RAG document ingestion: field reference

Last update: May 18, 2026
This document describes the flow for ingesting documents into vector stores for Retrieval Augmented Generation (RAG). Ingestion reads documents from a chosen source, parses and chunks the text, generates embeddings using your configured embedding pipeline, and writes the vectors into a selected vector store.
Cross-check labels, defaults, and parser options against your product build.
Global behavior notes
  • Vector store first: You must configure at least one vector store before ingestion can run. If none exist, the UI shows a message such as: no vector stores are configured; add one under your product’s vector store or AI services settings.
  • Embedding pipeline: Ingestion relies on the embedding model and vector store associated with this RAG configuration. Dimension and collection/index settings must stay aligned with that embedding model (see embedding-model-configuration.md and your Milvus, Pinecone, Qdrant, or Chroma vector store docs).
  • Server paths: Browse server (or equivalent) selects paths on the server running the admin or worker process, not necessarily on the administrator’s own PC. Verify path conventions (Windows vs Linux) for your deployment.
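
The first two notes above amount to a pre-flight check before a run. The following is a minimal sketch of that check, not the product's implementation: the `VectorStoreProfile` class, its `dimension` field, and the `preflight` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class VectorStoreProfile:
    """Hypothetical stand-in for a saved vector store configuration."""
    name: str
    dimension: int  # vector size the collection/index was created with


def preflight(stores: list[VectorStoreProfile], selected: str,
              embedding_dim: int) -> VectorStoreProfile:
    """Return the selected store, or raise if ingestion should not start."""
    if not stores:
        # Mirrors the UI message shown when nothing is configured.
        raise RuntimeError("no vector stores are configured")
    by_name = {store.name: store for store in stores}
    if selected not in by_name:
        raise KeyError(f"unknown vector store: {selected}")
    store = by_name[selected]
    if store.dimension != embedding_dim:
        # Vectors of the wrong size cannot be written to the collection/index.
        raise ValueError(
            f"embedding dimension {embedding_dim} does not match "
            f"store dimension {store.dimension}"
        )
    return store
```

A run would call `preflight(stores, "docs", 768)` once, before any file is read, so that a dimension mismatch fails fast instead of partway through a large directory.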

Configuration selection

Field
Description
Vector store
Required. The target vector store profile where ingested chunks (and embeddings) are stored. Choose a store that matches your RAG query path and embedding dimension. If the dropdown is empty, create and save a vector store configuration first.
---

Document source

Field
Description
Source type
Single file — ingest one document by path. Directory — ingest all supported files under a folder (respecting your product’s recursion and filter rules). URL — fetch content from a web address when your product supports it.
File path (or path / URL field)
Path or address for the selected source type. For single file or directory, use an absolute path the server can read. Use Browse server when available to reduce path typos. For URL, enter a full URL per your integration’s requirements.
Supported formats
Typical support includes PDF, Word, Excel, PowerPoint, HTML, CSV, JSON, XML, plain text, Markdown, and related formats; the exact list depends on your release. Unsupported files may be skipped or may fail the job, depending on the Continue on error setting.
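
To make the three source types concrete, here is a rough sketch of how a source could be expanded into a list of items to ingest. The source-type strings, the `SUPPORTED_SUFFIXES` set, and the `recursive` flag are assumptions for illustration; the real format list and recursion rules depend on your release.

```python
from pathlib import Path

# Assumption: an illustrative suffix list, not the product's actual one.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".pptx", ".html",
                      ".csv", ".json", ".xml", ".txt", ".md"}


def resolve_sources(source_type: str, location: str,
                    recursive: bool = True) -> list[str]:
    """Expand a source selection into the paths or URLs to ingest."""
    if source_type == "url":
        if not location.startswith(("http://", "https://")):
            raise ValueError("enter a full URL, including the scheme")
        return [location]

    path = Path(location)
    if not path.is_absolute():
        raise ValueError("use an absolute path the server can read")

    if source_type == "single_file":
        return [str(path)]
    if source_type == "directory":
        pattern = "**/*" if recursive else "*"
        # Keep regular files with a supported suffix; skip everything else.
        return sorted(
            str(p) for p in path.glob(pattern)
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    raise ValueError(f"unknown source type: {source_type}")
```

The path is evaluated on the machine where this code runs, which matches the note that server paths refer to the admin or worker host rather than the administrator's own PC.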
---

Configuration options (advanced)

Field
Description
Parser type
Format-specific parser (for example PDF, HTML, plain text). Choose the parser that matches the dominant file type in this run, or the type your product uses when a single parser is selected for a batch. Some products auto-detect per file; confirm behavior in your docs.
Character encoding
Text encoding for parsers that read byte streams (for example UTF-8). Use the encoding that matches your files to avoid mojibake or parse failures.
Max file size (bytes)
Upper bound on file size for ingestion. A value of 0 often means either no limit or the product default; confirm which applies in your build. With a non-zero value, oversized files are rejected or skipped early.
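
The encoding and size options can be illustrated with a short sketch. It assumes that 0 means "no limit" (one of the two interpretations mentioned above) and uses hypothetical helper names; the decoding itself relies only on Python's built-in `bytes.decode`.

```python
def within_size_limit(size_bytes: int, max_file_size: int) -> bool:
    """Assumption: max_file_size == 0 disables the check entirely."""
    return max_file_size == 0 or size_bytes <= max_file_size


def decode_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Strictly decode file bytes; raises UnicodeDecodeError on bad input."""
    return raw.decode(encoding)


raw = "café".encode("utf-8")
good = decode_text(raw, "utf-8")     # 'café'
garbled = decode_text(raw, "latin-1")  # 'cafÃ©': wrong encoding, no error
```

The last line shows why a mismatched encoding is easy to miss: decoding UTF-8 bytes as Latin-1 succeeds silently and produces mojibake, whereas decoding invalid bytes as UTF-8 fails loudly.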
---

Chunking configuration (advanced)

Field
Description
Splitter type
How text is split into chunks before embedding. Recursive (when labeled recommended) usually splits on paragraphs and headings first, then sentences, for more coherent chunks. Other types (if offered) may split on fixed characters or delimiters only.
Chunk size (characters)
Target maximum size of each chunk in characters (not tokens). Larger chunks preserve more context but can reduce retrieval precision; smaller chunks improve granularity but increase vector count and cost. Default 1000 is a common starting point.
Chunk overlap (characters)
Number of characters shared between adjacent chunks. Overlap helps avoid cutting sentences or facts in half at boundaries. Default 200 is typical with 1000-character chunks; adjust if answers miss context at edges.
Custom separators (optional)
Extra delimiter strings (if your product supports them) that force splits—for example specific headings or markers. Leave empty to use the splitter’s built-in rules.
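
How chunk size and chunk overlap interact is easiest to see with a fixed-character splitter. The sketch below is a plain sliding window, not the recursive splitter described above, and `split_fixed` is a hypothetical name; the defaults mirror the 1000 and 200 character values mentioned in the table.

```python
def split_fixed(text: str, chunk_size: int = 1000,
                overlap: int = 200) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


example = split_fixed("abcdefghij", chunk_size=4, overlap=1)
# example == ['abcd', 'defg', 'ghij']
```

With the defaults, each window advances by 800 characters, so a 2,500-character document yields three chunks of 1000, 1000, and 900 characters. Raising the overlap increases the vector count for the same document, which is the cost side of the trade-off described above.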
---

Ingestion options (advanced)

Field
Description
Batch size
How many chunks or documents to process per internal batch (for example 100). Higher values can improve throughput but increase memory use and the amount of work lost when a batch fails.
Continue on error
When enabled, ingestion skips or logs failed files or chunks and continues with the rest. When disabled, the job may stop on the first error, which suits strict validation but is inconvenient for large mixed folders.
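
The two options above can be sketched as a single loop. This is an illustration under assumptions, not the product's job runner: `ingest`, `process_batch`, and the return shape are hypothetical, and a real run may track failures per file or per chunk rather than per batch.

```python
from typing import Callable, Iterator


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def ingest(items: list, process_batch: Callable[[list], None],
           batch_size: int = 100,
           continue_on_error: bool = True) -> tuple[int, list]:
    """Process items in batches; return (items_done, failed_batches)."""
    done, failures = 0, []
    for batch in batched(items, batch_size):
        try:
            process_batch(batch)
            done += len(batch)
        except Exception as exc:
            if not continue_on_error:
                raise  # strict mode: stop the job on the first error
            failures.append((batch, str(exc)))  # record and move on
    return done, failures
```

In this sketch a single bad item fails its whole batch, so a larger batch size means more good items are lost alongside each bad one. That is the trade-off between throughput and the cost of a failure.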
---

Actions

Control
Description
Run ingestion
Starts the ingestion job with the current settings. Ensure vector store and paths are correct before running; large directories can take a long time.
Show / hide advanced settings
Toggles visibility of parser, chunking, and ingestion options. Basic runs may only need source + vector store.
---
