
Chunking and document splitting

Last update: May 18, 2026
Learn how ColdFusion splits documents into chunks before embedding. Covers chunkSize, chunkOverlap, splitterType, all built-in splitters, custom UDF splitters, segment metadata, and standalone splitting with documentService().split().
Chunking is the process of breaking loaded documents into smaller pieces called segments before embedding. Each segment is embedded individually and stored in the vector store. At query time, retrieval returns the most relevant segments to ground the model's answer.

Why chunking matters

Embedding models and vector stores work on finite pieces of text, so each loaded document must be broken into segments of bounded size before it can be embedded. The chunk size and overlap you choose trade retrieval precision against ingest cost and the amount of context carried by each segment.
Tradeoffs (conceptual)
  • Smaller chunks: More precise matches, more vectors, more ingest cost.
  • Larger chunks: More local context per vector, fewer segments, risk of mixing unrelated topics.
  • Overlap (chunkOverlap): Repeats a tail of the previous chunk at the start of the next so sentences are not cut in half at boundaries. Zero overlap means no duplicated boundary text between adjacent segments.
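The effect of chunkOverlap can be seen in a minimal sketch of fixed-width chunking. The function chunkWithOverlap below is a hypothetical helper written for illustration with core string functions only; it is not a ColdFusion built-in and it does not claim to reproduce how the built-in splitters choose boundaries.

<cfscript>
  // Hypothetical helper (not a ColdFusion built-in): fixed-width chunking with overlap.
  function chunkWithOverlap(required string text, required numeric chunkSize, numeric chunkOverlap = 0) {
    if (chunkOverlap >= chunkSize) {
      throw(message = "chunkOverlap must be smaller than chunkSize");
    }
    var chunks = [];
    var step = chunkSize - chunkOverlap;   // how far the window advances each time
    var start = 1;                         // CFML string positions are 1-based
    while (start <= len(text)) {
      arrayAppend(chunks, mid(text, start, chunkSize));
      if (start + chunkSize > len(text)) break;   // this window already reached the end
      start += step;
    }
    return chunks;
  }

  chunks = chunkWithOverlap("abcdefghijklmnopqrstuvwxyz", 10, 3);
  writeDump(chunks);
  // "abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"
</cfscript>

Each chunk after the first begins with the last three characters of the previous one ("hij", "opq", "vwx"). With chunkOverlap set to 0 the windows would simply sit end to end.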
Chunk size boundaries and segment count comparisons
The product tests check the following behavior:
  • Minimum boundary (chunkSize: 100): segment text lengths stay within chunkSize plus a tolerance.
  • Maximum boundary (chunkSize: 10000): a short file yields one segment.
  • Small vs large chunk counts: splitting the same document with chunkSize: 200 produces more segments than splitting it with chunkSize: 1000.
  • Small chunks vs default: on test-large.txt, chunkSize: 100 with chunkOverlap: 20 produces more segments than chunkSize: 1000.
  • Large chunks vs default: chunkSize: 5000 produces the same number of segments as chunkSize: 1000, or fewer.
<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/") });
  smallSegments = docService.split(documents, { chunkSize: 200 });
  largeSegments = docService.split(documents, { chunkSize: 1000 });
  writeOutput(arrayLen(smallSegments) & " vs " & arrayLen(largeSegments));
</cfscript>
Output
1696 vs 247

Core parameters: chunkSize, chunkOverlap, splitterType

The following parameters control how documents are split into segments. They are available across all three RAG APIs.
| Parameter | Role |
| --- | --- |
| chunkSize | Target maximum size of a chunk (tests use character-oriented sizes, for example 500 or 1000). |
| chunkOverlap | Characters shared between consecutive chunks when the splitter supports overlap (for example 50, 100, or 0). |
| splitterType | Algorithm: how boundaries are chosen (recursive, sentence, character, line, paragraph, word, regex, and in some APIs custom). |
| recursive | Used in Simple RAG options alongside splitterType (example: recursive: false with character in tests). |
| regexPattern | Required for regex splitters: pattern used to split (example: newline pattern in tests). |
| separators | For recursive, optional ordered list of delimiter strings (example: double newline then single newline). |
Where configuration lives
  • documentService().split(documents, options)
  • agent({ ingestion: { documentSplitter: { ... } } })
  • simpleRAG(source, model, { chunkSize, chunkOverlap, splitterType, recursive, ... })
In simpleRAG(), pass chunking options as flat keys on the third argument (the options struct) using chunkSize, chunkOverlap, splitterType, and recursive. You do not nest them under documentSplitter here; that nesting is used in Advanced RAG (agent()).
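The difference in nesting can be sketched with plain structs. This is a minimal illustration that only builds the option structs described above; it does not call agent() or simpleRAG(), and the variable names and values are placeholders.

<cfscript>
  // Shared chunking settings (values are illustrative)
  chunking = { chunkSize: 500, chunkOverlap: 50, splitterType: "sentence" };

  // Advanced RAG: the settings are nested under ingestion.documentSplitter
  agentIngestion = { documentSplitter: chunking };

  // Simple RAG: the same keys sit flat on the options struct (the third argument)
  simpleOptions = { chunkSize: 500, chunkOverlap: 50, splitterType: "sentence", recursive: false };

  writeDump(agentIngestion);
  writeDump(simpleOptions);
</cfscript>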

Built-in splitter types

ColdFusion includes the following built-in splitter algorithms. Pass the type name as a string to splitterType.
| Splitter type | Splits on | Best for |
| --- | --- | --- |
| character | Fixed character count | Simple text; fast but can cut in the middle of a word or sentence. |
| word | Word boundaries | Better than character for readability; no semantic awareness. |
| sentence | Sentence boundaries | Prose documents; preserves complete sentences. |
| line | Line breaks | Log files, CSV-like data, code. |
| paragraph | Paragraph breaks | Documents where each paragraph is a self-contained unit. |
| recursive | Paragraphs → sentences → words → characters (cascade) | General purpose; the recommended default for most documents. |
| hierarchical | Document structure (headings, sections) | Structured documents like manuals, wikis, legal docs. |
| semantic | Semantic similarity between sentences | Dense technical content where meaning shifts within paragraphs. |
| regex | Custom regular expression pattern | Custom-formatted documents. |
Non-empty segment text across all splitter types
The following example loops splitterType over recursive, character, line, word, sentence, and paragraph. To include regex, add "regex" to the types array; the loop then sets regexPattern: "\.\s+" for it. Every splitter is expected to produce segments with non-empty trimmed text.
<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/") });
  types = ["recursive", "character", "line", "word", "sentence", "paragraph"];
  for (splitterType in types) {
    opts = { chunkSize: 500, chunkOverlap: 50, splitterType: splitterType };
    if (splitterType == "regex") {
      opts.regexPattern = "\.\s+";
    }
    segments = docService.split(documents, opts);
    writeOutput(splitterType & ": " & arrayLen(segments) & " segments<br>");
  }
</cfscript>
Output
recursive: 486 segments
character: 544 segments
line: 486 segments
word: 478 segments
sentence: 487 segments
paragraph: 487 segments
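To preview what the pattern \.\s+ matches before handing it to a regex splitter, the sketch below applies it to a short string with core string functions. It is an approximation for experimenting with patterns, not the built-in regex splitter, and it ignores chunkSize and chunkOverlap. The sample text and the chr(31) placeholder delimiter are assumptions made for the illustration.

<cfscript>
  sample = "Chunking splits text. Each piece is embedded.  Retrieval returns the best pieces.";

  // Replace every match of the pattern with a delimiter that does not occur in the text,
  // then split on that delimiter.
  marked = reReplace(sample, "\.\s+", chr(31), "all");
  pieces = listToArray(marked, chr(31));

  for (piece in pieces) {
    writeOutput("[" & piece & "]<br>");
  }
  // [Chunking splits text]
  // [Each piece is embedded]
  // [Retrieval returns the best pieces.]
</cfscript>

The period and the whitespace after it are consumed by the match, so only the final sentence, which has no trailing whitespace, keeps its period.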

Choosing the right splitter for your content

Selecting a splitter depends on your document structure and the retrieval precision required. Use the table below as a guide.
| Content type | Recommended splitter |
| --- | --- |
| Most prose documents (articles, FAQs, manuals) | recursive: the recommended general-purpose default. Cascades from paragraphs down to characters. |
| Narrative text, customer support transcripts | sentence: preserves complete sentences; avoids mid-sentence cuts. |
| Log files, CSV-like data, code listings | line: splits on line breaks; each line is a discrete unit. |
| Structured documents (manuals, wikis, legal docs) | hierarchical: respects document structure such as headings and sections. |
| Dense technical content where meaning shifts within paragraphs | semantic: groups sentences by semantic similarity rather than syntactic boundaries. |
| Custom-formatted documents with known delimiters | regex: provide regexPattern to define split points precisely. |
| Simple text; performance-sensitive pipelines | character: fastest; does not preserve word or sentence boundaries. |
| Domain-specific structure (legal clauses, code functions) | Custom splitter UDF |
High-overlap configuration
High overlap is useful when answers are likely to span a chunk boundary. Use it when the default overlap produces answers that miss context at the edges.
  • Standalone: uses chunkSize: 500, chunkOverlap: 200.
  • Advanced RAG: uses chunkSize: 200, chunkOverlap: 80 on a larger file to stress overlap behavior during ingest.
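Higher overlap also raises the segment count, because each window advances by only chunkSize minus chunkOverlap characters. The estimate below is a rough sketch for a fixed-width splitter; estimateChunks is a hypothetical helper, and boundary-aware splitters such as sentence or recursive will produce different counts.

<cfscript>
  // Hypothetical helper: estimates how many fixed-width windows cover a text of the given length.
  function estimateChunks(required numeric textLength, required numeric chunkSize, numeric chunkOverlap = 0) {
    if (textLength <= chunkSize) return 1;
    return ceiling((textLength - chunkOverlap) / (chunkSize - chunkOverlap));
  }

  // Same 100,000-character text and chunkSize, different overlap
  lowOverlap  = estimateChunks(100000, 500, 50);    // stride of 450 characters
  highOverlap = estimateChunks(100000, 500, 200);   // stride of 300 characters
  writeOutput(lowOverlap & " vs " & highOverlap);   // 223 vs 333
</cfscript>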
Zero-overlap configuration
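With chunkOverlap: 0, adjacent segments share no boundary text, so no part of the document is embedded twice. Use this when you want to keep segment count and ingest cost down and can accept that a sentence at a boundary may be split between two segments.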
<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/long-doc.txt") });
  segments = docService.split(documents, {
    chunkSize: 200,
    chunkOverlap: 0
  });
  writeOutput("segment count: " & arrayLen(segments));
</cfscript>
Output
segment count: 668

Custom splitter via UDF

If your documents have a domain-specific structure that built-in splitters do not handle well (for example, legal contracts split by clause, or code files split by function), you can provide a custom splitter UDF.
Use type: "custom" with implementation set to a UDF that accepts a document struct and returns an array of segment structs { text, metadata }. The UDF in the sample splits on blank lines, so each paragraph becomes one segment.
<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });
  docsDir = expandPath("./docs/");
  embeddingModel = {
    provider: "openai",
    modelName: "text-embedding-3-small",
    apiKey: application.apiKey
  };
  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: embeddingModel
  });

  customSplitter = function(required struct document) {
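    // The fourth argument (multiCharacterDelimiter = true) makes the two line feeds act as one
    // delimiter; without it, listToArray() would split on every single line feed.
    var paragraphs = listToArray(document.text, chr(10) & chr(10), false, true);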
    var segments = [];
    for (var para in paragraphs) {
      if (len(trim(para)) > 0) {
        arrayAppend(segments, {
          text: trim(para),
          metadata: document.metadata
        });
      }
    }
    return segments;
  };

  ragBot = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir,
      recursive: false,
      includePatterns: ["*.pdf"],
      documentSplitter: {
        // Wire in the UDF defined above, as described in the text:
        // type "custom" with the UDF as implementation
        type: "custom",
        implementation: customSplitter
      },
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    retrievalAugmentor: {
      queryRouter: {
        contentRetrievers: [{
          vectorStore: vectorStore,
          maxResults: 3,
          minScore: 0.3,
          description: "Knowledge base"
        }]
      }
    }
  });

  ragBot.ingest();
  answer = ragBot.chat("How to extend Adobe subscription using prepaid card?");
  writeOutput(answer.message);
</cfscript>
Output
To extend your Adobe subscription using a prepaid card, follow these steps: 1. **Locate the Redemption Code**: Find the redemption code on your prepaid card. It is usually located beneath the scratch-off foil on the back of the card. 2. **Extend Your Subscription**: While your subscription is active, enter the redemption code in your Adobe account to add the subscription period to your existing account. If, however, you encounter any issues such as payment failures or authentication problems, ensure that your card details are updated in your Adobe account and complete any necessary bank authentication processes.
Note: The UDF receives one document struct and must return an array of segment structs, each with a text key and a metadata key. Returning an empty array is valid and skips the document.
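Because a custom splitter is an ordinary UDF, you can call it directly with a hand-built document struct to check its output before wiring it into agent(). The sketch below assumes only the contract described in the note above (a struct with text and metadata in, an array of { text, metadata } structs out). The clauseSplitter function, the "Clause N" heading convention, and the sample document are hypothetical.

<cfscript>
  // Hypothetical clause splitter: starts a new segment at each line that begins with "Clause <number>".
  clauseSplitter = function(required struct document) {
    var segments = [];
    var current = "";
    for (var line in listToArray(document.text, chr(10))) {
      // A clause heading closes the segment collected so far, if there is one
      if (reFindNoCase("^\s*Clause\s+\d+", line) && len(trim(current))) {
        arrayAppend(segments, { text: trim(current), metadata: document.metadata });
        current = "";
      }
      current &= line & chr(10);
    }
    // Flush the last clause
    if (len(trim(current))) {
      arrayAppend(segments, { text: trim(current), metadata: document.metadata });
    }
    return segments;
  };

  // Hand-built document struct for a quick check (contents are made up)
  doc = {
    text: "Clause 1 Term" & chr(10) & "One year." & chr(10) & "Clause 2 Fees" & chr(10) & "Paid monthly.",
    metadata: { source: "contract.txt" }
  };
  result = clauseSplitter(doc);
  writeOutput(arrayLen(result) & " segments");   // 2 segments
</cfscript>

A document with empty text makes this UDF return an empty array, which, per the note above, skips the document.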

Segment metadata schema

After split(), each segment's metadata struct contains splitter bookkeeping fields. The test splitter-metadata-schema.cfm validates ten keys on the first segment.
| Metadata key | Description |
| --- | --- |
| splitterType | The splitter algorithm used to produce this segment. |
| splitTimestamp | Time the split operation was recorded. |
| documentIndex | Zero-based index of the source document in the input array. |
| chunkIndex | Zero-based position of this segment within its source document. |
| globalSegmentIndex | Zero-based position of this segment across all documents in the split call. |
| totalChunks | Total segments produced in this split call; equals arrayLen(segments). |
| chunkSize | The chunkSize value used for this split. |
| chunkOverlap | The chunkOverlap value used for this split. |
| startOffset | Character offset in the source document where this segment begins. |
| endOffset | Character offset in the source document where this segment ends. Always greater than startOffset. |
The test splitter-metadata-sequential.cfm walks all segments and checks that chunkIndex and globalSegmentIndex are 0, 1, 2, … and that startOffset increases monotonically.
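A similar check can be sketched in a few lines. The helper below is a hypothetical illustration, not the product test: it assumes all segments come from a single source document, reads only the globalSegmentIndex, startOffset, and endOffset keys listed in the table, and treats "increases monotonically" as strictly increasing. The sample array uses made-up values in place of a real split() result.

<cfscript>
  // Hypothetical helper: true when indexes run 0, 1, 2, ... and offsets are ordered.
  function metadataIsSequential(required array segments) {
    var previousStart = -1;
    for (var i = 1; i <= arrayLen(segments); i++) {
      var meta = segments[i].metadata;
      // CFML arrays are 1-based, while the metadata indexes are zero-based
      if (meta.globalSegmentIndex != i - 1) return false;
      if (meta.endOffset <= meta.startOffset) return false;
      if (meta.startOffset <= previousStart) return false;
      previousStart = meta.startOffset;
    }
    return true;
  }

  // Made-up segments that mimic the metadata shape
  sampleSegments = [
    { text: "first",  metadata: { globalSegmentIndex: 0, startOffset: 0,   endOffset: 300 } },
    { text: "second", metadata: { globalSegmentIndex: 1, startOffset: 250, endOffset: 550 } },
    { text: "third",  metadata: { globalSegmentIndex: 2, startOffset: 500, endOffset: 800 } }
  ];
  writeOutput(metadataIsSequential(sampleSegments) ? "sequential" : "out of order");   // sequential
</cfscript>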
The following example calls split() with chunkSize: 300 and dumps the metadata of the first segment:
<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/") });
  segments = docService.split(documents, { chunkSize: 300 });
  meta = segments[1].metadata;
  writeDump(meta);
</cfscript>

Standalone splitting with documentService().split()

Call documentService().split() to split documents independently of any RAG service. This is useful when you want to inspect, transform, or route segments before ingestion, or when you are building a custom pipeline.
Default split (no options)
After load(), call split(documents) with no second argument to use product defaults. Defaults are chunkSize 1000 and chunkOverlap 100.
<cfscript>
  docService = documentService();
  documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf"
  });
  segments = docService.split(documents);
  writeOutput(arrayLen(segments) & " segments");
</cfscript>
Output
52 segments
Split with chunkSize and chunkOverlap
Pass a struct as the second argument with chunkSize and chunkOverlap. Segments are an array of structs with text and metadata.
<cfscript>
  docService = documentService();
  documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf"
  });
  segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50
  });
  writeDump(segments[1]);
</cfscript>
Split with splitterType: recursive
The recursive splitter works through an ordered cascade of boundaries, from paragraphs down to sentences, words, and finally characters. Pass splitterType: "recursive" alongside chunkSize and chunkOverlap.
<cfscript>
  docService = documentService();
  documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf"
  });
  segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50,
    splitterType: "recursive"
  });
  writeOutput(arrayLen(segments) & " segments");
</cfscript>
Output
96 segments
Load then split pipeline
Common standalone flow: load documents, split into segments, then (optionally) transform or inspect segments without ingesting into a vector store. The test expects at least as many segments as documents for multi-chunk files.
<cfscript>
  docService = documentService();
  documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf"
  });
  segments = docService.split(documents, {
    chunkSize: 1000,
    chunkOverlap: 100
  });
  writeOutput("documents=" & arrayLen(documents) & " segments=" & arrayLen(segments));
</cfscript>
Output
documents=21 segments=52
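The optional transform step can be as simple as filtering the segment array before it is ingested. The sketch below is one possible transform, assuming only that each segment is a struct with text and metadata as described above; dropShortSegments, the 20-character threshold, and the sample segments are hypothetical.

<cfscript>
  // Hypothetical helper: keeps segments whose trimmed text has at least minLength characters.
  function dropShortSegments(required array segments, numeric minLength = 20) {
    var threshold = minLength;   // local copy so the closure below can read it
    return arrayFilter(segments, function(segment) {
      return len(trim(segment.text)) >= threshold;
    });
  }

  // Made-up segments standing in for the result of split()
  sampleSegments = [
    { text: "Page 3", metadata: {} },
    { text: "To extend a subscription, enter the redemption code in your account.", metadata: {} }
  ];
  kept = dropShortSegments(sampleSegments, 20);
  writeOutput(arrayLen(kept) & " of " & arrayLen(sampleSegments) & " segments kept");   // 1 of 2 segments kept
</cfscript>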
Advanced RAG: default documentSplitter
In agent(), ingestion.documentSplitter may be an empty struct {}. Default behavior is equivalent to recursive splitting with chunkSize 1000 and chunkOverlap 100. For the sample file, ingest() is expected to report segmentsIngested greater than 0.
<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });
  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
      provider: "openai",
      modelName: "text-embedding-3-small",
      apiKey: application.apiKey
    }
  });
  docsDir = expandPath("./docs/");
  ragService = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir & "long-doc.txt",
      documentSplitter: {},
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    retrievalAugmentor: {
      queryRouter: {
        contentRetrievers: [{
          vectorStore: vectorStore,
          maxResults: 5,
          minScore: 0.3,
          description: "Knowledge base"
        }]
      }
    }
  });
  result = ragService.ingest();
  writeOutput(result.segmentsIngested);
</cfscript>
Output
126
Advanced RAG: chunkSize and chunkOverlap
Pass an explicit documentSplitter with only chunkSize and chunkOverlap. The minimal snippet omits splitterType:
<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });
  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
      provider: "openai",
      modelName: "text-embedding-3-small",
      apiKey: application.apiKey
    }
  });
  docsDir = expandPath("./docs/");
  ragService = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir,
      documentSplitter: {
        chunkSize: 500,
        chunkOverlap: 50
      },
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    retrievalAugmentor: {
      queryRouter: {
        contentRetrievers: [{
          vectorStore: vectorStore,
          maxResults: 5,
          minScore: 0.3,
          description: "Knowledge base"
        }]
      }
    }
  });
  ragService.ingest();
  answer = ragService.chat("How to update TIN?");
  writeOutput(answer.message);
</cfscript>
Output
To update your tax identification number (TIN), such as VAT, GST, or NIT for your individual account, you need to follow these steps: 1. **For Tax-Exempt Customers in North America:** If you are a tax-exempt customer, you can learn how to place a tax-exempt order with Adobe by following our step-by-step instructions available on the Adobe website. 2. **For PayPal Users:** If you use PayPal, ensure that you update your VAT or tax details directly in your PayPal account settings to guarantee that accurate tax information appears on future invoices. In case you need to resolve any issues, remember to add new payment information or update your existing one accordingly.
Advanced RAG: splitterType sentence
splitterType: "sentence" with chunkSize and chunkOverlap splits on sentence boundaries where possible:
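<cfscript>
  // chatModel, vectorStore, and docsDir are created as in the previous example.
  ragService = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir,
      documentSplitter: {
        splitterType: "sentence",
        chunkSize: 500,
        chunkOverlap: 50
      },
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    // retrievalAugmentor is left elided here; see the previous example for a full struct.
    retrievalAugmentor: { ... }
  });
</cfscript>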
Advanced RAG: regex splitter
For splitterType: "regex", provide regexPattern. The sample uses a pattern that splits on newlines:
<cfscript>
    chatModel = ChatModel({
        provider: "openai",
        modelName: "gpt-4o-mini",
        apiKey: application.apiKey,
        temperature: 0.7
    });

    vectorStore = VectorStore({
        provider: "INMEMORY",
        embeddingModel: {
            provider: "openai",
            modelName: "text-embedding-3-small",
            apiKey: application.apiKey
        }
    });

    // Directory that holds the sample documents
    docsDir = expandPath("./docs/");

    ragService = agent({
        CHATMODEL: chatModel,
        ingestion: {
            source: docsDir & "sample.txt",
            documentSplitter: {
                splitterType: "regex",
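                // CFML strings do not treat the backslash as an escape character,
                // so "\n" reaches the regex engine as the newline pattern \n.
                regexPattern: "\n"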
            },
            vectorStoreIngestor: { vectorStore: vectorStore }
        },
        retrievalAugmentor: {
            queryRouter: {
                contentRetrievers: [{
                    vectorStore: vectorStore,
                    maxResults: 5,
                    minScore: 0.3,
                    description: "Knowledge base"
                }]
            }
        }
    });

    ragService.ingest();
    answer = ragService.chat("What is the plot of study in scarlet?");
    writeOutput(answer.message);
</cfscript>
Output
A Study in Scarlet, published in 1887 by Sir Arthur Conan Doyle, is the first story featuring Sherlock Holmes and Dr. John Watson. The plot begins with Watson, a military surgeon returning from the Afghan War, seeking accommodation in London. He meets Holmes, and they become roommates at 221B Baker Street. Holmes, who works as a consulting detective, involves Watson in a murder case centered around a man named Enoch Drebber, found dead in a derelict house with the word 'RACHE' written in blood. Holmes uses his deduction skills to solve the case despite challenges from Scotland Yard inspectors. The narrative shifts to the backstory of the murderer, Jefferson Hope, whose actions are motivated by tragic vengeance related to a love triangle involving Mormons in the American West. Eventually, Hope dies in prison, and Watson publishes the case to acknowledge Holmes's brilliance.
Advanced RAG: recursive with custom separators
splitterType: "recursive" can take separators, an array of strings tried in order. The test uses chr(10) & chr(10) then chr(10) for paragraph then line breaks. A longer variant uses [chr(10) & chr(10), chr(10), " ", ""]:
<cfscript>
    chatModel = ChatModel({
        provider: "openai",
        modelName: "gpt-4o-mini",
        apiKey: application.apiKey,
        temperature: 0.7
    });

    vectorStore = VectorStore({
        provider: "INMEMORY",
        embeddingModel: {
            provider: "openai",
            modelName: "text-embedding-3-small",
            apiKey: application.apiKey
        }
    });

    // Directory that holds the sample documents
    docsDir = expandPath("./docs/");

    ragService = agent({
        CHATMODEL: chatModel,
        ingestion: {
            source: docsDir & "sample.txt",
            documentSplitter: {
                splitterType: "recursive",
                separators: [ chr(10) & chr(10), chr(10) ]
            },
            vectorStoreIngestor: { vectorStore: vectorStore }
        },
        retrievalAugmentor: {
            queryRouter: {
                contentRetrievers: [{
                    vectorStore: vectorStore,
                    maxResults: 5,
                    minScore: 0.3,
                    description: "Knowledge base"
                }]
            }
        }
    });

    ragService.ingest();
    answer = ragService.chat("Who is Dr. Watson?");
    writeOutput(answer.message);
</cfscript>
Output
Dr. John Watson is a fictional character created by Sir Arthur Conan Doyle, who first appears in the novel "A Study in Scarlet" published in 1887. Watson is a military surgeon who returns to London after being invalided from the Afghan War. He becomes the close companion and confidant of the famous detective Sherlock Holmes. They share lodgings at 221B Baker Street, where Watson becomes involved in Holmes's investigations. Throughout the stories, Watson serves as the narrator and often helps Holmes solve complex cases, highlighting his skills in observation and deduction.
Simple RAG: flat splitter options
simpleRAG() takes chunkSize, chunkOverlap, splitterType, and recursive on the third argument options struct:
<cfscript>
    chatModel = ChatModel({
        provider: "openai",
        modelName: "gpt-4o-mini",
        apiKey: application.apiKey,
        temperature: 0.7
    });

    vectorStore = VectorStore({
        provider: "INMEMORY",
        embeddingModel: {
            provider: "openai",
            modelName: "text-embedding-3-small",
            apiKey: application.apiKey
        }
    });

    docsDir = expandPath("./docs/");

    ragService = simpleRAG(
        docsDir,
        chatModel,
        {
            vectorStore: vectorStore,
            chunkSize: 500,
            chunkOverlap: 50,
            splitterType: "character",
            recursive: false
        }
    );

    ragService.ingest();
    answer = ragService.ask("How to extend Adobe subscription?");
    writeOutput(answer.message);
</cfscript>
Output
To extend your Adobe subscription, you can easily do so with a prepaid card while your subscription is still active. The subscription period will be added to your existing account when using a prepaid card. Here are the steps you need to follow: 1. On the prepaid card, find your redemption code beneath the scratch-off foil on the back of the card. 2. Select the appropriate link based on your product type.
Simple RAG: sentence splitter
Same flat options with splitterType: "sentence":
<cfscript>
    chatModel = ChatModel({
        provider: "openai",
        modelName: "gpt-4o-mini",
        apiKey: application.apiKey,
        temperature: 0.7
    });

    vectorStore = VectorStore({
        provider: "INMEMORY",
        embeddingModel: {
            provider: "openai",
            modelName: "text-embedding-3-small",
            apiKey: application.apiKey
        }
    });

    docsDir = expandPath("./docs/");

    ragService = simpleRAG(
        docsDir,
        chatModel,
        {
            vectorStore: vectorStore,
            chunkSize: 500,
            chunkOverlap: 200,
            splitterType: "sentence",
            recursive: false
        }
    );

    ragService.ingest();
    answer = ragService.ask("How to respond to Adobe support ticket?");
    writeOutput(answer.message);
</cfscript>
Output
To respond to an open Adobe support ticket, sign in to your Adobe account and update the case by adding messages or uploading files directly. You can include additional information or screenshots as needed. If requested, you may also submit proof of purchase for returns or proof of tax-exempt eligibility for tax-exempt orders.
