Chunking and document splitting

Last update: May 18, 2026

Learn how ColdFusion splits documents into chunks before embedding. Covers chunkSize, chunkOverlap, splitterType, all built-in splitters, custom UDF splitters, segment metadata, and standalone splitting with documentService().split().

Chunking is the process of breaking loaded documents into smaller pieces called segments before embedding. Each segment is embedded individually and stored in the vector store. At query time, retrieval returns the most relevant segments to ground the model's answer.

Why chunking matters

Embedding models and vector stores work on finite pieces of text, so a long document cannot be embedded as a single unit. Chunking turns each loaded document into one or more segments, and the chunk size and overlap you choose determine how precisely retrieval can match a query.

Tradeoffs (conceptual)

Smaller chunks: More precise matches, more vectors, more ingest cost.
Larger chunks: More local context per vector, fewer segments, risk of mixing unrelated topics.
Overlap (chunkOverlap): Repeats a tail of the previous chunk at the start of the next so sentences are not cut in half at boundaries. Zero overlap means no duplicated boundary text between adjacent segments.
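
To make the overlap tradeoff concrete, the following sketch implements a naive fixed-width chunker in plain CFML. It is an illustration only: naiveChunks is a hypothetical helper written for this page, not ColdFusion's built-in splitter, and the built-in splitters may choose boundaries differently.

<cfscript>
  // Hypothetical helper: cuts text into fixed-width windows that share chunkOverlap characters
  function naiveChunks(required string text, required numeric chunkSize, numeric chunkOverlap = 0) {
    var chunks = [];
    var step = chunkSize - chunkOverlap; // distance the window advances per chunk
    if (step <= 0) {
      throw(message = "chunkOverlap must be smaller than chunkSize");
    }
    var start = 1; // CFML string positions are 1-based
    while (start <= len(text)) {
      arrayAppend(chunks, mid(text, start, chunkSize));
      // Stop once the current window has reached the end of the text
      if (start + chunkSize > len(text)) {
        break;
      }
      start += step;
    }
    return chunks;
  }

  writeDump(naiveChunks("abcdefghijklmnopqrstuvwxyz", 10, 3));
</cfscript>

With chunkSize 10 and chunkOverlap 3, the window advances 7 characters at a time, so the chunks are abcdefghij, hijklmnopq, opqrstuvwx, and vwxyz. Each chunk starts with the last three characters of the previous one. With chunkOverlap 0, the same call returns abcdefghij, klmnopqrst, and uvwxyz, with no shared text.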

Chunk size boundaries and segment count comparisons

The following checks describe how chunkSize affects segment length and segment count:

Minimum boundary (chunkSize: 100): every segment's text length stays within chunkSize plus a tolerance.
Maximum boundary (chunkSize: 10000): a short file yields a single segment.
Small vs. large chunk counts: splitting the same document with chunkSize: 200 produces more segments than with chunkSize: 1000.
Small chunks vs. default: on test-large.txt, chunkSize: 100 with chunkOverlap: 20 produces more segments than chunkSize: 1000.
Large chunks vs. default: chunkSize: 5000 produces the same number of segments as chunkSize: 1000, or fewer.

<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/") });

  // Split the same documents with a small and a large chunk size
  smallSegments = docService.split(documents, { chunkSize: 200 });
  largeSegments = docService.split(documents, { chunkSize: 1000 });

  writeOutput(arrayLen(smallSegments) & " vs " & arrayLen(largeSegments));
</cfscript>

Output

1696 vs 247
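
The minimum-boundary check listed above can be sketched in the same way. The tolerance of 20 characters is a hypothetical value chosen for illustration; this page does not state the tolerance the splitter actually observes, so adjust it to what you see in your own data.

<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/") });
  segments = docService.split(documents, { chunkSize: 100 });

  tolerance = 20; // hypothetical allowance above chunkSize
  longest = 0;
  for (segment in segments) {
    longest = max(longest, len(segment.text));
  }

  writeOutput("longest segment: " & longest & " characters (limit " & (100 + tolerance) & ")");
</cfscript>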

Core parameters: chunkSize, chunkOverlap, splitterType

The following parameters control how documents are split into segments. They are available across all three RAG APIs.

Parameter	Role
chunkSize	Target maximum size of a chunk. Sizes are character-oriented (for example, 500 or 1000).
chunkOverlap	Number of characters shared between consecutive chunks when the splitter supports overlap (for example, 50, 100, or 0).
splitterType	Algorithm that decides where chunk boundaries fall (recursive, sentence, character, line, paragraph, word, regex, and, in some APIs, custom).
recursive	Boolean passed in Simple RAG options alongside splitterType (for example, recursive: false with splitterType: "character").
regexPattern	Required when splitterType is regex: the regular expression used to split (for example, a newline pattern).
separators	Optional with splitterType recursive: an ordered list of delimiter strings (for example, a double newline followed by a single newline).

Where configuration lives

Standalone: documentService().split(documents, options)
Advanced RAG: agent({ ingestion: { documentSplitter: { ... } } })
Simple RAG: simpleRAG(source, model, { chunkSize, chunkOverlap, splitterType, recursive, ... })

In simpleRAG(), pass chunking options as flat keys on the third argument (the options struct) using chunkSize, chunkOverlap, splitterType, and recursive. You do not nest them under documentSplitter here; that nesting is used in Advanced RAG (agent()).

Built-in splitter types

ColdFusion includes the following built-in splitter algorithms. Pass the type name as a string to splitterType.

Splitter type	Splits on	Best for
character	Fixed character count	Simple text; fast but can cut words mid-sentence.
word	Word boundaries	Better than character for readability; no semantic awareness.
sentence	Sentence boundaries	Prose documents; preserves complete sentences.
line	Line breaks	Log files, CSV-like data, code.
paragraph	Paragraph breaks	Documents where each paragraph is a self-contained unit.
recursive	Paragraphs → sentences → words → characters (cascade)	General purpose; the recommended default for most documents.
hierarchical	Document structure (headings, sections)	Structured documents like manuals, wikis, legal docs.
semantic	Semantic similarity between sentences	Dense technical content where meaning shifts within paragraphs.
regex	Custom regular expression pattern	Custom-formatted documents.

Non-empty segment text across all splitter types

The following example loops splitterType over recursive, character, line, word, sentence, and paragraph. The loop also handles regex by setting regexPattern: "\.\s+", but regex is not in the types array as written, so it does not appear in the output. Every segment must have non-empty trimmed text.

<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/") });

  // Add "regex" to this array to exercise the regexPattern branch below
  types = ["recursive", "character", "line", "word", "sentence", "paragraph"];

  for (splitterType in types) {
    opts = { chunkSize: 500, chunkOverlap: 50, splitterType: splitterType };
    if (splitterType == "regex") {
      opts.regexPattern = "\.\s+";
    }
    segments = docService.split(documents, opts);
    writeOutput(splitterType & ": " & arrayLen(segments) & " segments<br>");
  }
</cfscript>

Output

recursive: 486 segments
character: 544 segments
line: 486 segments
word: 478 segments
sentence: 487 segments
paragraph: 487 segments
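
The output above has no regex line. A minimal sketch of the same non-empty check for the regex splitter, reusing the regexPattern from the loop, might look like this. The emptyCount variable is a name introduced here for illustration.

<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/") });

  segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50,
    splitterType: "regex",
    regexPattern: "\.\s+" // split after a period followed by whitespace
  });

  // Count segments whose trimmed text is empty; the expectation is zero
  emptyCount = 0;
  for (segment in segments) {
    if (len(trim(segment.text)) == 0) {
      emptyCount++;
    }
  }

  writeOutput("regex: " & arrayLen(segments) & " segments, " & emptyCount & " empty");
</cfscript>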

Choosing the right splitter for your content

Selecting a splitter depends on your document structure and the retrieval precision required. Use the table below as a guide.

Content type	Recommended splitter
Most prose documents (articles, FAQs, manuals)	recursive — the recommended general-purpose default. Cascades from paragraphs down to characters.
Narrative text, customer support transcripts	sentence — preserves complete sentences; avoids mid-sentence cuts.
Log files, CSV-like data, code listings	line — splits on line breaks; each line is a discrete unit.
Structured documents (manuals, wikis, legal docs)	hierarchical — respects document structure such as headings and sections.
Dense technical content where meaning shifts within paragraphs	semantic — groups sentences by semantic similarity rather than syntactic boundaries.
Custom-formatted documents with known delimiters	regex — provide regexPattern to define split points precisely.
Simple text; performance-sensitive pipelines	character — fastest; does not preserve word or sentence boundaries.
Domain-specific structure (legal clauses, code functions)	Custom splitter UDF

High-overlap configuration

High overlap is useful when answers are likely to span a chunk boundary. Use it when the default overlap produces answers that miss context at the edges.

Standalone: uses chunkSize: 500, chunkOverlap: 200.
Advanced RAG: uses chunkSize: 200, chunkOverlap: 80 on a larger file to stress overlap behavior during ingest.
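
A sketch of the standalone high-overlap configuration, assuming the same long-doc.txt sample file used elsewhere on this page. It compares the segment count against a lower overlap so the extra ingest cost is visible; the variable names are illustrative.

<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/long-doc.txt") });

  // Same chunk size, different amounts of shared boundary text
  highOverlap = docService.split(documents, { chunkSize: 500, chunkOverlap: 200 });
  lowOverlap = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

  writeOutput("overlap 200: " & arrayLen(highOverlap) & " segments, overlap 50: " & arrayLen(lowOverlap) & " segments");
</cfscript>

Because each high-overlap segment repeats more of its predecessor, expect the high-overlap run to produce at least as many segments as the low-overlap run.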

Zero-overlap configuration

With chunkOverlap: 0, adjacent segments share no boundary text, so no text is embedded twice. Use this when ingest cost matters more than preserving context across chunk boundaries.

<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/long-doc.txt") });

  segments = docService.split(documents, {
    chunkSize: 200,
    chunkOverlap: 0
  });

  writeOutput("segment count: " & arrayLen(segments));
</cfscript>

Output

segment count: 668

Custom splitter via UDF

If your documents have a domain-specific structure that built-in splitters do not handle well (for example, legal contracts split by clause, or code files split by function), you can provide a custom splitter UDF.

Use type: "custom" with implementation set to a UDF that accepts a document struct and returns an array of segment structs { text, metadata }. The sample UDF splits each document into paragraphs on blank lines.

<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });

  docsDir = expandPath("./docs/");

  embeddingModel = {
    provider: "openai",
    modelName: "text-embedding-3-small",
    apiKey: application.apiKey
  };

  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: embeddingModel
  });

  // Custom splitter: one segment per paragraph (text separated by a blank line)
  customSplitter = function(required struct document) {
    // The fourth argument treats chr(10) & chr(10) as one multi-character delimiter;
    // without it, listToArray() would split on every single newline.
    var paragraphs = listToArray(document.text, chr(10) & chr(10), false, true);
    var segments = [];
    for (var para in paragraphs) {
      if (len(trim(para)) > 0) {
        arrayAppend(segments, {
          text: trim(para),
          metadata: document.metadata
        });
      }
    }
    return segments;
  };

  ragBot = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir,
      recursive: false,
      includePatterns: ["*.pdf"],
      // Route splitting through the UDF defined above
      documentSplitter: {
        type: "custom",
        implementation: customSplitter
      },
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    retrievalAugmentor: {
      queryRouter: {
        contentRetrievers: [{
          vectorStore: vectorStore,
          maxResults: 3,
          minScore: 0.3,
          description: "Knowledge base"
        }]
      }
    }
  });

  ragBot.ingest();
  answer = ragBot.chat("How to extend Adobe subscription using prepaid card?");
  writeOutput(answer.message);
</cfscript>

Output

To extend your Adobe subscription using a prepaid card, follow these steps: 1. **Locate the Redemption Code**: Find the redemption code on your prepaid card. It is usually located beneath the scratch-off foil on the back of the card. 2. **Extend Your Subscription**: While your subscription is active, enter the redemption code in your Adobe account to add the subscription period to your existing account. If, however, you encounter any issues such as payment failures or authentication problems, ensure that your card details are updated in your Adobe account and complete any necessary bank authentication processes.

Note: The UDF receives one document struct and must return an array of segment structs, each with a text key and a metadata key. Returning an empty array is valid and skips the document.

Segment metadata schema

After split(), each segment's metadata struct contains splitter bookkeeping fields. The test splitter-metadata-schema.cfm validates ten keys on the first segment.

Metadata key	Description
splitterType	The splitter algorithm used to produce this segment.
splitTimestamp	Time the split operation was recorded.
documentIndex	Zero-based index of the source document in the input array.
chunkIndex	Zero-based position of this segment within its source document.
globalSegmentIndex	Zero-based position of this segment across all documents in the split call.
totalChunks	arrayLen(segments) — total segments produced in this split call.
chunkSize	The chunkSize value used for this split.
chunkOverlap	The chunkOverlap value used for this split.
startOffset	Character offset in the source document where this segment begins.
endOffset	Character offset in the source document where this segment ends. Always greater than startOffset.

The test splitter-metadata-sequential.cfm walks all segments and checks that chunkIndex and globalSegmentIndex are 0, 1, 2, … and that startOffset increases monotonically.

The following example calls split() with chunkSize: 300 and dumps the metadata of the first segment:

<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/") });
  segments = docService.split(documents, { chunkSize: 300 });

  // CFML arrays are 1-based, so segments[1] is the first segment
  meta = segments[1].metadata;
  writeDump(meta);
</cfscript>
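
The sequential-index check described above can be approximated with a short loop. This sketch relies only on the globalSegmentIndex key documented in the table; the inOrder flag is a name introduced here, and the loop is not the product's own test.

<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/") });
  segments = docService.split(documents, { chunkSize: 300 });

  inOrder = true;
  for (i = 1; i <= arrayLen(segments); i++) {
    // Array positions are 1-based, while globalSegmentIndex is documented as zero-based
    if (segments[i].metadata.globalSegmentIndex != i - 1) {
      inOrder = false;
      break;
    }
  }

  writeOutput("globalSegmentIndex is sequential: " & inOrder);
</cfscript>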

Standalone splitting with documentService().split()

Call documentService().split() to split documents independently of any RAG service. This is useful when you want to inspect, transform, or route segments before ingestion, or when you are building a custom pipeline.

Default split (no options)

After load(), call split(documents) with no second argument to use product defaults. Defaults are chunkSize 1000 and chunkOverlap 100.

<cfscript>
  docService = documentService();
  documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf"
  });

  // No options struct: product defaults apply
  segments = docService.split(documents);

  writeOutput(arrayLen(segments) & " segments");
</cfscript>

Output

52 segments

Split with chunkSize and chunkOverlap

Pass a struct as the second argument with chunkSize and chunkOverlap. Segments are an array of structs with text and metadata.

<cfscript>
  docService = documentService();
  documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf"
  });

  segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50
  });

  // Each segment is a struct with text and metadata keys
  writeDump(segments[1]);
</cfscript>

Split with splitterType: recursive

The recursive splitter cascades through separators from coarse to fine (paragraphs, then sentences, words, and characters). Pass splitterType: "recursive" alongside chunkSize and chunkOverlap.

<cfscript>
  docService = documentService();
  documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf"
  });

  segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50,
    splitterType: "recursive"
  });

  writeOutput(arrayLen(segments) & " segments");
</cfscript>

Output

96 segments

Load then split pipeline

Common standalone flow: load documents, split into segments, then (optionally) transform or inspect segments without ingesting into a vector store. The test expects at least as many segments as documents for multi-chunk files.

<cfscript>
  docService = documentService();
  documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf"
  });

  segments = docService.split(documents, {
    chunkSize: 1000,
    chunkOverlap: 100
  });

  writeOutput("documents=" & arrayLen(documents) & " segments=" & arrayLen(segments));
</cfscript>

Output

documents=21 segments=52
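
As one possible transform step between split() and ingestion, the sketch below drops very short segments and counts the remaining segments per source document. The minLength threshold and the kept and perDocument variables are hypothetical; only the text and metadata.documentIndex keys come from the segment structure described on this page.

<cfscript>
  docService = documentService();
  documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf"
  });
  segments = docService.split(documents, { chunkSize: 1000, chunkOverlap: 100 });

  minLength = 50; // hypothetical threshold, in characters

  // Keep only segments with enough text to be worth embedding
  kept = arrayFilter(segments, function(segment) {
    return len(trim(segment.text)) >= minLength;
  });

  // Count the surviving segments per source document
  perDocument = {};
  for (segment in kept) {
    key = segment.metadata.documentIndex;
    perDocument[key] = (structKeyExists(perDocument, key) ? perDocument[key] : 0) + 1;
  }

  writeDump(perDocument);
</cfscript>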

Advanced RAG: default documentSplitter

In agent(), ingestion.documentSplitter may be an empty struct {}. The default behavior is equivalent to recursive splitting with chunkSize 1000 and chunkOverlap 100. The struct returned by ingest() reports segmentsIngested, which should be greater than 0 for the sample file.

<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });

  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
      provider: "openai",
      modelName: "text-embedding-3-small",
      apiKey: application.apiKey
    }
  });

  docsDir = expandPath("./docs/");

  ragService = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir & "long-doc.txt",
      documentSplitter: {}, // empty struct: use the default splitter settings
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    retrievalAugmentor: {
      queryRouter: {
        contentRetrievers: [{
          vectorStore: vectorStore,
          maxResults: 5,
          minScore: 0.3,
          description: "Knowledge base"
        }]
      }
    }
  });

  result = ragService.ingest();
  writeOutput(result.segmentsIngested);
</cfscript>

Output

126

Advanced RAG: chunkSize and chunkOverlap

This example sets an explicit documentSplitter with only chunkSize and chunkOverlap. splitterType is omitted, so the default splitter applies:

<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });

  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
      provider: "openai",
      modelName: "text-embedding-3-small",
      apiKey: application.apiKey
    }
  });

  docsDir = expandPath("./docs/");

  ragService = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir,
      documentSplitter: {
        chunkSize: 500,
        chunkOverlap: 50
      },
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    retrievalAugmentor: {
      queryRouter: {
        contentRetrievers: [{
          vectorStore: vectorStore,
          maxResults: 5,
          minScore: 0.3,
          description: "Knowledge base"
        }]
      }
    }
  });

  ragService.ingest();
  answer = ragService.chat("How to update TIN?");
  writeOutput(answer.message);
</cfscript>

Output

To update your tax identification number (TIN), such as VAT, GST, or NIT for your individual account, you need to follow these steps: 1. **For Tax-Exempt Customers in North America:** If you are a tax-exempt customer, you can learn how to place a tax-exempt order with Adobe by following our step-by-step instructions available on the Adobe website. 2. **For PayPal Users:** If you use PayPal, ensure that you update your VAT or tax details directly in your PayPal account settings to guarantee that accurate tax information appears on future invoices. In case you need to resolve any issues, remember to add new payment information or update your existing one accordingly.

Advanced RAG: splitterType sentence

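Setting splitterType: "sentence" with chunkSize and chunkOverlap splits on sentence boundaries where possible. The snippet assumes chatModel, vectorStore, and docsDir are defined as in the previous example, and it elides the retrievalAugmentor configuration: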

<cfscript>
  ragService = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir,
      documentSplitter: {
        splitterType: "sentence",
        chunkSize: 500,
        chunkOverlap: 50
      },
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    retrievalAugmentor: { ... }
  });
</cfscript>

Advanced RAG: regex splitter

For splitterType: "regex", provide regexPattern. The sample uses a pattern that splits on newlines:

<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });

  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
      provider: "openai",
      modelName: "text-embedding-3-small",
      apiKey: application.apiKey
    }
  });

  // Directory that contains the sample documents
  docsDir = expandPath("./docs/");

  ragService = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir & "sample.txt",
      documentSplitter: {
        splitterType: "regex",
        // CFML string literals do not process backslash escapes, so "\n"
        // reaches the regex engine as the newline escape. "\\n" would
        // instead match a literal backslash followed by the letter n.
        regexPattern: "\n"
      },
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    retrievalAugmentor: {
      queryRouter: {
        contentRetrievers: [{
          vectorStore: vectorStore,
          maxResults: 5,
          minScore: 0.3,
          description: "Knowledge base"
        }]
      }
    }
  });

  ragService.ingest();
  answer = ragService.chat("What is the plot of study in scarlet?");
  writeOutput(answer.message);
</cfscript>

Output

A Study in Scarlet, published in 1887 by Sir Arthur Conan Doyle, is the first story featuring Sherlock Holmes and Dr. John Watson. The plot begins with Watson, a military surgeon returning from the Afghan War, seeking accommodation in London. He meets Holmes, and they become roommates at 221B Baker Street. Holmes, who works as a consulting detective, involves Watson in a murder case centered around a man named Enoch Drebber, found dead in a derelict house with the word 'RACHE' written in blood. Holmes uses his deduction skills to solve the case despite challenges from Scotland Yard inspectors. The narrative shifts to the backstory of the murderer, Jefferson Hope, whose actions are motivated by tragic vengeance related to a love triangle involving Mormons in the American West. Eventually, Hope dies in prison, and Watson publishes the case to acknowledge Holmes's brilliance.

Advanced RAG: recursive with custom separators

With splitterType: "recursive", you can pass separators, an array of strings tried in order. A longer variant uses [chr(10) & chr(10), chr(10), " ", ""]. The example below uses chr(10) & chr(10) and then chr(10), which splits on paragraph breaks first and line breaks second:

<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });

  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
      provider: "openai",
      modelName: "text-embedding-3-small",
      apiKey: application.apiKey
    }
  });

  // Directory that contains the sample documents
  docsDir = expandPath("./docs/");

  ragService = agent({
    CHATMODEL: chatModel,
    ingestion: {
      source: docsDir & "sample.txt",
      documentSplitter: {
        splitterType: "recursive",
        // Try paragraph breaks first, then single line breaks
        separators: [ chr(10) & chr(10), chr(10) ]
      },
      vectorStoreIngestor: { vectorStore: vectorStore }
    },
    retrievalAugmentor: {
      queryRouter: {
        contentRetrievers: [{
          vectorStore: vectorStore,
          maxResults: 5,
          minScore: 0.3,
          description: "Knowledge base"
        }]
      }
    }
  });

  ragService.ingest();
  answer = ragService.chat("Who is Dr. Watson?");
  writeOutput(answer.message);
</cfscript>

Output

Dr. John Watson is a fictional character created by Sir Arthur Conan Doyle, who first appears in the novel "A Study in Scarlet" published in 1887. Watson is a military surgeon who returns to London after being invalided from the Afghan War. He becomes the close companion and confidant of the famous detective Sherlock Holmes. They share lodgings at 221B Baker Street, where Watson becomes involved in Holmes's investigations. Throughout the stories, Watson serves as the narrator and often helps Holmes solve complex cases, highlighting his skills in observation and deduction.
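
The longer separator list mentioned above can also be tried outside of agent(). The sketch below assumes that documentService().split() accepts separators in its options struct, as the parameter table suggests; confirm this against your ColdFusion version. It reuses the sample.txt file from the example above.

<cfscript>
  docService = documentService();
  documents = docService.load({ path: expandPath("./docs/sample.txt") });

  segments = docService.split(documents, {
    splitterType: "recursive",
    chunkSize: 500,
    chunkOverlap: 50,
    // Paragraph breaks, then line breaks, then spaces, then individual characters
    separators: [ chr(10) & chr(10), chr(10), " ", "" ]
  });

  writeOutput(arrayLen(segments) & " segments");
</cfscript>

Ordering the list from coarse to fine lets the splitter fall back to smaller units only when a larger unit does not fit within chunkSize.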

Simple RAG: flat splitter options

simpleRAG() takes chunkSize, chunkOverlap, splitterType, and recursive on the third argument options struct:

<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });

  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
      provider: "openai",
      modelName: "text-embedding-3-small",
      apiKey: application.apiKey
    }
  });

  docsDir = expandPath("./docs/");

  // Chunking options are flat keys on the third argument, not nested under documentSplitter
  ragService = simpleRAG(
    docsDir,
    chatModel,
    {
      vectorStore: vectorStore,
      chunkSize: 500,
      chunkOverlap: 50,
      splitterType: "character",
      recursive: false
    }
  );

  ragService.ingest();
  answer = ragService.ask("How to extend Adobe subscription?");
  writeOutput(answer.message);
</cfscript>

Output

To extend your Adobe subscription, you can easily do so with a prepaid card while your subscription is still active. The subscription period will be added to your existing account when using a prepaid card. Here are the steps you need to follow: 1. On the prepaid card, find your redemption code beneath the scratch-off foil on the back of the card. 2. Select the appropriate link based on your product type.

Simple RAG: sentence splitter

The same flat options, this time with splitterType: "sentence" and a higher chunkOverlap of 200:

<cfscript>
  chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
  });

  vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
      provider: "openai",
      modelName: "text-embedding-3-small",
      apiKey: application.apiKey
    }
  });

  docsDir = expandPath("./docs/");

  ragService = simpleRAG(
    docsDir,
    chatModel,
    {
      vectorStore: vectorStore,
      chunkSize: 500,
      chunkOverlap: 200,
      splitterType: "sentence",
      recursive: false
    }
  );

  ragService.ingest();
  answer = ragService.ask("How to respond to Adobe support ticket?");
  writeOutput(answer.message);
</cfscript>

Output

To respond to an open Adobe support ticket, sign in to your Adobe account and update the case by adding messages or uploading files directly. You can include additional information or screenshots as needed. If requested, you may also submit proof of purchase for returns or proof of tax-exempt eligibility for tax-exempt orders.
