Document processing service (Standalone)

Last update: May 18, 2026

Learn how to use ColdFusion's standalone documentService() for full procedural control over RAG document pipelines. Covers load(), split(), transform(), transformSegments(), ingest(), all async variants with Futures, composing with agent() for shared vector stores, and closing the service.

Document processing is the pipeline that turns files or URLs into text segments ready for embedding and storage. In ColdFusion RAG you can run that pipeline in two ways: inside an AI service's ingestion configuration (agent(), simpleRAG()), or outside it using documentService(). The standalone service is useful when you need explicit control over each step (load once, split with different strategies, transform with your own UDFs, then ingest to Milvus, Qdrant, or another store) without coupling everything to a single ingestion block.

documentService() returns an object you call like a small SDK: load, split, transform, transformSegments, ingest, plus loadAsync, ingestAsync, transformAsync, transformSegmentsAsync, and close. Parameters are passed per method; the factory itself does not take a global config struct.

documentService() vs agent() ingestion

agent({ ingestion: { ... } }) bundles document loading, splitting, and vector ingest into one lifecycle: you configure source, documentSplitter, vectorStoreIngestor, and call ingest() on the agent. documentService() exposes the same underlying stages as discrete calls so you can compose custom workflows (for example load from disk, split, enrich metadata in CFML, then ingest), integrate with batch jobs, or unit-test splitting and transforms without standing up a full agent.

Use agent() ingestion when you want a declarative RAG setup; use documentService() when you need procedural control or reuse of intermediate arrays (documents, segments).

Requirement	agent() ingestion	documentService()
Declarative RAG setup, minimal code	Preferred	Works but verbose
Procedural control over each step	No	Discrete method calls
Reuse intermediate arrays (documents, segments)	No	Each method returns an array
Custom UDF transforms at document or segment level	No	`transform()` / `transformSegments()`
Unit-test splitting and transforms in isolation	No	No agent required
Batch jobs or offline export (no vector store required)	No	Stop after split
Query retrieval from same service	`chat()` / `ask()`	No; ingest into a vector store that an `agent()` also uses

Creating the service

The service is created with documentService() and no arguments. Each call returns a new instance, and multiple instances operate independently of one another.

// No factory-level config: options are passed to each method
docService = documentService();
documents = docService.load({
    path: expandPath("./Documents/"),
    pattern: "*.txt"
});

// Multiple independent instances
service1 = documentService();
service2 = documentService();
docs1 = service1.load({ path: expandPath("./Documents/"), pattern: "*.txt" });
docs2 = service2.load({ path: expandPath("./Documents/"), pattern: "*.txt" });

Full pipeline: load → split → transform → ingest

A full pipeline chains load → split → transformSegments → ingest. A lighter pattern stops after load → split when you only need chunks for offline export, scoring, or non-vector workflows.

Step 1: Load documents

load() reads from the filesystem and from URLs. It returns a ColdFusion array of structs; each document struct includes at least text and metadata. You can pass a struct with path and optional pattern, recursive, and metadata (merged into each document's metadata), or pass a string path directly.

Basic load:

<cfscript>
docService = documentService();
documents = docService.load({
    path: expandPath("./docs/")
});

if (isArray(documents) && arrayLen(documents) > 0) {
    doc = documents[1];
    // Each document struct exposes doc.text and doc.metadata.
    // Dump inside the guard so doc is never referenced when nothing was loaded.
    writeDump(doc);
}
</cfscript>

With pattern, recursion, and custom metadata:

<cfscript>
docService = documentService();
documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf",
    recursive: false,
    metadata: { category: "test" }
});

if (isArray(documents) && arrayLen(documents) > 0) {
    doc = documents[1];
    // doc.metadata now includes the custom category value.
    // Dump inside the guard so doc is never referenced when nothing was loaded.
    writeDump(doc);
}
</cfscript>
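As noted at the start of this step, load() also accepts a plain string path in place of the options struct. A minimal sketch of that form, reusing the ./docs/ directory from the examples above; which defaults apply for pattern and recursion in this form is not stated here, so treat the result set as something to verify on your own server:

```cfml
<cfscript>
docService = documentService();

// String form: pass the directory path directly instead of an options struct
documents = docService.load(expandPath("./docs/"));

if (isArray(documents) && arrayLen(documents) > 0) {
    writeOutput("Loaded " & arrayLen(documents) & " document(s)");
}
</cfscript>
```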

Step 2: Split into segments

split() takes the array of documents from load() and returns an array of segment structs (each with text and metadata). You can rely on defaults (chunkSize: 1000, chunkOverlap: 100) or pass chunkSize, chunkOverlap, and splitterType. For splitterType: "regex", supply regexPattern.

Defaults (no options):

<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents);
writeDump(segments);
</cfscript>

Explicit chunking:

<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50
});
writeDump(segments);
</cfscript>

Recursive splitter:

<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50,
    splitterType: "recursive"
});
writeDump(segments);
</cfscript>

Regex splitter:

<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50,
    splitterType: "regex",
    regexPattern: "\n\n"
});
writeDump(segments);
</cfscript>
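To make the effect of chunkSize and chunkOverlap concrete, here is a plain CFML sketch of fixed-size windowing with overlap. It is an illustration only: chunkWithOverlap() is a hypothetical helper, it counts characters, and it is not the algorithm behind split() (the unit of chunkSize and the boundary handling of the built-in splitters are not specified in this page).

```cfml
<cfscript>
// Hypothetical helper: slide a window of chunkSize characters over the text,
// advancing by (chunkSize - chunkOverlap) so neighbouring chunks share a margin.
function chunkWithOverlap(required string text, numeric chunkSize = 1000, numeric chunkOverlap = 100) {
    if (arguments.chunkSize < 1 || arguments.chunkOverlap < 0 || arguments.chunkOverlap >= arguments.chunkSize) {
        throw(message = "chunkOverlap must be at least 0 and smaller than chunkSize");
    }
    var chunks = [];
    var step = arguments.chunkSize - arguments.chunkOverlap;
    var start = 1;
    while (start <= len(arguments.text)) {
        // mid() returns the remainder when fewer than chunkSize characters are left
        arrayAppend(chunks, mid(arguments.text, start, arguments.chunkSize));
        if (start + arguments.chunkSize > len(arguments.text)) {
            break; // this window already reached the end of the text
        }
        start += step;
    }
    return chunks;
}

sample = repeatString("abcdefghij", 25); // 250 characters of sample text
chunks = chunkWithOverlap(sample, 100, 20);
writeDump(chunks);
</cfscript>
```

A larger overlap repeats more context at each boundary at the cost of more segments to embed; a smaller one does the opposite.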

Step 3: Transform documents and segments

Transformation lets you modify documents after load() or segments after split() (normalizing text, enriching metadata, or tagging pipeline stages) before ingest. transform(documents, udf) applies a function to each document. transformSegments(segments, udf) passes both the source document and the segment to the UDF so you can correlate chunk-level data with file-level context.

Document-level UDF:

<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });

function myTransformer(required struct document) {
    document.metadata.transformed = true;
    document.metadata.wordCount = listLen(document.text, " ");
    return document;
}

transformed = docService.transform(documents, myTransformer);
writeDump(transformed);
</cfscript>
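Because a transform is an ordinary UDF, it can be exercised on its own, without a documentService() instance or an agent. The sketch below calls the same myTransformer function on a hand-built struct that stands in for a loaded document; the sample text is made up for illustration.

```cfml
<cfscript>
function myTransformer(required struct document) {
    document.metadata.transformed = true;
    document.metadata.wordCount = listLen(document.text, " ");
    return document;
}

// Hand-built stand-in for a loaded document (hypothetical sample data)
sampleDoc = { text: "ColdFusion makes RAG pipelines simple", metadata: {} };
result = myTransformer(sampleDoc);

// Fail loudly if the UDF did not set the metadata it is supposed to set
if (!result.metadata.transformed || result.metadata.wordCount != 5) {
    throw(message = "myTransformer did not produce the expected metadata");
}
writeOutput("myTransformer check passed");
</cfscript>
```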

Segment-level UDF:

<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

function segmentEnricher(struct document, required struct segment) {
    segment.metadata.enhanced = true;
    segment.metadata.charCount = len(segment.text);
    return segment;
}

transformed = docService.transformSegments(segments, segmentEnricher);
writeDump(transformed);
</cfscript>
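The segment-level UDF also receives the source document, which the example above does not use. A minimal sketch of correlating the two, assuming the document argument carries the metadata supplied to load(); inheritCategory is a hypothetical name, and the guards are there because the document parameter is declared optional:

```cfml
<cfscript>
docService = documentService();
documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.txt",
    metadata: { category: "test" }
});
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

// Copy a file-level metadata value onto each segment when the source document is available
function inheritCategory(struct document, required struct segment) {
    if (!isNull(arguments.document)
        && structKeyExists(arguments.document, "metadata")
        && structKeyExists(arguments.document.metadata, "category")) {
        segment.metadata.category = arguments.document.metadata.category;
    }
    return segment;
}

transformed = docService.transformSegments(segments, inheritCategory);
writeDump(transformed);
</cfscript>
```

Whether split() already carries document metadata over to each segment is not stated here, so the copy may be redundant on your server; the dump shows what each segment ends up with.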

Step 4: Ingest into a vector store

ingest() takes an array of segments and a vector store client, embeds segment text according to the store's embeddingModel, and writes vectors. You may call ingest(segments, store) with defaults, or pass a third struct with batchSize and continueOnError. The return value is a struct of statistics.

With batching and error policy (INMEMORY store):

<cfscript>
docService = documentService();

vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
        provider: "openai",
        modelName: "text-embedding-3-small",
        apiKey: application.apiKey
    }
});

documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

result = docService.ingest(segments, vectorStore, {
    batchSize: 50,
    continueOnError: true
});
writeDump(result);
</cfscript>

Qdrant example:

// Assumes docService and segments were created by the load and split steps above
vectorStoreClient = VectorStore({
    provider: "qdrant",
    url: application.vectorDB.qdrant.grpcUrl,
    apiKey: application.vectorDB.qdrant.apiKey,
    collectionName: "dps_ingest_qdrant_test",
    metricType: "COSINE",
    dimension: 384,
    embeddingModel: {
        provider: "ollama",
        modelName: "all-minilm",
        baseUrl: application.ollamaBaseUrl
    }
});

result = docService.ingest(segments, vectorStoreClient, {
    batchSize: 50,
    continueOnError: true
});

Load → split only (lightweight pipeline)

Stop after load → split when you only need chunks for offline export, scoring, or non-vector workflows:

<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, {
    chunkSize: 1000,
    chunkOverlap: 100
});
</cfscript>
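For the offline-export case, the segments array can be written out with standard CFML functions. A sketch under two assumptions: the segments are plain structs that serializeJSON() can handle, and ./export is a hypothetical, writable output directory.

```cfml
<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 1000, chunkOverlap: 100 });

// Hypothetical output location; adjust to a directory the server can write to
exportDir = expandPath("./export");
if (!directoryExists(exportDir)) {
    directoryCreate(exportDir);
}

// Write every segment (text plus metadata) to a single JSON file
fileWrite(exportDir & "/segments.json", serializeJSON(segments), "utf-8");

docService.close();
</cfscript>
```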

Full pipeline (load → split → transformSegments → ingest)

<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

function enrichSegment(struct document, required struct segment) {
    segment.metadata.pipeline = "full";
    segment.metadata.processedAt = now();
    return segment;
}

enrichedSegments = docService.transformSegments(segments, enrichSegment);

vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
        provider: "openai",
        modelName: "text-embedding-3-small",
        apiKey: application.apiKey
    }
});

result = docService.ingest(enrichedSegments, vectorStore);
writeDump(result);
</cfscript>

Async methods and futures (loadAsync, transformAsync, ingestAsync)

Async variants return a Future; call .get() to block until the operation completes. Use loadAsync, transformAsync, transformSegmentsAsync, and ingestAsync when the work can overlap with other request processing, subject to server threading and safety limits.

Async method	Sync equivalent	Returns (after .get())
`loadAsync(options)`	`load(options)`	Array of document structs
`transformAsync(docs, udf)`	`transform(docs, udf)`	Array of transformed document structs
`transformSegmentsAsync(segs, udf)`	`transformSegments(segs, udf)`	Array of transformed segment structs
`ingestAsync(segs, store)`	`ingest(segs, store)`	Statistics struct

loadAsync

future = docService.loadAsync({
    path: expandPath("./Documents/"),
    pattern: "*.txt"
});
documents = future.get();

transformAsync

function asyncTransformer(required struct document) {
    document.metadata.asyncDone = true;
    return document;
}

future = docService.transformAsync(documents, asyncTransformer);
transformed = future.get();

transformSegmentsAsync

future = docService.transformSegmentsAsync(segments, function(struct document, struct segment) {
    segment.metadata.asyncProcessed = true;
    return segment;
});
transformed = future.get();

ingestAsync

future = docService.ingestAsync(segments, vectorStoreClient);
result = future.get();

Note: future.get() blocks the current request thread until the operation finishes. For true background jobs that survive the HTTP response, use scheduled tasks, message queues, or long-lived workers.
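The benefit of the async variants comes from starting more than one operation before waiting on any of them. A sketch of that pattern with two loads; the directory names are hypothetical, and it assumes a single service instance accepts a second loadAsync() call while the first is still running (if that does not hold on your server, use one documentService() instance per load):

```cfml
<cfscript>
docService = documentService();

// Start both loads before waiting on either (directory names are hypothetical)
manualsFuture = docService.loadAsync({ path: expandPath("./docs/manuals/"), pattern: "*.txt" });
faqFuture = docService.loadAsync({ path: expandPath("./docs/faq/"), pattern: "*.txt" });

// Each get() blocks until its own operation has finished
manuals = manualsFuture.get();
faqs = faqFuture.get();

// Merge both result arrays into one list of documents
documents = [];
arrayAppend(documents, manuals, true);
arrayAppend(documents, faqs, true);

segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });
docService.close();
</cfscript>
```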

Compose with agent() for shared vector stores

documentService() is the standalone equivalent of the ingestion stages inside agent(). You can compose them: use documentService() to load, split, and transform documents with full procedural control, then ingest the resulting segments into a VectorStore that is also referenced by an agent() for retrieval. Both operate on the same underlying index.

Create one VectorStore instance. Use documentService() to populate it, then pass the same store to agent() for querying:

<cfscript>
// 1. Create a shared vector store
sharedStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
        provider: "openai",
        modelName: "text-embedding-3-small",
        apiKey: application.apiKey
    }
});

// 2. Use documentService() to load, split, enrich, and ingest
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

function enrichSegment(struct document, required struct segment) {
    segment.metadata.source = "documentService";
    return segment;
}

enriched = docService.transformSegments(segments, enrichSegment);
docService.ingest(enriched, sharedStore);
docService.close();

// 3. Use agent() for retrieval against the same store
chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
});

queryService = agent({
    CHATMODEL: chatModel,
    retrievalAugmentor: {
        queryRouter: {
            contentRetrievers: [{
                vectorStore: sharedStore,
                maxResults: 5,
                minScore: 0.3,
                description: "Knowledge base"
            }]
        }
    }
});

answer = queryService.chat("How to upgrade Adobe plan?");
writeOutput(answer.message);
</cfscript>

This pattern is useful when you need to run multiple ingestion passes (for example different chunk sizes for different document types) into the same store, or when ingestion and querying are separate processes that share a persistent vector store such as Qdrant or Milvus.
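A sketch of the multi-pass case, using only the calls shown earlier: text files and PDFs are split with different chunk sizes and ingested into the same store. The chunk sizes are arbitrary illustration values, not recommendations.

```cfml
<cfscript>
sharedStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
        provider: "openai",
        modelName: "text-embedding-3-small",
        apiKey: application.apiKey
    }
});

docService = documentService();

// Pass 1: text files, smaller chunks
textDocs = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
textSegments = docService.split(textDocs, { chunkSize: 500, chunkOverlap: 50 });
docService.ingest(textSegments, sharedStore);

// Pass 2: PDFs, larger chunks, same store
pdfDocs = docService.load({ path: expandPath("./docs/"), pattern: "*.pdf" });
pdfSegments = docService.split(pdfDocs, { chunkSize: 1000, chunkOverlap: 100 });
docService.ingest(pdfSegments, sharedStore);

docService.close();
</cfscript>
```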

Close the service

The service implements AutoCloseable-style cleanup: call close() when you are done to release any resources held by the instance.

<cfscript>
docService = documentService();
documents = docService.load({
    path: expandPath("./Documents/"),
    pattern: "*.txt"
});
docService.close();
</cfscript>

Call close() after the final ingest() call on a given instance, or in a finally block if the pipeline can throw. Multiple instances each require their own close() call; closing one instance does not affect others.
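A sketch of the finally-block variant, using standard cfscript exception handling. The catch clause only logs and rethrows; the log message text is illustrative.

```cfml
<cfscript>
docService = documentService();
try {
    documents = docService.load({ path: expandPath("./Documents/"), pattern: "*.txt" });
    segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });
    writeOutput("Prepared " & arrayLen(segments) & " segment(s)");
} catch (any e) {
    // Record the failure, then let the error propagate to the caller
    writeLog(text = "Document pipeline failed: " & e.message, type = "error");
    rethrow;
} finally {
    // Runs whether or not load() or split() threw
    docService.close();
}
</cfscript>
```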

Topic	Detail
`close()`	Releases resources held by this `documentService()` instance. Safe to call after `load()`, `split()`, `transform()`, or `ingest()`.
Scope	Each instance must be closed independently. Closing `service1` does not affect `service2`.
Timing	Call after the final `ingest()` on that instance, or in a `finally` block to guarantee cleanup even if an earlier step throws.

Note: The factory documentService() itself has no persistent state between calls. Only the instance object returned by a given call holds resources. Each new call to documentService() creates a fresh, independent instance.
