Document processing service (Standalone)

Last update: May 18, 2026
Learn how to use ColdFusion's standalone documentService() for full procedural control over RAG document pipelines. Covers load(), split(), transform(), transformSegments(), ingest(), all async variants with Futures, composing with agent() for shared vector stores, and closing the service.
Document processing is the pipeline that turns files or URLs into text segments ready for embedding and storage. In ColdFusion RAG you can run that pipeline in two ways: inside an AI service's ingestion configuration (agent(), simpleRAG()), or outside it using documentService(). The standalone service is useful when you need explicit control over each step (load once, split with different strategies, transform with your own UDFs, then ingest to Milvus, Qdrant, or another store) without coupling everything to a single ingestion block.
documentService() returns an object you call like a small SDK: load, split, transform, transformSegments, ingest, plus loadAsync, ingestAsync, transformAsync, transformSegmentsAsync, and close. Parameters are passed per method; the factory itself does not take a global config struct.

documentService() vs agent() ingestion

agent({ ingestion: { ... } }) bundles document loading, splitting, and vector ingest into one lifecycle: you configure source, documentSplitter, vectorStoreIngestor, and call ingest() on the agent. documentService() exposes the same underlying stages as discrete calls so you can compose custom workflows (for example load from disk, split, enrich metadata in CFML, then ingest), integrate with batch jobs, or unit-test splitting and transforms without standing up a full agent.
Use agent() ingestion when you want a declarative RAG setup; use documentService() when you need procedural control or reuse of intermediate arrays (documents, segments).
| Requirement | agent() ingestion | documentService() |
|---|---|---|
| Declarative RAG setup, minimal code | Preferred | Works but verbose |
| Procedural control over each step | No | Discrete method calls |
| Reuse intermediate arrays (documents, segments) | No | Each method returns an array |
| Custom UDF transforms at document or segment level | No | transform() / transformSegments() |
| Unit-test splitting and transforms in isolation | No | No agent required |
| Batch jobs or offline export (no vector store required) | No | Stop after split |
| Query retrieval from same service | chat() / ask() | No; hand segments to agent() |

Creating the Service

The service is created with documentService() and no arguments. Each call returns a new instance, and separate instances operate independently of one another.
<cfscript>
// No factory-level config: options are passed to each method
docService = documentService();
documents = docService.load({
    path: expandPath("./Documents/"),
    pattern: "*.txt"
});

// Multiple independent instances
service1 = documentService();
service2 = documentService();
docs1 = service1.load({ path: expandPath("./Documents/"), pattern: "*.txt" });
docs2 = service2.load({ path: expandPath("./Documents/"), pattern: "*.txt" });
</cfscript>

Full pipeline: load → split → transform → ingest

A full pipeline chains load → split → transformSegments → ingest. A lighter pattern stops after load → split when you only need chunks for offline export, scoring, or non-vector workflows.

Step 1: Load documents

load() reads from the filesystem and from URLs. It returns a ColdFusion array of structs; each document struct includes at least text and metadata. You can pass a struct with path and optional pattern, recursive, and metadata (merged into each document's metadata), or pass a string path directly.
Basic load:
<cfscript>
docService = documentService();
documents = docService.load({
    path: expandPath("./docs/")
});
if (isArray(documents) && arrayLen(documents) > 0) {
    doc = documents[1];
    // Each document struct exposes doc.text and doc.metadata.
    // Dump inside the guard so doc is never referenced when nothing was loaded.
    writeDump(doc);
}
</cfscript>
With pattern, recursion, and custom metadata:
<cfscript>
docService = documentService();
documents = docService.load({
    path: expandPath("./docs/"),
    pattern: "*.pdf",
    recursive: false,
    metadata: { category: "test" }
});
if (isArray(documents) && arrayLen(documents) > 0) {
    doc = documents[1];
    // doc.metadata now includes the merged category value.
    // Dump inside the guard so doc is never referenced when nothing was loaded.
    writeDump(doc);
}
</cfscript>
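The description above also mentions passing a string path directly, which neither example shows. A minimal sketch of that form follows. It assumes the string is treated like the path key, with pattern, recursive, and metadata left at whatever defaults the service applies (this page does not list them).

```cfml
<cfscript>
docService = documentService();

// String form: only the path is supplied. Defaults for the other
// options are assumed here, not documented on this page.
documents = docService.load(expandPath("./docs/"));

// Guard before indexing, mirroring the struct-based examples.
if (isArray(documents) && arrayLen(documents) > 0) {
    writeOutput("Loaded " & arrayLen(documents) & " document(s)");
}
</cfscript>
```

Switch to the struct form as soon as you need a file pattern or custom metadata.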

Step 2: Split into segments

split() takes the array of documents from load() and returns an array of segment structs (each with text and metadata). You can rely on defaults (chunkSize: 1000, chunkOverlap: 100) or pass chunkSize, chunkOverlap, and splitterType. For splitterType: "regex", supply regexPattern.
Defaults (no options):
<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents);
writeDump(segments);
</cfscript>
Explicit chunking:
<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50
});
writeDump(segments);
</cfscript>
Recursive splitter:
<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50,
    splitterType: "recursive"
});
writeDump(segments);
</cfscript>
Regex splitter:
<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, {
    chunkSize: 500,
    chunkOverlap: 50,
    splitterType: "regex",
    regexPattern: "\n\n"
});
writeDump(segments);
</cfscript>
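To make chunkSize and chunkOverlap concrete, here is a naive fixed-window sketch in plain CFML. It is not how split() is implemented. It assumes sizes are counted in characters (this page does not state the unit), and naiveChunk is a hypothetical helper written only for illustration.

```cfml
<cfscript>
// Hypothetical helper: fixed-size windows that share chunkOverlap
// characters with the previous window. Illustration only.
function naiveChunk(required string text, numeric chunkSize = 1000, numeric chunkOverlap = 100) {
    if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
        throw(type = "InvalidArgument", message = "Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
    }
    var chunks = [];
    var step = chunkSize - chunkOverlap; // how far the window advances each time
    var start = 1;                       // CFML strings are 1-indexed
    while (start <= len(text)) {
        arrayAppend(chunks, mid(text, start, chunkSize));
        // Stop once the current window has reached the end of the text.
        if (start + chunkSize > len(text)) {
            break;
        }
        start += step;
    }
    return chunks;
}

chunks = naiveChunk(repeatString("a", 25), 10, 2);
writeDump(chunks);
</cfscript>
```

With a 25-character string, chunkSize 10, and chunkOverlap 2, the window advances 8 characters at a time and yields three chunks of 10, 10, and 9 characters. Each chunk begins with the last two characters of the one before it.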

Step 3: Transform documents and segments

Transformation lets you modify documents after load() or segments after split() before ingest: normalizing text, enriching metadata, or tagging pipeline stages. transform(documents, udf) applies a function to each document. transformSegments(segments, udf) passes both the source document and the segment to the UDF so you can correlate chunk-level data with file-level context.
Document-level UDF:
<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });

function myTransformer(required struct document) {
    document.metadata.transformed = true;
    // Treat spaces, tabs, and line breaks as word delimiters
    document.metadata.wordCount = listLen(document.text, " " & chr(9) & chr(10) & chr(13));
    return document;
}

transformed = docService.transform(documents, myTransformer);
writeDump(transformed);
</cfscript>
Segment-level UDF:
<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

function segmentEnricher(struct document, required struct segment) {
    segment.metadata.enhanced = true;
    segment.metadata.charCount = len(segment.text);
    return segment;
}

transformed = docService.transformSegments(segments, segmentEnricher);
writeDump(transformed);
</cfscript>
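The text-normalization case mentioned above can be sketched with a document-level UDF. normalizeWhitespace is a hypothetical name and the normalized metadata key is an illustrative choice. Only transform(), split(), and the text and metadata keys come from this page. The sketch assumes a UDF may rewrite document.text as well as metadata.

```cfml
<cfscript>
// Hypothetical UDF: collapse runs of whitespace (spaces, tabs, line breaks)
// into single spaces so chunk sizes reflect real content.
function normalizeWhitespace(required struct document) {
    document.text = trim(reReplace(document.text, "\s+", " ", "all"));
    document.metadata.normalized = true; // illustrative marker key
    return document;
}

docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });

// Normalize first, then split the cleaned documents.
cleaned = docService.transform(documents, normalizeWhitespace);
segments = docService.split(cleaned, { chunkSize: 500, chunkOverlap: 50 });
</cfscript>
```

Normalizing before split() means padding and blank lines do not use up part of each chunk. Note that collapsing line breaks would defeat a regex splitter keyed on blank lines, so this ordering suits the size-based splitters better.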

Step 4: Ingest into a vector store

ingest() takes an array of segments and a vector store client, embeds segment text according to the store's embeddingModel, and writes vectors. You may call ingest(segments, store) with defaults, or pass a third struct with batchSize and continueOnError. The return value is a struct of statistics.
With batching and error policy (INMEMORY store):
<cfscript>
docService = documentService();
vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
        provider: "openai",
        modelName: "text-embedding-3-small",
        apiKey: application.apiKey
    }
});
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });
result = docService.ingest(segments, vectorStore, {
    batchSize: 50,
    continueOnError: true
});
writeDump(result);
</cfscript>
Qdrant example:
<cfscript>
// Assumes docService and segments were created as in the previous example
vectorStoreClient = VectorStore({
    provider: "qdrant",
    url: application.vectorDB.qdrant.grpcUrl,
    apiKey: application.vectorDB.qdrant.apiKey,
    collectionName: "dps_ingest_qdrant_test",
    metricType: "COSINE",
    dimension: 384,
    embeddingModel: {
        provider: "ollama",
        modelName: "all-minilm",
        baseUrl: application.ollamaBaseUrl
    }
});
result = docService.ingest(segments, vectorStoreClient, {
    batchSize: 50,
    continueOnError: true
});
</cfscript>

Load → split only (lightweight pipeline)

Stop after load → split when you only need chunks for offline export, scoring, or non-vector workflows:
<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, {
    chunkSize: 1000,
    chunkOverlap: 100
});
</cfscript>
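For the offline-export case, one possible follow-on is to write the segments to a JSON file. This is a sketch under assumptions: the ./export/ directory and segments.json file name are hypothetical, and it assumes the segment structs returned by split() behave like ordinary CFML structs that arrayMap() and serializeJSON() can handle.

```cfml
<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 1000, chunkOverlap: 100 });
docService.close();

// Keep only the two keys this page guarantees on a segment struct.
// Quoted keys preserve their case in the JSON output.
exportRows = arrayMap(segments, function(segment) {
    return { "text": segment.text, "metadata": segment.metadata };
});

// Hypothetical output location.
exportDir = expandPath("./export");
if (!directoryExists(exportDir)) {
    directoryCreate(exportDir);
}
fileWrite(exportDir & "/segments.json", serializeJSON(exportRows), "utf-8");
</cfscript>
```

No vector store or embedding model is involved, so this path can run in environments without access to an embedding provider.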

Full pipeline (load → split → transformSegments → ingest)

<cfscript>
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

function enrichSegment(struct document, required struct segment) {
    segment.metadata.pipeline = "full";
    segment.metadata.processedAt = now();
    return segment;
}

enrichedSegments = docService.transformSegments(segments, enrichSegment);

vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
        provider: "openai",
        modelName: "text-embedding-3-small",
        apiKey: application.apiKey
    }
});

result = docService.ingest(enrichedSegments, vectorStore);
writeDump(result);
</cfscript>

Async methods and futures (loadAsync, transformAsync, ingestAsync)

The async variants (loadAsync, transformAsync, transformSegmentsAsync, and ingestAsync) return a Future; call .get() to block until the operation completes. Use them when the work can overlap with other request processing (subject to server threading and safety limits).
| Async method | Sync equivalent | Returns (after .get()) |
|---|---|---|
| loadAsync(options) | load(options) | Array of document structs |
| transformAsync(docs, udf) | transform(docs, udf) | Array of transformed document structs |
| transformSegmentsAsync(segs, udf) | transformSegments(segs, udf) | Array of transformed segment structs |
| ingestAsync(segs, store) | ingest(segs, store) | Statistics struct |

loadAsync

<cfscript>
future = docService.loadAsync({
    path: expandPath("./Documents/"),
    pattern: "*.txt"
});
documents = future.get();
</cfscript>

transformAsync

<cfscript>
function asyncTransformer(required struct document) {
    document.metadata.asyncDone = true;
    return document;
}
future = docService.transformAsync(documents, asyncTransformer);
transformed = future.get();
</cfscript>

transformSegmentsAsync

<cfscript>
future = docService.transformSegmentsAsync(segments, function(struct document, struct segment) {
    segment.metadata.asyncProcessed = true;
    return segment;
});
transformed = future.get();
</cfscript>

ingestAsync

<cfscript>
future = docService.ingestAsync(segments, vectorStoreClient);
result = future.get();
</cfscript>
Note: future.get() blocks the current request thread until the operation finishes. For true background jobs that survive the HTTP response, use scheduled tasks, message queues, or long-lived workers.
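The snippets above call .get() immediately, which behaves much like the synchronous methods. A sketch of actually overlapping work might look like the following: start the load, set up the vector store while it runs, and block only when the documents are needed. It uses only calls shown on this page, and whether the overlap saves time in practice depends on the server's threading limits.

```cfml
<cfscript>
docService = documentService();

// Start the load; the call returns a Future without waiting for the files.
loadFuture = docService.loadAsync({
    path: expandPath("./Documents/"),
    pattern: "*.txt"
});

// Work that does not depend on the documents can proceed in the meantime.
vectorStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
        provider: "openai",
        modelName: "text-embedding-3-small",
        apiKey: application.apiKey
    }
});

// Block only at the point where the documents are required.
documents = loadFuture.get();
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

ingestFuture = docService.ingestAsync(segments, vectorStore);
result = ingestFuture.get();
docService.close();
</cfscript>
```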

Compose with agent() for shared vector stores

documentService() is the standalone equivalent of the ingestion stages inside agent(). You can compose them: use documentService() to load, split, and transform documents with full procedural control, then ingest the resulting segments into a VectorStore that is also referenced by an agent() for retrieval. Both operate on the same underlying index.
Create one VectorStore instance. Use documentService() to populate it, then pass the same store to agent() for querying:
<cfscript>
// 1. Create a shared vector store
sharedStore = VectorStore({
    provider: "INMEMORY",
    embeddingModel: {
        provider: "openai",
        modelName: "text-embedding-3-small",
        apiKey: application.apiKey
    }
});

// 2. Use documentService() to load, split, enrich, and ingest
docService = documentService();
documents = docService.load({ path: expandPath("./docs/"), pattern: "*.txt" });
segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });

function enrichSegment(struct document, required struct segment) {
    segment.metadata.source = "documentService";
    return segment;
}

enriched = docService.transformSegments(segments, enrichSegment);
docService.ingest(enriched, sharedStore);
docService.close();

// 3. Use agent() for retrieval against the same store
chatModel = ChatModel({
    provider: "openai",
    modelName: "gpt-4o-mini",
    apiKey: application.apiKey,
    temperature: 0.7
});

queryService = agent({
    CHATMODEL: chatModel,
    retrievalAugmentor: {
        queryRouter: {
            contentRetrievers: [{
                vectorStore: sharedStore,
                maxResults: 5,
                minScore: 0.3,
                description: "Knowledge base"
            }]
        }
    }
});

answer = queryService.chat("How do I upgrade my Adobe plan?");
writeOutput(answer.message);
</cfscript>
This pattern is useful when you need to run multiple ingestion passes (for example different chunk sizes for different document types) into the same store, or when ingestion and querying are separate processes that share a persistent vector store such as Qdrant or Milvus.
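The multi-pass case can be sketched as a loop over per-type settings. The file patterns and chunk sizes below are illustrative assumptions, and sharedStore is assumed to be the VectorStore created in the example above.

```cfml
<cfscript>
// Assumes sharedStore was created as in the example above.
docService = documentService();

// Hypothetical per-type settings: one ingestion pass per entry.
passConfigs = [
    { pattern: "*.txt", chunkSize: 500,  chunkOverlap: 50 },
    { pattern: "*.pdf", chunkSize: 1000, chunkOverlap: 100 }
];

for (cfg in passConfigs) {
    documents = docService.load({ path: expandPath("./docs/"), pattern: cfg.pattern });
    if (!isArray(documents) || arrayLen(documents) == 0) {
        continue; // nothing matched this pattern
    }
    segments = docService.split(documents, {
        chunkSize: cfg.chunkSize,
        chunkOverlap: cfg.chunkOverlap
    });
    // Every pass writes into the same store the agent reads from.
    docService.ingest(segments, sharedStore);
}

docService.close();
</cfscript>
```

Because each pass reuses the same store, an agent() configured against sharedStore retrieves across both document types without further changes.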

Close the service

The service implements AutoCloseable-style cleanup: call close() when you are done to release any resources held by the instance.
<cfscript>
docService = documentService();
documents = docService.load({
    path: expandPath("./Documents/"),
    pattern: "*.txt"
});
docService.close();
</cfscript>
Call close() after the final ingest() call on a given instance, or in a finally block if the pipeline can throw. Multiple instances each require their own close() call; closing one instance does not affect others.
| Topic | Detail |
|---|---|
| close() | Releases resources held by this documentService() instance. Safe to call after load(), split(), transform(), or ingest(). |
| Scope | Each instance must be closed independently. Closing service1 does not affect service2. |
| Timing | Call after the final ingest() on that instance, or in a finally block to guarantee cleanup even if an earlier step throws. |
Note: The factory documentService() itself has no persistent state between calls. Only the instance object returned by a given call holds resources. Each new call to documentService() creates a fresh, independent instance.
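The finally-block guidance can be sketched as follows. The log file name documentPipeline is a hypothetical choice; everything else uses calls shown on this page plus standard cfscript try/catch/finally.

```cfml
<cfscript>
docService = documentService();
try {
    documents = docService.load({ path: expandPath("./Documents/"), pattern: "*.txt" });
    segments = docService.split(documents, { chunkSize: 500, chunkOverlap: 50 });
    writeOutput("Prepared " & arrayLen(segments) & " segment(s)");
} catch (any e) {
    // Record the failure, then rethrow so the caller still sees it.
    writeLog(text = "Document pipeline failed: " & e.message, type = "error", file = "documentPipeline");
    rethrow;
} finally {
    // Runs whether or not an earlier step threw.
    docService.close();
}
</cfscript>
```

Creating the instance before the try block means docService is always defined by the time finally runs.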
