Document loading

Last update: May 18, 2026

Understand documentService() loading workflows, including file and URL loading, parserType and parserConfigs, recursive filtering, lazyLoad() streaming for large corpora, and async methods.

Loading reads bytes from a filesystem path or URL, picks a parser, and returns one or more document values. Parsing converts raw file content into text (and usually metadata) for splitting, embedding, and retrieval. documentService() returns a service object that you call like a small SDK. Its methods are load, lazyLoad, split, transform, transformSegments, and ingest, the async variants loadAsync, ingestAsync, transformAsync, and transformSegmentsAsync, and close.

A typical document struct contains the following fields:

Field	Description
text	String body used for chunking and vector search.
metadata	Struct: file name, URI, parser fields, custom tags.

Prerequisites before using documentService():

Paths: On supported releases, whitelist RAG source paths in pathfilter.json when the server enforces path filtering.
Network: URL loads need outbound HTTP(S). SSRF protections may block internal or disallowed URLs.

Basic load from file or directory

Call documentService() with no arguments. Pass a struct to load() with at least path. The result is an array of documents. Each document exposes text and metadata.

<cfscript>

    docService = documentService();

    // Load every supported file in the top-level ./docs/ folder.
    documents = docService.load({
        path: expandPath("./docs/")
    });

    if (isArray(documents) && arrayLen(documents) > 0) {
        doc = documents[1];   // CFML arrays are 1-based
        writeOutput("text length: " & len(doc.text));
        writeDump(doc.metadata);
    }

</cfscript>

You can also pass a plain string (path to a file or directory) directly to load(), not only a struct. The following example loads from both a directory and a single file:

<cfscript>

    docService = documentService();

    docsDir = expandPath("docs");

    // A plain string path works for a directory or a single file.
    dirDocuments  = docService.load(docsDir);
    fileDocuments = docService.load(docsDir & "/upgrade-adobe-plan.pdf");

    writeOutput(arrayLen(dirDocuments) & " from dir, " & arrayLen(fileDocuments) & " from file");

</cfscript>

Output

26 from dir, 1 from file

Load options: pattern filtering, recursive, custom metadata

Narrow files with a glob (pattern), control directory walking with recursive, and attach a metadata struct that is merged onto each loaded document. This is useful for tagging an entire batch.

<cfscript>

    docService = documentService();

    documents = docService.load({
        path:      expandPath("./docs/"),
        pattern:   "*.pdf",                // only PDF files
        recursive: false,                  // top-level folder only
        metadata:  { category: "test" }    // merged into every document
    });

    writeDump(documents);

</cfscript>

The following load options are supported:

Option	Description
path	Required. Path to a file or directory, or an http or https URL. Use expandPath() to resolve relative filesystem paths.
pattern	Glob filter applied to filenames (e.g. *.pdf). Only matching files are loaded. Omit to load all supported formats.
recursive	When true, walks subdirectories. When false (default), processes only the top-level folder.
metadata	A struct merged into each loaded document's metadata. Useful for tagging a batch with a category, source label, or version.
parserType	Forces a specific parser regardless of file extension. Omit to use automatic format detection based on extension.
parserConfigs	A struct keyed by format (e.g. json, pdf, csv) with format-specific parser settings such as JSON path selectors, PDF page ranges, or CSV row-per-document.
maxFileSize	Maximum file size in bytes. With a non-zero value, larger files are rejected or skipped before parsing. 0 means no limit.
includePatterns	Array of URL or path patterns to allow. Applied when loading from URLs.
excludePatterns	Array of URL or path patterns to block. Applied when loading from URLs.
parallel	When true, loads files using multiple threads. Use with maxThreads to control concurrency.
maxThreads	Maximum number of threads for parallel loading. Very large values may be capped (e.g. at 64).
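
Several of these options can be combined in a single call. The following is a minimal sketch, assuming a ./docs/ tree with nested folders. The 5 MB limit, the thread count, and the metadata tag are illustrative values, not recommended defaults.

<cfscript>
    docService = documentService();

    documents = docService.load({
        path:        expandPath("./docs/"),
        pattern:     "*.pdf",
        recursive:   true,               // walk subdirectories as well
        maxFileSize: 5 * 1024 * 1024,    // 5 MB expressed in bytes (illustrative)
        parallel:    true,               // load files on multiple threads
        maxThreads:  4,                  // illustrative concurrency cap
        metadata:    { source: "handbook" }
    });

    writeOutput(arrayLen(documents) & " document(s) loaded");
</cfscript>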

parserType override

parserType forces which parser runs. For example, passing "text" on a .json file returns raw file text instead of structured JSON. Omit parserType to use automatic format detection.

<cfscript>

    docService = documentService();

    // Force the plain-text parser on a .json file.
    rawAsText = docService.load({
        path:       expandPath("./docs/district.json"),
        parserType: "text"
    });

    // Let the service pick the parser from the file extension.
    autoParsed = docService.load({
        path: expandPath("./docs/district.json")
    });

    writeOutput("Forced text length: " & len(rawAsText[1].text));
    writeOutput("<br>Auto-parsed doc count: " & arrayLen(autoParsed));

</cfscript>

Output

Forced text length: 131605

Auto-parsed doc count: 1

parserConfigs: JSON

parserConfigs is a struct keyed by format. For JSON, common keys include:

jsonPath: selects nodes (e.g. $.articles[*])
contentKey: property whose value becomes text
metadataKeys: list of properties copied into metadata
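
To make these three keys concrete, the sketch below writes a small sample file and loads it with the same selectors. The file name, the sample articles, and the assumption that metadataKeys entries appear under the same names in doc.metadata are all illustrative and not confirmed by this page, so the metadata lookup is guarded.

<cfscript>
    // Hypothetical sample data; quoted keys keep their case when serialized.
    sample = {
        "articles": [
            { "title": "First post",  "author": "Ana", "publishedAt": "January 2026",  "body": "Text of the first article." },
            { "title": "Second post", "author": "Raj", "publishedAt": "February 2026", "body": "Text of the second article." }
        ]
    };
    samplePath = expandPath("./docs/sample-articles.json");   // hypothetical file name
    fileWrite(samplePath, serializeJSON(sample));

    docService = documentService();
    docs = docService.load({
        path: samplePath,
        parserConfigs: {
            json: {
                jsonPath:     "$.articles[*]",                     // select each article node
                contentKey:   "body",                              // value becomes doc.text
                metadataKeys: ["title", "author", "publishedAt"]   // copied into doc.metadata
            }
        }
    });

    for (doc in docs) {
        // Guarded lookup: the exact metadata layout is an assumption.
        title = structKeyExists(doc.metadata, "title") ? doc.metadata.title : "(no title)";
        writeOutput(encodeForHTML(title) & ": " & len(doc.text) & " chars<br>");
    }
</cfscript>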

<cfscript>

    docService = documentService();

    docs = docService.load({
        path: expandPath("./docs/rag.json"),
        parserConfigs: {
            json: {
                jsonPath:     "$.articles[*]",
                contentKey:   "body",
                metadataKeys: ["title", "author", "publishedAt"]
            }
        }
    });

    writeDump(docs);

</cfscript>

parserConfigs: PDF page range

For PDF, pages is a page range string. For example, "1" limits extraction to the first page.

<cfscript>

    docService = documentService();

    // Extract text from the first page only.
    pageDocs = docService.load({
        path:          expandPath("./docs/upgrade-adobe-plan.pdf"),
        parserConfigs: { pdf: { pages: "1" } }
    });

    writeOutput(len(pageDocs[1].text));

</cfscript>

Output

914

parserConfigs: CSV row-per-document

For CSV, rowPerDocument: true produces one document per row. Exact counts depend on the file.

<cfscript>

    docService = documentService();

    // One document per CSV row instead of one document for the whole file.
    docs = docService.load({
        path: expandPath("./docs/age-when-completed-education.csv"),
        parserConfigs: {
            csv: {
                rowPerDocument: true
            }
        }
    });

    writeOutput("Documents: " & arrayLen(docs));

</cfscript>

Output

Documents: 165

URL loading and requestOptions

Set path to an http or https URL. Use requestOptions for timeouts, retries, and userAgent. You can also add headers (e.g. Authorization) for authenticated URLs.

<cfscript>

    docService = documentService();

    docs = docService.load({
        path: "https://www.adobe.com/robots.txt",
        requestOptions: {
            connectionTimeout: 60000,    // 60 s to establish the connection
            readTimeout:       120000,   // 120 s to receive data
            maxRetries:        2,
            userAgent:         "ColdFusion-RAG-Test/1.0"
        }
    });

    writeOutput(arrayLen(docs) & " document(s)");
    writeOutput("<br>Preview: " & left(docs[1].text, 200));

</cfscript>

Output

1 document(s)

Preview: # The use of robots or other automated means to access the Adobe site # without the express permission of Adobe is strictly prohibited. # Notwithstanding the foregoing, Adobe may permit automated acce...

The supported requestOptions fields are:

Field	Description
connectionTimeout	Milliseconds to wait when establishing a connection before timing out.
readTimeout	Milliseconds to wait for data to be received before timing out.
maxRetries	Number of retry attempts on transient failures.
userAgent	User-agent string sent with the HTTP request.
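
For authenticated URLs, request headers can be supplied as mentioned above. The following is a minimal sketch, assuming headers is a struct of header names to values nested under requestOptions. That key name and shape are not shown on this page, so verify them against your release. The URL and the token are placeholders.

<cfscript>
    docService = documentService();

    apiToken = "REPLACE_WITH_TOKEN";   // placeholder; read from secure config in practice

    docs = docService.load({
        path: "https://example.com/private/report.txt",   // hypothetical URL
        requestOptions: {
            connectionTimeout: 10000,   // 10 s to connect
            readTimeout:       30000,   // 30 s to read the response
            maxRetries:        2,
            // Assumption: headers is a struct under requestOptions.
            headers: { "Authorization": "Bearer " & apiToken }
        }
    });

    writeOutput(arrayLen(docs) & " document(s)");
</cfscript>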

Note: ColdFusion selects the parser for URL-loaded content based on the file extension inferred from the URL path, not the HTTP Content-Type header. If the URL does not end in a recognizable extension, you may need to set parserType explicitly.
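
As a sketch of that case, the call below loads a URL whose path has no file extension and forces the "text" parser, the one parserType value shown earlier on this page. The URL is hypothetical.

<cfscript>
    docService = documentService();

    docs = docService.load({
        path:       "https://example.com/api/changelog",   // hypothetical URL, no extension
        parserType: "text"                                 // force the plain-text parser
    });

    writeOutput(arrayLen(docs) & " document(s)");
</cfscript>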

Lazy loading with lazyLoad() (Streaming large corpora)

load() reads every file into memory before returning, which can exhaust the JVM heap for large archives. lazyLoad() instead returns an iterator with hasNext() and next(): only one document is held at a time, and processing can start before the whole set is loaded.

<cfscript>

    docService = documentService();

    docIterable = docService.lazyLoad({
        path:    expandPath("./docs/"),
        pattern: "*.pdf"
    });

    count = 0;

    // Pull one document at a time until the iterator is exhausted.
    cfloop(condition="docIterable.hasNext()") {
        doc = docIterable.next();
        count++;
        writeOutput("Doc " & count & ": " & len(doc.text) & " chars<br>");
    }

    writeOutput("Total: " & count);

</cfscript>

Output

Doc 1: 1343 chars
Doc 2: 4577 chars
Doc 3: 903 chars
Doc 4: 4130 chars
Doc 5: 5729 chars
Doc 6: 723 chars
Doc 7: 1179 chars
Doc 8: 1718 chars
Doc 9: 1051 chars
Doc 10: 1607 chars
Doc 11: 1118 chars
Doc 12: 3135 chars
Doc 13: 1431 chars
Doc 14: 1465 chars
Doc 15: 1060 chars
Doc 16: 1026 chars
Doc 17: 1501 chars
Doc 18: 1480 chars
Doc 19: 947 chars
Doc 20: 1301 chars
Doc 21: 914 chars
Total: 21

Method	Description
lazyLoad(options)	Returns an iterable over all matched files. Does not load all files into memory at once.
hasNext()	Returns true if more documents remain in the iterator.
next()	Returns the next document struct (with text and metadata) and advances the iterator.

Note: lazyLoad() accepts the same path, pattern, recursive, and metadata options as load(). Use it whenever your document corpus is too large to hold entirely in the JVM heap.
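
One way to keep memory bounded while still working in groups is to collect documents from the iterator into small batches. The following is a minimal sketch using only hasNext() and next(). The batch size is illustrative, and processBatch() is a hypothetical stand-in for whatever split, embed, or ingest work follows.

<cfscript>
    // Hypothetical stand-in for downstream work on one batch.
    function processBatch(required array docs) {
        writeOutput("Batch of " & arrayLen(docs) & " document(s)<br>");
    }

    docService = documentService();

    docIterator = docService.lazyLoad({
        path:      expandPath("./docs/"),
        pattern:   "*.pdf",
        recursive: true
    });

    batchSize = 10;   // illustrative batch size
    batch     = [];
    batches   = 0;

    while (docIterator.hasNext()) {
        arrayAppend(batch, docIterator.next());

        if (arrayLen(batch) == batchSize) {
            processBatch(batch);
            batches++;
            batch = [];   // start a new array so the old batch can be collected
        }
    }

    // Flush the final, partially filled batch.
    if (arrayLen(batch) > 0) {
        processBatch(batch);
        batches++;
    }

    writeOutput("Processed " & batches & " batch(es)");
</cfscript>

At most batchSize documents are referenced at any time, so peak memory depends on the batch size rather than on the size of the corpus.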

Async loading with loadAsync()

loadAsync() returns a Future. Call get() on that Future to obtain the document array when loading finishes. This is useful when loading can overlap with other request processing.

<cfscript>

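    docService = documentService();

    // loadAsync() returns a Future instead of the document array.
    future = docService.loadAsync({
        path: expandPath("./docs/age-when-completed-education.csv"),
        parserConfigs: {
            csv: {
                rowPerDocument: true
            }
        }
    });

    // get() waits for loading to finish and returns the documents.
    docs = future.get();

    writeOutput("Documents: " & arrayLen(docs));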

</cfscript>

Output

Documents: 165

The following documentService() methods have async variants that return a Future:

Async method	Sync equivalent
loadAsync(options)	load(options)
transformAsync(docs, udf)	transform(docs, udf)
transformSegmentsAsync(segs, udf)	transformSegments(segs, udf)
ingestAsync(segs, store)	ingest(segs, store)

In every case, call .get() on the returned Future to block until the operation completes and retrieve the result. The Future pattern suits cases where you want to start an operation, do other work in the same request, and then wait at a clear completion point before the next step.

Note: future.get() blocks the current request thread until the operation finishes. For true background jobs that survive the HTTP response, you typically need scheduled tasks, message queues, or long-lived workers.
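
The overlap described above can be sketched as follows: start the load, do unrelated work, and call get() only where the documents are needed. The pageTitle assignment is a hypothetical placeholder for that other work.

<cfscript>
    docService = documentService();
    startedAt  = getTickCount();

    // Start loading; the call returns a Future rather than the documents.
    future = docService.loadAsync({
        path:    expandPath("./docs/"),
        pattern: "*.pdf"
    });

    // Other request work that does not need the documents can run here.
    pageTitle = "Ingestion report";

    // Block only at the point where the documents are required.
    docs = future.get();

    writeOutput(pageTitle & ": " & arrayLen(docs) & " document(s) in "
        & (getTickCount() - startedAt) & " ms");
</cfscript>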

Multiple service instances

The service is created with documentService() and no arguments. Each call returns a new instance. Multiple instances operate independently of one another.

<cfscript>

    // Each call to documentService() returns an independent instance.
    service1 = documentService();
    service2 = documentService();

    docs1 = service1.load({ path: expandPath("./docs/"), pattern: "*.txt" });
    docs2 = service2.load({ path: expandPath("./docs/"), pattern: "*.txt" });

</cfscript>

Use multiple instances when you need to run separate load/split/ingest pipelines concurrently, or when you want to isolate configuration between pipeline branches (e.g. different chunk sizes for different document types).
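
A minimal sketch of two concurrent branches, one per instance, is shown below. The branch metadata tag and the variable names are illustrative. Whether a single instance can also run several async loads at once is not stated on this page, so the sketch uses a separate instance per branch.

<cfscript>
    pdfService = documentService();
    txtService = documentService();

    // Each instance starts its own load; the metadata tag marks the branch.
    pdfFuture = pdfService.loadAsync({
        path: expandPath("./docs/"), pattern: "*.pdf", metadata: { branch: "pdf" }
    });
    txtFuture = txtService.loadAsync({
        path: expandPath("./docs/"), pattern: "*.txt", metadata: { branch: "txt" }
    });

    // Wait for both branches before continuing.
    pdfDocs = pdfFuture.get();
    txtDocs = txtFuture.get();

    writeOutput(arrayLen(pdfDocs) & " PDF, " & arrayLen(txtDocs) & " text");

    pdfService.close();
    txtService.close();
</cfscript>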

Closing the service

The service implements AutoCloseable-style cleanup. Call close() when you are done to release any resources held by the instance.

<cfscript>

    docService = documentService();

    documents = docService.load({
        path:    expandPath("./docs/"),
        pattern: "*.txt"
    });

    // Release resources held by this instance.
    docService.close();

</cfscript>

Note: No factory-level config is required when creating a documentService() instance. All options are passed to each individual method call, not to the factory itself.
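
If load() throws, a close() call placed after it never runs. One defensive pattern, sketched here under the assumption that load failures surface as ordinary catchable exceptions, is to put close() in a finally block.

<cfscript>
    docService = documentService();

    try {
        documents = docService.load({
            path:    expandPath("./docs/"),
            pattern: "*.txt"
        });
        writeOutput(arrayLen(documents) & " document(s) loaded");
    } catch (any e) {
        // Record the failure, then let the error propagate to the caller.
        writeLog(text="Document load failed: " & e.message, type="error");
        rethrow;
    } finally {
        // Runs whether load() succeeds or throws.
        docService.close();
    }
</cfscript>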
