Document loading

Last update: May 18, 2026
Understand documentService() loading workflows, including file and URL loading, parserType and parserConfigs, recursive filtering, lazyLoad streaming for large corpora, and async methods.
Loading reads bytes from a filesystem path or URL, picks a parser, and returns one or more document values. Parsing converts raw file content into text (and usually metadata) for splitting, embedding, and retrieval. documentService() returns an object you call like a small SDK: load, lazyLoad, split, transform, transformSegments, ingest, plus loadAsync, ingestAsync, transformAsync, transformSegmentsAsync, and close.
A typical document struct contains the following fields:
| Field | Description |
| --- | --- |
| text | String body used for chunking and vector search. |
| metadata | Struct: file name, URI, parser fields, custom tags. |
Prerequisites before using documentService():
  • Paths: On supported releases, whitelist RAG source paths in pathfilter.json when the server enforces path filtering.
  • Network: URL loads need outbound HTTP(S). SSRF protections may block internal or disallowed URLs.

Basic load from file or directory

Call documentService() with no arguments. Pass a struct to load() with at least path. The result is an array of documents. Each document exposes text and metadata.
<cfscript>
    docService = documentService();

    documents = docService.load({
        path: expandPath("./docs/")
    });

    if (isArray(documents) && arrayLen(documents) > 0) {
        doc = documents[1];
        writeOutput("text length: " & len(doc.text));
        writeDump(doc.metadata);
    }
</cfscript>
You can also pass a plain string (path to a file or directory) directly to load(), not only a struct. The following example loads from both a directory and a single file:
<cfscript>
    docService = documentService();
    docsDir = expandPath("docs");

    dirDocuments = docService.load(docsDir);
    fileDocuments = docService.load(docsDir & "/upgrade-adobe-plan.pdf");

    writeOutput(arrayLen(dirDocuments) & " from dir, " & arrayLen(fileDocuments) & " from file");
</cfscript>
Output
26 from dir, 1 from file

Load options: pattern filtering, recursive, custom metadata

Narrow files with a glob (pattern), control directory walking with recursive, and attach a metadata struct that is merged onto each loaded document. This is useful for tagging an entire batch.
<cfscript>
  docService = documentService();
  documents = docService.load({
    path:      expandPath("./docs/"),
    pattern:   "*.pdf",
    recursive: false,
    metadata:  { category: "test" }
  });
  writeDump(documents);
</cfscript>
The following load options are supported:
| Option | Description |
| --- | --- |
| path | Required. Absolute path to a file or directory, or an http/https URL. Use expandPath() to resolve relative paths. |
| pattern | Glob filter applied to filenames (e.g. *.pdf). Only matching files are loaded. Omit to load all supported formats. |
| recursive | When true, walks subdirectories. When false (default), processes only the top-level folder. |
| metadata | A struct merged into each loaded document's metadata. Useful for tagging a batch with a category, source label, or version. |
| parserType | Forces a specific parser regardless of file extension. Omit to use automatic format detection based on extension. |
| parserConfigs | A struct keyed by format (e.g. json, pdf, csv) with format-specific parser settings such as JSON path selectors, PDF page ranges, or CSV row-per-document. |
| maxFileSize | Upper bound on file size for ingestion, in bytes. Non-zero values reject or skip oversized files early. 0 means no limit. |
| includePatterns | Array of URL or path patterns to allow. Applied when loading from URLs. |
| excludePatterns | Array of URL or path patterns to block. Applied when loading from URLs. |
| parallel | When true, loads files using multiple threads. Use with maxThreads to control concurrency. |
| maxThreads | Maximum number of threads for parallel loading. Very large values may be capped (e.g. at 64). |
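As a minimal sketch, several of the options listed above can be combined in a single load() call. The option names come from this page; the specific values (a 10 MB size limit, four threads, the "manuals" category) are illustrative assumptions, not recommended settings.

```cfml
<cfscript>
    docService = documentService();

    // Option names are taken from the options listed above; values are examples only.
    documents = docService.load({
        path:        expandPath("./docs/"),
        pattern:     "*.pdf",
        recursive:   true,                 // also walk subdirectories
        maxFileSize: 10 * 1024 * 1024,     // bytes; 10 MB is an arbitrary example limit
        parallel:    true,                 // load files on multiple threads
        maxThreads:  4,                    // example concurrency cap
        metadata:    { category: "manuals" }
    });

    writeOutput(arrayLen(documents) & " document(s) loaded");
</cfscript>
```

Starting with a small maxThreads value and raising it only if loading is the bottleneck keeps memory use predictable, since load() still holds every matched document in memory.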
parserType override
parserType forces which parser runs. For example, passing "text" on a .json file returns raw file text instead of structured JSON. Omit parserType to use automatic format detection.
<cfscript>
    docService = documentService(); 
 
    rawAsText = docService.load({ 
        path: expandPath("./docs/district.json"), 
        parserType: "text" 
    }); 
    
    autoParsed = docService.load({ 
        path: expandPath("./docs/district.json") 
    }); 
    
    writeOutput("Forced text length: " & len(rawAsText[1].text)); 
    writeOutput("<br>Auto-parsed doc count: " & arrayLen(autoParsed)); 

</cfscript>
Output
Forced text length: 131605
Auto-parsed doc count: 1
parserConfigs: JSON
parserConfigs is a struct keyed by format. For JSON, common keys include:
  • jsonPath: selects nodes (e.g. $.articles[*])
  • contentKey: property whose value becomes text
  • metadataKeys: list of properties copied into metadata
<cfscript>
    docService = documentService();

    docs = docService.load({
        path: expandPath("./docs/rag.json"),
        parserConfigs: {
            json: {
                jsonPath: "$.articles[*]",
                contentKey: "body",
                metadataKeys: ["title", "author", "publishedAt"]
            }
        }
    });

    writeDump(docs);
</cfscript>
parserConfigs: PDF page range
For PDF, pages is a page range string. For example, "1" limits extraction to the first page.
<cfscript>
    docService = documentService(); 
 
    pageDocs = docService.load({ 
        path: expandPath("./docs/upgrade-adobe-plan.pdf"), 
        parserConfigs: { pdf: { pages: "1" } } 
    }); 
    
    writeOutput(len(pageDocs[1].text)); 

</cfscript>
Output
914
parserConfigs: CSV row-per-document
For CSV, rowPerDocument: true produces one document per row. Exact counts depend on the file.
<cfscript>
    docService = documentService();

    // One document per CSV row
    docs = docService.load({
        path: expandPath("./docs/age-when-completed-education.csv"),
        parserConfigs: {
            csv: {
                rowPerDocument: true
            }
        }
    });

    writeOutput("Documents: " & arrayLen(docs));
</cfscript>
Output
Documents: 165

URL loading and requestOptions

Set path to an http or https URL. Use requestOptions for timeouts, retries, and userAgent. You can also add headers (e.g. Authorization) for authenticated URLs.
<cfscript>
    docService = documentService();

    docs = docService.load({
        path: "https://www.adobe.com/robots.txt",
        requestOptions: {
            connectionTimeout: 60000,
            readTimeout: 120000,
            maxRetries: 2,
            userAgent: "ColdFusion-RAG-Test/1.0"
        }
    });

    writeOutput(arrayLen(docs) & " document(s)");
    writeOutput("<br>Preview: " & left(docs[1].text, 200));
</cfscript>
Output
1 document(s)
Preview: # The use of robots or other automated means to access the Adobe site # without the express permission of Adobe is strictly prohibited. # Notwithstanding the foregoing, Adobe may permit automated acce...
The supported requestOptions fields are:
| Field | Description |
| --- | --- |
| connectionTimeout | Milliseconds to wait when establishing a connection before timing out. |
| readTimeout | Milliseconds to wait for data to be received before timing out. |
| maxRetries | Number of retry attempts on transient failures. |
| userAgent | User-agent string sent with the HTTP request. |
Note: CF selects the parser for URL-loaded content based on the file extension inferred from the URL path, not the HTTP Content-Type header. If the URL does not end in a recognizable extension you may need to set parserType explicitly.
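The following sketch illustrates the note above: when the URL path has no recognizable extension, parserType is set explicitly. The URL is a hypothetical placeholder, the timeout values are arbitrary, and "text" is used because it is the parserType value shown earlier on this page.

```cfml
<cfscript>
    docService = documentService();

    docs = docService.load({
        // Hypothetical URL whose path has no file extension.
        path: "https://example.com/docs/readme",
        // Without an extension the parser cannot be inferred, so force one.
        parserType: "text",
        requestOptions: {
            connectionTimeout: 10000,   // example values, in milliseconds
            readTimeout: 30000,
            maxRetries: 1
        }
    });

    writeOutput(arrayLen(docs) & " document(s)");
</cfscript>
```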

Lazy loading with lazyLoad() (Streaming large corpora)

load() reads every file into memory before returning, which can exhaust the JVM heap for large archives. lazyLoad() instead returns an iterable with hasNext() and next(): only one document is held at a time, and processing can start before the whole set is loaded.
<cfscript>
    docService = documentService(); 
 
    docIterable = docService.lazyLoad({ 
        path: expandPath("./docs/"), 
        pattern: "*.pdf" 
    }); 
    
    count = 0; 
    cfloop(condition="docIterable.hasNext()") { 
        doc = docIterable.next(); 
        count++; 
        writeOutput("Doc " & count & ": " & len(doc.text) & " chars<br>"); 
    } 
    writeOutput("Total: " & count); 

</cfscript>
Output
Doc 1: 1343 chars
Doc 2: 4577 chars
Doc 3: 903 chars
Doc 4: 4130 chars
Doc 5: 5729 chars
Doc 6: 723 chars
Doc 7: 1179 chars
Doc 8: 1718 chars
Doc 9: 1051 chars
Doc 10: 1607 chars
Doc 11: 1118 chars
Doc 12: 3135 chars
Doc 13: 1431 chars
Doc 14: 1465 chars
Doc 15: 1060 chars
Doc 16: 1026 chars
Doc 17: 1501 chars
Doc 18: 1480 chars
Doc 19: 947 chars
Doc 20: 1301 chars
Doc 21: 914 chars
Total: 21
| Method | Description |
| --- | --- |
| lazyLoad(options) | Returns an iterable over all matched files. Does not load all files into memory at once. |
| hasNext() | Returns true if more documents remain in the iterator. |
| next() | Returns the next document struct (with text and metadata) and advances the iterator. |
Note: lazyLoad() accepts the same path, pattern, recursive, and metadata options as load(). Use it whenever your document corpus is too large to hold entirely in the JVM heap.
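One way to use the iterator on a large corpus is to process documents in small batches, so memory use is bounded by the batch size instead of the corpus size. This is a sketch under assumptions: processBatch() is a hypothetical helper that only sums text lengths here, and batchSize is an arbitrary value. Only lazyLoad(), hasNext(), and next() come from this page.

```cfml
<cfscript>
    // Hypothetical helper: stands in for real per-batch work
    // (for example splitting and ingesting). Here it only sums text lengths.
    function processBatch(required array batch) {
        var chars = 0;
        for (var doc in batch) {
            chars += len(doc.text);
        }
        return chars;
    }

    docService = documentService();
    docIterable = docService.lazyLoad({
        path: expandPath("./docs/"),
        pattern: "*.pdf",
        recursive: true
    });

    batchSize = 10;     // illustrative value; tune to your memory budget
    batch = [];
    totalChars = 0;

    while (docIterable.hasNext()) {
        arrayAppend(batch, docIterable.next());
        if (arrayLen(batch) == batchSize) {
            totalChars += processBatch(batch);
            batch = [];  // drop references so the finished batch can be collected
        }
    }

    // Process the final, partially filled batch, if any.
    if (arrayLen(batch) > 0) {
        totalChars += processBatch(batch);
    }

    writeOutput("Total characters: " & totalChars);
</cfscript>
```

Unlike the one-at-a-time loop, this holds up to batchSize documents in memory at once. That trade-off can be worthwhile when the downstream step is cheaper per batch than per document.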

Async loading with loadAsync()

loadAsync() returns a Future. Call get() on that Future to obtain the document array when loading finishes. This is useful when loading can overlap with other request processing.
<cfscript>
    docService = documentService();

    future = docService.loadAsync({
        path: expandPath("./docs/age-when-completed-education.csv"),
        parserConfigs: {
            csv: {
                rowPerDocument: true
            }
        }
    });

    docs = future.get();
    writeOutput("Documents: " & arrayLen(docs));
</cfscript>
Output
Documents: 165
The following documentService() methods have an async variant that returns a Future:
| Async method | Sync equivalent |
| --- | --- |
| loadAsync(options) | load(options) |
| transformAsync(docs, udf) | transform(docs, udf) |
| transformSegmentsAsync(segs, udf) | transformSegments(segs, udf) |
| ingestAsync(segs, store) | ingest(segs, store) |
In every case, call .get() on the returned Future to block until the operation completes and retrieve the result. The Future pattern fits when you want to start an operation early, do other work, and then have a clear completion point before the next step in the same request.
Note: future.get() blocks the current request thread until the operation finishes. For true background jobs that survive the HTTP response, you typically need scheduled tasks, message queues, or long-lived workers.
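The sketch below shows the overlap pattern described in this section: start two loads, do other work, then collect both results. It assumes that loads started with loadAsync() can progress while the request continues, which this page implies but does not detail. The patterns and the timing output are illustrative.

```cfml
<cfscript>
    docService = documentService();
    startedAt = getTickCount();

    // Start both loads; neither call waits for loading to finish.
    pdfFuture = docService.loadAsync({ path: expandPath("./docs/"), pattern: "*.pdf" });
    txtFuture = docService.loadAsync({ path: expandPath("./docs/"), pattern: "*.txt" });

    // Other request work can run here while the loads are in progress.

    // get() blocks the request thread until each result is ready.
    pdfDocs = pdfFuture.get();
    txtDocs = txtFuture.get();

    writeOutput(arrayLen(pdfDocs) & " PDF and " & arrayLen(txtDocs) & " text document(s) in "
        & (getTickCount() - startedAt) & " ms");
</cfscript>
```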

Multiple service instances

The service is created with documentService() and no arguments. Each call returns a new instance. Multiple instances operate independently of one another.
<cfscript>
  service1 = documentService();
  service2 = documentService();

  docs1 = service1.load({ path: expandPath("./docs/"), pattern: "*.txt" });
  docs2 = service2.load({ path: expandPath("./docs/"), pattern: "*.txt" });
</cfscript>
Use multiple instances when you need to run separate load/split/ingest pipelines concurrently, or when you want to isolate configuration between pipeline branches (e.g. different chunk sizes for different document types).
Closing the service
The service implements AutoCloseable-style cleanup. Call close() when you are done to release any resources held by the instance.
<cfscript>
  docService = documentService();
  documents = docService.load({
    path:    expandPath("./docs/"),
    pattern: "*.txt"
  });
  docService.close();
</cfscript>
Note: No factory-level config is required when creating a documentService() instance. All options are passed to each individual method call, not to the factory itself.
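To make sure close() runs even when loading throws, the call can be placed in a finally block. This is a minimal sketch using standard cfscript try/catch/finally. The log text is an illustrative assumption.

```cfml
<cfscript>
    docService = documentService();

    try {
        documents = docService.load({
            path:    expandPath("./docs/"),
            pattern: "*.txt"
        });
        writeOutput(arrayLen(documents) & " document(s)");
    } catch (any e) {
        // Record the failure, then let the error propagate to the caller.
        writeLog(text="Document load failed: " & e.message, type="error");
        rethrow;
    } finally {
        // Runs on success and on failure, so resources are always released.
        docService.close();
    }
</cfscript>
```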
