Chunk documents Transform

May 5, 2025

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing, and IDE setup.

Contributors

Description

This transform chunks documents. It supports multiple chunker modules (see the chunking_type parameter).

When using documents converted to JSON, the transform leverages the Docling Core HierarchicalChunker to chunk according to the document layout segmentation, i.e., respecting the original document components such as paragraphs, tables, enumerations, etc. It relies on documents converted with the Docling library in the docling2parquet transform using the option contents_type: "application/json", which provides the required JSON structure.

When using documents converted to Markdown, the transform leverages the Llama Index MarkdownNodeParser, which relies on its internal Markdown splitting logic.
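To give a feel for what layout-aware Markdown chunking means, here is a minimal sketch that starts a new chunk at every heading line. It is a simplified stand-in written with the standard library only, not the LlamaIndex MarkdownNodeParser code; the function name split_markdown_by_heading is hypothetical, and the sketch ignores cases such as headings inside code fences.

```python
import re
from typing import List

# ATX-style Markdown heading: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def split_markdown_by_heading(text: str) -> List[str]:
    """Start a new chunk at every heading line (simplified illustration)."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        # A heading closes the chunk collected so far, if there is one.
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only blank lines.
    return [c for c in chunks if c]


sample_md = "# Title\nIntro text.\n\n## Section A\nBody A.\n\n## Section B\nBody B."
md_chunks = split_markdown_by_heading(sample_md)
```

Each resulting chunk keeps its heading together with the text below it, which is the property that makes heading-based splitting preferable to cutting at arbitrary character offsets.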

Input

| input column name | data type | description |
|---|---|---|
| the one specified in content_column_name configuration | string | the content used in this transform |

Output format

The output parquet file will contain all the original columns, but the content will be replaced with the individual chunks.

Tracing the origin of the chunks

The transform makes it possible to trace the origin of each chunk through the source document ID column of the output (source_document_id by default), which is set to the value of the document_id column of the input table, if present. The actual column names can be customized with the parameters described below.
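The row-level effect described in the two sections above can be sketched in plain Python. This is an illustration under stated assumptions, not the transform's implementation: rows are modeled as dictionaries instead of a parquet table, the helper chunk_rows and the paragraph-splitting chunker are hypothetical, and only the default column names come from the configuration of this transform.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def chunk_rows(
    rows: List[Row],
    chunker: Callable[[str], List[str]],
    content_col: str = "contents",
    doc_id_col: str = "document_id",
    out_chunk_col: str = "contents",
    out_source_col: str = "source_document_id",
) -> List[Row]:
    """Emit one output row per chunk, keeping the other original columns."""
    out: List[Row] = []
    for row in rows:
        for chunk in chunker(str(row[content_col])):
            # Copy every original column except the full document content.
            new_row = {k: v for k, v in row.items() if k != content_col}
            # The content is replaced by the individual chunk.
            new_row[out_chunk_col] = chunk
            # Record where the chunk came from, if the input carries an id.
            if doc_id_col in row:
                new_row[out_source_col] = row[doc_id_col]
            out.append(new_row)
    return out


input_rows = [{"document_id": "doc-1", "filename": "a.md", "contents": "one\n\ntwo"}]
# Hypothetical chunker for the demo: split on blank lines.
output_rows = chunk_rows(input_rows, lambda text: text.split("\n\n"))
```

With this input, the single document yields two output rows, both carrying source_document_id set to "doc-1", so every chunk can be joined back to its source document.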

Configuration

The transform can be tuned with the following parameters.

| Parameter | Default | Description |
|---|---|---|
| chunking_type | dl_json | Chunking type to apply. Valid options are li_markdown for using the LlamaIndex Markdown chunking, dl_json for using the Docling JSON chunking, li_token_text for using the LlamaIndex Token Text Splitter, which chunks the text into fixed-sized windows of tokens. |
| content_column_name | contents | Name of the column containing the text to be chunked. |
| doc_id_column_name | document_id | Name of the column containing the doc_id to be propagated in the output. |
| chunk_size_tokens | 128 | Size of the chunk in tokens for the token text chunker. |
| chunk_overlap_tokens | 30 | Number of tokens overlapping between chunks for the token text chunker. |
| output_chunk_column_name | contents | Column name to store the chunks in the output table. |
| output_source_doc_id_column_name | source_document_id | Column name to store the doc_id from the input table. |
| output_jsonpath_column_name | doc_jsonpath | Column name to store the document path of the chunk in the output table. |
| output_pageno_column_name | page_number | Column name to store the page number of the chunk in the output table. |
| output_bbox_column_name | bbox | Column name to store the bbox of the chunk in the output table. |
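The interplay of chunk_size_tokens and chunk_overlap_tokens can be illustrated with a small sliding-window sketch. It shows only the windowing arithmetic, under the assumption that the text has already been split into tokens (plain strings stand in for real tokenizer output here). The LlamaIndex Token Text Splitter uses an actual tokenizer and additional splitting rules, so its chunk boundaries will differ; the function name token_windows is hypothetical.

```python
from typing import List


def token_windows(
    tokens: List[str], chunk_size: int = 128, overlap: int = 30
) -> List[List[str]]:
    """Cut a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Each new window starts this many tokens after the previous one.
    step = chunk_size - overlap
    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return windows


demo_tokens = [str(i) for i in range(300)]
demo_windows = token_windows(demo_tokens)  # defaults: 128 tokens, 30 overlapping
```

With the defaults, consecutive windows start 98 tokens apart (128 minus 30), so the last 30 tokens of one chunk reappear at the start of the next. A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks.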

Usage

Launched Command Line Options

When invoking the CLI, each parameter must be passed as --doc_chunk_<name>, where <name> is the parameter name from the Configuration section, e.g., --doc_chunk_output_chunk_column_name=myoutput.
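The naming rule above is mechanical, so a parameter dictionary can be turned into command line flags with a few lines of Python. This is a small sketch of the prefixing convention only; the helper to_cli_args is hypothetical and is not part of the transform or its launcher.

```python
from typing import Dict, List

# Prefix that the CLI expects in front of every transform parameter name.
CLI_PREFIX = "doc_chunk_"


def to_cli_args(params: Dict[str, object]) -> List[str]:
    """Render {name: value} pairs as --doc_chunk_<name>=<value> flags."""
    return [f"--{CLI_PREFIX}{name}={value}" for name, value in params.items()]


cli_args = to_cli_args({"chunking_type": "li_markdown", "chunk_size_tokens": 128})
```

The resulting list can be appended to whatever command launches the transform, which keeps parameter names in code identical to the ones listed in the Configuration section.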

Code example

See a sample notebook

Transforming data using the transform image

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.

Testing

Testing follows the testing strategy of data-processing-lib.

Currently we have:

Further Resources

Chunk documents Ray Transform

Summary

This project wraps the doc_chunk transform Python implementation with a Ray runtime.

Configuration and command line Options

The Chunk documents configuration and command line options are the same as for the base Python transform.

Launched Command Line Options

In addition to the options available to the transform as defined above, the set of launcher options is available.

Code example

See a sample Ray notebook

Transforming data using the transform image

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.