Extreme Tokenized Documents Annotator

January 29, 2025 ยท View on GitHub

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.

Contributors

Summary

This annotator retrieves the tokens generated for a set of documents. Then, it calculates, for each document, the size and the total number of characters. The number of tokens is divided by the size and by the number of characters, and the resulting values are stored in two columns ( tokens_per_doc_size and tokens_per_doc_num_chars).

The annotator transform annotates the input table with 5 columns:

  • doc_num_tokens - number of tokens for each document
  • doc_size_kbs - document size in kb
  • doc_num_chars - number of characters in the document
  • tokens_per_doc_size - ratio between number of tokens and document size
  • tokens_per_doc_num_chars - ratio between number of tokens and number of characters in document

Documents with extremely high or low number of tokens per character (or tokens per byte) are identified as extreme-tokenized documents and can be excluded in the filtering step.

Configuration and command line Options

The set of dictionary keys holding ExtremeTokenizedTransform configuration for values are as follows:

  • et_contents_column_name - specifies the name of the column holding the document text. The default is text.
  • et_arrow_path - location of the folder containing the arrow (tokenization) files.

Additionally, a set of data access-specific arguments are provided that enable the specification of the location of domain list files, so that these files could be stored in the local file system or in S3 storage, for example. The arguments are as follows (and generally match the TransformLauncher's data access arguments but with the `extreme_tokenized_' prefix).

  • et_local_config - specifies the input and output folders.
  • et_s3_config - specifies the input and output paths in s3.
  • et_s3_credentials - provides credentials to access the s3 storage.

See the Command Line options below for specifics on these.

Running

You can run the extreme_tokenized module to transform the test1.parquet file in test input data to an output directory. The output directory will contain both the new annotated test1.parquet file and the metadata.json file.

(venv) cma:extreme_tokenized$ make venv PYTHON=python3.11
(venv) cma:extreme_tokenized$ source venv/bin/activate
(venv) cma:extreme_tokenized$ python -m dpk_extreme_tokenized.runtime --et_arrow_path test-data/input/arrow --data_local_config "{ 'input_folder': 'test-data/input', 'output_folder': 'output' }"
09:41:22 INFO - Launching Extreme Tokenized Annotator transform
09:41:22 INFO - data factory et_ is using local configuration without input/output path
09:41:22 INFO - data factory et_ max_files -1, n_sample -1
09:41:22 INFO - data factory et_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
09:41:22 INFO - pipeline id pipeline_id
09:41:22 INFO - code location None
09:41:22 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
09:41:22 INFO - data factory data_ max_files -1, n_sample -1
09:41:22 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
09:41:22 INFO - orchestrator et started at 2025-01-27 09:41:22
09:41:22 INFO - Number of files is 1, source profile {'max_file_size': 0.029085159301757812, 'min_file_size': 0.029085159301757812, 'total_file_size': 0.029085159301757812}
09:41:22 INFO - Transforming table with 10 rows from file /home/cma/de/data-prep-kit/transforms/language/extreme_tokenized/test-data/input/test1.parquet
09:41:22 INFO - Completed 1 files (100.0%) in 0.001 min
09:41:22 INFO - Done processing 1 files, waiting for flush() completion.
(venv) cma:extreme_tokenized$ deactivate

Building the Docker Images

(venv) cma:extreme_tokenized$ make image 
...
(venv) cma:extreme_tokenized$ podman images
REPOSITORY                                                  TAG       IMAGE ID       CREATED              SIZE
extreme_tokenized-ray                                       latest    b933331aab92   41 seconds ago       2.77GB
quay.io/dataprep1/data-prep-kit/extreme_tokenized-ray       latest    b933331aab92   41 seconds ago       2.77GB
extreme_tokenized-python                                    latest    a5df95eba200   About a minute ago   696MB
quay.io/dataprep1/data-prep-kit/extreme_tokenized-python    latest    a5df95eba200   About a minute ago   696MB

In addition, there are some useful make targets (see conventions above) or use make help to see a list of available targets.

Launched Command Line Options

When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are available in addition to the options provided by the launcher.

options:
  -h, --help            show this help message and exit
  --et_contents_column_name ET_CONTENTS_COLUMN_NAME
                        Name of the column holding the document text
  --et_arrow_path ET_ARROW_PATH
                         Arrow folder location.
  --et_s3_cred ET_S3_CRED
                        AST string of options for s3 credentials. Only required for S3 data access.
                        access_key: access key help text
                        secret_key: secret key help text
                        url: optional s3 url
                        region: optional s3 region
                        Example: { 'access_key': 'access', 'secret_key': 'secret', 
                        'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud', 
                        'region': 'us-east-1' }
  --data_s3_cred DATA_S3_CRED
                        AST string of options for s3 credentials. Only required for S3 data access.
                        access_key: access key help text
                        secret_key: secret key help text
                        url: optional s3 url
                        region: optional s3 region
                        Example: { 'access_key': 'access', 'secret_key': 'secret', 
                        'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud', 
                        'region': 'us-east-1' }
  --data_s3_config DATA_S3_CONFIG
                        AST string containing input/output paths.
                        input_folder: Path to input folder of files to be processed
                        output_folder: Path to output folder of processed files
                        Example: { 'input_folder': 's3-path/your-input-bucket', 
                        'output_folder': 's3-path/your-output-bucket' }
  --data_local_config DATA_LOCAL_CONFIG
                        ast string containing input/output folders using local fs.
                        input_folder: Path to input folder of files to be processed
                        output_folder: Path to output folder of processed files
                        Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
  --data_max_files DATA_MAX_FILES
                        Max amount of files to process
  --data_checkpointing DATA_CHECKPOINTING
                        checkpointing flag
  --data_files_to_checkpoint DATA_FILES_TO_CHECKPOINT
                        list of file extensions to choose for checkpointing.
  --data_data_sets DATA_DATA_SETS
                        List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
  --data_files_to_use DATA_FILES_TO_USE
                        list of file extensions to choose for input.
  --data_num_samples DATA_NUM_SAMPLES
                        number of random input files to process
  --runtime_num_processors RUNTIME_NUM_PROCESSORS
                        size of multiprocessing pool
  --runtime_pipeline_id RUNTIME_PIPELINE_ID
                        pipeline id
  --runtime_job_id RUNTIME_JOB_ID
                        job id
  --runtime_code_location RUNTIME_CODE_LOCATION
                        AST string containing code location
                        github: Github repository URL.
                        commit_hash: github commit hash
                        path: Path within the repository
                        Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324', 
                        'path': 'transforms/universal/code' }