Document Quality Transform

January 22, 2025

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE setup.

Contributors

Description

This transform calculates and annotates several metrics that are useful for assessing the quality of a document. The document quality transform operates on text documents only.

Input

| input column name | data type | description |
|---|---|---|
| the one specified in doc_content_column configuration | string | text whose quality will be calculated by this transform |

Output columns annotated by this transform

| output column name | data type | description | supported language |
|---|---|---|---|
| docq_total_words | int | the total number of words | ALL |
| docq_mean_word_len | int | the mean of words' lengths | ALL |
| docq_symbol_to_word_ratio | float | the symbol-to-word ratio (Reference for symbols like emojis: https://textacy.readthedocs.io/en/0.11.0/api_reference/preprocessing.html, currently used symbol: #, ...) | ALL |
| docq_sentence_count | int | the number of sentences | ALL |
| docq_curly_bracket_ratio | float | the ratio between the number of occurrences of { or } over the text length | ALL |
| docq_lorem_ipsum_ratio | float | the ratio between the number of occurrences of lorem ipsum over the text length. Lorem ipsum, or lipsum as it is sometimes known, is dummy text used in laying out print, graphic or web designs. | ALL |
| docq_contain_bad_word | bool | whether the text contains bad words | ALL |
| docq_bullet_point_ratio | float | the ratio of lines starting with a bullet point | ALL |
| docq_ellipsis_line_ratio | float | the ratio of lines ending with an ellipsis | ALL |
| docq_alphabet_word_ratio | float | the ratio of words having at least one alphabetic character | ALL |
| docq_contain_common_en_words | bool | whether the given text contains common English words like the, and, to, that, of, with, be, and have | ALL |
| docq_avg_ja_sentence_len | int | average sentence length for an input text, inspired by an OSS HojiChar. | ja |
| docq_first_ja_alphabet_pos | int | first position of occurrence of Japanese alphabets (i.e., Hiragana or Katakana) | ja |

More detailed background on some of these columns can be found in DeepMind's Gopher paper.
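To make a few of the column descriptions concrete, here is a minimal sketch of how some of the ratios could be computed from the descriptions alone. It is not the transform's actual implementation: the function name `sketch_metrics`, the bullet and ellipsis markers, the choice to ignore blank lines, whitespace tokenization, and returning -1 when no kana is found are all assumptions made for illustration.

```python
import re

# Assumed markers; the transform may recognize a different set.
BULLET_MARKERS = ("-", "*", "•")
ELLIPSIS_MARKERS = ("...", "…")
# Hiragana and Katakana letters (an approximation of "Japanese alphabets").
JA_KANA = re.compile(r"[\u3041-\u3096\u30A1-\u30FA]")


def _ratio(count, total):
    """Divide safely, returning 0.0 for empty input."""
    return count / total if total else 0.0


def sketch_metrics(text):
    """Compute a handful of docq-style metrics for one document (illustrative only)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]  # non-blank lines
    words = text.split()  # whitespace tokenization (assumption)
    kana = JA_KANA.search(text)
    return {
        "total_words": len(words),
        "curly_bracket_ratio": _ratio(text.count("{") + text.count("}"), len(text)),
        "bullet_point_ratio": _ratio(sum(ln.startswith(BULLET_MARKERS) for ln in lines), len(lines)),
        "ellipsis_line_ratio": _ratio(sum(ln.endswith(ELLIPSIS_MARKERS) for ln in lines), len(lines)),
        "alphabet_word_ratio": _ratio(sum(any(c.isalpha() for c in w) for w in words), len(words)),
        "first_ja_alphabet_pos": kana.start() if kana else -1,  # -1 is an assumed sentinel
    }


if __name__ == "__main__":
    print(sketch_metrics("- item one {x}\nand so on...\n42 is a number\n"))
```

Documents with many bullet or ellipsis lines, or few alphabetic words, tend to be lists, truncated snippets, or non-prose content, which is why these ratios are useful as filtering signals.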

Configuration

The dictionary keys that hold the DocQualityTransform configuration values are as follows:

  • text_lang - specifies language used in the text content. By default, "en" is used.
  • doc_content_column - specifies column name that contains document text. By default, "contents" is used.
  • bad_word_filepath - specifies the path to the bad word file: a local path (file or directory) that points to the bad word file. This parameter is optional; leave it unset if you do not need bad word detection.

Example

{
    text_lang_key: "en",
    doc_content_column_key: "contents",
    bad_word_filepath_key: os.path.join(basedir, "ldnoobw", "en"),
}
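Since bad_word_filepath may point to either a file or a directory, the following sketch shows one way such a path could be loaded and used for a docq_contain_bad_word style check. The helper names `load_bad_words` and `contains_bad_word` are hypothetical, and the sketch assumes one entry per line, UTF-8 files, and single-word matching on whitespace-delimited text; the transform's own loading and matching logic may differ.

```python
import os
import re


def load_bad_words(path):
    """Collect lower-cased entries from a file, or from every file in a directory."""
    if os.path.isdir(path):
        candidates = [os.path.join(path, name) for name in sorted(os.listdir(path))]
        files = [f for f in candidates if os.path.isfile(f)]
    else:
        files = [path]
    words = set()
    for file_path in files:
        with open(file_path, encoding="utf-8") as fh:
            # Assumption: one bad word per line, blank lines ignored.
            words.update(line.strip().lower() for line in fh if line.strip())
    return words


def contains_bad_word(text, bad_words):
    """Return True if any word token of the text is in the bad word set."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return not tokens.isdisjoint(bad_words)
```

Token-set matching like this does not catch multi-word entries and is not suitable for languages without whitespace word boundaries, such as Japanese, which would need a different tokenization step.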

Usage

Launched Command Line Options

The following command line arguments are available:

  --docq_text_lang DOCQ_TEXT_LANG   language used in the text content. By default, "en" is used.
  --docq_doc_content_column DOCQ_DOC_CONTENT_COLUMN   column name that contains document text. By default, "contents" is used.
  --docq_bad_word_filepath DOCQ_BAD_WORD_FILEPATH   path to the bad word file: a local path (file or directory) that points to the bad word file. This parameter is optional; leave it unset if you do not need bad word detection.

These correspond to the configuration keys described above.
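The correspondence between the docq_-prefixed flags and the configuration keys can be illustrated with a small argparse sketch. This is not the transform's own argument handling: `build_parser` and `to_config` are hypothetical helpers, and the `None` default for the bad word path is an assumption (the source only says the parameter is optional).

```python
import argparse

CLI_PREFIX = "docq_"  # prefix that appears on the flags listed above


def build_parser():
    """Define the three documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="doc quality options (sketch)")
    parser.add_argument("--docq_text_lang", default="en")
    parser.add_argument("--docq_doc_content_column", default="contents")
    parser.add_argument("--docq_bad_word_filepath", default=None)  # assumed default
    return parser


def to_config(args):
    """Strip the prefix so keys read text_lang, doc_content_column, bad_word_filepath."""
    return {
        name[len(CLI_PREFIX):]: value
        for name, value in vars(args).items()
        if name.startswith(CLI_PREFIX)
    }


config = to_config(build_parser().parse_args(["--docq_text_lang", "ja"]))
print(config)
```

Prefixing the flags keeps the transform's options from colliding with launcher options that share the same command line.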

Running the samples

To run the samples, use the following make target:

  • run-cli-sample - runs dpk_doc_quality/transform.py using command line args

This target activates the virtual environment and sets up any configuration needed. Use the -n option of make to see the details of what is done to run the sample.

For example,

make run-cli-sample
...

Then

ls output

to see the results of the transform.

Code example

notebook

Testing

Testing follows the testing strategy of data-processing-lib.

Currently we have:

Further Resource

Consideration

Troubleshooting guide

For M1 Mac users: if you see the following error during the make command, error: command '/usr/bin/clang' failed with exit code 1, you should follow this step

Document Quality Ray Transform


Configuration and command line Options

Document Quality configuration and command line options are the same as for the base python transform.

Running

Launched Command Line Options

When running the transform with the Ray launcher (i.e., TransformLauncher), the set of launcher options is available in addition to the transform options defined here.

Running the samples

To run the samples, use the following make target:

  • run-ray-cli-sample - runs dpk_doc_quality/ray/transform.py using command line args

This target activates the virtual environment and sets up any configuration needed. Use the -n option of make to see the details of what is done to run the sample.

For example,

make run-ray-cli-sample
...

Then

ls output

to see the results of the transform.

Code example (Ray)

notebook

Transforming data using the transform image

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.