UniTok V4

June 25, 2025 · View on GitHub

The documentation for v3, old version, can be found here in Chinese.

Overview

PyPI version

Welcome to the UniTok v4! This library provides a unified preprocessing solution for machine learning datasets, handling diverse data types like text, categorical features, and numerical values.

Please refer to UniTok Handbook for more detailed information.

Road from V3 to V4

Changes and Comparisons

After UniTok 4.4.0, Job is renamed to Feature.

FeatureUniTok v3UniTok v4Comments
UniTok classSolely for tokenizationManages the entire preprocessing lifecycle
UniDep classData loading and combiningRemovedV4 combines the functionalities of UniTok and UniDep into a single class.
Column classColumn name is for both the original and tokenized datasetsN/AV4 introduces a Feature class.
Feature classN/ADefines how a specific column should be tokenized
Tokenizer classAmbiguous return type definitionreturn_list parameter must be of type bool
Tokenizer classOnly supports BertTokenizer for text processingSupports all Tokenizers in the transformers libraryNew TransformersTokenizer class
analyse methodSupportedNot supported Currently
Meta classOnly for human-friendly displayingManager for Feature, Tokenizer, and Vocab
unitok commandVisualization in the terminalMore colorful and detailed output
Vocab class (unitok >= 4.1.0)Save and load vocabulary using text filesSave and load vocabulary using pickle filesAvoids issues with special characters in text files

How to Migrate the Processed Data

unidep-upgrade-v4 <path>

Installation

Requirements

  • Python 3.7 or later
  • Dependencies:
    • pandas
    • transformers
    • tqdm
    • rich

Install UniTok via pip

pip install unitok

Core Concepts

States

  • initialized: The initial state after creating a UniTok instance.
  • tokenized: Achieved after applying tokenization to the dataset.
  • organized: Reached after combining multiple datasets via operations like union.

Components

  • UniTok: Manages the dataset preprocessing lifecycle.
  • Feature: Defines how a specific column should be tokenized.
  • Tokenizer: Encodes data using various methods (e.g., BERT, splitting by delimiters).
  • Vocabulary: Stores and manages unique tokens across datasets.

Primary Key (key_feature)

The key_feature acts as the primary key for operations like getitem and union, ensuring consistency across datasets.

Usage Guide

Loading Data

Load datasets using pandas:

import pandas as pd

item = pd.read_csv(
    filepath_or_buffer='news-sample.tsv',
    sep='\t',
    names=['nid', 'category', 'subcategory', 'title', 'abstract', 'url', 'title_entities', 'abstract_entities'],
    usecols=['nid', 'category', 'subcategory', 'title', 'abstract'],
)
item['abstract'] = item['abstract'].fillna('')  # Handle missing values

user = pd.read_csv(
    filepath_or_buffer='user-sample.tsv',
    sep='\t',
    names=['uid', 'history'],
)

interaction = pd.read_csv(
    filepath_or_buffer='interaction-sample.tsv',
    sep='\t',
    names=['uid', 'nid', 'click'],
)

Defining and Adding Features

Define tokenization features for different columns:

from unitok import UniTok, Vocab
from unitok.tokenizer import BertTokenizer, TransformersTokenizer, EntityTokenizer, SplitTokenizer, DigitTokenizer

item_vocab = Vocab(name='nid')  # will be used across datasets
user_vocab = Vocab(name='uid')  # will be used across datasets

with UniTok() as item_ut:
    bert_tokenizer = BertTokenizer(vocab='bert')
    llama_tokenizer = TransformersTokenizer(vocab='llama', key='huggyllama/llama-7b')

    item_ut.add_feature(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid', key=True)
    item_ut.add_feature(tokenizer=bert_tokenizer, column='title', name='title@bert', truncate=20)
    item_ut.add_feature(tokenizer=llama_tokenizer, column='title', name='title@llama', truncate=20)
    item_ut.add_feature(tokenizer=bert_tokenizer, column='abstract', name='abstract@bert', truncate=50)
    item_ut.add_feature(tokenizer=llama_tokenizer, column='abstract', name='abstract@llama', truncate=50)
    item_ut.add_feature(tokenizer=EntityTokenizer(vocab='category'), column='category')
    item_ut.add_feature(tokenizer=EntityTokenizer(vocab='subcategory'), column='subcategory')

with UniTok() as user_ut:
    user_ut.add_feature(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid', key=True)
    user_ut.add_feature(tokenizer=SplitTokenizer(vocab=item_vocab, sep=','), column='history', truncate=30)

with UniTok() as inter_ut:
    inter_ut.add_index_feature(name='index')
    inter_ut.add_feature(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid')
    inter_ut.add_feature(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid')
    inter_ut.add_feature(tokenizer=DigitTokenizer(vocab='click', vocab_size=2), column='click')

Tokenizing Data

Tokenize and save the processed data:

item_ut.tokenize(item).save('sample-ut/item')
item_vocab.deny_edit()  # will raise an error if new items are detected in the user or interaction datasets
user_ut.tokenize(user).save('sample-ut/user')
inter_ut.tokenize(interaction).save('sample-ut/interaction')

Combining Datasets

Combine datasets using union:

# => {'category': 0, 'nid': 0, 'title@bert': [1996, 9639, 3035, 3870, ...], 'title@llama': [450, 1771, 4167, 10470, ...], 'abstract@bert': [4497, 1996, 14960, 2015, ...], 'abstract@llama': [1383, 459, 278, 451, ...], 'subcategory': 0}
print(item_ut[0])

# => {'uid': 0, 'history': [0, 1, 2]}
print(user_ut[0])

# => {'uid': 0, 'nid': 7, 'index': 0, 'click': 1}
print(inter_ut[0])

with inter_ut:
    inter_ut.union(user_ut)

    # => {'index': 0, 'click': 1, 'uid': 0, 'nid': 7, 'history': [0, 1, 2]}
    print(inter_ut[0])

Glance at the Terminal

unitok sample-ut/item
UniTok (4beta)
Sample Size: 10
ID Column: nid

                                                                                 Features                                                                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Tokenizer                            ┃     Tokenizer ID      ┃ Column Mapping                               ┃ Vocab                             ┃    Max Length     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ TransformersTokenizer                │      auto_2VN5Ko      │ abstract -> abstract@llama                   │ llama (size=32024)                │        50         │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ EntityTokenizer                      │      auto_C0b9Du      │ subcategory -> subcategory                   │ subcategory (size=8)              │        N/A        │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ TransformersTokenizer                │      auto_2VN5Ko      │ title -> title@llama                         │ llama (size=32024)                │        20         │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ EntityTokenizer                      │      auto_4WQYxo      │ category -> category                         │ category (size=4)                 │        N/A        │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ BertTokenizer                        │      auto_Y9tADT      │ abstract -> abstract@bert                    │ bert (size=30522)                 │        46         │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ BertTokenizer                        │      auto_Y9tADT      │ title -> title@bert                          │ bert (size=30522)                 │        16         │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ EntityTokenizer                      │      auto_qwQALc      │ nid -> nid                                   │ nid (size=10)                     │        N/A        │
└──────────────────────────────────────┴───────────────────────┴──────────────────────────────────────────────┴───────────────────────────────────┴───────────────────┘

Contributing

We welcome contributions to UniTok! We appreciate your feedback, bug reports, and pull requests.

Our TODO list includes:

  • More detailed documentation
  • More examples and tutorials
  • More SQL-like operations
  • Analysis and visualization tools

License

This project is licensed under the MIT License - see the LICENSE file for details.