🚀 JTokkit - Java Tokenizer Kit

July 19, 2024

Welcome to JTokkit, a Java tokenizer library designed for use with OpenAI models.

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
assertEquals("hello world", enc.decode(enc.encode("hello world")));

// Or get the tokenizer corresponding to a specific OpenAI model
enc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);

💡 Quickstart

For a quick start, see our documentation.

📖 Introduction

JTokkit aims to be a fast and efficient tokenizer for natural language processing tasks that use OpenAI models. It provides an easy-to-use interface for tokenizing input text, for example to count the tokens a request to the GPT-3.5 model will require before sending it. The library grew out of the need for capabilities in the JVM ecosystem similar to those the tiktoken library provides for Python.
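
As a rough illustration of the token-counting use case, the sketch below checks whether a prompt fits into a token budget. It is deliberately independent of JTokkit so that it runs on its own: the `TokenBudget` class, its methods, and the word-based stand-in counter are hypothetical names introduced for this example only. In a real project you would presumably pass a method reference such as `enc::countTokens` in place of the stand-in; check the JavaDoc for the exact signature.

```java
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of JTokkit: checks a prompt against a token budget.
final class TokenBudget {
    private final ToIntFunction<String> tokenCounter;
    private final int maxTokens;

    TokenBudget(ToIntFunction<String> tokenCounter, int maxTokens) {
        if (maxTokens < 0) {
            throw new IllegalArgumentException("maxTokens must be >= 0");
        }
        this.tokenCounter = tokenCounter;
        this.maxTokens = maxTokens;
    }

    // Tokens left after the prompt; negative if the prompt exceeds the budget.
    int remaining(String prompt) {
        return maxTokens - tokenCounter.applyAsInt(prompt);
    }

    boolean fits(String prompt) {
        return remaining(prompt) >= 0;
    }

    // Stand-in counter for demonstration only: counts whitespace-separated words.
    // A real tokenizer will generally produce different counts.
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        TokenBudget budget = new TokenBudget(TokenBudget::countWords, 5);
        System.out.println(budget.fits("This is a sample sentence."));
        System.out.println(budget.remaining("one two three four five six"));
    }
}
```

Taking the counter as a function keeps the budget logic separate from the tokenizer, so the same check works with any encoding.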

🤖 Features

✅ Implements encoding and decoding via r50k_base, p50k_base, p50k_edit, cl100k_base and o200k_base

✅ Easy-to-use API

✅ Easy extensibility for custom encoding algorithms

✅ Zero Dependencies

✅ Supports Java 8 and above

✅ Fast and efficient performance

📊 Performance

According to the project's benchmark, JTokkit is two to three times faster than a comparable tokenizer.

[Figure: benchmark]

For details on the benchmark, see the benchmark directory.

๐Ÿ› ๏ธ Installation

You can install JTokkit by adding the following dependency to your Maven project:

<dependency>
    <groupId>com.knuddels</groupId>
    <artifactId>jtokkit</artifactId>
    <version>1.1.0</version>
</dependency>

Or alternatively using Gradle:

dependencies {
    implementation 'com.knuddels:jtokkit:1.1.0'
}

🔰 Getting Started

To use JTokkit, simply create a new EncodingRegistry and use getEncoding to retrieve the encoding you want to use. You can then use the encode and decode methods to encode and decode text.

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
IntArrayList encoded = enc.encode("This is a sample sentence.");
// encoded = [2028, 374, 264, 6205, 11914, 13]
        
String decoded = enc.decode(encoded);
// decoded = "This is a sample sentence."

// Or get the tokenizer based on the model type
Encoding secondEnc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);
// enc == secondEnc

The EncodingRegistry and Encoding classes are thread-safe and can be freely shared among components.
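
One way to read this guarantee is that an encoding can be created once and handed to every worker thread. The sketch below shows that sharing pattern using only the JDK. The `SharedCounterDemo` class and the `TokenCounter` interface are hypothetical stand-ins so that the example runs without JTokkit on the classpath; they are not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo, not part of JTokkit: one counter instance shared by all tasks.
final class SharedCounterDemo {
    // Stand-in for a thread-safe tokenizer; assumed to be safe for concurrent use.
    interface TokenCounter {
        int countTokens(String text);
    }

    // Counts tokens for each text in parallel, returning results in input order.
    static List<Integer> countAll(TokenCounter shared, List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String text : texts) {
                // Every task uses the same shared instance; no per-thread copies.
                Callable<Integer> task = () -> shared.countTokens(text);
                pending.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> result : pending) {
                counts.add(result.get());
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while counting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Counting failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With JTokkit, the shared counter would presumably be a method reference on a single `Encoding` instance, for example `enc::countTokens`.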

➰ Extending JTokkit

You may want to extend JTokkit to support custom encodings. To do so, you have two options:

  1. Implement the Encoding interface and register it with the EncodingRegistry:

     EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
     Encoding customEncoding = new CustomEncoding();
     registry.registerEncoding(customEncoding);

  2. Add new parameters for use with the existing BPE algorithm:

     EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
     GptBytePairEncodingParams params = new GptBytePairEncodingParams(
             "custom-name",
             Pattern.compile("some custom pattern"),
             encodingMap,
             specialTokenEncodingMap
     );
     registry.registerGptBytePairEncoding(params);

Afterwards you can use the custom encodings alongside the default ones and access them by using registry.getEncoding("custom-name"). See the JavaDoc for more details.
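
To make the roles of the split pattern and the encoding map in the second option more concrete, here is a toy sketch of rank-based byte pair encoding. It is not JTokkit's implementation: it works on Java chars instead of bytes, ignores special tokens, and favors clarity over speed. The `ToyBpe` class and the rank table used with it are hypothetical. The idea it illustrates is the one used by tiktoken-style tokenizers: the pattern splits the text into pieces, and within each piece the adjacent pair with the lowest rank is merged repeatedly until no known pair remains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only; hypothetical class, not part of JTokkit.
final class ToyBpe {
    private final Pattern splitPattern;
    private final Map<String, Integer> ranks;            // token text -> rank, also used as token id
    private final Map<Integer, String> inverse = new HashMap<>();

    ToyBpe(Pattern splitPattern, Map<String, Integer> ranks) {
        this.splitPattern = splitPattern;
        this.ranks = new HashMap<>(ranks);
        for (Map.Entry<String, Integer> e : ranks.entrySet()) {
            inverse.put(e.getValue(), e.getKey());
        }
    }

    List<Integer> encode(String text) {
        List<Integer> tokens = new ArrayList<>();
        Matcher matcher = splitPattern.matcher(text);
        while (matcher.find()) {                          // text not matched by the pattern is skipped
            for (String part : merge(matcher.group())) {
                Integer id = ranks.get(part);
                if (id == null) {
                    throw new IllegalArgumentException("No token for: " + part);
                }
                tokens.add(id);
            }
        }
        return tokens;
    }

    // Starts from single chars and repeatedly merges the lowest-ranked adjacent pair.
    private List<String> merge(String piece) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < piece.length(); i++) {
            parts.add(String.valueOf(piece.charAt(i)));
        }
        while (parts.size() > 1) {
            int best = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i + 1 < parts.size(); i++) {
                Integer rank = ranks.get(parts.get(i) + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    best = i;
                    bestRank = rank;
                }
            }
            if (best < 0) {
                break;                                    // no adjacent pair is a known token
            }
            String merged = parts.get(best) + parts.get(best + 1);
            parts.set(best, merged);
            parts.remove(best + 1);
        }
        return parts;
    }

    String decode(List<Integer> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Integer id : tokens) {
            String part = inverse.get(id);
            if (part == null) {
                throw new IllegalArgumentException("Unknown token id: " + id);
            }
            sb.append(part);
        }
        return sb.toString();
    }
}
```

In this sketch a lower rank means an earlier merge, so the rank table determines both which merges happen and which ids come out. A different pattern or table therefore yields a different encoding from the same merge loop.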

📄 License

JTokkit is licensed under the MIT License. See the LICENSE file for more information.