Hashformers Evaluation Report

January 8, 2026

Benchmark Overview: This report evaluates the performance of various text segmentation approaches across English hashtags, foreign hashtags, and code identifier splitting tasks.

The benchmark scripts are available in the scripts directory (scripts/benchmark_script_focused.py and scripts/benchmark_script.py).


📋 Datasets Used

Dataset                               Split Used
──────────────────────────────────    ──────────
ruanchaves/boun                       test
ruanchaves/stan_small                 test
ruanchaves/stan_large                 test
ruanchaves/dev_stanford               validation
ruanchaves/test_stanford              test
ruanchaves/snap                       train
ruanchaves/nru_hse                    test
ruanchaves/hashset_distant            test
ruanchaves/hashset_distant_sampled    test
ruanchaves/loyola                     test
ruanchaves/lynx                       test
ruanchaves/jhotdraw                   test
ruanchaves/binkley                    test
ruanchaves/bt11                       test

Note

ruanchaves/hashset_manual was excluded due to a loading error.
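The splits listed above could be loaded along the following lines, with load failures recorded instead of aborting the run (as happened with ruanchaves/hashset_manual). This is a minimal sketch, not taken from the benchmark scripts: it assumes the Hugging Face `datasets` library and its `load_dataset(name, split=...)` call, and the names `DATASETS`, `load_all`, and the `loader` parameter are hypothetical. Depending on the installed `datasets` version, some repositories may need extra options to load.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# (dataset name, split) pairs exactly as listed in the table above.
DATASETS: List[Tuple[str, str]] = [
    ("ruanchaves/boun", "test"),
    ("ruanchaves/stan_small", "test"),
    ("ruanchaves/stan_large", "test"),
    ("ruanchaves/dev_stanford", "validation"),
    ("ruanchaves/test_stanford", "test"),
    ("ruanchaves/snap", "train"),
    ("ruanchaves/nru_hse", "test"),
    ("ruanchaves/hashset_distant", "test"),
    ("ruanchaves/hashset_distant_sampled", "test"),
    ("ruanchaves/loyola", "test"),
    ("ruanchaves/lynx", "test"),
    ("ruanchaves/jhotdraw", "test"),
    ("ruanchaves/binkley", "test"),
    ("ruanchaves/bt11", "test"),
]


def load_all(
    pairs: List[Tuple[str, str]],
    loader: Optional[Callable[..., Any]] = None,
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Load every (name, split) pair; return (loaded, failed)."""
    if loader is None:
        # Assumption: the Hugging Face `datasets` package is installed.
        # Imported lazily so the helper can be exercised without it.
        from datasets import load_dataset
        loader = load_dataset

    loaded: Dict[str, Any] = {}
    failed: Dict[str, str] = {}
    for name, split in pairs:
        try:
            loaded[name] = loader(name, split=split)
        except Exception as exc:  # any failure means "skip and record"
            failed[name] = repr(exc)
    return loaded, failed
```

Calling `load_all(DATASETS)` returns the datasets that loaded plus a dictionary of error messages for those that did not, which makes exclusions like the one in the note easy to report.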


โฑ๏ธ Global Latency Performance

ModelMean (ms)Std (ms)Min (ms)Max (ms)Throughput (items/sec)
WordNinja0.190.140.022.055,357
SymSpell0.280.210.041.493,580
Ekphrasis0.690.970.119.171,449
Hashformers-DistilGPT2264.11320.6313.563,124.963.79
LLM-Qwen2 (0.5B)300.63166.81134.102,444.613.33
Hashformers-GPT2362.97424.7522.043,644.442.76

Tip

Heuristic-based approaches (WordNinja, SymSpell, Ekphrasis) are roughly 400x to 1,900x faster than transformer-based methods by mean latency (about three orders of magnitude), making them ideal for high-throughput scenarios.
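The latency columns can be reproduced from per-item timings along these lines. This is a sketch under stated assumptions rather than the benchmark's actual code: the function names are hypothetical, the use of the population standard deviation is a guess, and throughput is taken as 1000 divided by the mean latency in milliseconds, which is consistent with the table (for example, 1000 / 0.69 ≈ 1,449 for Ekphrasis).

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def time_per_item(segment: Callable[[str], object], items: Iterable[str]) -> List[float]:
    """Call `segment` once per item and return each call's latency in ms."""
    latencies_ms: List[float] = []
    for item in items:
        start = time.perf_counter()
        segment(item)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize_latencies(latencies_ms: List[float]) -> Dict[str, float]:
    """Aggregate per-item latencies into the columns used in the table."""
    mean_ms = statistics.fmean(latencies_ms)
    return {
        "mean_ms": mean_ms,
        # Assumption: population std; the report does not say which was used.
        "std_ms": statistics.pstdev(latencies_ms),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
        # Items per second for sequential, one-at-a-time processing.
        "throughput_per_sec": 1000.0 / mean_ms,
    }
```

Any callable that takes one string can be passed as `segment`, so the same harness applies to heuristic splitters and transformer-based segmenters alike; batched inference would need a different measurement.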


📈 Accuracy Results by Task

English Hashtags

Rank  Model                     Accuracy  Precision  Recall  F1
────  ───────────────────────   ────────  ─────────  ──────  ──────
🥇    Hashformers-DistilGPT2    76.67%    85.34%     78.88%  81.99%
🥈    Hashformers-GPT2          75.83%    85.27%     76.10%  80.42%
🥉    Ekphrasis                 72.50%    77.39%     80.48%  78.91%
4     WordNinja                 71.67%    71.48%     84.86%  77.60%
5     SymSpell                  69.17%    69.31%     80.08%  74.31%
6     LLM-Qwen2 (0.5B)          68.33%    75.45%     67.33%  71.16%
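The report does not state how the four metrics are computed. The sketch below shows one common convention as an assumption: accuracy as exact match of the whole segmentation, and precision, recall, and F1 over the character positions where a word boundary is placed. The function names are hypothetical, and the benchmark scripts may define the metrics differently.

```python
from typing import Dict, List, Set


def boundaries(segmentation: str) -> Set[int]:
    """Offsets in the unsegmented string where a space was inserted."""
    positions: Set[int] = set()
    offset = 0
    for word in segmentation.split()[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def evaluate(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """Exact-match accuracy plus boundary-level precision, recall, and F1."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    exact = true_pos = n_pred = n_gold = 0
    for g, p in zip(gold, predicted):
        exact += int(g.split() == p.split())
        g_b, p_b = boundaries(g), boundaries(p)
        true_pos += len(g_b & p_b)  # boundaries placed correctly
        n_pred += len(p_b)          # boundaries the model placed
        n_gold += len(g_b)          # boundaries in the reference
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": exact / len(gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Whatever the underlying unit, the reported F1 values are consistent with the harmonic mean of precision and recall: 85.34% and 78.88% give about 81.98%, against the 81.99% reported for Hashformers-DistilGPT2 (the small gap is attributable to rounding of the inputs).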

Foreign (Non-English) Hashtags

Rank  Model                     Accuracy  Precision  Recall  F1
────  ───────────────────────   ────────  ─────────  ──────  ──────
🥇    Ekphrasis                 83.33%    92.19%     84.29%  88.06%
🥈    Hashformers-GPT2          78.33%    86.03%     83.57%  84.78%
🥉    Hashformers-DistilGPT2    76.67%    83.94%     82.14%  83.03%
4     LLM-Qwen2 (0.5B)          60.00%    75.45%     59.29%  66.40%
5     SymSpell                  48.33%    46.67%     55.00%  50.49%
6     WordNinja                 35.00%    48.73%     55.00%  51.68%

Code Identifier Splitting

Rank  Model                     Accuracy  Precision  Recall  F1
────  ───────────────────────   ────────  ─────────  ──────  ──────
🥇    Ekphrasis                 66.00%    84.84%     80.76%  82.75%
🥈    WordNinja                 60.00%    74.57%     74.57%  74.57%
🥉    SymSpell                  60.00%    69.93%     68.73%  69.32%
4     LLM-Qwen2 (0.5B)          49.00%    70.47%     46.74%  56.20%
5     Hashformers-GPT2          43.00%    63.74%     37.46%  47.19%
6     Hashformers-DistilGPT2    38.00%    58.62%     35.05%  43.87%

🇷🇺 Russian Language Benchmark (NRU HSE Dataset)

This supplementary benchmark evaluates Russian hashtag segmentation using a language-specific model.

Rank  Model                     Accuracy  Precision  Recall  F1
────  ───────────────────────   ────────  ─────────  ──────  ──────
🥇    Hashformers-RuGPT3Small   80.00%    83.87%     81.25%  82.54%
🥈    Hashformers-GPT2          75.00%    75.00%     75.00%  75.00%
🥉    Hashformers-DistilGPT2    70.00%    74.19%     71.88%  73.02%
4     Ekphrasis                 50.00%    56.00%     43.75%  49.12%
5     SymSpell                  45.00%    45.00%     28.13%  34.62%
5     LLM-Qwen2 (0.5B)          45.00%    45.00%     28.13%  34.62%
7     WordNinja                 0.00%     10.00%     6.25%   7.69%

Important

Using a language-specific backbone (RuGPT3Small) improves Russian segmentation accuracy by 5 to 10 percentage points over the English-pretrained Hashformers backbones (GPT2 and DistilGPT2).


📊 Summary: Overall Performance Comparison

                           English    Foreign    Code          Russian
                           Hashtags   Hashtags   Identifiers   Hashtags
                           ────────   ────────   ───────────   ────────
Hashformers-DistilGPT2     🥇         🥉         6th           🥉
Hashformers-GPT2           🥈         🥈         5th           🥈
Hashformers-RuGPT3Small    n/a        n/a        n/a           🥇
Ekphrasis                  🥉         🥇         🥇            4th
WordNinja                  4th        6th        🥈            7th
SymSpell                   5th        5th        🥉            5th
LLM-Qwen2 (0.5B)           6th        4th        4th           5th
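The ranks in the tables are consistent with sorting by accuracy, breaking ties by F1, and letting models with identical scores share a rank (hence two 5th places and then a 7th on the Russian benchmark). The report does not state the rule, so the sketch below is an inferred reconstruction with a hypothetical function name.

```python
from typing import Dict, List, Optional, Tuple


def rank_models(scores: Dict[str, Tuple[float, float]]) -> List[Tuple[int, str]]:
    """Rank models given model -> (accuracy, f1); identical scores share a rank."""
    # Tuples compare element-wise: accuracy first, then F1 as the tie-breaker.
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranked: List[Tuple[int, str]] = []
    previous_key: Optional[Tuple[float, float]] = None
    for position, (model, key) in enumerate(ordered, start=1):
        # Competition ranking: a tie reuses the rank, the next model skips ahead.
        rank = ranked[-1][0] if key == previous_key else position
        ranked.append((rank, model))
        previous_key = key
    return ranked


# Accuracy and F1 from the Russian benchmark table.
russian = {
    "Hashformers-RuGPT3Small": (80.00, 82.54),
    "Hashformers-GPT2": (75.00, 75.00),
    "Hashformers-DistilGPT2": (70.00, 73.02),
    "Ekphrasis": (50.00, 49.12),
    "SymSpell": (45.00, 34.62),
    "LLM-Qwen2 (0.5B)": (45.00, 34.62),
    "WordNinja": (0.00, 7.69),
}

for rank, model in rank_models(russian):
    print(rank, model)
```

The same rule accounts for the code identifier table, where WordNinja and SymSpell both reach 60% accuracy and WordNinja is placed ahead on F1.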

🎯 Conclusions

Key Findings

  1. Hashformers excels at English hashtag segmentation, achieving the highest accuracy (76.67%) and F1-score (81.99%) with DistilGPT2. This represents a +8.3 percentage point improvement over the comparable-scale LLM-Qwen2 (0.5B).

  2. Heuristic methods dominate code identifier splitting. Ekphrasis leads with 66% accuracy, while Hashformers models underperform on programmatic naming conventions (camelCase, snake_case). This suggests the pretraining corpus of GPT-2 models lacks sufficient code-style text.

  3. Language-specific backbones matter. For Russian hashtags, Hashformers-RuGPT3Small (80% accuracy) substantially outperforms English-pretrained alternatives, demonstrating the importance of matching the LM to the target language.

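  4. Latency vs. accuracy trade-off is significant. Heuristic splitters are roughly 400x to 1,900x faster by mean latency. On English hashtags they trail the best Hashformers model by about 4 to 8 percentage points of accuracy; on foreign hashtags Ekphrasis leads outright, while SymSpell and WordNinja fall roughly 30 to 43 points behind the best Hashformers model. For batch processing millions of items, heuristics may be acceptable; for quality-critical applications, Hashformers is preferred.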

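  5. Hashformers outperforms similarly-sized LLMs on hashtags. Compared with LLM-Qwen2 (0.5B parameters), both Hashformers variants deliver higher accuracy on the English and foreign hashtag tasks (by about 7.5 to 18 percentage points), which suggests that the specialized architecture is more effective than a general-purpose LLM at comparable scale. The exception is code identifier splitting, where LLM-Qwen2 scores higher than both Hashformers models.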
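To make the latency trade-off in finding 4 concrete, the mean latencies from the latency table can be turned into rough wall-clock estimates. This is a back-of-the-envelope sketch with hypothetical function names; it assumes sequential, one-item-at-a-time processing on the benchmark hardware and ignores batching and parallelism.

```python
# Mean per-item latency in milliseconds, copied from the latency table.
MEAN_LATENCY_MS = {
    "WordNinja": 0.19,
    "SymSpell": 0.28,
    "Ekphrasis": 0.69,
    "Hashformers-DistilGPT2": 264.11,
    "LLM-Qwen2 (0.5B)": 300.63,
    "Hashformers-GPT2": 362.97,
}


def estimated_hours(model: str, n_items: int) -> float:
    """Estimated hours to process n_items sequentially at the mean latency."""
    total_seconds = MEAN_LATENCY_MS[model] * n_items / 1000.0
    return total_seconds / 3600.0


def speedup(slow_model: str, fast_model: str) -> float:
    """How many times faster fast_model is than slow_model, by mean latency."""
    return MEAN_LATENCY_MS[slow_model] / MEAN_LATENCY_MS[fast_model]


for name in MEAN_LATENCY_MS:
    print(f"{name}: {estimated_hours(name, 1_000_000):.2f} h per million items")
```

Under these assumptions, one million items take about 73 hours with Hashformers-DistilGPT2 and a little over 3 minutes with WordNinja, which is the scale of difference behind the recommendation to consider heuristics for very large batches.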


When to Use Hashformers?

The comparison below outlines when to use Hashformers versus other approaches, such as heuristic-based splitters (e.g., SymSpell, WordNinja) or large LLMs.

Approach: Heuristic-based
Examples: SymSpell, Ekphrasis, WordNinja, Spiral (Ronin)
Recommended when:
• Scalability is a primary requirement.
• The segmentation domain works well with a standard pre-built vocabulary.
Notes: Fast and efficient, but requires a pre-built vocabulary, which can be limiting for niche domains or languages.

Approach: Hashformers
Examples: Hashformers
Recommended when:
• Scalability is needed.
• You are working in a domain or language where a language model is readily available, but compiling a manual vocabulary is too burdensome.
Notes: In this benchmark, Hashformers outperforms an LLM of similar scale (Qwen2, 0.5B parameters) on the hashtag tasks.

Approach: Large LLMs
Examples: OpenAI, local LLM deployment
Recommended when:
• Cost, latency, and scalability are not concerns.
• You are segmenting a low volume of items.
Notes: To gain an accuracy advantage over Hashformers, you generally need to use significantly larger LLMs.

Appendix: Models Evaluated

Category     Model                     Description
───────────  ───────────────────────   ───────────────────────────────────────────────────────────────────────
Heuristic    WordNinja                 Statistical word segmentation based on Wikipedia unigram frequencies
Heuristic    SymSpell                  Fast spelling correction and word segmentation using Symmetric Delete
Heuristic    Ekphrasis                 Text preprocessing tool optimized for social media text
Heuristic    Spiral-Ronin              Identifier splitting for source code analysis
Hashformers  Hashformers-GPT2          GPT-2 backbone with specialized hashtag segmentation head
Hashformers  Hashformers-DistilGPT2    Distilled GPT-2 for faster inference with minimal accuracy loss
Hashformers  Hashformers-RuGPT3Small   Russian-language GPT-3 backbone for Cyrillic text: ai-forever/rugpt3small_based_on_gpt2
LLM          LLM-Qwen2 (0.5B)          General-purpose 0.5B parameter language model: Qwen/Qwen2-0.5B-Instruct

Report generated: January 2026