NodeJS Language Detection Benchmark :rocket:

May 17, 2023

  • This kind of benchmark is not perfect and the percentages can vary over time, but it gives a good idea of overall performance
  • Languages evaluated in this benchmark:
    • Asia: jpn, cmn, kor, hin
    • Europe: fra, spa, por, ita, nld, eng, deu, fin, rus
    • Middle East: tur, heb, ara
  • This page and graphs are auto-generated from the code

Libraries

Here is the list of libraries in this benchmark:

| Library | Script | Languages | Properly Identified | Improperly Identified | Not Identified | Avg Execution Time | Disk Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TinyLD Heavy | yarn bench:tinyld-heavy | 64 | 99.249% | 0.7478% | 0.0032% | 0.096ms | 2.0MB |
| TinyLD | yarn bench:tinyld | 64 | 98.5231% | 1.3712% | 0.1057% | 0.1191ms | 580KB |
| TinyLD Light | yarn bench:tinyld-light | 24 | 97.8778% | 1.9842% | 0.138% | 0.0947ms | 68KB |
| langdetect | yarn bench:langdetect | 53 | 95.675% | 4.325% | 0% | 0.3647ms | 1.8MB |
| node-cld | yarn bench:cld | 160 | 92.3654% | 1.6213% | 6.0133% | 0.0711ms | > 10MB |
| franc | yarn bench:franc | 187 | 74.2577% | 25.7423% | 0% | 0.2242ms | 267KB |
| franc-min | yarn bench:franc-min | 82 | 70.3891% | 23.1888% | 6.422% | 0.084ms | 119KB |
| franc-all | yarn bench:franc-all | 403 | 66.7081% | 33.2919% | 0% | 0.4763ms | 509KB |
| languagedetect | yarn bench:languagedetect | 52 | 65.2835% | 11.2808% | 23.4357% | 0.1896ms | 240KB |
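
To make the three outcome columns concrete, here is a minimal sketch of how such percentages could be computed. It is not the benchmark's actual code: the names `Detector`, `Sample`, `scoreDetector` and `toyDetect` are made up for this example, and treating an empty string as "not identified" is an assumption.

```typescript
// Assumed detector shape: returns a language code, or "" when it cannot decide.
// This is an illustration, not the API of any library listed above.
type Detector = (text: string) => string;

interface Sample {
  text: string;
  lang: string; // expected language code, e.g. "spa"
}

interface Score {
  properly: number; // % detected as the expected language
  improperly: number; // % detected as another language
  notIdentified: number; // % with no answer at all
}

function scoreDetector(detect: Detector, samples: Sample[]): Score {
  let ok = 0;
  let wrong = 0;
  let none = 0;
  for (const sample of samples) {
    const guess = detect(sample.text);
    if (guess === "") none++;
    else if (guess === sample.lang) ok++;
    else wrong++;
  }
  const pct = (n: number): number =>
    samples.length === 0 ? 0 : (n * 100) / samples.length;
  return { properly: pct(ok), improperly: pct(wrong), notIdentified: pct(none) };
}

// Hypothetical toy detector: script ranges for Japanese/Korean,
// and a naive "contains the word 'de'" rule for Spanish.
const toyDetect: Detector = (text) => {
  if (/[\u3040-\u30ff]/.test(text)) return "jpn";
  if (/[\uac00-\ud7af]/.test(text)) return "kor";
  if (/\bde\b/.test(text)) return "spa";
  return "";
};

const samples: Sample[] = [
  { text: "これはペンです", lang: "jpn" },
  { text: "안녕하세요", lang: "kor" },
  { text: "la casa de mi madre", lang: "spa" },
  { text: "a casa de pedra", lang: "por" }, // toy rule wrongly answers "spa"
  { text: "hello world", lang: "eng" }, // toy rule gives no answer
];

const score = scoreDetector(toyDetect, samples);
// score is { properly: 60, improperly: 20, notIdentified: 20 }
```

The three values always add up to 100%, which is also true of every row in the table.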

Global Accuracy

[Figure: Benchmark]

We see two groups of libraries:

  • tinyld, langdetect and cld over 90% accuracy
  • franc and languagedetect under 75% accuracy

Per Language

[Figure: Language]

We see big differences between languages:

  • Japanese and Korean are at almost 100% for every library (they use many unique characters)
  • Spanish and Portuguese are very similar, which causes more false positives and a higher error rate
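
The "unique characters" point can be illustrated with a small sketch. It assumes nothing about how the benchmarked libraries work internally; `scriptHint` is a hypothetical helper that only looks at Unicode blocks (Hiragana and Katakana, U+3040 to U+30FF, and Hangul syllables, U+AC00 to U+D7AF).

```typescript
// Hypothetical helper: guess a language from script-exclusive characters only.
// Returns "" when the text gives no script-level evidence.
function scriptHint(text: string): string {
  let kana = 0;
  let hangul = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0x3040 && code <= 0x30ff) kana++; // Hiragana + Katakana
    else if (code >= 0xac00 && code <= 0xd7af) hangul++; // Hangul syllables
  }
  if (kana === 0 && hangul === 0) return "";
  return kana >= hangul ? "jpn" : "kor";
}

const japaneseHint = scriptHint("こんにちは"); // "jpn"
const koreanHint = scriptHint("안녕하세요"); // "kor"
const spanishHint = scriptHint("gracias"); // ""
const portugueseHint = scriptHint("obrigado"); // ""
```

Spanish and Portuguese both use the Latin alphabet, so a check like this tells them nothing apart; a detector has to fall back on word or n-gram statistics, which is where the confusion comes from.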

Accuracy By Text length

Most libraries rely on statistical analysis, so the longer the input text, the better the detection. This is why their documentation often includes a note like this:

Make sure to pass it big documents to get reliable results.

Let's see if this statement is true, and how those libraries behave for different input sizes (from short to long).

[Figure: Size]

So the previous quote is right: above 512 characters, all the libraries become accurate enough.

But for a ~95% accuracy threshold:

  • tinyld (green) reaches it around 24 characters
  • langdetect (cyan) and cld (orange) reach it around 48 characters
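
One way to obtain an accuracy-by-length curve like this is to truncate every sample to a given length and score the detector again. The sketch below does exactly that; `accuracyByLength`, `firstLengthReaching` and the word-based toy detector are hypothetical names written for this example, not the benchmark's own code.

```typescript
// Accuracy (in %) of a detector when every sample is cut to `length` characters.
function accuracyByLength(
  detect: (text: string) => string,
  samples: { text: string; lang: string }[],
  lengths: number[]
): { length: number; accuracy: number }[] {
  return lengths.map((length) => {
    let ok = 0;
    for (const sample of samples) {
      if (detect(sample.text.slice(0, length)) === sample.lang) ok++;
    }
    const accuracy = samples.length === 0 ? 0 : (ok * 100) / samples.length;
    return { length, accuracy };
  });
}

// Smallest tested length whose accuracy reaches the threshold, if any.
function firstLengthReaching(
  curve: { length: number; accuracy: number }[],
  threshold: number
): number | undefined {
  for (const point of curve) {
    if (point.accuracy >= threshold) return point.length;
  }
  return undefined;
}

// Hypothetical toy detector that needs to see a marker word before answering.
const wordDetect = (text: string): string => {
  if (/\bthe\b/.test(text)) return "eng";
  if (/\bque\b/.test(text)) return "spa";
  return "";
};

const lengthSamples = [
  { text: "I think the weather is nice", lang: "eng" },
  { text: "creo que el tiempo es bueno", lang: "spa" },
];

const curve = accuracyByLength(wordDetect, lengthSamples, [4, 8, 16]);
// accuracies are 0, 50 and 100 for lengths 4, 8 and 16
const reached = firstLengthReaching(curve, 95); // 16
```

With real detectors and a large corpus, the same loop gives the kind of threshold quoted above (for example, the length at which a library first reaches ~95%).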

Execution Time

[Figure: Size]

Here we can notice a few things about performance:

  • langdetect (cyan) and franc (pink) seem to slow down at a similar rate
  • tinyld (green) slows down too, but at a much flatter rate
  • cld (orange) is definitely the fastest and doesn't show any apparent slowdown

But we've seen previously that some of those libraries need more than 256 characters to be accurate. This means they start to slow down just as they start to give decent results.
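
For reference, an average execution time per call can be estimated with a loop like the one below. This is only a sketch under stated assumptions: `averageTimeMs` is a hypothetical helper, the clock is injectable so the measurement can be checked deterministically, and `Date.now` (millisecond resolution) is used by default, so many rounds are needed for a meaningful figure.

```typescript
// Average time per detect() call, in milliseconds.
// `now` is any function returning a timestamp in ms (Date.now by default).
function averageTimeMs(
  detect: (text: string) => string,
  texts: string[],
  rounds: number = 100,
  now: () => number = Date.now
): number {
  if (texts.length === 0 || rounds <= 0) return 0;
  const start = now();
  for (let r = 0; r < rounds; r++) {
    for (const text of texts) detect(text);
  }
  return (now() - start) / (rounds * texts.length);
}

// Hypothetical stand-in for a real detector.
const dummyDetect = (text: string): string => (text.length > 0 ? "eng" : "");

// Measure short and long inputs separately to see how time grows with length.
const shortAvg = averageTimeMs(dummyDetect, ["hello world"], 1000);
const longAvg = averageTimeMs(dummyDetect, ["hello world ".repeat(50)], 1000);
```

Running the same measurement on inputs of increasing length is what reveals whether a library slows down as text gets longer.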


Conclusion

- By platform :computer:

  • For NodeJS: TinyLD, langdetect or node-cld (fast and accurate)
  • For Browser: TinyLD Light or franc-min (small, decent accuracy; franc is less accurate but supports more languages)

- By usage :speech_balloon:

  • Short text (chatbot, keywords, database, ...): TinyLD or langdetect
  • Long text (documents, webpage): node-cld or TinyLD
  • franc-all is the worst in terms of accuracy, which is no surprise because it tries to detect 400+ languages with only 3-grams. It is a technical demo that shows off big numbers but is useless in practice: even a language like English barely reaches a ~45% detection rate.
  • languagedetect is light but just not accurate enough
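
As a reminder of what a 3-gram is, here is a minimal, generic sketch of character trigram counting. It is not taken from franc or any other library in this benchmark; `trigramCounts` and its padding and lowercasing choices are assumptions made for illustration.

```typescript
// Count overlapping 3-character sequences in a text.
// The text is lowercased, whitespace is collapsed, and one space is added
// on each side so that word starts and ends produce their own trigrams.
function trigramCounts(text: string): Record<string, number> {
  const padded = " " + text.toLowerCase().replace(/\s+/g, " ").trim() + " ";
  const counts: Record<string, number> = {};
  for (let i = 0; i + 3 <= padded.length; i++) {
    const gram = padded.slice(i, i + 3);
    counts[gram] = (counts[gram] || 0) + 1;
  }
  return counts;
}

const grams = trigramCounts("the the");
// grams is { " th": 2, "the": 2, "he ": 2, "e t": 1 }
```

A short input produces only a handful of trigrams, so the more candidate languages there are, the less evidence is available to separate them.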

Last word :raising_hand:

Thanks for reading this article. These metrics are really helpful for the development of tinyld: they are used during development to see the impact of every modification and feature.

If you want to contribute or see another library in this benchmark, open an issue.