Source Datasets of REBU-Syn

January 31, 2024

  • We collected labeled data from 16 publicly available real datasets to construct REBU-Syn. The details of these datasets are listed in the following table.

| Data file name | Size | Link | License |
| --- | --- | --- | --- |
| OpenVINO | 1.5M | https://storage.googleapis.com/openimages/web/index.html | Apache License 2.0 |
| TextOCR | 0.8M | https://textvqa.org/textocr/dataset/ | CC BY 4.0 |
| ICDAR2013 | 843 | https://rrc.cvc.uab.es/?ch=2 | Unknown |
| ICDAR2015 | 4,467 | https://rrc.cvc.uab.es/?ch=4 | CC BY 4.0 |
| IIIT5K | 2,000 | https://cvit.iiit.ac.in/research/projects/cvit-projects/the-iiit-5k-word-dataset | MIT License |
| SVT | 257 | http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset | Unknown |
| Total-Text | 12,251 | https://github.com/cs-chan/Total-Text-Dataset | BSD-3 license |
| CTW1500 | 3,170 | https://github.com/Yuliang-Liu/Curve-Text-Detector | Unknown |
| Uber | 127,850 | https://s3-us-west-2.amazonaws.com/uber-common-public/ubertext/index.html | Unknown |
| RCTW17 | 10,245 | https://rctw.vlrlab.net/dataset | Unknown |
| COCOv2.0 | 72,950 | https://vision.cornell.edu/se3/coco-text-2/ | CC BY 4.0 |
| LSVT | 8,164 | https://rrc.cvc.uab.es/?ch=16 | Unknown |
| MLT19 | 55,112 | https://rrc.cvc.uab.es/?ch=15 | CC BY-NC 4.0 |
| ReCTS | 26,040 | https://rrc.cvc.uab.es/?ch=12 | Unknown |
| ArT | 31,966 | https://rrc.cvc.uab.es/?ch=14 | Unknown |
| Union14M_L_lmdb_format | 3M | https://github.com/Mountchicken/Union14M/tree/main?tab=readme-ov-file#34-download | MIT License |

  • We collected labeled data from 4 publicly available synthetic datasets to construct REBU-Syn. The details of these datasets are listed in the following table.

| Data file name | Size | Link | License |
| --- | --- | --- | --- |
| MJ | 6M | https://www.robots.ox.ac.uk/~vgg/data/text/ | Unknown |
| ST | 9M | https://www.robots.ox.ac.uk/~vgg/data/scenetext/ | Unknown |
| Curved SynthText | 1.7M | https://github.com/Jyouhou/ICDAR2019-ArT-Recognition-Alchemy | Apache License 2.0 |
| SynthAdd | 1.2M | https://github.com/wangpengnorman/SAR-Strong-Baseline-for-Text-Recognition | Unknown |

Datasets

Download the training datasets from the following links:

  1. LMDB archives for MJ, ST, IIIT5k, SVT, SVTP, IC13, IC15, CUTE80, ArT, RCTW17, ReCTS, LSVT, MLT19, COCO-Text, and Uber-Text.
  2. LMDB archives for TextOCR and OpenVINO.
  3. LMDB archives for Union14M_L_lmdb_format.
  4. CTW1500
  5. Total-Text
  6. SynthAdd
  7. Curved SynthText

Then, organize the data as follows:

REBU-Syn
├── train
│   └── synth_and_real
│       ├── Curved_SynthText
│       │   ├── syntext1
│       │   └── syntext2
│       ├── SynthAdd
│       │   ├── data.mdb
│       │   └── lock.mdb
│       ├── Union14M_L_lmdb_format
│       │   ├── difficult
│       │   ├── hard
│       │   ├── hell
│       │   ├── medium
│       │   └── simple
│       ├── benchmark
│       │   ├── ICDAR2013
│       │   ├── ICDAR2015
│       │   ├── IIIT5K
│       │   └── SVT
│       ├── extra
│       │   ├── CTW1500
│       │   └── total_text
│       ├── real_data
│       │   ├── ArT
│       │   ├── COCOv2.0
│       │   ├── LSVT
│       │   ├── MLT19
│       │   ├── OpenVINO
│       │   ├── RCTW17
│       │   ├── ReCTS
│       │   ├── TextOCR
│       │   └── Uber
│       └── mj_st
│           ├── data.mdb
│           └── lock.mdb
├── val
│   ├── CUTE80
│   ├── IC13_1015
│   ├── IC15_1811
│   ├── IIIT5k
│   ├── SVT
│   └── SVTP
└── test
    ├── CUTE80
    ├── IC13_1015
    ├── IC13_857
    ├── IC15_1811
    ├── IIIT5k
    ├── SVT
    └── SVTP
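
Before training, it can help to check that the folders are in place. The following is a minimal sketch, not part of the REBU-Syn code base: the helper name `missing_dirs` and the `EXPECTED_LAYOUT` mapping are hypothetical, and it assumes that `train`, `val`, and `test` sit directly under a single root directory. It only checks that the directories listed in the tree exist; it does not inspect the LMDB contents.

```python
from pathlib import Path

# Expected directories, copied from the tree above.
# Paths are relative to the dataset root (assumed to be the REBU-Syn folder).
EXPECTED_LAYOUT = {
    "train/synth_and_real/Curved_SynthText": ["syntext1", "syntext2"],
    "train/synth_and_real/SynthAdd": [],
    "train/synth_and_real/Union14M_L_lmdb_format": [
        "difficult", "hard", "hell", "medium", "simple",
    ],
    "train/synth_and_real/benchmark": ["ICDAR2013", "ICDAR2015", "IIIT5K", "SVT"],
    "train/synth_and_real/extra": ["CTW1500", "total_text"],
    "train/synth_and_real/real_data": [
        "ArT", "COCOv2.0", "LSVT", "MLT19", "OpenVINO",
        "RCTW17", "ReCTS", "TextOCR", "Uber",
    ],
    "train/synth_and_real/mj_st": [],
    "val": ["CUTE80", "IC13_1015", "IC15_1811", "IIIT5k", "SVT", "SVTP"],
    "test": [
        "CUTE80", "IC13_1015", "IC13_857", "IC15_1811", "IIIT5k", "SVT", "SVTP",
    ],
}


def missing_dirs(root):
    """Return the expected directories (relative paths) not found under `root`."""
    root = Path(root)
    missing = []
    for parent, children in EXPECTED_LAYOUT.items():
        # When no sub-directories are listed, check the parent directory itself.
        candidates = [f"{parent}/{child}" for child in children] or [parent]
        for rel in candidates:
            if not (root / rel).is_dir():
                missing.append(rel)
    return missing
```

Calling `missing_dirs("/path/to/REBU-Syn")` returns an empty list when every listed directory is present, and otherwise the relative paths that still need to be downloaded or moved.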

Data Generation

We generated MJST+ (60M) using TextRecognitionDataGenerator and SynthText. For the specific generation procedure, please refer to GenData.md.

Acknowledgement

We sincerely thank the creators of all 20 datasets used in REBU-Syn.

  • PARSeq: the dataset we built upon. Thanks for their wonderful work!
  • Union14M: provides a challenging STR training dataset. Be sure to check out this great open-source work if you are not already familiar with it!