Scene-Text Oriented Referring Expression Comprehension
December 27, 2022
This repository contains the resources and code for the TMM paper, Scene-Text Oriented Referring Expression Comprehension, including the RefText dataset and the official implementation of the Scene Text Awareness Network (STAN).
Introduction
We introduce a new task called scene-text oriented referring expression comprehension (ST-REC). To tackle this task, we propose a scene text awareness network (STAN) that can bridge the gap between texts from two modalities. Additionally, to conduct quantitative evaluations, we establish a new benchmark dataset called RefText, which contains 31K manually generated referring expressions for 11K objects from multiple image sources. Examples are shown as follows:
Data Preparation
- Download the annotation files from Google Drive and place them in data/reftext
- Download the images from Google Drive and place them in ln_data/other/images/reftext
- Download the Google OCR results from Google Drive and place them in ln_data/ocr
The folder structure for the dataset is shown below.
STAN
├── data
│   └── reftext
│       ├── reftext_subtest_home.pth
│       ├── reftext_subtest_oov.pth
│       ├── reftext_subtest_other.pth
│       ├── reftext_subtest_semantic.pth
│       ├── reftext_subtest_shelf.pth
│       ├── reftext_subtest_sport.pth
│       ├── reftext_subtest_street.pth
│       ├── reftext_test.pth
│       ├── reftext_train.pth
│       └── reftext_val.pth
└── ln_data
    ├── ocr
    │   └── google_ocr_results_reftext_rank_aggr.json
    └── other
        └── images
            └── reftext (contains 4,594 images)
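Before training, it can help to confirm that the downloads landed where the folder structure expects them. The following is a minimal sketch, not part of the official code: the helper name missing_paths is hypothetical, and the script only checks that the listed files and the image folder exist, not that their contents are valid.

```python
from pathlib import Path

# Relative paths copied from the folder structure listed above.
EXPECTED_FILES = [
    "data/reftext/reftext_subtest_home.pth",
    "data/reftext/reftext_subtest_oov.pth",
    "data/reftext/reftext_subtest_other.pth",
    "data/reftext/reftext_subtest_semantic.pth",
    "data/reftext/reftext_subtest_shelf.pth",
    "data/reftext/reftext_subtest_sport.pth",
    "data/reftext/reftext_subtest_street.pth",
    "data/reftext/reftext_test.pth",
    "data/reftext/reftext_train.pth",
    "data/reftext/reftext_val.pth",
    "ln_data/ocr/google_ocr_results_reftext_rank_aggr.json",
]
IMAGE_DIR = "ln_data/other/images/reftext"


def missing_paths(root):
    """Return the expected files and folders that do not exist under root.

    Hypothetical helper for illustration; it is not part of the STAN code.
    """
    root = Path(root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    if not (root / IMAGE_DIR).is_dir():
        missing.append(IMAGE_DIR)
    return missing


if __name__ == "__main__":
    # Run from the STAN folder; prints nothing when everything is in place.
    for path in missing_paths("."):
        print("missing:", path)
```

Running the script from the STAN folder prints one line per missing file or folder.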
Installation
- Clone the repository.
git clone https://github.com/Buki2/STAN.git
- Create a virtual environment and install the dependencies.
conda create -n stan python=3.6
conda activate stan
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
pip install opencv-python-headless
pip install matplotlib
conda install -c conda-forge scipy
pip install argparse
conda install -c conda-forge pillow
pip install pytorch-pretrained-bert --ignore-installed
pip install strsimpy
pip install nltk
- Download the pretrained YOLOv3 model and place it in ./saved_models.
wget -P saved_models https://pjreddie.com/media/files/yolov3.weights
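After installation, a quick check can confirm that the packages are importable and that the YOLOv3 weights are in place. This is a sketch under stated assumptions: the helper missing_modules is hypothetical, the import names (for example cv2 for opencv-python-headless and PIL for pillow) are the usual ones for the packages installed above, and the weights path assumes the wget command above was run from the repository root.

```python
import importlib.util
from pathlib import Path

# Import names for the packages installed above (assumed mapping:
# opencv-python-headless -> cv2, pillow -> PIL,
# pytorch-pretrained-bert -> pytorch_pretrained_bert).
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "cv2",
    "matplotlib",
    "scipy",
    "PIL",
    "pytorch_pretrained_bert",
    "strsimpy",
    "nltk",
]

# Location produced by `wget -P saved_models ...` when run from the repo root.
YOLO_WEIGHTS = Path("saved_models") / "yolov3.weights"


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_modules(REQUIRED_MODULES):
        print("module not found:", name)
    if not YOLO_WEIGHTS.is_file():
        print("weights not found:", YOLO_WEIGHTS)
```

The check only tests whether each module can be located; it does not verify the installed versions.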
Training
Run the following command from the repository root.
python train.py --gpu $GPU_ID --batch_size 14
We trained the model on one GPU with a batch size of 14 for 100 epochs. See the paper for the other experimental settings.
Evaluation
Run the following command from the repository root. Use the --test flag to enable test mode and the --test_set flag to choose a test set (options: test, subtest_street, subtest_shelf, subtest_home, subtest_sport, subtest_other, subtest_semantic).
python train.py --gpu $GPU_ID --batch_size 1 --resume ./saved_models/STAN_reftext_batch14_model_best.pth.tar --test --test_set subtest_street
(When evaluating subtest_semantic, set THRES_PHI = 0.40 in grounding_model.py.)
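To evaluate on every listed test set in turn, the command above can be generated once per set. The sketch below only builds and prints the commands; the helper name build_eval_commands is hypothetical, and the flags and checkpoint path are the ones shown in the example command above.

```python
import shlex

# Test sets listed as options for --test_set above.
TEST_SETS = [
    "test",
    "subtest_street",
    "subtest_shelf",
    "subtest_home",
    "subtest_sport",
    "subtest_other",
    "subtest_semantic",  # requires THRES_PHI = 0.40 in grounding_model.py
]


def build_eval_commands(gpu_id, checkpoint):
    """Build one evaluation command (as an argument list) per test set."""
    commands = []
    for test_set in TEST_SETS:
        commands.append([
            "python", "train.py",
            "--gpu", str(gpu_id),
            "--batch_size", "1",
            "--resume", checkpoint,
            "--test",
            "--test_set", test_set,
        ])
    return commands


if __name__ == "__main__":
    ckpt = "./saved_models/STAN_reftext_batch14_model_best.pth.tar"
    for cmd in build_eval_commands(0, ckpt):
        # Print each command so it can be reviewed or pasted into a shell.
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Because subtest_semantic needs a different THRES_PHI value, it is safer to run that command separately after editing grounding_model.py.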
Citation
If you find this resource helpful, please cite our paper and share our work.
@article{tmm/Bu2022/scenetext,
author={Yuqi Bu and Liuwu Li and Jiayuan Xie and Qiong Liu and Yi Cai and Qingbao Huang and Qing Li},
title={Scene-Text Oriented Referring Expression Comprehension},
journal={{IEEE} Transactions on Multimedia},
year={2022},
}
Acknowledgement
First, we thank the authors of ReSC and BBA for sharing their code.
Second, we thank all the annotators who contributed to the RefText dataset. Special thanks to Xin Wu, Jingwei Zhang, Wenhao Fang, Junpeng Chen, Junyue Song, Yaoming Deng, Jintao Tan, Zetao Lian, Shubin Huang, Cantao Wu, Hongfei Liu, Peizhi Zhao, et al.