Scene Graph Generation from Natural Language Supervision
June 5, 2022
This repository contains the PyTorch code for our paper "Learning to Generate Scene Graph from Natural Language Supervision", accepted to ICCV 2021.
Contents
- Overview
- Qualitative Results
- Installation
- Data
- Metrics
- Pretrained Object Detector
- Pretrained Scene Graph Generation Models
- Model Training
- Model Evaluation
- Acknowledgement
- Reference
Overview
Learning from image-text data has demonstrated recent success for many recognition tasks, yet is currently limited to visual features or individual visual concepts such as objects. In this paper, we propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph. To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs. Further, we design a Transformer-based model to predict these "pseudo" labels via a masked token prediction task. Learning from only image-sentence pairs, our model achieves a 30% relative gain over a recent method trained with human-annotated unlocalized scene graphs. Our model also shows strong results for weakly and fully supervised scene graph generation. In addition, we explore an open-vocabulary setting for detecting scene graphs, and present the first result for open-set scene graph generation.
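The pseudo-label step described above can be illustrated with a minimal sketch. It assumes the caption has already been parsed into (subject, predicate, object) triplets and that a detected region matches a caption concept when their lower-cased strings are equal. Both are simplifications for illustration, and the names `Region` and `make_pseudo_labels` are hypothetical rather than taken from this repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """A detected region: a class label and a box (x1, y1, x2, y2)."""
    label: str
    box: Tuple[float, float, float, float]


def make_pseudo_labels(
    regions: List[Region],
    triplets: List[Tuple[str, str, str]],
) -> List[Tuple[int, str, int]]:
    """Return (subject_region_idx, predicate, object_region_idx) pseudo labels.

    Assumption: a region matches a caption concept when their lower-cased
    strings are equal. Every matching subject/object region pair is labeled.
    """
    pseudo = []
    for subj, pred, obj in triplets:
        subj_ids = [i for i, r in enumerate(regions) if r.label.lower() == subj.lower()]
        obj_ids = [i for i, r in enumerate(regions) if r.label.lower() == obj.lower()]
        for s in subj_ids:
            for o in obj_ids:
                if s != o:  # a region is not related to itself
                    pseudo.append((s, pred, o))
    return pseudo


# Hypothetical detector output for one image.
regions = [
    Region("man", (10, 20, 110, 300)),
    Region("horse", (80, 120, 400, 380)),
    Region("tree", (420, 0, 600, 380)),
]
# Triplets a caption parser might produce for "a man riding a horse".
triplets = [("man", "riding", "horse")]

pseudo_labels = make_pseudo_labels(regions, triplets)
print(pseudo_labels)  # [(0, 'riding', 1)]
```

Regions whose labels never appear in the caption (the tree here) receive no relationship label, which is why such supervision is only a "pseudo" approximation of a full scene graph.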
Qualitative Results
Our generated scene graphs learned from image descriptions
Our generated scene graphs in open-set and closed-set settings
Installation
Check INSTALL.md for installation instructions.
Data
Check DATASET.md for instructions on downloading the data.
Metrics
Explanations of the metrics used in this toolkit are given in METRICS.md.
Pretrained Object Detector
In this project, we primarily use a Faster R-CNN detector pretrained on the Open Images dataset. You don't need to run this detector to use this repo: you can directly download the extracted detection features, as described in DATASET.md.
If you're interested in this detector, the pretrained model can be found in the TensorFlow 1 Detection Model Zoo: faster_rcnn_inception_resnet_v2_atrous_oidv4.
Update: The script used to extract region features (TensorFlow 1 Detection Model) is available in preprocess/README.md.
Additionally, to compare with previous fully supervised models, we also use the detector pretrained by Scene-Graph-Benchmark. You can download this Faster R-CNN model and extract all the files into the directory checkpoints/pretrained_faster_rcnn.
Pretrained Scene Graph Generation Models
Our pretrained SGG models can be downloaded from Google Drive. Details of these models can be found in the Model Training section below. After downloading, please put all the folders into the directory checkpoints/. More pretrained models will be released. Stay tuned!
Model Training
To train our scene graph generation models, run the script
bash train.sh MODEL_TYPE
where MODEL_TYPE specifies the training supervision, the training dataset and the scene graph generation model. See details below.
- Language supervised models: trained by image-text pairs
  - Language_CC-COCO_Uniter: train our Transformer-based model on the Conceptual Captions (CC) and COCO Caption (COCO) datasets
  - Language_*_Uniter: train our Transformer-based model on a single dataset. * represents the dataset name and can be CC, COCO, or VG
  - Language_OpensetCOCO_Uniter: train our Transformer-based model on the COCO dataset in the open-set setting
  - Language_CC-COCO_MotifNet: train the Motif-Net model with language supervision from the CC and COCO datasets
- Weakly supervised models: trained by unlocalized scene graph labels
  - Weakly_Uniter: train our Transformer-based model
- Fully supervised models: trained by localized scene graph labels
  - Sup_Uniter: train our Transformer-based model
  - Sup_OnlineDetector_Uniter: train our Transformer-based model using the object detector from Scene-Graph-Benchmark
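As a convenience, the MODEL_TYPE values listed above can be checked before launching a run. The sketch below is a hypothetical helper, not part of this repository: it only validates the name against that list and builds the `bash train.sh MODEL_TYPE` command, without executing it.

```python
from typing import List

# Dataset names that may replace "*" in Language_*_Uniter, per the list above.
SINGLE_DATASETS = ("CC", "COCO", "VG")

# All MODEL_TYPE values named in this README.
MODEL_TYPES = (
    ["Language_CC-COCO_Uniter"]
    + [f"Language_{d}_Uniter" for d in SINGLE_DATASETS]
    + [
        "Language_OpensetCOCO_Uniter",
        "Language_CC-COCO_MotifNet",
        "Weakly_Uniter",
        "Sup_Uniter",
        "Sup_OnlineDetector_Uniter",
    ]
)


def build_train_command(model_type: str) -> List[str]:
    """Return the argv list for `bash train.sh MODEL_TYPE` after validation."""
    if model_type not in MODEL_TYPES:
        raise ValueError(
            f"Unknown MODEL_TYPE {model_type!r}; expected one of {MODEL_TYPES}"
        )
    return ["bash", "train.sh", model_type]


cmd = build_train_command("Language_COCO_Uniter")
print(" ".join(cmd))  # bash train.sh Language_COCO_Uniter
```

To actually start training, the returned list could be passed to `subprocess.run(cmd, check=True)` from the repository root.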
You can set CUDA_VISIBLE_DEVICES in train.sh to specify which GPUs are used for model training (e.g., the default script uses 2 GPUs).
Model Evaluation
To evaluate a trained scene graph generation model, you can reuse the commands in train.sh by changing WSVL.SKIP_TRAIN to True and setting OUTPUT_DIR to the path of your trained model. An example is provided in test.sh; to use it, run bash test.sh.
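The change described above can be sketched as a small token rewrite. This assumes the training command passes configuration overrides as space-separated KEY VALUE pairs, which is an assumption about train.sh rather than something stated here. The helper `set_overrides`, the script name `TRAIN_SCRIPT.py`, and both checkpoint paths are hypothetical placeholders; check train.sh and test.sh for the actual command.

```python
from typing import Dict, List


def set_overrides(tokens: List[str], overrides: Dict[str, str]) -> List[str]:
    """Return a copy of `tokens` with the value after each KEY replaced.

    Keys that are not already present are appended as a new KEY VALUE pair.
    """
    result = list(tokens)
    for key, value in overrides.items():
        if key in result:
            idx = result.index(key)
            # Slice assignment also works when KEY is the last token.
            result[idx + 1:idx + 2] = [value]
        else:
            result.extend([key, value])
    return result


# Hypothetical training command, split into tokens.
train_tokens = [
    "python", "TRAIN_SCRIPT.py",
    "WSVL.SKIP_TRAIN", "False",
    "OUTPUT_DIR", "checkpoints/NEW_RUN",
]

eval_tokens = set_overrides(
    train_tokens,
    {"WSVL.SKIP_TRAIN": "True", "OUTPUT_DIR": "checkpoints/MY_TRAINED_MODEL"},
)
print(" ".join(eval_tokens))
```

Only the two settings named in the text change; every other token of the training command is left untouched.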
Acknowledgement
This repository is built on Scene-Graph-Benchmark for scene graph generation and UNITER for image-text representation learning.
We would especially like to thank Pengchuan Zhang for providing the object detector pretrained on the Objects365 dataset, which was used in our ablation study.
Reference
If you are using our code, please consider citing our paper.
@inproceedings{zhong2021SGGfromNLS,
title={Learning to Generate Scene Graph from Natural Language Supervision},
author={Zhong, Yiwu and Shi, Jing and Yang, Jianwei and Xu, Chenliang and Li, Yin},
booktitle={ICCV},
year={2021}
}