Microsoft COCO Dataset (Captioning)

July 31, 2022

Samples from the COCO Caption dataset (image credit: https://arxiv.org/pdf/1504.00325.pdf).

Description

The Microsoft COCO Captions dataset contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
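Because each image comes with several captions, a common first step is to group the captions by image. The sketch below assumes the annotations follow the usual COCO captions JSON layout: a top-level "annotations" list whose entries carry "image_id" and "caption" fields. The helper name `captions_by_image` and the sample entries are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

def captions_by_image(annotation_json):
    """Group caption strings by image id from a COCO-style captions dict."""
    grouped = defaultdict(list)
    for ann in annotation_json["annotations"]:
        # Captions are free text, so trim stray whitespace.
        grouped[ann["image_id"]].append(ann["caption"].strip())
    return dict(grouped)

# Tiny hand-written sample in the assumed layout (not real dataset entries).
sample = {
    "images": [{"id": 1, "file_name": "example_1.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "A dog runs on the beach."},
        {"id": 11, "image_id": 1, "caption": "A brown dog running near the water. "},
    ],
}

grouped = captions_by_image(sample)
print(grouped[1])
```

With a real annotation file, the dict would come from `json.load` on that file instead of the inline sample.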

Task

(from https://paperswithcode.com/task/image-captioning)

Image captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence.
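The encode-then-decode flow described above can be sketched as a greedy decoding loop. This is only a control-flow illustration: `greedy_caption`, `toy_encode`, and `toy_next_token` are hypothetical stand-ins, and a real system would use a trained vision encoder and language decoder in their place.

```python
def greedy_caption(image, encode, next_token, max_len=20, bos="<bos>", eos="<eos>"):
    """Encode the image once, then emit one token at a time until <eos> or max_len."""
    features = encode(image)            # intermediate representation of the image
    tokens = [bos]
    while len(tokens) - 1 < max_len:    # len(tokens) - 1 = words generated so far
        token = next_token(features, tokens)
        if token == eos:
            break
        tokens.append(token)
    return " ".join(tokens[1:])         # drop the <bos> marker

# Toy stand-ins (assumptions for illustration only).
def toy_encode(image):
    # "Encodes" a list of pixel intensities as their mean brightness.
    return sum(image) / len(image)

def toy_next_token(features, prefix):
    # Picks from a fixed word list based on the encoded brightness.
    words = ["a", "bright", "image"] if features > 0.5 else ["a", "dark", "image"]
    position = len(prefix) - 1
    return words[position] if position < len(words) else "<eos>"

print(greedy_caption([0.9, 0.8, 0.7], toy_encode, toy_next_token))
```

Practical systems often replace the greedy choice with beam search, but the encoder-once, decoder-in-a-loop structure stays the same.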

Metrics

Models are typically evaluated with BLEU or CIDEr; the leaderboard below also reports METEOR and SPICE.
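To give a feel for what BLEU measures, here is a minimal sentence-level sketch: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a simplification under stated assumptions (whitespace tokenization, lowercasing, no smoothing) and is not the official COCO evaluation code, which scores at the corpus level. The function names are chosen here for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU against one or more reference captions."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

score = sentence_bleu("a dog runs on the sand", ["a dog runs on the beach"])
print(round(score, 4))
```

Scores from this sketch lie in [0, 1], whereas leaderboards usually report corpus-level BLEU scaled by 100, so the numbers are not directly comparable.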

Leaderboard

(Ranked by CIDEr)

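| Rank | Model | BLEU-4 | CIDEr | METEOR | SPICE | Resources |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | OFA | 44.9 | 154.9 | 32.5 | 26.6 | paper, code |
| 2 | LEMON | 42.6 | 145.5 | 31.4 | 25.5 | paper |
| 3 | CoCa | 40.9 | 143.6 | 33.9 | 24.7 | paper |
| 4 | SimVLM | 40.6 | 143.3 | 33.7 | 25.4 | paper |
| 5 | VinVL | 41.0 | 140.9 | 31.1 | 25.2 | paper, code |
| 6 | OSCAR | 40.7 | 140.0 | 30.6 | 24.5 | paper, code |
| 7 | BLIP | 40.4 | 136.7 | 31.4 | 24.3 | paper, code, demo |
| 8 | M^2 | 39.1 | 131.2 | 29.2 | 22.6 | paper, code |
| 9 | BUTD | 36.5 | 113.5 | 27.0 | 20.3 | paper, code |
| 10 | ClipCap | 32.2 | 108.4 | 27.1 | 20.1 | paper, code |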

Auto-Downloading

cd lavis/datasets/download_scripts && python download_coco.py

References

"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, C. Lawrence Zitnick. arXiv:1504.00325, 2015.