Wikipedia-based Image Text (WIT) Dataset

September 21, 2022 ยท View on GitHub

WIT Dataset is now available for download and use.

WIT is also available as a Tensorflow Dataset (TFDS).

A part of WIT Dataset was used in the Kaggle Image-Text Retrieval Competition and you can find images and resnet embeddings available as part of that (not all images but more than 60% of it). These image pixels available here are also available via the TFDS linked above.

We are providing WIT as a set of 10 tsv files (compressed). The total dataset size is about ~25GB. This is the entire training dataset.

If you want to start quick, pick any one file which will be about ~2.5GB and will give you about ~10% of the data which has about ~3.5M+ image text example sets.

We are also including the validation and test sets (5 files each).

Here is a 1% data sample file for a quick start.

File Details

Each row of the WIT dataset consists of 17 columns and they are as follows in this order.

FIELD_NAMEFIELD_TYPE
languagestring
page_urlstring
image_urlstring
page_titlestring
section_titlestring
hierarchical_section_titlestring
caption_reference_descriptionstring
caption_attribution_descriptionstring
caption_alt_text_descriptionstring
mime_typestring
original_heightint
original_widthint
is_main_imagebool
attribution_passes_lang_idbool
page_changed_recentlybool
context_page_descriptionstring
context_section_descriptionstring

Training Set

Here are the links to download the 10 files which are part of WIT training set.

wit_v1.train.all-00000-of-00010.tsv.gz

wit_v1.train.all-00001-of-00010.tsv.gz

wit_v1.train.all-00002-of-00010.tsv.gz

wit_v1.train.all-00003-of-00010.tsv.gz

wit_v1.train.all-00004-of-00010.tsv.gz

wit_v1.train.all-00005-of-00010.tsv.gz

wit_v1.train.all-00006-of-00010.tsv.gz

wit_v1.train.all-00007-of-00010.tsv.gz

wit_v1.train.all-00008-of-00010.tsv.gz

wit_v1.train.all-00009-of-00010.tsv.gz

Validation Set

Here are the links to download the WIT validation set files.

wit_v1.val.all-00000-of-00005.tsv.gz

wit_v1.val.all-00001-of-00005.tsv.gz

wit_v1.val.all-00002-of-00005.tsv.gz

wit_v1.val.all-00003-of-00005.tsv.gz

wit_v1.val.all-00004-of-00005.tsv.gz

Test Set

Here are the links to download the WIT test set files.

wit_v1.test.all-00000-of-00005.tsv.gz

wit_v1.test.all-00001-of-00005.tsv.gz

wit_v1.test.all-00002-of-00005.tsv.gz

wit_v1.test.all-00003-of-00005.tsv.gz

wit_v1.test.all-00004-of-00005.tsv.gz

For any questions, please contact wit-dataset@google.com.

We welcome feedback and will try our best to incorporate it. Feel free to file a github issue too.

Last but not least, if WIT dataset is useful to you, please do write to us about it. Be it a blog post, a research project or a paper you are working, we are delighted to learn about it.