The Wikipedia Webpage 2M (WikiWeb2M) Dataset

October 13, 2023

We present the WikiWeb2M dataset consisting of over 2 million English Wikipedia articles. Our released dataset includes all of the text content on each page, links to the images present, and structure metadata such as which section each text and image element comes from.

This dataset is a contribution from our paper A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding.

The dataset is stored as gzipped TFRecord files which can be downloaded via these links.

Train

wikiweb2m-train.tfrecord.gz-00000-of-00005

wikiweb2m-train.tfrecord.gz-00001-of-00005

wikiweb2m-train.tfrecord.gz-00002-of-00005

wikiweb2m-train.tfrecord.gz-00003-of-00005

wikiweb2m-train.tfrecord.gz-00004-of-00005

Validation

wikiweb2m-val.tfrecord.gz

Test

wikiweb2m-test.tfrecord.gz
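
After downloading, it can help to confirm that all seven files are present before loading. The sketch below rebuilds the file names listed above and reports any that are missing from a local directory. The helper names `expected_files` and `missing_files` and the `data_dir` argument are hypothetical and not part of the released code.

```python
import os
from typing import List


def expected_files(num_train_shards: int = 5) -> List[str]:
    """Returns the file names listed above: five train shards, val, and test."""
    train = [
        f'wikiweb2m-train.tfrecord.gz-{i:05d}-of-{num_train_shards:05d}'
        for i in range(num_train_shards)
    ]
    return train + ['wikiweb2m-val.tfrecord.gz', 'wikiweb2m-test.tfrecord.gz']


def missing_files(data_dir: str) -> List[str]:
    """Lists the expected files that are not found in data_dir (hypothetical path)."""
    return [
        name for name in expected_files()
        if not os.path.exists(os.path.join(data_dir, name))
    ]
```

An empty list from `missing_files` means every expected file was found in that directory.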

WikiWeb2M Statistics

WikiWeb2M is the first open-source multimodal dataset to include all page content in a unified format. Here we provide aggregate statistics for the WikiWeb2M dataset, as well as the number of samples available for each of the fine-tuning tasks we design from it.

| Number of | Train | Validation | Test |
| --- | --- | --- | --- |
| Pages | 1,803,225 | 100,475 | 100,833 |
| Sections | 10,519,294 | 585,651 | 588,552 |
| Unique Images | 3,867,277 | 284,975 | 286,390 |
| Total Images | 5,340,708 | 299,057 | 300,666 |

Our data processing and filtering choices for each fine-tuning task are described in the paper.

| Downstream Task Samples | Train | Validation | Test |
| --- | --- | --- | --- |
| Page Description Generation | 1,435,263 | 80,103 | 80,339 |
| Section Summarization | 3,082,031 | 172,984 | 173,591 |
| Contextual Image Captioning | 2,222,814 | 124,703 | 124,188 |
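
To put the two tables in relation, the short sketch below sums the page counts across splits and estimates what share of the train split each task retains after filtering. Comparing image captioning samples against Total Images (rather than Unique Images) is an assumption made here for illustration; the paper describes the actual filtering.

```python
# Counts copied from the two tables above.
PAGES_PER_SPLIT = {'train': 1_803_225, 'val': 100_475, 'test': 100_833}
TRAIN_TOTALS = {'pages': 1_803_225, 'sections': 10_519_294, 'images': 5_340_708}

# Each task is paired with the unit it draws samples from. Using total
# (not unique) images as the denominator is an assumption.
TRAIN_TASK_SAMPLES = {
    'page_description': ('pages', 1_435_263),
    'section_summarization': ('sections', 3_082_031),
    'image_captioning': ('images', 2_222_814),
}

total_pages = sum(PAGES_PER_SPLIT.values())
coverage = {
    task: count / TRAIN_TOTALS[unit]
    for task, (unit, count) in TRAIN_TASK_SAMPLES.items()
}

print(f'total pages: {total_pages:,}')  # total pages: 2,004,533
for task, share in coverage.items():
    print(f'{task}: {share:.1%} of train {TRAIN_TASK_SAMPLES[task][0]}')
```

The page total matches the "over 2 million" articles stated in the introduction.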

Data and Task Examples

Here we illustrate how a single webpage can be processed into the three tasks we study: page description generation, section summarization, and contextual image captioning. The paper includes multiple Wikipedia article examples.

Illustration of Succulents Wikipedia Article being used for page description generation, section summarization, and contextual image captioning

Usage

TFRecord Features

Here we provide the names of the fields included in the dataset, their TensorFlow SequenceExample type (context or sequence feature), their data type, and a brief description.

| Feature | Sequence Example Type | DType | Description |
| --- | --- | --- | --- |
| split | Context | string | Dataset split this page contributes to (e.g., train, val, or test) |
| page_url | Context | string | Wikipedia page URL |
| page_title | Context | string | Wikipedia page title (the title of the article) |
| raw_page_description | Context | string | Wikipedia page description, which is typically the same as or very similar to the content of the first (root) section of the article |
| clean_page_description | Context | string | The same as raw_page_description but with newline and tab characters removed; this provides the exact target text for our page description generation task |
| page_contains_images | Context | int64 | Whether the Wikipedia page has images after our cleaning and processing steps |
| page_content_sections_without_table_list | Context | int64 | Number of content sections with text or images that do not contain a list or table; this field can be used to reproduce data filtering for page description generation |
| is_page_description_sample | Context | int64 | Whether a page is used as a sample for the page description fine-tuning task |
| section_title | Sequence | string | Titles of each section on the Wikipedia page, in order |
| section_index | Sequence | int64 | Index of each section on the Wikipedia page, in order |
| section_depth | Sequence | int64 | Depth of each section on the Wikipedia page, in order |
| section_heading_level | Sequence | int64 | Heading level of each section on the Wikipedia page, in order |
| section_subsection_index | Sequence | int64 | Subsection indices, grouped by section in order |
| section_parent_index | Sequence | int64 | The parent section index of each section, in order |
| section_text | Sequence | string | The body text of each section, in order |
| is_section_summarization_sample | Sequence | int64 | Whether a section is used as a sample for the section summarization fine-tuning task |
| section_raw_1st_sentence | Sequence | string | The first sentence extracted from each section, in order |
| section_clean_1st_sentence | Sequence | string | The same as section_raw_1st_sentence but with newline and tab characters removed; this provides the exact target text for our section summarization task |
| section_rest_sentence | Sequence | string | The sentences following the first sentence of each section, in order |
| section_contains_table_or_list | Sequence | int64 | Whether section content contains a table or list; this field is needed to reproduce sample filtering for section summarization |
| section_contains_images | Sequence | int64 | Whether each section has images after our cleaning and processing steps, in order |
| is_image_caption_sample | Sequence | int64 | Whether an image is used as a sample for the image captioning fine-tuning task |
| section_image_url | Sequence | string | Image URLs, grouped by section in order |
| section_image_mime_type | Sequence | string | Image MIME type, grouped by section in order |
| section_image_width | Sequence | int64 | Image width, grouped by section in order |
| section_image_height | Sequence | int64 | Image height, grouped by section in order |
| section_image_in_wit | Sequence | int64 | Whether an image was originally contained in the WIT dataset, grouped by section in order |
| section_image_raw_attr_desc | Sequence | string | Image attribution description, grouped by section in order |
| section_image_clean_attr_desc | Sequence | string | The English-only processed portions of the attribution description |
| section_image_raw_ref_desc | Sequence | string | Image reference description, grouped by section in order |
| section_image_clean_ref_desc | Sequence | string | The same as section_image_raw_ref_desc but with newline and tab characters removed; this provides the exact target text for our image captioning task |
| section_image_alt_text | Sequence | string | Image alt-text, grouped by section in order |
| section_image_captions | Sequence | string | Comma-separated concatenation of the alt-text, attribution, and reference descriptions; this is how captions are formatted when used as input text |
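
Several fields above are "grouped by section in order", so one section may hold zero, one, or many values (for example, several image URLs). When such a field is parsed with `tf.io.VarLenFeature`, the result is a sparse tensor made of `indices`, `values`, and `dense_shape`. The sketch below regroups that representation into one Python list per section. It assumes that the first index of each entry is the section position; the helper name `group_by_section` is hypothetical.

```python
from typing import Any, List, Sequence


def group_by_section(indices: Sequence[Sequence[int]],
                     values: Sequence[Any],
                     num_sections: int) -> List[List[Any]]:
    """Regroups sparse (indices, values) pairs into one list per section.

    Assumes each index is [section_position, position_within_section] and that
    entries arrive in row-major order, so order within a section is preserved.
    """
    grouped: List[List[Any]] = [[] for _ in range(num_sections)]
    for (section_position, _), value in zip(indices, values):
        grouped[section_position].append(value)
    return grouped


# Toy input: section 0 has two images, section 1 has none, section 2 has one.
toy_indices = [[0, 0], [0, 1], [2, 0]]
toy_values = ['a.jpg', 'b.jpg', 'c.jpg']
print(group_by_section(toy_indices, toy_values, num_sections=3))
# [['a.jpg', 'b.jpg'], [], ['c.jpg']]
```

With a parsed sparse tensor `sp`, the inputs would likely be `sp.indices.numpy().tolist()`, `sp.values.numpy().tolist()`, and `int(sp.dense_shape[0])`.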

Loading the Data

Here we provide a small code snippet showing how to load the TFRecord files. First, import the necessary packages.

```python
import numpy as np
import glob
import tensorflow.compat.v1 as tf
from collections import defaultdict
```

Next, define a data parser class.

```python
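class DataParser():
  """Loads WikiWeb2M TFRecord files and groups parsed examples by split."""

  def __init__(self,
               filepath: str = 'wikiweb2m-*',
               path: str = ''):
    # `path` is the directory prefix holding the downloaded files and must end
    # with a path separator (e.g. '/data/wikiweb2m/'). The empty default
    # (current working directory) is an assumption; set it to your download
    # location.
    self.filepath = filepath
    self.path = path
    # Maps a split name ('train', 'val', 'test') to its parsed examples.
    self.data = defaultdict(list)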

  def parse_data(self):
    context_feature_description = {
        'split': tf.io.FixedLenFeature([], dtype=tf.string),
        'page_title': tf.io.FixedLenFeature([], dtype=tf.string),
        'page_url': tf.io.FixedLenFeature([], dtype=tf.string),
        'clean_page_description': tf.io.FixedLenFeature([], dtype=tf.string),
        'raw_page_description': tf.io.FixedLenFeature([], dtype=tf.string),
        'is_page_description_sample': tf.io.FixedLenFeature([], dtype=tf.int64),
        'page_contains_images': tf.io.FixedLenFeature([], dtype=tf.int64),
        'page_content_sections_without_table_list': tf.io.FixedLenFeature([], dtype=tf.int64)
    }

    sequence_feature_description = {
        'is_section_summarization_sample': tf.io.VarLenFeature(dtype=tf.int64),
        'section_title': tf.io.VarLenFeature(dtype=tf.string),
        'section_index': tf.io.VarLenFeature(dtype=tf.int64),
        'section_depth': tf.io.VarLenFeature(dtype=tf.int64),
        'section_heading_level': tf.io.VarLenFeature(dtype=tf.int64),
        'section_subsection_index': tf.io.VarLenFeature(dtype=tf.int64),
        'section_parent_index': tf.io.VarLenFeature(dtype=tf.int64),
        'section_text': tf.io.VarLenFeature(dtype=tf.string),
        'section_clean_1st_sentence': tf.io.VarLenFeature(dtype=tf.string),
        'section_raw_1st_sentence': tf.io.VarLenFeature(dtype=tf.string),
        'section_rest_sentence': tf.io.VarLenFeature(dtype=tf.string),
        'is_image_caption_sample': tf.io.VarLenFeature(dtype=tf.int64),
        'section_image_url': tf.io.VarLenFeature(dtype=tf.string),
        'section_image_mime_type': tf.io.VarLenFeature(dtype=tf.string),
        'section_image_width': tf.io.VarLenFeature(dtype=tf.int64),
        'section_image_height': tf.io.VarLenFeature(dtype=tf.int64),
        'section_image_in_wit': tf.io.VarLenFeature(dtype=tf.int64),
        'section_contains_table_or_list': tf.io.VarLenFeature(dtype=tf.int64),
        'section_image_captions': tf.io.VarLenFeature(dtype=tf.string),
        'section_image_alt_text': tf.io.VarLenFeature(dtype=tf.string),
        'section_image_raw_attr_desc': tf.io.VarLenFeature(dtype=tf.string),
        'section_image_clean_attr_desc': tf.io.VarLenFeature(dtype=tf.string),
        'section_image_raw_ref_desc': tf.io.VarLenFeature(dtype=tf.string),
        'section_image_clean_ref_desc': tf.io.VarLenFeature(dtype=tf.string),
        'section_contains_images': tf.io.VarLenFeature(dtype=tf.int64)
    }

    def _parse_function(example_proto):
      return tf.io.parse_single_sequence_example(example_proto,
                                                 context_feature_description,
                                                 sequence_feature_description)

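    suffix = '.tfrecord*'

    # Match every downloaded file, e.g.
    # wikiweb2m-train.tfrecord.gz-00000-of-00005 or wikiweb2m-val.tfrecord.gz.
    data_path = sorted(glob.glob(self.path + self.filepath + suffix))
    raw_dataset = tf.data.TFRecordDataset(data_path, compression_type='GZIP')
    parsed_dataset = raw_dataset.map(_parse_function)

    # Each element is a (context, sequence) pair of feature dicts. Iterating
    # the dataset directly requires eager execution (the TensorFlow 2 default).
    for d in parsed_dataset:
      split = d[0]['split'].numpy().decode()
      self.data[split].append(d)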
```

Then you can run the following to parse the dataset.

```python
parser = DataParser()
parser.parse_data()
print((len(parser.data['train']), len(parser.data['val']), len(parser.data['test'])))
```
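
As an illustration of how the sample flags can be used, the sketch below forms (input, target) pairs for section summarization from one page. It assumes the page's sequence features have already been decoded into plain Python lists with one entry per section (that decoding step is not shown). The function name and dict layout are hypothetical, and using only `section_rest_sentence` as the input is a simplification; the paper describes the inputs actually used.

```python
from typing import Dict, Iterator, List, Tuple


def section_summarization_pairs(page: Dict[str, List]) -> Iterator[Tuple[str, str]]:
    """Yields (input_text, target_text) for sections flagged as samples.

    `page` is a hypothetical dict of per-section lists keyed by feature name.
    """
    flags = page['is_section_summarization_sample']
    targets = page['section_clean_1st_sentence']
    rests = page['section_rest_sentence']
    for flag, target, rest in zip(flags, targets, rests):
        if flag:  # int64 flag; nonzero marks a section summarization sample
            yield rest, target


# Toy page with two sections; only the second is flagged as a sample.
toy_page = {
    'is_section_summarization_sample': [0, 1],
    'section_clean_1st_sentence': ['Intro sentence.', 'Succulents store water.'],
    'section_rest_sentence': ['', 'They grow in arid climates.'],
}
print(list(section_summarization_pairs(toy_page)))
# [('They grow in arid climates.', 'Succulents store water.')]
```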

Models

Our full attention, transient global, and prefix global experiments were run using the LongT5 code base. The Prefix Global attention mechanism may be open sourced in the coming months.

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

```
@inproceedings{burns2023wiki,
  title={A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding},
  author={Andrea Burns and Krishna Srinivasan and Joshua Ainslie and Geoff Brown and Bryan A. Plummer and Kate Saenko and Jianmo Ni and Mandy Guo},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2023},
  url={https://openreview.net/forum?id=rwcLHjtUmn}
}
```