DATASET.md
July 13, 2024 ยท View on GitHub
File Pre-processing
This section hosts the detail information to conduct file pre-processing for both dataset.
We already provide all the pre-processed files in this link. Download all the json files and place then under /data folder. Alternatively, you can follow the following steps to process the files yourself.
Simple Action
Please copy and paste the files inside the data directory of CONE's released Ego4D-NLQ and MAD data to the current directory.
FILE SUMMARY
CONE have provided the pre-processed files (inside the data/ego4d_data or data/mad_data directory).
Each file is in JSON Line format, each row of the files can be loaded as a single dict in Python. Below is an example of the annotation:
{
"query": "what did I put in the black dustbin?",
"query_id": ca7e11a2-cd1e-40dd-9d2f-ea810ab6a99b_0,
"duration": 480,
"clip_id": "93231c7e-1cf4-4a20-b1f8-9cc9428915b2",
"video_id": "38737402-19bd-4689-9e74-3af391b15feb",
"timestamps": [425.0, 431.0],
}
query_id is a unique identifier of a query. This query corresponds to a video identified by its id clip_id.
duration is an integer indicating the duration of this video.
clip_id is a unique identifier of a video clip. For Ego4D benchmark, each video clip is actually trimmed from a full-length video identified by its id video_id.
Nevertheless, in our long video grounding problem setting, the video clip is reckoned as a long-form video.
timestamps is the ground-truth moment and a list with two scalars, the first scalar is ground-truth begin timestamp, the second scalar is groundtruth end timestamp.
Below shows the detailed procedures for file pre-processing, if you are interested.
Ego4D-NLQ
First get official json files (we also provide them in data/ego4d_ori_data directory under our released Ego4D-NLQ data). Then pre-process them through the following codes.
python reformat_data.py --input_train_split ego4d_ori_data/nlq_train.json --input_val_split ego4d_ori_data/nlq_val.json
--input_test_split ego4d_ori_data/nlq_test_unannotated.json --output_save_path ego4d_data --dset_name ego4d
python process_train_split.py --train_split ego4d_data/train.jsonl --dset_name ego4d
The first code aims to convert the original released file to our standard jsonl file for further processing, and the second code aims to remove some low-quality train samples.
MAD
First get official json files (CONE also provide them in data/mad_ori_data directory under their released MAD data). Then pre-process them through the following codes.
python reformat_data.py --input_train_split mad_ori_data/MAD_train.json --input_val_split mad_ori_data/MAD_val.json
--input_test_split mad_ori_data/MAD_test.json --output_save_path mad_data --dset_name mad
python process_train_split.py --train_split mad_data/train.jsonl --dset_name mad
NaQ
To process the NaQ augmentation data follow these steps:
-
PYTHONPATH=$PYTHONPATH:. python CONE/data/reformat_data.py --input_train_split data/ego4d_naq/v2/naq_datasets/naq_datasets/data/nlq_aug_naq_train.json --input_val_split ego4d_ori_data/nlq_val.json --input_test_split ego4d_ori_data/nlq_test_unannotated.json --output_save_path data/ego4d_nlq_data_for_cone/data/ego4d_naq_data --dset_name ego4d -
PYTHONPATH=$PYTHONPATH:. python data/process_train_split.py --train_split data/ego4d_nlq_data_for_cone/data/ego4d_naq_data/train.jsonl --dset_name ego4d -
PYTHONPATH=$PYTHONPATH:. python feature_extraction/misc/convert_pt_to_lmdb.py root_dir_list = \["data/ego4d_naq/video_features/features/egovlp/",\], feature_save_path = "data/ego4d_nlq_data_for_cone/offline_lmdb/egovlp_video_feature_naq" -
PYTHONPATH=$PYTHONPATH:. python feature_extraction/ego4d_clip_token_extractor.py --feature_output_path data/ego4d_nlq_data_for_cone/offline_lmdb/egovlp_clip_text_features_naq -
PYTHONPATH=$PYTHONPATH:. python feature_extraction/ego4d_clip_token_extractor.py PYTHONPATH=$PYTHONPATH:. python feature_extraction/ego4d_egovlp_cls_extractor.py PYTHONPATH=$PYTHONPATH:. python feature_extraction/ego4d_merge_textual_cls_token_feature.py PYTHONPATH=$PYTHONPATH:. python ego4d_roberta_token_extractor.py --do_extract