The COM Kitchens dataset
August 22, 2024
Authors
Koki Maeda(3,1)*, Tosho Hirasawa(4,1)*, Atsushi Hashimoto(1), Jun Harashima(2), Leszek Rybicki(2), Yusuke Fukasawa(2), Yoshitaka Ushiku(1)
(1) OMRON SINIC X Corp. (2) Cookpad Inc. (3) Tokyo Institute of Technology (4) Tokyo Metropolitan University
*: Equal contribution. This work was done during an internship at OMRON SINIC X.
Citation
Note
@InProceedings{comkitchens_eccv2024,
author = {Koki Maeda and Tosho Hirasawa and Atsushi Hashimoto and Jun Harashima and Leszek Rybicki and Yusuke Fukasawa and Yoshitaka Ushiku},
title = {COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark},
booktitle = {Proceedings of the European Conference on Computer Vision},
year = {2024},
}
Dataset Details
The COM Kitchens dataset provides cooking videos annotated with structured visual action graphs. The dataset currently has two benchmarks:
- Dense Video Captioning on unedited fixed-viewpoint videos (DVC-FV)
- Online Recipe Retrieval (OnRR)
We provide all the data for the benchmarks, along with .dat files that define the train/validation/test split.
File Structure
data
├─ ap # captions for each action-by-person entry
├─ frames # frames extracted from videos (split into train/valid/test)
├─ frozenbilm # features by FrozenBiLM (used by vid2seq)
└─ main # recipes annotated by human
   └─ {recipe_id} # recipe id
      └─ {kitchen_id} # kitchen id
         ├─ cropped_images # cropped images of bounding boxes for visual action graph
         ├─ frames # annotated frames for AP of visual action graph
         ├─ front_compressed.mp4 # recorded video
         ├─ annotations.xml # annotations in xml file format
         ├─ gold_recipe_translation_en.json # recipe annotations
         ├─ gold_recipe.json # rewritten recipe (in Japanese)
         ├─ graph.dot # visual action graph
         ├─ graph.dot.pdf # visualization of visual action graph
         └─ obj.names
├── ingredients.txt # ingredients list in the COM Kitchens dataset
├── ingredients_translation_en.txt # translated ingredients list in the COM Kitchens dataset
├── train.txt # list of recipe id in the train split
└── val.txt # list of recipe id in the validation split
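As a minimal sketch of how this layout might be traversed, the following Python snippet lists every `{recipe_id}/{kitchen_id}` directory under `data/main` and reads a split file. The helpers `iter_recordings` and `read_split` are hypothetical names, not part of the `com_kitchens` package, and the snippet assumes that each split file holds one recipe id per line.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_recordings(main_dir: Path) -> Iterator[Tuple[str, str, Path]]:
    """Yield (recipe_id, kitchen_id, path) for each recording directory."""
    # Plain files at either level (for example split lists) are skipped.
    for recipe_dir in sorted(p for p in main_dir.iterdir() if p.is_dir()):
        for kitchen_dir in sorted(p for p in recipe_dir.iterdir() if p.is_dir()):
            yield recipe_dir.name, kitchen_dir.name, kitchen_dir


def read_split(split_file: Path) -> List[str]:
    """Read recipe ids from a split file (assumption: one id per line)."""
    lines = split_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    main_dir = Path("data/main")  # adjust to where the dataset was downloaded
    if main_dir.is_dir():
        for recipe_id, kitchen_id, path in iter_recordings(main_dir):
            print(recipe_id, kitchen_id, path / "gold_recipe.json")
```

Sorting the directory names only makes the iteration order reproducible across runs; it carries no meaning for the dataset itself.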
Important files
gold_recipe.json
gold_recipe.json provides the recipe information, to which the visual action graph is attached.
| key | value | description |
|---|---|---|
| "recipe_id" | str | recipe id |
| "kitchen_id" | int | kitchen id |
| "ingredients" | List[str] | ingredients list (in Japanese) |
| "ingredient_images" | List[str] | path of the images of each ingredient |
| "steps" | List[Dict] | annotations by step |
| "steps/memo" | str | recipe sentence |
| "steps/words" | List[str] | recipe split word by word |
| "steps/ap_ids" | List[Dict] | Correspondence between AP and words |
| "actions_by_person" | List[str] | annotation of the visual action graph, including the time span and bounding boxes |
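The keys in the table above can be read with the standard `json` module. The sketch below is illustrative only: `load_recipe` and `summarize_steps` are hypothetical helper names, and because the inner layout of the `ap_ids` entries is not described here, the code only counts them rather than interpreting them.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_recipe(path: Path) -> Dict[str, Any]:
    """Load a gold_recipe.json file as a dictionary."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def summarize_steps(recipe: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one small summary per step, using the keys listed in the table."""
    summaries = []
    for index, step in enumerate(recipe.get("steps", [])):
        summaries.append(
            {
                "step": index,
                "memo": step.get("memo", ""),                 # recipe sentence
                "num_words": len(step.get("words", [])),      # tokenized sentence
                "num_ap_links": len(step.get("ap_ids", [])),  # AP-to-word links
            }
        )
    return summaries
```

A call such as `summarize_steps(load_recipe(Path("data/main") / recipe_id / kitchen_id / "gold_recipe.json"))` would then give a quick overview of one recording, where `recipe_id` and `kitchen_id` are placeholders for real directory names.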
{recipe_id}/{kitchen_id}/gold_recipe_translation_en.json
gold_recipe_translation_en.json provides only the translated recipe information.
| key | value | description |
|---|---|---|
| "ingredients" | List[str] | ingredients list (in English) |
| "steps" | List[Dict] | annotations by step |
| "steps/memo" | str | recipe sentence |
| "steps/words" | List[str] | recipe split word by word |
| "steps/ap_ids" | List[Dict] | Correspondence between AP and words |
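Since both files share the `steps` key, the Japanese and English sentences can be placed side by side. The following sketch assumes that the two files list their steps in the same order and in equal number, which is not stated above; `pair_step_memos` is a hypothetical helper name, and it raises an error when that assumption fails rather than silently misaligning the sentences.

```python
from typing import Any, Dict, List, Tuple


def pair_step_memos(
    gold: Dict[str, Any], translation: Dict[str, Any]
) -> List[Tuple[str, str]]:
    """Pair each Japanese step sentence with its English translation by index."""
    ja_steps = gold.get("steps", [])
    en_steps = translation.get("steps", [])
    # Assumption: steps are aligned by position in both files.
    if len(ja_steps) != len(en_steps):
        raise ValueError(
            f"step count mismatch: {len(ja_steps)} (ja) vs {len(en_steps)} (en)"
        )
    return [
        (ja.get("memo", ""), en.get("memo", ""))
        for ja, en in zip(ja_steps, en_steps)
    ]
```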
Download Procedure for COM Kitchens
Note
Application Form: English support will be available soon.
Quick Start
Dataset Preparation
- Dataset Preparation
  - Download annotation files and videos.
- Preprocess
  - Run `python -m com_kitchens.preprocess.video` to extract all frames of the videos.
  - Run `python -m com_kitchens.preprocess.recipe` to extract all action-by-person entries of the videos.
Warning
For simplicity, the preprocessing step extracts all frames. You can save disk space by extracting only the frames that the annotation files actually use.
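One way to decide which frames to keep is to convert annotated time spans into frame indices. This is a sketch under stated assumptions, not the preprocessing code of this repository: the annotation format for time spans is not described in this README, so the input is a plain list of hypothetical `(start_sec, end_sec)` pairs, and `frames_for_spans`, `fps`, and `stride` are illustrative names.

```python
import math
from typing import Iterable, List, Tuple


def frames_for_spans(
    spans: Iterable[Tuple[float, float]], fps: float, stride: int = 1
) -> List[int]:
    """Return sorted, de-duplicated frame indices covered by the time spans."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    if stride < 1:
        raise ValueError("stride must be at least 1")
    wanted = set()
    for start_sec, end_sec in spans:
        first = math.floor(start_sec * fps)  # first frame inside the span
        last = math.floor(end_sec * fps)     # last frame inside the span
        wanted.update(range(first, last + 1, stride))
    return sorted(wanted)
```

The resulting indices could then be passed to whichever frame extractor you use, so that only those frames are written to disk.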
Online Recipe Retrieval (OnRR)
- Training
  - Run `sh scripts/onrr-train-xclip.sh` for a simple start to training.
- Evaluation
  - Run `sh scripts/onrr-eval-xclip.sh {your/path/to/ckpt}` for the evaluation.
Training UniVL models in OnRR
UniVL requires S3D features extracted from the videos.
- Download `s3d_howto100m.pth` to `cache/s3d_howto100m.pth` or another path you configure.
- Run `sh scripts/extract_s3d_features.sh` to extract S3D features.
- Download the pretrained model `univl.pretrained.bin` to `cache/univl.pretrained.bin` or another path you configure.
- Then run `sh scripts/onrr-train-univl.sh` to train UniVL models.
Dense Video Captioning on unedited fixed-viewpoint videos (DVC-FV)
- Docker Images
  - Run `make build-docker-images` to build the Docker images.
- Preprocess
  - Run `sh scripts/dvc-vid2seq-prep` to extract
- Training & Evaluation
  - Run `sh scripts/vid2seq-zs.sh` to evaluate a pre-trained vid2seq model.
  - Run `sh scripts/vid2seq-ft.sh` to fine-tune and evaluate a vid2seq model.
  - Run `sh scripts/vid2seq-ft-rl-as.sh` to fine-tune and evaluate a vid2seq model that incorporates the action graph as both relation labels and attention supervision (RL+AS).
LICENSE
This project (other than the dataset) is licensed under the MIT License; see the LICENSE.txt file for details.