CI-VID: A Coherent Interleaved Text-Video Dataset
September 4, 2025
CI-VID is a large-scale dataset designed to advance coherent multi-clip video generation. Unlike traditional text-to-video (T2V) datasets with isolated clip-caption pairs, CI-VID supports text-and-video-to-video (TV2V) generation by providing over 340,000 interleaved sequences of video clips and rich captions. It enables models to learn both intra-clip content and inter-clip transitions, fostering story-driven generation with strong temporal and visual coherence. We also introduce a comprehensive evaluation suite including human, VLM-based, and similarity-based assessments. In addition, we split CI-VID into a training set (334k samples) and a test set (8k samples), enabling users to develop and evaluate their own metrics using the test set.
- Paper
- Download Train Samples (334k)
- Download Test Samples (8k)
- Download Videos: CI-VID/download_all_chunks.sh. For access to the full dataset videos (6.5 TB), please refer to BAAI/CI-VID on Hugging Face (https://huggingface.co/datasets/BAAI/CI-VID).
CI-VID Example Viewing
- Provided Files
example_viewing/
This part of the repository contains samples extracted from CI-VID to better illustrate the dataset's structure and characteristics.
[Example gallery: 12 sample sequences, each with a Captions link.]
Dataset Construction Functions
- Provided Files
construction_functions/
This part of the repository contains four key functions used to construct the CI-VID dataset:
- similarity_based_segment: segments raw videos based on visual similarity.
- entity_based_segment: segments videos based on the main entity.
- individual_annotation: generates clip-level captions individually.
- joint_annotation: generates inter-clip relation captions that capture transitions and coherence.
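To make the idea of similarity-based segmentation concrete, here is a minimal sketch. It is not the repository's `similarity_based_segment` implementation: the function name `segment_by_similarity`, the use of cosine similarity over per-frame feature vectors, and the `threshold` value are all assumptions chosen for illustration. It cuts a video wherever two adjacent frames look sufficiently different.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_by_similarity(
    frame_features: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from CI-VID
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive.

    A new clip starts whenever the similarity between two adjacent
    frames falls below `threshold`.
    """
    if not frame_features:
        return []
    bounds = [0]
    for i in range(1, len(frame_features)):
        if cosine(frame_features[i - 1], frame_features[i]) < threshold:
            bounds.append(i)
    bounds.append(len(frame_features))
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    # Toy 2-D "features": two visually distinct groups of frames.
    features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    print(segment_by_similarity(features))
```

In practice the feature vectors would come from some visual encoder; which encoder and threshold the dataset authors used is not stated here.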
Quantitative Evaluation
This part of the repository contains the quantitative evaluation resources for the CI-VID dataset, including evaluation code, prompts, visualizations, and sample annotations.
Overview
We propose three complementary evaluation protocols:
- Human Evaluation
- VLM-based Evaluation
- Similarity-based Evaluation
1. Human Evaluation
- Provided Files
  - evaluation/human_evaluation/prompts.jsonl: Prompts used for evaluation.
  - evaluation/human_evaluation/visual_contrast/: Visualizations for human evaluation (1,000 prompts). (download)
Human evaluation is based on 1,000 automatically generated prompts (seeded with keywords from VBench), each describing 6 scenes that compose a coherent multi-scene narrative.
- Models Compared: Baseline (trained on Emu3) vs. Fine-tuned (further fine-tuned on CI-VID).
- Examples: [gallery of 6 generated samples, each with a Text Prompt link]
- Procedure: 3 professional annotators compare model outputs across:
  - *Consistency*
  - *Narrativity*
  - *Factual correctness*
- Judgment Format: Side-by-side comparison, anonymized model identity, random top-bottom order.
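The judgment format above (side-by-side, anonymized, random top-bottom order) can be sketched as follows. This is an illustrative outline only, not the tooling used for the actual study: the function names, the task dictionary layout, and the `"tie"` option are assumptions.

```python
import random
from typing import Dict, Iterable, List


def build_blind_pairs(prompt_ids: Iterable, seed: int = 0) -> List[dict]:
    """Assign the two models to top/bottom slots at random for each prompt.

    The returned mapping stays with the organizers; annotators only
    see the "top" and "bottom" videos, never the model names.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    tasks = []
    for pid in prompt_ids:
        order = ["baseline", "finetuned"]
        rng.shuffle(order)
        tasks.append({"prompt_id": pid, "top": order[0], "bottom": order[1]})
    return tasks


def tally(tasks: List[dict], choices: Dict) -> Dict[str, int]:
    """Map annotator choices ('top', 'bottom', or 'tie') back to model names."""
    wins = {"baseline": 0, "finetuned": 0, "tie": 0}
    for task in tasks:
        choice = choices[task["prompt_id"]]
        winner = "tie" if choice == "tie" else task[choice]
        wins[winner] += 1
    return wins


if __name__ == "__main__":
    tasks = build_blind_pairs(range(4), seed=42)
    # Hypothetical annotator who always prefers the top video.
    print(tally(tasks, {t["prompt_id"]: "top" for t in tasks}))
```

Because the slot assignment is random, an annotator with a positional bias (always picking the top video, say) contributes roughly equally to both models instead of favoring one.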
2. VLM-based Evaluation
- Provided Files
  - evaluation/vlm_evaluation/vlm_evaluation_data.jsonl: Prompts used for evaluation.
  - evaluation/vlm_evaluation/vlm_evaluation.py: Code for VLM-based evaluation.
We use the same prompts as in the human evaluation to perform a VLM-based evaluation.
- Procedure: Qwen2-VL-72B-Instruct is prompted to assess model outputs along the following six dimensions:
  - Stylistic Consistency: the visual styles of frames need to be consistent (e.g., color tone, lighting, rendering technique, texture details).
  - Entity Consistency: the key characters and objects need to be consistent across frames (e.g., retain the same attributes and identity).
  - Background Consistency: the backgrounds and environments need to be consistent across frames.
  - Perspective Transition Coherence: the transitions between camera angles and scenes need to be smooth and logically aligned.
  - Text Prompt Alignment: the frames need to accurately reflect the content and intent of the original text prompt.
  - Visual Plausibility: the overall visual quality needs to be realistic, with no noticeable artifacts, glitches, or implausible elements.
- Score Guide:
  - 5 - Very Excellent: Perfect: not only flawless but demonstrates outstanding consistency and execution.
  - 4 - Excellent: Flawless: no noticeable issues.
  - 3 - Good: Nearly flawless, with only minor, negligible imperfections.
  - 2 - Fair: Minor flaws observed in one clip.
  - 1 - Poor: Major or multiple flaws.
  - 0 - Very Poor: Multiple (> 1) major flaws.
Scores are averaged over 6 evaluations per sample (1 full + 5 pairwise), with VLM calibration via reference examples.
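The averaging step can be sketched as below. This is a minimal illustration, not the logic in `vlm_evaluation.py`: the function name `sample_score` and the input layout (one dictionary of per-dimension scores per evaluation) are assumptions. The sketch takes one full evaluation plus five pairwise evaluations, each scored 0 to 5 on the six dimensions, and averages them per dimension.

```python
from statistics import mean
from typing import Dict, List

DIMENSIONS = [
    "stylistic_consistency",
    "entity_consistency",
    "background_consistency",
    "perspective_transition_coherence",
    "text_prompt_alignment",
    "visual_plausibility",
]


def sample_score(
    full: Dict[str, float], pairwise: List[Dict[str, float]]
) -> Dict[str, float]:
    """Average 1 full + 5 pairwise evaluations per dimension."""
    if len(pairwise) != 5:
        raise ValueError("expected exactly 5 pairwise evaluations")
    evaluations = [full, *pairwise]
    for ev in evaluations:
        for dim in DIMENSIONS:
            if not 0 <= ev[dim] <= 5:  # score guide runs from 0 to 5
                raise ValueError(f"score out of range for {dim}: {ev[dim]}")
    return {dim: mean(ev[dim] for ev in evaluations) for dim in DIMENSIONS}


if __name__ == "__main__":
    full = {dim: 5 for dim in DIMENSIONS}
    pairs = [{dim: 2 for dim in DIMENSIONS} for _ in range(5)]
    print(sample_score(full, pairs))
```

Each of the six evaluations carries equal weight here; whether the original pipeline weights the full evaluation differently is not stated.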
3. Similarity-based Evaluation
- Provided Files
  - evaluation/similarity_evaluation/object_similarity_data.jsonl: Captions and first clips for similarity-based evaluation.
  - evaluation/similarity_evaluation/object_similarity_evaluation.py: Code for computing similarity evaluation.
  - object_similarity_data.jsonl: Data for model inference. (download)
  - middle_frames.zip: Ground-truth middle frames for similarity evaluation. (download)
  - rectangles.zip: Ground-truth entity boxes for similarity evaluation. (download)
  - CI-VID_results.zip: Visualizations of results from the CI-VID fine-tuned model. (download)
  - observation_for_object_similarity_data.zip: Visual observation files for similarity-based evaluation data. (download)
  - sim_eval_img_package.zip: Entity box data with images. (download)
We construct a similarity-based evaluation dataset based on CI-VID data. To avoid data leakage, all test data and data from the same source videos are excluded from the CI-VID training set. This evaluation compares the similarity between the generated and ground-truth videos at both the global and object levels.
Evaluation Setup
- Object Detection: YOLO is applied to each video clip. For every clip, 3 frames are uniformly sampled and processed.
- Manual Filtering: Non-essential objects are removed manually. A maximum of two narrative-relevant object boxes are kept per frame.
- Evaluation Protocol:
  - Each sample includes the first clip and the full caption as input.
  - The evaluated model generates the remaining video clips.
  - We compute similarity between:
    - Generated and ground-truth middle frames (for whole-sequence similarity)
    - Object boxes in generated and reference frames (for object-level similarity)
  - For object similarity, we match each generated object to a ground-truth object across 3 frames per clip and use the best score as the clip score. The clip scores are averaged to give the sample score, and the final result is the average over all samples.
- Ground-truth Examples:
[Ground-truth example gallery: 13 sample sequences, each with a Captions link.]
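The object-level scoring described under Evaluation Protocol (best match per clip, mean over clips, mean over samples) can be sketched as follows. This is not the code in `object_similarity_evaluation.py`: all function names are hypothetical, cosine similarity over per-object feature vectors is an assumed stand-in for whatever similarity measure the script uses, and `uniform_frame_indices` is just one plausible reading of "uniformly sampled".

```python
import math
from statistics import mean
from typing import List, Sequence

Vector = Sequence[float]


def uniform_frame_indices(num_frames: int, k: int = 3) -> List[int]:
    """Pick k frame indices at the centers of k equal-length segments."""
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def clip_score(generated: List[Vector], ground_truth: List[Vector]) -> float:
    """Best similarity over all generated/ground-truth object pairs.

    Both lists pool the object features from the 3 sampled frames of a clip.
    """
    if not generated or not ground_truth:
        return 0.0
    return max(cosine(g, t) for g in generated for t in ground_truth)


def sample_score(clips: List[tuple]) -> float:
    """Mean clip score; each clip is a (generated, ground_truth) pair."""
    return mean(clip_score(gen, gt) for gen, gt in clips)


def dataset_score(samples: List[List[tuple]]) -> float:
    """Mean of the per-sample scores."""
    return mean(sample_score(clips) for clips in samples)


if __name__ == "__main__":
    print(uniform_frame_indices(30))
    clip_a = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])  # one perfect match
    clip_b = ([[1.0, 0.0]], [[0.0, 1.0]])              # no match
    print(dataset_score([[clip_a, clip_b], [clip_a]]))
```

Taking the maximum within a clip rewards a model for reproducing the key object in at least one sampled frame, while the two averaging steps keep long samples from dominating the final number.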
Research-Only License
This dataset is provided for non-commercial, research purposes only.
- Commercial use is not allowed.
- Redistribution or repackaging is not permitted without prior consent.