📄 CI-VID: A Coherent Interleaved Text-Video Dataset

September 4, 2025

CI-VID is a large-scale dataset designed to advance coherent multi-clip video generation. Unlike traditional text-to-video (T2V) datasets with isolated clip-caption pairs, CI-VID supports text-and-video-to-video (TV2V) generation by providing over 340,000 interleaved sequences of video clips and rich captions. It enables models to learn both intra-clip content and inter-clip transitions, fostering story-driven generation with strong temporal and visual coherence. We also introduce a comprehensive evaluation suite including human, VLM-based, and similarity-based assessments. In addition, we split CI-VID into a training set (334k samples) and a test set (8k samples), enabling users to develop and evaluate their own metrics using the test set.

🔗 📃 Paper
🔗 📦 Download Train Samples (334k)
🔗 📦 Download Test Samples (8k)
📦 Download Videos: CI-VID/download_all_chunks.sh. For access to the full dataset videos (6.5 TB), see BAAI/CI-VID on Hugging Face (https://huggingface.co/datasets/BAAI/CI-VID).

🗂️ CI-VID Example Viewing

📁 Provided Files
- example_viewing/

This part of the repository contains samples extracted from CI-VID to better illustrate the dataset’s structure and characteristics.

	🔗 Captions

🛠️ Dataset Construction Functions

📁 Provided Files
- construction_functions/

This part of the repository contains four key functions used to construct the CI-VID dataset:

similarity_based_segment – segments raw videos based on visual similarity.
entity_based_segment – segments videos based on the main entity.
individual_annotation – generates clip-level captions individually.
joint_annotation – generates inter-clip relation captions that capture transitions and coherence.
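
As a rough illustration of the idea behind similarity-based segmentation, the sketch below cuts a sequence of per-frame feature vectors wherever the cosine similarity between neighbouring frames drops below a threshold. This is not the code in construction_functions/: the helper name `segment_by_similarity`, the use of precomputed frame features, and the 0.8 threshold are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def segment_by_similarity(
    features: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive.

    A new segment starts whenever two consecutive frames are less similar
    than `threshold`. Hypothetical helper, not the repository function.
    """
    if not features:
        return []
    segments = []
    start = 0
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(features)))
    return segments


# Toy features: two frames pointing one way, then two pointing another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(segment_by_similarity(frames))  # [(0, 2), (2, 4)]
```

In practice the feature vectors would come from an image encoder, and the threshold would need tuning on real footage.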

📊 Quantitative Evaluation

This part of the repository contains the quantitative evaluation resources for the CI-VID dataset, including evaluation code, prompts, visualizations, and sample annotations.

🔍 Overview

We propose three evaluation protocols:

Human Evaluation
VLM-based Evaluation
Similarity-based Evaluation

👥 1. Human Evaluation

📁 Provided Files
- evaluation/human_evaluation/prompts.jsonl → Prompts used for evaluation.
- evaluation/human_evaluation/visual_contrast/ → Visualizations for human evaluation (1,000 prompts). 💡download

Human evaluation is based on 1,000 automatically generated prompts (seeded with keywords from VBench), each describing 6 scenes that together form a coherent multi-scene narrative.
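
The prompt file is distributed as JSON Lines, so it can be read one record at a time. The sketch below is a generic reader. It makes no assumption about the field names inside each record, and `read_jsonl` is a hypothetical helper rather than part of the repository.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


# Example usage (path taken from the file list above):
# prompts = list(read_jsonl("evaluation/human_evaluation/prompts.jsonl"))
# print(len(prompts), sorted(prompts[0].keys()))
```

Printing the keys of the first record is a quick way to discover the actual schema before writing any downstream code.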

Models Compared: Baseline (trained on Emu3) vs. Fine-tuned (further fine-tuned on CI-VID).
Examples:

	🔗 Text Prompt

Procedure: 3 professional annotators compare model outputs across:
- Consistency
- Narrativity
- Factual correctness

Judgment Format: Side-by-side comparison, anonymized model identity, random top-bottom order.
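
A blinded side-by-side layout of this kind can be sketched as follows. The function name `blind_pair`, the slot labels, and the file paths are assumptions for illustration. The actual annotation tooling is not described here.

```python
import random
from typing import Dict, Tuple


def blind_pair(
    baseline_video: str, finetuned_video: str, rng: random.Random
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Randomly assign the two outputs to the top and bottom slots.

    Returns (shown, key). `shown` is what the annotator sees and carries no
    model names; `key` maps each slot back to its model for un-blinding.
    """
    items = [("baseline", baseline_video), ("finetuned", finetuned_video)]
    rng.shuffle(items)
    shown = {"top": items[0][1], "bottom": items[1][1]}
    key = {"top": items[0][0], "bottom": items[1][0]}
    return shown, key


# File names should not reveal the model, so neutral paths are used here.
shown, key = blind_pair("clips/a_0001.mp4", "clips/b_0001.mp4", random.Random(0))
vote = "top"  # the slot an annotator preferred
print(shown, "-> preferred model:", key[vote])
```

Keeping `key` separate from `shown` means the annotator-facing data never contains the model identity.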

🤖 2. VLM-based Evaluation

📁 Provided Files
- evaluation/vlm_evaluation/vlm_evaluation_data.jsonl → Prompts used for evaluation.
- evaluation/vlm_evaluation/vlm_evaluation.py → Code for VLM-based evaluation

We use the same prompts as in the human evaluation to perform a VLM-based evaluation.

Procedure: Qwen2-VL-72B-Instruct is prompted to assess model outputs along the following six dimensions:

Stylistic Consistency
The visual styles of frames need to be consistent (e.g., color tone, lighting, rendering technique, texture details).

Entity Consistency
The key characters and objects need to be consistent across frames (e.g., retain the same attributes and identity).

Background Consistency
The backgrounds and environments need to be consistent across frames.

Perspective Transition Coherence
The transitions between camera angles and scenes need to be smooth and logically aligned.

Text Prompt Alignment
The frames need to accurately reflect the content and intent of the original text prompt.

Visual Plausibility
Is the overall visual quality realistic? Are there any noticeable artifacts, glitches, or implausible elements?

Score Guide:

5 – Very Excellent:
Perfect: not only flawless but also demonstrates outstanding consistency and execution.

4 – Excellent:
Flawless: no noticeable issues.

3 – Good:
Nearly flawless, with only minor, negligible imperfections.

2 – Fair:
Minor flaws observed in one clip.

1 – Poor:
Major or multiple flaws.

0 – Very Poor:
Multiple (> 1) major flaws.

🧪 Averaged over 6 evaluations per sample (1 full + 5 pairwise), with VLM calibration via reference examples.
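
The averaging step can be sketched as below. Two things are assumptions, not statements from the evaluation code: that the five pairwise evaluations cover adjacent clips of a six-scene sequence, and that the six scores are combined with equal weight. The helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def adjacent_pairs(num_clips: int) -> List[Tuple[int, int]]:
    """Index pairs of neighbouring clips; 6 clips give 5 pairs (assumed pairing)."""
    return [(i, i + 1) for i in range(num_clips - 1)]


def vlm_sample_score(full_score: float, pairwise_scores: Sequence[float]) -> float:
    """Equal-weight mean of 1 full-sequence score and 5 pairwise scores."""
    scores = [float(full_score)] + [float(s) for s in pairwise_scores]
    if len(scores) != 6:
        raise ValueError("expected 1 full score plus 5 pairwise scores")
    return sum(scores) / len(scores)


print(adjacent_pairs(6))
print(vlm_sample_score(4, [3, 4, 5, 4, 4]))  # 4.0
```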

🎯 3. Similarity-based Evaluation

📁 Provided Files
- evaluation/similarity_evaluation/object_similarity_data.jsonl → Captions and first clips for similarity-based evaluation.
- evaluation/similarity_evaluation/object_similarity_evaluation.py → Code for computing similarity evaluation.
- object_similarity_data.jsonl → Data for model inference. 💡download
- middle_frames.zip → Ground-truth middle frames for similarity evaluation. 💡download
- rectangles.zip → Ground-truth entity boxes for similarity evaluation. 💡download
- CI-VID_results.zip → Visualizations of results from the CI-VID fine-tuned model. 💡download
- observation_for_object_similarity_data.zip → Visual observation files for similarity-based evaluation data. 💡download
- sim_eval_img_package.zip → Entity box data with images. 💡download

We construct a similarity-based evaluation dataset based on CI-VID data. To avoid data leakage, all test data and data from the same source videos are excluded from the CI-VID training set. This evaluation compares the similarity between the generated and ground-truth videos at both the global and object levels.
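
The leakage rule described above (drop the test samples and every other sample cut from the same source video) can be sketched as a simple filter. The record keys `id` and `source_video` and the helper name are hypothetical. The real metadata fields may differ.

```python
from typing import Dict, Iterable, List, Set, Tuple

Sample = Dict[str, str]


def split_without_leakage(
    samples: Iterable[Sample], test_ids: Set[str]
) -> Tuple[List[Sample], List[Sample]]:
    """Return (train, test) so that no source video appears in both."""
    samples = list(samples)
    test = [s for s in samples if s["id"] in test_ids]
    # Every source video touched by the test set is banned from training.
    test_sources = {s["source_video"] for s in test}
    train = [s for s in samples if s["source_video"] not in test_sources]
    return train, test


toy = [
    {"id": "s1", "source_video": "vidA"},
    {"id": "s2", "source_video": "vidA"},  # same source as test sample s1
    {"id": "s3", "source_video": "vidB"},
]
train, test = split_without_leakage(toy, {"s1"})
print([s["id"] for s in train], [s["id"] for s in test])  # ['s3'] ['s1']
```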

⚙️ Evaluation Setup

Object Detection:
YOLO is applied to each video clip. For every clip, 3 frames are uniformly sampled and processed.
Manual Filtering:
Non-essential objects are removed manually. A maximum of two narrative-relevant object boxes are kept per frame.
Evaluation Protocol:
- Each sample includes the first clip and the full caption as input.
- The evaluated model generates the remaining video clips.
- We compute similarity between:
  - Generated and ground-truth middle frames → for whole-sequence similarity
  - Object boxes in generated and reference frames → for object-level similarity
- For object similarity, we match each generated object to the ground-truth object across 3 frames per clip and use the best score as the clip score. The sample score is the average of its clip scores, and the final result is the average over all samples.
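
The frame sampling and score aggregation in this setup can be sketched as follows. The similarity measure itself is not specified above, so the sketch takes per-frame similarity scores as given. Picking frames at the centres of three equal spans is one possible reading of "uniformly sampled", and all function names are hypothetical.

```python
from typing import List, Sequence


def uniform_frame_indices(num_frames: int, k: int = 3) -> List[int]:
    """Pick k frame indices at the centres of k equal-length spans of a clip."""
    if num_frames <= 0:
        raise ValueError("clip has no frames")
    return [min(num_frames - 1, int((i + 0.5) * num_frames / k)) for i in range(k)]


def clip_score(frame_scores: Sequence[float]) -> float:
    """Best object-similarity score over the sampled frames of one clip."""
    return max(frame_scores)


def object_sample_score(clips: Sequence[Sequence[float]]) -> float:
    """Mean of the clip scores for one sample."""
    return sum(clip_score(c) for c in clips) / len(clips)


def dataset_score(samples: Sequence[Sequence[Sequence[float]]]) -> float:
    """Mean of the sample scores over the whole evaluation set."""
    return sum(object_sample_score(s) for s in samples) / len(samples)


print(uniform_frame_indices(30))  # [5, 15, 25]

# Two toy samples; each inner list holds 3 per-frame scores for one clip.
toy_scores = [
    [[0.2, 0.8, 0.5], [0.6, 0.4, 0.1]],  # clip scores 0.8 and 0.6
    [[0.9, 0.3, 0.3]],                   # clip score 0.9
]
print(round(dataset_score(toy_scores), 3))
```

Taking the maximum per clip rewards a model if the object matches well in at least one sampled frame, while the two averaging steps weight every clip within a sample, and every sample, equally.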

Ground-truth Examples:

	🔗 Captions