🚀 Data Preparation

October 25, 2025

This document describes our data storage format and walks through a specific example of preparing new data from scratch.

📂 Data Hierarchy

The directory structure should look like this. You need to prepare your own playground directory:

Eagle2
├── deepspeed_configs
├── document
├── eaglevl
├── playground
│   ├── sft_data
│   ├── sft_jsonl
│   └── sft_recipe
├── pretrained
├── pyproject.toml
├── README.md
├── shell
├── streamlit_demo
└── tools

📌 Level 1: `sft_recipe`

A recipe is a collection of various data sources. For example:

playground/sft_recipe/debug.json

{
    "LLaVAR-Instruct-16K": {
        "root": "",
        "annotation": "playground/sft_jsonl/LLaVAR-Instruct-16K.jsonl",
        "weight": 1, 
        "weight_description": "Larger weight means samples appear more frequently in later training stages.",
        "repeat_time": 1, 
        "repeat_time_description": "Number of times data is repeated during training.",
        "length": 15500,
        "length_description": "Reference only; actual length determined by the JSONL file.",
        "modality": ["image"]
    }
}
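A recipe can also be generated from a script. The following is a minimal sketch, assuming the field names shown above; `build_recipe_entry` and `write_recipe` are hypothetical helpers and not part of the repository. The `*_description` fields are left out on the assumption that they are documentation only, and `length` is simply counted from the JSONL file.

```python
import json
from pathlib import Path


def build_recipe_entry(annotation_path, root="", weight=1, repeat_time=1,
                       modality=("image",)):
    """Build one recipe entry (hypothetical helper).

    `length` is counted from the non-empty lines of the JSONL file and is
    for reference only.
    """
    with open(annotation_path, "r", encoding="utf-8") as f:
        length = sum(1 for line in f if line.strip())
    return {
        "root": root,
        "annotation": str(annotation_path),
        "weight": weight,
        "repeat_time": repeat_time,
        "length": length,
        "modality": list(modality),
    }


def write_recipe(recipe_path, sources):
    """Write a recipe file from a mapping of dataset name -> JSONL path."""
    recipe = {name: build_recipe_entry(path) for name, path in sources.items()}
    Path(recipe_path).parent.mkdir(parents=True, exist_ok=True)
    with open(recipe_path, "w", encoding="utf-8") as f:
        json.dump(recipe, f, indent=4)
    return recipe
```

For example, `write_recipe("playground/sft_recipe/debug.json", {"LLaVAR-Instruct-16K": "playground/sft_jsonl/LLaVAR-Instruct-16K.jsonl"})` would produce a file with the same shape as the one above.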

📌 Level 2: `sft_jsonl` and `sft_data`

Each data source corresponds to a JSONL file in the sft_jsonl directory, and the image/video data are stored in sft_data.

Example entry in playground/sft_jsonl/LLaVAR-Instruct-16K.jsonl:

{
    "conversations": [
        {"from": "system", "value": "This system prompt is optional."},
        {"from": "human", "value": "<image>\nWhat is the main idea of the quote from Albert Camus?"},
        {"from": "gpt", "value": "The main idea of the quote by Albert Camus..."}
    ],
    "data": [
        {
            "type": "image",
            "image_list": ["path/to/image"]
        }
    ]
}
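The entry above is pretty-printed for readability; in a JSONL file each sample occupies a single line. The sketch below shows one way to write such a file, assuming the entry layout shown above. `make_image_sample` and `write_jsonl` are hypothetical helper names used only for illustration.

```python
import json


def make_image_sample(image_path, question, answer, system_prompt=None):
    """Build one single-image sample in the layout shown above."""
    conversations = []
    if system_prompt is not None:
        # The system turn is optional.
        conversations.append({"from": "system", "value": system_prompt})
    conversations.append({"from": "human", "value": f"<image>\n{question}"})
    conversations.append({"from": "gpt", "value": answer})
    return {
        "conversations": conversations,
        "data": [{"type": "image", "image_list": [image_path]}],
    }


def write_jsonl(samples, out_path):
    """Write one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```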

🔄 IMPORTANT: Converting Data to Training Format

We do not train directly on the JSONL files described above. Instead, we further process the data for:

📊 Calculating token lengths (essential for data packing)
🎬 Selecting relevant frames from video data
🚫 Filtering out incorrect samples
📦 Converting JSONL to parquet format for efficient reading and storage (parquet file only stores meta information)
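As an illustration of the filtering step, the sketch below checks one plausible criterion: that the number of image and video placeholders in the conversation matches the number of media items listed under `data`. This criterion is an assumption made for illustration; the checks actually performed by prepare.sh may differ. `count_media` and `placeholders_match_media` are hypothetical names.

```python
import re

# Matches <image> as well as numbered forms such as <image-1>.
IMAGE_RE = re.compile(r"<image(?:-\d+)?>")
VIDEO_RE = re.compile(r"<video(?:-\d+)?>")


def count_media(sample):
    """Count images and videos listed under the sample's "data" field."""
    n_images = n_videos = 0
    for item in sample.get("data", []):
        if item.get("type") == "image":
            n_images += len(item.get("image_list", []))
        elif item.get("type") == "video":
            n_videos += len(item.get("video_list", []))
    return n_images, n_videos


def placeholders_match_media(sample):
    """Return True if placeholder counts equal the listed media counts."""
    text = "".join(turn["value"] for turn in sample["conversations"])
    n_images, n_videos = count_media(sample)
    return (len(IMAGE_RE.findall(text)) == n_images
            and len(VIDEO_RE.findall(text)) == n_videos)
```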

Use this command for the conversion:

# single node
bash shell/prepare.sh playground/sft_recipe/debug.json

Upon completion, a new recipe file with a .prepare.json suffix is generated in the same location as the original recipe. Use this new recipe for subsequent training.

⚠️ Note: It's important to review prepare.sh to confirm settings like the number of video frames sampled and the tokenizer used.

✨ Optimized Workflow Tip: Data preparation currently still requires GPU resources. However, we have implemented a caching mechanism: files with the same MD5 hash are not regenerated. A further script based on Ray is coming soon! 🚧
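The idea behind hash-based caching can be sketched as follows. This is a generic illustration and not the repository's implementation: which inputs prepare.sh actually hashes is not specified here, and `file_md5`, `prepare_with_cache`, and the `.out` suffix are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def prepare_with_cache(input_path, cache_dir, process_fn):
    """Run `process_fn` on the input bytes unless a cached result exists.

    Returns (path_to_result, cache_hit). Results are keyed by the MD5 of
    the input file, so unchanged inputs are never processed twice.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{file_md5(input_path)}.out"
    if cached.exists():
        return cached, True
    cached.write_bytes(process_fn(Path(input_path).read_bytes()))
    return cached, False
```

Keying the cache on file content rather than file name means that renaming a file does not trigger reprocessing, while any change to its content does.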

🎨 Supported Modalities and Formats

Use one of the following formats in JSONL files. The prepare.sh script standardizes these formats automatically.

single image:

<image> is a placeholder

{
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is this?"},
        {"from": "gpt", "value": "It is an apple."}
    ],
    "image": "path/to/image"
}

multiple images:

<image-i> are placeholders

{
    "conversations": [
        {"from": "human", "value": "<image-1><image-2>\nWhat are these?"},
        {"from": "gpt", "value": "An apple and a banana."}
    ],
    "image": ["path/to/image-1", "path/to/image-2"]
}

single video:

<video> is a placeholder

{
    "conversations": [
        {"from": "human", "value": "<video>\nWhat is this?"},
        {"from": "gpt", "value": "It is an apple."}
    ],
    "video": "path/to/video"
}

single video clip:

<video> is a placeholder. You can define the start and end time of each clip, in seconds.

{
    "conversations": [
        {"from": "human", "value": "<video>\nWhat is this?"},
        {"from": "gpt", "value": "It is an apple."}
    ],
    "video": {"path": "path/to/video", "start": 0, "end": 10}
}

lmdb image:

Each image in LMDB is identified by one filename (the LMDB file) and one key.

{
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is this?"},
        {"from": "gpt", "value": "It is an apple."}
    ],
    "image": {"lmdb_file": "path/to/file", "lmdb_key": "key"}
}

interleaved format:

{
    "conversations": [
        {"from": "human", "value": "<image-1><video-1><video-2><image-2><image-3>\nWhat is this?"},
        {"from": "gpt", "value": "It is an apple."}
    ],
    "data": [
        {
            "type": "image",
            "image_list": [
                "path/to/image1",
                {"lmdb_file": "path/to/lmdb", "lmdb_key": "image2"},
                {"lmdb_file": "path/to/lmdb", "lmdb_key": "image3"}
            ]
        },
        {
            "type": "video",
            "video_list": [
                {"path": "path/to/video1", "start": 0, "end": 10},
                "path/to/video2"
            ]
        }
    ]
}
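To make the relationship between the formats concrete, the sketch below moves the top-level `image` / `video` fields of the simpler formats into the interleaved `data` layout. It is only a reading of the examples above, not the logic of prepare.sh: `to_interleaved` is a hypothetical name, and the conversation text (including placeholders) is deliberately left untouched.

```python
def _as_list(value):
    """Wrap a single path or dict in a list; copy lists as they are."""
    return list(value) if isinstance(value, list) else [value]


def to_interleaved(sample):
    """Return a copy of `sample` with "image"/"video" moved under "data".

    Samples that already carry a "data" field are returned unchanged.
    """
    if "data" in sample:
        return dict(sample)
    out = {k: v for k, v in sample.items() if k not in ("image", "video")}
    data = []
    if "image" in sample:
        data.append({"type": "image", "image_list": _as_list(sample["image"])})
    if "video" in sample:
        data.append({"type": "video", "video_list": _as_list(sample["video"])})
    out["data"] = data
    return out
```

For instance, a sample with `"video": {"path": "path/to/video", "start": 0, "end": 10}` becomes one with a single `video` entry in `data` whose `video_list` holds that clip description.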

🔖 Note: Placeholders (,