πŸš€ Data Preparation

October 25, 2025 Β· View on GitHub

We demonstrate our data storage format and provide a specific example of how to prepare new data from scratch

πŸ“‚ Data Hierarchy

The directory structure should look like this. You should prepare your own playground dir:

Eagle2
β”œβ”€β”€ deepspeed_configs
β”œβ”€β”€ document
β”œβ”€β”€ eaglevl
β”œβ”€β”€ playground
β”‚   β”œβ”€β”€ sft_data
β”‚   β”œβ”€β”€ sft_jsonl
β”‚   └── sft_recipe
β”œβ”€β”€ pretrained
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ README.md
β”œβ”€β”€ shell
β”œβ”€β”€ streamlit_demo
β”œβ”€β”€ tools

πŸ“Œ Level 1: sft_recipe

A recipe is a collection of various data sources. For example:

playground/sft_recipe/debug.json

{
    "LLaVAR-Instruct-16K": {
        "root": "",
        "annotation": "playground/sft_jsonl/LLaVAR-Instruct-16K.jsonl",
        "weight": 1, 
        "weight_description": "Larger weight means samples appear more frequently in later training stages.",
        "repeat_time": 1, 
        "repeat_time_description": "Number of times data is repeated during training.",
        "length": 15500,
        "length_description": "Reference only; actual length determined by the JSONL file.",
        "modality": ["image"]
    }
}

πŸ“Œ Level 2: sft_jsonl and sft_data

Each data source corresponds to a jsonl file in sft_jsonl dir, and the image/video data are stored in sft_data.

Example entry in playground/sft_jsonl/LLaVAR-Instruct-16K.jsonl:

{
    "conversations": [
        {"from": "system", "value": "This system prompt is optional."},
        {"from": "human", "value": "<image>\nWhat is the main idea of the quote from Albert Camus?"},
        {"from": "gpt", "value": "The main idea of the quote by Albert Camus..."}
    ],
    "data": [
        {
            "type": "image",
            "image_list": ["path/to/image"]
        }
    ],
}

πŸ”„ IMPORTANT: Converting Data to Training Format

We don't directly use the above generated JSONL file for training. Instead, we further process the data for:

  • πŸ“Š Calculating token lengths (essential for data packing)
  • 🎬 Selecting relevant frames from video data
  • 🚫 Filtering out incorrect samples
  • πŸ“¦ Converting JSONL to parquet format for efficient reading and storage (parquet file only stores meta information)

Use this command for the conversion:

# single node
bash shell/prepare.sh playground/sft_recipe/debug.json

Upon completion, a new recipe file is generated, named with a .prepare.json suffix at the original location. You can use this recipe for subsequent training.

⚠️ Note: It's important to review prepare.sh to confirm settings like the number of video frames sampled and the tokenizer used.

✨ Optimized Workflow Tip: Currently, data preparation still requires GPU resources. However, we've implemented a caching mechanismβ€”files with the same MD5 hash won't be regenerated. Further script based on Ray coming soon! 🚧

🎨 Supported Modalities and Formats

Use one of the following formats in JSONL files. The prepare.sh script standardizes these formats automatically.

single image:

<image> is a placeholder

{
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is this?"},
        {"from": "gpt", "value": "It is an apple."}
    ],
    "image": "path/to/image"
}

multiple images:

<image-i> are placeholders

{
    "conversations": [
        {"from": "human", "value": "<image-1><image-2>\nWhat are these?"},
        {"from": "gpt", "value": "An apple and a banana."}
    ],
    "image": ["path/to/image-1", "path/to/image-2"]
}

single video:

<video> is a placeholder

{
    "conversations": [
        {"from": "human", "value": "<video>\nWhat is this?"},
        {"from": "gpt", "value": "It an apple"}
    ],
    "video": "path/to/video"
}

single video clip:

<video> is a placeholder. You can define the starting and ending time of every clip (second)

{
    "conversations": [
        {"from": "human", "value": "<video>\nWhat is this?"},
        {"from": "gpt", "value": "It an apple"}
    ],
    "video": {"path": "path/to/video", "start": 0, "end": 10}
}

lmdb image:

The image in LMDB corresponds to one filename and one key.

{
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is this?"},
        {"from": "gpt", "value": "It an apple"}
    ],
    "image": {"lmdb_file": "path/to/file", "lmdb_key": "key"}
}

interleaved format

{
    "conversations": [
        {"from": "human", "value": "<image-1><video-1><video-2><image-2><image-3>\nWhat is this?"},
        {"from": "gpt", "value": "It an apple"}
    ],
    "data": [
        {
            "type": "image", 
            "image_list"=[
                "path/to/image1",
                {"lmdb_file": "path/to/lmdb", "lmdb_key": "image2"},
                {"lmdb_file": "path/to/lmdb", "lmdb_key": "image3"}
            ]
        },
        {
            "type": "video",
            "video_list"=[
                {"path": "path/to/video1", "start": 0, "end": 10},
                "path/to/video2"
            ]
        }
    ]
}

πŸ”– Note: Placeholders (,