π Data Preparation
October 25, 2025 Β· View on GitHub
We demonstrate our data storage format and provide a specific example of how to prepare new data from scratch
π Data Hierarchy
The directory structure should look like this. You should prepare your own playground dir:
Eagle2
βββ deepspeed_configs
βββ document
βββ eaglevl
βββ playground
β βββ sft_data
β βββ sft_jsonl
β βββ sft_recipe
βββ pretrained
βββ pyproject.toml
βββ README.md
βββ shell
βββ streamlit_demo
βββ tools
π Level 1: sft_recipe
A recipe is a collection of various data sources. For example:
playground/sft_recipe/debug.json
{
"LLaVAR-Instruct-16K": {
"root": "",
"annotation": "playground/sft_jsonl/LLaVAR-Instruct-16K.jsonl",
"weight": 1,
"weight_description": "Larger weight means samples appear more frequently in later training stages.",
"repeat_time": 1,
"repeat_time_description": "Number of times data is repeated during training.",
"length": 15500,
"length_description": "Reference only; actual length determined by the JSONL file.",
"modality": ["image"]
}
}
π Level 2: sft_jsonl and sft_data
Each data source corresponds to a jsonl file in sft_jsonl dir, and the image/video data are stored in sft_data.
Example entry in playground/sft_jsonl/LLaVAR-Instruct-16K.jsonl:
{
"conversations": [
{"from": "system", "value": "This system prompt is optional."},
{"from": "human", "value": "<image>\nWhat is the main idea of the quote from Albert Camus?"},
{"from": "gpt", "value": "The main idea of the quote by Albert Camus..."}
],
"data": [
{
"type": "image",
"image_list": ["path/to/image"]
}
],
}
π IMPORTANT: Converting Data to Training Format
We don't directly use the above generated JSONL file for training. Instead, we further process the data for:
- π Calculating token lengths (essential for data packing)
- π¬ Selecting relevant frames from video data
- π« Filtering out incorrect samples
- π¦ Converting JSONL to parquet format for efficient reading and storage (parquet file only stores meta information)
Use this command for the conversion:
# single node
bash shell/prepare.sh playground/sft_recipe/debug.json
Upon completion, a new recipe file is generated, named with a .prepare.json suffix at the original location. You can use this recipe for subsequent training.
β οΈ Note: It's important to review prepare.sh to confirm settings like the number of video frames sampled and the tokenizer used.
β¨ Optimized Workflow Tip: Currently, data preparation still requires GPU resources. However, we've implemented a caching mechanismβfiles with the same MD5 hash won't be regenerated. Further script based on Ray coming soon! π§
π¨ Supported Modalities and Formats
Use one of the following formats in JSONL files. The prepare.sh script standardizes these formats automatically.
single image:
<image> is a placeholder
{
"conversations": [
{"from": "human", "value": "<image>\nWhat is this?"},
{"from": "gpt", "value": "It is an apple."}
],
"image": "path/to/image"
}
multiple images:
<image-i> are placeholders
{
"conversations": [
{"from": "human", "value": "<image-1><image-2>\nWhat are these?"},
{"from": "gpt", "value": "An apple and a banana."}
],
"image": ["path/to/image-1", "path/to/image-2"]
}
single video:
<video> is a placeholder
{
"conversations": [
{"from": "human", "value": "<video>\nWhat is this?"},
{"from": "gpt", "value": "It an apple"}
],
"video": "path/to/video"
}
single video clip:
<video> is a placeholder. You can define the starting and ending time of every clip (second)
{
"conversations": [
{"from": "human", "value": "<video>\nWhat is this?"},
{"from": "gpt", "value": "It an apple"}
],
"video": {"path": "path/to/video", "start": 0, "end": 10}
}
lmdb image:
The image in LMDB corresponds to one filename and one key.
{
"conversations": [
{"from": "human", "value": "<image>\nWhat is this?"},
{"from": "gpt", "value": "It an apple"}
],
"image": {"lmdb_file": "path/to/file", "lmdb_key": "key"}
}
interleaved format
{
"conversations": [
{"from": "human", "value": "<image-1><video-1><video-2><image-2><image-3>\nWhat is this?"},
{"from": "gpt", "value": "It an apple"}
],
"data": [
{
"type": "image",
"image_list"=[
"path/to/image1",
{"lmdb_file": "path/to/lmdb", "lmdb_key": "image2"},
{"lmdb_file": "path/to/lmdb", "lmdb_key": "image3"}
]
},
{
"type": "video",
"video_list"=[
{"path": "path/to/video1", "start": 0, "end": 10},
"path/to/video2"
]
}
]
}
π Note: Placeholders (
,