Data Filtering for Open-Qwen2VL Pre-Training Data

April 2, 2025

Install

Our data filtering code is built on the DataComp repo. First, install the required packages:

cd data_filtering
pip install -r requirements.txt

Data Downloading

CC3M-CC12M-SBU

wget https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_filtered.json
img2dataset --url_list ccs_filtered.json --input_format "json" \
        --url_col "url" --caption_col "caption" --output_format webdataset \
        --output_folder ccs_webdataset --processes_count 32 --thread_count 128 --image_size 512 \
        --resize_only_if_bigger=True --resize_mode="keep_ratio" --skip_reencode=True \
        --enable_wandb False
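Before launching a long download, it can help to confirm that the annotation file matches the column names passed to img2dataset. The following is a minimal sketch, assuming `ccs_filtered.json` is a JSON list of objects with `url` and `caption` string fields (inferred from the `--url_col` and `--caption_col` flags above, not verified against the file). The helper names `summarize_records` and `load_and_summarize` are hypothetical.

```python
import json


def summarize_records(records, url_key="url", caption_key="caption"):
    """Count records that have a non-empty url and caption string."""
    usable = 0
    missing = 0
    for rec in records:
        url = rec.get(url_key)
        caption = rec.get(caption_key)
        has_url = isinstance(url, str) and url.strip() != ""
        has_caption = isinstance(caption, str) and caption.strip() != ""
        if has_url and has_caption:
            usable += 1
        else:
            missing += 1
    return {"total": usable + missing, "usable": usable, "missing": missing}


def load_and_summarize(path):
    """Load a JSON list of records from disk and summarize it.

    Assumes the file holds a single JSON array of objects.
    """
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return summarize_records(records)
```

Calling `load_and_summarize("ccs_filtered.json")` after the `wget` step returns the counts. A large `missing` value would suggest the column names do not match the file.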

DataComp-Medium-128M

Please follow the official scripts to download it in webdataset format.

Data Quality Score Generation

DFN-CLIP

Since we do not have access to the DFN model checkpoint, we directly use the released uids of the high-quality subset selected by DFN.
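If the uids are available as 32-character hexadecimal strings, they can be turned into a subset file for resharding. The sketch below assumes the DataComp subset convention as we understand it: a sorted NumPy array of dtype `u8,u8`, where each 128-bit uid is split into two unsigned 64-bit integers. The function names are hypothetical. The released DFN uids may already be in `.npy` form, in which case no conversion is needed.

```python
def uid_to_pair(uid):
    """Split a 32-hex-character uid into (high, low) unsigned 64-bit ints."""
    if len(uid) != 32:
        raise ValueError(f"expected a 32-character hex uid, got {len(uid)} characters")
    return int(uid[:16], 16), int(uid[16:], 16)


def uids_to_sorted_pairs(uids):
    """Convert hex uids to de-duplicated (high, low) pairs in sorted order."""
    return sorted({uid_to_pair(u) for u in uids})


def save_subset(pairs, path):
    """Write the pairs as a structured u8,u8 array (assumed resharder input)."""
    import numpy as np  # imported lazily so the helpers above need only the stdlib

    arr = np.array(pairs, dtype=np.dtype("u8,u8"))
    np.save(path, arr)
```

For example, `save_subset(uids_to_sorted_pairs(uids), "dfn_subset.npy")` would write a file that can be passed to `resharder.py` through `-s`; the output filename here is only illustrative.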

MLM-Filter

MLM-Filter uses an efficient MLLM to generate four distinct and comprehensive metrics that assess the quality of each image-text sample. Please follow the official repo to run large-scale quality score generation with mlm-filter-qwen2.5-1.5b-gpt4o.

Data Filtering for MLM-Filter

INPUT_DATADIR="path/to/datacomp/shards"
python baselines.py --metadata_dir $INPUT_DATADIR --save_path medium_filter_results/medium_semantic_understanding_th_85.npy --name llava_semantic_understanding_score --threshold 85
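mkdir -p "path/to/datacomp/medium_semantic_understanding_th_85"
# The original command read from $DOWNLOAD_DIR, which is never defined here; we assume it is the shard directory set in INPUT_DATADIR.
python resharder.py -i "$INPUT_DATADIR" -o "path/to/datacomp/medium_semantic_understanding_th_85" -s medium_filter_results/medium_semantic_understanding_th_85.npy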
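Conceptually, the filtering step keeps the uids of samples whose semantic understanding score reaches the threshold, then reshards only those samples. The following is a rough sketch of that selection logic, not the actual `baselines.py` implementation. It assumes each metadata row is a dict with a `uid` field and a score field, and that the comparison is inclusive (`>=`); both are assumptions.

```python
def select_uids(rows, score_key, threshold):
    """Return uids of rows whose score is at or above the threshold.

    Rows with a missing score are skipped rather than treated as zero.
    """
    kept = []
    for row in rows:
        score = row.get(score_key)
        if score is None:
            continue
        if score >= threshold:  # inclusive comparison is an assumption
            kept.append(row["uid"])
    return kept


# Toy metadata rows for illustration only; real uids are 32-character hex strings.
rows = [
    {"uid": "a", "llava_semantic_understanding_score": 90},
    {"uid": "b", "llava_semantic_understanding_score": 85},
    {"uid": "c", "llava_semantic_understanding_score": 60},
    {"uid": "d"},
]
selected = select_uids(rows, "llava_semantic_understanding_score", 85)
```

With the toy rows above, `selected` contains the uids of the first two rows. A higher threshold gives a smaller subset with higher scores.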