MLM Filter
April 14, 2025
Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".
Release
- [12/30/2024] 🔥 We released a new-generation MLM-Filter model based on Qwen2.5-1.5B, mlm-filter-qwen2.5-1.5b-gpt4o. The instruction data were regenerated with GPT-4o. With the much smaller LLM backbone, inference efficiency is significantly improved. The llava codebase for mlm-filter model inference has been completely removed from this repository and integrated into LLaVA-Unified.
- [10/24/2024] 🔥 We released two new MLM-Filter models based on Llama 3, mlm-filter-llama-3-8b and mlm-filter-llama-3.2-3b.
- [2/25/2024] 🔥 We released Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters. We propose adopting fine-tuned Multimodal Language Models as effective and efficient data filters to select high-quality image-text pairs from large-scale web-crawled image-text data. Check out the paper.
Project Structure
- mlm_filter_scoring_single_image.py: Sample code for generating quality scores for a single image-text pair
- mlm_filter_scoring_datacomp_batch_inference.py: Sample code for large-scale quality score generation on Webdataset-format image-text data
- mlm_filter_scoring_datacomp_batch_inference_v2.py: Sample code for large-scale quality score generation on Webdataset-format image-text data with Llama3- or Qwen2.5-based MLM-Filter models
- run_inference.sh: Sample script for large-scale quality score generation on Webdataset-format image-text data on machines with 8 GPUs
Install
We strongly recommend using Python 3.10, e.g.,
conda create -n mlm_filter python=3.10
Then install the dependencies for quality score generation:
pip install git+https://github.com/Victorwz/LLaVA-Unified.git
Quality Score Generation
Inference on Single Image
python mlm_filter_scoring_single_image.py --image-path /path/to/image --caption "text caption"
Parameters to note:
- --metric: quality scoring metric for generation, select among image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, all
- --image-path: path to an image file or an image URL
- --caption: text caption
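For scripted use, the call above can be assembled in Python. The following is a minimal sketch, assuming the script sits in the working directory and accepts the three flags listed above; the helper name `build_scoring_command` is hypothetical and not part of this repository.

```python
import sys

# Metric names as listed in the parameter notes above.
METRICS = (
    "image_text_matching",
    "object_detail_fulfillment",
    "caption_text_quality",
    "semantic_understanding",
    "all",
)


def build_scoring_command(image_path, caption, metric):
    """Return the argv list for the single-image script (hypothetical helper)."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric {metric!r}; expected one of {METRICS}")
    return [
        sys.executable,
        "mlm_filter_scoring_single_image.py",
        "--metric", metric,
        "--image-path", image_path,
        "--caption", caption,
    ]


if __name__ == "__main__":
    cmd = build_scoring_command("/path/to/image", "text caption", "image_text_matching")
    print(" ".join(cmd))
    # To actually run the scoring script:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Validating the metric name before launching avoids loading the model only to fail on a typo.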
Inference on Webdataset Large-Scale Data
bash run_inference.sh ${GPU_START_ID} ${Metric} ${Model_Path} ${Data_Path} ${Tars_Per_GPU} ${Num_GPU}
Parameters to note:
- GPU_START_ID: for large-scale score generation across multiple machines, the index of the machine
- Metric: quality scoring metric for generation, select among image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, all
- Model_Path: path to the MLM-Filter model checkpoint
- Data_Path: path to the webdataset image-text tars
- Tars_Per_GPU: number of webdataset image-text tars for a single GPU to run inference on
- Num_GPU: number of GPUs on one machine, e.g. 1, 8, 16
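To make the role of the numeric arguments concrete, here is a minimal sketch of one way tars could be split across machines and GPUs. It assumes tars are numbered consecutively from 0 and that each GPU takes a contiguous block of `Tars_Per_GPU` tars. This is an illustrative assumption, not a description of what run_inference.sh actually does, and `tar_range_for_gpu` is a hypothetical helper.

```python
def tar_range_for_gpu(machine_index, local_gpu, tars_per_gpu, num_gpu):
    """Return the half-open range of tar indices handled by one GPU.

    Assumption (not taken from run_inference.sh): tars are numbered from 0
    and assigned in contiguous blocks, machine by machine, GPU by GPU.
    """
    if not 0 <= local_gpu < num_gpu:
        raise ValueError("local_gpu must be in [0, num_gpu)")
    global_gpu = machine_index * num_gpu + local_gpu
    start = global_gpu * tars_per_gpu
    return range(start, start + tars_per_gpu)


if __name__ == "__main__":
    # Second machine (index 1) with 8 GPUs and 4 tars per GPU.
    for gpu in range(8):
        tars = tar_range_for_gpu(1, gpu, tars_per_gpu=4, num_gpu=8)
        print(f"GPU {gpu}: tars {tars.start}..{tars.stop - 1}")
```

Under this scheme no tar is scored twice, provided each machine is launched with a distinct index.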
Fine-Tuning MLM as Data Filter
- Prepare data
Please download the 50k multimodal instructions and save them to ./data/mlm_filter_instruct_50k_gpt4v_cc12m_4k.json.
Please download the images from constituting datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: ocr_vqa_images_llava_v15.zip
- TextVQA: train_val_images
- VisualGenome: part1, part2
- CC12M: the images are available at the Huggingface Data Repo; extract them with unzip images.zip -d data/images
After downloading all of them, organize the data as follows in ./data/images:
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
└── cc12m
We repacked the OCR-VQA images ourselves so that none of the images included in the LLaVA-v1.5-665k instruction dataset are missing because of failed downloads.
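Before starting a training run, it can help to confirm that the layout above is in place. The following is a minimal sketch, assuming the directory names shown in the tree; `missing_image_dirs` is a hypothetical helper and not part of this repository.

```python
from pathlib import Path

# Sub-directories from the layout above, relative to ./data/images.
EXPECTED_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "cc12m",
]


def missing_image_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_image_dirs("./data/images")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected image directories are present.")
```

The check only verifies that the directories exist; it does not confirm that every image was downloaded.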
- Start training!
Please refer to LLaVA-Unified for more fine-tuning guidance.
Training script with DeepSpeed ZeRO-3: LLaVA_Unified/scripts/mlm_filter/finetune.sh.
Our Best CLIP Model on DataComp-Medium
We also open-sourced our pre-trained CLIP-ViT-B/32 checkpoint, trained under the DataComp-Medium Benchmark Controlled Setting, at weizhiwang/clip_datacomp_medium_itm_th_66_AND_odf_th_20_gpt4v. Our best model is trained on the data filtered by both the ITM (image_text_matching) and ODF (object_detail_fulfillment) quality scores.
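Filtering on two scores at once amounts to keeping a pair only if it clears both cutoffs. The sketch below illustrates this, assuming each sample carries numeric `itm_score` and `odf_score` fields and that the `th_66` and `th_20` parts of the checkpoint name are score thresholds. Both the field names and this reading of the name are assumptions, and the sample records are made up for illustration.

```python
def filter_by_itm_and_odf(samples, itm_threshold, odf_threshold):
    """Keep samples whose ITM and ODF scores both reach their thresholds."""
    return [
        s for s in samples
        if s["itm_score"] >= itm_threshold and s["odf_score"] >= odf_threshold
    ]


if __name__ == "__main__":
    # Made-up records for illustration only; these are not real scores.
    samples = [
        {"uid": "a", "itm_score": 80, "odf_score": 35},
        {"uid": "b", "itm_score": 70, "odf_score": 10},
        {"uid": "c", "itm_score": 50, "odf_score": 60},
    ]
    kept = filter_by_itm_and_odf(samples, itm_threshold=66, odf_threshold=20)
    print([s["uid"] for s in kept])
```

Whether the cutoff is inclusive is also an assumption; adjust the comparison to match the filtering rule actually used.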
License
MIT License
Contacts
For any questions or issues, please feel free to contact weizhiwang@ucsb.edu or open a GitHub issue.
Citation
Please cite our paper if you find this repository interesting or helpful in your research:
@article{mlm-filter,
title={Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters},
author={Wang, Weizhi and Mrini, Khalil and Yang, Linjie and Kumar, Sateesh and Tian, Yu and Yan, Xifeng and Wang, Heng},
journal={arXiv preprint arXiv:2403.02677},
year={2024},
}
Credits
MLM-Filter is developed based on