FaceID-6M: A Large-Scale, Open-Source FaceID Customization Dataset

April 11, 2025

This repository contains resources referenced in the paper FaceID-6M: A Large-Scale, Open-Source FaceID Customization Dataset.

If you find this repository helpful, please cite the following:

🥳 News

Stay tuned! More related work will be updated!

[3 Jan, 2025] The repository is created.
[3 Mar, 2025] We release the first version of the paper.

FaceID-6M is the first large-scale, open-source FaceID dataset, containing 6 million high-quality text-image pairs. It is filtered from LAION-5B, which includes billions of diverse, publicly available text-image pairs, through a rigorous image and text filtering process. For image filtering, we apply a pre-trained face detection model to remove images that lack human faces, contain more than three faces, have low resolution, or feature faces occupying less than 4% of the total image area. For text filtering, we implement a keyword-based strategy to retain descriptions containing human-related terms, including references to people (e.g., man), nationality (e.g., Chinese), ethnicity (e.g., East Asian), professions (e.g., engineer), and names (e.g., Donald Trump). The resulting dataset is intended for training FaceID customization models and is released as an open resource for research and development.
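The filtering rules above can be sketched as two predicates. This is a minimal illustration, not the authors' pipeline: the face boxes are assumed to come from some face detector, and `MIN_SIDE`, `KEYWORDS`, and the choice to apply the 4% rule to the largest face are assumptions that the description does not specify.

```python
import re
from typing import List, Tuple

# The 3-face and 4% limits come from the description above.
MAX_FACES = 3
MIN_FACE_AREA_RATIO = 0.04
# Assumptions: the resolution cutoff and the keyword list are not given here.
MIN_SIDE = 256
KEYWORDS = {"man", "woman", "chinese", "engineer"}

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def keep_image(width: int, height: int, face_boxes: List[Box]) -> bool:
    """Return True if the image passes the face-based filters."""
    # Drop images with no faces or with more than three faces.
    if not face_boxes or len(face_boxes) > MAX_FACES:
        return False
    # Drop low-resolution images (hypothetical threshold).
    if min(width, height) < MIN_SIDE:
        return False
    # Assumption: the 4% rule is checked against the largest detected face.
    largest = max((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in face_boxes)
    return largest / (width * height) >= MIN_FACE_AREA_RATIO


def keep_text(caption: str) -> bool:
    """Return True if the caption contains at least one human-related keyword."""
    tokens = set(re.findall(r"[a-z]+", caption.lower()))
    return bool(tokens & KEYWORDS)
```

Single-word matching does not cover multi-word terms such as full names; a real keyword filter would need phrase matching for those.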

Comparison with Previous Works

FaceID Fidelity

Based on these results, the model trained on our FaceID-6M dataset achieves performance comparable to the official InstantID model in maintaining FaceID fidelity. For example, in cases 2 and 3, both the official InstantID model and the FaceID-6M-trained model generate the intended images from the input. This indicates that FaceID-6M is effective for training robust FaceID customization models.

Scaling Results

To evaluate the impact of dataset size on model performance and to optimize the trade-off between performance and training cost, we conduct scaling experiments by sampling subsets of different sizes from FaceID-6M. The sampled dataset sizes are: (1) 1K, (2) 10K, (3) 100K, (4) 1M, (5) 2M, (6) 4M, and (7) the full dataset (6M). For the experimental setup, we use the InstantID FaceID customization framework and adhere to the configurations used in the previous quantitative evaluations. The trained models are tested on a subset of the COCO2017 test set, with Face Sim, CLIP-T, and CLIP-I as the evaluation metrics.
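One way to draw such subsets is sketched below. It is only an illustration under stated assumptions: the text does not say how the subsets were sampled, so the nested design (each smaller subset contained in the larger ones), the fixed seed, and the name `sample_subsets` are all hypothetical.

```python
import random
from typing import Dict, List

# Subset sizes listed in the text (the full 6M set needs no sampling).
SUBSET_SIZES = [1_000, 10_000, 100_000, 1_000_000, 2_000_000, 4_000_000]


def sample_subsets(num_pairs: int, sizes: List[int], seed: int = 0) -> Dict[int, List[int]]:
    """Return one sorted list of sample indices per requested size."""
    rng = random.Random(seed)
    order = list(range(num_pairs))
    # Shuffle once and take prefixes, so smaller subsets nest inside larger ones.
    rng.shuffle(order)
    # Sizes larger than the dataset are skipped.
    return {size: sorted(order[:size]) for size in sizes if size <= num_pairs}


# Usage (indices refer to line numbers in the dataset's .jsonl file):
# subsets = sample_subsets(6_000_000, SUBSET_SIZES)
```

Nesting the subsets means any difference between two runs reflects the added data rather than a different random draw.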

The results show a clear positive relationship between training set size and the performance of FaceID customization models. For example, the Face Sim score increases from 0.38 with 2M training samples to 0.51 with 4M, and to 0.63 with the full 6M. These results underscore the contribution of our FaceID-6M dataset to FaceID customization research.

Released FaceID-6M dataset

We release two versions of our constructed dataset:

FaceID-70K: This is a subset of our FaceID-6M, obtained by further removing images smaller than 1024 pixels in either width or height. It consists of approximately 70K text-image pairs.
FaceID-6M: This is our constructed full FaceID customization dataset.

Please note that due to the large file size, we have pre-split each archive into multiple smaller parts. Before use, please run the merge and unzip commands to restore the full file. Take the InstantID-FaceID-70K dataset as an example:

cat laion_1024.tar.gz.* > laion_1024.tar.gz
tar zxvf laion_1024.tar.gz
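Where `cat` and `tar` are unavailable (for example on Windows), the same two steps can be approximated with the Python standard library. This is a sketch, not part of the released tooling: `merge_parts` and `extract` are hypothetical helper names, and it assumes the part suffixes sort lexicographically into the correct order, as the shell glob above also does.

```python
import glob
import shutil
import tarfile


def merge_parts(prefix: str) -> str:
    """Concatenate prefix.* parts in sorted order, like `cat prefix.* > prefix`."""
    parts = sorted(glob.glob(glob.escape(prefix) + ".*"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {prefix}")
    with open(prefix, "wb") as merged:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)
    return prefix


def extract(archive: str, dest: str = ".") -> None:
    """Unpack a gzip-compressed tar archive, like `tar zxvf archive`."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)


# Usage:
# extract(merge_parts("laion_1024.tar.gz"))
```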

Index

After restoring the full dataset, you will find a large number of .png and .npy files, as well as a ./face directory and a *.jsonl file:

*.png: Image files.
*.npy: The pre-computed landmarks of the face in the corresponding image, which are required to train InstantID-based models. If you do not need them, you can ignore these files.
./face: The directory containing the face files.
*.jsonl: Descriptions or texts. Ignore the file paths listed in the .jsonl file and use the line number instead to locate the corresponding image, face, and .npy files. For example, the 0th line in the .jsonl file corresponds to 0.png, 0.npy, and ./face/0.png.
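The line-number convention can be sketched as a small iterator. This is a minimal example under stated assumptions: `iter_samples` is a hypothetical helper, the name of the .jsonl file is passed in by the caller, and the keys inside each JSON record are not documented here, so the record is returned as-is.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_samples(root: str, jsonl_name: str) -> Iterator[Dict]:
    """Yield one record per .jsonl line, with file paths derived from the line number."""
    root_dir = Path(root)
    with open(root_dir / jsonl_name, encoding="utf-8") as f:
        for idx, line in enumerate(f):  # line numbers start at 0
            if not line.strip():
                continue  # skip blank lines without shifting the index
            yield {
                "index": idx,
                "meta": json.loads(line),  # text record; stored paths are ignored
                "image": root_dir / f"{idx}.png",
                "landmarks": root_dir / f"{idx}.npy",
                "face": root_dir / "face" / f"{idx}.png",
            }
```

The landmark files can then be read with `numpy.load` and the images with any image library.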

Released FaceID Customization Models

We release two versions of trained InstantID models:

InstantID-FaceID-70K: Model trained on our FaceID-70K dataset.
InstantID-FaceID-6M: Model trained on our FaceID-6M dataset.

from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="InstantX/InstantID", filename="ControlNetModel/config.json", local_dir="./checkpoints")
hf_hub_download(repo_id="InstantX/InstantID", filename="ControlNetModel/diffusion_pytorch_model.safetensors", local_dir="./checkpoints")
hf_hub_download(repo_id="InstantX/InstantID", filename="ip-adapter.bin", local_dir="./checkpoints")

If you cannot access Hugging Face, you can use hf-mirror to download the models.

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download InstantX/InstantID --local-dir checkpoints

For the face encoder, you need to manually download it via this URL to models/antelopev2, as the default link is invalid. Once you have prepared all models, the folder tree should look like this:

  .
  ├── models
  ├── checkpoints
  ├── ip_adapter
  ├── pipeline_stable_diffusion_xl_instantid.py
  ├── download.py
  ├── download.sh
  ├── get_face_info.py
  ├── infer_from_pkl.py
  ├── infer.py
  ├── train_instantId_sdxl.py
  ├── train_instantId_sdxl.sh
  └── README.md
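Before training, it can help to check that the downloads landed where the tree above expects them. The sketch below is a hypothetical helper, not part of the repository; the paths are taken from the download commands and the tree above, assuming `local_dir="./checkpoints"` was used as shown.

```python
from pathlib import Path
from typing import List

# Paths taken from the download commands and folder tree above.
EXPECTED = [
    "models/antelopev2",
    "checkpoints/ControlNetModel/config.json",
    "checkpoints/ControlNetModel/diffusion_pytorch_model.safetensors",
    "checkpoints/ip-adapter.bin",
    "ip_adapter",
    "pipeline_stable_diffusion_xl_instantid.py",
    "train_instantId_sdxl.sh",
]


def missing_paths(root: str = ".") -> List[str]:
    """Return the expected files or directories that do not exist under root."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]


# Usage:
# for path in missing_paths("."):
#     print("missing:", path)
```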

Step 2. Download Required Dataset

Please download our released dataset, then merge the parts and extract the archive:

cat laion_1024.tar.gz.* > laion_1024.tar.gz
tar zxvf laion_1024.tar.gz

Step 3. Training

Fill in MODEL_NAME, ENCODER_NAME, ADAPTOR_NAME, CONTROLNET_NAME, and JSON_FILE in our provided training script ./train_instantId_sdxl.sh, where:
1. MODEL_NAME refers to the backbone diffusion model, e.g., stable-diffusion-xl-base-1.0
2. ENCODER_NAME refers to the downloaded encoder, e.g., image_encoder
3. ADAPTOR_NAME and CONTROLNET_NAME refer to the pre-trained official InstantID model, e.g., checkpoints/ip-adapter.bin and checkpoints/ControlNetModel
4. JSON_FILE refers to our released FaceID-70K or FaceID-6M dataset.
Run the training script, for example: bash ./train_instantId_sdxl.sh

Inference

Fill in base_model_path, face_adapter, controlnet_path, prompt0, and face_image in our provided inference script ./infer_from_pkl.py, where:
1. base_model_path refers to the backbone diffusion model, e.g., stable-diffusion-xl-base-1.0
2. face_adapter and controlnet_path refer to your trained model, e.g., checkpoints/ip-adapter.bin and checkpoints/ControlNetModel
3. prompt0 and face_image refer to your test sample.
Run the inference script, for example: python ./infer_from_pkl.py

Contact

If you have any issues or questions about this repo, feel free to contact shuhewang@student.unimelb.edu.au
