Collaborative Face Experts Fusion in Video Generation: Boosting Identity Consistency Across Large Face Poses

January 12, 2026

This repository is the official implementation of CoFE, a novel identity-preserving Image-to-Video (I2V) generation method specifically designed for challenging large face pose scenarios.

News

  • [2026.1.12] We have released the ~160K-sample LaFID data list derived from OpenHumanVid.
  • [2025.12.28] We have open-sourced the code for our data processing pipeline.

Abstract

Current video generation models struggle to preserve identity under large face poses, for two main reasons: it is difficult to design an effective mechanism for integrating identity features into DiT architectures, and existing open-source video datasets lack targeted coverage of large face poses. To address these challenges, we present two key innovations. First, we propose Collaborative Face Experts Fusion (CoFE), which dynamically fuses complementary signals from three specialized experts within the DiT backbone: an identity expert that captures cross-pose invariant features, a semantic expert that encodes high-level visual context, and a detail expert that preserves pixel-level attributes such as skin texture and color gradients. Second, we introduce a data curation pipeline with three key components: Face Constraints to ensure diverse large-pose coverage, Identity Consistency to maintain a stable identity across frames, and Speech Disambiguation to align textual captions with actual speaking behavior. This pipeline yields LaFID-180K, a large-scale dataset of pose-annotated video clips designed for identity-preserving video generation. Experimental results on several benchmarks demonstrate that our approach significantly outperforms state-of-the-art methods in face similarity, FID, and CLIP semantic alignment.

Method Overview

Our approach integrates three specialized experts through a dynamic mixture mechanism, enabling robust identity preservation under challenging viewing conditions.
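As a rough illustration of what a dynamic mixture of three experts can look like, the sketch below fuses three equal-length feature vectors with softmax gate weights. It is a generic gated-fusion example, not the CoFE architecture: the function names, the plain-list features, and the idea of passing gate logits in directly are all assumptions made for illustration. In a real model the logits would be predicted by a learned layer inside the DiT backbone.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(identity, semantic, detail, gate_logits):
    """Return (fused_features, weights) for three equal-length feature vectors.

    Hypothetical helper: each expert contributes in proportion to its
    softmax gate weight, so the weights always sum to 1.
    """
    experts = [identity, semantic, detail]
    if len({len(e) for e in experts}) != 1:
        raise ValueError("all expert features must have the same length")
    if len(gate_logits) != len(experts):
        raise ValueError("expected one gate logit per expert")
    weights = softmax(gate_logits)
    fused = [
        sum(w * e[i] for w, e in zip(weights, experts))
        for i in range(len(identity))
    ]
    return fused, weights
```

With equal logits every expert receives weight 1/3; a much larger logit for one expert makes the fused output approach that expert's features.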

Data Processing Pipeline

The data processing pipeline in CoFE consists of three key components that collectively ensure high-quality, identity-consistent, and semantically grounded video-text pairs.

  • First, Face Constraints Filtering selects videos containing a single prominent face with substantial 3D pose variation, ensuring suitability for large-pose facial modeling.
  • Second, Identity Consistency Analysis verifies that the same person appears throughout each clip by measuring facial feature similarity across frames, thereby preserving identity integrity.
  • Third, Speech Disambiguation aligns spoken content with visual lip motion and enriches captions with explicit speaking-status descriptions using an LLM, resolving ambiguities in silent versus talking segments.

Together, these modules enable the construction of LaFID-180K—a dataset tailored for identity-preserving video generation under challenging pose conditions.

Usage

First, navigate to the data processing directory:

cd data_process

Step 1: Face Constraints Filtering

This step calculates basic face attributes (e.g., pose, quality) and then filters out videos that do not meet our criteria, such as those without a single prominent face or with insufficient pose variation.

python calc_face_info.py
python filter_face_constraints.py
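The filtering criteria can be pictured with a minimal sketch like the one below. It is not the logic of filter_face_constraints.py: the per-frame record layout (`num_faces`, `yaw`, `face_area`) and both thresholds are hypothetical values chosen only to illustrate the "single prominent face with sufficient pose variation" rule.

```python
def passes_face_constraints(frames, min_yaw_range=30.0, min_face_area=0.05):
    """Decide whether a clip satisfies illustrative face constraints.

    frames: list of dicts with hypothetical keys
      'num_faces' - number of detected faces in the frame
      'yaw'       - head yaw angle in degrees
      'face_area' - face box area as a fraction of the frame area
    """
    if not frames:
        return False
    for frame in frames:
        # Reject clips where any frame has zero or multiple faces.
        if frame["num_faces"] != 1:
            return False
        # Reject clips where the face is too small to be "prominent".
        if frame["face_area"] < min_face_area:
            return False
    yaws = [frame["yaw"] for frame in frames]
    # Require enough pose variation across the clip.
    return max(yaws) - min(yaws) >= min_yaw_range
```

A clip whose yaw sweeps from -20 to 25 degrees passes with the default threshold, while a near-frontal clip that only moves by 5 degrees is dropped.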

Step 2: Identity Consistency Analysis

This step ensures the identity of the person remains stable within each video.

python filter_ID_consistency.py

We also provide a variant that performs ID clustering:

python ID_cluster.py
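The consistency check in this step can be sketched as a cosine-similarity comparison between per-frame face embeddings. The code below is an assumption-laden illustration rather than the contents of filter_ID_consistency.py: the embeddings are plain lists standing in for the output of whatever face recognition model is used, the first frame serves as the reference, and the 0.6 threshold is a placeholder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_identity_consistent(embeddings, threshold=0.6):
    """True if every frame embedding stays close to the first frame's.

    embeddings: list of per-frame face embeddings (hypothetical input).
    threshold:  illustrative minimum cosine similarity to the reference.
    """
    if not embeddings:
        return False
    reference = embeddings[0]
    return all(
        cosine_similarity(reference, emb) >= threshold
        for emb in embeddings[1:]
    )
```

Comparing against a single reference frame keeps the sketch short; comparing against the mean embedding or all frame pairs would be stricter alternatives.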

Step 3: Speech Disambiguation

This final step aligns video content with speech, enriching captions with speaking-status tags.

First, transcribe your videos with a speech-to-text model; we recommend OpenAI's Whisper for this task. Save the transcriptions in a format that can be mapped back to your video files. Then run speech_disambiguation.py to refine the captions (text prompts).

python speech_disambiguation.py
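To make the idea of a speaking-status tag concrete, here is a small rule-based sketch. It is a stand-in for, not a copy of, speech_disambiguation.py, which relies on an LLM. The input mirrors the `segments` list that Whisper's `transcribe` returns (dicts with `start`, `end`, and `text`); the function names, the 0.2 ratio threshold, and the appended sentences are assumptions.

```python
def speaking_ratio(segments, clip_duration):
    """Fraction of the clip covered by non-empty transcript segments.

    segments: list of dicts with 'start', 'end' (seconds) and 'text',
    assumed to be non-overlapping. The result is capped at 1.0.
    """
    if clip_duration <= 0:
        raise ValueError("clip_duration must be positive")
    spoken = 0.0
    for seg in segments:
        if not seg["text"].strip():
            continue  # ignore segments with no recognized words
        start = max(0.0, seg["start"])
        end = min(clip_duration, seg["end"])
        spoken += max(0.0, end - start)
    return min(1.0, spoken / clip_duration)


def tag_caption(caption, segments, clip_duration, min_ratio=0.2):
    """Append an explicit speaking-status sentence to a caption."""
    if speaking_ratio(segments, clip_duration) >= min_ratio:
        status = "The person is speaking."
    else:
        status = "The person is not speaking."
    return f"{caption.rstrip()} {status}"
```

For example, a 4-second clip with one 2-second transcript segment has a speaking ratio of 0.5 and is tagged as speaking, while a clip with no segments is tagged as not speaking.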

Citation

If you find our paper and code useful in your research, please consider starring this repository and citing our work.

@article{cofe2025,
  title={From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts},
  author={Your Name and Others},
  journal={arXiv preprint},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

We extend our gratitude to the following projects and communities for their invaluable contributions to our research:

  • The open-source community for providing the OpenHumanVid dataset, which formed the foundation of our work.
  • The developers of the Wan-Model for their open-source model, which served as a crucial building block.
  • The authors and contributors of CogVideoX, LTX-Video, and ConsisID, whose work provided significant baselines for our comparative analysis.