Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization (AAAI 2025)

January 23, 2026

This repository contains the official implementation of Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization.

Run BLIP-Diffusion

Run the following command:

python run_blip_diffusion.py \
    --data_dir='./example/01.jpg' \
    --data_class='cat' \
    --prompt='jumping' \
    --output_dir='./outputs'
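
To run the command above over several images, the call can be wrapped in a small helper. The following is only a sketch: the helper name `blip_diffusion_command` and the batch loop are hypothetical and not part of this repository, and the loop assumes every image in `./example` belongs to the same class. Only the four flags shown above are used.

```python
import shlex
from pathlib import Path


def blip_diffusion_command(image_path, data_class, prompt, output_dir="./outputs"):
    """Build the argument list for one run_blip_diffusion.py call.

    Hypothetical helper: it only mirrors the four flags from the README.
    """
    return [
        "python", "run_blip_diffusion.py",
        f"--data_dir={image_path}",
        f"--data_class={data_class}",
        f"--prompt={prompt}",
        f"--output_dir={output_dir}",
    ]


if __name__ == "__main__":
    # Assumption: all .jpg files in ./example show a subject of class 'cat'.
    for image in sorted(Path("./example").glob("*.jpg")):
        cmd = blip_diffusion_command(image, "cat", "jumping")
        # Print the command; to execute it, pass cmd to
        # subprocess.run(cmd, check=True) instead.
        print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when a prompt contains spaces.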

Run ELITE

Download the global and local mappers from the ELITE homepage.
Run the baseline (run_elite.py) or ours (run_elite_ours.py):

python run_elite.py \
    --global_mapper_path="path-to-global_mapper.pt" \
    --local_mapper_path="path-to-local_mapper.pt" \
    --test_data_dir='./example/01.jpg' \
    --template='a * riding a bike' \
    --output_dir='./outputs'
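
To compare the baseline and our method on the same input, the two scripts can be called with identical arguments. The sketch below is a hypothetical helper, not part of this repository, and it assumes that run_elite_ours.py accepts the same flags as run_elite.py. The mapper paths are placeholders, as in the command above.

```python
import shlex


def elite_command(global_mapper_path, local_mapper_path, image_path, template,
                  output_dir="./outputs", ours=False):
    """Build the argument list for an ELITE run.

    Hypothetical helper. Assumption: run_elite_ours.py takes the same
    flags as run_elite.py.
    """
    script = "run_elite_ours.py" if ours else "run_elite.py"
    return [
        "python", script,
        f"--global_mapper_path={global_mapper_path}",
        f"--local_mapper_path={local_mapper_path}",
        f"--test_data_dir={image_path}",
        f"--template={template}",
        f"--output_dir={output_dir}",
    ]


if __name__ == "__main__":
    # Placeholder paths; replace them with the downloaded mapper files.
    for ours in (False, True):
        cmd = elite_command(
            "path-to-global_mapper.pt",
            "path-to-local_mapper.pt",
            "./example/01.jpg",
            "a * riding a bike",
            ours=ours,
        )
        # Print the command; to execute it, pass cmd to
        # subprocess.run(cmd, check=True) instead.
        print(shlex.join(cmd))
```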

Computing Infrastructure

GPU model: TITAN RTX
Memory: 13 GB (BLIP-Diffusion) / 9 GB (ELITE)
Operating system: Ubuntu 18.04.5 LTS