CANDI: Hybrid Discrete-Continuous Diffusion Models
October 27, 2025 ยท View on GitHub
Overview
CANDI (Continuous and Discrete Diffusion) is a novel hybrid diffusion model that combines continuous and discrete noise processes for high-quality text generation. This approach bridges the gap between continuous diffusion models used in computer vision and discrete token-based language modeling, offering the best of both worlds.

Method
CANDI uses a hybrid kernel to coordinate both discrete and continuous corruption explicitly throughout training:

This code base is built on the DUO codebase, which is available at this link: https://github.com/s-sahoo/duo?tab=readme-ov-file.
Installation
- Clone the repository:
git clone https://github.com/patrickpynadath1/candi.git
cd candi
- Install dependencies:
pip install -r requirements.txt
- (Optional) Install Flash Attention for faster training:
pip install flash-attn --no-build-isolation
- Download the OWT data
bash manual_download.sh
Experiments
This codebase includes the code for running experiments on Text8 and OWT. We will integrate the QM9 experiments later. In general, we re-use the same experimental methodology and codebase from https://github.com/kuleshov-group/discrete-diffusion-guidance.
Training
We include scripts for training models in scripts/slurm_scripts.
Frontier Analysis
Run temperature sweeps for frontier analysis using the following scripts:
# OpenWebText sweeps
bash scripts/gen_ppl_owt_candi_sweep.sh
# Text8 sweeps
bash scripts/gen_text8_candi_sweep.sh
Configuration
The project uses Hydra for configuration management. Key configuration files:
- Algorithm configs:
configs/algo/- Different diffusion algorithms (CANDI, MDLM, SEDD, etc.) - Data configs:
configs/data/- Dataset configurations - Model configs:
configs/model/- Model architecture settings - Base config:
configs/config.yaml- Main configuration file
Key Components
algo.py: Core algorithm implementations (CANDI, MDLM, DUO, etc.)main.py: Main training and evaluation scriptdataloader.py: Data loading and preprocessing utilitiesmodels/: Model architectures (DiT)metrics.py: Evaluation metrics and utilitiestrainer_base.py: Base trainer class with common functionality
Citation
If you use this code in your research, please cite:
@article{pynadath2025candi,
title={CANDI: Hybrid Discrete-Continuous Diffusion Models},
author={Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang},
journal={arXiv preprint},
year={2025}
}
For more details, visit our project page or check out the paper.