SANA: Efficient High-Resolution Image & Video Generation
March 5, 2026
SANA is an efficiency-oriented codebase for high-resolution image and video generation, providing complete training and inference pipelines.
This document specifies how to post-train (SFT/RL) a SANA-Image or SANA-Video model on Cosmos-RL.
Tutorial
For full documentation on post-training diffusion models, see the official Cosmos-RL documentation.
Configuration
Experiment: configurations for SANA can be found in configs/sana. We provide several preset config files:
- SFT
  - Image: sana-image-sft, sana-image-sft-lora
  - Video: sana-video-sft, sana-video-sft-lora
- RL
  - Image: sana-image-nft
  - Video: sana-video-nft
For a detailed explanation of the arguments, see the Configuration page of Cosmos-RL.
Reward service
Because reward computation is expensive, it should run in a separate asynchronous service rather than inside the trainer.
- You can launch a reward service by following this document.
- Configure the trainer to communicate with the reward service by setting the environment variables REMOTE_REWARD_TOKEN, REMOTE_REWARD_ENQUEUE_URL, and REMOTE_REWARD_FETCH_URL.
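Before launching a run, it can help to confirm that the three reward-service variables are actually set. The following is a minimal sketch, not part of Cosmos-RL: the helper name `missing_reward_vars` is hypothetical, and the check only tests that each variable is non-empty, not that the URLs are reachable.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "REMOTE_REWARD_TOKEN",
    "REMOTE_REWARD_ENQUEUE_URL",
    "REMOTE_REWARD_FETCH_URL",
)


def missing_reward_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the current process environment; passing a dict
    makes the helper easy to test. (Hypothetical helper, not a Cosmos-RL API.)
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_reward_vars()
print("missing reward settings:", ", ".join(missing) if missing else "none")
```

Running this in the same shell that will start the trainer shows which settings still need to be exported.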
Dataset
SFT
We support loading the dataset from a local directory. Prepare your paired prompts and multimodal data in the following format:
local_image_dataset_dir/
├── *.json
├── *.jpg
│...
local_video_dataset_dir/
├── *.json
├── *.mp4
│ ...
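A quick sanity check of such a directory can be sketched as below. It assumes that a prompt file and its media file share the same filename stem (for example `0001.json` with `0001.jpg`); the layout above does not state this pairing rule, so treat it as an assumption. The helper name `find_unpaired` is hypothetical, and the JSON contents are not inspected.

```python
import tempfile
from pathlib import Path

# The two media types shown in the layout above.
MEDIA_SUFFIXES = {".jpg", ".mp4"}


def find_unpaired(dataset_dir):
    """Return (json_without_media, media_without_json) as sorted stem lists.

    Assumption: files are paired by filename stem. This is an illustrative
    check, not the loader used by Cosmos-RL.
    """
    root = Path(dataset_dir)
    json_stems = {p.stem for p in root.glob("*.json")}
    media_stems = {
        p.stem for p in root.iterdir() if p.suffix.lower() in MEDIA_SUFFIXES
    }
    return sorted(json_stems - media_stems), sorted(media_stems - json_stems)


# Demo on a throwaway directory: "b" lacks media, "c" lacks a prompt file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.json", "a.jpg", "b.json", "c.mp4"):
        (Path(tmp) / name).touch()
    print(find_unpaired(tmp))  # (['b'], ['c'])
```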
RL
We support some popular datasets for RL training.
- Image: pickscore, ocr, geneval
- Video: filtered VidProM from DanceGRPO
Note: Cosmos-RL is flexible about user-customized datasets and input/output formats. You can edit the ./cosmos_rl/tools/dataset/diffusion_nft.py launcher to use your own dataset for training. For more details on dataset customization, refer to Customization.
Training
SFT
cosmos-rl --config ./configs/sana/sana-image-sft-lora.toml cosmos_rl.tools.dataset.diffusers_dataset
RL
cosmos-rl --config ./configs/sana/sana-image-nft.toml cosmos_rl.tools.dataset.diffusion_nft
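When switching between presets, the command line can be assembled programmatically. The sketch below assumes every preset maps to `./configs/sana/<preset>.toml`, which matches the RL command above but is not verified for every preset; `build_command` is a hypothetical helper, and the launcher module must be supplied by the caller.

```python
import shlex

CONFIG_DIR = "./configs/sana"  # directory named in the Configuration section


def build_command(preset, launcher):
    """Return the argv list for one cosmos-rl run.

    Assumption: the preset's config file is <CONFIG_DIR>/<preset>.toml.
    """
    if not preset or "/" in preset:
        raise ValueError(f"expected a bare preset name, got {preset!r}")
    return ["cosmos-rl", "--config", f"{CONFIG_DIR}/{preset}.toml", launcher]


cmd = build_command("sana-image-nft", "cosmos_rl.tools.dataset.diffusion_nft")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start training from a script.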
Citation
@misc{xie2024sana,
title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Haotian Tang and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
year={2024},
eprint={2410.10629},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.10629},
}
@misc{xie2025sana,
title={SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer},
author={Xie, Enze and Chen, Junsong and Zhao, Yuyang and Yu, Jincheng and Zhu, Ligeng and Lin, Yujun and Zhang, Zhekai and Li, Muyang and Chen, Junyu and Cai, Han and others},
year={2025},
eprint={2501.18427},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.18427},
}
@misc{chen2025sanasprint,
title={SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation},
author={Junsong Chen and Shuchen Xue and Yuyang Zhao and Jincheng Yu and Sayak Paul and Junyu Chen and Han Cai and Song Han and Enze Xie},
year={2025},
eprint={2503.09641},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.09641},
}
@misc{chen2025sana,
title={SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer},
author={Chen, Junsong and Zhao, Yuyang and Yu, Jincheng and Chu, Ruihang and Chen, Junyu and Yang, Shuai and Wang, Xianbang and Pan, Yicheng and Zhou, Daquan and Ling, Huan and others},
year={2025},
eprint={2509.24695},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.24695},
}