[CVPR 2026] Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

April 30, 2026

Jiaqi Han*¹, Juntong Shi*¹, Puheng Li¹, Haotian Ye¹, Qiushan Guo², Stefano Ermon¹

¹Stanford University  ²ByteDance

🎯 Overview

We propose Spectrum, a training-free spectral diffusion feature forecaster that enables global, long-range feature reuse with tightly controlled error. We view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size.
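The fit-then-forecast idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: the function names, the time normalization to $[-1, 1]$, the toy feature trajectory, and the small regularization value are all assumptions made for this example.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def to_unit_interval(t, t_min, t_max):
    """Map times in [t_min, t_max] to [-1, 1], the natural Chebyshev domain."""
    return 2.0 * (np.asarray(t, dtype=float) - t_min) / (t_max - t_min) - 1.0


def fit_chebyshev_ridge(times, feats, num_bases, lam, t_min, t_max):
    """Ridge-regress features of shape (n_obs, dim) onto T_0 .. T_{num_bases-1}."""
    phi = C.chebvander(to_unit_interval(times, t_min, t_max), num_bases - 1)
    gram = phi.T @ phi + lam * np.eye(num_bases)
    return np.linalg.solve(gram, phi.T @ feats)  # shape (num_bases, dim)


def forecast(coef, times, t_min, t_max):
    """Evaluate the fitted Chebyshev expansion at (possibly future) times."""
    phi = C.chebvander(to_unit_interval(times, t_min, t_max), coef.shape[0] - 1)
    return phi @ coef


def toy_feature(t):
    """Hypothetical smooth 2-dimensional feature trajectory (stand-in for a denoiser feature)."""
    t = np.asarray(t, dtype=float)
    return np.stack([np.cos(t / 20.0), 0.01 * t], axis=-1)


# Steps where the denoiser was actually run, and steps we want to skip.
observed = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
future = np.array([11.0, 12.0, 13.0])
t_min, t_max = 0.0, 13.0  # normalization window (an assumption of this sketch)

coef = fit_chebyshev_ridge(observed, toy_feature(observed), num_bases=4,
                           lam=1e-6, t_min=t_min, t_max=t_max)
pred = forecast(coef, future, t_min, t_max)
max_err = np.abs(pred - toy_feature(future)).max()
print(coef.shape, pred.shape, max_err)
```

A single fit is reused for several future steps, which is what allows multiple forward passes to be skipped between two real evaluations of the denoiser.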

Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to $4.79\times$ speedup on FLUX.1 and $4.67\times$ speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines. See more demos on our Project Page.

Please give us a star ⭐ if you find our work interesting!

โญ Also checkout our previous work CHORDS on multi-core diffusion sampling acceleration, accepted at ICCV 2025!

📰 Updates

🚀🚀🚀 Check out ComfyUI-Spectrum-SDXL for a community implementation of the Spectrum ComfyUI workflow for SDXL-based models! Huge thanks to ruwwww!!

🚀🚀🚀 Check out ComfyUI-Spectrum for a community implementation of the Spectrum ComfyUI workflow for a wide suite of models, including FLUX-based models, Qwen Image, Z Image Turbo, HunyuanVideo and Wan2.2! Huge thanks to judian17!!

🛠 Dependencies

Our code relies on the following core packages:

torch
transformers
diffusers
hydra-core
imageio
imageio-ffmpeg

For the specific versions of these packages that have been verified, as well as some optional dependencies, please refer to requirements.txt. We recommend creating a new virtual environment via the following procedure:

conda create -n spectrum python=3.10
conda activate spectrum
pip install -r requirements.txt

🚀 Running Inference

Before running the inference pipeline, please make sure that the models have been downloaded from 🤗 Hugging Face. We provide a download script for some example image and video generation models in download.py.

We use hydra to organize the hyperparameters of the image/video diffusion model and of the sampling algorithm. The default configurations can be found under the configs folder. The entry points for image and video generation are src/text_to_image.py and src/text_to_video.py, respectively. For SDXL, please refer to src/text_to_image_sdxl.py.

โญ Text-to-Image (T2I)

The command below is an example to perform image generation on Flux using 1 GPU.

CUDA_VISIBLE_DEVICES=0 \
python src/text_to_image.py \
    model=flux \
    algo=spectrum \
    algo.w=0.5 \
    algo.lam=0.1 \
    algo.m=4 \
    window_size=2 \
    flex_window=0.75 \
    exp_name=temp \
    ngpu=1 \
    total_prompt_num=1000 \
    output_base_path=output_samples_image \
    prompt_file=prompts/DrawBench200.txt

For model we currently support:

algo.w is set to 1.0 by default, which recovers our Chebyshev predictor. After publication, we also found that a convex mixture of our spectral predictor with linear interpolation slightly enhances robustness across a wider range of acceleration ratios. We recommend setting algo.w between 0.5 and 1.0, with a relatively larger value of algo.w when enabling more aggressive speedups (see flex_window).
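As a rough sketch of what such a convex mixture looks like, the snippet below blends a spectral forecast with a simple linear one using a weight w. The helper names are hypothetical, the spectral forecast is a placeholder array, and the linear predictor here (a straight line through the two most recently computed features) is an assumption, since the exact form of the linear component is not spelled out in this README.

```python
import numpy as np


def linear_predictor(t_prev, f_prev, t_last, f_last, t_query):
    """Straight line through the two most recent computed features, evaluated at t_query."""
    slope = (f_last - f_prev) / (t_last - t_prev)
    return f_last + slope * (t_query - t_last)


def mix_predictions(spectral_pred, linear_pred, w):
    """Convex mixture of the two forecasts; w = 1.0 keeps only the spectral one."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * spectral_pred + (1.0 - w) * linear_pred


f_prev = np.array([1.0, 0.0])          # feature computed at t = 0
f_last = np.array([2.0, 1.0])          # feature computed at t = 1
linear_pred = linear_predictor(0.0, f_prev, 1.0, f_last, 2.0)
spectral_pred = np.array([2.5, 1.5])   # placeholder for a Chebyshev forecast at t = 2

mixed = mix_predictions(spectral_pred, linear_pred, w=0.5)
print(mixed)  # [2.75 1.75]
```

With w=1.0 the function returns the spectral forecast unchanged, matching the default behavior described above.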

algo.lam refers to the regularization strength $\lambda$ in the paper. It is set to 0.1 by default.

algo.m refers to the number of Chebyshev bases. It is set to 4 by default.

window_size refers to the initial window size $\mathcal{N}$ in the paper.

flex_window refers to the hyperparameter $\alpha$ in the paper. Together, $\mathcal{N}$ and $\alpha$ define the sequence of diffusion steps at which an actual forward pass of the denoiser is performed. More details are in Appendix B.1 and Table 6 in the paper. A larger value of $\alpha$ corresponds to fewer actual network forward passes, leading to a larger speedup.

ngpu is the number of GPUs to use in parallel. We split all prompts equally across the GPUs to speed up the benchmark for all methods. Note that it should match the number of devices listed in CUDA_VISIBLE_DEVICES.
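To make the prompt splitting concrete, here is a minimal sketch of one way to divide a prompt list as evenly as possible among workers. The function shard_prompts and its contiguous-slice strategy are assumptions for illustration; the scripts in this repository may partition prompts differently.

```python
def shard_prompts(prompts, ngpu, rank):
    """Return the contiguous slice of prompts handled by worker `rank` out of `ngpu`."""
    if not 0 <= rank < ngpu:
        raise ValueError("rank must satisfy 0 <= rank < ngpu")
    base, extra = divmod(len(prompts), ngpu)
    # The first `extra` workers take one additional prompt each.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return prompts[start:end]


prompts = [f"prompt {i}" for i in range(10)]
shards = [shard_prompts(prompts, ngpu=4, rank=r) for r in range(4)]
sizes = [len(s) for s in shards]
print(sizes)  # [3, 3, 2, 2]
```

Every prompt lands in exactly one shard, so concatenating the shards in rank order reproduces the original list.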

output_base_path is the directory to save the generated samples.

prompt_file stores the list of prompts, one per line, which are used sequentially to generate the images.

For the full functionality of the script, please refer to the arguments and their default values (such as the number of inference steps and the image resolution) under the configs folder, which are parsed by hydra.

Remark: window_size=2 and flex_window=0.75 recover the $\alpha=0.75$ setting in the paper with 14 full network passes ($\approx 3.5\times$ speedup). For more aggressive acceleration, use window_size=2 and flex_window=3.0, which corresponds to the $\alpha=3.0$ setting in the paper with 10 network passes ($\approx 5\times$ speedup).
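The speedups quoted in the remark can be sanity-checked with a back-of-the-envelope ratio of total sampling steps to full network passes. The sketch below assumes a 50-step schedule, which is not stated in this section (check the configs folder for the actual default), and it ignores the cost of the forecasting itself, so measured speedups will be somewhat lower.

```python
def nominal_speedup(num_steps, full_passes):
    """Ideal speedup if forecasted steps were free: total steps / real forward passes."""
    if full_passes <= 0 or full_passes > num_steps:
        raise ValueError("full_passes must be in [1, num_steps]")
    return num_steps / full_passes


NUM_STEPS = 50  # assumed schedule length, not taken from this README

moderate = nominal_speedup(NUM_STEPS, 14)    # flex_window=0.75 setting
aggressive = nominal_speedup(NUM_STEPS, 10)  # flex_window=3.0 setting
print(round(moderate, 2), aggressive)  # 3.57 5.0
```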

We also provide a boilerplate script to launch the inference:

# For Flux and Stable Diffusion 3.5-Large
bash scripts/run_mp_image.sh
# For SDXL
bash scripts/run_mp_image_sdxl.sh

โญ Text-to-Video (T2V)

Similarly, the following script can be used for video generation with Spectrum:

CUDA_VISIBLE_DEVICES=0 \
python src/text_to_video.py \
    model=hunyuan \
    algo=spectrum \
    algo.w=0.5 \
    algo.lam=0.1 \
    algo.m=4 \
    window_size=2 \
    flex_window=0.75 \
    exp_name=temp \
    ngpu=1 \
    total_prompt_num=1000 \
    output_base_path=output_samples_video \
    prompt_file=prompts/video_demo.txt

where for model we currently support:

We also provide a boilerplate script to launch the inference:

# For HunyuanVideo and Wan2.1-14B
bash scripts/run_mp_video.sh

Remark: For high-resolution video generation, change model.width, model.height, and model.num_frames to your preferred values. For example, we use the 1080x720x129f setting with HunyuanVideo for the qualitative examples.

📌 Citation

Please consider citing our work if you find it useful:

@article{han2026adaptive,
  title={Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration},
  author={Han, Jiaqi and Shi, Juntong and Li, Puheng and Ye, Haotian and Guo, Qiushan and Ermon, Stefano},
  journal={arXiv preprint arXiv:2603.01623},
  year={2026}
}

๐Ÿ—’๏ธ Acknowledgments

Part of the code was inspired by TaylorSeer. We thank the authors for open-sourcing the codebase.

🧩 Contact and Community Contribution

If you have any questions, feel free to contact me at:

Jiaqi Han: jiaqihan@stanford.edu

🔥 We warmly welcome community contributions, such as support for more models! Please open a PR if you are interested!