May 24, 2026
Light Forcing:
Accelerating Autoregressive Video Diffusion via Sparse Attention
Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong📧, Shen Ren, Wenya Wang📧
NTU, HKUST, SenseTime (LightX2V Group)
(📧 denotes corresponding author.)
https://github.com/user-attachments/assets/2daa9f17-329e-4019-8f14-68ac2c467592
(Results on Self Forcing 1.3B. Left: dense attention. Right: 1.3× acceleration using Light Forcing.)
Why Light Forcing
- Pioneering work: The first to explore sparse attention acceleration for autoregressive video generation.
- Superior performance: Achieves a VBench total score of 84.5, outperforming existing sparse attention methods in quality.
- Plug-and-play acceleration: This repository provides additional acceleration techniques, including FP8 quantization, efficient kernels, and an efficient VAE, enabling easy speedups with just a few lines of configuration.
- Strong generality: Light Forcing is compatible with diverse GPUs (e.g., RTX 5090, H100, A100) and supports both short-video (e.g., 5s) and long-video (e.g., >10s) generation.
- Extreme acceleration: Achieves around 3.0× end-to-end speedup on a single RTX 5090 (27.4 FPS) and around 2.0× end-to-end speedup on an H100 (33.9 FPS).
Introduction
Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we find that applying them to AR models causes considerable performance degradation for two reasons: chunks are treated in isolation during generation, and informative past context is underused. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism that quantitatively estimates the contribution of each chunk and allocates sparsity accordingly. This progressive sparsity increase lets the current chunk inherit prior knowledge from earlier chunks during generation. Additionally, we introduce Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. This two-level mask selection strategy (i.e., frame level and block level) adaptively handles diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention methods in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2-1.3× end-to-end speedup). Combined with other efficiency techniques, Light Forcing further achieves a 2.0-3.0× end-to-end speedup across diverse GPUs (e.g., 27.4 FPS on RTX 5090 and 33.9 FPS on H100).
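To make the notion of block-level sparsity concrete, here is a minimal sketch of a generic top-k block mask: for each query block, only the highest-scoring fraction of key blocks is kept. This is an illustration under stated assumptions, not the selection rule used by Light Forcing, which the paragraph above describes only at a high level. The function name `block_mask` and the score layout (one row of key-block scores per query block) are hypothetical.

```python
import math


def block_mask(block_scores, sparsity):
    """Return a boolean mask that keeps the top-scoring key blocks per query block.

    block_scores[i][j] is a relevance score between query block i and key block j.
    sparsity is the fraction of key blocks to skip (e.g., 0.75 skips 75%).
    Hypothetical helper for illustration only.
    """
    masks = []
    for row in block_scores:
        # Number of key blocks to keep; round() guards against float jitter before ceil.
        keep = max(1, math.ceil(round((1.0 - sparsity) * len(row), 9)))
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        kept = set(ranked[:keep])
        masks.append([j in kept for j in range(len(row))])
    return masks


# One query block scored against eight key blocks, at 75% sparsity.
scores = [[0.1, 0.5, 0.3, 0.9, 0.2, 0.4, 0.8, 0.6]]
mask = block_mask(scores, sparsity=0.75)
kept_blocks = [j for j, keep in enumerate(mask[0]) if keep]
print(kept_blocks)
```

With eight key blocks and 75% sparsity, only two blocks are attended to, so the attention cost for that query block drops to roughly a quarter of the dense cost.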

Quick Start
Environment
We highly recommend the Docker image, as it is the simplest and fastest way to set up the environment. The image already includes optimized kernels for Flash Attention 4 sparse attention, FP8 deployment, and RMSNorm.
docker pull lvchengtao/light_forcing:v1
Note: The Docker image requires the host NVIDIA driver to support CUDA 13.0 or newer.
Download Checkpoints
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
hf download mack-williams/Light-Forcing --local-dir ./Light-Forcing
If you need to use LightVAE, run:
hf download lightx2v/Autoencoders lightvaew2_1.pth --local-dir ./Autoencoders
Fast Inference
Before inference, you can adjust the configuration file to enable the desired acceleration techniques, such as sparse attention, LightVAE, FP8 quantization, and efficient kernels.
long_video_gen: true # Whether to enable long-video generation
sink_size: 1
sparse_config:
  sparsity: 0.88
  sparsity_base: 0.98
  keep_frames: 6 # Number of past frames to keep
  keep_sink: 1 # Number of sink frames to force keeping
  keep_near: 2 # Number of nearest frames to force keeping
  # other keys: BLKQ, BLKK
efficient_deployment:
  lightvae_path: path to lightvae ckpt
  quant_fp8: true # Not supported on A100
  rmsnorm_kernel: true
  rope_kernel: true
  scale_shift_kernel: true
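The comments on `keep_frames`, `keep_sink`, and `keep_near` suggest a frame-level budget with some frames always retained. One plausible reading can be sketched as follows, assuming each past frame has a relevance score and that frames are ordered oldest first. The function `select_frames` and the scoring input are hypothetical and are not taken from the repository.

```python
def select_frames(scores, keep_frames=6, keep_sink=1, keep_near=2):
    """Pick indices of past frames to attend to (illustrative sketch only).

    scores[i] is a relevance score for past frame i, ordered oldest first.
    The first `keep_sink` frames and the last `keep_near` frames are always
    kept; the remaining budget goes to the highest-scoring other frames.
    """
    n = len(scores)
    if n <= keep_frames:
        return list(range(n))  # Nothing to prune yet.
    forced = set(range(keep_sink)) | set(range(n - keep_near, n))
    budget = max(keep_frames - len(forced), 0)
    candidates = [i for i in range(n) if i not in forced]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return sorted(forced | set(candidates[:budget]))


# Ten past frames; frame 0 is the sink, frames 8 and 9 are the nearest.
frame_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.05, 0.4, 0.6, 0.5]
selected = select_frames(frame_scores)
print(selected)
```

In this example the sink frame and the two nearest frames are kept regardless of score, and the remaining three slots go to the most relevant older frames.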
For short-video generation (e.g., 5s), run:
python inference.py \
--config_path configs/light_forcing_short.yaml \
--output_folder videos/light_forcing_short \
--checkpoint_path path to short_video_gen.pt \
--data_path prompts/MovieGenVideoBench_extended.txt \
--use_ema
For long-video generation (e.g., 15s), run:
python inference.py \
--config_path configs/light_forcing_long.yaml \
--output_folder videos/light_forcing_long \
--checkpoint_path path to long_video_gen.pt \
--data_path prompts/MovieGenVideoBench_extended.txt \
--use_ema \
--num_output_frames 63
Note
- On RTX 5090 and A100 GPUs, Light Forcing calls the Triton sparse attention kernel. On H100 and other Hopper GPUs, it calls the Flash Attention sparse kernel.
- If you use Hopper GPUs, we recommend setting `sparsity` and `sparsity_base` to `0.8` and `0.9`, respectively. Flash Attention 4 sparse attention currently supports a block size of 128, which is relatively coarse-grained, so further increasing sparsity does not bring additional speedup.
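The GPU-specific advice above can be captured in a small helper that adjusts a loaded configuration before inference. The sketch below is hypothetical and not part of the repository: the function name `adapt_config` is made up, the nested dictionary layout is an assumed reading of the YAML config, and the compute capability is passed in as a `(major, minor)` tuple (for example, as returned by `torch.cuda.get_device_capability()`), with `(9, 0)` for H100 and `(8, 0)` for A100.

```python
import copy


def adapt_config(config, capability):
    """Return a copy of `config` adjusted for the given GPU compute capability.

    Hypothetical helper; the key layout mirrors the YAML example and is an
    assumption about how the config is nested.
    """
    cfg = copy.deepcopy(config)
    major, minor = capability
    if major == 9:
        # Hopper: the Flash Attention sparse kernel uses coarse 128-wide blocks,
        # so the README recommends lower sparsity values.
        cfg["sparse_config"]["sparsity"] = 0.8
        cfg["sparse_config"]["sparsity_base"] = 0.9
    if (major, minor) == (8, 0):
        # A100: FP8 quantization is not supported.
        cfg["efficient_deployment"]["quant_fp8"] = False
    return cfg


base = {
    "sparse_config": {"sparsity": 0.88, "sparsity_base": 0.98},
    "efficient_deployment": {"quant_fp8": True},
}
h100_cfg = adapt_config(base, (9, 0))
a100_cfg = adapt_config(base, (8, 0))
print(h100_cfg["sparse_config"], a100_cfg["efficient_deployment"])
```

The helper copies the input rather than mutating it, so the same base configuration can be reused across machines.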
Performance Benchmarks
RTX 5090
| Metric | Duration | Flash Attention 2 | +Light Forcing (88% sparsity) | +FP8 linear | +Efficient kernel (RoPE, RMSNorm, etc.) | +Light VAE |
|---|---|---|---|---|---|---|
| Latency | 5 seconds | 9.09s | 6.83s | 5.90s | 5.37s | 2.96s |
| Speedup | 5 seconds | 1.00× | 1.33× | 1.54× | 1.69× | 3.07× |
| Peak Memory | 5 seconds | 17.8G | 17.8G | 16.6G | 15.8G | 12.7G |
| Latency | 15 seconds | 30.4s | 24.2s | 21.4s | 17.0s | 9.6s |
| Speedup | 15 seconds | 1.00× | 1.26× | 1.42× | 1.79× | 3.17× |
| Peak Memory | 15 seconds | 17.6G | 17.6G | 16.5G | 16.3G | 13.1G |
A100
| Metric | Duration | Flash Attention 2 | +Light Forcing (88% sparsity) | +Efficient kernel (RoPE, RMSNorm, etc.) | +Light VAE |
|---|---|---|---|---|---|
| Latency | 5 seconds | 11.38s | 9.88s | 9.41s | 4.85s |
| Speedup | 5 seconds | 1.00× | 1.15× | 1.21× | 2.35× |
| Latency | 15 seconds | 38.28s | 34.56s | 26.63s | 18.08s |
| Speedup | 15 seconds | 1.00× | 1.11× | 1.44× | 2.12× |
H100
| Metric | Duration | Flash Attention 3 | +Light Forcing (80% sparsity) | +FP8 linear | +Efficient kernel (RoPE, RMSNorm, etc.) | +Light VAE |
|---|---|---|---|---|---|---|
| Latency | 5 seconds | 4.80s | 4.33s | 4.32s | 3.74s | 2.39s |
| Speedup | 5 seconds | 1.00× | 1.11× | 1.11× | 1.28× | 2.01× |
| Latency | 15 seconds | 15.8s | 14.1s | 13.8s | 12.1s | 8.0s |
| Speedup | 15 seconds | 1.00× | 1.12× | 1.14× | 1.31× | 1.98× |
- Latency is the generation time of a single video on a single GPU, measured after operator warm-up (timing starts from the second sample).
- Efficient kernels such as RoPE, RMSNorm, and scale-shift are lossless acceleration methods.
- FP8 linear layers are near-lossless, while LightVAE may introduce slightly blurrier visual quality because it is designed for bidirectional video diffusion.
- Light Forcing has not yet been specifically optimized for A100 GPUs, and we plan to further optimize it in future updates.
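The speedup and FPS figures can be cross-checked from the latency rows. The sketch below recomputes them under two assumptions that this README does not state: a 5-second video corresponds to 21 latent frames, and the VAE decodes the first latent frame to one pixel frame and each later latent frame to four. Both assumptions are consistent with the reported FPS values, but they should be verified against the actual configs.

```python
def speedup(baseline_s, optimized_s):
    """End-to-end speedup relative to the dense-attention baseline."""
    return baseline_s / optimized_s


def pixel_frames(latent_frames, temporal_stride=4):
    # Assumption: the first latent frame decodes to 1 pixel frame and every
    # later latent frame decodes to `temporal_stride` pixel frames.
    return 1 + temporal_stride * (latent_frames - 1)


def throughput_fps(latent_frames, latency_s):
    """Generated pixel frames per second of wall-clock time."""
    return pixel_frames(latent_frames) / latency_s


SHORT_LATENT_FRAMES = 21  # Assumption for the 5-second setting.

# Latencies taken from the 5-second rows of the tables above.
rtx5090_speedup = round(speedup(9.09, 2.96), 2)
h100_speedup = round(speedup(4.80, 2.39), 2)
rtx5090_fps = round(throughput_fps(SHORT_LATENT_FRAMES, 2.96), 1)
h100_fps = round(throughput_fps(SHORT_LATENT_FRAMES, 2.39), 1)
print(rtx5090_speedup, h100_speedup, rtx5090_fps, h100_fps)
```

Under these assumptions the script reproduces the 3.07× and 2.01× speedups and the 27.4 and 33.9 FPS figures quoted earlier.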
Acknowledgments
Our code builds on the following projects:
- Video generation: Self Forcing and Infinite-Forcing.
- Sparse attention kernel: Flash Attention 4 and SLA (Triton).
- FP8 and RMSNorm kernel: SGLang team.
- Light VAE: LightX2V team.
Recommendation
We strongly recommend using LightX2V, a leading inference framework for video generation. LightX2V supports a wide range of autoregressive video generation models, including Self-Forcing, WorldPlay, Matrix-Game, and LingBot-World.
It provides a comprehensive set of acceleration techniques, including Weight Quantization (FP8/NVFP4), KV Cache Quantization, Offloading, Sparse Attention, LightVAE, Sequence Parallelism, and Kernel Fusion.
Citation
If you find our toolkit or paper useful for your research, please cite our work.
@article{lv2026light,
title={Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention},
author={Lv, Chengtao and Shi, Yumeng and Huang, Yushi and Gong, Ruihao and Ren, Shen and Wang, Wenya},
journal={arXiv preprint arXiv:2602.04789},
year={2026}
}