
May 24, 2026

Light Forcing:
Accelerating Autoregressive Video Diffusion via Sparse Attention

Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong📧, Shen Ren, Wenya Wang📧

NTU, HKUST, SenseTime (LightX2V Group)

(📧 denotes corresponding author.)

https://github.com/user-attachments/assets/2daa9f17-329e-4019-8f14-68ac2c467592

(Results on Self Forcing 1.3B. Left: dense attention. Right: 1.3× acceleration with Light Forcing.)

💡 Why Light Forcing

🥇 Pioneering work: The first to explore sparse attention acceleration for autoregressive video generation.
🏆 Superior performance: Achieves a VBench total score of 84.5, outperforming existing sparse attention methods in quality.
🔌 Plug-and-play acceleration: This repository provides additional acceleration techniques, including FP8 quantization, efficient kernels, and an efficient VAE, enabling easy speedups with just a few lines of configuration.
🌐 Strong generality: Light Forcing is compatible with diverse GPUs (e.g., RTX 5090, H100, A100) and supports both short-video (e.g., 5s) and long-video (e.g., >10s) generation.
⚡ Extreme acceleration: Achieves around 3.0× end-to-end speedup on a single RTX 5090 (27.4 FPS) and around 2.0× end-to-end speedup on an H100 (33.9 FPS).

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. Existing sparse attention solutions have shown promise on bidirectional models, but we find that applying them to AR models causes considerable performance degradation for two reasons: each chunk's generation is considered in isolation, and informative past context is underused. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism that quantitatively estimates the contribution of each chunk and allocates sparsity accordingly. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge from earlier chunks during generation. Additionally, we introduce Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. This two-level mask selection strategy (i.e., frame level and block level) adaptively handles diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention methods in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2-1.3× end-to-end speedup). Combined with other efficiency techniques, Light Forcing further achieves a 2.0-3.0× end-to-end speedup across diverse GPUs (e.g., 27.4 FPS on RTX 5090 and 33.9 FPS on H100).

✨ Quick Start

Environment

We highly recommend the Docker image, as it is the simplest and fastest way to get a working setup. It already includes optimized kernels for Flash Attention 4 sparse attention, FP8 deployment, and RMSNorm.

docker pull lvchengtao/light_forcing:v1

Note: The Docker image requires the host NVIDIA driver to support CUDA 13.0 or newer.

Download Checkpoints

hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
hf download mack-williams/Light-Forcing --local-dir ./Light-Forcing

If you need to use LightVAE, run:

hf download lightx2v/Autoencoders lightvaew2_1.pth --local-dir ./Autoencoders

Fast Inference

Before inference, you can adjust the configuration file to enable the desired acceleration techniques, such as sparse attention, LightVAE, FP8 quantization, and efficient kernels.

long_video_gen: true  # Whether to enable long-video generation
sink_size: 1
sparse_config:
  sparsity: 0.88
  sparsity_base: 0.98
  keep_frames: 6  # Number of past frames to keep
  keep_sink: 1  # Number of sink frames to force keeping
  keep_near: 2  # Number of nearest frames to force keeping
  # other keys: BLKQ, BLKK
efficient_deployment:
  lightvae_path: /path/to/lightvae_ckpt  # Placeholder: path to the LightVAE checkpoint (e.g., the downloaded lightvaew2_1.pth)
  quant_fp8: true  # Not supported on A100
  rmsnorm_kernel: true
  rope_kernel: true
  scale_shift_kernel: true
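The frame-level keys in sparse_config can be pictured with a small sketch. This is not the repository's implementation: the function name `select_frames`, the per-frame relevance scores, and the choice to count sink and nearest frames toward the keep_frames budget are all assumptions made for illustration. It only shows how a fixed frame budget could combine forced frames (sink and nearest) with the highest-scoring remaining frames.

```python
from typing import List, Sequence


def select_frames(scores: Sequence[float], keep_frames: int = 6,
                  keep_sink: int = 1, keep_near: int = 2) -> List[int]:
    """Return sorted indices of past frames to attend to.

    scores[i] is a hypothetical relevance score for past frame i,
    where index 0 is the oldest frame. Assumption: forced frames
    count toward the keep_frames budget.
    """
    n = len(scores)
    if n <= keep_frames:
        return list(range(n))  # history is still short, keep everything

    # Always keep the first `keep_sink` frames and the last `keep_near` frames.
    forced = set(range(min(keep_sink, n))) | set(range(max(n - keep_near, 0), n))

    # Spend the remaining budget on the highest-scoring other frames.
    budget = max(keep_frames - len(forced), 0)
    rest = sorted((i for i in range(n) if i not in forced),
                  key=lambda i: scores[i], reverse=True)
    return sorted(forced | set(rest[:budget]))


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 0.05]
print(select_frames(scores))  # [0, 1, 3, 5, 8, 9]
```

With the default values from the config above, frame 0 (sink) and frames 8 and 9 (nearest) are always kept, and the three remaining slots go to the best-scoring frames in between.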

For short-video generation (e.g., 5s), run:

python inference.py \
  --config_path configs/light_forcing_short.yaml \
  --output_folder videos/light_forcing_short \
  --checkpoint_path /path/to/short_video_gen.pt \
  --data_path prompts/MovieGenVideoBench_extended.txt \
  --use_ema

For long-video generation (e.g., 15s), run:

python inference.py \
  --config_path configs/light_forcing_long.yaml \
  --output_folder videos/light_forcing_long \
  --checkpoint_path /path/to/long_video_gen.pt \
  --data_path prompts/MovieGenVideoBench_extended.txt \
  --use_ema \
  --num_output_frames 63
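A rough way to relate --num_output_frames to video length, under assumptions not stated in this README: the flag counts latent frames, the Wan2.1 VAE compresses time by a factor of 4 (with the first frame kept as-is), and videos play at 16 FPS. The helper names below are hypothetical.

```python
LATENT_STRIDE = 4  # assumed temporal compression factor of the Wan2.1 VAE
FPS = 16           # assumed playback frame rate


def pixel_frames(latent_frames: int) -> int:
    """Decoded frame count for a given number of latent frames."""
    return LATENT_STRIDE * (latent_frames - 1) + 1


def duration_seconds(latent_frames: int) -> float:
    """Approximate video length in seconds."""
    return pixel_frames(latent_frames) / FPS


for n in (21, 63):
    print(n, pixel_frames(n), round(duration_seconds(n), 1))
```

Under these assumptions, 63 latent frames decode to 249 frames (about 15.6 s), and 21 latent frames decode to 81 frames (about 5.1 s), which matches the nominal 15s and 5s settings.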

Note

On RTX 5090 and A100 GPUs, Light Forcing calls the Triton sparse attention kernel. On H100 and other Hopper GPUs, it calls the Flash Attention sparse kernel.

If you use Hopper GPUs, we recommend setting sparsity and sparsity_base to 0.8 and 0.9, respectively. Flash Attention 4 sparse attention currently supports only a block size of 128, which is relatively coarse-grained, so increasing sparsity further does not bring additional speedup.
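The two notes above can be summarized as a small lookup, sketched below. The function name `pick_sparse_settings` and the kernel labels are hypothetical, and treating every non-Hopper GPU like the RTX 5090 and A100 is an assumption. The sketch keys on the CUDA compute capability major version, which is 9 for Hopper GPUs such as the H100.

```python
def pick_sparse_settings(cc_major: int) -> dict:
    """Suggest a sparse kernel and sparsity values for a GPU.

    cc_major is the CUDA compute capability major version:
    8 for A100, 9 for Hopper (H100), 12 for RTX 5090.
    """
    if cc_major == 9:
        # Hopper: Flash Attention sparse kernel with coarser 128-wide blocks,
        # so the README recommends lower sparsity values.
        return {"kernel": "flash_attention_sparse",
                "sparsity": 0.8, "sparsity_base": 0.9}
    # Assumption: all other GPUs follow the RTX 5090 / A100 path (Triton kernel)
    # with the values from the example config.
    return {"kernel": "triton_sparse",
            "sparsity": 0.88, "sparsity_base": 0.98}


print(pick_sparse_settings(9))
print(pick_sparse_settings(12))
```

In a PyTorch environment, the major version can be read from the first element of `torch.cuda.get_device_capability()`.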

📊 Performance Benchmarks

RTX 5090

Metric	Duration	Flash Attention 2	+Light Forcing (88% sparsity)	+FP8 linear	+Efficient kernel (RoPE, RMSNorm, etc.)	+Light VAE
Latency	5 seconds	9.09s	6.83s	5.90s	5.37s	2.96s
Speedup	5 seconds	1.00×	1.33×	1.54×	1.69×	3.07×
Peak Memory	5 seconds	17.8G	17.8G	16.6G	15.8G	12.7G
Latency	15 seconds	30.4s	24.2s	21.4s	17.0s	9.6s
Speedup	15 seconds	1.00×	1.26×	1.42×	1.79×	3.17×
Peak Memory	15 seconds	17.6G	17.6G	16.5G	16.3G	13.1G

A100

Metric	Duration	Flash Attention 2	+Light Forcing (88% sparsity)	+Efficient kernel (RoPE, RMSNorm, etc.)	+Light VAE
Latency	5 seconds	11.38s	9.88s	9.41s	4.85s
Speedup	5 seconds	1.00×	1.15×	1.21×	2.35×
Latency	15 seconds	38.28s	34.56s	26.63s	18.08s
Speedup	15 seconds	1.00×	1.11×	1.44×	2.12×

H100

Metric	Duration	Flash Attention 3	+Light Forcing (80% sparsity)	+FP8 linear	+Efficient kernel (RoPE, RMSNorm, etc.)	+Light VAE
Latency	5 seconds	4.80s	4.33s	4.32s	3.74s	2.39s
Speedup	5 seconds	1.00×	1.11×	1.11×	1.28×	2.01×
Latency	15 seconds	15.8s	14.1s	13.8s	12.1s	8.0s
Speedup	15 seconds	1.00×	1.12×	1.14×	1.31×	1.98×

We report the generation time of a single video on a single GPU, measured after operator warm-up (i.e., starting from the second sample).
Efficient kernels such as RoPE, RMSNorm, and scale-shift are lossless acceleration methods.
FP8 linear layers are near-lossless, while LightVAE may produce slightly blurrier results because it was designed for bidirectional video diffusion.
Light Forcing has not yet been specifically optimized for A100 GPUs, and we plan to further optimize it in future updates.
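The Speedup rows and the FPS figures quoted earlier can be reproduced from the latency numbers. The sketch below assumes speedup is baseline latency divided by optimized latency, and that a 5-second video contains 81 frames. The frame count is an assumption not stated in this README, chosen because it reproduces the reported 27.4 and 33.9 FPS.

```python
def speedup(baseline_s: float, latency_s: float) -> float:
    """End-to-end speedup relative to the dense-attention baseline."""
    return round(baseline_s / latency_s, 2)


def fps(num_frames: int, latency_s: float) -> float:
    """Generated frames per second of wall-clock time."""
    return round(num_frames / latency_s, 1)


# RTX 5090, 5-second video: latencies from the table, left to right.
rtx5090_5s = [9.09, 6.83, 5.90, 5.37, 2.96]
print([speedup(rtx5090_5s[0], t) for t in rtx5090_5s])

FRAMES_5S = 81  # assumed frame count of a 5-second video
print(fps(FRAMES_5S, 2.96))  # RTX 5090, all optimizations enabled
print(fps(FRAMES_5S, 2.39))  # H100, all optimizations enabled
```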

🤝 Acknowledgments

Our code builds on the following projects:

Video generation: Self Forcing and Infinite-Forcing.
Sparse attention kernel: Flash Attention 4 and SLA (Triton).
FP8 and RMSNorm kernel: SGLang team.
Light VAE: LightX2V team.

🚀 Recommendation

We strongly recommend using LightX2V, a leading inference framework for video generation. LightX2V supports a wide range of autoregressive video generation models, including Self-Forcing, WorldPlay, Matrix-Game, and LingBot-World.

It provides a comprehensive set of acceleration techniques, including Weight Quantization (FP8/NVFP4), KV Cache Quantization, Offloading, Sparse Attention, LightVAE, Sequence Parallelism, and Kernel Fusion.

✏️ Citation

If you find our toolkit or research paper useful or relevant to your research, please kindly cite our work.

@article{lv2026light,
  title={Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention},
  author={Lv, Chengtao and Shi, Yumeng and Huang, Yushi and Gong, Ruihao and Ren, Shen and Wang, Wenya},
  journal={arXiv preprint arXiv:2602.04789},
  year={2026}
}
