🎥 FAR: Frame Autoregressive Model for Both Short- and Long-Context Video Modeling 🚀

April 23, 2025

Project Page · arXiv · Hugging Face Weights · Colab Demo

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

(Sample video: DMLab)

📢 News

  • 2025-04: Updated the multi-level KV cache for faster inference on long videos. 🎉 Check our updated paper for details, and try the released Colab demo for inference speed tests.
  • 2025-04: Released a Colab demo for quick inference! 🎉
  • 2025-03: Paper and code of FAR are released! ๐ŸŽ‰

🌟 What's the Potential of FAR?

🔥 Introducing FAR: a new baseline for autoregressive video generation

FAR (Frame AutoRegressive Model) learns to predict continuous frames conditioned on an autoregressive context of preceding frames. This objective aligns naturally with video modeling, in the same way that next-token prediction aligns with language modeling.
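As a rough illustration of this objective, the sketch below pairs a clean frame context with a per-frame denoising loss on continuous latents. It is a minimal sketch under assumed interfaces; the model signature and the DDPM-style noising scheme are our assumptions, not the repository's actual training code:

import torch
import torch.nn.functional as F

def far_training_step(model, latents):
    # latents: (B, T, C, H, W) continuous frame latents from a video VAE.
    B, T = latents.shape[:2]
    t_idx = torch.randint(1, T, (1,)).item()   # frame to predict this step
    context = latents[:, :t_idx]               # clean autoregressive context
    target = latents[:, t_idx]                 # continuous next-frame latent
    # DDPM-style noising of the target frame only (assumed scheme).
    alpha = torch.rand(B, 1, 1, 1, device=latents.device)
    noise = torch.randn_like(target)
    noisy = alpha.sqrt() * target + (1.0 - alpha).sqrt() * noise
    pred_noise = model(noisy, context, alpha)  # hypothetical model interface
    return F.mse_loss(pred_noise, noise)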


🔥 FAR achieves better convergence than video diffusion models with the same continuous latent space:

🔥 FAR leverages clean visual context without additional image-to-video fine-tuning:

Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frame = 0) and video prediction (context frame ≥ 1) within a single model.
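In other words, the two tasks differ only in how many clean frames are supplied at sampling time. A minimal sketch, assuming a hypothetical sample_next_frame interface (not the repository's API):

import torch

def rollout(model, num_frames, context=None):
    # context=None  -> unconditional video generation (context frame = 0)
    # context given -> video prediction from the observed clean frames
    frames = [] if context is None else list(context.unbind(dim=1))
    while len(frames) < num_frames:
        frames.append(model.sample_next_frame(frames))  # hypothetical call
    return torch.stack(frames, dim=1)  # (B, num_frames, C, H, W)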

🔥 FAR supports efficient training on long video sequences with manageable token lengths:

The key technique behind this is long short-term context modeling: regular patchification of the short-term context preserves fine-grained temporal consistency, while aggressive patchification of the long-term context reduces redundant tokens.
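A back-of-the-envelope token count shows why this helps. The window size and patch sizes below are illustrative assumptions, not the paper's exact settings:

def num_context_tokens(T, window=16, h=16, w=16, short_patch=1, long_patch=4):
    # Recent frames keep fine-grained patches; older frames are aggressively
    # patchified so each contributes far fewer tokens.
    short_T = min(T, window)
    long_T = max(T - window, 0)
    short_tokens = short_T * (h // short_patch) * (w // short_patch)
    long_tokens = long_T * (h // long_patch) * (w // long_patch)
    return short_tokens + long_tokens

# 300-frame context: 16*256 + 284*16 = 8640 tokens,
# versus 300*256 = 76800 tokens with uniform fine patchification.
print(num_context_tokens(300))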

🔥 FAR exploits a multi-level KV cache to speed up autoregressive inference on long videos:
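The idea can be pictured as two cache levels: fine-grained KV entries for a sliding window of recent frames, plus coarse-grained entries (re-patchified to fewer tokens) for older frames, so frames outside the window are never recomputed at full resolution. A simplified sketch, not the repository's actual cache implementation:

from collections import deque

class MultiLevelKVCache:
    def __init__(self, window=16):
        self.window = window
        self.short = deque()  # fine-grained KV for the most recent frames
        self.long = []        # coarse-grained KV for all older frames

    def append(self, fine_kv, coarsen):
        # Cache the newest frame at fine granularity; once it leaves the
        # window, demote it to the coarse long-term level via `coarsen`
        # (e.g., aggressive re-patchification to fewer tokens).
        self.short.append(fine_kv)
        if len(self.short) > self.window:
            self.long.append(coarsen(self.short.popleft()))

    def context(self):
        # Attention context = long-term coarse tokens + short-term fine tokens.
        return self.long + list(self.short)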

📚 For more details, check out our paper.

๐Ÿ‹๏ธโ€โ™‚๏ธ FAR Model Zoo

We provide the trained FAR models from our paper for reproduction.

Video Generation

We evaluate with seeds [0, 2, 4, 6], following the evaluation protocol of Latte; each FVD entry below is the mean ± standard deviation over these seeds (see the snippet after the table):

| Model (Config) | #Params | Resolution | Condition | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) | Memory (Per GPU) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAR-L | 457 M | 128x128 | ✗ | 280 ± 11.7 | Model-HF | Google Drive | 12.2 | 22 G |
| FAR-L | 457 M | 128x128 | ✓ | 99 ± 5.9 | Model-HF | Google Drive | 12.2 | 22 G |
| FAR-L | 457 M | 256x256 | ✗ | 303 ± 13.5 | Model-HF | Google Drive | 12.7 | 22 G |
| FAR-L | 457 M | 256x256 | ✓ | 113 ± 3.6 | Model-HF | Google Drive | 12.7 | 22 G |
| FAR-XL | 657 M | 256x256 | ✗ | 279 ± 9.2 | Model-HF | Google Drive | 14.6 | 22 G |
| FAR-XL | 657 M | 256x256 | ✓ | 108 ± 4.2 | Model-HF | Google Drive | 14.6 | 22 G |
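For reference, the ± values above are the mean and standard deviation of FVD across the four evaluation seeds. A minimal sketch with made-up per-seed scores (in practice, each score comes from one test.py run per seed):

import statistics

scores_by_seed = {0: 285.0, 2: 270.1, 4: 292.3, 6: 272.6}  # hypothetical FVDs
scores = list(scores_by_seed.values())
print(f"FVD: {statistics.mean(scores):.0f} ± {statistics.stdev(scores):.1f}")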

Short-Video Prediction

We follow the evaluation protocol of MCVD and ExtDM:

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) | Memory (Per GPU) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAR-B | 130 M | UCF101 | 25.64 | 0.818 | 0.037 | 194.1 | Model-HF | Google Drive | 3.6 | 9 G |
| FAR-B | 130 M | BAIR (c=2, p=28) | 19.40 | 0.819 | 0.049 | 144.3 | Model-HF | Google Drive | 2.6 | 12 G |

Long-Video Prediction

We evaluate with seeds [0, 2, 4, 6], following the evaluation protocol of TECO:

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) | Memory (Per GPU) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAR-B-Long | 150 M | DMLab | 22.3 | 0.687 | 0.104 | 64 | Model-HF | Google Drive | 17.5 | 13 G |
| FAR-M-Long | 280 M | Minecraft | 16.9 | 0.448 | 0.251 | 39 | Model-HF | Google Drive | 18.2 | 19 G |

🔧 Dependencies and Installation

1. Setup Environment:

# Setup Conda Environment
conda create -n FAR python=3.10
conda activate FAR

# Install PyTorch
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install other dependencies
pip install -r requirements.txt

2. Prepare Dataset:

We have uploaded the datasets used in this paper to Hugging Face for faster download. Please follow the instructions below to prepare them.

from huggingface_hub import snapshot_download

dataset_url = {
    "ucf101": "guyuchao/UCF101",
    "bair": "guyuchao/BAIR",
    "minecraft": "guyuchao/Minecraft",
    "minecraft_latent": "guyuchao/Minecraft_Latent",
    "dmlab": "guyuchao/DMLab",
    "dmlab_latent": "guyuchao/DMLab_Latent"
}

# Download each dataset snapshot into datasets/<name>
for key, url in dataset_url.items():
    snapshot_download(
        repo_id=url,
        repo_type="dataset",
        local_dir=f"datasets/{key}",
        token="input your hf token here"  # your Hugging Face access token
    )

Then, enter each dataset directory and extract the shards:

find . -name "shard-*.tar" -exec tar -xvf {} \;

3. Prepare Pretrained Models of FAR:

We have uploaded the pretrained FAR models to Hugging Face. Please follow the instructions below to download them if you want to evaluate FAR.

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"  # your Hugging Face access token
)

🚀 Training

To train different models, you can run the following command:

accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 19040 \
    train.py \
    -opt train_config.yml

  • Wandb: Set use_wandb to True in the config to enable wandb monitoring.
  • Periodic Evaluation: Set val_freq to control how frequently evaluation runs during training.
  • Auto Resume: Simply rerun the script; the model will find the latest checkpoint to resume from, and the wandb log will resume automatically.
  • Efficient Training on Pre-Extracted Latents: Set use_latent to True and set data_list to the corresponding latent path list (see the config sketch below).
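A hypothetical config fragment illustrating these options; only the keys named above (use_wandb, val_freq, use_latent, data_list) come from this README, and every value and path is a placeholder:

# train_config.yml (illustrative fragment only)
use_wandb: true        # enable wandb monitoring
val_freq: 5000         # evaluate every 5000 iterations (placeholder value)
use_latent: true       # train on pre-extracted latents
data_list:             # placeholder latent path list
  - datasets/dmlab_latent/train_list.txt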

💻 Sampling & Evaluation

To evaluate a pretrained model, copy the training config and set pretrain_network (which defaults to ~) to the path of your trained model folder. Then run the following script:

accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 10410 \
    test.py \
    -opt test_config.yml

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

📖 Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{gu2025long,
    title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
    author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2503.19325},
    year={2025}
}