🎥 FAR: Frame Autoregressive Model for Both Short- and Long-Context Video Modeling 🚀

April 23, 2025

Project Page · arXiv · Hugging Face Weights · Colab Demo

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

(Sample video: DMLab)

📢 News

  • 2025-04: Updated the multi-level KV cache for faster inference on long videos. 🎉 Check our updated paper for details, and try the released Colab demo for inference speed tests.
  • 2025-04: Released a Colab demo for quick inference! 🎉
  • 2025-03: Paper and code of FAR are released! ๐ŸŽ‰

🌟 What's the Potential of FAR?

🔥 Introducing FAR: a new baseline for autoregressive video generation

FAR (Frame AutoRegressive Model) learns to predict continuous frames conditioned on an autoregressive context of preceding frames. This objective aligns naturally with video modeling, in the same way that next-token prediction aligns with language modeling.
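As a rough illustration of this objective, the sketch below pairs a clean frame context with a per-frame denoising loss on continuous latents. It is a minimal sketch under assumed interfaces; the model signature and the DDPM-style noising scheme are our assumptions, not the repository's actual training code:

import torch
import torch.nn.functional as F

def far_training_step(model, latents):
    # latents: (B, T, C, H, W) continuous frame latents from a video VAE.
    B, T = latents.shape[:2]
    t_idx = torch.randint(1, T, (1,)).item()   # frame to predict this step
    context = latents[:, :t_idx]               # clean autoregressive context
    target = latents[:, t_idx]                 # continuous next-frame latent
    # DDPM-style noising of the target frame only (assumed scheme).
    alpha = torch.rand(B, 1, 1, 1, device=latents.device)
    noise = torch.randn_like(target)
    noisy = alpha.sqrt() * target + (1.0 - alpha).sqrt() * noise
    pred_noise = model(noisy, context, alpha)  # hypothetical model interface
    return F.mse_loss(pred_noise, noise)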


🔥 FAR achieves better convergence than video diffusion models with the same continuous latent space:

🔥 FAR leverages clean visual context without additional image-to-video fine-tuning:

Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frame = 0) and video prediction (context frame ≥ 1) within a single model.
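In other words, the two tasks differ only in how many clean frames are supplied at sampling time. A minimal sketch, assuming a hypothetical sample_next_frame interface (not the repository's API):

import torch

def rollout(model, num_frames, context=None):
    # context=None  -> unconditional video generation (context frame = 0)
    # context given -> video prediction from the observed clean frames
    frames = [] if context is None else list(context.unbind(dim=1))
    while len(frames) < num_frames:
        frames.append(model.sample_next_frame(frames))  # hypothetical call
    return torch.stack(frames, dim=1)  # (B, num_frames, C, H, W)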

🔥 FAR supports efficient training on long video sequences with manageable token lengths:

The key technique behind this is long short-term context modeling: regular patchification of the short-term context preserves fine-grained temporal consistency, while aggressive patchification of the long-term context reduces redundant tokens.
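A back-of-the-envelope token count shows why this helps. The window size and patch sizes below are illustrative assumptions, not the paper's exact settings:

def num_context_tokens(T, window=16, h=16, w=16, short_patch=1, long_patch=4):
    # Recent frames keep fine-grained patches; older frames are aggressively
    # patchified so each contributes far fewer tokens.
    short_T = min(T, window)
    long_T = max(T - window, 0)
    short_tokens = short_T * (h // short_patch) * (w // short_patch)
    long_tokens = long_T * (h // long_patch) * (w // long_patch)
    return short_tokens + long_tokens

# 300-frame context: 16*256 + 284*16 = 8640 tokens,
# versus 300*256 = 76800 tokens with uniform fine patchification.
print(num_context_tokens(300))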

🔥 FAR exploits a multi-level KV cache to speed up autoregressive inference on long videos:
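The idea can be pictured as two cache levels: fine-grained KV entries for a sliding window of recent frames, plus coarse-grained entries (re-patchified to fewer tokens) for older frames, so frames outside the window are never recomputed at full resolution. A simplified sketch, not the repository's actual cache implementation:

from collections import deque

class MultiLevelKVCache:
    def __init__(self, window=16):
        self.window = window
        self.short = deque()  # fine-grained KV for the most recent frames
        self.long = []        # coarse-grained KV for all older frames

    def append(self, fine_kv, coarsen):
        # Cache the newest frame at fine granularity; once it leaves the
        # window, demote it to the coarse long-term level via `coarsen`
        # (e.g., aggressive re-patchification to fewer tokens).
        self.short.append(fine_kv)
        if len(self.short) > self.window:
            self.long.append(coarsen(self.short.popleft()))

    def context(self):
        # Attention context = long-term coarse tokens + short-term fine tokens.
        return self.long + list(self.short)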

📚 For more details, check out our paper.

๐Ÿ‹๏ธโ€โ™‚๏ธ FAR Model Zoo

We provide the trained FAR models from our paper for reproduction.

Video Generation

We evaluate with seeds [0, 2, 4, 6], following the evaluation protocol of Latte; each FVD entry below is the mean ± standard deviation over these seeds (see the snippet after the table):

| Model (Config) | #Params | Resolution | Condition | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) | Memory (Per GPU) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAR-L | 457 M | 128x128 | ✗ | 280 ± 11.7 | Model-HF | Google Drive | 12.2 | 22 G |
| FAR-L | 457 M | 128x128 | ✓ | 99 ± 5.9 | Model-HF | Google Drive | 12.2 | 22 G |
| FAR-L | 457 M | 256x256 | ✗ | 303 ± 13.5 | Model-HF | Google Drive | 12.7 | 22 G |
| FAR-L | 457 M | 256x256 | ✓ | 113 ± 3.6 | Model-HF | Google Drive | 12.7 | 22 G |
| FAR-XL | 657 M | 256x256 | ✗ | 279 ± 9.2 | Model-HF | Google Drive | 14.6 | 22 G |
| FAR-XL | 657 M | 256x256 | ✓ | 108 ± 4.2 | Model-HF | Google Drive | 14.6 | 22 G |
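For reference, the ± values above are the mean and standard deviation of FVD across the four evaluation seeds. A minimal sketch with made-up per-seed scores (in practice, each score comes from one test.py run per seed):

import statistics

scores_by_seed = {0: 285.0, 2: 270.1, 4: 292.3, 6: 272.6}  # hypothetical FVDs
scores = list(scores_by_seed.values())
print(f"FVD: {statistics.mean(scores):.0f} ± {statistics.stdev(scores):.1f}")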

Short-Video Prediction

We follow the evaluation protocol of MCVD and ExtDM:

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) | Memory (Per GPU) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAR-B | 130 M | UCF101 | 25.64 | 0.818 | 0.037 | 194.1 | Model-HF | Google Drive | 3.6 | 9 G |
| FAR-B | 130 M | BAIR (c=2, p=28) | 19.40 | 0.819 | 0.049 | 144.3 | Model-HF | Google Drive | 2.6 | 12 G |

Long-Video Prediction

We evaluate with seeds [0, 2, 4, 6], following the evaluation protocol of TECO:

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) | Memory (Per GPU) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAR-B-Long | 150 M | DMLab | 22.3 | 0.687 | 0.104 | 64 | Model-HF | Google Drive | 17.5 | 13 G |
| FAR-M-Long | 280 M | Minecraft | 16.9 | 0.448 | 0.251 | 39 | Model-HF | Google Drive | 18.2 | 19 G |

🔧 Dependencies and Installation

1. Setup Environment:

# Setup Conda Environment
conda create -n FAR python=3.10
conda activate FAR

# Install PyTorch
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install other dependencies
pip install -r requirements.txt

2. Prepare Dataset:

We have uploaded the datasets used in this paper to Hugging Face for faster download. Please follow the instructions below to prepare them.

from huggingface_hub import snapshot_download

dataset_url = {
    "ucf101": "guyuchao/UCF101",
    "bair": "guyuchao/BAIR",
    "minecraft": "guyuchao/Minecraft",
    "minecraft_latent": "guyuchao/Minecraft_Latent",
    "dmlab": "guyuchao/DMLab",
    "dmlab_latent": "guyuchao/DMLab_Latent"
}

# Download each dataset snapshot into datasets/<name>
for key, url in dataset_url.items():
    snapshot_download(
        repo_id=url,
        repo_type="dataset",
        local_dir=f"datasets/{key}",
        token="input your hf token here"  # your Hugging Face access token
    )

Then, enter each dataset directory and extract the shards:

find . -name "shard-*.tar" -exec tar -xvf {} \;

3. Prepare Pretrained Models of FAR:

We have uploaded the pretrained FAR models to Hugging Face. Please follow the instructions below to download them if you want to evaluate FAR.

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"  # your Hugging Face access token
)

🚀 Training

To train different models, you can run the following command:

accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 19040 \
    train.py \
    -opt train_config.yml

  • Wandb: Set use_wandb to True in the config to enable wandb monitoring.
  • Periodic Evaluation: Set val_freq to control how frequently evaluation runs during training.
  • Auto Resume: Simply rerun the script; the model will find the latest checkpoint to resume from, and the wandb log will resume automatically.
  • Efficient Training on Pre-Extracted Latents: Set use_latent to True and set data_list to the corresponding latent path list (see the config sketch below).
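A hypothetical config fragment illustrating these options; only the keys named above (use_wandb, val_freq, use_latent, data_list) come from this README, and every value and path is a placeholder:

# train_config.yml (illustrative fragment only)
use_wandb: true        # enable wandb monitoring
val_freq: 5000         # evaluate every 5000 iterations (placeholder value)
use_latent: true       # train on pre-extracted latents
data_list:             # placeholder latent path list
  - datasets/dmlab_latent/train_list.txt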

💻 Sampling & Evaluation

To evaluate a pretrained model, copy the training config and set pretrain_network (which defaults to ~) to the path of your trained model folder. Then run the following script:

accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 10410 \
    test.py \
    -opt test_config.yml

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

📖 Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{gu2025long,
    title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
    author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2503.19325},
    year={2025}
}