May 24, 2026

Sapiens2

Scale. Semantics. Fidelity.

Rawal Khirodkar · He Wen · Julieta Martinez · Yuan Dong · Su Zhaoen · Shunsuke Saito

ICLR 2026

A family of high-resolution transformers pretrained on 1 billion human images, achieving state-of-the-art performance across diverse human-centric tasks — pose estimation, body-part segmentation, surface normals, pointmaps, and human matting.

🤗 Demos: Pose · Seg · Normal · Pointmap · Matting

📣 News

May 15, 2026: Sapiens2-1B human matting model is released.
April 24, 2026: Initial Sapiens2 release — pose, body-part segmentation, surface normals, and pointmaps.

⚡ Quick Start

Run a pretrained backbone forward pass — only torch and safetensors needed:

import os
import torch
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load a pretrained checkpoint
model = Sapiens2(arch="sapiens2_1b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt = os.path.expanduser("~/sapiens2_host/pretrain/sapiens2_1b_pretrain.safetensors")
model.load_state_dict(load_file(ckpt))

# Forward pass on a single image (RGB; ImageNet normalization recommended).
# A random tensor stands in for a real image here; layout is (batch, channels, H, W).
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features
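
The snippet above feeds a random tensor. For a real image, a minimal preprocessing sketch might look like the following. It assumes the standard ImageNet mean and standard deviation on a [0, 1] scale and a plain bilinear resize to 1024×768; the `preprocess` helper is hypothetical and not part of the repository, so check the task docs for the exact values and resizing the released models expect.

```python
import torch
import torch.nn.functional as F

# Standard ImageNet statistics for inputs scaled to [0, 1].
# Assumption: Sapiens2 uses these values; verify against the repo's inference docs.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def preprocess(image_hwc_uint8: torch.Tensor, size=(1024, 768)) -> torch.Tensor:
    """Hypothetical helper: (H, W, 3) uint8 RGB -> (1, 3, H', W') float32."""
    # Channels-first, add a batch dimension, scale to [0, 1].
    x = image_hwc_uint8.permute(2, 0, 1).unsqueeze(0).float() / 255.0
    # Resize to the model's (H, W); this does not preserve aspect ratio.
    x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    # Per-channel normalization.
    return (x - IMAGENET_MEAN) / IMAGENET_STD


# Stand-in for a decoded RGB image (4:3 height-to-width, like the model input).
image = torch.randint(0, 256, (480, 360, 3), dtype=torch.uint8)
x = preprocess(image)
print(x.shape)  # torch.Size([1, 3, 1024, 768])
```

The result can replace the random `x` in the Quick Start snippet after moving it to the model's device with `.cuda()`.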

🪶 Zero-Dependency Usage

The Quick Start snippet above imports from a single self-contained file — torch (plus safetensors for checkpoint loading) is all you need. Drop the file into your project and you're done:

curl -O https://raw.githubusercontent.com/facebookresearch/sapiens2/main/sapiens/backbones/standalone/sapiens2.py

For Sapiens v1, grab sapiens.py instead.

🧬 Model Card

Model	Params	FLOPs	Embed dim	Layers	Heads
Sapiens2-0.1B	0.114 B	0.342 T	768	12	12
Sapiens2-0.4B	0.398 B	1.260 T	1024	24	16
Sapiens2-0.8B	0.818 B	2.592 T	1280	32	16
Sapiens2-1B	1.462 B	4.715 T	1536	40	24
Sapiens2-1B (4K)	1.607 B	—	1536	40	24
Sapiens2-5B	5.071 B	15.722 T	2432	56	32

All models use patch size 16 and are trained at 1024×768 (H×W) resolution, except Sapiens2-1B (4K), which is trained at 4096×3072 with use_tokenizer=True.
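
To make the resolution figures concrete, the short sketch below works out how many 16×16 patches each training resolution produces. The `patch_grid` helper is hypothetical, and the arithmetic covers only the raw patch grid: whether the models add extra tokens, or how `use_tokenizer=True` changes the sequence length at 4K, is not stated here.

```python
def patch_grid(img_size, patch_size=16):
    """Hypothetical helper: return (rows, cols, total) patches for an (H, W) image."""
    h, w = img_size
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    rows, cols = h // patch_size, w // patch_size
    return rows, cols, rows * cols


print(patch_grid((1024, 768)))   # (64, 48, 3072)
print(patch_grid((4096, 3072)))  # (256, 192, 49152)
```

The 4K input yields 16 times as many raw patches as the 1024×768 input.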

📦 Getting Started

Clone the repository:

git clone https://github.com/facebookresearch/sapiens2.git
cd sapiens2
export SAPIENS_ROOT=$(pwd)

Install (requires Python ≥3.12 and PyTorch ≥2.7):

pip install -e .

Download checkpoints from MODEL_ZOO.md. Place downloaded files under $SAPIENS_CHECKPOINT_ROOT (default: ~/sapiens2_host):

sapiens2_host/
├── pretrain/
│   ├── sapiens2_{0.1b,0.4b,0.8b,1b,5b}_pretrain.safetensors
│   └── sapiens2_1b_4k_pretrain.safetensors
├── pose/
│   └── sapiens2_{0.4b,0.8b,1b,5b}_pose.safetensors
├── seg/
│   └── sapiens2_{0.4b,0.8b,1b,5b}_seg.safetensors
├── normal/
│   └── sapiens2_{0.4b,0.8b,1b,5b}_normal.safetensors
├── pointmap/
│   └── sapiens2_{0.4b,0.8b,1b,5b}_pointmap.safetensors
├── matting/
│   └── sapiens2_1b_matting.safetensors
└── detector/                  # [optional] only needed for pose inference
    └── detr-resnet-101-dc5/
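
Given the layout above, a checkpoint path can be assembled from the task and model size. The sketch below is one way to do that; `checkpoint_path` is a hypothetical helper, and reading `SAPIENS_CHECKPOINT_ROOT` with a fallback to `~/sapiens2_host` mirrors the default described above rather than the repository's own lookup code.

```python
import os
from pathlib import Path


def checkpoint_path(task: str, size: str) -> Path:
    """Hypothetical helper: build <root>/<task>/sapiens2_<size>_<task>.safetensors."""
    # Assumption: the root comes from this environment variable, with the
    # documented default as a fallback.
    root = os.environ.get("SAPIENS_CHECKPOINT_ROOT", "~/sapiens2_host")
    return Path(root).expanduser() / task / f"sapiens2_{size}_{task}.safetensors"


ckpt = checkpoint_path("pose", "1b")
print(ckpt, "(found)" if ckpt.is_file() else "(missing)")
```

Passing `size="1b_4k"` with `task="pretrain"` produces the 4K pretraining file name shown in the tree.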

🎯 Vision Tasks

Task	Description	Inference	Train
Pose Estimation	308 whole-body keypoints	docs/POSE.md	docs/train/POSE.md
Body-Part Segmentation	29 body parts	docs/SEG.md	docs/train/SEG.md
Surface Normal Estimation	per-pixel normals	docs/NORMAL.md	docs/train/NORMAL.md
Pointmap Estimation	per-pixel 3D points	docs/POINTMAP.md	docs/train/POINTMAP.md
Human Matting	alpha matte + foreground	docs/MATTING.md	docs/train/MATTING.md

📚 Citation

@article{khirodkarsapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Su, Zhaoen and Saito, Shunsuke},
  journal={arXiv preprint arXiv:2604.21681},
  year={2026}
}
