README.md

May 22, 2026 ยท View on GitHub

HumanNet: Human-Centric Video Learning and Embodied AI Resources

DAGroup & SimpleSilicon Innovation Team

Peking University

๐Ÿ”ฅ News

  • [2026.05.22] ๐Ÿ”ฅ We reprocessed a compliant embodied dataset on our university cluster. In this version, we removed non-compliant third-person source data and, after the full data pipeline, obtained a high-quality first-person robot video dataset with 3M videos, 720P resolution, and 30 FPS. To the best of our knowledge, RoVid-X is the largest robot video generation dataset to date for training physics-aware models. See Dataset | Resources.
  • [2026.05.18] ๐Ÿ”ฅ We release StableVLA. Congratulations on its acceptance to ICML 2026! It is a vision-language-action model for robust robot policy learning. See Docs | Code | Project | Paper | Checkpoint.
  • [Next Month] ๐Ÿ”ฅ We are preparing the open-source release of the HumanNet corpus, the curation pipeline, and the post-training validation code. Stay tuned!
  • [2026.05.11]๐Ÿ”ฅ The HumanNet technical report and project page have been released: Paper | Project.

๐Ÿ“‘ Todo List

  • Release the HumanNet technical report on arXiv. โœ…
  • Release StableVLA model code and documentation. โœ…
  • Release RoVid-X on Hugging Face. โœ…
  • Release a HumanNet preview subset on Hugging Face for early access.
  • Release the full one-million-hour HumanNet corpus with metadata and annotations.
  • Release the trained checkpoints initialized from HumanNet.

๐Ÿ“ฃ Overview

teaser This repository is maintained as a growing research hub for human-centric video data, embodied learning models, and validation code. It currently centers on HumanNet, a one-million-hour human-centric video corpus, and will also host related models, training recipes, evaluation protocols, and release notes.

The initial core release is HumanNet, a scalable infrastructure for fine-grained activity understanding, motion-aware video learning, and embodied pretraining. HumanNet pairs first-person and third-person footage with caption labels, motion annotations, and hand and body signals, organized by a multi-axis taxonomy and produced by a curation pipeline that treats human-centric filtering, viewpoint characterization, quality control, and privacy review as first-class design choices.

๐ŸŽฅ Demo

https://github.com/user-attachments/assets/52eaa410-0ec4-4f89-81e8-d2ecf9bb351c

๐Ÿ“š Dataset Family

DatasetStatusDocumentationResources
HumanNetDocumentation availableDocssrc/dataset/humandata
RoVid-XReleased on Hugging FaceDataset cardHugging Face / src/dataset/rovid-x

๐Ÿค– Model Family

ModelStatusDocumentationCode
StableVLACode and docs availableDocssrc/model/StableVLA

๐Ÿ—‚๏ธ Repository Map

HumanNet/
โ”œโ”€โ”€ README.md                 # Repository entry point
โ”œโ”€โ”€ docs/                     # Component-level documentation and release notes
โ”‚   โ”œโ”€โ”€ humandata.md          # HumanNet dataset documentation
โ”‚   โ””โ”€โ”€ stablevla.md          # StableVLA documentation
โ”œโ”€โ”€ assets/                   # Figures used by the repository README
โ””โ”€โ”€ src/
    โ”œโ”€โ”€ dataset/
    โ”‚   โ”œโ”€โ”€ humandata/        # HumanNet dataset resources
    โ”‚   โ””โ”€โ”€ rovid-x/          # ROViD-X dataset resources
    โ””โ”€โ”€ model/
        โ””โ”€โ”€ StableVLA/        # StableVLA source code, training scripts, and model README

๐Ÿ”ง Usage

Coming soon.

# Download a HumanNet subset (placeholder)
# if you are in china mainland, run this first: export HF_ENDPOINT=https://hf-mirror.com
# pip install -U "huggingface_hub[cli]"
huggingface-cli download DAGroup-PKU/HumanNet

# Download RoVid-X
huggingface-cli download DAGroup-PKU/RoVid-X --repo-type dataset

๐Ÿ™ Acknowledgement

We gratefully acknowledge SimpleSilicon Innovation for providing funding and resource support, and Astribot for providing real-robot platforms and deployment experiment support.

๐Ÿ“ง Ethics Concerns

The videos referenced in this repository are sourced from public domains and intended solely to showcase the capabilities of this research. Human-centric video raises non-trivial privacy, consent, and dual-use concerns; any release will follow license review, redaction, restricted-content filtering, access controls where necessary, and clear documentation of what is included or excluded.

  • The service is a research preview. Please contact us if you find any potential violations.

โœ๏ธ Citation

If you find our work useful in your research, please consider giving a star :star: and citation :pencil:.

BibTeX

@article{deng2026humannet,
  title={HumanNet: Scaling Human-centric Video Learning to One Million Hours},
  author={Deng, Yufan and Zhou, Daquan},
  journal={arXiv preprint arXiv:2605.06747},
  year={2026}
}

@misc{fu2026stablevlarobustvisionlanguageactionmodels,
      title={StableVLA: Towards Robust Vision-Language-Action Models without Extra Data}, 
      author={Yiyang Fu and Chubin Zhang and Shukai Gong and Yufan Deng and Kaiwei Sun and Qiyang Min and Qibin Hou and Yansong Tang and Jianan Wang and Daquan Zhou},
      year={2026},
      eprint={2605.18287},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.18287}, 
}