TinyNeXt
October 22, 2025
Official PyTorch implementation of "An Efficient Hybrid Vision Transformer for TinyML Applications" (ICCV 2025).
Abstract: To enable the deployment of Vision Transformers on resource-constrained mobile and edge devices, the development of efficient ViT models has attracted significant attention. Researchers achieved remarkable improvements in accuracy and speed by optimizing attention mechanisms and integrating lightweight CNN modules. However, existing designs often overlook runtime overhead from memory-bound operations and the shift in feature characteristics from spatial-dominant to semantic-dominant as networks deepen. This work introduces TinyNeXt, a family of efficient hybrid ViTs for TinyML, featuring Lean Single-Head Self-Attention to minimize memory-bound operations, and a macro design tailored to feature characteristics at different stages. TinyNeXt strikes a better accuracy-speed trade-off across diverse tasks and hardware platforms, outperforming state-of-the-art models of comparable scale. For instance, our TinyNeXt-T achieves a remarkable 71.5% top-1 accuracy with only 1.0M parameters on ImageNet-1K. Furthermore, compared to recent efficient models like MobileViT-XXS and MobileViT-XS, TinyNeXt-S and TinyNeXt-M achieve 3.7%/0.5% higher accuracy, respectively, while running 2.1×/2.6× faster on Nvidia Jetson Nano.
Comparison with SOTA models on ImageNet-1K
Overview of TinyNeXt
Repository Structure
- classification/: Image classification code and checkpoint.
- detection/: Object detection code.
- segmentation/: Semantic segmentation code.
- speed_benchmark/: Throughput testing scripts and logs.
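For readers who want to time a model before digging into speed_benchmark/, the general shape of a latency measurement can be sketched as below. This is a generic, standard-library-only illustration and not the repository's own script: `measure_latency` and `dummy_infer` are hypothetical names, and the workload is a stand-in for a real forward pass.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(infer: Callable[[], object], warmup: int = 10, runs: int = 50) -> Dict[str, float]:
    """Time a zero-argument callable and report per-call latency in milliseconds."""
    # Warm-up calls are discarded so one-time costs (lazy init, caches) do not skew the samples.
    for _ in range(warmup):
        infer()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        # On a GPU, the device must be synchronized before the clock is read;
        # this plain-Python sketch has no such step because it runs on the CPU only.
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "min_ms": min(samples_ms),
    }


def dummy_infer() -> int:
    # Hypothetical stand-in workload; replace with a real model forward pass.
    return sum(i * i for i in range(20_000))


stats = measure_latency(dummy_infer, warmup=3, runs=10)
print(stats)
```

Reporting the median alongside the mean is a common choice because a few slow outlier runs can inflate the mean on a busy device.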
Model Performance
Image Classification Performance (ImageNet-1K)
| Model | Top-1 Accuracy | Parameters | MACs | Latency |
|---|---|---|---|---|
| TinyNeXt-M | 75.3% | 2.3M | 475M | 19.4ms |
| TinyNeXt-S | 72.7% | 1.3M | 304M | 14.3ms |
| TinyNeXt-T | 71.5% | 1.0M | 259M | 12.7ms |
Latency is measured on Nvidia Jetson Nano.
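To make the latency column easier to compare, the short sketch below converts it into images per second and into latency relative to TinyNeXt-T. The numbers are copied from the classification table; the conversion assumes each latency figure is for a single image, which the table does not state explicitly.

```python
# Latency (ms) and top-1 accuracy (%) copied from the classification table.
models = {
    "TinyNeXt-M": {"latency_ms": 19.4, "top1": 75.3},
    "TinyNeXt-S": {"latency_ms": 14.3, "top1": 72.7},
    "TinyNeXt-T": {"latency_ms": 12.7, "top1": 71.5},
}

baseline = models["TinyNeXt-T"]

summary = {}
for name, m in models.items():
    summary[name] = {
        # Assumption: one image per measured forward pass.
        "images_per_second": round(1000.0 / m["latency_ms"], 1),
        "latency_vs_T": round(m["latency_ms"] / baseline["latency_ms"], 2),
        "top1_gain_vs_T": round(m["top1"] - baseline["top1"], 1),
    }

for name, row in summary.items():
    print(name, row)
```

Under that assumption, TinyNeXt-M trades about 1.53× the latency of TinyNeXt-T for 3.8 points of top-1 accuracy.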
Object Detection Performance Based on SSDLite (MS-COCO 2017)
| Backbone | AP | AP50 | AP75 | Parameters |
|---|---|---|---|---|
| TinyNeXt-S | 22.4 | 37.9 | 22.7 | 2.3M |
| TinyNeXt-M | 25.0 | 41.1 | 25.4 | 3.3M |
Semantic Segmentation Performance Based on DeepLabv3 (Pascal VOC 2012)
| Backbone | Parameters | FLOPs | mIoU |
|---|---|---|---|
| TinyNeXt-S | 2.3M | 3.5G | 75.5 |
| TinyNeXt-M | 3.3M | 5.1G | 76.9 |
Acknowledgements
We thank the following repositories, among others, for supporting our research:
Citation
If you find this work helpful, please consider citing:
@inproceedings{tinynext_iccv2025,
author = {Zeng, Fanhong and Li, Huanan and Guan, Juntao and Fan, Rui and Wu, Tong and Wang, Xilong and Lai, Rui},
title = {An Efficient Hybrid Vision Transformer for TinyML Applications},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {19914--19924}
}