InternImage: Large-Scale Vision Foundation Model

March 4, 2025

This repository is the official implementation of InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

[Paper] [Zhihu column]

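Highlights

👍 The strongest general-purpose vision backbone, with up to 3 billion parameters
🏆 90.1% Top-1 accuracy on the ImageNet image classification benchmark, the highest among open-source models
🏆 65.5 mAP on the COCO object detection benchmark, the only model to exceed 65 mAP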

Features

Introduction

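InternImage is a vision foundation model based on convolutional neural networks (CNNs), proposed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike Transformer-based networks, InternImage uses the deformable convolution DCNv3 as its core operator. This gives the model the dynamic effective receptive field that downstream tasks such as detection and segmentation require, and it also enables adaptive spatial aggregation.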
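To make "dynamic receptive field" and "adaptive spatial aggregation" more concrete, here is a minimal pure-Python sketch of a deformable-convolution-style aggregation for a single channel at a single output location. It is an illustration only, not the repository's DCNv3 operator: the function names (`bilinear_sample`, `deformable_aggregate`) are hypothetical, the offsets and modulation scores are passed in by hand rather than predicted from the input, and the grouping and projection weights of the real operator are omitted. The softmax normalization over sampling points is an assumption about how the modulation weights are normalized.

```python
import math

# Regular 3x3 sampling grid around the output location (row offset, column offset).
GRID_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at a fractional (y, x).

    Positions outside the map contribute zero.
    """
    height, width = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < height and 0 <= xi < width:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_aggregate(feature, center, offsets, modulation_logits):
    """Aggregate nine sampled values around `center`.

    offsets: nine (dy, dx) shifts added to the regular grid positions.
             In a real network these are predicted from the input, which
             is what makes the receptive field dynamic.
    modulation_logits: nine scores, softmax-normalized into aggregation
             weights (the "adaptive spatial aggregation" part).
    """
    cy, cx = center
    weights = softmax(modulation_logits)
    output = 0.0
    for (gy, gx), (oy, ox), m in zip(GRID_3X3, offsets, weights):
        output += m * bilinear_sample(feature, cy + gy + oy, cx + gx + ox)
    return output


if __name__ == "__main__":
    # Toy 5x5 feature map whose value grows linearly with position.
    fmap = [[float(5 * r + c) for c in range(5)] for r in range(5)]
    no_shift = [(0.0, 0.0)] * 9
    half_pixel = [(0.5, 0.5)] * 9
    uniform = [0.0] * 9
    print(deformable_aggregate(fmap, (2, 2), no_shift, uniform))
    print(deformable_aggregate(fmap, (2, 2), half_pixel, uniform))
```

With zero offsets and uniform scores this reduces to a plain 3x3 average. Non-zero offsets move the sampling points off the regular grid, and uneven scores let a few sampling points dominate the result.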

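Other projects related to InternImage include the pre-training algorithm M3I-Pretraining, the Uni-Perceiver series of general-purpose decoders, and the BEVFormer series of general-purpose encoders for autonomous driving perception.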

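Performance

On the ImageNet image classification benchmark, InternImage reaches 90.1% Top-1 accuracy using only publicly available data. Apart from two undisclosed models from Google and Microsoft that rely on additional datasets, it is the only model with accuracy above 90.0%. It is also the most accurate and the largest open-source model on ImageNet.
On the COCO object detection benchmark, InternImage achieves 65.5 mAP, making it the only model in the world to exceed 65 mAP.
It achieves the best reported performance on 16 other important vision benchmark datasets covering classification, detection, and segmentation.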

Classification

Image classification	Scene classification		Long-tail classification
ImageNet	Places365	Places 205	iNaturalist 2018
90.1	61.2	71.7	92.6

Detection

General object detection				Long-tail object detection		Autonomous driving object detection		Dense object detection
COCO	VOC 2007	VOC 2012	OpenImage	LVIS minival	LVIS val	BDD100K	nuScenes	CrowdHuman
65.5	94.0	97.2	74.1	65.8	63.2	38.8	64.8	97.2

Segmentation

Semantic segmentation			Street-scene segmentation	RGBD segmentation
ADE20K	COCO Stuff-10K	Pascal Context	CityScapes	NYU Depth V2
62.9	59.6	70.3	87.0	68.1

Released Models

Open-source visual pretrained models

name	pretrain	resolution	#param	download
InternImage-L	ImageNet-22K	384x384	223M	pth | hf
InternImage-XL	ImageNet-22K	384x384	335M	pth | hf
InternImage-H	Joint 427M	384x384	1.08B	pth | hf
InternImage-G	Joint 427M	384x384	3B	pth | hf

ImageNet-1K image classification

name	pretrain	resolution	acc@1	#param	FLOPs	download
InternImage-T	ImageNet-1K	224x224	83.5	30M	5G	pth | hf | cfg
InternImage-S	ImageNet-1K	224x224	84.2	50M	8G	pth | hf | cfg
InternImage-B	ImageNet-1K	224x224	84.9	97M	16G	pth | hf | cfg
InternImage-L	ImageNet-22K	384x384	87.7	223M	108G	pth | hf | cfg
InternImage-XL	ImageNet-22K	384x384	88.0	335M	163G	pth | hf | cfg
InternImage-H	Joint 427M	640x640	89.6	1.08B	1478G	pth | hf | cfg
InternImage-G	Joint 427M	512x512	90.1	3B	2700G	pth | hf | cfg

COCO object detection and instance segmentation

backbone	method	schd	box mAP	mask mAP	#param	FLOPs	download
InternImage-T	Mask R-CNN	1x	47.2	42.5	49M	270G	ckpt | cfg
InternImage-T	Mask R-CNN	3x	49.1	43.7	49M	270G	ckpt | cfg
InternImage-S	Mask R-CNN	1x	47.8	43.3	69M	340G	ckpt | cfg
InternImage-S	Mask R-CNN	3x	49.7	44.5	69M	340G	ckpt | cfg
InternImage-B	Mask R-CNN	1x	48.8	44.0	115M	501G	ckpt | cfg
InternImage-B	Mask R-CNN	3x	50.3	44.8	115M	501G	ckpt | cfg
InternImage-L	Cascade	1x	54.9	47.7	277M	1399G	ckpt | cfg
InternImage-L	Cascade	3x	56.1	48.5	277M	1399G	ckpt | cfg
InternImage-XL	Cascade	1x	55.3	48.1	387M	1782G	ckpt | cfg
InternImage-XL	Cascade	3x	56.2	48.8	387M	1782G	ckpt | cfg

backbone	method	box mAP (val/test)	#param	download
CB-InternImage-H	DINO (TTA)	65.0 / 65.4	2.18B	ckpt | cfg
CB-InternImage-G	DINO (TTA)	65.3 / 65.5	6B	TODO

ADE20K semantic segmentation

backbone	method	resolution	mIoU (ss/ms)	#param	FLOPs	download
InternImage-T	UperNet	512x512	47.9 / 48.1	59M	944G	ckpt | cfg
InternImage-S	UperNet	512x512	50.1 / 50.9	80M	1017G	ckpt | cfg
InternImage-B	UperNet	512x512	50.8 / 51.3	128M	1185G	ckpt | cfg
InternImage-L	UperNet	640x640	53.9 / 54.1	256M	2526G	ckpt | cfg
InternImage-XL	UperNet	640x640	55.0 / 55.3	368M	3142G	ckpt | cfg
InternImage-H	UperNet	896x896	59.9 / 60.3	1.12B	3566G	ckpt | cfg
InternImage-H	Mask2Former	896x896	62.5 / 62.9	1.31B	4635G	ckpt | cfg

Model inference speed

Export a classification model from PyTorch to TensorRT

Export a detection model from PyTorch to TensorRT

Export a segmentation model from PyTorch to TensorRT

name	resolution	#param	FLOPs	batch 1 FPS (TensorRT)
InternImage-T	224x224	30M	5G	156
InternImage-S	224x224	50M	8G	129
InternImage-B	224x224	97M	16G	116
InternImage-L	384x384	223M	108G	56
InternImage-XL	384x384	335M	163G	47
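Batch-1 throughput is often easier to reason about as per-image latency, which is simply 1000 / FPS in milliseconds. The short sketch below applies that conversion to the figures in the table above. The dictionary and function names are illustrative, and the result ignores any pre- or post-processing time that the FPS measurement may not include.

```python
# Batch-1 TensorRT FPS values copied from the table above.
FPS_BATCH1 = {
    "InternImage-T": 156,
    "InternImage-S": 129,
    "InternImage-B": 116,
    "InternImage-L": 56,
    "InternImage-XL": 47,
}


def latency_ms(fps):
    """Convert frames per second at batch size 1 into milliseconds per image."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    for name, fps in FPS_BATCH1.items():
        print(f"{name}: {latency_ms(fps):.2f} ms per image")
```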

Before converting a PyTorch model to TensorRT with mmdeploy, make sure the DCNv3 custom operator has been compiled correctly. It can be installed as follows:

# Root of your MMDeploy checkout
export MMDEPLOY_DIR=/the/root/path/of/MMDeploy

# Copy the DCNv3 custom ops into MMDeploy; they are located at InternImage/tensorrt/modulated_deform_conv_v3
cp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt

# Build the custom ops (TENSORRT_DIR and CUDNN_DIR must already be set in the environment)
cd ${MMDEPLOY_DIR}
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) && make install

# Install MMDeploy after building the custom ops
cd ${MMDEPLOY_DIR}
pip install -e .

For more details on compiling custom operators with mmdeploy, please refer to this document.
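The build commands above read three environment variables: MMDEPLOY_DIR, TENSORRT_DIR, and CUDNN_DIR. If one is unset, cmake may fail with a less obvious error, so a small preflight check can save time. The following is a sketch of such a check. The helper name `check_build_env` is hypothetical and not part of InternImage or MMDeploy, and it only verifies that each variable points to an existing directory, not that the directory contains a valid installation.

```python
import os
from pathlib import Path

# Environment variables referenced by the build commands above.
REQUIRED_VARS = ("MMDEPLOY_DIR", "TENSORRT_DIR", "CUDNN_DIR")


def check_build_env(env=None):
    """Return a list of problems; an empty list means the variables look usable."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    issues = check_build_env()
    for issue in issues:
        print("warning:", issue)
    if not issues:
        print("environment variables look fine")
```

Running the script before `cmake` prints one warning per missing or invalid variable.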

Competitions

2022 Waymo 3D Camera-Only Detection Challenge: BEVFormer++, built on InternImage, won first place in this track
nuScenes 3D detection: BEVFormer v2 achieved state-of-the-art performance (64.8 NDS) on the nuScenes camera-only detection task
CVPR 2023 Workshop End-to-End Autonomous Driving: InternImage served as the baseline for the 3D Occupancy Prediction Challenge and the OpenLane Topology Challenge

Citation

If this work is helpful for your research, please cite it using the following BibTeX entry.

@inproceedings{wang2023internimage,
  title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={14408--14419},
  year={2023}
}
