InternImage for Image Classification

March 4, 2025

This folder contains the implementation of InternImage for image classification.

Installation
Data Preparation
Released Models
Evaluation
Training from Scratch on ImageNet-1K
Manage Jobs with Slurm
Training with DeepSpeed
Extracting Intermediate Features
Export

Installation

Clone this repository:

git clone https://github.com/OpenGVLab/InternImage.git
cd InternImage

Create a conda virtual environment and activate it:

conda create -n internimage python=3.9
conda activate internimage

Install CUDA>=10.2 with cuDNN>=7 following the official installation instructions.
Install PyTorch>=1.10.0 and torchvision>=0.9.0 with CUDA>=10.2.

For example, to install torch==1.11 with CUDA==11.3:

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113  -f https://download.pytorch.org/whl/torch_stable.html

Install timm==0.6.11, mmcv-full==1.5.0, mmsegmentation==0.27.0, and mmdet==2.28.1:

pip install -U openmim
mim install mmcv-full==1.5.0
mim install mmsegmentation==0.27.0
pip install timm==0.6.11 mmdet==2.28.1

Install other requirements:

pip install opencv-python termcolor yacs pyyaml scipy
# Please use a version of numpy lower than 2.0
pip install numpy==1.26.4
pip install pydantic==1.10.13

Compile CUDA operators

Before compiling, please use the nvcc -V command to check whether your nvcc version matches the CUDA version of PyTorch.
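
The same check can be scripted. The following is a minimal sketch, not part of this repository: it parses the `release X.Y` field from `nvcc -V` output and compares it with a CUDA version string such as the one PyTorch reports in `torch.version.cuda`. The helper names `parse_nvcc_release` and `cuda_versions_match` are hypothetical.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from the 'release X.Y' field of `nvcc -V` output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a 'release X.Y' field in the nvcc output")
    return int(match.group(1)), int(match.group(2))


def cuda_versions_match(nvcc_output, torch_cuda):
    """Compare nvcc's release with a CUDA version string such as '11.3'."""
    major, minor = torch_cuda.split(".")[:2]
    return parse_nvcc_release(nvcc_output) == (int(major), int(minor))


# Sample line in the format printed by `nvcc -V` (illustrative input only).
sample_output = "Cuda compilation tools, release 11.3, V11.3.109"
print(cuda_versions_match(sample_output, "11.3"))
```

In practice you would pass the captured output of `nvcc -V` and the value of `torch.version.cuda` instead of the sample strings. Note that `torch.version.cuda` is `None` for CPU-only builds, which this sketch does not handle.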

cd ./ops_dcnv3
sh ./make.sh
# unit test (every check should print True)
python test.py

You can also install the operator from the precompiled .whl files (DCNv3-1.0-whl).

Data Preparation

We provide the following ways to prepare data:

Standard ImageNet-1K

We use the standard ImageNet dataset, which you can download from http://image-net.org/.

For the standard folder layout, move the validation images into labeled sub-folders. The file structure should look like:

$ tree data
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
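
Moving the validation images into labeled sub-folders can be sketched as follows. This is an illustrative helper rather than a script shipped with this repository: `sort_val_images` and the `filename_to_class` mapping are hypothetical, and you would need to build the mapping yourself from the validation ground-truth labels.

```python
import shutil
from pathlib import Path


def sort_val_images(val_dir, filename_to_class):
    """Move each listed image in val_dir into val_dir/<class>/.

    filename_to_class is a hypothetical mapping such as
    {"img4.jpeg": "class1"} that you supply yourself.
    Returns the number of files moved.
    """
    val_dir = Path(val_dir)
    moved = 0
    for filename, class_name in filename_to_class.items():
        source = val_dir / filename
        if not source.is_file():
            # Already moved or missing: skip, so the function can be re-run.
            continue
        target_dir = val_dir / class_name
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(source), str(target_dir / filename))
        moved += 1
    return moved
```

Calling `sort_val_images("data/imagenet/val", mapping)` on a flat folder would produce the `val/class1/img4.jpeg` layout shown in the tree above, assuming the mapping covers every image.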

Zipped ImageNet-1K

To speed up reading images from a massive number of small files, we also support zipped ImageNet, which consists of four files:

train.zip, val.zip: the zipped folders for the train and validation splits.
train.txt, val.txt: the relative path of each image inside the corresponding zip file, followed by its ground-truth label.

Make sure the data folder looks like this:

$ tree data
data
└── ImageNet-Zip
    ├── train_map.txt
    ├── train.zip
    ├── val_map.txt
    └── val.zip

$ head -n 5 meta_data/val.txt
ILSVRC2012_val_00000001.JPEG	65
ILSVRC2012_val_00000002.JPEG	970
ILSVRC2012_val_00000003.JPEG	230
ILSVRC2012_val_00000004.JPEG	809
ILSVRC2012_val_00000005.JPEG	516

$ head -n 5 meta_data/train.txt
n01440764/n01440764_10026.JPEG	0
n01440764/n01440764_10027.JPEG	0
n01440764/n01440764_10029.JPEG	0
n01440764/n01440764_10040.JPEG	0
n01440764/n01440764_10042.JPEG	0
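
To illustrate how the map files and zip archives fit together, here is a minimal sketch that parses one tab-separated map line and reads the matching image bytes from an open archive. It assumes the path in each map line is exactly the member name inside the zip file. The function names are hypothetical and this is not the repository's data loader.

```python
import io
import zipfile


def parse_map_line(line):
    """Split one 'relative/path<TAB>label' line into (path, int label)."""
    path, label = line.rstrip("\n").split("\t")
    return path, int(label)


def read_zipped_sample(zip_file, map_line):
    """Return (raw image bytes, label) for one map entry from an open ZipFile."""
    path, label = parse_map_line(map_line)
    return zip_file.read(path), label


# Tiny in-memory archive standing in for train.zip (placeholder bytes, not a real JPEG).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("n01440764/n01440764_10026.JPEG", b"fake-jpeg-bytes")

with zipfile.ZipFile(buffer) as archive:
    data, label = read_zipped_sample(archive, "n01440764/n01440764_10026.JPEG\t0\n")
    print(len(data), label)
```

Reading from a single archive replaces many small file opens with lookups in one file, which is the motivation given above for the zipped format.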

ImageNet-22K

For the ImageNet-22K dataset, create a folder named fall11_whole and move all images into labeled sub-folders inside it. Then download the train-val split files (ILSVRC2011fall_whole_map_train.txt and ILSVRC2011fall_whole_map_val.txt) and put them in the parent directory of fall11_whole. The file structure should look like:
```
  $ tree imagenet22k/
  imagenet22k/
  └── fall11_whole
      ├── n00004475
      ├── n00005787
      ├── n00006024
      ├── n00006484
      └── ...
```
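
A quick way to confirm this layout before training is sketched below. It only checks what the paragraph above describes: a fall11_whole folder plus the two split files in its parent directory. The helper name `missing_in22k_entries` is hypothetical.

```python
from pathlib import Path

MAP_FILES = (
    "ILSVRC2011fall_whole_map_train.txt",
    "ILSVRC2011fall_whole_map_val.txt",
)


def missing_in22k_entries(root):
    """Return the expected entries absent under root (an empty list means OK).

    root is the parent directory of fall11_whole, e.g. imagenet22k/.
    """
    root = Path(root)
    missing = []
    if not (root / "fall11_whole").is_dir():
        missing.append("fall11_whole/")
    for name in MAP_FILES:
        if not (root / name).is_file():
            missing.append(name)
    return missing
```

For example, `missing_in22k_entries("imagenet22k")` returns an empty list once the folder and both split files are in place.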

iNaturalist 2018

For iNaturalist 2018, please download the dataset from the official repository. The file structure should look like:

  $ tree inat2018/
  inat2018/
  ├── categories.json
  ├── test2018
  ├── test2018.json
  ├── train2018.json
  ├── train2018_locations.json
  ├── val2018
  ├── val2018.json
  └── val2018_locations.json

Released Models

Open-Source Visual Pretrained Models

| name | pretrain | resolution | #param | download |
| --- | --- | --- | --- | --- |
| InternImage-L | IN-22K | 384x384 | 223M | pth \| hf |
| InternImage-XL | IN-22K | 384x384 | 335M | pth \| hf |
| InternImage-H | Joint 427M -> IN-22K | 384x384 | 1.08B | pth \| hf |
| InternImage-G | Joint 427M -> IN-22K | 384x384 | 3B | pth \| hf |

ImageNet-1K Image Classification

| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| --- | --- | --- | --- | --- | --- | --- |
| InternImage-T | IN-1K | 224x224 | 83.5 | 30M | 5G | pth \| hf \| cfg |
| InternImage-S | IN-1K | 224x224 | 84.2 | 50M | 8G | pth \| hf \| cfg |
| InternImage-B | IN-1K | 224x224 | 84.9 | 97M | 16G | pth \| hf \| cfg |
| InternImage-L | IN-22K | 384x384 | 87.7 | 223M | 108G | pth \| hf \| cfg |
| InternImage-XL | IN-22K | 384x384 | 88.0 | 335M | 163G | pth \| hf \| cfg |
| InternImage-H | Joint 427M -> IN-22K | 640x640 | 89.6 | 1.08B | 1478G | pth \| hf \| cfg |
| InternImage-G | Joint 427M -> IN-22K | 512x512 | 90.1 | 3B | 2700G | pth \| hf \| cfg |

iNaturalist 2018 Image Classification

| name | pretrain | resolution | acc@1 | #param | download |
| --- | --- | --- | --- | --- | --- |
| InternImage-H | Joint 427M -> IN-22K | 384x384 | 92.6 | 1.1B | ckpt \| cfg \| log |

Evaluation

To evaluate a pretrained InternImage on ImageNet val, run:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --master_port 12345 main.py --eval \
--cfg <config-file> --resume <checkpoint> --data-path <imagenet-path>

For example, to evaluate InternImage-B on a single GPU:

python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
--cfg configs/internimage_b_1k_224.yaml --resume internimage_b_1k_224.pth --data-path <imagenet-path>
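
If you launch evaluations from a script, the command above can be assembled as an argument list. The sketch below uses only the flags shown in this section. `build_eval_command` is a hypothetical helper, and `data/imagenet` stands in for your own `<imagenet-path>`.

```python
import shlex


def build_eval_command(cfg, checkpoint, data_path, num_gpus=1, master_port=12345):
    """Assemble the evaluation command from this section as an argument list."""
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        "--master_port", str(master_port),
        "main.py", "--eval",
        "--cfg", cfg,
        "--resume", checkpoint,
        "--data-path", data_path,
    ]


command = build_eval_command(
    "configs/internimage_b_1k_224.yaml",
    "internimage_b_1k_224.pth",
    "data/imagenet",  # placeholder for <imagenet-path>
)
# Print the command in a shell-safe form for inspection.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the classification folder.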

Training from Scratch on ImageNet-1K

The paper results were obtained from models trained with configs in configs/without_lr_decay.

To train an InternImage on ImageNet from scratch, run:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> --master_port 12345  main.py \
--cfg <config-file> --data-path <imagenet-path> [--batch-size <batch-size-per-gpu> --output <output-directory> --tag <job-tag>]

Manage Jobs with Slurm

For example, to train or evaluate InternImage on a Slurm cluster, run:

InternImage-T (IN-1K)

# Train for 300 epochs with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 512
# Train for 300 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --resume pretrained/internimage_t_1k_224.pth --eval

InternImage-S (IN-1K)

# Train for 300 epochs with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml --batch-size 512
# Train for 300 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml --batch-size 128
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_s_1k_224.yaml --resume pretrained/internimage_s_1k_224.pth --eval

InternImage-B (IN-1K)

# Train for 300 epochs with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_b_1k_224.yaml --batch-size 512
# Train for 300 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_b_1k_224.yaml --batch-size 128
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_b_1k_224.yaml --resume pretrained/internimage_b_1k_224.pth --eval

InternImage-L (IN-22K to IN-1K)

# Train for 20 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_l_22kto1k_384.yaml --batch-size 16
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_l_22kto1k_384.yaml --resume pretrained/internimage_l_22kto1k_384.pth --eval

InternImage-XL (IN-22K to IN-1K)

# Train for 20 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_xl_22kto1k_384.yaml --batch-size 16
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_xl_22kto1k_384.yaml --resume pretrained/internimage_xl_22kto1k_384.pth --eval

InternImage-H (IN-22K to IN-1K)

# Train for 20 epochs with 32 GPUs
GPUS=32 sh train_in1k.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --batch-size 16
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --resume pretrained/internimage_h_22kto1k_640.pth --eval

InternImage-G (IN-22K to IN-1K)

# Train for 20 epochs with 64 GPUs
GPUS=64 sh train_in1k.sh <partition> <job-name> configs/internimage_g_22kto1k_512.yaml --batch-size 8
# Evaluate on ImageNet-1K with 8 GPUs
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/internimage_g_22kto1k_512.yaml --resume pretrained/internimage_g_22kto1k_512.pth --eval

Training with DeepSpeed

We support DeepSpeed to reduce memory costs when training large-scale models, e.g., InternImage-H with over 1 billion parameters. To use it, first install the requirement:

pip install deepspeed==0.8.3

Then you can launch training on a Slurm cluster with 8 GPUs as follows (using the tiny and huge models as examples). The default ZeRO stage is 1; it can be changed with the command-line argument --zero-stage.

GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4 --eval --resume ckpt.pth
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_t_1k_224.yaml --batch-size 128 --accumulation-steps 4 --eval --resume deepspeed_ckpt_dir
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --batch-size 16 --accumulation-steps 4 --pretrained pretrained/internimage_h_jointto22k_384.pth
GPUS=8 GPUS_PER_NODE=8 sh train_in1k_deepspeed.sh <partition> <job-name> configs/internimage_h_22kto1k_640.yaml --batch-size 16 --accumulation-steps 4 --pretrained pretrained/internimage_h_jointto22k_384.pth --zero-stage 3

🤗 HuggingFace Accelerate Integration of DeepSpeed

Optionally, you can use DeepSpeed through our HuggingFace Accelerate integration.

pip install accelerate==0.18.0

accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_wo_loss_scale.yaml main_accelerate.py --cfg configs/internimage_h_22kto1k_640.yaml --data-path data/imagenet --batch-size 16 --pretrained pretrained/internimage_h_jointto22k_384.pth --accumulation-steps 4
accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_offload.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path data/imagenet --batch-size 128 --accumulation-steps 4 --output output_zero3_offload
accelerate launch --config_file configs/accelerate/dist_8gpus_zero1.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path data/imagenet --batch-size 128 --accumulation-steps 4

Memory Costs

Here are the reference GPU memory costs for InternImage-H with 8 GPUs.

Settings: total batch size = 512 (batch size 16 per GPU), gradient accumulation steps = 4.

| Resolution | ZeRO Stage | CPU Offloading | Memory (MB) |
| --- | --- | --- | --- |
| 640 | zero1 | False | 22572 |
| 640 | zero3 | False | 20000 |
| 640 | zero3 | True | 19144 |
| 384 | zero1 | False | 16000 |
| 384 | zero3 | True | 11928 |
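
The total batch size quoted for these measurements follows from the number of GPUs, the per-GPU batch size, and the gradient accumulation steps. A small worked example (the function name is only illustrative):

```python
def total_batch_size(num_gpus, batch_size_per_gpu, accumulation_steps=1):
    """Samples contributing to one optimizer step across all GPUs."""
    return num_gpus * batch_size_per_gpu * accumulation_steps


# 8 GPUs x 16 samples per GPU x 4 accumulation steps
print(total_batch_size(num_gpus=8, batch_size_per_gpu=16, accumulation_steps=4))  # 512
```

Changing the number of GPUs therefore requires adjusting --batch-size or --accumulation-steps if you want to keep the same total.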

Convert Checkpoints

To convert a DeepSpeed checkpoint to a PyTorch fp32 checkpoint, you can use the following snippet.

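from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

# checkpoint_dir is the DeepSpeed checkpoint directory; tag selects which saved checkpoint to convert.
convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, 'best.pth', tag='best')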

Then you can use best.pth as usual, e.g., model.load_state_dict(torch.load('best.pth')).

Due to limited computational resources, the DeepSpeed training scripts have so far been verified only for the first few epochs. Please file an issue if you have problems reproducing the full training run.

Extracting Intermediate Features

To extract the features of an intermediate layer, you can use extract_feature.py.

For example, to extract the features of b.png from the layers patch_embed and levels.0.downsample and save them to b.pth, run:

python extract_feature.py --cfg configs/internimage_t_1k_224.yaml --img b.png --keys patch_embed levels.0.downsample --save --resume internimage_t_1k_224.pth

Export

First, install mmdeploy:

pip install mmdeploy==0.14.0

To export InternImage-T from PyTorch to ONNX, run:

python export.py --model_name internimage_t_1k_224 --ckpt_dir /path/to/ckpt/dir --onnx

To export InternImage-T from PyTorch to TensorRT, run:

git clone https://github.com/open-mmlab/mmdeploy.git
cd mmdeploy && git checkout v0.13.0
export MMDEPLOY_DIR=$(pwd)
# prepare our custom ops; you can find them at InternImage/tensorrt/modulated_deform_conv_v3
cp -r ../../tensorrt/modulated_deform_conv_v3 csrc/mmdeploy/backend_ops/tensorrt/

# build custom ops
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=g++ -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) && make install

# install mmdeploy after building the custom ops (from the mmdeploy root, not from build/)
cd ${MMDEPLOY_DIR}
pip install -e .
cd ../

python export.py --model_name internimage_t_1k_224 --ckpt_dir /path/to/ckpt/dir --trt