Quantization for image classification
August 19, 2021
Install
- Clone the repo (change FASTDIR as preferred):
export FASTDIR=/workspace
cd $FASTDIR/git/
git clone https://github.com/aim-uofa/model-quantization
git clone https://github.com/blueardour/pytorch-utils
cd model-quantization
ln -s ../pytorch-utils utils
# create separate log and weight folders (optional; if the symbolic links are not created, the script will create these folders under the project path)
#mkdir -p /data/pretrained/pytorch/model-quantization/{exp,weights}
#ln -s /data/pretrained/pytorch/model-quantization/exp .
#ln -s /data/pretrained/pytorch/model-quantization/weights .
- Install prerequisite packages
cd $FASTDIR/git/model-quantization
# python 3 is required
pip install -r requirement.txt
Quantization for the classification task has no strict requirement on the PyTorch version. However, other tasks such as detection and segmentation require a newer PyTorch version. detectron2 currently requires Torch 1.4+. Besides, the CUDA version on the machine should match the one PyTorch was compiled with.
- Install Nvidia image pre-processing packages and mixed precision training packages (optional, highly recommended)
Dataset
This repo supports the ImageNet and CIFAR datasets. Create the necessary folders and prepare the datasets. Example:
# dataset
mkdir -p /data/cifar
mkdir -p /data/imagenet
# download imagenet and move the training and evaluation data into /data/imagenet/{train,val}, respectively.
# cifar dataset can be downloaded on the fly
Pretrained models and Quantization Results
Some of the quantization results are listed in result_cls.md. We provide pretrained models on Google Drive.
Quick Start
Both training and testing use the train.sh script. Calling main.py directly is also possible.
bash train.sh config.xxxx
config.xxxx is the configuration file, which contains the network architecture, quantization related, and training related parameters. For more about the supported options, refer to Training script options below and to config.md. Also refer to the examples in the config subfolder.
Training is often time-consuming. Try our start_on_terminate.sh script, which can be used to queue a second task: the next round of training starts automatically when the current training process terminates.
# wait in a screen shell
screen -S next-round
bash start_on_terminate.sh [current training thread pid] [next round config.xxxx]
# Ctrl+A D to detach the screen to the background
Besides, tools.py provides many useful functions for debugging, verbose output, and model conversion. Refer to tools.md for detailed usage.
Known Issues
See known issues
Training script options
- From 2020.07.28, dynamic loading of the training options from a policy file is supported.
- Option parsing
  Common options are parsed in util/config.py. Quantization related options are kept separately in main.py.
- Keyword (choosing quantization method)
  The --keyword option is one of the most important variables; it controls the model architecture and the choice of quantization algorithm.
  We currently support quantization algorithms by adding the following options to the keyword:
  a. lq for LQ-Nets
  b. pact for PACT
  c. dorefa for DoReFa-Net. Besides, the additional keywords lsq (learned step size) and non-uniform (FATNN) are available.
  d. xnor for XNOR-Net. If gamma is combined with xnor in the keyword, a separate learnable scale coefficient is added (it becomes XNOR-Net++).
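As a rough illustration of how a keyword list might select the quantization method, here is a minimal sketch. The helper name select_quant_method, the comma-separated keyword format, and the rule that exactly one method keyword must be present are assumptions made for illustration; the repository's own option handling may differ.

```python
# Hypothetical helper, not part of the repository.
# Method keywords and their papers, as listed above.
METHODS = {
    "lq": "LQ-Nets",
    "pact": "PACT",
    "dorefa": "DoReFa-Net",
    "xnor": "XNOR-Net",
}


def select_quant_method(keyword):
    """Return (method_name, modifiers) for a comma-separated keyword string."""
    words = [w.strip() for w in keyword.split(",") if w.strip()]
    chosen = [w for w in words if w in METHODS]
    if len(chosen) != 1:
        # Assumption: exactly one method keyword is expected.
        raise ValueError("expected exactly one method keyword, got %r" % chosen)

    method = chosen[0]
    # lsq and non-uniform are reported whenever they appear in the list.
    modifiers = [w for w in words if w in ("lsq", "non-uniform")]
    # gamma only has a meaning together with xnor (XNOR-Net++).
    if method == "xnor" and "gamma" in words:
        modifiers.append("gamma")
    return METHODS[method], modifiers


print(select_quant_method("resnet18,dorefa,lsq"))
print(select_quant_method("resnet18,xnor,gamma"))
```

Words that the helper does not recognize (here, resnet18) are simply ignored, which mirrors the idea that the same keyword list also carries structure-control options.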
- Keyword (structure control):
  The network structure can be chosen by --arch or --model. For ResNet, the official ResNet model is provided with pytorch-resnetxx, and a more flexible ResNet architecture can be realized by setting --arch or --model to resnetxx. For the latter case, many options can be combined to customize the network structure:
  a. The presence or absence of origin in keyword chooses whether the Bi-Real skip connection is preferred (block-wise skip connection versus layer-wise skip connection).
  b. bacs or cbas, etc., indicate the layer order in a ResNet block. For example, bacs is a kind of pre-activation structure: within a ResNet block, the normalization layer comes first, then the activation layer, then the convolutional layer, and last the skip connection. For the pre-activation structure, preBN is required for the first ResNet block. Refer to resnet.md for more information.
  c. By default, all layers except the first and last are quantized. real_skip can be added to keep the skip connection layers in ResNet at full precision, which is widely used in XNOR-Net and Bi-Real Net.
  d. For the normalization layer and activation layer, we also provide some keyword options for different variants. For example, NRelU means the network does not include ReLU activation, and PRelU indicates that PReLU is employed. Refer to model/layer.py for details.
  e. Padding and quantization order. I think it is an error to pad the feature map with 0 after quantization, especially in BNNs. From my perspective, this strategy turns BNNs into TNNs. Thus, I advocate padding the feature map with zero first and then going through the quantization step. To stay compatible with the publication while also providing a revised method, padding_after_quant can be set to control the order between padding and quantization. Refer to line 445 in model/quant.py for the implementation.
  f. Skip connection realization. Two choices are provided. One is average pooling with stride followed by a conv1x1 with stride=1. The other is a single conv1x1 with the required stride. singleconv in keyword is used to make the choice.
  g. fixup is used to enable the architecture in Fixup Initialization.
  h. The option base, which is a standalone option rather than a word in the keyword list, is used to realize the branch configuration in Group-Net.
  Self-defined keyword values are supported and can easily be implemented as the user wishes. As introduced above, the options can be combined to build different architecture variants. Examples can be found in the config subfolder.
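The padding-order argument in item e can be checked on a toy example. The sketch below uses a 1-D list instead of a feature map, and the helpers binarize and zero_pad are hypothetical names introduced only for this illustration (mapping zero to +1 is also an assumption); it is not the code behind padding_after_quant.

```python
def binarize(values):
    """Sign-like binarization to {-1, +1}; zero is mapped to +1 (assumption)."""
    return [1 if v >= 0 else -1 for v in values]


def zero_pad(values, pad):
    """Pad a 1-D list with `pad` zeros on each side."""
    return [0] * pad + list(values) + [0] * pad


feature = [0.7, -1.2, 0.3, -0.1]

# Quantize first, then pad: the padded zeros survive as a third value.
pad_after_quant = zero_pad(binarize(feature), 1)

# Pad first, then quantize: the padded zeros are binarized like any other input.
pad_before_quant = binarize(zero_pad(feature, 1))

print(sorted(set(pad_after_quant)))   # [-1, 0, 1]
print(sorted(set(pad_before_quant)))  # [-1, 1]
```

With padding after quantization the tensor effectively holds three distinct values, which is the sense in which a binary network starts to behave like a ternary one.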
- Activation and weight quantization options
  The script provides independent configurations for activations and weights. Here we explain some advanced options.
  - xx_quant_group indicates the number of groups for the quantization parameters along the channel dimension.
  - xx_adaptive, in most cases, indicates an additional normalization operation, which shows great potential to improve performance.
  - xx_grad_type defines a custom gradient approximation method. In general, the quantization step is not differentiable, so techniques such as the STE (straight-through estimator) are used to approximate the gradient. Other types of approximation exist. Besides, some works advocate adding a scale coefficient to the gradient in order to stabilize training.
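To make the gradient approximation idea concrete, here is a minimal scalar sketch of a straight-through estimator with an optional gradient scale. The function names, the [0, 1] clipping range, and the scale argument are assumptions for illustration; the values accepted by xx_grad_type are not listed here, and this sketch does not correspond to any specific one.

```python
import math


def quantize_forward(x, bits=2):
    """Uniformly quantize x, clipped to [0, 1], onto 2**bits - 1 steps."""
    levels = 2 ** bits - 1
    clipped = min(max(x, 0.0), 1.0)
    # floor(v + 0.5) rounds to the nearest level without banker's rounding.
    return math.floor(clipped * levels + 0.5) / levels


def ste_backward(x, grad_output, scale=1.0):
    """Straight-through estimator for quantize_forward.

    Rounding is treated as the identity, so the incoming gradient passes
    through unchanged inside the clipping range and is zero outside it.
    `scale` is an optional coefficient applied to the gradient.
    """
    inside = 0.0 <= x <= 1.0
    return grad_output * scale if inside else 0.0


print(quantize_forward(0.4))              # nearest of {0, 1/3, 2/3, 1}
print(ste_backward(0.4, 2.0))             # gradient passes through
print(ste_backward(1.7, 2.0))             # clipped input: zero gradient
print(ste_backward(0.4, 2.0, scale=0.5))  # scaled gradient
```

In a real training framework the same pair would be written as a custom autograd function; the scalar form only shows which values flow forward and which flow backward.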
- Weight decay
  Three major related options.
  - --wd sets the default L2 weight decay value.
  - Weight decay was originally proposed to avoid overfitting with a large number of parameters. For some small tensors, for example the parameters in BatchNorm layers (as well as custom quantization parameters, such as the clip value), a weight decay of zero is advocated. --decay_small controls whether to decay those small tensors.
  - --custom_decay_list and --custom_decay are combined to set a specific custom decay value for certain parameters. For example, in PACT, the clip_boundary can have its own independent weight decay for regularization.
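The interaction of the three weight decay options can be sketched as a grouping step over named parameters. Everything below is an assumption made for illustration: the helper build_param_groups is hypothetical, a "small tensor" is taken to mean one with at most one dimension, and custom entries are matched by substring. The repository may apply different rules.

```python
def build_param_groups(params, wd, decay_small=False,
                       custom_decay_list=(), custom_decay=0.0):
    """Group parameter names by the weight decay they should receive.

    params maps a parameter name to its number of tensor dimensions.
    """
    groups = {}
    for name, ndim in params.items():
        if any(key in name for key in custom_decay_list):
            decay = custom_decay   # explicit custom value wins
        elif ndim <= 1 and not decay_small:
            decay = 0.0            # small tensors are not decayed
        else:
            decay = wd             # default L2 weight decay
        groups.setdefault(decay, []).append(name)
    return [{"params": names, "weight_decay": decay}
            for decay, names in groups.items()]


example_params = {
    "conv1.weight": 4,
    "bn1.weight": 1,
    "bn1.bias": 1,
    "conv1.clip_boundary": 1,
}
example_groups = build_param_groups(
    example_params, wd=1e-4,
    custom_decay_list=["clip_boundary"], custom_decay=2e-4)
for group in example_groups:
    print(group)
```

The returned list has the same shape as the per-parameter option groups that PyTorch optimizers accept, with names standing in for the actual tensors.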
- Learning rate
  - multi-step decay
  - poly decay
  - sgdr (with restart)
  - --custom_lr_list and --custom_lr are provided, similarly to the weight decay options mentioned above, to set a specific custom learning rate for certain parameters.
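For reference, the three schedules named above can be written as closed-form functions of the epoch. This is a sketch of the common textbook forms, not the repository's implementation; the argument names (milestones, gamma, power, cycle_length, min_lr) are assumptions, and the SGDR variant here uses a fixed cycle length.

```python
import math


def multi_step_lr(base_lr, epoch, milestones, gamma=0.1):
    """Multiply the rate by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def poly_lr(base_lr, epoch, total_epochs, power=1.0):
    """Polynomial decay from base_lr at epoch 0 to zero at total_epochs."""
    return base_lr * (1 - epoch / total_epochs) ** power


def sgdr_lr(base_lr, epoch, cycle_length, min_lr=0.0):
    """Cosine annealing that restarts at base_lr every cycle_length epochs."""
    t = epoch % cycle_length
    cosine = 1 + math.cos(math.pi * t / cycle_length)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine


for epoch in (0, 5, 10, 30):
    print(epoch,
          multi_step_lr(0.1, epoch, [10, 20]),
          poly_lr(0.1, epoch, 40),
          sgdr_lr(0.1, epoch, 10))
```

Printing a few epochs side by side makes the difference visible: the multi-step rate drops in jumps, the poly rate shrinks steadily, and the SGDR rate returns to its base value at the start of each cycle.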
- Mixed precision training options
  --fp16 and --opt_level [O1] are provided for mixed precision training.
  - FP32
  - FP16 with custom level; the O1 level is recommended.