Lightweight-VIT

October 21, 2024

Abstract

The transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet they lack the inductive biases and efficiency benefits of convolutions and face significant computational and memory challenges that limit their real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: efficient component design, dynamic networks, and knowledge distillation. We evaluate the work in each area on the ImageNet benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight the respective advantages, disadvantages, and flexibility of each approach. Finally, we propose future research directions and discuss potential challenges in the lightweighting of vision transformers, with the aim of inspiring further exploration and providing practical guidance for the community.
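
To make the "parameters" axis of that trade-off concrete, the following is a minimal sketch that counts the learnable parameters of a plain ViT classifier from its hyperparameters. It is not taken from the paper: the function name `vit_param_count` and its arguments are hypothetical, and it assumes the standard layout (convolutional patch embedding, class token, learned position embedding, pre-norm blocks with a fused QKV projection, and a linear head).

```python
def vit_param_count(dim, depth, mlp_ratio=4, patch=16, img=224,
                    in_chans=3, num_classes=1000):
    """Count learnable parameters of a plain ViT (assumed standard layout)."""
    num_patches = (img // patch) ** 2

    # Patch embedding: one conv with a patch x patch kernel, plus bias.
    patch_embed = in_chans * patch * patch * dim + dim
    # Class token and learned position embedding (patches + class token).
    cls_and_pos = dim + (num_patches + 1) * dim

    hidden = mlp_ratio * dim
    block = (
        2 * dim                        # LayerNorm before attention
        + dim * 3 * dim + 3 * dim      # fused QKV projection
        + dim * dim + dim              # attention output projection
        + 2 * dim                      # LayerNorm before MLP
        + dim * hidden + hidden        # MLP first linear layer
        + hidden * dim + dim           # MLP second linear layer
    )

    # Final LayerNorm and linear classification head.
    final = 2 * dim + dim * num_classes + num_classes
    return patch_embed + cls_and_pos + depth * block + final


if __name__ == "__main__":
    # Base-sized config (dim 768, depth 12) vs. tiny-sized (dim 192, depth 12).
    print(vit_param_count(dim=768, depth=12))  # 86567656
    print(vit_param_count(dim=192, depth=12))  # 5717416
```

The number of attention heads does not appear because, with a fused QKV projection, it changes how the channels are split but not how many weights exist. Almost all parameters sit in the repeated blocks, which is why many of the component-level designs listed below target the attention and MLP layers.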

Overview of Paper

[Figure: overview]

[Figure: content]

Efficient ViT Components

  1. Embedding Structure Design

    • CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [GitHub] [PDF]
    • CvT: Introducing Convolutions to Vision Transformers [GitHub] [PDF]
    • MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [GitHub] [PDF]
    • Mobile-Former: Bridging MobileNet and Transformer [PDF]
    • Shunted Self-Attention via Multi-Scale Token Aggregation [GitHub] [PDF]
    • Patch Slimming for Efficient Vision Transformers [PDF]
    • RepViT: Revisiting Mobile CNN From ViT Perspective [GitHub] [PDF]
    • TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [GitHub] [PDF]
  2. Efficient Position Encoding

    • Rotary Position Embedding for Vision Transformer [GitHub] [PDF]
    • Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [GitHub] [PDF]
    • Lightweight Structure-Aware Attention for Visual Understanding [PDF]
    • Functional Interpolation for Relative Positions Improves Long Context Transformers [PDF]
    • Relative Positional Encoding Family via Unitary Transformation [PDF]
    • Conditional Positional Encodings for Vision Transformers [GitHub] [PDF]
  3. Efficient Token Update

    • Global Filter Networks for Image Classification [GitHub] [PDF]
    • Focal Self-attention for Local-Global Interactions in Vision Transformers [PDF]
    • Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [GitHub] [PDF]
    • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [GitHub] [PDF]
    • MaxViT: Multi-Axis Vision Transformer [GitHub] [PDF]
    • Skip-Attention: Improving Vision Transformers by Paying Less Attention [PDF]
    • SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [GitHub] [PDF]
    • BiFormer: Vision Transformer with Bi-Level Routing Attention [GitHub] [PDF]
    • ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections [PDF]
    • Vision Transformer with Sparse Scan Prior [GitHub] [PDF]
    • ConvMLP: Hierarchical Convolutional MLPs for Vision [GitHub] [PDF]
    • CycleMLP: A MLP-like Architecture for Dense Prediction [GitHub] [PDF]
    • ResMLP: Feedforward networks for image classification with data-efficient training [PDF]
    • Hire-MLP: Vision MLP via Hierarchical Rearrangement [GitHub] [PDF]
  4. Framework Design

    • Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [GitHub] [PDF]
    • Vision Transformers with Hierarchical Attention [GitHub] [PDF]
    • Twins: Revisiting the Design of Spatial Attention in Vision Transformers [GitHub] [PDF]
    • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [GitHub] [PDF]
    • MaxViT: Multi-Axis Vision Transformer [GitHub] [PDF]
    • FDViT: Improve the Hierarchical Architecture of Vision Transformer [PDF]
    • HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [PDF]
    • Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles [GitHub] [PDF]
    • HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs [PDF]

Dynamic Network

  1. Dynamic Resolution

    • Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [GitHub] [PDF]
    • DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [GitHub] [PDF]
    • EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention [GitHub] [PDF]
    • SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [PDF]
    • HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers [PDF]
    • CF-ViT: A General Coarse-to-Fine Method for Vision Transformer [PDF]
    • No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling [PDF]
    • ATS: Adaptive Token Sampling For Efficient Vision Transformers [PDF]
    • TPC-ViT: Token Propagation Controller for Efficient Vision Transformer [PDF]
    • Multi-Scale And Token Mergence: Make Your ViT More Efficient [PDF]
    • Super Vision Transformer [GitHub] [PDF]
    • Token Merging: Your ViT But Faster [GitHub] [PDF]
    • Token Fusion: Bridging the Gap between Token Pruning and Token Merging [PDF]
    • Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer [GitHub] [PDF]
  2. Depth Adaptation

    • A-ViT: Adaptive Tokens for Efficient Vision Transformer [GitHub] [PDF]
    • Distillation-Based Training for Multi-Exit Architectures [PDF]
    • Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead [PDF]
    • Multi-Exit Vision Transformer for Dynamic Inference [PDF]
    • LGViT: Dynamic Early Exiting for Accelerating Vision Transformer [PDF]
    • Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition [PDF]
    • CF-ViT: A General Coarse-to-Fine Method for Vision Transformer [GitHub] [PDF]
    • AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [PDF]
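
The two ideas in this section can be illustrated with a small, framework-free sketch. It is a generic illustration and not the method of any paper listed above: `prune_tokens` keeps only the highest-scoring tokens (dynamic resolution), and `early_exit` stops at the first intermediate classifier whose confidence passes a threshold (depth adaptation). The function names, the `keep_ratio` and `threshold` parameters, and the use of plain Python lists are all assumptions made for illustration.

```python
import math


def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token indices by importance score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore spatial order
    return [tokens[i] for i in kept], kept


def early_exit(exit_probs, threshold=0.9):
    """Return (predicted_class, exit_depth) for the first confident exit.

    exit_probs holds one class-probability list per intermediate classifier,
    ordered from shallow to deep. If no exit is confident enough, the
    prediction of the last exit is used.
    """
    if not exit_probs:
        raise ValueError("at least one exit is required")
    for depth, probs in enumerate(exit_probs, start=1):
        confidence = max(probs)
        if confidence >= threshold:
            break
    return probs.index(confidence), depth


if __name__ == "__main__":
    kept_tokens, kept_idx = prune_tokens(
        ["a", "b", "c", "d"], [0.10, 0.70, 0.05, 0.15], keep_ratio=0.5
    )
    print(kept_tokens, kept_idx)  # ['b', 'd'] [1, 3]

    print(early_exit([[0.5, 0.5], [0.05, 0.95], [0.0, 1.0]]))  # (1, 2)
```

In both cases the amount of computation depends on the input: an easy image can shed more tokens or exit earlier than a hard one. How the scores and exit classifiers are obtained is where the listed methods differ.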

Knowledge Distillation

  1. Feature Knowledge Distillation

  2. Response Knowledge Distillation