Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL

September 9, 2024 · View on GitHub

To extract robust and more generalizable skeleton action recognition features, large amounts of well-curated data are typically required, which is a challenging task hindered by annotation costs. In this work, we investigate unsupervised representation learning for skeleton action recognition. For this purpose, we designed a convolutional transformer framework, named ReL-SAR, exploiting the complementarity of convolutional and attention layers for jointly modeling spatial and temporal cues in skeleton sequences.

alt text

This repository complements our paper, providing a reference implementation of the method as described in the paper. Please contact the authors for enquiries regarding the code.

Dependencies

We tested our code on the following environment.

tensorflow 2.6.0
Python 3.8.5
numpy 1.23.5

First, clone the repository and install the required pip packages (virtual environment recommended!):

pip install -r requirements.txt

Data preparation for NW-UCLA:

Download raw data from http://users.eecs.northwestern.edu/~jwa368/data/multiview_action_videos.tgz

Place the videos in data/NW-UCLA in that form:

 ├── data
   ├── NW-UCLA
      ├── carry
         ├──  vid_1
         ├──  vid_2
         ........
      ├── doffing
         ├──  vid_1
         ├──  vid_2
         ........
      ├── donning
         ├──  vid_1
         ├──  vid_2
         ........
      ├── ....
      ├── ....

Download ViTPose Huge model: https://huggingface.co/JunkyByte/easy_ViTPose/blob/main/torch/coco_25/vitpose-h-coco_25.pth
Place the ViTPose Huge model in data/

Install the requirements for the data preparation:

cd data/       
pip install -r requirements.txt

Generate the keypoints for each video:
```
python data/keypoints_finder.py
```
Generate npy file for each class:
```
python data/creating_npy.py     
```

Supervised Baseline Training:

To run a supervised baseline experiment:

python Supervised-Baseline/main.py -b

Bootstrap Your Own Latent (BYOL) for Representation Learning:

alt text

To run a byol training process:

python BYOL_pretraining/byol_train_script.py

Evaluating the Effectiveness of Learned Feature Hierarchies:

To run experiments for fully fine-tuning and freezing Conv1D layers:

python BYOL_pretraining/main.py -b

That's a wrap!