ACM
July 23, 2023
This repository contains the code for the paper:
Online Continual Learning Without the Storage Constraint
Ameya Prabhu, Zhipeng Cai, Puneet Dokania, Philip Torr, Vladlen Koltun, Ozan Sener
[Arxiv]
[PDF]
[Bibtex]
Installation and Dependencies
Our code was run on a 16GB RTX 3080Ti Laptop GPU with 64GB RAM and PyTorch >=1.13; a more powerful GPU or more RAM will allow for faster experimentation.
- Install the requirements needed to run the code in a Python >=3.9 environment:
# First, activate a new virtual environment
pip3 install -r requirements.txt
Fast Dataset Setup
- There is a fast, direct mechanism to download and use our datasets implemented in this repository.
- Input the directory where the dataset was downloaded into the data_dir field in src/opts.py.
- All code in this repository was run on this dataset.
Recreating the Datasets
YOUR_DATA_DIR would contain two subfolders: cglm and cloc. The following are instructions to set up each dataset:
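Before launching any experiment, it can help to confirm that the data directory has the two subfolders named above. The following is a minimal sketch, not part of the repository; the helper name missing_subfolders is hypothetical, and only the folder names cglm and cloc come from this README.

```python
from pathlib import Path

# Subfolder names taken from this README.
EXPECTED_SUBFOLDERS = ("cglm", "cloc")


def missing_subfolders(data_dir):
    """Return the expected dataset subfolders that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
```

Calling missing_subfolders("YOUR_DATA_DIR") returns an empty list when both folders exist, and otherwise the names that still need to be created or downloaded.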
Continual Google Landmarks V2 (CGLM)
Download Images
- You can download the Continual Google Landmarks V2 dataset by following the instructions on their GitHub repository; run the following in the DATA_DIR directory:
wget -c https://raw.githubusercontent.com/cvdfoundation/google-landmark/master/download-dataset.sh
mkdir train && cd train
bash ../download-dataset.sh train 499
Recreating Metadata
- Download metadata by running the following commands in the scripts directory:
wget -c https://s3.amazonaws.com/google-landmark/metadata/train_attribution.csv
python cglm_scrape.py
- Parse the XML files and organize them as a dictionary.
- The ordering used in the paper is available to download from here.
- Now, select only images that are a part of the order file and your dataset should be ready!
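The last three steps can be sketched roughly as follows. This is an illustration under assumptions, not the authors' code: the XML schema produced by the scrape is not described here, so the sketch simply keeps every leaf element; it also assumes each XML file is named after its image id and that the order file has already been read into a list of image ids. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_dict(xml_text):
    """Flatten one XML document into {tag: text} over its leaf elements.

    If a tag repeats, the last occurrence wins.
    """
    root = ET.fromstring(xml_text)
    record = {}
    for elem in root.iter():
        is_leaf = len(elem) == 0
        if is_leaf and elem.text and elem.text.strip():
            record[elem.tag] = elem.text.strip()
    return record


def load_metadata(xml_dir):
    """Map each XML file stem (assumed to be the image id) to its parsed fields."""
    return {
        path.stem: xml_to_dict(path.read_text(encoding="utf-8"))
        for path in sorted(Path(xml_dir).glob("*.xml"))
    }


def select_ordered(metadata, order_ids):
    """Keep only images listed in the order file, in the order file's sequence."""
    return [(image_id, metadata[image_id]) for image_id in order_ids if image_id in metadata]
```

Images that appear in the metadata but not in the order file are dropped, and ids in the order file with no metadata are skipped rather than raising an error.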
Continual YFCC100M (CLOC)
Extremely Fast Image Downloader
- Download the cloc.txt file from this link inside the YOUR_DATASET_DIR/cloc directory.
- The cloc.txt file contains 36.8M image links; missing/broken links from the original CLOC download file have been removed.
- Download the dataset in parallel and at scale using img2dataset. This finishes in under a day on an 8-node server (read the instructions in the img2dataset repo for further distributed download options):
pip install img2dataset
img2dataset --url_list cyfcc.txt --input_format "txt" --output_format webdataset --output_folder images --processes_count 16 --thread_count 256 --resize_mode no --skip_reencode True
- Match the URLs and file indexes to the idx for the training script given in the original CLOC repo via this script.
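The matching step is handled by the linked script. The sketch below only illustrates the general idea and is not that script. It assumes, purely for illustration, that the idx of an image is the zero-based line position of its URL in the link file, and that the (file key, URL) pairs have already been read from the download's metadata. Both function names are hypothetical.

```python
def url_index(url_lines):
    """Map each URL to the zero-based line position of its first occurrence."""
    index = {}
    for position, line in enumerate(url_lines):
        url = line.strip()
        if url and url not in index:
            index[url] = position
    return index


def match_downloads(downloaded, index):
    """Pair each downloaded file key with the list position of its URL.

    `downloaded` is an iterable of (file_key, url) pairs. URLs that are not
    in the index (for example, failed or unknown downloads) are skipped.
    """
    return {key: index[url] for key, url in downloaded if url in index}
```

With 36.8M links, the dictionary built by url_index is large, so this kind of matching is best run once and the result saved to disk.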
Running the Code
Replication
Additional Experiments
- To reproduce our KNN scaling graphs (Figure 1b), please run the following on a machine with a large amount of RAM:
cd scripts/
python knn_scaling.py
python plot_knn_results.py
- To reproduce the blind classifier, please run the following:
cd scripts/
python run_blind.py
If you discover any bugs in the code, please contact me; I will cross-check them with my nightmares.
Updates
- New ordering files: using upload_date instead of the date from EXIF metadata (more unique timestamps and more faithful to the story), we get this new order file. It is different from the order file at the CLDatasets repo. Do not cross-compare.
- However, no substantial changes were observed in the trends! The label correlation does not go away (in fact, it slightly increases with the better ordering, which breaks the ties between identical timestamps that had led to random ordering!)
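To see why unique timestamps matter for the ordering, the sketch below measures how many items share a timestamp and sorts with a deterministic tie-break. It is an illustration only: how the repository breaks ties is not described here, and using the image id as the secondary key is an assumption. Both function names are hypothetical.

```python
from collections import Counter


def tied_fraction(timestamps):
    """Fraction of items whose timestamp is shared with at least one other item."""
    if not timestamps:
        return 0.0
    counts = Counter(timestamps)
    tied = sum(count for count in counts.values() if count > 1)
    return tied / len(timestamps)


def order_by_timestamp(records):
    """Sort (image_id, timestamp) pairs by timestamp, breaking ties by image_id."""
    return sorted(records, key=lambda record: (record[1], record[0]))
```

A lower tied_fraction means fewer positions in the stream depend on the tie-break rule, which is the sense in which a timestamp field with more unique values gives a more faithful ordering.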
Citation
We hope ACM is a strong method for comparison, and this idea/codebase is useful for your cool CL idea! To cite our work:
@article{prabhu2023online,
title={Online Continual Learning Without the Storage Constraint},
author={Prabhu, Ameya and Cai, Zhipeng and Dokania, Puneet and Torr, Philip and Koltun, Vladlen and Sener, Ozan},
journal={arXiv preprint arXiv:2305.09253},
year={2023}
}