Continual Learning Module
October 30, 2025 ยท View on GitHub
This module provides popular continual learning baseline implementations on top of the Soft-Actor-Critic (SAC) algorithm. The implementation is based on Tensorflow.
Installation
To install the continual learning module, run the following command:
$ pip install COOM[cl]
Running Experiments
You can run single task or continual learning experiments with CL/run_single.py and CL/run_cl.py scripts, respectively.
To see available script arguments, run with --help option, e.g. python CL/run_single.py --help
Single task
python CL/run_single.py --scenario pitfall
Continual learning
python CL/run_cl.py --sequence CO4 --cl_method packnet
Reproducing Experimental Results
We have also listed all the commands for running the experiments in our paper in cl.sh and single.sh. We used seeds [0, 1, ..., 9] for all experiments in the paper.
Average Performance, Forgetting, Forward Transfer and Action Distributions
We evaluate the continual learning methods on the COOM benchmark based on Average Performance, Forgetting, and Forward Transfer. We use the following CL methods:
python CL/run_cl.py --sequence [SEQUENCE] --seed [SEED] --cl_method packnet --packnet_retrain_steps 10000 --clipnorm 2e-05
python CL/run_cl.py --sequence [SEQUENCE] --seed [SEED] --cl_method mas --cl_reg_coef=10000
python CL/run_cl.py --sequence [SEQUENCE] --seed [SEED] --cl_method agem --regularize_critic --episodic_mem_per_task 10000 --episodic_batch_size 128
python CL/run_cl.py --sequence [SEQUENCE] --seed [SEED] --cl_method l2 --cl_reg_coef=100000
python CL/run_cl.py --sequence [SEQUENCE] --seed [SEED] --cl_method ewc --cl_reg_coef=250
python CL/run_cl.py --sequence [SEQUENCE] --seed [SEED] --cl_method vcl --cl_reg_coef=1 --vcl_first_task_kl False
python CL/run_cl.py --sequence [SEQUENCE] --seed [SEED] --cl_method clonex --exploration_kind 'best_return' --cl_reg_coef=100 --episodic_mem_per_task 10000 --episodic_batch_size 128
python CL/run_cl.py --sequence [SEQUENCE] --seed [SEED] --batch_size 512 --buffer_type reservoir --reset_buffer_on_task_change False --replay_size 8e5 # Perfect Memory
python CL/run_cl.py --sequence [SEQUENCE] --seed [SEED] # Fine-tuning
We ran the COC sequence with sparse reward and only with PackNet:
python CL/run_cl.py --sequence COC --seed [SEED] --sparse_rewards --cl_method packnet --packnet_retrain_steps 10000 --clipnorm 2e-05
Measuring Forward Transfer also requires running SAC on each task in isolation:
python CL/run_single.py --scenario [SCENARIO] --envs [ENVS] --seed [SEED] --no_test
Network plasticity
To reproduce our network plasticity experiments from the paper, run the following command:
python CL/run_cl.py --sequence CO8 --seed [SEED] --repeat_sequence 10 --no_test --steps_per_env 100000
Method Variations
To reproduce our method variations experiments from the paper, run the following command:
Image Augmentations
- Random Convolution
- Random Shift
- Random Noise
python CL/run_cl.py --sequence CO8 --cl_method [METHOD] --seed [SEED] --augment --augmentation conv
python CL/run_cl.py --sequence CO8 --cl_method [METHOD] --seed [SEED] --augment --augmentation shift
python CL/run_cl.py --sequence CO8 --cl_method [METHOD] --seed [SEED] --augment --augmentation noise
Prioritized Experience Replay (PER)
python CL/run_cl.py --sequence CO8 --cl_method [METHOD] --seed [SEED] --buffer_type prioritized
LSTM
python CL/run_cl.py --sequence CO8 --cl_method [METHOD] --seed [SEED] --use_lstm
Critic Regularization
python CL/run_cl.py --sequence CO8 --cl_method [METHOD] --seed [SEED] --regularize_critic
Command Line Arguments
Below is a table of the available command line arguments for the script:
| Category | Argument | Default | Description |
|---|---|---|---|
| Core | --scenarios | None | Scenarios to run. Choices: health_gathering, run_and_gun, dodge_projectiles, chainsaw, raise_the_roof, floor_is_lava, hide_and_seek, arms_dealer, pitfall |
--envs | ['default'] | Name of the environments in the scenario(s) to run | |
--test_envs | [] | Name of the environments to periodically evaluate the agent on | |
--no_test | False | If True, no test environments will be used | |
--seed | 0 | Seed for randomness | |
--gpu | None | Which GPU to use | |
--sparse_rewards | False | Whether to use the sparse reward setting | |
| Continual Learning | --sequence | None | Name of the continual learning sequence. Choices: CD4, CD8, CD16, CO4, CO8, CO16, COC, MIXED |
--cl_method | None | Continual learning method. Choices: clonex, owl, l2, ewc, mas, vcl, packnet, agem | |
--start_from | 0 | Which task to start/continue the training from | |
--num_repeats | 1 | How many times to repeat the sequence | |
--random_order | False | Whether to randomize the order of the tasks | |
| DOOM | --render | False | Render the environment |
--render_sleep | 0.0 | Sleep time between frames when rendering | |
--variable_queue_length | 5 | Number of game variables to remember | |
--frame_skip | 4 | Number of frames to skip | |
--resolution | None | Screen resolution of the game. Choices: 800X600, 640X480, 320X240, 160X120 | |
| Save/Load | --save_freq_epochs | 25 | Save the model parameters after n epochs |
--model_path | None | Path to load the model from | |
| Recording | --record | False | Whether to record gameplay videos |
--record_every | 100 | Record gameplay video every n episodes | |
--video_folder | 'videos' | Path to save the gameplay videos | |
| Logging | --with_wandb | False | Enables Weights and Biases |
--logger_output | ["tsv", "tensorboard"] | Types of logger used. Choices: neptune, tensorboard, tsv | |
--group_id | "default_group" | Group ID, for grouping logs from different experiments into common directory | |
--log_every | 1000 | Number of steps between subsequent evaluations and logging | |
| Model | --use_lstm | False | Whether to use an LSTM after the CNN encoder head |
--hidden_sizes | [256, 256] | Hidden sizes list for the MLP models | |
--activation | "lrelu" | Activation kind for the models | |
--use_layer_norm | True | Whether to use layer normalization | |
--multihead_archs | True | Whether to use multi-head architecture | |
--hide_task_id | False | Whether the model knows the task during test time | |
| Learning Rate | --lr | 1e-3 | Learning rate for the optimizer |
--lr_decay | 'linear' | Method to decay the learning rate over time. Choices: None, 'linear', 'exponential' | |
--lr_decay_rate | 0.1 | Rate to decay the learning | |
--lr_decay_steps | 1e5 | Number of steps to decay the learning rate | |
| Replay Buffer | --replay_size | 5e4 | Size of the replay buffer |
--buffer_type | "fifo" | Strategy of inserting examples into the buffer. Choices: fifo, other values as per BufferType enum | |
--episodic_memory_from_buffer | True | [Description] | |
| Training | --steps_per_env | 2e5 | Number of steps the algorithm will run per environment |
--update_after | 5000 | Number of env interactions to collect before starting to do update the gradient | |
--update_every | 500 | Number of env interactions to do between every update | |
--n_updates | 50 | Number of consecutive policy gradient descent updates to perform | |
--batch_size | 128 | Minibatch size for the optimization | |
--gamma | 0.99 | Discount factor | |
--alpha | "auto" | Entropy regularization coefficient | |
--target_output_std | 0.089 | Target standard deviation of the action distribution for dynamic alpha tuning | |
--regularize_critic | False | Whether to regularize both actor and critic, or only actor | |
--clipnorm | None | Value for gradient clipping | |
| Testing | --test | True | Whether to test the model |
--test_only | False | Whether to only test the model | |
--test_episodes | 3 | Number of episodes to test the model | |
| Exploration | --start_steps | 10000 | Number of steps for uniform-random action selection |
--agent_policy_exploration | False | Whether to use uniform exploration only in the first task | |
--exploration_kind | None | Kind of exploration to use at the beginning of a new task | |
| Task Change | --reset_buffer_on_task_change | True | Whether to reset the replay buffer on task change |
--reset_optimizer_on_task_change | True | Whether to reset the optimizer on task change | |
--reset_critic_on_task_change | False | Whether to reset the critic on task change | |
| CL Method Specific | --packnet_retrain_steps | 0 | Number of retrain steps after network pruning per task |
--cl_reg_coef | 0.0 | Regularization strength for certain CL methods | |
--vcl_first_task_kl | False | Use KL regularization for the first task in VCL | |
--episodic_mem_per_task | 0 | Number of examples to keep in memory per task for AGEM | |
--episodic_batch_size | 0 | Minibatch size for additional loss computation in AGEM | |
| Observation | --frame_stack | 4 | Number of frames to stack |
--frame_height | 84 | Height of the frame | |
--frame_width | 84 | Width of the frame | |
--augment | False | Whether to use image augmentation | |
--augmentation | None | Type of image augmentation. Choices: 'conv', 'shift', 'noise' | |
| Reward | --reward_frame_survived | 0.01 | Reward for surviving a frame |
--reward_switch_pressed | 15.0 | Reward for pressing a switch | |
--reward_kill_dtc | 1.0 | Reward for eliminating an enemy | |
--reward_kill_rag | 5.0 | Reward for eliminating an enemy | |
--reward_kill_chain | 5.0 | Reward for eliminating an enemy | |
--reward_health_hg | 15.0 | Reward for picking up a health kit | |
--reward_health_has | 5.0 | Reward for picking a health kit | |
--reward_weapon_ad | 15.0 | Reward for picking a weapon | |
--reward_delivery | 30.0 | Reward for delivering an item | |
--reward_platform_reached | 1.0 | Reward for reaching a platform | |
--reward_on_platform | 0.1 | Reward for staying on a platform | |
--reward_scaler_pitfall | 0.1 | Reward scaler for traversal | |
--reward_scaler_traversal | 1e-3 | Reward scaler for traversal | |
| Penalty | --penalty_passivity | -0.1 | Penalty for not moving |
--penalty_death | -1.0 | Negative reward for dying | |
--penalty_projectile | -0.01 | Negative reward for projectile hit | |
--penalty_health_hg | -0.01 | Negative reward for losing health | |
--penalty_health_dtc | -1.0 | Negative reward for losing health | |
--penalty_health_has | -5.0 | Negative reward for losing health | |
--penalty_lava | -0.1 | Penalty for stepping on lava | |
--penalty_ammo_used | -0.1 | Negative reward for using ammo |