Command-line usage guide for minimax.train

November 22, 2023

Command-line arguments are parsed by Parsnip.

You can quickly generate batches of training commands from a JSON configuration file using minimax.config.make_cmd.
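
As a rough illustration of what generating a batch of commands from a configuration involves, the sketch below expands a small grid of argument values into one training command per combination. The JSON layout (a flat object mapping argument names to lists of values), the `expand_grid` helper, and the `python -m minimax.train` entry point are assumptions made for this example; they are not the schema or implementation of minimax.config.make_cmd.

```python
import itertools
import json
import shlex

# Hypothetical config layout: each argument maps to a list of values to sweep.
CONFIG_JSON = """
{
  "seed": [1, 2],
  "train_runner": ["plr"],
  "agent_rl_algo": ["ppo"],
  "lr": [0.0001, 0.00005]
}
"""


def expand_grid(config):
    """Return one command string per combination of argument values."""
    names = list(config)
    commands = []
    for values in itertools.product(*(config[name] for name in names)):
        argv = ["python", "-m", "minimax.train"]  # assumed entry point
        argv += [f"--{name}={value}" for name, value in zip(names, values)]
        commands.append(shlex.join(argv))
    return commands


commands = expand_grid(json.loads(CONFIG_JSON))
for command in commands:
    print(command)
```

Sweeping `seed` this way gives each run its own random seed, which matches the recommendation that seeds be unique per experimental run.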

General arguments

| Argument | Description |
| --- | --- |
| seed | Random seed, should be unique per experimental run |
| agent_rl_algo | Base RL algorithm used for training (e.g. PPO) |
| n_total_updates | Total number of updates for the training run |
| train_runner | Which training runner to use, e.g. dr, plr, or paired |
| n_devices | Number of devices over which to shard the environment batch dimension |
| n_students | Number of students in the autocurriculum |
| n_parallel | Number of parallel environments |
| n_eval | Number of parallel trials per environment (environment batch dimension is then n_parallel*n_eval) |
| n_rollout_steps | Number of steps per rollout (used for each update cycle) |
| lr | Learning rate |
| lr_final | Final learning rate, based on linear schedule. Defaults to None, corresponding to no schedule. |
| lr_anneal_steps | Number of steps over which to linearly anneal from lr to lr_final |
| student_value_coef | Value loss coefficient |
| student_entropy_coef | Entropy bonus coefficient |
| student_unroll_update | Unroll multi-gradient updates this many times (can lead to speed ups) |
| max_grad_norm | Clip gradients beyond this magnitude |
| adam_eps | Value of the numerical stability constant $\epsilon$ for Adam |
| discount | Discount factor $\gamma$ for the student's RL optimization |
| n_unroll_rollout | Unroll rollout scans this many times (can lead to speed ups) |
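
To make two of the relationships in this table concrete, here is a minimal sketch of how lr, lr_final, and lr_anneal_steps could combine into a linear schedule, and of how the environment batch dimension n_parallel*n_eval divides across n_devices. Both helpers are hypothetical and written for illustration. Holding the rate at lr_final after annealing ends, and requiring the batch to divide evenly across devices, are assumptions rather than documented behavior of minimax.train.

```python
def linear_lr(step, lr, lr_final=None, lr_anneal_steps=0):
    """Learning rate at `step` under an assumed linear annealing rule."""
    if lr_final is None or lr_anneal_steps <= 0:
        return lr  # lr_final=None corresponds to no schedule
    # Fraction of the annealing period completed, clamped to 1 (assumption).
    frac = min(step, lr_anneal_steps) / lr_anneal_steps
    return lr + frac * (lr_final - lr)


def per_device_batch(n_parallel, n_eval, n_devices):
    """Environment batch size handled by each device."""
    total = n_parallel * n_eval  # environment batch dimension
    if total % n_devices != 0:
        # Assumption: the batch must shard evenly across devices.
        raise ValueError("n_parallel * n_eval must be divisible by n_devices")
    return total // n_devices


print(linear_lr(step=500, lr=1e-4, lr_final=1e-5, lr_anneal_steps=1000))
print(per_device_batch(n_parallel=32, n_eval=2, n_devices=4))
```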

Logging arguments

| Argument | Description |
| --- | --- |
| verbose | Enable verbose output |
| track_env_metrics | Track per rollout batch environment metrics if True |
| log_dir | Path to directory storing all experiment folders |
| xpid | Unique name for experiment folder, stored in --log_dir |
| log_interval | Log training statistics every this many rollout cycles |
| wandb_base_url | Base API URL if logging with wandb |
| wandb_api_key | API key for wandb |
| wandb_entity | wandb entity associated with the experiment run |
| wandb_project | wandb project for the experiment run |
| wandb_group | wandb group for the experiment run |

Checkpointing arguments

| Argument | Description |
| --- | --- |
| checkpoint_interval | Save a checkpoint every this many rollout cycles |
| from_last_checkpoint | Begin training from latest checkpoint.pkl, if any, in the experiment folder |
| archive_interval | Save an additional checkpoint for models trained per this many rollout cycles |

Evaluation arguments

| Argument | Description |
| --- | --- |
| test_env_names | Names of the test environments to evaluate on |
| test_n_episodes | Average test results over this many episodes per test environment |
| test_agent_idxs | Test agents at these indices (csv of indices or * for all indices) |
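
The test_agent_idxs format (a csv of indices, or * for all indices) can be read as in the sketch below. The `parse_agent_idxs` helper and its `n_agents` parameter are hypothetical names introduced for this example, and the bounds check is an assumption; the parser in minimax may behave differently.

```python
def parse_agent_idxs(spec, n_agents):
    """Turn a spec such as "0,2" or "*" into a list of agent indices."""
    spec = spec.strip()
    if spec == "*":
        return list(range(n_agents))  # "*" selects every agent
    idxs = [int(token) for token in spec.split(",") if token.strip()]
    for idx in idxs:
        if not 0 <= idx < n_agents:
            raise ValueError(f"agent index {idx} out of range for {n_agents} agents")
    return idxs


print(parse_agent_idxs("*", n_agents=3))
print(parse_agent_idxs("0,2", n_agents=3))
```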

PPO arguments

These arguments activate when --agent_rl_algo=ppo.

| Argument | Description |
| --- | --- |
| student_ppo_n_epochs | Number of PPO epochs per update cycle |
| student_ppo_n_minibatches | Number of minibatches per PPO epoch |
| student_ppo_clip_eps | Clip coefficient for PPO |
| student_ppo_clip_value_loss | Perform value clipping if True |
| gae_lambda | Lambda discount factor for Generalized Advantage Estimation |
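
For readers unfamiliar with what discount, gae_lambda, and student_ppo_clip_eps control, the sketch below shows textbook Generalized Advantage Estimation and the PPO clipped surrogate term in plain Python. The function names and default values are chosen for illustration only. This is the standard formulation, not the code used by minimax.

```python
def compute_gae(rewards, values, last_value, dones, discount=0.99, gae_lambda=0.95):
    """Advantages for one rollout; dones[t] is True if the episode ended at step t."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error, with bootstrapping cut off at episode boundaries.
        delta = rewards[t] + discount * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, decayed by discount * gae_lambda.
        next_advantage = delta + discount * gae_lambda * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO objective term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


advantages = compute_gae(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0],
    last_value=0.0,
    dones=[False, True],
    discount=1.0,
    gae_lambda=1.0,
)
print(advantages)
print(clipped_surrogate(ratio=1.5, advantage=1.0, clip_eps=0.2))
```

With gae_lambda set to 0 the estimate reduces to the one-step TD error, and with gae_lambda set to 1 it becomes the discounted return minus the value baseline.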

PAIRED arguments

The arguments in this section activate when --train_runner=paired.

| Argument | Description |
| --- | --- |
| teacher_lr | Learning rate |
| teacher_lr_final | Anneal learning rate to this value (defaults to teacher_lr) |
| teacher_lr_anneal_steps | Number of steps over which to linearly anneal from teacher_lr to teacher_lr_final |
| teacher_discount | Discount factor $\gamma$ |
| teacher_value_loss_coef | Value loss coefficient |
| teacher_entropy_coef | Entropy bonus coefficient |
| teacher_n_unroll_update | Unroll multi-gradient updates this many times (can lead to speed ups) |
| ued_score | Name of UED objective, e.g. relative_regret |

These PPO-specific arguments for teacher optimization further activate when --agent_rl_algo=ppo.

| Argument | Description |
| --- | --- |
| teacher_ppo_n_epochs | Number of PPO epochs per update cycle |
| teacher_ppo_n_minibatches | Number of minibatches per PPO epoch |
| teacher_ppo_clip_eps | Clip coefficient for PPO |
| teacher_ppo_clip_value_loss | Perform value clipping if True |
| teacher_gae_lambda | Lambda discount factor for Generalized Advantage Estimation |

PLR arguments

The arguments in this section activate when --train_runner=plr.

| Argument | Description |
| --- | --- |
| ued_score | Name of UED objective (aka PLR scoring function) |
| plr_replay_prob | Replay probability |
| plr_buffer_size | Size of level replay buffer |
| plr_staleness_coef | Staleness coefficient |
| plr_temp | Score distribution temperature |
| plr_use_score_ranks | Use rank-based prioritization (rather than proportional) |
| plr_min_fill_ratio | Only replay once level replay buffer is filled above this ratio |
| plr_use_robust_plr | Use robust PLR (i.e. only update policy on replay levels) |
| plr_force_unique | Force level replay buffer members to be unique |
| plr_use_parallel_eval | Use Parallel PLR or Parallel ACCEL (if plr_mutation_fn is set) |
| plr_mutation_fn | If set, PLR becomes ACCEL. Use 'default' for default mutation operator per environment. |
| plr_n_mutations | Number of applications of plr_mutation_fn per mutation cycle. |
| plr_mutation_criterion | How replay levels are selected for mutation (e.g. batch, easy, hard). |
| plr_mutation_subsample_size | Number of replay levels selected for mutation according to the criterion (ignored if using batch criterion) |
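
The sketch below shows how plr_temp, plr_staleness_coef, and plr_use_score_ranks typically interact in Prioritized Level Replay: scores become a prioritized distribution, which is then mixed with a staleness distribution. It follows the formulation in the PLR paper. The function names, the uniform fallback, and the assumption of non-negative scores in the proportional case are illustrative choices, not details taken from minimax.

```python
def normalize(weights):
    """Scale weights to sum to 1, falling back to uniform if they sum to 0."""
    total = sum(weights)
    if total <= 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def replay_distribution(scores, last_sampled, current_step,
                        temp=1.0, staleness_coef=0.1, use_score_ranks=True):
    """Probability of replaying each level in the buffer."""
    n = len(scores)
    if use_score_ranks:
        # Rank-based prioritization: weight 1/rank, rank 1 = highest score.
        order = sorted(range(n), key=lambda i: -scores[i])
        weights = [0.0] * n
        for rank, i in enumerate(order, start=1):
            weights[i] = 1.0 / rank
    else:
        # Proportional prioritization (assumes non-negative scores).
        weights = list(scores)
    # The temperature sharpens (temp < 1) or flattens (temp > 1) the distribution.
    score_probs = normalize([w ** (1.0 / temp) for w in weights])
    # Staleness: levels not sampled for longer receive more weight.
    stale_probs = normalize([current_step - t for t in last_sampled])
    return [(1.0 - staleness_coef) * p + staleness_coef * q
            for p, q in zip(score_probs, stale_probs)]


probs = replay_distribution(
    scores=[0.5, 2.0, 1.0],
    last_sampled=[3, 9, 6],
    current_step=10,
    temp=0.3,
    staleness_coef=0.3,
)
print(probs)
```

Setting staleness_coef to 0 samples purely by score, while setting it to 1 samples purely by how long ago each level was last replayed.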

Environment-specific arguments

Maze

See the AMaze docs for details on how to specify training, evaluation, and teacher-specific environment parameters via the command line.