RL Games: High performance RL library

April 17, 2026

Note: The next release will be 2.0.0 (unreleased). It migrates fully from gym to gymnasium. EnvPool support has been restored with the new envpool 1.1.1 (Python 3.11–3.14, NumPy 2.x, MuJoCo 3.x compatible).

Some results on different environments:

(Demo videos: Ant running, Humanoid running, Allegro Hand 400, Shadow Hand OpenAI, Allegro Hand real world, AllegroKuka, Go1 flat velocity, Go1 rough velocity, G1 humanoid flat velocity.)

Implemented in PyTorch:

  • PPO with support for the asymmetric actor-critic variant
  • End-to-end GPU-accelerated training pipeline with Isaac Gym and Brax
  • Masked actions support
  • Multi-agent training with decentralized and centralized critic variants
  • Self-play

Previously implemented in TensorFlow 1.x (removed in this version):

  • Rainbow DQN
  • A2C
  • PPO

Quickstart: Colab in the Cloud

Explore RL Games quickly and easily in Colab notebooks.

Installation

For maximum training performance, PyTorch >= 2.2 with CUDA is recommended.

pip install rl-games

Or clone the repo and install the latest version from source:

pip install -e .

With optional extras (e.g. Atari, Mujoco, EnvPool):

pip install -e ".[atari,mujoco,envpool]"

Available extras: atari, mujoco, envpool, brax, pufferlib.

For high-throughput vectorized MuJoCo / Atari / DM Control training, install the envpool extra and see docs/ENVPOOL.md.

uv is a fast Python package manager. To create a virtual environment and install rl_games:

uv venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[mujoco,envpool]"

Citing

If you use rl-games in your research, please use the following citation:

@misc{rl-games2021,
  title = {rl-games: A High-performance Framework for Reinforcement Learning},
  author = {Makoviichuk, Denys and Makoviychuk, Viktor},
  month = {May},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Denys88/rl_games}},
}

Development setup

uv venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[atari,mujoco]"

Training

NVIDIA Isaac Gym

Download Isaac Gym and follow its installation instructions: https://developer.nvidia.com/isaac-gym
Then install IsaacGymEnvs: https://github.com/NVIDIA-Omniverse/IsaacGymEnvs

Ant

python train.py task=Ant headless=True
python train.py task=Ant test=True checkpoint=nn/Ant.pth num_envs=100

Humanoid

python train.py task=Humanoid headless=True
python train.py task=Humanoid test=True checkpoint=nn/Humanoid.pth num_envs=100

Shadow Hand block orientation task

python train.py task=ShadowHand headless=True
python train.py task=ShadowHand test=True checkpoint=nn/ShadowHand.pth num_envs=100

Other

Atari Pong

python runner.py --train --file rl_games/configs/atari/ppo_pong.yaml
python runner.py --play --file rl_games/configs/atari/ppo_pong.yaml --checkpoint nn/PongNoFrameskip.pth

Brax Ant

pip install -U "jax[cuda12]"
pip install brax
python runner.py --train --file rl_games/configs/brax/ppo_ant.yaml
python runner.py --play --file rl_games/configs/brax/ppo_ant.yaml --checkpoint runs/Ant_brax/nn/Ant_brax.pth
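
These runner.py commands can also be driven programmatically through the Runner class. A minimal sketch, assuming the CartPole config shipped with the repo (any config under rl_games/configs works):

import yaml

from rl_games.torch_runner import Runner

# Load a shipped config; ppo_cartpole.yaml is a small example.
with open('rl_games/configs/ppo_cartpole.yaml') as f:
    config = yaml.safe_load(f)

runner = Runner()
runner.load(config)  # consumes the 'params' block of the yaml
runner.reset()
runner.run({'train': True, 'play': False})  # same switches as --train / --play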

Experiment tracking

rl_games supports experiment tracking with Weights and Biases:

python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --track
WANDB_API_KEY=xxxx python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --track
python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --track
python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --wandb-entity openrlbenchmark --track

Multi GPU

We use torchrun to orchestrate multi-GPU runs.

torchrun --standalone --nnodes=1 --nproc_per_node=2 runner.py --train --file rl_games/configs/ppo_cartpole.yaml

Config Parameters

| Field | Example Value | Default | Description |
|---|---|---|---|
| seed | 8 | None | Seed for pytorch, numpy, etc. |
| algo |  |  | Algorithm block. |
|   name | a2c_continuous | None | Algorithm name. Possible values are: sac, a2c_discrete, a2c_continuous. |
| model |  |  | Model block. |
|   name | continuous_a2c_logstd | None | Possible values: continuous_a2c (expects sigma in (0, +inf)), continuous_a2c_logstd (expects sigma in (-inf, +inf)), a2c_discrete, a2c_multi_discrete. |
| network |  |  | Network description. |
|   name | actor_critic |  | Possible values: actor_critic or soft_actor_critic. |
|   separate | False |  | Whether to use a separate network with the same architecture for the critic. In almost all cases, if you normalize value it is better to keep it False. |
|   space |  |  | Network space. |
|     continuous |  |  | continuous or discrete. |
|       mu_activation | None |  | Activation for mu. In almost all cases None works best, but tanh may be worth trying. |
|       sigma_activation | None |  | Activation for sigma. Treated as log(sigma) or sigma depending on the model. |
|       mu_init |  |  | Initializer for mu. |
|         name | default |  |  |
|       sigma_init |  |  | Initializer for sigma. If you are using the logstd model, 0 is a good value. |
|         name | const_initializer |  |  |
|         val | 0 |  |  |
|       fixed_sigma | True |  | If true, the sigma vector doesn't depend on the input. |
|   cnn |  |  | Convolution block. |
|     type | conv2d |  | Currently two types are supported: conv2d or conv1d. |
|     activation | elu |  | Activation between conv layers. |
|     initializer |  |  | Initializer. Some names are taken from TensorFlow. |
|       name | glorot_normal_initializer |  | Initializer name. |
|       gain | 1.4142 |  | Additional parameter. |
|     convs |  |  | Convolution layers. Same parameters as in torch. |
|       filters | 32 |  | Number of filters. |
|       kernel_size | 8 |  | Kernel size. |
|       strides | 4 |  | Strides. |
|       padding | 0 |  | Padding. |
|       filters | 64 |  | Next convolution layer info. |
|       kernel_size | 4 |  |  |
|       strides | 2 |  |  |
|       padding | 0 |  |  |
|       filters | 64 |  |  |
|       kernel_size | 3 |  |  |
|       strides | 1 |  |  |
|       padding | 0 |  |  |
|   mlp |  |  | MLP block. Convolution is supported too; see other config examples. |
|     units |  |  | Array of MLP layer sizes, for example [512, 256, 128]. |
|     d2rl | False |  | Use the D2RL architecture from https://arxiv.org/abs/2010.09163. |
|     activation | elu |  | Activations between dense layers. |
|     initializer |  |  | Initializer. |
|       name | default |  | Initializer name. |
|   rnn |  |  | RNN block. |
|     name | lstm |  | RNN layer name. lstm and gru are supported. |
|     units | 256 |  | Number of units. |
|     layers | 1 |  | Number of layers. |
|     before_mlp | False | False | Whether to apply the RNN before the MLP block. |
| config |  |  | RL config block. |
|   reward_shaper |  |  | Reward shaper. Can apply simple transformations. |
|     min_val | -1 |  | You can apply min_val, max_val, scale and shift. |
|     scale_value | 0.1 | 1 |  |
|   normalize_advantage | True | True | Normalize advantage. |
|   gamma | 0.995 |  | Reward discount. |
|   tau | 0.95 |  | Lambda for GAE. Called tau by mistake a long time ago because lambda is a keyword in Python :( |
|   learning_rate | 3e-4 |  | Learning rate. |
|   name | walker |  | Name which will be used in TensorBoard. |
|   save_best_after | 10 |  | How many epochs to wait before starting to save the checkpoint with the best score. |
|   score_to_win | 300 |  | If the score is >= this value, training will stop. |
|   grad_norm | 1.5 |  | Grad norm. Applied if truncate_grads is True. A good value is in (1.0, 10.0). |
|   entropy_coef | 0 |  | Entropy coefficient. A good value for continuous spaces is 0; for discrete, 0.02. |
|   truncate_grads | True |  | Whether to truncate grads. Stabilizes training. |
|   env_name | BipedalWalker-v3 |  | Environment name. |
|   e_clip | 0.2 |  | Clip parameter for the PPO loss. |
|   clip_value | False |  | Apply clipping to the value loss. If you use normalize_value you don't need it. |
|   num_actors | 16 |  | Number of running actors/environments. |
|   horizon_length | 4096 |  | Horizon length per actor. The total number of steps will be num_actors * horizon_length * num_agents (num_agents == 1 if the env is not multi-agent). |
|   minibatch_size | 8192 |  | Minibatch size. The total number of steps must be divisible by the minibatch size. |
|   minibatch_size_per_env | 8 |  | Minibatch size per env. If specified, overrides the default minibatch size with minibatch_size_per_env * num_envs. |
|   mini_epochs | 4 |  | Number of mini-epochs. A good value is in [1, 10]. |
|   critic_coef | 2 |  | Critic coefficient. By default critic_loss = critic_coef * 1/2 * MSE. |
|   lr_schedule | adaptive | None | Scheduler type. Can be None, linear or adaptive. Adaptive is best for continuous control tasks. The learning rate is updated every mini-epoch. |
|   kl_threshold | 0.008 |  | KL threshold for the adaptive schedule. If KL < kl_threshold/2 then lr = lr * 1.5, and the opposite in the other direction. |
|   normalize_input | True |  | Apply running mean/std normalization to the input. |
|   bounds_loss_coef | 0.0 |  | Coefficient for the auxiliary loss for continuous spaces. |
|   max_epochs | 10000 |  | Maximum number of epochs to run. |
|   max_frames | 5000000 |  | Maximum number of frames (env steps) to run. |
|   normalize_value | True |  | Use running mean/std normalization for the value. |
|   use_diagnostics | True |  | Adds more information to TensorBoard. |
|   value_bootstrap | True |  | Bootstrap the value when an episode is finished. Very useful for locomotion envs. |
|   bound_loss_type | regularisation | None | Adds an aux loss for the continuous case. 'regularisation' is the sum of squared actions; 'bound' is the sum of actions above 1.1. |
|   bounds_loss_coef | 0.0005 | 0 | Regularisation coefficient. |
|   use_smooth_clamp | False |  | Use a smooth clamp instead of a regular one for clipping. |
|   zero_rnn_on_done | False | True | If False, the RNN internal state is not reset (set to 0) when an environment is reset. Can improve training in some cases, for example when domain randomization is on. |
|   player |  |  | Player configuration block. |
|     render | True | False | Render the environment. |
|     deterministic | True | True | Use a deterministic policy (argmax or mu) instead of a stochastic one. |
|     use_vecenv | True | False | Use vecenv to create the environment for the player. |
|     games_num | 200 |  | Number of games to run in player mode. |
|   env_config |  |  | Env configuration block. Passed directly to the environment. This example is taken from the Atari wrapper. |
|     skip | 4 |  | Number of frames to skip. |
|     name | BreakoutNoFrameskip-v4 |  | The exact name of an (Atari) gym env. Depending on the training env, these parameters can differ. |
|   evaluation | True | False | Enables the evaluation feature for inference while training. |
|   update_checkpoint_freq | 100 | 100 | Frequency, in steps, at which to look for new checkpoints. |
|   dir_to_monitor |  |  | Directory to monitor for checkpoints during evaluation. |
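
For orientation, here is how these fields fit together in a yaml file. This is a minimal sketch in the style of the shipped BipedalWalker configs; the values are illustrative rather than tuned:

params:
  seed: 8
  algo:
    name: a2c_continuous
  model:
    name: continuous_a2c_logstd
  network:
    name: actor_critic
    separate: False
    space:
      continuous:
        mu_activation: None
        sigma_activation: None
        mu_init:
          name: default
        sigma_init:
          name: const_initializer
          val: 0
        fixed_sigma: True
    mlp:
      units: [256, 128]
      activation: elu
      initializer:
        name: default
  config:
    name: walker
    env_name: BipedalWalker-v3
    reward_shaper:
      scale_value: 0.1
    normalize_input: True
    normalize_value: True
    normalize_advantage: True
    gamma: 0.995
    tau: 0.95
    learning_rate: 3e-4
    lr_schedule: adaptive
    kl_threshold: 0.008
    e_clip: 0.2
    entropy_coef: 0
    critic_coef: 2
    grad_norm: 1.5
    truncate_grads: True
    num_actors: 16
    horizon_length: 4096
    minibatch_size: 8192
    mini_epochs: 4
    max_epochs: 10000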

Custom network example:

A simple test network is provided in rl_games.envs.test_network. It takes a dictionary observation. To register it, add the following code to your __init__.py:

from rl_games.envs.test_network import TestNetBuilder 
from rl_games.algos_torch import model_builder
model_builder.register_network('testnet', TestNetBuilder)
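
Once registered, the network can be selected from a yaml config by the name it was registered under. A minimal sketch (only the relevant block is shown):

params:
  network:
    name: testnet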

A simple test environment is provided as an example.

Additional supported environment properties and functions

| Field | Default Value | Description |
|---|---|---|
| use_central_value | False | If true, the returned obs is expected to be a dict with 'obs' and 'state'. |
| value_size | 1 | Shape of the returned rewards. The network will support multi-head value automatically. |
| concat_infos | False | Whether the default vecenv should convert a list of dicts to a dict of lists. Very useful if you want to use value bootstrapping; in this case you need to always return 'time_outs': True or False from the env. |
| get_number_of_agents(self) | 1 | Returns the number of agents in the environment. |
| has_action_mask(self) | False | Returns True if the environment has an invalid-action mask. |
| get_action_mask(self) | None | Returns action masks if has_action_mask is true. A good example is the SMAC env. |
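
A minimal sketch of a custom environment exposing these hooks. The class is illustrative (not part of the repo) and uses the classic gym API that rl_games 1.x targets:

import gym
import numpy as np

class MaskedTestEnv(gym.Env):
    """Illustrative single-agent env with an invalid-action mask."""

    def __init__(self, **kwargs):
        self.action_space = gym.spaces.Discrete(4)
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)

    def get_number_of_agents(self):
        return 1  # multi-agent envs return their agent count here

    def has_action_mask(self):
        return True  # tells rl_games that get_action_mask() is available

    def get_action_mask(self):
        # 1 = action allowed, 0 = invalid; here the last action is disabled.
        return np.array([1, 1, 1, 0], dtype=np.int8)

    def reset(self):
        return np.zeros(8, dtype=np.float32)

    def step(self, action):
        obs = np.zeros(8, dtype=np.float32)
        # Returning 'time_outs' in info enables value_bootstrap on timeouts.
        return obs, 0.0, False, {'time_outs': False}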

Release Notes

1.6.5

  • Added torch.compile support with configurable modes. Provides 10-40% performance improvement. Requires torch 2.2 or newer.
    • Default mode is reduce-overhead for balanced compilation time and runtime performance
    • Configurable via the torch_compile parameter in yaml configs (true/false/"default"/"reduce-overhead"/"max-autotune"); a config sketch follows this list
    • Separate compilation modes for actor and central value networks
    • See torch.compile documentation for detailed configuration and mode selection guidance
  • Fixed critical bugs in asymmetric actor-critic (central_value) training:
    • Fixed incorrect device reference in update_lr() method
    • Fixed infinite loop when iterating over dataset
    • Added proper __iter__ method to PPODataset class
  • Fixed variance calculation in RunningMeanStd to use population variance
  • Fixed get_mean_std_with_masks function.
  • Fixed missing central value optimizer state in checkpoint save/load
  • Added myosuite support.
  • Added auxiliary loss support.
  • Update for tacsl release: CNN tower processing, critic weights loading and freezing.
  • Fixed SAC input normalization.
  • Fixed SAC agent summary writer to use configured directory instead of hardcoded 'runs/'
  • Fixed default player config num_games value.
  • Fixed applying minibatch size per env.
  • Added concat_output support for RNN.
  • SAC improvements:
    • Fixed missing gamma_tensor initialization bug
    • Removed hardcoded torch.compile decorators (now respects YAML config)
    • Optimized tensor operations and removed unnecessary clones
  • Environment wrapper fixes:
    • Fixed tuple/list observation handling for compatibility with various gym environments
    • Added proper numpy to torch tensor conversion in cast_obs
    • Fixed missing gym import in envpool wrapper
  • Ray integration improvements:
    • Moved Ray import to lazy loading (only when RayVecEnv is used)
    • Added configurable Ray initialization with ray_config parameter
    • Added proper cleanup with close() method for Ray actors
    • Default 1GB object store memory allocation
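
As a follow-up to the torch.compile item above, here is a hedged config sketch. The exact nesting of torch_compile inside the config block is an assumption; the accepted values are those listed in the release note:

params:
  config:
    # true/false or one of: "default", "reduce-overhead", "max-autotune"
    torch_compile: reduce-overhead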

1.6.1

  • Fixed a central value RNN bug which occurs when training a multi-agent environment.
  • Added Deepmind Control PPO benchmark.
  • Added a few more experimental ways to train value prediction (OneHot and TwoHot encoding, and cross-entropy loss instead of L2).
  • The new methods didn't show a consistent improvement, so they cannot yet be turned on from the yaml files. Once we find an env which trains better with them, they will be added to the config.
  • Added shaped reward graph to the tensorboard.
  • Fixed bug with SAC not saving weights with save_frequency.
  • Added multi-node training support for GPU-accelerated training environments like Isaac Gym. No changes in training scripts are required. Thanks to @ankurhanda and @ArthurAllshire for assistance in implementation.
  • Added an evaluation feature for inference during training. When enabled, checkpoints from the training process can be automatically picked up and updated in the inference process.
  • Added get/set API for runtime update of rl training parameters. Thanks to @ArthurAllshire for the initial version of fast PBT code.
  • Fixed SAC not loading weights properly.
  • Removed the Ray dependency for use cases where it's not required.
  • Added a warning for using the deprecated 'seq_len' instead of 'seq_length' in configs with RNN networks.

1.6.0

  • Added an ONNX export colab example for discrete and continuous action spaces. For the continuous case, an LSTM policy example is provided as well.
  • Improved RNNs training in continuous space, added option zero_rnn_on_done.
  • Added NVIDIA CuLE support: https://github.com/NVlabs/cule
  • Added player config override. Vecenv is used for inference.
  • Fixed multi-gpu training with central value.
  • Fixed the max_frames termination condition and its interaction with the linear learning rate: https://github.com/Denys88/rl_games/issues/212
  • Fixed "deterministic" misspelling issue.
  • Fixed Mujoco and Brax SAC configs.
  • Fixed multi-agent env statistics reporting. Fixed StarCraft II SMAC environments.

1.5.2

  • Added observation normalization to SAC.
  • Brought back the legacy adaptive KL mode.

1.5.1

  • Fixed build package issue.

1.5.0

  • Added wandb support.
  • Added poetry support.
  • Fixed various bugs.
  • Fixed CNN input not being divided by 255 in the case of dictionary obs.
  • Added more envpool mujoco and atari training examples. Some of the results: 15 min Mujoco humanoid training, 2 min atari pong.
  • Added Brax and Mujoco colab training examples.
  • Added a 'seed' command line parameter. It overrides the seed in the config when it's > 0.
  • Deprecated horovod in favor of torch.distributed (#171).

1.4.0

  • Added discord channel https://discord.gg/hnYRq7DsQh :)
  • Added envpool support with a few Atari examples. Works 3-4x faster than Ray.
  • Added MuJoCo results. Much better than the OpenAI Spinning Up PPO results.
  • Added tcnn (https://github.com/NVlabs/tiny-cuda-nn) support. Reduces training time by 5-10% in the Isaac Gym envs.
  • Various fixes and improvements.

1.3.2

  • Added a 'sigma' command line parameter. It overrides sigma for continuous spaces when fixed_sigma is True.

1.3.1

  • Fixed SAC not working

1.3.0

  • Simplified RNN implementation. Works a little slower but is much more stable.
  • The central value network can now be non-RNN even when the policy is RNN.
  • Removed load_checkpoint from the yaml file. Now --checkpoint works for both train and play.

1.2.0

  • Added Swish (SiLU) and GELU activations; they can improve Isaac Gym results for some envs.
  • Removed tensorflow and made initial cleanup of the old/unused code.
  • Simplified runner.
  • Now networks are created in the algos with load_network method.

1.1.4

  • Fixed a crash in play (test) mode in the player when the simulation and RL devices are not the same.
  • Fixed various multi-GPU errors.

1.1.3

  • Fixed a crash when running a single Isaac Gym environment in play (test) mode.
  • Added the config parameter clip_actions for switching off internal action clipping and rescaling.

1.1.0

  • Added to PyPI: pip install rl-games
  • Added reporting of env (sim) step FPS, without policy inference. Improved naming.
  • Renames in the yaml config for better readability: steps_num to horizon_length and lr_threshold to kl_threshold.

Troubleshooting

  • Some of the supported envs are not installed with setup.py; you need to install them manually.
  • Starting from rl-games 1.1.0, old yaml configs are not compatible with the new version:
    • steps_num should be changed to horizon_length, and lr_threshold to kl_threshold

Known issues

  • Running a single environment with Isaac Gym can cause a crash; if it happens, switch to at least 2 environments simulated in parallel.