Environment API and Semantics

August 1, 2019

This text describes the Python-based Environment API defined by dm_env.

Overview

The main interaction with an environment is via the step() method.

Each call to an environment's step() method takes an action parameter and returns a TimeStep namedtuple with fields

step_type, reward, discount, observation

A sequence consists of a series of TimeSteps returned by consecutive calls to step(). In many settings we refer to each sequence as an episode. Each sequence starts with a step_type of FIRST, ends with a step_type of LAST, and has a step_type of MID for all intermediate TimeSteps.
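To make the field layout concrete, here is a minimal sketch that uses only the standard library. The StepType and TimeStep classes below are stand-ins that mirror the fields and step types described above; they are not dm_env's own source. The helper is_valid_sequence is a hypothetical name introduced for illustration, and it assumes a sequence has at least two TimeSteps (one FIRST and one LAST).

```python
import enum
from typing import Any, NamedTuple, Optional, Sequence


class StepType(enum.IntEnum):
  """Stand-in for the three step types described above."""
  FIRST = 0
  MID = 1
  LAST = 2


class TimeStep(NamedTuple):
  """Stand-in with the four fields, in the order listed above."""
  step_type: StepType
  reward: Optional[float]
  discount: Optional[float]
  observation: Any


def is_valid_sequence(timesteps: Sequence[TimeStep]) -> bool:
  """Checks the FIRST, MID, ..., MID, LAST pattern (hypothetical helper)."""
  if len(timesteps) < 2:
    return False
  if timesteps[0].step_type != StepType.FIRST:
    return False
  if timesteps[-1].step_type != StepType.LAST:
    return False
  return all(ts.step_type == StepType.MID for ts in timesteps[1:-1])


# One short sequence: the FIRST step carries no reward or discount.
sequence = [
    TimeStep(StepType.FIRST, None, None, "obs0"),
    TimeStep(StepType.MID, 1.0, 0.9, "obs1"),
    TimeStep(StepType.LAST, 0.5, 0.0, "obs2"),
]
```

Here is_valid_sequence(sequence) holds, while dropping the final TimeStep leaves a sequence that does not end in LAST.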

As well as step(), each environment implements a reset() method. This takes no arguments, forces the start of a new sequence and returns the first TimeStep. See the run loop samples below for more details.

Calling step() on a new environment instance, or immediately after a TimeStep with a step_type of LAST, is equivalent to calling reset(). In other words, the action argument will be ignored and a new sequence will begin, starting with a step_type of FIRST.
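The automatic restart can be sketched with a toy environment. CountdownEnv, its length parameter and the stand-in StepType and TimeStep classes are all hypothetical names used for illustration; a real environment would follow the same rule in its own way.

```python
import enum
from typing import NamedTuple, Optional


class StepType(enum.IntEnum):
  FIRST = 0
  MID = 1
  LAST = 2


class TimeStep(NamedTuple):
  step_type: StepType
  reward: Optional[float]
  discount: Optional[float]
  observation: int


class CountdownEnv:
  """Toy environment: a sequence ends `length` steps after its FIRST step."""

  def __init__(self, length: int = 2):
    self._length = length
    self._remaining = 0
    self._needs_reset = True  # True for a new instance and after LAST.

  def reset(self) -> TimeStep:
    self._remaining = self._length
    self._needs_reset = False
    return TimeStep(StepType.FIRST, None, None, self._remaining)

  def step(self, action) -> TimeStep:
    if self._needs_reset:
      # The action is ignored; this call behaves exactly like reset().
      return self.reset()
    self._remaining -= 1
    if self._remaining == 0:
      self._needs_reset = True
      return TimeStep(StepType.LAST, 1.0, 0.0, self._remaining)
    return TimeStep(StepType.MID, 1.0, 1.0, self._remaining)


# step() is called four times without ever calling reset() explicitly.
env = CountdownEnv(length=2)
step_types = [env.step(action=0).step_type for _ in range(4)]
```

The four calls yield FIRST, MID, LAST, FIRST: the first and last calls ignore their action and start a new sequence.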

NOTE: The discount does not determine when a sequence ends. The discount may be 0 in the middle of a sequence and ≥0 at the end of a sequence.

Example sequences

We show two examples of sequences below, along with the first TimeStep of the next sequence.

Each row corresponds to the tuple returned by an environment's step() method. We use r, ɣ and obs to denote the reward, discount and observation respectively, x to denote a None or optional value at a timestep, and ✓ to denote a value that exists at a timestep.

Example: A sequence where the end of the prediction (the discounted sum of future rewards that we wish to predict) coincides with the end of the sequence, i.e. the sequence ends with a discount of 0. Such a sequence could represent a single episode of a finite-horizon RL task.

(r, ɣ, obs)  | (x, x, ✓) →  (✓, ✓, ✓)  →  (✓, 0, ✓) ⇢ (x, x, ✓)
step_type    |   FIRST         MID           LAST       FIRST

Example: Here the prediction does not terminate at the end of the sequence, which ends with a nonzero discount. This type of termination is sometimes used in infinite-horizon RL settings.

(r, ɣ, obs)  | (x, x, ✓) →  (✓, ✓, ✓)  →  (✓, > 0, ✓) ⇢ (x, x, ✓)
step_type    |   FIRST         MID           LAST         FIRST

In general, a discount of 0 does not need to coincide with the end of a sequence. An environment may return ɣ = 0 in the middle of a sequence, and may do this multiple times within a sequence. We do not (typically) call these sub-sequences episodes.
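One way to see the effect of a mid-sequence discount of 0 is to compute the prediction target directly. The sketch below assumes the common recursive definition G_t = r_{t+1} + ɣ_{t+1} G_{t+1}. The function name discounted_returns, the bootstrap_value parameter and the numbers are illustrative assumptions, not part of the API.

```python
from typing import List, Sequence, Tuple


def discounted_returns(
    steps: Sequence[Tuple[float, float]],
    bootstrap_value: float = 0.0,
) -> List[float]:
  """Returns G_t for each (reward, discount) pair, computed backwards.

  steps[t] holds the reward and discount of the TimeStep that follows
  step t, so G_t = reward + discount * G_{t+1}. `bootstrap_value` is the
  value assumed beyond the final step; it only matters if the final
  discount is nonzero.
  """
  returns = []
  next_return = bootstrap_value
  for reward, discount in reversed(steps):
    next_return = reward + discount * next_return
    returns.append(next_return)
  returns.reverse()
  return returns


# Rewards and discounts of the MID and LAST steps of one sequence.
# The second step has a discount of 0 although the sequence continues.
steps = [(1.0, 0.5), (2.0, 0.0), (4.0, 0.5), (8.0, 0.0)]
returns = discounted_returns(steps)
```

With these numbers the returns are [2.0, 2.0, 8.0, 8.0]. The first two returns are unaffected by the later rewards of 4 and 8, because the zero discount on the second step ends the prediction there even though the sequence itself continues.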

The step_type can potentially be used by an agent. For instance, some agents may reset their short-term memory when step_type is LAST, but not when the step_type is MID, even if the discount is 0. This is up to the creator of the agent, but it does mean that the aforementioned two ways to model a termination of the prediction do not necessarily correspond to the same agent behaviour.
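As an illustration of this distinction, the sketch below shows a hypothetical MemoryAgent that clears its short-term memory only when it sees a LAST step. The agent class and the stand-in StepType and TimeStep definitions are assumptions made for this example; how an agent uses step_type is entirely up to its author.

```python
import enum
from typing import Any, List, NamedTuple, Optional


class StepType(enum.IntEnum):
  FIRST = 0
  MID = 1
  LAST = 2


class TimeStep(NamedTuple):
  step_type: StepType
  reward: Optional[float]
  discount: Optional[float]
  observation: Any


class MemoryAgent:
  """Hypothetical agent that remembers the observations of a sequence."""

  def __init__(self):
    self.memory: List[Any] = []

  def step(self, timestep: TimeStep) -> int:
    self.memory.append(timestep.observation)
    if timestep.step_type == StepType.LAST:
      # End of sequence: forget everything. A zero discount on a MID
      # step does not trigger this branch.
      self.memory.clear()
    return 0  # Dummy action.


agent = MemoryAgent()
agent.step(TimeStep(StepType.FIRST, None, None, "a"))
agent.step(TimeStep(StepType.MID, 1.0, 0.0, "b"))   # Discount 0, mid-sequence.
memory_after_zero_discount = list(agent.memory)
agent.step(TimeStep(StepType.LAST, 1.0, 0.0, "c"))  # Discount 0, end of sequence.
memory_after_last = list(agent.memory)
```

Both the MID and the LAST step above carry a discount of 0, yet only the LAST step empties the agent's memory.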

Run loop samples

Here we show some sample run loops for using an environment with an agent class that implements a step(timestep) method.

NOTE: Environments do not make any assumptions about the structure of algorithmic code or agent classes. These examples are illustrative only.

Continuing

We may call step() repeatedly.

timestep = env.reset()
while True:
  action = agent.step(timestep)
  timestep = env.step(action)

NOTE: An environment will ignore the action passed to step() immediately after a LAST step, and will return the FIRST step of a new sequence. An agent or algorithm may use the step_type, for example to decide when to reset short-term memory.

Set number of sequences

We can choose to run a specific number of sequences. Here we use the syntactic sugar method .last() to check whether we are at the end of a sequence.

for _ in range(num_sequences):

  timestep = env.reset()
  while True:
    action = agent.step(timestep)
    timestep = env.step(action)
    if timestep.last():
      # Let the agent observe the final TimeStep; its action is discarded.
      _ = agent.step(timestep)
      break

A TimeStep also has .first() and .mid() methods.

Manual truncation

We can truncate a sequence manually at some step_limit.

import dm_env

step_limit = 100
for _ in range(num_sequences):

  timestep = env.reset()

  step_counter = 1
  while True:
    action = agent.step(timestep)
    timestep = env.step(action)
    if step_counter == step_limit:
      # Overwrite the step type so the agent sees the end of the sequence.
      timestep = timestep._replace(step_type=dm_env.StepType.LAST)

    if timestep.last():
      _ = agent.step(timestep)
      break

    step_counter += 1

In this example we've accessed the step_type element directly.

The format of observations and actions

Environments should return observations and accept actions in the form of NumPy arrays.

An environment may return observations made up of multiple arrays, for example a list where the first item is an array containing an RGB image and the second item is an array containing velocities. The arrays may also be values in a dict, or any other structure made up of basic Python containers. Note: A single array is a perfectly valid format.

Similarly, actions may be specified as multiple arrays, for example control signals for distinct parts of a simulated robot.

Each environment also implements an observation_spec() and an action_spec() method. Each method should return a structure of Array specs, where the structure should correspond exactly to the format of the actions/observations.
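The requirement that specs and values share one structure can be sketched as a parallel walk over both. To stay dependency-free, the example below uses nested Python lists as stand-ins for NumPy arrays and a hypothetical ArraySpec dataclass as a stand-in for the Array spec; the helper names shape_of and matches are also assumptions made for illustration.

```python
import dataclasses
from typing import Any, Tuple


@dataclasses.dataclass(frozen=True)
class ArraySpec:
  """Stand-in spec; nested lists stand in for NumPy arrays here."""
  shape: Tuple[int, ...]
  dtype: str
  name: str


def shape_of(value: Any) -> Tuple[int, ...]:
  """Shape of a rectangular nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
  shape = []
  while isinstance(value, list):
    shape.append(len(value))
    value = value[0] if value else None
  return tuple(shape)


def matches(spec: Any, value: Any) -> bool:
  """True if `value` has the same structure and leaf shapes as `spec`."""
  if isinstance(spec, ArraySpec):
    return shape_of(value) == spec.shape
  if isinstance(spec, dict):
    return (isinstance(value, dict) and spec.keys() == value.keys()
            and all(matches(spec[k], value[k]) for k in spec))
  if isinstance(spec, (list, tuple)):
    return (type(spec) is type(value) and len(spec) == len(value)
            and all(matches(s, v) for s, v in zip(spec, value)))
  return False


# A dict of specs and an observation with exactly the same layout.
observation_spec = {
    "image": ArraySpec(shape=(2, 2, 3), dtype="uint8", name="image"),
    "velocity": ArraySpec(shape=(2,), dtype="float32", name="velocity"),
}
observation = {
    "image": [[[0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0]]],
    "velocity": [0.0, 0.0],
}
```

Here matches(observation_spec, observation) holds. An observation that omits the velocity entry, or whose image has a different shape, would not match.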

Each Array spec should define the dtype, shape and, where possible, the bounds and name of the corresponding action or observation array.

Note: Actions should almost always specify bounds, e.g. they should use the BoundedArray spec subclass.
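To show why bounds are useful, here is a minimal sketch of a bounded spec for a one-dimensional action. BoundedSpec is a hypothetical stand-in written with the standard library and plain Python sequences; it is not the BoundedArray class itself, and its validate and clip methods are assumptions made for this example.

```python
import dataclasses
from typing import List, Sequence, Tuple


@dataclasses.dataclass(frozen=True)
class BoundedSpec:
  """Stand-in for a bounded spec over a 1-D action."""
  shape: Tuple[int, ...]
  minimum: float
  maximum: float
  name: str

  def validate(self, action: Sequence[float]) -> Sequence[float]:
    """Returns `action` unchanged, or raises ValueError if it is invalid."""
    if (len(action),) != self.shape:
      raise ValueError(f"{self.name}: expected shape {self.shape}")
    if any(not self.minimum <= a <= self.maximum for a in action):
      raise ValueError(
          f"{self.name}: values must lie in [{self.minimum}, {self.maximum}]")
    return action

  def clip(self, action: Sequence[float]) -> List[float]:
    """Clamps each element of `action` into [minimum, maximum]."""
    return [min(max(a, self.minimum), self.maximum) for a in action]


# Two control signals, each limited to the range [-1, 1].
action_spec = BoundedSpec(shape=(2,), minimum=-1.0, maximum=1.0, name="torque")
clipped = action_spec.clip([2.0, -0.5])
```

With bounds in the spec, an agent can clamp or reject out-of-range actions before passing them to step(), rather than relying on the environment to handle them.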