src.ppo package#

Submodules#

src.ppo.agent module#

class src.ppo.agent.FCAgent(envs: SyncVectorEnv, environment_config: EnvironmentConfig, fc_model_config=None, device: device = device(type='cpu'), hidden_dim: int = 64)#

Bases: PPOAgent

actor: Sequential#
critic: Sequential#
learn(memory: Memory, args: OnlineTrainConfig, optimizer: Optimizer, scheduler: PPOScheduler, track: bool) None#

Performs the learning phase of the PPO algorithm, updating the agent’s parameters using the collected experience.

Parameters:
  • memory (Memory) – The replay buffer containing the collected experiences.

  • args (OnlineTrainConfig) – The configuration for the training.

  • optimizer (optim.Optimizer) – The optimizer to update the agent’s parameters.

  • scheduler (PPOScheduler) – The scheduler attached to the optimizer.

  • track (bool) – Whether to track the training progress.

rollout(memory: Memory, num_steps: int, envs: SyncVectorEnv, trajectory_writer=None, sampling_method='basic', **kwargs) None#

Performs the rollout phase of the PPO algorithm, collecting experience by interacting with the environment.

Parameters:
  • memory (Memory) – The replay buffer to store the experiences.

  • num_steps (int) – The number of steps to collect.

  • envs (gym.vector.SyncVectorEnv) – The vectorized environment to interact with.

  • trajectory_writer (TrajectoryWriter, optional) – The writer to log the collected trajectories. Defaults to None.
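
Taken together, rollout() and learn() form the outer PPO loop. The sketch below shows one plausible way to wire an FCAgent into that loop. It assumes config objects (environment_config, online_config) constructed elsewhere from the project's configuration dataclasses, and the field names used on online_config (total_timesteps, batch_size, num_steps) are assumptions rather than documented attributes.

    import torch
    import gym  # the project may use gym or gymnasium; adjust the import accordingly

    from src.ppo.agent import FCAgent
    from src.ppo.memory import Memory

    # environment_config and online_config are assumed to be built elsewhere
    # (EnvironmentConfig / OnlineTrainConfig instances from the project's config module).
    envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
    device = torch.device("cpu")

    agent = FCAgent(envs, environment_config, device=device, hidden_dim=64)
    memory = Memory(envs, online_config, device)

    # Assumed field names on online_config; substitute the real ones.
    num_updates = online_config.total_timesteps // online_config.batch_size
    optimizer, scheduler = agent.make_optimizer(num_updates=num_updates,
                                                initial_lr=2.5e-4, end_lr=0.0)

    for update in range(num_updates):
        agent.rollout(memory, online_config.num_steps, envs)                   # collect experience
        agent.learn(memory, online_config, optimizer, scheduler, track=False)  # update actor/critic
        memory.reset()                                                         # clear buffer for the next rollout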

class src.ppo.agent.LSTMPPOAgent(envs: SyncVectorEnv, environment_config: EnvironmentConfig, lstm_config: LSTMModelConfig, device: device)#

Bases: PPOAgent

actor: Module#
critic: Module#
learn(memory: Memory, args: OnlineTrainConfig, optimizer: Optimizer, scheduler: PPOScheduler, track: bool) None#
preprocess_obs(obs)#
rollout(memory: Memory, num_steps: int, envs: SyncVectorEnv, trajectory_writer=None, sampling_method='basic', **kwargs) None#
class src.ppo.agent.PPOAgent(envs: SyncVectorEnv, device)#

Bases: Module

actor: Module#
critic: Module#
layer_init(layer: Linear, std: float = 1.4142135623730951, bias_const: float = 0.0) Linear#

Initializes the weights of a linear layer with orthogonal initialization and the biases with a constant value.

Parameters:
  • layer (nn.Linear) – The linear layer to be initialized.

  • std (float, optional) – The standard deviation of the distribution used to initialize the weights. Defaults to np.sqrt(2).

  • bias_const (float, optional) – The constant value to initialize the biases with. Defaults to 0.0.

Returns:

The initialized linear layer.

Return type:

nn.Linear
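
For reference, a minimal sketch of an initializer with this behaviour (orthogonal weights scaled by std, constant bias); the actual implementation may differ in detail:

    import numpy as np
    import torch.nn as nn

    def layer_init(layer: nn.Linear, std: float = np.sqrt(2), bias_const: float = 0.0) -> nn.Linear:
        nn.init.orthogonal_(layer.weight, std)      # orthogonal weight matrix, scaled by std
        nn.init.constant_(layer.bias, bias_const)   # constant bias
        return layer

    # e.g. layer_init(nn.Linear(64, 1), std=1.0) for a value head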

abstract learn(memory, args, optimizer, scheduler) None#
make_optimizer(num_updates: int, initial_lr: float, end_lr: float) Tuple[Optimizer, PPOScheduler]#

Returns an Adam optimizer with a learning rate schedule for updating the agent’s parameters.

Parameters:
  • num_updates (int) – The total number of updates to be performed.

  • initial_lr (float) – The initial learning rate.

  • end_lr (float) – The final learning rate.

Returns:

A tuple containing the optimizer and its attached scheduler.

Return type:

Tuple[optim.Optimizer, PPOScheduler]

abstract rollout(memory, args, envs, trajectory_writer, **kwargs) None#
class src.ppo.agent.PPOScheduler(optimizer: Optimizer, initial_lr: float, end_lr: float, num_updates: int)#

Bases: object

step()#

Implement linear learning rate decay so that after num_updates calls to step, the learning rate is end_lr.
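
A minimal sketch of a scheduler with this behaviour, assuming it counts step() calls and writes the linearly interpolated rate into every parameter group (not necessarily the exact implementation used here):

    class LinearLRScheduler:
        """Linearly anneals the learning rate from initial_lr to end_lr over num_updates steps."""

        def __init__(self, optimizer, initial_lr: float, end_lr: float, num_updates: int):
            self.optimizer = optimizer
            self.initial_lr = initial_lr
            self.end_lr = end_lr
            self.num_updates = num_updates
            self.n_step_calls = 0

        def step(self):
            self.n_step_calls += 1
            frac = min(self.n_step_calls / self.num_updates, 1.0)
            lr = self.initial_lr + frac * (self.end_lr - self.initial_lr)
            for param_group in self.optimizer.param_groups:
                param_group["lr"] = lr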

class src.ppo.agent.TransformerPPOAgent(envs: SyncVectorEnv, environment_config: EnvironmentConfig, transformer_model_config: TransformerModelConfig, device: device = device(type='cpu'))#

Bases: PPOAgent

actor: Module#
critic: Module#
learn(memory: Memory, args: OnlineTrainConfig, optimizer: Optimizer, scheduler: PPOScheduler, track: bool) None#

Performs the learning phase of the PPO algorithm, updating the agent’s parameters using the collected experience.

Parameters:
  • memory (Memory) – The replay buffer containing the collected experiences.

  • args (OnlineTrainConfig) – The configuration for the training.

  • optimizer (optim.Optimizer) – The optimizer to update the agent’s parameters.

  • scheduler (PPOScheduler) – The scheduler attached to the optimizer.

  • track (bool) – Whether to track the training progress.

rollout(memory: Memory, num_steps: int, envs: SyncVectorEnv, trajectory_writer=None, sampling_method='basic', **kwargs) None#

Performs the rollout phase of the PPO algorithm, collecting experience by interacting with the environment.

Parameters:
  • memory (Memory) – The replay buffer to store the experiences.

  • num_steps (int) – The number of steps to collect.

  • envs (gym.vector.SyncVectorEnv) – The vectorized environment to interact with.

  • trajectory_writer (TrajectoryWriter, optional) – The writer to log the collected trajectories. Defaults to None.

training: bool#
src.ppo.agent.get_agent(model_config: dataclass, envs: SyncVectorEnv, environment_config: EnvironmentConfig, online_config) PPOAgent#

Returns an agent based on the given configuration.

Args:
  - model_config: The configuration for the model.
  - envs: The environment to train on.
  - environment_config: The configuration for the environment.
  - online_config: The configuration for online training.

Returns:
  - A PPOAgent matching the given configuration.

src.ppo.agent.load_all_agents_from_checkpoints(checkpoint_folder_path, num_envs=10)#


src.ppo.agent.load_saved_checkpoint(path, num_envs=10) PPOAgent#
src.ppo.agent.sample_from_agents(agents, rollout_length=2000, trajectory_path=None, num_envs=1, sampling_method='basic')#
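
A possible usage pattern for these helpers; the checkpoint folder path below is illustrative:

    from src.ppo.agent import load_all_agents_from_checkpoints, sample_from_agents

    # Load every checkpointed agent found in the (illustrative) folder.
    agents = load_all_agents_from_checkpoints("checkpoints/my_run", num_envs=10)

    # Roll each agent out; pass a path via trajectory_path to save the trajectories.
    sample_from_agents(
        agents,
        rollout_length=2000,
        trajectory_path=None,
        num_envs=1,
        sampling_method="basic",
    )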

src.ppo.compute_adv_vectorized module#

src.ppo.compute_adv_vectorized.compute_advantages_vectorized(next_value: Tensor, next_done: Tensor, rewards: Tensor, values: Tensor, dones: Tensor, device: device, gamma: float, gae_lambda: float) Tensor#

The compute_advantages_vectorized function computes the Generalized Advantage Estimation (GAE) advantages for a batch of environments in a vectorized manner.

Parameters:
  • next_value (torch.Tensor) – The predicted value of the next state for each environment in the batch, of shape (num_envs,).

  • next_done (torch.Tensor) – Whether the next state is done or not for each environment in the batch, of shape (num_envs,).

  • rewards (torch.Tensor) – The rewards received for each timestep and environment, of shape (timesteps, num_envs).

  • values (torch.Tensor) – The predicted state value for each timestep and environment, of shape (timesteps, num_envs).

  • dones (torch.Tensor) – Whether the state is done or not for each timestep and environment, of shape (timesteps, num_envs).

  • device (torch.device) – The device on which to perform computations.

  • gamma (float) – The discount factor to use.

  • gae_lambda (float) – The GAE lambda value to use.

Returns:

The computed GAE advantages for each timestep and environment, of shape (timesteps, num_envs).

Return type:

advantages (torch.Tensor)
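
For reference, the quantity being computed is the standard GAE recursion: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_{t+1}) - V(s_t), with A_t = delta_t + gamma * gae_lambda * (1 - done_{t+1}) * A_{t+1}. A plain, non-vectorized sketch that the vectorized implementation should agree with:

    import torch

    def compute_advantages_reference(next_value, next_done, rewards, values, dones,
                                     device, gamma, gae_lambda):
        # Backward-loop GAE; written for clarity rather than speed.
        timesteps, num_envs = rewards.shape
        advantages = torch.zeros_like(rewards, device=device)
        last_gae = torch.zeros(num_envs, device=device)
        for t in reversed(range(timesteps)):
            if t == timesteps - 1:
                next_nonterminal = 1.0 - next_done.float()
                next_values = next_value
            else:
                next_nonterminal = 1.0 - dones[t + 1].float()
                next_values = values[t + 1]
            delta = rewards[t] + gamma * next_values * next_nonterminal - values[t]
            last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
            advantages[t] = last_gae
        return advantages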

src.ppo.compute_adv_vectorized.shift_rows(arr)#

Returns a 2D array where the i-th row is the input array shifted right by i positions and zero-filled on the left, i.e. row i contains the input values from index 0 to n-1-i, placed starting at column i. If the input array has more than one dimension, the later dimensions are treated as batch dimensions.

Args:
  - arr (np.ndarray): 1D array (optionally with additional batch dimensions) to be transformed into a 2D array.

Returns:
  - np.ndarray: A 2D array (plus any batch dimensions) built from shifted copies of the input.

Example

Given a 1D array like:

[1, 2, 3]

this function will return:

[[1, 2, 3], [0, 1, 2], [0, 0, 1]]
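
A sketch that reproduces the documented example (row i holds the input shifted right by i positions, zero-filled); the actual implementation is likely vectorized differently:

    import numpy as np

    def shift_rows_reference(arr: np.ndarray) -> np.ndarray:
        n = arr.shape[0]
        out = np.zeros((n, n) + arr.shape[1:], dtype=arr.dtype)
        for i in range(n):
            out[i, i:] = arr[: n - i]   # row i: arr shifted right by i, zero-padded on the left
        return out

    shift_rows_reference(np.array([1, 2, 3]))
    # -> [[1, 2, 3],
    #     [0, 1, 2],
    #     [0, 0, 1]]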

src.ppo.memory module#

class src.ppo.memory.Memory(envs: SyncVectorEnv, args: OnlineTrainConfig, device: device = device(type='cpu'))#

Bases: object

A memory buffer for storing experiences during the rollout phase.

add(*data: Tensor)#

Adds an experience to storage. Called during the rollout phase.

*data: A tuple containing the tensors of (obs, done, action, logprob, value, reward) for an agent.

add_vars_to_log(**kwargs)#

Add variables to storage, for eventual logging (if args.track=True).

compute_advantages(next_value: Tensor, next_done: Tensor, rewards: Tensor, values: Tensor, dones: Tensor, device: device, gamma: float, gae_lambda: float) Tensor#

Compute advantages using Generalized Advantage Estimation.

Generalized Advantage Estimation (GAE) is a technique used in Proximal Policy Optimization (PPO) to estimate the advantage function, which is the difference between the expected value of the cumulative reward and the estimated value of the current state.

Args:
  - next_value (Tensor): the estimated value of the next state.
  - next_done (Tensor): whether the next state is terminal.
  - rewards (Tensor): the rewards received from taking actions.
  - values (Tensor): the estimated values of the states.
  - dones (Tensor): whether the states are terminal.
  - device (torch.device): the device to store the tensors on.
  - gamma (float): the discount factor.
  - gae_lambda (float): the GAE lambda parameter.

Returns:
  - advantages (Tensor): the advantages of the states.

get_minibatch_indexes(batch_size: int, minibatch_size: int, recurrence: Optional[int] = None) List[ndarray]#

Return a list of length (batch_size // minibatch_size) where each element is an array of indexes into the batch.

If recurrence is not None, the elements returned in each minibatch are a list of starting indices separated by the recurrence value.

Each index should appear exactly once.

get_minibatches(recurrence: Optional[int] = None, indexes: Optional[List[array]] = None) List[Minibatch]#
Return a list of Minibatch objects of length (batch_size // minibatch_size), where each minibatch contains the experiences selected by one set of minibatch indexes.

Args:
  - recurrence (int): the number of steps to take between each minibatch.
  - indexes (List[np.array]): the indexes to use for the minibatches.

Returns:
  - List[Minibatch]: a list of minibatches.
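
For example, given a Memory instance memory that has completed a rollout, a learning phase might iterate over the returned minibatches roughly as follows (illustrative; the actual loss computation lives in the agents' learn() methods):

    for mb in memory.get_minibatches():
        # Each Minibatch holds aligned tensors with the same leading (batch) dimension.
        assert mb.obs.shape[0] == mb.actions.shape[0] == mb.advantages.shape[0]
        # mb.obs, mb.actions, mb.logprobs, mb.advantages, mb.values and mb.returns
        # are the inputs to the PPO clipped-surrogate and value losses.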

get_printable_output() str#

Returns a new progress-bar description if any episodes have terminated; otherwise, the existing description is left unchanged.

get_trajectory_minibatches(timesteps: int, prob_go_from_end: float = 0.1) List[TrajectoryMinibatch]#

Return a list of trajectory minibatches, where each minibatch contains experiences from a single trajectory.

Args:
  - timesteps (int): the number of timesteps to include in each minibatch.
  - prob_go_from_end (float, optional): the probability of sampling a minibatch that runs up to the end of a trajectory. Defaults to 0.1.

Returns:
  - List[TrajectoryMinibatch]: a list of minibatches.

log() None#

Logs variables to wandb.

reset() None#

Function to be called at the end of each rollout period, to make space for new experiences to be generated.

sample_experiences()#

Prints out a randomly selected experience as a sanity check.

Each experience consists of a tuple containing:
  - obs: observations of the environment
  - done: whether the episode has terminated
  - action: the action taken by the agent
  - logprob: the log probability of taking that action
  - value: the estimated value of the current state
  - reward: the reward received from taking that action

The output will be a sample from the stored experiences, in the format:

Sample X/Y:
  obs    : […]
  done   : […]
  action : […]
  logprob: […]
  value  : […]
  reward : […]

class src.ppo.memory.Minibatch(obs: Tensor, actions: Tensor, logprobs: Tensor, advantages: Tensor, values: Tensor, returns: Tensor, recurrence_memory: Optional[Tensor] = None, mask: Optional[Tensor] = None)#

Bases: object

A dataclass containing the tensors of a minibatch of experiences.

actions: Tensor#
advantages: Tensor#
logprobs: Tensor#
mask: Optional[Tensor] = None#
obs: Tensor#
recurrence_memory: Optional[Tensor] = None#
returns: Tensor#
values: Tensor#
class src.ppo.memory.TrajectoryMinibatch(obs: Tensor, actions: Tensor, logprobs: Tensor, advantages: Tensor, values: Tensor, returns: Tensor, timesteps: Tensor, rewards: Tensor)#

Bases: object

A dataclass containing the tensors of a minibatch of experiences, including trajectory information leading up to each step.

actions: Tensor#
advantages: Tensor#
logprobs: Tensor#
obs: Tensor#
returns: Tensor#
rewards: Tensor#
timesteps: Tensor#
values: Tensor#
src.ppo.memory.process_memory_vars_to_log(memory_vars_to_log)#

src.ppo.my_probe_envs module#

class src.ppo.my_probe_envs.Probe1#

Bases: Env

One action, observation of [0.0], one timestep long, +1 reward. We expect the agent to rapidly learn that the value of the constant [0.0] observation is +1.0. Note we’re using a continuous observation space for consistency with CartPole.
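
A sketch of an environment with this behaviour (one discrete action, constant [0.0] observation, single step, reward +1); details such as the exact gym API version are assumptions:

    import numpy as np
    import gym
    from gym import spaces

    class Probe1Sketch(gym.Env):
        action_space = spaces.Discrete(1)
        observation_space = spaces.Box(low=np.array([0.0]), high=np.array([0.0]))

        def reset(self, seed=None, return_info=True, options=None):
            super().reset(seed=seed)
            obs = np.array([0.0], dtype=np.float32)
            return (obs, {}) if return_info else obs

        def step(self, action):
            obs = np.array([0.0], dtype=np.float32)
            return obs, 1.0, True, {}   # constant observation, +1 reward, immediate termination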

action_space: Discrete#
observation_space: Box#
reset(seed: Optional[int] = None, return_info=True, options=None) Union[ndarray, Tuple[ndarray, dict]]#

Resets the environment to an initial internal state, returning an initial observation and info.

This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.

Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.

For Custom environments, the first line of reset() should be super().reset(seed=seed) which implements the seeding correctly.

Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.

Parameters:
  • seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random). If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again. Please refer to the minimal example above to see this paradigm in action.

  • options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment)

Returns:
  • observation (ObsType) – Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
  • info (dict) – This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().

step(action: int) Tuple[ndarray, float, bool, dict]#

Run one timestep of the environment’s dynamics using the agent actions.

When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.

Changed in version 0.26: The Step API was changed removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated which is critical for reinforcement learning bootstrapping algorithms.

Parameters:

action (ActType) – an action provided by the agent to update the environment state.

Returns:
  • observation (ObsType) – An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
  • reward (SupportsFloat) – The reward as a result of taking the action, which can be positive or negative.
  • terminated (bool) – Whether the agent reaches the terminal state (as defined under the MDP of the task). An example is reaching the goal state or moving into the lava in the Sutton and Barto Gridworld. If true, the user needs to call reset().
  • truncated (bool) – Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a time limit, but it could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
  • info (dict) – Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
  • done (bool) – (Deprecated) A boolean value indicating whether the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of the terminated and truncated return values. A done signal may be emitted for different reasons: the task underlying the environment was solved successfully, a time limit was exceeded, or the physics simulation entered an invalid state.

class src.ppo.my_probe_envs.Probe2#

Bases: Env

One action, observation of [-1.0] or [+1.0], one timestep long, reward equals observation. We expect the agent to rapidly learn that the value of each observation is equal to the observation itself.

action_space: Discrete#
observation_space: Box#
reset(seed: Optional[int] = None, return_info=True, options=None) Union[ndarray, Tuple[ndarray, dict]]#

Resets the environment to an initial internal state, returning an initial observation and info.

This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.

Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.

For Custom environments, the first line of reset() should be super().reset(seed=seed) which implements the seeding correctly.

Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.

Parameters:
  • seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random). If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again. Please refer to the minimal example above to see this paradigm in action.

  • options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment)

Returns:
  • observation (ObsType) – Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
  • info (dict) – This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().

step(action: int) Tuple[ndarray, float, bool, dict]#

Run one timestep of the environment’s dynamics using the agent actions.

When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.

Changed in version 0.26: The Step API was changed removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated which is critical for reinforcement learning bootstrapping algorithms.

Parameters:

action (ActType) – an action provided by the agent to update the environment state.

Returns:
  • observation (ObsType) – An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
  • reward (SupportsFloat) – The reward as a result of taking the action, which can be positive or negative.
  • terminated (bool) – Whether the agent reaches the terminal state (as defined under the MDP of the task). An example is reaching the goal state or moving into the lava in the Sutton and Barto Gridworld. If true, the user needs to call reset().
  • truncated (bool) – Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a time limit, but it could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
  • info (dict) – Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
  • done (bool) – (Deprecated) A boolean value indicating whether the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of the terminated and truncated return values. A done signal may be emitted for different reasons: the task underlying the environment was solved successfully, a time limit was exceeded, or the physics simulation entered an invalid state.

class src.ppo.my_probe_envs.Probe3#

Bases: Env

One action, [0.0] then [1.0] observation, two timesteps, +1 reward at the end. We expect the agent to rapidly learn the discounted value of the initial observation.

action_space: Discrete#
observation_space: Box#
reset(seed: Optional[int] = None, return_info=True, options=None) Union[ndarray, Tuple[ndarray, dict]]#

Resets the environment to an initial internal state, returning an initial observation and info.

This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.

Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.

For Custom environments, the first line of reset() should be super().reset(seed=seed) which implements the seeding correctly.

Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.

Parameters:
  • seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random). If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again. Please refer to the minimal example above to see this paradigm in action.

  • options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment)

Returns:
  • observation (ObsType) – Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
  • info (dict) – This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().

step(action: int) Tuple[ndarray, float, bool, dict]#

Run one timestep of the environment’s dynamics using the agent actions.

When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.

Changed in version 0.26: The Step API was changed removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated which is critical for reinforcement learning bootstrapping algorithms.

Parameters:

action (ActType) – an action provided by the agent to update the environment state.

Returns:
  • observation (ObsType) – An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
  • reward (SupportsFloat) – The reward as a result of taking the action, which can be positive or negative.
  • terminated (bool) – Whether the agent reaches the terminal state (as defined under the MDP of the task). An example is reaching the goal state or moving into the lava in the Sutton and Barto Gridworld. If true, the user needs to call reset().
  • truncated (bool) – Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a time limit, but it could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
  • info (dict) – Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
  • done (bool) – (Deprecated) A boolean value indicating whether the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of the terminated and truncated return values. A done signal may be emitted for different reasons: the task underlying the environment was solved successfully, a time limit was exceeded, or the physics simulation entered an invalid state.

class src.ppo.my_probe_envs.Probe4#

Bases: Env

Two actions, [0.0] observation, one timestep, reward is -1.0 or +1.0 dependent on the action. We expect the agent to learn to choose the +1.0 action.

action_space: Discrete#
observation_space: Box#
reset(seed: Optional[int] = None, return_info=True, options=None) Union[ndarray, Tuple[ndarray, dict]]#

Resets the environment to an initial internal state, returning an initial observation and info.

This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.

Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.

For Custom environments, the first line of reset() should be super().reset(seed=seed) which implements the seeding correctly.

Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.

Parameters:
  • seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random). If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again. Please refer to the minimal example above to see this paradigm in action.

  • options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment)

Returns:
  • observation (ObsType) – Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
  • info (dict) – This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().

step(action: int) Tuple[ndarray, float, bool, dict]#

Run one timestep of the environment’s dynamics using the agent actions.

When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.

Changed in version 0.26: The Step API was changed removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated which is critical for reinforcement learning bootstrapping algorithms.

Parameters:

action (ActType) – an action provided by the agent to update the environment state.

Returns:
  • observation (ObsType) – An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
  • reward (SupportsFloat) – The reward as a result of taking the action, which can be positive or negative.
  • terminated (bool) – Whether the agent reaches the terminal state (as defined under the MDP of the task). An example is reaching the goal state or moving into the lava in the Sutton and Barto Gridworld. If true, the user needs to call reset().
  • truncated (bool) – Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a time limit, but it could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
  • info (dict) – Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
  • done (bool) – (Deprecated) A boolean value indicating whether the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of the terminated and truncated return values. A done signal may be emitted for different reasons: the task underlying the environment was solved successfully, a time limit was exceeded, or the physics simulation entered an invalid state.

class src.ppo.my_probe_envs.Probe5#

Bases: Env

Two actions, random 0/1 observation, one timestep, reward is 1 if action equals observation otherwise -1. We expect the agent to learn to match its action to the observation.

action_space: Discrete#
observation_space: Box#
reset(seed: Optional[int] = None, return_info=True, options=None) Union[ndarray, Tuple[ndarray, dict]]#

Resets the environment to an initial internal state, returning an initial observation and info.

This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.

Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.

For Custom environments, the first line of reset() should be super().reset(seed=seed) which implements the seeding correctly.

Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.

Parameters:
  • seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random). If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again. Please refer to the minimal example above to see this paradigm in action.

  • options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment)

Returns:
  • observation (ObsType) – Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
  • info (dict) – This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().

step(action: int) Tuple[ndarray, float, bool, dict]#

Run one timestep of the environment’s dynamics using the agent actions.

When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.

Changed in version 0.26: The Step API was changed removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated which is critical for reinforcement learning bootstrapping algorithms.

Parameters:

action (ActType) – an action provided by the agent to update the environment state.

Returns:
  • observation (ObsType) – An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
  • reward (SupportsFloat) – The reward as a result of taking the action, which can be positive or negative.
  • terminated (bool) – Whether the agent reaches the terminal state (as defined under the MDP of the task). An example is reaching the goal state or moving into the lava in the Sutton and Barto Gridworld. If true, the user needs to call reset().
  • truncated (bool) – Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a time limit, but it could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
  • info (dict) – Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
  • done (bool) – (Deprecated) A boolean value indicating whether the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of the terminated and truncated return values. A done signal may be emitted for different reasons: the task underlying the environment was solved successfully, a time limit was exceeded, or the physics simulation entered an invalid state.

class src.ppo.my_probe_envs.Probe6#

Bases: Env

Two actions, single float observation that increments by 1 every time step, reward is 1 if action is 1 otherwise 0. We expect the agent to learn to choose action 1 when the observation is odd and action 0 when it is even.

action_space: Discrete#
observation_space: Box#
reset(seed: Optional[int] = None, return_info=True, options=None) Union[ndarray, Tuple[ndarray, dict]]#

Resets the environment to an initial internal state, returning an initial observation and info.

This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.

Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.

For Custom environments, the first line of reset() should be super().reset(seed=seed) which implements the seeding correctly.

Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.

Parameters:
  • seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random). If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again. Please refer to the minimal example above to see this paradigm in action.

  • options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment)

Returns:
  • observation (ObsType) – Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
  • info (dict) – This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().

step(action: int) Tuple[ndarray, float, bool, dict]#

Run one timestep of the environment’s dynamics using the agent actions.

When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.

Changed in version 0.26: The Step API was changed removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated which is critical for reinforcement learning bootstrapping algorithms.

Parameters:

action (ActType) – an action provided by the agent to update the environment state.

Returns:
  • observation (ObsType) – An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
  • reward (SupportsFloat) – The reward as a result of taking the action, which can be positive or negative.
  • terminated (bool) – Whether the agent reaches the terminal state (as defined under the MDP of the task). An example is reaching the goal state or moving into the lava in the Sutton and Barto Gridworld. If true, the user needs to call reset().
  • truncated (bool) – Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a time limit, but it could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
  • info (dict) – Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
  • done (bool) – (Deprecated) A boolean value indicating whether the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of the terminated and truncated return values. A done signal may be emitted for different reasons: the task underlying the environment was solved successfully, a time limit was exceeded, or the physics simulation entered an invalid state.

class src.ppo.my_probe_envs.Probe7#

Bases: Env

4 timesteps. The observation at time 0 is sampled uniformly from 0 or 1. Reward is 0 at all timesteps except the 5th, when it is 1 if the action is equal to the observation given at the first timestep, and 0 otherwise.

action_space: Discrete#
observation_space: Box#
reset(seed: Optional[int] = None, return_info=True, options=None) Union[ndarray, Tuple[ndarray, dict]]#

Resets the environment to an initial internal state, returning an initial observation and info.

This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.

Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.

For Custom environments, the first line of reset() should be super().reset(seed=seed) which implements the seeding correctly.

Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.

Parameters:
  • seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random). If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again. Please refer to the minimal example above to see this paradigm in action.

  • options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment)

Returns:
  • observation (ObsType) – Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
  • info (dict) – This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().

step(action: int) Tuple[ndarray, float, bool, dict]#

Run one timestep of the environment’s dynamics using the agent actions.

When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.

Changed in version 0.26: The Step API was changed removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated which is critical for reinforcement learning bootstrapping algorithms.

Parameters:

action (ActType) – an action provided by the agent to update the environment state.

Returns:
  • observation (ObsType) – An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
  • reward (SupportsFloat) – The reward as a result of taking the action, which can be positive or negative.
  • terminated (bool) – Whether the agent reaches the terminal state (as defined under the MDP of the task). An example is reaching the goal state or moving into the lava in the Sutton and Barto Gridworld. If true, the user needs to call reset().
  • truncated (bool) – Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a time limit, but it could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
  • info (dict) – Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
  • done (bool) – (Deprecated) A boolean value indicating whether the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of the terminated and truncated return values. A done signal may be emitted for different reasons: the task underlying the environment was solved successfully, a time limit was exceeded, or the physics simulation entered an invalid state.

src.ppo.runner module#

src.ppo.runner.combine_args(run_config, environment_config, online_config, transformer_model_config=None)#
src.ppo.runner.ppo_runner(run_config: RunConfig, environment_config: EnvironmentConfig, online_config: OnlineTrainConfig, model_config: Optional[Union[TransformerModelConfig, LSTMModelConfig]])#

Executes Proximal Policy Optimization (PPO) training on a specified environment with provided hyperparameters.

Args:
  - run_config (RunConfig): An object containing general run configuration details.
  - environment_config (EnvironmentConfig): An object containing environment-specific configuration details.
  - online_config (OnlineTrainConfig): An object containing online training configuration details.
  - model_config (Optional[Union[TransformerModelConfig, LSTMModelConfig]]): An optional object containing either Transformer or LSTM model configuration details.

Returns: None.

src.ppo.train module#

src.ppo.train.check_and_upload_new_video(video_path, videos, step=None)#

Checks if new videos have been generated in the video path directory since the last check, and if so, uploads them to the current WandB run.

Args:
  - video_path: The path to the directory where the videos are being saved.
  - videos: A list of the names of the videos that have already been uploaded to WandB.
  - step: The current step in the training loop, used to associate the video with the correct timestep.

Returns:
  - A list of the names of all the videos currently present in the video path directory.

src.ppo.train.prepare_video_dir(video_path)#
src.ppo.train.train_ppo(run_config: RunConfig, online_config: OnlineTrainConfig, environment_config: EnvironmentConfig, model_config: Optional[Union[TransformerModelConfig, LSTMModelConfig]], envs: SyncVectorEnv, trajectory_writer=None) PPOAgent#

Trains a PPO agent on a given environment.

Args:
  - run_config (RunConfig): An object containing general run configuration details.
  - online_config (OnlineTrainConfig): An object containing online training configuration details.
  - environment_config (EnvironmentConfig): An object containing environment-specific configuration details.
  - model_config (Optional[Union[TransformerModelConfig, LSTMModelConfig]]): An optional object containing either Transformer or LSTM model configuration details.
  - envs (SyncVectorEnv): The environment in which to perform training.
  - trajectory_writer (optional): An optional object for writing trajectories to a file.

Returns:
  - agent (PPOAgent): The trained PPO agent.
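
A heavily abridged usage sketch; the config objects are assumed to be RunConfig, OnlineTrainConfig, and EnvironmentConfig instances constructed elsewhere from the project's configuration dataclasses (their fields are not documented on this page):

    import gym
    from src.ppo.train import train_ppo

    # run_config, online_config and environment_config are assumed to be built elsewhere.
    envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])

    agent = train_ppo(
        run_config=run_config,
        online_config=online_config,
        environment_config=environment_config,
        model_config=None,   # or a TransformerModelConfig / LSTMModelConfig instance
        envs=envs,
    )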

src.ppo.utils module#

src.ppo.utils.get_obs_preprocessor(obs_space)#
src.ppo.utils.get_obs_shape(single_observation_space) tuple#

Returns the shape of a single observation.

Parameters:

single_observation_space (gym.spaces.Box, gym.spaces.Discrete, gym.spaces.Dict) – The observation space of a single agent.

Returns:

The shape of a single observation.

Return type:

tuple

src.ppo.utils.parse_args()#
src.ppo.utils.preprocess_images(images, device=None)#
src.ppo.utils.set_global_seeds(seed)#

Sets random seeds in several different ways (to guarantee reproducibility)
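
A typical implementation seeds Python's random module, NumPy, and PyTorch; a sketch under that assumption:

    import random
    import numpy as np
    import torch

    def set_global_seeds(seed: int) -> None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.backends.cudnn.deterministic = True  # commonly added for stricter reproducibility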

src.ppo.utils.store_model_checkpoint(agent, online_config, run_config, checkpoint_num, checkpoint_artifact) int#

Module contents#