<h1>Social learning in goal navigation tasks</h1>
<p><em>2020-07-02</em></p>
<p>For the past few months I’ve been studying social learning in multi-agent reinforcement learning as part of the Spring 2020 OpenAI Scholars program. This is the last in a series of posts I’ve written during the program, and in this post I’ll discuss some experiments I’ve been conducting to study social learning by independent RL agents. The <a href="https://kam.al/blog/marl1">first post</a> in this series has lots of context about why I’m interested in multi-agent reinforcement learning.
Before continuing I wanted to express my tremendous gratitude to OpenAI for organizing the Scholars program, and to my mentor Natasha Jaques for her incredible support and encouragement.</p>
<p>I spent much of the time dedicated to this project working on open-source tools to facilitate MARL research. I developed Marlgrid, a multi-agent gridworld environment (<a href="https://github.com/kandouss/marlgrid">github</a>, <a href="https://kam.al/blog/marlgrid">post</a>), and a PPO-LSTM implementation (<a href="https://github.com/kandouss/kamarl/blob/master/kamarl/ppo.py">github</a>, <a href="https://kam.al/blog/ppo">post</a>) that I used to train agents. PPO is a powerful algorithm, and I recommend checking out the post I wrote about <a href="https://kam.al/blog/ppo_stale_states">hidden state refreshing</a> for details about how I got it working well with LSTMs for tasks that require memory (at desktop scale!).</p>
<h1 id="social-learning-in-goal-navigation-tasks">Social learning in goal navigation tasks</h1>
<p>Solitary humans are pretty useless, but they gain incredible capabilities through social interactions with other humans. By “capabilities” I don’t just mean personal capacity for thought, but rather all the ways an individual can interact with the world. For example, I can win a race with <em>any non-human animal</em> by tapping into a huge body of tools and cultural knowledge accumulated and accessed through social interaction with other humans.
This is sort of an unfair comparison because I get to take advantage of human technology while other animals don’t, but that’s just the point: over time, interactions between individual humans have given rise to capabilities so great as to make direct comparisons with other animals almost nonsensical.</p>
<h2 id="independent-multi-agent-reinforcement-learning">Independent multi-agent reinforcement learning</h2>
<p>Understanding the ways AI systems might exhibit or benefit from social learning seems very important given how central it is to human intelligence. It seems likely that humans are biologically predisposed towards interacting with other humans: infants show signs of cooperative social behavior like mimicry (Tomasello 2009) to a much greater degree than similar species do (Henrich 2015). For this reason, we might expect that AI systems would need to be endowed with inductive biases that encourage social learning in order to gain the capacity for skill-enhancing social learning.</p>
<p>Lots of work in multi-agent reinforcement learning is focused on building systems that perform well in multiplayer games (poker, Go, StarCraft, Hanabi) or on finding ways to train agents that can coordinate effectively with other agents (cooperative games, autonomous vehicles). Developing mechanisms to encourage social learning would fall in this category. The objective in these applications is to develop capable systems, and many researchers reasonably use algorithms where agents have some sort of built-in bias to encourage the desired behavior – centralized critics, shared/cooperative/shaped rewards, etc. I’d characterize these research programs as offensive, in the sports sense.</p>
<p>But I’m particularly interested in the defensive program: identifying the circumstances in which social behavior might emerge on its own, without explicit encouragement. For starters, understanding the interactions that might arise between ostensibly independent agents seems like an obvious prerequisite for safely deploying adaptive systems. As reinforcement learning algorithms become more capable and widely deployed, it will be important to understand the circumstances in which collective behavior might arise – for instance, when automated stockbrokers might acquire knowledge or skills from one another.</p>
<p>Consider two experiments that both show the emergence of some phenomenon in a group of RL agents. In the first experiment, agents are given a reward or inductive bias that facilitates this phenomenon. In the second, independent agents exhibit the phenomenon without such encouragement. The first experiment lets us conclude that the particular method/arrangement was sufficient for the phenomenon to emerge. If the phenomenon is “playing a strategy game without serious strategic weaknesses” or “successfully negotiating a traffic jam”, then this is really valuable. But the second experiment supports a much stronger claim: the phenomenon is a property of the environment or scenario, and with smart enough agents we should expect to see it in any similar scenario.</p>
<h2 id="learning-from-experts">Learning from experts</h2>
<p>The purpose of this project is to identify circumstances in which novice agents can acquire skills through social learning.</p>
<p>With the perspective described above, this project is building towards determining when we might expect capable adaptive agents to learn from one another due to the structure of their environment, with the hope of drawing conclusions that are agnostic to the underlying RL algorithms. Thus, I’ll focus on independent multi-agent reinforcement learning, a context in which agents aren’t explicitly given a reason to cooperate.</p>
<p><img src="/images/social_learning/learning_diagram.svg" style="width:400px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<p>This diagram illustrates the process of social learning: a novice learning in the presence of experts (solid line) is able to attain mastery of some skill. But if the novice is alone in the learning environment (dotted line), it is unable to acquire that skill and remains a novice. The human process of social learning fits this template. Humans are born with little innate skill. Individuals raised outside human society are unable to develop the same capabilities as those who can profit from social learning and cultural knowledge. This pattern would hold for a wide variety of skill metrics such as vocabulary size, top speed, or Twitter follower count.</p>
<h2 id="prior-work">Prior work</h2>
<p>In <em>Observational learning by reinforcement learning</em>, Borsa et al. (2019) demonstrated that there are circumstances in which RL agents can learn more quickly and achieve higher rewards in the presence of experts. They examine two-agent scenarios in which a novice is trained by reinforcement learning (with A3C) in the presence of an expert agent that is hard-coded to perform the same task perfectly.</p>
<div align="center" style="overflow:hidden;"><p>
<img src="/images/social_learning/borsa_map.png" style="width:400px; max-width:100%; vertical-align: inherit;" />
<video src="/images/social_learning/borsa_sample_video.mp4" id="avoidant agent" controls="" preload="" loop="" style="width:400px; max-width:100%;"></video>
</p></div>
<p>(<a href="https://www.youtube.com/watch?v=fBH-QGzGkYs&feature=emb_logo">video source</a>)</p>
<p>Among the scenarios they consider are exploratory navigation gridworld tasks where agents must locate and navigate to visually distinct target locations. As shown in the image and video above, both novices (green) and experts (blue) begin each episode in the top left portion of the map. The goal is placed randomly at one of the sixteen positions indicated by “G?”. Agents are rewarded for navigating to the purple goal tiles.</p>
<p>Borsa et al. found that in partially observed environments the presence of such hard-coded experts can ease learning for novices, but that the presence of experts didn’t improve the final skill of the trained novices.</p>
<h1 id="social-skill-acquisition">Social skill acquisition</h1>
<h2 id="learning-environments">Learning environments</h2>
<p>I have focused on exploratory navigation tasks in gridworld environments. With these tasks, expert agents are able to effectively locate and navigate to a certain objective. Such tasks are particularly promising for social learning because the skillful behavior required to achieve high rewards is visible to onlooking agents through the motion of experts.</p>
<p>Implementations of the environments I describe below are available in <a href="https://github.com/kandouss/marlgrid">Marlgrid</a>.</p>
<p>The “cluttered” environments (as shown in the video below) are exploratory navigation tasks that pose similar challenges to standard random maze tasks. The clutter in the environment can occlude agents’ views, and agents are unable to move through the clutter. Agents respawn randomly in the environment after reaching the goal, and continue doing so until the episode reaches a maximum duration.</p>
<video src="/images/social_learning/cluttered_example.mp4" id="cluttered_env_example" controls="" preload="" loop="" style="width:450px; max-width:100%; padding:2px; padding-bottom: 10px; display:block; margin-left:auto; margin-right:auto;"></video>
<p>Expert agents achieve high returns by quickly locating and navigating to the goal tile, and by storing appropriate information in their hidden states to hasten this process after each respawn (within a single episode). Novice agents accomplish this less quickly, and stand to achieve higher returns by using cues from the observed trajectories of expert agents to hasten their own search. But novices need to observe lots of expert behavior in order to learn from it, and novices in cluttered environments are somewhat unlikely to be close enough to experts to observe and learn from them.</p>
<p>I created the “goal cycle” environments to address this issue. In these environments, agents receive a reward of +1 for navigating between (typically three) goal tiles in a certain order, and receive a configurable penalty for making mistakes. The size of the penalty relative to the reward determines the difficulty of exploration, and when the penalty is large, goal cycle becomes a hard-exploration task.</p>
<video src="/images/social_learning/cool_goal_cycle_example.mp4" id="cluttered_env_example" controls="" preload="" loop="" style="width:450px; max-width:100%; padding:2px; padding-bottom: 10px; display:block; margin-left:auto; margin-right:auto;"></video>
<p>In the goal cycle environment, expert agents spend the first portion of each episode identifying the order in which to traverse the goals, and spend the rest of the episode collecting rewards by cycling between the goals in that order. Since novice agents very quickly learn to navigate to a first goal tile, encounters between novices and experts are much more likely than in the cluttered environment.</p>
<p>It is very difficult for novice agents to learn to identify the correct goal cycle when the penalty is large compared to the reward, since the penalty disincentivizes the sort of exploration that is necessary to discover the optimal strategy. Thus the size of the penalty provides a knob for tuning the difficulty of solving the task directly by trial and error, relative to the difficulty of learning to accrue rewards through social behavior like following other agents.</p>
<p>In order to observe the sort of social learning described <a href="#learning-from-experts">above</a>, we need some way of training experts in environments with large penalties. We can do this by initially training the agents in a forgiving low-penalty environment, then slowly increasing the penalty with a curriculum as shown here:
<img src="/images/social_learning/penalty_curriculum.png" style="width:400px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<p>To expedite gathering experience in these environments, I used the SubprocVecEnv wrapper published with the OpenAI RL <a href="https://github.com/openai/baselines">Baselines</a> (Dhariwal et al. 2017) to collect experience in 8-64 parallel environments. Parallelized in this way, I am able to collect about a billion transitions of experience per day on a desktop computer, with agents calculating policies using the network architecture described below.</p>
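<p>Here’s a rough sketch of that setup. The CartPole environment is just a stand-in for a Marlgrid environment, and <code>policy.act</code> is an assumed batched action-selection method:</p>
<pre><code class="language-python">import gym
from baselines.common.vec_env import SubprocVecEnv

def make_env(seed):
    def _thunk():
        env = gym.make("CartPole-v1")  # stand-in for a Marlgrid env
        env.seed(seed)
        return env
    return _thunk

venv = SubprocVecEnv([make_env(i) for i in range(32)])  # 8-64 workers
obs = venv.reset()                        # stacked obs, one row per worker
for _ in range(128):
    actions = policy.act(obs)             # assumed batched action selection
    obs, rewards, dones, infos = venv.step(actions)
</code></pre>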
<h2 id="algorithms">Algorithms</h2>
<p>I’ve been training agents with Proximal Policy Optimization (Schulman et al. 2017). The neural networks expressing the policy and value functions share input layers as well as an LSTM. The network architecture is shown below.</p>
<p><img src="/images/ppo/ppo_arch.png" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<p>In the Marlgrid goal cycle environments, observations consist of images representing partial views, and optionally include encoded scalar directional and positional information as well as the agent’s reward signal.
The image is fed through a series of convolutional layers, and the result is concatenated with the scalar inputs. This vector is fed through one or two fully connected trunk layers before the LSTM. The output of the LSTM is split and processed by separate MLP heads that output value estimates and policies.</p>
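<p>Here’s a minimal PyTorch sketch of that architecture; the layer sizes and kernel shapes are illustrative rather than the exact values from my implementation:</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class PPOLSTMNet(nn.Module):
    """Conv encoder -> concat scalars -> trunk MLP -> LSTM -> pi/V heads."""
    def __init__(self, n_scalar, n_actions, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        conv_out = self.conv(torch.zeros(1, 3, 27, 27)).shape[1]
        self.trunk = nn.Sequential(nn.Linear(conv_out + n_scalar, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden)
        self.pi_head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.v_head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img, scalars, hx):
        # img: [seq, batch, 3, 27, 27]; scalars: [seq, batch, n_scalar]
        t, b = img.shape[:2]
        z = self.conv(img.flatten(0, 1)).view(t, b, -1)   # encode each frame
        z = self.trunk(torch.cat([z, scalars], dim=-1))   # concat scalar inputs
        z, hx = self.lstm(z, hx)
        return self.pi_head(z), self.v_head(z), hx        # policy logits, values
</code></pre>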
<p>As I described in <a href="https://kam.al/blog/ppo_stale_states">another post</a>, I found that periodically refreshing the hidden states stored in the agent replay buffers<sup id="fnref:hiddenstates" role="doc-noteref"><a href="#fn:hiddenstates" class="footnote" rel="footnote">1</a></sup> between the mini-batch gradient steps within a single batch update is crucial for good performance in tasks like goal cycle that require heavy use of memory over extended trajectories.</p>
<p>I implemented PPO-LSTM with hidden state refreshing using PyTorch. So that the parameter updates and hidden state updates are efficient, I implemented a <a href="https://github.com/kandouss/kamarl/blob/1d40ee38b15e7f68c2837fabf454b80308eab829/kamarl/modules.py#L146">custom LSTM layer</a> that jit-compiles iteration over the items of the LSTM’s input sequence to expose an interface for tensors with dimensions</p>
\[([ n_{seq}\times n_{mb}\times n_{in}], [n_{seq}\times n_{mb}\times n_{h}]) \rightarrow [n_{seq}\times n_{mb}\times n_{h}],\]
<p>where $n_{seq}$ is the sequence length, $n_{mb}$ is the mini-batch size, and $n_{h}$ is the size of the LSTM hidden state<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup>. The PyTorch jit makes computing the full LSTM output for a full batch of episodes quite efficient. With 27x27 pixel input observations, a batch size of 32 and an LSTM hidden size of 256, it is possible to recalculate an entire batch worth of hidden states in a single forward pass. This takes well under 1GB of VRAM and is very fast, particularly since only a forward pass is needed and autograd can be disabled.</p>
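<p>Here’s a toy version of the idea: wrap an <code>nn.LSTMCell</code> in a module whose forward method loops over the sequence dimension, then compile the loop with <code>torch.jit.script</code>. (The real layer accepts per-step hidden states as in the signature above; this sketch just carries the state through the sequence.)</p>
<pre><code class="language-python">from typing import List, Tuple
import torch
import torch.nn as nn

class SeqLSTM(nn.Module):
    """Maps a [n_seq, n_mb, n_in] input (plus an initial state) to
    [n_seq, n_mb, n_h] outputs by iterating an LSTMCell over the sequence."""
    def __init__(self, n_in: int, n_h: int):
        super().__init__()
        self.cell = nn.LSTMCell(n_in, n_h)

    def forward(self, x: torch.Tensor, state: Tuple[torch.Tensor, torch.Tensor]):
        h, c = state
        outputs: List[torch.Tensor] = []
        for t in range(x.shape[0]):        # explicit loop over timesteps
            h, c = self.cell(x[t], (h, c))
            outputs.append(h)
        return torch.stack(outputs), (h, c)

lstm = torch.jit.script(SeqLSTM(64, 256))  # jit-compile the sequence loop
</code></pre>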
<h1 id="experiments">Experiments</h1>
<h2 id="cluttered">Cluttered</h2>
<p>In cluttered environments, novice agents did not learn more effectively in the presence of experts. The chart below shows the average episode returns for solitary novices and novices learning with two experts. Each curve is the average of five trials, with the highlighted regions showing two standard errors.</p>
<p><img src="/images/social_learning/cues_negative_result.png" style="width:500px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<h2 id="goal-cycle-hidden-goals">Goal cycle, hidden goals</h2>
<div align="center"> <p>
<video src="/images/social_learning/following.mp4" id="avoidant agent" controls="" preload="" loop="" style="width:450px; max-width100%; padding:2px;"></video>
<video src="/images/social_learning/following2.mp4" id="avoidant agent" controls="" preload="" loop="" style="width:450px; max-width:100%; padding:2px;"></video>
</p> </div>
<p>The videos above show the behavior of a novice agent learning in a goal cycle environment in the presence of experts. The two experts were trained together with a penalty curriculum. The novice agent’s partial views are shown below the two experts’ in the columns on the right. The experts are able to see the goals as usual, but the goals are masked out of the novice’s partial view. In this scenario, the novice agent develops a very robust following behavior. Since the novice agent is only able to follow the experts, it is unable to attain the same level of expertise as the experts. This mirrors a similar finding in Borsa et al. (2019), though in this case the experts as well as the novices are trained with RL.</p>
<p><img src="/images/social_learning/late_bloomer.png" style="width:500px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<h1 id="discussion">Discussion</h1>
<p>The purpose of this project is to find circumstances in which social learning occurs between independent agents.
In scenarios conducive to social learning, novices can learn to accomplish a task more effectively in the presence of experts than they would alone. The motivating example illustrated <a href="#learning-from-experts">in the introduction</a> shows an extreme case in which novice agents in the presence of experts are able to attain high skill levels, but solitary novices are unable to (or extremely unlikely to).</p>
<p>When the exploration task is easy to solve directly (without incorporating information from other agents), the presence of experts confers no benefit to novices. This is the case in the cluttered environment, as shown <a href="#cluttered">above</a>. Conversely when the behavior of other agents provides helpful cues for solving tasks, novices can learn policies that make use of that information, as I observed in the goal cycle environment with hidden goals.</p>
<h2 id="next-steps">Next steps</h2>
<p>With three-goal goal cycle environments, it is difficult to construct a scenario in which solitary agents consistently fail to learn while novices robustly succeed at learning in the presence of experts. This is because novice agents (whether or not they are solitary) learn to avoid the penalty cost of exploration by avoiding goal tiles altogether (after receiving an initial reward), as shown below.</p>
<video src="/images/goal_cycle/avoidant_agent.mp4" id="cluttered_env_example" controls="" preload="" loop="" style="width:450px; max-width:100%; padding:2px; display:block; margin-left:auto; margin-right:auto;"></video>
<p>With more than three goals, the penalty can be set such that there are many non-expert strategies that still involve traversing goal tiles. I hypothesize that agents with such sub-optimal strategies would be more likely to learn to associate cues from experts with higher rewards, and thus might be more likely to learn social behavior. Discovering the correct cycle in environments with four or more goals is significantly harder, so training optimal experts poses a challenge. Still, I think this is a very promising direction.</p>
<p>In this post, I have used returns as a stand-in for skill. But agents that achieve high rewards by following experts might be following much simpler policies (i.e. “follow a blue agent”) than those that solve the task directly (i.e. “find one goal, then another; if the second gave a penalty, …”). One way to tighten up the analogy with human social skill acquisition would be to evaluate the transfer performance of trained novices in solitary environments.</p>
<h1 id="references">References</h1>
<p>Joseph Henrich (2015). The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter.</p>
<p>Michael Tomasello (2009). Why We Cooperate.</p>
<p>Diana Borsa et al. <a href="http://www.ifaamas.org/Proceedings/aamas2019/pdfs/p1117.pdf">Observational Learning by Reinforcement Learning</a>. AAMAS, Montreal, Canada, 2019.</p>
<p>John Schulman et al. <a href="https://arxiv.org/pdf/1707.06347">Proximal Policy Optimization Algorithms</a>. arXiv preprint arXiv:1707.06347, 2017.</p>
<p>Prafulla Dhariwal et al. <a href="https://github.com/openai/baselines">OpenAI Baselines</a>. GitHub repository, https://github.com/openai/baselines, 2017.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:hiddenstates" role="doc-endnote">
<p>To facilitate mini-batch gradient steps in the PPO updates, agents store the LSTM hidden states $h_t$ alongside the partially observed Markov decision process (POMDP) state-action-reward tuples $(s_t, a_t, r_t)$ recorded as they interact with the environment. The policy and value functions are updated synchronously with a single Adam optimizer with mini-batches consisting of 256 to 1024 randomly sampled trajectories of 8 consecutive time steps (for a total of 1024-8192 transitions per mini-batch). <a href="#fnref:hiddenstates" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>The LSTM hidden state actually consists of “hidden” and “cell” values and would conventionally be expressed as a tensor of shape $[2\times n_h]$. I glossed over this distinction in the discussion above for the sake of brevity. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

<h1>Stale hidden states in PPO-LSTM</h1>
<p><em>2020-06-25</em></p>
<p>I’ve been using Proximal Policy Optimization (PPO, Schulman et al. 2017) to train agents to accomplish gridworld tasks. The neural net architectures I’ve been using include LSTM layers – this gives the agents the capacity to remember details from earlier in an episode when choosing actions later in the episode. This capacity is particularly important in the partially observed environments that are ubiquitous in multi-agent reinforcement learning (MARL).</p>
<p>I’ve found PPO and LSTMs to be a potent combination, but getting it to work well has required lots of effort and attention to detail. In this post I’ll discuss hidden state refreshing, a feature of my implementation that I have found to be important for achieving good performance in partially observed environments with sparse rewards.</p>
<p>As a teaser, here’s a video of a solitary agent locating and navigating between four goals in a Marlgrid goal cycle environment. PPO-LSTM with hidden state refreshing has enabled me to train agents to accomplish this sort of challenging partially observed exploration task with just my desktop computer.</p>
<video src="/images/ppo_lstm/ppo_lstm_teaser.mp4" id="four_goal_cycling" controls="" preload="" loop="" style="height:300px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p>In on-policy deep reinforcement learning algorithms, agents alternate between collecting a batch of experience (storing it in a replay buffer) and updating their parameters based on that experience.
Parameter updates that cause big changes in agent behavior policies are often harmful to performance. Algorithms like Trust Region Policy Optimization (TRPO, Schulman et al. 2015) and PPO offer ways to update agent parameters while keeping the induced policies within a “trust region” of the pre-update policies.
In PPO (and particularly the PPO-Clip variant I’ve been working with), the policy network is trained with a surrogate objective that is maximized when the policy increases the likelihood of producing high-advantage actions <em>without drifting too far from the pre-update policy</em>.</p>
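<p>For reference, here’s a sketch of the standard PPO-Clip policy loss, negated so it can be minimized with gradient descent (details in my implementation may differ slightly):</p>
<pre><code class="language-python">import torch

def ppo_clip_loss(logp, logp_old, adv, clip_eps=0.2):
    """PPO-Clip surrogate: logp/logp_old are log-probs of the taken
    actions under the current and pre-update policies; adv are advantages."""
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(ratio * adv, clipped).mean()
</code></pre>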
<h2 id="implementation-details-are-important">Implementation details are important</h2>
<p>RL algorithms like PPO have lots of moving parts, and actually implementing them involves lots of small algorithmic design choices. Not all of these are considered core parts of the algorithm, but put together they can have a pretty big impact on performance. Engstrom et al. (2020) draw the striking conclusion that “much of PPO’s observed improvement in performance comes from seemingly small modifications to the core algorithm…” that had not been emphasized in published comparisons with other training methods.</p>
<h3 id="detail-early-stopping">Detail: early stopping</h3>
<p>My implementation of PPO is based on the <a href="https://github.com/openai/spinningup">Spinning Up PPO code</a>, from which it inherits lots of tweaks and good design choices. One of these is early stopping.</p>
<p>In PPO, each batch update consists of several mini-batch gradient updates. This helps with sample efficiency, and the clipped surrogate objective helps prevent the updated policy from straying too far from the trust region. But as an extra guarantee, the mini-batch iteration terminates if the expected KL divergence between the current policy and the policy used to collect the experience exceeds some threshold (more on this later).</p>
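<p>In code, the check looks something like this – the 1.5x margin follows the Spinning Up convention, and the threshold itself is a tunable hyperparameter:</p>
<pre><code class="language-python">def should_stop_early(logp_old, logp, target_kl=0.01):
    """Estimate the KL divergence between the pre-update and current
    policies from sampled log-prob tensors, and stop iterating over
    mini-batches once it exceeds the threshold."""
    approx_kl = (logp_old - logp).mean().item()
    return approx_kl > 1.5 * target_kl
</code></pre>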
<h3 id="detail-hidden-states">Detail: Hidden states</h3>
<p>Adding a recurrent neural network like an LSTM to the policy and/or value networks gives an agent the capacity to use memory, at the cost of significant implementation complexity. Much of this complexity arises from the handling of hidden states.</p>
<p>PPO updates involve (1) computing policies and values for trajectories in the replay buffer, (2) using these to calculate losses for the policy and value networks, and (3) updating the networks’ parameters with stochastic gradient descent (typically Adam) to minimize these losses.
SGD-style parameter updates typically work best when the loss is computed from uncorrelated samples rather than e.g. whole trajectories. In the specific context of on-policy RL, Andrychowicz et al. (2020) found that parameter updates that used multiple mini-batch gradient steps (with random transitions randomly assigned to mini-batches) were more effective than large-batch parameter updates that used the entire batch.</p>
<p>With architectures that include LSTMs, policies and values are functions of a hidden state as well as the observed state of the environment. Thus the loss for an arbitrary replay buffer transition depends on the hidden state associated with that transition. We cache hidden states alongside observed states/actions/rewards in the replay buffer to make sure we can compute losses efficiently.</p>
<h1 id="stale-values-in-the-ppo-replay-buffer">Stale values in the PPO replay buffer</h1>
<p>In <em>off-policy</em> RL, experience in the replay buffer can be re-used for a very large number of parameter updates. In the R2D2 paper, Kapturowski et al. (2019) showed that there are significant discrepancies between Q-values calculated with stale vs. fresh hidden states – “fresh” meaning recalculated with current (or recent) model parameters. The experience used for each parameter update in on-policy algorithms is collected with the most recent version of the policy and discarded after a single update, so the data used for each update (including saved hidden states) is less stale than with off-policy algorithms.</p>
<p>Even so, using fresh data for parameter updates can be important for on-policy reinforcement learning. In <em>What matters in on-policy reinforcement learning?</em>, Andrychowicz et al. (2020) argue that the advantage estimates used in on-policy reinforcement learning algorithms like PPO can become stale over the course of a single update. In typical implementations the advantages are computed using the value network only once per batch, but with each mini-batch iteration the stored advantages become less consistent with the current value network parameters. Andrychowicz et al. (2020) suggest mitigating this issue by recalculating advantages before each mini-batch iteration rather than before each batch update, and they show that this improves performance on their benchmarks.</p>
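<p>For concreteness, here’s a simplified sketch of recomputing advantages with GAE from freshly recalculated values; it ignores the episode-boundary bookkeeping a real implementation needs:</p>
<pre><code class="language-python">import torch

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.
    rewards, values: [T] tensors; last_value: bootstrap value for step T."""
    adv = torch.zeros_like(rewards)
    next_value = last_value
    running = torch.zeros(())
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

# "Refreshing" = re-running the value network over the buffer, then
# calling gae_advantages again before each mini-batch iteration.
</code></pre>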
<p>The argument for refreshing advantages extends to hidden states for architectures with RNNs: as the hidden states saved in the buffer become more stale, estimates computed from them (including advantage values) become less accurate.</p>
<p>Stale hidden states also potentially undermine the mechanisms used in PPO-Clip to maintain trust regions during updates. During each mini-batch policy update, current policies (calculated with the most recent network parameters) are compared to stored policies (those computed with the pre-update parameters) for loss clipping. If the “current” policies are computed using stale hidden states, they might falsely appear more similar to the stored policies. This would prevent loss clipping from blocking large policy changes.</p>
<p>There is a similar issue with early stopping: hidden states are needed to calculate the expected KL divergence between the pre-update and current policies when deciding whether to prematurely terminate an update step. If the hidden states are stale, then these estimates will be inaccurate, and early stopping will be less effective at keeping the policy within the trust region.</p>
<p>Fortunately, since the replay buffers used for on-policy reinforcement learning are typically quite small (especially compared to those used in off-policy RL), it’s not too costly to simply recalculate the stored hidden states every few mini-batches. And if we are already following the recommendation of Andrychowicz et al. (2020) to periodically refresh advantages, then recalculating hidden states has a low marginal cost. So, I had the idea to apply the stale-state refreshing technique of R2D2 to PPO, and conducted some experiments to see how much it helped.</p>
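<p>The refresh itself is just a forward pass over the buffer with autograd disabled. A rough sketch, with placeholder buffer and network interfaces rather than my exact code:</p>
<pre><code class="language-python">import torch

@torch.no_grad()
def refresh_hidden_states(net, buffer):
    """Replay each stored episode through the current network and
    overwrite the cached hidden states with fresh ones."""
    for episode in buffer.episodes:
        hx = net.initial_state(batch_size=1)
        for t, obs in enumerate(episode.observations):
            episode.hidden_states[t] = hx   # state entering step t
            _, _, hx = net.step(obs, hx)    # forward pass only; no autograd
</code></pre>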
<h1 id="experiments">Experiments</h1>
<p>I ran some experiments to test the impact of hidden state staleness on PPO performance in a couple different environments. In addition to policy performance, I computed the KL divergence of the pre- and post-update policies. In both cases I refreshed the hidden states before calculating the post-update policies for estimating the policy divergence.</p>
<p>The lines shown are means of 3 trials. The highlighted regions show +/-2 standard errors of the means.</p>
<h2 id="cluttered-env">Cluttered env</h2>
<div align="center"> <p>
<img src="/images/ppo_lstm/clutter_kl_comparison.png" style="width:450px; max-width100%; padding:2px;" />
<img src="/images/ppo_lstm/clutter_return_comparison.png" style="width:450px; max-width:100%; padding:2px;" />
</p> </div>
<p>Without hidden state refreshing, updates cause much larger policy divergences. But this doesn’t impact performance all that much; the reward curves are pretty similar. This task isn’t too memory intensive; an agent with only feed-forward networks would probably do fine with a strategy like “move toward the goal if it’s visible, otherwise randomly move/rotate”.</p>
<h2 id="goal-cycle-env">Goal cycle env</h2>
<div align="center"> <p>
<img src="/images/ppo_lstm/cycle_kl_comparison.png" style="width:450px; max-width100%; padding:2px;" />
<img src="/images/ppo_lstm/cycle_return_comparison.png" style="width:450px; max-width:100%; padding:2px;" />
</p> </div>
<p>Hidden state refreshing makes a huge difference for goal cycle performance! When the hidden states are allowed to get stale, the combination of the PPO-Clip objective function and early stopping fails to keep the policies from changing dramatically during the updates – note that the y axis range is about an order of magnitude larger in the goal cycle divergence plot than in the cluttered divergence plot.
Because this task is more memory intensive, refreshing the hidden state is critical to achieving good performance.</p>
<h1 id="references">References</h1>
<p>John Schulman et al. <a href="https://arxiv.org/pdf/1707.06347">Proximal Policy Optimization Algorithms</a>. arXiv preprint arXiv:1707.06347, 2017.</p>
<p>John Schulman et al. <a href="https://arxiv.org/pdf/1502.05477">Trust Region Policy Optimization</a>. arXiv preprint arXiv:1502.05477, 2015.</p>
<p>Joshua Achiam. <a href="https://github.com/openai/spinningup">Spinning Up in Deep Reinforcement Learning</a>. 2018.</p>
<p>Marcin Andrychowicz et al. <a href="https://arxiv.org/pdf/2006.05990">What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study</a>. arXiv preprint arXiv:2006.05990, 2020.</p>
<p>Logan Engstrom et al. <a href="https://arxiv.org/pdf/2005.12729">Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO</a>. arXiv preprint arXiv:2005.12729, 2020.</p>
<p>John Schulman et al. <a href="https://arxiv.org/pdf/1506.02438">High-Dimensional Continuous Control Using Generalized Advantage Estimation</a>. arXiv preprint arXiv:1506.02438, 2015.</p>
<p>Steven Kapturowski et al. Recurrent Experience Replay in Distributed Reinforcement Learning. ICLR 2019.</p>

<h1>Goal cycle environments</h1>
<p><em>2020-06-24</em></p>
<p>I’ve been thinking about ways to construct challenging gridworld scenarios in which the behavior of expert agents might provide cues that ease learning for novice agents. I’ve focused on tasks in which navigation is central, since an agent’s movement can always in principle be visible to other agents. Tasks like this often resemble random maze navigation: an agent spawns in an environment with a random layout and receives a reward after navigating to a certain (perhaps randomly placed) goal tile. The episode ends either after the agent reaches the goal or after a fixed time limit (with the agent respawning each time it reaches the goal).</p>
<p>In that sort of environment, the main skill exhibited by expert agents is efficiently searching a map (not running in to walls, not retracing steps, etc). Non-expert agents don’t stand to learn all that much by observing the experts.</p>
<p>I developed a class of environments called goal cycle that I’ve been using to study hard-exploration tasks in both single and multi-agent scenarios.</p>
<h1 id="goal-cycle-environments">Goal cycle environments</h1>
<p>The grid in the example below has size 13x13 and contains 3 goals and 11 pieces of clutter. The goals and clutter are placed randomly. Agents can step on and see through the goal tiles and other agents, but they cannot step on or see through wall or clutter tiles.</p>
<p><img src="/images/goal_cycle/empty_cluttered_grid.png" style="width:300px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<p>Each goal tile has an id between 0 and 2 (if there are 3 goals, otherwise between 0 and N-1). Agents are rewarded the first time they step on a goal in each episode. They are also rewarded any time they navigate from goal $x$ to goal $(x+1)\%N$. Agents are penalized for stepping on a goal tile out of order, and the size of this penalty is configurable.</p>
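<p>A simplified sketch of this reward rule (a stand-in for the actual Marlgrid logic, which also handles respawns and per-agent bookkeeping):</p>
<pre><code class="language-python">def goal_reward(goal_id, last_goal_id, n_goals, penalty):
    """Reward for stepping on goal `goal_id`, given the last goal visited."""
    if last_goal_id is None:
        return 1.0                               # first goal of the episode
    if goal_id == (last_goal_id + 1) % n_goals:
        return 1.0                               # correct next goal in the cycle
    return -penalty                              # stepped on a goal out of order
</code></pre>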
<p><img src="/images/goal_cycle/empty_grid_labeled_goals.png" style="width:300px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<p>While the goal identities are visually indistinguishable to the agents, the Marlgrid goal cycle environments allow agents to directly see the (scalar) rewards dispensed by the environments as part of their observations. I’ve also added support for a visual prestige mechanism that causes agents to change color in response to rewards or penalties (in their own egocentric views as well as those of other agents).</p>
<video src="/images/goal_cycle/good_single_policy.mp4" id="three goal converged" controls="" preload="" loop="" style="height:300px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p>The agent shown above was trained with <a href="https://kam.al/blog/ppo_stale_states">PPO-LSTM</a>. The whole map is shown on the left, but the agent only sees the partial view shown on the right. This agent was trained from images and didn’t receive any positional/directional/reward information directly from the environment. With prestige coloring, the agent starts off red and becomes somewhat bluer with each positive reward. Penalties (negative rewards) reset the agent’s color to red. In this episode, the agent traverses the goals in the order [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0] and happens not to make any mistakes.</p>
<h2 id="hard-exploration">Hard exploration</h2>
<p>When the penalty is large the reward signal becomes deceptive and goal cycle becomes a hard exploration problem (complete with sparse and delayed rewards). Some of the well-known environments that researchers have used to measure progress in this challenging problem domain include the Atari game Montezuma’s Revenge and the door/key and multi-room gridworld environments. The difficulty of exploration varies with the size of the penalty.</p>
<p>Agents incur penalties for exploring possible sequences, so novice agents tend to learn to avoid the goal tiles:</p>
<video src="/images/goal_cycle/avoidant_agent.mp4" id="avoidant agent" controls="" preload="" loop="" style="height:300px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p>When the penalty is large, an agent can obtain a maximal return in an episode by cycling between the three goals in the correct order. But the agent must first identify that order. Even optimal policies might incur penalties before discovering the correct order, since the goal tiles can’t be disambiguated without stepping on them. In a three-goal environment, this happens half the time. Identifying the proper order constitutes a hard exploration problem that must be overcome in each episode.</p>
<video src="/images/goal_cycle/cycle_mistake.mp4" id="three goal converged" controls="" preload="" loop="" style="height:300px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p>Environments with a higher penalty $p$ for traversing the goals out of order are more difficult to explore.</p>
<h1>Hyperparameter hell or: How I learned to stop worrying and love PPO</h1>
<p><em>2020-06-01</em></p>
<p>Multi-agent reinforcement learning (MARL) is pretty tricky. Beyond all the challenges of single-agent RL, interactions between learning agents introduce stochasticity and nonstationarity that can make tasks harder for each agent to master. At a high level, many interesting questions in MARL can be framed in ways that focus on the properties of the interactions between abstract agents and are ostensibly agnostic to the underlying RL algorithms, e.g. “In what circumstances do agents learn to communicate?”. For such questions, the structure of the environment/reward function/etc. is usually more important.</p>
<p>While state of the art deep RL algorithms may not be critical for multi-agent experiments, achieving the degree of stability and robustness necessary to empirically study such questions can be challenging.</p>
<p>The most time consuming part of creating a scaffolding for running MARL experiments has been implementing and tuning RL algorithms that can perform well in the stochastic and partially observable gridworld environments I care about.</p>
<p>I spent a few weeks(!) struggling to find hyperparameters and algorithmic improvements that would allow my <a href="https://kam.al/blog/drqn/">DRQN implementation</a> to perform well in a variety of Marlgrid environments. I got slight improvements by messing with the model architecture and optimizer, simplifying the environment, shaping the action space and implementing PER. Still, I’d begun to worry that the middling performance was due not to fixable shortcomings of the algorithm, but to crippling bugs with <a href="https://github.com/kandouss/marlgrid">Marlgrid</a> or serious issues with my intuition about the difficulty of the tasks I’d been trying to solve. This was frustrating because the RL performance I was fighting for is only of incidental importance to the questions I’d like to study.</p>
<p>Then I implemented PPO (Schulman et al. 2017). It immediately blew DRQN out of the water – with essentially no time spent on fussy tuning or optimization. I immediately saw huge increases in stability, robustness, and efficiency.</p>
<p>Tasks that DRQN could solve in hours, PPO could solve in minutes.
With DRQN I struggled to find hyperparameters that would work for both small and large cluttered gridworlds; the first set of hyperparameters I used for PPO worked for all the gridworld tasks I’d been looking at – as well as Atari Breakout!</p>
<video src="/images/ppo/ppo_breakout.mp4" id="empty-dqn-video" controls="" preload="" loop="" style="width:400px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p>This was extra satisfying because I didn’t use the tricks that are typically employed to ease learning for RL agents playing Atari games (frame skipping/stacking, action repeating, greyscale conversion, image downsizing). As a bonus: my PPO implementation reused the bits of my DRQN code of which I was most suspicious; good performance with PPO confirmed that these weren’t broken.</p>
<h1 id="ppo">PPO</h1>
<p>I implemented PPO (code <a href="https://github.com/kandouss/kamarl">here</a>) using the <a href="https://github.com/openai/spinningup">PyTorch Spinning Up code</a> as a reference. I highly recommend studying the <a href="https://spinningup.openai.com/en/latest/">Spinning Up docs</a> for more details about PPO and other modern deep RL algorithms.</p>
<p><img src="/images/ppo/ppo_arch.png" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<div style="display:block; margin-left:auto; margin-right:auto; font-size:90%; margin-bottom:20px; width:90%;">
PPO architecture. Observations $s_t$ can include both images and scalar/vector quantities. Images are fed through a few convolutional layers. The non-image portions of the input state and the output of the convolutions are concatenated then passed through a shallow MLP before getting fed into the LSTM. Separate MLPs compute the policy $\pi(s_t)$ and value $V(s_t)$ from the output of the LSTM.
</div>
<p>There are some substantive differences between my implementation and the spinning up reference. I include an LSTM to make sure the network can learn effective strategies in partially observable environments. As in my DRQN implementation, agents save hidden states while collecting experience. These saved hidden states are used during parameter updates.</p>
<p>In Spinning Up, the policy and value networks have separate weights and are separately updated with batch gradient descent (with Adam) – policy first, then value. In my implementation, the policy and value networks share most of their parameters; the heads that compute the policy $\pi(s_t)$ and value $V(s_t)$ of a state $s_t$ branch off after the LSTM. The parameters in the unified network are updated together (with minibatch Adam) to minimize the combined policy loss and value loss $(L_v + L_\pi)$.</p>
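<p>Schematically, a single mini-batch update looks something like this – names like <code>net</code>, <code>batch</code>, and <code>optimizer</code> are placeholders for my actual training loop:</p>
<pre><code class="language-python">import torch

logits, values, _ = net(batch.images, batch.scalars, batch.hidden)
logp = torch.distributions.Categorical(logits=logits).log_prob(batch.actions)

ratio = torch.exp(logp - batch.logp_old)            # PPO-Clip policy loss
clipped = torch.clamp(ratio, 0.8, 1.2) * batch.adv  # clip_eps = 0.2
L_pi = -torch.min(ratio * batch.adv, clipped).mean()
L_v = ((values.squeeze(-1) - batch.returns) ** 2).mean()

optimizer.zero_grad()
(L_v + L_pi).backward()      # one Adam step over the shared parameters
optimizer.step()
</code></pre>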
<h2 id="why-didnt-i-use-ppo-to-begin-with">Why didn’t I use PPO to begin with?</h2>
<p>On-policy algorithms like PPO only update the policy parameters using data collected with the most recent version of the policy, while off-policy algorithms like DRQN can learn from older data. This makes off-policy RL potentially more effective in circumstances where collecting environment interactions is costly. Since I have access to <1% of the compute used to collect experience for those spectacular PPO results, I figured DRQN might be a better choice. In retrospect there are a few issues with this reasoning. First: while I (still) find the argument for off-policy sample efficiency to be somewhat compelling, I probably shouldn’t have used it as evidence that DRQN specifically is more sample efficient than PPO specifically. PPO is one of the best on-policy RL algorithms; it would be fairer to compare its sample efficiency with a state of the art off-policy RL algorithm like TD3 or SAC.</p>
<p>More importantly: to the degree that I care about efficiency, I care about overall (wall/CPU time) efficiency rather than sample efficiency in particular. Sample efficiency is more important in domains where interacting with the environment is slow or costly (robotics, or learning from humans). But collecting experience in Marlgrid is fast (on my desktop, I can collect about a billion frames of experience in a day), so sample efficiency is not much of a concern.</p>
<p>From a bird’s eye view, I could have taken the impressive results with Dota 2/Rubik’s cube manipulation as evidence that PPO is a good choice. But these systems made use of far more computation resources than I have access to: OpenAI used 64 GPUs and 920 CPUs while collecting experience to train agents to manipulate Rubik’s cubes, and 512 GPUs and 51200 CPUs to train its team of Dota agents. I was hesitant to conclude that PPO would also be effective at my much smaller scale (1 CPU, 1-2 GPUs). But as it turns out, PPO has excellent downward scalability.</p>
<h1 id="tuning-drqn-vs-ppo">Tuning DRQN vs PPO</h1>
<p>I found that PPO easily outperformed DRQN in the tasks I’ve been studying. Another key advantage of PPO is that it was much less finicky. Performance with DRQN was inconsistent and unstable, and I spent lots of time adjusting parameters and adding features to mitigate these issues.</p>
<p>If a PPO agent is struggling to make headway on a task, there are a few go-to parameters that seem to dramatically increase its chance of success: learning rate and batch size. There’s a clear tradeoff between training speed (faster when learning rates are higher and batch sizes are lower) and the complexity of tasks that an agent can solve.</p>
<p>With DRQN it was tough to find hyperparameters that would allow stable training. And even for tasks that DRQN could “solve”, performance would sometimes drop after long periods of good behavior. I added a few tweaks that help with performance and stability:</p>
<ul>
<li>Target Q network
<ul>
<li>Very standard trick for avoiding value function overestimation</li>
</ul>
</li>
<li>Entropy-regularized Q function and Boltzmann exploration
<ul>
<li>These are standard features for contemporary DQN (i.e. default in tf-agents DQN)</li>
</ul>
</li>
<li>Hidden state refreshing (R2D2-style)</li>
  <li>Prioritized experience replay</li>
</ul>
<p>The best Q-learning-based RL algorithms (like the Rainbow algorithm from Hessel et al. at DeepMind) involve even more tweaks, such as N-step learning, dueling DQN, and distributional RL.</p>
<p>The parameter controlling entropy regularization was particularly important; there seemed to be a task-specific range of acceptable values. If it was too high or too low, learning would be less stable.</p>
<p>I also tried (to no avail)</p>
<ul>
<li>changing the optimizer and learning rate
<ul>
<li>RMSProp seems to work much better for DRQN, but Adam is better for PPO</li>
</ul>
</li>
<li>changing the network architecture
<ul>
<li>LSTM hidden size, number/configuration of conv layers</li>
</ul>
</li>
<li>dramatically increasing the size of the replay buffer</li>
</ul>
<h1 id="references">References</h1>
<p>John Schulman et al. <a href="https://arxiv.org/pdf/1707.06347">Proximal Policy Optimization Algorithms</a>. arXiv preprint arXiv:1707.06347, 2017.</p>
<p>Joshua Achiam. <a href="https://github.com/openai/spinningup">Spinning Up in Deep Reinforcement Learning</a>. 2018.</p>
<p>OpenAI et al. <a href="https://arxiv.org/abs/1910.07113">Solving Rubik’s Cube with a Robot Hand</a>. arXiv preprint arXiv:1910.07113, 2019.</p>
<p><a href="https://blog.openai.com/openai-five/">OpenAI Five</a>. Blog post, 2018.</p>
<p>Matteo Hessel et al. <a href="https://arxiv.org/pdf/1710.02298">Rainbow: Combining Improvements in Deep Reinforcement Learning</a>. arXiv preprint arXiv:1710.02298, 2017.</p>

<h1>Prioritized Experience Replay in DRQN</h1>
<p><em>2020-05-08</em></p>
<p>Q learning is a classic and well-studied reinforcement learning (RL) algorithm. Adding neural network Q-functions led to the milestone Deep Q-Network (DQN) algorithm that surpassed human performance on a suite of Atari games (Mnih et al. 2013). DQN is attractive due to its simplicity, but the most successful DQN-based algorithms tend to rely on many tweaks and improvements to achieve stability and good performance.</p>
<p>For instance, DeepMind’s 2017 Rainbow algorithm (Hessel et al. 2017) showed that combining double Q learning, prioritized experience replay (PER, Schaul et al. 2015), dueling Q-networks, multi-step learning, and distributional Q learning could outperform standard DQN and A3C, and typically exceed human performance on the benchmark suite of Atari games.</p>
<p>Since the multi-agent experiments I’m interested in don’t really require state-of-the-art performance, and since many of the excellent and feature rich off-the-shelf implementations of RL algorithms do not effectively support multi-agent training, I’ve been implementing the algorithms from scratch. In the interest of managing complexity, I’ve been adding bells and whistles only as necessary.</p>
<p>In <a href="https://kam.al/blog/drqn">a previous post</a> I discussed Deep Recurrent Q-Networks (DRQN), in which agents use recurrent neural networks (RNNs) as a sort of memory that gives them the capacity to learn strategies that explicitly account for past information. This is a key advantage in partially observed environments. As an example, DRQN agents in the exploratory navigation tasks I’ve been studying can learn to use the hidden state to avoid regions they’ve already visited. Learning in partially observed environments is key for MARL, so the scales tipped in favor of implementing DRQN.</p>
<p>I’ve also been using entropy-regularized Q learning with Boltzmann exploration following Haarnoja et al. (2017). I found this strategy for encouraging exploration to be much less finicky than epsilon-greedy exploration.</p>
<p>My DRQN implementation was usually working reasonably well for the environments I’m most interested in (partially observed gridworlds filled with clutter and randomly placed rewards), but training wasn’t quite as stable as I’d hoped. Notably, agent performance would sometimes degrade significantly after long periods of success. Reasoning that the unlearning I observed could be due to catastrophic forgetting once the model began training on only successful episodes, I decided to implement prioritized experience replay.</p>
<video src="/images/per/drqn_per_vid1.mp4" id="drqn_collapse" controls="" preload="" style="width:500px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p><img src="/images/per/motivating_failure.png" style="width:500px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<div style="display:block; margin-left:auto; margin-right:auto; font-size:90%; margin-bottom:20px; width:90%;">
The agent learns to navigate to the goal in about 5k episodes, then unlearns after about 10k more. After unlearning, the behavior is way less effective than a random policy. Instabilities like this motivated me to implement prioritized experience replay: the drop in performance happens after the initial unsuccessful episodes have been purged from the agent's replay buffer, so maybe the agent isn't learning from some key aspects of its experience?
</div>
<p>Prioritized Experience Replay (PER) is a key component of many recent off-policy RL algorithms like R2D2 (Kapturowski et al. 2019), and the ablations in the Rainbow paper suggest that PER is among the most important DQN extensions for achieving good performance. Thus I decided to add PER to my DRQN implementation. Sadly PER wasn’t a silver bullet for the particular instability shown above.</p>
<h1 id="prioritized-experience-replay">Prioritized Experience Replay</h1>
<h2 id="replay-buffers-and-td-errors">Replay buffers and TD errors</h2>
<p>In Q learning, agents collect experience following a policy specified by the state-value/Q-function. As the agent interacts with the environment, it periodically updates the parameters of its Q-function to minimize the temporal difference (TD) error of its state-value predictions. As a reminder, the TD error for a transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$ is</p>
\[\delta_{\theta} (s_t, a_t, r_t, s_{t+1}) = Q_{\theta}(s_{t}, a_{t}) - (r_{t} + \gamma \cdot \underset{a}{\text{max }} Q_{\tilde{\theta}}(s_{t+1}, a)),\]
<p>and it describes the difference between the agent’s estimate $Q(s_t, a_t)$ of the expected total future return after taking action $a_t$, and the reward $r_t$ actually observed for that transition plus a discounted bootstrap estimate of the value of the next state $s_{t+1}$.</p>
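<p>In code, this is only a few lines (a numpy sketch; the done-mask for terminal states is a standard implementation detail rather than something spelled out in the equation above):</p>
<pre><code class="language-python">import numpy as np

def td_errors(q_sa, rewards, q_next_max, dones, gamma=0.99):
    """TD errors for a batch of transitions.

    q_sa:       Q(s_t, a_t) from the online network
    q_next_max: max_a Q(s_{t+1}, a) from the target network
    dones:      1.0 if s_{t+1} is terminal (no bootstrapping), else 0.0
    """
    targets = rewards + gamma * (1.0 - dones) * q_next_max
    return q_sa - targets
</code></pre>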
<p>In tabular Q learning, agents update their Q-functions to reduce the TD error immediately after seeing each transition. Deep Q-Networks (DQN) are far more expressive than tabular Q-functions, but training them in typical deep learning style with stochastic gradient descent (SGD) only works well on batches of uncorrelated samples. Thus DQN agents use replay buffers to store large amounts of past experience. Parameter updates still occur between environment steps, but rather than learning from each transition as it is collected (as in tabular Q learning), agents repeat the following procedure:</p>
<ol>
<li>Uniformly sample a batch of transitions $D_{\text{batch}}$ from the replay buffer $D$,</li>
<li>Update the Q-function parameters $\theta$ to minimize the average TD error $\delta_{\theta}$ for $D_{\text{batch}}$ with one step of gradient descent:</li>
</ol>
\[\theta \leftarrow \theta - \alpha \cdot\underset{(s_t, a_t, r_t, s_{t+1}) \sim D_{\text{batch}}}{\mathop{\mathbb{E}}}
\Big[
\nabla_{\theta} \left\lVert \delta_{\theta} (s_t, a_t, r_t, s_{t+1}) \right\rVert
\Big] .\]
<p>Large replay buffers can be helpful because uniform samples are likely to be uncorrelated: sampled transitions probably come from different episodes, and plausibly include a wider variety of environment states. This helps stabilize training.
But larger replay buffers can also lead to slower learning: older experience will have been collected with older versions of an agent’s policy, and thus may not be very helpful for improving the current policy. And of course storing lots of experience requires lots of memory.</p>
<h2 id="prioritized-experience-replay-1">Prioritized Experience Replay</h2>
<p>The goal of PER is for agents to learn from the portions of past experience that give rise to the largest performance improvement (Schaul et al. 2015). In standard deep Q learning, agents sample transitions uniformly from their replay buffers; with PER, they instead preferentially sample transitions that have high TD error. I’ve been using proportional PER, where the probability $p_{t}$ that a sample $(s_t, a_t, r_t, s_{t+1})$ is included in a batch is proportional to the magnitude of its TD error: $p_{t} \propto \left| \delta_{\theta}(s_t, a_t, r_t, s_{t+1}) \right|$.</p>
<p>As an example of where this may be helpful, consider an environment with very sparse rewards. Since the agent sees rewards very rarely, the rewards from almost all transitions in the replay buffer will be zero, and the Q-function could pretty quickly converge to something like $Q^{\text{naive}}(s, a) = 0$. Then for a batch to be useful, it needs to contain a transition with a nonzero reward. These are exactly the transitions for which $Q^{\text{naive}}(s, a)$ has high TD error and would be prioritized by PER.</p>
<p>The most straightforward way to implement PER would be to calculate the TD error for every transition in the replay buffer prior to each gradient update, then to use these errors (or some function of them) as weights when sampling from the replay buffer to construct a batch. This would be quite costly.</p>
<p>PER avoids the cost of recalculating TD errors for the whole replay buffer before each parameter update by caching TD errors between parameter updates. This is helpful because individual transitions in the replay buffer are typically sampled for gradient updates many times before getting purged from the buffer. Each time a certain transition is used for a gradient update, the TD error for that transition is stored in the buffer to be used as a sampling priority in the future.</p>
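<p>A minimal proportional-PER buffer might look something like the following (a sketch with hypothetical names, not my actual implementation; a production version along the lines of Schaul et al. would use a sum-tree for O(log n) sampling and a priority exponent $\alpha$):</p>
<pre><code class="language-python">import numpy as np

class ProportionalReplay:
    """Minimal proportional PER sketch (O(n) sampling, no sum-tree)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.transitions = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New transitions get the current max priority so they're
        # sampled at least once before their TD error is known.
        max_prio = self.priorities.max() if self.transitions else 1.0
        if self.capacity > len(self.transitions):
            self.transitions.append(transition)
        else:
            self.transitions[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        prios = self.priorities[:len(self.transitions)]
        probs = prios / prios.sum()
        idx = np.random.choice(len(self.transitions), batch_size, p=probs)
        return idx, [self.transitions[i] for i in idx], probs[idx]

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Cache |TD error| as each transition's new sampling priority.
        self.priorities[idx] = np.abs(td_errors) + eps
</code></pre>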
<h3 id="importance-sampling">Importance sampling</h3>
<p>Since PER changes the sampling weights of the transitions in the replay buffer, the distribution of the transitions comprising $D_{\text{batch}}^{\text{PER}}$ differs from the distribution of transitions in the replay buffer overall. This means the parameter gradients computed from TD errors in $D_{\text{batch}}^{\text{PER}}$ are biased estimates of the “true” gradients, so repeated SGD updates might not yield parameters that minimize the average TD error for the whole replay buffer.</p>
<p>This bias can be corrected with a technique called importance sampling, using a weighted average of the sampled TD errors to compute the loss for each batch. The weights are set to cancel out variations in the probabilities that each transition would have been included in the batch in the first place:
$w_{i}^{\text{is}} \propto 1/p_{i}$ for the $i$-th sample in the batch<sup id="fnref:alpha_footnote" role="doc-noteref"><a href="#fn:alpha_footnote" class="footnote" rel="footnote">1</a></sup>.</p>
<p>With the importance sampling correction, the procedure for updating the Q-function parameters becomes</p>
<ol>
<li>Sample a batch of transitions $D^{\text{PER}}_{\text{batch}}$ from the replay buffer $D$, weighting each transition by its previously cached TD error.</li>
<li>
<p>Update the Q-function parameters $\theta$ to minimize the average weighted TD error $\delta_{\theta}$ for $D^{\text{PER}}_{\text{batch}}$ with one step of gradient descent:</p>
\[\theta \leftarrow \theta - \alpha \cdot \left(
\underset{(s_t, a_t, r_t, s_{t+1}, w^{\text{is}}_{t}) \sim D_{\text{batch}}^{\text{PER}}}{\mathop{\mathbb{E}}}
\Big[
\nabla_{\theta} \left( w^{\text{is}}_{t} \cdot \left\lVert \delta_{\theta} (s_t, a_t, r_t, s_{t+1}) \right\rVert \right)
\Big] \right)\]
</li>
<li>Update the TD errors stored in the replay buffer.</li>
</ol>
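<p>In code, the importance-sampling correction amounts to a per-sample weight on the loss (a PyTorch-flavored sketch; the function and argument names are my own, and normalizing the weights by their max follows Schaul et al.):</p>
<pre><code class="language-python">import torch

def per_weighted_loss(q_sa, targets, probs, buffer_size, beta=0.4):
    """Importance-sampling-weighted TD loss for a PER batch.

    probs: sampling probability of each transition in the batch.
    """
    weights = (buffer_size * probs) ** (-beta)
    weights = weights / weights.max()  # only ever scale gradients down
    td = q_sa - targets
    # The new |TD errors| are returned so they can be written back to
    # the replay buffer as updated priorities (step 3 above).
    return (weights * td.abs()).mean(), td.abs().detach()
</code></pre>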
<h3 id="handling-recurrence">Handling recurrence</h3>
<p>To train the recurrent Q-function with backpropagation through time, the batches of sampled experience consist of trajectories (of length $k \ll$ episode length) rather than individual transitions. In order to use PER for trajectory sampling, we need some way to aggregate the TD errors (which are defined for each individual transition) to obtain sequence sampling priorities. I did this by just taking the average TD error magnitude over each window, so the sampling probability for the trajectory beginning at $t$ is $p_t \propto \frac{1}{k} \sum_{i=t}^{t+k-1} \left| \delta_{\theta}(s_i, a_i, r_i, s_{i+1}) \right|$.</p>
<p>And to avoid computing this before constructing each batch, I cached these values in the replay buffers alongside the TD errors for each transition, and updated them any time the TD errors changed.</p>
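<p>The windowed averaging itself is cheap if done with a cumulative sum (a numpy sketch of just the aggregation step):</p>
<pre><code class="language-python">import numpy as np

def trajectory_priorities(td_errors, k):
    """Mean |TD error| over each length-k window: the sampling priority
    for the trajectory starting at each timestep."""
    abs_td = np.abs(td_errors)
    # Cumulative-sum trick: windowed means in O(n) rather than O(n*k).
    csum = np.concatenate([[0.0], np.cumsum(abs_td)])
    return (csum[k:] - csum[:-k]) / k
</code></pre>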
<h2 id="but-did-per-solve-the-unlearning-problem">But did PER solve the unlearning problem?</h2>
<p>I implemented PER over the course of a few days, after replicating the forgetting described above in a few different environment configurations and hyperparameter combinations. After adding PER, I found that the issue was largely gone! But PER’s contribution appeared much smaller when I ran experiments to measure it in a more controlled and rigorous manner.</p>
<p>As it happens, I got better at tuning the other hyperparameters while I was working on the PER code. The improvement to the forgetting issue that I noticed after implementing PER was due in large part to training with better values of the Q-function’s entropy regularization parameter. PER was helpful (particularly for speeding up training), but for the small partially observed gridworlds with which I’ve been experimenting, I didn’t find (environment, hyperparameter) configurations in which it was decisively important.</p>
<p><img src="/images/per/ep_return_comparison.png" style="width:500px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" />
<img src="/images/per/traj_return_comparison2.png" style="width:500px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<div style="display:block; margin-left:auto; margin-right:auto; font-size:90%; margin-bottom:20px; width:90%;">
The PER and non-PER agents (with identical hyperparameters) perform similarly in this comparison. The trajectories sampled for Q-function updates with PER have a noticeably different distribution than those sampled by the non-PER agent, as reflected by differences in the returns of those trajectories.
</div>
<h1 id="references">References</h1>
<p>Volodymyr Mnih et al. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602, 2013.</p>
<p>Matteo Hessel et al. Rainbow: Combining Improvements in Deep Reinforcement Learning. arXiv preprint arXiv:1710.02298, 2017.</p>
<p>Tom Schaul et al. Prioritized Experience Replay. arXiv preprint arXiv:1511.05952, 2015.</p>
<p>Tuomas Haarnoja et al. Reinforcement Learning with Deep Energy-Based Policies. arXiv preprint arXiv:1702.08165, 2017.</p>
<p>Steven Kapturowski et al. Recurrent experience replay in distributed reinforcement learning. ICLR 2019.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:alpha_footnote" role="doc-endnote">
<p>In Schaul et al. 2015, the strength of the importance sampling correction is controlled by a hyperparameter $\beta \in [0,1]$: $w_{i}^{\text{is}} \propto 1/p_{i}^{\beta}$. $\beta=0$ applies no correction, and $\beta=1$ fully compensates for the sampling bias. <a href="#fnref:alpha_footnote" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Kamal NdousseQ learning is a classic and well-studied reinforcement learning (RL) algorithm. Adding neural network Q-functions led to the milestone Deep Q-Network (DQN) algorithm that surpassed human performance on a suite of Atari games (Mnih et al. 2013). DQN is attractive due to its simplicity, but the DQN-based algorithms that are most successful tend to rely on many tweaks and improvements to achieve stability and good performance.DQN and DRQN in partially observable gridworlds2020-03-30T00:00:00-07:002020-03-30T00:00:00-07:00https://kandouss.github.io/blog/drqn<p>RL agents whose policies use only feedforward neural networks have a limited capacity to accomplish tasks in partially observable environments. For such tasks, an agent may need to account for past observations or previous actions to implement a successful strategy.</p>
<p>As I mentioned in <a href="https://kam.al/blog/marl2/#partial-observability">a previous post</a>, DQN agents struggle to accomplish simple navigation tasks in partially observed gridworld environments when they have no memory of past observations. Multi-agent environments are inherently partially observed; while agents can observe each other, they can’t directly observe the actions (or history of actions) taken by other agents. Knowing this action history makes it easier to predict the other agent’s next action and therefore the next state, leading to a big advantage for agents that have some form of memory.</p>
<p>One way to address the issue of partial visibility is to use policies that incorporate recurrent neural networks (RNNs). In this post I’ll focus on deep recurrent Q-networks (DRQN, Hausknecht et al. 2015) in single-agent environments. DRQN is very similar to DQN, though the procedure for training RNN-based Q-networks adds some complexity.</p>
<p>In particular, I’ll discuss</p>
<ul>
<li>differences between DQN and DRQN,</li>
<li>ways to manage the hidden state for recurrent Q-networks, and</li>
<li>empirical advantages of DRQN over DQN</li>
</ul>
<h1 id="dqn">DQN</h1>
<p>The key component of DQN is a neural network that estimates state-action values $Q(s_{t}, a_{t})$ – the expected future return from taking any possible action $a_{t}$ in a given state $s_{t}$. In DQN (and DRQN) we assume the action space is discrete, i.e. <span style="font-size:90%">$a^{0} = \text{move forward, } a^{1} =\text{rotate left, }$</span>etc., and for the examples in this post the “states” observed by the agents are images.</p>
<p><img src="/images/drqn/diagram1.png" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<div style="display:block; margin-left:auto; margin-right:auto; font-size:90%; margin-bottom:20px; width:90%;">
The typical DQN network architecture consists of a few convolutional layers followed by a few fully-connected layers. The input $s_t$ is a 3-channel RGB image of the agent's egocentric partial view of the environment. The $i$-th neuron in the network's output layer is the state-action value of the $i$-th possible action conditional on the input; i.e. $Q(s_{t}, a^{i})$.
</div>
<p>When presented with a new environment state $s$, the agent estimates the state-action values $Q(s, a’)$ of all possible actions $a’$, then selects the action with the highest value, i.e. $a = \pi(s) = \text{arg max}_{a’} Q(s, a’)$. During each episode the agent records the sequence of states/actions/rewards. Between action steps the agent uses value iteration to update the weights of its Q-network, minimizing</p>
\[\text{Loss}(\theta) \dot{=} \left\lVert Q^{\theta}(s_t,a_t) - \left( r_t + \gamma\ \underset{a}{\text{max}}\ Q^{\theta}(s_{t+1}, a) \right) \right\rVert .\]
<p>The agent uses samples of past transitions (stored in a replay buffer) to estimate the loss, and uses some variant of stochastic gradient descent (SGD) to minimize the loss with respect to the network parameters $\theta$.</p>
<div class="codelist">
<strong> DQN update </strong>
<ul>
<li> loss $\leftarrow$ 0 </li>
<li> Sample a batch of $N$ transitions from the agent's replay buffer </li>
<li> For each sampled transition $(s, a, r, s')$: </li>
<ul>
<li> $v \leftarrow Q^{\theta}(s, a)$ </li>
<li> $\hat{v} \leftarrow r + \gamma \cdot \text{max}_{a'} Q^{\tilde{\theta}} (s', a')$ </li>
<li> loss $\leftarrow$ loss $+ \left\lVert v - \hat{v} \right\rVert$ </li>
</ul>
<li> Update parameters $\theta$ to minimize loss. </li>
</ul></div>
<p>This procedure computes $\hat{v}$ with a target Q-network $Q^{\tilde{\theta}}$. The target Q is a snapshot of the regular Q-network, whose weights $\tilde{\theta}$ are periodically copied from the main Q-network. This stabilizes the bootstrapped targets and helps prevent overestimation of state-action values, a common issue in DQN.</p>
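<p>For reference, the periodic copy is a one-liner in PyTorch (a sketch; the stand-in network and the sync interval are placeholders, not my actual architecture):</p>
<pre><code class="language-python">import copy
import torch.nn as nn

q_network = nn.Linear(16, 4)         # stand-in for the real Q-network
target_q = copy.deepcopy(q_network)  # frozen snapshot used for targets

# ...then, every few hundred or thousand gradient updates:
target_q.load_state_dict(q_network.state_dict())
</code></pre>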
<h1 id="drqn">DRQN</h1>
<p>In DRQN some of the post-convolution layers are replaced by an RNN, typically a long short-term memory (LSTM) cell. RNNs are called <em>recurrent</em> because the output of the cell at one timestep is fed back into itself to compute the output at the next timestep. The “hidden state” $h_{t}$ is a form of memory that gives recurrent cells the capacity to store information between timesteps, and allows them to learn patterns that unfold over time. LSTMs use special gates to control the flow of information into and out of the hidden state, and to forget inconsequential hidden state information. For an excellent overview of RNNs and LSTMs, check out <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">this post</a> on Chris Olah’s blog.</p>
<p>The process by which DRQN agents select actions is the same as in DQN, but the agent uses information from the hidden state in addition to the observed state $s_{t}$ – so the agent needs to keep track of the hidden state over the course of each episode. Typically the hidden state $h_{0}$ is set to zero at the beginning of each episode.</p>
<p><img src="/images/drqn/diagram2.png" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<div style="display:block; margin-left:auto; margin-right:auto; font-size:90%; margin-bottom:20px; width:90%;">
DRQN network architecture. This diagram shows the network unrolled over time to emphasize that computing $Q$ values for state $s_{t}$ requires the hidden state $h_{t}$ generated by the RNN layer for the previous state $s_{t-1}$.
</div>
<p>The output of the network at time $t$ depends on the hidden state value $h_{t}$, so it’s instructive to write the Q-function expressed by the network (with weights $\theta$) as $Q^{\theta}(s_{t}, a_{t}, h_{t})$. The network’s RNN layer transforms the hidden state $h_{t} \rightarrow h_{t+1}$; for convenience, we can define a function $Z$ that maps observed states, actions, and hidden states to Q-values and new hidden states: $Z^{\theta}(s_{t}, a_{t}, h_{t}) \dot{=} (Q^{\theta}(s_{t}, a_{t}, h_{t}), h_{t+1})$.</p>
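<p>A recurrent Q-network along these lines is compact in PyTorch (a sketch of $Z^{\theta}$ with the convolutional encoder omitted; all of the sizes here are made up):</p>
<pre><code class="language-python">import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Z^theta: maps (encoded s_t, h_t) -> (Q-values, h_{t+1})."""

    def __init__(self, feat_dim=64, hidden_dim=128, n_actions=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, feats, hidden=None):
        # feats: (batch, seq_len, feat_dim); hidden: LSTM (h, c) tuple,
        # zero-initialized by nn.LSTM when hidden is None.
        out, next_hidden = self.lstm(feats, hidden)
        return self.q_head(out), next_hidden
</code></pre>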
<p>The DRQN gradient update is similar to the DQN gradient update:</p>
<div class="codelist">
<strong> DRQN update </strong>
<ul>
<li> loss $\leftarrow$ 0 </li>
<li> Sample $N$ replay sequences of length $T+1$ from the agent's replay buffer </li>
<li> For each sampled sequence $(s_{0 \ldots T}, a_{0 \ldots T}, r_{0 \ldots T})$: </li>
<ul>
<li> Initialize hidden state $h_0$ </li>
<li> For $\tau$ in $0 \ldots T-1$: </li>
<ul>
<li> $\left( x, h_{\tau+1} \right) \leftarrow Z^{\theta} (s_{\tau}, a_{\tau}, h_{\tau})$ </li>
<li> $v \leftarrow x$</li>
<li> $\hat{v} \leftarrow r_{\tau} + \gamma \cdot \text{max}_{a'} Q^{\tilde{\theta}} (s_{\tau+1} , a' , h_{\tau + 1})$ </li>
<li> loss $\leftarrow$ loss $+ \left\lVert v - \hat{v} \right\rVert$ </li>
</ul>
</ul>
<li> Update parameters $\theta$ to minimize loss. </li>
</ul></div>
<h2 id="keeping-track-of-hidden-states">Keeping track of hidden states</h2>
<p>The DRQN update procedure needs some way to <span style="font-size:90%"> $\text{initialize hidden state } h_{0}$ </span> for each trajectory sampled while updating the network. Updating the RNN parameters changes the way it interprets hidden states, so the hidden states originally used by the agent to compute its actions aren’t necessarily helpful for later updates.</p>
<p>The original DRQN paper suggests zero-initializing the hidden state at the beginning of each sampled trajectory, but points out that this limits the RNN’s capacity to learn patterns longer than the sampled sequence length $T$.</p>
<p>An alternative approach is to use $Z^{\theta}$ to calculate $h_{t}$ from scratch by evaluating the network for the whole sequence of observations $s_{0} \ldots s_{t-1}$ while keeping track of the hidden states. This can be quite costly since the sampled sequences can occur anywhere in an episode and the episodes can have many more than $T$ steps. For example, if we want to train on a 10-step sequence where $t=990…1000$, we’d need to run the RNN over the first 989 timesteps to get $h_{990}$.</p>
<p>As part of the R2D2 algorithm, Kapturowski et al. (2019) suggest storing hidden states in the replay buffer and periodically refreshing them. When it’s time to update the weights of the policy network, the initial hidden state for each sampled sequence is read directly from the replay buffer alongside the states/actions/rewards.
This nominally allows the agent to learn to keep useful information in its hidden state through an entire episode, but introduces the potential issue of hidden state staleness: the network parameters might be updated many times before the hidden state is refreshed. Still, Kapturowski et al. show that agents trained with stored/periodically refreshed hidden states outperform those that use zero-initialization in most of the partially observable tasks they consider. This is the strategy I use in my DRQN implementation for the comparisons that follow.</p>
<p>When the stored hidden states are not stale, refreshing them should only cause small changes to their values. To monitor hidden state staleness, I’ve been logging the cosine similarity between the old and new hidden state values during each refresh (averaged over every episode in the replay buffer).</p>
<p><img src="/images/drqn/drqn_hidden_update_similarity.png" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<div style="display:block; margin-left:auto; margin-right:auto; font-size:90%; margin-bottom:20px; width:90%;">
Average cosine similarity of previous hidden states and new hidden states calculated during replay buffer hidden state refreshes for the DRQN agent shown <a href="#cluttered-dqn-video">below</a>.
</div>
<p>In this example, the agent seems not to make much use of the hidden state until about step 200k. Updates to the LSTM between steps 200k and 400k seem to result in relatively volatile changes to the stored hidden states.</p>
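<p>For concreteness, a refresh-and-compare step might look roughly like this (a sketch reusing the RecurrentQNet module sketched above; the per-step loop is left unvectorized for clarity, and the names are my own):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def refresh_hidden_states(q_net, obs_feats, stored_hiddens):
    """Recompute one episode's hidden states and measure staleness.

    obs_feats:      (T, feat_dim) encoded observations for the episode
    stored_hiddens: (T, hidden_dim) hidden states saved in the buffer
    """
    new_hiddens, h = [], None  # zero-initialize at the episode start
    with torch.no_grad():  # refreshing shouldn't build a graph
        for feat in obs_feats:
            _, h = q_net(feat.view(1, 1, -1), h)
            new_hiddens.append(h[0].view(-1))  # h[0] is the LSTM h-state
    new_hiddens = torch.stack(new_hiddens)
    staleness = F.cosine_similarity(stored_hiddens, new_hiddens, dim=-1)
    return new_hiddens, staleness.mean()
</code></pre>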
<h1 id="dqn-v-drqn-in-empty-gridworlds">DQN v. DRQN in empty gridworlds</h1>
<h3 id="notescaveats">Notes/caveats</h3>
<ul>
<li>In all the environments shown below, agents receive a reward of +1 if they attain the goal state within a time limit, and 0 otherwise. Since the reward signal itself doesn’t distinguish between skilled agents who get to the goal quickly and unskilled agents who meander before stumbling into the goal, the agent skill comparisons that follow use episode length (averaged over 25 episodes, then smoothed) rather than episode reward.</li>
<li>Vanilla DQN-style $\epsilon$-greedy exploration was very finicky for the DRQN network; using a Boltzmann exploration policy and entropy regularization helped significantly. All of the results shown here (for both DQN and DRQN) incorporate these tricks.</li>
<li>It was harder to get DRQN working than DQN. The extra effort I put into tuning DRQN might unfairly advantage it in this comparison.</li>
<li>Thanks to the LSTM, the DRQN networks have more parameters than the DQN networks. Adding more or larger layers to the DQN’s post-convolution MLP didn’t seem to help very much, but I haven’t explored that with much rigor.</li>
<li>Batch sizes and update frequencies for the two variants were the same, but each element of the DRQN batches was a sequence (of length 20 steps) rather than an individual transition, so the DRQN updates included contributions from many more individual timesteps.</li>
<li>All the curves shown here are for a single training run (not averaged over multiple seeds).</li>
<li>The videos I’ve selected are representative of agent performance somewhat late in the training process.</li>
</ul>
<h2 id="empty-environment">Empty environment</h2>
<p>In the “empty” environment variants, agents spawn at a random location and need to navigate to the green goal tile at the bottom right of the grid. In these experiments the grid has size $8\times8$, and the episode time limit is 100 steps.</p>
<p>DRQN takes much longer than DQN to converge. The final performance (measured by episode length) of the DRQN agent is a bit better and more consistent, since it is able to systematically explore the environment and avoid revisiting positions seen earlier in a single episode.</p>
<h3 id="dqn-1">DQN</h3>
<p><!-- - run_03_26_15:11:05 --></p>
<video src="/images/drqn/dqn_empty_example.mp4" id="empty-dqn-video" controls="" preload="" loop="" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p><!-- - exalted-star-134 -->
<img src="/images/drqn/dqn_lengths_empty.png" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<h3 id="drqn-1">DRQN</h3>
<p><!-- - run_03_26_16:55:45 --></p>
<video src="/images/drqn/drqn_empty_example.mp4" id="empty-drqn-video" controls="" preload="" loop="" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p><!-- - toasty-frog-142 -->
<img src="/images/drqn/drqn_lengths_empty.png" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<h3 id="cluttered-environment">Cluttered environment</h3>
<p>Agents in the “cluttered” environments have the same goal as agents in the empty environment, but the environments are filled with static obstacles that are randomly placed each episode. Here the grids are $11\times11$ and have 15 pieces of clutter, and the episode time limit is 400 steps.</p>
<p>The DQN agents are unable to make much headway in this environment; they are only able to attain the goal if they spawn right next to it. The DRQN agents learn to explore the environment and more consistently navigate to the goal square.</p>
<h3 id="dqn-2">DQN</h3>
<p><!-- - run_03_26_15:31:11 --></p>
<video src="/images/drqn/dqn_cluttered_example.mp4" id="cluttered-dqn-video" controls="" preload="" loop="" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p><!-- - lyric-frost-141 -->
<img src="/images/drqn/dqn_lengths_cluttered.png" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<h3 id="drqn-2">DRQN</h3>
<p><!-- - run_03_26_22:25:46 --></p>
<video src="/images/drqn/drqn_cluttered_example.mp4" id="cluttered-drqn-video" controls="" preload="" loop="" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p><!-- - scarlet-sky-162 -->
<img src="/images/drqn/drqn_lengths_cluttered.png" style="width:600px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<h1 id="references">References</h1>
<p>Matthew Hausknecht et al. <a href="https://arxiv.org/pdf/1507.06527">Deep Recurrent Q-Learning for Partially Observable MDPs</a>. arXiv preprint arXiv:1507.06527, 2015.</p>
<p>Steven Kapturowski et al. <a href="https://openreview.net/pdf?id=r1lyTjAqYX">Recurrent experience replay in distributed reinforcement learning</a>. ICLR 2019.</p>
<p>Special thanks to Natasha Jaques for help with this post, and for help improving my DQN/DRQN implementations!</p>Kamal NdousseRL agents whose policies use only feedforward neural networks have a limited capacity to accomplish tasks in partially observable environments. For such tasks, an agent may need to account for past observations or previous actions to implement a successful strategy.Multi-agent gridworlds2020-03-02T00:00:00-08:002020-03-02T00:00:00-08:00https://kandouss.github.io/blog/marl2<p>Gridworlds are popular environments for RL experiments. Agents in gridworlds can move between adjacent tiles in a rectangular grid, and are typically trained to pursue rewards solving simple puzzles in the grid. <a href="https://github.com/maximecb/gym-minigrid">MiniGrid</a> is a popular and flexible gridworld implementation that has been used in more than 20 publications.</p>
<p><img src="/images/marl2/doorkey.png" style="display:block; margin-left:auto; margin-right:auto;" /></p>
<div style="display:block; margin-left:auto; margin-right:auto; font-size:90%; margin-bottom:20px; width:90%;">
Gridworld scenario based on the MiniGrid "DoorKey" environment. The position and orientation of the agent are shown by the red pointer, and the grey-highlighted cells comprise the agent’s field of view. Each time step, the agent can choose one of several possible actions: move forward, turn left, turn right, pick up, drop, and toggle. In this scenario, the agent must pick up the key, toggle the door, and navigate to the green square to receive a reward of +1.
</div>
<p>I’ve created a multi-agent variant of MiniGrid called <a href="https://github.com/kandouss/marlgrid">MarlGrid</a>. In the modified library, multiple agents exist in a shared environment: they can observe each other and aren’t allowed to collide. At each time step, the agents view distinct portions of the environment, act independently, and receive separate rewards. Scenarios can be a mix of competitive and collaborative, depending on the structure of the reward signals. MarlGrid could be valuable to other multi-agent researchers, so feedback and contributions are welcome!</p>
<p><img src="/images/marl2/multigrid_example_1.png" style="width:400px; max-width:100%; display:block; margin-left:auto; margin-right:auto;" /></p>
<div style="display:block; margin-left:auto; margin-right:auto; font-size:90%; margin-bottom:20px; width:90%;">
Basic multi-agent gridworld scenario featuring three interacting agents. The partial perspectives of each of the three agents are shown on the right.
</div>
<h1 id="dqn">DQN</h1>
<p>While working on the multi-agent environment, I’ve been using deep Q learning to train independent agents to accomplish the simple task described above. The agents’ behavior is guided by separate deep Q networks (DQN) that predict the best action to take based on their observations. This technique, sometimes called Independent Q Networks or IQN, can effectively train agents in very simple scenarios:</p>
<video src="/images/marl2/iqn_simple.webm" controls="" preload="" loop="" style="width:400px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p>But IQN has shortcomings that manifest in larger or more complex environments.</p>
<video src="/images/marl2/iqn_large_spinning.webm" id="spinning-video" controls="" preload="" loop="" style="width:400px; max-width:100%; display:block; margin-left:auto; margin-right:auto; margin-bottom:20px;"></video>
<p>Below, I’ll review DQN and explain <a href="#challenges-in-multi-agent-training">some of these issues</a>.</p>
<h2 id="deep-q-learning-review">Deep Q-learning review</h2>
<p>In Q learning, the behavior of the agent is determined by a Q function $Q(s, a)$ that estimates the value of a state $s$ conditional on taking a particular action $a$. The value of a state (at time $t$) is taken to be the expected $\gamma$-discounted sum of the rewards $r_{t^{‘} \geq t}$ the agent would subsequently receive. The procedure by which an agent chooses an action based on the state is known as the agent’s policy $\pi$. When the environment is in state $s_{t}$, the agent takes the action $a_{t}$ that maximizes $Q$:</p>
\[a_{t} = \pi(s_{t}) = \underset{a}{\text{arg}\,\text{max}}\ Q(s_{t}, a)
\label{eq:qpolicy}\]
<p>Classical Q learning assumes the set of possible states $S: s_{i} \in S$ and possible actions $A: a_{j} \in A$ are both finite. Then the Q function can be represented by a table where the value in cell $(i,j)$ is $Q(s_{i}, a_{j})$, i.e. the expected value of taking action $a_{j}$ in state $s_{i}$. A variant of this technique called deep Q learning also applies to continuous state spaces, using a neural network (DQN) approximation rather than a tabular representation of the state-action value function $Q$. DQN was developed by Mnih et al. (2013) and achieved impressive results on a suite of Atari games.</p>
<p>This value estimation problem is central to reinforcement learning: rather than predicting the immediate rewards the agent might receive by taking a greedy action now, the goal is to estimate all of the future rewards it can attain starting from this state. This is why reinforcement learning is sometimes referred to as sequential decision making.</p>
<p>In Q learning, experience collected by an agent (behaving per $\text{eq.}~\ref{eq:qpolicy}$) is used to update $Q$ with a method called value iteration, which seeks to minimize the loss:</p>
\[L = \left\lVert Q(s_t,a_t) - \left( r_t + \gamma\ \underset{a}{\text{max}}\ Q(s_{t+1}, a) \right) \right\rVert\]
<p>Iteratively minimizing this loss leads to continual improvements in the Q function when the environment in which the agent is situated can be described by a stochastic Markov decision process (MDP) (Jaakkola et al 1994). An MDP is characterized by $(S, A, T, R)$, where $S$ is the set of possible environment states, $A$ is the set of possible actions, $T:S \times A \rightarrow S$ is the transition function, and $R:S\times A \rightarrow \mathbb{R}$ is the reward function.
For an environment to be an MDP, $T$ and $R$ must be stationary: the reward $r_{t}$ and new state $s_{t+1}$ resulting from the agent taking an action $a_{t}$ in a state $s_{t}$ can’t depend on any states or actions from before $t$ (Markov assumption).</p>
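<p>To make the connection concrete: with a squared-error version of this loss, one gradient step on a single table entry reduces to the familiar tabular Q-learning update (a quick sketch; the learning rate value here is illustrative):</p>
<pre><code class="language-python">import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular value-iteration step; Q is an |S| x |A| array."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
</code></pre>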
<h2 id="challenges-in-multi-agent-training">Challenges in multi-agent training</h2>
<p>In IQN, value iteration is used to simultaneously and independently train multiple DQN agents. The environment overall can still be described as an MDP, but the states and rewards observed by the individual agents no longer obey the Markov assumption: individual agents don’t control all the actions that determine the transition of the global environment state, and may only have partial views of that state (as in the MarlGrid examples).</p>
<h3 id="non-stationarity">Non-stationarity</h3>
<p>In a $k$-agent gridworld, the transition function depends on the actions of all the agents: $ s_{t+1} = T(s_{t}, a_{t}) = T(s_{t}, a_{t}^{1}, a_{t}^{2}, …, a_{t}^{k})$. So from the perspective of an agent $i$, the transition depends not only on the state and its own action $a_{t}^{i} = \pi^{i}(s_{t})$, but also on the actions $\{ a_{t}^{j} = \pi^{j}(s_{t}), j \neq i \}$ of the other agents.</p>
<p>If the policies of the other agents were static, then their actions could be seen as aspects of the environment state invisible to the $i$-th agent, and this would boil down to an issue of <a href="#partial-observability">partial observability</a>. But since the policies change as the agents learn, $T$ is non-stationary from the perspective of any single agent.</p>
<h3 id="partial-observability">Partial observability</h3>
<p>Individual agents in the MarlGrid environments have limited fields of view. Even in relatively simple environments, this limits the sophistication of agent behavior.</p>
<div align="center">
<p>
<img src="/images/marl2/partial_obs_0.png" style="width:400px; max-width100%; padding:2px;" />
<img src="/images/marl2/partial_obs_1.png" style="width:400px; max-width:100%; padding:2px;" />
</p>
<div align="left" style="font-size:90%; margin-bottom:20px; width:90%;">
Left: the blue agent (bottom right) sees the goal square, and must rotate left to begin moving towards it.<br />
Right: Having rotated, the blue agent can no longer observe the goal. Since the agent's policy can only generate actions based on the current observation, the agent is unlikely to follow a sequence of actions that will lead it to the goal.
</div>
</div>
<p>As another example, the purple agent in the <a href="#spinning-video">second video above</a> ends up spending lots of time wandering aimlessly. Since the agent lacks memory, it is unable to develop a strategy that would help it systematically explore the environment.</p>
<p>This shortcoming of basic DQN implementations is typically mitigated by giving the agent some capacity to account for observations prior to $s_t$ when determining an action $a_t$. The Atari DQN collaboration (Mnih et al. 2013) accomplished this by giving agents direct access to some of the 16 previous frames. Hausknecht et al. (2015) addressed this using deep recurrent Q networks (DRQNs), which use RNNs to explicitly maintain state across multiple steps in the environment. I plan to take the latter approach: next week, I will incorporate LSTM cells into my DQN in order to give agents memory, improving their ability to handle partial observability and to model other agents’ behavior.</p>
<h1 id="references">References</h1>
<p>Volodymyr Mnih et al. <a href="https://arxiv.org/pdf/1312.5602">Playing Atari with Deep Reinforcement Learning</a>. arXiv preprint arXiv:1312.5602, 2013.</p>
<p>Tommi Jaakkola et al. <a href="https://papers.nips.cc/paper/764-convergence-of-stochastic-iterative-dynamic-programming-algorithms.pdf">On the convergence of stochastic iterative dynamic programming algorithms</a>. Neural Computation, 6 (6): 1185-1201, 1994.</p>
<p>Matthew Hausknecht et al. <a href="https://arxiv.org/pdf/1507.06527">Deep Recurrent Q-Learning for Partially Observable MDPs</a>. arXiv preprint arXiv:1507.06527, 2015.</p>Kamal NdousseGridworlds are popular environments for RL experiments. Agents in gridworlds can move between adjacent tiles in a rectangular grid, and are typically trained to pursue rewards solving simple puzzles in the grid. MiniGrid is a popular and flexible gridworld implementation that has been used in more than 20 publications.Why I’m excited about MARL2020-02-14T00:00:00-08:002020-02-14T00:00:00-08:00https://kandouss.github.io/blog/marl1<blockquote>
<p>I’m excited to be participating in the 2020 cohort of the OpenAI Scholars program. With the mentorship of Natasha Jaques, I’ll be spending the next few months studying multi-agent reinforcement learning (MARL) and periodically writing blog posts to document my progress. In this first post, I’ll discuss the reasons I’m excited about MARL and my plan for the Scholars program.</p>
</blockquote>
<h1 id="what-is-marl">What is MARL?</h1>
<p>Reinforcement learning (RL) is the subfield of machine learning concerned with optimizing the behavior of an agent interacting with an environment to maximize external rewards. RL is partially inspired by theories of animal learning from psychology, taking the term “reinforcement” from Pavlov’s 1927 work on classical conditioning (Sutton 2018).</p>
<p>Training agents to perform well at video games (and computerized board games) is a prototypical use-case for reinforcement learning, since the action space (controls, possible moves) and observation space (pixels on screen, state of the board) are well defined, and the score (points in the game, or victory/loss) provides a clear reward signal. Progress in RL has been marked by success in harder and harder games: backgammon (Tesauro 1995); various Atari games (Mnih et al. 2013); chess, shogi, and go (Silver et al. 2017); Dota 2 (Berner 2019); and Starcraft 2 (Vinyals et al. 2019).</p>
<p>Multi-agent reinforcement learning (MARL) is the extension of RL to scenarios with multiple interacting agents. MARL is naturally important for applications like self-driving cars, where agents can only succeed by accounting for the behavior of other agents (Reddy 2018). Research in MARL includes efforts to foster collaborative problem solving, to improve and understand inter-agent communication, and to characterize emergent social phenomena.</p>
<p>In general, RL agents learn by incrementally improving a strategy for attaining a high reward. For this learning to be effective, the reward signal must distinguish between beneficial and detrimental changes to the agent’s behavior. In competitive multi-agent environments like games, agents struggle to learn effective strategies against much stronger opponents (improvements to their behavior still lead to losses). More broadly, training an agent to complete a difficult task generally requires constructing either a sequence (curriculum) of tasks that allows the agent to incrementally build expertise, or a very informative reward signal.</p>
<p>Many successful applications of RL to competitive two-player games have generated an effective automatic curriculum with a technique called self-play. With self-play, agents are trained to compete against copies of themselves – so they always have an opponent of a comparable skill level. Agents are able to learn effectively because the reward signal reflects the differences between good and bad variations to their strategies, and this remains true as their behavior grows in sophistication. Agents trained by self-play can out-compete human experts in complex strategic games including Go (Silver 2017) and Dota 2 (Berner 2019).</p>
<p>However, recent research has revealed limitations of self-play, and showed that training large and diverse ensembles of agents can help individual agents achieve higher performance. For example, Wang et al. (2019) show that training with a population of agents – along with procedures for generating curricula of increasingly difficult tasks and transferring successful individual agents between tasks – can encourage agents to develop sophisticated and robust strategies to overcome challenges that may be otherwise intractable. Vinyals et al. (2019) address the shortcomings of self-play by training Starcraft agents (1v1) against a diverse population of opponents with varying skill levels and strategy types. This diversity forces agents to learn robust strategies that are not overfit to the weaknesses of particular opponents.</p>
<p>With more than two agents in a shared environment, the automatic curriculum of challenges induced by increasingly complex cooperative and competitive interactions (Leibo et al. 2019) can give rise to a rich variety of emergent phenomena including communication (Jaques et al. 2019) and tool use (Baker et al. 2020). However, MARL presents challenges that are not present in typical RL problems. Because other agents are constantly learning and adapting, the learning environment in MARL changes continually. RL algorithms with convergence guarantees in the single-agent setting do not necessarily converge or even stabilize when there are multiple agents interacting and learning simultaneously (Balduzzi et al. 2018, Mazumdar et al. 2019). I’ll discuss some of the ways standard RL agent architectures are modified for multi-agent scenarios in a future post.</p>
<h1 id="why-i-am-excited-about-marl">Why I am excited about MARL</h1>
<h2 id="human-intelligence-is-very-social">Human intelligence is very social</h2>
<p>The fact that I have some direct introspective access to the stuff in my head makes it easy to attribute my thoughts and actions to internal cognitive processes. This probably causes me to overestimate the importance of this internal stuff (compared to external influences) in shaping my thoughts and behavior. If I were presented with a novel task that I needed to work through “from scratch”, I would lean heavily on concepts refined over thousands of years of human cultural/social development.</p>
<p>Humans are really smart, even compared to closely related animals like chimps. In the first few million years after their last common ancestor, proto-humans developed a slew of distinct characteristics including larger brains and a notable aptitude for social learning (Henrich 2015).</p>
<p>But anatomically modern humans invented language just 50k years ago – recently enough that biological evolution hasn’t had the time to change our basic cognitive capabilities.
This suggests that the remarkable progress humanity has made since then has been due to social/cultural development and accumulation of knowledge rather than improvements to the human brain.</p>
<p>The environment inhabited by modern humans – in the RL sense – is mainly made up of entities that arose through these processes of socio-cultural evolution (companies, countries). I know how to function in society, but I would die pretty quickly if I were dropped into the environments that shaped human biological evolution.</p>
<p>I’m excited about MARL as a way to empirically study the emergence of the sort of social phenomena that gave rise to the complexity of modern human culture, and as a way to build agents that can make use of cultural knowledge in a human-like way.</p>
<h2 id="marl-might-be-a-useful-tool-for-understanding-how-ai-will-impact-society">MARL might be a useful tool for understanding how AI will impact society</h2>
<p>Reinforcement learning-based automation is poised to be deployed for many practical uses including stock trading, corporate decision making, robotics and self-driving, etc. MARL is likely to be important for enabling self-driving cars and household robots to coordinate with humans, and might be a useful tool for training individual agents that are capable enough to succeed at complex real-world tasks. But regardless of how central MARL is in developing such systems, any future that is filled with RL agents could benefit from the ability to understand how those agents may interact.</p>
<p>By analogy to the emergence of entities like companies and countries in large groups of humans, we might expect the interactions of these agents to give rise to emergent behavior.
Just as collections of interacting humans pose risks that individual humans do not, the group behavior of these RL agents may be undesirable.</p>
<p>Some potential concerns are pretty straightforward (“are the stock trading bots colluding?”), but the stark difference in the objectives and capabilities between individuals and countries suggests that characterizing emergent social behavior might be really important.
As AI agents fill more roles in society, MARL might provide us with invaluable tools for understanding how inadvertent collectives of agents emerge/behave, and the impact they could have on society at large.</p>
<h1 id="scholars-plan">Scholars plan</h1>
<p>The OpenAI Scholars program lasts for 4 months. The program is divided in half, into a learning portion and a project portion. During the learning portion, I’ll work roughly in parallel on digesting the MARL literature and developing some simple multi-agent experiments.</p>
<p>First, I’ll study and write about the broad state of MARL research (~3 weeks) while I work on the general infrastructure for running experiments. Then I’ll narrow the scope of my reading to focus on communication in MARL while adding inter-agent communication to the experiment scaffolding (~3 weeks). Building on all of this, I plan to use information theoretic tools from the MARL literature to empirically investigate inter-agent communication in this toy system.</p>
<h1 id="references">References</h1>
<p>Richard S. Sutton and Andrew G. Barto. <a href="http://incompleteideas.net/book/bookdraft2017nov5.pdf">Reinforcement Learning: An Introduction; 2nd Edition</a>, 2017.</p>
<p>Gerald Tesauro. <a href="http://www.bkgm.com/articles/tesauro/tdl.html">Temporal Difference Learning and TD-Gammon</a>. Communications of the ACM. 38 (3), 1995.</p>
<p>Volodymyr Mnih et al. <a href="https://arxiv.org/pdf/1312.5602">Playing Atari with Deep Reinforcement Learning</a>. arXiv preprint arXiv:1312.5602, 2013.</p>
<p>David Silver et al. <a href="https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf">Mastering the Game of Go without Human Knowledge</a>. Nature 550, 354-359, 2017.</p>
<p>David Silver et al. <a href="https://arxiv.org/pdf/1712.01815">Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm</a>. arXiv preprint arXiv:1712.01815, 2017.</p>
<p>Oriol Vinyals et al. <a href="https://rdcu.be/bVI7G">Grandmaster level in Starcraft II using multi-agent reinforcement learning</a>. Nature 575, 350-354, 2019.</p>
<p>Ilge Akkaya et al. <a href="https://arxiv.org/pdf/1910.07113">Solving Rubik’s Cube with a Robot Hand</a>. arXiv preprint arXiv:1910.07113, 2019.</p>
<p>Reddy et al. <a href="https://arxiv.org/pdf/1802.01744.pdf">Shared Autonomy via Deep Reinforcement Learning</a>. arXiv preprint arXiv:1802.01744, 2018.</p>
<p>Christopher Berner et al. <a href="https://arxiv.org/pdf/1912.06680">Dota 2 with Large Scale Deep Reinforcement Learning</a>. arXiv preprint arXiv:1912.06680, 2019.</p>
<p>Rui Wang et al. <a href="https://arxiv.org/pdf/1901.01753">Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions</a>. arXiv preprint arXiv:1901.01753, 2019.</p>
<p>Natasha Jaques et al. <a href="https://arxiv.org/pdf/1810.08647">Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning</a>. arXiv preprint arXiv:1810.08647, 2019.</p>
<p>Bowen Baker et al. <a href="https://arxiv.org/pdf/1909.07528">Emergent Tool Use From Multi-Agent Autocurricula</a>. arXiv preprint arXiv:1909.07528, 2019.</p>
<p>David Balduzzi et al. <a href="https://arxiv.org/pdf/1802.05642.pdf">The Mechanics of n-Player Differentiable Games</a>. arXiv preprint arXiv:1802.05642, 2018.</p>
<p>Eric Mazumdar et al. <a href="https://arxiv.org/pdf/1907.03712">Policy-Gradient Algorithms Have No Guarantees of Convergence in Linear Quadratic Games</a>. arXiv preprint arXiv:1907.03712, 2019.</p>
<p>Joel Z. Leibo et al. <a href="https://arxiv.org/pdf/1903.00742">Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research</a>. arXiv preprint arXiv:1903.00742, 2019.</p>
<p>Joseph Henrich. The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter, 2015.</p>Kamal NdousseI’m excited to be participating in the 2020 cohort of the OpenAI Scholars program. With the mentorship of Natasha Jaques, I’ll be spending the next few months studying multi-agent reinforcement learning (MARL) and periodically writing blog posts to document my progress. In this first post, I’ll discuss the reasons I’m excited about MARL and my plan for the Scholars program.MineRL: Recurrent replay2019-11-09T00:00:00-08:002019-11-09T00:00:00-08:00https://kandouss.github.io/blog/MineRL<p>I spent some time recently exploring reinforcement learning in the excellent <a href="http://www.minerl.io/">MineRL</a> minecraft environments.
I haven’t played much Minecraft, and I haven’t personally accomplished the holy-grail objective of mining a diamond.
The prospect of building a bot that can learn to accomplish a task that I haven’t completed – one that is as human-accessible as this – is incredibly exciting!</p>
<p>There are a bunch of factors that make the MineRL environments interesting and challenging:</p>
<ul>
<li>the need to learn from pixels,</li>
<li>the need to coordinate actions on short and long time scales,</li>
<li>mixed discrete/continuous action and observation spaces, and</li>
<li>very sparse rewards.</li>
</ul>
<p>MineRL also provides many hours of expert data – recorded trajectories of human players accomplishing a variety of in-game tasks. Rewards in the MineRL environments are very sparse; in most variants, agents reap their first reward after successfully chopping a tree. This is very unlikely to happen if the agent is randomly mashing buttons, which makes the expert data particularly valuable.</p>
<p>I’ve mostly been focusing on RL algorithms that fit the full MineRL environments – in particular, those that work pretty naturally with both discrete and continuous action spaces, can learn from expert demonstrations, and can cope with sparse rewards. Actor-critic algorithms fit the bill pretty well, and soft actor-critic in particular is promising thanks to its excellent demonstrated sample efficiency (even when learning directly from pixels).</p>
<p>These algorithms are tricky to implement properly, and their performance can be quite sensitive to hyperparameter values. Furthermore, the extreme reward sparsity makes it very difficult to distinguish between a bug-laden algorithm and one that is correct but poorly tuned: either way, the reward will be zero for a long time.</p>
<p>So I instead began by implementing algorithms in the much simpler and more forgiving Roboschool environments. I implemented the agents with a modular architecture: the networks for encoding observations and emitting actions are inferred from the structure of the environment, but the core of the learning algorithms is not environment-specific. This let me validate agent architectures in the more forgiving Roboschool environments before moving them to MineRL.</p>
<p>Nevertheless, this often left me watching “validated” agents hop randomly across the map, wondering whether they’d be capable of achieving the basic tree-chop task. A few hours into the N-th fruitless training run, I decided to put myself in the agent’s shoes and actually play the game for a bit. Hoping to build some empathy for the difficulty of the task, I approached a tree and tried to punch out the wood. Surprisingly, this took at least a second of continuous “attack” actions. If I let up on the mouse for an instant, the tree would remain intact.</p>
<p>Agents in off-policy RL algorithms like soft actor-critic choose actions by sampling from distributions. Up to this point, I had been assuming that the agent could succeed in simple tasks by choosing actions independently at each step. If I were presented with a bunch of early-game Minecraft frames out of order and told to choose actions that would lead to tree chops, I’d choose an appropriate action > 90% of the time. But this isn’t nearly good enough: if the agent chose the ‘attack’ action 90% of the time, on average it would take about 7 seconds before the agent would successfully chop some wood (assuming this requires 1 second of constant attacking at 30fps).</p>
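<p>That “about 7 seconds” figure is easy to check empirically (a throwaway simulation using the 1-second-of-attacking and 30fps assumptions above):</p>
<pre><code class="language-python">import random

def steps_until_run(p=0.9, run_length=30):
    """Steps until the first run of `run_length` consecutive 'attack'
    actions, when attack is chosen independently with probability p."""
    steps, streak = 0, 0
    while run_length > streak:
        steps += 1
        streak = streak + 1 if p > random.random() else 0
    return steps

trials = [steps_until_run() for _ in range(10_000)]
print(sum(trials) / len(trials) / 30)  # roughly 7.5 seconds at 30fps
</code></pre>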
<p>There are a few potential approaches to stabilizing actions:</p>
<ul>
<li>Shape the observation space by including past observations at every step, so that policies have direct access to some of the environment dynamics</li>
<li>Shape the action space by de-bouncing jitter in discrete actions</li>
<li>Add explicit temporal regularization to the policy loss
<ul>
<li>“Observe and Look Further…”, one of the DeepMind Montezuma’s Revenge papers, suggests a “temporal consistency” (TC) loss that penalizes producing different actions at consecutive steps (see the sketch after this list).</li>
</ul>
</li>
<li>Use auto-regressive policies to give agents information about past actions</li>
<li>Use fully recurrent networks</li>
</ul>
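<p>As a sketch of the temporal-regularization idea, here’s one plausible form of such a penalty: a KL divergence between consecutive action distributions. This is an illustration of the concept rather than the exact loss from that paper (or from my code):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def temporal_consistency_penalty(logits):
    """KL divergence between the policy's action distributions at
    consecutive steps, averaged over a trajectory.

    logits: (T, n_actions) policy logits for one trajectory.
    """
    log_p = F.log_softmax(logits, dim=-1)
    p_prev = log_p[:-1].exp()
    # KL( pi(.|s_t) || pi(.|s_{t+1}) ), summed over actions
    return (p_prev * (log_p[:-1] - log_p[1:])).sum(dim=-1).mean()
</code></pre>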
<p>As soon as I added a TC loss to the policy, the agent started to (occasionally) successfully chop trees! On top of that, using LSTM-based policies instead of feedforward networks begat further improvements, but in the interest of managing complexity and stability I’ve mostly been experimenting with LSTMs in significantly simpler architectures (behavior cloning, advantage-weighted regression, etc).</p>
<p>To facilitate experiments with recurrent policies, I implemented a fancy trajectory replay buffer that</p>
<ol>
<li>can store and update hidden states,</li>
<li>can easily sample minibatches of arbitrary-length (obs, act, rew, done, hidden) sequences, and</li>
<li>stores all the data on disk, in surprisingly efficient memory-mapped numpy arrays.</li>
</ol>
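<p>The on-disk storage is mostly just numpy’s <code>np.memmap</code> (a sketch with made-up shapes, capacity, and filenames, not the full buffer):</p>
<pre><code class="language-python">import numpy as np

capacity, obs_shape, hidden_dim = 100_000, (64, 64, 3), 256

# Disk-backed arrays: writes go to memory-mapped pages, and reading an
# arbitrary-length sequence is just an ordinary slice.
obs_buf = np.memmap('obs.dat', dtype=np.uint8, mode='w+',
                    shape=(capacity,) + obs_shape)
hid_buf = np.memmap('hidden.dat', dtype=np.float32, mode='w+',
                    shape=(capacity, hidden_dim))

seq = obs_buf[1000:1020]  # a 20-step observation sequence
</code></pre>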
<p>My implementation of the “RecurrentReplayBuffer”, along with some other utilities that were helpful in managing the complex hierarchical MineRL action/observation spaces, are <a href="https://github.com/kandouss/spacelib">available on github</a>.</p>
<p>I’ll update this post as I continue cleaning and refactoring an unseemly mess of private code.</p>Kamal NdousseI spent some time recently exploring reinforcement learning in the excellent MineRL minecraft environments. I haven’t played much Minecraft, and I haven’t personally accomplished the holy-grail objective of mining a diamond. The prospect of building a bot that can learn to accomplish a task that I haven’t completed – one that is as human-accessible as this – is incredibly exciting!