Efficient Exploration via State Marginal Matching
Abstract
To solve tasks with sparse rewards, reinforcement learning algorithms must be equipped with suitable exploration techniques. However, it is unclear what underlying objective is being optimized by existing exploration algorithms, or how they can be altered to incorporate prior knowledge about the task. Most importantly, it is difficult to use exploration experience from one task to acquire exploration strategies for another task. We address these shortcomings by learning a single exploration policy that can quickly solve a suite of downstream tasks in a multi-task setting, amortizing the cost of learning to explore. We recast exploration as a problem of State Marginal Matching (SMM): we learn a mixture of policies for which the state marginal distribution matches a given target state distribution, which can incorporate prior knowledge about the task. Without any prior knowledge, the SMM objective reduces to maximizing the marginal state entropy. We optimize the objective by reducing it to a two-player, zero-sum game, where we iteratively fit a state density model and then update the policy to visit states with low density under this model. While many previous algorithms for exploration employ a similar procedure, they omit a crucial historical averaging step, without which the iterative procedure does not converge to a Nash equilibrium. To parallelize exploration, we extend our algorithm to use mixtures of policies, wherein we discover connections between SMM and previously-proposed skill learning methods based on mutual information. On complex navigation and manipulation tasks, we demonstrate that our algorithm explores faster and adapts more quickly to new tasks.^{1} Videos and code: https://sites.google.com/view/statemarginalmatching
1 Introduction
In order to solve tasks with sparse or delayed rewards, reinforcement learning (RL) algorithms must be equipped with suitable exploration techniques. Exploration methods based on random actions have limited ability to cover a wide range of states. More sophisticated techniques, such as intrinsic motivation, can be much more effective. However, it is often unclear what underlying objective is optimized by these methods, or how prior knowledge can be readily incorporated into the exploration strategy. Most importantly, it is difficult to use exploration experience from one task to acquire exploration strategies for another task.
We address these shortcomings by considering a multitask setting, where many different reward functions can be provided for the same set of states and dynamics. Rather than reinventing the wheel and learning to explore anew for each task, we aim to learn a single, taskagnostic exploration policy that can be adapted to many possible downstream reward functions, amortizing the cost of learning to explore. This exploration policy can be viewed as a prior on the policy for solving downstream tasks. Learning will consist of two phases: during training, we acquire this taskagnostic exploration policy; during testing, we use this exploration policy to quickly explore and maximize the task reward.
Learning a single exploration policy is considerably more difficult than doing exploration throughout the course of learning a single task. The latter is done by intrinsic motivation (Pathak et al., 2017; Tang et al., 2017; Oudeyer et al., 2007) and count-based exploration methods (Bellemare et al., 2016), which can effectively explore to find states with high reward, at which point the agent can decrease exploration and increase exploitation of those high-reward states. While these methods perform efficient exploration for learning a single task, the policy at any particular iteration is not a good exploration policy. For example, the final policy at convergence would only visit the high-reward states discovered for the current task. A straightforward solution is to simply take the historical average over policies from each iteration of training. At test time, we sample one of the policies from a previous training iteration, and use that policy to sample actions for the episode. Our algorithm will implicitly do this.
What objective should be optimized during training to obtain a good exploration policy? We recast exploration as a problem of State Marginal Matching: given a desired state distribution, we learn a mixture of policies for which the state marginal distribution matches this desired distribution. Without any prior information, this objective reduces to maximizing the marginal state entropy $\mathcal{H}_\pi[s]$, which encourages the policy to visit as many states as possible. The distribution matching objective also provides a convenient mechanism for humans to incorporate prior knowledge about the task, whether in the form of constraints that the agent should obey; preferences for some states over other states; reward shaping; or the relative importance of each state dimension for a particular task.
We propose an algorithm to optimize the State Marginal Matching (SMM) objective. First, we reduce the problem of SMM to a two-player, zero-sum game between a policy player and a density player. We find a Nash equilibrium for this game using fictitious play (Brown, 1951), a classic procedure from game theory. Our resulting algorithm iteratively fits a state density model and then updates the policy to visit states with low density under this model. While many previous algorithms for exploration employ a similar procedure, they omit a crucial historical averaging step, without which the iterative procedure is not guaranteed to converge.
In short, our paper studies the State Marginal Matching objective as a principled objective for acquiring a task-agnostic exploration policy. We propose an algorithm to optimize this objective. Our analysis of this algorithm sheds light on prior methods, and we empirically show that SMM solves hard exploration tasks faster than state-of-the-art baselines in navigation and manipulation domains.
2 Related Work
Most prior work on exploration has looked at exploration bonuses and intrinsic motivation. Typically, these algorithms (Pathak et al., 2017; Oudeyer et al., 2007; Schmidhuber, 1991; Houthooft et al., 2016; Burda et al., 2018) formulate some auxiliary task, and use prediction error on that task as an exploration bonus. Another class of methods (Tang et al., 2017; Bellemare et al., 2016; Schmidhuber, 2010) directly encourage the agent to visit novel states. While all methods effectively explore during the course of solving a single task, the policy obtained at convergence is often not a good exploration policy. For example, consider an exploration bonus derived from prediction error of an inverse model (Pathak et al., 2017). At convergence, the inverse model will have high error at states with stochastic dynamics, so the resulting policy will always move towards these stochastic states and fail to explore the rest of the state space.
Many exploration algorithms can be classified by whether they do exploration in the space of actions, policy parameters, goals, or states. Common exploration strategies, including $\epsilon$-greedy and Ornstein–Uhlenbeck noise (Lillicrap et al., 2015), as well as standard MaxEnt algorithms (Ziebart, 2010; Haarnoja et al., 2018), do exploration in action space. Recent work (Fortunato et al., 2017; Plappert et al., 2017) shows that adding noise to the parameters of the policy can result in good exploration. Most closely related to our work are methods (Pong et al., 2019; Hazan et al., 2018) that perform exploration in the space of states or goals. In fact, Hazan et al. (2018) consider the same State Marginal Matching objective that we examine. However, the algorithm proposed there requires an oracle planner and an oracle density model, assumptions that our method does not require. Finally, some prior work considers exploration in the space of goals (Colas et al., 2018; Held et al., 2017; Nair et al., 2018; Pong et al., 2019). In Appendix D.3, we also discuss how goal-conditioned RL (Kaelbling, 1993; Schaul et al., 2015) can be viewed as a special case of State Marginal Matching when the goal-sampling distribution is learned jointly with the policy.
The problems of exploration and meta-reinforcement learning are tightly coupled. Exploration algorithms visit a wide range of states with the aim of finding new states with high reward. Meta-reinforcement learning algorithms (Duan et al., 2016; Finn et al., 2017; Rakelly et al., 2019; Mishra et al., 2017) must perform effective exploration if they hope to solve a downstream task. Some prior work has explicitly looked at the problem of learning to explore (Gupta et al., 2018; Xu et al., 2018). However, these methods rely on meta-learning algorithms which are often complicated and brittle.
Closely related to our approach are standard maximum action entropy algorithms (Haarnoja et al., 2018; Kappen et al., 2012; Rawlik et al., 2013; Ziebart et al., 2008; Theodorou and Todorov, 2012). While these algorithms are referred to as MaxEnt RL, they maximize entropy over actions, not states. This class of algorithms can be viewed as performing inference on a graphical model where the likelihood of a trajectory is given by its exponentiated reward (Toussaint and Storkey, 2006; Levine, 2018; Abdolmaleki et al., 2018). While distributions over trajectories define distributions over states, the relationship is complicated. Given a target distribution over states, it is quite challenging to design a reward function such that the optimal maximum action entropy policy matches the target state distribution. Our Algorithm 1 avoids learning the reward function and instead directly learns a policy that matches the target distribution.
Finally, the idea of distribution matching has been employed successfully in imitation learning settings (Ziebart et al., 2008; Ho and Ermon, 2016; Finn et al., 2016; Fu et al., 2017). While inverse RL algorithms assume access to expert trajectories, we instead assume access to the density of the target state marginal distribution. Similar to inverse RL algorithms (Ho and Ermon, 2016; Fu et al., 2018), our method can likewise be interpreted as learning a reward function, though our reward function is obtained via a density model instead of a discriminator.
3 State Marginal Matching
In this section, we propose the State Marginal Matching problem as a principled objective for learning to explore, and offer an algorithm for optimizing it. We consider a parametric policy $\pi_\theta(a \mid s)$ that chooses actions in a Markov Decision Process (MDP) with fixed episode length $T$, dynamics distribution $p(s_{t+1} \mid s_t, a_t)$, and initial state distribution $p(s_1)$. The MDP together with the policy form an implicit generative model over states. We define the state marginal distribution $\rho_\pi(s)$ as the probability that the policy visits state $s$:
$$\rho_\pi(s) \triangleq \mathbb{E}_{s_1 \sim p(s_1),\; a_t \sim \pi_\theta(\cdot \mid s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)} \Big[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}(s_t = s) \Big]$$
We emphasize that $\rho_\pi(s)$ is not a distribution over trajectories, and is not the stationary distribution of the policy after infinitely many steps, but rather the distribution over states visited in a finite-length episode.^{2} We also note that any trajectory distribution matching problem can be reduced to a state marginal matching problem by augmenting the current state to include all previous states. ^{2}$\rho_\pi(s)$ approaches the policy's stationary distribution in the limit as the episode horizon $T \to \infty$.
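To make this definition concrete, the following sketch (ours, not part of the paper) estimates the state marginal by Monte Carlo for a hypothetical random-walk policy on the integer line; the function `state_marginal`, the policy `uniform_walk`, and the toy dynamics are all illustrative assumptions:

```python
import random
from collections import Counter

def state_marginal(policy, T=10, episodes=20000, seed=0):
    """Monte Carlo estimate of rho_pi(s): the probability that state s is
    occupied at a uniformly random timestep of a length-T episode."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(episodes):
        s = 0                      # initial state distribution: always state 0
        for _ in range(T):
            counts[s] += 1
            s += policy(s, rng)    # deterministic toy dynamics: s' = s + a
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

# A hypothetical policy: a uniform random walk over actions {-1, +1}.
uniform_walk = lambda s, rng: rng.choice([-1, 1])
rho = state_marginal(uniform_walk)
```

For this symmetric walk, the estimated marginal concentrates near the initial state, illustrating that $\rho_\pi(s)$ is a finite-horizon visitation distribution rather than a stationary one.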
We assume that we are given a target distribution $p^*(s)$ that encodes our uncertainty about the tasks we may be given at test-time. For example, a roboticist might assign small values of $p^*(s)$ to states that are dangerous, regardless of the desired task. Alternatively, we might learn $p^*(s)$ from data about human preferences (Christiano et al., 2017). For goal-reaching tasks, we can analytically derive the optimal target distribution (Appendix C). Given $p^*(s)$, our goal is to find a parametric policy whose state marginal $\rho_\pi(s)$ is "closest" to this target distribution, where we measure discrepancy using the Kullback-Leibler (KL) divergence:
(1)  $\max_\pi \; -D_{KL}\big(\rho_\pi(s) \,\|\, p^*(s)\big)$
(2)  $\;= \max_\pi \; \mathbb{E}_{s \sim \rho_\pi(s)}\big[\log p^*(s) - \log \rho_\pi(s)\big]$
(3)  $\;= \max_\pi \; \mathbb{E}_{s \sim \rho_\pi(s)}\big[\log p^*(s)\big] + \mathcal{H}_\pi[s]$
Note that we use the reverse KL divergence (Bishop, 2006), which is mode-covering (i.e., exploratory). We show in Appendix C that the policies obtained via State Marginal Matching provide an optimal exploration strategy for a particular distribution over reward functions. To gain intuition for the State Marginal Matching objective, we decompose it in two ways. In Equation 3, we see that State Marginal Matching is equivalent to maximizing the reward function $\log p^*(s)$ while simultaneously maximizing the entropy of states. Note that, unlike traditional MaxEnt RL algorithms (Ziebart et al., 2008; Haarnoja et al., 2018), we regularize the entropy of the state distribution, not the conditional distribution of actions given states, which results in exploration in the space of states rather than in actions. Moreover, Equation 2 suggests that State Marginal Matching maximizes a pseudo-reward $r(s) \triangleq \log p^*(s) - \log \rho_\pi(s)$, which assigns positive utility to states that the agent visits too infrequently and negative utility to states visited too frequently (see Figure 1). We emphasize that maximizing this pseudo-reward is not an RL problem because the pseudo-reward depends on the policy.
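The equivalence of the three forms of the objective can be checked numerically. The sketch below (ours, for illustration) uses hypothetical discrete distributions for the state marginal and the target, and evaluates Equations 1-3 term by term:

```python
import math

# Hypothetical state marginal rho_pi(s) and target p*(s) over three states.
rho = [0.5, 0.3, 0.2]
p_star = [0.25, 0.25, 0.5]

# Equation 1: negative reverse KL divergence -D_KL(rho || p*).
neg_kl = -sum(r * math.log(r / p) for r, p in zip(rho, p_star))

# Equation 2: expected pseudo-reward r(s) = log p*(s) - log rho_pi(s).
pseudo_reward = sum(r * (math.log(p) - math.log(r)) for r, p in zip(rho, p_star))

# Equation 3: expected log target density plus state entropy.
expected_log_target = sum(r * math.log(p) for r, p in zip(rho, p_star))
state_entropy = -sum(r * math.log(r) for r in rho)
```

All three quantities agree exactly, confirming that the KL objective, the pseudo-reward form, and the entropy-regularized form are the same objective written three ways.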
3.1 Optimizing the State Marginal Matching Objective
Optimizing Equation 2 to obtain a single exploration policy is more challenging than standard RL because the reward function itself depends on the policy. To break this cyclic dependency, we introduce a parametric state density model $q_\psi(s)$ to approximate the policy's state marginal distribution, $q_\psi(s) \approx \rho_\pi(s)$. We assume that the class of density models is sufficiently expressive to represent every policy:
Assumption 1.
For every policy $\pi \in \Pi$, there exists $\psi \in \Psi$ such that $q_\psi(s) = \rho_\pi(s)$.
Now, we can optimize the policy w.r.t. this proxy distribution. Let a class of policies $\Pi$ and a class of density models $Q = \{q_\psi : \psi \in \Psi\}$ satisfying Assumption 1 be given. For any target distribution $p^*(s)$, the following optimization problems are equivalent:
(4)  $\max_\pi \min_{q \in Q} \; \mathbb{E}_{s \sim \rho_\pi(s)}\big[\log p^*(s) - \log q(s)\big] \;=\; \max_\pi \; \mathbb{E}_{s \sim \rho_\pi(s)}\big[\log p^*(s) - \log \rho_\pi(s)\big]$
To see this, note that
(5)  $\mathbb{E}_{s \sim \rho_\pi(s)}\big[\log p^*(s) - \log q(s)\big] \;=\; \mathbb{E}_{s \sim \rho_\pi(s)}\big[\log p^*(s) - \log \rho_\pi(s)\big] + D_{KL}\big(\rho_\pi(s) \,\|\, q(s)\big)$
Since the KL term is non-negative, the left-hand side is minimized when it vanishes. By Assumption 1, this minimum is attained at $q_\psi = \rho_\pi$ for some $\psi \in \Psi$, so we obtain the desired result.
Solving the new max-min optimization problem is equivalent to finding the Nash equilibrium of a two-player, zero-sum game: a policy player chooses the policy $\pi$ while the density player chooses the density model $q$. To avoid confusion, we use actions to refer to controls output by the policy in the traditional RL problem and strategies to refer to the decisions of the policy player and decisions of the density player. The Nash existence theorem (Nash, 1951) proves that such a stationary point always exists for such a two-player, zero-sum game.
One common approach to saddle point games is to alternate between updating player A w.r.t. player B, and updating player B w.r.t. player A. However, simple games such as Rock-Paper-Scissors illustrate that such a greedy approach is not guaranteed to converge to a stationary point. A slight variant, fictitious play (Brown, 1951), does converge to a Nash equilibrium (Robinson, 1951; Daskalakis and Pan, 2014). At each iteration, each player chooses their best strategy in response to the historical average of the opponent's strategies. In our setting, fictitious play alternates between (1) fitting the density model to the historical average of policies, and (2) updating the policy with RL to minimize the log-density of the state, using a historical average of the density models:
(6)  $q^{(m+1)} \leftarrow \arg\max_q \; \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{s \sim \rho_{\pi^{(i)}}(s)}\big[\log q(s)\big]$
(7)  $\pi^{(m+1)} \leftarrow \arg\max_\pi \; \mathbb{E}_{s \sim \rho_\pi(s)}\Big[\log p^*(s) - \frac{1}{m} \sum_{i=1}^{m} \log q^{(i)}(s)\Big]$
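To illustrate the convergence claim, the sketch below (ours, not from the paper) runs fictitious play on Rock-Paper-Scissors: each player best-responds to the opponent's empirical history of play, and the historical averages of the strategies approach the uniform Nash equilibrium, even though the individual iterates keep cycling:

```python
# Row player's payoff matrix for Rock-Paper-Scissors (a zero-sum game).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def best_response(payoff, opponent_counts):
    """Pure strategy maximizing expected payoff against the opponent's
    empirical (historical average) strategy."""
    n = sum(opponent_counts)
    scores = [sum(payoff[i][j] * opponent_counts[j] / n for j in range(3))
              for i in range(3)]
    return max(range(3), key=lambda i: scores[i])

# The column player maximizes the negated, transposed payoffs (zero-sum).
A_col = [[-A[i][j] for i in range(3)] for j in range(3)]

row_counts = [1, 0, 0]   # arbitrary initial plays
col_counts = [0, 1, 0]
for _ in range(30000):
    r = best_response(A, col_counts)
    c = best_response(A_col, row_counts)
    row_counts[r] += 1
    col_counts[c] += 1

# Historical averages of play approach the uniform Nash (1/3, 1/3, 1/3).
row_avg = [x / sum(row_counts) for x in row_counts]
col_avg = [x / sum(col_counts) for x in col_counts]
```

This mirrors the structure of Equations 6-7: the density player (here, the column player) responds to the policy player's history, and vice versa, and it is the historical average, not the latest iterate, that converges.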
We summarize the resulting algorithm in Algorithm 1. In practice, we can efficiently implement Equation 6 and avoid storing the policy parameters from every iteration by instead storing sampled states from each iteration.^{3} We cannot perform the same trick for Equation 7, and instead resort to approximating the historical average of density models with the most recent iterate $q^{(m)}$. ^{3}One way is to maintain an infinite-sized replay buffer and fit the density model to the replay buffer at each iteration. Alternatively, we can replace older samples in a fixed-size replay buffer less frequently, such that sampling from the buffer is uniform over iterations.
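Here is a minimal tabular sketch of this loop (our simplification, not the paper's implementation): the "policy player" directly picks a distribution over three states in place of the inner RL update, and the density model is a smoothed histogram of the replay buffer, which realizes the historical average of Equation 6:

```python
import math
import random

random.seed(0)

# Toy setting: the policy directly parameterizes a distribution over three
# states (a stand-in for the inner RL update of Algorithm 1).
p_star = [0.5, 0.25, 0.25]   # target state distribution p*(s)
n_states = 3
replay_buffer = []            # states sampled from *all* past policies
policy = [1.0, 0.0, 0.0]      # start fully collapsed onto state 0

for _ in range(200):
    # Roll out the current policy; the buffer accumulates the history.
    replay_buffer.extend(random.choices(range(n_states), policy, k=100))
    # (1) Fit the density model q to the historical average of policies,
    #     here a Laplace-smoothed histogram of the whole replay buffer.
    q = [(replay_buffer.count(s) + 1) / (len(replay_buffer) + n_states)
         for s in range(n_states)]
    # (2) Greedy "RL" update: move the policy onto the state with the largest
    #     pseudo-reward r(s) = log p*(s) - log q(s), i.e. the state most
    #     under-visited relative to the target.
    best = max(range(n_states),
               key=lambda s: math.log(p_star[s]) - math.log(q[s]))
    policy = [1.0 if s == best else 0.0 for s in range(n_states)]

# The historical average of state visitations approaches the target p*.
hist = [replay_buffer.count(s) / len(replay_buffer) for s in range(n_states)]
```

Note that each individual policy iterate is a degenerate point mass; only the historical average over iterates matches $p^*(s)$, which is exactly why sampling a policy from a past iteration at test time is needed.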
3.2 Do Exploration Bonuses Using Predictive Error Perform State Marginal Matching?
We note some striking similarities between Equation 4 and exploration methods based on prediction error. For example, when our density model is a VAE, Equation 4 becomes
$$\max_\pi \; \mathbb{E}_{s \sim \rho_\pi(s)}\big[\log p^*(s) + \|f_\theta(s) - s\|_2^2 + D_{KL}\big(e_\theta(z \mid s) \,\|\, p(z)\big)\big],$$
where $f_\theta$ is our autoencoder and the $D_{KL}$ term is the KL penalty on the VAE encoder $e_\theta(z \mid s)$ for the data distribution $\rho_\pi(s)$. In contrast, the objective for RND (Burda et al., 2018) is
$$\max_\pi \; \mathbb{E}_{s \sim \rho_\pi(s)}\big[\|f_\theta(s) - \bar{e}(s)\|_2^2\big],$$
where $\bar{e}$ is an encoder obtained by a randomly initialized neural network. Exploration bonuses based on the predictive error of forward models (Schmidhuber, 1991; Chentanez et al., 2005; Stadie et al., 2015) have a similar form, but instead consider full transitions:
$$\max_\pi \; \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho_\pi}\big[\|f_\theta(s_t, a_t) - s_{t+1}\|_2^2\big].$$
Exploration bonuses derived from inverse models (Pathak et al., 2017) look similar:
$$\max_\pi \; \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho_\pi}\big[\|f_\theta(s_t, s_{t+1}) - a_t\|_2^2\big].$$
Each of these methods can be interpreted as almost learning a particular density model of $\rho_\pi(s)$ and using the negative log-probability under that density model as a reward. However, because they omit the historical averaging step, they do not actually perform distribution matching. This provides an interesting interpretation of state marginal matching as a more principled way to apply intrinsic motivation: instead of simply taking the latest policy, which is not by itself optimizing any particular objective, we take the historical average, which can be shown to match the target distribution asymptotically.
3.3 Better Marginal Matching with Mixture of Policies
Given the challenging problem of exploration in large state spaces, it is natural to wonder whether we can accelerate exploration by automatically decomposing the potentially-multimodal target distribution into a mixture of "easier-to-learn" distributions and learning a corresponding set of policies to do distribution matching for each component. Note that the mixture model we introduce here is orthogonal to the historical averaging step discussed before. Using $\rho_{\pi_z}(s)$ to denote the state distribution of the policy conditioned on the latent variable $z$, the state marginal distribution of the mixture of policies is
(8)  $\rho_\pi(s) = \int_z p(z)\, \rho_{\pi_z}(s)\, dz = \mathbb{E}_{z \sim p(z)}\big[\rho_{\pi_z}(s)\big],$
where $p(z)$ is a latent prior. As before, we will minimize the KL divergence between this mixture distribution and the target distribution. Using Bayes' rule to rewrite $\log \rho_\pi(s)$ in terms of conditional probabilities, we obtain the following optimization problem:
(9)  $\max_\pi \; \mathbb{E}_{z \sim p(z),\; s \sim \rho_{\pi_z}(s)}\big[\underbrace{\log p^*(s)}_{(a)} \; \underbrace{-\, \log \rho_{\pi_z}(s)}_{(b)} \; + \; \underbrace{\log p(z \mid s)}_{(c)} \; \underbrace{-\, \log p(z)}_{(d)}\big]$
Intuitively, this says that the agent should go to states (a) with high density under the target state distribution, (b) where this agent has not been before, and (c) where this agent is clearly distinguishable from the other agents. The last term (d) says to explore in the space of mixture components $z$. This decomposition bears a resemblance to the mutual-information objectives in recent work (Achiam et al., 2018; Eysenbach et al., 2018; Co-Reyes et al., 2018). Thus, one interpretation of our work is as explaining that mutual information objectives almost perform distribution matching. The caveat is that prior work omits the state entropy term, which provides high reward for visiting novel states, possibly explaining why these previous works have failed to scale to complex tasks.
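The Bayes'-rule decomposition behind Equation 9 can be verified numerically. The two-skill mixture below is hypothetical, chosen only for illustration; the check confirms that the four-term decomposed pseudo-reward equals the mixture pseudo-reward $\log p^*(s) - \log \rho_\pi(s)$ for every skill and state:

```python
import math

# Hypothetical two-skill mixture over three states (illustration only).
p_z = [0.5, 0.5]                  # latent prior p(z)
rho_z = [[0.7, 0.2, 0.1],         # rho_{pi_z}(s) for skill z = 0
         [0.1, 0.3, 0.6]]         # rho_{pi_z}(s) for skill z = 1
p_star = [1 / 3, 1 / 3, 1 / 3]    # uniform target distribution p*(s)

# Mixture state marginal, Equation 8: rho_pi(s) = sum_z p(z) rho_{pi_z}(s).
rho = [sum(p_z[z] * rho_z[z][s] for z in range(2)) for s in range(3)]
# Posterior over skills via Bayes' rule: p(z|s) = p(z) rho_{pi_z}(s) / rho_pi(s).
post = [[p_z[z] * rho_z[z][s] / rho[s] for s in range(3)] for z in range(2)]

def r_marginal(s):
    """Pseudo-reward against the mixture marginal: log p*(s) - log rho_pi(s)."""
    return math.log(p_star[s]) - math.log(rho[s])

def r_decomposed(z, s):
    """The four terms of Equation 9: (a) target density, (b) novelty for this
    skill, (c) discriminability, (d) latent-space exploration."""
    return (math.log(p_star[s]) - math.log(rho_z[z][s])
            + math.log(post[z][s]) - math.log(p_z[z]))
```

Because terms (b)-(d) recombine into $-\log \rho_\pi(s)$ exactly, optimizing the per-skill decomposition optimizes the original mixture objective.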
We summarize the resulting procedure in Algorithm 2. The only difference from before is that we learn a discriminator $d(z \mid s)$, in addition to updating the density models $q_z(s)$ and the policies $\pi_z$. Jensen's inequality tells us that maximizing the log-density of the learned discriminator maximizes a lower bound on the true posterior density (see Agakov (2004)):
$$\mathbb{E}_{z \sim p(z),\; s \sim \rho_{\pi_z}(s)}\big[\log d(z \mid s)\big] \;\le\; \mathbb{E}_{z \sim p(z),\; s \sim \rho_{\pi_z}(s)}\big[\log p(z \mid s)\big]$$
In our experiments, we fix the latent prior $p(z)$ to be uniform. Note that the updates for each $\pi_z$ can be conducted in parallel. A distributed implementation would emulate broadcast-collect algorithms (Lynch, 1996), with each worker updating its policy independently, and periodically aggregating results to update the discriminator $d(z \mid s)$.
4 Experiments
In this section, we empirically study whether our method learns to explore effectively, and compare against prior exploration methods. Our experiments will demonstrate how State Marginal Matching provides good exploration, a key component of which is the historical averaging step. More experimental details can be found in Appendix E.1, and code will be released upon publication.
Baselines: We compare to a state-of-the-art off-policy MaxEnt RL algorithm, Soft Actor-Critic (SAC) (Haarnoja et al., 2018), and three exploration methods: Count-based Exploration (Count), which discretizes states and uses an exploration bonus based on the visitation counts of the discretized states; Pseudocounts (PC) (Bellemare et al., 2016), which obtains an exploration bonus from the recoding probability; and Intrinsic Curiosity Module (ICM) (Pathak et al., 2017), which uses prediction error as an exploration bonus.
Manipulation Task: The manipulation environment (Plappert et al., 2018) (shown on the right) consists of a robot with a single gripper arm and a block object resting on top of a table surface, with a 10-dimensional observation space and a 4-dimensional action space. The robot's task is to move the object to a goal location that is not observed by the robot, thus requiring the robot to explore by moving the block to different locations on the table. At the beginning of each episode, we spawn the object at the center of the table, and the robot gripper above the initial block position. We terminate the episode after 50 environment steps, or if the block falls off the table.
Navigation Task: The agent is spawned at the center of long hallways that extend radially outward, like the spokes of a wheel, as shown on the right. The agent's task is to navigate to the end of a goal corridor. We can vary the length of the hallways and the number of halls to finely control the task difficulty and measure how well various algorithms scale as the exploration problem becomes more challenging. We consider two types of robots: in 2D Navigation, the agent is a point mass whose position is directly controlled by velocity actions; in 3D Navigation, the agent is the quadrupedal robot from Schulman et al. (2015), which has a 113-dimensional observation space and a 7-dimensional action space.
Implementation Details. The extrinsic environment reward implicitly defines the target distribution: $p^*(s) \propto \exp(r_{\text{env}}(s))$. We use a VAE to model the density for both SMM and Pseudocounts (PC). For SMM, we use discrete latent skills $z$. All results are averaged over 4 random seeds.
4.1 Experimental Results
Question 1: Is exploration more effective with maximizing state entropy or action entropy?
MaxEnt RL algorithms such as SAC maximize entropy over actions, which is often motivated as leading to good exploration. In contrast, the State Marginal Matching objective leads to maximizing entropy over states. In this experiment, we compared our method to SAC on the navigation task. To see how each method scaled, we also increased the number of hallways (# Arms) to increase the exploration challenge. To evaluate each method, we counted the number of hallways that the agent fully explored (i.e., reached the end of) during training. Figure 2 shows that our method, which maximizes entropy over states, consistently explores 60% of hallways, whereas MaxEnt RL, which maximizes entropy over actions, rarely visits more than 20% of hallways. Further, using mixtures of policies (§ 3.3) explores even better.^{4} Figure 2 also shows the state visitations for the three-hallway environment, illustrating that SAC only explores one hallway whereas SMM explores all three. ^{4}In all experiments, we run each method for the same number of environment transitions; a mixture of 3 policies does not get to take 3 times more transitions.
Question 2: Does historical averaging improve exploration?
While historical averaging is necessary to guarantee convergence (§ 3.1), most prior exploration methods do not employ historical averaging, raising the question of whether it is necessary in practice. To answer this question, we compare SMM to three exploration methods. For each method, we compare (1) the policy obtained at convergence with (2) the historical average of policy iterates over training. We measure how well each explores by computing the marginal state entropy, which we compute by discretizing the state space.^{5} Figure 2(a) shows that historical averaging improves exploration of SMM, and can even improve exploration of the baselines w.r.t. the gripper position. ^{5}Discretization is used only for evaluation; no policy has access to it (except for Count).
Question 3: Does State Marginal Matching allow us to quickly find unknown goals?
In this experiment, we evaluate whether the exploration policy acquired by our method efficiently explores to solve a wide range of downstream tasks. On the manipulation environment, we defined the target distribution to be uniform over the entire state space (joint + block configuration), with the constraint that we put low probability mass on states where the block has fallen off the table. The target distribution also incorporated the prior that actions should be small and the arm should be close to the object. As shown in Figure 2(c), our method has learned to explore better than the baselines, finding over 80% of the goals. Figure 9 illustrates which goals each method succeeded in finding. Our method succeeds in finding a wide range of goals.
Question 4: Can injecting prior knowledge via the target distribution bias exploration?
One of the benefits of the State Marginal Matching objective is that it allows users to easily incorporate prior knowledge about the task. In this experiment, we check whether prior knowledge injected via the target distribution is reflected in the policy obtained from State Marginal Matching. Using the same manipulation environment as above, we modified the target distribution to assign larger probability to states where the block is on the left half of the table than on the right half. In Figure 4, we plot the state marginals of the block Y-coordinate (the axis separating the left and right halves of the table). We see that our method acquires a policy whose state distribution closely matches the target distribution.
5 Discussion
In this paper, we introduced a formal objective for exploration. While it is often unclear what existing exploration algorithms will converge to, our State Marginal Matching objective has a clear solution: at convergence, the policy should visit states in proportion to their density under a target distribution. Not only does this objective encourage exploration, it also provides human users with a flexible mechanism to bias exploration towards states they prefer and away from dangerous states. Upon convergence, the resulting policy can thereafter be used as a prior in a multitask setting, amortizing exploration and enabling faster adaptation to new, potentially sparse, reward functions. The algorithm we proposed looks quite similar to previous exploration methods based on prediction error, suggesting that those methods are also performing some sort of distribution matching. However, by deriving our method from first principles, we note that these prior methods omit a crucial historical averaging step, without which the algorithm is not guaranteed to converge. Experiments on navigation and manipulation tasks demonstrated how our method learns to explore, enabling an agent to efficiently explore in new tasks provided at test time.
In future work, we aim to study connections between inverse RL, MaxEnt RL and state marginal matching, all of which perform some sort of distribution matching. Empirically, we aim to scale to more complex tasks by parallelizing the training of all mixture components simultaneously. Broadly, we expect the state distribution matching problem formulation to enable the development of more effective and principled RL methods that reason about distributions rather than individual states.
6 Acknowledgements
We would like to thank Maruan AlShedivat for helpful discussions and comments. LL is supported by NSF grant DGE1745016 and AFRL contract FA870215D0002. BE is supported by Google. EP is supported by ONR grant N000141812861 and Apple. RS is supported by NSF grant IIS1763562, ONR grant N000141812861, AFRL CogDeCON, and Apple. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF, AFRL, ONR, Google or Apple. We also thank Nvidia for their GPU support.
References
 Abdolmaleki et al. (2018) Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.
 Achiam et al. (2018) Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018.
 Agakov (2004) David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201, 2004.
 Agrawal and Jia (2017) Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worstcase regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
 Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying countbased exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
 Bishop (2006) Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 Brown (1951) G. Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 1951.
 Burda et al. (2018) Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
 Chentanez et al. (2005) Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 1281–1288, 2005.
 Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
 CoReyes et al. (2018) John D CoReyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and Sergey Levine. Selfconsistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813, 2018.
 Colas et al. (2018) Cédric Colas, Olivier Sigaud, and PierreYves Oudeyer. Curious: Intrinsically motivated multitask, multigoal reinforcement learning. arXiv preprint arXiv:1810.06284, 2018.
 Daskalakis and Pan (2014) Constantinos Daskalakis and Qinxuan Pan. A counterexample to karlin’s strong conjecture for fictitious play. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 11–20. IEEE, 2014.
 Duan et al. (2016) Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL^2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
 Eysenbach et al. (2018) Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
 Finn et al. (2016) Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1126–1135. JMLR. org, 2017.
 Fortunato et al. (2017) Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.
 Fu et al. (2017) Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
Fu et al. (2018) Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkHywlA.
Gupta et al. (2018) Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, pages 5302–5311, 2018.
Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 Hazan et al. (2018) Elad Hazan, Sham M Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
 Held et al. (2017) David Held, Xinyang Geng, Carlos Florensa, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366, 2017.
 Ho and Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
 Houthooft et al. (2016) Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
 Kaelbling (1993) Leslie Pack Kaelbling. Learning to achieve goals. In IJCAI, pages 1094–1099. Citeseer, 1993.
 Kappen et al. (2012) Hilbert J Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012.
 Levine (2018) Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Lynch (1996) Nancy A Lynch. Distributed algorithms. Elsevier, 1996.
Mishra et al. (2017) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141, 2017.
 Nair et al. (2018) Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9191–9200, 2018.
 Nash (1951) John Nash. Noncooperative games. Annals of mathematics, pages 286–295, 1951.
Oudeyer et al. (2007) Pierre-Yves Oudeyer, Frédéric Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
 Plappert et al. (2017) Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.
Plappert et al. (2018) Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
Pong et al. (2019) Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-Fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
Rakelly et al. (2019) Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.
Rawlik et al. (2013) Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
 Robinson (1951) Julia Robinson. An iterative method of solving a game. Annals of mathematics, pages 296–301, 1951.
 Schaul et al. (2015) Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320, 2015.
Schmidhuber (1991) Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222–227, 1991.
 Schmidhuber (2010) Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
 Schulman et al. (2015) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
 Stadie et al. (2015) Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
Tang et al. (2017) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.
 Theodorou and Todorov (2012) Evangelos A Theodorou and Emanuel Todorov. Relative entropy and free energy dualities: Connections to path integral and kl control. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 1466–1473. IEEE, 2012.
Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
Toussaint and Storkey (2006) Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning, pages 945–952. ACM, 2006.
 Xie et al. (2013) Dan Xie, Sinisa Todorovic, and SongChun Zhu. Inferring "dark matter" and "dark energy" from videos. In The IEEE International Conference on Computer Vision (ICCV), December 2013.
Xu et al. (2018) Tianbing Xu, Qiang Liu, Liang Zhao, and Jian Peng. Learning to explore with meta-policy gradient. arXiv preprint arXiv:1803.05044, 2018.
 Ziebart (2010) Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
 Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
 Ziebart et al. (2009) Brian D Ziebart, Nathan Ratliff, Garratt Gallagher, Christoph Mertz, Kevin Peterson, J Andrew Bagnell, Martial Hebert, Anind K Dey, and Siddhartha Srinivasa. Planningbased prediction for pedestrians. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3931–3936. IEEE, 2009.
Appendix A Graphical Models for State Marginal Matching
In Figure 4(a), we show that the State Marginal Matching objective can be viewed as a projection of the target distribution onto the set of realizable policies. Figures 4(b) and 4(c) illustrate the generative models for states in the single-policy and mixture-policy cases.



Appendix B A Simple Experiment


We consider a simple task to illustrate why action entropy is insufficient for distribution matching. We consider an MDP with two states and two actions, shown in Figure 6, and no reward function. A standard maximum action-entropy policy (e.g., SAC) will choose actions uniformly at random. However, because the self-loop in state A has a smaller probability than the self-loop in state B, the agent will spend 60% of its time in state B and only 40% of its time in state A. Thus, maximum action entropy policies will not yield uniform state distributions. We apply our method to learn a policy that maximizes state entropy. As shown in Figure 6, our method achieves the highest possible state entropy.
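This effect is easy to verify numerically. The sketch below power-iterates the Markov chain induced by a uniform-random policy; the self-loop probabilities are hypothetical values chosen to reproduce the 60%/40% split described above, not the exact dynamics of Figure 6.

```python
# Two-state MDP from Appendix B: under a uniform-random policy, the state
# marginal is not uniform. The self-loop probabilities are hypothetical,
# chosen so the agent spends 60% of its time in state B (as in the text).
P_STAY_A = 0.25  # self-loop probability in state A (smaller)
P_STAY_B = 0.50  # self-loop probability in state B (larger)

def stationary_distribution(p_stay_a, p_stay_b, iters=1000):
    """Power-iterate the induced Markov chain to its stationary distribution."""
    pa, pb = 0.5, 0.5  # arbitrary initial state distribution
    for _ in range(iters):
        pa, pb = (pa * p_stay_a + pb * (1 - p_stay_b),
                  pa * (1 - p_stay_a) + pb * p_stay_b)
    return pa, pb

pa, pb = stationary_distribution(P_STAY_A, P_STAY_B)
print(f"time in A: {pa:.2f}, time in B: {pb:.2f}")  # 0.40 and 0.60
```

Despite choosing actions uniformly, the agent's state marginal is skewed toward the state with the stickier self-loop.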
While we consider the case without a reward function, we can further show that there does not exist a reward function for which the optimal policy achieves a uniform state distribution. It is enough to consider the relative reward of states A and B. If the reward at state A is larger, then the agent will take actions to remain at state A as often as possible; the optimal policy remains at state A 91% of the time. If the reward at state B is larger, the optimal policy can remain at state B 100% of the time. If the two rewards are equal, all policies are optimal, and we have no guarantee that an arbitrary policy will have a uniform state distribution.
Appendix C Choosing the Target Distribution for Goal-Reaching Tasks
In general, the choice of the target distribution $p^*(s)$ will depend on the distribution of test-time tasks. In this section, we consider the special case where the test-time tasks correspond to goal-reaching, and derive the optimal target distribution $p^*(s)$.
We consider the setting where goals $g$ are sampled from some known distribution $p_g(g)$. Our goal is to minimize the number of episodes required to reach that goal state. We define reaching the goal state as visiting a state that lies within an $\epsilon$-ball of the goal, where both $\epsilon$ and the distance metric are known.
We start with a simple lemma that shows that the probability that we reach the goal at any state in a trajectory is at least the probability that we reach the goal at a randomly chosen state in that same trajectory. Defining the binary random variable $G_t \in \{0, 1\}$ as the event that the state at time $t$ reaches the goal state, we can formally state the claim as follows:
Lemma C.1.
$$\mathbb{P}\left(\exists\, t \in \{1, \cdots, T\} : G_t = 1\right) \;\ge\; \mathbb{P}_{t \sim \mathrm{Unif}[1, T]}\left(G_t = 1\right) \tag{10}$$
Proof.
We start by noting the following implication:
$$\left(G_t = 1,\; t \sim \mathrm{Unif}[1, T]\right) \implies \left(\exists\, t \in \{1, \cdots, T\} : G_t = 1\right) \tag{11}$$
Thus, the probability of the event on the RHS must be at least as large as the probability of the event on the LHS:
$$\mathbb{P}\left(\exists\, t \in \{1, \cdots, T\} : G_t = 1\right) \;\ge\; \mathbb{P}_{t \sim \mathrm{Unif}[1, T]}\left(G_t = 1\right) \tag{12}$$
∎
Next, we look at the expected number of episodes to reach the goal state. Since each episode is independent, the expected hitting time is simply
$$\mathbb{E}[\mathrm{HittingTime}(g)] = \frac{1}{\mathbb{P}\left(\exists\, t : G_t = 1\right)} \le \frac{1}{\mathbb{P}_{t \sim \mathrm{Unif}[1, T]}\left(G_t = 1\right)} \tag{13}$$
Note that we have upper-bounded the hitting time using Lemma C.1. Since the goal $g$ is a random variable, we take an expectation over $g$:
$$\mathbb{E}_{g \sim p_g}[\mathrm{HittingTime}(g)] \le \mathbb{E}_{g \sim p_g}\left[\frac{1}{\mathbb{P}_{t \sim \mathrm{Unif}[1, T]}\left(G_t = 1\right)}\right] \tag{14}$$
We can rewrite the RHS using $\rho_\pi(s)$ to denote the policy's state marginal distribution and $S_\epsilon(g)$ to denote the $\epsilon$-ball around goal $g$:
$$\mathbb{E}_{g \sim p_g}[\mathrm{HittingTime}(g)] \le \mathbb{E}_{g \sim p_g}\left[\frac{1}{\int_{S_\epsilon(g)} \rho_\pi(s)\, ds}\right] \tag{15}$$
We will minimize this quantity, an upper bound on the expected hitting time.
Lemma C.2.
The state marginal distribution $\rho_\pi(s) \propto \sqrt{\tilde{p}_g(s)}$ minimizes the upper bound in Equation 15, where $\tilde{p}_g(s) \triangleq \int_{S_\epsilon(s)} p_g(g)\, dg$ is a smoothed version of the target density.
Before diving into the proof, we provide a bit of intuition. In the case where $\epsilon \to 0$, the optimal target distribution is $\rho_\pi(s) \propto \sqrt{p_g(s)}$. For nonzero $\epsilon$, the policy in Lemma C.2 is equivalent to convolving $p_g$ with a box filter before taking the square root. In both cases, we see that the optimal policy does distribution matching to some function of the goal distribution. Note that $\tilde{p}_g$ may not sum to one and therefore is not a proper probability distribution.
Proof.
We start by forming the Lagrangian:
$$\mathcal{L}(\rho, \lambda) = \int \frac{p_g(g)}{\rho(g)}\, dg + \lambda \left( \int \rho(g)\, dg - 1 \right) \tag{16}$$
The first derivative is
$$\frac{\partial \mathcal{L}}{\partial \rho(g)} = -\frac{p_g(g)}{\rho(g)^2} + \lambda \tag{17}$$
Note that the second derivative is positive, indicating that this Lagrangian is convex, so all stationary points must be global minima:
$$\frac{\partial^2 \mathcal{L}}{\partial \rho(g)^2} = \frac{2\, p_g(g)}{\rho(g)^3} > 0 \tag{18}$$
Setting the first derivative equal to zero and rearranging terms, we obtain
$$\rho(g) = \sqrt{\frac{p_g(g)}{\lambda}} \propto \sqrt{p_g(g)} \tag{19}$$
Swapping $g$ for $s$ and replacing $p_g$ with its smoothed counterpart $\tilde{p}_g$ to account for the $\epsilon$-ball, we obtain the desired result. ∎
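The conclusion of Lemma C.2 can be sanity-checked numerically. The sketch below considers the discrete, $\epsilon \to 0$ case with a hypothetical three-goal distribution, and verifies that no randomly sampled state marginal achieves a smaller hitting-time bound than the square-root rule.

```python
import math
import random

# Discrete, eps -> 0 sanity check of Lemma C.2: among state marginals rho,
# rho(g) proportional to sqrt(p_g(g)) minimizes E_{g~p_g}[1 / rho(g)].
p_g = [0.7, 0.2, 0.1]  # hypothetical goal distribution over three goals

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

def bound(rho):
    """Upper bound on expected hitting time, E_{g~p_g}[1 / rho(g)]."""
    return sum(p / r for p, r in zip(p_g, rho))

sqrt_rule = normalize([math.sqrt(p) for p in p_g])
best = bound(sqrt_rule)  # equals (sum_g sqrt(p_g(g)))^2 by Cauchy-Schwarz

random.seed(0)
for _ in range(10_000):  # random search over the simplex never does better
    rho = normalize([random.random() + 1e-9 for _ in range(3)])
    assert bound(rho) >= best - 1e-12

print(f"sqrt-rule bound: {best:.4f}, direct matching: {bound(p_g):.4f}")
```

Note that directly matching the goal distribution ($\rho = p_g$) gives a strictly worse bound than the square-root rule whenever $p_g$ is non-uniform.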
Appendix D State Marginal Matching with Mixtures of Policies
D.1 Alternative Derivation via Information Theory
The language of information theory gives an alternate view on the mixture model objective in Equation 9. First, we recall that mutual information can be decomposed in two ways:
$$\mathcal{I}(s; z) = \mathcal{H}[s] - \mathcal{H}[s \mid z] = \mathcal{H}[z] - \mathcal{H}[z \mid s] \tag{20}$$
Thus, we have the following identity:
$$\mathcal{H}[s] = \mathcal{H}[s \mid z] + \mathcal{H}[z] - \mathcal{H}[z \mid s] \tag{21}$$
Plugging this identity into Equation 9, we see that our mixture policy approach is identical to the original SMM objective (Equation 2), albeit using a mixture of policies:
$$\max_\pi\; \mathbb{E}_{\substack{z \sim p(z) \\ s \sim \rho_{\pi_z}(s)}}\left[\log p^*(s)\right] + \mathcal{H}[s] \tag{22}$$
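The entropy identity used above is straightforward to check numerically on a small discrete joint distribution (the values below are arbitrary):

```python
import math

# Check H[s] = H[s|z] + H[z] - H[z|s] on an arbitrary joint p(z, s)
# over two latents and three states.
p_joint = {(0, 0): 0.20, (0, 1): 0.15, (0, 2): 0.05,
           (1, 0): 0.10, (1, 1): 0.25, (1, 2): 0.25}

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

# Marginals over states s and latents z.
p_s = [sum(p for (z, s), p in p_joint.items() if s == si) for si in range(3)]
p_z = [sum(p for (z, s), p in p_joint.items() if z == zi) for zi in range(2)]

H_s, H_z = entropy(p_s), entropy(p_z)
H_s_given_z = -sum(p * math.log(p / p_z[z]) for (z, s), p in p_joint.items())
H_z_given_s = -sum(p * math.log(p / p_s[s]) for (z, s), p in p_joint.items())

assert abs(H_s - (H_s_given_z + H_z - H_z_given_s)) < 1e-12
print("both decompositions of I(s; z) agree")
```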
D.2 Test-Time Adaptation via Latent Posterior Update
After acquiring our task-agnostic policy during training, at test time we want the policy to adapt to solve the test-time task. Fast adaptation belongs to the realm of meta-RL, for which prior work has proposed many algorithms (Duan et al., 2016; Finn et al., 2017; Rakelly et al., 2019). In our setting, we propose a lightweight meta-learning procedure that exploits the fact that we use a mixture of policies. Rather than adapting all parameters of our policy, we only adapt the frequency with which we sample each mixture component, which we can do simply via posterior sampling.
For simplicity, we consider test-time tasks that give sparse rewards. For each mixture component $z$, we model the probability that the agent obtains the sparse reward: $p(r = 1 \mid z)$. At the start of each episode, we sample the mixture component $z$ with probability proportional to the posterior probability that $\pi_z$ obtains the reward:
$$z \sim p(z \mid r = 1) \propto p(r = 1 \mid z)\, p(z) \tag{23}$$
Intuitively, this procedure biases us toward sampling skills that previously yielded high reward. Because posterior sampling over mixture components is a bandit problem, this approach is optimal (Agrawal and Jia, 2017) in the regime where we only adapt the mixture components. We use this procedure to quickly adapt to test-time tasks in Figures 2(b) and 2(c).
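A minimal sketch of this adaptation scheme treats each mixture component as an arm of a Bernoulli bandit ("did $\pi_z$ obtain the sparse reward?") with a Beta posterior; the per-skill success probabilities and the number of episodes below are illustrative assumptions, not values from our experiments.

```python
import random

# Posterior sampling over mixture components z. Each component is a bandit
# arm with a Beta(1, 1) prior over its success probability p(r=1 | z).
# TRUE_SUCCESS is a hypothetical stand-in for rolling out each skill.
random.seed(0)
TRUE_SUCCESS = [0.05, 0.10, 0.60]   # hypothetical p(r=1 | z) for each skill
alpha = [1.0, 1.0, 1.0]             # Beta posterior: 1 + #successes
beta = [1.0, 1.0, 1.0]              # Beta posterior: 1 + #failures

counts = [0, 0, 0]
for _ in range(2000):
    # Thompson sampling: draw a success rate from each posterior, pick argmax.
    draws = [random.betavariate(alpha[z], beta[z]) for z in range(3)]
    z = max(range(3), key=lambda i: draws[i])
    counts[z] += 1
    r = 1 if random.random() < TRUE_SUCCESS[z] else 0
    alpha[z] += r
    beta[z] += 1 - r

print("episodes allocated per skill:", counts)  # skill 2 dominates
```

After a handful of episodes, sampling concentrates on the component that actually obtains the reward, without touching the policy parameters.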
D.3 Connections to Goal-Conditioned RL
Goal-Conditioned RL (Kaelbling, 1993; Nair et al., 2018; Held et al., 2017) can be viewed as a special case of State Marginal Matching where the goal-sampling distribution is learned jointly with the policy. In particular, consider State Marginal Matching with a mixture policy (Algorithm 2), where the latent variable $z$ maps bijectively to goal states $g$. In this case, we learn goal-conditioned policies of the form $\pi(a \mid s, g)$. We start by swapping $z$ for $g$ in the SMM objective with Mixtures of Policies (Equation 9):
$$\max_\pi\; \mathbb{E}_{\substack{g \sim p(g) \\ s \sim \rho_{\pi_g}(s)}}\left[\log p^*(s) + \log p(g \mid s) - \log \rho_{\pi_g}(s) - \log p(g)\right] \tag{24}$$
The second term, $p(g \mid s)$, is an estimate of which goal the agent is trying to reach, similar to objectives in intent inference (Ziebart et al., 2009; Xie et al., 2013). The third term, $\rho_{\pi_g}(s)$, is the distribution over states visited by the policy when attempting to reach goal $g$. For an optimal goal-conditioned policy in an infinite-horizon MDP, both of these terms are Dirac functions:
$$\rho_{\pi_g}(s) = \delta(s - g) \qquad \text{and} \qquad p(g \mid s) = \delta(g - s) \tag{25}$$
In this setting, the State Marginal Matching objective simply says to sample goals with probability equal to the density of that goal under the target distribution:
$$p(g) = p^*(g) \tag{26}$$
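As a sanity check, this reduction can be verified numerically: over goal distributions $p(g)$, the surviving objective $\mathbb{E}_{g \sim p(g)}[\log p^*(g) - \log p(g)] = -D_{KL}(p \,\|\, p^*)$ is maximized at $p = p^*$. The target density below is hypothetical.

```python
import math
import random

# Over goal distributions p(g), E_{g~p}[log p*(g) - log p(g)] = -KL(p || p*)
# is maximized (at zero) by p = p*.
p_star = [0.5, 0.3, 0.2]  # hypothetical target goal density

def objective(p):
    return sum(pi * (math.log(ps) - math.log(pi))
               for pi, ps in zip(p, p_star) if pi > 0)

assert abs(objective(p_star)) < 1e-12  # KL(p* || p*) = 0

random.seed(0)
for _ in range(10_000):  # no random goal distribution does better
    w = [random.random() + 1e-9 for _ in range(3)]
    s = sum(w)
    assert objective([x / s for x in w]) <= 1e-12

print("p(g) = p*(g) maximizes the objective")
```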
Whether goal-conditioned RL is the preferable way to do distribution matching depends on (1) the difficulty of sampling goals and (2) the supervision that will be provided at test time. It is natural to use goal-conditioned RL in settings where it is easy to sample goals, such as when the space of goals is small and finite or otherwise low-dimensional. If a large collection of goals is available a priori, we could use importance sampling to generate goals for training the goal-conditioned policy (Pong et al., 2019). However, in many real-world settings, goals are high-dimensional observations (e.g., images, lidar) that are challenging to sample. While goal-conditioned RL is likely the right approach when a goal will be specified at test time, a latent-conditioned policy may explore better in settings where the goal state is not provided at test time.
Appendix E Additional Experiments & Experimental Details
E.1 Environment Details
Both the Manipulation and 3D Navigation tasks were implemented using the MuJoCo simulator (Todorov et al., 2012). We summarize environment parameters in Table 1.
Manipulation. We use the simulated Fetch Robotics arm (https://fetchrobotics.com/) implemented by Plappert et al. (2018). The state vector includes the action taken by the robot and the xyz-coordinates of the block and the robot gripper, respectively. In Manipulation-Uniform, the target state marginal distribution is given by
$$p^*(s) \propto \exp\left(\alpha_1 r_1(s) + \alpha_2 r_2(s) + \alpha_3 r_3(s)\right)$$
where $\alpha_1, \alpha_2, \alpha_3$ are fixed weights, and the rewards $r_1(s), r_2(s), r_3(s)$
correspond to (1) a uniform distribution of the block position over the table surface (the agent receives +0 reward while the block is on the table), (2) an indicator reward for moving the robot gripper close to the block, and (3) an action penalty, respectively. In Manipulation-Half, $r_1(s)$ is replaced by a reward function that gives a slightly higher reward (+0.1) for states where the block is on the right side of the table. During training, all policies are trained on a weighted sum of the three reward terms, $r(s) = \alpha_1 r_1(s) + \alpha_2 r_2(s) + \alpha_3 r_3(s)$. At test time for Manipulation-Uniform, we sample a goal block location uniformly across the table, and record the number of episodes until the agent finds the goal.
Navigation: Episodes have a maximum time horizon of 100 steps and 500 steps for 2D and 3D navigation, respectively. The environment reward is a function of $(x, y)$, the xy-position of the agent. Except in Figure 2, where we vary the number of halls, we use 3 halls for all 2D and 3D Navigation experiments. In Figures 2 and 9(a), we use a uniform target distribution over the ends of all halls, so at training time the agent receives the environment reward if it is close enough to the end of any of the halls. In Figure 9(b) (3D Navigation), a goal is sampled in one of the three halls, and the agent must explore to find the goal.
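As a concrete illustration, a sparse hall-reaching reward of this form can be sketched as follows; the hall endpoints, the distance threshold, and the reward value are illustrative assumptions, not the paper's exact constants:

```python
import math

# Sparse hall-reaching reward: nonzero only within a radius EPS of the end
# of one of the N halls (all constants below are hypothetical).
HALL_ENDS = [(5.0, 0.0), (0.0, 5.0), (-5.0, 0.0)]  # N = 3 hall endpoints
EPS = 0.5

def reward(x, y):
    """Return 1 if (x, y) is within EPS of any hall end, else 0."""
    return 1.0 if any(math.hypot(x - hx, y - hy) <= EPS
                      for hx, hy in HALL_ENDS) else 0.0

assert reward(5.0, 0.1) == 1.0   # near the end of the first hall
assert reward(0.0, 0.0) == 0.0   # at the junction: no reward
```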
Table 1: Environment parameters.

Domain | dim(S) | dim(A) | Max steps | Env | Env Reward / Target | Figure
Manipulation | 10 | 4 | 50 | Uniform | Uniform block pos. over table surface | 3, 10
Manipulation | 10 | 4 | 50 | Half | More block pos. density on left-half of table | 4
Navigation | 2 | 2 | 100 | 2D | Uniform over all N halls | 2
Navigation | 2 | 2 | 100 | 2D | Uniform over all N halls | 9(a)
Navigation | 113 | 7 | 500 | 3D | One (unobserved) goal hallway | 9(b)
E.2 Visualizing the Manipulation Environment
We visualize the log state marginal over block XY-coordinates in Figures 7 and 8. In Figure 9, we plot goals sampled at test time, colored by the number of episodes each method required to push the block to that goal location. Blue dots indicate that the agent found the goal quickly. We observe that SMM has the most blue dots, indicating that it succeeds in exploring a wide range of states at test time.
E.3 Additional Experimental Results
To understand the relative contribution of each component in the mixture-case SMM objective (Equation 9), we compare our method to baselines that lack the conditional state entropy term, the latent-conditional action entropy term, or both (i.e., SAC). We evaluate on 2D Navigation (Figure 9(a)) and 3D Navigation (Figure 9(b)). Results show that our method relies heavily on both key differences from SAC.
We show training curves for Manipulation-Half in Figure 11.



E.4 Experiment Details
Hyperparameter settings are summarized in Table 2. All algorithms were trained for 1K epochs (1M environment steps) for Manipulation and 3D Navigation, and 100 epochs (100K environment steps) for 2D Navigation.
Loss Hyperparameters. The SAC reward scale controls the action entropy reward relative to the extrinsic reward. The count coefficient controls the intrinsic count-based exploration reward relative to the extrinsic reward and the SAC action entropy reward. Similarly, the pseudocount coefficient controls the intrinsic pseudocount exploration reward. The SMM coefficients control the different loss components (state entropy and latent conditional entropy) of the SMM objective in Equation 9.
Historical Averaging. In the Manipulation experiments, we tried the following sampling strategies for historical averaging: (1) Uniform: sample policies uniformly across training iterations. (2) Exponential: sample policies with recent policies sampled exponentially more often than earlier ones. (3) Last: sample the latest policies uniformly at random. We found that Uniform worked less well, possibly because the policies at early iterations were not trained enough. We found a negligible difference in the state entropy metric between Exponential and Last, and between sampling 5 vs. 10 historical policies. Note that since we only sample 10 checkpoints, it is unnecessary to keep checkpoints from every iteration.
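The three sampling strategies can be sketched as follows; the function name, the exponential decay rate, and the default window size are illustrative assumptions:

```python
import random

# Checkpoint-sampling strategies for historical averaging. Checkpoints are
# indexed 0 (oldest) .. n-1 (latest); decay and n_last are hypothetical.
def sample_checkpoint(n, strategy, n_last=10, decay=0.9, rng=random):
    if strategy == "uniform":      # uniform over all saved checkpoints
        return rng.randrange(n)
    if strategy == "exponential":  # recent checkpoints exponentially more likely
        weights = [decay ** (n - 1 - i) for i in range(n)]
        return rng.choices(range(n), weights=weights, k=1)[0]
    if strategy == "last":         # uniform over the n_last most recent
        return rng.randrange(max(0, n - n_last), n)
    raise ValueError(f"unknown strategy: {strategy}")

random.seed(0)
picks = [sample_checkpoint(100, "last") for _ in range(1000)]
assert min(picks) >= 90 and max(picks) <= 99  # only the 10 latest are sampled
```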
Network Hyperparameters. For all algorithms, we use a Gaussian policy with two hidden layers of size (300, 300) with Tanh activation and a final fully-connected layer. The value function and Q-function are each a feedforward MLP with two hidden layers of size (300, 300) with ReLU activation and a final fully-connected layer. The same network configuration is used for the SMM discriminator, but with different input and output sizes. The SMM density model is a VAE with encoder and decoder networks each consisting of two hidden layers of size (150, 150) with ReLU activation. The same VAE network configuration is used for Pseudocount.
Table 2: Hyperparameter settings.

Environment | Algorithm | Hyperparameters Used | Hyperparameters Considered
All | All |  | N/A (default SAC hyperparameters)
Manip.-Uniform | SMM |  |
Manip.-Uniform | SAC | SAC reward scale: 0.1 | SAC reward scale: 0.1, 1, 10, 100
Manip.-Uniform | Count |  |
Manip.-Uniform | Pseudocount |  |
Manip.-Uniform | ICM | Learning rate: 1e-3 | Learning rate: 1e-4, 1e-3, 1e-2
Manip.-Half | All | SAC reward scale: 0.1 | (Best reward scale for Manip.-Uniform)
Manip.-Half | SMM |  |
Manip.-Half | Count |  |
Manip.-Half | ICM | Learning rate: 1e-3 | Learning rate: 1e-4, 1e-3, 1e-2
2D Navigation | All | SAC reward scale: 25 | SAC reward scale: 1e-2, 0.1, 1, 10, 25, 100
2D Navigation | SMM |  |
3D Navigation | All | SAC reward scale: 25 | SAC reward scale: 1e-2, 0.1, 1, 10, 25, 100
3D Navigation | SMM |  |
# Clusters | # GPUs | GPU | CUDA | NVIDIA Driver
3 | 4 | GeForce RTX 2080 Ti | 10.1 | 418.43
1 | 4 | Titan X | 10.1 | 418.43