Evaluate The Quality Of The Output [Batch 3, Version 2.1]

Instructions

Input: What are the baselines?

Introduction and Background

Many reinforcement learning algorithms are designed for relatively small discrete or continuous action spaces and so have trouble scaling. Text-adventure games, or interactive fictions, are simulations in which both the agent's state space and action space are expressed in textual natural language. An example of a single-turn agent interaction in the popular text game Zork1 can be seen in Fig. FIGREF1. Text-adventure games present multiple challenges: partial observability, commonsense reasoning, and a combinatorially sized state-action space.

Text-adventure games are structured as long puzzles or quests, interspersed with bottlenecks. The quests can usually be completed through multiple branching paths; however, games can also feature one or more bottlenecks. Bottlenecks are areas that an agent must pass through in order to progress to the next section of the game, regardless of which path the agent took to complete the preceding section of the quest BIBREF0. In this work, we focus on more effectively exploring this space and surpassing these bottlenecks, building on prior work that tackles the other problems.

Formally, we use the definition of text-adventure games given in BIBREF1 and BIBREF2. These games are partially observable Markov decision processes (POMDPs), represented as a 7-tuple $\langle S, T, A, \Omega, O, R, \gamma \rangle$: the set of environment states, the (mostly deterministic) conditional transition probabilities between states, the vocabulary of words used to compose text commands, the observations returned by the game, the observation conditional probabilities, the reward function, and the discount factor, respectively. For our purposes, understanding the exact state and action spaces we use in this work is critical, so we define each of them in relative depth.

Action-Space. Solving Zork1, the canonical text-adventure game, requires generating actions consisting of up to five words from a relatively modest vocabulary of 697 words recognized by the game's parser. This results in $\mathcal{O}(697^5) \approx 1.64 \times 10^{14}$ possible actions at every step. To facilitate text-adventure game playing, BIBREF2 introduce Jericho, a framework for interacting with text games. They propose a template-based action space in which the agent first selects a template, consisting of an action verb and preposition, and then fills it in with relevant entities (e.g. [get] ___ [from] ___). Zork1 has 237 templates, each with up to two blanks, yielding a template-action space of size $\mathcal{O}(237 \times 697^2) \approx 1.15 \times 10^8$. This space is still far larger than those used by most previous approaches applying reinforcement learning to text-based games.
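As a quick sanity check on the two action-space sizes quoted above, the short Python snippet below recomputes both counts from the numbers given in the text (a 697-word vocabulary, commands of up to five words, 237 templates with at most two blanks). The variable names are ours and purely illustrative.

# Back-of-the-envelope count of Zork1's action spaces, using only the
# numbers reported in the text (illustrative; not taken from Jericho's code).
VOCAB_SIZE = 697        # words recognized by Zork1's parser
MAX_WORDS = 5           # maximum command length in words
NUM_TEMPLATES = 237     # action templates extracted by Jericho
MAX_BLANKS = 2          # each template has at most two entity blanks

word_level_actions = VOCAB_SIZE ** MAX_WORDS                   # ~1.64e14
template_actions = NUM_TEMPLATES * VOCAB_SIZE ** MAX_BLANKS    # ~1.15e8

print(f"word-level action space:     {word_level_actions:.2e}")
print(f"template-based action space: {template_actions:.2e}")
print(f"reduction factor:            {word_level_actions / template_actions:.1e}")

The template space is roughly six orders of magnitude smaller than the raw word-level space, which is the motivation for the template-based formulation, yet it is still on the order of a hundred million actions per step.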
State-Representation. Prior work has shown that knowledge graphs are effective at dealing with the challenges of partial observability (BIBREF3; BIBREF4). A knowledge graph is a set of 3-tuples of the form $\langle subject, relation, object \rangle$. These triples are extracted from the observations using Stanford's Open Information Extraction (OpenIE) BIBREF5. Human-made text-adventure games often contain relatively complex, semi-structured information that OpenIE is not designed to parse, so additional rules are added to ensure that the correct information is extracted. The graph itself is more or less a map of the world, with information about objects' affordances and attributes linked to the rooms in which they are placed. The graph also distinguishes between items that are in the agent's possession and items in its immediate surrounding environment. An example of the knowledge graph and specific implementation details can be found in Appendix SECREF14.

BIBREF6 introduce KG-A2C, which uses a knowledge-graph-based state representation to aid in the selection of actions in a combinatorially sized action space; specifically, they use the knowledge graph to constrain the kinds of entities that can fill the blanks in the template action space. They test their approach on Zork1, showing that the combination of the knowledge graph and template action selection results in improvements over existing methods. They note that their approach reaches a score of 40, which corresponds to a bottleneck in Zork1 where the player is eaten by a "grue" (resulting in negative reward) if the player has not first lit a lamp. The lamp must be lit many steps after it is first encountered, in a different section of the game; this action is necessary to continue exploring but does not immediately produce any positive reward. That is, there is a long-term dependency between actions that is not immediately rewarded, as seen in Figure FIGREF1. Others using artificially constrained action spaces also report an inability to pass through this bottleneck BIBREF7, BIBREF8. Such bottlenecks pose a significant challenge for these methods because the agent does not see the correct action sequence often enough for it to be reinforced: for that to happen, the agent needs to reach the next possible reward beyond the bottleneck. More efficient exploration strategies are required to pass bottlenecks.

Our contributions are two-fold. We first introduce a method that detects bottlenecks in text games using the overall reward gained and the knowledge graph state. This method freezes the policy used to reach the bottleneck and restarts training from that point onward, additionally conducting a backtracking search to ensure that a sub-optimal policy has not been frozen. Our second contribution explores how to leverage knowledge graphs to improve existing exploration algorithms for combinatorial action spaces, such as Go-Explore BIBREF9. We additionally present a comparative ablation study analyzing the performance of these methods on the popular text game Zork1.
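Before moving on to the exploration methods, here is a minimal sketch of the knowledge-graph-constrained template filling that KG-A2C relies on. This is our own illustrative reconstruction, not the authors' implementation: the graph is held as a plain set of (subject, relation, object) triples, a template's blanks may only be filled with entities currently present in the graph, and the triple-extraction step is stubbed out entirely.

from itertools import product

# Minimal sketch (not the KG-A2C implementation): a knowledge graph kept as a
# set of (subject, relation, object) triples, used to restrict which entities
# may fill the blanks of a Jericho-style template such as "put ___ in ___".

def entities(kg):
    """Collect every entity mentioned as a subject or object in the graph."""
    return {s for s, _, _ in kg} | {o for _, _, o in kg}

def grounded_actions(template, kg):
    """Fill a template's blanks with all combinations of known entities."""
    num_blanks = template.count("___")
    actions = []
    for combo in product(entities(kg), repeat=num_blanks):
        action = template
        for entity in combo:
            action = action.replace("___", entity, 1)
        actions.append(action)
    return actions

# Toy graph built from a few observation-derived triples.
kg = {
    ("you", "are in", "kitchen"),
    ("lamp", "is in", "kitchen"),
    ("you", "have", "sword"),
}

print(grounded_actions("turn on ___", kg))          # e.g. ["turn on lamp", ...]
print(len(grounded_actions("put ___ in ___", kg)))  # two-blank template

The point of the constraint is visible even in this toy example: instead of considering every vocabulary word for each blank, the agent only considers the handful of entities the graph currently knows about.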
Exploration Methods

In this section, we describe methods for exploring combinatorially sized action spaces such as those of text games, focusing especially on methods that can deal with their inherent bottleneck structure. We first describe our method that explicitly attempts to detect bottlenecks, and then describe how an exploration algorithm such as Go-Explore BIBREF9 can leverage knowledge graphs.

KG-A2C-chained

An example of a bottleneck can be seen in Figure FIGREF1. We extend the KG-A2C algorithm as follows. First, we detect bottlenecks as states from which the agent is unable to progress any further. We set a patience parameter; if the agent has not seen a higher score within patience steps, it assumes it has been limited by a bottleneck.

Second, when a bottleneck is found, we freeze the policy that gets the agent to the state with the highest score. The agent then begins training a new policy from that particular state. Simply freezing the policy that led to the bottleneck, however, can result in a policy that is globally sub-optimal. We therefore employ a backtracking strategy that restarts exploration from each of the $n$ previous steps, searching for a more optimal policy that reaches that bottleneck. At each step, we keep track of a buffer of the $n$ states and admissible actions that led up to that locally optimal state. We force the agent to explore from this state to attempt to drive it out of the local optimum. If it is still unable to escape the local optimum, we restart the training process again, starting from the state immediately before the agent reaches the local optimum. If this continues to fail, we keep iterating through the buffer of seen states leading up to the local optimum until we either find a more optimal state or run out of states to restart from, at which point we terminate the training algorithm.
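The following is a minimal sketch of the chaining logic described above, under our own simplifying assumptions: a generic agent interface (train_step, freeze_policy, start_new_policy, explore_until_improvement) and an environment with snapshot restore stand in for KG-A2C and Jericho. Only the control flow mirrors the description; every named method is a hypothetical placeholder.

from collections import deque

# Minimal sketch of the bottleneck-chaining idea (not the authors' code).
# `train_step`, `restore`, and the other agent hooks are hypothetical.

def chained_training(agent, env, patience=1000, backtrack=10, max_steps=100_000):
    best_score = float("-inf")
    steps_since_improvement = 0
    buffer = deque(maxlen=backtrack)   # last `backtrack` (state, score) snapshots
    frozen_policies = []               # chain of policies frozen at bottlenecks

    for _ in range(max_steps):
        score, state = agent.train_step(env)   # one training interaction (hypothetical)
        buffer.append((state, score))

        if score > best_score:
            best_score = score
            steps_since_improvement = 0
        else:
            steps_since_improvement += 1

        # Patience exhausted: assume the agent is stuck at a bottleneck.
        if steps_since_improvement >= patience:
            frozen_policies.append(agent.freeze_policy())  # freeze the policy that reached the best state
            # Backtrack through recent states, training a fresh policy from each,
            # until one of them eventually improves on the best score.
            while buffer:
                restart_state, _ = buffer.pop()
                env.restore(restart_state)                 # hypothetical snapshot restore
                agent.start_new_policy()
                if agent.explore_until_improvement(env, best_score, patience):
                    break
            else:
                return frozen_policies                     # ran out of states: terminate
            steps_since_improvement = 0

    return frozen_policies

The design choice this sketch tries to capture is that each frozen policy is only responsible for reaching one bottleneck; everything after that point is learned by a new policy, which keeps the long-horizon credit-assignment problem local to each segment.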
KG-A2C-Explore

Go-Explore BIBREF9 is an algorithm designed to keep track of sub-optimal and under-explored states so that the agent can explore from more promising states that might otherwise be missed due to sparse rewards. The Go-Explore algorithm consists of two phases: the first continuously explores until a set of promising states and corresponding trajectories is found on the basis of total score, and the second robustifies the found policy against potential stochasticity in the game. Promising states are defined as states that, when explored from, are likely to result in higher-reward trajectories. Since the text games we are dealing with are mostly deterministic, with the exception of Zork in its later stages, we focus only on using Phase 1 of the Go-Explore algorithm to find an optimal policy.

BIBREF10 apply Go-Explore to text games on a set of simpler games generated using the game-generation framework TextWorld BIBREF1. Instead of training a policy network in parallel to generate actions used for exploration, they use a small set of "admissible actions" (actions guaranteed to change the world state at any given step) to explore and find high-reward trajectories during Phase 1. This space of actions is relatively small (on the order of $10^2$ per step), so finding high-reward trajectories this way in larger action spaces such as Zork's would be infeasible.

Go-Explore maintains an archive of cells, defined as sets of states that map to a single representation, to keep track of promising states. BIBREF9 encode each cell simply by keeping track of the agent's position, and BIBREF10 use the textual observations encoded by a recurrent neural network as the cell representation. We improve on this implementation by training the KG-A2C network in parallel, using the snapshot of the knowledge graph in conjunction with the game state to encode the current state, and using this as the cell representation. At each step, Go-Explore chooses a cell to explore at random, weighted by score to prefer more advanced cells. KG-A2C then runs for a number of steps, starting from the knowledge graph state and the last seen game state of that cell. This generates a trajectory for the agent while further training KG-A2C at each iteration, producing a new knowledge graph representation as well as a new game state for the cell. After expanding a cell, Go-Explore continues to sample cells by weight to keep expanding its set of known states. At the same time, KG-A2C benefits from the heuristic of selecting preferred cells and is trained on promising states more often.
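The following is a minimal sketch of the Phase 1 cell archive just described, under our own assumptions: cells are keyed by a hashable encoding of the knowledge graph snapshot plus game state, and cells are sampled with probability proportional to their best score. The encode_cell, env, and agent.rollout functions are hypothetical placeholders, not the paper's implementation.

import random

# Minimal sketch of a Go-Explore-style Phase 1 archive (illustrative only).
# A "cell" groups states that share one representation; here the key is a
# hashable encoding of (knowledge-graph snapshot, observation), standing in
# for the learned representation described above.

class CellArchive:
    def __init__(self):
        self.cells = {}   # key -> dict(score, state, trajectory)

    def add(self, key, score, state, trajectory):
        """Keep the highest-scoring snapshot seen for each cell."""
        best = self.cells.get(key)
        if best is None or score > best["score"]:
            self.cells[key] = {"score": score, "state": state, "trajectory": trajectory}

    def sample(self):
        """Pick a cell to expand, weighted by score to prefer advanced cells."""
        keys = list(self.cells)
        weights = [max(self.cells[k]["score"], 1e-3) for k in keys]  # avoid zero weights
        return self.cells[random.choices(keys, weights=weights, k=1)[0]]

def phase1(env, agent, encode_cell, iterations=1000, rollout_len=50):
    archive = CellArchive()
    obs, kg, state = env.reset()             # hypothetical reset returning a snapshot
    archive.add(encode_cell(kg, obs), 0, state, [])
    for _ in range(iterations):
        cell = archive.sample()
        env.restore(cell["state"])           # hypothetical snapshot restore
        # Roll the agent out from the chosen cell; the same trajectory is also
        # used as on-policy training data for the agent.
        for obs, kg, state, score, traj in agent.rollout(env, rollout_len):
            archive.add(encode_cell(kg, obs), score, state, cell["trajectory"] + traj)
    return archive

In this sketch the score-weighted sampling is what realizes the "prefer more advanced cells" heuristic: higher-scoring cells are restored and expanded more often, so the agent spends more of its training interactions near the frontier of the game.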
Evaluation

We compare our two exploration strategies to the following baselines and ablations:

KG-A2C: the exact method presented in BIBREF6, with no modifications.

A2C: the same approach as KG-A2C but with all knowledge graph components removed; the state representation is text only, encoded using recurrent networks.

A2C-chained: a variation of KG-A2C-chained in which our policy-chaining approach is used with the A2C method to train the agent instead of KG-A2C.

A2C-Explore: uses A2C together with the exploration strategy of KG-A2C-Explore; the cell representations here are defined in terms of the recurrent-network-based encoding of the textual observation.

Figure FIGREF10 shows that agents using knowledge graphs in addition to either enhanced exploration method far outperform the baseline A2C and KG-A2C. KG-A2C-chained and KG-A2C-Explore both pass the bottleneck score of 40, whereas A2C-Explore reaches the bottleneck but cannot surpass it. A couple of key insights can be drawn from these results. The first is that the knowledge graph appears to be critical; it is theorized to help with partial observability. However, the knowledge graph representation alone is not sufficient: without enhanced exploration methods, it cannot surpass the bottleneck. A2C-chained, which explores without a knowledge graph, fails even to outperform the baseline A2C. We hypothesize that this is because the knowledge graph implicitly aids the sample efficiency of bottleneck detection and subsequent exploration. That is, exploring after backtracking from a potentially detected bottleneck is much more efficient in the knowledge graph-based agent.