Evaluate The Quality Of The Output [Batch 3, Version 2.1]
Instructions Input: What are the baselines?
Introduction and Background
Many reinforcement learning algorithms are designed for relatively small discrete or continuous action spaces and so have trouble scaling. Text-adventure games, or interactive fictions, are simulations in which both an agent's state and action spaces are in textual natural language. An example of a one-turn agent interaction in the popular text-game Zork1 can be seen in Fig. FIGREF1. Text-adventure games present multiple challenges in the form of partial observability, commonsense reasoning, and a combinatorially-sized state-action space. Text-adventure games are structured as long puzzles or quests that can usually be completed through multiple branching paths. However, games can also feature one or more bottlenecks: areas that an agent must pass through in order to progress to the next section of the game, regardless of what path the agent has taken to complete that section of the quest BIBREF0. In this work, we focus on more effectively exploring this space and surpassing these bottlenecks, building on prior work that tackles the other problems.
Formally, we use the definition of text-adventure games seen in BIBREF1 and BIBREF2. These games are partially observable Markov decision processes (POMDPs), represented as a 7-tuple $\langle S, T, A, \Omega, O, R, \gamma \rangle$ representing the set of environment states, the mostly deterministic conditional transition probabilities between states, the vocabulary of words used to compose text commands, the observations returned by the game, the observation conditional probabilities, the reward function, and the discount factor, respectively. For our purposes, understanding the exact state and action spaces we use in this work is critical, and so we define each of these in relative depth.
Action-Space. Solving Zork1, a canonical text-adventure game, requires generating actions consisting of up to five words from a relatively modest vocabulary of 697 words recognized by the game's parser. This results in $\mathcal{O}(697^5) \approx 1.64 \times 10^{14}$ possible actions at every step. To facilitate text-adventure game playing, BIBREF2 introduce Jericho, a framework for interacting with text-games. They propose a template-based action space in which the agent first selects a template, consisting of an action verb and preposition, and then fills it in with relevant entities (e.g. [get] ___ [from] ___). Zork1 has 237 templates, each with up to two blanks, yielding a template-action space of size $\mathcal{O}(237 \times 697^2) \approx 1.15 \times 10^{8}$. This space is still far larger than those used by most previous approaches applying reinforcement learning to text-based games.
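The two space sizes quoted above follow directly from the counts in the text (697 vocabulary words, 237 templates); a quick sanity check:

```python
# Sanity check on the action-space sizes quoted above for Zork1.
vocab = 697      # words recognized by the game's parser
templates = 237  # action templates, each with up to two blanks

full_space = vocab ** 5                  # every possible five-word command
template_space = templates * vocab ** 2  # template plus two entity slots

print(f"full action space:     {full_space:.2e}")      # ~1.64e+14
print(f"template action space: {template_space:.2e}")  # ~1.15e+08
print(f"reduction factor:      {full_space // template_space}x")
```

Even after this roughly million-fold reduction, over $10^8$ actions per step remains far beyond typical discrete-action reinforcement learning benchmarks.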
State-Representation. Prior work has shown that knowledge graphs are effective at dealing with the challenges of partial observability (BIBREF3; BIBREF4). A knowledge graph is a set of 3-tuples of the form $\langle subject, relation, object \rangle$. These triples are extracted from the observations using Stanford's Open Information
Extraction (OpenIE) BIBREF5. Human-made text-adventure games often contain relatively complex semi-structured information that OpenIE is not designed to parse, and so they add additional rules to ensure that the correct information is extracted. The graph itself is more or less a map of the world, with information about objects' affordances and attributes linked to the rooms in which they are placed. The graph also distinguishes between items in the agent's possession and items in the immediate surrounding environment. An example of the knowledge graph and specific implementation details can be found in Appendix SECREF14.
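The triple-based state representation above can be sketched as a minimal store. This is an illustrative sketch only: the class, the relation names ("in", "has"), and the example entities are assumptions, not the authors' implementation.

```python
# Minimal sketch of the knowledge-graph state described above.
# Relation names ("in", "has") are illustrative, not the paper's schema.
class KnowledgeGraph:
    def __init__(self):
        self.triples = set()  # {(subject, relation, object), ...}

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def objects_in(self, room):
        """Entities located in a room: the 'map of the world' view."""
        return {s for (s, r, o) in self.triples if r == "in" and o == room}

    def inventory(self, agent="you"):
        """Items in the agent's possession, kept distinct from surroundings."""
        return {o for (s, r, o) in self.triples if s == agent and r == "has"}

kg = KnowledgeGraph()
kg.add("brass lantern", "in", "living room")  # object-room link
kg.add("you", "has", "sword")                 # possession link
print(kg.objects_in("living room"))  # {'brass lantern'}
print(kg.inventory())                # {'sword'}
```

Keeping possession ("has") and location ("in") as distinct relations is what lets the graph separate inventory from surroundings, as the paragraph above describes.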
BIBREF6 introduce KG-A2C, which uses a knowledge-graph-based state representation to aid in the selection of actions in a combinatorially-sized action space; specifically, they use the knowledge graph to constrain the kinds of entities that can fill the blanks in the template action space. They test their approach on Zork1, showing that the combination of the knowledge graph and template action selection results in improvements over existing methods. They note that their approach reaches a score of 40, which corresponds to a bottleneck in Zork1 where the player is eaten by a "grue" (resulting in negative reward) if the player has not first lit a lamp. The lamp must be lit many steps after first being encountered, in a different section of the game; this action is necessary to continue exploring but does not immediately produce any positive reward. That is, there is a long-term dependency between actions that is not immediately rewarded, as seen in Figure FIGREF1. Others using artificially constrained action spaces also report an inability to pass through this bottleneck BIBREF7, BIBREF8. Bottlenecks pose a significant challenge for these methods because the agent does not see the correct action sequence to pass the bottleneck enough times. This is in part because, for that sequence to be reinforced, the agent needs to reach the next possible reward beyond the bottleneck.
More efficient exploration strategies are required to pass bottlenecks. Our contributions are two-fold. We first introduce a method that detects bottlenecks in text-games using the overall reward gained and the knowledge graph state. This method freezes the policy used to reach the bottleneck and restarts training from there on out, additionally conducting a backtracking search to ensure that a sub-optimal policy has not been frozen. The second contribution explores how to leverage knowledge graphs to improve existing exploration algorithms for dealing with combinatorial action spaces, such as Go-Explore BIBREF9. We additionally present a comparative ablation study analyzing the performance of these methods on the popular text-game Zork1.
Exploration Methods
In this section, we describe methods to explore the combinatorially-sized action spaces of text-games, focusing especially on methods that can deal with their inherent bottleneck structure. We first describe our method that explicitly attempts to detect bottlenecks, and then describe how an exploration algorithm such as Go-Explore BIBREF9 can leverage knowledge graphs.
KG-A2C-chained An example of a bottleneck can be seen in Figure FIGREF1. We extend the KG-A2C algorithm as follows. First, we detect bottlenecks as states where the agent is unable to progress any further. We set a patience parameter; if the agent has not seen a higher score in patience steps, it assumes it has been limited by a bottleneck. Second, when a bottleneck is found, we freeze the policy that gets the agent to the state with the highest score. The agent then begins training a new policy from that particular state.
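The patience check described above can be sketched as follows; the function name and the toy score history are illustrative assumptions, not the authors' code.

```python
# Patience-based bottleneck detection, as described above. Only the
# detection logic is sketched; the training loop is elided.
def detect_bottleneck(score_history, patience):
    """Return True if the best score has not improved in `patience` steps."""
    if len(score_history) <= patience:
        return False  # not enough history to judge
    best_before = max(score_history[:-patience])
    recent_best = max(score_history[-patience:])
    return recent_best <= best_before  # no new high score in the window

scores = [0, 5, 10, 10, 10, 10, 10]
print(detect_bottleneck(scores, patience=4))  # True: no gain in last 4 steps
print(detect_bottleneck(scores, patience=6))  # False: score rose in window
```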
Simply freezing the policy that led to the bottleneck, however, can potentially result in a policy that is globally sub-optimal. We therefore employ a backtracking strategy that restarts exploration from each of the $n$ previous steps, searching for a more optimal policy that reaches that bottleneck. At each step, we keep track of a buffer of the $n$ states and admissible actions that led up to that locally optimal state. We force the agent to explore from this state to attempt to drive it out of the local optimum. If it is still unable to find its way out, we restart the training process, but starting at the state immediately before the agent reaches the local optimum. If this continues to fail, we iterate through the buffer of seen states leading up to that local optimum until we either find a more optimal state or run out of states to restart from, in which case we terminate the training algorithm.
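The backtracking loop above can be sketched as follows. Here `train_from` is a hypothetical stand-in for restarting KG-A2C training from a saved state, and the toy escape table is purely illustrative.

```python
# Backtracking over the buffer of the last n states, as described above.
# `train_from` is a hypothetical hook: restart training from a saved state
# and return the best score reached.
def backtrack(buffer, bottleneck_score, train_from):
    """Walk backwards through saved states until one escapes the bottleneck."""
    for state in reversed(buffer):       # most recent state first
        new_score = train_from(state)    # restart exploration from this state
        if new_score > bottleneck_score:
            return state, new_score      # found a more optimal policy
    return None, bottleneck_score        # ran out of states: terminate

# Toy example: pretend only the second-to-last state leads past a score of 40.
buffer = ["s1", "s2", "s3", "s4"]
escape = {"s3": 45}
state, score = backtrack(buffer, 40, lambda s: escape.get(s, 40))
print(state, score)  # s3 45
```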
KG-A2C-Explore Go-Explore BIBREF9 is an algorithm designed to keep track of sub-optimal and under-explored states in order to allow the agent to discover more optimal states that may be hidden behind sparse rewards. The Go-Explore algorithm consists of two phases: the first continuously explores until a set of promising states and corresponding trajectories are found on the basis of total score, and the second robustifies the found policy against potential stochasticity in the game. Promising states are defined as those states that, when explored from, will likely result in higher-reward trajectories. Since the text-games we are dealing with are mostly deterministic, with the exception of Zork in later stages, we focus only on using Phase 1 of the Go-Explore algorithm to find an optimal policy. BIBREF10 look at applying Go-Explore to text-games on a set of simpler games generated using the game-generation framework TextWorld BIBREF1. Instead of training a policy network in parallel to generate actions used for exploration, they use a small set of "admissible actions" (actions guaranteed to change the world state at any given step during Phase 1) to explore and find high-reward trajectories. This space of actions is relatively small (on the order of $10^2$ per step), and so finding high-reward trajectories this way in larger action spaces such as Zork's would be infeasible.
Go-Explore maintains an archive of cells, defined as sets of states that map to a single representation, to keep track of promising states. BIBREF9 encode each cell by simply keeping track of the agent's position, and BIBREF10 use the textual observation encoded by a recurrent neural network as the cell representation. We improve on this implementation by training the KG-A2C network in parallel, using the snapshot of the knowledge graph in conjunction with the game state to further encode the current state, and using this as the cell representation. At each step, Go-Explore chooses a cell to explore at random (weighted by score to prefer more advanced cells). The KG-A2C then runs for a number of steps, starting with the knowledge graph state and the last seen state of the game from the cell. This generates a trajectory for the agent while further training the KG-A2C at each iteration, creating a new knowledge graph representation as well as a new game state for the cell. After expanding a cell, Go-Explore continues to sample cells by weight to expand its known states. At the same time, KG-A2C benefits from the heuristic of selecting preferred cells and is trained on promising states more often.
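The score-weighted cell selection above can be sketched as follows; the archive layout, cell names, and use of simple proportional weights are illustrative assumptions, not the exact scheme of BIBREF9.

```python
import random

# Sketch of score-weighted cell selection from a Go-Explore-style archive.
# The cell contents (knowledge-graph snapshot, game state) are placeholders.
def sample_cell(archive, rng=random):
    """Pick a cell to explore, weighted by score to prefer advanced cells."""
    cells = list(archive)
    weights = [max(archive[c]["score"], 1) for c in cells]  # avoid zero weight
    return rng.choices(cells, weights=weights, k=1)[0]

archive = {
    "west_of_house": {"score": 5,  "kg": None, "state": None},
    "cellar":        {"score": 40, "kg": None, "state": None},  # bottleneck
}
# "cellar" is sampled roughly 8x as often as "west_of_house".
print(sample_cell(archive))
```

Weighting by score biases exploration toward the frontier of the game while still occasionally revisiting earlier cells, which is what lets Phase 1 escape sparse-reward plateaus.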
Evaluation
We compare our two exploration strategies to the following baselines and ablations:
KG-A2C This is the exact method presented in BIBREF6, with no modifications.
A2C The same approach as KG-A2C but with all knowledge graph components removed. The state representation is text only, encoded using recurrent networks.
A2C-chained A variation on KG-A2C-chained in which we use our policy-chaining approach with the A2C method to train the agent instead of KG-A2C.
A2C-Explore Uses A2C in addition to the exploration strategy seen in KG-A2C-Explore. The cell representations here are defined in terms of the recurrent-network-based encoding of the textual observation.
Figure FIGREF10 shows that agents utilizing knowledge graphs in addition to either enhanced exploration method far outperform the baseline A2C and KG-A2C. KG-A2C-chained and KG-A2C-Explore both pass the bottleneck of a score of 40, whereas A2C-Explore reaches the bottleneck but cannot surpass it.
There are a couple of key insights that can be drawn from these results. The first is that the knowledge graph appears to be critical; it is theorized to help with partial observability. However, the knowledge graph representation alone is not sufficient: without enhanced exploration methods it cannot surpass the bottleneck. A2C-chained, which explores without a knowledge graph, fails to even outperform the baseline A2C. We hypothesize that this is due to the knowledge graph aiding implicitly in the sample efficiency of bottleneck detection and subsequent exploration. That is, exploring after backtracking from a potentially detected bottleneck is much more efficient in the knowledge