COMS W4701: Artificial Intelligence, Spring 2023
Homework 3

Instructions: Compile all solutions, plots, and analysis for this assignment in a single PDF file (typed, or handwritten legibly if absolutely necessary). Problems 2 and 3 will ask you to use and minimally modify provided scripts; you will not be turning in your code for these problems. For Problem 4, implement your solutions directly in the provided Python file(s) and turn these in along with the PDF file to Gradescope. Please be mindful of the deadline, as late submissions are not accepted, as well as our course policies on academic honesty.

Problem 1: Twenty-One (16 points)

Consider a one-player version of the game twenty-one as a Markov decision process. The objective is to draw cards one at a time from an infinite deck of playing cards and acquire a card sum as large as possible without going over 21. For now we will have ten integer states in {12, ..., 21} representing the card sum (sums smaller than 12 are trivially played). At each turn we can take one of two actions from state s. Stopping yields a reward equal to s and immediately ends the game. Hitting yields zero reward, and we will either transition to a state s′ with probability 1/13 for each s′ with s < s′ ≤ 21, or immediately end the game (“bust”) with probability (s − 8)/13.

1. Write down the Bellman optimality equations for the value V(s) of each state. While you can do so by writing 10 separate equations, each equation form is similar enough that you can write one compact expression for V(s) that applies to all of them (use summation notation).

2. We perform value iteration starting with V_0(s) = 0 for all states. What are the values V_1(s) in the next iteration? Have we found the optimal values V*?

3. Now consider taking a policy iteration approach. Suppose that we have an initial policy π_1, which is to hit in every state. What are the associated values V^{π_1}(s)? You should be able to reason this out without having to solve a linear system or write an iterative program.

4. Perform the next step of policy iteration to find π_2. Have we found the optimal policy π*?
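For concreteness, here is a minimal value-iteration sketch of the ten-state MDP described above. The state encoding, loop structure, and convergence threshold are illustrative assumptions (the assignment provides no code for Problem 1), and the snippet is only meant to make the transition model concrete, not to substitute for the written answers.

```python
import numpy as np

# Ten states: card sums 12..21. Stopping yields reward s; hitting yields 0 reward and
# moves to each s' with s < s' <= 21 with probability 1/13, otherwise the game ends (bust).
states = np.arange(12, 22)
V = np.zeros(len(states))      # V_0(s) = 0 for all states
gamma = 1.0                    # Problem 1 mentions no discount factor, so assume undiscounted

for _ in range(1000):
    V_new = np.empty_like(V)
    for i, s in enumerate(states):
        stop_value = s
        # Expected value of hitting: busting contributes nothing, so sum over s' = s+1..21.
        hit_value = gamma * sum(V[j] for j, sp in enumerate(states) if sp > s) / 13.0
        V_new[i] = max(stop_value, hit_value)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print(dict(zip(states.tolist(), np.round(V, 3).tolist())))
```

The first sweep of this loop is exactly the computation of V_1(s) from V_0(s) = 0 that question 2 asks about.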
Problem 2: Twenty-One Redux (18 points)

We will now modify the twenty-one MDP to incorporate a special rule regarding the ace card. If a player’s hand contains an ace, it can add either 1 or 11 to the card sum, provided that the total sum does not exceed 21. If the latter case is possible, we say that the hand contains a usable ace. The number of states is now doubled to twenty, and each can be represented by a tuple s = (sum, ace), where sum is the card sum as before and ace can be either True or False.

Suppose we take an action from state s = (sum, ace). As before, stopping returns reward equal to sum and immediately ends the game. The possible transitions for hitting are as follows, all with zero reward:

- Transition to state s′ = (sum′, ace), where sum < sum′ ≤ 21, with probability 1/13.
- If ace is True, transition to state s′ = (sum′, False), where 12 ≤ sum′ < sum, with probability 1/13, or to state s′ = (sum, False) with probability 4/13.
- Else if ace is False, immediately end the game with probability (sum − 8)/13.

As an example, let s = (16, True). Our hand has a usable ace (ace value is 11) and card sum 16. We then choose to hit. We may draw a value between 1 and 5, which brings us to states (17, True), ..., (21, True) with probability 1/13 each. We may draw a value between 6 and 9, which brings the card sum over 21, so we “remove” usability of the ace and decrease the excessive card sum by 10. This brings us to states (12, False), ..., (15, False) with probability 1/13 each. Finally, we may draw a value of 10, which brings our state to (16, False) with probability 4/13.

The provided twentyone.ipynb script initializes an array storing the 20 state values. The first 10 are the ace = False states, and the last 10 are the ace = True states. Given a discount factor gamma, the value_iteration() method can be used to find the optimal values for this problem. The code outside the main loop assigns the transition probabilities for hitting in a 20 × 20 array, allowing us to compute the Bellman update using a simple matrix-vector product.

1. Run three instances of value iteration with the following values of the discount factor: γ = 1, γ = 0.9, and γ = 0.5. For each instance, use matplotlib to produce a separate scatter plot with the x-axis ranging from 12 to 21. Use two different markers to plot the values of states with and without a usable ace, and show a legend as well. Copy these plots to your PDF.

2. What are the optimal values and actions of the states with no usable ace? Explain why they either differ from or remain the same as those of the previous problem (depending on your observations), in which aces could only take a value of 1.

3. For γ = 1, what are the optimal actions in each state with a usable ace? Explain the differences in strategy compared to the corresponding states without a usable ace.

4. Explain how and why the optimal strategies in usable-ace states change as we decrease γ.
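Since the notebook itself is not reproduced here, the sketch below shows one way the matrix-vector Bellman update described above can look, together with one of the requested scatter plots. The state ordering (first ten entries ace = False, last ten ace = True) follows the description; how the transition array is filled in and the exact signature of value_iteration() here are illustrative assumptions rather than the provided implementation.

```python
import numpy as np
import matplotlib.pyplot as plt

sums = np.arange(12, 22)
n = len(sums)                          # 10 card sums; 20 states in total
P = np.zeros((2 * n, 2 * n))           # hit transitions; indices 0-9: ace=False, 10-19: ace=True

for i, s in enumerate(sums):
    # No usable ace: move to any larger sum <= 21 with probability 1/13 each (bust otherwise).
    for j, sp in enumerate(sums):
        if sp > s:
            P[i, j] = 1 / 13
    # Usable ace: larger sums keep the ace usable; overshooting drops 10 and the ace becomes unusable.
    for j, sp in enumerate(sums):
        if sp > s:
            P[n + i, n + j] = 1 / 13   # new sum still <= 21
        elif sp < s:
            P[n + i, j] = 1 / 13       # overshoot: sum' = sum + draw - 10, ace = False
    P[n + i, i] = 4 / 13               # ten-valued draw: state becomes (sum, False)

def value_iteration(gamma, tol=1e-10):
    stop = np.tile(sums, 2).astype(float)              # stopping reward equals the card sum
    V = np.zeros(2 * n)
    while True:
        V_new = np.maximum(stop, gamma * (P @ V))      # Bellman update as a matrix-vector product
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V = value_iteration(gamma=1.0)
plt.scatter(sums, V[:n], marker="o", label="no usable ace")
plt.scatter(sums, V[n:], marker="x", label="usable ace")
plt.xlabel("card sum"); plt.ylabel("optimal value"); plt.legend()
plt.show()
```

Comparing stop against gamma * (P @ V) elementwise in the converged solution also tells you which action each state prefers, which is the quantity the questions above ask you to interpret.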
Problem 3: Bernoulli Bandits (18 points)

You will be simulating a K-armed Bernoulli bandit in bandits.ipynb; each arm gives either a reward of 1 with a set probability p or 0 with probability 1 − p. means is a list of the p probabilities for each arm. strat and param describe the strategy used to play this bandit (ε-greedy or UCB) along with the parameter value (ε or α), and the bandit is simulated M times for T timesteps each. The returned quantities are the average regret and the average frequency with which the best (highest-mean) arm is chosen at each timestep.

execute() simulates the bandit given the same arguments above, except that params is expected to be a list, so that we can compare the simulation results for different values of ε or α. The returned values from bernoulli_bandit() are plotted vs time on separate plots (note that regret is plotted cumulatively). The x-axis is logarithmic, so that the trends are better visualized over time. Thus, logarithmic growth will appear to be linear on these plots, while linear growth will appear to be exponential.

For each part of this problem, you will only need to set the arguments as instructed and then call execute(). You will then copy these plots into your submission and provide analysis. Make sure you have a good understanding of what each plot is showing you before responding to the questions.

1. Let’s first try an “easier” problem: a 2-armed bandit with means 0.3 and 0.7, simulated for T = 10000 timesteps and averaged over M = 100 simulations. Use ε-greedy with ε values 0.1, 0.2, 0.3, and then use UCB with α values 0.1, 0.2, 0.3 (you can ignore any divide-by-zero warnings). Show the two sets of plots and briefly answer the following questions.
   - For ε-greedy, which parameter value does better on the two metrics over the long term? Why might your answer change if we look at a shorter time period (e.g., t < 100)?
   - How does the value of α affect whether UCB does better or worse than ε-greedy? How does α affect the order of growth of regret?

2. Now let’s consider a “harder” 2-armed bandit with means 0.5 and 0.55. Perform the same simulations as in the previous part with all other values unchanged. Show the two sets of plots and briefly answer the following questions.
   - Compare ε-greedy’s performance on this problem vs the previous one on the two metrics. For this problem, do we need more or fewer (or about the same number of) trials to maximize the frequency of playing the best arm?
   - Compare the performance of ε-greedy vs UCB on this problem on the two metrics. Which generally seems to be less sensitive to varying parameter values, and why?

3. For the last experiment, simulate a 4-armed bandit with means 0.2, 0.4, 0.6, 0.8, keeping all other values the same. Do so using both ε-greedy and UCB, and show both sets of plots. Briefly answer the following questions.
   - Among the three bandit problems, explain why ε-greedy performs the worst on this one in terms of regret, particularly for larger values of ε.
   - How does UCB compare to ε-greedy here when α is properly chosen (i.e., look at the best performance of UCB)? Explain any differences that you observe.
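To make the two plotted metrics concrete, here is a sketch of what a single ε-greedy run on a Bernoulli bandit might look like and how per-step regret and best-arm frequency can be accumulated. The function name, the sample-mean estimates, and the regret bookkeeping are illustrative assumptions; they are not taken from the provided bandits.ipynb.

```python
import numpy as np

def run_epsilon_greedy(means, epsilon, T, rng):
    """One illustrative ε-greedy run; returns per-step expected regret and best-arm indicator."""
    K = len(means)
    counts = np.zeros(K)            # number of pulls per arm
    estimates = np.zeros(K)         # sample-mean reward estimates
    best = int(np.argmax(means))
    regret = np.zeros(T)
    chose_best = np.zeros(T)
    for t in range(T):
        if rng.random() < epsilon:
            a = int(rng.integers(K))              # explore uniformly at random
        else:
            a = int(np.argmax(estimates))         # exploit the current best estimate
        reward = float(rng.random() < means[a])   # Bernoulli(means[a]) reward
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]   # incremental mean update
        regret[t] = means[best] - means[a]        # expected regret of this pull
        chose_best[t] = (a == best)
    return regret, chose_best

# Average over M runs, matching the "easier" 2-armed setup in part 1.
rng = np.random.default_rng(0)
M, T, means = 100, 10_000, [0.3, 0.7]
runs = [run_epsilon_greedy(means, epsilon=0.1, T=T, rng=rng) for _ in range(M)]
avg_cumulative_regret = np.mean([np.cumsum(r) for r, _ in runs], axis=0)
avg_best_arm_frequency = np.mean([b for _, b in runs], axis=0)
print(avg_cumulative_regret[-1], avg_best_arm_frequency[-1])
```

A UCB variant would replace the action choice with an argmax over estimates[a] plus an exploration bonus that shrinks as counts[a] grows (for example sqrt(α · ln t / counts[a])); the exact form of the bonus and the role of α should be read off the provided notebook.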
Problem 4: Frozen Lake (48 points)

Frozen Lake is a classical test example for reinforcement learning algorithms included in the Gym API. It is a grid-world problem in which most states (“Frozen”) yield zero reward, the “Goal” is a terminal state that yields +1 reward, and all other terminal states (“Holes”) yield zero reward. The agent may move in one of the four cardinal directions from any Frozen state, but transitions may be stochastic and unknown to the agent, so it will employ reinforcement learning.

First set up the poetry virtual environment as in past homeworks. It is essential that you do so to avoid importing and setting up the Gym libraries yourself. You will be implementing certain functions in the QLearningAgent class in agent.py. To run your code and simulate the environment, you can modify parameters in and run main.py. The agent will then train for 10000 episodes and evaluate its learned policy on 100 episodes. If the render parameter is set to True, a small graphic will pop up showing the agent’s trajectory during each evaluation episode. In addition, the evaluation success rate, learned state values (remember that V(s) = max_a Q(s, a)), true state values (computed using value iteration), and value error norm will be printed on the console.
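Since agent.py and main.py are not included here, the sketch below shows the two core pieces a tabular Q-learning agent for Frozen Lake typically needs: ε-greedy action selection and the Q-value update. The class layout, method names, hyperparameters, and the use of gymnasium's FrozenLake-v1 environment are all assumptions for illustration; the assignment's QLearningAgent will have its own interface and environment setup.

```python
import numpy as np
import gymnasium as gym   # assumption: the course's poetry environment may provide Gym differently

class QLearningAgent:
    """Illustrative tabular Q-learning agent; not the assignment's agent.py API."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = np.random.default_rng(0)

    def select_action(self, s):
        # ε-greedy: explore with probability ε, otherwise act greedily with respect to Q.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next, done):
        # Q-learning target r + γ max_a' Q(s', a'), with no bootstrapping from terminal states.
        target = r + (0.0 if done else self.gamma * np.max(self.Q[s_next]))
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

env = gym.make("FrozenLake-v1")
agent = QLearningAgent(env.observation_space.n, env.action_space.n)

for episode in range(10_000):           # train for 10000 episodes, as described above
    s, _ = env.reset()
    done = False
    while not done:
        a = agent.select_action(s)
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        agent.update(s, a, r, s_next, done)
        s = s_next

V = agent.Q.max(axis=1)   # learned state values: V(s) = max_a Q(s, a)
print(V.reshape(4, 4))    # the default FrozenLake map is 4x4
```

Only the ε-greedy choice and the update rule Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)] carry over to the assignment; constructor arguments, exploration scheduling, and evaluation should follow the provided agent.py and main.py.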