Homework 7
ECE 461/661: Introduction to Machine Learning
Prof. Yuejie Chi and Prof. Beidi Chen
Due: Sunday, Dec. 3rd, 2023 at 8:59 PM PT / 11:59 PM ET

Please remember to show your work for all problems and to write down the names of any students you collaborate with. The full collaboration and grading policies are available on the course website: https://18661.github.io/

Your solutions should be uploaded to Gradescope (https://www.gradescope.com/) in PDF format by the deadline. We will not accept hardcopies. If you choose to hand-write your solutions, please make sure the uploaded copies are legible. Gradescope will ask you to identify which page(s) contain your solutions to which problems, so make sure you leave enough time to finish this before the deadline. We will give you a 30-minute grace period to upload your solutions in case of technical problems.

1 Gaussian Multi-Armed Bandit [33 points]

Consider the following multi-armed bandit problem. We have three slot machines, k1, k2, k3, which provide rewards drawn from univariate Gaussian distributions. Each distribution has a mean, μ1, μ2, μ3, and a variance, all unknown to us. Every time we pull a lever, a reward is observed. We have observed the rewards listed below for each slot machine. Now it is up to us to understand the exploration-exploitation tradeoff and determine which slot machine provides the highest reward on average in the fewest pulls. Assume the highest expected reward over all arms is 1 (i.e., max{μ1, μ2, μ3} = 1).

k1 rewards: -3.4246, 4.5886, -0.4250, -0.8251, 1.4727, 0.1228, 3.5182
k2 rewards: -4.8422, -9.2154, -3.8178, 3.7586, -3.9574
k3 rewards: 0.2795, 1.4759

(a) (4 pts) Suppose on the next pull we choose the slot machine according to a greedy approach. Which slot machine should we pick for the next pull if we choose to exploit? Show your work and explain why.

(b) (5 pts) Now we would like to understand the difference between the rewards we observed over the 14 pulls and the rewards an optimal strategy would have collected. Provide a numerical answer.

(c) (9 pts) Suppose that instead of exploiting on the next pull, we chose to explore one of the other two slot machines and received a reward of 5. How does this change our choice for the following pull (still assuming we are greedy) and our regret? Explain why the regret increases, decreases, or stays the same.

(d) (6 pts) UCB1 approach. Suppose that before our next draw we use the UCB1 algorithm instead of the greedy approach. Calculate the upper confidence bound for each slot machine (use the natural logarithm). What is our optimal choice under this algorithm?

(e) (9 pts) Suppose we make the optimal choice found in the previous part and receive a reward of -5. How does this change our optimal choice for the next pull? What happens to our regret after this observation, and why?
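As a numerical sanity check on parts (a) and (b), here is a minimal Python sketch (illustrative, not part of the assignment). It computes the empirical mean of each arm, which is the greedy criterion, and takes "regret" in the sense of part (b): the total reward an optimal strategy would have collected over the 14 pulls (14 x 1, since max μ = 1) minus the total reward actually observed.

    import numpy as np

    # Observed rewards per arm, copied from the table above.
    rewards = {
        "k1": [-3.4246, 4.5886, -0.4250, -0.8251, 1.4727, 0.1228, 3.5182],
        "k2": [-4.8422, -9.2154, -3.8178, 3.7586, -3.9574],
        "k3": [0.2795, 1.4759],
    }

    # Greedy criterion (part a): pick the arm with the highest empirical mean.
    means = {arm: float(np.mean(r)) for arm, r in rewards.items()}
    greedy_arm = max(means, key=means.get)
    print("empirical means:", means)
    print("greedy choice:", greedy_arm)

    # Regret over the observed pulls (part b): the optimal strategy pulls
    # the best arm (mean reward 1) every round, so its expected total is
    # simply the number of pulls.
    n_pulls = sum(len(r) for r in rewards.values())    # 14 pulls total
    collected = sum(sum(r) for r in rewards.values())
    regret = n_pulls * 1.0 - collected
    print(f"pulls = {n_pulls}, collected = {collected:.4f}, regret = {regret:.4f}")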
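For part (d), a sketch of the UCB1 index, assuming the standard form mean_i + sqrt(2 ln n / n_i), where n is the total number of pulls and n_i is the number of pulls of arm i; if lecture used a different constant inside the square root, substitute it here.

    import numpy as np

    def ucb1_indices(rewards_by_arm):
        # UCB1 index: empirical mean plus an exploration bonus that grows
        # with the total pull count n and shrinks with the arm's own
        # pull count n_i. The natural logarithm is used, per part (d).
        n = sum(len(r) for r in rewards_by_arm.values())
        return {
            arm: float(np.mean(r)) + np.sqrt(2.0 * np.log(n) / len(r))
            for arm, r in rewards_by_arm.items()
        }

    # With the reward lists from the previous sketch, the UCB1 choice is
    # the arm with the largest index:
    # ucb = ucb1_indices(rewards); best = max(ucb, key=ucb.get)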
2 Gridworld [34 points]

Consider the following grid environment. Starting from any unshaded square, you can move up, down, left, or right. Actions are deterministic and always succeed (e.g., going left from state 16 goes to state 15) unless they would cause the agent to run into a wall. The thicker edges in the figure indicate walls, and attempting to move in the direction of a wall results in staying in the same square (e.g., going in any direction other than left from state 16 stays in 16). Taking any action from the target square with the cheese (no. 11) earns a reward of r_g (so r(11, a) = r_g for all actions a) and ends the episode. Taking any action from the square with the cat (no. 6) earns a reward of r_r (so r(6, a) = r_r for all actions a) and ends the episode. Otherwise, taking any action from every other square is associated with a reward r_s ∈ {-1, 0, +1} (even if the action results in the agent staying in the same square). Assume the discount factor γ = 1, r_g = +10, and r_r = -1000 unless otherwise specified.

     1 |  2 |  3 |  4
     5 |  6 |  7 |  8
     9 | 10 | 11 | 12
    13 | 14 | 15 | 16

(Square 6 holds the cat and square 11 the cheese; the wall placement is shown by the thicker edges in the original figure.)

(a) (12 pts) Let r_s = 0. Evaluate the following policy and show the corresponding value for every state (square).
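Part (a) asks for policy evaluation. Below is a minimal sketch of iterative policy evaluation for this deterministic setting. The policy and the wall layout come from the figures and are not reproduced in the text, so the policy dict and step function here are hypothetical placeholders to be filled in from the figure. The sketch treats squares 6 and 11 as one-step terminals with values r_r and r_g (any action from them earns that reward and ends the episode), and with γ = 1 the loop converges only if the policy reaches a terminal square from every state.

    def evaluate_policy(states, policy, step, r_s, r_g, r_r, gamma=1.0, tol=1e-9):
        # Iterative policy evaluation for a deterministic gridworld.
        #   states: iterable of square numbers 1..16
        #   policy: dict mapping each non-terminal square to an action
        #           (hypothetical; read it off the policy figure)
        #   step:   function (square, action) -> next square; must encode
        #           the walls from the figure (moving into a wall returns
        #           the same square)
        V = {s: 0.0 for s in states}
        V[11], V[6] = r_g, r_r          # one-step terminal values
        while True:
            delta = 0.0
            for s in states:
                if s in (6, 11):
                    continue
                # Every non-terminal action costs r_s, then the agent
                # lands deterministically in step(s, policy[s]).
                new_v = r_s + gamma * V[step(s, policy[s])]
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                return V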