
CSE 571 Artificial Intelligence, Homework 4
Sanidhya Chauhan, 1229529043
A 1.1) For a standard Markov Decision Process (MDP) with discount factor γ, the value of a state under a given policy π can be calculated using the Bellman equation for policy evaluation:

V^π(s) = R(s) + γ · Σ_{s'} P(s' | s, π(s)) · V^π(s')

R(s) is the reward for being in state s.
γ = 0.99 is the discount factor.
P(s' | s, π(s)) is the probability of transitioning from state s to s' under the action chosen by policy π in state s.
V^π(s') is the value of the next state s'.

a) r = 100
i) With a reward of 100 for the non-terminal states, the agent is strongly incentivized to avoid the terminal state with reward +10 for as long as possible, because staying in any non-terminal state yields a better immediate reward. The optimal policy in this case is therefore to move so that the agent never reaches a terminal state, accumulating the high immediate reward indefinitely.

b) r = −3
i) With a reward of −3 for non-terminal states, the agent is motivated to reach the terminal state with reward +10 as quickly as possible in order to escape the negative living reward. The optimal policy must still balance this urgency against the probability of accidentally entering the terminal state with a negative reward.

c) r = 0
i) With a reward of 0 for non-terminal states, the agent is indifferent to moving around in non-terminal states but still prefers to reach the terminal state with reward +10. The policy therefore leans toward safe moves that make progress toward the positive terminal state without risking entry into the negative terminal state.

d) r = +3
i) With a positive reward of 3 for non-terminal states, the agent has an incentive to delay reaching the terminal state with reward +10, since it accumulates positive reward simply by moving around. It will eventually still aim for the +10 terminal state, but the policy may take a more conservative path that minimizes the risk of entering the terminal state with a negative reward.

Intuitively, the value of r changes the desirability of staying in non-terminal states versus reaching the terminal states. The discount factor γ slightly discounts future rewards relative to immediate rewards, but since it is close to 1 it has little effect on the qualitative shape of the policy. The stochastic transitions, with a 20% chance of moving at right angles to the intended direction, further complicate the optimal policy by introducing a risk factor that the policy must mitigate.
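To make this intuition concrete, the following minimal Python sketch runs value iteration on a small gridworld with γ = 0.99, a +10 terminal state, a negative terminal state, the 80/10/10 transition noise described above, and a configurable living reward r. The grid layout, the −10 value assumed for the negative terminal, and every function and variable name below are illustrative assumptions rather than the actual homework environment, and the sketch uses Bellman optimality backups (value iteration) rather than the policy-evaluation equation quoted above.

# Minimal value-iteration sketch for Question A 1.1 (illustrative only).
# Assumed for illustration: the 3x4 grid layout, the -10 reward of the "bad"
# terminal, and every name below. Taken from the write-up: gamma = 0.99, a +10
# terminal, a configurable living reward r, and 80/10/10 transition noise.

GAMMA = 0.99
ACTIONS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
PERPENDICULAR = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}

GRID = ["...G",   # 'G' = +10 terminal
        ".#.B",   # 'B' = negative terminal (assumed -10), '#' = wall
        "...."]   # '.' = ordinary cell paying the living reward r
TERMINAL = {'G': 10.0, 'B': -10.0}

def passable(row, col):
    return 0 <= row < len(GRID) and 0 <= col < len(GRID[0]) and GRID[row][col] != '#'

def move(state, action):
    """Deterministic move; bumping into a wall or the edge leaves the state unchanged."""
    (row, col), (dr, dc) = state, ACTIONS[action]
    return (row + dr, col + dc) if passable(row + dr, col + dc) else state

def transitions(state, action):
    """80% intended direction, 10% for each direction at right angles to it."""
    return [(0.8, move(state, action))] + \
           [(0.1, move(state, a)) for a in PERPENDICULAR[action]]

def value_iteration(living_reward, sweeps=500):
    states = [(row, col) for row in range(len(GRID))
              for col in range(len(GRID[0])) if passable(row, col)]
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # Synchronous Bellman optimality backup; terminals keep their fixed reward.
        V = {s: TERMINAL[GRID[s[0]][s[1]]] if GRID[s[0]][s[1]] in TERMINAL
             else max(sum(p * (living_reward + GAMMA * V[s2])
                          for p, s2 in transitions(s, a)) for a in ACTIONS)
             for s in V}
    return V

# Varying the living reward illustrates the qualitative behaviour discussed above.
for r in (100, -3, 0, 3):
    V = value_iteration(r)
    print(f"r = {r:>4}: V(bottom-left corner) = {V[(2, 0)]:.2f}")

Printing the value of one corner cell for each r shows how much the non-terminal states gain or lose relative to the +10 exit as the living reward changes.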
A 1.2)
Initialization:
Initial policy: π(cool) = Slow; π(warm) = Fast.
Discount factor: γ = 0.5.

Iteration 1: Policy Evaluation
Calculate the expected utility of each state under the current policy. Using the initial policy:
For state "cool" with action "Slow": U(cool) = 1 + γ × U(cool)
For state "warm" with action "Fast": U(warm) = 0.5 × (2 + γ × U(cool)) + 0.5 × (−10 + γ × U(overheated))
State "overheated" has no outgoing actions (it is a terminal state), so its utility does not change (0 by default).
The utilities converge to:
U(cool) = 1.9999961853027344
U(warm) = −10.0 (an initial estimate that is corrected in the following iterations)
U(overheated) = 0

Iteration 1: Policy Improvement
For each state, check whether changing the action would increase the expected utility.
For state "cool", the current action is "Slow". We check whether "Fast" is better:
Using "Fast": U′(cool) = 0.5 × (2 + γ × U(warm)) + 0.5 × (−10 + γ × U(overheated))
Since U′(cool) with "Fast" is worse than with "Slow", we do not change the policy for "cool".
For state "warm", the current action is "Fast". We check whether "Slow" is better:
Using "Slow": U′(warm) = 0.5 × (1 + γ × U(cool)) + 0.5 × (2 + γ × U(warm))
Since U′(warm) with "Slow" is better than with "Fast", we change the policy for "warm" to "Slow".
The policy becomes:
π(cool) = Slow
π(warm) = Slow

Iteration 2: Policy Evaluation
We recalculate the utilities with the updated policy until they converge. The utilities converge to:
U(cool) = 1.9999961853027344 (unchanged because the policy for "cool" did not change)
U(warm) = 2.6664695888757706 (now corrected with the updated policy)
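A small policy-iteration sketch in Python can reproduce this computation. The transition and reward triples below are read directly off the evaluation equations in this answer (e.g. Slow in "cool" pays +1 and stays in "cool"); everything else, including the function names, the dictionary layout, and the fixed number of evaluation sweeps, is an assumption made for illustration rather than part of the assignment.

# Policy-iteration sketch for Question A 1.2 (illustrative only).
# The transition/reward model is inferred from the evaluation equations above.

GAMMA = 0.5

# MDP[state][action] = list of (probability, reward, next_state)
MDP = {
    "cool": {
        "Slow": [(1.0, 1, "cool")],
        "Fast": [(0.5, 2, "warm"), (0.5, -10, "overheated")],
    },
    "warm": {
        "Slow": [(0.5, 1, "cool"), (0.5, 2, "warm")],
        "Fast": [(0.5, 2, "cool"), (0.5, -10, "overheated")],
    },
    "overheated": {},   # terminal state: no actions, utility stays 0
}

def q_value(U, state, action):
    """Expected utility of taking `action` in `state`, given current utilities U."""
    return sum(p * (reward + GAMMA * U[nxt]) for p, reward, nxt in MDP[state][action])

def evaluate(policy, U, sweeps=50):
    """Iterative policy evaluation with a fixed number of synchronous sweeps."""
    for _ in range(sweeps):
        U = {s: q_value(U, s, policy[s]) if MDP[s] else 0.0 for s in MDP}
    return U

def policy_iteration(policy):
    U = {s: 0.0 for s in MDP}
    while True:
        U = evaluate(policy, U)                          # policy evaluation
        improved = {s: max(MDP[s], key=lambda a: q_value(U, s, a))
                    for s in MDP if MDP[s]}              # policy improvement
        if improved == policy:
            return policy, U
        policy = improved

policy, U = policy_iteration({"cool": "Slow", "warm": "Fast"})
print(policy)   # expected: {'cool': 'Slow', 'warm': 'Slow'}
print(U)        # U(cool) ~ 2.0 and U(warm) ~ 2.67, matching the converged values above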