CSE - 571 Artificial Intelligence
Homework 4
Sanidhya Chauhan
1229529043
A 1.1) For a standard Markov Decision Process (MDP) with the discount factor γ, the value V of
a state under a certain policy π can be calculated using the Bellman equation for policy
evaluation:
V^π(s) = R(s) + γ ∑_{s'} P(s'|s, π(s)) V^π(s')
● R(s) is the reward for being in state s.
● γ = 0.99 is the discount factor.
● P(s'|s, π(s)) is the probability of transitioning from state s to state s' under the action defined by policy π at state s.
● V^π(s') is the value of the next state s'.
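To make the policy-evaluation update concrete, here is a minimal Python sketch. The two-state MDP inside it (its states, rewards, and transition probabilities) is a hypothetical toy example, not the grid world from the question; only the update rule mirrors the equation above.

# Minimal policy-evaluation sketch for the Bellman equation above.
# The toy MDP below (states, rewards, transitions) is an assumed example,
# not the grid world from the assignment.
GAMMA = 0.99

states = ["s0", "s1"]
R = {"s0": 1.0, "s1": 0.0}                     # R(s)
policy = {"s0": "a", "s1": "a"}                # pi(s)
# P[(s, a)] = {s': probability of landing in s'}
P = {
    ("s0", "a"): {"s0": 0.8, "s1": 0.2},
    ("s1", "a"): {"s0": 0.5, "s1": 0.5},
}

def evaluate_policy(tol=1e-6):
    """Iterate V(s) = R(s) + gamma * sum_s' P(s'|s, pi(s)) V(s') to convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = R[s] + GAMMA * sum(
                p * V[s2] for s2, p in P[(s, policy[s])].items()
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

print(evaluate_policy())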
a) r=100
i) With a reward of 100 for the non-terminal states, the agent is highly incentivized to avoid reaching the terminal state with reward +10 for as long as possible, because staying in any non-terminal state gives a better immediate reward. This means that the optimal policy in this case would be to move in such a way that it never reaches the terminal state, thus accumulating a high immediate reward indefinitely.
b) r=−3
i) With a reward of −3 for non-terminal states, the agent would be motivated to reach the terminal state with reward +10 as soon as possible to escape the per-step negative reward. However, the optimal policy still has to balance that urgency against the probability of accidentally entering the negative-reward terminal state.
c) r=0
i) With a reward of 0 for non-terminal states, the agent is indifferent to moving around in non-terminal states but still prefers to reach the terminal state with reward +10. So the policy would lean towards making safe moves that progress towards the positive terminal state without risking entering the negative terminal state.
d) r=+3
i) With a positive reward of 3 for non-terminal states, the agent has an incentive to delay reaching the terminal state with reward +10 because it is accumulating a positive reward by simply moving around. However, it will eventually still aim to reach the terminal state with reward +10, but the policy may take a more conservative path that minimizes the risk of entering the terminal state with a negative reward.
Intuitively, the value of r changes the desirability of staying in non-terminal states versus
reaching the terminal states. The discount factor
slightly discounts future rewards compared to
𝛾
immediate rewards but since it is close to 1, it doesn't have a significant effect on the qualitative
aspect of the policy. The stochastic nature of the transitions due to the 20% chance of moving at
right angles to the intended direction further complicates the optimal policy as it introduces a risk
factor that the policy must mitigate.
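As a rough back-of-the-envelope check (a sketch that ignores the transition noise and the grid layout entirely), the discounted value of collecting the per-step reward r forever is the geometric series r/(1 − γ), which can be set against the one-time +10 exit reward:

# Discounted value of earning the per-step reward r forever: r / (1 - gamma).
# This ignores transition noise and the grid layout, so it only illustrates
# how the per-step reward compounds relative to the +10 exit reward.
GAMMA = 0.99
for r in (100, -3, 0, 3):
    print(f"r = {r:4}: staying forever is worth r/(1-gamma) = {r / (1 - GAMMA):8.1f}")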
A 1.2)
Initialization:
Initial policy: π(cool) = Slow; π(warm) = Fast.
Discount factor: γ = 0.5.
Iteration 1: Policy Evaluation
● Calculate the expected utility for each state under the current policy.
Using the initial policy:
● For state "cool" with action "Slow": U(cool) = 1 + γ×U(cool)
● For state "warm" with action "Fast": U(warm) = 0.5×(2 + γ×U(cool)) + 0.5×(−10 + γ×U(overheated))
State "overheated" has no outgoing actions (it is a terminal state), so its utility does not change
(0 by default).
The utilities converge to:
● U(cool) = 1.9999961853027344
● U(warm) = −10.0 (an initial estimate, corrected in the following iterations)
● U(overheated) = 0
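For reference, the long decimal for U(cool) is what sweep-by-sweep evaluation of U(cool) = 1 + γ×U(cool) produces when it is stopped just short of the exact fixed point 1/(1 − 0.5) = 2. A small sketch (the 19-sweep count is an assumption about the stopping threshold used):

# Sweep-by-sweep evaluation of U(cool) = 1 + gamma * U(cool), starting at 0.
# The exact fixed point is 1 / (1 - 0.5) = 2; stopping after finitely many
# sweeps gives a value just below 2.  (The 19 sweeps here are an assumption;
# the exact count depends on the stopping threshold, which is not stated.)
gamma = 0.5
u_cool = 0.0
for _ in range(19):
    u_cool = 1 + gamma * u_cool
print(u_cool)  # 1.9999961853027344, the value quoted above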
Iteration 1: Policy Improvement
● For each state, check if changing the action would increase the expected utility.
For the state "cool", the current action is "Slow". We check if "Fast" is better:
● Using "Fast": U′(cool) = 0.5×(2 + γ×U(warm)) + 0.5×(−10 + γ×U(overheated))
● Since U′(cool) with "Fast" is worse than with "Slow", we don't change the policy for "cool".
For state "warm", the current action is "Fast". We check if "Slow" is better:
● Using "Slow": U′(warm) = 0.5×(1 + γ×U(cool)) + 0.5×(2 + γ×U(warm))
● Since U′(warm) with "Slow" is better than with "Fast", we change the policy for "warm" to "Slow".
The policy changes to:
● π(cool) = Slow
● π(warm) = Slow
Iteration 2: Policy Evaluation
● We recalculate the utilities with the updated policy until they converge.
The utilities converge to:
● U(cool) = 1.9999961853027344 (unchanged because the policy for "cool" did not change)
● U(warm) = 2.6664695888757706 (now corrected with the updated policy)
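For completeness, the whole procedure can be checked in code. The sketch below runs policy iteration on this MDP using the transition and reward model implied by the hand calculations above (the model is read off the U(...) equations, so treat it as an assumption). Because it evaluates each policy to convergence, intermediate utilities may differ from the hand-picked starting values, but the final policy and utilities agree with the result above.

# Policy iteration on the cool/warm/overheated MDP, using the transition and
# reward model read off the hand calculations above (an assumption, since the
# problem statement is not restated here).
GAMMA = 0.5
STATES = ["cool", "warm", "overheated"]
TERMINAL = {"overheated"}

# T[(state, action)] = list of (probability, reward, next_state)
T = {
    ("cool", "Slow"): [(1.0, 1, "cool")],
    ("cool", "Fast"): [(0.5, 2, "warm"), (0.5, -10, "overheated")],
    ("warm", "Slow"): [(0.5, 1, "cool"), (0.5, 2, "warm")],
    ("warm", "Fast"): [(0.5, 2, "cool"), (0.5, -10, "overheated")],
}
ACTIONS = {"cool": ["Slow", "Fast"], "warm": ["Slow", "Fast"]}

def q_value(s, a, U):
    """Expected utility of taking action a in state s under utilities U."""
    return sum(p * (r + GAMMA * U[s2]) for p, r, s2 in T[(s, a)])

def policy_evaluation(policy, U, tol=1e-9):
    """Iterate the Bellman backups for the fixed policy until convergence."""
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            new_u = q_value(s, policy[s], U)
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U

def policy_iteration():
    policy = {"cool": "Slow", "warm": "Fast"}   # initial policy from above
    U = {s: 0.0 for s in STATES}
    while True:
        U = policy_evaluation(policy, U)
        stable = True
        for s in ACTIONS:
            best = max(ACTIONS[s], key=lambda a: q_value(s, a, U))
            if q_value(s, best, U) > q_value(s, policy[s], U):
                policy[s], stable = best, False
        if stable:
            return policy, U

print(policy_iteration())
# Expected result: policy {'cool': 'Slow', 'warm': 'Slow'},
# with U(cool) ≈ 2 and U(warm) ≈ 8/3 ≈ 2.667, consistent with the
# converged values quoted above.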