Machine Learning - Final Exam - 2019

December 17, 2019

The duration of the exam is 1 hour and 10 minutes. Good luck!

Question 1 - Short questions (20 points, equally distributed)

Please briefly answer the following questions. Be precise and concise in your answers. In the context of reinforcement learning:

1) What is a policy function?
2) What is a value function?
3) What is a Bellman equation?
4) Cite one advantage of TD (temporal difference) methods compared to MC (Monte Carlo) methods.
5) In model-free reinforcement learning, the policy improvement step is typically an ε-greedy improvement, as opposed to the greedy improvement used in dynamic programming. Why?
6) In dynamic programming and reinforcement learning in general, it is typical to discount future rewards by a parameter γ < 1. Please give at least one justification for discounting future rewards.
7) Explain the difference between SARSA and Q-learning.
8) Give an example of a potential application of reinforcement learning in finance.

Question 2 (20 points)

Please briefly explain the picture below.

Question 3 (20 points)

Consider a Markov reward process (MRP) with two states: the good state and the bad state. In the good state, the reward is 4, and in the bad state the reward is 2. The probability of transitioning from the good state to the bad state is 0.5. The probability of transitioning from the bad state to the good state is also 0.5. The time discount factor is γ = 0.5. Find the value function of this MRP.
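For a finite MRP like the one in Question 3, the value function can be obtained by solving the linear Bellman system V = R + γPV directly. The sketch below illustrates this with NumPy; the state ordering (good state first) and the convention that the reward is collected in the current state are assumptions on my part, not part of the exam.

import numpy as np

# Illustrative sketch: solve V = R + gamma * P @ V, i.e. V = (I - gamma * P)^{-1} R,
# for the two-state MRP of Question 3 (state 0 = good, state 1 = bad).
gamma = 0.5
P = np.array([[0.5, 0.5],   # transition probabilities from the good state
              [0.5, 0.5]])  # transition probabilities from the bad state
R = np.array([4.0, 2.0])    # reward collected in the good and bad states

V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)  # value function [V(good), V(bad)]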
Question 4 (20 points)

As you saw in the last graded assignment, we can find the theoretical price of a stock by solving the Bellman equation:

V(s_t) = E_t[ R_t + γ V(s_{t+1}) ]

As in your homework, suppose you have a data set of realized rewards R_t and a sequence of visited states s_t stored in a buffer. You parametrize the value function using a neural network. Write a pseudo-algorithm to solve the Bellman equation.

Question 5 (20 points)

Q-learning and neural networks have been known for decades, but for a long time it was believed that using neural networks to represent value functions was unstable. That changed in 2014, when DeepMind showed that a single deep neural network could learn to play many Atari video games, achieving "super-human" performance on many of them. Two key ingredients that made training more stable are "replay buffers" and "fixed targets". Explain what replay buffers and fixed targets are, and how they help overcome the instability of learning value functions with neural networks.
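For reference, the sketch below shows one way a pseudo-algorithm of the kind asked for in Question 4 could be written as code, combined with the replay buffer and fixed target network discussed in Question 5. It is only a sketch under assumed choices: the PyTorch framework, the network size, and the hyperparameters are illustrative and are not taken from the exam or the homework.

import random
import torch
import torch.nn as nn

# Illustrative sketch: fit a neural-network value function V(s) to the Bellman
# equation V(s_t) = E_t[ R_t + gamma * V(s_{t+1}) ] from a buffer of
# (s_t, r_t, s_{t+1}) transitions, using minibatches sampled from a replay
# buffer and a fixed target network on the right-hand side.
gamma = 0.99
state_dim = 4  # assumed state dimension, for illustration only

value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_net.load_state_dict(value_net.state_dict())  # fixed target starts as a copy

optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
replay_buffer = []  # filled elsewhere with (s, r, s_next) tuples (states as tensors)

def td_update(batch_size=32):
    """One stochastic regression step toward the fixed Bellman targets."""
    batch = random.sample(replay_buffer, batch_size)   # replay buffer: random minibatch
    s = torch.stack([b[0] for b in batch])
    r = torch.tensor([b[1] for b in batch], dtype=torch.float32).unsqueeze(1)
    s_next = torch.stack([b[2] for b in batch])

    with torch.no_grad():                              # fixed target: no gradient flows
        target = r + gamma * target_net(s_next)        # R_t + gamma * V_target(s_{t+1})
    loss = nn.functional.mse_loss(value_net(s), target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every few hundred updates, refresh the fixed target:
# target_net.load_state_dict(value_net.state_dict())

Sampling transitions at random from the replay buffer breaks the correlation between consecutive samples, and holding the target network fixed between periodic copies keeps the regression target from moving at every gradient step; together these choices are what make this style of training noticeably more stable.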