Set of states: $S$
Start state: $s_0$
Set of actions: $A$
Transition function ($p(s' | s, a)$): given a state-action pair, a probability distribution over the next state. Transitions can be stochastic.
Reward function ($r(s, a, s')$): given a state-action pair and the resulting next state, the reward gained.
Discount factor ($\gamma$): factor that makes future rewards worth less than immediate rewards.
Horizon ($H$): number of actions to take.
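As a concrete illustration (not from the notes), a small tabular MDP can be written down directly as a few dictionaries. The two-state example below and the names `states`, `actions`, `P`, `R`, `reward`, and `step` are all hypothetical; this is a minimal Python sketch of the components defined above.

```python
import random

# Hypothetical two-state MDP with two actions.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] maps next state s' -> probability p(s'|s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},   # stochastic transition
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# R[(s, a, s')] is the reward r(s, a, s'); unlisted triples give 0.
R = {("s0", "move", "s1"): 1.0}
reward = lambda s, a, s_next: R.get((s, a, s_next), 0.0)

gamma = 0.9   # discount factor
H = 50        # horizon (number of actions)

def step(s, a):
    """Sample s' ~ p(.|s, a) and return (s', r(s, a, s'))."""
    next_states = list(P[(s, a)].keys())
    probs = list(P[(s, a)].values())
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, reward(s, a, s_next)
```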
Learn a policy $\pi$ that maps states to actions to maximize the expected sum of discounted rewards:
$$ \max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{H} \gamma^t R(S_t, A_t, S_{t+1})\right] $$
A policy can be a deterministic mapping from states to actions, $\pi(s)$, or a stochastic policy that gives a probability distribution over actions, $\pi(a|s)$.
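Continuing the hypothetical sketch above, a deterministic policy is just a state-to-action mapping, while a stochastic policy stores a distribution over actions for each state; `pi_det`, `pi_stoch`, and `sample_action` are illustrative names, not from the notes.

```python
import random

# Deterministic policy: a plain mapping state -> action, pi(s).
pi_det = {"s0": "move", "s1": "stay"}

# Stochastic policy: state -> probability distribution over actions, pi(a|s).
pi_stoch = {
    "s0": {"stay": 0.1, "move": 0.9},
    "s1": {"stay": 0.7, "move": 0.3},
}

def sample_action(pi, s):
    """Return pi(s) if the policy is deterministic, otherwise sample a ~ pi(.|s)."""
    choice = pi[s]
    if isinstance(choice, dict):
        acts, probs = zip(*choice.items())
        return random.choices(acts, weights=probs, k=1)[0]
    return choice
```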
Optimal Value Function:
$$ V^*(s) = \max_{\pi} \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t R(s_t, a_t, s_{t+1}) \,\middle|\, \pi, s_0 = s\right] $$
This is the expected discounted sum of rewards obtained starting from state $s$ when following the optimal infinite-horizon policy.
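The expectation here can be approximated by averaging discounted returns over sampled rollouts. Below is a rough sketch that reuses the hypothetical `step`, `sample_action`, `gamma`, and `H` from the earlier snippets; note it estimates $V^\pi$ for whatever policy is passed in, so it only approximates $V^*$ if that policy is optimal.

```python
def estimate_value(s, pi, n_rollouts=1000):
    """Monte Carlo estimate of E[sum_t gamma^t r_t | pi, s_0 = s] over horizon H."""
    total = 0.0
    for _ in range(n_rollouts):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(H):
            a = sample_action(pi, state)
            state, r = step(state, a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_rollouts
```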
Value iteration is an algorithm for computing the optimal values, from which the optimal policy can be extracted. The values of all states are initialized to 0; we then sweep over all states repeatedly, updating each value with the following rule:
$$ V_k^*(s) = \max_{a} \sum_{s'} P(s'|s, a) \left( R(s, a, s') + \gamma V_{k-1}^*(s') \right) $$
Each update looks one step ahead to the values of the next states. As $k \to \infty$, this can be proven to converge to the optimal infinite-horizon value function, and acting greedily with respect to it gives the optimal policy.
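A minimal sketch of this update on the hypothetical tabular MDP from the earlier snippets (`states`, `actions`, `P`, `reward`, and `gamma` are the assumed names):

```python
def value_iteration(n_iters=100):
    """Tabular value iteration:
    V_k(s) = max_a sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V_{k-1}(s'))."""
    V = {s: 0.0 for s in states}          # initialize all values to 0
    for _ in range(n_iters):
        # Synchronous update: the comprehension reads the old V, then rebinds it.
        V = {
            s: max(
                sum(p * (reward(s, a, s_next) + gamma * V[s_next])
                    for s_next, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    return V
```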
$Q^*(s,a)$: the expected discounted sum of rewards from taking action $a$ in state $s$ and acting optimally afterward.
$$ Q^*(s, a) = \sum_{s'} P(s'|s, a)\left(R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right) $$
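The same fixed-point update can be run directly on Q-values, and the optimal policy then falls out by acting greedily on the result. Again a sketch over the hypothetical tabular MDP from above; `q_value_iteration` and `greedy_policy` are illustrative names.

```python
def q_value_iteration(n_iters=100):
    """Tabular Q-value iteration using the fixed point above:
    Q(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * max_a' Q(s',a'))."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        Q = {
            (s, a): sum(
                p * (reward(s, a, s_next) + gamma * max(Q[(s_next, a2)] for a2 in actions))
                for s_next, p in P[(s, a)].items()
            )
            for s in states for a in actions
        }
    return Q

def greedy_policy(Q):
    """Extract pi*(s) = argmax_a Q*(s, a)."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Usage: pi_star = greedy_policy(q_value_iteration())
```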