CS188 Discussion Week 14: Reinforcement Learning
By Nuttapong Chentanez

Markov Decision Process (MDP)
- Consists of a set of states s ∈ S
- A transition model T(s,a,s') = P(s' | s, a)
  - The probability that executing action a in state s leads to s'
- A reward function R(s,a,s')
  - Can also write R_enter(s') for the reward for entering s', or R_in(s) for being in s
- A start state (or a distribution over start states)
- There may be terminal states
- Want to compute an optimal policy

Bellman's Equation
- Idea: optimal rewards = maximize over the first action, then follow the optimal policy
- V*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
- How to compute this? Simple Monte-Carlo (sampling) is one option

Value Iteration
- Theorem: will converge to the unique optimal values
- The policy may converge long before the values do

Policy Iteration
- Policy evaluation: calculate utilities for a fixed policy
- Policy improvement: update the policy based on the resulting converged utilities
- Repeat until the policy does not change
- In practice there is no need to compute the "exact" utility of a policy; a rough estimate is enough

Combining Dynamic Programming with Monte-Carlo
- Take one step of sampling, then use the current state-value estimate
- Update: V^π(s) ← V^π(s) + α [ R(s, π(s), s') + γ V^π(s') − V^π(s) ]
- A.k.a. Temporal Difference Learning (TD)

Reinforcement Learning
- Still have an MDP:
  - A set of states s ∈ S
  - A set of actions (per state) a ∈ A
  - A model T(s,a,s')
  - A reward function R(s,a,s')
- Still looking for a policy π(s)
- However, the agent does not know T or R
- Must actually try actions out in states to learn

Reinforcement Learning in Animals
- Studied experimentally in psychology for more than 60 years
- Rewards: food, pain, hunger, etc.
- Example: bees learn a near-optimal foraging plan in a field of artificial flowers
- Dolphin training

Model-Based Learning
- Can try to learn T and R first, then solve the MDP as before
- Simplest case:
  - Count outcomes for each (s,a)
  - Normalize to get an estimate of T(s,a,s')
  - Discover R(s,a,s') the first time we experience (s,a,s')

Passive Learning
- Given a policy π(s), try to learn V^π(s) without knowing T or R

TD Features
- On-line, incremental, bootstrapping (uses the information learned so far)
- Model-free; converges as long as the learning rate α decreases over time (e.g. α = 1/k) with α < 1

Problems with TD
- TD learns the value of states under a given policy
- If we want to learn the optimal policy, that is not enough
- Idea: learn values of state-action pairs (Q-values) instead

Q-Functions
- Q^π(s,a): the utility of starting at state s, taking action a, then following π thereafter
- Q*(s,a): the Q-value of the optimal policy, i.e. the utility of starting at state s, taking action a, then following the optimal policy thereafter

Q-Learning
- An algorithm that applies the TD idea to learn Q*

Learn Q*(s,a) values
- Receive a sample (s, a, s', r)
- Consider your old estimate: Q(s,a)
- Consider your new sample estimate: sample = r + γ max_a' Q(s', a')
- Modify the old estimate towards the new sample: Q(s,a) ← Q(s,a) + α (sample − Q(s,a))
- Equivalently, average samples over time: Q(s,a) ← (1 − α) Q(s,a) + α · sample
- Converges as long as α decreases over time (e.g. α = 1/k) with α < 1
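As a concrete illustration of the update above, here is a minimal tabular Q-learning sketch in Python. The 3-state chain environment, its rewards, and the constants are made up for the example (they are not from the discussion notes); only the update rule itself follows the sample/averaging scheme described above.

# Minimal tabular Q-learning sketch (hypothetical 3-state chain MDP for illustration).
import random
from collections import defaultdict

GAMMA = 0.9          # discount factor
ALPHA = 0.5          # learning rate (in practice decayed over time, e.g. alpha = 1/k)
ACTIONS = ["left", "right"]

def step(state, action):
    """Hypothetical chain: states 0,1,2; reaching state 2 ends the episode with reward +10."""
    s_next = min(state + 1, 2) if action == "right" else max(state - 1, 0)
    reward = 10.0 if s_next == 2 else -1.0
    done = (s_next == 2)
    return s_next, reward, done

Q = defaultdict(float)   # Q[(s, a)] defaults to 0

def q_update(s, a, r, s_next, done):
    """One Q-learning update: move Q(s,a) toward the sample r + gamma * max_a' Q(s', a')."""
    sample = r if done else r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (sample - Q[(s, a)])

for episode in range(200):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS)          # pure exploration here; see epsilon-greedy below
        s_next, r, done = step(s, a)
        q_update(s, a, r, s_next, done)
        s = s_next

print({(s, a): round(Q[(s, a)], 2) for s in range(3) for a in ACTIONS})

With enough episodes the table approaches Q*; in practice the purely random action choice is replaced by an exploration strategy such as ε-greedy, discussed next.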
Exploration vs. Exploitation
- "If you always take the current best action, you will never explore other actions and never find out whether a better action exists"
- Explore initially, then exploit later
- There are many schemes for balancing exploration vs. exploitation
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1 − ε, act according to the current policy (e.g. take the action with the best Q-value)
- ε-greedy will explore the space, but it keeps taking random actions even after the learning is "done"
- A solution: lower ε over time

Another Solution: Exploration Functions
- Idea: instead of exploring a fixed amount, explore areas whose values are not yet well established
- E.g. f(u,n) = u + k/n, where k is a constant and n counts how often the state-action pair has been tried

Practical Q-Learning
- In realistic situations there are too many states to visit them all; the state space may even be infinite
- Need to learn from a small amount of training data
- Need to generalize to new, similar states
- This is the fundamental problem of machine learning

Function Approximation
- It is inefficient or infeasible to learn a Q-value for every state-action pair separately
- Suppose we approximate Q(s_t, a_t) with a function f with parameters θ
- Can then do a gradient-descent update: θ ← θ + α [ v_t − Q_θ(s_t, a_t) ] ∇_θ Q_θ(s_t, a_t)

Project time
- v_t is the reward (target) received
- Idea: the gradient indicates the direction of change in θ that most increases Q
- We want Q to look more like v_t, so we modify each parameter depending on whether increasing or decreasing it makes Q more like v_t
- In this simple form it does not always work, but the idea is on the right track
- Feature selection is difficult in general
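The sketch below shows one way the gradient update looks with a linear, feature-based Q-function combined with ε-greedy action selection. The feature extractor, feature names, and constants are hypothetical placeholders, not taken from the discussion notes; only the weight update implements the v_t − Q_θ idea described above.

# Minimal sketch of linear (feature-based) approximate Q-learning with epsilon-greedy actions.
import random

EPSILON, ALPHA, GAMMA = 0.1, 0.1, 0.9

def features(state, action):
    """Hypothetical feature extractor: returns {feature_name: value} for a (state, action) pair."""
    return {"bias": 1.0,
            "state_value": float(state),
            "is_right": 1.0 if action == "right" else 0.0}

weights = {}   # theta: one weight per feature name, default 0

def q_value(state, action):
    """Linear approximation: Q_theta(s,a) = sum_i theta_i * f_i(s,a)."""
    return sum(weights.get(name, 0.0) * value for name, value in features(state, action).items())

def choose_action(state, actions):
    """Epsilon-greedy: random action with probability epsilon, otherwise the best current Q-value."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_value(state, a))

def update(state, action, reward, next_state, next_actions, done):
    """Move each weight in the direction that makes Q(s,a) look more like the target v_t."""
    target = reward if done else reward + GAMMA * max(q_value(next_state, a) for a in next_actions)
    difference = target - q_value(state, action)       # v_t - Q_theta(s_t, a_t)
    for name, value in features(state, action).items():
        weights[name] = weights.get(name, 0.0) + ALPHA * difference * value

# Example usage with hypothetical states 0,1,2 and actions "left"/"right":
a = choose_action(0, ["left", "right"])
update(0, a, -1.0, 1, ["left", "right"], done=False)

Because Q_θ is linear in θ, the gradient with respect to weight i is just the feature value f_i(s,a), which is why each weight moves by α · (v_t − Q_θ) · f_i. Decaying ε and α over time addresses the point above about continuing to act randomly after learning is done.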