A Crash Course in Reinforcement Learning
Oliver Schulte, Simon Fraser University

Outline
• What is Reinforcement Learning?
• Key Definitions
• Key Learning Tasks
• Reinforcement Learning Techniques
• Reinforcement Learning with Neural Nets

Learning To Act
• So far: learning to predict.
• Now: learning to act.
– In engineering: control theory.
– In economics and operations research: decision and game theory.
• Examples:
– fly a helicopter
– drive a car
– play Go
– play soccer

RL at a glance
[Diagram: the agent–environment loop. The agent observes a state and a reward from the environment and chooses an action, producing the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, ...]
• Goal: learn to choose actions that maximize r0 + γr1 + γ²r2 + ..., where 0 < γ < 1.
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch13.pdf

Acting in Action
• Autonomous helicopter
– An example of imitation learning: start by observing human actions.
• Learning to play video games
– "Deep Q works best when it lives in the moment"
• Learning to flip pancakes

Markov Decision Processes
• Recall the Markov process (MP):
– state = vector x ≅ s of input variable values
– can contain hidden variables = partially observable (POMDP)
– transition probability P(s'|s)
• Markov reward process (MRP) = MP + rewards r.
• Markov decision process (MDP) = MRP + actions a.
• Markov game = MDP with actions and rewards for > 1 agent.

Model Parameters: transition probabilities
• Markov process: P(s(t+1) | s(t)).
• MDP: P(s(t+1) | s(t), a(t)), plus the expected reward E(r(t+1) | s(t), a(t)).
• Recall the basketball example; also the hockey example.
• Grid example: David Poole's demo.

Returns and discounting
• A trajectory is a (possibly infinite) sequence s(0), a(0), r(0), s(1), a(1), r(1), ..., s(n), a(n), r(n), ...
• The return is the total sum of rewards.
• But if the trajectory is infinite, we have an infinite sum!
• Solution: weight by a discount factor γ between 0 and 1.
• Return = r(0) + γr(1) + γ²r(2) + ...

RL Concepts
• These 3 functions (the policy π, the value function V, and the action-value function Q, defined below) can be computed by neural networks.
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch13.pdf

Policies and Values
• A deterministic policy π is a function that maps states to actions, i.e. it tells us how to act.
• A policy can also be probabilistic.
• A policy can be implemented using neural nets.
• Given a policy and an MDP, we have the expected return from following the policy starting at a state.
• Notation: Vπ(s).

Optimal Policies
• A policy π* is optimal if, for any other policy π and for all states s, Vπ*(s) ≥ Vπ(s).
• The value of the optimal policy is written V*(s).

The action value function
• Given a policy π, the expected return from taking action a in state s and following π thereafter is denoted Qπ(s, a).
• Similarly, Q*(s, a) is the value of an action under the optimal policy.
• Grid example.

Two Learning Problems
• Prediction: for a fixed policy, learn Vπ(s).
• Control: for a given MDP, learn V*(s) and the optimal policy.
• There are variants of both problems for the Q-function.

Model-Based Learning
• Data → transition probabilities → value function (via dynamic programming).
• Bellman equation: Vπ(s) = Σ_a π(a|s) [ E(r | s, a) + γ Σ_{s'} P(s'|s, a) Vπ(s') ].
• Developed for transition probabilities that are "nice": discrete, Gaussian, Poisson, ...
• Grid example.
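To make these definitions concrete, here is a minimal Python sketch (not from the slides): a made-up two-state MDP stored as dictionaries of transition probabilities P(s'|s,a) and expected rewards E(r|s,a), together with the discounted return of a finite reward sequence. All names and numbers here are illustrative assumptions.

```python
# Minimal sketch of an MDP and discounted returns. The two-state MDP below
# is a made-up example for illustration, not one from the slides.

GAMMA = 0.9  # discount factor, 0 < gamma < 1

# P[s][a] maps next state s' -> P(s'|s,a)
P = {
    "A": {"stay": {"A": 0.8, "B": 0.2}, "go": {"A": 0.1, "B": 0.9}},
    "B": {"stay": {"A": 0.3, "B": 0.7}, "go": {"A": 0.9, "B": 0.1}},
}

# R[s][a] = E(r | s, a), the expected immediate reward
R = {
    "A": {"stay": 0.0, "go": 1.0},
    "B": {"stay": 2.0, "go": 0.0},
}

def discounted_return(rewards, gamma=GAMMA):
    """Return r(0) + gamma*r(1) + gamma^2*r(2) + ... for a finite trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```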
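Continuing the same hypothetical MDP, the Bellman equation can be applied as an update rule until the values stop changing. This is a sketch of iterative policy evaluation (the dynamic programming step above) for a fixed stochastic policy; it is an illustration, not the slides' grid demo.

```python
# Iterative policy evaluation: apply the Bellman equation as an update rule
# until V stops changing. Uses the hypothetical P, R, GAMMA defined above.

# pi[s] maps action a -> pi(a|s), a fixed stochastic policy to evaluate
pi = {
    "A": {"stay": 0.5, "go": 0.5},
    "B": {"stay": 0.5, "go": 0.5},
}

def policy_evaluation(P, R, pi, gamma=GAMMA, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # V(s) = sum_a pi(a|s) [ E(r|s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
            v_new = sum(
                pi[s][a] * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

print(policy_evaluation(P, R, pi))
```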
Model-free Learning
• Bypass estimating the transition probabilities.
• Why? Continuous state variables, and no "nice" functional form.
• (How about using an LSTM/RNN dynamics model, i.e. deep dynamic programming?)

Model-free Learning
• Directly learn the optimal policy π* (policy iteration).
• Directly learn the optimal value function V*.
• Directly learn the optimal action-value function Q*.
• All of these functions can be implemented in a neural network (NN learning = reinforcement learning).

Model-free Learning: What are the data?
• The data are simply a sequence of events s(0), a(0), r(0), s(1), a(1), r(1), ...; the sequence does not tell us expected values or optimal actions.
• Monte Carlo learning: to learn V, observe the return at the end of each episode.
• E.g., ChessBase gives the percentage of wins by White for any position.

Temporal Difference Learning
• Consistency idea: using the current model and the given data s(0), a(0), r(0), s(1), a(1), r(1), ...,
1. estimate the value V(s(t)) at the current state,
2. estimate the next-step value V1(s(t)) = r(t) + γV(s(t+1)),
3. minimize the "error" [V1(s(t)) − V(s(t))]².

Model-Free Learning Example
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
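As a complement to the linked lectures, here is a minimal tabular TD(0) prediction sketch for the toy MDP defined earlier. The sample_next helper that stands in for the environment is my own hypothetical addition, and for simplicity it returns the toy model's expected reward as the sampled reward.

```python
import random

# Tabular TD(0) prediction sketch for the toy MDP above (illustrative).
# After each transition, nudge V(s) toward the one-step target r + gamma*V(s').

ALPHA = 0.1  # learning rate (an assumed hyperparameter)

def sample_next(s, a):
    """Sample s' ~ P(.|s,a); here the expected reward R[s][a] stands in for r."""
    states, probs = zip(*P[s][a].items())
    s2 = random.choices(states, probs)[0]
    return R[s][a], s2

def td0_prediction(pi, episodes=5000, steps=20, gamma=GAMMA, alpha=ALPHA):
    V = {s: 0.0 for s in P}
    for _ in range(episodes):
        s = random.choice(list(P))
        for _ in range(steps):
            a = random.choices(list(pi[s]), list(pi[s].values()))[0]
            r, s2 = sample_next(s, a)
            target = r + gamma * V[s2]       # V1(s(t)) = r(t) + gamma*V(s(t+1))
            V[s] += alpha * (target - V[s])  # reduce [V1(s(t)) - V(s(t))]^2
            s = s2
    return V

print(td0_prediction(pi))  # should roughly match policy_evaluation(P, R, pi)
```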
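For the control problem, the same one-step-target idea can learn Q* directly, without a model. Below is a tabular Q-learning sketch continuing the setup above; the ε-greedy exploration scheme and the hyperparameter values are my assumptions, not from the slides.

```python
# Tabular Q-learning sketch: learn Q*(s,a) directly from sampled transitions,
# with epsilon-greedy exploration. Continues the toy MDP sketches above.

EPSILON = 0.1  # exploration rate (an assumed hyperparameter)

def q_learning(episodes=5000, steps=20, gamma=GAMMA, alpha=ALPHA, eps=EPSILON):
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    for _ in range(episodes):
        s = random.choice(list(P))
        for _ in range(steps):
            if random.random() < eps:          # explore: random action
                a = random.choice(list(Q[s]))
            else:                              # exploit: current best action
                a = max(Q[s], key=Q[s].get)
            r, s2 = sample_next(s, a)
            # One-step target uses the best next action: r + gamma * max_a' Q(s',a')
            target = r + gamma * max(Q[s2].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Q = q_learning()
print({s: max(Q[s], key=Q[s].get) for s in Q})  # greedy policy read off Q
```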