Math 6330: Statistical Consulting, Class 11
Tony Cox, [email protected]
University of Colorado at Denver
Course web site: http://cox-associates.com/6330/

Course schedule
• April 14: Draft of project/term paper due
• April 18, 25, May 2: In-class presentations
• May 2: Last class
• May 4: Final project/paper due by 8:00 PM

MAB Thompson sampling (cont.)

Thompson sampling and adaptive Bayesian control: Bernoulli trials
• Basic idea: choose each of the k actions according to the probability that it is best
• Estimate that probability via Bayes' rule
– It is the mean of the posterior distribution
– Use beta conjugate prior updating for the "Bernoulli bandit" (0-1 reward: S = success, F = failure)
– Sample from the posterior for each arm 1, …, k; choose the arm with the highest sampled value. Update and repeat.
• Agrawal and Goyal, 2012: http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf

Thompson sampling: general stochastic (random) rewards
• Second idea: generalize to an arbitrary reward distribution (normalized to the interval [0, 1]) by counting a trial as a "success" with probability equal to its reward
• Agrawal and Goyal, 2012: http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf

Thompson sampling with complex online actions
• Main idea: embed simulation-optimization in the Thompson sampling loop
– S = state space; sample the states
– Y = observation, h = reward, X = random variable depending on the action
• Applications: job scheduling (assigning jobs to machines); web advertising with reward depending on the sets of ads shown
• Updating posteriors can be done efficiently using a sampling-based approach (particle filtering)
• Gopalan et al., 2014: http://jmlr.org/proceedings/papers/v32/gopalan14.pdf

Comparing methods
• In simulation experiments, Thompson sampling works well with batch updating, even with slowly or occasionally changing rewards and other realistic complexities.
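The Bernoulli-bandit procedure described above (beta conjugate updating; sample each arm's posterior; play the arm with the highest sampled value) can be sketched in a few lines of Python. The arm success probabilities below are hypothetical values chosen for illustration, not taken from the slides:

```python
import random

def thompson_bernoulli(true_probs, n_rounds=10_000, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors.

    Each round: draw one sample from every arm's Beta posterior,
    play the arm with the highest sampled value, observe a 0/1
    reward, and update that arm's success/failure counts.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k  # posterior Beta "success" parameters (uniform prior)
    beta = [1] * k   # posterior Beta "failure" parameters
    total_reward = 0
    for _ in range(n_rounds):
        # Sample each arm's posterior; choose the arm with the largest draw
        draws = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: draws[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        total_reward += reward
        if reward:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return total_reward, alpha, beta

# Example: three hypothetical arms; the best arm (0.7) should
# accumulate the overwhelming share of the pulls.
reward, alpha, beta = thompson_bernoulli([0.3, 0.5, 0.7])
pulls = [alpha[i] + beta[i] - 2 for i in range(3)]
print(pulls, reward)
```

The same code handles general rewards in [0, 1] via the trick on the earlier slide: replace the observed reward r with a Bernoulli draw that equals 1 with probability r, then update as before.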
• Beats UCB1 in many but not all comparisons
• More practical than UCB1 for batch updating because it keeps experimenting (trying actions with some randomness) between updates
• http://engineering.richrelevance.com/recommendations-thompson-sampling/

MAB variations
• Contextual bandits
– See a signal before acting
– Constrained contextual bandits: actions are constrained
• Adversarial bandits
– Adaptive adversaries
– Bubeck and Slivkins, 2012: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/COLT12_BS.pdf
• Restless bandits: probabilities change over time
• Gittins index: maximizes expected discounted reward, but is not easy to compute
• Correlated bandits

Wrap-up on MAB problems
• Adaptive Bayesian learning works well in simple environments, including many of practical interest
• The resulting rules are *much* simpler to implement than previous methods (e.g., Gittins index policies)
• Sampling-based approaches (Thompson sampling, particle filtering, etc.) make "online learning" computationally practical

Wrap-up on adaptive learning
• No need for a causal model
– Learn act-consequence probabilities and optimal decision rules directly
• Assumes a stationary (or slowly changing) decision environment, a known choice set, and immediate feedback (reward) following each action
• Works very well when these assumptions are met: low-regret learning is possible

Optimal stopping

Optimal stopping decision problems
• Suppose a decision-maker (d.m.) faces a random sequence of opportunities
– How long should she wait for the best one?
– When should she stop and commit to a final choice?
• Examples: selling a house, hiring a new employee, accepting a job offer, replacing a component, shuttering an aging facility, taking a parking spot, etc.
• Other optimal stopping problems: least-cost policies for replacing aging components

Hazard functions: conditional rate of failure given survival so far
• Let T = length of life for a component (or person, or time until first occurrence of an event, etc.)
– T is a random variable with cdf F(t) = Pr(T ≤ t) and survival function S(t) = 1 − F(t) = Pr(T > t)
– The pdf of T is f(t) = F′(t) = dF(t)/dt
• The hazard function of T is defined as
– h(t) = lim_{dt→0} Pr(t < T ≤ t + dt | T > t)/dt
– h(t) = f(t)/S(t) = f(t)/[1 − F(t)]
• Interpretation: "instantaneous failure rate"
– h(t)·dt ≈ Pr(failure occurs in the next dt | survival until t)
– In discrete time, dt = 1 and no limit is taken

Using hazard functions to guide decisions
• The shape of the hazard function can often guide decisions, e.g.:
– If h(t) is increasing, the optimal time to stop is when h(t) first reaches a certain threshold
– If h(t) is decreasing, the best decision is either not to start, or else to continue until failure occurs
• Normal distribution hazard function calculator: http://reliabilityanalyticstoolkit.appspot.com/normal_distribution
• SPRT and other calculators: http://reliabilityanalyticstoolkit.appspot.com/
• www.wolfram.com/mathematica/new-in-9/enhanced-probability-and-statistics/define-a-distribution-given-its-hazard-function.html
• https://www.ncss.com/software/ncss/survival-analysis-in-ncss/

Example: optimal age replacement
• The lifetime T of a component is a random variable with a known distribution
• Suppose it costs $10 to replace the component before it fails and $50 to replace it after it fails
• When should the component be voluntarily replaced (if it has not yet failed)?
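For the $10/$50 example, a renewal-reward argument gives the long-run expected cost per unit time of replacing at age tau as [c_p·S(tau) + c_f·(1 − S(tau))] / ∫₀^tau S(t) dt, which can be minimized numerically over tau. A minimal sketch in Python, assuming (purely for illustration, not from the slides) a Weibull lifetime with shape 2 and scale 10, i.e., an increasing hazard h(t) = t/50:

```python
import math

def weibull_survival(t, shape=2.0, scale=10.0):
    """S(t) = Pr(T > t) for a Weibull lifetime (shape > 1 => increasing hazard)."""
    return math.exp(-((t / scale) ** shape))

def cost_rate(tau, c_planned=10.0, c_failure=50.0, n_grid=2000):
    """Long-run expected cost per unit time under age replacement at tau.

    Renewal-reward: expected cost per cycle divided by expected cycle
    length E[min(T, tau)] = integral of S(t) over [0, tau].
    """
    S_tau = weibull_survival(tau)
    expected_cost = c_planned * S_tau + c_failure * (1.0 - S_tau)
    # Trapezoid-rule integration of S(t) on [0, tau]
    dt = tau / n_grid
    expected_length = sum(
        0.5 * (weibull_survival(i * dt) + weibull_survival((i + 1) * dt)) * dt
        for i in range(n_grid)
    )
    return expected_cost / expected_length

# Grid search for the replacement age minimizing cost per unit time
best_tau = min((0.1 * j for j in range(1, 300)), key=cost_rate)
print(round(best_tau, 1), round(cost_rate(best_tau), 3))
```

Because shape > 1 makes the hazard increasing, a finite optimal replacement age exists; with a decreasing hazard the minimization would push tau toward infinity (run to failure), matching the hazard-shape guidance above.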
• The answer can be calculated by minimizing the expected average cost per cycle (or by equating the marginal benefit of continuing to its marginal cost), but the calculations are detailed and soon become tedious
• Alternative: Google "optimal replacement age calculator"

Optimal age replacement calculator
• http://www.reliawiki.org/index.php/Optimum_Replacement_Time_Example

Optimal selling of an asset
• If offers arrive sequentially from a known distribution and the costs of waiting are known, then an optimal decision boundary can be constructed to maximize EMV: sell when the price series first hits the boundary
– W(t) = price series, S(t) = maximum price so far
• http://file.scirp.org/Html/9-1040163_25151.htm

Optimal stopping: variations
• Offers arrive sequentially from an unknown distribution
– Bayesian updating provides solutions
• Time pressure: must sell by a deadline, or within a fixed number of offers
• With or without the ability to return to previous offers
• http://file.scirp.org/Html/9-1040163_25151.htm

Wrap-up on optimal stopping and statistical decision theory
• Many valuable decision problems can be solved using the philosophy of simulation-optimization:
– Try different decisions and evaluate their probable consequences
– Choose the one with the best (EMV- or EU-maximizing) probability distribution of consequences
• Finding a best decision or decision rule can become very technical
– Use appropriate software or on-line calculators
• For business applications, understanding how to formulate decision problems and solve them with software can create high value in practice

Heuristics and biases