CMPUT 654 – ONLINE LEARNING
ASSIGNMENT 1
FALL 2006
DEPARTMENT OF COMPUTING SCIENCE, UNIVERSITY OF ALBERTA

Due: Midnight, Sunday, October 20
Submission: Send a pdf or postscript document by e-mail
Worth: 25% of the final grade (for details, see the course webpage!)
Instructor: Csaba Szepesvári, Ath331, x2-8581, [email protected]
Homepage: www.cs.ualberta.ca/~szepesva/CMPUT654

Note: In order to achieve 100% on this assignment, you need to collect at least 200 points. (This is far less than the total number of points that can be collected, so you can pick the exercises you want to solve. You are also allowed to hand in more exercises to safeguard your result.) Hard assignments are marked by (*) and come with ridiculously high points (for these problems the points do not have any relationship to the hardness of the problems).

1. Distributing Time Amongst Arms

Consider a decision problem with K options. Assume that we know that the problem is such that, for any of the options, if during n trials this option is used T times, then the resulting total loss is at most $\sqrt{T}$. Hence, if option j is used $T_j$ times, then the total loss you suffer is at most $\sum_{j=1}^{K} \sqrt{T_j}$. Distribute the sampling times between the actions so as to minimize the above bound on the total loss, assuming a fixed number of trials! What is the minimum total loss that can be guaranteed? (The sampling times must sum to n, the number of trials: $\sum_{j=1}^{K} T_j = n$, and, of course, they must be non-negative.)

Score: 20 points

Hint: Proceed as follows: first, conjecture the optimal distribution. Then try to prove that your conjecture holds. For the proof you only need to know an inequality that we used more than once in the class :)

2. Low Regret in a Full Information Stationary Stochastic Environment

Consider a stochastic, stationary environment with K arms: all payoffs of all the arms are independent of each other, and the payoff sequence of any arm forms an i.i.d. sequence. For simplicity, assume that the payoffs lie between 0 and 1. The crucial difference to a bandit problem is that whatever arm you choose, you also learn the payoffs of the other arms (hence, in every trial you gain information about all the arms). Your payoff, however, is still just the payoff of the arm that you have chosen. Design an algorithm for this problem that achieves finite total regret. Prove a bound on the instantaneous regret, i.e., a bound on the expected loss at an arbitrary time n. Using this bound, prove a bound on the total regret.

Hint: Start by proving a bound on the probability of choosing a suboptimal arm at some trial. Use Hoeffding's inequality.

Score: 20 points
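Several of the hints in this assignment refer to Hoeffding's inequality; for convenience, one standard form of it is the following. Let $X_1, \dots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$ almost surely, and let $S_n = \sum_{i=1}^{n} X_i$. Then, for any $t > 0$,

\[ P(S_n - E[S_n] \ge t) \le \exp\left( -\frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right), \]

and the same bound holds for $P(E[S_n] - S_n \ge t)$.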
3. Deciding Who is Better

Consider a two-player, zero-sum game with stochastic payoffs (e.g., Poker). Assume that the payoffs lie in the interval $[-K, K]$ (consider the payoff from the point of view of Player 1). Assume that the players play a series of games but do not learn while they are playing (they do not change their strategies in any way). Given a series of payoffs $X_1, \dots, X_n$ for Player 1, design a simple but powerful test that has failure probability less than 5% and which determines whether one of the players is better than the other (Player 1 is better than Player 2 when his expected payoff is above 0). Note that the possible outcomes for a test like this are: "Player 1 is better than Player 2", "Player 2 is better than Player 1", and "it is undecided which of the players is better".

Determine the probability that the test returns "undecided", assuming that the difference between the players' performances is at most $\Delta$ (this probability is related to the "power" of the test). Discuss your results. Look into how K, the range of the payoffs, influences the test.

Hint: For the analytic part use Hoeffding's inequality.

Score: 20 points

Write a program to study the actual behaviour of the test and experiment with various payoff distributions. Look into how the failure probabilities depend on the variance, on K, and on the expected value of $X_1$. Compare your empirical results with the bounds that you derived. Study the tightness of the theoretical bounds (identify conditions under which the bounds are tight and cases when they are not). Suggest improvements.

Score: 20 points

Design a stopping rule that stops only when it can be decided with high probability whether one of the players is better than the other. Implement this rule and study its behaviour empirically.

Score: 20 points

Prove that the rule that you designed meets its design requirements.

Score: 1000 points (*)

4. Testing for Independence

The purpose of this exercise is to develop further confidence in reasoning about probabilities and, in particular, in using Hoeffding's inequality. In data mining, people look for non-random co-occurrences ("market basket analysis"). One problem here can be stated as follows: Imagine that you observe a number of i.i.d. random variables $Z_1, \dots, Z_N$, $Z_i = (X_i, Y_i)$, $X_i \in A$, $Y_i \in B$, where A and B are finite sets. Design a test that detects whether the distributions underlying $X_i$ and $Y_i$ are independent, i.e., whether it holds that for any $(a, b) \in A \times B$,

(4.1)  $P(X = a, Y = b) = P(X = a) \, P(Y = b)$.

Here $(X, Y)$ is distributed identically to $(X_i, Y_i)$. More specifically, design a test based on Hoeffding's inequality that has failure probability less than 5% and which detects when the random variables are not independent.

Hint: Use Hoeffding's inequality and a union bounding argument. The latter works as follows: Imagine that the failure events $F_1, \dots, F_n$ all have a probability of at most $\delta/n$: $P(F_i) \le \delta/n$. Then, by $P(\cup_{i=1}^{n} F_i) \le \sum_{i=1}^{n} P(F_i)$, the probability that any of these events happens is at most $\delta$. Hence, the probability that no failure happens is at least $1 - \delta$.

Score: 30 points
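To make the empirical side of Exercise 4 concrete, the following Python sketch shows one possible shape of a frequency-based independence test. The deviation threshold eps is deliberately left as an input: deriving a valid value eps(N, delta) from Hoeffding's inequality and the union bound is precisely the exercise. The Bernoulli-style sample data at the bottom is an illustrative assumption only.

    import random
    from collections import Counter

    def independence_test(pairs, eps):
        # Flag dependence when some empirical joint frequency deviates from
        # the product of the empirical marginals by more than eps. A valid
        # eps = eps(N, delta) must come from Hoeffding plus a union bound
        # over A x B; here it is simply a parameter.
        n = len(pairs)
        joint = Counter(pairs)
        margin_x = Counter(x for x, _ in pairs)
        margin_y = Counter(y for _, y in pairs)
        for a in margin_x:
            for b in margin_y:
                p_joint = joint[(a, b)] / n
                p_prod = (margin_x[a] / n) * (margin_y[b] / n)
                if abs(p_joint - p_prod) > eps:
                    return "not independent"
        return "undecided"  # consistent with independence at this sample size

    # Illustration on dependent data: Y copies X with probability 0.9.
    random.seed(0)
    data = []
    for _ in range(10000):
        x = random.randint(0, 1)
        y = x if random.random() < 0.9 else random.randint(0, 1)
        data.append((x, y))
    print(independence_test(data, eps=0.05))  # eps chosen ad hoc here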
5. Regret and Inferior Sampling Times

Consider a stochastic, stationary, i.i.d. bandit setting. Let the payoff of arm i at time t be $X_{it}$ ($X_{it} \in \mathbb{R}$). Consider any bandit algorithm A. Let $T_i(t)$ denote the sampling time of arm i up to time t: $T_i(t) = \sum_{s=1}^{t} \mathbb{I}\{I_s = i\}$, where $I_s \in \{1, \dots, K\}$ denotes the arm chosen at time s by A. Let $Z_t$ be the payoff at time t: $Z_t = X_{I_t, T_{I_t}(t)}$. Prove that

\[ E\left[ \sum_{t=1}^{n} Z_t \right] = \sum_{j=1}^{K} \mu_j \, E[T_j(n)], \]

where $\mu_j = E[X_{j1}]$.

Hint: Use Wald's equation, which states the following: Let $X_1, X_2, \dots$ be an i.i.d. sequence of random variables satisfying $E[|X_i|] < +\infty$. Further, let T be a stopping time with respect to this sequence whose expectation is bounded: $E[T] < +\infty$. Then

\[ E\left[ \sum_{s=1}^{T} X_s \right] = E[T] \, E[X_1]. \]

Exploit that allocation rules can only use past information. Let $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$ be sequences of random variables, where $Y_i$ and $X_j$ are independent for any integers i and j. We say that T is a stopping time with respect to the sequence $X_1, X_2, \dots$ if, for all $n \ge 1$, the event $\{T = n\}$ is completely determined by (at most) the total information known up to time n, namely $\{X_1, X_2, \dots, X_n\}$ and $Y_1, Y_2, \dots$.

Score: 20 points

Now denote by $Z_t^A$ the payoff at time t, emphasizing the role of the allocation rule. Prove that

\[ \sup_A E\left[ \sum_{t=1}^{n} Z_t^A \right] = n \mu^*, \quad \text{where } \mu^* = \max_j \mu_j. \]

Score: 20 points

6. Instantaneous Failure Probability for UCB1

Consider UCB1. Prove a tight (polynomial) bound on the probability that the arm with the highest average value at time n is a suboptimal one.

Score: 1000 points (*)

7. The Bias Term in UCB1

The bias term in UCB1 has the form

\[ c_{ts} = \sqrt{\frac{p \log t}{2 s}}. \]

The theorem proven in the class states that the regret is logarithmic when p > 2 and actually predicts that the regret blows up when p = 2. Study the behaviour of UCB1 for p ≤ 2! More specifically, design a test environment, implement UCB1, and study its behaviour by looking at the expected (total) regret and the variance of the regret (a minimal implementation sketch is given below, after this exercise). Use graphs to illustrate your findings! Summarize your findings in the form of some conjectures.

Score: 20 points

Prove that if p is "too small" then the regret is not logarithmic.

Score: 1000 points (*)

Now assume that $c_{ts}$ is monotone increasing in t and monotone decreasing in s, and that $\lim_{t\to\infty} c_{ts} = \infty$ and $\lim_{s\to\infty} c_{ts} = 0$. Other than this, $c_{ts}$ is not restricted. Is this sufficient to make UCB1 a no-regret algorithm? We call an algorithm "no-regret" if its average expected regret per step is zero or smaller in the limit, i.e., $\limsup_{n\to\infty} R_n / n \le 0$. Here $R_n$ is the expected regret after n steps.

Score: 40 points
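For the experimental part of Exercise 7, here is a minimal Python sketch of UCB1 with the exponent p exposed in the bias term $c_{ts} = \sqrt{p \log t/(2s)}$ given above. The two-armed Bernoulli environment, the horizon, and the number of runs are illustrative assumptions, and the quantity reported is the pseudo-regret; the exercise asks you to design your own test environment and plots.

    import math
    import random

    def run_ucb1(means, n, p, rng):
        # One run of UCB1 with bias term sqrt(p * log t / (2 s)).
        # Returns the pseudo-regret against the best mean.
        K = len(means)
        counts = [0] * K          # s: number of pulls of each arm
        sums = [0.0] * K          # cumulative payoff of each arm
        best = max(means)
        regret = 0.0
        for t in range(1, n + 1):
            if t <= K:
                i = t - 1         # play each arm once to initialize
            else:
                i = max(range(K), key=lambda j: sums[j] / counts[j]
                        + math.sqrt(p * math.log(t) / (2 * counts[j])))
            reward = 1.0 if rng.random() < means[i] else 0.0  # Bernoulli arm
            counts[i] += 1
            sums[i] += reward
            regret += best - means[i]
        return regret

    rng = random.Random(1)
    for p in (0.5, 1.0, 2.0, 3.0):   # p <= 2 is the regime to study
        runs = [run_ucb1([0.5, 0.6], 10000, p, rng) for _ in range(20)]
        mean = sum(runs) / len(runs)
        var = sum((r - mean) ** 2 for r in runs) / len(runs)
        print(f"p={p}: mean regret {mean:.1f}, variance {var:.1f}")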
8. Doubling Trick

Let A = A(η) be a learning algorithm that has a free parameter η (e.g., a learning rate). Assume that it is known that the total regret of A(η) in n trials is bounded by $c n^{\alpha}$ with some c > 0, provided that η is selected to match n: $\eta = f^*(n)$. Here $f^*$ is some function mapping integers to reals. The problem is that you have to know n in order to select η, and since η has to match n in order to achieve a good regret, if there is no fixed time horizon then it is not clear how to run algorithm A (how should η be selected?) so that its regret is still small. The "doubling trick" is a simple solution. It works as follows: Time is segmented into periods. The first period has a length of 1, the second has a length of 2, the third 4, etc. In period k you run the algorithm with parameter $\eta_k = f^*(2^k)$.

Give a closed-form bound on the regret (no sums) for this "meta-algorithm". Try to derive a reasonably tight bound. In deriving the bound consider full periods only, i.e., when $n = 1 + 2 + \dots + 2^k$ for some k.

Score: 10 points

Derive a bound that applies to any time-point n, not just to the period-ends.

Score: 10 points

What would make doubling special? Bound the regret if the period lengths are of the form $\lfloor e^{\beta k} \rfloor$ with β > 0. In this exercise a heuristic analysis is fine (you do not need to worry about the period lengths being integers, and it is fine to have a bound that is restricted to period-ends). What is the trend as a function of β? Can you suggest a good value of β that makes your bound small? In view of your results, reason about possible modifications of the meta-algorithm that may work better than the doubling trick in practice. Reason about the performance of the new algorithm.

Score: 20 points

Assume that the regret bound is logarithmic, i.e., $R_n \le c \log(n)$ with some c > 0. What is the best way to set up the lengths of the periods if the goal is to have a slow growth-rate? Derive a bound on the regret (again, a heuristic analysis suffices).

Score: 20 points

9. Turning Action-Elimination into an Online Algorithm

The Median-Elimination (ME) algorithm of Even-Dar et al. (2003) is designed to pick the best action in a stochastic, stationary bandit problem with a low error probability. If you have K arms with i.i.d. rewards that are confined, say, to the interval [0, 1], the ME algorithm is guaranteed to return an ε-optimal arm with probability at least $1 - \delta$ using

\[ N_{\varepsilon,\delta,K} = \left\lfloor c \, \frac{K}{\varepsilon^2} \log(1/\delta) \right\rfloor \]

samples. Here c > 0 is a constant that can be found in the above article. Your task is to turn this algorithm into one that achieves a low expected regret. For this, consider running the algorithm in phases: each phase is divided into an exploration and an exploitation segment. Try to design the lengths of these so that the expected regret is small. Again, a heuristic analysis suffices.

Score: 20 points

10. Approximating the Max-Norm

Approximations of the max-norm play an important role in the construction and analysis of several online learning algorithms (cf. the "potential function method"). Prove that for any real numbers $a_1, \dots, a_n$,

\[ \lim_{p\to\infty} \left( \sum_{i=1}^{n} |a_i|^p \right)^{1/p} = \max_{i=1,\dots,n} |a_i|, \]

and that

\[ \lim_{\eta\to\infty} \frac{1}{\eta} \ln \sum_{i=1}^{n} e^{\eta a_i} = \max_{i=1,\dots,n} a_i. \]

Hint: To prove the second result, use the first one.

Score: 10 points

11. Horse Race

Consider the horse race problem (see the slides for more information), but with the modification that there is a cap on the amount that you can bet in a single round and also on the odds. The guarantee derived in the class was that the multiplicative loss is bounded. Do you think that this is a strong guarantee, given that both the bets and the wins are capped? What would be a stronger guarantee for this modified problem? Implement a simple version of the problem: the index of the winning horse could be, e.g., a periodic or almost periodic function of time ($\sin(x) + \sin(\sqrt{2}\,x)$ is a good start), and the odds could take values like, e.g., {1, 2, 3, 4}. The experts could use the same function used to generate the index of the winning horse, but with different parameters (e.g., frequency, phase). Look at the actual payoffs, the logarithm of the payoffs, and both the additive and multiplicative differences to the cumulative payoff of the best expert.

Score: 40 points
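For Exercise 11, here is a minimal Python sketch of the suggested environment: the winner's index is driven by $\sin(x) + \sin(\sqrt{2}\,x)$ as proposed above, and the experts use the same generator with perturbed parameters. The capped bet-everything-on-one-horse scheme, the number of horses, and the odds cycle are illustrative assumptions; designing and analysing the actual betting strategy is the exercise.

    import math

    def winner_index(t, n_horses, freq=1.0, phase=0.0):
        # Map the almost-periodic signal sin(x) + sin(sqrt(2) x) into a
        # horse index in {0, ..., n_horses - 1}.
        x = freq * t + phase
        s = math.sin(x) + math.sin(math.sqrt(2) * x)   # s lies in [-2, 2]
        return min(int((s + 2) / 4 * n_horses), n_horses - 1)

    n_horses, n_rounds, bet_cap = 4, 1000, 1.0
    odds = [1, 2, 3, 4]               # capped odds, cycled over rounds
    # Hypothetical experts: same generator, different frequency/phase.
    experts = [(0.9, 0.0), (1.0, 0.5), (1.1, 1.0)]

    wealth = [0.0] * len(experts)     # cumulative (additive) payoff
    for t in range(n_rounds):
        w = winner_index(t, n_horses)
        o = odds[t % len(odds)]
        for e, (freq, phase) in enumerate(experts):
            guess = winner_index(t, n_horses, freq, phase)
            # Each expert bets the cap on its predicted winner.
            wealth[e] += bet_cap * (o - 1) if guess == w else -bet_cap

    best = max(wealth)
    for e, v in enumerate(wealth):
        print(f"expert {e}: payoff {v:.0f}, gap to best {best - v:.0f}")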
References

Even-Dar, E., Mannor, S., and Mansour, Y. (2003). Action elimination and stopping conditions for reinforcement learning. In Fawcett, T. and Mishra, N., editors, Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pages 162–169.