
CMPUT 654 – ONLINE LEARNING
ASSIGNMENT 1
FALL 2006
DEPARTMENT OF COMPUTING SCIENCE
UNIVERSITY OF ALBERTA
Due: Midnight, Sunday, October 20
Submission: Send a pdf or postscript document in e-mail
Worth: 25% of final grade (for details, see the course webpage!)
Instructor: Csaba Szepesvári, Ath331, x2-8581, [email protected]
Homepage: www.cs.ualberta.ca/~szepesva/CMPUT654
Note: In order to achieve 100% on this assignment, you need to collect at least
200 points. (This is far less than the total number of points available, so you
can pick the exercises you want to solve. You are also allowed to hand in extra
exercises to safeguard your result.) Hard assignments are marked by * and come
with ridiculously high points (for these problems, the points bear no relationship
to the hardness of the problems).
1. Distributing Time Amongst Arms
Consider a decision problem with K options. Assume that we know that the problem
is such that, for any of the options, if during n trials this option is used T times,
then the resulting total loss is at most $\sqrt{T}$. Hence, if option j is used $T_j$
times, then the total loss you suffer is at most $\sum_{j=1}^{K} \sqrt{T_j}$. Distribute
the sampling times between the actions to minimize the above bound on the total loss,
assuming a fixed number of trials! What is the minimum total loss that can be
guaranteed? (The sampling times must sum to n, the number of trials:
$\sum_{j=1}^{K} T_j = n$, and, of course, they must be non-negative.)
Score: 20 points
Hint: Proceed as follows: first, conjecture the optimal distribution. Then try to
prove that your conjecture holds. For the proof you only need to know an inequality
that we used more than once in class. :)
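Before attempting the proof, a quick numerical check of the conjecture can help. The sketch below (the candidate allocations are purely illustrative assumptions) evaluates the bound $\sum_{j=1}^{K} \sqrt{T_j}$ for a few allocations summing to n:

```python
import numpy as np

def loss_bound(T):
    """The guaranteed total-loss bound sum_j sqrt(T_j) for an allocation."""
    return float(np.sum(np.sqrt(T)))

n, K = 1000, 4
rng = np.random.default_rng(0)
candidates = {
    "uniform":      np.full(K, n / K),
    "concentrated": np.array([n] + [0] * (K - 1), dtype=float),
    "random":       n * rng.dirichlet(np.ones(K)),   # random allocation summing to n
}
for name, T in candidates.items():
    print(f"{name:>12}: bound = {loss_bound(T):.2f}")
```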
2. Low Regret in a Full Information Stationary Stochastic
Environment
Consider a stochastic, stationary environment with K arms: All payoffs of all the
arms are independent of each other and the payoff sequences for any arm form an
i.i.d. sequence. For simplicity, assume that the payoffs lie between 0 and 1. The
crucial difference from a bandit problem is that whichever arm you choose, you also
learn the payoffs of the other arms (hence, in every trial you gain information about
all the arms). Your payoff, however, is still just the payoff of the arm that you have
chosen.
Design an algorithm for this problem that achieves finite total regret. Prove a bound
on the instantaneous regret, i.e., a bound on the expected loss at an arbitrary time
n. Using this bound, prove a bound on the total regret.
Hint: Start by proving a bound on the probability of choosing a suboptimal arm
at some trial. Use Hoeffding’s inequality.
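One natural candidate (a sketch, with Bernoulli payoffs as an illustrative assumption, not necessarily the intended algorithm) is to follow the leader on the empirical means; since all arms are observed every round, each empirical mean is based on t samples, which is exactly where the hint's Hoeffding argument applies:

```python
import numpy as np

def follow_the_leader(payoffs):
    """Play the arm with the best empirical mean so far.

    In the full-information setting the payoffs of all arms are revealed
    each round, so the leader is simply the arm with the largest payoff sum.
    `payoffs` has shape (n, K).
    """
    n, K = payoffs.shape
    sums = np.zeros(K)
    total = 0.0
    for t in range(n):
        arm = int(np.argmax(sums))    # current leader (arbitrary at t = 0)
        total += payoffs[t, arm]
        sums += payoffs[t]            # full information: observe every arm
    return total

# Illustrative Bernoulli environment with means 0.4, 0.5, 0.7.
rng = np.random.default_rng(0)
means = np.array([0.4, 0.5, 0.7])
payoffs = (rng.random((10_000, 3)) < means).astype(float)
print("realized total payoff:", follow_the_leader(payoffs))
```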
Score: 20 points
3. Deciding Who is Better
Consider a two-player, zero-sum game with stochastic payoffs (e.g. Poker). Assume
that the payoffs lie in the interval [−K, K] (consider the payoff from the point of
view of Player 1). Assume that the players play a series of games but they do not
learn while they are playing (they do not change their strategies in any way). Given
a series of payoffs, X1 , . . . , Xn , for Player 1, design a simple, but powerful test that
has failure probability less than 5% and which determines if one of the players is
better than the other (Player 1 is better than Player 2 when his expected payoff
is above 0). Note that the possible outcomes for a test like this are: “Player 1 is
better than Player 2”, “Player 2 is better than Player 1”, “it is undecided which of
the players is better”. Determine the probability that the test returns “undecided”,
assuming that the difference between the players’ performances is at least $\Delta$
(this probability determines the “power” of the test). Discuss your results. Look into
how K, the range of the payoffs, influences the test.
Hint: For the analytic part use Hoeffding’s inequality.
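For concreteness, here is a minimal sketch of one such test (the constants and the simulated payoff distribution are illustrative assumptions, not part of the assignment):

```python
import numpy as np

def hoeffding_test(payoffs, K, delta=0.05):
    """Decide which player is better from Player 1's payoffs in [-K, K].

    Hoeffding's inequality for variables with range 2K gives
    P(|mean - mu| >= eps) <= 2 * exp(-n * eps**2 / (2 * K**2)),
    so eps = 2 * K * sqrt(log(2 / delta) / (2 * n)) keeps the
    failure probability at most delta.
    """
    n = len(payoffs)
    mean = np.mean(payoffs)
    eps = 2 * K * np.sqrt(np.log(2 / delta) / (2 * n))
    if mean > eps:
        return "Player 1 is better"
    if mean < -eps:
        return "Player 2 is better"
    return "undecided"

# Illustrative simulation: Player 1 has a true edge of roughly 0.5.
rng = np.random.default_rng(0)
K = 5.0
payoffs = np.clip(rng.uniform(-K, K, size=10_000) + 0.5, -K, K)
print(hoeffding_test(payoffs, K))
```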
Score: 20 points
Write a program to study the actual behaviour of the test and experiment with
various payoff distributions. Look into how the failure probabilities depend on the
variance, K and the expected value of X1 . Compare your empirical results with
the bounds that you derived. Study the tightness of the theoretical bounds (identify conditions when the bounds are tight and cases when they are not). Suggest
improvements.
Score: 20 points
Design a stopping rule that stops only when it can be decided with high probability
if one of the players is better than the other. Implement this rule and study its
behaviour empirically.
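One possible design (a sketch, not necessarily the intended rule) spends the error budget over rounds so that a union bound over all stopping times keeps the total failure probability below delta:

```python
import math
import random

def sequential_test(payoff_stream, K, delta=0.05):
    """Stop as soon as the winner can be declared with confidence 1 - delta.

    The error budget is split as delta_n = delta / (n * (n + 1)); since
    sum_n 1/(n*(n+1)) = 1, the union bound over all rounds keeps the total
    failure probability below delta.
    """
    total, n = 0.0, 0
    for n, x in enumerate(payoff_stream, start=1):
        total += x
        mean = total / n
        delta_n = delta / (n * (n + 1))
        eps = 2 * K * math.sqrt(math.log(2 / delta_n) / (2 * n))
        if abs(mean) > eps:
            return ("Player 1 is better" if mean > 0 else "Player 2 is better"), n
    return "undecided", n

# Illustrative stream with a true mean of about 1.0, clipped to [-K, K].
K = 5.0
rng = random.Random(0)
stream = (max(-K, min(K, rng.uniform(-K + 1.0, K + 1.0))) for _ in range(100_000))
print(sequential_test(stream, K))
```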
Score: 20 points
Prove that the rule that you designed meets its design requirements.
Score: 1000 points (*)
4. Testing for Independence
The purpose of this exercise is to develop further confidence in reasoning about
probabilities and in particular using Hoeffding’s inequality.
In data mining people look for non-random co-occurrences (“market basket analysis”). One problem here can be stated as follows: Imagine that you observe a
number of i.i.d. random variables, Z1 , . . . , ZN , Zi = (Xi , Yi ), Xi ∈ A, Yi ∈ B,
where A and B are finite sets. Design a test that detects if the distributions underlying Xi and Yi are independent, i.e., if it holds that for any (a, b) ∈ A × B,
(4.1)    $P(X = a, Y = b) = P(X = a)\, P(Y = b)$.
Here (X, Y ) are identically distributed with (Xi , Yi ). More specifically, design a
test based on Hoeffding’s inequality that has failure probability less than 5% and
which detects when the random variables are not independent.
Hint: Use Hoeffding’s inequality and a union bounding argument. The latter works
as follows: Imagine that the failure events, $F_1, \dots, F_n$, all have a probability
of at most $\delta/n$: $P(F_i) \le \delta/n$. Then, by
$P(\cup_{i=1}^{n} F_i) \le \sum_{i=1}^{n} P(F_i)$, you have that the probability
that any of these events happens is at most $\delta$. Hence, the probability
that no failure happens is at least $1 - \delta$.
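For concreteness, here is a sketch of such a test (the $3\epsilon$ threshold is an assumption obtained by combining one Hoeffding deviation for the joint frequency with one for each marginal; the budget split is the union bound from the hint):

```python
import math
import random
from collections import Counter

def independence_test(xs, ys, delta=0.05):
    """Reject independence only when the empirical frequencies force it.

    Each empirical frequency is an average of indicator variables, so by
    Hoeffding |p_hat - p| <= eps holds with probability >= 1 - delta/m when
    eps = sqrt(log(2*m/delta) / (2*N)); a union bound over all
    m = |A||B| + |A| + |B| estimated frequencies keeps the total failure
    probability below delta. Under independence, the triangle inequality
    then gives |p_hat(a,b) - p_hat(a)*p_hat(b)| <= 3*eps for every (a, b).
    """
    N = len(xs)
    A, B = sorted(set(xs)), sorted(set(ys))
    m = len(A) * len(B) + len(A) + len(B)
    eps = math.sqrt(math.log(2 * m / delta) / (2 * N))
    p_ab, p_a, p_b = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    for a in A:
        for b in B:
            gap = abs(p_ab[(a, b)] / N - (p_a[a] / N) * (p_b[b] / N))
            if gap > 3 * eps:
                return "not independent"
    return "independence not rejected"

# Illustrative data: Y copies X half the time, so the pair is dependent.
rng = random.Random(0)
xs = [rng.choice("ab") for _ in range(20_000)]
ys = [x if rng.random() < 0.5 else rng.choice("cd") for x in xs]
print(independence_test(xs, ys))
```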
Score: 30 points
5. Regret and Inferior Sampling Times
Consider a stochastic, stationary, i.i.d. bandit setting. Let the payoff of arm $i$ at
time $t$ be $X_{it}$ ($X_{it} \in \mathbb{R}$). Consider any bandit algorithm, $A$. Let
$T_i(t)$ denote the sampling time of arm $i$ up to time $t$:
$T_i(t) = \sum_{s=1}^{t} \mathbb{I}_{\{I_s = i\}}$, where $I_s \in \{1, \dots, K\}$
denotes the arm chosen at time $s$ by $A$. Let $Z_t$ be the payoff at time $t$:
$Z_t = X_{I_t, T_{I_t}(t)}$. Prove that
$$\mathbb{E}\left[ \sum_{t=1}^{n} Z_t \right] = \sum_{j=1}^{K} \mu_j\, \mathbb{E}[T_j(n)],$$
where $\mu_j = \mathbb{E}[X_{j1}]$.
Hint: Use Wald’s equation, which states the following: Let $X_1, X_2, \dots$ be an i.i.d.
sequence of random variables satisfying $\mathbb{E}[|X_i|] < +\infty$. Further, let $T$ be
a stopping time with respect to this sequence whose expectation is bounded:
$\mathbb{E}[T] < +\infty$. Then
$$\mathbb{E}\left[ \sum_{s=1}^{T} X_s \right] = \mathbb{E}[T]\, \mathbb{E}[X_1].$$
Exploit that allocation rules can only use past information.
Let $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$ be sequences of random variables, where $Y_i$
and $X_j$ are independent for any integers $i$ and $j$. We say that $T$ is a stopping time
with respect to the sequence $X_1, X_2, \dots$ if, for all $n \ge 1$, the event $\{T = n\}$
is completely determined by (at most) the total information known up to time $n$:
$\{X_1, X_2, \dots, X_n\}$ and $Y_1, Y_2, \dots$.
Score: 20 points
Now, denote by $Z_t^A$ the payoff at time $t$, emphasizing the role of the allocation
rule. Prove that $\sup_A \mathbb{E}\left[ \sum_{t=1}^{n} Z_t^A \right] = n\mu^*$, where
$\mu^* = \max_j \mu_j$.
Score: 20 points
6. Instantaneous Failure Probability for UCB1
Consider UCB1. Prove a tight (polynomial) bound on the probability that the arm
with the highest average value at time n is a suboptimal one.
Score: 1000 points (*)
7. The Bias Term in UCB1
The bias term in UCB1 has the form
$$c_{ts} = \sqrt{\frac{p \log t}{2s}}.$$
The theorem proven in class states that the regret is logarithmic when $p > 2$
and actually predicts that the regret blows up when $p = 2$. Study the behaviour
of UCB1 for $p \le 2$! More specifically, design a test environment, implement UCB1,
and study its behaviour by looking at the expected (total) regret and the variance
of the regret. Use graphs to illustrate your findings! Summarize your findings in
the form of some conjectures.
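A minimal experimental harness might look like the following sketch (Bernoulli arms and all concrete parameter values are illustrative assumptions):

```python
import numpy as np

def ucb1(means, p, horizon, rng):
    """Run UCB1 with bias term sqrt(p * log(t) / (2 * s)) on Bernoulli arms.

    Returns the pseudo-regret after `horizon` steps; `means` are the true
    arm means of the test environment.
    """
    K = len(means)
    counts = np.zeros(K)          # s: number of pulls per arm
    sums = np.zeros(K)            # running payoff sums
    regret, best = 0.0, max(means)
    for t in range(1, horizon + 1):
        if t <= K:
            i = t - 1             # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(p * np.log(t) / (2 * counts))
            i = int(np.argmax(ucb))
        sums[i] += rng.random() < means[i]
        counts[i] += 1
        regret += best - means[i]
    return regret

rng = np.random.default_rng(0)
for p in (3.0, 2.0, 1.0, 0.5):
    runs = [ucb1([0.5, 0.6], p, 10_000, rng) for _ in range(20)]
    print(f"p={p}: mean regret {np.mean(runs):.1f}, std {np.std(runs):.1f}")
```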
Score: 20 points
Prove that if p is “too small” then the regret is not logarithmic.
Score: 1000 points (*)
Now assume that $c_{ts}$ is monotone increasing in $t$ and monotone decreasing in $s$,
and that $\lim_{t\to\infty} c_{ts} = \infty$ and $\lim_{s\to\infty} c_{ts} = 0$. Other
than this, $c_{ts}$ is not restricted. Is this sufficient to make UCB1 a no-regret
algorithm? We call an algorithm “no-regret” if its average expected regret per step is
zero or smaller in the limit, i.e., $\limsup_{n\to\infty} R_n/n \le 0$. Here $R_n$ is
the expected regret after $n$ steps.
Score: 40 points
8. Doubling Trick
Let $A = A(\eta)$ be a learning algorithm that has a free parameter $\eta$ (e.g., a
learning rate). Assume that it is known that the total regret of $A(\eta)$ in $n$ trials
is bounded by $c n^\alpha$ for some $c > 0$, provided that $\eta$ is selected to match
$n$: $\eta = f^*(n)$. Here $f^*$ is some function mapping integers to reals.
The problem is that you have to know $n$ in order to select $\eta$. Since $\eta$ has to
match $n$ in order to achieve a good regret, if there is no fixed time horizon then it is
not clear how to run algorithm $A$ (how should $\eta$ be selected?) so that its regret is
still small.
The “doubling trick” is a simple solution. It works as follows: Time is segmented
into periods. The first period has a length of 1, the second has a length of 2, the
third 4, etc. In period $k$ you run the algorithm with parameter $\eta_k = f^*(2^k)$.
Give a closed-form bound on the regret (no sums) for this “meta-algorithm”. Try
to derive a reasonably tight bound. In deriving the bound, consider full periods
only, i.e., when $n = 1 + 2 + \dots + 2^k$ for some $k$.
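For orientation (a sketch, assuming only the stated per-period guarantee, i.e., that the period of length $2^i$ contributes at most $c\,(2^i)^\alpha$ to the regret), the per-period bounds sum as a geometric series:
$$R_n \le \sum_{i=0}^{k} c\,(2^i)^{\alpha} = c\, \frac{2^{(k+1)\alpha} - 1}{2^{\alpha} - 1} \le \frac{2^{\alpha}}{2^{\alpha} - 1}\, c\, n^{\alpha},$$
where the last step uses $2^k \le n = 2^{k+1} - 1$.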
Score: 10 points
Derive a bound that applies to any time-point n, not just to the period-ends.
Score: 10 points
What would make doubling special? Bound the regret if the period lengths are of
the form $\lfloor e^{\beta k} \rfloor$ with $\beta > 0$. In this exercise a heuristic
analysis is fine (you do not need to care about the period lengths being integers, and
it is fine to have a bound that is restricted to period-ends). What is the trend as a
function of $\beta$?
Can you suggest a good value of β that makes your bound small? In view of your
results reason about possible modifications of the meta-algorithm that may work
better than the doubling trick in practice. Reason about the performance of the
new algorithm.
Score: 20 points
Assume that the regret bound is logarithmic, i.e., $R_n \le c \log(n)$ for some $c > 0$.
What is the best way to set up the lengths of the periods if the goal is to have a slow
growth-rate? Derive a bound on the regret (again, a heuristic analysis suffices).
Score: 20 points
9. Turning Action-Elimination into an Online Algorithm
The Median-Elimination (ME) algorithm of Even-Dar et al. (2003) is designed to
pick the best action in a stochastic, stationary bandit problem with a low error
probability. If you have $K$ arms with i.i.d. rewards that are confined, say, to the
interval $[0, 1]$, the ME algorithm is guaranteed to return an $\epsilon$-optimal arm
(with probability at least $1 - \delta$) using
$$N_{\epsilon,\delta,K} = \left\lfloor c\, \frac{K}{\epsilon^2} \log(1/\delta) \right\rfloor$$
samples. Here $c > 0$ is a constant that can be found in the above article. Your
task is to turn this algorithm into one that achieves a low expected regret. To this
end, consider running the algorithm in phases: each phase is divided into an
exploration and an exploitation segment. Try to design the lengths of these segments
so that the expected regret is small. Again, a heuristic analysis suffices.
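As one possible starting point, the sketch below (the schedules and the constant c are assumptions for illustration, not the unique answer) balances the exploration cost of each phase against the regret accumulated while exploiting:

```python
import math

def phase_plan(num_phases, K, c=1.0):
    """Heuristic phase design for turning ME into an online algorithm.

    In phase k, run ME to obtain an eps_k-optimal arm with confidence
    delta_k at a cost of N_k exploration steps, then exploit that arm for
    L_k steps. Taking eps_k = delta_k = 2**-k and L_k = N_k / eps_k
    balances the exploration cost N_k against the exploitation regret
    (eps_k + delta_k) * L_k, which heuristically yields total regret of
    order T**(2/3) (up to log factors) after T steps.
    """
    plan = []
    for k in range(1, num_phases + 1):
        eps = delta = 2.0 ** -k
        N = math.ceil(c * K / eps ** 2 * math.log(1 / delta))
        L = math.ceil(N / eps)
        plan.append((k, eps, N, L))
    return plan

for k, eps, N, L in phase_plan(5, K=10):
    print(f"phase {k}: eps={eps:.4f}, explore {N} steps, exploit {L} steps")
```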
Score: 20 points
10. Approximating the Max-Norm
Approximations of the max-norm play an important role in the construction and
analysis of several online learning algorithms (cf. “potential function method”).
Prove that for any real numbers $a_1, \dots, a_n$,
$$\lim_{p \to \infty} \left( \sum_{i=1}^{n} |a_i|^p \right)^{1/p} = \max_{i=1,\dots,n} |a_i|,$$
and that
$$\lim_{\eta \to \infty} \frac{1}{\eta} \ln \sum_{i=1}^{n} e^{\eta a_i} = \max_{i=1,\dots,n} a_i.$$
Hint: To prove the second result, use the first one.
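One way to unpack the hint (a sketch): the log-sum-exp is the logarithm of a $p$-norm of the vector $(e^{a_1}, \dots, e^{a_n})$ with $p = \eta$,
$$\frac{1}{\eta} \ln \sum_{i=1}^{n} e^{\eta a_i} = \ln \left( \sum_{i=1}^{n} \left( e^{a_i} \right)^{\eta} \right)^{1/\eta},$$
so letting $\eta \to \infty$ and applying the first result to the non-negative numbers $e^{a_i}$, together with the continuity of $\ln$, gives the limit $\ln \max_i e^{a_i} = \max_i a_i$.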
Score: 10 points
11. Horse Race
Consider the horse race problem (see the slides for more information) but with the
modification that there is a cap on the amount that you can bet in a single round and
also on the odds. The guarantee derived in class was that the multiplicative
loss is bounded. Do you think that this is a strong guarantee, given that both the
bets and the winnings are capped? What would be a stronger guarantee for this
modified problem? Implement a simple version of the problem: the index of the
winning horse could be, e.g., a periodic or almost periodic function of time
($\sin(x) + \sin(\sqrt{2}\,x)$ is a good start) and the odds could take values like,
e.g., $\{1, 2, 3, 4\}$. The
experts could use the same function used to generate the index of the winning horse
but with different parameters (e.g. frequency, phase). Look at the actual payoffs,
the logarithm of the payoffs and both the additive and multiplicative differences to
the cumulative payoff of the best expert.
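A possible skeleton for the simulation (every concrete choice here, from the signal-to-index mapping to the parameter grid, is an illustrative assumption):

```python
import numpy as np

def winning_horse(t, freq=1.0, phase=0.0, n_horses=4):
    """Index of the winning horse at time t from an almost-periodic signal."""
    s = np.sin(freq * t + phase) + np.sin(np.sqrt(2) * freq * t + phase)
    return int((s + 2) / 4 * n_horses) % n_horses   # map [-2, 2] to {0,...,3}

# Experts predict with the same signal family but perturbed parameters;
# we track their cumulative payoffs under capped bets and fixed odds.
T, bet_cap = 1000, 1.0
odds = np.array([1.0, 2.0, 3.0, 4.0])
experts = [(f, p) for f in (0.9, 1.0, 1.1) for p in (0.0, 0.5)]
wealth = np.ones(len(experts))
for t in range(T):
    w = winning_horse(t)
    for j, (f, p) in enumerate(experts):
        guess = winning_horse(t, freq=f, phase=p)
        bet = min(bet_cap, max(wealth[j], 0.0))      # cap on the bet size
        wealth[j] += bet * (odds[w] - 1) if guess == w else -bet
print("final wealth per expert:", np.round(wealth, 2))
```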
Score: 40 points
References
Even-Dar, E., Mannor, S., and Mansour, Y. (2003). Action elimination and stopping conditions for reinforcement learning. In Fawcett, T. and Mishra, N., editors, Proceedings of the Twentieth International Conference on Machine Learning
(ICML-2003), pages 162–169.