This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE INFOCOM 2010 proceedings
This paper was presented as part of the main Technical Program at IEEE INFOCOM 2010.
Learning to optimally exploit multi-channel
diversity in wireless systems
P. Chaporkar (IITB, Mumbai), A. Proutiere (Microsoft Research, UK), H. Asnani (IITB, Mumbai)
Abstract—Consider a wireless system where a transmitter
may send data to a set of receivers, or on various channels,
experiencing random time-varying fading. The transmitter can
send data to a single receiver or on a single channel at a time
and may adapt its transmission power to the radio conditions
of the chosen receiver/channel. Its objective is to implement a
strategy defining at each time how to select the receiver/channel
and transmission power, so as to maximize its throughput, i.e.,
its average sending rate, under an average power constraint. The
optimization problem is easy when the fading conditions of all the
receivers/channels are known. In many situations however, the
instantaneous fading conditions are not known a priori, instead
they have to be acquired, i.e., receivers/channels have to be
probed, which consumes resources (time, spectrum, energy) in
proportion to the number of probed receivers/channels. Hence,
the transmitter may choose not to acquire the radio conditions of
all the receivers/channels so as to spare resources for actual transmissions. In this paper, we aim at characterizing a joint probing,
receiver/channel selection and power control strategy maximizing
throughput. We provide an adaptive algorithm converging to
the throughput optimal strategy. This algorithm may be used
in a wide class of wireless systems with limited information,
such as broadcast systems without a priori knowledge of the
instantaneous Channel-State Information (CSI). It can also
be used to solve dynamic spectrum access problems such as
those arising in cognitive radio systems, where secondary users
can access large parts of the spectrum, but have to discover which
portions of the spectrum offer more favorable radio conditions
or less interference from primary users.
I. INTRODUCTION
Opportunistic resource allocation has been shown to significantly improve the performance of wireless systems by
exploiting (rather than countering) location dependent and
time varying channel conditions on various links. But, to
employ opportunistic schemes, the transmitter has to know
the channel state information (CSI) on each of the links.
The CSI is not automatically known, rather it has to be
acquired. Acquiring CSI on each link consumes resources
(time, power and bandwidth) proportional to the number of
links. Often the gain due to opportunism compensates for
the resources invested in CSI acquisition, and hence many
systems keep resources aside for CSI acquisition. For example,
in CDMA/HDR [2] broadcast systems, a dedicated uplink
channel for each receiver is maintained to communicate the
CSI. In IEEE 802.16 based WiMax systems, CSI can be
obtained by polling each link at the beginning of a frame
[8]. However there is an increasing number of systems where
it is not feasible to maintain dedicated resources for CSI
acquisition; rather the CSI acquisition should be done on
demand. We refer to systems with on demand CSI acquisition
as limited information based MAC.
(Footnote: Prof. Chaporkar’s work is supported by India-UK Advanced Technology
Centre (IU-ATC) of Excellence in Next Generation Networks Systems and
Services.)
An increasingly important example of limited information
based MAC is the opportunistic spectrum access methods used
in systems such as cognitive radio systems. In these systems,
a secondary user may access a large number of frequency
bands provided that these bands are not currently occupied by
licensed or primary users. In this scenario, a user wishing to
maximize its transmission rate has to opportunistically use the
parts of the spectrum that are left idle by primary users and offer favorable
fading conditions. Of course, the user cannot maintain dedicated resources to acquire the CSI on each frequency band,
and to check whether a primary user is using it. Rather, before
transmitting, the user should acquire this information on a few
well selected bands.
For an optimal design of limited information based MAC,
one has to strike the best exploration versus exploitation tradeoff. Here, exploration refers to finding out (probing) link CSIs.
Exploration consumes resources proportional to the number
of links probed, and thus leaves fewer resources for the actual
data transmission. On the other hand, exploitation refers to
opportunistically transmitting on the probed link with the best
CSI; the more links one probes, the greater the chance
of finding a link with good channel conditions.
In [6], we have developed a probing strategy achieving
the optimal exploration vs. exploitation trade-off under the
assumption that the transmitter always transmits at a fixed
power. Here, our aim is to achieve the optimal trade-off when
the transmitter can vary the transmit power, but has to satisfy
an average power constraint. In wireless networks, power is
an important resource that should be used optimally. Hence,
it is important to design joint probing and power control
schemes that maximize the system throughput. We investigate
the throughput gain achieved using power adaptation in limited
information based MAC.
Now, we elaborate on the analytical challenges in obtaining
the optimal joint probing and power control schemes in limited
information based MAC. When the fading conditions on the
various links are known at the transmitter, then the optimal
power control scheme can be obtained as a solution of a
convex optimization program. For example, in the case where
a single channel can be used at a time, the optimal scheme is to
always transmit on the channel with the most favorable fading
state, and to share power in time through the celebrated water-filling procedure, where the water-filling level is obtained so
that the average power constraint is satisfied with equality [9].
To compute the water-filling level, one needs to know the
distribution of the CSI of the chosen link (the link with the best
CSI). As we will demonstrate, a similar analysis is possible
978-1-4244-5837-0/10/$26.00 ©2010 IEEE
even in the case of limited information based MAC, i.e., the
optimal probing and power control scheme can be obtained as
a solution to an optimization problem. A major difficulty in
solving this problem is that the constrained set of possible
schemes is large and is extremely intricate to characterize.
Indeed, to compute the average power consumption of a given
scheme, we need to quantify the distribution of the CSI of
the link selected at the end of the probing phase, which turns
out to be almost impossible. Hence, we need to solve the
optimization problem without really knowing the constrained
set of possible schemes. To circumvent this difficulty, we
propose an on-line learning strategy that provably converges
to the optimal joint probing and power control scheme. More
precisely, the contributions in this paper are as follows:
• We formalize the problem of designing optimal opportunistic probing and power allocation schemes as a Constrained
Markov Decision Process (Section II).
• We provide structural properties of the problem in systems
where transmitting on a single link at a time is permitted.
These properties allow us to characterize the throughput-optimal strategies (Section III).
• As the complexity of the numerical computation of the
optimal strategies from the aforementioned characterization
grows exponentially with the number of links, we propose an
on-line learning algorithm with linear complexity that provably
converges to the optimal strategy (Section IV).
• The results are then extended to the case where the transmitter is allowed to transmit on several links at a time (Section V).
• Finally, we illustrate and discuss, using simulations, the
efficiency of the proposed optimal exploration-exploitation
strategies. In particular, we evaluate the price in terms of
throughput that has to be paid due to the lack of information,
i.e., due to the fact that the channel states have to be acquired
(Section VI).
Note that related work is presented in Section VII, and we
conclude in Section VIII.
II. SYSTEM MODEL AND PROBLEM FORMULATION
We present the first basic model considered in this paper to analyze the problem of designing optimal exploration/exploitation strategies in limited information based
MAC. We generalize this model in Section V.
A. Model
Consider a user that can possibly transmit on N channels,
but on one channel at a time. Time is slotted. The slot duration
is assumed to correspond to the coherence time of channels.
We assume that the radio conditions on the various channels
satisfy the block fading model: the radio conditions on channel
i are constant during each slot, and hence represented by a
channel state Ci (t) in slot t. The random variable Ci (t) takes
its values in a finite set C = {c1 , c2 , · · · , cM }. Moreover,
Ci (t), t ≥ 0 are i.i.d. random variables with distribution
Fi (·). We assume that the distribution Fi (·) is known to
the transmitter for every i. Here the underlying assumption
is that the user remains in the system a long time, so that
it can learn Fi (·). We also assume that the channel states
are independent across channels, i.e., the random variables
Ci (t), t ≥ 0, i = 1, . . . , N are independent (spatial diversity).
At the beginning of each slot, the user may acquire the
state of one or several channels sequentially. Probing a channel
takes a fixed proportion β of the slot, so that after probing
k channels, the fraction of the slot available for actual data
transmissions is (1 − kβ). When the user decides to transmit
on the probed channel i observed in state c ∈ C with power
p, its transmission rate is approximated by the Shannon formula:
$R(c, p) = \log\left(1 + \frac{c\,p}{N_0}\right)$, where $N_0$ denotes the thermal noise
power. The choice of the rate function R(·, ·) does not impact
the results derived in this paper, provided that it is increasing
and concave in the second argument, i.e., in power. If the user
transmits after probing k channels, and decides to transmit at
power p on a channel in state c, the amount of information
transmitted during this slot is: (1 − kβ)R(c, p). Note that in
[17], [5], similar models (but with fixed transmit power) have
been considered and exemplified in practical systems.
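As a small illustration of the model above, the following sketch evaluates the rate function and the amount of information sent in one slot. The function names and the numerical values are hypothetical and chosen only for illustration; the logarithm is taken as natural, which is an assumption since the text does not fix the base.

```python
import math

def rate(c: float, p: float, n0: float = 1.0) -> float:
    """Shannon-type rate R(c, p) = log(1 + c * p / N0) (natural log assumed)."""
    return math.log1p(c * p / n0)

def slot_throughput(k: int, beta: float, c: float, p: float, n0: float = 1.0) -> float:
    """Information sent in a slot after probing k channels: (1 - k*beta) * R(c, p)."""
    if not 0.0 <= k * beta <= 1.0:
        raise ValueError("probing must fit in the slot: need 0 <= k*beta <= 1")
    return (1.0 - k * beta) * rate(c, p, n0)

# Hypothetical numbers: probing two channels with beta = 0.1 leaves 80% of the slot.
example = slot_throughput(k=2, beta=0.1, c=3.0, p=1.0)
```

Each extra probe removes a fraction β of the slot regardless of what it reveals, which is the cost side of the exploration versus exploitation trade-off.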
In order to utilize the channel resources and its power
reserve optimally, the user has to decide in a smart way the
order in which it is going to probe channels, when to stop
probing and start transmitting actual data, and finally at which
power it should transmit. In short, it has to implement an
optimal probing and power allocation strategy. Formally, we
define such a strategy as follows. Consider an arbitrary slot
(the slot considered does not play any role here as the system
is i.i.d. over slots). In this slot, let s = [s1 · · · sN ] denote
an N -dimensional vector indicating which channels have been
already probed and also the states of these channels. If the ith
channel has been probed, then there exists c ∈ C such that
si = c; and for unprobed channels, we let si = −1. The
set of all possible states is $S = (C \cup \{-1\})^N$. Depending
on the past decisions in the slot, and its observation of the
channel states, the user has to decide whether to probe further,
or to transmit on a channel, and at which power. This decision
can be random, e.g. with some probability the user decides to
probe further, and with some other probability it decides to
stop and transmit. In Figure 1, we give an example of decisions
in a simplistic 3-channel system. In the following, we denote
by P(A) the set of probability measures on A.
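[Figure 1: decision diagram. From state s = (−1, −1, −1), decision (2, p) probes channel 2 (P2), giving state (−1, c, −1); decision (3, p) probes channel 3 (P3), giving state (−1, c, c′); decision (3, p) then transmits on channel 3 (Tr 3). The two probes form the exploration phase, the transmission forms the exploitation phase.]
Fig. 1. Decisions made in one slot - Exploration phase of duration 2β:
Channels 2 and 3 are probed; Exploitation phase of duration (1 − 2β):
transmission on channel 3 at power p (i.e., (1 − 2β)R(c′, p) bits are sent).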
Definition 1: A joint probing and power control strategy π
is a mapping from the set of states S to the set P({1, . . . , N }×
R+ ), i.e., in every state s, π chooses a pair (i, p) randomly
according to the distribution π(s).
− If si = −1, then the user probes channel i, observes its
state c, and the system state changes to s′, where s′j = sj for
j ≠ i and s′i = c.
− If si ∈ C, it means that the channel i has been probed
already. The user stops probing and starts transmitting on
channel i with power p.
The above definition does not exclude deterministic strategies that choose a single pair (i, p) with probability 1 in each state
s; in this case, $\pi(s) = \delta_{(i,p)}$. It is worth observing as well
that the decision taken by a strategy π is defined in all possible
states s ∈ S, although because of the specific choices made
by π, some states may not be actually reached (for example, π
can decide that channel 1 is never probed first, in which case,
the state (c, −1, . . . , −1) for c ∈ C is never reached under π).
Strategy π in those states can be arbitrarily defined.
We denote by Π the set of all probing and power allocation
strategies. For a given strategy π ∈ Π, we define by ρπ the
corresponding occupation measure, i.e., for any subset A ⊂
S of states, and Borel set I ⊂ R+ of possible transmission
powers, the probability that under π, the user stops probing in
a state s ∈ A and starts transmitting at a power p ∈ I is:
$$\rho_\pi(A \times I) = \int_{S \times \mathbb{R}_+} 1_{s \in A}\, 1_{p \in I}\, d\rho_\pi(s, p).$$
We also introduce the measure σπ , corresponding to the
distribution of the state in which the strategy π stops: for any
A ⊂ S,
$$\sigma_\pi(A) = \int_{S \times \mathbb{R}_+} 1_{s \in A}\, d\rho_\pi(s, p).$$
We refer to σπ (·) as the terminal state distribution. The
occupation measure ρπ results from the random decisions
made by π, and also from the random channel states.
B. Problem formulation
We are now ready to state the problem of designing a
probing and power allocation strategy maximizing the user’s long-term throughput subject to an average power constraint. Since
the objective is to maximize throughput, we restrict our
attention to strategies that, when deciding to stop and transmit,
transmit on the channel with the best observed state. In state
s, we denote by s̄ = max{si , i = 1, . . . , N } the state of the
best (probed) channel. We also denote by k(s) the number of
channels that have been probed in state s.
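Both throughput $T^{(\pi)}$ and average power $P^{(\pi)}$ under strategy π are expressed through the occupation measure ρπ:
$$T^{(\pi)} = \int_{S \times \mathbb{R}_+} d\rho_\pi(s, p)\,(1 - k(s)\beta)\,R(\bar{s}, p), \tag{1}$$
$$P^{(\pi)} = \int_{S \times \mathbb{R}_+} d\rho_\pi(s, p)\,(1 - k(s)\beta)\,p. \tag{2}$$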
Denote by P0 the average power budget. Our problem is
then formalized as follows
(O1) Find π ∈ Π maximizing T (π) subject to P (π) ≤ P0 .
This problem cannot be solved using classical methods, e.g.,
convex optimization techniques, simply because the objective and the constraint are both functions of the occupation
measure, which proves quite complicated to characterize for
a given strategy. In fact the problem belongs to the class
of constrained stochastic control problems [1] which are
notoriously difficult. In the next section, we provide some
structural properties of (O1), that will help the analysis.
III. STRUCTURAL PROPERTIES OF OPTIMAL STRATEGIES
To solve (O1), we need to study the structure of the possible
optimal probing and power allocation strategies. First we show
that it is useless to randomize the power allocation. Then
we prove that optimal power allocations are always obtained
through water-filling. We show that this implies that solving
(O1) is equivalent to identifying the saddle point of a function
depending on the probing strategy and on a parameter defining
the level of the water-filling procedure providing the power
allocation. Finally, we provide structural properties of the
probing strategy maximizing this function.
A. Derandomizing power
We first define the set Π1 ⊂ Π as the set of strategies
π such that the power allocation is deterministic in the sense
that when in state s, π decides to stop probing and to transmit,
it then picks a unique transmission power, denoted by pπ (s).
Mathematically this implies that for any state s and any subset
I of R+ , $\rho_\pi(s, I) = \sigma_\pi(s) \times 1_{p_\pi(s) \in I}$. In the following, for
any π ∈ Π, we denote by ρπ (p|s) the probability that π selects
power p given that it stops probing in state s.
Lemma 1: Let π ∈ Π. Consider π′ ∈ Π1 such that it makes
the same probing decisions as π, but averages the transmission
power decisions made by π: for any state s, if π chooses a
pair (i, p) for some possible power p, then π′ chooses (i, p′),
with $p' = \int_{\mathbb{R}_+} p\, d\rho_\pi(p|s)$. Then: $T^{(\pi')} \ge T^{(\pi)}$.
Proof. Note that since R(·, ·) is concave in power, for any
state s we have, by Jensen’s inequality and the definition of
π′, that: $R(\bar{s}, p_{\pi'}(s)) \ge \int_{\mathbb{R}_+} d\rho_\pi(p|s)\, R(\bar{s}, p)$. Then:
$$T^{(\pi)} = \int_{S \times \mathbb{R}_+} d\rho_\pi(s, p)\,(1 - \beta k(s))\,R(\bar{s}, p)$$
$$= \sum_{s \in S} \sigma_\pi(s)\,(1 - \beta k(s)) \int_{p \in \mathbb{R}_+} d\rho_\pi(p|s)\, R(\bar{s}, p)$$
$$\le \sum_{s \in S} \sigma_\pi(s)\,(1 - \beta k(s))\, R(\bar{s}, p_{\pi'}(s)) = T^{(\pi')}. \qquad \Box$$
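The Jensen step in the proof of Lemma 1 can be checked numerically. The sketch below compares, for one terminal state, the rate obtained at the averaged power with the average rate under a randomized power choice. The function names and the numbers are hypothetical, and the log-rate function with N0 = 1 is assumed.

```python
import math

def rate(c, p, n0=1.0):
    """R(c, p) = log(1 + c * p / N0), concave in p."""
    return math.log1p(c * p / n0)

def averaged_vs_randomized(c, powers, probs, n0=1.0):
    """Return (rate at the mean power, mean rate under the randomized power)."""
    if abs(sum(probs) - 1.0) > 1e-12:
        raise ValueError("probabilities must sum to one")
    mean_power = sum(q * p for p, q in zip(powers, probs))
    mean_rate = sum(q * rate(c, p, n0) for p, q in zip(powers, probs))
    return rate(c, mean_power, n0), mean_rate

# Hypothetical example: power 0 or 4 with equal probability, channel state c = 2.
deterministic, randomized = averaged_vs_randomized(2.0, [0.0, 4.0], [0.5, 0.5])
```

Both choices consume the same average power (2 in this example), so derandomizing never violates the power constraint while it can only increase the rate.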
B. Optimality of water-filling
Now we investigate the possible form of optimal power
allocations. We fix the terminal state distribution σ ∈ P(S),
and given that distribution, we seek the best power allocation.
A (deterministic) power allocation is represented by a function
p : S → R+ . The throughput achieved by power allocation
p(·) is:
$$T^{(\sigma,p)} = \sum_{s \in S} \sigma(s)\,(1 - \beta k(s))\,R(\bar{s}, p(s)).$$
The average power consumption under p(·) is:
$$P^{(\sigma,p)} = \sum_{s \in S} \sigma(s)\,(1 - \beta k(s))\,p(s).$$
We seek to solve, for a given σ ∈ P(S):
(Pσ) Find p*(·) maximizing $T^{(\sigma,p)}$ subject to $P^{(\sigma,p)} \le P_0$.
Clearly (Pσ ) is a convex optimization problem, and should
R(·, ·) be strictly concave in power, it admits a unique solution.
Consider the associated Lagrangian:
$$L_\sigma(p(\cdot), \mu) = \sum_{s \in S} \sigma(s)\,(1 - \beta k(s))\,[R(\bar{s}, p(s)) - \mu p(s)] + \mu P_0,$$
where μ ≥ 0 denotes the Lagrange multiplier. Denote by
$G(\sigma, \mu) = \max_{p(\cdot)} L_\sigma(p(\cdot), \mu)$. The solution of (Pσ) is given by a power allocation obtained through a water-filling
procedure of parameter μ, as stated in the following lemma:
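Lemma 2: We have:
$$G(\sigma, \mu) = \sum_{s \in S} \sigma(s)\,(1 - \beta k(s))\,[R(\bar{s}, p_\mu(\bar{s})) - \mu p_\mu(\bar{s})] + \mu P_0,$$
where
$$p_\mu(\bar{s}) = \left(\frac{1}{\mu} - \frac{N_0}{\bar{s}}\right)^+ .$$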
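Proof. The result follows by solving $\frac{\partial L_\sigma}{\partial p(s)} = 0$ for all s in the maximization defining G, and truncating at zero since p(s) ≥ 0. □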
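For the log-rate function, the stationarity condition reads c/(N0 + c p) = μ, which gives p = 1/μ − N0/c, truncated at zero. The sketch below implements this water-filling power and, for a fixed terminal state distribution, finds by bisection the value of μ at which the average power equals P0. The representation of σ as a list of (probability, number of probed channels, best state) triples, the function names, and the numbers are assumptions made for illustration; in the full problem σ itself depends on the probing strategy.

```python
import math

def waterfill_power(c, mu, n0=1.0):
    """Water-filling power p_mu(c) = max(0, 1/mu - N0/c)."""
    return max(0.0, 1.0 / mu - n0 / c)

def average_power(terminal, mu, beta, n0=1.0):
    """Average power sum_s sigma(s) * (1 - beta*k(s)) * p_mu(best state of s).

    terminal: list of (probability, k, best_state) triples standing in for sigma.
    """
    return sum(q * (1.0 - k * beta) * waterfill_power(c, mu, n0)
               for q, k, c in terminal)

def solve_mu(terminal, p0, beta, n0=1.0, lo=1e-9, hi=1e9, iters=200):
    """Bisection on a log scale; the average power is non-increasing in mu."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if average_power(terminal, mid, beta, n0) > p0:
            lo = mid   # too much power: raise mu, i.e. lower the water level 1/mu
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Hypothetical sigma: one channel probed, best state 1 or 4 with equal probability.
terminal = [(0.5, 1, 1.0), (0.5, 1, 4.0)]
mu = solve_mu(terminal, p0=1.0, beta=0.1)
```

Bisection is used here only because the average power is monotone in μ for a fixed σ; it is not the method proposed in the paper, which learns μ on-line.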
C. Saddle point interpretation
From the previous result, the power allocation in a throughput-optimal strategy is necessarily obtained through a water-filling procedure. Hence to identify such an optimal strategy
π*, we may restrict our attention to strategies defined by
a probing strategy and a parameter μ defining the level of
the water-filling procedure. To formalize this observation, we
define the notion of probing strategy:
Definition 2: A probing strategy ν is a mapping from S to
the set P({1, . . . , N }), i.e., in every state s, ν chooses an
index i randomly according to the distribution ν(s).
− If si = −1, then the user probes channel i, observes its
state ci, and the system state changes to s′, where s′j = sj
for j ≠ i and s′i = ci.
− If si ∈ C, it means that the channel i has been probed
already. The user stops probing yielding a terminal state s.
We denote by V the set of probing strategies. The couple
composed by a probing strategy ν ∈ V, and a power allocation
obtained through water-filling of level μ (i.e., pμ (·)) defines a
strategy π ∈ Π1 , and we use the notation π = (ν, μ). Define
Π2 as the set of such strategies:
Π2 = {π ∈ Π1 : ∃ν ∈ V, μ > 0, π = (ν, μ)}.
For a strategy π = (ν, μ) ∈ Π2 , the terminal state distribution
σπ depends on π through the probing strategy ν only; hence
we may write σν = σπ . Summarizing what we have shown
so far: Solving (O1) is equivalent to solving (O2) where:
(O2) Find π ∈ Π2 maximizing $T^{(\pi)}$ subject to $P^{(\pi)} \le P_0$.
The following crucial result will help us to characterize the
solution of (O2). It states that the solution may be interpreted
as the saddle point of the function (ν, μ) → G(σν , μ) defined
in §III-B.
Theorem 1: Let π* = (ν*, μ*) ∈ Π2. The strategy π* is
optimal if and only if the pair (ν*, μ*) satisfies the following
saddle point condition: for any ν ∈ V, μ > 0,
$$G(\sigma_\nu, \mu^*) \le G(\sigma_{\nu^*}, \mu^*) \le G(\sigma_{\nu^*}, \mu),$$
or equivalently,
$$G(\sigma_{\nu^*}, \mu^*) = \min_{\mu > 0} \max_{\nu \in V} G(\sigma_\nu, \mu) = \max_{\nu \in V} \min_{\mu > 0} G(\sigma_\nu, \mu). \tag{3}$$
The proof of Theorem 1 is not straightforward since G is
not the Lagrangian of problem (O2), and hence (3) does not a
priori express the strong duality of some optimization problem.
Next, we present the formal proof.
Proof. First, we show that (O1) is a convex optimization
problem. To show this, we need to show that (1) Π is a
convex set, and (2) T (π) is concave in π. Note that a joint
probing and power control policy π is characterized by its
occupation measure ρπ . Thus, the convex combination of
the two policies is defined as the convex combination of
their occupation measures, elementwise. That is, for every
α ∈ [0, 1], π = απ1 + (1 − α)π2 implies that ρπ (s, p) =
αρπ1 (s, p)+(1−α)ρπ2 (s, p). Clearly, π is a valid joint probing
and power control policy as it can be obtained by choosing
π1 w.p. α and π2 w.p. (1 − α). Thus, Π is a convex space.
Now, we show that T (π) is concave in π. We need to show
that T (π) ≥ αT (π1 ) + (1 − α)T (π2 ) . First, note that
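$$\sigma_\pi(s) = \alpha \sigma_{\pi_1}(s) + (1 - \alpha)\sigma_{\pi_2}(s),$$
$$\rho_\pi(p|s) = \theta \rho_{\pi_1}(p|s) + (1 - \theta)\rho_{\pi_2}(p|s),$$
where $\theta = \frac{\alpha \sigma_{\pi_1}(s)}{\alpha \sigma_{\pi_1}(s) + (1 - \alpha)\sigma_{\pi_2}(s)}$. With the above observations
and some algebra, it can be verified that $T^{(\pi)} = \alpha T^{(\pi_1)} + (1 - \alpha) T^{(\pi_2)}$.
Thus, $T^{(\pi)}$ is a concave function of π.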
Now, we show that (O1) has strong duality property using
Slater’s constraint qualification condition. Note that any strategy
π that allocates zero power in every terminal state, i.e.,
$\sum_{s \in S} \rho_\pi(s, \{0\}) = 1$, is a strictly feasible solution of (O1).
Thus, Slater’s condition holds. This implies that
$$\max_{\pi \in \Pi} \min_{\lambda \ge 0} \left[T^{(\pi)} + \lambda(P_0 - P^{(\pi)})\right] = \min_{\lambda \ge 0} \max_{\pi \in \Pi} \left[T^{(\pi)} + \lambda(P_0 - P^{(\pi)})\right]. \tag{4}$$
In (4), λ is the Lagrange multiplier.
Now, by Lemma 1, we know that the optimal probing and
power control strategy lies in Π1 . Thus, (4) holds even when
Π is replaced by Π1 .
Let Πσ denote the set of policies π that generate the
same terminal distribution σ. Moreover, let Σ = {σ : σ =
σπ for some π ∈ Π1 }. With this notation, the right hand side
of (4) can be written as follows:
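$$\min_{\lambda \ge 0} \max_{\sigma \in \Sigma} \max_{\pi \in \Pi_\sigma} \left[T^{(\pi)} + \lambda(P_0 - P^{(\pi)})\right]. \tag{5}$$
Consider the last optimization in (5), and note that
$$\max_{\pi \in \Pi_\sigma} \left[T^{(\pi)} + \lambda(P_0 - P^{(\pi)})\right] = \max_{p(\cdot)} L_\sigma(p(\cdot), \lambda).$$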
This is because, in Πσ , policies differ in their power allocation
only. Thus, optimizing over Πσ is equivalent to choosing
optimal power control. Thus,
$$\min_{\lambda \ge 0} \max_{\sigma \in \Sigma} \max_{\pi \in \Pi_\sigma} \left[T^{(\pi)} + \lambda(P_0 - P^{(\pi)})\right] = \min_{\lambda \ge 0} \max_{\sigma \in \Sigma} G(\sigma, \lambda). \tag{6}$$
Now, note that the left hand side of (4) is equal to
$$\max_{\sigma \in \Sigma} \max_{\pi \in \Pi_\sigma} \min_{\lambda \ge 0} \left[T^{(\pi)} + \lambda(P_0 - P^{(\pi)})\right]. \tag{7}$$
Using similar arguments as before, we note that
$$\max_{\pi \in \Pi_\sigma} \min_{\lambda \ge 0} \left[T^{(\pi)} + \lambda(P_0 - P^{(\pi)})\right] = \max_{p(\cdot)} \min_{\lambda \ge 0} L_\sigma(p(\cdot), \lambda).$$
Using the strong duality of (Pσ), we conclude that
$$\max_{\pi \in \Pi_\sigma} \min_{\lambda \ge 0} \left[T^{(\pi)} + \lambda(P_0 - P^{(\pi)})\right] = \min_{\lambda \ge 0} \max_{p(\cdot)} L_\sigma(p(\cdot), \lambda).$$
Thus,
$$\max_{\sigma \in \Sigma} \max_{\pi \in \Pi_\sigma} \min_{\lambda \ge 0} \left[T^{(\pi)} + \lambda(P_0 - P^{(\pi)})\right] = \max_{\sigma \in \Sigma} \min_{\lambda \ge 0} G(\sigma, \lambda). \tag{8}$$
From (6) and (8), we conclude that
$$\min_{\lambda \ge 0} \max_{\sigma \in \Sigma} G(\sigma, \lambda) = \max_{\sigma \in \Sigma} \min_{\lambda \ge 0} G(\sigma, \lambda).$$
The result follows. □
Theorem 1 provides a simple way to verify the optimality
of a given strategy π = (ν, μ) in Π2. For example, we simply
have to check that: (1) $\nu = \arg\max_{\nu' \in V} G(\sigma_{\nu'}, \mu)$; (2) $\mu = \arg\min_{\mu' > 0} G(\sigma_\nu, \mu')$.
Now observe that for any σ ∈ P(S), G(σ, μ) is minimized
in μ if and only if the resulting average power consumption
is exactly equal to P0 (this follows by differentiating G with respect to μ). Summarizing,
we have the following characterization of optimal strategies:
Corollary 1: Let π* ∈ Π. The strategy π* solves (O1) if
and only if π* ∈ Π2, i.e., ∃(ν*, μ*) ∈ V × R+ : π* = (ν*, μ*),
and (1) $\nu^* = \arg\max_{\nu \in V} G(\sigma_\nu, \mu^*)$, (2) $P^{(\sigma_{\nu^*}, \mu^*)} = P_0$,
where for any (ν, μ), $P^{(\sigma_\nu, \mu)}$ denotes the average power
consumption under strategy (ν, μ):
$$P^{(\sigma_\nu, \mu)} = \sum_{s \in S} \sigma_\nu(s)\,(1 - k(s)\beta)\,p_\mu(\bar{s}).$$
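Condition (2) of the corollary rests on the fact that G(σ, μ) is minimized in μ exactly when the average power equals P0. A short envelope-type derivation makes this explicit; the auxiliary function $g_s$ is notation introduced here for convenience and does not appear in the paper. For the log-rate function, on the range where $p_\mu(\bar{s}) > 0$, one has $g_s(\mu) = \log(\bar{s}/(N_0 \mu)) - 1 + \mu N_0/\bar{s}$, whose derivative is $-1/\mu + N_0/\bar{s} = -p_\mu(\bar{s})$.

```latex
\begin{align*}
g_s(\mu) &:= \max_{p \ge 0}\,\bigl[R(\bar{s}, p) - \mu p\bigr]
          = R(\bar{s}, p_\mu(\bar{s})) - \mu\, p_\mu(\bar{s}),
 & g_s'(\mu) &= -\,p_\mu(\bar{s}), \\
G(\sigma, \mu) &= \sum_{s \in S} \sigma(s)\,(1 - \beta k(s))\, g_s(\mu) + \mu P_0, \\
\frac{\partial G}{\partial \mu}(\sigma, \mu)
 &= P_0 - \sum_{s \in S} \sigma(s)\,(1 - \beta k(s))\, p_\mu(\bar{s})
  = P_0 - P^{(\sigma, \mu)} .
\end{align*}
```

Since G(σ, ·) is a maximum of functions that are affine in μ, it is convex in μ, so the stationary point where $P^{(\sigma,\mu)} = P_0$ is a minimizer.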
D. Structure of the optimal probing strategy
If one wishes to use the characterization of the solution of
(O1) provided in the above corollary, one needs to be able
to verify Condition (1). In other words, we need to solve the
following problem for a fixed μ:
(Pμ ) Find ν ∈ V maximizing G(σν , μ).
(Pμ ) can be seen as a generalized version of stopping time
problems, and as it turns out, similar problems have been
recently studied and solved, see [5], [6]. We adapt the results of
these existing analyses to our setting. For brevity, we introduce
the following notation: for any c ∈ C,
G(c) = R(c, pμ (c)) + μ[P0 − pμ (c)].
Assume that at a given slot, the system is in state s.
- If under strategy ν, we stop probing and transmit (on the
best channel), the reward is Gtr (s) with
Gtr (s) = (1 − k(s)β)G(s̄);
- If under strategy ν, we probe further a channel i in state
c ∈ C, the state becomes s′ = s(i) where for all j ≠ i,
s′j = sj and s′i = Ci. Ci is the random variable representing
the state of channel i.
Now denote by G*(s) the average reward under an optimal
strategy ν* starting from state s. Bellman’s equation allows
us to recursively characterize G*: for any s ∈ S,
$$G^*(s) = \max\left\{G_{tr}(s),\ \max_{i: s_i = -1} E_i[G^*(s(i))]\right\},$$
where Ei [·] is the expectation taken w.r.t. the distribution
Fi (·) of the state of the i-th channel. To characterize the
solution ν* of (Pμ), we need to compute G*(s0) where
s0 = (−1, . . . , −1) is the initial state. To do so, let us introduce
the average reward Gpr,tr (s) obtained when, starting in state
s, one first probes channel i and after that, one stops and
transmits (on the best channel):
$$G_{pr,tr}(s) = (1 - (k(s) + 1)\beta) \max_{i: s_i = -1} E_i[G(\max\{\bar{s}, C_i\})].$$
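For a small number of channels, Bellman's equation can be evaluated by brute force, which is useful as a reference when testing simpler rules. The sketch below is a memoized recursion over states; the dictionary format for the distributions Fi, the function names, and the numerical values are hypothetical. It also assumes that the user cannot transmit before probing at least one channel, which the text does not state explicitly, and it uses the log-rate function with N0 = 1.

```python
import math
from functools import lru_cache

def make_reward(mu, p0, n0=1.0):
    """Return G(c) = R(c, p_mu(c)) + mu * (P0 - p_mu(c)) for the log-rate R."""
    def reward(c):
        p = max(0.0, 1.0 / mu - n0 / c)          # water-filling power
        return math.log1p(c * p / n0) + mu * (p0 - p)
    return reward

def optimal_reward(dists, beta, reward):
    """Evaluate G*(s0) by memoized Bellman recursion.

    dists[i] maps each (positive) state of channel i to its probability.
    """
    n = len(dists)

    @lru_cache(maxsize=None)
    def g_star(s):
        k = sum(1 for x in s if x != -1)
        # Assumption: with no probed channel the user cannot transmit yet.
        best = (1.0 - k * beta) * reward(max(s)) if k > 0 else -math.inf
        for i in range(n):
            if s[i] == -1:
                cont = sum(q * g_star(s[:i] + (c,) + s[i + 1:])
                           for c, q in dists[i].items())
                best = max(best, cont)
        return best

    return g_star((-1,) * n)

# Hypothetical two-state channel distribution.
d = {1.0: 0.5, 4.0: 0.5}
G = make_reward(mu=0.5, p0=1.0)
one_channel = optimal_reward([d], beta=0.1, reward=G)
two_channels = optimal_reward([d, d], beta=0.1, reward=G)
```

The number of states visited grows as (#C + 1)^N, so this recursion is only practical for a handful of channels.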
For the results of [5], [6] to be applicable, we need the
following property of function G(·) that can be easily checked:
Lemma 3: G(·) is a non-decreasing function.
We are now ready to provide two structural properties of
the optimal probing strategy ν*, that will actually characterize
this strategy in some particular but relevant cases.
1) Optimal stopping rule: The following result states that
in any given state s ∈ S, to optimally decide whether to stop
and transmit or to probe further, we only need to follow the
choice made by the one-step-look-ahead strategy [5], [6].
Theorem 2: Let ν* be the optimal probing strategy solving
(Pμ). In any state s ∈ S, ν* decides to probe another channel
if and only if: Gpr,tr (s) > Gtr (s).
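The stopping rule of Theorem 2 only compares two numbers in each state. A minimal sketch follows, assuming the log-rate function with N0 = 1 and a dictionary representation of each Fi; the function names and values are hypothetical, and the convention that the user always probes when nothing has been probed yet is an added assumption.

```python
import math

def reward(c, mu, p0, n0=1.0):
    """G(c) = R(c, p_mu(c)) + mu * (P0 - p_mu(c)) with water-filling power p_mu."""
    p = max(0.0, 1.0 / mu - n0 / c)
    return math.log1p(c * p / n0) + mu * (p0 - p)

def should_probe(s, dists, beta, mu, p0):
    """One-step look-ahead rule: probe iff G_pr,tr(s) > G_tr(s)."""
    unprobed = [i for i, x in enumerate(s) if x == -1]
    if not unprobed:
        return False                      # nothing left to probe
    k = len(s) - len(unprobed)
    if k == 0:
        return True                       # assumption: must probe before transmitting
    best = max(s)                         # state of the best probed channel
    g_tr = (1.0 - k * beta) * reward(best, mu, p0)
    g_pr_tr = (1.0 - (k + 1) * beta) * max(
        sum(q * reward(max(best, c), mu, p0) for c, q in dists[i].items())
        for i in unprobed)
    return g_pr_tr > g_tr

# Hypothetical example with three i.i.d. two-state channels.
d = {1.0: 0.5, 4.0: 0.5}
after_bad = should_probe((1.0, -1, -1), [d, d, d], beta=0.01, mu=0.5, p0=1.0)
after_good = should_probe((4.0, -1, -1), [d, d, d], beta=0.01, mu=0.5, p0=1.0)
```

In this example the user keeps probing after observing the worst state and stops after observing the best one, since a further probe could then only cost time.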
Theorem 2 is sufficient to characterize the optimal strategy
when the states of the various channels are i.i.d. Indeed, in this
case, the order in which channels are probed has no impact on
the average reward, and hence we can probe channels in any
order. However, when the channel states are not identically
distributed, characterizing ν* becomes extremely complicated
and is an open problem in general. This might be explained
by the fact that the one-step-look-ahead strategy is not always
optimal as shown in [6].
2) Optimal channel probing order: As discussed above, the
main challenge in characterizing ν* is to determine the optimal
order in which channels should be probed. And in general, this
issue proves impossible to solve. However, there are special
cases where it is still possible to find ν*. Specifically, when the
channel states are stochastically ordered (as defined below),
the optimal order is obtained when the stochastically largest
unprobed channel is probed.
Channels are stochastically ordered if there exists a permutation ω of {1, . . . , N } such that for all i, j , if ω(i) ≤ ω(j),
then Cω(j) ≤st Cω(i) , where X ≤st Y if and only if for all
increasing functions f such that E[f (Y )] < ∞, E[f (X)] ≤
E[f (Y )]. Without loss of generality, when the channels are
stochastically ordered, we assume that the permutation ω is
ω(i) = i for all i. An example of ordered channels is when
one can write Ci = E[Ci ]Yi where the random variables Yi ’s
are i.i.d. copies of a fixed random variable Y , i.e., when the
channels have similar distributions but different means. This is
a quite usual fading model in wireless networks, for example
in the case of Rayleigh fading. In these settings, we can obtain
an optimal probing strategy [6]:
Theorem 3: Assume that the channels are stochastically
ordered. Let ν* be the optimal probing strategy solving (Pμ ).
In any state s ∈ S, under ν*, the decision on whether
to stop and transmit or to probe further is defined by the
rule of Theorem 2. Moreover, if the decision is to probe
further, the next channel to probe is the stochastically largest
un-probed channel. In other words, we necessarily have:
s = (c1 , . . . , ck(s) , −1, . . . , −1), and the channel to probe next
is channel k(s) + 1.
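Whether a given set of finite-state channels is stochastically ordered can be checked directly, using the standard equivalent characterization that X ≤st Y if and only if P(X > t) ≤ P(Y > t) for all t. The sketch below returns the probing order from the stochastically largest channel to the smallest, or None when no such total order exists. The dictionary format and function names are hypothetical; sorting by mean is only a device to produce a candidate order, which is then verified pairwise.

```python
def tail(dist, t):
    """P(X > t) for a finite distribution given as {state: probability}."""
    return sum(q for c, q in dist.items() if c > t)

def st_leq(dx, dy, tol=1e-12):
    """X <=_st Y iff P(X > t) <= P(Y > t) for all t; support points suffice."""
    points = set(dx) | set(dy)
    return all(tail(dx, t) <= tail(dy, t) + tol for t in points)

def probing_order(dists):
    """Indices from stochastically largest to smallest, or None if not ordered."""
    def mean(i):
        return sum(c * q for c, q in dists[i].items())
    # A stochastic order implies the same order on the means, so sort by mean
    # and then verify dominance between consecutive channels (transitivity).
    order = sorted(range(len(dists)), key=mean, reverse=True)
    for a, b in zip(order, order[1:]):
        if not st_leq(dists[b], dists[a]):
            return None
    return order

# Hypothetical examples.
ordered = probing_order([{1: 0.5, 4: 0.5}, {1: 0.2, 4: 0.8}, {1: 0.8, 4: 0.2}])
not_ordered = probing_order([{1: 0.5, 4: 0.5}, {2: 1.0}])
```

When an order is returned, Theorem 3 says the next channel to probe is simply the first unprobed index in that list.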
E. Summary
In this section, we have proved that the optimal probing and
power control strategy π* solving (O1) has the following properties: (i) the optimal power control strategy is deterministic;
(ii) it is obtained via a water-filling procedure of parameter
μ*; (iii) π* = (ν*, μ*) where ν* denotes the optimal probing
strategy, and (ν*, μ*) satisfies $\nu^* = \arg\max_{\nu \in V} G(\sigma_\nu, \mu^*)$
and $P^{(\sigma_{\nu^*}, \mu^*)} = P_0$; finally, we have identified how to
determine $\nu^* = \arg\max_{\nu \in V} G(\sigma_\nu, \mu^*)$.
We have theoretically characterized the optimal probing and
power control strategy. However, we still need to numerically
compute the optimal water-filling parameter μ*, which is
difficult since the average power consumption depends on
both the probing strategy and the water-filling parameter. Such
computation might be prohibitive on a simple mobile device.
Indeed, computing the average power consumption even for a
fixed strategy needs to consider all possible realizations of the
state of all channels, which requires $O((\#C)^N)$ operations. In
the next section, we propose a simple algorithm that the user
can run while exploring and exploiting the spectrum resources
and that provably converges to the optimal probing and power
control strategy. In each slot, the user has to perform O(N )
operations to make its probing and power control decisions.
The price for reducing the complexity is the time it takes for
the algorithm to converge.
IV. OPTIMAL ON-LINE STRATEGY
We now provide an on-line algorithm that provably converges to the optimal joint probing and power control strategy. The algorithm may be interpreted as a multiple timescale stochastic approximation algorithm. We first describe the
algorithm, and then prove its convergence.
A. Stochastic learning algorithm
The algorithm seeks to solve $\min_\mu \max_\nu G(\sigma_\nu, \mu)$. At each
slot, the parameter μ, defining the power allocation obtained
through water-filling, is updated. The probing strategy ν is also
updated at each slot so as to maximize G(σν , μ). The latter
update is performed using the analysis presented in §III-D.
The update of μ is done so that μ converges to μ solution of
∂G
∂μ = 0, which is equivalent to the fact that the average power
consumption under pμ (·) is exactly P0 .
Formally the algorithm maintains two random variables: the
power allocation parameter μn ∈ R+ in slot n, and Pn ∈ R+
representing the average empirical power consumed until slot
n. The Algorithm operates as follows.
Algorithm 1
1) In the n-th slot, run the probing strategy
νn ∈ arg maxν G(σν , μn ), and power allocation pμn (·);
2) At the end of slot n:
(i) Observe γn+1, the transmission power during slot n,
and update Pn as:
$$P_{n+1} = P_n + a_n (\gamma_{n+1} - P_n); \tag{9}$$
(ii) Update μn as:
$$\mu_{n+1} = \mu_n + b_n (P_{n+1} - P_0). \tag{10}$$
The step-size sequences (an) and (bn) are chosen such that:
$$\sum_n a_n = \sum_n b_n = \infty, \qquad \sum_n a_n^2 < \infty, \quad \sum_n b_n^2 < \infty.$$
Note that
in principle, we should update the parameter μn as a function
of the actual average power consumption using the optimal
strategy given the power allocation parametrized by μn. This
average power cannot be observed in one slot of course, so we
need to impose that the update on μn is much slower than that
of Pn; in other words, we require that bn/an → 0 as n → ∞.
Note also that Algorithm 1 is easy to implement, because Step
1 only requires implementing νn, which has been completely
characterized in §III-D; this requires O(N) operations, since
in the worst case we probe all channels.
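The end-of-slot updates (9) and (10) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `sample_power` is a hypothetical callback standing in for Step 1 (it runs the probing strategy and returns the power actually used in the slot), the step sizes a_n = n^(-0.8) and b_n = 10/n are the ones reported in the simulation section, and the lower bound `mu_min` is an added safeguard that is not part of (10).

```python
def step_sizes(n):
    """Step sizes a_n = n^-0.8 and b_n = 10/n, so that b_n / a_n -> 0."""
    return n ** -0.8, 10.0 / n


def algorithm1_update(mu, p_avg, gamma, n, p0, mu_min=1e-3):
    """One end-of-slot update: (9) for the power average, then (10) for mu."""
    a_n, b_n = step_sizes(n)
    p_next = p_avg + a_n * (gamma - p_avg)   # update (9)
    mu_next = mu + b_n * (p_next - p0)       # update (10)
    # Keeping mu >= mu_min is an assumption added here to keep 1/mu finite.
    return max(mu_min, mu_next), p_next


def run_algorithm1(sample_power, p0, n_slots, mu0=1.0):
    """Iterate the two updates; sample_power(mu) plays the role of Step 1."""
    mu, p_avg = mu0, 0.0
    for n in range(1, n_slots + 1):
        gamma = sample_power(mu)  # power consumed in slot n under mu
        mu, p_avg = algorithm1_update(mu, p_avg, gamma, n, p0)
    return mu, p_avg
```

The per-slot work outside `sample_power` is constant, so the O(N) cost quoted above comes entirely from running the probing strategy.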
B. Convergence analysis
We prove that Algorithm 1 converges to the optimal probing
and power allocation strategy, i.e., the long-term throughput
is optimized while satisfying the power constraint.
Theorem 4: Under Algorithm 1, we have almost surely:
μn → μ*, νn → ν* as n → ∞.
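Proof. The updates in Algorithm 1 can be written as:
$$P_{n+1} = P_n + a_n \left( \mathbb{E}[\gamma_{n+1} | \mathcal{F}_n] - P_n + M^1_{n+1} \right),$$
$$\mu_{n+1} = \mu_n + b_n \left( \mathbb{E}[P_{n+1} | \mathcal{F}_n] - P_0 + M^2_{n+1} \right),$$
where the σ-algebra Fn = σ(Pm, μm, m ≤ n) represents the
past up to slot n, and (M^1_n) and (M^2_n) are martingale difference
sequences defined by:
$$M^1_{n+1} = \gamma_{n+1} - \mathbb{E}[\gamma_{n+1} | \mathcal{F}_n], \qquad M^2_{n+1} = P_{n+1} - \mathbb{E}[P_{n+1} | \mathcal{F}_n].$$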
Note that the average power E[γn+1 |Fn ] observed in slot n
depends on the past only through the parameter μn , hence
we can define a function g(·, ·) such that g(μn , Pn ) =
E[γn+1 |Fn ] − Pn . Similarly E[Pn+1 |Fn ] depends on the past
through Pn and μn only, and there exists a function h(·, ·) such
that h(μn , Pn ) = E[Pn+1 |Fn ] − P0 . Hence the updates in
Algorithm 1 become:
$$P_{n+1} = P_n + a_n \left( g(\mu_n, P_n) + M^1_{n+1} \right),$$
$$\mu_{n+1} = \mu_n + b_n \left( h(\mu_n, P_n) + M^2_{n+1} \right).$$
These are the equations of a stochastic approximation algorithm with two time-scales as considered in [4] Chapter 6.
It can be shown then that h and g are Lipschitz. Now the
conditions to apply the results of [4] are met, and we deduce
that Algorithm 1 converges. Now since the unique equilibrium
point of Algorithm 1 is that where the power consumption is
exactly P0 and where an optimal probing strategy is used,
Theorem 4 is proved.
□
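For intuition, the two-time-scale structure can be summarized by the pair of limiting differential equations that such schemes typically track. The display below is only a sketch: the notation $\bar\gamma(\mu)$ for the mean power consumed in a slot when the parameter is frozen at $\mu$ is introduced here for illustration and does not appear in the proof.

```latex
\begin{align*}
\text{fast time-scale ($\mu$ frozen):}\quad & \dot P(t) = \bar\gamma(\mu) - P(t), \\
\text{slow time-scale:}\quad & \dot \mu(t) = \bar\gamma(\mu(t)) - P_0 .
\end{align*}
```

On the fast time-scale P relaxes to $\bar\gamma(\mu)$, so the slow recursion effectively sees the true average power. Its rest point satisfies $\bar\gamma(\mu) = P_0$, and one expects it to be stable when $\bar\gamma$ is decreasing in $\mu$, since a larger $\mu$ lowers the water level.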
V. MULTI-CHANNEL TRANSMISSIONS
So far, we have considered that a user may access N
channels, but transmits on one of these channels at a time.
Here, we extend the analysis to the case where the user
can transmit on several channels simultaneously,
provided that these channels have been probed. We assume
that the various channels are orthogonal, so that concurrent
transmissions on different channels do not interfere with each
other. Now the decision problem that the user faces is similar
to that investigated previously, except that here when the user
decides to stop and transmit, it has to decide the transmission
power on each of the probed channels. Note that the user may
decide not to transmit at all on a given channel by allocating a
zero power on this channel. As before, the user’s objective is
to maximize its throughput. The analysis of this problem uses
methods similar to those developed in Sections III and IV.
A. Problem formulation
We first define the space of probing and power allocation
strategies in the case of possible multi-channel transmissions,
and then state the throughput maximization problem.
Definition 3: A joint probing and power control strategy π̃ is a mapping from the set of states S to the set
P({0, 1, . . . , N} × R^N_+), i.e., in every state s, π̃ chooses a
pair (i, p) randomly according to the distribution π̃(s).
− If i > 0, then the user probes channel i, observes its state
ci, and the system state changes to s′, where s′j = sj for j ≠ i
and s′i = ci.
− If i = 0, the user stops probing and starts transmitting on
each channel j with power pj.
In state s, let (i, p) be the decision made under π̃. Then,
we impose the following restrictions on π̃'s decisions: (a) If
i > 0, π̃ probes channel i, which means that i had not been
probed earlier, i.e., si = −1. (b) If i = 0, then under π̃, the
user stops and transmits. To ensure that it transmits on probed
channels only, we impose pj > 0 only if sj ∈ C.
The set of joint probing and power allocation strategies
satisfying (a) and (b) is denoted by Π̃. As before, for any
π̃ ∈ Π̃, we define the associated occupation measure ρπ̃(·)
and the terminal state distribution σπ̃(·). Using these, we can
compute the throughput and the average power under π̃:
$$T(\tilde\pi) = \int_{S \times \mathbb{R}^N_+} d\rho_{\tilde\pi}(s, p)\, (1 - k(s)\beta) \sum_{j=1}^{N} R(s_j, p_j), \tag{11}$$
$$P(\tilde\pi) = \int_{S \times \mathbb{R}^N_+} d\rho_{\tilde\pi}(s, p)\, (1 - k(s)\beta) \sum_{j=1}^{N} p_j. \tag{12}$$
We seek to solve the following optimization problem:
Find π̃* ∈ Π̃ maximizing T(π̃) subject to P(π̃) ≤ P0. (O1)
B. Optimal power allocation and saddle point interpretation
We first provide structural properties of the optimal power
allocations that simplify problem (O1). First, using the concavity of the rate function R(·, ·) in power, we can reproduce the
proof of Lemma 1, and prove that we may restrict our attention
to deterministic power allocations. We define Π̃1 as the set of
strategies having deterministic power allocations, and denote
by pπ̃(s) the power allocation vector chosen under strategy
π̃ ∈ Π̃1 in state s ∈ S. Next, we identify, for a given terminal
state distribution σ ∈ P(S), the optimal power allocation. The
power allocation of a strategy π̃ in Π̃1 whose terminal state
distribution is σ is just represented as a function p : S → R^N_+,
and the couple (σ, p(·)) uniquely defines the throughput and
the average power consumption:
$$T(\tilde\pi) = T^{(\sigma,p)} = \sum_{s \in S} \sigma(s) (1 - k(s)\beta) \sum_{j=1}^{N} R(s_j, p_j(s)),$$
$$P(\tilde\pi) = P^{(\sigma,p)} = \sum_{s \in S} \sigma(s) (1 - k(s)\beta) \sum_{j=1}^{N} p_j(s).$$
We solve:
(P̃σ) Find p*(·) maximizing T^(σ,p) subject to P^(σ,p) ≤ P0.
The above problem is convex with associated Lagrangian:
$$L_\sigma(p(\cdot), \mu) = \sum_{s \in S} \sigma(s) (1 - \beta k(s)) \sum_{j=1}^{N} \left[ R(s_j, p_j(s)) - \mu p_j(s) \right] + \mu P_0.$$
We can then easily show that the power allocation maximizing
the Lagrangian is again obtained through water-filling with
parameter μ. Note that here the water-filling is made in time
and channels, i.e., for any state s ∈ S, the optimal power
allocation is pμ(s) with, for any j ∈ {1, . . . , N},
$$p_{\mu,j}(s) = 0 \ \text{ if } s_j = -1, \qquad p_{\mu,j}(s) = \left( \frac{1}{\mu} - \frac{N_0}{s_j} \right)^{+} = p_\mu(s_j) \ \text{ if } s_j \in C.$$
Hence we can restrict our attention to strategies within the
set Π̃2 of strategies whose power allocations are obtained
through water-filling in time and channels. Any strategy π̃ ∈
Π̃2 can be represented as a couple (ν, μ) ∈ V × R+, where
ν is a probing strategy satisfying σπ̃ = σν and μ is the time-channel water-filling parameter of the power allocation.
Now for any (σ, μ) ∈ P(S) × R+, define G̃(σ, μ) as:
$$\tilde G(\sigma, \mu) = \max_{p(\cdot)} L_\sigma(p(\cdot), \mu) = \sum_{s \in S} \sigma(s) (1 - k(s)\beta) \sum_{j=1}^{N} R(s_j, p_\mu(s_j)) + \mu \Big( P_0 - \sum_{s \in S} \sigma(s) (1 - k(s)\beta) \sum_{j=1}^{N} p_\mu(s_j) \Big).$$
Then, as in Theorem 1, it can be shown that an optimal probing
and power allocation strategy is (ν*, μ*) ∈ Π̃2 and solves the
following strong maxmin condition:
$$\max_{\nu \in V} \min_{\mu \ge 0} \tilde G(\sigma_\nu, \mu) = \min_{\mu \ge 0} \max_{\nu \in V} \tilde G(\sigma_\nu, \mu).$$
Finally, π̃* ∈ Π̃ solves (O1) iff π̃* = (ν*, μ*) ∈ Π̃2 with:
1) ν* = arg maxν∈V G̃(σν, μ*),
2) P^(σν*, μ*) = P0,
where for any (ν, μ), P^(σν, μ) denotes the average power
consumption under strategy (ν, μ):
$$P^{(\sigma_\nu, \mu)} = \sum_{s \in S} \sigma_\nu(s) (1 - k(s)\beta) \sum_{j=1}^{N} p_\mu(s_j).$$
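The time-and-channel water-filling and the resulting value of G̃(σ, μ) can be evaluated numerically as in the sketch below. Several points are assumptions made for illustration: the rate function is taken to be R(c, p) = log(1 + c p / N0), which is consistent with the water-filling form above but is not restated in this section; k(s) is read as the number of probed channels in state s; a terminal-state distribution is passed as a plain dictionary; and all names are hypothetical.

```python
import math


def rate(gain, power, n0=1.0):
    """Assumed rate function R(c, p) = log(1 + c p / N0)."""
    return math.log(1.0 + gain * power / n0)


def water_filling(state, mu, n0=1.0):
    """Per-channel powers; entries equal to -1 mark un-probed channels."""
    return [0.0 if c == -1 else max(0.0, 1.0 / mu - n0 / c) for c in state]


def g_tilde(sigma, mu, beta, p0, n0=1.0):
    """Return (G~(sigma, mu), average power) for sigma = {state: probability}."""
    throughput = 0.0
    avg_power = 0.0
    for state, prob in sigma.items():
        k = sum(1 for c in state if c != -1)  # number of probed channels
        powers = water_filling(state, mu, n0)
        weight = prob * (1.0 - k * beta)
        throughput += weight * sum(
            rate(c, p, n0) for c, p in zip(state, powers) if c != -1
        )
        avg_power += weight * sum(powers)
    return throughput + mu * (p0 - avg_power), avg_power
```

The second returned value is the average power of the strategy, which is the quantity that condition 2) requires to equal P0 at the optimum.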
C. Structure of optimal probing strategies
We fix the power control to be a water-filling power
allocation with parameter μ. For this power control, we obtain
the optimal probing strategy ν*. For any state s ∈ S, define
As as the set of probed channels in state s, and Ās as the set
of un-probed channels. Also define:
$$\tilde G(s) = \sum_{j \in A_s} G(s_j), \qquad \text{where } G(c) = R(c, p_\mu(c)) - \mu p_\mu(c).$$
Now, consider the system to be in state s. If a probing
strategy terminates in s, then the total reward received is
$$\tilde G_{tr}(s) = (1 - k(s)\beta)\, \tilde G(s).$$
If the strategy decides to probe further, say channel i, then
the system state changes from s to s(i), where s(i) satisfies
As(i) = As ∪ {i}. Let G̃*(s) denote the maximum expected
reward starting from state s. Then, we can characterize G̃*(·)
recursively, starting from the state s0 = (−1, . . . , −1), using
Bellman's equation:
$$\tilde G^{*}(s) = \max\Big\{ \tilde G_{tr}(s),\ \max_{i \in \bar A_s} \mathbb{E}_i[\tilde G^{*}(s(i))] \Big\}.$$
In order to characterize ν*, let us define the following term,
which provides the maximum expected reward that can be
obtained by probing exactly one additional channel:
$$\tilde G_{pr,tr}(s) = (1 - (k(s) + 1)\beta) \Big( \tilde G(s) + \max_{i \in \bar A_s} \mathbb{E}_i[G(C_i)] \Big).$$
1) Optimal stopping rule: Now, we characterize the states
in which an optimal probing strategy terminates:
Theorem 5: The optimal probing strategy ν* terminates in
state s if and only if G̃tr(s) ≥ G̃pr,tr(s).
2) Optimal channel probing order: Now, we fully characterize ν* by obtaining an optimal channel probing order.
Theorem 6: Assume that the channels are stochastically
ordered. Fix any state s ∈ S such that
$$\tilde G_{tr}(s) < \tilde G_{pr,tr}(s). \tag{13}$$
Then, ν* probes the stochastically largest channel in Ās.
The proofs of Theorems 5 and 6 are similar to those of
Theorems 2 and 3.
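The stopping rule and the probing order can be combined into a single per-state decision, sketched below. The sketch makes several assumptions for illustration: the rate function is R(c, p) = log(1 + c p / N0), each channel has a finite distribution given as (gain, probability) pairs, and the next channel is chosen as the un-probed one with the largest E_i[G(C_i)]. Since G is non-decreasing in the gain, this choice coincides with the stochastically largest channel when the channels are stochastically ordered. All names are hypothetical.

```python
import math


def g_reward(gain, mu, n0=1.0):
    """G(c) = R(c, p_mu(c)) - mu * p_mu(c), with assumed R = log(1 + c p / N0)."""
    power = max(0.0, 1.0 / mu - n0 / gain)
    return math.log(1.0 + gain * power / n0) - mu * power


def expected_reward(dist, mu, n0=1.0):
    """E_i[G(C_i)] for a channel given as a list of (gain, probability)."""
    return sum(prob * g_reward(gain, mu, n0) for gain, prob in dist)


def probing_decision(state, channel_dists, mu, beta, n0=1.0):
    """Return None to stop and transmit, else the index of the channel to probe."""
    probed = [j for j, c in enumerate(state) if c != -1]
    unprobed = [j for j, c in enumerate(state) if c == -1]
    if not unprobed:
        return None
    k = len(probed)
    g_s = sum(g_reward(state[j], mu, n0) for j in probed)
    best = max(unprobed, key=lambda j: expected_reward(channel_dists[j], mu, n0))
    g_tr = (1.0 - k * beta) * g_s
    g_pr_tr = (1.0 - (k + 1) * beta) * (
        g_s + expected_reward(channel_dists[best], mu, n0)
    )
    # Stop exactly when terminating now is at least as good as one more probe.
    return None if g_tr >= g_pr_tr else best
```

Each decision touches every channel at most once, so a full slot of up to N probes needs at most O(N^2) evaluations of `expected_reward` in this naive form; caching the expectations per value of μ would reduce that.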
D. An optimal on-line strategy
Again, computing π̃* can be quite difficult (exponential
complexity). As in the case where transmissions on a single
channel were allowed, we can propose an on-line learning
algorithm that provably converges to the optimal strategy. The
algorithm is exactly the same as Algorithm 1 except that we
use the strategy νn ∈ arg maxν G̃(σν, μn) in slot n.
VI. SIMULATION RESULTS
We now illustrate the throughput gains achieved by an
optimal probing and power control strategy using simulation.
The N channels are equivalent, and experience Rayleigh
fading, i.i.d. across slots. The results with heterogeneous
channels follow similar trends and are omitted due to space
constraints. We assume that β = 0.04. The optimal strategy π* is compared to: (1) a genie-aided strategy, which assumes that
at the beginning of each slot the channel states are known; (2)
a fixed-power strategy πfp where an average power P0 is used
in each slot (in the case of multi-channel transmissions, the
power is evenly spread among probed channels), and where an
optimal probing strategy, given this fixed power allocation, is
used as determined in [17], [6]; (3) Strategies π1 and πN where
one or all channels are probed, and where the optimal power
allocation, given this probing strategy, is used. We compute π* using learning Algorithm 1 (or its equivalent in multi-channel
transmission scenarios), with parameters a(n) = (1/n)^0.8
and b(n) = 10/n. We observe a convergence time¹ for this
algorithm that lies between 2000 and 3000 slots with up to 25
channels.
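The convergence-time criterion used here (the first time after which the achieved throughput stays within 5% of the maximum) can be computed from a throughput trace as in the following sketch. The function name and the list-based trace are assumptions made for illustration.

```python
def convergence_time(throughputs, max_throughput, tolerance=0.05):
    """First index after which the trace stays within tolerance of the maximum.

    Returns None when the trace never settles inside the band.
    """
    last_violation = -1
    for n, value in enumerate(throughputs):
        if abs(value - max_throughput) > tolerance * max_throughput:
            last_violation = n
    first_ok = last_violation + 1
    return first_ok if first_ok < len(throughputs) else None
```

Note that a trace which enters the 5% band and later leaves it only counts from its last re-entry, which matches the "remains within" wording of the definition.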
Figures 2(a) and (b) present results when transmitting on
only one channel is allowed. Figure 2(a) shows the throughput
of various strategies as a function of N for a fixed SNR =
10 dB. Comparing the throughput achieved with the genie-aided strategy and the others allows us to quantify the price
of information; e.g. for N channels, the loss in throughput
due to lack of channel information is around 30%, but this
loss grows as N increases; it should scale as log log(N) for
large N, because when probing is required, the throughput
remains bounded as N grows large. Figure 2(b) shows the
throughput gain of π* over other strategies as a function of
average SNR. Note that the throughput gain of π* over πfp
is negligible except for low SNR (the gain is 9% at -10 dB).
The reason behind this is that in the high SNR regime log(1 +
SNR) ≈ log(SNR). With this approximation, the optimal
solution of (Pσ ) is p(s) = P0 for every s ∈ S and σ(·),
i.e., the constant power control is almost optimal. However,
note that the gain is small even for moderate SNR values,
e.g., the gain is 1% at 0 dB. Thus, when transmitting on only
one channel is allowed, optimizing over the probing strategy
is important, and optimizing over power control is not crucial.
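The high-SNR argument can be made explicit with a short calculation, sketched here under the assumption that the rate function is $R(c, p) = \log(1 + c\,p/N_0)$ and for a single transmitting channel of gain $c$. With the approximation, the stationarity condition of the Lagrangian in $p$ reads:

```latex
\begin{align*}
\frac{\partial}{\partial p}\Big[ \log\Big(\frac{c\,p}{N_0}\Big) - \mu p \Big]
  = \frac{1}{p} - \mu = 0
  \quad\Longrightarrow\quad p = \frac{1}{\mu},
\end{align*}
```

whereas without the approximation the same condition gives $c/(N_0 + c\,p) = \mu$, i.e. $p = 1/\mu - N_0/c$. In the approximate case the stationary power no longer depends on $c$, so the allocation is constant across states, with the constant fixed by the average power constraint.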
In Figure 2(c), we give the throughput gain of π̃* over
other strategies in the case of multi-channel transmissions. It
is interesting to see that the throughput gain is significantly
higher than that observed in Figure 2(b). Note also that the
gain over πfp is quite important (at least 90% for various values
of SNR). Thus, to achieve good performance, it is imperative
to optimize both probing and power control strategies, which
contrasts with the case of single-channel transmissions.
¹ By definition, the convergence time is the first time after which the
achieved throughput remains within 5% of the maximum throughput.
Fig. 2. Throughput and throughput gains in the case of single-channel transmissions (a) and (b) and multi-channel transmissions (c). Panels: (a) throughput vs. number of channels N, avg. SNR = 10 dB (curves: genie-aided, optimal, fixed-power, probe-all, probe-one); (b) throughput gain (%) vs. avg. SNR (dB), N = 15 (curves: fixed-power, probe-all, probe-one); (c) throughput gain (%) vs. avg. SNR (dB), N = 15 (curves: fixed-power, probe-all, probe-one).
VII. RELATED WORK
The problem analyzed in this paper falls into the broad
class of stochastic control problems [3], where an optimal
exploration vs. exploitation trade-off has to be identified.
However, as already noticed in [11], it does not correspond
to any of the classical control problems, such as multi-armed
bandits, or stopping time, or optimal sampling problems.
Indeed, in the various versions of the multi-armed bandit
problem, sampling an arm (here a channel) is not allowed
before exploiting it. Note that the authors of [15] propose a
model for opportunistic spectrum access where, in each slot,
the user chooses a channel and tries to transmit on it without
acquiring its state. This model actually corresponds to the restless multi-armed bandit problem [18]. Our problem cannot be
seen as a stopping time problem [7], because here in addition
to the decision to probe further or to stop and transmit, the user
has to select which channel to probe next, or at which power
to transmit. It would become a stopping time problem if the
channels were statistically equivalent and if transmissions were
made at a fixed power, e.g., as in [17]. Finally, our problem
is not an optimal sampling problem, where one seeks the optimal order
in which random variables should be sampled [14], since this
kind of model does not allow for exploitation.
The design of optimal probing and channel selection strategies has only recently been studied [13], [10], [12], [11],
[5], [6], but most often under the assumptions that (i) the
channel states are identically distributed and (ii) power control
is not taken into account. In [6], the authors manage to relax
assumption (i), but to our knowledge, the present work is the
first to consider probing and power control strategies jointly.
VIII. CONCLUSION
We have considered a case where a user can access many
channels for data transmission, but to use them effectively
it needs to acquire CSI. Acquiring CSI consumes resources,
thereby reducing the resources remaining for actual data
transmission. In such systems, we have designed a probing
and power control strategy that maximizes the throughput.
The optimal strategy is computationally simple, but can be
computed only through an iterative learning algorithm. We have
shown that the iterative procedure converges to the optimal
policy. Key insights obtained from the numerical experiments
are: (a) when a user can transmit only on a single channel,
the gain through power adaptation is limited, i.e., the constant
power allocation with an optimal probing strategy provides
near-optimal performance; (b) when a user can transmit on multiple
channels simultaneously, the throughput gain through intelligent power allocation is significant (more than 90%). Hence,
it is of paramount importance to use joint probing and power
control to optimally exploit the available resources. Note that
cognitive radio is one of the most important examples of
systems in which a user can simultaneously transmit on multiple
channels after acquiring CSI.
REFERENCES
[1] E. Altman. Constrained Markov Decision Processes. Chapman and
Hall/CRC, 1999.
[2] P. Bender, P. Black, M. Grob, R. Padovani, N. Sindhushayana, A. Viterbi.
CDMA/HDR: a bandwidth-efficient high-speed wireless data service for
nomadic users. IEEE Commun. Mag., vol. 38, pp. 70-77, 2000.
[3] D. Bertsekas. Dynamic Programming and Optimal Control, 3rd edition.
Athena Scientific, 2007.
[4] V. Borkar. Stochastic Approximation, a Dynamical Systems Viewpoint.
Hindustan Book Agency (Cambridge University Press), 2008.
[5] N. Chang, M. Liu. Optimal channel probing and transmission scheduling
for opportunistic spectrum access. In proc. of ACM MobiCom, 2007.
[6] P. Chaporkar, A. Proutiere. Optimal Joint Probing and Transmission
Strategy for Maximizing Throughput in Wireless Systems. IEEE J. on
Selected Areas in Commun., vol. 26, no. 8, pp. 1546-1556, Oct. 2008.
[7] Y.S. Chow, H. Robbins, D. Siegmund. Great expectations: the theory of
optimal stopping. Houghton Mifflin Company, 1971.
[8] K. Etemad. Overview of Mobile WiMax technology and evolution. IEEE
Comm. Magazine, pp 31-40, Oct. 2008.
[9] A. Goldsmith, P. Varaiya. Capacity of Fading Channels with Channel
side information. IEEE Trans. Inform. Theory, vol. 43, pp 1986-1992,
Nov. 1997.
[10] S. Guha, K. Munagala, S. Sarkar. Jointly optimal transmission and
probing strategies for multichannel wireless systems. In proc. of CISS,
2006.
[11] S. Guha, K. Munagala, S. Sarkar. Approximation Schemes for Information Acquisition and Exploitation in Multichannel Wireless Networks,
Proc. of Allerton Conf. on Commu., Control and Computing, 2006.
[12] S. Guha, K. Munagala, S. Sarkar. Optimizing Transmission Rate in
Wireless Channels using Adaptive Probes. Poster paper in ACM Sigmetrics/Performance Conference, 2006.
[13] Z. Ji, Y. Yang, J. Zhou, M. Takai, R. Bagrodia. Exploiting medium access
diversity in rate adaptive wireless LANs. In proc. of ACM Mobicom,
2004.
[14] M. Kodialam. The throughput of sequential testing. Lecture Notes in
Comput. Sci., vol. 2081, pp. 280-292, 2001.
[15] L. Lai, H. El Gamal, H. Jiang and H. V. Poor. Cognitive Medium Access:
Exploration, Exploitation and Competition. Submitted to IEEE ToN, Oct.
2007.
[16] H. Robbins. Some aspects of the sequential design of experiments. Bull.
Amer. Math. Soc., 55 pp 527-535, 1952.
[17] A. Sabharwal, A. Khoshnevis, E. Knightly. Opportunistic spectral usage:
Bounds and multi-band CSMA/CA protocol. IEEE/ACM Trans. on
Networking, vol. 15, no. 3, 2007.
[18] P. Whittle. Restless bandits: Activity allocation in a changing world.
In: A celebration of Applied Probability, J. Gani (Ed), J. Appl. Probab.
Spec., 25 pp 287-298, 1988.