This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE INFOCOM 2010 proceedings. This paper was presented as part of the main Technical Program at IEEE INFOCOM 2010.

Learning to optimally exploit multi-channel diversity in wireless systems

P. Chaporkar (IITB, Mumbai), A. Proutiere (Microsoft Research, UK), H. Asnani (IITB, Mumbai)

Abstract: Consider a wireless system where a transmitter may send data to a set of receivers, or on various channels, experiencing random time-varying fading. The transmitter can send data to a single receiver or on a single channel at a time and may adapt its transmission power to the radio conditions of the chosen receiver/channel. Its objective is to implement a strategy defining at each time how to select the receiver/channel and transmission power, so as to maximize its throughput, i.e., its average sending rate, under an average power constraint. The optimization problem is easy when the fading conditions of all the receivers/channels are known. In many situations, however, the instantaneous fading conditions are not known a priori; instead they have to be acquired, i.e., receivers/channels have to be probed, which consumes resources (time, spectrum, energy) in proportion to the number of probed receivers/channels. Hence, the transmitter may choose not to acquire the radio conditions of all the receivers/channels so as to spare resources for actual transmissions. In this paper, we aim at characterizing a joint probing, receiver/channel selection and power control strategy maximizing throughput. We provide an adaptive algorithm converging to the throughput optimal strategy. This algorithm may be used in a wide class of wireless systems with limited information, such as broadcast systems without a priori knowledge of the instantaneous Channel-State Information (CSI).
But it can also be used to solve dynamic spectrum access problems such as those arising in cognitive radio systems, where secondary users can access large parts of the spectrum, but have to discover which portions of the spectrum offer more favorable radio conditions or less interference from primary users.

Prof. Chaporkar's work is supported by India-UK Advanced Technology Centre (IU-ATC) of Excellence in Next Generation Networks Systems and Services.

I. INTRODUCTION

Opportunistic resource allocation has been shown to significantly improve the performance of wireless systems by exploiting (rather than countering) location dependent and time varying channel conditions on various links. But, to employ opportunistic schemes, the transmitter has to know the channel state information (CSI) on each of the links. The CSI is not automatically known; rather, it has to be acquired. Acquiring CSI on each link consumes resources (time, power and bandwidth) proportional to the number of links. Often the gain due to opportunism compensates for the resources invested in CSI acquisition, and hence many systems keep resources aside for CSI acquisition. For example, in CDMA/HDR [2] broadcast systems, a dedicated uplink channel for each receiver is maintained to communicate the CSI. In IEEE 802.16 based WiMax systems, CSI can be obtained by polling each link at the beginning of a frame [8].

However, there is an increasing number of systems where it is not feasible to maintain dedicated resources for CSI acquisition; rather, the CSI acquisition should be done on demand. We refer to systems with on demand CSI acquisition as limited information based MAC. An increasingly important example of limited information based MAC is the opportunistic spectrum access methods used in systems such as cognitive radio systems. In these systems, a secondary user may access a large number of frequency bands provided that these bands are not currently occupied by licensed or primary users.
In this scenario, a user wishing to maximize its transmission rate has to opportunistically use the parts of the spectrum that are left idle by primary users and that offer favorable fading conditions. Of course, the user cannot maintain dedicated resources to acquire the CSI on each frequency band, and to check whether a primary user is using it. Rather, before transmitting, the user should acquire this information on a few well selected bands.

For an optimal design of limited information based MAC, one has to strike the best exploration versus exploitation trade-off. Here, exploration refers to finding out (probing) link CSIs. Exploration consumes resources proportional to the number of links probed, and thus leaves fewer resources for the actual data transmission. On the other hand, exploitation refers to opportunistically transmitting on the probed link with the best CSI; hence the more links one probes, the greater the chance of finding a link with good channel conditions. In [6], we have developed a probing strategy achieving the optimal exploration vs. exploitation trade-off under the assumption that the transmitter always transmits at a fixed power. Here, our aim is to achieve the optimal trade-off when the transmitter can vary the transmit power, but has to satisfy an average power constraint. In wireless networks, power is an important resource that should be used optimally. Hence, it is important to design joint probing and power control schemes that maximize the system throughput. We investigate the throughput gain achieved using power adaptation in limited information based MAC.

Now, we elaborate on the analytical challenges in obtaining the optimal joint probing and power control schemes in limited information based MAC. When the fading conditions on the various links are known at the transmitter, the optimal power control scheme can be obtained as a solution of a convex optimization program.
For example, in the case where a single channel can be used at a time, the optimal scheme is to always transmit on the channel with the most favorable fading state, and to share power in time through the celebrated water-filling procedure, where the water-filling level is obtained so that the average power constraint is satisfied with equality [9]. To compute the water-filling level, one needs to know the distribution of the CSI of the chosen link (the link with the best CSI).

978-1-4244-5837-0/10/$26.00 ©2010 IEEE

As we will demonstrate, a similar analysis is possible even in the case of limited information based MAC, i.e., the optimal probing and power control scheme can be obtained as a solution to an optimization problem. A major difficulty in solving this problem is that the constrained set of possible schemes is large and is extremely intricate to characterize. Indeed, to compute the average power consumption of a given scheme, we need to quantify the distribution of the CSI of the link selected at the end of the probing phase, which turns out to be almost impossible. Hence, we need to solve the optimization problem without really knowing the constrained set of possible schemes. To circumvent this difficulty, we propose an on-line learning strategy that provably converges to the optimal joint probing and power control scheme. More precisely, the contributions in this paper are as follows:
• We formalize the problem of designing optimal opportunistic probing and power allocation schemes as a Constrained Markov Decision Process (Section II).
• We provide structural properties of the problem in systems where transmitting on a single link at a time is permitted.
These properties allow us to characterize the throughput-optimal strategies (Section III).
• As the complexity of the numerical computation of the optimal strategies from the aforementioned characterization grows exponentially with the number of links, we propose an on-line learning algorithm with linear complexity that provably converges to the optimal strategy (Section IV).
• The results are then extended to the case where the transmitter is allowed to transmit on several links at a time (Section V).
• Finally, we illustrate and discuss, using simulations, the efficiency of the proposed optimal exploration-exploitation strategies. In particular, we evaluate the price in terms of throughput that has to be paid due to the lack of information, i.e., due to the fact that the channel states have to be acquired (Section VI).
Note that related work is presented in Section VII, and we conclude in Section VIII.

II. SYSTEM MODEL AND PROBLEM FORMULATION

We present the first basic model considered in this paper to analyze the problem of designing optimal exploration/exploitation strategies in limited information based MAC. We generalize this model in Section V.

A. Model

Consider a user that can possibly transmit on $N$ channels, but on one channel at a time. Time is slotted. The slot duration is assumed to correspond to the coherence time of channels. We assume that the radio conditions on the various channels satisfy the block fading model: the radio conditions on channel $i$ are constant during each slot, and hence represented by a channel state $C_i(t)$ in slot $t$. The random variable $C_i(t)$ takes its values in a finite set $\mathcal{C} = \{c_1, c_2, \ldots, c_M\}$. Moreover, $C_i(t)$, $t \ge 0$ are i.i.d. random variables with distribution $F_i(\cdot)$. We assume that the distribution $F_i(\cdot)$ is known to the transmitter for every $i$. Here the underlying assumption is that the user remains in the system a long time, so that it can learn $F_i(\cdot)$.
We also assume that the channel states are independent across channels, i.e., the random variables $C_i(t)$, $t \ge 0$, $i = 1, \ldots, N$ are independent (spatial diversity).

At the beginning of each slot, the user may acquire the state of one or several channels sequentially. Probing a channel takes a fixed proportion $\beta$ of the slot, so that after probing $k$ channels, the fraction of the slot available for actual data transmissions is $(1 - k\beta)$. When the user decides to transmit on the probed channel $i$ observed in state $c \in \mathcal{C}$ with power $p$, its transmission rate is approximated by the Shannon formula:
$$R(c, p) = \log\left(1 + \frac{c\,p}{N_0}\right),$$
where $N_0$ denotes the thermal noise power. The choice of the rate function $R(\cdot, \cdot)$ does not impact the results derived in this paper, provided that it is increasing and concave in the second argument, i.e., in power. If the user transmits after probing $k$ channels, and decides to transmit at power $p$ on a channel in state $c$, the amount of information transmitted during this slot is $(1 - k\beta) R(c, p)$. Note that in [17], [5], similar models (but with fixed transmit power) have been considered and exemplified in practical systems.

In order to utilize the channel resources and its power reserve optimally, the user has to decide in a smart way the order in which it is going to probe channels, when to stop probing and start transmitting actual data, and finally at which power it should transmit. In short, it has to implement an optimal probing and power allocation strategy. Formally, we define such a strategy as follows. Consider an arbitrary slot (the slot considered does not play any role here as the system is i.i.d. over slots). In this slot, let $s = [s_1 \cdots s_N]$ denote an $N$-dimensional vector indicating which channels have already been probed and also the states of these channels. If the $i$-th channel has been probed, then there exists $c \in \mathcal{C}$ such that $s_i = c$; and for unprobed channels, we let $s_i = -1$. The set of all possible states is $\mathcal{S} = (\mathcal{C} \cup \{-1\})^N$.
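To make the slot model concrete, the following Python sketch evaluates the rate $R(c, p)$ and the amount of information $(1 - k\beta) R(c, p)$ sent in one slot. It is only an illustration under stated assumptions: the function names are hypothetical, the logarithm is taken to be natural (the text does not fix the base), and $N_0 = 1$ is an arbitrary default.

```python
import math

UNPROBED = -1  # marker used in the state vector s for channels not yet probed


def rate(c: float, p: float, n0: float = 1.0) -> float:
    """Shannon-type rate R(c, p) = log(1 + c*p/N0); natural log is an assumption."""
    return math.log1p(c * p / n0)


def num_probed(s: tuple) -> int:
    """Number of channels already probed in state s."""
    return sum(1 for x in s if x != UNPROBED)


def slot_information(s: tuple, c: float, p: float, beta: float, n0: float = 1.0) -> float:
    """Information sent in the slot when stopping in state s and transmitting
    at power p on a channel in state c: (1 - k*beta) * R(c, p)."""
    k = num_probed(s)
    if k * beta > 1.0:
        raise ValueError("probing time exceeds the slot duration")
    return (1.0 - k * beta) * rate(c, p, n0)
```

For instance, with $\beta = 0.1$, stopping in state $s = (-1, 2.0, 0.5)$ leaves a fraction $0.8$ of the slot for the transmission itself.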
Depending on the past decisions in the slot, and its observation of the channel states, the user has to decide whether to probe further, or to transmit on a channel, and at which power. This decision can be random, e.g., with some probability the user decides to probe further, and with some other probability it decides to stop and transmit. In Figure 1, we give an example of decisions in a simplistic 3-channel system. In the following, we denote by $\mathcal{P}(A)$ the set of probability measures on $A$.

[Figure 1: sequence of states $(-1,-1,-1) \to (-1,c,-1) \to (-1,c,c')$ under the decisions $(2,p)$, $(3,p)$, $(3,p)$, i.e., probe channel 2, probe channel 3, transmit on channel 3.]
Fig. 1. Decisions made in one slot. Exploration phase of duration $2\beta$: channels 2 and 3 are probed; exploitation phase of duration $(1 - 2\beta)$: transmission on channel 3 at power $p$ (i.e., $(1 - 2\beta) R(c', p)$ bits are sent).

Definition 1: A joint probing and power control strategy $\pi$ is a mapping from the set of states $\mathcal{S}$ to the set $\mathcal{P}(\{1, \ldots, N\} \times \mathbb{R}_+)$, i.e., in every state $s$, $\pi$ chooses a pair $(i, p)$ randomly according to the distribution $\pi(s)$.
- If $s_i = -1$, then the user probes channel $i$, observes its state $c$, and the system state changes to $s'$, where $s'_j = s_j$ for $j \neq i$ and $s'_i = c$.
- If $s_i \in \mathcal{C}$, it means that channel $i$ has been probed already. The user stops probing and starts transmitting on channel $i$ with power $p$.

The above definition does not exclude deterministic strategies that choose a single pair $(i, p)$ (i.e., w.p. 1) in each state $s$. In this case, $\pi(s) = \delta_{(i,p)}$.
It is worth observing as well that the decision taken by a strategy $\pi$ is defined in all possible states $s \in \mathcal{S}$, although because of the specific choices made by $\pi$, some states may not actually be reached (for example, $\pi$ can decide that channel 1 is never probed first, in which case the state $(c, -1, \ldots, -1)$ for $c \in \mathcal{C}$ is never reached under $\pi$). Strategy $\pi$ in those states can be arbitrarily defined. We denote by $\Pi$ the set of all probing and power allocation strategies.

For a given strategy $\pi \in \Pi$, we denote by $\rho_\pi$ the corresponding occupation measure, i.e., for any subset $A \subset \mathcal{S}$ of states, and Borel set $I \subset \mathbb{R}_+$ of possible transmission powers, the probability that under $\pi$, the user stops probing in a state $s \in A$ and starts transmitting at a power $p \in I$ is:
$$\rho_\pi(A \times I) = \int_{\mathcal{S} \times \mathbb{R}_+} 1_{s \in A}\, 1_{p \in I}\, d\rho_\pi(s, p).$$
We also introduce the measure $\sigma_\pi$, corresponding to the distribution of the state in which the strategy $\pi$ stops: for any $A \subset \mathcal{S}$,
$$\sigma_\pi(A) = \int_{\mathcal{S} \times \mathbb{R}_+} 1_{s \in A}\, d\rho_\pi(s, p).$$
We refer to $\sigma_\pi(\cdot)$ as the terminal state distribution. The occupation measure $\rho_\pi$ results from the random decisions made by $\pi$, and also from the random channel states.

B. Problem formulation

We are now ready to state the problem of designing a probing and power allocation strategy maximizing the user's long-term throughput subject to an average power constraint. Since the objective is to maximize throughput, we restrict our attention to strategies that, when deciding to stop and transmit, transmit on the channel with the best observed state. In state $s$, we denote by $\bar{s} = \max\{s_i, i = 1, \ldots, N\}$ the state of the best (probed) channel. We also denote by $k(s)$ the number of channels that have been probed in state $s$. Both the throughput $T(\pi)$ and the average power $P(\pi)$ under strategy $\pi$ are expressed through the occupation measure $\rho_\pi$:
$$T(\pi) = \int_{\mathcal{S} \times \mathbb{R}_+} d\rho_\pi(s, p)\, (1 - k(s)\beta)\, R(\bar{s}, p), \tag{1}$$
$$P(\pi) = \int_{\mathcal{S} \times \mathbb{R}_+} d\rho_\pi(s, p)\, (1 - k(s)\beta)\, p. \tag{2}$$
Denote by $P_0$ the average power budget.
Our problem is then formalized as follows:

(O1) Find $\pi \in \Pi$ maximizing $T(\pi)$ subject to $P(\pi) \le P_0$.

This problem cannot be solved using classical methods, e.g., convex optimization techniques, simply because the objective and the constraint are both functions of the occupation measure, which proves quite complicated to characterize for a given strategy. In fact, the problem belongs to the class of constrained stochastic control problems [1], which are notoriously difficult. In the next section, we provide some structural properties of (O1) that will help the analysis.

III. STRUCTURAL PROPERTIES OF OPTIMAL STRATEGIES

To solve (O1), we need to study the structure of the possible optimal probing and power allocation strategies. First we show that it is useless to randomize the power allocation. Then we prove that optimal power allocations are always obtained through water-filling. We show that this implies that solving (O1) is equivalent to identifying the saddle point of a function depending on the probing strategy and on a parameter defining the level of the water-filling procedure providing the power allocation. Finally, we provide structural properties of the probing strategy maximizing this function.

A. Derandomizing power

We first define the set $\Pi_1 \subset \Pi$ as the set of strategies $\pi$ such that the power allocation is deterministic, in the sense that when in state $s$, $\pi$ decides to stop probing and to transmit, it then picks a unique transmission power, denoted by $p_\pi(s)$. Mathematically this implies that for any state $s$ and any subset $I$ of $\mathbb{R}_+$, $\rho_\pi(s, I) = \sigma_\pi(s) \times 1_{p_\pi(s) \in I}$. In the following, for any $\pi \in \Pi$, we denote by $\rho_\pi(p|s)$ the probability that $\pi$ selects power $p$ given that it stops probing in state $s$.

Lemma 1: Let $\pi \in \Pi$. Consider $\pi' \in \Pi_1$ such that it makes the same probing decisions as $\pi$, but averages the transmission power decisions made by $\pi$: for any state $s$, if $\pi$ chooses a pair $(i, p)$ for some possible power $p$, then $\pi'$ chooses $(i, p')$, with $p' = \int_{\mathbb{R}_+} d\rho_\pi(p|s)\, p$.
Then: $T(\pi') \ge T(\pi)$.

Proof. Note that since $R(\cdot, \cdot)$ is concave in power, for any state $s$ we have, by Jensen's inequality and the definition of $\pi'$, that:
$$R(\bar{s}, p_{\pi'}(s)) \ge \int_{\mathbb{R}_+} d\rho_\pi(p|s)\, R(\bar{s}, p).$$
Then, since $\pi$ and $\pi'$ make the same probing decisions and hence share the same terminal state distribution $\sigma_\pi$:
$$T(\pi) = \int_{\mathcal{S} \times \mathbb{R}_+} d\rho_\pi(s, p)\, (1 - \beta k(s))\, R(\bar{s}, p) = \sum_{s \in \mathcal{S}} \sigma_\pi(s) (1 - \beta k(s)) \int_{p \in \mathbb{R}_+} d\rho_\pi(p|s)\, R(\bar{s}, p) \le \sum_{s \in \mathcal{S}} \sigma_\pi(s) (1 - \beta k(s))\, R(\bar{s}, p_{\pi'}(s)) = T(\pi').$$
□

B. Optimality of water-filling

Now we investigate the possible form of optimal power allocations. We fix the terminal state distribution $\sigma \in \mathcal{P}(\mathcal{S})$, and given that distribution, we seek the best power allocation. A (deterministic) power allocation is represented by a function $p : \mathcal{S} \to \mathbb{R}_+$. The throughput achieved by power allocation $p(\cdot)$ is:
$$T^{(\sigma,p)} = \sum_{s \in \mathcal{S}} \sigma(s) (1 - \beta k(s))\, R(\bar{s}, p(s)).$$
The average power consumption under $p(\cdot)$ is:
$$P^{(\sigma,p)} = \sum_{s \in \mathcal{S}} \sigma(s) (1 - \beta k(s))\, p(s).$$
We seek to solve, for a given $\sigma \in \mathcal{P}(\mathcal{S})$:

($P_\sigma$) Find $p^\star(\cdot)$ maximizing $T^{(\sigma,p)}$ subject to $P^{(\sigma,p)} \le P_0$.

Clearly ($P_\sigma$) is a convex optimization problem, and if $R(\cdot, \cdot)$ is strictly concave in power, it admits a unique solution. Consider the associated Lagrangian:
$$L_\sigma(p(\cdot), \mu) = \sum_{s \in \mathcal{S}} \sigma(s) (1 - \beta k(s)) \left[ R(\bar{s}, p(s)) - \mu p(s) \right] + \mu P_0,$$
where $\mu \ge 0$ denotes the Lagrange multiplier. Denote by $G(\sigma, \mu) = \max_{p(\cdot)} L_\sigma(p(\cdot), \mu)$. The solution of ($P_\sigma$) is a power allocation obtained through a water-filling procedure of parameter $\mu$, as stated in the following lemma:

Lemma 2: We have:
$$G(\sigma, \mu) = \sum_{s \in \mathcal{S}} \sigma(s) (1 - \beta k(s)) \left[ R(\bar{s}, p_\mu(\bar{s})) - \mu p_\mu(\bar{s}) \right] + \mu P_0,$$
where
$$p_\mu(\bar{s}) = \left( \frac{1}{\mu} - \frac{N_0}{\bar{s}} \right)^+.$$

Proof. The result follows by solving $\partial L_\sigma / \partial p(s) = 0$ for all $s$, subject to $p(s) \ge 0$. □

C. Saddle point interpretation

From the previous result, the power allocation in a throughput optimal strategy is necessarily obtained through a water-filling procedure.
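As a numerical illustration of Lemma 2, the Python sketch below evaluates the water-filling power $p_\mu(\bar{s}) = (1/\mu - N_0/\bar{s})^+$ and, for a given terminal state distribution, searches by bisection for the value of $\mu$ at which the average power equals $P_0$. The representation of the terminal distribution as a list of triples (probability, number of probed channels, best state), the function names, and the use of bisection are all assumptions made for this example; the text itself does not prescribe a numerical method.

```python
def waterfill_power(mu: float, c: float, n0: float = 1.0) -> float:
    """Water-filling power p_mu(c) = (1/mu - N0/c)^+ for a best state c > 0."""
    return max(0.0, 1.0 / mu - n0 / c)


def average_power(mu: float, terminal, beta: float, n0: float = 1.0) -> float:
    """Average power sum_s sigma(s) (1 - beta*k(s)) p_mu(s_bar).

    `terminal` is a list of (prob, k, best_state) triples (hypothetical layout).
    """
    return sum(prob * (1.0 - beta * k) * waterfill_power(mu, c, n0)
               for prob, k, c in terminal)


def solve_mu(terminal, beta: float, p0: float, n0: float = 1.0) -> float:
    """Bisection on mu: the average power is non-increasing in mu."""
    lo = 1e-12                                  # power is very large here
    hi = max(c for _, _, c in terminal) / n0    # power is zero at and above this mu
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if average_power(mid, terminal, beta, n0) > p0:
            lo = mid
        else:
            hi = mid
    return hi  # feasible side: average power <= p0


# Example: one channel probed, best state 1.0 or 2.0 with equal probability.
terminal_example = [(0.5, 1, 1.0), (0.5, 1, 2.0)]
mu_example = solve_mu(terminal_example, beta=0.1, p0=1.0)
```

In this example both best states receive positive power, so the constraint reads $0.9\,(1/\mu - 0.75) = 1$, and the bisection returns $1/\mu \approx 1.861$.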
Hence to identify such an optimal strategy $\pi^\star$, we may restrict our attention to strategies defined by a probing strategy and a parameter $\mu$ defining the level of the water-filling procedure. To formalize this observation, we define the notion of probing strategy:

Definition 2: A probing strategy $\nu$ is a mapping from $\mathcal{S}$ to the set $\mathcal{P}(\{1, \ldots, N\})$, i.e., in every state $s$, $\nu$ chooses an index $i$ randomly according to the distribution $\nu(s)$.
- If $s_i = -1$, then the user probes channel $i$, observes its state $c_i$, and the system state changes to $s'$, where $s'_j = s_j$ for $j \neq i$ and $s'_i = c_i$.
- If $s_i \in \mathcal{C}$, it means that channel $i$ has been probed already. The user stops probing, yielding a terminal state $s$.

We denote by $\mathcal{V}$ the set of probing strategies. The pair composed of a probing strategy $\nu \in \mathcal{V}$ and a power allocation obtained through water-filling of level $\mu$ (i.e., $p_\mu(\cdot)$) defines a strategy $\pi \in \Pi_1$, and we use the notation $\pi = (\nu, \mu)$. Define $\Pi_2$ as the set of such strategies: $\Pi_2 = \{\pi \in \Pi_1 : \exists \nu \in \mathcal{V}, \mu > 0, \pi = (\nu, \mu)\}$. For a strategy $\pi = (\nu, \mu) \in \Pi_2$, the terminal state distribution $\sigma_\pi$ depends on $\pi$ through the probing strategy $\nu$ only; hence we may write $\sigma_\nu = \sigma_\pi$. Summarizing what we have shown so far, solving (O1) is equivalent to solving (O2), where:

(O2) Find $\pi \in \Pi_2$ maximizing $T(\pi)$ subject to $P(\pi) \le P_0$.

The following crucial result will help us to characterize the solution of (O2). It states that the solution may be interpreted as the saddle point of the function $(\nu, \mu) \mapsto G(\sigma_\nu, \mu)$ defined in §III-B.

Theorem 1: Let $\pi^\star = (\nu^\star, \mu^\star) \in \Pi_2$. The strategy $\pi^\star$ is optimal if and only if the pair $(\nu^\star, \mu^\star)$ satisfies the following saddle point condition: for any $\nu \in \mathcal{V}$, $\mu > 0$,
$$G(\sigma_\nu, \mu^\star) \le G(\sigma_{\nu^\star}, \mu^\star) \le G(\sigma_{\nu^\star}, \mu),$$
or equivalently,
$$G(\sigma_{\nu^\star}, \mu^\star) = \min_{\mu > 0} \max_{\nu \in \mathcal{V}} G(\sigma_\nu, \mu) = \max_{\nu \in \mathcal{V}} \min_{\mu > 0} G(\sigma_\nu, \mu). \tag{3}$$

The proof of Theorem 1 is not straightforward since $G$ is not the Lagrangian of problem (O2), and hence (3) does not a priori express the strong duality of some optimization problem.
Next, we present the formal proof.

Proof. First, we show that (O1) is a convex optimization problem. To show this, we need to show that (1) $\Pi$ is a convex set, and (2) $T(\pi)$ is concave in $\pi$. Note that a joint probing and power control policy $\pi$ is characterized by its occupation measure $\rho_\pi$. Thus, the convex combination of two policies is defined as the convex combination of their occupation measures, elementwise. That is, for every $\alpha \in [0, 1]$, $\pi = \alpha \pi_1 + (1 - \alpha) \pi_2$ implies that $\rho_\pi(s, p) = \alpha \rho_{\pi_1}(s, p) + (1 - \alpha) \rho_{\pi_2}(s, p)$. Clearly, $\pi$ is a valid joint probing and power control policy as it can be obtained by choosing $\pi_1$ w.p. $\alpha$ and $\pi_2$ w.p. $(1 - \alpha)$. Thus, $\Pi$ is a convex set. Now, we show that $T(\pi)$ is concave in $\pi$. We need to show that $T(\pi) \ge \alpha T(\pi_1) + (1 - \alpha) T(\pi_2)$. First, note that
$$\sigma_\pi(s) = \alpha \sigma_{\pi_1}(s) + (1 - \alpha) \sigma_{\pi_2}(s), \qquad \rho_\pi(p|s) = \theta \rho_{\pi_1}(p|s) + (1 - \theta) \rho_{\pi_2}(p|s),$$
where
$$\theta = \frac{\alpha \sigma_{\pi_1}(s)}{\alpha \sigma_{\pi_1}(s) + (1 - \alpha) \sigma_{\pi_2}(s)}.$$
With the above observations and some algebra, it can be verified that $T(\pi) = \alpha T(\pi_1) + (1 - \alpha) T(\pi_2)$. Thus, $T(\pi)$ is a concave function of $\pi$.

Now, we show that (O1) has the strong duality property using Slater's constraint qualification condition. Note that any strategy $\pi$ that allocates zero power in every terminal state, i.e., $\sum_{s \in \mathcal{S}} \rho_\pi(s, 0) = 1$, is a strictly feasible solution of (O1). Thus, Slater's condition holds. This implies that
$$\max_{\pi \in \Pi} \min_{\lambda \ge 0} \left[ T(\pi) + \lambda (P_0 - P(\pi)) \right] = \min_{\lambda \ge 0} \max_{\pi \in \Pi} \left[ T(\pi) + \lambda (P_0 - P(\pi)) \right]. \tag{4}$$
In (4), $\lambda$ is the Lagrange multiplier. Now, by Lemma 1, we know that the optimal probing and power control strategy lies in $\Pi_1$. Thus, (4) holds even when $\Pi$ is replaced by $\Pi_1$. Let $\Pi_\sigma$ denote the set of policies $\pi$ that generate the same terminal distribution $\sigma$. Moreover, let $\Sigma = \{\sigma : \sigma = \sigma_\pi \text{ for some } \pi \in \Pi_1\}$. With this notation, the right hand side of (4) can be written as follows:
$$\min_{\lambda \ge 0} \max_{\sigma \in \Sigma} \max_{\pi \in \Pi_\sigma} \left[ T(\pi) + \lambda (P_0 - P(\pi)) \right]. \tag{5}$$
Consider the last optimization in (5), and note that
$$\max_{\pi \in \Pi_\sigma} \left[ T(\pi) + \lambda (P_0 - P(\pi)) \right] = \max_{p(\cdot)} L_\sigma(p(\cdot), \lambda).$$
This is because, in $\Pi_\sigma$, policies differ in their power allocation only. Thus, optimizing over $\Pi_\sigma$ is equivalent to choosing the optimal power control. Thus,
$$\min_{\lambda \ge 0} \max_{\sigma \in \Sigma} \max_{\pi \in \Pi_\sigma} \left[ T(\pi) + \lambda (P_0 - P(\pi)) \right] = \min_{\lambda \ge 0} \max_{\sigma \in \Sigma} G(\sigma, \lambda). \tag{6}$$
Now, note that the left hand side of (4) is equal to
$$\max_{\sigma \in \Sigma} \max_{\pi \in \Pi_\sigma} \min_{\lambda \ge 0} \left[ T(\pi) + \lambda (P_0 - P(\pi)) \right]. \tag{7}$$
Using similar arguments as before, we note that
$$\max_{\pi \in \Pi_\sigma} \min_{\lambda \ge 0} \left[ T(\pi) + \lambda (P_0 - P(\pi)) \right] = \max_{p(\cdot)} \min_{\lambda \ge 0} L_\sigma(p(\cdot), \lambda).$$
Using the strong duality of ($P_\sigma$), we conclude that
$$\max_{\pi \in \Pi_\sigma} \min_{\lambda \ge 0} \left[ T(\pi) + \lambda (P_0 - P(\pi)) \right] = \min_{\lambda \ge 0} \max_{p(\cdot)} L_\sigma(p(\cdot), \lambda).$$
Thus,
$$\max_{\sigma \in \Sigma} \max_{\pi \in \Pi_\sigma} \min_{\lambda \ge 0} \left[ T(\pi) + \lambda (P_0 - P(\pi)) \right] = \max_{\sigma \in \Sigma} \min_{\lambda \ge 0} G(\sigma, \lambda). \tag{8}$$
From (6) and (8), we conclude that
$$\min_{\lambda \ge 0} \max_{\sigma \in \Sigma} G(\sigma, \lambda) = \max_{\sigma \in \Sigma} \min_{\lambda \ge 0} G(\sigma, \lambda).$$
The result follows. □

Theorem 1 provides a simple way to verify the optimality of a given strategy $\pi = (\nu, \mu)$ in $\Pi_2$. For example, we simply have to check that:
1. $\nu = \arg\max_{\nu' \in \mathcal{V}} G(\sigma_{\nu'}, \mu)$;
2. $\mu = \arg\min_{\mu' > 0} G(\sigma_\nu, \mu')$.
Now observe that for any $\sigma \in \mathcal{P}(\mathcal{S})$, $G(\sigma, \mu)$ is minimized in $\mu$ if and only if the resulting average power consumption is exactly equal to $P_0$ (this follows by differentiating $G$ w.r.t. $\mu$). Summarizing, we have the following characterization of optimal strategies:

Corollary 1: Let $\pi^\star \in \Pi$. The strategy $\pi^\star$ solves (O1) if and only if $\pi^\star \in \Pi_2$, i.e., $\exists (\nu^\star, \mu^\star) \in \mathcal{V} \times \mathbb{R}_+ : \pi^\star = (\nu^\star, \mu^\star)$, and
(1) $\nu^\star = \arg\max_{\nu \in \mathcal{V}} G(\sigma_\nu, \mu^\star)$,
(2) $P^{(\sigma_{\nu^\star}, \mu^\star)} = P_0$,
where for any $(\nu, \mu)$, $P^{(\sigma_\nu, \mu)}$ denotes the average power consumption under strategy $(\nu, \mu)$:
$$P^{(\sigma_\nu, \mu)} = \sum_{s \in \mathcal{S}} \sigma_\nu(s) (1 - k(s)\beta)\, p_\mu(\bar{s}).$$

D. Structure of the optimal probing strategy

If one wishes to use the characterization of the solution of (O1) provided in the above corollary, one needs to be able to verify Condition (1).
In other words, we need to solve the following problem for a fixed $\mu$:

($P_\mu$) Find $\nu \in \mathcal{V}$ maximizing $G(\sigma_\nu, \mu)$.

($P_\mu$) can be seen as a generalized version of stopping time problems, and as it turns out, similar problems have been recently studied and solved, see [5], [6]. We adapt the results of these existing analyses to our setting. For brevity, we introduce the following notation: for any $c \in \mathcal{C}$, $G(c) = R(c, p_\mu(c)) + \mu [P_0 - p_\mu(c)]$. Assume that at a given slot, the system is in state $s$.
- If under strategy $\nu$, we stop probing and transmit (on the best channel), the reward is $G_{tr}(s)$ with $G_{tr}(s) = (1 - k(s)\beta)\, G(\bar{s})$;
- If under strategy $\nu$, we further probe a channel $i$, the state becomes $s' = s(i)$ where for all $j \neq i$, $s'_j = s_j$ and $s'_i = C_i$. Here $C_i$ is the random variable representing the state of channel $i$.

Now denote by $G^\star(s)$ the average reward under an optimal strategy $\nu^\star$ starting from state $s$. Bellman's equation allows us to recursively characterize $G^\star$: for any $s \in \mathcal{S}$,
$$G^\star(s) = \max\left\{ G_{tr}(s),\ \max_{i : s_i = -1} E_i[G^\star(s(i))] \right\},$$
where $E_i[\cdot]$ is the expectation taken w.r.t. the distribution $F_i(\cdot)$ of the state of the $i$-th channel. To characterize the solution $\nu^\star$ of ($P_\mu$), we need to compute $G^\star(s_0)$ where $s_0 = (-1, \ldots, -1)$ is the initial state. To do so, let us introduce the average reward $G_{pr,tr}(s)$ obtained when, starting in state $s$, one first probes the best single additional channel $i$ and after that, one stops and transmits (on the best channel):
$$G_{pr,tr}(s) = (1 - (k(s) + 1)\beta) \max_{i : s_i = -1} E_i[G(\max\{\bar{s}, C_i\})].$$
For the results of [5], [6] to be applicable, we need the following property of function $G(\cdot)$ that can be easily checked:

Lemma 3: $G(\cdot)$ is a non-decreasing function.

We are now ready to provide two structural properties of the optimal probing strategy $\nu^\star$, that will actually characterize this strategy in some particular but relevant cases.
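For small systems, the quantities above can be evaluated directly. The Python sketch below computes $G(c)$, $G_{tr}(s)$, $G_{pr,tr}(s)$, and $G^\star(s)$ by memoized recursion on Bellman's equation. It is an illustration under several assumptions, none of which come from the text: the names are hypothetical, channel distributions are given as dictionaries mapping states to probabilities, the logarithm is natural, all states in $\mathcal{C}$ are positive, $N\beta \le 1$, and stopping is only allowed once at least one channel has been probed (following Definition 1). The recursion visits every state, so its cost grows exponentially with $N$.

```python
import math
from functools import lru_cache


def make_rewards(dists, beta, mu, p0, n0=1.0):
    """dists[i] is a dict {state c > 0: probability} for channel i (hypothetical layout)."""
    n = len(dists)

    def g(c):
        # G(c) = R(c, p_mu(c)) + mu * (P0 - p_mu(c)), with water-filling power p_mu.
        p = max(0.0, 1.0 / mu - n0 / c)
        return math.log1p(c * p / n0) + mu * (p0 - p)

    def k(s):
        return sum(1 for x in s if x != -1)

    def g_tr(s):
        # Reward for stopping now and transmitting on the best probed channel.
        return (1.0 - k(s) * beta) * g(max(s))

    def g_pr_tr(s):
        # Reward for probing the best single extra channel, then stopping.
        best = max(s)  # equals -1 if nothing has been probed yet
        return (1.0 - (k(s) + 1) * beta) * max(
            sum(prob * g(max(best, c)) for c, prob in dists[i].items())
            for i in range(n) if s[i] == -1)

    @lru_cache(maxsize=None)
    def g_star(s):
        # Bellman recursion; stopping requires at least one probed channel (assumption).
        value = g_tr(s) if k(s) > 0 else -math.inf
        for i in range(n):
            if s[i] == -1:
                cont = sum(prob * g_star(s[:i] + (c,) + s[i + 1:])
                           for c, prob in dists[i].items())
                value = max(value, cont)
        return value

    return g, g_tr, g_pr_tr, g_star


# Two i.i.d. channels with states 1.0 and 3.0, each with probability 1/2.
g, g_tr, g_pr_tr, g_star = make_rewards(
    [{1.0: 0.5, 3.0: 0.5}, {1.0: 0.5, 3.0: 0.5}], beta=0.1, mu=0.5, p0=1.0)
```

In this two-channel example, after observing state 3.0 on the first probed channel the stopping reward already exceeds the continuation value, whereas after observing state 1.0 it is better to probe the second channel.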
1) Optimal stopping rule: The following result states that in any given state $s \in \mathcal{S}$, to optimally decide whether to stop and transmit or to probe further, we only need to follow the choice made by the one-step-look-ahead strategy [5], [6].

Theorem 2: Let $\nu^\star$ be the optimal probing strategy solving ($P_\mu$). In any state $s \in \mathcal{S}$, $\nu^\star$ decides to probe another channel if and only if: $G_{pr,tr}(s) > G_{tr}(s)$.

Theorem 2 is sufficient to characterize the optimal strategy when the states of the various channels are i.i.d. Indeed, in this case, the order in which channels are probed has no impact on the average reward, and hence we can probe channels in any order. However, when the channel states are not identically distributed, characterizing $\nu^\star$ becomes extremely complicated and is an open problem in general. This might be explained by the fact that the one-step-look-ahead strategy is not always optimal as shown in [6].

2) Optimal channel probing order: As discussed above, the main challenge in characterizing $\nu^\star$ is to determine the optimal order in which channels should be probed. In general, this issue proves impossible to solve. However, there are special cases where it is still possible to find $\nu^\star$. Specifically, when the channel states are stochastically ordered (as defined below), the optimal order is obtained by always probing the stochastically largest unprobed channel. Channels are stochastically ordered if there exists a permutation $\omega$ of $\{1, \ldots, N\}$ such that for all $i, j$, if $\omega(i) \le \omega(j)$, then $C_{\omega(j)} \le_{st} C_{\omega(i)}$, where $X \le_{st} Y$ if and only if for all increasing functions $f$ such that $E[f(Y)] < \infty$, $E[f(X)] \le E[f(Y)]$. Without loss of generality, when the channels are stochastically ordered, we assume that the permutation $\omega$ is $\omega(i) = i$ for all $i$. An example of ordered channels is when one can write $C_i = E[C_i] Y_i$ where the random variables $Y_i$ are i.i.d.
copies of a fixed random variable $Y$, i.e., when the channels have similar distributions but different means. This is a quite usual fading model in wireless networks, for example in the case of Rayleigh fading. In these settings, we can obtain an optimal probing strategy [6]:

Theorem 3: Assume that the channels are stochastically ordered. Let $\nu^\star$ be the optimal probing strategy solving ($P_\mu$). In any state $s \in \mathcal{S}$, under $\nu^\star$, the decision on whether to stop and transmit or to probe further is defined by the rule of Theorem 2. Moreover, if the decision is to probe further, the next channel to probe is the stochastically largest unprobed channel. In other words, we necessarily have: $s = (c_1, \ldots, c_{k(s)}, -1, \ldots, -1)$, and the channel to probe next is channel $k(s) + 1$.

E. Summary

In this section, we have proved that the optimal probing and power control strategy $\pi^\star$ solving (O1) has the following properties: (i) the optimal power control strategy is deterministic; (ii) it is obtained via a water-filling procedure of parameter $\mu^\star$; (iii) $\pi^\star = (\nu^\star, \mu^\star)$ where $\nu^\star$ denotes the optimal probing strategy, and $(\nu^\star, \mu^\star)$ satisfies $\nu^\star = \arg\max_{\nu \in \mathcal{V}} G(\sigma_\nu, \mu^\star)$ and $P^{(\sigma_{\nu^\star}, \mu^\star)} = P_0$; finally, we have identified how to determine $\nu^\star = \arg\max_{\nu \in \mathcal{V}} G(\sigma_\nu, \mu^\star)$.

We have theoretically characterized the optimal probing and power control strategy. However, we still need to numerically compute the optimal water-filling parameter $\mu^\star$, which is difficult since the average power consumption depends on both the probing strategy and the water-filling parameter. Such computation might be prohibitive on a simple mobile device.
Indeed, computing the average power consumption even for a fixed strategy requires considering all possible realizations of the states of all channels, which takes $O((\#\mathcal{C})^N)$ operations. In the next section, we propose a simple algorithm that the user can run while exploring and exploiting the spectrum resources and that provably converges to the optimal probing and power control strategy. In each slot, the user has to perform $O(N)$ operations to make its probing and power control decisions. The price for reducing the complexity is the time it takes for the algorithm to converge.

IV. OPTIMAL ON-LINE STRATEGY

We now provide an on-line algorithm that provably converges to the optimal joint probing and power control strategy. The algorithm may be interpreted as a multiple timescale stochastic approximation algorithm. We first describe the algorithm, and then prove its convergence.

A. Stochastic learning algorithm

The algorithm seeks to solve $\min_\mu \max_\nu G(\sigma_\nu, \mu)$. At each slot, the parameter $\mu$, defining the power allocation obtained through water-filling, is updated. The probing strategy $\nu$ is also updated at each slot so as to maximize $G(\sigma_\nu, \mu)$. The latter update is performed using the analysis presented in §III-D. The update of $\mu$ is done so that $\mu$ converges to $\mu^\star$, the solution of $\partial G / \partial \mu = 0$, which is equivalent to the fact that the average power consumption under $p_\mu(\cdot)$ is exactly $P_0$. Formally, the algorithm maintains two random variables: the power allocation parameter $\mu_n \in \mathbb{R}_+$ in slot $n$, and $P_n \in \mathbb{R}_+$ representing the average empirical power consumed until slot $n$. The algorithm operates as follows.

Algorithm 1
1) In the $n$-th slot, run the probing strategy $\nu_n \in \arg\max_\nu G(\sigma_\nu, \mu_n)$, and power allocation $p_{\mu_n}(\cdot)$;
2) At the end of slot $n$:
(i) Observe $\gamma_{n+1}$, the transmission power during slot $n$, and update $P_n$ as:
$$P_{n+1} = P_n + a_n (\gamma_{n+1} - P_n); \tag{9}$$
(ii) Update $\mu_n$ as:
$$\mu_{n+1} = \mu_n + b_n (P_{n+1} - P_0). \tag{10}$$

The step-size sequences $(a_n)$ and $(b_n)$ are chosen such that: $\sum_n a_n = \sum_n b_n = \infty$ and $\sum_n a_n^2, \sum_n b_n^2 < \infty$. Note that in principle, we should update the parameter $\mu_n$ as a function of the actual average power consumption using the optimal strategy given the power allocation parametrized by $\mu_n$. This average power cannot be observed in one slot, of course, so we need to impose that the update on $\mu_n$ is much slower than that of $P_n$; in other words, we require that $b_n / a_n \to 0$ as $n \to \infty$. Note also that Algorithm 1 is easy to implement, because Step 1 only requires implementing $\nu_n$, which has been completely characterized in §III-D; this requires $O(N)$ operations, since in the worst case we probe all channels.

B. Convergence analysis

We prove that Algorithm 1 converges to the optimal probing and power allocation strategy, i.e., the long-term throughput is optimized while satisfying the power constraint.

Theorem 4: Under Algorithm 1, we have almost surely: $\mu_n \to \mu^\star$, $\nu_n \to \nu^\star$ when $n \to \infty$.

Proof. The updates in Algorithm 1 can be written as:
$$P_{n+1} = P_n + a_n \left( E[\gamma_{n+1} | \mathcal{F}_n] - P_n + M^1_{n+1} \right),$$
$$\mu_{n+1} = \mu_n + b_n \left( E[P_{n+1} | \mathcal{F}_n] - P_0 + M^2_{n+1} \right),$$
where the σ-algebra $\mathcal{F}_n = \sigma(P_m, \mu_m, m \le n)$ represents the past up to slot $n$, and $M^1_n$ and $M^2_n$ are martingale difference sequences defined by:
$$M^1_{n+1} = \gamma_{n+1} - E[\gamma_{n+1} | \mathcal{F}_n], \qquad M^2_{n+1} = P_{n+1} - E[P_{n+1} | \mathcal{F}_n].$$
Note that the average power $E[\gamma_{n+1} | \mathcal{F}_n]$ observed in slot $n$ depends on the past only through the parameter $\mu_n$; hence we can define a function $g(\cdot, \cdot)$ such that $g(\mu_n, P_n) = E[\gamma_{n+1} | \mathcal{F}_n] - P_n$. Similarly, $E[P_{n+1} | \mathcal{F}_n]$ depends on the past through $P_n$ and $\mu_n$ only, and there exists a function $h(\cdot, \cdot)$ such that $h(\mu_n, P_n) = E[P_{n+1} | \mathcal{F}_n] - P_0$. Hence the updates in Algorithm 1 become:
$$P_{n+1} = P_n + a_n \left( g(\mu_n, P_n) + M^1_{n+1} \right),$$
$$\mu_{n+1} = \mu_n + b_n \left( h(\mu_n, P_n) + M^2_{n+1} \right).$$
These are the equations of a stochastic approximation algorithm with two time-scales as considered in [4], Chapter 6. It can then be shown that $h$ and $g$ are Lipschitz.
Now the conditions to apply the results of [4] are met, and we deduce that Algorithm 1 converges. Since the unique equilibrium point of Algorithm 1 is the point where the power consumption is exactly $P_0$ and where an optimal probing strategy is used, Theorem 4 is proved. □

V. MULTI-CHANNEL TRANSMISSIONS

So far, we have considered that a user may access N channels, but transmits on only one of these channels at a time. Here, we extend the analysis to the case where the user can transmit on several channels simultaneously, provided that these channels have been probed. We assume that the various channels are orthogonal, so that concurrent transmissions on different channels do not interfere with each other. The decision problem that the user faces is similar to that investigated previously, except that when the user decides to stop and transmit, it has to decide the transmission power on each of the probed channels. Note that the user may decide not to transmit at all on a given channel by allocating zero power to this channel. As before, the user's objective is to maximize its throughput. The analysis of this problem uses methods similar to those developed in Sections III and IV.

A. Problem formulation

We first define the space of probing and power allocation strategies in the case of possible multi-channel transmissions, and then state the throughput maximization problem.

Definition 3: A joint probing and power control strategy π̃ is a mapping from the set of states S to the set $\mathcal{P}(\{0, 1, \ldots, N\} \times \mathbb{R}^N_+)$, i.e., in every state s, π̃ chooses a pair (i, p) randomly according to the distribution π̃(s).
− If i > 0, then the user probes channel i, observes its state $c_i$, and the system state changes to $s'$, where $s'_j = s_j$ for $j \neq i$ and $s'_i = c_i$.
− If i = 0, the user stops probing and starts transmitting on each channel j with power $p_j$.

In state s, let (i, p) be the decision made under π̃. Then, we impose the following restrictions on π̃'s decisions:
(a) If i > 0, π̃ probes channel i, which means that i has not been probed earlier, i.e., $s_i = -1$.
(b) If i = 0, then under π̃, the user stops and transmits. To ensure that it transmits on probed channels only, we impose $p_j > 0$ only if $s_j \in C$.

The set of joint probing and power allocation strategies satisfying (a) and (b) is denoted by Π̃. As before, for any π̃ ∈ Π̃, we define the associated occupation measure $\rho_{\tilde\pi}(\cdot)$ and the terminal state distribution $\sigma_{\tilde\pi}(\cdot)$. Using these, we can compute the throughput and the average power under π̃:
$$T(\tilde\pi) = \int_{S \times \mathbb{R}^N_+} d\rho_{\tilde\pi}(s, p)\, (1 - k(s)\beta) \sum_{j=1}^{N} R(s_j, p_j), \qquad (11)$$
$$P(\tilde\pi) = \int_{S \times \mathbb{R}^N_+} d\rho_{\tilde\pi}(s, p)\, (1 - k(s)\beta) \sum_{j=1}^{N} p_j. \qquad (12)$$
We seek to solve the following optimization problem:

(O1) Find $\tilde\pi^\star \in \tilde\Pi$ maximizing $T(\tilde\pi)$ subject to $P(\tilde\pi) \le P_0$.

B. Optimal power allocation and saddle point interpretation

We first provide structural properties of the optimal power allocations that simplify problem (O1). First, using the concavity of the rate function R(·, ·) in power, we can reproduce the proof of Lemma 1, and prove that we may restrict our attention to deterministic power allocations. We define Π̃1 as the set of strategies having deterministic power allocations, and denote by $p_{\tilde\pi}(s)$ the power allocation vector chosen under strategy π̃ ∈ Π̃1 in state s ∈ S.

Next, we identify, for a given terminal state distribution $\sigma \in \mathcal{P}(S)$, the optimal power allocation. The power allocation of a strategy π̃ in Π̃1 whose terminal state distribution is σ is simply represented as a function $p : S \to \mathbb{R}^N_+$, and the couple (σ, p(·)) uniquely defines the throughput and the average power consumption:
$$T(\tilde\pi) = T^{(\sigma, p)} = \sum_{s \in S} \sigma(s) (1 - k(s)\beta) \sum_{j=1}^{N} R(s_j, p_j(s)),$$
$$P(\tilde\pi) = P^{(\sigma, p)} = \sum_{s \in S} \sigma(s) (1 - k(s)\beta) \sum_{j=1}^{N} p_j(s).$$
We solve:

$(\tilde{P}_\sigma)$ Find $p^\star(\cdot)$ maximizing $T^{(\sigma, p)}$ subject to $P^{(\sigma, p)} \le P_0$.

The above problem is convex, with associated Lagrangian:
$$L_\sigma(p(\cdot), \mu) = \sum_{s \in S} \sigma(s) (1 - \beta k(s)) \sum_{j=1}^{N} \left[ R(s_j, p_j) - \mu p_j \right] + \mu P_0.$$
We can then easily show that the power allocation maximizing the Lagrangian is again obtained through water-filling with parameter μ. Note that here the water-filling is made in time and channels, i.e., for any state s ∈ S, the optimal power allocation is $p_\mu(s)$ with, for any j ∈ {1, . . . , N}:
$$p_{\mu, j}(s) = 0 \ \text{ if } s_j = -1, \qquad p_{\mu, j}(s) = \left[ \frac{1}{\mu} - \frac{N_0}{s_j} \right]^+ = p_\mu(s_j) \ \text{ if } s_j \in C.$$
Hence we can restrict our attention to strategies within the set Π̃2 of strategies whose power allocations are obtained through water-filling in time and channels. Any strategy π̃ ∈ Π̃2 can be represented as a couple $(\nu, \mu) \in V \times \mathbb{R}_+$, where ν is a probing strategy satisfying $\sigma_{\tilde\pi} = \sigma_\nu$ and μ is the time-channel water-filling parameter of the power allocation.

Now for any $(\sigma, \mu) \in \mathcal{P}(S) \times \mathbb{R}_+$, define $\tilde{G}(\sigma, \mu)$ as:
$$\tilde{G}(\sigma, \mu) = \max_{p(\cdot)} L_\sigma(p(\cdot), \mu) = \sum_{s \in S} \sigma(s) (1 - k(s)\beta) \sum_{j=1}^{N} R(s_j, p_\mu(s_j)) + \mu \left( P_0 - \sum_{s \in S} \sigma(s) (1 - k(s)\beta) \sum_{j=1}^{N} p_\mu(s_j) \right).$$
Then, as in Theorem 1, it can be shown that an optimal probing and power allocation strategy is $(\nu^\star, \mu^\star) \in \tilde\Pi_2$ and solves the following strong maxmin condition:
$$\max_{\nu \in V} \min_{\mu \ge 0} \tilde{G}(\sigma_\nu, \mu) = \min_{\mu \ge 0} \max_{\nu \in V} \tilde{G}(\sigma_\nu, \mu).$$
Finally, $\tilde\pi^\star \in \tilde\Pi$ solves (O1) iff $\tilde\pi^\star = (\nu^\star, \mu^\star) \in \tilde\Pi_2$ with:
1) $\nu^\star = \arg\max_{\nu \in V} \tilde{G}(\sigma_\nu, \mu^\star)$,
2) $P^{(\sigma_{\nu^\star}, \mu^\star)} = P_0$,
where for any (ν, μ), $P^{(\sigma_\nu, \mu)}$ denotes the average power consumption under strategy (ν, μ):
$$P^{(\sigma_\nu, \mu)} = \sum_{s \in S} \sigma_\nu(s) (1 - k(s)\beta) \sum_{j=1}^{N} p_\mu(s_j).$$

C.
Structure of optimal probing strategies

We fix the power control to be a water-filling power allocation with parameter μ. For this power control, we obtain the optimal probing strategy $\nu^\star$. For any state s ∈ S, define $A_s$ as the set of probed channels in state s, and $\bar{A}_s$ as the set of un-probed channels. Also define:
$$\tilde{G}(s) = \sum_{j \in A_s} G(s_j), \quad \text{where } G(c) = R(c, p_\mu(c)) - \mu p_\mu(c).$$
Now, consider the system to be in state s. If a probing strategy terminates in s, then the total reward received is $\tilde{G}_{tr}(s) = (1 - k(s)\beta)\, \tilde{G}(s)$. If the strategy decides to probe further, say channel i, then the system state changes from s to s(i), where s(i) satisfies $A_{s(i)} = A_s \cup \{i\}$. Let $\tilde{G}^\star(s)$ denote the maximum expected reward starting from state s. Then, we can characterize $\tilde{G}^\star(\cdot)$ recursively, starting from state $s_0 = (-1, \ldots, -1)$, using Bellman's equation:
$$\tilde{G}^\star(s) = \max\left\{ \tilde{G}_{tr}(s),\ \max_{i \in \bar{A}_s} E_i\left[ \tilde{G}^\star(s(i)) \right] \right\}.$$
In order to characterize $\nu^\star$, let us define the following term, which gives the maximum expected reward that can be obtained by probing exactly one additional channel:
$$\tilde{G}_{pr,tr}(s) = (1 - (k(s) + 1)\beta) \left[ \tilde{G}(s) + \max_{i \in \bar{A}_s} E_i[G(C_i)] \right].$$

1) Optimal stopping rule: We now characterize the states in which an optimal probing strategy terminates.

Theorem 5: The optimal probing strategy $\nu^\star$ terminates in state s if and only if $\tilde{G}_{tr}(s) \ge \tilde{G}_{pr,tr}(s)$.

2) Optimal channel probing order: We now fully characterize $\nu^\star$ by obtaining an optimal channel probing order.

Theorem 6: Assume that the channels are stochastically ordered. Fix any state s ∈ S such that
$$\tilde{G}_{tr}(s) < \tilde{G}_{pr,tr}(s). \qquad (13)$$
Then, $\nu^\star$ probes the stochastically largest channel in $\bar{A}_s$.

The proofs of Theorems 5 and 6 are similar to those of Theorems 2 and 3.

D. An optimal on-line strategy

Again, computing $\tilde\pi^\star$ can be quite difficult (exponential complexity). As in the case where transmissions on a single channel were allowed, we can propose an on-line learning algorithm that provably converges to the optimal strategy.
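The per-slot computations behind such an on-line strategy can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the rate function is assumed to be R(c, p) = log(1 + c p / N0) (the form under which the water-filling expression of Section V-B arises), the expected gains E_i[G(C_i)] of the un-probed channels are assumed to be precomputed for the current μ, and the lower bound mu_min that keeps μ positive is an added safeguard that the text does not specify. The step sizes are the ones reported in Section VI.

```python
import math


def waterfill(c, mu, n0=1.0):
    """Water-filling power p_mu(c) = (1/mu - N0/c)^+ for a probed channel gain c > 0."""
    return max(0.0, 1.0 / mu - n0 / c)


def gain(c, mu, n0=1.0):
    """G(c) = R(c, p_mu(c)) - mu * p_mu(c), with the ASSUMED rate R(c, p) = log(1 + c*p/N0)."""
    p = waterfill(c, mu, n0)
    return math.log(1.0 + c * p / n0) - mu * p


def should_stop(probed, unprobed_expected_gain, mu, beta, n0=1.0):
    """Stopping test of Theorem 5: stop iff G_tr(s) >= G_pr,tr(s).

    probed: gains c_j of the channels probed so far.
    unprobed_expected_gain: E_i[G(C_i)] for each un-probed channel
        (assumed to be precomputed by the caller for the current mu).
    """
    if not unprobed_expected_gain:
        return True  # nothing left to probe
    k = len(probed)
    g = sum(gain(c, mu, n0) for c in probed)
    g_tr = (1.0 - k * beta) * g
    g_pr_tr = (1.0 - (k + 1) * beta) * (g + max(unprobed_expected_gain))
    return g_tr >= g_pr_tr


def update(p_avg, mu, gamma, a_n, b_n, p0, mu_min=1e-6):
    """One application of updates (9) and (10) at the end of a slot."""
    p_next = p_avg + a_n * (gamma - p_avg)   # (9): fast time-scale
    mu_next = mu + b_n * (p_next - p0)       # (10): slow time-scale
    # Keeping mu above mu_min is an assumption of this sketch, not part of (10).
    return p_next, max(mu_next, mu_min)


def step_sizes(n):
    """Step sizes used in Section VI: a(n) = (1/n)^0.8 and b(n) = 10/n, for n >= 1."""
    return n ** -0.8, 10.0 / n
```

In a slot, one would call should_stop after each probe, probing the stochastically largest un-probed channel (Theorem 6) until it returns True, then transmit with waterfill on every probed channel and feed the power actually used into update as gamma.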
The algorithm is exactly the same as Algorithm 1, except that we use the strategy $\nu_n \in \arg\max_\nu \tilde{G}(\sigma_\nu, \mu_n)$ in slot n.

VI. SIMULATION RESULTS

We now illustrate, using simulation, the throughput gains achieved by an optimal probing and power control strategy. The N channels are equivalent and experience Rayleigh fading, i.i.d. across slots. The results with heterogeneous channels follow similar trends and are omitted due to space constraints. We assume that β = 0.04. The optimal strategy $\pi^\star$ is compared to: (1) a genie-aided strategy, which assumes that the channel states are known at the beginning of each slot; (2) a fixed-power strategy $\pi_{fp}$, where an average power $P_0$ is used in each slot (in the case of multi-channel transmissions, the power is evenly spread among probed channels), and where an optimal probing strategy, given this fixed power allocation, is used as determined in [17], [6]; (3) strategies $\pi_1$ and $\pi_N$, where one or all channels are probed, and where the optimal power allocation, given this probing strategy, is used. We compute $\pi^\star$ using learning Algorithm 1 (or its equivalent in multi-channel transmission scenarios), with parameters $a(n) = (1/n)^{0.8}$ and $b(n) = 10/n$. We observe a convergence time¹ for this algorithm that lies between 2000 and 3000 slots with up to 25 channels.

Figures 2(a) and (b) present results when transmitting on only one channel is allowed. Figure 2(a) shows the throughput of various strategies as a function of N for a fixed SNR = 10 dB. Comparing the throughput achieved with the genie-aided strategy and the others allows us to quantify the price of information; e.g. for N channels, the loss in throughput due to lack of channel information is around 30%, but this loss grows as N increases: it should scale as log log(N) for large N, because when probing is required, the throughput remains bounded as N grows large. Figure 2(b) shows the throughput gain of $\pi^\star$ over other strategies as a function of average SNR.
Note that the throughput gain of $\pi^\star$ over $\pi_{fp}$ is negligible except for low SNR (the gain is 9% at -10 dB). The reason behind this is that in the high SNR regime log(1 + SNR) ≈ log(SNR). With this approximation, the optimal solution of $(P_\sigma)$ is $p(s) = P_0$ for every s ∈ S and σ(·), i.e., the constant power control is almost optimal. However, note that the gain is small even for moderate SNR values, e.g., the gain is 1% at 0 dB. Thus, when transmitting on only one channel is allowed, optimizing over the probing strategy is important, and optimizing over power control is not crucial.

In Figure 2(c), we give the throughput gain of $\tilde\pi^\star$ over other strategies in the case of multi-channel transmissions. It is interesting to see that the throughput gain is significantly higher than that observed in Figure 2(b). Note also that the gain over $\pi_{fp}$ is quite significant (at least 90% for various values of SNR). Thus, to achieve good performance, it is imperative to optimize both probing and power control strategies, which contrasts with the case of single-channel transmissions.

¹ By definition, the convergence time is the first time after which the achieved throughput remains within 5% of the maximum throughput.

Fig. 2. Throughput and throughput gains in the case of single-channel transmissions (a) and (b) and multi-channel transmissions (c). Panels: (a) throughput vs. number of channels N, avg. SNR = 10 dB (Genie-aided, Optimal, Fixed-power, Probe-all, Probe-one); (b) throughput gain (%) vs. avg. SNR (dB), N = 15 (Fixed-power, Probe-all, Probe-one); (c) throughput gain (%) vs. avg. SNR (dB), N = 15 (Fixed-power, Probe-all, Probe-one).

VII. RELATED WORK

The problem analyzed in this paper falls into the broad class of stochastic control problems [3], where an optimal exploration vs. exploitation trade-off has to be identified. However, as already noticed in [11], it does not correspond to any of the classical control problems, such as multi-armed bandits, stopping time, or optimal sampling problems. Indeed, in the various versions of the multi-armed bandit problem, sampling an arm (here a channel) is not allowed before exploiting it. Note that the authors of [15] propose a model for opportunistic spectrum access where, in each slot, the user chooses a channel and tries to transmit on it without acquiring its state. This model actually corresponds to the restless multi-armed bandit problem [18]. Our problem cannot be seen as a stopping time problem [7], because here, in addition to the decision to probe further or to stop and transmit, the user has to select which channel to probe next, or at which power to transmit. It would become a stopping time problem if the channels were statistically equivalent and if transmissions were made at a fixed power, e.g. as in [17]. Finally, our problem is not an optimal sampling problem, where one seeks the optimal order in which random variables should be sampled [14], since this kind of model does not allow for exploitation.

The design of optimal probing and channel selection strategies has been studied only recently [13], [10], [12], [11], [5], [6], but most often under the assumptions that (i) the channel states are identically distributed and (ii) power control is not taken into account. In [6], the authors manage to relax assumption (i), but to our knowledge, the present work is the first to consider joint probing and power control strategies.

VIII.
CONCLUSION

We have considered a case where a user can access many channels for data transmission, but needs to acquire CSI to use them effectively. Acquiring CSI consumes resources, thereby reducing the resources remaining for actual data transmission. In such systems, we have designed a probing and power control strategy that maximizes the throughput. The optimal strategy is computationally simple, but can be computed only through an iterative learning algorithm. We have shown that the iterative procedure converges to the optimal policy. Key insights obtained from the numerical experiments are: (a) when a user can transmit only on a single channel, the gain through power adaptation is limited, i.e., the constant power allocation with optimal probing strategy provides near-optimal performance; (b) when a user can transmit on multiple channels simultaneously, the throughput gain through intelligent power allocation is significant (more than 90%). Hence, it is of paramount importance to use joint probing and power control to optimally exploit the available resources. Note that cognitive radio is one of the most important examples of systems in which a user can simultaneously transmit on multiple channels after acquiring CSI.

REFERENCES

[1] E. Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC, 1999.
[2] P. Bender, P. Black, M. Grob, R. Padovani, N. Sindhushayana, A. Viterbi. CDMA/HDR: a bandwidth-efficient high-speed wireless data service for nomadic users. IEEE Commun. Mag., vol. 28, pp. 70-77, 2000.
[3] D. Bertsekas. Dynamic Programming and Optimal Control, 3rd edition. Athena Scientific, 2007.
[4] V. Borkar. Stochastic Approximation, a Dynamical Systems Viewpoint. Hindustan Book Agency (Cambridge University Press), 2008.
[5] N. Chang, M. Liu. Optimal channel probing and transmission scheduling for opportunistic spectrum access. In proc. of ACM MobiCom, 2007.
[6] P. Chaporkar, A. Proutiere. Optimal Joint Probing and Transmission Strategy for Maximizing Throughput in Wireless Systems. IEEE J. on Selected Areas in Commu., vol. 26, no. 18, pp. 1546-1556, Oct. 2008.
[7] Y.S. Chow, H. Robbins, D. Siegmund. Great expectations: the theory of optimal stopping. Houghton Mifflin Company, 1971.
[8] K. Etemad. Overview of Mobile WiMax technology and evolution. IEEE Comm. Magazine, pp. 31-40, Oct. 2008.
[9] A. Goldsmith, P. Varaiya. Capacity of Fading Channels with Channel side information. IEEE Trans. Inform. Theory, vol. 43, pp. 1986-1992, Nov. 1997.
[10] S. Guha, K. Munagala, S. Sarkar. Jointly optimal transmission and probing strategies for multichannel wireless systems. In proc. of CISS, 2006.
[11] S. Guha, K. Munagala, S. Sarkar. Approximation Schemes for Information Acquisition and Exploitation in Multichannel Wireless Networks. Proc. of Allerton Conf. on Commu., Control and Computing, 2006.
[12] S. Guha, K. Munagala, S. Sarkar. Optimizing Transmission Rate in Wireless Channels using Adaptive Probes. Poster paper in ACM Sigmetrics/Performance Conference, 2006.
[13] Z. Ji, Y. Yang, J. Zhou, M. Takai, R. Bagrodia. Exploiting medium access diversity in rate adaptive wireless LANs. In proc. of ACM Mobicom, 2004.
[14] M. Kodialam. The throughput of sequential testing. Lecture Notes in Compu. Sci., 2081, pp. 280-292, 2001.
[15] L. Lai, H. El Gamal, H. Jiang, H. V. Poor. Cognitive Medium Access: Exploration, Exploitation and Competition. Submitted to IEEE ToN, Oct. 2007.
[16] H. Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 55, pp. 527-535, 1952.
[17] A. Sabharwal, A. Khoshnevis, E. Knightly. Opportunistic spectral usage: Bounds and multi-band CSMA/CA protocol. ACM/IEEE Trans. on Networking, vol. 15-3, 2007.
[18] P. Whittle. Restless bandits: Activity allocation in a changing world. In: A celebration of Applied Probability, J. Gani (Ed), J. Appl. Probab. Spec., 25, pp. 287-298, 1988.