On the Minimax Complexity of Pricing in a Changing Environment

Omar Besbes∗, Columbia University
Assaf Zeevi†, Columbia University

First submitted: June 22, 2008 / Revised: April 24, 2009, September 2, 2009

Abstract

We consider a pricing problem in an environment where the customers’ willingness-to-pay (WtP) distribution may change at some point over the selling horizon. Customers arrive sequentially and make purchase decisions based on a quoted price and their private reservation price. The seller knows the WtP distribution pre- and post-change, but does not know the time at which this change occurs. The performance of a pricing policy is measured in terms of regret: the loss in revenues relative to an oracle that knows the time of change prior to the start of the selling season. We derive lower bounds on the worst-case regret and develop pricing strategies that achieve the order of these bounds, thus establishing the complexity of the pricing problem. Our results shed light on the role of price experimentation and its necessity for optimal detection of changes in market response / WtP. Our formulation allows for essentially arbitrary consumer WtP distributions and purchase request patterns.

Keywords: pricing, non-stationary demand, estimation, detection, change-point, price experimentation

1 Introduction

1.1 Overview of the problem and main objectives

Overview of the problem. A product is to be sold over a finite selling season with prospective buyers arriving sequentially over time, each endowed with a reservation price modeled as an independent draw from a common willingness-to-pay (WtP) distribution. The product is purchased if the buyer’s reservation price exceeds the posted price; otherwise the buyer leaves without making a purchase. The objective of the decision maker (a monopolist seller) is to adjust prices over time so as to maximize expected cumulative revenues.
∗ Graduate School of Business, e-mail: [email protected]
† Graduate School of Business, e-mail: [email protected]

The bulk of the academic literature that focuses on such pricing problems, mostly found within the field of revenue management, assumes that the WtP distribution does not change over time; in other words, consumer preferences are assumed to be time-homogeneous. Those papers that do relax this assumption typically endow the decision maker with exact foreknowledge of the time-dependent characteristics of demand. (See the literature review in Section 1.4 for further details and references.) While such assumptions ensure mathematical tractability, they tend to ignore an important element of realism that exists in many real-world settings, where various market effects and intrinsic consumer time-preferences introduce shifts in the market response during the course of a selling season, typically in a manner that is not fully observed or predictable.

The purpose of this paper is to study a family of stylized pricing problems that allows the WtP distribution (and hence market response) to change over the course of a selling season, at a time that is unknown to the decision maker. The primary objective is to quantify the complexity of such problems and establish fundamental limits on what can and cannot be achieved by any pricing policy. In particular, our goal is to shed light on two fundamental issues: i.) the pivotal role of dynamic pricing and price experimentation in monitoring market response; and ii.) the trade-off between accurate detection of market change (exploration) and revenue optimization (exploitation). In this paper we consider a simple instance of an uncertain change in the environment: a WtP distribution remains constant up until an unknown time, at which point it shifts to a new distribution from which buyers’ reservation prices are drawn up until the end of the selling horizon.
The decision maker needs to devise pricing policies that maximize revenues over the relevant time horizon, by properly adapting to said change, whose effects can only be observed indirectly via customer purchase decisions. In order to “zoom in” on the impact of the unknown change in market response, we assume that the two WtP distributions, pre- and post-change, are revealed to the decision maker at the start of the selling horizon (but are otherwise quite arbitrary in structure and, in particular, need not belong to any parametric family). Endowing the decision maker with such information eliminates complexities that stem from learning the underlying distributions. Moreover, this formulation represents the minimal departure from antecedent literature, in which both the WtP distributions and their time dependence are assumed to be known a priori. Ultimately, by furthering understanding of the issues at play here, this paper strives to provide a foundation for tackling the more complicated problem of jointly learning the underlying distributions and detecting changes in their structure. We come back to discuss this point in Section 6.

For the purpose of this paper we will ignore inventory considerations. From an analytical perspective, much in the spirit of the remarks made in the previous paragraph, this allows us to crisply identify the implications of a change in the WtP distributions, and to characterize the complexity of this problem without masking it by other effects. From a more practical standpoint, we note that such settings arise frequently in many “real world” pricing and revenue optimization problems, with notable examples in financial services involving secured or unsecured loans, mortgages, credit offerings, etc. Below is a simple illustrative example of such a setting, including some empirical observations that support our modeling paradigm and main focus.

An illustrative example.
An on-line auto-lender operating in a direct-to-consumer sales channel quotes rates to interested customers given information they supply. Based on the characteristics of the customers, a FICO score (a measure of creditworthiness) is computed, and together with the requested loan type serves as an input to an optimization problem. The output of this problem is an offered annual percentage rate (APR), which serves as the decision variable. For all practical purposes, the lender is not faced with constraints on the amount of funds it can allocate towards these loans. The data set contains all instances of incoming customers who were offered a loan during a period ranging from December 2003 to December 2004, as well as their ultimate decision (accept or reject); see Besbes et al. (2008) for a detailed description of the data and sales process. Customers are segmented into tiers, and for each such tier the acceptance probability is modeled by exp{θ_1 + θ_2 x}/(1 + exp{θ_1 + θ_2 x}), where x is the rate offered and θ_1, θ_2 are two parameters to be estimated. This is the familiar logit family of response functions. We focus on a given segment of customer characteristics and estimate the logit model as a function of the APR offered by the firm. Considering two consecutive periods of 6 months, the parameters obtained were as follows: a) period 1, running from December 2003 to May 2004: θ_1 = 0.89, θ_2 = −0.30; b) period 2, running from June 2004 to December 2004: θ_1 = 2.71, θ_2 = −0.58. The two logit curves corresponding to the above parameters are depicted in Figure 1. We observe that the two curves differ significantly, but no clear connection to macroeconomic variables (e.g., cost-of-funds or inter-bank borrowing rates), or to changes in the market landscape (new product offerings, etc.), could be established or was available.
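The two estimated response curves can be reproduced directly from the reported parameters. A minimal sketch (plain Python; the parameter values are those quoted above, everything else is illustrative):

```python
import math

def acceptance_prob(theta1: float, theta2: float, apr: float) -> float:
    """Logit acceptance probability exp{theta1 + theta2*x} / (1 + exp{theta1 + theta2*x})."""
    z = theta1 + theta2 * apr
    return math.exp(z) / (1.0 + math.exp(z))

# Estimated parameters for the two consecutive 6-month periods (from the text).
period1 = (0.89, -0.30)   # December 2003 - May 2004
period2 = (2.71, -0.58)   # June 2004 - December 2004

for apr in (4.0, 5.5, 7.0):
    p1 = acceptance_prob(*period1, apr)
    p2 = acceptance_prob(*period2, apr)
    print(f"APR {apr:.1f}: period 1 -> {p1:.3f}, period 2 -> {p2:.3f}")
```

Evaluating over the APR range of Figure 1 shows the two curves crossing near an APR of 6.5, with the period-2 curve responding more steeply to rate changes.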
In this setting, one of the major challenges for the lender is to detect and adjust its APR to “match” changes in the demand environment in real time, basing such decisions only on data she or he was able to collect up until that point.

[Figure 1: Estimated logit models as a function of the offered annual percentage rate (APR): period 1 corresponds to December 2003 to May 2004; and period 2 corresponds to June 2004 to December 2004.]

1.2 Summary of the main results

The performance of a pricing policy will be measured relative to the best achievable performance corresponding to a (clairvoyant) oracle that knows the value of the change-point, the time at which the WtP distribution changes, at the start of the selling season. The difference in revenues between the latter and the former defines the regret; the smaller the regret, the better the performance of the policy. Unlike the oracle, the seller is restricted to non-anticipating policies, i.e., policies in which prices chosen at any given point in time are only allowed to depend on past prices and observed purchase decisions. Our objective is to characterize the minimax regret, i.e., the minimal worst-case regret (where “worst” is relative to all possible locations of the time of change). This criterion ensures that a policy is robust and exhibits “good” performance irrespective of the timing of the change in market response.

Below we list the main analytical results of the paper, and subsequently (Section 1.3) interpret them and discuss some of the qualitative insights that emerge. For this purpose, let us denote by N the total number of consumers requesting a price quote over the selling horizon; it follows that the best achievable performance generates revenues of order N.

i.) We prove that the worst-case regret of any admissible pricing policy is at least of order N^{1/2}; see the lower bound in Theorem 1.
That is, any policy must incur losses of order N^{1/2} relative to the revenues generated by the oracle, which are of order N.

ii.) We propose a pricing strategy that actively monitors market response via price experimentation and observation of purchase decisions; see Algorithm 1. The policy is shown to achieve the lower bound described above (up to a logarithmic term), and hence is essentially minimax optimal.

iii.) We elucidate the tension between the contradicting objectives of accurate detection of a change and revenue maximization; see Propositions 1 and 2. It is effectively seen that any “good” policy must price-experiment in a significant manner in order to balance this trade-off.

iv.) We highlight an intuitive structural property of the pre- and post-change WtP distributions under which the minimax complexity is of a significantly smaller order: log N. This follows from upper and lower bounds established in Theorems 3 and 4, respectively. In this setting, optimal policies can be found within the more restricted class of “passive” pricing algorithms that do not price-experiment (see Algorithm 2).

At a higher level, a further contribution of the paper is in bringing together two strands of literature; in particular, it ports tools from sequential analysis to an operations research setting, where change-point detection is executed jointly with system control (pricing); see further discussion in Section 1.4.

1.3 Qualitative insights and significance of the main results

1. The value of information. Our results establish that it is possible to design non-anticipating pricing policies whose expected revenue performance is very close to that of clairvoyant ones.
Specifically, an oracle with advance knowledge of the change-point only gains additional revenue of the order of the square root of the total revenues accumulated by our proposed policies (and under further structural restrictions, the value of this information translates into a mere logarithmic-order difference in revenues). If one wishes to view this on a normalized scale, these results establish that the average revenue extracted per customer converges to the oracle performance as the number of customers requesting a quote grows, and hence the value of prior information on the change in WtP diminishes.

2. Price experimentation and the value of dynamic pricing. The pricing policies proposed in this paper are constructed to balance two contradicting objectives: continuous price experimentation increases the ability to detect sudden changes in the market, yet simultaneously causes a deterioration in the instantaneous revenue rate. Resolving this tension yields the precise frequency and extent of experimentation that guarantee good performance. In particular, our results establish that it suffices to price-experiment on roughly the square root of the total number of customers arriving over the time horizon. Unlike traditional settings, here dynamic pricing is not driven by dynamic programming considerations but rather by the need to balance revenue losses resulting from experimentation against potential gains from accurate detection.

3. Robustness. It is worthwhile noting that our proposed policies rely only on observed purchase decisions and do not build on any specific assumptions on the time of change in the WtP distribution. For example, no prior distribution over the change-point, or dependence on other exogenous variables, is assumed.
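To make the experimentation budget concrete: out of N customers, roughly √N are quoted an exploration price. A minimal scheduling sketch (a hypothetical helper for illustration, not the paper's Algorithm 1), assuming evenly spaced experiments:

```python
import math

def experimentation_slots(n_customers: int) -> list:
    """Return ~sqrt(N) evenly spaced (1-based) customer indices at which
    an exploration price, rather than the incumbent optimal price, is quoted."""
    m = max(1, math.isqrt(n_customers))     # ~sqrt(N) experiments in total
    spacing = max(1, n_customers // m)      # ~sqrt(N) customers between experiments
    return list(range(1, n_customers + 1, spacing))[:m]

slots = experimentation_slots(10_000)
print(len(slots), slots[:3])   # 100 slots: customers 1, 101, 201, ...
```

Every non-slot customer would be quoted the currently optimal price; the ~√N slots supply the purchase observations used for detecting a change.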
A choice of prior in the current non-stationary setting would be hard to justify, and while it might be recognized that dependence on exogenous variables exists in various settings, one often still faces significant uncertainty with regard to the length of time between a change in exogenous variables and a change in customers’ WtP. A feature of the proposed approach is that it relies on minimal assumptions.

The remainder of the paper. The next section concludes the introduction with a review of related work. Section 2 formulates the problem. Section 3 provides a fundamental limit on the performance of any pricing policy and analyzes an active pricing scheme with a performance that achieves this limit. Section 4 characterizes the best achievable performance under a restricted class of market conditions. Section 5 presents a set of numerical illustrations and qualitative insights, and Section 6 discusses extensions to the present work. Appendices A and B contain the proofs of the main results, while Appendix C details the proofs of auxiliary results.

1.4 Literature review

Our work contributes to, and is related to, various streams of research.

Dynamic pricing. The presence of changes in the demand environment naturally drives dynamic pricing. This is illustrated in Gallego and van Ryzin (1997), where time-varying demand models and corresponding pricing policies are analyzed. It is important to note that there the temporal evolution of the demand model is known in advance. In the absence of model uncertainty, dynamic pricing is typically driven by inventory and perishability considerations (see, e.g., Talluri and van Ryzin (2005, Section 5.2) and Phillips (2005) for recent overviews). Lobo (2007) and Besbes and Zeevi (2007) help shed light on how dynamic pricing is used to uncover an unknown (yet static) WtP distribution. A related intermediate setting is found in Levin et al.
(2008), who study online learning in the presence of time- (and inventory-) dependent demand, where the unknown parameters are static. The present paper illustrates the fundamental role of dynamic pricing in monitoring and detecting potential changes in WtP distributions; to the best of our knowledge, this is the first study that establishes such a sharp characterization of the complexity of a non-stationary pricing problem.

Experimentation in non-stationary and uncertain environments. A related study is that of Keller and Rady (1999), where a continuous-time, infinite-horizon quantity-setting problem is analyzed in a Bayesian framework. A state of the world that characterizes parameters of the price function evolves according to a continuous-time Markov chain, and the firm maximizes profits. The authors identify two regimes that depend on the problem parameters and that are characterized by extreme or moderate experimentation levels. The main difference relative to our study lies in the dynamic programming logic adopted there, which is only possible due to the Markovian assumption on the state-of-the-world process. In contrast, our study does not rely on any assumption regarding the change-point and as a result requires a different methodology based on information-theoretic arguments.

Change-point detection. General ideas date back to the early work of Shewhart (1931) in the context of product quality control, with more refined procedures developed by Page (1954), Shiryayev (1963) and Roberts (1966). The typical formulation is one of minimizing the expected time between the change itself and the detection of the change (a.k.a. the detection delay), subject to a constraint on the so-called false alarm rate; see Shiryayev (1978), Siegmund (1985) and the recent overview of Lai (2001).
A particular instance of this is the so-called “standard” Poisson disorder problem, in which the objective is to detect a change in the rate of a Poisson process, where the time of the change has an exponential distribution; see, e.g., Bayraktar et al. (2005) and the references therein. While our work is related to traditional change-point problems, there are two important distinguishing features stemming from our revenue management setting. First, performance in our problem is naturally defined in terms of revenues, and “good” detection ability does not by itself imply improved performance; in fact, here one trades off instantaneous performance against detection ability. Second, most work in the literature takes the observations to be exogenous, and these are used to detect the time of change. In the current setting, the observations associated with the decisions of buyers to accept or decline a purchase are endogenous, as they depend on the quoted price, which is a decision variable. This endows our problem with a “closed-loop” nature that is absent in other studies. Finally, it is also worth noting that our analysis is, by and large, “distribution free,” insofar as very little is assumed on the arrival stream of potential buyers and the time of change.

Minimax and adversarial formulations. Problems associated with performance optimization in uncertain and non-stationary environments have often been considered in the economics and computer science literature, dating back to the early work of Hannan (1957); see Cesa-Bianchi and Lugosi (2006) for a recent monograph on the subject. The prevalent formulation in this stream of literature has been to allow nature to act in an adversarial manner at each point in time, while in our setting the adversary’s actions are restricted to the start of the time horizon. While this limits the power of nature, it is also less conservative and guides the design of potentially more practical policies.
For example, in this setting the optimal oracle policy is allowed to be dynamic, as opposed to the more common benchmark of static oracle policies considered in the aforementioned stream of literature. This also implies that the performance of the oracle in our adversarial formulation, which serves to benchmark our proposed policies, can be significantly better than the traditional static benchmark.

2 Problem Formulation

The model. We consider a revenue management problem in which a firm (the “decision-maker”) sells a single product over a planning or sales horizon during which N buyers (“customers”) arrive sequentially and request a price quote for the product. Each customer is assumed to have a willingness-to-pay (WtP) or reservation price for the product; we will denote by V_i the WtP of the i-th customer. She or he purchases the product if and only if V_i exceeds or equals the price quoted by the decision-maker, p_i. We assume that V_i is a random variable with right-continuous cumulative distribution function F(·; i), and that the set of feasible prices is [p̲, p̄], where 0 < p̲ < p̄ < ∞. For some τ ∈ {1, ..., N+1} and all i ≥ 1, put

F(p; i) = F_a(p) if i < τ,   F(p; i) = F_b(p) if i ≥ τ.

In other words, the response function, or probability of purchase as a function of price, is assumed to be identically equal to F̄_a(·) up until the change-point τ, and equal to F̄_b(·) subsequent to that.¹ Here, and throughout the paper, for any cumulative distribution F, F̄ will denote the complement of F, i.e., F̄(·) = 1 − F(·). To exclude trivial solutions, we assume that the response functions differ at some feasible price. Fixing some δ_0 ∈ (0, 1), we let ℱ denote the class of pairs of distribution functions (F_a(·), F_b(·)) that satisfy

max_{p ∈ [p̲, p̄]} |F̄_b(p) − F̄_a(p)| ≥ δ_0.   (1)

Admissible pricing policies. Let (p_i : 1 ≤ i ≤ N) denote the price process, which is assumed to take values in [p̲, p̄].
Let Y_i = 1{V_i ≥ p_i} denote the sales outcome associated with the i-th customer: Y_i = 1 indicates that customer i purchased the product, while Y_i = 0 indicates that she or he opted not to purchase it. Let {H_i, i ≥ 0} denote the filtration, or history, associated with the process of prices and purchases, with H_0 = ∅ and H_i = σ((p_j, Y_j), 1 ≤ j ≤ i). A pricing policy is said to be non-anticipating if it is adapted to the filtration {H_i, i ≥ 0}, which means that the price quoted to the (i+1)-st customer, p_{i+1}, is H_i-measurable (i.e., determined by H_i). We will restrict attention to the set of non-anticipating policies, denoted by 𝒫, and for any policy π ∈ 𝒫, we denote the price offered to the (i+1)-st customer, p_{i+1}, by ψ_{i+1}(H_i); ψ will be referred to as the price mapping associated with policy π. For any τ ∈ {1, ..., N+1} and any policy π ∈ 𝒫, we will use P^π_τ and E^π_τ to denote the probabilities of events and expectations of random variables, respectively, when a change in response function occurs at τ and the pricing policy π is used.

Information structure and the decision-maker’s objective. We assume that the response functions before and after the change, F̄_a(·) and F̄_b(·), respectively, are known to the decision-maker; however, she or he does not know the point of the change, τ. In addition, the decision-maker need not know the number of customers N that arrive over the horizon of interest. The only information available is that N ≥ 2 and that for some known N_0 ≥ 2 and 0 < ν̲ < ν̄ < ∞,

ν̲ N_0 ≤ N ≤ ν̄ N_0.   (2)

Roughly speaking, this condition states that the decision-maker knows the order of magnitude of the number of potential customers arriving during the planning or sales horizon of interest, given by the lower and upper bounds above. Note that no probabilistic assumptions are made with regard to the arrival process.

¹ The possibility of having continuous changes is discussed in Section 6.
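The primitives just introduced are straightforward to simulate. A minimal sketch (the response functions and prices below are illustrative placeholders, not part of the model):

```python
import random

def simulate_sales(prices, tau, Fbar_a, Fbar_b, seed=0):
    """Draw sales outcomes Y_i = 1{V_i >= p_i}: customer i (1-based) buys
    with probability Fbar_a(p_i) if i < tau and Fbar_b(p_i) if i >= tau."""
    rng = random.Random(seed)
    outcomes = []
    for i, p in enumerate(prices, start=1):
        buy_prob = Fbar_a(p) if i < tau else Fbar_b(p)
        outcomes.append(1 if rng.random() < buy_prob else 0)
    return outcomes

# Illustrative truncated-linear response functions (hypothetical).
Fbar_a = lambda p: max(0.0, 1.0 - 0.2 * p)
Fbar_b = lambda p: max(0.0, 0.75 - 0.1 * p)

# A constant price quoted to 100 customers, with a change after customer 50.
Y = simulate_sales([2.0] * 100, tau=51, Fbar_a=Fbar_a, Fbar_b=Fbar_b)
print(sum(Y), "purchases out of", len(Y))
```

A non-anticipating policy would replace the fixed price list by prices computed from the observed history (p_j, Y_j), j ≤ i, i.e., measurably with respect to H_i.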
For p ∈ [p̲, p̄] and ℓ = a, b, let r_ℓ(p) = p F̄_ℓ(p) denote the revenue functions. Put p*_a ∈ arg max{r_a(p) : p ∈ [p̲, p̄]} to be a maximizer of the revenue function before the change, and similarly let p*_b ∈ arg max{r_b(p) : p ∈ [p̲, p̄]} denote a maximizer of the revenue function after the change. If τ were known to the decision-maker at the start of the selling season, the optimal policy would be to quote p*_a to customers 1 through τ−1 and p*_b to customers τ through N. The corresponding performance, denoted J*(F_a, F_b, τ), will be referred to as the “oracle” performance, as one would need access to the value of τ to achieve it:

J*(F_a, F_b, τ) = Σ_{1 ≤ i < τ} r_a(p*_a) + Σ_{τ ≤ i ≤ N} r_b(p*_b).   (3)

In contrast, the expected cumulative revenues over the N customers for any admissible pricing policy π (and its associated price mapping ψ) are given by:

J^π(F_a, F_b, τ) := E^π_τ [ Σ_{1 ≤ i ≤ N} ψ_i(H_{i−1}) 1{V_i ≥ ψ_i(H_{i−1})} ].

Clearly, for all admissible policies π ∈ 𝒫, J^π(F_a, F_b, τ) ≤ J*(F_a, F_b, τ), and the difference between the two quantifies the degradation in revenues due to lack of prior knowledge of the time of change τ. Ideally, one would like to design policies that make this gap, also known as the regret, as small as possible. To ensure that policies exhibit this behavior uniformly over a range of change-point scenarios, and to preclude simple-minded policies (such as “guessing” the value of τ at time 0), we adopt the following formulation. Nature first reveals the response functions to the decision-maker but keeps the time of change hidden; the decision-maker then constructs and announces an admissible policy, followed by nature selecting the change-point τ to maximize the difference between J*(F_a, F_b, τ) and J^π(F_a, F_b, τ). In particular, we define the minimax regret as follows²:

R*(N, ℱ) := sup_{(F_a,F_b) ∈ ℱ} inf_{π ∈ 𝒫} sup_{1 ≤ τ ≤ N+1} { J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) }.   (4)

² 1) Note that here, to avoid cluttering the notation, we do not write explicitly that N can be selected in an adversarial manner by nature, but the results developed hold for the worst-case value of N satisfying condition (2). 2) While we focus in this paper on the absolute regret, it is worthwhile noting that the arguments developed could be used to analyze the minimax relative regret (i.e., the regret normalized by the oracle performance).

Informally, the decision-maker attempts to select a policy that tracks the oracle optimal policy closely, so that J^π(F_a, F_b, τ) is “close” to J*(F_a, F_b, τ). When deciding where to price at a given point in time, the decision-maker needs to consider both the impact on instantaneous performance and the impact on the information gathered about the change point τ. This trade-off will be formalized in the coming sections and will guide the design of pricing policies.

Computing (4) is essentially an intractable goal due to the broad class of response functions under consideration and the absence of probabilistic structure on the time of change τ. In what follows, we will be interested in gaining insight into the magnitude of R*(N, ℱ), especially its dependence on the total number of customers N, as well as in designing policies with performance that “comes close” to R*(N, ℱ). To that end, we introduce the following definition.

Definition 1 (minimax optimality) A policy π̂ ∈ 𝒫 is said to be optimal with respect to class ℱ if for each (F_a, F_b) ∈ ℱ and all N ≥ 2,

sup_{1 ≤ τ ≤ N+1} { J*(F_a, F_b, τ) − J^π̂(F_a, F_b, τ) } ≤ C R*(N, ℱ),   (5)

for some constant C ≥ 1 independent of F_a and F_b. A policy π̃ ∈ 𝒫 is said to be near-optimal with respect to class ℱ if for any ε > 0, for each (F_a, F_b) ∈ ℱ and all N ≥ 2,

sup_{1 ≤ τ ≤ N+1} { J*(F_a, F_b, τ) − J^π̃(F_a, F_b, τ) } ≤ C (R*(N, ℱ))^{1+ε},   (6)

for some constant C > 0 independent of F_a and F_b.
It is worth noting that the left-hand side in (5) and (6) is lower bounded by R*(N, ℱ) for some configuration of response functions (F_a, F_b) ∈ ℱ. Note also that, in general, J*(F_a, F_b, τ) is of order N, while R*(N, ℱ) is typically of a lower order of magnitude. Hence, an optimal policy is one that achieves the minimax regret up to a multiplicative constant, i.e., that achieves the optimal order of magnitude of the revenue loss J*(F_a, F_b, τ) − J^π(F_a, F_b, τ). In contrast, a near-optimal policy can have a worst-case performance that exceeds “slightly” the order of magnitude of the minimax regret, where this excess is encoded in (6).

Discussion of the model and problem formulation. Few modeling assumptions are made in the present work. First, our formulation allows nature to choose the number of customer arrivals in an adversarial manner, subject only to the constraint (2), and no probabilistic assumptions are made with regard to this arrival process. Second, the WtP distributions are only assumed to satisfy a separation of at least δ_0 at some price in [p̲, p̄]. Finally, no probabilistic assumptions are made with respect to the time of change. A possible alternative would have been to introduce a prior distribution on the time of change, F_τ, leading to a Bayesian approach. Assuming for simplicity that N is known a priori to the decision-maker, one would then focus on the following objective:

R*_B(N, ℱ, F_τ) = sup_{(F_a,F_b) ∈ ℱ} inf_{π ∈ 𝒫} E[ J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) ],   (7)

where the expectation is with respect to the prior on τ. From (4) and (7), it follows that for any prior F_τ, R*_B(N, ℱ, F_τ) ≤ R*(N, ℱ). In §6, we argue that the Bayesian and minimax settings are essentially equivalent by showing that R*(N, ℱ) ≈ R*_B(N, ℱ, F_τ) for certain priors.

3 Optimal Price Experimentation and Detection

3.1 A lower bound on best achievable performance

Theorem 1 For some constant C > 0,

R*(N, ℱ) ≥ C N^{1/2}   (8)

for all N ≥ 2.
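To see why the N^{1/2} bound is informative, it helps to compare it with the regret of a naive policy that ignores the change altogether. A minimal sketch, using hypothetical linear revenue functions r_a(p) = p(1 − 0.2p) and r_b(p) = p(0.75 − 0.1p) (maximized at p_a* = 2.5 and p_b* = 3.75, respectively):

```python
def oracle_revenue(n, tau, r_a_star, r_b_star):
    """J*(F_a, F_b, tau) from (3): (tau-1) pre-change plus (N-tau+1) post-change revenues."""
    return (tau - 1) * r_a_star + (n - tau + 1) * r_b_star

def static_revenue(n, tau, r_a_at_p, r_b_at_p):
    """Expected revenue when one fixed price is quoted to every customer."""
    return (tau - 1) * r_a_at_p + (n - tau + 1) * r_b_at_p

# Hypothetical example revenue functions (illustrative only).
r_a = lambda p: p * (1.0 - 0.2 * p)
r_b = lambda p: p * (0.75 - 0.1 * p)
p_a_star, p_b_star = 2.5, 3.75

for n in (100, 1_000, 10_000):
    tau = n // 2 + 1                      # change at mid-horizon
    regret = (oracle_revenue(n, tau, r_a(p_a_star), r_b(p_b_star))
              - static_revenue(n, tau, r_a(p_a_star), r_b(p_a_star)))
    print(f"N = {n:6d}: regret of always quoting p_a* = {regret:.2f}")
```

The static policy's regret grows linearly in N (here it equals (N − τ + 1)(r_b(p_b*) − r_b(p_a*))), an order of magnitude above the N^{1/2} lower bound; closing this gap is what the actively monitoring policy of Section 3.3 accomplishes.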
This result asserts that any admissible policy must incur a revenue loss of at least order N^{1/2} relative to the oracle pricing policy that knows the change-point τ. The value of the constant C is computed explicitly in the proof. The proof of Theorem 1, which is outlined in Section 3.2, provides important qualitative insights. In particular, it quantifies the need for price experimentation as a consequence of the following trade-off: the absence of price experimentation over a large number of customers could result in failed or late detection of a change in the market response; on the other hand, price experimentation leads to potential revenue losses as one “steps away” from the pre-change optimal price. The proof also reveals insights that suggest the “correct” frequency of price experiments, and this will be harnessed in Section 3.3 to develop a near-optimal policy.

3.2 Proof outline and intuition behind Theorem 1

Broad overview of the proof and key ideas. In broad strokes, the proof focuses on a subclass ℱ1 ⊆ ℱ and establishes a lower bound on R*(N, ℱ1), which yields a lower bound on R*(N, ℱ).³ It is organized around two cases, dictated by the information accumulated about the change. If a policy accumulates only limited information on the prospective change, one can establish a lower bound on the regret it incurs (see Proposition 1), which follows from this lack of information. Conversely, any policy that accumulates “ample” information about the change increases the likelihood of accurate detection of the change, but at the expense of significant revenue losses due to experimentation (see Proposition 2). These two cases together yield the lower bound announced in Theorem 1.
A critical aspect of the proof is the quantification of information, and the link that is established between the level of information accumulated and price experimentation. In particular, the argument gives rise to a notion of the “correct” order of information accumulation, which translates into a prescription for the frequency of price experiments (see also Theorem 2 in Section 3.3).

³ While it would be sufficient to exhibit a pair of response functions in ℱ for which the performance is bounded below, we establish here a lower bound on the performance for any pair in a broad class ℱ1, to highlight the market conditions that lead to a “worst-case” performance.

Preliminaries. Fix some μ ∈ (0, 1/2), δ_p > 0, α, γ > 0 and K > 0, and let ℱ1 consist of pairs of response functions that satisfy the following assumption (where δ_0 is fixed as in the definition of ℱ).

Assumption 1
i.) max_{p ∈ [p̲, p̄]} |F̄_a(p) − F̄_b(p)| ≥ δ_0.
ii.) F̄_a(p), F̄_b(p) ∈ (μ, 1 − μ) for all p ∈ [p̲, p̄], and F̄_a(·) and F̄_b(·) are Lipschitz with constant K.
iii.) F̄_a(p*_a) = F̄_b(p*_a).
iv.) |p*_a − p*_b| > δ_p.
v.) r_i(p*_i) − r_i(p) ≥ h(p − p*_i) for all p ∈ [p̲, p̄], for i = a, b, where h(x) = min{α|x|², γ} for all x ∈ ℝ.

From the above, it is clear that ℱ1 ⊂ ℱ. Important features of the class ℱ1 are the following: the response functions cross at the pre-change optimal decision p*_a [iii.)]; there is a revenue deterioration whenever one steps away from the current optimal decision [v.)]; and the optimal decisions p*_a and p*_b are distinct [iv.)]. For the latter, note that if p*_a and p*_b were to coincide, then one could achieve a regret of zero by simply applying p*_a throughout the horizon. The Lipschitz assumption [ii.)] is a technical condition that ensures that one cannot price close to p*_a and yet gather ample information about the change. Assumption 1 is clearly satisfied by a wide range of response functions.
(For example, it is easy to check that this assumption is satisfied for 𝐹¯𝑎(𝑝) = (1 − 0.2𝑝)⁺ and 𝐹¯𝑏(𝑝) = (0.75 − 0.1𝑝)⁺ when, e.g., [𝑝, 𝑝] = [0.1, 4], 𝛿0 = 0.2, 𝜇 = 0.1, 𝛿𝑝 = 0.5, 𝛼 = 0.1 and 𝛾 = 1.)

In order to gain some intuition on the main effects at play in characterizing ℛ∗(𝑁, ℱ1), we will focus on batches of customers of a given size 𝛥 ∈ {1, ..., 𝑁}. In particular, let us define

Ñ = ⌊𝑁/𝛥⌋ + 1,    𝑙𝑖 = 1 + (𝑖 − 1)𝛥 for 𝑖 = 1, ..., Ñ,    𝑙_{Ñ+1} = 𝑁.

The number of customer batches of size 𝛥 is given by (Ñ − 1) ≥ 1, and we let 𝑙𝑖 be the index of the first customer in batch 𝑖, where 𝑖 = 1, ..., Ñ − 1. Here and in the rest of the manuscript, for any real number 𝑥, ⌈𝑥⌉ will denote the smallest integer larger than or equal to 𝑥 and ⌊𝑥⌋ the largest integer smaller than or equal to 𝑥.

In what follows, for ℓ = 𝑎, 𝑏 and 𝑝 ∈ [𝑝, 𝑝], we let ℙℓ(1∣𝑝) = 1 − ℙℓ(0∣𝑝) = 𝐹¯ℓ(𝑝), and for 𝑗 = 1, ..., Ñ, we let ℙ𝜋_{𝑙𝑗} denote the probability distribution of observed customers' purchase decisions when policy 𝜋 ∈ 𝒫 is used and the change in response occurs at index 𝜏 = 𝑙𝑗. Let 𝒦(ℙ𝜋_{𝑙𝑗+1}, ℙ𝜋_{𝑙𝑗}) denote the Kullback-Leibler (KL) divergence (cf. Borovkov (1998)) between the two measures ℙ𝜋_{𝑙𝑗+1} and ℙ𝜋_{𝑙𝑗}, which is given by (see Lemma 1 in Appendix C)

𝒦(ℙ𝜋_{𝑙𝑗+1}, ℙ𝜋_{𝑙𝑗}) = ∑_{𝑖=𝑙𝑗}^{𝑙𝑗+1 − 1} 𝔼𝜋_{𝑙𝑗+1} [ log ( ℙ𝑎(𝑌𝑖 ∣ 𝜓𝑖(ℋ𝑖−1)) / ℙ𝑏(𝑌𝑖 ∣ 𝜓𝑖(ℋ𝑖−1)) ) ].   (9)

The KL-divergence is a measure of "distance" between two probability distributions.

Case 1: Strategies with "limited" information gathering. We first treat the case in which policies under-experiment.

Proposition 1 Suppose that (𝐹𝑎, 𝐹𝑏) belongs to ℱ1 and that 𝛥 ∈ {1, ..., 𝑁}. Then for any 𝛽 > 0 and any 𝜋 ∈ 𝒫 such that min_{1≤𝑗≤Ñ−1} 𝒦(ℙ𝜋_{𝑙𝑗+1}, ℙ𝜋_{𝑙𝑗}) ≤ 𝛽,

sup_{1≤𝜏≤𝑁+1} {𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)} ≥ 𝐶1 𝛥,   (10)

where 𝐶1 depends only on the parameters of the class ℱ1 and 𝛽.

Remark 1 (Intuition and proof sketch).
The result of Proposition 1 establishes a connection between the "distance" separating the probability distributions ℙ𝜋_{𝑙𝑗+1} and ℙ𝜋_{𝑙𝑗} and the achievable revenue performance: if the former is "small," then the worst-case regret is at least of the order of the batch size 𝛥. The intuition behind this is as follows. When the two probability distributions are "close," no policy is able to distinguish sufficiently well between a change occurring at time 𝑙𝑗 or at 𝑙𝑗+1. In particular, the probability of making an error when trying to distinguish between the two is strictly positive, and independent of 𝛥 and of the total number of customers. This implies that for customers in some batch of size 𝛥, it is impossible to determine whether their WtP distribution is 𝐹𝑎 or 𝐹𝑏. This, in turn, leads to the lower bound advertised above.

Remark 2 (Interpretation in terms of price experimentation). Consider two distributions (𝐹𝑎, 𝐹𝑏) belonging to ℱ1. By Assumption 1 iii.), they cross at 𝑝∗𝑎. Going back to the expression for the KL-divergence, we observe that whenever one prices at 𝑝∗𝑎, the associated term in (9) contributes zero to the sum. In other words, each term in the sum, which is non-negative, contributes only if there is a positive probability of pricing "away" from 𝑝∗𝑎 before a change occurs. As a result, the decision-maker faces the following dilemma: pricing at 𝑝∗𝑎 maximizes the instantaneous revenue rate if the change has not yet occurred, but pricing too often at or "close" to 𝑝∗𝑎 prevents the accumulation of sufficient information about the change. The worst-case regret of such a policy is then bounded below by 𝐶1𝛥, as highlighted in Proposition 1.

Case 2: Strategies with "ample" information gathering. The next result provides a lower bound on the performance of policies for which the KL-divergence exceeds a given 𝛽 for all batches.
This corresponds to policies that price at least once away from 𝑝∗𝑎 in every batch of size 𝛥 prior to the change.

Proposition 2 Suppose that (𝐹𝑎, 𝐹𝑏) belongs to ℱ1 and that 𝛥 ∈ {1, ..., 𝑁}. Then for any 𝛽 > 0 and any 𝜋 ∈ 𝒫 such that 𝒦(ℙ𝜋_{𝑙𝑗+1}, ℙ𝜋_{𝑙𝑗}) > 𝛽 for all 𝑗 = 1, ..., Ñ − 1,

sup_{1≤𝜏≤𝑁+1} {𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)} ≥ 𝐶2 (Ñ − 1),   (11)

where 𝐶2 depends only on the parameters of the class ℱ1 and 𝛽.

Remark 3 (Intuition and proof sketch). When the "distance" between the probability distributions ℙ𝜋_{𝑙𝑗+1} and ℙ𝜋_{𝑙𝑗} is bounded below, the policy in question price-experiments away from 𝑝∗𝑎 sufficiently often throughout the batches that precede the change. Assumption 1 v.) ensures that, prior to the change, deviating from 𝑝∗𝑎 results in a revenue deterioration. Now, if the change occurs toward the end of the horizon, at 𝜏 = 𝑙_{Ñ−1}, then the policy incurs experimentation losses throughout almost the entire horizon, and this leads to the lower bound of 𝐶2(Ñ − 1) on the worst-case regret.

Combining the two cases. Proposition 1 establishes a limit on the performance of policies that use moderate price experimentation away from 𝑝∗𝑎, and Proposition 2 establishes a limit on policies that experiment too often. For any admissible policy 𝜋 ∈ 𝒫, exactly one of the two cases considered in Propositions 1 and 2 occurs. Combining the two results, the worst-case regret over the class ℱ1 is bounded below as follows: ℛ∗(𝑁, ℱ1) ≥ min{𝐶1𝛥, 𝐶2(Ñ − 1)}. Noting that Ñ ≈ 𝑁/𝛥 and selecting 𝛥 = 𝑁^{1/2} leads to ℛ∗(𝑁, ℱ1) ≥ 𝐶′𝑁^{1/2} for some 𝐶′ > 0. Recalling the initial remark that ℛ∗(𝑁, ℱ) ≥ ℛ∗(𝑁, ℱ1), the result of Theorem 1 follows.
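The balancing act in the last step can be checked numerically: with placeholder constants 𝐶1 = 𝐶2 = 1, the two-case bound min{𝐶1𝛥, 𝐶2(𝑁/𝛥 − 1)} is maximized near 𝛥 = 𝑁^{1/2}. A small sketch (the constants and the search grid are illustrative, not from the paper):

```python
import math

def two_case_bound(N, delta, C1=1.0, C2=1.0):
    """min{C1*Δ, C2*(Ñ−1)} with Ñ−1 ≈ N/Δ − 1: the bound before optimizing Δ."""
    return min(C1 * delta, C2 * (N / delta - 1))

N = 10**6
# Scan batch sizes on a coarse grid; the maximizer sits near sqrt(N) = 1000,
# where the increasing term C1*Δ and the decreasing term C2*(N/Δ − 1) cross.
best = max(range(1, N, 500), key=lambda d: two_case_bound(N, d))
assert abs(best - math.sqrt(N)) <= 501
```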
3.3 The proposed policy and its performance

Let (𝐹𝑎, 𝐹𝑏) be an arbitrary pair of response functions in ℱ and let 𝑝0 be a price in [𝑝, 𝑝] such that ∣𝐹¯𝑎(𝑝0) − 𝐹¯𝑏(𝑝0)∣ ≥ 𝛿0 (such a price exists by condition (1)). We introduce below a pricing policy defined through three positive constants (𝑐𝑒, 𝑐𝑟, 𝜀) that serve as tuning parameters. The general structure runs as follows. Start by quoting 𝑝∗𝑎, but regularly quote 𝑝0 for a "small" number of customers, who constitute the price experimentation batch. The rationale here is to monitor for a change in market response. After having observed the responses of a batch of customers of size determined by 𝑐𝑟 and 𝑐𝑒, one uses a decision rule based on the observations of demand at 𝑝0 to declare the presence or absence of a change. The role of 𝜀 here is to allow some slack in the decision rule, due to noise in the observations (stemming from customer-specific WtP). When a change is declared, switch to 𝑝∗𝑏 until the end of the horizon.

Algorithm 1: 𝜋(𝑐𝑒, 𝑐𝑟, 𝜀)

Step 1. Joint Pricing and Market Monitoring:
  Initialize: 𝑑𝑒𝑡𝑒𝑐𝑡 = 0, 𝑖 = 1, 𝑗 = 1
  Set 𝑛𝑟 = ⌈𝑐𝑟 𝑁0^{1/2}⌉   [revenue extraction batch size]
  Set 𝑛𝑒 = ⌈𝑐𝑒 log 𝑁0⌉   [price experimentation batch size]
  While 𝑑𝑒𝑡𝑒𝑐𝑡 = 0 and 𝑗 ≤ 𝑁:
    (a) Pricing:
      𝑝𝑗 = 𝑝∗𝑎 for 𝑗 = 𝑖, ..., 𝑖 + 𝑛𝑟 − 1   [revenue extraction]
      𝑝𝑗 = 𝑝0 for 𝑗 = 𝑖 + 𝑛𝑟, ..., 𝑖 + 𝑛𝑟 + 𝑛𝑒 − 1   [price experimentation]
    (b) Detection test: Compute
      D̂ = (1/𝑛𝑒)[total sales from cust. 𝑖 + 𝑛𝑟 to cust. 𝑖 + 𝑛𝑟 + 𝑛𝑒 − 1] − 𝐹¯𝑎(𝑝0)
      [difference between empirical mean and true pre-change response at 𝑝0]
      If D̂ ∈ [−𝜀, 𝜀]: 𝑑𝑒𝑡𝑒𝑐𝑡 = 0   [difference "small" ⇒ no change detected]
      Else: 𝑑𝑒𝑡𝑒𝑐𝑡 = 1   [difference "large" ⇒ change detected]
      End
    𝑖 = 𝑖 + 𝑛𝑟 + 𝑛𝑒
  End

Step 2. Quote Adjustment: Set 𝑝𝑗 = 𝑝∗𝑏 for 𝑖 ≤ 𝑗 ≤ 𝑁   [quote post-change optimal price]

Intuition. In Algorithm 1, one focuses on batches of customers of size 𝑛𝑟 + 𝑛𝑒, of order 𝑁^{1/2}, which is guided by the lower bound in Theorem 1.
In any such batch, 𝑛𝑒 determines the number of times one quotes away from 𝑝∗𝑎 prior to the detection of a change, while 𝑛𝑟 characterizes the number of times one quotes at 𝑝∗𝑎, the optimal pre-change price. Here 𝑛𝑒 quantifies the degree of price experimentation that takes place until a change is detected. Note that increasing 𝑛𝑒 improves detection ability but may degrade performance when a change occurs toward the end of the horizon. After each group of 𝑛𝑟 + 𝑛𝑒 customers arrives, one compares the sample average demand, over the trailing window of size 𝑛𝑒, to the theoretical mean 𝐹¯𝑎(𝑝0) under the assumption that no change has occurred. In the absence of a change, this difference should be "small," while if the change has already occurred, the difference will be bounded away from zero (given that condition (1) implies that ∣𝐹¯𝑎(𝑝0) − 𝐹¯𝑏(𝑝0)∣ ≥ 𝛿0). The tuning parameter 𝜀 is used to distinguish between the two hypotheses. The choice of 𝜀 and 𝑐𝑒 allows one to control both types of error: the probability of declaring a change in the absence of one; and the probability of not detecting the change after more than 2(𝑛𝑟 + 𝑛𝑒) customers have arrived following the change. Let 𝑐𝑒, 𝑐𝑟 and 𝜀 be specified as follows:

0 < 𝜀 < 𝛿0,   𝑐𝑒 = max{𝜀⁻², (𝛿0 − 𝜀)⁻²},   and 𝑐𝑟 > 0.   (12)

The next result characterizes the performance of the proposed pricing scheme.

Theorem 2 Let 𝐹𝑎(⋅) and 𝐹𝑏(⋅) be such that condition (1) holds. Let 𝜋(𝑐𝑒, 𝑐𝑟, 𝜀) be defined by Algorithm 1 with 𝑐𝑒, 𝑐𝑟 and 𝜀 as specified in (12). Then for some finite constant 𝐶¯ > 0, the worst-case regret associated with 𝜋(𝑐𝑒, 𝑐𝑟, 𝜀) is bounded as follows:

sup_{1≤𝜏≤𝑁+1} {𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)} ≤ 𝐶¯ 𝑁^{1/2} log 𝑁,   (13)

for all 𝑁 ≥ 2, and hence

ℛ∗(𝑁, ℱ) ≤ 𝐶¯ 𝑁^{1/2} log 𝑁.   (14)

Discussion. The constant 𝐶¯ appearing above depends only on the parameters of the class ℱ and the arrival process bounds 𝜈 and 𝜈¯.
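A runnable sketch of Algorithm 1's control flow can make the batching structure concrete. This is not the paper's implementation: the customer model `wtp_draw` and all numeric arguments in the test scenario below are hypothetical.

```python
import math

def algorithm1(c_e, c_r, eps, N, N0, p_star_a, p_star_b, p0, Fbar_a_p0, wtp_draw):
    """Sketch of Algorithm 1: alternate revenue-extraction batches at p*_a with
    short experimentation batches at p0; declare a change when the empirical
    demand at p0 drifts from Fbar_a(p0) by more than eps."""
    n_r = math.ceil(c_r * math.sqrt(N0))   # revenue extraction batch size
    n_e = math.ceil(c_e * math.log(N0))    # price experimentation batch size
    prices, detect, j = [], False, 0
    while not detect and j < N:
        batch_sales = 0
        for k in range(n_r + n_e):         # (a) pricing within one batch
            if j >= N:
                break
            p = p_star_a if k < n_r else p0
            sale = 1 if wtp_draw(j) >= p else 0   # customer j's decision
            if k >= n_r:
                batch_sales += sale        # count only the experimentation sub-batch
            prices.append(p)
            j += 1
        D_hat = batch_sales / n_e - Fbar_a_p0     # (b) detection test
        detect = not (-eps <= D_hat <= eps)
    prices.extend([p_star_b] * (N - j))    # Step 2: post-change optimal price
    return prices
```

For instance, with a deterministic WtP that drops from 2.0 to 0.9 at customer 500, the policy quotes 𝑝∗𝑎 (with periodic visits to 𝑝0) until detection, and 𝑝∗𝑏 thereafter.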
We have thus established that the lower bound in Theorem 1 can be achieved up to a logarithmic term. Given that the scale of losses is of order 𝑁^{1/2}, the policy prescribed by Algorithm 1 is near-optimal. The proposed policy performs finely tuned price experiments, a crucial feature as highlighted in Section 3.2. One could refer to the pricing scheme as "active," given that it actively seeks to learn about a change in the response function through price experimentation. In contrast, a policy that would "wait" for a change while pricing at a fixed point would be referred to as "passive." Section 4 analyzes in more detail settings where passive schemes can perform well.

4 Reduced Complexity and Sufficiency of Passive Monitoring

This section studies an important special case of the problems discussed in the previous section, establishing a simple structural condition under which price experimentation is no longer necessary and the complexity of the pricing problem is significantly reduced. Fixing 𝛿0 ∈ (0, 1), let 𝒢 denote the class of pairs of response functions (𝐹𝑎, 𝐹𝑏) that satisfy

∣𝐹¯𝑏(𝑝∗𝑎) − 𝐹¯𝑎(𝑝∗𝑎)∣ ≥ 𝛿0.   (15)

This condition implies that the two response functions are separated at the pre-change optimal price 𝑝∗𝑎 by at least 𝛿0, which is more stringent than condition (1) that served as a premise for the analysis in Section 3. We will see shortly that under the above condition, one can accumulate information about the change while quoting the pre-change optimal price 𝑝∗𝑎. This in turn leads to fundamentally simpler pricing strategies that achieve significantly better performance than in the general case. Given that 𝒢 ⊆ ℱ, and hence that nature is more restricted here, it follows that ℛ∗(𝑁, 𝒢) ≤ ℛ∗(𝑁, ℱ). The main questions we focus on are by how much ℛ∗(𝑁, 𝒢) differs from ℛ∗(𝑁, ℱ), and what the optimal or near-optimal policies are in this setting.
4.1 The proposed passive pricing policy

We introduce below a pricing policy defined through two positive constants (𝑐, 𝜀) that serve as tuning parameters. The main idea is as follows. First, 𝑝∗𝑎 is quoted and the average demand at 𝑝∗𝑎 is monitored over a trailing window whose size 𝑛 is determined by 𝑐 and the proxy for the number of arrivals, 𝑁0. Then, using 𝜀 as a threshold, one declares whether or not a change has occurred. Following a positive detection, one adjusts the quotes to the price 𝑝∗𝑏, the optimal post-change price, and holds it until the end of the season. In contrast with Algorithm 1, price experimentation is not used and assessments of the occurrence of a change take place much more often. This is summarized in the following pseudo-code.

Algorithm 2: 𝜋(𝑐, 𝜀)

Step 1. Pricing and Market Monitoring:
  Initialize: 𝑑𝑒𝑡𝑒𝑐𝑡 = 0, 𝑖 = 1, 𝑗 = 1
  Set 𝑛 = ⌈𝑐 log 𝑁0⌉   [trailing window size]
  While 𝑑𝑒𝑡𝑒𝑐𝑡 = 0 and 𝑗 ≤ 𝑁:
    (a) Pricing:
      𝑝𝑗 = 𝑝∗𝑎 for 𝑗 = 𝑖, ..., 𝑖 + 𝑛 − 1
    (b) Detection test: Compute
      D̂ = (1/𝑛)[total sales from cust. 𝑖 to cust. 𝑖 + 𝑛 − 1] − 𝐹¯𝑎(𝑝∗𝑎)
      [difference between empirical mean and true pre-change response at 𝑝∗𝑎]
      If D̂ ∈ [−𝜀, 𝜀]: 𝑑𝑒𝑡𝑒𝑐𝑡 = 0   [difference "small" ⇒ no change detected]
      Else: 𝑑𝑒𝑡𝑒𝑐𝑡 = 1   [difference "large" ⇒ change detected]
      End
    𝑖 = 𝑖 + 𝑛 + 1
  End

Step 2. Quote Adjustment: Set 𝑝𝑗 = 𝑝∗𝑏 for 𝑖 ≤ 𝑗 ≤ 𝑁   [quote post-change optimal price]

A close inspection of Algorithm 2 reveals that it is a special case of Algorithm 1. Indeed, taking 𝑐𝑟 = 0 and 𝑝0 = 𝑝∗𝑎 in the latter yields the pricing scheme of Algorithm 2, in which price experimentation and revenue extraction are conducted jointly. After each batch of customer arrivals, the sample average of demand over the trailing window of size 𝑛 is compared to 𝐹¯𝑎(𝑝∗𝑎). In the absence of a change, this difference should be "small." If the change has occurred, the difference will be bounded away from zero, given that 𝛿0 > 0.
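A runnable sketch of Algorithm 2's control flow follows; it is a hypothetical illustration, with the customer model `wtp_draw` and the numeric arguments in the test scenario below invented for the example rather than taken from the paper.

```python
import math

def algorithm2(c, eps, N, N0, p_star_a, p_star_b, Fbar_a_pa, wtp_draw):
    """Sketch of Algorithm 2: price at p*_a and monitor demand over a trailing
    window of n = ceil(c*log(N0)) customers; switch to p*_b upon detection."""
    n = math.ceil(c * math.log(N0))        # trailing window size
    prices, detect, j = [], False, 0
    while not detect and j < N:
        window_sales = 0
        for _ in range(n):                 # (a) pricing: quote p*_a
            if j >= N:
                break
            window_sales += 1 if wtp_draw(j) >= p_star_a else 0
            prices.append(p_star_a)
            j += 1
        D_hat = window_sales / n - Fbar_a_pa      # (b) detection test
        detect = not (-eps <= D_hat <= eps)
    prices.extend([p_star_b] * (N - j))    # Step 2: post-change optimal price
    return prices
```

The only pricing decision is when to switch; no customer is ever quoted a price other than 𝑝∗𝑎 or 𝑝∗𝑏, which is what makes the scheme "passive."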
The tuning parameter 𝜀 is used to distinguish between the two hypotheses of the change having occurred or not.

4.2 Performance of the proposed policy

The next result characterizes the performance of Algorithm 2.

Theorem 3 Let 𝐹𝑎(⋅) and 𝐹𝑏(⋅) be such that condition (15) holds. Let 𝜀 be such that 0 < 𝜀 < 𝛿0 and 𝑐 = max{𝜀⁻², (𝛿0 − 𝜀)⁻²}. Let 𝜋(𝑐, 𝜀) be defined by means of Algorithm 2. Then for some finite constant 𝐶¯′ > 0, the worst-case regret achieved by 𝜋(𝑐, 𝜀) is bounded as follows:

sup_{1≤𝜏≤𝑁+1} {𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)} ≤ 𝐶¯′ log 𝑁,   (16)

for all 𝑁 ≥ 2. Consequently, the minimax regret satisfies

ℛ∗(𝑁, 𝒢) ≤ 𝐶¯′ log 𝑁.   (17)

The constant 𝐶¯′ depends only on the class 𝒢 and the parameters 𝜈 and 𝜈¯ characterizing the arrival process (see (2)). Note that ℛ∗(𝑁, 𝒢) is significantly smaller than ℛ∗(𝑁, ℱ), which was shown to be essentially of order 𝑁^{1/2}. This difference stems from the fact that under condition (15), there is no need for price experimentation, and the decision-maker can infer the occurrence of a change while pricing at 𝑝∗𝑎. The question remains whether one can improve upon Algorithm 2; this is the topic of Section 4.3.

Proof sketch. For the proposed scheme, efficiently localizing the change-point implies that the pricing policy will be "close" to the oracle optimal price path (which has access to the value of 𝜏 before the start of the season). We highlight below the connection between revenue performance and the detection ability of the proposed policy. Let k̂ denote the customer number at which the price switches from 𝑝∗𝑎 to 𝑝∗𝑏. Then the expected revenues associated with the proposed policy are given by

𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏) = 𝔼𝜋𝜏 [ ∑_{𝑖=1}^{min{k̂,𝜏}−1} 𝑟𝑎(𝑝∗𝑎) + 𝑟𝑏(𝑝∗𝑎)(k̂ − 𝜏)⁺ + 𝑟𝑎(𝑝∗𝑏)(k̂ − 𝜏)⁻ + ∑_{𝑖=max{k̂,𝜏}}^{𝑁} 𝑟𝑏(𝑝∗𝑏) ].

In the equation above, and throughout the manuscript, for any real number 𝑥, 𝑥⁺ refers to max{𝑥, 0} and 𝑥⁻ to max{−𝑥, 0}.
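Because the regret of such a scheme is governed by the detection gap 𝔼𝜋𝜏∣k̂ − 𝜏∣, one can gauge a passive policy's performance by simulating that gap directly. A Monte Carlo sketch under Bernoulli demand (all names and parameter values are hypothetical; 𝑞𝑎 and 𝑞𝑏 stand in for 𝐹¯𝑎(𝑝∗𝑎) and 𝐹¯𝑏(𝑝∗𝑎)):

```python
import math
import random

def detection_gap(c, eps, N, tau, q_a, q_b, seed):
    """Simulate the window-based test of Algorithm 2 and return |k̂ − τ|,
    where k̂ is the customer index at which a change is declared."""
    rng = random.Random(seed)
    n = math.ceil(c * math.log(N))         # trailing window size
    j = 0
    while j < N:
        sales = sum(1 for k in range(n)
                    if rng.random() < (q_a if j + k < tau else q_b))
        j += n
        if abs(sales / n - q_a) > eps:     # change declared at customer k̂ = j
            return abs(j - tau)
    return abs(N - tau)                    # no detection over the horizon

# Average gap over independent replications; well-separated responses
# (q_a = 0.9 vs. q_b = 0.1) keep the gap around one window length.
gaps = [detection_gap(c=3, eps=0.3, N=1000, tau=500, q_a=0.9, q_b=0.1, seed=s)
        for s in range(200)]
avg_gap = sum(gaps) / len(gaps)
```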
The regret is then given by

𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)
  = (𝑟𝑏(𝑝∗𝑏) − 𝑟𝑏(𝑝∗𝑎)) 𝔼𝜋𝜏[(k̂ − 𝜏)⁺] + (𝑟𝑎(𝑝∗𝑎) − 𝑟𝑎(𝑝∗𝑏)) 𝔼𝜋𝜏[(k̂ − 𝜏)⁻]
  ≤ max{𝑟𝑎(𝑝∗𝑎) − 𝑟𝑎(𝑝∗𝑏), 𝑟𝑏(𝑝∗𝑏) − 𝑟𝑏(𝑝∗𝑎)} 𝔼𝜋𝜏[(k̂ − 𝜏)⁺ + (k̂ − 𝜏)⁻]
  = max{𝑟𝑎(𝑝∗𝑎) − 𝑟𝑎(𝑝∗𝑏), 𝑟𝑏(𝑝∗𝑏) − 𝑟𝑏(𝑝∗𝑎)} 𝔼𝜋𝜏[∣k̂ − 𝜏∣].   (18)

From the last expression, we observe that the performance of the proposed algorithm is driven by detection accuracy, i.e., by how small 𝔼𝜋𝜏[∣k̂ − 𝜏∣] can be made. The latter, in turn, is driven by the probability of detecting a change when none occurred (false alarm), and by that of not detecting the change after more than 𝑐 log 𝑁 customers have requested the product since the change. These two probabilities can be controlled by analyzing deviations of normalized sums of i.i.d. random variables from their true mean, in conjunction with judicious choices of 𝑐 and 𝜀.

4.3 Optimality of the passive pricing policy

The next result provides a lower bound on the performance of any admissible pricing policy 𝜋 ∈ 𝒫.

Theorem 4 For some 𝐶′ > 0,

ℛ∗(𝑁, 𝒢) ≥ 𝐶′ log 𝑁   (19)

for all 𝑁 ≥ 2.

The above result, combined with Theorem 3, establishes that the minimax regret satisfies ℛ∗(𝑁, 𝒢) ≈ log 𝑁. It shows that one cannot improve upon the logarithmic growth of the regret in the total number of customers 𝑁; in addition, this growth rate is achieved by the policy presented in Algorithm 2. In other words, roughly log 𝑁 customers are offered a suboptimal price due to the change in the WtP. This illustrates that the decision-maker can counter nature quite effectively if she or he is able to gather information about the change while pricing at 𝑝∗𝑎.

Intuition and proof sketch. It is important to note that the set of pricing policies 𝒫 is quite "large," as few restrictions are imposed.
The proof of Theorem 4 establishes that it is possible to reduce the worst-case regret minimization problem to a detection problem, in which the objective is to minimize the expected "distance" between the customer number at which one declares a change, k̂, and the customer index at which the change actually occurs, i.e., to minimize 𝔼𝜋𝜏∣k̂ − 𝜏∣. This reduction implies that any level of performance that can be achieved in the regret minimization problem can also be achieved in the detection problem. The last part of the proof establishes a fundamental limit on performance in the detection problem, which yields a lower bound for regret minimization.

5 Illustrative Numerical Examples

In what follows, we fix the price domain to [𝑝, 𝑝] = [0.5, 5] and the post-change response function 𝐹¯𝑏(⋅), and vary the response function prior to the change, 𝐹¯𝑎(⋅). The response function after the change is taken to be 𝐹¯𝑏(𝑝) = (3.2𝑝)^{−1/2}. The response function prior to the change is linear, 𝐹¯𝑎(𝑝) = max{1 − 𝛽𝑎𝑝, 0}, where the coefficient 𝛽𝑎 can take three values. The cases we focus on are:
- Case I.) 𝐹¯𝑎(𝑝) = max{1 − 0.8𝑝, 0},
- Case II.) 𝐹¯𝑎(𝑝) = max{1 − 0.4𝑝, 0},
- Case III.) 𝐹¯𝑎(𝑝) = max{1 − 0.2𝑝, 0}.
These choices allow us to cover different possibilities with respect to the difference between the pre- and post-change response functions at the pre-change optimal price 𝑝∗𝑎. In particular, cases I and III are cases where the pair of response functions belongs to 𝒢 (for some appropriate 𝛿0), i.e., the responses are well separated at 𝑝∗𝑎; case II is a case where the pair of response functions belongs to ℱ ∖ 𝒢, i.e., where 𝐹¯𝑎(𝑝∗𝑎) = 𝐹¯𝑏(𝑝∗𝑎). We depict in Figure 2 the pre- and post-change response functions, as well as the corresponding revenue functions, for cases II and III. We let 𝜋1 denote the policy defined by means of Algorithm 1 with 𝑐𝑒 = 2, 𝑐𝑟 = 2, 𝜀 = 0.3 and 𝑝0 = arg max_{𝑝∈[𝑝,𝑝]}{∣𝐹¯𝑎(𝑝) − 𝐹¯𝑏(𝑝)∣} (𝑝0 = 1.25 for cases I. and III.
and 𝑝0 = 2.5 for case II.); 𝜋2 denotes the policy defined by means of Algorithm 2 with 𝑐 = 3 and 𝜀 = 0.3. The total number of customers requesting a quote, 𝑁, is assumed to equal 10³. The experiments are based on the following arrival process parameters: 𝑁0 = 1,000 and 𝜈 = 0.25.

Structure of the pricing policies. Figure 3 contrasts sample paths of the prices associated with the policies 𝜋1 and 𝜋2 for a case where 𝜏 = 500 and the pre-change response function is 𝐹¯𝑎(𝑝) = max{1 − 0.8𝑝, 0} (case I.). The figure highlights how 𝜋1 regularly experiments at the price 𝑝0 for a small batch of customers to monitor for the occurrence of a change. As discussed in the previous sections, this policy trades off the performance losses associated with such experimentation (away from 𝑝∗𝑎) against the improved detection ability that results from it. In contrast, 𝜋2 simply prices at 𝑝∗𝑎 until a change is detected. In the sample path depicted, 𝜋2 detects the change faster than 𝜋1, as the latter only assesses the occurrence of a change after experimentation occurs. Note that this is to be expected in cases where (𝐹𝑎, 𝐹𝑏) ∈ 𝒢, i.e., when the response functions are "well separated" at 𝑝∗𝑎.

[Figure 2: Test cases. Panel (a) depicts the response functions and panel (b) the corresponding revenue functions for test cases II and III, together with the optimal prices 𝑝∗𝑎 (II), 𝑝∗𝑎 (III) and 𝑝∗𝑏. The case to which the pre-change quantities correspond is indicated in parentheses.]

Performance and benchmarking.
For purposes of benchmarking, we consider two other policies: the policy 𝜋𝑎 that ignores possible changes in the environment and prices at 𝑝∗𝑎 throughout the horizon; and the best fixed-price static oracle policy 𝜋𝑠 that, given knowledge of the value of 𝜏, selects the best single price and holds it fixed throughout the selling season. The latter policy is clearly not admissible, but serves as a reasonable benchmark. In Table 1, we present the percentage of expected revenue loss relative to the oracle performance, [𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)]/𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏), for the four policies under consideration, three values of the change point 𝜏, and the three cases of pre-change response functions. The results depicted are based on running 10³ independent simulation replications, from which the performance indicators were derived by averaging. The standard error of the percentage loss was always seen to be below 5% of the policy's estimated performance. The results highlight the small magnitude of the regret achieved by the policy 𝜋1. In particular, its performance is in general far superior to that of 𝜋𝑠, the best fixed-price oracle policy, in all instances tested.

[Figure 3: Price paths. The change point occurs at 𝜏 = 500. The figure depicts price paths for policies 𝜋1 and 𝜋2, including each policy's detection lag. The optimal prices pre and post change are 𝑝∗𝑎 = 2.5 and 𝑝∗𝑏 = 5, and 𝑝0 = 1.25 is used for experimentation.]

Pre-change response fn | change-point 𝜏 | 𝜋𝑎 | 𝜋𝑠 | 𝜋1 | 𝜋2
I.   | 𝑁/4  | 59.7 | 7.7  | 3.2  | 7.6
I.   | 𝑁/2  | 51.7 | 20.0 | 7.9  | 9.9
I.   | 3𝑁/4 | 36.9 | 36.3 | 18.1 | 14.8
II.  | 𝑁/4  | 42.9 | 14.3 | 5.36 | 37.0
II.  | 𝑁/2  | 33.3 | 31.5 | 11.6 | 27.4
II.  | 3𝑁/4 | 20.0 | 19.6 | 20.6 | 18.7
III. | 𝑁/4  | 22.0 | 16.2 | 5.4  | 6.6
III. | 𝑁/2  | 14.6 | 13.2 | 9.3  | 6.9
III. | 3𝑁/4 | 7.3  | 7.0  | 14.5 | 8.9

Table 1: Percentage loss relative to the oracle performance, [𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)]/𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏). The table displays all cases of response functions (I–III). 𝜋1 denotes the policy defined by means of Algorithm 1; 𝜋2 the policy defined by means of Algorithm 2; 𝜋𝑎 the policy that prices at 𝑝∗𝑎 throughout the horizon; and 𝜋𝑠 the best fixed-price oracle policy.

As expected, the policy 𝜋1 performs better when the change occurs earlier, as less price experimentation is involved compared to the oracle policy. Yet we observe that 𝜋1 achieves a relative regret below 20.6% in all instances tested, as opposed to 𝜋2, whose relative regret can be as high as 37%. The key feature at play here is that 𝜋1 uses price experimentation to ensure more reliable detection of the change, while 𝜋2 relies on observations at 𝑝∗𝑎 to infer information about the change. When 𝐹¯𝑎(𝑝∗𝑎) and 𝐹¯𝑏(𝑝∗𝑎) are close, 𝜋2 might fail to detect a change. In cases I. and III., the two response curves are well separated at 𝑝∗𝑎, which explains why 𝜋2 performs well in those cases. However, in case II., where 𝐹¯𝑎(𝑝) = max{1 − 0.4𝑝, 0}, the two response curves are equal at 𝑝∗𝑎, yielding a significant deterioration in the performance of 𝜋2.

Fine tuning policies. There are various parameters associated with 𝜋1 that can be fine-tuned to potentially improve performance. We investigate below the impact of the experimentation price 𝑝0, which essentially determines the amount of information gathered regarding the occurrence of a change in the market. Using the previous setup, Table 2 reports the performance of 𝜋1 relative to the oracle policy as 𝑝0 varies, for the three cases analyzed, when the change occurs at customer 𝜏 = 500.
𝑝0   | Case I. | Case II. | Case III.
1    | 6.8  | 24.4  | 11.6
1.25 | 8.2  | 24.5  | 9.6
1.5  | 8.1  | 23.3  | 12.2
1.75 | 8.6  | 22.5  | 11.4
2    | 8.5  | 11.3  | 11.5
2.25 | 9.2  | 10.2  | 12.3
2.5  | 10.0 | 11.21 | 13.3
2.75 | 9.6  | 12.37 | 15.9
3    | 9.7  | 12.2  | 14.8
3.25 | 10.2 | 12.6  | 16.73

Table 2: Percentage loss relative to the oracle performance, [𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)]/𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏). The table displays the performance of 𝜋1, the policy defined by means of Algorithm 1, as the experimentation price 𝑝0 varies. A change occurs at 𝜏 = 𝑁/2.

We observe that a poor choice of 𝑝0, i.e., selecting 𝑝0 in a region where the two response curves are close, can have a significant negative impact on performance. However, the performance does not vary significantly over the wide range of choices of 𝑝0 for which the response curves are sufficiently separated.

6 Concluding Remarks and Extensions

On the cost of information acquisition. The analysis in the paper reveals a significant distinction between the classes of response functions ℱ and 𝒢, the key being whether they are "well separated" at 𝑝∗𝑎. These cases admit fairly intuitive interpretations. In the well-separated case, i.e., when ∣𝐹¯𝑎(𝑝∗𝑎) − 𝐹¯𝑏(𝑝∗𝑎)∣ > 0, it is possible to acquire information about the change while pricing at the pre-change optimal price 𝑝∗𝑎. Roughly speaking, this makes information acquisition "costless." In contrast, when 𝐹¯𝑎(𝑝∗𝑎) = 𝐹¯𝑏(𝑝∗𝑎), information about the change can only be acquired by quoting prices that differ from 𝑝∗𝑎, i.e., via price experimentation. This implies a potential decrease in revenues prior to the change (as illustrated in Figure 3). In other words, here information acquisition is costly, and its cost is endogenously determined by the revenue rate deterioration associated with price experimentation. Thus, in a changing environment, information acquisition costs play a central role in determining achievable performance and guiding the design of "good" policies.

On the observability of lost sales.
The assumption that the decision-maker observes customers who decline to purchase, while valid in various B2B settings where a sale can only be made after a request for quote is received, is restrictive. In order to address settings with non-observable lost sales, one would need to introduce some temporal structure for the arrival process in order to be able to relate the number of sales to the actual number of potential customers. In that case, we expect similar results to the ones derived in this paper. In the absence of this link, it seems impossible to detect changes in the demand environment as it is impossible to infer the cause for the absence of purchases. We also refer the reader to Talluri (2009) for a discussion of the related issue of purchase probability estimation in the context of assortment selection. The presence of multiple change points. While the analysis in the current paper was conducted for the case of a single change point, the approach taken is applicable when multiple change points are present, as long as these are suitably “separated.” More precisely, as long as a representative number of customers arrives between such changes, and the decision-maker has access to the order of magnitude of the number of potential customers between change points, the method developed could be applied. Abrupt versus gradual changes in the demand environment. In the current paper, we have modeled changes in the demand environment as being abrupt. What happens when these changes are gradual? The first thing to note is that the detection delay associated with the proposed policies might be longer in that setting. However, this need not imply a significant loss in performance. Indeed, the lack of detection of a change in the response functions indicates that the new demand environment is still within an “indifference zone” relative to the current one, and hence using the latter model has only a minor impact on performance. 
The proposed policies, which are built with this indifference-zone idea in mind, will only detect a change once the new demand environment differs "significantly" from the current one; hence the core ideas developed in this paper remain applicable in settings where the demand changes gradually.

An alternative Bayesian formulation. We commented at the end of Section 2 on a possible Bayesian approach to the problem (see (7)). In particular, we mentioned that under any prior 𝐹𝜏 for the time of change, ℛ∗𝐵(𝑁, ℱ, 𝐹𝜏) ≤ ℛ∗(𝑁, ℱ); i.e., one is able to achieve a lower worst-case regret when the time of change is drawn from a distribution and performance is averaged with respect to that distribution. We argue here that there exists a prior for which ℛ∗(𝑁, ℱ) ≤ 𝐶 log 𝑁 ℛ∗𝐵(𝑁, ℱ, 𝐹𝜏) for some 𝐶 > 0. Indeed, assume that 𝑁 is initially known to the decision-maker (which corresponds to 𝜈 = 𝜈¯) and recall the discussion in Section 3.2 and the definition of the 𝑙𝑖's, which separate the arrival stream into "batches." Suppose that 𝐹𝜏 places equal mass on each of the customer indices 𝑙𝑖, which are separated by batches of size 𝛥 ≈ 𝑁^{1/2}. Then, noting that any policy gathers either limited or ample information on at least half of these batches, an adaptation of the argument developed in Section 3.2 can be used to establish that the worst-case regret is bounded below by order 𝑁^{1/2}. In other words, if the prior can be selected arbitrarily, then the Bayesian regret is of the same order as the minimax regret. Hence, sup_{𝐹𝜏} ℛ∗𝐵(𝑁, ℱ, 𝐹𝜏) ≈ ℛ∗(𝑁, ℱ), i.e., the two are of the same order of magnitude up to logarithmic terms.

Complexity of detection versus complexity of learning. This paper has focused on the complexity of pricing in an environment where the time of change is unknown. A key question that naturally arises is how this compares with the complexity of learning the post-change response function.
Consider the regret

ℛ∗𝑙(𝑁, ℱ) := inf_{𝜋∈𝒫} sup_{𝐹𝑏 : (𝐹𝑎,𝐹𝑏)∈ℱ} { 𝐽∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽𝜋(𝐹𝑎, 𝐹𝑏, 𝜏) },   (20)

where the time of change 𝜏 and the initial response function 𝐹𝑎(⋅) are both known, but the post-change response function 𝐹𝑏(⋅) is not. Such an objective isolates the complexity associated with learning. Based on the stochastic approximation literature, and in particular the results in Polyak and Tsybakov (1990), one would expect ℛ∗𝑙(𝑁, ℱ) ≈ 𝑁^{1/2} under relatively minimal smoothness conditions on the post-change response function. In other words, we have the remarkable fact that the complexities of detection and of learning are comparable, implying that both tasks contribute equally from a performance perspective. An important research direction is the design of "good" and practical algorithms for cases where both the time of change and the post-change response function are unknown.

References

Bayraktar, E., Dayanik, S. and Karatzas, I. (2005), ‘The standard Poisson disorder problem revisited’, Stochastic Processes and their Applications 115, 1437–1450.

Besbes, O., Phillips, R. and Zeevi, A. (2008), ‘Testing the validity of a demand model: an operations perspective’, Manufacturing & Service Operations Management, forthcoming.

Besbes, O. and Zeevi, A. (2007), ‘Dynamic pricing without knowing the demand function: risk bounds and near-optimal algorithms’, Operations Research, forthcoming.

Borovkov, A. (1998), Mathematical Statistics, Gordon and Breach Science Publishers.

Cesa-Bianchi, N. and Lugosi, G. (2006), Prediction, Learning, and Games, Cambridge University Press.

Gallego, G. and van Ryzin, G. (1997), ‘A multiproduct dynamic pricing problem and its applications to network yield management’, Operations Research 45, 24–41.

Hannan, J. (1957), ‘Approximation to Bayes risk in repeated play’, Contributions to the Theory of Games III, Princeton University Press, 97–139.

Keller, G. and Rady, S.
(1999), 'Optimal experimentation in a changing environment', The Review of Economic Studies 66, 475–507.

Korostelev, A. P. (1987), 'On minimax estimation of a discontinuous signal', Theory of Probability and its Applications 32, 727–730.

Lai, T. L. (2001), 'Sequential analysis: some classical problems and new challenges', Statistica Sinica 11, 303–408.

Levin, Y., Levina, T., McGill, J. and Nediak, M. (2008), 'Dynamic pricing with online learning and strategic consumers', Operations Research, forthcoming.

Lobo, M. S. (2007), 'The value of dynamic pricing', working paper, Duke University.

Page, E. S. (1954), 'Continuous inspection schemes', Biometrika 41, 100–115.

Phillips, R. (2005), Pricing and Revenue Optimization, Stanford University Press.

Polyak, B. T. and Tsybakov, A. (1990), 'Optimal order of accuracy of search algorithms in stochastic optimization', Problems of Information Transmission 26, 126–133.

Roberts, S. W. (1966), 'A comparison of some control chart procedures', Technometrics 8, 411–430.

Shewhart, W. A. (1931), The Economic Control of the Quality of Manufactured Product, Van Nostrand, New York.

Shiryayev, A. N. (1963), 'On optimum methods in quickest detection problems', Theory of Probability and its Applications 8, 22–46.

Shiryayev, A. N. (1978), Optimal Stopping Rules, Springer-Verlag.

Siegmund, D. (1985), Sequential Analysis, Springer-Verlag.

Talluri, K. (2009), 'A finite-population revenue management model and a risk-ratio procedure for the joint estimation of population size and parameters', working paper, Universitat Pompeu Fabra.

Talluri, K. T. and van Ryzin, G. J. (2005), Theory and Practice of Revenue Management, Springer-Verlag.

Tsybakov, A. (2004), Introduction à l'estimation non-paramétrique, Springer.

Online Companion: On the Minimax Complexity of Pricing in a Changing Environment

A Proofs for Section 3

Preliminaries.
We introduce below some notation and basic results that will be used in the proofs that follow. For any policy π ∈ 𝒫 and its associated price mapping ψ, we define the random variable 𝒥^π(F_a, F_b, τ) as follows:

    𝒥^π(F_a, F_b, τ) = Σ_{i=1}^{τ−1} r_a(ψ_i(ℋ_{i−1})) + Σ_{i=τ}^{N} r_b(ψ_i(ℋ_{i−1})).    (A-1)

In the proofs, we will use extensively the following two equalities:

    𝔼^π_τ[𝒥^π(F_a, F_b, τ)] = J^π(F_a, F_b, τ),    (A-2)

    J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) = 𝔼^π_τ[ Σ_{i=1}^{τ−1} [r_a(p*_a) − r_a(ψ_i(ℋ_{i−1}))] + Σ_{i=τ}^{N} [r_b(p*_b) − r_b(ψ_i(ℋ_{i−1}))] ].    (A-3)

These follow from the conditioning argument below:

    J^π(F_a, F_b, τ) = 𝔼^π_τ[ Σ_{i=1}^{τ−1} ψ_i(ℋ_{i−1}) 1{V_i ≥ ψ_i(ℋ_{i−1})} + Σ_{i=τ}^{N} ψ_i(ℋ_{i−1}) 1{V_i ≥ ψ_i(ℋ_{i−1})} ]
    = Σ_{i=1}^{τ−1} 𝔼^π_τ[ 𝔼^π_τ[ ψ_i(ℋ_{i−1}) 1{V_i ≥ ψ_i(ℋ_{i−1})} | ℋ_{i−1} ] ] + Σ_{i=τ}^{N} 𝔼^π_τ[ 𝔼^π_τ[ ψ_i(ℋ_{i−1}) 1{V_i ≥ ψ_i(ℋ_{i−1})} | ℋ_{i−1} ] ]
    = Σ_{i=1}^{τ−1} 𝔼^π_τ[ r_a(ψ_i(ℋ_{i−1})) ] + Σ_{i=τ}^{N} 𝔼^π_τ[ r_b(ψ_i(ℋ_{i−1})) ]
    = 𝔼^π_τ[ 𝒥^π(F_a, F_b, τ) ].

Proof of Theorem 1. The proof of the result relies on Propositions 1 and 2 (whose proofs are presented following this one) and the notation introduced in Section 3.2. In particular, suppose that one takes Δ = ⌈N^{1/2}⌉. Then Δ ≥ N^{1/2} and Ñ − 1 ≥ (1/6)N^{1/2}. Combining the results of Propositions 1 and 2, we obtain that for any policy π ∈ 𝒫,

    sup_{1≤τ≤N+1} { J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) } ≥ min{ C̄_1 Δ, C̄_2 (Ñ − 1) }
    ≥ min{ C̄_1 N^{1/2}, (1/6) C̄_2 N^{1/2} } = min{ C̄_1, (1/6) C̄_2 } N^{1/2}.

We get that for all (F_a, F_b) ∈ ℱ_1,

    sup_{1≤τ≤N+1} { J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) } ≥ C N^{1/2},

where C = min{ C̄_1, (1/6) C̄_2 }. This concludes the proof.

Proof of Proposition 1. Consider any policy π ∈ 𝒫 such that min_{1≤j≤Ñ−1} 𝒦(ℙ^π_{l_{j+1}}, ℙ^π_{l_j}) < β. Let i_0 denote an index j such that 𝒦(ℙ^π_{l_{j+1}}, ℙ^π_{l_j}) < β and consider the following two hypotheses:

    H_0 : τ ∉ {l_{i_0}, ..., l_{i_0+1} − 1},    H_1 : τ = l_{i_0}.
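The two hypotheses above define a batch-level testing problem: distinguishing two product-Bernoulli laws at small KL distance. A small simulation (ours, with illustrative Bernoulli parameters and a simple mean-threshold test; this is not the paper's construction) shows the worst-case error of such a test sitting comfortably above a floor of the form (1/4)exp{−β}:

```python
import math, random

# Illustration (ours): distinguishing two product-Bernoulli laws whose KL
# distance is beta. Even a natural mean-threshold test keeps a worst-case
# error above (1/4)exp(-beta); all parameter values are illustrative.
random.seed(4)
q0, q1, n, M = 0.50, 0.55, 20, 50_000
beta = n * (q1 * math.log(q1 / q0) + (1 - q1) * math.log((1 - q1) / (1 - q0)))

def declares_change(q):   # decide H1 when the sample mean crosses the midpoint
    return sum(random.random() < q for _ in range(n)) / n > (q0 + q1) / 2

err0 = sum(declares_change(q0) for _ in range(M)) / M       # error under H0
err1 = sum(not declares_change(q1) for _ in range(M)) / M   # error under H1
print(round(max(err0, err1), 3), ">=", round(0.25 * math.exp(-beta), 3))
```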
Under ℙ^π_{l_{i_0}} a change occurs at l_{i_0}, and under ℙ^π_{l_{i_0+1}} no change occurs in {l_{i_0}, ..., l_{i_0+1} − 1}. Let φ denote a decision rule, i.e., a mapping from the set of price and demand realizations in {1, ..., l_{i_0+1} − 1} into {0, 1}: φ = 0 will denote "no change" and φ = 1 will denote the presence of a change at l_{i_0}. By Tsybakov (2004, Theorem 2.2), the worst-case error probability of any decision rule is lower bounded by (1/4) exp{−β}, i.e.,

    inf_φ max{ ℙ^π_{l_{i_0}}{φ = 0}, ℙ^π_{l_{i_0+1}}{φ = 1} } ≥ (1/4) exp{−β}.    (A-4)

We show next that this implies that the losses in performance throughout the horizon must be of order Δ. Let δ = |p*_b − p*_a|/2 and δ_r = inf_{y∈[p*_a−δ, p*_a+δ]} { r_b(p*_b) − r_b(y) }, and note that δ_r > 0 by conditions iv.) and v.) in Assumption 1. Let C_1 = δ_r/(1 + δ_r/h(δ)), and define C_2 as

    C_2 = (1/8) C_1 exp{−β}.    (A-5)

Suppose for a moment that we have

    sup_{k=i_0, i_0+1} 𝔼^π_{l_k}[ J* − 𝒥^π ] ≤ C_2 Δ,    (A-6)

and consider the following decision rule φ:

    φ = 0 if Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] ≤ C_1 Δ,
    φ = 1 if Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] > C_1 Δ.

We next analyze the error probabilities associated with this rule:

    ℙ^π_{l_{i_0+1}}{φ = 1} = ℙ^π_{l_{i_0+1}}{ Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] > C_1 Δ }
    ≤(a) (1/(C_1 Δ)) 𝔼^π_{l_{i_0+1}}[ Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] ]
    ≤ (1/(C_1 Δ)) 𝔼^π_{l_{i_0+1}}[ J* − 𝒥^π ]
    ≤(b) C_2/C_1
    =(c) (1/8) exp{−β},

where (a) follows from Markov's inequality; (b) follows from the assumption that (A-6) holds; and (c) follows from the definitions of C_1 and C_2 (see (A-5)).

We now turn to ℙ^π_{l_{i_0}}{φ = 0}. We first establish the following inequality:

    ℙ^π_{l_{i_0}}{φ = 0} ≤ ℙ^π_{l_{i_0}}{ Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_b(p*_b) − r_b(ψ_j(ℋ_{j−1})) ] ≥ δ_r [1 − C_1/h(δ)] Δ }.    (A-7)

Indeed, suppose that φ = 0, i.e., Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] ≤ C_1 Δ. Then by Assumption 1 v.), Σ_{j=l_{i_0}}^{l_{i_0+1}−1} h(|p*_a − ψ_j(ℋ_{j−1})|) ≤ C_1 Δ.
This in turn implies that Σ_{j=l_{i_0}}^{l_{i_0+1}−1} 1{|p*_a − ψ_j(ℋ_{j−1})| > δ} ≤ C_1 Δ/h(δ), and hence

    Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_b(p*_b) − r_b(ψ_j(ℋ_{j−1})) ]
    ≥ Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_b(p*_b) − r_b(ψ_j(ℋ_{j−1})) ] 1{|p*_a − ψ_j(ℋ_{j−1})| ≤ δ}
    ≥ inf_{y∈[p*_a−δ, p*_a+δ]} [ r_b(p*_b) − r_b(y) ] · Σ_{j=l_{i_0}}^{l_{i_0+1}−1} 1{|p*_a − ψ_j(ℋ_{j−1})| ≤ δ}
    ≥ δ_r [1 − C_1/h(δ)] Δ.

Coming back to (A-7), we now have

    ℙ^π_{l_{i_0}}{φ = 0} ≤ ℙ^π_{l_{i_0}}{ Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_b(p*_b) − r_b(ψ_j(ℋ_{j−1})) ] ≥ δ_r [1 − C_1/h(δ)] Δ }
    ≤(a) (1/(δ_r [1 − C_1/h(δ)] Δ)) 𝔼^π_{l_{i_0}}[ Σ_{j=l_{i_0}}^{l_{i_0+1}−1} [ r_b(p*_b) − r_b(ψ_j(ℋ_{j−1})) ] ]
    ≤ (1/(δ_r [1 − C_1/h(δ)] Δ)) 𝔼^π_{l_{i_0}}[ J* − 𝒥^π ]
    ≤(b) C_2/(δ_r [1 − C_1/h(δ)])
    =(c) (1/8) exp{−β},

where (a) follows from Markov's inequality; (b) follows from the assumption that (A-6) holds; and (c) follows from the definitions of C_1 and C_2 (see (A-5)), noting that δ_r [1 − C_1/h(δ)] = C_1. We deduce that the rule φ defined earlier satisfies

    max{ ℙ^π_{l_{i_0}}{φ = 0}, ℙ^π_{l_{i_0+1}}{φ = 1} } ≤ (1/8) exp{−β} < (1/4) exp{−β},

which is in contradiction with (A-4). We deduce that (A-6) cannot hold and hence, in the current case where min_{1≤j≤Ñ−1} 𝒦(ℙ^π_{l_{j+1}}, ℙ^π_{l_j}) = 𝒦(ℙ^π_{l_{i_0+1}}, ℙ^π_{l_{i_0}}) < β, we necessarily have

    sup_{1≤τ≤N+1} 𝔼^π_τ[ J*(F_a, F_b, τ) − 𝒥^π(F_a, F_b, τ) ] > C_2 Δ.    (A-8)

Recall that C_2 = [δ_r/(1 + δ_r/h(δ))](1/8) exp{−β} and note that δ > δ_p/2 and δ_r ≥ h(δ) ≥ min{α δ_p²/4, γ} by Assumption 1 iv.) and v.). We deduce that C_2 ≥ C̄_1 := (1/16) h(δ_p/2) exp{−β} and that

    sup_{1≤τ≤N+1} 𝔼^π_τ[ J*(F_a, F_b, τ) − 𝒥^π(F_a, F_b, τ) ] > C̄_1 Δ,    (A-9)

where C̄_1 depends on the parameters of the class ℱ_1 and β. This concludes the proof.

Proof of Proposition 2. Consider any policy π ∈ 𝒫 such that min_{1≤j≤Ñ−1} 𝒦(ℙ^π_{l_{j+1}}, ℙ^π_{l_j}) ≥ β. We analyze the performance of the policy when the change occurs at τ = l_Ñ.
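The collapse of constants used in step (c) of the proof of Proposition 1 rests on the identity δ_r[1 − C_1/h(δ)] = C_1; a quick numeric check (ours, with arbitrary sample values and a helper name of our own) confirms it:

```python
import math

# Numeric check (ours) of the constant collapse in step (c) of Proposition 1:
# with C1 = dr/(1 + dr/h), one has dr*(1 - C1/h) = C1, so both Markov bounds
# reduce to C2/C1 = (1/8)exp(-beta). Sample values are illustrative.
def constants(dr, h, beta):
    C1 = dr / (1 + dr / h)
    C2 = C1 * math.exp(-beta) / 8
    return C1, C2

for dr, h, beta in [(0.3, 0.1, 1.0), (2.0, 0.5, 0.2)]:
    C1, C2 = constants(dr, h, beta)
    print(abs(dr * (1 - C1 / h) - C1) < 1e-12,
          abs(C2 / C1 - math.exp(-beta) / 8) < 1e-12)
```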
We have

    𝔼^π_{l_Ñ}[ J*(F_a, F_b, τ) − 𝒥^π(F_a, F_b, τ) ]
    = 𝔼^π_{l_Ñ}[ Σ_{j=1}^{l_Ñ−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] + Σ_{j=l_Ñ}^{N} [ r_b(p*_b) − r_b(ψ_j(ℋ_{j−1})) ] ]
    ≥ 𝔼^π_{l_Ñ}[ Σ_{j=l_1}^{l_Ñ−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] ]
    = Σ_{i=1}^{Ñ−1} 𝔼^π_{l_Ñ}[ Σ_{j=l_i}^{l_{i+1}−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] ]
    =(a) Σ_{i=1}^{Ñ−1} 𝔼^π_{l_{i+1}}[ Σ_{j=l_i}^{l_{i+1}−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] ]
    ≥(b) Σ_{i=1}^{Ñ−1} 𝔼^π_{l_{i+1}}[ Σ_{j=l_i}^{l_{i+1}−1} h(|p*_a − ψ_j(ℋ_{j−1})|) ],    (A-10)

where (a) follows from the fact that the random variable Σ_{j=l_i}^{l_{i+1}−1} [ r_a(p*_a) − r_a(ψ_j(ℋ_{j−1})) ] is ℋ_{l_{i+1}−1}-measurable (and the measures ℙ^π_{l_Ñ} and ℙ^π_{l_{i+1}} coincide on ℋ_{l_{i+1}−1}); and (b) follows from Assumption 1 v.). The following lemma, whose proof can be found in Appendix C, allows to further lower bound the terms appearing in the sum in (A-10).

Lemma 1 For some C_𝒦 > 0 that depends only on the parameters defining the class ℱ_1, we have that for all i = 1, ..., Ñ − 1,

    𝔼^π_{l_{i+1}}[ Σ_{j=l_i}^{l_{i+1}−1} h(|p*_a − ψ_j(ℋ_{j−1})|) ] ≥ C_𝒦 𝒦(ℙ^π_{l_{i+1}}, ℙ^π_{l_i}).

Combining the latter result with (A-10) yields

    𝔼^π_{l_Ñ}[ J* − 𝒥^π ] ≥ C_𝒦 Σ_{i=1}^{Ñ−1} 𝒦(ℙ^π_{l_{i+1}}, ℙ^π_{l_i}) ≥ (Ñ − 1) C_𝒦 β.

We deduce that when min_{1≤i≤Ñ−1} 𝒦(ℙ^π_{l_{i+1}}, ℙ^π_{l_i}) ≥ β, we necessarily have

    sup_{1≤τ≤N+1} 𝔼^π_τ[ J*(F_a, F_b, τ) − 𝒥^π(F_a, F_b, τ) ] ≥ (Ñ − 1) C_𝒦 β.    (A-11)

Letting C̄_2 = C_𝒦 β, we have that

    sup_{1≤τ≤N+1} 𝔼^π_τ[ J*(F_a, F_b, τ) − 𝒥^π(F_a, F_b, τ) ] ≥ C̄_2 (Ñ − 1),    (A-12)

where C̄_2 depends on the parameters of the class ℱ_1 and β. This concludes the proof.

Proof of Theorem 2. For what follows, when a change is declared, we let k̂ denote the customer number at which it is declared (i.e., the customer number at which the price changes to p*_b); when no change is declared, we let k̂ = N + 1. We introduce the following notation:

    i_0 = 1,    i_{j+1} = i_j + n_r + n_e,    j = 0, ..., n_b − 1,

where n_b = sup{ j : i_j ≤ N }. Note that n_b ≤ N/(n_r + n_e). Let j* correspond to the index such that

    i_{j*} < τ ≤ i_{j*+1}.    (A-13)

Step 1. The first step consists of relating the regret to the detection abilities of the policy.
Recalling the expression for the regret provided in (A-3) in the preliminaries, we have for the policy under consideration

    J*(F_a, F_b, τ) − J^π(F_a, F_b, τ)
    = 𝔼^π_τ[ Σ_{i=1}^{τ−1} [ r_a(p*_a) − r_a(ψ_i(ℋ_{i−1})) ] + Σ_{i=τ}^{N} [ r_b(p*_b) − r_b(ψ_i(ℋ_{i−1})) ] ]
    = 𝔼^π_τ[ Σ_{i=1}^{min{τ−1, k̂−1}} [ r_a(p*_a) − r_a(p_0) ] 1{ψ_i(ℋ_{i−1}) = p_0}
        + Σ_{i=k̂}^{τ−1} [ r_a(p*_a) − r_a(p*_b) ] 1{ψ_i(ℋ_{i−1}) = p*_b}
        + Σ_{i=τ}^{min{N, k̂−1}} [ r_b(p*_b) − r_b(ψ_i(ℋ_{i−1})) ] ]
    ≤ n_e (n_b + 1) [ r_a(p*_a) − r_a(p_0) ] + 𝔼^π_τ[ (τ − k̂)^+ ] [ r_a(p*_a) − r_a(p*_b) ]
        + 𝔼^π_τ[ (k̂ − τ)^+ ] max{ r_b(p*_b) − r_b(p*_a), r_b(p*_b) − r_b(p_0) }.    (A-14)

Next, we focus on bounding the terms on the right-hand side above.

Step 2. In this step we analyze separately 𝔼^π_τ[(τ − k̂)^+] and 𝔼^π_τ[(k̂ − τ)^+]. For ℓ = 1, ..., j*, let q_{f,ℓ} denote the probability of a false alarm at i_ℓ + 1, i.e., of incorrectly detecting a change after observing customer i_ℓ. We have

    𝔼^π_τ[ (τ − k̂)^+ ] = Σ_{j=1}^{τ−1} ℙ^π_τ{ (τ − k̂)^+ ≥ j }
    = Σ_{j=1}^{τ−1} ℙ^π_τ{ k̂ ≤ τ − j }
    ≤ Σ_{ℓ=1}^{j*} Σ_{j=i_ℓ}^{i_{ℓ+1}−1} ℙ^π_τ{ ∪_{m=1}^{ℓ} { k̂ = i_m + 1 } }
    ≤(a) Σ_{ℓ=1}^{j*} Σ_{j=i_ℓ}^{i_{ℓ+1}−1} ℓ q_{f,ℓ}
    ≤ (n_r + n_e) Σ_{ℓ=1}^{j*} ℓ q_{f,ℓ},    (A-15)

where (a) follows from a union bound. For ℓ = 1, ..., j*, one can bound q_{f,ℓ} as follows:

    q_{f,ℓ} =(a) ℙ^π_τ{ (1/n_e) Σ_{m=i_{ℓ−1}+n_r+1}^{i_{ℓ−1}+n_r+n_e} (Y_m − 𝔼^π_τ[Y_m]) > ε } ≤(b) 2 exp{−2 n_e ε²},    (A-16)

where in (a), the Y_m are i.i.d. Bernoulli random variables with ℙ^π_τ{Y_m = 1} = F̄_a(p_0); and (b) follows from Hoeffding's inequality. Combining (A-15) and (A-16), one obtains

    𝔼^π_τ[ (τ − k̂)^+ ] ≤ 2 N² exp{−2 n_e ε²}.    (A-17)

By the definition of n_e, we have that n_e ≥ c_e log N_0 ≥ c_e log N + c_e log ν^{−1}. We deduce that

    𝔼^π_τ[ (τ − k̂)^+ ] ≤ 2 N² exp{−2 ε² (c_e log N − c_e log ν)} = 2 N^{2−2c_e ε²} ν^{2c_e ε²} ≤ 2 ν^{2c_e ε²},    (A-18)

where the last inequality holds since 2(1 − ε² c_e) ≤ 0 and N ≥ 1. We now turn to analyzing 𝔼^π_τ[(k̂ − τ)^+].
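The false-alarm event bounded in (A-16) can be illustrated by simulation; the sketch below (ours, with illustrative parameter values) compares the empirical false-alarm frequency during an exploration batch against the Hoeffding bound 2 exp{−2 n_e ε²}:

```python
import math, random

# Sketch (ours) of the false-alarm event in (A-16): during an exploration
# batch of n_e customers priced at p0, a change is wrongly declared when the
# empirical purchase rate drifts from Fbar_a(p0) by more than eps.
# The parameter values below are illustrative assumptions.
random.seed(2)
n_e, fbar_a, eps = 400, 0.4, 0.1

def false_alarm():
    sales = sum(random.random() < fbar_a for _ in range(n_e))
    return abs(sales / n_e - fbar_a) > eps

trials = 20_000
freq = sum(false_alarm() for _ in range(trials)) / trials
hoeffding = 2 * math.exp(-2 * n_e * eps ** 2)
print(freq, "<=", round(hoeffding, 6))
```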
For ℓ = j* + 2, ..., n_b, let q_{d,ℓ} denote the following quantity:

    q_{d,ℓ} = ℙ^π_τ{ (1/n_e) Σ_{m=i_{ℓ−1}+n_r+1}^{i_{ℓ−1}+n_r+n_e} (Y_m − F̄_a(p_0)) ≤ ε },

where the Y_m are Bernoulli random variables with ℙ^π_τ{Y_m = 1} = F̄(p_0; m). The quantity q_{d,ℓ} represents the probability of not detecting a change at exactly i_ℓ, given that it has not yet been detected. We have

    𝔼^π_τ[ (k̂ − τ)^+ ] = Σ_{j=0}^{N−τ+1} ℙ^π_τ{ (k̂ − τ)^+ ≥ j }
    = Σ_{j=0}^{N−τ+1} ℙ^π_τ{ k̂ ≥ τ + j }
    ≤ 3(n_r + n_e) + Σ_{ℓ=j*+2}^{n_b} Σ_{j=i_ℓ}^{i_{ℓ+1}−1} ℙ^π_τ{ ∩_{m=j*+2}^{ℓ} { k̂ ≠ i_m } }
    ≤(a) 3(n_r + n_e) + Σ_{ℓ=j*+2}^{n_b} (n_r + n_e) (q_{d,j*+2})^{ℓ−j*−1}
    = (n_r + n_e) ( 3 + Σ_{m=1}^{n_b−j*−1} (q_{d,j*+2})^m )
    = (n_r + n_e) ( 3 + q_{d,j*+2} [1 − (q_{d,j*+2})^{n_b−j*−1}] / (1 − q_{d,j*+2}) )
    ≤ (n_r + n_e) · 3/(1 − q_{d,j*+2}),    (A-19)

where in (a), we used the fact that q_{d,j} = q_{d,j*+2} for j = j* + 2, ..., n_b. On the other hand, we have for ℓ = j* + 2, ..., n_b,

    q_{d,ℓ} = ℙ^π_τ{ (1/n_e) Σ_{m=i_{ℓ−1}+n_r+1}^{i_{ℓ−1}+n_r+n_e} (Y_m − F̄_a(p_0)) ≤ ε }
    = ℙ^π_τ{ (1/n_e) Σ_{m=i_{ℓ−1}+n_r+1}^{i_{ℓ−1}+n_r+n_e} (Y_m − 𝔼^π_τ[Y_m]) + (F̄_b(p_0) − F̄_a(p_0)) ≤ ε }
    ≤(a) ℙ^π_τ{ (1/n_e) Σ_{m=i_{ℓ−1}+n_r+1}^{i_{ℓ−1}+n_r+n_e} (Y_m − 𝔼^π_τ[Y_m]) ≥ |F̄_b(p_0) − F̄_a(p_0)| − ε }
    ≤(b) 2 exp{−2 n_e (δ_0 − ε)²}
    ≤(c) 2 N_0^{−2c_e(δ_0−ε)²}
    ≤(d) 1/2,

where (a) follows from a triangle inequality; (b) follows from Hoeffding's inequality, which holds as long as δ_0 > ε; (c) follows from the fact that n_e ≥ c_e log N_0; and (d) is true since N_0 ≥ 2 and 2c_e(δ_0 − ε)² ≥ 2. Note that n_r + n_e ≤ c_r N_0^{1/2} + c_e log N_0 + 2 ≤ c_r (N/ν)^{1/2} + c_e log(N/ν) + 2. Coming back to (A-19), we get

    𝔼^π_τ[ (k̂ − τ)^+ ] ≤ (n_r + n_e) · 3/(1 − q_{d,j*+2}) ≤ 6 ( c_r (N/ν)^{1/2} + c_e log(N/ν) + 2 ).    (A-20)

Step 3. Let δ_r = max{ r_a(p*_a) − r_a(p_0), r_a(p*_a) − r_a(p*_b), r_b(p*_b) − r_b(p*_a), r_b(p*_b) − r_b(p_0) }.
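The geometric-series step in (A-19) above can be confirmed numerically; this quick check (ours) verifies that once the per-batch miss probability q is at most 1/2, the series 3 + q + q² + ⋯ is dominated by 3/(1 − q):

```python
# Quick numeric confirmation (ours) of the geometric-series step in (A-19):
# for q <= 1/2, the truncated series 3 + q + q^2 + ... never exceeds 3/(1 - q).
for q in [0.1, 0.3, 0.5]:
    partial = 3 + sum(q ** m for m in range(1, 50))
    print(q, round(partial, 4), "<=", round(3 / (1 - q), 4))
```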
Returning to the bound on the regret, (A-14) in combination with (A-18) and (A-20) yields

    J*(F_a, F_b, τ) − J^π(F_a, F_b, τ)
    ≤ δ_r [ n_e (n_b + 1) + 𝔼^π_τ[(τ − k̂)^+] + 𝔼^π_τ[(k̂ − τ)^+] ]
    ≤ 2 δ_r [ (N/(n_r + n_e) + 1)(c_e log N_0 + 1) + ν^{2c_e ε²} + 3 c_r (N/ν)^{1/2} + 3 c_e log(N/ν) + 6 ].

The right-hand side above does not depend on τ, hence

    sup_{1≤τ≤N+1} [ J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) ] ≤ C_1 N^{1/2} log N,

where C_1 is a function of δ_r, c_e, c_r, ε, ν and ν̄. This concludes the proof.

B Proofs for Section 4

Proof of Theorem 3. This result follows from a slight modification of the proof of Theorem 2. In particular, one notes that Algorithm 2 can be seen as a special case of Algorithm 1 where n_r = 0 and p_0 = p*_a. Next, we analyze the steps that need to be modified in the proof of Theorem 2. The only modifications to the proof occur in the upper bound on the regret in Step 1 (see (A-14)) and in the upper bound on 𝔼^π_τ[(k̂ − τ)^+]; see (A-20) in Step 2. In particular, the upper bound on the regret can here be written as

    J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) ≤ 𝔼^π_τ[ (τ − k̂)^+ ] [ r_a(p*_a) − r_a(p*_b) ] + 𝔼^π_τ[ (k̂ − τ)^+ ] [ r_b(p*_b) − r_b(p*_a) ].

Then, in Step 2, the upper bound for 𝔼^π_τ[(k̂ − τ)^+], see (A-20), becomes

    𝔼^π_τ[ (k̂ − τ)^+ ] ≤ n_e · 3/(1 − q_{d,j*+2}) ≤ 6 ( c_e log(N/ν) + 1 ).    (B-1)

Hence, with δ_r = max{ r_a(p*_a) − r_a(p*_b), r_b(p*_b) − r_b(p*_a) }, one obtains

    J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) ≤ δ_r [ 𝔼^π_τ[(τ − k̂)^+] + 𝔼^π_τ[(k̂ − τ)^+] ] ≤ 2 δ_r [ ν^{2c_e ε²} + 3 c_e log(N/ν) + 3 ],

which yields

    sup_{1≤τ≤N+1} [ J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) ] ≤ C_1 log N,

with C_1 depending only on δ_r, c_e, ε, ν and ν̄. This completes the proof.

Proof of Theorem 4. The proof is organized in three main steps.
In the first step we establish that if one is able to achieve a given performance level in terms of revenues, then a related performance level can be achieved for the problem of change detection (where one attempts to minimize the distance between the detection time and the actual time of change). In the second step, we establish a fundamental limit on the performance of any detection rule. The last step concludes by translating those fundamental limits back to the revenue maximization problem.

Let ℱ_2 denote the set of response functions (F_a, F_b) that satisfy conditions i.), ii.), iv.) and v.) of Assumption 1. We analyze ℛ*(N, 𝒢 ∩ ℱ_2) and then use the fact that ℛ*(N, 𝒢) ≥ ℛ*(N, 𝒢 ∩ ℱ_2). Consider any pair of response functions (F_a, F_b) ∈ 𝒢 ∩ ℱ_2.

Step 1. Consider any policy π ∈ 𝒫 and its associated price mapping ψ. Recall the definition of the random variable 𝒥^π(F_a, F_b, τ) given in (A-1) in the preliminaries. Let γ be any positive constant (which may depend on N) and define

    ℬ_γ = { ω : J*(F_a, F_b, τ) − 𝒥^π(F_a, F_b, τ) < γ }.    (B-2)

Note that

    J*(F_a, F_b, τ) − 𝒥^π(F_a, F_b, τ) ≥ Σ_{i=1}^{τ−1} h(|p*_a − ψ_i(ℋ_{i−1})|) + Σ_{i=τ}^{N} h(|p*_b − ψ_i(ℋ_{i−1})|),

where h(⋅) was defined in Assumption 1 v.). Hence, for all ω ∈ ℬ_γ, we have

    Σ_{i=1}^{τ−1} h(|p*_a − ψ_i(ℋ_{i−1})|) + Σ_{i=τ}^{N} h(|p*_b − ψ_i(ℋ_{i−1})|) < γ.    (B-3)

We next define a stopping rule for the index of the customer at which a change occurs, τ, based on the pricing policy π. Let

    j_0 := ⌈ γ/h(δ_p/3) ⌉,    (B-4)

where δ_p and h(⋅) were defined in Assumption 1. Define

    k̂_1 = inf{ 1 < j ≤ N : {|p*_b − ψ_j(ℋ_{j−1})| ≤ δ_p/3} ∪ {j = N} },

and for i ≥ 1,

    k̂_{i+1} = inf{ k̂_i < j ≤ N : {|p*_b − ψ_j(ℋ_{j−1})| ≤ δ_p/3} ∪ {j = N} } if k̂_i < N, and k̂_{i+1} = N if k̂_i = N.

The stopping rule is now defined as the j_0-th term of this sequence, namely

    k̂* = k̂_{j_0}.    (B-5)

In other words, a change is declared after we have priced "close" to p*_b a total of j_0 times (or once we have reached N).
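The stopping rule in (B-5) admits a compact implementation sketch; the version below is ours and simplified (it counts "close" prices from the first customer onward, and the function name, toy path, and parameter values are our own assumptions):

```python
# A minimal, simplified implementation sketch (ours) of the stopping rule in
# (B-5): declare a change at the j0-th customer whose quoted price falls
# within delta_p/3 of the post-change optimal price, or at N if that never
# happens.
def k_hat_star(prices, p_b_star, delta_p, j0):
    N = len(prices)
    hits = 0
    for j, p in enumerate(prices, start=1):
        if abs(p_b_star - p) <= delta_p / 3:
            hits += 1
            if hits == j0:
                return j
    return N

# toy price path that moves to p_b_star = 2.0 from customer 6 onward
path = [1.0] * 5 + [2.0] * 10
print(k_hat_star(path, p_b_star=2.0, delta_p=0.6, j0=3))
```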
The next lemma, whose proof can be found in Appendix C, provides a guarantee on the performance of the stopping rule k̂* on the set ℬ_γ.

Lemma 2 For all ω ∈ ℬ_γ, we have

    0 ≤ k̂* − τ ≤ 2 j_0.    (B-6)

In other words, we have established that ℬ_γ ⊆ {ω : 0 ≤ k̂* − τ ≤ 2j_0} ⊆ {ω : |k̂* − τ| ≤ 2j_0}, which implies that

    sup_{1≤τ≤N+1} ℙ{ℬ_γ^c} ≥ sup_{1≤τ≤N+1} ℙ{ |k̂* − τ| > 2j_0 }.

Step 2. The following lemma, whose proof can be found in Appendix C, establishes a fundamental limit on the performance of any stopping rule.

Lemma 3 For some C > 0 and α > 0, any stopping rule k̂ must satisfy

    sup_{1≤τ≤N+1} ℙ^π_τ{ |k̂ − τ| > C log N } ≥ α.    (B-7)

Step 3. Set γ = (C_1 log N − h(δ_p/3))^+, where C_1 = h(δ_p/3) C/2. Then 2⌈γ/h(δ_p/3)⌉ ≤ C log N and

    sup_{1≤τ≤N+1} ℙ{ℬ_γ^c} ≥ sup_{1≤τ≤N+1} ℙ{ |k̂* − τ| > 2j_0 } ≥ sup_{1≤τ≤N+1} ℙ{ |k̂* − τ| > C log N } ≥(a) α,

where (a) follows from Lemma 3. The latter, in conjunction with Markov's inequality, implies that

    sup_{1≤τ≤N+1} [ J*(F_a, F_b, τ) − J^π(F_a, F_b, τ) ] = sup_{1≤τ≤N+1} 𝔼[ J*(F_a, F_b, τ) − 𝒥^π(F_a, F_b, τ) ]
    ≥ γ · sup_{1≤τ≤N+1} ℙ{ℬ_γ^c} ≥ α γ = α (C_1 log N − h(δ_p/3))^+,

which is of order log N. This completes the proof.

C Proofs of Auxiliary Results

Proof of Lemma 1. Note first that

    𝒦(ℙ^π_{l_{j+1}}, ℙ^π_{l_j}) = 𝔼^π_{l_{j+1}}[ log ( ℙ^π_{l_{j+1}}{Y_i : 1 ≤ i ≤ N} / ℙ^π_{l_j}{Y_i : 1 ≤ i ≤ N} ) ]
    = 𝔼^π_{l_{j+1}}[ log ( ℙ^π_{l_{j+1}}{Y_N | Y_i : 1 ≤ i ≤ N−1} ⋯ ℙ^π_{l_{j+1}}{Y_2 | Y_1} ℙ^π_{l_{j+1}}{Y_1} / ( ℙ^π_{l_j}{Y_N | Y_i : 1 ≤ i ≤ N−1} ⋯ ℙ^π_{l_j}{Y_2 | Y_1} ℙ^π_{l_j}{Y_1} ) ) ]
    = 𝔼^π_{l_{j+1}}[ log Π_{i=1}^{N} ( ℙ^π_{l_{j+1}}{Y_i | ℋ_{i−1}} / ℙ^π_{l_j}{Y_i | ℋ_{i−1}} ) ]
    = 𝔼^π_{l_{j+1}}[ Σ_{i=1}^{N} log ( ℙ^π_{l_{j+1}}{Y_i | ℋ_{i−1}} / ℙ^π_{l_j}{Y_i | ℋ_{i−1}} ) ]
    = 𝔼^π_{l_{j+1}}[ Σ_{i=1}^{N} log ( ( 1{1 ≤ i ≤ l_{j+1}−1} ℙ^π_a{Y_i | ℋ_{i−1}} + 1{l_{j+1} ≤ i ≤ N} ℙ^π_b{Y_i | ℋ_{i−1}} ) / ( 1{1 ≤ i ≤ l_j−1} ℙ^π_a{Y_i | ℋ_{i−1}} + 1{l_j ≤ i ≤ N} ℙ^π_b{Y_i | ℋ_{i−1}} ) ) ]
    = 𝔼^π_{l_{j+1}}[ Σ_{i=l_j}^{l_{j+1}−1} log ( ℙ_a(Y_i | ψ_i(ℋ_{i−1})) / ℙ_b(Y_i | ψ_i(ℋ_{i−1})) ) ].

Now, the term in the expectation above can be simplified to yield

    𝒦(ℙ^π_{l_{j+1}}, ℙ^π_{l_j}) = Σ_{i=l_j}^{l_{j+1}−1} 𝔼^π_{l_{j+1}}[ log ( ℙ_a(Y_i | ψ_i(ℋ_{i−1})) / ℙ_b(Y_i | ψ_i(ℋ_{i−1})) ) ]
    = Σ_{i=l_j}^{l_{j+1}−1} 𝔼^π_{l_{j+1}}[ 𝔼^π_{l_{j+1}}[ log ( ℙ_a(Y_i | ψ_i(ℋ_{i−1})) / ℙ_b(Y_i | ψ_i(ℋ_{i−1})) ) | ℋ_{i−1} ] ].

Let u = F̄_a(x) − F̄_a(p*_a) and v = F̄_b(x) − F̄_b(p*_a), and note that u, v ∈ (−F̄_a(p*_a), 1 − F̄_a(p*_a)). Then

    𝔼^π_{l_{j+1}}[ log ( ℙ_a(Y_i | ψ_i(ℋ_{i−1})) / ℙ_b(Y_i | ψ_i(ℋ_{i−1})) ) | ψ_i(ℋ_{i−1}) = x ]
    = ℙ_a(Y_i = 1 | x) log ( ℙ_a(Y_i = 1 | x) / ℙ_b(Y_i = 1 | x) ) + ℙ_a(Y_i = 0 | x) log ( ℙ_a(Y_i = 0 | x) / ℙ_b(Y_i = 0 | x) )
    = F̄_a(x) log ( F̄_a(x) / F̄_b(x) ) + F_a(x) log ( F_a(x) / F_b(x) )
    = [F̄_a(p*_a) + u] log ( (F̄_a(p*_a) + u) / (F̄_b(p*_a) + v) ) + [F_a(p*_a) − u] log ( (F_a(p*_a) − u) / (F_b(p*_a) − v) )
    = [F̄_a(p*_a) + u] [ log(1 + u/F̄_a(p*_a)) − log(1 + v/F̄_a(p*_a)) ] + [F_a(p*_a) − u] [ log(1 − u/F_a(p*_a)) − log(1 − v/F_a(p*_a)) ].

Note that for all y ∈ (−(1 − 2μ)/(1 − μ), (1 − 2μ)/μ), we have log(1 + y) ≤ y and −log(1 + y) ≤ −y + [(1 − μ)/μ]² y². This implies that

    𝔼^π_{l_{j+1}}[ log ( ℙ_a(Y_i | ψ_i(ℋ_{i−1})) / ℙ_b(Y_i | ψ_i(ℋ_{i−1})) ) | ψ_i(ℋ_{i−1}) = x ]
    ≤ [F̄_a(p*_a) + u] [ u/F̄_a(p*_a) − v/F̄_a(p*_a) + ((1 − μ)/μ)² v²/(F̄_a(p*_a))² ] + [F_a(p*_a) − u] [ −u/F_a(p*_a) + v/F_a(p*_a) + ((1 − μ)/μ)² v²/(F_a(p*_a))² ]
    ≤ ((1 − μ)/μ)² [ (F̄_a(p*_a) + u)/(F̄_a(p*_a))² + (F_a(p*_a) − u)/(F_a(p*_a))² ] v² + [ 1/F̄_a(p*_a) + 1/F_a(p*_a) ] [ u² + v² − uv ]
    ≤ [ 4(1 − μ)²/μ³ + 2/μ ] min{ K² |x − p*_a|², 1 }
    ≤ (1/C_𝒦) h(|p*_a − x|),

where

    C_𝒦 = [ ( 4(1 − μ)²/μ³ + 2/μ ) max{K², 1} / min{α, γ} ]^{−1},

and μ was defined in Assumption 1 ii.). We deduce that

    𝔼^π_{l_{j+1}}[ Σ_{i=l_j}^{l_{j+1}−1} h(|p*_a − ψ_i(ℋ_{i−1})|) ] ≥ C_𝒦 𝒦(ℙ^π_{l_{j+1}}, ℙ^π_{l_j}),

and the proof is complete.

Proof of Lemma 2. Consider ω ∈ ℬ_γ and suppose (B-6) is not true. Then two cases need to be considered.

i.) Suppose first that k̂* − τ > 2j_0.
Then we have that

    Σ_{i=τ}^{k̂*} h(|p*_b − ψ_i(ℋ_{i−1})|) ≥ Σ_{i=τ}^{k̂*} h(|p*_b − ψ_i(ℋ_{i−1})|) 1{|p*_b − ψ_i(ℋ_{i−1})| > δ_p/3}
    ≥(a) h(δ_p/3) Σ_{i=τ}^{k̂*} 1{|p*_b − ψ_i(ℋ_{i−1})| > δ_p/3}
    ≥(b) (j_0 + 1) h(δ_p/3)
    >(c) γ,

where (a) follows from the fact that h(⋅) is increasing; (b) follows from the definition of k̂* and the fact that k̂* − τ > 2j_0 (at most j_0 of the more than 2j_0 indices in {τ, ..., k̂*} can carry a price within δ_p/3 of p*_b); and (c) follows from the definition of j_0 in (B-4). However, the last inequality is in contradiction with (B-3). Hence necessarily, k̂* − τ ≤ 2j_0.

ii.) Suppose now that k̂* − τ < 0. Then we have that

    Σ_{i=1}^{k̂*} h(|p*_a − ψ_i(ℋ_{i−1})|) ≥ Σ_{i=1}^{k̂*} h(|p*_a − p*_b + p*_b − ψ_i(ℋ_{i−1})|) 1{|p*_b − ψ_i(ℋ_{i−1})| ≤ δ_p/3}
    ≥(a) h(δ_p − δ_p/3) Σ_{i=1}^{k̂*} 1{|p*_b − ψ_i(ℋ_{i−1})| ≤ δ_p/3}
    ≥ j_0 h(2δ_p/3)
    >(b) j_0 h(δ_p/3)
    ≥(c) γ,

where (a) and (b) follow from the fact that h(⋅) is strictly increasing; and (c) follows from the definition of j_0 in (B-4). The last inequality is in contradiction with (B-3). Hence, k̂* − τ ≥ 0.

We conclude from the two cases that (B-6) is necessarily true and the result is established.

Proof of Lemma 3. Let k̂ be any stopping rule. We develop an argument parallel to that of Korostelev (1987). Define

    φ := sup{ log ( ℙ_a(y | p) / ℙ_b(y | p) ) : y = 0, 1, p ∈ [p̲, p̄] },    (C-1)

and note that Assumption 1 ii.) ensures that φ is well defined. If φ = 0, then ℙ_a = ℙ_b and p*_a = p*_b, and one can achieve zero regret. We assume from now on that φ > 0. Fix β ∈ (0, 1) and let

    B(φ, β, x) := [ 2 log β − φ − log 2 + log x − log(1 + log x) ] / log x.

Note that B(φ, β, x) is increasing in x for x > exp(exp(1)) and converges to 1 as x → ∞. Let n_0 = min{ j : j ∈ ℕ, j ≥ exp(exp(1)), B(φ, β, j) > 0 }. Choose C > 0 as follows:

    C = min{ B(φ, β, n_0)/φ, 1 }.    (C-2)

Note also that for all n ≥ n_0, C ≤ (1/φ) B(φ, β, n). Let g(x) = x (C log x + 1)^{−1} and note that g(⋅) is increasing for x ≥ 2 (since C ≤ 1) and tends to infinity as x → ∞. Let

    n_1 = min{ j : j ∈ ℕ, j ≥ n_0, g(j) ≥ 3/2 }.    (C-3)

We distinguish between two cases.

Case 1. N ≥ n_1. Let Δ = ⌈C log N⌉ and Ñ = ⌈N/Δ⌉.
Note that N/Δ ≥ N (C log N + 1)^{−1} ≥ 3/2 by the choice of n_1 in (C-3), and hence Ñ ≥ 2. Define

    l_j = 1 + (j − 1)Δ, j = 1, ..., Ñ,    l_{Ñ+1} = N.

Letting ℙ^π_{l_j} denote the probability associated with the observations when the change occurs at τ = l_j, define

    Z_j := log ( ℙ^π_{l_{j+1}}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} / ℙ^π_{l_j}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} ).

Using a conditioning argument as in Lemma 1 (for the simplification of 𝒦(ℙ^π_{l_{j+1}}, ℙ^π_{l_j})) yields that

    Z_j = Σ_{i=l_j}^{l_{j+1}−1} log ( ℙ_a(Y_i | ψ_i(ℋ_{i−1})) / ℙ_b(Y_i | ψ_i(ℋ_{i−1})) ).    (C-4)

Suppose for a moment that in the current case where N ≥ n_1, we have

    max_{1≤j≤Ñ−1} ℙ^π_{l_j}{ |k̂ − τ| > Δ/3 } < 1 − β.    (C-5)

Let 𝒜_j denote the event {ω : |k̂ − l_j| ≤ Δ/3} and note that the 𝒜_j, j = 1, ..., Ñ − 1, are disjoint events and that {ω : |k̂ − l_Ñ| > Δ} ⊇ ∪_{j=1}^{Ñ−1} 𝒜_j. This implies that

    ℙ_{l_Ñ}{ |k̂ − τ| > Δ } ≥ ℙ_{l_Ñ}{ ∪_{j=1}^{Ñ−1} 𝒜_j }
    = Σ_{j=1}^{Ñ−1} ℙ_{l_Ñ}{ |k̂ − l_j| ≤ Δ/3 }
    = Σ_{j=1}^{Ñ−1} 𝔼_{l_Ñ}[ 1{|k̂ − l_j| ≤ Δ/3} ]
    =(a) Σ_{j=1}^{Ñ−1} 𝔼_{l_j}[ exp(Z_j) 1{|k̂ − l_j| ≤ Δ/3} ].
(C-6)

For (a) above we have used the fact that for any ℋ_{l_{j+1}−1}-measurable random variable V,

    𝔼_{l_Ñ}[V] = 𝔼^π_{l_{j+1}}[ ( ℙ^π_{l_Ñ}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} / ℙ^π_{l_{j+1}}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} ) V ]
    = 𝔼^π_{l_{j+1}}[ 𝔼^π_{l_{j+1}}[ ( ℙ^π_{l_Ñ}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} / ℙ^π_{l_{j+1}}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} ) V | ℋ_{l_{j+1}−1} ] ]
    =(b) 𝔼^π_{l_{j+1}}[ V 𝔼^π_{l_{j+1}}[ ℙ^π_{l_Ñ}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} / ℙ^π_{l_{j+1}}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} | ℋ_{l_{j+1}−1} ] ]
    =(c) 𝔼^π_{l_{j+1}}[V]
    =(d) 𝔼_{l_j}[ exp(Z_j) V ],

where (b) follows from the fact that V is ℋ_{l_{j+1}−1}-measurable; (d) follows from a standard change of measure; and (c) follows from the chain of equalities below:

    𝔼^π_{l_{j+1}}[ ℙ^π_{l_Ñ}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} / ℙ^π_{l_{j+1}}{Y_i : 1 ≤ i ≤ l_{Ñ+1}} | ℋ_{l_{j+1}−1} ]
    = 𝔼^π_{l_{j+1}}[ ( ℙ^π_{l_Ñ}{Y_i : l_{j+1} ≤ i ≤ l_{Ñ+1} | Y_i : 1 ≤ i ≤ l_{j+1}−1} ℙ^π_{l_Ñ}{Y_i : 1 ≤ i ≤ l_{j+1}−1} ) / ( ℙ^π_{l_{j+1}}{Y_i : l_{j+1} ≤ i ≤ l_{Ñ+1} | Y_i : 1 ≤ i ≤ l_{j+1}−1} ℙ^π_{l_{j+1}}{Y_i : 1 ≤ i ≤ l_{j+1}−1} ) | ℋ_{l_{j+1}−1} ]
    = 𝔼^π_{l_{j+1}}[ ℙ^π_{l_Ñ}{Y_i : l_{j+1} ≤ i ≤ l_{Ñ+1} | Y_i : 1 ≤ i ≤ l_{j+1}−1} / ℙ^π_{l_{j+1}}{Y_i : l_{j+1} ≤ i ≤ l_{Ñ+1} | Y_i : 1 ≤ i ≤ l_{j+1}−1} | ℋ_{l_{j+1}−1} ]
    = 1,

where the second equality uses the fact that the first l_{j+1} − 1 observations have the same distribution (governed by ℙ^π_a) under both measures, and the last equality follows by integrating the conditional likelihood ratio over the future observations.

Note that by (C-4), Z_j is the sum of Δ random variables, each lower bounded by −φ (cf. the definition of φ in (C-1)). Hence Z_j ≥ −Δφ, and one can lower bound every term in the sum in (C-6) as follows:

    𝔼_{l_j}[ exp(Z_j) 1{|k̂ − l_j| ≤ Δ/3} ] ≥ 𝔼_{l_j}[ exp(−φ(C log N + 1)) 1{|k̂ − l_j| ≤ Δ/3} ] = (exp(−φ)/N^{Cφ}) 𝔼_{l_j}[ 1{|k̂ − l_j| ≤ Δ/3} ].
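The change-of-measure identity just established can be checked by simulation in a two-law Bernoulli example; the sketch below is ours (fixed prices, illustrative sale probabilities and batch length) and verifies that the expectation of a batch statistic under one law matches the likelihood-ratio-weighted expectation under the other:

```python
import math, random

# Simulation sketch (ours) of the change-of-measure step: for a statistic V of
# the batch, E under the "a" law equals E of exp(log-likelihood ratio) * V
# under the "b" law. Bernoulli parameters and batch length are illustrative.
random.seed(3)
qa, qb, batch = 0.6, 0.3, 5

def sample(q):              # one batch of sale indicators under probability q
    return [random.random() < q for _ in range(batch)]

def loglr(ys):              # log P_a(ys) - log P_b(ys) for an i.i.d. batch
    return sum(math.log(qa / qb) if y else math.log((1 - qa) / (1 - qb)) for y in ys)

V = lambda ys: float(sum(ys) >= 3)   # a batch-measurable statistic
M = 200_000
lhs = sum(V(sample(qa)) for _ in range(M)) / M
rhs = sum(math.exp(loglr(ys := sample(qb))) * V(ys) for _ in range(M)) / M
print(round(lhs, 2), round(rhs, 2))
```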
Going back to (C-6), we have

    ℙ_{l_Ñ}{ |k̂ − τ| > Δ } ≥ (exp(−φ)/N^{Cφ}) Σ_{j=1}^{Ñ−1} 𝔼_{l_j}[ 1{|k̂ − l_j| ≤ Δ/3} ]
    ≥ (exp(−φ)/N^{Cφ}) (Ñ − 1) min_{1≤j≤Ñ−1} ℙ^π_{l_j}{ |k̂ − τ| ≤ Δ/3 }.    (C-7)

On the other hand, noting that Ñ − 1 ≥ Ñ/2 and that C ≤ 1, we have

    log ( exp(−φ) N^{−Cφ} (Ñ − 1) ) ≥ −φ + log ( N^{−Cφ} N/(2 log N + 2) )
    = −φ + (1 − Cφ) log N − log(1 + log N) − log 2
    = log N [ −Cφ + B(φ, β, N) ] − 2 log β
    ≥(a) −2 log β,

where (a) follows from the definition of C in (C-2) and the fact that N ≥ n_1 ≥ n_0 (so that Cφ ≤ B(φ, β, N)). This implies that exp(−φ) N^{−Cφ} (Ñ − 1) ≥ 1/β². Hence we have

    ℙ_{l_Ñ}{ |k̂ − τ| > Δ } ≥ (1/β²) min_{1≤j≤Ñ−1} ℙ^π_{l_j}{ |k̂ − τ| ≤ Δ/3 } ≥(a) (1/β²) β = 1/β > 1,

where (a) follows from assumption (C-5). This is a contradiction since ℙ_{l_Ñ}{|k̂ − τ| > Δ} ≤ 1. Hence we conclude that (C-5) cannot hold and necessarily,

    max_{1≤j≤Ñ−1} ℙ^π_{l_j}{ |k̂ − τ| > Δ/3 } ≥ 1 − β,

implying that for all cases such that N ≥ n_1,

    sup_{1≤τ≤N+1} ℙ^π_τ{ |k̂ − τ| > (C/3) log N } ≥ 1 − β.    (C-8)

Case 2. N < n_1. Let τ_1 = N − 1 and τ_2 = N. Suppose ℙ_{τ_1}{|k̂ − τ_1| = 0} ≥ 1/2; then

    ℙ_{τ_2}{ |k̂ − τ_1| = 0 } = 𝔼_{τ_2}[ 1{|k̂ − τ_1| = 0} ]
    = 𝔼_{τ_1}[ exp{ log ( ℙ_a{Y_{N−1}} / ℙ_b{Y_{N−1}} ) } 1{|k̂ − τ_1| = 0} ]
    ≥ exp{−φ} 𝔼_{τ_1}[ 1{|k̂ − τ_1| = 0} ]
    ≥ (1/2) exp{−φ},

and hence ℙ_{τ_2}{|k̂ − τ_2| ≥ 1} ≥ (1/2) exp{−φ}. We deduce that

    sup_{τ∈{τ_1, τ_2}} ℙ^π_τ{ |k̂ − τ| ≥ 1 } ≥ (1/2) exp{−φ}.

Noting that in the current case, 1 ≥ N/n_1 ≥ (1/n_1) log N, we deduce that

    sup_{1≤τ≤N+1} ℙ^π_τ{ |k̂ − τ| ≥ (1/n_1) log N } ≥ (1/2) exp{−φ}.

Combining the two cases, one obtains

    sup_{1≤τ≤N+1} ℙ^π_τ{ |k̂ − τ| ≥ C_1 log N } ≥ α,

with C_1 = min{C/3, 1/n_1} and α = min{1 − β, (1/2) exp{−φ}}, which concludes the proof.
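As a closing numerical aside on Lemma 1: the key inequality there dominates a per-observation Bernoulli KL divergence by a constant times a squared probability gap. The check below is ours; the constant 1/(μ(1−μ)) is a crude chi-square-style choice of our own, not the paper's C_𝒦, and the grid of probabilities is illustrative.

```python
import math

# Numeric aside (ours) on the inequality behind Lemma 1: the per-observation
# Bernoulli Kullback-Leibler divergence is dominated by a constant (depending
# only on mu) times the squared gap in purchase probabilities, when both
# probabilities lie in [mu, 1 - mu]. We use the standard KL <= chi-square bound.
def bern_kl(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

mu = 0.1
const = 1.0 / (mu * (1 - mu))   # crude chi-square style constant, our choice
grid = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ok = all(bern_kl(a, b) <= const * (a - b) ** 2 for a in grid for b in grid)
print(ok)
```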