On the Minimax Complexity of Pricing in a Changing Environment

Omar Besbes∗ and Assaf Zeevi†
Columbia University
First submitted: June 22, 2008 / Revised: April 24, 2009, September 2, 2009
Abstract
We consider a pricing problem in an environment where the customers’ willingness-to-pay
(WtP) distribution may change at some point over the selling horizon. Customers arrive sequentially and make purchase decisions based on a quoted price and their private reservation
price. The seller knows the WtP distribution pre- and post-change, but does not know the time
at which this change occurs. The performance of a pricing policy is measured in terms of regret:
the loss in revenues relative to an oracle that knows the time of change prior to the start of the
selling season. We derive lower bounds on the worst case regret and develop pricing strategies
that achieve the order of these bounds, thus establishing the complexity of the pricing problem. Our results shed light on the role of price experimentation, and its necessity for optimal
detection of changes in market response / WtP. Our formulation allows for essentially arbitrary
consumer WtP distributions, and purchase request patterns.
Keywords: pricing, non-stationary demand, estimation, detection, change-point, price experimentation
1 Introduction
1.1 Overview of the problem and main objectives
Overview of the problem. A product is to be sold over a finite selling season with prospective
buyers arriving sequentially over time, each endowed with a reservation price modeled as an independent draw from a common willingness-to-pay (WtP) distribution. The product is purchased if
the buyer’s reservation price exceeds the posted price; otherwise the buyer leaves without making
a purchase. The objective of the decision maker (a monopolist seller) is to adjust prices over time
so as to maximize expected cumulative revenues.
∗ Graduate School of Business, e-mail: [email protected]
† Graduate School of Business, e-mail: [email protected]
The bulk of the academic literature that focuses on such pricing problems, mostly found within
the field of revenue management, assumes that the WtP distribution does not change over time;
in other words, consumer preferences are assumed to be time-homogeneous. Those papers that do
relax this assumption typically endow the decision maker with exact foreknowledge of the time-dependent characteristics of demand. (See the literature review in Section 1.4 for further details
and references.) While such assumptions ensure mathematical tractability, they tend to ignore an
important element of realism that exists in many real-world settings, where various market effects
and intrinsic consumer time-preferences introduce shifts in the market response during the course
of a selling season, typically in a manner that is not fully observed or predictable.
The purpose of this paper is to study a family of stylized pricing problems that allows the WtP
distribution (and hence market response) to change over the course of a selling season, at a time
that is unknown to the decision maker. The primary objective is to quantify the complexity of such
problems and establish fundamental limits on what can and cannot be achieved by any pricing
policy. In particular, our goal is to shed light on two fundamental issues: i.) the pivotal role of
dynamic pricing and price experimentation in monitoring market response; and ii.) the trade-off
between accurate detection of market change (exploration) and revenue optimization (exploitation).
In this paper we consider a simple instance of an uncertain change in the environment: a WtP
distribution remains constant up until an unknown time, at which point it shifts to a new distribution from which buyers’ reservation prices are drawn up until the end of the selling horizon.
The decision maker needs to devise pricing policies that maximize revenues over the relevant time
horizon, by properly adapting to said change whose effects can only be observed indirectly via
customer purchase decisions.
In order to “zoom in” on the impact of the unknown change in market response, we assume that
the two WtP distributions, pre- and post-change, are revealed to the decision maker at the start of
the selling horizon (but are otherwise quite arbitrary in structure and in particular need not belong
to any parametric family). Endowing the decision maker with such information eliminates complexities that stem from learning the underlying distributions. Moreover, this formulation represents
the minimal departure from antecedent literature, in which both the WtP distributions and their
time dependence are assumed to be known a priori. Ultimately, by furthering understanding of the
issues at play here, this paper strives to provide a foundation for tackling the more complicated
problem of jointly learning the underlying distributions and detecting changes in their structure.
We come back to discuss this point in Section 6.
For the purpose of this paper we will ignore inventory considerations. From an analytical perspective, much in the spirit of the remarks made in the previous paragraph, this allows us to crisply
identify the implications of a change in the WtP distributions, and characterize the complexity of
this problem without masking it by other effects. From a more practical standpoint, we note that
such settings arise frequently in many “real world” pricing and revenue optimization problems,
with notable examples in financial services involving secured or unsecured loans, mortgages, credit
offerings, etc. Below is a simple illustrative example of such a setting, including some empirical
observations that support our modeling paradigm and main focus.
An illustrative example. An on-line auto-lender operating in a direct-to-consumer sales channel quotes rates to interested customers given information they supply. Based on the characteristics
of the customers, a FICO score (a measure of creditworthiness) is computed, and together with the
requested loan type serves as input variables to an optimization problem. The output of this problem is an offered annual percentage rate (APR), which serves as the decision variable. For all
practical purposes, the lender is not faced with constraints on the amount of funds it can allocate
towards these loans.
The data set contains all instances of incoming customers who were offered a loan during a
period ranging from December 2003 to December 2004, as well as their ultimate decision (accept or
reject); see Besbes et al. (2008) for a detailed description of the data and sales process. Customers
are segmented into tiers, and for each such tier the acceptance probability is modeled by $\exp\{\theta_1 + \theta_2 x\}\big(1 + \exp\{\theta_1 + \theta_2 x\}\big)^{-1}$, where 𝑥 is the rate offered and 𝜃1 , 𝜃2 are two parameters to be estimated. This is the familiar logit family of response functions.
We focus on a given segment of customer characteristics and estimate the logit model as a
function of the APR offered by the firm. Considering two consecutive periods of 6 months, the
parameters obtained were as follows:
a) period 1 running from December 2003 to May 2004: 𝜃1 = 0.89, 𝜃2 = −0.30.
b) period 2 running from June 2004 to December 2004: 𝜃1 = 2.71, 𝜃2 = −0.58.
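As a quick illustration (ours, not part of the original analysis), the two estimated response curves can be evaluated directly from the reported parameter values; the function name below is our own.

```python
from math import exp

def accept_prob(x, theta1, theta2):
    """Logit acceptance probability exp{t1 + t2*x} / (1 + exp{t1 + t2*x})
    at offered rate x (APR, in percent)."""
    z = exp(theta1 + theta2 * x)
    return z / (1.0 + z)

# Parameter estimates reported in the text.
PERIOD_1 = (0.89, -0.30)  # December 2003 to May 2004
PERIOD_2 = (2.71, -0.58)  # June 2004 to December 2004

for rate in (4.0, 5.0, 6.0, 7.0):
    print(f"APR {rate:.1f}%: period 1 -> {accept_prob(rate, *PERIOD_1):.3f}, "
          f"period 2 -> {accept_prob(rate, *PERIOD_2):.3f}")
```

For example, at an APR of 4%, the period-2 acceptance probability is markedly higher than the period-1 probability, consistent with the gap between the two curves in Figure 1.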
The two logit curves corresponding to the above parameters are depicted in Figure 1. We observe
that the two curves differ significantly, but no clear connection to macroeconomic variables (e.g.,
cost-of-funds or inter-bank borrowing rates), or changes in market landscape (new product offerings
etc.) could be established or were available. In this setting, one of the major challenges for the
lender is to detect changes in the demand environment and adjust its APR to “match” them in real time, basing such decisions only on the data it has been able to collect up until that point.
1.2 Summary of the main results
The performance of a pricing policy will be measured relative to the best achievable performance
corresponding to a (clairvoyant) oracle that knows the value of the change-point, the time at which
the WtP changes, at the start of the selling season. The difference in revenues between the latter
and the former defines the regret; the smaller the regret the better the performance of the policy.
Unlike the oracle, the seller is restricted to non-anticipating policies, i.e., policies in which prices
[Figure 1: plot of the two estimated response curves against the rate (APR) over the range 4–7.]

Figure 1: Estimated logit models as a function of the offered annual percentage rate (APR): period 1 corresponds to December 2003 to May 2004; and period 2 corresponds to June 2004 to December 2004.
chosen at any given point in time are only allowed to depend on past prices and observed purchase
decisions. Our objective is to characterize the minimax regret, i.e., the minimal worst case regret
(where “worst” is relative to all possible locations of the time of change). This criterion ensures
that a policy is robust and exhibits “good” performance irrespective of the timing of change in
market response.
Below we list the main analytical results of the paper, and subsequently (Section 1.3) interpret
them and discuss some of the qualitative insights that emerge. For this purpose, let us denote by
𝑁 the total number of consumers requesting a price quote over the selling horizon; the best achievable performance then generates revenues of order 𝑁 .
i.) We prove that the worst case regret of any admissible pricing policy is at least of order 𝑁 1/2 ;
see the lower bound in Theorem 1. That is, any policy must incur losses of order 𝑁 1/2 relative
to the revenues generated by the oracle which are of order 𝑁 .
ii.) We propose a pricing strategy that actively monitors market response via price experimentation and observation of purchase decisions; see Algorithm 1. The policy is shown to achieve
the lower bound described above (up to a logarithmic term), and hence is essentially minimax
optimal.
iii.) We elucidate the tension between the contradicting objectives of accurate detection of a change
and revenue maximization; see Propositions 1 and 2. It is effectively seen that any “good”
policy must price-experiment in a significant manner in order to balance this trade-off.
iv.) We highlight an intuitive structural property of the pre- and post-change WtP distributions
under which the minimax complexity is of significantly smaller order: log 𝑁 . This follows
from upper and lower bounds established in Theorems 3 and 4, respectively. In this setting,
optimal policies can be found within the more restricted class of “passive” pricing algorithms
that do not price-experiment (see Algorithm 2).
At a higher level, a further contribution of the paper is in bringing together two strands of literature, in particular, porting tools from sequential analysis to an operations research setting, where change-point detection is executed jointly with system control (pricing); see further discussion in Section 1.4.
1.3 Qualitative insights and significance of the main results
1. The value of information. Our results establish that it is possible to design non-anticipating
pricing policies whose expected revenue performance is very close to that of clairvoyant ones.
Specifically, an oracle with advance knowledge of the change-point only gains additional revenue
of the order of the square root of the total revenues accumulated by our proposed policies (and under further structural restrictions, the value of this information translates into a mere logarithmic-order difference in revenues). Viewed on a normalized scale, these results establish that
the average revenue extracted per customer converges to the oracle performance as the number
of customers requesting a quote grows, and hence the value of prior information on the change in
WtP diminishes.
2. Price experimentation and the value of dynamic pricing. The pricing policies proposed in this paper are constructed to balance two contradicting objectives: continuous price experimentation increases the ability to detect sudden changes in the market, yet simultaneously causes
a deterioration in the instantaneous revenue rate. Resolving this tension yields the precise frequency and extent of experimentation that guarantee good performance. In particular, our results
establish that it suffices to price-experiment on roughly the square root of the total number of customers
arriving over the time horizon. Unlike traditional settings, here dynamic pricing is not driven by
dynamic programming considerations but rather by the need to balance revenue losses resulting from experimentation against potential gains from accurate detection.
3. Robustness. It is worthwhile noting that our proposed policies rely only on observed
purchase decisions and do not build on any specific assumptions on the time of change in WtP
distribution. For example, no prior distribution over the change-point, or dependence on other
exogenous variables is assumed. A choice of prior in the current non-stationary setting would be
hard to justify, and while it might be recognized that dependence on exogenous variables exists
in various settings, one often still faces significant uncertainty with regard to the length of time
between a change in exogenous variables and a change in customers’ WtP. A feature of the proposed
approach is that it relies on minimal assumptions.
The remainder of the paper. The next section concludes the introduction with a review of
related work. Section 2 formulates the problem. Section 3 provides a fundamental limit on the
performance of any pricing policy and analyzes an active pricing scheme with a performance that
achieves this limit. Section 4 characterizes the best achievable performance under a restricted class
of market conditions. Section 5 presents a set of numerical illustrations and qualitative insights
and Section 6 discusses extensions to the present work. Appendices A and B contain the proofs of
the main results while Appendix C details the proofs of auxiliary results.
1.4 Literature review
Our work contributes, and is related to various streams of research.
Dynamic pricing. The presence of changes in the demand environment naturally drives dynamic pricing. This is illustrated in Gallego and van Ryzin (1997), where time-varying demand
models and corresponding pricing policies are analyzed. It is important to note that there the
temporal evolution of the demand model is known in advance. In the absence of model uncertainty,
dynamic pricing is typically driven by inventory and perishability considerations (see, e.g., Talluri
and van Ryzin (2005, Section 5.2) and Phillips (2005) for recent overviews). Lobo (2007) and Besbes and Zeevi (2007) help shed light on how dynamic pricing is used to uncover an unknown (yet
static) WtP distribution. A related intermediate setting is found in Levin et al. (2008) who study
online learning in the presence of time (and inventory) dependent demand, where the unknown
parameters are static. The present paper illustrates the fundamental role of dynamic pricing in
monitoring and detecting potential changes in WtP distributions; to the best of our knowledge this
is the first study that establishes such a sharp characterization of the complexity of a non-stationary
pricing problem.
Experimentation in non-stationary and uncertain environments. A related study is
that of Keller and Rady (1999), where a continuous-time, infinite-horizon, quantity-setting problem
is analyzed in a Bayesian framework. A state of the world that characterizes parameters of the price
function evolves according to a continuous time Markov chain and the firm maximizes profits. The
authors identify two regimes that depend on the problem parameters and that are characterized by
extreme or moderate experimentation levels. The main difference relative to our study lies in the
dynamic programming logic adopted there, which is only possible due to the Markovian assumption
on the state of the world process. In contrast, our study does not rely on any assumption regarding
the change point and as a result requires a different methodology based on information theoretical
arguments.
Change-point detection. General ideas date back to the early work of Shewhart (1931) in
the context of product quality control, with more refined procedures developed by Page (1954),
Shiryayev (1963) and Roberts (1966). The typical formulation is one of minimizing the expected
time between the change itself and the detection of the change (aka detection delay), with a constraint on the so-called false alarm rate; see Shiryayev (1978), Siegmund (1985) and the recent
overview of Lai (2001). A particular instance of this is the so-called “standard” Poisson disorder
problem, where the objective is to detect a change in the mean of a Poisson process, where the time
of the change has an exponential distribution; see, e.g., Bayraktar et al. (2005) and the references
therein.
While our work is related to traditional change-point problems, there are two important distinguishing features stemming from our revenue management setting. First, performance in our
problem is naturally defined in terms of revenues, and “good” detection ability does not imply improved performance. In fact, here one trades off instantaneous performance with detection ability.
Second, most work in the literature takes the observations to be exogenous, and these are used to
detect the time of change. In the current setting, the observations associated with the decisions of
buyers to accept or decline a purchase are endogenous, as they depend on the quoted price which
is a decision variable. This endows our problem with a “closed-loop” nature that is absent in other
studies. Finally, it is also worth noting that our analysis is, by and large, “distribution free,” insofar
as very little is assumed on the arrival stream of potential buyers and the time of change.
Minimax and adversarial formulations. Problems associated with performance optimization
in uncertain and non-stationary environments have often been considered in the economics and
computer science literature, dating back to the early work of Hannan (1957); see Cesa-Bianchi and
Lugosi (2006) for a recent monograph on the subject. The prevalent formulation in this stream of
literature has been to allow nature to act in an adversarial manner at each point in time, while in
our setting, the adversary’s actions are restricted to the start of the time horizon. While this limits
the power of nature, it is also less conservative and guides the design of potentially more practical
policies. For example, in this setting the optimal oracle policy is allowed to be dynamic, as opposed
to the more common benchmark of static oracle policies considered in the aforementioned stream
of literature. This also implies that the performance of the oracle in our adversarial formulation,
which serves to benchmark our proposed policies, can be significantly better than the traditional
static benchmark.
2 Problem Formulation
The model. We consider a revenue management problem in which a firm (the “decision-maker”)
sells a single product over a planning or sales horizon during which 𝑁 buyers (“customers”) arrive
sequentially and request a price quote for the product. Each customer is assumed to have a
willingness-to-pay (WtP) or reservation price for the product; we will denote by 𝑉𝑖 the WtP of
the 𝑖-th customer. She or he purchases the product if and only if 𝑉𝑖 exceeds or equals the price
quoted by the decision-maker, 𝑝𝑖 . We assume that 𝑉𝑖 is a random variable with right continuous
cumulative distribution function 𝐹 (⋅; 𝑖).
We assume that the set of feasible prices is [𝑝, 𝑝], where 0 < 𝑝 < 𝑝 < ∞. For some 𝜏 ∈
{1, ..., 𝑁 + 1} and all 𝑖 ≥ 1, put
$$F(p; i) = \begin{cases} F_a(p) & \text{if } i < \tau, \\ F_b(p) & \text{if } i \ge \tau. \end{cases}$$
In other words, the response function or probability of purchase as a function of price, is assumed to
be identically equal to 𝐹¯𝑎 (⋅) up until the change-point 𝜏 , and is equal to 𝐹¯𝑏 (⋅) subsequent to that.¹
Here, and throughout the paper, for any cumulative distribution 𝐹 , 𝐹¯ will denote the complement
of 𝐹 , i.e., 𝐹¯ (⋅) = 1 − 𝐹 (⋅). To exclude trivial solutions, we assume that the response functions differ
at some feasible price. Fixing some 𝛿0 ∈ (0, 1), we let ℱ denote the class of pairs of distribution
functions (𝐹𝑎 (⋅), 𝐹𝑏 (⋅)) that satisfy
$$\max_{p \in [\underline{p}, \bar{p}]} \big| \bar{F}_b(p) - \bar{F}_a(p) \big| \;\ge\; \delta_0. \qquad (1)$$
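To make the change-point model concrete, the following sketch (ours, not the paper’s; all names and the linear response functions are illustrative) simulates the purchase outcomes 𝑌𝑖 = 1{𝑉𝑖 ≥ 𝑝𝑖 } under a single change-point.

```python
import random

def simulate_sales(prices, tau, F_bar_a, F_bar_b, seed=0):
    """Simulate purchase decisions for a fixed price sequence: customer i
    buys with probability F_bar_a(p_i) if i < tau, and F_bar_b(p_i) if i >= tau."""
    rng = random.Random(seed)
    return [1 if rng.random() < (F_bar_a(p) if i < tau else F_bar_b(p)) else 0
            for i, p in enumerate(prices, start=1)]

# Illustrative linear response functions (probabilities of purchase).
F_bar_a = lambda p: max(1.0 - 0.2 * p, 0.0)
F_bar_b = lambda p: max(0.75 - 0.1 * p, 0.0)

# Twenty customers, all quoted price 2.0, with the change at customer 11.
outcomes = simulate_sales([2.0] * 20, tau=11, F_bar_a=F_bar_a, F_bar_b=F_bar_b)
```

Note that the seller only observes the binary outcomes, so the effect of the change is visible only indirectly through the purchase frequency at the quoted prices.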
Admissible pricing policies. Let (𝑝𝑖 : 1 ≤ 𝑖 ≤ 𝑁 ) denote the price process which is assumed
to take values in [𝑝, 𝑝]. Let 𝑌𝑖 = 1{𝑉𝑖 ≥ 𝑝𝑖 } denote the sales outcome associated with the 𝑖-th
customer: 𝑌𝑖 = 1 indicates that customer 𝑖 purchased the product, while 𝑌𝑖 = 0 indicates that she
or he opted not to purchase it. Let {ℋ𝑖 , 𝑖 ≥ 0} denote the filtration or history associated with the process of prices and purchases, with ℋ0 = ∅ and $\mathcal{H}_i = \sigma\big((p_j, Y_j), 1 \le j \le i\big)$. A pricing policy is
said to be non-anticipating if it is adapted to the filtration {ℋ𝑖 , 𝑖 ≥ 0}, which means that the price
quoted to the (𝑖 + 1)𝑠𝑡 customer, 𝑝𝑖+1 , is ℋ𝑖 -measurable (i.e., determined by ℋ𝑖 ). We will restrict
attention to the set of non-anticipating policies denoted by 𝒫, and for any policy 𝜋 ∈ 𝒫, we denote
the price offered to the (𝑖 + 1)𝑠𝑡 customer, 𝑝𝑖+1 , by 𝜓𝑖+1 (ℋ𝑖 ); 𝜓 will be referred to as the price
mapping associated with policy 𝜋.
For any 𝜏 ∈ {1, ..., 𝑁 +1}, and any policy 𝜋 ∈ 𝒫, we will use ℙ𝜋𝜏 and 𝔼𝜋𝜏 to denote the probabilities
of events and expectations of random variables, respectively, when a change in response function
occurs at 𝜏 and the pricing policy 𝜋 is used.
Information structure and the decision maker’s objective. We assume that the response
functions before and after the change, 𝐹¯𝑎 (⋅) and 𝐹¯𝑏 (⋅), respectively, are known to the decision-maker;
however, she or he does not know the point of the change, 𝜏 . In addition, the decision-maker need
not know the number of customers 𝑁 that arrive over the horizon of interest. The only information
¹ The possibility of having continuous changes is discussed in Section 6.
available is that $N \ge 2$ and that for some known $N_0 \ge 2$ and $0 < \underline{\nu} < \bar{\nu} < \infty$,
$$\underline{\nu} N_0 \;\le\; N \;\le\; \bar{\nu} N_0. \qquad (2)$$
Roughly speaking, this condition states that the decision-maker knows the order of magnitude of
the number of potential customers arriving during the planning or sales horizon of interest, given by
the lower and upper bounds above. Note that no probabilistic assumptions are made with regard
to the arrival process.
For $p \in [\underline{p}, \bar{p}]$ and $\ell = a, b$, let $r_\ell(p) = p \bar{F}_\ell(p)$ denote the revenue function. Put $p^*_a \in \arg\max\{r_a(p) : p \in [\underline{p}, \bar{p}]\}$ to be a maximizer of the revenue function before the change, and similarly let $p^*_b \in \arg\max\{r_b(p) : p \in [\underline{p}, \bar{p}]\}$ denote a maximizer of the revenue function after the change.
If 𝜏 were known to the decision-maker at the start of the selling season, the optimal policy is
to quote 𝑝∗𝑎 to customers 1 through 𝜏 − 1 and 𝑝∗𝑏 to customers 𝜏 through 𝑁 . The corresponding
performance, denoted 𝐽 ∗ (𝐹𝑎 , 𝐹𝑏 , 𝜏 ), will be referred to as the “oracle” performance, as one would
need to have access to the value of 𝜏 to achieve it:
$$J^*(F_a, F_b, \tau) \;=\; \sum_{1 \le i < \tau} r_a(p^*_a) \;+\; \sum_{\tau \le i \le N} r_b(p^*_b). \qquad (3)$$
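To make (3) concrete, here is a small numerical sketch (ours, not the paper’s): the oracle revenue is computed with the maximizers approximated by grid search, and the response functions and price range are purely illustrative.

```python
def oracle_revenue(N, tau, F_bar_a, F_bar_b, p_lo=0.1, p_hi=4.0, grid=1000):
    """J*(F_a, F_b, tau) of equation (3): the clairvoyant oracle charges the
    pre-change optimal price to customers 1, ..., tau-1 and the post-change
    optimal price to customers tau, ..., N.  The maximizers p_a*, p_b* are
    approximated by grid search over the feasible price set [p_lo, p_hi]."""
    pts = [p_lo + k * (p_hi - p_lo) / grid for k in range(grid + 1)]
    r_a_star = max(p * F_bar_a(p) for p in pts)  # r_a(p_a*)
    r_b_star = max(p * F_bar_b(p) for p in pts)  # r_b(p_b*)
    return (tau - 1) * r_a_star + (N - tau + 1) * r_b_star

# Illustrative response functions (purchase probabilities).
F_bar_a = lambda p: max(1.0 - 0.2 * p, 0.0)
F_bar_b = lambda p: max(0.75 - 0.1 * p, 0.0)
```

With these response functions the single-period optima are $p^*_a = 2.5$ and $p^*_b = 3.75$, so the oracle revenue is linear in both 𝜏 and 𝑁.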
In contrast, the expected cumulative revenues over the 𝑁 customers for any admissible pricing
policy 𝜋 (and its associated price mapping 𝜓) are given by:
$$J^\pi(F_a, F_b, \tau) \;:=\; \mathbb{E}^\pi_\tau \Big[ \sum_{1 \le i \le N} \psi_i(\mathcal{H}_{i-1})\, \mathbf{1}\{V_i \ge \psi_i(\mathcal{H}_{i-1})\} \Big].$$
Clearly for all admissible policies 𝜋 ∈ 𝒫, 𝐽 𝜋 (𝐹𝑎 , 𝐹𝑏 , 𝜏 ) ≤ 𝐽 ∗ (𝐹𝑎 , 𝐹𝑏 , 𝜏 ) and the difference between
the two quantifies the degradation in revenues due to lack of prior knowledge on the time of change
𝜏 . Ideally, one would like to design policies that make this gap, also known as the regret, as small as possible. To ensure that the policies exhibit this behavior uniformly over a range of change-point scenarios, and to preclude simple-minded policies (such as “guessing” the value of 𝜏 at time 0), we adopt the following formulation.
Assuming “nature” first reveals the response functions to the decision-maker but keeps the
time of change hidden, the decision-maker then constructs and announces an admissible policy,
followed by nature selecting the change-point 𝜏 to maximize the difference between 𝐽 ∗ (𝐹𝑎 , 𝐹𝑏 , 𝜏 )
and 𝐽 𝜋 (𝐹𝑎 , 𝐹𝑏 , 𝜏 ). In particular, we define the minimax regret as follows²:
$$\mathcal{R}^*(N, \mathcal{F}) \;:=\; \sup_{(F_a, F_b) \in \mathcal{F}} \; \inf_{\pi \in \mathcal{P}} \; \sup_{1 \le \tau \le N+1} \big\{ J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau) \big\}. \qquad (4)$$
² 1) Note that here, to avoid cluttering the notation, we do not write explicitly that 𝑁 can be selected in an adversarial manner by nature, but the results developed hold for the worst-case value of 𝑁 satisfying condition (2). 2) While we focus in this paper on the absolute regret, it is worthwhile noting that the arguments developed could be used to analyze the minimax relative regret (i.e., the regret normalized by the oracle performance).
Informally, the decision-maker attempts to select a policy that tracks closely the oracle optimal
policy, so that 𝐽 𝜋 (𝐹𝑎 , 𝐹𝑏 , 𝜏 ) is “close” to 𝐽 ∗ (𝐹𝑎 , 𝐹𝑏 , 𝜏 ). When deciding where to price at a given
point in time, the decision-maker needs to consider both the impact on instantaneous performance
as well as the impact on the information gathered about the change 𝜏 . This trade-off will be
formalized in the coming sections and will guide the design of pricing policies.
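As a toy illustration of this trade-off, the worst-case regret of a purely static (non-experimenting) deterministic price path can be evaluated numerically, since its expected revenue has a closed form. The sketch below is ours, with illustrative linear response functions, and is not one of the paper’s policies.

```python
def expected_revenue(prices, tau, F_bar_a, F_bar_b):
    """Expected revenue of a deterministic price path when the change occurs
    at customer tau: customer i purchases with probability F_bar_a(p_i) if
    i < tau, and F_bar_b(p_i) otherwise."""
    return sum(p * (F_bar_a(p) if i < tau else F_bar_b(p))
               for i, p in enumerate(prices, start=1))

def worst_case_regret(prices, F_bar_a, F_bar_b, p_lo=0.1, p_hi=4.0, grid=1000):
    """sup over tau in {1, ..., N+1} of J*(tau) - J^pi(tau), for a fixed
    (non-adaptive) price path; the oracle's single-period optimal revenues
    are approximated by grid search."""
    N = len(prices)
    pts = [p_lo + k * (p_hi - p_lo) / grid for k in range(grid + 1)]
    ra = max(p * F_bar_a(p) for p in pts)  # r_a(p_a*)
    rb = max(p * F_bar_b(p) for p in pts)  # r_b(p_b*)
    return max((tau - 1) * ra + (N - tau + 1) * rb
               - expected_revenue(prices, tau, F_bar_a, F_bar_b)
               for tau in range(1, N + 2))

# Illustrative response functions; pricing statically at the pre-change
# optimum p_a* = 2.5 leaves a regret that grows linearly in N when the
# change occurs early (worst case: tau = 1).
F_bar_a = lambda p: max(1.0 - 0.2 * p, 0.0)
F_bar_b = lambda p: max(0.75 - 0.1 * p, 0.0)
```

This linear-in-𝑁 worst case for a static policy is what well-designed experimenting policies improve to order 𝑁^{1/2}.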
Computing (4) exactly is essentially intractable, due to the broad class of response functions under consideration and the absence of probabilistic structure on the time of change 𝜏 . In what follows,
we will be interested in gaining insights on the magnitude of ℛ∗ (𝑁, ℱ), especially its dependence
on the total number of customers 𝑁 , as well as designing policies with performance that “comes
close” to ℛ∗ (𝑁, ℱ). To that end, we introduce the following definition.
Definition 1 (minimax optimality) A policy 𝜋̂ ∈ 𝒫 is said to be optimal with respect to class ℱ if for each (𝐹𝑎 , 𝐹𝑏 ) ∈ ℱ and all 𝑁 ≥ 2,
$$\sup_{1 \le \tau \le N+1} \big\{ J^*(F_a, F_b, \tau) - J^{\hat{\pi}}(F_a, F_b, \tau) \big\} \;\le\; C\, \mathcal{R}^*(N, \mathcal{F}), \qquad (5)$$
for some constant 𝐶 ≥ 1 independent of 𝐹𝑎 and 𝐹𝑏 . A policy 𝜋̃ ∈ 𝒫 is said to be near-optimal with respect to class ℱ if for any 𝜖 > 0, for each (𝐹𝑎 , 𝐹𝑏 ) ∈ ℱ and all 𝑁 ≥ 2,
$$\sup_{1 \le \tau \le N+1} \big\{ J^*(F_a, F_b, \tau) - J^{\tilde{\pi}}(F_a, F_b, \tau) \big\} \;\le\; C\, \big( \mathcal{R}^*(N, \mathcal{F}) \big)^{1+\epsilon}, \qquad (6)$$
for some constant 𝐶 > 0 independent of 𝐹𝑎 and 𝐹𝑏 .
It is worth noting that the left-hand side in (5) and (6) is lower bounded by ℛ∗ (𝑁, ℱ) for
some configuration of response functions (𝐹𝑎 , 𝐹𝑏 ) ∈ ℱ. Note also that in general, one has that
𝐽 ∗ (𝐹𝑎 , 𝐹𝑏 , 𝜏 ) is of order 𝑁 , while ℛ∗ (𝑁, ℱ) is typically of a lower order of magnitude. Hence,
an optimal policy is one that achieves the minimax regret up to a multiplicative constant, i.e.,
that achieves the optimal order of magnitude of the revenue loss 𝐽 ∗ (𝐹𝑎 , 𝐹𝑏 , 𝜏 ) − 𝐽 𝜋 (𝐹𝑎 , 𝐹𝑏 , 𝜏 ). In
contrast, a near-optimal policy can have a worst case performance that exceeds “slightly” the order
of magnitude of the minimax regret, where this excess is encoded in (6).
Discussion of the model and problem formulation. Few modeling assumptions are made
in the present work. First, our formulation allows nature to choose the number of customer arrivals
in an adversarial manner, only subject to the constraint (2), and no probabilistic assumptions are
made with regard to this arrival process. Second, the WtP distributions are only assumed to satisfy
a separation of at least 𝛿0 at some price in [𝑝, 𝑝]. Finally, no probabilistic assumptions are made
with respect to the time of change. A possible alternative would have been to introduce a prior
distribution on the time of change, 𝐹𝜏 , leading to a Bayesian approach. Assuming for simplicity
that 𝑁 is known a priori to the decision-maker, one would then focus on the following objective
$$\mathcal{R}^*_B(N, \mathcal{F}, F_\tau) \;=\; \sup_{(F_a, F_b) \in \mathcal{F}} \; \inf_{\pi \in \mathcal{P}} \; \mathbb{E}\big[ J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau) \big], \qquad (7)$$
where the expectation is with respect to the prior on 𝜏 . From (4) and (7), it follows that for any
prior 𝐹𝜏 , ℛ∗𝐵 (𝑁, ℱ, 𝐹𝜏 ) ≤ ℛ∗ (𝑁, ℱ). In Section 6, we argue that the Bayesian and minimax settings are
essentially equivalent by showing that ℛ∗ (𝑁, ℱ) ≈ ℛ∗𝐵 (𝑁, ℱ, 𝐹𝜏 ) for certain priors.
3 Optimal Price Experimentation and Detection
3.1 A lower bound on best achievable performance
Theorem 1 For some constant 𝐶 > 0,
$$\mathcal{R}^*(N, \mathcal{F}) \;\ge\; C N^{1/2} \qquad (8)$$
for all 𝑁 ≥ 2.
This result asserts that any admissible policy must incur a revenue loss of at least order 𝑁 1/2
relative to the oracle pricing policy that knows the change-point 𝜏 . The value of the constant 𝐶 is computed explicitly in the proof.
The proof of Theorem 1, which is outlined in Section 3.2, provides important qualitative insights.
In particular, it quantifies the need for price experimentation as a consequence of the following tradeoff: the absence of price experimentation over a large number of customers could result in failure
of or late detection of a change in the market response; on the other hand, price experimentation
leads to potential revenue losses as one “steps away” from the pre-change optimal price. The proof
also reveals some insights that suggest the “correct” frequency of price experiments, and this will
be harnessed in Section 3.3 to develop a near-optimal policy.
3.2 Proof outline and intuition behind Theorem 1
Broad overview of the proof and key ideas. In broad strokes, the proof focuses on a subclass
ℱ1 ⊆ ℱ and establishes a lower bound on ℛ∗ (𝑁, ℱ1 ), which yields a lower bound on ℛ∗ (𝑁, ℱ)3 .
It is organized around two cases, dictated by the information accumulated about the change. If a
policy accumulates only limited information on the prospective change, one can establish a lower
bound on performance (see Proposition 1) which follows from this lack of information. Similarly, any
policy which accumulates “ample” information about the change increases the likelihood of accurate
detection of change, but at the expense of significant revenue losses due to experimentation (see
Proposition 2). These two cases yield the lower bound announced in Theorem 1. A critical aspect
of the proof is the quantification of information, and the link that is established between the level
³ While it would be sufficient to exhibit a pair of response functions in ℱ for which the performance is bounded below, we establish here a lower bound on the performance for any pair in a broad class ℱ1 to highlight the market conditions that lead to a “worst-case” performance.
of information accumulated and price experimentation. In particular, the argument gives rise to
a notion of “correct” order of information accumulation, which translates to a prescription for the
frequency of price experiments (see also Theorem 2 in Section 3.3).
Preliminaries. Fix some 𝜇 ∈ (0, 1/2), 𝛿𝑝 > 0, 𝛼, 𝛾 > 0 and 𝐾 > 0, and let ℱ1 consist of the pairs of response functions that satisfy the following assumption (where 𝛿0 is fixed as in the definition of ℱ).
Assumption 1
i.) $\max_{p \in [\underline{p}, \bar{p}]} |\bar{F}_a(p) - \bar{F}_b(p)| \ge \delta_0$.
ii.) $\bar{F}_a(p), \bar{F}_b(p) \in (\mu, 1 - \mu)$ for all $p \in [\underline{p}, \bar{p}]$, and $\bar{F}_a(\cdot)$ and $\bar{F}_b(\cdot)$ are Lipschitz with constant $K$.
iii.) $\bar{F}_a(p^*_a) = \bar{F}_b(p^*_a)$.
iv.) $|p^*_a - p^*_b| > \delta_p$.
v.) $r_i(p^*_i) - r_i(p) \ge h(p - p^*_i)$ for all $p \in [\underline{p}, \bar{p}]$, for $i = a, b$, where $h(x) = \min\{\alpha |x|^2, \gamma\}$ for all $x \in \mathbb{R}$.
From the above, it is clear that ℱ1 ⊂ ℱ. Important features of the class ℱ1 are the following: the
response functions cross at the pre-change optimal decision 𝑝∗𝑎 [𝑖𝑖𝑖.)]; there is a revenue deterioration
whenever stepping away from the current optimal decision [𝑣.)]; and the optimal decisions 𝑝∗𝑎
and 𝑝∗𝑏 are distinct [𝑖𝑣.)]. For the latter, note that if 𝑝∗𝑎 and 𝑝∗𝑏 were to coincide, then one could achieve a regret of zero by simply applying 𝑝∗𝑎 throughout the horizon. The Lipschitz assumption
[𝑖𝑖.)] is a technical condition that ensures that one cannot price close to 𝑝∗𝑎 and yet gather ample
information about the change. Assumption 1 is clearly satisfied by a wide range of response
functions. (For example, it is easy to check that this assumption is satisfied for 𝐹¯𝑎 (𝑝) = (1 − 0.2𝑝)+
and 𝐹¯𝑏 (𝑝) = (0.75 − 0.1𝑝)+ when e.g., [𝑝, 𝑝] = [0.1, 4], 𝛿0 = 0.2, 𝜇 = 0.1, 𝛿𝑝 = 0.5, 𝛼 = 0.1 and
𝛾 = 1.)
In order to gain some intuition on the main effects at play in characterizing ℛ∗ (𝑁, ℱ1 ), we will
focus on batches of customers of a given size 𝛥 ∈ {1, ..., 𝑁}. In particular, let us define:
𝑁̃ = ⌊𝑁/𝛥⌋ + 1,
𝑙𝑖 = 1 + (𝑖 − 1)𝛥,   𝑖 = 1, ..., 𝑁̃,
𝑙_{𝑁̃+1} = 𝑁.

The number of customer batches of size 𝛥 is given by (𝑁̃ − 1) ≥ 1, and 𝑙𝑖 is the index of the first customer in batch 𝑖, for 𝑖 = 1, ..., 𝑁̃ − 1. Here and in the rest of the manuscript, for any real number 𝑥, ⌈𝑥⌉ denotes the smallest integer larger than or equal to 𝑥 and ⌊𝑥⌋ the largest integer smaller than or equal to 𝑥.
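The batch bookkeeping above can be written out directly; the values 𝑁 = 1000 and 𝛥 = 100 used in the call below are hypothetical, chosen only for illustration:

```python
def batch_indices(N, Delta):
    """Return (N_tilde, l): N_tilde = floor(N / Delta) + 1 and
    l[i-1] = 1 + (i - 1) * Delta for i = 1, ..., N_tilde, with
    l_{N_tilde + 1} = N appended at the end."""
    N_tilde = N // Delta + 1
    l = [1 + (i - 1) * Delta for i in range(1, N_tilde + 1)]
    l.append(N)  # l_{N_tilde + 1} = N
    return N_tilde, l

N_tilde, l = batch_indices(N=1000, Delta=100)
# N_tilde = 11; the 10 batches of size 100 start at customers 1, 101, ..., 901
```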
In what follows, we let, for ℓ = 𝑎, 𝑏 and 𝑝 ∈ [𝑝, 𝑝], ℙℓ(1∣𝑝) = 1 − ℙℓ(0∣𝑝) = 𝐹̄ℓ(𝑝), and for 𝑗 = 1, ..., 𝑁̃ we let ℙ^𝜋_{𝑙_𝑗} denote the probability distribution of observed customers’ purchase decisions when policy 𝜋 ∈ 𝒫 is used and the change in response occurs at index 𝜏 = 𝑙𝑗. Let 𝒦(ℙ^𝜋_{𝑙_{𝑗+1}}, ℙ^𝜋_{𝑙_𝑗}) denote the Kullback-Leibler (KL) divergence (cf. Borovkov (1998)) between these two measures, which is given by (see Lemma 1 in Appendix C):

𝒦(ℙ^𝜋_{𝑙_{𝑗+1}}, ℙ^𝜋_{𝑙_𝑗}) = ∑_{𝑖=𝑙𝑗}^{𝑙_{𝑗+1}−1} 𝔼^𝜋_{𝑙_{𝑗+1}} [ log ( ℙ𝑎(𝑌𝑖 ∣ 𝜓𝑖(ℋ𝑖−1)) / ℙ𝑏(𝑌𝑖 ∣ 𝜓𝑖(ℋ𝑖−1)) ) ].   (9)
The KL-divergence is a measure of “distance” between two probability distributions.
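Each summand in (9) is, in effect, the KL divergence between two Bernoulli purchase distributions at the quoted price. A small sketch with assumed linear response curves that cross at 𝑝 = 2.5 (these curves are hypothetical, not taken from the paper) makes the role of the crossing point explicit:

```python
import math

def bernoulli_kl(qa, qb):
    """KL(Bern(qa) || Bern(qb)) for purchase probabilities qa = F_a(p) and
    qb = F_b(p) at a common quoted price p."""
    return qa * math.log(qa / qb) + (1 - qa) * math.log((1 - qa) / (1 - qb))

# Hypothetical linear response curves crossing at p = 2.5 (cf. Assumption 1 iii.):
F_a = lambda p: 1 - 0.2 * p
F_b = lambda p: 0.75 - 0.1 * p

print(bernoulli_kl(F_a(2.5), F_b(2.5)))      # 0.0: the crossing price is uninformative
print(bernoulli_kl(F_a(1.0), F_b(1.0)) > 0)  # True: other prices accumulate information
```

This is the precise sense in which pricing at 𝑝∗𝑎 contributes zero to the sum in (9), as Remark 2 below discusses.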
Case 1: Strategies with “limited” information gathering. We first treat the case in which
polices under-experiment.
Proposition 1 Suppose that (𝐹𝑎, 𝐹𝑏) belongs to ℱ1 and that 𝛥 ∈ {1, ..., 𝑁}. Then for any 𝛽 > 0 and any 𝜋 ∈ 𝒫 such that min_{1≤𝑗≤𝑁̃−1} 𝒦(ℙ^𝜋_{𝑙_{𝑗+1}}, ℙ^𝜋_{𝑙_𝑗}) ≤ 𝛽,

sup_{1≤𝜏≤𝑁+1} {𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽^𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)} ≥ 𝐶1𝛥,   (10)

where 𝐶1 depends only on the parameters of the class ℱ1 and 𝛽.
Remark 1 (Intuition and proof sketch). The result of Proposition 1 establishes a connection between
the “distance” that separates the probability distributions ℙ𝜋𝑙𝑗+1 and ℙ𝜋𝑙𝑗 , and the performance,
revenue-wise, that can be achieved: if the former is “small,” then the worst-case performance is at
least of order the batch size 𝛥. The intuition behind this is as follows. When the two probability
distributions are “close,” then no policy is able to distinguish sufficiently well between a change
occurring at time 𝑙𝑗 or 𝑙𝑗+1 . In particular, the probability of making an error in trying to distinguish
between the two is strictly positive, and independent of 𝛥 and the total number of customers. This
implies that for customers in some batch of size 𝛥, it is impossible to determine whether their WtP
distribution is 𝐹𝑎 or 𝐹𝑏. This, in turn, leads to the lower bound advertised above.
Remark 2 (Interpretation in terms of price experimentation). Consider two distributions (𝐹𝑎 , 𝐹𝑏 )
belonging to ℱ1 . By Assumption 1 𝑖𝑖𝑖.), they cross at 𝑝∗𝑎 . Going back to the expression for the
KL-divergence we observe that whenever one prices at 𝑝∗𝑎 , the associated term in (9) contributes
zero to the sum. In other words, each term in the sum, which is non-negative, contributes to the
sum only if there is a positive probability of pricing “away” from 𝑝∗𝑎 before a change occurs. As
a result, the decision-maker faces the following dilemma: pricing at 𝑝∗𝑎 maximizes instantaneous
revenue rate if the change has not yet occurred, but pricing too often at 𝑝∗𝑎 or “close” to 𝑝∗𝑎 prohibits
accumulating sufficient information about the change. The worst-case regret associated with the
policy is ultimately limited by 𝐶 1 𝛥, as highlighted in Proposition 1.
Case 2: Strategies with “ample” information gathering. The next result provides a lower bound on the performance of policies for which the KL-divergence exceeds a given 𝛽 for all batches. This corresponds to policies that price at least once away from 𝑝∗𝑎 in every batch of size 𝛥 prior to the change.

Proposition 2 Suppose that (𝐹𝑎, 𝐹𝑏) belongs to ℱ1 and that 𝛥 ∈ {1, ..., 𝑁}. Then for any 𝛽 > 0 and any 𝜋 ∈ 𝒫 such that 𝒦(ℙ^𝜋_{𝑙_{𝑗+1}}, ℙ^𝜋_{𝑙_𝑗}) > 𝛽 for all 𝑗 = 1, ..., 𝑁̃ − 1,

sup_{1≤𝜏≤𝑁+1} {𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽^𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)} ≥ 𝐶2(𝑁̃ − 1),   (11)

where 𝐶2 depends only on the parameters of the class ℱ1 and 𝛽.
Remark 3 (Intuition and proof sketch). When the “distance” between the probability distributions ℙ^𝜋_{𝑙_{𝑗+1}} and ℙ^𝜋_{𝑙_𝑗} is bounded below, it follows that the policy in question price-experiments away from 𝑝∗𝑎 sufficiently often throughout the batches that precede the change. Assumption 1 𝑣.) ensures that, prior to the change, deviating from 𝑝∗𝑎 results in a revenue deterioration. Now, if the change occurs toward the end of the horizon, at 𝜏 = 𝑙_{𝑁̃−1}, then this policy incurs losses almost throughout the horizon due to price experimentation, and this leads to the lower bound on the worst-case regret of 𝐶2(𝑁̃ − 1).
Combining the two cases. Proposition 1 establishes a limit on the performance of policies that use moderate price experimentation away from 𝑝∗𝑎, and Proposition 2 establishes a limit on policies that experiment too often. Combining the two results, we obtain the result of Theorem 1, which provides a universal bound on the performance of any admissible policy. Indeed, for any such policy 𝜋 ∈ 𝒫, only one of the two cases considered in Propositions 1 and 2 can occur. As a result, the worst-case regret over the class ℱ1 is lower bounded as follows: ℛ^∗(𝑁, ℱ1) ≥ min{𝐶1𝛥, 𝐶2(𝑁̃ − 1)}. Noting that 𝑁̃ ≈ 𝑁/𝛥 and selecting 𝛥 = 𝑁^{1/2} leads to ℛ^∗(𝑁, ℱ1) ≥ 𝐶′𝑁^{1/2} for some 𝐶′ > 0. Recalling the initial remark that ℛ^∗(𝑁, ℱ) ≥ ℛ^∗(𝑁, ℱ1), the result of Theorem 1 follows.
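The balancing act behind the choice 𝛥 = 𝑁^{1/2} can be illustrated numerically; the constants 𝐶1 = 𝐶2 = 1 below are hypothetical placeholders for those in Propositions 1 and 2:

```python
def lower_bound(N, Delta, C1=1.0, C2=1.0):
    """min of the two regret lower bounds from Propositions 1 and 2."""
    return min(C1 * Delta, C2 * (N / Delta - 1))

N = 10_000
best = max(range(1, N + 1), key=lambda D: lower_bound(N, D))
# best is within a few units of N ** 0.5 = 100, where the bound is of order N ** 0.5
```

Small 𝛥 makes the under-experimentation bound 𝐶1𝛥 weak; large 𝛥 makes the over-experimentation bound 𝐶2(𝑁/𝛥 − 1) weak; the two are of the same order precisely at 𝛥 ≈ 𝑁^{1/2}.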
3.3 The proposed policy and its performance
Let (𝐹𝑎 , 𝐹𝑏 ) be an arbitrary pair of response functions in ℱ and let 𝑝0 be a price in [𝑝, 𝑝] such that
∣𝐹¯𝑎 (𝑝0 ) − 𝐹¯𝑏 (𝑝0 )∣ ≥ 𝛿0 (such a price exists by condition (1)).
We introduce below a pricing policy defined through three positive constants that serve as tuning
parameters (𝑐𝑒, 𝑐𝑟, 𝜀). The general structure runs as follows: start by quoting 𝑝∗𝑎, but regularly quote 𝑝0 to a “small” number of customers, who constitute the price experimentation batch. The
rationale here is to monitor for a change in market response. After having observed the responses
of a batch of customers of size determined by 𝑐𝑟 and 𝑐𝑒 , one uses a decision rule based on the
observations of demand at 𝑝0 to declare the presence or the absence of a change. The role of 𝜀
here is to allow some slack in the decision rule, due to noise in the observations (stemming from
customer specific WtP). When a change is declared, switch to 𝑝∗𝑏 until the end of the horizon.
Algorithm 1: 𝝅(𝒄𝒆, 𝒄𝒓, 𝜺)

Step 1. Joint Pricing and Market Monitoring:
  Initialize: 𝑑𝑒𝑡𝑒𝑐𝑡 = 0, 𝑖 = 1, 𝑗 = 1
  Set 𝑛𝑟 = ⌈𝑐𝑟 𝑁0^{1/2}⌉.   [revenue extraction batch size]
  Set 𝑛𝑒 = ⌈𝑐𝑒 log 𝑁0⌉.   [price experimentation batch size]
  While 𝑑𝑒𝑡𝑒𝑐𝑡 = 0 and 𝑗 ≤ 𝑁,
    (a) Pricing:
      𝑝𝑗 = 𝑝∗𝑎 for 𝑗 = 𝑖, ..., 𝑖 + 𝑛𝑟 − 1   [revenue extraction]
      𝑝𝑗 = 𝑝0 for 𝑗 = 𝑖 + 𝑛𝑟, ..., 𝑖 + 𝑛𝑟 + 𝑛𝑒 − 1   [price experimentation]
    (b) Detection test:
      Compute
      𝐷̂ = (1/𝑛𝑒)[total sales from cust. 𝑖 + 𝑛𝑟 to cust. 𝑖 + 𝑛𝑟 + 𝑛𝑒 − 1] − 𝐹̄𝑎(𝑝0)
      [difference between empirical mean and true pre-change response at 𝑝0]
      If 𝐷̂ ∈ [−𝜀, 𝜀]
        𝑑𝑒𝑡𝑒𝑐𝑡 = 0   [difference “small” ⇒ no change detected]
      Else
        𝑑𝑒𝑡𝑒𝑐𝑡 = 1   [difference “large” ⇒ change detected]
      End
      𝑖 = 𝑖 + 𝑛𝑟 + 𝑛𝑒
  End

Step 2. Quote Adjustment:
  Set 𝑝𝑗 = 𝑝∗𝑏 for 𝑖 ≤ 𝑗 ≤ 𝑁   [quote post-change optimal price]
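For concreteness, the following Python sketch mimics the pseudo-code against simulated Bernoulli purchase responses. Everything outside the policy logic — the response curves, the change point, and the parameter values passed in any call — is a hypothetical assumption for illustration:

```python
import math
import random

def run_policy(F_a, F_b, p_a_star, p_b_star, p0, tau, N, N0, c_e, c_r, eps, rng):
    """Sketch of Algorithm 1: cycle between n_r quotes at p_a_star and n_e
    experimental quotes at p0; declare a change once the empirical purchase
    rate at p0 drifts more than eps away from F_a(p0)."""
    n_r = math.ceil(c_r * math.sqrt(N0))  # revenue extraction batch size
    n_e = math.ceil(c_e * math.log(N0))   # price experimentation batch size
    prices, j, detect = [], 0, False
    while not detect and j < N:
        sales = 0
        for k in range(n_r + n_e):
            if j >= N:
                break
            price = p_a_star if k < n_r else p0  # step (a): pricing
            F = F_a if j + 1 < tau else F_b      # customer j+1's WtP regime
            buy = rng.random() < F(price)        # simulated purchase decision
            if k >= n_r:
                sales += buy                     # count sales at p0 only
            prices.append(price)
            j += 1
        # step (b): detection test, run only after a complete experimentation batch
        if j < N and abs(sales / n_e - F_a(p0)) > eps:
            detect = True
    prices += [p_b_star] * (N - j)  # step 2: quote post-change optimal price
    return prices
```

With well-separated curves at 𝑝0, the returned price path should switch to p_b_star shortly after tau, mirroring the behavior depicted in Figure 3.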
Intuition. In Algorithm 1, one focuses on batches of customers of size 𝑛𝑟 + 𝑛𝑒 of order 𝑁 1/2 ,
which is guided by the lower bound in Theorem 1. In any such batch, 𝑛𝑒 determines the number
of times one quotes away from 𝑝∗𝑎 prior to the detection of a change, while 𝑛𝑟 characterizes the
number of times one quotes at 𝑝∗𝑎 , the optimal pre-change price. Here 𝑛𝑒 quantifies the degree of
price experimentation that takes place until a change is detected. Note that increasing 𝑛𝑒 yields an
increase in detection abilities but also a potential decrease in performance when a change occurs
toward the end of the horizon. After each group of 𝑛𝑟 + 𝑛𝑒 customers arrives, one compares the
sample average demand, over the trailing window of size 𝑛𝑒 , to the theoretical mean 𝐹¯𝑎 (𝑝0 ) assuming
no change has occurred. In the absence of change, this difference should be “small” while if the
change has already occurred, this difference will be bounded away from zero (given that condition
(1) implies that ∣𝐹¯𝑎 (𝑝0 ) − 𝐹¯𝑏 (𝑝0 )∣ ≥ 𝛿0 ). The tuning parameter 𝜀 is used to distinguish between
the two hypotheses. The choice of 𝜀 and 𝑐𝑒 allows one to control both types of error: the probability
of declaring a change in the absence of one; and the probability of not detecting the change after
more than 2(𝑛𝑟 + 𝑛𝑒 ) customers have arrived following the change.
Let 𝑐𝑒, 𝑐𝑟 and 𝜀 be specified as follows:

0 < 𝜀 < 𝛿0,   𝑐𝑒 = max{𝜀^{−2}, (𝛿0 − 𝜀)^{−2}},   and 𝑐𝑟 > 0.   (12)
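Although the paper's proof is not reproduced here, the role of the constants in (12) can be illustrated with a standard Hoeffding bound: with 𝑛𝑒 = ⌈𝑐𝑒 log 𝑁0⌉ i.i.d. Bernoulli observations, the per-test false-alarm probability is at most 2 exp(−2𝑛𝑒𝜀²), and the per-test missed-detection probability once the purchase rate has shifted by 𝛿0 is at most 2 exp(−2𝑛𝑒(𝛿0 − 𝜀)²); the choice of 𝑐𝑒 makes both decay polynomially in 𝑁0:

```python
import math

def error_bounds(c_e, eps, delta0, N0):
    """Hoeffding-type upper bounds on the per-test false-alarm and
    missed-detection probabilities for the threshold test in Algorithm 1."""
    n_e = math.ceil(c_e * math.log(N0))
    false_alarm = 2 * math.exp(-2 * n_e * eps ** 2)
    missed = 2 * math.exp(-2 * n_e * (delta0 - eps) ** 2)
    return n_e, false_alarm, missed

# With c_e chosen as in (12) for eps = 0.1, delta0 = 0.2, both bounds are tiny:
n_e, fa, md = error_bounds(c_e=max(0.1 ** -2, (0.2 - 0.1) ** -2),
                           eps=0.1, delta0=0.2, N0=1000)
```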
The next result characterizes the performance of the proposed pricing scheme.
Theorem 2 Let 𝐹𝑎 (⋅) and 𝐹𝑏 (⋅) be such that condition (1) holds. Let 𝜋(𝑐𝑒 , 𝑐𝑟 , 𝜀) be defined by
Algorithm 1 with 𝑐𝑒 , 𝑐𝑟 and 𝜀 as specified in (12). Then for some finite constant 𝐶¯ > 0, the
worst-case regret associated with 𝜋(𝑐𝑒, 𝑐𝑟, 𝜀) is bounded as follows:

sup_{1≤𝜏≤𝑁+1} {𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽^𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)} ≤ 𝐶̄𝑁^{1/2} log 𝑁,   (13)

for all 𝑁 ≥ 2, and hence

ℛ^∗(𝑁, ℱ) ≤ 𝐶̄𝑁^{1/2} log 𝑁.   (14)
Discussion. The constant 𝐶¯ appearing above depends only on parameters of the class ℱ and
the arrival process bounds 𝜈 and 𝜈. We have thus established that the lower bound in Theorem
1 can be achieved up to a logarithmic term. Given that the scale of losses is of order 𝑁 1/2 , the
policy prescribed by Algorithm 1 is near-optimal. The proposed policy performs finely tuned price
experiments, a crucial feature as highlighted in Section 3.2. One could refer to the pricing scheme as
“active,” given that it is actively seeking to learn about a change in the response function through
price experimentation. In contrast, a policy that would “wait” for a change while pricing at a fixed
point would be referred to as “passive.” Section 4 analyzes in more detail settings where passive
schemes can perform well.
4 Reduced Complexity and Sufficiency of Passive Monitoring
This section studies an important special case of those discussed in the previous section, establishing
a simple structural condition under which price experimentation is no longer necessary, and the
complexity of the pricing problem is significantly reduced. Fixing 𝛿0 ∈ (0, 1), let 𝒢 denote the class
of pairs of response functions (𝐹𝑎 , 𝐹𝑏 ) that satisfy
∣𝐹̄𝑏(𝑝∗𝑎) − 𝐹̄𝑎(𝑝∗𝑎)∣ ≥ 𝛿0.   (15)
This condition implies that the two response functions are separated at the pre-change optimal
price, 𝑝∗𝑎 , by at least 𝛿0 , which is more stringent than condition (1) that served as a premise to
the analysis in Section 3. We will see shortly that, under the above condition, one can
accumulate information about the change while quoting the pre-change optimal price 𝑝∗𝑎 . This
in turn will lead to fundamentally simpler pricing strategies that achieve significantly better performance than in the general case. Given that 𝒢 ⊆ ℱ, and hence nature is
more restricted here, it follows that ℛ^∗(𝑁, 𝒢) ≤ ℛ^∗(𝑁, ℱ). The main questions we focus on are by how much ℛ^∗(𝑁, 𝒢) differs from ℛ^∗(𝑁, ℱ), and what the optimal or near-optimal policies are in this setting.
4.1 The proposed passive pricing policy
We introduce below a pricing policy defined through two positive constants (𝑐, 𝜀) that serve as
tuning parameters. The main idea is as follows: First, 𝑝∗𝑎 is quoted and the average demand at
𝑝∗𝑎 is monitored by focusing on a trailing window, whose size 𝑛 is determined by 𝑐 and the proxy
for the number of arrivals, 𝑁0 . Then, using 𝜀 as threshold, we declare whether a change has
occurred or not. Following a positive detection, one adjusts the quote to the price 𝑝∗𝑏, the optimal post-change price, and holds it until the end of the season. In contrast with Algorithm 1, price
experimentation is not used and the assessments of the occurrence of a change take place much
more often. This is summarized in the following pseudo-code.
Algorithm 2: 𝝅(𝒄, 𝜺)

Step 1. Pricing and Market Monitoring:
  Initialize: 𝑑𝑒𝑡𝑒𝑐𝑡 = 0, 𝑖 = 1, 𝑗 = 1
  Set 𝑛 = ⌈𝑐 log 𝑁0⌉   [trailing window size]
  While 𝑑𝑒𝑡𝑒𝑐𝑡 = 0 and 𝑗 ≤ 𝑁,
    (a) Pricing:
      𝑝𝑗 = 𝑝∗𝑎 for 𝑗 = 𝑖, ..., 𝑖 + 𝑛 − 1
    (b) Detection test:
      Compute
      𝐷̂ = (1/𝑛)[total sales from cust. 𝑖 to cust. 𝑖 + 𝑛 − 1] − 𝐹̄𝑎(𝑝∗𝑎)
      [difference between empirical mean and true pre-change response at 𝑝∗𝑎]
      If 𝐷̂ ∈ [−𝜀, 𝜀]
        𝑑𝑒𝑡𝑒𝑐𝑡 = 0   [difference “small” ⇒ no change detected]
      Else
        𝑑𝑒𝑡𝑒𝑐𝑡 = 1   [difference “large” ⇒ change detected]
      End
      𝑖 = 𝑖 + 𝑛 + 1
  End

Step 2. Quote Adjustment:
  Set 𝑝𝑗 = 𝑝∗𝑏 for 𝑖 ≤ 𝑗 ≤ 𝑁   [quote post-change optimal price]
A close inspection of Algorithm 2 reveals that it is a special case of Algorithm 1. Indeed, if one
takes 𝑐𝑟 = 0 and 𝑝0 = 𝑝∗𝑎 in the latter, then one obtains the pricing scheme of Algorithm 2, where
price experimentation and revenue extraction are conducted jointly. After each batch of customer
arrivals, the sample average of demand over the trailing window of size 𝑛 is compared to 𝐹¯𝑎 (𝑝∗𝑎 ). In
the absence of change, this difference should be “small.” If the change has occurred, this difference
will be bounded away from zero, given that 𝛿0 > 0. The tuning parameter 𝜀 is used to distinguish
between the two hypotheses of the change having occurred or not.
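A companion Python sketch of Algorithm 2 follows; the response curves and parameter values supplied in any call are again hypothetical, and the separation of the curves at 𝑝∗𝑎 is what allows monitoring without experimentation:

```python
import math
import random

def run_passive_policy(F_a, F_b, p_a_star, p_b_star, tau, N, N0, c, eps, rng):
    """Sketch of Algorithm 2: quote p_a_star throughout, compare the purchase
    rate over each window of n customers to F_a(p_a_star), and switch to
    p_b_star once the difference exceeds eps."""
    n = math.ceil(c * math.log(N0))  # trailing window size
    prices, j, detect = [], 0, False
    while not detect and j < N:
        sales = 0
        for _ in range(n):
            if j >= N:
                break
            F = F_a if j + 1 < tau else F_b   # customer j+1's WtP regime
            sales += rng.random() < F(p_a_star)
            prices.append(p_a_star)
            j += 1
        if j < N and abs(sales / n - F_a(p_a_star)) > eps:
            detect = True
    prices += [p_b_star] * (N - j)  # step 2: quote post-change optimal price
    return prices
```

Note that, as in the reduction discussed above, this is Algorithm 1 with the experimentation price set to 𝑝∗𝑎 and no revenue-extraction phase.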
4.2 Performance of the proposed policy
The next result characterizes the performance of Algorithm 2.
Theorem 3 Let 𝐹𝑎 (⋅) and 𝐹𝑏 (⋅) be such that condition (15) holds. Let 𝜀 be such that 0 < 𝜀 < 𝛿0
and 𝑐 = max{𝜀−2 , (𝛿0 − 𝜀)−2 }. Let 𝜋(𝑐, 𝜀) be defined by means of Algorithm 2. Then for some finite
constant 𝐶¯ ′ > 0, the worst-case regret achieved by 𝜋(𝑐, 𝜀) is bounded as follows
sup_{1≤𝜏≤𝑁+1} {𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽^𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)} ≤ 𝐶̄′ log 𝑁,   (16)

for all 𝑁 ≥ 2. Consequently, the minimax regret satisfies

ℛ^∗(𝑁, 𝒢) ≤ 𝐶̄′ log 𝑁.   (17)
The constant 𝐶¯ ′ depends only on the class 𝒢 and the parameters 𝜈 and 𝜈 characterizing the arrival
process (see (2)). Note that ℛ∗ (𝑁, 𝒢) is significantly smaller than ℛ∗ (𝑁, ℱ), which was shown to
be essentially of order 𝑁 1/2 . This difference stems from the fact that under condition (15), there is
no need for price experimentation, and the decision-maker can infer occurrence of a change while
pricing at 𝑝∗𝑎 . The question still remains whether one can improve upon Algorithm 2; this is the
topic of Section 4.3.
Proof sketch. For the proposed scheme, efficiently localizing the change-point implies that the
proposed pricing policy will be “close” to the oracle optimal price path (that has access to the value of 𝜏 before the start of the season). We highlight below the connection between revenue
performance and detection ability of the proposed policy. Let 𝑘ˆ denote the customer number for
which the price switches from 𝑝∗𝑎 to 𝑝∗𝑏 . Then, the expected revenues associated with the proposed
policy are given by
𝐽^𝜋(𝐹𝑎, 𝐹𝑏, 𝜏) = 𝔼^𝜋_𝜏 [ ∑_{𝑖=1}^{min{𝑘̂,𝜏}−1} 𝑟𝑎(𝑝∗𝑎) + 𝑟𝑏(𝑝∗𝑎)(𝑘̂ − 𝜏)^+ + 𝑟𝑎(𝑝∗𝑏)(𝑘̂ − 𝜏)^− + ∑_{𝑖=max{𝑘̂,𝜏}}^{𝑁} 𝑟𝑏(𝑝∗𝑏) ].

In the equations above and throughout the manuscript, for any real number 𝑥, 𝑥^+ refers to max{𝑥, 0} and 𝑥^− to max{−𝑥, 0}. The regret is then given by

𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽^𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)
  = (𝑟𝑏(𝑝∗𝑏) − 𝑟𝑏(𝑝∗𝑎)) 𝔼^𝜋_𝜏[(𝑘̂ − 𝜏)^+] + (𝑟𝑎(𝑝∗𝑎) − 𝑟𝑎(𝑝∗𝑏)) 𝔼^𝜋_𝜏[(𝑘̂ − 𝜏)^−]
  ≤ max{𝑟𝑎(𝑝∗𝑎) − 𝑟𝑎(𝑝∗𝑏), 𝑟𝑏(𝑝∗𝑏) − 𝑟𝑏(𝑝∗𝑎)} 𝔼^𝜋_𝜏[(𝑘̂ − 𝜏)^+ + (𝑘̂ − 𝜏)^−]
  = max{𝑟𝑎(𝑝∗𝑎) − 𝑟𝑎(𝑝∗𝑏), 𝑟𝑏(𝑝∗𝑏) − 𝑟𝑏(𝑝∗𝑎)} 𝔼^𝜋_𝜏[∣𝑘̂ − 𝜏∣].   (18)
From the last expression, we observe that the performance of the proposed algorithm will be driven by detection accuracy, i.e., by how small 𝔼^𝜋_𝜏[∣𝑘̂ − 𝜏∣] can be made. The latter, in turn, is driven by the probability of detecting a change when none occurred (false alarm), and that of not detecting the change after more than 𝑐 log 𝑁 customers have requested the product since the change. These two probabilities can be controlled by analyzing deviations of normalized sums of i.i.d. random variables relative to their true mean, in conjunction with judicious choices of 𝑐 and 𝜀.
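For a deterministic switch time 𝑘̂, the regret decomposition above (cf. (18)) can be verified directly. The revenue values below are illustrative, derived from hypothetical linear response curves that cross at 𝑝∗𝑎 (so 𝑟𝑏(𝑝∗𝑎) = 𝑟𝑎(𝑝∗𝑎)):

```python
def revenues(k_hat, tau, N, r):
    """Total expected revenue when the quote switches from p_a* to p_b* at
    customer k_hat and the WtP distribution changes at customer tau.
    r maps (regime, quoted price) to expected revenue per customer."""
    total = 0.0
    for i in range(1, N + 1):
        regime = 'a' if i < tau else 'b'
        price = 'pa' if i < k_hat else 'pb'
        total += r[(regime, price)]
    return total

# Hypothetical values from F_a(p) = 1 - 0.2p, F_b(p) = 0.75 - 0.1p,
# p_a* = 2.5, p_b* = 3.75; the curves cross at p_a*, so r_b(p_a*) = r_a(p_a*).
r = {('a', 'pa'): 1.25, ('a', 'pb'): 0.9375,
     ('b', 'pa'): 1.25, ('b', 'pb'): 1.40625}
N, tau = 1000, 500
for k_hat in (400, 650):  # one early switch, one late switch
    regret = revenues(tau, tau, N, r) - revenues(k_hat, tau, N, r)
    formula = ((r[('b', 'pb')] - r[('b', 'pa')]) * max(k_hat - tau, 0)
               + (r[('a', 'pa')] - r[('a', 'pb')]) * max(tau - k_hat, 0))
    assert abs(regret - formula) < 1e-9
```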
4.3 Optimality of the passive pricing policy
The next result provides a lower bound on the performance of any admissible pricing policy 𝜋 ∈ 𝒫.
Theorem 4 For some 𝐶′ > 0,

ℛ^∗(𝑁, 𝒢) ≥ 𝐶′ log 𝑁   (19)

for all 𝑁 ≥ 2.
The above result, combined with Theorem 3 establishes the minimax regret ℛ∗ (𝑁, 𝒢) ≈ log 𝑁 . It
shows that one cannot improve upon the logarithmic growth of the regret in terms of the total
number of customers, 𝑁 . In addition, this growth rate is achieved by the policy presented in
Algorithm 2. In other words, roughly log 𝑁 customers are offered a suboptimal price due to the
change in the WtP. This illustrates that the decision-maker can counter nature quite effectively if
she or he is able to gather information about the change while pricing at 𝑝∗𝑎 .
Intuition and proof sketch. It is important to note that the set of pricing policies 𝒫 is quite
“large,” as few restrictions are imposed. The proof of Theorem 4 establishes that it is possible to reduce the worst-case regret minimization problem to a detection problem, where the objective is to minimize the expected “distance” between the customer index at which one declares a change, 𝑘̂, and the true index at which the change occurs, i.e., to minimize 𝔼^𝜋_𝜏∣𝑘̂ − 𝜏∣. This reduction implies
that any level of performance that can be achieved in the regret minimization problem can also be
achieved in the detection problem. The last part of the proof establishes a fundamental limit on
performance in the detection problem, which yields a lower bound for regret minimization.
5 Illustrative Numerical Examples
In what follows, we fix the price domain to [𝑝, 𝑝] = [0.5, 5] and the response function after the change, 𝐹̄𝑏(⋅), and vary the response function prior to the change, 𝐹̄𝑎(⋅). The response function after the change is taken to be 𝐹̄𝑏(𝑝) = (3.2𝑝)^{−1/2}. The response function prior to the change is linear, 𝐹̄𝑎(𝑝) = max{1 − 𝛽𝑎𝑝, 0}, where the coefficient 𝛽𝑎 can take three values. The cases we focus on are:
- Case 𝐼.)
𝐹¯𝑎 (𝑝) = max{1 − 0.8𝑝, 0},
- Case 𝐼𝐼.) 𝐹¯𝑎 (𝑝) = max{1 − 0.4𝑝, 0},
- Case 𝐼𝐼𝐼.) 𝐹¯𝑎 (𝑝) = max{1 − 0.2𝑝, 0}.
These choices allow us to cover different possibilities with respect to the difference between the pre- and post-change response functions at the pre-change optimal price 𝑝∗𝑎. In particular, cases 𝐼 and 𝐼𝐼𝐼 are cases where the pair of response functions belongs to 𝒢 (for some appropriate 𝛿0), i.e., the responses are well separated at 𝑝∗𝑎; and case 𝐼𝐼 is a case where the pair of response functions belongs to ℱ ∖ 𝒢, i.e., where 𝐹̄𝑎(𝑝∗𝑎) = 𝐹̄𝑏(𝑝∗𝑎). We depict in Figure 2 the pre- and post-change response functions as well as the corresponding revenue functions for cases 𝐼𝐼 and 𝐼𝐼𝐼.
We let 𝜋1 denote the policy defined by means of Algorithm 1 with 𝑐𝑒 = 2, 𝑐𝑟 = 2, 𝜀 = 0.3 and
𝑝0 = arg max𝑝∈[𝑝,𝑝] {∣𝐹¯𝑎 (𝑝) − 𝐹¯𝑏 (𝑝)∣} (𝑝0 = 1.25 for cases 𝐼. and 𝐼𝐼𝐼. and 𝑝0 = 2.5 for case 𝐼𝐼.); 𝜋2
denotes the policy defined by means of Algorithm 2 with 𝑐 = 3 and 𝜀 = 0.3. The total number of
customers requesting a quote 𝑁 is assumed to be equal to 103 . The experiments are based on the
following parameters: the arrival process parameters are set at 𝑁0 = 1, 000 and 𝜈 = 0.25.
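The prices quoted below can be recovered by a simple grid search (this computation is illustrative, not part of the paper's method); for case 𝐼𝐼 it reproduces 𝑝∗𝑎 = 1.25 and 𝑝0 = 2.5, while 𝑝∗𝑏 = 5 is a boundary solution since 𝑝𝐹̄𝑏(𝑝) ∝ 𝑝^{1/2} is increasing on the price domain:

```python
# Case II response curves from the setup above.
F_b = lambda p: (3.2 * p) ** -0.5
F_a2 = lambda p: max(1 - 0.4 * p, 0.0)

grid = [0.5 + 0.001 * k for k in range(4501)]  # price grid over [0.5, 5]
p_a_star = max(grid, key=lambda p: p * F_a2(p))       # pre-change optimal price
p_b_star = max(grid, key=lambda p: p * F_b(p))        # post-change optimal price
p0 = max(grid, key=lambda p: abs(F_a2(p) - F_b(p)))   # experimentation price
# p_a_star ≈ 1.25, p_b_star = 5.0 (boundary), p0 ≈ 2.5
```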
Structure of the pricing policies. Figure 3 contrasts sample paths of the prices associated
with the policies 𝜋1 and 𝜋2 for a case where 𝜏 = 500 and the pre-change response function is
𝐹¯𝑎 (𝑝) = max{1 − 0.8𝑝, 0} (case 𝐼.). The figure highlights how 𝜋1 regularly experiments at the price
𝑝0 for a small batch of customers to monitor the occurrence of a change. As discussed in the previous
sections, this policy trades off the performance losses associated with such experimentation (away
from 𝑝∗𝑎 ) with the improved detection abilities that result from it. In contrast, 𝜋2 just prices at 𝑝∗𝑎
until a change is detected. In the sample path depicted, 𝜋2 detects the change faster than 𝜋1 as
the latter only assesses the occurrence of a change after experimentation occurs. Note that this is to be expected in cases where (𝐹𝑎, 𝐹𝑏) ∈ 𝒢, i.e., when the response functions are “well separated” at 𝑝∗𝑎.

Figure 2: Test cases. Panel (a) depicts the response curves and panel (b) the revenue curves for two test cases (𝐼𝐼 and 𝐼𝐼𝐼); the case to which the pre-change quantities correspond is indicated in parentheses. [Figure omitted: response functions 𝐹̄𝑎 (cases 𝐼𝐼 and 𝐼𝐼𝐼) and 𝐹̄𝑏, and revenue functions 𝑟𝑎 (cases 𝐼𝐼 and 𝐼𝐼𝐼) and 𝑟𝑏, plotted against price, with 𝑝∗𝑎(𝐼𝐼), 𝑝∗𝑎(𝐼𝐼𝐼) and 𝑝∗𝑏 marked.]
Performance and benchmarking. For purposes of benchmarking, we consider two other
policies: the policy 𝜋𝑎 that ignores possible changes in the environment and prices at 𝑝∗𝑎 throughout
the horizon; and the best fixed-price static oracle policy 𝜋𝑠 that, given knowledge of the value of 𝜏,
selects the best single price and holds it fixed throughout the selling season. The latter policy is
clearly not an admissible one, but serves as a reasonable benchmark.
In Table 1, we present the percentage of expected revenue loss relative to the oracle performance
[𝐽 ∗ (𝐹𝑎 , 𝐹𝑏 , 𝜏 ) − 𝐽 𝜋 (𝐹𝑎 , 𝐹𝑏 , 𝜏 )]/𝐽 ∗ (𝐹𝑎 , 𝐹𝑏 , 𝜏 ) for: the four policies under consideration; three values
of the change point 𝜏 ; and the three different cases for pre-change response functions.
The results depicted are based on running 103 independent simulation replications from which
the performance indicators were derived by averaging. The standard error for the percentage loss
was always seen to be below 5% of the policy’s estimated performance.
Figure 3: Price paths. The change point occurs at 𝜏 = 500. The figure depicts price paths for policies 𝜋1 and 𝜋2, indicating the detection lag following 𝜏. The pre- and post-change optimal prices are 𝑝∗𝑎 = 2.5 and 𝑝∗𝑏 = 5, and 𝑝0 = 1.25 is used for experimentation. [Figure omitted: price quotes plotted against customer number for each policy.]

Pre-change response fn      𝐼.                    𝐼𝐼.                   𝐼𝐼𝐼.
change-point 𝜏        𝑁/4   𝑁/2   3𝑁/4     𝑁/4   𝑁/2   3𝑁/4     𝑁/4   𝑁/2   3𝑁/4
𝜋𝑎                    59.7  51.7  36.9     42.9  33.3  20.0     22.0  14.6   7.3
𝜋𝑠                     7.7  20.0  36.3     14.3  31.5  19.6     16.2  13.2   7.0
𝜋1                     3.2   7.9  18.1     5.36  11.6  20.6      5.4   9.3  14.5
𝜋2                     7.6   9.9  14.8     37.0  27.4  18.7      6.6   6.9   8.9

Table 1: Percentage loss relative to the oracle performance: [𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽^𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)]/𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏). The table displays all cases of response functions (𝐼–𝐼𝐼𝐼). 𝜋1 denotes the policy defined by means of Algorithm 1; 𝜋2 the policy defined by means of Algorithm 2; 𝜋𝑎 the policy that prices at 𝑝∗𝑎 throughout the horizon; and 𝜋𝑠 the best fixed-price oracle policy.

The results highlight the small magnitude of the regret achieved by the policy 𝜋1. In particular, its performance is in general far superior to that of 𝜋𝑠, the best fixed-price oracle policy, in all instances tested. As expected, the policy 𝜋1 performs better when the change occurs earlier, as
less price experimentation will be involved compared to the oracle policy. Yet, we observe that the
policy 𝜋1 achieves a relative regret below 20.6% in all instances tested, as opposed to the policy 𝜋2
whose relative regret can be as high as 37%. The key feature at play here is that the policy 𝜋1 uses
price experimentation to ensure more reliable detection of the change, while 𝜋2 relies on observations
at 𝑝∗𝑎 to infer information about the change. When 𝐹¯𝑎 (𝑝∗𝑎 ) and 𝐹¯𝑏 (𝑝∗𝑎 ) are close, 𝜋2 might fail to
detect a change. In cases 𝐼. and 𝐼𝐼𝐼., the two response curves are well separated at 𝑝∗𝑎 , explaining
why 𝜋2 performs well in such cases. However, in the case where 𝐹̄𝑎(𝑝) = max{1 − 0.4𝑝, 0} (case
𝐼𝐼.), the two response curves are equal at 𝑝∗𝑎 , yielding a significant deterioration in performance for
𝜋2 .
Fine-tuning policies. There are various parameters associated with 𝜋1 that can be fine-tuned to potentially improve performance. We investigate below the impact of the experimentation price 𝑝0, which essentially determines the amount of information gathered regarding the occurrence of a change in the market. Using the previous setup, Table 2 reports the performance of 𝜋1 relative to the oracle policy as 𝑝0 varies, for the three cases analyzed, when a change occurs at customer 𝜏 = 500.
𝑝0             1    1.25   1.5   1.75     2   2.25    2.5   2.75      3   3.25
Case 𝐼.      6.8     8.2   8.1    8.6   8.5    9.2   10.0    9.6    9.7   10.2
Case 𝐼𝐼.    24.4    24.5  23.3   22.5  11.3   10.2  11.21  12.37   12.2   12.6
Case 𝐼𝐼𝐼.   11.6     9.6  12.2   11.4  11.5   12.3   13.3   15.9   14.8  16.73

Table 2: Percentage loss relative to the oracle performance: [𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽^𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)]/𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏). The table displays the performance of 𝜋1, the policy defined by means of Algorithm 1, as the experimentation price 𝑝0 varies. A change occurs at 𝜏 = 𝑁/2.
We observe that a poor choice of 𝑝0 , i.e., selecting 𝑝0 in a region where the two response curves
are close, can have a significant negative impact on performance. However, the performance does
not vary significantly over a wide range of choices of 𝑝0 where the response curves are sufficiently
separated.
6 Concluding Remarks and Extensions
On the cost of information acquisition. The analysis in the paper reveals a significant distinction between the classes of response functions ℱ and 𝒢, the key being whether they are “well
separated” at 𝑝∗𝑎 . These cases admit fairly intuitive interpretations. In the well separated case, i.e.,
when ∣𝐹̄𝑎(𝑝∗𝑎) − 𝐹̄𝑏(𝑝∗𝑎)∣ > 0, it is possible to acquire information about the change while pricing at the pre-change optimal price 𝑝∗𝑎. Roughly speaking, this makes information acquisition “costless.”
In contrast, when 𝐹¯𝑎 (𝑝∗𝑎 ) = 𝐹¯𝑏 (𝑝∗𝑎 ), information about the change can only be acquired by quoting
prices that differ from 𝑝∗𝑎 , i.e., via price experimentation. This implies a potential decrease in revenues prior to the change (as illustrated in Figure 3). In other words, here information acquisition
is costly and its cost is endogenously determined by the revenue rate deterioration associated with
price experimentation. Thus, in a changing environment, the information acquisition costs play a
central role in determining achievable performance and guiding the design of “good” policies.
On the observability of lost sales. The assumption that the decision-maker observes customers who decline to purchase, while valid in various B2B settings where a sale can only be made
after a request for quote is received, is restrictive. In order to address settings with non-observable
lost sales, one would need to introduce some temporal structure for the arrival process in order to
be able to relate the number of sales to the actual number of potential customers. In that case,
we expect similar results to the ones derived in this paper. In the absence of this link, it seems
impossible to detect changes in the demand environment as it is impossible to infer the cause for
the absence of purchases. We also refer the reader to Talluri (2009) for a discussion of the related
issue of purchase probability estimation in the context of assortment selection.
The presence of multiple change points. While the analysis in the current paper was
conducted for the case of a single change point, the approach taken is applicable when multiple
change points are present, as long as these are suitably “separated.” More precisely, as long as
a representative number of customers arrives between such changes, and the decision-maker has
access to the order of magnitude of the number of potential customers between change points, the
method developed could be applied.
Abrupt versus gradual changes in the demand environment. In the current paper,
we have modeled changes in the demand environment as being abrupt. What happens when
these changes are gradual? The first thing to note is that the detection delay associated with the
proposed policies might be longer in that setting. However, this need not imply a significant loss
in performance. Indeed, the lack of detection of a change in the response functions indicates that
the new demand environment is still within an “indifference zone” relative to the current one, and
hence using the latter model has only a minor impact on performance. The proposed policies,
which are built with this indifference zone idea in mind, will only detect a change once the new
demand environment differs “significantly” from the current one, hence the core ideas developed in
this paper are still applicable in settings where the demand changes gradually.
An alternative Bayesian formulation. We commented at the end of Section 2 on a possible
Bayesian approach to the problem (see (7)). In particular, we mentioned that under any prior
for the time of change 𝐹𝜏, ℛ^∗_𝐵(𝑁, ℱ, 𝐹𝜏) ≤ ℛ^∗(𝑁, ℱ); i.e., one is able to achieve a lower worst-case regret when the time of change is drawn from a distribution and performance is averaged with respect to that distribution. We argue here that there exists a prior for which ℛ^∗(𝑁, ℱ) ≤ 𝐶 log 𝑁 ℛ^∗_𝐵(𝑁, ℱ, 𝐹𝜏) for some 𝐶 > 0. Indeed, assume that 𝑁 is initially known to the decision-maker (which corresponds to 𝜈 = 𝜈), and recall the discussion in Section 3.2 and the definition of the 𝑙𝑖's, which separate the arrival stream into “batches.” Suppose that 𝐹𝜏 places equal mass on each of the customer indices 𝑙𝑖, which are separated by batches of size 𝛥 ≈ 𝑁^{1/2}. Then, noting that any
policy either gathers limited or ample information on at least half these batches, an adaptation of
the argument developed in Section 3.2 can be used to establish that the worst-case regret is lower
bounded by order 𝑁 1/2 . In other words, if the prior can be selected arbitrarily, then the Bayesian
regret is of the same order as the minimax regret. Hence, sup𝐹𝜏 ℛ∗𝐵 (𝑁, ℱ, 𝐹𝜏 ) ≈ ℛ∗ (𝑁, ℱ), i.e.,
they are of the same order of magnitude up to logarithmic terms.
Complexity of detection versus complexity of learning. This paper has focused on the
complexity of pricing in an environment where the time of change is unknown. A key question that
naturally arises is how this compares with the complexity of learning the post-change response
function. Consider the regret
ℛ^∗_𝑙(𝑁, ℱ) := inf_{𝜋∈𝒫} sup_{𝐹𝑏 : (𝐹𝑎,𝐹𝑏)∈ℱ} {𝐽^∗(𝐹𝑎, 𝐹𝑏, 𝜏) − 𝐽^𝜋(𝐹𝑎, 𝐹𝑏, 𝜏)},   (20)
where the time of change 𝜏 and the initial response function 𝐹𝑎(⋅) are both known, but the post-change response function 𝐹𝑏(⋅) is not known. Such an objective isolates the complexity associated
with learning. Based on the stochastic approximations literature and in particular the results in
Polyak and Tsybakov (1990), one would expect
ℛ^∗_𝑙(𝑁, ℱ) ≈ 𝑁^{1/2}
if one imposes some relatively minimal smoothness conditions on the post-change response function.
In other words, we have the remarkable fact that the complexities of detection and learning are comparable, implying that both tasks contribute equally from a performance perspective. An important
research direction is the design of “good” and practical algorithms for cases where both the time
of change and the post-change response function are unknown.
References
Bayraktar, E., Dayanik, S. and Karatzas, I. (2005), ‘The standard Poisson disorder problem revisited’, Stochastic Processes and their Applications 115, 1437–1450.
Besbes, O., Phillips, R. and Zeevi, A. (2008), ‘Testing the validity of a demand model: an operations
perspective’, Manufacturing & Service Operations Management, forthcoming.
Besbes, O. and Zeevi, A. (2007), ‘Dynamic pricing without knowing the demand function: Risk
bounds and near-optimal algorithms’, forthcoming in Operations Research.
Borovkov, A. (1998), Mathematical Statistics, Gordon and Breach Science Publishers.
Cesa-Bianchi, N. and Lugosi, G. (2006), Prediction, learning, and games, Cambridge University
Press.
Gallego, G. and van Ryzin, G. (1997), ‘A multiproduct dynamic pricing problem and its applications
to network yield management’, Operations Research 45, 24–41.
25
Hannan, J. (1957), ‘Approximation to Bayes risk in repeated play’, Contributions to the Theory of
Games, Princeton University Press III, 97–139.
Keller, G. and Rady, S. (1999), ‘Optimal experimentation in a changing environment’, The review
of economic studies 66, 475–507.
Korostelev, A. P. (1987), ‘On minimax estimation of a discontinuous signal’, Theory of Probability
and its Applications 32, 727–730.
Lai, T. L. (2001), ‘Sequential analysis: Some classical problems and new challenges’, Statistica
Sinica 11, 303–408.
Levin, Y., Levina, T., McGill, J. and Nediak, M. (2008), ‘Dynamic pricing with online learning and
strategic consumers’, forthcoming in Operations Research .
Lobo, M. S. (2007), ‘The value of dynamic pricing’, working paper, Duke University .
Page, E. S. (1954), ‘Continuous inspection schemes’, Biometrika 41, 100–115.
Phillips, R. (2005), Pricing and Revenue Optimization, Stanford University Press.
Polyak, B. T. and Tsybakov, A. (1990), ‘Optimal order of accuracy of search algorithms in stochastic
optimization’, Problems of Information Transmission 26, 126–133.
Roberts, S. W. (1966), ‘A comparison of some control chart procedures’, Technometrics 8, 411–430.
Shewhart, W. A. (1931), The Economic control of the Quality of Manufactured Product, Van
Nostrand, New York.
Shiryayev, A. N. (1963), ‘On optimum methods in quickest detection problems’, Theory of Probability and its Applications 8, 22–46.
Shiryayev, A. N. (1978), Optimal Stopping Rules, Springer-Verlag.
Siegmund, D. (1985), Sequential analysis, Springer-Verlag.
Talluri, K. (2009), ‘A finite-population revenue management model and a risk-ratio procedure for
the joint estimation of population size and parameters’, working paper, Universitat Pompeu
Fabra .
Talluri, K. T. and van Ryzin, G. J. (2005), Theory and Practice of Revenue Management, SpringerVerlag.
Tsybakov, A. (2004), Introduction à l’estimation non-paramétrique, Springer.
26
Online Companion:
On the Minimax Complexity of Pricing in a Changing Environment

Omar Besbes (Graduate School of Business, Columbia University)
Assaf Zeevi (Graduate School of Business, Columbia University)

A Proofs for Section 3
Preliminaries. We introduce below some notation and basic results that will be used in the proofs that follow. For any policy $\pi \in \mathcal{P}$ and its associated price mapping $\psi$, we define the random variable $\mathcal{J}^\pi(F_a, F_b, \tau)$ as follows:
$$\mathcal{J}^\pi(F_a, F_b, \tau) = \sum_{i=1}^{\tau-1} r_a(\psi_i(\mathcal{H}_{i-1})) + \sum_{i=\tau}^{N} r_b(\psi_i(\mathcal{H}_{i-1})). \tag{A-1}$$
In the proofs, we will make extensive use of the following two equalities:
$$\mathbb{E}^\pi_\tau[\mathcal{J}^\pi(F_a, F_b, \tau)] = J^\pi(F_a, F_b, \tau), \tag{A-2}$$
$$J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau) = \mathbb{E}^\pi_\tau\bigg[\sum_{i=1}^{\tau-1} \big[r_a(p^*_a) - r_a(\psi_i(\mathcal{H}_{i-1}))\big] + \sum_{i=\tau}^{N} \big[r_b(p^*_b) - r_b(\psi_i(\mathcal{H}_{i-1}))\big]\bigg]. \tag{A-3}$$
These follow from the conditioning argument below (recall that a sale occurs when the customer's reservation price $V_i$ is at least the posted price):
$$\begin{aligned}
J^\pi(F_a, F_b, \tau)
&= \mathbb{E}^\pi_\tau\bigg[\sum_{i=1}^{\tau-1} \psi_i(\mathcal{H}_{i-1})\, \mathbf{1}\{V_i \ge \psi_i(\mathcal{H}_{i-1})\} + \sum_{i=\tau}^{N} \psi_i(\mathcal{H}_{i-1})\, \mathbf{1}\{V_i \ge \psi_i(\mathcal{H}_{i-1})\}\bigg] \\
&= \sum_{i=1}^{\tau-1} \mathbb{E}^\pi_\tau\Big[\mathbb{E}^\pi_\tau\big[\psi_i(\mathcal{H}_{i-1})\, \mathbf{1}\{V_i \ge \psi_i(\mathcal{H}_{i-1})\} \,\big|\, \mathcal{H}_{i-1}\big]\Big] + \sum_{i=\tau}^{N} \mathbb{E}^\pi_\tau\Big[\mathbb{E}^\pi_\tau\big[\psi_i(\mathcal{H}_{i-1})\, \mathbf{1}\{V_i \ge \psi_i(\mathcal{H}_{i-1})\} \,\big|\, \mathcal{H}_{i-1}\big]\Big] \\
&= \sum_{i=1}^{\tau-1} \mathbb{E}^\pi_\tau\big[r_a(\psi_i(\mathcal{H}_{i-1}))\big] + \sum_{i=\tau}^{N} \mathbb{E}^\pi_\tau\big[r_b(\psi_i(\mathcal{H}_{i-1}))\big] \\
&= \mathbb{E}^\pi_\tau[\mathcal{J}^\pi(F_a, F_b, \tau)].
\end{aligned}$$
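As a quick numerical sanity check of the identity (A-2), the following sketch compares $J^\pi$ with a Monte Carlo average of the realized revenue for a fixed, non-adaptive price path; the uniform WtP distributions and the price path are hypothetical choices for illustration.

```python
import random

def check_identity(N=40, tau=25, n_mc=50_000, seed=7):
    """Monte Carlo check of (A-2) for a fixed (non-adaptive) price path, with
    hypothetical WtP distributions U[0,1] (pre-change) and U[0,2] (post-change)."""
    rng = random.Random(seed)
    prices = [0.3 + 0.4 * (i % 2) for i in range(N)]  # deterministic policy psi
    # J^pi: expected revenue via r_a(p) = p*(1-p) and r_b(p) = p*(1-p/2)
    J = sum(p * ((1 - p) if i < tau - 1 else (1 - p / 2))
            for i, p in enumerate(prices))
    # Average of the realized revenue sum_i psi_i * 1{V_i >= psi_i}
    total = 0.0
    for _ in range(n_mc):
        for i, p in enumerate(prices):
            v = rng.random() if i < tau - 1 else 2.0 * rng.random()
            if v >= p:
                total += p
    return J, total / n_mc
```

The two quantities agree up to Monte Carlo error, as (A-2) asserts.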
Proof of Theorem 1. The proof of the result relies on Propositions 1 and 2 (whose proofs are presented following this one) and the notation introduced in Section 3.2. In particular, suppose that one takes
$$\Delta = \big\lceil N^{1/2} \big\rceil.$$
Then $\Delta \ge N^{1/2}$ and $\tilde N - 1 \ge (1/6) N^{1/2}$. Combining the results of Propositions 1 and 2, we obtain that for any policy $\pi \in \mathcal{P}$,
$$\sup_{1 \le \tau \le N+1} \{J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau)\} \ge \min\big\{\underline{C}_1 \Delta,\; \underline{C}_2 (\tilde N - 1)\big\} \ge \min\big\{\underline{C}_1 N^{1/2},\; (1/6)\underline{C}_2 N^{1/2}\big\} = \min\{\underline{C}_1, (1/6)\underline{C}_2\}\, N^{1/2}.$$
We get that for all $(F_a, F_b) \in \mathcal{F}_1$,
$$\sup_{1 \le \tau \le N+1} \{J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau)\} \ge \underline{C}\, N^{1/2},$$
where $\underline{C} = \min\{\underline{C}_1, (1/6)\underline{C}_2\}$. This concludes the proof.
Proof of Proposition 1. Consider any policy $\pi \in \mathcal{P}$ such that $\min_{1 \le j \le \tilde N - 1} \mathcal{K}(\mathbb{P}^\pi_{l_{j+1}}, \mathbb{P}^\pi_{l_j}) < \beta$. Let $i_0$ denote an index $j$ such that $\mathcal{K}(\mathbb{P}^\pi_{l_{j+1}}, \mathbb{P}^\pi_{l_j}) < \beta$ and consider the following two hypotheses:
$$H_0: \ \tau \notin \{l_{i_0}, \ldots, l_{i_0+1} - 1\}, \qquad H_1: \ \tau = l_{i_0}.$$
Under $\mathbb{P}^\pi_{l_{i_0}}$ a change occurs at $l_{i_0}$, and under $\mathbb{P}^\pi_{l_{i_0+1}}$ no change occurs in $\{l_{i_0}, \ldots, l_{i_0+1} - 1\}$. Let $\phi$ denote a decision rule, i.e., a mapping from the set of price and demand realizations in $\{1, \ldots, l_{i_0+1} - 1\}$ into $\{0, 1\}$: $\phi = 0$ will denote "no change" and $\phi = 1$ will denote the presence of a change at $l_{i_0}$. By Tsybakov (2004, Theorem 2.2), the worst-case probability of error of any decision rule is lower bounded by $(1/4)\exp\{-\beta\}$, i.e.,
$$\inf_\phi \max\big\{\mathbb{P}^\pi_{l_{i_0}}\{\phi = 0\},\; \mathbb{P}^\pi_{l_{i_0+1}}\{\phi = 1\}\big\} \ge (1/4)\exp\{-\beta\}. \tag{A-4}$$
We show next that this implies that the losses in performance throughout the horizon must be of
order 𝛥.
Let $\delta = |p^*_b - p^*_a|/2$ and $\delta_r = \inf_{y \in [p^*_a - \delta, p^*_a + \delta]} \{r_b(p^*_b) - r_b(y)\}$, and note that $\delta_r > 0$ by conditions $iv.)$ and $v.)$ in Assumption 1. Let $C_1 = \delta_r/(1 + \delta_r/h(\delta))$, and define $C_2$ as
$$C_2 = \frac{1}{8}\, C_1 \exp\{-\beta\}. \tag{A-5}$$
Suppose for a moment that we have
$$\sup_{k = i_0,\, i_0+1} \mathbb{E}^\pi_{l_k}[J^* - \mathcal{J}^\pi] \le C_2 \Delta, \tag{A-6}$$
and consider the following decision rule $\phi$:
$$\phi = \begin{cases} 0 & \text{if } \sum_{j=l_{i_0}}^{l_{i_0+1}-1} \big[r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))\big] \le C_1 \Delta, \\[4pt] 1 & \text{if } \sum_{j=l_{i_0}}^{l_{i_0+1}-1} \big[r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))\big] > C_1 \Delta. \end{cases}$$
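The rule $\phi$ can be sketched in code; the revenue curve and the constants below are hypothetical placeholders for illustration, not quantities from the class $\mathcal{F}_1$.

```python
def decision_rule(prices, r_a, p_star_a, C1, Delta):
    """phi from the proof: declare a change (return 1) iff the cumulative
    pre-change revenue loss over the block exceeds the threshold C1 * Delta."""
    loss = sum(r_a(p_star_a) - r_a(p) for p in prices)
    return 1 if loss > C1 * Delta else 0

# toy revenue curve r_a(p) = p * (1 - p), maximized at p_star_a = 0.5
r_a = lambda p: p * (1.0 - p)
no_change = decision_rule([0.5] * 10, r_a, 0.5, C1=0.01, Delta=10)  # loss 0
change = decision_rule([0.9] * 10, r_a, 0.5, C1=0.01, Delta=10)     # loss 1.6
```

The point of the proof is that a policy with small regret under both hypotheses would make this test too accurate, contradicting (A-4).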
We next analyze the error probabilities associated with this rule. We have
$$\begin{aligned}
\mathbb{P}^\pi_{l_{i_0+1}}\{\phi = 1\}
&= \mathbb{P}^\pi_{l_{i_0+1}}\Big\{\sum_{j=l_{i_0}}^{l_{i_0+1}-1} \big[r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))\big] > C_1 \Delta\Big\} \\
&\stackrel{(a)}{\le} \frac{1}{C_1 \Delta}\, \mathbb{E}^\pi_{l_{i_0+1}}\Big[\sum_{j=l_{i_0}}^{l_{i_0+1}-1} \big[r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))\big]\Big] \\
&\le \frac{1}{C_1 \Delta}\, \mathbb{E}^\pi_{l_{i_0+1}}[J^* - \mathcal{J}^\pi] \stackrel{(b)}{\le} \frac{C_2}{C_1} \stackrel{(c)}{=} \frac{1}{8}\exp\{-\beta\},
\end{aligned}$$
where $(a)$ follows from Markov's inequality; $(b)$ follows from the assumption that (A-6) holds; and $(c)$ follows from the definitions of $C_1$ and $C_2$ (see (A-5)).
We now turn to $\mathbb{P}^\pi_{l_{i_0}}\{\phi = 0\}$. We first establish the following inequality:
$$\mathbb{P}^\pi_{l_{i_0}}\{\phi = 0\} \le \mathbb{P}^\pi_{l_{i_0}}\Big\{\sum_{j=l_{i_0}}^{l_{i_0+1}-1} \big[r_b(p^*_b) - r_b(\psi_j(\mathcal{H}_{j-1}))\big] \ge \delta_r [1 - C_1/h(\delta)] \Delta\Big\}. \tag{A-7}$$
Indeed, suppose that $\phi = 0$, i.e., $\sum_{j=l_{i_0}}^{l_{i_0+1}-1} [r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))] \le C_1 \Delta$. Then by Assumption 1 $v.)$, $\sum_{j=l_{i_0}}^{l_{i_0+1}-1} h(p^*_a - \psi_j(\mathcal{H}_{j-1})) \le C_1 \Delta$. This in turn implies that $\sum_{j=l_{i_0}}^{l_{i_0+1}-1} \mathbf{1}\{|p^*_a - \psi_j(\mathcal{H}_{j-1})| > \delta\} \le C_1 \Delta / h(\delta)$ and hence
$$\begin{aligned}
\sum_{j=l_{i_0}}^{l_{i_0+1}-1} \big[r_b(p^*_b) - r_b(\psi_j(\mathcal{H}_{j-1}))\big]
&\ge \sum_{j=l_{i_0}}^{l_{i_0+1}-1} \big[r_b(p^*_b) - r_b(\psi_j(\mathcal{H}_{j-1}))\big]\, \mathbf{1}\{|p^*_a - \psi_j(\mathcal{H}_{j-1})| \le \delta\} \\
&\ge \inf_{y \in [p^*_a - \delta,\, p^*_a + \delta]} \big[r_b(p^*_b) - r_b(y)\big] \sum_{j=l_{i_0}}^{l_{i_0+1}-1} \mathbf{1}\{|p^*_a - \psi_j(\mathcal{H}_{j-1})| \le \delta\} \\
&\ge \delta_r [1 - C_1/h(\delta)] \Delta.
\end{aligned}$$
Coming back to (A-7), we now have
$$\begin{aligned}
\mathbb{P}^\pi_{l_{i_0}}\{\phi = 0\}
&\le \mathbb{P}^\pi_{l_{i_0}}\Big\{\sum_{j=l_{i_0}}^{l_{i_0+1}-1} \big[r_b(p^*_b) - r_b(\psi_j(\mathcal{H}_{j-1}))\big] \ge \delta_r [1 - C_1/h(\delta)] \Delta\Big\} \\
&\stackrel{(a)}{\le} \frac{1}{\delta_r [1 - C_1/h(\delta)] \Delta}\, \mathbb{E}^\pi_{l_{i_0}}\Big[\sum_{j=l_{i_0}}^{l_{i_0+1}-1} \big[r_b(p^*_b) - r_b(\psi_j(\mathcal{H}_{j-1}))\big]\Big] \\
&\le \frac{1}{\delta_r [1 - C_1/h(\delta)] \Delta}\, \mathbb{E}^\pi_{l_{i_0}}[J^* - \mathcal{J}^\pi] \stackrel{(b)}{\le} \frac{C_2}{\delta_r [1 - C_1/h(\delta)]} \stackrel{(c)}{=} \frac{1}{8}\exp\{-\beta\},
\end{aligned}$$
where $(a)$ follows from Markov's inequality; $(b)$ follows from the assumption that (A-6) holds; and $(c)$ follows from the definitions of $C_1$ and $C_2$ (see (A-5)), noting that $\delta_r[1 - C_1/h(\delta)] = C_1$.
We deduce that the rule $\phi$ defined earlier satisfies
$$\max\big\{\mathbb{P}^\pi_{l_{i_0}}\{\phi = 0\},\; \mathbb{P}^\pi_{l_{i_0+1}}\{\phi = 1\}\big\} \le (1/8)\exp\{-\beta\} < (1/4)\exp\{-\beta\},$$
which is in contradiction with (A-4). We deduce that (A-6) cannot hold and hence, in the current case where $\min_{1 \le j \le \tilde N - 1} \mathcal{K}(\mathbb{P}^\pi_{l_{j+1}}, \mathbb{P}^\pi_{l_j}) = \mathcal{K}(\mathbb{P}^\pi_{l_{i_0+1}}, \mathbb{P}^\pi_{l_{i_0}}) < \beta$, we necessarily have
$$\sup_{1 \le \tau \le N+1} \mathbb{E}^\pi_\tau[J^*(F_a, F_b, \tau) - \mathcal{J}^\pi(F_a, F_b, \tau)] > C_2 \Delta. \tag{A-8}$$
Recall that $C_2 = [\delta_r/(1 + \delta_r/h(\delta))](1/8)\exp\{-\beta\}$ and note that $\delta \ge \delta_p/2$ and $\delta_r \ge h(\delta) \ge \min\{\alpha \delta_p^2/4, \gamma\}$ by Assumption 1 $iv.)$ and $v.)$. We deduce that $C_2 \ge \underline{C}_1 := (1/16)\, h(\delta_p/2) \exp\{-\beta\}$ and that
$$\sup_{1 \le \tau \le N+1} \mathbb{E}^\pi_\tau[J^*(F_a, F_b, \tau) - \mathcal{J}^\pi(F_a, F_b, \tau)] > \underline{C}_1 \Delta, \tag{A-9}$$
where $\underline{C}_1$ depends on the parameters of the class $\mathcal{F}_1$ and $\beta$. This concludes the proof.
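The testing lower bound (A-4) invoked above can be probed numerically. The sketch below computes, by brute force, the worst-case error of a likelihood-ratio test between two Bernoulli product measures and compares it with $(1/4)\exp\{-\mathcal{K}\}$; the parameters are arbitrary illustrative values, and the bound holds for any test.

```python
import math
from itertools import product

def test_error_vs_bound(p=0.45, q=0.55, n=8):
    """Worst-case error of a likelihood-ratio test between two i.i.d. Bernoulli
    product measures, versus the (1/4)exp(-KL) lower bound used in (A-4)."""
    err_p = err_q = 0.0
    for ys in product([0, 1], repeat=n):
        prob_p = math.prod(p if y else 1 - p for y in ys)
        prob_q = math.prod(q if y else 1 - q for y in ys)
        if prob_p >= prob_q:
            err_q += prob_q      # mass of Q classified as P
        else:
            err_p += prob_p      # mass of P classified as Q
    kl = n * (p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q)))
    return max(err_p, err_q), 0.25 * math.exp(-kl)

worst, lower = test_error_vs_bound()
```

As expected, the realized worst-case error sits above the information-theoretic floor.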
Proof of Proposition 2. Consider any policy $\pi \in \mathcal{P}$ such that $\min_{1 \le j \le \tilde N - 1} \mathcal{K}(\mathbb{P}^\pi_{l_{j+1}}, \mathbb{P}^\pi_{l_j}) \ge \beta$. We analyze the performance of the policy when the change occurs at $\tau = l_{\tilde N}$. We have
$$\begin{aligned}
\mathbb{E}^\pi_{l_{\tilde N}}[J^*(F_a, F_b, \tau) - \mathcal{J}^\pi(F_a, F_b, \tau)]
&= \mathbb{E}^\pi_{l_{\tilde N}}\Big[\sum_{j=1}^{l_{\tilde N}-1} \big[r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))\big] + \sum_{j=l_{\tilde N}}^{N} \big[r_b(p^*_b) - r_b(\psi_j(\mathcal{H}_{j-1}))\big]\Big] \\
&\ge \mathbb{E}^\pi_{l_{\tilde N}}\Big[\sum_{j=l_1}^{l_{\tilde N}-1} \big[r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))\big]\Big] \\
&= \sum_{i=1}^{\tilde N - 1} \mathbb{E}^\pi_{l_{\tilde N}}\Big[\sum_{j=l_i}^{l_{i+1}-1} \big[r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))\big]\Big] \\
&\stackrel{(a)}{=} \sum_{i=1}^{\tilde N - 1} \mathbb{E}^\pi_{l_{i+1}}\Big[\sum_{j=l_i}^{l_{i+1}-1} \big[r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))\big]\Big] \\
&\stackrel{(b)}{\ge} \sum_{i=1}^{\tilde N - 1} \mathbb{E}^\pi_{l_{i+1}}\Big[\sum_{j=l_i}^{l_{i+1}-1} h\big(p^*_a - \psi_j(\mathcal{H}_{j-1})\big)\Big],
\end{aligned} \tag{A-10}$$
where $(a)$ follows from the fact that the random variable $\sum_{j=l_i}^{l_{i+1}-1} [r_a(p^*_a) - r_a(\psi_j(\mathcal{H}_{j-1}))]$ is $\mathcal{H}_{l_{i+1}-1}$-measurable (and the measures $\mathbb{P}^\pi_{l_{\tilde N}}$ and $\mathbb{P}^\pi_{l_{i+1}}$ coincide on $\mathcal{H}_{l_{i+1}-1}$); and $(b)$ follows from Assumption 1 $v.)$. The following lemma, whose proof can be found in Appendix C, allows one to further lower bound the terms appearing in the sum in (A-10).

Lemma 1 For some $C_{\mathcal{K}} > 0$ that depends only on the parameters defining the class $\mathcal{F}_1$, we have for all $i = 1, \ldots, \tilde N - 1$,
$$\mathbb{E}^\pi_{l_{i+1}}\Big[\sum_{j=l_i}^{l_{i+1}-1} h\big(p^*_a - \psi_j(\mathcal{H}_{j-1})\big)\Big] \ge C_{\mathcal{K}}\, \mathcal{K}(\mathbb{P}^\pi_{l_{i+1}}, \mathbb{P}^\pi_{l_i}).$$

Combining the latter result with (A-10) yields
$$\mathbb{E}^\pi_{l_{\tilde N}}[J^* - \mathcal{J}^\pi] \ge C_{\mathcal{K}} \sum_{i=1}^{\tilde N - 1} \mathcal{K}(\mathbb{P}^\pi_{l_{i+1}}, \mathbb{P}^\pi_{l_i}) \ge (\tilde N - 1)\, C_{\mathcal{K}} \beta.$$
We deduce that when $\min_{1 \le j \le \tilde N - 1} \mathcal{K}(\mathbb{P}^\pi_{l_{j+1}}, \mathbb{P}^\pi_{l_j}) \ge \beta$, we necessarily have
$$\sup_{1 \le \tau \le N+1} \mathbb{E}^\pi_\tau[J^*(F_a, F_b, \tau) - \mathcal{J}^\pi(F_a, F_b, \tau)] \ge (\tilde N - 1)\, C_{\mathcal{K}} \beta. \tag{A-11}$$
Letting $\underline{C}_2 = C_{\mathcal{K}} \beta$, we have that
$$\sup_{1 \le \tau \le N+1} \mathbb{E}^\pi_\tau[J^*(F_a, F_b, \tau) - \mathcal{J}^\pi(F_a, F_b, \tau)] \ge \underline{C}_2 (\tilde N - 1), \tag{A-12}$$
where $\underline{C}_2$ depends on the parameters of the class $\mathcal{F}_1$ and $\beta$. This concludes the proof.
Proof of Theorem 2. For what follows, when a change is declared, we let 𝑘ˆ denote the customer
number at which it is declared (i.e., the customer number for which the price changes to 𝑝∗𝑏 ); when
no change is declared, we let 𝑘ˆ = 𝑁 + 1.
We introduce the following notation:
$$i_0 = 1, \qquad i_{j+1} = i_j + n_r + n_e, \quad j = 0, \ldots, n_b - 1,$$
where $n_b = \sup\{j : i_j \le N\}$. Note that $n_b \le N/(n_r + n_e)$. Let $j^*$ correspond to the index such that
$$i_{j^*} < \tau \le i_{j^*+1}. \tag{A-13}$$
Step 1. The first step consists of relating the regret to the detection abilities of the policy. Recalling the expression for the regret provided in (A-3) in the preliminaries, we have for the policy under consideration
$$\begin{aligned}
J^*(F_a, F_b, \tau) &- J^\pi(F_a, F_b, \tau) \\
&= \mathbb{E}^\pi_\tau\Big[\sum_{i=1}^{\tau-1} [r_a(p^*_a) - r_a(\psi_i(\mathcal{H}_{i-1}))] + \sum_{i=\tau}^{N} [r_b(p^*_b) - r_b(\psi_i(\mathcal{H}_{i-1}))]\Big] \\
&= \mathbb{E}^\pi_\tau\Big[\sum_{i=1}^{\min\{\tau-1,\, \hat k - 1\}} [r_a(p^*_a) - r_a(p_0)]\, \mathbf{1}\{\psi_i(\mathcal{H}_{i-1}) = p_0\} + \sum_{i=\hat k}^{\tau-1} [r_a(p^*_a) - r_a(p^*_b)]\, \mathbf{1}\{\psi_i(\mathcal{H}_{i-1}) = p^*_b\} \\
&\qquad\quad + \sum_{i=\tau}^{\min\{N,\, \hat k - 1\}} [r_b(p^*_b) - r_b(\psi_i(\mathcal{H}_{i-1}))]\Big] \\
&\le n_e (n_b + 1)\, [r_a(p^*_a) - r_a(p_0)] + \mathbb{E}^\pi_\tau\big[(\tau - \hat k)^+\big]\, [r_a(p^*_a) - r_a(p^*_b)] \\
&\qquad + \mathbb{E}^\pi_\tau\big[(\hat k - \tau)^+\big]\, \max\{r_b(p^*_b) - r_b(p^*_a),\; r_b(p^*_b) - r_b(p_0)\}.
\end{aligned} \tag{A-14}$$
Next, we focus on bounding the terms on the right-hand side above.

Step 2. In this step we analyze separately $\mathbb{E}^\pi_\tau[(\tau - \hat k)^+]$ and $\mathbb{E}^\pi_\tau[(\hat k - \tau)^+]$.
For $\ell = 1, \ldots, j^*$, let $q_{f,\ell}$ denote the probability of a false alarm at $i_\ell + 1$, i.e., of incorrectly detecting a change after observing customer $i_\ell$. We have
$$\begin{aligned}
\mathbb{E}^\pi_\tau\big[(\tau - \hat k)^+\big]
&= \sum_{j=1}^{\tau-1} \mathbb{P}^\pi_\tau\big\{(\tau - \hat k)^+ \ge j\big\} = \sum_{j=1}^{\tau-1} \mathbb{P}^\pi_\tau\big\{\hat k \le \tau - j\big\} \\
&\le \sum_{\ell=1}^{j^*} \sum_{j=i_\ell}^{i_{\ell+1}-1} \mathbb{P}^\pi_\tau\Big\{\bigcup_{m=1}^{\ell} \{\hat k = i_m + 1\}\Big\} \stackrel{(a)}{\le} \sum_{\ell=1}^{j^*} \sum_{j=i_\ell}^{i_{\ell+1}-1} \sum_{m=1}^{\ell} q_{f,m} \le (n_r + n_e) \sum_{\ell=1}^{j^*} \ell \max_{1 \le m \le j^*} q_{f,m},
\end{aligned} \tag{A-15}$$
where $(a)$ follows from a union bound. For $\ell = 1, \ldots, j^*$, one can bound $q_{f,\ell}$ as follows:
$$q_{f,\ell} \stackrel{(a)}{=} \mathbb{P}^\pi_\tau\bigg\{\bigg|\frac{1}{n_e} \sum_{m=i_{\ell-1}+n_r+1}^{i_{\ell-1}+n_r+n_e} \big(Y_m - \mathbb{E}^\pi_\tau[Y_m]\big)\bigg| > \varepsilon\bigg\} \stackrel{(b)}{\le} 2\exp\{-2 n_e \varepsilon^2\}, \tag{A-16}$$
where in $(a)$ the $Y_m$ are i.i.d. Bernoulli random variables with $\mathbb{P}^\pi_\tau\{Y_m = 1\} = \bar F_a(p_0)$; and $(b)$ follows from Hoeffding's inequality. Combining (A-15) and (A-16), one obtains
$$\mathbb{E}^\pi_\tau\big[(\tau - \hat k)^+\big] \le 2 N^2 \exp\{-2 n_e \varepsilon^2\}. \tag{A-17}$$
By the definition of $n_e$, we have that $n_e \ge c_e \log N_0 \ge c_e \log N + c_e \log \nu^{-1}$. We deduce that
$$\mathbb{E}^\pi_\tau\big[(\tau - \hat k)^+\big] \le 2 N^2 \exp\{-2\varepsilon^2 (c_e \log N - c_e \log \nu)\} = 2 N^{2 - 2 c_e \varepsilon^2} \nu^{2 c_e \varepsilon^2} \le 2 \nu^{2 c_e \varepsilon^2}, \tag{A-18}$$
where the last inequality holds since $2(1 - \varepsilon^2 c_e) \le 0$ and $N \ge 1$.
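The Hoeffding bound in (A-16) can be checked by simulation; the window size, threshold, and sale probability below are illustrative values only.

```python
import math
import random

def false_alarm_rate(n_e=100, eps=0.15, p_sale=0.4, n_mc=20_000, seed=3):
    """Empirical probability that the windowed statistic exceeds eps when no
    change has occurred (Y_m i.i.d. Bernoulli(p_sale))."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_mc):
        mean = sum(1 for _ in range(n_e) if rng.random() < p_sale) / n_e
        hits += abs(mean - p_sale) > eps
    return hits / n_mc

rate = false_alarm_rate()
hoeffding_bound = 2 * math.exp(-2 * 100 * 0.15 ** 2)  # 2 exp(-2 n_e eps^2)
```

The empirical false-alarm rate falls (comfortably) below the Hoeffding bound, which is what makes the union bound in (A-17) serviceable despite its crudeness.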
We now turn to analyzing $\mathbb{E}^\pi_\tau[(\hat k - \tau)^+]$. For $\ell = j^* + 2, \ldots, n_b$, let $q_{d,\ell}$ denote the following quantity:
$$q_{d,\ell} = \mathbb{P}^\pi_\tau\bigg\{\bigg|\frac{1}{n_e} \sum_{m=i_{\ell-1}+n_r+1}^{i_{\ell-1}+n_r+n_e} \big(Y_m - \bar F_a(p_0)\big)\bigg| \le \varepsilon\bigg\},$$
where the $Y_m$ are Bernoulli random variables with $\mathbb{P}^\pi_\tau\{Y_m = 1\} = \bar F(p_0; m)$. The quantity $q_{d,\ell}$ represents the probability of not detecting the change at exactly $i_\ell$, given that it has not yet been detected. We have
$$\begin{aligned}
\mathbb{E}^\pi_\tau\big[(\hat k - \tau)^+\big]
&= \sum_{j=0}^{N-\tau+1} \mathbb{P}^\pi_\tau\big\{(\hat k - \tau)^+ \ge j\big\} = \sum_{j=0}^{N-\tau+1} \mathbb{P}^\pi_\tau\big\{\hat k \ge \tau + j\big\} \\
&\le 3(n_r + n_e) + \sum_{\ell=j^*+2}^{n_b} \sum_{j=i_\ell}^{i_{\ell+1}-1} \mathbb{P}^\pi_\tau\Big\{\bigcap_{m=j^*+2}^{\ell} \{\hat k \neq i_m\}\Big\} \\
&\stackrel{(a)}{\le} 3(n_r + n_e) + \sum_{\ell=j^*+2}^{n_b} (n_r + n_e)\, (q_{d,j^*+2})^{\ell - j^* - 1} \\
&= (n_r + n_e) \Big(3 + \sum_{m=1}^{n_b - j^* - 1} (q_{d,j^*+2})^m\Big) = (n_r + n_e) \Big(3 + q_{d,j^*+2}\, \frac{1 - (q_{d,j^*+2})^{n_b - j^* - 1}}{1 - q_{d,j^*+2}}\Big) \\
&\le (n_r + n_e)\, \frac{3}{1 - q_{d,j^*+2}},
\end{aligned} \tag{A-19}$$
where in $(a)$ we used the fact that $q_{d,j} = q_{d,j^*+2}$ for $j = j^* + 2, \ldots, n_b$. On the other hand, we have for $\ell = j^* + 2, \ldots, n_b$,
$$\begin{aligned}
q_{d,\ell}
&= \mathbb{P}^\pi_\tau\bigg\{\bigg|\frac{1}{n_e} \sum_{m=i_{\ell-1}+n_r+1}^{i_{\ell-1}+n_r+n_e} \big(Y_m - \bar F_a(p_0)\big)\bigg| \le \varepsilon\bigg\} \\
&= \mathbb{P}^\pi_\tau\bigg\{\bigg|\frac{1}{n_e} \sum_{m=i_{\ell-1}+n_r+1}^{i_{\ell-1}+n_r+n_e} \big(Y_m - \mathbb{E}^\pi_\tau[Y_m]\big) + \big(\bar F_b(p_0) - \bar F_a(p_0)\big)\bigg| \le \varepsilon\bigg\} \\
&\stackrel{(a)}{\le} \mathbb{P}^\pi_\tau\bigg\{\bigg|\frac{1}{n_e} \sum_{m=i_{\ell-1}+n_r+1}^{i_{\ell-1}+n_r+n_e} \big(Y_m - \mathbb{E}^\pi_\tau[Y_m]\big)\bigg| \ge \big|\bar F_b(p_0) - \bar F_a(p_0)\big| - \varepsilon\bigg\} \\
&\stackrel{(b)}{\le} 2\exp\{-2 n_e (\delta_0 - \varepsilon)^2\} \stackrel{(c)}{\le} 2 N_0^{-2 c_e (\delta_0 - \varepsilon)^2} \stackrel{(d)}{\le} \frac{1}{2},
\end{aligned}$$
where $(a)$ follows from the triangle inequality; $(b)$ follows from Hoeffding's inequality (together with $|\bar F_b(p_0) - \bar F_a(p_0)| \ge \delta_0$), which holds as long as $\delta_0 > \varepsilon$; $(c)$ follows from the fact that $n_e \ge c_e \log N_0$; and $(d)$ is true since $N_0 \ge 2$ and $2 c_e (\delta_0 - \varepsilon)^2 \ge 2$. Note that $n_r + n_e \le c_r N_0^{1/2} + c_e \log N_0 + 2 \le c_r (N/\nu)^{1/2} + c_e \log(N/\nu) + 2$. Coming back to (A-19), we get
$$\mathbb{E}^\pi_\tau\big[(\hat k - \tau)^+\big] \le (n_r + n_e)\, \frac{3}{1 - q_{d,j^*+2}} \le 6\big(c_r (N/\nu)^{1/2} + c_e \log(N/\nu) + 2\big). \tag{A-20}$$
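The closing step of (A-19) is the elementary bound $3 + q(1-q^k)/(1-q) \le 3/(1-q)$, which in fact holds for all $q \in (0,1)$ (the proof only needs $q_{d,j^*+2} \le 1/2$). A quick numeric check over a grid:

```python
def geometric_tail_bound_holds(q, k):
    """Check 3 + q*(1 - q**k)/(1 - q) <= 3/(1 - q), the last step of (A-19)."""
    return 3 + q * (1 - q ** k) / (1 - q) <= 3 / (1 - q) + 1e-12

all_hold = all(geometric_tail_bound_holds(q, k)
               for q in (0.1, 0.25, 0.5, 0.75, 0.9)
               for k in range(1, 60))
```

Algebraically, the inequality reduces to $-2q - q^{k+1} \le 0$, which is immediate.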
Step 3. Let $\delta_r = \max\{r_a(p^*_a) - r_a(p_0),\; r_a(p^*_a) - r_a(p^*_b),\; r_b(p^*_b) - r_b(p^*_a),\; r_b(p^*_b) - r_b(p_0)\}$. Returning to the bound on the regret, (A-14) in combination with (A-18) and (A-20) yields
$$\begin{aligned}
J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau)
&\le \delta_r \Big[n_e (n_b + 1) + \mathbb{E}^\pi_\tau\big[(\tau - \hat k)^+\big] + \mathbb{E}^\pi_\tau\big[(\hat k - \tau)^+\big]\Big] \\
&\le 2 \delta_r \Big[\big(N/(n_r + n_e) + 1\big)(c_e \log N_0 + 1) + \nu^{2 c_e \varepsilon^2} + 3 c_r (N/\nu)^{1/2} + 3 c_e \log(N/\nu) + 6\Big].
\end{aligned}$$
The right-hand side above does not depend on $\tau$, hence
$$\sup_{1 \le \tau \le N+1} \big[J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau)\big] \le C_1 N^{1/2} \log N,$$
where $C_1$ is a function of $\delta_r$, $c_e$, $c_r$, $\varepsilon$, $\underline{\nu}$ and $\bar\nu$. This concludes the proof.
B Proofs for Section 4
Proof of Theorem 3. This result follows from a slight modification of the proof of Theorem 2. In particular, one notes that Algorithm 2 can be seen as a special case of Algorithm 1 in which $n_r = 0$ and $p_0 = p^*_a$. Next, we analyze the steps that need to be modified in the proof of Theorem 2. The only modifications occur in the upper bound on the regret in Step 1 (see (A-14)) and in the upper bound on $\mathbb{E}^\pi_\tau[(\hat k - \tau)^+]$ in Step 2 (see (A-20)). In particular, the upper bound on the regret can here be written as
$$J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau) \le \mathbb{E}^\pi_\tau\big[(\tau - \hat k)^+\big]\, [r_a(p^*_a) - r_a(p^*_b)] + \mathbb{E}^\pi_\tau\big[(\hat k - \tau)^+\big]\, [r_b(p^*_b) - r_b(p^*_a)].$$
Then, in Step 2, the upper bound for $\mathbb{E}^\pi_\tau[(\hat k - \tau)^+]$ in (A-20) becomes
$$\mathbb{E}^\pi_\tau\big[(\hat k - \tau)^+\big] \le n_e\, \frac{3}{1 - q_{d,j^*+2}} \le 6\big(c_e \log(N/\nu) + 1\big). \tag{B-1}$$
Hence, with $\delta_r = \max\{r_a(p^*_a) - r_a(p^*_b),\; r_b(p^*_b) - r_b(p^*_a)\}$, one obtains
$$J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau) \le \delta_r \Big[\mathbb{E}^\pi_\tau\big[(\tau - \hat k)^+\big] + \mathbb{E}^\pi_\tau\big[(\hat k - \tau)^+\big]\Big] \le 2 \delta_r \big[\nu^{2 c_e \varepsilon^2} + 3 c_e \log(N/\nu) + 3\big],$$
which yields
$$\sup_{1 \le \tau \le N+1} \big[J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau)\big] \le C_1 \log N,$$
with $C_1$ that depends only on $\delta_r$, $c_e$, $\varepsilon$, $\underline{\nu}$ and $\bar\nu$. This completes the proof.
Proof of Theorem 4. The proof is organized in three main steps. In the first step, we establish that if one is able to achieve a given performance level in terms of revenues, then a related performance level can be achieved for the problem of change detection (where one attempts to minimize the distance between the detection time and the actual time of change). In the second step, we establish a fundamental limit on the performance of any detection rule. The last step concludes by translating those fundamental limits to the revenue maximization problem.

Let $\mathcal{F}_2$ denote the set of response functions $(F_a, F_b)$ that satisfy conditions $i.)$, $ii.)$, $iv.)$ and $v.)$ of Assumption 1. We analyze $\mathcal{R}^*(N, \mathcal{G} \cap \mathcal{F}_2)$ and then use the fact that $\mathcal{R}^*(N, \mathcal{G}) \ge \mathcal{R}^*(N, \mathcal{G} \cap \mathcal{F}_2)$. Consider any pair of response functions $(F_a, F_b) \in \mathcal{G} \cap \mathcal{F}_2$.
Step 1. Consider any policy $\pi \in \mathcal{P}$ and its associated price mapping $\psi$. Recall the definition of the random variable $\mathcal{J}^\pi(F_a, F_b, \tau)$ given in (A-1) in the preliminaries. Let $\gamma$ be any positive constant (which may depend on $N$) and define
$$\mathcal{B}_\gamma = \{\omega : J^*(F_a, F_b, \tau) - \mathcal{J}^\pi(F_a, F_b, \tau) < \gamma\}. \tag{B-2}$$
Note that
$$J^*(F_a, F_b, \tau) - \mathcal{J}^\pi(F_a, F_b, \tau) \ge \sum_{i=1}^{\tau-1} h\big(p^*_a - \psi_i(\mathcal{H}_{i-1})\big) + \sum_{i=\tau}^{N} h\big(p^*_b - \psi_i(\mathcal{H}_{i-1})\big),$$
where $h(\cdot)$ was defined in Assumption 1 $v.)$. Hence, for all $\omega \in \mathcal{B}_\gamma$, we have
$$\sum_{i=1}^{\tau-1} h\big(p^*_a - \psi_i(\mathcal{H}_{i-1})\big) + \sum_{i=\tau}^{N} h\big(p^*_b - \psi_i(\mathcal{H}_{i-1})\big) < \gamma. \tag{B-3}$$
We next define, based on the pricing policy $\pi$, a stopping rule for the index $\tau$ of the customer at which the change occurs. Let
$$j_0 := \bigg\lceil \frac{\gamma}{h(\delta_p/3)} \bigg\rceil, \tag{B-4}$$
where $\delta_p$ and $h(\cdot)$ were defined in Assumption 1. Define
$$\hat k_1 = \inf\big\{1 < j \le N : \{|p^*_b - \psi_j(\mathcal{H}_{j-1})| \le \delta_p/3\} \cup \{j = N\}\big\},$$
and for $i \ge 1$,
$$\hat k_{i+1} = \begin{cases} \inf\big\{\hat k_i < j \le N : \{|p^*_b - \psi_j(\mathcal{H}_{j-1})| \le \delta_p/3\} \cup \{j = N\}\big\} & \text{if } \hat k_i < N, \\ N & \text{if } \hat k_i = N. \end{cases}$$
The stopping rule is now defined as the $j_0$-th term of this sequence, namely
$$\hat k^* = \hat k_{j_0}. \tag{B-5}$$
In other words, a change is declared after we have priced $j_0$ times "close" to $p^*_b$ (or once we have reached $N$). The next lemma, whose proof can be found in Appendix C, provides a guarantee on the performance of the stopping rule $\hat k^*$ on the set $\mathcal{B}_\gamma$.

Lemma 2 For all $\omega \in \mathcal{B}_\gamma$, we have
$$0 \le \hat k^* - \tau \le 2 j_0. \tag{B-6}$$
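The stopping rule (B-5) can be sketched directly; the price path and constants below are hypothetical illustrations.

```python
def k_hat_star(prices, p_star_b, delta_p, j0, N):
    """Stopping rule (B-5): stop at the j0-th index whose posted price is within
    delta_p/3 of p_star_b (the index N also counts), returning N if exhausted."""
    count = 0
    for j, p in enumerate(prices, start=1):
        if abs(p_star_b - p) <= delta_p / 3.0 or j == N:
            count += 1
            if count == j0:
                return j
    return N

# toy path: prices sit near p_a = 0.5, then move near p_b = 1.0 from index 61 on
path = [0.5] * 60 + [1.0] * 40
k_star = k_hat_star(path, p_star_b=1.0, delta_p=0.3, j0=5, N=100)
```

On this path the fifth "close" price occurs at index 65, so the rule stops there, shortly after the prices first move toward $p^*_b$.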
In other words, we have established that
$$\mathcal{B}_\gamma \subseteq \{\omega : 0 \le \hat k^* - \tau \le 2 j_0\} \subseteq \{\omega : |\hat k^* - \tau| \le 2 j_0\},$$
which implies that
$$\sup_{1 \le \tau \le N+1} \mathbb{P}\{\mathcal{B}_\gamma^c\} \ge \sup_{1 \le \tau \le N+1} \mathbb{P}\{|\hat k^* - \tau| > 2 j_0\}.$$
Step 2. The following lemma, whose proof can be found in Appendix C, establishes a fundamental limit on the performance of any stopping rule.

Lemma 3 For some $C > 0$ and $\alpha > 0$, any stopping rule $\hat k$ must satisfy
$$\sup_{1 \le \tau \le N+1} \mathbb{P}^\pi_\tau\{|\hat k - \tau| > C \log N\} \ge \alpha. \tag{B-7}$$
Step 3. Set
$$\gamma = \big(C_1 \log N - h(\delta_p/3)\big)^+,$$
where $C_1 = h(\delta_p/3)\, C / 2$. Then $2 \lceil \gamma / h(\delta_p/3) \rceil \le C \log N$ and
$$\sup_{1 \le \tau \le N+1} \mathbb{P}\{\mathcal{B}_\gamma^c\} \ge \sup_{1 \le \tau \le N+1} \mathbb{P}\{|\hat k^* - \tau| > 2 j_0\} \ge \sup_{1 \le \tau \le N+1} \mathbb{P}\{|\hat k^* - \tau| > C \log N\} \stackrel{(a)}{\ge} \alpha,$$
where $(a)$ follows from Lemma 3. The latter, in conjunction with Markov's inequality, implies that
$$\sup_{1 \le \tau \le N+1} \big[J^*(F_a, F_b, \tau) - J^\pi(F_a, F_b, \tau)\big] = \sup_{1 \le \tau \le N+1} \mathbb{E}\big[J^*(F_a, F_b, \tau) - \mathcal{J}^\pi(F_a, F_b, \tau)\big] \ge \gamma \sup_{1 \le \tau \le N+1} \mathbb{P}\{\mathcal{B}_\gamma^c\} \ge \alpha \big(C_1 \log N - h(\delta_p/3)\big)^+,$$
which is of order $\log N$. This completes the proof.
C Proofs of Auxiliary Results
Proof of Lemma 1. Note first that
$$\begin{aligned}
\mathcal{K}(\mathbb{P}^\pi_{l_{j+1}}, \mathbb{P}^\pi_{l_j})
&= \mathbb{E}^\pi_{l_{j+1}}\bigg[\log \frac{\mathbb{P}^\pi_{l_{j+1}}\{Y_i : 1 \le i \le N\}}{\mathbb{P}^\pi_{l_j}\{Y_i : 1 \le i \le N\}}\bigg] \\
&= \mathbb{E}^\pi_{l_{j+1}}\bigg[\log \frac{\mathbb{P}^\pi_{l_{j+1}}\{Y_N \mid Y_i : 1 \le i \le N-1\} \cdots \mathbb{P}^\pi_{l_{j+1}}\{Y_2 \mid Y_1\}\, \mathbb{P}^\pi_{l_{j+1}}\{Y_1\}}{\mathbb{P}^\pi_{l_j}\{Y_N \mid Y_i : 1 \le i \le N-1\} \cdots \mathbb{P}^\pi_{l_j}\{Y_2 \mid Y_1\}\, \mathbb{P}^\pi_{l_j}\{Y_1\}}\bigg] \\
&= \mathbb{E}^\pi_{l_{j+1}}\bigg[\log \prod_{i=1}^{N} \frac{\mathbb{P}^\pi_{l_{j+1}}\{Y_i \mid \mathcal{H}_{i-1}\}}{\mathbb{P}^\pi_{l_j}\{Y_i \mid \mathcal{H}_{i-1}\}}\bigg] = \mathbb{E}^\pi_{l_{j+1}}\bigg[\sum_{i=1}^{N} \log \frac{\mathbb{P}^\pi_{l_{j+1}}\{Y_i \mid \mathcal{H}_{i-1}\}}{\mathbb{P}^\pi_{l_j}\{Y_i \mid \mathcal{H}_{i-1}\}}\bigg] \\
&= \mathbb{E}^\pi_{l_{j+1}}\bigg[\sum_{i=1}^{N} \log \frac{\mathbf{1}\{1 \le i \le l_{j+1}-1\}\, \mathbb{P}^\pi_a\{Y_i \mid \mathcal{H}_{i-1}\} + \mathbf{1}\{l_{j+1} \le i \le N\}\, \mathbb{P}^\pi_b\{Y_i \mid \mathcal{H}_{i-1}\}}{\mathbf{1}\{1 \le i \le l_j-1\}\, \mathbb{P}^\pi_a\{Y_i \mid \mathcal{H}_{i-1}\} + \mathbf{1}\{l_j \le i \le N\}\, \mathbb{P}^\pi_b\{Y_i \mid \mathcal{H}_{i-1}\}}\bigg] \\
&= \mathbb{E}^\pi_{l_{j+1}}\bigg[\sum_{i=l_j}^{l_{j+1}-1} \log \frac{\mathbb{P}_a(Y_i \mid \psi_i(\mathcal{H}_{i-1}))}{\mathbb{P}_b(Y_i \mid \psi_i(\mathcal{H}_{i-1}))}\bigg].
\end{aligned}$$
Now, the term in the expectation above can be simplified to yield
$$\mathcal{K}(\mathbb{P}^\pi_{l_{j+1}}, \mathbb{P}^\pi_{l_j}) = \sum_{i=l_j}^{l_{j+1}-1} \mathbb{E}^\pi_{l_{j+1}}\bigg[\log \frac{\mathbb{P}_a(Y_i \mid \psi_i(\mathcal{H}_{i-1}))}{\mathbb{P}_b(Y_i \mid \psi_i(\mathcal{H}_{i-1}))}\bigg] = \sum_{i=l_j}^{l_{j+1}-1} \mathbb{E}^\pi_{l_{j+1}}\bigg[\mathbb{E}^\pi_{l_{j+1}}\bigg[\log \frac{\mathbb{P}_a(Y_i \mid \psi_i(\mathcal{H}_{i-1}))}{\mathbb{P}_b(Y_i \mid \psi_i(\mathcal{H}_{i-1}))} \,\bigg|\, \mathcal{H}_{i-1}\bigg]\bigg].$$
Let $u = \bar F_a(x) - \bar F_a(p^*_a)$ and $v = \bar F_b(x) - \bar F_b(p^*_a)$, and note that $u, v \in (-\bar F_a(p^*_a),\, 1 - \bar F_a(p^*_a))$ (recall that $\bar F_b(p^*_a) = \bar F_a(p^*_a)$ for the pairs in the class $\mathcal{F}_1$). Then
$$\begin{aligned}
\mathbb{E}^\pi_{l_{j+1}}&\bigg[\log \frac{\mathbb{P}_a(Y_i \mid \psi_i(\mathcal{H}_{i-1}))}{\mathbb{P}_b(Y_i \mid \psi_i(\mathcal{H}_{i-1}))} \,\bigg|\, \psi_i(\mathcal{H}_{i-1}) = x\bigg] \\
&= \mathbb{P}_a(Y_i = 1 \mid x) \log \frac{\mathbb{P}_a(Y_i = 1 \mid x)}{\mathbb{P}_b(Y_i = 1 \mid x)} + \mathbb{P}_a(Y_i = 0 \mid x) \log \frac{\mathbb{P}_a(Y_i = 0 \mid x)}{\mathbb{P}_b(Y_i = 0 \mid x)} \\
&= \bar F_a(x) \log \frac{\bar F_a(x)}{\bar F_b(x)} + F_a(x) \log \frac{F_a(x)}{F_b(x)} \\
&= [\bar F_a(p^*_a) + u] \log \frac{\bar F_a(p^*_a) + u}{\bar F_a(p^*_a) + v} + [F_a(p^*_a) - u] \log \frac{F_a(p^*_a) - u}{F_a(p^*_a) - v} \\
&= [\bar F_a(p^*_a) + u] \Big[\log\Big(1 + \frac{u}{\bar F_a(p^*_a)}\Big) - \log\Big(1 + \frac{v}{\bar F_a(p^*_a)}\Big)\Big] + [F_a(p^*_a) - u] \Big[\log\Big(1 - \frac{u}{F_a(p^*_a)}\Big) - \log\Big(1 - \frac{v}{F_a(p^*_a)}\Big)\Big].
\end{aligned}$$
Note that for all $y \in (-(1-2\mu)/(1-\mu),\, (1-2\mu)/\mu)$, we have $\log(1+y) \le y$ and $-\log(1+y) \le -y + [(1-\mu)/\mu]\, y^2$. This implies that
$$\begin{aligned}
\mathbb{E}^\pi_{l_{j+1}}&\bigg[\log \frac{\mathbb{P}_a(Y_i \mid \psi_i(\mathcal{H}_{i-1}))}{\mathbb{P}_b(Y_i \mid \psi_i(\mathcal{H}_{i-1}))} \,\bigg|\, \psi_i(\mathcal{H}_{i-1}) = x\bigg] \\
&\le [\bar F_a(p^*_a) + u] \bigg[\frac{u}{\bar F_a(p^*_a)} - \frac{v}{\bar F_a(p^*_a)} + \frac{1-\mu}{\mu}\, \frac{v^2}{(\bar F_a(p^*_a))^2}\bigg] + [F_a(p^*_a) - u] \bigg[\frac{-u}{F_a(p^*_a)} + \frac{v}{F_a(p^*_a)} + \frac{1-\mu}{\mu}\, \frac{v^2}{(F_a(p^*_a))^2}\bigg] \\
&= \bigg[\frac{1}{\bar F_a(p^*_a)} + \frac{1}{F_a(p^*_a)}\bigg] (u^2 - uv) + \frac{1-\mu}{\mu} \bigg[\frac{\bar F_a(p^*_a) + u}{(\bar F_a(p^*_a))^2} + \frac{F_a(p^*_a) - u}{(F_a(p^*_a))^2}\bigg] v^2 \\
&\le \bigg[\frac{4}{\mu} + \frac{2(1-\mu)^2}{\mu^3}\bigg] \min\big\{K^2 |x - p^*_a|^2,\, 1\big\} \le \frac{1}{C_{\mathcal{K}}}\, h(p^*_a - x),
\end{aligned}$$
where the penultimate inequality uses $|u| \vee |v| \le \min\{K |x - p^*_a|, 1\}$ together with $\mu \le \bar F_a(p) \le 1 - \mu$ for all prices $p$ (Assumption 1 $ii.)$), and where
$$C_{\mathcal{K}} = \bigg[\Big(\frac{4}{\mu} + \frac{2(1-\mu)^2}{\mu^3}\Big) \frac{\max\{K^2, 1\}}{\min\{\alpha, \gamma\}}\bigg]^{-1},$$
with $\mu$ defined in Assumption 1 $ii.)$. We deduce that
$$\mathbb{E}^\pi_{l_{j+1}}\bigg[\sum_{i=l_j}^{l_{j+1}-1} h\big(p^*_a - \psi_i(\mathcal{H}_{i-1})\big)\bigg] \ge C_{\mathcal{K}}\, \mathcal{K}(\mathbb{P}^\pi_{l_{j+1}}, \mathbb{P}^\pi_{l_j}),$$
and the proof is complete.
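The first chain of equalities above is the chain rule for the KL divergence, which collapses the divergence to the block on which the two measures differ. For non-adaptive prices (so that the per-step laws are fixed Bernoulli distributions) this can be verified by brute force; the Bernoulli parameters below are arbitrary illustrations.

```python
import math
from itertools import product

def bern_kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def joint_kl(probs_p, probs_q):
    """KL between two sequences of independent Bernoulli observations,
    computed by enumerating all outcomes."""
    total = 0.0
    for ys in product([0, 1], repeat=len(probs_p)):
        lp = math.prod(p if y else 1 - p for y, p in zip(ys, probs_p))
        lq = math.prod(q if y else 1 - q for y, q in zip(ys, probs_q))
        total += lp * math.log(lp / lq)
    return total

# the two laws agree on the first two steps and differ on the last three,
# mirroring P_{l_{j+1}} versus P_{l_j}: the KL reduces to the differing block
lhs = joint_kl([0.4] * 5, [0.4, 0.4, 0.6, 0.6, 0.6])
rhs = 3 * bern_kl(0.4, 0.6)
```

Under a policy the prices are adaptive, so the paper's argument works with conditional laws given $\mathcal{H}_{i-1}$, but the additive structure is the same.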
Proof of Lemma 2. Consider $\omega \in \mathcal{B}_\gamma$ and suppose that (B-6) is not true. Then two cases need to be considered.

$i.)$ Suppose first that $\hat k^* - \tau > 2 j_0$. Then we have
$$\begin{aligned}
\sum_{i=\tau}^{\hat k^*} h\big(p^*_b - \psi_i(\mathcal{H}_{i-1})\big)
&\ge \sum_{i=\tau}^{\hat k^*} h\big(p^*_b - \psi_i(\mathcal{H}_{i-1})\big)\, \mathbf{1}\{|p^*_b - \psi_i(\mathcal{H}_{i-1})| > \delta_p/3\} \\
&\stackrel{(a)}{\ge} h(\delta_p/3) \sum_{i=\tau}^{\hat k^*} \mathbf{1}\{|p^*_b - \psi_i(\mathcal{H}_{i-1})| > \delta_p/3\} \\
&\stackrel{(b)}{\ge} (j_0 + 1)\, h(\delta_p/3) \stackrel{(c)}{>} \gamma,
\end{aligned}$$
where $(a)$ follows from the fact that $h(\cdot)$ is increasing; $(b)$ follows from the definition of $\hat k^*$ and the fact that $\hat k^* - \tau > 2 j_0$ (at most $j_0$ of the indices $i \in \{\tau, \ldots, \hat k^*\}$ can satisfy $|p^*_b - \psi_i(\mathcal{H}_{i-1})| \le \delta_p/3$); and $(c)$ follows from the definition of $j_0$ in (B-4). However, the last inequality is in contradiction with (B-3). Hence necessarily $\hat k^* - \tau \le 2 j_0$.

$ii.)$ Suppose now that $\hat k^* - \tau < 0$. Then we have
$$\begin{aligned}
\sum_{i=1}^{\hat k^*} h\big(p^*_a - \psi_i(\mathcal{H}_{i-1})\big)
&\ge \sum_{i=1}^{\hat k^*} h\big(p^*_a - p^*_b + p^*_b - \psi_i(\mathcal{H}_{i-1})\big)\, \mathbf{1}\{|p^*_b - \psi_i(\mathcal{H}_{i-1})| \le \delta_p/3\} \\
&\stackrel{(a)}{\ge} h(\delta_p - \delta_p/3) \sum_{i=1}^{\hat k^*} \mathbf{1}\{|p^*_b - \psi_i(\mathcal{H}_{i-1})| \le \delta_p/3\} \\
&\ge j_0\, h(2\delta_p/3) \stackrel{(b)}{>} j_0\, h(\delta_p/3) \stackrel{(c)}{\ge} \gamma,
\end{aligned}$$
where $(a)$ and $(b)$ follow from the fact that $h(\cdot)$ is strictly increasing; and $(c)$ follows from the definition of $j_0$ in (B-4). The last inequality is in contradiction with (B-3). Hence, $\hat k^* - \tau \ge 0$.

We conclude from the two cases that (B-6) is necessarily true, and the result is established.
Proof of Lemma 3. Let $\hat k$ be any stopping rule. We develop an argument parallel to that of Korostelev (1987). Define
$$\phi := \sup\bigg\{\bigg|\log \frac{\mathbb{P}_a(y \mid p)}{\mathbb{P}_b(y \mid p)}\bigg| : y = 0, 1,\; p \in [\underline{p}, \bar p]\bigg\}, \tag{C-1}$$
and note that Assumption 1 $ii.)$ ensures that $\phi$ is well defined. If $\phi = 0$, then $\mathbb{P}_a = \mathbb{P}_b$ and $p^*_a = p^*_b$, and one can achieve zero regret. We assume from now on that $\phi > 0$.

Fix $\beta \in (0,1)$ and let
$$B(\phi, \beta, x) := 1 - \frac{\phi + \log 2 + \log(1 + \log x) + 2\log\beta^{-1}}{\log x}.$$
Note that $B(\phi, \beta, x)$ is increasing in $x$ for $x > \exp(\exp(1))$ and converges to 1 as $x \to \infty$. Let $n_0 = \min\{j \in \mathbb{N} : j \ge \exp(\exp(1)),\; B(\phi, \beta, j) > 0\}$. Choose $C > 0$ as follows:
$$C = \min\bigg\{\frac{B(\phi, \beta, n_0)}{\phi},\, 1\bigg\}. \tag{C-2}$$
Note also that for all $n \ge n_0$, $C \le (1/\phi)\, B(\phi, \beta, n)$.

Let $g(x) = x (C \log x + 1)^{-1}$ and note that $g(\cdot)$ is increasing for $x \ge 2$ (since $C \le 1$) and tends to infinity as $x \to \infty$. Let
$$n_1 = \min\{j \in \mathbb{N} : j \ge n_0,\; g(j) \ge 3/2\}. \tag{C-3}$$
We distinguish between two cases.

Case 1: $N \ge n_1$. Let $\Delta = \lceil C \log N \rceil$ and $\tilde N = \lceil N/\Delta \rceil$. Note that $N/\Delta \ge N (C \log N + 1)^{-1} \ge 3/2$ by the choice of $n_1$ in (C-3), and hence $\tilde N \ge 2$. Define
$$l_j = 1 + (j-1)\Delta, \quad j = 1, \ldots, \tilde N, \qquad l_{\tilde N + 1} = N.$$
Letting $\mathbb{P}^\pi_{l_j}$ denote the probability associated with the observations when the change occurs at $\tau = l_j$, define
$$Z_j := \log \frac{\mathbb{P}^\pi_{l_{j+1}}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}}{\mathbb{P}^\pi_{l_j}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}}.$$
Using a conditioning argument as in Lemma 1 (there used to simplify $\mathcal{K}(\mathbb{P}^\pi_{l_{j+1}}, \mathbb{P}^\pi_{l_j})$) yields
$$Z_j = \sum_{i=l_j}^{l_{j+1}-1} \log \frac{\mathbb{P}_a(Y_i \mid \psi_i(\mathcal{H}_{i-1}))}{\mathbb{P}_b(Y_i \mid \psi_i(\mathcal{H}_{i-1}))}. \tag{C-4}$$
Suppose for a moment that, in the current case where $N \ge n_1$, we have
$$\max_{1 \le j \le \tilde N - 1} \mathbb{P}^\pi_{l_j}\{|\hat k - \tau| > \Delta/3\} < 1 - \beta. \tag{C-5}$$
Let $\mathcal{A}_j$ denote the event $\{\omega : |\hat k - l_j| \le \Delta/3\}$ and note that the events $\mathcal{A}_j$, $j = 1, \ldots, \tilde N - 1$, are disjoint and that $\{\omega : |\hat k - l_{\tilde N}| > \Delta/3\} \supseteq \cup_{j=1}^{\tilde N - 1} \mathcal{A}_j$. This implies that
$$\begin{aligned}
\mathbb{P}_{l_{\tilde N}}\{|\hat k - \tau| > \Delta/3\}
&\ge \mathbb{P}_{l_{\tilde N}}\Big\{\bigcup_{j=1}^{\tilde N - 1} \mathcal{A}_j\Big\} = \sum_{j=1}^{\tilde N - 1} \mathbb{P}_{l_{\tilde N}}\{|\hat k - l_j| \le \Delta/3\} \\
&= \sum_{j=1}^{\tilde N - 1} \mathbb{E}_{l_{\tilde N}}\big[\mathbf{1}\{|\hat k - l_j| \le \Delta/3\}\big] \stackrel{(a)}{=} \sum_{j=1}^{\tilde N - 1} \mathbb{E}_{l_j}\big[\exp(Z_j)\, \mathbf{1}\{|\hat k - l_j| \le \Delta/3\}\big].
\end{aligned} \tag{C-6}$$
For $(a)$ above, we have used the fact that for any $\mathcal{H}_{l_{j+1}-1}$-measurable random variable $V$,
$$\begin{aligned}
\mathbb{E}_{l_{\tilde N}}[V]
&= \mathbb{E}^\pi_{l_{j+1}}\bigg[\frac{\mathbb{P}^\pi_{l_{\tilde N}}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}}{\mathbb{P}^\pi_{l_{j+1}}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}}\, V\bigg] = \mathbb{E}^\pi_{l_{j+1}}\bigg[\mathbb{E}^\pi_{l_{j+1}}\bigg[\frac{\mathbb{P}^\pi_{l_{\tilde N}}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}}{\mathbb{P}^\pi_{l_{j+1}}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}}\, V \,\bigg|\, \mathcal{H}_{l_{j+1}-1}\bigg]\bigg] \\
&\stackrel{(b)}{=} \mathbb{E}^\pi_{l_{j+1}}\bigg[V\, \mathbb{E}^\pi_{l_{j+1}}\bigg[\frac{\mathbb{P}^\pi_{l_{\tilde N}}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}}{\mathbb{P}^\pi_{l_{j+1}}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}} \,\bigg|\, \mathcal{H}_{l_{j+1}-1}\bigg]\bigg] \stackrel{(c)}{=} \mathbb{E}^\pi_{l_{j+1}}[V] \stackrel{(d)}{=} \mathbb{E}_{l_j}[\exp(Z_j)\, V],
\end{aligned}$$
where $(b)$ follows from the fact that $V$ is $\mathcal{H}_{l_{j+1}-1}$-measurable; $(d)$ follows from a standard change of measure; and $(c)$ follows from the chain of equalities below:
$$\begin{aligned}
\mathbb{E}^\pi_{l_{j+1}}&\bigg[\frac{\mathbb{P}^\pi_{l_{\tilde N}}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}}{\mathbb{P}^\pi_{l_{j+1}}\{Y_i : 1 \le i \le l_{\tilde N + 1}\}} \,\bigg|\, \mathcal{H}_{l_{j+1}-1}\bigg] \\
&= \mathbb{E}^\pi_{l_{j+1}}\bigg[\frac{\mathbb{P}^\pi_{l_{\tilde N}}\{Y_i : 1 \le i \le l_{\tilde N + 1} \mid Y_i : 1 \le i \le l_{j+1}-1\}\; \mathbb{P}^\pi_{l_{\tilde N}}\{Y_i : 1 \le i \le l_{j+1}-1\}}{\mathbb{P}^\pi_{l_{j+1}}\{Y_i : 1 \le i \le l_{\tilde N + 1} \mid Y_i : 1 \le i \le l_{j+1}-1\}\; \mathbb{P}^\pi_{l_{j+1}}\{Y_i : 1 \le i \le l_{j+1}-1\}} \,\bigg|\, \mathcal{H}_{l_{j+1}-1}\bigg] \\
&= \mathbb{E}^\pi_{l_{j+1}}\bigg[\frac{\mathbb{P}^\pi_{l_{\tilde N}}\{Y_i : 1 \le i \le l_{\tilde N + 1} \mid Y_i : 1 \le i \le l_{j+1}-1\}\; \mathbb{P}^\pi_a\{Y_i : 1 \le i \le l_{j+1}-1\}}{\mathbb{P}^\pi_{l_{j+1}}\{Y_i : 1 \le i \le l_{\tilde N + 1} \mid Y_i : 1 \le i \le l_{j+1}-1\}\; \mathbb{P}^\pi_a\{Y_i : 1 \le i \le l_{j+1}-1\}} \,\bigg|\, \mathcal{H}_{l_{j+1}-1}\bigg] \\
&= \mathbb{E}^\pi_{l_{j+1}}\bigg[\frac{\mathbb{P}^\pi_{l_{\tilde N}}\{Y_i : l_{j+1} \le i \le l_{\tilde N + 1} \mid Y_i : 1 \le i \le l_{j+1}-1\}}{\mathbb{P}^\pi_{l_{j+1}}\{Y_i : l_{j+1} \le i \le l_{\tilde N + 1} \mid Y_i : 1 \le i \le l_{j+1}-1\}} \,\bigg|\, \mathcal{H}_{l_{j+1}-1}\bigg] = 1,
\end{aligned}$$
where we have used the fact that, prior to $l_{j+1}$, no change has occurred under either measure, so that both coincide with $\mathbb{P}^\pi_a$ on those observations.

Note that by (C-4), $Z_j$ is the sum of $\Delta$ terms, each lower bounded by $-\phi$ (cf. the definition of $\phi$ in (C-1)). Hence $Z_j \ge -\Delta\phi \ge -\phi(C \log N + 1)$, and one can lower bound every term in the sum in (C-6) as follows:
$$\mathbb{E}_{l_j}\big[\exp(Z_j)\, \mathbf{1}\{|\hat k - l_j| \le \Delta/3\}\big] \ge \frac{\exp(-\phi)}{N^{C\phi}}\, \mathbb{E}_{l_j}\big[\mathbf{1}\{|\hat k - l_j| \le \Delta/3\}\big].$$
Going back to (C-6), we have
$$\mathbb{P}_{l_{\tilde N}}\{|\hat k - \tau| > \Delta/3\} \ge \frac{\exp(-\phi)}{N^{C\phi}} \sum_{j=1}^{\tilde N - 1} \mathbb{E}_{l_j}\big[\mathbf{1}\{|\hat k - l_j| \le \Delta/3\}\big] \ge \frac{\exp(-\phi)}{N^{C\phi}}\, (\tilde N - 1) \min_{1 \le j \le \tilde N - 1} \mathbb{P}^\pi_{l_j}\{|\hat k - \tau| \le \Delta/3\}. \tag{C-7}$$
On the other hand, noting that $\tilde N - 1 \ge \tilde N / 2$ and that $C \le 1$, we have
$$\begin{aligned}
\log\big(N^{-C\phi}(\tilde N - 1)\big)
&\ge \log\Big(N^{-C\phi}\, \frac{N}{2 \log N + 2}\Big) = (1 - C\phi) \log N - \log(1 + \log N) - \log 2 \\
&= \log N \big[-C\phi + B(\phi, \beta, N)\big] + \phi + 2\log\beta^{-1} \stackrel{(a)}{\ge} \phi + 2\log\beta^{-1},
\end{aligned}$$
where $(a)$ follows from the definition of $C$ in (C-2) and the fact that $N \ge n_1 \ge n_0$ (so that $C\phi \le B(\phi, \beta, N)$). This implies that $\exp(-\phi)\, N^{-C\phi}\, (\tilde N - 1) \ge 1/\beta^2$. Hence we have
$$\mathbb{P}_{l_{\tilde N}}\{|\hat k - \tau| > \Delta/3\} \ge \frac{1}{\beta^2} \min_{1 \le j \le \tilde N - 1} \mathbb{P}^\pi_{l_j}\{|\hat k - \tau| \le \Delta/3\} \stackrel{(a)}{>} \frac{1}{\beta^2} \cdot \beta = \frac{1}{\beta} > 1,$$
where $(a)$ follows from assumption (C-5), which implies that $\min_{1 \le j \le \tilde N - 1} \mathbb{P}^\pi_{l_j}\{|\hat k - \tau| \le \Delta/3\} > \beta$. This is a contradiction, since $\mathbb{P}_{l_{\tilde N}}\{|\hat k - \tau| > \Delta/3\} \le 1$. Hence we conclude that (C-5) cannot hold and necessarily
$$\max_{1 \le j \le \tilde N - 1} \mathbb{P}^\pi_{l_j}\{|\hat k - \tau| > \Delta/3\} \ge 1 - \beta,$$
implying that for all cases such that $N \ge n_1$,
$$\sup_{1 \le \tau \le N+1} \mathbb{P}^\pi_\tau\{|\hat k - \tau| > (C/3) \log N\} \ge 1 - \beta. \tag{C-8}$$

Case 2: $N < n_1$. Let $\tau_1 = N - 1$ and $\tau_2 = N$. Suppose first that $\mathbb{P}_{\tau_1}\{|\hat k - \tau_1| = 0\} \ge 1/2$. Then
$$\begin{aligned}
\mathbb{P}_{\tau_2}\{|\hat k - \tau_1| = 0\}
&= \mathbb{E}_{\tau_2}\big[\mathbf{1}\{|\hat k - \tau_1| = 0\}\big] = \mathbb{E}_{\tau_1}\bigg[\exp\bigg\{\log \frac{\mathbb{P}_a\{Y_{N-1}\}}{\mathbb{P}_b\{Y_{N-1}\}}\bigg\}\, \mathbf{1}\{|\hat k - \tau_1| = 0\}\bigg] \\
&\ge \exp\{-\phi\}\, \mathbb{E}_{\tau_1}\big[\mathbf{1}\{|\hat k - \tau_1| = 0\}\big] \ge \frac{1}{2} \exp\{-\phi\},
\end{aligned}$$
and hence $\mathbb{P}_{\tau_2}\{|\hat k - \tau_2| \ge 1\} \ge (1/2)\exp\{-\phi\}$. If instead $\mathbb{P}_{\tau_1}\{|\hat k - \tau_1| = 0\} < 1/2$, then $\mathbb{P}_{\tau_1}\{|\hat k - \tau_1| \ge 1\} \ge 1/2$. In either case, we deduce that
$$\sup_{\tau \in \{\tau_1, \tau_2\}} \mathbb{P}^\pi_\tau\{|\hat k - \tau| \ge 1\} \ge (1/2)\exp\{-\phi\}.$$
Noting that in the current case $1 \ge N/n_1 \ge (1/n_1) \log N$, we deduce that
$$\sup_{1 \le \tau \le N+1} \mathbb{P}^\pi_\tau\{|\hat k - \tau| \ge (1/n_1) \log N\} \ge (1/2)\exp\{-\phi\}.$$
Combining the two cases, one obtains
$$\sup_{1 \le \tau \le N+1} \mathbb{P}^\pi_\tau\{|\hat k - \tau| \ge C_1 \log N\} \ge \alpha,$$
with $C_1 = \min\{C/3,\, 1/n_1\}$ and $\alpha = \min\{1 - \beta,\, (1/2)\exp\{-\phi\}\}$, which concludes the proof.
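The change-of-measure identity used in steps $(a)$ and $(d)$ of the proof of Lemma 3 (and again in Case 2) can be verified in a minimal setting: for any random variable $V$, $\mathbb{E}_Q[V] = \mathbb{E}_P[\exp(Z)\, V]$ with $Z = \log(dQ/dP)$. The Bernoulli parameters below are arbitrary illustrations.

```python
import math
from itertools import product

def change_of_measure_check(n=4, p=0.3, q=0.7):
    """Check E_Q[V] = E_P[exp(Z) V] with Z = log(dQ/dP), for V = sum of the
    observations, by enumerating all Bernoulli sequences of length n."""
    e_q = e_p_weighted = 0.0
    for ys in product([0, 1], repeat=n):
        v = sum(ys)
        prob_p = math.prod(p if y else 1 - p for y in ys)
        prob_q = math.prod(q if y else 1 - q for y in ys)
        z = math.log(prob_q / prob_p)
        e_q += prob_q * v
        e_p_weighted += prob_p * math.exp(z) * v
    return e_q, e_p_weighted
```

Both expectations equal $nq$, as they must; in the proof, the same reweighting transfers probabilities computed under a change at $l_{\tilde N}$ to probabilities under a change at $l_j$ at an $\exp(-\Delta\phi)$ cost.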