Dynamic Pricing Under Model Uncertainty:
Learning and Earning
Assaf Zeevi (Columbia University)
Alex Slivkins (MSR, NYC)
EC, Portland, June 2015
Game plan
Part 1 (AZ) : motivation, set up, parametric, no inventory constraints,
incomplete learning, remedies, model misspecification [ 75 min, 2:00 - 3:15 ]
[ Break, 3:15-3:45 ]
Part 2 (AS): bandit connection, capacitated / constrained, discretization,
contextual formulation, open problems
[ 75 min, 3:45 - 5:00 ]
Part 1: Outline
◮ Motivation: why are we interested in this problem?
◮ Formulation: in its simplest form...
◮ Some history and preliminary observations
◮ Simple lookahead policies [ Bayesian and non-B ]
◮ Incomplete learning (and possible remedies...)
◮ Exploration-Exploitation-Approximation
◮ Model misspecification
Dynamic pricing: general background
adjusting prices over time in order to maximize revenues
◮ one of the core problems in modern revenue management
◮ adjusting prices serves multiple purposes:
– stimulate demand/sales
– control variation in demand to support production planning
– reflect shadow cost of resources/inventory
– match competitors’ “moves”
– monitor/track changing market conditions
– learn about consumer preferences
Price variation in sales data: Empirical illustration
[figure: two panels over time (days 0–350): estimated demand parameter θ̂2 drifting between 0 and −3, and the offered rate (between 3.5 and 5.5) with the optimal rate marked]
Price variation and the effects of competition
[figure: mean daily offered rate (%) of firm 1 and firm 2 over time (days 500–700); firm 2 adjusts its rate a few days following the adjustment by firm 1]
cost of funds is stable throughout time horizon!
Dynamic pricing: general background
The macro-level view:
◮ basic macro-economic theory postulates price rigidity
– prices often do not track well changing economic conditions
– basis for theory of business cycle dynamics and monetary policy
◮ recent macro-economic studies have shown:
– DP considerations affect nearly 50% of observed “price flexibility” [ across major part of US economy! ]
– these effects have been increasing rapidly since the 1980’s [ post de-regulation act & birth of modern revenue management.. ]
– characteristics are highly transitory [ manifesting only on relatively short time cycles ]
Dynamic pricing: general background
Micro/tactical view:
◮ Airlines among pioneers in developing DP approaches [ Phillips (05’), Talluri and van Ryzin (05’) ]
◮ price adjustment via capacity controls
– open/close fare classes allows simple feedback mechanism...
– driven by dynamic programming logic
◮ Price experimentation increasingly prevalent
– 90% of large U.S. retailers conduct some type of price experiments [ Gaur and Fisher (05’) ]
– testing for merchandise selection [ Rajaram and Fisher (00’) ]
Dynamic pricing: general background
Enabling/facilitating factors
◮ legislative: de-regulation and resulting competition
◮ technological: computing and advent of Internet
– Direct-to-Consumer model [ Dell, Amazon, Apple, eBay,... ]
– simple and nearly frictionless price changes
– easy to collect and analyze response data
Some key challenges:
1. Efficient use of past data [ inference ]
2. “Smart” ways to collect new data [ active learning ]
3. Tractable and provably good pricing policies [ approximation ]
Focus of this tutorial
premise: monopolist selling to non-strategic consumers,
finite time horizon, w or w/o inventory constraints
– not very realistic but need to start somewhere...
salient features of ‘learning and earning’ problem:
– uncertain consumer behavior [ not known and may change ]
– noisy observations [ limited sales data ]
– competition [ models/actions only partially observable ]
focal points:
– how to balance optimization and estimation
– price experimentation [ how long, how many prices etc ]
– illustrate potential issues / pitfalls / solutions
– what constitutes “good” dynamic pricing strategies
Problem formulation (in its simplest form)
Buyers: arrive sequentially, indexed t = 1, 2, . . .
T measures the length of horizon
Seller: sells single product
decides on price pt , t = 1, 2, . . .
Demand model: λ(p; θ)
θ is unknown parameter
Realized demand at time t is
quantity based: Dt = λ(pt ; θ) + εt
customer based: Dt = 1{sale at time t}
Objective: design pricing policy π to maximize cumulative expected revenues
R(π, T) = E[ ∑_{t=1}^T pt Dt ]
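To make the objective concrete, here is a minimal simulation sketch of evaluating R(π, T) for a fixed-price policy under quantity-based demand; the linear demand form, parameter values, and noise level are illustrative assumptions, not the tutorial's exact instance.

```python
import random

def simulate_revenue(policy, theta=(2.0, -0.5), T=1000, sigma=0.1, seed=0):
    """Run a pricing policy for T periods under an assumed linear demand
    lambda(p; theta) = theta0 + theta1 * p, with Gaussian noise eps_t."""
    rng = random.Random(seed)
    theta0, theta1 = theta
    revenue, history = 0.0, []
    for t in range(1, T + 1):
        p = policy(t, history)                         # price decision p_t
        d = theta0 + theta1 * p + rng.gauss(0, sigma)  # realized demand D_t
        revenue += p * d
        history.append((p, d))
    return revenue

# oracle benchmark: with theta known, p* = -theta0 / (2*theta1) maximizes p*lambda(p)
p_star = -2.0 / (2 * -0.5)   # = 2.0 for these illustrative parameters
rev = simulate_revenue(lambda t, h: p_star)
```

A learning policy would replace the constant `lambda t, h: p_star` with a rule that maps the history of (price, demand) pairs to the next price.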
Origins of this line of research
Econ literature: early 70’s...
Rothschild (74):
◮ Bayesian formulation, finite number of prices
(infinite horizon/discounted) two-armed bandits
conclusions: ex-ante optimal actions may be ex-post suboptimal
⇒ incomplete learning phenomenon
Aghion, Bolton, Harris, and Jullien (91):
◮ optimal experimentation
McLennan (84), Keller and Rady (99):
◮ further demonstration of incomplete learning
How do firms set prices in practice
◮ A typical procedure:
1. initial data collection (e.g., market research)
2. postulate a (typically simple) parametric model of demand
3. estimate model parameters [ possibly price experiment ]
4. optimize empirical estimate of profit fn
5. track market response to presented price
6. update estimates using new sales data
◮ prevalent use of Bayesian-based pricing engines
◮ common approaches rely on (semi) myopic policies
use optimized prices to draw inference...
... with hope that learning takes care of itself
Where did this all begin? (cont’d)
OR/MS literature: only emerged in recent years...
Lobo and Boyd (2003):
◮ linear demand, normal prior, no inventory constraints
practical price experimentation, myopic policies
Aviv and Pazgal (2005):
◮ exponential demand, gamma prior, w/ inventory constraint
first paper to adapt Gallego-van Ryzin (1994) model...
revisited in Farias and Van Roy (2009)...
Besbes and Z (2009):
◮ non-Bayesian, parametric and nonparametric, w/ inventory constraints
exploration-exploitation, performance guarantees and optimality
Quite a bit of recent activity...
◮ for recent work try googling:
– Omar Besbes, Denis Saure
– Victor Araman, Rene Caldentey
– den Boer and Zwart
– Cooper et al
– Paat Rusmeviechientong
– Mike Harrison, N. Bora Keskin, ...
◮ for work in econ check:
– Dirk Bergemann, Godfrey Keller, Sven Rady, Ilya Segal, Chris Harris, ...
◮ for work in CS community see second part of tutorial...
Roadmap
Part I: A close look at myopic (limited lookahead) policies
Part I(a): Bayesian setting
Part I(b): Frequentist setting
Part II: Impact of model misspecification
Part III: A prelude to capacitated problems...
Second part of tutorial: capacitated problems, discretization, bandit connection etc...
Part I: The Good/Bad/Ugly in myopic policies
Recall typical strategies:
◮ test a few price points [ initialization ]
◮ either update prior (Bayesian) or estimator (Frequentist) [ estimation ]
◮ compute optimal price given current information [ optimization ]
◮ plug in prevailing price and observe sales [ information gathering ]
◮ go back to step 2 and repeat
A first hint of trouble...
Key assumption: seller knows demand model structure up to unknown parameter(s)
⇒ the model is well specified
Why are we assuming this?
Illustration of what can go wrong...
[figure: two panels of true vs. fitted demand functions over price, and two panels of the resulting price iterates p̌ over 20 iterations]
model misspecification can lead to unstable price dynamics...
Part I(a): An illustrative Bayesian setting
◮ unknown parameter θ takes one of two values {θ0 , θ1 }
– two models/hypotheses indexed by 0 and 1
[figure: the two candidate demand curves λ0 (p) and λ1 (p) in two scenarios, with the prices p̂ = 1 and p̂ = 0.95 marked]
An illustrative Bayesian setting (cont’d)
◮ prior: q [ belief that model i is correct ]
– at t = 0 puts mass q0 on model 0 [ 1 − q0 on model 1 ]
◮ revenue/profit fn: r(p; θi ) = p · λ(p; θi ) for model i
◮ price optimization at time t: maximize r(p) = Eq r(p; θ)
– q ≡ qt−1 is prior at time t − 1
– gives price decision pt at time t
◮ Bayesian updating: compute qt using Bayes rule
◮ continue until end of horizon T
⇒ myopic Bayesian policy (MBP)
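A minimal sketch of the MBP loop for the two-model setting with Bernoulli sales; the two linear purchase-probability curves, the price grid, and the prior below are illustrative assumptions, not the tutorial's exact instance.

```python
import random

def mbp(lam0, lam1, prices, q0, true_model, T=500, seed=1):
    """Myopic Bayesian policy (sketch): at each step, price to maximize
    expected revenue under the current belief q = P(model 0 is correct),
    then update the belief by Bayes' rule on the observed sale/no-sale."""
    rng = random.Random(seed)
    q = q0
    lam_true = lam0 if true_model == 0 else lam1
    path = []
    for _ in range(T):
        # myopic step: maximize q*r0(p) + (1-q)*r1(p) over the price grid
        p = max(prices, key=lambda p: q * p * lam0(p) + (1 - q) * p * lam1(p))
        sale = rng.random() < lam_true(p)          # Bernoulli sale at price p
        l0 = lam0(p) if sale else 1 - lam0(p)      # likelihood under model 0
        l1 = lam1(p) if sale else 1 - lam1(p)      # likelihood under model 1
        q = q * l0 / (q * l0 + (1 - q) * l1)       # Bayes update
        path.append((p, q))
    return path

# two hypothetical linear purchase-probability models
lam0 = lambda p: max(0.0, min(1.0, 1.0 - 0.5 * p))
lam1 = lambda p: max(0.0, min(1.0, 0.8 - 0.3 * p))
path = mbp(lam0, lam1, [round(0.1 * k, 1) for k in range(1, 21)], 0.5, true_model=0)
```

Note the failure mode discussed next: at a price where the two curves intersect, `l0 == l1` and the belief does not move, which is exactly how MBP can get stuck at an uninformative price.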
Performance of MBP
Revenue loss relative to oracle: oracle knows a priori the correct model...
∆(T) = [ T r(p∗ ) − R(π, T) ] / r(p∗ )
– numerator is diff between best achievable revenues and those of policy
– π is myopic Bayesian policy
– denominator makes for convenient normalization
– ∆(T) measures # of customers who were quoted sub-optimal price...
– the smaller ∆(T), the better the candidate policy...
Typical objective: sublinear growth, ideally with matching bounds...
Uninformative prices and indeterminate equilibria
[figure: the myopic best-response map ϕ(q) over beliefs q ∈ [0, 1], with model-optimal prices p∗0 = 0.78 and p∗1 = 1.33, the uninformative price p̂ = 1.00, and the fixed-point belief q̂ = 0.67]
Implications: Performance of MBP
[figure: ∆(T) under MBP grows linearly in T, reaching 14.5 by T = 250 and 27.4 by T = 500]
∆(T ) = # customers “lost” due to model uncertainty...
Is this the best that can be done?
Simple competitor policy
[figure: ∆(T) stays bounded near 4.1 over the horizon T ∈ [0, 500]]
Theory behind numerical illustration
Thm. [Keskin, Harrison and Z (09’)]
1. Beliefs under MBP converge (a.s.) to a limit under either ambient model;
2. For “many” demand models MBP will get “stuck” in indeterminate equilibrium with probability 1;
3. Profit losses increase linearly in horizon: ∆(T) ≥ cT
◮ MBP misprices on almost every customer...
◮ incomplete learning despite many price tweaks
What can be achieved and how?
Def. A policy is discriminative if the two underlying models are separated at each pricing decision of the policy...
Thm. [Keskin, Harrison and Z (09’)]
1. The class of discriminative policies achieves bounded profit loss: ∆(T) ≤ C for all T
2. MBP can be modified straightforwardly to be discriminative
Some questions
Is the phenomenon of incomplete learning specific to:
Q1. the Bayesian setting? [ versus classical estimation ]
Q2. simple discrete settings? [ does continuous updating help? ]
Q3. “interlaced” estimation and optimization? [ will batching help? ]
Part I(b): Towards a more general theory
◮ unknown parameter vector θ ∈ Θ
– uncountable number of hypotheses / demand models...
– no prior [ non-Bayesian setting ]
◮ revenue/profit fn: r(p; θ) = p · λ(p; θ)
◮ estimation: at stage t form estimator θ̂t based on past observations
◮ price optimization: at stage t maximize r(p; θ̂t )
– obtain price pt
◮ use price pt , observe outcome, and repeat
◮ continue until end of horizon T
⇒ myopic LS/ML/GMM/etc
An illustrative example
◮ Linear demand model: λ(p; θ) = θ1 p + θ0 [ quantity-based RM ]
◮ Observed demand: given pricing decision pt ,
Dt = λ(pt ; θ) + εt
◮ Admissible policies π:
– pricing decision at stage t based on {p1 , D1 , . . . , pt−1 , Dt−1 }
– pricing policies do not know θ
◮ Myopic ordinary least squares:
– at stage t − 1 have LS estimator θ̂t−1
– set pt to maximize { p λ(p; θ̂t−1 ) }
– observe Dt and repeat...
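This myopic least-squares loop can be sketched as follows; the parameter values, noise level, initial test prices, and price bounds are illustrative assumptions (the tutorial's point, shown next, is that such plug-in prices can settle without sufficient exploration).

```python
import random

def myopic_ols(theta=(2.0, -0.5), T=200, sigma=0.05, p_bounds=(0.5, 3.0), seed=2):
    """Myopic least-squares pricing (sketch): refit (theta0, theta1) by OLS
    on all past (price, demand) pairs, then charge the plug-in optimal price."""
    rng = random.Random(seed)
    th0, th1 = theta
    prices = [1.0, 2.5]                       # two initial test prices
    demands = [th0 + th1 * p + rng.gauss(0, sigma) for p in prices]
    for t in range(T):
        n = len(prices)
        pbar = sum(prices) / n
        dbar = sum(demands) / n
        sxx = sum((p - pbar) ** 2 for p in prices)
        b1 = sum((p - pbar) * (d - dbar) for p, d in zip(prices, demands)) / sxx
        b0 = dbar - b1 * pbar
        # plug-in maximizer of p * (b0 + b1*p), clipped to the price range
        p = -b0 / (2 * b1) if b1 < 0 else p_bounds[1]
        p = min(max(p, p_bounds[0]), p_bounds[1])
        prices.append(p)
        demands.append(th0 + th1 * p + rng.gauss(0, sigma))
    return prices

prices = myopic_ols()
```

Because the greedy prices quickly become nearly identical, later observations add almost no price dispersion, so the estimator stops improving: the data-collection side effect of the pricing decision is ignored.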
Performance of myopic policy
[figure: ∆(T) under the myopic policy grows with T (values from ≈600 to ≈780 as T ranges up to 10^5)]
What’s wrong?
[figure: distribution of the price pT within a sample of size R = 10,000; the prices cluster in a narrow band between roughly 0.98 and 1.14]
Towards a general theory
◮ myopic policies fail because they are “too greedy”
– decisions intended to optimize revenues are used for data collection...
– ... and this data is then used for next decisions.
◮ ignores necessity of suitable information gathering
Towards a general theory (cont’d)
Thm. [Keskin and Z (14’)] For “regular” parametric models of demand, any admissible policy π must satisfy:
∆(T) ≥ ∑_{t=1}^T C1 / ( C2 + It (π; θ) )
◮ It (π; θ) = Fisher information at time t for observations generated under π
◮ best scenario: information grows linearly
– even then, no policy can achieve bounded revenue loss!
◮ worst scenario: information growth is bounded
– revenue losses grow linearly...
Fixing myopic policies
Key idea: optimize myopically but insert constraint on information growth in the form of a lower information envelope
◮ make sure at each stage t that information exceeds envelope constraint
◮ previous result suggests envelope needs to grow linearly in t...
true, but not exactly the right idea...
what can go wrong? too many pricing decisions aimed at information gathering...
◮ could lead to excessive revenue loss
Fixing myopic policies
◮ Estimate: using past observations form parameter estimate θ̂t
◮ Optimize: set pt to maximize p · λ(p; θ̂t )
◮ Envelope test: calculate Ĵt [ estimator of Fisher information ]
– if Ĵt ≥ κ log t then optimize
– if Ĵt < κ log t then price test
◮ repeat until end of horizon
– policy acts in semi-myopic manner
– seeks to optimize subject to information constraints
– violating constraint triggers price test
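The envelope-test logic above can be sketched as follows, assuming a linear demand model. Here Ĵt is proxied by the accumulated price dispersion S_xx (which drives the Fisher information of a linear fit); the value of κ, the ±0.3 test offset, and all parameters are illustrative choices.

```python
import math
import random

def semi_myopic(theta=(2.0, -0.5), T=2000, sigma=0.05, kappa=0.5, delta=0.3, seed=3):
    """Semi-myopic pricing (sketch): act greedily, but if the information
    proxy J_t falls below the envelope kappa*log(t), run a price test."""
    rng = random.Random(seed)
    th0, th1 = theta
    prices = [1.0, 2.5]                                 # initial price tests
    demands = [th0 + th1 * p + rng.gauss(0, sigma) for p in prices]
    n_tests = 0
    for t in range(2, T):
        n = len(prices)
        pbar = sum(prices) / n
        dbar = sum(demands) / n
        J = sum((p - pbar) ** 2 for p in prices)        # information proxy S_xx
        b1 = sum((p - pbar) * (d - dbar) for p, d in zip(prices, demands)) / J
        b0 = dbar - b1 * pbar
        greedy = -b0 / (2 * b1) if b1 < 0 else 3.0      # plug-in price
        greedy = min(max(greedy, 0.5), 3.0)
        if J >= kappa * math.log(t):
            p = greedy                                  # envelope satisfied: optimize
        else:
            n_tests += 1                                # envelope violated: price test
            p = greedy + (delta if n_tests % 2 else -delta)
        prices.append(p)
        demands.append(th0 + th1 * p + rng.gauss(0, sigma))
    return prices, n_tests

prices, n_tests = semi_myopic()
```

Since greedy prices add almost no dispersion, the log-growing envelope is eventually violated and price tests fire; each test injects fresh dispersion, which is the constraint doing its job.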
Performance of semi-myopic policy
[figure: ∆(T) under the semi-myopic policy plotted against log T; growth is roughly linear in log T (values between ≈600 and ≈680)]
Performance of semi-myopic policy
[figure: the expected information estimate E Ĵt grows linearly in t (values up to ≈1200 as t ranges up to 10^5)]
Performance of semi-myopic policy
[figure: mean number of price tests in first t periods]
Performance of semi-myopic policy
[figure: distribution of the price pT within a sample of size R = 10,000; prices now concentrate tightly between roughly 1.08 and 1.12]
Theory behind numerical illustration
Thm. [Keskin and Z (14’)] For “regular” parametric models of demand, the semi-myopic policy achieves
∆(T) ≤ C log T [ if parameter dimension = 1 ]
◮ matches lower bound (up to constants ... )
◮ under this policy information grows linearly...
– while envelope constraint only imposes log-growth!!!
Theory behind numerical illustration (cont’d)
Thm. [Keskin and Z (14’)] For “regular” parametric models of demand, the semi-myopic policy with information envelope proportional to √T achieves
∆(T) ≤ C √T log T [ if parameter dimension > 1 ]
◮ “matching” lower bound: for any policy ∆(T) ≥ c √T
◮ under this policy information also grows linearly...
Takeaway message
◮ myopic policies are simple and computationally attractive approximations...
◮ ... but myopic decisions can lead to incomplete learning
– in absence of suitable consideration for information gathering
◮ it is possible to “fix” myopic policies
– constraints on information accumulation
◮ familiar logic from UCB / ε-greedy...
◮ leads to simple rules-of-thumb of when/how to price test
Part II: Effects of misspecification
Motivation:
most mathematical models are abstractions of reality
not meant to, or capable of, capturing complexities of underlying
systems and/or phenomena
often only provide a crude approximation of underlying
“true” relationships among observed data...
=⇒ models are inherently misspecified
Main Q. What is the impact of model misspecification?
Background on misspecification
data X1 , . . . , Xn iid according to some distribution P [ density p ]
model using parametric family Qθ [ density q(·; θ), θ ∈ Θ ]
model is:
well-specified if p ∈ { q(·; θ) : θ ∈ Θ }
misspecified if p ∉ { q(·; θ) : θ ∈ Θ }
inference e.g., using Maximum Likelihood
θ̂n ∈ arg max{ q(X1 , . . . , Xn ; θ) }
Q. What are the properties of θ̂n ?
Background on misspecification
estimator θ̂n ∈ arg max{ q(X1 , . . . , Xn ; θ) }
large sample properties – log-likelihood fn
log q(X1 , . . . , Xn ; θ) = ∑_{i=1}^n log q(Xi ; θ) ≈ n · E[log q(X1 ; θ)]
expectation wrt P...
maximizing likelihood = minimizing KL-divergence
−E[log q(X1 ; θ)] = E log [ p(X1 ) / q(X1 ; θ) ] − E log p(X1 ) =: KL( P‖Qθ ) − E log p(X1 )
⇒ arg max{ E[log q(X1 ; θ)] } = arg min{ KL( P‖Qθ ) }
Background on misspecification
large sample behavior of misspecified ML-estimator:
θ∗ = arg min{ KL( P‖Qθ ) }
expect that θ̂n → θ∗ as n → ∞
– ML-estimator converges to value that minimizes KL-divergence
– this corresponds to member of Qθ “closest” to P...
– “close” is measured in KL (pseudo-)distance
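A small numeric sketch of this KL-projection view; the "true" distribution P on {0, 1, 2} and the Binomial(2, θ) model family below are made-up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for distributions given as dicts over a common support."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def binom2(theta):
    """Binomial(2, theta) pmf -- the (misspecified) model family Q_theta."""
    return {0: (1 - theta) ** 2, 1: 2 * theta * (1 - theta), 2: theta ** 2}

# a "true" distribution P on {0,1,2} that is NOT binomial
P = {0: 0.5, 1: 0.1, 2: 0.4}

# grid-search the KL projection of P onto the binomial family
grid = [k / 1000 for k in range(1, 1000)]
theta_star = min(grid, key=lambda th: kl_divergence(P, binom2(th)))

# for this family the KL minimizer matches the mean: theta* = E[X] / 2
mean = sum(x * px for x, px in P.items())   # = 0.9, so theta* = 0.45
```

Even though no binomial distribution equals P, the misspecified ML estimator still converges to a well-defined target, the KL-closest member of the family; the pricing question is whether decisions built on that target are any good.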
Discussion
◮ model misspecification has long history in statistics/decision theory, econometrics: good comprehensive reference is White (1996)
◮ first studied by Cox (1961), Berk (1966), Huber (1967) and others
main objectives of theory:
– establish asymptotic behavior under model misspecification [ mirroring classical theory... ]
– develop statistical tests to check whether model is misspecified
◮ misspecification = approximation error... = minimal “distance” between Qθ and P...
◮ reducing misspecification error: “grow” the complexity of the class Qθ ...
– leads to a variety of model selection procedures (AIC, BIC-like etc)
– adding deg. of freedom per sample size... [ e.g. Adjusted R² ]
– method of sieves (Grenander, Geman and Huang (1982))
Main question
what are the implications of model misspecification on the
quality of sequential decisions derived from these models?
Problem formulation
discrete periods t = 1, . . . , T
Dt = λ(pt ) + εt ;  λ(p) = E[D | p] is the true demand fn, unknown
εt represents statistical noise ; think Gaussian...
no historical data available: need to fit a model on the fly
objective: maximize cumulative revenues
E[ ∑_{t=1}^T pt Dt ] = E[ ∑_{t=1}^T pt λ(pt ) ]
sequence of prices {pt } adapted to history of demand observations...
Recall: How do firms set prices in practice
1. postulate a simple parametric model of demand [ e.g., linear, exponential, iso-elastic, logit ]
2. estimate model parameters [ possibly price experiment ]
3. optimize empirical estimate of profit fn
4. track market response to presented price
5. update estimates using new sales data and repeat
◮ common approaches rely on simple (semi) myopic policies
– use optimized prices to draw inference...
– ... with hope that learning takes care of itself
– ignore any issue related to misspecification...
Two concrete questions
Q1: What is the impact of model misspecification on price dynamics?
Q2: What can we hope to “learn” and “earn” given model misspecification?
Problem formulation (cont’d)
◮ focus on simple class of myopic pricing policies
– price based only on most recent observations ( extreme case of exp. smoothing... )
◮ already know that using all observations + myopic policy => incomplete learning
What can go wrong – I (no noise)
true demand model is logit: λ(p) = e^{3−p} / (1 + e^{3−p})
fit a linear demand model: 1 − βp
[figure: evolution of the price p̌ over 30 iterations relative to the optimal price p∗ (prices between 2 and 3.5)]
What can go wrong – II (no noise)
true demand model is logit: λ(p) = e^{4.1−p} / (1 + e^{4.1−p})
fit a linear demand model: 1 − βp
[figure: evolution of the price p̌ over 30 iterations relative to the optimal price p∗ (prices between 2 and 6)]
Misspecified pricing – no noise
true demand model is arbitrary: λ(p)
fit a linear demand model: α − βp
◮ need at least two prices...
  at each decision epoch, price at p_i and p_i + δ_i
recalibration:
  β_i = −[λ(p_i + δ_i) − λ(p_i)] / δ_i    [update slope]
  α_i = β_i p_i + λ(p_i)                  [update intercept]
new postulated model: α_i − β_i p
new decision: p_{i+1} = arg max_p { p · (α_i − β_i p) } = α_i / (2 β_i)
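As a concrete aid, the recalibration loop is easy to simulate end to end. A minimal sketch, assuming the logit curve λ(p) = e^(4.1−p)/(1 + e^(4.1−p)) from the numerical illustrations; the starting price, the fixed experimentation step δ = 0.1, and the iteration count are illustrative choices, not from the slides:

```python
import math

def true_demand(p):
    """Logit demand curve (unknown to the seller)."""
    return math.exp(4.1 - p) / (1.0 + math.exp(4.1 - p))

def recalibrate(p, delta, demand):
    """Fit the linear model alpha - beta*p through two observed demand points."""
    d0, d1 = demand(p), demand(p + delta)
    beta = (d0 - d1) / delta       # slope estimate (positive for decreasing demand)
    alpha = beta * p + d0          # intercept: the line passes through (p, demand(p))
    return alpha, beta

p = 1.0                            # illustrative starting price
for _ in range(30):
    alpha, beta = recalibrate(p, 0.1, true_demand)
    p = alpha / (2.0 * beta)       # myopic decision: argmax of p*(alpha - beta*p)
print(round(p, 3))                 # settles near the maximizer of p*lambda(p)
```

Note the fixed point of this map: p = α/(2β) together with α = βp + λ(p) forces βp = λ(p), which as δ → 0 becomes λ(p) + p λ′(p) = 0, i.e. exactly the first-order condition for p∗.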
Numerical illustration
true demand model is logit: λ(p) = e^(4.1−p) / (1 + e^(4.1−p))
fit a linear demand model: α − βp
[Figure: price vs. iterations; the recalibrated prices converge]
misspecified model... nevertheless leads to optimal decisions
Linear-based misspecified pricing with noisy observations
For customer blocks of size n_i, price at p_i and p_i + δ_i
Recalibration:
  λ̂(p_i) = (1/n_i) Σ_{t = t_i+1}^{t_i+n_i} D_t(p_i)                    [demand estimate at p_i]
  λ̂(p_i + δ_i) = (1/n_i) Σ_{t = t_i+n_i+1}^{t_i+2n_i} D_t(p_i + δ_i)   [demand estimate at p_i + δ_i]
  β̂_i = −[λ̂(p_i + δ_i) − λ̂(p_i)] / δ_i                                [update slope estimate]
  α̂_i = β̂_i p_i + λ̂(p_i)                                              [update intercept estimate]
new postulated model: α̂_i − β̂_i p
updated decision: p_{i+1} = α̂_i / (2 β̂_i)
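The noisy variant can be sketched the same way. A hypothetical simulation, assuming Bernoulli sales drawn from the logit curve and illustrative parameters (n_0 = 500, a = 2, 8 blocks, a price clip to a plausible range, and a guard against a noisy non-positive slope estimate), none of which come from the slides:

```python
import math
import random

def true_demand(p):
    """Logit demand (unknown to the seller); observed sales are Bernoulli draws."""
    return math.exp(4.1 - p) / (1.0 + math.exp(4.1 - p))

def block_estimate(p, n, rng):
    """Average of n Bernoulli sales observations at price p."""
    return sum(rng.random() < true_demand(p) for _ in range(n)) / n

rng = random.Random(1)
p, n0, a = 2.0, 500, 2                      # illustrative start, block size, growth factor
for i in range(8):
    n_i = n0 * a**i                         # geometrically growing blocks n_i = a^i * n_0
    delta_i = n_i ** -0.25                  # shrinking experimentation step delta_i = n_i^(-1/4)
    lam0 = block_estimate(p, n_i, rng)
    lam1 = block_estimate(p + delta_i, n_i, rng)
    beta = (lam0 - lam1) / delta_i          # noisy slope estimate
    alpha = beta * p + lam0                 # noisy intercept estimate
    if beta > 1e-3:                         # guard: skip the update on a degenerate slope
        p = alpha / (2.0 * beta)
    p = min(max(p, 0.5), 8.0)               # keep prices in a plausible range
print(round(p, 2))
```

With growing blocks and shrinking δ_i, the price sequence settles near the maximizer of p·λ(p), illustrating the consistency theorem below.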
Misspecified-based myopic pricing – Consistency
consider a setting where the postulated (and incorrect) model is linear
Thm (Consistency). Suppose that δ_i ↓ 0, n_i ↑ ∞, and some mild regularity conditions hold.
Then, for any initial price p_1, the sequence of prices {p_i : i ≥ 1} converges to p∗ = arg max_p { p λ(p) } in probability.
price decisions converge to the optimal point irrespective of the underlying demand model!
Picture proof...
[Figure: (a) demand functions: the true λ(p) vs. fitted linear models α_i − β_i p across iterations; (b) revenue functions: the true p λ(p) vs. p(α_i − β_i p); the maximizer of the fitted revenue curve approaches that of the true revenue curve]
Refined performance analysis and optimality results
Thm (Convergence rate – pricing sequence).
Fix a > 1, and take blocks n_i = a^i n_0 and δ_i = n_i^{−1/4}. Then
  E|p_T − p∗|² ≤ C (log T) / T^{1/2}
If the demand function is only known to be twice differentiable, then for any pricing policy there exists a demand function λ(·) such that
  E|p_T − p∗|² ≥ c / T^{1/2}
Optimality results (cont’d)
Thm (Revenue optimality).
Fix a > 1, and take blocks n_i = a^i n_0 and δ_i = n_i^{−1/4}. Then
  ∆(T) ≤ C (log T)² √T
For any pricing policy there exists a demand function λ(·) such that
  ∆(T) ≥ c √T
performance is almost best possible among all non-anticipating pricing schemes
Main takeaways
surprising sufficiency of linear demand models
  – Dawes (1979), The robust beauty of improper linear models in decision making, American Psychologist
simple family of parametric myopic pricing policies
  – prices (decisions) converge to the optimal point despite misspecification
  – overall generated revenues (performance) essentially best possible
provides justification for the prevalent use of simple parametric models...
“All models are wrong, but some are useful...” (George E. P. Box)
Some questions
◮ what is the impact of demand functions that “drift”?
  teaser: harder can make life easier...
◮ what is the impact of inventory constraints?
  teaser: not a much harder incomplete-information problem...
◮ what is the impact of dealing with multiple products?
  teaser: optimal experimentation has geometric structure...
◮ what is the impact of competition (game-theoretic formulation)?
  teaser: can lead to significant variation in price dynamics...
THANK YOU!
Dynamic Pricing with Model Uncertainty
Alex Slivkins (Microsoft Research NYC)
Assaf Zeevi (Columbia University)
Tutorial at ACM EC 2015 (Portland, June 2015)
Paradigmatic problem
Seller with limited supply: 𝑘 identical items to sell
In each round 𝑡 = 1 . . 𝑛, a new customer arrives
seller offers 1 item @ price 𝑝𝑡 ∈ [0,1]
customer accepts or rejects
Until no more items or no more customers
Goal: adjust price over time, to maximize reward
S(p) = Pr[sale @ price p]   (the demand curve)
  fixed but unknown to the seller; no parametric assumptions
interpretation: sale if and only if p_t ≤ v_t,
where v_t is the customer’s value, drawn IID.
What is going on
First intuition: we want to sell at (unknown) “best price”
offered price too low ⇒ likely sale, wasted item
offered price too high ⇒ likely no sale, wasted customer
… but we learn something about the demand distribution
Our learning ability is handicapped:
can’t afford to sell too many items while trying “low” prices
(“explore-exploit tradeoff ”, “learn-and-earn”)
without parametric assumptions, no long-range inference
[Figure: a demand curve S(p) with two candidate prices p_1, p_2]
Outline
Unlimited supply
Limited supply
Beyond best fixed price
Abstract resource constraints
Extension to contexts
Further directions
Unlimited supply (k = n)
Multi-armed bandit problem:
  actions (prices) with fixed but unknown expected rewards
  partial feedback: observed for the chosen action, but not for all actions
  the basic model for explore-exploit
Special feature: sale @ price p ⟹ sale @ any lower price
Reward function R(p) = p · S(p)   (expected per-round reward)
Best fixed price p∗: maximizes R(p)
algorithm’s performance is measured by
  Regret(n) = n · R(p∗) − E[algorithm’s total reward]
Unlimited supply: Reduction to bandits
Uniform discretization U ⊂ [0,1] with mesh ε; then run a bandit algorithm on U
Regret = Regret_bandit(U) + (OPT − OPT_U)   [bandit regret + discretization error]
Round down p∗ to the nearest price in U, call it q∗
  selling @ q∗ loses at most ε per sale, so discretization error ≤ εn
Pick ε in advance to optimize regret
Unlimited supply: Reduction to bandits
Simple algorithm: uniform discretization U (mesh ε) + explore-then-exploit
  pick an action u.a.r. from U for n_0 rounds, then pick the “best action” and stick with it
  pick ε, n_0 in advance to optimize regret ⟹ Regret O(n^{3/4})
Better approach: adaptive exploration
  adapt to observations to zoom in on better actions ⟹ optimal regret O(n^{2/3})
Unlimited supply: Optimism under uncertainty
Well-known heuristic for adaptive exploration: “round up” the uncertainty for each action
If action i is chosen n_i times with empirical average μ̄_i, then exp. reward μ_i ∈ μ̄_i ± O(1/√n_i)
Upper Confidence Bound (UCB): μ_i^UCB = μ̄_i + O(1/√n_i), so μ_i ≤ μ_i^UCB w.h.p.
UCB algorithm: in each round, pick the action i with maximal “index”, where index = μ_i^UCB
  each round implicitly combines explore & exploit
Theorem: regret O(√(n|U|)) w.r.t. OPT_U   (discretization)
  ⟹ regret O(n^{2/3}) w.r.t. OPT   (adjust ε)
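The UCB recipe on a discretized price grid fits in a few lines. A hedged sketch: the linear demand S(p) = 1 − p (values uniform on [0,1]), the grid of ten prices, and the horizon are illustrative choices, not from the slides:

```python
import math
import random

def ucb_pricing(n, prices, sale_prob, seed=0):
    """UCB over a finite price grid; index = empirical mean revenue + confidence radius."""
    rng = random.Random(seed)
    counts = [0] * len(prices)
    revenue = [0.0] * len(prices)
    total = 0.0
    for t in range(1, n + 1):
        def index(i):
            if counts[i] == 0:
                return float("inf")            # try every price at least once
            bonus = math.sqrt(2.0 * math.log(t) / counts[i])
            return revenue[i] / counts[i] + bonus
        i = max(range(len(prices)), key=index)
        sale = rng.random() < sale_prob(prices[i])
        r = prices[i] if sale else 0.0
        counts[i] += 1
        revenue[i] += r
        total += r
    return total / n                           # realized average per-round revenue

prices = [j / 10 for j in range(1, 11)]        # discretization U with mesh 0.1
S = lambda p: 1.0 - p                          # values uniform on [0, 1]
avg = ucb_pricing(200_000, prices, S)
best = max(p * S(p) for p in prices)           # best fixed price on the grid
```

Over a long horizon the realized average revenue approaches the best grid price's R(p) = p·S(p), illustrating the "explore and exploit in every round" behavior.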
Unlimited supply: Lower bound on regret
Theorem: Regret ≥ Ω(n^{2/3})
A family of problem instances: values v_t are multiples of ε (the uniform discretization U)
For prices p ∈ U, and a needle-in-a-haystack price p_0 ∈ U:
  R(p) ≡ p · S(p) = { r + ε if p = p_0;  r otherwise }
For ε = n^{−1/3}, cannot find the needle!
Any algorithm has regret Ω(n^{2/3}) for some p_0.
Limited supply (k < n)
exploitation is limited by the #items left after exploration [even with explore-then-exploit]
maximizing expected per-round reward is not the right goal; need to think about expected total reward
Best fixed price: p∗ = arg max_p REW(p), where REW(p) = expected total reward for fixed price p
Regret(k, n) = REW(p∗) − REW(algorithm)
now want regret sublinear in k = #items
Limited supply: Explore-then-exploit fails for k ≪ n
Uniform discretization, n_0 rounds of exploration (u.a.r.)
If n_0 ≥ k: for the problem instance with value v_t ≡ 1, all items are sold at expected average price 1/2 ⟹ regret k/2
If n_0 ≤ k: assume v_t ∈ {0, v} with Pr[v_t = v] = k/n
  the algorithm knows S(p) only up to ±1/√n_0, which is ≫ k/n,
  so it cannot tell v = 1/2 from v = 1 ⟹ regret k/2
Linear regret if k < O(√n); sub-linear (but sub-optimal) regret if k ∼ n.
Limited supply: UCB algorithm for total rewards
k items, n rounds; uniform discretization U; S(p) = Pr(sale @p)
Want: in each round, pick the price arg max_{p ∈ U} UCB(REW(p))
Approximate: REW(p) ≈ p · min(k, n · S(p))   [n · S(p) ≈ #sales @p]
Algorithm: in each round, pick the price p ∈ U with maximal
  Index(p) = p · min(k, n · UCB(S(p)))   [UCB(S(p)) = avg. sales rate + conf. term]
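The capped index is a one-liner. A simulation sketch, assuming Bernoulli sales with the linear curve S(p) = 1 − p, supply k = n/4, and an illustrative price grid; all parameters are illustrative, not from the slides:

```python
import math
import random

def ucb_s(sales, count, t):
    """Upper confidence bound on the sale probability S(p)."""
    if count == 0:
        return 1.0                                  # optimistic before any data
    return min(1.0, sales / count + math.sqrt(2.0 * math.log(t) / count))

def ucb_total_rewards(n, k, prices, sale_prob, seed=0):
    """Each round, pick p maximizing Index(p) = p * min(k, n * UCB(S(p)))."""
    rng = random.Random(seed)
    counts = [0] * len(prices)
    sales = [0] * len(prices)
    items, revenue = k, 0.0
    for t in range(1, n + 1):
        if items == 0:
            break                                   # out of inventory
        i = max(range(len(prices)),
                key=lambda j: prices[j] * min(k, n * ucb_s(sales[j], counts[j], t)))
        sold = rng.random() < sale_prob(prices[i])
        counts[i] += 1
        if sold:
            sales[i] += 1
            items -= 1
            revenue += prices[i]
    return revenue

n, k = 20_000, 5_000
prices = [j / 10 for j in range(1, 11)]
S = lambda p: 1.0 - p
rev = ucb_total_rewards(n, k, prices, S)
best = max(p * min(k, n * S(p)) for p in prices)    # deterministic proxy for REW at the best grid price
```

Note how the min(k, ·) cap immediately deflates the index of low prices (which would burn inventory), so exploration concentrates where the capped total reward is plausibly high.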
Limited supply: Overview of the analysis
Intuition: the algorithm zooms in on p∗∗ = arg max_{p ∈ U} p · min(k, n · S(p)), which is close to p∗ in expected total reward
How to bound the revenue loss from prices p ≠ p∗∗? Elegant tricks from bandit analysis do not apply.
“charging argument”: charge each p ≠ p∗∗ for each time it is chosen,
  so that the total revenue loss ≤ the sum of these charges,
  and the sum can be usefully bounded from above
focus on a carefully defined “high-probability event”
Limited supply: Better regret for regular demands
Recall: reward function R(p) = p · S(p).
Better regret if R(·) is concave: R″(·) ≤ 0 (regular demands)
How does it help: the analysis uses an upper bound on |H_{δ,U}|, where
  H_{δ,U} = { p ∈ U : R(p∗) − R(p) ≤ δ }
by concavity, R(·) is essentially quadratic near p∗ ⟹ a better upper bound on |H_{δ,U}|
Same algorithm (UCB for total rewards), but a different discretization step ε
Limited supply: Upper and lower bounds
UCB algorithm for total rewards:
  regret O((k log n)^{2/3}) without assumptions
  regret O(c_S (k log n)^{1/2}) with regular demands, where c_S is an instance-dependent constant
Both are optimal up to O(log n) factors
Lower bounds rule out regret o(k^{2/3}) and o(c_S k^{1/2}) for arbitrarily large k, n
  reduction to unlimited supply: the k^{2/3} LB holds for a needle-in-a-haystack example
  the √k LB holds for “generic” demand curves
Beyond best fixed price: All-knowing benchmarks (know the demand curve)
  best fixed price p∗   (weakest)
  optimal pricing policy
  optimal offline mechanism (Myerson 1981)   (strongest)
All benchmarks are within O(√(k log k)) of one another for regular demands
In general, the optimal pricing policy can be much better than p∗
Beyond best fixed price: Two prices better than one!
Example: a distribution over two prices twice as good as p∗
Problem instance: value v_t = 1 w/ prob εk/n, and v_t = ε otherwise
WLOG focus on prices p ∈ {ε, 1}. For both, REW(p) ≤ εk.
Distribution D: p = ε w/ prob (1 − ε)k/n, and p = 1 otherwise
Then REW(D) ≥ (2 − o(1)) εk.
Beyond best fixed price: All-knowing benchmarks
best fixed price ≤ best distribution over prices ≤ OPT ≤ LP-relaxation
  (the best fixed price is the standard benchmark for bandits)
OPT: best pricing policy: in each round, pick any price given the remaining resources
Regret is measured w.r.t. the LP-relaxation; optimal regret O(k^{2/3})
Beyond best fixed price: LP-relaxation
For a distribution D over prices p (S(p) = sale probab.):
  R(D) = E_{p∼D}[p · S(p)]   (exp. per-step reward)
  S(D) = E_{p∼D}[S(p)]       (exp. per-step #sales)
Linear relaxation: exp. total reward for always using D, assuming deterministic sales and fractional time:
  LP(D) = max τ R(D)  such that τ ≤ n and τ S(D) ≤ k   (τ = #steps before the algorithm stops)
  ⟹ LP(D) = R(D) · min(n, k/S(D))   (the “LP-value” of D)
Theorem: max_D LP(D) ≥ OPT
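The closed form LP(D) = R(D) · min(n, k/S(D)) makes the earlier two-price example easy to check numerically. A small sketch, assuming the two-value instance with illustrative n, k, ε (sale at price p iff p ≤ v_t):

```python
# Two-value instance: v_t = 1 w/ prob eps*k/n, and v_t = eps otherwise.
n, k, eps = 10**6, 10**4, 0.01
S = {eps: 1.0, 1.0: eps * k / n}          # sale probability at each candidate price

def lp_value(D):
    """LP(D) = R(D) * min(n, k / S(D)) for a distribution D: price -> probability."""
    R = sum(q * p * S[p] for p, q in D.items())   # expected per-step reward
    Sd = sum(q * S[p] for p, q in D.items())      # expected per-step #sales
    return R * min(n, k / Sd)

fixed_low = lp_value({eps: 1.0})          # always price at eps
fixed_high = lp_value({1.0: 1.0})         # always price at 1
q = (1 - eps) * k / n                     # mix: price eps w/ prob q, price 1 otherwise
mixed = lp_value({eps: q, 1.0: 1 - q})
print(fixed_low, fixed_high, round(mixed, 1))
```

Both fixed prices have LP-value εk = 100, while the mixture comes out close to 2εk, matching the "two prices better than one" claim.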
Beyond best fixed price: Algorithms for price distributions
Uniform discretization U; zoom in on the price distribution D over U with max LP-value
Three algorithms with optimal regret w.r.t. OPT (some fine print):
  UCB index: each round, pick D to maximize UCB[LP(D)]
  Multiplicative weights update (MWU): MWU on the dual variable, a fictitious cost-per-item;
    each round, pick p ∈ U with the best UCB of E[reward]/E[cost]
  Balanced exploration: among “plausibly optimal” D’s,
    pick one that explores each price in U “as much as possible”
Beyond best fixed price: “Balanced exploration”
confidence interval on LP(D) for a price distribution D, built from the confidence intervals on S(p) for each price p in support(D)
D′ is not “plausibly optimal” if its confidence interval for LP(D′) lies entirely below another’s
In each round, choose a “plausibly optimal” price distribution D. How?
  • pick a price p uniformly at random from U
  • pick the “plausibly optimal” D which maximizes D_p (the probability D places on p)
Beyond best fixed price / balanced exploration: Analysis (sketch)
(n rounds, k items; D = the algorithm’s selection in round t, D∗ = the best distribution)
  price distribution       E[sales @t]       E[reward @t]
  optimal D∗               k/n               OPT/n
  balanced exploration     ≤ k/n + Error     ≥ OPT/n − Error
For any plausibly optimal D, D∗, in each round t:
  balanced exploration ⟹ enough samples from each price in support(D)
  ⟹ sharp confidence interval on LP(D)   (same for D∗)
  ⟹ small upper bound on (1/n) · [LP(D) − LP(D∗)]
  ⟹ E[reward @t] close to OPT/n
Abstract resource constraints
Dynamic pricing on a fixed & finite price set U (e.g., a uniform discretization)
⟹ a more abstract bandit problem: the algorithm consumes limited resources (knapsack constraints)
“Bandits with knapsacks”
Abstract resource constraints: Bandits with Knapsacks
d resources: a budget B_i for each resource i; time is just another resource
In each round t, the algorithm chooses an action a_t ∈ A
  outcome: (reward; consumption of each resource)
  stop if any resource exceeds its budget
dynamic pricing as a special case: action = price, reward = money, resource = items
The outcome for action a is IID from a distribution Dist_a over outcomes
  • Dist_a is fixed over time and not known to the algorithm
Goal: maximize total reward given the constraints
Abstract resource constraints: Algorithms for Bandits with Knapsacks
Machinery from dynamic pricing with price distributions carries over:
zoom in on the distribution over actions with max LP-value
Three algorithms with optimal regret w.r.t. OPT:
  UCB index: each round, pick D to maximize UCB[LP(D)]
  Multiplicative weights update (MWU): MWU on the dual variable (fictitious cost-per-item);
    each round, pick the action with the best UCB of E[reward]/E[cost]
  Balanced exploration: among “plausibly optimal” D’s,
    pick one that explores each action “as much as possible”
Abstract resource constraints: Dynamic pricing with multiple products
selling multiple products, limited supply of each
  action = price vector (a price for each product)
each product consumes some primitive resources
  action = price vector (a price for each product)
bundling & volume pricing: given a collection of allowed bundles
  action = price vector (a price for each allowed bundle)
Discretization: U = a finite subset of actions. Given a specific discretization U, regret is w.r.t. OPT(U).
Abstract resource constraints: Dynamic procurement
“Dynamic pricing for buying” (vs. selling), e.g. in a crowdsourcing market (MTurk)
Employer with many tasks, limited budget
In each round t, a new worker arrives
  employer offers price p_t ∈ [0,1]; worker accepts or rejects
Until out of workers or out of money
Pr[accept @ price p] fixed but unknown
Goal: adjust the price over time to maximize #tasks
Extensions: e.g. multiple types of tasks with per-type budgets
Abstract resource constraints: Discretization
Theorem: Sufficient condition to bound the discretization error:
for each action a, some discretized action a′ approximates the ratio
  E[reward @a] / E[resource-i consumption @a]
dynamic pricing with a single product (action a = price p):
  ratio = p · Pr[sale @p] / Pr[sale @p] = p  ⟹ uniform mesh OK
dynamic procurement with a single budget: OK, but with a different mesh: {1/(1 + jε) : j ∈ ℕ}
multiple products/budgets: not clear which discretization to choose
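The procurement mesh can be checked directly: it is uniform in 1/p, and in procurement the reward-to-spend ratio of offering price p is 1/p, so rounding a price down to the mesh perturbs the ratio by at most ε. A small sketch with illustrative ε and a truncation point p_min (both my choices, not from the slides):

```python
eps, p_min = 0.01, 0.05

# Mesh {1/(1 + j*eps) : j in N}, generated one point past p_min so that
# every price p >= p_min has a mesh point below it.
mesh = []
j = 0
while True:
    v = 1.0 / (1.0 + j * eps)
    mesh.append(v)
    if v < p_min:
        break
    j += 1

def round_down(p):
    """Largest mesh price <= p (the mesh is stored in decreasing order)."""
    return next(q for q in mesh if q <= p)

# For any price p in [p_min, 1], the rounded price approximates 1/p within eps:
for i in range(1001):
    p = p_min + (1.0 - p_min) * i / 1000
    q = round_down(p)
    assert -1e-9 <= 1.0 / q - 1.0 / p <= eps + 1e-9
```

The key step: round_down picks q = 1/(1 + jε) with j = ⌈(1/p − 1)/ε⌉, so 1/q − 1/p lands in [0, ε).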
Abstract resource constraints: Best distribution beats best fixed action
[Figure: two actions, two resources; the uniform distribution over the two actions can be far better (≫) than the best fixed action]
Contextual dynamic pricing
Seller with k identical items
In each round t,
  a new customer arrives, with a known profile x_t (the “context”)
  seller offers 1 item @ price p_t ∈ [0,1]
  customer accepts with (unknown) probability S(p_t | x_t)
Until no more items or no more customers
Goal: adjust the price over time to maximize revenue
Contextual bandits: in each round, an observable “context”; all probabilities depend on both the action and the context.
Contextual dynamic pricing: Policy sets
policy: contexts → prices
Policy set Π = all policies learnable from offline data via a given method, e.g. linear regression, decision trees, etc.
OPT_Π = REW[best all-knowing algorithm restricted to Π]
Regret = OPT_Π − REW[algorithm]
Because of resource constraints, one may have OPT_Π ≫ REW[best fixed policy in Π]
Essentially, OPT_Π = REW[best distribution over policies in Π]
Trivial: treat policies as “meta-actions” ⇒ regret ∼ √|Π|
New challenge: regret ∼ √(log |Π|): explore many policies at once!
Contextual dynamic pricing: Plausibly optimal distributions
Treat policies π as “actions”. LP-value for a distribution D over policies:
  LP(D) = R(D) · min(n, k/S(D))
  R(D) = E_{π∼D} E_x[R(π(x) | x)]   (exp. per-step reward)
  S(D) = E_{π∼D} E_x[S(π(x) | x)]   (exp. per-step #sales)
Theorem: max_D LP(D) ≥ OPT
confidence interval on LP(D), built from the confidence intervals on R(π), S(π) for each policy π in support(D);
D′ is not “plausibly optimal” if its confidence interval for LP(D′) lies entirely below another’s
Contextual dynamic pricing: Algorithm for a finite #prices
Marry “balanced exploration” & prior work on contextual bandits.
Goal: zoom in on the optimal distribution over policies in Π (policy: contexts → actions)
In each round t,
  pick a plausibly optimal distribution D (given the observations so far)
  with low probability pick a price p_t u.a.r.;
  else draw a policy π ∼ D and pick price p_t = π(context x_t)
  pick D so as to explore every policy π′ ∈ Π at once:
    p_t = π′(x_t) with “near-optimal” probability
Contextual dynamic pricing: Results
with m feasible prices, regret O(√(mn · log(mn|Π|)))   (n rounds, policy set Π)
  optimal for any given m, n, Π
extends to multiple products
  ? bound the discretization error & pick the optimal discretization
the algorithm above is non-constructive;
very recently: a computationally efficient algorithm with the same regret
Further directions: Demand distribution changes over time
Lots known for unlimited supply (with a uniform discretization U):
  reduce to adversarial bandits
  optimal regret w.r.t. the best fixed price
    – unrestricted change over time
    – bounded “total variation” over time
  regret w.r.t. the best per-round price if demands change slowly:
    – by at most ε in each round, or as an unbiased random walk
    – by ≤ f(t, t′) between times t and t′, for some known f
Limited supply: ???
Further directions: Multiple products & bundle pricing
Status: strong results for a fixed discretization of price vectors
  ? discretization error / optimal uniform discretization
  ? is a fixed discretization the right approach?
    an alternative: evolve the discretization over time to zoom in on better prices; partial results for unlimited supply
  ? make no assumptions, but do better for “nice” instances,
    esp. if the worst-case problem is prohibitively high-dimensional;
    possibly many useful ways to define “nice”
Most related papers
(Only papers directly relevant to this talk; lots of other great work not mentioned.)
Literature review on dynamic pricing (in general): den Boer (2015)
Unlimited supply
Kleinberg & Leighton (2003): fixed discretization + bandit algorithm;
optimal results for IID and adversarial cases; sophisticated lower bound.
Limited supply: single product
Besbes & Zeevi (2009): first paper, explore-then-exploit for k~n
Babaioff, Blumrosen, Dughmi, Singer (2011): single-item case
Babaioff, Dughmi, Slivkins & Kleinberg (2012): UCB index algorithm,
optimal solution for general case and for regular demands
Wang, Deng & Ye (2014): regular, Lipschitz demands, k~n: optimal solution
Most related papers
Limited supply, multiple constraints
Besbes & Zeevi (2012): “primitive resources”, k~n, explore-then-exploit
Badanidiyuru, Kleinberg & Slivkins (2013): bandits with knapsacks (BwK):
balanced exploration, MWU algorithm; discretization for dynamic pricing
Badanidiyuru, Langford, Slivkins (2014): Contextual BwK with policy sets:
optimal regret via computationally inefficient algo (balanced exploration)
Agrawal & Devanur (2014): UCB Index algorithm for BwK;
extension of BwK to convex objectives and linear contexts
Agrawal, Devanur & Li (2015): Contextual BwK with policy sets and convex
objectives: optimal regret via computationally efficient algorithm
Most related papers
Bandits
Auer, Cesa-Bianchi, Fischer (2002): UCB algorithm for IID rewards
Auer, Cesa-Bianchi, Freund, Schapire (2002): adversarial rewards; MWU
Hazan & Kale (2009): bounded “total variation” over time
Slivkins & Upfal (2008), Slivkins (2011): expected rewards change “slowly”
Dynamic procurement
Badanidiyuru, Kleinberg, Singer (2012): approximation ratio
Singla & Krause (2013): UCB index algorithm
Ho, Slivkins & Vaughan (2014): adaptive discretization for unlimited budgets;
extension to repeated principal-agent game.