Economics and Probabilistic Machine Learning
David M. Blei
Departments of Computer Science and Statistics
Columbia University
Modern probabilistic modeling
An efficient framework for discovering meaningful patterns in massive data.

Off-the-shelf methods ("duct tape"):
- Many software packages available; typically fast and scalable

Models tailored to the problem's questions and assumptions:
- More challenging to implement; may not be fast or scalable

How to use traditional machine learning and statistics to solve modern problems?

Probabilistic machine learning: tailored models for the problem at hand.
- Compose and connect reusable parts
- Driven by disciplinary knowledge and its questions
- Large-scale data, both in terms of data points and data dimension
- Focus on discovering and using structure in unstructured data
- Exploratory, observational, and causal analyses
[Diagram: the probabilistic pipeline — knowledge & question plus data feed a loop: "Make assumptions" → "Discover patterns" → "Predict & Explore" → "Criticize model" → "Revise".]

Figure S2: Population structure inferred from the TGP data set using the TeraStructure algorithm at three values for the number of populations K (K = 7, 8, 9; populations LWK, YRI, ACB, ASW, CDX, CHB, CHS, JPT, KHV, CEU, FIN, GBR, IBS, TSI, MXL, PUR, CLM, PEL, GIH). The visualization of the θ's in the figure shows patterns consistent with the major geographical regions. Some of the clusters identify a specific region (e.g., red for Africa) while others represent admixture between regions (e.g., green for Europeans and Central/South Americans). The presence of clusters that are shared between different regions demonstrates the more continuous nature of the structure. The new cluster from K = 7 to K = 8 matches structure differentiating between American groups. For K = 9, the new cluster is unpopulated.
The probabilistic pipeline
- Design models that reflect our domain expertise and knowledge
- Given data, compute the approximate posterior of the hidden variables
- Use the computation to predict the future or explore the patterns in your data
Population analysis of 2 billion genetic measurements
[Figure: inferred population memberships for the HGDP groups, e.g., Adygei, Balochi, Basque, Bedouin, French, Han, Italian, Japanese, Maya, Papuan, Russian, Yoruba, across seven inferred populations.]

Communities discovered in a 3.7M node network of U.S. Patents
[Figure: fifteen topics, each shown as a list of its most probable words — e.g., (Game, Season, Team, Coach, Play, …), (Bush, Campaign, Clinton, Republican, House, …), (Stock, Percent, Companies, Fund, Market, …), (Government, War, Military, Officials, Iraq, …), (Art, Museum, Show, Gallery, Works, …), (Police, Yesterday, Man, Officer, Officers, …).]

Figure 5: Topics found in a corpus of 1.8 million articles from the New York Times. Modified from Hoffman et al. (2013).
Topics found in 1.8M articles from the New York Times
Neuroscience analysis of 220 million fMRI measurements
Events uncovered from 2M diplomatic cables

Figure 1: Measure of "eventness," or time-interval impact on cable content (Eq. 2). The grey background indicates the number of cables sent over time. This comes from the model fit discussed in Section 3. Capsule successfully detects real-world events from National Archive diplomatic cables.

We develop Capsule, a probabilistic model for detecting and characterizing important events in large collections of historical communication. We define an "event" to be a moment in history when embassies deviate from what each usually discusses, and when each embassy deviates in the same way. Capsule embeds this intuition into a Bayesian model.
[Figure: a 2D map of learned item attributes; nearby items form coherent groups, e.g., tortilla chips and salsas (tostitos, santitas, pace picante), taco fixings (taco shells, refried beans, seasoning mixes), and shredded cheeses (lucerne, kraft, sargento).]

Item characteristics learned from 5.6M purchases
Our perspective:
- Customized data analysis is important to many fields.
- This pipeline separates assumptions, computation, application.
- It facilitates solving data science problems.
What we need:
- Flexible and expressive components for building models
- Scalable and generic inference algorithms
- New applications to stretch probabilistic modeling into new areas
- Here I discuss two threads of research with Susan Athey's group.
- Build probabilistic models to analyze large-scale consumer behavior: many consumers choosing among many items.
- (Caveat: I'm not an economist.)
- Also joint with Francisco Ruiz.
- Vision: a utility model for baskets of items:

  U(basket) = [subs/comps] + [shopper] + [prices] + [other] + ⋯

- Goals:
  - design, fit, check, and revise this model
  - answer counterfactual questions about purchase behavior
Economic embeddings
Identifying substitutes and co-purchases in large-scale consumer data.

- Word embeddings are a powerful approach for analyzing language.
- They discover a distributed representation of words:
  - distances appear to capture semantic similarity.
- Many variants, but each reflects the same main ideas:
  - words are placed in a low-dimensional latent space
  - a word's probability depends on its distance to the other words in its context

Bengio et al., A neural probabilistic language model. Journal of Machine Learning Research, 2003.
Mikolov et al., Efficient estimation of word representations in vector space. Neural Information Processing Systems, 2013.
Image: Paul Ginsparg
[Figure: a neighborhood of the item-attribute map — Buitoni fresh pastas and sauces (angel hair, tortellini, ravioli, alfredo and pesto sauces) cluster together with similar items (genova meat sauce, genova ricotta ravioli, genova meat ravioli).]

- Exponential family embeddings generalize this idea to other types of data.
- They use generalized linear models and exponential families.
- Examples: neuroscience, recommender systems, networks, shopping baskets.
- Bigger picture: a statistical perspective on ideas from neural networks.
Zebrafish brain activity

Table 2: Analysis of neural data: mean squared error and standard errors of neural activity (on the test set) for different models. Both models significantly outperform […]; […] is more accurate than […].

Figure 1: Top view of the zebrafish brain, with blue circles at the locations of the individual neurons. We zoom in on 3 neurons and their 50 nearest neighbors (small blue dots), visualizing the "synaptic weights" learned by the model (K = 100). The edge color encodes the inner product of the neural embedding vector and the context vector, ρ_n^⊤ α_m, for each neighbor m. Positive values are green, negative values are red, and the transparency is proportional to the magnitude. With these weights we can form hypotheses about how nearby neurons are connected.

…the lagged activity conditional on the simultaneous lags of surrounding neurons. We studied context sizes c ∈ {10, 50} and latent dimension K ∈ {10, 100}.

Models. We compare to probabilistic factor analysis (FA), fitting K-dimensional factors for each neuron and K-dimensional factor loadings for each time frame. In FA, each entry of the data matrix is a Gaussian with mean equal to the …

Interactions between countries
Vacation-town deli

[Diagram: six items — jam, peanut butter #1, peanut butter #2, soda, bread, pizza.]

- Consider a vacation-town deli; it has six items.
- Customers either buy [pizza, soda] or [peanut butter, jam, bread]. Customers only buy one type of peanut butter at a time.
- Items bought together (or not) are co-purchased (or not). The peanut butters are substitutes. (For now we ignore many issues, e.g., formal definitions, price, causality.)
- We would like to capture this purchase behavior.
Vacation-town deli

- We endow each item with two (unknown) locations in a real space R^K: an embedding ρ_i and a context vector α_i.
- The conditional probability of each item depends on its embedding and the context vectors of the other items in the basket,

  x_{b,i} | x_{b,−i} ~ Poisson( exp{ ρ_i^⊤ Σ_{j≠i} α_j x_{b,j} } ).

- α_i are latent product attributes.
- ρ_i indicates how product i interacts with other products' attributes.
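To make the conditional concrete, here is a minimal numpy sketch of the deli example. The embeddings and context vectors are randomly generated stand-ins, not fitted values, and the item names are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2                                   # latent dimension
items = ["pizza", "soda", "pb1", "pb2", "jam", "bread"]
n = len(items)

# Hypothetical latent variables: an embedding rho_i and a
# context vector alpha_i for each item.
rho = rng.normal(size=(n, K))
alpha = rng.normal(size=(n, K))

def poisson_rate(i, basket):
    """lambda_i = exp( rho_i^T sum_{j != i} alpha_j x_{b,j} ):
    the conditional Poisson rate of item i given the rest of the basket."""
    context = alpha.T @ basket - basket[i] * alpha[i]
    return float(np.exp(rho[i] @ context))

basket = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # pizza and soda
rates = [poisson_rate(i, basket) for i in range(n)]
```

With a fitted model, `rates` would be high for items commonly co-purchased with pizza and soda and low for the peanut-butter/jam/bread group.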
[Figure: the fitted embeddings ρ (interaction) and context vectors α (attributes) of the six deli items, plotted in two dimensions.]

- Pizza and soda are never bought with bread, jam, and PB, and vice versa.
- Bread, jam, and PB are bought together.
- PB #1 is never bought with PB #2.
- PB #1 is bought with similar items as PB #2.
Exponential Family Embedding

- The goal of an EF-EMB is to discover a useful representation of data.
- Observations: x = x_{1:n}, where x_i is a D-vector.
- Examples:

  DOMAIN        INDEX                    VALUE
  Language      position in text i       word indicator
  Neuroscience  neuron and time (n, t)   activity level
  Network       pair of nodes (s, d)     edge indicator
  Shopping      item and basket (d, b)   number purchased
Exponential Family Embedding

[Diagram: observed data x_i, embeddings ρ, context vectors α.]

- Three ingredients: context, conditional exponential family, embedding structure.
- Two latent variables per data index: an embedding and a context vector.
- Model each data point conditioned on its context and latent variables.
- The latent variables interact in the conditional. How they interact depends on which indices are in the context and which one is modeled.
Context

- Each data point i has a context c_i, a set of indices of other data points.
- We model the conditional of the data point given its context, p(x_i | x_{c_i}).
- Examples:

  DOMAIN        DATA POINT       CONTEXT
  Language      word             surrounding words
  Neuroscience  neuron activity  activity of surrounding neurons
  Network       edge             other edges on the two nodes
  Shopping      purchased item   other item counts on the same trip
Conditional exponential family

[Diagram: observed data x_i, embeddings ρ, context vectors α.]

- The EF-EMB has latent variables for each data point's index: an embedding ρ[i] and a context vector α[i].
- These are used in the conditional of each data point,

  x_i | x_{c_i} ~ ExpFam( η_i(x_{c_i}; ρ[i], α[c_i]), t(x_i) ).

  (Poisson for counts, Gaussian for reals, Bernoulli for binary, etc.)
Conditional exponential family

- The natural parameter combines the embedding and context vectors,

  η_i(x_{c_i}) = f( ρ[i]^⊤ Σ_{j∈c_i} α[j] x_j ).

- E.g., an item's embedding (interaction) helps determine its count; its context vector (attributes) helps determine other items' counts.
Embedding Structure

- The embedding structure determines how parameters are shared.
- E.g., ρ[i] = ρ[j] for i = (Oreos, t) and j = (Oreos, u).
- Sharing enables learning about an object, such as a neuron, node, or item.
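A sketch of what sharing means in code, with hypothetical item names: every data index (item, trip) that refers to the same item looks up one shared parameter vector.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
items = ["oreos", "milk", "coffee"]      # hypothetical items

# One embedding and one context vector per *item*, not per data point.
rho = {item: rng.normal(size=K) for item in items}
alpha = {item: rng.normal(size=K) for item in items}

def embedding(index):
    """Look up the embedding for data index (item, trip); the trip is
    ignored, so rho[(oreos, t)] == rho[(oreos, u)] by construction."""
    item, trip = index
    return rho[item]

# The same array object backs every occurrence of "oreos".
shared = embedding(("oreos", 7)) is embedding(("oreos", 12))
```

Every purchase of an item, across all trips, therefore contributes evidence about that item's single embedding.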
Pseudolikelihood

- We model each data point, conditional on the others.
- Combine these ingredients in a "pseudo-likelihood" (i.e., a utility),

  L(ρ, α) = Σ_{i=1}^n [ η_i^⊤ t(x_i) − a(η_i) ] + log f(ρ) + log g(α).

- Fit with stochastic optimization; exponential families simplify the gradients.
Pseudolikelihood

- The objective resembles a collection of GLM likelihoods.
- The gradient is

  ∇_{ρ[j]} L = Σ_i ( t(x_i) − E[t(x_i)] ) ∇_{ρ[j]} η_i + ∇_{ρ[j]} log f(ρ[j]).

- (Stochastic gradients give justification to NN ideas like "negative sampling.")
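For the Poisson family, t(x) = x and E[t(x)] = exp(η), so the gradient above has a simple closed form. The sketch below takes stochastic gradient steps on the embeddings ρ for a single basket, holding α fixed; treating the log-prior term log f as an L2 penalty (precision `lam`) is an assumption of this sketch, not a detail from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 6, 2                     # items, latent dimension
lr, lam = 0.01, 0.1             # step size; assumed Gaussian-prior precision
rho = 0.1 * rng.normal(size=(n, K))
alpha = 0.1 * rng.normal(size=(n, K))

def sgd_step(x):
    """One stochastic step on a basket x of counts. For Poisson,
    grad_i = (x_i - exp(eta_i)) * ctx_i + gradient of the log prior."""
    global rho
    grad = np.zeros_like(rho)
    for i in range(n):
        ctx = alpha.T @ x - x[i] * alpha[i]      # sum_{j != i} alpha_j x_j
        eta = rho[i] @ ctx                       # natural parameter
        grad[i] = (x[i] - np.exp(eta)) * ctx - lam * rho[i]
    rho = rho + lr * grad

basket = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
for _ in range(200):
    sgd_step(basket)
```

In practice one would subsample baskets (and, per the slides, downweight the zeros) rather than iterate over one basket.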
Market basket analysis

- Data: purchase counts of items in shopping trips at a large grocery store
  - category level: 478 categories; 635,000 trips; 6.8M purchases
  - item level: 5,675 items; 620,000 trips; 5.6M purchases
- Context: other items purchased on the same trip
- Structure: embeddings for each item are shared across trips
- Family: Poisson (and we downweight the zeros)
Market basket analysis

- Recall the conditional probability

  x_i | x_{−i} ~ Poisson( exp{ ρ_i^⊤ Σ_{j≠i} α_j x_j } ).

- α_i reflects the attributes of item i.
- ρ_i reflects the interaction of item i with the attributes of other items.
A 2D representation of category attributes α_i

[Figure: t-SNE projection (components 1 and 2); nearby categories form coherent groups, e.g., baking (shortening, powdered sugar, condensed milk, extracts, flour, granulated sugar, brown sugar, pie crust, evaporated milk, pie filling, frozen pastry dough) and baby products (infant formula, disposable diapers, baby accessories, baby/youth wipes, infant toiletries, childrens/infants analgesics).]
A 2D representation of item attributes α_i

[Figure: t-SNE projection of item attributes; nearby items form coherent groups, e.g., Buitoni fresh pastas and sauces (tortellini, ravioli, alfredo and pesto sauces), water crackers, spreads, and deli cheeses (carrs, wellington, alouette, brie, gouda, havarti), and Mexican foods (tostitos, santitas, and mission tortilla chips, salsas, taco shells, refried beans, taco seasonings, shredded cheeses).]
Category level:

  MODEL                                     K=20    K=50    K=100
  Poisson embedding                         7.497   7.284   7.199
  Poisson embedding (downweighting zeros)   7.110   6.994   6.950
  Additive Poisson embedding                7.868   8.191   8.414
  Hierarchical Poisson factorization        7.740   7.626   7.626
  Poisson PCA                               8.314   9.51    11.01

Item level:

  MODEL                                K=50   K=100
  Poisson embedding                    7.72   7.64
  Hierarchical Poisson factorization   7.86   7.87

Gopalan et al., Scalable recommendation with hierarchical Poisson factorization. Uncertainty in Artificial Intelligence, 2015.
Collins et al., A generalization of principal component analysis to the exponential family. Neural Information Processing Systems, 2002.
We want to use this fit to understand purchase patterns.

- Exchangeables have a similar effect on the purchase of other items.
- Same-category items tend to be exchangeable and rarely purchased together (e.g., two types of peanut butter).
- Complements are purchased (or not purchased) together (e.g., hot dogs and buns).
- PB #1 and PB #2 induce similar distributions over the other items, but they are rarely purchased together.
- Define the sigmoid between two items,

  μ_{ki} ≜ 1 / (1 + exp{ −ρ_k^⊤ α_i }),    μ̄_{ki} ≜ 1 − μ_{ki}.
- The "substitute predictor" is

  Σ_{k∉{i,j}} [ μ_{ki} log( μ_{ki} / μ_{kj} ) + μ̄_{ki} log( μ̄_{ki} / μ̄_{kj} ) ] − μ_{ji} log( μ_{ji} / (1 − μ_{ji}) ).
- The "complement predictor" is the negative of the last term,

  μ_{ji} log( μ_{ji} / (1 − μ_{ji}) ).

- Notes:
  - We use the symmetrized version of both quantities.
  - These quantities generalize to other exponential families.
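The two predictors are easy to transcribe directly. In this sketch the μ matrix is built from randomly generated parameters as a stand-in for a fitted model, with μ_{ki} = σ(ρ_k^⊤ α_i).

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 6, 2
rho = rng.normal(size=(n, K))
alpha = rng.normal(size=(n, K))

# mu[k, i] = sigmoid(rho_k^T alpha_i): how item i's presence moves
# the probability of item k.
mu = 1.0 / (1.0 + np.exp(-(rho @ alpha.T)))

def substitute_score(i, j):
    """KL-style comparison of the distributions items i and j induce
    over the *other* items, minus a co-purchase term for (i, j)."""
    s = 0.0
    for k in range(n):
        if k in (i, j):
            continue
        s += mu[k, i] * np.log(mu[k, i] / mu[k, j])
        s += (1 - mu[k, i]) * np.log((1 - mu[k, i]) / (1 - mu[k, j]))
    return s - mu[j, i] * np.log(mu[j, i] / (1 - mu[j, i]))

def complement_score(i, j):
    """Negative of the last term of the substitute predictor."""
    return mu[j, i] * np.log(mu[j, i] / (1 - mu[j, i]))

def sym_substitute(i, j):
    """The slides use the symmetrized version of both quantities."""
    return 0.5 * (substitute_score(i, j) + substitute_score(j, i))
```

High symmetrized substitute scores flag pairs like the two peanut butters: similar induced distributions, rarely co-purchased.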
Example co-purchases at the category level:

  ITEM 1               ITEM 2                    SCORE (RANK)
  organic vegetables   organic fruits            6.18 (01)
  vegetables (<10 oz)  beets (>=10 oz)           5.64 (02)
  baby food            disposable diapers        3.43 (32)
  stuffing             cranberries               3.30 (36)
  gravy                stuffing                  3.23 (37)
  pie filling          evaporated milk           3.09 (42)
  deli cheese          deli crackers             2.87 (55)
  dry pasta/noodles    tomato paste/sauce/puree  2.73 (63)
  mayonnaise           mustard                   2.61 (69)
  cake mixes           frosting                  2.49 (78)
Top ten potential substitutes at the category level:

  ITEM 1                    ITEM 2                    SCORE (RANK)
  bouquets                  roses                      0.20 (01)
  frozen pizza 1            frozen pizza 2             0.18 (02)
  bottled water 1           bottled water 2           -0.07 (03)
  carbonated soft drinks 1  carbonated soft drinks 2  -0.12 (04)
  orange juice 1            orange juice 2            -0.37 (05)
  bathroom tissue 1         bathroom tissue 2         -0.58 (06)
  bananas 1                 bananas 2                 -0.61 (07)
  salads-convenience 1      salads-convenience 2      -0.63 (08)
  potatoes 1                potatoes 2                -0.66 (09)
  bouquets                  blooming                  -1.18 (10)
Example co-purchases at the UPC level:

  ITEM 1                          ITEM 2                          SCORE (RANK)
  ygrt peach ff                   ygrt mxd berry ff               19.83 (0001)
  s&w beans garbanzo              s&w beans red kidney            14.42 (0002)
  whiskas cat fd beef             whiskas cat food tuna/chicken    8.45 (0149)
  parsnips loose                  rutabagas                        8.32 (0157)
  celery hearts organic           apples fuji organic              4.36 (0995)
  85p ln gr beef patties 15p fat  sesame buns                      4.35 (1005)
  kiwi imported                   mangos small                     3.22 (1959)
  colby jack shredded             taco bell taco seasoning mix     2.89 (2472)
  star magazine                   in touch magazine                2.87 (2497)
  seasoning mix fajita            mission tortilla corn super sz   2.87 (2500)
Example potential substitutes at the UPC level

ITEM 1                       ITEM 2                           SCORE (RANK)
coffee drip grande           coffee drip venti                -0.33 (001)
sandwich signature reg       sandwich signature lrg           -1.17 (020)
market bouquet               alstromeria/rose bouquet         -2.89 (186)
sushi shoreline combo        sushi full moon combo            -3.76 (282)
semifreddis bread baguette   crusty sweet baguette            -7.65 (566)
orbit gum peppermint         orbit gum spearmint              -7.96 (595)
snickers candy bar           3 musketeers candy bar           -7.97 (598)
cheer lndry det color guard  all lndry det liquid fresh rain  -7.99 (602)
coors light beer btl         coors light beer can             -8.12 (621)
greek salad signature        neptune salad signature          -8.15 (630)
Summary and Questions
[Diagram: observed data x_i, item embeddings ρ, and context vectors α.]
I Word embeddings have become a staple in natural language processing
I We distilled their essential elements and generalized them to consumer data
I Compared to classical factorization, good performance on many data sets
– movie ratings, neural activity, scientific reading, shopping baskets
Rudolph et al. Exponential Family Embeddings. Neural Information Processing Systems, 2016.
Liang et al. Modeling User Exposure in Recommendation. Recommendation Systems, 2016.
Ranganath et al. Deep Exponential Families. Artificial Intelligence and Statistics, 2015.
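As a concrete sketch of the model family summarized above, the snippet below computes a Poisson-embedding log-likelihood for a single shopping basket. The rate function exp(ρ_i · context), the leave-one-out context, and all dimensions are illustrative assumptions, not the exact specification from Rudolph et al.

```python
import numpy as np

def basket_log_likelihood(counts, rho, alpha):
    """Poisson embedding log-likelihood of one basket of item counts.

    counts: (n_items,) integer purchase counts
    rho:    (n_items, K) item embeddings (interaction)
    alpha:  (n_items, K) item context vectors (attributes)
    """
    total_context = alpha.T @ counts  # (K,) sum of context vectors, weighted by counts
    log_lik = 0.0
    for i, y in enumerate(counts):
        # the context of item i excludes item i itself
        context_i = total_context - counts[i] * alpha[i]
        rate = np.exp(rho[i] @ context_i)  # conditional Poisson rate for item i
        # Poisson log pmf: y log(rate) - rate - log(y!)
        log_lik += y * np.log(rate) - rate - np.sum(np.log(np.arange(1, y + 1)))
    return log_lik

rng = np.random.default_rng(0)
counts = np.array([2, 0, 1])
rho = rng.normal(scale=0.1, size=(3, 4))
alpha = rng.normal(scale=0.1, size=(3, 4))
ll = basket_log_likelihood(counts, rho, alpha)
```

Maximizing this objective over ρ and α, summed over baskets, is the embedding analogue of fitting a factorization model.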
Summary and Questions
I How can we capture higher-order structure in the embeddings?
I Why downweight the zeros?
I How can we include price and other complexities?
Poisson factorization
A computationally efficient method for discovering correlated preferences
I Economics
– Look at items within one category (e.g. yoghurt)
– Try to estimate the effects of interventions (e.g., coupons, price, layout)
I Machine learning
– Look at all items
– Estimate user preferences and make predictions (recommendations)
– Ignore causal effects of interventions
[Figure: a users × items matrix of interactions; the items are movies such as Star Trek, The Empire Strikes Back, Return of the Jedi, When Harry Met Sally, Barefoot in the Park, and Star Trek III.]
I Implicit data is about users interacting with items
– clicks
– “likes”
– purchases
I Less information than explicit data (e.g. ratings), but more prevalent
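Implicit feedback of this kind reduces to a users-by-items count matrix. A minimal sketch, assuming a hypothetical event log of (user, item) pairs; in practice the matrix is stored sparsely, since most entries are zero:

```python
import numpy as np

def counts_from_events(events, n_users, n_items):
    """Tally (user, item) interaction events into a users-by-items count matrix."""
    Y = np.zeros((n_users, n_items), dtype=int)
    for user, item in events:
        Y[user, item] += 1
    return Y

# e.g. four logged clicks or purchases among three users and three items
Y = counts_from_events([(0, 2), (0, 2), (1, 0), (2, 1)], n_users=3, n_items=3)
```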
[Graphical model: USER PREFERENCES θ_u, ITEM ATTRIBUTES β_i, observations y_ui]
θ_uk ∼ Gamma(·, ·)
β_ik ∼ Gamma(·, ·)
y_ui ∼ Poisson(θ_u⊤ β_i)
Poisson factorization
I Assumptions
– Users (consumers) have latent preferences θ_u.
– Items have latent attributes β_i.
– The number of times a shopper purchases each item is Poisson-distributed.
I The posterior p(θ, β | y) reveals purchase patterns.
Gopalan et al., Scalable recommendation with hierarchical Poisson factorization. Uncertainty in Artificial Intelligence, 2015.
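The generative process above can be sketched in a few lines. The Gamma hyperparameters and problem sizes below are illustrative placeholders, not the values from Gopalan et al.:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 50, 40, 5
a, b = 0.3, 0.3  # illustrative Gamma shape/rate hyperparameters

# user preferences and item attributes (note: numpy uses scale = 1/rate)
theta = rng.gamma(shape=a, scale=1.0 / b, size=(n_users, K))
beta = rng.gamma(shape=a, scale=1.0 / b, size=(n_items, K))

# purchase counts: y_ui ~ Poisson(theta_u . beta_i)
y = rng.poisson(theta @ beta.T)
```

Simulating from the model like this is also a useful check on any inference code: fit the posterior to y and see whether it recovers the planted structure.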
Advantages
I captures heterogeneity of users
I implies a distribution of total consumption
I efficient approximation, only requires non-zero data
Gopalan et al., Scalable recommendation with hierarchical Poisson factorization. Uncertainty in Artificial Intelligence, 2015.
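The "only requires non-zero data" point follows directly from the Poisson log-likelihood, Σ_{u,i} [y_ui log λ_ui − λ_ui − log y_ui!]: the first and third terms vanish wherever y_ui = 0, and with λ_ui = θ_u⊤β_i the rate penalty sums in closed form as (Σ_u θ_u)⊤(Σ_i β_i). A sketch (function name and data layout are illustrative):

```python
import numpy as np

def pf_log_likelihood(nonzeros, theta, beta):
    """Poisson-factorization log-likelihood touching only the nonzero counts.

    nonzeros: iterable of (u, i, y) triples with y > 0
    theta:    (n_users, K) preferences; beta: (n_items, K) attributes
    """
    # Rate penalty over ALL user-item pairs, without forming the dense matrix:
    # sum_{u,i} theta_u . beta_i = (sum_u theta_u) . (sum_i beta_i)
    ll = -theta.sum(axis=0) @ beta.sum(axis=0)
    for u, i, y in nonzeros:
        rate = theta[u] @ beta[i]
        ll += y * np.log(rate) - np.sum(np.log(np.arange(1, y + 1)))  # y log(rate) - log(y!)
    return ll
```

The cost is linear in the number of observed purchases plus an O(K) correction, rather than linear in users × items.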
News articles from the New York Times / Scientific articles from Mendeley

“Business Self-Help”
– Stay Focused And Your Career Will Manage Itself
– To Tear Down Walls You Have to Move Out of Your Office
– Self-Reliance Learned Early
– Maybe Management Isn't Your Style
– My Copyright Career

“Astronomy”
– Theory of Star Formation
– Error estimation in astronomy: A guide
– Astronomy & Astrophysics
– Measurements of Omega from 42 High-Redshift Supernovae
– Stellar population synthesis at the resolution of 2003

“Personal Finance”
– In Hard Economy for All Ages Older Isn't Better It's Brutal
– Younger Generations Lag Parents in Wealth-Building
– Fast-Growing Brokerage Firm Often Tangles With Regulators
– The Five Stages of Retirement Planning Angst
– Signs That It's Time for a New Broker

“Biodiesel”
– Biodiesel from microalgae
– Biodiesel from microalgae beats bioethanol
– Commercial applications of microalgae
– Second Generation Biofuels
– Hydrolysis of lignocellulosic materials for ethanol production

“All Things Airplane”
– Flying Solo
– Crew-Only 787 Flight Is Approved By FAA
– All Aboard Rescued After Plane Skids Into Water at Bali Airport
– Investigators Begin to Test Other Parts On the 787
– American and US Airways May Announce a Merger This Week

“Political Science”
– Social Capital: Origins and Applications in Modern Sociology
– Increasing Returns, Path Dependence, and Politics
– Institutions, Institutional Change and Economic Performance
– Diplomacy and Domestic Politics
– Comparative Politics and the Comparative Method

Figure 6: The top 10 items by the expected weight β_i from three of the 100 components discovered by our algorithm for the New York Times and Mendeley data sets.
[Figure: mean normalized precision and mean recall at 20 recommendations on the Mendeley, New York Times, Echo Nest, Netflix (implicit), and Netflix (explicit) data sets, comparing MF, LDA, NMF, BPF, and HPF.]
Figure 4: Predictive performance on data sets. The top and bottom plots show normalized mean precision
and mean recall at 20 recommendations, respectively. While competing method performance varies across
data sets, HPF and BPF consistently outperform competing methods.
“FRUIT”
– stone fruit
– pears
– tropical fruit
– apples
– grapes

“CAT CARE”
– cat food wet
– cat food dry
– cat litter & deodorant
– canned fish
– paper towels

“BABY ESSENTIALS”
– baby food
– starbucks coffee
– disposable diapers
– infant formula
– baby/youth wipes

“HEALTHY”
– health and milk substitutes
– organic vegetables
– organic fruits
– cold cereal
– vegetarian / organic frozen
[Figure: latent preferences across 20 components for Consumer #1 (“Cats and Babies”) and Consumer #2 (“Healthy and Cats”).]
Poisson factorization and economics
I Consider a utility model of a single purchase with Gumbel error ε,
U(y_ui) = log(θ_u⊤ β_i) + ε.
I Suppose a shopper u buys N items. Then
y_u | N ∼ Multi(N, π_u),  π_ui ∝ exp{log(θ_u⊤ β_i)} = θ_u⊤ β_i.
I Thus, if N is itself Poisson, the unconditional distribution of counts is Poisson factorization,
y_uj ∼ Poisson(θ_u⊤ β_j).
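The Multinomial–Poisson connection can be verified numerically: the joint pmf of independent Poisson counts, divided by the Poisson pmf of their total, equals the multinomial pmf with probabilities proportional to the rates. The rates below are illustrative stand-ins for θ_u⊤β_j:

```python
import numpy as np
from math import exp, factorial

def poisson_pmf(y, lam):
    return lam ** y * exp(-lam) / factorial(y)

def multinomial_pmf(y, pi):
    coef = factorial(sum(y))
    for yi in y:
        coef //= factorial(yi)
    return coef * np.prod([p ** yi for p, yi in zip(pi, y)])

lam = np.array([0.7, 1.3, 2.0])  # illustrative per-item rates theta_u . beta_j
y = [1, 0, 3]
N = sum(y)

# P(y | N): joint of independent Poissons divided by the pmf of their sum
conditional = (np.prod([poisson_pmf(yi, li) for yi, li in zip(y, lam)])
               / poisson_pmf(N, lam.sum()))
direct = multinomial_pmf(y, lam / lam.sum())
```

The two quantities agree exactly, which is the identity the slide's derivation relies on.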
Poisson factorization and economics
I With this connection, we can devise new utility models, e.g.,
U(y_ui) = log(θ_u⊤ β_i + α_u exp{−c · price_i}) + ε.
I ...and other factors
– time of day
– in stock
– date
– observed item characteristics & category
– demographic information about the shopper
I Inference is still efficient.
I With assumptions, we can answer counterfactual questions.
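A counterfactual under the price-augmented utility can be sketched as follows; the price-sensitivity parameters α_u and c, the prices, and the random preferences are all illustrative assumptions:

```python
import numpy as np

def choice_probs(theta_u, beta, alpha_u, c, prices):
    """Choice probabilities under U = log(theta.beta + alpha*exp(-c*price)) + Gumbel noise."""
    attraction = beta @ theta_u + alpha_u * np.exp(-c * prices)
    return attraction / attraction.sum()  # Gumbel errors yield the logit form

rng = np.random.default_rng(0)
theta_u = rng.gamma(1.0, 1.0, size=3)              # one shopper's preferences
beta = rng.gamma(1.0, 1.0, size=(5, 3))            # five items' attributes
prices = np.array([2.0, 3.0, 1.5, 4.0, 2.5])

p = choice_probs(theta_u, beta, alpha_u=0.5, c=1.0, prices=prices)
prices_cf = prices.copy()
prices_cf[0] = 1.0                                  # counterfactual: discount item 0
p_cf = choice_probs(theta_u, beta, alpha_u=0.5, c=1.0, prices=prices_cf)
```

Comparing p and p_cf answers a simple counterfactual question: under the assumed model, discounting item 0 raises its choice probability at the expense of the others.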
I Poisson factorization efficiently analyzes large-scale purchase behavior.
I Next steps
– include notions of co-purchases and substitutes
– include time of day at a level that is unconfounded
– include price and stock out; answer counterfactual questions
I Research in recommendation systems can help economic analyses.
[Figure: the probabilistic pipeline — knowledge & question, data; make assumptions, discover patterns, predict & explore — illustrated with a published analysis and population-structure results.]

Figure S2: Population structure inferred from the TGP data set using the TeraStructure algorithm at three values for the number of populations K (K = 7, 8, 9). The visualization of the θ's in the figure shows patterns consistent with the major geographical regions. Some of the clusters identify a specific region (e.g. red for Africa) while others represent admixture between regions (e.g. green for Europeans and Central/South Americans). The presence of clusters that are shared between different regions demonstrates the more continuous nature of the structure. The new cluster from K = 7 to K = 8 matches structure differentiating between American groups. For K = 9, the new cluster is unpopulated.
Probabilistic machine learning: design expressive models to analyze data.
I Tailor your method to your question and knowledge
I Use generic and scalable inference to analyze large data sets
I Form predictions, hypotheses, inferences, and revise the model
Opportunities for economics and machine learning
I Push economics to high-dimensional data and scalable computation
I Push ML to explainable models, applied causal inference, new problems
I Develop new modeling methods together