“Big Data” Goes to the Movies: Revealing latent similarity
among movies and its effect on box office performance
Sandra Barbosu∗
December 21, 2015
Abstract
This paper explores, in the movie industry setting, how the rise of “big data”
presents firms with the opportunity to acquire knowledge that can influence product
performance and shape firm strategy. It asks the question: what is the value of
insights gained from “big data” for firms? Specifically, the paper focuses on Amazon Instant Video’s “Customers who rented this [focal movie] also rented this...”
lists to (1) evaluate similarity between movies based on users’ rental patterns, (2)
show that movies that are implicitly similar to others have better box office performance than movies that are far from others, and (3) provide reasons for these
performance differences. This paper also explores the formation of implicit clusters
of movies, made of movies with many co-rentals in common, and shows that they
differ from usual classification schemes (genres). Observable characteristics such as
genre, actor/director reputation, studio type and MPAA rating have modest power to
explain why these clusters form. This points to the possibility that implicit similarity dimensions may exist within clusters. An initial exploration shows that one such
dimension is theme, which transcends coarse classifications. Exploring other possible
latent common features could provide further insights to studios to help them make
movies better targeted to different audience segments. Methodologically, this paper
employs a combination of rigorous econometric methods and machine learning tools.
∗Rotman School of Management, University of Toronto. [email protected]. I am grateful
to my committee for their valuable insights and continuous support: Mike Ryall (chair), Anne Bowers, Joshua Gans,
Avi Goldfarb, and Mitchell Hoffman. I also thank Andrew Ching, Brian Silverman, Olav Sorenson, Scott Stern,
David Tan, participants at the 2014 CCC Doctoral Consortium, 2014 Mallen Motion Picture Economics Conference,
2015 BPS Dissertation Consortium, and my colleagues in the Rotman doctoral program for their helpful feedback.
All errors remain my own.
1 Introduction and Related Work
Digitization is transforming a wide range of industries, and the rise of “big data”
presents firms with the opportunity to gain insights that can contribute to product
performance and strategy. The proliferation of websites such as Amazon,
eBay, Pinterest and Twitter, which are able to aggregate data from millions of users,
has led to the availability of detailed information about consumer preferences that
was previously very difficult, if not impossible, to collect. Such fine-grained data can
provide firms with useful knowledge for their product development strategies. In this
paper, I address the question: what is the value of “big data” insights for firms?
The setting of this paper is the United States motion picture industry, an industry
with high economic importance, which contributed $504 billion to the United States’
GDP in 2011 (Bureau of Economic Analysis 2013). The movie industry has received
attention from multiple streams of literature, including research in economics exploring the determinants of box office success (e.g. Gopinath et al. 2013, Liu 2006), and
work in sociology focusing on the relationship between the number of genres a movie
spans and its performance (Hsu 2006, Hsu et al. 2009).
The data for this paper are collected from Amazon Instant Video, Amazon’s
movie streaming service, which in December 2013 had over 25,000 movies and an
estimated 20 million users (McGlaun 2013). A key reason I chose to focus on Amazon
Instant Video is the section “Customers who rented this [focal movie] also rented...”
associated with each movie. This section contains an ordered list of the top twenty
movies most frequently co-rented with the focal movie, which I refer to as a co-rental list. It is produced by Amazon’s item-to-item collaborative filtering algorithm
(Linden et al. 2003) and reflects the aggregated rental behavior of millions of users.
Co-rental lists provide a novel opportunity to observe relationships between movies
based directly on consumers’ rental patterns, which was not previously possible. This
offers a new way to determine similarity between movies from revealed consumer
preferences, without having to rely on third-party classifications like genres, which
are often vague and problematic. Rooted in data analytics, this approach has the
potential to reveal latent dimensions of similarity between movies that may not be
captured by traditional classifications, and this could have implications for studios’
movie development strategies. Using co-rental lists, I evaluate a movie’s implicit similarity with other movies based on the number of common co-rentals it shares with
every other movie. With the assumption that online rental behavior reflects consumers’ movie viewing behavior more generally, I explore the relationship between a
movie’s implicit similarity and its box office performance in Section 4.
This analysis is motivated by the theoretical question of how a movie’s implicit
similarity, as revealed from user rental patterns, matters for performance. On one
hand, movies that are similar to others tend to form implicit clusters of formulaic
movies that are liked by niche consumer groups. They may contain latent similarity
features that transcend simple genres, and their unobservable formulas appeal to
specific audience segments. As formulaic movies are well-aligned with niche segments’
preferences, this could lead to high performance. On the other hand, movies that
are far from others contain novel combinations of attributes that are not specific to
any niche audience, but that have the potential to command larger audiences by
appealing to multiple audiences, which could result in high performance.
The results of my analysis in Section 4 show that on average, movies that are
implicitly similar to others are higher-performing at the box office than more distant
movies. The implication of these findings is that consumers prefer formulaic movies
that appeal to niche groups over movies that attempt to combine disparate features.
After establishing these results, a concern still remains: does implicit similarity actually affect performance, or is this relationship due to an alternative mechanism?
Likely candidates for alternative explanations are: (1) omitted variable - an unobservable feature such as a movie’s quality may both affect its performance and lead
it to be rented together with other movies of similar quality, or (2) reverse causality
- movies are rented together because they have similar performance, even if they are
not otherwise similar (e.g. high-performing movies get rented together because of
shared box office success).
To address these concerns, I employ several empirical techniques, including instrumental variable and control function methods, and a recently developed procedure to
evaluate robustness of results to omitted variable bias (Oster 2014). These methods
provide evidence against the proposed alternative explanations, lending support to
the argument that implicit similarity affects movie performance.
Having established the relationship between implicit similarity and box office
performance, in Section 5 I delve deeper to explore which movies are implicitly similar to each other. By revealing clusters of movies that consumers frequently watch
together, this analysis would provide information to movie studios about different
market segments that may transcend simple genre classifications. This knowledge
could be useful for studios in order to help them make movies that are better aligned
with the preferences of different market segments. In this analysis, I apply machine
learning algorithms to determine which movies cluster together by having many
shared co-rentals with each other. I show that observable characteristics like genre,
actor/director reputation, studio type and MPAA rating have low explanatory power
in determining cluster membership. These findings raise the possibility that there
exist latent dimensions of similarity that are common to clusters, which are not captured by typical observable characteristics. Exploring these dimensions can further
help studios gain a better understanding of different audience segments’ preferences.
More generally, this analysis illustrates the potential that the proliferation of big data, that is, increasingly large volumes of unstructured data, can have for firms across industries, as it can reveal fine-grained information about consumer preferences that was
previously very difficult, if not impossible, to obtain. This information can help firms
gain insights into consumer preferences that may help them make products that better fit what their customers want.
The results and insights in this paper contribute to several research streams: research on the economics and strategy of digitization, the branch of sociology that
studies the relationship between how many categories a firm spans and its performance, and marketing work on the determinants of movie success. I discuss the
contributions to each literature below. In addition, the paper employs different analytical tools in each part of the analysis. Figure 1 provides a roadmap of those tools.
INSERT FIGURE 1 HERE
Economics and strategy of digitization Research in this area explores the causes
and economic consequences of the digitization that has been occurring across industries (see Goldfarb et al. 2015 for a detailed overview). One stream of this literature
explores the impact of information technology on firm performance, strategy, and
overall industry structure in various settings (see Bloom et al. 2009, Athey and
Gans 2010, Brynjolfsson and Hitt 2000, McElheran 2011). Saunders and Tambe
(2012) show that adoption of data analytics practices is associated with higher firm
value. Einav et al. (2011) provide evidence in favor of business strategies that are
data-driven, showing that experiments conducted by sellers on eBay can shed light
on consumer behavior and market outcomes, which can hold insights for firms.
This paper contributes to this literature by exploring what types of insights data
analytics can provide for firms, which were previously very difficult to obtain. Particularly, it focuses on how user rental patterns on Amazon Instant Video can shed light
on previously unobservable consumer preferences and latent dimensions of similarity
between movies, as well as the implications of this information for movie performance. This paper also has methodological contributions by employing machine
learning techniques recently adopted in economics (Athey and Imbens 2015).
Categories research in sociology Starting with the seminal work by Zuckerman
(1999) on the categorical imperative, there has been a growing stream of research
in sociology that explores the relationship between the number of categories a product spans and its performance. Work in this stream argues that category-spanning
products tend to perform worse than products that belong to one category, because
products that span categories are difficult to interpret and evaluate, which results in
consumers ignoring them.
Although the negative correlation between category-spanning and performance
is well-established in this literature, the causality of this relationship is difficult to
disentangle, and attempts to tackle this issue are lacking. Moreover, with a few
important exceptions (Zuckerman 2004, Goldberg and Vashevko 2013), most work
relies on institutional, third-party classifications (e.g. SIC codes or movie genres) to
measure category-spanning, which may not capture how individuals actually classify.
In the movie industry setting, Hsu et al. (2009) show that movies that span genres
perform worse at the box office. However, particularly in the movie industry, genre
assignments are difficult to make as they are often broad, vague and subjective
(Basuroy et al. 2003), and may not accurately show which movies are similar to
each other. These issues make it problematic to use genres to measure the number
of categories a movie is in.
This paper aims to make two contributions to this literature. First, it makes a
departure from the general reliance on film genres to infer similarity between movies.
It attempts to get closer to how individuals classify by using the movie rental patterns of Amazon Instant Video users to infer implicit similarity between movies.
Movies that are frequently co-rented together by users form implicit categories, or
clusters, which differ from established genres. Second, the paper seeks to establish
and then explore the causality of the relationship between how similar a movie is to
others (i.e. the degree to which it belongs to one implicit cluster with other movies
or spans clusters) and movie performance, through a two-part approach made up of
instrumental variable and control function methods in Section 4.
Marketing There is a large stream of research in the marketing literature exploring
the determinants of success in the movie industry (see Eliashberg et al. 2006 for a
survey of the literature). Some studies focus on quantifying the impact of different
factors such as famous actors/directors, blogs and advertising on movie revenues
(Ravid 1999, Gopinath et al. 2013), while others explore the role of film critics as
both predictors and influencers of movie performance (Basuroy et al. 2003).
This paper builds on this literature by exploring how a previously unobservable
factor, a movie’s implicit similarity to other movies, as revealed through online consumer rental patterns, affects a movie’s box office performance. This provides an
avenue for future work to explore other previously difficult to observe movie characteristics that digitization can reveal, which may have performance implications.
The rest of the paper is organized as follows. Section 2 describes the data. Section
3 motivates and constructs a movie’s implicit similarity measure. Section 4 analyzes
the relationship between implicit similarity and box office performance. Section
5 explores the formation of implicit clusters of movies, and the latent similarity
dimensions they share. Section 6 concludes and suggests avenues for future research.
2 Data
There are two types of data that I have collected for the analysis. The first type is
used to evaluate implicit similarity between movies based on users’ co-rental behavior.
For this purpose, I collected data from Amazon Instant Video’s “Customers who
rented this [focal movie] also rented...” section associated with each movie. I focused
only on motion pictures that were released in theaters, excluding TV shows, direct-to-video films, music videos, and other short clips and duplicate titles, as my focus
is on factors that affect performance of feature length movies and studios’ strategies.
This section lists the top twenty movies that are most frequently co-rented with
the focal movie by Amazon users, in order of co-rental frequency (with hyperlinks
to the movies’ respective pages). Figure 2 shows a sample Amazon Instant Video
movie page for the movie Angels & Demons; the top five of the 20 co-rentals appear
on the main page. As I explain in detail in Section 3, these co-rental lists are a
key feature of my analysis. I use them to compute a measure of implicit similarity
between movies based on the number of co-rentals each two movies have in common, as revealed through rental patterns.¹ Based on this measure, I investigate the relationship between a movie’s implicit similarity to other movies and its box office performance in Section 4. In Section 5, I explore which movies are similar to each other, forming clusters of movies, and examine what latent dimensions of similarity these movies share.

¹ A possible concern is that the explicit visibility of the first five co-rental relationships may drive those movies to continue being rented together, which would reinforce their positions (Oestreicher-Singer and Sundararajan, 2012). This would imply that co-rental lists are a combination of (1) the initial revealed consumer preferences that generate the lists and (2) a bias introduced by the reinforcement of the first five visible co-rentals. However, I argue that these lists, which are the basis for the construction of my implicit similarity measure, still reflect consumer preferences. They are based on initial preferences and, more importantly, the results in Section 4 show that a movie’s online implicit similarity is strongly associated with its box office performance. Since a movie accrues most of its box office revenue before it can be rented on Amazon, this association with past performance implies that, despite the potential bias, consumers’ online rental patterns do reflect their broader preferences. As an additional robustness test, I run my analysis with the implicit similarity measure constructed from a movie’s first 10 co-rentals (five of which are not immediately visible on the movie’s page), and the results still hold.
INSERT FIGURE 2 HERE
A potential concern about the data could be the lack of information on how
Amazon actually creates its co-rental lists. Can the lists be taken at face value as a
ranking of movies most frequently co-rented with a focal movie? I argue that they
can and attempt to provide evidence that alleviates such concerns.
As engineers and co-developers of several Amazon recommendation algorithms,
including the co-rental lists, Linden et al. (2003) explain in detail the algorithm
used to produce these lists. Known as item-to-item collaborative filtering, the code
produces a ranking of the most frequently co-rented movies by Amazon users in the
first stage. Then, the algorithm adjusts the ranking to also take into account any
revealed preferences of users based on their prior rental history. The description of the
algorithm by Linden et al. (2003) reduces the uncertainty of how the co-rental lists
are produced. In addition, I collect the data using a browser with no previous search
history on Amazon, in order to get as close as possible to the algorithm’s initial
ranking of co-rentals, unaffected by any revealed preferences.
A concern may still remain that Amazon could run experiments by inserting random movies in these lists, instead of those produced by the algorithm. However,
even if this occasionally does take place, Amazon’s incentive is to increase rentals,
so it would aim to show movies for which it had some evidence that certain users
would like them. Otherwise, it would not have an incentive to show them, and they
would not remain in the lists for very long. Thus, even if Amazon may sometimes
run experiments, it is in Amazon’s best interest to list movies that consumers actually want to rent, that is, movies that are in some way similar to the other movies shown. Taken together, knowledge about Amazon’s algorithm and its incentives should
reduce concerns about whether co-rental lists truly reflect movies that consumers like
to rent together.
In addition to the initial Amazon Instant Video data on co-rentals, I use a number
of other sources to obtain additional variables related to each movie that have been
shown to affect movie performance and are relevant for the ensuing analysis. First, I
determine whether a movie is a sequel in a franchise, as this is positively correlated
with box office revenue (Terry et al., 2004). I also gather data on critic ratings prior
to movie release from Metacritic, since critics have an influence through their dual
role as predictors and influencers of performance (Basuroy et al. 2003). From the
Academy Awards website archive, I note whether a movie won or was nominated for
an Oscar in the following six categories: best picture, best director, best actor/actress
and best supporting actor/actress, which are found to be associated with higher
revenues (Nelson et al. 2001). I obtain each starring actor’s popularity from IMDB
Pro’s StarMeter, which has a yearly ranking (from 2004 to 2013) of the top actors
in the industry, based on the number of IMDB user searches for that individual, and
I use IMDB Pro’s Top 100 Directors rankings to evaluate director popularity. Also
from IMDB, I collect the following movie characteristics: starring actors, director,
studio, IMDB rating, MPAA rating, genres and US theatrical release date.
Additionally, I collect information about factors related to the movie’s production
and exhibition. From BoxOfficeMojo/The Numbers, I obtain data on a movie’s
box office revenue, production budget, and number of theaters the movie played in
during each week of its exhibition. Based on the movie’s release date, I also identify
whether it was released in the summer/winter or on a holiday weekend, to account
for seasonality in the movie industry (Einav 2007). For a subset of movies, those
released between 2009 and 2013, I further obtain advertising budgets, broken up by
media type and month, from Kantar Media. This variable is essential to my empirical
approach that attempts to establish causality in the relationship between a movie’s
implicit similarity to other movies and box office performance.
3 Measure of Implicit Similarity

3.1 Motivation
As mentioned in Section 1, Amazon Instant Video co-rental lists provide an opportunity to measure implicit similarity between movies based directly on consumers’
rental patterns. The advantage of this method is that it does not rely on third-party,
coarse genres to infer similarity between movies, since these classifications may not
accurately reflect how individual consumers classify movies.
The implicit similarity measure I construct is used to explore the relationship
between a movie’s implicit similarity (how similar it is to other movies based on
common co-rentals) and its box office performance in Section 4.
3.2 Construction
The measure of a movie’s implicit similarity to other movies, ImpSim, is based on
the number of common co-rentals a movie has with others. For any two movies,
the more common co-rentals they share, the more similar they are to each other.
A movie’s final implicit similarity measure is an average of its similarity with every
other movie. Details of the calculation are provided below.
Background for the construction of the measure
The implicit similarity measure is an adaptation of Zuckerman (2004)’s measure of
category coherence. In his paper, the aim is to compare the stock market volatility
of firms that have coherent identities (i.e. they are part of coherent categories of
firms) to that of firms with incoherent identities. To determine a firm’s coherence,
Zuckerman develops a measure coined category coherence, based on the analysts that
follow the firm: if a firm’s analysts follow a homogeneous group of firms, the focal
firm is considered coherent; if its analysts cover a heterogeneous group of disparate
firms, the firm is called incoherent. Zuckerman’s findings show that incoherent firms
have a more volatile value on the stock market than coherent firms. He argues the
result is explained by the idea that it is harder for people to interpret information
about a firm with an incoherent identity, because it lacks a clear reference group.
Zuckerman (2004)’s paper is part of a stream of research in sociology that studies
the relationship between the number of categories a firm spans and its performance.
However, a key novelty in his analysis compared to other papers is that he does
not rely on established SIC codes as a measure of how clearly a firm belongs to a
particular category, since these classifications can sometimes be difficult to make and
vague. Instead, he develops a way to measure the extent to which a firm belongs to
a coherent category based on the homogeneity of the set of firms its analysts follow,
without imposing problematic third-party categories.
An insight in the present paper is that there is a similarity between the analysts in
Zuckerman’s stock market setting and the Amazon co-rental lists in my context. For
Zuckerman, the more homogeneous is the group of firms that a focal firm’s analysts
follow, the more the focal firm belongs to a coherent category. The analogy in my
paper is that the more common co-rentals two movies share, the more those movies
belong to a coherent implicit cluster of movies.
The implicit similarity measure for movie m is operationalized as follows:
Step 1
For each pair of movies m, k compute:
$$p_{mk} = \frac{c_{mk}}{n+1},$$

where p_mk is the proximity of movie m to movie k; c_mk is the number of co-rentals
that movie m and movie k have in common, and n is the number of co-rentals over
which the measure is calculated; 1 is added to n in the denominator because when
counting the number of common co-rentals, the focal movie itself is included to check
whether it shows up on the co-rental list of any other movie.
The measure used throughout the analysis is calculated over the first five co-rentals
(n=5) as those are the most closely related to the focal movie m, and appear on its
main Amazon page. As a robustness check, the measure is also computed over the
complete list of 20 co-rentals (n = 20); the results of the analysis are qualitatively
similar using either computation of the measure.
Step 2:
Let N be the total number of movies, and k_i any movie other than m. The implicit similarity measure for movie m, ImpSim_m, is computed as:

$$ImpSim_m = \frac{\sum_{i=1}^{N-1} p_{mk_i}}{n_c},$$

where n_c is the number of movies k for which p_mk > 0. The measure ranges
from 0 to 1; 1 means that movie m has the same co-rentals as each of its co-rentals
(complete overlap), while 0 means there is no overlap. I interpret a measure of 1
to suggest that movie m has a high level of implicit similarity to other movies, and
a measure of 0 to suggest m has a low level of implicit similarity to other movies.²
Appendix B shows how the measure is constructed through a simple example.
² Throughout the analysis, the constructed measure is standardized for ease of interpretation due to its skewness, so the mean is 0, the standard deviation is 1, and the range is no longer [0,1].
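To make the two steps concrete, the sketch below computes the proximity and implicit similarity values in R from scraped co-rental lists. It is a minimal illustration, not the paper’s code: the data structure `corentals` (a named list mapping each movie ID to its ordered vector of co-rented movie IDs) and the exact counting of common co-rentals are assumptions.

```r
# Minimal sketch (not the paper's code): p_mk and ImpSim_m from co-rental lists.
# Assumes `corentals` is a named list: movie ID -> character vector of its
# top co-rented movie IDs, as scraped from Amazon Instant Video.
implicit_similarity <- function(corentals, n = 5) {
  ids   <- names(corentals)
  lists <- lapply(corentals, function(x) head(x, n))
  sapply(ids, function(m) {
    p <- sapply(setdiff(ids, m), function(k) {
      # One reading of the counting rule: the focal movies themselves are
      # included when intersecting the two lists, hence the n + 1 denominator.
      c_mk <- length(intersect(c(m, lists[[m]]), c(k, lists[[k]])))
      c_mk / (n + 1)
    })
    pos <- p[p > 0]
    if (length(pos) == 0) 0 else mean(pos)  # average over the n_c movies with p_mk > 0
  })
}

# Example: imp <- implicit_similarity(corentals, n = 5); scale(imp) standardizes it.
```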
4 Implicit similarity/performance relationship
This section focuses on determining the relationship between a movie’s implicit similarity and its box office performance. As discussed in detail in Section 1, the analysis
in this section is motivated by the theoretical question of which movies would perform better: movies that are implicitly similar to others, or movies that are far
from others? Implicitly similar movies may perform better because they target niche
segments of consumers, ensuring they have an audience, while distant movies could
perform better because they combine elements that appeal to multiple segments,
with the potential to draw larger audiences.
This analysis extends research exploring the relationship between the number of
genres a movie spans and its performance (Hsu 2006, Hsu et al. 2009), by focusing
on the extent to which a movie belongs to a coherent implicit category with other
movies, rather than third-party genre categories. Section 4.1 aims to establish the
relationship between a movie’s implicit similarity level and its box office revenues.
Section 4.2 attempts to disentangle the direction of causality in this relationship,
providing evidence that implicit similarity affects box office performance. This part
of the analysis is performed on a subset of movies released between 2009 and 2013,
for which I have data on all the relevant variables. Most importantly, I was also
able to obtain data on advertising spending for this subsample, a variable that is
instrumental in the empirical approach I employ to establish causality.
Table 1 presents summary statistics of movie characteristics for both the full
sample of 25,227 movies and for the subsample of movies from 2009 to 2013 that
is used in the analysis, for comparison. Movies in the full sample span the years
1905 to 2013, with an average release year of 1996, while movies in the subsample
are released during the last five years of the full sample, 2009 to 2013, on average
released in 2011. We can see that movies in the subsample are similar to those in
the full sample on key variables like total box office, production budget and range of
the implicit similarity measure.
INSERT TABLE 1 HERE
4.1 Establishing the existence of a relationship
This subsection explores the question: are there performance differences between
movies that are similar to others, as determined by the implicit similarity measure,
and movies that are far from all others? I run an initial regression as specified below
to determine the relationship between implicit similarity level and performance, including control variables that have been shown to be relevant to movie performance:
$$\ln(TotalBoxOffice)_m = \alpha_0 + \alpha_1 ImpSim_m + \alpha_2 \#Genres_m + \alpha_3 Rank_m + \alpha_4 (\#Genres_m \times Rank_m) + \alpha_5 \ln(pre.advertising)_m + \alpha_6 Metacritic_m + \alpha_7 Sequel_m + \alpha_8 \#Stars_m + \alpha_9 TopDir_m + \alpha_{10} MPAA_m + \alpha_{11} Holiday_m + \alpha_{12} HHI.5wks_m + \alpha_{13} Year_m + \epsilon_m, \quad (1)$$
where ln(TotalBoxOffice)_m is movie m’s total box office revenue;³ ImpSim_m is the
independent variable of interest, which expresses movie m’s implicit similarity level;
³ Box office revenues, production and advertising budgets are normalized using the Consumer Price Index to 2013 million dollars.
#Genres_m is the number of established genres movie m is assigned; ln(pre.advertising)_m is the advertising spending for movie m prior to its release; Metacritic_m is movie m’s average critic rating prior to its release; Sequel_m is an indicator variable for whether or not movie m is a sequel; #Stars_m is a count of movie m’s starring actors that were listed in the StarMeter’s Top 100 in the year prior to the movie’s release; TopDir_m is an indicator variable that shows whether movie m’s director is listed in the Top 100 Directors on IMDB; MPAA_m is a vector of indicator variables that shows movie m’s MPAA rating (i.e. NR, G, PG, PG13, or R); Holiday_m indicates whether movie m was released on one of the following holiday weekends: Independence Day, President’s Day, Labor Day, Memorial Day, Thanksgiving, Christmas, or New Year’s Eve; Year_m is a vector of release year indicator variables (2009-2013).
The variables HHI.5wks_m, Rank_m and the interaction #Genres_m × Rank_m require more detailed explanations. The variable HHI.5wks_m is designed to capture
the average level of competition movie m faces during its first five weekends playing
in theaters. This is because by the end of this period, movies have generally made
more than 80% of their total box office revenues, as illustrated in Figure 3. The
measure is constructed in the following steps: (1) I obtain data from BoxOfficeMojo
on the weekly box office revenues of all movies playing in theaters on each weekend
in the sample, even if some movies are not in the original dataset because they are
not available on Amazon Instant Video. (2) Given the box office revenues of all
movies playing on a given weekend, I compute each movie’s market share. (3) Based
on the computed market shares, I calculate a Herfindahl-Hirschman Index (HHI) to
measure the level of competition in each week. (4) For the final HHI.5wks_m variable, the last important step is a modification of the traditional HHI. In computing HHI.5wks_m, I exclude movie m’s market share by subtracting it from the computation of the movie’s HHI. The reason for this step is that this variable is meant
to reflect the level of competition that movie m faces during its first five weeks in
theaters. Including movie m’s market share in the calculation would not accurately
capture the level of competition m is facing from the other movies playing.
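A minimal sketch of this leave-one-out construction is below, assuming a hypothetical data frame `weekly` with one row per movie-weekend (columns `movie`, `weekend`, `revenue`, `week_in_run`); the way the focal share is excluded here is one possible operationalization of the description above, not necessarily the paper’s exact computation.

```r
# Sketch: average leave-one-out HHI over a movie's first five weekends.
# `weekly` is an assumed data frame: movie, weekend, revenue, week_in_run.
hhi_5wks <- function(weekly) {
  first5 <- subset(weekly, week_in_run <= 5)
  sapply(split(first5, first5$movie), function(rows) {
    mean(sapply(seq_len(nrow(rows)), function(i) {
      wk     <- subset(weekly, weekend == rows$weekend[i])
      rivals <- wk$revenue[wk$movie != rows$movie[i]]  # exclude the focal movie
      shares <- rivals / sum(rivals)
      sum(shares^2)                                    # HHI among competitors
    }))
  })
}
```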
INSERT FIGURE 3 HERE
The #Genres variable aims to control for performance differences between movies
that have different numbers of genres (ranging from 1 to 5). The inclusion of this
variable is motivated by prior research that finds that being listed in multiple genres
is associated with lower box office performance (Hsu et al. 2009). Even for movies
that have the same number of genres, there may be performance differences that
depend on not only the number, but which particular genres the movie is in. The
Rank_m variable seeks to control for such performance differences due to different
combinations of genres. The measure is constructed in the following steps: (1) I
identify all the possible combinations of genres of movies on Amazon Instant Video.
For each combination, I compute the average box office of movies with those genres.
(2) Within each number of genres (ranging from 1 to 5), I rank the genre combinations
based on their average box office, with the highest grossing combination having a
rank of 1. (3) I normalize the rankings for ease of comparison across different numbers
of genres. An illustrative example of the construction of this variable is in Appendix
A. The interaction variable #Genres_m × Rank_m is meant to capture the possibility
that number of genres can have different relationships with movie performance that
vary based on the movie’s genre combinations.
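The following sketch illustrates the ranking procedure, assuming a hypothetical data frame `movies` with a `genres` column (a sorted, comma-separated string of a movie’s genre labels) and a `box_office` column; the normalization to [0, 1] is one possible choice, not necessarily the one used in Appendix A.

```r
# Sketch: rank genre combinations by average box office within each genre
# count, then normalize the rank. `movies$genres` and `movies$box_office`
# are assumed placeholder columns.
movies$n_genres <- lengths(strsplit(movies$genres, ","))
combo <- aggregate(box_office ~ genres + n_genres, data = movies, FUN = mean)
combo$rank   <- ave(-combo$box_office, combo$n_genres, FUN = rank)  # 1 = highest average
combo$rank_n <- ave(combo$rank, combo$n_genres,
                    FUN = function(r) r / max(r))                   # normalize within count
movies <- merge(movies, combo[, c("genres", "rank_n")], by = "genres")
```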
INSERT TABLE 2 HERE
The results of regression (1), shown in Table 2 Column (1), indicate that there is a
positive, significant correlation between a movie’s implicit similarity and its box office
performance. The analysis in this section was motivated by the theoretical question of
whether implicitly similar or distant movies would perform better. Findings indicate
that on average, movies that are similar to others perform better, which implies that
consumers gain a higher utility from formulaic movies that closely align with the
preferences of niche audience segments than from movies with disparate elements.
Given these results, the question arises: what is the direction of causality of this
relationship? Addressing this question is the goal of Section 4.2.
4.2 Disentangling the causality of the relationship
There are three possible explanations for the correlation found in Section 4.1. One
option, which is the hypothesized direction, is that a movie’s implicit similarity, as
revealed through consumers’ rental patterns on Amazon, is a feature that affects
movie box office performance through the mechanism described above in Section 4.1.
However, there are also two possible alternative explanations. First, there could
be an unobservable factor, such as movie quality, that affects both implicit similarity
and performance. This option is explored in Section 4.2.1. Second, there may be an
issue of reverse causality, meaning that box office performance could affect Amazon
rental patterns and thus implicit similarity. This possibility is addressed in Section
4.2.2. I attempt to disentangle the direction of causality in the relationship by
exploring each of the possible mechanisms, and providing evidence to refute the
alternative explanations.⁴

⁴ This multistep approach is necessary because in this context it is not possible to test the hypothesized direction of causality directly. Doing so would require either an exogenous shock to a movie’s implicit similarity, or an instrumental variable that would be correlated with its implicit similarity but not with box office performance. Both of these have proven impossible to find. However, I am still able to provide supportive evidence for the proposed mechanism by employing a multistep approach based on instrumental variable, control function and robustness to omitted variable bias tests, in order to refute each of the possible alternative explanations. To the best of my knowledge, this is the first paper that tackles the question of causality in the relationship between the extent to which a product belongs to a clear category and its performance, thus contributing to the sociology research on categories.

4.2.1 Is there an unobservable factor that affects both a movie’s implicit similarity and its box office performance?

This subsection addresses the first alternative explanation, the possibility that an unobservable factor could affect both a movie’s implicit similarity and its box office performance. A likely candidate is a movie’s quality: the concern is that high-quality movies may both perform better at the box office and be more likely to be rented together on Amazon because of their quality. This would imply that similar movies are not high-performing because they contain common features that fit well with the tastes of niche audience segments. Instead, the reason is simply
that high quality movies are rented together due to their quality, which also affects
their performance.
Control function
To address this concern, I employ a control function approach, in which the goal is
to extract previously unobservable quality and include it as an explicit independent
variable in regression (1), which is reproduced below. If implicit similarity remains
significant and the coefficients of the independent variables generally remain stable
with the explicit inclusion of the quality variable, this would provide evidence that
the relationship between implicit similarity and box office performance is not due to
an unobservable like movie quality.
Given the regression:
$$\ln(TotalBoxOffice)_m = \alpha_0 + \alpha_1 ImpSim_m + \alpha_2 \#Genres_m + \alpha_3 Rank_m + \alpha_4 (\#Genres_m \times Rank_m) + \alpha_5 \ln(pre.advertising)_m + \alpha_6 Metacritic_m + \alpha_7 Sequel_m + \alpha_8 \#Stars_m + \alpha_9 TopDir_m + \alpha_{10} MPAA_m + \alpha_{11} Holiday_m + \alpha_{12} HHI.5wks_m + \alpha_{13} Year_m + \epsilon_m, \quad (1)$$
I attempt to extract movie quality from a movie’s pre-release advertising spending
by employing a control function approach, as explained below. A movie’s pre-release
advertising budget is finalized once a movie has completed production, so that the
studio can first assess the quality of the finished movie. Its advertising can have an
effect on the movie’s box office revenue through two mechanisms: first, advertising
can directly have a positive effect on revenue by making the movie more well-known
to audiences (Elberse and Anand, 2007), and second, it can affect revenue in an
indirect way, because the advertising that a studio spends on a particular movie
reflects its belief about the movie’s quality.
Since movie quality is unobservable, the econometric issue is that ε_m is not independent of ln(pre.advertising)_m. I address this issue through a control function
(see e.g. Heckman 1978, Petrin and Train 2009 for details on the implementation).
The basic idea of the control function approach is to derive a proxy variable that
conditions on the part of pre-release advertising that depends on the error (movie
quality). By doing so, the remaining variation in pre-release advertising spending
becomes independent of the error, capturing only the direct effect of advertising on
movie box office revenue, and previously unobservable quality can be included as an
explicit regressor.
The control function is a two-step procedure. The first step involves regressing
pre-release advertising spending on an instrumental variable and other covariates,
in order to obtain the residuals of the regression (what is referred to as the proxy
variable). The idea for the first step is that having found an appropriate instrument
for pre-release advertising, the residuals of the regression reflect the portion of a studio’s
advertising expenditures for a movie that goes beyond what would be expected for a
movie with the same observable characteristics.
For the choice of instrumental variable for a movie’s pre-release advertising, I
follow Chintagunta et al. (2010), and instrument pre-release advertising with a
movie’s production budget. The reason this is an appropriate instrument is that a
movie’s pre-release advertising is typically set as a fixed proportion of its production
budget (Vogel 2009). The implication is that pre-release advertising spending either above or below this fixed proportion sheds light on the studio’s belief about the movie’s quality. After running the first step regression and
collecting the residuals, the second step of the control function approach involves
using both the pre-release advertising spending and the residuals (interpreted as
perceived movie quality by the studio) in the original regression (1), as follows:
$$\ln(TotalBoxOffice)_m = \alpha_0 + \alpha_1 ImpSim_m + \alpha_2 QualityResidual_m + \alpha_3 \#Genres_m + \alpha_4 Rank_m + \alpha_5 (\#Genres_m \times Rank_m) + \alpha_6 \ln(pre.advertising)_m + \alpha_7 Metacritic_m + \alpha_8 Sequel_m + \alpha_9 \#Stars_m + \alpha_{10} TopDir_m + \alpha_{11} MPAA_m + \alpha_{12} Holiday_m + \alpha_{13} HHI.5wks_m + \alpha_{14} Year_m + \epsilon_m \quad (2)$$
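A sketch of the two steps in R is below; the column names (`ln_pre_adv`, `ln_budget`, and so on) are placeholders for the covariates defined in regression (1), and this is an illustrative sketch rather than the paper’s actual estimation code.

```r
# Sketch of the two-step control function; all column names are placeholders.
# Step 1: regress pre-release advertising on the production-budget instrument
# and the other covariates; the residuals proxy for perceived movie quality.
step1 <- lm(ln_pre_adv ~ ln_budget + n_genres * rank_n + metacritic + sequel +
              n_stars + top_dir + mpaa + holiday + hhi_5wks + factor(year),
            data = df)
df$quality_residual <- residuals(step1)

# Step 2: add the residual to regression (1), as in equation (2).
step2 <- lm(ln_total_box_office ~ imp_sim + quality_residual + n_genres * rank_n +
              ln_pre_adv + metacritic + sequel + n_stars + top_dir + mpaa +
              holiday + hhi_5wks + factor(year), data = df)
summary(step2)
```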
The results of the control function method are shown in Table 2 Column (2). We
can see that the magnitude of the implicit similarity variable decreases slightly from
0.137 to 0.115, and that the inclusion of the perceived quality variable does not eliminate its significance. Including the quality variable does not change the explanatory
power of the model, which may in part be due to the comprehensive baseline control
variables (Caliendo et al. 2014).
Test to evaluate robustness of results to omitted variable bias
As an additional robustness test, I check that the relationship between Implicit Similarity and ln(TotalBoxOffice) is not due to omitted-variable bias by applying the
procedure developed in Oster (2014). The idea is to ask how important the unobservables would need to be relative to the observables in order to eliminate the
estimated effect of the observable variable of interest (see Oster 2014 for details).
To determine this, an assumption has to be made about the share of the outcome
variance that could be explained by observable and unobservable variables together,
referred to as Rmax . This depends on the context. The second step involves evaluating the movements in the magnitude of the coefficient of interest, along with
movements in R2 , between two regressions, one that includes only the variable of
interest and one that includes the other observables. These values are used to calculate the degree of selection on unobservables relative to observables which would
be necessary to explain away the result of the variable of interest, δ. To implement
the method, I first run a baseline specification only including the variable of interest,
ImpSim, as follows: ln(TotalBoxOffice)_m = α0 + α1 ImpSim_m + ε_m. Running this
initial regression, I obtain α1 = 0.47 and R2 = 0.43. Next, I run a second regression
that includes controls for my other observable variables. This regression is the same
as regression (1), reproduced below:
$$\ln(TotalBoxOffice)_m = \alpha_0 + \alpha_1 ImpSim_m + \alpha_2 \#Genres_m + \alpha_3 Rank_m + \alpha_4 (\#Genres_m \times Rank_m) + \alpha_5 \ln(pre.advertising)_m + \alpha_6 Metacritic_m + \alpha_7 Sequel_m + \alpha_8 \#Stars_m + \alpha_9 TopDir_m + \alpha_{10} MPAA_m + \alpha_{11} Holiday_m + \alpha_{12} HHI.5wks_m + \alpha_{13} Year_m + \epsilon_m. \quad (1)$$
I obtain α1 = 0.162 and R2 = 0.891. Using the recently developed Stata psacalc
command with an approximated Rmax of 0.95, I obtain the δ (the degree of proportionality between unobservables and observables) that would be needed to produce a zero effect of ImpSim (α1 = 0): δ = 4.07. The assumption of Rmax in this context is motivated
by the fact that the R2 of regression (1) is already 0.90, so Rmax , the variance that
can be explained by observables and unobservables together, can take values between 0.90 and
1. In my case, any choice of Rmax gives δ >2. On average, larger values of δ (δ >
1) indicate more robust coefficient estimates, and with δ > 2 for any Rmax , it seems
implausible that the effect of ImpSim is driven by omitted variable bias. The value
of δ = 2 would suggest that the unobservables would need to be twice as important
as the observables to eliminate the effect I find.
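The sketch below shows the two regressions whose coefficient and R² movements are the inputs to this procedure; δ itself is obtained from Oster’s (2014) formula (e.g. via Stata’s psacalc) and is not re-implemented here. Column names are placeholders.

```r
# Sketch: the uncontrolled and controlled regressions that feed Oster's (2014)
# calculation; delta is then computed with her formula (e.g. Stata's psacalc).
base <- lm(ln_total_box_office ~ imp_sim, data = df)
full <- lm(ln_total_box_office ~ imp_sim + n_genres * rank_n + ln_pre_adv +
             metacritic + sequel + n_stars + top_dir + mpaa + holiday +
             hhi_5wks + factor(year), data = df)
c(beta_base = coef(base)["imp_sim"], r2_base = summary(base)$r.squared,
  beta_full = coef(full)["imp_sim"], r2_full = summary(full)$r.squared)
```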
4.2.2 Does box office performance affect implicit similarity?
The previous section provides evidence against the possibility that an unobservable
factor such as movie quality could affect both a movie’s box office performance and
its implicit similarity, by employing a control function approach and a test developed
by Oster (2014) to evaluate robustness of results to omitted-variable bias.
This section turns to addressing the possibility of reverse causality, meaning that
box office performance could actually affect implicit similarity, an option that is
very plausible in this setting. Given that most movies become available for rental
on Amazon Instant Video only after they are out of theaters, it is reasonable to
expect that a movie’s box office revenues could affect what movies people tend to
rent together online. This would imply that high-grossing movies are rented together
online because they are high-performing, not necessarily because they are inherently
similar to each other. To explore this possibility, I develop a regression model that
has as dependent variable a movie’s implicit similarity, and as independent variable
of interest a movie’s total box office revenue, as follows:
$$ImpSim_m = \alpha_0 + \alpha_1 \ln(TotalBoxOffice)_m + \alpha_2 \#Genres_m + \alpha_3 Rank_m + \alpha_4 (\#Genres_m \times Rank_m) + \alpha_5 \ln(advertising)_m + \alpha_6 Metacritic_m + \alpha_7 Sequel_m + \alpha_8 \#Stars_m + \alpha_9 TopDir_m + \alpha_{10} MPAA_m + \alpha_{11} Holiday_m + \alpha_{12} Weeks_m + \alpha_{13} Year_m + \epsilon_m \quad (3)$$
Next, I employ an instrumental variable approach, substituting an instrument for the
total box office variable in regression (3), in order to explore whether it is causally
affecting implicit similarity.
In order to be valid, an instrument Z must satisfy two conditions: (1) it must
be correlated with the endogenous explanatory variable conditional on the other regressors, corr(Z_i, X_i) ≠ 0, and (2) it cannot be correlated with the error term in the
regression, corr(Z_i, u_i) = 0. An instrument for a movie’s total box office revenue that
meets these two conditions is the share of potential moviegoers affected by inclement
weather. The reasoning for this instrument is that when the weather is bad, it is
more likely that people switch from outdoor to indoor activities, which could lead
to an increase in theater attendance on those dates. At the same time, there is no
reason to expect bad weather during a movie’s run in theaters to directly affect a
movie’s implicit similarity on Amazon Instant Video, other than through its indirect
effect on a movie’s box office revenue.
To construct this instrumental variable, I obtain population data for the thirty
largest American cities from the US Census Bureau, and I compute an estimated
annual US market size of moviegoers by summing up the potential market sizes for
each of the cities. Then, I collect data on all significant weather events taking place
in the thirty cities during the periods of time that the movies in my dataset are
playing in theaters. I focus specifically on one type of weather event: thunderstorm
warnings. I then compute the share of the estimated US market of moviegoers that
experience a thunderstorm warning in each weekend that the movies are playing in
theaters. As a final step, for each movie, I sum up the weekend shares for the first
five weekends that the movie is in theaters. As for the HHI variable that I construct,
I choose to aggregate this variable over the movie’s first five weeks in theaters, as
this is the period of time when movies make more than 80% of their total revenue.
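A sketch of the 2SLS estimation using the AER package is below; `storm_share` stands in for the thunderstorm-exposure instrument, the other column names are placeholders for the covariates in regression (3), and this is an illustrative implementation rather than the paper’s code.

```r
# Sketch: 2SLS for regression (3), instrumenting ln(total box office) with the
# share of the moviegoer market under thunderstorm warnings (storm_share).
library(AER)
iv_fit <- ivreg(
  imp_sim ~ ln_total_box_office + n_genres * rank_n + ln_advertising +
    metacritic + sequel + n_stars + top_dir + mpaa + holiday + weeks +
    factor(year) |
    storm_share + n_genres * rank_n + ln_advertising + metacritic + sequel +
    n_stars + top_dir + mpaa + holiday + weeks + factor(year),
  data = df)
summary(iv_fit, diagnostics = TRUE)  # diagnostics include the first-stage F-statistic
```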
Results of the OLS and 2SLS methods are reported in Column (1) and Column
(2), respectively, of Table 3. The first-stage F-statistic is 13.27, indicating that the
instrument for box office revenue - the share of US moviegoers affected by thunderstorm warnings during a movie’s first five weeks in theaters - satisfies the relevance
condition. We can see that there are large differences between the results of the OLS
and 2SLS methods. For the OLS estimation, the relationship between a movie’s box
office revenue and its implicit similarity is positive and highly significant; for the instrumental variable method, there is no evidence that box office revenues significantly
affect a movie’s implicit similarity. This provides evidence against the alternative
explanation that the relationship between implicit similarity and performance is due
to reverse causality.
INSERT TABLE 3 HERE
Taken together, the control function in Section 4.2.1 and the instrumental variable
in Section 4.2.2 help alleviate the endogeneity concerns that surround the relationship
between implicit similarity and performance, in a setting where it is not possible to
disentangle causality through a direct approach. To the best of my knowledge, this is
the first paper in the literature studying the relationship between category coherence
and performance to attempt to establish the causality of such a relationship.
5 The formation of implicit movie clusters
Section 4 investigated the relationship between a movie’s implicit similarity and its
box office performance, and attempted to determine the causality of this relationship.
Results showed that movies that are similar to others, by having many co-rentals in
common with other movies, tend to have better box office revenues than movies that
are far from other movies (i.e. those with little co-rental overlap with other movies).
Having established this relationship, this section shifts the attention to exploring what movies tend to form implicit clusters together, by having many common
co-rentals with each other. The aim of this analysis is to reveal these implicit clusters of movies and find what features they have in common; doing so may expose
latent dimensions of similarity that would likely not be captured through observable
characteristics. This knowledge could help studios produce movies that better target
different audience segments, by observing similarity dimensions based on consumer
co-rental patterns that were previously difficult to observe.
Currently, movie studio executives decide which movies to green-light (i.e. move
from development to production phase) through an uncertain process in which several
assigned readers have to agree on the potential of a script (Eliashberg et al. 2007)
and how well it aligns with the studio’s goals regarding genres, talent and financial
considerations. While studios rely on such methods to make green-light decisions,
companies like Amazon Instant Video and Netflix are using their wealth of available
user data to gain insights about their customers’ preferences and are increasingly
engaging in data-driven original content production to compete with movie studios.
For example, when deciding on its production of “House of Cards,” Netflix found
that users who liked the original BBC series were also the ones who liked movies
directed by David Fincher and those starring Kevin Spacey. The show is critically
acclaimed and has won numerous awards (IMDB). Similarly, Amazon has also used
data analytics to gain insights from their users on different pilots’ potential for success. Both companies argue that they let the data guide their creative strategies,
and that studios rely too much on uncertain studio tastemakers to produce content
(Wall Street Journal, 2013).
Greater knowledge about consumer preferences and dimensions of implicit similarity, which studios could gain by analyzing insights from websites like Amazon
Instant Video, could help them refine their green-lighting process and make movies
that better target different audience segments. One important aspect to understand
is which movies tend to cluster together based on consumer co-rental patterns.
To shed light on this, I employ machine learning techniques for pattern recognition and learning from data, which have recently been introduced for use in the
economics literature (Alpaydin 2014, Athey and Imbens 2015). Using the walktrap
algorithm (Pons and Latapy, 2005),⁵ I produce implicit clusters of movies based on
movies that have many common co-rentals on Amazon Instant Video. For this part
of the analysis, I use the entire set of 25,227 movies in the dataset, which leads to
526 different implicit clusters. Approximately 65% of these clusters contain fewer
than 10 movies, 35% have between 10 and 1,000 movies, and less than 1% of clusters
have more than 1,000 movies. The number of movies per cluster ranges between
1 and 4,797 movies, with the average number being 43 movies in a cluster. There
are 313 movie franchises in my sample, composed of an original movie with one or
multiple sequels. The average franchise has 4.25 movies, with a range from 1 to 13
movies. We would expect that franchises’ movies would tend to cluster together, as
the formulas of the sequels are based on the original movies. Findings support this
view: more than 72% of movie franchises have all their movies in the same implicit
cluster, while the other 28% have movies in at most two related clusters.
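A sketch of the clustering step is below. It assumes a precomputed matrix `prox` of the pairwise proximities p_mk from Section 3 and builds an undirected, weighted movie graph before running igraph’s walktrap community detection; the exact graph construction used in the paper may differ.

```r
# Sketch: implicit clusters via the walktrap algorithm (igraph). `prox` is an
# assumed symmetric matrix of pairwise proximities p_mk with movie IDs as names.
library(igraph)
ut  <- prox
ut[lower.tri(ut, diag = TRUE)] <- 0                  # keep each movie pair once
idx <- which(ut > 0, arr.ind = TRUE)
edges <- data.frame(from   = rownames(prox)[idx[, 1]],
                    to     = colnames(prox)[idx[, 2]],
                    weight = ut[idx])
g  <- graph_from_data_frame(edges, directed = FALSE)
wt <- cluster_walktrap(g, weights = E(g)$weight)
clusters <- membership(wt)                           # movie ID -> cluster label
table(sizes(wt))                                     # distribution of cluster sizes
```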
Given the formation of these implicit clusters of movies, a question arises about
what characteristics movies within a given cluster have in common. In order to
explore this, I run logit regressions to determine whether observable characteristics
such as movie genre, movie/director/actor awards, rating, season/year of release, and
studio type are able to explain a movie’s cluster membership, or whether potentially
latent similarity dimensions between movies may exist. After excluding clusters that
have fewer than 10 movies, I run logit regressions of the form:
⁵ The walktrap algorithm is implemented through the igraph package in R.
$$Cluster_m = \alpha_0 + \alpha_1 MPAA_m + \alpha_2 Season_m + \alpha_3 StudioType_m + \alpha_4 Awards_m + \alpha_5 Genres_m + \alpha_6 Year_m + \epsilon_m, \quad (4)$$
where Cluster_m is the cluster membership of movie m; MPAA_m is a vector of indicator variables that shows movie m’s MPAA rating (i.e. NR, G, PG, PG13, or R); Season_m indicates whether movie m was released during summer or winter; StudioType_m indicates whether movie m’s studio is a major, conglomerate indie, or true indie; Awards_m is an indicator variable that shows whether movie m received any of the following Oscar accolades: best picture nomination/win, best director nomination/win, best actor nomination/win, best actress nomination/win, best supporting actor nomination/win, best supporting actress nomination/win; Genres_m are genre fixed effects for the following categories: action/adventure, comedy, documentary, drama, fantasy, foreign films, gay/lesbian, horror, kids/family, military/war, musical, mystery/thriller, romance, science fiction, sports, or western; Year_m are release year fixed effects.
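A sketch of these per-cluster regressions is below; it assumes a data frame `df` with a `cluster` column plus the observables (with one 0/1 column per genre named `genre_*`), all of which are placeholder names, and reports McFadden’s pseudo-R² for each cluster.

```r
# Sketch: per-cluster logit of membership on observables, collecting McFadden's
# pseudo-R-squared. Column names (cluster, mpaa, season, studio_type, awards,
# genre_* indicators, year) are assumed placeholders.
big_clusters <- names(which(table(df$cluster) >= 10))
genre_cols   <- grep("^genre_", names(df), value = TRUE)
pseudo_r2 <- sapply(big_clusters, function(cl) {
  df$in_cluster <- as.integer(df$cluster == cl)
  f    <- reformulate(c("mpaa", "season", "studio_type", "awards",
                        "factor(year)", genre_cols), response = "in_cluster")
  fit  <- glm(f, data = df, family = binomial)
  null <- glm(in_cluster ~ 1, data = df, family = binomial)
  1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))  # McFadden pseudo-R2
})
hist(pseudo_r2)  # distribution summarized in Figure 4
```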
INSERT FIGURE 4 HERE
Figure 4 shows the distribution of the Pseudo-R² of the logit regressions performed for each of the clusters. We can see that observable characteristics, such as a movie’s genre, movie/director/actor awards, MPAA rating, studio type, and release year, have only modest power in explaining a movie’s cluster membership, with an average Pseudo-R² of 0.31. This suggests that a large part of why movies form a
cluster with other movies remains unexplained by observable attributes. With these
findings, the question becomes: why do these clusters of movies form, and what
do they have in common? Are there latent dimensions of similarity that movies in
each cluster have in common, which are not detected through observable factors and
coarse classifications? Some possible latent characteristics that may not be captured
through established categories include a movie’s theme, visual and sound effects,
pace and storytelling style, or character idiosyncrasies. In the rest of this section, I
explore one of these dimensions, the potential for the existence of common themes
that transcend simple genres.
To do so, I collect all the movies’ synopses from IMDB, and search for potential
commonalities in their themes. Table 4 shows five implicit clusters, the most prevalent genre labels in each cluster by percent of movies assigned to that genre, five
examples of movies in each cluster, and the common themes that are revealed when
exploring the text of the movies’ synopses. For example, a comparison of clusters 7
and 29 shows that even though the genre label drama is assigned to a majority of
movies in both clusters, each cluster has a common theme that its movies share, and
the themes of the two clusters are very different. Movies in Cluster 7 explore
issues of cultural identity, and the impact of political and class differences on people’s
daily lives. In contrast, movies in Cluster 29 tell stories about the lives of artists and
royals, focusing particularly on their struggles.⁶ Other examples of shared themes
that emerge within clusters are described in Table 4.
⁶ Text analysis is performed using the text mining and word cloud packages in R.
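As a rough illustration, the sketch below pulls the most frequent synopsis terms for one cluster with the tm package; `df$synopsis` and `df$cluster` are assumed placeholder columns, and the preprocessing is a simplification of whatever steps the paper applies.

```r
# Sketch: frequent terms in the IMDB synopses of a given cluster's movies,
# using the tm package. `df$synopsis` and `df$cluster` are assumed columns.
library(tm)
cluster_terms <- function(df, cl, k = 15) {
  corp <- VCorpus(VectorSource(df$synopsis[df$cluster == cl]))
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeWords, stopwords("en"))
  tdm  <- TermDocumentMatrix(corp)
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(freq, k)                      # candidate theme words for the cluster
}
# Example: cluster_terms(df, cl = 7) surfaces terms suggestive of that cluster's theme.
```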
INSERT TABLE 4 HERE
Having shed light on the formation of implicit clusters of movies and latent themes
that movies within clusters share, as a final analysis I explore whether there are differences across clusters in the average implicit similarity measure of their movies.
This is motivated by the question of whether different types of movies tend to be
more or less similar to other movies than other types.
Initial regressions show that the relationship between a cluster and the average
implicit similarity of its movies is made up of two parts: (1) First, there is a mechanical relationship: the more movies in a cluster, the lower the average implicit
measure of the cluster. This is expected because, in the calculation of the implicit similarity measure, two movies must share the same co-rental for that pair to count in the computation, whereas a one-way link between two movies can be enough for both to be assigned to the same cluster. As a result, the more movies there are in a cluster's periphery, the lower the cluster's average implicit measure. (2) Second, there is a part of the
relationship between a cluster and the average implicit similarity that goes beyond
the simply mechanical one. This is because movies in some clusters are more or less
tightly connected to other movies due to the kind of movies they are (e.g. the type
of theme that is common within a cluster could explain why those movies are more
or less similar to others).
In order to compare average implicit measures between clusters and eliminate the
mechanical relationship of the number of movies influencing the average, I take the
movies with the highest implicit measure in each cluster and calculate the average
of their implicit measures, which I can compare across clusters. Figure 5 shows the
average implicit similarity of the top 10 movies with the highest measure in each
cluster of at least 50 movies. We can see from the dispersion in the graph that there
are significant differences in the average measures across clusters.
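The calculation just described can be sketched in a few lines of R; this is only an illustration, assuming a data frame movies with columns cluster and imp_sim (names are mine, and dplyr is not necessarily what the paper uses).

    library(dplyr)

    top10_avg <- movies %>%
      group_by(cluster) %>%
      filter(n() >= 50) %>%               # clusters with at least 50 movies
      slice_max(imp_sim, n = 10) %>%      # the 10 movies with the highest implicit similarity
      summarise(avg_top10 = mean(imp_sim))

    top10_avg                             # dispersion across clusters, as plotted in Figure 5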
What does this heterogeneity across clusters imply? For example, we can compare
two clusters of about the same size, Cluster 25, with 190 movies, and Cluster 29,
with 186 movies. Cluster 25 is composed of movies that explore themes of martial
arts and success of underdog stories. The average implicit similarity measure of its
top 10 movies is 0.57. Cluster 29, whose movies are about the lives and struggles
of royals, poets and artists, has a higher average measure, of 1.29. Since these two
clusters are of similar sizes, I can also compute and compare the average similarity
measure of their movies calculated over the entire set of movies. The relationship
between the average similarity measure of the two clusters is qualitatively similar
when done over the top 10 and over the entire set of movies. Computed over the
entire set, Cluster 25 still has a lower average (-0.21) than Cluster 29 (-0.07). This
implies that movies about the lives of artists and royals are more tightly connected
to other movies than movies about martial arts and underdog success stories.
(Footnote 7: I have also computed the average over the top 50 movies in clusters of at least 100 movies, and obtained similar relationships between the clusters.)
This exercise shows that there is heterogeneity across clusters in the level of
implicit similarity of their movies, suggesting that movie type (e.g. the theme a
movie explores) can explain whether a movie is more or less similar to other movies.
INSERT FIGURE 5 HERE
This initial exploration of the characteristics of implicit clusters suggests that
genre classifications may be too broad to reflect the more subtle and latent similarity
dimensions that movies within clusters share. Even though at first glance movies are
assigned the same genre labels, a deeper look may reveal large differences between
movies of the same genre. Exploring the movies’ synopses provides an example
of the kind of latent information and similarity features that data analytics has
the potential to uncover. Moreover, differences in the level of implicit similarity
across clusters indicate that some types of movies are more likely to be niche and
formulaic, whereas movies from other clusters tend to incorporate more disparate
elements. Movie theme, as revealed through the movies making up implicit clusters,
is one possible factor that can suggest whether a movie is more or less formulaic. A
more in-depth text analysis of the movies’ scripts, along with a study of other latent
attributes, such as narrative style, rhythm, and visual and sound effects, provides
an interesting avenue for future research to uncover dimensions of latent similarity
between movies and to explore features that lead movies to be more or less similar
to other movies.
6 Discussion and Conclusion
This paper explores how the proliferation of voluminous amounts of data, “big data,”
holds the potential for firms to gain insights that may affect product performance and
firm strategy. Specifically, in the US movie industry setting, with data from Amazon Instant Video, the analysis focuses on the information that is revealed through
Amazon’s “Customers who rented this movie also rented this...” lists associated with
each movie. I present a novel way to determine implicit similarity between movies
from users’ online rental behavior, based on an adaptation of a measure originally
developed by Zuckerman (2004) in the stock market setting. The advantage of this implicit similarity approach is that it emerges directly from users’ rental patterns, rather than relying on third-party classifications such as genres, which may not capture how individuals categorize.
Based on this concept of implicit similarity, I show that movies that are similar
to others (having many common co-rentals with other movies) have better box office performance than movies that are far from others. This analysis was motivated
by the theoretical question of which movies would do better: those that are similar to others, forming implicit clusters of formulaic movies, or those that are far from others, combining disparate elements that could potentially draw larger audiences. Findings show that, on average, formulaic movies tend to perform better, suggesting that audiences prefer niche movies to movies that combine different elements.
I attempt to disentangle the direction of causality in the relationship between
implicit similarity and performance by employing a combination of control function
and instrumental variable techniques, as well as a method to evaluate robustness
to omitted variable bias. Although in this setting it is not possible to evaluate the
proposed direction of causality directly, using these empirical approaches I am able
to provide evidence against the two most likely alternative explanations for this relationship: (1) common cause, the idea that an unobservable factor could affect both a
movie’s implicit similarity and its box office performance, and (2) reverse causality,
the possibility that box office performance is in fact affecting what movies people
rent together. By ruling out these alternative mechanisms, my empirical approach
lends support to the hypothesized direction: that implicit similarity is a feature that
affects a movie’s box office performance.
Moreover, I show that implicit clusters of movies, composed of movies that have
many co-rentals in common, differ from typical classification schemes like genres.
Observable characteristics such as genre, movie/director/actor awards, studio type
or MPAA rating have modest power to explain why clusters form, leaving a large
part unexplained. These findings raise the question: given the implicit clusters I
uncover, what latent features do their movies share? In this paper, I conduct an
initial exploration of one possibility, the potential that movies within a cluster share
a common theme, a latent dimension that transcends coarse genres. I also show that
there exist differences in the level of implicit similarity across clusters, indicating
that some types of movies are more likely to be niche and formulaic, whereas others
tend to incorporate more disparate elements. Investigating other previously unobservable dimensions, like storytelling style, pace, visual/sound effects, and character
idiosyncrasies, may provide additional insights into the determinants of similarity between movies and the factors that lead them to be more or less similar to other movies.
Thus, a more in-depth exploration of these factors could inform studios about other subtle dimensions of similarity relevant to performance, and it provides an interesting avenue for future research.
This paper makes contributions to several streams of literature. First, it contributes to the literature on the economics and strategy of digitization by exploring
novel insights that data analytics can reveal for firms about consumer preferences,
which were previously very difficult, if not impossible, to obtain, and that have
performance and strategy implications. The paper also makes methodological contributions in this area by using machine learning techniques that have recently been
adopted in economics. Second, it contributes to the categories research in sociology,
by (1) inferring implicit categories from consumer co-rental patterns rather than relying on third-party genre classifications to determine categories, and (2) providing
evidence for the causality of the relationship between a movie’s implicit similarity
level (the extent to which it belongs to a coherent category with other movies) and
its performance. To the best of my knowledge, this is the first paper in this literature that tackles the issue of causality. Third, the paper contributes to the literature in marketing that explores the determinants of box office success, by examining how a
formerly unobservable movie feature, implicit similarity, affects its performance. This
could lead to future work in this area that investigates what other previously latent
characteristics may affect movie performance and how they shape firm strategy.
References
[1] Alpaydin, E. (2014). Introduction to Machine Learning, Third Edition. MIT Press.
[2] Athey, S. and J. Gans. (2010) “The Impact of Targeting Technology on Advertising Markets and Media Competition,” AER Papers and Proceedings, 100(2):
608-613.
[3] Athey, S. and G. Imbens. (2015). “Lectures on Machine Learning.” NBER, Boston, MA. Lecture.
[4] Basuroy, S., Chatterjee, S., & Ravid, S. A. (2003). “How critical are critical
reviews? The box office effects of film critics, star power and budgets,” Journal
of Marketing, 67(4): 103-117.
[5] Bloom, N., L. Garicano, R. Sadun and J. Van Reenen. (2009). “The Distinct
Effects of Information Technology and Communication Technology on Firm Organization,” NBER Working Paper No. 14975.
[6] Brynjolfsson, E. and L. Hitt. (2000). “Beyond Computation: Information Technology, Organizational Transformation and Business Performance,” Journal of
Economic Perspectives, 14(4): 23-48.
[7] Caliendo, M., Mahlstedt, R. & Mitnik, O. A. (2014). “Unobservable, but unimportant? The influence of personality traits (and other usually unobserved variables) for the estimation of treatment effects,” IZA Discussion Paper No. 8337,
IZA, Bonn, Germany.
[8] Chintagunta, P., Gopinath, S., & S. Venkataraman. (2010). “The effects of online
user reviews on movie box office performance: Accounting for sequential rollout
and aggregation across local markets,” Marketing Science, 29(5): 944-957.
[9] Einav, L. (2007). “Seasonality in the US motion picture industry,” RAND Journal
of Economics, 38(1): 127-145.
[10] Einav, L., T. Kuchler, J. Levin and N. Sundaresan. (2011). “Learning from
Seller Experiments in Online Markets,” NBER Working Paper No. 17385.
[11] Eliashberg, J., A. Elberse and M. Leenders. (2006). “The Motion Picture Industry: Critical Issues in Practice, Current Research, and New Research Directions,”
Marketing Science, 25(6): 638 - 661.
[12] Goldfarb, A., S. Greenstein and C. Tucker. (2015). Economic Analysis of the
Digital Economy, Chicago, IL: University of Chicago Press.
[13] Gopinath, S., P. K. Chintagunta, & S. Venkataraman. (2013). “Blogs, advertising and local-market movie box office performance,” Management Science,
59(12): 2635-2654.
[14] Hsu, G. (2006). “Jacks of All Trades and Masters of None: Audiences’ Reactions to Spanning Genres in Feature Film Production,” Administrative Science
Quarterly, 51(3): 420-450.
[15] Hsu, G., T. Hannan and O. Kocak. (2009). “Multiple Category Membership in
Markets: An Integrative Theory and Two Empirical Tests,” American Sociological Review, 74(1): 150-169.
[16] Leonard, A. “How Netflix is turning viewers into puppets.” Salon. 1 February 2013. Retrieved 12 August 2015 from: http://www.salon.com/2013/02/01/how_netflix_is_turning_viewers_into_puppets/
[17] Linden, G., B. Smith and J. York. (2003). “Amazon.com Recommendations:
Item-to-Item Collaborative Filtering,” Industry Report IEEE Computer Society,
76-80.
[18] Liu, Y. (2006). “Word of Mouth for Movies: Its Dynamics and Impact on Box
Office Revenue,” Journal of Marketing, 70(3): 74-89.
[19] McElheran, K. (2015). “Do Market Leaders Lead in Business Process Innovation? The Case(s) of E-Business Adoption,” Management Science, 61(6): 1197-1216.
[20] Nelson, R., M. Donihue, D. Waldman and C. Wheaton. (2001). “What’s an
Oscar worth?” Economic Inquiry, 39(1): 1-16.
[21] Oestreicher-Singer, G. and A. Sundararajan. (2012). “The Visible Hand? Demand Effects of Recommendation Networks in Electronic Markets,” Management
Science, 58(11): 1963-1981.
[22] Oster, E. (2014). “Unobservable Selection and Coefficient Stability: Theory and
Validation,” Working paper, University of Chicago.
[23] Pons, P. and M. Latapy. (2006). “Computing Communities in Large Networks Using Random Walks,” Journal of Graph Algorithms and Applications, 10(2): 284-293.
[24] Ravid, A. (1999). “Information, Blockbusters, and Stars: A Study of the Film Industry,” The Journal of Business, 72(4): 463-492.
[25] Saunders, A. and P. Tambe. “The Value of Data: Evidence from a Textual
Analysis of 10-Ks,” Working Paper, January 2012.
[26] Sharma, A. “Amazon Mines Its Data Trove to Bet on TV’s Next Hit.”
The Wall Street Journal. 1 November 2013. 15 August 2015. Retrieved from:
http://www.wsj.com/articles/SB10001424052702304200804579163861637839706/
[27] Terry, N., M. Butler & D. De’Armond. (2006). “The Economic Impact of Movie
Critics on Box Office Performance,” Academy of Marketing Studies Journal, 8(1):
61-73.
[28] Vogel, H. Entertainment Industry Economics: A Guide for Financial Analysis,
5th ed. Cambridge, UK: Cambridge University Press, 2009.
[29] Zuckerman, E. (1999). “The Categorical Imperative: Securities Analysts and
the Illegitimacy Discount,” American Journal of Sociology, 104(5): 1398-1438.
[30] Zuckerman, E. (2004). “Structural Incoherence and Stock Market Activity,”
American Sociological Review, 69(3): 405-432.
Appendix A: Rank Variable Construction Example
The table below shows an illustrative example of the Rank variable construction.
Table A1: Example of Rank variable construction

Number of Genres    Genres      Box Office ($M)    Rank
1                   A           10                 1
1                   B           7                  2
1                   C           4                  3
2                   A, B        12                 1
2                   A, C        8                  2
2                   B, C        5                  3
3                   A, B, C     11                 1
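A minimal R sketch of this construction follows, assuming a data frame movies with illustrative columns n_genres and box_office; it is not the paper's code.

    # Rank movies by box office within each number-of-genres group (1 = highest grossing),
    # matching the ordering shown in Table A1.
    movies$rank <- ave(-movies$box_office, movies$n_genres,
                       FUN = function(x) rank(x, ties.method = "min"))

Ranking the negated box office values means the highest-grossing movie within each group receives Rank = 1, the second highest Rank = 2, and so on.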
Appendix B: Implicit Similarity Construction Example
Suppose movies A and B have 5 mutual co-rentals, and movies A and C have 4 mutual co-rentals. A doesn’t have any other co-rentals in common with other movies.
Calculate movie A’s implicit similarity measure.
Step 1. For each of movie A's pairs, (A, B) and (A, C), calculate p_{AB} and p_{AC}:
\[
p_{AB} = \frac{5}{6}, \qquad p_{AC} = \frac{4}{6} = \frac{2}{3}.
\]

Step 2. Average the pairwise values:
\[
ImpSim_A = \frac{\frac{5}{6} + \frac{2}{3}}{2} = \frac{9}{12} = \frac{3}{4}.
\]
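A minimal R sketch reproducing this arithmetic is given below; the denominator of 6 is taken at face value from the worked example, and treating it as a general constant is an assumption.

    # Implicit similarity as the average of the pairwise p values for the focal movie.
    imp_sim <- function(mutual_corentals, denom = 6) {
      p <- mutual_corentals / denom   # one p per pair, e.g. p_AB = 5/6
      mean(p)                         # average over the focal movie's pairs
    }

    imp_sim(c(AB = 5, AC = 4))        # (5/6 + 2/3) / 2 = 0.75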
Figure 1: Methodological tools used in the paper
Notes: The diagram shows a roadmap of the various methodological tools employed in the analysis of each section.
Figure 2: Amazon Instant Video sample movie page
Notes: The image shows a sample Amazon main page for the movie “Angels & Demons.” It includes the starring
actors, runtime, release year, MPAA rating, Amazon stars, IMDB score, synopsis, and the movie’s top 5 co-rentals.
Figure 3: Percent of Total Box Office Revenues, by week
Notes: The figure shows the average cumulative percent of total box office gross each week for movies in the 2009-2013 subsample of movies. On average, movies in this sample have accumulated more than 80% of their total box office grosses by their fifth weekend in theaters. (The horizontal axis shows the weekend, from 0 to 50; the vertical axis shows the cumulative percent of total box office revenues, from 0 to 1.)
Figure 4: Explanatory power of observable characteristics for cluster membership
Notes: I run logit regressions of cluster membership for each cluster containing at least 10 movies, to determine whether observable characteristics can explain whether a movie belongs to a particular cluster. This figure is a histogram of the Pseudo-R2s of those regressions, indicating the percent of clusters at each Pseudo-R2 (horizontal axis: Pseudo-R2, from 0 to 1; vertical axis: percent of clusters).
Figure 5: Average implicit similarity by cluster
Notes: The graph shows the average implicit measure of the top 10 movies with the highest implicit similarity in each cluster, for clusters with at least 50 movies. The horizontal axis shows the standardized average similarity measure of each cluster and the vertical axis shows the Cluster ID; each Cluster ID is also labeled on the graph.
Table 1: Summary Statistics of Key Variables

                              Obs.             Mean            Std. Dev.      Min             Max
Variable Name                 Full     Sub     Full    Sub     Full   Sub     Full    Sub     Full    Sub

Main Dep/Indep
ln(TotalBoxOffice)            6,962    843     14.74   14.45   3.45   3.08    4.27    4.09    20.45   20.08
Implicit Similarity           22,548   859     0       0       1      1       -0.65   -0.97   11.31   9.23

Production and Distribution
ln(Advertising)               800      800     14.54   14.54   2.81   2.81    6.70    6.70    17.97   17.97
ln(ProductionBudget)          2,698    702     16.59   16.57   1.52   1.59    8.70    9.55    19.87   19.93
Release Year                  18,113   859     1996    2011    20.44  1.34    1905    2009    2013    2013
Number of Genres              25,227   843     1.67    2.06    0.84   0.94    1       1       5       5
Sequel                        25,227   859     0.02    0.06    0.15   0.25    0       0       1       1

Genre Composition
Action-Adventure              25,227   859     0.20    0.28    0.40   0.45    0       0       1       1
Comedy                        25,227   859     0.28    0.37    0.45   0.48    0       0       1       1
Documentary                   25,227   859     0.08    0.06    0.27   0.25    0       0       1       1
Drama                         25,227   859     0.45    0.55    0.50   0.49    0       0       1       1
Fantasy                       25,227   859     0.02    0.04    0.13   0.19    0       0       1       1
Foreign Films                 25,227   859     0.04    0.05    0.21   0.22    0       0       1       1
Horror                        25,227   859     0.16    0.11    0.36   0.32    0       0       1       1
Kids-Family                   25,227   859     0.01    0       0.09   0       0       0       1       0
Military/War                  25,227   859     0.01    0.01    0.09   0.09    0       0       1       1
Musicals                      25,227   859     0.02    0.01    0.14   0.12    0       0       1       1
Mystery-Thrillers             25,227   859     0.15    0.22    0.36   0.41    0       0       1       1
Romance                       25,227   859     0.12    0.18    0.32   0.39    0       0       1       1
Science Fiction               25,227   859     0.10    0.15    0.29   0.46    0       0       1       1
Westerns                      25,227   859     0.02    0.01    0.15   0.08    0       0       1       1

Notes: The table shows summary statistics of key variables for both the full Amazon sample of 25,227 movies (Full) and the subsample used in the analysis in Section 3 (Sub) for comparison.
Table 2: Relationship between implicit similarity and box office performance

DV = ln(TotalBoxOffice)
Variable                  (1)                    (2)
Implicit Similarity       0.137***   (0.044)     0.115**    (0.045)
QualityResidual                                  -0.045     (0.059)
#Genres                   0.031      (0.094)     -0.042     (0.101)
Rank                      -0.099*    (0.052)     -0.128**   (0.060)
#Genres X Rank            0.016      (0.022)     0.023      (0.024)
ln(Pre-Advertising)       1.054***   (0.026)     1.060***   (0.053)
Metacritic                0.025***   (0.002)     0.025***   (0.003)
Sequel                    0.590***   (0.115)     0.525***   (0.118)
#Stars                    0.145**    (0.063)     0.090      (0.071)
TopDirector               0.086      (0.098)     -0.039     (0.100)
Holiday                   -0.392     (0.329)     -0.0754    (0.262)
HHI.5wks                  0.628      (0.608)     0.274      (0.755)
Rated NR                  0.804***   (0.225)     0.701**    (0.290)
Rated G                   -0.145     (0.435)     -0.598**   (0.254)
Rated PG13                -0.139     (0.124)     -0.118     (0.131)
Rated R                   -0.496***  (0.126)     -0.453***  (0.143)
Observations              776                    664
R-squared                 0.906                  0.904

Regressions include year fixed effects. Robust standard errors are in parentheses. *** p<0.01, ** p<0.05, * p<0.1
Table 3: Instrumental variable method results

DV = Implicit Similarity
Variable                  (1)                    (2)
ln(TotalBoxOffice)        0.184***   (0.062)     -0.014     (0.285)
#Genres                   -0.045     (0.115)     -0.017     (0.123)
Rank                      -0.061     (0.053)     -0.072     (0.057)
#Genres X Rank            0.018      (0.023)     0.018      (0.022)
ln(Total-Advertising)     -0.128*    (0.068)     0.072      (0.287)
Metacritic                -0.001**   (0.003)     -0.006**   (0.003)
Sequel                    0.596*     (0.321)     0.704*     (0.398)
#Stars                    -0.074     (0.062)     -0.052     (0.066)
TopDirector               -0.166     (0.113)     -0.159     (0.107)
Holiday                   -0.152     (0.132)     -0.139     (0.127)
Weeks                     -0.012     (0.009)     0.005      (0.025)
Rated NR                  -0.130     (0.213)     0.103      (0.390)
Rated G                   0.600*     (0.321)     0.641**    (0.267)
Rated PG13                -0.054     (0.128)     -0.086     (0.150)
Rated R                   0.047      (0.139)     -0.016     (0.172)
Number of instruments                            1
First-stage F-statistics                         13.27
Observations              767                    767
R-squared                 0.152                  0.122

Column (1) is an OLS regression, Column (2) is 2SLS. Both specifications include year fixed effects. Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1.
The instrument for ln(TotalBoxOffice) is the share of US moviegoers affected by thunderstorm warnings during a
movie’s first five weekends in theaters.
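For readers who want to estimate a specification of this form, a minimal sketch using the AER package is shown below; the variable names (including storm_share for the thunderstorm-warning instrument) and the exact set of controls are illustrative assumptions, not the paper's code.

    library(AER)

    # Column (2): 2SLS with ln(TotalBoxOffice) instrumented by the storm-exposure share.
    iv_fit <- ivreg(imp_sim ~ ln_box_office + n_genres * rank + ln_total_advertising +
                      metacritic + sequel + n_stars + top_director + holiday + weeks +
                      mpaa + factor(year) |
                      . - ln_box_office + storm_share,
                    data = movies)

    summary(iv_fit, diagnostics = TRUE)  # reports the first-stage (weak instrument) F-statistic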
Table 4: Examples of Implicit Clusters

Cluster 7 (128 movies)
  Genres: Drama (63.28%), Comedy (43.75%), Foreign Films (37.50%), Romance (17.96%), Mystery (10.15%)
  Common Themes: Movies exploring issues of cultural identity, political and class differences, and their impact on daily life.
  Movie Examples: The Class; City of Life; The Intouchables; The Women on the 6th Floor; 4 months 3 weeks 2 days

Cluster 17 (129 movies)
  Genres: Comedy (70.54%), Drama (45.75%), Romance (30.23%), Science Fiction (7.75%), Mystery (4.65%)
  Common Themes: Movies with humor that is based on the outcomes of unlikely friendships and relationships.
  Movie Examples: Odd Couple; Grumpy Old Men; The Breakfast Club; Brain Donors; Under the Boardwalk

Cluster 25 (190 movies)
  Genres: Action (85.78%), Foreign Films (35.78%), Drama (28.95%), Comedy (13.16%), Science Fiction (7.89%)
  Common Themes: Movies about martial arts, success of underdog stories, and themes about powerful female heroines.
  Movie Examples: Ip Man Series; Kingdom of War; Mulan: Rise of a Warrior; Mirageman; Crouching Tiger, Hidden Dragon

Cluster 29 (186 movies)
  Genres: Drama (81.72%), Romance (32.25%), Comedy (21.50%), Documentary (6.99%), Mystery (5.91%)
  Common Themes: Movies about the lives, and in particular the struggles, of royals, artists and poets.
  Movie Examples: Elizabeth; They Came to Play; The Red Violin; The Madness of King George; The Eyes of Van Gogh

Cluster 60 (302 movies)
  Genres: Comedy (58.61%), Action (26.15%), Drama (22.19%), Science Fiction (21.19%), Kids/Family (12.58%)
  Common Themes: Movies oriented toward families; based on clever, innovative, and sometimes science fiction ideas.
  Movie Examples: Back to the Future; Mrs. Doubtfire; Honey I Shrunk the Kids; The Nutty Professor; Ghostbusters