“Big Data” Goes to the Movies: Revealing latent similarity among movies and its effect on box office performance Sandra Barbosu∗ December 21, 2015 Abstract This paper explores, in the movie industry setting, how the rise of “big data” presents firms with the opportunity to acquire knowledge that can influence product performance and shape firm strategy. It asks the question: what is the value of insights gained from “big data” for firms? Specifically, the paper focuses on Amazon Instant Video’s “Customers who rented this [focal movie] also rented this...” lists to (1) evaluate similarity between movies based on users’ rental patterns, (2) show that movies that are implicitly similar to others have better box office performance than movies that are far from others, and (3) provide reasons for these performance differences. This paper also explores the formation of implicit clusters of movies, made of movies with many co-rentals in common, and shows that they differ from usual classification schemes (genres). Observable characteristics such as genre, actor/director reputation, studio type and MPAA rating have modest power to explain why these clusters form. This points to the possibility that implicit similarity dimensions may exist within clusters. An initial exploration shows that one such dimension is theme, which transcends coarse classifications. Exploring other possible latent common features could provide further insights to studios to help them make movies better targeted to different audience segments. Methodologically, this paper employs a combination of rigorous econometric methods and machine learning tools. ∗ Rotman School of Management, University of Toronto. [email protected]. I am grateful to my committee for their valuable insights and continuous support: Mike Ryall (chair), Anne Bowers, Joshua Gans, Avi Goldfarb, and Mitchell Hoffman. I also thank Andrew Ching, Brian Silverman, Olav Sorenson, Scott Stern, David Tan, participants at the 2014 CCC Doctoral Consortium, 2014 Mallen Motion Picture Economics Conference, 2015 BPS Dissertation Consortium, and my colleagues in the Rotman doctoral program for their helpful feedback. All errors remain my own. 1 1 Introduction and Related Work Digitization is transforming a wide range of industries, and the rise of “big data” presents firms with the opportunity to gain insights that can contribute to product performance and strategy. The proliferation of numerous websites such as Amazon, eBay, Pinterest and Twitter, which are able to aggregate data from millions of users, has led to the availability of detailed information about consumer preferences that was previously very difficult, if not impossible, to collect. Such fine-grained data can provide firms with useful knowledge for their product development strategies. In this paper, I address the question: what is the value of “big data” insights for firms? The setting of this paper is the United States motion picture industry, an industry with high economic importance, which contributed $504 billion to the United States’ GDP in 2011 (Bureau of Economic Analysis 2013). The movie industry has received attention from multiple streams of literature, including research in economics exploring the determinants of box office success (e.g. Gopinath et al. 2013, Liu 2006), and work in sociology focusing on the relationship between the number of genres a movie spans and its performance (Hsu 2006, Hsu et al. 2009). 
The data for this paper are collected from Amazon Instant Video, Amazon's movie streaming service, which in December 2013 had over 25,000 movies and an estimated 20 million users (McGlaun 2013). A key reason I chose to focus on Amazon Instant Video is the section "Customers who rented this [focal movie] also rented..." associated with each movie. This section contains an ordered list of the top twenty movies most frequently co-rented with the focal movie, which I refer to as a co-rental list. It is produced by Amazon's item-to-item collaborative filtering algorithm (Linden et al. 2003) and reflects the aggregated rental behavior of millions of users.

Co-rental lists provide a novel opportunity to observe relationships between movies based directly on consumers' rental patterns, which was not previously possible. This offers a new way to determine similarity between movies from revealed consumer preferences, without having to rely on third-party classifications like genres, which are often vague and problematic. Rooted in data analytics, this approach has the potential to reveal latent dimensions of similarity between movies that may not be captured by traditional classifications, and this could have implications for studios' movie development strategies.

Using co-rental lists, I evaluate a movie's implicit similarity with other movies based on the number of common co-rentals it shares with every other movie. With the assumption that online rental behavior reflects consumers' movie viewing behavior more generally, I explore the relationship between a movie's implicit similarity and its box office performance in Section 4. This analysis is motivated by the theoretical question of how a movie's implicit similarity, as revealed from user rental patterns, matters for performance. On one hand, movies that are similar to others tend to form implicit clusters of formulaic movies that are liked by niche consumer groups. They may contain latent similarity features that transcend simple genres, and their unobservable formulas appeal to specific audience segments. As formulaic movies are well aligned with niche segments' preferences, this could lead to high performance. On the other hand, movies that are far from others contain novel combinations of attributes that are not specific to any niche audience, but that have the potential to command larger audiences by appealing to multiple segments, which could also result in high performance.

The results of my analysis in Section 4 show that on average, movies that are implicitly similar to others are higher-performing at the box office than more distant movies. The implication of these findings is that consumers prefer formulaic movies that appeal to niche groups over movies that attempt to combine disparate features. After establishing these results, a concern still remains: does implicit similarity actually affect performance, or is this relationship due to an alternative mechanism? Likely candidates for alternative explanations are: (1) omitted variable - an unobservable feature such as a movie's quality may both affect its performance and lead it to be rented together with other movies of similar quality, or (2) reverse causality - movies are rented together because they have similar performance, even if they are not otherwise similar (e.g. high-performing movies get rented together because of shared box office success).
To address these concerns, I employ several empirical techniques, including instrumental variable and control function methods, and a recently developed procedure to evaluate robustness of results to omitted variable bias (Oster 2014). These methods provide evidence against the proposed alternative explanations, lending support to the argument that implicit similarity affects movie performance. Having established the relationship between implicit similarity and box office performance, in Section 5 I delve deeper to explore which movies are implicitly similar to each other. By revealing clusters of movies that consumers frequently watch together, this analysis would provide information to movie studios about different market segments that may transcend simple genre classifications. This knowledge 4 could be useful for studios in order to help them make movies that are better aligned with the preferences of different market segments. In this analysis, I apply machine learning algorithms to determine which movies cluster together by having many shared co-rentals with each other. I show that observable characteristics like genre, actor/director reputation, studio type and MPAA rating have low explanatory power in determining cluster membership. These findings raise the possibility that there exist latent dimensions of similarity that are common to clusters, which are not captured by typical observable characteristics. Exploring these dimensions can further help studios gain a better understanding of different audience segments’ preferences. More generally, this analysis illustrates the potential that the proliferation of big data, or the increasingly large unstructured data, can have for firms across industries, as it can reveal fine-grained information about consumer preferences which was previously very difficult, if not impossible, to obtain. This information can help firms gain insights into consumer preferences that may help them make products with a better fit with what their customers want. The results and insights in this paper contribute to several research streams: research on the economics and strategy of digitization, the branch of sociology that studies the relationship between how many categories a firm spans and its performance, and marketing work on the determinants of movie success. I discuss the contributions to each literature below. In addition, the paper employs different analytical tools in each part of the analysis. Figure 1 provides a roadmap of those tools. 5 INSERT FIGURE 1 HERE Economics and strategy of digitization Research in this area explores the causes and economic consequences of the digitization that has been occurring across industries (see Goldfarb et al. 2015 for a detailed overview). One stream of this literature explores the impact of information technology on firm performance, strategy, and overall industry structure in various settings (see Bloom et al. 2009, Athey and Gans 2010, Brynjolfsson and Hitt 2000, McElheran 2011). Saunders and Tambe (2012) show that adoption of data analytics practices is associated with higher firm value. Einav et al. (2011) provide evidence in favor of business strategies that are data-driven, showing that experiments conducted by sellers on eBay can shed light on consumer behavior and market outcomes, which can hold insights for firms. This paper contributes to this literature by exploring what types of insights data analytics can provide for firms, which were previously very difficult to obtain. 
Particularly, it focuses on how user rental patterns on Amazon Instant Video can shed light on previously unobservable consumer preferences and latent dimensions of similarity between movies, as well as the implications of this information for movie performance. This paper also has methodological contributions by employing machine learning techniques recently adopted in economics (Athey and Imbens 2015). Categories research in sociology Starting with the seminal work by Zuckerman (1999) on the categorical imperative, there has been a growing stream of research 6 in sociology that explores the relationship between the number of categories a product spans and its performance. Work in this stream argues that category-spanning products tend to perform worse than products that belong to one category, because products that span categories are difficult to interpret and evaluate, which results in consumers ignoring them. Although the negative correlation between category-spanning and performance is well-established in this literature, the causality of this relationship is difficult to disentangle, and attempts to tackle this issue are lacking. Moreover, with a few important exceptions (Zuckerman 2004, Goldberg and Vashevko 2013), most work relies on institutional, third-party classifications (e.g. SIC codes or movie genres) to measure category-spanning, which may not capture how individuals actually classify. In the movie industry setting, Hsu et al. (2009) show that movies that span genres perform worse at the box office. However, particularly in the movie industry, genre assignments are difficult to make as they are often broad, vague and subjective (Basuroy et al. 2003), and may not accurately show which movies are similar to each other. These issues make it problematic to use genres to measure the number of categories a movie is in. This paper aims to make two contributions to this literature. First, it makes a departure from the general reliance on film genres to infer similarity between movies. It attempts to get closer to how individuals classify by using the movie rental patterns of Amazon Instant Video users to infer implicit similarity between movies. Movies that are frequently co-rented together by users form implicit categories, or clusters, which differ from established genres. Second, the paper seeks to establish 7 and then explore the causality of the relationship between how similar a movie is to others (i.e. the degree to which it belongs to one implicit cluster with other movies or spans clusters) and movie performance, through a two-part approach made up of instrumental variable and control function methods in Section 4. Marketing There is a large stream of research in the marketing literature exploring the determinants of success in the movie industry (see Eliashberg et al. 2006 for a survey of the literature). Some studies focus on quantifying the impact of different factors such as famous actors/directors, blogs and advertising on movie revenues (Ravid 1999, Gopinath et al. 2013), while others explore the role of film critics as both predictors and influencers of movie performance (Basuroy et al. 2003). This paper builds on this literature by exploring how a previously unobservable factor, a movie’s implicit similarity to other movies, as revealed through online consumer rental patterns, affects a movie’s box office performance. 
This provides an avenue for future work to explore other previously difficult-to-observe movie characteristics that digitization can reveal, which may have performance implications.

The rest of the paper is organized as follows. Section 2 describes the data. Section 3 motivates and constructs a movie's implicit similarity measure. Section 4 analyzes the relationship between implicit similarity and box office performance. Section 5 explores the formation of implicit clusters of movies, and the latent similarity dimensions they share. Section 6 concludes and suggests avenues for future research.

2 Data

There are two types of data that I have collected for the analysis. The first type is used to evaluate implicit similarity between movies based on users' co-rental behavior. For this purpose, I collected data from Amazon Instant Video's "Customers who rented this [focal movie] also rented..." section associated with each movie. I focused only on motion pictures that were released in theaters, excluding TV shows, direct-to-video films, music videos, and other short clips and duplicate titles, as my focus is on factors that affect the performance of feature-length movies and studios' strategies. This section lists the top twenty movies that are most frequently co-rented with the focal movie by Amazon users, in order of co-rental frequency (with hyperlinks to the movies' respective pages). Figure 2 shows a sample Amazon Instant Video movie page for the movie Angels & Demons; the top five of the 20 co-rentals appear on the main page. As I explain in detail in Section 3, these co-rental lists are a key feature of my analysis. I use them to compute a measure of implicit similarity between movies based on the number of co-rentals each pair of movies has in common, as revealed through rental patterns.1 Based on this measure, I investigate the relationship between a movie's implicit similarity to other movies and its box office performance in Section 4. In Section 5, I explore which movies are similar to each other, forming clusters of movies, and examine what latent dimensions of similarity these movies share.

1 A possible concern would be that the explicit visibility of the first five co-rental relationships may drive them to continue being rented together, which would reinforce their positions (Oestreicher-Singer and Sundararajan, 2012). This would imply that co-rental lists are a combination of: (1) initial revealed consumer preferences that generate the lists and (2) a bias introduced by the reinforcement of the first five visible co-rentals. However, I argue that these lists, which are the basis for the construction of my implicit similarity measure, are still reflective of consumer preferences. This is because they are based on initial preferences and, more importantly, because results in Section 4 show that a movie's online implicit similarity affects its box office performance. Since box office revenues are largely accrued by a movie before it can be rented on Amazon, this finding implies that despite the bias, consumers' online rental patterns are reflective of consumer preferences, as the similarity measure significantly affects past performance. As an additional robustness test, I run my analysis with the implicit similarity measure constructed on a movie's first 10 co-rentals (five of which are not immediately visible on the movie's page) and the results still hold.

INSERT FIGURE 2 HERE

A potential concern about the data could be the lack of information on how Amazon actually creates its co-rental lists.
Can the lists be taken at face value as a ranking of movies most frequently co-rented with a focal movie? I argue that they can, and attempt to provide evidence that alleviates such concerns. As engineers and co-developers of several Amazon recommendation algorithms, including the co-rental lists, Linden et al. (2003) explain in detail the algorithm used to produce these lists. Known as item-to-item collaborative filtering, the algorithm first produces a ranking of the movies most frequently co-rented by Amazon users. It then adjusts the ranking to also take into account any revealed preferences of users based on their prior rental history. The description of the algorithm by Linden et al. (2003) reduces the uncertainty about how the co-rental lists are produced. In addition, I collect the data using a browser with no previous search history on Amazon, in order to get as close as possible to the algorithm's initial ranking of co-rentals, unaffected by any revealed preferences.

A concern may still remain that Amazon could run experiments by inserting random movies in these lists, instead of those produced by the algorithm. However, even if this occasionally does take place, Amazon's incentive is to increase rentals, so it would aim to show movies for which it had some evidence that certain users would like them. Otherwise, it would not have an incentive to show them, and they would not remain in the lists for very long. Thus, even if Amazon may sometimes run experiments, it is in Amazon's best interest to list movies that consumers actually want to rent, that is, movies that are in some way similar to the other movies shown. Taken together, knowledge about Amazon's algorithm and its incentives should reduce concerns about whether co-rental lists truly reflect movies that consumers like to rent together.

In addition to the initial Amazon Instant Video data on co-rentals, I use a number of other sources to obtain additional variables related to each movie that have been shown to affect movie performance and are relevant for the ensuing analysis. First, I determine whether a movie is a sequel in a franchise, as this is positively correlated with box office revenue (Terry et al., 2004). I also gather data on critic ratings prior to movie release from Metacritic, since critics have an influence through their dual role as predictors and influencers of performance (Basuroy et al. 2003). From the Academy Awards website archive, I note whether a movie won or was nominated for an Oscar in the following six categories: best picture, best director, best actor/actress and best supporting actor/actress, which are found to be associated with higher revenues (Nelson et al. 2001). I obtain each starring actor's popularity from IMDB Pro's StarMeter, which has a yearly ranking (from 2004 to 2013) of the top actors in the industry, based on the number of IMDB user searches for that individual, and I use IMDB Pro's Top 100 Directors rankings to evaluate director popularity. Also from IMDB, I collect the following movie characteristics: starring actors, director, studio, IMDB rating, MPAA rating, genres and US theatrical release date. Additionally, I collect information about factors related to the movie's production and exhibition. From BoxOfficeMojo/The Numbers, I obtain data on a movie's box office revenue, production budget, and number of theaters the movie played in during each week of its exhibition.
Based on the movie’s release date, I also identify whether it was released in the summer/winter or on a holiday weekend, to account for seasonality in the movie industry (Einav 2007). For a subset of movies, those released between 2009 and 2013, I further obtain advertising budgets, broken up by media type and month, from Kantar Media. This variable is essential to my empirical approach that attempts to establish causality in the relationship between a movie’s implicit similarity to other movies and box office performance. 3 Measure of Implicit Similarity 3.1 Motivation As mentioned in Section 1, Amazon Instant Video co-rental lists provide an opportunity to measure implicit similarity between movies based directly on consumers’ rental patterns. The advantage of this method is that it does not rely on third-party, coarse genres to infer similarity between movies, since these classifications may not accurately reflect how individual consumers classify movies. The implicit similarity measure I construct is used to explore the relationship 12 between a movie’s implicit similarity (how similar it is to other movies based on common co-rentals) and its box office performance in Section 4. 3.2 Construction The measure of a movie’s implicit similarity to other movies, ImpSim, is based on the number of common co-rentals a movie has with others. For any two movies, the more common co-rentals they share, the more similar they are to each other. A movie’s final implicit similarity measure is an average of its similarity with every other movie. Details of the calculation are provided below. Background for the construction of the measure The implicit similarity measure is an adaptation of Zuckerman (2004)’s measure of category coherence. In his paper, the aim is to compare the stock market volatility of firms that have coherent identities (i.e. they are part of coherent categories of firms) to that of firms with incoherent identities. To determine a firm’s coherence, Zuckerman develops a measure coined category coherence, based on the analysts that follow the firm: if a firm’s analysts follow a homogeneous group of firms, the focal firm is considered coherent; if its analysts cover a heterogeneous group of disparate firms, the firm is called incoherent. Zuckerman’s findings show that incoherent firms have a more volatile value on the stock market than coherent firms. He argues the result is explained by the idea that it is harder for people to interpret information about a firm with an incoherent identity, because it lacks a clear reference group. Zuckerman (2004)’s paper is part of a stream of research in sociology that studies 13 the relationship between the number of categories a firm spans and its performance. However, a key novelty in his analysis compared to other papers is that he does not rely on established SIC codes as a measure of how clearly a firm belongs to a particular category, since these classifications can sometimes be difficult to make and vague. Instead, he develops a way to measure the extent to which a firm belongs to a coherent category based on the homogeneity of the set of firms its analysts follow, without imposing problematic third-party categories. An insight in the present paper is that there is a similarity between the analysts in Zuckerman’s stock market setting and the Amazon co-rental lists in my context. For Zuckerman, the more homogeneous is the group of firms that a focal firm’s analysts follow, the more the focal firm belongs to a coherent category. 
The analogy in my paper is that the more common co-rentals two movies share, the more those movies belong to a coherent implicit cluster of movies. The implicit similarity measure for movie m is operationalized as follows:

Step 1: For each pair of movies m, k compute:

\[ p_{mk} = \frac{c_{mk}}{n+1}, \]

where p_{mk} is the proximity of movie m to movie k; c_{mk} is the number of co-rentals that movie m and movie k have in common; and n is the number of co-rentals over which the measure is calculated. One is added to n in the denominator because, when counting the number of common co-rentals, the focal movie itself is included to check whether it shows up on the co-rental list of any other movie. The measure used throughout the analysis is calculated over the first five co-rentals (n = 5), as those are the most closely related to the focal movie m and appear on its main Amazon page. As a robustness check, the measure is also computed over the complete list of 20 co-rentals (n = 20); the results of the analysis are qualitatively similar using either computation of the measure.

Step 2: Let N be the total number of movies, and k_i any movie other than m. The implicit similarity measure for movie m, ImpSim_m, is computed as:

\[ ImpSim_m = \frac{\sum_{i=1}^{N-1} p_{mk_i}}{n_c}, \]

where n_c is the number of movies k for which p_{mk} > 0. The measure ranges from 0 to 1; 1 means that movie m has the same co-rentals as each of its co-rentals (complete overlap), while 0 means there is no overlap. I interpret a measure of 1 to suggest that movie m has a high level of implicit similarity to other movies, and a measure of 0 to suggest m has a low level of implicit similarity to other movies.2 Appendix B shows how the measure is constructed through a simple example.

2 Throughout the analysis, the constructed measure is standardized for ease of interpretation due to its skewness, so mean = 0, standard deviation = 1 and the range is no longer [0,1].
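To make the computation concrete, the following is a minimal R sketch of Steps 1 and 2 (R being the language used elsewhere in the paper for the clustering and text analysis). The data structure `corentals` is an assumed placeholder: a named list in which each element holds a movie's top-n co-rental IDs. The handling of the focal movie in the intersection follows one reading of the Step 1 description; in practice the pairwise loop would be restricted to pairs sharing at least one co-rental.

```r
# Assumed layout: corentals[["m"]] is the character vector of movie m's top-n
# co-rented movie IDs (n = 5 in the main analysis).
n <- 5
movies <- names(corentals)

# Step 1: proximity p_mk = c_mk / (n + 1). Each focal movie is added to its own
# list so that its appearance on the other movie's co-rental list is also counted.
proximity <- function(m, k) {
  c_mk <- length(intersect(union(m, corentals[[m]]), union(k, corentals[[k]])))
  c_mk / (n + 1)
}

# Step 2: ImpSim_m averages p_mk over the n_c movies k (k != m) with p_mk > 0.
impsim <- sapply(movies, function(m) {
  p <- sapply(setdiff(movies, m), function(k) proximity(m, k))
  if (all(p == 0)) 0 else sum(p) / sum(p > 0)
})

# Standardized version used in the regressions (mean 0, standard deviation 1).
impsim_std <- as.numeric(scale(impsim))
```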
4 Implicit similarity/performance relationship

This section focuses on determining the relationship between a movie's implicit similarity and its box office performance. As discussed in detail in Section 1, the analysis in this section is motivated by the theoretical question of which movies would perform better: movies that are implicitly similar to others, or movies that are far from others? Implicitly similar movies may perform better because they target niche segments of consumers, ensuring they have an audience, while distant movies could perform better because they combine elements that appeal to multiple segments, with the potential to draw larger audiences. This analysis extends research exploring the relationship between the number of genres a movie spans and its performance (Hsu 2006, Hsu et al. 2009), by focusing on the extent to which a movie belongs to a coherent implicit category with other movies, rather than third-party genre categories. Section 4.1 aims to establish the relationship between a movie's implicit similarity level and its box office revenues. Section 4.2 attempts to disentangle the direction of causality in this relationship, providing evidence that implicit similarity affects box office performance. This part of the analysis is performed on a subset of movies released between 2009 and 2013, for which I have data on all the relevant variables. Most importantly, I was also able to obtain data on advertising spending for this subsample, a variable that is instrumental in the empirical approach I employ to establish causality.

Table 1 presents summary statistics of movie characteristics for both the full sample of 25,227 movies and for the subsample of movies from 2009 to 2013 that is used in the analysis, for comparison. Movies in the full sample span the years 1905 to 2013, with an average release year of 1996, while movies in the subsample are released during the last five years of the full sample, 2009 to 2013, with an average release year of 2011. We can see that movies in the subsample are similar to those in the full sample on key variables like total box office, production budget and range of the implicit similarity measure.

INSERT TABLE 1 HERE

4.1 Establishing the existence of a relationship

This subsection explores the question: are there performance differences between movies that are similar to others, as determined by the implicit similarity measure, and movies that are far from all others? I run an initial regression as specified below to determine the relationship between implicit similarity level and performance, including control variables that have been shown to be relevant to movie performance:

\[ \ln(TotalBoxOffice)_m = \alpha_0 + \alpha_1 ImpSim_m + \alpha_2 \#Genres_m + \alpha_3 Rank_m + \alpha_4 (\#Genres_m \times Rank_m) + \alpha_5 \ln(pre.advertising)_m + \alpha_6 Metacritic_m + \alpha_7 Sequel_m + \alpha_8 \#Stars_m + \alpha_9 TopDir_m + \alpha_{10} MPAA_m + \alpha_{11} Holiday_m + \alpha_{12} HHI.5wks_m + \alpha_{13} Year_m + \epsilon_m, \quad (1) \]

where ln(TotalBoxOffice)_m is movie m's total box office revenue;3 ImpSim_m is the independent variable of interest, which expresses movie m's implicit similarity level; #Genres_m is the number of established genres movie m is assigned; ln(pre.advertising)_m is the advertising spending for movie m prior to its release; Metacritic_m is movie m's average critic rating prior to its release; Sequel_m is an indicator variable for whether or not movie m is a sequel; #Stars_m is a count of movie m's starring actors that were listed in the StarMeter's Top 100 in the year prior to the movie's release; TopDir_m is an indicator variable that shows whether movie m's director is listed in the Top 100 Directors on IMDB; MPAA_m is a vector of indicator variables that shows movie m's MPAA rating (i.e. NR, G, PG, PG13, or R); Holiday_m indicates whether movie m was released on a holiday weekend (Independence Day, President's Day, Labor Day, Memorial Day, Thanksgiving, Christmas, or New Year's Eve); Year_m is a vector of release year indicator variables (2009 - 2013).

3 Box office revenues, production and advertising budgets are normalized using the Consumer Price Index to 2013 million dollars.

The variables HHI.5wks_m, Rank_m and the interaction #Genres_m × Rank_m require more detailed explanations. The variable HHI.5wks_m is designed to capture the average level of competition movie m faces during its first five weekends playing in theaters. This is because by the end of this period, movies have generally made more than 80% of their total box office revenues, as illustrated in Figure 3. The measure is constructed in the following steps: (1) I obtain data from BoxOfficeMojo on the weekly box office revenues of all movies playing in theaters on each weekend in the sample, even if some movies are not in the original dataset because they are not available on Amazon Instant Video. (2) Given the box office revenues of all movies playing on a given weekend, I compute each movie's market share. (3) Based on the computed market shares, I calculate a Herfindahl-Hirschman Index (HHI) to measure the level of competition on each weekend.
(4) For the final HHI.5wks_m variable, the last important step is a modification of the traditional HHI. In computing HHI.5wks_m, I exclude movie m's market share by subtracting it from the computation of the movie's HHI. The reason for this step is that this variable is meant to reflect the level of competition that movie m faces during its first five weeks in theaters. Including movie m's market share in the calculation would not accurately capture the level of competition m is facing from the other movies playing.

INSERT FIGURE 3 HERE

The #Genres variable aims to control for performance differences between movies that have different numbers of genres (ranging from 1 to 5). The inclusion of this variable is motivated by prior research that finds that being listed in multiple genres is associated with lower box office performance (Hsu et al. 2009). Even for movies that have the same number of genres, there may be performance differences that depend not only on the number, but on which particular genres the movie is in. The Rank_m variable seeks to control for such performance differences due to different combinations of genres. The measure is constructed in the following steps: (1) I identify all the possible combinations of genres of movies on Amazon Instant Video. For each combination, I compute the average box office of movies with those genres. (2) Within each number of genres (ranging from 1 to 5), I rank the genre combinations based on their average box office, with the highest grossing combination having a rank of 1. (3) I normalize the rankings for ease of comparison across different numbers of genres. An illustrative example of the construction of this variable is in Appendix A. The interaction variable #Genres_m × Rank_m is meant to capture the possibility that the number of genres can have different relationships with movie performance that vary based on the movie's genre combinations.
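To make this construction concrete, the following is a minimal R sketch of the HHI.5wks control and of specification (1). The data layouts are assumed placeholders: `weekly` has one row per movie-weekend with box office revenue, and `movies_df` has one row per movie with columns named roughly after the variables in the text.

```r
library(dplyr)

# HHI.5wks: weekend-level HHI computed over competitors only (the focal movie's
# squared share is subtracted), averaged over the movie's first five weekends.
hhi_5wks <- weekly %>%
  group_by(weekend) %>%
  mutate(share = revenue / sum(revenue),
         hhi_excl_own = sum(share^2) - share^2) %>%
  ungroup() %>%
  group_by(movie) %>%
  arrange(weekend, .by_group = TRUE) %>%
  slice_head(n = 5) %>%
  summarise(HHI.5wks = mean(hhi_excl_own))

movies_df <- left_join(movies_df, hhi_5wks, by = "movie")

# Specification (1): NGenres * Rank expands to both main effects plus their
# interaction, matching the #Genres, Rank and #Genres x Rank terms.
m1 <- lm(log(TotalBoxOffice) ~ ImpSim + NGenres * Rank + log(pre.advertising) +
           Metacritic + Sequel + NStars + TopDir + MPAA + Holiday +
           HHI.5wks + factor(Year),
         data = movies_df)
summary(m1)
```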
INSERT TABLE 2 HERE

The results of regression (1), shown in Table 2 Column (1), indicate that there is a positive, significant correlation between a movie's implicit similarity and its box office performance. The analysis in this section was motivated by the theoretical question of whether implicitly similar or distant movies would perform better. Findings indicate that on average, movies that are similar to others perform better, which implies that consumers gain a higher utility from formulaic movies that closely align with the preferences of niche audience segments than from movies with disparate elements. Given these results, the question arises: what is the direction of causality of this relationship? Addressing this question is the goal of Section 4.2.

4.2 Disentangling the causality of the relationship

There are three possible explanations for the correlation found in Section 4.1. One option, which is the hypothesized direction, is that a movie's implicit similarity, as revealed through consumers' rental patterns on Amazon, is a feature that affects movie box office performance through the mechanism described above in Section 4.1. However, there are also two possible alternative explanations. First, there could be an unobservable factor, such as movie quality, that affects both implicit similarity and performance. This option is explored in Section 4.2.1. Second, there may be an issue of reverse causality, meaning that box office performance could affect Amazon rental patterns and thus implicit similarity. This possibility is addressed in Section 4.2.2. I attempt to disentangle the direction of causality in the relationship by exploring each of the possible mechanisms, and providing evidence to refute the alternative explanations.4

4 This multistep approach is necessary because in this context it is not possible to test the hypothesized direction of causality directly, in a straightforward way. Doing so would require either an exogenous shock to a movie's implicit similarity, or an instrumental variable that would be correlated with its implicit similarity but not with box office performance. Both of these have proven impossible to find. However, I am still able to provide supportive evidence for the proposed mechanism by employing a multistep approach based on instrumental variable, control function and robustness to omitted variable bias tests, in order to refute each of the possible alternative explanations. To the best of my knowledge, this is the first paper that tackles the question of causality in the relationship between the extent to which a product belongs to a clear category and its performance, thus contributing to the sociology research on categories.

4.2.1 Is there an unobservable factor that affects both a movie's implicit similarity and its box office performance?

This subsection addresses the first alternative explanation, the possibility that an unobservable factor could affect both a movie's implicit similarity and box office performance. A likely candidate is a movie's quality, as a valid concern would be that high-quality movies would both perform better at the box office and be more likely to be rented together on Amazon due to their quality. This would imply that similar movies are not high-performing because they contain common features that fit well with the tastes of niche audience segments. Instead, the reason is simply that high-quality movies are rented together due to their quality, which also affects their performance.

Control function

To address this concern, I employ a control function approach, in which the goal is to extract previously unobservable quality and include it as an explicit independent variable in regression (1), which is reproduced below. If implicit similarity remains significant and the coefficients of the independent variables generally remain stable with the explicit inclusion of the quality variable, this would provide evidence that the relationship between implicit similarity and box office performance is not due to an unobservable like movie quality. Given the regression:

\[ \ln(TotalBoxOffice)_m = \alpha_0 + \alpha_1 ImpSim_m + \alpha_2 \#Genres_m + \alpha_3 Rank_m + \alpha_4 (\#Genres_m \times Rank_m) + \alpha_5 \ln(pre.advertising)_m + \alpha_6 Metacritic_m + \alpha_7 Sequel_m + \alpha_8 \#Stars_m + \alpha_9 TopDir_m + \alpha_{10} MPAA_m + \alpha_{11} Holiday_m + \alpha_{12} HHI.5wks_m + \alpha_{13} Year_m + \epsilon_m, \quad (1) \]

I attempt to extract movie quality from a movie's pre-release advertising spending by employing a control function approach, as explained below. A movie's pre-release advertising budget is finalized once a movie has completed production, so that the studio can first assess the quality of the finished movie. Its advertising can have an effect on the movie's box office revenue through two mechanisms: first, advertising can directly have a positive effect on revenue by making the movie better known to audiences (Elberse and Anand, 2007), and second, it can affect revenue in an indirect way, because the advertising that a studio spends on a particular movie reflects its belief about the movie's quality.
Since movie quality is unobservable, the econometric issue is that \epsilon_m is not independent of ln(pre.advertising)_m. I address this issue through a control function (see e.g. Heckman 1978, Petrin and Train 2009 for details on the implementation). The basic idea of the control function approach is to derive a proxy variable that conditions on the part of pre-release advertising that depends on the error (movie quality). By doing so, the remaining variation in pre-release advertising spending becomes independent of the error, capturing only the direct effect of advertising on movie box office revenue, and previously unobservable quality can be included as an explicit regressor.

The control function is a two-step procedure. The first step involves regressing pre-release advertising spending on an instrumental variable and other covariates, in order to obtain the residuals of the regression (what is referred to as the proxy variable). The idea for the first step is that, having found an appropriate instrument for pre-release advertising, the residuals of the regression would reflect a studio's advertising expenditures for a movie that go beyond what would be expected for a movie with the same observable characteristics. For the choice of instrumental variable for a movie's pre-release advertising, I follow Chintagunta et al. (2010), and instrument pre-release advertising with a movie's production budget. The reason this is an appropriate instrument is that a movie's pre-release advertising is typically set as a fixed proportion of its production budget (Vogel 2009). The implication is that a movie with pre-release advertising spending either above or below this fixed proportion would shed light on the studio's belief about the movie's quality. After running the first step regression and collecting the residuals, the second step of the control function approach involves using both the pre-release advertising spending and the residuals (interpreted as perceived movie quality by the studio) in the original regression (1), as follows:

\[ \ln(TotalBoxOffice)_m = \alpha_0 + \alpha_1 ImpSim_m + \alpha_2 QualityResidual_m + \alpha_3 \#Genres_m + \alpha_4 Rank_m + \alpha_5 (\#Genres_m \times Rank_m) + \alpha_6 \ln(pre.advertising)_m + \alpha_7 Metacritic_m + \alpha_8 Sequel_m + \alpha_9 \#Stars_m + \alpha_{10} TopDir_m + \alpha_{11} MPAA_m + \alpha_{12} Holiday_m + \alpha_{13} HHI.5wks_m + \alpha_{14} Year_m + \epsilon_m \quad (2) \]

The results of the control function method are shown in Table 2 Column (2). We can see that the magnitude of the implicit similarity variable decreases slightly from 0.137 to 0.115, and that the inclusion of the perceived quality variable does not eliminate its significance. Including the quality variable does not change the explanatory power of the model, which may in part be due to the comprehensive baseline control variables (Caliendo et al. 2014).
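A minimal R sketch of this two-step procedure, continuing the assumed `movies_df` layout from Section 4.1 (`budget` stands in for the production budget); in practice the second-step standard errors would be bootstrapped to account for the generated regressor.

```r
# Step 1: regress pre-release advertising on the instrument (production budget)
# and the other covariates; the residual proxies the studio's private quality signal.
step1 <- lm(log(pre.advertising) ~ log(budget) + NGenres * Rank + Metacritic +
              Sequel + NStars + TopDir + MPAA + Holiday + HHI.5wks + factor(Year),
            data = movies_df)
movies_df$QualityResidual <- resid(step1)

# Step 2: specification (2), which adds the residual to the baseline regression.
m2 <- lm(log(TotalBoxOffice) ~ ImpSim + QualityResidual + NGenres * Rank +
           log(pre.advertising) + Metacritic + Sequel + NStars + TopDir +
           MPAA + Holiday + HHI.5wks + factor(Year),
         data = movies_df)
summary(m2)
```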
Test to evaluate robustness of results to omitted variable bias

As an additional robustness test, I check that the relationship between Implicit Similarity and ln(TotalBoxOffice) is not due to omitted-variable bias by applying the procedure developed in Oster (2014). The idea is to ask how important the unobservables would need to be relative to the observables in order to eliminate the estimated effect of the observable variable of interest (see Oster 2014 for details). To determine this, an assumption has to be made about the share of the outcome variance that could be explained by observable and unobservable variables together, referred to as R_max. This depends on the context. The second step involves evaluating the movements in the magnitude of the coefficient of interest, along with movements in R², between two regressions, one that includes only the variable of interest and one that includes the other observables. These values are used to calculate δ, the degree of selection on unobservables relative to observables that would be necessary to explain away the result of the variable of interest.

To implement the method, I first run a baseline specification including only the variable of interest, ImpSim, as follows:

\[ \ln(TotalBoxOffice)_m = \alpha_0 + \alpha_1 ImpSim_m + \epsilon_m. \]

Running this initial regression, I obtain α1 = 0.47 and R² = 0.43. Next, I run a second regression that includes controls for my other observable variables. This regression is the same as regression (1), reproduced below:

\[ \ln(TotalBoxOffice)_m = \alpha_0 + \alpha_1 ImpSim_m + \alpha_2 \#Genres_m + \alpha_3 Rank_m + \alpha_4 (\#Genres_m \times Rank_m) + \alpha_5 \ln(pre.advertising)_m + \alpha_6 Metacritic_m + \alpha_7 Sequel_m + \alpha_8 \#Stars_m + \alpha_9 TopDir_m + \alpha_{10} MPAA_m + \alpha_{11} Holiday_m + \alpha_{12} HHI.5wks_m + \alpha_{13} Year_m + \epsilon_m. \quad (1) \]

From this regression, I obtain α1 = 0.162 and R² = 0.891. Using the recently developed Stata psacalc command with an approximated R_max of 0.95, I obtain the δ (degree of proportionality between unobservables and observables) that would be needed to produce a zero effect of ImpSim (α1 = 0): δ = 4.07. The assumption about R_max in this context is motivated by the fact that the R² of regression (1) is already 0.90, so R_max, the variance that can be explained by observables and unobservables together, can take values between 0.90 and 1. In my case, any choice of R_max gives δ > 2. On average, larger values of δ (δ > 1) indicate more robust coefficient estimates, and with δ > 2 for any R_max, it seems implausible that the effect of ImpSim is driven by omitted variable bias. A value of δ = 2 would mean that the unobservables would need to be twice as important as the observables to eliminate the effect I find.
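As a back-of-the-envelope illustration: the reported δ = 4.07 comes from Stata's psacalc, which implements the full Oster estimator, but the simplified coefficient-stability approximation associated with Oster (2014) can be evaluated directly from the two regressions above and roughly reproduces it.

```r
# Simplified approximation, using the estimates reported in the text.
beta_uncontrolled <- 0.47;  r2_uncontrolled <- 0.43    # ImpSim only
beta_controlled   <- 0.162; r2_controlled   <- 0.891   # full specification (1)
r2_max            <- 0.95

# Degree of selection on unobservables (relative to observables) needed to drive
# the bias-adjusted ImpSim coefficient to zero, under the approximation
# beta* = beta_controlled - delta * (beta_uncontrolled - beta_controlled) *
#         (r2_max - r2_controlled) / (r2_controlled - r2_uncontrolled).
delta <- beta_controlled * (r2_controlled - r2_uncontrolled) /
  ((beta_uncontrolled - beta_controlled) * (r2_max - r2_controlled))
delta   # roughly 4.1, close to the psacalc value of 4.07
```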
4.2.2 Does box office performance affect implicit similarity?

The previous section provides evidence against the possibility that an unobservable factor such as movie quality could affect both a movie's box office performance and its implicit similarity, by employing a control function approach and a test developed by Oster (2014) to evaluate robustness of results to omitted-variable bias. This section turns to addressing the possibility of reverse causality, meaning that box office performance could actually affect implicit similarity, an option that is very plausible in this setting. Given that most movies become available for rental on Amazon Instant Video only after they are out of theaters, it is reasonable to expect that a movie's box office revenues could affect what movies people tend to rent together online. This would imply that high-grossing movies are rented together online because they are high-performing, not necessarily because they are inherently similar to each other.

To explore this possibility, I develop a regression model that has as dependent variable a movie's implicit similarity, and as independent variable of interest a movie's total box office revenue, as follows:

\[ ImpSim_m = \alpha_0 + \alpha_1 \ln(TotalBoxOffice)_m + \alpha_2 \#Genres_m + \alpha_3 Rank_m + \alpha_4 (\#Genres_m \times Rank_m) + \alpha_5 \ln(advertising)_m + \alpha_6 Metacritic_m + \alpha_7 Sequel_m + \alpha_8 \#Stars_m + \alpha_9 TopDir_m + \alpha_{10} MPAA_m + \alpha_{11} Holiday_m + \alpha_{12} Weeks_m + \alpha_{13} Year_m + \epsilon_m \quad (3) \]

Next, I employ an instrumental variable approach, substituting an instrument for the total box office variable in regression (3), in order to explore whether it is causally affecting implicit similarity. In order to be valid, an instrument Z must satisfy two conditions: (1) it must be correlated with the endogenous explanatory variable conditional on the other regressors, corr(Z_i, X_i) ≠ 0, and (2) it cannot be correlated with the error term in the regression, corr(Z_i, u_i) = 0.

An instrument for a movie's total box office revenue that meets these two conditions is the share of potential moviegoers affected by inclement weather. The reasoning for this instrument is that when the weather is bad, it is more likely that people switch from outdoor to indoor activities, which could lead to an increase in theater attendance on those dates. At the same time, there is no reason to expect bad weather during a movie's run in theaters to directly affect a movie's implicit similarity on Amazon Instant Video, other than through its indirect effect on a movie's box office revenue. To construct this instrumental variable, I obtain population data for the thirty largest American cities from the US Census Bureau, and I compute an estimated annual US market size of moviegoers by summing up the potential market sizes for each of the cities. Then, I collect data on all significant weather events taking place in the thirty cities during the periods of time that the movies in my dataset are playing in theaters. I focus specifically on one type of weather event: thunderstorm warnings. I then compute the share of the estimated US market of moviegoers that experience a thunderstorm warning on each weekend that the movies are playing in theaters. As a final step, for each movie, I sum up the weekend shares for the first five weekends that the movie is in theaters. As with the HHI variable, I choose to aggregate this variable over the movie's first five weeks in theaters, as this is the period of time when movies make more than 80% of their total revenue.

Results of the OLS and 2SLS methods are reported in Column (1) and Column (2), respectively, of Table 3. The first-stage F-statistic is 13.27, indicating that the instrument for box office revenue - the share of US moviegoers affected by thunderstorm warnings during a movie's first five weeks in theaters - satisfies the relevance condition. We can see that there are large differences between the results of the OLS and 2SLS methods. For the OLS estimation, the relationship between a movie's box office revenue and its implicit similarity is positive and highly significant; for the instrumental variable method, there is no evidence that box office revenues significantly affect a movie's implicit similarity. This provides evidence against the alternative explanation that the relationship between implicit similarity and performance is due to reverse causality.
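A minimal R sketch of the OLS versus 2SLS comparison, with `ThunderShare5wks` as an assumed placeholder column for the constructed instrument (the summed weekend shares of the potential moviegoer market under thunderstorm warnings):

```r
library(AER)   # provides ivreg()

# OLS version of specification (3).
ols <- lm(ImpSim ~ log(TotalBoxOffice) + NGenres * Rank + log(advertising) +
            Metacritic + Sequel + NStars + TopDir + MPAA + Holiday +
            Weeks + factor(Year), data = movies_df)

# 2SLS: log total box office is instrumented with the thunderstorm-warning share;
# everything after "|" lists the instrument together with all exogenous regressors.
iv <- ivreg(ImpSim ~ log(TotalBoxOffice) + NGenres * Rank + log(advertising) +
              Metacritic + Sequel + NStars + TopDir + MPAA + Holiday +
              Weeks + factor(Year) |
              ThunderShare5wks + NGenres * Rank + log(advertising) +
              Metacritic + Sequel + NStars + TopDir + MPAA + Holiday +
              Weeks + factor(Year), data = movies_df)

summary(iv, diagnostics = TRUE)   # includes the first-stage weak-instrument F-test
```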
INSERT TABLE 3 HERE Taken together, the control function in Section 4.2.1 and the instrumental variable in Section 4.2.2 help alleviate the endogeneity concerns that surround the relationship between implicit similarity and performance, in a setting where it is not possible to 28 disentangle causality through a direct approach. To the best of my knowledge, this is the first paper in the literature studying the relationship between category coherence and performance to attempt to establish the causality of such a relationship. 5 The formation of implicit movie clusters Section 4 investigated the relationship between a movie’s implicit similarity and its box office performance, and attempted to determine the causality of this relationship. Results showed that movies that are similar to others, by having many co-rentals in common with other movies, tend to have better box office revenues than movies that are far from other movies (i.e. those with little co-rental overlap with other movies). Having established this relationship, this section shifts the attention to exploring what movies tend to form implicit clusters together, by having many common co-rentals with each other. The aim of this analysis is to reveal these implicit clusters of movies and find what features they have in common; doing so may expose latent dimensions of similarity that would likely not be captured through observable characteristics. This knowledge could help studios produce movies that better target different audience segments, by observing similarity dimensions based on consumer co-rental patterns that were previously difficult to observe. Currently, movie studio executives decide which movies to green-light (i.e. move from development to production phase) through an uncertain process in which several assigned readers have to agree on the potential of a script (Eliashberg et al. 2007) 29 and how well it aligns with the studio’s goals regarding genres, talent and financial considerations. While studios rely on such methods to make green-light decisions, companies like Amazon Instant Video and Netflix are using their wealth of available user data to gain insights about their customers’ preferences and are increasingly engaging in data-driven original content production to compete with movie studios. For example, when deciding on its production of “House of Cards,” Netflix found that users who liked the original BBC series were also the ones who liked movies directed by David Fincher and those starring Kevin Spacey. The show is critically acclaimed and has won numerous awards (IMDB). Similarly, Amazon has also used data analytics to gain insights from their users on different pilots’ potential for success. Both companies argue that they let the data guide their creative strategies, and that studios rely too much on uncertain studio tastemakers to produce content (Wall Street Journal, 2013). Greater knowledge about consumer preferences and dimensions of implicit similarity, which studios could gain by analyzing insights from websites like Amazon Instant Video, could help them refine their green-lighting process and make movies that better target different audience segments. One important aspect to understand is which movies tend to cluster together based on consumer co-rental patterns. To shed light on this, I employ machine learning techniques for pattern recognition and learning from data, which have recently been introduced for use in the economics literature (Alpaydin 2014, Athey and Imbens 2015). 
Using the walktrap algorithm (Pons and Latapy, 2005),5 I produce implicit clusters of movies based on movies that have many common co-rentals on Amazon Instant Video. For this part of the analysis, I use the entire set of 25,227 movies in the dataset, which leads to 526 different implicit clusters. Approximately 65% of these clusters contain fewer than 10 movies, 35% have between 10 and 1,000 movies, and less than 1% of clusters have more than 1,000 movies. The number of movies per cluster ranges between 1 and 4,797, with an average of 43 movies in a cluster.

5 The walktrap algorithm is implemented through the igraph package in R.

There are 313 movie franchises in my sample, each composed of an original movie with one or multiple sequels. The average franchise has 4.25 movies, with a range from 1 to 13 movies. We would expect that franchises' movies would tend to cluster together, as the formulas of the sequels are based on the original movies. Findings support this view: more than 72% of movie franchises have all their movies in the same implicit cluster, while the other 28% have movies in at most two related clusters.

Given the formation of these implicit clusters of movies, a question arises about what characteristics movies within a given cluster have in common. In order to explore this, I run logit regressions to determine whether observable characteristics such as movie genre, movie/director/actor awards, rating, season/year of release, and studio type are able to explain a movie's cluster membership, or whether potentially latent similarity dimensions between movies may exist. After excluding clusters that have fewer than 10 movies, I run logit regressions of the form:

\[ Cluster_m = \alpha_0 + \alpha_1 MPAA_m + \alpha_2 Season_m + \alpha_3 StudioType_m + \alpha_4 Awards_m + \alpha_5 Genres_m + \alpha_6 Year_m + \epsilon_m, \quad (4) \]

where Cluster_m is the cluster membership of movie m; MPAA_m is a vector of indicator variables that shows movie m's MPAA rating (i.e. NR, G, PG, PG13, or R); Season_m indicates whether movie m was released during summer or winter; StudioType_m indicates whether movie m's studio is a major, conglomerate indie, or true indie; Awards_m is an indicator variable that shows whether movie m received any of the following Oscar accolades: best picture nomination/win, best director nomination/win, best actor nomination/win, best actress nomination/win, best supporting actor nomination/win, best supporting actress nomination/win; Genres_m are genre fixed effects for the following categories: action/adventure, comedy, documentary, drama, fantasy, foreign films, gay/lesbian, horror, kids/family, military/war, musical, mystery/thriller, romance, science fiction, sports or western; Year_m are release year fixed effects.

INSERT FIGURE 4 HERE

Figure 4 shows the distribution of the Pseudo-R² of the logit regressions performed for each of the clusters. We can see that observable characteristics, like a movie's genre, movie/director/actor awards, MPAA rating, studio type, and release year, have only modest power in explaining a movie's cluster membership, with an average Pseudo-R² of 0.31. This suggests that a large part of why movies form a cluster with other movies remains unexplained by observable attributes.
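A minimal R sketch of the clustering and of the cluster-membership logits described above, under assumed data layouts: `pairs` is a data frame of movie pairs with a `shared` count of common co-rentals, and `movies_df` holds the movie-level observables; the full set of sixteen genre indicators is abbreviated to a few columns for readability.

```r
library(igraph)

# Movies are vertices; edge weights are the number of common co-rentals.
g  <- graph_from_data_frame(pairs[pairs$shared > 0, ], directed = FALSE)
wt <- cluster_walktrap(g, weights = E(g)$shared)
movies_df$cluster <- membership(wt)[as.character(movies_df$movie_id)]

# For each cluster with at least 10 movies, a logit of membership on observables,
# as in specification (4).
big_clusters <- names(which(table(movies_df$cluster) >= 10))
fits <- lapply(big_clusters, function(cl) {
  glm(I(cluster == as.integer(cl)) ~ MPAA + Season + StudioType + Awards +
        factor(Year) + Action + Comedy + Documentary + Drama + Horror + Romance,
      family = binomial, data = movies_df)
})

# McFadden pseudo-R2 for each cluster-level logit (the distribution in Figure 4).
pseudo_r2 <- sapply(fits, function(f) 1 - f$deviance / f$null.deviance)
```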
With these findings, the question becomes: why do these clusters of movies form, and what do they have in common? Are there latent dimensions of similarity that movies in each cluster share, which are not detected through observable factors and coarse classifications? Some possible latent characteristics that may not be captured through established categories include a movie's theme, visual and sound effects, pace and storytelling style, or character idiosyncrasies. In the rest of this section, I explore one of these dimensions: the potential existence of common themes that transcend simple genres.

To do so, I collect all the movies' synopses from IMDB, and search for potential commonalities in their themes. Table 4 shows five implicit clusters, the most prevalent genre labels in each cluster by percent of movies assigned to that genre, five examples of movies in each cluster, and the common themes that are revealed when exploring the text of the movies' synopses. For example, a comparison of Clusters 7 and 29 shows that even though the genre label drama is assigned to a majority of movies in both clusters, each cluster has a common theme that its movies share, and the themes of the two clusters are very different. Movies in Cluster 7 explore issues of cultural identity, and the impact of political and class differences on people's daily lives. In contrast, movies in Cluster 29 tell stories about the lives of artists and royals, focusing particularly on their struggles.6 Other examples of shared themes that emerge within clusters are described in Table 4.

6 Text analysis is performed using the text mining and word cloud packages in R.

INSERT TABLE 4 HERE
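A minimal R sketch of the synopsis text mining mentioned in footnote 6, applied to one cluster at a time; `synopses` is an assumed character vector of the IMDB plot summaries for that cluster's movies.

```r
library(tm)
library(wordcloud)

corpus <- VCorpus(VectorSource(synopses))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Term frequencies across the cluster's synopses.
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

head(freq, 25)                                 # most frequent terms in the cluster
wordcloud(names(freq), freq, max.words = 50)   # visual summary of recurring themes
```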
What does this heterogeneity across clusters imply? For example, we can compare two clusters of about the same size: Cluster 25, with 190 movies, and Cluster 29, with 186 movies. Cluster 25 is composed of movies that explore themes of martial arts and underdog success stories. The average implicit similarity measure of its top 10 movies is 0.57. Cluster 29, whose movies are about the lives and struggles of royals, poets and artists, has a higher average measure of 1.29. Since these two clusters are of similar sizes, I can also compute and compare the average similarity measure of their movies calculated over the entire set of movies. The relationship between the two clusters' average similarity measures is qualitatively similar whether computed over the top 10 movies or over the entire set. Computed over the entire set, Cluster 25 still has a lower average (-0.21) than Cluster 29 (-0.07). This implies that movies about the lives of artists and royals are more tightly connected to other movies than movies about martial arts and underdog success stories.

This exercise shows that there is heterogeneity across clusters in the level of implicit similarity of their movies, suggesting that movie type (e.g., the theme a movie explores) can help explain whether a movie is more or less similar to other movies.

INSERT FIGURE 5 HERE

This initial exploration of the characteristics of implicit clusters suggests that genre classifications may be too broad to reflect the more subtle and latent similarity dimensions that movies within clusters share. Even though movies may at first glance be assigned the same genre labels, a deeper look can reveal large differences between movies of the same genre. Exploring the movies' synopses provides an example of the kind of latent information and similarity features that data analytics has the potential to uncover. Moreover, differences in the level of implicit similarity across clusters indicate that some types of movies are more likely to be niche and formulaic, whereas movies from other clusters tend to incorporate more disparate elements. Movie theme, as revealed through the movies making up implicit clusters, is one possible factor that can suggest whether a movie is more or less formulaic. A more in-depth text analysis of the movies' scripts, along with a study of other latent attributes such as narrative style, rhythm, and visual and sound effects, provides an interesting avenue for future research to uncover dimensions of latent similarity between movies and to explore the features that lead movies to be more or less similar to others.

6 Discussion and Conclusion

This paper explores how the proliferation of voluminous amounts of data, “big data,” holds the potential for firms to gain insights that may affect product performance and firm strategy. Specifically, in the US movie industry setting, with data from Amazon Instant Video, the analysis focuses on the information revealed through Amazon's “Customers who rented this movie also rented this...” lists associated with each movie. I present a novel way to determine implicit similarity between movies from users' online rental behavior, based on an adaptation of a measure originally developed by Zuckerman (2004) in the stock market setting.
Rather than relying on third-party classifications like genres, which may not capture how individuals categorize, this implicit similarity approach has the advantage of emerging directly from user rental patterns. Based on this concept of implicit similarity, I show that movies that are similar to others (having many common co-rentals with other movies) have better box office performance than movies that are far from others. This analysis was motivated by the theoretical question of which movies would do better: those that are similar to others, forming implicit clusters of formulaic movies, or those that are far from others, combining disparate elements that could potentially draw larger audiences. Findings show that, on average, formulaic movies tend to perform better, suggesting that audiences prefer niche movies over movies that combine disparate elements.

I attempt to disentangle the direction of causality in the relationship between implicit similarity and performance by employing a combination of control function and instrumental variable techniques, as well as a method to evaluate robustness to omitted variable bias. Although in this setting it is not possible to evaluate the proposed direction of causality directly, these empirical approaches allow me to provide evidence against the two most likely alternative explanations for this relationship: (1) common cause, the idea that an unobservable factor could affect both a movie's implicit similarity and its box office performance, and (2) reverse causality, the possibility that box office performance in fact affects which movies people rent together. By ruling out these alternative mechanisms, my empirical approach lends support to the hypothesized direction: implicit similarity is a feature that affects a movie's box office performance.

Moreover, I show that implicit clusters of movies, composed of movies that have many co-rentals in common, differ from typical classification schemes like genres. Observable characteristics such as genre, movie/director/actor awards, studio type, and MPAA rating have modest power to explain why clusters form, leaving a large part unexplained. These findings raise the question: given the implicit clusters I uncover, what latent features do their movies share? In this paper, I conduct an initial exploration of one possibility, the potential that movies within a cluster share a common theme, a latent dimension that transcends coarse genres. I also show that there exist differences in the level of implicit similarity across clusters, indicating that some types of movies are more likely to be niche and formulaic, whereas others tend to incorporate more disparate elements. Investigating other previously unobservable dimensions, such as storytelling style, pace, visual/sound effects, and character idiosyncrasies, may provide additional insights into the determinants of similarity between movies and the factors that lead them to be more or less similar to other movies. A more in-depth exploration of these factors could thus inform studios about other subtle dimensions of similarity relevant for performance, and provides an interesting avenue for future research.

This paper makes contributions to several streams of literature.
First, it contributes to the literature on the economics and strategy of digitization by exploring the novel insights that data analytics can reveal to firms about consumer preferences, insights that were previously very difficult, if not impossible, to obtain and that have performance and strategy implications. The paper also makes methodological contributions in this area by using machine learning techniques that have recently been adopted in economics. Second, it contributes to the categories research in sociology by (1) inferring implicit categories from consumer co-rental patterns rather than relying on third-party genre classifications, and (2) providing evidence on the causality of the relationship between a movie's implicit similarity level (the extent to which it belongs to a coherent category with other movies) and its performance. To the best of my knowledge, this is the first paper in this literature that tackles the issue of causality. Third, the paper contributes to the literature in marketing that explores the determinants of box office success by examining how a formerly unobservable movie feature, implicit similarity, affects performance. This could lead to future work in this area that investigates what other previously latent characteristics may affect movie performance and how they shape firm strategy.

References

[1] Alpaydin, E. (2014). Introduction to Machine Learning, Third Edition. MIT Press.

[2] Athey, S. and J. Gans. (2010). “The Impact of Targeting Technology on Advertising Markets and Media Competition,” AER Papers and Proceedings, 100(2): 608-613.

[3] Athey, S. and G. Imbens. (2015). “Lectures on Machine Learning.” NBER, Boston, MA. Lecture.

[4] Basuroy, S., S. Chatterjee, and S. A. Ravid. (2003). “How critical are critical reviews? The box office effects of film critics, star power and budgets,” Journal of Marketing, 67(4): 103-117.

[5] Bloom, N., L. Garicano, R. Sadun, and J. Van Reenen. (2009). “The Distinct Effects of Information Technology and Communication Technology on Firm Organization,” NBER Working Paper No. 14975.

[6] Brynjolfsson, E. and L. Hitt. (2000). “Beyond Computation: Information Technology, Organizational Transformation and Business Performance,” Journal of Economic Perspectives, 14(4): 23-48.

[7] Caliendo, M., R. Mahlstedt, and O. A. Mitnik. (2014). “Unobservable, but unimportant? The influence of personality traits (and other usually unobserved variables) for the estimation of treatment effects,” IZA Discussion Paper No. 8337, IZA, Bonn, Germany.

[8] Chintagunta, P., S. Gopinath, and S. Venkataraman. (2010). “The effects of online user reviews on movie box office performance: Accounting for sequential rollout and aggregation across local markets,” Marketing Science, 29(5): 944-957.

[9] Einav, L. (2007). “Seasonality in the US motion picture industry,” RAND Journal of Economics, 38(1): 127-145.

[10] Einav, L., T. Kuchler, J. Levin, and N. Sundaresan. (2011). “Learning from Seller Experiments in Online Markets,” NBER Working Paper No. 17385.

[11] Eliashberg, J., A. Elberse, and M. Leenders. (2006). “The Motion Picture Industry: Critical Issues in Practice, Current Research, and New Research Directions,” Marketing Science, 25(6): 638-661.

[12] Goldfarb, A., S. Greenstein, and C. Tucker. (2015). Economic Analysis of the Digital Economy. Chicago, IL: University of Chicago Press.
[13] Gopinath, S., P. K. Chintagunta, and S. Venkataraman. (2013). “Blogs, advertising and local-market movie box office performance,” Management Science, 59(12): 2635-2654.

[14] Hsu, G. (2006). “Jacks of All Trades and Masters of None: Audiences' Reactions to Spanning Genres in Feature Film Production,” Administrative Science Quarterly, 51(3): 420-450.

[15] Hsu, G., T. Hannan, and O. Kocak. (2009). “Multiple Category Membership in Markets: An Integrative Theory and Two Empirical Tests,” American Sociological Review, 74(1): 150-169.

[16] Leonard, A. “How Netflix is turning viewers into puppets.” Salon, 1 February 2013. Accessed 12 August 2015. Retrieved from: http://www.salon.com/2013/02/01/how_netflix_is_turning_viewers_into_puppets/

[17] Linden, G., B. Smith, and J. York. (2003). “Amazon.com Recommendations: Item-to-Item Collaborative Filtering,” Industry Report, IEEE Computer Society, 76-80.

[18] Liu, Y. (2006). “Word of Mouth for Movies: Its Dynamics and Impact on Box Office Revenue,” Journal of Marketing, 70(3): 74-89.

[19] McElheran, K. (2015). “Do Market Leaders Lead in Business Process Innovation? The Case(s) of E-Business Adoption,” Management Science, 61(6): 1197-1216.

[20] Nelson, R., M. Donihue, D. Waldman, and C. Wheaton. (2001). “What's an Oscar worth?” Economic Inquiry, 39(1): 1-16.

[21] Oestreicher-Singer, G. and A. Sundararajan. (2012). “The Visible Hand? Demand Effects of Recommendation Networks in Electronic Markets,” Management Science, 58(11): 1963-1981.

[22] Oster, E. (2014). “Unobservable Selection and Coefficient Stability: Theory and Validation,” Working paper, University of Chicago.

[23] Pons, P. and M. Latapy. (2006). “Computing Communities in Large Networks Using Random Walks,” Journal of Graph Algorithms and Applications, 10(2): 284-293.

[24] Ravid, S. A. (1999). “Information, Blockbusters, and Stars: A Study of the Film Industry,” The Journal of Business, 72(4): 463-492.

[25] Saunders, A. and P. Tambe. (2012). “The Value of Data: Evidence from a Textual Analysis of 10-Ks,” Working Paper, January 2012.

[26] Sharma, A. “Amazon Mines Its Data Trove to Bet on TV's Next Hit.” The Wall Street Journal, 1 November 2013. Accessed 15 August 2015. Retrieved from: http://www.wsj.com/articles/SB10001424052702304200804579163861637839706/

[27] Terry, N., M. Butler, and D. De’Armond. (2006). “The Economic Impact of Movie Critics on Box Office Performance,” Academy of Marketing Studies Journal, 8(1): 61-73.

[28] Vogel, H. (2009). Entertainment Industry Economics: A Guide for Financial Analysis, 5th ed. Cambridge, UK: Cambridge University Press.

[29] Zuckerman, E. (1999). “The Categorical Imperative: Securities Analysts and the Illegitimacy Discount,” American Journal of Sociology, 104(5): 1398-1438.

[30] Zuckerman, E. (2004). “Structural Incoherence and Stock Market Activity,” American Sociological Review, 69(3): 405-432.

Appendix A: Rank Variable Construction Example

The table below shows an illustrative example of the Rank variable construction.

Table A1: Example of Rank variable construction

Number of Genres   Genres     Box Office ($M)   Rank
1                  A          10                1
1                  B          7                 2
1                  C          4                 3
2                  A, B       12                1
2                  A, C       8                 2
2                  B, C       5                 3
3                  A, B, C    11                1

Appendix B: Implicit Similarity Construction Example

Suppose movies A and B have 5 mutual co-rentals, and movies A and C have 4 mutual co-rentals. Movie A does not have any other co-rentals in common with other movies. Calculate movie A's implicit similarity measure.

Step 1. For each of movie A's pairs, (A, B) and (A, C), calculate p_{AB} and p_{AC}.
\[
p_{AB} = \frac{5}{6}, \qquad p_{AC} = \frac{4}{6} = \frac{2}{3}
\]

Step 2.

\[
ImpSim_A = \frac{p_{AB} + p_{AC}}{2} = \frac{\frac{5}{6} + \frac{2}{3}}{2} = \frac{9}{12} = \frac{3}{4}
\]

Figure 1: Methodological tools used in the paper
Notes: The diagram shows a roadmap of the various methodological tools employed in the analysis of each section.

Figure 2: Amazon Instant Video sample movie page
Notes: The image shows a sample Amazon main page for the movie “Angels & Demons.” It includes the starring actors, runtime, release year, MPAA rating, Amazon stars, IMDB score, synopsis, and the movie's top 5 co-rentals.

Figure 3: Percent of Total Box Office Revenues, by week
[Line graph; x-axis: weekend in theaters (0-50); y-axis: cumulative percent of total box office revenues (0-1).]
Notes: The figure shows the average cumulative percent of total box office gross each week for movies in the 2009-2013 subsample of movies. On average, movies in this sample have accumulated more than 80% of their total box office grosses by their fifth weekend in theaters.

Figure 4: Explanatory power of observable characteristics for cluster membership
[Histogram; x-axis: Pseudo-R2 (0-1); y-axis: percent of clusters.]
Notes: I run logit regressions of cluster membership for each cluster containing at least 10 movies, to determine whether observable characteristics can explain whether a movie belongs to a particular cluster. This figure is a histogram of the Pseudo-R2s of those regressions, indicating the percent of clusters with each Pseudo-R2.

Figure 5: Average implicit similarity by cluster
Notes: The graph shows the average implicit measure of the top 10 movies with the highest implicit similarity in each cluster, for clusters with at least 50 movies. The horizontal axis shows the standardized average similarity measure of each cluster and the vertical axis shows the Cluster ID. Each Cluster ID is also labeled on the graph.

Table 1: Summary Statistics of Key Variables

                            Obs.              Mean             Std. Dev.        Min              Max
Variable Name               Full      Sub     Full     Sub     Full    Sub      Full     Sub     Full     Sub

Main Dep/Indep
ln(TotalBoxOffice)          6,962     843     14.74    14.45   3.45    3.08     4.09     4.27    20.45    20.08
Implicit Similarity         22,548    859     0        0       1       1        -0.97    -0.65   11.31    9.23

Production and Distribution
ln(Advertising)             800       800     14.54    14.54   2.81    2.81     6.70     6.70    17.97    17.97
ln(ProductionBudget)        2,698     702     16.59    16.57   1.52    1.59     8.70     9.55    19.93    19.87
Release Year                18,113    859     1996     2011    20.44   1.34     1905     2009    2013     2013
Number of Genres            25,227    843     1.67     2.06    0.84    0.94     1        1       5        5
Sequel                      25,227    859     0.02     0.06    0.15    0.25     0        0       1        1

Genre Composition
Action-Adventure            25,227    859     0.20     0.28    0.40    0.45     0        0       1        1
Comedy                      25,227    859     0.28     0.37    0.45    0.48     0        0       1        1
Documentary                 25,227    859     0.08     0.06    0.27    0.25     0        0       1        1
Drama                       25,227    859     0.45     0.55    0.50    0.49     0        0       1        1
Fantasy                     25,227    859     0.02     0.04    0.13    0.19     0        0       1        1
Foreign Films               25,227    859     0.04     0.05    0.21    0.22     0        0       1        1
Horror                      25,227    859     0.16     0.11    0.36    0.32     0        0       1        1
Kids-Family                 25,227    859     0.01     0       0.09    0        0        0       1        0
Military/War                25,227    859     0.01     0.01    0.09    0.09     0        0       1        1
Musicals                    25,227    859     0.02     0.01    0.14    0.12     0        0       1        1
Mystery-Thrillers           25,227    859     0.15     0.22    0.36    0.41     0        0       1        1
Romance                     25,227    859     0.12     0.18    0.32    0.39     0        0       1        1
Science Fiction             25,227    859     0.10     0.15    0.29    0.46     0        0       1        1
Westerns                    25,227    859     0.02     0.01    0.15    0.08     0        0       1        1

Notes: The table shows summary statistics of key variables for both the full Amazon sample of 25,227 movies (Full) and the subsample used in the analysis in Section 3 (Sub) for comparison.
Table 2: Relationship between implicit similarity and box office performance

                            DV = ln(TotalBoxOffice)
Variable                    (1)                  (2)
Implicit Similarity         0.137***  (0.044)    0.115**   (0.045)
QualityResidual                                  -0.045    (0.059)
#Genres                     0.031     (0.094)    -0.042    (0.101)
Rank                        -0.099*   (0.052)    -0.128**  (0.060)
#Genres X Rank              0.016     (0.022)    0.023     (0.024)
ln(Pre-Advertising)         1.054***  (0.026)    1.060***  (0.053)
Metacritic                  0.025***  (0.002)    0.025***  (0.003)
Sequel                      0.590***  (0.115)    0.525***  (0.118)
#Stars                      0.145**   (0.063)    0.090     (0.071)
TopDirector                 0.086     (0.098)    -0.039    (0.100)
Holiday                     -0.392    (0.329)    -0.0754   (0.262)
HHI.5wks                    0.628     (0.608)    0.274     (0.755)
Rated NR                    0.804***  (0.225)    0.701**   (0.290)
Rated G                     -0.145    (0.435)    -0.598**  (0.254)
Rated PG13                  -0.139    (0.124)    -0.118    (0.131)
Rated R                     -0.496*** (0.126)    -0.453*** (0.143)
Observations                776                  664
R-squared                   0.906                0.904

Notes: Regressions include year fixed effects. Robust standard errors are in parentheses. *** p<0.01, ** p<0.05, * p<0.1.

Table 3: Instrumental variable method results

                            DV = Implicit Similarity
Variable                    (1)                  (2)
ln(TotalBoxOffice)          0.184***  (0.062)    -0.014    (0.285)
#Genres                     -0.045    (0.115)    -0.017    (0.123)
Rank                        -0.061    (0.053)    -0.072    (0.057)
#Genres X Rank              0.018     (0.023)    0.018     (0.022)
ln(Total-Advertising)       -0.128*   (0.068)    0.072     (0.287)
Metacritic                  -0.001**  (0.003)    -0.006**  (0.003)
Sequel                      0.596*    (0.321)    0.704*    (0.398)
#Stars                      -0.074    (0.062)    -0.052    (0.066)
TopDirector                 -0.166    (0.113)    -0.159    (0.107)
Holiday                     -0.152    (0.132)    -0.139    (0.127)
Weeks                       -0.012    (0.009)    0.005     (0.025)
Rated NR                    -0.130    (0.213)    0.103     (0.390)
Rated G                     0.600*    (0.321)    0.641**   (0.267)
Rated PG13                  -0.054    (0.128)    -0.086    (0.150)
Rated R                     0.047     (0.139)    -0.016    (0.172)
Number of instruments                            1
First-stage F-statistic                          13.27
Observations                767                  767
R-squared                   0.152                0.122

Notes: Column (1) is an OLS regression, Column (2) is 2SLS. Both specifications include year fixed effects. Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1. The instrument for ln(TotalBoxOffice) is the share of US moviegoers affected by thunderstorm warnings during a movie's first five weekends in theaters.

Table 4: Examples of Implicit Clusters

Cluster 7 (128 movies)
Genres: Drama (63.28%), Comedy (43.75%), Foreign Films (37.50%), Romance (17.96%), Mystery (10.15%)
Movie Examples: The Class; City of Life; The Intouchables; The Women on the 6th Floor; 4 Months, 3 Weeks and 2 Days
Common Themes: Movies exploring issues of cultural identity, political and class differences, and their impact on daily life.

Cluster 17 (129 movies)
Genres: Comedy (70.54%), Drama (45.75%), Romance (30.23%), Science Fiction (7.75%), Mystery (4.65%)
Movie Examples: Odd Couple; Grumpy Old Men; The Breakfast Club; Brain Donors; Under the Boardwalk
Common Themes: Movies with humor that is based on the outcomes of unlikely friendships and relationships.

Cluster 25 (190 movies)
Genres: Action (85.78%), Foreign Films (35.78%), Drama (28.95%), Comedy (13.16%), Science Fiction (7.89%)
Movie Examples: Ip Man series; Kingdom of War; Mulan: Rise of a Warrior; Mirageman; Crouching Tiger, Hidden Dragon
Common Themes: Movies about martial arts, underdog success stories, and themes about powerful female heroines.

Cluster 29 (186 movies)
Genres: Drama (81.72%), Romance (32.25%), Comedy (21.50%), Documentary (6.99%), Mystery (5.91%)
Movie Examples: Elizabeth; They Came to Play; The Red Violin; The Madness of King George; The Eyes of Van Gogh
Common Themes: Movies about the lives, and in particular the struggles, of royals, artists and poets.

Cluster 60 (302 movies)
Genres: Comedy (58.61%), Action (26.15%), Drama (22.19%), Science Fiction (21.19%), Kids/Family (12.58%)
Movie Examples: Back to the Future; Mrs. Doubtfire; Honey, I Shrunk the Kids; The Nutty Professor; Ghostbusters
Common Themes: Movies oriented toward families, based on clever, innovative, and sometimes science fiction ideas.