The Effect of Calorie Posting Regulation on Consumer Opinion: A Flexible Latent Dirichlet Allocation Model with Informative Priors

Dinesh Puranam, Vishal Narayan and Vrinda Kadiyali*

May 2014

Abstract

In 2008, New York City mandated that all multi-unit restaurants post calorie information on their menus. For managers of multi-unit and stand-alone restaurants, and for policy makers, a pertinent goal might be to monitor the impact of this regulation on consumer conversations. We propose a scalable Bayesian topic model to measure and understand changes in consumer opinion about health (and other topics). We calibrate the model on 761,962 online reviews of restaurants posted over 8 years. Our methodological contribution is to generalize topic extraction approaches in marketing and computer science. Specifically, our model does the following: a) each word can probabilistically belong to multiple topics (e.g. “fries” could belong to the topics “taste” as well as “health”); b) managers can specify prior topics of interest, such as “health” for a calorie posting regulation; and c) review lengths can affect distributions of topic proportions, so that longer reviews might include more topics. Through careful controls, we isolate the potentially causal effect of the regulation on consumer opinion. Following the regulation, there was a small but significant increase in the discussion of health topics. Health discussion remains restricted to a small segment of consumers.

* Dinesh Puranam is a Marketing Ph.D. student at Johnson, Cornell University. This paper is part of his doctoral thesis. Vishal Narayan is an assistant professor of Marketing at NUS, Singapore. Vrinda Kadiyali is the Nicholas H. Noyes Professor of Marketing and Economics at Johnson, Cornell University. They can be reached at [email protected], [email protected] and [email protected] respectively.

Section 1.
Introduction

In the face of rising obesity, Mayor Bloomberg of New York City pushed a regulation in 2008 that required chain restaurants (those with 15 or more units nationwide) to display calories for every item on all menu boards and menus in a font that was at least as prominent as price. Two years later, the Affordable Health Care Act of 2010 mandated that restaurants with multiple locations prominently display calories for every item on all menus. The desired impact of both these laws was that posting calorific information would make health more salient in the minds of consumers eating out, and make it easier for them to choose healthier foods. However, implementing this national regulation “has gotten extremely thorny” in the words of the Commissioner of the Food and Drug Administration. “There are very, very strong opinions and powerful voices both on the consumer and public health side and on the industry side, and we have worked very hard to figure out what really makes sense,” she said (Jalonick 2013). Past research has shown us that regulations pertaining to health claims on food labels affect consumer search and consumer behavior in various ways (Roe, Levy and Derby 1999, Bollinger, Leslie and Sorenson 2011, and Downs et al. 2013). Unlike these papers, we focus our research on consumers’ post-consumption opinions of the product. Our data are 761,962 reviews of 9,805 restaurants in New York City, posted on a leading restaurant review site in an 8-year period from the website’s inception in October 2004 to December 2012. We are unaware of studies which estimate the impact of regulation changes on consumer opinion or word of mouth. We propose an automated and scalable probabilistic model that summarizes this large volume of free, unsolicited, rich user-generated text into a few interpretable topics.
These topics can offer managerial and policy insights into how consumer opinion or the “voice of the consumer” (Griffin and Hauser 1993, Lee and Bradlow 2011) changed due to the implementation of a calorie posting regulation in New York City. Traditional approaches to measure the effects of regulations, such as surveys and focus groups, might be expensive, time consuming, and potentially subject to recall biases and demand effects (Netzer et al. 2012). Unlike such approaches which rely on primary data collected over a short period of time, our data are available over several years. Therefore, our approach is especially useful for studying the impact of temporally distant events (such as past regulations) by comparing periods before and after such events.1 Based on our data and methods, we pose and answer the following empirical questions that are managerially and policy relevant. a) What were the major topics or attributes about chain restaurants that consumers discussed in online reviews before and after the mandatory calorie posting regulation was enforced? b) Was health a topic of discussion before and after the regulation? What proportion of discussion was on health? Which topics were discussed to a larger extent relative to health? c) How widespread was the discussion of health in online reviews before and after the regulation? Very widespread discussion would be represented in the data by health being discussed to the same extent in most reviews. Less widespread discussion of health would be implied if it is discussed to a large extent but only in a small number of reviews. The following managerial and regulatory insights can be obtained from our analysis. First, should there be an increase post-regulation in health as a topic of discussion by a large set of consumers, this can be seen as a measure of the regulation’s success in making health more salient in the minds and voices of consumers.
Second, textual content posted in online consumer reviews affects subsequent demand (Archak, Ghose and Ipeirotis 2011, Ghose, Ipeirotis and Li 2012). That is, we might expect greater discussion of health across a very large number of online reviews to be accompanied by greater consumption of healthier foods. Third, changes in patterns of consumer opinion can provide continuous, timely and free inputs into more traditional forms of marketing research. Increased discussion of health in online reviews can serve as a basis for commissioning more costly investigations into changes in consumer buying behavior, e.g. the patterns of substitution away from less healthy options. Fourth, how widespread health mentions are (i.e. variance in how deeply reviews discuss health as a topic, conditional on the mean level of health topics across reviews) can provide insights into consumer segments. Such information can also serve as the basis of studies aimed at identifying individuals who might influence restaurant choices of the population, and at ascertaining the demographic correlates of those individuals who are most vocal about health. Note that there is also widespread media discussion of health, and in particular of this regulation; we confine ourselves to the analysis of consumers’ post-consumption data for the reasons outlined above.

1 It is plausible that despite large sample sizes, user-generated textual content might also suffer from biases; indeed, no market research technique is perfect. As such we do not propose to replace traditional approaches, but instead to augment them with data that are available for free, in larger quantities, and over longer periods of time.

We now briefly discuss our model and research design. We use a rigorous Bayesian framework to summarize a large collection of reviews into a few representative latent “topics” (e.g. price, service, menu item, cuisine). We characterize topics by a probability distribution over all words in reviews.
For each word in a review, a topic is chosen, and conditional on the choice of topic, a word is chosen. This process continues until the review adequately represents the topics of interest of the writer. So each review is a random mixture of several topics (e.g. a restaurant review could be simplistically represented as 20% price, 20% service, and 60% Mexican). This process represents a probabilistic interpretation of the data generation process for the observed reviews. We use state-of-the-art tools to select topics that are coherent for ease of managerial and policy interpretability. Estimation challenges arise because a) we do not observe the topics, but infer them from the data, b) the same word could belong to different topics, necessitating a flexible modeling approach, and c) the large scale of the data (761,962 reviews) necessitates scalable estimation techniques. Since the scope of the regulation was limited to chain restaurants, we analyze data from chain and standalone restaurants separately, such that standalone restaurants serve as a useful contrast and as a natural control group.2 To isolate the causal effects of regulations on consumer opinions in chain (and standalone) restaurants, we control for short term differences in characteristics between chain and standalone restaurants (via interactions of chain- and time-specific dummies) and for geographical differences in topic proportions (via zip code dummies). To further test for robustness of causal inference, we conduct a variant of a regression discontinuity analysis (Thistlethwaite and Campbell 1960, Hartmann, Nair and Narayanan 2011): we constrain the time period of analysis to a few months before and after the implementation of the regulation to minimize the effect of potential time-varying confounds. In sum, we combine state-of-the-art methods from computer science with experimental design methods to draw causal inferences from textual data.
Situating our work in the existing research literature, our paper is somewhat related to the literature on the economic impact of numeric characteristics of online reviews (Godes and Silva 2013; Godes and Mayzlin 2004).

2 We refer to restaurants with fewer than 15 units nationwide as “standalone” (as opposed to chain) for ease of understanding. As mentioned, such restaurants were outside the scope of the regulation.

However, there has been relatively less research on extracting useful information from large masses of review text. Decker and Trusov (2010) use text mining to estimate the relative effect of product attributes and brand names on product evaluation. Ghose et al. (2012) combine text mining with crowdsourcing methods to estimate hotel demand. Methodologically, our work is closer to three papers. Archak et al. (2011) use a part-of-speech tagger to identify frequently mentioned nouns and noun phrases in reviews. They then cluster nouns that appear in similar “contexts” (windows of four words around the noun). The resulting set of clusters corresponds to product “attributes” or “topics”. Lee and Bradlow (2011) automatically extract details from each review in terms of phrases. Each phrase is then rendered into a word vector which records the frequency with which a word appears in the corresponding phrase. Phrases are clustered together according to their similarity, measured as the distance between the word vectors. Clustering is achieved using a K-means algorithm. Netzer et al. (2012) use a similar approach, with the difference that they define similarity between products based on their co-mention in the data. Our model generalizes previous approaches in marketing for extracting topics in three important ways. First, these approaches deterministically allocate each word (or phrase of words) into one cluster, so that each topic (or cluster) is represented by a set of unique phrases.
However, it is plausible that words could denote multiple topics or product attributes. For example, the word “fries” could be associated with the attribute “taste” (e.g. “fries are tasty”) and also with the attribute “health” (e.g. “fries are unhealthy”). In our approach, each word of the vocabulary is probabilistically assigned to each topic. So a word can represent several topics, each with a specific probability. Each topic is represented by a probability distribution over all words in the vocabulary. This is appealing both statistically and conceptually. Statistically, we demonstrate that a less deterministic approach of clustering words leads to better out-of-sample model fit. Conceptually, it seems unlikely that consumers think of a word or a phrase as being associated with a single topic. Second, our substantive research objective dictates a different modeling approach. Managers or policy makers might have informed priors about how consumer opinions might change due to specific events. For example, the enforcement of a new sales tax law might alter the level of discussion of “price” as a topic, when consumers review the focal product online. On the other hand, a new regulation pertaining to minimum wages in the retailing industry might alter consumer opinion about service (if wage increases lead to service improvements), or price (if wage increases translate to price increases). In the papers discussed above, words or phrases are allocated to topics, and then topics are interpreted by the researcher. In contrast, our approach allows the analyst to pre-specify constructs or topics of interest, and to then track changes in consumer opinion as it pertains to those topics. This is achieved by specifying an informative prior distribution of topics over the words in the vocabulary. This enables us to parsimoniously integrate managerial intuition and interest with information contained in thousands of reviews.
Combining managerial intuition with statistical modeling has a long tradition in marketing and psychology, and has even been shown to improve model fit compared with purely statistical modeling (Blattberg and Hoch 1990, Yaniv and Hogarth 1993, Wierenga 2006). Although improving model fit is not the most important objective of this research, we find that “seeding” specific topics does not lead to lower fit than an “unseeded” version of the model, where the prior distribution of topics over words is diffuse. Third, a standard assumption in the marketing and computer science literatures (Blei, Ng and Jordan 2003, Lee and Bradlow 2011) is that the distribution of topics within a document is independent of the writer’s decision of how many words to write. However, it seems intuitive that the length of a document affects the distribution of topics. Writers who want to communicate about several topics are likely to use more words. Shorter documents might be focused on a few topics leading to sparser topic distributions. We extend the computer science literature on topic models by allowing the within-document topic distributions to vary with the length of the document; we find empirical support for this. Other than improved model performance, allowing topic distributions to vary by the length of the review has substantive implications. Authors of reviews which are focused solely on health are more likely to lead the consumer opinion on health, and be more important than the general reviewing population for targeting. To the extent that shorter reviews are more likely to discuss one or very few topics, the length of a review might be an important summary statistic of user-generated content to consider in identifying such reviews and reviewers. 
From a substantive standpoint, this is the first attempt in the marketing literature to investigate the level of discussion of a combination of pre-defined topics of greater managerial interest, and topics which are entirely data driven. Further, we complement recent research on temporal dynamics in ratings of online reviews (Godes and Silva 2013, Li and Hitt 2008), by inferring how levels of discussion of various topics vary over time due to an exogenous event. We draw inferences not just from the mean level of discussion of each topic across reviews, but also from the variance of this distribution. We are not aware of prior work that does this. Our work also complements the academic research on the effect of the calorie posting regulation on consumer behavior (Bollinger et al. 2010, Downs et al. 2013). Such research provides insights from survey and transactional data from a single chain of restaurants (e.g. Starbucks for Bollinger et al. 2010). Our data are from 9,805 restaurants including 78 unique chains, and we focus on post-consumption opinions. As a preview into our main findings, we find that topics associated with American staple fast-foods such as burgers, fries, sandwiches and steaks get discussed to a far greater extent than health. The mean level of the discussion of the health topic increased due to the regulation. Health is discussed only in a small proportion of reviews (less than 7%). This proportion increased for chain restaurants, suggesting that the regulation served to maintain the salience of health among a small segment of health-conscious consumers. Given the overall trends of increasing obesity, even small post-regulation increases in health mentions in restaurant reviews might be worth celebrating, and have potential for significant long-term implications (see section 4 for details). Next we discuss the model specification. Section 3 presents the data, and discusses specific estimation challenges.
Section 4 presents the results from the model, and their implications. Section 5 compares out-of-sample model performance with models which do not incorporate the unique features discussed above. It also demonstrates scalability in terms of estimation time. Section 6 concludes.

Section 2. The Model

2.1 Model Specification

Our model belongs to a class of probabilistic topic models termed Latent Dirichlet Allocation (LDA) models, which have been developed by computer science (specifically machine learning) researchers to analyze words in large sets of original texts in order to discover the themes or topics in them. These models do not require any prior annotation or labeling of documents (documents are online reviews in our application). Such modeling enables us to summarize textual documents at a scale that would be impossible by human annotation (Blei 2012). We first describe the key conceptual ideas behind the LDA model and then discuss the statistical specification. We start with the basic intuition that documents include multiple topics. Consider the consumer review of a restaurant in Figure 1. We can see that the writer discusses the menu items of Mexican cuisine that she orders (burritos, enchiladas), discusses service in some detail (waiting, timeliness), and also issues pertaining to the healthiness of her order (calories, healthiness and the effect on her weight). The LDA model tries to statistically capture this intuition. We describe it first in the form of a generative process. This is an imaginary random process by which the model assumes the textual data in each document was generated. In terms of basic notation, a corpus is a concatenation of all documents in the dataset, e.g. a set of product reviews. A vocabulary is the set of all unique words across all documents (e.g. all unique words in all reviews posted on a website). A topic is a probability distribution over all words in the vocabulary.
The word “calories” would be associated with the “health” topic with a high probability, but also with all other topics with a non-zero probability. The word “dessert” might be associated with high probability with both the “health” topic and the “taste” topic.3 Each of the three topics in Figure 1 lists some of the words corresponding to the topic with relatively high probability. We assume that topics are specified before any data have been generated. The first step in the creation of a document is for the writer to randomly choose a distribution over topics (25%, 50% and 25% of topics 1, 2 and 3 in Figure 1). We allow this distribution to vary across documents, i.e. each document can exhibit topics in a different proportion. In the next step, the writer chooses a word. For each word in the document, the writer first randomly chooses a topic based on the distribution above. She then randomly chooses a word from the corresponding distribution of the chosen topic over all words in the vocabulary. For example, a draw from the distribution of the three topics might result in the choice of topic 1. Next, a draw from the distribution of topic 1 over all words in the vocabulary might lead to the choice of “burrito”. This process of choosing words is repeated until the writer finishes writing the document. The next document is written similarly, but with a potentially different distribution over topics (say 0%, 89% and 11% of topics 1, 2 and 3).

3 The topic “health” could include words drawn from both positive (“healthy”) and negative valence (“not at all healthy”) contexts. Current approaches in computer science for jointly modeling valence and topic are associated with errors in topic measurement, which could potentially lead to incorrect inferences of the effects of interest. So we do not model valence. We do, however, discuss the robustness of our results to accounting for the overall valence of the review in the analysis in section 4.
====Insert Figure 1 Here====

As mentioned, the researcher does not observe the topics, the distribution of words for each topic, the distribution of topics for each document, or the choice of topic that led to a specific choice of a word. The central computational problem is to use the observed documents to infer these distributions, i.e. to uncover the hidden topic structure that generated the observed set of documents. The process described above defines a joint probability distribution over both the observed and hidden random variables. We use this joint distribution to compute the conditional (or posterior) distribution of the hidden variables given the observed documents and words. We now describe the model more formally. Each document $d$ ($d=1,\ldots,D$) is composed of $n_d$ words. The total number of word instances in the corpus is $N$ (e.g. for a corpus of 100,000 documents with 100 words each, $N$ is 10 million). The corpus is defined by an $N$-dimensional vector $\mathbf{w} = (w_{11}, w_{12}, \ldots, w_{di}, \ldots, w_{Dn_D})$, where $w_{di}$ is the $i$th word of document $d$. The vocabulary is defined by $V$ unique words; each unique word is denoted by $v$. The distribution of topic $k$ ($k=1,\ldots,K$) over the vocabulary is denoted by a $V$-dimensional vector $\phi_k$. Element $\phi_{kv}$ denotes the probability of word $v$ belonging to topic $k$. In Figure 1, $\phi_{kv}$ for the word “burrito” belonging to topic 1 is 0.13. Document $d$ is a mixture of the $K$ topics. $\theta_d$ is a $K$-dimensional vector that represents the proportions of each topic in document $d$. So for a 3-topic model, the review in Figure 1 might be summarized as $\theta_d = [0.25\ 0.50\ 0.25]$. Extant approaches in marketing such as cluster analysis deterministically allocate words to topics, so that $\phi_{kv}$ can take only the values 0 or 1. The first step in the process for generating $w_{di}$ is to draw $\theta_d$ from a Dirichlet distribution, i.e. $\theta_d \sim \mathrm{Dirichlet}(\alpha_d)$. The second step is to draw $z_{di}$, the topic assignment for the $i$th word in document $d$. This is drawn from a categorical distribution with parameter $\theta_d$, i.e.
$z_{di} \mid \theta_d \sim \mathrm{Categorical}(\theta_d)$. This is a particularly convenient choice of distributions, as the Dirichlet distribution is conjugate to the categorical distribution, i.e. the posterior distribution of $\theta_d$ is also Dirichlet. For example, in Figure 1, a draw from a categorical distribution with parameter $\theta_d = [0.25\ 0.50\ 0.25]$ might lead to the choice of the second topic, i.e. $z_{di} = 2$. $z_{di}$ is an element of the $N$-dimensional vector $\mathbf{z}$, which represents the latent variable indicating the topic assignment of each word in the corpus $\mathbf{w}$. Given the choice of topic $z_{di}$, the word $w_{di}$ is drawn from the categorical distribution with parameter $\phi_{z_{di}}$, i.e. $w_{di} \mid z_{di}, \phi_{z_{di}} \sim \mathrm{Categorical}(\phi_{z_{di}})$. To exploit conjugacy, the distribution of $\phi_k$ is also specified as Dirichlet, i.e. $\phi_k \sim \mathrm{Dirichlet}(\beta)$. Continuing with the earlier example, given the choice of topic $k=2$, a draw from the categorical distribution over all $V$ words with parameter $\phi_2$ might lead to the choice of the word “obese”. This process is repeated for each word in document $d$. In summary, LDA allocates words from documents into latent topics. The distributions of topics across a document, and of words over topics, are both Dirichlet, hence the term “Latent Dirichlet Allocation”. This set-up leads to the formation of topics based on the co-occurrence of words across documents. Words which co-occur more frequently are more likely to be assigned to the same topic. The model described so far assumes that the distribution of topics within a document is independent of its length. A parsimonious way to relax this assumption is to focus on $\theta_d$, the vector that represents the proportions of each topic in document $d$. The standard assumption in the literature is that $\theta_d$ is drawn from a Dirichlet distribution with parameter $\alpha_d$ which is invariant across reviews, i.e. $\alpha_d = \alpha$. We relax this assumption by allowing the Dirichlet parameter to vary with the number of words in the focal document, i.e. $\alpha_d = \alpha n_d$.
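The generative process above, including the length-dependent prior $\alpha_d = \alpha n_d$, can be sketched as follows. This is a minimal illustration with hypothetical values of K, V, α and β, not the authors' estimation code:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 1000           # number of topics and vocabulary size (illustrative)
alpha, beta = 0.1, 0.01  # hypothetical Dirichlet hyper-parameters

# Topics: each phi_k is a probability distribution over the V vocabulary words
phi = rng.dirichlet(np.full(V, beta), size=K)  # shape (K, V)

def generate_document(n_d):
    """Generate one document of n_d words under the length-dependent prior."""
    # theta_d ~ Dirichlet(alpha * n_d): longer documents get a larger
    # concentration parameter, hence more evenly spread topic proportions
    theta_d = rng.dirichlet(np.full(K, alpha * n_d))
    words = []
    for _ in range(n_d):
        z_di = rng.choice(K, p=theta_d)    # draw a topic for this word
        w_di = rng.choice(V, p=phi[z_di])  # draw a word from that topic
        words.append(w_di)
    return theta_d, words

theta_d, words = generate_document(100)
```

Repeating `generate_document` for each review, with its own `n_d`, reproduces the corpus-level generative story.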
In this manner, we parsimoniously allow the within-document topic distribution to vary with the length of the document. To demonstrate how this specification affects the within-document topic distribution, we estimated the model assuming that $\alpha$ is known to the researcher, and studied topic proportions for different values of $\alpha$. We present topic proportions for a 20-topic model for three values of $\alpha$ in Figure 2. Larger values of $\alpha$ are associated with a more even distribution of topic proportions across topics. So our specification enables us to allow larger documents to have more evenly distributed topics. Later in the paper, we demonstrate that this specification improves model fit.

====Insert Figure 2 Here====

2.2 Model Estimation

We estimate the posterior distributions of the hyper-parameters $\alpha$ and $\beta$, the document-level parameter $\theta_d$, the vector of word-level assignments of topics $\mathbf{z}$, and the topic-level parameter $\phi_k$. Assuming documents are conditionally independent and identically distributed, we show in Online Appendix 1 that the likelihood of the data conditional on the hyper-parameters is calculated as follows:

$$L(\mathbf{w} \mid \alpha, \beta) = \prod_{d=1}^{D} \int\!\!\int p(\theta_d \mid \alpha, n_d)\, p(\phi \mid \beta) \prod_{i=1}^{n_d} \sum_{k=1}^{K} \prod_{v=1}^{V} \left[ \phi_{k,v}\, \theta_{dk} \right]^{I(w_{di}=v)} d\theta_d\, d\phi \qquad (1)$$

where $I(\cdot)$ is the indicator function. We face two estimation challenges: this function does not have a closed-form analytical solution (Dickey 1983), and the dimensionality of our parameter space is very high (a common feature of problems associated with “big data”). The dimensionality problem is owing to the large number of unique words in the corpus ($V$), the potentially large number of topics which summarize them, and the large number of documents (note that $\theta_d$ is document specific). Following the computer science literature, we propose the following estimation strategy. First, rather than estimate $\phi_k$ or $\theta_d$ as parameters (Griffiths and Steyvers 2004), we estimate the posterior distribution of the assignment of words to topics, $P(\mathbf{z} \mid \mathbf{w})$.
We then obtain unbiased and consistent estimates of $\phi_k$ and $\theta_d$ by examining this posterior distribution. Second, much of the computer science literature assumes that the hyper-parameters $\alpha$ and $\beta$ are known, presumably for computational ease. However, it seems implausible that we would know the distribution of words over topics a priori. We therefore employ an expectation maximization algorithm to estimate these parameters (Online Appendix 2 provides details).4 We now focus on the problem of evaluating $P(\mathbf{z} \mid \mathbf{w})$, which is given as follows:

$$P(\mathbf{z} \mid \mathbf{w}) = \frac{P(\mathbf{z}, \mathbf{w})}{\sum_{\mathbf{z}} P(\mathbf{z}, \mathbf{w})} \qquad (2)$$

Owing to the conjugacy of the Dirichlet distribution, the numerator can be factorized and simplified as $P(\mathbf{z}, \mathbf{w}) = P(\mathbf{w} \mid \mathbf{z}) P(\mathbf{z})$, where

$$P(\mathbf{w} \mid \mathbf{z}) = \prod_{k=1}^{K} \frac{\Gamma(V\beta)}{\Gamma(\beta)^V} \frac{\prod_{v=1}^{V} \Gamma(\beta + n_{kv})}{\Gamma(V\beta + n_k)} \qquad (3)$$

and

$$P(\mathbf{z}) = \prod_{d=1}^{D} \frac{\Gamma(K\alpha n_d)}{\Gamma(\alpha n_d)^K} \frac{\prod_{k=1}^{K} \Gamma(\alpha n_d + n_{kd})}{\Gamma(K\alpha n_d + n_d)} \qquad (4)$$

$\Gamma(\cdot)$ is the standard gamma function. This specification involves several counts: $n_{kv}$ is the number of times the word $v$ in the vocabulary is assigned to topic $k$ in the corpus; $n_k$ is the number of words in the corpus which are assigned to topic $k$; and $n_{kd}$ is the number of words in document $d$ assigned to topic $k$. The detailed derivation is available from the authors.

4 LDA is a “bag-of-words” model (Eliashberg, Hui and Zhang 2007, Netzer et al. 2012), i.e. the order of words in a document does not affect the joint distribution of the observed and hidden random variables. Modeling word order is computationally intensive and therefore rare, even in computer science. Such models have usually been limited to incorporating bi-grams (word pairs) or tri-grams (triplets of words). Given the computational burden posed by estimating the hyper-parameters, we chose to retain the standard “bag-of-words” assumption.
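The count-based expressions in equations (3) and (4) lend themselves to collapsed Gibbs sampling over the topic assignments z. The following is a minimal, illustrative sketch of such a sampler with the length-dependent prior α_d = αn_d, with hypothetical hyper-parameter values; it is not the authors' Java implementation:

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA with alpha_d = alpha * n_d.

    docs: list of documents, each a list of word ids in {0, ..., V-1}.
    Returns posterior-mean estimates (theta, phi) computed from final counts.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_kv = np.zeros((K, V))  # times word v is assigned to topic k, corpus-wide
    n_kd = np.zeros((K, D))  # words in document d assigned to topic k
    n_k = np.zeros(K)        # total words assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):  # initialize counts from random assignments
        for i, v in enumerate(doc):
            k = z[d][i]
            n_kv[k, v] += 1; n_kd[k, d] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            a_d = alpha * len(doc)  # length-dependent Dirichlet parameter
            for i, v in enumerate(doc):
                k = z[d][i]  # remove the current assignment from the counts
                n_kv[k, v] -= 1; n_kd[k, d] -= 1; n_k[k] -= 1
                # full conditional: (n_kd + alpha*n_d) * (n_kv + beta) / (n_k + V*beta)
                p = (n_kd[:, d] + a_d) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k  # record the new assignment and restore the counts
                n_kv[k, v] += 1; n_kd[k, d] += 1; n_k[k] += 1
    n_d = np.array([len(doc) for doc in docs])
    theta = ((n_kd + alpha * n_d) / (n_d + K * alpha * n_d)).T  # D x K
    phi = (n_kv + beta) / (n_k[:, None] + V * beta)             # K x V
    return theta, phi
```

Because the full conditional depends only on the counts, φ and θ never need to be sampled; they are recovered by post-processing, as the paper describes.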
Yet, this posterior distribution cannot be computed directly, because the sum over $\mathbf{z}$ in the denominator of equation (2) does not factorize and involves $K^N$ terms, which is again computationally challenging owing to the size of our “big” dataset. So we adopt a Markov chain Monte Carlo approach which relies on Gibbs sampling of the latent topic assignment variable $\mathbf{z}$. Our algorithm belongs to a class of algorithms which is known to perform well in terms of scalability to large datasets and computation speed (Griffiths and Steyvers 2004).5 The full conditional distribution of $\mathbf{z}$ is free of $\phi_k$ and $\theta_d$, enabling us to estimate these parameters by post-processing. Further details of the estimation algorithm appear in Online Appendix 2. The estimation algorithm was coded in Java. The MCMC chain ran for 2,000 iterations, with the first 500 iterations for “burn-in”. The last 1,500 iterations (using a sampling lag of 75) yielded 20 samples that were used to compute the moments of the posterior parameter distributions.

5 Variational inference (VI) methods are also commonly employed in computer science and statistics for large-scale problems with intractable integrals. Whereas Monte Carlo methods provide numerical approximations of the exact posterior by sampling, VI methods provide a locally optimal but precise analytical solution to an approximation of the posterior. We estimated the model using a VI method and obtained almost identical results with comparable computational speed. We chose Monte Carlo methods since they are more common in the marketing literature.

Section 3: Analysis and Evaluation

We start by describing the textual data. The mean length of all 761,962 reviews is 126.7 words (SD=109.6). Each sentence is split into its component words using the Natural Language Toolkit’s Tokenizer (Bird 2009). After eliminating stop words (“a”, “the” etc.) and words that occurred fewer than 5 times in the entire corpus (Griffiths and Steyvers 2004, Lu et al.
2011), the number of unique words in the corpus is 44,276. Although the calorie posting regulation was implemented over a few months, we assume July 1, 2008 as the “implementation date” for comparing pre- and post-regulation consumer opinion. Robustness of our results to this assumption is discussed in the next section. Next, in Table 1, we report the 100 most frequent words across all reviews, separately for chain and for standalone restaurants, and for reviews posted before and after the implementation date. We find some interesting commonalities across both chain and standalone restaurants before and after the implementation of the regulation. The words “good” and “service” appear in the top 10 words in all four conditions (2 restaurant types × 2 time periods). Price is prominently referenced in various ways (e.g. “$”, “cheap”, “worth”). That eating out is a social activity is indicated by the prevalence of “friend” and “friends”, though these words are ranked lower in the chain restaurants. Reviewers discuss “fresh” and “delicious” in all four conditions. “Salad” and “salads” both appear in the top hundred in every condition, but appear to be ranked lower in the standalone restaurants post implementation of the regulation. In contrast to these similarities, “location” and “fast” appear in the rankings for chain restaurants only, whereas “décor” and “atmosphere” appear only in the context of standalone restaurants. Words associated with health such as “calories”, “health”, “fit” and “light” are absent from this list. Given the general view that health is not a very important consideration when eating out, this is not surprising.

====Insert Table 1 Here====

Although such analysis is useful to obtain a preliminary sense of the data, it cannot be used to draw any meaningful or robust substantive inferences about changes in consumer opinion due to the regulation.
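The tokenization and vocabulary-pruning steps described above can be sketched as follows. This is a simplified illustration: a regular-expression tokenizer and a token stop-word list stand in for NLTK's tokenizer and a full stop-word list, and the review strings are hypothetical:

```python
import re
from collections import Counter

# Illustrative subset of a stop-word list (the paper removes standard stop words)
STOP_WORDS = {"a", "an", "the", "and", "is", "was", "it", "of", "to"}

def build_vocabulary(reviews, min_count=5):
    """Tokenize reviews, then drop stop words and rare words (< min_count).

    The paper uses the Natural Language Toolkit's tokenizer (Bird 2009);
    a regex tokenizer stands in here to keep the sketch self-contained.
    """
    tokenized = [re.findall(r"[a-z']+", r.lower()) for r in reviews]
    counts = Counter(w for doc in tokenized for w in doc if w not in STOP_WORDS)
    vocab = {w for w, c in counts.items() if c >= min_count}
    # Map each review to its list of retained words
    docs = [[w for w in doc if w in vocab] for doc in tokenized]
    return vocab, docs

reviews = ["The fries were tasty"] * 5 + ["Service was slow"]
vocab, docs = build_vocabulary(reviews, min_count=5)
```

On the paper's corpus, the analogous pipeline reduces the raw text to the 44,276-word vocabulary used for estimation.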
Since our objective is to infer topics of discussion from the data, analyses that count specific words in the corpus, and track how these frequencies vary over time, are not helpful either. First, except for the "health" topic, it is a priori unclear which words or topics to look for in the corpus. Second, even if a reliable list of topics were available, any choice of words for measuring the level of discussion of specific topics would be subjective; results pertaining to levels of topic discussions and their changes are sensitive to such choices. LDA offers a data-based, replicable, objective and principled methodology of inferring topics from text corpora. Yet a major challenge in all topic models is the interpretability of estimated topics. Models with large numbers of topics typically fit the data better and are able to support finer-grained distinctions in the text. However, some topics are more interpretable than others in the judgment of domain experts; and the number of less interpretable topics often increases with the number of topics (Mimno et al. 2011). Measures of model performance such as out-of-sample fit, although commonly employed in marketing, correlate poorly with human judgments of topic interpretability (Chang et al. 2009). This has led to increased interest among computer scientists in developing automated metrics which are better able to predict topic interpretability. A useful insight from this research is that if a topic is highly interpretable (to humans), pairs of words which are associated with this topic with a high probability should frequently co-occur in several documents of the corpus. For example, a topic in which the words "healthy" and "vegetables" are highly probable is likely to be more interpretable or "coherent" if both of these words occur together in several restaurant reviews. Mimno et al. (2011) provide evidence for this result, and use it to develop a "topic coherence" metric Ck for each topic.
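This document co-occurrence idea can be sketched in code; a minimal illustration of the Mimno et al. (2011) coherence score, computed from a hypothetical toy corpus:

```python
import math

def topic_coherence(top_words, docs):
    """Coherence of one topic, following Mimno et al. (2011):
    the sum over ordered word pairs of log((D(v_m, v_l) + 1) / D(v_l)),
    where D counts the documents containing the given word(s).
    top_words: the M most probable words of the topic, most probable first.
    docs: each document represented as a set of its words."""
    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(all(w in doc for w in words) for doc in docs)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1)
                              / doc_freq(top_words[l]))
    return score

# Toy corpus: "healthy" and "vegetables" co-occur in two of three reviews.
docs = [{"healthy", "vegetables", "salad"},
        {"healthy", "vegetables"},
        {"salad", "fries"}]
score = topic_coherence(["healthy", "vegetables"], docs)
```

Topics whose top words frequently appear together score higher; a topic whose top words never co-occur accumulates only the +1 smoothing term and scores low.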
Topics scoring higher on this metric are more interpretable by human judges. It is defined as follows:

C_k = Σ_{m=2}^{M} Σ_{l=1}^{m-1} log [ (D(v_m^k, v_l^k) + 1) / D(v_l^k) ]   (5)

where V^k = (v_1^k, ..., v_M^k) is the list of the M most probable words in topic k, D(v) is the number of documents in which the word v appears, and D(v, v') is the number of documents which contain at least one occurrence each of both v and v'. We now discuss how we choose the number of topics (K) and label each of them. Several statistical approaches exist for this purpose. Similar to cluster analysis, we maximize the dissimilarity between topics (Deveaud et al. 2012, Cao et al. 2009) by computing a distance between every pair of topics, where each topic is a probability distribution over the vocabulary. We employ the Jensen-Shannon statistic (Lin 1991, Steyvers and Griffiths 2007), which is similar to the Kullback-Leibler divergence statistic (Kullback and Leibler 1951), except that it is symmetric (i.e. the order of distributions does not matter) and always takes finite values; these are both desirable properties. On estimating our model for various values of K, we found that this statistic is maximized at K=200. So all results pertain to 200-topic models. Not all topics are of substantive interest; so we follow the computer science literature and restrict substantive inferences to a few coherent topics only (Mimno et al. 2011, AlSumait et al. 2009). Specifically, we present 20 topics in Table 2: the seeded topic discussed above, and the 19 topics with the greatest values of the topic coherence metric.6 Coherence scores of all other topics appear in Online Appendix 3. Each topic is commonly represented by listing the most probable words in the topic (Chang et al. 2009, Blei et al. 2003). We extend this principle to label topic k in terms of the two distinct words which have the greatest posterior probability of belonging to that topic (as per φ_k).
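For concreteness, the Jensen-Shannon statistic between two topic distributions can be sketched as follows; unlike the Kullback-Leibler divergence, it is symmetric and always finite:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q); terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    # Average KL divergence of p and q to their midpoint m = (p + q) / 2;
    # symmetric in p and q, and finite even when supports do not overlap.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two toy topic distributions over a 3-word vocabulary.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
d = jensen_shannon(p, q)
```

The statistic is bounded above by log 2 (attained for non-overlapping distributions), which makes pairwise topic distances directly comparable across values of K.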
Although other words associated with the topic are likely meaningful, we choose this method for its objectivity, conciseness and because it does not require human intervention. ====Insert Table 2 Here==== A unique feature of our model (vis-à-vis extant models for analyzing textual data in marketing) is that it permits the researcher to specify words to belong to a topic. This "seeded" topic then becomes a topic of central interest. The posterior parameter distributions can then be used to infer changes to the distribution of this topic across documents and over time. In our analysis, we seed a topic which we simply label "health", by allowing the prior distribution φ_k of this topic over the vocabulary to contain the following words or "seeds" with high probability: calorie, calories, fat, diet, health, healthy, light, fit, cardio, lean and protein. This list is based on a review of Section 81.50 of the New York City Health Code that articulates the regulation. The words "calorie", "calories" and "health" are the most frequently occurring health related words in the policy document. Words such as "light", "fit", "lean" and "protein" appear related to health, and occurred with high frequency in our corpus. Seeding a topic with a few words allows the topic to "attract" other words which co-occur with the seeded words frequently in the corpus. We start with a uniform random initialization, where a topic is assigned to every word in every document. Two count variables are important: nkv (the number of times the word v in the vocabulary is assigned to topic k in the corpus) and nk (the number of words in the corpus which are assigned to topic k).6 In the "unseeded" model these counts are updated over a number of MCMC iterations to yield a posterior estimate of φ_k. In the seeded version of the model, we randomly choose a topic and increment the counts of nkv and nk artificially (or add "pseudo-counts") for the seed words, thus increasing the prior probability that the seed words will be located in the same topic. After some experimentation, we choose 5 as the value of this "pseudo-count". This choice is sufficiently flexible to allow for the possibility that the posterior estimate of φ_k will contain a) seeds with low probability, and b) other words with high probability. Given the large volume of data, this choice of pseudo-counts does not affect the results. To verify this, we included a low frequency health related word, "cardio", as a seed. "Cardio" receives a low posterior probability assignment in the health topic and is not in the top 20 words used to describe the topic. Although we use the Jensen-Shannon statistic to decide the number of topics, this in itself does not guarantee that all inferred topics are managerially relevant. Our topic of interest is quite focused, i.e. contains very few substantively relevant words. Seeding enables us to study the issue of interest in a focused, practical, yet statistically robust manner. Therefore, seeding is a very useful tool to investigate potentially small but emerging trends in the data, and allows managers and regulators to measure topics of special interest to them. 6 In a further analysis to test robustness of topics, we measured how far the most probable words of the topics are from uniform distributions. The closer a topic's top words follow a uniform distribution, the less likely it is that the topic is informative. Empirically we expect Zipf's law to apply; most of the probability mass in each topic is allocated to a few words. Employing this measure did not change the results in the paper. Section 4. Results and Implications 4.1 Results We first discuss the major topics of discussion in our data. We then discuss the relative importance of the health topic. Next, we discuss how the discussion of the health topic is distributed across reviews, with an eye to inferring segments of the reviewing population which are more vocal about health.
Finally, we draw inferences about how the relative importance of health and other topics changed over time due to the calorie posting regulation. In Table 2, we present the top 10 words in decreasing order of posterior probability of being in each topic (for the top 20 topics), as inferred from the analysis of all reviews of chain restaurants. These topics perform very well on well-established coherence metrics.7 First, we find that a substantial number of topics are focused on specific menu items. As examples, the words for topics 4 and 5 are predominantly about Mexican food, topics 2 and 13 are mostly about steak related words, and the label of topic 14 captures its contents: burgers and fries. Second, several topics are focused on specific restaurant brands: Potbelly (topic 11), Olive Garden (topic 10), Benihana (topic 12), and Hooters (topic 18). It is evident that a majority of all topics summarizing online consumer opinion are about restaurant brands and specific menu items. Third, different aspects of service are captured across topics. Topic 7 captures the service of the delivery of food and beverages (e.g. the words seated, server, bar, wait, reservation). Topic 8 alludes to non-food related restaurant services (e.g. arcade, games, play, tickets). Fourth, only one topic out of the "top" 20 connotes price, suggesting its low importance in consumer opinion. Price is a "search" attribute (rather than an "experience" attribute), so perhaps writing about it is perceived to be less informative by reviewers. Lastly, other than the first topic that we seeded with health related words, there are topics that might be important in understanding changes in consumer opinion due to the calorie posting regulation. Increasing discussion about the topic "salad_salads" might be indicative of increased discussion about healthier products. The words in this topic appear to be predominantly about salad ingredients; this can be interpreted as greater awareness of what is being consumed. Similarly, greater discussion of the topic "steak_wolfgang" might be an early signal that consumers think and write as much about high calorie food as they did prior to the regulation. So we infer changes in consumer opinion due to the regulation not just in terms of changes in the discussion of the "health" topic, but also based on discussion of obviously less and more healthy products. We now discuss the relative importance of the top 20 most coherent topics. In Table 3, we present the means (across reviews of chain restaurants) of the posterior mean of the topic proportions θ_d. Since there are 200 topics, in the absence of any information, we might expect the proportion of each topic to be about 0.5%. Topics associated with American staple fast-foods such as burgers, fries, sandwiches and steaks get discussed to a greater extent than the average topic. In what can be perceived to be a healthy signal, the topic salad_salads is discussed more than the average topic. The seeded health topic is discussed at about an average level; so is price. Potentially unhealthy foods such as cakes and chocolates are discussed at below average levels. We also present means of topic proportions prior to and after the implementation of the regulation. We note increasing trends in the discussion of popular products such as sandwiches and steaks. In what can be seen as a measure of success of the calorie-posting regulation, the "health" topic and the topic associated with salads get discussed more post-regulation. 7 To improve interpretability, it might be tempting to combine topics which appear similar. This can better be achieved by estimating models with fewer topics than by manually combining topics post estimation, since manual combinations might be subjective. However, such models would offer poorer fit. We follow the standard approach of drawing substantive inferences from the best fitting model.
On the other hand, the proportion of discussion of high calorie foods (e.g. cakes) shows a decline in reviews of chain restaurants.8 This is the first paper to infer temporal changes in topics of discussion about products in online forums. However, as in much of the computer science literature on topic models, the drivers of temporal changes in topic proportions are still unclear; we return to this issue in detail later. ====Insert Tables 3 and 4 Here==== In Table 4, we present the distribution of topic proportions based on reviews posted for standalone restaurants. We use the topics estimated on data from chain restaurants, and estimate their proportions for data on standalone restaurants (i.e. we hold φ_k fixed across the treatment and control groups, and estimate θ_d for each review of standalone restaurants).9 We find major differences in the extent to which the top 20 topics are discussed in these restaurants. Most topics form a greater proportion of reviews of chain restaurants than of standalone restaurants, perhaps because these topics were inferred based on reviews of chain restaurants. We find that health (the seeded topic) forms a greater proportion of reviews of chain restaurants. This could be driven by differences in preferences of reviewers who review both kinds of restaurants, and might perhaps be an early signal of the need for calorie posting regulation for standalone restaurants also. The topic connoting price (tip_%) is discussed more in standalone restaurants; this is not very surprising as chain restaurants are often quick serve restaurants (with some exceptions) where meals are paid for upfront and with a minimal service component. It is also plausible that public knowledge of prices of standalone restaurants is lower than that of chain restaurants (e.g. the prices of McDonald's burgers might be better known than those of the neighborhood deli). So discussion of price of standalone restaurants might be more informative. Next we discuss how widespread the discussion of health and other topics is. We find that the variance in topic proportion (across reviews) of all top-20 topics is quite high (3.7 to 15.0 times the mean topic proportion), suggesting that any one topic might not get discussed to any significant extent by large segments of the reviewing population. To further explore this notion, we compute the proportion of reviews for which the topic proportion is greater than the baseline topic proportion of 0.5%. We find that just 6.1% of reviews of chain restaurants contain the "health" topic to an extent greater than 0.5% (Table 3). In as many as 63% of all reviews, the proportion of this topic is less than 0.05%. Health as a topic gets discussed in very few reviews. Similarly, the discussion of healthy foods as measured by the topic salad_salads is restricted to 10.5% of reviews (to an extent of 0.5% or more). 8 We focus on inferring the proportions of topics across reviews, and how they change due to the regulation. The absolute levels of discussion of most topics increase over time, since more reviews are posted over time. This holds for both chain and standalone restaurants. Further details are available from the authors. 9 Another option is to estimate topics separately for standalone and chain restaurants. We could then compare a topic k for chain restaurants with the topic for standalone restaurants which is at minimum distance as measured by the Jensen-Shannon statistic. However, our method enables more precise comparison of topics across the two groups. In further analysis we estimated the model on the entire corpus so that φ_k is inferred from reviews of both chain and standalone restaurants. All substantive results pertaining to the health topic remain unchanged.
On the other hand, the discussion of staple fast foods is more widespread (burgers_fries is discussed by 12.7% of all reviews, to an extent of 0.5% or more), as is the discussion of popular brands (e.g. chipotle_burrito(2)). We provide evidence that discussion of specific topics in online product reviews is skewed, such that a small proportion of all reviews account for most of the discussion.10 We now discuss temporal trends in how widespread the discussion of topics is across reviews. Irrespective of whether the calorie posting regulation leads to an overall increased discussion of health vis-à-vis other topics, it might affect the proportion of the population which discusses health. More widespread discussion of health might be another measure of success of the regulation. We find a very small increase in the proportion of reviews of chain restaurants which discuss "health" (to an extent of at least 0.5% of the review) in the post-regulation period: 6.09% to 6.15%, but a decline in the proportion of reviews of standalone restaurants (14.2% to 13.5%). Since the regulation was not mandated for standalone restaurants, this suggests that it might have served to stem the decline in how widely health gets discussed in reviews of chain restaurants. The proportion of reviews discussing healthy foods (salad_salads) declines marginally for both kinds of restaurants, suggesting no effect of the regulation. Moreover, at least some high calorie foods witnessed more widespread discussion (filet_steak and medium_steak) in chain restaurant reviews; this increase was much more than the corresponding increase for standalone restaurants. Therefore, a small proportion of the population reviewing chain restaurants discussed explicit health related words and healthy foods; on the other hand the discussion of less healthy foods such as steaks seems to have become more widespread. To estimate the causal effect of the regulation on the mean levels of topic proportions in reviews of chain restaurants, we specify the following model for the proportion of topic k in document d:

θ_{kd} = β_{0k} + β_{1k} Chain_d × quarterId_d + β_{2k} Post_d + β_{3k} Chain_d × Post_d + β_{4k} ZipCode_d + ε_{kd}   (6)

Chain_d equals 1 if review d is for a chain restaurant (0 otherwise). Post_d is a dummy variable for the implementation time of the regulation (1 if review d was dated July 2008 or later; 0 otherwise). quarterId_d is a vector of dummy variables. The element corresponding to the unique year and quarter combination (e.g. quarter 1, 2008) in which review d was written takes the value 1; all other elements are 0. ZipCode_d is a dummy variable for the zth zipcode in our data, referring to the location of the restaurant reviewed in review d. We also include the interaction of the variables Chain and Post. The key effect of interest is that of the interaction of the variables Chain_d and Post_d. It measures how topic proportions changed after the regulation for chain restaurants in comparison with standalone restaurants. Positive (negative) estimates of the interaction effect indicate that the proportion of topic k in reviews of chain restaurants increased (decreased) due to the implementation of the regulation. In terms of controls, the coefficient of Post_d captures changes to topic proportions (before and after implementation) that affect all restaurants similarly, e.g. population level changes in health salience. 10 To aid understanding, we present an example of a review with a high estimate of the proportion of the health topic: "For anyone who is living a healthy lifestyle you need to come and sample Muscle Maker you will be back. The staff is helpful and friendly. The Arizona Rocky Balboa and Cajun Chicken with whole wheat penne are my favorites. If your (sic) on a low carb or low sodium or any kind of diet (except a high fat diet) they have something for you."
The interactions between Chain_d and quarterId_d control for short term changes in factors which might affect chain and standalone restaurants differently (e.g. more tourist traffic to chain restaurants in a particularly warm winter, greater advertising by chain restaurants in a summer of high tourist traffic). Dummies for the last quarter in the data, and for one zipcode, are excluded to avoid collinearity. Error terms are assumed IID and normally distributed. All parameters are topic-specific. Key parameters are presented in Table 5. ====Insert Table 5 Here==== Several insights emerge. First, after controlling for chain characteristics (in comparison to standalone restaurants), temporal trends, restaurant locations, and any time-specific shocks that might affect the two formats differently, we find that the proportion of the health topic in chain restaurants increases due to the regulation. So does the proportion of discussion of topics of seemingly healthier foods such as salads. This clearly signals success of the regulation in terms of increasing the salience of health, at least for those consumers of restaurants who post online reviews. However, the proportions of topics connoting high calorie foods such as "steak_wolfgang", "filet_steak" and "burger_fries" also increase after the implementation of the regulation. The greatest magnitude of change post regulation is for the topic "burger_fries" (the coefficient of Chain_d × Post_d is 45.05 (SD=1.06); that of the health topic is just 3.50 (SD=0.55)). So unfortunately, the relative proportion of discussion of high calorie foods was high before the regulation, and became even higher after it was implemented. From the coefficients of Post_d, it is evident that topics related to some brands garner a greater proportion of online reviews post June 2008 (Olive Garden and Cosi), whereas other brands are discussed relatively less (Potbelly and Hooters).
Such trends can serve as informative signals for brand managers of the focal and competing brands: greater online discussion of a brand might be a precursor to increasing demand for it.11 As mentioned earlier, to account for the possibility that factors other than the regulation might affect topic proportions of chain restaurants, but not those of standalone restaurants, we conduct a regression discontinuity analysis. Such analysis elicits causal effects of interventions more cleanly by assigning a threshold above or below which an intervention is assigned. Such a threshold in our context is simply the time of implementation of the calorie posting regulation (July 2008). The treatment (mandatory calorie posting) is assigned to chain restaurants only after this cutoff. By comparing observations lying closely on either side of the threshold, it is possible to estimate the local treatment effect in contexts in which randomization was infeasible. So we estimate the regressions discussed above not for all reviews in our data period but for reviews posted in a period of X months before and after the date of implementation. The smaller the time period of analysis around the date of implementation, the less likely is the occurrence of any events which potentially affect topic proportions of chain restaurants only. We estimate the regressions for X=6 months (i.e. on all reviews posted in 2008 only), 1 year and 2 years. Although the regression coefficients vary in magnitude, the coefficient signs remain the same. All results discussed in the paper hold for all regressions, irrespective of the time period of analysis, and are available from the authors. 11 We conducted further analysis by including the valence of the review (measured on a 5-point scale in our data) as a covariate. Results discussed in the paper remain unchanged. Details are available from the authors.
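To make the identification logic concrete, the difference-in-differences quantity captured by the Chain × Post interaction in equation (6) can be sketched on illustrative numbers; the topic proportions below are hypothetical, and the full specification with quarter-by-chain and zipcode dummies is omitted for brevity:

```python
def did_effect(theta, chain, post):
    """Difference-in-differences on cell means for one topic.
    theta: topic proportions, one per review; chain/post: 0/1 indicators."""
    def cell_mean(c, p):
        # Mean topic proportion in the (chain=c, post=p) cell.
        vals = [t for t, ch, po in zip(theta, chain, post)
                if ch == c and po == p]
        return sum(vals) / len(vals)
    # Pre-to-post change for chain restaurants minus the same change
    # for standalone restaurants (the Chain x Post coefficient in a
    # saturated two-group, two-period regression).
    return ((cell_mean(1, 1) - cell_mean(1, 0))
            - (cell_mean(0, 1) - cell_mean(0, 0)))

# Hypothetical health-topic proportions (%), one review per cell.
theta = [0.43, 0.44, 0.43, 0.48]
chain = [0, 0, 1, 1]   # standalone vs chain
post  = [0, 1, 0, 1]   # pre vs post regulation
effect = did_effect(theta, chain, post)
```

Here the standalone trend (0.43 to 0.44) is netted out of the chain trend (0.43 to 0.48), isolating the treatment effect from common temporal shocks.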
We do not analyze smaller time periods of data, since the implementation of the regulation was spread over a few months. Another potential concern could pertain to our choice of temporal break (July 2008). It is plausible that restaurants made changes prior to the regulation (e.g. healthier menus or lower calorie ingredients) in anticipation of calorie posting. Such changes could have affected health topic proportions even before the regulation was implemented. So we repeated this analysis for various temporal breaks (both before and after July 2008), and find that our results hold. 4.2 Implications for Managers and Policy Makers We first discuss implications for policy makers interested in promoting healthy eating out habits. We find that health is not a prominent topic of discussion among hundreds of thousands of reviewers of restaurants in New York City. With over 57% of all adults in the city being overweight or obese12, this is worrisome. Most reviewers of restaurants discuss health to a very low extent or not at all. Interestingly, much of the discussion of health is skewed towards a small segment of reviewers, who can be readily identified online. They could serve as useful starting points for initiatives to identify influencers or evangelists who might be successful in changing online public opinion about health. We find that the calorie posting was successful in two ways. First, it led to an increase in the discussion of health related words among online reviews of chain restaurants. Managers of restaurants with healthier offerings might be encouraged by this trend and managers of restaurants with less healthy offerings might consider conducting more market research to determine whether and how to alter their strategy. Second, the proportion of reviews of chain restaurants which discuss health increased slightly after the regulation; whereas this proportion declined for standalone restaurants. 
In other words, the discussion of health became less widespread for standalone restaurants, but not for chain restaurants. Although these are encouraging signs of success of the regulation, they provide a basis for conducting more costly studies into consumption of healthier products as a logical next step. While these results are econometrically significant, are they economically significant? Our estimate of an 11.6% increase in health topic proportion (from 0.43% to 0.48%) is consistent with research based on transaction data. Bollinger et al. (2011) estimate a 6% decrease in calories per transaction at Starbucks after the regulation, but no change in overall revenues. 12 Source: http://www.health.ny.gov/statistics/prevention/obesity/county/newyorkcity.htm Irrespective of the data source and research methodology, such small effect sizes might suggest that the regulation was not a success. However, small changes in consumer behavior due to the regulation can bring about major changes in obesity levels. Kuo et al. (2009) estimate that even if 10% of restaurant patrons in Los Angeles county were to reduce calorie consumption by 100 calories per meal, as much as 40.6% of average annual weight gain in the entire county population would be averted. Reduction in obesity levels has monumental social and economic significance in the US. Over 250,000 deaths in the US every year are attributable to obesity (Allison et al. 1999). Obesity related costs in the US in 2008 were estimated to be a staggering $147 billion (Finkelstein et al. 2009); greater than the GDP of all but about 60 countries, and still rising. Therefore, a seemingly small change in the level of health discussion due to the regulation is to be celebrated by regulators. Another key finding is that topics pertaining to health, price and service garner a smaller proportion of online reviews than those pertaining to brands and menu items.
To the extent that these topics are correlated with product attributes which consumers use for choice decisions, this serves as a free and externally valid input into product management decisions. For making tradeoffs between investing in service or menu redesign, it is useful for managers to know that menu items get discussed far more than service. Among menu items, the fact that steaks, burgers and sandwiches are discussed more than salads and appetizers is an indication of the relative popularity of various food items for eating out in New York City. Although not central to our key research question of determining the effect of the regulation on consumer opinion, our analysis reveals useful insights for brand managers of restaurants. Topics in Table 2 reveal words which are commonly used along with certain brand names in consumer reviews. We note that Qdoba is the only brand among the top 10 words for the topic "chipotle_burrito(1)", suggesting that Chipotle and Qdoba are perceived to be similar by consumers. This could potentially serve as a useful input for future store choice decisions, where one brand might not want to locate very close to the other. Food items frequently mentioned with a brand indicate which items the brand is associated with. Based on this, Chipotle is more strongly associated with burritos and chicken, and not as much with tacos or beef or avocado. This could serve as input into a formal menu planning exercise – more items related to burritos and chicken might make these brand-product associations stronger. Lastly, as mentioned earlier, the overall trend of increasing (decreasing) discussion of certain brands after the implementation of the regulation could be a leading signal of increasing (decreasing) demand. Section 5.
Model Comparison and Scalability 5.1 Model Comparison We assess improvement in model performance due to the incorporation of three features which are unique to the marketing literature: a) allowing words to belong to multiple topics probabilistically (instead of deterministically), b) allowing the researcher to seed certain topics with specific words which are considered substantively important, and c) allowing the distribution of topics within a document to be affected by the length of the document. Model A deterministically allocates each word in the vocabulary to only one topic. We achieve this by assuming that each word deterministically belongs to the topic for which the posterior probability is the greatest, i.e. we assume φ_kv is 1 (and 0 otherwise) if the posterior probability of word v belonging to topic k is greater than the posterior probability of this word belonging to any other topic. Model B is an unseeded model, i.e. we do not impose any prior distribution φ_k of any topic to contain any word with high probability. Model C is identical to the proposed model with the exception that we assume that θ_d is drawn from a Dirichlet distribution with parameter α which is invariant across reviews, and does not depend on review length. For model comparison, we compute the perplexity score: the likelihood of observing a collection of words given a model of how the words were generated. It is monotonically decreasing in the likelihood of the data, such that models with lower perplexity fit the data better. It is commonly used for model comparison in the Natural Language Processing literature (Blei et al. 2003, Arora, Ge and Moitra 2012), and is defined as follows:

perplexity(D_test) = exp( - Σ_{d=1}^{D} log p(w_d^test | D_train) / Σ_{d=1}^{D} n_d )   (7)

where p(w_d^test | D_train) = Π_{i=1}^{n_d} Σ_{k=1}^{K} θ_dk Π_{v=1}^{V} (φ_kv^train)^{I(w_id = v)}   (8)

and w_id is the ith word of document d.
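A minimal sketch of this perplexity computation, corresponding to equations (7) and (8), with toy document-topic proportions θ and topic-word probabilities φ:

```python
import math

def perplexity(docs, theta, phi):
    """Perplexity of held-out documents.
    docs: documents as lists of word indices into the vocabulary.
    theta[d][k]: topic proportions for document d.
    phi[k][v]: probability of word v under topic k."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for v in doc:
            # Mixture probability of word v under document d's topic mix
            # (the inner sum over topics in equation (8)).
            p_word = sum(theta[d][k] * phi[k][v] for k in range(len(phi)))
            log_lik += math.log(p_word)
        n_words += len(doc)
    # exp of negative average log-likelihood per word, as in equation (7).
    return math.exp(-log_lik / n_words)

# Toy example: 2 topics over a 2-word vocabulary, one 2-word document.
phi = [[0.9, 0.1], [0.1, 0.9]]
theta = [[0.5, 0.5]]
docs = [[0, 1]]
pp = perplexity(docs, theta, phi)
```

With the even topic mix above, every word has probability 0.5 and the perplexity is 2, the vocabulary size; better-fitting models drive this number down.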
To compare out-of-sample performance of the three models with the proposed model, we employ a 10-fold cross validation technique (Hastie et al. 2005). In each "fold", a 10% hold out sample (the "test" data in equations 7 and 8) is drawn. The model is estimated from the remaining 90% of the data (the "train" data). We assume the topic distribution over the vocabulary (φ) to be the same across the training data and the hold out data, and estimate the document level topic proportions θ_d for each document of the hold out data. Three insights emerge from these model comparisons (see Table 6). First, out of the three novel features we propose, the greatest improvement in fit comes from probabilistic allocation of words to topics. Model A performs much worse than the proposed model for every hold out sample. Second, our seeded model is comparable in fit to the more flexible model B (the unseeded model). In four out of ten samples, the unseeded model has a slightly greater perplexity score (indicating worse fit) than the proposed model. On average, the mean perplexity score of model B is only slightly lower than that of the proposed model. Therefore, our seeded model enables the incorporation of managerial intuition and offers much richer managerial insights, at very little cost in terms of model fit. As mentioned, the seeded topic might get "flooded" with substantively less relevant words if fewer topics are chosen. Alternatively, all words of substantive interest might get dispersed over several topics if more topics are chosen. Finally, model C, which ignores the length of the review, performs worse than the proposed model for each hold out sample. This provides empirical validity to the notion that longer reviews have a more even distribution of topics. To the extent that the topics discussed in reviews are indicative of reviewers' attribute preferences, review lengths can serve as an easily measurable and observable segmentation variable.
It is plausible that someone who posts a short review primarily about a single topic, say salads, might be a better target for a banner ad for a salad bar than someone who posts a long review discussing salads, burgers and steaks. To the best of our knowledge, current targeting technologies for social media do not consider the length of content posted online. Our findings suggest this might be a fruitful arena.

====Insert Table 6 Here====

5.2 Scalability

Of the 761,962 reviews in our data, 9,253 are for chain restaurants and the remainder for standalone restaurants. Although our approach involved estimating topic proportions (vectors of 200 elements) for each of 761,962 reviews, it is possible that other applications of this model might require larger datasets (e.g. comparisons of topic proportions from user-generated content across cities or even countries). As such, it is important to assess the scalability of our model to larger datasets. We estimate the entire model (i.e. every parameter) on datasets of varying numbers of reviews, and report the time taken to convergence. Although inspection of the time series of log-likelihoods across iterations of the MCMC chain reveals that convergence is achieved in 600 iterations irrespective of sample size, we report computation times for 2,000 iterations. We used a standard machine with an Intel 3930K processor capable of a maximum speed of 4.5 GHz (on a single core) and 48 GB of RAM. Convergence times for datasets with 50,000, 100,000 and 660,000 documents are 3.4 hours, 8.7 hours and 61.9 hours respectively. Convergence times thus increase approximately linearly with sample size. For most applications of the model that are not real time, this estimation time seems acceptable; and because real-time applications typically involve far fewer documents, we believe our model holds promise for real time marketing applications as well.
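As a quick back-of-envelope check of the approximately linear scaling, one can fit a line to the three reported (documents, hours) pairs and inspect the slope, i.e. hours per document. This is purely illustrative, using only the timings reported above:

```python
import numpy as np

# Reported convergence times from Section 5.2: document counts vs. hours.
docs = np.array([50_000, 100_000, 660_000])
hours = np.array([3.4, 8.7, 61.9])

# Least-squares fit of hours = a * docs + b; a is hours per document.
a, b = np.polyfit(docs, hours, 1)
```

The fitted slope is on the order of 1e-4 hours (a fraction of a second) per document, and the fitted line reproduces the three observed timings closely, consistent with the approximately linear scaling noted in the text.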
A brand manager might wish to track the proportion of a topic seeded with key brands on a daily basis, based on an analysis of all product reviews posted for the product category. Even though such applications might be based on smaller datasets (the number of reviews posted on any given day is not very high), we refer the reader to an active research stream in computer science concerned with reducing computation time using both deterministic and stochastic estimation techniques (Mimno et al. 2012, Zhai et al. 2012).

Section 6. Conclusion

The growth of the internet has led to the availability of very large quantities of data that are often less structured than data collected offline. Such data are often in the form of opinions of consumers (e.g. blogs, product reviews), come from an increasingly representative subset of the population, are in the public domain, and are available for long periods of time (e.g. 8 years in this research). This provides an unprecedented opportunity for marketers not just to understand what consumers are saying about their products at a point in time, but also to continuously track changes in consumer opinion over time. However, a major challenge for researchers is that much of these data are textual. It is perhaps for this reason that much of the research based on user-generated online content has focused on numerical descriptors of these data or on simpler measures like word counts. Techniques to analyze large volumes of text are at a nascent stage even in computer science. Yet, there is considerable interest from practitioners in using these data to gain usable knowledge. A recent report by the McKinsey Global Institute (Manyika et al.
2011) suggests that analyzing such data will become a "key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus." Early research using online textual data in marketing has focused on inferring market structure and product attributes in specific product categories; on ascertaining the extent to which these correlate with consumer-level data collected from more traditional experimental and survey-based techniques; and on incorporating measures of such data in demand models. We extend this work by using textual data to address an issue that has perhaps been infeasible to study otherwise: how can researchers track changes in consumer opinion over time, and assess the impact of exogenous events on such changes? Specifically, we assess the impact of a regulation to post calories in chain restaurants on consumer opinion pertaining to chain restaurants. Across marketing and computer science, we were unable to find other research that uses textual data to infer the effect of any factor on consumer opinion. We find significant changes in the proportions of various topics of discussion due to the implementation of the regulation. Methodologically, we extend the Latent Dirichlet Allocation set of models from computer science. These represent the state of the art in that literature, and we introduce them to marketing. We look forward to several strategy- and policy-relevant applications as well as more sophisticated models in this area of topic detection and measurement.

References

Allison, D. B., K. R. Fontaine, J. E. Manson, J. Stevens, T. D. VanItallie. 1999. Annual Deaths Attributable to Obesity in the United States. The Journal of the American Medical Association. 282(16), 1530--1538.

AlSumait, L., D. Barbará, J. Gentle, C. Domeniconi. 2009. Topic Significance Ranking of LDA Generative Models. Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I. 67--82.

Archak, N., A. Ghose, P.
G. Ipeirotis. 2011. Deriving the Pricing Power of Product Features by Mining Consumer Reviews. Management Science. 57(8), 1485--1509.

Arora, S., R. Ge, A. Moitra. 2012. Learning Topic Models - Going beyond SVD. Proceedings of the 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science. 1--10.

Bird, S., E. Loper, E. Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Blattberg, R. C., S. J. Hoch. 1990. Database Models and Managerial Intuition: 50% Model + 50% Manager. Management Science. 36(8), 887--899.

Blei, D. M. 2012. Probabilistic Topic Models. Communications of the ACM. 55(4), 77--84.

Blei, D. M., A. Ng, M. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research. 3, 993--1022.

Bollinger, B., P. Leslie, A. Sorensen. 2011. Calorie Posting in Chain Restaurants. American Economic Journal: Economic Policy. 91--128.

Cao, J., T. Xia, J. Li, Y. Zhang, S. Tang. 2009. A Density-Based Method for Adaptive LDA Model Selection. Neurocomputing. 72(7-9), 1775--1781.

Chang, J., J. Boyd-Graber, S. Gerrish, C. Wang, D. M. Blei. 2009. Reading Tea Leaves: How Humans Interpret Topic Models. Neural Information Processing Systems. 1--9.

Decker, R., M. Trusov. 2010. Estimating Aggregate Consumer Preferences from Online Product Reviews. International Journal of Research in Marketing. 27(4), 293--307.

Deveaud, R., E. SanJuan, P. Bellot. 2012. LIA at TREC 2012 Web Track: Unsupervised Search Concepts Identification from General Sources of Information. Proceedings of the 21st Text REtrieval Conference (TREC 2012), Gaithersburg, USA, November 7-9.

Dickey, J. M. 1983. Multiple Hypergeometric Functions: Probabilistic Interpretations and Statistical Uses. Journal of the American Statistical Association. 78(383), 628--637.

Downs, J. S., J. Wisdom, B. Wansink, G. Loewenstein. 2013. Supplementing Menu Labeling With Calorie Recommendations to Test for Facilitation Effects. American Journal of Public Health. 103(9), 1604--1609.
Eliashberg, J., S. K. Hui, J. Zhang. 2007. From Story Line to Box Office: A New Approach for Green-lighting Movie Scripts. Management Science. 53(6), 881--893. Finkelstein, E. A., J. G. Trogdon, J. W. Cohen, W. Dietz. 2009. Annual Medical Spending Attributable to Obesity: Payer and Service Specific Estimates. Health Affairs. 28(5), 822--831. Ghose, A., P. G. Ipeirotis, B. Li. 2012. Designing Ranking Systems for Hotels on Travel Search Engines by Mining User-generated and Crowdsourced Content. Marketing Science. 31(3), 493-520. Godes, D., D. Mayzlin. 2004. Using Online Conversations to Study Word-of-Mouth Communication. Marketing Science. 23(4), 545--560. Godes, D., J. C. Silva. 2013. Sequential and Temporal Dynamics of Online Opinion. Marketing Science. 31(3). 448--473. Griffin, A., J. R. Hauser. 1993. The Voice of the Customer. Marketing Science. 12(1), 1--27. Griffiths, T., M. Steyvers. 2004. Finding Scientific Topics. Proceedings of the National Academy of Sciences of the United States of America. 101(Suppl 1), 5228--5235. Hartmann, W., H. S. Nair, S. Narayanan. 2011. Identifying Causal Marketing Mix Effects Using a Regression Discontinuity Design. Marketing Science. 30(6), 1079--1097. Hastie, T., R. Tibshirani, J. Friedman, J. Franklin. 2005. The Elements of Statistical Learning: Data Mining, Inference and Prediction. The Mathematical Intelligencer. 27(2), 83--85. Jalonick, M. C. 2013. FDA Head Says Menu Labeling 'Thorny' Issue. Associated Press, March 12. Kullback, S., R. Leibler. 1951. On Information and Sufficiency. Annals of Mathematical Statistics. 22 (1), 79-86. Kuo, T., C. J. Jarosz, P. Simon, J. E. Fielding. 2009. Menu Labeling as a Potential Strategy for Combating the Obesity Epidemic: a Health Impact Assessment. American Journal of Public Health. 99(9), 1680--1686. Lee, T. Y., E. T. Bradlow. 2011. Automated Marketing Research Using Online Customer Reviews. Journal of Marketing Research. 48(5), 881--894. Lin, J., 1991. 
Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory. 37(1), 145--151.

Li, X., L. M. Hitt. 2008. Self-Selection and Information Role of Online Product Reviews. Information Systems Research. 456--474.

Lu, B., M. Ott, C. Cardie, B. Tsou. 2011. Multi-aspect Sentiment Analysis with Topic Models. Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference. 81--88.

Manyika, J., M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A. H. Byers. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, May.

Mimno, D., M. D. Hoffman, D. M. Blei. 2012. Sparse Stochastic Inference for Latent Dirichlet Allocation. Proceedings of the 29th International Conference on Machine Learning.

Mimno, D., H. M. Wallach, E. Talley, M. Leenders, A. McCallum. 2011. Optimizing Semantic Coherence in Topic Models. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 262--272.

Minka, T. P. 2000. Estimating a Dirichlet Distribution. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf.

Netzer, O., R. Feldman, J. Goldenberg, M. Fresko. 2012. Mine Your Own Business: Market-Structure Surveillance Through Text Mining. Marketing Science. 31(3), 521--543.

Roe, B., A. S. Levy, B. M. Derby. 1999. The Impact of Health Claims on Consumer Search and Product Evaluation Outcomes: Results from FDA Experimental Data. Journal of Public Policy and Marketing. 89--105.

Steyvers, M., T. Griffiths. 2007. Probabilistic Topic Models. Handbook of Latent Semantic Analysis. 427(7), 424--440.

Thistlethwaite, D. L., D. T. Campbell. 1960. Regression-discontinuity Analysis: An Alternative to the Ex Post Facto Experiment. Journal of Educational Psychology. 51(6), 309.

Wierenga, B. 2006. Motion Pictures: Consumers, Channels, and Intuition. Marketing Science. 25(6), 674--677.

Yaniv, I., R. M. Hogarth. 1993.
Judgmental Versus Statistical Prediction: Information Asymmetry and Combination Rules. Psychological Science. 4(1), 58--62.

Zhai, K., J. Boyd-Graber, N. Asadi, M. L. Alkhouja. 2012. Mr. LDA: A Flexible Large Scale Topic Modeling Package Using Variational Inference in MapReduce. Proceedings of the 21st International Conference on World Wide Web. 879--888.

Table 1: Most Frequently Occurring Words by Restaurant Type and Time Period (100 most frequently occurring words per cell)

Chain restaurants, before July 1, 2008: good, love, $, chicken, steak, fries, service, salad, burger, lunch, pretty, chipotle, better, location, burrito, fast, free, best, fresh, cheese, chain, line, cosi, meat, sandwich, nice, wait, didn, bad, burgers, bread, sauce, taste, give, meal, delicious, hot, friend, minutes, square, night, find, side, tasty, guys, city, bit, sandwiches, worth, dinner, white, mexican, wasn, street, staff, isn, feel, nyc, friends, york, burritos, places, lobster, drinks, drink, kind, price, quick, chips, taco, bar, beef, lot, rice, doesn, experience, hour, huge, red, half, extra, bacon, things, makes, bell, games, fun, manhattan, joint, garden, mcdonald, friendly, wanted, cheap, hard, olive, full, expensive, salads, favorite

Standalone restaurants, before July 1, 2008: good, service, $, nice, pretty, bar, best, chicken, night, delicious, love, sauce, didn, better, dinner, bit, cheese, friend, lunch, meal, wait, wine, salad, tasty, fresh, friends, pizza, drinks, sushi, dishes, side, bad, wasn, dish, hot, rice, experience, meat, soup, staff, worth, friendly, atmosphere, dessert, fried, taste, amazing, give, special, find, feel, drink, burger, decor, fries, sweet, prices, favorite, perfect, places, brunch, pork, bread, excellent, nyc, cheap, kind, fish, chocolate, sandwich, lot, city, tables, decent, dining, price, fun, room, served, spicy, steak, thought, large, huge, beef, things, beer, recommend, waiter, full, top, street, happy, york, minutes, french, flavor, home, cream, shrimp

Chain restaurants, after July 1, 2008: good, service, burger, steak, $, fries, salad, lunch, chicken, better, pretty, location, sandwich, love, nice, chipotle, didn, cheese, meal, fresh, fast, best, bad, line, wait, delicious, meat, bread, burrito, bit, staff, burgers, give, experience, minutes, sauce, free, side, chain, taste, wasn, friendly, hot, price, nyc, guys, bar, night, dinner, sandwiches, tasty, friend, pizza, waiter, square, decent, quality, find, cooked, feel, half, city, asked, quick, lobster, drink, lot, subway, drinks, worth, wanted, prices, amazing, places, special, toppings, put, bacon, huge, extra, week, rice, full, hour, happy, medium, thought, shrimp, kind, review, potatoes, shake, white, clean, top, friends, waiting, things, favorite, cheap

Standalone restaurants, after July 1, 2008: good, service, $, nice, delicious, pretty, chicken, best, bar, sauce, night, didn, love, better, dinner, meal, cheese, bit, wait, friend, lunch, fresh, dish, salad, pizza, amazing, wine, drinks, experience, pork, wasn, side, friends, taste, tasty, rice, dishes, bad, burger, fried, friendly, staff, meat, special, worth, sweet, give, soup, dessert, bread, hot, fries, spicy, price, brunch, perfect, drink, sandwich, nyc, flavor, sushi, minutes, atmosphere, favorite, waiter, find, feel, places, served, prices, lot, recommend, fish, happy, thought, excellent, kind, top, decent, beef, tasted, full, steak, hour, beer, wanted, cooked, tables, chocolate, quality, decor, asked, cream, city, things, large, huge, dining, awesome, super

Table 2: Seeded Topics, Other Coherent Topics and Associated Words
(Each entry lists the topic ID, name, coherence score for unseeded topics, and the top 10 words in decreasing order of posterior probability of belonging to the topic.)

1. "Health" (seeded): calories, calorie, healthy, fat, low, diet, protein, count, light, chicken
2. steak_wolfgang (-265): steak, wolfgang, luger, spinach, peter, bacon, steakhouse, creamed, porterhouse, good
3. filet_steak (-301): filet, steak, potatoes, lobster, good, mignon, mashed, crab, spinach, sides
4. chipotle_burrito(1) (-308): chipotle, burrito, rice, beans, salsa, chicken, corn, cheese, sour, cream
5. chipotle_burrito(2) (-311): chipotle, burrito, burritos, qdoba, mexican, bowl, fast, chicken, location, chips
6. juice_orange (-326): juice, orange, cheap, business, buy, plastic, clear, realize, cups, puts
7. seated_server (-337): seated, server, waiter, minutes, hostess, party, reservation, arrived, bar, wait
8. games_game (-344): games, game, fun, play, dave, tickets, kids, card, arcade, playing
9. cake_chocolate (-356): cake, chocolate, dessert, week, good, steak, salad, creme, filet, meal
10. olive_garden (-361): olive, garden, italian, pasta, salad, breadsticks, bread, square, unlimited, og
11. sandwich_potbelly (-364): sandwich, potbelly, sandwiches, bread, peppers, lunch, line, subway, hot, wreck
12. chef_benihana (-365): chef, benihana, hibachi, shrimp, rice, show, chicken, fun, birthday, fried
13. medium_steak (-366): medium, steak, rare, cooked, steaks, meat, good, waiter, filet, plate
14. burger_fries (-375): burger, fries, guys, burgers, toppings, good, free, cheeseburger, regular, Cajun
15. salad_salads (-376): salad, salads, dressing, bowl, ingredients, toppings, lunch, chopped, $, fresh
16. tip_% (-381): tip, %, bill, 18, waiter, service, gratuity, waitress, didn, added
17. cosi_sandwich (-382): cosi, sandwich, bread, salad, sandwiches, tomato, chicken, lunch, good, soup
18. wings_hooters (-383): wings, hooters, girls, beer, waitress, buffalo, hot, waitresses, chicken, pretty
19. onions_lettuce (-384): onions, lettuce, grilled, cheese, sauce, tomatoes, mushrooms, ketchup, tomato, mustard
20. dinner_appetizers (-385): dinner, appetizers, night, good, main, entrees, party, shrimp, drinks, delicious

Table 3: Distributions of Topic Proportions Before and After the Regulation (Chain Restaurants)
(Columns: mean topic proportion across reviews, All / Pre / Post; then proportion of reviews for which the topic proportion exceeds 0.5%, All / Pre / Post.)

"Health": 0.48% / 0.43% / 0.48% | 6.14% / 6.09% / 6.15%
steak_wolfgang: 1.69% / 1.80% / 1.67% | 9.20% / 8.42% / 9.30%
filet_steak: 1.51% / 1.08% / 1.57% | 11.00% / 7.61% / 11.47%
chipotle_burrito(1): 1.56% / 1.57% / 1.56% | 9.38% / 9.94% / 9.30%
chipotle_burrito(2): 2.62% / 3.52% / 2.50% | 14.59% / 16.65% / 14.31%
juice_orange: 0.28% / 0.17% / 0.29% | 3.22% / 4.21% / 3.09%
seated_server: 0.91% / 0.54% / 0.96% | 10.58% / 7.79% / 10.96%
games_game: 1.00% / 1.62% / 0.91% | 5.81% / 7.88% / 5.53%
cake_chocolate: 0.42% / 0.51% / 0.41% | 7.74% / 9.40% / 7.51%
olive_garden: 0.86% / 1.25% / 0.81% | 5.77% / 6.71% / 5.64%
sandwich_potbelly: 1.76% / 0.42% / 1.95% | 9.59% / 6.45% / 10.02%
chef_benihana: 0.52% / 0.47% / 0.53% | 4.25% / 4.39% / 4.23%
medium_steak: 0.65% / 0.61% / 0.65% | 9.07% / 7.88% / 9.23%
burger_fries: 2.55% / 2.40% / 2.57% | 12.72% / 11.37% / 12.91%
salad_salads: 1.35% / 1.26% / 1.36% | 10.52% / 10.65% / 10.50%
tip_%: 0.55% / 0.44% / 0.56% | 6.66% / 4.92% / 6.90%
cosi_sandwich: 1.35% / 1.74% / 1.30% | 8.41% / 10.47% / 8.12%
wings_hooters: 0.54% / 0.82% / 0.51% | 5.31% / 6.80% / 5.10%
onions_lettuce: 0.48% / 0.46% / 0.48% | 7.77% / 7.43% / 7.82%
dinner_appetizers: 0.42% / 0.34% / 0.44% | 7.84% / 7.70% / 7.85%

Table 4: Distributions of Topic Proportions Before and After the Regulation (Standalone Restaurants)
(Columns: mean topic proportion across reviews, All / Pre / Post; then proportion of reviews for which the topic proportion exceeds 0.5%, All / Pre / Post.)

"Health": 0.28% / 0.30% / 0.27% | 13.55% / 14.21% / 13.47%
steak_wolfgang: 0.42% / 0.41% / 0.43% | 17.04% / 16.69% / 17.09%
filet_steak: 0.46% / 0.43% / 0.48% | 21.89% / 21.65% / 21.92%
chipotle_burrito(1): 0.52% / 0.52% / 0.52% | 20.80% / 20.41% / 20.85%
chipotle_burrito(2): 0.50% / 0.55% / 0.45% | 20.41% / 21.41% / 20.28%
juice_orange: 0.55% / 0.53% / 0.57% | 21.34% / 21.51% / 21.32%
seated_server: 0.38% / 0.37% / 0.38% | 18.29% / 19.31% / 18.15%
games_game: 0.87% / 0.91% / 0.84% | 24.26% / 25.15% / 24.15%
cake_chocolate: 0.41% / 0.39% / 0.43% | 19.65% / 20.06% / 19.60%
olive_garden: 0.33% / 0.33% / 0.33% | 17.64% / 18.64% / 17.51%
sandwich_potbelly: 0.40% / 0.39% / 0.40% | 19.65% / 20.74% / 19.51%
chef_benihana: 0.41% / 0.42% / 0.40% | 18.81% / 19.56% / 18.72%
medium_steak: 0.52% / 0.46% / 0.58% | 20.95% / 20.13% / 21.05%
burger_fries: 0.37% / 0.38% / 0.36% | 17.88% / 18.79% / 17.76%
salad_salads: 0.29% / 0.27% / 0.30% | 14.13% / 14.23% / 14.11%
tip_%: 0.58% / 0.70% / 0.48% | 18.06% / 20.01% / 17.81%
cosi_sandwich: 0.40% / 0.43% / 0.38% | 18.06% / 19.15% / 17.92%
wings_hooters: 0.51% / 0.49% / 0.52% | 19.54% / 19.83% / 19.51%
onions_lettuce: 0.91% / 0.92% / 0.90% | 24.05% / 23.85% / 24.07%
dinner_appetizers: 0.28% / 0.30% / 0.27% | 13.85% / 14.80% / 13.73%

Table 5: The Effect of Calorie Posting Regulation on Topic Proportions
(Columns: coefficient of Posting; coefficient of Chain × Posting. Values in parentheses are negative.)

"Health": M (0.05) / 3.50; SD 0.04 / 0.55
steak_wolfgang: M 0.17 / 6.92; SD 0.08 / 1.14
filet_steak: M 0.03 / 16.61; SD 0.07 / 0.95
chipotle_burrito(1): M 0.09 / 17.76; SD 0.08 / 1.09
chipotle_burrito(2): M (0.23) / 27.84; SD 0.08 / 1.20
juice_orange: M (0.14) / (4.63); SD 0.08 / 1.10
seated_server: M (0.13) / 3.39; SD 0.05 / 0.75
games_game: M (0.04) / 1.78; SD 0.12 / 1.80
cake_chocolate: M 0.07 / 1.10; SD 0.06 / 0.83
olive_garden: M (0.11) / 11.02; SD 0.05 / 0.70
sandwich_potbelly: M (0.18) / (1.50); SD 0.05 / 0.73
chef_benihana: M (0.97) / (0.13); SD 0.77 / 0.05
medium_steak: M 1.57 / 0.39; SD 1.17 / 0.08
burger_fries: M (0.07) / 45.05; SD 0.07 / 1.06
salad_salads: M 0.03 / 12.24; SD 0.06 / 0.82
tip_%: M (0.35) / 3.08; SD 0.07 / 1.03
cosi_sandwich: M (0.20) / 3.56; SD 0.06 / 0.90
wings_hooters: M (0.02) / (3.14); SD 0.08 / 1.14
onions_lettuce: M (2.35) / 0.32; SD 1.60 / 0.11
dinner_appetizers: M (0.09) / 2.76; SD 0.04 / 0.54

Note: M and SD stand for the mean and standard error of the coefficient estimate. p-value < 0.05 for parameter estimates in bold.

Table 6: Model Comparison (Perplexity scores for 10 hold out samples)
(Columns: Proposed Model | Model A (deterministic allocation of words) | Model B (unseeded) | Model C (ignores review length).)

Hold out sample 1: 553.9 | 1598.5 | 548.0 | 560.2
Hold out sample 2: 543.9 | 1574.9 | 545.1 | 554.3
Hold out sample 3: 555.5 | 1601.7 | 557.9 | 568.1
Hold out sample 4: 566.0 | 1606.2 | 567.8 | 575.5
Hold out sample 5: 591.2 | 1628.7 | 590.7 | 600.0
Hold out sample 6: 567.4 | 1607.8 | 569.3 | 574.8
Hold out sample 7: 547.1 | 1569.1 | 544.7 | 552.9
Hold out sample 8: 564.3 | 1590.5 | 558.8 | 571.6
Hold out sample 9: 563.9 | 1597.2 | 558.9 | 572.0
Hold out sample 10: 555.6 | 1578.6 | 553.2 | 559.0
Mean across all samples: 560.9 | 1595.3 | 559.4 | 568.8

Figure 1: Document, Topics and Topic Proportions [figure not reproduced]

Figure 2: Distribution of topic proportions for a 20-topic model for various values of α (three panels: α = 0.001, α = 0.25, α = 100, each with β = 0.1; each panel plots the topic proportion θ against the 20 topics) [figure not reproduced]

Online Appendix 1: Derivation of
Equation (1)

Our objective is to estimate:

$$p(\mathbf{w}, \boldsymbol{\theta}, \mathbf{z}, \Phi \mid \alpha, \beta) = p(\mathbf{w} \mid \boldsymbol{\theta}, \Phi, \mathbf{z}) \, p(\boldsymbol{\theta}, \mathbf{z}, \Phi \mid \alpha, \beta) = p(\mathbf{w} \mid \mathbf{z}, \Phi) \, p(\mathbf{z} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \alpha) \, p(\Phi \mid \beta) \quad (A1.1)$$

The joint probability of observing a single document with a given topic assignment is given by:

$$p(\mathbf{w}_d, \mathbf{z}_d \mid n_d, \alpha, \beta) = \iint \prod_{i=1}^{n_d} p(w_{id} \mid z_{id}, \Phi) \, p(z_{id} \mid \boldsymbol{\theta}_d) \, p(\boldsymbol{\theta}_d \mid n_d, \alpha) \, p(\Phi \mid \beta) \, d\boldsymbol{\theta}_d \, d\Phi \quad (A1.2)$$

To get the probability of observing a given document given the hyper-parameters, we integrate over $\boldsymbol{\theta}_d$ and $\Phi$, and sum over the $K$ possible values of $z_{id}$:

$$p(\mathbf{w}_d \mid n_d, \alpha, \beta) = \iint \prod_{i=1}^{n_d} \sum_{z_{id}=1}^{K} p(w_{id} \mid z_{id}, \Phi) \, p(z_{id} \mid \boldsymbol{\theta}_d) \, p(\boldsymbol{\theta}_d \mid n_d, \alpha) \, p(\Phi \mid \beta) \, d\boldsymbol{\theta}_d \, d\Phi \quad (A1.3)$$

A useful result yielding a more compact notation is as follows:

$$p(w_{id} \mid z_{id} = k, \Phi) \, p(z_{id} = k \mid \boldsymbol{\theta}_d) = \prod_{v=1}^{V} \left[ p(w_{id} = v \mid z_{id} = k, \Phi) \, p(z_{id} = k \mid \boldsymbol{\theta}_d) \right]^{I(w_{id} = v)}$$

where $I$ is the indicator function. Therefore,

$$p(\mathbf{w}_d \mid n_d, \alpha, \beta) = \iint p(\boldsymbol{\theta}_d \mid n_d, \alpha) \, p(\Phi \mid \beta) \prod_{i=1}^{n_d} \sum_{k=1}^{K} \prod_{v=1}^{V} \left[ p(w_{id} = v \mid z_{id} = k, \Phi) \, p(z_{id} = k \mid \boldsymbol{\theta}_d) \right]^{I(w_{id} = v)} \, d\boldsymbol{\theta}_d \, d\Phi \quad (A1.4)$$

The model assumptions allow us to simplify this further. Note that if $w_{id}$ is word $v$ in the vocabulary and $z_{id} = k$, then $p(w_{id} = v \mid z_{id} = k, \Phi) = \phi_{k,v}$. Also $p(z_{id} = k \mid \boldsymbol{\theta}_d) = \theta_{d,k}$, since $z_{id}$ is drawn from a categorical distribution with parameter $\boldsymbol{\theta}_d$. Hence

$$p(\mathbf{w}_d \mid n_d, \alpha, \beta) = \iint p(\boldsymbol{\theta}_d \mid n_d, \alpha) \, p(\Phi \mid \beta) \prod_{i=1}^{n_d} \sum_{k=1}^{K} \prod_{v=1}^{V} \left[ \phi_{k,v} \, \theta_{d,k} \right]^{I(w_{id} = v)} \, d\boldsymbol{\theta}_d \, d\Phi \quad (A1.5)$$

Equation (1) is simply the product of (A1.5) over the D documents.

Online Appendix 2: Optimizing Hyper-Parameters Incorporating Document Length

Recall that D is the total number of documents in the corpus and d is the index of a specific document.
$n_d$ is the number of words in document $d$; $k$ is the topic index, and there are $K$ topics in the model; $v$ is a unique word in the vocabulary, whose size is $V$. Finally, $n_{dk}$ is the number of words in document $d$ assigned to topic $k$. To optimize the joint probability with respect to $\alpha$, we focus on the terms that contain $\alpha$:

$$P(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) \propto \prod_{d=1}^{D} \frac{\Gamma(K \alpha n_d) \prod_{k=1}^{K} \Gamma(\alpha n_d + n_{dk})}{\left[\Gamma(\alpha n_d)\right]^K \, \Gamma(K \alpha n_d + n_d)} \quad (A2.1)$$

Taking logs for convenience:

$$\log P(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) \propto \sum_{d=1}^{D} \left[ \log \Gamma(K \alpha n_d) - K \log \Gamma(\alpha n_d) + \sum_{k=1}^{K} \log \Gamma(\alpha n_d + n_{dk}) - \log \Gamma(K \alpha n_d + n_d) \right] \quad (A2.2)$$

At the optimal value $\alpha = \alpha^*$ we have:

$$\log P(\mathbf{w}, \mathbf{z} \mid \alpha^*, \beta) \propto \sum_{d=1}^{D} \sum_{k=1}^{K} \left[ \log \Gamma(\alpha^* n_d + n_{dk}) - \log \Gamma(\alpha^* n_d) \right] + \sum_{d=1}^{D} \left[ \log \Gamma(K \alpha^* n_d) - \log \Gamma(K \alpha^* n_d + n_d) \right] \quad (A2.3)$$

Directly maximizing this log likelihood (by setting its derivative with respect to $\alpha^*$ to 0) yields an intractable expression. We make use of two bounds on gamma functions instead (Minka 2000), where $x$ is a positive real number, $\hat{x}$ is another positive real number (which, as shown below, we can view as an estimate of $x$) and $n$ is an integer:

$$\log \Gamma(x + n) - \log \Gamma(x) \geq \log \Gamma(\hat{x} + n) - \log \Gamma(\hat{x}) + \hat{x} \left[ \Psi(\hat{x} + n) - \Psi(\hat{x}) \right] \left( \log x - \log \hat{x} \right) \quad (A2.4a)$$

and

$$\log \Gamma(x) - \log \Gamma(x + n) \geq \log \Gamma(\hat{x}) - \log \Gamma(\hat{x} + n) + \left[ \Psi(\hat{x} + n) - \Psi(\hat{x}) \right] \left( \hat{x} - x \right) \quad (A2.4b)$$

where $\Psi$ is the digamma function. We can consequently rewrite equation A2.3 by substituting the bounds defined in equations A2.4a and A2.4b. For the first portion of equation A2.3 we use equation A2.4a, with $x = \alpha n_d$ and $n = n_{dk}$; for the latter half of equation A2.3 we use equation A2.4b, with $x = K \alpha n_d$ and $n = n_d$. This results in:

$$\log P(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) \geq \sum_{d=1}^{D} \sum_{k=1}^{K} \left\{ \log \Gamma(\alpha^* n_d + n_{dk}) - \log \Gamma(\alpha^* n_d) + \alpha^* n_d \left[ \Psi(\alpha^* n_d + n_{dk}) - \Psi(\alpha^* n_d) \right] \left[ \log(\alpha n_d) - \log(\alpha^* n_d) \right] \right\}$$
$$+ \sum_{d=1}^{D} \left\{ \log \Gamma(K \alpha^* n_d) - \log \Gamma(K \alpha^* n_d + n_d) + \left[ \Psi(K \alpha^* n_d + n_d) - \Psi(K \alpha^* n_d) \right] \left( K \alpha^* n_d - K \alpha n_d \right) \right\} \quad (A2.5)$$

We can group all terms that do not involve $\alpha$ into a constant term.
$$\log P(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) \geq \sum_{d=1}^{D} \sum_{k=1}^{K} \alpha^* n_d \left[ \Psi(\alpha^* n_d + n_{dk}) - \Psi(\alpha^* n_d) \right] \log(\alpha n_d) - \sum_{d=1}^{D} \left[ \Psi(K \alpha^* n_d + n_d) - \Psi(K \alpha^* n_d) \right] K \alpha n_d + C \quad (A2.6)$$

Maximizing this lower bound with respect to $\alpha$ moves the bound closer to the maximum of the log likelihood (the second derivative with respect to $\alpha$ is negative, so the stationary point is a maximum). Taking the derivative with respect to $\alpha$ and setting it to 0 yields:

$$\sum_{d=1}^{D} \sum_{k=1}^{K} \alpha^* n_d \left[ \Psi(\alpha^* n_d + n_{dk}) - \Psi(\alpha^* n_d) \right] \frac{1}{\alpha} - \sum_{d=1}^{D} \left[ \Psi(K \alpha^* n_d + n_d) - \Psi(K \alpha^* n_d) \right] K n_d = 0 \quad (A2.7)$$

This yields a simple update:

$$\alpha = \alpha^* \, \frac{\sum_{d=1}^{D} n_d \sum_{k=1}^{K} \left[ \Psi(\alpha^* n_d + n_{dk}) - \Psi(\alpha^* n_d) \right]}{K \sum_{d=1}^{D} n_d \left[ \Psi(K \alpha^* n_d + n_d) - \Psi(K \alpha^* n_d) \right]} \quad (A2.8)$$

We can iteratively update the value of $\alpha^*$ until convergence, yielding the maximum likelihood estimate of $\alpha^*$. In practice this converges relatively quickly. We can similarly derive the update for the hyper-parameter $\beta$.

Online Appendix 3: Other Topics Names and Coherence Scores (C.S.)

1 white_castle (-385); 2 capital_grille (-391); 3 drink_drinks (-391); 4 steak_chris (-391)
5 lunch_lines (-394); 6 side_bison (-396); 7 staff_friendly (-398); 8 bbq_chicken (-399)
9 fish_tacos (-399); 10 full_wonderful (-399); 11 coffee_tea (-402); 12 free_cosi (-402)
13 chicken_fried (-403); 14 pizza_delivery (-404); 15 steak_house (-405); 16 good_pretty (-414)
17 manager_told (-415); 18 burger_shake (-416); 19 customers_counter (-418); 20 friday_tgi (-419)
21 seating_tables (-419); 22 friend_didn (-423); 23 subway_sandwich (-423); 24 chips_mexican (-424)
25 palm_steak (-426); 26 good_service (-427); 27 sandwiches_pret (-427); 28 burger_fries (-429)
29 friends_group (-429); 30 recommend_highly (-430); 31 breakfast_egg (-433); 32 fries_french (-434)
33 service_slow (-436); 34 minutes_wait (-438); 35 square_ruby (-438); 36 bacon_tomato (-440)
37 chick_fil (-443); 38 wine_glass (-443); 39 bad_reviews (-444); 40 soup_bon (-444)
41 dont_lot (-446); 42 delicious_fresh (-447); 43 potato_sweet (-449); 44 eye_rib (-450)
45 didn_asked (-451); 46 fast_wendy (-453); 47 fresh_ingredients (-454); 48 high_quality (-454)
49 tasted_didn (-455); 50 flavor_taste (-457); 51 good_better (-457); 52 extra_$ (-458)
53 huge_portions (-458); 54 outback_steak (-458); 55 didn_worst (-460); 56 !!_!!! (-463)
57 hit_miss (-463); 58 birthday_coupon (-464); 59 standard_good (-464); 60 cream_ice (-466)
61 york_city (-466); 62 service_bit (-467); 63 cashier_?" (-469); 64 happy_hour (-471)
65 nice_options (-472); 66 ihop_pancakes (-476); 67 lunch_office (-476); 68 good_give (-478)
69 line_lunch (-479); 70 ---_ny (-480); 71 free_better (-480); 72 nice_bit (-480)
73 gift_card (-481); 74 service_good (-481); 75 didn_boyfriend (-482); 76 money_worth (-484)
77 wasn_bad (-484); 78 $_99 (-485); 79 bit_better (-485); 80 service_good (-485)
81 quick_bite (-486); 82 hard_rock (-487); 83 beef_patties (-488); 84 good_nice (-488)
85 soda_coke (-488); 86 $_pay (-489); 87 onion_rings (-489); 88 door_top (-490)
89 good_expectations (-490); 90 better_price (-491); 91 location_number (-491); 92 surprised_find (-491)
93 things_.) (-491); 94 ago_good (-492); 95 hot_dog (-492); 96 kind_doesn (-493)
97 pretty_cheap (-493); 98 meal_end (-495); 99 pretty_cool (-496); 100 guy_lady (-497)
101 lobster_red (-497); 102 counter_walk (-498); 103 isn_places (-498); 104 love_yum (-498)
105 good_pretty (-499); 106 good_nice (-501); 107 room_dining (-501); 108 avoid_awful (-502)
109 bathroom_clean (-502); 110 location_locations (-503); 111 top_pieces (-503); 112 bar_drinks (-504)
113 bread_panera (-506); 114 guys_thought (-506); 115 half_rest (-506); 116 side_dish (-506)
117 street_location (-506); 118 wait_district (-506); 119 water_asked (-506); 120 location_week (-508)
121 +_$ (-510); 122 service_husba (-511); 123 walked_today (-511); 124 price_pay (-512)
125 taco_bell (-512); 126 lunch_hours (-513); 127 better_girl (-514); 128 simply_deal (-514)
129 chicken_salad (-516); 130 boston_market (-518); 131 review_write (-518); 132 give_show (-519)
133 nice_friendly (-522); 134 wanted_thought (-522); 135 !!!_!!!! (-523); 136 delicious_mouth (-524)
137 nyc_places (-524); 138 special_dinner (-524); 139 meat_flavor (-525); 140 brother_quickly (-526)
141 prices_decent (-526); 142 bag_paper (-528); 143 service_slow (-528); 144 put_point (-529)
145 bad_wasn (-530); 146 good_blah (-530); 147 free_bit (-531); 148 location_full (-531)
149 cold_warm (-532); 150 salt_salty (-532); 151 sauce_sweet (-534); 152 check_awesome (-536)
153 meal_start (-536); 154 prices_prepared (-536); 155 night_late (-537); 156 cheese_mac (-538)
157 heart_attack (-538); 158 chain_nyc (-542); 159 hate_yeah (-543); 160 walking_distance (-544)
161 feel_makes (-545); 162 good_things (-546); 163 looked_didn (-552); 164 love_.) (-552)
165 good_pretty (-557); 166 experience_ser (-558); 167 single_properl (-558); 168 experience_din (-561)
169 wall_street (-561); 170 decided_give (-565); 171 isn_reason (-566); 172 store_find (-567)
173 good_..... (-569); 174 good_chain (-570); 175 mcdonald_mc (-571); 176 best_love (-573)
177 manhattan_cit (-580); 178 best_nyc (-583); 179 meat_hell (-583); 180 home_awesom (-625)
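The fixed-point update for α derived in Online Appendix 2 (equation A2.8) is straightforward to implement. A minimal sketch, assuming the topic-assignment counts n_dk are available as a list of per-document lists; the digamma approximation and all names here are ours, not from the paper:

```python
import math

def digamma(x):
    """Psi(x) via upward recurrence plus an asymptotic expansion (x > 0)."""
    result = 0.0
    while x < 6.0:                      # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (
        1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def update_alpha(ndk, alpha0=0.1, tol=1e-8, max_iter=500):
    """Iterate the fixed-point update of eq. (A2.8) until convergence.

    ndk : list of per-document lists; ndk[d][k] = words in document d
          assigned to topic k. Document lengths n_d are the row sums.
    """
    D, K = len(ndk), len(ndk[0])
    nd = [sum(row) for row in ndk]
    alpha = alpha0
    for _ in range(max_iter):
        # numerator: sum_d n_d sum_k [psi(a n_d + n_dk) - psi(a n_d)]
        num = sum(nd[d] * sum(digamma(alpha * nd[d] + ndk[d][k])
                              - digamma(alpha * nd[d]) for k in range(K))
                  for d in range(D))
        # denominator: K sum_d n_d [psi(K a n_d + n_d) - psi(K a n_d)]
        den = K * sum(nd[d] * (digamma(K * alpha * nd[d] + nd[d])
                               - digamma(K * alpha * nd[d]))
                      for d in range(D))
        new_alpha = alpha * num / den
        if abs(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```

Because each iteration multiplies the current α by a positive ratio, the estimate stays positive throughout, matching the requirement that the Dirichlet parameter be strictly positive.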