Knowl Inf Syst
DOI 10.1007/s10115-012-0548-z

REGULAR PAPER

Gender, writing and ranking in review forums: a case study of the IMDb

Jahna Otterbacher

Received: 18 February 2011 / Revised: 5 April 2012 / Accepted: 22 August 2012
© Springer-Verlag London Limited 2012

Abstract  Online review forums provide consumers with essential information about goods and services by facilitating word-of-mouth communication. Although preferences are correlated with demographic characteristics, reviewer gender is not often provided on user profiles. We consider the case of the Internet Movie Database (IMDb), where users exchange views on movies. Like many forums, IMDb employs collaborative filtering such that by default, reviews are ranked by perceived utility. IMDb also provides a unique gender filter that displays an equal number of reviews authored by men and women. Using logistic classification, we compare reviews with respect to writing style, content and metadata features. We find salient differences in stylistic features and content between reviews written by men and women, as predicted by sociolinguistic theory. However, utility is the best predictor of gender, with women's reviews perceived as being much less useful than those written by men. While we cannot observe who votes at IMDb, we do find that highly rated female-authored reviews exhibit "male" characteristics. Our results have implications for which contributions are likely to be seen, and for the extent to which participants get a balanced view as to "what others think" about an item.

Keywords  Gender · Information filtering · Social voting · Text classification

J. Otterbacher
Department of Humanities, Illinois Institute of Technology, Chicago, IL 60616, USA
e-mail: [email protected]; [email protected]

1 Introduction

When investigating goods and services of interest, consumers often seek out third-party information [10], such as consumer-contributed reviews posted at online forums. Such "word-of-mouth" on the web enables people to learn about others' opinions of and experiences with an item, before investing the time and money in consuming it [42]. In other words, one way consumers use such forums is to get a general impression of "what others think" about an item of interest.

We examine a popular site, the Internet Movie Database (IMDb.com), at which consumers exchange views on a type of hedonic good, movies. In contrast to more utilitarian goods (e.g., pencils or insect repellent), the consumption of hedonic goods, also referred to as experience goods, resembles a personal experience [18]. Therefore, factors such as a consumer's age and gender correlate with his or her preferences for these items [19]. For instance, while perhaps stereotypical, certain genres of movies (e.g., "tearjerkers") tend to appeal more to women than men [35]; thus, it is expected that such preferences will be reflected in consumers' impressions and assessments of the movies they review.

Despite this correlation between demographics and preferences for hedonic goods, it is often difficult for consumers to find out who the reviewers are. For instance, reviewer gender is a potentially useful attribute for filtering reviews of experience goods. However, in many forums, one cannot determine the gender of reviewers with a high degree of accuracy. In online environments, many participants intentionally maintain a gender-neutral identity, so as to avoid attracting undesirable attention and to be taken more seriously [6].
More generally, many have privacy concerns [1,45] and choose not to disclose personal information. This is a serious challenge for information seekers. People often use demographic characteristics to judge whether or not someone is similar to themselves [47], which then helps them in interpreting information and relating it to their own contexts. In fact, because of the potential of demographic information to enhance the process of information seeking, researchers have considered how to automatically infer user characteristics, including gender (e.g., [20,36]). In addition to providing cues to information seekers, effective means to infer user attributes and behaviors can enable site designers to appropriately personalize content or make recommendations to users (e.g., [2,38]).

1.1 IMDb's gender filter

The Internet Movie Database provides a unique opportunity to explore author gender prediction in the context of a consumer-generated review forum. At IMDb, there is a filter allowing users to identify reviews written by male and female participants. The intent is to allow users to gain a gender-balanced perspective on a movie of interest; when the filter is selected, an equal number of reviews authored by men and women is shown in an interleaved display, as shown in Fig. 1. Alternatively, the default display is IMDb's collaborative filtering (i.e., social voting) mechanism, the "Best" filter, under which the reviews are sorted according to what other participants voted as being "useful." The filter might be described as a sociotechnical means through which users can share their preferences with one another, helping others to learn of interesting content (e.g., [34]).

Searching for relevant content in large online communities is challenging for many reasons, and general-purpose search tools are not always effective in such contexts [49]. At IMDb, popular films often have several hundred, and sometimes even thousands, of user-contributed reviews. Therefore, filtering the reviews by particular attributes, such as perceived usefulness (i.e., "Best"), can greatly reduce the threat of information overload to users who want to learn what others think of a given film. Other approaches for helping users glean pertinent information from large sets of reviews include the automatic identification and summarization of the opinions expressed in reviews (e.g., [51]) as well as identifying a small set of the most comprehensive reviews (e.g., [48]). In contrast, the IMDb filter mechanism does not change the set of reviews presented to the user. Rather, the filter simply reorders them, prioritizing those with desirable attributes.

Fig. 1 Reviewer gender filter at IMDb

Figure 1 shows a review of the classic film Casablanca. The gender filter has been selected, and it can be seen that a male participant wrote this particular review. As shown, there are 142 reviews (out of a total of 753 contributed reviews for Casablanca) for which author gender is available. Finally, above each review, the perceived utility is displayed in the form "x of y people found the following review useful."^1 Feedback is solicited following each review, where participants are asked if they found the review useful or not. This feedback is used to determine the presentation order of the reviews under the default filter, "Best." IMDb is unique in that it provides filters beyond perceived utility.

^1 Following [52], we define utility as the number of users who found a review useful divided by the total number of votes received (i.e., x/y).
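To make the utility score concrete, the following is a minimal sketch of how the displayed feedback string maps to the x/y definition in footnote 1. The regular expression and function are our own illustration of the idea; IMDb does not expose such an interface.

```python
import re

# Pattern matching the displayed feedback line (our own parsing sketch,
# not an IMDb API): "x of y people found the following review useful"
FEEDBACK = re.compile(r"(\d+) of (\d+) people found the following review useful")

def utility(feedback_line: str) -> float:
    """Perceived utility as defined in footnote 1: x / y (0.0 if no votes)."""
    m = FEEDBACK.search(feedback_line)
    if m is None:
        raise ValueError("unrecognized feedback line")
    x, y = int(m.group(1)), int(m.group(2))
    return x / y if y > 0 else 0.0

print(utility("142 of 190 people found the following review useful"))  # ~0.747
```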
Previous research has highlighted the biases inherent in the simple collaborative filtering mechanisms that have become commonplace. Much of this research examines the "helpfulness" mechanism at Amazon.com. This mechanism is very similar to IMDb's "Best" filter, although Amazon asks participants if they find reviews "helpful." Ghose and Ipeirotis [13] noted that it takes time to accumulate sufficient votes for review rankings to become meaningful, a problem similar to that of "cold starts" in recommendation systems [41]. Liu et al. [28] warned of an "early bird" bias; reviews posted earlier have had more time to collect votes, thus they have an advantage over more recent contributions. They also noted a "winners' circle" bias, in which the top-ranked items are unlikely to be displaced by new contributions. Since highly rated reviews are displayed first, and since users are unlikely to read (and subsequently to vote on) many reviews [21], what is perceived as helpful remains so over time. Finally, in their analyses, Danescu-Niculescu-Mizil et al. [9] found that there is a penalty for reviews that deviate from the average opinion of the product. More specifically, when a reviewer's rating of a product is below the average over all users, the associated review is less likely to be seen as helpful by others. The provision of additional review filters beyond popularity or helpfulness appears to be an attempt to expose readers to a more diverse set of contributions.

1.2 Gender and communication

In addition to having distinct preferences for hedonic goods, research suggests that men and women have different communication styles, both in face-to-face and computer-mediated environments. These differences likely affect the extent to which women participate in online forums such as IMDb, as well as how women's contributions are received by others and thus, how visible they are in the forum.

Early work on gender differences in language focused on communication in face-to-face settings, often in a group. Lakoff [26] proposed ten salient characteristics of "women's language." For instance, she found that women and men often choose to use different lexical items in similar situations (e.g., women describe colors more vividly than men). Another finding was that women make greater use of hedge phrases (e.g., "kind of," "sort of"), which effectively soften an argument, presumably making it more acceptable to an interlocutor. Similarly, Tannen [46] theorized that women use "rapport talk" while men adopt a style described as "report talk," contrasting women's emphasis on building and supporting social relationships with men's focus on conveying information. In addition to manner, the topics discussed by all-female groups versus all-male groups have been noted to differ, with men discussing objects (e.g., cars) and women talking about people and relationships [8].

Herring has found that many gender differences in conversational speech also apply to computer-mediated settings [15–17]. For instance, in online forums, men dominate interactions. Women participants typically post fewer messages, and are less likely to receive responses from others.
In a study of academic discussion lists, Herring also described a "list effect," in which contributors write using the stylistic features of the dominant gender (i.e., in male-majority lists, both women and men participants posted in a "male" voice) [16]. In addition, while female-authored messages tended to foster relationships with others (i.e., were more inclusive, expressed emotion, used hedging and apologies), male-authored messages tended to be more competitive and assertive in nature. These findings are echoed by Gefen and Ridings [12], who explained that communication differences have to do with how men and women use shared information spaces. They claimed that while male participants value the chance to showcase their knowledge to others, women often seek or offer social support for solving problems. Finally, Awad and Ragowsky [5] studied the mediating effect of gender in judging the quality of online word-of-mouth in electronic commerce contexts. Like Gefen and Ridings, they found that men appreciate the opportunity to sound off about an item online, while women value the chance to interact with and receive feedback from others.

1.3 Goals of the current work

As we have seen, men and women have different preferences for hedonic goods, use online forums in various ways, and adopt distinct communication styles. Our current goal is to examine the differences between movie reviews written by male and female reviewers at IMDb, with respect to three categories of attributes: (1) writing style, (2) thematic content and (3) review metadata. As will be detailed in Sects. 2–4, we use a logistic regression framework to examine how accurately we can infer author gender. We will show that male and female participants' reviews do differ significantly with respect to writing style and content, as predicted by sociolinguistic theory. However, these features are not nearly as important as perceived utility in inferring author gender. In Sect. 6, we provide possible explanations for this difference in perceived utility. While we cannot observe who votes on the utility of reviews, we do find that highly rated reviews by females are characterized by "male" features which, given Herring's findings [16], might suggest that IMDb is a male-majority forum. Finally, Sect. 7 relates our findings to the broader issues of the use of simple collaborative filtering in online forums and presents areas for future work.

2 Inferring author gender

We now define the task of predicting author gender in a set of IMDb movie reviews. We then relate our task to previous research on the prediction of author gender.

Definition 2.1 Given the set S of user-contributed reviews about movie j, predict the author gender of each review s ∈ S. We assume that all information seen by users at the main page of the movie's review forum is available for use.

We model features of author writing style and review content, as most approaches to text categorization rely on features that characterize these two aspects of a text. Commonly used features include token-level measures (e.g., average word or sentence length), vocabulary richness (i.e., size of vocabulary/length) and counts of occurrences of commonly used words [44]. In addition, researchers have applied dimension-reduction techniques (e.g., factor analysis) to word frequencies (e.g., [4]). In addition to style and content, we exploit information available at IMDb.
This information includes movie metadata (e.g., the number of reviews contributed, which gives us a measure of the movie's popularity), as well as review metadata (e.g., date posted, reviewer's rating of the movie). In addition, we exploit metadata of the collaborative filtering mechanism (i.e., utility scores). These attributes are fully detailed in Sect. 4.

Most research on inferring author gender concerns "clean" text, as opposed to the user-contributed content that has become so characteristic of the web. For instance, [23] classified a set of fiction and non-fiction documents from the British National Corpus (BNC). In addition to lexical features (frequencies of commonly used words), they exploited a set of "quasi-syntactic" features, since the BNC is tagged for parts of speech. They reported 80 % accuracy in predicting author gender. Argamon et al. [3] also predicted gender for formal texts from the BNC; however, they exploited different features. In particular, they found differences in the use of pronouns. While male and female authors made references to nominals (i.e., wrote about "things") with similar frequencies, women's use of pronouns was much greater, and they used more first person pronouns. More recently, they turned their attention to inferring author gender in blog text [4]. They compared themes discussed by male versus female bloggers. To derive key themes, they used the "meaning extraction method" [7], performing factor analysis on frequencies of the most common words in the corpus. They discovered 20 themes, and found differences between the texts of male and female bloggers. For instance, "religion" and "politics" were common themes discussed by men, whereas "conversation" and "at home" were popular with women. They reported classification accuracies "around 80 %."

We are not aware of any work that considers the prediction of author gender in postings to an online forum or community. First, reviews posted to such sites are not clean texts. Most review forums are not closely moderated and typically allow anyone to post with few barriers to entry (i.e., after creating a user name and password). Reviews often contain grammatical errors and may even be off topic. Second, movie reviews are significantly shorter than most texts analyzed in previous work. For instance, in [23], the average text length was over 34,000 words. In [4], the minimum blog length was 500 words. By contrast, our mean review length is only 256 words. Therefore, a movie review provides fewer clues about author writing style and the themes that she or he might discuss. On the other hand, since movie reviews are part of an ongoing discourse about a film, as compared with being stand-alone texts, we can exploit metadata that provides clues about the social aspects of the forum. Therefore, in addition to stylistic and content features, we include movie and review metadata.

3 Data

To create a movie review corpus,^2 we relied on the list of the 250 top films of all time (as of January 2010), as voted by IMDb participants, reasoning that they draw large audiences and would have many reviews written by both men and women. At each movie forum, we used the "male/female" filter to collect reviews with gender information. For the 250 movies studied, 31,300 reviews had gender identity, out of 157,065 total reviews (19.9 %). Within each movie, the proportion of reviews with gender identity ranged from 54 % to only 2 % (median of 18.5 %). It is not possible to calculate the true proportion of reviewers who have disclosed their gender, because IMDb always displays an equal number of male- and female-authored reviews. Therefore, we should keep in mind that we are only able to study the differences between male- and female-authored reviews for the subset of reviewers who have provided their gender identities.

^2 While there are other movie review corpora available (e.g., for studying sentiment analysis), we were not able to find existing data with author gender.

Before performing analyses that involved word counts, we performed stemming. We also disregarded words that occurred only once in the entire data set. As in [39], we reasoned that such terms were very likely to be junk, such as misspellings or leftover HTML artifacts.
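A minimal sketch of this preprocessing step, assuming NLTK's Porter stemmer; the whitespace tokenization and the function name are our own illustrative choices, not necessarily the exact pipeline used.

```python
from collections import Counter

from nltk.stem import PorterStemmer

def preprocess(reviews):
    """Stem all tokens, then drop terms occurring only once in the corpus."""
    stemmer = PorterStemmer()
    stemmed = [[stemmer.stem(tok) for tok in text.lower().split()]
               for text in reviews]
    corpus_counts = Counter(tok for review in stemmed for tok in review)
    # Hapax legomena are likely junk (misspellings, leftover HTML artifacts)
    return [[tok for tok in review if corpus_counts[tok] > 1]
            for review in stemmed]
```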
3.1 Characteristics

Table 1 presents an overview of corpus attributes. As can be seen, even though there are an equal number of male- and female-authored reviews in the data, male reviewers have written over a million words more than have female reviewers. In addition, over all male-contributed reviews, we find a larger vocabulary as compared with female-authored reviews.

Table 1 Corpus attributes

                                    Male        Female      All
  Reviews                           15,650      15,650      31,300
  Unique reviewers                  10,394      10,618      21,012
  Reviews per movie (mean/median)   NA          NA          125.2/69
  Total words                       4,625,377   3,396,819   8,022,196
  Vocabulary size                   27,972      23,403      35,783

Table 2 presents summary statistics for review length in words, as well as the total feedback votes received and the perceived utility of the review. As these distributions are often skewed, the table shows both means and medians. It can be seen that there are significant differences between male- and female-authored reviews; a Wilcoxon–Mann–Whitney rank-sum test [30] for evaluating the difference between the medians of males versus females has a p value of approximately zero for all three attributes.

Table 2 Review attributes (mean/median)

                   Male        Female
  Length (words)   295.3/222   216.8/162
  Total votes      30.7/3      7.1/1
  Utility          0.60/0.67   0.26/0

4 Research framework

We use logistic regression [32] to model the log odds of a review being written by a female versus a male. More formally, the model to be estimated is:

  \ln \frac{\Pr(y_i = 1)}{\Pr(y_i = 0)} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n    (1)

where x_1 to x_n are the predictors and y_i is the response (taking the value of 0 if the author is a man and 1 if a woman). Thus, male is the default and we are predicting the event of a reviewer being female. Stated otherwise, given a vector of n variables describing a review, we use the predicted likelihood to assign the vector to one of two groups (i.e., if the likelihood of being female is greater than 0.5, the author is labeled as female; otherwise the label is left as male).

It should be noted that a variety of statistical and machine-learning approaches has been applied to text classification problems. Citing [50], the authors of [44] explain that for relatively large data sets (i.e., at least 300 observations for each class), there is little difference in the performance of most classifiers on text classification tasks. Since our focus is to compare the contribution of the three types of information, rather than to contribute to machine learning research, we employ a multivariate statistics technique. Logistic regression allows us to model a dichotomous response variable, while avoiding assumptions as to the normality of our predictor variables.
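As an illustration, Eq. (1) can be estimated with off-the-shelf tools. The sketch below uses statsmodels on toy data standing in for the feature vectors of Sect. 4; it shows the mechanics only, not the paper's actual feature construction or the stepwise fitting procedure of Sect. 5.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))        # stand-in predictors x_1 .. x_n
y = rng.integers(0, 2, size=400)     # response: 1 = female, 0 = male (default)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)   # estimates Eq. (1)
print(model.params)           # b_0 .. b_n on the log-odds scale
print(np.exp(model.params))   # the corresponding odds ratios
print(model.prsquared)        # McFadden's pseudo R^2 (cf. Sect. 5.1)

# Classification rule from Sect. 4: label female when Pr(female) > 0.5
labels = (model.predict(sm.add_constant(X)) > 0.5).astype(int)
```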
4.1 Writing style

Table 3 presents the aspects of author writing style we have incorporated into our analysis and provides a short justification for each. As shown, previous researchers have discussed these attributes when comparing the communication styles of women and men. The attributes we model are: the extent to which a movie reviewer adopts an inner- versus outer-focused discourse (i.e., the extent to which he or she acknowledges a perceived audience, or tends to use a reporting style), the use of hedge phrases, the complexity of one's writing style, the richness of vocabulary used in a review and patterns of frequent word use (i.e., lexical markers).

Table 3 Writing style features

  Inner- versus outer-focused discourse
    # Pronouns (PN)/total words; # first person PN; # second person PN; # third person PN
    Researchers have found that women use more pronouns than men. We compute the rate of PN use, as well as the number of first, second and third person PNs

  Hedging
    # Hedges used/total words
    We use a list of 55 common hedges from [25]. Hedging is characteristic of women's language

  Complexity
    Characters/words; characters/words in title; words/sentences
    These basic measures of writing complexity have been used in automatic essay scoring [11]

  Vocabulary richness
    # Unique words/total words
    Quantifies the diversity of lexical items used. Stamatatos et al. [44] warn that it may not be stable in short texts

  Lexical markers
    Frequency of top 50 words
    We create a list of the 50 highest-frequency words and count their occurrences in each review

4.2 Content

We consider five metrics that quantify aspects of review content. The first four metrics presented in Table 4 tell us something about the content of a review with respect to the other reviews written about the same movie.

The first is a measure of review centrality. To compute centrality, we create a centroid, a vector consisting of all words occurring in the entire set of reviews about movie j. Each element corresponds to the tf-idf weight for the given word [40]. The centroid score for review s, which is the sum of the tf-idf scores of all words in review s, quantifies the extent to which it contains words that are important across the set of reviews on the specific movie.

Perplexity, entropy and out-of-vocabulary rate (OOV) are measures of textual uniqueness, based on a language modeling framework [31]. We view the set of reviews as a bag of words, from which a probability distribution is derived. The creation of review s is viewed as a sequence of randomly selected words from the distribution that results from all other reviews about movie j (i.e., excluding review s). In other words, the random variable X can take on values (i.e., words) in a discrete set of symbols, which is the vocabulary used across all of the other reviews about the movie in question. Given this probability distribution for X, the entropy for a given movie review is the average uncertainty of X:

  H(p) = -\sum_{x} p(x) \log_2 p(x)    (2)

Perplexity is related to entropy as follows:

  \mathrm{perplexity} = 2^{H(p)}    (3)

and provides a measure of uncertainty that is perhaps easier to interpret. A perplexity of k indicates a level of uncertainty or surprise in the variable, equivalent to that faced when choosing between k events that are equally likely to occur. Finally, OOV is simply the proportion of words used in review s that were not used in any other of the reviews written about movie j.

Table 4 Review content features

  Centrality   A document-level centroid score [37] quantifies the extent to which review s is important in the discourse about movie j
  Uniqueness   Perplexity [31] measures the extent to which review s is surprising. A perplexity of k means that guessing the words in the review is as difficult as guessing between k items of equal probability. Entropy [31] quantifies the average uncertainty of review s, expressed as the number of bits required to encode it. Out-of-vocabulary rate (OOV) is the proportion of words in review s that do not appear in any other reviews of movie j
  Themes       Latent semantic analysis (LSA). We use the set of 31,300 reviews to create a semantic space on which to perform LSA. We discover the top 20 themes used in our IMDb movie reviews. For each review, we use the scores on these 20 themes to represent the content discussed
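A minimal sketch of the uniqueness measures for a single review s, scored against the other reviews of the same movie. Here the per-review entropy is computed as the average per-word surprisal under the background unigram model, and add-one smoothing handles unseen words; both are our own reading of the description above, as neither detail is fully specified in the text.

```python
import math
from collections import Counter

def uniqueness(review, other_reviews):
    """Entropy (bits/word), perplexity and OOV rate of review s against the
    unigram distribution of all other reviews about the same movie."""
    background = Counter(tok for r in other_reviews for tok in r)
    total = sum(background.values())
    vocab = set(background)

    oov = sum(tok not in vocab for tok in review) / len(review)

    # Add-one smoothing so unseen words receive non-zero probability
    def p(tok):
        return (background[tok] + 1) / (total + len(vocab) + 1)

    entropy = -sum(math.log2(p(tok)) for tok in review) / len(review)  # Eq. (2)
    return entropy, 2 ** entropy, oov                                  # Eq. (3)
```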
Our last content feature is based on latent semantic analysis (LSA) [27]. We first construct a semantic space using our movie review data. Specifically, we begin by computing word counts across all reviews. We then consider the 500 most frequent words. Thus, we have a matrix with dimensions 31,300 by 500, which we subject to a factor analysis. We derive the first 20 factors, corresponding to the eigenvectors with the 20 highest eigenvalues. Finally, we compute the scores on these 20 factors for each review. Our analysis will examine if, based on this representation, there are differences in content between the male- and female-authored reviews.

To choose the number of factors, we compared candidate models to a baseline with only one factor and computed Bentler and Bonnett's normed fit index (NFI) [29]. To achieve a good fit (i.e., NFI of 0.90 or greater), we need approximately 100 factors. However, interpretation then becomes difficult, as many factors have small eigenvalues. Therefore, we also consider the Kaiser criterion, which states that any factor with an eigenvalue less than one should be dropped [22]. We experiment with the model having 20 factors, whose smallest eigenvalue is slightly greater than 1.
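A sketch of this construction, assuming scikit-learn. TruncatedSVD stands in for the factor analysis described above (it is the usual LSA implementation, not necessarily the exact procedure used), and the three-review corpus is a toy stand-in for the 31,300-review data set.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "i think the film was great and the acting was great",
    "his performance as the killer made the plot dark and violent",
    "my family watched this classic and we all loved it",
]  # toy stand-in for the full corpus

# Counts of the 500 most frequent words (the toy vocabulary is far smaller)
counts = CountVectorizer(max_features=500).fit_transform(reviews)

# 20 factors in the paper; the toy corpus only supports 2
lsa = TruncatedSVD(n_components=2, random_state=0)
theme_scores = lsa.fit_transform(counts)   # per-review scores on each theme
print(theme_scores.shape)                  # (n_reviews, n_components)
```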
Table 5 describes the factors with suggested interpretations and shows the words with the largest loadings on each factor. As can be seen, the majority of the factors (e.g., F1, F2, F12, F14) have straightforward interpretations. However, others are less obvious. In particular, it is difficult to distinguish between the meanings of F5 and F10, both of which involve words used to express positive sentiments about movies. It can also be seen that several factors (e.g., F1, F2) overlap with stylistic features.

Table 5 Factors, interpretations and lexical items with largest loadings

  F1   Articles/prepositions   the, of, to, a, in, on, that, with
  F2   First person            I, my, me, am, think, saw
  F3   Third person            he, his, him, himself, man, wife
  F4   Second person           you, your, if, go, see, want
  F5   Positive impression     good, really, like, pretty, funny
  F6   Success                 perform, best, role, actor, Oscar
  F7   Fantasy                 ring, battle, fantasy, book, fan, version
  F8   Feminine                her, she, woman, girl, beautiful, relationship
  F9   Violence                scene, plot, end, kill, murder, dead
  F10  Others liked            movie, watch, great, people, like, good
  F11  Location                unit, state, theater
  F12  All time hit            ever, best, greatest, classic, favorite
  F13  Groups                  they, them, people, other, we, our, together
  F14  Family                  kid, girl, children, family, comedy, father
  F15  Aux/modals/past         was, were, had, did, didn't, would
  F16  Aux/modals              have, been, be, if, any, may, should, do
  F17  Comic                   Batman, dark, comic, hero, action, fan
  F18  Relations               time, relationship, character, together, first
  F19  Time                    hour, minute, two, three, no
  F20  Effects                 effect, special, visual, action, star

4.3 Metadata

Table 6 summarizes the metadata used in our model. As shown, this information concerns the movie itself (e.g., popularity), review sentiment (i.e., the number of stars the reviewer gave the movie), the age and length of the reviews and how reviews are received by forum participants. In addition to the reviewer's rating, we also include the deviation of the rating from the average (i.e., over all reviewers). Previous work found that there is a negative relationship between the deviation from the average opinion (i.e., item rating) expressed in a review forum and the perceived utility of a review [9].

Table 6 Review and movie metadata

  Movie popularity            # total reviews about movie j
  Reviewer rating             1–10 stars; rating assigned by author of review s
  Reviewer rating deviation   Reviewer's rating of movie j minus the average rating of j
  Age (days)                  Age of review s
  Length (words)              # words in review s
  Length (sentences)          # sentences in review s
  Title length (words)        # words in title of review s
  Attention received from community
    Total votes               Total votes for review s
    Perceived utility         Utility of review s

5 Analysis

To compare the characteristics of male- and female-authored reviews, we build a separate logistic regression model with the features of each type as predictors: (1) stylistic, (2) content and (3) metadata features. To control for the possibility that the reviews differ in content or writing style depending on the specific movie reviewed, we initially add dummy variables for the 250 movies. However, these are not significant in any of the models. To fit the models, we use a forward stepwise procedure [32] that begins with an empty model (i.e., assigning the label of male to all reviews) and adds a predictor if it yields a significant improvement over the current model (i.e., when the likelihood ratio test for the effect of the predictor has a p value < 0.05), as sketched below. Once we have a model, we compare the estimated odds ratios of the predictors to identify features in each category that are associated with author gender. For each feature category, we briefly report the individual features with the strongest correlations to reviewer gender.
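A sketch of this forward stepwise procedure under our reading of [32], using statsmodels and a chi-squared likelihood-ratio test with one degree of freedom per added predictor; implementation details such as convergence handling are omitted.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def forward_stepwise(X, y, alpha=0.05):
    """Greedily add the predictor with the most significant likelihood-ratio
    test against the current model, stopping when none reaches p < alpha."""
    selected, remaining = [], list(range(X.shape[1]))
    # Empty (intercept-only) model: effectively labels every review "male"
    current_llf = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0).llf
    while remaining:
        candidates = []
        for j in remaining:
            llf = sm.Logit(y, sm.add_constant(X[:, selected + [j]])).fit(disp=0).llf
            lr = 2 * (llf - current_llf)          # likelihood-ratio statistic
            candidates.append((chi2.sf(lr, df=1), j, llf))
        p_value, best, best_llf = min(candidates)
        if p_value >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
        current_llf = best_llf
    return selected
```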
Finally, we follow the same procedure to build a model using features of all three types. We evaluate and compare the classification accuracies of all models in Sect. 5.5.

5.1 Model with stylistic features

Recall that the features of style we examined include the rates of pronoun use, vocabulary richness, the use of hedge phrases, word and sentence complexity and lexical features. For the lexical features, we first identify the most frequent 50 words in the corpus. This results in a set of five content words (movie, film, story, characters and time) and 45 function words (pronouns, articles, prepositions, adverbs, verbs, etc.). For each review, we counted the occurrences of these words, resulting in a set of 50 lexical features. Clearly, some of the stylistic features are correlated (e.g., occurrences of the word "I" and the number of first person pronouns in a given review). Therefore, it is expected that several features will end up not having a significant effect in the model.

The model with stylistic predictors of gender is highly significant. However, it has a pseudo R^2 (McFadden's R^2) of only 0.09.^3 This suggests that stylistic features alone most likely do not provide enough information to distinguish between male and female reviewers. While we will report the classification accuracy using only stylistic features in Sect. 5.5, here we discuss the most significant differences in terms of writing style between male and female reviewers.

^3 Unlike in OLS regression, the pseudo R^2 cannot be interpreted as the proportion of variance in the dependent variable that is explained by the model; it is a simple measure of the strength of association between the predictors and the response. Therefore, it is a useful guide in choosing an appropriate model, but has no literal interpretation.

Figure 2 summarizes the stylistic trends we observe. The strongest indicator of female writing is a richer vocabulary, with an odds ratio of 6.626. In other words, holding other factors constant, for a one-unit increase in vocabulary richness, the odds of a reviewer being female increase by a factor of 6.6. By contrast, the best predictor of "male" writing is complexity, particularly the character-to-word ratio, with an odds ratio of 0.7919. As shown, female-authored reviews are also characterized by a higher rate of pronoun use, and in particular, the singular first person.

Fig. 2 Stylistic predictors of female (top) and male (bottom) reviewers

In addition, male-authored reviews are characterized by the use of prepositions. Also, it is interesting that of the five content words, all but the word having to do with people ("character") are used more often by men than by women. In summary, our analysis of stylistic features concurs with previous research. In particular, there is evidence that female movie reviewers have a more involved style, frequently writing in the first person voice. By contrast, men more often use a reporting style, discussing "the movie" or "a film," and making use of the third person pronouns "his" and "it."
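To make the odds-ratio interpretation concrete, a short worked example with the reported coefficient for vocabulary richness; the baseline probability of 0.50 is our own illustrative choice, and note that vocabulary richness lies in [0, 1], so a full one-unit increase spans the whole scale.

```python
import math

odds_ratio = 6.626                 # reported for vocabulary richness
b = math.log(odds_ratio)           # the underlying logistic coefficient

# A one-unit increase multiplies the odds of "female" by 6.626,
# holding the other predictors fixed.
odds = 0.5 / 0.5                   # start from Pr(female) = 0.50
odds_after = odds * math.exp(b)    # = 6.626
p_after = odds_after / (1 + odds_after)
print(round(p_after, 3))           # 0.869
```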
5.2 Model with content features

Fig. 3 Content predictors of female (top) and male (bottom) reviewers

Figure 3 shows the content features that have the strongest correlation to author gender. As can be seen, the strongest indicator of female writing is the use of first person pronouns. In fact, the factors representing the use of first, second and third person pronouns are all positively related to female-authored reviews (odds ratios of 1.4616, 1.0481 and 1.2017, respectively). We also observe that female reviewers are more likely to discuss themes related to groups, which is the second strongest correlate. As in the model with stylistic features, we again observe that the use of prepositions and articles (F1) is strongly related to male-authored reviews, having an odds ratio of only 0.5721. The themes discussed more often by males as compared with females include a movie being an all time hit, violence and special effects, with odds ratios of 0.8158, 0.8332 and 0.8682, respectively. Men are also more likely than women to discuss the success of the movie (e.g., if it won an Oscar). Finally, entropy is the only feature based on language modeling that has a significant effect in the model, with an odds ratio of 0.8168. Compared with other reviews written about the same movie, male-authored reviews tend to be more unique or surprising than those of females.

The likelihood ratio test for the content-features model as a whole is highly significant. However, like the model based on stylistic features, its pseudo R^2 is only 0.09. Therefore, it is uncertain whether content features alone can adequately distinguish between male- and female-authored reviews.

5.3 Model with metadata features

The model with metadata features is highly statistically significant and has a pseudo R^2 of 0.1826. This suggests that metadata features in general have a stronger correlation to reviewer gender than do the stylistic or content features. While six of the metadata features have a statistically significant effect in the model, perceived utility is the only feature with an odds ratio that deviates substantially from one (0.0863). The other features (review length in words, the number of reviews written about the movie, the total votes received by a review, review age and the length of the review's title) have odds ratios between 0.9900 and 1, and therefore are not likely to be useful in discriminating between reviews written by male versus female participants.

5.4 Model with style, content and metadata features

As mentioned, many of the stylistic and content features are correlated with one another. In particular, the counts of first, second and third person pronouns used as stylistic features are highly correlated to F2, F4 and F3, respectively. In addition, F1 accounts for the use of articles and prepositions, eliminating the need for the majority of the lexical features. Therefore, to avoid problems with collinearity in the model using all three types of predictors, we exclude the 50 lexical features.

The resulting model that includes the remaining style, content and metadata features is detailed in Table 7. The likelihood ratio test for the model as a whole is highly significant. In addition, the model based on stylistic, content and metadata features has a pseudo R^2 of 0.2379. As mentioned, the pseudo R^2 does not have a straightforward interpretation as does R^2 in the context of ordinary least squares regression. Therefore, we next turn toward evaluating the classification accuracy of each of our models.

Table 7 Model with style, content and metadata features (odds ratios)

  Metadata
    Utility                      0.0863
    Length (sents)               0.9854
    Total votes                  0.9946
    # Reviews                    0.9983
    Title length (chars)         0.9982
    Age                          0.9998
  Style
    Vocab richness               1.6870
    PN/1000                      1.0060
    Words/sents                  0.9888
  Uniqueness
    Entropy                      0.8320
  Content
    1st person (F2)              1.4596
    Groups (F13)                 1.2100
    Comic (F17)                  1.2100
    Aux/modals/past (F15)        1.2012
    Feminine (F8)                1.1777
    Location (F11)               1.1710
    Fantasy (F7)                 1.1315
    Family (F14)                 1.1115
    3rd person (F3)              1.1077
    Violence (F9)                0.8111
    All time hit (F12)           0.8186
    Positive impression (F5)     0.8771
    Articles/prepositions (F1)   0.8800
    Effects (F20)                0.9347
    Success (F6)                 0.9447

5.5 Comparison between models

In our analyses, we identified independent variables that are correlated to author gender. To compare the models in terms of their ability to infer gender, and to compare our results to previous work, we now evaluate their classification accuracy on this task. To do this, we use a tenfold cross-validation procedure [33].
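A sketch of this evaluation, assuming scikit-learn on toy data; λ_p is then the proportional reduction in error over always predicting the modal class (male), which errs on half of this gender-balanced data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))     # toy stand-in for the review features
y = rng.integers(0, 2, size=400)  # toy labels: accuracy will be near chance

acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10).mean()

baseline_error = 0.5              # predicting the mode: all reviewers male
lambda_p = 1 - (1 - acc) / baseline_error
# e.g. the ALL model's accuracy of 0.7371 gives 1 - 0.2629/0.5 = 0.474,
# matching Table 8
```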
Table 8 reports the average classification accuracies for each model, as well as the respective standard deviations. In addition, λ_p [32] is also shown. This statistic ranges from 0 to 1 and expresses the proportional reduction in error, relative to the baseline of predicting the mode (i.e., labeling all reviewers as male, which results in 50 % error). As a comparison, we also show the accuracy when review utility is the only predictor.

Table 8 Classification accuracy for all models

             Accuracy   SD       λ_p
  Style      64.55      1.0717   0.291
  Content    65.03      1.2119   0.301
  Metadata   72.95      3.088    0.459
  Utility    72.46      3.495    0.449
  ALL        73.71      3.662    0.474

As can be seen, while all models do significantly better than the baseline, the model incorporating all three types of predictor variables has the best performance. Its accuracy is significantly better than that of the content-only and style-only models. However, there is not a statistically significant difference between its accuracy and that of the models using only metadata and only review utility to infer gender.

Others have previously noted that features characterizing the writing style and content of a text can be sensitive to the text's length (e.g., [44]). Therefore, we also examined the average classification accuracy of each of the models, as a function of review length. Figure 4 indeed illustrates that the accuracy of the models incorporating only stylistic and content features is positively correlated to length. By contrast, the models using all features and only metadata are relatively more stable.

Fig. 4 Accuracy as a function of review length

6 Discussion

There are salient differences both in terms of writing style and content between IMDb movie reviews written by male and female authors. However, the perceived utility of the review by IMDb participants is the strongest predictor of gender. We examine this point further in order to provide a possible explanation. We also discuss what we have learned about inferring author gender from short texts contributed in online forums. We then conclude with a discussion of areas for future work.

6.1 Gender and review utility

Why is utility such a strong predictor of gender, even when controlling for properties such as review length, valence and age, which are known to bias collaborative utility scores [9,28]? As mentioned in the introduction, consumers with similar demographics are more likely to share tastes in hedonic goods such as movies.
Therefore, a likely explanation is that a large proportion of IMDb participants are men, who give more favorable feedback to male-authored reviews, since such reviews are more similar to their own perspectives and communicate those views in a manner that is familiar to them.

We do not have the necessary data to examine whether men give better feedback to male-authored reviews as compared with those written by women, since voting is anonymous at IMDb. In addition, we cannot be sure whether the majority of participants are male or female, since this information is not available in user profiles. However, we can study the characteristics of the highly rated female-authored reviews. In other words, could it be the case that reviews authored by women that are perceived to be relatively useful are more similar to male-authored reviews?

To examine this question, we divide the female-authored reviews into two groups: more useful (utility > 0.5; n = 2,898) and less useful (utility ≤ 0.5; n = 12,752). We use the same forward stepwise procedure (with a significance threshold of 0.05) to fit a logistic regression model distinguishing the more useful reviews from the less useful ones, using stylistic (excluding the lexical features), content and metadata predictors. Several variables that have been noted to be positively correlated to review utility [9] have significant positive effects in the model: total feedback votes, review age, the reviewer's rating of the movie and length. Figure 5 displays the predictors that have a significant effect in the model, with their respective odds ratios. Of interest is that one of the themes associated with male-authored reviews in Fig. 3, "all time hit," is characteristic of the more useful female-authored reviews.

Fig. 5 Features of more (top) and less (bottom) useful women's reviews

Likewise, we can observe that vocabulary richness, which was associated with female-authored reviews in the model described previously in Table 7, is correlated to the less useful reviews written by women. Finally, the use of first and second person pronouns is also more characteristic of less useful reviews. In summary, the analysis shows that reviews written by women that are viewed by IMDb participants as being more useful do have similarities to male-authored reviews, as compared with less useful female-authored reviews.

6.2 Participation

To further explain the above findings, we can consider the rate of participation of the 21,012 reviewers in our data set. From reviewers' profile pages, we obtained the total number of reviews that they had contributed at IMDb. Indeed, male reviewers are more active than females; the mean number of reviews contributed by males is 42.7 (median of 5), whereas for females the mean is only 8.7 (median of 2). In addition, it is illustrative that the most prolific male reviewer in our data set has written four times more reviews than the most prolific female (a total of 8,167 versus 2,061 reviews, respectively). Therefore, we find additional evidence that IMDb is a male-majority forum.

Herring's "list effect" [16] does not apply here, since we do not find that both males and females write like males, the majority gender. Herring studied academic discussion lists, and it is likely that within an academic community, participants know one another better and have more interaction than in a large forum such as IMDb. At IMDb, we see that there are significant differences, both in writing style and content, between male- and female-authored reviews.
In addition, we do observe that a relatively small number of female-authored reviews exhibit "male" characteristics, and that these reviews are perceived as being more useful, as compared with reviews written in a "female" voice. It may be that such female participants recognize that the less personal, more objective, "male" style of writing is better received by others [43], and so try to adopt that style.

6.3 Classifying author gender for short texts

We are not aware of previous work that has considered how to infer author gender for textual contributions posted in online forums. As previously mentioned, the movie reviews in our data set are substantially shorter than the texts other researchers have experimented on, and we anticipated that stylistic and content metrics might not be as robust as they are for longer texts. At the same time, texts posted in an online forum are qualitatively different from stand-alone texts such as the scientific articles and books that others have studied. For instance, reviews at IMDb occur within an ongoing discourse about a given movie, and readers are invited to be active participants, both by writing their own reviews and by voting for the reviews they find most useful.

We found that in order to compensate for the brevity of reviews, content and stylistic features can be augmented with metadata, which reflect social (e.g., utility, or how others perceive the review) and dynamic (e.g., review age) aspects of the forum. Previous work on classifying literary and scientific texts [3,23] and blogs [4] with respect to author gender reported accuracy rates around 80 %. In comparison, when we use only stylistic or content features, we can infer movie reviewer gender with 65 % accuracy. If we also exploit metadata, accuracy increases to just under 74 %.

In future work, we plan to reexamine the performance of the three classifiers, given more information about each author. In the current study, we had only one review per author on which to base a decision as to his or her gender, along with movie and review metadata. Another task of interest is to infer a user's gender given all of his or her postings at the site. In other words, for many users, we will have reviews of multiple movies, along with associated metadata (e.g., the user's average utility over a larger set of contributions). Given that features of writing style and content are known to be sensitive to text length [44] and the brief nature of online reviews, we plan to further explore reviewer gender classification exploiting different types of information.

7 Conclusion

Review forums are large, multiuser information spaces in which participants, who likely have not had previous interactions with one another, exchange information via textual postings describing their experiences with and opinions of various items. In popular forums, it is not uncommon for users to find hundreds or even thousands of reviews posted about an item of interest. In response to the need to organize this information, collaborative filtering mechanisms, in which participants vote as to which reviews they find most "helpful" or "useful," are commonly used. This is an easy means of organizing reviews, since it does not require administrators to modify postings, nor does it require much effort from participants, who provide simple binary feedback as to whether or not reviews were useful to them.
However, researchers who have studied these voting mechanisms have reported biases, such as the early bird and winners' circle effects [28], as well as the penalizing of reviews expressing views that deviate from the majority, particularly when the review is more critical than the majority [9]. Our case study of the IMDb suggests that a gender bias exists at this particular site, with female-authored reviews being seen as significantly less useful than those contributed by males. In other words, if it were not for IMDb's gender filter, under the default collaborative filter, most of the highly ranked reviews would be authored by male participants, effectively hiding the views of female participants.

In the current work, we used text classification in order to see which aspects of writing style, content and forum metadata could effectively discriminate between female- and male-authored reviews. We confirmed that, as found in previous research, female and male participants do write differently and discuss contrasting themes in their movie reviews. Based on the observation that female-authored reviews are much less likely to be judged by others as being useful, and that males are generally more prolific than female participants, we came to the conclusion that IMDb is a male-majority forum.

There is much discussion about gender and participation online. For instance, in a recent New York Times Room for Debate session,^4 debaters discussed recent findings that less than 15 % of contributors at Wikipedia are female [14]. It is important to consider why this gender gap exists and how it might be reduced. In future work, tools based on text classification might aid in detecting and ameliorating such problems. For instance, tools could be developed to infer demographic characteristics of contributors. This might help site designers to determine if certain groups are underrepresented, such that they could be encouraged to participate. In addition, in forums that use simple collaborative filtering (e.g., "helpful" or "useful" mechanisms), demographic information could be used to suggest as-yet-unrated contributions of possible interest to a user, based on her own demographic traits. This could help reduce biases such as the "winners' circle" by giving unrated content a chance to collect votes.

^4 http://www.nytimes.com/roomfordebate/2011/02/02/where-are-the-women-in-wikipedia

In conclusion, mechanisms can be developed that might help forums balance out the participation and visibility of all participants, and that make participants feel as though their contributions are valued. This will be challenging work from both technical and sociotechnical points of view, as care must be taken not only to ensure that classification algorithms are sufficiently accurate, but also that demographic information is not used in a way that might cause privacy concerns among participants.

Acknowledgments We thank the anonymous reviewers who provided helpful feedback on this work, as well as the reviewers of an earlier version of this work, which appeared at ACM CIKM 2010. We also acknowledge the insightful advice of Alexia Panayiotou, as well as Mengyuan (Serena) Li's assistance with data collection.

References

1. Acquisti A, Gross R (2006) Imagined communities: awareness, information sharing, and privacy on Facebook. In: Privacy enhancing technologies. Lecture notes in computer science. Springer, Berlin
2. Ahmed A, Low Y, Aly M, Josifovski V, Smola A (2011) Scalable distributed inference of dynamic user interests for behavioral targeting. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, pp 114–122
3. Argamon S, Koppel M, Fine J, Shimoni AR (2003) Gender, genre, and writing style in formal written texts. Text 23
4. Argamon S, Koppel M, Pennebaker JW, Schler J (2007) Mining the blogosphere: age, gender and the varieties of self-expression. First Monday 12(9). http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2003/1878
5. Awad NF, Ragowsky A (2008) Establishing trust in electronic commerce through online word of mouth: an examination across genders. J Manag Inf Syst 24(4):101–121
6. Bruckman A (1996) Gender swapping on the internet. In: Ludlow P (ed) High noon on the electronic frontier: conceptual issues in cyberspace. MIT Press, Cambridge
7. Chung C, Pennebaker JW (2008) Revealing dimensions of thinking in open-ended self-descriptions: an automated meaning extraction method for natural language. J Res Pers 42:96–132
8. Coates J (1993) Women, men, and language. Longman, London
9. Danescu-Niculescu-Mizil C, Kossinets G, Kleinberg J, Lee L (2009) How opinions are received by online communities: a case study on Amazon.com helpfulness votes. In: Proceedings of the international world wide web conference, Madrid, Spain, pp 141–150
10. Dellarocas C (2003) The digitization of word of mouth: promise and challenges of online feedback mechanisms. Manag Sci 49(10):1407–1424
11. Foltz PW, Laham D, Landauer TK (1999) Automated essay scoring: applications to educational technology. In: Proceedings of the world conference on educational multimedia, hypermedia and telecommunications, Chesapeake, pp 939–944
12. Gefen D, Ridings CM (2005) If you spoke as she does, sir, instead of the way you do: a sociolinguistics perspective of gender differences in virtual communities. ACM SIGMIS Database 36(2):78–92
13. Ghose A, Ipeirotis PG (2010) Estimating the helpfulness and economic impact of product reviews: mining text and reviewer characteristics. IEEE Trans Knowl Data Eng, pp 1498–1512
14. Glott R, Ghosh R, Schmidt P (2010) Analysis of Wikipedia survey. UNU-MERIT. http://www.wikipediasurvey.org/docs/Wikipedia_Overview_15March2010-FINAL.pdf
15. Herring SC (1996a) Posting in a different voice: gender and ethics in computer-mediated communication. In: Ess C (ed) Philosophical perspectives on computer-mediated communication. SUNY Press, New York, pp 115–145
16. Herring SC (1996b) Two variants of an electronic message schema. In: Herring SC (ed) Computer-mediated communication: linguistic, social and cross-cultural perspectives. John Benjamins, New York, pp 81–108
17. Herring SC (2003) Gender and power in online communication. In: Holmes J, Meyerhoff M (eds) The handbook of language and gender. Blackwell, Oxford, pp 202–228
18. Hirschman EC, Holbrook MB (1982) Hedonic consumption: emerging concepts, methods and propositions. J Mark 46(Summer):92–101
19. Holbrook MB, Schindler RM (1994) Age, sex and attitude toward the past as predictors of consumers' aesthetic tastes for cultural products. J Mark Res 31:412–422
20. Hu J, Zeng H-J, Li H, Niu C, Chen Z (2007) Demographic prediction based on user's browsing behavior. In: Proceedings of the ACM WWW, Banff, Alberta, May 2007, pp 151–160
21. Joachims T, Granka L, Pan B, Hembrooke H, Radlinski F, Gay G (2007) Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans Inf Syst 25(2). http://doi.acm.org/10.1145/1229179.1229181
22. Kaiser HF (1960) The application of electronic computers to factor analysis. Educ Psychol Meas 20:141–151
23. Koppel M, Argamon S, Shimoni AR (2003) Automatically categorizing written texts by author gender. Lit Linguist Comput 17(4):401–412
24. Kostakos V (2009) Is the crowd's wisdom biased? A quantitative analysis of three online communities. In: Proceedings of IEEE social communication, international symposium on social intelligence and networking, Vancouver, Canada, pp 251–255
25. Lakoff G (1973) Hedges: a study in meaning criteria and the logic of fuzzy concepts. J Philos Log 2(4):458–508
26. Lakoff R (1973) Language and woman's place. Lang Soc 2:45–79
27. Landauer TK, Foltz PW, Laham D (1998) An introduction to latent semantic analysis. Discourse Process 25:259–284
28. Liu J, Cao Y, Lin C-Y, Huang Y, Zhou M (2007) Low-quality product review detection in opinion summarization. In: Proceedings of the conference on empirical methods in natural language processing, pp 334–342
29. Loehlin JC (1992) Latent variable models. Lawrence Erlbaum Associates, London
30. Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18:50–60
31. Manning CD, Schütze H (2000) Foundations of statistical natural language processing. MIT Press, Cambridge
32. Menard S (2002) Applied logistic regression analysis. Quantitative applications in the social sciences. Sage University Press, Beverley Hills
33. Mitchell T (1997) Machine learning. McGraw Hill, New York
34. Muhlestein D, Lim S (2011) Online learning with social computing based interest sharing. Knowl Inf Syst 26:31–58
35. Oliver MB, Weaver JB III (2000) An examination of factors related to sex differences in enjoyment of sad films. J Broadcast Electron Media 44(2):282–300
36. Popescu A, Grefenstette G (2010) Mining user home location and gender from Flickr tags. In: Proceedings of the 4th international conference on weblogs and social media, Washington, DC, May 2010
37. Radev D, Jing H, Stys M, Tam D (2004) Centroid-based summarization of multiple documents. Inf Process Manag 40:919–938
38. Roth M, Ben-David A, Deutscher D, Flysher G, Horn I, Leichtberg A, Leiser N, Matias Y, Merom R (2010) Suggesting friends using the implicit social graph. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, pp 233–242
39. Sahlgren M, Karlgren J (2009) Terminology mining in social media. In: Proceedings of the ACM conference on information and knowledge management, Hong Kong, Nov 2009, pp 405–414
40. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill, Inc., New York
41. Schein AI, Popescul A, Ungar LH, Pennock DM (2002) Methods and metrics for cold-start recommendations. In: Proceedings of SIGIR, pp 253–260
42. Schindler RM, Bickart B (2005) Published word of mouth: referable, consumer-generated information on the Internet. In: Haugtvedt C, Machleit K, Yalch R (eds) Online consumer psychology: understanding and influencing behavior in the virtual world. Lawrence Erlbaum Associates, London, pp 35–61
43. Spender D (1989) The writing or the sex or why you don't have to read women's writing to know it's no good. Elsevier, Oxford
44. Stamatatos E, Kokkinakis G, Fakotakis N (2000) Automatic text categorization in terms of genre and author. Comput Linguist 26(4):471–495
45. Stutzman F (2006) An evaluation of identity-sharing behavior in social network communities. Intern Digit Media Arts J 3(1):10–18
46. Tannen D (1990) You just don't understand. HarperCollins Publishers, Inc., New York
47. Terveen L, McDonald DW (2005) Social matching: a framework and research agenda. ACM Trans Comput Hum Interact 12(3):401–434
48. Tsaparas P, Ntoulas A, Terzi E (2011) Selecting a comprehensive set of reviews. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, pp 168–176
49. Wang D, Tse QCK, Zhou Y (2011) A decentralized search engine for dynamic Web communities. Knowl Inf Syst 26:105–125
50. Yang Y (1999) An evaluation of statistical approaches to text categorization. Inf Retr J 1(1):69–90
51. Zhai Z, Liu B, Xu H, Jia P (2011) Clustering product features for opinion mining. In: Proceedings of the fourth ACM international conference on Web search and data mining, Hong Kong, pp 347–354
52. Zhang Z, Varadarajan B (2006) Utility scoring of product reviews. In: Proceedings of the ACM conference on information and knowledge management, Arlington, pp 51–57

Author Biography

Jahna Otterbacher received her PhD in 2006 from the School of Information at the University of Michigan (Ann Arbor), USA. She is currently an Assistant Professor in the Lewis Department of Humanities at the Illinois Institute of Technology (IIT) in Chicago, Illinois, USA. Broadly, the goals of her research are to understand and use patterns in language to facilitate online interaction between people, as well as to enhance their access to information.