Gender, writing and ranking in review forums: a case study of the IMDb

Knowl Inf Syst
DOI 10.1007/s10115-012-0548-z
REGULAR PAPER

Jahna Otterbacher

Received: 18 February 2011 / Revised: 5 April 2012 / Accepted: 22 August 2012
© Springer-Verlag London Limited 2012
Abstract Online review forums provide consumers with essential information about goods and services by facilitating word-of-mouth communication. Although preferences are correlated with demographic characteristics, reviewer gender is often not provided on user profiles. We consider the case of the Internet Movie Database (IMDb), where users exchange views on movies. Like many forums, IMDb employs collaborative filtering such that, by default, reviews are ranked by perceived utility. IMDb also provides a unique gender filter that displays an equal number of reviews authored by men and women. Using logistic regression, we compare reviews with respect to writing style, content and metadata features. We find salient differences in stylistic features and content between reviews written by men and women, as predicted by sociolinguistic theory. However, utility is the best predictor of gender, with women's reviews perceived as being much less useful than those written by men. While we cannot observe who votes at IMDb, we do find that highly rated female-authored reviews exhibit "male" characteristics. Our results have implications for which contributions are likely to be seen, and for the extent to which participants get a balanced view of "what others think" about an item.
Keywords Gender · Information filtering · Social voting · Text classification
1 Introduction
When investigating goods and services of interest, consumers often seek out third-party information [10], such as consumer-contributed reviews posted at online forums. Such "word-of-mouth" on the web enables people to learn about others' opinions of and experiences with an item before investing the time and money in consuming it [42]. In other words, one way consumers use such forums is to get a general impression of "what others think" about an item of interest.
J. Otterbacher (B)
Department of Humanities, Illinois Institute of Technology, Chicago, IL 60616, USA
e-mail: [email protected]; [email protected]
We examine a popular site, the Internet Movie Database (IMDb.com), at which consumers
exchange views on a type of hedonic good, movies. In contrast to more utilitarian goods (e.g.,
pencils or insect repellant), the consumption of hedonic goods, also referred to as experience
goods, resembles a personal experience [18]. Therefore, factors such as a consumer’s age and
gender correlate to his or her preferences for these items [19]. For instance, while perhaps
stereotypical, certain genres of movies (e.g., “tearjerkers”) tend to appeal more to women
than men [35]; thus it is expected that such preferences will be reflected in consumers’
impressions and assessments of movies they review.
Despite this correlation between demographics and preferences for hedonic goods, it is
often difficult for consumers to find out who the reviewers are. For instance, reviewer gender
is a potentially useful attribute for filtering reviews of experience goods. However, in many
forums, one cannot determine the gender of reviewers with a high degree of accuracy. In
online environments, many participants intentionally maintain a gender-neutral identity, so as to avoid attracting undesirable attention and to be taken more seriously [6]. More generally,
many have privacy concerns [1,45] and choose not to disclose personal information. This is
a serious challenge for information seekers.
People often use demographic characteristics to judge whether or not someone is similar to themselves [47], which then helps them in interpreting information and relating
it to their own contexts. In fact, because of the potential of demographic information to
enhance the process of information seeking, researchers have considered how to automatically infer user characteristics, including gender (e.g., [20,36]). In addition to providing
cues to information seekers, effective means to infer user attributes and behaviors can enable
site designers to appropriately personalize content or make recommendations to users (e.g.,
[2,38]).
1.1 IMDb’s gender filter
The Internet Movie Database provides a unique opportunity to explore author gender prediction
in the context of a consumer-generated review forum. At IMDb, there is a filter allowing
users to identify reviews written by male and female participants. The intent is to allow users
to gain a gender-balanced perspective on a movie of interest; when the filter is selected, an
equal number of reviews authored by men and women is shown in an interleaved display, as
shown in Fig. 1. Alternatively, the default display is IMDb’s collaborative filtering (i.e., social
voting) mechanism, the “Best” filter, under which the reviews are sorted according to what
other participants voted as being “useful.” The filter might be described as a sociotechnical
means through which users can share their preferences with one another, helping others to
learn of interesting content (e.g., [34]).
Searching for relevant content in larger online communities is challenging for many reasons, and general purpose search tools are not always effective in such contexts [49]. At IMDb,
popular films often have several hundred, and sometimes even thousands, of user-contributed
reviews. Therefore, filtering the reviews by particular attributes, such as perceived usefulness
(i.e., “Best”), can greatly reduce the threat of information overload to users who want to learn
what others think of a given film. Other approaches for helping users glean pertinent information from large sets of reviews include the automatic identification and summarization of
the opinions expressed in reviews (e.g., [51]) as well as identifying a small set of the most
comprehensive reviews (e.g., [48]). In contrast, the IMDb filter mechanism does not change
the set of reviews presented to the user. Rather, the filter simply reorders them, prioritizing
those with desirable attributes.
Fig. 1 Reviewer gender filter at IMDb
Figure 1 shows a review of the classic film Casablanca. The gender filter has been selected,
and it can be seen that a male participant wrote this particular review. As shown, there are 142
reviews (out of a total of 753 contributed reviews for Casablanca), for which author gender
is available. Finally, above each review, the perceived utility is displayed in the form "x of y people found the following review useful."¹ Feedback is solicited following each review,
where the participants are asked if they found the review useful or not. This feedback is used
to determine the presentation order of the review under the default filter, “Best.”
IMDb is unique in that it provides filters beyond perceived utility. Previous research has
highlighted the biases inherent in the simple collaborative filtering mechanisms that have
become commonplace. Much of this research examines the “helpfulness” mechanism at
Amazon.com. This mechanism is very similar to IMDb’s “Best” filter, although Amazon
asks participants if they find reviews “helpful.” Ghose and Ipeirotis [13] noted that it takes
time to accumulate sufficient votes for review rankings to become meaningful, a problem
similar to that of “cold starts” in recommendation systems [41]. Liu et al. [28] warned of an
“early bird” bias; reviews posted earlier have had more time to collect votes, thus they have
an advantage over more recent contributions. They also noted a “winners’ circle” bias, in
which the top-ranked items are unlikely to be displaced by new contributions. Since highly
rated reviews are displayed first, and since users are unlikely to read (and subsequently to vote on) many reviews [21], what is perceived as helpful remains so over time. Finally, in their analyses, Danescu-Niculescu-Mizil et al. [9] found that there is a penalty for reviews that deviate from the average opinion of the product. More specifically, when a reviewer's rating of a product is below the average over all users, the associated review is less likely to be seen as helpful by others. The provision of additional review filters beyond popularity or helpfulness appears to be an attempt to expose readers to a more diverse set of contributions.

¹ Following [52], we define utility as the number of users who found a review useful divided by the total number of votes received (i.e., x/y).
1.2 Gender and communication
In addition to having distinct preferences for hedonic goods, research suggests that men and
women have different communication styles, both in face-to-face and computer-mediated
environments. These differences likely affect the extent to which women participate in online
forums such as IMDb, as well as how women’s contributions are received by others and thus,
how visible they are in the forum.
Early work on gender differences in language focused on communication in face-to-face settings, often in a group. Lakoff [26] proposed ten salient characteristics of "women's
language.” For instance, she found that women and men often choose to use different lexical
items in similar situations (e.g., women describe colors more vividly than men). Another
finding was that women make greater use of hedge phrases (e.g., “kind of,” “sort of”), which
effectively soften an argument, presumably making it more acceptable to an interlocutor.
Similarly, Tannen [46] theorized that women use “rapport talk” while men adopt a style
described as “report talk,” contrasting women’s emphasis on building and supporting social
relationships with men’s focus on conveying information. In addition to manner, the topics
discussed by all female groups versus all male groups have been noted to differ, with men
discussing objects (e.g., cars) and women talking about people and relationships [8].
Herring has found that many gender differences in conversational speech also apply to
computer-mediated settings [15–17]. For instance, in online forums, men dominate interactions. Women participants typically post fewer messages, and are less likely to receive
responses from others. In a study of academic discussion lists, Herring also described a “list
effect,” in which contributors write using the stylistic features of the dominate gender (i.e., in
male-majority lists, both women and men participants posted in a “male” voice) [16]. In addition, while female-authored messages tended to foster relationships with others (i.e., were
more inclusive, expressed emotion, used hedging and apologies), male-authored messages
tended to be more competitive and assertive in nature.
These findings are echoed by Gefen and Ridings [12], who explained that communication
differences have to do with how men and women use shared information spaces. They claimed
that while male participants value the chance to showcase their knowledge to others, women
often seek or offer social support for solving problems. Finally, Awad and Ragowsky [5]
studied the mediating effect of gender in judging the quality of online word-of-mouth in
electronic commerce contexts. Like Gefen and Ridings, they found that men appreciate the
opportunity to sound off about an item online, while women value the chance to interact with
and receive feedback from others.
1.3 Goals of the current work
As we have seen, men and women have different preferences for hedonic goods, use online
forums in various ways, and adopt unique communication styles. Our current goal is to
examine the differences between movie reviews written by male and female reviewers at
IMDb, with respect to three categories of attributes: (1) writing style, (2) thematic content
and (3) review metadata. As will be detailed in Sects. 2–4, we use a logistic regression
framework to examine how accurately we can infer author gender.
We will show that male and female participants’ reviews do differ significantly with
respect to writing style and content, as predicted by sociolinguistic theory. However, these
features are not nearly as important as perceived utility in inferring author gender. In Sect. 6,
we provide possible explanations for this difference in perceived utility. While we cannot
observe who votes on the utility of reviews, we do find that highly rated reviews by females
are characterized by “male” features which, given Herring’s findings [16], might suggest that
IMDb is a male-majority forum. Finally, Sect. 7 relates our findings to the broader issues of
the use of simple collaborative filtering in online forums and presents areas for future work.
2 Inferring author gender
We now define the task of predicting author gender in a set of IMDb movie reviews. We then
relate our task to previous research on the prediction of author gender.
Definition 2.1 Given the set S of user-contributed reviews about movie j, predict the author gender of each review s ∈ S. We assume that all information seen by users at the main page of the movie's review forum is available for use.
We model features of author writing style and review content, as most approaches to
text categorization rely on features that characterize these two aspects of a text. Commonly
used features include token-level measures (e.g., average word or sentence length), vocabulary richness (i.e., size of vocabulary/length) and counts of occurrences of commonly used
words [44]. In addition, researchers have applied dimension-reduction techniques (e.g., factor
analysis) on word frequencies (e.g., [4]).
In addition to style and content, we exploit information available at IMDb. This information
includes movie metadata (e.g., the number of reviews contributed, which gives us a measure
of the movie’s popularity), as well as review metadata (e.g., date posted, reviewer’s rating of
the movie). In addition, we exploit metadata of the collaborative filtering mechanism (i.e.,
utility scores). These attributes are fully detailed in Sect. 4.
Most research on inferring author gender concerns “clean” text, as opposed to
user-contributed content that has become so characteristic of the web. For instance, Koppel et al. [23] classified a set of fiction and non-fiction documents from the British National Corpus (BNC).
In addition to lexical features (frequencies of commonly used words), they exploited a set of
“quasi-syntactic” features, since the BNC is tagged for parts of speech. They reported 80 %
accuracy in predicting author gender.
Argamon et al. [3] also predicted gender for formal texts from the BNC; however, they
exploited different features. In particular, they found differences in the use of pronouns. While
male and female authors made references to nominals (i.e., wrote about “things”) with similar
frequencies, women’s use of pronouns was much greater, and they used more first person
pronouns. More recently, they turned their attention to inferring author gender in blog text
[4]. They compared themes discussed by male versus female bloggers. To derive key themes,
they used the “meaning extraction method” [7], performing factor analysis on frequencies of
the most common words in the corpus. They discovered 20 themes, and found differences
between the texts of male and female bloggers. For instance, “religion” and “politics” were
common themes discussed by men, whereas “conversation” and “at home” were popular with
women. They reported classification accuracies “around 80 %.”
We are not aware of any work that considers the prediction of author gender in postings
in an online forum or community. First, reviews posted to such sites are not clean texts. Most
review forums are not closely moderated and typically allow anyone to post with few barriers
to entry (i.e., after creating a user name and password). Reviews often contain grammatical
errors and may even be off topic.
Second, movie reviews are significantly shorter than most texts analyzed in previous work.
For instance, in [23], the average text length was over 34,000 words. In [4], the minimum blog
length was 500 words. By contrast, our mean review length is only 256 words. Therefore,
a movie review provides fewer clues about author writing style and themes that she or he
might discuss. On the other hand, since movie reviews are part of an ongoing discourse about
a film as compared with being stand-alone texts, we can exploit metadata that provides clues
about the social aspects of the forum. Therefore, in addition to stylistic and content features,
we include movie and review metadata.
3 Data
To create a movie review corpus,² we relied on the list of the 250 top films of all time (as
of January 2010), as voted by IMDb participants, reasoning that they draw large audiences
and would have many reviews written by both men and women. At each movie forum, we
used the “male/female” filter to collect reviews with gender information. For the 250 movies
studied, 31,300 reviews had gender identity, out of 157,065 total reviews (19.9 %). Within
each movie, the proportion of reviews with gender identity ranged from 54 % to only 2 %
(median of 18.5 %). It is not possible to calculate the true proportion of reviewers who have
disclosed their gender, because IMDb always displays an equal number of male- and female-authored reviews. Therefore, we should keep in mind that we are only able to study the
differences between male- and female-authored reviews, for the subset of reviewers who
have provided their gender identities.
Before performing analyses that involved word counts, we performed stemming. We also
disregarded words that had occurred only once in the entire data set. As in [39], we reasoned
that such terms were very likely to be junk, such as misspellings or leftover HTML artifacts.
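As an illustration, a minimal preprocessing pipeline along these lines might look as follows; the choice of the Porter stemmer and the exact tokenization are assumptions, since the paper does not specify them.

```python
import re
from collections import Counter

from nltk.stem import PorterStemmer  # the choice of stemmer is an assumption

def preprocess(reviews):
    """Stem tokens, then drop terms occurring only once in the whole corpus."""
    stemmer = PorterStemmer()
    tokenized = [
        [stemmer.stem(tok) for tok in re.findall(r"[a-z']+", text.lower())]
        for text in reviews
    ]
    counts = Counter(tok for review in tokenized for tok in review)
    # Hapax legomena are likely junk (misspellings, leftover HTML artifacts).
    return [[tok for tok in review if counts[tok] > 1] for review in tokenized]

print(preprocess(["A great, great film!", "Greatly overrated film."]))
```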
3.1 Characteristics
Table 1 presents an overview of corpus attributes. As can be seen, even though there are an
equal number of male- and female-authored reviews in the data, male reviewers have written
over a million words more than have female reviewers. In addition, over all male-contributed
reviews, we find a larger vocabulary as compared with female-authored reviews.
Table 2 presents summary statistics for review length in words, as well as the total feedback
votes received and the perceived utility of the review. As these distributions are often skewed,
the table shows both means and medians. It can be seen that there are significant differences
between male- and female-authored reviews; a Wilcoxon–Mann–Whitney rank-sum test [30]
for evaluating the difference between the medians of males versus females has a p value of
approximately zero for all three attributes.
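This comparison can be reproduced with a standard rank-sum test; a sketch using SciPy, with toy stand-ins for the real length data:

```python
from scipy.stats import mannwhitneyu

# Toy review lengths in words for each group (stand-ins for the real data).
male_lengths = [295, 410, 180, 520, 233]
female_lengths = [162, 210, 140, 190, 175]

# Two-sided Wilcoxon-Mann-Whitney rank-sum test, as in [30].
stat, p_value = mannwhitneyu(male_lengths, female_lengths, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```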
² While there are other movie review corpora available (e.g., for studying sentiment analysis), we were not able to find existing data with author gender.
Table 1 Corpus attributes

                                   Male        Female      All
Reviews                            15,650      15,650      31,300
Unique reviewers                   10,394      10,618      21,012
Reviews per movie (mean/median)    NA          NA          125.2/69
Total words                        4,625,377   3,396,819   8,022,196
Vocabulary size                    27,972      23,403      35,783

Table 2 Review attributes (mean/median)

                  Male        Female
Length (words)    295.3/222   216.8/162
Total votes       30.7/3      7.1/1
Utility           0.60/0.67   0.26/0
4 Research framework
We use logistic regression [32] to model the log odds of a review being written by a female
versus a male. More formally, the model to be estimated is:
$$\ln \frac{\Pr(y_i = 1)}{\Pr(y_i = 0)} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n \qquad (1)$$
where x_1 to x_n are the predictors and y_i is the response (taking the value of 0 if the author is a man and 1 if a woman). Thus, male is the default and we are predicting the event of a reviewer being female. Stated otherwise, given a vector of n variables describing a review, we use the predicted likelihood to assign the vector to one of two groups (i.e., if the likelihood of being female is greater than 0.5, the author is labeled as female; otherwise the label is left as male).
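As an illustration of this setup, here is a minimal sketch using scikit-learn (an assumption; the paper does not name its software, and the feature matrix here is a toy stand-in for the review features described below):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # toy feature vectors: one row per review
y = rng.integers(0, 2, size=200)     # 0 = male (default), 1 = female

model = LogisticRegression().fit(X, y)

# Predicted probability of the "female" class, thresholded at 0.5 as in the paper.
prob_female = model.predict_proba(X)[:, 1]
labels = (prob_female > 0.5).astype(int)

# Each odds ratio is the exponentiated coefficient b_j from Eq. (1).
print(np.exp(model.coef_.ravel()))
```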
It should be noted that a variety of statistical and machine-learning approaches have been applied to text classification problems. Previous work [44,50] explains that for relatively large data sets (i.e., at least 300 observations for each class), there is little difference in the performance of most classifiers on text classification tasks. Since our focus is to compare the contribution of the three types of information, rather than to contribute to machine learning research, we employ a multivariate statistical technique. Logistic regression allows us to model a dichotomous response variable, while avoiding assumptions as to the normality of our predictor variables.
4.1 Writing style
Table 3 presents the aspects of author writing style we have incorporated into our analysis
and provides a short justification for each. As shown, previous researchers have discussed
these attributes when comparing the communication styles of women and men. The attributes
we model are: the extent to which a movie reviewer adopts an inner- versus outer-focused
discourse (i.e., the extent to which he or she acknowledges a perceived audience, or tends
to use a reporting style), the use of hedge phrases, the complexity of one’s writing style,
the richness of vocabulary used in a review and patterns of frequent word use (i.e., lexical
markers).
Table 3 Writing style features

Inner- versus outer-focused discourse
  Measure: # Pronouns (PN)/total words; # first person PN; # second person PN; # third person PN
  Explanation: Researchers have found that women use more pronouns than men. We compute the rate of PN use, as well as the number of first, second and third person PNs.

Hedging
  Measure: # Hedges used/total words
  Explanation: We use a list of 55 common hedges from [25]. Hedging is characteristic of women's language.

Complexity
  Measure: Characters/words; characters/words in title; words/sentences
  Explanation: These basic measures of writing complexity have been used in automatic essay scoring [11].

Vocabulary richness
  Measure: # Unique words/total words
  Explanation: Quantifies the diversity of lexical items used. Stamatatos et al. [44] warn that it may not be stable in short texts.

Lexical markers
  Measure: Frequency of top 50 words
  Explanation: We create a list of the 50 highest-frequency words and count their occurrences in each review.
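To make these measures concrete, the following sketch computes the Table 3 features for a single review; the pronoun and hedge lists are abridged stand-ins (the paper uses full pronoun counts and the 55 hedges from [25]):

```python
import re

PRONOUNS = {"i", "me", "my", "you", "your", "he", "she", "it", "they", "them"}  # abridged
HEDGES = {"kind of", "sort of", "maybe", "perhaps", "somewhat"}  # abridged stand-in

def style_features(text):
    """Compute the Table 3 style measures for one review."""
    words = re.findall(r"[a-z']+", text.lower())
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    joined = " ".join(words)
    n = len(words)
    return {
        "pronoun_rate": sum(w in PRONOUNS for w in words) / n,
        "hedge_rate": sum(joined.count(h) for h in HEDGES) / n,
        "chars_per_word": sum(len(w) for w in words) / n,
        "words_per_sentence": n / len(sents),
        "vocab_richness": len(set(words)) / n,
    }

print(style_features("I kind of liked it. Maybe you would too."))
```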
4.2 Content
We consider five metrics that quantify aspects of review content. The first four metrics
presented in Table 4 tell us something about the content of a review with respect to the other
reviews written about the same movie. The first is a measure of review centrality. To compute centrality, we create a centroid: a vector consisting of all words occurring in the entire set of reviews about movie j, where each element corresponds to the tf·idf weight for the given word [40]. The centroid score for review s, which is the sum of the tf·idf scores of all words in review s, quantifies the extent to which it contains words that are important across the set of reviews on the specific movie.
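A sketch of the centroid score under this definition, using scikit-learn's tf·idf weighting (the exact tf·idf variant is an assumption; the paper cites the classic formulation [40]):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [  # toy reviews for a single movie j
    "A timeless classic with a great story.",
    "Great acting and a classic story.",
    "The pacing dragged in the middle.",
]

tfidf = TfidfVectorizer().fit_transform(reviews)      # reviews x vocabulary
centroid = np.asarray(tfidf.mean(axis=0)).ravel()     # average weight per word

# Centroid score of review s: sum the centroid weights of the words it contains.
contains = (tfidf.toarray() > 0).astype(float)
print(contains @ centroid)
```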
Perplexity, entropy and out-of-vocabulary rate (OOV) are measures of textual uniqueness based on a language modeling framework [31]. We view the set of reviews as
a bag of words, from which a probability distribution is derived. The creation of review s is
viewed as a sequence of randomly selected words from the distribution that results from all
other reviews about movie j (i.e., excluding review s). In other words, the random variable
X can take on values (i.e., words) in a discrete set of symbols, which is the vocabulary used
across all of the other reviews about the movie in question.
Given this probability distribution for X, the entropy for a given movie review is the average uncertainty of X:

$$H(p) = -\sum_{x} p(x) \log_2 p(x) \qquad (2)$$
Perplexity is related to entropy as follows:

$$\text{perplexity} = 2^{H(p)} \qquad (3)$$

and provides a measure of uncertainty that is perhaps easier to interpret. A perplexity of k indicates a level of uncertainty or surprise in the variable, equivalent to that faced when choosing between k events that are equally likely to occur. Finally, OOV is simply the proportion of words used in review s that were not used in any of the other reviews written about movie j.

Table 4 Review content features

Centrality   A document-level centroid score [37] quantifies the extent to which review s is important in the discourse about movie j
Uniqueness   Perplexity [31] measures the extent to which review s is surprising. A perplexity of k means that guessing the words in the review is as difficult as guessing between k items of equal probability.
             Entropy [31] quantifies the average uncertainty of review s, expressed as the number of bits required to encode it.
             Out-of-vocabulary rate (OOV) is the proportion of words in review s that do not appear in any other reviews of movie j.
Themes       Latent semantic analysis (LSA): We use the set of 31,300 reviews to create a semantic space on which to perform LSA. We discover the top 20 themes used in our IMDb movie reviews. For each review, we use the scores on these 20 themes to represent the content discussed.
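A sketch of these uniqueness measures under a leave-one-out unigram model; the add-one smoothing and the exclusion of OOV words from the entropy average are assumptions, since the paper does not specify how unseen words are handled:

```python
import math
from collections import Counter

def uniqueness(review, other_reviews):
    """Entropy, perplexity and OOV rate of `review` (a token list) against a
    unigram model built from all other reviews of the same movie."""
    counts = Counter(tok for r in other_reviews for tok in r)
    total = sum(counts.values())
    vocab = set(counts)

    oov = sum(tok not in vocab for tok in review) / len(review)

    # Per-word uncertainty of the review under the model, with add-one
    # smoothing over the model vocabulary (a simplifying assumption).
    in_vocab = [tok for tok in review if tok in vocab]
    entropy = -sum(
        math.log2((counts[tok] + 1) / (total + len(vocab))) for tok in in_vocab
    ) / max(len(in_vocab), 1)
    return entropy, 2 ** entropy, oov

others = [["great", "classic", "story"], ["classic", "acting", "great"]]
print(uniqueness(["great", "film", "classic"], others))
```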
Our last content feature is based on latent semantic analysis (LSA) [27]. We first construct
a semantic space using our movie review data. Specifically, we begin by computing word
counts across all reviews. We then consider the 500 most frequent words. Thus, we have a
matrix with dimensions 31,300 by 500, which we subject to a factor analysis. We derive the
first 20 factors, corresponding to the eigenvectors with the 20 highest eigenvalues. Finally, we
compute the scores on these 20 factors for each review. Our analysis will examine if, based on
this representation, there are differences in content between the male- and female-authored
reviews.
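A sketch of this construction, using truncated SVD as the factorization step (the paper performs factor analysis; SVD is the closely related decomposition commonly used for LSA [27], so this is a stand-in rather than the exact procedure):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "I loved this movie, the story was great",
    "The battle scenes and the special effects were stunning",
    "Her performance made the whole film for me",
    "Greatest classic ever, the best film of all time",
]

# Count matrix over the most frequent words (500 in the paper; fewer here).
counts = CountVectorizer(max_features=20).fit_transform(reviews)

# The paper derives 20 factors; the toy corpus only supports a few.
lsa = TruncatedSVD(n_components=3, random_state=0)
theme_scores = lsa.fit_transform(counts)  # one row of factor scores per review
print(theme_scores.shape)                 # (4, 3)
```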
We compared candidate models to a baseline with only one factor and computed Bentler
and Bonett's normed fit index (NFI) [29]. To achieve a good fit (i.e., NFI of 0.90 or greater),
we need approximately 100 factors. However, interpretation becomes difficult, as many factors have small eigenvalues. Therefore, we also consider the Kaiser criterion, which states
that any factor with an eigenvalue less than one should be dropped [22]. We experiment with
the model having 20 factors, whose smallest eigenvalue is slightly greater than 1.
Table 5 describes the factors with suggested interpretations, and shows the words with the
largest loadings on each factor. As can be seen, the majority of the factors (e.g., F1, F2, F12,
F14) have straightforward interpretations. However, others are less obvious. In particular, it
is difficult to distinguish between the meanings of F5 and F10, both of which involve words used to express positive sentiments about movies. It can also be seen that several factors (e.g., F1, F2) overlap with stylistic features.

Table 5 Factors, interpretations and lexical items with largest loadings

F1  Articles/prepositions   the, of, to, a, in, on, that, with
F2  First person            I, my, me, am, think, saw
F3  Third person            he, his, him, himself, man, wife
F4  Second person           you, your, if, go, see, want
F5  Positive impression     good, really, like, pretty, funny
F6  Success                 perform, best, role, actor, Oscar
F7  Fantasy                 ring, battle, fantasy, book, fan, version
F8  Feminine                her, she, woman, girl, beautiful, relationship
F9  Violence                scene, plot, end, kill, murder, dead
F10 Others liked            movie, watch, great, people, like, good
F11 Location                unit, state, theater
F12 All time hit            ever, best, greatest, classic, favorite
F13 Groups                  they, them, people, other, we, our, together
F14 Family                  kid, girl, children, family, comedy, father
F15 Aux/modals/past         was, were, had, did, didn't, would
F16 Aux/modals              have, been, be, if, any, may, should, do
F17 Comic                   Batman, dark, comic, hero, action, fan
F18 Relations               time, relationship, character, together, first
F19 Time                    hour, minute, two, three, no
F20 Effects                 effect, special, visual, action, star

Table 6 Review and movie metadata

Movie popularity                    # Total reviews about movie j
Reviewer rating (1–10 stars)        Rating assigned by author of review s
Reviewer rating deviation           Reviewer's rating of movie j − average rating of j
Age (days)                          Age of review s
Length (words)                      # Words in review s
Length (sentences)                  # Sentences in review s
Title length (words)                # Words in title of review s
Attention received from community   Total votes for review s
Perceived utility                   Utility of review s
4.3 Metadata
Table 6 summarizes the metadata used in our model. As shown, this information concerns
the movie itself (e.g., popularity), review sentiment (i.e., number of stars reviewer gave the
movie), the age and length of the reviews and how reviews are received by forum participants.
In addition to the reviewer’s rating, we also include the deviation of the rating from average
(i.e., over all reviewers). Previous work found that there is a negative relationship between
the deviation from the average opinion (i.e., item rating) expressed in a review forum, and
the perceived utility of a review [9].
5 Analysis
To compare the characteristics of male- and female-authored reviews, we build a separate
logistic regression model with the features of each type as predictors: (1) stylistic, (2) content
and (3) metadata features. To control for the possibility that the reviews differ in content or
writing style depending on the specific movie reviewed, we initially add dummy variables
for the 250 movies. However, these are not significant in any of the models.
To fit the models, we use a forward stepwise procedure [32] that begins with an empty model (i.e., assigning the label of male to all reviews) and adds a predictor if it significantly improves the likelihood of the current model (i.e., when the likelihood ratio test for the effect of the predictor has a p value <0.05). Once we have a model, we compare the estimated
odds ratios of the predictors, to identify features in each category that are associated with
author gender. For each feature category, we briefly report the individual features with the
strongest correlations to reviewer gender. Finally, we follow the same procedure to build a
model using features of all three types. We evaluate and compare the classification accuracies
of all models in Sect. 5.4.
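A sketch of this forward stepwise procedure built on statsmodels (a hypothetical implementation following the p < 0.05 likelihood ratio criterion described above):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def forward_stepwise(X, y, alpha=0.05):
    """Greedily add the predictor with the most significant likelihood ratio
    test, stopping when no remaining predictor reaches p < alpha."""
    selected, remaining = [], list(range(X.shape[1]))
    llf = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0).llf  # empty model
    while remaining:
        trials = []
        for j in remaining:
            fit = sm.Logit(y, sm.add_constant(X[:, selected + [j]])).fit(disp=0)
            lr = 2 * (fit.llf - llf)              # likelihood ratio statistic
            trials.append((chi2.sf(lr, df=1), j, fit.llf))
        p_value, best, new_llf = min(trials)
        if p_value >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
        llf = new_llf
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
print(forward_stepwise(X, y))  # should select the informative feature 0
```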
5.1 Model with stylistic features
Recall that the features of style we examined include the rates of pronoun use, vocabulary
richness, the use of hedge phrases, word and sentence complexity and lexical features. For
the lexical features, we first identify the most frequent 50 words in the corpus. This results
in a set of five content words (movie, film, story, characters and time) and 45 function
words (pronouns, articles, prepositions, adverbs, verbs, etc.). For each review, we counted
the occurrences of these words, resulting in a set of 50 lexical features. Clearly, some of
the stylistic features are correlated (e.g., occurrences of the word “I” and the number of first
person pronouns in a given review). Therefore, it is expected that several features will end
up not having a significant effect in the model.
The model with stylistic predictors of gender is highly significant. However, it has a pseudo R² (McFadden's R²) of only 0.09.³ This suggests that stylistic features alone most likely do
not provide enough information to distinguish between male and female reviewers. While
we will report the classification accuracy using only stylistic features in Sect. 5.4, here we
discuss the most significant differences in terms of writing style between male and female
reviewers.
Figure 2 summarizes the stylistic trends we observe. The strongest indicator of female writing is a richer vocabulary, with an odds ratio of 6.626. In other words, holding other factors constant, for a one-unit increase in vocabulary richness, the odds of a reviewer being female increase by a factor of 6.6. By contrast, the best predictor of "male" writing is complexity, particularly the character-to-word ratio, with an odds ratio of 0.7919. As shown, female-authored reviews are also characterized by a higher rate of pronoun use, and in particular, the singular first person.
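To connect these figures back to Eq. (1): each reported odds ratio is the exponentiated regression coefficient, so for vocabulary richness, for example,

$$\text{OR} = e^{b}, \qquad b_{\text{vocab}} = \ln(6.626) \approx 1.89,$$

and a one-unit increase in vocabulary richness multiplies the odds of the author being female by $e^{1.89} \approx 6.6$, holding the other predictors constant.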
³ Unlike in OLS regression, the pseudo R² cannot be interpreted as the proportion of variance in the dependent variable that is explained by the model; it is a simple measure of the strength of association between the predictors and the dependent variable. Therefore, it is a useful guide in choosing an appropriate model, but has no literal interpretation.
Fig. 2 Stylistic predictors of female (top) and male (bottom) reviewers
Fig. 3 Content predictors of female (top) and male (bottom) reviewers
In addition, male-authored reviews are characterized by the use of prepositions. Also,
it is interesting that of the five content words, all but the word having to do with people
(“character”) are used more often by men than women. In summary, our analysis of stylistic
features concurs with previous research. In particular, there is evidence that female movie
reviewers have a more involved style, frequently writing in first person voice. By contrast,
men more often use a reporting style, discussing “the movie” or “a film,” and making use of
the third person pronouns “his” and “it.”
5.2 Model with content features
Figure 3 shows the content features that have the strongest correlation to author gender. As can
be seen, the strongest indicator of female writing is the use of first person pronouns. In fact,
the factors representing the use of first, second and third person pronouns are all positively
related to female-authored reviews (odds ratios of 1.4616, 1.0481 and 1.2017, respectively).
We also observe that female reviewers are more likely to discuss themes related to groups,
which is the second strongest correlate.
As in the model with stylistic features, we again observe that the use of prepositions and
articles (F1) is strongly related to male-authored reviews, having an odds ratio of only 0.5721.
The themes discussed more often by males as compared with females include a movie being
an all time hit, violence, and special effects, with odds ratios of 0.8158, 0.8332 and 0.8682,
respectively. Men are also more likely than women to discuss the success of the movie (e.g., whether it won an Oscar). Finally, entropy is the only feature based on language modeling that has
a significant effect in the model, with an odds ratio of 0.8168. Compared with other reviews
written about the same movie, male-authored reviews tend to be more unique or surprising
than those of females.
The likelihood ratio test for the model as a whole is highly significant. However, like the model based on stylistic features, its pseudo R² is only 0.09. Therefore, it is uncertain whether content features alone can adequately distinguish between male- and female-authored reviews.
5.3 Model with metadata features
The model with metadata features is highly statistically significant and has a pseudo R² of 0.1826. This suggests that metadata features in general have a stronger correlation to reviewer
gender than do the stylistic or content features.
While six of the metadata features have a statistically significant effect in the model,
perceived utility is the only feature with an odds ratio that deviates substantially from one
(0.0863). The other features (review length in words, the number of reviews written about
the movie, the total votes received by a review, review age and the length of the review’s
title) have odds ratios between 0.9900 and 1, and therefore are not likely to be useful in
discriminating between reviews written by male versus female participants.
5.4 Model with style, content and metadata features
As mentioned, many of the stylistic and content features are correlated with one another. In
particular, the counts of first, second and third person pronouns used as stylistic features, are
highly correlated to F2, F4 and F3, respectively. In addition, F1 accounts for the use of articles
and prepositions, eliminating the need for the majority of the lexical features. Therefore, to
avoid problems with collinearity in the model using all three types of predictors, we exclude
the 50 lexical features. The resulting model that includes the remaining style, content and
metadata features is detailed in Table 7.
The likelihood ratio test for the model as a whole is highly significant. In addition, the
model based on stylistic, content and metadata features has a pseudo R² of 0.2379. As mentioned, the pseudo R² does not have a straightforward interpretation as does R² in the context of ordinary least squares regression. Therefore, we next turn toward evaluating the
classification accuracy of each of our models.
5.5 Comparison between models
In our analyses, we identified independent variables that are correlated to author gender. To
compare models in terms of their ability to infer gender, and to compare our results to previous
work, we now evaluate their classification accuracy on this task. To do this, we use a tenfold
cross-validation procedure [33]. Table 8 reports the average classification accuracies for each
model, as well as the respective standard deviation. In addition, λp [32] is also shown. This statistic ranges from 0 to 1 and expresses the proportional reduction in error, relative to the baseline of predicting the mode (i.e., labeling all reviewers as male, which results in 50 % error). As a comparison, we also show the accuracy when review utility is the only predictor.
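A sketch of this evaluation using scikit-learn's tenfold cross-validation (toy data; on the gender-balanced corpus, always predicting "male" gives the 50 % error baseline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))                         # toy review feature matrix
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)  # 0 = male, 1 = female (toy)

acc = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="accuracy")
print(f"mean accuracy = {acc.mean():.2%}, sd = {acc.std():.4f}")

# lambda_p: proportional reduction in error over predicting the mode (50% error).
baseline_error = 0.5
lambda_p = (baseline_error - (1 - acc.mean())) / baseline_error
print(f"lambda_p = {lambda_p:.3f}")
```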
As can be seen, while all models do significantly better than the baseline, the model
incorporating all three types of predictor variables has the best performance. Its accuracy is
significantly better than that of the content only and style only models. However, there is not
a statistically significant difference between its accuracy and that of the models using only
metadata and only review utility to infer gender.
Others have previously noted that features characterizing the writing style and content of a
text can be sensitive to the text’s length (e.g., [44]). Therefore, we also examined the average
classification accuracy of each of the models, as a function of review length. Figure 4 indeed
illustrates that the accuracy of the models incorporating only stylistic and content features is
positively correlated with length. By contrast, the models using all features and only metadata are relatively more stable.

Fig. 4 Accuracy as a function of review length

Table 7 Model with style, content and metadata features

                                  Odds ratio
Metadata
  Utility                         0.0863
  Length (sents)                  0.9854
  Total votes                     0.9946
  # Reviews                       0.9983
  Title length (chars)            0.9982
  Age                             0.9998
Style
  Vocab richness                  1.6870
  PN/1000                         1.0060
  Words/sents                     0.9888
Uniqueness
  Entropy                         0.8320
Content
  1st person (F2)                 1.4596
  Groups (F13)                    1.2100
  Comic (F17)                     1.2100
  Aux/modals/past (F15)           1.2012
  Feminine (F8)                   1.1777
  Location (F11)                  1.1710
  Fantasy (F7)                    1.1315
  Family (F14)                    1.1115
  3rd person (F3)                 1.1077
  Violence (F9)                   0.8111
  All time hit (F12)              0.8186
  Positive impression (F5)        0.8771
  Articles/prepositions (F1)      0.8800
  Effects (F20)                   0.9347
  Success (F6)                    0.9447

Table 8 Classification accuracy for all models

           Accuracy   SD       λp
Style      64.55      1.0717   0.291
Content    65.03      1.2119   0.301
Metadata   72.95      3.088    0.459
Utility    72.46      3.495    0.449
ALL        73.71      3.662    0.474
6 Discussion
There are salient differences both in terms of writing style and content between IMDb movie
reviews written by male and female authors. However, the perceived utility of the review
by IMDb participants is the strongest predictor of gender. We examine this point further
in order to provide a possible explanation. We also discuss what we have learned about
inferring author gender from short texts contributed in online forums. We then conclude with
a discussion of areas for future work.
6.1 Gender and review utility
Why is utility such a strong predictor of gender, even when controlling for properties such as
review length, valence and age, which are known to bias collaborative utility scores [9,28]?
As mentioned in the introduction, consumers with similar demographics are more likely to
share tastes in hedonic goods such as movies. Therefore, a likely explanation is that a large
proportion of IMDb participants are men, who give more favorable feedback to male-authored
reviews, since they are similar to their own perspectives and communicate those views in a
way that is familiar to them.
We do not have the necessary data to examine whether men give better feedback to male-authored reviews as compared with those written by women, since voting is anonymous at
IMDb. In addition, we cannot be sure if the majority of participants are male or female, since
this information is not available in user profiles. However, we can study the characteristics
of the highly rated female-authored reviews. In other words, could it be the case that reviews
authored by women that are perceived to be relatively useful are more similar to male-authored
reviews?
To examine this question, we divide the female-authored reviews into two groups: more useful (utility >0.5; n = 2,898) and less useful (utility ≤0.5; n = 12,752). We use the
same forward stepwise procedure (with a significance threshold of 0.05) to fit a logistic
regression model to distinguish the more useful reviews from the less useful ones, using
stylistic (excluding the lexical features), content and metadata predictors. Several variables,
which have been noted to be positively correlated to review utility [9], have significant
positive effects in the model: total feedback votes, review age, reviewer’s rating of the movie
and length.
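A sketch of this split-and-refit step (toy data; the paper fits the response with the same stepwise procedure, whereas this stand-in uses a plain logistic fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_female = rng.normal(size=(500, 6))   # toy features for female-authored reviews
utility = rng.uniform(size=500)        # toy stand-in for the x/y utility scores

y = (utility > 0.5).astype(int)        # 1 = "more useful", 0 = "less useful"
fit = LogisticRegression().fit(X_female, y)
print(np.exp(fit.coef_.ravel()))       # odds ratios for the more-useful class
```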
Figure 5 displays predictors that have a significant effect in the model, with their respective
odds ratios. Of interest is that one of the themes that were associated with male-authored
reviews in Fig. 3, “all time hit,” is characteristic of the more useful female-authored reviews.
Fig. 5 Features of more (top) and less (bottom) useful women’s reviews
Likewise, we can observe that vocabulary richness, which was associated with female-authored reviews in the model described previously in Table 7, is correlated with the less
useful reviews written by women. Finally, the use of first and second person pronouns is
also more characteristic of less useful reviews. In summary, the analysis shows that reviews written by women that are viewed by IMDb participants as more useful do have similarities to male-authored reviews, as compared with less useful female-authored reviews.
6.2 Participation
To further explain the above findings, we can consider the rate of participation of the 21,012
reviewers in our data set. From reviewers’ profile pages, we obtained the total number of
reviews that they had contributed at IMDb. Indeed, male reviewers are more active than
females; the mean number of reviews contributed by males is 42.7 (median of 5), whereas
for females the mean is only 8.7 (median of 2). In addition, it is illustrative that the most
prolific male reviewer in our data set has written four times more reviews than the most prolific
female (a total of 8,167 versus 2,061 reviews, respectively). Therefore, we find additional
evidence that IMDb is a male-majority forum.
Herring’s “list effect” [16] does not apply here, since we do not find that both males and
females write like males, the majority gender. Herring studied academic discussion lists, and
it is likely that within an academic community, participants know one another better and
have more interaction than in a large forum such as IMDb. At IMDb, we see that there are
significant differences, both in writing style and content, between male- and female-authored
reviews. In addition, we do observe that a relatively small number of female-authored reviews
exhibit “male” characteristics, and that these reviews are perceived as being more useful, as
compared with reviews written in a “female” voice. It may be that such female participants
recognize that the less personal, more objective, “male” style of writing is better received by
others [43], and so try to adopt that style.
6.3 Classifying author gender for short texts
We are not aware of previous work that considered how to infer author gender for textual
contributions posted in online forums. As previously mentioned, the movie reviews in our
data set are substantially shorter than the texts other researchers have experimented on, and we
anticipated that stylistic and content metrics might not be as robust as they are for longer texts.
At the same time, texts posted in an online forum are qualitatively different than individual
texts such as scientific articles and books that others have studied. For instance, reviews at
IMDb occur within an ongoing discourse about a given movie, and readers are invited to be
active participants, both by writing their own reviews, as well as in voting for the reviews
they find most useful.
We found that in order to compensate for the brevity of reviews, content and stylistic
features can be augmented with metadata, which reflect social (e.g., utility or how others
perceive the review) and dynamic (e.g., review age) aspects of the forum. Previous work on
classifying literary and scientific texts [23,3] and blogs [4] with respect to author gender
reported accuracy rates around 80 %. In comparison, when we use only stylistic or content
features, we can infer movie reviewer gender with 65 % accuracy. If we also exploit metadata,
accuracy increases to just under 74 %.
In future work, we plan to reexamine the performance of the three classifiers, given more information about each author. Here, we had only one review per author on which to base a decision as to his or her gender, along with movie and review metadata. Another task
of interest is to infer a user’s gender, given all of his or her postings at the site. In other words,
for many users, we will have reviews of multiple movies, along with associated metadata
(e.g., the user’s average utility over a larger set of contributions). Given that features of
writing style and content are known to be sensitive to text length [44] and the brief nature of
online reviews, we plan to further explore reviewer gender classification exploiting different
types of information.
7 Conclusion
Review forums are large, multiuser information spaces in which participants, who likely have
not had previous interactions with one another, exchange information via textual postings,
describing their experiences with and opinions of various items. In popular forums, it is
not uncommon for users to find hundreds or even thousands of reviews posted about an
item of interest. In response to the need to organize this information, collaborative filtering
mechanisms, in which participants vote as to which reviews they find most “helpful” or
“useful” are commonly used. This is an easy means to organize reviews, since it does not
require administrators to modify postings, nor does it require much effort from participants,
who provide simple binary feedback as to whether or not reviews were useful to them.
However, researchers who have studied these voting mechanisms have reported biases,
such as the early bird and the winners’ circle [28], as well as the penalizing of reviews
expressing views that deviate from the majority, particularly when the review is more critical
than the majority [9]. Our case study of the IMDb suggests that a gender bias exists at this
particular site, with female-authored reviews being seen as significantly less useful than those
contributed by males. In other words, if it were not for IMDb’s gender filter, under the default
collaborative filter, most of the highly ranked reviews would be authored by male participants,
effectively hiding the views of female participants.
In this work, we used text classification to see which aspects of writing style, content and forum metadata could effectively discriminate between female- and male-authored
reviews. We confirmed that, as found in previous research, female and male participants do
write differently and discuss contrasting themes in their movie reviews. Based on the observation that female-authored reviews are much less likely to be judged by others as being
useful, and that males are generally more prolific than female participants, we concluded that IMDb is a male-majority forum.
There is much discussion about gender and participation online. For instance, in a recent
New York Times Room for Debate session,⁴ debaters discussed recent findings that less than
15 % of contributors at Wikipedia are female [14]. It is important to consider why this gender gap exists and how it might be reduced.

⁴ http://www.nytimes.com/roomfordebate/2011/02/02/where-are-the-women-in-wikipedia
In future work, tools based on text classification might aid in detecting and ameliorating such problems. For instance, tools could be developed to infer demographic
characteristics of contributors. This might help site designers to determine if certain groups
are underrepresented, such that they could be encouraged to participate. In addition, in forums
that use simple collaborative filtering (e.g., “helpful” or “useful” mechanisms), demographic
information could be used to suggest yet unrated contributions of possible interest to a user,
based on her own demographic traits. This could help reduce biases such as the “winners’
circle” by giving unrated content a chance to collect votes. In conclusion, mechanisms can
be developed that might help forums balance out the participation and visibility of all participants, and that make participants feel as though their contributions are valued. This will be
challenging work from both technical and sociotechnical points of view, as care must
be taken not only to ensure that classification algorithms are sufficiently accurate, but also
that demographic information is not used in a way that might cause privacy concerns among
participants.
Acknowledgments We thank the anonymous reviewers who provided helpful feedback on this work, as well
as the reviewers of an earlier version of this work, which appeared at ACM CIKM 2010. We also acknowledge
the insightful advice of Alexia Panayiotou, as well as Mengyuan (Serena) Li’s assistance with data collection.
References
1. Acquisti A, Gross R (2006) Imagined communities: awareness, information sharing, and privacy on
Facebook. In: Privacy enhancing technologies. Lecture notes in computer science, Springer, Berlin
2. Ahmed A, Low Y, Aly M, Josifovski V, Smola A (2011) Scalable distributed inference of dynamic user
interests for behavioral targeting. In: Proceedings of the 17th ACM SIGKDD international conference on
knowledge discovery and data mining, San Diego, pp 114–122
3. Argamon S, Koppel M, Fine J, Shimoni AR (2003) Gender, genre, and writing style in formal written
texts, Text 23
4. Argamon S, Koppel M, Pennebaker JW, Schler J (2007) Mining the blogosphere: age, gender and the
varieties of self-expression. First Monday 12(9). http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/
fm/article/view/2003/1878
5. Awad NF, Ragowsky A (2008) Establishing trust in electronic commerce through online word of mouth:
an examination across genders. J Manag Inf Syst 24(4):101–121
6. Bruckman A (1996) Gender swapping on the internet. In: Ludlow P (ed) High noon on the electronic
frontier: conceptual issues in cyberspace. MIT Press, Cambridge
7. Chung C, Pennebaker JW (2008) Revealing dimensions of thinking in open-ended self-descriptions: an
automated meaning extraction method for natural language. J Res Pers 42:96–132
8. Coates J (1993) Women, men, and language. Longman, London
9. Danescu-Niculescu-Mizil C, Kossinets G, Kleinberg J, Lee L (2009) How opinions are received by online
communities: a case study on Amazon.com helpfulness votes. In: Proceedings of the international world
wide web conference, Madrid, Spain, pp 141–150
10. Dellarocas C (2003) The digitization of word of mouth: promise and challenges of online feedback
mechanisms. Manag Sci 49(10):1407–1424
11. Foltz PW, Laham D, Landauer TK (1999) Automated essay scoring: applications to educational technology. In: Proceedings of world conference on educational multimedia, hypermedia and telecommunications, Chesapeake, pp 939–944
12. Gefen D, Ridings CM (2005) If you spoke as she does, sir, instead of the way you do: a sociolinguistics
perspective of gender differences in virtual communities. ACM SIGMIS Database 3(2):78–92
13. Ghose A, Ipeirotis PG (2010) Estimating the helpfulness and economic impact of product reviews: mining
text and reviewer characteristics. IEEE Trans Knowl Data Eng 1498–1512
14. Glott R, Ghosh R, Schmidt P (2010) Analysis of Wikipedia survey UNU-MERIT http://www.
wikipediasurvey.org/docs/Wikipedia_Overview_15March2010-FINAL.pdf
15. Herring SC (1996) Posting in a different voice: gender and ethics in computer-mediated communication.
In: Ess C (ed) Philosophical perspectives on computer-mediated communication. SUNY Press, New York,
pp 115–145
16. Herring SC (1996b) Two variants of an electronic message schema. In: Herring SC (ed) Computer-mediated communication: linguistic, social and cross-cultural perspectives. John Benjamins, New York,
pp 81–108
17. Herring SC (2003) Gender and power in online communication. In: Holmes J, Meyerhoff M (eds) The
handbook of language and gender. Blackwell, Oxford, pp 202–228
18. Hirschman EC, Holbrook MB (1982) Hedonic consumption: emerging concepts, methods and propositions. J Mark 46(Summer):92–101
19. Holbrook MB, Schindler RM (1994) Age, sex and attitude toward the past as predictors of consumers’
aesthetic tastes for cultural products. J Mark Res 31:412–422
20. Hu J, Zeng H-J, Li H, Niu C, Chen Z (2007) Demographic prediction based on user’s browsing behavior.
In: Proceedings of the ACM WWW Banff, Albert, May 2007, pp 151–160
21. Joachims T, Granka L, Pan B, Hembrooke H, Radlinski F, Gay G (2007) Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans Inf Syst 25(2). http://
doi.acm.org/10.1145/1229179.1229181
22. Kaiser HF (1960) The application of electronic computers to factor analysis. Educ Psychol Meas 20:141–
151
23. Koppel M, Argamon S, Shimoni AR (2003) Automatically categorizing written texts by author gender.
Lit Linguist Comput 17(4):401–412
24. Kostakos V (2009) Is the crowd’s wisdom biased? A quantitative analysis of three online communities. In: Proceedings of IEEE social communication, international symposium on social intelligence and
networking, Vancouver, Canada, pp 251–255
25. Lakoff G (1973) Hedges: a study in meaning criteria and the logic of fuzzy concepts. J Philos Log
2(4):458–508
26. Lakoff R (1973) Language and woman’s place. Lang Soc 2:45–79
27. Landauer TK, Foltz PW, Laham D (1998) An introduction to latent semantic analysis. Discourse Process
25:259–284
28. Liu J, Cao Y, Lin C-Y, Huang Y, Zhou M (2007) Low-quality product review detection in opinion
summarization. In: Proceedings of the conference on empirical methods in natural language processing,
pp 334–342
29. Loehlin JC (1992) Latent variable models. Lawrence Erlbaum Associates, London
30. Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger
than the other. Ann Math Stat 18:50–60
31. Manning CD, Schutze H (2000) Foundations of statistical natural language processing. MIT Press, Cambridge
32. Menard S (2002) Applied logistic regression analysis. Quantitative applications in the social sciences.
Sage University Press, Beverley Hills
33. Mitchell T (1997) Machine learning. McGraw Hill, New York
34. Muhlestein D, Lim S (2011) Online learning with social computing based interest sharing. Knowl Inf
Syst 26:31–58
35. Oliver MB, Weaver JB III (2000) An examination of factors related to sex differences in enjoyment of
sad films. J Broadcast Electron Media 44(2):282–300
36. Popescu A, Grefenstette G (2010) Mining user home location and gender from Flickr tags. In: Proceedings
of the 4th international conference on weblogs and social media, Washington, DC, May 2010
37. Radev D, Jing H, Stys M, Tam D (2004) Centroid-based summarization of multiple documents. Inf Process
Manag 40:919–938
38. Roth M, Ben-David A, Deutscher D, Flysher G, Horn I, Leichtberg A, Leiser N, Matias Y, Merom R
(2010) Suggesting friends using the implicit social graph. In: Proceedings of the 16th ACM SIGKDD
international conference on knowledge discovery and data mining, Washington, DC, pp 233–242
39. Sahlgren M, Karlgren J (2009) Terminology mining in social media. In: Proceedings of the ACM conference on information and knowledge management, Hong Kong, Nov 2009, pp 405–414
40. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill, Inc., New York
41. Schein AI, Popescul A, Ungar LH, Pennock DM (2002) Methods and metrics for cold-start recommendations. In: Proceedings of SIGIR, pp 253–260
42. Schindler RM, Bickart B (2005) Published word of mouth: referable, consumer-generated information
on the Internet. In: Hauvgedt C, Machleit K, Yalch R (eds) Online consumer psychology: understanding
and influencing behavior in the virtual world. Lawrence Erlbaum Associates, London, pp 35–61
43. Spender D (1989) The writing or the sex or why you don’t have to read women’s writing to know it’s no
good. Elsevier, Oxford
44. Stamatatos E, Kokkinakis G, Fakotakis N (2000) Automatic text categorization in terms of genre and
author. Comput Linguist 26(4):471–495
45. Stutzman F (2006) An evaluation of identity-sharing behavior in social network communities. Intern Digit
Media Arts J 3(1):10–18
46. Tannen D (1990) You just don’t understand. HarperCollins Publishers, Inc., New York
47. Terveen L, McDonald DW (2005) Social matching: a framework and research agenda. ACM Trans Comput
Hum Interact 12(3):401–434
48. Tsaparas P, Ntoulas A, Terzi E (2011) Selecting a comprehensive set of reviews. In: Proceedings of the
17th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, pp
168–176
49. Wang D, Tse QCK, Zhou Y (2011) A decentralized search engine for dynamic Web communities. Knowl
Inf Syst 26:105–125
50. Yang Y (1999) An evaluation of statistical approaches to text categorization. Inf Retr J 1(1):69–90
51. Zhai Z, Liu B, Xu H, Jia P (2011) Clustering product features for opinion mining. In: Proceedings of the
fourth ACM international conference on Web search and data mining, Hong Kong, pp 347–354
52. Zhang Z, Varadarajan B (2006) Utility scoring of product reviews. In: Proceedings of the ACM conference
on information and knowledge management, Arlington, pp 51–57
Author Biography
Jahna Otterbacher received her PhD in 2006 from the School of
Information at the University of Michigan (Ann Arbor), USA. She is
currently an Assistant Professor in the Lewis Department of Humanities at the Illinois Institute of Technology (IIT) in Chicago, Illinois,
USA. Broadly, the goals of her research are to understand and use patterns in language to facilitate online interaction between people, as well
as to enhance their access to information.