
Multidimensional Characterization of Expert Users in the Yelp Review Network∗

Cheng Han Lee
Department of Computer Science
University of Illinois at Urbana-Champaign
[email protected]

Sean Massung
Department of Computer Science
University of Illinois at Urbana-Champaign
[email protected]

∗Submitted as the semester project for CS 598hs Fall 2014.
ABSTRACT
In this paper, we propose a multidimensional model that integrates text analysis, temporal information, network structure, and user metadata to effectively predict experts from
a large collection of user profiles. We make use of the Yelp
Academic Dataset which provides us with a rich social network of bidirectional friendships, full review text including
formatting, timestamped activities, and user metadata (such
as votes and other information) in order to analyze and
train our classification models. Through our experiments,
we hope to develop a feature set that can be used to accurately predict whether a user is a Yelp expert (also known
as an ‘elite’ user) or a normal user. We show that each of
the four feature types is able to capture a signal that a user
is an expert user. In the end, we combine all feature sets
together in an attempt to raise the classification accuracy
even higher.
Keywords
network mining, text mining, expert finding, social network
analysis, time series analysis
1. INTRODUCTION
Expert finding seeks to locate users in a particular domain
that have more qualifications or knowledge (expertise) than
the average user. Usually, the number of experts is very
low compared to the overall population, making this a challenging problem. Expert finding is especially important in
medical, legal, and even governmental situations. In our
work, we focus on the Yelp academic dataset [8] since it has
many rich features that are unavailable in other domains.
In particular, we have full review content, timestamps, the
friend graph, and user metadata—this allows us to use techniques from text mining, time series analysis, social network
analysis, and classical machine learning.
From a user’s perspective, it’s important to find an expert reviewer to give a fair or useful review of a business that
may be a future destination. From a business’s perspective,
expert reviewers should be great summarizers and able to
explain exactly how to improve their store or restaurant. In
both cases, it’s much more efficient to find the opinion of an
expert reviewer than sift through hundreds of thousands of
potentially useless or spam reviews.
Yelp is a crowd-sourced business review site as well as a social network, consisting of several objects: users, reviews,
and businesses. Users write text reviews accompanied by a
star rating for businesses they visit. Users also have bidirectional friendships as well as one-directional fans. We
consider the social network to consist of the bidirectional
friendships since each user consents to the friendship of the
other user. Additionally, popular users are much less likely
to know their individual fans making this connection much
weaker. Each review object is annotated with a time stamp,
so we are able to investigate trends temporally.
The purpose of this work is to investigate and analyze the
Yelp dataset and find potentially interesting patterns that
we can exploit in our future expert-finding system. The key
question we hope to answer is:
Given a network of Yelp users, who is an elite user?
To answer the above question, we have to first address the
following:
1. How does the text in expert reviews differ from text in
normal reviews?
2. How does the average number of votes per review for
a user change over time?
3. Are elite users the first to review a new business?
4. Does the social network structure suggest whether a
user is an elite user?
5. Does the user metadata available from Yelp give any indication of a user’s status?
The structure of this paper is as follows: in section 2, we
discuss related work. In sections 3, 4, 5, and 6, we discuss the
four different dimensions of the Yelp dataset. For the first
three feature types, we use text analysis, temporal analysis,
and social network analysis respectively. The user metadata
is already in a quantized format, so we simply overview the
fields available. Section 7 details running experiments on the
proposed features on balanced (number of experts is equal
to the number of normal users) and unbalanced (number of
experts is much less) data. Finally, we end with conclusions
and future work in section 8.
2. RELATED WORK
RankClus [13] integrates clustering and ranking on heterogeneous information networks. Within each cluster, a ranking
of nodes is created, so the top-ranked nodes could be considered experts for a given cluster. For example, consider the
DBLP bibliographic network. Clusters are formed based on
authors who share coauthors, and within each cluster there
is a ranking of authoritative authors (experts in their field).
Clustering and ranking are defined to mutually enhance each
other since conditional rank is used as a clustering feature
and cluster membership is used as an object feature. In
order to determine the final configuration, an expectation-maximization algorithm is used to iteratively update cluster
and ranking assignments.
This work is relevant to our Yelp dataset if we consider clusters to be the business categories, and experts to be domain
experts. However, the Yelp categories are not well-defined
since some category labels overlap, so some extra processing
may be necessary to deal with this issue.
Expert Finding in Social Networks [3] considers Facebook,
LinkedIn, and Twitter as domains where experts reside. Instead of labeling nodes from the entire graph as experts, a
subset of candidate nodes is considered and they are ranked
according to an expertise score. These expertise scores are
obtained through link relation types defined on each social
network, such as creates, contains, annotates, owns, etc.
To rank experts, they used a vector-space retrieval model
common in information retrieval [10] and evaluated with
popular IR metrics such as MAP, MRR, and NDCG [10].
Their vector space consisted of resources, related entities,
and expertise measures. They concluded that profile information is a less useful determiner for expertise than their
extracted relations, and that resources created by others including the target are also quite useful.
A paper on “Expertise Networks” [17] begins with a large
study on analyzing a question and answer forum; typical
centrality measures such as PageRank [11] and HITS [9] are
used to initially find expert users. Then, other features describing these expert users are defined or extracted in order
to create an “ExpertiseRank” algorithm, which (as far as we
can tell) is essentially PageRank. This algorithm was then evaluated by human raters, and it was found that ExpertiseRank had slightly smaller errors than the other measures (including HITS; it was not evaluated against PageRank).
While the result of ExpertiseRank is unsurprising, we would
be unable to directly use it or PageRank since the Yelp social
network is undirected; running PageRank on an undirected
network approximates degree centrality.
Figure 1: Comparison of features (text, time, and graph) used in previous work: Sun et al. 2009 [13], Bozzon et al. 2013 [3], Zhang et al. 2007 [17], Choudhury et al. 2009 [5], Balog et al. 2009 [1], and Ehrlich et al. 2007 [6].
A paper on Expert Language Models [1] builds two different
language models by invoking Bayes’ Theorem. The conditional probability of a candidate given a specific query is
estimated by representing it using a multinomial probability
distribution over the vocabulary terms. A candidate model θ_ca is inferred for each candidate ca, such that the probability of a query term t given the candidate model is p(t|θ_ca). For one of the models, they assumed that the document and the candidate are conditionally independent; for the other model, they used the probability p(t|d, ca), which is based on the strength of the co-occurrence between a term and a candidate in a particular document.
In terms of the modeling techniques used above, we can
adopt a similar method whereby the candidate in the Expert
Language Models [1] is a Yelp user, and we will determine the
extent to which a review characterizes an elite or normal
user.
For the paper on Interesting YouTube commenters [5], the
goal is to determine a real scalar value corresponding to
each conversation to measure its interestingness. The model comprises detecting conversational themes using a mixture model approach, determining the ‘interestingness’ of participants and conversations based on a random walk model,
and lastly, establishing the consequential impact of ‘interestingness’ via different metrics. The paper could be useful
to us for characterizing reviews and Yelp users in terms of
‘interestingness’. An intuitive conjecture is that ‘elite’ users
should be ones with high ‘interestingness’ level and likewise,
they should post reviews that are interesting.
In summary, Fig 1 shows a comparison of the related work surveyed and which aspects of the dataset each paper examined.
3. TEXT ANALYSIS
The text analysis examines the reviews written by each user
in order to extract features from the unstructured text content. Common text processing techniques such as indexing,
categorization, and language modeling are explored in the
next sections.
3.1 Datasets
First, we preprocessed the text by lowercasing, removing stop words, and performing stemming with the Porter2 English stemmer. Text was tokenized into unigram bag-of-words features by the MeTA toolkit (http://meta-toolkit.github.io/meta/).
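To make the preprocessing concrete, the following is a rough Python sketch of the steps just described (lowercasing, stop word removal, Porter2 stemming, and unigram counting). It uses NLTK as a stand-in for the MeTA toolkit, so the library calls and the function name are illustrative assumptions rather than the actual pipeline.

```python
# Rough sketch of the preprocessing described above. NLTK is used here as a
# stand-in for the MeTA toolkit; the Snowball "english" stemmer implements Porter2.
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

STOP_WORDS = set(stopwords.words("english"))   # requires the NLTK stopwords corpus
STEMMER = SnowballStemmer("english")

def tokenize(review_text):
    """Lowercase, drop stop words, stem, and return a unigram bag of words."""
    tokens = review_text.lower().split()
    tokens = [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)   # term -> frequency
```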
We created three datasets that are used in this section:

• All: contains all the elite reviews (reviews written by an elite user) and an equal number of normal reviews

• Elite: contains only reviews written by elite users

• Normal: contains only reviews written by normal users

Elite and Normal together make up All; this is to ensure the analyses run on these corpora have a balanced class distribution. Overall, there were 1,125,458 reviews, consisting of 329,624 elite reviews and 795,834 normal reviews. Thus, the set of normal reviews was shuffled and truncated to 329,624.

The number of elite users is far smaller than the ratio of written reviews may suggest; this is because elite users write many more reviews on average than normal users. A summary of the three datasets is found in Fig 2.

Dataset   Docs      Avg. Length   |V|       Raw (MB)   Index (MB)
All       659,248   81.8          164,311   480        81
Elite     329,624   98.8          125,137   290        46
Normal    329,624   64.9          95,428    190        37

Figure 2: Comparison of the three text datasets of Yelp reviews in terms of corpus size, average document length (in words), vocabulary size, raw data size, and indexed data size (both in MB).

3.2 Classification

We tested how easy it is to distinguish between an elite review and a non-elite (normal) review with a simple supervised classification task. We used the dataset All described in the previous section along with each review’s true label to train an SVM classifier. Evaluation was performed with five-fold cross-validation and had a baseline of 50% accuracy. Results of this categorization experiment are displayed in Fig 3.

Confusion Matrix
          classified as elite   classified as normal
elite     0.665                 0.335
normal    0.252                 0.748

Class     F1 Score   Precision   Recall
Elite     0.694      0.665       0.725
Normal    0.718      0.748       0.691
Total     0.706      0.706       0.708

Figure 3: Confusion matrix and classification accuracy on normal vs elite reviews.

The confusion matrix tells us that it was slightly easier to classify normal reviews, though the overall accuracy was acceptable at just over 70%. The precision and recall maxima fell on opposite classes, though overall the F1 scores were similar. Recall that the F1 score is the harmonic mean of precision P and recall R:

F_1 = \frac{2PR}{P + R}

Since this is just a baseline classifier, we expect it is possible to achieve higher accuracy using more advanced features such as n-grams of words or grammatical features like part-of-speech tags or parse tree productions. However, this initial experiment is to determine whether elite and non-elite reviews can be separated based on text alone, with no regard to the author or context. Since the accuracy of this default model is 70%, it seems that text will make a useful subset of the overall features used to predict expertise.

Furthermore, remember that this classification experiment is not about whether a user is elite, but rather whether a review has been written by an elite user; it would be very straightforward to extend this problem to classify users instead, where each user is a combination of all reviews that he or she writes. In fact, this is what we do in section 7, where we are concerned with identifying elite users.

3.3 Language Model
We now turn to the next text analysis method: unigram
language models. A language model is simply a distribution
of words given some context. In our example, we will define
three language models—each based on a corpus described in
section 3.1.
The background language model (or “collection” language
model) simply represents the All corpus. We define a
smoothed collection language model p_C(w) as

p_C(w) = \frac{count(w, C) + 1}{|C| + |V|}

This creates a distribution p_C(·) over each word w ∈ V. Here, C is the corpus and V is the vocabulary (each unique word in C), so |C| is the total number of words in the corpus and |V| is the number of unique terms.
The collection language model essentially shows the probability of a word occurring in the entire corpus. Thus, we
can sort the outcomes in the distribution by their assigned
probabilities to get the most frequent words. Unsurprisingly,
these words are the common stop words with no real content
information. However, we will use this background model to
filter out words specific to elite or normal users.
We now define another unigram language model to represent the probability of seeing a word w in a corpus θ ∈
{elite, normal}. We create a normalized language model
score per word using the smoothed background model defined previously:
score(w, θ) = \frac{count(w, θ) / |θ|}{p_C(w)} = \frac{count(w, θ)}{p_C(w) \cdot |θ|}
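As an illustration, the two quantities above might be computed as follows; the dictionary-based storage of term counts and the function names are assumptions, not part of the MeTA pipeline.

```python
# Illustrative computation of the smoothed background model p_C(w) and the
# normalized score(w, theta); `counts_*` are plain dicts of term frequencies
# (an assumption about how the corpora are stored).
def collection_prob(word, counts_all, corpus_len, vocab_size):
    """Add-one smoothed probability of `word` in the All corpus."""
    return (counts_all.get(word, 0) + 1) / (corpus_len + vocab_size)

def lm_score(word, counts_theta, theta_len, counts_all, corpus_len, vocab_size):
    """score(w, theta) = (count(w, theta) / |theta|) / p_C(w)."""
    p_c = collection_prob(word, counts_all, corpus_len, vocab_size)
    return counts_theta.get(word, 0) / (theta_len * p_c)

# Example: rank the elite vocabulary by score and keep the top 20 tokens.
# top_elite = sorted(counts_elite,
#                    key=lambda w: lm_score(w, counts_elite, elite_len,
#                                           counts_all, all_len, vocab_size),
#                    reverse=True)[:20]
```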
Background   Normal     Elite
the          gorsek     uuu
and          forks)     aloha!!!
a            yu-go      **recommendations**
i            sabroso    meter:
to           (***       **summary**
was          eloff      carin
of           -/+        no1dp
is           jeph       (lyrics
for          deirdra    friends!!!!!
it           ruffin’    **ordered**
in           josefa     8/20/2011
that         ubox       rickie
my           waite      kuge
with         again!!    ;]]]
but          optionz    #365
this         ecig       g
you          nulook     *price
we           gtr        visits):
they         shiba      r
on           kenta      ik

Figure 4: Top 20 tokens from each of the three language models.

The goal of the language model score is to find unigram tokens that are very indicative of their respective categories; using a language model this way can be seen as a form of feature selection. Fig 4 shows a comparison of the top twenty words from each of the three models.

These default language models did not reveal very clear differences in word usage between the two categories, despite the elite users using a larger vocabulary, as shown in Fig 2. The singular finding was that the elite language model shows its users are more likely to segment their reviews into different sections discussing different aspects of the business: for example, recommendations, summary, ordered, or price.

Also, it may appear that there are a good deal of nonsense words among the top words from each language model. However, upon closer inspection, these words are actually valid given some domain knowledge of the Yelp dataset. For example, the top word “gorsek” in the normal language model is the last name of a normal user who always signs his posts. Similarly, “sabroso” is a Spanish word meaning delicious that a particular user likes to use in his posts. Similar arguments can be made for other words in the normal language model. In the elite model, “uuu” was originally “\uuu/”, an emoticon that an elite user is fond of. “No1DP” is a Yelp username that is often referred to by a few other elite users in their review text.

Work on supervised and unsupervised review aspect segmentation has been done before [15, 16], and it may be applicable in our case since there are clear boundaries in aspect mentions. Another approach would be to add a boolean feature has_aspects that detects whether a review is segmented in the style popular among elite users.

3.4 Typographical Features

Based partly on the experiments performed in section 3.3, we now define typographical features of the review text. We call a feature ‘typographical’ if it is a trait that can’t be detected by a unigram words tokenizer and is indicative of the style of review writing. We use the following six style or typographical features; a brief sketch of how they might be computed follows this subsection.

• Average review length. We calculate review length as the number of whitespace-delimited tokens in a review. Average review length is simply the average of this count across all of a user’s reviews.

• Average review sentiment. We used sentiment valence scores [12] to calculate the sentiment of an entire review. The sentiment valence score is < 0 if the overall sentiment is negative and > 0 if the overall sentiment is positive.

• Paragraph rate. Based on the language model analysis, we included a feature to detect whether paragraph segmentation was used in a review. We simply count the rate of multiple newline characters per review per user.

• List rate. Again based on the language model analysis, we add this feature to detect whether a bulleted list is included in the review. We defined a list as the beginning of a line followed by ‘*’ or ‘-’ before alphabetic characters.

• All caps. The rate of words in all capital letters. We suspect very high rates of capital letters will indicate spam or useless reviews.

• Bad punctuation. Again, this feature is to detect less serious reviews in an attempt to find spam. A basic example of bad punctuation is not starting a new sentence with a capital letter.

Although the number of features here is low, we hope that the added meaning behind each one is more informative than a single unigram words feature.
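A minimal sketch of how most of these style features might be extracted is given below. The regular expressions and the placeholder valence lexicon are illustrative simplifications, and the bad-punctuation check is omitted.

```python
import re

def style_features(reviews, valence):
    """Per-user style features from raw review strings. `valence` is assumed to
    map a word to a sentiment score; the bad-punctuation check is omitted."""
    n = len(reviews)
    lengths, sentiments, caps_rates, paragraphs, lists = [], [], [], [], []
    for text in reviews:
        words = text.split()
        lengths.append(len(words))
        sentiments.append(sum(valence.get(w.lower(), 0.0) for w in words))
        caps_rates.append(sum(w.isupper() for w in words) / max(len(words), 1))
        paragraphs.append(len(re.findall(r"\n\s*\n", text)))             # blank-line breaks
        lists.append(len(re.findall(r"(?m)^\s*[*-]\s*[A-Za-z]", text)))  # "* item" / "- item"
    return {
        "avg_length": sum(lengths) / n,
        "avg_sentiment": sum(sentiments) / n,
        "all_caps_rate": sum(caps_rates) / n,
        "paragraph_rate": sum(paragraphs) / n,
        "list_rate": sum(lists) / n,
    }
```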
4. TEMPORAL ANALYSIS
In this section, we look at how features change temporally
by making use of the time stamp in reviews as well as tips.
This allows us to analyze the activity of a user over time
as well as how the average number of votes the user has
received changes with each review posted.
4.1 Average Votes-per-review Over Time
First, we examine how the average number of votes-per-review varies with each review posted by a user. To gather this data, we grouped the reviews in the Yelp dataset by user and ordered each user’s reviews by the date they were posted.

The goal was to predict whether a user is an “Elite” or “Normal” user using the votes-per-review vs. review number plot. The motivation for this was that, after processing the data, we found that the number of votes on average was significantly greater for elite users compared to normal
Elite vs Normal Users Statistics
               useful votes   funny votes   cool votes
elite users    616            361           415
normal users   20             7             7

Figure 5: Average number of votes per category for elite and normal users.

Logistic Regression Summary
            elite users   normal users
training    2005          2005
testing     18040         18040

Figure 6: Summary of training and testing data for logistic regression.

Confusion Matrix
          classified as elite   classified as normal
elite     0.64                  0.36
normal    0.26                  0.74

Figure 7: Summary of results for logistic regression.
users, as shown in Fig 5. Thus, we decided to find out whether any trend exists in how the average number of votes grows with each review posted by users from both categories. We hypothesized that elite users should have an increasing average number of votes over time.
On the y-axis, we have υ_i, which is the votes-per-review after a user posts his ith review. This is defined as the sum of the number of “useful”, “cool”, and “funny” votes divided by the number of reviews by the user up to that point in time. On the x-axis, we have the review count.
Using the Yelp dataset, we plotted a scatter plot for each
user. Visual inspection of graphs did not show any obvious
trends in how the average number of likes per review varied
with each review being posted by the user.
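A sketch of how the running votes-per-review series υ_i, and the P_increase and μ statistics defined next, might be computed per user is shown below; the layout of the votes field is an assumption about the review JSON.

```python
def vote_statistics(reviews):
    """`reviews` is one user's reviews in posting order; each is assumed to have
    a `votes` dict with 'useful', 'funny', and 'cool' counts."""
    upsilon = []            # upsilon[i-1] = votes-per-review after the ith review
    total_votes = 0
    for i, review in enumerate(reviews, start=1):
        total_votes += sum(review["votes"].values())
        upsilon.append(total_votes / i)
    increases = sum(1 for prev, cur in zip(upsilon, upsilon[1:]) if cur > prev)
    p_increase = increases / len(reviews)
    mu = sum(upsilon) / len(reviews)
    return upsilon, p_increase, mu
```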
We then proceeded to perform a logistic regression using the following variables:

P_{increase} = \frac{count(increases)}{count(reviews)}

\mu = \frac{\sum_{i=0}^{count(reviews)} υ_i}{count(reviews)}

where count(increases) is the number of times the average votes-per-review increased (i.e., υ_{i+1} > υ_i) after the user posted a review and count(reviews) is the number of reviews the user has made.

Both the training and testing sets consist only of users with at least one review. For each user, we calculated the variables P_increase and μ. The training and testing data are summarized in Fig 6: 10% of users with at least one review became part of the training data and the remaining 90% were used for testing.

The model achieved an accuracy of 0.69 on the testing set. The results are shown in Fig 7.

Given that the overall accuracy of our model is relatively high at 0.69, we can hypothesize that P_increase is higher for elite users than for normal users. This means that each review an elite user posts tends to be a “quality” review that receives enough votes to increase the running average of votes-per-review for that user. The second hypothesis is that the mean of the running average votes-per-review is higher for elite users than for normal users. This is supported by the data shown in Fig 5, where the average votes for elite users are higher than for normal users.

4.2 User Review Rank
For the second part of our temporal analysis, we look at
the rank of each review a user has posted. Using 0-indexing, if
a review has rank r for business b, the review was the rth
review written for business b.
Our hypothesis was that an elite user should be one of the
first few users who write a review for a restaurant because
elite users are more likely to find new restaurants to review.
Also, based on the dataset, elite users write approximately
230 more reviews on average than normal users, thus it is
more likely that elite users will be one of the first users to
review a business. Over time, since there are more normal
users, the ratio of elite to normal users will decrease as more
normal users write reviews.
To verify this, we calculated the percentage of elite reviews
for each rank across the top 10,000 businesses, whereby the
top business is defined as the business with the most reviews.
The number of ranks we look at will be the minimum number of reviews of a single business among the top 10,000
businesses. The plot is shown in Fig 8.
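The per-rank percentages plotted in Fig 8 might be computed along the following lines; the data structures (reviews grouped by business in chronological order, with max_rank set to the minimum review count among the top 10,000 businesses) are assumptions about how the dataset is loaded.

```python
from collections import defaultdict

def elite_fraction_by_rank(businesses, is_elite, max_rank):
    """`businesses` maps a business id to its reviews in chronological order and
    `is_elite` maps a user id to True/False; both layouts are assumptions.
    Returns the fraction of rank-r reviews (0-indexed) written by elite users."""
    elite_at_rank = defaultdict(int)
    total_at_rank = defaultdict(int)
    for reviews in businesses.values():
        for rank, review in enumerate(reviews[:max_rank]):
            total_at_rank[rank] += 1
            elite_at_rank[rank] += int(is_elite[review["user_id"]])
    return [elite_at_rank[r] / total_at_rank[r] for r in range(max_rank)]
```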
[Figure 8: Plot of the probability of being an elite user for reviews at rank r.]

Given that the dataset consists of approximately 10% elite users, the plot shows us that it is more likely for an elite user to be among the first few reviewers of a business.

We calculated a score for each user which is a function of the rank of each review of the user, and we included this as a feature in the SVM. For each review of a user, we find the total number of reviews for the business that this review belongs to. We take the total review count of this business and subtract the rank of the review from it. We then sum this value over each review to assign a score to the user. Based on our hypothesis, since elite users will more likely have a lower rank for each review than normal users, the score for elite users should therefore be higher than for normal users.

The score is defined as follows:

score = \sum_{rev} ( review_count(business_of(rev)) − rank(rev) )

We subtract the rank from the total review count so that, based on our hypothesis, elite users will end up having a higher score.

4.3 User Tip Rank

A tip is a short chunk of text that a user can submit to a restaurant via any Yelp mobile application. Using 0-indexing, if a tip has rank r for business b, the tip was the rth tip written for business b. Similar to the review rank, we hypothesized that an elite user should be one of the first few tippers (a tipper is a person who gives a tip) of a restaurant. We plotted the same kind of graph, which shows the percentage of elite tips for each rank across the top 10,000 businesses, whereby the top business is defined as the business with the most tips. The plot is shown in Fig 9.

[Figure 9: Plot of the probability of being an elite user for tips at rank r.]

Given that the dataset consists of approximately 10% elite users, the plot shows us that it is more likely for an elite user to be among the first few tippers of a business. Furthermore, for this specific dataset, elite users only make up approximately 25% of the total number of tips, yet for the top 10,000 businesses, they make up more than 25% of the tips for almost all the ranks shown in Fig 9.

We then calculated a score for each user based on the rank of each tip of the user, and we included this as a feature in the SVM. The score is defined as follows:

score = \sum_{tip} ( tip_count(business_of(tip)) − rank(tip) )

(The equation for this score follows the same reasoning as in the user review rank section.)

4.4 Review Activity Window

In this section, we look at the distribution of a user’s activity over time. The window we look at is between the user’s join date and the end date, defined as the last date of any review posted in the entire dataset. For each user, we find the interval in days between each review, including the join date and end date. For example, if a user has two reviews on date1 and date2, where date2 is after date1, the interval durations will be: date1 − joinDate, date2 − date1, and endDate − date2. So for n reviews, we get n + 1 intervals. Based on the list of intervals, we calculate a score. For this feature, we hypothesize that the lower the score, the more likely the user is an elite user.

The score is defined as:

score = \frac{var(intervals) + avg(intervals)}{days_on_yelp}

where var(intervals) is the variance of all the interval values, avg(intervals) is their average, and days_on_yelp is the number of days a user has been on Yelp.
For the variance, the hypothesis is that for elite users, the
variance will tend to be low as we hypothesize that elite users
should post regularly. For normal users, the variance will be
high possibly due to irregular posting and long periods of
inactivity between posts.
We also look at the average value of the intervals. This is
because if we were to only look at variance, a user who writes
a review every two days will get the same variance (zero) as
a user who writes a review every day. As such, the average of the intervals accounts for this by increasing the score. Finally, we divide the score by the number of days the user
has been on Yelp. This is to account for situations where
a user makes a post every week but has only been on Yelp
for three weeks, versus a user who makes a post every week
as well but has been on Yelp for a year. The user who has been on Yelp for a year will then get a lower value for this score, which, under our hypothesis, points toward an elite user.
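A compact sketch of this activity window score, assuming review dates are available as datetime.date objects and with a small guard against a zero-day window, is:

```python
from statistics import mean, pvariance

def activity_window_score(join_date, review_dates, end_date):
    """Score from the day-intervals between consecutive activity dates, bounded
    by the join date and the dataset end date (dates are datetime.date objects)."""
    points = [join_date] + sorted(review_dates) + [end_date]
    intervals = [(b - a).days for a, b in zip(points, points[1:])]
    days_on_yelp = max((end_date - join_date).days, 1)   # guard against a zero-day window
    return (pvariance(intervals) + mean(intervals)) / days_on_yelp
```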
5. SOCIAL NETWORK ANALYSIS
The Yelp social network is the user friendship graph. This
data is available in the latest version of the Yelp academic
dataset. We used the graph library from the same toolkit
that was used to do the text analysis in section 3.
We make an assumption that users on the Yelp network
don’t become friends at random; that is, we hypothesize
that users become friends if they think their friendship is
mutually beneficial. In this model, we think one user will
become friends with another user if he or she thinks the
other user is worth knowing (i.e., is a “good reviewer”). We
believe this is a fair assumption to make, since the purpose
of the Yelp website is to provide quality reviews for both
businesses and users. One potential downside we can see is
users becoming friends just because they are friends in real
life, or in a different social network.
5.1 Network Centrality
Since our goal is to find “interesting” or “elite” users, we use
three network centrality measures to identify central (important) nodes. We would like to find out if elite users are
more likely to be central nodes in their friendship network.
We’d also like to investigate whether the results of the three
centrality measures we investigate are correlated. Next, we
briefly summarize each measure. For a more in-depth discussion of centrality (including the measures we use), we
suggest the reader consult [7]. For our centrality calculations we considered the graph of 123,369 users that wrote at
least one review.
Degree centrality for a user u is simply the degree of node
u. In our network, this is the same value as the number
of friends. Therefore, it makes sense that users with more
friends are more important (or active) than those that have
fewer or no friends. Degree centrality can be calculated almost instantly.
Betweenness centrality for a node u essentially captures the
number of shortest paths between all pairs of nodes that
pass through u. In this case, a user being an intermediary
between many user pairs signifies importance. Betweenness
centrality is very expensive to calculate, even using an O(mn) algorithm [4]. This algorithm is part of the toolkit we used, and it took two hours to run on 3.0 GHz processors with 24 threads.
Eigenvector centrality operates under the assumption that
important nodes are connected to other important nodes.
PageRank [11] is a simple extension to eigenvector centrality.
If a graph is represented as an adjacency matrix A, then
the (i, j)th cell is 1 if there is an edge between i and j,
and 0 otherwise. This notation is convenient when defining
eigenvector centrality for a node u, denoted x_u:

x_u = \frac{1}{λ} \sum_{i=1}^{n} A_{iu} x_i
Since this can be rewritten as Ax = λx, we can solve for
the eigenvector centrality values with power iteration, which
converges in a small number of iterations and is quite fast.
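A minimal power-iteration sketch over an adjacency-list representation is shown below; it is an illustration rather than the toolkit's implementation, and the tolerance and normalization choices are arbitrary.

```python
import math

def eigenvector_centrality(adjacency, iterations=100, tol=1e-8):
    """`adjacency` maps each node to the set of its neighbors (undirected graph).
    Repeatedly applies x <- Ax and renormalizes until the vector settles."""
    x = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        new_x = {node: sum(x[nbr] for nbr in neighbors)
                 for node, neighbors in adjacency.items()}
        norm = math.sqrt(sum(v * v for v in new_x.values())) or 1.0
        new_x = {node: v / norm for node, v in new_x.items()}
        if max(abs(new_x[n] - x[n]) for n in x) < tol:
            return new_x
        x = new_x
    return x
```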
Degree Centrality
Name       Reviews   Useful   Friends   Fans    Elite
Walker     240       6,166    2,917     142     Y
Kimquyen   628       7,489    2,875     128     Y
Katie      985       23,030   2,561     1,068   Y
Philip     706       4,147    2,551     86      Y
Gabi       1,440     12,807   2,550     420     Y

Betweenness Centrality
Name       Reviews   Useful   Friends   Fans    Elite
Gabi       1,440     12,807   2,550     420     Y
Philip     706       4,147    2,551     86      Y
Lindsey    906       7,641    1,617     348     Y
Jon        230       2,709    1,432     60      Y
Walker     240       6,166    2,917     142     Y

Eigenvector Centrality
Name       Reviews   Useful   Friends   Fans    Elite
Kimquyen   628       7,489    2,875     128     Y
Carol      505       2,740    2,159     163     Y
Sam        683       9,142    1,960     100     Y
Alina      329       2,096    1,737     141     N
Katie      985       23,030   2,561     1,068   Y

Figure 10: Comparison of the top-ranked users as defined by the three centrality measures on the social network.
The eigenvector centralities for the Yelp social network were
calculated in less than 30 seconds.
Fig 10 shows the comparison of the top five ranked users
based on each centrality score. The top five users of each
centrality shared some names: Walker, Gabi, and Philip in
degree and betweenness; Kimquyen and Katie in degree and
eigenvector; betweenness and eigenvector shared no users in
the top five (though not shown, there are some that are the
same in the range six to ten).
The top users defined by centrality measures are almost all
elite users even though elite users only make up about 8% of
the dataset. The only exception here is Alina from eigenvector centrality. Her other statistics look like they fit in with
the other elite users, so perhaps this could be a prediction
that Alina will be elite in the year 2015.
The next step is to use these social network features to predict elite users.
5.2 Weighted Networks
Adding weighted links between users could enhance the graph representation. The link types that could potentially be weighted are fans and votes. Additionally, if we had some measure of friendship tie strength based on communication or profile views, we could use weighted centrality measures for this aspect as well.

Unfortunately, we have no way to define the strength of the friendship between two users, since we only have the information present in the Yelp academic dataset. As for the votes and fans, the Yelp academic dataset only gives us a raw number for these values, as opposed to the actual links in the social network. If we had this additional information, we could add those centrality measures to the friendship graph centrality measures for an enriched social network feature set.
6. USER METADATA
User metadata is information that is already part of the
JSON Yelp user object. It is possible to see all the metadata
by visiting the Yelp website and viewing specific numerical
fields.
• Votes. Votes are ways to show a specific type of appreciation towards a user. There are three types of
votes: funny, useful, and cool. There is no specific
definition for what each means.
• Review count. This is simply the total number of
reviews that a user has written.
• Number of friends. The total number of friends in
a user’s friendship graph. This feature is duplicated
in the degree centrality measure of the social network
analysis.
• Number of fans. The total number of fans a user
has.
• Average rating. The average star rating in [1, 5] the
user gives a business.
• Number of compliments. According to Yelp, the
compliment button is “an easy way to send some good
vibes.” This is separate from a review. In fact, users
get compliments from other users based on particular
reviews that they write.
We hope to use these metadata features in order to classify
users as elite. We already saw in section 5 that some metadata fields seemed to be correlated with network centrality
measures as well as a user’s status, so it seems like they will
be informative features.
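A sketch of flattening a user JSON object into this metadata feature vector follows; the exact field names are assumptions about the academic dataset's user schema.

```python
def metadata_features(user):
    """Flatten one Yelp user JSON object into the metadata features above.
    The field names are assumptions about the academic dataset's user schema."""
    votes = user.get("votes", {})
    compliments = user.get("compliments", {})
    return [
        votes.get("funny", 0), votes.get("useful", 0), votes.get("cool", 0),
        user.get("review_count", 0),
        len(user.get("friends", [])),
        user.get("fans", 0),
        user.get("average_stars", 0.0),
        sum(compliments.values()),
    ]
```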
7. EXPERIMENTS
We now run experiments to test whether each feature generation method is a viable candidate to distinguish between
elite and normal users. As mentioned before, the number of
elite users is much smaller than the number of total users;
about 8% of all 252,898 users are elite. This presents us with
a very imbalanced class distribution. Since using the entire
user base to classify elite users has such a high baseline (92%
accuracy), we also truncate the dataset to a balanced class
distribution with a total of 40,090 users, giving an alternate
baseline of 50% accuracy. Both datasets are used for all
future experiments.
As described in section 3.1, we use the MeTA toolkit (http://meta-toolkit.github.io/meta/) to do the text tokenization, class balancing, and five-fold cross-validation with SVM. SVM is implemented here as stochastic gradient descent with hinge loss.
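For illustration, a roughly equivalent setup in scikit-learn (not the MeTA toolkit used in our experiments) looks like the following, with stand-in data and arbitrary hyperparameters:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data: one row of already-extracted features per user,
# with label 1 for elite and 0 for normal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 250))
y = rng.integers(0, 2, size=1000)

# Hinge loss trained with SGD is a linear SVM, mirroring the setup described above.
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=1000, tol=1e-3)
scores = cross_val_score(clf, X, y, cv=5)          # five-fold cross-validation
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```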
Confusion Matrix: Balanced Text Features
          classified as elite   classified as normal
elite     0.651                 0.349
normal    0.124                 0.876
Overall accuracy: 76.7%, baseline 50%

Confusion Matrix: Unbalanced Text Features
          classified as elite   classified as normal
elite     0.582                 0.418
normal    0.039                 0.961
Overall accuracy: 91.8%, baseline 92%

Figure 11: Confusion matrices for normal vs elite users on balanced and unbalanced datasets.
Confusion Matrix: Balanced Temporal Features
          classified as elite   classified as normal
elite     0.790                 0.210
normal    0.320                 0.680
Overall accuracy: 73.5%, baseline 50%

Confusion Matrix: Unbalanced Temporal Features
          classified as elite   classified as normal
elite     0.267                 0.733
normal    0.067                 0.933
Overall accuracy: 88%, baseline 92%

Figure 12: Confusion matrices for normal vs elite users on balanced and unbalanced datasets.
7.1 Text Features
We represent users as a collection of all their review text.
Based on the previous experiments, we saw that it was possible to classify a single review as being written by an elite
or normal user. Now, we want to classify users based on all
their reviews as either an elite or normal user. Figure 11
shows the results of the text classification task. Using the
balanced dataset we achieve about 77% accuracy, compared
to barely achieving the baseline accuracy in the full dataset.
Since the text features are so high dimensional, we performed some basic feature selection by selecting the most
frequent features from the dataset. Before feature selection,
we had an accuracy on the balanced dataset of about 70%.
Using the top 100, 250, and 500 features all resulted in a
similar accuracy of around 76%. We use the reduced feature set of 250 in our experimental results in the rest of this
paper.
7.2 Temporal Features
The temporal features consist of features derived using
changes in the average number of votes per review posted,
the sum of the ranks of reviews of an user as well as the tips,
and the distribution of reviews posted over the lifetime of a
user. Using these features, we obtained the results shown in
Figure 12.
7.3 Graph Features
Figure 13 shows the results using the centrality measures from the social network.
Confusion Matrix: Balanced Graph Features
          classified as elite   classified as normal
elite     0.842                 0.158
normal    0.251                 0.749
Overall accuracy: 79.6%, baseline 50%

Confusion Matrix: Unbalanced Graph Features
          classified as elite   classified as normal
elite     0.311                 0.689
normal    0.075                 0.925
Overall accuracy: 87.6%, baseline 92%

Figure 13: Confusion matrices for normal vs elite users on balanced and unbalanced datasets.

Confusion Matrix: Balanced Metadata Features
          classified as elite   classified as normal
elite     0.959                 0.041
normal    0.083                 0.917
Overall accuracy: 93.8%, baseline 50%

Confusion Matrix: Unbalanced Metadata Features
          classified as elite   classified as normal
elite     0.880                 0.120
normal    0.097                 0.903
Overall accuracy: 90.1%, baseline 92%

Figure 14: Confusion matrices for normal vs elite users on balanced and unbalanced datasets.

Confusion Matrix: Balanced All Features
          classified as elite   classified as normal
elite     0.754                 0.256
normal    0.111                 0.889
Overall accuracy: 82.2%, baseline 50%

Confusion Matrix: Unbalanced All Features
          classified as elite   classified as normal
elite     0.976                 0.024
normal    0.731                 0.269
Overall accuracy: 92%, baseline 92%

Figure 15: Confusion matrices for normal vs elite users on balanced and unbalanced datasets with all features present.

             Text   Temp.   Graph   Meta   All
Balanced     .767   .735    .796    .938   .822∗
Unbalanced   .918   .880    .876    .901   .920

Figure 16: Final results summary for all features and feature combinations on balanced and unbalanced data. ∗Excluding just the text features resulted in 90.4% accuracy.
Although there are only three graph features, Figure 10 showed that there is potentially a correlation between elite status and high centrality values. The three graph features alone were able to predict whether a user was elite with almost 80% accuracy on the balanced dataset. Again, results were lower than the baseline when using the full user set.
7.4 Metadata Features
Using only the six metadata features from the original Yelp
JSON file gave surprisingly high accuracy at almost 94% for
the balanced classes. In fact, the metadata features had the
highest precision for both the elite and normal classes. The
unbalanced accuracy was near the baseline.
7.5 Feature Combination and Discussion
To combine features, we simply concatenated the feature
vectors for all the previous features and used the same splits
and classifier as before. Figure 15 shows the breakdown of
this classification experiment. Additionally, we summarize
all results by final accuracy in Figure 16.
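The combination step itself is just concatenation; a minimal sketch, with stand-in arrays and illustrative dimensions, is:

```python
import numpy as np

n_users = 40090   # size of the balanced dataset
# Stand-in feature blocks, one row per user in a shared user ordering (an assumption);
# the non-text dimensions are illustrative.
text_f = np.zeros((n_users, 250))    # selected text features
temporal_f = np.zeros((n_users, 5))  # temporal scores
graph_f = np.zeros((n_users, 3))     # degree, betweenness, eigenvector centrality
meta_f = np.zeros((n_users, 8))      # metadata fields

combined = np.hstack([text_f, temporal_f, graph_f, meta_f])  # fed to the same SVM
```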
Unfortunately, the combined feature vectors did not significantly improve the classification accuracy on the balanced dataset as we had expected. Initially, we thought that this might be due to overfitting, which is why we reduced the number of text features from over 70,000 to 250. Using the 70,000 text features combined with the other feature types resulted in about 70% accuracy; with the top 250 features, we achieved 82.2% as shown in the tables. For the unbalanced dataset, the results did improve to reach the difficult baseline.
Using all combined features except the text features resulted
in 90.4% accuracy, suggesting there is some sort of disagreement between “predictive” text features and all other predictive features. Thus, removing the text features yielded
a much higher result, approaching the accuracy of just the
Yelp metadata features.
Since we dealt with some overfitting issues, we checked that the classifier used regularization. Regularization ensures that the weights for specific features do not become too large when those features appear to be highly predictive of the class label. Fortunately (or unfortunately), the classifier we used does employ regularization, so there was nothing further we could do along these lines to increase performance.
8. CONCLUSION AND FUTURE WORK
We investigated several different feature types to attempt to
classify elite users in the Yelp network. We found that all of
our features were able to distinguish between the two user
types. However, when combined, we weren’t able to make
an improvement in accuracy on the class-balanced dataset
over the best-performing single feature type.
In the text analysis, we can investigate different classifiers to
improve the classification accuracy. For example, k-nearest
neighbor could be a good approach since it is nonlinear and
we have a relatively small number of dimensions after reducing the text features. The text analysis could also be
extended with the aid of topic modeling [2]. One output of such a model is a soft clustering of documents into topics; a document is then represented as a mixture of these topics, and each document’s topic mixture can be used as a feature vector for the classifier.
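One possible realization of this idea, using scikit-learn's LDA purely as an illustration with arbitrary parameters, might look like the following:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# `docs` would hold one concatenated review string per user; toy stand-ins here.
docs = ["great food and friendly staff", "slow service but a nice patio"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=20, random_state=0)
topic_mixtures = lda.fit_transform(counts)   # one topic-mixture row per user
# Each row could be appended to a user's feature vector for the classifier.
```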
In the temporal analysis, we made some assumptions about
the ‘elite’ data provided by the Yelp dataset. The data tells
us for which years the user was ‘elite’, and we made a simplifying assumption that as long as a user has at least one year
of elite status, the user is currently and has always been an
elite user. For instance, if a user was only elite in the year
2010, we treated the user’s review back in 2008 as an elite review. Also, we could have made use of more advanced models like the vector autoregression model (VAR) [14] which
might allow us to improve the analysis of votes per review
over time. One possible way would be to look at all the votes-per-review plots of users in the dataset and run the model
using this data. Finally, in the network analysis, we can consider different network features such as clustering coefficient
or similarity via random walks.
The graph features would certainly benefit from added
weights, but as mentioned in section 5, we unfortunately
do not have this data. Social graph structure can also be
created with more information about fans and votes.
Finally, since the metadata features were by far the best-performing, it would be an interesting auxiliary problem to
predict their values via a regression using the other feature
types we created.
APPENDIX
A. REFERENCES
[1] Krisztian Balog, Leif Azzopardi, and Maarten
de Rijke. A language modeling framework for expert
finding. Inf. Process. Manage., 45(1):1–19, January
2009.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
Latent dirichlet allocation. J. Mach. Learn. Res.,
3:993–1022, March 2003.
[3] Alessandro Bozzon, Marco Brambilla, Stefano Ceri,
Matteo Silvestri, and Giuliano Vesci. Choosing the
right crowd: Expert finding in social networks. In
Proceedings of the 16th International Conference on
Extending Database Technology, EDBT ’13, pages
637–648, New York, NY, USA, 2013. ACM.
[4] Ulrik Brandes. A faster algorithm for betweenness
centrality. Journal of Mathematical Sociology,
25:163–177, 2001.
[5] Munmun De Choudhury, Hari Sundaram, Ajita John,
and Dorée Duncan Seligmann. What makes
conversations interesting?: Themes, participants and
consequences of conversations in online social media.
In Proceedings of the 18th International Conference on
World Wide Web, WWW ’09, pages 331–340, New
York, NY, USA, 2009. ACM.
[6] Kate Ehrlich, Ching-Yung Lin, and Vicky
Griffiths-Fisher. Searching for experts in the
enterprise: Combining text and social network
analysis. In Proceedings of the 2007 International
ACM Conference on Supporting Group Work, GROUP
’07, pages 117–126, New York, NY, USA, 2007. ACM.
[7] Jiawei Han. Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA, 2005.
[8] Yelp Inc. Yelp Dataset Challenge, 2014.
http://www.yelp.com/dataset_challenge.
[9] Jon M. Kleinberg. Authoritative sources in a
hyperlinked environment. J. ACM, 46(5):604–632,
September 1999.
[10] Christopher D. Manning, Prabhakar Raghavan, and
Hinrich Schütze. Introduction to Information
Retrieval. Cambridge University Press, New York, NY,
USA, 2008.
[11] Larry Page, Sergey Brin, R. Motwani, and
T. Winograd. The pagerank citation ranking:
Bringing order to the web, 1998.
[12] Bo Pang and Lillian Lee. Opinion mining and
sentiment analysis. Found. Trends Inf. Retr.,
2(1-2):1–135, January 2008.
[13] Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin,
Hong Cheng, and Tianyi Wu. Rankclus: Integrating
clustering with ranking for heterogeneous information
network analysis. In Proceedings of the 12th
International Conference on Extending Database
Technology: Advances in Database Technology, EDBT
’09, pages 565–576, New York, NY, USA, 2009. ACM.
[14] Hiro Y. Toda and Peter C.B. Phillips. Vector
Autoregression and Causality. Cowles Foundation
Discussion Papers 977, Cowles Foundation for
Research in Economics, Yale University, May 1991.
[15] Hongning Wang, Yue Lu, and ChengXiang Zhai.
Latent aspect rating analysis on review text data: A
rating regression approach. In Proceedings of the 16th
ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’10,
pages 783–792, New York, NY, USA, 2010. ACM.
[16] Hongning Wang, Yue Lu, and ChengXiang Zhai.
Latent aspect rating analysis without aspect keyword
supervision. In Proceedings of the 17th ACM SIGKDD
International Conference on Knowledge Discovery and
Data Mining, KDD ’11, pages 618–626, New York,
NY, USA, 2011. ACM.
[17] Jun Zhang, Mark S. Ackerman, and Lada Adamic.
Expertise networks in online communities: Structure
and algorithms. In Proceedings of the 16th
International Conference on World Wide Web, WWW
’07, pages 221–230, New York, NY, USA, 2007. ACM.