Multidimensional Characterization of Expert Users in the Yelp Review Network∗

Cheng Han Lee, Department of Computer Science, University of Illinois at Urbana-Champaign, [email protected]
Sean Massung, Department of Computer Science, University of Illinois at Urbana-Champaign, [email protected]

∗Submitted as the semester project for CS 598hs, Fall 2014.

ABSTRACT
In this paper, we propose a multidimensional model that integrates text analysis, temporal information, network structure, and user metadata to effectively predict experts in a large collection of user profiles. We make use of the Yelp Academic Dataset, which provides a rich social network of bidirectional friendships, full review text including formatting, timestamped activities, and user metadata (such as votes and other information) for analyzing and training our classification models. Through our experiments, we hope to develop a feature set that can accurately predict whether a user is a Yelp expert (also known as an 'elite' user) or a normal user. We show that each of the four feature types captures a signal that a user is an expert. In the end, we combine all feature sets in an attempt to raise classification accuracy even higher.

Keywords
network mining, text mining, expert finding, social network analysis, time series analysis

1. INTRODUCTION
Expert finding seeks to locate users in a particular domain who have more qualifications or knowledge (expertise) than the average user. Usually, the number of experts is very low compared to the overall population, making this a challenging problem. Expert finding is especially important in medical, legal, and even governmental settings. In our work, we focus on the Yelp academic dataset [8], since it has many rich features that are unavailable in other domains. In particular, we have full review content, timestamps, the friend graph, and user metadata; this allows us to use techniques from text mining, time series analysis, social network analysis, and classical machine learning.

From a user's perspective, it is important to find an expert reviewer who gives a fair or useful review of a business that may be a future destination. From a business's perspective, expert reviewers should be great summarizers, able to explain exactly how to improve a store or restaurant. In both cases, it is much more efficient to find the opinion of an expert reviewer than to sift through hundreds of thousands of potentially useless or spam reviews.

Yelp is a crowd-sourced business review site as well as a social network, consisting of several object types: users, reviews, and businesses. Users write text reviews accompanied by a star rating for businesses they visit. Users also have bidirectional friendships as well as one-directional fans. We consider the social network to consist of the bidirectional friendships, since each user consents to the friendship with the other user; additionally, popular users are much less likely to know their individual fans, making the fan connection much weaker. Each review object is annotated with a timestamp, so we are able to investigate trends temporally.

The purpose of this work is to investigate and analyze the Yelp dataset and find potentially interesting patterns that we can exploit in our future expert-finding system. The key question we hope to answer is: Given a network of Yelp users, who is an elite user? To answer this question, we first address the following:
1. How does the text in expert reviews differ from text in normal reviews?
2. How does the average number of votes per review for a user change over time?
3. Are elite users the first to review a new business?
4. Does the social network structure suggest whether a user is an elite user?
5. Does user metadata available from Yelp give any indication of a user's status?

The structure of this paper is as follows: in section 2, we discuss related work. In sections 3, 4, 5, and 6, we discuss the four different dimensions of the Yelp dataset; for the first three feature types, we use text analysis, temporal analysis, and social network analysis respectively. The user metadata is already in a quantized format, so we simply overview the fields available. Section 7 details experiments with the proposed features on balanced (the number of experts equals the number of normal users) and unbalanced (the number of experts is much smaller) data. Finally, we end with conclusions and future work in section 8.

2. RELATED WORK
RankClus [13] integrates clustering and ranking on heterogeneous information networks. Within each cluster, a ranking of nodes is created, so the top-ranked nodes can be considered experts for that cluster. For example, consider the DBLP bibliographic network: clusters are formed based on authors who share coauthors, and within each cluster there is a ranking of authoritative authors (experts in their field). Clustering and ranking are defined to mutually enhance each other, since conditional rank is used as a clustering feature and cluster membership is used as an object feature. To determine the final configuration, an expectation-maximization algorithm iteratively updates cluster and ranking assignments. This work is relevant to our Yelp dataset if we consider clusters to be business categories and experts to be domain experts. However, the Yelp categories are not well-defined, since some category labels overlap, so extra processing may be necessary to deal with this issue.

Expert Finding in Social Networks [3] considers Facebook, LinkedIn, and Twitter as domains where experts reside. Instead of labeling nodes from the entire graph as experts, a subset of candidate nodes is considered, and these candidates are ranked according to an expertise score. The expertise scores are obtained through link relation types defined on each social network, such as creates, contains, annotates, and owns. To rank experts, the authors used a vector-space retrieval model common in information retrieval [10] and evaluated with popular IR metrics such as MAP, MRR, and NDCG [10]. Their vector space consisted of resources, related entities, and expertise measures. They concluded that profile information is a less useful determiner of expertise than their extracted relations, and that resources created by others that mention the target are also quite useful.

A paper on "Expertise Networks" [17] begins with a large study analyzing a question-and-answer forum; typical centrality measures such as PageRank [11] and HITS [9] are used to initially find expert users. Then, other features describing these expert users are defined or extracted in order to create an "ExpertiseRank" algorithm, which (as far as we can tell) is essentially PageRank.
This algorithm was then evaluated by human raters, and ExpertiseRank was found to have slightly smaller errors than the other measures (including HITS; it was not evaluated against PageRank). While this result is unsurprising, we would be unable to use ExpertiseRank or PageRank directly, since the Yelp social network is undirected, and running PageRank on an undirected network approximates degree centrality.

A paper on expert language models [1] builds two different language models by invoking Bayes' theorem. The conditional probability of a candidate given a specific query is estimated by representing the candidate as a multinomial probability distribution over the vocabulary terms. A candidate model θ_ca is inferred for each candidate ca, such that the probability of a query term given the candidate model is p(t | θ_ca). For one of the models, the authors assume that the document and the candidate are conditionally independent; for the other, they use the probability p(t | d, ca), which is based on the strength of the co-occurrence between a term and a candidate in a particular document. We can adopt a similar method whereby the candidate is a Yelp user, and we determine the extent to which a review characterizes an elite or normal user.

For the paper on interesting YouTube commenters [5], the goal is to assign each conversation a real scalar value measuring its interestingness. The model comprises detecting conversational themes using a mixture-model approach, determining the 'interestingness' of participants and conversations with a random-walk model, and establishing the consequential impact of 'interestingness' via different metrics. This could be useful to us for characterizing reviews and Yelp users in terms of 'interestingness': an intuitive conjecture is that 'elite' users should have a high 'interestingness' level and, likewise, should post reviews that are interesting.

In summary, Fig 1 compares the related work surveyed in terms of which aspects of the data each examined.

Figure 1: Comparison of the text, temporal, and graph features used in previous work: Sun et al. 2009 [13], Bozzon et al. 2013 [3], Zhang et al. 2007 [17], Choudhury et al. 2009 [5], Balog et al. 2009 [1], and Ehrlich et al. 2007 [6].

3. TEXT ANALYSIS
The text analysis examines the reviews written by each user in order to extract features from the unstructured text content. Common text-processing techniques such as indexing, categorization, and language modeling are explored in the next sections.

3.1 Datasets
First, we preprocessed the text by lowercasing, removing stop words, and stemming with the Porter2 English stemmer. Text was tokenized into a unigram bag-of-words with the MeTA toolkit (http://meta-toolkit.github.io/meta/). We created three datasets that are used in this section:

• All: contains all the elite reviews (reviews written by an elite user) and an equal number of normal reviews
• Elite: contains only reviews written by elite users
• Normal: contains only reviews written by normal users

Elite and Normal together make up All; this ensures that analyses run on these corpora have a balanced class distribution. Overall, there were 1,125,458 reviews, consisting of 329,624 elite reviews and 795,834 normal reviews; the normal reviews were therefore randomly sampled down to 329,624. The number of elite users is far smaller than the ratio of written reviews may suggest, because elite users write many more reviews on average than normal users. A summary of the three datasets is given in Fig 2.

Dataset      All       Elite     Normal
Docs         659,248   329,624   329,624
Avg. length  81.8      98.8      64.9
|V|          164,311   125,137   95,428
Raw (MB)     480       290       190
Index (MB)   81        46        37

Figure 2: Comparison of the three text datasets of Yelp reviews in terms of corpus size, average document length (in words), vocabulary size, raw data size, and indexed data size (both in MB).
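The balanced All corpus can be produced by randomly undersampling the normal reviews, as described above. A minimal sketch follows; the file and field names reflect the 2014-era Yelp academic dataset layout (one JSON object per line) and are our assumptions, not part of the paper:

```python
import json
import random

random.seed(42)  # reproducible sampling

# Collect the ids of users with at least one elite year.
elite_ids = set()
with open("yelp_academic_dataset_user.json") as f:
    for line in f:
        user = json.loads(line)
        if user.get("elite"):  # non-empty list of elite years
            elite_ids.add(user["user_id"])

# Split reviews by author status.
elite_reviews, normal_reviews = [], []
with open("yelp_academic_dataset_review.json") as f:
    for line in f:
        review = json.loads(line)
        (elite_reviews if review["user_id"] in elite_ids
         else normal_reviews).append(review["text"])

# Undersample the majority (normal) class to match the elite count.
normal_sample = random.sample(normal_reviews, len(elite_reviews))
all_corpus = elite_reviews + normal_sample  # the balanced "All" dataset
```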
3.2 Classification
We tested how easy it is to distinguish between an elite review and a non-elite (normal) review via a simple supervised classification task. We used the dataset All described in the previous section, along with each review's true label, to train an SVM classifier. Evaluation was performed with five-fold cross validation and had a baseline of 50% accuracy. Results of this categorization experiment are displayed in Fig 3.

Confusion matrix   classified as elite   classified as normal
elite              0.665                 0.335
normal             0.252                 0.748

Class    F1 score   Precision   Recall
Elite    0.694      0.665       0.725
Normal   0.718      0.748       0.691
Total    0.706      0.706       0.708

Figure 3: Confusion matrix and classification accuracy on normal vs elite reviews.

The confusion matrix tells us that it was slightly easier to classify normal reviews, though the overall accuracy was acceptable at just over 70%. Precision and recall reached their maximums on opposite classes, though the overall F1 scores were similar. Recall that the F1 score is the harmonic mean of precision P and recall R:

F1 = 2PR / (P + R)

Since this is just a baseline classifier, we expect it is possible to achieve higher accuracy using more advanced features such as word n-grams or grammatical features like part-of-speech tags and parse-tree productions. However, this initial experiment was to determine whether elite and non-elite reviews can be separated based on text alone, with no regard to author or context. Since the accuracy of this default model is 70%, it seems that text will make a useful subset of the overall features for predicting expertise. Furthermore, note that this classification experiment does not decide whether a user is elite, but rather whether a review was written by an elite user; it is straightforward to extend the problem to classify users instead, where each user is the combination of all reviews that he or she writes. In fact, this is what we do in section 7, where we are concerned with identifying elite users.

3.3 Language Model
We now turn to the next text analysis method: unigram language models. A language model is simply a distribution over words given some context. Here we define three language models, each based on a corpus described in section 3.1.

The background language model (or "collection" language model) represents the All corpus. We define a smoothed collection language model pC(w) as

pC(w) = (count(w, C) + 1) / (|C| + |V|)

This creates a distribution pC(·) over each word w ∈ V. Here, C is the corpus and V is the vocabulary (each unique word in C), so |C| is the total number of words in the corpus and |V| is the number of unique terms. The collection language model essentially gives the probability of a word occurring anywhere in the corpus, so sorting the outcomes by their assigned probabilities yields the most frequent words. Unsurprisingly, these are common stop words with no real content information. However, we will use this background model to filter out words specific to elite or normal users.
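This background model takes only a few lines of code given a tokenized corpus. A minimal sketch of the add-one smoothed estimate defined above, assuming the corpus is already a list of token lists:

```python
from collections import Counter

def collection_lm(docs):
    """Add-one smoothed background model p_C(w) over a tokenized corpus."""
    counts = Counter(w for doc in docs for w in doc)
    corpus_len = sum(counts.values())  # |C|: total tokens
    vocab_size = len(counts)           # |V|: unique terms
    denom = corpus_len + vocab_size
    return {w: (c + 1) / denom for w, c in counts.items()}

# Sorting by probability recovers the most frequent (stop) words:
# top = sorted(p_c.items(), key=lambda kv: kv[1], reverse=True)[:20]
```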
We now define another unigram language model to represent the probability of seeing a word w in a corpus θ ∈ {elite, normal}. We create a normalized language-model score per word using the smoothed background model defined previously:

score(w, θ) = (count(w, θ) / |θ|) / pC(w) = count(w, θ) / (pC(w) · |θ|)

The goal of the language model score is to find unigram tokens that are highly indicative of their respective categories; using a language model this way can be seen as a form of feature selection. Fig 4 shows the top twenty words ranked by each of the three models.
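The score can be computed directly from per-corpus counts and the background model. A minimal sketch, reusing `collection_lm` from the previous sketch:

```python
from collections import Counter

def lm_scores(docs, p_c):
    """score(w, theta) = count(w, theta) / (p_C(w) * |theta|)."""
    counts = Counter(w for doc in docs for w in doc)
    theta_len = sum(counts.values())  # |theta|: tokens in this corpus
    return {w: c / (p_c[w] * theta_len)  # assumes p_c's vocabulary covers theta
            for w, c in counts.items() if w in p_c}

# Top indicative unigrams for the elite corpus:
# top_elite = sorted(lm_scores(elite_docs, p_c).items(),
#                    key=lambda kv: kv[1], reverse=True)[:20]
```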
Background: the, and, a, i, to, was, of, is, for, it, in, that, my, with, but, this, you, we, they, on
Normal: gorsek, forks), yu-go, sabroso, (***, eloff, -/+, jeph, deirdra, ruffin', josefa, ubox, waite, again!!, optionz, ecig, nulook, gtr, shiba, kenta
Elite: uuu, aloha!!!, **recommendations**, meter:, **summary**, carin, no1dp, (lyrics, friends!!!!!, **ordered**, 8/20/2011, rickie, kuge, ;]]], #365, g, *price, visits):, r, ik

Figure 4: Top 20 tokens from each of the three language models.

These default language models did not reveal very clear differences in word usage between the two categories, despite elite users using a larger vocabulary, as shown in Fig 2. The singular finding was that the elite language model shows its users are more likely to segment their reviews into different sections discussing different aspects of the business: for example, recommendations, summary, ordered, or price.

It may also appear that there are a good deal of nonsense words among the top words from each language model. However, upon closer inspection, these words are valid given some domain knowledge of the Yelp dataset. For example, the top word "gorsek" in the normal language model is the last name of a normal user who always signs his posts. Similarly, "sabroso" is a Spanish word meaning delicious that a particular user likes to use in his posts. Similar arguments can be made for other words in the normal language model. In the elite model, "uuu" was originally "\uuu/", an emoticon that an elite user is fond of, and "No1DP" is a Yelp username that is often referred to by a few other elite users in their review text.

Work on supervised and unsupervised review aspect segmentation has been done before [15, 16], and it may be applicable in our case since there are clear boundaries in aspect mentions. Another approach would be to add a boolean feature has_aspects that detects whether a review is segmented in the style popular among elite users.

3.4 Typographical Features
Based partly on the experiments performed in section 3.3, we now define typographical features of the review text. We call a feature 'typographical' if it captures a trait that cannot be detected by a unigram word tokenizer and is indicative of the style of review writing. We use the following six style features (a sketch of extracting them follows the list):

• Average review length. We calculate review length as the number of whitespace-delimited tokens in a review; average review length is simply the average of this count across all of a user's reviews.
• Average review sentiment. We use sentiment valence scores [12] to calculate the sentiment of an entire review. The sentiment valence score is < 0 if the overall sentiment is negative and > 0 if it is positive.
• Paragraph rate. Based on the language model analysis, we include a feature to detect whether paragraph segmentation was used in a review. We simply count the rate of multiple newline characters per review per user.
• List rate. Again based on the language model analysis, this feature detects whether a bulleted list is included in the review. We define a list as the beginning of a line followed by '*' or '-' before alphabetic characters.
• All caps. The rate of words in all capital letters. We suspect very high rates of capital letters indicate spam or useless reviews.
• Bad punctuation. This feature also aims to detect less serious reviews in an attempt to find spam. A basic example of bad punctuation is not starting a new sentence with a capital letter.

Although the number of features here is low, we hope that the added meaning behind each one is more informative than a single unigram word feature.
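A minimal sketch of extracting these features from raw review text. The average-sentiment feature is omitted since it requires an external valence lexicon [12], and the regular expressions are our own reading of the definitions above, not the paper's exact implementation:

```python
import re

def typographical_features(reviews):
    """Style features averaged over one user's reviews (assumes >= 1 review)."""
    n = len(reviews)
    avg = lambda xs: sum(xs) / n

    lengths = [len(r.split()) for r in reviews]
    # Rate of fully capitalized words (length > 1 to skip "I", "A").
    caps = [sum(w.isupper() and len(w) > 1 for w in r.split())
            / max(len(r.split()), 1) for r in reviews]
    # Paragraph segmentation: runs of multiple newlines.
    paragraphs = [len(re.findall(r"\n\s*\n", r)) for r in reviews]
    # Bulleted list: line starting with '*' or '-' before alpha characters.
    lists = [len(re.findall(r"^\s*[*-]\s*(?=[A-Za-z])", r, re.MULTILINE))
             for r in reviews]
    # Bad punctuation proxy: sentence end followed by a lowercase letter.
    bad_punct = [len(re.findall(r"[.!?]\s+[a-z]", r)) for r in reviews]

    return {
        "avg_length": avg(lengths),
        "all_caps_rate": avg(caps),
        "paragraph_rate": avg(paragraphs),
        "list_rate": avg(lists),
        "bad_punctuation_rate": avg(bad_punct),
    }
```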
4. TEMPORAL ANALYSIS
In this section, we look at how features change over time by making use of the timestamps on reviews and tips. This allows us to analyze a user's activity over time, as well as how the average number of votes the user has received changes with each review posted.

4.1 Average Votes-per-review Over Time
First, we examine how the average number of votes-per-review varies with each review posted by a user. To gather this data, we grouped the reviews in the Yelp dataset by user and ordered each user's reviews by posting date. The goal was to predict whether a user is an elite or normal user from the plot of votes-per-review against review number. The motivation was that, after processing the data, we found that the average number of votes is significantly greater for elite users than for normal users, as shown in Fig 5.

Statistics     useful votes   funny votes   cool votes
elite users    616            361           415
normal users   20             7             7

Figure 5: Average number of votes per category for elite and normal users.

Thus, we decided to find out whether any trend exists in how the average number of votes grows with each review posted by users from both categories; we hypothesized that elite users should have an increasing average number of votes over time. On the y-axis, we have υ_i, the votes-per-review after a user posts his or her ith review, defined as the sum of the number of "useful", "cool", and "funny" votes divided by the number of reviews by the user up to that point in time. On the x-axis, we have the review count. Using the Yelp dataset, we plotted a scatter plot for each user. Visual inspection of the graphs did not show any obvious trends in how the average number of votes per review varied as reviews were posted. We then performed a logistic regression using the following variables:

P_increase = count(increases) / count(reviews)

μ = ( Σ_{i=1}^{count(reviews)} υ_i ) / count(reviews)

where count(increases) is the number of times the average votes-per-review increased (i.e., υ_{i+1} > υ_i) after the user posted a review, and count(reviews) is the number of reviews the user has made.

Both the training and testing sets consist only of users with at least one review; for each such user, we calculated P_increase and μ. 10% of these users became the training data and the remaining 90% were used for testing, as summarized in Fig 6.

Logistic regression summary   elite users   normal users
training                      2,005         2,005
testing                       18,040        18,040

Figure 6: Summary of training and testing data for logistic regression.

The model achieved an accuracy of 0.69 on the testing set; the results are shown in Fig 7.

Confusion matrix   classified as elite   classified as normal
elite              0.64                  0.36
normal             0.26                  0.74

Figure 7: Summary of results for logistic regression.

Given that the overall accuracy of our model is relatively high at 0.69, we can hypothesize that P_increase is higher for elite users than for normal users; that is, each review an elite user posts tends to be a "quality" review that receives enough votes to increase the running average of votes-per-review for that user. The second hypothesis is that the mean of the running average votes-per-review for elite users is higher than that of normal users. This is supported by the data shown in Fig 5, where the average votes for elite users are higher than for normal users.
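These two variables are simple statistics over a user's running vote averages. A minimal sketch, assuming `votes` is the chronological list of each review's total (useful + funny + cool) votes for one user with at least one review:

```python
def temporal_variables(votes):
    """P_increase and mu from a user's chronological per-review vote totals."""
    running_avgs = []
    total = 0
    for i, v in enumerate(votes, start=1):
        total += v
        running_avgs.append(total / i)  # v_i: votes-per-review after review i
    # Number of times the running average increased after a new review.
    increases = sum(b > a for a, b in zip(running_avgs, running_avgs[1:]))
    p_increase = increases / len(votes)
    mu = sum(running_avgs) / len(votes)  # mean of the running averages
    return p_increase, mu
```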
4.2 User Review Rank
For the second part of our temporal analysis, we look at the rank of each review a user has posted. Using 0-indexing, if a review has rank r for business b, the review was the rth review written for business b. Our hypothesis was that an elite user should be one of the first few users to write a review for a restaurant, because elite users are more likely to seek out new restaurants to review. Also, based on the dataset, elite users write approximately 230 more reviews on average than normal users, making it more likely that an elite user will be among the first to review a business. Over time, since there are more normal users, the ratio of elite to normal reviews should decrease as more normal users write reviews. To verify this, we calculated the percentage of elite reviews at each rank across the top 10,000 businesses, where the top business is defined as the business with the most reviews; the number of ranks we consider is the minimum number of reviews of any single business among the top 10,000. The plot is shown in Fig 8.

Figure 8: Plot of the probability of being an elite user for reviews at rank r.

Given that the dataset consists of approximately 10% elite users, the plot shows that it is more likely for an elite user to be among the first few reviewers of a business. We calculated a score for each user as a function of the rank of each of the user's reviews and included it as a feature in the SVM. For each review of a user, we find the total number of reviews of the business the review belongs to, subtract the review's rank from that total, and sum this value over all of the user's reviews:

score = Σ_{rev} ( review_count(business_of(rev)) − rank(rev) )

We subtract the rank from the total review count so that, per our hypothesis, elite users (whose reviews will more likely have lower ranks than those of normal users) end up with higher scores.

4.3 User Tip Rank
A tip is a short chunk of text that a user can submit to a business via any Yelp mobile application. Using 0-indexing, if a tip has rank r for business b, the tip was the rth tip written for business b. Analogously to the review rank, we hypothesized that an elite user should be one of the first few tippers (people who leave a tip) of a restaurant. We plotted the same kind of graph, showing the percentage of elite tips at each rank across the top 10,000 businesses, where the top business is defined as the business with the most tips. The plot is shown in Fig 9.

Figure 9: Plot of the probability of being an elite user for tips at rank r.

Given that the dataset consists of approximately 10% elite users, the plot shows that it is more likely for an elite user to be among the first few tippers of a business. Furthermore, in this dataset, elite users write only approximately 25% of the total number of tips, yet for the top 10,000 businesses they account for more than 25% of the tips at almost all of the ranks shown in Fig 9. We then calculated a score for each user based on the rank of each of the user's tips and included it as a feature in the SVM:

score = Σ_{tip} ( tip_count(business_of(tip)) − rank(tip) )

The equation for this score follows the same reasoning as in the user review rank section.

4.4 Review Activity Window
In this section, we look at the distribution of a user's activity over time. The window we consider runs from the user's join date to the end date, defined as the date of the last review posted in the entire dataset. For each user, we find the interval in days between consecutive reviews, including the join date and end date. For example, if a user has two reviews on date1 and date2, where date2 is after date1, the interval durations will be date1 − joinDate, date2 − date1, and endDate − date2; so for n reviews, we get n + 1 intervals. From the list of intervals, we calculate a score, hypothesizing that the lower the score, the more likely the user is elite:

score = ( var(intervals) + avg(intervals) ) / days_on_yelp

where var(intervals) is the variance of the interval values, avg(intervals) is their average, and days_on_yelp is the number of days the user has been on Yelp.

For the variance, the hypothesis is that elite users' variance will tend to be low, since we expect elite users to post regularly; for normal users, the variance will be high, possibly due to irregular posting and long periods of inactivity between posts. We also include the average interval because, looking at variance alone, a user who writes a review every two days would get the same variance (zero) as a user who writes a review every day; the average of the intervals accounts for this by increasing the score of the less frequent poster. Finally, we divide by the number of days the user has been on Yelp, to account for situations such as a user who posts every week but has only been on Yelp for three weeks versus one who also posts every week but has been on Yelp for a year; the user who has been on Yelp for a year gets the lower, more elite-like score.
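The interval score is easy to compute once a user's review dates are collected. A minimal sketch, assuming `datetime.date` values and a guard against a zero-day account age (our assumption; the paper does not discuss this edge case):

```python
from statistics import mean, pvariance

def activity_window_score(review_dates, join_date, end_date):
    """Interval score from section 4.4; all arguments are datetime.date."""
    dates = [join_date] + sorted(review_dates) + [end_date]
    # n reviews produce n + 1 intervals, in days.
    intervals = [(b - a).days for a, b in zip(dates, dates[1:])]
    days_on_yelp = (end_date - join_date).days or 1  # avoid division by zero
    return (pvariance(intervals) + mean(intervals)) / days_on_yelp
```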
5. SOCIAL NETWORK ANALYSIS
The Yelp social network is the user friendship graph; this data is available in the latest version of the Yelp academic dataset. We used the graph library from the same toolkit that was used for the text analysis in section 3. We assume that users on the Yelp network do not become friends at random; rather, we hypothesize that users become friends when they think the friendship is mutually beneficial. In this model, one user will befriend another if he or she thinks the other user is worth knowing (i.e., is a "good reviewer"). We believe this is a fair assumption to make, since the purpose of the Yelp website is to provide quality reviews for both businesses and users. One potential downside we can see is users becoming friends simply because they are friends in real life or on a different social network.

5.1 Network Centrality
Since our goal is to find "interesting" or "elite" users, we use three network centrality measures to identify central (important) nodes. We would like to find out whether elite users are more likely to be central nodes in their friendship network, and whether the results of the three centrality measures are correlated. Below, we briefly summarize each measure; for a more in-depth discussion of centrality (including the measures we use), we suggest the reader consult [7]. For our centrality calculations, we considered the graph of the 123,369 users who wrote at least one review.

Degree centrality for a user u is simply the degree of node u, which in our network equals the number of friends. The intuition is that users with more friends are more important (or more active) than those with fewer or no friends. Degree centrality can be calculated almost instantly.

Betweenness centrality for a node u essentially captures the number of shortest paths between all pairs of nodes that pass through u; a user who is an intermediary between many user pairs is considered important. Betweenness centrality is very expensive to calculate, even using an O(mn) algorithm [4]. This algorithm is part of the toolkit we used, and it took two hours to run on 3.0 GHz processors with 24 threads.

Eigenvector centrality operates under the assumption that important nodes are connected to other important nodes; PageRank [11] is a simple extension of eigenvector centrality. If a graph is represented as an adjacency matrix A, then the (i, j)th cell is 1 if there is an edge between i and j, and 0 otherwise. This notation is convenient for defining the eigenvector centrality of a node u, denoted x_u:

x_u = (1/λ) Σ_{i=1}^{n} A_{iu} x_i

Since this can be rewritten as Ax = λx, we can solve for the eigenvector centrality values with power iteration, which converges in a small number of iterations and is quite fast; the eigenvector centralities for the Yelp social network were calculated in less than 30 seconds.
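Power iteration needs only the adjacency structure. A minimal sketch, assuming the friendship graph is a dict mapping each user to a list of friends and contains at least one edge:

```python
def eigenvector_centrality(adj, iterations=100, tol=1e-8):
    """Power iteration: repeatedly apply x <- A x, renormalizing each step."""
    x = {u: 1.0 for u in adj}
    for _ in range(iterations):
        # One matrix-vector product over the adjacency list.
        nxt = {u: sum(x[v] for v in adj[u]) for u in adj}
        norm = sum(v * v for v in nxt.values()) ** 0.5
        nxt = {u: v / norm for u, v in nxt.items()}
        if max(abs(nxt[u] - x[u]) for u in adj) < tol:
            return nxt  # converged
        x = nxt
    return x
```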
Degree centrality
Name       Reviews   Useful   Friends   Fans    Elite
Walker     240       6,166    2,917     142     Y
Kimquyen   628       7,489    2,875     128     Y
Katie      985       23,030   2,561     1,068   Y
Philip     706       4,147    2,551     86      Y
Gabi       1,440     12,807   2,550     420     Y

Betweenness centrality
Name       Reviews   Useful   Friends   Fans    Elite
Gabi       1,440     12,807   2,550     420     Y
Philip     706       4,147    2,551     86      Y
Lindsey    906       7,641    1,617     348     Y
Jon        230       2,709    1,432     60      Y
Walker     240       6,166    2,917     142     Y

Eigenvector centrality
Name       Reviews   Useful   Friends   Fans    Elite
Kimquyen   628       7,489    2,875     128     Y
Carol      505       2,740    2,159     163     Y
Sam        683       9,142    1,960     100     Y
Alina      329       2,096    1,737     141     N
Katie      985       23,030   2,561     1,068   Y

Figure 10: Comparison of the top-ranked users as defined by the three centrality measures on the social network.

Fig 10 compares the top five ranked users under each centrality score. The top five of each centrality shared some names: Walker, Gabi, and Philip appear in both degree and betweenness; Kimquyen and Katie in both degree and eigenvector; betweenness and eigenvector shared no users in the top five (though, not shown here, some users are shared in ranks six to ten). The top users defined by the centrality measures are almost all elite, even though elite users make up only about 8% of the dataset. The only exception is Alina from eigenvector centrality; her other statistics look much like those of the other elite users, so perhaps this is a prediction that Alina will become elite in 2015. The next step is to use these social network features to predict elite users.

5.2 Weighted Networks
Adding weighted links between users could definitely enhance the graph representation; the link types that could potentially be weighted are fans and votes. Additionally, if we had some tie strength of friendship based on communication or profile views, we could use weighted centrality measures for this aspect as well. Unfortunately, we have no way to define the strength of the friendship between two users, since we only have the information present in the Yelp academic dataset. As for the votes and fans, the dataset gives only raw counts for these values, as opposed to the actual links in the social network. If we had this additional information, we could add those centrality measures to the friendship graph centrality measures for an enriched social network feature set.

6. USER METADATA
User metadata is information that is already part of the JSON Yelp user object; all of it can be seen by visiting the Yelp website and viewing specific numerical fields.

• Votes. Votes are ways to show a specific type of appreciation towards a user. There are three types of votes: funny, useful, and cool. There is no specific definition of what each means.
• Review count. The total number of reviews a user has written.
• Number of friends. The total number of friends in a user's friendship graph. This feature duplicates the degree centrality measure from the social network analysis.
• Number of fans. The total number of fans a user has.
• Average rating. The average star rating in [1, 5] that the user gives a business.
• Number of compliments. According to Yelp, the compliment button is "an easy way to send some good vibes." Compliments are separate from reviews: users receive compliments from other users based on particular reviews they write.

We hope to use these metadata features to classify users as elite. We already saw in section 5 that some metadata fields appear correlated with network centrality measures as well as with a user's status, so they seem likely to be informative features.
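These fields can be read straight off the user objects. A minimal sketch; the field names (`votes`, `review_count`, `friends`, `fans`, `average_stars`, `compliments`) follow the 2014-era Yelp academic dataset user JSON, an assumption on our part, with compliments summed over their sub-types:

```python
def metadata_features(user):
    """Feature vector from a Yelp user object: the three vote counts
    plus the five remaining metadata fields from section 6."""
    votes = user.get("votes", {})
    return [
        votes.get("funny", 0),
        votes.get("useful", 0),
        votes.get("cool", 0),
        user.get("review_count", 0),
        len(user.get("friends", [])),            # number of friends
        user.get("fans", 0),
        user.get("average_stars", 0.0),          # average rating in [1, 5]
        sum(user.get("compliments", {}).values()),
    ]
```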
7. EXPERIMENTS
We now run experiments to test whether each feature generation method is a viable candidate for distinguishing elite from normal users. As mentioned before, the number of elite users is much smaller than the total number of users: about 8% of all 252,898 users are elite. This presents a very imbalanced class distribution. Since classifying elite users over the entire user base has such a high baseline (92% accuracy), we also truncate the dataset to a balanced class distribution with a total of 40,090 users, giving an alternate baseline of 50% accuracy. Both datasets are used in all of the following experiments. As described in section 3.1, we use the MeTA toolkit (http://meta-toolkit.github.io/meta/) for text tokenization, class balancing, and five-fold cross validation with an SVM; the SVM is implemented as stochastic gradient descent with hinge loss.

7.1 Text Features
We represent each user as the collection of all of his or her review text. The previous experiments showed that it is possible to classify a single review as being written by an elite or normal user; now, we classify users, based on all their reviews, as elite or normal. Figure 11 shows the results of this text classification task. Using the balanced dataset, we achieve about 77% accuracy, compared to barely reaching the baseline accuracy on the full dataset.

Since the text features are very high-dimensional, we performed basic feature selection by selecting the most frequent features in the dataset. Before feature selection, accuracy on the balanced dataset was about 70%; using the top 100, 250, and 500 features all resulted in similar accuracy of around 76%. We use the reduced set of 250 features in the experimental results in the rest of this paper.

Balanced text features     classified as elite   classified as normal
elite                      0.651                 0.349
normal                     0.124                 0.876
Overall accuracy: 76.7%, baseline 50%

Unbalanced text features   classified as elite   classified as normal
elite                      0.582                 0.418
normal                     0.039                 0.961
Overall accuracy: 91.8%, baseline 92%

Figure 11: Confusion matrices for normal vs elite users on the balanced and unbalanced datasets with text features.

7.2 Temporal Features
The temporal features consist of the features derived from changes in the average number of votes per review posted, the sums of the ranks of a user's reviews and tips, and the distribution of reviews posted over the lifetime of a user. Using these features, we obtained the results shown in Figure 12.

Balanced temporal features     classified as elite   classified as normal
elite                          0.790                 0.210
normal                         0.320                 0.680
Overall accuracy: 73.5%, baseline 50%

Unbalanced temporal features   classified as elite   classified as normal
elite                          0.267                 0.733
normal                         0.067                 0.933
Overall accuracy: 88%, baseline 92%

Figure 12: Confusion matrices for normal vs elite users on the balanced and unbalanced datasets with temporal features.
7.3 Graph Features
Figure 13 shows the results using the centrality measures from the social network. Although there are only three features, Figure 10 showed that there is potentially a correlation between elite status and high centrality values. The three graph features alone were able to predict whether a user was elite with almost 80% accuracy on the balanced dataset; again, results were below the baseline when using the full user set.

Balanced graph features     classified as elite   classified as normal
elite                       0.842                 0.158
normal                      0.251                 0.749
Overall accuracy: 79.6%, baseline 50%

Unbalanced graph features   classified as elite   classified as normal
elite                       0.311                 0.689
normal                      0.075                 0.925
Overall accuracy: 87.6%, baseline 92%

Figure 13: Confusion matrices for normal vs elite users on the balanced and unbalanced datasets with graph features.

7.4 Metadata Features
Using only the six metadata features from the original Yelp JSON gave surprisingly high accuracy at almost 94% on the balanced classes. In fact, the metadata features had the highest precision for both the elite and normal classes. The unbalanced accuracy was near the baseline.

Balanced metadata features     classified as elite   classified as normal
elite                          0.959                 0.041
normal                         0.083                 0.917
Overall accuracy: 93.8%, baseline 50%

Unbalanced metadata features   classified as elite   classified as normal
elite                          0.880                 0.120
normal                         0.097                 0.903
Overall accuracy: 90.1%, baseline 92%

Figure 14: Confusion matrices for normal vs elite users on the balanced and unbalanced datasets with metadata features.

7.5 Feature Combination and Discussion
To combine features, we simply concatenated the feature vectors of all the previous feature types and used the same splits and classifier as before. Figure 15 shows the breakdown of this classification experiment, and Figure 16 summarizes all results by final accuracy.

Balanced all features     classified as elite   classified as normal
elite                     0.754                 0.256
normal                    0.111                 0.889
Overall accuracy: 82.2%, baseline 50%

Unbalanced all features   classified as elite   classified as normal
elite                     0.976                 0.024
normal                    0.731                 0.269
Overall accuracy: 92%, baseline 92%

Figure 15: Confusion matrices for normal vs elite users on the balanced and unbalanced datasets with all features present.

             Text   Temp.   Graph   Meta   All
Balanced     .767   .735    .796    .938   .822∗
Unbalanced   .918   .880    .876    .901   .920

Figure 16: Final results summary for all features and feature combinations on balanced and unbalanced data. ∗Excluding just the text features resulted in 90.4% accuracy.

Unfortunately, the combined feature vectors did not significantly improve the classification accuracy on the balanced dataset as we had expected. Initially, we thought this might be due to overfitting, which is why we reduced the number of text features from over 70,000 to 250: using the 70,000 text features combined with the other feature types resulted in about 70% accuracy, while with the top 250 features we achieved the 82.2% shown in the tables. For the unbalanced dataset, the results did improve enough to reach the difficult baseline.

Using all combined features except the text features resulted in 90.4% accuracy, suggesting some sort of disagreement between "predictive" text features and the other predictive features; removing the text features yielded a much higher result, approaching the accuracy of the Yelp metadata features alone. Since we dealt with some overfitting issues, we made sure that the classifier used regularization, which prevents the weights of individual features from becoming too large even when they appear incredibly predictive of the class label. Fortunately (or unfortunately), the classifier we used does employ regularization, so there was nothing further we could do along these lines to increase performance.
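The concatenation-and-classify pipeline can be sketched with scikit-learn, whose `SGDClassifier` with hinge loss matches the SVM-as-SGD setup described at the start of section 7; the per-user feature arrays are assumed to have been built elsewhere (e.g., by the earlier sketches):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Assumed: text_f, temporal_f, graph_f, meta_f are (n_users, d_i) arrays,
# and y holds 1 for elite users and 0 for normal users.
X = np.hstack([text_f, temporal_f, graph_f, meta_f])  # simple concatenation

# Hinge loss trained by SGD approximates a linear SVM; alpha is the
# regularization strength discussed in section 7.5.
clf = SGDClassifier(loss="hinge", alpha=1e-4)
scores = cross_val_score(clf, X, y, cv=5)  # five-fold cross validation
print(f"accuracy: {scores.mean():.3f}")
```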
8. CONCLUSION AND FUTURE WORK
We investigated several different feature types to classify elite users in the Yelp network. We found that each of our feature sets was able to distinguish between the two user types on its own. However, when combined, they did not improve accuracy on the class-balanced dataset over the best-performing single feature type.

In the text analysis, we can investigate different classifiers to improve the classification accuracy. For example, k-nearest neighbors could be a good approach, since it is nonlinear and we have a relatively small number of dimensions after reducing the text features. The text analysis could also be extended with the aid of topic modeling [2]: the algorithm clusters documents into separate topics, a document is then represented as a mixture of these topics, and each document's mixture can be used as a feature for the classifier.

In the temporal analysis, we made some assumptions about the 'elite' data provided by the Yelp dataset. The data tells us in which years a user was 'elite', and we made the simplifying assumption that as long as a user has at least one year of elite status, the user is currently, and has always been, an elite user. For instance, if a user was only elite in 2010, we treated that user's review from 2008 as an elite review. We could also make use of more advanced models such as the vector autoregression model (VAR) [14], which might improve the analysis of votes per review over time; one possibility is to take all the votes-per-review plots of users in the dataset and run the model on this data.

Finally, in the network analysis, we can consider different network features such as the clustering coefficient or similarity via random walks. The graph features would certainly benefit from added weights, but as mentioned in section 5, we unfortunately do not have this data; the social graph structure could also be enriched with more information about fans and votes. Lastly, since the metadata features were by far the best-performing, it would be an interesting auxiliary problem to predict their values via regression using the other feature types we created.

APPENDIX
A. REFERENCES
[1] Krisztian Balog, Leif Azzopardi, and Maarten de Rijke. A language modeling framework for expert finding. Inf. Process. Manage., 45(1):1–19, January 2009.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
[3] Alessandro Bozzon, Marco Brambilla, Stefano Ceri, Matteo Silvestri, and Giuliano Vesci. Choosing the right crowd: Expert finding in social networks. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, pages 637–648, New York, NY, USA, 2013. ACM.
[4] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163–177, 2001.
[5] Munmun De Choudhury, Hari Sundaram, Ajita John, and Dorée Duncan Seligmann. What makes conversations interesting?: Themes, participants and consequences of conversations in online social media. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 331–340, New York, NY, USA, 2009. ACM.
[6] Kate Ehrlich, Ching-Yung Lin, and Vicky Griffiths-Fisher. Searching for experts in the enterprise: Combining text and social network analysis. In Proceedings of the 2007 International ACM Conference on Supporting Group Work, GROUP '07, pages 117–126, New York, NY, USA, 2007. ACM.
[7] Jiawei Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[8] Yelp Inc. Yelp Dataset Challenge, 2014. http://www.yelp.com/dataset_challenge.
[9] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, September 1999.
[10] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[11] Larry Page, Sergey Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1998.
[12] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2):1–135, January 2008.
[13] Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu. RankClus: Integrating clustering with ranking for heterogeneous information network analysis. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT '09, pages 565–576, New York, NY, USA, 2009. ACM.
[14] Hiro Y. Toda and Peter C. B. Phillips. Vector autoregression and causality. Cowles Foundation Discussion Papers 977, Cowles Foundation for Research in Economics, Yale University, May 1991.
[15] Hongning Wang, Yue Lu, and ChengXiang Zhai. Latent aspect rating analysis on review text data: A rating regression approach. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 783–792, New York, NY, USA, 2010. ACM.
[16] Hongning Wang, Yue Lu, and ChengXiang Zhai. Latent aspect rating analysis without aspect keyword supervision. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 618–626, New York, NY, USA, 2011. ACM.
[17] Jun Zhang, Mark S. Ackerman, and Lada Adamic. Expertise networks in online communities: Structure and algorithms. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 221–230, New York, NY, USA, 2007. ACM.