GetJar Mobile Application Recommendations with Very Sparse Datasets

Kent Shi                         Kamal Ali
GetJar Inc.                      GetJar Inc.
San Mateo, CA, USA               San Mateo, CA, USA
[email protected]                 [email protected]

ABSTRACT
The Netflix competition of 2006 [2] has spurred significant activity in the recommendations field, particularly in approaches using latent factor models [3, 5, 8, 12]. However, the near ubiquity of the Netflix dataset and the similar MovieLens datasets (http://www.grouplens.org/node/73) may be narrowing the generality of lessons learned in this field. At GetJar, our goal is to make appealing recommendations of mobile applications (apps). For app usage, we observe a distribution with higher kurtosis (a heavier head and a longer tail) than that of the aforementioned movie datasets. This happens primarily because of the large disparity in resources available to app developers and the low cost of app publication relative to movies. In this paper we compare a latent factor model (PureSVD) and a memory-based model with our novel PCA-based model, which we call Eigenapp. We use both accuracy and variety as evaluation metrics. PureSVD did not perform well due to its reliance on explicit feedback such as ratings, which we do not have. Memory-based approaches that perform vector operations in the original high-dimensional space over-predict popular apps because they fail to capture the neighborhoods of less popular apps: they have high accuracy due to the concentration of mass in the head, but do poorly in terms of the variety of apps exposed. Eigenapp, which exploits neighborhood information in a low-dimensional space, did well on both precision and variety, underscoring the importance of dimensionality reduction for forming quality neighborhoods in high kurtosis distributions.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications: Data mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval: Information filtering

General Terms
Algorithms, Experimentation, Performance

Keywords
Recommender system, mobile application, evaluation, sparse data, PCA

1. INTRODUCTION
In the last few years, there has been a tremendous amount of growth in the mobile app space, particularly on the Android platform. As of January 2012, more than 400,000 apps are hosted on Google's app store, Google Play (formerly known as Android Market); see http://www.distimo.com/blog/2012_01_google-android-market-tops-400000-applications. However, Google Play provides little personalization beyond location-based tailoring of catalogs: all users from a given country see the same list of apps regardless of their tastes and preferences. Since most users typically navigate no more than a few pages when browsing the store, the lack of personalization limits exposure for the majority of apps. By analyzing app usage on a sample of devices, we find that this space is dominated by a few apps, which unsurprisingly are ones that have recently been "featured" on the front page of Google Play.

GetJar, founded in 2004, is the largest free app store in the world. It provides mobile apps to users of all mobile platforms. We have recently begun to focus on the Android platform due to its openness and surging market share. Our goal is to become an attractive destination for Android apps by providing high quality personalization as a means to app discovery.

1.1 Challenges
While recommendation techniques, especially those using collaborative filtering, have been common since the early 1990s [6] and have been deployed on a number of e-commerce websites such as Amazon.com [9], recommendation in the emerging app domain is beset by unique challenges, mainly due to the greater kurtosis in the distribution of app usage data. From anonymous usage data collected at GetJar, we find that a few well-known apps are popular among a large number of users, but the vast majority of apps are rarely used by most users. Figure 1(a) compares the data distributions of the movie (Netflix) and app (GetJar) domains. Note that the plot includes only apps that have recently been used by GetJar users; this constitutes approximately 55,000 apps, or about 14% of all apps.

Figure 1: (a) Distribution of items (GetJar apps or Netflix movies) in terms of percentage of total users, with items sorted by popularity. (b) The same distributions plotted on a log-log scale.

The movie at the first percentile (rank 177) is rated by 20% of Netflix users. In contrast, the app at the first percentile (rank 550) is used by only 0.6% of GetJar users. Furthermore, the movie at the first percentile has 42% as many users as the most popular movie, but the app at the first percentile has only 1.3% as many users as the most popular app. Therefore, even though there are over 400,000 available apps, in reality only a few thousand of them are used in any significant sense. The same data is plotted in Figure 1(b), this time using a log scale for both axes. The GetJar curve is almost a straight line in log-log space, indicating that the frequencies can be approximated by a Zipf distribution [17]. This figure definitively shows the qualitative difference between the two distributions: the app distribution is linear in log-log space whereas the movie distribution is not.
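To make the log-log linearity claim concrete, the following sketch (our illustration, not part of the original study) fits a straight line to a rank-frequency curve in log-log space; a roughly constant slope indicates a Zipf-like distribution. The array name user_counts and the synthetic data are assumptions used only for demonstration.

```python
import numpy as np

def loglog_slope(user_counts):
    """Fit a line to the rank-frequency curve in log-log space.

    user_counts is a hypothetical array holding, for each app, the number
    of distinct users who used it. A near-linear fit (slope close to -1
    for classic Zipf) supports the Zipf approximation discussed above.
    """
    counts = np.sort(np.asarray(user_counts, dtype=float))[::-1]  # sort apps by popularity
    counts = counts[counts > 0]
    ranks = np.arange(1, len(counts) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(counts), deg=1)
    return slope, intercept

# Demonstration on synthetic Zipf-distributed data (not GetJar data):
rng = np.random.default_rng(0)
samples = rng.zipf(a=2.0, size=100_000)   # draw "app ids" from a Zipf law
user_counts = np.bincount(samples)        # usage count per app id
print(loglog_slope(user_counts))
```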
Traditional collaborative filtering techniques [9, 14] and even the newer latent factor models [3, 5, 8, 12, 13] were not designed to handle this level of sparsity. There are at least three reasons for this difference. First, the disparity in available resources among app developers is larger than that among movie producers, mainly because the cost (in time and money) of publishing an app is much lower than that of releasing a movie. Second, because the smartphone space is less mature, most casual users are unaware of the full capabilities of their device or of what apps are available for it. This is in contrast to domains such as movies, where numerous outlets are dedicated to reviewing or promoting the products. Third, discovery mechanisms in the app space are less effective and mature than those of other domains.

Today, most app stores offer three ways for users to discover apps: (1) listings of apps sorted by the number of downloads or a similar trending metric, (2) category-based browsing, and (3) keyword-based searching. We know that the number of apps that can be exposed using listings is limited, and that methods (2) and (3) are not as effective as we would like. Browsing by category is only useful if the ontology of categories is rich, as in the case of Amazon. But most app stores rely on developers to categorize their own apps using a fixed inventory of labels. This leads to a small number of categories and a large number of apps within each, so only the top few apps in each category ever gain significant visibility. Search is also ineffective because we find that most users do not know what to search for. About 90% of search queries at GetJar are titles (or close variants) of popular apps, which means search is not currently an effective vehicle for discovering new apps.

1.2 Goal and evaluation criteria
Users visit GetJar hoping to find interesting and useful apps. But as we have seen, common strategies such as browsing and searching, which have worked well for other e-commerce sites, do not work as well in domains where many items remain under-publicized. Our goal is to use personalization to help users find a greater variety of appealing apps. Our prototype recommendation system recommends a top-N list of apps to each user based on her recent app usage. We judge the quality of the recommendations primarily by accuracy, which represents the ability of the recommender to predict the presence of an app on the user's device. To increase the exposure of under-publicized apps, the recommender is also evaluated on its ability to recommend tail apps and on the variety of the apps it recommends.

A number of app stores currently offer personalized app recommendations, most notably the Apple App Store and the Amazon Appstore. However, little is known about how they generate their recommendations, and we are unaware of any publications on mobile app recommendation.

The rest of the paper is organized as follows: Section 2 reviews how the data was collected and some of its properties; Section 3 provides details of the algorithms we considered; Section 4 presents the experimental setup and results; and Sections 5 and 6 provide discussion and conclusions.

2. THE GETJAR DATA
The data we report on in this paper comes from server log files at GetJar from which all personally identifying information has been stripped; records pertaining to a single device can still be linked through a common anonymous identifier. The apps we report on include those hosted on GetJar as well as those on Google Play. For the purposes of this study, we rely on app usage data rather than installation data. We choose not to use installation data because it is a poor indicator of interest: many app installations are experimental from the user's perspective, and a significant fraction of our users are found to uninstall an app on the same day they installed it.
Also, another significant fraction of users have a vast number of installed apps that never get used. Many users are new to the mobile app space and are likely experimenting with a variety of apps. We restrict our data to recent app usage to account for this, and because users' tastes in apps can change more rapidly than in traditional domains such as movies and music; we are only interested in recommending apps that reflect their current tastes and interests. The observation period for this study is from November 7 to November 21, 2011. We find that varying the length of the observation period by a few days makes almost no difference in the number of apps used by the users. (We use the more convenient word "users" to denote their anonymized identifiers.) To reduce noise from apps that were merely being trialed, we filtered out apps that were not used on any day other than the day of installation. We further cleaned the data by removing users who joined or left midway through the observation period and those not associated with a pre-determined list of legitimate devices. The resulting dataset contains 101,106 users. For each user we used the list of apps and the number of days each app was used during the observation period. The total number of unique apps used by all users during the interval, subject to these constraints, was 55,020.

2.1 Data sparsity and long tail
As we have already illustrated in Figure 1, our data is extremely sparse and the vast majority of apps have low usage. While sparsity and a long tail [1] are well-known characteristics of all e-commerce data, they are especially pronounced in our dataset. Figure 2 plots the cumulative distribution of the items in terms of the total amount of usage. The GetJar dataset is far more head-heavy than the Netflix dataset: the top 1% of apps account for 58% of usage, whereas the top 1% of movies contribute 22% of all ratings. An even more selective subset, the 100 most popular apps, accounts for 30% of total app usage. For the GetJar dataset, we define the head to be the top 100 apps and the remaining apps to be the tail.

Figure 2: Cumulative distribution of items in terms of percentage of total usage; the curves can be viewed as the integral of the curves in Figure 1.

One major reason for this difference is that many apps are used every day, but movies are seldom watched more than once or twice; Netflix users may therefore be more likely to explore new items than GetJar users. Another reason is that the Netflix data was collected over a much longer period of time. The longer tail in the GetJar dataset, as previously alluded to, is primarily due to the low cost of publishing apps compared to the cost of releasing a movie. This encourages developers to release as many apps as possible to increase the chances of their apps being discovered by search. This strategy often leads to apps being published multiple times with different titles but similar functionality. It also encourages the proliferation of a large number of apps tailored for very specific needs (e.g., ringtone apps dedicated to music by specific artists) as opposed to general apps (e.g., a single ringtone app containing music by all artists).

Given that we have little or no usage information on the bulk of the tail apps, recommending them is a very difficult task. To ensure that the recommended apps have a certain amount of support, for this study we limited our app population to apps with more than 20 users. This reduces the number of apps from 55,020 to 7,304. Even though this pruning removed 87% of the apps (or 98% if we include apps with no usage), only 9% of the total usage was eliminated from our modeling. Table 1 shows the size and density of the user-item matrices before and after pruning. Even after rejecting the bottom 87% of the apps, the GetJar* dataset is still much sparser than Netflix.

Table 1: Size of the user-item matrices for the Netflix and GetJar datasets. GetJar* denotes the GetJar dataset restricted to apps used by more than 20 users.

Dataset   Users     Items    Usages/Ratings   Density
GetJar    101,106   55,020   1.99M            0.04%
GetJar*   101,031   7,304    1.82M            0.25%
Netflix   480,189   17,770   100M             1.18%
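As a minimal sketch of the pruning and density computation described above (assuming a SciPy sparse user-item matrix named R with one non-zero entry per observed user-app pair; the 20-user threshold is from the text, the names and representation are ours):

```python
import numpy as np
from scipy import sparse

def prune_and_density(R, min_users=20):
    """Drop apps used by at most min_users users and report matrix density.

    R is a hypothetical scipy.sparse user-item matrix of usage indicators.
    Returns the pruned matrix (analogous to the GetJar* dataset) and its density.
    """
    R = sparse.csc_matrix(R)
    users_per_item = np.asarray((R > 0).sum(axis=0)).ravel()
    keep = users_per_item > min_users          # keep apps with more than 20 users
    R_pruned = R[:, keep]
    density = R_pruned.nnz / (R_pruned.shape[0] * R_pruned.shape[1])
    return R_pruned, density
```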
2.2 Usage versus ratings
Another difference between the GetJar dataset and the Netflix dataset is that movie ratings are explicit feedback about interest, whereas days of usage is implicit feedback [11]. The benefit of an explicit rating system is that it is well defined and standardized, and thus yields a more accurate measurement of interest than implicit feedback such as days of usage, which can be influenced by factors such as mood, unforeseen events, or logging errors. Furthermore, usage is correlated with category: we find that "social" apps are consistently the most heavily used apps among nearly all users. This is because "social" apps must be used often in order to serve their purpose, while apps in categories such as "productivity" are seldom needed on a continuous basis. So while it is safe to assume that a user enjoyed a movie she rated highly more than one she rated low, the same cannot be said for a user who used a "social" app more than a "productivity" app.

We choose not to use ratings because they have a number of drawbacks in the mobile app domain. Most importantly, ratings are very difficult to collect for a large number of users without intrusive intervention. Furthermore, since users' tastes in apps may change and many developers frequently update their apps with new features or functionality, ratings may become obsolete in as little as one month. Finally, observing ratings on Google Play, we find they are polarized, with the vast majority of ratings being either 1 or 5. This is likely due to fragmentation of the Android platform (many manufacturers produce Android devices with differing hardware and operating system tweaks, making it difficult for developers to test their apps on all devices, so apps often do not work as intended), with most ratings reflecting whether the app worked (5) or not (1) for the user.

Due to the influence of the Netflix competition, most research in the recommendations community has been geared toward rating prediction by minimizing root mean square error (RMSE). However, Cremonesi et al. [3] reported that improving RMSE does not translate into improved accuracy for the top-N task: on the Netflix and MovieLens datasets, the predictive accuracy of a naive most-popular list is comparable to that of sophisticated algorithms optimized for RMSE. We tried the same on the GetJar dataset, substituting days of usage for ratings, and found that algorithms optimized for RMSE actually performed far worse than a simple most-popular list. With that said, days of usage can still be used in neighborhood approaches, provided some correlation between usage and interest exists.
Part of this study is to evaluate the usefulness of this signal. Thus, for our experiments, we used two versions of the user-item matrix: in the first, each cell is the number of days the app was used; in the second, each cell is a binary indicator of usage during the observation period. We would like to see whether the additional granularity provided by days of usage generates better recommendations than a binary indicator.

3. MODELS
Two common recommendation approaches in use today are memory-based models and latent factor models. Memory-based models leverage the neighborhood of items in user space or that of users in item space: a user-user or item-item similarity matrix is computed for pairs, and recommendations are generated based on these similarities. Latent factor models are more sophisticated approaches in which the user-item matrix is decomposed via matrix factorization techniques such as Singular Value Decomposition (SVD); latent factors are then extracted and used to generate predictions. We evaluated both approaches using our data. In addition, we developed a hybrid system using Principal Components Analysis (PCA), which we call Eigenapp. These three algorithms were also compared against a non-personalized baseline that serves the most popular items.

3.1 Non-personalized models
Non-personalized models serve the same list of items to all users. They commonly sort items by the number of purchases, profit margin, click-through rate (CTR), or other similar metrics. In this paper, our non-personalized baseline sorts items by popularity, defined as the number of distinct users that used the item during the observation period.

3.2 Memory-based models
There are two types of memory-based models: item-based and user-based. Item-based models find similarities between items and, for a given user, recommend items similar to the items she already owns. User-based models find similarities between users and, for a given user, recommend items owned by her most similar users. Computationally, item-based models are more scalable because there are usually far fewer items than users, as is the case in the mobile app space. In addition, there is research showing that item-based algorithms generally perform better than user-based algorithms [9, 14]. Hence, our memory-based model uses the item-based approach.

Two of the most common neighborhood similarity metrics in current use are the Pearson correlation coefficient and cosine similarity. The Pearson correlation coefficient is computed for a pair of items based on the set of users that have used both. Since the vast majority of our items reside in the long tail, many items are unlikely to share common users with most other items. Table 2 presents the distribution of the number of common users in the GetJar and Netflix datasets: 83.2% of item pairs in the GetJar dataset have zero users in common, whereas the same percentage for Netflix is 0.2%. For GetJar, more than 90% of item pairs have one or no common users, making it impossible to compute correlations for those pairs. In addition, the vast majority of the remaining item pairs share 10 or fewer users, meaning that the sample correlation estimate is likely to be inaccurate due to poor support. In contrast, the published Netflix dataset has less than 1% of movie pairs sharing one or fewer common users, and about 65% of movie pairs share more than 10 common users.

Table 2: Breakdown of the number of common users for the GetJar* and Netflix datasets. For n items, the total number of item pairs is (n^2 - n)/2.

                     Number of Common Users
Dataset    0        1       2-10     11-20    >20
GetJar*    83.2%    9.1%    6.6%     0.6%     0.6%
Netflix    0.2%     0.4%    33.8%    22.2%    43.3%

Since the Pearson correlation coefficient is undefined for 90% of our item pairs, we use cosine similarity. Let R denote the m x n user-item matrix, where m is the number of users and n is the number of items. From R, we compute an item-item similarity matrix S whose (i, j) entry is

    s_{i,j} = \frac{r_{*,i} \cdot r_{*,j}}{\|r_{*,i}\|_2 \, \|r_{*,j}\|_2}    (1)

where r_{*,i} and r_{*,j} are the ith and jth columns of R. Cosine similarity does not require items to share common users; in that case it simply produces a similarity of 0. However, it still suffers from low overlap support: the closest neighbors of a less popular item often arise by coincidence, simply because they are the only items that produced non-zero similarity scores.

Using S, the affinity t_{u,i} between user u and item i is the sum of similarities between i and the items used by u:

    t_{u,i} = \sum_{j \in I_u} s_{i,j}    (2)

where I_u is the set of items used by u. For a given user, all items are sorted by their affinity score to produce a top-N list. (Users that use a greater number of items have more summands in equation (2), but since we are only interested in the relative order of items for a given user, the varying number of summands does not pose a problem.)

We made two slight modifications to the above method that produced better results. First, the item-item similarity scores s_{i,j} were normalized before being used in equation (2). Deshpande et al. [4] suggested a normalization such that the similarities sum to 1; however, we found that z-score normalization worked much better for the GetJar dataset, producing the asymmetric similarity

    \tilde{s}_{i,j} = \frac{s_{i,j} - \bar{s}_{*,j}}{\sigma_{s_{*,j}}}    (3)

where \bar{s}_{*,j} is the average similarity to item j and \sigma_{s_{*,j}} is the standard deviation of similarities to item j. Second, for each candidate item i, instead of summing over all items in I_u, we considered only the l nearest items, i.e., those with the greatest normalized similarity scores to i. This reduces noise by discarding items only weakly related to i. For the GetJar dataset, we find that setting l = 5 works best.
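A sketch of the item-based scheme of equations (1)-(3), written with dense NumPy arrays for clarity (the authors' implementation is not published; R, l, and N follow the notation in the text, everything else is assumed):

```python
import numpy as np

def item_cosine_similarity(R):
    """Equation (1): cosine similarity between the item columns of R."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0               # avoid division by zero for unused items
    Rn = R / norms
    return Rn.T @ Rn                      # n x n item-item similarity matrix S

def zscore_normalize(S):
    """Equation (3): z-score normalize similarities to each item j (column-wise)."""
    mean = S.mean(axis=0)
    std = S.std(axis=0)
    std[std == 0] = 1.0
    return (S - mean) / std

def top_n_for_user(S_norm, owned, l=5, N=10):
    """Equation (2), restricted to the l nearest owned items of each candidate."""
    owned_idx = np.array(sorted(owned))
    owned_set = set(owned)
    scores = np.full(S_norm.shape[0], -np.inf)
    for i in range(S_norm.shape[0]):
        if i in owned_set:
            continue                      # never recommend apps the user already has
        sims = S_norm[i, owned_idx]       # normalized similarity of candidate i to owned apps
        scores[i] = np.sort(sims)[-l:].sum()
    return list(np.argsort(scores)[::-1][:N])
```

In practice the similarity matrix would be computed once offline from the training data; only the last function runs per user at recommendation time.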
3.3 Latent factor models
Latent factor models factorize the user-item matrix R into two lower-rank matrices: user factors and item factors. These models are often used for rating prediction, where the rating r_{u,i} of user u on item i is predicted by taking the inner product of their respective vectors in the user-factor and item-factor matrices. User and item biases are commonly removed by subtracting the row and column means from R prior to factorization; the biases are added back onto the inner product to generate the final prediction. Examples of this approach include [5, 8, 12, 13]. We tried [5] and [13], substituting days of usage for ratings and then sorting the predictions to generate a top-N recommended list. The results were by far the worst of all algorithms, for the reasons explained in Section 2.2, and we expect similar results from other rating-prediction-based algorithms.

The only latent factor top-N algorithm we are aware of is PureSVD [3]. It works by replacing all missing values in R (those with no ratings) with 0 and then factorizing R via SVD:

    R = U \Sigma V^T    (4)

The affinity between user u and item i can then be computed as

    t_{u,i} = r_{u,*} \cdot Q \cdot q_i^T    (5)

where Q holds the top k singular vectors extracted from V and q_i is the row of Q corresponding to item i. Note that t_{u,i} is simply an association measure, not a predicted rating. A top-N list is made for user u by selecting the N items with the highest affinity score to u. PureSVD is the only latent factor algorithm we evaluated that was able to generate reasonable recommendations. The main reason is that, unlike the other algorithms, PureSVD is not optimized for RMSE-based rating prediction but rather for the relative ordering of items produced by the association scores.
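A sketch of PureSVD scoring per equations (4)-(5), using SciPy's truncated sparse SVD (k = 300 matches the number of singular vectors used in the evaluation later; the function and variable names are ours, not the authors'):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

def puresvd_item_factors(R, k=300):
    """Equation (4): truncated SVD of R with missing entries treated as 0."""
    U, s, Vt = svds(sparse.csr_matrix(R, dtype=float), k=k)
    return Vt.T                                 # Q: n x k matrix of top singular vectors

def puresvd_scores(r_u, Q):
    """Equation (5): affinity t_{u,i} = r_{u,*} . Q . q_i^T, for all items at once."""
    return (r_u @ Q) @ Q.T                      # length-n vector of association scores

# Top-N for a user: score all items, mask out already-used apps, take the N largest.
# scores = puresvd_scores(r_u, Q); scores[used] = -np.inf
# top_n = np.argsort(scores)[::-1][:N]
```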
3.4 Eigenapp model
Of the two approaches above, memory-based models yielded far better results despite only having meaningful neighborhoods for popular items. We want to improve memory-based models by borrowing ideas from latent factor models. Along these lines, we used dimensionality reduction to extract meaningful features for the items and then applied memory-based techniques to generate recommendations in this reduced space. Our neighborhood is still item-based, but items are now represented using features instead of users.

Similar to [3], we replace all missing values in R with 0. Given the large disparity in app frequencies, we normalized the item vectors to prevent the features from being dominated by popular apps. This is done by normalizing each column of R to have zero mean and unit length: \sum_u \tilde{r}_{u,i} = 0 and \sum_u \tilde{r}_{u,i}^2 = 1. We denote this normalized user-item matrix as \tilde{R} and apply PCA to \tilde{R} for feature extraction.

PCA is performed via eigen decomposition of the covariance matrix C. C is computed by first calculating the mean item vector b, with b_u = \frac{1}{n} \sum_i \tilde{r}_{u,i}, then removing the mean by forming the matrix A with cells a_{u,i} = \tilde{r}_{u,i} - b_u, and finally computing C = A A^T. Note that C is an m x m matrix, and the number of users m is likely to be very large, which makes direct eigen decomposition practically impossible in time and space. Observing that the number of items n is much smaller, we used the same procedure as in Eigenface [16] to make the computation tractable. The procedure first performs eigen decomposition on the n x n matrix A^T A, obtaining eigenvectors v_j^* and eigenvalues \lambda_j^* such that for each j:

    A^T A v_j^* = \lambda_j^* v_j^*    (6)

Multiplying both sides by A, we get

    A A^T (A v_j^*) = \lambda_j^* (A v_j^*)    (7)

so the vectors v_j = A v_j^* are eigenvectors of C. From there, we normalize each v_j to unit length and keep only the k eigenvectors with the largest corresponding eigenvalues. These eigenvectors represent the dimensions with the largest variances, i.e., the dimensions that best differentiate the items.
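The covariance trick of equations (6)-(7), sketched in NumPy under the assumption that the centered, normalized matrix A (users by items) fits in memory as a dense array; k = 300 is the setting reported in Section 4, the rest is illustrative:

```python
import numpy as np

def top_eigenapps(A, k=300):
    """Eigenvectors of C = A A^T (m x m) obtained via the smaller n x n matrix A^T A.

    Equations (6)-(7): if A^T A v* = lambda v*, then A A^T (A v*) = lambda (A v*),
    so v = A v* is an eigenvector of C with the same eigenvalue.
    """
    small = A.T @ A                         # n x n, feasible because n << m
    lam, v_star = np.linalg.eigh(small)     # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:k]       # keep the k largest eigenvalues
    V = A @ v_star[:, order]                # map back to eigenvectors of C
    V /= np.linalg.norm(V, axis=0)          # normalize each eigenapp to unit length
    return V                                # m x k matrix whose columns are the eigenapps
```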
Alternatively, these eigenvectors can be viewed as item features: items with similar projected values on a particular eigenvector are likely to be similar in certain attributes. We denote these eigenvectors as eigenapps. Finally, we project all items onto the reduced eigenspace by D = VA, where V is the k x m matrix whose rows are the eigenapps; D is a k x n matrix whose columns contain the projections of each item onto the eigenapps. These values can be viewed as the coefficients or weights of the eigenapps for each item. Inspecting individual rows of D, we find that apps with high projected values on a given eigenapp are often of similar types; this served as preliminary validation that the Eigenapp approach indeed captures latent item features.

Item-item similarities can be computed using equation (1), except that we use D instead of R. Since D is dense, similarity scores will likely be non-zero for all item pairs. Once the item-item similarity matrix S has been computed, the remainder of the algorithm is identical to the memory-based algorithm described in Section 3.2. We find that the neighborhoods computed in the reduced eigenspace are of much better quality than those computed by the memory-based method in the non-reduced space. Neighborhood quality is still better for popular items than for less popular ones, likely due to better support. We also find that neighborhood quality improves as we increase the number of eigenapps used, and that the neighborhoods become relatively stable after k = 200.

The computational complexity of this algorithm, up to generating S, is O(mn^2). On the current GetJar dataset, that process took about 11 minutes on an Intel Core i7 machine using the Eigen library (http://eigen.tuxfamily.org). Since the computation of S is the offline phase of the recommender system, and the number of apps with some minimum amount of usage is unlikely to increase significantly with more users, we do not believe this will pose a problem.

Eigenapp is similar to another PCA-based algorithm, Eigentaste [7]. The main difference is that Eigentaste, which was evaluated on the Jester joke dataset (http://eigentaste.berkeley.edu/dataset), requires a gauge item set in which every item has been rated by every user. Coming up with such a gauge set is impossible for our application, much less a representative one. In addition, Eigentaste uses a user-based neighborhood approach to generate recommendations, whereas Eigenapp uses item-based neighborhoods.
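Continuing the sketch above, the projection D and the eigenspace similarities could look as follows (again an illustration; V holds the eigenapps as columns here, so V.T @ A corresponds to the k x n matrix D in the text). The z-score normalization and top-l scoring of Section 3.2 are then applied to this similarity matrix unchanged.

```python
import numpy as np

def eigenspace_item_similarity(A, V):
    """Project items onto the eigenapps (D, k x n) and compute cosine
    similarities between items in the reduced space, as in equation (1)."""
    D = V.T @ A                             # k x n projected item representations
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0
    Dn = D / norms
    return Dn.T @ Dn                        # dense n x n item-item similarity matrix
```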
4. EVALUATION
We evaluated the four types of models from Section 3 on the GetJar dataset: non-personalized (POP), memory-based (MEM), PureSVD, and Eigenapp. The experiment is set up by randomly dividing the users into five equal-sized groups. Four of the groups are used for training and the remaining one for evaluation. Using the training set, we compute the item-item similarity matrix S for MEM and Eigenapp, the item factor matrix Q for PureSVD, and the list of most popular items for POP. The number of eigenvectors used for Eigenapp and the number of singular vectors used for PureSVD are both 300. For each user in the test set, we sort the apps by install time, feed the first M - 1 apps to the model to generate its recommendation list of N apps, and then check whether the left-out app is in the recommended list (all algorithms exclude from their recommendations the M - 1 apps known to already be installed for the given user). This procedure is repeated over all five ways of dividing the user groups, so that every group is used as the evaluation group once and a recommendation list exists for every user. Two forms of the user-item matrix R were considered, as described in Section 2.2: the version using days of usage is denoted DAY, and the binarized version is denoted BIN.

Accuracy is our first evaluation criterion because we want the recommendations to be relevant to the user's interests and preferences. However, user satisfaction is not solely dependent on accuracy [10]. In particular, given the dominance of the popular apps in this domain, it is important to expose apps in the tail. With that in mind, we also evaluated the accuracy of the models in recommending tail apps, and the variety of the apps recommended.

4.1 Accuracy
The accuracies of the models were evaluated by the standard precision-recall methodology. Since there is only one relevant item to be predicted for each user (the left-out app), we set h_u equal to 1 if the relevant item is in the top-N list for user u and 0 otherwise. Precision and recall at each N are computed by:

    precision(N) = \frac{\sum_{u=1}^{m} h_u}{m \cdot N}    (8)

    recall(N) = \frac{\sum_{u=1}^{m} h_u}{m}    (9)

where m is the number of users.

Figure 3: (a) Precision-recall curves and (b) recall at N curves using all users in the test set.

Figure 3(a) shows the precision-recall curves for the algorithms. The best performer was MEM, despite using an item-item similarity matrix consisting mostly of zeros. A close second was Eigenapp, followed by POP and PureSVD. Figure 3(b) shows the recall at each N, up to N = 50, i.e., the percentage of users whose missing app was identified in the top N. When N is 10, MEM identified the missing app for about 11% of users, Eigenapp for about 10%, and POP and PureSVD for about 7% and 4% of users respectively. The two types of user-item matrix (BIN and DAY) made little difference to the global accuracy of any of the three algorithms, indicating that the additional signal contributed by the number of days of usage does not outweigh its inaccuracies.
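A sketch of the leave-one-out accuracy computation of equations (8)-(9): each test user's most recently installed app is held out, the recommender sees the remaining apps, and h_u records whether the held-out app appears in the top-N list. The recommend_fn interface and names are assumptions, not from the paper.

```python
def precision_recall_at_n(test_users, recommend_fn, N=10):
    """Equations (8) and (9).

    test_users: iterable of per-user app lists, sorted by install time.
    recommend_fn(known_apps, N): returns a top-N list excluding known_apps.
    """
    hits, m = 0, 0
    for apps in test_users:
        if len(apps) < 2:
            continue                        # need at least one app to feed and one to hold out
        known, held_out = apps[:-1], apps[-1]
        top_n = recommend_fn(known, N)
        hits += int(held_out in top_n)      # h_u
        m += 1
    return hits / (m * N), hits / m         # precision(N), recall(N)
```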
4.2 Accuracy of less popular items
Given the overwhelming exposure popular apps receive today in the Android ecosystem, many users will use them simply because they are the only apps they know about. Thus using a popular app may not be as strong an indicator of interest as using a less popular app. In order to measure precision and recall on the tail, we redrew the precision-recall curves after excluding the 100 most popular apps from each user's recommended list. Note that h_u will therefore always be 0 for users whose relevant item is among the 100 most popular apps; those users were removed from this experiment.

Figure 4: (a) Precision-recall curves and (b) recall at N curves after removing the 100 most popular items.

Figure 4(a) shows the precision-recall curves after removing the 100 most popular items. Eigenapp has the highest accuracy on this tail subset; MEM is now second, followed by PureSVD and POP. Recall at N, shown in Figure 4(b), gives a similar picture, but it is worth noting that relative to Figure 3(b), recall dropped for every algorithm except PureSVD, showing that it is more difficult to recommend relevant tail apps than head apps. The two types of user-item matrix (BIN and DAY) again yielded similar performance for all three algorithms, although Eigenapp and PureSVD did slightly better with BIN than with DAY.

4.3 Presentation
The impression the recommended list makes on the user is also important to satisfaction [10]. An artifact of our methodology of predicting the left-out item is that we penalize algorithms for recommending items the user might have liked had she known about them. Since it is impossible for us to know which of the "irrelevant" items in the top N (those that do not correspond to the left-out item) are potentially interesting, we can only judge the diversity of the items presented. In this study, we are interested in recommending a diverse list of apps from across the popularity spectrum.

Table 3: Distribution of recommended apps in terms of app popularity for the different algorithms, with N set to 10. The frequencies are computed by pooling all m x N recommended apps, where m is the number of users, and finding the percentage of apps that lie in each popularity range.

                          Popularity Rank
Algorithm       1-50    51-100   101-500   501-1000   >1000
POP             100%    0        0         0          0
MEM BIN         85%     5%       6%        2%         2%
MEM DAY         80%     6%       8%        3%         4%
PureSVD BIN     <1%     <1%      74%       22%        5%
PureSVD DAY     1%      5%       54%       29%        11%
Eigenapp BIN    34%     18%      27%       8%         14%
Eigenapp DAY    34%     18%      23%       7%         18%

Table 3 shows the popularity ranking of the recommended apps when N is set to 10. As expected, all apps recommended by POP come from the 50 most popular apps. Of the remaining algorithms, more than 80% of the apps recommended by MEM are from the 50 most popular; this is because less popular apps have few non-zero similarities in the high-dimensional neighborhood used by MEM. We also see that PureSVD recommends almost no apps from the 100 most popular, particularly when using BIN, which explains why its accuracy increased significantly when we considered only the less popular items. From a presentation perspective, it is better to show a good mixture of items from the head and the tail. Recommending only head items is poor for discovery because these are items users most likely already know about; recommending only tail items may also be poor, because users will not recognize any of the recommended items and may lose trust in the recommender [15]. With that said, Eigenapp performed the best under this criterion, as it recommends items across all popularity levels.

4.4 Variety
In addition to higher exposure of less popular items, we would also like that exposure to be spread out rather than concentrated on a few items. We measure variety by the entropy of the recommended results:

    H = -\sum_i p(i) \log p(i)    (10)

where p(i) is the relative frequency of item i among the top-N recommended items for all users (m x N items in total).
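Equation (10) in code form: the entropy of the pooled top-N lists over all users, with p(i) estimated as the relative frequency of item i among the m x N recommended slots (natural logarithm assumed, since the text does not specify a base):

```python
import numpy as np
from collections import Counter

def recommendation_entropy(all_top_n_lists):
    """Equation (10): H = -sum_i p(i) log p(i) over the pooled recommendations."""
    counts = Counter(item for top_n in all_top_n_lists for item in top_n)
    total = sum(counts.values())
    p = np.array([c / total for c in counts.values()])
    return float(-(p * np.log(p)).sum())
```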
Figure 5: Entropy of the recommended list for all users with N ranging from 1 to 50.

Figure 5 plots the entropy of the recommendations for the different models with N ranging from 1 to 50. Eigenapp and PureSVD have the largest entropy, meaning that the tops of their recommendation lists comprise a wider variety of items. MEM has far lower entropy, which is expected since it largely recommends only the popular items, and POP, as expected, has the smallest entropy. It is also worth noting that as N increases from 1 to 5, there is a large increase in variety for MEM, indicating that its variety is particularly poor in the top 5 of the recommendation list, exactly the region to which users pay the most attention. We verified this by finding that for MEM, the most popular app is the first item recommended to approximately 40% of users (or 80% of the users who do not already have that app). This finding is expected given the results presented in Section 4.3. Algorithms that recommend tail apps are expected to have a larger variety because the tail is where the vast majority of apps reside. However, Eigenapp actually recommended fewer tail apps than PureSVD and yet had slightly higher entropy, indicating that the exposure of tail apps by Eigenapp is more evenly distributed than that of PureSVD.

5. DISCUSSION
The recent Netflix competition has significantly influenced the recommendation community towards RMSE-based evaluation metrics. We were unable to find many algorithms expressly tailored for the top-N task, with the exception of [3]. In addition, most recent research focuses on the movie domain, using the Netflix or MovieLens datasets, where interest is explicitly expressed through ratings. It is questionable how well these models translate to other domains, particularly those where explicit feedback is not available.

Among the models we evaluated, neighborhood approaches were found to generate accurate recommendations. In particular, traditional memory-based models operating in the high-dimensional user space performed surprisingly well when evaluated using precision-recall. However, further analysis showed that this was due to the presentation of a small set of popular items to almost every user, albeit in different orders. The high kurtosis of the app usage distribution benefits algorithms that concentrate on a small set of popular items when evaluated under precision-recall, although such recommenders are unlikely to generate a pleasing experience for users.

The Eigenapp model performed very well on the GetJar dataset, particularly in its ability to recommend a diverse list of apps. This is because all item vectors are normalized prior to applying PCA, so the usage of less popular apps can be captured by the top eigenvectors. That makes it possible for less popular apps to be among the closest neighbors of popular apps. This is particularly important for the exposure of less popular apps: given the dominance of the popular apps, only apps that are close to one of the popular apps can appear frequently at the top of the recommended lists. Using traditional memory-based models, the popular apps form a tight cluster (relative to the less popular apps) in the neighborhood structure, making it difficult for less popular apps to surface to the top of the recommended lists for many users.
6. CONCLUSION
With increasing numbers of people switching to smartphones, the mobile application space is an emerging domain for recommendation systems. Due to the wide disparity in resources among app publishers, the apps developed by large companies receive far more exposure than those developed by individual developers. This results in app usage being dominated by a few popular apps. The problem is further exacerbated by existing app stores using non-personalized ranking mechanisms. While that approach may help most users find high-quality, essential apps quickly, it is less effective for users who are in an exploratory mode.

In this study, we used app usage as our signal of interest. Given the characteristics of this data, we found that traditional memory-based approaches heavily favor popular apps, contrary to our mission. On the other hand, latent factor models developed for the Netflix data performed quite poorly in terms of accuracy. We find that the Eigenapp model performed the best both in accuracy and in promoting less well known apps in the tail of our dataset.

A system using the Eigenapp model is currently in internal trials at GetJar. It presents a personalized app list to users alongside a non-personalized most-popular list; the first list serves users in an exploratory mode and the second those looking for the most sought-after apps. We plan to open this system for general use in the second half of 2012. At the same time, we are continuously working to improve the system. A limitation of the current model is that it includes only apps with a certain minimum amount of usage, a condition that most apps do not satisfy. While the included set probably contains most of the potentially interesting apps, it is possible that we removed some interesting niche apps, or high-quality apps by individual developers that were not exposed due to lack of marketing. The latter case is particularly important to us. We are currently exploring content-based models that extract useful features from app metadata, and plan to combine the collaborative and content-based approaches in future work.
7. ACKNOWLEDGEMENTS
The authors would like to thank Anand Venkataraman for guidance, edits, and help with revisions. Chris Dury provided valuable feedback and Sunil Yarram helped during various stages of data preparation.

8. REFERENCES
[1] C. Anderson. The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006.
[2] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of KDD Cup and Workshop, pages 3–6, 2007.
[3] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys '10), pages 39–46, New York, NY, USA, 2010. ACM.
[4] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1):143–177, Jan. 2004.
[5] S. Funk. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.
[6] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61–70, Dec. 1992.
[7] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, July 2001.
[8] Y. Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pages 426–434, New York, NY, USA, 2008. ACM.
[9] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7:76–80, 2003.
[10] S. M. McNee, J. Riedl, and J. A. Konstan. Being accurate is not enough: How accuracy metrics have hurt recommender systems. In CHI '06 Extended Abstracts on Human Factors in Computing Systems (CHI EA '06), pages 1097–1101, New York, NY, USA, 2006. ACM.
[11] D. W. Oard and J. Kim. Implicit feedback for recommender systems. In Proceedings of the AAAI Workshop on Recommender Systems, pages 81–83, 1998.
[12] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, pages 39–42, 2007.
[13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender system: A case study. In Proceedings of the ACM WebKDD Workshop, 2000.
[14] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (WWW '01), pages 285–295, New York, NY, USA, 2001. ACM.
[15] G. Shani and A. Gunawardana. Evaluating recommendation systems. In Recommender Systems Handbook, pages 257–297, 2011.
[16] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, Jan. 1991.
[17] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.