GetJar Mobile Application Recommendations with Very Sparse Datasets

Kent Shi and Kamal Ali
GetJar Inc., San Mateo, CA, USA
[email protected], [email protected]
ABSTRACT
The Netflix competition of 2006 [2] has spurred significant
activity in the recommendations field, particularly in approaches using latent factor models [3, 5, 8, 12]. However,
the near ubiquity of the Netflix and the similar MovieLens datasets (http://www.grouplens.org/node/73) may be narrowing the generality of lessons learned
in this field. At GetJar, our goal is to make appealing recommendations of mobile applications (apps). For app usage,
we observe a distribution that has higher kurtosis (heavier
head and longer tail) than that for the aforementioned movie
datasets. This happens primarily because of the large disparity in resources available to app developers and the low
cost of app publication relative to movies.
In this paper we compare a latent factor (PureSVD) and
a memory-based model with our novel PCA-based model,
which we call Eigenapp. We use both accuracy and variety
as evaluation metrics. PureSVD did not perform well due
to its reliance on explicit feedback such as ratings, which we
do not have. Memory-based approaches that perform vector operations in the original high-dimensional space overpredict popular apps because they fail to capture the neighborhood of less popular apps. They had high accuracy due to the concentration of mass in the head, but did poorly in terms of the variety of apps exposed. Eigenapp, which exploits neighborhood information in low-dimensional spaces, did well on both precision and variety, underscoring the importance of dimensionality reduction for forming quality neighborhoods in high-kurtosis distributions.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

General Terms
Algorithms, Experimentation, Performance

Keywords
Recommender system, mobile application, evaluation, sparse data, PCA
1. INTRODUCTION
In the last few years, there has been a tremendous amount
of growth in the mobile app space, particularly in the Android platform. As of January 2012, there are more than
400,000 apps hosted on Google's app store, Google Play (formerly known as Android Market) (http://www.distimo.com/blog/2012_01_google-android-market-tops-400000-applications). However, Google Play
provides little personalization beyond location-based tailoring of catalogs. That means all users from a given country
will see the same list of apps regardless of their tastes and
preferences.
Since most users typically navigate no more than a few
pages when browsing the store, lack of personalization limits exposure for the majority of the apps. By analyzing
the usage of apps on a sample of devices, we find that this
space is dominated by a few apps, which unsurprisingly are
ones that have been “featured” recently on the front page of
Google Play.
GetJar, founded in 2004, is the largest free app store in
the world. It provides mobile apps to users of all mobile
platforms. We have recently begun to focus on the Android
platform due to its openness and surging market share. Our
goal is to become an attractive destination for Android apps
by providing high quality personalization as a means to app
discovery.
1.1 Challenges
While recommendation techniques, especially those using collaborative filtering, have been common since the early 1990s [6] and have been deployed on a number of e-commerce websites such as Amazon.com [9], recommendation in the emerging app domain is a task beset by unique challenges, mainly due to the greater kurtosis in the distribution of app usage data.
From anonymous usage data collected at GetJar, we find that there are a few well-known apps popular among a large number of users, but the vast majority of apps are rarely used by most users. Figure 1(a) shows a comparison of the data distribution between the movie (Netflix) and app (GetJar) domains. Note that the plot only includes apps that have been recently used by GetJar users. This constitutes approximately 55,000 apps, or about 14% of all apps.
Figure 1: (a) Distribution of items (GetJar apps or Netflix movies) in terms of percentage of total users, with items sorted by popularity. (b) Distributions of items plotted in log-log scale.
The movie at the first percentile (rank 177) is rated by 20% of Netflix users. In contrast, the app at the first percentile (rank 550) is used by only 0.6% of GetJar users. Furthermore, the movie at the first percentile has 42% as many users as the most popular movie, but the app at the first percentile has only 1.3% as many users as the most popular app. Therefore, even though there are over 400,000 available apps, in reality only a few thousand of them are being used in any significant sense.
The same data is plotted in Figure 1(b), this time using a log scale for both axes. We can see that the GetJar curve is almost a straight line in log-log space, indicating that the frequencies can be approximated by a Zipf distribution [17]. This figure clearly shows the qualitative difference in distribution: the app distribution is linear in log-log space whereas the movie distribution is not. Traditional collaborative filtering techniques [9, 14], and even the newer latent factor models [3, 5, 8, 12, 13], were not designed to handle this level of sparsity.
There are at least three reasons for this difference. First, the disparity in available resources among app developers is larger than that among movie producers. This is mainly because the cost (time and money) of publishing an app is much lower than that of releasing a movie. Second, due to the less mature nature of the smart phone space, most casual users are unaware of the full capabilities of their device or of what apps are available for it. This is in contrast to other domains such as movies, where there are numerous outlets dedicated to reviewing or promoting those products. Third, discovery mechanisms in the app space are less effective and mature compared to those of other domains.
Today, most app stores offer three ways for users to discover apps: (1) listings of apps sorted by the number of downloads or a similar trending metric, (2) category-based browsing, and (3) keyword-based searching. We know that the number of apps that can be exposed using listings is limited, and that methods 2 and 3 are not as effective as we would like. Browsing by category is only useful if the ontology of categories is rich, as in the case of Amazon. But most app stores rely on developers to categorize their own apps using a fixed inventory of labels. This leads to a small number of categories and a large number of apps within each, causing only the top few apps in each category to ever have significant visibility. Search is also ineffective because we find that most users do not know what to search for. About 90% of search queries at GetJar are titles (or close variants) of popular apps, which means search is currently not being used as an effective vehicle to discover new apps.
1.2 Goal and evaluation criteria
Users visit GetJar hoping to find interesting and useful
apps. But as we have seen, common strategies such as
browsing and searching, which have worked well for other
e-commerce sites, don't work as well in domains where many
items remain under-publicized. Our goal is to use personalization to help users find a greater variety of appealing apps.
Our prototype recommendation system recommends a top-N list of apps to each user based on her recent app usage.
We judge the quality of the recommendations primarily by
accuracy, which represents the ability of the recommender
to predict the presence of an app on the user’s device. To
increase the exposure of under-publicized apps, the recommender is also evaluated on its ability to recommend tail
apps, as well as on the variety of the apps it recommends.
A number of app stores currently offer personalized app
recommendations, most notably the Apple App Store and
the Amazon Appstore. However, little is known about how
they generate their recommendations. Furthermore, we are
unaware of any publications on mobile app recommendations.
The rest of the paper is organized as follows: Section 2
will review how the data was collected and some of its properties; Section 3 will provide details of the algorithms that
we considered; Section 4 will provide the experimental setup
and results; and finally Sections 5 and 6 provide discussion
and conclusions.
2. THE GETJAR DATA
The data we report upon in this paper comes from server log files at GetJar where all personally identifying information had been stripped, but information pertaining to a single source can be uniquely identified up to a common anonymous identifier. The apps we report here include those hosted on GetJar as well as those on Google Play.
For the purposes of this study, we rely upon app usage data rather than installation data. The reason we choose not to use installation data is that it is a poor indicator of interest, since many app installations are experimental from a user's perspective. A significant fraction of our users are found to uninstall an app on the same day as they installed it. There is also another significant fraction of users that have a vast number of installed apps that never get used. Many users are new to the mobile app space and thus are likely experimenting with a variety of apps. We restrict our data to recent app usage to account for the fact that users' tastes for apps can change more rapidly than in traditional domains such as movies and music. We are only interested in recommending apps that reflect their current tastes or interests.
The observation period for data used in this study is from November 7 to November 21, 2011. We find that varying the length of the observation period by a few days makes almost no difference in the number of apps used by the users (we use the more convenient word users to denote their anonymized identifiers). In an effort to reduce noise in the data from apps that were being trialed by users, we filtered out apps that were not used other than on the day of installation. We further cleaned the data by removing users that joined or left midway during the observation period and those that were not associated with a pre-determined list of legitimate devices. The resultant dataset contains 101,106 users. For each user we used the list of apps and the number of days each app was used during the observation period. The total number of unique apps used by all users during the interval satisfying our constraints was 55,020.

2.1 Data sparsity and long tail
As we have already illustrated in Figure 1, our data is extremely sparse and the vast majority of apps have low usage. While it is well known that sparsity and a long tail [1] are two characteristics of all e-commerce data, these are especially pronounced in our dataset.
Figure 2 plots the cumulative distribution of the items in terms of the total amount of usage. We can see that the GetJar dataset is far more head-heavy compared to the Netflix dataset, with the top 1% of apps accounting for 58% of usage, in contrast to Netflix where the top 1% of movies contribute 22% of all ratings. An even more selective subset, the 100 most popular apps, accounts for 30% of total app usage. For the GetJar dataset, we define the head to be the top 100 apps and the remaining apps to be the tail.
One major reason for this difference is that many apps are used every day, but movies are seldom watched more than once or twice. Thus Netflix users may be more likely to explore new items relative to GetJar users. Another reason is that the Netflix data was collected over a much longer period of time. The reason for the longer tail in the GetJar dataset, as previously alluded to, is primarily the low cost of publishing apps compared to the cost of releasing a movie. This encourages developers to release as many apps as possible to increase the chances of their apps being discovered by search. This strategy often leads to apps being published multiple times with different titles but similar functionality. It also encourages the proliferation of a large number of apps tailored for very specific needs (e.g. ringtone apps dedicated to music by specific artists) as opposed to general apps (e.g. a single ringtone app containing music by all artists).
Given that we have little or no usage information on the bulk of the tail apps, recommending them is a very difficult task. In order to ensure that the recommended apps have a certain amount of support, for this study we limited our app population to apps with more than 20 users. This reduces the number of apps from 55,020 to 7,304. Even though this pruning process removed 87% of apps (or 98% if we include apps with no usage), it is noteworthy that only 9% of the total usage was thus eliminated from our modeling. Table 1 shows the size and density of the user-item matrices before and after our pruning. It shows that even after rejecting the bottom 87% of the apps, the GetJar* dataset is still much sparser relative to Netflix.

Figure 2: Cumulative distribution of items in terms of percentage of total usage; the curves can be viewed as the integral of the curves in Figure 1.

Dataset    Users     Items    Usages/Ratings   Density
GetJar     101,106   55,020   1.99M            0.04%
GetJar*    101,031   7,304    1.82M            0.25%
Netflix    480,189   17,770   100M             1.18%

Table 1: Size of the user-item matrices for the Netflix and GetJar datasets. GetJar* denotes the GetJar dataset including only apps that have been used by more than 20 users.
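For concreteness, the following Python sketch shows how the kind of pruning and density computation described above can be carried out on a sparse user-item matrix. The toy records, the tiny threshold, and the variable names are illustrative assumptions; this is not the pipeline used at GetJar (which applies the 20-user threshold to the full dataset).

    import numpy as np
    from scipy.sparse import csr_matrix

    # Toy usage records (user_id, app_id, days_used); illustrative only.
    usage = [(0, 0, 5), (0, 1, 1), (1, 0, 3), (2, 0, 7), (2, 2, 1), (3, 3, 2)]
    rows, cols, days = zip(*usage)
    R = csr_matrix((days, (rows, cols)), shape=(4, 5))

    # Keep only apps used by more than min_users distinct users
    # (the paper uses min_users = 20; a tiny threshold is used for the toy data).
    min_users = 1
    users_per_app = (R > 0).sum(axis=0).A1
    kept = np.flatnonzero(users_per_app > min_users)
    R_pruned = R[:, kept]

    # Density and share of usage retained, mirroring Table 1 and the 9% figure.
    density = R_pruned.nnz / (R_pruned.shape[0] * R_pruned.shape[1])
    usage_retained = R_pruned.sum() / R.sum()
    print(f"kept {kept.size} apps, density {density:.2%}, usage retained {usage_retained:.1%}")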
2.2 Usage versus ratings
Another difference between the GetJar dataset and the
Netflix dataset is that movie ratings are explicit feedback of interest, whereas days of usage is implicit feedback [11]. The benefit of an explicit rating system is that it is well-defined and standardized, thus generating a more accurate measurement of interest compared to implicit feedback such as days of usage. The latter can be influenced by a number of factors such as mood, unforeseen events, or logging errors. Furthermore, there is also a correlation between usage and category: we find that “social” apps are consistently the most
heavily used apps among nearly all users. This is because
“social” apps need to be used often in order to serve their
purpose, but apps in categories such as “productivity” are
seldom needed on a continuous basis. So while it is safe to
assume that a user enjoyed a movie that she rated highly
relative to one rated lowly, the same cannot be said for
a user that used a “social” app more than a “productivity”
app.
We choose not to use ratings because it has a number of
drawbacks in the mobile app domain. Most importantly, it
is very difficult to collect for a large number of users without forceful intervention. Furthermore, since users’ taste in
apps may change and many app developers frequently update their apps with new features or functionalities, ratings
may become obsolete in as little as one month. Finally, observing ratings on Google Play, we find they are polarized,
with the vast majority of ratings being either 1 or 5. This
is likely due to fragmentation of the Android platform (many manufacturers produce Android devices with varying hardware and operating-system tweaks, making it difficult for developers to test their apps on all devices, so apps often do not work as intended on some devices), resulting in most ratings being given based on whether the app worked (5) or not (1) for the user.
Due to the influence of the Netflix competition, most research in the recommendations community has been geared
toward rating prediction by means of minimizing root mean
square error (RMSE). However, Cremonesi et al. [3] reported that improving RMSE does not translate into improvement in accuracy for the top-N task. On the Netflix
and MovieLens datasets, the predictive accuracy of a naive most-popular list is comparable to that of sophisticated
algorithms optimized for RMSE. We tried the same using
the GetJar dataset but substituting days of usage for ratings, and found that algorithms optimized for RMSE actually performed far worse than a simple most popular list.
With that said, days of usage can still be used for neighborhood approaches, provided that there still exists some
correlation between it and interest. A part of this study is
to evaluate the usefulness of this metric. Thus, for our experiments, we used two versions of the user-item matrix. In
the first version, each cell represents the number of days the
app was used, and in the second, each cell is a binary indicator of usage during the observation period. We’d like to
see if the additional granularity provided by the days of usage will generate better recommendations than when using
a binary indicator.
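A minimal sketch of the two matrix variants, assuming a small dense days-of-usage matrix for illustration (our actual matrices are large and sparse):

    import numpy as np

    # Toy days-of-usage matrix (users x apps); real matrices are large and sparse.
    R_day = np.array([[5, 0, 1, 0],
                      [0, 3, 0, 0],
                      [2, 0, 0, 7]], dtype=float)

    # BIN variant: 1 if the app was used at all during the observation window.
    R_bin = (R_day > 0).astype(float)
    print(R_bin)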
Number of Common Users    0       1       2-10    11-20   >20
GetJar*                   83.2%   9.1%    6.6%    0.6%    0.6%
Netflix                   0.2%    0.4%    33.8%   22.2%   43.3%

Table 2: Breakdown of the number of common users for the GetJar and Netflix datasets. For n items, the total number of item pairs is (n^2 - n)/2.
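The statistic reported in Table 2 can be tabulated directly from a binary user-item matrix, since the matrix product B^T B counts, for every item pair, the users the two items share. The sketch below is illustrative only (toy matrix, arbitrary bucketing code); it is not the script used to produce the table.

    import numpy as np

    # Toy binary user-item matrix (users x items).
    B = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 1, 1, 0],
                  [1, 0, 0, 0]])

    co = B.T @ B                            # (i, j): number of users who used both i and j
    iu = np.triu_indices(B.shape[1], k=1)   # each unordered pair once: (n^2 - n)/2 pairs
    pair_counts = co[iu]

    buckets = {"0": (pair_counts == 0).mean(),
               "1": (pair_counts == 1).mean(),
               "2-10": ((pair_counts >= 2) & (pair_counts <= 10)).mean(),
               "11-20": ((pair_counts >= 11) & (pair_counts <= 20)).mean(),
               ">20": (pair_counts > 20).mean()}
    print(buckets)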
3. MODELS
Two common recommendation approaches in use today are those using memory-based models and latent factor models. Memory-based models leverage the neighborhood of items in user space or that of users in item space. A user-user or item-item similarity matrix is computed for pairs, and recommendations are generated based on these similarities. Latent factor models are more sophisticated approaches where the user-item matrix is decomposed via matrix factorization techniques such as Singular Value Decomposition (SVD). Latent factors are then extracted and used to generate predictions.
We evaluated both of the above approaches using our data. In addition, we developed a hybrid system using Principal Components Analysis (PCA), which we call Eigenapp. These three algorithms were also compared against a non-personalized baseline recommendation system that serves the most popular items.

3.1 Non-personalized models
Non-personalized models are those that serve the same list of items to all users. They commonly sort items by the number of purchases, profit margin, click-through rate (CTR), or other similar metrics. In this paper, our non-personalized baseline algorithm sorts items by popularity, where popularity is defined as the number of distinct users that have used the item during the observation period.

3.2 Memory-based models
There are two types of memory-based models: item-based and user-based. Item-based models find similarities between items, and for a given user they recommend items that are similar to items she already owns. User-based models find similarities between users, and for a given user they recommend items owned by her most similar users.
Computationally, item-based models are more scalable because there are usually far fewer items than users, as is the case in the mobile app space. In addition, there is research showing that item-based algorithms generally perform better than user-based algorithms [9, 14]. Hence, our memory-based model uses the item-based approach.
Two of the most common neighborhood similarity metrics in current use are the Pearson correlation coefficient and cosine similarity. The Pearson correlation coefficient is computed for a pair of items based on the set of users that have used both. Since the vast majority of our items reside in the long tail, many of those items are unlikely to share common users with most other items.
Table 2 presents the distribution of the number of common users in the GetJar and Netflix datasets. The table shows that 83.2% of item pairs in the GetJar dataset have zero users in common, whereas the same percentage for Netflix is 0.2%. For GetJar, more than 90% of item pairs have one or no common users. Thus it is impossible to compute correlations for these item pairs. In addition, the vast majority of the remaining item pairs share 10 or fewer users,
meaning that the sample correlation estimate is likely to be
inaccurate due to poor support. In contrast, the published
Netflix dataset has less than 1% of movie pairs sharing 1
or fewer common users and about 65% of movie pairs share
more than 10 common users. Since the Pearson correlation
coefficient is undefined for 90% of our item pairs, we will use
cosine similarity.
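To make the point concrete, the toy example below shows a tail item pair with no common users: the Pearson correlation over co-rating users has no support at all, while the cosine of the full columns is still defined (and simply 0). The vectors are invented for illustration.

    import numpy as np

    # Columns of a toy binary usage matrix for two tail items (rows are users).
    item_i = np.array([1, 0, 0, 1, 0], dtype=float)
    item_j = np.array([0, 0, 1, 0, 0], dtype=float)

    common = np.flatnonzero((item_i > 0) & (item_j > 0))
    print("common users:", common.size)   # 0, so Pearson over co-users has no support

    # Cosine over the full columns is always defined (here it is simply 0).
    cosine = item_i @ item_j / (np.linalg.norm(item_i) * np.linalg.norm(item_j))
    print("cosine:", cosine)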
Let R denote the m × n user-item matrix where m is the
number of users and n is the number of items. From R, we
compute an item-item similarity matrix S, whose (i, j) entry
is:
s_{i,j} = \frac{r_{*,i} \cdot r_{*,j}}{\|r_{*,i}\|_2 \, \|r_{*,j}\|_2}    (1)
where r_{*,i} and r_{*,j} are the ith and jth columns respectively of R. Cosine similarity does not require items to share common users; in such a case it will simply produce a similarity of 0. However, it still suffers from low overlap support: the closest neighbors for a less popular item will often occur by coincidence, simply because they are the only ones that produced non-zero similarity scores.
Using S, the affinity t_{u,i} between user u and item i is the sum of similarities between i and the items used by u:

t_{u,i} = \sum_{j \in I_u} s_{i,j}    (2)

where I_u is the set of items used by u. For a given user, all items are sorted by their affinity score in order to produce a top-N list. (In equation (2), users that use a greater number of items will have more summands, but since we are only interested in the relative order of items for a given user, the varying number of summands does not pose a problem.)
We made two slight modifications to the above method that produced better results. First, the item-item similarity scores s_{i,j} were normalized before being used in equation (2). Deshpande et al. [4] suggested a normalization such that the similarities sum to 1. However, we found that normalizing using z-scores worked much better for the GetJar dataset, producing the asymmetric similarity:

\tilde{s}_{i,j} = \frac{s_{i,j} - \bar{s}_{*,j}}{\sigma_{s_{*,j}}}    (3)

where \bar{s}_{*,j} is the average similarity to item j and \sigma_{s_{*,j}} is the standard deviation of similarities to item j. Second, for each item candidate i, instead of summing over all items in I_u, we only considered the l nearest items, which are those with the greatest normalized similarity scores to i. This has the effect of reducing noise by discarding items weakly related to the given i. For the GetJar dataset, we find that setting l = 5 seemed to work the best.

3.3 Latent factor models
Latent factor models work by factorizing the user-item matrix R into two lower rank matrices: user factors and item factors. These models are often used for rating prediction, where a rating r_{u,i} for user u on item i can be predicted by taking the inner product of their respective vectors in the user factors and item factors. User bias and item bias are commonly removed by subtracting the row and column means from R prior to the factorization step. The biases are added back onto the inner product to generate the final prediction.
Examples of this approach include [5, 8, 12, 13]. We tried [5] and [13] by substituting days of usage for ratings and then sorting the predictions to generate a top-N recommended list. The results were by far the worst of all algorithms, for reasons explained in Section 2.2. We expect similar results for other rating-prediction-based algorithms.
The only latent factor top-N algorithm we are aware of is PureSVD [3]. The algorithm works by replacing all missing values (those with no ratings) in R with 0, and then factorizing R via SVD:

R = U \Sigma V^T    (4)

The affinity between user u and item i can then be computed by:

t_{u,i} = r_{u,*} \cdot Q \cdot q_i^T    (5)

where Q stands for the top k singular vectors extracted from V and q_i is the row in Q corresponding to item i. Note that t_{u,i} is simply an association measure and not a predicted rating. A top-N list can then be made for user u by selecting the N items with the highest affinity score to u.
PureSVD is the only latent factor algorithm we evaluated that was able to generate reasonable recommendations. The main reason for this is that, unlike the other algorithms, PureSVD is optimized not for RMSE-based rating prediction but rather for the relative ordering of items produced by the association scores.

3.4 Eigenapp model
Of the two previously mentioned approaches, memory-based models yielded far better results despite only having neighborhoods for popular items. We want to improve the result of memory-based models by borrowing ideas from the latent factor models. Along these lines, we used dimensionality reduction techniques to extract meaningful features from the items and then applied memory-based techniques to generate recommendations in this reduced space.
Our neighborhood is still item-based, but items are now represented using features instead of users. Similar to [3], we also replace all missing values in R with 0. Given the large disparity in app frequencies, we normalized the item vectors to prevent the features from being based only on popular items. This is done by normalizing each column of R to have zero mean and unit length: \sum_u r_{u,i} = 0 and \sum_u r_{u,i}^2 = 1. We denote this new normalized user-item matrix as \hat{R} and apply PCA to \hat{R} for feature extraction.
PCA is performed via eigen decomposition of the covariance matrix C. C is computed by first calculating the mean item vector b, with b_u = \frac{1}{n} \sum_i \hat{r}_{u,i}, then removing the mean by forming the matrix A, where each cell a_{u,i} = \hat{r}_{u,i} - b_u, and finally computing C = A A^T. Note that C is an m × m matrix, with the number of users m likely to be a very large number. This makes eigen decomposition practically impossible in time and space. Observing that the number of items n is likely to be much lower, we used the same procedure as in Eigenface [16] to optimize the process. The procedure works by first conducting eigen decomposition on the n × n matrix A^T A, obtaining eigenvectors v_j^* and eigenvalues \lambda_j^* such that for each j:

A^T A v_j^* = \lambda_j^* v_j^*    (6)

Multiplying both sides by A, we get:

A A^T (A v_j^*) = \lambda_j^* (A v_j^*)    (7)
Figure 3: (a) Precision-recall curves and (b) Recall at N curves using all users in the test set.
We see that the vectors v_j = A v_j^* are the eigenvectors of C. From there, we normalize each v_j to length one and keep only the k eigenvectors with the highest corresponding eigenvalues. These eigenvectors represent the dimensions with the largest variances, or the dimensions that can best differentiate the items. Alternatively, these eigenvectors can also be viewed as item features: items with similar projected values on a particular eigenvector are likely to be similar in certain attributes. We will denote these eigenvectors as eigenapps. Finally, we can project all the items onto the reduced eigenspace by D = V^T A, where V is the matrix whose columns are the k retained eigenvectors. D is a k × n matrix, where each column contains the projected values of an item onto each of the eigenapps. These values can be viewed as the coefficients or weights of the eigenapps for that item. Observing several rows of D, we find that apps with high projected values on a given eigenapp are often of similar types. This was useful in
preliminary validation showing that the Eigenapp approach
indeed captured latent item features.
Item-item similarities can be computed using equation (1)
except that we use D instead of R. Since D is dense,
similarity scores will likely be non-zero for all item pairs.
Once the item-item similarity matrix S has been computed,
the remainder of the algorithm is identical to the memorybased algorithm described in Section 3.2. We find that
the computed neighborhood in the reduced eigenspace is of
much better quality compared to the one computed using
the memory-based methods in the non-reduced space. However, neighborhood quality is still better for popular items
than for less popular items, likely due to better support. We
also find that the quality of neighborhood improves when we
increase the number of eigenapps used, and that the neighborhood becomes relatively stable after k = 200.
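The following sketch summarizes our reading of the Eigenapp pipeline (Sections 3.2 and 3.4) in Python/NumPy: column-normalize the usage matrix, extract the top-k principal directions, project the items into the reduced space, and score candidates with the z-score-normalized, l-nearest-neighbor sum of equations (1)-(3). The toy data, the parameter values, and the use of an SVD in place of the explicit A^T A eigen-decomposition are illustrative choices, not the implementation used in our system.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k, l, topN = 200, 50, 10, 5, 10          # users, items, eigenapps, neighbors, list size
    R = (rng.random((m, n)) < 0.05).astype(float)  # toy binary usage matrix

    # Normalize each item column to zero mean and unit length (Section 3.4).
    Rn = R - R.mean(axis=0)
    norms = np.linalg.norm(Rn, axis=0)
    norms[norms == 0] = 1.0
    Rn = Rn / norms

    # Remove the mean item vector; the paper eigen-decomposes A A^T via the smaller
    # A^T A (the Eigenface trick). An SVD of A yields the same leading eigenvectors.
    A = Rn - Rn.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    D = U[:, :k].T @ A                             # k x n projections ("eigenapp" weights)

    # Item-item cosine similarities in the reduced space (equation (1) applied to D),
    # then z-score normalization per column (equation (3)).
    Dn = D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    S_item = Dn.T @ Dn
    S_norm = (S_item - S_item.mean(axis=0)) / np.maximum(S_item.std(axis=0), 1e-12)

    def recommend(user_row, sims, l=5, topN=10):
        """Score each unowned item by the sum of its l largest normalized
        similarities to the user's items (equations (2)-(3)); return the top N."""
        owned = np.flatnonzero(user_row)
        scores = np.full(sims.shape[0], -np.inf)
        for i in range(sims.shape[0]):
            if owned.size == 0 or i in owned:
                continue
            scores[i] = np.sort(sims[i, owned])[::-1][:l].sum()
        return np.argsort(scores)[::-1][:topN]

    print(recommend(R[0], S_norm, l=l, topN=topN))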
The computational complexity of this algorithm, up to generating S, is O(mn^2). Using the current GetJar dataset, that process took about 11 minutes on an Intel Core i7 machine using the Eigen library (http://eigen.tuxfamily.org). However, since the computation of S is the offline phase of the recommender system, and the number of apps with some minimum amount of usage is unlikely to increase significantly with more users, we do not believe this will pose a problem.
Eigenapp is similar to another PCA-based algorithm, Eigentaste [7]. The main difference is that Eigentaste, which was evaluated on the Jester joke dataset (http://eigentaste.berkeley.edu/dataset), requires a gauge item set where every user must have rated every item in the gauge set. Coming up with such a gauge set is impossible for our application, much less one that is representative. In addition, Eigentaste uses a user-based neighborhood approach to generate recommendations, whereas Eigenapp utilizes item-based neighborhoods.

4. EVALUATION
We evaluated the four types of models from Section 3: Non-personalized (POP), Memory-based (MEM), PureSVD and Eigenapp, using the GetJar dataset. The experiment is set up by randomly dividing the users into five equal-sized groups. Four of the groups are used for training, and the remaining one for evaluation. Using the training set, we compute the item-item similarity matrix S for MEM and Eigenapp, the item factor matrix Q for PureSVD, and the list of most popular items for POP. The number of eigenvectors used for Eigenapp and the number of singular vectors used for PureSVD are both 300. For each user in the test set, we sort the apps by install time. We feed the first M − 1 apps to the model to generate its recommendation list of N apps. Then we check whether the left-out app is in the recommended list (all algorithms exclude from their recommendation list the M − 1 apps known to already be installed for the given user). This procedure is repeated on all 5 possible ways of dividing the user groups, allowing every group to be used as the evaluation group once, so that a recommendation list exists for every user.
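A sketch of this protocol, under the assumption that each user's app list is already ordered by install time, is given below; the hit indicators it accumulates are the h_u values used for the precision and recall definitions in Section 4.1 below. The helper names and the toy most-popular model are hypothetical.

    import numpy as np
    from collections import Counter

    def evaluate_hit_rate(user_apps, fit_fn, topN=10, n_folds=5, seed=0):
        """Leave-one-out sketch of the protocol above: split users into folds,
        hide each test user's most recently installed app, feed the rest to the
        model, and record a hit if the hidden app appears in the top-N list.
        The mean hit indicator equals recall(N) in equation (9)."""
        rng = np.random.default_rng(seed)
        users = np.array(list(user_apps))
        rng.shuffle(users)
        hits = total = 0
        for fold in np.array_split(users, n_folds):
            test = set(fold.tolist())
            rec = fit_fn([u for u in users if u not in test])
            for u in test:
                apps = user_apps[u]
                if len(apps) < 2:
                    continue
                seen, held_out = apps[:-1], apps[-1]
                recs = rec(seen, topN=topN, exclude=set(seen))
                hits += int(held_out in recs)
                total += 1
        return hits / max(total, 1)

    # Toy check with a hypothetical "most popular" model.
    data = {0: [1, 2, 3], 1: [2, 3, 4], 2: [1, 3, 5], 3: [2, 4, 5], 4: [1, 2, 5]}
    def fit(train_users):
        counts = Counter(a for u in train_users for a in data[u])
        ranked = [a for a, _ in counts.most_common()]
        return lambda seen, topN, exclude: [a for a in ranked if a not in exclude][:topN]
    print(evaluate_hit_rate(data, fit, topN=3))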
Figure 4: (a) Precision-recall curves and (b) Recall at N curves after removing the 100 most popular items.
Two forms of the user-item matrix R were considered for the experiments, as described in Section 2.2. The first version, using days of usage, will be denoted as DAY, and the binarized version will be denoted as BIN.
Accuracy is the first evaluation criterion we used, because we want our recommendations to be relevant to users' interests and preferences. However, user satisfaction is not solely dependent on accuracy [10]. In particular, given the dominance of the popular apps in this domain, it is important to expose apps in the tail. With that in mind, we also evaluated the accuracy of the models in recommending tail apps, and the variety of the apps recommended.

4.1 Accuracy
The accuracies of the models were evaluated by the standard precision-recall methodology. Since we have only one relevant item to be predicted for each user (the left-out app), we set h_u equal to 1 if the relevant item is in the top-N list for user u and 0 otherwise. Precision and recall at each N are computed by:

precision(N) = \frac{\sum_{u=1}^{m} h_u}{m \cdot N}    (8)

recall(N) = \frac{\sum_{u=1}^{m} h_u}{m}    (9)

where m is the number of users.
Figure 3(a) illustrates the precision-recall curves for the algorithms. As we can see, the best performer was MEM, despite using an item-item similarity matrix consisting mostly of zeros. A close second was Eigenapp, followed by POP and PureSVD. Figure 3(b) illustrates the recall at each N, up to N = 50. This figure shows the percentage of users whose missing app was identified in the top-N. When N is 10, MEM identified the missing app for about 11% of users, Eigenapp for about 10% of users, and POP and PureSVD for about 7% and 4% of users respectively.
The two types of user-item matrix (BIN and DAY) made little difference in the global accuracy of any of the three algorithms, indicating that the additional signal contributed by the number of days of usage does not outweigh its inaccuracies.

4.2 Accuracy of less popular items
Given the overwhelming exposure popular apps receive today in the Android ecosystem, many users will use them simply because those are the only apps they know. Thus using a popular app may not be a strong indicator of interest relative to less popular apps. In order to measure precision and recall on the “tail”, we redrew the precision-recall curves by excluding the 100 most popular apps from the recommended list of each user. Note, therefore, that h_u will always be 0 for users whose relevant items are among the 100 most popular apps; those users were thus removed for this experiment.
Figure 4(a) shows the precision-recall curves after removing the 100 most popular items. The figure shows that Eigenapp has the highest accuracy for this tail subset. MEM is now second, followed by PureSVD and POP. Recall at N, shown in Figure 4(b), shows a similar picture, but it is worth noting that relative to Figure 3(b), recall dropped for every algorithm with the exception of PureSVD. This shows that it is more difficult to recommend relevant tail apps than head apps.
Using the two types of user-item matrix (BIN and DAY) still achieved similar performance for all three algorithms, but it appears Eigenapp and PureSVD yielded slightly better results using BIN compared to DAY.
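For reference, equations (8) and (9) reduce to the following small helper once the hit indicators h_u have been collected (applicable both to the full evaluation of Section 4.1 and to the tail-only evaluation above):

    import numpy as np

    def precision_recall_at_n(hits, N):
        """hits: array of h_u indicators (1 if the left-out app was in the user's
        top-N list, else 0). Implements equations (8) and (9)."""
        m = len(hits)
        recall = np.sum(hits) / m
        precision = np.sum(hits) / (m * N)
        return precision, recall

    print(precision_recall_at_n(np.array([1, 0, 0, 1, 0]), N=10))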
4.3 Presentation
The impression that the recommended list makes on the user is also important to satisfaction [10]. An artifact of our methodology for predicting the left-out item is
that we penalize algorithms for predicting items that the
user may have liked had she known about them. Since it
is impossible for us to know which of the “irrelevant” items
(those that do not correspond to the left out item) in the
top-N are potentially interesting ones, we can only judge the
diversity of items that are presented. In this study, we are
interested in recommending a diverse list of apps from across the whole popularity spectrum.
Algorithm   1-50    51-100   101-500   501-1000   >1000
POP         100%    0        0         0          0
MM BIN      85%     5%       6%        2%         2%
MM DAY      80%     6%       8%        3%         4%
PS BIN      <1%     <1%      74%       22%        5%
PS DAY      1%      5%       54%       29%        11%
EA BIN      34%     18%      27%       8%         14%
EA DAY      34%     18%      23%       7%         18%

Table 3: Distribution of recommended apps in terms of app popularity rank for the different algorithms (MM stands for Memory, PS stands for PureSVD, and EA stands for Eigenapp) with N set to 10. The frequencies are computed by collecting all m · N recommended apps, where m is the number of users, and finding the percentage of apps that lie in each of the stated popularity ranges.

Figure 5: Entropy of the recommended list for all users with N ranging from 1 to 50.
Table 3 shows the popularity ranking of the recommended
apps when N is set to 10. As expected, all recommended
apps by POP are from the 50 most popular apps. Of the
remaining algorithms, more than 80% of apps recommended
by MEM are from the 50 most popular. This is due to less
popular apps not having many non-zero similarities in the
high dimensional neighborhood used by MEM. We also see
that PureSVD recommends almost no apps from the 100
most popular, particularly when using BIN, which explains
why its accuracy increased significantly when we only considered the less popular items.
From a presentation perspective, it is better to show a
good mixture of items from the head and the tail. Only recommending items at the head is poor for discovery because
they are items that users most likely already know. Only
recommending items at the tail may not be good because
users will not recognize any of the recommended items and
so may lose trust in the recommender [15]. With that said,
it appears that Eigenapp performed the best under this measure, as it recommends items across all popularity levels.
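The per-bucket percentages of Table 3 can be tabulated as in the sketch below, assuming a precomputed popularity rank for every app; the function and argument names are illustrative.

    import numpy as np

    def popularity_profile(rec_lists, popularity_rank, edges=(50, 100, 500, 1000)):
        """Collect all m*N recommended apps and report the fraction falling into
        each popularity-rank bucket, as in Table 3. rec_lists is a list of top-N
        lists; popularity_rank maps app id -> rank (1 = most popular)."""
        ranks = np.array([popularity_rank[a] for recs in rec_lists for a in recs])
        bins = [0, *edges, np.inf]
        labels = ["1-50", "51-100", "101-500", "501-1000", ">1000"]
        return {lab: float(np.mean((ranks > lo) & (ranks <= hi)))
                for lab, lo, hi in zip(labels, bins[:-1], bins[1:])}

    print(popularity_profile([[1, 7, 300]], {1: 1, 7: 60, 300: 250}))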
4.4 Variety
In addition to higher exposure of less popular items, we would also like the exposure to be spread out rather than concentrated on a few items. We measure variety by calculating the entropy of the recommended results, which can be computed as follows:

H = -\sum_i p(i) \log p(i)    (10)

where p(i) is the relative frequency of item i among the top-N recommended items for all users (m · N total items).
Figure 5 plots the entropy of the recommendations for the different models with N ranging from 1 to 50. We see that Eigenapp and PureSVD have the largest entropy, meaning that the tops of their recommendation lists are comprised of a wider variety of items. MEM has a far lower entropy, which we can expect since it largely recommends only the popular items. And as expected, POP has the smallest entropy. It is also worth noting that as N increases from 1 to 5, there is a large increase in variety for MEM, indicating that its variety is particularly poor in the top 5 of the recommendation list, exactly the region to which users pay the most attention. We verified this by finding that for MEM, the most popular app is the first item recommended to approximately 40% of users (or 80% of the users that do not already have that item).
This finding is expected given the result presented in Section 4.3. Algorithms that recommend tail apps are expected to have a larger variety because the tail is where the vast majority of apps reside. However, Eigenapp actually recommended fewer tail apps than PureSVD, and yet it had slightly higher entropy, indicating that the exposure of tail apps by Eigenapp is more evenly distributed relative to that of PureSVD.
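A minimal sketch of the entropy measure of equation (10), assuming the natural logarithm (the base is not material for comparing models):

    import numpy as np
    from collections import Counter

    def recommendation_entropy(rec_lists):
        """Entropy of the pooled top-N recommendations (equation (10)); p(i) is the
        relative frequency of item i among all m*N recommended slots."""
        counts = Counter(a for recs in rec_lists for a in recs)
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()
        return float(-(p * np.log(p)).sum())

    print(recommendation_entropy([[1, 2], [1, 3]]))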
5. DISCUSSION
The recent Netflix competition has significantly influenced the recommendation community towards the usage of RMSE-based evaluation metrics. We were unable to find many algorithms expressly tailored for the top-N items task, with the exception of [3]. In addition, we find most recent research to be focused on the movie domain, using the Netflix or MovieLens datasets, where interest is explicitly expressed through ratings. It is questionable how well these models would translate into other domains, particularly those where explicit feedback is not available.
Among the models we evaluated, neighborhood approaches are found to generate accurate recommendations. In particular, traditional memory-based models operating in the high-dimensional user space performed surprisingly well when evaluated using precision-recall. However, further analysis showed that this was due to the presentation of a small set of popular items to almost every user, albeit in different orders. The high kurtosis of the app usage distribution benefits algorithms that concentrate on a small set of popular items when evaluated under precision-recall, although such recommenders are unlikely to generate a pleasing experience for the users.
The Eigenapp model performed very well on the GetJar dataset, particularly in its ability to recommend a diverse list of apps. This is because all item vectors are normalized prior to applying PCA, thus usage of less popular apps can be captured by the top eigenvectors. That makes it possible for the less popular apps to be among the closest neighbors of the popular apps. This is particularly important for exposure of the less popular apps, because given the dominance of the popular apps, only apps that are close to one of the popular apps can make frequent appearances at the top of the recommended lists. Using traditional memory-based models, the popular apps form a tight cluster (relative to the less popular apps) in their neighborhoods, thus making it difficult for less popular apps to surface to the top of the recommended lists for many users.
6. CONCLUSION
With increasing numbers of people switching to smart
phones, the mobile application space is an emerging domain
for recommendation systems. Due to the wide disparity in
resources among app publishers, the apps that large companies develop receive far more exposure than those developed
by individual developers. This results in app usage being
dominated by a few popular apps. The problem is further
exacerbated by existing app stores using non-personalized
ranking mechanisms. While that approach may help most
users find high quality and essential apps quickly, it is less
effective in recommending apps to users who are in an exploratory mode.
In this study, we used app usage as our signal of user interest. Given
the characteristics of this data, we found that traditional
memory-based approaches heavily favor popular apps contrary to our mission. On the other hand, latent factor
models that were developed based on the Netflix data performed quite poorly accuracy-wise. We find that the Eigenapp model performed the best both in accuracy and in the promotion of less well-known apps in the tail of our dataset.
A system using the Eigenapp model is currently in internal
trials at GetJar. It presents a personalized app list to users
along with a non-personalized most popular list. The first
list is elicited when users are in an exploratory mode and
the second when they are looking for the most sought-after
apps. We plan to open this system for general use in the
second half of 2012. Simultaneously, we are also working
continuously to improve our system.
A limitation of the current model is that it includes only
apps with a certain minimum of usage, a condition that most
apps do not satisfy. While the set of apps included probably
contains most of the potentially interesting ones, it is possible that we removed some interesting niche apps, or high
quality apps by individual developers that were not exposed
due to lack of marketing. The latter case is particularly
important to us. We are currently exploring content-based
models that extract useful features from app metadata and
plan to combine the results of the collaborative and content-based approaches in future work.
7. ACKNOWLEDGEMENTS
The authors would like to thank Anand Venkataraman for guidance, edits and help with revisions. Chris Dury provided valuable feedback and Sunil Yarram helped during various stages of data preparation.

8. REFERENCES
[1] C. Anderson. The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006.
[2] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of KDD Cup and Workshop, pages 3–6, 2007.
[3] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pages 39–46, New York, NY, USA, 2010. ACM.
[4] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143–177, Jan. 2004.
[5] S. Funk. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.
[6] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Commun. ACM, 35(12):61–70, Dec. 1992.
[7] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr., 4(2):133–151, July 2001.
[8] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 426–434, New York, NY, USA, 2008. ACM.
[9] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7:76–80, 2003.
[10] S. M. McNee, J. Riedl, and J. A. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI '06 Extended Abstracts on Human Factors in Computing Systems, CHI EA '06, pages 1097–1101, New York, NY, USA, 2006. ACM.
[11] D. W. Oard and J. Kim. Implicit feedback for recommender systems. In Proceedings of the AAAI Workshop on Recommender Systems, pages 81–83, 1998.
[12] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, pages 39–42, 2007.
[13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender system – a case study. In Proceedings of the ACM WebKDD Workshop, 2000.
[14] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, WWW '01, pages 285–295, New York, NY, USA, 2001. ACM.
[15] G. Shani and A. Gunawardana. Evaluating recommendation systems. Recommender Systems Handbook, pages 257–297, 2011.
[16] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71–86, Jan. 1991.
[17] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.