Your Apps Are What You Are:
Personification Through Installed Smartphone Apps

Suranga Seneviratne∗†, Aruna Seneviratne∗†, Prasant Mohapatra‡
∗ School of EET, University of New South Wales, † NICTA, Australia
Email: suranga.seneviratne,[email protected]
‡ Department of Computer Science, University of California, Davis
‡ Email: [email protected]
Abstract—The smartphone ecosystem is based on apps developed by third-party app developers, each providing a specific service to the user. Apps need to obtain permissions from the user to access private information about the user in order to provide the service. A privacy-conscious user can decide to restrict access to this private information. However, other information that is considered non-private, such as the device ID, phone model and the list of installed apps, is accessible to app developers without the consent of the user, and some apps have been reported to collect this information. In this paper, we investigate what can be inferred from one piece of non-private information, namely the installed app list, by using a large dataset of users' app lists. We show that access to the app list enables inferring information about users, using well-known clustering techniques and a simple set of rules. Finally, we show that the inferred information can be used to provide useful services such as recommendation systems. However, it also leads to a loss of privacy, which can be harmful.
I. INTRODUCTION
Smartphone usage is increasing and is predicted to reach 50% of the global mobile device market by 2017 [1]. Smartphones allow third parties to develop apps that provide different services to users. Third-party app developers publish these apps in the app markets of the relevant mobile operating system, and users can download and install them on their smartphones. The Android and iOS operating systems covered approximately 91% of the global smartphone market in Q1 2013 [2]. The official markets for Android and iOS were each reported [3] to have more than 800,000 apps in May 2013, and approximately 51 billion app downloads were predicted for the year 2013.
Users decide to install apps depending on their requirements. For example, a user who commutes by train is likely to install an app that provides train timetables. Thus, intuitively, the apps that a user has installed are potentially good indicators of their interests, lifestyle, etc.
In Android environments, apps need explicit permission to access personal data such as location, call logs, SMS and social network profiles. Permission to access this information is requested at the time of installation, and if the user does not wish to grant the permission, he/she can decide not to install that app. Further, at any point in time users can check the permissions given to an app and decide to keep or uninstall the app. In contrast, the list of installed apps on a user's smartphone can be obtained without user permission by any installed app. It has been reported [4] that some ad libraries embed this feature and collect information about installed apps. Though it is not as straightforward as in Android, the second most popular mobile operating system, iOS, also allows apps to obtain the list of installed apps [5].
This raises two fundamental questions. First, if a third party could obtain the list of installed apps of individual users, what can be inferred from this information? Second, what can the inferred information be used for? This paper answers the first question by analyzing the app lists of over 8000 Android users and verifying the findings using 34 volunteers. We then attempt to answer the second question by investigating the possibility of "personifying" the user using the information that can be inferred from the app lists.
We make the following contributions in this paper.
• We present, to the best of our knowledge, the first large scale study of what apps people install and their associated basic characteristics.
• We show that it is possible to accurately infer personal information that can be used to personify users.
• We evaluate the viability of the proposed methodology using a limited but significant group of volunteers.
• We highlight that personification of users can potentially be used for a number of beneficial purposes but does lead to a loss of user privacy.
The rest of the paper is organized as follows. In Section II we present the methodology used to collect and analyze the data, and in Section III we provide basic statistics of our datasets. Section IV shows the possibility of stereotypical user profiling based on higher-level app categories. In Section V we present our decision rule engine, followed by the results in Section VI. Section VII discusses the implications of the findings. Related work is presented in Section VIII, and Section IX concludes the paper.
II. DATA COLLECTION AND PROCESSING
In order to gather the lists of apps downloaded and installed on user devices, we adopted two different methods. First, we crawled two popular Social App Discovery sites, Appbrain [6] and Appaware [7], where users publicly share their installed app lists. Second, we developed and distributed an Android app called Apptronomy1 among a group of volunteers to collect the lists of their installed apps. In addition, we manually collected the app lists of 16 volunteer iPhone users. From these users we collected basic demographic information through a brief questionnaire. We use the dataset collected from the volunteers to evaluate our decision rules for inferring personal information in Section VI.

TABLE I: Summary of the datasets: Before pre-processing

                               Appbrain    Appaware
# of users                         8653         841
# of apps                         85770       24254
# of installations               705004       94024
Average # of apps/user               81         112
Median # of apps/user                51          75
Table I provides a summary of the characteristics of our crawled datasets. Appbrain and Appaware together provide the app lists of more than 9000 users and include well over 100,000 apps. This dataset, however, contains pre-installed apps as well as some "outlier users" who have an abnormally high or low number of installed apps. To eliminate the bias that the pre-installed apps and outlier users introduce, we removed them from the dataset as described below.
• Preinstalled apps: Our primary objective is to identify user characteristics based on the apps they install. Thus, it is necessary to remove pre-installed apps such as Facebook, Gmail and YouTube. This was done by generating a list of pre-installed apps by examining 5 popular phone models from 5 different manufacturers covering 3 major Android versions. We identified 53 pre-installed apps, which were removed from the dataset.
• Outliers: We believe that, in the Social App Discovery services we crawled, users with a very low number of apps are those who have not synched their phones with the service, while users with an excessively high number of apps are potentially those who promote apps, such as marketers and app developers. To remove the outlier users, we first used a statistical outlier removal method, the interquartile range (IQR) with parameter 1.5 [8]. However, this method did not identify a lower limit, so we removed the lists smaller than the 10th percentile to eliminate users with low numbers of installed apps. For the upper bound we used the value given by the IQR rule (see the sketch after this list). Since the volunteers who provided their app lists are known, we did not apply this correction to the Apptronomy and iOS datasets.
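As an illustration, the outlier filtering described above can be sketched as follows. This is a minimal Python sketch under our own assumptions about the data representation; the paper does not provide code.

```python
import numpy as np

def filter_outlier_users(app_counts):
    """Return a boolean mask of users to keep, based on their app counts.

    app_counts: 1-D array with the number of installed apps per user.
    """
    counts = np.asarray(app_counts, dtype=float)

    # Tukey's rule: flag values beyond 1.5 * IQR from the quartiles.
    q1, q3 = np.percentile(counts, [25, 75])
    iqr = q3 - q1
    upper = q3 + 1.5 * iqr          # upper bound from the IQR rule
    lower = q1 - 1.5 * iqr          # may be useless for skewed data

    # When the IQR rule gives no effective lower limit, fall back to the
    # 10th percentile, mirroring the procedure described in the paper.
    if lower <= counts.min():
        lower = np.percentile(counts, 10)

    return (counts >= lower) & (counts <= upper)
```

For example, `keep = filter_outlier_users(apps_per_user)` yields the subset of users retained after pre-processing.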
The characteristics of the datasets after the removal of the pre-installed apps and the outlier users are shown in Table II.
The categories of the apps installed by the users were then identified by querying Google Playstore, using the 30 pre-defined categories used when publishing apps in Google Playstore. In cases where an app could not be found in Google Playstore, we queried alternative app markets to find the category.

1 https://play.google.com/store/apps/details?id=com.test.apptronomy
TABLE II: Summary of the datasets: After pre-processing

                               Appbrain    Appaware
# of users                         7360         708
# of apps                         68115       18764
# of installations               474433       64327
Average # of apps/user               64          91
Median # of apps/user                54          79
For the apps found in Google Playstore, we also collected whether the app was free or paid, and the price of the paid apps.
III. BASIC CHARACTERISTICS
In this section, we describe the general characteristics of the apps downloaded and installed by the users, which provide insights into the number of apps users have and the diversity of the user-installed app space by category and price. These characteristics help us scope the size and trends of the datasets we intend to analyze, and identify potential biases and similarities.
A. App categories
To determine the distinctions between the installed sets of apps, as well as to provide an overall profile of the apps being installed on devices, we first compared the composition of the corpus of apps by category for both datasets, as shown in Figure 1a. We found that the top app categories were Tools, Entertainment and Personalization; these three categories accounted for approximately 33% of the total apps. The Comics, Libraries and Medical categories have the smallest numbers of apps. According to Figure 1a, the percentage of apps in each category is approximately the same for both datasets. For example, the Tools category holds around 13% of the apps in both datasets, as shown in the stacked Tools column in Figure 1a.
B. App installations
The actual percentage of apps of a given category that a user installs can differ from the percentage of apps available in that category in the app market. Figure 1a shows the percentage of apps available in each category, while Figure 1b shows the percentage of apps installed on user devices. As can be seen from Figure 1b, the top category of installed apps is again Tools. In contrast, apps in the Productivity and Communication categories have a higher share of installations despite these two categories having a lower number of apps available in the app market, as shown in Figure 1a. Similarly, although more apps are available in the Entertainment and Personalization categories in the app market, the actual percentages of installations are lower.
C. Number of apps per user
To personify a user using their apps, the app list of each user needs to contain a statistically significant number of apps. To determine whether this is the case, we analyzed the datasets to determine the number of installed apps per user.
[Fig. 1: Apps by availability and installations. (a) App availability by category; (b) app installations by category. x-axis: app category; y-axis: % of total apps / % of total installations, for Appaware and Appbrain.]

[Fig. 2: Distribution of the number of apps per user. (a) CDF before pre-processing; (b) CDF after pre-processing. x-axis: number of apps per user; y-axis: cumulative probability, for Appaware and Appbrain.]
Figure 2 plots the CDF of the number of apps per user before and after pre-processing, which shows the effect of removing the outlier users and the pre-installed apps.
As can be seen from Figure 2b, 50% of the users have installed more than 50 apps in both datasets. This corroborates a Nielsen report [9], which found that in 2012 an average US smartphone user had around 41 apps, compared to 32 in the previous year. Furthermore, it shows that there is a sufficient number of user-installed apps that can possibly be used for personification.
One possible reason for the difference between the Appbrain and Appaware data is the length of time the sites have been in operation: the Appbrain site was launched in 2010, and the Appaware site was launched in 2012. The older the site, the higher the possibility of having app lists which have not been synched for a long period of time. Furthermore, the users of the newer site could possibly be more app-savvy, and/or the site has not yet had sufficient time to attract general smartphone users.
D. Deleted Apps
We found that a number of apps present in user app lists have been deleted from Google Playstore: 3.7% in Appaware and 29.8% in Appbrain. Apps get removed from Google Playstore for various reasons, such as the developer discontinuing the app or Google removing apps due to malicious behavior. We believe the reason Appbrain lists have a significantly higher share of removed apps is that the site has been in operation for a longer period of time. This is consistent with the observation made by d'Heureuse et al. [10] that approximately 8% of the total apps were deleted from Google Playstore over a period of three months. The potential implications of deleted apps still being on user phones are discussed in Section VII.
E. Free vs. paid apps and the cost of paid apps
To study user behavior with respect to paid apps, we examined the numbers of paid and free apps in the users' app lists. Figure 3a shows the number of paid apps in user app lists. It shows that 19% of the users in the Appaware dataset and 38% of the users in the Appbrain dataset did not have any paid apps. In general, the Appaware dataset contains users who have more paid apps than the Appbrain dataset users. Again, we believe the reason for the higher number of paid apps in the Appaware dataset is its age and its potentially more app-savvy users.
For the majority of the users in both datasets, the total cost of the paid apps was less than AU$10, as shown in Figure 3c. Furthermore, Figure 3b shows that more than 50% of the paid apps cost less than AU$2.
Figures 4a and 4b compare the percentages of free and paid app installations by category. While the percentages for the Tools and Productivity categories remain the same, users have more paid apps in the Arcade games, Brain games, Music and Photography categories. Further, users install more free apps than paid apps in the Communication and Entertainment categories. A summary of the findings is shown in Table III.
[Fig. 3: Statistics of paid apps. (a) CDF of the number of paid apps per user; (b) CDF of the price of apps (AU$); (c) histogram of the cost of the app list (AU$), for Appaware and Appbrain.]

[Fig. 4: Free and paid apps by installation. (a) Free app installations by category; (b) paid app installations by category. x-axis: app category; y-axis: % of free/paid app installations, for Appaware and Appbrain.]

TABLE III: Free apps vs. paid apps

                                          Appbrain    Appaware
# of apps                                    68115       18764
# of deleted apps                            20294         694
# of free apps                               42494       15262
% of free apps/available apps                  89%         84%
# of installations                          474433       64327
# of deleted app installations               91369        1411
# of free app installations                 347992       53367
% of free app insta./availa. app insta.        91%         85%
Average price of a paid app (AU$)             3.16        3.33
Median price of a paid app (AU$)              1.94        1.99

IV. USER CLUSTERING

To investigate the possibility of mapping users onto stereotypical profiles using their app lists, we represented each user in a 30-dimensional space, where each dimension is the percentage of apps the user has installed of a particular category. That is, the i-th user Ui is represented as Ui = <p_i1, p_i2, ..., p_i30>, where p_ik for k = 1..30 is the percentage of apps of category k in the i-th user's app list:

p_ik = (no. of user-installed apps in category k / total number of user-installed apps) x 100
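A minimal sketch of this representation (the helper below and its inputs are our own illustration, not the authors' code):

```python
from collections import Counter

def user_vector(app_categories, all_categories):
    """Represent a user as percentages of installed apps per category.

    app_categories: list of category names, one per installed app.
    all_categories: the 30 Google Play categories, in a fixed order.
    """
    counts = Counter(app_categories)
    total = len(app_categories)
    # p_ik = (# apps of category k / total # of apps) * 100
    return [100.0 * counts[c] / total for c in all_categories]
```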
We then used Ward's method [11] for agglomerative hierarchical clustering, with the Euclidean distance as the distance measure between two users, to determine whether users form clusters. To decide the optimal number of clusters, we plotted the error sum of squares (SSE) against the number of clusters for both datasets. The optimal number of clusters is the number of clusters around the elbow point of the SSE vs. number of clusters graph. In Figure 5a, the elbow point lies around 6 for both datasets, and thus 6 was selected as the optimal number of clusters.
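The clustering step could, for instance, be reproduced with SciPy's implementation of Ward's method; the paper does not state which software was used, so the sketch below is an assumption about tooling.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def sse_per_k(X, max_k=14):
    """SSE for k = 1..max_k clusters from Ward-linkage hierarchical clustering.

    X: (n_users, 30) matrix of category percentages.
    """
    X = np.asarray(X, dtype=float)
    Z = linkage(X, method="ward")   # Ward linkage implies Euclidean distance
    sse = []
    for k in range(1, max_k + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        # SSE = sum of squared distances of users to their cluster centroid
        err = 0.0
        for c in np.unique(labels):
            members = X[labels == c]
            err += ((members - members.mean(axis=0)) ** 2).sum()
        sse.append(err)
    return sse   # plot SSE vs. k and pick the elbow (k = 6 in the paper)
```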
It is necessary to determine whether these clusters have a one-to-one mapping across the two datasets, i.e., whether a cluster in the Appaware dataset is close to only one cluster in the Appbrain dataset and has large distances to the other clusters in Appbrain. We first calculated the centroid of each cluster and then calculated the pairwise distances between the cluster centroids across the two datasets. We found that each cluster in one dataset is quite close to only one cluster in the other dataset and is a significant distance away from the rest of the clusters. We show this pairwise distance matrix in Table V.
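As an illustration, the cross-dataset matching of clusters can be computed as follows (a sketch with our own function and variable names):

```python
import numpy as np

def centroid_distance_matrix(X_a, labels_a, X_b, labels_b):
    """Pairwise Euclidean distances between cluster centroids of two datasets."""
    X_a, X_b = np.asarray(X_a, float), np.asarray(X_b, float)
    cents_a = np.array([X_a[labels_a == c].mean(axis=0) for c in np.unique(labels_a)])
    cents_b = np.array([X_b[labels_b == c].mean(axis=0) for c in np.unique(labels_b)])
    # D[i, j] = distance between centroid i of dataset A and centroid j of dataset B
    diff = cents_a[:, None, :] - cents_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```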
Figure 5b shows that the percentage of users in each cluster is approximately the same in both datasets, with the exception of cluster 2. We discuss the reasons for this below. The analysis of the category-wise composition of the cluster centroids is shown in Table IV.2 The analysis showed that a cluster can represent particular user interests. The types of user interests identified are shown in Table VI.

2 For clarity, any category with less than 1.5% across all cluster centers was omitted.
Cluster 1 users have more installed apps in the Music, Entertainment and Books categories. Cluster 2 users do not have specific distinguishing features compared to the other clusters; therefore we call them the balanced users. This is also the cluster which showed the discrepancy highlighted above.
[Fig. 5: User clustering. (a) SSE vs. number of clusters; (b) % of users in each cluster (c1-c6), for Appaware and Appbrain.]

[Fig. 6: Distance matrix for the Appaware dataset. (a) Distance matrix before clustering; (b) distance matrix after clustering.]

TABLE V: Adjacency matrix of cluster centroids (rows: Appaware clusters, columns: Appbrain clusters)

        c1     c2     c3     c4     c5     c6
c1     5.8   11.9   15.8   14.8   25.0   20.4
c2    11.9    5.2   17.0   13.8   15.7   17.6
c3    21.1   26.7    7.3   23.2   33.3   32.2
c4    20.7   19.3   24.9    7.2   25.7   25.8
c5    26.3   15.1   30.5   24.3    5.6   24.3
c6    15.1   15.6   23.8   17.7   26.8   12.4

TABLE VI: Clusters: Explanation

Cluster no.    Description
c1             Entertainment, Music and Books
c2             Balanced across all categories
c3             Game apps (Arcade, Brain and Sports games)
c4             Personalization apps
c5             Excessive no. of Tools apps
c6             Social and Communications apps
However, due to the generality of this cluster, it is not possible to identify a specific reason for the discrepancy. The distinguishing feature of cluster 3 users is the high use of gaming apps (Arcade games, Brain games and Sports games). There is a clear distinction between cluster 4 and the rest, as it represents users with higher numbers of apps in the Personalization category. The distinguishing feature of cluster 5 users is the larger number of apps that fall under the Tools category compared to the others, while cluster 6 users are distinguished by higher numbers of apps in the Social and Communication categories. Since we observed similar clustering across the two datasets collected from different sources, we conclude that these clusters represent generally distinguishable features of smartphone users.
Figure 6 shows the Euclidean distance matrix between the users before and after clustering for the Appaware dataset. It shows that clusters 2, 3, 4 and 5 are clearly distinguishable. However, clusters 1 and 6 tend to overlap with cluster 2. Cluster 2 represents the balanced users, and cluster 1 differs from it due to the high use of apps that fall under the Entertainment, Music and Books categories. On the other hand, cluster 2 users have more apps that fall under the Tools category, which reduces the difference in Euclidean distance. Similarly, the difference between cluster 2 and cluster 6 is the higher use of apps in the Communication and Social categories in cluster 6. Nevertheless, cluster 6 has a lower number of apps in the Tools category, which reduces the distance between cluster 6 and cluster 2.
TABLE VII: Personal information by the presence of apps

Language/Ethnicity/Country: presence of apps in languages other than English; presence of apps related to specific countries
Religion: Catholic Chaplets 01, Catholic Rosary Quick Guide, Bible Topics, Lord Buddha Temple, Judaism 4 U, Krishnashtakam
Interests: Sports: The Official ESPNcricinfo App, ESPN FC; Music: Instrumental Hip Hop Rap Beats, Jazz Radio
Relationship Status: Local Dating, eHarmony
Health conditions: Diabetes Diet, Stress Check by Azumio
Education: ACCA Student Planner, CSAT UPSC Prep, Engineering Dictionary
Sex Orientation: NearOx, DISTINC.TT
V. DECISION RULE ENGINE

The objective is to determine the possibility of personifying a user through the list of apps he/she has installed on the smartphone. In order to do this, it is necessary to infer fine-grained personal information about the user such as religion, ethnicity/languages, relationship status and health conditions. Table VII shows some potential apps which can be used to derive such personal information.
As the app space is large, a heuristic rule-based decision engine was used. For each app, the app description and the user reviews on Google Playstore were used to extract semantic information about the app, namely the language and the concepts given by the natural language processing API Alchemy [12]. Concepts tag the apps using keywords in the app description and the user reviews. For example, a description/review containing the keywords BMW, Ferrari and Porsche will be tagged as "automotive industry".
This information is used with a set of rules to extract information about the app, such as the religion and language associated with it. The users who have these apps installed were then characterized using this information. The rules used to extract the information are described below. The analysis used the merged, pre-processed Appaware and Appbrain datasets, which consist of 8068 users.
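Conceptually, the rule engine is a lookup-and-match loop over each user's app list. The sketch below is our own illustration; the app_metadata input stands in for the concepts/keywords returned by a text-analysis service such as Alchemy and is not the actual API.

```python
def infer_attributes(app_list, app_metadata, attribute_terms):
    """Apply presence-based rules over a user's installed apps.

    app_list: package names installed by the user.
    app_metadata: dict mapping package name -> set of concepts/keywords
                  extracted from its store description and reviews.
    attribute_terms: dict mapping an attribute (e.g. "religion") to the
                     set of terms that signal it (e.g. religion names).
    Returns, per attribute, the set of matched terms for this user.
    """
    matches = {attr: set() for attr in attribute_terms}
    for app in app_list:
        concepts = app_metadata.get(app, set())
        for attr, terms in attribute_terms.items():
            matches[attr] |= concepts & terms
    return matches
```

For example, passing `{"country": set(country_names), "religion": set(religion_names)}` as `attribute_terms` would flag the countries and religions suggested by a user's app list.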
[TABLE IV: Cluster centroids (a.a: Appaware, a.b: Appbrain), giving the category-wise composition of the cluster centroids c1-c6 as percentages per app category.]

TABLE VIII: Top-5 languages (other than English) by the number of users

Language      # of users    % from total users
Arabic               236                 2.93%
German               172                 2.13%
Italian              153                 1.90%
Spanish              109                 1.35%
Portuguese            83                 1.03%

[Fig. 7: Percentages of users with language information. (a) % of users having non-English apps, by the number of non-English apps; (b) % of users (among those having more than one non-English app) by the number of languages.]

A. Language/Ethnicity
Most of the apps (63.5%) installed by the users were identified as English. The next top-5 languages associated with apps are shown in Table VIII. Apps for which a language could not be identified were not used in the analysis. All the apps that could be associated with a language were distributed among 47 different languages. Our language rule scans through the user app lists and, if any non-English app is found, flags those languages as the potential languages the user speaks.
Figure 7a shows the number of non-English apps installed by users. It shows that 13% of the users have only one non-English app, while 5% of the users have two non-English apps, 3% have three non-English apps, and so on. Due to the large difference between the percentages of users having only one and two non-English apps, we removed the group having only one non-English app from our analysis, because this can lead to errors in the decision making, as discussed in Section VI.
For a user who has apps associated with both Italian and German, it is not possible to definitively determine the native language of the user. Figure 7b shows the ambiguity that arises from having apps in multiple non-English languages. As can be seen, the majority of the users have apps associated with only one language. Thus it is possible to identify the native language spoken by the user. Using this observation, the following rule was used to determine the language of a user.
language rule - if the user has more than one non-English app, list the languages of those apps as the possible languages the user speaks.
Applying the above rule, we could infer the language of 25% of the users in our dataset. This is summarized in Table VIII, which provides the top-5 languages by the number of users and their percentage of the total number of users in our dataset.
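As an illustration, the language rule with its more-than-one-app threshold could be written as follows (a sketch with hypothetical inputs; the per-app language labels are assumed to come from the text-analysis step):

```python
def language_rule(app_languages, min_non_english=2):
    """Possible spoken languages inferred from the languages of installed apps.

    app_languages: language identified for each installed app (None if unknown).
    Mirrors the rule above: only users with more than one non-English app are
    considered, and the languages of those apps are flagged.
    """
    non_english = [lang for lang in app_languages if lang and lang != "English"]
    if len(non_english) < min_non_english:
        return set()        # too little evidence; skip this user
    return set(non_english)
```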
B. Country

While the country of residence of a smartphone user can be extracted from network information such as the IP address or the telecom operator, it is not possible to determine other information such as the country of origin or the countries the user has a specific interest in. To obtain this information, for each app we extracted a list of concepts using the Alchemy API and looked for the presence of country names by comparing them with a pre-defined list of all the countries in the world, available at [13]. In the case of a match, the app was associated with the matching country.
country rule - if the user has more than one app with a country name tag, list those countries as the possible countries the user has associations with.
We observed that some apps have multiple country names as keywords, and we ignored these apps in our analysis to avoid ambiguity. Figure 8a shows the percentage of users against the number of apps with country information. Again, we discarded the users having only one app with country information, for the same reason described above. Figure 8b shows the number of countries identified for the users who have more than one country-associated app. It shows that the majority of the users are associated with one or two countries. Applying this rule to the dataset, we were able to infer the associated countries for 8% of the users.
C. Religion

Religion was associated with an app using the same principles used to associate country information with an app. A pre-defined list of the top 10 religions by the number of followers globally was used. If any app associated with a religion was in a user's app list, we inferred the user to be a follower of that religion. In this analysis, 25 apps had two religion names associated with them; we removed those apps to eliminate ambiguity.
religion rule - if the user has apps with religious names associated with them, flag those religions as the potential religious beliefs of that user.
As expected, most app lists contained apps that have an association with only one religion. We found 40 users who have apps with different religion flags, and we removed these users from our analysis. Table IX lists a summary of our findings. We were able to infer the religions of 14% of the users in our dataset.

[Fig. 8: Percentages of users with country information. (a) % of users having apps with country info, by the number of such apps; (b) % of users (among those having more than one app with country info) by the number of associated countries.]

TABLE IX: Users by religion

Religion        # of users    % from total users
Islam                  869               10.77%
Hinduism               140                1.74%
Christianity            93                1.15%
Buddhism                19                0.24%
Other                    9                0.11%

TABLE X: Users by sport interest

Sport           # of users    % from total users
Cricket                168                2.08%
Cycling                 47                0.58%
Football                41                0.51%
Baseball                30                0.37%
Golf                    30                0.37%

TABLE XI: Summary of the ground truth datasets

                               Android    iOS
# of users                          18     16
# of apps                          417    419
# of installations                 576    546
Average # of apps/user              32     34
Median # of apps/user               28     22
D. Sport interests

The sport interest rule extracts information about the sporting interests of the users. Similarly to the previous inferences, we checked for apps having sport names as concepts by comparing the concepts of each app with a list of globally popular sports available at [14]. We ignored gaming apps, because playing such games does not necessarily mean that the user is a fan of that sport. The rule aims at identifying apps which provide sport-related news and/or sports scores.
sport interest rule - if the user has non-game apps which have sport names, list those sports as the sports the user is interested in.
Again, apps having more than one sport name associated with them were discarded to avoid ambiguity. Using this rule we could infer the sport interests of 4% of the users; we summarize the results in Table X.
E. Relationship status

The relationship rule checks for the presence of the concept "dating" associated with an app. We identify the users with dating apps installed as single.
relationship rule - if the user has dating apps, list the user as single.
By applying this rule we were able to identify 8% of the users as single.

VI. RESULTS
Since it was not possible to establish the "ground truth" for the two crawled datasets, results from 18 volunteer Android users of Apptronomy and 16 iPhone users who gave us their app lists were used to validate the results. A summary of the test datasets is shown in Table XI. Due to privacy concerns, the volunteers were not asked for any personal information other than their country of residence and their native language. The methodology used to validate the results is schematically shown in Figure 9.
[Fig. 9: Ground truth data collection process. 1) App list collected from the Android/iOS smartphone; 2) text extraction from Google Playstore/Apple Appstore (app description, user reviews); 3) text analysis with the Alchemy API (language, keywords); 4) inferences by the decision rule engine to build a user profile; 5) cross validation.]
We applied the decision rules to the test dataset and calculated the performance metrics precision and recall for the original language rule and country rule, as well as for relaxed versions of the rules in which the presence of a single app is sufficient for the decision, since we had the ground truth. Since our dataset was limited and the Alchemy API returned only a very limited number of words, we extracted keywords from the app market text instead of concepts. Table XII summarizes the results.
TABLE XII: Summary of results

                        Language               Country
Platform            Precision   Recall     Precision   Recall
Android (≥ 1 app)         42%      21%          100%      18%
Android (> 1 app)        100%      17%          100%       3%
iOS (≥ 1 app)            100%      30%           45%      20%
iOS (> 1 app)            100%      20%          100%       4%

The results show a higher precision and a lower recall. The drop in precision when the more-than-one-app requirement is relaxed is due to the Alchemy API identifying apps as being associated with a country when in reality they were not. This shows that it is necessary to have more than one app with language association information to make an accurate inference.
The low recall is due to the Alchemy API not providing country name associations for some apps which were country specific. This is possibly because the app description text and the user reviews in the app market did not contain any country information. Another reason is that there were some languages which were not identified by the Alchemy API. We discuss how the results can be further improved in Section VII.
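The precision and recall values in Table XII follow the standard definitions; below is a minimal sketch of how they can be computed for one attribute against the volunteers' ground truth (the input structures are our own assumption, not the paper's code).

```python
def precision_recall(predictions, ground_truth):
    """Compare per-user set-valued predictions against ground truth.

    predictions: dict user -> set of inferred values (empty if no inference).
    ground_truth: dict user -> set of true values.
    Precision = correct inferences / all inferences made.
    Recall    = correct inferences / all true values.
    """
    tp = fp = fn = 0
    for user, truth in ground_truth.items():
        pred = predictions.get(user, set())
        tp += len(pred & truth)
        fp += len(pred - truth)
        fn += len(truth - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```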
VII. DISCUSSION
There are a number of implications, both positive and possibly negative, of the inferences that can be made from a user's installed app list.
A. Characterization

The characterization of the number of installed apps of smartphone users can be beneficial for dimensioning smartphone hardware such as the screen, storage and memory, for arranging apps across multiple screens, and for designing app pre-launching methods, as has already been suggested [15]. In contrast, the presence of deleted apps on users' smartphones poses a potential security problem, as users may be using apps that have been deemed malicious or are malfunctioning.
The evaluation of app availability and actual user installations by category shows that some app categories have an excess number of apps, namely Entertainment and Personalization, whilst some other categories, especially Productivity and Communication, have a shortage. This provides insights to app developers for future app development. The characterization of user behavior with respect to free and paid apps can provide useful information for advertisers to generate targeted advertisements: it may indicate users who have a higher probability of making online purchases.
B. Clustering

It is possible to cluster users using well-known clustering techniques. One obvious implication is that knowledge of which cluster a user belongs to gives a very good indication of the types of apps that app makers should recommend to the user. This information can be further exploited to provide cross-domain product recommendations by combining it with secondary information, such as users' movie preferences and shopping patterns. Thus, this can be beneficial for both users and various service providers.
C. Personification

The results show that, using simple decision rules, it is possible to predict fine-grained personal information such as language, ethnicity and interests. This information can be used for the benefit of the user in applications such as UI personalization, micro-targeted advertising and recommendations of various kinds. Thus, knowledge of the app list can be seen as a means of instantly building a user profile, compared to user tracking techniques which can be expensive and time consuming.
On the other hand, we showed that it is possible to infer personal information such as religion, relationship status, health conditions and sexual orientation by observing the apps users install. More importantly, this information can be obtained without the consent of the user. As these types of inferences can be easily made and misused by third parties, this is a clear violation of users' privacy.
D. Limitations

The primary limitation of the proposed method is that it works only when a diverse range of apps has been installed. For smartphone users who use only the pre-installed apps and a very limited number of popular apps, the method fails. We believe that at best this is a temporary limitation: as users' familiarity with app markets increases, the majority of users will have a diverse range of apps.
Another potential limitation arises because the presence of an app is not a definitive indicator of a user's interest, as the user might not be using the app, or it might have been installed by a family member. We believe that this is not a real limitation, as the decision to install an app, or to allow an app to be installed, is in itself an indication that the user has at least some interest associated with the app.
E. Possible improvements

The classification methods and the rules used in this paper are simple and straightforward. Despite this, the results are usable. It is likely that optimizing the classification techniques and developing a more sophisticated set of rules will yield better results. However, the utility of a more sophisticated set of rules is unclear and needs further investigation. For example, the rules used simply checked for the presence of certain keywords. However, the keyword list returned by the text processing APIs might not always contain these keywords, but rather a set of other keywords associated with the topic. Thus, a term-similarity based document classification method would increase the probability of identifying apps related to certain topics. With such a method, precision may be reduced but recall would be increased.
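One way to realize this term-similarity idea is to compare an app's store text against a small seed document per topic using TF-IDF and cosine similarity; the sketch below (using scikit-learn) only illustrates the direction and is not a method from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def topic_scores(app_text, topic_seeds):
    """Similarity between an app's description/reviews and topic seed texts.

    app_text: concatenated description and reviews of one app.
    topic_seeds: dict topic -> seed text (e.g. "cricket batsman wicket ...").
    Returns a dict topic -> cosine similarity in [0, 1].
    """
    topics = list(topic_seeds)
    docs = [app_text] + [topic_seeds[t] for t in topics]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    return dict(zip(topics, sims))
```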
VIII. RELATED WORK
Various previous works have looked at the possibility of deriving personality traits by mining smartphone data. Chittaranjan et al. [16] predicted smartphone users' Big-Five personality traits [17] by mining the app usage, call and SMS usage, and Bluetooth proximity information of 83 smartphone users, and evaluated the results by comparing the predicted personality with the personality derived from the users' answers to the Ten Item Personality Inventory (TIPI) [18]. Similarly, LiKamWa et al. [19] used contextual information such as browsed websites, used apps and contact logs (SMS, voice and email) of 32 participants over a period of two months to infer users' mood.
Privacy leakage due to third-party ad libraries collecting user information through over-permission (i.e., asking for permissions which are not required for the app's function) has been studied in multiple works [20], [4], [21], [22]. However, this information leakage is always under the control of the users, as they can avoid installing apps which ask for additional permissions.
Shepard et al. [23] and Böhmer et al. [24] studied how app usage depends on contextual variables such as location, time of day and day of week, and multiple works have used this information to predict users' future app usage [25], [15]. Pan et al. [26] collected the app installation logs, call logs and data on nearby Bluetooth devices of 55 smartphone users for 5 months. This information, together with externally collected information such as friendship and affiliation, was used to predict users' future app installations using supervised learning. However, they did not study app installation patterns based on app categories or the personal information that can be derived from them.
We differentiate our work from previous work in that we use only a single snapshot of the installed app list, which is accessible to any app developer without the knowledge of the user. Profiling users based on app usage might give more accurate results; however, it requires the consent of the user and longer data collection periods. Using tracking [27] to identify which websites users visit in order to infer their interests also needs observations over longer periods. Thus, our personification method provides an easy means of building user profiles.
IX. CONCLUSION
This paper presented, to the best of our knowledge, the first large scale study of user-installed Android apps and their related basic statistics. It then analyzed the information that can be derived from the list of installed apps, which is accessible through any installed app on the smartphone. The analysis showed that smartphone users form clusters based on the characteristics of the apps they install. Further, access to app lists enables the extraction of more fine-grained information about the smartphone user, such as spoken languages, religion and interests, by using simple rules. The viability of the rules was demonstrated using ground truth data from 34 volunteer smartphone users. Despite their simplicity, it was shown that detailed information about users could be inferred with a precision over 40%. Finally, the opportunities this kind of analysis provides and the threats it poses were highlighted. We believe that the accuracy of the inferences can be significantly improved through the use of more sophisticated classification techniques, which we intend to explore. Therefore, we believe that app lists can be effectively used to personify users. We plan to expand our own dataset with ground truth and to release our current dataset to the research community to explore some of the opportunities this type of analysis provides.
REFERENCES
[1] eMarketer, “Smartphone adoption tips past 50% in major markets worldwide,” http://www.emarketer.com/Article/SmartphoneAdoptionTipsPast5MajorMarketsWorldwide/1009923, 2013.
[2] A. Cocotas, “Android grabs a record share of the global smartphone market,” http://au.businessinsider.com/androidbouncesbacktoarecordquarter20135, 2013.
[3] mobiThinking, “Global mobile statistics 2013 section e: Mobile apps,
app stores, pricing and failure rates,” http://mobithinking.com/mobilemarketingtools/latestmobilestats/e, 2013.
[4] M. Grace, W. Zhou, X. Jiang, and A.-R. Sadeghi, “Unsafe exposure
analysis of mobile in-app advertisements,” in Proc. of the 5th ACM
WiSec, 2012, pp. 101–112.
[5] D. Amitay, “ios app detection,” http://www.ihasapp.com, 2012.
[6] Appbrain, “Top android apps and games in the android market,”
http://www.appbrain.com, 2013.
[7] Appaware, “Top android apps and games today on appaware.com,”
http://www.appaware.com, 2013.
[8] F. Mosteller and J. W. Tukey, “Data analysis and regression. a second
course in statistics,” Addison-Wesley Series in Behavioral Science:
Quantitative Methods, Reading, Mass.: Addison-Wesley, 1977, vol. 1.
[9] Nielsen, “Application:what a difference a year makes,”
http://www.nielsen.com/content/dam/corporate/us/en/newswire/uploads
/2012/05/appnation-what-has-changed.png, 2012.
[10] N. d’Heureuse, F. Huici, M. Arumaithurai, M. Ahmed, K. Papagiannaki,
and S. Niccolini, “What’s app?: a wide-scale measurement study of
smart phone markets,” SIGMOBILE Mobile Computing and Communications Review, vol. 16, no. 2, Nov. 2012.
[11] J. H. Ward Jr, “Hierarchical grouping to optimize an objective function,”
Journal of the American statistical association, vol. 58, 1963.
[12] “Alchemy api,” http://www.alchemyapi.com, 2013.
[13] M. Gifford, “Text list of all countries in the world,”
http://openconcept.ca/blog/mgifford/text-list-all-countries-world, 2007.
[14] Topendsports, “List of sports from around the world,”
http://www.topendsports.com/sport/sport-list.htm, 2013.
[15] C. Zhang, X. Ding, G. Chen, K. Huang, X. Ma, and B. Yan, Nihao: A
predictive smartphone application launcher, 2013, vol. 110 LNICST.
[16] G. Chittaranjan, B. Jan, and D. Gatica-Perez, “Who’s who with big-five:
Analyzing and classifying personality traits with smartphones,” in Proc.
- ISWC, 2011, pp. 29–36.
[17] R. R. McCrae and O. P. John, “An introduction to the five-factor model
and its applications,” Journal of Personality, vol. 60, no. 2, 1992.
[18] S. D. Gosling, P. J. Rentfrow, and W. B. Swann Jr., “A very brief measure
of the big-five personality domains,” Journal of Research in Personality,
vol. 37, no. 6, pp. 504–528, 2003.
[19] R. LiKamWa, Y. Liu, N. D. Lane, and L. Zhong, “Moodscope building
a mood sensor from smartphone usage patterns,” in Proc. of the 11th
MobiSys ’13, 2013.
[20] I. Leontiadis, C. Efstratiou, M. Picone, and C. Mascolo, “Don’t kill my
ads!: balancing privacy in an ad-supported mobile application market,”
in Proc. of the 12th HotMobile ’12. ACM, 2012, pp. 2:1–2:6.
[21] P. Pearce, A. P. Felt, G. Nunez, and D. Wagner, “Addroid: privilege
separation for applications and advertisers in android,” in Proc. of the
7th ASIACCS ’12. ACM, 2012, pp. 71–72.
[22] S. Shekhar, M. Dietz, and D. S. Wallach, “Adsplit: separating smartphone advertising from applications,” in Proceedings of the 21st USENIX
conference on Security symposium, ser. Security’12, 2012, pp. 28–28.
[23] C. Shepard, A. Rahmati, C. Tossell, L. Zhong, and P. Kortum, “Livelab: measuring wireless networks and smartphone users in the field,”
SIGMETRICS Perform. Eval. Rev., vol. 38, no. 3, pp. 15–20, Jan. 2011.
[24] M. Böhmer, B. Hecht, J. Schöning, A. Krüger, and G. Bauer, “Falling
asleep with angry birds, facebook and kindle: a large scale study on
mobile application usage,” in Proceedings of the 13th MobileHCI ’11,
2011.
[25] T. Yan, D. Chu, D. Ganesan, A. Kansal, and J. Liu, “Fast app launching
for mobile devices using predictive user context,” in Proc. of the 10th
MobiSys ’12, 2012, pp. 113–126.
[26] W. Pan, N. Aharony, and A. Pentland, “Composite social network for
predicting mobile apps installation,” in Proc. of the National Conference
on Artificial Intelligence, vol. 1, 2011, pp. 821–827.
[27] S. Han, J. Jung, and D. Wetherall, “A study of third-party tracking by
mobile apps in the wild,” University of Washington, Tech. Rep. UWCSE-12-03-01.