Your Apps Are What You Are: Personification Through Installed Smartphone Apps
Suranga Seneviratne*†, Aruna Seneviratne*†, Prasant Mohapatra‡
*School of EET, University of New South Wales; †NICTA, Australia
Email: suranga.seneviratne, [email protected]
‡Department of Computer Science, University of California, Davis
Email: [email protected]

Abstract—The smartphone ecosystem is based on the use of apps developed by third-party app developers, each providing a specific service to the user. The apps need to obtain permissions from the user to access information about the user in order to provide the service. A privacy-conscious user can decide to restrict access to this private information. However, other information that is considered non-private, such as the device id, phone model and installed apps, is accessible to app developers without the consent of the user, and some apps have been reported to collect this information. In this paper, we investigate what can be inferred from non-private information, namely the list of installed apps, using a large dataset of users' app lists. We show that access to the app list enables inferring information about the users, using well-known clustering techniques and a simple set of rules. Finally, we show that the inferred information can be used to provide some useful services, such as recommendation systems for users. However, it also leads to some loss of privacy, which can be harmful.

I. INTRODUCTION

Smartphone usage is increasing and is predicted to reach 50% of the global mobile device market by 2017 [1]. Smartphones allow third parties to develop apps that provide different services to users. Third-party app developers publish these apps in the app markets of the relevant mobile operating system, and users can download and install them on their smartphones. The Android and iOS operating systems covered approximately 91% of the global smartphone market in Q1 2013 [2]. The official markets for Android and iOS are reported [3] to have more than 800,000 apps each as of May 2013, and approximately 51 billion app downloads are predicted for the year 2013.

Users decide to install apps depending on their requirements. For example, a user who commutes by train is likely to install an app that provides train timetables. Thus, intuitively, the apps that a user has installed are potentially good indicators of their interests, lifestyle, etc.

In Android environments, apps need explicit permission to access personal data such as location, call logs, SMS and social network profiles. The permission to access this information is requested at the time of installation, and if the user does not wish to grant the permission, he/she can decide not to install that app. Further, at any point in time users can check the permissions given to an app and decide whether to keep or uninstall it. In contrast, the list of installed apps on a user's smartphone can be obtained without user permission through any installed app. It has been reported [4] that some ad libraries have embedded this feature and collect information about installed apps. Though it is not as straightforward as in Android, the second most popular mobile operating system, iOS, also allows apps to obtain the list of installed apps [5]. This raises two fundamental questions. First, if a third party could get the list of installed apps of individual users, what can be inferred from this information? Second, what can the inferred information be used for?
This paper answers the first question by analyzing the app lists of over 8000 Android users and verifying the findings using 34 volunteers. We then attempt to answer the second question by investigating the possibility of "personifying" the user, using the information that can be inferred from the app lists. We make the following contributions in this paper.
• We present, to the best of our knowledge, the first large-scale study of what apps people install and their associated basic characteristics.
• We show that it is possible to accurately infer personal information that can be used to personify users.
• We evaluate the viability of the proposed methodology using a limited but significant group of volunteers.
• We highlight that personification of users can potentially be used for a number of beneficial purposes, but does lead to a loss of user privacy.

The rest of the paper is organized as follows. In Section II we present the methodology used to collect and analyze data, and in Section III we provide the basic statistics of our datasets. Section IV shows the possibility of stereotypical user profiling based on the higher-level app categories. In Section V we present our decision rule engine, followed by the results in Section VI. Section VII discusses the implications of the findings. Related work is presented in Section VIII and Section IX concludes the paper.

II. DATA COLLECTION AND PROCESSING

In order to gather the lists of apps downloaded and installed on user devices, we adopted two different methods. First, we crawled two popular social app discovery sites, Appbrain [6] and Appaware [7], where users publicly share their installed app lists. Second, we developed and distributed an Android app called Apptronomy¹ among a group of volunteers to collect the lists of their installed apps. In addition, we manually collected the app lists of 16 volunteer iPhone users. From these users we collected basic demographic information through a brief questionnaire. We use the dataset collected from the volunteers to evaluate our decision rules for inferring personal information in Section VI.

¹ https://play.google.com/store/apps/details?id=com.test.apptronomy

TABLE I: Summary of the datasets: Before pre-processing
                         Appbrain   Appaware
# of users               8653       841
# of apps                85770      24254
# of installations       705004     94024
Average # of apps/user   81         112
Median # of apps/user    51         75

Table I provides a summary of the characteristics of our crawled datasets. Appbrain and Appaware together provide the app lists of more than 9000 users and include well over 100,000 apps. This dataset, however, contains pre-installed apps as well as some "outlier users" who have either an abnormally high or low number of installed apps. To eliminate the bias that the pre-installed apps and outlier users introduce, we removed them from the dataset as described below.
• Pre-installed apps: Our primary objective is to identify user characteristics based on the apps they install. Thus, it is necessary to remove pre-installed apps such as Facebook, Gmail and YouTube. This was done by generating a list of pre-installed apps from an examination of 5 popular phone models from 5 different manufacturers, covering 3 major Android versions. We were able to identify 53 pre-installed apps, which were removed from the dataset.
• Outliers: We believe that, in the social app discovery services that were crawled, users with a very low number of apps are those who have not synched their phones with the service, and users with an excessively high number of apps are potentially those who promote apps, such as marketers and app developers. To remove the outlier users, we first used a statistical outlier removal method, the interquartile range (IQR) with parameter 1.5 [8]. However, the method did not identify a lower limit, so we removed the lists smaller than the 10th percentile to eliminate users with low numbers of installed apps. For the upper bound we used the value given by the IQR method (a sketch of this filtering step is given at the end of this section). Since the volunteers who provided their app lists are known, we did not apply this correction to the Apptronomy and iOS datasets.

The characteristics of the datasets after the removal of the pre-installed apps and the outlier users are shown in Table II.

TABLE II: Summary of the datasets: After pre-processing
                         Appbrain   Appaware
# of users               7360       708
# of apps                68115      18764
# of installations       474433     64327
Average # of apps/user   64         91
Median # of apps/user    54         79

The categories of the apps installed by the users were then identified by querying Google Playstore, using the 30 pre-defined categories used when publishing apps in Google Playstore. In cases where an app could not be found in Google Playstore, we queried alternative app markets to find its category. For the apps found in Google Playstore we also collected whether the app was free or paid, and the price of the paid apps.
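The pre-processing described above amounts to two filters over the crawled installation records. Below is a minimal sketch, assuming the data is held in a pandas DataFrame with one row per (user, app) installation; the PREINSTALLED set is a hypothetical stand-in for the 53 identified pre-installed apps, which are not listed in the paper.

```python
import pandas as pd

# Hypothetical stand-in for the 53 pre-installed apps identified in Section II.
PREINSTALLED = {"com.facebook.katana", "com.google.android.gm", "com.google.android.youtube"}

def preprocess(installs: pd.DataFrame) -> pd.DataFrame:
    """installs: one row per (user_id, app_id) installation."""
    # 1) Drop pre-installed apps so that only user-installed apps remain.
    installs = installs[~installs["app_id"].isin(PREINSTALLED)]

    # 2) Remove outlier users: upper bound from the IQR rule with parameter 1.5,
    #    lower bound from the 10th percentile (the IQR rule gave no lower limit here).
    apps_per_user = installs.groupby("user_id")["app_id"].nunique()
    q1, q3 = apps_per_user.quantile(0.25), apps_per_user.quantile(0.75)
    upper = q3 + 1.5 * (q3 - q1)
    lower = apps_per_user.quantile(0.10)
    keep = apps_per_user[(apps_per_user >= lower) & (apps_per_user <= upper)].index
    return installs[installs["user_id"].isin(keep)]
```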
III. BASIC CHARACTERISTICS

In this section, we describe the general characteristics of the apps downloaded and installed by the users, which provide insights into the number of apps users have and the diversity of the user-installed app space by category and price. These characteristics help us scope the size and trends of the datasets we intend to analyze, and identify potential biases and similarities.

A. App categories

To determine the distinctions between the installed sets of apps, as well as to provide an overall profile of the apps being installed on devices, we first compared the composition of the corpus of apps by category for both datasets, as shown in Figure 1a. It was found that the top app categories were Tools, Entertainment and Personalization. Furthermore, these three categories accounted for approximately 33% of the total apps. The categories Comics, Libraries and Medical have the fewest apps. According to Figure 1a, the percentages of apps in each category are approximately the same for both datasets. For example, the Tools category holds around 13% of the apps in both datasets, as shown in the stacked Tools column in Figure 1a.

B. App installations

The actual percentage of apps a user installs from a given category can be different from the percentage of apps that are available in that category in the app market. Figure 1a shows the percentage of apps available in each category, while Figure 1b shows the percentage of apps installed on user devices. As can be seen from Figure 1b, the top category of apps installed by users is again Tools. In contrast, apps in the Productivity and Communications categories have a higher number of installations despite these two categories having a lower number of apps available in the app market, as shown in Figure 1a. Similarly, although there are more apps available in the Entertainment and Personalization categories in the app market, the actual percentages of installations are lower.

Fig. 1: Apps by availability and installations. (a) App availability by category; (b) App installations by category.

C. No. of apps per user

To personify a user using their apps, the app list of each user needs to contain a statistically significant number of apps. To determine whether this is the case, we analyzed the datasets to determine the number of installed apps per user. Figure 2 plots the CDF of the number of apps per user before and after pre-processing, which shows the effect of removing the outlier users and the pre-installed apps.

Fig. 2: Distribution of the number of apps per user. (a) CDF before pre-processing; (b) CDF after pre-processing.

As can be seen from Figure 2b, 50% of the users have installed more than 50 apps in both datasets. This corroborates a Nielsen report [9], which found that in 2012 an average US smartphone user had around 41 apps, compared to 32 in the previous year. Furthermore, it shows that there is a sufficient number of user-installed apps that can possibly be used for personification. One possible reason for the difference between the Appbrain and Appaware data is the length of time the two services have been in operation: the Appbrain site was launched in 2010 and the Appaware site in 2012. The older the site, the higher the possibility of having app lists which have not been synched for a long period of time. Furthermore, the users of the newer site could possibly be more app-savvy, and/or the site has not yet had sufficient time to attract general smartphone users.

D. Deleted Apps

It was found that a number of apps which have been deleted from Google Playstore are still present in user app lists: 3.7% in Appaware and 29.8% in Appbrain. Apps get removed from Google Playstore for various reasons, such as the developer discontinuing the app, or Google removing apps due to malicious behavior. The reason Appbrain lists have a significantly higher number of removed apps, we believe, is that it has been in operation for a longer period of time. This is consistent with the observation made by d'Heureuse et al. [10] that approximately 8% of the total apps were deleted from Google Playstore within a period of three months. The potential implications of deleted apps still being present on user phones are discussed in Section VII.

E. Free vs. paid apps and the cost of paid apps

To study user behavior with respect to paid apps, we examined the number of paid and free apps in the users' app lists. Figure 3a shows the number of paid apps in user app lists.
It shows that 19% of the users in the Appaware dataset and 38% of the users in the Appbrain dataset did not have any paid apps. In general, the Appaware dataset contains users who have more paid apps than the Appbrain dataset users. Again, we believe that the reasons for the higher number of paid apps in the Appaware dataset are its age and its potentially more app-savvy users. For the majority of users in both datasets, the total cost of the paid apps was less than AU$10, as shown in Figure 3c. Furthermore, Figure 3b shows that more than 50% of the paid apps cost less than AU$2.

Fig. 3: Statistics of paid apps. (a) CDF: number of paid apps per user; (b) CDF: price of apps (AU$); (c) Histogram: cost of app list (AU$).

Figures 4a and 4b compare the percentages of free and paid app installations by category. While the percentages of the Tools and Productivity categories remain the same, users have more paid apps in the Arcade games, Brain games, Music and Photography categories. Further, users install more free apps than paid apps in the Communications and Entertainment categories. A summary of the findings is shown in Table III.

Fig. 4: Free and paid apps by installation. (a) Free app installations by category; (b) Paid app installations by category.

TABLE III: Free apps vs. paid apps
                                              Appbrain   Appaware
# of apps                                     68115      18764
# of deleted apps                             20294      694
# of free apps                                42494      15262
% of free apps / available apps               89%        84%
# of installations                            474433     64327
# of deleted app installations                91369      1411
# of free app installations                   347992     53367
% of free app inst. / available app inst.     91%        85%
Average price of a paid app (AU$)             3.16       3.33
Median price of a paid app (AU$)              1.94       1.99

IV. USER CLUSTERING

To investigate the possibility of mapping users into stereotypical profiles using their app lists, we represented each user in a 30-dimensional space, where each dimension is the percentage of apps the user has installed of a particular category. E.g., the ith user Ui is represented as Ui = <p_i1, p_i2, ..., p_i30>, where p_ik, for k = 1..30, is the percentage of apps in the ith user's app list that belong to category k:

p_ik = (no. of user-installed apps in category k / total no. of user-installed apps) × 100

We then used Ward's method [11] for agglomerative hierarchical clustering, with the Euclidean distance as the distance measure between two users, to determine whether users form clusters. To decide the optimal number of clusters, we plotted the error sum of squares (SSE) against the number of clusters for both datasets. The optimal number of clusters is the number of clusters around the elbow point of the SSE vs. number-of-clusters graph.
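A minimal sketch of this clustering step, assuming the per-user category percentages p_ik are stacked into a matrix X of shape (number of users, 30); SciPy's Ward linkage is used here as a stand-in, since the paper does not name its exact tooling.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_users(X: np.ndarray, max_k: int = 14):
    """X: (n_users, 30) matrix of per-category percentages p_ik."""
    # Agglomerative hierarchical clustering, Ward's method, Euclidean distance.
    Z = linkage(X, method="ward")

    # Within-cluster error sum of squares (SSE) for each candidate number of
    # clusters, used to locate the elbow point as in Figure 5a.
    sse = {}
    for k in range(2, max_k + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        sse[k] = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                     for c in np.unique(labels))
    return Z, sse

# Six clusters were chosen at the elbow point for both datasets:
# labels = fcluster(Z, t=6, criterion="maxclust")
```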
In Figure 5a, the elbow point lies around 6 for both datasets, and thus 6 was selected as the optimal number of clusters.

Fig. 5: User clustering. (a) SSE vs. number of clusters; (b) % of users in each cluster.

It is necessary to determine whether these clusters have a one-to-one mapping across the two datasets, i.e., whether a cluster in the Appaware dataset is close to only one cluster in the Appbrain dataset and has large distances to the other Appbrain clusters. We first calculated the centroid of each cluster and then calculated the pairwise distances between the cluster centroids across the two datasets. It was found that each cluster in one dataset is quite close to only one cluster in the other dataset and is a significant distance away from the rest of the clusters. We show this pairwise distance matrix in Table V.

TABLE V: Adjacency matrix of cluster centroids (rows: Appaware; columns: Appbrain)
      c1     c2     c3     c4     c5     c6
c1    5.8    11.9   15.8   14.8   25.0   20.4
c2    11.9   5.2    17.0   13.8   15.7   17.6
c3    21.1   26.7   7.3    23.2   33.3   32.2
c4    20.7   19.3   24.9   7.2    25.7   25.8
c5    26.3   15.1   30.5   24.3   5.6    24.3
c6    15.1   15.6   23.8   17.7   26.8   12.4

Figure 5b shows that the percentage of users in each cluster is approximately the same in both datasets, with the exception of cluster 2. We discuss the reasons for this below. The analysis of the category-wise composition of the cluster centroids is shown in Table IV.²

TABLE IV: Cluster centroids (a.a: Appaware, a.b: Appbrain). The table gives the per-category percentage composition of each cluster centroid c1 to c6 for both datasets.

² For clarity, any category having less than 1.5% across all cluster centers was omitted.

The analysis showed that each cluster could be representing a type of user interest. The types of user interests identified are shown in Table VI.

TABLE VI: Clusters: Explanation
c1   Entertainment, Music and Books
c2   Balanced mix of all categories
c3   Game apps (Arcade, Brain and Sports games)
c4   Personalization apps
c5   Excessive no. of Tools apps
c6   Social and Communications apps

Cluster 1 users have more installed apps in the Music, Entertainment and Books categories. Cluster 2 users do not have specific features compared to the other clusters; therefore we call them the balanced users. This is also the cluster which showed the discrepancy highlighted above; however, due to the generality of the cluster, it is not possible to identify a specific reason for the discrepancy. The distinguishing feature of cluster 3 users is the high use of gaming apps (Arcade Games, Brain Games and Sports Games). There is a clear distinction between cluster 4 and the rest, as it represents users with higher numbers of apps in the Personalization category. The distinguishing feature of cluster 5 users is the larger number of apps that fall under the Tools category compared to the others, while cluster 6 users have the distinguishing feature of higher numbers of apps that fall under the Social and Communication categories. Due to the fact that we observed similar clustering across the two datasets collected from different sources, we conclude that these clusters represent generally distinguishable features of smartphone users.
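The cross-dataset consistency check above reduces to computing the centroid of each cluster in both datasets and the pairwise Euclidean distances between them. A minimal sketch, assuming the label arrays come from the clustering step sketched earlier:

```python
import numpy as np

def centroid_distance_matrix(X_a, labels_a, X_b, labels_b, k=6):
    """Pairwise Euclidean distances between cluster centroids of two datasets.

    X_a, X_b: (n_users, 30) percentage matrices; labels_a, labels_b: numpy arrays
    of labels 1..k (e.g. from fcluster). Entry (i, j) of the result is the distance
    between centroid i of dataset A and centroid j of dataset B, as in Table V.
    """
    cent_a = np.array([X_a[labels_a == c].mean(axis=0) for c in range(1, k + 1)])
    cent_b = np.array([X_b[labels_b == c].mean(axis=0) for c in range(1, k + 1)])
    return np.linalg.norm(cent_a[:, None, :] - cent_b[None, :, :], axis=2)
```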
Figure 6 shows the Euclidean distance matrix between the users before and after clustering for the Appaware dataset. It shows that clusters 2, 3, 4 and 5 are clearly distinguishable. However, clusters 1 and 6 tend to overlap with cluster 2. Cluster 2 represents the balanced users, and cluster 1 differs from it due to the high use of apps that fall under the Entertainment, Music and Books categories. On the other hand, cluster 2 users have more apps that fall under the Tools category, which minimizes the difference in Euclidean distance. Similarly, the difference between cluster 2 and cluster 6 is the higher use of apps in the Communication and Social categories in cluster 6. Nevertheless, cluster 6 has a lower number of apps in the Tools category, which reduces the distance between cluster 6 and cluster 2.

Fig. 6: Distance matrix. (a) Before clustering; (b) after clustering.

V. DECISION RULE ENGINE

The objective is to determine the possibility of personifying a user through the list of apps he/she has installed on the smartphone. In order to do this, it is necessary to infer fine-grained personal information about the user, such as religion, ethnicity/languages, relationship status and health conditions. Table VII shows some potential apps which can be used to derive personal information.

TABLE VII: Personal information by the presence of apps
Personal/Demographic Information   Example apps
Language/Ethnicity/Country         Presence of apps in languages other than English; presence of apps related to specific countries
Religion                           Catholic Chaplets 01, Catholic Rosary Quick Guide, Bible Topics, Lord Buddha Temple, Judaism 4 U, Krishnashtakam
Interests                          Sports: The Official ESPNcricinfo App, ESPN FC; Music: Instrumental Hip Hop Rap Beats, Jazz Radio
Relationship status                Local Dating, eHarmony
Health conditions                  Diabetes Diet, Stress Check by Azumio
Education                          ACCA Student Planner, CSAT UPSC Prep, Engineering Dictionary
Sexual orientation                 NearOx, DISTINC.TT

As the app space is large, a heuristic rule-based decision engine was used. For each app, the app description and the user reviews in Google Playstore were used to extract semantic information about the app, namely the language and the concepts given by the natural language processing API Alchemy [12]. Concepts tag the apps using keywords in the app description and the user reviews. For example, a description/review containing the keywords BMW, Ferrari and Porsche will be tagged as "automotive industry". This information is used with a set of rules to extract information about the app, such as the religion and language associated with it. The users who have these apps installed were then characterized using this information. The rules used to extract the information are described below. The analysis used the merged, pre-processed Appaware and Appbrain datasets, which consist of 8068 users.
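A minimal sketch of such a rule-engine skeleton, using plain keyword matching against pre-defined lists as a simplified stand-in for the Alchemy concept extraction; the list contents and helper names are illustrative, not the ones used in the paper.

```python
# Illustrative tag lists; the paper uses complete lists of countries [13], sports [14],
# and the top-10 religions, which are not reproduced here.
COUNTRIES = {"australia", "germany", "italy"}
RELIGIONS = {"islam", "hinduism", "christianity", "buddhism"}
TAG_LISTS = {"countries": COUNTRIES, "religions": RELIGIONS}

def tag_app(description: str) -> dict:
    """Tag one app from its store description/review text (stand-in for concept extraction)."""
    words = set(description.lower().split())
    return {kind: words & vocab for kind, vocab in TAG_LISTS.items()}

def infer_user(app_descriptions: list, min_apps: int = 2) -> dict:
    """Apply presence-based rules over one user's app list.

    Apps carrying several tags of the same kind are ignored to avoid ambiguity, and a
    kind is reported only when more than one app carries such a tag, mirroring the
    threshold used by the language and country rules described below.
    """
    tagged = {kind: [] for kind in TAG_LISTS}
    for desc in app_descriptions:
        for kind, values in tag_app(desc).items():
            if len(values) == 1:                     # discard ambiguous multi-tag apps
                tagged[kind].append(next(iter(values)))
    return {kind: set(vals) if len(vals) >= min_apps else set()
            for kind, vals in tagged.items()}
```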
A. Language/Ethnicity

Most of the apps installed by the users (63.5%) were identified as English. The next top-5 languages associated with apps are shown in Table VIII. Apps for which a language could not be identified were not used in the analysis. All the apps that could be associated with a language were distributed among 47 different languages. Our language rule scans through a user's app list and, if any non-English app is found, flags the languages of those apps as the potential languages the user speaks.

TABLE VIII: Top-5 languages by the number of users
Language     # of users   % of total users
Arabic       236          2.93%
German       172          2.13%
Italian      153          1.90%
Spanish      109          1.35%
Portuguese   83           1.03%

Figure 7a shows the number of non-English apps installed by users. It shows that 13% of the users have only one non-English app, while 5% of the users have two non-English apps, 3% have three non-English apps, and so on. Due to the large difference between the percentages of users having only one and two non-English apps, we removed the group having only one non-English app from our analysis, because it can lead to errors in the decision making, as discussed in Section VI. For a user who has apps associated with Italian and German, it is not possible to definitively determine the native language of the user. Figure 7b shows the ambiguity that arises from having apps in multiple non-English languages. As can be seen, the majority of the users have apps associated with only one language. Thus it is possible to identify the native language spoken by the user. Using this observation, the following rule was used to determine the language of a user.

Fig. 7: Percentages of users with language information. (a) % of users having non-English apps; (b) % of users with apps in multiple languages.

language rule - if the user has more than one non-English app, list the languages of those apps as the possible languages the user speaks.
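A minimal sketch of the language rule, assuming each installed app has already been labelled with a language during the text-analysis step (that labelling is not shown here).

```python
def language_rule(app_languages, min_non_english=2):
    """app_languages: one language label per installed app, e.g. ["English", "German", None].

    Per the rule above, a user is flagged only when more than one non-English app is
    present; the languages of those apps are returned as the user's possible languages.
    """
    non_english = [lang for lang in app_languages if lang and lang != "English"]
    if len(non_english) < min_non_english:
        return set()              # too little evidence: no inference for this user
    return set(non_english)

# Example: language_rule(["English", "German", "German"]) -> {"German"}
```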
When analyzing the dataset using the above rule, we could infer the language of 25% of the users in our dataset. This is summarized in Table VIII, which provides the top-5 languages by the number of users and the percentage they represent of the total number of users in our dataset.

B. Country

While the country of residence of smartphone users can be extracted from network information such as the IP address or the telecom operator, it is not possible to determine other information such as the country of origin or the countries that users have a specific interest in. To obtain this information, for each app we extracted a list of concepts using the Alchemy API and looked for the presence of country names by comparing them with a pre-defined list of all the countries in the world, available at [13]. In the case of a match, the app was associated with the matching country.

country rule - if the user has more than one app with a country name tag, list those countries as the possible countries the user has associations with.

We observed that some apps have multiple country names as keywords, and we ignored these apps in our analysis to avoid ambiguity. Figure 8a shows the percentage of users against the number of apps with country information. Again, we discarded the users having only one app with country information, for the same reason described above. Figure 8b shows the number of countries identified for the users who have more than one country-associated app. It shows that the majority of these users are associated with one or two countries. Applying this rule to the dataset, we were able to infer the associated countries for 8% of the users.

Fig. 8: Percentages of users with country information. (a) % of users having apps with country information; (b) % of users associated with multiple countries.

C. Religion

Religion was associated with an app using the same principles used to associate country information with an app. A pre-defined list of the top 10 religions, based on the number of followers globally, was used. If any app with an associated religion was in a user's app list, we inferred the user to be a believer of that religion. In this analysis, 25 apps had two religious names associated with them; we removed those apps to eliminate ambiguity.

religion rule - if the user has apps with religious names associated with them, flag those religions as potential religious beliefs of that user.

As expected, most app lists contained apps that have an association with only one religion. We found 40 users who had apps with different religious flags, and we removed these users from our analysis. In Table IX we list a summary of our findings. We were able to infer the religions of 14% of the users in our dataset.

TABLE IX: Users by religion
Religion       # of users   % of total users
Islam          869          10.77%
Hinduism       140          1.74%
Christianity   93           1.15%
Buddhism       19           0.24%
Other          9            0.11%

D. Sport interests

The sport interest rule extracts information about the sporting interests of the users.
Similarly to the previous inferences, we checked for apps having sports names as concepts by comparing the concepts of each app with a list of globally popular sports, available at [14]. We ignored gaming apps, because playing such games may not necessarily mean that the user is a fan of that sport. The rule is aimed at identifying apps which provide sport-related news and/or sports scores.

sport interest rule - if the user has non-game apps which have sports names, list those sports as the sports the user is interested in.

Again, apps having more than one sports name associated with them were discarded to avoid ambiguity. Using this rule we could infer the sports interests of 4% of the users; the results are summarized in Table X.

TABLE X: Users by sport interest
Sport      # of users   % of total users
Cricket    168          2.08%
Cycling    47           0.58%
Football   41           0.51%
Baseball   30           0.37%
Golf       30           0.37%

E. Relationship status

The relationship rule checks for the availability of the concept "dating" associated with an app. We identify users with dating apps installed as single.

relationship rule - if the user has dating apps, list that user as single.

By applying this rule we were able to identify 8% of users as single.

VI. RESULTS

Since it was not possible to establish the "ground truth" for the two crawled datasets, results from 18 volunteer Android users who installed Apptronomy and 16 iPhone users who gave us their app lists were used to validate the results. A summary of the test datasets is shown in Table XI. Due to privacy concerns, the volunteers were not asked for any personal information other than their country of residence and their native language. The methodology used to validate the results is schematically shown in Figure 9.

TABLE XI: Summary of the ground truth datasets
                         Android   iOS
# of users               18        16
# of apps                417       419
# of installations       576       546
Average # of apps/user   32        34
Median # of apps/user    28        22

Fig. 9: Ground truth data collection process. 1) The app list is collected from the Android/iOS smartphone; 2) the app description and user reviews are extracted from Google Playstore/Apple Appstore; 3) text analysis with the Alchemy API yields the language and keywords; 4) the decision rule engine produces inferences that form a user profile; 5) cross validation.

We applied the decision rules to the test dataset and calculated the performance metrics precision and recall for the original language rule and country rule, as well as for relaxed versions of the rules in which the presence of a single app is sufficient for the decision, since we had the ground truth. Since our dataset was limited and the Alchemy API also returned a very limited number of words, we extracted keywords from the app market text instead of concepts. Table XII summarizes the results.

TABLE XII: Summary of results
                       Language                Country
Platform               Precision   Recall      Precision   Recall
Android (≥ 1 app)      42%         21%         100%        18%
Android (> 1 app)      100%        17%         100%        3%
iOS (≥ 1 app)          100%        30%         45%         20%
iOS (> 1 app)          100%        20%         100%        4%

The results show a higher precision and a lower recall. The drop in precision when the more-than-one-app requirement is relaxed is due to the Alchemy API identifying apps as being associated with a country when in reality they were not. This shows that it is necessary to have more than one app with language association information to make an accurate inference. The low recall is because the Alchemy API did not provide country name associations for some apps which were country specific. This is possibly due to the app description text and the user reviews in the app market not containing any country information. Another reason was that there were some languages which were not identified by the Alchemy API. We discuss how the results can be further improved in Section VII.
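Since the rules produce a set of candidate values per user (or no decision at all), precision and recall can be computed per attribute against the volunteers' answers. Below is a minimal sketch of one reasonable way to score this, not necessarily the exact procedure used in the paper; the dictionary layout is hypothetical.

```python
def precision_recall(inferred: dict, truth: dict):
    """inferred: user_id -> set of inferred values (empty set means no decision);
    truth:    user_id -> the user's actual value from the questionnaire."""
    decided = {u: vals for u, vals in inferred.items() if vals}
    correct = sum(1 for u, vals in decided.items() if truth.get(u) in vals)
    precision = correct / len(decided) if decided else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Example: precision_recall({"u1": {"German"}, "u2": set()}, {"u1": "German", "u2": "Italian"})
# returns (1.0, 0.5)
```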
VII. DISCUSSION

There are a number of implications, both positive and possibly negative, of the inferences that can be made from a user's installed app list.

A. Characterization

The characterization of the number of installed apps of smartphone users can be beneficial for dimensioning smartphone hardware such as the screen, storage and memory, for arranging apps across multiple screens, and for designing app pre-launching methods, as has already been suggested [15]. In contrast, the presence of deleted apps on a user's smartphone poses a potential security problem, as the users may be using apps that have been deemed malicious or are malfunctioning. The evaluation of app availability and actual user installations by category shows that some app categories have an excess number of apps, namely Entertainment and Personalization, whilst some other categories, especially Productivity and Communications, have a shortage. This provides insights for app developers for future app development. The characterization of user behavior with respect to free and paid apps can provide useful information for advertisers to generate targeted advertisements: it may indicate users who have a higher probability of making online purchases.

B. Clustering

It is possible to cluster users using well-known clustering techniques. One obvious implication is that knowledge of which cluster a user belongs to gives a very good indication of the types of apps that app makers should recommend to the user. This information can be further exploited to provide cross-domain product recommendations by combining it with secondary information, such as users' movie preferences and shopping patterns. Thus, this can be beneficial for both users and various service providers.

C. Personification

The results show that, using simple decision rules, it is possible to predict fine-grained personal information such as language, ethnicity and interests. This information can be used for the benefit of the user in applications such as UI personalization, micro-targeted advertising and recommendations of various kinds. Thus knowledge of the app list can be seen as a means of instantly building a user profile, compared to user tracking techniques, which can be expensive and time consuming. On the other hand, we showed that it is possible to infer personal information such as religion, relationship status, health conditions and sexual orientation by observing the apps users install. More importantly, this information can be obtained without the consent of the user. As these types of inferences can easily be made and misused by third parties, this is definitely a violation of users' privacy.

D. Limitations

The primary limitation of the proposed method is that it works only when a diverse range of apps is installed. For smartphone users who only use the pre-installed apps and a very limited number of popular apps, the method fails. We believe that at best this is a temporary limitation: as users' familiarity with app markets increases, the majority of users will have a diverse range of apps. Another potential limitation arises because the presence of an app is not a definitive indicator of a user's interest, as the user might not be using the app, or it might have been installed by a family member. We believe that this is not a real limitation, as the decision to install an app, or to allow an app to be installed, is in itself an indication that the user has at least some interest associated with the app.

E. Possible improvements

The classification methods and the rules used in this paper are simple and straightforward. Despite this, the results are usable.
It is likely that it will be possible to optimize the classification techniques, and that developing a more sophisticated set of rules will yield better results. However, the utility of using a more sophisticated set of rules is unclear and needs further investigation. For example, the rules used simply checked for the presence of certain keywords. However, the keyword list returned by the text processing APIs might not always contain these keywords, but rather a set of other keywords associated with the topic. Thus a term-similarity-based document classification method would increase the probability of identifying apps related to certain topics. With such a method, precision may be reduced but recall will be increased.

VIII. RELATED WORK

Various previous works have looked at the possibility of deriving personality traits by mining smartphone data. Chittaranjan et al. [16] predicted smartphone users' Big Five personality [17] by mining the app usage, call and SMS usage, and Bluetooth proximity information of 83 smartphone users, and evaluated the results by comparing the predicted personality with the personality derived from the users' answers to the Ten Item Personality Inventory (TIPI) [18]. Similarly, LiKamWa et al. [19] used contextual information such as browsed websites, used apps and contact logs (SMS, voice and email) of 32 participants over a period of two months to infer users' mood. Privacy leakage due to third-party ad libraries collecting user information through over-permission (i.e., asking for permissions which are not required for the app's function) has been studied in multiple works [20] [4] [21] [22]. However, this information leakage is always under the control of the users, as they can avoid installing apps which ask for additional permissions. Shepard et al. [23] and Böhmer et al. [24] studied how app usage depends on contextual variables such as location, time of day and day of week, and multiple works have used this information to predict users' future app usage [25], [15]. Pan et al. [26] collected app installation logs, call logs and data on nearby Bluetooth devices of 55 smartphone users for 5 months. This information, together with externally collected information such as friendship and affiliation, was used to predict users' future app installations using supervised learning. However, they did not study app installation patterns based on app categories or the personal information that can be derived. We differentiate our work from previous work in that we use only a single snapshot of the installed list of apps, which is accessible to any app developer without the knowledge of the user. Profiling users based on app usage might give more accurate results; however, it requires the consent of the user and longer data collection periods. The use of tracking [27] to identify which websites users visit in order to infer interests also needs observations over longer periods. Thus our personification method provides an easy means of building user profiles.

IX. CONCLUSION

This paper presented, to the best of our knowledge, the first large-scale study of user-installed Android apps and related basic statistics. It then analyzed the information that can be derived from the list of installed apps, which is accessible through any installed app on the smartphone. The analysis showed that smartphone users form clusters based on the characteristics of the apps they install.
Further, access to app lists enables the extraction of finer-grained information about the smartphone user, such as spoken languages, religion and interests, using simple rules. The viability of the rules used was demonstrated using ground truth data from 34 volunteer smartphone users. Despite their simplicity, it was shown that detailed information about users could be inferred with a precision of over 40%. Finally, the opportunities this kind of analysis provides, and the threats it poses, were highlighted. We believe that the accuracy of the inferences can be significantly improved through the use of more sophisticated classification techniques, which we intend to explore. Therefore, we believe that app lists can be effectively used to personify users. We plan to expand our own dataset with ground truth and to release our current dataset to the research community to explore some of the opportunities this type of analysis provides.

REFERENCES

[1] eMarketer, "Smartphone adoption tips past 50% in major markets worldwide," http://www.emarketer.com/Article/SmartphoneAdoptionTipsPast5MajorMarketsWorldwide/1009923, 2013.
[2] A. Cocotas, "Android grabs a record share of the global smartphone market," http://au.businessinsider.com/androidbouncesbacktoarecordquarter20135, 2013.
[3] mobiThinking, "Global mobile statistics 2013 section E: Mobile apps, app stores, pricing and failure rates," http://mobithinking.com/mobilemarketingtools/latestmobilestats/e, 2013.
[4] M. Grace, W. Zhou, X. Jiang, and A.-R. Sadeghi, "Unsafe exposure analysis of mobile in-app advertisements," in Proc. of the 5th ACM WiSec, 2012, pp. 101–112.
[5] D. Amitay, "iOS app detection," http://www.ihasapp.com, 2012.
[6] Appbrain, "Top Android apps and games in the Android market," http://www.appbrain.com, 2013.
[7] Appaware, "Top Android apps and games today on appaware.com," http://www.appaware.com, 2013.
[8] F. Mosteller and J. W. Tukey, Data Analysis and Regression: A Second Course in Statistics, Addison-Wesley Series in Behavioral Science: Quantitative Methods. Reading, Mass.: Addison-Wesley, 1977, vol. 1.
[9] Nielsen, "Application: what a difference a year makes," http://www.nielsen.com/content/dam/corporate/us/en/newswire/uploads/2012/05/appnation-what-has-changed.png, 2012.
[10] N. d'Heureuse, F. Huici, M. Arumaithurai, M. Ahmed, K. Papagiannaki, and S. Niccolini, "What's app?: A wide-scale measurement study of smart phone markets," SIGMOBILE Mobile Computing and Communications Review, vol. 16, no. 2, Nov. 2012.
[11] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, 1963.
[12] "Alchemy API," http://www.alchemyapi.com, 2013.
[13] M. Gifford, "Text list of all countries in the world," http://openconcept.ca/blog/mgifford/text-list-all-countries-world, 2007.
[14] Topendsports, "List of sports from around the world," http://www.topendsports.com/sport/sport-list.htm, 2013.
[15] C. Zhang, X. Ding, G. Chen, K. Huang, X. Ma, and B. Yan, "Nihao: A predictive smartphone application launcher," 2013, vol. 110 LNICST.
[16] G. Chittaranjan, B. Jan, and D. Gatica-Perez, "Who's who with Big-Five: Analyzing and classifying personality traits with smartphones," in Proc. ISWC, 2011, pp. 29–36.
[17] R. R. McCrae and O. P. John, "An introduction to the five-factor model and its applications," Journal of Personality, vol. 60, no. 2, 1992.
[18] S. D. Gosling, P. J. Rentfrow, and W. B. Swann Jr., "A very brief measure of the Big-Five personality domains," Journal of Research in Personality, vol. 37, no. 6, pp. 504–528, 2003.
[19] R. LiKamWa, Y. Liu, N. D. Lane, and L. Zhong, "MoodScope: Building a mood sensor from smartphone usage patterns," in Proc. of the 11th MobiSys '13, 2013.
[20] I. Leontiadis, C. Efstratiou, M. Picone, and C. Mascolo, "Don't kill my ads!: Balancing privacy in an ad-supported mobile application market," in Proc. of the 12th HotMobile '12. ACM, 2012, pp. 2:1–2:6.
[21] P. Pearce, A. P. Felt, G. Nunez, and D. Wagner, "AdDroid: Privilege separation for applications and advertisers in Android," in Proc. of the 7th ASIACCS '12. ACM, 2012, pp. 71–72.
[22] S. Shekhar, M. Dietz, and D. S. Wallach, "AdSplit: Separating smartphone advertising from applications," in Proceedings of the 21st USENIX Conference on Security Symposium, ser. Security '12, 2012, pp. 28–28.
[23] C. Shepard, A. Rahmati, C. Tossell, L. Zhong, and P. Kortum, "LiveLab: Measuring wireless networks and smartphone users in the field," SIGMETRICS Perform. Eval. Rev., vol. 38, no. 3, pp. 15–20, Jan. 2011.
[24] M. Böhmer, B. Hecht, J. Schöning, A. Krüger, and G. Bauer, "Falling asleep with Angry Birds, Facebook and Kindle: A large scale study on mobile application usage," in Proceedings of the 13th MobileHCI '11, 2011.
[25] T. Yan, D. Chu, D. Ganesan, A. Kansal, and J. Liu, "Fast app launching for mobile devices using predictive user context," in Proc. of the 10th MobiSys '12, 2012, pp. 113–126.
[26] W. Pan, N. Aharony, and A. Pentland, "Composite social network for predicting mobile apps installation," in Proc. of the National Conference on Artificial Intelligence, vol. 1, 2011, pp. 821–827.
[27] S. Han, J. Jung, and D. Wetherall, "A study of third-party tracking by mobile apps in the wild," University of Washington, Tech. Rep. UW-CSE-12-03-01.