APP

Ming Liu
Harbin Institute of Technology
School of Computer Science and
Technology
Background
• Many apps are released to help users make the best use of their phones.
• Faced with the massive number of available apps, users can rely on app retrieval and app recommendation to find the apps they want.
• Most recent methods depend on user logs or on latent contextual similarity between apps.
• They can only detect whether two apps are downloaded or installed together, or whether they provide similar functions.
APP Relationship
• Apps can have deeper relationships, for example when one app needs another app to cooperate with it to complete its task.
• Such a relationship cannot be discovered from user logs or the latent contexts of apps alone.
• Example: “Hotels.com” and “Alipay”.
  • https://play.google.com/store/apps/details?id=com.hcom.android&hl=zh_CN
  • https://play.google.com/store/apps/details?id=com.alipay.android.client.pad&hl=zh_CN
The Role of Reviews
• Reviews contain useful information about apps, such as users' viewpoints.
• Given two apps, app1 and app2: if users in a review of app1 ask for a service that app1 cannot provide, and another review of app2 states that app2 provides this service, then app1 and app2 are possibly relevant.
• This relationship is not similarity.
Challenges
• Reviews are too short to be exploited fully.
• Most reviews do not describe apps directly but only contain users' viewpoints, so it is hard to extract attributes from them.
Our approach: an iterative process that combines review similarity and app relationship in a single calculation.
Related Work
• An app is just an entity, so the methods for calculating entity relationship can be applied directly to app relationship.
• Dictionary-based methods (sometimes called knowledge-based methods) rely on professional thesauri to extract attributes that define the relationship among entities (or apps).
• Statistics-based methods (sometimes called corpus-based methods) mine the relationship among entities from a large-scale corpus.
Limitations
• Dictionary-based methods
  • With a hierarchical structure (e.g. WordNet), the relationship between entities can easily be read from their positions in the hierarchy.
  • Most current thesauri do not include apps as terms, so attributes that represent apps cannot be extracted from them.
• Statistics-based methods
  • They seldom run into missing-data problems.
  • They rely on contextual similarity computed from attributes extracted from a corpus, so they can only detect entity similarity.
Organization
• We use a matrix M to organize reviews and apps, and represent apps and reviews by app vectors and review vectors respectively.
• M is an n×k matrix built with the Vector Space Model. Each column of M corresponds to one app in APP. Each row of M corresponds to one review in RC.
• Each entry of these vectors is weighted by tf-idf, an effective and efficient statistical metric.
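As a minimal sketch of how such a matrix could be assembled, the snippet below builds an n×k tf-idf matrix from per-review app counts. The input format (`review_app_counts`), the function name, and the particular tf-idf variant (relative frequency times `log(n/df) + 1`) are assumptions made for illustration; the slides do not state which variant is used.

```python
import math

def build_tfidf_matrix(review_app_counts, k):
    """Build the n x k matrix M: rows are reviews, columns are apps.

    review_app_counts[i] maps an app index to how often review i refers
    to that app (hypothetical input format).
    """
    n = len(review_app_counts)
    # Document frequency: in how many reviews each app occurs.
    df = [0] * k
    for counts in review_app_counts:
        for app in counts:
            df[app] += 1
    M = [[0.0] * k for _ in range(n)]
    for i, counts in enumerate(review_app_counts):
        total = sum(counts.values())
        for app, cnt in counts.items():
            tf = cnt / total
            idf = math.log(n / df[app]) + 1.0  # assumed smoothing variant
            M[i][app] = tf * idf
    return M

# Three reviews, two apps.
M = build_tfidf_matrix([{0: 2}, {0: 1, 1: 1}, {1: 3}], k=2)
```

Column p of `M` is then the app vector V(app_p), and row i is the review vector V(rc_i).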
APP Relationship Calculation
• The straightforward way to calculate the relationship between two apps (e.g. $app_p$ and $app_q$) is to take the inner product of their app vectors:

$$R(app_p, app_q) = R\big(V(app_p), V(app_q)\big)$$

$$R(app_p, app_q) = \sum_{c=1}^{n} tf_{cp} \cdot tf_{cq}$$

$tf_{cp}$ and $tf_{cq}$ denote the values of the $c$-th entry in $V(app_p)$ and $V(app_q)$ respectively.
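The sum over reviews above can be sketched directly in code. The function name and the toy matrix are illustrative assumptions; only the inner-product formula comes from the slide.

```python
def app_relationship(M):
    """R[p][q] = sum over reviews c of M[c][p] * M[c][q]."""
    n, k = len(M), len(M[0])
    return [[sum(M[c][p] * M[c][q] for c in range(n)) for q in range(k)]
            for p in range(k)]

# Rows are reviews, columns are apps (toy values, not real tf-idf weights).
M = [[1.0, 0.5, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 3.0]]
R = app_relationship(M)
```

Here `R[0][1]` is 0.5 because apps 0 and 1 co-occur in the first review, while `R[0][2]` is 0.0 because apps 0 and 2 never share a review.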
Drawback
• The previous equation rests on the idea that two apps that frequently appear in the same reviews are likely to be similar.
• Consequence: many relationship values are close to 0.
• Reason: many apps share no common reviews even when they are really similar.
Expand
• Reviews often share topic similarity.
• For example, take two reviews $rc_i$ and $rc_j$ that correspond to $app_p$ and $app_q$ respectively. If users in $rc_i$ ask for a service that $app_p$ does not provide, and users in $rc_j$ state that $app_q$ provides this service, then $rc_i$ and $rc_j$ are topic similar.
• It follows that if two apps frequently appear in topic-similar reviews, the two apps are relevant.
Results

$$R(app_p, app_q) = R\big(V'(app_p), V'(app_q)\big)$$

• $V'(app_p)$ and $V'(app_q)$ denote the two app vectors after expansion by topic similarity among reviews. Their $c$-th entries are calculated by

$$tf'_{cp} = \sum_{g=1}^{n} tf_{gp}\, Q_{gc}; \qquad tf'_{cq} = \sum_{g=1}^{n} tf_{gq}\, Q_{gc}$$

$$Q_{gc} = \frac{Sim(rc_g, rc_c)}{\sum_{l=1}^{n} Sim(rc_l, rc_c)}$$
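A small sketch of this expansion, under the assumption that the review similarity is given as an n×n list of lists `Sim`; the function names and the zero-column guard are illustrative choices, not part of the slides.

```python
def column_normalize(S):
    """Q[g][c] = S[g][c] / sum_l S[l][c]; all-zero columns stay zero."""
    n = len(S)
    col_sums = [sum(S[l][c] for l in range(n)) for c in range(n)]
    return [[S[g][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n)]
            for g in range(n)]

def expand_app_vectors(M, Sim):
    """Return M' with M'[c][p] = sum_g M[g][p] * Q[g][c]."""
    n, k = len(M), len(M[0])
    Q = column_normalize(Sim)
    return [[sum(M[g][p] * Q[g][c] for g in range(n)) for p in range(k)]
            for c in range(n)]

# Two reviews, two apps; each app appears in exactly one review.
M = [[1.0, 0.0],
     [0.0, 1.0]]
Sim = [[1.0, 0.5],
       [0.5, 1.0]]          # the two reviews are partly topic similar
M_exp = expand_app_vectors(M, Sim)
# Inner product of the expanded app vectors (columns 0 and 1 of M_exp).
r_expanded = sum(row[0] * row[1] for row in M_exp)
```

Before expansion the two apps share no review, so their inner product is 0. After expansion `r_expanded` is 4/9, because the topic similarity between the two reviews spreads each app's weight into the other review.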
Review Similarity
• To evaluate the previous equations, topic similarity among reviews must be calculated beforehand.
• As with app relationship, vector-based measurements can be applied directly to calculate topic similarity:

$$Sim(rc_i, rc_j) = Sim\big(V(rc_i), V(rc_j)\big)$$

$$Sim(rc_i, rc_j) = \sum_{c=1}^{k} tf_{ic} \cdot tf_{jc}$$
Expand
• Just as reviews share topic similarity, apps have semantic relationships among them, so two reviews that share no common apps may still present a similar topic.

$$Sim(rc_i, rc_j) = Sim\big(V'(rc_i), V'(rc_j)\big)$$

$$tf'_{ic} = \sum_{g=1}^{k} tf_{ig}\, P_{gc}; \qquad tf'_{jc} = \sum_{g=1}^{k} tf_{jg}\, P_{gc}$$

$$P_{gc} = \frac{R(app_g, app_c)}{\sum_{l=1}^{k} R(app_l, app_c)}$$
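The review-side expansion mirrors the app-side one. The sketch below assumes the app relationship is available as a k×k list of lists `R`; function names and toy values are illustrative.

```python
def review_similarity(M):
    """Sim[i][j] = sum over apps c of M[i][c] * M[j][c]."""
    n, k = len(M), len(M[0])
    return [[sum(M[i][c] * M[j][c] for c in range(k)) for j in range(n)]
            for i in range(n)]

def expand_review_vectors(M, R):
    """Return M' with M'[i][c] = sum_g M[i][g] * P[g][c],
    where P[g][c] = R[g][c] / sum_l R[l][c] (zero columns stay zero)."""
    n, k = len(M), len(R)
    col_sums = [sum(R[l][c] for l in range(k)) for c in range(k)]
    P = [[R[g][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(k)]
         for g in range(k)]
    return [[sum(M[i][g] * P[g][c] for g in range(k)) for c in range(k)]
            for i in range(n)]

M = [[1.0, 0.0],            # review 0 only refers to app 0
     [0.0, 1.0]]            # review 1 only refers to app 1
R = [[1.0, 0.5],
     [0.5, 1.0]]            # the two apps are partly related
sim_plain = review_similarity(M)
sim_expanded = review_similarity(expand_review_vectors(M, R))
```

`sim_plain[0][1]` is 0.0 because the reviews share no app, whereas `sim_expanded[0][1]` is 4/9 once the app relationship is taken into account.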
Assumption
• The calculations of app relationship and review similarity depend on each other.
• To calculate review similarity, app relationship must be known in advance: given $rc_i$ and $rc_j$, calculating $Sim(rc_i, rc_j)$ requires $P_{gc}$, which is formed from the app relationship between $app_g$ and $app_c$. Conversely, to calculate app relationship, review similarity must be known in advance.
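This mutual dependence suggests an alternating loop. The following is a minimal sketch, assuming that the loop starts from the plain inner-product review similarity, that the original matrix M is re-expanded at every step, and that a fixed number of steps is run; none of these details is specified on the slides, and all function names are hypothetical.

```python
def gram(rows):
    """Inner products between all pairs of row vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]

def transpose(A):
    return [list(col) for col in zip(*A)]

def col_normalize(S):
    """Divide every column by its sum; all-zero columns stay zero."""
    sums = [sum(col) for col in zip(*S)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in S]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def iterate(M, steps=5):
    """Alternate between app relationship R and review similarity Sim."""
    Sim = gram(M)                          # initial Sim from raw review vectors
    R = gram(transpose(M))                 # returned unchanged if steps == 0
    for _ in range(steps):
        Q = col_normalize(Sim)
        M_app = matmul(transpose(Q), M)    # tf'_cp = sum_g tf_gp * Q_gc
        R = gram(transpose(M_app))         # R from expanded app vectors
        P = col_normalize(R)
        M_rev = matmul(M, P)               # tf'_ic = sum_g tf_ig * P_gc
        Sim = gram(M_rev)                  # Sim from expanded review vectors
    return R, Sim

M = [[1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
R_plain = gram(transpose(M))
R_iter, Sim_iter = iterate(M, steps=3)
```

In this toy matrix apps 0 and 2 never share a review, so `R_plain[0][2]` is 0, while `R_iter[0][2]` becomes positive because app 1 links the reviews in which they appear.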
Simulation Results
[Table: example app pairs labelled similar, relevant, or irrelevant. Apps shown: Google Map, Baidu Map, Booking.com, Alipay, Calculator, PGA Tour, Sohu Video, Youku Video, Effective Weight Loss, Nike + Running, MX Player, Chrome, Neuro Desktop, AutoCAD, Cameringo Demo, Trip Advisor, mWeather, Football 2014, Medical Directory, MediDiary Basic, Photo Art Studio, Pic Frames, Fun Weight Loss, Amazon, English Dictionary, Dictionary Word Web, Discount Calculator, Ebay, Change Voice, Tube.]
Two Ways
• App relationship obtained by the two ways
[Figure: two panels, one per way. x-axis: Iterative Steps (0 to 14); y-axis: Relationship between APP Pairs (0.0 to 1.0); one curve per app pair: S/1 to S/5, R/1 to R/5, I/1 to I/5.]
Reasons for the Observations
• Our process combines app relationship and review similarity in one iterative calculation.
• App relationship is mined from reviews and then guides the calculation of review similarity.
• Review similarity is found through the relationship among apps and then guides the calculation of app relationship.
• Through this iterative process, deep relationships among apps can be obtained.
Initialization
• To run our two-way-alternative process, we need to set one initial parameter: either the initial app relationship $R_0(app_p, app_q)$ or the initial review similarity $Sim_0(rc_i, rc_j)$. (initial parameters)
• We also need to choose the measurement used to calculate app relationship and review similarity from app vectors and review vectors, that is, to decide how $Sim(rc_i, rc_j)$ and $R(app_p, app_q)$ are calculated. (initial measurement)
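One way to picture the two starting points is to run the same alternation twice, once seeded with an initial app relationship and once with an initial review similarity, and record the value for one app pair at each step. The sketch below assumes that this is what the two ways refer to, uses the plain inner product as the initial measurement, and introduces hypothetical helper names throughout.

```python
def gram(rows):
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]

def transpose(A):
    return [list(col) for col in zip(*A)]

def col_normalize(S):
    sums = [sum(col) for col in zip(*S)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in S]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def app_step(M, Sim):
    """App relationship from app vectors expanded by review similarity."""
    return gram(transpose(matmul(transpose(col_normalize(Sim)), M)))

def review_step(M, R):
    """Review similarity from review vectors expanded by app relationship."""
    return gram(matmul(M, col_normalize(R)))

def track_from_R0(M, R0, p, q, steps):
    """Seed with an initial app relationship; record R[p][q] per step."""
    R, track = R0, []
    for _ in range(steps):
        R = app_step(M, review_step(M, R))
        track.append(R[p][q])
    return track

def track_from_Sim0(M, Sim0, p, q, steps):
    """Seed with an initial review similarity; record R[p][q] per step."""
    Sim, track = Sim0, []
    for _ in range(steps):
        R = app_step(M, Sim)
        track.append(R[p][q])
        Sim = review_step(M, R)
    return track

M = [[1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
way1 = track_from_R0(M, gram(transpose(M)), p=0, q=2, steps=3)
way2 = track_from_Sim0(M, gram(M), p=0, q=2, steps=3)
```

Plotting `way1` and `way2` against the step index would give two tracks for the same app pair, one per starting point.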
Data Sets and Metrics
• Miller-65: 65 entity (or word) pairs selected from WordNet whose relationship values have already been defined by experts.
  • Metrics: Pearson correlation and Spearman correlation
• APP Collection: 1000 apps collected from Google Play to form an artificial test collection.
  • Metric: F1
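For reference, the three metrics can be sketched in plain Python. How F1 is applied to the APP Collection is not described on the slides, so `f1_score` below simply compares a predicted set of items with a gold set; that framing, like the function names, is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def f1_score(predicted, gold):
    """F1 between a predicted set and a gold set of items."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, `spearman([1, 2, 3, 4], [1, 4, 9, 16])` is 1 up to rounding because the two rankings agree exactly, even though the relationship is not linear.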
Results when varying the initial parameters while fixing Cosine as the initial measurement
Miller-65
APP Collection
TWI1
Pearson
HCT-HCG
Cosine
CH
TLDA
Cucerzan
TPCA
Cucerzan
KL
GMEL
TLSI
ESBM
Euclidean
ESBM
LC
ELPM
WC
Snowball
TLSI
ERPM
TSR
83.52
88.08
70.81
68.72
84.11
66.35
68.48
TWI1
F1
TWI2
F1
53.78
54.69
67.54
HCT-HCG
Cosine
86.16
CH
TLDA
56.91
84.35
80.21
Cucerzan
TPCA
69.34
79.68
65.57
63.58
66.83
Cucerzan
KL
68.77
81.13
82.33
GMEL
TLSI
65.64
ESBM
Euclidean
61.56
60.83
80.21
ESBM
LC
65.43
76.73
82.31
81.88
TWI2
Spearman
82.14
87.01
69.35
67.25
82.56
64.78
67.13
Pearson
68.89
87.41
81.81
68.53
83.75
67.19
81.73
Spearman
85.39
83.72
85.03
83.39
ELPM
WC
85.22
83.86
82.12
80.44
Snowball
TLSI
82.45
79.76
84.46
83.03
84.49
83.02
ERPM
TSR
81.63
81.14
Results when fixing Cosine to set the initial parameters while varying the initial measurement used to calculate vectors

Initial        Miller-65                                APP Collection
measurement    TWI1                TWI2                 TWI1     TWI2
               Pearson  Spearman   Pearson  Spearman    F1       F1
Cosine         68.11    66.63      65.94    64.22       64.83    62.84
Euclidean      67.67    66.15      65.42    63.78       64.31    62.37
KL             68.59    67.05      66.31    64.67       65.22    63.25
TLDA           68.78    67.30      66.52    64.83       65.48    63.47
TPCA           68.75    67.26      66.47    64.76       65.52    63.33
TLSI           68.93    67.40      66.69    64.98       65.76    63.74
LC             68.85    67.31      66.52    64.83       65.71    63.57
WC             69.03    67.47      66.71    64.97       65.84    63.82
TSR            69.15    67.59      66.85    65.11       65.92    63.89
Reasons
• Initial parameters strongly affect the results because they are the only predefined factors that influence the subsequent calculation.
• Initial measurements hardly affect the results because app vectors and review vectors have already been expanded by the semantic relationship among apps and the topic similarity among reviews. Different initial measurements therefore cannot bring in extra semantics, so they cannot change the results.
Conclusions
• To obtain high-quality results, we only need to focus on choosing an effective method to set the initial parameters.
• However, determining which method is effective is difficult.
• For this reason, we hope to obtain high-quality results even with weak initial parameters.
Two Definitions
• Weak initial parameters or weak initial measurements: obtained by simple co-occurrence-based methods, for example Cosine, KL, or Euclidean distance.
• Effective initial parameters or effective initial measurements: obtained by compression-based or semantic-based methods, for example Word Clustering, Lexical Cohesion, LSI, PCA, or LDA.
“Booking.com” and “Alipay”
• The two ways initialized by one combined method are marked with the same symbol.
• With effective initial parameters, the final results are large; with weak initial parameters, the final results are small.
• With effective initial parameters, the results from the two ways are closer to each other than with weak initial parameters.
[Figure: x-axis: Iterative Steps (0 to 14); y-axis: Relationship between App Pair (0.50 to 0.95); series: HCT-HCG/Cosine, CH/TLDA, Cucerzan/TPCA, Cucerzan/KL, GMEL/TLSI, ESBM/Euclidean, ESBM/LC, ELPM/WC, Snowball/TLSI, ERPM/TSR.]
Observations and Conclusions
• Observations:
  • The results with effective initial parameters are closer to the optimal results than those with weak initial parameters.
  • In the early iterations, the range of the results with weak initial parameters contains the range of the results with effective initial parameters.
• Conclusions:
  • When the initial parameters for both ways are optimal, the results of the two ways should be the same at every step.
  • The optimal results lie between the two results obtained by the two ways of our two-way-alternative process.
Two Ways Combination
• With effective initial parameters, the tracks are smooth, whereas with weak initial parameters, the tracks are rough.
• Because the results with effective initial parameters are closer to the optimal results, it is reasonable to treat the results from the smoother way as more credible and to give them more weight in the combined results.
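The slides do not give a formula for this combination. As one possible illustration only, the sketch below measures roughness as the mean absolute second difference of a track and weights each way's final value by the inverse of its roughness. Both the roughness measure and the weighting rule are assumptions, not the authors' method.

```python
def roughness(track):
    """Mean absolute second difference; near 0 for a straight track."""
    if len(track) < 3:
        return 0.0
    diffs = [abs(track[i + 1] - 2 * track[i] + track[i - 1])
             for i in range(1, len(track) - 1)]
    return sum(diffs) / len(diffs)

def combine(track1, track2, eps=1e-6):
    """Weighted average of the two final values; the smoother track
    receives the larger weight (hypothetical weighting rule)."""
    w1 = 1.0 / (roughness(track1) + eps)
    w2 = 1.0 / (roughness(track2) + eps)
    return (w1 * track1[-1] + w2 * track2[-1]) / (w1 + w2)

smooth = [0.5, 0.6, 0.7, 0.8]   # steady track
rough = [0.5, 0.9, 0.4, 0.6]    # oscillating track
combined = combine(smooth, rough)
```

Because the combination is a weighted average with positive weights, the combined value always lies between the two final values, and here it sits very close to the final value of the smooth track.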
Selected Publications
1. Ming Liu, Chong Wu, Yuanchao Liu. A Vector Reconstruction based Clustering Algorithm Particularly for Large-Scale Text Collection. Neural Networks. 2014, Accepted. (SCI)
2. Ming Liu, Chong Wu, Yuanchao Liu. A Similarity Calculation Method Based on Feature Weight Quantification. Chinese Journal of Computers. 2014, Accepted.
3. Ming Liu, Chong Wu, Yuanchao Liu. Weight Evaluation for Features via Constrained Data-Pairs. Information Sciences. 2014, Accepted. (SCI)
4. Ming Liu, Yuanchao Liu, Bingquan Liu, Lei Lin. Probability based Text Clustering Algorithm by Alternately Repeating Two Operations. Journal of Information Science. 2013, 39(3): 372-383. (SCI, IDS: 149BC)
5. Ming Liu, Lei Lin, Lili Shan, Chengjie Sun. A Novel Self-Adaptive Clustering Algorithm for Dynamic Data. ICONIP 2012, Doha, Qatar, 2012: 42-49.
6. Ming Liu, Bingquan Liu, Yuanchao Liu, Chengjie Sun. Data Evolvement Analysis Based on Topology Self-Adaptive Clustering Algorithm. Information Technology and Control. 2012, 41(2): 162-172. (SCI, IDS: 967UJ)
7. Ming Liu, Xiaolong Wang, Yuanchao Liu. Research on a Keyphrase Extraction Method Based on Lexical Chains. Chinese Journal of Computers. 2010, 33(7): 1246-1255.
8. Ming Liu, Xiaolong Wang, Yuanchao Liu. A Fast Clustering Algorithm for Large-Scale High-Dimensional Data. Acta Automatica Sinica. 2009, 35(7): 859-866.
End
Thank you!