18th International Conference on Database and Expert Systems Applications
Journey to the Centre of the Star:
Various Ways of Finding Star Centers in
Star Clustering
Tok Wee Hyong
Derry Tanti Wijaya
Stéphane Bressan
Vector Space Clustering
• Naturally translates into a graph clustering
problem for a dense graph
• Vertices are the vectors; the weight of an edge is the cosine of the corresponding vectors
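A minimal sketch of how a set of vectors could be turned into such a dense weighted graph. The names `cosine` and `similarity_graph` and the toy vectors are illustrative assumptions, not taken from the slides.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def similarity_graph(vectors):
    """Dense graph as a matrix: entry [i][j] is the cosine of vectors i and j."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) if i != j else 0.0
             for j in range(n)]
            for i in range(n)]

# Toy term vectors (hypothetical data).
docs = [[1, 0, 1], [1, 0, 0], [0, 1, 0]]
graph = similarity_graph(docs)
```

The diagonal is set to 0 here so that a vertex is never treated as its own neighbour; that is a design choice of this sketch.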
Star Clustering for Graph [1]
• Computes vertex cover by a simple computation
of star-shaped dense sub-graphs
1. Lower weight edges are pruned
2. Vertices with higher degree (that are not satellites) are chosen in turn as Star centers
3. Vertices connected to a center become satellites
4. Algorithm terminates when every vertex is either a center or a satellite
5. Each center and its satellites form a cluster
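The five steps can be sketched as follows. This is an illustrative reading of the slide, not the reference implementation of [1]. The function name `star_clusters`, the use of `>=` when comparing with the threshold, and the tie-break by vertex index are assumptions.

```python
def star_clusters(weights, sigma):
    """Greedy off-line Star clustering on a symmetric weight matrix."""
    n = len(weights)
    # Step 1: prune edges whose weight is below the threshold sigma.
    adj = {v: {u for u in range(n) if u != v and weights[v][u] >= sigma}
           for v in range(n)}
    # Step 2: consider vertices by decreasing degree (ties: lower index first).
    order = sorted(range(n), key=lambda v: (-len(adj[v]), v))
    centers, covered = [], set()
    for v in order:
        if v in covered:
            continue  # v is already a center or a satellite
        centers.append(v)
        covered.add(v)
        covered |= adj[v]  # Step 3: neighbours of a center become satellites
    # Step 4 holds once the loop ends: every vertex is a center or a satellite.
    # Step 5: each center and its satellites form a (possibly overlapping) cluster.
    return {c: {c} | adj[c] for c in centers}

# Hypothetical 4-vertex similarity matrix.
W = [[0.0, 0.9, 0.8, 0.1],
     [0.9, 0.0, 0.2, 0.1],
     [0.8, 0.2, 0.0, 0.1],
     [0.1, 0.1, 0.1, 0.0]]
clusters = star_clusters(W, sigma=0.5)
```

With this toy matrix, vertex 0 becomes a center with satellites 1 and 2, and the isolated vertex 3 forms its own cluster.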
Star Clustering
• Does not require the indication of an a priori
number of clusters
• Allows clusters to overlap
• Analytically guarantees a lower bound on the
similarity between objects in each cluster
• Computes more accurate clusters than either the
single or average link hierarchical clustering
Star Clustering
• Two critical elements:
• Threshold for pruning edges (σ)
• Metrics for selecting Star centers
• Aslam et al. [1] derived the theoretical lower bound on
the expected similarity between two satellites in a cluster
• Empirically shown to be a good estimate of the actual
similarity
• Current metrics for selecting Star centers do not leverage this finding
→ Our focus is on the metrics for selecting Star centers
Extended Star Clustering
• Choose Star centers using complement degree
of vertices
• Allow Star centers to be adjacent to one another
• Has two versions: unrestricted and restricted
Our proposal
• Degree may not be the best metric
• We propose metrics that consider the weights of edges in order to maximize intra-cluster similarity:
• Markov Stationary Distribution
• Lower Bound
• Average
• Sum
Markov Stationary Distribution
• Similar to the idea of Google’s PageRank algorithm [2]
Method:
• Similarity graph is normalized into a symmetric
Markov matrix
• Compute the stationary distribution of the matrix
$A^* = (I - A)^{-1}$
• Vertices are sorted by their stationary values and
chosen in turn as Star centers
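A minimal sketch of ranking vertices by a stationary distribution. It swaps in plain power iteration on a row-normalized similarity matrix rather than the closed form shown on the slide, so both the normalization and the iteration are assumptions of this example. The name `stationary_distribution` is hypothetical, and every vertex is assumed to have at least one edge.

```python
def stationary_distribution(weights, tol=1e-12, max_iter=10_000):
    """Power iteration for pi = pi P, where P is the row-normalized weight matrix."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    # Row-normalize so that each row of P sums to 1 (assumes no zero rows).
    P = [[w / row_sums[i] for w in weights[i]] for i in range(n)]
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical 3-vertex similarity matrix.
W = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 0.5],
     [1.0, 0.5, 0.0]]
pi = stationary_distribution(W)
# Vertices sorted by decreasing stationary value: candidate order for Star centers.
ranking = sorted(range(len(W)), key=lambda v: (-pi[v], v))
```

For a symmetric weight matrix normalized this way, the stationary value of a vertex is proportional to the sum of its edge weights, which is a quick sanity check on the output.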
Lower Bound
• Theoretical lower bound on expected similarity
between satellite vertices:
$\cos(\gamma_{i,j}) \ge \cos(\alpha_i)\cos(\alpha_j) + \frac{\sigma}{\sigma + 1}\sin(\alpha_i)\sin(\alpha_j)$
• Can be used to estimate the average intra-cluster similarity
• Lower bound metric is the estimated average
intra-cluster similarity when v is a Star center
and v.adj are its satellites
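$\mathrm{lb}(v) = \dfrac{\left(\sum_{v_i \in v.\mathrm{adj}} \cos(\alpha_i)\right)^2 + \frac{\sigma}{\sigma + 1}\left(\sum_{v_i \in v.\mathrm{adj}} \sin(\alpha_i)\right)^2}{n^2}$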
• Computed on the pruned graph
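A minimal sketch of the lower bound metric for one candidate center. It assumes that cos(αi) is the weight of the edge between v and its neighbour vi in the pruned graph, and that n is the number of neighbours of v. The function name `lower_bound` and the value returned for an isolated vertex are also assumptions.

```python
import math

def lower_bound(center_weights, sigma):
    """lb(v) from the cosines between v and each of its neighbours."""
    n = len(center_weights)
    if n == 0:
        return 0.0  # assumption: an isolated vertex scores 0
    cos_sum = sum(center_weights)
    # sin(alpha_i) recovered from cos(alpha_i); clamp guards against rounding.
    sin_sum = sum(math.sqrt(max(0.0, 1.0 - w * w)) for w in center_weights)
    return (cos_sum ** 2 + sigma / (sigma + 1) * sin_sum ** 2) / n ** 2

# Hypothetical center with two satellites at cosines 0.6 and 0.8, sigma = 0.5.
score = lower_bound([0.6, 0.8], sigma=0.5)
```

When every satellite has cosine 1 with the center, the sine term vanishes and the metric equals 1 regardless of σ.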
Average and Sum
• Approximations of the lower bound metric
• Computed on the pruned graph
• For each vertex v,
$\mathrm{ave}(v) = \dfrac{\sum_{v_i \in v.\mathrm{adj}} \cos(\alpha_i)}{\mathrm{degree}(v)}$
$\mathrm{sum}(v) = \sum_{v_i \in v.\mathrm{adj}} \cos(\alpha_i)$
• Average metric is the square root of the first
term in the lower bound metric
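A minimal sketch of the two approximations, under the same assumption that cos(αi) is the weight of the edge between v and its neighbour vi in the pruned graph. The names `sum_metric` and `ave_metric` are hypothetical.

```python
def sum_metric(center_weights):
    """sum(v): total weight of the edges incident to v."""
    return sum(center_weights)

def ave_metric(center_weights):
    """ave(v): mean weight of the edges incident to v."""
    if not center_weights:
        return 0.0  # assumption: an isolated vertex scores 0
    return sum(center_weights) / len(center_weights)

weights_of_v = [0.6, 0.8]  # hypothetical cosines to two neighbours
s = sum_metric(weights_of_v)
a = ave_metric(weights_of_v)
# First term of the lower bound metric, taking n = degree(v).
first_term = s ** 2 / len(weights_of_v) ** 2
```

Squaring `a` gives `first_term`, which is the relation stated in the last bullet.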
Markov, Lower Bound, Average, Sum Metrics
• We integrate our proposed metrics in the Star
algorithm and its variants to produce:
• Star-lb
• Star-sum
• Star-ave
• Star-markov
• Star-extended-sum-(r)
• Star-extended-ave-(r)
• Star-extended-sum-(u)
• Star-extended-ave-(u)
• Star-online-sum
• Star-online-ave
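One way to picture the integration for the off-line variants is a greedy Star loop whose ordering key is a pluggable metric. This is a sketch under assumptions: the names `star_with_metric`, `degree` and `ave`, the `>=` threshold test and the index tie-break are illustrative, and the extended and on-line variants are not covered.

```python
def star_with_metric(weights, sigma, metric):
    """Greedy Star clustering; centers are tried in decreasing metric order."""
    n = len(weights)
    adj = {v: [u for u in range(n) if u != v and weights[v][u] >= sigma]
           for v in range(n)}
    # Score each vertex from the weights of its edges in the pruned graph.
    score = {v: metric([weights[v][u] for u in adj[v]]) for v in range(n)}
    order = sorted(range(n), key=lambda v: (-score[v], v))
    clusters, covered = {}, set()
    for v in order:
        if v not in covered:  # skip vertices that are already satellites
            clusters[v] = {v, *adj[v]}
            covered |= clusters[v]
    return clusters

def degree(ws):
    return len(ws)

def ave(ws):
    return sum(ws) / len(ws) if ws else 0.0

# Hypothetical chain 0 - 1 - 2 - 3 with one strong edge between 2 and 3.
W = [[0.0, 0.6, 0.0, 0.0],
     [0.6, 0.0, 0.6, 0.0],
     [0.0, 0.6, 0.0, 0.95],
     [0.0, 0.0, 0.95, 0.0]]
by_degree = star_with_metric(W, 0.5, degree)
by_ave = star_with_metric(W, 0.5, ave)
```

On this toy graph the degree ordering picks vertex 1 first, while the average ordering starts from vertex 3 and keeps the strong edge between 2 and 3 at the core of a cluster.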
Experiments
• Compare performance with off-line and on-line
Star clustering and restricted and unrestricted
Extended Star clustering
• Use data from Reuters-21578, Tipster-AP, and
our original collection: Google
• Measure effectiveness: recall, precision, F1
• Measure efficiency: running time
• Measure sensitivity to σ
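For reference, the three effectiveness measures can be computed for one cluster against one reference topic as in the sketch below. How clusters are matched to topics is not stated on the slide, so this example only covers a single pair; the function name and the document identifiers are hypothetical.

```python
def precision_recall_f1(cluster, topic):
    """Precision, recall and F1 of a cluster with respect to a reference topic."""
    cluster, topic = set(cluster), set(topic)
    hits = len(cluster & topic)  # documents that are in both sets
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(topic) if topic else 0.0
    # F1 is the harmonic mean of precision and recall (0 when nothing matches).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical document identifiers.
p, r, f1 = precision_recall_f1(cluster={1, 2, 3, 4}, topic={3, 4, 5})
```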
Off-line Algorithms
• Star-lb and Star-ave are the most effective, but Star-ave is much more efficient
• Star-random performs comparably to original
Star when threshold σ is the average similarity
Off-line Algorithms
Effectiveness comparison
[Figure: bar chart of Precision, Recall and F1 (y-axis, 0 to 1.2) for star-ave, star-sum, star-markov, star-random, star-lb and star on reuters, tipster-ap and google]
Off-line Algorithms
Efficiency comparison
[Figure: bar chart of Time (ms) (y-axis, 0 to 400000) for star-ave, star-sum, star-markov, star-random, star-lb and star on reuters, tipster-ap and google]
Order of Stars
• We empirically show that Star-ave approximates Star-lb better than the other algorithms do, because it makes a similar choice of Star centers
Order of Stars (on Tipster-AP)
[Figure: line chart of expected similarity rank (y-axis, 0 to 2000) against iteration (x-axis, 1 to 52) for star, star-sum, star-markov, star-ave and star-lb]
Sensitivity to σ
As compared to the original Star:
• Star-ave and Star-markov converge to a
maximum F1 at a lower threshold
• The maximum F1 of Star-ave is higher
• F1 gradient of Star-ave and Star-markov is
smaller
Sensitivity to σ (F1 on Reuters)
[Figure: line chart of F1 (y-axis, 0 to 1.2) against σ (x-axis, 0 to 20, with the tick 1 labelled "mean") for star, star-ave, star-markov, star-sum and star-lb]
Sensitivity to σ (F1 gradient on Reuters)
[Figure: line chart of the F1 gradient (y-axis, 0 to 0.6) against σ (x-axis, 0 to 20, with the tick 1 labelled "mean") for star, star-ave, star-markov, star-sum and star-lb]
Extended Star
• Star-ave is more effective and efficient than
Star-extended-(r)
• Star-extended-ave-(r) improves the
effectiveness of Star-extended-(r)
• Similar findings are observed with Star-extended-(u)
Extended Star
Effectiveness comparison
[Figure: bar chart of Precision, Recall and F1 (y-axis, 0 to 1.2) for star-extended-sum-(r), star-extended-ave-(r), star-extended-(r) and star-ave on reuters, tipster-ap2 and google]
Extended Star
Efficiency comparison
[Figure: bar chart of Time (ms) (y-axis, 0 to 50000) for star-extended-sum-(r), star-extended-ave-(r), star-extended-(r) and star-ave on reuters, tipster-ap2 and google]
On-line Algorithms
• Star-online-ave is more effective and efficient
than the original Star on-line algorithm
On-line Algorithms
Effectiveness comparison
[Figure: bar chart of Precision, Recall and F1 (y-axis, 0 to 1.2) for star-online-random, star-online-sum, star-online-ave and star-online on reuters, tipster-ap and google]
On-line Algorithms
Efficiency comparison
[Figure: bar chart of Time (ms) (y-axis, 0 to 400000) for star-online-random, star-online-sum, star-online-ave and star-online on reuters, tipster-ap and google]
Conclusion
• The current metric for selecting Star centers is not optimal
• We propose various new metrics for selecting Star centers that maximize intra-cluster similarity
• The average metric is a fast and good approximation of the lower bound metric
• Since intra-cluster similarity is maximized, it is precision that is mostly improved
• Our proposed average metric yields up to a 19.1% improvement in precision for off-line algorithms, a 20.9% improvement in precision for on-line algorithms, and a 102% improvement in precision for the extended Star algorithm
References
1. Aslam, J., Pelekhov, K., Rus, D.: The Star Clustering Algorithm. Journal of Graph Algorithms and Applications 8(1), 95–129 (2004)
2. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Proceedings of the Seventh International Conference on World Wide Web 7, 107–117 (1998)
Credits
This work was funded by the National University of Singapore ARG project R-252-000-285-112, "Mind Your Language: Corpora and Algorithms for Fundamental Natural Language Processing Tasks in Information Retrieval and Extraction for the Indonesian and Malay languages"
Copyright © 2007 by Stéphane Bressan