18th International Conference on Database and Expert Systems Applications

Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star Clustering
Tok Wee Hyong, Derry Tanti Wijaya, Stéphane Bressan

Vector Space Clustering
• Naturally translates into a graph clustering problem for a dense graph
• Vertices are the vectors; the weight of an edge is the cosine of the corresponding vectors

Star Clustering for Graph [1]
• Computes a vertex cover by a simple computation of star-shaped dense sub-graphs
1. Lower weight edges are pruned
2. Vertices with higher degree (that are not satellites) are chosen in turn as Star centers
3. Vertices connected to a center become satellites
4. Algorithm terminates when every vertex is either a center or a satellite
5. Each center and its satellites form a cluster

Star Clustering
• Does not require an a priori number of clusters
• Allows clusters to overlap
• Analytically guarantees a lower bound on the similarity between objects in each cluster
• Computes more accurate clusters than either single-link or average-link hierarchical clustering

Star Clustering
• Two critical elements:
  • Threshold for pruning edges (σ)
  • Metrics for selecting Star centers
• Aslam et al. [1] derived a theoretical lower bound on the expected similarity between two satellites in a cluster
• The bound has been shown empirically to be a good estimate of the actual similarity
• Current metrics for selecting Star centers do not leverage this finding
Our focus is on the metrics for selecting Star centers

Extended Star Clustering
• Chooses Star centers using the complement degree of vertices
• Allows Star centers to be adjacent to one another
• Has two versions: unrestricted and restricted

Our proposal
• Degree may not be the best metric
• We propose metrics that consider the weights of edges in order to maximize intra-cluster similarity:
  • Markov Stationary Distribution
  • Lower Bound
  • Average
  • Sum

Markov Stationary Distribution
• Similar to the idea of Google's PageRank algorithm [2]
Method:
• Similarity graph is normalized into a symmetric Markov matrix
• Compute the stationary distribution of the matrix: $A^{*} = (I - A)^{-1}$
• Vertices are sorted by their stationary values and chosen in turn as Star centers

Lower Bound
• Theoretical lower bound on the expected similarity between satellite vertices:
  $\cos(\gamma_{i,j}) \ge \cos(\alpha_i)\cos(\alpha_j) + \frac{\sigma}{\sigma + 1}\sin(\alpha_i)\sin(\alpha_j)$
• Can be used to estimate the average intra-cluster similarity
• Lower bound metric is the estimated average intra-cluster similarity when v is a Star center and v.adj are its satellites:
  $lb(v) = \frac{\left(\sum_{v_i \in v.adj} \cos(\alpha_i)\right)^2 + \frac{\sigma}{\sigma + 1}\left(\sum_{v_i \in v.adj} \sin(\alpha_i)\right)^2}{n^2}$
• Computed on the pruned graph

Average and Sum
• Approximations of the lower bound metric
• Computed on the pruned graph
• For each vertex v,
  $ave(v) = \frac{\sum_{v_i \in v.adj} \cos(\alpha_i)}{degree(v)}$
  $sum(v) = \sum_{v_i \in v.adj} \cos(\alpha_i)$
• Average metric is the square root of the first term in the lower bound metric

Markov, Lower Bound, Average, Sum Metrics
• We integrate our proposed metrics in the Star algorithm and its variants to produce:
  • Star-lb
  • Star-sum
  • Star-ave
  • Star-markov
  • Star-extended-sum-(r)
  • Star-extended-ave-(r)
  • Star-extended-sum-(u)
  • Star-extended-ave-(u)
  • Star-online-sum
  • Star-online-ave

Experiments
• Compare performance with off-line and on-line Star clustering and with restricted and unrestricted Extended Star clustering
• Use data from Reuters-21578, Tipster-AP, and our original collection: Google
• Measure effectiveness: recall, precision, F1
• Measure efficiency: running time
• Measure sensitivity to σ

Off-line Algorithms
• Star-lb and Star-ave are the most effective, but Star-ave is much more efficient
• Star-random performs comparably to the original Star when the threshold σ is the average similarity

Off-line Algorithms: Effectiveness comparison
[Figure: Precision, Recall, and F1 of star, star-lb, star-random, star-markov, star-sum, and star-ave on reuters, tipster-ap, and google]

Off-line Algorithms: Efficiency comparison
[Figure: Time (ms) of star, star-lb, star-random, star-markov, star-sum, and star-ave on reuters, tipster-ap, and google]

Order of Stars
• We empirically demonstrate that Star-ave approximates Star-lb better than the other algorithms do, because it makes a similar choice of Star centers

Order of Stars (on Tipster-AP)
[Figure: expected similarity rank against iteration for star, star-sum, star-markov, star-ave, and star-lb]

Sensitivity to σ
As compared to the original Star:
• Star-ave and Star-markov converge to a maximum F1 at a lower threshold
• The maximum F1 of Star-ave is higher
• The F1 gradient of Star-ave and Star-markov is smaller

Sensitivity to σ (F1 on Reuters)
[Figure: F1 against σ for star, star-ave, star-markov, star-sum, and star-lb; the σ axis runs from 0 to 20, with 1 labeled "mean"]

Sensitivity to σ (F1 gradient on Reuters)
[Figure: F1 gradient against σ for star, star-ave, star-markov, star-sum, and star-lb; the σ axis runs from 0 to 20, with 1 labeled "mean"]

Extended Star
• Star-ave is more effective and efficient than Star-extended-(r)
• Star-extended-ave-(r) improves the effectiveness of Star-extended-(r)
• Similar findings are observed with Star-extended-(u)

Extended Star: Effectiveness comparison
[Figure: Precision, Recall, and F1 of star-ave, star-extended-(r), star-extended-ave-(r), and star-extended-sum-(r) on reuters, tipster-ap2, and google]

Extended Star: Efficiency comparison
[Figure: Time (ms) of star-ave, star-extended-(r), star-extended-ave-(r), and star-extended-sum-(r) on reuters, tipster-ap2, and google]

On-line Algorithms
• Star-online-ave is more effective and efficient than the original Star on-line algorithm

On-line Algorithms: Effectiveness comparison
[Figure: Precision, Recall, and F1 of star-online, star-online-ave, star-online-sum, and star-online-random on reuters, tipster-ap, and google]

On-line Algorithms: Efficiency comparison
[Figure: Time (ms) of star-online, star-online-ave, star-online-sum, and star-online-random on reuters, tipster-ap, and google]

Conclusion
• The current metric for selecting Star centers is not optimal
• We propose various new metrics for selecting Star centers that maximize intra-cluster similarity
• The average metric is a fast and good approximation of the lower bound metric
• Since intra-cluster similarity is maximized, it is precision that is mostly improved
• Our proposed average metric yields up to 19.1% improvement in precision for the off-line algorithms, 20.9% for the on-line algorithms, and 102% for the extended Star algorithm

References
1. Aslam, J., Pelekhov, K., Rus, D.: The Star Clustering Algorithm. Journal of Graph Algorithms and Applications 8(1), 95–129 (2004)
2. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the Seventh International Conference on World Wide Web 7, 107–117 (1998)

Credits
This work was funded by the National University of Singapore ARG project R-252-000-285-112, "Mind Your Language: Corpora and Algorithms for Fundamental Natural Language Processing Tasks in Information Retrieval and Extraction for the Indonesian and Malay languages"
Copyright © 2007 by Stéphane Bressan
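
Appendix: sketch of Star center selection
The five steps of the off-line Star algorithm and the degree, sum, average, and lower bound metrics from the slides can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that cos(α_i) is the weight of the edge between a candidate center v and its neighbor v_i, that sin(α_i) is derived from that weight, and that n in the lower bound metric equals degree(v), which is what makes the average metric the square root of the first term. The function names, the dict-of-dicts graph, and the toy edge list are all hypothetical, and ties are broken by insertion order.

```python
import math

def build_graph(edges):
    """Build a symmetric similarity graph {u: {v: weight}} from (u, v, weight) triples."""
    sim = {}
    for u, v, w in edges:
        sim.setdefault(u, {})[v] = w
        sim.setdefault(v, {})[u] = w
    return sim

def prune(sim, sigma):
    """Step 1: drop every edge whose weight is below the threshold sigma."""
    return {u: {v: w for v, w in nbrs.items() if w >= sigma}
            for u, nbrs in sim.items()}

# Each metric scores vertex v on the pruned graph adj.
def degree_metric(adj, v, sigma):
    return len(adj[v])

def sum_metric(adj, v, sigma):
    return sum(adj[v].values())

def ave_metric(adj, v, sigma):
    n = len(adj[v])
    return sum(adj[v].values()) / n if n else 0.0

def lb_metric(adj, v, sigma):
    n = len(adj[v])  # assumption: n is the number of neighbors of v
    if n == 0:
        return 0.0
    cos_sum = sum(adj[v].values())
    # assumption: sin(alpha_i) is recovered from the edge weight cos(alpha_i)
    sin_sum = sum(math.sqrt(max(0.0, 1.0 - w * w)) for w in adj[v].values())
    return (cos_sum ** 2 + sigma / (sigma + 1) * sin_sum ** 2) / n ** 2

def star_cluster(sim, sigma, metric=degree_metric):
    """Return {center: set of satellites}, choosing centers in decreasing metric order."""
    adj = prune(sim, sigma)
    order = sorted(adj, key=lambda v: metric(adj, v, sigma), reverse=True)
    satellites = set()
    clusters = {}
    for v in order:               # step 2: take candidates in turn
        if v in satellites:       # satellites cannot become centers
            continue
        clusters[v] = set(adj[v])  # step 5: center plus its satellites
        satellites.update(adj[v])  # step 3: neighbors become satellites
    return clusters               # step 4: every vertex is now a center or a satellite

# Toy data (hypothetical): the b-e edge falls below sigma and is pruned.
edges = [("a", "b", 0.5), ("a", "c", 0.9), ("a", "e", 0.5),
         ("c", "d", 0.8), ("b", "e", 0.1)]
sim = build_graph(edges)

for name, metric in [("degree", degree_metric), ("sum", sum_metric),
                     ("ave", ave_metric), ("lb", lb_metric)]:
    print(name, star_cluster(sim, 0.4, metric))
```

On this toy graph the degree and sum metrics pick a as the first center, while the average and lower bound metrics pick c, whose edges are stronger on average. The example is only meant to show that the choice of metric can change which vertices become centers. It says nothing about effectiveness on real collections.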