CS473: Web Information Retrieval, Fall 2014
Final Project Report
John Moore, Ed Flanagan, Arjun Thakar
Handed In: November 26, 2014

1 Introduction

Clustering has been a widely researched problem. There are many well-known hierarchical clustering algorithms, along with variations of each, that cluster instances given a single distance measure such as Euclidean or Manhattan distance. However, these algorithms use only one distance measure, whereas in many problems we are given several distance measures, each of which may provide valuable information about how similar two instances of a dataset are. We propose using Probabilistic Soft Logic (PSL), a framework similar to a Markov logic network, to combine multiple distance functions into one, which we can then use in any clustering algorithm [1]. In general, many hierarchical clustering algorithms require only a distance matrix as input, and we take advantage of this fact by computing each pairwise distance between posts with PSL. In this paper we apply hierarchical clustering with complete linkage, using the distances learned by PSL, to a Stack Exchange dataset. Results show that in certain scenarios, particularly on smaller datasets, our approach does much better than clustering the documents with a single distance measure.

2 First Order Logic

A first-order knowledge base (KB) is a set of sentences or formulas in first-order logic. Formulas are constructed from four types of symbols: constants, variables, functions, and predicates. Constants represent objects of interest (e.g. people), variables range over possible objects, functions are mappings from tuples of objects to objects (e.g. FatherOf), and predicate symbols represent relations among objects (e.g. friendship links). An atomic formula (atom) is a predicate symbol applied to a tuple of terms (e.g. friend(B, A) or sameCollege(A, C)) [3].

3 Probabilistic Soft Logic

A PSL program consists of a set of first-order logic rules with conjunctive bodies and single-literal heads. PSL is very similar to Markov logic networks; however, one notable difference is that PSL relaxes the hard constraint of a predicate being either 1 or 0. Instead, PSL assigns soft truth values from the interval [0, 1], where 0 corresponds to false and 1 corresponds to true. Rules are associated with non-negative weights, which determine how strongly each rule is expected to hold [1]. The following illustrates how we can predict whether two people go to the same college:

0.2 : friend(B, A) ∧ sameCollege(A, C) → sameCollege(B, C)
0.7 : roommate(B, A) ∧ sameCollege(A, C) → sameCollege(B, C)

The value on the very left is the weight, which encodes how much we 'trust' each rule. Ideally, the second rule is more important than the first, which is encoded via its larger weight. A rule r ≡ r_body → r_head ≡ ¬r_body ∨ r_head is satisfied, I(r) = 1, if and only if I(r_body) ≤ I(r_head). A rule's distance to satisfaction is then

d_r(I) = max(0, I(r_body) − I(r_head)).

For example, if I(r_body) = 0.7 and I(r_head) = 0.5, the distance to satisfaction is 0.2.

4 Distance Measures

Stack Exchange forum data has many features from which we can calculate distances between posts. Features that allow distance calculation include creation date, last activity date, view count, title, body, and tags. To calculate distances between dates, we compute the number of minutes separating two posts. For view counts, we simply take the difference in the number of views. For title, body, and tags, we can use a vector space model and compute Okapi-weighted vectors.
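As a concrete illustration of the non-text distances described above, the following is a minimal Python sketch of the date and view-count distances and of assembling a pairwise distance matrix for one measure. The dictionary fields mirror attribute names in the Stack Exchange dumps, but the helper names and the assumed timestamp format are ours and purely illustrative, not part of any released code.

    from datetime import datetime

    # Timestamp format assumed for the Stack Exchange dumps; adjust if the dump differs.
    DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

    def date_distance(post_a, post_b, field="CreationDate"):
        """Number of minutes separating two posts on a date attribute."""
        t_a = datetime.strptime(post_a[field], DATE_FORMAT)
        t_b = datetime.strptime(post_b[field], DATE_FORMAT)
        return abs((t_a - t_b).total_seconds()) / 60.0

    def view_count_distance(post_a, post_b):
        """Absolute difference in view counts."""
        return abs(int(post_a["ViewCount"]) - int(post_b["ViewCount"]))

    def distance_matrix(posts, dist_fn):
        """Symmetric n x n matrix of pairwise distances for one measure."""
        n = len(posts)
        matrix = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                matrix[i][j] = matrix[j][i] = dist_fn(posts[i], posts[j])
        return matrix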
To calculate a distance between two of these Okapi-weighted vectors, we can then simply take the dot product, as performed in [2].

5 Algorithm

To solve our problem, we propose using PSL to combine distance measures. We create the following rules in our PSL model:

d1(A, B) → sameCluster(A, B)
d2(A, B) → sameCluster(A, B)
...
dn(A, B) → sameCluster(A, B)

where n is the number of distance measures (in our case, 6). The following rules encode transitivity:

d1(A, B) & d1(B, C) → d1(A, C)
d2(A, B) & d2(B, C) → d2(A, C)
...
dn(A, B) & dn(B, C) → dn(A, C)

We also define priors:

~d1(A, B)
~d2(A, B)
...
~dn(A, B)

First, we calculate our distance measures; then, for each pair of posts, we obtain a sameCluster probability via inference over these rules. Next, we use a function to transform these probabilities into distances. There are a variety of functions that can convert a probability of similarity into a distance; in this paper, we take the complement of the sameCluster probability. Since we calculate distances between each pair of posts, our algorithm is O(p^2), where p is the number of posts, and we encode the distances into a distance matrix whose (i, j)-th element is the distance between post i and post j. We can then apply any well-known clustering method that takes a distance matrix as input.

In general, our method takes as input P, the PSL model together with its parameters and algorithms for weight learning and inference; our experiments use the default parameters for weight learning and inference in [1]. D is a set of distance measure transformations (how to calculate a distance given an attribute and a distance function); these are the distance measures defined in the Distance Measures section. F is a transformation function that maps a probability to a distance; in our case, we simply take the complement of the sameCluster probability. G is our input post data as described in the Data section. CA is a given clustering algorithm; in this paper, we use hierarchical clustering with complete linkage to cluster our posts.

Algorithm 1 Overall Method(P, D, F, G, CA)
1: Let M be an array of n × n distance matrices
2: for i in 1..length(D) do
3:   Compute the n × n distance matrix Mi from G
4: end for
5: Create k rules in P
6: Let P use M
7: Compute k weights W from P and M
8: Compute the n × n sameCluster probabilities C from W and M using inference in P
9: Transform C into the distance matrix TD using F
10: Compute the resulting clusters RC from CA and TD
11: return RC

6 Evaluation

6.1 Data

We chose to use Stack Exchange data since it is a well-known forum website that releases feature-rich datasets at https://archive.org/details/stackexchange. In order to evaluate performance, we chose subsets of already categorized datasets. Ground truth simply consists of each forum post associated with its predetermined category. In our evaluation, we chose 5 categories: academia, aviation, beer, chemistry, and bicycles. We chose these categories because some of them overlap (e.g. academia and chemistry), while others are very different (e.g. aviation and beer).

6.2 Evaluation Methodology

To evaluate clustering, we use two standard evaluation measures. The first is clustering entropy, an external evaluation measure [2]. In general, lower entropy indicates less randomness within clusters, while higher entropy indicates the opposite; therefore, lower scores indicate better clusters.
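As a concrete reference for this measure, the following minimal Python sketch computes a weighted clustering entropy from predicted cluster assignments and ground-truth categories. This is one standard definition of the measure, which we assume matches the one in [2]; the function and variable names are illustrative.

    from collections import Counter
    from math import log

    def clustering_entropy(clusters, labels):
        """Weighted average entropy of the ground-truth label distribution
        inside each predicted cluster; lower is better.

        clusters: list of predicted cluster ids, one per post
        labels:   list of ground-truth category names, one per post
        """
        n = len(labels)
        total = 0.0
        for c in set(clusters):
            members = [labels[i] for i in range(n) if clusters[i] == c]
            size = len(members)
            counts = Counter(members)
            entropy = -sum((cnt / size) * log(cnt / size, 2) for cnt in counts.values())
            total += (size / n) * entropy
        return total

    # Example: three posts in one cluster, two in another
    print(clustering_entropy([0, 0, 0, 1, 1],
                             ["beer", "beer", "aviation", "chemistry", "chemistry"]))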
The second measure is the Adjusted Rand Index (ARI), which, similar to accuracy, compares the original ground truth (the predefined categories) to the resulting clusters; as with accuracy, higher scores indicate better clustering [4].

As baselines, we also cluster using the original distance matrix from each attribute alone. Let di refer to each distance matrix, where 1 ≤ i ≤ 6: d1 is Creation Date, d2 is Last Activity Date, d3 is View Count, d4 is Title, d5 is Body, and d6 is Tags.

We ran two groups of tests. The first group used randomly selected posts to construct a dataset for testing. Since the first group did not yield good results, we ran a second group of tests that used the posts with the largest bodies in each category, with the intuition that posts with small bodies do not have indicative text features. We also vary the number of attributes used in each test, and we denote each test by the number of attributes used and the sampling technique. Three attributes corresponds to the text attributes (title, body, and tags), while six corresponds to using all attributes; this lets us see whether adding the non-text attributes results in better clusters. The Random sampling technique refers to picking posts at random, and Big refers to picking the posts with the biggest bodies, measured by the number of characters in the body. We varied the size of our datasets by picking 5, 10, 15, ..., 40 posts from each of the 5 categories, resulting in 25, 50, 75, ..., 200 posts.

In the result tables, E. M. refers to the entropy of our model and AR. M. to its adjusted Rand index. Similarly, E. di and AR. di refer to the entropy and adjusted Rand index obtained by clustering with the di distance matrix alone; that is, we use each of these distance matrices directly in the hierarchical clustering algorithm with complete linkage, and the resulting clusters are compared to ground truth to obtain entropy and adjusted Rand index scores. We perform these experiments in order to compare our method's results to this set of baselines.

6.3 Results

As we can see from the results (Configurations 1-4, tabulated at the end of this report), our method consistently reduces entropy as the size of the dataset increases, while the individual measures tend to stay around the same entropy value. This is probably because aggregating our measures through PSL results in a space of posts where similar posts are more likely to be in the same 'region'. However, as the amount of data increases, the adjusted Rand index of our method consistently decreases. This could be due to a non-indicative feature (d5) degrading the overall distance measure that PSL creates. Using the biggest posts, we find that using only 3 attributes in our model results in much higher adjusted Rand scores than using all 6, most likely because the text attributes are much more indicative than the non-text attributes in this context. The highest adjusted Rand index in Configuration 2 is 0.55, obtained with only 25 posts. It is interesting that when the data size is low, our method coupled with random sampling and all 6 attributes yields higher adjusted Rand index scores than biggest-body sampling with 6 attributes; when the size is large, big sampling yields the higher scores. Overall, when the data size is low, using only the text attributes and sampling the biggest posts appears to produce the best distances from our method, since this setting yields the best entropies and adjusted Rand scores.
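To make the final steps of Algorithm 1 and this evaluation concrete, the following is a minimal sketch, assuming SciPy and scikit-learn, of how a matrix of inferred sameCluster probabilities is turned into clusters and an adjusted Rand score. The names are illustrative and this is not our released implementation.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.metrics import adjusted_rand_score

    def cluster_and_score(same_cluster, labels, n_clusters=5):
        """same_cluster: n x n symmetric matrix of inferred sameCluster probabilities
        labels: list of n ground-truth category names"""
        # Transformation function F: complement of the sameCluster probability
        distances = 1.0 - np.asarray(same_cluster, dtype=float)
        np.fill_diagonal(distances, 0.0)

        # Hierarchical clustering with complete linkage on the condensed matrix
        condensed = squareform(distances, checks=False)
        tree = linkage(condensed, method="complete")
        clusters = fcluster(tree, t=n_clusters, criterion="maxclust")

        ari = adjusted_rand_score(labels, clusters)
        return clusters, ari

The entropy score can then be computed from the returned clusters and the ground-truth labels with the clustering_entropy sketch shown earlier.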
7 Conclusion

In this paper, we have successfully used PSL to combine many different yet informative distance measures into a single distance for use in a hierarchical clustering algorithm. One distance measure alone may only provide one dimension of similarity, whereas using multiple dimensions may capture the 'true' similarity between posts. PSL enables us to combine distance measures probabilistically, weighting some rules more heavily than others based upon how well they hold in the dataset. Using these rules and weights, PSL performs inference to infer sameCluster probabilities, which we transform into distances via a transformation function for use in any given hierarchical clustering algorithm. We have demonstrated that our method performs better than the baselines the majority of the time under Configuration 2. Our method does need to be tuned by selecting indicative attributes and sampling representative points, but once tuned, we believe it is a very effective method for clustering with several distance measures. In conclusion, this method has been observed to be much better than using any one distance measure alone in scenarios where a dataset has many indicative attributes and low dataset sizes.

Configuration 1 (3 attributes, Random sampling)
Size     25       50       75       100      125      150      175      200
E. M.    1.2764   0.7736   0.7918   0.6191   0.5467   0.3360   0.4467   0.3054
AR. M.   0.1186   0.0754   0.0590   0.0664   0.0475   0.0105   0.0257   0.0089
E. d4    0.6615   1.0054   1.1642   1.2457   1.3568   1.4219   1.3775   1.4412
AR. d4   0.0526   0.0792   0.0557   0.0768   0.1473   0.1670   0.1215   0.1096
E. d5    1.5685   1.5449   1.5906   1.5213   1.5002   1.5604   1.5647   1.4756
AR. d5   0.0192   0.0412   -0.0039  0.0432   0.0326   0.0789   0.0212   0.0112
E. d6    0.7668   0.9930   0.9815   1.0802   1.0633   1.3090   1.2530   1.3118
AR. d6   0.0357   0.0889   0.0924   0.02102  0.0581   0.1543   0.0959   0.0994

Configuration 2 (3 attributes, Big sampling)
Size     25       50       75       100      125      150      175      200
E. M.    1.4328   1.5147   1.1047   0.7632   0.6615   0.5972   0.3616   0.2868
AR. M.   0.55     0.4471   0.2846   0.1167   0.0762   0.0569   0.0141   0.0072
E. d4    0.6615   0.8472   1.0349   1.2263   1.3309   1.3570   1.4172   1.4421
AR. d4   0.0105   0.0497   0.1440   0.0932   0.0929   0.1433   0.1425   0.1329
E. d5    1.5061   1.4890   1.5559   1.4852   1.5851   1.5093   1.5105   1.5767
AR. d5   0.0146   -0.0188  0.0157   0.0323   0.0031   0.0079   0.0061   0.0012
E. d6    0.8493   1.0052   1.1366   1.2172   1.1876   1.3133   1.3425   1.4101
AR. d6   0.0602   0.0495   0.0878   0.0952   0.0637   0.1757   0.1379   0.1187

Configuration 3 (6 attributes, Random sampling)
Size     25       50       75       100      125      150      175      200
E. M.    1.4587   1.3229   1.1839   0.5239   0.5241   0.4307   0.4277   0.3055
AR. M.   0.3586   0.1295   0.0852   0.0184   0.0196   0.0118   0.0222   0.0089
E. d1    1.5752   1.5636   1.4824   1.4205   1.4422   1.4981   1.3972   1.5275
AR. d1   -0.0233  0.0056   0.0462   -0.0068  0.0215   0.0072   0.0080   0.0093
E. d2    1.5560   1.4879   1.5425   1.5741   1.5244   1.5654   1.5039   1.5628
AR. d2   -0.0568  0.0063   0.0730   0.0530   0.0770   -0.0013  0.0072   0.0214
E. d3    1.5124   1.4981   1.5377   1.5416   1.5608   1.5898   1.5705   1.575
AR. d3   0.0543   0.0330   -0.0200  0.0153   0.0051   0.0041   -0.0008  0.0072
E. d4    0.7668   1.0404   0.7688   1.2244   1.4364   1.4358   1.3679   1.4984
AR. d4   0.0357   0.2298   0.0317   0.1862   0.1518   0.0798   0.0970   0.0993
E. d5    1.4994   1.5629   1.5454   1.5743   1.4614   1.5275   1.5782   1.4990
AR. d5   0.0109   0.0066   0.1179   0.0419   0.0416   0.0155   0.0242   0.0181
E. d6    0.6615   0.8167   0.9510   0.9878   1.1335   1.2652   1.3212   1.2504
AR. d6   0.0211   0.0295   0.0928   0.1288   0.1299   0.0744   0.0678   0.0926

Configuration 4 (6 attributes, Big sampling)
Size     25       50       75       100      125      150      175      200
E. M.    1.4146   1.3278   0.8713   0.6802   0.6944   0.5707   0.4572   0.4897
AR. M.   0.1414   0.0972   0.0587   0.0396   0.0470   0.0246   0.0220   0.0346
E. d1    1.5685   1.5582   1.5194   1.4771   1.4730   1.5092   1.5790   1.5376
AR. d1   -0.05    0.0265   0.0751   -0.0071  0.0537   0.0177   0.0067   0.0159
E. d2    1.4994   1.5866   1.5414   1.5598   1.6037   1.5212   1.5001   1.5721
AR. d2   -0.0109  -0.0339  0.0376   -0.0078  0.0035   -0.0002  -0.0005  0.0196
E. d3    1.5560   1.5242   1.5624   1.5914   1.5900   1.5083   1.5591   1.5373
AR. d3   0.0341   0.0363   -0.0062  0.0051   -0.0086  -0.0131  -0.0056  -0.0032
E. d4    0.6615   0.8472   1.0349   1.2263   1.3309   1.3570   1.4172   1.4421
AR. d4   0.0105   0.0497   0.1440   0.0932   0.0929   0.1433   0.1425   0.1329
E. d5    1.5061   1.4890   1.5559   1.4852   1.5851   1.5093   1.5105   1.5767
AR. d5   0.0146   -0.0188  0.0157   0.0323   0.0031   0.0079   0.0061   0.0012
E. d6    0.8493   1.0052   1.1366   1.2172   1.1876   1.3133   1.3425   1.4101
AR. d6   0.0602   0.0495   0.0878   0.0952   0.0637   0.1757   0.1379   0.1187

References

[1] Angelika Kimmig, Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. A short introduction to probabilistic soft logic.
[2] William Lee, Hui Fang, and Yifan Li. Effective information access over public email archives, 2005.
[3] Matthew Richardson and Pedro Domingos. Markov logic networks. Mach. Learn., 62(1-2):107-136, February 2006.
[4] Douglas Steinley. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3):386-396, 2004.