CS473: Web Information Retrieval, Fall 2014
Final Project Report
John Moore, Ed Flanagan, Arjun Thakar
Handed In: November 26, 2014

1 Introduction

Clustering has been a widely researched problem. There are many well-known hierarchical clustering algorithms, along with variations of each, that cluster instances given a single distance measure such as Euclidean or Manhattan distance. However, these algorithms use only one distance measure, whereas in many problems we are given several distance measures, each of which may provide valuable information about how similar two instances of a dataset are. We propose using Probabilistic Soft Logic (PSL), a framework similar to a Markov logic network, to combine multiple distance functions into one, which we can then use in any clustering algorithm [1]. In general, many hierarchical clustering algorithms require only a distance matrix as input, and we take advantage of this fact by computing each pairwise distance between posts with PSL. In this paper we apply hierarchical clustering with complete linkage, using the distances learned by PSL, to a Stack Exchange dataset. Results show that in certain scenarios, particularly on smaller datasets, our approach does much better than clustering the documents with a single distance measure.

2 First Order Logic

A first-order knowledge base (KB) is a set of sentences or formulas in first-order logic. Formulas are constructed from four types of symbols: constants, variables, functions, and predicates. Constants represent objects of interest (e.g. people), variables range over possible objects, functions are mappings from tuples of objects to objects (e.g. FatherOf), and predicate symbols represent relations among objects (e.g. friendship links). An atomic formula (atom) is a predicate symbol applied to a tuple of terms (e.g. friend(B, A) or sameCollege(A, C)) [3].

3 Probabilistic Soft Logic

A PSL program consists of a set of first-order logic rules with conjunctive bodies and single-literal heads. PSL is very similar to Markov logic networks; however, one notable difference is that PSL relaxes the hard constraint of a predicate being either 1 or 0. Instead, PSL assigns soft truth values from the interval [0, 1], where 0 corresponds to false and 1 corresponds to true. Rules are associated with non-negative weights, which determine how strongly each rule is expected to hold [1]. The following illustrates how we can predict whether two people go to the same college:

0.2 : friend(B, A) ∧ sameCollege(A, C) → sameCollege(B, C)
0.7 : roommate(B, A) ∧ sameCollege(A, C) → sameCollege(B, C)

The value on the very left is the weight, which encodes how much we 'trust' each rule. Ideally, the second rule is more important than the first, which is encoded via its larger weight. A rule r ≡ r_body → r_head ≡ ¬r_body ∨ r_head is satisfied, I(r) = 1, if and only if I(r_body) ≤ I(r_head). A rule's distance to satisfaction is then

d_r(I) = max(0, I(r_body) − I(r_head)).

For example, if I(r_body) = 0.7 and I(r_head) = 0.5, the distance to satisfaction is 0.2.

4 Distance Measures

Stack Exchange forum data has many features from which we can calculate distances between posts. Features that allow distance calculation include creation date, last activity date, view count, title, body, and tags. To calculate distances between dates, we compute the number of minutes separating two posts. For view counts, we simply take the difference in the number of views. For title, body, and tags, we can use a vector space model and compute Okapi-weighted vectors.
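As a concrete illustration of the non-text distances described above, the following is a minimal Python sketch of the date and view-count distances and of assembling a pairwise distance matrix for one measure. The dictionary fields mirror attribute names in the Stack Exchange dumps, but the helper names and the assumed timestamp format are ours and purely illustrative, not part of any released code.

    from datetime import datetime

    # Timestamp format assumed for the Stack Exchange dumps; adjust if the dump differs.
    DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

    def date_distance(post_a, post_b, field="CreationDate"):
        """Number of minutes separating two posts on a date attribute."""
        t_a = datetime.strptime(post_a[field], DATE_FORMAT)
        t_b = datetime.strptime(post_b[field], DATE_FORMAT)
        return abs((t_a - t_b).total_seconds()) / 60.0

    def view_count_distance(post_a, post_b):
        """Absolute difference in view counts."""
        return abs(int(post_a["ViewCount"]) - int(post_b["ViewCount"]))

    def distance_matrix(posts, dist_fn):
        """Symmetric n x n matrix of pairwise distances for one measure."""
        n = len(posts)
        matrix = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                matrix[i][j] = matrix[j][i] = dist_fn(posts[i], posts[j])
        return matrix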
To calculate a distance between two of these Okapi-weighted vectors, we can then simply take the dot product, as performed in [2].

5 Algorithm

To solve our problem, we propose using PSL to combine distance measures. We create the following rules in our PSL model:

d1(A, B) → sameCluster(A, B)
d2(A, B) → sameCluster(A, B)
...
dn(A, B) → sameCluster(A, B)

where n is the number of distance measures (in our case, 6). The following rules encode transitivity:

d1(A, B) & d1(B, C) → d1(A, C)
d2(A, B) & d2(B, C) → d2(A, C)
...
dn(A, B) & dn(B, C) → dn(A, C)

We also define priors:

~d1(A, B)
~d2(A, B)
...
~dn(A, B)

First, we calculate our distance measures; then, for each pair of posts, we obtain a sameCluster probability via inference over these rules. Next, we use a function to transform these probabilities into distances. There are a variety of functions that can convert a probability of similarity into a distance; in this paper, we take the complement of the sameCluster probability. Since we calculate distances between each pair of posts, our algorithm is O(p^2), where p is the number of posts, and we encode the distances into a distance matrix whose (i, j)-th element is the distance between post i and post j. We can then apply any well-known clustering method that takes a distance matrix as input.

In general, our method takes as input P, the PSL model together with its parameters and algorithms for weight learning and inference; our experiments use the default parameters for weight learning and inference in [1]. D is a set of distance measure transformations (how to calculate a distance given an attribute and a distance function); these are the distance measures defined in the Distance Measures section. F is a transformation function that maps a probability to a distance; in our case, we simply take the complement of the sameCluster probability. G is our input post data as described in the Data section. CA is a given clustering algorithm; in this paper, we use hierarchical clustering with complete linkage to cluster our posts.

Algorithm 1 Overall Method(P, D, F, G, CA)
1: Let M be an array of n × n distance matrices
2: for i in 1..length(D) do
3:   Compute the n × n distance matrix Mi from G
4: end for
5: Create k rules in P
6: Let P use M
7: Compute k weights W from P and M
8: Compute the n × n sameCluster probabilities C from W and M using inference in P
9: Transform C into the distance matrix TD using F
10: Compute the resulting clusters RC from CA and TD
11: return RC

6 Evaluation

6.1 Data

We chose to use Stack Exchange data since it is a well-known forum website that releases feature-rich datasets at https://archive.org/details/stackexchange. In order to evaluate performance, we chose subsets of already categorized datasets. Ground truth simply consists of each forum post associated with its predetermined category. In our evaluation, we chose 5 categories: academia, aviation, beer, chemistry, and bicycles. We chose these categories because some of them overlap (e.g. academia and chemistry), while others are very different (e.g. aviation and beer).

6.2 Evaluation Methodology

To evaluate clustering, we use two standard evaluation measures. The first is clustering entropy, an external evaluation measure [2]. In general, lower entropy indicates less randomness within clusters, while higher entropy indicates the opposite; therefore, lower scores indicate better clusters.
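As a concrete reference for this measure, the following minimal Python sketch computes a weighted clustering entropy from predicted cluster assignments and ground-truth categories. This is one standard definition of the measure, which we assume matches the one in [2]; the function and variable names are illustrative.

    from collections import Counter
    from math import log

    def clustering_entropy(clusters, labels):
        """Weighted average entropy of the ground-truth label distribution
        inside each predicted cluster; lower is better.

        clusters: list of predicted cluster ids, one per post
        labels:   list of ground-truth category names, one per post
        """
        n = len(labels)
        total = 0.0
        for c in set(clusters):
            members = [labels[i] for i in range(n) if clusters[i] == c]
            size = len(members)
            counts = Counter(members)
            entropy = -sum((cnt / size) * log(cnt / size, 2) for cnt in counts.values())
            total += (size / n) * entropy
        return total

    # Example: three posts in one cluster, two in another
    print(clustering_entropy([0, 0, 0, 1, 1],
                             ["beer", "beer", "aviation", "chemistry", "chemistry"]))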
The second measure is the Adjusted Rand Index (ARI), which, similar to accuracy, compares the original ground truth (the predefined categories) to the resulting clusters; as with accuracy, higher scores indicate better clustering [4].

As baselines, we also cluster using the original distance matrix from each attribute alone. Let di refer to each distance matrix, where 1 ≤ i ≤ 6: d1 is Creation Date, d2 is Last Activity Date, d3 is View Count, d4 is Title, d5 is Body, and d6 is Tags.

We ran two groups of tests. The first group used randomly selected posts to construct a dataset for testing. Since the first group did not yield good results, we ran a second group of tests that used the posts with the largest bodies in each category, with the intuition that posts with small bodies do not have indicative text features. We also vary the number of attributes used in each test, and we denote each test by the number of attributes used and the sampling technique. Three attributes corresponds to the text attributes (title, body, and tags), while six corresponds to using all attributes; this lets us see whether adding the non-text attributes results in better clusters. The Random sampling technique refers to picking posts at random, and Big refers to picking the posts with the biggest bodies, measured by the number of characters in the body. We varied the size of our datasets by picking 5, 10, 15, ..., 40 posts from each of the 5 categories, resulting in 25, 50, 75, ..., 200 posts.

In the result tables, E. M. refers to the entropy of our model and AR. M. to its adjusted Rand index. Similarly, E. di and AR. di refer to the entropy and adjusted Rand index obtained by clustering with the di distance matrix alone; that is, we use each of these distance matrices directly in the hierarchical clustering algorithm with complete linkage, and the resulting clusters are compared to ground truth to obtain entropy and adjusted Rand index scores. We perform these experiments in order to compare our method's results to this set of baselines.

6.3 Results

As we can see from the results (Configurations 1-4, tabulated at the end of this report), our method consistently reduces entropy as the size of the dataset increases, while the individual measures tend to stay around the same entropy value. This is probably because aggregating our measures through PSL results in a space of posts where similar posts are more likely to be in the same 'region'. However, as the amount of data increases, the adjusted Rand index of our method consistently decreases. This could be due to a non-indicative feature (d5) degrading the overall distance measure that PSL creates. Using the biggest posts, we find that using only 3 attributes in our model results in much higher adjusted Rand scores than using all 6, most likely because the text attributes are much more indicative than the non-text attributes in this context. The highest adjusted Rand index in Configuration 2 is 0.55, obtained with only 25 posts. It is interesting that when the data size is low, our method coupled with random sampling and all 6 attributes yields higher adjusted Rand index scores than biggest-body sampling with 6 attributes; when the size is large, big sampling yields the higher scores. Overall, when the data size is low, using only the text attributes and sampling the biggest posts appears to produce the best distances from our method, since this setting yields the best entropies and adjusted Rand scores.
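To make the final steps of Algorithm 1 and this evaluation concrete, the following is a minimal sketch, assuming SciPy and scikit-learn, of how a matrix of inferred sameCluster probabilities is turned into clusters and an adjusted Rand score. The names are illustrative and this is not our released implementation.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.metrics import adjusted_rand_score

    def cluster_and_score(same_cluster, labels, n_clusters=5):
        """same_cluster: n x n symmetric matrix of inferred sameCluster probabilities
        labels: list of n ground-truth category names"""
        # Transformation function F: complement of the sameCluster probability
        distances = 1.0 - np.asarray(same_cluster, dtype=float)
        np.fill_diagonal(distances, 0.0)

        # Hierarchical clustering with complete linkage on the condensed matrix
        condensed = squareform(distances, checks=False)
        tree = linkage(condensed, method="complete")
        clusters = fcluster(tree, t=n_clusters, criterion="maxclust")

        ari = adjusted_rand_score(labels, clusters)
        return clusters, ari

The entropy score can then be computed from the returned clusters and the ground-truth labels with the clustering_entropy sketch shown earlier.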
7 Conclusion

In this paper, we have successfully used PSL to combine many different yet informative distance measures into a single distance for use in a hierarchical clustering algorithm. One distance measure alone may only provide one dimension of similarity, whereas using multiple dimensions may capture the 'true' similarity between posts. PSL enables us to combine distance measures probabilistically, weighting some rules more heavily than others based upon how well they hold in the dataset. Using these rules and weights, PSL performs inference to infer sameCluster probabilities, which we transform into distances via a transformation function for use in any given hierarchical clustering algorithm. We have demonstrated that our method performs better than the baselines the majority of the time under Configuration 2. Our method does need to be tuned by selecting indicative attributes and sampling representative points, but once tuned, we believe it is a very effective method for clustering with several distance measures. In conclusion, this method has been observed to be much better than using any one distance measure alone in scenarios where a dataset has many indicative attributes and low dataset sizes.

Configuration 1 (3 attributes, Random sampling)
Size     25       50       75       100      125      150      175      200
E. M.    1.2764   0.7736   0.7918   0.6191   0.5467   0.3360   0.4467   0.3054
AR. M.   0.1186   0.0754   0.0590   0.0664   0.0475   0.0105   0.0257   0.0089
E. d4    0.6615   1.0054   1.1642   1.2457   1.3568   1.4219   1.3775   1.4412
AR. d4   0.0526   0.0792   0.0557   0.0768   0.1473   0.1670   0.1215   0.1096
E. d5    1.5685   1.5449   1.5906   1.5213   1.5002   1.5604   1.5647   1.4756
AR. d5   0.0192   0.0412   -0.0039  0.0432   0.0326   0.0789   0.0212   0.0112
E. d6    0.7668   0.9930   0.9815   1.0802   1.0633   1.3090   1.2530   1.3118
AR. d6   0.0357   0.0889   0.0924   0.02102  0.0581   0.1543   0.0959   0.0994

Configuration 2 (3 attributes, Big sampling)
Size     25       50       75       100      125      150      175      200
E. M.    1.4328   1.5147   1.1047   0.7632   0.6615   0.5972   0.3616   0.2868
AR. M.   0.55     0.4471   0.2846   0.1167   0.0762   0.0569   0.0141   0.0072
E. d4    0.6615   0.8472   1.0349   1.2263   1.3309   1.3570   1.4172   1.4421
AR. d4   0.0105   0.0497   0.1440   0.0932   0.0929   0.1433   0.1425   0.1329
E. d5    1.5061   1.4890   1.5559   1.4852   1.5851   1.5093   1.5105   1.5767
AR. d5   0.0146   -0.0188  0.0157   0.0323   0.0031   0.0079   0.0061   0.0012
E. d6    0.8493   1.0052   1.1366   1.2172   1.1876   1.3133   1.3425   1.4101
AR. d6   0.0602   0.0495   0.0878   0.0952   0.0637   0.1757   0.1379   0.1187

Configuration 3 (6 attributes, Random sampling)
Size     25       50       75       100      125      150      175      200
E. M.    1.4587   1.3229   1.1839   0.5239   0.5241   0.4307   0.4277   0.3055
AR. M.   0.3586   0.1295   0.0852   0.0184   0.0196   0.0118   0.0222   0.0089
E. d1    1.5752   1.5636   1.4824   1.4205   1.4422   1.4981   1.3972   1.5275
AR. d1   -0.0233  0.0056   0.0462   -0.0068  0.0215   0.0072   0.0080   0.0093
E. d2    1.5560   1.4879   1.5425   1.5741   1.5244   1.5654   1.5039   1.5628
AR. d2   -0.0568  0.0063   0.0730   0.0530   0.0770   -0.0013  0.0072   0.0214
E. d3    1.5124   1.4981   1.5377   1.5416   1.5608   1.5898   1.5705   1.575
AR. d3   0.0543   0.0330   -0.0200  0.0153   0.0051   0.0041   -0.0008  0.0072
E. d4    0.7668   1.0404   0.7688   1.2244   1.4364   1.4358   1.3679   1.4984
AR. d4   0.0357   0.2298   0.0317   0.1862   0.1518   0.0798   0.0970   0.0993
E. d5    1.4994   1.5629   1.5454   1.5743   1.4614   1.5275   1.5782   1.4990
AR. d5   0.0109   0.0066   0.1179   0.0419   0.0416   0.0155   0.0242   0.0181
E. d6    0.6615   0.8167   0.9510   0.9878   1.1335   1.2652   1.3212   1.2504
AR. d6   0.0211   0.0295   0.0928   0.1288   0.1299   0.0744   0.0678   0.0926

Configuration 4 (6 attributes, Big sampling)
Size     25       50       75       100      125      150      175      200
E. M.    1.4146   1.3278   0.8713   0.6802   0.6944   0.5707   0.4572   0.4897
AR. M.   0.1414   0.0972   0.0587   0.0396   0.0470   0.0246   0.0220   0.0346
E. d1    1.5685   1.5582   1.5194   1.4771   1.4730   1.5092   1.5790   1.5376
AR. d1   -0.05    0.0265   0.0751   -0.0071  0.0537   0.0177   0.0067   0.0159
E. d2    1.4994   1.5866   1.5414   1.5598   1.6037   1.5212   1.5001   1.5721
AR. d2   -0.0109  -0.0339  0.0376   -0.0078  0.0035   -0.0002  -0.0005  0.0196
E. d3    1.5560   1.5242   1.5624   1.5914   1.5900   1.5083   1.5591   1.5373
AR. d3   0.0341   0.0363   -0.0062  0.0051   -0.0086  -0.0131  -0.0056  -0.0032
E. d4    0.6615   0.8472   1.0349   1.2263   1.3309   1.3570   1.4172   1.4421
AR. d4   0.0105   0.0497   0.1440   0.0932   0.0929   0.1433   0.1425   0.1329
E. d5    1.5061   1.4890   1.5559   1.4852   1.5851   1.5093   1.5105   1.5767
AR. d5   0.0146   -0.0188  0.0157   0.0323   0.0031   0.0079   0.0061   0.0012
E. d6    0.8493   1.0052   1.1366   1.2172   1.1876   1.3133   1.3425   1.4101
AR. d6   0.0602   0.0495   0.0878   0.0952   0.0637   0.1757   0.1379   0.1187

References

[1] Angelika Kimmig, Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. A short introduction to probabilistic soft logic.
[2] William Lee, Hui Fang, and Yifan Li. Effective information access over public email archives, 2005.
[3] Matthew Richardson and Pedro Domingos. Markov logic networks. Mach. Learn., 62(1-2):107-136, February 2006.
[4] Douglas Steinley. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3):386-396, 2004.