BOĞAZİÇİ UNIVERSITY
DEPARTMENT OF MANAGEMENT INFORMATION SYSTEMS
Event-Based Clustering on Turkish Daily News
Doruk Güneş
Fatih Ok
July 2017
Table of Contents

1 Introduction
  1.1 Information Retrieval
  1.2 Machine Learning
2 Background
  2.1 Bag of Words
  2.2 Vector Space Model
    2.2.1 TF-IDF Value
  2.3 Similarity and Distances
    2.3.1 Euclidean Distance
    2.3.2 Cosine Similarity
  2.4 Clustering
    2.4.1 K-Means Clustering
    2.4.2 Agglomerative Clustering
    2.4.3 DBSCAN
    2.4.4 Incremental Clustering
  2.5 Evaluation Measures
    2.5.1 Internal Clustering Validation Measures
    2.5.2 External Clustering Validation Measures
3 Methodology
  3.1 Online and Incremental Clustering
  3.2 Implementation
  3.3 Evaluation
  3.4 Discussion
4 Conclusion
5 References
1 Introduction
Since the news industry inevitably shifted from traditional publishing to online publishing, news content is being generated rapidly. In Turkey, the majority of news articles are published by more than a hundred digital providers. Readers can follow all of these sources not only through social media but also through aggregator apps such as Flipboard or Bundle. However, this innovation also brings problems. Much of the time, different articles cover the same story or are duplicated directly from one another, so the end user may be faced with redundant news streams. There is therefore a strong need for automated solutions that organize online articles.
Aggregators such as Google News and Bing News address this problem by presenting their content in event-based clusters. These systems rely on methods and algorithms that will be discussed in this report. However, since these solutions focus on non-Turkish stories, there are still problems to be solved for Turkish news sources.
In summary, the main purpose of this research is to test and evaluate clustering methodologies that group Turkish news articles from various sources according to the events they report.
1.1 Information Retrieval
Information Retrieval (IR) is the activity of obtaining resources relevant to an information need from a collection of information resources.
In this project, IR is the first process that must be completed. We therefore developed a web crawler (a spider) that extracts all the data we need from a regular article. The spider continuously tracks the home pages of five major digital publishers, fetches newly published stories, and saves them in a MongoDB instance.
The crawler is built with Scrapy, a Python framework that provides all the necessary modules, such as downloaders, schedulers, parsers, and item pipelines. The only remaining work is to connect these modules and supply source-specific selectors such as XPath expressions or CSS classes. The architecture of a Scrapy-powered web crawler is shown in Figure 1.
Figure 1 - The architecture overview of a Scrapy-powered crawler (source: scrapy.org)
1.2 Machine Learning
Machine learning is the science of designing and developing algorithms that learn from data, such as sensor readings or database records.
All machine learning techniques that have been used for document clustering in this project are explained and evaluated in the background section.
2 Background
2.1 Bag of Words
The bag-of-words model represents a text document as an unordered collection (a "bag") of its words, disregarding grammar and word order. In this work, the bag of words consists of the unique unigram terms that can be extracted from a document.
Figure 2 - Bag-of-words representation
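As a minimal sketch (the headline fragment below is hypothetical), a bag-of-words representation can be built by simply counting unigram tokens:

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split on whitespace, and count unigram terms."""
    return Counter(text.lower().split())

# Hypothetical headline fragment.
bow = bag_of_words("Ankara haber Ankara gündem")
print(bow)  # Counter({'ankara': 2, 'haber': 1, 'gündem': 1})
```

A real pipeline would also strip punctuation and apply Turkish-aware normalization, which this sketch omits.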
2.2 Vector Space Model
The vector space model is an algebraic model that represents text documents as vectors of identifiers. Since algorithms such as k-means can efficiently cluster only numerical values, such a representation is required. Given the bag of words extracted from a document, a feature vector is built in which each word (term) is a feature and its value is the term weight. The term weight can be a TF-IDF value.
2.2.1 TF-IDF value
TF-IDF, or term frequency-inverse document frequency, is intended to reflect how important a word is to a document within a collection. TF-IDF not only considers the frequency of a word in a document (TF) but also weights each term by the inverse of its document frequency (IDF).
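A sketch of the idea, using one common TF-IDF variant (raw term frequency times log-scaled inverse document frequency) over a hypothetical three-document tokenized corpus:

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """TF-IDF with raw term frequency and log-scaled IDF.
    Libraries such as scikit-learn use a smoothed variant of the same idea."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [
    ["seçim", "sonuç", "ankara"],  # hypothetical tokenized articles
    ["maç", "sonuç", "gol"],
    ["ankara", "trafik"],
]
# "sonuç" occurs in 2 of 3 documents, so it gets a lower weight than
# "seçim", which occurs in only 1.
print(tfidf("seçim", corpus[0], corpus))
print(tfidf("sonuç", corpus[0], corpus))
```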
2.3 Similarity and Distances
After documents have been transformed into vectors, a similarity measure must be selected before a distance-based clustering algorithm (e.g., k-means) can be applied. No measure universally fits all kinds of clustering problems (Huang, 2008). However, Euclidean distance and cosine similarity are both suitable metrics for document clustering.
2.3.1 Euclidean Distance
Euclidean distance is a metric for the distance between two points in two- or three-dimensional space, and it can also be used in clustering to measure distances between text documents. The distance between two documents $d_a$ and $d_b$, whose term vectors are $t_a$ and $t_b$ respectively, is calculated as
$$D_E(t_a, t_b) = \sqrt{\sum_{t=1}^{m} |w_{t,a} - w_{t,b}|^2}$$
where $w_{t,a} = \mathrm{tfidf}(d_a, t)$.
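The formula translates directly into Python; the two toy weight vectors below are hypothetical:

```python
import math

def euclidean_distance(ta, tb):
    """Euclidean distance between two equal-length term-weight vectors."""
    return math.sqrt(sum((wa - wb) ** 2 for wa, wb in zip(ta, tb)))

print(euclidean_distance([0.0, 3.0], [4.0, 0.0]))  # 5.0
```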
2.3.2 Cosine Similarity
Cosine similarity is a measure of the similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. It is one of the most commonly used measures for document clustering. If documents are represented as term vectors, the similarity of two documents is the correlation between the vectors. The similarity between two documents can be defined as
$$\mathrm{SIM}_C(t_a, t_b) = \frac{t_a \cdot t_b}{|t_a| \times |t_b|}$$
where $t_a$ and $t_b$ are the term vectors of the two documents over the term set $T = \{t_1, \ldots, t_m\}$.
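A direct Python sketch of the definition, with hypothetical toy vectors:

```python
import math

def cosine_similarity(ta, tb):
    """Cosine of the angle between two non-zero term vectors."""
    dot = sum(a * b for a, b in zip(ta, tb))
    norm_a = math.sqrt(sum(a * a for a in ta))
    norm_b = math.sqrt(sum(b * b for b in tb))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0: same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: no shared terms
```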
2.4 Clustering
Clustering is the grouping of a set of objects such that objects in the same group, called a cluster, share the same or similar attributes, in contrast to objects in other clusters. In machine learning, there are four main clustering methods: hierarchical clustering, centroid-based clustering, distribution-based clustering, and density-based clustering. Each of these methods includes various alternative sub-methods for specific purposes. In this research, we examine four clustering techniques that are commonly used in text-based clustering work.
2.4.1 K-Means Clustering
K-means clustering is a centroid-based clustering technique. It partitions n elements into k clusters such that each element belongs to the cluster with the nearest mean. The technique takes k as input and produces k clusters as output.
The k-means algorithm is a decent choice for datasets with a small number of proportionally sized clusters and linearly separable data, and it also scales well to large datasets.
For event-based clustering of daily news articles, the main problem with k-means is that the number of clusters is not dynamic. When a new article arrives, it must enter an existing cluster rather than create a new one with a single element. Thus, if the content is not related to a past event, the clusters cannot remain consistent over time.
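A minimal scikit-learn sketch over a hypothetical four-article toy corpus shows both the mechanics and the limitation: k must be fixed before any article arrives.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical articles: two cover one event, two cover another.
docs = [
    "election results announced in the capital",
    "capital election results announced today",
    "football match decided by a late goal",
    "late goal decides the football match",
]
X = TfidfVectorizer().fit_transform(docs)

# k = 2 must be chosen up front -- the limitation discussed above.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the two events fall into two separate clusters
```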
2.4.2 Agglomerative Clustering
Agglomerative clustering is a strategy for hierarchical clustering. In this approach, each observation starts in its own cluster, and these clusters are progressively merged from the bottom up to build the hierarchy.
As with k-means, the result is fixed by a parameter chosen in advance: either a target number of clusters or a distance threshold at which merging stops. This compels the cluster count toward a predetermined value rather than letting the number of clusters grow dynamically as new events occur.
2.4.3 DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm. It computes the distances from each object to its neighbors and forms clusters by grouping regions that contain more objects than a predetermined threshold.
In contrast to k-means and agglomerative clustering, the DBSCAN algorithm suits datasets with an unknown number of clusters. However, for event-based clustering of daily news stories, it requires substantial computing power and memory, since the algorithm must recalculate the clusters whenever new data arrives.
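A small scikit-learn sketch with hypothetical 2-D points illustrates the behavior: DBSCAN discovers the number of clusters on its own and marks isolated points as noise.

```python
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (hypothetical data).
points = [[0.0, 0.0], [0.0, 0.1], [0.2, 0.0],
          [5.0, 5.0], [5.0, 5.1],
          [10.0, 10.0]]

# eps is the neighborhood radius; min_samples is the density threshold.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(list(labels))  # two clusters are found; the isolated point gets label -1
```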
2.4.4 Incremental Clustering
In incremental clustering, the aim is to reduce storage and processing costs in terms of time, CPU usage, and budget. The idea is to process one element at a time and typically store only a small subset of the data (Ackerman & Dasgupta).
Online data sources such as news sites and blogs constantly publish new documents, which calls for clustering that is both effective and efficient. For event-based clustering of daily news stories, incremental document clustering systems can be more beneficial than other clustering techniques used alone.
2.5 Evaluation Measures
The main objective of clustering algorithms is high intra-cluster similarity and low inter-cluster similarity. Internal measures are designed to evaluate the quality of clustering algorithms in this respect by computing cluster similarities; they evaluate the goodness of a clustering structure without reference to external information (Liu, Li, Xiong, Gao, & Wu, 2010). However, good scores on internal measures do not necessarily imply good effectiveness in an application. In the next section, we look at the silhouette coefficient as an internal measure.
External measures, on the other hand, are often referred to as gold standards: they compare a clustering against externally supplied class labels. Most of the time this labeling process relies on human labelers and human intuition. The V-measure is the external measure we will look at.
2.5.1 Internal Clustering Validation Measures
2.5.1.1 Silhouette Coefficient
The silhouette coefficient measures the dissimilarity of a document to its own cluster against its dissimilarity to the closest neighboring cluster. It can be calculated from the mean intra-cluster distance $x$ and the mean nearest-cluster distance $y$ of each document:
$$s = \frac{y - x}{\max(x, y)}$$
The silhouette coefficient ranges from -1 to 1. Values below 0 are undesirable, since they indicate that some other cluster is more similar to the point than its own. Values near 0 generally mean that clusters overlap, and 1 is the best value that can be obtained.
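scikit-learn provides this measure as `silhouette_score`; a sketch over hypothetical well-separated 2-D points:

```python
from sklearn.metrics import silhouette_score

# Two well-separated toy clusters (hypothetical data).
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels = [0, 0, 0, 1, 1, 1]

score = silhouette_score(X, labels)
print(score)  # close to 1, since intra-cluster distances are tiny
```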
2.5.2 External Clustering Validation Measures
2.5.2.1 V-Measure
The V-measure compares a manually labeled target clustering against an automatically created clustering to determine how similar the two clusterings are.
To understand the logic and formula of the V-measure, we need to look at homogeneity and completeness. A clustering result satisfies homogeneity when each cluster contains only data points that are members of a single class. To satisfy completeness, all data points that are members of a given class must be assigned to the same cluster (Rosenberg & Hirschberg, 2007).
The V-measure is the harmonic mean of homogeneity and completeness:
$$v = \frac{2 \times homogeneity \times completeness}{homogeneity + completeness}$$
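scikit-learn exposes homogeneity, completeness, and the V-measure directly. The sketch below uses hypothetical ground-truth event labels and checks the harmonic-mean relation:

```python
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

truth = [0, 0, 0, 1, 1, 1]  # hypothetical manually labeled events
pred = [0, 0, 1, 1, 1, 1]   # a clustering that misplaces one article

h = homogeneity_score(truth, pred)
c = completeness_score(truth, pred)
v = v_measure_score(truth, pred)
print(h, c, v)
# v is the harmonic mean of h and c:
assert abs(v - 2 * h * c / (h + c)) < 1e-9
```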
3 Methodology
In our project, we collect and save all published stories as they appear and try to determine whether each new article is related to a previous one. As mentioned earlier, clustering daily news articles demands extra efficiency because of real-time streaming. For this reason, rather than running a conventional clustering algorithm for every arriving article, we developed an incremental approach based on the k-means clustering algorithm.
The system tracks all records and applies the algorithm to each record without rerunning the clustering over the complete dataset. A basic representation of the architecture can be seen in Figure 3.
In this section, we discuss the incremental clustering approach, its application in our event-based clustering system, and the evaluation of the methodology.
Figure 3 - Basic representation of the architecture.
3.1 Online and Incremental Clustering
As described by Lönnberg and Love (2013), an Incremental Clustering Algorithm (ICA) clusters incremental data quickly. Processing one object at a time, it either places the object in a pre-existing cluster or creates a new cluster containing only that object.
If no cluster has been created previously, in other words, if the algorithm is processing the very first object, a new cluster is created for it. For each subsequent object, the algorithm finds the most similar existing object and checks that similarity against a predefined threshold. If the similarity is above the threshold, the new object is assigned to the cluster of the most similar object; if it is below the threshold, a new cluster is created.
Rather than finding the most similar object, an alternative approach is to assign the new object to the cluster of any object that meets the threshold.
Both approaches are discussed and evaluated in later sections.
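The first approach (most similar object plus threshold) can be sketched in plain Python with cosine similarity; the input vectors below are hypothetical stand-ins for TF-IDF vectors, and vector construction and persistence are simplified away:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def incremental_cluster(vectors, threshold):
    """Assign each incoming vector to the cluster of its most similar
    previously seen vector, or open a new cluster when no similarity
    reaches the threshold."""
    clusters = []  # cluster id per processed vector, aligned with `seen`
    seen = []      # previously processed vectors
    next_id = 0
    for v in vectors:
        if seen:
            sims = [cosine(v, s) for s in seen]
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= threshold:
                clusters.append(clusters[best])
                seen.append(v)
                continue
        clusters.append(next_id)  # first object, or nothing similar enough
        next_id += 1
        seen.append(v)
    return clusters

# Hypothetical vectors: items 0 and 1 are similar, item 2 is not.
stream = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(incremental_cluster(stream, threshold=0.8))  # [0, 0, 1]
```

The second approach would simply scan `seen` and stop at the first object meeting the threshold, trading assignment quality for speed.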
3.2 Implementation
3.3 Evaluation
3.4 Discussion
4 Conclusion
5 References
Ackerman, M., & Dasgupta, S. (n.d.). Incremental clustering: The case for extra clusters. Retrieved from NIPS: https://papers.nips.cc/paper/5608-incremental-clustering-the-case-for-extra-clusters.pdf
Huang, A. (2008). Similarity measures for text document clustering. Hamilton: The University of Waikato, Department of Computer Science.
Liu, Y., Li, Z., Xiong, H., Gao, X., & Wu, J. (2010). Understanding of internal clustering validation measures. International Conference on Data Mining, 911-916.
Lönnberg, M., & Love, Y. (2013). Large scale news article clustering. Sweden: Chalmers University of Technology, Department of Computer Science and Engineering.
Rosenberg, A., & Hirschberg, J. (2007). V-Measure: A conditional entropy-based external cluster evaluation measure.
scikit-learn documentation. sklearn.metrics.silhouette_score. Retrieved from http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
Wikipedia. (2017, March 30). tf–idf. Retrieved from https://en.wikipedia.org/wiki/Tf–idf