A Methodology for Mining Document-Enriched
Heterogeneous Information Networks
Miha Grčar1, Nejc Trdin1, Nada Lavrač1
1
Jožef Stefan Institute, Dept. of Knowledge Technologies, Jamova cesta 39, 1000 Ljubljana, Slovenia
{Miha.Grcar, Nejc Trdin, Nada.Lavrac}@ijs.si
Abstract The paper presents a new methodology for mining heterogeneous information
networks, motivated by the fact that, in many real-life scenarios, documents are available in
heterogeneous information networks, such as interlinked multimedia objects containing titles,
descriptions, and subtitles. The methodology consists of transforming documents into bag-of-words vectors, decomposing the corresponding heterogeneous network into separate graphs and
computing structural-context feature vectors with PageRank, and finally constructing a common
feature vector space in which knowledge discovery is performed. We exploit this feature vector
construction process to devise an efficient classification algorithm. We demonstrate the
approach by applying it to the task of categorizing video lectures. We show that our approach
exhibits low time and space complexity without compromising classification accuracy.
Keywords: text mining, heterogeneous information networks, data fusion,
classification, centroid-based classifier, diffusion kernels
1 Introduction
In many real-life data mining scenarios involving document analysis, the
accompanying data can be represented in the form of heterogeneous information
networks. We address this data analysis setting by proposing a methodology which
takes advantage of both research fields, text mining and mining heterogeneous
information networks.
Text mining [1], which aims at extracting useful information from document
collections, is a well-developed field of computer science. In the last decade, text
mining research was driven by the growth of the size and the number of document
collections available in corporate and governmental environments and especially by
the rapid growth of the world’s largest source of semi-structured data, the Web. Text
mining is used to extract knowledge from document collections by using data mining,
machine learning, natural language processing, and information retrieval techniques.
Like in data mining, but even more importantly, the data preprocessing step plays a
crucial role in text mining. In this step, documents are transformed into feature
vectors according to a certain representational model and then processed with the
available machine learning algorithms that can handle sparse vector collections with
high feature dimensionality and continuous or binary features (such as k-NN, k-Means, SVM, and Naive Bayes).
Naturally, not all data comes in the form of documents. A lot of recent data mining
research is performed on the data from networked systems where individual agents or
components interact with other components, forming large, interconnected, and
heterogeneous networks. For short, such networks are called heterogeneous
information networks [2]. Some examples of heterogeneous information networks are
communication and computer networks, transportation networks, epidemic networks,
social networks, e-mail networks, citation networks, biological networks, and also the
Web (with the emphasis on its structure). Such networks can also be formed from data
in relational databases and ontologies where the objects are interlinked with
heterogeneous links. In heterogeneous information networks, knowledge discovery is
usually performed by resorting to social network analysis [3], link analysis techniques
[4], and other dedicated approaches to mining heterogeneous information networks
[2].
In many real-life scenarios, documents are available in information networks. This
results in heterogeneous information networks in which (some) objects are associated
also with a corresponding set of text documents. Examples of such networks include
the Web (interlinked HTML documents), multimedia repositories (interlinked
multimedia descriptions, subtitles, slide titles, etc.), social networks of professionals
(interlinked CVs), citation networks (interlinked publications), and even software
code (heterogeneously interlinked code comments). The abundance of such
document-enriched networks motivates the development of a new methodology that
joins the two worlds, text mining and mining heterogeneous information networks,
and handles the two types of data in a common data mining framework.
The methodology presented in this paper is based on decomposing a heterogeneous
network into (homogeneous) graphs, computing feature vectors with Personalized
PageRank [5], and constructing a common vector space in which knowledge
discovery is performed. Heterogeneity is taken into account in this final step of the
methodology where all the structural contexts and the text documents are “fused”
together.
We demonstrate the methodology by applying it to the categorization of video
lectures on VideoLectures.net <http://videolectures.net/>, one of the world’s largest
academic video hosting Web portals.
The paper is structured as follows. In Section 2 we present the related work. In
Section 3 the proposed methodology is presented. Section 4 presents the experimental
results of employing the methodology in a real-life video-lecture categorization task.
We present a new centroid classifier that exploits the properties of the presented
feature vector construction process in Section 5 and discuss its efficiency in terms of
reduced time and space complexity in Section 6. Section 7 concludes the paper with
plans for future work.
2 Related Work
Text mining employs basic machine learning principles, such as supervised and
unsupervised learning [6], to perform tasks such as text categorization (also known as
“text classification”), topic ontology construction, text corpora visualization [7], and
user profiling [8]. Most text mining approaches rely on the bag-of-words vector
representation of documents [9].
Text categorization is a widely researched area due to its value in real-life
applications such as indexing of scientific articles, patent categorization, spam
filtering, and Web page categorization [10]. In [11], the authors present a method for
categorizing Web pages into the Yahoo! Taxonomy <http://dir.yahoo.com/>. They
employ a set of Naive Bayes classifiers, one for each category in the taxonomy. For
each category, the corresponding classifier gives the probability that the document
belongs to this category. A similar approach is presented in [12], where Web pages
are being categorized into the DMoz taxonomy <http://www.dmoz.org/>. Each
category is modeled with the corresponding centroid bag-of-words vector and a
document is categorized simply by computing the cosine similarity between the
document’s bag-of-words vector and each of the computed centroids. Apart from
Naive Bayes [6] and centroid-based classifiers [13; 27], SVM [14] is also a popular
classifier for text categorization.
In the field of mining heterogeneous information networks, a different family of
analysis algorithms was devised to deal with data analysis problems. Important
building blocks are the techniques that can be used to assess the relevance of an object
(with respect to another object or a query) or the similarity between two objects in a
network. Some of these techniques are: spreading of activation [15], hubs and
authorities (HITS) [16], PageRank and Personalized PageRank [5], SimRank [17],
and diffusion kernels [18]. These methods are extensively used in information-retrieval systems. The general idea is to propagate “authority” from “query nodes”
into the rest of the graph or heterogeneous network, assigning higher ranks to more
relevant objects.
ObjectRank [20] employs global PageRank (importance) and Personalized PageRank
(relevance) to enhance keyword search in databases. Specifically, the authors convert
a relational database of scientific papers into a graph by constructing the data graph
(interrelated instances) and the schema graph (concepts and relations). To speed up
the querying process, they precompute Personalized PageRank vectors (PPVs) for all
possible query words. HubRank [19] is an improvement of ObjectRank in terms of
space and time complexity without compromising the accuracy. It examines query
logs to compute several hubs for which PPVs are precomputed. In addition, instead of
precomputing full-blown PPVs, they compute fingerprints [23] which are a set of
Monte Carlo random walks associated with a node. Stoyanovich et al. [21] present a
ranking method called EntityAuthority which defines a graph-based data model that
combines Web pages, extracted (named) entities, and ontological structure in order to
improve the quality of keyword-based retrieval of either pages or entities. The authors
evaluate three conceptually different methods for determining relevant pages and/or
entities in such graphs. One of the methods is based on mutual reinforcement between
pages and entities, while the other two approaches are based on PageRank and HITS,
respectively.
For classification tasks, Zhu and Ghahramani [30] present a method for
transductive learning which first constructs a graph from the data and then propagates
labels along the edges to label (i.e., classify) the unlabeled portion of the data. The
graph regularization framework proposed by Zhou and Schölkopf [24] can also be
employed for categorization. However, most of these methodologies are devised for
graphs rather than heterogeneous networks. GNetMine [25] is built on top of the
graph regularization framework but takes the heterogeneity of the network into
account and consequently yields better results. CrossMine [26] is another system that
exploits heterogeneity in networks. It constructs labeling rules while propagating
labels along the edges in a heterogeneous network. These approaches clearly
demonstrate the importance of handling different types of relations and/or objects in a
network separately.
Even though in this paper we deal with feature vectors rather than kernels, the kernel-based data fusion approach presented by Lanckriet et al. [28] is closely related to our work. In their method, the authors propose a general-purpose methodology for kernel-based data fusion. They represent each type of data with a kernel and then compute a weighted linear combination of kernels (which is again a kernel). The linear-combination weights are computed through an optimization process called Multiple
Kernel Learning (MKL) [22; 29] which is tightly integrated into the SVM’s margin
maximization process. In [28], the authors define a quadratically constrained
quadratic program (QCQP) in order to compute the support vectors and linear-combination weights that maximize the margin. In the paper, the authors employ their
methodology for predicting protein functions in yeast. They fuse together 6 different
kernels (4 of them are diffusion kernels based on graph structures). They show that
their data fusion approach outperforms SVM trained on any single type of data, as
well as a previously published method based on Markov random fields. In the
approach employed in our case study, we do not employ MKL but rather a stochastic
optimizer called Differential Evolution [31] which enables us to directly optimize the
target evaluation metric.
From a high-level perspective, the approaches presented in this section either
(1) extract features from text documents for the purpose of document categorization,
(2) categorize objects by propagating rank, similarity, or labels with a PageRank-like
authority propagation algorithm, or (3) take network heterogeneity into account in a
classification setting. In this work, we employ several well-established approaches
from these three categories. The main contribution is a general-purpose framework for
feature vector construction establishing an analogy between bag-of-words vectors and
Personalized PageRank (P-PR). In contrast to the approaches that use authority
propagation algorithms for label propagation, we employ P-PR for feature vector
construction. This allows us to “fuse” text documents and different types of structural
information together and thus take the heterogeneity of the network into account. Our
methodology is thoroughly discussed in the following sections.
3 Proposed Methodology
This section presents the proposed methodology for transforming a heterogeneous
information network into a feature vector representation. We assume that we have a
heterogeneous information network that can be decomposed into several
homogeneous undirected graphs with weighted edges, each representing a certain type
of relationship between objects of interest (see Section 4.1 for an example, where
edge weights represent either video lecture co-authorship counts or the number of
users viewing the same video lecture). We also assume that several objects of interest
are associated with text documents, which is not mandatory for the methodology to
work. Fig. 1 illustrates the proposed feature vector construction process.
Fig. 1. The proposed methodology for transforming a heterogeneous information network and the corresponding
text documents into a joint feature vector format. Feature vector construction is shown for one particular object.
Text documents are first transformed into feature vectors (i.e., TF-IDF bag-of-words
vectors) as briefly explained in Section 3.1. In addition, each graph (resulting from the
decomposition of the given heterogeneous network) is transformed into a set of
feature vectors. We employ Personalized PageRank for this purpose as explained in
Section 3.2. As a result, each object is now represented as a set of feature vectors (i.e.,
one for each graph and one for the corresponding text document). Finally, the feature
vectors describing a particular object are combined into a single concatenated feature
vector as discussed in Section 3.3. We end up with a typical machine learning setting
in which each object, representing either a labeled or unlabeled data instance, is
represented as a sparse feature vector with continuous feature values. The
concatenated feature vectors are ready to be employed for solving data mining tasks.
For classification and clustering, any kernel or distance-based algorithm can be used
(e.g., SVM, k-NN, k-Medoids, agglomerative clustering). With some care, the
algorithms that manipulate feature vectors (e.g., Centroid Classifier and k-Means) can
also be employed. One of the clear advantages over kernels and similarity matrices is
also the ability to employ feature weighting and feature selection algorithms.
3.1 Constructing Feature Vectors from Text Documents
To convert text documents into their bag-of-words representations, we follow a
typical text mining approach [1]. The documents are tokenized, stop words are
removed, and the word tokens are stemmed (or lemmatized). Bigrams are considered
in addition to unigrams. Infrequent words are removed from the vocabulary. Next,
TF-IDF vectors are computed and normalized in the Euclidean sense. Finally, from
each vector, the terms with the lowest weights are removed (i.e., their weights are set
to 0).
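To make this step concrete, the following Python sketch (using scikit-learn) outlines the described pipeline; the parameter values are illustrative assumptions rather than the exact cut-offs used in our experiments, and stemming/lemmatization is omitted for brevity.

```python
# Sketch of the preprocessing pipeline of Section 3.1 (illustrative parameters;
# stemming/lemmatization would be plugged in via a custom analyzer).
from sklearn.feature_extraction.text import TfidfVectorizer

def build_bow_vectors(documents, min_weight=0.01):
    vectorizer = TfidfVectorizer(
        stop_words="english",      # stop-word removal
        ngram_range=(1, 2),        # unigrams and bigrams
        min_df=1,                  # raise to drop infrequent words from the vocabulary
        norm="l2")                 # Euclidean normalization of TF-IDF vectors
    X = vectorizer.fit_transform(documents)   # sparse matrix, one row per document
    X.data[X.data < min_weight] = 0.0         # remove the terms with the lowest weights
    X.eliminate_zeros()
    return X, vectorizer

docs = ["Support vector machines for text categorization",
        "Categorizing video lectures with PageRank-based features"]
X, vec = build_bow_vectors(docs)
```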
3.2 Using Personalized PageRank for Computing Structural-Context Feature Vectors
For computing the structural-context feature vectors corresponding to each separate
context graph, we run P-PR from a single source node representing the object for
which we want to compute the feature vector. The process is equivalent to a random
walk that starts in the source node. In effect, for a selected source node i in a given
graph, P-PR computes a vector of probabilities with components PRi(j), where j is any
node in the graph. PRi(j) is the probability that a random walker starting from node i
will be observed at node j at an arbitrary point in time.
[Fig. 2 depicts a small graph whose eight nodes are labeled word_1 through word_8; the walk started at node 7 “writes” the text “word_7 word_6 word_4 word_3 …”, yielding v7 = ⟨0.03, 0.03, 0.06, 0.18, 0.09, 0.19, 0.27, 0.15⟩.]
Fig. 2. The “random writer” principle: the random walker is “writing down” words that it encounters along the
way. This is similar to generating random texts with a language model.
Let us suppose that each node is assigned a single random word and that the random
walker is “writing down” words that it encounters along the way (this principle is
illustrated in Fig. 2). It is not difficult to see that a structural-context feature vector
computed with P-PR is in fact the l1-normalized (i.e., the sum of vector components is
equal to 1) term-frequency bag-of-words vector representation of this random text
document. This is also one of the main reasons for employing P-PR over other
methods for computing structural features: it allows an interpretation that relates P-PR
vectors to bags-of-words and thus nicely fits into the existing text mining frameworks.
The other reasons for choosing P-PR over other similar methods are discussed in
Section 3.4.
In text mining, cosine similarity is normally used to compare bag-of-words vectors.
Cosine similarity is equal to computing the dot product of feature vectors, provided
that the two vectors are normalized in the Euclidean sense (i.e., their l2-norm is equal
to 1). Since we use the dot product as the similarity measure in the proposed
framework, the P-PR vectors need to be normalized in the Euclidean sense in order to
conform to the analogy with text mining. Given a P-PR vector vi = ⟨PRi(1), PRi(2), …, PRi(n)⟩ for object i, the corresponding structural-context feature vector vi' is thus computed as vi' = ||vi||–1 ⟨PRi(1), PRi(2), …, PRi(n)⟩.
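As an illustration, the sketch below computes one structural-context feature vector by power iteration over a small weighted graph and then normalizes it in the Euclidean sense; the dense adjacency matrix, the convergence settings, and the toy graph are assumptions made for the example, not the implementation used in our experiments.

```python
# Sketch: a structural-context feature vector via Personalized PageRank
# (power iteration) from a single source node, followed by l2 normalization.
import numpy as np

def ppr_vector(A, source, d=0.85, tol=1e-8, max_iter=1000):
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    W = A / A.sum(axis=0, keepdims=True)       # column-stochastic transition matrix
                                               # (assumes no isolated nodes)
    t = np.zeros(n)
    t[source] = 1.0                            # teleport back to the source node
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = (1.0 - d) * t + d * (W @ r)   # P-PR update
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r_next                              # l1-normalized: components sum to 1

def structural_feature_vector(A, source, d=0.85):
    r = ppr_vector(A, source, d)
    return r / np.linalg.norm(r)               # Euclidean (l2) normalization

A = np.array([[0, 1, 0],                       # toy path graph 0 - 1 - 2
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
v0 = structural_feature_vector(A, source=0, d=0.4)
```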
3.3 Combining Feature Vectors
The final step of the proposed methodology is to combine the computed bag-of-words
and structural-context feature vectors describing a particular object into a single
concatenated feature vector. To explain the theoretical background, we first establish
a relationship between feature vectors and linear kernels. Suppose that for a given
object i, the concatenated feature vector is obtained by “gluing together” m feature
vectors, i.e., m – 1 structural feature vectors and one bag-of-words feature vector
corresponding to the text document describing the object. For a given set of n objects,
let us denote the m sets of feature vectors with V1, …, Vm, where each Vk is a matrix
with n rows, in which i-th row represents the feature vector corresponding to object i.
The corresponding kernels, one for each set of feature vectors, are computed as
Kk = VkVkT.
This relationship is important because there has been a lot of work done recently on
Multiple Kernel Learning (MKL) which can also be employed for data fusion [28]. In
MKL, multiple kernels are combined into a weighted convex combination of kernels
which yields a combined kernel K = kαkKk, kαk = 1, αk ≥ 0. In analogy, we derive
the following equation which shows how the above weights αk can be used to
combine feature vectors:
V = √α1 V1 ⊕ √α2 V2 ⊕ … ⊕ √αm Vm.   (1)
In this equation, ⊕ represents the concatenation of matrix rows. To prove that the
resulting combined vectors correspond to the kernel K, we have to show that
VVT = K:
VVT = (√α1 V1 ⊕ … ⊕ √αm Vm)(√α1 V1 ⊕ … ⊕ √αm Vm)T = Σk αkVkVkT = Σk αkKk = K.
Note that the weights wk from Fig. 1 directly correspond to the above weights, i.e., wk = √αk.
In general, weights αk can be set in several different ways. We can resort to trial-and-error or a greedy heuristic. We can also consider “binary weights” and either include
or exclude a certain type of vectors. Employing MKL is also an option. In the
presented case study (see Section 4), we employ a stochastic optimizer and directly
optimize the target evaluation metric.
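For illustration, the sketch below builds the combined matrix V of Equation (1) from per-context matrices and verifies that VVT equals the weighted combination of the corresponding linear kernels; the matrices and weights are random placeholders rather than data from the case study.

```python
# Sketch of Equation (1): weighted concatenation of per-context feature vectors.
import numpy as np

def combine_feature_vectors(V_list, alphas):
    alphas = np.asarray(alphas, dtype=float)
    assert np.isclose(alphas.sum(), 1.0) and np.all(alphas >= 0)
    # each block is scaled by sqrt(alpha_k), so that V V^T = sum_k alpha_k V_k V_k^T
    return np.hstack([np.sqrt(a) * Vk for a, Vk in zip(alphas, V_list)])

rng = np.random.default_rng(0)
V_list = [rng.random((5, 3)), rng.random((5, 4))]        # two contexts, five objects
alphas = [0.7, 0.3]
V = combine_feature_vectors(V_list, alphas)
K = sum(a * Vk @ Vk.T for a, Vk in zip(alphas, V_list))  # weighted kernel combination
assert np.allclose(V @ V.T, K)                           # combined vectors reproduce K
```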
3.4 Theoretical Comparison of Suitable Methods for Computing Structural-Context Feature Vectors
Our approach to constructing structural-context feature vectors from weighted
networks is to model networks as undirected graphs with weighted edges. In this
section we overview three methods which are suitable for such modeling as they all
model the features concerning the structure of the given network. We construct
weights on the edges between instances in the same way for all three algorithms – by
defining the adjacency matrices. We first present a straightforward mathematical
approach with Markov chains, and then outline the ScentTrails algorithm which is
computationally too slow. Finally, we present the Personalized PageRank (P-PR) approach, which is based on spreading of activation and is the most appropriate for the task; it is therefore the method used in our methodology.
The first approach is to employ Markov chains [32]. A Markov chain is a
mathematical system that undergoes discrete transitions between a finite number of
states. Markov chains are represented by a transition matrix P, in which each element
pij denotes the probability of transition of the Markov chain from state i to state j. Our
task is to compute the probability of the Markov chain being in some state, given the
initial state probability distribution s(0). The formula for computing the distribution after k transitions is given in Equation (2).
s(k) = s(0)Pk   (2)
We are interested in the probability of the Markov chain being in each of the states after an arbitrary number of transitions. The obtained probability distribution over the states is called the stationary distribution π (Equation (3)). Note that the stationary distribution is independent of the starting probability distribution and that Equation (4) must hold.
π = limk→∞ s(k)   (3)
π = πP   (4)
The best way to find the stationary distribution vector is by transforming the problem
into a system of linear equations with n variables and n + 1 equations. Note that
computing the stationary distribution for n nodes asymptotically takes O(n³)
computations.
As we want a feature vector for each instance, there is no point in calculating the
global stationary distribution, since it is the same for every starting distribution. We
will compute the stationary distribution for each of the instances by introducing a
damping factor d. We can interpret the Markov chain with a damping factor as a
random walker, walking through connected nodes with the following property. At
each node, the random walker decides whether to teleport back to the source node
(this is done with the probability (1 – d)) or to continue the walk along one of the
edges. The probability of choosing a certain edge is proportional to the edge’s weight
compared to the weights of the other edges connected to the node.
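For reference, the sketch below computes the global stationary distribution in the way just described, by solving the linear system πP = π together with the normalization constraint; the toy transition matrix is an illustrative assumption. This is the cubic-time computation that the damping-factor and P-PR approaches avoid.

```python
# Sketch: stationary distribution of a Markov chain via a linear system
# with n variables and n + 1 equations (the cubic-time approach).
import numpy as np

def stationary_distribution(P):
    n = P.shape[0]
    # (P^T - I) pi = 0 stacked together with the constraint sum(pi) = 1
    M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(M, b, rcond=None)
    return pi

P = np.array([[0.5, 0.5, 0.0],     # toy 3-state chain (rows sum to 1)
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
pi = stationary_distribution(P)
assert np.allclose(pi @ P, pi)     # pi is indeed stationary
```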
Since the Markov chain method is slow for large networks, we will rather consider
two other options: the ScentTrails and the P-PR algorithms.
The idea of the ScentTrails algorithm [34] is to metaphorically “waft” scent of a
specific vertex in the direction of its out-links (links with higher weights conduct
more scent than links with lower weights). The scent is then iteratively spread
throughout the graph. After that we can observe how much of the scent reached each
of the other vertices – the more scent on the vertex, the more important it is. After
defining the adjacency matrix, we normalize the rows of matrix A so that they sum to
1. In each iteration we compute a scent conduit matrix S(t). The scent conduit matrix
element sij(t) denotes the amount of scent that reached vertex i from vertex j within t
iterations. The scent conduit matrix computation is performed as follows:
S(0) = I,   S(t) = I + (1 – β)·zdiag(ATS(t–1)),   (5)
where zdiag(...) sets the diagonal entries of the matrix to zero (to prevent the scent
being directed back to its origin), and β is the fraction of the scent intensity that is lost
during the propagation through each link.
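A rough Python sketch of the iteration in Equation (5) is given below; it follows the formula as reconstructed above, with β and the number of iterations as assumed parameters, and is not a reference implementation of ScentTrails.

```python
# Rough sketch of the ScentTrails scent-conduit iteration (Equation (5) above).
import numpy as np

def scent_conduit_matrix(A, beta=0.25, iterations=20):
    A = np.asarray(A, dtype=float)
    A = A / A.sum(axis=1, keepdims=True)    # row-normalize the adjacency matrix
    n = A.shape[0]
    S = np.eye(n)                           # S(0) = I
    for _ in range(iterations):
        T = A.T @ S                         # propagate scent along the links
        np.fill_diagonal(T, 0.0)            # zdiag(...): no scent back to its origin
        S = np.eye(n) + (1.0 - beta) * T    # beta: fraction of scent lost per link (assumed)
    return S
```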
The ScentTrails algorithm shows some similarities with the Markov chain approach –
the matrix rows are normalized (and can be perceived as probability distributions) and
also Equation (5) shows similarities with equations used for computing distributions
for the Markov chain after t steps. Note however that the resulting matrix elements
sij(t) do not denote probabilities – in fact, the scent conduit matrix elements can be
greater than 1. They indicate the aggregate width of the scent conduit between each
pair of vertices.
The biggest disadvantage of ScentTrails is the cost of the matrix multiplication in Equation (5), which asymptotically takes O(n³) operations for n nodes; moreover, S(t) may converge only after many iterations.
An interesting notion in graph modeling is the concept of local centrality. Local centrality
determines which nodes are relevant from the point of view of a subset of vertices.
The basic algorithm is called spreading activation [33; 15]. The idea of the algorithm
is to give a subset of vertices some initial energy, which is then iteratively spread
through the weighted edges of the graph. Vertices with the most energy after some number of steps are considered central, given the initial subset. An interesting
instance of spreading activation algorithms is the P-PR algorithm.
In our methodology, the Personalized PageRank method (P-PR) [5] is the method
selected for computing structural-context feature vectors. “Personalized” in this
context refers to using a predefined set of nodes which we use as the source of rank.
The P-PR algorithm uses the same data structures as the Markov chain approach. In addition, it computes the weights and introduces the damping factor d in the same way as the Markov chain approach. The difference is in the computation of
the stationary distribution for the chain. P-PR uses an iterative method, which
converges to the appropriate stationary distribution in logarithmic time [5], unlike
Markov chains, where we solve a system of linear equations in cubic time.
As the iterative method is a lot faster than solving the system of linear equations, we
employ the P-PR method in our methodology.
4 VideoLectures.net Categorization Case Study
The task in the VideoLectures.net case study was to develop a method that can be
used to assist the categorization of video lectures hosted by one of the largest video
lecture repositories VideoLectures.net. This functionality was required due to the
rapid growth of the number of hosted lectures (150–200 lectures are added each
month) as well as due to the fact that the categorization taxonomy is rather fine-grained (129 categories in the provided database snapshot). We evaluated our
methodology in this use case, confronting it with a typical text mining approach and
an approach based on diffusion kernels.
4.1 Dataset
The VideoLectures.net team provided us with a set of 3,520 English lectures, 1,156 of
which were manually categorized. Each lecture is described by a title, while 2,537
lectures also have a short description and/or come with slide titles. The lectures are
categorized into 129 categories. Each lecture can be assigned to more than one
category (on average, a categorized lecture is categorized into 1.26 categories). There
are 2,706 authors in the dataset, 219 events at which the lectures were recorded, and
3,274 portal users’ click streams.
From this data, it is possible to represent lectures, authors, events, and portal users in
a heterogeneous information network. In this network, authors are linked to lectures,
lectures are linked to events, and portal users are linked to lectures that they viewed.
Following our methodology, data preprocessing was performed by using as input the
following textual and structural information about video lectures (a sketch of the graph construction is given after the list):
• Each lecture is assigned a textual document formed of the title and, if available, extended with the corresponding lecture description (abstract) and the titles of the slides.
• The structural information of this heterogeneous network is represented in the form of three homogeneous graphs in which nodes represent individual video lectures:
o Same-event graph. Two nodes are linked if the two corresponding lectures
were recorded at the same event. The weight of a link is always 1.
o Same-author graph. Two nodes are linked if the two corresponding lectures
were presented by the same author or authors. The weight of a link is
proportional to the number of authors the two lectures have in common.
o Viewed-together graph. Two nodes are linked if the two corresponding
lectures were viewed together by one or several portal users. The weight of a
link is proportional to the number of users that viewed both lectures.
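The sketch below illustrates how the three graphs could be assembled as weighted edge lists; the input structures (author sets and event IDs keyed by lecture, and per-user sets of viewed lectures) are illustrative assumptions about the data layout, not the actual database schema.

```python
# Sketch: building the same-author, same-event and viewed-together graphs
# as dictionaries mapping lecture pairs to edge weights.
from collections import defaultdict
from itertools import combinations

def build_graphs(lecture_authors, lecture_event, user_views):
    same_author, same_event, viewed_together = (defaultdict(float) for _ in range(3))

    for a, b in combinations(sorted(lecture_authors), 2):
        common = len(lecture_authors[a] & lecture_authors[b])
        if common:                                  # weight = number of common authors
            same_author[(a, b)] = float(common)
        if lecture_event[a] == lecture_event[b]:    # weight is always 1
            same_event[(a, b)] = 1.0

    for viewed in user_views.values():              # one set of lectures per portal user
        for a, b in combinations(sorted(viewed), 2):
            viewed_together[(a, b)] += 1.0          # weight = number of co-viewing users

    return same_author, same_event, viewed_together
```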
4.2 Results of Text Mining and Diffusion Kernels
We first performed a set of experiments on textual data only, by following a typical
text mining approach. In addition, we employed diffusion kernels (DK) [18] for
classifying lectures according to their structural contexts.
In the text mining experiments, each lecture was assigned a text document formed of
the title and, if available, extended with the corresponding abstract and slide titles. We
represented the documents as normalized TF-IDF bag-of-words vectors. We
performed 10-fold cross validation on the manually categorized lectures. We
performed flat classification as suggested in [12]. We employed several classifiers for
the task: Centroid Classifier (for details see Section 5.1), SVM, and k-Nearest
Neighbors (k-NN). In the case of SVM, we applied SVMmulticlass [14] for which we set
ε (the termination criterion) to 0.1 and C (the tradeoff between error and margin) to
5,000. In the case of k-NN, we set k (the number of neighbors) to 20. We used the dot
product (i.e., the cosine similarity) to compute the similarity between feature vectors.
In addition to text mining experiments (using only the textual information), we
also computed the DKs for the three graphs (we set the diffusion coefficient to
0.0001). For each kernel separately, we employed SVM and k-NN in a 10-fold cross
validation setting. The two classifiers were configured in the same way as before in
the text mining setting.
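For reference, a diffusion kernel for one of the graphs can be sketched as the matrix exponential of the negative graph Laplacian scaled by the diffusion coefficient [18]; the dense-matrix formulation below is an illustrative sketch rather than the exact implementation used in our experiments.

```python
# Sketch: diffusion kernel of a weighted graph, K = expm(beta * H), where
# H = A - D is the negative graph Laplacian and beta the diffusion coefficient.
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta=1e-4):
    A = np.asarray(A, dtype=float)
    H = A - np.diag(A.sum(axis=1))   # negative Laplacian of the weighted graph
    return expm(beta * H)            # dense n x n kernel matrix
```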
We measured the classification accuracy on 1, 3, 5, and 10 top categories predicted by
the classifiers. The results are shown in Fig. 3, where different evaluation metrics are
stacked on top of each other thus forming one column for each of the performed
experiments. Standard errors are shown next to the accuracy values given in the chart
(note also a slight discrepancy between accuracy values in the text and the values in
Fig. 3 and Fig. 4, as in the figures the values were rounded to the closest integer, due
to space limitations).
The results show that the text mining approach performs relatively well. It achieves
55.10% accuracy on the top-most item (k-NN) and 84.78% on top 10 items (Centroid
Classifier). The same-author graph contains the least relevant information for the
categorization task. The most relevant information is contained in the viewed-together
graph. k-NN applied to the viewed-together graph achieves 72.74% accuracy on the
topmost item and 93.94% on top 10 items. Notably, the choice of the classification algorithm is not as important as the selection of the data from which the similarities between objects are inferred.
Fig. 3. Results of the selected text categorization algorithms and diffusion kernels.
4.3 Results of the Proposed Methodology
In the next set of experiments, we applied the proposed methodology. The results are
shown in Fig. 4.
The first nine experiments in Fig. 4 were performed by employing the proposed
methodology on each graph separately. As before, we performed 10-fold cross
validation on the manually categorized lectures and employed the Centroid Classifier,
SVM, and k-NN for the categorization task (we used the same parameter values as
before). We set the PageRank damping factor to 0.4 when computing the structural-context feature vectors.
In the last three experiments in Fig. 4, we employed the data fusion method explained
in Section 3.3. In Experiment 10, we weighted all types of data (i.e., bags-of-words,
viewed-together, same-event, and same-author) equally. We only show the results for
Centroid Classifier (SVM and k-NN demonstrated comparable results). In Experiment
11 we employed Differential Evolution (DE) to directly optimize the target evaluation
metric. The objective function to be maximized was computed in an inner 10-fold
cross validation loop and was defined as Σc∈{1,3,5,10} accc, where accc stands for the
accuracy of the categorization algorithm on top c predicted categories. We only
employed the Centroid Classifier in this setting as it is fast enough to allow for
numerous iterations required for the stochastic optimizer to find a good solution. DE
computed the following weights: 0.0130, 0.9651, 0.0175, and 0.0045 for the bag-of-words, viewed-together, same-event, and same-author data, respectively. In the last
experiment, we removed the viewed-together information from the test set. The
reason is that, in real life, new lectures are not connected to other lectures in the viewed-together graph because they have not yet been viewed by any user.
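As an illustration of this weight optimization, the sketch below uses SciPy's differential evolution as a stand-in for the optimizer of Storn and Price [31]; inner_cv_accuracy is a hypothetical callback that runs the inner 10-fold cross validation for a given weight vector and returns the top-c accuracy.

```python
# Sketch: optimizing the feature-vector combination weights with differential evolution.
import numpy as np
from scipy.optimize import differential_evolution

def optimize_weights(inner_cv_accuracy, n_contexts=4, seed=1):
    # inner_cv_accuracy(alphas, c) is a user-supplied (hypothetical) evaluator
    def objective(raw):
        alphas = np.abs(raw) / (np.abs(raw).sum() + 1e-12)   # normalize to sum to 1
        # maximize the sum of accuracies on the top 1, 3, 5 and 10 predictions
        return -sum(inner_cv_accuracy(alphas, c) for c in (1, 3, 5, 10))

    result = differential_evolution(objective, bounds=[(0.0, 1.0)] * n_contexts,
                                    seed=seed)
    return np.abs(result.x) / np.abs(result.x).sum()
```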
Fig. 4. Results of employing the proposed methodology.
From the results of the first 9 experiments, we can confirm that the most relevant
information is contained in the viewed-together graph. The Centroid Classifier
applied to the viewed-together graph exhibits 74.91% accuracy on the topmost item
and 95.33% on top 10 items. We can also reconfirm that the choice of the
classification algorithm is not as important as the selection of the data from which the
similarities between objects are inferred. Even so, the Centroid Classifier does
outperform the SVM and the k-NN on top 10 items and, in the case of the viewed-together graph, also on the topmost item. The Centroid Classifier is outperformed by
the other two classifiers on the topmost item in the case of the same-event graph.
When comparing the approaches based on our methodology to the DK-based
approaches, we can see that the Centroid Classifier applied to the viewed-together
graph outperforms the SVM and the k-NN applied to the viewed-together diffusion
kernel. On the other hand, with respect to the same-event and same-author graphs, the
Centroid Classifier is outperformed by the DK-based approaches on the topmost
predicted category.
The results of Experiment 10 show that weighting all types of data equally does not
produce the best results. The accuracy falls in comparison with exploiting the viewed-together graph alone. The optimized weights indeed yield the best results (Experiment
11). Since most of the relevant information is contained in the viewed-together graph,
the accuracy achieved through combining feature vectors is not much higher than that
demonstrated by exploiting the viewed-together graph alone. However, as clearly
demonstrated by the last experiment, the combined feature vectors excel when the
viewed-together information is not present in the test set. The classifier is able to
exploit the remaining data and exhibit accuracies that are significantly higher than
those achieved by resorting to text mining alone (88.15% versus 84.78% accuracy on
the top 10 items). A classifier based on combined feature vectors is thus not only
more accurate but also robust to a certain type of data missing from the test examples.
5 Efficient Classification with the PageRank-based
Centroid Classifier
From the results presented in Section 4 it is evident that the Centroid Classifier offers
very good performance and is much more efficient than its competitors. This
outcome has motivated the development of a new centroid-based classifier which
exploits the flexibility of the proposed feature-vector construction process in order to
compute the centroids extremely efficiently.
5.1 The Standard Centroid Classifier
In text mining, the centroid vector is a vector representing an artificial prototype
document of a document set which “summarizes” the documents in the set. Given a
set of TF-IDF vectors, there are several ways of computing the corresponding
centroid vector. Some of the well-known methods are the Rocchio formula, the
average of vector components, and the (normalized) sum of vector components. Of
these three methods, the normalized sum of vector components has been shown to perform
best in text classification scenarios [27]. In this case, given a set of document vectors
represented as rows in matrix D (let D[i] denote the i-th row of matrix D) and a set of
row indices R, identifying documents that we want to group into a centroid, the
normalized centroid vector C is computed as follows:
C' = (1/|R|) Σi∈R D[i],   C = C' / ||C'||.
Let us now consider the multi-context setting introduced in Section 3. Suppose we have
m contexts and thus m sets of feature vectors represented as rows in matrices
V1, …, Vm. Again, let R be the set of row indices identifying objects that we want to
group into a centroid. Finally, let V[i] denote the i-th row in matrix V. In the proposed
framework, in order not to invalidate the intuitions provided in Sections 3.2 and 3.3,
the centroid needs to be computed as follows (Σk αk = 1, αk ≥ 0):
C = √α1 C1/||C1|| ⊕ √α2 C2/||C2|| ⊕ … ⊕ √αm Cm/||Cm||,   where Ck = (1/|R|) Σi∈R Vk[i], 1 ≤ k ≤ m.   (6)
Note that || C || = 1 in this case.
Following the above equations, we train a centroid classifier by computing a centroid
vector for each class of objects (i.e., for each class label). In the classification phase,
we simply compare the feature vector of a new (unlabelled) object with every centroid
in the model. The centroid that yields the highest similarity score implies the label to
be assigned to the object. More formally, we denote this as follows:
C* = arg maxC∈M {cossim(C, v)}.
In this equation, M = {C1, C2, …, Cr} is the centroid model, represented simply as a
set of centroid vectors. Note that each centroid vector in the set corresponds exactly to
one of the r classes. Furthermore, v is the (unlabelled) object’s feature vector and C*
is the centroid vector that represents the predicted class.
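A minimal sketch of this classifier, operating on (already combined) feature vectors held in a dense NumPy array, is given below; the multi-context centroid of Equation (6) would additionally normalize and weight each context block separately, and the real vectors are sparse.

```python
# Sketch: centroid classifier with normalized-sum centroids and dot-product
# (cosine) similarity on l2-normalized feature vectors.
import numpy as np

class CentroidClassifier:
    def fit(self, V, labels):
        self.classes = sorted(set(labels))
        centroids = []
        for c in self.classes:
            s = V[[i for i, y in enumerate(labels) if y == c]].sum(axis=0)
            centroids.append(s / np.linalg.norm(s))   # normalized sum of vectors
        self.centroids = np.vstack(centroids)
        return self

    def predict(self, v):
        v = v / np.linalg.norm(v)                     # cosine similarity = dot product
        return self.classes[int(np.argmax(self.centroids @ v))]
```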
5.2 PageRank-based Centroid Classifier on Graphs
In this section, we show that each structural context of a centroid vector as given by
Equation (6) can be very efficiently computed. Let us focus on one of the “partial”
centroids representing one of the structural contexts in Equation (6), Ck (1 ≤ k ≤ m).
The methodology suggests that, in order to compute Ck, we should construct |R| P-PR vectors and compute their average. However, it is possible to do this computation
a lot more efficiently by computing just one P-PR vector. Instead of running P-PR
from a single source node, we set R to be the set of source nodes (when the random
walker teleports, it teleports to any of the nodes in R with equal probability). It turns
out that a centroid computed in this way is exactly the same as if it was computed in
the “slow way” by strictly following the methodology. We show this equivalence in
the following paragraphs.
Let A be the adjacency matrix of the graph representing one of the structural contexts,
normalized so that each column sums up to 1. Let V be the matrix in which rows
represent the corresponding structural-context feature vectors. Let V[i] denote the i-th
row in matrix V (i.e., the P-PR feature vector of the i-th object). Let R be the set of
row indices identifying nodes (objects) that we want to group into a centroid.
Furthermore, let ti be the “teleport” vector defining the i-th node as the source node,
having i-th element set to 1 and all others to 0, ti = [0, …, 0, 1, 0, …, 0]T. The size of
this vector is equal to the number of rows in V. Finally, let d be the PageRank
damping factor. Then, each row in matrix V is computed by solving the P-PR
equation:
V[i] = (1 – d)ti + dAV[i]
(7)
If we now compute the average over the matrix rows (i.e., P-PR vectors) defined by
R, we get the following equation:
(1/|R|) Σi∈R V[i] = (1/|R|) Σi∈R [(1 – d)ti + dAV[i]]
= (1/|R|) Σi∈R (1 – d)ti + (1/|R|) Σi∈R dAV[i]
= (1 – d)·(1/|R|) Σi∈R ti + dA·(1/|R|) Σi∈R V[i]
C' = (1 – d)t' + dAC',   where C' = (1/|R|) Σi∈R V[i] and t' = (1/|R|) Σi∈R ti.
It is easy to see that this equation resembles the single-source P-PR equation
(Equation (7)). The main difference is the modified teleport vector t' which contains
values 1 / | R | at locations that denote the nodes (objects) that we want to group into a
centroid. This is exactly the P-PR equation with multiple source nodes where 1 / | R |
is the probability of choosing a particular source node when teleporting. Therefore,
instead of computing the average over several single-source P-PR vectors, we can
compute just one multiple-source P-PR vector.
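The sketch below mirrors this derivation: a single power-iteration run with the teleport probability spread uniformly over the source set R yields the same vector as averaging the |R| single-source P-PR vectors. The dense adjacency matrix and convergence settings are again illustrative assumptions.

```python
# Sketch: multiple-source Personalized PageRank as used by the PRCC centroid.
import numpy as np

def multi_source_ppr(A, R, d=0.85, tol=1e-8, max_iter=1000):
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    W = A / A.sum(axis=0, keepdims=True)          # column-stochastic transition matrix
    t = np.zeros(n)
    t[list(R)] = 1.0 / len(R)                     # teleport vector t': mass 1/|R| on R
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = (1.0 - d) * t + d * (W @ r)
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r_next   # equals the average of the |R| single-source P-PR vectors
```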
In the case of r classes and n objects, n ≫ r, this not only speeds up the process by a factor of n / r but also reduces the time complexity from computing O(n) P-PR
vectors to computing O(1) P-PR vectors. Practical implications are outlined in the
following section.
6 Time and Space Complexity Analysis
Whenever a set of new lectures enters the categorization system—regardless of whether
we use the proposed methodology (termed “bags-of-features” in Fig. 5) or the DK
approach—the following procedure is applied: (1) kernel or feature vectors are
recomputed, (2) a model is trained on manually categorized lectures, and (3) new
lectures are categorized. Each fold in the 10-fold cross validation roughly corresponds
to this setting. We focused on the viewed-together graph only and measured the times
required to perform each of these 3 steps in each of the 10 folds, computing average
values in the end. The results are given in Fig. 5.
[Fig. 5 chart; columns: 1: DK – kNN, 2: Personalized PageRank – kNN, 3: Personalized PageRank – PRCC.]
Fig. 5. The time spent for feature vector or kernel computation, training, and prediction. Note that the chart is
plotted on a logarithmic scale.
The results show that the DK-based approach (column 1) is more demanding than the
proposed methodology represented by column 2 (1,193 seconds vs. 371 seconds).
Roughly speaking, this is mostly due to the fact that in our use case, the diffusion
kernel is computed over 3,520 objects (resulting in a 3,520 by 3,520 kernel matrix)
while by using the proposed methodology, “only” 1,156 P-PR vectors of length 3,520
need to be computed, where 1,156 is the number of manually categorized lectures.
Note also that computing a series of P-PR vectors is trivially parallelizable as one
vector is computed entirely independently of the others (the so-called “embarrassingly
parallel” problem). On a quad-core machine, for example, the time required to
compute the P-PR vectors in our case would be approximately 80 seconds. Even
greater efficiency is demonstrated by the PageRank-based centroid classifier (PRCC)
(the last column). When PRCC is used, the feature vectors are not pre-computed.
Instead, in the training phase, approximately 130 P-PR vectors are computed, one for
each category in the training set. In addition, in the prediction phase, approximately
115 additional P-PR vectors are computed (115 objects is roughly the size of the test
set). PRCC thus requires only 70 seconds for the entire process. Needless to say, the
PRCC-based approach is also trivially parallelizable which makes it even more
suitable for large-scale scenarios. Let us also point out that this efficiency is not
achieved at the cost of decreased accuracy. In fact, the accuracy of PRCC is exactly
the same as that of the Centroid Classifier (see Section 4.3). Of all our experiments
involving the viewed-together graph, the one employing the Centroid Classifier
(which is equivalent to employing the more efficient PRCC) demonstrates the best
accuracy.
Considering the space complexity, let us point out that PRCC computes and stores
only around 130 P-PR vectors of length 3,520 (i.e., the PRCC model) which makes it
by far the most efficient approach in terms of memory requirements. In comparison,
the DK-based approach stores a 3,520 by 3,520 kernel matrix and k-NN employed by
the proposed methodology stores around 1,040 P-PR vectors of length 3,520 (roughly
1,040 objects constitute the training set in each fold). For simplicity, we assumed that
these vectors are not sparse, which is actually not the case and would speak even more
in favor of the proposed methodology.
7 Conclusions and Future Work
We presented a new methodology for mining heterogeneous information networks.
The methodology is based on building a common vector space from textual and
structural information, for which we use Personalized PageRank (P-PR) to compute
structural-context features. We also devised and presented an extremely efficient
PageRank-based centroid classifier. We applied the proposed methodology and the
devised classifier in a video lecture categorization use case and showed that the
proposed methodology is fast and memory-efficient, and that the devised classifier is
accurate and robust.
In future work we will further develop the analogy between text mining and the
proposed methodology, considering stop nodes (analogous to stop words). We will
also look for a more efficient way to compute weights when combining feature
vectors. We will apply the methodology to larger problem domains to fully utilize the
efficiency demonstrated by the developed PageRank-based Centroid Classifier.
Acknowledgements
This work has been partially funded by the European Commission in the context of
the FP7 project FIRST (Large scale information extraction and integration
infrastructure for supporting financial decision making) under the grant agreement no.
257928. The authors are grateful to the Center for Knowledge Transfer at Jožef Stefan
Institute and to Viidea Ltd. for providing the Videolectures dataset and the use case
presented in this paper. We are grateful also to Anže Vavpetič for his help and
discussions.
References
[1] R. Feldman, J. Sanger: The Text Mining Handbook: Advanced Approaches in
Analyzing Unstructured Data. Cambridge University Press (2006)
[2] J. Han: Mining Heterogeneous Information Networks by Exploring the Power of
Links. Proceedings of Discovery Science ‘09, pp. 13–30 (2009)
[3] W. de Nooy, A. Mrvar, V. Batagelj: Exploratory Social Network Analysis with
Pajek, Cambridge University Press (2005)
[4] L. Getoor, C. P. Diehl: Link Mining: A Survey. SIGKDD Explorations, vol. 7(2),
pp. 3–12 (2005)
[5] L. Page, S. Brin, R. Motwani, T. Winograd: The PageRank Citation Ranking:
Bringing Order to the Web. Technical Report, Stanford InfoLab (1999)
[6] T. Mitchell: Machine Learning. McGraw Hill (1997)
[7] B. Fortuna, M. Grobelnik, D. Mladenic: OntoGen: Semi-Automatic Ontology
Editor. HCI International ’07, Beijing (2007)
[8] H. R. Kim, P. K. Chan: Learning Implicit User Interest Hierarchy for Context in
Personalization. Journal of Applied Intelligence, vol. 28(2) (2008)
[9] G. Salton: Automatic Text Processing: The Transformation, Analysis, and
Retrieval of Information by Computer. Addison-Wesley (1989)
[10] F. Sebastiani: Machine Learning in Automated Text Categorization. ACM
Computing Surveys, vol. 34(1), pp. 1–47 (2002)
[11] D. Mladenic: Machine Learning on Non-Homogeneous, Distributed Text Data.
PhD thesis (1998)
[12] M. Grobelnik, D. Mladenic: Simple Classification into Large Topic Ontology of
Web Documents. Journal of Computing and Information Technology, vol. 13(4), pp.
279–285 (2005)
[13] S. Tan: An Improved Centroid Classifier for Text Categorization. Expert
Systems with Applications, vol. 35(1–2) (2008)
[14] T. Joachims, T. Finley, C.-N. J. Yu: Cutting-Plane Training of Structural SVMs.
Journal of Machine Learning, vol. 77(1) (2009).
[15] F. Crestani: Application of Spreading Activation Techniques in Information
Retrieval. Artificial Intelligence Review, vol. 11, pp. 453–482 (1997)
[16] J. M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. Journal of
the Association for Computing Machinery, vol. 46, pp. 604–632 (1999)
[17] G. Jeh, J. Widom: SimRank: A Measure of Structural Context Similarity.
Proceedings of KDD ‘02, pp. 538–543 (2002)
[18] R. I. Kondor, J. Lafferty: Diffusion Kernels on Graphs and Other Discrete
Structures. Proceedings of ICML ‘02, pp. 315–322 (2002)
[19] S. Chakrabarti: Dynamic Personalized PageRank in Entity-Relation Graphs. In
Proceedings of WWW 2007, pp. 571–580 (2007)
[20] A. Balmin, V. Hristidis, Y. Papakonstantinou: ObjectRank: Authority-based
Keyword Search in Databases. Proceedings of VLDB ‘04, pp. 564–575 (2004)
[21] J. Stoyanovich, S. Bedathur, K. Berberich, G. Weikum: EntityAuthority:
Semantically Enriched Graph-based Authority Propagation. In Proceedings of the
10th International Workshop on Web and Databases (2007)
[22] A. Rakotomamonjy, F. Bach, Y. Grandvalet, S. Canu: SimpleMKL. Journal of
Machine Learning Research, vol. 9, pp. 2491–2521 (2008)
[23] D. Fogaras, B. Rácz: Towards Scaling Fully Personalized PageRank. In
Proceedings of the Workshop on Algorithms and Models for the Web-graph (WAW
2004), pp. 105–117 (2004)
[24] D. Zhou, B. Schölkopf: A Regularization Framework for Learning from Graph
Data. ICML Workshop on Statistical Relational Learning and Its Connections to
Other Fields (2004)
[25] M. Ji, Y. Sun, M. Danilevsky, J. Han, J. Gao: Graph Regularized Transductive
Classification on Heterogeneous Information Networks. In Proceedings of PKDD, pp.
570–586 (2010)
[26] X. Yin, J. Han, J. Yang, P. S. Yu: CrossMine: Efficient Classification Across
Multiple Database Relations. In Proceedings of Constraint-Based Mining and
Inductive Databases, pp. 172–195 (2004)
[27] A. Cardoso-Cachopo, A. L. Oliveira: Empirical Evaluation of Centroid-based
Models for Single-label Text Categorization. INESC-ID Technical Report 7/2006
(2006)
[28] G. R. G. Lanckriet, M. Deng, N. Cristianini, M. I. Jordan, W. S. Noble: Kernel-based Data Fusion and Its Application to Protein Function Prediction in Yeast.
Proceedings of the Pacific Symposium on Biocomputing, pp. 300–311 (2004)
[29] S. V. N. Vishwanathan, Z. Sun, N. Theera-Ampornpunt, and M. Varma: Multiple
Kernel Learning and the SMO Algorithm. Advances in Neural Information
Processing Systems 23 (2010)
[30] X. Zhu, Z. Ghahramani: Learning from Labeled and Unlabeled Data with Label
Propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University
(2002)
[31] R. Storn, K. Price: Differential Evolution: A Simple and Efficient Heuristic for
Global Optimization over Continuous Spaces. Journal of Global Optimization, vol.
11, pp. 341–359, Kluwer Academic Publishers (1997)
[32] E. Nummelin: General Irreducible Markov Chains and Non-Negative Operators.
Cambridge University Press (2004)
[33] M. A. Rodriguez: Grammar-Based Random Walkers in Semantic Networks.
Knowledge-Based Systems, vol. 21(7), pp. 727–739 (2008)
[34] C. Olston, E. H. Chi: ScentTrails: Integrating Browsing and Searching on the
Web. ACM Transactions on Computer-Human Interaction (TOCHI), vol. 10(3),
pp. 177–197 (2003)