
Learning Hierarchical Relationships among Partially Ordered Objects
with Heterogeneous Attributes and Links
Chi Wang, Jiawei Han
University of Illinois at Urbana-Champaign
{chiwang1,hanj}@illinois.edu
Qi Li, Xiang Li, Wen-Pin Lin, Heng Ji
City University of New York
{hengji}@cs.qc.cuny.edu
Abstract
Objects linking with many other objects in an information network may imply various semantic relationships.
Uncovering such knowledge is essential for role discovery, data cleaning, and better organization of information networks, especially when the semantically meaningful relationships are hidden or mingled with noisy
links and attributes. In this paper we study a generic
form of relationship along which objects can form a treelike structure, since it is a pervasive structure in various
domains. We formalize the problem of uncovering hierarchical relationships in a supervised setting. In general, local features of object attributes, their interaction
patterns, as well as rules and constraints for knowledge
propagation can be used to infer the structures. Existing approaches, designed for specific applications, either cannot handle dependency rules together with local
features, or cannot leverage labeled data to differentiate their importance. In this study, we propose a discriminative undirected graphical model. It integrates a
wide range of features and rules by defining potential
functions with simple forms. These functions are also
summarized and categorized. Our experiments on three
quite different domains demonstrate how to apply the
method to encode domain knowledge. The efficacy is
measured with both traditional and our newly designed
metrics in the evaluation of discovered tree structures.
1 Introduction
In an information network, linked objects have different
roles defined by their relationships with other objects.
Many relationships are not explicitly specified, but hidden in the observed network or mingled with noisy links.
Uncovering such knowledge has great importance in
cleaning and reorganizing the information network for
better utility. When the linked objects with such relationship can be organized into a tree-like structure,
we name it hierarchical relationship. There are many
instances of this kind of relationship in the real world.
Parent-child, manager-subordinate, advisor-advisee are
examples of such relationships among people in social
networks and organizations. Many recent studies have reported applications that benefit from particular hierarchical relationships in various domains, such as news
dynamics tracking (Leskovec et al. [20]), information retrieval from online discussions (Seo et al. [27]), inference
of search intent (Yin and Shah [33]), and information
cascade discovery in social networks (Gomez-Rodriguez
et al. [24]). However, no one has formalized the problem in a domain-independent generic setting, where the
relationship is not detectable from single clear patterns,
but needs to be learned from multiple features and constraints with the presence of labeled data. In this paper,
we cast the task of hierarchical relationship reconstruction as a learning problem, and attempt a solution that
is generally applicable for partially ordered objects.
The supervised scenario is worth studying because
it is often the case that we need to discover the instantiated hierarchical relationship among a set of data
entities in a noisy network, given attributes on the entities and interaction patterns. Also, we can use commonsense assumptions to constrain the prediction or
propagate the knowledge in the network according to
the dependencies existing among linked objects. Without labeled instances as training data, we are unable
to determine their relative importance and make correct predictions in a generic case. On the other hand,
it is challenging to handle both local features and dependency rules simultaneously in a learning framework,
specifically for the tree-like structure prediction. Traditional machine learning methods designed for individual
prediction cannot capture the dependencies naturally,
while generic learning methods designed for structured output do not specify how to encode the specific dependencies of tree-like structure prediction and are hard to apply directly. Thus it is necessary to study the properties of tree-like structure prediction and develop a method that can be easily applied.
Contributions
First, the problem is novel because no one has studied the generic supervised tree construction problem
with the same kind of input and output.
Second, the methods used in previous work on specific domains either lacked the ability to accommodate multiple features of the data, or were incapable of incorporating the dependencies between the labels of linked objects. We overcome their drawbacks to achieve
better performance. For example, when predicting the
family relationship of two named entities, not only the context in which they occur in a document, but also their ages, residence locations, etc., provide clues. We can define
different features from these clues. “Local” features
are indicative of whether two objects have a certain
relationship or not, and have nothing to do with the
relationship among others. The last name equivalence
of two people is an exemplar local feature for filiation.
We can also define features or rules to capture the
correlation of the relationships between different pairs
of objects. There are simple propagation rules, such
as siblings share the same parent. There are also
more complicated ones, such as the constraint that one
must graduate from his advisor before s/he can advise
others, which may involve complex interactions between
unknown variables because one’s advisor and advisees
both need to be inferred. Our method can accommodate such complicated interactions and the local features in a unified model and learn their importance jointly, rather than artificially assigning weights or using hard constraints for postprocessing.
Third, we study the generally useful features and
rules that offer clues for solving real problems. They
are categorized into two major classes and eight minor
types. We show that many complex dependency rules
can be factorized into potential functions that are only
dependent on two variables. Thus one only needs
to materialize these potential functions, either in the
form of singleton potential or pairwise potential, to
apply our approach. The generality of our approach is
validated by experimenting with different applications.
We reimplemented several state-of-the-art approaches and show that our method consistently outperforms them in these applications.
2 Related Work
In a broad view of relationship identification, there
are studies in different domains. One category of
such work is relation mining from text data. Text
mining and language processing techniques have been
employed on text data to add structured semantic
information to plain text. The widely studied entity relationship extraction also belongs to this category,
such as those developed around the Knowledge Base
Population task [16]. While most of them explore low-
level textual features [1, 14], some also exploit relational
patterns for reasoning [25, 7]. Another line of studies
focuses primarily on processing the interaction events in
a social network, and try to discover relationships from
traffic or content patterns of the interaction, e.g., from
the email communication data [23, 9]. The central
problem for most of these studies is judging whether
a pair of objects have a certain relationship. They do not have a special requirement for the discovered relationships to form certain structures.
The particular relationship considered in this paper
is asymmetric among a set of linked objects, and the objects can be organized in a tree-like structure along this
relationship. In a few recent studies, finding such a relationship is an essential task. Leskovec et al. [20] defines
the DAG partitioning problem, which is NP-hard, and
proposes a class of heuristic methods of finding a parent for each non-rooted node, which can be regarded as
finding hierarchical relationships among the phrase quotations in news articles. Kemp and Tenenbaum [18] propose to use a generative model to find structure in data,
and Maiya and Berger-Wolf [22] apply the similar idea
in inferring the social network hierarchy with the maximum likelihood of observing the interactions among the
people. In the Web search domain, Yin and Shah [33]
studies the problem of building taxonomies of search
intents for entity queries based on inferring the “belonging” relationships between them with unsupervised
approaches. There is other related work on building taxonomies from Web tags with a similar methodology [8]. NLP researchers have also studied the entity hyponymy (or is-a) relation from web documents, among whom Zhang et al.'s unsupervised approach [34] claimed state-of-the-art performance, though the tree structures along the relation are not exploited. In [31], advisor-advisee relationships are mined and academic family trees are built from a coauthor network in an unsupervised way.
All of these unsupervised approaches rely on one kind
of observation data, either links or attributes, and a
clear standard of tree construction. We handle heterogeneous data with multiple attributes and links while
no single factor determines the hierarchy, and we learn
the important factors from training data. In the Information Retrieval community, researchers developed supervised methods to discover the "reply" relationship in online conversations, which is also a hierarchical relationship [27]. They use domain-specific features and cannot be directly applied in more general scenarios.
To the best of our knowledge, this work is the first attempt to formalize and solve the general hierarchical
relationship learning problem.
Figure 1: An illustration of the problem definition: a candidate DAG over v1, v2, v3, v4, one possible result, and another possible result.

The terms "social hierarchy" and "organizational hierarchy" refer to stratified node ranking results in some work [12, 26], and "hierarchical structure" refers to hierarchical grouping/clustering in the work of many researchers, including physicists and biologists [6]. They
are essentially distinct concepts from what we are studying. Our output is a network where each node is an
object from the input data and each link represents the
existence of the target relationship between a pair of
objects.
3 Problem Formulation
We first give two real-world examples.
Example 1. Given a set of named people, with knowledge of their ages, nationalities and mentions in text, we want to find the family relation among them.
Besides text features such as the coreferential link
between two names with family keywords, we have
constraints like: (1) for two persons to be siblings, they
should have the same parent; and (2) if A is B’s parent,
then A is unlikely to be C’s parent if B and C have
different nationalities. The output should be a family
tree or forest of these people.
Example 2. In an online forum, we want to predict the
replying relationship among the posts within the same
thread, with knowledge of the content, author and the
posting time. Every post replies to one earlier post in
the same thread, except for the first one. One intuition
is that a post will have similar content to the one it replies to; yet another possibility is that two similar posts
may reply to a common post. The output is a tree
structure of each threaded discussion.
As an abstract of many such kind of problems, we
give the following formalization. Given a set of objects
V = {v1 , . . . , vn } and their directed links E ⊆ V × V ,
a relation R ⊂ E is a hierarchical relationship if (1)
for every object u ∈ V , there exists exactly one object
v ∈ V such that (u, v) ∈ R; and (2) there does not exist
a cycle along R, i.e., no sequence (u1 , . . . , ut ), t > 1
such that u1 R u2 , . . . , ut−1 R ut , ut R u1 . In the relation
instance vi R vj , we name vi as a child and vj as a parent.
For convenience we denote (vi , vj ) to be eij . We define
the out-neighbors of one node as Yi = {vj |(vi , vj ) ∈ E}.
In a general setting, there might exist multiple
hierarchical relationships among the same set of nodes.
In this paper we study the task of uncovering one userspecified hierarchical relationship according to labeled
instance pairs with this relationship. Let the labeled
pairs be L ⊂ R, we aim to uncover the remaining set
R \ L. In other words, we need to predict for every
pair of directly linked objects (vi , vj ) ∈ E, whether
the statement “(vi , vj ) ∈ R” is true.
This defines
the generic hierarchical relationship learning problem.
Along with each learned relation Rk , the objects form a
tree or a forest. So we can also name it tree-like structure
prediction.
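As a concrete check of this definition, the following minimal sketch (in Python, not taken from the paper) tests conditions (1) and (2) for a given relation R; a root is represented by a self-link (u, u) ∈ R, which the definition permits and which the example below uses.

```python
# Minimal sketch: verify that R is a hierarchical relationship over V.
def is_hierarchical(V, R):
    """V: iterable of objects; R: set of (child, parent) pairs."""
    parent = {}
    for u, v in R:
        if u in parent:              # condition (1): at most one parent per object
            return False
        parent[u] = v
    if set(parent) != set(V):        # condition (1): every object has a parent
        return False
    for u in V:                      # condition (2): no cycle of length > 1 along R
        seen, cur = {u}, parent[u]
        while cur != u and cur not in seen:
            seen.add(cur)
            cur = parent[cur]
        if cur == u and parent[u] != u:
            return False
    return True
```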
Figure 1 gives one example. If we instantiate it as a
family tree prediction problem, each node vi represents
a person, and each link points from one person to
his potential parent. v1 and v2 each has a link to
themselves, implying they can be the root of a tree. We
call G = (V, E) a candidate graph. Suppose we know
the ages of these people, we can make sure the candidate
graph has no directed cycles by always linking a younger
person to an older one. Many other real problems also
have this property.
Assumption 1. The candidate graph is a directed
acyclic graph (DAG).
With this assumption, given a set of objects, a treelike structure can be learned in two steps. First, we
extract a partial order for the objects and build a DAG.
Next, we learn a model with given labels on some links
in the DAG, and conduct prediction for the remaining
links.
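The first step can be as simple as linking each object only to candidates that are strictly "older" under an attribute that induces the partial order (ages in Example 1, posting time in Example 2). A minimal sketch of this step follows; the attribute name and the may_link filter are illustrative assumptions, not fixed by the paper.

```python
# Minimal sketch of step 1: build the candidate DAG from a partial order.
def build_candidate_dag(objects, order_key, may_link=lambda a, b: True):
    """objects: dict id -> attribute dict. Returns the out-neighbor sets
    Y_i = {v_j : (v_i, v_j) in E}; objects with no candidate parent get a
    self-link so they can serve as roots (cf. Figure 1)."""
    Y = {}
    for i, ai in objects.items():
        # a candidate parent must be strictly "older" (smaller ordering value),
        # so the resulting candidate graph contains no directed cycle
        Y[i] = {j for j, aj in objects.items()
                if j != i and aj[order_key] < ai[order_key] and may_link(ai, aj)}
        if not Y[i]:
            Y[i] = {i}
    return Y
```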
When the candidate graph is not a DAG in the original data, additional effort needs to be made. In fact, it is
an NP-hard problem to remove as few edges as possible
to break cycles of a directed graph, known as minimum
feedback arc set [17]. The best known approximation
algorithm has ratio O(log n log log n) [11]. In this paper,
instead of examining that issue, we simply assume
such an order can be constructed and focus on how
to design features and learn a joint model to handle
the dependency and propagate knowledge. We will
demonstrate that it is nontrivial to learn the tree
structure even when the candidate graph is a DAG.
4 Our Approach
The relation of each pair of nodes can be determined
one by one; or be determined together, by taking their
interactions into consideration. We argue that a joint
model should be more powerful. For instance, in Example 1 (Figure 2(a)), parents and siblings are inter-related and mutually constrained in a network. If we decompose the inference task as a classification problem for every individual pair of nodes, it is difficult to leverage intra-dependency among nodes, because one prediction depends on some other predictions which are also unknown. On the other hand, the internal dependencies within the structure provide regularization over each individual prediction, which helps the overall predictions in the network. When we have high confidence in discovering some parts of the structure, the prediction on the less certain parts will become easier with the assumption that the knowledge can be propagated. One may argue that these constraints can be added to the point-wise classification model as postprocessing. For example, after the prediction, if we find some conflicts such as two siblings having different parents, we change the prediction for one of them to avoid mistakes. However, the correct parent is already missing, and we do not even know which prediction is wrong.

This shows the importance of inference across multiple dependency rules among nodes/edges in the whole network. One challenge of our problem in general cases is that the heterogeneous information and their interplay can give different constraints, some harder, some softer, and some even inconsistent with each other. In Example 2 (Figure 2(b)), the two rules do not agree with each other if both are applied to make predictions. Moreover, rules like "the more similar two objects, the more probable they share the same parent" cannot be fulfilled by hard constraints. To propagate the knowledge in a systematic way, we need a model that can unify these different kinds of features and rules.

Figure 2: Examples for the dependency among the inference. (a) Two soft dependency rules on a family tree: their relative importance needs to be learned. (b) Conflicting rules on forum reply structure: similar posts may be filiation or siblings.

In this paper, we resort to a probabilistic graphical model to handle the uncertain dependency rules. Specifically, we develop a discriminative model CRF-Hier to solve the hierarchical relationship learning problem. As a conditionally trained, undirected graphical model, it is able to accommodate multiple, overlapping, non-independent features without having to generate the observed variables or model their dependencies. Compared to the commonly used linear-chain CRFs [19] or their extension 2D CRFs [35], our model is designed towards handling the more complicated dependencies when inferring the tree structure.

4.1 Conditional Random Field for Hierarchical Relationship
We model the joint probability of every possible relationship (vi, vj) ∈ E being a truly existent relationship in R. We use an indicator variable xij for the event "(vi, vj) ∈ R", i.e., xij = 1 if (vi, vj) ∈ R, and 0 if not. As we have analyzed, the inference of the relationship for some pairs is not independent. Suppose we have evidence that two people v2 and v4 are not siblings; we may not expect the two events "(v2, v1) ∈ R" and "(v4, v1) ∈ R" to happen together. We formalize that kind of intuition as a Markov assumption that events involving common objects are dependent.

Assumption 2. Two events "(va, vb) ∈ R" and "(vc, vd) ∈ R" correspond to connected random variables in a Markov network if and only if they share a common node, i.e., one of the following is true: va = vc, vb = vc, va = vd or vb = vd. (Figure 3)

Figure 3: Illustration of the Markov assumption. (a) The 4 cases for va R vb and vc R vd to be directly dependent. (b) The Markov network derived from the example candidate DAG.

It immediately follows that the Markov network can be obtained from the candidate graph G by having a node for every edge in G and connecting nodes that represent two adjacent edges in G.

The conditional joint probability is formulated as

(4.1)  p(X|G, Θ) ∝ exp( ∑_{k=1}^{K} θk Fk(X, G) + H(X, G) )

where {Fk(X, G)}, k = 1, . . . , K, is a set of features defined on the given candidate graph G and the indicator variables X = {xij}; {θk} are the weights of the corresponding features. H is a special feature function that enforces the hierarchy constraint ∑_{j:(vi,vj)∈E} xij ≤ 1 for every vi:

(4.2)  H(X, G) = −∞ if ∃i such that ∑_{j:(vi,vj)∈E} xij > 1, and 0 otherwise.

Any other hard constraints can be encoded in the same manner.

Thus, once we learn the weights {θk} from training data, the relation inference task can be formulated as a maximum a posteriori (MAP) inference problem: for each given candidate DAG G, we target the optimal configuration of the relationship indicators X*, such that

(4.3)  X* = arg max_{X ∈ 𝒳} p(X|G, Θ)

where 𝒳 is the set of all the possible configurations, i.e., the search space. Since every xij can take 0 or 1, the size of the space is 2^{|E|}.

Such a formulation keeps the maximal generality, while it poses a great challenge to solve the combinatorial optimization problem. We improve it in two ways.

First, we explore the form of the feature functions Fk. Each of them can be defined on all variables, and then the computation of every function value relies on the enumeration of X in 𝒳. However, since we have made the Markov assumptions, the dependency can be represented by a factor graph, and each feature function can be decomposed into potential functions over cliques of the graph (Hammersley-Clifford theorem [15]). Moreover, we can restrict the potential functions to have interactions between at most a pair of random variables xij and xst. In fact, we have the following claim.

Theorem 4.1. Any factor graph over discrete variables can be converted to a factor graph restricted to pairwise interactions, by introducing auxiliary random variables.

This property was found by Weiss and Freeman [32]. Although the generic procedure of conversion may introduce additional variables, we will show in the next section that quite a broad range of features can be materialized in forms as simple as pairwise potentials without the help of auxiliary variables. Here we factorize the constraint H to exemplify the philosophy: H(X, G) = ∑_{eis, eit ∈ E} h(xis, xit), where h(xis, xit) = −∞ if xis = xit = 1, and 0 otherwise.

Next, we try to reduce the number of variables and constraints. To leverage the constraint that one node has only one parent and the assumption that the candidate graph G is a DAG, we introduce a variable yi to represent the parent of vi, i.e., yi = j if and only if eij = (vi, vj) is an instance of the hierarchical relationship R. Since Assumption 1 holds, the problem is equivalent to the task of predicting yi's value from Yi, because every combination of the yi's ensures the acyclic prerequisite. Assumption 2 implies there are two kinds of dependencies: two variables yi and yj where vj is a candidate parent of vi; and two variables ys and yt such that vs and vt share a common candidate parent vm. With this formulation, the constraint H is not needed any more, and the objective function has the following form:

(4.4)  p(Y|G, Θ) ∝ exp( ∑_{k∈I1} ∑_{ys∈Sk} θk gk(ys|Ys) + ∑_{k∈I2} ∑_{(yi,yj)∈Pk} θk fk(yi, yj|Yi, Yj) )

where I1 and I2 are the index sets of features that can be decomposed into singleton potential and pairwise potential functions respectively, and Sk and Pk are the decomposed singleton and pairwise cliques for the k-th feature.

Figure 4: The graphical representation of the proposed CRF-Hier model for the example in Figure 1. All the singleton potentials defined on each variable are listed in the table connected to the related variables by a dotted line; pairwise potentials are represented by solid rectangles and tables listing the pairwise function values when the pair of variables take different configurations.
As an example, for the candidate graph G in Figure 4 we build a factor graph like the one on its right. The four nodes v1, v2, v3, and v4 have four parent variables y1 to y4, whose ranges are Y1 = {v1}, Y2 = {v1, v2}, Y3 = {v2}, and Y4 = {v1, v3}. For each variable there can be one or more singleton potential functions. For each pair of directly linked nodes, we have a pairwise potential function, f4 in this example (we omit the one between y1 and y4). v2 and v4 have a common parent candidate, so there is one pairwise potential f5 defined on y2 and y4.
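For concreteness, the following minimal sketch (illustrative names, not the authors' code) evaluates the unnormalized log-score of Equation (4.4) for a parent assignment and finds the MAP assignment by brute-force enumeration, which is feasible only for toy candidate DAGs like the one in Figure 4; the actual inference uses belief propagation, as described in Section 4.3.

```python
# Minimal sketch of the score in Eq. (4.4) and a brute-force MAP search.
from itertools import product

def log_score(Y, theta, singletons, pairwise):
    """Y: dict node -> chosen parent; theta: dict potential name -> weight;
    singletons: list of (name, i, g) with g(y_i) -> float;
    pairwise: list of (name, i, j, f) with f(y_i, y_j) -> float."""
    s = sum(theta[name] * g(Y[i]) for name, i, g in singletons)
    s += sum(theta[name] * f(Y[i], Y[j]) for name, i, j, f in pairwise)
    return s

def map_assignment(Y_range, theta, singletons, pairwise):
    """Y_range: dict node -> list of candidate parents (the sets Y_i)."""
    nodes = list(Y_range)
    best, best_s = None, float("-inf")
    for combo in product(*(Y_range[i] for i in nodes)):
        Y = dict(zip(nodes, combo))
        s = log_score(Y, theta, singletons, pairwise)
        if s > best_s:
            best, best_s = Y, s
    return best
```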
We name our model CRF-Hier as it is a conditional
random field optimally designed for the hierarchical
relationship learning problem.
We give several examples for specific potential definitions.
Table 1: Potential categorization and illustration. (For conciseness we omit the conditional variables in g and f.)
Type              | Cognitive Description                                        | Potential Definition                      | Example
homophily         | Parent and child are similar                                 | g(yi) = sim(vi, vyi)                      | textual similarity, interaction intensity, spatial and temporal locality
polarity          | Parent is superior to child                                  | g(yi) = asim(vi → vyi)                    | authority, interaction tendency, conceptual extension and inclusion
support pattern   | Patterns frequently occurring with child-parent pairs        | g(yi) = #pattern(vi, vyi)                 | contextual pattern, interaction pattern, preference to certain attributes
forbidden pattern | Patterns rarely occurring with child-parent pairs            | g(yi) = −#pattern(vi, vyi)                | forbidden attribute to share, forbidden distinction
attribute augment | Use inherited attributes from parents or children            | f(yi, yj) = [yi = j]sim(vi, vyj)          | content propagation for documents, authority propagation for entities
label propagate   | Similar nodes share similar parents (children)               | f(yi, yj) = sim(vyi, vyj)sim(vi, vj)      | siblings share common parents, colleagues share similar supervisors
reciprocity       | Patterns alternating in child-parent and parent-child pairs  | f(yi, yj) = sim(vi, vyj)sim(vj, vyi)      | author reciprocity in online conversation (forth-and-back)
constraints       | Restrict certain patterns                                    | f(yi, yj) = −#pattern′(vi, vj, vyi, vyj)  | consistency check of transitive property
Example 3. The names of a parent and a child should
appear in the same sentence (support sentence) together
with family words like “son, father”.
g(yi |Yi ) = #support sentences of (vi , vyi ).
Example 4. Siblings have a common parent.
f(yi, yj|Yi, Yj) = [yi = yj] p(vi and vj are siblings),
where we use the Iverson bracket to denote a number that
is 1 if the condition in square brackets is satisfied, and
0 otherwise. In this case, [yi = yj ] = 1 if yi = yj and 0
otherwise.
Example 5. Two people of the same age are unlikely
siblings.
f (yi , yj |Yi , Yj ) = [yi = yj ][vi ’s age = vj ’s age].
Note that these statements are not always
true, e.g., twins can be siblings. But the advantage
of our framework is that it does not enforce them
to be true. The indicator in Example 5 takes value
0 or 1, and the weight of the corresponding feature
controls how much we trust this rule. Eventually the
compatibility with this rule plays a factor as the product
of potential function and feature weight in the additive
log-likelihood.
We see that these different features, constraints and
propagation rules can be encoded in a unified form; we
just need to define the potential for each of them. We
discuss how to systematically design potentials in the
next subsection, followed by the inference and learning
algorithm.
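To make the encoding concrete, here is a minimal sketch (not the authors' code) of Examples 3-5 written as potential functions; the helpers count_support_sentences, sibling_prob, and age are assumed to be supplied by the application.

```python
# Minimal sketch of Examples 3-5 as singleton/pairwise potential functions.
def g_support_sentence(yi, i, count_support_sentences):
    # Example 3: number of sentences mentioning v_i and its chosen parent
    # v_{yi} together with family keywords.
    return count_support_sentences(i, yi)

def f_siblings_share_parent(yi, yj, i, j, sibling_prob):
    # Example 4: reward assignments in which likely siblings share a parent.
    return sibling_prob(i, j) if yi == yj else 0.0

def f_same_age_unlikely_siblings(yi, yj, i, j, age):
    # Example 5: indicator that two nodes with the same age share a parent;
    # the learned (typically negative) weight controls how much this rule
    # is trusted.
    return 1.0 if yi == yj and age(i) == age(j) else 0.0
```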
4.2 Potential Function Design
We have restricted our focus to the features that can be decomposed into either singleton potential or pairwise potential functions. We explore typical features and rules to be defined across various domains in that space. Usually the objects in the network have heterogeneous attributes, as well as complex interaction patterns. We summarize the potential types with domain-independent cognitive meanings so that one can design domain-specific potentials with this map.

We first examine several important singleton potentials.

• homophily. The first kind, and probably the most widely applicable, is a similarity measure between two objects. The assumption here is that filiation connects to homophily, e.g., content similarity, interaction intensity (e.g., the frequency of telephone calls in unit time), location adjacency, time proximity (e.g., whether two documents are published within a short period). This kind of similarity measure sim is symmetric, and there are numerous metrics we can use. The potential function has the form g(yi|Yi) = sim(vi, vyi).

• polarity. The second kind, almost equally important, is an asymmetric similarity measure. It is used to measure the dominance of certain attributes of the parent over the child, e.g., authority difference, bias of interaction tendency (e.g., whether A writes many more emails to B than B to A), or the degree of conceptual generalization/specialization. Such a measure asim quantifies the partial order in terms of polarity between linked nodes, in the form g(yi|Yi) = asim(vi → vyi).

• support pattern. The third kind of potentials characterize the preference for certain patterns involving a pair of nodes with filiation. We can define a potential based on the number of pattern occurrences: g(yi|Yi) = #pattern(vi, vyi).

While the singleton potentials depict local intuitions, the pairwise potentials are responsible for knowledge propagation as well as restrictive dependencies.

• attribute augmentation. One intuition for knowledge propagation is that one node can borrow attributes from its parent or child to augment its own content. The difficulty for a deterministic method to achieve such propagation is that the child or the parent of each node is unknown before we finish the inference. In our model this can be realized by defining a pairwise potential f(yi, yj|Yi, Yj) = [yi = j]sim(vi, vyj). It can be understood in two ways: knowing that the parent of vi is vj, vj will tend to choose a parent similar to vi; or given that the parent of vj is vyj, it affects the decision of its child towards inheriting attributes from its parent. By replacing the boolean indicator [yi = j] with a weighting function of vi and vj we can control the extent to which we propagate the attribute.

• label propagation. Sometimes we can measure how likely two nodes are to share the same parent, so the label of one's parent can be propagated to similar nodes with a function like f(yi, yj|Yi, Yj) = [yi = yj]sim(vi, vj). Given vi's parent vyi, the more similar vj is to vi, the larger contribution to the joint likelihood this function will make via setting vj's parent to be the same, i.e., yj = yi.

• reciprocity. This kind of potentials can handle a more complicated pattern that occurs in child-parent and parent-child pairs alternately. For example, "if author A often replies to author B, then author B is more likely to reply to author A". To achieve such kind of collective intelligence, we seem to need a big factor function like F(yi|G, Y \ yi) = ∑_{A,B} ( [ai = B][ayi = A] ∑_{j: aj = A} [ayj = B] ), where ai stands for vi's author. It requires all labels to be known. Fortunately, we find that rules in this form have an equivalent decomposed representation. The above rule can be decomposed into pairwise potentials f(yi, yj|Yi, Yj) = [ai = ayj][ayi = aj].

For a specific application we can encode an arbitrary number of potentials of each type. In the experiment part we will use three real-world examples to demonstrate that. In principle, one can apply frequent pattern mining or statistical methods to find distinguishing features in each category.

4.3 Model Inference and Learning
Given a training network G = {V, E} with instances labeled with the hierarchical relationship L, we need to find the optimal model setting Θ = {θk}, k = 1, . . . , K, which maximizes the conditional likelihood defined in (4.4) over the training set. Let Y(o) and T be the variable set and assignments corresponding to the labeled relation instances L, and Y(u) be the unknown labels in the training data. Their union is the full variable set Y in the training data. The log-likelihood of the labeled variables can be written as:

(4.5)  LΘ = log p(Y(o) = T | G, Θ) = log ∑_{Y(u)} exp( Θᵀ F(Y|Y(o)=T, G) ) − log Z(G, Θ)

where F(Y, G) is the vector representation of the feature functions, and Z(G, Θ) = ∑_Y exp( Θᵀ F(Y, G) ).

To avoid overfitting, we penalize the likelihood by an L2 regularization. Taking the derivative of this objective function, we get:

(4.6)  ∇LΘ = E_{p(Y(u)|G,Θ,Y(o)=T)} F(Y|Y(o)=T, G) − E_{p(Y|G,Θ)} F(Y, G) − λΘ

When the training data are fully labeled, Y(u) = ∅, and Equation (4.6) becomes simply:

∇LΘ = F(Y = T, G) − E_{p(Y|G,Θ)} F(Y, G) − λΘ

The first part is the empirical feature value in the training set, the second part is the expectation of the feature value in the given training data, and the last part is the L2 regularization. Given that the expectation of the feature value can be computed efficiently, the L-BFGS [4] algorithm can be employed to optimize the objective function (4.5) using this gradient, although it is not convex with respect to Θ, and we can expect to find a local maximum of it. When there are multiple training networks, we train the model by optimizing the sum of their log-likelihoods.

With the feature weights fixed at each step, the learning process requires an efficient inference algorithm for the marginal probability of every clique: the singletons and edges on which we define our potentials. The prediction, however, requires MAP inference to find the maximal joint probability. Loopy belief propagation (LBP) [13] and its variants, Tree-Based Reparameterization (TRP) [30], residual belief propagation [10], etc., have been shown to achieve empirically good inference and to be more scalable than general-purpose LP solvers (CPLEX).

In the toy model in Figure 4, there are 5 different potentials g1, g2, g3, f4, f5. We will learn the weight vector Θ = {θ1, . . . , θ5} from training data, and then find the optimal value of Y = {y1, . . . , y4} to predict the structure.
Figure 5: Comparison of inferred structures and the gold-standard structure: (a) flat tree, (b) chain, (c) an inferred tree, (d) true structure. (a)(b)(c) are three possible prediction results for the true structure in (d). Green (light) nodes have the right prediction for their parents, while red (dark) nodes have wrong predictions.

Table 2: Measurement for structures in Figure 5
Structure     | Panc | Ranc | F1anc | Apath | Apar
(a) flat tree | 1.00 | 0.60 | 0.75  | 0.60  | 0.60
(b) chain     | 0.60 | 1.00 | 0.75  | 0.40  | 0.80
(c) inferred  | 0.80 | 0.80 | 0.80  | 0.80  | 0.80

5 Experimental Results
We perform the experimental study on different tasks, varying from named-entity-level to document-level relationship discovery. Our code and data can be downloaded from http://tinyurl.com/3t64rq8

Evaluation Measure. Little study has been carried out on how we should evaluate the quality of hierarchical relationship prediction. Accuracy on the predicted parent yi (Apar), and accuracy on the predicted relation pairs xij (Apair), are the two most natural evaluation criteria; they or their variants (Precision, Recall, etc.) are employed by most previous studies [27, 33, 31]. However, such measures only evaluate the prediction variables on each node or each edge in an isolated view, missing some aspects of the comprehensive goodness of structure. We take an example to illustrate.

Figure 5 lists a ground-truth structure and several different reconstruction results. Both (b) and (c) have the same Apar and Apair because only one node has an incorrect predicted parent. However, the chain is quite different from the gold-standard tree with two branches, and result (c) should be regarded as closer to the ground truth. So we can see that one mistake may not only affect the parent of one node, but also deviate the shaping statistics of other nodes, e.g., their degree and their number of ancestors and descendants. In other words, different edges have different importance in preserving the shape of the tree, which is not reflected by the unweighted judgment of each prediction. Tree similarity measures for ontologies such as tree edit distance [2] are not desirable either, because the edit operations do not apply here, and their computation has biquadratic complexity w.r.t. the tree size.

We define a set of novel measures for quantitative evaluation of the quality of hierarchical relationship prediction, which can be computed in linear time. Formally, let T be the ground-truth structure for n linked objects V = {v1, . . . , vn}, and Y the prediction. We evaluate how well the structure is preserved in the prediction by examining two additional aspects: the ancestors and the path to the root.

• Precision and recall of ancestors, Panc and Ranc:

(5.7)  Panc = ( ∑_{i=1}^{n} δ[Anci(Y) ⊂ Anci(T)] ) / n
(5.8)  Ranc = ( ∑_{i=1}^{n} δ[Anci(T) ⊂ Anci(Y)] ) / n

where Anci(Y) and Anci(T) stand for the ancestor set of node vi in the prediction and in the ground truth respectively.

• Accuracy of the path to the root, Apath:

(5.9)  Apath = ( ∑_{i=1}^{n} δ[pathi(Y) = pathi(T)] ) / n

where pathi(Y) and pathi(T) are the paths from node vi to its root in the prediction and in the ground truth respectively.

As a commonly used compromise between precision and recall, we can also define the F-value for the proposed ancestor measures as the harmonic mean of precision and recall. For the example in Figure 5, the proposed metrics have the values shown in Table 2. We can see that although (b) and (c) have the same accuracy on the parent prediction, three of the four proposed measures imply that the inferred structure in (c) has better quality than the chain. Also, we notice that Apath is the most strict measure, and in that measure (b) is even worse than (a). That implies one predicted structure may be good in some aspect but bad in others. This reaffirms the necessity of using multiple measures for tree structure evaluation.

The last thing we want to point out is that the hardness of the inference problem highly depends on the number of candidate parents of each node. A random guess will have much lower performance than 0.5 in most cases, even when each node only has two or three candidate parents.

Algorithm Setting. We observe no obvious difference among the several variants of belief propagation algorithms. The learning results are based on LBP, and the L2 penalty parameter is λ = 2 if not further specified. All the features are normalized into [−1, 1].
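For reference, a minimal sketch (not the released code) of how Apar and the proposed ancestor and path measures can be computed from the true and predicted parent of each node, with roots represented as their own parents as in Figure 1; the subset test implements the containment in Equations (5.7) and (5.8).

```python
# Minimal sketch of the evaluation measures Apar, Panc, Ranc, Apath.
def ancestors(parent, i):
    """Proper ancestors of node i under a parent map (root is its own parent)."""
    anc, cur = set(), i
    while parent[cur] != cur and parent[cur] not in anc:
        cur = parent[cur]
        anc.add(cur)
    return anc

def path_to_root(parent, i):
    path = [i]
    while parent[path[-1]] != path[-1] and parent[path[-1]] not in path:
        path.append(parent[path[-1]])
    return path

def evaluate(truth, pred):
    """truth, pred: dict node -> parent. Returns (Apar, Panc, Ranc, Apath)."""
    nodes = list(truth)
    n = float(len(nodes))
    a_par = sum(pred[i] == truth[i] for i in nodes) / n
    p_anc = sum(ancestors(pred, i) <= ancestors(truth, i) for i in nodes) / n
    r_anc = sum(ancestors(truth, i) <= ancestors(pred, i) for i in nodes) / n
    a_path = sum(path_to_root(pred, i) == path_to_root(truth, i)
                 for i in nodes) / n
    return a_par, p_anc, r_anc, a_path
```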
Table 3: Potentials used in the family tree construction task.
Type              | Potential Description
homophily         | vi and vyi live in the same location?; mutual information of two names and family keywords from web snippets
polarity          | suffix comparison: Junior, I, III, IV, etc.
support pattern   | co-occurrence of two names and some family keywords from web snippets; parent-child implied by Wikipedia infoboxes
forbidden pattern | child's birth year after parent's death + 1; non parent-child implied by Wikipedia
attribute augment | same residence location of one and grandparent
label propagate   | people who are likely to be siblings share the same parent
constraints       | people with the same age get lower possibility to share the same parent

5.1 Uncovering Family Tree Structure
We apply our method to an entity relation discovery problem. In the Natural Language Processing (NLP) community, it is sometimes studied as a slot filling task [16], i.e., to answer questions like "who are the top employees of IBM". Some relation types satisfy our definition of hierarchical relationship, e.g., manager-subordinate, parent organization-subsidiary, country-state-city. We take the family relation (parent-child) as the case to study, and try to answer the following two questions: 1) whether the proposed method works better than the state-of-the-art NLP approaches to generic entity relation mining; and 2) how good a joint model is compared to a model that does not handle dependency rules, or uses the rules just for post processing.

For clear demonstration, we define the task as automatically assembling the family tree from a set of person named entities. These named entities were extracted from two famous American families, the Kennedy family and the Roosevelt family, as listed in Wikipedia and Freebase [3]. We design potentials according to the map in Section 4.2. Table 3 lists the potentials we used. Given any pair of named entities we first collected all the Web context sentences (snippets) returned by the Yahoo! search engine. Then we encoded features based on analysis of these snippets, such as co-occurrence statistics. We also encoded additional features from various discovered attributes including residence, age, birth date, death date and other family members. The results of random guess reflect the hardness of the problem.

We compare our method with three baselines: (1) NLP, the general relation mining approach based on NLP techniques described in [5]. More specifically, we applied a pattern matching algorithm to discover parent-children relation links among entities by analyzing the Web snippets. (2) Ranking SVM, a robust machine learning technique developed from the support vector machine (SVM) for ranking problems, but only singleton features can be handled. (3) Ranking SVM + post processing (PP). For Ranking SVM, we treat each node as a query, its parent as the relevant "document", and all the other candidates as non-relevant "documents". For post processing, we encode some pairwise potentials as global constraints (e.g., one person cannot have multiple parents; siblings should share the same parents) in Integer Linear Programming (ILP), in order to maximize the summation of confidence values from Ranking SVM subject to these constraints. The detailed implementation is described in [21].

Table 4 shows the results on two families with 60 and 40 members respectively. We can see that the model optimized for hierarchical relationship discovery performs 2- to 3-fold better in most measures than the general-purpose relation miner. Compared with the two-stage method that uses post-processing rules, the joint model can better integrate them into the learning and inference framework. The margin is largest in the most strict measure, path accuracy (333%, 133%). That implies our method makes fewer mistakes at the key positions of the tree structure, where the chance of absorbing knowledge and doing regularization is higher. We also find that RankingSVM and RankingSVM + PP beat NLP in Apar, but are no better than NLP in F1anc and Apath in the first test case. Also, the post processing does not help much because the confidence estimation from the prediction is not always reliable due to noise in the prediction features. As one example, the prediction component successfully identified "William Emlen Roosevelt" as the parent of "Philip James Roosevelt" but with a low confidence, while it mistakenly identified "Theodore Roosevelt" as the parent of "George Emlen Roosevelt" with a much higher confidence due to their high mutual information value in Web snippets. Therefore in the post-processing stage, based on the fact that "Philip James Roosevelt" and "George Emlen Roosevelt" are siblings, the label propagation rule mistakenly changed the parent of "Philip James Roosevelt" to "Theodore Roosevelt". In our model, the weight of the local features and the propagation rules is learned from the data, and the global optimization prevents this mistake.
Table 4: Prediction performance for family tree
Train/Test          | Method        | F1anc  | Apath  | Apar
Train on Kennedy,   | Random        | < 0.01 | < 0.01 | 0.0943
test on Roosevelt   | NLP           | 0.1146 | 0.0333 | 0.1833
                    | RankingSVM    | 0.0667 | 0.0500 | 0.5000
                    | RankingSVM+PP | 0.0667 | 0.0500 | 0.5833
                    | CRF-Hier      | 0.3439 | 0.2167 | 0.7167
Train on Roosevelt, | Random        | < 0.01 | < 0.01 | 0.1313
test on Kennedy     | NLP           | 0.2625 | 0.1750 | 0.2500
                    | RankingSVM    | 0.4371 | 0.1500 | 0.3250
                    | RankingSVM+PP | 0.4750 | 0.1500 | 0.3500
                    | CRF-Hier      | 0.4846 | 0.3500 | 0.4000

5.2 Uncovering Online Discussion Structure
Online conversation structure is a hot topic studied
by researchers in information retrieval, since the structure benefits many tasks including keyword-based retrieval, expert finding and question-answer discovery.
We perform the study on the problem of finding reply relationship among posts in online forums. The
data are crawled from the Apple Discussion forum (http://discussions.apple.com/) and the Google Earth Community (http://bbs.keyhole.com/). From each forum we crawled around 8000 threads, and each thread contains 6-7 posts on average, although some threads contain as many as 250 posts. The posts in
each thread can be organized as a tree based on their
reply relationship. The task is to reconstruct the reply structure for the threads with no labels, given a
few threaded posts with labeled reply relationship. Although we use the data from the forums where the reply
relationship is actually recorded by the system, the real
application scenario will be predicting the online conversations whose structure is unknown. Therefore, we
want to study the following questions: 1) How many labeled data are needed to achieve good performance; and
2) How is the adaptability of the model when trained
on one forum while testing on another.
We list the features of each type in Table 5. The
competitor we choose is Ranking SVM, used in [27] for
this task. Again, it can only handle singleton features.
We also compare with a naive baseline that always
predicts chain structure.
Table 5: Potentials used in the post reply structure task.
Type              | Features, Rules and Constraints
homophily         | tf-idf cosine similarity cos(vi, vyi); recency of posting time
polarity          | whether vyi is the first post of the thread; whether vyi is the last post before vi
support pattern   | whether ayi's name appears in post vi, where ayi is the author of vyi
forbidden pattern | an author does not reply to himself
attribute augment | the average content of one post's children is similar to its parent: [yi = j] cos(vi, vyj)
label propagate   | similar posts reply to the same post: [yi = yj] cos(vi, vj)
reciprocity       | author A replies to B, motivated by B replying to A: [ayj = ai][ayi = aj]
constraints       | one author does not repeatedly reply to a post; if B replied to A's post, A does not reply to B's earlier post: −[tyi < tj][ayi = aj][ayj = ai]

Figure 6: Performance with varying training data size (Apple). (a) Training on fully labeled threads: F1anc and Apath of CRF-Hier and Ranking SVM as the number of threads in the training set varies. (b) Training on partially labeled threads: the same measures as the number of labeled samples used from each training thread varies. For reference, the chain structure gives F1anc = 0.7435 and Apath = 0.5647.

To answer the first question, we fix the test data of 2000 threads and vary the training data size in two different ways. First, we use all the labels from each thread, but vary the number of training threads from 50 to 5000. Second, we fix the number of training threads at 1000, and change the number of labels we use from each thread from 3 to 11. From Figure 6(a), we find that with a small training set, 50 labeled threads, CRF-Hier already achieves encouraging performance. The margin is significant because even the naive baseline of predicting every post to reply to the last post gives 0.74 in F1anc — compared with Ranking SVM's 0.80
and CRF-Hier’s 0.86, CRF-Hier doubles the margin of
what can be achieved by Ranking SVM from the naive
baseline. As more training data is added, the testing
performance is relatively more stable than Ranking SVM's. Figure 6(b) shows that when the labels for each tree are very incomplete, CRF-Hier's performance degrades to that of its competitor because the pairwise features cannot be well exploited. When the labels become reasonably sufficient (5 posts in this case) to characterize the structural dependency among them, CRF-Hier shows its superiority (increasing the margin over the baseline by 32% in Apath). Although CRF-Hier
has more feature weights to learn, the L2 regularization
mitigates overfitting, and the model works well even
when the training data size is small.
Table 6: Cross domain evaluation (CRF-Hier/Ranking SVM).
(a) F1anc
Train \ Test  | Apple           | Google Earth
Apple         | 0.8476 / 0.8136 | 0.8383 / 0.8017
Google Earth  | 0.8233 / 0.7855 | 0.8186 / 0.8099
(b) Apath
Train \ Test  | Apple           | Google Earth
Apple         | 0.7326 / 0.6722 | 0.7143 / 0.6548
Google Earth  | 0.6909 / 0.6325 | 0.6797 / 0.6610
All improvements are statistically significant with p < 0.05.

Table 7: Prediction performance for academic family tree
Method        | F1anc  | Apath  | Apar
TPFG          | 0.5241 | 0.2276 | 0.3548
CRF-Singleton | 0.4684 | 0.1648 | 0.3226
CRF-Hier      | 0.5681 | 0.3226 | 0.4516

To answer the second question, we randomly select
2,000 threads for training and 2,000 threads for testing
from each of the two datasets, and perform a cross domain experiment with the 4 combinations of train/test
sets. Apple Discussion is a technical forum about computers, while Google Earth focuses on entertainment. From
the comparative results in Table 6, we can find that
with the help of pairwise features, CRF-Hier generalizes
better than Ranking SVM which relies on the singleton
features only.
In conclusion, CRF-Hier consistently outperforms the baseline when the size or the domain of the training data varies. It adapts and generalizes well.
5.3 Uncovering Academic Family Tree
As a final experiment, we show an example in a domain involving no text data. We consider the task
of academic family tree discovery according to the advisor-advisee relationship, which needs to be inferred from research publication networks. We use the DBLP Computer Science Bibliography Database, which consists
of 871,004 authors and 2,365,362 papers with time
provided (from 1970 to 2009). To test the accuracy of the discovered advisor-advisee relationships, we
adopt labeled data from [31], which are crawled from
the Mathematics Genealogy (http://www.genealogy.
math.ndsu.nodak.edu) and AI Genealogy (http://
aigp.eecs.umich.edu).
We define the potentials according to the features
used in [31]: average Kulczynski and IR measure in
estimated advising period as homophily and polarity
potentials, and the constraint that one cannot advise
another before graduation as a pairwise potential. We
compare with the unsupervised method TPFG in [31],
which is equivalent to giving every potential equal
weight without learning from labeled data. We also
compare to a CRF model with singleton features only.
So we can have a decomposed study on the importance
of the learning and the constraint respectively.
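As an illustration, the graduation constraint can be written as a pairwise potential along the following lines; start_year and grad_year are hypothetical helpers (e.g., derived from the estimated advising periods) and the sketch is not the exact feature used.

```python
# Minimal sketch: "one cannot advise another before graduation" as a
# pairwise potential over the parent (advisor) variables y_i and y_j.
def f_advise_after_graduation(yi, yj, i, j, start_year, grad_year):
    """Penalize assignments where v_j advises v_i (yi == j) although v_j's
    advising of v_i starts before v_j graduated from its own advisor v_{yj}."""
    if yi == j and start_year(j, i) < grad_year(j, yj):
        return -1.0   # incompatible with the constraint
    return 0.0
```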
The results are shown in Table 7. Note that we now
evaluate with more strict measures than the Apair used
in [31]. The result shows that learning is helpful when
the same set of features is used, yet the constraint
encoded by pairwise potentials is critical: without the
constraint, CRF has even worse performance compared
to the unsupervised model TPFG which can handle the
constraint. When the constraint is added, CRF-Hier
can increase Apath , Apar and F 1anc of TPFG by 42%,
27% and 8%.
The overall performance for this dataset is lower
than that of the Forum dataset. One reason is we
use only a small number of features for fair comparison
with the unsupervised method. Another reason is the
dataset is not fully labeled, i.e., only a scattered part
of the forest is known and we do not have even one
fully labeled tree. The interaction of the variables is
thus limited within a short range in the training data,
and the trained model has less power in utilizing the
interaction.
6 Conclusions and Future Work
We define and tackle the problem of hierarchical
relationship discovery for linked objects. We study it
in the supervised setting. We propose a discriminative
probabilistic model that is optimized for the tree structure learning and prediction, which can handle both local features and knowledge propagation rules. We sort
out the common features and rules that can be encoded
in simple and unified forms. By demonstrating with various specific applications, we validate the effectiveness
and generality of our approach.
While we have categorized the features for tree
structures, it remains an open problem how to do feature extraction and selection automatically for these different types of features. One may explore other learning frameworks such as max-margin Markov networks [28] and structured SVMs [29], and other settings such as semi-supervised or active learning. Finally, more research issues need to be studied when some assumptions are violated, e.g., when the linked objects do not have a cycle-free order, or the relations do not form a strict tree structure.
References
[1] E. Agichtein and L. Gravano. Snowball: extracting
relations from large plain-text collections. In Proc.
2000 ACM conf. on Digital libraries (DL ’00), 2000.
[2] P. Bille. A survey on tree edit distance and related
problems. Theor. Comput. Sci., 337:217–239, June
2005.
[3] K. Bollacker, R. Cook, and P. Tufts. Freebase: A
shared database of structured general human knowledge. In AAAI ’08, 2008.
[4] R. H. Byrd, J. Nocedal, and R. B. Schnabel. Representations of quasi-newton matrices and their use in
limited memory methods. Math. Program., 63:129–156,
1994.
[5] Z. Chen, S. Tamang, A. Lee, X. Li, W.-P. Lin, J. Artiles, M. Snover, M. Passantino, and H. Ji. Cunyblender tac-kbp2010 entity linking and slot filling system description. In Proc. 2010 NIST Text Analytics
Conference, TAC ’10, 2010.
[6] A. Clauset, C. Moore, and M. E. J. Newman. Hierarchical structure and the prediction of missing links in
networks. Nature, 453:98–101, May 2008.
[7] A. Culotta, A. McCallum, and J. Betz. Integrating
probabilistic extraction models and data mining to
discover relations and patterns in text. In Proc. HLT
Conf. the North American Chapter of the Association
of Computational Linguistics, HLT-NAACL ’06, 2006.
[8] L. Di Caro, K. S. Candan, and M. L. Sapino. Using
tagflake for condensing navigable tag hierarchies from
tag clouds. In KDD ’08, 2008.
[9] C. P. Diehl, G. Namata, and L. Getoor. Relationship
identification for social network discovery. In AAAI’07,
2007.
[10] G. Elidan, I. McGraw, and D. Koller. Residual belief
propagation: Informed scheduling for asynchronous
message passing. In Proc. 2006 Conf. on Uncertainty
in AI (UAI ’06), 2006.
[11] G. Even, J. S. Naor, B. Schieber, and M. Sudan.
Approximating minimum feedback sets and multicuts
in directed graphs. ALGORITHMICA, 20:151–174,
1998.
[12] L. C. Freeman. Uncovering organizational hierarchies.
Comput. Math. Organ. Theory, 3:5–18, March 1997.
[13] B. J. Frey. Graphical models for machine learning and
digital communication. MIT Press, Cambridge, MA,
USA, 1998.
[14] Z. GuoDong, S. Jian, Z. Jie, and Z. Min. Exploring
various knowledge in relation extraction. In ACL ’05,
2005.
[15] J. Hammersley and P. Clifford. Markov field on finite
graphs and lattices, 1971. Unpublished.
[16] H. Ji and R. Grishman. Knowledge base population:
Successful approaches and challenges. In ACL ’11,
2011.
[17] R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages
85–103. 1972.
[18] C. Kemp and J. B. B. Tenenbaum. The discovery of
structural form. Proceedings of National Academy of
Sciences of the United States of America, July 2008.
[19] J. D. Lafferty, A. McCallum, and F. C. N. Pereira.
Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In ICML’01,
pages 282–289, 2001.
[20] J. Leskovec, L. Backstrom, and J. Kleinberg. Memetracking and the dynamics of the news cycle. In KDD
’09, 2009.
[21] Q. Li, S. Anzaroot, W. Lin, X. Li, and H. Ji. Joint
Inference for Cross-document Information Extraction.
In CIKM ’11, 2011.
[22] A. S. Maiya and T. Y. Berger-Wolf. Inferring the
maximum likelihood hierarchy in social networks. In
Proc. 2009 International Conf. on Computational Science and Engineering, 2009.
[23] A. McCallum, X. Wang, and A. Corrada-Emmanuel.
Topic and role discovery in social networks with experiments on enron and academic email. Journal of Artificial Intelligence Research (JAIR), 30:249–272, 2007.
[24] M. G. Rodriguez, J. Leskovec, and A. Krause. Inferring
networks of diffusion and influence. In KDD ’10, 2010.
[25] D. Roth and W.-t. Yih. Probabilistic reasoning for
entity & relation recognition. In Proc. 2002 ICCL
Conf. on Computational Linguistics, COLING ’02,
2002.
[26] R. Rowe and S. Hershkop. Automated social hierarchy
detection through email network analysis. In Proc.
2007 ACM workshop on Web mining and social network
analysis (WebKDD/SNA-KDD ’07), 2007.
[27] J. Seo, W. Croft, and D. Smith. Online community
search using thread structure. In CIKM ’09, 2009.
[28] B. Taskar, C. Guestrin, and D. Koller. Max margin
Markov networks. In NIPS ’03, 2003.
[29] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML ’04, 2004.
[30] M. J. Wainwright, T. Jaakkola, and A. S. Willsky.
Tree-based reparameterization for approximate estimation on loopy graphs. In NIPS’01, pages 1001–1008.
[31] C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu,
and J. Guo. Mining advisor-advisee relationships from
research publication networks. In KDD’10, 2010.
[32] Y. Weiss and W. T. Freeman. On the optimality
of solutions of the max-product belief propagation
algorithm in arbitrary graphs. IEEE Transactions on
Information Theory, 47:723–735, 2001.
[33] X. Yin and S. Shah. Building taxonomy of web search
intents for name entity queries. In WWW ’10, 2010.
[34] F. Zhang, S. Shi, J. Liu, S. Sun, and C.-Y. Lin. Nonlinear evidence fusion and propagation for hyponymy
relation mining. In Proc. 2011 ACL Conf. Human Language Technologies, HLT ’11, 2011.
[35] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y.
Ma. 2d conditional random fields for web information
extraction. In ICML ’05, 2005.