Learning Shortest Paths for Word Graphs
Emmanouil Tzouridis and Ulf Brefeld
Technische Universität Darmstadt,
Hochschulstr. 10, 64289 Darmstadt, Germany
{tzouridis,brefeld}@kma.informatik.tu-darmstadt.de
Abstract. The vast amount of information on the Web drives the need
for aggregation and summarisation techniques. We study event extraction as a text summarisation task over redundant sentences, a problem also known as sentence compression. Given a set of sentences describing the
same event, we aim at generating a summarisation that is (i) a single sentence, (ii) simply structured and easily understandable, and (iii) minimal
in terms of the number of words/tokens. Existing approaches for sentence
compression are often based on finding the shortest path in word graphs
that is spanned by related input sentences. These approaches, however,
deploy manually crafted heuristics for edge weights and lack theoretical
justification. In this paper, we cast sentence compression as a structured
prediction problem. Edges of the compression graph are represented by
features drawn from adjacent nodes so that corresponding weights are
learned by a generalised linear model. Decoding is performed in polynomial time by a generalised shortest path algorithm using loss-augmented
inference. We report on preliminary results on artificial and real world
data.
1 Introduction
Information is ubiquitous. People are consuming information on the go by reading news articles and blog entries, checking status updates, or planning their next vacation. However, this informed mobility comes at the cost of relevance. Users
need to manually identify relevant pieces of information in the overloaded supply
and to aggregate these pieces themselves to find an answer to their query. Thus,
there is a real need for techniques that assist the user in separating relevant from
irrelevant information and aggregating the pieces of information automatically.
In this paper we study the intelligent aggregation of related sentences to
quickly serve the information needs of users. Given a collection of sentences
dealing with the same real-world event, we aim at generating a single sentence
that is (i) a summarisation of the input sentences, (ii) simply structured and
easily understandable, and (iii) minimal in terms of the number of words/tokens.
The input collection of sentences is represented as a word graph [4], where words
are identified with nodes and directed edges connect adjacent words in at least
one sentence. The output is a sequence of words fulfilling conditions (i-iii).
In general, learning such mappings between arbitrary structured and interdependent input and output spaces challenges the standard model of learning a
mapping from independently drawn instances to a small set of labels. In order
to capture the involved dependencies it is helpful to represent inputs x ∈ X and
outputs y ∈ Y in a joint feature representation. The learning task is therefore
rephrased as finding a function f : X × Y → ℝ such that

    ŷ = argmax_y f(x, y)    (1)
is the desired output for any input x [2, 12]. The function f is a linear model in
a joint space Φ(x, y) of input and output variables and the computation of the
argmax is performed by an appropriate decoding strategy. Prominent applications of such models are part-of-speech tagging, parsing, or image segmentation.
In this paper, we cast sentence compression as a supervised structured prediction problem to learn a mapping from word graphs to their shortest paths.
Edges of the graphs are labeled with costs and the shortest path realises the
lowest possible costs from a start to an end node. Thus, we aim at finding a
function f , such that
    ŷ = argmin_y f(x, y).    (2)
We devise structured perceptrons and support vector machines for learning the
shortest path. The latter usually requires the use of two-best decoding algorithms
which are expensive in terms of computation time and memory. We prove that an
equivalent expression can be obtained by a generalised shortest path algorithm
using loss-augmented inference, which renders learning much more efficient
than using a two-best decoding strategy. Empirically, we compare our approach
to the state-of-the-art that uses heuristic edge weights [4] on artificial and real
world data and report on preliminary results.
The remainder is structured as follows. Section 2 introduces preliminaries.
Our main contribution on learning shortest paths is presented in Section 3 and
Section 4 reports on empirical results. Section 5 discusses related work and Section 6 concludes.
2 Preliminaries

2.1 Word Graphs
Word or compression graphs have been studied for instance by [3, 4]. The idea
is to build a non-redundant representation for possibly redundant sequences by
merging identical observations. From a collection of related sentences {s1 , . . . , sn },
we iteratively construct a word graph by adding sentences one-by-one as follows:
We begin with the empty graph and add the first sentence s1 , where every word
in the sentence becomes a node and a directed edge connects nodes of adjacent
words. After this step, the graph is a sequence representing s1 . Words from the
second sentence s2 are incorporated by differentiating two cases: (i) if the graph
already contains the word (e.g., using lowercase representations), the word is
Fig. 1: Word graph constructed from the sentences: "Yahoo in rumoured $1.1bn bid to buy white-hot Tumblr", "Yahoo buys Tumblr as David Karp stays as CEO", "Yahoo to buy Tumblr for $1.1bn". The shortest path is highlighted.
simply mapped to the corresponding node and (ii) otherwise, a new node is instantiated. In both cases, a directed edge is inserted to connect the word to its
predecessor from s2 . The procedure continues until all n sentences are incorporated in the graph.
Usually, sentences are augmented by designated auxiliary words indicating
the start and the end of the sentence. The sketched procedure merges identical
words but preserves the structure of the sentences along the contained paths and
the original sentences can often be reconstructed from the compressed representation. There are many different ways to build such graphs, e.g., by excluding punctuation, removing stop words, or using part-of-speech information to improve merge operations. Figure 1 shows some related sentences and the corresponding word graph.
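To make the construction concrete, the following minimal Python sketch builds such a graph from raw sentences. It is a simplification of the procedure above: every identical lowercased token is merged into a single node, punctuation handling and part-of-speech disambiguation are omitted, and the names (build_word_graph, the auxiliary start/end tokens) are ours rather than part of the original implementation.

```python
from collections import defaultdict

START, END = "-start-", "-end-"   # auxiliary start/end tokens

def build_word_graph(sentences):
    """Merge identical (lowercased) words into one node and connect adjacent
    words by directed edges; edge counts record how often a transition occurs."""
    nodes = set()
    edges = defaultdict(int)          # (word_u, word_v) -> frequency
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        nodes.update(tokens)
        for u, v in zip(tokens, tokens[1:]):
            edges[(u, v)] += 1        # existing nodes and edges are reused
    return nodes, dict(edges)

nodes, edges = build_word_graph([
    "Yahoo to buy Tumblr for $1.1bn",
    "Yahoo in rumoured $1.1bn bid to buy white-hot Tumblr",
])
```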
2.2 Shortest Path Algorithms
Let x = (N, E) be a directed weighted graph, where N is the set of nodes and E the set of edges. As the graph x defines the sets N and E, we will use N(x) and E(x) in the remainder to denote the set of nodes and edges of graph x, respectively. Every edge (xi, xj) ∈ E(x) is assigned a positive weight given by a cost function cost : (xi, xj) ↦ ℝ+. A path y in the graph x is a sequence of
connected nodes of x and the cost of such a path is given by the sum of the edge
costs for every edge that is on the path. Given the graph x, a start node xs and an end node xe, the shortest path problem is finding the path in x from xs to xe with the lowest costs,

    argmin_y Σ_{(xi,xj)∈E(x)} yij cost(xi, xj)
    s.t. y ∈ path(xs, xe).
There exist many algorithms for computing shortest paths efficiently [6–8]. Usually, these methods are based on iterative relaxation, where an
approximation of the exact quantity is iteratively updated until it converges
to the correct solution. Such algorithms converge in polynomial time if there
are no negative cycles inside the graph, otherwise the problem is NP-hard. A
prominent algorithm for computing the k shortest paths is Yen’s algorithm
[5]. Intuitively, the approach recursively computes the second best solution by
considering deviations from the shortest path, the third best solution from the
previous two solutions, and so on. Figure 1 visualises the shortest path for the
displayed compression graph.
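As an illustration of the relaxation idea, the sketch below implements a Bellman–Ford-style single-source solver; it is not the decoder used in the paper, but any equivalent shortest path routine (including Yen's algorithm for k-best lists) could be plugged in. The graph encoding (a node set, an edge list, and a cost dictionary) follows the sketch above and is an assumption.

```python
def bellman_ford(nodes, edges, cost, source):
    """Relaxation-based single-source shortest paths: distance estimates are
    repeatedly tightened until they converge (assumes no negative cycles)."""
    dist = {v: float("inf") for v in nodes}
    pred = {v: None for v in nodes}
    dist[source] = 0.0
    for _ in range(len(nodes) - 1):            # |N| - 1 rounds of relaxation
        for (u, v) in edges:
            if dist[u] + cost[(u, v)] < dist[v]:
                dist[v] = dist[u] + cost[(u, v)]
                pred[v] = u
    return dist, pred

def extract_path(pred, target):
    """Follow predecessor pointers back from the end node."""
    path, v = [], target
    while v is not None:
        path.append(v)
        v = pred[v]
    return list(reversed(path))
```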
3 Learning the Shortest Path

3.1 Representation
To learn the shortest path, we need to draw features from adjacent nodes in the
word graph to learn the score of the connecting edge. Let xi and xj be connected
nodes of the compression graph x, that is xi , xj ∈ N (x) and (xi , xj ) ∈ E(x).
We represent the edge between xi and xj by a feature vector φ(xi , xj ) that
captures characteristic traits of the connected nodes, such as indicator functions
detailing whether xi and xj are part of the same named entity or part-of-speech
transitions.
A path in the graph is represented as an n × n binary matrix y with n =
|N (x)| and elements {yij } given by yij = [[(xi , xj ) ∈ path]] where [[z]] is the
indicator function returning one if z is true and zero otherwise. The cost of using
the edge (xi , xj ) in a path is given by a linear combination of those features
parameterised by w,
    cost(xi, xj) = w⊤φ(xi, xj).
Replacing the constant costs by the parameterised ones, we arrive at the following objective function (ignoring the constraints for a moment) that can be
rewritten as a generalised linear model.
    Σ_{(xi,xj)∈E(x)} yij w⊤φ(xi, xj) = w⊤ Σ_{(xi,xj)∈E(x)} yij φ(xi, xj) = w⊤Φ(x, y) = f(x, y),

where the middle sum defines the joint feature representation Φ(x, y).
Given a word graph x, the shortest path ŷ for a fixed parameter vector w can
now be computed by
    ŷ = argmin_y f(x, y),
where f is exactly the objective of the shortest path algorithm and the argmin is consequently computed by an appropriate solver, such as Yen’s algorithm [5].
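The following sketch illustrates how edge features, the joint feature map Φ(x, y), and the path cost f(x, y) = w⊤Φ(x, y) fit together. The concrete features are placeholders of our own choosing; the paper's actual indicators (shared named entities, part-of-speech transitions) would require an NLP pipeline that is omitted here.

```python
import numpy as np

def edge_features(xi, xj):
    """Illustrative feature map phi(x_i, x_j) for an edge between two words."""
    return np.array([
        1.0,                                           # bias
        float(xi[:1].isupper() and xj[:1].isupper()),  # crude proxy for a shared named entity
        (len(xi) + len(xj)) / 20.0,                    # token-length feature
    ])

def joint_feature_map(path_edges):
    """Phi(x, y): sum of edge features over all edges on the path y."""
    return np.sum([edge_features(u, v) for (u, v) in path_edges], axis=0)

def path_cost(w, path_edges):
    """f(x, y) = w^T Phi(x, y); the desired summary is the path minimising it."""
    return float(w @ joint_feature_map(path_edges))

w = np.array([0.5, -1.0, 0.2])
print(path_cost(w, [("Yahoo", "buys"), ("buys", "Tumblr")]))
```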
3.2 Problem Setting
In our setting, word graphs x ∈ X and the best summarising sentence y ∈ Y are represented jointly by a feature map Φ(x, y) that allows us to capture multi-way dependencies between inputs and outputs. We apply a generalised linear model f(x, y) = w⊤Φ(x, y) to decode the shortest path

    ŷ = argmin_y f(x, y).
The quality of f is measured by the Hamming loss
    ∆H(y, ŷ) = (1/2) Σ_{(xi,xj)∈E(x)} [[yij ≠ ŷij]]
that details the differences between the true y and the prediction ŷ, where [[·]]
is again the indicator function from Section 3.1. Thus, the generalisation error
is given by
    R[f] = ∫_{X×Y} ∆H(y, argmin_ȳ f(x, ȳ)) dP(x, y)
and approximated by its empirical counterpart
    R̂[f] = Σ_{i=1}^{m} ∆H(yi, argmin_ȳ f(xi, ȳ))    (3)
on a finite m-sample of pairs (x1, y1), (x2, y2), . . . , (xm, ym), where xi is a word
graph and yi its shortest path (i.e., the best summarising sentence). Minimising
the empirical risk in Equation (3) directly leads to an optimisation problem
that is not well-posed as there generally exist many equally good solutions that
realise an empirical loss of zero. We thus consider also the minimisation of the
regularised empirical risk
    Q̂[f] = Ω(f) + Σ_{i=1}^{m} ∆H(yi, argmin_ȳ f(xi, ȳ))    (4)

where Ω(f) places a prior on f, e.g., to enforce smooth solutions. In the remainder we focus on Ω(f) = ‖w‖².
3.3 Perceptron-based Learning
Structured Perceptrons [13, 14] directly minimise the empirical loss in Equation
(3). The training set is processed iteratively, where in the i-th iteration the
actual prediction ŷ = argmin_y f(xi, y) is compared with the true output yi. If ∆H(yi, ŷ) = 0 the algorithm proceeds with the next instance. However, if ∆H(yi, ŷ) ≠ 0 an erroneous prediction is made and the parameters need to be
adjusted accordingly. In case of learning shortest paths, we aim at assigning a
smaller function value to the true path than to all alternative paths. If this holds
for all training examples we have
    ∀ ȳ ≠ yi : w⊤Φ(xi, ȳ) − w⊤Φ(xi, yi) > 0.    (5)
In case one of these constraints is violated, the parameter vector is updated
according to
    w ← w + Φ(xi, ŷ) − Φ(xi, yi),

where ŷ denotes the erroneously decoded path (the false prediction). One can show that the perceptron converges if an optimal solution realising R̂[f] = 0 exists [13, 14].
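A minimal sketch of this training loop is given below; the decoder and the joint feature map are assumed to be supplied by the caller, and the path representation (e.g., a tuple of edges) is an implementation choice of ours, not prescribed by the paper.

```python
import numpy as np

def structured_perceptron(examples, decode, phi_joint, dim, epochs=10):
    """Structured perceptron for the argmin setting described above.
    `examples` holds (graph, true_path) pairs, `decode(w, graph)` returns the
    current shortest path under edge costs w^T phi, and `phi_joint(graph, path)`
    computes Phi(x, y); all three are assumed to be provided by the caller."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for graph, y_true in examples:
            y_hat = decode(w, graph)
            if y_hat != y_true:          # erroneous prediction, Delta_H > 0
                # push the wrong path up and the true path down in cost
                w += phi_joint(graph, y_hat) - phi_joint(graph, y_true)
    return w
```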
3.4 Large-margin Approach
For support vector-based learning, we extend the constraints in Equation (5) by
a term that induces a margin between the true path yi and all alternative paths.
A common technique is called margin-rescaling and amounts to scaling the margin with the actual loss that is induced by decoding ȳ instead of yi. Thus, rescaling
the margin by the loss implements the intuition that the confidence of rejecting
a mistaken output is proportional to its error. In the context of learning shortest
paths, margin-rescaling gives us the following constraint
    ∀ ȳ ≠ yi : w⊤Φ(xi, ȳ) − w⊤Φ(xi, yi) > ∆H(yi, ȳ) − ξi,    (6)
where ξi ≥ 0 is a slack-variable that allows pointwise relaxations of the margin.
Solving the equation for ξi shows that margin rescaling also affects the hinge loss that now augments the structural loss ∆H,
    ℓ∆H(xi, yi, f) = max( 0, max_ȳ [ ∆H(yi, ȳ) − w⊤Φ(xi, ȳ) + w⊤Φ(xi, yi) ] ).
The effective hinge loss upper bounds the structural loss ∆H for every pair
(xi , yi ) and therefore also
    Σ_{i=1}^{m} ℓ∆H(xi, yi, f) ≥ Σ_{i=1}^{m} ∆H(yi, argmin_ȳ f(xi, ȳ))
holds. Instead of minimising the empirical risk in Equation (3), structural support vector machines [2] aim to minimise its regularised counterpart in Equation
(4). A maximum-margin approach to learning shortest paths therefore leads to
the following optimisation problem
    min_{w,ξ}  ‖w‖² + Σ_{i=1}^{m} ξi

    s.t.  ∀i ∀ȳ ≠ yi : w⊤Φ(xi, ȳ) − w⊤Φ(xi, yi) > ∆H(yi, ȳ) − ξi
          ∀i : ξi ≥ 0.
The above optimisation problem can be solved by cutting plane methods (e.g.,
[2]). The idea behind cutting planes is to instantiate only a minimal subset of the exponentially many constraints. That is, for the i-th training example, we decode the shortest path ŷ given our current model and consider two cases: (i) If ŷ ≠ yi, the prediction is erroneous and ŷ is called the most strongly violated constraint as it realises the smallest function value, i.e., f(xi, ŷ) < f(xi, y) for all y ≠ ŷ. Consequently, the respective constraint of the above optimisation
problem is instantiated and influences the subsequent iterations. (ii) If instead
the prediction is correct, that is ŷ = yi , we need to verify that the second
best prediction ŷ(2) fulfils the margin constraint. If so, we proceed with the
next training example, otherwise we instantiate the corresponding constraint,
analogously to case (i). Luckily, we do not need to rely on an expensive two-best
shortest path algorithm but can compute the most strongly violated constraint
directly via the cost function
    Q(ȳ) = ∆H(yi, ȳ) − w⊤Φ(xi, ȳ) + w⊤Φ(xi, yi)    (7)
that has to be maximised with respect to ȳ. The following proposition shows that we can
equivalently solve a shortest path problem for finding the maximiser of Q.
Proposition 1 (Loss-augmented inference for shortest path problems). The maximiser y* of Q in Equation (7) can be equivalently computed by solving a shortest path problem with cost(xi, xj) = yij + w⊤φ(xi, xj).
Proof. In the proof, we treat paths y as graphs and write N (y) for the set of
nodes on the path and E(y) to denote the set of edges that lie on the path.
If, for instance, an element of the binary adjacency matrix representing path y
equals one, e.g., yij = 1, we write yi , yj ∈ N (y) and (yi , yj ) ∈ E(y). First, note
that the Hamming loss can be rewritten as
    ∆H(yi, ȳ) = Σ_{(yi,yj)∈E(y)} (1 − yij ȳij).    (8)
Using Equation (8), we have
    ŷ = argmax_ȳ [ ∆H(yi, ȳ) + w⊤Φ(xi, yi) − w⊤Φ(xi, ȳ) ]
       = argmax_ȳ [ ∆H(yi, ȳ) − w⊤Φ(xi, ȳ) ]
       = argmax_ȳ [ Σ_{(yi,yj)∈E(y)} (1 − yij ȳij) − w⊤Φ(xi, ȳ) ]
       = argmax_ȳ [ −Σ_{(yi,yj)∈E(y)} yij ȳij − w⊤Φ(xi, ȳ) ]
       = argmin_ȳ [ Σ_{(yi,yj)∈E(y)} yij ȳij + w⊤Φ(xi, ȳ) ]
       = argmin_ȳ [ Σ_{(yi,yj)∈E(y)} yij ȳij + w⊤ Σ_{(xi,xj)∈E(x)} ȳij φ(xi, xj) ]
       = argmin_ȳ [ Σ_{(xi,xj)∈E(x)} yij ȳij + w⊤ Σ_{(xi,xj)∈E(x)} ȳij φ(xi, xj) ]
       = argmin_ȳ Σ_{(xi,xj)∈E(x)} (yij + w⊤φ(xi, xj)) ȳij.

The output ŷ is the shortest path with costs given by yij + w⊤φ(xi, xj). □
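A minimal sketch of the resulting loss-augmented edge costs is given below, assuming edges are represented as node pairs and edge_features returns φ(xi, xj) as in the earlier sketches.

```python
def loss_augmented_costs(graph_edges, true_path_edges, w, edge_features):
    """Edge costs realising Proposition 1: cost(x_i, x_j) = y_ij + w^T phi(x_i, x_j).
    Edges on the annotated path become more expensive by one, so the resulting
    shortest path combines low model cost with a large Hamming distance to the
    truth, i.e. it is the most strongly violated constraint."""
    on_true_path = set(true_path_edges)
    return {(u, v): float((u, v) in on_true_path) + float(w @ edge_features(u, v))
            for (u, v) in graph_edges}
```

Feeding these costs into any standard shortest path routine (such as the relaxation sketch in Section 2.2) yields the maximiser of Q without resorting to a two-best decoder.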
Given a parameter vector w and start and end nodes xs and xt , respectively,
the optimisation of Q can be performed with the following linear program.
    min_ȳ Σ_{(i,j)} (yij + w⊤φ(xi, xj)) ȳij

    s.t.  ∀k ∈ N(x)\{s, t} :   Σ_j ȳkj − Σ_i ȳik ≤ 0
          ∀k ∈ N(x)\{s, t} :  −Σ_j ȳkj + Σ_i ȳik ≤ 0
          Σ_j ȳsj − Σ_i ȳis ≤ 1    ∧    −Σ_j ȳsj + Σ_i ȳis ≤ −1
          Σ_i ȳit − Σ_j ȳtj ≤ 1    ∧    −Σ_i ȳit + Σ_j ȳtj ≤ −1
          ∀(i, j) : ȳij ≤ x(i,j)
          ∀(i, j) : ȳij ∈ {0, 1}
The first two lines of constraints guarantee that every inner node of the path has as many incoming as outgoing edges, the third line guarantees that the path starts in xs and, analogously, the fourth line ensures that it terminates in xt. The remaining constraints force the edges of the path ȳ to move along existing edges of x and to be binary.
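To illustrate how loss-augmented inference drives training, the sketch below replaces the cutting-plane quadratic program of [2] with a simple stochastic subgradient step on the margin-rescaled hinge loss; this is a simplification for exposition, not the optimiser used in the paper. The routine decode_loss_augmented is assumed to solve the loss-augmented shortest path problem of Proposition 1, and paths are assumed to be given as collections of edges.

```python
import numpy as np

def hamming(path_a_edges, path_b_edges):
    """Delta_H: half the number of edge indicators on which two paths differ."""
    return len(set(path_a_edges) ^ set(path_b_edges)) / 2.0

def ssvm_subgradient(examples, decode_loss_augmented, phi_joint,
                     dim, epochs=20, lam=0.01, eta=0.1):
    """Margin-rescaled training with loss-augmented inference: whenever the
    most strongly violated path breaks constraint (6), take a subgradient
    step on the regularised hinge loss."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for graph, y_true in examples:
            y_bar = decode_loss_augmented(w, graph, y_true)    # maximiser of Q
            margin = float(w @ (phi_joint(graph, y_bar) - phi_joint(graph, y_true)))
            if margin < hamming(y_true, y_bar):                # constraint (6) violated
                w -= eta * (lam * w + phi_joint(graph, y_true) - phi_joint(graph, y_bar))
    return w
```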
Fig. 2: Performance on artificial data for perceptrons (left) and SVMs (right); the x-axis shows the distance α, the y-axis the top-one accuracy, with one curve per graph size (10, 20, 30, and 40 nodes).
4 Empirical Results

4.1 Artificial Data
To showcase the effectiveness of our approach, we generate artificial data as follows.
We sample random graphs with |N| ∈ {10, 20, 30, 40} nodes. For every node in a graph, we sample the number of outgoing edges uniformly in the interval [|N|/2, |N|]. For every outgoing edge, a receiving node is sampled uniformly from the remaining |N| − 1 nodes. The optimal path is annotated as follows. We draw the length of the path uniformly in the interval [|N|/2, |N|] and randomly select
the respective number of nodes from N . This gives us a sequence of nodes that
we define as the shortest path. In case two adjacent nodes are not connected by
an edge, we discard the nodes and draw again.
To ensure that the optimal path is actually the one with lowest costs, edge features are sampled from a one-dimensional Gaussian mixture distribution, where the generating component is chosen according to whether the respective edge lies on the shortest path or not. That is, we introduce two Gaussian components G0 and G1, so that costs for edges lying on the shortest path are drawn from G0(µ1, σ1²) while costs for all other edges are sampled from G1(µ2, σ2²).
The difficulty of the experimental setup is controlled by a parameter α that measures the distance of the two means, i.e., α = |µ1 − µ2|. We initialise the components by drawing one of them (say q) according to a coin flip and sampling the corresponding mean from a normal distribution µq ∼ N(−α/2, 0.1); the other component is then initialised with µq̄ ∼ N(α/2, 0.1). We use σ1 = σ2 = 0.01. We report
on averages over 100 repetitions.
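A sketch of this generator is shown below. It deviates from the description in one point for brevity: instead of re-drawing node pairs that are not connected, the missing path edges are simply added to the graph; parameter defaults are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_instance(n_nodes=10, alpha=0.5, sigma=0.01):
    """Random graph with a planted shortest path: on-path edge features are
    drawn from a Gaussian with mean -alpha/2, off-path edges from one with
    mean +alpha/2 (both with standard deviation sigma)."""
    # random directed edges: every node gets between |N|/2 and |N| successors
    edges = set()
    for u in range(n_nodes):
        out_degree = int(rng.integers(n_nodes // 2, n_nodes + 1))
        targets = rng.choice([v for v in range(n_nodes) if v != u],
                             size=min(out_degree, n_nodes - 1), replace=False)
        edges.update((u, int(v)) for v in targets)
    # plant a path of random length over randomly selected nodes
    length = int(rng.integers(n_nodes // 2, n_nodes + 1))
    path = [int(v) for v in rng.choice(n_nodes, size=length, replace=False)]
    path_edges = set(zip(path, path[1:]))
    edges.update(path_edges)          # simplification: add missing path edges
    # one-dimensional edge features from the two Gaussian components
    features = {e: rng.normal(-alpha / 2 if e in path_edges else alpha / 2, sigma)
                for e in edges}
    return edges, features, path
```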
The results for perceptrons and SVMs are shown in Figure 2. The distance
α is depicted on the x-axis. The y-axis shows the top-one accuracy. The performance of both algorithms highly depends on the distance of the cost-generating
components and the size of the graph. The fewer the nodes and the larger the distance α, the more accurate the prediction. Both algorithms perform similarly.
Fig. 3: Performance on news headlines. Top-1 accuracy (left column) and average rank (right column) are plotted against the number of training samples for perceptrons (top row) and SVMs (bottom row); the Filippova baseline is shown in every panel for reference.
4.2 News Headlines
The real-world data originates from news articles which have been crawled from
several web sites on different days. We use categories Technology, Sports, Business and General and focus only on the headlines. Related sets of news headlines
are manually identified and grouped together. We discard collections with fewer than 5 headlines and build word graphs for the remaining sets following the procedure described in Section 2.1, where we remove stop words. Ground truth is annotated manually by selecting the best sentence among the 20 shortest paths computed by Yen’s algorithm [5] using frequencies as edge weights; this annotation was done by one of the authors. This process leaves us with 87 training
examples.
We aim to learn the costs for the edges that give rise to the optimal compression of the training graphs. We devise three different sets of features. The
first feature representation consists of only two features that are inspired by the
heuristic in [4]. For an edge (xi , xj ), we use
    φ1(xi, xj) = ( #(x1)/#(x1, x2), #(x2)/#(x1, x2) )⊤,
Table 1: Leave-one-out results for news headlines

            avg. acc.   avg. rank
Filippova   0.277       4.378
SVM-fset3   0.252       6.942
where # denotes the frequency of nodes and edges, respectively. The second feature representation extends the previous one by Wordnet similarity simW (x1 , x2 )
of the nodes and frequencies #L taken from the Leipzig corpus, to incorporate
the notion of global relatedness,
    φ2(xi, xj) = ( #(x1)/#(x1, x2), #(x2)/#(x1, x2), simW(x1, x2), #L(x1)/#L(x1, x2), #L(x2)/#L(x1, x2) )⊤.
Finally, we use a third feature representation which is again inspired by [4].
Instead of precomputing the surrogates, we simply input the ingredients to have
the algorithm pick the best combination,
    φ3(xi, xj) = ( #(x1), #(x2), #(x1, x2), log(#(x1)), log(#(x2)), log(#(x1, x2)) )⊤.
We compare our algorithms with the unsupervised approach by Filippova [4]. In that work, the author extracts the shortest path from the word graph using (#(x1) + #(x2)) / #(x1, x2) as edge weights.
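The sketch below computes the first feature set and the baseline edge weight from node and edge frequencies of the word graph. Note that the weighting actually proposed in [4] is more elaborate; the formula here follows the simplified form stated above, and the frequency dictionaries are assumed to come from a construction like the one in Section 2.1.

```python
def phi1(node_freq, edge_freq, xi, xj):
    """First feature set: node frequencies normalised by the edge frequency,
    phi1(xi, xj) = (#(x1)/#(x1, x2), #(x2)/#(x1, x2))."""
    f_edge = edge_freq[(xi, xj)]
    return [node_freq[xi] / f_edge, node_freq[xj] / f_edge]

def filippova_weight(node_freq, edge_freq, xi, xj):
    """Heuristic edge weight of the unsupervised baseline, in the simplified
    form given in the text: (#(x1) + #(x2)) / #(x1, x2)."""
    return (node_freq[xi] + node_freq[xj]) / edge_freq[(xi, xj)]
```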
Figure 3 shows average accuracy (left column) and average rank (right column) for perceptrons (top row) and SVMs (bottom row) for different training
set sizes, depicted on the x-axis. Every curve is the result of a cross-validation
that uses all available data. Thus, the rightmost points are generated by a 2-fold cross-validation while the leftmost points result from using 11 folds. Due to the small training sets, interpreting the figures is difficult. The unsupervised baseline outperforms the learning methods, although there are indications that more training data could lead to better performance of perceptrons and SVMs. The first feature representation shows better performance than the second. However, these conjectures need to be verified by an experiment on a larger scale.
Finally, Table 1 shows average accuracies and average ranks for a leave-one-out setup using the third feature representation. The results are promising and not too far from the baseline; however, as before, the evaluation needs to be based on larger sample sizes to allow for interpretation.
5 Related Work
Barzilay and McKeown [9] study sentence compression using dependency trees. Aligned trees are represented by a lattice from which a compressed sentence is extracted by an entropy-based criterion over all possible traversals of the lattice. Wan et al. [11] use language models in combination with maximum spanning trees to rank candidate aggregations that satisfy grammatical constraints.
While the previous approaches to multi-sentence compression are based on syntactic parsing of the sentences, word graph approaches have been proposed that do not make use of dependency trees or other linguistic concepts. Filippova
[4] casts the problem as finding the shortest path in directed word graphs, where
each node is a unique word and directed edges represent the word ordering in
the original sentences. The costs of these edges are a heuristic function that is
based on word frequencies. Recently, Boudin and Morin [10] propose a re-ranking
scheme to identify summarising sentences that contain many keyphrases. The
underlying idea is that representative key phrases for a given topic give rise to
more informative aggregations.
In contrast to the cited related work, we cast the problem of sentence compression as a supervised structured prediction problem and aim at learning the
edge costs from a possibly rich set of features describing adjacent nodes.
6 Conclusion
In this paper, we proposed to learn shortest paths in compression graphs for
summarising related sentences. We addressed the previously unsupervised problem in a supervised context and devised structured perceptrons and support
vector machines that effectively learn the edge weights of compression graphs,
so that a shortest path algorithm decodes the best possible summarisation. We
showed that the most strongly violated constraints can be computed directly by loss-augmented inference, rendering the use of expensive two-best algorithms
unnecessary. Empirically, we presented preliminary results on artificial and real
world data sets. Due to small sample sizes, conclusions cannot be confidently
drawn yet, although the results seemingly indicate that learning shortest paths
could be an alternative to heuristic and unsupervised approaches. Future work
will address this question in greater detail.
References
1. U. Brefeld. Cost-based Ranking in Input Output Spaces. Proceedings of the Workshop on Learning from Non-vectorial Data, 2007.
2. I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods
for Structured and Interdependent Output Variables, Journal of Machine Learning
Research, 6 (Sep):1453-1484, 2005
3. R. Barzilay, L. Lee, Learning to Paraphrase: An Unsupervised Approach Using
Multiple-Sequence Alignment, in Proc. of NAACL-HLT, 2003.
4. K. Filippova. Multi-sentence compression: Finding shortest paths in word graphs,
COLING, 2010
5. J. Y. Yen, Finding the k Shortest Loopless Paths in a Network. Management Science
17 (11): 712–716, 1971
6. R. Bellman (1958). "On a routing problem". Quarterly of Applied Mathematics 16: 87–90. MR 0102435.
7. L. R. Ford Jr. (August 14, 1956). Network Flow Theory. Paper P-923. Santa Monica, California: RAND Corporation.
8. E. W. Dijkstra (1959). "A note on two problems in connexion with graphs". Numerische Mathematik 1: 269–271. doi:10.1007/BF01386390.
9. R. Barzilay and K. McKeown. Sentence Fusion for Multidocument News Summarization, Computational Linguistics, 2005.
10. F. Boudin and E. Morin. Keyphrase Extraction for N-best Reranking in MultiSentence Compression, Proceedings of the 2013 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013.
11. S. Wan, R. Dale, M. Dras, C. Paris. Global revision in summarisation: generating
novel sentences with Prim’s algorithm, Conference of the Pacific Association for
Computational Linguistics, 2007.
12. B. Taskar and D. Klein and M. Collins and D. Koller and C. Manning. Max-margin
parsing, Proc. EMNLP, 2004.
13. M. Collins and N. Duffy. New Ranking Algorithms for Parsing and Tagging: Kernels
over Discrete Structures, and the Voted Perceptron, ACM, 2002
14. Y. Altun, M. Johnson, T. Hofmann. Investigating loss functions and optimization
methods for discriminative learning of label sequences, Proc. EMNLP, 2003.
15. Princeton University ”About WordNet.” WordNet. Princeton University. 2010.
http://wordnet.princeton.edu
16. Leipzig Corpora Collection (LCC). Universität Leipzig, Institut für Informatik,
Abteilung Sprachverarbeitung. http://corpora.uni-leipzig.de/