
Latent Semantic Indexing
Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata
Vector space model
Recall
              d1     d2     d3     d4     d5   |  q
car           0.5    0      0      0      0    |  1
automobile    0.2    0.8    0      0.2    0    |  0
engine        0.7    0.6    0.9    0      0.5  |  0
search        0      0      0.7    0.5    0.8  |  0

Cosine similarity qᵀd / (‖q‖ ‖d‖) of q with d1, …, d5:  0.7   0   0   0   0

Each term represents a dimension
Documents are vectors in the term space
Term-document matrix: a very sparse matrix
Entries are scores of the terms in the documents (Boolean → count → weight)
Query is also a vector in the term space
What is the problem?
car ~ automobile
But in the vector space model each term is a different dimension
Vector similarity: cosine of the angle between the vectors
sim_cos(q, d) = cos(θ(q, d)) = qᵀd / (‖q‖ ‖d‖)
Synonyms in different dimensions
(Same term-document matrix as on the previous slide: terms car, automobile, engine, search; documents d1–d5 and query q.)

[Figure: the query q and the documents d1, d2 drawn as vectors in the term space spanned by car, automobile and engine]

Car and automobile are synonyms
But they are different dimensions in the vector space model
Same situation for terms belonging to similar concepts
Goal: can we map synonyms (similar concepts) to the same dimensions automatically?
Linear algebra review
 Rank of a matrix: number of linearly independent
columns (or rows)
 If A is an m × n matrix, rank(A) ≤ min(m, n)
Rank of

              d1    d2    d3    d4    d5
car           1     2     0     0     1
automobile    1     2     0     0     0
engine        1     2     1     0.2   0
search        0     0     1     0.2   0.8

= ?
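A quick way to check the answer numerically (a minimal sketch added here, not part of the slides; it uses numpy and the matrix shown above):

```python
import numpy as np

# Term-document matrix from the slide (rows: car, automobile, engine, search)
A = np.array([
    [1, 2, 0, 0,   1  ],
    [1, 2, 0, 0,   0  ],
    [1, 2, 1, 0.2, 0  ],
    [0, 0, 1, 0.2, 0.8],
])

# d2 = 2 * d1 and d4 = 0.2 * d3, so only three columns are linearly independent
print(np.linalg.matrix_rank(A))   # -> 3
```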
Linear algebra review
 A square matrix M is called orthogonal if its rows and columns
are orthogonal unit vectors (orthonormal vectors)
– Each column (row) has norm 1
– Any two columns (rows) have dot product 0
 For a square matrix A, if there is a non-zero vector v such that
Av = λv
for some scalar λ, then v is called an eigenvector of A and
λ is the corresponding eigenvalue
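A small numerical illustration of both definitions (a sketch added here, not from the slides; a 2-D rotation matrix serves as the orthogonal matrix, and the symmetric matrix A is arbitrary):

```python
import numpy as np

# An orthogonal matrix: a rotation by 45 degrees
theta = np.pi / 4
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(M.T @ M, np.eye(2)))    # True: columns (and rows) are orthonormal

# Eigenvectors: A v = lambda v
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))        # True
```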
Singular value decomposition
If A is an m × n matrix with rank r
Then there exists a factorization of A as
A = U Σ Vᵀ
where U (m × m) and V (n × n) are orthogonal, and Σ (m × n) is a diagonal-like matrix
Σ = (σij), where σii = σi for i = 1, …, r are the singular values of A; all other entries of Σ are zero
σ1 ≥ σ2 ≥ … ≥ σr > 0
Columns of U are the left singular vectors of A, columns of V are the right singular vectors
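A minimal sketch of the factorization with numpy (not from the slides; the matrix is the small term-document example used earlier):

```python
import numpy as np

A = np.array([
    [0.5, 0,   0,   0,   0  ],   # car
    [0.2, 0.8, 0,   0.2, 0  ],   # automobile
    [0.7, 0.6, 0.9, 0,   0.5],   # engine
    [0,   0,   0.7, 0.5, 0.8],   # search
])

U, sigma, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, sigma, Vt.shape)           # (4, 4), singular values in decreasing order, (5, 5)

# Rebuild A = U Sigma V^T, where Sigma is the m x n diagonal-like matrix
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, sigma)
print(np.allclose(A, U @ Sigma @ Vt))     # True
```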
Singular value decomposition
[Figure: A (m × n) = U (m × m) · Σ (m × n) · Vᵀ (n × n), with σ1, …, σr on the diagonal of Σ and zeros elsewhere; keeping only the first r singular vectors gives the reduced form A = U (m × r) · Σ (r × r) · Vᵀ (r × n)]
Matrix diagonalization for a symmetric matrix
If A is an m × n matrix with rank r
Consider C = AAᵀ (an m × m symmetric matrix). Then:
C has rank r
C = AAᵀ = U Σ² Uᵀ, where Σ² is a diagonal matrix with entries σi², for i = 1, …, r
Columns of U are the eigenvectors of C
σi² are the corresponding eigenvalues of C
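A quick numerical check of these claims (a sketch added here, not from the slides; the matrix is random):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 6))
U, sigma, Vt = np.linalg.svd(A)
C = A @ A.T

# Eigenvalues of C are the squared singular values of A
print(np.allclose(np.sort(np.linalg.eigvalsh(C))[::-1], sigma**2))    # True

# Each column u_i of U is an eigenvector of C: C u_i = sigma_i^2 u_i
for i in range(len(sigma)):
    print(np.allclose(C @ U[:, i], sigma[i]**2 * U[:, i]))            # True
```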
SVD of the term-document matrix
A = [d1, …, dn]
Documents are vectors in the m-dimensional term space
But we would expect far fewer concepts to be associated with the collection
m terms, k concepts, k << m
Ignore all but the first k singular values and singular vectors:
A_k = U_k Σ_k V_kᵀ,  with U_k (m × k), Σ_k (k × k), V_kᵀ (k × n)
A_k is a low-rank approximation of A
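A minimal sketch of the rank-k approximation A_k = U_k Σ_k V_kᵀ (not from the slides; it reuses the small example matrix from earlier):

```python
import numpy as np

def low_rank_approximation(A, k):
    """Keep only the k largest singular values/vectors of A."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

A = np.array([
    [0.5, 0,   0,   0,   0  ],   # car
    [0.2, 0.8, 0,   0.2, 0  ],   # automobile
    [0.7, 0.6, 0.9, 0,   0.5],   # engine
    [0,   0,   0.7, 0.5, 0.8],   # search
])

A2 = low_rank_approximation(A, k=2)
print(A2.shape)                      # (4, 5): columns are still m-dimensional vectors
print(np.linalg.matrix_rank(A2))     # 2
```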
Low-rank approximation
Computationally, the columns of A_k are still m-dimensional vectors
But A_k has rank k
Now compute the cosine similarity with the query q
Retrieval in the concept space
 Retrieval in the term space (cosine): both q and d are m-dimensional vectors (m = # of terms)
sim_term(q, d) = sim_cos(q, d) = qᵀd / (‖q‖ ‖d‖)
 Term space (m) → concept space (k)
 Use the first k singular vectors
Query: q → U_kᵀ q   ((k × m)(m × 1) = k × 1)
Document: d → U_kᵀ d   ((k × m)(m × 1) = k × 1)
 Cosine similarity in the concept space:
sim_concept(q, d) = sim_cos(U_kᵀ q, U_kᵀ d) = (U_kᵀ q) · (U_kᵀ d) / (‖U_kᵀ q‖ ‖U_kᵀ d‖)
 Other variants: map using (U_k Σ_k)ᵀ
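A minimal sketch of retrieval in the concept space (not from the slides; it reuses the small example matrix, a query for "car", and k = 2 — with such a mapping, d2, which contains "automobile" but not "car", typically receives a nonzero score):

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

A = np.array([
    [0.5, 0,   0,   0,   0  ],   # car
    [0.2, 0.8, 0,   0.2, 0  ],   # automobile
    [0.7, 0.6, 0.9, 0,   0.5],   # engine
    [0,   0,   0.7, 0.5, 0.8],   # search
])
q = np.array([1.0, 0, 0, 0])     # query: "car"

k = 2
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]                    # m x k

q_concept = Uk.T @ q             # k-dimensional query
D_concept = Uk.T @ A             # k x n document representations

for j in range(A.shape[1]):
    print(f"d{j+1}: term space {cosine(q, A[:, j]):.2f}, "
          f"concept space {cosine(q_concept, D_concept[:, j]):.2f}")
```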
How to find the optimal low-rank?
 The choice of k is primarily intuitive
– Assumption that a document collection has exactly k concepts
– No systematic method to find the optimal k
– Experimental results are not very consistent
Bast & Majumdar, SIGIR 2005
HOW DOES LSI WORK?
Spectral retrieval – general framework
Term-document matrix: A (m × n), m terms, n documents; query q (m × 1)
→ cosine similarities in the term space
Singular value decomposition (SVD): A (m × n) = U (m × r) · Σ (r × r) · Vᵀ (r × n)
U_k = first k columns of U
Dimension reduction to the concept space: L = U_kᵀ (k × m)
L·A (k × n): k concepts, n documents;  L·q (k × 1): the query in the concept space
→ cosine similarities in the concept space
LSI and other LSI-based retrieval methods are called "spectral retrieval"
Spectral retrieval as document "expansion"

A 0-1 expansion matrix ("add car if auto is present"):

          car  auto  engine  search
car        1    1      0       0
auto       1    1      0       0
engine     0    0      1       0
search     0    0      0       1

Multiplying the expansion matrix with a document d = (0, 1, 1, 0)ᵀ (auto and engine present) gives the expanded document (1, 1, 1, 0)ᵀ: car is added because auto is present
Spectral retrieval as document "expansion"
LSI expansion matrix U_2 U_2ᵀ, projecting to 2 dimensions:

          car    auto   engine  search
car       0.29   0.36   0.25   -0.12
auto      0.36   0.44   0.30   -0.17
engine    0.25   0.30   0.44    0.30
search   -0.12  -0.17   0.30    0.84

(U_2ᵀ, the first two left singular vectors:  0.42 0.51 0.66 0.37  and  0.33 0.43 -0.08 -0.84)

Expansion matrix · d, with d = (0, 1, 1, 0)ᵀ ("add car if auto is present"), gives (0.61, 0.74, 0.74, 0.13)ᵀ

 Ideal expansion matrix should have
– high scores for intuitively related terms
– low scores for intuitively unrelated terms
The expansion matrix depends heavily on the subspace dimension!
Why document "expansion"
LSI expansion matrix U_3 U_3ᵀ, projecting to 3 dimensions:

          car    auto   engine  search
car       0.93  -0.12   0.20   -0.11
auto     -0.12   0.80   0.34   -0.18
engine    0.20   0.34   0.44    0.30
search   -0.11  -0.18   0.30    0.84

(U_3ᵀ, the first three left singular vectors:  0.42 0.51 0.66 0.37,  0.33 0.43 -0.08 -0.84,  -0.80 0.59 0.06 -0.01)

Expansion matrix · d, with d = (0, 1, 1, 0)ᵀ, gives (0.08, 1.13, 0.78, 0.12)ᵀ: now car is barely added

 Ideal expansion matrix should have
– high scores for intuitively related terms
– low scores for intuitively unrelated terms
The expansion matrix depends heavily on the subspace dimension!
Finding the optimal number of dimensions k remained an open problem
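A minimal sketch of the expansion-matrix view (not from the slides; the 4 × 4 term-document matrix here is an assumed toy example, so the numbers will not match those on the slides, but the dependence on k is the same effect):

```python
import numpy as np

# Assumed toy term-document matrix (rows: car, auto, engine, search)
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
], dtype=float)
d = np.array([0.0, 1.0, 1.0, 0.0])    # document containing auto and engine

U, sigma, Vt = np.linalg.svd(A)
for k in (2, 3):
    Uk = U[:, :k]
    T = Uk @ Uk.T                     # m x m expansion matrix U_k U_k^T
    print(f"k={k}: expanded d =", np.round(T @ d, 2))
# The "car" entry of the expanded document changes considerably with k,
# which is the dimension dependence illustrated on these slides.
```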
Relatedness Curves
 How the entries in the expansion matrix depend on the dimension k of the subspace
 Plot the (i, j)-th entry of the expansion matrix T = LᵀL = U_k U_kᵀ against the dimension k
T_ij(k) = U_i1·U_j1 + … + U_ik·U_jk
 i.e. the cumulative dot product of the i-th and j-th rows of U (the matrix of left singular vectors)
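A minimal sketch of computing such a curve (not from the slides; the matrix here is random, and relatedness_curve is a name introduced only for this example):

```python
import numpy as np

def relatedness_curve(U, i, j):
    """T_ij(k) for k = 1, ..., r: cumulative dot product of rows i and j of U."""
    return np.cumsum(U[i, :] * U[j, :])

rng = np.random.default_rng(0)
A = rng.random((6, 10))
U, _, _ = np.linalg.svd(A, full_matrices=False)

print(np.round(relatedness_curve(U, 0, 1), 3))   # one entry per subspace dimension k
```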
Types of Relatedness Curves
Three main types
[Figure: three relatedness curves (expansion matrix entry vs. subspace dimension, 0–600): logic / logics, node / vertex, logic / vertex]
No single dimension is appropriate for all term pairs
But the shape of the curve indicates the term-term relationship!
Curves for related terms
 We call two terms perfectly related if they have an identical co-occurrence pattern
[Figure: example co-occurrence matrices for two terms (identical, slightly perturbed, more perturbed) and the corresponding relatedness curves; shape: up, then down]
– Proven shape for perfectly related terms
– Provably small change after a slight perturbation
– More perturbation
The point of fall-off is different for every term pair, and we can calculate it
Curves for unrelated terms
 Co-occurrence graph:
– terms are vertices
– edge between two terms if they co-occur
 We call two terms perfectly unrelated if no path connects them in the graph
[Figure: relatedness curves for perfectly unrelated terms, with slight and then more perturbation]
– Proven shape for perfectly unrelated terms
– Provably small change after a slight perturbation
– More perturbation
Curves for unrelated terms randomly oscillate around zero
TN: the Non-negativity Test
1. Normalize the term-document matrix so that the theoretical point of fall-off is the same for all term pairs
2. Discard the parts of the curves after this point
3. For each term pair: if the curve is never negative before this point, set the entry in the expansion matrix to 1, otherwise to 0
[Figure: three example curves — related terms (set entry to 1), related terms (set entry to 1), unrelated terms (set entry to 0)]
A simple 0-1 classification, which produces a sparse expansion matrix!
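A rough sketch of the test (not from the slides or the paper; the normalization step is omitted and the common fall-off point k0 is simply passed in as a parameter, so this only illustrates the idea):

```python
import numpy as np

def tn_expansion_matrix(A, k0):
    """0-1 expansion matrix: entry (i, j) = 1 iff the relatedness curve of
    terms i and j never goes negative before dimension k0."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    m = A.shape[0]
    T = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            curve = np.cumsum(U[i, :k0] * U[j, :k0])
            T[i, j] = 1.0 if curve.min() >= 0 else 0.0
    return T

rng = np.random.default_rng(0)
A = rng.random((5, 8))
print(tn_expansion_matrix(A, k0=4))
```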
TS: the Smoothness Test
1. Again, discard the part of the curves after the theoretical point of fall-off (the same for every term pair, after normalization)
2. For each term pair compute the smoothness of its curve (= 1 if very smooth, → 0 as the number of turns increases)
3. If the smoothness is above some threshold, set the entry in the expansion matrix to 1, otherwise to 0
[Figure: three example curves with smoothness scores 0.69 and 0.82 (related terms, set entry to 1) and 0.07 (unrelated terms, set entry to 0)]
Again, a 0-1 classification, which produces a sparse expansion matrix!
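A rough sketch of this test as well (not from the slides or the paper; here "smoothness" is approximated by the fraction of steps at which the curve does not change direction, which is only one plausible reading of "number of turns"):

```python
import numpy as np

def smoothness(curve):
    """Close to 1 for a smooth curve, closer to 0 as the number of turns grows."""
    diffs = np.diff(curve)
    turns = np.sum(np.sign(diffs[1:]) != np.sign(diffs[:-1]))
    return 1.0 - turns / max(len(diffs) - 1, 1)

def ts_expansion_matrix(A, k0, threshold=0.8):
    """0-1 expansion matrix: entry (i, j) = 1 iff the relatedness curve of
    terms i and j is smooth enough before dimension k0."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    m = A.shape[0]
    T = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            curve = np.cumsum(U[i, :k0] * U[j, :k0])
            T[i, j] = 1.0 if smoothness(curve) >= threshold else 0.0
    return T

rng = np.random.default_rng(0)
A = rng.random((5, 8))
print(ts_expansion_matrix(A, k0=4))
```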
Experimental results

Average precision:

Collection                           COS     LSI*    LSI-RN*  CORR*   IRR*    TN      TS
Time (425 docs, 3882 terms)          63.2%   62.8%   58.6%    59.1%   62.2%   64.9%   64.1%
Reuters (21578 docs, 5701 terms)     36.2%   32.0%   37.0%    32.3%   ——      41.9%   42.9%
Ohsumed (233445 docs, 99117 terms)   13.2%   6.9%    13.0%    10.9%   ——      14.4%   15.3%

COS: baseline, cosine similarity in the term space
LSI: Latent Semantic Indexing (Dumais et al. 1990)
LSI-RN: Term-normalized LSI (Ding et al. 2001)
CORR: Correlation-based LSI (Dupret et al. 2001)
IRR: Iterative Residual Rescaling (Ando & Lee 2001)
TN: non-negativity test
TS: smoothness test
* the numbers for LSI, LSI-RN, CORR, IRR are for the best subspace dimension!
Asymmetric term-term relations
[Figure: relatedness curves fruit – fruit, apple – apple and fruit – apple plotted against the dimension k]
Related terms: fruit – apple
 Until some dimension k' the curve fruit – apple is above the curve apple – apple
 Until dimension k' apple is more related to fruit than to apple itself
 Asymmetric relation: fruit is more general than apple
Bast, Dupret, Majumdar & Piwowarski, 2006
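A minimal sketch of detecting the asymmetry (not from the slides; curve and more_general are names introduced only here, and the cut-off k_max stands in for the dimension k' on the slide):

```python
import numpy as np

def curve(U, i, j):
    """Relatedness curve: cumulative dot product of rows i and j of U."""
    return np.cumsum(U[i, :] * U[j, :])

def more_general(U, i, j, k_max):
    """True if, up to dimension k_max, the curve (i, j) stays above the curve (j, j),
    i.e. term j is more related to term i than to itself -> i is the more general term."""
    return bool(np.all(curve(U, i, j)[:k_max] >= curve(U, j, j)[:k_max]))

rng = np.random.default_rng(0)
A = rng.random((6, 10))
U, _, _ = np.linalg.svd(A, full_matrices=False)
print(more_general(U, 0, 1, k_max=3))
```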
Examples
More general -- Less general

Fruit -- Apple
Car -- Opel
Space -- Solar
Restaurant -- Dish
India -- Gandhi
Fashion -- Trousers
Restaurant -- Waiter
Metal -- Zinc
Sweden -- Stockholm
India -- Delhi
Church -- Priest
Opera -- Vocal
Metal -- Aluminum
Fashion -- Silk
Saudi -- Sultan
Fish -- Shark
Sources and Acknowledgements
 IR Book by Manning, Raghavan and Schuetze:
http://nlp.stanford.edu/IR-book/
 Bast and Majumdar: Why spectral retrieval works.
SIGIR 2005
– Some slides are adapted from the talk by Hannah Bast