DEEP BELIEF NETWORKS WITH APPLICATIONS
TO FAST DOCUMENT RETRIEVAL
CSC2515
Existing Methods
• One of the most popular and widely used algorithms in practice
for document retrieval is TF-IDF.
• TF-IDF weights each word by:
– its frequency in the query document (Term Frequency)
– the logarithm of the reciprocal of its frequency in the whole set of
documents (Inverse Document Frequency).
However, TF-IDF has several limitations:
– It computes document similarity directly in the word-count space,
which may be slow for large vocabularies.
– It assumes that the counts of different words provide independent
evidence of similarity.
– It makes no use of semantic similarities between words.
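• For concreteness, a minimal NumPy sketch of this weighting scheme (the normalization of term frequency and all names are illustrative choices, not taken from the slides):

```python
import numpy as np

def tfidf(counts):
    """counts: (n_docs, n_words) matrix of raw word counts.
    Weight each word by its within-document frequency times the log of the
    reciprocal of its document frequency."""
    n_docs = counts.shape[0]
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)              # number of documents containing each word
    idf = np.log(n_docs / np.maximum(df, 1))   # log of reciprocal document frequency
    return tf * idf

# Retrieval then ranks documents by similarity (e.g. cosine) directly in this
# weighted word-count space, which is what becomes slow for large vocabularies.
```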
Existing Methods
• To overcome these limitations, many models for capturing
low-dimensional, latent representations have been proposed.
• One such simple and widely-used method is Latent Semantic
Analysis (LSA).
• It extracts low-dimensional semantic structure using SVD to
get a low-rank approximation of the word-document
co-occurrence matrix:
  log(1 + M(doc, w)) ≈ U S V,
  where U is |doc| × d, S is d × d, and V is d × |w|.
• A test query q is represented as the d-dimensional vector S⁻¹ V log(1 + q).
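• A minimal sketch of this procedure with a truncated SVD (NumPy; the function names and the toy count matrix M are illustrative):

```python
import numpy as np

def lsa_fit(M, d):
    """Rank-d SVD of the log-transformed word-document co-occurrence matrix M (|doc| x |w|)."""
    X = np.log1p(M)                                   # log(1 + M(doc, w))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :d], np.diag(s[:d]), Vt[:d, :]        # U: |doc| x d, S: d x d, V: d x |w|

def lsa_project(q, S, V):
    """Represent a query count vector q as the d-dim vector S^{-1} V log(1 + q)."""
    return np.linalg.inv(S) @ V @ np.log1p(q)
```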
Drawbacks of Existing Methods
• LSA is a linear method so it can only capture pairwise
correlations between words.
• Numerous methods, in particular probabilistic versions of
LSA were introduced in the machine learning community.
• These models can be viewed as graphical models in which a
single layer of hidden topic variables has directed
connections to variables that represent word counts.
• There are limitations on the types of structure that can be
represented efficiently by a single layer of hidden variables.
• We will build a network with multiple hidden layers and with
millions of parameters and show that it can discover latent
representations that work much better.
RBMs revisited
• A joint configuration (v, h) has an energy:
  E(v, h) = − Σ_i b_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i h_j W_ij.
• The probability that the model assigns to v is:
  p(v) = Σ_h p(v, h) = (1/Z) Σ_h exp(−E(v, h)).
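• A small illustrative sketch of these quantities for a binary RBM (NumPy); enumerating all hidden configurations to compute Σ_h exp(−E(v, h)) is only feasible for toy models:

```python
import numpy as np
from itertools import product

def energy(v, h, b_v, b_h, W):
    """E(v, h) = - sum_i b_i v_i - sum_j b_j h_j - sum_{i,j} v_i h_j W_ij."""
    return -(b_v @ v) - (b_h @ h) - (v @ W @ h)

def unnormalized_p_v(v, b_v, b_h, W, n_hidden):
    """sum_h exp(-E(v, h)); dividing by Z over all v would give p(v)."""
    return sum(np.exp(-energy(v, np.array(h), b_v, b_h, W))
               for h in product([0, 1], repeat=n_hidden))
```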
RBMs for continuous data
• Hidden units remain binary.
• The visible units are replaced by linear stochastic units that
have Gaussian noise.
• The energy becomes:
  E(v, h) = Σ_i (v_i − b_i)² / (2σ_i²) − Σ_j b_j h_j − Σ_{i,j} (v_i / σ_i) h_j W_ij.
• Conditional distributions over hidden and visible units are:
  p(h_j = 1 | v) = 1 / (1 + exp(−b_j − Σ_i W_ij v_i / σ_i)),
  p(v_i = v | h) = (1 / (√(2π) σ_i)) exp( −(v − b_i − σ_i Σ_j h_j W_ij)² / (2σ_i²) ).
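• A sketch of sampling from these two conditionals (NumPy; the parameter shapes and the use of a random Generator are assumptions made for illustration):

```python
import numpy as np

def sample_h_given_v(v, b_h, W, sigma, rng):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i W_ij v_i / sigma_i)."""
    p = 1.0 / (1.0 + np.exp(-(b_h + (v / sigma) @ W)))
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, b_v, W, sigma, rng):
    """p(v_i | h) is Gaussian with mean b_i + sigma_i * sum_j h_j W_ij and std sigma_i."""
    mean = b_v + sigma * (W @ h)
    return rng.normal(mean, sigma)
```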
RBMs for count data
• Hidden units remain binary and the visible word counts are
modeled by the Poisson model.
• The energy is defined as:
  E(v, h) = − Σ_i b_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i h_j W_ij + Σ_i log v_i!.
• Conditional distributions over hidden and visible units are:
  p(h_j = 1 | v) = 1 / (1 + exp(−b_j − Σ_i W_ij v_i)),
  p(v_i = n | h) = Poisson( n, exp(b_i + Σ_j h_j W_ij) ),
  where Poisson(n, λ) = e^(−λ) λ^n / n!.
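• A sketch of the corresponding conditionals for the count model (NumPy; shapes and names are again illustrative):

```python
import numpy as np

def sample_h_given_v(v, b_h, W, rng):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i W_ij v_i)."""
    p = 1.0 / (1.0 + np.exp(-(b_h + v @ W)))
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, b_v, W, rng):
    """p(v_i = n | h) = Poisson(n, lambda_i) with lambda_i = exp(b_i + sum_j h_j W_ij)."""
    rate = np.exp(b_v + W @ h)
    return rng.poisson(rate)
```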
Learning Stacks of RBMs
• Perform greedy, layer-by-layer learning:
  – Learn and freeze W1 using the Poisson model.
  – Treat the existing feature detectors as if they were data.
  – Learn and freeze W2.
  – Greedily learn many layers.
• Each layer of features captures strong high-order correlations between the activities of units in the layer below.
[Figure: a stack of RBMs trained bottom-up: word counts → 2000 (W1) → 1000 (W2) → 500 (W3) → 30 (W4).]
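• A simplified sketch of the greedy stacking loop with CD-1 updates and binary units throughout (learning rate, epochs, and the mean-field reconstruction are illustrative choices, not the exact recipe from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05, seed=0):
    """Train one binary-binary RBM with CD-1; returns weights and hidden biases."""
    rng = np.random.default_rng(seed)
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        v1 = sigmoid(h0 @ W.T + b_v)                 # mean-field reconstruction
        p_h1 = sigmoid(v1 @ W + b_h)
        W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(data)
        b_v += lr * (v0 - v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_h

def train_stack(data, layer_sizes):
    """Greedily train RBMs: freeze each layer, then treat its features as data."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(x, n_hidden)
        weights.append((W, b_h))                     # freeze this layer
        x = sigmoid(x @ W + b_h)                     # features become the next layer's data
    return weights
```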
20 newsgroup: 2-D topic space
[Figure: autoencoder 2-D topic space and LSA 2-D topic space for the 20-newsgroup corpus; points are coloured by newsgroup, including talk.politics.mideast, rec.sport.hockey, talk.religion.misc, sci.cryptography, misc.forsale, and comp.graphics.]
• The 20-newsgroup corpus contains 18,845 postings (11,314
training and 7,531 test) taken from Usenet newsgroups.
• We use a 2000-500-250-125-10 autoencoder to convert a
document into a low-dimensional code.
• We used a simple “bag-of-words” representation.
Reuters Corpus: 2-D topic space
[Figure: autoencoder 2-D topic space and LSA 2-D topic space for the Reuters corpus; labelled clusters include Interbank Markets, European Community Monetary/Economic, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Accounts/Earnings, Legal/Judicial, and Government Borrowings.]
• We use a 2000-500-250-125-2 autoencoder to convert test
documents into a two-dimensional code.
• The Reuters Corpus Volume II contains 804,414 newswire
stories (randomly split into 402,207 training and 402,207 test).
• We used a simple “bag-of-words” representation.
Results for 10-D codes
• We use the cosine of the angle between two codes as a
measure of similarity.
• Precision-recall curves when the 10-D code of a query document from the
test set is used to retrieve other test-set documents, averaged
over all 402,207 possible queries.
[Figure: precision (%) vs. recall (%) curves for Autoencoder 10D, Autoencoder 10D prior to fine-tuning, LSA 10D, and LSA 50D; the recall axis runs from 0.1% to 100% on a log scale.]
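• A minimal sketch of ranking by the cosine of the angle between codes (NumPy; `codes` would hold the low-dimensional codes of the test documents):

```python
import numpy as np

def retrieve(query_code, codes, top_k=10):
    """Rank documents by cosine similarity between their low-dimensional codes."""
    q = query_code / np.linalg.norm(query_code)
    C = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    return np.argsort(-(C @ q))[:top_k]
```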
Semantic Hashing
[Figure: the semantic hashing network. Recursive pretraining learns a stack of RBMs on the 2000-dimensional bag-of-words input (2000 → 500 via W1, 500 → 500 via W2, 500 → 20-unit code layer via W3); unrolling produces a deep autoencoder whose decoder uses the transposed weights; fine-tuning adjusts all weights (W_i + ε_i) with Gaussian noise fed into the code layer.]
• Learn to map documents into semantic 20-bit binary codes and
use these codes as memory addresses.
• We then have the ultimate retrieval tool: given a query document,
compute its 20-bit address and retrieve all of the documents
stored at similar addresses, with no search at all.
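• A sketch of this address-based lookup: store each document at its 20-bit code and, at query time, probe every address within a small Hamming radius of the query's code (the table structure and the radius are illustrative assumptions):

```python
from collections import defaultdict
from itertools import combinations

def build_table(doc_codes):
    """doc_codes: iterable of (doc_id, code) with code a 20-bit integer address."""
    table = defaultdict(list)
    for doc_id, code in doc_codes:
        table[code].append(doc_id)
    return table

def probe(table, query_code, n_bits=20, radius=2):
    """Return documents stored at addresses within `radius` bit flips of the query code."""
    hits = list(table.get(query_code, []))
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            addr = query_code
            for b in bits:
                addr ^= 1 << b                       # flip bit b
            hits.extend(table.get(addr, []))
    return hits
```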
The Main Idea of Semantic Hashing
[Figure: a document is mapped by a function f to an address in memory, so that semantically similar documents end up stored at nearby addresses.]
Semantic Hashing
[Figure, left: 2-D embedding of the learned 20-bit codes for Reuters, with labelled clusters such as European Community Monetary/Economic, Disasters and Accidents, Government Borrowing, Energy Markets, and Accounts/Earnings. Right: precision (%) vs. recall (%) curves for TF-IDF, TF-IDF using 20 bits, and Locality Sensitive Hashing.]
• Left picture shows a 2-dimensional embedding of the learned
20-bit codes using stochastic neighbor embedding.
• Right picture shows Precision-Recall curves when a query
document from the test set is used to retrieve other test set
documents, averaged over all 402,207 possible queries.
Semantic Hashing
• We used a simple C implementation on the Reuters dataset
(402,207 training and 402,207 test documents).
• For a given query, it takes about 0.5 milliseconds to create a
short-list of about 3,000 semantically similar documents.
• It then takes 10 milliseconds to retrieve the top few matches
from that short-list using TF-IDF, and this two-stage procedure
is more accurate than applying TF-IDF to the full document set.
Learning nonlinear embedding
• Learning a similarity measure over the input space X.
• Given a distance metric D (e.g. Euclidean), we can measure the
similarity between two input vectors x_n, x_k ∈ X by
computing D[f(x_n|W), f(x_k|W)].
• “Push-Pull” idea: pull points belonging to the same class
together; push points belonging to different classes apart.
[Figure: two copies of the same network (2000 → 500 via W1, 500 → 500 via W2, 500 → 10 via W3) map inputs x1 and x2 to codes, and the distance d[f(x1), f(x2)] is computed between the two codes.]
Learning Nonlinear NCA
• We are given a set of N labeled training cases (x_n, c_n),
n = 1, 2, ..., N, where x_n ∈ R^d and c_n ∈ {1, 2, ..., C}.
• For each training vector xn, define the probability that point n
selects one of its neighbours k as:
  p_nk = exp(−D_nk) / Σ_{z≠n} exp(−D_nz),   with p_nn = 0,
  where D_nk = ||f(x_n|W) − f(x_k|W)||² and f(·|W) is a multi-layer perceptron.
• Previous algorithms: a simple linear projection f(x|W) = W x.
• The Euclidean distance in the projected space is then a Mahalanobis distance:
  D[f(x_n), f(x_k)] = (x_n − x_k)ᵀ WᵀW (x_n − x_k).
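• A sketch of these neighbour-selection probabilities for an already-embedded set of points (NumPy; `F` is an assumed name for the matrix of f(x_n|W) values):

```python
import numpy as np

def neighbour_probs(F):
    """F: (N, d) embedded points. Returns P with
    P[n, k] = exp(-D_nk) / sum_{z != n} exp(-D_nz) and P[n, n] = 0."""
    D = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)   # squared distances D_nk
    expD = np.exp(-D)
    np.fill_diagonal(expD, 0.0)                                 # p_nn = 0
    return expD / expD.sum(axis=1, keepdims=True)
```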
Learning Nonlinear NCA
• Probability that point n belongs to class a is:
  p(c_n = a) = Σ_{k: c_k = a} p_nk.
• Maximize the expected number of correctly classified points
on the training data:
  O_NCA = (1/N) Σ_{n=1}^{N} Σ_{k: c_k = c_n} p_nk.
• By restricting f(x|W) to a simple linear projection we arrive at linear NCA.
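• A self-contained sketch of this objective (NumPy; `F` are the embedded points and `c` the class labels, both illustrative names):

```python
import numpy as np

def nca_objective(F, c):
    """O_NCA = (1/N) sum_n sum_{k: c_k = c_n} p_nk for embedded points F (N, d)
    and integer class labels c (N,)."""
    D = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    expD = np.exp(-D)
    np.fill_diagonal(expD, 0.0)                     # p_nn = 0
    P = expD / expD.sum(axis=1, keepdims=True)
    same_class = (c[:, None] == c[None, :])
    return (P * same_class).sum() / len(c)
```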
2-D codes
[Figure: 2-D codes for 20-newsgroup documents; clusters correspond to atheism, religion.christian, sci.cryptography, sci.space, rec.hockey, rec.autos, comp.windows, and comp.hardware.]
Regularized Nonlinear NCA
[Figure: a stack of RBMs (weights W1–W4, with layers of 2000, 500, and 30 units) is pretrained and unrolled into an encoder-decoder network; during fine-tuning all weights are adjusted (W_i + ε_i), the decoder reconstruction contributes the (1 − λ)·E term, and the 30-D code at the top of the encoder contributes the λ·NCA term.]
• The NCA objective O_NCA is combined with the autoencoder
reconstruction error E to maximize:
  C = λ O_NCA + (1 − λ)(−E).
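• A one-line sketch of the combined criterion (the value of λ here is purely illustrative, not the setting used in the slides):

```python
def regularized_objective(o_nca, recon_error, lam=0.99):
    """C = lambda * O_NCA + (1 - lambda) * (-E); maximized during fine-tuning."""
    return lam * o_nca - (1.0 - lam) * recon_error
```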
Results
[Figure, left: error (%) vs. number of nearest neighbours (1, 3, 5, 7) for Nonlinear NCA 30D, Linear NCA 30D, Autoencoder 30D, and PCA 30D. Right: accuracy (%) vs. number of retrieved documents (1 to 7,531) for Nonlinear NCA + Auto 10D, Nonlinear NCA 10D, Linear NCA 10D, Autoencoder 10D, LDA 10D, and LSA 10D.]
• Left: K nearest neighbours results in the 30-D space on the
MNIST test set.
• Right: Precision-recall curves for 20 newsgroup test data.
THE END