Finding Meaning in the Chaos
Establishing Corpus Size Limits for Unsupervised Semantic Linking

Gary Munnelly (1), Alexander O'Connor (2), Jennifer Edmond (3), Séamus Lawless (1)

1 School of Computer Science and Statistics, Trinity College, Dublin, Ireland
2 School of Computing, Dublin City University, Dublin, Ireland
3 Long Room Hub, Trinity College, Dublin, Ireland

DARIAH, 2015
Background

- Examined the effects of cultural trauma on a society [1]
- Attempted to measure changes in the meaning of metaphors due to cultural trauma

[1] Cultural trauma describes a sudden shift in societal views due to a significant event like the Holocaust.
About a year ago, I was invited to help out on a project called
SPeCTReSS. SPeCTReSS was interested in examining the effects of
cultural trauma on a society. For our part, we were interested in
investigating how cultural trauma might be reflected in the literary works
of a society. How would the meaning behind certain common metaphors
and images change? Would these changes transcend national borders?
The challenge was to train a system that could recognise and measure
shifts in the meaning behind particular metaphors or images (for example,
the swastika) in response to these traumatic events.
Metaphors: Direct Word to Word Association
[Figure: example word-association graphs for "Swastika" built from 1900 and 1950 texts. In the 1900 model the swastika is linked to terms such as Buddhism, Hinduism, Peace and Hope; in the 1950 model it is linked to Horror, Evil, Holocaust and Nazi, with a similarity weight on each edge.]
This could possibly be achieved with Word2Vec
We figured there were probably two approaches we could take to
analysing these metaphors. The first would be to attempt to establish
direct one-to-one relationships between words that were considered
semantically equivalent. Where we saw a strong bond between, say, a
noun and an adjective, we might be able to say that their use was
roughly synonymous in the corpus, suggesting that the noun was a
metaphor for the adjective.
So, as we see here, we have a (fictitious) model trained on works from
1900 and 1950. Due to the different time periods (and the cultural
trauma that was the holocaust happening in the middle), the meaning of
this image has changed drastically with time.
The algorithm that you could use to build these relationships is called
Word2Vec, which was developed by Google. It's a very powerful algorithm
that learns vector representations of words, placing words it considers
semantically similar close together. What's incredible about it is that these
vectors are additive and subtractive, so if we have a vector that links "king"
and "queen", Word2Vec knows that it can get from king to queen by
subtracting the vector for man and adding the vector for woman.
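As a concrete illustration (not the pipeline from the project itself), here is a minimal sketch of that analogy using gensim's Word2Vec implementation, the library listed in the reading slide; the toy sentences are placeholders and far too small to learn anything meaningful.

    # Minimal sketch only: the corpus here is a stand-in; a real experiment
    # would train on the full literary collection.
    from gensim.models import Word2Vec

    # Hypothetical pre-tokenised corpus: one list of tokens per sentence.
    sentences = [
        ["the", "king", "ruled", "the", "kingdom"],
        ["the", "queen", "ruled", "the", "kingdom"],
        ["the", "man", "walked", "through", "the", "city"],
        ["the", "woman", "walked", "through", "the", "city"],
        # ... many more sentences are needed for meaningful vectors
    ]

    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

    # Vector arithmetic: king - man + woman should land near "queen"
    # (only once the model has seen enough text).
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))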
Metaphors: Topic Modelling
[Figure: example topic groupings for 1900 and 1950. In the 1900 model the swastika clusters with terms such as Hope, Peace, Hinduism and Buddhism; in the 1950 model it clusters with Holocaust, Horror, Evil and Nazi, each term carrying a weight for its strength within the topic.]
Might be done using Latent Semantic Analysis or Latent Dirichlet Allocation
Another approach might have been to try to group words under broad
topic headings. Words that were strongly related to the same topic could
be considered to describe the same subject. If we can decipher what that
subject is, then we may be able to get a sense of what images these
metaphors are supposed to evoke. It's crude, but easier to train than
Word2Vec, so in some cases, e.g. where we might not have access to a
lot of text, it could be considered more desirable.

Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis
(LSA)/Latent Semantic Indexing (LSI) are two methods by which we
could achieve this. We'll talk a bit more about those later.
Background

- Examined the effects of cultural trauma on the literary works of a society [1]
- Attempted to measure changes in the meaning of metaphors due to cultural trauma
- Models produced were too unstable due to a lack of text
- How much text do we need?

[1] Cultural trauma describes a sudden shift in societal views due to a significant event like the Holocaust.
So, we ran the experiment and we actually got some really cool results.
Kaa (the snake from The Jungle Book) got linked to the Devil in the
Bible, "Hitler" got linked to the word "shit", which is a result I'll treasure
forever. Unfortunately, it turned out that this was a pure fluke. The models
were unstable and when we tried to reproduce those connections in
subsequent training runs, we weren't able to do it. Essentially, the
literature didn't reflect these relationships at all, which was very
disappointing.

So, of course, we asked ourselves why this might have happened. We
thought it might have been due to us not having enough text to work
with. Relatively speaking, the collections we used were very small, which
is probably why we didn't see anything particularly useful coming out of
it. So, of course, that raised the question: how much text do we need?
Overview

1. Introduction
2. LSA and LDA: An Overview
3. Methodology
4. Results
5. Conclusion
Topic Modelling
So, a very quick introduction to LSA and LDA.

Assume that we have a very large corpus of documents. We don't know
what they're about, but we want to find some way to group them such
that documents which refer to the same subject are stored together. In
fact, maybe we don't even understand the language in which they're
written. So how do we establish whether or not two documents are about
the same subject?
Topic Modelling
¡Hola! Me llamo Pedro. Soy de
España. Vivo en una casa con mi
madre, mi padre y mis tres perros.
Soy Juan. Soy de Madrid. Tengo
una madre y un padre, pero no
tengo ninguna mascota.
The documents contain several of the same words which would suggest that they probably
discuss the same topic
What we could do is look at the structure of the documents. What words
do we see commonly occurring together? Might this suggest that these
words are in some way semantically related, or perhaps discuss a similar
subject?
In this case, we see quite a few parallels between these two toy documents
– Madre, Padre and the phrase “soy de”. Without knowing what these
words actually mean, we can make a fairly decent guess that these two
texts are probably discussing the same thing, and indeed they are.
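To make that intuition concrete, here is a tiny Python sketch (not part of the original work) that counts the word types shared by the two toy documents and normalises by the union, which is exactly the Jaccard idea that resurfaces later in the stability measure.

    # A toy sketch: without understanding Spanish we can still count how many
    # word types the two documents share, and normalise by the union.
    doc1 = "hola me llamo pedro soy de espana vivo en una casa con mi madre mi padre y mis tres perros"
    doc2 = "soy juan soy de madrid tengo una madre y un padre pero no tengo ninguna mascota"

    words1, words2 = set(doc1.split()), set(doc2.split())
    shared = words1 & words2
    jaccard = len(shared) / len(words1 | words2)

    print(shared)              # e.g. {'soy', 'de', 'una', 'madre', 'padre', 'y'}
    print(round(jaccard, 3))   # the higher this is, the more the documents overlap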
LSA and LDA

Latent Semantic Analysis
- Represents documents as a matrix of unique terms and paragraphs
- Reduces the matrix by singular value decomposition
- Reconstructs an approximation of the original matrix using just two dimensions of the factorised matrices

Latent Dirichlet Allocation
- Represents documents as groupings of topic distributions
- Topic distributions produce words with a given probability
- If the structure of the document reflects the topic distribution, then it can be said to be about that topic
Both are unsupervised topic modelling algorithms.
LSA constructs a matrix that represents your documents, with rows being
unique terms and columns being paragraphs. This matrix is reduced
using singular value decomposition to produce three matrices which,
when combined, will perfectly reconstruct the original matrix. What LSA
does, however, is recombine those matrices across only two dimensions,
collapsing the original matrix. You essentially blur the lines between words,
establishing relationships. From the result, we can derive our topics.

LDA represents documents as being comprised of groups of topics. In
this instance, a topic is a probability distribution which produces words.
So, for example, the topic that represents Nazis will produce words like
Reich, Hitler, Holocaust and genocide with a high degree of probability.
Where the structure of a document reflects this distribution, we can say
that the document is likely related to that subject.
One key thing to note is that both of these algorithms are unsupervised.
We can set a few initial parameters, but when they begin running we
have no input. One number we can set is the number of topics that the
algorithms are looking for. We can’t define what those topics will be, but
we can specify how many of them we want.
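For readers who want to try this, a minimal gensim sketch of both models might look like the following; the documents, dictionary settings and topic counts are illustrative placeholders rather than the configuration used in the talk.

    # Sketch only: train a small LSI (LSA) and LDA model with gensim.
    from gensim import corpora, models

    # Hypothetical tokenised documents.
    documents = [
        ["swastika", "temple", "buddhism", "peace"],
        ["swastika", "nazi", "holocaust", "horror"],
        ["monster", "sea", "madness", "cult"],
        # ... in practice, hundreds of full texts
    ]

    dictionary = corpora.Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

    # LSA/LSI: term-document matrix factorised by SVD, truncated to num_topics dimensions.
    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)

    # LDA: documents modelled as mixtures of topics, topics as distributions over words.
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

    print(lsi.print_topics())
    print(lda.print_topics())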
Methodology
The experiment:
1. Select a suitable corpus and train a model on it. Call this the gold standard
2. Reduce the size of the corpus and train a new model on the reduced collection
3. Compare the reduced model to the gold model and measure the similarity
4. Repeat reducing and comparing until the reduced model doesn’t agree with gold
5. At this point, we can say that the model is unstable
So, the experiment itself was fairly simple. What we wanted to do was
determine how big our corpus needed to be before our topics would
stabilise.
The experiment, it must be said, didn't go entirely as planned. Our initial
intention was to train a model on the entirety of Wikipedia. We knew
we'd have enough information there to get a stable model. However,
training the models on Wikipedia proved to be extremely tricky. A
combination of space requirements and speed limitations meant that I
didn't manage to train the models, which is disappointing.
Instead, we worked with a collection of short stories by H.P. Lovecraft. I
thought this would be interesting, partially because it involved
investigating the works of a single author using these methods and
partially because I’m a bit of a horror nut.
So, we need to establish how we'll measure the similarity of a model and
the gold standard. To do this, we use a method that is described by a
guy called Derek Greene. I'll step through that quickly now.
Let’s say we have two models and we want to know how similar they are.
Essentially, if two models are similar, we expect that their topics will have
roughly the same content. So we need a way to compare the contents of
topics.
Average Jaccard

T1 = {album, music, best, award, win}
T2 = {sport, best, win, medal, award}

Depth d | Top d terms of T1              | Top d terms of T2              | AJ
1       | album                          | sport                          | 0.000
2       | album, music                   | sport, best                    | 0.000
3       | album, music, best             | sport, best, win               | 0.067
4       | album, music, best, award      | sport, best, win, medal        | 0.086
5       | album, music, best, award, win | sport, best, win, medal, award | 0.154
We want to measure how similar they are. We could just look at how
many words they have in common (the size of the intersection divided by
the size of the union, i.e. the Jaccard similarity), but that's not really an
accurate measure of similarity on its own. As you move down the list of
terms in the topic, they become less strongly associated with that subject,
so the words towards the end of the list are of lesser importance than
those at the top.

The measure used by Derek Greene is the Average Jaccard. This takes the
topics in subsets of increasing length, computes the Jaccard similarity
between each pair of subsets and averages those scores over the
increasing depths.

For example, we start with just "album" and "sport". Zero similarity.
"Album" and "music" vs "sport" and "best", again zero. At the third
stage, however, they have the word "best" in common. The Jaccard
similarity between these two sets is 1/5 (the size of the intersection
divided by the size of the union), but we average this with the previous
two scores, causing a division by 3 and giving 1/15 or 0.067.
It continues...
Average Jaccard
AJ(T_i, T_j) = \frac{1}{t} \sum_{d=1}^{t} \frac{|T_{i,d} \cap T_{j,d}|}{|T_{i,d} \cup T_{j,d}|}
Used to compute the similarity of two topics
Here’s the mathematical formulation for those who are interested.
As we’ve seen, it’s genuinely not as bad as it looks.
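A small Python sketch of this measure, written to match the formula and the worked example above (it is not Greene's reference implementation):

    # Compare the top-t ranked terms of two topics at every depth d = 1..t
    # and average the Jaccard similarities.
    def average_jaccard(topic_a, topic_b):
        """topic_a, topic_b: ranked lists of terms, highest-weighted first."""
        t = min(len(topic_a), len(topic_b))
        total = 0.0
        for d in range(1, t + 1):
            head_a, head_b = set(topic_a[:d]), set(topic_b[:d])
            total += len(head_a & head_b) / len(head_a | head_b)
        return total / t

    t1 = ["album", "music", "best", "award", "win"]
    t2 = ["sport", "best", "win", "medal", "award"]
    print(round(average_jaccard(t1, t2), 3))  # 0.154, as in the worked example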
So, this formula will give us the similarity of two topics, but how do we
determine the similarity of two models using this Average Jaccard
measure?
Agreement
M_1 = {T_{1,1}, T_{1,2}, T_{1,3}, T_{1,4}, T_{1,5}}
M_2 = {T_{2,1}, T_{2,2}, T_{2,3}, T_{2,4}, T_{2,5}}

Match = {(T_{1,3}, T_{2,2}), (T_{1,2}, T_{2,3}), (T_{1,5}, T_{2,5}), (T_{1,1}, T_{2,4}), (T_{1,4}, T_{2,1})}

agree(M_1, M_2) = \frac{AJ(T_{1,3}, T_{2,2}) + AJ(T_{1,2}, T_{2,3}) + AJ(T_{1,5}, T_{2,5}) + AJ(T_{1,1}, T_{2,4}) + AJ(T_{1,4}, T_{2,1})}{5}
We have two models, each comprised of the same number of topics, and
we want to find the extent to which they are in agreement about the
contents of those topics. Bear in mind that we haven't influenced what
these topics are. Two models could arrive at completely different
conclusions about the topics present in the collection they were built
with. At the very least we can expect that even if they find exactly the
same topics, those topics won't appear in the same order, so topic 1 in
model 1 might not describe the same topic as topic 1 in model 2. We
therefore have to pair each one off with its most similar partner. We do
this using the Hungarian Method, which gives us the optimal
configuration of topic pairs based on Average Jaccard.
We then compute the average similarity of these topic pairings by getting
the Average Jaccard for each and dividing by the number of topics.
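A possible sketch of that pairing step, using scipy's linear_sum_assignment as the Hungarian method and the average_jaccard function sketched earlier; it illustrates the idea rather than reproducing the code used in the experiment.

    # Pair every topic in model 1 with its best match in model 2, then average
    # the Average Jaccard scores of the chosen pairs.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def agreement(model_a, model_b):
        """model_a, model_b: lists of topics, each a ranked list of terms."""
        sim = np.array([[average_jaccard(ta, tb) for tb in model_b] for ta in model_a])
        # The Hungarian method minimises cost, so negate the similarities.
        rows, cols = linear_sum_assignment(-sim)
        return sim[rows, cols].mean()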
Agreement
agree(M_x, M_y) = \frac{1}{k} \sum_{i=1}^{k} AJ(T_{x,i}, \pi(T_{x,i}))
Used to compute the similarity of two models
Mathematically, it looks like this.

So, we can now take two models and say that they do or don't agree on
the nature of the topics present in a corpus. How can we use this
information to determine whether or not the models are stable?
Quite simply, really.
Stability
stability(k) = \frac{1}{\tau} \sum_{i=1}^{\tau} agree(M_0, M_i)
Used to assess the stability of a model
Sorry, just mathematical notation for this one.
After we’ve trained our gold standard model, we train a number of
models on a subset of the collection. Not reducing here. Just a number
of models trained on a fixed subset. The average agreement of each of
these models with the gold standard is our stability measure. If the
model is stable, then each of the sub models should have a high
agreement with the gold standard, giving us a large average. If not, we’ll
get a smaller average indicating instability.
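A rough sketch of that measure in Python; train_topic_model is an assumed placeholder helper (not from the talk) that trains LDA or LSA on a list of documents and returns each topic as a ranked list of terms, and agreement is the function sketched above.

    # Average agreement of several subsample models with the gold standard.
    import random

    def stability(gold_topics, documents, num_topics, fraction, runs=10):
        scores = []
        for _ in range(runs):
            sample = random.sample(documents, int(len(documents) * fraction))
            sub_topics = train_topic_model(sample, num_topics)  # assumed helper
            scores.append(agreement(gold_topics, sub_topics))
        return sum(scores) / len(scores)

    # e.g. stability(gold_topics, corpus, num_topics=5, fraction=0.5)
    # gives the stability score at the 50% reduction level.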
So, that's basically what we did for Lovecraft. We produced a number of
sub-collections of his works at 90%, 80%, 70% and so on. At each level of
reduction we trained 10 models of each type. We then compared these 10
models to the gold standard and used that to assess the stability of the
model at that level of reduction. The results looked like this...
Lovecraft
Lovecraft: 146 documents, ≈ 389,486 words
So, here's Lovecraft modelled using LDA: 146 documents, totalling about
390,000 words. For two topics, it looks like we actually did OK. About 0.75
stability. That's not horrendous. It's not exactly desirable either, but it's
not awful. That's not a lot of topics for 146 stories though, is it? He's
either talking about monsters or he's not? That's not really a great
insight.

As you can see, as we increase the number of topics, it gets harder and
harder for the model to stabilise. For 20 topics, there's not really much
difference between training it on 10% of the corpus or 90%.
Lovecraft
Lovecraft: 146 documents, ≈ 389,486 words
Here’s LSA, which actually stabilised quite well. For 2, 3, 4, 5, 6, 7, 8, 9,
10 topics as we reintroduce documents to the corpus, we rapidly
approach a high level of stability. That’s pretty nice to see. Our topics
here are more stable and can divide the corpus into a broader range of
topics. This isn’t really surprising though, I don’t think. LSA is fairly
deterministic in terms of how it works - represent the document as a
matrix, reduce, reconstruct, you’re done. There’s a greater degree of
random variability in LDA, such that two models trained on the same
collection which attain stability still won’t necessarily agree 100% on all
their topics. Truth be told, I don’t actually think this graph really tells us
anything useful. I mean, let’s actually look at it. For two topics, the
stability largely mirrors the size of the subsample. Intuitively, that makes
sense. So the models will be stable. They’ll be in a high degree of
agreement, but will they be useful? That’s a different question.
Lovecraft Line
Lovecraft: 29,894 documents, ≈ 389,486 words
I cheated.
I tried taking every individual sentence as a document and fed them to
each of the models one at a time. This gave the impression that we had
a much larger corpus than just 146 documents. We got just under 30,000
documents by doing this. We achieved nothing. No stability. There just
wasn’t enough context for LDA to find suitable patterns I guess. That’s
what you get for being a cheat.
Lovecraft Line
Lovecraft: 29,894 documents, ≈ 389,486 words
LSA converges again, although not as tightly as we saw before. They all
start off relatively compact, but they spread out as they move along.
Again, my guess is that this is simply because, due to each document
only being about 10 words long, the model just doesn’t have enough
context to decipher what the topics should be. We only really get decent
stability up to 6 topics here. Again, the actual contents of those topics
might not even be useful.
Horror
Horror: 185 documents, ≈ 1,627,251 words
Like I said, I’m a horror nut. I was curious to see what would happen if I
introduced some new authors to the collection, so I added some works by
Edgar Allan Poe, Bram Stoker and a number of others. A lot of these
were novels or collected works, so the number of words in the corpus
increased drastically, even though there were only 40 new documents. I
thought this would have resulted in much higher stability. We have more
documents, more text and a greater diversity of writing style. Surely this
would make it easier to distinguish between different topics. Seemingly
not. These models are all less stable than those we saw for the smaller
Lovecraft corpus. It’s rather unexpected and I’m not sure why this might
be the case. It could be that different authors use the same words in
different ways. Maybe this confused the model as the diversity in writing
style meant it was actually harder to establish a definite context for
particular words.
I didn’t get to run this test for LSA unfortunately.
Metaphor
Metaphor: 34 documents, ≈ 2,026,713 words
I decided to go back and see what this stability analysis would look like
for our original metaphor corpus too. This is the smallest number of
documents, but the greatest quantity of text that we worked with. In
some ways it's similar to the results that we got for Lovecraft. About 0.70 stability.
Notice the pinch at 40%. I’m not 100% sure what happened there, but
my guess is that it might actually be a mistake in the subsample. I think
I might have removed a particularly influential document (or perhaps a
series of particularly influential documents) at that level of subsampling.
Metaphor
Metaphor: 34 documents, ≈ 2,026,713 words
See how LSA has a similar shape? A drop off at the end and a pinch in
the middle. I don't think this tells us anything about LSA or LDA. I think
it speaks more to the effect of changing the corpus.
Conclusion

There are a lot of factors that go into determining how stable a model will be:
- Size of the collection
- Number of topics
- Diversity of text
- How the text is parsed

Future work:
- Generate Wikipedia model and compare to Lovecraft model
- Generate models using Part of Speech tagged text
So having done all of that, what can we say?
There are a lot of factors that influence the stability and quality of your
models. It’s not just about how much text you have or how many topics
you choose. The nature of different collections varies. How they’re
structured, how they’re parsed, all of this will influence the quality of the
models that are produced. This isn’t to mention the various priors and
other parameters that you can configure before you start building the
models. There’s a lot that can influence your model.
As for where we should go from here, I'd like to see that Wikipedia model
somehow being generated. Alex made a very interesting suggestion: it
might be possible to treat this model as a relatively normalised model of
how the English language is used, since the noise of so many authors
cancels out personal style. That's assuming that we build the model
correctly. Could we compare this model to something like the Lovecraft
model (assuming it's stable) and from that determine what it is about
Lovecraft that makes his style distinctly his?
It would also be interesting to see the effect of building models on POS
tagged text, so that the model can distinguish between "can" the noun
and "can" the verb.
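As a small illustration (one possible approach, using NLTK rather than anything from the talk), POS tags can be appended to tokens so the topic model treats the two senses of "can" as different words.

    # Sketch: merge each word with its POS tag before topic modelling.
    # Requires the NLTK 'punkt' and 'averaged_perceptron_tagger' data packages.
    import nltk

    sentence = "I can recycle the can"
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # list of (word, tag) pairs

    # Word and tag combined become a single distinct token.
    pos_tokens = [f"{word.lower()}_{tag}" for word, tag in tagged]
    print(pos_tokens)  # e.g. ['i_PRP', 'can_MD', 'recycle_VB', 'the_DT', 'can_NN']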
For Further Reading
Derek Greene, Derek O’Callaghan and Pádraig Cunningham
How Many Topics? Stability Analysis for Topic Models.
Machine Learning and Knowledge Discovery in Databases. 498-513, 2014
David M. Blei, Andrew Y. Ng, and Michael I. Jordan
Latent Dirichlet Allocation.
The Journal of Machine Learning Research. 3:993-1022, 2003
Thomas K. Landauer, Peter W. Foltz, and Darrell Laham
An Introduction to Latent Semantic Analysis.
Quantitative Approaches to Semantic Knowledge Representations. 25(3): 259-284, 1998
Radim Řehůřek and Petr Sojka
Gensim: Topic Modelling for Humans
https://radimrehurek.com/gensim/index.html
Feedback:
1) Too few words overall, perhaps too few occurrences of particular words
(suggested min. 200).
2) Several people suggested paragraph-level dissection of the texts.
3) A warning about the unrepresentative nature of Wikipedia, because it's
low on personal pronouns, syntactic variety and other things.
4) Suggested that gensim and Mallet don't behave the same way with
LDA. Gensim has Mallet bindings.
Thank You
[email protected]
github.com/munnellg/elltm
peculiarparity.com/writing