
Automatic Correction of Topic Coherence
William Martin <[email protected]>
John Shawe-Taylor <[email protected]>
Text Analysis
Text Analysis
● We want to group documents.
● We want to see broad categories without reading every document.
Text Analysis
● We can do both of these things using topic modelling.
● How can we optimise the results?
● By improving topic coherence.
Topic Modelling
Topic Modelling
● Unsupervised machine learning.
● Assumes documents are generated from a set of latent (hidden) topics.
[Figure: latent topics generating a document]
Topic Modelling
● Example topics (top ten terms each):
○ chemistry, synthesis, oxidation, reaction, product, organic, conditions, cluster, molecule, studies
○ orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field
○ infection, immune, aids, infected, viral, cells, vaccine, antibodies, hiv, parasite
Topic Modelling
● We use information from the documents to
approximate these latent topics.
[Figure: documents used to approximate the latent topics]
Topic Modelling
● Select the topic that generated each word:
[Figure: each word assigned to the topic that generated it]
Topics
● A topic is a distribution over all words in the dataset.
Topic Mixture Model
● A document is formed from a distribution over all topics.
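To make the two distributions concrete, here is a tiny illustrative sketch in Python; the topic names, words, and numbers are invented for the example, not taken from the talk.

```python
# A topic assigns a probability to every word in the dataset's vocabulary
# (only four words are shown; a real topic spans the whole vocabulary).
space_topic = {"orbit": 0.12, "solar": 0.09, "mars": 0.07, "dust": 0.05}

# A document is formed from a mixture over all topics.
document_mixture = {"space": 0.70, "chemistry": 0.25, "medicine": 0.05}
```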
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA)
● Introduced in 2003*.
● Topic mixture model.
● Fits:
○ topic-document distribution.
○ term-topic distribution.
● These distributions are used for the text
analysis applications mentioned earlier.
* 'Latent Dirichlet allocation' by Blei et al. (2003).
Latent Dirichlet Allocation (LDA)
● θ - distribution of topics in a single document
● z - topic assignments of all dataset words
● w - word identities of all dataset words
● N - number of words in a single document
● M - number of documents
Latent Dirichlet Allocation (LDA)
● α - Dirichlet prior on per-document topic
distribution.
● β - Dirichlet prior on per-topic word
distribution.
LDA Properties
● Topic mixture model.
● Sparsity:
○ Low values for α (~0.1) encourage few topics per
document.
○ Low values for β (~0.001) encourage few words per
topic.
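As a rough sketch of the generative story these priors imply, the following Python (using numpy, with made-up corpus sizes) draws sparse per-topic and per-document distributions with the α and β values quoted above; it is an illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M = 1000, 20, 5        # vocabulary size, topics, documents (illustrative)
alpha, beta = 0.1, 0.001     # sparse Dirichlet priors from the slide

# Per-topic word distributions: low beta concentrates mass on few words.
phi = rng.dirichlet(np.full(V, beta), size=K)

documents = []
for _ in range(M):
    theta = rng.dirichlet(np.full(K, alpha))   # low alpha -> few topics per document
    N = rng.poisson(100)                       # document length (arbitrary choice)
    z = rng.choice(K, size=N, p=theta)         # topic assignment for each word slot
    w = [rng.choice(V, p=phi[k]) for k in z]   # word drawn from its topic
    documents.append(w)
```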
Gibbs Sampling
● For each sample:
○ For each term:
■ Sample a topic in proportion to how likely it is to have generated the term.
■ Ignore the term's own current assignment when computing these probabilities.
● We do this by maintaining term-topic and
topic-document counts.
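A minimal sketch of one such sweep, assuming a standard collapsed Gibbs sampler for LDA; the array names and shapes are my own choices, not taken from the talk.

```python
import numpy as np

def gibbs_sweep(docs, z, n_tk, n_dk, n_k, alpha, beta):
    """One collapsed Gibbs pass over every token.

    docs : list of documents, each a list of word ids
    z    : current topic assignment for every token (same nesting as docs)
    n_tk : term-topic counts, shape (V, K)
    n_dk : topic-document counts, shape (D, K)
    n_k  : total tokens assigned to each topic, shape (K,)
    """
    V, K = n_tk.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Ignore the term's own inclusion: remove it from the counts.
            n_tk[w, k_old] -= 1
            n_dk[d, k_old] -= 1
            n_k[k_old] -= 1
            # Probability of each topic having generated this term.
            p = (n_dk[d] + alpha) * (n_tk[w] + beta) / (n_k + V * beta)
            p /= p.sum()
            k_new = np.random.choice(K, p=p)
            # Put the term back under its newly sampled topic.
            n_tk[w, k_new] += 1
            n_dk[d, k_new] += 1
            n_k[k_new] += 1
            z[d][i] = k_new
```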
Topic Coherence
Topic Coherence
● Fact - generated topics do not always make
sense!
● A measure of a topic's internal relatedness.
● An indicator of topic quality.
Topic Coherence
● Several metrics.
● We used Google Log Hits:
○ LH-Score(topic) = log(# results for top 10 terms)
○ Query = "+ t1 + t2 + t3 + ... + t10"
● A fast metric.
● However, it can no longer be automated.
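A sketch of the scoring function itself; the hit-count lookup is left as a caller-supplied callable (hit_count, my own placeholder name) because, as noted above, the Google query can no longer be automated.

```python
import math

def lh_score(top_terms, hit_count):
    """Google Log Hits coherence for a single topic.

    top_terms : the topic's ten highest-probability terms
    hit_count : hypothetical callable mapping a query string to the
                number of results returned by the search backend
    """
    query = " ".join("+" + term for term in top_terms[:10])
    return math.log(hit_count(query))
```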
Coherent
● students, program, summer, biomedical, training, experience, undergraduate, career, minority, student
Incoherent
● globin, longevity, human, erythroid, sickle, beta, hb, senescence, older, lcr
Topic Coherence
● To illustrate - imagine we plotted each word
in a document on a 2D grid...
Document as a Grid
[Figure: words plotted on a 2D grid, marked as related or unrelated]
Now Overlay Topics
[Figure: topics overlaid on the grid, grouping related words]
Topic Coherence
● Those were good topics.
● Topic distributions are not always ideal.
● Some topics are coherent, others are
formed from terms that are not similar to
anything else.
Topic Coherence
[Figure: a less coherent topic overlaid on the grid of related and unrelated words]
Want to Increase Topic Coherence
[Figure: the desired outcome, with topics covering related words and avoiding unrelated ones]
Optimising Coherence
Optimising Semantic Coherence in Topic Models
● Mimno et al. (2011) effectively add biases during each Gibbs sampling step.
● Biases derived from internal term similarity.
● Can we improve coherence further using
external term similarity?
● I will explain the process.
Polya Urn Model
● Instead of adding and removing a single term from the counts, we add and remove the term together with all of its similar terms.
● This favours similar words when sampling.
● Similar terms are added to the term-topic counts during sampling.
Polya Urn Model
[Figure: ordinarily only the term itself is added to the counts (the urn); under the Polya urn model the term and its similar terms are all added]
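A sketch of the modified count update under this scheme; similar_terms (a map from a word id to its similar word ids) and the fractional weight given to the extra counts are my own illustrative choices, not details from the talk.

```python
def add_token(w, k, d, n_tk, n_dk, similar_terms, weight=1.0):
    """Polya-urn style update: assigning word w to topic k in document d
    also credits w's similar terms in the term-topic counts."""
    n_tk[w, k] += 1
    n_dk[d, k] += 1
    for s in similar_terms.get(w, ()):
        n_tk[s, k] += weight   # use a float count array if weight < 1

def remove_token(w, k, d, n_tk, n_dk, similar_terms, weight=1.0):
    """Mirror of add_token, applied before a token's topic is resampled."""
    n_tk[w, k] -= 1
    n_dk[d, k] -= 1
    for s in similar_terms.get(w, ()):
        n_tk[s, k] -= weight
```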
Document Frequency Information
● Fast.
● Results can be cached.
● Works by asking how often a pair of words
appear together in documents (and how
often they don't).
Adjacency Matrix
● Compute an adjacency matrix over all pairs of words in the dataset.
● Entries are computed using DFI.
● From this, form a list of similar terms for each word.
● Ignore pairs that fall below a certain threshold (0.5).
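A sketch of this step under stated assumptions: the talk only says the score is derived from how often word pairs do and do not co-occur in documents, so the Jaccard-style document overlap used below is my own stand-in, combined with the 0.5 threshold from the slide.

```python
import numpy as np

def similar_term_lists(doc_term, threshold=0.5):
    """Build per-word similar-term lists from document frequency information.

    doc_term  : binary document-term matrix, shape (D, V); entry (d, v) is 1
                if word v occurs in document d
    threshold : pairs scoring below this value are ignored
    """
    df = doc_term.sum(axis=0)               # documents containing each word
    co = doc_term.T @ doc_term              # documents containing both words
    union = df[:, None] + df[None, :] - co  # documents containing either word
    sim = co / np.maximum(union, 1)         # adjacency matrix of similarities
    np.fill_diagonal(sim, 0.0)              # a word is not its own neighbour
    return {v: set(np.flatnonzero(sim[v] >= threshold))
            for v in range(sim.shape[0])}
```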
Challenges
● Large datasets.
● Lots of caching.
● Small modification for the case where a
word is not found.
● Stop words, stemming.
Results
Results on NUS Abstracts
Results on NIPS Abstracts
Results
● Peak 30% improvement in topic coherence.
● Coherent topics are still formed.
● Incoherent topics are avoided.
● Greater range of coherent topics.
Implications
What Does this Mean?
● More accurate clustering.
● Categories all make sense during analysis.
Future Work - Tailoring
● Select validation dataset based on desired
effects.
● Multilingual datasets.
● Coherence.
Conclusions
Conclusions
● No-holds-barred method of improving topic
coherence.
● Ability to tailor results using validation
dataset.
● Speed is an issue.
Questions?