Automatic Correction of Topic Coherence
William Martin, John Shawe-Taylor

Text Analysis
● We want to group documents.
● We want to see broad categories without physically reading the documents.

Text Analysis
● We can do these things using topic modelling.
● How can we optimise the results?
● By improving topic coherence.

Topic Modelling
● Unsupervised machine learning.
● Assumes documents are generated from a set of latent (hidden) topics.
[Diagram: documents generated from latent topics.]

Topic Modelling
● Example topics:
○ chemistry synthesis oxidation reaction product organic conditions cluster molecule studies
○ orbit dust jupiter line system solar gas atmospheric mars field
○ infection immune aids infected viral cells vaccine antibodies hiv parasite

Topic Modelling
● We use information from the documents to approximate these latent topics.
[Diagram: documents used to approximate the latent topics.]

Topic Modelling
● Select the topic that generated each word.
[Diagram: assigning a topic to each word.]

Topics
● A topic is a distribution over all dataset words.

Topic Mixture Model
● A document is formed from a distribution over all topics.

Latent Dirichlet Allocation (LDA)
● Introduced in 2003*.
● A topic mixture model.
● Fits:
○ the topic-document distribution.
○ the term-topic distribution.
● These distributions are used for the text analysis applications mentioned earlier.
* 'Latent Dirichlet allocation' by Blei et al. (2003).

Latent Dirichlet Allocation (LDA)
[Plate diagram of the LDA graphical model.]
● θ - distribution of topics in a single document.
● z - topic assignment vector for all dataset words.
● w - word identity vector for all dataset words.
● N - number of words in a single document.
● M - number of documents.

Latent Dirichlet Allocation (LDA)
● α - Dirichlet prior on the per-document topic distribution.
● β - Dirichlet prior on the per-topic word distribution.

LDA Properties
● Topic mixture model.
● Sparsity:
○ Low values for α (~0.1) encourage few topics per document.
○ Low values for β (~0.001) encourage few words per topic.

Gibbs Sampling
● For each sample:
○ For each term:
■ Select a topic according to how likely it is to have randomly generated the term.
■ Ignore the term's own inclusion.
● We do this by maintaining term-topic and topic-document counts (a short code sketch follows these slides).

Topic Coherence
● Fact: generated topics do not always make sense!
● A measure of a topic's internal relatedness.
● The quality of a topic.

Topic Coherence
● Several metrics exist.
● We used Google Log Hits:
○ LH-Score(topic) = log(# results for the top 10 terms)
○ Query = "+t1 +t2 +t3 ... +t10"
● A fast metric.
● Can no longer be automated.
● Coherent: students program summer biomedical training experience undergraduate career minority student
● Incoherent: globin longevity human erythroid sickle beta hb senescence older lcr

Topic Coherence
● To illustrate, imagine we plotted each word in a document on a 2D grid...

Document as a Grid
[Diagram: words plotted on a 2D grid; legend: related word, unrelated word.]

Now Overlay Topics
[Diagram: topics overlaid on the word grid; legend: topic, related word, unrelated word.]

Topic Coherence
● Those were good topics.
● Topic distributions are not always ideal.
● Some topics are coherent; others are formed from terms that are not similar to anything else.

Topic Coherence
[Diagram: a topic overlaid on unrelated words; legend: topic, related word, unrelated word.]

Want to Increase Topic Coherence
[Diagram: the same grid with topics covering related words; legend: topic, related word, unrelated word.]

Optimising Coherence

Optimising Semantic Coherence in Topic Models
● Mimno et al. (2011) effectively used biases added during each Gibbs sample.
● Biases derived from internal term similarity.
● Can we improve coherence further using external term similarity?
● I will explain the process.
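Before moving on to that process, here is a minimal Python sketch of the count-based sampling step described on the Gibbs Sampling slide: the word's own occurrence is removed from the counts, a topic is drawn in proportion to how likely it is to have generated the word, and the counts are restored. The variable names (n_tw, n_dt, n_t) and the array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gibbs_update(d, w, old_z, n_tw, n_dt, n_t, alpha, beta, rng):
    """One collapsed Gibbs sampling step for a single word occurrence.

    d     : document index of the word occurrence
    w     : vocabulary index of the word
    old_z : topic currently assigned to this occurrence
    n_tw  : (num_topics, vocab_size) term-topic counts
    n_dt  : (num_docs, num_topics) topic-document counts
    n_t   : (num_topics,) total words assigned to each topic
    alpha : Dirichlet prior on per-document topic distribution (~0.1)
    beta  : Dirichlet prior on per-topic word distribution (~0.001)
    """
    vocab_size = n_tw.shape[1]

    # "Ignore term's own inclusion": remove this occurrence from the counts.
    n_tw[old_z, w] -= 1
    n_dt[d, old_z] -= 1
    n_t[old_z] -= 1

    # How likely each topic is to generate this word, given all other assignments.
    p = (n_tw[:, w] + beta) / (n_t + vocab_size * beta) * (n_dt[d] + alpha)
    p /= p.sum()

    # Draw a new topic and add the occurrence back into the counts.
    new_z = rng.choice(len(p), p=p)
    n_tw[new_z, w] += 1
    n_dt[d, new_z] += 1
    n_t[new_z] += 1
    return new_z
```

The small α and β values from the LDA Properties slide enter here as the sparsity-inducing priors on the two count-based distributions.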
Polya Urn Model
● Instead of adding and removing a single term from the counts, we add and remove the term and all similar terms.
● Favours similar words when sampling.
● Similar terms are added to the term-topic counts during sampling (a code sketch follows the final slide).

Polya Urn Model
[Diagram: ordinarily, only the sampled term is added to the counts (urn); in the Polya urn model, the term and its similar terms are added.]

Document Frequency Information (DFI)
● Fast.
● Results can be cached.
● Works by asking how often a pair of words appears together in documents (and how often it does not).

Adjacency Matrix
● Compute an adjacency matrix over every pair of words in the dataset.
● Use DFI.
● From this, form a list of similar terms.
● Ignore term pairs whose similarity falls below a threshold (0.5).

Challenges
● Large datasets.
● Lots of caching.
● A small modification is needed for the case where a word is not found.
● Stop words and stemming.

Results

Results on NUS Abstracts
[Figure: coherence results on the NUS abstracts dataset.]

Results on NIPS Abstracts
[Figure: coherence results on the NIPS abstracts dataset.]

Results
● Peak 30% improvement in topic coherence.
● Coherent topics are still formed.
● Incoherent topics are avoided.
● Greater range of coherent topics.

Implications

What Does this Mean?
● More accurate clustering.
● Categories all make sense during analysis.

Future Work - Tailoring
● Select the validation dataset based on desired effects.
● Multilingual datasets.
● Coherence.

Conclusions
● No-holds-barred method of improving topic coherence.
● Ability to tailor results using a validation dataset.
● Speed is an issue.

Questions?
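As a closing illustration, here is a minimal Python sketch of the two pieces described above: a similar-terms list built from document frequency information and thresholded at 0.5, and a Polya-urn style count update that also credits similar terms. The co-occurrence ratio used as the similarity score, the weights, and all names are assumptions made for this sketch; they are not the exact formulation used in this work.

```python
import numpy as np
from collections import defaultdict

def build_similar_terms(doc_term_sets, threshold=0.5):
    """Build a similar-terms list from document frequency information.

    doc_term_sets : list of sets of vocabulary indices, one set per document
    threshold     : term pairs scoring below this are ignored (0.5 in the talk)

    Similarity here is an illustrative co-occurrence ratio: how often two
    words appear in the same document, relative to how often either appears.
    """
    df = defaultdict(int)     # document frequency of each word
    co_df = defaultdict(int)  # document frequency of each word pair
    for terms in doc_term_sets:
        for w in terms:
            df[w] += 1
        for w1 in terms:
            for w2 in terms:
                if w1 < w2:
                    co_df[(w1, w2)] += 1

    similar = defaultdict(list)
    for (w1, w2), both in co_df.items():
        either = df[w1] + df[w2] - both
        score = both / either if either else 0.0
        if score >= threshold:
            similar[w1].append(w2)
            similar[w2].append(w1)
    return similar

def polya_urn_increment(topic, w, n_tw, similar, weight=1.0, sim_weight=0.5):
    """Polya-urn style update: adding a word to a topic also adds its
    similar terms (with a smaller, illustrative weight), so the sampler
    favours topics whose words hang together."""
    n_tw[topic, w] += weight
    for s in similar.get(w, []):
        n_tw[topic, s] += sim_weight

# Example: two tiny documents (as sets of vocabulary indices), then one update.
docs = [{0, 1, 2}, {0, 1, 3}]
similar = build_similar_terms(docs)        # pairs that co-occur often enough are linked
n_tw = np.zeros((2, 4))                    # 2 topics, 4 vocabulary words
polya_urn_increment(0, 0, n_tw, similar)   # word 0 and its similar terms gain count mass
```

A matching decrement, subtracting the same weights, would be applied when a word's topic assignment is resampled, mirroring the add-and-remove step described on the Polya Urn Model slide.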