Ben Jeffery NLP Talk

NLP IN Q
▸ No existing NLP libraries
▸ Parsing is expensive, simple vector operations are cheap
▸ Focus on vector operations, rather than named entity
recognition, part-of-speech tagging, co-reference
resolution
REPRESENTING DATA
Document 1
circumambulate| 4.997212
sail          | 0.9236821
cook          | 4.969805
whale         | 0
ishmael       | 3.722053
...

Document 2
fish   | 0.4790985
harpoon| 2.636207
jolly  | 2.898556
ishmael| 5.263778
inn    | 4.057829
...

Union of keys
`circumambulate`sail`cook`whale`ishmael`fish`harpoon`jolly`inn

Key-aligned vectors
4.997 0.923 4.969 0     3.722 0n    0n    0n    0n    ...
0n    0n    0n    0n    5.263 0.479 2.636 2.898 4.057 ...
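A minimal q sketch of this alignment (d1, d2, alignedKeys, v1 and v2 are illustrative names, not from the talk):
d1:`circumambulate`sail`cook`whale`ishmael!4.997212 0.9236821 4.969805 0 3.722053
d2:`fish`harpoon`jolly`ishmael`inn!0.4790985 2.636207 2.898556 5.263778 4.057829
alignedKeys:distinct key[d1],key d2     / union of the two key lists
v1:d1 alignedKeys                       / indexing fills missing terms with 0n
v2:d2 alignedKeys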
COSINE SIMILARITY
▸ Measures the cosine of the angle between two vectors
▸ Dot product is the sum of the pair-wise products
▸ Given two vectors aligned such that each index i refers to
the same element in each vector, the q code is:
0 ^ (sum x * y) % (*) . {sqrt sum x xexp 2} each (x; y)
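For instance, wrapped as a function and applied to the key-aligned vectors from the previous slide (cosSim is an illustrative name):
cosSim:{0 ^ (sum x*y) % (*) . {sqrt sum x xexp 2} each (x;y)}
cosSim[v1;v2]        / sum and the magnitudes ignore nulls; 1 = same direction, 0 = orthogonal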
TF-IDF
▸ Significance depends on the document and the corpus
▸ A word is significant if it is common in a document, but
uncommon in a corpus
▸ Term Frequency * Inverse Document Frequency
▸ IDF: log count[corpora] % sum containsTerm;
TF: (1 + log occurrences) | 0;
significance: TF * IDF;
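A minimal worked sketch, using a hypothetical list of per-document term-count dictionaries (docs and the counts are made up for illustration):
docs:(`whale`sea!10 3;`whale`inn!1 7;`inn`sea!4 2)   / term counts per document
occurrences:docs[;`whale]                            / 10 1 0N
containsTerm:not null occurrences                    / 110b
IDF:log count[docs] % sum containsTerm               / log 3%2
TF:0 | 1 + log 0^occurrences                         / missing counts become 0; log 0 capped at 0
TF*IDF                                               / significance of `whale in each document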
LEVEL STATISTICS
▸ Within a single document, clustered terms are more significant than uniformly
distributed ones
▸ Compares the standard deviation of the distances between occurrences to the
value a geometric distribution would predict
▸ Where “distances” is, for each word, the vector of gaps between its successive
occurrences, p its overall frequency, and n its number of occurrences
(see the sketch at the end of this slide):
σ : (dev each distances) % avg each distances;
σnor : σ % sqrt 1 - p;
sd(σnor) : (1 % sqrt n) * 1 + 2.8 * n xexp -0.865; / factors in # of occurrences
⟨σnor⟩ : ((2 * n) - 1) % ((2 * n) + 2);
significance : (σnor - ⟨σnor⟩) % sd(σnor);
▸ Carpena, P., et al. "Level statistics of words: Finding keywords in literary texts and
symbolic sequences." Physical Review E 79.3 (2009): 035102.
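A minimal sketch of these formulas for a single word, assuming positions is a sorted list of the word's token positions and N the document's total token count (both names are illustrative):
levelStat:{[positions;N]
  n:count positions;
  p:n % N;                                     / overall frequency of the word
  distances:1 _ deltas positions;              / gaps between successive occurrences
  sigma:dev[distances] % avg distances;
  signor:sigma % sqrt 1 - p;
  meannor:((2*n) - 1) % (2*n) + 2;
  sdnor:(1 % sqrt n) * 1 + 2.8 * n xexp -0.865;
  (signor - meannor) % sdnor}
levelStat[3 40 42 45 200;1000]                 / higher values mean the word is more clustered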
CLUSTERING
▸ Find groupings of entities
▸ Cluster documents, terms, proper nouns
▸ Find natural divisions of text
▸ Can be random or deterministic
▸ Can take as parameters: similarity of documents, number
of clusters, time spent clustering
▸ Cluster centroids can be represented as feature vectors
CLUSTERING ALGORITHMS
▸ K-means
▸ Pick k random documents and cluster around them, then use the
centroids as the new clustering points, and repeat until convergence (see the sketch below)
▸ Buckshot clustering
▸ Cluster sqrt(n) of the documents with an O(n^2) algorithm, then
match the remaining documents to the centroids
▸ Group Average
▸ Starting with buckshot clusters, cluster each cluster into sub-clusters,
merge any similar sub-clusters, and repeat as long as you want
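A minimal k-means sketch along these lines (not the talk's implementation), assuming docs is a list of key-aligned feature vectors; cosSim is the cosine similarity from the earlier slide, and kmeans, k and iters are illustrative names:
cosSim:{0 ^ (sum x*y) % (*) . {sqrt sum x xexp 2} each (x;y)}
kmeans:{[docs;k;iters]
  cents:neg[k]?docs;                                              / k random documents as seeds
  do[iters;
    near:{[cs;d] first idesc cosSim[d] each cs}[cents] each docs; / most similar centroid per document
    cents:avg each docs value group near];                        / new centroid = mean of its members
  group near}                                                     / cluster label -> document indices
/ usage: kmeans[docs;5;10]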
THE MARKOV CLUSTERING ALGORITHM
▸ “… random walks on the graph will infrequently go from
one natural cluster to another.” - Stijn van Dongen
▸ Multiply the matrix by itself (expansion), square every
element (inflation), normalize the columns, and repeat this
process until it converges (see the sketch below)
▸ Rows with multiple non-zero values give the clusters
▸ http://micans.org/mcl/
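A minimal sketch of those steps (not the author's implementation), assuming sim is a hypothetical square similarity matrix of floats given as a list of rows:
mclStep:{[m]
  m:m mmu m;                              / expansion: take two steps of the random walk
  m:m*m;                                  / inflation: square every element
  m %\: sum m}                            / re-normalize each column to sum to 1
mcl:{[sim]
  res:mclStep/[20;sim %\: sum sim];       / start column-normalized, iterate a fixed number of times
  distinct where each 0 < res}            / distinct non-empty rows each list one cluster's members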
MARKOV CLUSTERING FOR DOCUMENTS
Minimum similarity is passed in as a parameter:
.80+           Form letters, similar versions, updated articles
.60 to .80     Articles translated from English, then back
.50 to .60     Articles about the same events
.25 to .50     Articles about the same topics
.10 to .25     Articles about the same, more general, topics
less than .10  Several very large clusters, outliers become obvious
MARKOV CLUSTERING THE BIBLE (OR KJB IN KDB)
▸ At .49, you get Matthew, Mark and Luke in a cluster
▸ At .11, you get the New Testament, the Old Testament,
and the Epistles, clustered by author
▸ At .05, you get the Epistles of John in one cluster, and
everything else in another
(Cluster diagrams shown at similarity .06 and similarity .08)
EXPLAINING SIMILARITY
▸ Useful for explaining why
▸ a document is in a cluster
▸ a document matches a query
▸ two documents are similar
▸ product : (terms1 % magnitude terms1)
* (terms2 % magnitude terms2);
desc alignedKeys ! product % sum product;
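For example, with the key-aligned vectors from the representing-data slide (magnitude written out explicitly; v1, v2 and alignedKeys are illustrative names):
magnitude:{sqrt sum x xexp 2}
product:0 ^ (v1 % magnitude v1) * v2 % magnitude v2;
desc alignedKeys ! product % sum product        / terms contributing most to the similarity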
(Image: Diogenes Sitting in His Tub, Jean-Léon Gérôme)
EXPLAINING SIMILARITY 3
▸ Given the cluster containing the three gospels Matthew, Mark and Luke, described
by the keywords
disciples, pharisees, john, peter, herod, mary, answering, scribes, simon, pilate
▸ The Gospel According to Saint Matthew (Relevance 0.84)
disciples 0.310, pharisees 0.146, peter 0.087, herod 0.085, john 0.082, scribes
0.062, mary 0.061, hour 0.057, publicans 0.056, simon 0.053
▸ The Gospel According to Saint Mark (Relevance 0.78)
disciples 0.291, pharisees 0.115, john 0.093, peter 0.090, herod 0.081, immediately
0.072, scribes 0.069, answering 0.066, pilate 0.062, mary 0.061
▸ The Gospel According to Saint Luke (Relevance 0.82)
disciples 0.244, pharisees 0.133, john 0.093, answering 0.091, herod 0.088, peter
0.086, mary 0.076, simon 0.067, pilate 0.062, immediately 0.061
COMPARE CORPORA
▸ Find, for each term, the difference in relative frequency via log-likelihood (usage sketch at the end of this slide)
▸ totalFreq : (termCountA + termCountB) %
(totalWordCountA + totalWordCountB);
expectedA : totalWordCountA * totalFreq;
desc (termCountA * log[termCountA % expectedA]);
▸ Rayson, Paul, and Roger Garside. "Comparing corpora using
frequency profiling." Proceedings of the workshop on
Comparing Corpora. Association for Computational Linguistics,
2000.
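A minimal worked sketch of the comparison above, with made-up per-corpus term counts (cntA, cntB and the numbers are illustrative only):
cntA:`lord`jesus`israel!700 50 400               / term counts in corpus A (toy numbers)
cntB:`lord`jesus`israel!150 600 30               / term counts in corpus B
totalA:sum cntA; totalB:sum cntB                 / total word counts (here just these terms)
totalFreq:(cntA + cntB) % totalA + totalB
expectedA:totalA * totalFreq
desc cntA * log cntA % expectedA                 / terms most characteristic of corpus A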
COMPARE CORPORA - KJB
▸ Old Testament - Lord, shall, thy, Israel, king, thee, thou,
land, shalt, children, house
▸ New Testament - Jesus, ye, Christ, things, unto, god, faith,
disciples, man, world, say
COMPARE CORPORA - JEFF SKILLING’S EMAILS
▸ Business emails - enron, please, jeff, energy, information,
market, business
▸ Fraternity emails - yahoo, beta, betas, reunion, kai,
ewooglin
WORDS AS VECTORS
▸ Words can be described as vectors
▸ All previously mentioned operations become available on
individual words
▸ Vectors are based on co-occurrence
▸ word2vec uses machine learning to find which co-occurring
words are most predictive
CALCULATING VECTORS FOR WORDS
▸ Finding the significance of “captain” to “ahab”
▸ Of the 272 sentences containing “captain”, 78 contain “ahab”
▸ “captain” occurs in 2.7% of sentences, but occurs in 16% of sentences also
containing “ahab”
▸ The likelihood that a sentence containing “ahab” also contains “captain” is a
binomially distributed random variable, as it is the outcome of a Bernoulli process
▸ The deviation of this random variable is √(np(1-p))
where p is the overall probability of “captain” being in a sentence
▸ Significance is (cooccurrenceRate - overallFrequency) % deviation
(.16 - .027) % .162
.84
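A minimal sketch of this calculation, assuming sents is a hypothetical list of tokenized sentences (lists of symbols); it uses √(p(1-p)) as the deviation, matching the worked numbers above:
coocSig:{[sents;ctx;term]
  p:avg term in' sents;                            / overall frequency of the term
  rate:avg term in' sents where ctx in' sents;     / its frequency in sentences containing ctx
  (rate - p) % sqrt p * 1 - p}
/ usage: coocSig[sents;`ahab;`captain]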
WORD VECTOR EXAMPLES
Moby
stem      relevance tokens
------------------------------------------------------------------
dick      11.3      `dick`dick's
whale     7.75      `whaling`whale`whales`whale's`whaled
white     7.04      `white`whiteness`whitenesses`whites
ahab      6.1       `ahab`ahab's`ahabs
boat      4.95      `boat`boats`boat's
encounter 4.52      `encounter`encountered`encountering`encounters
seem      4.31      `seemed`seem`seems`seeming
sea       4.13      `sea`seas`sea's
WORD VECTOR EXAMPLES
harpoon
stem     relevance tokens
-------------------------------------------------------
whale    3.918473  `whaling`whale`whales`whale's`whaled
boat     2.902082  `boat`boats`boat's
line     2.235111  `line`lines`lined`lining
sea      1.991354  `sea`seas`sea's
iron     1.973497  `iron`irons`ironical
dart     1.964671  `dart`darted`darts`darting
ship     1.888228  `ship`ships`shipped`ship's`shipping
queequeg 1.825947  `queequeg`queequeg's
WORD VECTOR EXAMPLES - PROPER NOUNS ONLY
Moses
stem      relevance
-------------------
aaron     2.76
israel    2.26
pharaoh   1.39
egypt     1.31
egyptians 1.23
levites   1.19
eleazar   1.07
sinai     1.06
joshua    1
jordan    0.921
god       0.9

Jesus
stem      relevance
-------------------
galilee   1.85
god       1.85
son       1.73
lord      1.68
john      1.57
peter     1.5
jerusalem 1.47
jews      1.45
pilate    1.37
david     1.33
pharisees 1.24

Pharaoh
stem      relevance
-----------------------
egypt     5.53
moses     3.9
joseph    3.46
egyptians 3.26
goshen    2.21
aaron     2.03
israel    1.58
god       1.16
red sea   1.16
canaan    1.13
hebrews   1.13
WORDS AS VECTORS
▸ Clustering words becomes possible
▸ Given the names: pharaoh jude simon noah lamech judas
ham methuselah aaron levi moses shem japeth jesus
▸ Cluster 1: noah lamech ham methuselah shem
Cluster 2: pharaoh aaron levi moses
Cluster 3: simon judas jesus
ANSWER QUERIES
▸ Find the harmonic mean of each token's relevance for each
search term (a q sketch follows the examples below)
▸ Drop any terms with above-average significance to the
anti-search terms
Search Terms: captain, pequod
captain ahab  | 0.672187
captain peleg | 0.4844358
captains      | 0.4797662
captain bildad| 0.4429896

Search Terms: captain, !pequod
captain sleet   | 0.5986764
captain scoresby| 0.5184432
captain pollard | 0.5184432
captain mayhew  | 0.5184432
captain boomer  | 0.5184432
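A minimal sketch of the harmonic-mean ranking step only (the anti-term filtering is omitted), assuming relTo is a hypothetical dictionary from search term to a (token!relevance) dictionary, with every relevance vector sharing the same token keys:
hmean:{count[x] % sum 1 % x}                  / harmonic mean
rankTokens:{[relTo;terms]
  tokens:key first relTo terms;               / the shared, key-aligned token list
  vals:value each relTo terms;                / one relevance vector per search term
  desc tokens ! hmean each flip vals}         / harmonic mean across the search terms
/ usage: rankTokens[relTo;`captain`pequod]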
ANSWER QUERIES
How are Captain Bildad and Captain Peleg related?
captain| 0.0341
hand   | 0.00935
old    | 0.00896
ship   | 0.00894
owner  | 0.00883
EXPAND SETS
Summing the vectors for a set of words will give an expanded set
expanding simon, andrew, james, and john gives
bartholomew, alphaeus, matthew, thaddaeus, canaanite, zelotes, thomas,
brother, iscariot, zebedee, james, peter, lebbaeus, boanerges, traitor, andrew,
philip, simon, judas, and john
expanding bread, fish, milk, and beans gives
butter, honey, lentiles, cheese, millet, kine(cows), fitches(spelt), parched,
shobi, bason, earthen, pulse, wheat, and barley
STEMMING
▸ Stemming removes what it guesses are inflections
antidisestablishmentarianism -> establish
programmer -> program
brother -> broth
▸ Produces a root word, which may not be a real word:
happiness -> happi
▸ Stemmers can be compared by aggressiveness
▸ Stemming is rule-based and does not require extensive datasets
STEMMING
▸ Moby Dick has
16950 distinct words
10466 distinct stems
nearly 700 words have 4 or more inflected forms
▸ general generally generous generic generously generalizing
generations generated
▸ admirer admire admirals admiral admirable admirably
admiral’s admirers
TOKENIZING
▸ Tokens are individual words, names, numbers, etc.
▸ Proper names are counted as a single token
▸ Rule based, for simplicity
▸ To tokenize, all characters not in [a-zA-Z0-9’\u00C0-\u017F] get
replaced with whitespace, then split on whitespace, remove
terminal apostrophes, then join consecutive proper nouns (simplified sketch below)
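A minimal sketch of the replace-and-split steps, restricted to ASCII letters and digits (a simplification of the character ranges above; terminal-apostrophe removal and proper-noun joining are omitted):
tokenize:{[txt]
  keep:.Q.an,"'";                                      / letters, digits, "_" and apostrophe
  t:" " vs lower @[txt;where not txt in keep;:;" "];   / replace everything else with a space
  t where 0 < count each t}                            / split on whitespace, drop empties
tokenize "Call me Ishmael.  Some years ago..."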
PROPER NOUNS
▸ Any run of title-cased words not at the start of a sentence is
treated as a proper noun
▸ Any title-cased word at the start of a sentence is treated as
a proper noun if it is found as a proper noun elsewhere
SENTENCE DETECTION
▸ Rule based, for simplicity
▸ Break on ‘?’, ‘!’, or any period which isn’t part of an ellipsis (…), used
as an infix (P.E.I.), preceding a number (.1416), or following a title
(Ms.)
▸ Sentence boundaries get modified to include quotes or
multiple punctuation marks
▸ Not as accurate as machine-learning approaches, which can detect
sentence breaks in cases like “My uncle is from P.E.I. He’s in the potato business”
EXPLORATORY NLP UI IN ANALYST FOR KX
Searching, managing document collections, and common NLP operations are available through the UI
VISUALIZING NLP DATA IN ANALYST FOR KX
Data from prose text can be visualized; shown here is a discontinuity across chapters 32-45,
which cover zoology rather than the narrative
CLOSING COMMENTS
▸ Feature vectors and the vector space model work for
things other than documents
▸ Used in dialectology, phonetics, music classification, and
recommender systems
▸ Other operations that can be done with the same
techniques include summarization, phrase detection, and
natural language generation
CONTACT INFO
[email protected]