Latent Semantic Analysis:
Is it a solution to Plato’s problem?
[And 10 other questions & answers.]
• How did this paper change our lives?
• What is Plato's problem?
• Oh no! Not more philosophy?
• How can Plato’s problem be solved?
• What kind of solution do we need?
• What is latent semantic analysis?
• How is an LSA model constructed?
• How is the LSA model used?
– What’s a cosine between vectors?
• What are some cool empirical findings?
• Is LSA psychologically plausible?
How did this paper change our lives?
• Because I saw a talk by Landauer on this work, I
became interested in latent semantic analysis [LSA]
• Because I was interested in LSA, I became interested in Curt Burgess's HAL model.
• Because I was interested in HAL, I decided to come to
Edmonton, where Lori Buchanan was working on it
• Because I came to Edmonton, here I am teaching Psych 357.
• If Landauer hadn’t written this paper, we probably
wouldn’t have the mutual pleasure of knowing each
other as we do.
What is Plato's problem?
• Meno (in the Platonic dialog named after him) asks: How
can one ever investigate what one does not know?
• He saw two problems:
• i.) How can you propose what you do not know as
the object of your search?
• ii.) How will you recognize what you do not know as
the thing you did not know if you do (by chance) find
it?
• More generally, the problem is that there is a gap between
what we experience and what we know, with the latter
seeming to be larger than the former is able to support.
Oh no! Not more philosophy?
• Not at all (indeed, the opposite)
• Plato's problem is exactly the poverty of the stimulus/failure of induction problem
• It is thus central to syntactic knowledge as well as to many other dimensions of linguistic knowledge (wherever we make fine-grained untaught distinctions: e.g. prosody, phonology, and semantics).
How can Plato’s problem be solved?
• i.) Plato's solution was recollection of knowledge gained in
a previous life, famously demonstrated in the Meno by
showing that a slave boy 'knows' the Pythagorean Theorem
• ii.) Some favour the idea of innate knowledge, the modern
equivalent of recollection of a previous life
• The basic common principle is one we already know and
love in Psych357: we need some source of strong
additional constraints on the problems (= information) to
narrow down the size of the search space.
What kind of solution do we need?
• That is: What properties are desirable in a
scientifically-acceptable explanation of how
constraints on a search space operate?
• i.) They must be sufficient.
• ii.) They must be well-defined.
• iii.) They must be psychologically plausible.
What is latent semantic analysis?
• LSA is an algorithmically well-defined way of measuring lexical co-occurrence in some set of texts
• The assumption is that co-occurrence says something about semantics: words about the same things are likely to occur in the same contexts.
• If we have many words and contexts, small differences in co-occurrence probabilities can be compiled together to give information about semantics.
– Think of 20 questions: no single question might be sufficient to identify an unknown object, but 20 questions usually are sufficient
How is an LSA model constructed?
• i.) Build a matrix with rows representing words and columns representing context (a document or word string)
• ii.) Enter in each cell (= a word X document intersection) a count of how many times that word occurred in that document
• iii.) Transform the matrix
• i.) Build a matrix with rows representing words
and columns representing context (a document or
word string)
              Sonnets   Learn C   A day at the zoo   …
dog
zebra
computer
…
• ii.) Enter in each cell (= a word X document intersection) a count of how many times that word occurred in that document
              Sonnets   Learn C   A day at the zoo   …
dog              6         1            7
zebra            0         2           46
computer         0       123            0
…
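To make steps i.) and ii.) concrete, here is a minimal Python sketch that builds a word X document count matrix from a made-up toy corpus (the document names and texts are invented for illustration):

```python
from collections import Counter

import numpy as np

# Toy corpus: three 'documents' standing in for Sonnets, Learn C, and
# A day at the zoo. Real LSA models use tens of thousands of documents.
documents = {
    "Sonnets": "shall i compare thee to a summer day my dog and i walked",
    "Learn C": "a pointer stores the address of a variable in computer memory",
    "A day at the zoo": "the zebra and the dog watched the computer screen",
}

# Row labels: every distinct word in the corpus; columns are the documents.
vocabulary = sorted({w for text in documents.values() for w in text.split()})
word_index = {w: i for i, w in enumerate(vocabulary)}

# Cell (i, j) = how many times word i occurred in document j.
counts = np.zeros((len(vocabulary), len(documents)))
for j, text in enumerate(documents.values()):
    for word, n in Counter(text.split()).items():
        counts[word_index[word], j] = n
```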
• iii.) Transform the matrix
– a.) Control for word frequency
• The log transform compresses the effects of
frequency
– b.) Control for the number of contexts each word
appeared in
• Words that occur in few contexts are more
informative about those contexts (= reduce
uncertainty about their context more) than words
that appear in many different contexts
– E.g. knowing the word ‘computer’ was common places more constraints on what the document is about than knowing the word ‘the’ was common
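Steps a.) and b.) are often implemented together as a 'log-entropy' weighting; the sketch below assumes that particular scheme, which matches the description above but is not necessarily the exact transform used in the paper:

```python
import numpy as np

def log_entropy_transform(counts: np.ndarray) -> np.ndarray:
    """Weight a words x documents count matrix (assumes 2+ documents)."""
    # a.) The log transform compresses the effects of raw frequency.
    log_counts = np.log1p(counts)

    # b.) Entropy of each word's distribution over documents: a word that
    # occurs in only a few contexts has low entropy (it is informative),
    # while a word like 'the' that is spread over every context has high
    # entropy (it tells you little about any particular document).
    row_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    probs = counts / row_totals
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(probs > 0, probs * np.log(probs), 0.0)
    entropy = -plogp.sum(axis=1)

    # Weight near 1 = concentrated, informative word; near 0 = spread-out word.
    word_weight = 1.0 - entropy / np.log(counts.shape[1])
    return log_counts * word_weight[:, np.newaxis]
```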
• iii.) Transform the matrix
– c.) Singular value decomposition
• This reduces dimensionality by 'projecting' the tens
of thousands of context dimensions onto a smaller
number (roughly 300).
• A mathematical projection is roughly the same as
real projection: Think of shining a light through a
three dimensional pattern and tracing the shadow it
casts to get a two-dimensional projection
• The 'discarded' dimensions are those that are least
informative = have low variance = are redundant
(e.g. a word like 'the' occurred in every context or a
word like 'anti-disestablishmentarianism' occurred in
hardly any contexts).
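A sketch of step c.) using a plain truncated SVD; keeping roughly 300 dimensions follows the slide, and the cutoff is a tunable parameter rather than a fixed requirement:

```python
import numpy as np

def reduce_dimensions(weighted: np.ndarray, k: int = 300) -> np.ndarray:
    """Project the words x documents matrix down to k dimensions per word."""
    # U: words x r, s: singular values (largest first), Vt: r x documents.
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    k = min(k, len(s))
    # Keep the k highest-variance directions; the discarded ones are the
    # least informative, redundant 'shadow' dimensions mentioned above.
    return U[:, :k] * s[:k]
```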
Huh? What’s a cosine between vectors?
• They probably forgot to mention in your Grade 9 trigonometry class (as they did in mine) that cosine is extensible to dimensions above 2
– Typical teaching: always the special case, never the
general.
• The dot product of two vectors is the sum of the products of corresponding entries in the two vectors: i.e. (x1*x2) + (y1*y2) + (z1*z2), for two vectors of length 3.
• The dot product of two vectors is the cosine of the angle
between those two vectors, multiplied by the lengths of
those vectors.
• Therefore, the cosine is the dot product divided by the product of the two vector lengths
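In code, the general-dimension cosine is exactly the recipe above: the dot product divided by the product of the two vector lengths.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# For vectors of length 3, np.dot computes (x1*x2) + (y1*y2) + (z1*z2).
print(cosine(np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0])))  # ~0.707
```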
How is the LSA model used?
• To get a measure of how related a word is to another word, measure the distance between the rows representing the two words.
– This gives you a measure of how different the contexts of the two words were: that is, how differently the two words were distributed across the contexts
• You can also take the distance between two
document vectors to get a measure of how related
they are.
• You can measure distance by taking the cosine
between two vectors
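Putting the pieces together, word-word relatedness is the cosine between the two word rows of the reduced matrix; reduced and word_index below refer to the illustrative sketches above, not to the authors' code:

```python
import numpy as np

def relatedness(word1: str, word2: str,
                reduced: np.ndarray, word_index: dict) -> float:
    """Cosine between the rows representing the two words."""
    v1 = reduced[word_index[word1]]
    v2 = reduced[word_index[word2]]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# On a large corpus we would expect relatedness("zebra", "dog", ...) to come
# out higher than relatedness("zebra", "computer", ...).
```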
What are some cool empirical findings?
• i.) LSA models can pass the TOEFL
• ii.) LSA can learn the meanings of words it has
never encountered
• iii.) LSA can explain some priming effects
• iv.) LSA replicates human number judgments
• v.) LSA can mark essays
• vi.) LSA-like measures predict LD RTs
i.) LSA models can pass the TOEFL
• On a 4-possibility multiple choice TOEFL,
the model got 51.5% correct (corrected for
guessing)
• Chance score is 25%
• Real foreigners hoping to attend American
universities averaged 52.7%
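The slide only reports the score, so the scoring procedure sketched here is an assumption: answer each item by picking the alternative whose vector has the highest cosine with the stem word.

```python
import numpy as np

def answer_item(stem: str, alternatives: list,
                reduced: np.ndarray, word_index: dict) -> str:
    """Pick the alternative most similar to the stem (hypothetical scorer)."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    stem_vec = reduced[word_index[stem]]
    return max(alternatives,
               key=lambda w: cos(stem_vec, reduced[word_index[w]]))
```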
ii.) LSA can learn the meanings of
words it had never encountered
• So can children!
• By substituting words with nonsense words and controlling access,
they showed that the model could learn the meanings of words it
had never encountered
• This replicated (and explained) an odd result which had been found with human children, and led to the estimate that most word knowledge is acquired inductively rather than directly.
• The result is not odd when you consider that the meaning of a
word is distributed across all vectors with which it shares contexts.
– You can learn a lot about lions, even if you have never heard of
them before, by knowing they are something like tigers.
iii.) LSA can explain some priming effects
• The model can explain some priming work
using homographs: i.e. testing for 'mole'
(the animal) versus 'mole' (the beauty
mark).
• If context is marked by word form (either phonological or orthographic), then these two words will indeed get overlapping contexts even though they are semantically different
iv.) LSA replicates human number judgments
• Previous work has shown that judgments about number size are best represented on the assumption that numbers are represented as the log of their values.
– That is, people ‘scale down’ large numbers: e.g. on a log scale, 2 and 3 are farther apart (ln 3 - ln 2 ≈ 0.41) than 8 and 9 are (ln 9 - ln 8 ≈ 0.12)
• LSA got the same representation using their contextual occurrences.
v.) LSA can mark essays
• LSA judgments of the quality of sentences correlate at r =
0.81 with expert ratings
• LSA can judge how good an essay (on a well-defined set
topic) is by computing the average distance between the
essay to be marked and a set of model essays
– The correlations are equal to between-human correlations
• “If you wrote a good essay and scrambled the words you
would get a good grade," Landauer said. "But try to get the
good words without writing a good essay!”
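A sketch of the essay-marking idea: score an essay by its average cosine to a set of model essays. Representing an essay as the average of its word vectors is an assumption made for this illustration; the paper's exact procedure may differ.

```python
import numpy as np

def essay_vector(text: str, reduced: np.ndarray, word_index: dict) -> np.ndarray:
    """Represent an essay as the average of its known word vectors (assumed)."""
    vecs = [reduced[word_index[w]] for w in text.lower().split() if w in word_index]
    return np.mean(vecs, axis=0)

def essay_score(essay: str, model_essays: list,
                reduced: np.ndarray, word_index: dict) -> float:
    """Average cosine between the essay and each model essay."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    e = essay_vector(essay, reduced, word_index)
    return float(np.mean([cos(e, essay_vector(m, reduced, word_index))
                          for m in model_essays]))
```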
vi.) LSA-like measures predict LD RTs
• An LSA-like measure for single words can predict
human RTs in lexical decision
• We used 10 words each side of the target word as
a ‘document’ and got distances between all words
• Words close to their nearest neighbours are
recognized more quickly than words far away
from them, after controlling for other known
variables
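A sketch of how those 10-words-each-side ‘documents’ could be collected; this is an illustrative reconstruction of the windowing step, not the original code:

```python
def context_windows(words: list, target: str, radius: int = 10) -> list:
    """Collect the words within `radius` positions of each target occurrence."""
    windows = []
    for i, w in enumerate(words):
        if w == target:
            windows.append(words[max(0, i - radius):i] + words[i + 1:i + 1 + radius])
    return windows
```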
Is LSA psychologically plausible?
• Well, the above evidence suggests it might be, and that evidence is nicely consistent with much of our talk about mapping between schemas
• Neuro-philosopher Paul Churchland has written:
"Explanatory understanding consists of the activation of a
specific prototype vector in a well-trained network. It consists
in the apprehension of the problematic case as an instance of a
general type, a type for which the creature has a detailed and
well-informed representation. Such a representation allows
the creature to anticipate aspects of the case so far
unperceived, and to deploy practical techniques appropriate to
the case at hand."
Paul Churchland, A Neurocomputational Perspective: The Nature of Mind and the Structure of Science