School of Computing, Faculty of Engineering
NLP/CL Review: Eric Atwell, Language Research Group (with thanks to other contributors)

Objectives of the module
On completion of this module, students should be able to:
- understand the theory and terminology of empirical modelling of natural language;
- understand and use algorithms, resources and techniques for implementing and evaluating NLP systems;
- be familiar with some of the main language engineering application areas;
- appreciate why unrestricted natural language processing is still a major research task.

In a nutshell: NLP/CL Review
Why NLP is difficult: language is a complex system.
How to solve it: corpus-based machine-learning approaches.
Motivation: applications of "The Language Machine".

The main sub-areas of linguistics
◮ Phonetics and Phonology: the study of linguistic sounds, or speech.
◮ Morphology: the study of the meaningful components of words.
◮ Syntax: the study of the structural relationships between words.
◮ Semantics: the study of the meanings of words, phrases and sentences.
◮ Discourse: the study of linguistic units larger than a single utterance.
◮ Pragmatics: the study of how language is used to accomplish goals.

Python, NLTK, WEKA
Python: a good programming language for NLP
• Interpreted
• Object-oriented
• Easy to interface to other things (text files, the web, DBMSs)
• Takes good ideas from Java, Lisp, Tcl and Perl
• Easy to learn
• FUN!
NLTK: the Natural Language Toolkit for Python, with demos and tutorials.
WEKA: a machine-learning toolkit of classifiers, e.g. J48 decision trees.

Why is NLP difficult?
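As a small taste of why Python suits NLP prototyping, here is a minimal sketch (the example sentence is invented) that builds a word-frequency distribution using only the standard library, much as NLTK's FreqDist does:

```python
from collections import Counter

# Toy illustration: Python's batteries-included text handling is
# why it is popular for quick NLP experiments.
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()       # naive whitespace tokenization
freq = Counter(tokens)      # word-frequency distribution

print(freq.most_common(2))  # [('the', 3), ('sat', 2)]
```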
Ambiguity of grammar (PoS) and meaning: genuine newspaper headlines
• Iraqi Head Seeks Arms
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• Kids Make Nutritious Snacks
• British Left Waffles on Falkland Islands
• Red Tape Holds Up New Bridges
• Bush Wins on Budget, but More Lies Ahead
• Hospitals are Sued by 7 Foot Doctors
(Headlines leave out punctuation and function-words; see Lynne Truss, 2003, Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation.)

Computers are not brains
• There is evidence that much of language understanding is built into the human brain.
Computers do not socialize
• Much of language is about communicating with people.
Key problems:
• Representation of meaning and hidden structure
• Language presupposes knowledge about the world
• Language is ambiguous: a message can have many interpretations
• Language presupposes communication between people

The Role of Memorization
Children learn words quickly
• Around age two they learn about 1 word every 2 hours (roughly 9 words a day).
• Often they need only one exposure to associate a meaning with a word.
• They can make mistakes, e.g. overgeneralization: "I goed to the store."
• Exactly how they do this is still under study.
Adult vocabulary
• A typical adult knows about 60,000 words; literate adults about twice that.
But there is too much to memorize!
Example derivation chain: establish, establishment, disestablishment, antidisestablishment, antidisestablishmentarian, antidisestablishmentarianism (a political philosophy opposed to the separation of church and state, i.e. in favour of keeping the Church of England as the official state church).
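The PoS ambiguity behind headlines like "Teacher Strikes Idle Kids" can be made concrete with a toy sketch; the mini-lexicon and its tag sets below are invented for illustration, not taken from a real tagger:

```python
# Hypothetical toy lexicon: each word maps to its possible PoS tags.
lexicon = {
    "teacher": ["NOUN"],
    "strikes": ["VERB", "NOUN"],  # "strikes": walks out, or hits?
    "idle":    ["ADJ", "VERB"],   # "idle": lazy, or to make idle?
    "kids":    ["NOUN", "VERB"],  # "kids": children, or teases?
}

headline = "teacher strikes idle kids".split()

# The number of distinct tag sequences grows multiplicatively
# with each ambiguous word.
readings = 1
for word in headline:
    readings *= len(lexicon[word])

print(readings)  # 1 * 2 * 2 * 2 = 8 possible tag sequences
```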
MAYBE we don't remember every word separately; MAYBE we remember MORPHEMES and how to combine them.

Rationalism v Empiricism
Rationalism: the doctrine that knowledge is acquired by reason without regard to experience (Collins English Dictionary).
Empiricism: the doctrine that all knowledge derives from experience (Collins English Dictionary).
The field was stuck for quite some time. The rationalist Noam Chomsky (1957, Syntactic Structures) argued that we should build models through introspection: a language model is a set of rules thought up by an expert, like "expert systems". Chomsky thought data was full of errors, so it was better to rely on linguists' intuitions; but linguistic models built for specific examples did not generalise.

A new approach started around 1990: Corpus Linguistics
• Well, not really new; but in the 1950s to 1980s researchers didn't have the text, disk space, or GHz.
• Main idea: machine learning from CORPUS data.
How to do corpus linguistics:
• Get a large text collection (a corpus; plural: several corpora).
• Compute statistical models over the words/PoS/parses/... in the corpus.
Surprisingly effective!

Example problem: using very large corpora for grammar checking
Which word to use: <principal> or <principle>?
Empirical solution: keep track of which words are the neighbours of each spelling in well-edited text, e.g.:
• Principal: "high school", as in "I am in my third year as the principal of Anamosa High School." or "School-principal transfers caused some upset."
• Principle: "rule", as in "This is a simple formulation of the quantum mechanical uncertainty principle." or "Power without principle is barren, but principle without power is futile." (Tony Blair)
At grammar-check time, choose the spelling best predicted by the probability of co-occurring with the surrounding words. No need to "understand the meaning"!?
Surprising results:
• Log-linear improvement even up to a billion words of training text!
• Getting more data is better than fine-tuning algorithms!
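The principal/principle neighbour-counting idea can be sketched as follows; the four-sentence "corpus" is invented and `choose()` is a hypothetical helper, but the scoring mirrors the co-occurrence approach described above:

```python
from collections import Counter

# Invented mini-corpus of "well-edited" sentences, labelled by which
# spelling each one uses.
corpus = [
    ("principal", "the principal of the high school retired"),
    ("principal", "school principal transfers caused upset"),
    ("principle", "the uncertainty principle is a rule of physics"),
    ("principle", "power without principle is barren"),
]

# Count the words that co-occur with each spelling.
neighbours = {"principal": Counter(), "principle": Counter()}
for spelling, sentence in corpus:
    for word in sentence.split():
        if word != spelling:
            neighbours[spelling][word] += 1

def choose(context):
    """Pick the spelling whose recorded neighbours best match the context."""
    scores = {s: sum(c[w] for w in context.split())
              for s, c in neighbours.items()}
    return max(scores, key=scores.get)

print(choose("our high school hired"))  # principal
```

No "understanding" is involved: the context words alone decide, which is the point of the slide.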
The Effects of LARGE Datasets
From Banko & Brill, 2001. Scaling to Very Very Large Corpora for Natural Language Disambiguation, Proc ACL.

Corpus, word tokens and types
Corpus: text selected by language, genre, domain, ...
Examples: Brown, LOB, BNC, Penn Treebank, MapTask, CCA, ...
Corpus annotation: text headers, PoS, parses, ...
Corpus size is the number of words; it depends on tokenisation.
We can count word tokens, word types, and the type-token distribution.
A lexeme/lemma is the "root form", as opposed to its inflections (be v am/is/was ...).

Tokenization and Morphology
Tokenization: by whitespace, or by regular expressions.
Problem cases: It's, data-base, New York, ...
"Jabberwocky" shows that we can break even nonsense words into morphemes.
Morpheme types: root/stem, affix, clitic
• Derivational vs. inflectional
• Regular vs. irregular
• Concatenative vs. templatic (root-and-pattern)

Corpus word-counts and n-grams
FreqDist counts of tokens and their distribution can be useful:
• e.g. find the main characters in Gutenberg texts;
• e.g. compare word-lengths in different languages.
A human can predict the next word ...
N-gram models are based on counts in a large corpus.
They can auto-generate a story ...
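The token/type counts and the "predict the next word" idea above can be sketched with a bigram model over an invented mini-corpus:

```python
from collections import Counter, defaultdict

# Invented mini-corpus; a real n-gram model would be trained
# on millions of words.
corpus = "the cat sat on the mat the cat ran".split()

tokens = len(corpus)       # word tokens: 9
types = len(set(corpus))   # word types: 6

# Bigram counts: how often each word2 follows each word1.
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def predict(word):
    # Most frequent continuation seen in the corpus.
    return bigrams[word].most_common(1)[0][0]

print(tokens, types)   # 9 6
print(predict("the"))  # 'cat' follows 'the' twice, 'mat' once
```

Chaining `predict()` from a seed word is the story-generation trick mentioned above, and it shows the weakness too: the chain keeps choosing the single most likely continuation.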
(but it gets stuck in a local maximum)
Morphological analysers: Porter stemmer, Morphy, PC-Kimmo
Morphology by lookup: CatVar, CELEX, OALD++

Word-counts follow Zipf's Law
Zipf's law applies to a word type-token frequency distribution: frequency is proportional to the inverse of the rank in a ranked list.
f * r = k, where f is frequency, r is rank, and k is a constant.
That is: a few very common words, a small-to-medium number of middle-frequency words, and a long tail of words with frequency 1 (or 0).
Chomsky argued against corpus evidence as it is finite and limited compared to introspection; Zipf's law shows that many words/structures appear only once, or not at all, in a given corpus, which arguably supports the argument that corpus evidence is limited compared to introspection.

Kilgarriff's Sketch Engine
Sketch Engine shows a Word Sketch, a list of collocates: words co-occurring with the target word more frequently than predicted by independent probabilities.
A lexicographer can colour-code groups of related collocates, indicating different senses or meanings of the target word.
With a large corpus the lexicographer should find all current senses, better than relying on intuition/introspection.
It has a large user-base of experience, and has been used in the development of several published dictionaries for English.
For minority languages with few existing corpus resources, Sketch Engine is combined with Web-Bootcat to enable lexicographers to collect their own Web-as-Corpus.

Parts of Speech
Parts of Speech group words into grammatical categories ... and separate the different functions of a word.

Training and Testing of Machine Learning Algorithms
Algorithms that "learn" from data see a set of examples and try to generalize from them.
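Zipf's relation f * r = k can be checked numerically; the frequency list below is synthetic, chosen to be exactly Zipfian for illustration (real corpus counts only approximate it):

```python
# Frequencies of the 6 most common word types, ranked 1..6.
# Synthetic data: each is k / rank with k = 1200.
ranked_freqs = [1200, 600, 400, 300, 240, 200]

# Under Zipf's law, frequency * rank should be (roughly) constant.
products = [f * r for r, f in enumerate(ranked_freqs, start=1)]
print(products)  # [1200, 1200, 1200, 1200, 1200, 1200]
```

The long tail is the flip side of the same law: with k fixed, high ranks force frequencies down towards 1, which is why so many types occur only once in any finite corpus.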
In English, many words are ambiguous: they have 2 or more possible PoS-tags.
A very simple tagger: tag everything as NN.
Better PoS-taggers: unigram, bigram, trigram, Brill, ...

Training set:
• the examples the algorithm is trained on
Test set:
• also called held-out data or unseen data
• use this for testing your algorithm
• it must be separate from the training set; otherwise, you cheated!
"Gold Standard"
• A test set that a community has agreed on and uses as a common benchmark; use it for final evaluation.

Grammar and Parsing
Context-Free Grammars and Constituency
Parse-trees show the syntactic structure of sentences.
Key constituents: S, NP, VP, PP.
You can draw a parse-tree and a corresponding CFG.
Some common CFG phenomena for English:
• sentence-level constructions
• NP, PP, VP
• coordination
• subcategorization

Problems with context-free grammars and parsers
• Coordination: "X -> X and X" is a meta-rule, not a strict CFG rule.
• Agreement: needs duplicate CFG rules for singular/plural etc.
• Subcategorization: needs separate CFG non-terminals for transitive/intransitive/...
• Movement: the object/subject of a verb may be "moved" in questions.
• Dependency parsing captures deeper semantics, but is harder.

Top-down and Bottom-up Parsing
• Parsing: top-down v bottom-up v combined.
• Ambiguity causes backtracking, so a CHART PARSER stores partial parses.
Parsing sentences left-to-right: The horse raced past the barn
[S [NP [A the] [N horse] NP] [VP [V raced] [PP [I past] [NP [A the] [N barn] NP] PP] VP] S]

Chunking or Shallow Parsing
Break text up into non-overlapping contiguous subsets of tokens.
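The unigram tagger with an "everything is NN" backoff, and its evaluation on a held-out test set, can be sketched as follows; the tagged sentences are hand-made and `tag()` is a hypothetical helper, not NLTK's API:

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged "corpus" (training set).
train = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
         ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]

# Count tags per word.
counts = defaultdict(Counter)
for word, t in train:
    counts[word][t] += 1

def tag(word):
    # Unigram tagger: the most frequent tag seen in training;
    # back off to the "everything is NN" baseline for unseen words.
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NN"

# Evaluate on a separate held-out test set, never on the training data.
test = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")]
correct = sum(tag(w) == t for w, t in test)
print(correct, "of", len(test))  # 'runs' is unseen, guessed NN: 2 of 3
```

Scoring on `train` instead of `test` would report 100% accuracy, which is exactly the "you cheated!" case the slide warns about.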
Shallow parsing or chunking is useful for:
• Entity recognition: people, locations, organizations.
• Studying linguistic patterns, e.g. subcategorization frames: "gave NP", "gave up NP in NP", "gave NP NP", "gave NP to NP".
• Prosodic phrase breaks: pauses in speech.
It can ignore complex structure when that structure is not relevant.
Chunking can be done via regular expressions over PoS-tags.
Garden-path example: The horse raced past the barn fell
[S [NP [NP [A the] [N horse] NP] [VP [V raced] [PP [I past] [NP [A the] [N barn] NP] PP] VP] NP] [VP [V fell] VP] S]

Information Extraction
Partial parsing gives us NP chunks ...
IE: Named Entity Recognition: people, places, companies, dates etc.
In a cohesive text, some NPs refer to the same thing/person.
Needed: an algorithm for NP coreference resolution; e.g. "Hudda", "ten of hearts", "Mrs Anthrax", "she" all refer to the same Named Entity.
• A desirable skill (Google, M$)

Semantics: Word Sense Disambiguation
e.g. mouse (animal / PC-interface)
• It's a (very) hard task.
• Humans are very good at it; computers are not.
• An active field of research for over 50 years.
• Mistakes in disambiguation have negative results.
• Beginning to be of practical use.

Machine learning v cognitive modelling
NLP has been successful using ML from data, without "linguistic / cognitive models".
Supervised ML: given labelled data (e.g. PoS-tagged text to train a PoS-tagger, to tag new text in the style of the training text).
Unsupervised ML: no labelled data (e.g. clustering words with "similar contexts" gives PoS-tag categories).
Unsupervised ML is harder, but increasingly successful!

NLP applications
Machine Translation
Information Retrieval (Google, etc.)
Detecting Terrorist Activities
Localization: adapting text (e.g. ads) to the local language
Information Extraction
Understanding the Quran ...
For more, see The Language Machine.

And Finally ...
Any final questions?
Feedback please (e.g. email me).
Good luck in the exam! Look at past exam papers, BUT note the changes in topics covered.
And if you do use NLP in your career, please let me know ...
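Chunking via regular expressions over PoS-tags can be sketched like this; the tagged sentence and the NP pattern (an optional determiner followed by one or more nouns) are simplified for illustration:

```python
import re

# Hand-tagged sentence (simplified tag set).
tagged = [("the", "DT"), ("horse", "NN"), ("raced", "VBD"),
          ("past", "IN"), ("the", "DT"), ("barn", "NN")]

# Encode the tag sequence as a string so a regex can scan it.
tag_string = " ".join(t for _, t in tagged)   # "DT NN VBD IN DT NN"
np_pattern = r"(DT )?(NN ?)+"                 # NP = optional det + nouns

chunks = []
for m in re.finditer(np_pattern, tag_string + " "):
    # Map the character match back to token positions.
    start = len(tag_string[:m.start()].split())
    length = len(m.group().split())
    chunks.append(" ".join(w for w, _ in tagged[start:start + length]))

print(chunks)  # ['the horse', 'the barn']
```

Note that this flat NP-spotting never has to decide whether "raced" heads a main verb phrase or a reduced relative clause, which is how chunking sidesteps the garden-path ambiguity above.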