Natural Language Processing
>> Error Checking <<
Winter/fall term 2015/2016, 41.4268
Prof. Dr. Bettina Harriehausen-Mühlbauer
Univ. of Applied Sciences, Darmstadt, Germany
https://www.fbi.h-da.de/organisation/personen/harriehausenmuehlbauer-bettina.html
[email protected]

Error Checking

Electronic grammars ... precisely describe the central, well-defined grammatical constructions of a language ...
• ??? well-defined ???
• Chomsky (1957): ... "parse all and only the grammatical constructions of language ..."

Question: How do we define grammaticality?
• Utterances of native speakers?
• Structural descriptions in our descriptive grammars?
• Lexical descriptions in our (standard) dictionaries?
• ...

Error Checking

PLNLP: Programming Language for Natural Language Processing – George Heidorn (IBM, Microsoft)
PEG: Penelope English Grammar – Karen Jensen (IBM, Microsoft)
http://www.amazon.de/Natural-Language-Processing-InternationalEngineering/dp/0792392795

Error Checking

Error correction: a question of grammaticality
Is grammaticality boolean? ± grammaticality? Is something either correct or wrong?

Types of errors:
(a) The children climbs over the fence. (dis/agreement)
(b) Either of the manuscripts are fine. (dis/agreement)
(c) ... between you and I ... (case)

Error Checking

"Can you have it translated into English please and have this forwarded to Paul and I?"
"Bettina - can you let Yunsun and I know if options 1 or 2 are possible asap? Only if not, perhaps option 3 can be followed."
(email correspondence with a native English speaker)

Error Checking

Types of errors (continued):
(d) He stated the fact and why they succeeded. (non-/parallel construction)
(e) We want to very carefully test the idea. (split infinitive)
(f) The big, old, dirty, ugly, hungry, ... dog (multiple conjoined adjectives)
(g) The mouse, which the cat, which sat under the table, which was ... (embedded relative clauses; grammar? style?)
(h) Would you like another glass of soda. (incorrect sentence delimiter)
(i) He said that (maybe) he (maybe) would (maybe) come (maybe). (word order)
(j) Colourless green ideas sleep furiously. (syntax! semantics?)
(...) lexical errors (typos); spelling reform in various European countries

Error Checking: the question of grammaticality

Always remember: the computer is ultimately patient – in contrast to the human brain.
• Computationally speaking, language is rather vivid, and we let the data dictate the form of the grammar (insofar as possible) rather than trying to develop rules as a model of any current linguistic theory.
• We do distinguish between various types of errors, i.e. we regard errors as positioned somewhere on a continuum.
• The rules treat different types of errors differently.

The computational concept of grammaticality:
correct – somewhere in between – wrong

The tripartite organization of the grammar (Karen Jensen) – a small sketch of this control flow follows after the tree sample below:
1. Decoding rules are applied to the input string.
   - Success with exactly 1 parse: done.
   - Success with more than 1 parse: continue with 2.
   - Fail with 0 parses: continue with 3.
2. Encoding rules: a hierarchy (preference order) of the competing parses is computed.
3. Encoding rules: with the error flag off, a fitted parse is built; with the flag on, error detection / correction is performed.

Parsing: tree structures
Programming Language FOR Natural Language Processing

Samples of tree structures:

Enter a sentence or a PLNLP command: Linguistics is interesting.
----------------------------
DECL   NP1    NOUN1* "linguistics"
       VERB1* "is"
       AJP1   ADJ1*  "interesting"
       PUNC1  "."
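To make the tripartite control flow concrete, here is a minimal sketch in Python (not in PLNLP). All function names (decode, rank, fitted_parse, detect_errors, analyze) and the toy return values are illustrative assumptions, not part of the actual system.

# Minimal sketch of the tripartite control flow: 1. decoding rules yield
# 0, 1 or more parses; 2. with several parses a preference hierarchy is
# computed; 3. with no parse, a fitted parse is built (error flag off) or
# error detection / correction is attempted (error flag on).

def decode(tokens):
    """Toy stand-in for the decoding (parsing) rules."""
    if tokens == ["linguistics", "is", "interesting", "."]:
        return [("DECL", "NP(linguistics) VP(is AJP(interesting))", 0.9)]
    return []                                    # core rules fail: 0 parses

def rank(parses):
    """Toy stand-in for the encoding rules that order competing parses."""
    return sorted(parses, key=lambda p: p[2], reverse=True)

def fitted_parse(tokens):
    """Toy stand-in for the fitted-parse fallback."""
    return ("XXXX", " ".join(tokens), 0.0)

def detect_errors(tokens):
    """Toy stand-in for reparsing with relaxed rules and error flags."""
    return ("DECL", " ".join(tokens), 0.5), ["GRAMMATICAL ERROR"]

def analyze(tokens, nlperrs=False):
    parses = decode(tokens)                      # 1. decoding rules
    if len(parses) > 1:
        return rank(parses)[0], []               # 2. hierarchy of parses
    if len(parses) == 1:
        return parses[0], []
    if nlperrs:                                  # 3. encoding rules
        return detect_errors(tokens)             #    flag on: error correction
    return fitted_parse(tokens), []              #    flag off: fitted parse

print(analyze(["linguistics", "is", "interesting", "."]))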
Computed Trees

Linguistics is interesting.

(prtreer 1)
DECL1 (7200)
  VP1 (5080)
    NP2 (3000)
      NOUN1 "linguistics"
    VP2 (4090)
      VP3 (4000)
        VERB1 "is"
      AJP1 (2100)
        ADJ1 "interesting"
  PUNC1 (975)
    PUNC2 "."

Computed Trees – Problem

The structure of the computed trees (the rule structure) is based on binary rules.

Sample: the recursive rule ADJ + NP
----------------------------------------------------------------------
the
the sad
the sad, brown
the sad, brown, dirty
the sad, brown, dirty dog
the sad, brown, dirty dog.

(prtreer np1)
NP1 (3050)
  AJP5 (2100)
    ADJ1 "the"
  NP2 (3050)
    AJP6 (6220)
      AJP2 (2100)
        ADJ2 "sad"
      AJP7 (6210)
        CONJ1 (980)
          PUNC3 (975)
            PUNC4 ","
        AJP3 (2230)
          AJP8 (2100)
            ADJ3 "brown"
          PUNC1 (975)
            PUNC5 ","
    NP3 (3050)
      AJP9 (2100)
        ADJ4 "dirty"
      NP4 (3000)
        NOUN1 "dog"

Computed Trees: PS-tree

[Graphical phrase-structure tree for NP1 "the sad, brown, dirty dog", corresponding to the (prtreer np1) output above.]

Record Structure

(prtrec np1)
SEGTYPE STR RULES RULE COPYOF BASE POS INDIC PRMODS PRMODS PRMODS HEAD VERB ADV CLOSED NODENAME
'NP' "the sad, brown, dirty dog" 3000 3050 3050 3050 3050 AJP5 NP2 NP2 "sad, brown, dirty dog" 'DOG' VERB NOUN ADV SING PERS3 ART DET DEF DET1 "the" AJP1 "sad, brown," AJP4 "dirty" NOUN1 "dog" REC "dog" REC "dog" 1 'NP1'
Value=NIL

[Diagram: an NP consisting of PRMODS and a HEAD, spanning "the sad, brown, dirty dog".]

Record Structure – Syntax
(compare LFG: f-structures)

Left side: attribute names; right side: values.
Values can either be simple or complex; many of the values are records themselves (pointers!).
At least the following 5 attributes are needed to build a parse structure tree: PRMODS, HEAD, PSMODS, SEGTYPE and STR.
The remaining attributes show that the record contains much more information than is displayed in the parse structure tree:
• RULES = derivational history of the parse (= order of the applied rules)
• INDIC = features from the dictionary / from the word entries (link to the electronic dictionary)
• BASE = the (base) form of the word entry (from the electronic dictionary)
• POS = part of speech
• functional information, such as SUBJECT, DIROBJ, INDOBJ, is added to the record by grammar rules
(A minimal illustration of such records follows after sample (2) below.)

Record Structure – Syntax – Sample (1)

Linguistics is interesting.

Enter a sentence or a PLNLP command:
(prtrec 1)
SEGTYPE SEGTYP2 STR RULES RULE COPYOF BASE DICT POS INDIC PRMODS HEAD PSMODS PSMODS FRSTV SUBJECT PREDADJ COPL CLOSED PARSENO SENTYPE TOPIC NODENAME
'SENT' 'DECL' "linguistics is interesting" 4000 4090 5080 7200 7200 REC VP1 PUNC1 VP1 "linguistics is interesting" 'BE' 'is' VERB SING PRES THATCOMP PERS3 NP1 "linguistics" VERB1 "is" AJP1 "interesting" PUNC1 "." VERB1 "is" NP1 "linguistics" AJP1 "interesting" 1 1 1 'DECL' NP1 "linguistics" 'DECL1'
Value=NIL

Record Structure – Syntax – Sample (2)

Linguistics is interesting.

Enter a sentence or a PLNLP command:
(prtrec np1)
SEGTYPE STR RULES RULE COPYOF BASE DICT INDIC PRMODS HEAD NODENAME
'NP' "linguistics" 3000 3000 NOUN1 NP2 "linguistics" 'LINGUISTICS' 'linguistics' SING PERS3 NP1 "linguistics" NOUN1 "linguistics" 'NP1'
Value=NIL
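As a rough illustration of the record structure described above, the following Python sketch (not PLNLP) models records as attribute-value structures whose values may point to other records, and prints a parse-structure tree from SEGTYPE, STR, PRMODS, HEAD and PSMODS alone. The Record class and print_tree helper are hypothetical; in the real system HEAD is a single record rather than a list.

# Hypothetical sketch of attribute-value records with record-valued
# attributes (pointers). Only SEGTYPE, STR, PRMODS, HEAD and PSMODS are
# needed to print the parse-structure tree.

class Record(dict):
    """An attribute-value record; record-valued attributes act as pointers."""

def print_tree(rec, depth=0):
    """Print a parse-structure tree using only SEGTYPE, STR, PRMODS, HEAD, PSMODS."""
    print("  " * depth + rec["SEGTYPE"] + '  "' + rec["STR"] + '"')
    for child in rec.get("PRMODS", []) + rec.get("HEAD", []) + rec.get("PSMODS", []):
        print_tree(child, depth + 1)

noun = Record(SEGTYPE="NOUN", STR="linguistics", BASE="LINGUISTICS",
              POS="NOUN", INDIC=["SING", "PERS3"])
np   = Record(SEGTYPE="NP",   STR="linguistics", HEAD=[noun])
verb = Record(SEGTYPE="VERB", STR="is", BASE="BE", INDIC=["SING", "PRES"])
adj  = Record(SEGTYPE="ADJ",  STR="interesting", BASE="INTERESTING")
ajp  = Record(SEGTYPE="AJP",  STR="interesting", HEAD=[adj])
vp   = Record(SEGTYPE="VP",   STR="is interesting", HEAD=[verb], PSMODS=[ajp])
sent = Record(SEGTYPE="SENT", STR="linguistics is interesting",
              SEGTYP2="DECL", PRMODS=[np], HEAD=[vp],
              SUBJECT=np)   # functional information, added by a grammar rule

print_tree(sent)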
Record Structure – Syntax – Sample (3)

Linguistics is interesting.

Enter a sentence or a PLNLP command:
(prtrec vp2)
SEGTYPE STR RULES RULE COPYOF BASE POS DICT INDIC HEAD PSMODS FRSTV PREDADJ COPL NODENAME
'VP' "is interesting" 4000 4090 4090 VP3 AJP1 VP3 "is" 'BE' VERB 'is' SING PRES AUX TRAN THATCOMP VERB1 "is" AJP1 "interesting" VERB1 "is" AJP1 "interesting" 1 'VP1'
Value=NIL

Record Structure – Syntax – Sample (4)

Linguistics is interesting.

Enter a sentence or a PLNLP command:
(prtrec vp3)
SEGTYPE STR RULES RULE COPYOF BASE DICT POS INDIC HEAD FRSTV COPL NODENAME
'VP' "is" 4000 4000 VERB1 VERB1 "is" 'BE' 'is' VERB SING PRES AUX PERS3 VERB1 "is" VERB1 "is" 1 'VP3'
Value=NIL

Record Structure – Syntax – Sample (5)

Linguistics is interesting.

Enter a sentence or a PLNLP command:
(prtrec ajp1)
SEGTYPE STR RULES RULE COPYOF BASE DICT POS INDIC HEAD VERB ADJPRED NODENAME
'AJP' "interesting" 2100 2100 ADJ1 ADJ1 "interesting" 'INTERESTING' 'interesting' VERB ADJ PRESPART TRAN THATCOMP ADJ1 "interesting" REC "interesting" 1 'AJP1'
Value=NIL

Record Structure – Syntax – Sample (6)

Linguistics is interesting.

Enter a sentence or a PLNLP command:
(prtrec verb1)
SEGTYPE STR COPYOF BASE DICT POS INDIC COPL NODENAME
'VERB' "is" REC "is" 'BE' 'is' VERB SING PRES AUX PERS3 1 'VERB1'
Value=NIL

Record Structure – Syntax – Sample (7) – it works just the same way for any other language

DEN MANN FÄHRT DIE FRAU.
(German, object-initial word order: "The woman is driving the man.")

(prtrec 1)
SEGTYPE SEGTYP2 STR RULES RULE COPYOF BASE DICT INDIC PRMODS HEAD PSMODS PSMODS SUBJECT SING DIROBJ PARSENO NODENAME
'SENT' 'DECL' "Den Mann fährt die Frau." 2010 2700 2430 2870 2870 REC VP1 PUNC1 VP1 "Den Mann fährt die Frau" 'FAHREN' 'fährt' PRES P3 NI NA NPZ NDP NP1 "Den Mann" VERB1 "fährt" NP2 "die Frau" PUNC1 "." "die Frau" 1 NP1 "Den Mann" 1 'DECL1'
Value=NIL

Multiple Parses – Disambiguation

Goals:
• reduce the number of analyses
• "compute" an order in which to prefer the analyses:
• modify the rules in such a way that only the syntactically meaningful structures "survive"
• limit certain ambiguities, e.g. PP-attachment

How is this done? Multiple parses are often normal: when 2 different syntactic analyses of a sentence are correct, we cannot suppress either of the alternatives until we have additional information at hand.
Default: local attachment. (A small illustration of this preference follows below.)
New: parsing of dictionary entries and using "hidden" information.
(Jensen/Binot: Disambiguating Prepositional Phrase Attachments by Using On-line Dictionary Definitions)
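The default preference for local attachment can be illustrated with a toy metric. The sketch below (Python, not PLNLP) penalises each attachment by its distance from the head it modifies; it is not the actual P-METRIC of the PEG grammar, and the function parse_metric is a hypothetical placeholder.

# Toy preference metric: closer ("local") attachment scores higher.
def parse_metric(attachments):
    """attachments: list of (modifier, head, distance_in_words) triples."""
    penalty = sum(distance - 1 for _, _, distance in attachments)
    return 1.0 / (1.0 + penalty)            # higher is better

# "He watched the man with the telescope."
np_attachment = [("with the telescope", "man", 1)]       # PP attached to the noun (local)
vp_attachment = [("with the telescope", "watched", 3)]   # PP attached to the verb (non-local)

candidates = {"NP attachment": np_attachment, "VP attachment": vp_attachment}
for name, att in sorted(candidates.items(),
                        key=lambda kv: parse_metric(kv[1]), reverse=True):
    print(name, round(parse_metric(att), 2))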
Multiple Parses

PLEASE REFRAIN FROM TAKING PICTURES
-----------------------------------
IMPR1  AVP1   ADV1*  "please"
       VERB1* "refrain"
       PP1    PREP1  "from"
              AJP1   ADJ1*  "taking"
              NOUN1* "pictures"
       PUNC1  "."
P-METRIC=0.21
-----------------------------------
IMPR2  AVP1   ADV1*  "please"
       VERB1* "refrain"
       PP2    PREP1  "from"
              VERB2* "taking"
              NP1    NOUN1* "pictures"
       PUNC1  "."
P-METRIC=0.20001

Record Structure – PARSENO1

(prtrec 1)
SEGTYPE SEGTYP2 STR RULES RULE COPYOF BASE POS INDIC PRMODS HEAD PSMODS PSMODS FRSTV SUBJECT NOUN OBJTPREP YOUPLEAS PARSENO SENTYPE TOPIC METRIC NODENAME
'SENT' 'IMPR' "please refrain from taking pictures" 4000 4250 4240 7200 7200 REC VP1 PUNC1 VP1 "please refrain from taking pictures" 'REFRAIN' VERB NOUN SING PLUR PRES INF AVP1 "please" VERB1 "refrain" PP1 "from taking pictures" PUNC1 "." VERB1 "refrain" REC1 "" REC "refrain" 'FROM' 1 1 'IMPR' REC1 "" 0.21 'IMPR1'
Value=NIL

Record Structure – PARSENO2

(prtrec 2)
SEGTYPE SEGTYP2 STR RULES RULE COPYOF BASE POS INDIC PRMODS HEAD PSMODS PSMODS FRSTV SUBJECT NOUN OBJTPREP YOUPLEAS PARSENO SENTYPE TOPIC METRIC NODENAME
'SENT' 'IMPR' "please refrain from taking pictures" 4000 4250 4240 7200 7200 REC VP2 PUNC1 VP2 "please refrain from taking pictures" 'REFRAIN' VERB NOUN SING PLUR PRES INF AVP1 "please" VERB1 "refrain" PP2 "from taking pictures" PUNC1 "." VERB1 "refrain" REC2 "" REC "refrain" 'FROM' 1 2 'IMPR' REC2 "" 0.20001 'IMPR2'
Value=NIL

Multiple Parses – tree structures

(prtreer 1)
IMPR1 (7200)
  VP1 (4240)
    AVP1 (2260)
      ADV1 "please"
    VP3 (4250)
      VP4 (4000)
        VERB1 "refrain"
      PP1 (2360)
        PREP1 "from"
        NP2 (3050)
          AJP2 (2100)
            ADJ1 "taking"
          NP1 (3000)
            NOUN1 "pictures"
  PUNC1 (975)
    PUNC2 "."
Value=NIL
("taking pictures" with "taking" as an adjective: pictures, that are taking)

(prtreer 2)
IMPR2 (7200)
  VP2 (4240)
    AVP1 (2260)
      ADV1 "please"
    VP5 (4250)
      VP4 (4000)
        VERB1 "refrain"
      PP2 (2380)
        PREP1 "from"
        PRPRTCL1 (2690)
          VP6 (4050)
            VP7 (4000)
              VERB2 "taking"
            NP1 (3000)
              NOUN1 "pictures"
  PUNC1 (975)
    PUNC2 "."
Value=NIL
("taking pictures" as picture taking = to take a photo)

Fitted Parse

Fitted parsing is used when the core rules of the grammar cannot compute a complete parse structure (S node / final VP node).

Fitted parses are ideal to
• debug your grammar
• point to errors in your grammar
• ...

Frequent mistakes:
• sentence fragments
• fixed expressions
• extreme ellipses
• coherency problems (for a sample see Jensen)
• ...

Fitted Parse

Procedure:
1. The system attempts to interpret every input string as a sentence.
2. In case the final (S-) rule cannot be applied, the string is analyzed by a set of encoding rules which represent the fitting routine.
3. The fitting routine attempts to find plausible constituents.
4. Order in which the constituents are preferred:
   (a) imperatives
   (b) VP with a subject
   (c) tensed VP without a subject
   (d) non-finite clause
   (e) phrases without a verb
   (f) others

In words (a sketch of this procedure follows below):
1. The grammar looks for a record with segtype "IMPR". If that fails ->
2. the grammar looks for a record with segtype "VP" and a SUBJECT attribute in the VP record. If that fails ->
3. etc. (until (f)).
   If the grammar finds a segment record which covers the entire input string -> success! If that fails ->
4. the largest possible record is chosen and declared to be the HEAD. Then ->
5. the other records are added as pre- and postmodifying constituents (lists).
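A minimal sketch of the fitting routine, under the assumption that the core rules have already produced a set of partial records with word-offset spans. It is written in Python rather than PLNLP; PREFERENCE, fitted_parse and the record fields are illustrative names, not the system's own.

# Hypothetical sketch of the fitted-parse procedure described above.
PREFERENCE = ["IMPR", "VP_WITH_SUBJECT", "TENSED_VP", "NONFINITE_CLAUSE",
              "VERBLESS_PHRASE", "OTHER"]

def fitted_parse(records, input_length):
    """records: list of dicts with 'segtype', 'start' and 'end' (word offsets)."""
    def pref(rec):
        seg = rec["segtype"]
        return PREFERENCE.index(seg) if seg in PREFERENCE else len(PREFERENCE)

    # steps 1-3: walk through the preference order (a)-(f) and look for a
    # record that covers the entire input string
    for rec in sorted(records, key=pref):
        if rec["start"] == 0 and rec["end"] == input_length:
            return {"head": rec, "prmods": [], "psmods": []}

    # step 4: otherwise the largest record is declared to be the HEAD ...
    head = max(records, key=lambda r: r["end"] - r["start"])
    # step 5: ... and the remaining records become pre-/postmodifiers
    return {"head": head,
            "prmods": [r for r in records if r["end"] <= head["start"]],
            "psmods": [r for r in records if r["start"] >= head["end"]]}

# "that he would come ." - no record covers the whole input
segments = [{"segtype": "OTHER", "start": 0, "end": 4},    # NP "that he would come"
            {"segtype": "OTHER", "start": 4, "end": 5}]    # "."
print(fitted_parse(segments, 5))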
Fitted Parse

Sample of a fitted parse: 1. fragment: subordinate clause

that he would come.
-------------------------
XXXX1  NP1*   COMPL1  "that"
              NP2     PRON1* "he"
              VERB1   "would"
              VERB2*  "come"
       PUNC1  "."
-------------------------

(prtrec np1)
SEGTYPE STR RULES RULE COPYOF BASE POS INDIC PRMODS PRMODS PRMODS HEAD FRSTV SUBJECT NOUN ADJ COPL AUX1 AUXFORM NODENAME
'NP' "that he would come" 4000 4040 5080 3390 3390 REC COMPL1 VP1 VP1 "he would come" 'COME' NOUN VERB ADJ SING PAST AUX INDCNAM2 PERS3 COMPL1 "that" NP2 "he" VERB1 "would" VERB2 "come" VERB1 "would" NP2 "he" REC "come" REC "come" 1 1 1 'NP1'
Value=NIL

Error Correction

In Natural Language Processing, one needs to anticipate errors before the machine can handle them. The machine is NOT capable of detecting errors without the help of the human (grammarian).

How is this done? Error statistics!
(For German: Goethe Institute, Munich)

Error Correction

In PLNLP, certain flags can be set
• by which we can correct errors in the input string (intelligent text processing beyond spell-checking).

Question: What do these flags look like?

Sample rule (shortened):
NP (conditions, ...)
VP (conditions, ..., <PERSNUMB(NP).AGREE.PERSNUMB|SETERR<'PERS1'>>)
-> VP (..., SUBJECT=NP, ...)

Read: either the PERSNUMB attribute of the NP agrees with the PERSNUMB attribute of the current node (here: VP), OR (|) the error attribute "PERS1" is set to true.
(A minimal sketch of this relaxed condition follows after the German sample below.)

Error Correction

Enter a sentence or a PLNLP command: Der Mann arbeiten mit der Frau.
(German, with a subject-verb agreement error: "The man work with the woman.")
--------------------------
XXXX1  NP1    DET1   ADJ1*  "Der"
              NOUN1* "Mann"
       VP1*   VERB1* "arbeiten"
              PP1    PREP1  "mit"
                     DET2   ADJ2*  "der"
                     NOUN2* "Frau"
       PUNC1  "."
--------------------------

(setq nlperrs 1)

Enter a sentence or a PLNLP command: Der Mann arbeiten mit der Frau.
--------------------------
DECL1  NP1    DET1   ADJ1*  "Der"
              NOUN1* "Mann"
       VERB1* "arbeiten"
       PP1    PREP1  "mit"
              DET2   ADJ2*  "der"
              NOUN2* "Frau"
       PUNC1  "."

GRAMMATICAL ERROR IN SENTENCE 3.
Der Mann arbeiten mit der Frau.
CONSIDER: Der Mann arbeitet mit der Frau.
CHANGE THE VERBAL INFLECTION
--------------------------
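The relaxed agreement condition of the sample rule can be sketched as follows (Python, not PLNLP). The function combine_np_vp and the feature encoding are hypothetical; only PERSNUMB, PERS1 and the nlperrs flag are taken from the rule and the session above.

# Hypothetical sketch of an error-tolerant NP + VP rule: agreement either
# holds, or - when the error flag is switched on - the error attribute
# PERS1 is set on the resulting record so a correction can be proposed.

def combine_np_vp(np, vp, nlperrs=False):
    """NP + VP -> VP(SUBJECT=NP), with an optionally relaxed agreement condition."""
    result = dict(vp, SUBJECT=np, ERRORS=[])
    if np["PERSNUMB"] == vp["PERSNUMB"]:
        return result                        # condition holds: normal parse
    if nlperrs:                              # flag on: accept, but set the error attribute
        result["ERRORS"].append("PERS1")
        return result
    return None                              # flag off: the rule fails, no full parse

np = {"STR": "Der Mann", "PERSNUMB": "SING3"}
vp = {"STR": "arbeiten mit der Frau", "PERSNUMB": "PLUR3"}

print(combine_np_vp(np, vp))                 # None -> only a fitted parse is possible
print(combine_np_vp(np, vp, nlperrs=True))   # parse succeeds, error attribute PERS1 is set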
Approximate Parse

Syntactic ambiguities – disambiguation of multiple parses

Approximate parse / syntactic ambiguities

Problem: multiple parses, i.e. too many analyses of an input string -> incorrect analyses.
Ad hoc solution: limit the application of rules by augmentations / conditions / restrictions on the electronic PS rules.
Goal: one parse for every input string.
Present problem: little (if any) semantic/pragmatic information is available. Only a pragmatic analysis, which considers world knowledge in the process of disambiguation, would result in a disambiguation of the structure.
Solution: an approximate parse is built, and alternative "attachment points" are marked in the tree by "?".

Ambiguities / Approximate Parse

HE WATCHED THE MAN WITH THE TELESCOPE.
------------------------------
DECL1  NP1    PRON1  "he"
       VERB1* "watched"
       NP2    DET1   ADJ1*  "the"
              NOUN1* "man"
     ? PP1    PREP1  "with"
              DET2   ADJ2*  "the"
              NOUN2* "telescope"
       PUNC1  "."

remember: eat with a fork vs. eat with bones

ER FÄHRT DEN MANN MIT DEM AUTO.
(German: "He is driving the man with the car.")
------------------------------
DECL1  NP1    PRON1  "Er"
       VERB1* "fährt"
       NP2    DET1   ADJ1*  "den"
              NOUN1* "Mann"
     ? PP1    PREP1  "mit"
              DET2   ADJ2*  "dem"
              NOUN2* "Auto"
       PUNC1  "."

Ambiguities / Approximate Parse

More examples:
(1) In both cases the PP is a time expression:

I read a report about the evolution in 10 minutes.
• Pragmatics / world knowledge: evolution needs more than 10 minutes.
• Syntax / semantics: * report + "in-PP" (okay: report + "about-PP"); * report + 2 PPs (about ..., in ...)

vs.

I read a report about the evolution in the last 10 million years.
• Pragmatics / world knowledge: "I" (a human being) cannot be 10 million years old.
• Syntax / semantics: * report + "in-PP" (okay: report + "about-PP"); * report + 2 PPs (about ..., in ...)

The past – the present – the future

The past:
• dreams, science fiction (limitations in technology, both on the HW and the SW side)
The present:
• applications like Penelope (NaLa)
The future:
• combination of rule-based algorithms and statistics
• mobile (NLP) applications