
Natural Language Processing
>> Error Checking <<
Winter term 2015/2016
41.4268
Prof. Dr. Bettina Harriehausen-Mühlbauer
Univ. of Applied Sciences, Darmstadt, Germany
https://www.fbi.h-da.de/organisation/personen/harriehausenmuehlbauer-bettina.html
[email protected]
Error Checking
Electronic grammars...
...precisely describe the central, well-defined grammatical
constructions of a language...
• ??? well-defined ???
• Chomsky (1957) ...“parse all and only the grammatical
constructions of language ...“
Question: How do we define grammaticality?
• Utterances of native speakers?
• Structural descriptions in our descriptive grammars?
• Lexical descriptions in our (standard) dictionaries?
• .....
Error Checking
PLNLP : Programming Language for Natural Language Processing
George Heidorn (IBM, Microsoft)
PEG : Penelope English Grammar
Karen Jensen (IBM, Microsoft)
http://www.amazon.de/Natural-Language-Processing-InternationalEngineering/dp/0792392795
Error Checking
Error correction: a question of grammaticality
Is grammaticality boolean?
Or is there ± grammaticality?
Is something simply correct or wrong?
Types of errors:
(a) The children climbs over the fence. (dis/agreement)
(b) Either of the manuscripts are fine. (dis/agreement)
(c) ...between you and I... (case)
Error Checking
Can you have it translated into English please and
have this forwarded to Paul and I?
Bettina - can you let Yunsun and I know if options
1 or 2 are possible asap? Only if not, perhaps
option 3 can be followed.
(email correspondence with a native English speaker)
Error Checking
Types of errors:
(d) He stated the fact and why they succeeded. (non-/parallel
construction)
(e) We want to very carefully test the idea. (split infinitives)
(f) The big, old, dirty, ugly, hungry,....dog (multiple conjoined adjectives)
(g) The mouse, which the cat, which sat under the table, which was...
(embedded relative clauses; grammar? style?)
(h) Would you like another glass of soda. (incorrect S-delimiter)
(i) He said that (maybe) he (maybe) would (maybe) come (maybe). (word order)
(j) Colourless green ideas sleep furiously. (syntax fine, semantics?)
...
(..) lexical errors (typos); spelling reforms in various European countries
Error Checking : Question of grammaticality
Always remember: the computer is infinitely
patient - in contrast to the human brain.
• Computationally speaking, language is rather lively, and we let the data
dictate the form of the grammar (insofar as possible), rather than try to
develop rules as a model of any current linguistic theory.
• We do distinguish between various types of errors, i.e. we regard errors as
positioned somewhere on a continuum.
• The rules treat different types of errors differently.
The computational concept of grammaticality:

correct <---- somewhere in between ----> wrong
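To make the continuum concrete, here is a minimal sketch (not PLNLP; all names and weights are invented for illustration) of scoring grammaticality as a value between correct and wrong, rather than as a boolean:

# Hypothetical sketch: grammaticality as a weighted continuum, not a boolean.
# Error types and weights are illustrative, not PLNLP's actual values.
ERROR_WEIGHTS = {
    "agreement": 0.3,   # "The children climbs..."
    "case": 0.2,        # "...between you and I..."
    "style": 0.1,       # split infinitive, deep embedding
}

def grammaticality(errors: list[str]) -> float:
    """Return a score in [0, 1]: 1.0 = correct, 0.0 = wrong."""
    penalty = sum(ERROR_WEIGHTS.get(e, 0.5) for e in errors)
    return max(0.0, 1.0 - penalty)

print(grammaticality([]))                     # 1.0  -> correct
print(grammaticality(["style"]))              # 0.9  -> somewhere in between
print(grammaticality(["agreement", "case"]))  # 0.5  -> somewhere in between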
The tripartite organization of the grammar (Karen Jensen):

Input string
1. Decoding rules:
   1 parse   -> success
   >1 parses -> success; 2. encoding rules: a hierarchy (ranking) of the
                parses is computed
   0 parses  -> fail; 3. encoding rules:
                error flag off -> fitted parse
                error flag on  -> error detection / correction
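A hedged sketch of this control flow, with stub functions standing in for the decoding and encoding rule sets (none of these are real PLNLP calls):

# Hedged sketch of the tripartite organization described above. All functions
# are hypothetical stand-ins for PLNLP's decoding/encoding rule sets.

def parse(s):          # 1. decoding rules: return all parses found
    return []          # stub: pretend no parse succeeded

def rank(parses):      # 2. encoding rules: compute a hierarchy of parses
    return sorted(parses, key=lambda p: p["metric"], reverse=True)

def fitted_parse(s):   # 3a. flag off: fitted parse
    return {"segtype": "XXXX", "str": s}

def correct_errors(s): # 3b. flag on: error detection / correction
    return {"segtype": "DECL", "str": s, "corrections": []}

def analyze(s, error_flag=False):
    parses = parse(s)
    if len(parses) == 1:
        return parses[0]              # success: unique parse
    if len(parses) > 1:
        return rank(parses)[0]        # success: prefer highest-ranked parse
    return correct_errors(s) if error_flag else fitted_parse(s)

print(analyze("that he would come."))  # falls through to the fitted parse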
Parsing : tree structures

Samples of tree structures:

Enter a sentence or a PLNLP command:
Linguistics is interesting.

DECL
  NP1
    NOUN1*  "linguistics"
  VERB1*    "is"
  AJP1
    ADJ1*   "interesting"
  PUNC1     "."
Computed Trees

Linguistics is interesting.

(prtreer 1)
DECL1 (7200)
  VP1 (5080)
    NP2 (3000)
      NOUN1  "linguistics"
    VP2 (4090)
      VP3 (4000)
        VERB1  "is"
      AJP1 (2100)
        ADJ1  "interesting"
  PUNC1 (975)
    PUNC2  "."
Computed Trees - Problem:

The structure of the computed trees (rule structure) is based on binary rules.
Sample: recursive rule ADJ + NP, applied word by word:

the
the sad
the sad, brown
the sad, brown, dirty
the sad, brown, dirty dog
the sad, brown, dirty dog.

(prtreer np1)
NP1 (3050)
  AJP5 (2100)
    ADJ1  "the"
  NP2 (3050)
    AJP6 (6220)
      AJP2 (2100)
        ADJ2  "sad"
      AJP7 (6210)
        CONJ1 (980)
          PUNC3 (975)
            PUNC4  ","
        AJP3 (2230)
          AJP8 (2100)
            ADJ3  "brown"
          PUNC1 (975)
            PUNC5  ","
    NP3 (3050)
      AJP9 (2100)
        ADJ4  "dirty"
      NP4 (3000)
        NOUN1  "dog"
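The right-branching structure above can be reproduced in a few lines; a minimal sketch, assuming the simplified binary rule ADJ + NP -> NP (tuples stand in for records):

# A minimal sketch of the binary, recursive NP rule (ADJ + NP -> NP) that
# produces the right-branching tree above. The tuple layout is illustrative.
def build_np(words: list[str]):
    """Parse ["the", "sad", "brown", "dirty", "dog"] right to left:
    NP(the, NP(sad, NP(brown, NP(dirty, NP(dog)))))"""
    head, *rest = words[::-1]              # innermost NP is the noun
    tree = ("NP", ("NOUN", head))
    for adj in rest:
        tree = ("NP", ("AJP", adj), tree)  # binary rule: ADJ + NP -> NP
    return tree

print(build_np(["the", "sad", "brown", "dirty", "dog"]))
# ('NP', ('AJP', 'the'), ('NP', ('AJP', 'sad'), ('NP', ('AJP', 'brown'),
#  ('NP', ('AJP', 'dirty'), ('NP', ('NOUN', 'dog'))))))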
Computed Trees: PS-tree

[PS-tree figure for "the sad, brown, dirty dog": NP1 branches into AJP5 (the)
and NP2; NP2 into AJP6 and NP3; AJP6 into AJP2 (sad) and AJP7; AJP7 into
CONJ1 (,) and AJP3; AJP3 into AJP8 (brown) and PUNC1 (,); NP3 into AJP9
(dirty) and NP4 (dog).]
Record Structure

(prtrec np1)
SEGTYPE   'NP'
STR       "the sad, brown, dirty dog"
RULES     3000 3050 3050 3050
RULE      3050 AJP5 NP2
COPYOF    NP2 "sad, brown, dirty dog"
BASE      'DOG'
POS       VERB NOUN ADV
INDIC     SING PERS3 ART DET DEF
PRMODS    DET1 "the"
PRMODS    AJP1 "sad, brown,"
PRMODS    AJP4 "dirty"
HEAD      NOUN1 "dog"
VERB      REC "dog"
ADV       REC "dog"
CLOSED    1
NODENAME  'NP1'
Value=NIL

[Sketch: the NP record, with PRMODS "the sad, brown, dirty" and HEAD "dog".]
Record Structure - Syntax

(compare: LFG: F-structures)

left side: attribute names
right side: values

Values can be either simple or complex. Many of the values are records
themselves (pointers!).

At least the following 5 attributes are needed to build a parse structure tree:
PRMODS, HEAD, PSMODS, SEGTYPE and STR.
The rest of the attributes show that the record contains much more information
than is shown in the parse structure tree.

• RULES = derivational history of the parse (= order of the applied rules)
• INDIC = features from the dictionary / from the word entries (link to the electronic dictionary)
• BASE = the (base) form of the word entry (from the electronic dictionary)
• POS = part of speech
• functional information, such as SUBJECT, DIROBJ, INDOBJ, is added to the record by
grammar rules
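A hedged sketch of such records in Python: attribute-value pairs whose values are either simple symbols or pointers to other records. Field names follow the slides; the class itself is illustrative only:

# PLNLP-style records as Python objects: attribute-value pairs whose values
# may be simple symbols or pointers to other records. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Record:
    segtype: str                 # e.g. 'NP', 'VP', 'SENT'
    strval: str                  # STR: the covered substring
    base: str = ""               # BASE: lemma from the dictionary
    indic: list = field(default_factory=list)   # INDIC: dictionary features
    prmods: list = field(default_factory=list)  # premodifiers (records)
    head: "Record | None" = None                # HEAD (a record: a pointer!)
    psmods: list = field(default_factory=list)  # postmodifiers (records)

dog = Record("NOUN", "dog", base="DOG", indic=["SING", "PERS3"])
np1 = Record("NP", "the sad, brown, dirty dog", base="DOG",
             prmods=[Record("DET", "the"), Record("AJP", "sad, brown,"),
                     Record("AJP", "dirty")],
             head=dog)
print(np1.head.base)  # 'DOG' -- complex values are records themselves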
Record Structure - Syntax - Sample (1)

Enter a sentence or a PLNLP command:
Linguistics is interesting.

(prtrec 1)
SEGTYPE   'SENT'
SEGTYP2   'DECL'
STR       "linguistics is interesting"
RULES     4000 4090 5080 7200
RULE      7200 REC VP1 PUNC1
COPYOF    VP1 "linguistics is interesting"
BASE      'BE'
DICT      'is'
POS       VERB
INDIC     SING PRES THATCOMP PERS3
PRMODS    NP1 "linguistics"
HEAD      VERB1 "is"
PSMODS    AJP1 "interesting"
PSMODS    PUNC1 "."
FRSTV     VERB1 "is"
SUBJECT   NP1 "linguistics"
PREDADJ   AJP1 "interesting"
COPL      1
CLOSED    1
PARSENO   1
SENTYPE   'DECL'
TOPIC     NP1 "linguistics"
NODENAME  'DECL1'
Value=NIL
Record Structure - Syntax - Sample (2)

Enter a sentence or a PLNLP command:
Linguistics is interesting.

(prtrec np1)
SEGTYPE   'NP'
STR       "linguistics"
RULES     3000
RULE      3000 NOUN1
COPYOF    NP2 "linguistics"
BASE      'LINGUISTICS'
DICT      'linguistics'
INDIC     SING PERS3
PRMODS    NP1 "linguistics"
HEAD      NOUN1 "linguistics"
NODENAME  'NP1'
Value=NIL
Record Structure - Syntax - Sample (3)

Enter a sentence or a PLNLP command:
Linguistics is interesting.

(prtrec vp2)
SEGTYPE   'VP'
STR       "is interesting"
RULES     4000 4090
RULE      4090 VP3 AJP1
COPYOF    VP3 "is"
BASE      'BE'
POS       VERB
DICT      'is'
INDIC     SING PRES AUX TRAN THATCOMP
HEAD      VERB1 "is"
PSMODS    AJP1 "interesting"
FRSTV     VERB1 "is"
PREDADJ   AJP1 "interesting"
COPL      1
NODENAME  'VP1'
Value=NIL
Record Structure - Syntax - Sample (4)

Enter a sentence or a PLNLP command:
Linguistics is interesting.

(prtrec vp3)
SEGTYPE   'VP'
STR       "is"
RULES     4000
RULE      4000 VERB1
COPYOF    VERB1 "is"
BASE      'BE'
DICT      'is'
POS       VERB
INDIC     SING PRES AUX PERS3
HEAD      VERB1 "is"
FRSTV     VERB1 "is"
COPL      1
NODENAME  'VP3'
Value=NIL
Record Structure - Syntax - Sample (5)

Enter a sentence or a PLNLP command:
Linguistics is interesting.

(prtrec ajp1)
SEGTYPE   'AJP'
STR       "interesting"
RULES     2100
RULE      2100 ADJ1
COPYOF    ADJ1 "interesting"
BASE      'INTERESTING'
DICT      'interesting'
POS       VERB ADJ
INDIC     PRESPART TRAN THATCOMP
HEAD      ADJ1 "interesting"
VERB      REC "interesting"
ADJPRED   1
NODENAME  'AJP1'
Value=NIL
Record Structure - Syntax - Sample (6)

Enter a sentence or a PLNLP command:
Linguistics is interesting.

(prtrec verb1)
SEGTYPE   'VERB'
STR       "is"
COPYOF    REC "is"
BASE      'BE'
DICT      'is'
POS       VERB
INDIC     SING PRES AUX PERS3
COPL      1
NODENAME  'VERB1'
Value=NIL
Record Structure - Syntax - Sample (7) -
it works just the same way for any other language

DEN MANN FÄHRT DIE FRAU.
(German, object-first word order: "The woman drives the man.")

(prtrec 1)
SEGTYPE   'SENT'
SEGTYP2   'DECL'
STR       "Den Mann fährt die Frau."
RULES     2010 2700 2430 2870
RULE      2870 REC VP1 PUNC1
COPYOF    VP1 "Den Mann fährt die Frau"
BASE      'FAHREN'
DICT      'fährt'
INDIC     PRES P3 NI NA NPZ NDP
PRMODS    NP1 "Den Mann"
HEAD      VERB1 "fährt"
PSMODS    NP2 "die Frau"
PSMODS    PUNC1 "."
SUBJECT   "die Frau"
SING      1
DIROBJ    NP1 "Den Mann"
PARSENO   1
NODENAME  'DECL1'
Value=NIL
Multiple Parses - Disambiguation

Goal:
• reduce the number of analyses
• "compute" an order in which to prefer the analyses

Approach:
• modify the rules in such a way that only the syntactically meaningful structures
"survive"
• limit certain ambiguities, e.g. PP-attachment

How is this done?
Multiple parses are often normal: when 2 different syntactic analyses of a sentence
are correct, we cannot suppress either alternative until we have additional
information at hand. Default: local attachment. New: parsing dictionary
entries and using their "hidden" information.
(Jensen/Binot: Disambiguating Prepositional Phrase Attachments by Using On-line
Dictionary Definitions)
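A minimal sketch of ranking multiple parses by a preference metric, anticipating the P-METRIC values on the next slide (the scoring data here is invented for illustration):

# Hedged sketch: order candidate parses so the preferred analysis comes first,
# in the spirit of the P-METRIC values shown below. Data is illustrative.
def rank_parses(parses):
    return sorted(parses, key=lambda p: p["metric"], reverse=True)

candidates = [
    {"node": "IMPR1", "reading": "taking as ADJ ('pictures that are taking')",
     "metric": 0.21},
    {"node": "IMPR2", "reading": "taking as gerund ('to take pictures')",
     "metric": 0.20001},
]
for p in rank_parses(candidates):
    print(p["node"], p["metric"])   # IMPR1 first: higher metric is preferred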
Multiple Parses

PLEASE REFRAIN FROM TAKING PICTURES

IMPR1
  AVP1
    ADV1*    "please"
  VERB1*     "refrain"
  PP1
    PREP1    "from"
    AJP1
      ADJ1*  "taking"
    NOUN1*   "pictures"
  PUNC1      "."
P-METRIC=0.21

IMPR2
  AVP1
    ADV1*    "please"
  VERB1*     "refrain"
  PP2
    PREP1    "from"
    VERB2*   "taking"
    NP1
      NOUN1* "pictures"
  PUNC1      "."
P-METRIC=0.20001
Record Structure - PARSENO 1

(prtrec 1)
SEGTYPE   'SENT'
SEGTYP2   'IMPR'
STR       "please refrain from taking pictures"
RULES     4000 4250 4240 7200
RULE      7200 REC VP1 PUNC1
COPYOF    VP1 "please refrain from taking pictures"
BASE      'REFRAIN'
POS       VERB NOUN
INDIC     SING PLUR PRES INF
PRMODS    AVP1 "please"
HEAD      VERB1 "refrain"
PSMODS    PP1 "from taking pictures"
PSMODS    PUNC1 "."
FRSTV     VERB1 "refrain"
SUBJECT   REC1 ""
NOUN      REC "refrain"
OBJTPREP  'FROM'
YOUPLEAS  1
PARSENO   1
SENTYPE   'IMPR'
TOPIC     REC1 ""
METRIC    0.21
NODENAME  'IMPR1'
Value=NIL
Record Structure - PARSENO 2

(prtrec 2)
SEGTYPE   'SENT'
SEGTYP2   'IMPR'
STR       "please refrain from taking pictures"
RULES     4000 4250 4240 7200
RULE      7200 REC VP2 PUNC1
COPYOF    VP2 "please refrain from taking pictures"
BASE      'REFRAIN'
POS       VERB NOUN
INDIC     SING PLUR PRES INF
PRMODS    AVP1 "please"
HEAD      VERB1 "refrain"
PSMODS    PP2 "from taking pictures"
PSMODS    PUNC1 "."
FRSTV     VERB1 "refrain"
SUBJECT   REC2 ""
NOUN      REC "refrain"
OBJTPREP  'FROM'
YOUPLEAS  1
PARSENO   2
SENTYPE   'IMPR'
TOPIC     REC2 ""
METRIC    0.20001
NODENAME  'IMPR2'
Value=NIL
Multiple Parses - tree structure

(prtreer 1)
IMPR1 (7200)
  VP1 (4240)
    AVP1 (2260)
      ADV1  "please"
    VP3 (4250)
      VP4 (4000)
        VERB1  "refrain"
      PP1 (2360)
        PREP1  "from"
        NP2 (3050)
          AJP2 (2100)
            ADJ1  "taking"
          NP1 (3000)
            NOUN1  "pictures"
  PUNC1 (975)
    PUNC2  "."
Value=NIL
-> "taking" read as an adjective: pictures that are taking

(prtreer 2)
IMPR2 (7200)
  VP2 (4240)
    AVP1 (2260)
      ADV1  "please"
    VP5 (4250)
      VP4 (4000)
        VERB1  "refrain"
      PP2 (2380)
        PREP1  "from"
        PRPRTCL1 (2690)
          VP6 (4050)
            VP7 (4000)
              VERB2  "taking"
            NP1 (3000)
              NOUN1  "pictures"
  PUNC1 (975)
    PUNC2  "."
Value=NIL
-> "taking" read as a gerund: picture taking = to take a photo
Fitted Parse

Fitted parsing is used when the core rules of the grammar
cannot compute a complete parse structure
(S-node / final VP-node).

Fitted parses are ideal to
• debug your grammar
• point to errors in your grammar
• ...

Frequent causes:
• sentence fragments
• fixed expressions
• extreme ellipses
• coherency problems (for a sample, see Jensen)
• ...
Fitted Parse

Procedure:
1. The system attempts to interpret every input string as a sentence.
2. In case the final (S-) rule cannot be applied, the string is analyzed by a set of encoding
rules, which represent the fitting routine.
3. The fitting routine attempts to find plausible constituents.
4. Order in which the constituents are preferred:
(a) imperatives
(b) VP with a subject
(c) tensed VP without a subject
(d) non-finite clause
(e) phrases without a verb
(f) others

In words (a sketch follows below):
1. The grammar looks for a record with segtype "IMPR". If that fails ->
2. the grammar looks for a record with segtype "VP" and a SUBJECT attribute in
the VP record. If that fails ->
3. and so on, down to (f). If the grammar finds a segment record which covers the entire
input string -> success! If that fails ->
4. the largest possible record is chosen and declared to be the HEAD. Then ->
5. other records are added as premodifying and postmodifying constituents (lists).
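The promised sketch of the fitting routine, with dicts standing in for records and invented predicates for the preference classes (a) through (f); this is not PLNLP's actual encoding rules:

# Hedged sketch of the fitting routine's preference order. Records are plain
# dicts; the predicates and best-cover logic are illustrative stand-ins.
PREFERENCE = [
    lambda r: r["segtype"] == "IMPR",                      # (a) imperatives
    lambda r: r["segtype"] == "VP" and "subject" in r,     # (b) VP with subject
    lambda r: r["segtype"] == "VP" and r.get("tensed")
              and "subject" not in r,                      # (c) tensed VP, no subject
    lambda r: r["segtype"] in ("INFCL", "PRPRTCL"),        # (d) non-finite clause
    lambda r: r["segtype"] in ("NP", "PP", "AJP", "AVP"),  # (e) verbless phrase
    lambda r: True,                                        # (f) others
]

def fit(records, input_len):
    for prefer in PREFERENCE:
        candidates = [r for r in records if prefer(r)]
        for r in sorted(candidates, key=lambda r: r["len"], reverse=True):
            if r["len"] == input_len:
                return r                 # covers the whole input: success
    # nothing covers the input: take the largest record as the HEAD;
    # remaining records would become pre-/postmodifier lists
    head = max(records, key=lambda r: r["len"])
    return {"segtype": "XXXX", "head": head}

recs = [{"segtype": "NP", "len": 4}, {"segtype": "VP", "len": 3, "tensed": True}]
print(fit(recs, input_len=5))  # fitted parse: XXXX with the NP as HEAD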
Fitted Parse

Sample of a fitted parse:
1. Fragment: subordinate clause

that he would come.

XXXX1
  NP1*
    COMPL1    "that"
    NP2
      PRON1*  "he"
    VERB1     "would"
    VERB2*    "come"
  PUNC1       "."

(prtrec np1)
SEGTYPE   'NP'
STR       "that he would come"
RULES     4000 4040 5080 3390
RULE      3390 REC COMPL1 VP1
COPYOF    VP1 "he would come"
BASE      'COME'
POS       NOUN VERB ADJ
INDIC     SING PAST AUX INDCNAM2 PERS3
PRMODS    COMPL1 "that"
PRMODS    NP2 "he"
PRMODS    VERB1 "would"
HEAD      VERB2 "come"
FRSTV     VERB1 "would"
SUBJECT   NP2 "he"
NOUN      REC "come"
ADJ       REC "come"
COPL      1
AUX1      1
AUXFORM   1
NODENAME  'NP1'
Value=NIL
Error Correction

In Natural Language Processing, one needs to anticipate errors before the
machine can handle them.
The machine is NOT capable of detecting errors without the help of a human
(grammarian).

How is this done?
Error statistics!
(For German: Goethe-Institut, Munich)
Error Correction

In PLNLP, certain flags can be set
• by which we can correct errors in the input string (intelligent text processing beyond
spell-checking).

Question: What do these flags look like?

Sample rule (shortened):
NP (conditions,...)
VP (conditions,..., <PERSNUMB(NP).AGREE.PERSNUMB|SETERR<'PERS1'>>)
->
VP (..., SUBJECT=NP,...)

Read: either the PERSNUMB attribute of the NP agrees with the
PERSNUMB attribute of the current node (here: VP), OR (|) the error
attribute "PERS1" is set to true.
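A hedged Python rendering of this rule: on disagreement the parse is not rejected; instead an error attribute is set, so the parse survives for later correction (record layout and names are illustrative, not PLNLP):

# Sketch of the rule above: check person/number agreement between subject NP
# and VP; on mismatch, set an error attribute instead of failing the rule.
def apply_subject_rule(np: dict, vp: dict) -> dict:
    """NP + VP -> VP(SUBJECT=NP), flagging PERS1 on disagreement."""
    if np.get("persnumb") != vp.get("persnumb"):
        vp["err"] = vp.get("err", set()) | {"PERS1"}   # SETERR<'PERS1'>
    vp["subject"] = np                                 # SUBJECT=NP
    return vp

np = {"str": "the children", "persnumb": ("PERS3", "PLUR")}
vp = {"str": "climbs over the fence", "persnumb": ("PERS3", "SING")}
result = apply_subject_rule(np, vp)
print(result["err"])  # {'PERS1'}: parse survives, error recorded for correction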
Error Correction

Enter a sentence or a PLNLP command:
Der Mann arbeiten mit der Frau.
(German: "The man work with the woman." - subject-verb agreement error)

XXXX1
  NP1
    DET1
      ADJ1*   "Der"
    NOUN1*    "Mann"
  VP1*
    VERB1*    "arbeiten"
    PP1
      PREP1   "mit"
      DET2
        ADJ2* "der"
      NOUN2*  "Frau"
  PUNC1       "."

(setq nlperrs 1)
Enter a sentence or a PLNLP command:
Der Mann arbeiten mit der Frau.

DECL1
  NP1
    DET1
      ADJ1*   "Der"
    NOUN1*    "Mann"
  VERB1*      "arbeiten"
  PP1
    PREP1     "mit"
    DET2
      ADJ2*   "der"
    NOUN2*    "Frau"
  PUNC1       "."

GRAMMATICAL ERROR IN SENTENCE 3.
Der Mann arbeiten mit der Frau.
CONSIDER:
Der Mann arbeitet mit der Frau.
CHANGE THE VERBAL INFLECTION
Approximate parse
Syntactic ambiguities
Disambiguation of multiple parses
Approximate parse / syntactic ambiguities

Problem: multiple parses, i.e. too many analyses of an input string -> incorrect
analyses.

Ad hoc solution: limit the application of rules by augmentations/conditions/restrictions
on the electronic PS-rules.

Goal: one parse for every input string.

Present problem: little (if any) semantic/pragmatic information is available. Only a
pragmatic analysis, which considers world knowledge during the process of
disambiguation, would result in a disambiguation of the structure.

Solution: an approximate parse is built, and alternative "attachment points" are marked
in the tree by "?".
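A minimal sketch of this idea: build one tree, attach the ambiguous PP at the default (local) site, and keep the "?" mark explicit for later components (all structures are invented for illustration):

# Hedged sketch of an approximate parse: instead of producing multiple parses,
# build ONE tree, attach the ambiguous PP at a default site, and mark it with
# "?" so later (semantic/pragmatic) components can re-attach it.
def approximate_parse(verb, obj, pp):
    return {
        "segtype": "DECL",
        "head": verb,
        "psmods": [
            obj,
            {"mark": "?", "pp": pp,          # printed as "?" in the tree
             "candidates": [verb, obj]},     # possible attachment heads
        ],
    }

tree = approximate_parse("watched", {"segtype": "NP", "str": "the man"},
                         {"segtype": "PP", "str": "with the telescope"})
print(tree["psmods"][1]["mark"])  # '?' -- one parse, ambiguity kept explicit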
Ambiguities / Approximate Parse

HE WATCHED THE MAN WITH THE TELESCOPE.

DECL1
  NP1
    PRON1   "he"
  VERB1*    "watched"
  NP2
    DET1
      ADJ1* "the"
    NOUN1*  "man"
  ?
  PP1
    PREP1   "with"
    DET2
      ADJ2* "the"
    NOUN2*  "telescope"
  PUNC1     "."

remember:
eat with a fork
vs.
eat with bones

ER FÄHRT DEN MANN MIT DEM AUTO.
(German: "He drives the man with the car.")

DECL1
  NP1
    PRON1   "Er"
  VERB1*    "fährt"
  NP2
    DET1
      ADJ1* "den"
    NOUN1*  "Mann"
  ?
  PP1
    PREP1   "mit"
    DET2
      ADJ2* "dem"
    NOUN2*  "Auto"
  PUNC1     "."
Ambiguities / Approximate Parse

More examples:
(1) in both cases "PP - time expression":

I read a report about the evolution in 10 minutes.
Pragmatics / world knowledge: evolution needs > 10 minutes.
Syntax/Semantics: * report + "in-PP" (okay: report + "about-PP")
                  * report + 2 PPs (about..., in...)

vs.

I read a report about the evolution in the last 10 million years.
Pragmatics / world knowledge: "I" (a human being) cannot be 10 million years old.
Syntax/Semantics: * report + "in-PP" (okay: report + "about-PP")
                  * report + 2 PPs (about..., in...)
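A hedged sketch in the spirit of Jensen/Binot: attachment preferences that could be mined from on-line dictionary definitions decide where the PP goes; the preference table here is entirely invented:

# Sketch: decide PP attachment from lexical preferences (as Jensen/Binot mine
# from dictionary definitions). The table below is invented for illustration.
PREFERS_PP = {
    ("report", "about"): True,   # a report ABOUT something: noun attachment
    ("report", "in"): False,     # *report in 10 minutes: not noun attachment
    ("read", "in"): True,        # read something IN 10 minutes: verb attachment
}

def attach_pp(verb: str, noun: str, prep: str) -> str:
    """Return the preferred attachment site for prep, or '?' if unknown."""
    if PREFERS_PP.get((noun, prep)):
        return noun              # attach the PP to the object noun
    if PREFERS_PP.get((verb, prep)):
        return verb              # attach the PP to the verb
    return "?"                   # keep the approximate parse's "?" mark

print(attach_pp("read", "report", "about"))  # report
print(attach_pp("read", "report", "in"))     # read
print(attach_pp("read", "report", "with"))   # ?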
The past - the present - the future

The past:
• dreams, science fiction (limitations in technology, on the
hardware as well as the software side)

The present:
• applications like Penelope (NaLa)

The future:
• combination of rule-based algorithms and statistics
• mobile (NLP) applications