The role of the parser

    source code → scanner → tokens → parser → IR
    (errors are reported along the way)

The parser
• performs context-free syntax analysis
• guides context-sensitive analysis
• constructs an intermediate representation
• produces meaningful error messages
• attempts error correction

Copyright © 2007 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].

CS502 Parsing

Syntax analysis

Context-free syntax is specified with a context-free grammar. Formally, a CFG G is a 4-tuple (Vt, Vn, S, P), where:

• Vt is the set of terminal symbols in the grammar. For our purposes, Vt is the set of tokens returned by the scanner.

• Vn, the nonterminals, is a set of syntactic variables that denote sets of (sub)strings occurring in the language. These are used to impose a structure on the grammar.

• S is a distinguished nonterminal (S ∈ Vn) denoting the entire set of strings in L(G). This is sometimes called a goal symbol.

• P is a finite set of productions specifying how terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left-hand side.

The set V = Vt ∪ Vn is called the vocabulary of G.

Syntax analysis

Grammars are often written in Backus-Naur form (BNF). Example:

    1  ⟨goal⟩ ::= ⟨expr⟩
    2  ⟨expr⟩ ::= ⟨expr⟩⟨op⟩⟨expr⟩
    3         | num
    4         | id
    5  ⟨op⟩   ::= +
    6         | −
    7         | ∗
    8         | /

Notation and terminology

• a, b, c, … ∈ Vt
• A, B, C, … ∈ Vn
• U, V, W, … ∈ V
• α, β, γ, … ∈ V∗
• u, v, w, …
∈ Vt∗

If A → γ then αAβ ⇒ αγβ is a single-step derivation using A → γ. Similarly, ⇒∗ and ⇒+ denote derivations of ≥ 0 and ≥ 1 steps.

If S ⇒∗ β then β is said to be a sentential form of G.

L(G) = {w ∈ Vt∗ | S ⇒+ w}; w ∈ L(G) is called a sentence of G.

Note, L(G) = {β ∈ V∗ | S ⇒∗ β} ∩ Vt∗.

The example grammar describes simple expressions over numbers and identifiers. In a BNF for a grammar, we represent
1. non-terminals with angle brackets or capital letters
2. terminals with typewriter font or underline
3. productions as in the example

Scanning vs. parsing

Where do we draw the line?

    term ::= [a-zA-Z]([a-zA-Z] | [0-9])∗ | 0 | [1-9][0-9]∗
    op   ::= + | − | ∗ | /
    expr ::= (term op)∗ term

Regular expressions are used to classify:
• identifiers, numbers, keywords
• REs are more concise and simpler for tokens than a grammar
• more efficient scanners can be built from REs (DFAs) than grammars

Context-free grammars are used to count:
• brackets: (), begin…end, if…then…else
• imparting structure: expressions

Syntactic analysis is complicated enough: a grammar for C has around 200 productions. Factoring out lexical analysis as a separate phase makes the compiler more manageable.

Derivations

We can view the productions of a CFG as rewriting rules. Using our example CFG:

    ⟨goal⟩ ⇒ ⟨expr⟩
           ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
           ⇒ ⟨expr⟩⟨op⟩⟨expr⟩⟨op⟩⟨expr⟩
           ⇒ ⟨id,x⟩⟨op⟩⟨expr⟩⟨op⟩⟨expr⟩
           ⇒ ⟨id,x⟩ + ⟨expr⟩⟨op⟩⟨expr⟩
           ⇒ ⟨id,x⟩ + ⟨num,2⟩⟨op⟩⟨expr⟩
           ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨expr⟩
           ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

We have derived the sentence x + 2 ∗ y. We denote this ⟨goal⟩ ⇒∗ id + num ∗ id.

Such a sequence of rewrites is a derivation or a parse. The process of discovering a derivation is called parsing.

Derivations

At each step, we chose a non-terminal to replace. This choice can lead to different derivations.
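The rewriting view of derivations can be made concrete in a few lines of Python. This is an illustrative sketch, not from the lecture: grammar symbols are plain strings, and `apply_leftmost` replays the leftmost derivation of x + 2 ∗ y shown above.

```python
# Productions as rewriting rules for the ambiguous expression grammar.
# Nonterminals are the keys of GRAMMAR; everything else is a terminal.
GRAMMAR = {
    "goal": [["expr"]],
    "expr": [["expr", "op", "expr"], ["num"], ["id"]],
    "op":   [["+"], ["-"], ["*"], ["/"]],
}
NONTERMINALS = set(GRAMMAR)

def apply_leftmost(form, production):
    """Rewrite the leftmost nonterminal in `form` using `production`
    (a list of RHS symbols), returning the new sentential form."""
    for i, sym in enumerate(form):
        if sym in NONTERMINALS:
            return form[:i] + production + form[i + 1:]
    raise ValueError("no nonterminal left to expand")

# Replay the derivation of x + 2 * y from the text, one chosen RHS per step:
steps = [
    ["expr"],                   # goal  => expr
    ["expr", "op", "expr"],     # expr  => expr op expr
    ["expr", "op", "expr"],     # leftmost expr again
    ["id"],                     # <id,x>
    ["+"],                      # op => +
    ["num"],                    # <num,2>
    ["*"],                      # op => *
    ["id"],                     # <id,y>
]
form = ["goal"]
for rhs in steps:
    form = apply_leftmost(form, rhs)
```

Each entry of `steps` is the RHS chosen for the leftmost non-terminal at that step; the loop ends at the sentence id + num ∗ id.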
Two are of particular interest:

leftmost derivation — the leftmost non-terminal is replaced at each step
rightmost derivation — the rightmost non-terminal is replaced at each step

The previous example was a leftmost derivation.

Rightmost derivation

For the string x + 2 ∗ y:

    ⟨goal⟩ ⇒ ⟨expr⟩
           ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
           ⇒ ⟨expr⟩⟨op⟩⟨id,y⟩
           ⇒ ⟨expr⟩ ∗ ⟨id,y⟩
           ⇒ ⟨expr⟩⟨op⟩⟨expr⟩ ∗ ⟨id,y⟩
           ⇒ ⟨expr⟩⟨op⟩⟨num,2⟩ ∗ ⟨id,y⟩
           ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
           ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

Again, ⟨goal⟩ ⇒∗ id + num ∗ id.

Precedence

These two derivations point out a problem with the grammar. It has no notion of precedence, or implied order of evaluation.

[Parse tree for the derivation above: the ⟨expr⟩ covering ⟨id,x⟩ + ⟨num,2⟩ is combined by ∗ with ⟨id,y⟩.]

Treewalk evaluation computes (x + 2) ∗ y — the “wrong” answer! Should be x + (2 ∗ y).

Precedence

To add precedence takes additional machinery:

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
    3           | ⟨expr⟩ − ⟨term⟩
    4           | ⟨term⟩
    5  ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
    6           | ⟨term⟩/⟨factor⟩
    7           | ⟨factor⟩
    8  ⟨factor⟩ ::= num
    9           | id

This grammar enforces a precedence on the derivation:
• terms must be derived from expressions
• forces the “correct” tree

Precedence

Now, for the string x + 2 ∗ y:

    ⟨goal⟩ ⇒ ⟨expr⟩
           ⇒ ⟨expr⟩ + ⟨term⟩
           ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨factor⟩
           ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨id,y⟩
           ⇒ ⟨expr⟩ + ⟨factor⟩ ∗ ⟨id,y⟩
           ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
           ⇒ ⟨term⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
           ⇒ ⟨factor⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
           ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

[Parse tree: ⟨expr⟩ + ⟨term⟩, where the ⟨term⟩ covers ⟨num,2⟩ ∗ ⟨id,y⟩.]

Again, ⟨goal⟩ ⇒∗ id + num ∗ id, but this time, we build the desired tree.
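To see the difference in concrete terms, here is a small Python sketch (illustrative, not from the lecture) contrasting a flat left-to-right evaluation, which mimics the ambiguous grammar's (x + 2) ∗ y tree, with the two-level expr/term evaluation that the precedence grammar forces. With x = 1 and y = 3 the two readings disagree.

```python
# Operator implementations shared by both evaluators.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def eval_flat(tokens):
    """Evaluate strictly left to right, ignoring precedence:
    the (x + 2) * y reading."""
    value = tokens[0]
    for i in range(1, len(tokens), 2):
        value = OPS[tokens[i]](value, tokens[i + 1])
    return value

def eval_prec(tokens):
    """Evaluate with * and / binding tighter, as the expr/term/factor
    grammar dictates: the x + (2 * y) reading."""
    # First collapse every run of * and / (the <term>s) ...
    terms, ops = [tokens[0]], []
    for i in range(1, len(tokens), 2):
        op, rhs = tokens[i], tokens[i + 1]
        if op in ("*", "/"):
            terms[-1] = OPS[op](terms[-1], rhs)
        else:
            ops.append(op)
            terms.append(rhs)
    # ... then combine the terms with + and - (the <expr> level).
    value = terms[0]
    for op, t in zip(ops, terms[1:]):
        value = OPS[op](value, t)
    return value

# x + 2 * y with x = 1, y = 3:
flat = eval_flat([1, "+", 2, "*", 3])   # (1 + 2) * 3
prec = eval_prec([1, "+", 2, "*", 3])   # 1 + (2 * 3)
```

`eval_flat` yields 9 here (the wrong tree); `eval_prec` yields 7, matching the tree the precedence grammar builds.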
Treewalk evaluation computes x + (2 ∗ y).

Ambiguity

If a grammar has more than one derivation for a single sentential form, then it is ambiguous. Example:

    ⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
           | if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
           | other stmts

Consider deriving the sentential form:

    if E1 then if E2 then S1 else S2

It has two derivations. This ambiguity is purely grammatical. It is a context-free ambiguity.

Ambiguity

May be able to eliminate ambiguities by rearranging the grammar:

    ⟨stmt⟩      ::= ⟨matched⟩
                | ⟨unmatched⟩
    ⟨matched⟩   ::= if ⟨expr⟩ then ⟨matched⟩ else ⟨matched⟩
                | other stmts
    ⟨unmatched⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
                | if ⟨expr⟩ then ⟨matched⟩ else ⟨unmatched⟩

This generates the same language as the ambiguous grammar, but applies the common sense rule:

    match each else with the closest unmatched then

This is most likely the language designer’s intent.

Ambiguity

Ambiguity is often due to confusion in the context-free specification. Context-sensitive confusions can arise from overloading. Example:

    a = f(17)

In many Algol-like languages, f could be a function or subscripted variable. Disambiguating this statement requires context:
• need values of declarations
• not context-free
• really an issue of type

Rather than complicate parsing, we will handle this separately.

Parsing: the big picture

    grammar → parser generator → parser
    tokens → parser → IR

Our goal is a flexible parser generator system.

Top-down versus bottom-up

Top-down parsers
• start at the root of derivation tree and fill in
• picks a production and tries to match the input
• may require backtracking

Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar. To build a parse, it repeats the following steps until the fringe of the parse tree matches the input string:

1. At a node labelled A, select a production A → α
and construct the appropriate child for each symbol of α

2. When a terminal is added to the fringe that doesn’t match the input string, backtrack

3. Find the next node to be expanded (must have a label in Vn)

The key is selecting the right production in step 1 ⇒ it should be guided by the input string. Some grammars are backtrack-free (predictive).

Bottom-up parsers, by contrast:
• start at the leaves and fill in
• start in a state valid for legal first tokens
• as input is consumed, change state to encode possibilities (recognize valid prefixes)
• use a stack to store both state and sentential forms

Simple expression grammar

Recall our grammar for simple expressions:

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
    3           | ⟨expr⟩ − ⟨term⟩
    4           | ⟨term⟩
    5  ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
    6           | ⟨term⟩/⟨factor⟩
    7           | ⟨factor⟩
    8  ⟨factor⟩ ::= num
    9           | id

Consider the input string x − 2 ∗ y.

Example

(↑ marks the parser’s position in the input.)

    Prod’n  Sentential form               Input
    –       ⟨goal⟩                        ↑x − 2 ∗ y
    1       ⟨expr⟩                        ↑x − 2 ∗ y
    2       ⟨expr⟩ + ⟨term⟩               ↑x − 2 ∗ y
    4       ⟨term⟩ + ⟨term⟩               ↑x − 2 ∗ y
    7       ⟨factor⟩ + ⟨term⟩             ↑x − 2 ∗ y
    9       id + ⟨term⟩                   ↑x − 2 ∗ y
    –       id + ⟨term⟩                   x ↑− 2 ∗ y
    –       ⟨expr⟩                        ↑x − 2 ∗ y
    3       ⟨expr⟩ − ⟨term⟩               ↑x − 2 ∗ y
    4       ⟨term⟩ − ⟨term⟩               ↑x − 2 ∗ y
    7       ⟨factor⟩ − ⟨term⟩             ↑x − 2 ∗ y
    9       id − ⟨term⟩                   ↑x − 2 ∗ y
    –       id − ⟨term⟩                   x ↑− 2 ∗ y
    –       id − ⟨term⟩                   x − ↑2 ∗ y
    7       id − ⟨factor⟩                 x − ↑2 ∗ y
    8       id − num                      x − ↑2 ∗ y
    –       id − num                      x − 2 ↑∗ y
    –       id − ⟨term⟩                   x − ↑2 ∗ y
    5       id − ⟨term⟩ ∗ ⟨factor⟩        x − ↑2 ∗ y
    7       id − ⟨factor⟩ ∗ ⟨factor⟩      x − ↑2 ∗ y
    8       id − num ∗ ⟨factor⟩           x − ↑2 ∗ y
    –       id − num ∗ ⟨factor⟩           x − 2 ↑∗ y
    –       id − num ∗ ⟨factor⟩           x − 2 ∗ ↑y
    9       id − num ∗ id                 x − 2 ∗ ↑y
    –       id − num ∗ id                 x − 2 ∗ y↑

Example

Another possible parse for x − 2 ∗ y:

    Prod’n  Sentential form                    Input
    –       ⟨goal⟩                             ↑x − 2 ∗ y
    1       ⟨expr⟩                             ↑x − 2 ∗ y
    2       ⟨expr⟩ + ⟨term⟩                    ↑x − 2 ∗ y
    2       ⟨expr⟩ + ⟨term⟩ + ⟨term⟩           ↑x − 2 ∗ y
    2       ⟨expr⟩ + ⟨term⟩ + ···              ↑x − 2 ∗ y
    2       ···                                ↑x − 2 ∗ y

Left-recursion

Top-down parsers cannot handle left-recursion in a grammar. Formally, a grammar is left-recursive if
∃A ∈ Vn such that A ⇒+ Aα for some string α.

If the parser makes the wrong choices, expansion doesn’t terminate. This isn’t a good property for a parser to have. (Parsers should terminate!) Our simple expression grammar is left-recursive.

Eliminating left-recursion

To remove left-recursion, we can transform the grammar. Consider the grammar fragment:

    ⟨foo⟩ ::= ⟨foo⟩α
          | β

where α and β do not start with ⟨foo⟩. We can rewrite this as:

    ⟨foo⟩ ::= β⟨bar⟩
    ⟨bar⟩ ::= α⟨bar⟩
          | ε

where ⟨bar⟩ is a new non-terminal. This fragment contains no left-recursion.

Example

Our expression grammar contains two cases of left-recursion:

    ⟨expr⟩ ::= ⟨expr⟩ + ⟨term⟩
           | ⟨expr⟩ − ⟨term⟩
           | ⟨term⟩
    ⟨term⟩ ::= ⟨term⟩ ∗ ⟨factor⟩
           | ⟨term⟩/⟨factor⟩
           | ⟨factor⟩

Applying the transformation gives:

    ⟨expr⟩  ::= ⟨term⟩⟨expr′⟩
    ⟨expr′⟩ ::= +⟨term⟩⟨expr′⟩ | −⟨term⟩⟨expr′⟩ | ε
    ⟨term⟩  ::= ⟨factor⟩⟨term′⟩
    ⟨term′⟩ ::= ∗⟨factor⟩⟨term′⟩ | /⟨factor⟩⟨term′⟩ | ε

With this grammar, a top-down parser will
• terminate
• backtrack on some inputs

Example

This cleaner grammar defines the same language:

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨term⟩ + ⟨expr⟩
    3           | ⟨term⟩ − ⟨expr⟩
    4           | ⟨term⟩
    5  ⟨term⟩   ::= ⟨factor⟩ ∗ ⟨term⟩
    6           | ⟨factor⟩/⟨term⟩
    7           | ⟨factor⟩
    8  ⟨factor⟩ ::= num
    9           | id

It is
• right-recursive
• free of ε-productions

Unfortunately, it generates a different associativity. Same syntax, different meaning.

Example

Our long-suffering expression grammar:

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨term⟩⟨expr′⟩
    3  ⟨expr′⟩  ::= +⟨term⟩⟨expr′⟩
    4           | −⟨term⟩⟨expr′⟩
    5           | ε
    6  ⟨term⟩   ::= ⟨factor⟩⟨term′⟩
    7  ⟨term′⟩  ::= ∗⟨factor⟩⟨term′⟩
    8           | /⟨factor⟩⟨term′⟩
    9           | ε
    10 ⟨factor⟩ ::= num
    11          | id

Recall, we factored out left-recursion.

How much lookahead is needed?

We saw that top-down parsers may need to backtrack when they select the wrong production. Do we need arbitrary lookahead to parse CFGs?
• in general, yes
• use the Earley or Cocke-Younger-Kasami algorithms

Fortunately
• large subclasses of CFGs can be parsed with limited lookahead
• most programming language constructs can be expressed in a grammar that falls in these subclasses

Among the interesting subclasses are:

LL(1): left to right scan, left-most derivation, 1-token lookahead; and
LR(1): left to right scan, right-most derivation, 1-token lookahead

Predictive parsing

Basic idea: For any two productions A → α | β, we would like a distinct way of choosing the correct production to expand.

For some RHS α ∈ G, define FIRST(α) as the set of tokens that appear first in some string derived from α. That is, for some w ∈ Vt∗, w ∈ FIRST(α) iff. α ⇒∗ wγ.

Key property: Whenever two productions A → α and A → β both appear in the grammar, we would like

    FIRST(α) ∩ FIRST(β) = ∅

This would allow the parser to make a correct choice with a lookahead of only one symbol! (The example grammar has this property!)

Left factoring

What if a grammar does not have this property? Sometimes, we can transform a grammar to have this property.

For each non-terminal A find the longest prefix α common to two or more of its alternatives.

    if α ≠ ε then replace all of the A productions
        A → αβ1 | αβ2 | ··· | αβn
    with
        A → αA′
        A′ → β1 | β2 | ··· | βn
    where A′ is a new non-terminal.

Repeat until no two alternatives for a single non-terminal have a common prefix.

Example

Consider a right-recursive version of the expression grammar:

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨term⟩ + ⟨expr⟩
    3           | ⟨term⟩ − ⟨expr⟩
    4           | ⟨term⟩
    5  ⟨term⟩   ::= ⟨factor⟩ ∗ ⟨term⟩
    6           | ⟨factor⟩/⟨term⟩
    7           | ⟨factor⟩
    8  ⟨factor⟩ ::= num
    9           | id

To choose between productions 2, 3, & 4, the parser must see past the num or id and look at the +, −, ∗, or /:

    FIRST(2) ∩ FIRST(3) ∩ FIRST(4) ≠ ∅

This grammar fails the test. Note: This grammar is right-associative.
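The left-factoring transformation just described can be sketched in Python. This is a minimal, illustrative sketch (the name `left_factor` and the representation are assumptions, not from the lecture): productions are lists of symbol strings, `[]` stands for ε, and it performs one factoring round, grouping alternatives by their first symbol.

```python
# os.path.commonprefix compares element-wise and works on lists of
# symbols, not just path strings.
from os.path import commonprefix

def left_factor(nt, productions):
    """One round of left factoring for nonterminal `nt`: alternatives
    sharing a first symbol have their longest common prefix pulled out
    into a new primed nonterminal. (A fuller version would generate a
    fresh name per factored group.)"""
    groups = {}
    for p in productions:
        groups.setdefault(p[0] if p else None, []).append(p)
    new_rules = {nt: []}
    for head, group in groups.items():
        if head is None or len(group) == 1:
            new_rules[nt].extend(group)       # nothing to factor
            continue
        prefix = list(commonprefix(group))    # longest shared prefix
        prime = nt + "'"
        new_rules[nt].append(prefix + [prime])
        # The suffixes become the primed nonterminal's alternatives;
        # an empty suffix is the epsilon production.
        new_rules[prime] = [p[len(prefix):] for p in group]
    return new_rules

# <expr> ::= <term> + <expr> | <term> - <expr> | <term>
rules = left_factor(
    "expr",
    [["term", "+", "expr"], ["term", "-", "expr"], ["term"]],
)
```

On the ⟨expr⟩ alternatives this factors out the shared ⟨term⟩ prefix, producing ⟨expr⟩ ::= ⟨term⟩⟨expr′⟩ and ⟨expr′⟩ ::= +⟨expr⟩ | −⟨expr⟩ | ε, as in the example that follows.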
Example

There are two non-terminals that must be left-factored:

    ⟨expr⟩ ::= ⟨term⟩ + ⟨expr⟩
           | ⟨term⟩ − ⟨expr⟩
           | ⟨term⟩
    ⟨term⟩ ::= ⟨factor⟩ ∗ ⟨term⟩
           | ⟨factor⟩/⟨term⟩
           | ⟨factor⟩

Applying the transformation gives us:

    ⟨expr⟩  ::= ⟨term⟩⟨expr′⟩
    ⟨expr′⟩ ::= +⟨expr⟩ | −⟨expr⟩ | ε
    ⟨term⟩  ::= ⟨factor⟩⟨term′⟩
    ⟨term′⟩ ::= ∗⟨term⟩ | /⟨term⟩ | ε

Example

Substituting back into the grammar yields:

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨term⟩⟨expr′⟩
    3  ⟨expr′⟩  ::= +⟨expr⟩
    4           | −⟨expr⟩
    5           | ε
    6  ⟨term⟩   ::= ⟨factor⟩⟨term′⟩
    7  ⟨term′⟩  ::= ∗⟨term⟩
    8           | /⟨term⟩
    9           | ε
    10 ⟨factor⟩ ::= num
    11          | id

Now, selection requires only a single token lookahead. Note: This grammar is still right-associative.

Example

Parsing x − 2 ∗ y with this grammar:

    Prod’n  Sentential form                         Input
    –       ⟨goal⟩                                  ↑x − 2 ∗ y
    1       ⟨expr⟩                                  ↑x − 2 ∗ y
    2       ⟨term⟩⟨expr′⟩                           ↑x − 2 ∗ y
    6       ⟨factor⟩⟨term′⟩⟨expr′⟩                  ↑x − 2 ∗ y
    11      id⟨term′⟩⟨expr′⟩                        ↑x − 2 ∗ y
    –       id⟨term′⟩⟨expr′⟩                        x ↑− 2 ∗ y
    9       id⟨expr′⟩                               x ↑− 2 ∗ y
    4       id − ⟨expr⟩                             x ↑− 2 ∗ y
    –       id − ⟨expr⟩                             x − ↑2 ∗ y
    2       id − ⟨term⟩⟨expr′⟩                      x − ↑2 ∗ y
    6       id − ⟨factor⟩⟨term′⟩⟨expr′⟩             x − ↑2 ∗ y
    10      id − num⟨term′⟩⟨expr′⟩                  x − ↑2 ∗ y
    –       id − num⟨term′⟩⟨expr′⟩                  x − 2 ↑∗ y
    7       id − num ∗ ⟨term⟩⟨expr′⟩                x − 2 ↑∗ y
    –       id − num ∗ ⟨term⟩⟨expr′⟩                x − 2 ∗ ↑y
    6       id − num ∗ ⟨factor⟩⟨term′⟩⟨expr′⟩       x − 2 ∗ ↑y
    11      id − num ∗ id⟨term′⟩⟨expr′⟩             x − 2 ∗ ↑y
    –       id − num ∗ id⟨term′⟩⟨expr′⟩             x − 2 ∗ y↑
    9       id − num ∗ id⟨expr′⟩                    x − 2 ∗ y↑
    5       id − num ∗ id                           x − 2 ∗ y↑

The next symbol determined each choice correctly.

Back to left-recursion elimination

Given a left-factored CFG, to eliminate left-recursion:

    if ∃ A → Aα then replace all of the A productions
        A → Aα | β | … | γ
    with
        A → NA′
        N → β | … | γ
        A′ → αA′ | ε
    where N and A′ are new non-terminals.

Repeat until there are no left-recursive productions.

Recursive descent parsing

Now, we can produce a simple recursive descent parser from the (right-associative) grammar.

Generality

Question:
By left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single token lookahead?

Answer: Given a context-free grammar that doesn’t meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions.

Many context-free languages do not have such a grammar:

    {a^n 0 b^n | n ≥ 1} ∪ {a^n 1 b^2n | n ≥ 1}

Must look past an arbitrary number of a’s to discover the 0 or the 1 and so determine the derivation.

Recursive descent parsing

    goal:
        token ← next token();
        if (expr() = ERROR or token ≠ EOF) then
            return ERROR;

    expr:
        if (term() = ERROR) then
            return ERROR;
        else return expr prime();

    expr prime:
        if (token = PLUS) then
            token ← next token();
            return expr();
        else if (token = MINUS) then
            token ← next token();
            return expr();
        else return OK;

Recursive descent parsing

    term:
        if (factor() = ERROR) then
            return ERROR;
        else return term prime();

    term prime:
        if (token = MULT) then
            token ← next token();
            return term();
        else if (token = DIV) then
            token ← next token();
            return term();
        else return OK;

    factor:
        if (token = NUM) then
            token ← next token();
            return OK;
        else if (token = ID) then
            token ← next token();
            return OK;
        else return ERROR;

Building the tree

One of the key jobs of the parser is to build an intermediate representation of the source code. To build an abstract syntax tree, we can simply insert code at the appropriate points:

• factor() can stack nodes id, num
• term prime() can stack nodes ∗, /
• term() can pop 3, build and push subtree
• expr prime() can stack nodes +, −
• expr() can pop 3, build and push subtree
• goal() can pop and return tree

Non-recursive predictive parsing

Observation: Our recursive descent parser encodes state information in its run-time stack, or call stack.
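The routines above translate almost mechanically into Python. A sketch, with token names as in the pseudocode, and ERROR/OK modelled as False/True (the class wrapper is an assumption for bundling the token stream with the routines):

```python
class Parser:
    """Recursive descent parser for the right-associative expression
    grammar; each grammar nonterminal is one method."""

    def __init__(self, tokens):
        self.tokens = tokens + ["EOF"]
        self.pos = 0
        self.token = self.tokens[0]

    def next_token(self):
        self.pos += 1
        self.token = self.tokens[self.pos]

    def goal(self):
        return self.expr() and self.token == "EOF"

    def expr(self):
        return self.term() and self.expr_prime()

    def expr_prime(self):
        if self.token in ("PLUS", "MINUS"):
            self.next_token()
            return self.expr()
        return True                 # epsilon: OK

    def term(self):
        return self.factor() and self.term_prime()

    def term_prime(self):
        if self.token in ("MULT", "DIV"):
            self.next_token()
            return self.term()
        return True                 # epsilon: OK

    def factor(self):
        if self.token in ("NUM", "ID"):
            self.next_token()
            return True
        return False                # ERROR

# x - 2 * y tokenizes as ID MINUS NUM MULT ID
ok = Parser(["ID", "MINUS", "NUM", "MULT", "ID"]).goal()
```

`ok` is True for the well-formed input; a dangling operator such as `ID PLUS` makes `goal()` return False.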
Using recursive procedure calls to implement a stack abstraction may not be particularly efficient. This suggests other implementation methods:

• explicit stack, hand-coded parser
• stack-based, table-driven parser

Non-recursive predictive parsing

Now, a predictive parser looks like:

    source code → scanner → tokens → table-driven parser → IR
    (the parser consults a stack and a set of parsing tables)

Rather than writing code, we build tables. Building tables can be automated!

Table-driven parsers

A parser generator system often looks like:

    grammar → parser generator → parsing tables
    source code → scanner → tokens → table-driven parser → IR

This is true for both top-down (LL) and bottom-up (LR) parsers.

Non-recursive predictive parsing

Input: a string w and a parsing table M for G

    tos ← 0
    Stack[tos] ← EOF
    Stack[++tos] ← Start Symbol
    token ← next token()
    repeat
        X ← Stack[tos]
        if X is a terminal or EOF then
            if X = token then
                pop X
                token ← next token()
            else error()
        else  /* X is a non-terminal */
            if M[X, token] = X → Y1Y2···Yk then
                pop X
                push Yk, Yk−1, ···, Y1
            else error()
    until X = EOF

Non-recursive predictive parsing

What we need now is a parsing table M. Our expression grammar:

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨term⟩⟨expr′⟩
    3  ⟨expr′⟩  ::= +⟨expr⟩
    4           | −⟨expr⟩
    5           | ε
    6  ⟨term⟩   ::= ⟨factor⟩⟨term′⟩
    7  ⟨term′⟩  ::= ∗⟨term⟩
    8           | /⟨term⟩
    9           | ε
    10 ⟨factor⟩ ::= num
    11          | id

Its parse table (entries are production numbers):

                id   num   +    −    ∗    /    $†
    ⟨goal⟩       1    1    –    –    –    –    –
    ⟨expr⟩       2    2    –    –    –    –    –
    ⟨expr′⟩      –    –    3    4    –    –    5
    ⟨term⟩       6    6    –    –    –    –    –
    ⟨term′⟩      –    –    9    9    7    8    9
    ⟨factor⟩    11   10    –    –    –    –    –

FIRST

For a string of grammar symbols α, define FIRST(α) as:

• the set of terminal symbols that begin strings derived from α:
      {a ∈ Vt | α ⇒∗ aβ}
• If α ⇒∗ ε then ε ∈ FIRST(α)

FIRST(α) contains the set of tokens valid in the initial position in α.

To build FIRST(X):

1. If X ∈ Vt then FIRST(X) is {X}

2. If X → ε then add ε to FIRST(X)
3. If X → Y1Y2···Yk:
   (a) Put FIRST(Y1) − {ε} in FIRST(X)
   (b) ∀i : 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ ··· ∩ FIRST(Yi−1) (i.e., Y1···Yi−1 ⇒∗ ε) then put FIRST(Yi) − {ε} in FIRST(X)
   (c) If ε ∈ FIRST(Y1) ∩ ··· ∩ FIRST(Yk) then put ε in FIRST(X)

Repeat until no more additions can be made.

(† we use $ to represent EOF)

FOLLOW

For a non-terminal A, define FOLLOW(A) as

    the set of terminals that can appear immediately to the right of A in some sentential form

Thus, a non-terminal’s FOLLOW set specifies the tokens that can legally appear after it. A terminal symbol has no FOLLOW set.

To build FOLLOW(A):

1. Put $ in FOLLOW(⟨goal⟩)
2. If A → αBβ:
   (a) Put FIRST(β) − {ε} in FOLLOW(B)
   (b) If β = ε (i.e., A → αB) or ε ∈ FIRST(β) (i.e., β ⇒∗ ε) then put FOLLOW(A) in FOLLOW(B)

Repeat until no more additions can be made.

LL(1) grammars

Previous definition: A grammar G is LL(1) iff. for all non-terminals A, each distinct pair of productions A → β and A → γ satisfy the condition FIRST(β) ∩ FIRST(γ) = ∅.

What if A ⇒∗ ε?

Revised definition: A grammar G is LL(1) iff. for each set of productions A → α1 | α2 | ··· | αn:

1. FIRST(α1), FIRST(α2), …, FIRST(αn) are all pairwise disjoint
2. If αi ⇒∗ ε then FIRST(αj) ∩ FOLLOW(A) = ∅, ∀1 ≤ j ≤ n, i ≠ j.

If G is ε-free, condition 1 is sufficient.

LL(1) grammars

Provable facts about LL(1) grammars:

1. No left-recursive grammar is LL(1)
2. No ambiguous grammar is LL(1)
3. Some languages have no LL(1) grammar
4. An ε-free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar.

Example:

• S → aS | a is not LL(1) because FIRST(aS) = FIRST(a) = {a}
• S → aS′
  S′ → aS′ | ε
  accepts the same language and is LL(1)

LL(1) parse table construction

Input: Grammar G
Output: Parsing table M
Method:

1. ∀ productions A → α:
   (a) ∀a ∈ FIRST(α), add A → α to M[A, a]
   (b) If ε ∈ FIRST(α):
       i. ∀b ∈ FOLLOW(A), add A → α to M[A, b]
       ii. If $ ∈ FOLLOW(A) then add A → α to M[A, $]
2. Set each undefined entry of M to error.

If ∃ M[A, a] with multiple entries then the grammar is not LL(1).

Note: recall a, b ∈ Vt, so a, b ≠ ε.

Example

Our long-suffering expression grammar:

    S → E           T → FT′
    E → TE′         T′ → ∗T | /T | ε
    E′ → +E | −E | ε    F → id | num

         FIRST            FOLLOW
    S    {num, id}        {$}
    E    {num, id}        {$}
    E′   {ε, +, −}        {$}
    T    {num, id}        {+, −, $}
    T′   {ε, ∗, /}        {+, −, $}
    F    {num, id}        {+, −, ∗, /, $}
    id   {id}             −
    num  {num}            −
    ∗    {∗}              −
    /    {/}              −
    +    {+}              −
    −    {−}              −

         id        num       +         −         ∗         /         $
    S    S → E     S → E     –         –         –         –         –
    E    E → TE′   E → TE′   –         –         –         –         –
    E′   –         –         E′ → +E   E′ → −E   –         –         E′ → ε
    T    T → FT′   T → FT′   –         –         –         –         –
    T′   –         –         T′ → ε    T′ → ε    T′ → ∗T   T′ → /T   T′ → ε
    F    F → id    F → num   –         –         –         –         –

Building the tree

Again, we insert code at the right points:

    tos ← 0
    Stack[tos] ← EOF
    Stack[++tos] ← root node
    Stack[++tos] ← Start Symbol
    token ← next token()
    repeat
        X ← Stack[tos]
        if X is a terminal or EOF then
            if X = token then
                pop X
                token ← next token()
                pop and fill in node
            else error()
        else  /* X is a non-terminal */
            if M[X, token] = X → Y1Y2···Yk then
                pop X
                pop node for X
                build node for each child and make it a child of node for X
                push nk, Yk, nk−1, Yk−1, ···, n1, Y1
            else error()
    until X = EOF

Error recovery

Key notion:
• For each non-terminal, construct a set of terminals on which the parser can synchronize
• When an error occurs looking for A, scan until an element of SYNCH(A) is found

Building SYNCH:
1. a ∈ FOLLOW(A) ⇒ a ∈ SYNCH(A)
2. place keywords that start statements in SYNCH(A)

A grammar that is not LL(1)

    ⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
           | if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
           | ...

Left-factored:

    ⟨stmt⟩  ::= if ⟨expr⟩ then ⟨stmt⟩ ⟨stmt′⟩ | ...
    ⟨stmt′⟩ ::= else ⟨stmt⟩ | ε

Now, FIRST(⟨stmt′⟩) = {ε, else}. Also, FOLLOW(⟨stmt′⟩) = {else, $}. But,

    FIRST(⟨stmt′⟩) ∩ FOLLOW(⟨stmt′⟩) = {else} ≠ ∅

On seeing else, there is a conflict between choosing
    ⟨stmt′⟩ ::= else ⟨stmt⟩  and  ⟨stmt′⟩ ::= ε

⇒ the grammar is not LL(1)!

The fix: Put priority on ⟨stmt′⟩ ::= else ⟨stmt⟩ to associate else with the closest previous then.

Building SYNCH, continued:
3. add symbols in FIRST(A) to SYNCH(A)

If we can’t match a terminal on top of stack:
1. pop the terminal
2. print a message saying the terminal was inserted
3. continue the parse

(i.e., SYNCH(a) = Vt − {a})

LL(k) parsing

Definition: G is strong LL(k) iff. for A → β and A → γ (β ≠ γ),

    FIRSTk(β FOLLOWk(A)) ∩ FIRSTk(γ FOLLOWk(A)) = ∅

LL(1) = strong LL(1), but strong LL(k) ⊂ LL(k) for k > 1.

Definition: G is LL(k) iff.:

1. S ⇒∗lm wAα ⇒lm wβα ⇒∗ wx
2. S ⇒∗lm wAα ⇒lm wγα ⇒∗ wy
3. FIRSTk(x) = FIRSTk(y)

⇒ β = γ

i.e., knowing w, A, and the next k input tokens, FIRSTk(x) = FIRSTk(y) accurately predicts the production to use: A → β (= γ).

• LL(k) ⊂ LL(k + 1)
• strong LL(k) ⊂ strong LL(k + 1)
• strong LL(k) ⊂ LL(k), k > 1

Strong LL(k): example

Consider the grammar, G:

    G → S$    S → aAa    S → bAba    A → b    A → ε

L(G) = {aba$, aa$, bbba$, bba$}

G is NOT LL(1):
    Predict(A → b) = FIRST1(b) = {b}
    Predict(A → ε) = FOLLOW1(A) = {a, b}

G is NOT strong LL(2):
    Predict(A → b) = FIRST2(b FOLLOW2(A)) = {ba, bb}
    Predict(A → ε) = FIRST2(ε FOLLOW2(A)) = {a$, ba}

G is LL(2):
    In aAa:  Predict(A → b) = {ba} and Predict(A → ε) = {a$}
    In bAba: Predict(A → b) = {bb} and Predict(A → ε) = {ba}

Some definitions

Recall: For a grammar G, with start symbol S, any string α such that S ⇒∗ α is called a sentential form.

• If α ∈ Vt∗, then α is called a sentence in L(G)
• Otherwise it is just a sentential form (not a sentence in L(G))

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence.

Bottom-up parsing

Goal: Given an input string w and a grammar G, construct a parse tree by starting at the leaves and working to the root.

At each match, it applies a reduction to build on the frontier:

• each reduction matches an upper frontier of the partially built tree to the RHS of some production
• each reduction adds a node on top of the frontier

The parser repeatedly matches a right-sentential form from the language against the tree’s upper frontier. (A right-sentential form is a sentential form that occurs in the rightmost derivation of some sentence.) The final result is a rightmost derivation, in reverse.

Example

Consider the grammar

    1  S → aABe
    2  A → Abc
    3    | b
    4  B → d

and the input string abbcde.

    Prod’n  Sentential form
    3       a b bcde
    2       a Abc de
    4       aA d e
    1       aABe
    –       S

The trick appears to be scanning the input and finding valid sentential forms.

Handles

What are we trying to find? A substring α of the tree’s upper frontier that matches some production A → α where reducing α to A is one step in the reverse of a rightmost derivation. We call such a string a handle.

Formally: a handle of a right-sentential form γ is a production A → β and a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ.

i.e., if S ⇒∗rm αAw ⇒rm αβw then A → β in the position following α is a handle of αβw.

Because γ is a right-sentential form, the substring to the right of a handle contains only terminal symbols.

Handles

[Figure: the handle A → β in the parse tree for αβw — S derives αAw, and the A node derives β.]

Handles

Theorem: If G is unambiguous then every right-sentential form has a unique handle.

Proof: (by definition)
1. G is unambiguous ⇒ rightmost derivation is unique
2. ⇒ a unique production A → β applied to take γi−1 to γi
3. ⇒ a unique position k at which A → β is applied
4. ⇒ a unique handle A → β

Example

The left-recursive expression grammar (original form):

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
    3           | ⟨expr⟩ − ⟨term⟩
    4           | ⟨term⟩
    5  ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
    6           | ⟨term⟩/⟨factor⟩
    7           | ⟨factor⟩
    8  ⟨factor⟩ ::= num
    9           | id
    Prod’n  Sentential form
    –       ⟨goal⟩
    1       ⟨expr⟩
    3       ⟨expr⟩ − ⟨term⟩
    5       ⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩
    9       ⟨expr⟩ − ⟨term⟩ ∗ id
    7       ⟨expr⟩ − ⟨factor⟩ ∗ id
    8       ⟨expr⟩ − num ∗ id
    4       ⟨term⟩ − num ∗ id
    7       ⟨factor⟩ − num ∗ id
    9       id − num ∗ id

Handle-pruning

The process to construct a bottom-up parse is called handle-pruning. To construct a rightmost derivation

    S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ··· ⇒ γn−1 ⇒ γn = w

we set i to n and apply the following simple algorithm:

    for i = n downto 1
        1. find the handle Ai → βi in γi
        2. replace βi with Ai to generate γi−1

This takes 2n steps, where n is the length of the derivation.

Stack implementation

One scheme to implement a handle-pruning, bottom-up parser is called a shift-reduce parser. Shift-reduce parsers use a stack and an input buffer:

1. initialize stack with $
2. Repeat until the top of the stack is the goal symbol and the input token is $:
   a) find the handle — if we don’t have a handle on top of the stack, shift an input symbol onto the stack
   b) prune the handle — if we have a handle A → β on the stack, reduce:
      i) pop |β| symbols off the stack
      ii) push A onto the stack

Example: back to x − 2 ∗ y

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
    3           | ⟨expr⟩ − ⟨term⟩
    4           | ⟨term⟩
    5  ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
    6           | ⟨term⟩/⟨factor⟩
    7           | ⟨factor⟩
    8  ⟨factor⟩ ::= num
    9           | id

    Stack                          Input            Action
    $                              id − num ∗ id    shift
    $ id                           − num ∗ id       reduce 9
    $ ⟨factor⟩                     − num ∗ id       reduce 7
    $ ⟨term⟩                       − num ∗ id       reduce 4
    $ ⟨expr⟩                       − num ∗ id       shift
    $ ⟨expr⟩ −                     num ∗ id         shift
    $ ⟨expr⟩ − num                 ∗ id             reduce 8
    $ ⟨expr⟩ − ⟨factor⟩            ∗ id             reduce 7
    $ ⟨expr⟩ − ⟨term⟩              ∗ id             shift
    $ ⟨expr⟩ − ⟨term⟩ ∗            id               shift
    $ ⟨expr⟩ − ⟨term⟩ ∗ id                          reduce 9
    $ ⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩                    reduce 5
    $ ⟨expr⟩ − ⟨term⟩                               reduce 3
    $ ⟨expr⟩                                        reduce 1
    $ ⟨goal⟩                                        accept

In summary:
1. Shift until the top of the stack is the right end of a handle
2.
Find the left end of the handle and reduce.

The example took 5 shifts + 9 reduces + 1 accept.

Shift-reduce parsing

Shift-reduce parsers are simple to understand. A shift-reduce parser has just four canonical actions:

1. shift — next input symbol is shifted onto the top of the stack
2. reduce — right end of handle is on top of stack; locate left end of handle within the stack; pop handle off stack and push appropriate non-terminal LHS
3. accept — terminate parsing and signal success
4. error — call an error recovery routine

Key insight: recognize handles with a DFA:
• DFA transitions shift states instead of symbols
• accepting states trigger reductions

LR parsing

The skeleton parser:

    push s0
    token ← next token()
    repeat forever
        s ← top of stack
        if action[s, token] = “shift si” then
            push si
            token ← next token()
        else if action[s, token] = “reduce A → β” then
            pop |β| states
            s′ ← top of stack
            push goto[s′, A]
        else if action[s, token] = “accept” then
            return
        else error()

This takes k shifts, l reduces, and 1 accept, where k is the length of the input string and l is the length of the reverse rightmost derivation.

Example tables

The grammar:

    1  ⟨goal⟩   ::= ⟨expr⟩
    2  ⟨expr⟩   ::= ⟨term⟩ + ⟨expr⟩
    3           | ⟨term⟩
    4  ⟨term⟩   ::= ⟨factor⟩ ∗ ⟨term⟩
    5           | ⟨factor⟩
    6  ⟨factor⟩ ::= id

                 ACTION                    GOTO
    state   id    +     ∗     $      ⟨expr⟩  ⟨term⟩  ⟨factor⟩
    0       s4    –     –     –        1       2       3
    1       –     –     –     acc      –       –       –
    2       –     s5    –     r3       –       –       –
    3       –     r5    s6    r5       –       –       –
    4       –     r6    r6    r6       –       –       –
    5       s4    –     –     –        7       2       3
    6       s4    –     –     –        –       8       3
    7       –     –     –     r2       –       –       –
    8       –     r4    –     r4       –       –       –

Example using the tables

    Stack          Input           Action
    $ 0            id ∗ id + id$   s4
    $ 0 4          ∗ id + id$      r6
    $ 0 3          ∗ id + id$      s6
    $ 0 3 6        id + id$        s4
    $ 0 3 6 4      + id$           r6
    $ 0 3 6 3      + id$           r5
    $ 0 3 6 8      + id$           r4
    $ 0 2          + id$           s5
    $ 0 2 5        id$             s4
    $ 0 2 5 4      $               r6
    $ 0 2 5 3      $               r5
    $ 0 2 5 2      $               r3
    $ 0 2 5 7      $               r2
    $ 0 1          $               acc

Note: This is a simple little right-recursive grammar; not the same as in previous lectures.
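The skeleton parser and the example tables fit together in a short Python sketch (the table encoding as dicts and the names are illustrative; production 1, ⟨goal⟩ ::= ⟨expr⟩, is subsumed by the accept action, as in the tables):

```python
# ACTION maps (state, token) to "s<state>", "r<production>", or "acc";
# missing entries are errors. GOTO maps (state, nonterminal) to a state.
ACTION = {
    (0, "id"): "s4", (1, "$"): "acc",
    (2, "+"): "s5", (2, "$"): "r3",
    (3, "+"): "r5", (3, "*"): "s6", (3, "$"): "r5",
    (4, "+"): "r6", (4, "*"): "r6", (4, "$"): "r6",
    (5, "id"): "s4", (6, "id"): "s4",
    (7, "$"): "r2",
    (8, "+"): "r4", (8, "$"): "r4",
}
GOTO = {
    (0, "expr"): 1, (0, "term"): 2, (0, "factor"): 3,
    (5, "expr"): 7, (5, "term"): 2, (5, "factor"): 3,
    (6, "term"): 8, (6, "factor"): 3,
}
RULES = {           # production number -> (LHS, |RHS|)
    2: ("expr", 3), 3: ("expr", 1),
    4: ("term", 3), 5: ("term", 1), 6: ("factor", 1),
}

def parse(tokens):
    stack = [0]                      # push s0
    tokens = tokens + ["$"]
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return False             # error()
        if act == "acc":
            return True              # accept
        if act.startswith("s"):      # shift si
            stack.append(int(act[1:]))
            i += 1
        else:                        # reduce A -> beta
            lhs, n = RULES[int(act[1:])]
            del stack[-n:]           # pop |beta| states
            stack.append(GOTO[(stack[-1], lhs)])

ok = parse(["id", "*", "id", "+", "id"])
```

Running `parse` on id ∗ id + id follows exactly the stack trace shown above and accepts; malformed input hits a missing ACTION entry and fails.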
LR(k) grammars

Informally, we say that a grammar G is LR(k) if, given a rightmost derivation

    S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ··· ⇒ γn = w,

we can, for each right-sentential form in the derivation,

1. isolate the handle of each right-sentential form, and
2. determine the production by which to reduce

by scanning γi from left to right, going at most k symbols beyond the right end of the handle of γi.

LR(k) grammars

Formally, a grammar G is LR(k) iff.:

1. S ⇒∗rm αAw ⇒rm αβw, and
2. S ⇒∗rm γBx ⇒rm αβy, and
3. FIRSTk(w) = FIRSTk(y)

⇒ αAy = γBx

i.e., assume sentential forms αβw and αβy, with common prefix αβ and common k-symbol lookahead FIRSTk(y) = FIRSTk(w), such that αβw reduces to αAw and αβy reduces to γBx. But, the common prefix means αβy also reduces to αAy, for the same result. Thus αAy = γBx.

Why study LR grammars?

LR(1) grammars are often used to construct parsers. We call these parsers LR(1) parsers.

• virtually all context-free programming language constructs can be expressed in an LR(1) form
• LR grammars are the most general grammars parsable by a deterministic, bottom-up parser
• efficient parsers can be implemented for LR(1) grammars
• LR parsers detect an error as soon as possible in a left-to-right scan of the input
• LR grammars describe a proper superset of the languages recognized by predictive (i.e., LL) parsers

LL(k): recognize use of a production A → β seeing the first k symbols derived from β
LR(k): recognize the handle β after seeing everything derived from β plus k lookahead symbols

LR parsing

Three common algorithms build tables for an “LR” parser. An LR(1) parser for either Algol or Pascal has several thousand states, while an SLR(1) or LALR(1) parser for the same language may have several hundred states. Taking the third first:

3. LALR(1)
   • intermediate sized set of grammars
   • same number of states as SLR(1)
   • canonical construction is slow and large
   • better construction techniques exist
SLR(1) • smallest class of grammars • smallest tables (number of states) • simple, fast construction 2. LR(1) • full set of LR(1) grammars • largest tables (number of states) • slow, large construction • LR parsers detect an error as soon as possible in a left-to-right scan of the input CS502 70 71 CS502 Parsing 72 LR(k) items Example The table construction algorithms use sets of LR(k) items or configurations to represent the possible states in a parse. The • indicates how much of an item we have seen at a given state in the parse: [A → •XY Z] indicates that the parser is looking for a string that can be derived from XY Z [A → XY • Z] indicates that the parser has seen a string derived from XY and is looking for one derivable from Z An LR(k) item is a pair [!, "], where ! is a production from G with a • at some position in the RHS, marking how much of the RHS of a production has already been seen " is a lookahead string containing k symbols (terminals or $) LR(0) items: A → XY Z generates 4 LR(0) items: 1. 2. 3. 4. Two cases of interest are k = 0 and k = 1: LR(0) items play a key role in the SLR(1) table construction algorithm. LR(1) items play a key role in the LR(1) and LALR(1) table construction algorithms. CS502 Parsing (no lookahead) 73 [A → •XY Z] [A → X •Y Z] [A → XY • Z] [A → XY Z•] CS502 Parsing The characteristic finite state machine (CFSM) closure0 The CFSM for a grammar is a DFA which recognizes viable prefixes of right-sentential forms: Given an item [A → ! • B"], its closure contains the item and any other items that can generate legal substrings to follow !. A viable prefix is any prefix that does not extend beyond the handle. Thus, if the parser has viable prefix ! on its stack, the input should reduce to B" (or # for some other item [B → •#] in the closure). It accepts when a handle has been discovered and needs to be reduced. To construct the CFSM we need two functions: function closure0(I) repeat if [A → ! 
• B"] ∈ I add [B → •#] to I until no more items can be added to I return I • closure0(I) to build its states • goto0(I, X) to determine its transitions CS502 Parsing 74 75 CS502 Parsing 76 goto0 Building the LR(0) item sets Let I be a set of LR(0) items and X be a grammar symbol. We start the construction with the item [S. → •S$], where S. is the start symbol of the augmented grammar G. S is the start symbol of G $ represents EOF Then, GOTO(I, X) is the closure of the set of all items [A → !X • "] such that [A → ! • X"] ∈ I If I is the set of valid items for some viable prefix #, then GOTO(I, X) is the set of valid items for the viable prefix #X. GOTO (I, X) To compute the collection of sets of LR(0) items function items(G.) s0 ← closure0({[S. → •S$]}) S ← {s0} repeat for each set of items s ∈ S for each grammar symbol X if goto0(s, X) /= % and goto0(s, X) /∈ S add goto0(s, X) to S until no more item sets can be added to S return S represents state after recognizing X in state I. function goto0(I,X) let J be the set of items [A → !X • "] such that [A → ! • X"] ∈ I return closure0(J) CS502 Parsing 77 LR(0) example I0 : S → •E$ E → •E + T E → •T T → •id T → •(E) I1 : S → E • $ E → E • +T I2 : S → E$• I3 : E → E + •T T → •id T → •(E) The corresponding CFSM: 9 T id 0 + 1 2 CS502 E I4 : E → E + T • I5 : T → id• I6 : T → (•E) E → •E + T E → •T T → •id T → •(E) I7 : T → (E•) E → E • +T I8 : T → (E)• I9 : E → T • 7 ) T 4 ( 6 ( + 3 $ T id 5 id E Parsing 78 Constructing the LR(0) parsing table 1 S → E$ 2 E → E +T | T 3 4 T → id 5 | (E) ( CS502 1. construct the collection of sets of LR(0) items for G. 2. state i of the CFSM is constructed from Ii (a) [A → ! • a"] ∈ Ii and goto0(Ii, a) = I j ⇒ ACTION[i, a] ← “shift j” (b) [A → !•] ∈ Ii, A /= S. ⇒ ACTION[i, a] ← “reduce A → !”, ∀a (c) [S. → S$•] ∈ Ii ⇒ ACTION[i, a] ← “accept”, ∀a 3. goto0(Ii, A) = I j ⇒ GOTO[i, A] ← j 4. set undefined entries in ACTION and GOTO to “error” 8 5. initial state of parser s0 is closure0([S. 
→ •S$]) Parsing 79 CS502 Parsing 80 LR(0) example Conflicts in the ACTION table 9 state T ( id 0 + 1 5 E 7 ) T 4 0 1 2 3 4 5 6 7 8 9 ( 6 ( + 3 $ 2 id id E T 8 ACTION GOTO id ( ) + $ S E T s5 s6 – – – – 1 9 – – – s3 s2 – – – acc acc acc acc acc – – – s5 s6 – – – – – 4 r2 r2 r2 r2 r2 – – – r4 r4 r4 r4 r4 – – – s5 s6 – – – – 7 9 – – s8 s3 – – – – r5 r5 r5 r5 r5 – – – r3 r3 r3 r3 r3 – – – If the LR(0) parsing table contains any multiply-defined ACTION entries then G is not LR(0) Two conflicts arise: shift-reduce: both shift and reduce possible in same item set reduce-reduce: more than one distinct reduce action possible in same item set Conflicts can be resolved through lookahead in ACTION. Consider: • A → $ | a! ⇒ shift-reduce conflict • a:=b+c*d requires lookahead to avoid shift-reduce conflict after shifting c (need to see * to give precedence over +) CS502 Parsing 81 SLR(1): simple lookahead LR CS502 Parsing 82 From previous example Add lookaheads after building LR(0) item sets 1 S → E$ 2 E → E +T | T 3 4 T → id 5 | (E) Constructing the SLR(1) parsing table: 1. construct the collection of sets of LR(0) items for G. FOLLOW(E) = FOLLOW(T ) = {$, +, )} state 2. state i of the CFSM is constructed from Ii 9 (a) [A → ! • a"] ∈ Ii and goto0(Ii, a) = I j ⇒ ACTION[i, a] ← “shift j”, ∀a /= $ (b) [A → !•] ∈ Ii, A /= S. ⇒ ACTION[i, a] ← “reduce A → !”, ∀a ∈ FOLLOW(A) T ( id 0 (c) [S. → S • $] ∈ Ii ⇒ ACTION[i, $] ← “accept” id E + 1 3. goto0(Ii, A) = I j ⇒ GOTO[i, A] ← j 2 E 7 ) T 4 ( 6 ( + 3 $ T id 5 0 1 2 3 4 5 6 7 8 9 ACTION GOTO id ( ) + $ S E T s5 s6 – – – – 1 9 – – – s3 acc – – – – – – – – –– – s5 s6 – – – – – 4 – – r2 r2 r2 – – – – – r4 r4 r4 – – – s5 s6 – – – – 7 9 – – s8 s3 – – – – – – r5 r5 r5 – – – – – r3 r3 r3 – – – 8 4. set undefined entries in ACTION and GOTO to “error” 5. initial state of parser s0 is closure0([S. 
→ •S$]) CS502 Parsing 83 CS502 Parsing 84 Example: A grammar that is not LR(0) 1 2 3 4 5 6 7 S → E$ E → E +T | T T → T ∗F | F F → id | (E) FOLLOW E {+, ), $} T {+, ∗, ), $} F {+, ∗, ), $} CS502 I0 : S → •E$ E → •E + T E → •T T → •T ∗ F T → •F F → •id F → •(E) I1 : S → E • $ E → E • +T I2 : S → E$• I3 : E → E + •T T → •T ∗ F T → •F F → •id F → •(E) I4 : T → F• I5 : F → id• Example: But it is SLR(1) I6 : F → (•E) E → •E + T E → •T T → •T ∗ F T → •F F → •id F → •(E) I7 : E → T • T → T • ∗F I8 : T → T ∗ •F F → •id F → •(E) I9 : T → T ∗ F• I10 : F → (E)• I11 : E → E + T • T → T • ∗F I12 : F → (E•) E → E • +T Parsing state 0 1 2 3 4 5 6 7 8 9 10 11 12 85 Example: A grammar that is not SLR(1) Consider: S → L=R | R L → ∗R | id R → L Its LR(0) item sets: I0 : S. → •S$ S → •L = R S → •R L → •∗R L → •id R → •L I1 : S. → S • $ I2 : S → L• = R R → L• I3 : S → R• I4 : L → id• + – s3 – – r5 r6 – r3 – r4 r7 r2 s3 CS502 ∗ – – – – r5 r6 – s8 – r4 r7 s8 – ACTION id ( ) $ s5 s6 – – – – – acc – – – – s5 s6 – – – – r5 r5 – – r6 r6 s5 s6 – – – – r3 r3 s5 s6 – – – – r4 r4 – – r7 r7 – – r2 r2 – – s10 – S – – – – – – – – – – – – – GOTO E T 1 7 – – – – – 11 – – – – 12 7 – – – – – – – – – – – – F 4 – – 4 – – 4 – 9 – – – – Parsing 86 LR(1) items Recall: An LR(k) item is a pair [!, "], where I5 : L → ∗ • R R → •L L → •∗R L → •id I6 : S → L = •R R → •L L → •∗R L → •id I7 : L → ∗R• I8 : R → L• I9 : S → L = R• ! is a production from G with a • at some position in the RHS, marking how much of the RHS of a production has been seen " is a lookahead string containing k symbols (terminals or $) What about LR(1) items? • All the lookahead strings are constrained to have length 1 • Look something like [A → X •Y Z, a] Now consider I2: = ∈ FOLLOW(R) (S ⇒ L = R ⇒ ∗R = R) CS502 Parsing 87 CS502 Parsing 88 LR(1) items closure1(I) What’s the point of the lookahead symbols? Given an item [A → ! • B", a], its closure contains the item and any other items that can generate legal substrings to follow !. 
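The closure computation for LR(1) items can be sketched concretely. Below is a hypothetical Python rendering, using the non-SLR(1) grammar above (S → L=R | R, L → ∗R | id, R → L) and assuming a grammar with no ε-productions, so that FIRST(βa) reduces to FIRST of the first symbol of β, or {a} when β is empty.

```python
# Sketch of closure1 for LR(1) items, on the non-SLR(1) example grammar.
# An item is (lhs, rhs, dot, lookahead); nonterminals are GRAMMAR's keys.
# Simplifying assumption: no epsilon-productions (true for this grammar).

GRAMMAR = {
    "S'": [("S",)],
    "S":  [("L", "=", "R"), ("R",)],
    "L":  [("*", "R"), ("id",)],
    "R":  [("L",)],
}

def first_sym(X, seen=frozenset()):
    """FIRST of a single symbol; 'seen' guards against left recursion."""
    if X not in GRAMMAR:          # terminal
        return {X}
    if X in seen:
        return set()
    out = set()
    for rhs in GRAMMAR[X]:
        out |= first_sym(rhs[0], seen | {X})
    return out

def closure1(items):
    """Add [B -> .gamma, b] for each [A -> alpha . B beta, a], b in FIRST(beta a)."""
    items = set(items)
    while True:
        new = set()
        for (lhs, rhs, dot, la) in items:
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                beta = rhs[dot + 1:]
                # no epsilon-productions: FIRST(beta a) is FIRST(beta[0]), or {a}
                las = first_sym(beta[0]) if beta else {la}
                for prod in GRAMMAR[rhs[dot]]:
                    for b in las:
                        new.add((rhs[dot], prod, 0, b))
        if new <= items:
            return frozenset(items)
        items |= new

I0 = closure1({("S'", ("S",), 0, "$")})
print(len(I0))   # 8 items, matching I0 on the slide
```

Note how [S → •L=R, $] contributes items [L → •∗R, =] and [L → •id, =] (lookahead = from FIRST("=R" · $)), while [R → •L, $] contributes the same L-items with lookahead $, reproducing the split lookaheads in I0.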
• carry along to choose correct reduction when there is a choice
• lookaheads are bookkeeping, unless the item has • at the right end:
  – in [A → X • Y Z, a], a has no direct use
  – in [A → X Y Z •, a], a is useful
• allows use of grammars that are not uniquely invertible†

The point: for [A → α•, a] and [B → α•, b], we can decide between reducing to A or B by looking at limited right context.

† No two productions have the same RHS.

Thus, if the parser has viable prefix α on its stack, the input should reduce to Bβ (or γ for some other item [B → •γ, b] in the closure).

    function closure1(I)
        repeat
            if [A → α • Bβ, a] ∈ I
                add [B → •γ, b] to I, where b ∈ FIRST(βa)
        until no more items can be added to I
        return I

goto1(I)

Let I be a set of LR(1) items and X be a grammar symbol. Then, goto1(I, X) is the closure of the set of all items [A → αX • β, a] such that [A → α • Xβ, a] ∈ I. If I is the set of valid items for some viable prefix γ, then goto1(I, X) is the set of valid items for the viable prefix γX. goto1(I, X) represents the state after recognizing X in state I.

    function goto1(I, X)
        let J be the set of items [A → αX • β, a]
            such that [A → α • Xβ, a] ∈ I
        return closure1(J)

Building the LR(1) item sets for grammar G

We start the construction with the item [S′ → •S, $], where S′ is the start symbol of the augmented grammar G′, S is the start symbol of G, and $ represents EOF.

To compute the collection of sets of LR(1) items:

    function items(G′)
        s0 ← closure1({[S′ → •S, $]})
        S ← {s0}
        repeat
            for each set of items s ∈ S
                for each grammar symbol X
                    if goto1(s, X) ≠ ∅ and goto1(s, X) ∉ S
                        add goto1(s, X) to S
        until no more item sets can be added to S
        return S

Constructing the LR(1) parsing table

Build lookahead into the DFA to begin with. Back to the previous example (∉ SLR(1)):

S → L = R
  | R
L → ∗R
  | id
R → L

1. construct the collection of sets of LR(1) items for G′.
2. state i of the LR(1) machine is constructed from Ii
(a) [A → α 
• a", b] ∈ Ii and goto1(Ii, a) = I j ⇒ ACTION[i, a] ← “shift j” (b) [A → !•, a] ∈ Ii, A /= S. ⇒ ACTION[i, a] ← “reduce A → !” (c) [S. → S•, $] ∈ Ii ⇒ ACTION[i, $] ← “accept” I0 : S. → •S, $ S → •L = R, $ S → •R, $ L → • ∗ R, = L → •id, = R → •L, $ L → • ∗ R, $ L → •id, $ I1 : S. → S•, $ I2 : S → L• = R, $ R → L•, $ I3 : S → R•, $ I4 : L → ∗ • R, = $ R → •L, =$ L → • ∗ R, = $ L → •id, =$ L=R R ∗R id L 3. goto1(Ii, A) = I j ⇒ GOTO[i, A] ← j 4. set undefined entries in ACTION and GOTO to “error” 5. initial state of parser s0 is closure1([S. → •S, $]) I5 : L → id•, =$ I6 : S → L = •R, $ R → •L, $ L → • ∗ R, $ L → •id, $ I7 : L → ∗R•, = $ I8 : R → L•, =$ I9 : S → L = R•, $ I10 : R → L•, $ I11 : L → ∗ • R, $ R → •L, $ L → • ∗ R, $ L → •id, $ I12 : L → id•, $ I13 : L → ∗R•, $ I2 no longer has shift-reduce conflict: reduce on $, shift on = CS502 Parsing 93 CS502 Parsing Example: back to SLR(1) expression grammar Another example In general, LR(1) has many more states than LR(0)/SLR(1): Consider: 0 S. → 1 S → 2 C → | 3 1 S → E 4 T → T ∗F | F 2 E → E +T 5 | T 3 6 F → id 7 | (E) LR(1) item sets: I0 : S → •E, $ E → •E + T,+$ E → •T, +$ T → •T ∗ F, ∗ + $ T → •F, ∗+$ F → •id, ∗ + $ F → •(E), ∗ + $ CS502 I0. :shifting ( F → (•E), ∗ + $ E → •E + T,+) E → •T, +) T → •T ∗ F, ∗+) T → •F, ∗+) F → •id, ∗+) F → •(E), ∗+) Parsing S CC cC d state ACTION c d $ 0 s3 s4 – 1 – – acc 2 s6 s7 – 3 s3 s4 – 4 r3 r3 – 5 – – r1 6 s6 s7 – 7 – – r3 8 r2 r2 – 9 – – r2 I0.. :shifting ( F → (•E), ∗+) E → •E + T,+) E → •T, +) T → •T ∗ F, ∗+) T → •F, ∗+) F → •id, ∗+) F → •(E), ∗+) 95 CS502 GOTO S C 1 2 – – – 5 – 8 – – – – – 9 – – – – – – LR(1) item sets: I0 : S. → •S, $ S → •CC, $ C → •cC, cd C → •d, cd I1 : S. 
→ S•, $ I2 : S → C •C, $ C → •cC, $ C → •d, $ I3 : C → c •C, cd C → •cC, cd C → •d, cd Parsing 94 I4 : C → d•, cd I5 : S → CC•, $ I6 : C → c •C, $ C → •cC, $ C → •d, $ I7 : C → d•, $ I8 : C → cC•, cd I9 : C → cC•, $ 96 LALR(1) parsing LALR(1) table construction Define the core of a set of LR(1) items to be the set of LR(0) items derived by ignoring the lookahead symbols. To construct LALR(1) parsing tables, we can insert a single step into the LR(1) algorithm Thus, the two sets (1.5) For each core present among the set of LR(1) items, find all sets having that core and replace these sets by their union. • {[A → ! • ", a], [A → ! • ", b]}, and • {[A → ! • ", c], [A → ! • ", d]} The goto function must be updated to reflect the replacement sets. have the same core. Key idea: If two sets of LR(1) items, Ii and I j , have the same core, we can merge the states that represent them in the ACTION and GOTO tables. The resulting algorithm has large space requirements. CS502 Parsing 97 CS502 LALR(1) table construction Example The revised (and renumbered) algorithm Reconsider: 0 S. → S 1 S → CC 2 C → cC | d 3 1. construct the collection of sets of LR(1) items for G. 2. for each core present among the set of LR(1) items, find all sets having that core and replace these sets by their union (update the goto function incrementally) 3. state i of the LALR(1) machine is constructed from Ii. Merged states: I36 : C → c •C, cd$ C → •cC, cd$ C → •d, cd$ I47 : C → d•, cd$ I89 : C → cC•, cd$ (a) [A → ! • a", b] ∈ Ii and goto1(Ii, a) = I j ⇒ ACTION[i, a] ← “shift j” (b) [A → !•, a] ∈ Ii, A /= S. ⇒ ACTION[i, a] ← “reduce A → !” (c) [S. → S•, $] ∈ Ii ⇒ ACTION [i, $] ← “accept” 4. goto1(Ii, A) = I j ⇒ GOTO[i, A] ← j Parsing I0 : S. → •S, $ S → •CC, $ C → •cC, cd C → •d, cd I1 : S. 
→ S•, $ I2 : S → C •C, $ C → •cC, $ C → •d, $ I3 : C → c •C, cd C → •cC, cd C → •d, cd I4 : C → d•, cd I5 : S → CC•, $ 98 I6 : C → c •C, $ C → •cC, $ C → •d, $ I7 : C → d•, $ I8 : C → cC•, cd I9 : C → cC•, $ state ACTION GOTO c d $ S C 0 s36 s47 – 1 2 1 – – acc – – 2 s36 s47 – – 5 36 s36 s47 – – 8 47 r3 r3 r3 – – 5 – – r1 – – 89 r2 r2 r2 – – 5. set undefined entries in ACTION and GOTO to “error” 6. initial state of parser s0 is closure1([S. → •S, $]) CS502 Parsing 99 CS502 Parsing 100 More efficient LALR(1) construction The role of precedence Observe that we can: Precedence and associativity can be used to resolve shift/reduce conflicts in ambiguous grammars. • represent Ii by its basis or kernel: items that are either [S. → •S, $] or do not have • at the left of the RHS • lookahead with higher precedence ⇒ shift • same precedence, left associative ⇒ reduce Advantages: • compute shift, reduce and goto actions for state derived from Ii directly from its kernel • more concise, albeit ambiguous, grammars • shallower parse trees ⇒ fewer reductions Classic application: expression grammars This leads to a method that avoids building the complete canonical collection of sets of LR(1) items CS502 Parsing 101 CS502 Parsing The role of precedence Error recovery in shift-reduce parsers With precedence and associativity, we can use: The problem E → | | | | | | | E ∗E E/E E +E E −E (E) -E id num • encounter an invalid token • bad pieces of tree hanging from stack • incorrect entries in symbol table We want to parse the rest of the file Restarting the parser This eliminates useless reductions (single productions) CS502 Parsing 102 • find a restartable state on the stack • move to a consistent place in the input • print an informative message to stderr 103 CS502 Parsing (line number) 104 Error recovery in yacc/bison/Java CUP Example The error mechanism Using error stmt list • designated token error • valid in any production : | stmt stmt list ; stmt can be augmented with error • 
error shows synchronization points

When an error is discovered:
• pops the stack until error is legal
• skips input tokens until it successfully shifts 3 tokens
• error productions can have actions

This mechanism is fairly general.

Example: the list

stmt_list : stmt
          | stmt_list ; stmt

can be augmented with error:

stmt_list : stmt
          | error
          | stmt_list ; stmt

This should
• throw out the erroneous statement
• synchronize at “;” or “end”
• invoke yyerror("syntax error")

Other “natural” places for errors:
• all the “lists”: FieldList, CaseList
• missing parentheses or brackets (yychar)
• extra operator or missing operator

See §Error Recovery of the on-line CUP manual.

Left versus right recursion

Right recursion:
• needed for termination in predictive parsers
• requires more stack space
• right associative operators

Left recursion:
• works fine in bottom-up parsers
• limits required stack space
• left associative operators

Rule of thumb:
• right recursion for top-down parsers
• left recursion for bottom-up parsers

Parsing review

Recursive descent: a hand-coded recursive descent parser directly encodes a grammar (typically an LL(1) grammar) into a series of mutually recursive procedures. It has most of the linguistic limitations of LL(1).

LL(k): an LL(k) parser must be able to recognize the use of a production after seeing only the first k symbols of its right hand side.

LR(k): an LR(k) parser must be able to recognize the occurrence of the right hand side of a production after having seen all that is derived from that right hand side, with k symbols of lookahead.

Complexity of parsing: grammar hierarchy

The original slide shows the Chomsky hierarchy as a containment diagram (with the context-free grammars further divided into ambiguous and unambiguous), annotated with recognizers and costs:
• type 0 (unrestricted): α → β
• type 1 (context-sensitive): linear-bounded automaton; PSPACE-complete
• type 2 (context-free): Earley’s algorithm, O(n³)
• type 3 (regular, A → wX): DFA, O(n)

Language vs. grammar

For example, every regular language has a grammar that is LL(1), but not all regular grammars are LL(1). Consider:

S → ab
S → ac

Without left-factoring, this grammar is not LL(1).
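The associativity point about recursion can be seen concretely: a left-recursive subtraction rule groups (8−4)−2, while a right-recursive one groups 8−(4−2). A small illustrative sketch (hypothetical helper names, single-digit operands):

```python
# Sketch of how recursion direction fixes associativity for
#   expr -> expr - term   (left recursion:  evaluate as a left fold)
#   expr -> term - expr   (right recursion: evaluate by recursing on the rest)
# Tokens are single digits and '-', e.g. ['8', '-', '4', '-', '2'].

def eval_left(tokens):
    """Left-associative: fold terms as they arrive, mirroring left recursion."""
    val = int(tokens[0])
    for i in range(1, len(tokens), 2):   # step over '-' operators
        val = val - int(tokens[i + 1])
    return val

def eval_right(tokens):
    """Right-associative: recurse on the remainder, mirroring right recursion."""
    if len(tokens) == 1:
        return int(tokens[0])
    return int(tokens[0]) - eval_right(tokens[2:])

toks = list("8-4-2")
print(eval_left(toks))    # 2, i.e. (8-4)-2
print(eval_right(toks))   # 6, i.e. 8-(4-2)
```

The iterative fold also shows why left recursion "limits required stack space": it needs none, whereas the right-recursive version's call depth grows with the number of operators.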
Within the context-free grammars, the diagram places the LR(k) and LL(k) grammars, parsable in O(n) by Knuth’s algorithm, with LR(1) containing LALR(1), which contains SLR(1), which contains LR(0); similarly LL(1) contains LL(0). Note: this is a hierarchy of grammars, not languages.
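To close the review, the LR(0) construction from the earlier slides (closure0, goto0, and the items worklist) can be sketched as follows, using the LR(0) example grammar S → E$, E → E+T | T, T → id | (E). With these productions the construction yields the ten item sets I0–I9 shown earlier.

```python
# Sketch of the LR(0) set-of-items construction on the slides' grammar:
#   S -> E $ ;  E -> E + T | T ;  T -> id | ( E )
# An item is (lhs, rhs, dot); nonterminals are the keys of GRAMMAR.

GRAMMAR = {
    "S": [("E", "$")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",), ("(", "E", ")")],
}

def closure0(items):
    """Add [B -> .gamma] for every item with the dot before nonterminal B."""
    items = set(items)
    while True:
        new = set()
        for (lhs, rhs, dot) in items:
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    new.add((rhs[dot], prod, 0))
        if new <= items:
            return frozenset(items)
        items |= new

def goto0(I, X):
    """Advance the dot over X in every item that allows it, then close."""
    J = {(lhs, rhs, dot + 1) for (lhs, rhs, dot) in I
         if dot < len(rhs) and rhs[dot] == X}
    return closure0(J)

def items():
    """Worklist collection of all LR(0) item sets reachable from the start."""
    symbols = {X for prods in GRAMMAR.values() for rhs in prods for X in rhs}
    s0 = closure0({("S", ("E", "$"), 0)})
    S, work = {s0}, [s0]
    while work:
        s = work.pop()
        for X in symbols:
            t = goto0(s, X)
            if t and t not in S:
                S.add(t)
                work.append(t)
    return S

sets = items()
print(len(sets))   # 10 item sets, matching I0..I9 on the slide
```

The same worklist becomes the LR(1) construction by carrying lookaheads in the items and replacing closure0/goto0 with their closure1/goto1 counterparts.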