Language Processors (E2.15)
Lecture 2: Language Grammars

Objectives

To introduce:
- The concept of grammars
- Types of grammars
- Chomsky’s hierarchy
- Examples; string derivations; parse trees
- Regular expressions
- Notations associated with grammars

Introduction to Grammars (1): What are they?

- Grammars are the essential formalism for describing the structure of programs in a programming language.
- They provide a precise, yet easy to understand, syntactic specification of a programming language.
- Although in principle they describe only the syntax of a language, they are instrumental in the definition of its semantics.

Introduction to Grammars (2): Why have them?

- [By definition] To describe a language precisely and unambiguously.
- From certain classes of grammars we can automatically construct an efficient parser that determines whether a source program is syntactically well formed [we will see how in coming lectures].

Introduction to Grammars (3): Where is the challenge?

- If a language contained only a finite number of elements (strings), its definition would be straightforward (albeit tedious), since all the strings in the language could be listed.
- All languages of interest contain an infinite number of strings, so some means of representing an infinite number of strings in a finite manner is required.
- We will examine different notations, with varying expressive power.

Grammar Basics (1): Form of a grammar (1)

- A grammar consists mainly of a set of production rules [full definition of a grammar coming up].
- A production rule has two parts, a left-hand side and a right-hand side, separated by a left-to-right arrow; the left-hand side is the name of the syntactic construct, and the right-hand side shows a possible form of that construct.
- The right-hand side of a production rule can contain two kinds of symbols: terminal and non-terminal.
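
As a rough illustration of this two-part form, a production rule could be held as a small data structure. The sketch below is only an assumption made for illustration: the class name `Production`, the helpers `is_terminal` and `split_symbols`, and the convention that terminals are written in single quotes are all hypothetical. The rule it builds is the expression rule this lecture uses as its example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Production:
    """One production rule: lhs -> rhs (hypothetical representation)."""
    lhs: str               # name of the syntactic construct
    rhs: Tuple[str, ...]   # one possible form of that construct


def is_terminal(symbol: str) -> bool:
    # Assumed convention: terminals are written inside single quotes.
    return len(symbol) >= 2 and symbol[0] == "'" and symbol[-1] == "'"


def split_symbols(rule: Production) -> Tuple[List[str], List[str]]:
    """Return (terminals, non_terminals) found on the right-hand side."""
    terminals = [s for s in rule.rhs if is_terminal(s)]
    non_terminals = [s for s in rule.rhs if not is_terminal(s)]
    return terminals, non_terminals


rule = Production("Expression",
                  ("'('", "Expression", "Operator", "Expression", "')'"))
terminals, non_terminals = split_symbols(rule)
print(rule.lhs, "->", " ".join(rule.rhs))
print("terminals:", terminals)
print("non-terminals:", non_terminals)
```

Storing the right-hand side as a sequence of symbols, rather than as one string, keeps the distinction between terminal and non-terminal symbols explicit.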

Grammar Basics (1): Form of a grammar (2)

Example of a production rule:

    Expression -> ‘(’ Expression Operator Expression ‘)’

- Non-terminal symbols (here Expression and Operator) are usually denoted with capital letters (A, B, C).
- Terminal symbols (here ‘(’ and ‘)’) are usually denoted with small letters (a, b, c).
- Sequences of grammar symbols are usually denoted with Greek letters (α, β, γ, ...); the empty sequence is denoted by ε (epsilon).

Grammar Basics (2): Example of a grammar (1)

    [1] Expression -> ‘(’ Expression Operator Expression ‘)’
    [2] Expression -> ‘1’
    [3] Operator   -> ‘+’
    [4] Operator   -> ‘*’

From these production rules you can produce various strings, for example:

    (1*(1+1))

This is the sequential form of a “program text”; the steps of the production process that lead to this string are called the derivation of that string.

Grammar Basics (2): Example of a grammar (2) – derivation of a string

Using rules [1] to [4] above, the (leftmost) derivation of the string (1*(1+1)) is as follows. Each step is labelled r@p: rule r is applied to the symbol at position p of the previous line.

         Expression                                            <- start symbol
    1@1  ‘(’ Expression Operator Expression ‘)’
    2@2  ‘(’ ‘1’ Operator Expression ‘)’
    4@3  ‘(’ ‘1’ ‘*’ Expression ‘)’
    1@4  ‘(’ ‘1’ ‘*’ ‘(’ Expression Operator Expression ‘)’ ‘)’
    2@5  ‘(’ ‘1’ ‘*’ ‘(’ ‘1’ Operator Expression ‘)’ ‘)’
    3@6  ‘(’ ‘1’ ‘*’ ‘(’ ‘1’ ‘+’ Expression ‘)’ ‘)’
    2@7  ‘(’ ‘1’ ‘*’ ‘(’ ‘1’ ‘+’ ‘1’ ‘)’ ‘)’

Grammar Basics (3): Derivations and parse trees

Parse tree of the derivation [the order is no longer visible]:

    Expression
        ‘(’
        Expression
            ‘1’
        Operator
            ‘*’
        Expression
            ‘(’
            Expression
                ‘1’
            Operator
                ‘+’
            Expression
                ‘1’
            ‘)’
        ‘)’

Formal aspects (1): Formal definition of a grammar

A grammar is defined as a quadruple (V_T, V_N, P, S), where:
- V_T is the set of terminal symbols.
- V_N is the set of non-terminal symbols.
- V_T and V_N have no symbols in common, and V is the union of V_T and V_N.
- P is the set of production rules, each of which is a pair (α, β), where α is in V+ [one or more elements of V] and β is in V* [zero or more elements of V]; a production is written α -> β.
- S is a member of V_N, known as the start or sentence symbol; it is the starting point in the generation of any string in the language.

Formal aspects (2): Types of grammar – Chomsky’s hierarchy

- The grammar we defined places no restrictions on the types of productions that may appear.
- Chomsky defined four classes of grammar (types 0–3), where type 0 is the unrestricted grammar we saw, and the other types are derived by imposing restrictions on the form of productions that may be used.
- Type-0 grammars are equivalent to Turing machines.

Formal aspects (3): Type-1 grammars

- The first restriction is that every production α -> β must satisfy |α| ≤ |β|.
- Type-1 grammars are equivalent to linear bounded automata.

Formal aspects (4): Type-2 grammars

- A further restriction is that the left-hand side of every production must be a single non-terminal.
- These are also known as “context-free” (CF) grammars (compiler theory is based almost entirely on CF grammars), and are equivalent to push-down automata.
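
The expression grammar used earlier in this lecture is context-free, since every left-hand side is a single non-terminal. As an illustration, the following is a minimal sketch of a recursive-descent recogniser for it. It is one possible way to check membership, not a method prescribed by the lecture; the function name `recognise` and its helpers are hypothetical, and the input is assumed to contain no whitespace.

```python
def recognise(text: str) -> bool:
    """Return True if text can be derived from Expression.

    Grammar (from the lecture):
      [1] Expression -> '(' Expression Operator Expression ')'
      [2] Expression -> '1'
      [3] Operator   -> '+'
      [4] Operator   -> '*'
    """
    pos = 0  # index of the next unread character

    def peek() -> str:
        # Next character, or "" when the input is exhausted.
        return text[pos] if pos < len(text) else ""

    def expect(ch: str) -> bool:
        # Consume ch if it is the next character.
        nonlocal pos
        if peek() == ch:
            pos += 1
            return True
        return False

    def expression() -> bool:
        if expect("1"):                        # rule [2]
            return True
        return (expect("(") and expression()   # rule [1]
                and operator() and expression() and expect(")"))

    def operator() -> bool:
        return expect("+") or expect("*")      # rules [3] and [4]

    # The whole input must be consumed, not just a prefix.
    return expression() and pos == len(text)


for candidate in ["(1*(1+1))", "1", "(1+1", "1+1"]:
    print(candidate, recognise(candidate))
```

The nested calls to `expression` rely on the Python call stack to remember how many parentheses are still open, which loosely mirrors the role of the stack in a push-down automaton.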

Formal aspects (5): Type-3 grammars

- If we further impose the restriction that productions must either all be left-linear or all be right-linear, the grammar is a type-3 grammar.
  [Right-linear: every production is of the form A -> a or A -> bC.
   Left-linear: every production is of the form A -> a or A -> Bc.]
- Type-3 grammars are also known as regular grammars, and are equivalent to finite state automata.

Formal aspects (6): Backus-Naur Form (BNF)

- The notation used so far is sufficient, but in practice a richer notation, called the Backus-Naur Form, is used.
- The right-hand sides of rules with the same left-hand side are combined and separated by vertical bars. E.g. the rules N -> α, N -> β, N -> γ are replaced with N -> α | β | γ
- It is very suitable for expressing nesting and recursion.

Formal aspects (7): Extended Backus-Naur Form (1) – the additional operators

- However, BNF is less convenient for expressing repetition and optionality.
- Extended BNF remedies this by introducing three additional notations, each in the form of a postfix operator:
    R+  indicates the occurrence of one or more Rs, to express repetition [the Kleene cross]
    R?  indicates the occurrence of zero or one Rs, to express optionality
    R*  indicates the occurrence of zero or more Rs, to express optional repetition [the Kleene star]
- Parentheses are added if these postfix operators are to operate on more than one grammar symbol.

Formal aspects (7): Extended Backus-Naur Form (2) – example

The production rule

    Expression -> ‘way’* ‘too’ ‘boring’ (‘for’ (‘me’|‘us’))?
can produce the following sentences:

    way too boring
    way way too boring
    way too boring for me
    way way too boring for us
    too boring

Regular expressions (1): Notation (1)

- Languages generated by type-3 grammars (regular grammars) can also be described using another important notation: regular expressions.
- One of their most common applications is in advanced text searches [have a look at your favourite search engine].
- The simplest kind is a sequence of simple characters, for example (using Perl notation): /student/
- We can then use the Kleene star and Kleene cross, as well as the ? operator, to express repetition and optionality.

Regular expressions (1): Notation (2)

The following are also used:
- The brackets [ ] define a character class, which may contain a range; for example, [0-9] means any digit between 0 and 9.
- Other operators are also sometimes used; many are language or application specific. For example, Perl also has the /./ operator to mean any character, and the /^/ operator (caret), placed immediately after the opening bracket, specifies what a pattern cannot be (for example, [^0-9] means not a digit).

Regular expressions (2): Example – SheepTalk

- Consider the language of certain kinds of sheep, which consists of strings that look like: baa! baaa! baaaa! ...
- The language consists of strings with a b, followed by at least two a's, followed by an exclamation mark.
- So the regular expression /baa+!/ generates this language, as does the regular expression /baaa*!/.
(Example from Chapter 2 of Speech and Language Processing, Jurafsky and Martin.)

Regular expressions (3): Advantages / Limitations

Regular expressions are powerful but have limitations. To generate the language {x^m y^n | m, n ≥ 0} we can use the regular expression x*y*, but there is no regular expression that generates the language {x^m y^m | m ≥ 0}.
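
The patterns above can be tried out directly. The sketch below uses Python's standard `re` module rather than Perl, which is a substitution made here for illustration; the patterns use only concatenation, `*` and `+`, so they stay within the regular expressions described in this lecture. The helper `is_xm_ym` is hypothetical and shows that recognising x^m y^m requires counting, which the pattern x*y* does not do.

```python
import re

# SheepTalk: the two equivalent patterns from the lecture.
sheep_plus = re.compile(r"baa+!")
sheep_star = re.compile(r"baaa*!")

for s in ["ba!", "baa!", "baaa!", "baaaa!", "baa", "aab!"]:
    # fullmatch requires the whole string to match, not just a prefix.
    a = sheep_plus.fullmatch(s) is not None
    b = sheep_star.fullmatch(s) is not None
    print(f"{s!r}: baa+! -> {a}, baaa*! -> {b}")

# {x^m y^n | m, n >= 0} is described by the regular expression x*y*.
xs_then_ys = re.compile(r"x*y*")


def is_xm_ym(s: str) -> bool:
    """Hypothetical helper: recognise {x^m y^m | m >= 0} by counting."""
    m = len(s) // 2
    return len(s) % 2 == 0 and s == "x" * m + "y" * m


for s in ["", "xy", "xxyy", "xxy", "yx"]:
    print(f"{s!r}: x*y* -> {xs_then_ys.fullmatch(s) is not None}, "
          f"x^m y^m -> {is_xm_ym(s)}")
```

The string xxy is accepted by x*y* but rejected by the counting helper: matching the number of x's against the number of y's needs memory that grows with the input, which a fixed pattern of this kind cannot provide.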

However, regular expressions are simpler to use, so they are frequently used in the lexical analysis part of input processing in language processors. Also, the theoretical constructs (finite state automata) associated with regular expressions are well studied, so a number of proofs can be made about the expressions and the languages they generate.

Summary

Language grammars:
- Are necessary for specifying a language in a formal and compact way.
- Can be classified according to Chomsky’s hierarchy.
- Type-3 grammars, or the equivalent regular expression notation, are used for the lexical analysis part of language processing, while CF grammars are more frequently associated with the syntax analysis part.

Next lectures:
- More on grammars
- A complete example of a simple language processor

Recommended Reading

- Section 1.9 of Grune et al.
- Section 4.2 of Aho et al.
- Chapters 5, 8 of Terry
- Chapter 2 of “The Essence of Compilers” by Robin Hunter