Introduction to Parsing Lecture 5

Lecture 5 - University of Richmond4 Beyond Regular Languages • Difficulty with regular languages is that many ... • Languages requiring counting modulo a fixed integer – E.g.,

Feb 09, 2020





Introduction to Parsing

Lecture 5




•  Regular languages revisited

•  Parser overview

•  Context-free grammars (CFG’s)

•  Derivations

•  Ambiguity



Languages and Automata

•  Formal languages are very important in CS –  Especially in programming languages

•  Regular languages –  The weakest formal languages widely used –  Many applications (as we’ve seen)

•  We will also study context-free languages, tree languages



Beyond Regular Languages

•  Difficulty with regular languages is that many languages are not regular –  Some are very important –  They can’t be expressed using REs and FAs

•  Ex. Strings of balanced parentheses are not regular: $\{\, (^i\,)^i \mid i \ge 0 \,\}$ –  Note: given as a set, not as an RE –  Note this is fairly representative of lots of programming constructs
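To make the set definition concrete, here is a minimal Python sketch of a membership test for it. The function name `is_nested_parens` is an assumption made for illustration. Note that the check relies on a counter `i` that can grow without bound, which is exactly what a finite automaton lacks.

```python
def is_nested_parens(s: str) -> bool:
    """Return True if s is i '(' characters followed by i ')' characters, i >= 0."""
    i = 0
    # Count the leading run of opening parentheses.
    while i < len(s) and s[i] == "(":
        i += 1
    # Exactly i closing parentheses must follow, and nothing else.
    return s[i:] == ")" * i


for example in ["", "()", "((()))", "(()", "()()"]:
    print(repr(example), is_nested_parens(example))
```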



Beyond Regular Languages

•  Ex. Nested arithmetic expressions –  ((1+2) * 3)

•  Ex. Nested if then else statements: if then if then if then … fi fi fi –  “if” here acts like “(” in the previous example –  Note that even if a language doesn’t have the closing “fi” that Cool has, it is usually implied


An Example To Help Understand the Limitations

•  Consider the following DFA

[Figure: DFA diagram, not reproduced]

•  What does it recognize? Ex: 1111111
•  Note: doesn’t have any way of knowing length of input string



Beyond Regular Languages

•  In general: Nesting constructs cannot be handled by regular expressions

•  Raises the questions: –  What can be expressed? –  Why are REs insufficient for recognizing arbitrary nesting constructs?



What Can Regular Languages Express?

•  Languages requiring counting modulo a fixed integer –  E.g., parity

•  Intuition: A finite automaton that runs long enough must repeat states

•  Finite automaton can’t remember # of times it has visited a particular state
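As an illustration of counting modulo a fixed integer, the following sketch simulates a two-state DFA for parity. The alphabet {0, 1}, the state names, and the function name `accepts_even_ones` are assumptions chosen for the example. The automaton only remembers which state it is in, never how many times it has been there.

```python
def accepts_even_ones(s: str) -> bool:
    """Simulate a two-state DFA accepting strings with an even number of 1s."""
    transitions = {
        ("even", "0"): "even",
        ("even", "1"): "odd",
        ("odd", "0"): "odd",
        ("odd", "1"): "even",
    }
    state = "even"  # start state, also the only accepting state
    for ch in s:
        state = transitions[(state, ch)]
    return state == "even"


print(accepts_even_ones("1111111"))  # seven 1s is an odd count, so this prints False
```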



The Functionality of the Parser

•  Input: sequence of tokens from lexer

•  Output: parse tree of the program (But some parsers never produce a parse tree . . .)




•  Cool: if x = y then 1 else 2 fi

•  Parser input (from lexical analyzer): IF ID = ID THEN INT ELSE INT FI

•  Parser output: a parse tree rooted at IF-THEN-ELSE

[Figure: parse tree, not reproduced]







•  Note: nesting structure has been made explicit by tree

•  Also the three components of the if then else –  Predicate –  Then branch –  Else branch







Comparison with Lexical Analysis

Phase    Input                  Output
Lexer    String of characters   String of tokens
Parser   String of tokens       Parse tree
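The lexer row of the comparison can be sketched in a few lines of Python. This is only an illustration for the single Cool statement used earlier: the regular expression, the keyword set, and the function name `lex` are assumptions, not part of any real Cool lexer.

```python
import re

KEYWORDS = {"if", "then", "else", "fi"}
# One alternative per token class: integer literal, identifier/keyword, '='.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(=))")


def lex(source: str) -> list[str]:
    """Turn a string of characters into a list of token names."""
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        number, word, equals = m.groups()
        if number is not None:
            tokens.append("INT")
        elif word is not None:
            tokens.append(word.upper() if word in KEYWORDS else "ID")
        else:
            tokens.append("=")
        pos = m.end()
    return tokens


print(" ".join(lex("if x = y then 1 else 2 fi")))
# IF ID = ID THEN INT ELSE INT FI
```

The output list is exactly the "string of tokens" that the parser row takes as input.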


Couple of things

•  As mentioned, sometimes the parse tree is only implicit –  More on this later –  Many compilers do build a full parse tree, many do not

•  There are compilers that combine lexer and parser phases into one phase –  Everything is done by the parser –  Parsing technology is powerful enough to express lexical analysis in addition to parsing –  But most compilers use two phases, because REs are such a good match for lexical analysis



The Role of the Parser

•  Not all strings of tokens are programs . . .
•  . . . parser must distinguish between valid and invalid strings of tokens –  And give error messages for the invalid ones

•  We need –  A language for describing valid strings of tokens –  An algorithm for distinguishing valid from invalid strings of tokens



Context-Free Grammars

•  Programming language constructs have recursive structure

•  An EXPR in Cool can be… –  if EXPR then EXPR else EXPR fi –  while EXPR loop EXPR pool –  … –  Note: recursively composed of other expressions

•  Context-free grammars are a natural notation for this recursive structure



What is a Context-Free Grammar (CFG)?

•  A CFG consists of –  A set of terminals T –  A set of non-terminals N –  A start symbol S (a non-terminal) –  A set of productions

$X \to Y_1 Y_2 \cdots Y_n$

where $X \in N$ and $Y_i \in T \cup N \cup \{\varepsilon\}$



Notational Conventions

•  In these lecture notes –  Non-terminals are written upper-case –  Terminals are written lower-case –  The start symbol is the left-hand side of the first production
•  This is standard for CFGs



Examples of CFGs

S → ( S )
S → ε



Examples of CFGs

S → ( S )
S → ε

What are the parts of the grammar: N = ? T = ? Start = ?



Examples of CFGs

S → ( S )
S → ε

What are the parts of the grammar: N = { S } T = { ( , ) } Start = S (the only nonterminal)



Examples of CFGs

S → ( S )
S → ε

What are the parts of the grammar: N = { S } T = { ( , ) } Start = S (the only nonterminal) Productions?
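One way to write down all four parts of this grammar is as a small data structure. The sketch below is an assumption about representation, not anything prescribed by the lecture: the class name `Grammar`, its field names, and the helper `is_well_formed` are hypothetical, and the empty tuple `()` stands for ε.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset
    terminals: frozenset
    start: str
    productions: tuple  # pairs (lhs, rhs); rhs is a tuple of symbols, () encodes epsilon


PARENS = Grammar(
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"(", ")"}),
    start="S",
    productions=(
        ("S", ("(", "S", ")")),  # S -> ( S )
        ("S", ()),               # S -> epsilon
    ),
)


def is_well_formed(g: Grammar) -> bool:
    """Check the conditions from the CFG definition."""
    symbols = g.nonterminals | g.terminals
    return (
        g.start in g.nonterminals
        and not (g.nonterminals & g.terminals)
        and all(
            lhs in g.nonterminals and all(y in symbols for y in rhs)
            for lhs, rhs in g.productions
        )
    )


print(is_well_formed(PARENS))
```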



Examples of CFGs

A fragment of Cool:

EXPR → if EXPR then EXPR else EXPR fi
     | while EXPR loop EXPR pool
     | id



Examples of CFGs (cont.)

Simple arithmetic expressions:

E → E * E | E + E | ( E ) | id



The Language of a CFG

Read productions as rules:

$X \to Y_1 \cdots Y_n$

means $X$ can be replaced by $Y_1 \cdots Y_n$. That is, in general, the right-hand side can replace the left-hand side.



Key Idea

1.  Begin with a string consisting of the start symbol “S”

2.  Replace any non-terminal X in the string by the right-hand side of some production $X \to Y_1 \cdots Y_n$

3.  Repeat (2) until there are no non-terminals in the string

So note, the string is changing over time
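The three steps can be sketched as a small rewriting loop. This is a hedged illustration using the balanced-parentheses grammar; the names `PRODUCTIONS` and `derive`, the random choice of which production to apply, and the `max_steps` cutoff are all assumptions added for the example.

```python
import random

# S -> ( S ) | epsilon, with the empty list standing for epsilon.
PRODUCTIONS = {"S": [["(", "S", ")"], []]}


def derive(start="S", seed=0, max_steps=50):
    """Return the successive strings of one randomly chosen derivation."""
    rng = random.Random(seed)
    string = [start]                      # step 1: begin with the start symbol
    steps = [list(string)]
    for _ in range(max_steps):
        positions = [i for i, sym in enumerate(string) if sym in PRODUCTIONS]
        if not positions:                 # step 3: stop when no non-terminals remain
            break
        i = rng.choice(positions)         # step 2: pick a non-terminal ...
        rhs = rng.choice(PRODUCTIONS[string[i]])
        string[i:i + 1] = rhs             # ... and replace it by a right-hand side
        steps.append(list(string))
    return steps


for step in derive(seed=0):
    print(" ".join(step) or "ε")
```

Each printed line is one snapshot of the changing string; the cutoff only guards against an unlucky run that keeps choosing the recursive production.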



The Language of a CFG (Cont.)

More formally, write

$X_1 \cdots X_i \cdots X_n \to X_1 \cdots X_{i-1} Y_1 \cdots Y_m X_{i+1} \cdots X_n$

if there is a production

$X_i \to Y_1 \cdots Y_m$

and say that the left-hand side “derives”, or “can derive”, the right-hand side.

This is one step of a context-free derivation.



The Language of a CFG (Cont.)

Write

$X_1 \cdots X_n \to^{*} Y_1 \cdots Y_m$

if

$X_1 \cdots X_n \to \cdots \to \cdots \to Y_1 \cdots Y_m$

in 0 or more steps. We say “the left-hand side rewrites in zero or more steps to the right-hand side”.


So, in general…

When we write $X_0 \to^{*} X_n$, it is shorthand for saying that there is some sequence of individual productions (rules) that get us from $X_0$ to $X_n$ in zero or more steps



The Language of a CFG

Let G be a context-free grammar with start symbol S. Then the language, L(G), of G is:

$L(G) = \{\, a_1 \cdots a_n \mid S \to^{*} a_1 \cdots a_n \text{ and every } a_i \text{ is a terminal} \,\}$
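The definition of L(G) can be explored by brute force for small grammars. The sketch below enumerates every terminal string up to a length bound by searching over left-most rewrites. The function name `language` and the dictionary encoding of productions are assumptions; the search is only guaranteed to stop for grammars where repeated rewriting eventually adds terminals, which holds for the parentheses grammar used here.

```python
from collections import deque


def language(productions, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start."""
    nonterminals = set(productions)
    seen = {(start,)}
    queue = deque([(start,)])
    result = set()
    while queue:
        form = queue.popleft()
        # Position of the left-most non-terminal, if any.
        idx = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if idx is None:
            result.add("".join(form))     # only terminals left: a member of L(G)
            continue
        for rhs in productions[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            terminal_count = sum(s not in nonterminals for s in new)
            if terminal_count <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return result


PARENS = {"S": [("(", "S", ")"), ()]}     # S -> ( S ) | epsilon
print(sorted(language(PARENS, "S", 6), key=len))
```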






•  Terminals are so-called because there are no rules for replacing them

•  Once generated, terminals are a permanent feature of the string

•  Terminals ought to be tokens of the language



Recall earlier Example

L(G) is the language of CFG G

Strings of balanced parentheses: $\{\, (^i\,)^i \mid i \ge 0 \,\}$

Two grammars:

S → ( S )
S → ε

OR

S → ( S ) | ε




Cool Example

A fragment of COOL:

EXPR → if EXPR then EXPR else EXPR fi
     | while EXPR loop EXPR pool
     | id

Recall: –  Non-terminals are written upper-case –  Terminals are written lower-case

Also, could have written as three productions



Cool Example (Cont.)

Some elements of the language (why?)

id
if id then id else id fi
while id loop id pool
if while id loop id pool then id else id fi
if if id then id else id fi then id else id fi
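One way to answer the "why?" is to check membership mechanically. The following is a minimal recursive-descent recognizer for the three-production Cool fragment, written as a sketch: the function names `parse_expr` and `in_language` are assumptions, and tokens are taken to be whitespace-separated words.

```python
def parse_expr(tokens, pos=0):
    """Return the position just after one EXPR starting at pos, or raise SyntaxError."""

    def expect(word, p):
        if p >= len(tokens) or tokens[p] != word:
            raise SyntaxError(f"expected {word!r} at token {p}")
        return p + 1

    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "id":                       # EXPR -> id
        return pos + 1
    if tok == "if":                       # EXPR -> if EXPR then EXPR else EXPR fi
        p = parse_expr(tokens, pos + 1)
        p = expect("then", p)
        p = parse_expr(tokens, p)
        p = expect("else", p)
        p = parse_expr(tokens, p)
        return expect("fi", p)
    if tok == "while":                    # EXPR -> while EXPR loop EXPR pool
        p = parse_expr(tokens, pos + 1)
        p = expect("loop", p)
        p = parse_expr(tokens, p)
        return expect("pool", p)
    raise SyntaxError(f"unexpected token {tok!r} at {pos}")


def in_language(text):
    tokens = text.split()
    try:
        return parse_expr(tokens) == len(tokens)
    except SyntaxError:
        return False


print(in_language("if while id loop id pool then id else id fi"))
```

Each branch of `parse_expr` mirrors one production, so a string is accepted exactly when some derivation from EXPR produces it.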



Arithmetic Example

Simple arithmetic expressions:

E → E + E | E * E | ( E ) | id

Some elements of the language:

id
id + id
( id )
id * id
( id ) * id
id * ( id )




The idea of a CFG is a big step. But:

•  Membership in a language is “yes” or “no”; also need parse tree of the input

•  Must handle errors gracefully

•  Need an implementation of CFG’s (e.g., bison)



More Notes

•  Form of the grammar is important –  Many grammars generate the same language –  Tools are sensitive to the grammar

–  Note: Tools for regular languages (e.g., flex) are sensitive to the form of the regular expression, but this is rarely a problem in practice



Derivations and Parse Trees

A derivation is a sequence of productions

A derivation can be drawn as a tree –  Start symbol is the tree’s root –  For a production $X \to Y_1 \cdots Y_n$ add children $Y_1, \ldots, Y_n$ to node $X$



Derivation Example

•  Grammar: E → E + E | E * E | ( E ) | id

•  String: id * id + id (we wish to “parse” the string)



Derivation Example (Cont.)

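E → E + E
  → E * E + E
  → id * E + E
  → id * id + E
  → id * id + id

[Figure: parse tree of the input string. The root E has children E, +, E; the left child E has children E, *, E; each remaining E has the single child id.]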



Derivation in Detail (1)





Derivation in Detail (2)



E → E + E

[Figure: parse tree built so far]



Derivation in Detail (3)


E → E + E
  → E * E + E

[Figure: parse tree built so far]




Derivation in Detail (4)

E → E + E
  → E * E + E
  → id * E + E

[Figure: parse tree built so far]





Derivation in Detail (5)

E → E + E
  → E * E + E
  → id * E + E
  → id * id + E

[Figure: parse tree built so far]



Derivation in Detail (6)

E → E + E
  → E * E + E
  → id * E + E
  → id * id + E
  → id * id + id

[Figure: completed parse tree]



Some Interesting Things About Parse Trees

•  A parse tree has –  Terminals at the leaves –  Non-terminals at the interior nodes

•  An in-order traversal of the leaves is the original input –  Let’s go back and take a look

•  The parse tree shows the association of operations, the input string does not –  Note * binds more tightly than + because the * node sits in a subtree below the + node
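The claim about the leaves can be checked on the parse tree for id * id + id. In this sketch a tree node is a tuple `(symbol, child, ...)` and a leaf is a plain string; that encoding and the names `TREE` and `leaves` are assumptions made for illustration.

```python
# Parse tree for id * id + id: interior nodes are non-terminals, leaves are terminals.
TREE = (
    "E",
    ("E", ("E", "id"), "*", ("E", "id")),
    "+",
    ("E", "id"),
)


def leaves(node):
    """Collect the leaves of a parse tree from left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1:]:                # node[0] is the non-terminal label
        result.extend(leaves(child))
    return result


print(" ".join(leaves(TREE)))  # id * id + id
```

The * node appears inside the left subtree of the node that introduces +, which is how the tree records that * binds more tightly.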


An Interesting Question…

How did I know to pick this particular parse tree for the derivation? It turns out that there is more than one…




Left-most and Right-most Derivations

•  The example we did is a left-most derivation –  At each step, replace the left-most non-terminal

•  There is an equivalent notion of a right-most derivation

E → E + E
  → E * E + E
  → id * E + E
  → id * id + E
  → id * id + id



Left-most and Right-most Derivations

•  The example we did is a left-most derivation –  At each step, replace the left-most non-terminal

•  There is an equivalent notion of a right-most derivation

E → E + E
  → E + id
  → E * E + id
  → E * id + id
  → id * id + id
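Both derivation orders can be read off a single parse tree. The sketch below expands the frontier of the tree for id * id + id, always choosing either the left-most or the right-most unexpanded non-terminal. The tuple encoding of the tree and the names `TREE` and `derivation` are assumptions for the example.

```python
# Parse tree for id * id + id: a node is (symbol, child, ...); a leaf is a string.
TREE = (
    "E",
    ("E", ("E", "id"), "*", ("E", "id")),
    "+",
    ("E", "id"),
)


def derivation(tree, leftmost=True):
    """List the sentential forms of the left-most or right-most derivation of tree."""

    def show(frontier):
        return " ".join(n[0] if isinstance(n, tuple) else n for n in frontier)

    frontier = [tree]
    forms = [show(frontier)]
    while True:
        pending = [i for i, n in enumerate(frontier) if isinstance(n, tuple)]
        if not pending:
            return forms
        i = pending[0] if leftmost else pending[-1]
        frontier[i:i + 1] = frontier[i][1:]   # replace the non-terminal by its children
        forms.append(show(frontier))


print(" → ".join(derivation(TREE, leftmost=True)))
print(" → ".join(derivation(TREE, leftmost=False)))
```

The two printed derivations differ only in the order of the steps, while the tree they come from is the same.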



Right-most Derivation in Detail (1)





Right-most Derivation in Detail (2)



E → E + E

[Figure: parse tree built so far]



Right-most Derivation in Detail (3)




E → E + E
  → E + id

[Figure: parse tree built so far]




Right-most Derivation in Detail (4)

E → E + E
  → E + id
  → E * E + id

[Figure: parse tree built so far]



Right-most Derivation in Detail (5)


E → E + E
  → E + id
  → E * E + id
  → E * id + id

[Figure: parse tree built so far]




Right-most Derivation in Detail (6)

E → E + E
  → E + id
  → E * E + id
  → E * id + id
  → id * id + id

[Figure: completed parse tree]



Derivations and Parse Trees

•  Note that right-most and left-most derivations have the same parse tree –  In this case –  And this is not an accident

•  The difference is the order in which branches are added

•  Finally, there can be other derivations of the same parse tree that are neither left-most nor right-most –  But we are most interested in left-most and right-most derivations



Summary of Derivations

•  We are not just interested in whether s is in L(G) –  We need a parse tree for s

•  A derivation defines a parse tree –  But one parse tree may have many derivations

•  Left-most and right-most derivations are important in parser implementation




Ambiguity

•  Grammar: E → E + E | E * E | ( E ) | id

•  String: id * id + id



Ambiguity (Cont.)

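This string has two parse trees

[Figure: two parse trees for id * id + id. In one, the root production is E → E * E and the right operand derives id + id. In the other, the root production is E → E + E and the left operand derives id * id.]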



Ambiguity (Cont.)

•  A grammar is ambiguous if it has more than one parse tree for some string –  Equivalently, there is more than one right-most or left-most derivation for some string

•  Ambiguity is BAD –  Leaves meaning of some programs ill-defined
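The definition of ambiguity can be tested directly by counting parse trees. The sketch below does this for the grammar E → E + E | E * E | ( E ) | id only; it is not a general algorithm, and the function name `count_parse_trees` is an assumption. It tries every production at every split point and multiplies the counts of the sub-spans.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E + E | E * E | ( E ) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of parse trees deriving tokens[i:j] from E."""
        n = 0
        if j - i == 1 and tokens[i] == "id":                        # E -> id
            n += 1
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":  # E -> ( E )
            n += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                               # E -> E op E
            if tokens[k] in ("+", "*"):
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(tokens))


print(count_parse_trees("id * id + id".split()))  # 2
```

A count above 1 for any string shows the grammar is ambiguous; a count of 1 for one string proves nothing about the grammar as a whole.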



Dealing with Ambiguity

•  There are several ways to handle ambiguity

•  Most direct method is to rewrite grammar unambiguously

E → E' + E | E'
E' → id * E' | id | ( E ) * E' | ( E )

•  Enforces precedence of * over +
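A recursive-descent parser that follows the rewritten grammar shows the precedence falling out of the grammar's shape. This is a sketch under assumptions: the function names are hypothetical, tokens are whitespace-separated, and the result is a simplified operator tree rather than the full parse tree.

```python
def parse(tokens):
    """Parse tokens with E -> E' + E | E' and E' -> id * E' | id | ( E ) * E' | ( E )."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r} at token {pos}")
        pos += 1

    def parse_e():                        # E -> E' + E | E'
        left = parse_e_prime()
        if peek() == "+":
            eat("+")
            return ("+", left, parse_e())
        return left

    def parse_e_prime():                  # E' -> id * E' | id | ( E ) * E' | ( E )
        if peek() == "id":
            eat("id")
            operand = "id"
        else:
            eat("(")
            operand = parse_e()
            eat(")")
        if peek() == "*":
            eat("*")
            return ("*", operand, parse_e_prime())
        return operand

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r} at {pos}")
    return tree


print(parse("id * id + id".split()))
```

Because every product is completed inside `parse_e_prime` before `parse_e` looks for a +, the * operator always ends up below + in the tree.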



Ambiguity in Arithmetic Expressions

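•  Recall the grammar E → E + E | E * E | ( E ) | int
•  The string int * int + int has two parse trees:

[Figure: two parse trees for int * int + int. In one, the root production is E → E * E and the right operand derives int + int. In the other, the root production is E → E + E and the left operand derives int * int.]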



Ambiguity: The Dangling Else

•  Consider the grammar E → if E then E | if E then E else E | OTHER

•  This grammar is also ambiguous



The Dangling Else: Example

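•  The expression if E1 then if E2 then E3 else E4 has two parse trees

[Figure: two parse trees. In the first, the outer if has children E1, the inner if (with children E2 and E3), and E4, so the else belongs to the outer if. In the second, the outer if has children E1 and the inner if (with children E2, E3, and E4), so the else belongs to the inner if.]

•  Typically we want the second form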



The Dangling Else: A Fix

•  else matches the closest unmatched then
•  We can describe this in the grammar

E → MIF        /* all then are matched */
  | UIF        /* some then is unmatched */
MIF → if E then MIF else MIF
    | OTHER
UIF → if E then E
    | if E then MIF else UIF

•  Describes the same set of strings
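The rule "else matches the closest unmatched then" can also be seen in a small hand-written parser. The sketch below does not implement the MIF/UIF grammar itself; it is a greedy recursive-descent parser, given here as an assumption-laden illustration, that consumes an else as soon as one is available and so attaches it to the innermost open if. The function name `parse_if` and the treatment of every other word as an OTHER expression are hypothetical.

```python
def parse_if(tokens):
    """Parse nested if/then/else, attaching each else to the closest unmatched then."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r} at token {pos}")
        pos += 1

    def parse_e():
        nonlocal pos
        if peek() == "if":
            eat("if")
            cond = parse_e()
            eat("then")
            then_branch = parse_e()
            if peek() == "else":          # greedy: the innermost open if takes the else
                eat("else")
                return ("if", cond, then_branch, parse_e())
            return ("if", cond, then_branch)
        tok = peek()
        if tok is None or tok in ("then", "else"):
            raise SyntaxError(f"unexpected token {tok!r} at {pos}")
        pos += 1
        return tok                        # any other word stands for OTHER

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r} at {pos}")
    return tree


print(parse_if("if E1 then if E2 then E3 else E4".split()))
```

For the example expression the else branch E4 ends up inside the inner if node, the form that is typically wanted.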



The Dangling Else: Example Revisited

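•  The expression if E1 then if E2 then E3 else E4

[Figure: the two candidate parse trees for this expression, one with the else attached to the outer if and one with the else attached to the inner if]

•  The tree with the else attached to the outer if: not valid because the then expression is not a MIF

•  The tree with the else attached to the inner if: a valid parse tree (for a UIF)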




•  No general techniques for handling ambiguity

•  Impossible to convert automatically an ambiguous grammar to an unambiguous one

•  Used with care, ambiguity can simplify the grammar –  Sometimes allows more natural definitions –  We need disambiguation mechanisms



Precedence and Associativity Declarations

•  Instead of rewriting the grammar –  Use the more natural (ambiguous) grammar –  Along with disambiguating declarations

•  Most tools allow precedence and associativity declarations to disambiguate grammars

•  Examples …



Associativity Declarations

•  Consider the grammar E → E + E | int
•  Ambiguous: two parse trees of int + int + int

[Figure: two parse trees for int + int + int, one grouping the string as (int + int) + int and the other as int + (int + int)]

•  Left associativity declaration: %left +



Precedence Declarations

•  Consider the grammar E → E + E | E * E | int –  And the string int + int * int

[Figure: two parse trees for int + int * int, one grouping the string as int + (int * int) and the other as (int + int) * int]

•  Precedence declarations:
%left +
%left *
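To see what such declarations buy, here is a hedged Python sketch of precedence climbing over the ambiguous grammar. It is not how bison implements `%left`; it only mimics the effect, with a table in which a later declaration gets a higher precedence (as in bison) and every operator is treated as left-associative. The names `PRECEDENCE` and `parse_expr` are assumptions.

```python
# Mirrors "%left +" followed by "%left *": later declarations bind more tightly.
PRECEDENCE = {"+": 1, "*": 2}


def parse_expr(tokens):
    """Parse E -> E + E | E * E | int, resolving ambiguity with the table above."""
    pos = 0

    def parse(min_prec):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "int":
            raise SyntaxError(f"expected 'int' at token {pos}")
        left = "int"
        pos += 1
        while (
            pos < len(tokens)
            and tokens[pos] in PRECEDENCE
            and PRECEDENCE[tokens[pos]] >= min_prec
        ):
            op = tokens[pos]
            pos += 1
            # Requiring strictly higher precedence on the right makes op left-associative.
            right = parse(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = parse(1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at {pos}")
    return tree


print(parse_expr("int + int * int".split()))
print(parse_expr("int + int + int".split()))
```

The first call groups the product under the sum, and the second groups the two additions to the left, matching the trees selected by the declarations.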