Agenda

CPSC4600 1

Agenda

Scanner vs. parser
  Regular grammar vs. context-free grammar

Grammars (context-free grammars)
  grammar rules
  derivations
  parse trees
  ambiguous grammars
  useful examples

Reading: Chapter 2, 4.1 and 4.2

Characteristics of a Parser

Input: sequence of tokens from the scanner

Output: parse tree of the program
  - the parse tree is generated (implicitly or explicitly) if the input is a legal program
  - if the input is an illegal program, syntax errors are issued

Note: instead of a parse tree, some parsers directly produce:
  - an abstract syntax tree (AST) + symbol table, or
  - intermediate code, or
  - object code

In the following lectures, we'll assume that a parse tree is generated.

Comparison with Lexical Analysis

Phase            | Input                | Output
Lexical Analysis | String of characters | String of tokens
Syntax Analysis  | String of tokens     | Parse tree

Example

• The program: x * y + z

• Input to parser: ID TIMES ID PLUS ID
  we'll write tokens as follows: id * id + id

• Output of parser: the parse tree

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id

Why are Regular Grammars Not Enough?

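Write an automaton that accepts strings “a”, “(a)”, “((a))”, and “(((a)))”

What about “a”, “(a)”, “((a))”, “(((a)))”, …, “(^k a )^k” for every k ≥ 0?
A finite automaton has a fixed number of states, so it cannot count an unbounded number of open parentheses; no regular grammar describes this language.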
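A minimal C sketch of what recognizing this language takes: the function has to remember how many '(' it has seen, and that count has no upper bound, which is exactly what a finite automaton lacks. The name is_nested_a is hypothetical, and the input is taken as a plain character string rather than scanner tokens.

```c
#include <stddef.h>

/* Returns 1 if s has the form (^k a )^k for some k >= 0, otherwise 0. */
int is_nested_a(const char *s)
{
    size_t depth = 0;

    while (*s == '(') {                 /* count the opening parentheses */
        depth++;
        s++;
    }
    if (*s != 'a')                      /* exactly one 'a' in the middle */
        return 0;
    s++;
    while (*s == ')' && depth > 0) {    /* each ')' must pair with a '(' */
        depth--;
        s++;
    }
    return depth == 0 && *s == '\0';    /* all matched, nothing left over */
}
```

For example, is_nested_a("((a))") accepts, while is_nested_a("((a)") and is_nested_a("(a))") are rejected.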


What must parser do?

1. Recognizer: not all strings of tokens are programs
   must distinguish between valid and invalid strings of tokens
2. Translator: must expose program structure
   • e.g., associativity and precedence
   • hence must return the parse tree

We need:
  A language for describing valid strings of tokens
    context-free grammars (analogous to regular grammars in the scanner)
  A method for distinguishing valid from invalid strings of tokens (and for building the parse tree)
    the parser (analogous to the state machine in the scanner)

Context-free grammars (CFGs)

Example: Simple Arithmetic Expressions Grammar

In English:
  An integer is an arithmetic expression.
  If exp1 and exp2 are arithmetic expressions, then so are the following:
    exp1 - exp2
    exp1 / exp2
    ( exp1 )

The corresponding CFG:              We'll write tokens as follows:
  exp → INTLITERAL                    E → intlit
  exp → exp MINUS exp                 E → E - E
  exp → exp DIVIDE exp                E → E / E
  exp → LPAREN exp RPAREN             E → ( E )

Reading the CFG

The grammar has five terminal symbols: intlit, -, /, (, )
  terminals of a grammar = tokens returned by the scanner.

The grammar has one non-terminal symbol: E
  non-terminals describe valid sequences of tokens

The grammar has four productions or rules, each of the form: E → α
  left-hand side = a single non-terminal.
  right-hand side = either
    a sequence of one or more terminals and/or non-terminals, or
    ε (an empty production)

Example, revisited

Note: a more compact way to write the previous grammar:

  E → INTLITERAL | E - E | E / E | ( E )

or

  E → INTLITERAL
    | E - E
    | E / E
    | ( E )

A formal definition of CFGs

A CFG consists of
  A set of terminals T
  A set of non-terminals N
  A start symbol S (a non-terminal)
  A set of productions:
    X → Y1 Y2 … Yn
    where X ∈ N and Yi ∈ T ∪ N ∪ {ε}

Notational Conventions

In these lecture notes
  Non-terminals are written upper-case
  Terminals are written lower-case
  The start symbol is the left-hand side of the first production

The Language of a CFG

The language defined by a CFG is the set of strings that can be derived from the start symbol of the grammar.

Derivation: read productions as rules:
  X → Y1 … Yn
means X can be replaced by Y1 … Yn

Derivation: key idea

1. Begin with a string consisting of the start symbol “S”

2. Replace any non-terminal X in the string by the right-hand side of some production X → Y1 … Yn

3. Repeat (2) until there are no non-terminals in the string

Derivation: an example

CFG:
  E → id
  E → E + E
  E → E * E
  E → ( E )

Is string id * id + id in the language defined by the grammar?

derivation:
  E ⇒ E + E
    ⇒ E * E + E
    ⇒ id * E + E
    ⇒ id * id + E
    ⇒ id * id + id

Terminals

Terminals are called so because there are no rules for replacing them

Once generated, terminals are permanent

Therefore, terminals are the tokens of the language


The Language of a CFG (Cont.)

More formally, write
  X1 X2 … Xn ⇒ X1 X2 … Xi-1 Y1 Y2 … Ym Xi+1 … Xn
if there is a production
  Xi → Y1 Y2 … Ym

The Language of a CFG (Cont.)

Write
  X1 X2 … Xn ⇒* Y1 Y2 … Ym
if
  X1 X2 … Xn ⇒ … ⇒ Y1 Y2 … Ym
in 0 or more steps

The Language of a CFG

Let G be a context-free grammar with start symbol S. Then the language of G is:
  { a1 a2 … an | S ⇒* a1 a2 … an }
where ai, i = 1, 2, …, n are terminal symbols

Examples

Strings of balanced parentheses: { (^i )^i | i ≥ 0 }

The grammar:
  S → ( S )
  S → ε
same as
  S → ( S ) | ε

Arithmetic Expression Example

Simple arithmetic expressions:
  E → E + E | E * E | ( E ) | id

Some elements of the language:
  id                 id + id
  ( id )             id * id
  ( id ) * id        id * ( id )

Notes

The idea of a CFG is a big step. But:
  Membership in a language is “yes” or “no”
    we also need the parse tree of the input!
    furthermore, we must handle errors gracefully
  Need an “implementation” of CFGs, i.e. the parser
    we'll create the parser using a parser generator
    available generators: CUP, bison, yacc

More Notes

Form of the grammar is important
  Many grammars generate the same language
  Parsers are sensitive to the form of the grammar

Example:
  E → E + E | E - E | intlit
is not suitable for an LL(1) parser (a common kind of parser).

Derivations and Parse Trees

A derivation is a sequence of productions
  S ⇒ … ⇒ … ⇒ …

A derivation can be drawn as a tree
  Start symbol is the tree's root
  For a production X → Y1 … Yn add children Y1 … Yn to node X

Derivation Example

Grammar
  E → E + E | E * E | ( E ) | id

String
  id * id + id

Derivation Example (Cont.)

  E ⇒ E + E
    ⇒ E * E + E
    ⇒ id * E + E
    ⇒ id * id + E
    ⇒ id * id + id

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id

Notes on Derivations

A parse tree has
  Terminals at the leaves
  Non-terminals at the interior nodes

An in-order traversal of the leaves is the original input

The parse tree shows the association of operations, the input string does not

Left-most and Right-most Derivations

The example is a left-most derivation
  At each step, replace the left-most non-terminal

There is an equivalent notion of a right-most derivation

  E ⇒ E + E
    ⇒ E + id
    ⇒ E * E + id
    ⇒ E * id + id
    ⇒ id * id + id

Derivations and Parse Trees

Note that right-most and left-most derivations have the same parse tree

The difference is the order in which branches are added


Remarks on Derivation

We are not just interested in whether s ∈ L(G)
  We need a parse tree for s (because we need to build the AST)

A derivation defines a parse tree
  But one parse tree may have many derivations

Left-most and right-most derivations are important in parser implementation

Ambiguity(1)

Grammar
  E → E + E | E * E | ( E ) | id

String
  id * id + id

Ambiguity (2)

This string has two parse trees

Tree 1, grouping id * (id + id):

        E
      / | \
     E  *  E
     |   / | \
     id E  +  E
        |     |
        id    id

Tree 2, grouping (id * id) + id:

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id

Ambiguity(3)

for each of the two parse trees, find the corresponding left-most derivation

for each of the two parse trees, find the corresponding right-most derivation


Ambiguity (4)

A grammar is ambiguous if, for some string of the language
  it has more than one parse tree, or
  there is more than one right-most derivation, or
  there is more than one left-most derivation.
(the three conditions are equivalent)

Ambiguity leaves the meaning of some programs ill-defined

Dealing with Ambiguity

There are several ways to handle ambiguity

Most direct method is to rewrite the grammar unambiguously

  E  → E' + E | E'
  E' → id * E' | id | ( E )

Enforces precedence of * over +

Removing Ambiguity

Rewriting:
  Expression Grammars
    precedence
    associativity
  IF-THEN-ELSE
    the Dangling-ELSE problem

Handling operator precedence

Rewrite the grammar
  use a different nonterminal for each precedence level
  start with the lowest precedence (MINUS)

  E → E - E | E / E | ( E ) | id

rewrite to

  E → E - E | T
  T → T / T | F
  F → id | ( E )

Example

parse tree for id - id / id

  E → E - E | T
  T → T / T | F
  F → id | ( E )

            E
         /  |  \
        E   -   E
        |       |
        T       T
        |     / | \
        F    T  /  T
        |    |     |
        id   F     F
             |     |
             id    id

Handling Operator Associativity

The grammar captures operator precedence, but it is still ambiguous!
  it fails to express that both subtraction and division are left associative;
  e.g., 5-3-2 is equivalent to ((5-3)-2) and not to (5-(3-2)).

Recursion

A grammar is recursive in nonterminal X if: X ⇒+ … X …
  ⇒+ means “in one or more steps”: X derives a sequence of symbols that includes an X

A grammar is left recursive in X if: X ⇒+ X …
  in one or more steps, X derives a sequence of symbols that starts with an X

A grammar is right recursive in X if: X ⇒+ … X
  in one or more steps, X derives a sequence of symbols that ends with an X

Resolving ambiguity due to associativity

The grammar given above is both left and right recursive in nonterminals E and T

To correctly express operator associativity:
  For left associativity, use left recursion.
  For right associativity, use right recursion.

Here's the correct grammar:
  E → E - T | T
  T → T / F | F
  F → id | ( E )

The Dangling “Else” ambiguity

Consider the grammar
  St → if E then St
     | if E then St else St
     | other

This grammar is also ambiguous
  e.g., in “if E1 then if E2 then S1 else S2” the else can attach to either then

Resolving the “dangling else”

else matches the closest unmatched then
We can describe this in the grammar

  St  → MIF        /* all then are matched */
      | UIF        /* some then are unmatched */
  MIF → if E then MIF else MIF
      | other
  UIF → if E then St
      | if E then MIF else UIF

Describes the same set of strings

Precedence and Associativity Declarations in Parser Generators

Instead of rewriting the grammar
  Use the more natural (ambiguous) grammar
  Along with disambiguating declarations

Most parser generators allow precedence and associativity declarations to disambiguate grammars

Parsing Approaches

Top-down parsing
  build parse tree from start symbol (root)
  match terminal symbols (tokens) in the production rules with tokens in the input stream
  simple but limited in power

Bottom-up parsing
  start from input token stream
  build parse tree from terminal symbols (tokens) until we get the start symbol
  complex but powerful

Top Down vs. Bottom Up

[Figure: two panels, Top-down Parsing and Bottom-up Parsing. Each marks where parsing starts (“start here”), the input token stream, and the result; the top-down panel also marks the match against the input.]

Top-down Parsing

A top-down parsing algorithm parses an input string of tokens by tracing out the steps in a leftmost derivation.

The parse tree associated with the input string is constructed using preorder traversal and hence the name “top-down”.


Top-down parsers

There are mainly two kinds of top-down parsers:

1. Predictive parsers
   - Try to make decisions about the structure of the tree below a node based on a few lookahead tokens (usually one!).
   - Weakness: Little program structure has been seen before predictive decisions must be made.

2. Backtracking parsers
   - Solve the lookahead problem by backtracking if one decision turns out to be wrong and making a different choice.
   - Weakness: Backtracking parsers are slow (exponential time in general).

Recursive-descent parsing

Main idea

1. Use the grammar rules as recipes for procedure code that “parses” the rule
2. Each non-terminal corresponds to a procedure
3. Each appearance of a terminal in the right-hand side of a rule causes a token to be matched.
4. Each appearance of a non-terminal corresponds to a call of the associated procedure.

Example: Recursive-descent Parsing

F → ( E ) | num

Code:

void F()
{
    if (token == num)
        match(num);
    else {
        match('(');    // match token '('
        E();
        match(')');    // match token ')'
    }
}

Example: Recursive-descent Parsing (2)

Observation: Note how lookahead is not a problem in this example: if the token is num, go one way; if the token is '(', go the other; and if the token is neither, declare an error:

void match(Token expect)
{
    if (token == expect)
        getToken();    // get next token
    else
        error(token, expect);
}

Example: Recursive-descent Parsing (3)

A recursive-descent procedure can also compute values or syntax trees:

int F()
{
    if (token == num) {
        int temp = atoi(lexeme);
        match(num);
        return temp;
    }
    else {
        match('(');
        int temp = E();
        match(')');
        return temp;
    }
}
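The value-computing F can be made runnable under a few assumptions that are not in the slides: the scanner is replaced by a character cursor, errors set a flag instead of aborting, and E is given the illustrative right-recursive rule E → F + E | F. The names parse_sum, cur and failed are hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>

static const char *cur;   /* next unread character (stands in for the scanner) */
static int failed;        /* set by match() on the first syntax error */

static void match(char expect)
{
    if (*cur == expect) cur++;      /* consume the expected character */
    else failed = 1;                /* otherwise record the error */
}

static int E(void);

/* F -> ( E ) | num */
static int F(void)
{
    if (isdigit((unsigned char)*cur)) {
        char *end;
        int v = (int)strtol(cur, &end, 10);   /* read the whole number */
        cur = end;
        return v;
    }
    match('(');
    if (failed) return 0;           /* neither num nor '(' : stop descending */
    int v = E();
    match(')');
    return v;
}

/* E -> F + E | F   (assumed rule, chosen only for illustration) */
static int E(void)
{
    int v = F();
    if (*cur == '+') {
        match('+');
        v += E();
    }
    return v;
}

/* Returns 1 and stores the result in *value if src is a valid expression. */
int parse_sum(const char *src, int *value)
{
    cur = src;
    failed = 0;
    *value = E();
    return !failed && *cur == '\0';
}
```

With this sketch, parse_sum("(1+2)+3", &v) succeeds with v equal to 6, and parse_sum("(1+2", &v) reports an error because the closing parenthesis never matches.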


When Recursive Descent Does Not Work

E → E '+' term | term

void E()
{
    if (token == ??) {
        E();           // uh, oh!!
        match('+');
        term();
    }
    else
        term();
}

- A left-recursive grammar has a non-terminal A such that A ⇒+ A α for some α
- Recursive descent does not work in such cases: the procedure calls itself before consuming any token, so it never terminates

Elimination of Left Recursion

Consider the left-recursive grammar
  A → A α | β     for some sentential forms α and β

A generates all strings starting with a β and followed by a number of α

Can rewrite the grammar using right-recursion
  A  → β A'
  A' → α A' | ε
where A' is a new nonterminal
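One possible way to code the rewritten grammar, taking β = num and α = - num, so the left-recursive rule is E → E - num | num and the rewrite is E → num E', E' → - num E' | ε. In this sketch E' receives the value built so far as a parameter, which keeps subtraction left associative even though the grammar is now right recursive. The mini-scanner and all function names are assumptions made for illustration.

```c
#include <ctype.h>

static const char *p;     /* cursor into the input (hypothetical mini-scanner) */
static int ok;            /* cleared on the first syntax error */

static void skip(void) { while (*p == ' ') p++; }

/* num: one or more decimal digits */
static int num(void)
{
    int v = 0;
    skip();
    if (!isdigit((unsigned char)*p)) { ok = 0; return 0; }
    while (isdigit((unsigned char)*p))
        v = v * 10 + (*p++ - '0');
    return v;
}

/* E' -> - num E' | epsilon
   'left' is the value of everything parsed so far, so a - b - c
   is evaluated as (a - b) - c. */
static int e_prime(int left)
{
    skip();
    if (*p == '-') {
        p++;
        int right = num();
        return e_prime(left - right);
    }
    return left;          /* epsilon production: nothing more to consume */
}

/* E -> num E' */
static int e(void) { return e_prime(num()); }

/* Returns 1 and stores the value in *out if src is a valid expression. */
int eval_minus(const char *src, int *out)
{
    p = src;
    ok = 1;
    *out = e();
    skip();
    return ok && *p == '\0';
}
```

For 5-3-2 this yields 0, the left-associative result ((5-3)-2); a naive right-recursive evaluation would give 5-(3-2) = 4.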


Elimination of Left Recursion (2)

In general
  A → A α1 | … | A αn | β1 | … | βm

All strings derived from A start with one of β1, …, βm and continue with several instances of α1, …, αn

Rewrite as
  A  → β1 A' | … | βm A'
  A' → α1 A' | … | αn A' | ε

General Left Recursion

The grammar
  S → A α | δ
  A → S β
is also left-recursive because
  S ⇒+ S β α

This left-recursion can also be eliminated
See book, Section 4.3 for general algorithm

Summary of Recursive Descent with backtracking

Simple and general parsing strategy
  Left-recursion must be eliminated first
  … but that can be done automatically

Unpopular because of backtracking
  Thought to be too inefficient

In practice, backtracking is eliminated by restricting the grammar

Predictive Parsers

Like recursive-descent but parser can “predict” which production to use
  - By looking at the next few tokens
  - No backtracking

Predictive parsers accept LL(k) grammars
  - L means “left-to-right” scan of input
  - L means “leftmost derivation”
  - k means “predict based on k tokens of lookahead”

In practice, LL(1) is used

LL(1) Languages

In recursive-descent, for each non-terminal and input token there may be a choice of production

LL(1) means that for each non-terminal and token there is only one production

Can be specified via 2D tables
  - One dimension for current non-terminal to expand
  - One dimension for next token
  - A table entry contains one production

Predictive Parsing and Left Factoring

Consider the grammar
  E → T + E | T
  T → num | num * T | ( E )

Hard to predict because
  For T, two productions start with num
  For E, it is not clear how to predict

A grammar must be left-factored before use for predictive parsing

Left-Factoring Example

Recall the grammar
  E → T + E | T
  T → num | num * T | ( E )

Factor out common prefixes of productions
  E → T X
  X → + E | ε
  T → ( E ) | num Y
  Y → * T | ε

LL(1) Parsing Table Example

Left-factored grammar
  E → T X
  X → + E | ε
  T → ( E ) | num Y
  Y → * T | ε

The LL(1) parsing table:

      | num   | *   | +   | (     | ) | $
  E   | T X   |     |     | T X   |   |
  X   |       |     | + E |       | ε | ε
  T   | num Y |     |     | ( E ) |   |
  Y   |       | * T | ε   |       | ε | ε

LL(1) Parsing Table Example (Cont.)

Consider the [E, num] entry
  - “When current non-terminal is E and next input is num, use production E → T X”
  - This production can generate a num in the first place

Consider the [Y, +] entry
  - “When current non-terminal is Y and current token is +, get rid of Y”
  - Y can be followed by + only in a derivation in which Y → ε

LL(1) Parsing Tables. Errors

Blank entries indicate error situations
  Consider the [E, *] entry
  “There is no way to derive a string starting with * from non-terminal E”

Using Parsing Tables

Method similar to recursive descent, except
  - For each non-terminal S
  - We look at the next token a
  - And choose the production shown at [S, a]

We use a stack to keep track of pending non-terminals

We reject when we encounter an error state

We accept when we encounter end-of-input

LL(1) Parsing Algorithm

initialize stack = <S $> and Token = nextToken()
  (S is the start nonterminal, $ is the end-of-input symbol)

repeat
  case stack of
    <X, rest> : if T[X, Token] = Y1…Yn
                  then stack ← <Y1…Yn rest>;
                  else error();
    <t, rest> : if t == Token
                  then stack ← <rest>; Token = nextToken();
                  else error();
until stack == < >     // empty
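The algorithm can be sketched in C for the left-factored grammar E → T X, X → + E | ε, T → ( E ) | num Y, Y → * T | ε. Several choices here are assumptions rather than part of the slides: each token is a single character (n stands for num, $ ends the input), the table is written as a switch instead of a 2D array, and ll1_parse is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* T[nonterminal, token]: returns the right-hand side to push, "" for the
   epsilon production, or NULL for a blank (error) entry. */
static const char *table(char nt, char tok)
{
    switch (nt) {
    case 'E': return (tok == 'n' || tok == '(') ? "TX" : NULL;
    case 'X': return tok == '+' ? "+E"
                   : (tok == ')' || tok == '$') ? "" : NULL;
    case 'T': return tok == 'n' ? "nY" : tok == '(' ? "(E)" : NULL;
    case 'Y': return tok == '*' ? "*T"
                   : (tok == '+' || tok == ')' || tok == '$') ? "" : NULL;
    default:  return NULL;
    }
}

/* Parses a '$'-terminated token string; returns 1 on accept, 0 on error. */
int ll1_parse(const char *input)
{
    char stack[256];
    size_t top = 0;

    stack[top++] = '$';                      /* end-of-input marker at the bottom */
    stack[top++] = 'E';                      /* start symbol on top */

    while (top > 0) {
        char x = stack[--top];
        if (strchr("EXTY", x) != NULL) {     /* non-terminal: expand via the table */
            const char *rhs = table(x, *input);
            if (rhs == NULL) return 0;       /* blank entry: syntax error */
            size_t n = strlen(rhs);
            if (top + n > sizeof stack) return 0;
            while (n > 0)                    /* push reversed so rhs[0] ends on top */
                stack[top++] = rhs[--n];
        } else {                             /* terminal: must equal the next token */
            if (x != *input) return 0;
            if (x != '$') input++;
        }
    }
    return 1;                                /* stack empty: accept */
}
```

Calling ll1_parse("n*n$") follows the same steps as the num * num trace in these notes, while ll1_parse("*n$") fails immediately on the blank [E, *] entry.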


LL(1) Parsing Example

Stack        | Input        | Action
E $          | num * num $  | T X
T X $        | num * num $  | num Y
num Y X $    | num * num $  | terminal
Y X $        | * num $      | * T
* T X $      | * num $      | terminal
T X $        | num $        | num Y
num Y X $    | num $        | terminal
Y X $        | $            | ε
X $          | $            | ε
$            | $            | ACCEPT

      | num   | *   | +   | (     | ) | $
  E   | T X   |     |     | T X   |   |
  X   |       |     | + E |       | ε | ε
  T   | num Y |     |     | ( E ) |   |
  Y   |       | * T | ε   |       | ε | ε

Constructing Parsing Tables

LL(1) languages are those defined by a parsing table for the LL(1) algorithm

No table entry can be multiply defined

We want to generate parsing tables from CFG


Constructing Parsing Tables: First and Follow sets

If A → α, where in the row of A do we place α?

Answer:
  In the column of t where t can start a string derived from α
    α ⇒* t β
    We say that t ∈ First(α)

  In the column of t if α is ε and t can follow an A
    S ⇒* β A t δ
    We say t ∈ Follow(A)

Computing First Sets

Definition: First(X) = { t | X ⇒* t α } ∪ { ε | X ⇒* ε }

Algorithm sketch (see book for details):
  1. for all terminals t do First(t) ← { t }
  2. for each production X → ε do First(X) ← { ε }
  3. if X → A1 … An α and ε ∈ First(Ai), 1 ≤ i ≤ n do
       • add First(α) to First(X)
  4. for each X → A1 … An s.t. ε ∈ First(Ai), 1 ≤ i ≤ n do
       • add ε to First(X)
  5. repeat steps 3 & 4 until no First set can be grown
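The fixed-point iteration can be sketched in C for the grammar E → T X, X → + E | ε, T → ( E ) | num Y, Y → * T | ε. The representation is an assumption made for brevity: each symbol is one character (n stands for num), each First set is a bitmask, and names such as compute_first, first_of and term_bit are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define EPS 0x100u                       /* bit standing for epsilon */

static const char *TERMS = "n+*()";      /* n stands for the token num */
static const char *NONTERMS = "EXTY";
/* Each string is the lhs followed by its rhs; an empty rhs is epsilon. */
static const char *PRODS[] = { "ETX", "X+E", "X", "T(E)", "TnY", "Y*T", "Y" };

static unsigned first_nt[4];             /* First(E), First(X), First(T), First(Y) */

unsigned term_bit(char t) { return 1u << (int)(strchr(TERMS, t) - TERMS); }

/* First set of one grammar symbol (terminal or non-terminal). */
unsigned first_of(char sym)
{
    const char *nt = strchr(NONTERMS, sym);
    return nt ? first_nt[nt - NONTERMS] : term_bit(sym);
}

/* Grow the First sets until a full pass over the productions adds nothing. */
void compute_first(void)
{
    int changed = 1;
    memset(first_nt, 0, sizeof first_nt);
    while (changed) {
        changed = 0;
        for (size_t i = 0; i < sizeof PRODS / sizeof PRODS[0]; i++) {
            unsigned *lhs = &first_nt[strchr(NONTERMS, PRODS[i][0]) - NONTERMS];
            unsigned add = 0;
            const char *s = PRODS[i] + 1;
            for (; *s; s++) {
                unsigned f = first_of(*s);
                add |= f & ~EPS;          /* contribute First of this rhs symbol */
                if (!(f & EPS)) break;    /* stop at the first non-nullable symbol */
            }
            if (*s == '\0') add |= EPS;   /* every rhs symbol can derive epsilon */
            if ((*lhs | add) != *lhs) { *lhs |= add; changed = 1; }
        }
    }
}
```

After compute_first(), first_of('X') holds the bits for + and ε, and first_of('E') holds the bits for num and (.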


First Sets. Example

Recall the grammar
  E → T X
  X → + E | ε
  T → ( E ) | num Y
  Y → * T | ε

First sets
  First( ( ) = { ( }          First( T ) = { num, ( }
  First( ) ) = { ) }          First( E ) = { num, ( }
  First( num ) = { num }      First( X ) = { +, ε }
  First( + ) = { + }          First( Y ) = { *, ε }
  First( * ) = { * }

Computing Follow Sets

Definition:

  Follow(X) = { t | S ⇒* β X t δ }

Intuition:
  If S is the start symbol then $ ∈ Follow(S)
  If X → A B then First(B) ⊆ Follow(A) and Follow(X) ⊆ Follow(B)
  Also if B ⇒* ε then Follow(X) ⊆ Follow(A)

Computing Follow Sets (Cont.)

Algorithm sketch:

1. Follow(S) ← { $ }
2. For each production A → α X β
     add First(β) \ {ε} to Follow(X)
3. For each A → α X β where ε ∈ First(β)
     add Follow(A) to Follow(X)
   repeat step(s) 2 and 3 until no Follow set grows
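A C sketch of these steps for the grammar E → T X, X → + E | ε, T → ( E ) | num Y, Y → * T | ε. To keep it short, the First sets of the non-terminals are hard-coded from the First-set example in these notes, and only the Follow sets of non-terminals are computed. The bitmask encoding (n for num, END for $) and the names compute_follow, follow_of and bit are assumptions.

```c
#include <stddef.h>
#include <string.h>

#define EPS 0x100u                       /* bit standing for epsilon */
#define END 0x200u                       /* bit standing for $ */

static const char *TERMS = "n+*()";      /* n stands for the token num */
static const char *NONTERMS = "EXTY";
static const char *PRODS[] = { "ETX", "X+E", "X", "T(E)", "TnY", "Y*T", "Y" };

static unsigned follow_nt[4];            /* Follow(E), Follow(X), Follow(T), Follow(Y) */

unsigned bit(char t) { return 1u << (int)(strchr(TERMS, t) - TERMS); }

static int nt_index(char s)
{
    const char *q = strchr(NONTERMS, s);
    return q ? (int)(q - NONTERMS) : -1;
}

/* First sets copied from the worked example (not recomputed here). */
static unsigned first_sym(char s)
{
    switch (s) {
    case 'E': case 'T': return bit('n') | bit('(');
    case 'X':           return bit('+') | EPS;
    case 'Y':           return bit('*') | EPS;
    default:            return bit(s);   /* First(t) = { t } for a terminal t */
    }
}

unsigned follow_of(char nt) { return follow_nt[nt_index(nt)]; }

void compute_follow(void)
{
    int changed = 1;
    memset(follow_nt, 0, sizeof follow_nt);
    follow_nt[nt_index('E')] = END;               /* step 1: $ follows the start symbol */
    while (changed) {
        changed = 0;
        for (size_t i = 0; i < sizeof PRODS / sizeof PRODS[0]; i++) {
            int a = nt_index(PRODS[i][0]);
            for (const char *x = PRODS[i] + 1; *x; x++) {
                int xi = nt_index(*x);
                if (xi < 0) continue;             /* skip terminals */
                unsigned add = 0;
                const char *b = x + 1;            /* beta: the symbols after X */
                for (; *b; b++) {
                    unsigned f = first_sym(*b);
                    add |= f & ~EPS;              /* step 2: First(beta) minus epsilon */
                    if (!(f & EPS)) break;
                }
                if (*b == '\0') add |= follow_nt[a];   /* step 3: beta can vanish */
                if ((follow_nt[xi] | add) != follow_nt[xi]) {
                    follow_nt[xi] |= add;
                    changed = 1;
                }
            }
        }
    }
}
```

After compute_follow(), follow_of('T') contains the bits for +, ) and $, matching the Follow(T) entry in the example for this grammar.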


Follow Sets. Example

Recall the grammar
  E → T X
  X → + E | ε
  T → ( E ) | num Y
  Y → * T | ε

Follow sets
  Follow( + ) = { num, ( }        Follow( * ) = { num, ( }
  Follow( ( ) = { num, ( }        Follow( E ) = { ), $ }
  Follow( X ) = { $, ) }          Follow( T ) = { +, ), $ }
  Follow( ) ) = { +, ), $ }       Follow( Y ) = { +, ), $ }
  Follow( num ) = { *, +, ), $ }

Constructing LL(1) Parsing Tables

Construct a parsing table T for CFG G

For each production A → α in G do:
  For each terminal t ∈ First(α) do
    T[A, t] = α
  If ε ∈ First(α), for each t ∈ Follow(A) do
    T[A, t] = α
  If ε ∈ First(α) and $ ∈ Follow(A) do
    T[A, $] = α

Notes on LL(1) Parsing Tables

If any entry is multiply defined then G is not LL(1). This happens, for example:
  If G is ambiguous
  If G is left recursive
  If G is not left-factored

Most programming language grammars are not LL(1)

There are tools that build LL(1) tables

Review

For some grammars there is a simple parsing strategy
  Predictive parsing

Next time: Bottom-up parsing
