Jan 03, 2016
CPSC4600 1
Agenda
- Scanner vs. parser
  - Regular grammar vs. context-free grammar
- Grammars (context-free grammars)
  - grammar rules
  - derivations
  - parse trees
  - ambiguous grammars
  - useful examples
- Reading: Chapter 2, 4.1 and 4.2
CPSC4600 2
Characteristics of a Parser
Input: sequence of tokens from the scanner
Output: parse tree of the program
- a parse tree is generated (implicitly or explicitly) if the input is a legal program
- if the input is an illegal program, syntax errors are issued
Note: Instead of a parse tree, some parsers directly produce:
- an abstract syntax tree (AST) + symbol table, or
- intermediate code, or
- object code
In the following lectures, we’ll assume that a parse tree is generated.
CPSC4600 3
Comparison with Lexical Analysis
Phase              Input                  Output
Lexical Analysis   String of characters   String of tokens
Syntax Analysis    String of tokens       Parse tree
CPSC4600 4
Example
• The program: x * y + z
• Input to parser: ID TIMES ID PLUS ID
  we’ll write tokens as follows: id * id + id
• Output of parser: the parse tree

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id
CPSC4600 5
Why are Regular Grammars Not Enough?
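Try to write an automaton that accepts the strings “a”, “(a)”, “((a))”, “(((a)))”, … and in general “(^k a )^k” for every k ≥ 0.
A finite automaton has a fixed number of states, so it cannot count an unbounded number of open parentheses and then check that the same number are closed.
A context-free grammar can describe this language: S → ( S ) | a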
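The counting argument can be made concrete with a small recognizer. This is only a sketch: the function name is_nested_a is hypothetical, and the two counters play the role of the unbounded memory that a finite automaton does not have.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if s has the form (^k a )^k for some k >= 0.
   The name is_nested_a is an illustrative assumption, not from the slides. */
bool is_nested_a(const char *s)
{
    size_t open = 0, close = 0;

    while (*s == '(') { open++; s++; }    /* count the leading '(' */
    if (*s != 'a')
        return false;                     /* exactly one 'a' in the middle */
    s++;
    while (*s == ')') { close++; s++; }   /* count the trailing ')' */

    /* nothing may follow, and the two counts must agree */
    return *s == '\0' && open == close;
}
```

For example, is_nested_a("((a))") is true, while is_nested_a("((a)") is false because the counts differ.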
CPSC4600 6
What must parser do?
1. Recognizer: not all strings of tokens are programs
   - must distinguish between valid and invalid strings of tokens
2. Translator: must expose program structure
   - e.g., associativity and precedence
   - hence must return the parse tree
We need:
- A language for describing valid strings of tokens:
  context-free grammars (analogous to regular grammars in the scanner)
- A method for distinguishing valid from invalid strings of tokens (and for building the parse tree):
  the parser (analogous to the state machine in the scanner)
CPSC4600 7
Context-free grammars (CFGs)
Example: Simple Arithmetic Expressions Grammar
In English:
- An integer is an arithmetic expression.
- If exp1 and exp2 are arithmetic expressions, then so are the following:
  exp1 - exp2
  exp1 / exp2
  ( exp1 )
The corresponding CFG:              We’ll write tokens as follows:
exp → INTLITERAL                    E → intlit
exp → exp MINUS exp                 E → E - E
exp → exp DIVIDE exp                E → E / E
exp → LPAREN exp RPAREN             E → ( E )
CPSC4600 8
Reading the CFG
The grammar has five terminal symbols: intlit, -, /, (, )
- terminals of a grammar = tokens returned by the scanner
The grammar has one non-terminal symbol: E
- non-terminals describe valid sequences of tokens
The grammar has four productions or rules, each of the form: E → …
- left-hand side = a single non-terminal
- right-hand side = either
  - a sequence of one or more terminals and/or non-terminals, or
  - ε (an empty production)
CPSC4600 9
Example, revisited
Note: a more compact way to write the previous grammar:
E → INTLITERAL | E - E | E / E | ( E )
or
E → INTLITERAL
  | E - E
  | E / E
  | ( E )
CPSC4600 10
A formal definition of CFGs
A CFG consists of
- A set of terminals T
- A set of non-terminals N
- A start symbol S (a non-terminal)
- A set of productions:
  X → Y1 Y2 … Yn
  where X ∈ N and Yi ∈ T ∪ N ∪ {ε}
CPSC4600 11
Notational Conventions
In these lecture notes
- Non-terminals are written upper-case
- Terminals are written lower-case
- The start symbol is the left-hand side of the first production
CPSC4600 12
The Language of a CFG
The language defined by a CFG is the set of strings that can be derived from the start symbol of the grammar.
Derivation: Read productions as rules:
X → Y1 … Yn
means X can be replaced by Y1 … Yn
CPSC4600 13
Derivation: key idea
1. Begin with a string consisting of the start symbol “S”
2. Replace any non-terminal X in the string by the right-hand side of some production for X
3. Repeat (2) until there are no non-terminals in the string
CPSC4600 14
Derivation: an example
CFG:
E → id
E → E + E
E → E * E
E → ( E )
Is the string id * id + id in the language defined by the grammar?
Yes, by the following derivation:
E ⇒ E + E
  ⇒ E * E + E
  ⇒ id * E + E
  ⇒ id * id + E
  ⇒ id * id + id
CPSC4600 15
Terminals
Terminals are called so because there are no rules for replacing them
Once generated, terminals are permanent
Therefore, terminals are the tokens of the language
CPSC4600 16
The Language of a CFG (Cont.)
More formally, write
X1 … Xi … Xn ⇒ X1 … Xi-1 Y1 Y2 … Ym Xi+1 … Xn
if there is a production Xi → Y1 Y2 … Ym
CPSC4600 17
The Language of a CFG (Cont.)
Write X1 X2 … Xn ⇒* Y1 Y2 … Ym
if X1 X2 … Xn ⇒ … ⇒ Y1 Y2 … Ym
in 0 or more steps
CPSC4600 18
The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
{ a1 a2 … an | S ⇒* a1 a2 … an }
where ai, i = 1, 2, …, n are terminal symbols
CPSC4600 19
Examples
Strings of balanced parentheses: { (^i )^i | i ≥ 0 }
The grammar:
S → ( S )
S → ε
same as
S → ( S ) | ε
CPSC4600 20
Arithmetic Expression Example
Simple arithmetic expressions:
E → E + E | E * E | ( E ) | id
Some elements of the language:
id             id + id
(id)           id * id
(id) * id      id * (id)
CPSC4600 21
Notes
The idea of a CFG is a big step. But:
- Membership in a language is “yes” or “no”
  - we also need the parse tree of the input!
  - furthermore, we must handle errors gracefully
- Need an “implementation” of CFGs, i.e. the parser
  - we’ll create the parser using a parser generator
  - available generators: CUP, bison, yacc
CPSC4600 22
More Notes
Form of the grammar is important
- Many grammars generate the same language
- Parsers are sensitive to the form of the grammar
Example:
E → E + E | E - E | intlit
is not suitable for an LL(1) parser (a common kind of parser).
CPSC4600 23
Derivations and Parse Trees
A derivation is a sequence of productions
S ⇒ … ⇒ … ⇒ …
A derivation can be drawn as a tree
- Start symbol is the tree’s root
- For a production X → Y1 Y2 add children Y1 Y2 to node X
CPSC4600 24
Derivation Example
Grammar: E → E + E | E * E | ( E ) | id
String: id * id + id
CPSC4600 25
Derivation Example (Cont.)
Derivation:
E ⇒ E + E
  ⇒ E * E + E
  ⇒ id * E + E
  ⇒ id * id + E
  ⇒ id * id + id

Parse tree:
            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id
CPSC4600 26
Notes on Derivations
A parse tree has
- Terminals at the leaves
- Non-terminals at the interior nodes
An in-order traversal of the leaves is the original input
The parse tree shows the association of operations, the input string does not
CPSC4600 27
Left-most and Right-most Derivations
The example is a left-most derivation
- At each step, replace the left-most non-terminal
There is an equivalent notion of a right-most derivation:
E ⇒ E + E
  ⇒ E + id
  ⇒ E * E + id
  ⇒ E * id + id
  ⇒ id * id + id
CPSC4600 28
Derivations and Parse Trees
Note that right-most and left-most derivations have the same parse tree
The difference is the order in which branches are added
CPSC4600 29
Remarks on Derivation
We are not just interested in whether s ∈ L(G)
- We need a parse tree for s (because we need to build the AST)
A derivation defines a parse tree
- But one parse tree may have many derivations
Left-most and right-most derivations are important in parser implementation
CPSC4600 30
Ambiguity(1)
Grammar: E → E + E | E * E | ( E ) | id
String: id * id + id
CPSC4600 31
Ambiguity (2)
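This string has two parse trees

Tree 1 (* at the root, + below it):
            E
          / | \
         E  *  E
         |   / | \
         id E  +  E
            |     |
            id    id

Tree 2 (+ at the root, * below it):
            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id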
CPSC4600 32
Ambiguity(3)
for each of the two parse trees, find the corresponding left-most derivation
for each of the two parse trees, find the corresponding right-most derivation
CPSC4600 33
Ambiguity (4)
A grammar is ambiguous if, for some string of the language,
- it has more than one parse tree, or
- there is more than one right-most derivation, or
- there is more than one left-most derivation.
(the three conditions are equivalent)
Ambiguity leaves the meaning of some programs ill-defined
CPSC4600 34
Dealing with Ambiguity
There are several ways to handle ambiguity
Most direct method is to rewrite the grammar unambiguously
E → E' + E | E'
E' → id * E' | id | (E)
Enforces precedence of * over +
CPSC4600 35
Removing Ambiguity
Rewriting:
- Expression Grammars
  - precedence
  - associativity
- IF-THEN-ELSE
  - the Dangling-ELSE problem
CPSC4600 36
Handling operator precedence
Rewrite the grammar
- use a different nonterminal for each precedence level
- start with the lowest precedence (MINUS)
E → E - E | E / E | ( E ) | id
rewrite to
E → E - T | T
T → T / F | F
F → id | ( E )
CPSC4600 37
Example
Parse tree for id - id / id
E → E - T | T
T → T / F | F
F → id | ( E )

          E
        / | \
       E  -  T
       |   / | \
       T  T  /  F
       |  |     |
       F  F     id
       |  |
       id id
CPSC4600 38
Handling Operator Associativity
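Separating precedence levels is not enough by itself. If the rules are written E → E - E | T and T → T / T | F, the grammar captures operator precedence, but it is still ambiguous!
It fails to express that both subtraction and division are left associative;
e.g., 5-3-2 is equivalent to ((5-3)-2) and not to (5-(3-2)).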
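The two groupings give different values, which is why the parse tree has to fix one of them. A minimal sketch, with hypothetical helper names sub_left and sub_right, that evaluates a list of operands under each grouping:

```c
/* Left-associative grouping: ((a[0] - a[1]) - a[2]) - ...
   Assumes n >= 1. Function names are illustrative only. */
int sub_left(const int *a, int n)
{
    int acc = a[0];
    for (int i = 1; i < n; i++)
        acc -= a[i];
    return acc;
}

/* Right-associative grouping: a[0] - (a[1] - (a[2] - ...))
   Assumes n >= 1. */
int sub_right(const int *a, int n)
{
    if (n == 1)
        return a[0];
    return a[0] - sub_right(a + 1, n - 1);
}
```

For the operands 5, 3, 2 the left-associative reading gives (5-3)-2 = 0, while the right-associative reading gives 5-(3-2) = 4.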
CPSC4600 39
Recursion
A grammar is recursive in nonterminal X if: X ⇒+ … X …
- ⇒+ means “in one or more steps”: X derives a sequence of symbols that includes an X
A grammar is left recursive in X if: X ⇒+ X …
- in one or more steps, X derives a sequence of symbols that starts with an X
A grammar is right recursive in X if: X ⇒+ … X
- in one or more steps, X derives a sequence of symbols that ends with an X
CPSC4600 40
Resolving ambiguity due to associativity
Rules of the form E → E - E and T → T / T are both left and right recursive in nonterminals E and T, which is what makes the grammar ambiguous.
To correctly express operator associativity:
- For left associativity, use left recursion.
- For right associativity, use right recursion.
Here's the correct grammar (left recursive only, so - and / are left associative):
E → E - T | T
T → T / F | F
F → id | ( E )
CPSC4600 41
The Dangling “Else” ambiguity
Consider the grammar
St → if E then St
   | if E then St else St
   | other
This grammar is also ambiguous
CPSC4600 42
Resolving the “dangling else”
else matches the closest unmatched then
We can describe this in the grammar
St → MIF   /* all then are matched */
   | UIF   /* some then are unmatched */
MIF → if E then MIF else MIF
    | other
UIF → if E then St
    | if E then MIF else UIF
Describes the same set of strings
CPSC4600 43
Precedence and Associativity Declarations in Parser Generators
Instead of rewriting the grammar
- Use the more natural (ambiguous) grammar
- Along with disambiguating declarations
Most parser generators allow precedence and associativity declarations to disambiguate grammars
CPSC4600 44
Parsing Approaches
Top-down parsing
- build parse tree from start symbol (root)
- match terminal symbols (tokens) in the production rules with tokens in the input stream
- simple but limited in power
Bottom-up parsing
- start from input token stream
- build parse tree from terminal symbols (tokens) until the start symbol is reached
- complex but powerful
CPSC4600 45
Top Down vs. Bottom Up
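[Figure: two diagrams side by side. Top-down Parsing: start at the root and match down to the input token stream, which is the result. Bottom-up Parsing: start at the input token stream and build up to the result at the root.]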
CPSC4600 46
Top-down Parsing
A top-down parsing algorithm parses an input string of tokens by tracing out the steps in a leftmost derivation.
The parse tree associated with the input string is constructed using preorder traversal and hence the name “top-down”.
CPSC4600 47
Top-down parsers
There are mainly two kinds of top-down parsers:
1. Predictive parsers
   - Try to make decisions about the structure of the tree below a node based on a few lookahead tokens (usually one!).
   - Weakness: Little program structure has been seen before predictive decisions must be made.
2. Backtracking parsers
   - Solve the lookahead problem by backtracking if one decision turns out to be wrong and making a different choice.
   - Weakness: Backtracking parsers are slow (exponential time in general).
CPSC4600 48
Recursive-descent parsing
Main idea
1. Use the grammar rules as recipes for procedure code that “parses” the rule
2. Each non-terminal corresponds to a procedure
3. Each appearance of a terminal in the right-hand side of a rule causes a token to be matched.
4. Each appearance of a non-terminal corresponds to a call of the associated procedure.
CPSC4600 49
Example: Recursive-descent Parsing
F → ( E ) | num
Code:
void F()
{
    if (token == num)
        match(num);       // F → num
    else {
        match('(');       // match token '('
        E();              // parse the nested expression
        match(')');       // match token ')'
    }
}
CPSC4600 50
Example: Recursive-descent Parsing (2)
Observation: Note how lookahead is not a problem in this example: if the token is a number, go one way; if the token is '(', go the other; and if the token is neither, match('(') declares an error:
void match(Token expect)
{
    if (token == expect)
        getToken();              // get next token
    else
        error(token, expect);    // report found vs. expected token
}
CPSC4600 51
Example: Recursive-descent Parsing (3)
A recursive-descent procedure can also compute values or syntax trees:
int F()
{
    if (token == num) {
        int temp = atoi(lexeme);   // read the value before advancing
        match(num);
        return temp;
    }
    else {
        match('(');
        int temp = E();            // value of the nested expression
        match(')');
        return temp;
    }
}
CPSC4600 52
When Recursive Descent Does Not Work
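E → E '+' term | term
void E()
{
    if (token == ??) {    // no single token tells us which alternative to pick
        E();              // uh, oh!! recursive call without consuming any input
        match('+');
        term();
    }
    else
        term();
}
- A left-recursive grammar has a non-terminal A with A ⇒+ A α for some α
- Recursive descent does not work in such cases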
CPSC4600 53
Elimination of Left Recursion
Consider the left-recursive grammar
A → A α | β
for some sentential forms α and β
A generates all strings starting with a β and followed by a number of α
Can rewrite the grammar using right-recursion
A → β A’
A’ → α A’ | ε
where A’ is a new nonterminal
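As a sketch of how the rewritten form fits recursive descent, take E → E '+' term | term, which becomes E → term E’ and E’ → '+' term E’ | ε. The code below assumes term → num | ( E ), a one-character-at-a-time scanner over a string, and hypothetical names (parse_E, parse_Eprime, eval_sum); none of these come from the slides.

```c
#include <ctype.h>
#include <stdbool.h>

static const char *p;   /* cursor into the input (stand-in for a scanner) */
static bool ok;         /* cleared on the first syntax error */

static int parse_E(void);

static void skip_ws(void) { while (*p == ' ') p++; }

/* Match one expected single-character token. */
static void match(char expect)
{
    skip_ws();
    if (*p == expect) p++;
    else ok = false;
}

/* term -> num | ( E )   (assumed shape of term) */
static int parse_term(void)
{
    skip_ws();
    if (isdigit((unsigned char)*p)) {
        int v = 0;
        while (isdigit((unsigned char)*p))
            v = v * 10 + (*p++ - '0');
        return v;
    }
    if (*p != '(') { ok = false; return 0; }   /* neither num nor '(' */
    match('(');
    int v = parse_E();
    match(')');
    return v;
}

/* E' -> '+' term E' | epsilon ; left is the value accumulated so far */
static int parse_Eprime(int left)
{
    skip_ws();
    if (ok && *p == '+') {
        match('+');
        int right = parse_term();
        return parse_Eprime(left + right);
    }
    return left;                               /* epsilon alternative */
}

/* E -> term E' : no left recursion, so every call consumes input first */
static int parse_E(void)
{
    return parse_Eprime(parse_term());
}

/* Returns true and stores the sum in *out if s is a valid expression. */
bool eval_sum(const char *s, int *out)
{
    p = s;
    ok = true;
    int v = parse_E();
    skip_ws();
    if (!ok || *p != '\0') return false;
    *out = v;
    return true;
}
```

Passing the accumulated value into parse_Eprime keeps + left associative even though the rewritten grammar is right recursive.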
CPSC4600 54
Elimination of Left Recursion (2)
In general
A → A α1 | … | A αn | β1 | … | βm
All strings derived from A start with one of β1, …, βm and continue with several instances of α1, …, αn
Rewrite as
A → β1 A’ | … | βm A’
A’ → α1 A’ | … | αn A’ | ε
CPSC4600 55
General Left Recursion
The grammar
S → A α | δ
A → S β
is also left-recursive because S ⇒+ S β α
This left-recursion can also be eliminated
See book, Section 4.3 for the general algorithm
CPSC4600 56
Summary of Recursive Descent with backtracking
Simple and general parsing strategy
- Left-recursion must be eliminated first
- … but that can be done automatically
Unpopular because of backtracking
- Thought to be too inefficient
In practice, backtracking is eliminated by restricting the grammar
CPSC4600 57
Predictive Parsers
Like recursive-descent but the parser can “predict” which production to use
- By looking at the next few tokens
- No backtracking
Predictive parsers accept LL(k) grammars
- The first L means “left-to-right” scan of input
- The second L means “leftmost derivation”
- k means “predict based on k tokens of lookahead”
In practice, LL(1) is used
CPSC4600 58
LL(1) Languages
In recursive-descent, for each non-terminal and input token there may be a choice of production
LL(1) means that for each non-terminal and token there is only one production
Can be specified via 2D tables
- One dimension for current non-terminal to expand
- One dimension for next token
- A table entry contains one production
CPSC4600 59
Predictive Parsing and Left Factoring
Consider the grammar
E → T + E | T
T → num | num * T | ( E )
Hard to predict because
- For T, two productions start with num
- For E, it is not clear how to predict
A grammar must be left-factored before use for predictive parsing
CPSC4600 60
Left-Factoring Example
Recall the grammar
E → T + E | T
T → num | num * T | ( E )
Factor out common prefixes of productions
E → T X
X → + E | ε
T → ( E ) | num Y
Y → * T | ε
CPSC4600 61
LL(1) Parsing Table Example
Left-factored grammar
E → T X
X → + E | ε
T → ( E ) | num Y
Y → * T | ε
The LL(1) parsing table:
       num      *        +        (        )        $
E      T X                        T X
X                        + E               ε        ε
T      num Y                      ( E )
Y               * T      ε                 ε        ε
CPSC4600 62
LL(1) Parsing Table Example (Cont.)
Consider the [E, num] entry
- “When current non-terminal is E and next input is num, use production E → T X”
- This production can generate a num in the first place
Consider the [Y, +] entry
- “When current non-terminal is Y and current token is +, get rid of Y”
- Y can be followed by + only in a derivation in which Y → ε
CPSC4600 63
LL(1) Parsing Tables. Errors
Blank entries indicate error situations
- Consider the [E, *] entry
- “There is no way to derive a string starting with * from non-terminal E”
CPSC4600 64
Using Parsing Tables
Method similar to recursive descent, except
- For each non-terminal S
- We look at the next token a
- And choose the production shown at [S, a]
We use a stack to keep track of pending non-terminals
We reject when we encounter an error state
We accept when we encounter end-of-input
CPSC4600 65
LL(1) Parsing Algorithm
initialize stack = <S $> and Token = nextToken()
    (S is the start nonterminal, $ is the end-of-input symbol)
repeat
    case stack of
        <X, rest> : if T[X, Token] = Y1 … Yn
                        then stack ← <Y1 … Yn rest>;
                        else error();
        <t, rest> : if t == Token
                        then stack ← <rest>; Token = nextToken();
                        else error();
until stack == < >   // empty
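The loop above can be sketched in C for the left-factored grammar E → T X, X → + E | ε, T → ( E ) | num Y, Y → * T | ε. The encoding is an assumption made for brevity: every grammar symbol is one character, 'n' stands for the token num, the input is a string of such token characters, and the names table and ll1_accepts are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* T[X, t]: right-hand side to push, "" for epsilon, NULL for a blank
   (error) entry. Non-terminals: E X T Y. Terminals: n + * ( ) $. */
static const char *table(char X, char t)
{
    switch (X) {
    case 'E': return (t == 'n' || t == '(') ? "TX" : NULL;
    case 'X': return t == '+' ? "+E" : (t == ')' || t == '$') ? "" : NULL;
    case 'T': return t == 'n' ? "nY" : t == '(' ? "(E)" : NULL;
    case 'Y': return t == '*' ? "*T"
                   : (t == '+' || t == ')' || t == '$') ? "" : NULL;
    default:  return NULL;
    }
}

/* Returns true if the token string is derivable from E. */
bool ll1_accepts(const char *tokens)
{
    char stack[256];
    size_t top = 0;
    const char *in = tokens;          /* *in is the lookahead token */

    stack[top++] = '$';               /* end-of-input marker at the bottom */
    stack[top++] = 'E';               /* start non-terminal on top */

    while (top > 0) {
        char X = stack[--top];
        char t = *in ? *in : '$';     /* end of string acts as $ */

        if (X == 'E' || X == 'X' || X == 'T' || X == 'Y') {
            const char *rhs = table(X, t);
            if (rhs == NULL) return false;              /* blank entry */
            size_t n = strlen(rhs);
            if (top + n > sizeof stack) return false;   /* fixed-size sketch */
            for (size_t i = n; i > 0; i--)              /* push reversed so */
                stack[top++] = rhs[i - 1];              /* rhs[0] is on top */
        } else if (X == '$') {
            return *in == '\0';       /* accept only at the real end of input */
        } else {
            if (*in != X) return false;   /* terminal must match lookahead */
            in++;
        }
    }
    return false;
}
```

For example, ll1_accepts("n*n") is true and ll1_accepts("*n") is false, the latter because the [E, *] entry is blank.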
CPSC4600 66
LL(1) Parsing Example
Stack           Input            Action
E $             num * num $      T X
T X $           num * num $      num Y
num Y X $       num * num $      terminal
Y X $           * num $          * T
* T X $         * num $          terminal
T X $           num $            num Y
num Y X $       num $            terminal
Y X $           $                ε
X $             $                ε
$               $                ACCEPT

Parsing table (repeated for reference):
       num      *        +        (        )        $
E      T X                        T X
X                        + E               ε        ε
T      num Y                      ( E )
Y               * T      ε                 ε        ε
CPSC4600 67
Constructing Parsing Tables
LL(1) languages are those defined by a parsing table for the LL(1) algorithm
No table entry can be multiply defined
We want to generate parsing tables from CFG
CPSC4600 68
Constructing Parsing Tables: First and Follow sets
If A → α, where in the row of A do we place α?
Answer:
- In the column of t where t can start a string derived from α
  α ⇒* t β
  We say that t ∈ First(α)
- In the column of t if α is ε and t can follow an A
  S ⇒* β A t δ
  We say t ∈ Follow(A)
CPSC4600 69
Computing First Sets
Definition: First(X) = { t | X ⇒* t α } ∪ { ε | X ⇒* ε }
Algorithm sketch (see book for details):
1. for all terminals t do First(t) ← { t }
2. for each production X → ε do First(X) ← { ε }
3. if X → A1 … An α and ε ∈ First(Ai), 1 ≤ i ≤ n do
   • add First(α) to First(X)
4. for each X → A1 … An s.t. ε ∈ First(Ai), 1 ≤ i ≤ n do
   • add ε to First(X)
5. repeat steps 3 & 4 until no First set can be grown
CPSC4600 70
First Sets. Example
Recall the grammar
E → T X
X → + E | ε
T → ( E ) | num Y
Y → * T | ε
First sets
First( ( ) = { ( }          First( T ) = { num, ( }
First( ) ) = { ) }          First( E ) = { num, ( }
First( num ) = { num }      First( X ) = { +, ε }
First( + ) = { + }          First( Y ) = { *, ε }
First( * ) = { * }
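The First sets of this grammar can be reproduced with a small fixpoint loop. The sketch below makes several assumptions for brevity: symbols are single characters with 'n' standing for num, sets are bitmasks (bit i is terminals[i], EPS is the bit for ε), and the name compute_first is hypothetical. It is a common formulation of the iteration rather than a line-by-line transcription of the five steps.

```c
#include <stdbool.h>
#include <string.h>

/* Grammar: E -> T X, X -> + E | eps, T -> ( E ) | n Y, Y -> * T | eps */
static const struct { char lhs; const char *rhs; } prods[] = {
    {'E', "TX"}, {'X', "+E"}, {'X', ""},
    {'T', "(E)"}, {'T', "nY"}, {'Y', "*T"}, {'Y', ""},
};
enum { NPRODS = sizeof prods / sizeof prods[0] };

static const char terminals[] = "n+*()";   /* bit i stands for terminals[i] */
#define EPS 32u                            /* bit standing for epsilon */

static unsigned bit(char t)
{
    return 1u << (unsigned)(strchr(terminals, t) - terminals);
}

/* Fills first[c] for every grammar symbol c (indexed by character code). */
void compute_first(unsigned first[256])
{
    memset(first, 0, 256 * sizeof first[0]);
    for (const char *t = terminals; *t; t++)     /* First(t) = { t } */
        first[(unsigned char)*t] = bit(*t);

    bool grown = true;
    while (grown) {                              /* repeat until stable */
        grown = false;
        for (int i = 0; i < NPRODS; i++) {
            unsigned *fx = &first[(unsigned char)prods[i].lhs];
            unsigned before = *fx;
            const char *s = prods[i].rhs;
            /* walk the right-hand side while the prefix can derive epsilon */
            for (; *s; s++) {
                unsigned fs = first[(unsigned char)*s];
                *fx |= fs & ~EPS;
                if (!(fs & EPS)) break;
            }
            if (*s == '\0')                      /* whole rhs can vanish */
                *fx |= EPS;
            if (*fx != before) grown = true;
        }
    }
}
```

With this bit layout, first['E'] and first['T'] come out as { n, ( }, first['X'] as { +, ε } and first['Y'] as { *, ε }, matching the sets listed above.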
CPSC4600 71
Computing Follow Sets
Definition:
Follow(X) = { t | S ⇒* β X t δ }
Intuition:
- If S is the start symbol then $ ∈ Follow(S)
- If X → A B then First(B) ⊆ Follow(A) and Follow(X) ⊆ Follow(B)
- Also if B ⇒* ε then Follow(X) ⊆ Follow(A)
CPSC4600 72
Computing Follow Sets (Cont.)
Algorithm sketch:
1. Follow(S) ← { $ }
2. For each production A → α X β
   add First(β) \ {ε} to Follow(X)
3. For each A → α X β where ε ∈ First(β)
   add Follow(A) to Follow(X)
Repeat steps 2 and 3 until no Follow set grows
CPSC4600 73
Follow Sets. Example
Recall the grammar
E → T X
X → + E | ε
T → ( E ) | num Y
Y → * T | ε
Follow sets
Follow( + ) = { num, ( }        Follow( * ) = { num, ( }
Follow( ( ) = { num, ( }        Follow( E ) = { ), $ }
Follow( X ) = { $, ) }          Follow( T ) = { +, ), $ }
Follow( ) ) = { +, ), $ }       Follow( Y ) = { +, ), $ }
Follow( num ) = { *, +, ), $ }
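These Follow sets can be checked with a short fixpoint computation. The sketch assumes single-character symbols ('n' for num), bitmask sets over the terminals "n+*()$" with EPS as the ε bit, and First sets hard-coded from the previous slide to keep it short; the names first and compute_follow are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Grammar: E -> T X, X -> + E | eps, T -> ( E ) | n Y, Y -> * T | eps */
static const struct { char lhs; const char *rhs; } prods[] = {
    {'E', "TX"}, {'X', "+E"}, {'X', ""},
    {'T', "(E)"}, {'T', "nY"}, {'Y', "*T"}, {'Y', ""},
};
enum { NPRODS = sizeof prods / sizeof prods[0] };

static const char terminals[] = "n+*()$";  /* bit i stands for terminals[i] */
#define EPS 64u                            /* bit standing for epsilon */

static unsigned bit(char t)
{
    return 1u << (unsigned)(strchr(terminals, t) - terminals);
}

/* First sets, hard-coded for this grammar (an assumption of the sketch). */
static unsigned first(char c)
{
    switch (c) {
    case 'E': case 'T': return bit('n') | bit('(');
    case 'X':           return bit('+') | EPS;
    case 'Y':           return bit('*') | EPS;
    default:            return bit(c);     /* terminal: only itself */
    }
}

/* Fills follow[c] for every grammar symbol c (indexed by character code). */
void compute_follow(unsigned follow[256])
{
    memset(follow, 0, 256 * sizeof follow[0]);
    follow['E'] = bit('$');                /* step 1: E is the start symbol */

    bool grown = true;
    while (grown) {                        /* repeat steps 2 and 3 */
        grown = false;
        for (size_t i = 0; i < NPRODS; i++) {
            for (const char *x = prods[i].rhs; *x; x++) {
                unsigned *fx = &follow[(unsigned char)*x];
                unsigned before = *fx;
                const char *b = x + 1;     /* beta: symbols after X */
                for (; *b; b++) {
                    unsigned fb = first(*b);
                    *fx |= fb & ~EPS;      /* step 2: First(beta) \ {eps} */
                    if (!(fb & EPS)) break;
                }
                if (*b == '\0')            /* step 3: beta derives epsilon */
                    *fx |= follow[(unsigned char)prods[i].lhs];
                if (*fx != before) grown = true;
            }
        }
    }
}
```

Like the slide, the sketch also computes Follow for terminals, although only the Follow sets of non-terminals are needed for table construction.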
CPSC4600 74
Constructing LL(1) Parsing Tables
Construct a parsing table T for CFG G
For each production A → α in G do:
- For each terminal t ∈ First(α) do
  T[A, t] = α
- If ε ∈ First(α), for each t ∈ Follow(A) do
  T[A, t] = α
- If ε ∈ First(α) and $ ∈ Follow(A) do
  T[A, $] = α
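The construction rules can be sketched for the left-factored expression grammar. Everything below is an illustrative assumption: single-character symbols ('n' for num), First(α) and Follow(A) hard-coded as strings taken from the earlier slides ('e' standing for ε), $ stored inside the Follow strings so the third rule is covered by the second, and the hypothetical name build_table. The function also counts multiply defined entries, which is the LL(1) test.

```c
#include <string.h>

enum { NT = 4, NTERM = 6 };
static const char nonterms[] = "EXTY";     /* table rows */
static const char terms[]    = "n*+()$";   /* table columns */

/* Each production with First(rhs); 'e' stands for epsilon. */
static const struct { char lhs; const char *rhs; const char *first; } prods[] = {
    {'E', "TX",  "n("}, {'X', "+E", "+"}, {'X', "", "e"},
    {'T', "(E)", "("},  {'T', "nY", "n"}, {'Y', "*T", "*"}, {'Y', "", "e"},
};
enum { NPRODS = sizeof prods / sizeof prods[0] };

/* Follow sets of the non-terminals, hard-coded for this grammar. */
static const char *follow(char A)
{
    return (A == 'E' || A == 'X') ? ")$" : "+)$";
}

/* Stores production p at [row, t]; returns 1 if the entry was already set. */
static int set_entry(int table[NT][NTERM], int row, char t, int p)
{
    int col = (int)(strchr(terms, t) - terms);
    int conflict = table[row][col] != -1;
    table[row][col] = p;
    return conflict;
}

/* Fills table[A][t] with a production index, or -1 for a blank entry.
   Returns the number of multiply defined entries (0 means LL(1)). */
int build_table(int table[NT][NTERM])
{
    int conflicts = 0;

    for (int r = 0; r < NT; r++)
        for (int c = 0; c < NTERM; c++)
            table[r][c] = -1;

    for (int p = 0; p < NPRODS; p++) {
        int row = (int)(strchr(nonterms, prods[p].lhs) - nonterms);
        for (const char *t = prods[p].first; *t; t++) {
            if (*t != 'e') {
                /* t in First(alpha): T[A, t] = alpha */
                conflicts += set_entry(table, row, *t, p);
            } else {
                /* eps in First(alpha): T[A, t] = alpha for t in Follow(A) */
                for (const char *f = follow(prods[p].lhs); *f; f++)
                    conflicts += set_entry(table, row, *f, p);
            }
        }
    }
    return conflicts;
}
```

For this grammar build_table reports 0 conflicts; the [E, num] entry holds production 0 (E → T X) and the [E, *] entry stays blank.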
CPSC4600 75
Notes on LL(1) Parsing Tables
If any entry is multiply defined then G is not LL(1). This happens, for example:
- If G is ambiguous
- If G is left recursive
- If G is not left-factored
Most programming language grammars are not LL(1)
There are tools that build LL(1) tables
CPSC4600 76
Review
For some grammars there is a simple parsing strategy: predictive parsing
Next time: Bottom-up parsing