CS 381 - Summer 2005
Top-down and Bottom-up Parsing - a whirlwind tour
June 20, 2005
Slide acknowledgment: Radu Rugina, CS 412
Simplified Compiler Structure
Source code: if (b == 0) a = b;
  ↓ Front end (machine-independent): understand source code
Intermediate code
  ↓ Optimizer: optimize
Intermediate code
  ↓ Back end (machine-dependent): generate assembly code
Assembly code: cmp $0,ecx
               cmovz edx,ecx
Simplified Front-End Structure
Source code (character stream): if (b == 0) a = b;
  ↓ Lexical Analysis
Token stream: if ( b == 0 ) a = b ;
  ↓ Syntax Analysis (Parsing)
Abstract Syntax Tree (AST):
        if
       /  \
     ==    =
    /  \  / \
   b   0 a   b
  ↓ Semantic Analysis
Parse Tree vs. AST
• Parse tree also called "concrete syntax"
• Abstract Syntax Tree: discards (abstracts) unneeded information
[Figure: parse tree (concrete syntax) and abstract syntax tree for (1 + 2 + (3 + 4)) + 5]
How to build an AST
• Need to find a derivation for the program in the grammar
• Want an efficient algorithm
  – should only read token stream once
  – exponential brute-force search out of question
  – even CKY is too slow
• Two main ways to parse:
  – top-down parsing (recursive descent)
  – bottom-up parsing (shift-reduce)
Parsing Top-down
Grammar:
  S → E + S | E
  E → num | ( S )
Goal: construct a leftmost derivation of string while reading in token stream
Input: (1+2+(3+4))+5. The lookahead token marks the boundary between the parsed part (to its left) and the unparsed part (from the lookahead on).
  Partly-derived string   Lookahead
  S                       (
  E+S                     (
  (S)+S                   1
  (E+S)+S                 1
  (1+S)+S                 2
  (1+E+S)+S               2
  (1+2+S)+S               (
  (1+2+E)+S               (
  (1+2+(S))+S             3
  (1+2+(E+S))+S           3
Problem
Grammar:
  S → E + S | E
  E → num | ( S )
• Want to decide which production to apply based on next symbol
  (1):    S → E → (S) → (E) → (1)
  (1)+2:  S → E + S → (S) + S → (E) + S → (1)+E → (1)+2
• Why is this hard?
Grammar is Problem
• This grammar cannot be parsed top-down with only a single look-ahead symbol
• Not LL(1) = Left-to-right-scanning, Left-most derivation, 1 look-ahead symbol
• Is it LL(k) for some k?
• Can rewrite grammar to allow top-down parsing: create LL(1) grammar for same language
Making a grammar LL(1)
Original grammar:
  S → E + S
  S → E
  E → num
  E → ( S )
Left-factored grammar:
  S → E S'
  S' → ε
  S' → + S
  E → num
  E → ( S )
• Problem: can't decide which S production to apply until we see symbol after first expression
• Left-factoring: Factor common S prefix, add new non-terminal S' at decision point. S' derives (+E)*
Parsing with new grammar
Grammar:
  S → E S'
  S' → ε | + S
  E → num | ( S )
Input: (1+2+(3+4))+5
  Partly-derived string        Lookahead
  S                            (
  E S'                         (
  (S) S'                       1
  (E S') S'                    1
  (1 S') S'                    +
  (1+E S') S'                  2
  (1+2 S') S'                  +
  (1+2 + S) S'                 (
  (1+2 + E S') S'              (
  (1+2 + (S) S') S'            3
  (1+2 + (E S') S') S'         3
  (1+2 + (3 S') S') S'         +
  (1+2 + (3 + E) S') S'        4
Predictive Parsing
• LL(1) grammar:
  – for a given non-terminal, the look-ahead symbol uniquely determines the production to apply
  – top-down parsing = predictive parsing
  – Driven by predictive parsing table of non-terminals × terminals → productions
Using Table
Grammar:
  S → E S'
  S' → ε | + S
  E → num | ( S )
Predictive parsing table (rows: non-terminals, columns: look-ahead terminals, $ = end of input):
         num       +        (          )       $
  S      → E S'             → E S'
  S'               → + S               → ε     → ε
  E      → num              → ( S )
Input: (1+2+(3+4))+5
  Partly-derived string   Lookahead
  S                       (
  E S'                    (
  (S) S'                  1
  (E S') S'               1
  (1 S') S'               +
  (1 + S) S'              2
  (1+E S') S'             2
  (1+2 S') S'             +
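The table lookup above can also be run directly with an explicit stack instead of procedures. The following is a minimal sketch under stated assumptions, not part of the original slides: the class name `Ll1TableDemo`, the string encoding of tokens (`"num"`, `"+"`, `"("`, `")"`) and the `"non-terminal lookahead"` map keys are all illustrative choices.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: table-driven LL(1) recognizer for
//   S -> E S'     S' -> epsilon | + S     E -> num | ( S )
class Ll1TableDemo {
    // Key: "<non-terminal> <lookahead>"; value: right-hand side (empty array = epsilon).
    static final Map<String, String[]> TABLE = new HashMap<>();
    static {
        TABLE.put("S num", new String[] {"E", "S'"});
        TABLE.put("S (",   new String[] {"E", "S'"});
        TABLE.put("S' +",  new String[] {"+", "S"});
        TABLE.put("S' )",  new String[] {});
        TABLE.put("S' $",  new String[] {});
        TABLE.put("E num", new String[] {"num"});
        TABLE.put("E (",   new String[] {"(", "S", ")"});
    }

    static boolean isNonTerminal(String s) {
        return s.equals("S") || s.equals("S'") || s.equals("E");
    }

    // Returns true if the token sequence is derivable from S.
    static boolean parse(String... tokens) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");   // end-of-input marker
        stack.push("S");   // start symbol
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop();
            String look = pos < tokens.length ? tokens[pos] : "$";
            if (isNonTerminal(top)) {
                String[] rhs = TABLE.get(top + " " + look);
                if (rhs == null) return false;              // empty table entry: error
                for (int i = rhs.length - 1; i >= 0; i--) { // push RHS, leftmost symbol on top
                    stack.push(rhs[i]);
                }
            } else {
                if (!top.equals(look)) return false;        // terminal mismatch
                pos++;                                      // match terminal (or final $)
            }
        }
        return pos == tokens.length + 1;                    // all tokens and $ were matched
    }
}
```

For example, `Ll1TableDemo.parse("(", "num", "+", "num", ")", "+", "num")` accepts, while `Ll1TableDemo.parse("num", "+")` is rejected because the entry for S on `$` is empty.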
How to Implement?
• Table can be converted easily into a recursive-descent parser
• Three procedures: parse_S, parse_S', parse_E
Recursive-Descent Parser
(token holds the lookahead token; each case corresponds to one entry of the parsing table)
    void parse_S() {
        switch (token) {
            case number: parse_E(); parse_S'(); return;          // S → E S'
            case '(':    parse_E(); parse_S'(); return;          // S → E S'
            default:     throw new ParseError();
        }
    }
    void parse_S'() {
        switch (token) {
            case '+': token = input.read(); parse_S(); return;   // S' → + S
            case ')': return;                                    // S' → ε
            case EOF: return;                                    // S' → ε
            default:  throw new ParseError();
        }
    }
    void parse_E() {
        switch (token) {
            case number: token = input.read(); return;           // E → number
            case '(':    token = input.read(); parse_S();        // E → ( S )
                         if (token != ')') throw new ParseError();
                         token = input.read(); return;
            default:     throw new ParseError();
        }
    }
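The three procedures can be assembled into a small self-contained recognizer. This is a sketch under assumptions rather than the course's actual code: tokens are plain strings (`"num"`, `"+"`, `"("`, `")"`, with `"$"` standing in for EOF), `parse_S'` is renamed `parseSPrime` because an apostrophe is not legal in a Java identifier, and the class name `RecursiveDescentDemo` and the helper `accepts` are hypothetical.

```java
// Hypothetical demo class: recursive-descent recognizer for
//   S -> E S'     S' -> epsilon | + S     E -> num | ( S )
class RecursiveDescentDemo {
    static class ParseError extends RuntimeException { }

    private final String[] tokens;
    private int pos = 0;

    RecursiveDescentDemo(String... tokens) { this.tokens = tokens; }

    // Current lookahead token; "$" plays the role of EOF.
    private String token() { return pos < tokens.length ? tokens[pos] : "$"; }

    void parseS() {
        switch (token()) {
            case "num":
            case "(":
                parseE(); parseSPrime(); return;   // S -> E S'
            default:
                throw new ParseError();
        }
    }

    void parseSPrime() {
        switch (token()) {
            case "+": pos++; parseS(); return;     // S' -> + S
            case ")":
            case "$": return;                      // S' -> epsilon
            default:  throw new ParseError();
        }
    }

    void parseE() {
        switch (token()) {
            case "num": pos++; return;             // E -> num
            case "(":                              // E -> ( S )
                pos++;
                parseS();
                if (!token().equals(")")) throw new ParseError();
                pos++;
                return;
            default:
                throw new ParseError();
        }
    }

    // Returns true if the whole token sequence is derivable from S.
    static boolean accepts(String... tokens) {
        RecursiveDescentDemo p = new RecursiveDescentDemo(tokens);
        try {
            p.parseS();
            return p.pos == tokens.length;         // no input may be left over
        } catch (ParseError e) {
            return false;
        }
    }
}
```

Calling `RecursiveDescentDemo.accepts("(", "num", "+", "num", ")", "+", "num")` accepts the input; the chain of calls made along the way has the same shape as the parse tree.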
Call Tree = Parse Tree
Input: (1 + 2 + (3 + 4)) + 5
[Figure: the parse tree for this input next to the tree of calls to parse_S, parse_E and parse_S'; the two trees have the same shape]
How to Construct Parsing Tables
• There exists an algorithm for automatically generating a predictive parse table from a grammar (take 412 for details)
  S → E S'
  S' → ε | + S
  E → number | ( S )
Summary for top-down parsing
• LL(k) grammars
  – left-to-right scanning
  – leftmost derivation
  – can determine what production to apply from the next k symbols
  – Can automatically build predictive parsing tables
• Predictive parsers
  – Can be easily built for LL(k) grammars from the parsing tables
  – Also called recursive-descent, or top-down parsers
Top-Down Parsing Summary
Language grammar
  ↓ Left-recursion elimination, Left-factoring
LL(1) grammar
  ↓
predictive parsing table
  ↓
recursive-descent parser
  ↓
parser with AST generation
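The last step of this pipeline, a parser with AST generation, can be sketched by letting each parsing procedure return a tree node instead of nothing. Everything named here is an assumption made for illustration: the class `AstDemo`, the node types `Num` and `Add`, the tiny built-in tokenizer, and the choice to fold S' into `parseS`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: recursive descent that builds an AST for
//   S -> E S'     S' -> epsilon | + S     E -> num | ( S )
class AstDemo {
    interface Node { int eval(); }

    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        public int eval() { return value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    static final class Add implements Node {
        final Node left, right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
        public int eval() { return left.eval() + right.eval(); }
        @Override public String toString() { return "(" + left + " + " + right + ")"; }
    }

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    // Minimal tokenizer: digit runs become number tokens, any other
    // non-blank character is a one-character token; "$" marks the end.
    private AstDemo(String src) {
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int j = i + 1;
            if (Character.isDigit(c)) {
                while (j < src.length() && Character.isDigit(src.charAt(j))) j++;
            }
            tokens.add(src.substring(i, j));
            i = j;
        }
        tokens.add("$");
    }

    private String peek() { return tokens.get(pos); }

    // S -> E S', with S' -> epsilon | + S handled inline.
    private Node parseS() {
        Node left = parseE();
        if (peek().equals("+")) { pos++; return new Add(left, parseS()); }
        if (peek().equals(")") || peek().equals("$")) return left;
        throw new IllegalStateException("unexpected token " + peek());
    }

    // E -> num | ( S ); the parentheses leave no trace in the AST.
    private Node parseE() {
        if (Character.isDigit(peek().charAt(0))) {
            return new Num(Integer.parseInt(tokens.get(pos++)));
        }
        if (peek().equals("(")) {
            pos++;
            Node inner = parseS();
            if (!peek().equals(")")) throw new IllegalStateException("expected )");
            pos++;
            return inner;
        }
        throw new IllegalStateException("unexpected token " + peek());
    }

    static Node parse(String src) {
        AstDemo p = new AstDemo(src);
        Node root = p.parseS();
        if (p.pos != p.tokens.size() - 1) throw new IllegalStateException("trailing input");
        return root;
    }
}
```

For `(1+2+(3+4))+5` the resulting tree prints as `((1 + (2 + (3 + 4))) + 5)`: the grouping comes from the tree structure alone, and `eval()` on the root gives 15.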
Now: Bottom-up Parsing
• A more powerful parsing technology
• LR grammars -- more expressive than LL
  – construct right-most derivation of program
  – virtually all programming languages
  – easier to express programming language syntax
• Shift-reduce parsers
  – Parsers for LR grammars
  – automatic parser generators (e.g. yacc, CUP)
Bottom-up Parsing
Grammar:
  S → S + E | E
  E → num | ( S )
• Right-most derivation -- backward
  – Start with the tokens
  – End with the start symbol
  (1+2+(3+4))+5 ← (E+2+(3+4))+5
  ← (S+2+(3+4))+5 ← (S+E+(3+4))+5
  ← (S+(3+4))+5 ← (S+(E+4))+5 ← (S+(S+4))+5
  ← (S+(S+E))+5 ← (S+(S))+5 ← (S+E)+5 ← (S)+5
  ← E+5 ← S+E ← S
Progress of Bottom-up Parsing
(read top to bottom: the right-most derivation, backward)
  Derivation step    Input (the gap separates scanned from unscanned input)
  (1+2+(3+4))+5      (1+2+(3+4))+5
  (E+2+(3+4))+5      (1 +2+(3+4))+5
  (S+2+(3+4))+5      (1 +2+(3+4))+5
  (S+E+(3+4))+5      (1+2 +(3+4))+5
  (S+(3+4))+5        (1+2+(3 +4))+5
  (S+(E+4))+5        (1+2+(3 +4))+5
  (S+(S+4))+5        (1+2+(3 +4))+5
  (S+(S+E))+5        (1+2+(3+4 ))+5
  (S+(S))+5          (1+2+(3+4 ))+5
  (S+E)+5            (1+2+(3+4) )+5
  (S)+5              (1+2+(3+4) )+5
  E+5                (1+2+(3+4)) +5
  S+E                (1+2+(3+4))+5
  S                  (1+2+(3+4))+5
Bottom-up Parsing
Grammar:
  S → S + E | E
  E → num | ( S )
• (1+2+(3+4))+5 ← (E+2+(3+4))+5 ← (S+2+(3+4))+5 ← (S+E+(3+4))+5 …
• Advantage of bottom-up parsing: can postpone the selection of productions until more of the input is scanned
[Figure: parse tree for (1+2+(3+4))+5 under this grammar]
Top-down Parsing
Grammar:
  S → S + E | E
  E → num | ( S )
Input: (1+2+(3+4))+5
  S → S+E → E+E → (S)+E → (S+E)+E → (S+E+E)+E → (E+E+E)+E → (1+E+E)+E → (1+2+E)+E ...
• In left-most derivation, entire tree above a token (2) has been expanded when encountered
[Figure: parse tree for (1+2+(3+4))+5 under this grammar]
Top-down vs. Bottom-up
[Figure: two parse-tree outlines, labelled Top-down and Bottom-up, each divided into scanned and unscanned input]
Bottom-up: Don't need to figure out as much of the parse tree for a given amount of input
Shift-reduce Parsing
• Parsing actions: a sequence of shift and reduce operations
• Parser state: a stack of terminals and non-terminals (grows to the right)
• Current derivation step = always stack + input
  Derivation step    Stack    Unconsumed input
  (1+2+(3+4))+5               (1+2+(3+4))+5
  (E+2+(3+4))+5      (E       +2+(3+4))+5
  (S+2+(3+4))+5      (S       +2+(3+4))+5
  (S+E+(3+4))+5      (S+E     +(3+4))+5
Shift-reduce Parsing
• Parsing is a sequence of shifts and reduces
• Shift: move look-ahead token to stack
  Stack    Input            Action
  (        1+2+(3+4))+5     shift 1
  (1       +2+(3+4))+5
• Reduce: replace symbols γ from top of stack with non-terminal symbol X, corresponding to production X → γ (pop γ, push X)
  Stack    Input            Action
  (S+E     +(3+4))+5        reduce S → S+E
  (S       +(3+4))+5
Shift-reduce Parsing
Grammar:
  S → S + E | E
  E → num | ( S )
  Derivation         Stack    Input stream       Action
  (1+2+(3+4))+5               (1+2+(3+4))+5      shift
  (1+2+(3+4))+5      (        1+2+(3+4))+5       shift
  (1+2+(3+4))+5      (1       +2+(3+4))+5        reduce E → num
  (E+2+(3+4))+5      (E       +2+(3+4))+5        reduce S → E
  (S+2+(3+4))+5      (S       +2+(3+4))+5        shift
  (S+2+(3+4))+5      (S+      2+(3+4))+5         shift
  (S+2+(3+4))+5      (S+2     +(3+4))+5          reduce E → num
  (S+E+(3+4))+5      (S+E     +(3+4))+5          reduce S → S+E
  (S+(3+4))+5        (S       +(3+4))+5          shift
  (S+(3+4))+5        (S+      (3+4))+5           shift
  (S+(3+4))+5        (S+(     3+4))+5            shift
  (S+(3+4))+5        (S+(3    +4))+5             reduce E → num
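The trace above can be reproduced with a few lines of code. This is only a sketch for this one grammar, and its way of choosing actions is an assumption, not a general algorithm: it reduces whenever the top of the stack matches a right-hand side (trying S + E before E) and shifts otherwise. That fixed preference happens to work for this grammar but not for grammars in general. The class name `ShiftReduceDemo` and the token strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: hand-coded shift-reduce recognizer for
//   S -> S + E | E     E -> num | ( S )
class ShiftReduceDemo {
    static final List<String> TERMINALS = Arrays.asList("num", "+", "(", ")");

    // True if the top of the stack equals the given symbols.
    static boolean endsWith(List<String> stack, String... suffix) {
        int start = stack.size() - suffix.length;
        return start >= 0
            && stack.subList(start, stack.size()).equals(Arrays.asList(suffix));
    }

    // Pop n symbols from the top of the stack and push the non-terminal.
    static void reduce(List<String> stack, int n, String nonTerminal) {
        for (int i = 0; i < n; i++) stack.remove(stack.size() - 1);
        stack.add(nonTerminal);
    }

    // Returns the list of actions taken, or null if the input is rejected.
    static List<String> parse(String... tokens) {
        List<String> stack = new ArrayList<>();   // grows to the right
        List<String> actions = new ArrayList<>();
        int pos = 0;
        while (true) {
            if (endsWith(stack, "num")) {
                reduce(stack, 1, "E"); actions.add("reduce E -> num");
            } else if (endsWith(stack, "(", "S", ")")) {
                reduce(stack, 3, "E"); actions.add("reduce E -> ( S )");
            } else if (endsWith(stack, "S", "+", "E")) {
                reduce(stack, 3, "S"); actions.add("reduce S -> S + E");
            } else if (endsWith(stack, "E")) {
                reduce(stack, 1, "S"); actions.add("reduce S -> E");
            } else if (pos < tokens.length) {
                if (!TERMINALS.contains(tokens[pos])) return null;  // unknown token
                stack.add(tokens[pos++]); actions.add("shift");
            } else {
                break;                            // nothing to reduce, nothing to shift
            }
        }
        // Accept only if the whole input was reduced to the start symbol.
        return stack.size() == 1 && stack.get(0).equals("S") ? actions : null;
    }
}
```

On the token sequence for `(1+2+(3+4))+5` the first twelve recorded actions match the action column of the trace. Deciding between shift and reduce in general is exactly what the LR parsing table is for.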
Problem
• How do we know which action to take: whether to shift or reduce, and which production?
• Issues:
  – Sometimes can reduce but shouldn't
  – Sometimes can reduce in different ways
Action Selection Problem
• Given stack σ and look-ahead symbol b, should parser:
  – shift b onto the stack (making it σb)
  – reduce X → γ assuming that stack has the form αγ (making it αX)
• If stack has form αγ, should apply reduction X → γ (or shift) depending on stack prefix α
  – α is different for different possible reductions, since γ's have different length
LR Parsing Engine
• Basic mechanism:
  – Use a set of parser states
  – Use a stack with alternating symbols and states
    • E.g.: 1 ( 6 S 10 + 5
  – Use a parsing table to:
    • Determine what action to apply (shift/reduce)
    • Determine the next state
• The parser actions can be precisely determined from the table
The LR Parsing Table
Layout: one row per state
  – Action table: columns are terminals; each entry gives the next action and next state
  – Goto table: columns are non-terminals; each entry gives the next state
• Algorithm: look at entry for current state S and input terminal C
  – If Table[S,C] = s(S') then shift: push(C), push(S')
  – If Table[S,C] = X → γ then reduce: pop(2*|γ|), S' = top(), push(X), push(Table[S',X])
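LR Parsing Table Example
(sN = shift and go to state N, gN = go to state N, X → γ = reduce by that production)
  State   (        )        id       ,        $        S     L
  1       s3                s2                         g4
  2       S→id     S→id     S→id     S→id     S→id
  3       s3                s2                         g7    g5
  4                                           accept
  5                s6                s8
  6       S→(L)    S→(L)    S→(L)    S→(L)    S→(L)
  7       L→S      L→S      L→S      L→S      L→S
  8       s3                s2                         g9
  9       L→L,S    L→L,S    L→L,S    L→L,S    L→L,S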
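The table above can drive a small LR engine. The sketch below is written under a few assumptions that are not in the slides: the productions are numbered 1: S → (L), 2: S → id, 3: L → S, 4: L → L,S so that a reduce entry can be written `rN`; the stack holds states only, so a reduce pops |γ| entries instead of the 2*|γ| needed with alternating symbols and states; the class name `LrTableDemo` is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical demo class: LR engine driven by the example table.
class LrTableDemo {
    // Column order: five terminals (action table), then two non-terminals (goto table).
    static final List<String> COLS = Arrays.asList("(", ")", "id", ",", "$", "S", "L");

    // TABLE[state][column]; row 0 is unused because states are numbered from 1.
    static final String[][] TABLE = {
        {},
        {"s3", null, "s2", null, null,  "g4", null},
        {"r2", "r2", "r2", "r2", "r2",  null, null},
        {"s3", null, "s2", null, null,  "g7", "g5"},
        {null, null, null, null, "acc", null, null},
        {null, "s6", null, "s8", null,  null, null},
        {"r1", "r1", "r1", "r1", "r1",  null, null},
        {"r3", "r3", "r3", "r3", "r3",  null, null},
        {"s3", null, "s2", null, null,  "g9", null},
        {"r4", "r4", "r4", "r4", "r4",  null, null},
    };

    // Assumed production numbering: 1: S -> ( L )  2: S -> id  3: L -> S  4: L -> L , S
    static final String[] LHS = {null, "S", "S", "L", "L"};
    static final int[] RHS_LENGTH = {0, 3, 1, 1, 3};

    static boolean parse(String... tokens) {
        Deque<Integer> states = new ArrayDeque<>();
        states.push(1);                                     // start state
        int pos = 0;
        while (true) {
            boolean atEnd = pos >= tokens.length;
            int col = COLS.indexOf(atEnd ? "$" : tokens[pos]);
            if (col < 0 || (!atEnd && col >= 4)) return false;  // not a terminal
            String entry = TABLE[states.peek()][col];
            if (entry == null) return false;                // empty entry: syntax error
            if (entry.equals("acc")) return true;
            int n = Integer.parseInt(entry.substring(1));
            if (entry.charAt(0) == 's') {                   // shift: consume token, enter state n
                states.push(n);
                pos++;
            } else {                                        // reduce by production n
                for (int i = 0; i < RHS_LENGTH[n]; i++) states.pop();
                String go = TABLE[states.peek()][COLS.indexOf(LHS[n])];
                states.push(Integer.parseInt(go.substring(1)));
            }
        }
    }
}
```

For example, `LrTableDemo.parse("(", "id", ",", "id", ")")` accepts, while `LrTableDemo.parse("(", "id", ")", ")")` stops at the empty entry for state 4 on `)` and is rejected.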
LR(k) Grammars
• LR(k) = Left-to-right scanning, Right-most derivation, k look-ahead symbols
• Main cases: LR(0), LR(1), and some variations (SLR and LALR(1))
• Parsers for LR(0) Grammars:
  – Determine the actions without any lookahead symbol
Building LR(0) Parsing Tables
• To build the parsing table:
  – Define states of the parser
  – Build a DFA to describe the transitions between states
  – Use the DFA to build the parsing table
Summary for bottom-up parsing
• LR(k) grammars
  – left-to-right scanning
  – rightmost derivation
  – can determine whether to shift or reduce from the next k symbols
  – Can automatically build parsing tables
• Shift-reduce parsers
  – Can be built for LR(k) grammars using automated parser generator tools, e.g. CUP, yacc.
Top-down vs. Bottom-up again
[Figure: two parse-tree outlines, labelled Top-down and Bottom-up, each divided into scanned and unscanned input]
Top-down: LL(k), recursive descent
Bottom-up: LR(k), shift-reduce