CS 381 - Summer 2005 Top-down and Bottom-up Parsing - a whirlwind tour


June 20, 2005

Slide acknowledgment: Radu Rugina, CS 412

Simplified Compiler Structure

Source code: if (b == 0) a = b;
  ↓ Front end (machine-independent): understand source code
Intermediate code
  ↓ Optimizer: optimize
Intermediate code
  ↓ Back end (machine-dependent): generate assembly code
Assembly code:
  cmp $0,ecx
  cmovz edx,ecx

Simplified Front-End Structure

Source code (character stream): if (b == 0) a = b;
  ↓ Lexical Analysis
Token stream: if ( b == 0 ) a = b ;
  ↓ Syntax Analysis (Parsing)
Abstract Syntax Tree (AST): an if node whose children are == (over b and 0) and = (over a and b)
  ↓ Semantic Analysis
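The lexical analysis step can be illustrated with a minimal sketch. The class name LexerSketch and the method tokenize are hypothetical names chosen for this example, and the token rules (letters and digits, the two-character operator ==, single-character punctuation) are an assumption that covers only the sample statement, not a full language.

```java
import java.util.ArrayList;
import java.util.List;

class LexerSketch {
    // Splits a character stream into tokens: identifiers/keywords, integers,
    // the two-character operator "==", and single-character punctuation.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace only separates tokens
            } else if (Character.isLetter(c)) {
                int j = i;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                tokens.add(src.substring(i, j));       // identifier or keyword
                i = j;
            } else if (Character.isDigit(c)) {
                int j = i;
                while (j < src.length() && Character.isDigit(src.charAt(j))) j++;
                tokens.add(src.substring(i, j));       // integer literal
                i = j;
            } else if (c == '=' && i + 1 < src.length() && src.charAt(i + 1) == '=') {
                tokens.add("==");                      // two-character operator
                i += 2;
            } else {
                tokens.add(String.valueOf(c));         // any other single character
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [if, (, b, ==, 0, ), a, =, b, ;]
        System.out.println(tokenize("if (b == 0) a = b;"));
    }
}
```

The parser then works on this token list and never looks at individual characters again.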

Parse Tree vs. AST

• The parse tree is also called the “concrete syntax” tree
• The abstract syntax tree (AST) discards (abstracts) unneeded information

[Figure: a parse tree (concrete syntax), with nodes S, E, +, ( and ), next to the abstract syntax tree, with only + nodes and the leaves 1 to 5]

How to build an AST

• Need to find a derivation for the program in the grammar
• Want an efficient algorithm
– should only read token stream once
– exponential brute-force search out of question
– even CKY is too slow
• Two main ways to parse:
– top-down parsing (recursive descent)
– bottom-up parsing (shift-reduce)

Parsing Top-down

Goal: construct a leftmost derivation of the string while reading in the token stream

Grammar:
S → E + S | E
E → num | ( S )

Input: (1+2+(3+4))+5 (the slide marks the parsed and unparsed parts of the input at each step)

Partly-derived String    Lookahead
S                        (
E+S                      (
(S)+S                    1
(E+S)+S                  1
(1+S)+S                  2
(1+E+S)+S                2
(1+2+S)+S                2
(1+2+E)+S                (
(1+2+(S))+S              3
(1+2+(E+S))+S            3

Problem

Grammar:
S → E + S | E
E → num | ( S )

• Want to decide which production to apply based on next symbol

(1):    S → E → (S) → (E) → (1)
(1)+2:  S → E + S → (S) + S → (E) + S → (1)+E → (1)+2

• Why is this hard?

Grammar is Problem

• This grammar cannot be parsed top-down with only a single look-ahead symbol
• Not LL(1) = Left-to-right scanning, Left-most derivation, 1 look-ahead symbol
• Is it LL(k) for some k?
• Can rewrite grammar to allow top-down parsing: create LL(1) grammar for same language

Making a grammar LL(1)

Original grammar:
S → E + S
S → E
E → num
E → ( S )

Left-factored grammar:
S → E S'
S' → ε
S' → + S
E → num
E → ( S )

• Problem: can't decide which S production to apply until we see the symbol after the first expression
• Left-factoring: factor out the common prefix E of the S productions and add a new non-terminal S' at the decision point. S' derives (+E)*

Parsing with new grammar

Grammar:
S → E S'
S' → ε | + S
E → num | ( S )

Input: (1+2+(3+4))+5

Partly-derived String         Lookahead
S                             (
E S'                          (
(S) S'                        1
(E S') S'                     1
(1 S') S'                     +
(1+E S') S'                   2
(1+2 S') S'                   +
(1+2 + S) S'                  (
(1+2 + E S') S'               (
(1+2 + (S) S') S'             3
(1+2 + (E S') S') S'          3
(1+2 + (3 S') S') S'          +
(1+2 + (3 + E) S') S'         4

Predictive Parsing

• LL(1) grammar:
– for a given non-terminal, the look-ahead symbol uniquely determines the production to apply
– top-down parsing = predictive parsing
– Driven by predictive parsing table of non-terminals × terminals → productions

Using Table

Grammar:
S → E S'
S' → ε | + S
E → num | ( S )

        num        +        (          )      $
S       → E S'              → E S'
S'                 → + S               → ε    → ε
E       → num               → ( S )

Input: (1+2+(3+4))+5

Partly-derived String    Lookahead
S                        (
E S'                     (
(S) S'                   1
(E S') S'                1
(1 S') S'                +
(1 + S) S'               2
(1+E S') S'              2
(1+2 S') S'              +
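A table like this one can also drive a parser directly, with an explicit stack in place of recursion. The following is a minimal sketch under stated assumptions: the class name TableDrivenLL1, the helper tokenize, and the encoding of ε as an empty list are all choices made for this example, not part of the slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

class TableDrivenLL1 {
    // Predictive parsing table: non-terminal -> (look-ahead terminal -> right-hand side).
    // An empty right-hand side stands for the epsilon production.
    static final Map<String, Map<String, List<String>>> TABLE = Map.of(
        "S",  Map.of("num", List.of("E", "S'"), "(", List.of("E", "S'")),
        "S'", Map.of("+", List.of("+", "S"), ")", List.<String>of(), "$", List.<String>of()),
        "E",  Map.of("num", List.of("num"), "(", List.of("(", "S", ")")));

    // Hypothetical helper: turns "(1+2)" into [ "(", "num", "+", "num", ")", "$" ].
    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isDigit(c)) {
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                out.add("num");
            } else {
                if (!Character.isWhitespace(c)) out.add(String.valueOf(c));
                i++;
            }
        }
        out.add("$");                                   // end-of-input marker
        return out;
    }

    // Returns true if the input can be derived from S.
    static boolean parse(String src) {
        List<String> input = tokenize(src);
        int pos = 0;
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("S");
        while (!stack.isEmpty()) {
            String top = stack.pop();
            String lookahead = input.get(pos);
            if (TABLE.containsKey(top)) {               // non-terminal: consult the table
                List<String> rhs = TABLE.get(top).get(lookahead);
                if (rhs == null) return false;          // empty table entry: syntax error
                for (int k = rhs.size() - 1; k >= 0; k--) stack.push(rhs.get(k));
            } else {                                    // terminal: must match the input
                if (!top.equals(lookahead)) return false;
                pos++;
            }
        }
        return pos == input.size();
    }

    public static void main(String[] args) {
        System.out.println(parse("(1+2+(3+4))+5"));     // prints true
        System.out.println(parse("(1+2"));              // prints false
    }
}
```

The right-hand side is pushed in reverse so that its leftmost symbol ends up on top of the stack, which is what makes the derivation leftmost.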

How to Implement?

• Table can be converted easily into a recursive-descent parser

        num        +        (          )      $
S       → E S'              → E S'
S'                 → + S               → ε    → ε
E       → num               → ( S )

• Three procedures: parse_S, parse_S', parse_E

Recursive-Descent Parser

    // token holds the lookahead token
    void parse_S() {
        switch (token) {
            case number:                    // S → E S'
                parse_E(); parse_S'(); return;
            case '(':                       // S → E S'
                parse_E(); parse_S'(); return;
            default:
                throw new ParseError();
        }
    }

        number     +        (          )      $
S       → E S'              → E S'
S'                 → + S               → ε    → ε
E       → number            → ( S )

Recursive-Descent Parser

    void parse_S'() {
        switch (token) {
            case '+':                       // S' → + S
                token = input.read(); parse_S(); return;
            case ')': return;               // S' → ε
            case EOF: return;               // S' → ε
            default: throw new ParseError();
        }
    }

        number     +        (          )      $
S       → E S'              → E S'
S'                 → + S               → ε    → ε
E       → number            → ( S )

Recursive-Descent Parser

    void parse_E() {
        switch (token) {
            case number:                    // E → number
                token = input.read(); return;
            case '(':                       // E → ( S )
                token = input.read();
                parse_S();
                if (token != ')') throw new ParseError();
                token = input.read(); return;
            default: throw new ParseError();
        }
    }

        number     +        (          )      $
S       → E S'              → E S'
S'                 → + S               → ε    → ε
E       → number            → ( S )

Call Tree = Parse Tree

Input: (1 + 2 + (3 + 4)) + 5

[Figure: the parse tree for the input, with nodes S, E, and S', next to the tree of calls to parse_S, parse_E, and parse_S' made while parsing it]

        N          +        (          )      $
S       → E S'              → E S'
S'                 → + S               → ε    → ε
E       → N                 → ( S )

How to Construct Parsing Tables

• There exists an algorithm for automatically generating a predictive parse table from a grammar (take 412 for details)

S → E S'
S' → ε | + S
E → number | ( S )

Summary for top-down parsing

• LL(k) grammars
– left-to-right scanning
– leftmost derivation
– can determine what production to apply from the next k symbols
– Can automatically build predictive parsing tables
• Predictive parsers
– Can be easily built for LL(k) grammars from the parsing tables
– Also called recursive-descent, or top-down parsers

Top-Down Parsing Summary

Language grammar
  ↓ Left-recursion elimination, Left-factoring
LL(1) grammar
  ↓
predictive parsing table
  ↓
recursive-descent parser
  ↓
parser with AST generation
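The last step, a recursive-descent parser that also generates an AST, might look like the sketch below. It is written under assumptions: the names AstParser, Expr, Num, and Add are hypothetical, input is read character by character instead of from a token stream, and S' is folded into parseS. Parentheses leave no node in the tree, which is the sense in which the AST discards unneeded information.

```java
class AstParser {
    // AST node types: a number leaf, or an addition with two children.
    interface Expr { int eval(); }

    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
        public int eval() { return value; }
        public String toString() { return Integer.toString(value); }
    }

    static final class Add implements Expr {
        final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        public int eval() { return left.eval() + right.eval(); }
        public String toString() { return "(" + left + " + " + right + ")"; }
    }

    private final String src;
    private int pos = 0;

    AstParser(String src) { this.src = src.replace(" ", ""); }

    // Current look-ahead character, or '$' at end of input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '$'; }

    // S -> E S'   with   S' -> epsilon | + S   folded into this method.
    Expr parseS() {
        Expr left = parseE();
        if (peek() == '+') {                 // S' -> + S
            pos++;
            return new Add(left, parseS());
        }
        return left;                         // S' -> epsilon
    }

    // E -> num | ( S )
    Expr parseE() {
        if (Character.isDigit(peek())) {
            int start = pos;
            while (Character.isDigit(peek())) pos++;
            return new Num(Integer.parseInt(src.substring(start, pos)));
        }
        if (peek() == '(') {
            pos++;
            Expr inner = parseS();           // parentheses leave no node in the AST
            if (peek() != ')') throw new IllegalArgumentException("expected ) at " + pos);
            pos++;
            return inner;
        }
        throw new IllegalArgumentException("unexpected input at " + pos);
    }

    // Parses a whole string and rejects trailing input.
    static Expr parse(String text) {
        AstParser p = new AstParser(text);
        Expr e = p.parseS();
        if (p.pos != p.src.length()) throw new IllegalArgumentException("trailing input at " + p.pos);
        return e;
    }

    public static void main(String[] args) {
        Expr e = parse("(1+2+(3+4))+5");
        System.out.println(e);               // prints ((1 + (2 + (3 + 4))) + 5)
        System.out.println(e.eval());        // prints 15
    }
}
```

Because the LL(1) grammar is right-recursive, the additions nest to the right; that is harmless for + but would matter for a non-associative operator such as subtraction.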

Now: Bottom-up Parsing

• A more powerful parsing technology
• LR grammars: more expressive than LL
– construct right-most derivation of program
– virtually all programming languages
– easier to express programming language syntax
• Shift-reduce parsers
– Parsers for LR grammars
– automatic parser generators (e.g. yacc, CUP)

Bottom-up Parsing

• Right-most derivation, backward
– Start with the tokens
– End with the start symbol

Grammar:
S → S + E | E
E → num | ( S )

(1+2+(3+4))+5 ← (E+2+(3+4))+5
← (S+2+(3+4))+5 ← (S+E+(3+4))+5
← (S+(3+4))+5 ← (S+(E+4))+5 ← (S+(S+4))+5
← (S+(S+E))+5 ← (S+(S))+5 ← (S+E)+5 ← (S)+5
← E+5 ← S+E ← S

Progress of Bottom-up Parsing

The left column, read from bottom to top, is the right-most derivation. In the right column, the gap in the input shows how far it has been read.

Sentential form      Input
(1+2+(3+4))+5        (1+2+(3+4))+5
(E+2+(3+4))+5        (1 +2+(3+4))+5
(S+2+(3+4))+5        (1 +2+(3+4))+5
(S+E+(3+4))+5        (1+2 +(3+4))+5
(S+(3+4))+5          (1+2+(3 +4))+5
(S+(E+4))+5          (1+2+(3 +4))+5
(S+(S+4))+5          (1+2+(3 +4))+5
(S+(S+E))+5          (1+2+(3+4 ))+5
(S+(S))+5            (1+2+(3+4 ))+5
(S+E)+5              (1+2+(3+4) )+5
(S)+5                (1+2+(3+4) )+5
E+5                  (1+2+(3+4)) +5
S+E                  (1+2+(3+4))+5
S                    (1+2+(3+4))+5

Bottom-up Parsing

• (1+2+(3+4))+5 ← (E+2+(3+4))+5 ← (S+2+(3+4))+5 ← (S+E+(3+4))+5 …
• Advantage of bottom-up parsing: can postpone the selection of productions until more of the input is scanned

Grammar:
S → S + E | E
E → num | ( S )

[Figure: parse tree built from the productions S → S + E, S → E, E → ( S ), with leaves 1, 2, 3, 4, 5]

Top-down Parsing

Input: (1+2+(3+4))+5

Grammar:
S → S + E | E
E → num | ( S )

S → S+E → E+E → (S)+E → (S+E)+E → (S+E+E)+E → (E+E+E)+E → (1+E+E)+E → (1+2+E)+E ...

• In a left-most derivation, the entire tree above a token (2) has been expanded when the token is encountered

[Figure: parse tree built from the productions S → S + E, S → E, E → ( S ), with leaves 1, 2, 3, 4, 5]

Top-down vs. Bottom-up

[Figure: two parse trees, labeled Top-down and Bottom-up, each drawn over an input divided into a scanned and an unscanned part]

• Bottom-up: don't need to figure out as much of the parse tree for a given amount of input

Shift-reduce Parsing

• Parsing actions: a sequence of shift and reduce operations
• Parser state: a stack of terminals and non-terminals (grows to the right)
• Current derivation step = always stack + input

Derivation step      stack     unconsumed input
(1+2+(3+4))+5        (1        +2+(3+4))+5
(E+2+(3+4))+5        (E        +2+(3+4))+5
(S+2+(3+4))+5        (S        +2+(3+4))+5
(S+E+(3+4))+5        (S+E      +(3+4))+5


• Parsing is a sequence of shifts and reduces
• Shift: move look-ahead token to stack

stack     input             action
(         1+2+(3+4))+5      shift 1
(1        +2+(3+4))+5

• Reduce: replace symbols γ from top of stack with non-terminal symbol X, corresponding to production X → γ (pop γ, push X)

stack     input             action
(S+E      +(3+4))+5         reduce S → S+E
(S        +(3+4))+5


Grammar:
S → S + E | E
E → num | ( S )

derivation           stack     input stream       action
(1+2+(3+4))+5                  (1+2+(3+4))+5      shift
(1+2+(3+4))+5        (         1+2+(3+4))+5       shift
(1+2+(3+4))+5        (1        +2+(3+4))+5        reduce E → num
(E+2+(3+4))+5        (E        +2+(3+4))+5        reduce S → E
(S+2+(3+4))+5        (S        +2+(3+4))+5        shift
(S+2+(3+4))+5        (S+       2+(3+4))+5         shift
(S+2+(3+4))+5        (S+2      +(3+4))+5          reduce E → num
(S+E+(3+4))+5        (S+E      +(3+4))+5          reduce S → S+E
(S+(3+4))+5          (S        +(3+4))+5          shift
(S+(3+4))+5          (S+       (3+4))+5           shift
(S+(3+4))+5          (S+(      3+4))+5            shift
(S+(3+4))+5          (S+(3     +4))+5             reduce E → num
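The shift and reduce steps can be reproduced with a small hand-written recognizer. This is only a sketch for this particular grammar, not a general LR parser: it assumes that reducing whenever the top of the stack matches a right-hand side is safe, as long as S → S + E is tried before S → E. The class name ShiftReduceSketch and the action strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ShiftReduceSketch {
    // Returns the actions taken, ending in "accept"; throws on a syntax error.
    static List<String> parse(String text) {
        String input = text.replace(" ", "");
        List<String> stack = new ArrayList<>();      // grows to the right
        List<String> actions = new ArrayList<>();
        int pos = 0;
        while (true) {
            int n = stack.size();
            String top = n > 0 ? stack.get(n - 1) : "";
            if (top.equals("num")) {                              // E -> num
                stack.set(n - 1, "E");
                actions.add("reduce E -> num");
            } else if (n >= 3 && top.equals(")") && stack.get(n - 2).equals("S")
                       && stack.get(n - 3).equals("(")) {         // E -> ( S )
                stack.subList(n - 3, n).clear();
                stack.add("E");
                actions.add("reduce E -> ( S )");
            } else if (n >= 3 && top.equals("E") && stack.get(n - 2).equals("+")
                       && stack.get(n - 3).equals("S")) {         // S -> S + E
                stack.subList(n - 3, n).clear();
                stack.add("S");
                actions.add("reduce S -> S + E");
            } else if (top.equals("E")) {                         // S -> E
                stack.set(n - 1, "S");
                actions.add("reduce S -> E");
            } else if (pos < input.length()) {                    // nothing to reduce: shift
                char c = input.charAt(pos);
                if (Character.isDigit(c)) {
                    while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
                    stack.add("num");
                } else {
                    stack.add(String.valueOf(c));
                    pos++;
                }
                actions.add("shift");
            } else if (stack.equals(List.of("S"))) {              // input consumed, only S left
                actions.add("accept");
                return actions;
            } else {
                throw new IllegalArgumentException("syntax error, stack = " + stack);
            }
        }
    }

    public static void main(String[] args) {
        for (String action : parse("1+2")) System.out.println(action);
    }
}
```

The order of the two E cases matters: with S + E on the stack, reducing E to S is possible but wrong, which is an instance of "can reduce but shouldn't".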

Problem

• How do we know which action to take: whether to shift or reduce, and which production?
• Issues:
– Sometimes can reduce but shouldn't
– Sometimes can reduce in different ways

Action Selection Problem

• Given stack σ and look-ahead symbol b, should parser:
– shift b onto the stack (making it σb)
– reduce X → γ assuming that stack has the form αγ (making it αX)
• If stack has form αγ, should apply reduction X → γ (or shift) depending on stack prefix α
– α is different for different possible reductions, since γ's have different length.

LR Parsing Engine

• Basic mechanism:
– Use a set of parser states
– Use a stack with alternating symbols and states
  • E.g.: 1 ( 6 S 10 + 5
– Use a parsing table to:
  • Determine what action to apply (shift/reduce)
  • Determine the next state
• The parser actions can be precisely determined from the table

The LR Parsing Table

[Table layout: one row per state. Action table: one column per terminal, giving the next action and next state. Goto table: one column per non-terminal, giving the next state.]

• Algorithm: look at entry for current state S and input terminal C
– If Table[S,C] = s(S') then shift: push(C), push(S')
– If Table[S,C] = X → γ then reduce: pop(2*|γ|), S' = top(), push(X), push(Table[S',X])

LR Parsing Table Example

     (         )         id        ,         $         S     L
1    s3                  s2                            g4
2    S→id      S→id      S→id      S→id      S→id
3    s3                  s2                            g7    g5
4                                            accept
5              s6                  s8
6    S→(L)     S→(L)     S→(L)     S→(L)     S→(L)
7    L→S       L→S       L→S       L→S       L→S
8    s3                  s2                            g9
9    L→L,S     L→L,S     L→L,S     L→L,S     L→L,S
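The example table can be run by a small engine, sketched below under stated assumptions. The grammar S → ( L ) | id, L → S | L , S is read off the reductions in the table. The class name LrEngineSketch, the production numbering, and the split into ACTION, REDUCE, and GOTO maps are choices made for this example. The stack holds states only, since each state already determines the symbol pushed with it, so a reduction pops |γ| entries instead of 2*|γ|.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

class LrEngineSketch {
    // Productions, numbered for this sketch: 0: S->(L)  1: S->id  2: L->S  3: L->L,S
    static final String[] LHS = {"S", "S", "L", "L"};
    static final int[] RHS_LEN = {3, 1, 1, 3};

    // Action table for shift states: state -> (terminal -> "sN" or "accept").
    static final Map<Integer, Map<String, String>> ACTION = Map.of(
        1, Map.of("(", "s3", "id", "s2"),
        3, Map.of("(", "s3", "id", "s2"),
        4, Map.of("$", "accept"),
        5, Map.of(")", "s6", ",", "s8"),
        8, Map.of("(", "s3", "id", "s2"));

    // States whose whole row is one reduction: state -> production number.
    static final Map<Integer, Integer> REDUCE = Map.of(2, 1, 6, 0, 7, 2, 9, 3);

    // Goto table: state -> (non-terminal -> next state).
    static final Map<Integer, Map<String, Integer>> GOTO = Map.of(
        1, Map.of("S", 4),
        3, Map.of("S", 7, "L", 5),
        8, Map.of("S", 9));

    // tokens: space-separated terminals, e.g. "( id , id )"; "$" is appended here.
    static boolean parse(String tokens) {
        String[] input = (tokens.trim() + " $").trim().split("\\s+");
        Deque<Integer> states = new ArrayDeque<>();
        states.push(1);                                       // start state
        int pos = 0;
        while (true) {
            int state = states.peek();
            if (REDUCE.containsKey(state)) {                  // reduce X -> gamma
                int prod = REDUCE.get(state);
                for (int k = 0; k < RHS_LEN[prod]; k++) states.pop();
                Integer next = GOTO.getOrDefault(states.peek(), Map.of()).get(LHS[prod]);
                if (next == null) return false;
                states.push(next);                            // goto on X
                continue;
            }
            String action = ACTION.getOrDefault(state, Map.of()).get(input[pos]);
            if (action == null) return false;                 // empty entry: syntax error
            if (action.equals("accept")) return pos == input.length - 1;
            states.push(Integer.parseInt(action.substring(1)));   // shift to state N
            pos++;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("( id , ( id , id ) )"));    // prints true
        System.out.println(parse("( id ,"));                  // prints false
    }
}
```

Reduce rows ignore the look-ahead entirely, which is what makes this table LR(0).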

LR(k) Grammars

• LR(k) = Left-to-right scanning, Right-most derivation, k look-ahead symbols
• Main cases: LR(0), LR(1), and some variations (SLR and LALR(1))
• Parsers for LR(0) grammars:
– Determine the actions without any look-ahead symbol

Building LR(0) Parsing Tables

• To build the parsing table:
– Define states of the parser
– Build a DFA to describe the transitions between states
– Use the DFA to build the parsing table

Summary for bottom-up parsing

• LR(k) grammars
– left-to-right scanning
– rightmost derivation
– can determine whether to shift or reduce from the next k symbols
– Can automatically build parsing tables
• Shift-reduce parsers
– Can be built for LR(k) grammars using automated parser generator tools, e.g. CUP, yacc.

Top-down vs. Bottom-up again

[Figure: two parse trees, each drawn over an input divided into a scanned and an unscanned part]

• Top-down: LL(k), recursive descent
• Bottom-up: LR(k), shift-reduce
