Parsing G22.2110 Programming Languages May 24, 2012 New York University Chanseok Oh ([email protected])


• Chapter 2
  – Scanning
  – Parsing

• Overview

– Scanner (also called tokenizer, lexer, or lexical analyzer)
    Input:  IF ( A >= .30 ) THEN { …
    Tokens: IF, LPARAN, IDENT(A), GTE, FPN(.30), RPARAN, THEN, …
  • Tokens, lexemes
  • DFA, NFA, regular expressions
  • lex, flex, JLex

– Parser
  • DPDA, deterministic context-free grammars
  • Yacc, Bison
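As a rough illustration of the scanner's job, the sketch below turns the input fragment from this slide into its token stream with Python's `re` module. The token names are the ones on the slide; the regular expressions and the `scan` function are assumptions made for this example, not part of the course material.

```python
import re

# Token names follow the slide (including its LPARAN/RPARAN spelling).
# The patterns are assumptions chosen to cover the slide's one example.
TOKEN_SPEC = [
    ("FPN",    r"\d*\.\d+"),       # floating-point number such as .30
    ("IF",     r"IF\b"),
    ("THEN",   r"THEN\b"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("GTE",    r">="),
    ("LPARAN", r"\("),
    ("RPARAN", r"\)"),
    ("SKIP",   r"\s+"),            # whitespace separates lexemes
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Return the list of tokens for text, or raise SyntaxError."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        kind = m.lastgroup              # name of the alternative that matched
        if kind in ("IDENT", "FPN"):
            tokens.append(f"{kind}({m.group()})")   # keep the lexeme
        elif kind != "SKIP":
            tokens.append(kind)
        pos = m.end()
    return tokens

print(scan("IF ( A >= .30 ) THEN"))
# ['IF', 'LPARAN', 'IDENT(A)', 'GTE', 'FPN(.30)', 'RPARAN', 'THEN']
```

Each alternative in the combined pattern is a regular expression, which is why a scanner can be implemented as a DFA; tools such as lex and flex generate that automaton from a similar list of patterns.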

• Table of Contents

– Practical parsers (linear time)
  • LL (top-down, predictive)
  • LR (bottom-up, shift-reduce)

– Related side topics
  • Ambiguity, language and parser hierarchy

– Examples: Simple Calculator Language

• A Language

– A set of strings (over a given set of symbols)
  • { finite, set, with, five, strings }
  • { ab, aaba, abbaba, … }
  • { 0^n 1^n }
  • { a^i b^j | i < j }
  • { void main() { int i = 0 }, … }

– Is an input string in the language?
  • cf. recursive (Turing-decidable) languages
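The membership question can be made concrete for two of the example languages. The sketch below is only an illustration: the function names are invented here, and it assumes n ≥ 0 for the language of zeros followed by equally many ones.

```python
def in_0n1n(s):
    """Membership in { 0^n 1^n }, assuming n >= 0."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * n + "1" * n

def in_aibj(s):
    """Membership in { a^i b^j | i < j }."""
    i = len(s) - len(s.lstrip("a"))   # number of leading a's
    j = len(s) - i                    # everything after them must be b's
    return s[i:] == "b" * j and i < j

print(in_0n1n("0011"), in_0n1n("001"))   # True False
print(in_aibj("abb"), in_aibj("ab"))     # True False
```

Both languages require counting, so no DFA or regular expression recognizes them; they are context-free, which is the class the rest of the deck deals with.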

• Context-Free Languages (CFL)

– Languages that can be generated by CFGs
– Languages that can be recognized by PDAs
– Not all languages are context-free.
– CFGs are suitable for most programming languages.
  • <sentence> := <subject> <verb> <object> PERIOD
– Deterministic CFLs

• Example

Here is our CFG:

S := id A
A := , id A
A := ;

Input: sum , a1 , ptr ;

• Parse Tree

S
├─ id (sum)
└─ A
   ├─ ,
   ├─ id (a1)
   └─ A
      ├─ ,
      ├─ id (ptr)
      └─ A
         └─ ;

S := id A
A := , id A
A := ;

• Ambiguous Grammars

– Is a given grammar ambiguous? Undecidable in general.
– There is no general procedure for converting an ambiguous grammar to an unambiguous one.
– Some ambiguity can be tolerated in deterministic parsing, e.g., by defining precedence or associativity.

E → E + E | E - E | E * E | E / E
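To see the ambiguity of this expression grammar concretely, the sketch below counts the distinct parse trees of a token string. It assumes the grammar is completed with a leaf production E → id (not shown on this slide), and the function name is invented for the example.

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}

def count_parse_trees(tokens):
    """Count parse trees of tokens under E -> E op E | id (the id rule is an assumption)."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        for k in range(lo + 1, hi - 1):       # try each operator as the root
            if tokens[k] in OPERATORS:
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees("id + id * id".split()))        # 2
print(count_parse_trees("id + id + id + id".split()))   # 5
```

A count above 1 means the string has more than one parse tree, so the grammar is ambiguous. Precedence and associativity declarations are a way of telling the parser which of those trees to build.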

• Parsers

– LL (Left-to-right scan, Left-most derivation)
  • Top-down
  • Predictive
  • Simple and easy to understand

– LR (Left-to-right scan, Right-most derivation)
  • Bottom-up
  • Shift-reduce
  • Most common in production-level parsers
  • Variants: SLR (Simple), LALR (Look-ahead)

• LL(k) Parser

– LL(k) parser
  • Uses k look-ahead symbols.
  • Does not backtrack (deterministic).

– LL(1) is the most popular kind of LL parser.

– LL(k) languages
  • Not all CFLs are LL(k) languages.

[Diagram: the LL(k) languages drawn as a region inside the CFLs]

• LL Parsing Example

<id_list> := id <id_list_tail>
<id_list_tail> := , id <id_list_tail>
<id_list_tail> := ;

It is an LL grammar. The language is also LL.

Input to parse: sum , a1 , ptr ;

[Diagram: the language marked as a point inside the LL region of the CFLs]

• Parse Tree

<id_list>
├─ id (sum)
└─ <id_list_tail>
   ├─ ,
   ├─ id (a1)
   └─ <id_list_tail>
      ├─ ,
      ├─ id (ptr)
      └─ <id_list_tail>
         └─ ;
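A predictive (recursive-descent) parser for the id_list grammar can be sketched with one function per nonterminal. This is a minimal illustration under stated assumptions: tokens are already split on whitespace, anything other than "," and ";" counts as an id, and the nested-tuple tree format is invented for the example.

```python
def kind(tok):
    """Classify a token: ',' and ';' stand for themselves, anything else is an id."""
    return tok if tok in {",", ";"} else "id"

def parse_id_list(tokens):
    """Top-down parse of a token list; returns a parse tree as nested tuples."""
    pos = 0

    def match(expected):
        nonlocal pos
        if pos >= len(tokens) or kind(tokens[pos]) != expected:
            found = tokens[pos] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {expected}, found {found}")
        pos += 1
        return tokens[pos - 1]

    def id_list():
        # <id_list> := id <id_list_tail>
        return ("id_list", match("id"), id_list_tail())

    def id_list_tail():
        # One token of look-ahead decides which production to expand.
        if pos < len(tokens) and tokens[pos] == ",":
            return ("id_list_tail", match(","), match("id"), id_list_tail())
        return ("id_list_tail", match(";"))

    tree = id_list()
    if pos != len(tokens):
        raise SyntaxError("unexpected input after ';'")
    return tree

print(parse_id_list("sum , a1 , ptr ;".split()))
```

The tree is built from the root downward and the input is consumed left to right without backtracking, which is the behavior the LL slides describe.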


• LR Parser

– LR(k) parser
  • Uses k look-ahead symbols.
  • Usually k is 1, and the term "LR parser" often refers to this case.

– LR(k) languages
  • Not all CFLs are LR(k) languages.

[Diagram: the LR languages drawn as a region inside the CFLs]

• Language Relationships

[Diagram: the classes LR(0), SLR, LALR, LR(1), LL(0), and LL(1), placed relative to the unambiguous and ambiguous languages]

• LR Parsing Example

With the same grammar,

id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;

It is also an LR grammar, and the language is LR.

Input to parse (as before): sum , a1 , ptr ;

[Diagram: the language marked as a point inside both the LR(1) and LL regions of the CFLs]

• Parse Tree

<id_list>
├─ id (sum)
└─ <id_list_tail>
   ├─ ,
   ├─ id (a1)
   └─ <id_list_tail>
      ├─ ,
      ├─ id (ptr)
      └─ <id_list_tail>
         └─ ;


• Another LR Parsing Example

Consider a modified grammar:

<id_list> := <id_list_prefix> ;
<id_list_prefix> := <id_list_prefix> , id
<id_list_prefix> := id

The grammar is not LL (though the language itself is both LR and LL).

• LR Parsing

<id_list>
├─ <id_list_prefix>
│  ├─ <id_list_prefix>
│  │  ├─ <id_list_prefix>
│  │  │  └─ id (sum)
│  │  ├─ ,
│  │  └─ id (a1)
│  ├─ ,
│  └─ id (ptr)
└─ ;
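The bottom-up behavior on the left-recursive id_list_prefix grammar can be imitated with a stack. The sketch below is a hand-written simplification, not a table-driven LR parser: it checks the top of the stack against the three right-hand sides directly, and the function name and token convention (anything other than "," and ";" is an id) are assumptions made for the example.

```python
def shift_reduce(tokens):
    """Return the reductions performed, in order, for the id_list_prefix grammar."""
    stack, reductions = [], []
    for tok in tokens:
        # Shift: push the next input symbol.
        stack.append(tok if tok in {",", ";"} else "id")
        # Reduce for as long as the top of the stack is a right-hand side.
        while True:
            if stack[-3:] == ["id_list_prefix", ",", "id"]:
                stack[-3:] = ["id_list_prefix"]
                reductions.append("id_list_prefix := id_list_prefix , id")
            elif stack == ["id"]:
                stack[:] = ["id_list_prefix"]
                reductions.append("id_list_prefix := id")
            elif stack == ["id_list_prefix", ";"]:
                stack[:] = ["id_list"]
                reductions.append("id_list := id_list_prefix ;")
            else:
                break
    if stack != ["id_list"]:
        raise SyntaxError(f"could not reduce input; stack is {stack}")
    return reductions

for step in shift_reduce("sum , a1 , ptr ;".split()):
    print(step)
```

The reductions come out leaf-first: the innermost id_list_prefix (for sum) is recognized before the outer ones, and the root id_list is produced last. Left recursion lets each identifier be reduced as soon as it is read, so the stack never grows with the length of the list.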

• Simple Calculator Language

3 + ( 4 * 1 )

total := 7

read n

write ( 10 - ( total + 1 ) / 3 * n )

• Simple Arithmetic Expression

E → E + E | E - E | E * E | E / E
E → id | number | ( E )

• Simple Arithmetic Expression

– The language is LL, but this grammar is not an LL grammar (it is an LR grammar).

– The two most common obstacles to "LL(1)-ness":
  • Left recursion
  • Common prefixes

expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number | ( expr )
add_op → + | -
mult_op → * | /

Left recursion:
  stmt_list → stmt_list stmt

Common prefixes:
  stmt → id := expr
  stmt → id ( arg_list )

• Converting to LL Grammars

Left recursion replaced by right recursion:
  stmt_list → stmt_list stmt
becomes
  stmt_list → stmt stmt_list | ε

Common prefix factored out:
  stmt → id stmt_list_tail
  stmt_list_tail → := expr | ( arg_list )

– Alternatively, you can employ conflict-resolution rules.

• Converted LL(1) Grammar

expr → term term_tail
term_tail → add_op term term_tail | ε
term → factor factor_tail
factor_tail → mult_op factor factor_tail | ε
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /

[Diagram: the LL languages drawn as a region inside the CFLs]

Not every CFG can be converted to an LL grammar. Why?
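The rewrite that produces the tail nonterminals in this grammar is the standard removal of immediate left recursion: A → A α | β becomes A → β A_tail and A_tail → α A_tail | ε. The sketch below applies that textbook transformation to one nonterminal; the function name and the list-of-lists grammar encoding (an empty list standing for ε) are assumptions made for the example.

```python
def remove_left_recursion(nonterminal, productions, tail_name):
    """Rewrite A -> A alpha | beta as A -> beta A_tail, A_tail -> alpha A_tail | epsilon.

    productions is a list of right-hand sides, each a list of symbols.
    An empty list represents epsilon.
    """
    recursive = [rhs[1:] for rhs in productions if rhs[:1] == [nonterminal]]
    others = [rhs for rhs in productions if rhs[:1] != [nonterminal]]
    if not recursive:
        return {nonterminal: productions}     # nothing to do
    return {
        nonterminal: [beta + [tail_name] for beta in others],
        tail_name: [alpha + [tail_name] for alpha in recursive] + [[]],
    }

print(remove_left_recursion(
    "expr", [["term"], ["expr", "add_op", "term"]], "term_tail"))
# {'expr': [['term', 'term_tail']],
#  'term_tail': [['add_op', 'term', 'term_tail'], []]}
```

Only immediate left recursion is handled here. The transformation keeps the language the same but changes the shape of the parse tree, which is one reason the LL version of a grammar is usually less intuitive than the left-recursive one.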

• LL(1) for Simple Calculator Language

program → stmt_list $$
stmt_list → stmt stmt_list | ε
stmt → id := expr | read id | write expr
expr → term term_tail
term_tail → add_op term term_tail | ε
term → factor factor_tail
factor_tail → mult_op factor factor_tail | ε
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /

Three rules (for program, stmt_list, and stmt) are added to the previous LL(1) grammar for expressions.

• LL Parsing

– Input program

read A
read B
sum := A + B
write sum
write sum / 2

• Predict Sets

program     → stmt_list $$                   {id, read, write, $$}
stmt_list   → stmt stmt_list                 {id, read, write}
            | ε                              {$$}
stmt        → id := expr                     {id}
            | read id                        {read}
            | write expr                     {write}
expr        → term term_tail                 {(, id, number}
term_tail   → add_op term term_tail          {+, -}
            | ε                              {), id, read, write, $$}
term        → factor factor_tail             {(, id, number}
factor_tail → mult_op factor factor_tail     {*, /}
            | ε                              {+, -, ), id, read, write, $$}
factor      → ( expr )                       {(}
            | id                             {id}
            | number                         {number}
add_op      → +                              {+}
            | -                              {-}
mult_op     → *                              {*}
            | /                              {/}
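The predict sets in this table can be recomputed mechanically. The sketch below uses the usual fixed-point computation of FIRST and FOLLOW and then takes PREDICT(A → α) = FIRST(α), plus FOLLOW(A) when α can derive ε. The grammar encoding (tuples of symbols, an empty tuple for ε) and all function names are assumptions made for this example.

```python
EPS = ()   # an empty right-hand side stands for epsilon

GRAMMAR = {
    "program":     [("stmt_list", "$$")],
    "stmt_list":   [("stmt", "stmt_list"), EPS],
    "stmt":        [("id", ":=", "expr"), ("read", "id"), ("write", "expr")],
    "expr":        [("term", "term_tail")],
    "term_tail":   [("add_op", "term", "term_tail"), EPS],
    "term":        [("factor", "factor_tail")],
    "factor_tail": [("mult_op", "factor", "factor_tail"), EPS],
    "factor":      [("(", "expr", ")"), ("id",), ("number",)],
    "add_op":      [("+",), ("-",)],
    "mult_op":     [("*",), ("/",)],
}

def first_of(symbols, first, nullable):
    """Return (FIRST of the symbol string, whether the string can derive epsilon)."""
    result = set()
    for sym in symbols:
        if sym not in GRAMMAR:              # terminal
            result.add(sym)
            return result, False
        result |= first[sym]
        if sym not in nullable:
            return result, False
    return result, True

def predict_sets():
    first = {nt: set() for nt in GRAMMAR}
    follow = {nt: set() for nt in GRAMMAR}
    nullable = set()
    changed = True
    while changed:                          # iterate until nothing grows
        changed = False
        for nt, rules in GRAMMAR.items():
            for rhs in rules:
                f, eps = first_of(rhs, first, nullable)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
                if eps and nt not in nullable:
                    nullable.add(nt)
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym in GRAMMAR:      # FOLLOW is only defined for nonterminals
                        rest, rest_eps = first_of(rhs[i + 1:], first, nullable)
                        new = rest | (follow[nt] if rest_eps else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True
    predict = {}
    for nt, rules in GRAMMAR.items():
        for rhs in rules:
            f, eps = first_of(rhs, first, nullable)
            predict[(nt, rhs)] = f | (follow[nt] if eps else set())
    return predict

for (nt, rhs), tokens in predict_sets().items():
    print(f"{nt} -> {' '.join(rhs) or 'epsilon'}: {sorted(tokens)}")
```

For the two ε-productions of the expression part, this computation yields {), id, read, write, $$} for term_tail and {+, -, ), id, read, write, $$} for factor_tail, matching the table.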

• Predict Sets

– Notice the pairwise disjoint sets: {id}, {read}, {write}.

– Suppose the parser is about to expand stmt.

– Looking ahead 1 token (LL(1)) is enough to choose the production.

stmt → id := expr     {id}
     | read id        {read}
     | write expr     {write}
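The way one token of look-ahead drives the choice of production can be shown with a small recursive-descent recognizer for the LL(1) calculator grammar. This is a sketch under stated assumptions: tokens are separated by whitespace, a token starting with a digit is a number, any other non-keyword token is an id, and the function only counts statements instead of building a tree.

```python
SYMBOLS = {"read", "write", ":=", "+", "-", "*", "/", "(", ")", "$$"}

def kind(tok):
    if tok in SYMBOLS:
        return tok
    return "number" if tok[0].isdigit() else "id"

def parse_program(source):
    """Recognize source against the LL(1) grammar; return the number of statements."""
    tokens = source.split() + ["$$"]          # $$ marks the end of input
    pos = 0
    statements = 0

    def peek():
        return kind(tokens[pos])

    def match(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected}, found {tokens[pos]}")
        pos += 1

    def stmt_list():
        nonlocal statements
        if peek() in {"id", "read", "write"}:   # predict stmt_list -> stmt stmt_list
            stmt()
            statements += 1
            stmt_list()
        elif peek() != "$$":                    # predict stmt_list -> epsilon
            raise SyntaxError(f"unexpected {tokens[pos]}")

    def stmt():
        if peek() == "id":
            match("id")
            match(":=")
            expr()
        elif peek() == "read":
            match("read")
            match("id")
        else:
            match("write")
            expr()

    def expr():
        term()
        term_tail()

    def term_tail():
        if peek() in {"+", "-"}:
            match(peek())
            term()
            term_tail()
        elif peek() not in {")", "id", "read", "write", "$$"}:
            raise SyntaxError(f"unexpected {tokens[pos]}")

    def term():
        factor()
        factor_tail()

    def factor_tail():
        if peek() in {"*", "/"}:
            match(peek())
            factor()
            factor_tail()
        elif peek() not in {"+", "-", ")", "id", "read", "write", "$$"}:
            raise SyntaxError(f"unexpected {tokens[pos]}")

    def factor():
        if peek() == "(":
            match("(")
            expr()
            match(")")
        elif peek() in {"id", "number"}:
            match(peek())
        else:
            raise SyntaxError(f"unexpected {tokens[pos]}")

    stmt_list()
    match("$$")
    return statements

print(parse_program("read A read B sum := A + B write sum write sum / 2"))   # 5
```

Every branch tests the look-ahead token against a predict set. Because the sets for the alternatives of each nonterminal are disjoint, at most one branch applies and the parser never backtracks. The ε-branches also explain how the parser finds statement boundaries without separators: `write` after `A + B` is in the predict set of term_tail → ε, so the expression ends there.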

• LL(1)

program → stmt_list $$
stmt_list → stmt stmt_list | ε
stmt → id := expr | read id | write expr
expr → term term_tail
term_tail → add_op term term_tail | ε
term → factor factor_tail
factor_tail → mult_op factor factor_tail | ε
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /

• Better grammar: LR(1)

– More intuitive than the LL grammar
  • However, not exactly the same language (no empty string)

– Left recursion is advantageous.

program → stmt_list $$
stmt_list → stmt_list stmt | stmt
stmt → id := expr | read id | write expr
expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number | ( expr )
add_op → + | -
mult_op → * | /

• LR Parsing

– With the same input program,

read A
read B
sum := A + B
write sum
write sum / 2

• State Transition Diagram

State 0 (initial state)
  program → ● stmt_list $$
  stmt_list → ● stmt_list stmt
  stmt_list → ● stmt
  stmt → ● id := expr
  stmt → ● read id
  stmt → ● write expr

State 0 --read--> State 1
  stmt → read ● id

State 1 --id--> State 1'
  stmt → read id ●
  Reduce (shifting stmt from the viewpoint of State 0)

State 0 --stmt--> State 0'
  stmt_list → stmt ●
  Reduce (shifting stmt_list)

State 0 --stmt_list--> State 2
  program → stmt_list ● $$
  stmt_list → stmt_list ● stmt
  stmt → ● id := expr
  stmt → ● read id
  stmt → ● write expr
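The item sets in these states follow from two standard operations on LR(0) items: closure (add the productions of any nonterminal that appears right after the dot) and goto (move the dot over one symbol, then take the closure). The sketch below reproduces States 0, 1, 0' and 2. It is restricted to the statement-level productions, so expr is deliberately left without productions; the triple encoding (lhs, rhs, dot position) and the function names are assumptions made for the example.

```python
# Statement-level productions only; expr is left undefined in this sketch.
GRAMMAR = {
    "program":   [("stmt_list", "$$")],
    "stmt_list": [("stmt_list", "stmt"), ("stmt",)],
    "stmt":      [("id", ":=", "expr"), ("read", "id"), ("write", "expr")],
}

def closure(items):
    """Close a set of (lhs, rhs, dot) items under 'dot before a nonterminal'."""
    result = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for production in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], production, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return result

def goto(items, symbol):
    """Items reached by moving the dot over symbol, closed again."""
    moved = {(lhs, rhs, dot + 1) for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)

def show(item):
    lhs, rhs, dot = item
    return f"{lhs} -> {' '.join(rhs[:dot] + ('●',) + rhs[dot:])}"

state0 = closure({("program", ("stmt_list", "$$"), 0)})
state1 = goto(state0, "read")
state2 = goto(state0, "stmt_list")

for name, state in [("State 0", state0), ("State 1", state1), ("State 2", state2)]:
    print(name)
    for item in sorted(state):
        print("  ", show(item))
```

State 0 comes out with six items and State 2 with five, as in the diagram. An item with the dot at the end, such as the single item in `goto(state1, "id")`, marks a state in which the parser reduces.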

• Shift/Reduce Conflicts

expr → ● term
factor → id ●
…

• Reduce/Reduce Conflicts

expr → id ●
factor → id ●
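Without look-ahead, both kinds of conflict can be detected by inspecting a single item set: a complete item (dot at the end) next to an item that can shift a terminal is a shift/reduce conflict, and two complete items are a reduce/reduce conflict. The sketch below applies that LR(0) test. The item encoding (lhs, rhs, dot position) and the function name are assumptions, and the first example state is hypothetical rather than taken from the slides.

```python
def lr0_conflicts(items, terminals):
    """Return the kinds of LR(0) conflict present in one item set."""
    items = list(items)
    reductions = [(lhs, rhs) for lhs, rhs, dot in items if dot == len(rhs)]
    shifts = {rhs[dot] for lhs, rhs, dot in items
              if dot < len(rhs) and rhs[dot] in terminals}
    found = []
    if reductions and shifts:
        found.append("shift/reduce")
    if len(reductions) > 1:
        found.append("reduce/reduce")
    return found

TERMINALS = {"id", "number", ":=", "+", "-", "*", "/", "(", ")"}

# Hypothetical state: after id, either reduce to factor or shift ':='.
print(lr0_conflicts([("factor", ("id",), 1),
                     ("stmt", ("id", ":=", "expr"), 1)], TERMINALS))
# ['shift/reduce']

# The reduce/reduce pair of items from the slide.
print(lr0_conflicts([("expr", ("id",), 1),
                     ("factor", ("id",), 1)], TERMINALS))
# ['reduce/reduce']
```

SLR, LALR(1), and LR(1) parsers keep the same basic test but only count a reduction as possible when the next input token is in the look-ahead set of that item, which removes many of the conflicts this function reports.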

• Resolving Conflicts

• LR(0)
  – Any LR language has an LR(0) grammar (with $$).
  – Not practical: prohibitively large and unintuitive

• SLR
  – SLR grammar: no shift/reduce or reduce/reduce conflicts when using FOLLOW sets
  – FOLLOW sets: also used in LL to generate PREDICT sets

• LALR(1)
  – LALR(1) grammar (may not be SLR)
  – Same states as SLR
  – Improvement over SLR with local look-ahead
  – LALR parsers are the most common parsers in practice.

• LR(1)
  – LR(1) grammars (may not be LALR(1) or SLR)
