Parsing G22.2110 Programming Languages May 24, 2012 New York University Chanseok Oh ([email protected])
Dec 31, 2015
• Overview
– Scanner, Tokenizer, Lexer, Lexical Analyzer
  IF ( A >= .30 ) THEN { …
  IF, LPARAN, IDENT(A), GTE, FPN(.30), RPARAN, THEN, …
  • Tokens, Lexemes
  • DFA, NFA, Regular expressions
  • lex, flex, JLex
– Parser
  • DPDA, Deterministic context-free grammars
  • Yacc, Bison
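A minimal sketch of the scanner step above, turning the slide's sample input into its token stream. The token names come from the slide; the regular expressions and the function name `tokenize` are assumptions made for illustration, not the definitions used by lex or flex.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("FPN",    r"\d*\.\d+"),       # floating-point number such as .30
    ("GTE",    r">="),
    ("LPARAN", r"\("),
    ("RPARAN", r"\)"),
    ("IF",     r"IF\b"),           # keywords are tried before identifiers
    ("THEN",   r"THEN\b"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("SKIP",   r"\s+"),            # whitespace produces no token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return the list of tokens; IDENT and FPN carry their lexeme."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind = m.lastgroup
        if kind != "SKIP":
            lexeme = m.group()
            tokens.append(f"{kind}({lexeme})" if kind in ("IDENT", "FPN") else kind)
        pos = m.end()
    return tokens

print(tokenize("IF ( A >= .30 ) THEN"))
```

Each alternative in the combined pattern is a regular expression, which is why a scanner can be implemented as a finite automaton.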
• Table of Contents
– Practical parsers (Linear time)
  • LL (top-down, predictive)
  • LR (bottom-up, shift-reduce)
– Related side-topics
  • Ambiguity, Language and parser hierarchy
– Examples: Simple Calculator Language
• A Language
– A set of strings (of given symbols)
  • { finite, set, with, five, strings }
  • { ab, aaba, abbaba, … }
  • { 0^n 1^n }
  • { a^i b^j | i < j }
  • { void main() { int i = 0 }, … }
– Is an input string in the language?
  • cf. Recursive, Turing-decidable languages
• Context-Free Languages (CFL)
– Languages that can be generated by
  • CFG’s
– Languages that can be recognized by
  • PDA’s
– Not all languages are CF.
– CFG: suitable for most PL’s.
  • <sentence> := <subject> <verb> <object> PERIOD
– Deterministic CFL
• Ambiguous Grammars
– Is a given grammar ambiguous? Undecidable in general.
– No general procedure for converting to unambiguous grammars
– Can be allowed to some extent for deterministic parsing, e.g., by defining precedence or associativity.
E → E + E | E - E | E * E | E / E
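To see the ambiguity concretely, the sketch below enumerates every parse tree of a token list under the grammar above. It assumes an extra production E → id for operands (the slide does not show one) and an input that alternates operands and operators; `parses` is a hypothetical helper name.

```python
def parses(tokens):
    """All parse trees (nested tuples) of tokens under E -> E op E | id.

    Assumes tokens alternate operand, operator, operand, ...
    """
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    for i in range(1, len(tokens), 2):      # any operator may be the root
        for left in parses(tokens[:i]):
            for right in parses(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees

for tree in parses(["8", "-", "3", "-", "2"]):
    print(tree)
```

The two trees for `8 - 3 - 2` correspond to 8 - (3 - 2) and (8 - 3) - 2. Declaring `-` left-associative is what selects the second one.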
• Parsers
– LL (Left-to-right, Left-most derivation)
  • Top-down
  • Predictive
  • Simple and easy to understand
– LR (Left-to-right, Right-most derivation in reverse)
  • Bottom-up
  • Shift-reduce
  • Most common in production-level use
  • SLR (Simple)
  • LALR (Look-ahead)
• LL(k) Parser
– LL(k) Parser
  • Uses k look-ahead symbols
  • Does not backtrack (deterministic).
– LL(1) is the most popular kind of LL parser.
– LL(k) Languages
  • Not all CFL’s are LL(k) languages.
[Figure: LL(k) drawn as a subset of CFL]
• LL Parsing Example
<id_list> := id <id_list_tail>
<id_list_tail> := , id <id_list_tail>
<id_list_tail> := ;
It is an LL grammar. The language is also LL.
Input to parse: sum , a1 , ptr ;
[Figure: LL drawn as a subset of CFL, with a dot marking this language inside LL]
• Parse Tree
<id_list>
  id (sum)
  <id_list_tail>
    ,
    id (a1)
    <id_list_tail>
      ,
      id (ptr)
      <id_list_tail>
        ;
<id_list> := id <id_list_tail>
<id_list_tail> := , id <id_list_tail>
<id_list_tail> := ;
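The top-down construction of this tree can be sketched as a predictive (recursive-descent) parser with one function per nonterminal. This is an illustrative sketch: the tuple representation of the tree and the rule that any token other than `,` or `;` counts as an id are assumptions.

```python
def parse_id_list(tokens):
    """Predictive LL(1) parser for the id_list grammar; returns a parse tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect_id():
        nonlocal pos
        tok = peek()
        if tok is None or tok in (",", ";"):
            raise SyntaxError(f"expected id at position {pos}, got {tok!r}")
        pos += 1
        return ("id", tok)

    def id_list():
        # <id_list> := id <id_list_tail>
        first = expect_id()
        return ("id_list", first, id_list_tail())

    def id_list_tail():
        nonlocal pos
        tok = peek()
        if tok == ",":   # predict <id_list_tail> := , id <id_list_tail>
            pos += 1
            ident = expect_id()
            return ("id_list_tail", ",", ident, id_list_tail())
        if tok == ";":   # predict <id_list_tail> := ;
            pos += 1
            return ("id_list_tail", ";")
        raise SyntaxError(f"expected ',' or ';' at position {pos}, got {tok!r}")

    tree = id_list()
    if pos != len(tokens):
        raise SyntaxError("trailing input after ';'")
    return tree

print(parse_id_list("sum , a1 , ptr ;".split()))
```

One token of look-ahead (`,` or `;`) is enough to choose the production for `<id_list_tail>`, so the parser never backtracks.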
• LR Parser
– LR(k) parser
  • Uses k look-ahead symbols.
  • Usually k is 1, and the term LR Parser is often intended to refer to this case.
– LR(k) Languages
  • Not all CFL’s are LR(k) languages.
[Figure: LR drawn as a subset of CFL]
• LR Parsing Example
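With the same grammar,
<id_list> := id <id_list_tail>
<id_list_tail> := , id <id_list_tail>
<id_list_tail> := ;
It is also an LR grammar, and the language is LR.
Input to parse (as before): sum , a1 , ptr ;
[Figure: LL drawn inside LR(1), which is drawn inside CFL, with a dot marking this language inside LL]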
• Parse Tree
<id_list>
  id (sum)
  <id_list_tail>
    ,
    id (a1)
    <id_list_tail>
      ,
      id (ptr)
      <id_list_tail>
        ;
<id_list> := id <id_list_tail>
<id_list_tail> := , id <id_list_tail>
<id_list_tail> := ;
• Another LR Parsing Example
Consider a modified grammar,
<id_list> := <id_list_prefix> ;
<id_list_prefix> := <id_list_prefix> , id
<id_list_prefix> := id
The grammar is not LL (though the language itself is both LR and LL).
• LR Parsing
<id_list>
  <id_list_prefix>
    <id_list_prefix>
      <id_list_prefix>
        id (sum)
      ,
      id (a1)
    ,
    id (ptr)
  ;
<id_list> := <id_list_prefix> ;
<id_list_prefix> := <id_list_prefix> , id
<id_list_prefix> := id
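The bottom-up order in which this tree is built can be sketched with a simplified shift-reduce loop. Note the assumption: handles are found by pattern-matching the top of the stack, which happens to work for this small grammar; a real LR parser would consult a state table generated from the grammar. The function name `shift_reduce` is hypothetical.

```python
def shift_reduce(tokens):
    """Shift-reduce recognizer for the left-recursive id_list grammar.

    Returns (accepted, actions). Any token other than ',' or ';' is an id.
    """
    symbols = [t if t in (",", ";") else "id" for t in tokens]
    stack, actions = [], []
    while True:
        if stack[-3:] == ["id_list_prefix", ",", "id"]:
            stack[-3:] = ["id_list_prefix"]
            actions.append("reduce id_list_prefix := id_list_prefix , id")
        elif stack == ["id"]:
            stack[:] = ["id_list_prefix"]
            actions.append("reduce id_list_prefix := id")
        elif stack == ["id_list_prefix", ";"] and not symbols:
            stack[:] = ["id_list"]
            actions.append("reduce id_list := id_list_prefix ;")
        elif symbols:
            stack.append(symbols.pop(0))   # shift the next input symbol
            actions.append("shift")
        else:
            break                          # nothing left to shift or reduce
    return stack == ["id_list"], actions

accepted, actions = shift_reduce("sum , a1 , ptr ;".split())
print(accepted)
for action in actions:
    print(action)
```

With left recursion, each `, id` is reduced as soon as it is read, so the stack never holds more than three symbols for this grammar.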
• Simple Arithmetic Expression
– The language is LL, but this grammar is not an LL grammar (it is an LR grammar, though).
– Two most common obstacles to “LL(1)-ness”
  • Left-recursion
  • Common prefixes
expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number | ( expr )
add_op → + | -
mult_op → * | /
stmt → id := expr | id ( arg_list )
• Converting to LL-Grammars
– Left recursion replaced by right recursion:
  stmt_list → stmt stmt_list | є
– Common prefixes left-factored:
  stmt → id stmt_list_tail
  stmt_list_tail → := expr | ( arg_list )
– Alternatively, you can employ conflict-resolution rules.
• Converted LL(1) Grammar
expr → term term_tail
term_tail → add_op term term_tail | є
term → factor factor_tail
factor_tail → mult_op factor factor_tail | є
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /
[Figure: LL drawn as a subset of CFL]
Not every CFG can be converted to an LL grammar. Why?
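The expr and term rules of the converted grammar follow a mechanical pattern: A → A α | β becomes A → β A_tail and A_tail → α A_tail | є. A minimal sketch of that transformation for immediate left recursion only (indirect left recursion is not handled); the function name and the list-of-lists grammar encoding are assumptions.

```python
def eliminate_immediate_left_recursion(nonterminal, productions, tail_name=None):
    """Rewrite A -> A alpha | beta as A -> beta A_tail, A_tail -> alpha A_tail | empty.

    productions is a list of right-hand sides, each a list of symbols;
    the empty list stands for the empty string.
    """
    tail = tail_name or nonterminal + "_tail"
    recursive = [rhs[1:] for rhs in productions if rhs and rhs[0] == nonterminal]
    others = [rhs for rhs in productions if not rhs or rhs[0] != nonterminal]
    if not recursive:
        return {nonterminal: productions}      # nothing to do
    return {
        nonterminal: [beta + [tail] for beta in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }

# expr -> term | expr add_op term, using the slide's name term_tail
print(eliminate_immediate_left_recursion(
    "expr", [["term"], ["expr", "add_op", "term"]], tail_name="term_tail"))
```

The rewritten rules generate the same strings, but the parse tree now leans to the right, so left associativity has to be restored when the tree is evaluated.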
• LL(1) for Simple Calculator Language
program → stmt_list $$
stmt_list → stmt stmt_list | є
stmt → id := expr | read id | write expr
expr → term term_tail
term_tail → add_op term term_tail | є
term → factor factor_tail
factor_tail → mult_op factor factor_tail | є
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /
Added three more production rules to the previous LL(1) grammar for expressions.
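A sketch of how the expression part of this grammar maps onto recursive-descent functions. The `_tail` functions take the value computed so far as an argument, which is one common way to keep `-` and `/` left-associative despite the right-recursive rules. The tokenizing regex, the `env` dictionary for identifiers, and the choice to evaluate rather than build a tree are all assumptions for illustration.

```python
import re

def evaluate(text, env=None):
    """Evaluate an expression using the LL(1) expr/term/factor rules."""
    env = env or {}
    # Simplified scanner (assumption): characters it does not know are skipped.
    tokens = re.findall(r"\d+\.?\d*|[A-Za-z_]\w*|[-+*/()]", text) + ["$$"]
    pos = 0

    def peek():
        return tokens[pos]

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():                      # expr -> term term_tail
        return term_tail(term())

    def term_tail(left):             # term_tail -> add_op term term_tail | empty
        if peek() in ("+", "-"):
            op = advance()
            right = term()
            return term_tail(left + right if op == "+" else left - right)
        return left                  # empty production

    def term():                      # term -> factor factor_tail
        return factor_tail(factor())

    def factor_tail(left):           # factor_tail -> mult_op factor factor_tail | empty
        if peek() in ("*", "/"):
            op = advance()
            right = factor()
            return factor_tail(left * right if op == "*" else left / right)
        return left

    def factor():                    # factor -> ( expr ) | id | number
        tok = advance()
        if tok == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok[0].isdigit():
            return float(tok)
        if tok[0].isalpha() or tok[0] == "_":
            return env[tok]
        raise SyntaxError(f"unexpected token {tok!r}")

    value = expr()
    if peek() != "$$":
        raise SyntaxError(f"unexpected token {peek()!r}")
    return value

print(evaluate("8 - 3 - 2"))
print(evaluate("(2 + 3) * x", {"x": 4}))
```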
• Predict Sets
program → stmt_list $$                        {id, read, write, $$}
stmt_list → stmt stmt_list                    {id, read, write}
          | є                                 {$$}
stmt → id := expr                             {id}
     | read id                                {read}
     | write expr                             {write}
expr → term term_tail                         {(, id, number}
term_tail → add_op term term_tail             {+, -}
          | є                                 {), id, read, write, $$}
term → factor factor_tail                     {(, id, number}
factor_tail → mult_op factor factor_tail      {*, /}
            | є                               {+, -, ), id, read, write, $$}
factor → ( expr )                             {(}
       | id                                   {id}
       | number                               {number}
add_op → +                                    {+}
       | -                                    {-}
mult_op → *                                   {*}
        | /                                   {/}
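These sets can be computed mechanically. The sketch below uses the usual fixed-point iteration over FIRST, FOLLOW, and nullability, then forms PREDICT(A → α) as FIRST(α), plus FOLLOW(A) when α can derive the empty string. The dictionary encoding of the grammar and all function names are assumptions.

```python
GRAMMAR = {
    "program":     [["stmt_list", "$$"]],
    "stmt_list":   [["stmt", "stmt_list"], []],
    "stmt":        [["id", ":=", "expr"], ["read", "id"], ["write", "expr"]],
    "expr":        [["term", "term_tail"]],
    "term_tail":   [["add_op", "term", "term_tail"], []],
    "term":        [["factor", "factor_tail"]],
    "factor_tail": [["mult_op", "factor", "factor_tail"], []],
    "factor":      [["(", "expr", ")"], ["id"], ["number"]],
    "add_op":      [["+"], ["-"]],
    "mult_op":     [["*"], ["/"]],
}

def first_of(seq, first, nullable):
    """FIRST of a symbol sequence, and whether the sequence can derive empty."""
    out = set()
    for sym in seq:
        if sym not in GRAMMAR:          # terminal
            out.add(sym)
            return out, False
        out |= first[sym]
        if sym not in nullable:
            return out, False
    return out, True

def predict_sets():
    first = {nt: set() for nt in GRAMMAR}
    follow = {nt: set() for nt in GRAMMAR}
    nullable = set()
    changed = True
    while changed:                      # iterate until nothing grows
        changed = False
        for lhs, rhss in GRAMMAR.items():
            for rhs in rhss:
                f, eps = first_of(rhs, first, nullable)
                before = (len(first[lhs]), lhs in nullable)
                first[lhs] |= f
                if eps:
                    nullable.add(lhs)
                changed |= before != (len(first[lhs]), lhs in nullable)
                for i, sym in enumerate(rhs):
                    if sym in GRAMMAR:  # update FOLLOW of each nonterminal in rhs
                        rest, rest_eps = first_of(rhs[i + 1:], first, nullable)
                        size = len(follow[sym])
                        follow[sym] |= rest
                        if rest_eps:
                            follow[sym] |= follow[lhs]
                        changed |= len(follow[sym]) != size
    predict = {}
    for lhs, rhss in GRAMMAR.items():
        for rhs in rhss:
            f, eps = first_of(rhs, first, nullable)
            predict[(lhs, tuple(rhs))] = f | (follow[lhs] if eps else set())
    return predict

for (lhs, rhs), symbols in predict_sets().items():
    print(lhs, "->", " ".join(rhs) or "empty", sorted(symbols))
```

The grammar is LL(1) exactly when the predict sets of the alternatives of each nonterminal are pairwise disjoint.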
• Predict Sets
– Notice the pair-wise disjoint sets: {id}, {read}, {write}
– Suppose you are to expand stmt.
– Looking ahead 1 token is enough to pick the production (LL(1)).
stmt → id := expr     {id}
     | read id        {read}
     | write expr     {write}
• LL(1)
program → stmt_list $$
stmt_list → stmt stmt_list | є
stmt → id := expr | read id | write expr
expr → term term_tail
term_tail → add_op term term_tail | є
term → factor factor_tail
factor_tail → mult_op factor factor_tail | є
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /
• Better grammar: LR(1)
– More intuitive than LL
  • However, not exactly the same language (no empty string)
– Left recursion is advantageous.
program → stmt_list $$
stmt_list → stmt_list stmt | stmt
stmt → id := expr | read id | write expr
expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number | ( expr )
add_op → + | -
mult_op → * | /
• State Transition Diagram
State 0 (Initial state)
  program → ● stmt_list $$
  stmt_list → ● stmt_list stmt
  stmt_list → ● stmt
  stmt → ● id := expr
  stmt → ● read id
  stmt → ● write expr
State 0, on read: State 1
  stmt → read ● id
State 1, on id: State 1’
  stmt → read id ●
  Reduce (shifting stmt from a viewpoint of State 0)
State 0, on stmt: State 0’
  stmt_list → stmt ●
  Reduce (shifting stmt_list)
State 0, on stmt_list: State 2
  program → stmt_list ● $$
  stmt_list → stmt_list ● stmt
  stmt → ● id := expr
  stmt → ● read id
  stmt → ● write expr
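The item sets in these states can be reproduced with the standard closure and goto operations on LR(0) items. In this sketch an item is a tuple (lhs, rhs, dot position). Only the statement-level productions are encoded, so `expr` is treated as an opaque symbol, as in the partial diagram; that restriction and the function names are assumptions.

```python
# Statement-level productions of the LR grammar; expr is left unexpanded.
GRAMMAR = [
    ("program",   ("stmt_list", "$$")),
    ("stmt_list", ("stmt_list", "stmt")),
    ("stmt_list", ("stmt",)),
    ("stmt",      ("id", ":=", "expr")),
    ("stmt",      ("read", "id")),
    ("stmt",      ("write", "expr")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """Add A -> . gamma for every nonterminal A that appears right after a dot."""
    items = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for lhs2, rhs2 in GRAMMAR:
                item = (lhs2, rhs2, 0)
                if lhs2 == rhs[dot] and item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)

def goto(state, symbol):
    """Move the dot over symbol in every item that allows it, then close."""
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in state
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)

def show(state):
    for lhs, rhs, dot in sorted(state):
        print(" ", lhs, "->", " ".join(rhs[:dot] + ("●",) + rhs[dot:]))

state0 = closure({("program", ("stmt_list", "$$"), 0)})
print("State 0"); show(state0)
print("State 1"); show(goto(state0, "read"))
print("State 2"); show(goto(state0, "stmt_list"))
```

An item with the dot at the far right, such as the single item reached on `stmt` from State 0, signals a reduction.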
• Resolving Conflicts
– LR(0)
  • Any LR language has an LR(0) grammar (with $$).
  • Not practical: prohibitively large and unintuitive
– SLR
  • SLR grammar: no shift/reduce or reduce/reduce conflicts when using FOLLOW sets
  • FOLLOW sets: also used in LL to generate PREDICT sets
– LALR(1)
  • LALR(1) grammar (may not be SLR)
  • Same states as SLR
  • Improvement over SLR with local look-ahead
  • LALR’s are the most common parsers in practice.
– LR(1)
  • LR(1) grammars (may not be LALR(1) or SLR)