Compilation 0368-3133
Lecture 4: Syntax Analysis
Noam Rinetzky
Zijian Xu, Hong Chen, Song-Chun Zhu and Jiebo Luo, "A hierarchical compositional model for face representation and sketching," IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 2008.
2
Where are we?
[Figure: compiler pipeline. Source text (txt) → Process text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → Sem. Analysis → Annotated AST → Intermediate code generation → IR → Intermediate code optimization → IR → Code generation → Symbolic Instructions (SI) → Target code optimization → SI → Machine code generation → MI → Write executable output → Executable code (exe). Lexical Analysis and Syntax Analysis are the marked stages.]
From scanning to parsing
3
Input: characters (program text)
((23 + 7) * x)
Lexical Analyzer (regular language: Id → 'a' | ... | 'z')
Token stream:
LP LP Num OP Num RP OP Id RP
Parser (context free grammar: Exp → ... | Exp + Exp | Id)
Output: a syntax error, or, for valid syntax, an Abstract Syntax Tree:
Op(*)
  Op(+)
    Num(23)
    Num(7)
  Id(x)
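The scanning step above can be sketched with regular expressions. This is a minimal illustration, not the course's scanner: the token names (LP, RP, Num, Id, OP) follow the slide, while the patterns and the helper name `tokenize` are assumptions made for the example.

```python
import re

# Assumed token patterns; only the token names come from the slide.
TOKEN_SPEC = [
    ("Num", r"\d+"),
    ("Id", r"[a-z]+"),
    ("OP", r"[+*]"),
    ("LP", r"\("),
    ("RP", r"\)"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Turn program text into a list of (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

kinds = [kind for kind, _ in tokenize("((23 + 7) * x)")]
```

The parser only needs the token kinds to check syntax; the lexemes (23, 7, x) end up as leaves of the abstract syntax tree.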
4
Broad kinds of parsers
• Top-Down parsers
  – Construct parse tree in a top-down manner
  – Find the leftmost derivation
• Bottom-Up parsers
  – Construct parse tree in a bottom-up manner
  – Find the rightmost derivation in a reverse order
• Parsers for arbitrary grammars
  – Earley's method, CYK method
  – Usually, not used in practice (though might change)
5
6
Context free grammars (CFGs)
G = (V,T,P,S)
• V – non terminals (syntactic variables)
• T – terminals (tokens)
• P – derivation rules
  – Each rule of the form V → (T ∪ V)*
• S – start symbol
7
Derivations
• Show that a sentence ω is in the language of a grammar G by repeatedly applying production rules, starting from the start symbol
FIRST sets computation example
Grammar (in the EXPR rule, "->" after TERM is a token of the language, not the rule arrow):
STMT → if EXPR then STMT | while EXPR do STMT | EXPR ;
EXPR → TERM -> id | zero? TERM | not EXPR | ++ id | -- id
TERM → id | constant

1. Initialization
   TERM: id, constant
   EXPR: zero?, not, ++, --
   STMT: if, while
2. F(STMT) = F(STMT) ∪ F(EXPR)
   TERM: id, constant
   EXPR: zero?, not, ++, --
   STMT: if, while, zero?, not, ++, --
3. F(EXPR) = F(EXPR) ∪ F(TERM)
   TERM: id, constant
   EXPR: zero?, not, ++, --, id, constant
   STMT: if, while, zero?, not, ++, --
4. F(STMT) = F(STMT) ∪ F(EXPR)
   TERM: id, constant
   EXPR: zero?, not, ++, --, id, constant
   STMT: if, while, zero?, not, ++, --, id, constant
5. We reached a fixed-point
   TERM: id, constant
   EXPR: zero?, not, ++, --, id, constant
   STMT: if, while, zero?, not, ++, --, id, constant
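The iteration above can be sketched as a small fixed-point loop. This is an illustrative sketch for grammars without null productions (like this one); the dictionary encoding of the grammar and the name `first_sets` are assumptions made for the example.

```python
# Each nonterminal maps to its alternatives; each alternative is a list of symbols.
# "->" inside the EXPR rule is a token of the language.
GRAMMAR = {
    "STMT": [["if", "EXPR", "then", "STMT"],
             ["while", "EXPR", "do", "STMT"],
             ["EXPR", ";"]],
    "EXPR": [["TERM", "->", "id"],
             ["zero?", "TERM"],
             ["not", "EXPR"],
             ["++", "id"],
             ["--", "id"]],
    "TERM": [["id"], ["constant"]],
}

def first_sets(grammar):
    """Grow FIRST sets until nothing changes (no epsilon rules assumed)."""
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:
        changed = False
        for nonterminal, alternatives in grammar.items():
            for alt in alternatives:
                head = alt[0]
                # A terminal contributes itself; a nonterminal contributes its FIRST set.
                new = first[head] if head in grammar else {head}
                if not new <= first[nonterminal]:
                    first[nonterminal] |= new
                    changed = True
    return first

FIRST = first_sets(GRAMMAR)
```

The loop stops exactly when a full pass adds nothing, which is the fixed-point of step 5.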
36
Fixed-point algorithm for computing FIRST sets
• What to do with null productions?
X → Y a | Z b
Y → ε
Z → ε
• Say input = "a", which rule to use?
• a ∉ FIRST(Y), a ∉ FIRST(Z)
Use what comes after Y/Z
37
Computing FIRST sets (take II)
• Observation
If X → A1 … Ak N α | … and ε ∈ FIRST(A1), …, ε ∈ FIRST(Ak)
then FIRST(N) \ { ε } ⊆ FIRST(X) (constraint)
Use what comes after A1 … Ak to predict which production rule of X to use
38
FOLLOW sets
• FOLLOW(N) = the set of tokens that can immediately follow the non-terminal N in some sentential form
If S ⇒* αNtβ then t ∈ FOLLOW(N)
p. 189
39
FOLLOW sets
• FOLLOW(N) = the set of tokens that can immediately follow the non-terminal N in some sentential form
If S ⇒* αNtβ then t ∈ FOLLOW(N)
• FOLLOW(t) = the set … terminal t … form
If αNtβ ⇒* α'tqβ' then q ∈ FOLLOW(t)
p. 189
40
FOLLOW sets: Constraints
• $ ∈ FOLLOW(S) ($ is the end of input, S is the start symbol)
• If X → α N β then FIRST(β) – { ε } ⊆ FOLLOW(N)
• If X → α N β and ε ∈ FIRST(β) then FOLLOW(X) ⊆ FOLLOW(N)
Compute FIRST and FOLLOW by solving the extended constraint system
41
Example: FOLLOW sets
• E → T X
• X → + E | ε
• T → ( E ) | int Y
• Y → * T | ε

Terminal   +        (        *        )          int
FOLLOW     int, (   int, (   int, (   +, ), $    *, ), +, $

Non-terminal   E      T          X      Y
FOLLOW         ), $   +, ), $    $, )   +, ), $
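The constraint system for FIRST and FOLLOW can be solved by the same kind of fixed-point iteration. The sketch below is one possible encoding, not the lecture's code: an empty list stands for an ε alternative, and the names `first_of_seq` and `first_follow` are assumptions made for the example.

```python
EPS, END = "ε", "$"

# The example grammar; [] encodes an epsilon alternative.
GRAMMAR = {
    "E": [["T", "X"]],
    "X": [["+", "E"], []],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "Y": [["*", "T"], []],
}

def first_of_seq(seq, first, grammar):
    """FIRST of a sequence of symbols, given the FIRST sets computed so far."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in grammar else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol in seq can derive epsilon (or seq is empty)
    return out

def first_follow(grammar, start):
    first = {n: set() for n in grammar}
    follow = {n: set() for n in grammar}
    follow[start].add(END)                      # $ ∈ FOLLOW(S)
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for alt in alternatives:
                f = first_of_seq(alt, first, grammar)
                if not f <= first[lhs]:
                    first[lhs] |= f
                    changed = True
                for i, sym in enumerate(alt):
                    if sym not in grammar:
                        continue                # FOLLOW is computed for nonterminals only
                    rest = first_of_seq(alt[i + 1:], first, grammar)
                    add = rest - {EPS}          # FIRST(β) – {ε} ⊆ FOLLOW(N)
                    if EPS in rest:
                        add |= follow[lhs]      # FOLLOW(X) ⊆ FOLLOW(N)
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return first, follow

FIRST, FOLLOW = first_follow(GRAMMAR, "E")
```

Running it on the example grammar gives the non-terminal row of the table; the FOLLOW sets of terminals in the table are not computed by this sketch.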
42
Prediction Table
• A → α
• T[A,t] = α if t ∈ FIRST(α)
• T[A,t] = α if ε ∈ FIRST(α) and t ∈ FOLLOW(A)
  – t can also be $
• T is not well defined ⇒ the grammar is not LL(1)
43
LL(k) grammars
• A grammar is in the class LL(1) (the case k = 1) iff for every two productions A → α and A → β
  – FIRST(α) ∩ FIRST(β) = {}
    • In particular, α ⇒* ε and β ⇒* ε together is not possible
  – If β ⇒* ε then FIRST(α) ∩ FOLLOW(A) = {}
44
Problem: Non LL(k) grammars
45
LL(k) grammars
• An LL(k) grammar G can be derived via:
  – Top-down derivation
  – Scanning the input from left to right (L)
  – Producing the leftmost derivation (L)
  – With lookahead of k tokens (k)
  – G is not ambiguous
  – G is not left-recursive
• A language is said to be LL(k) when it has an LL(k) grammar
46
Non LL grammar: Common prefix
term → ID | indexed_elem
indexed_elem → ID [ expr ]
• FIRST(term) = { ID }
• FIRST(indexed_elem) = { ID }
• FIRST/FIRST conflict
47
Solution: left factoring
• Rewrite the grammar to be in LL(1)
Before:
term → ID | indexed_elem
indexed_elem → ID [ expr ]
After:
term → ID after_ID
after_ID → [ expr ] | ε
Intuition: just like factoring x*y + x*z into x*(y+z)
48
Left factoring – another example
Before:
S → if E then S else S | if E then S | T
After:
S → if E then S S' | T
S' → else S | ε
49
Non LL grammar: Problematic null productions
S → X a b
X → a | ε
• FIRST(S) = { a }, FOLLOW(S) = { $ }
• FIRST(X) = { a, ε }, FOLLOW(X) = { a }
• FIRST/FOLLOW conflict
T[X, a] = a if a ∈ FIRST(a)
T[X, a] = ε if ε ∈ FIRST(ε) and a ∈ FOLLOW(X)
(t can also be $)
T is not well defined ⇒ the grammar is not LL(1)
50
Solution: substitution
S → A a b
A → a | ε
Substitute A in S:
S → a a b | a b
Left factoring:
S → a after_A
after_A → a b | b
51
Non LL grammar: Left-recursion
E → E - term | term
• Left recursion cannot be handled with a bounded lookahead
• What can we do?
52
Solution: Left recursion removal
G1: N → Nα | β
G2: N → βN'
    N' → αN' | ε
• L(G1) = β, βα, βαα, βααα, …
• L(G2) = same
p. 130
Can be done algorithmically.
Problem: grammar becomes mangled beyond recognition
53
Solution: Left recursion removal
G1: N → Nα | β
G2: N → βN'
    N' → αN' | ε
• L(G1) = β, βα, βαα, βααα, …
• L(G2) = same
Example:
E → E - term | term
becomes
E → term TE | term
TE → - term TE | ε
p. 130
Can be done algorithmically.
Problem: grammar becomes mangled beyond recognition
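As a rough sketch of the algorithmic version, the function below removes immediate left recursion from a single nonterminal, following the G1 to G2 scheme. The encoding (lists of symbols, an empty list for ε) and the name `remove_immediate_left_recursion` are assumptions made for the example; indirect left recursion is not handled.

```python
def remove_immediate_left_recursion(name, alternatives):
    """Rewrite N -> N α | β as N -> β N' and N' -> α N' | ε.

    alternatives is a list of symbol lists; [] encodes ε.
    """
    new_name = name + "'"
    # α parts: what follows the leading N in each left-recursive alternative.
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    # β parts: the alternatives that do not start with N.
    betas = [alt for alt in alternatives if not alt or alt[0] != name]
    if not alphas:
        return {name: alternatives}  # nothing to do
    return {
        name: [beta + [new_name] for beta in betas],
        new_name: [alpha + [new_name] for alpha in alphas] + [[]],
    }

rewritten = remove_immediate_left_recursion("E", [["E", "-", "term"], ["term"]])
```

For E → E - term | term this yields E → term E' and E' → - term E' | ε, the same shape as G2 with TE renamed to E'.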
– nonterminals x tokens -> production alternative
– Entries indexed by nonterminal N and token t
• Entry contains the alternative of N that must be predicted when the current input starts with t
LL(k) parsing via PDA
56
LL(k) parsing via PDA: Moves
• Prediction: top(prediction stack) = N
  – Pop N
  – If table[N, current] = α, push α to prediction stack, otherwise – syntax error
• Match: top(prediction stack) = t
  – If (t == current) pop prediction stack, otherwise syntax error
57
LL(k) parsing via PDA: Termination
• Parsing terminates when prediction stack is empty
  – If input is empty at that point, success, otherwise, syntax error
58
Example transition table
(1) E → LIT
(2) E → ( E OP E )
(3) E → not E
(4) LIT → true
(5) LIT → false
(6) OP → and
(7) OP → or
(8) OP → xor

Rows are nonterminals, columns are input tokens; each entry says which rule should be used.

       (    )    not   true   false   and   or   xor   $
E      2         3     1      1
LIT                    4      5
OP                                    6     7    8
59
Model of non-recursive predictive parser
[Figure: a predictive parsing program that reads the input (a + b $), keeps a stack (X Y Z $), consults the parsing table, and produces output.]
60
Running parser example
Grammar: A → aAb | c
Input: aacbb$
Prediction table:
     a          b    c
A    A → aAb         A → c

Input suffix   Stack content   Move
aacbb$         A$              predict(A,a) = A → aAb
aacbb$         aAb$            match(a,a)
acbb$          Ab$             predict(A,a) = A → aAb
acbb$          aAbb$           match(a,a)
cbb$           Abb$            predict(A,c) = A → c
cbb$           cbb$            match(c,c)
bb$            bb$             match(b,b)
b$             b$              match(b,b)
$              $               match($,$) – success
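The predict and match moves of this run can be reproduced with a short table-driven loop. This is a minimal sketch for the grammar A → aAb | c only; the names `TABLE`, `NONTERMINALS` and `ll1_parse` are assumptions made for the example.

```python
# Prediction table for A -> aAb | c: (nonterminal, token) -> right-hand side.
TABLE = {("A", "a"): ["a", "A", "b"], ("A", "c"): ["c"]}
NONTERMINALS = {"A"}

def ll1_parse(tokens, table=TABLE, start="A"):
    """Return the list of moves; raise SyntaxError on illegal input."""
    tokens = list(tokens) + ["$"]
    stack = ["$", start]              # the top of the stack is the end of the list
    pos, moves = 0, []
    while stack:
        top, current = stack.pop(), tokens[pos]
        if top in NONTERMINALS:       # prediction move
            rhs = table.get((top, current))
            if rhs is None:
                raise SyntaxError(f"predict({top},{current}) = ERROR")
            moves.append(f"predict({top},{current})")
            stack.extend(reversed(rhs))
        elif top == current:          # match move
            moves.append(f"match({top},{current})")
            pos += 1
        else:
            raise SyntaxError(f"expected {top}, got {current}")
    return moves

moves = ll1_parse("aacbb")
```

On the input aacbb the sketch performs the same nine moves as the table; on abcbb it stops with predict(A,b) = ERROR, because the table has no entry for (A, b).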
61
Errors
62
Handling Syntax Errors
• Report and locate the error
• Diagnose the error
• Correct the error
• Recover from the error in order to discover more errors
  – without reporting too many "strange" errors
63
Error Diagnosis
• Line number
  – may be far from the actual error
• The current token
• The expected tokens
• Parser configuration
64
Error Recovery
• Becomes less important in interactive environments
• Example heuristics:
  – Search for a semicolon and ignore the statement
  – Try to "replace" tokens for common errors
  – Refrain from reporting 3 subsequent errors
• Globally optimal solutions
  – For every input w, find a valid program w' with a "minimal-distance" from w
65
Illegal input example
Grammar: A → aAb | c
Input: abcbb$
Prediction table:
     a          b    c
A    A → aAb         A → c

Input suffix   Stack content   Move
abcbb$         A$              predict(A,a) = A → aAb
abcbb$         aAb$            match(a,a)
bcbb$          Ab$             predict(A,b) = ERROR
66
Error handling in LL parsers
Grammar: S → a c | b S
Input: c$
Prediction table:
     a          b          c
S    S → a c    S → b S

Input suffix   Stack content   Move
c$             S$              predict(S,c) = ERROR

• Now what?
  – Predict b S anyway: "missing token b inserted in line XXX"
67
Error handling in LL parsers
• Result: infinite loop
Grammar: S → a c | b S
Input: c$ (with the "missing" b inserted: bc$)
Prediction table:
     a          b          c
S    S → a c    S → b S

Input suffix   Stack content   Move
bc$            S$              predict(S,b) = S → bS
bc$            bS$             match(b,b)
c$             S$              Looks familiar?
68
Error handling and recovery
• x = a * (p+q * ( -b * (r-s);
• Where should we report the error?
• The valid prefix property
69
The Valid Prefix Property
• For every prefix of tokens t1, t2, …, ti that the parser identifies as legal:
  – there exist tokens ti+1, ti+2, …, tn such that t1, t2, …, tn is a syntactically valid program
• If every token is considered as a single character:
  – For every prefix word u that the parser identifies as legal, there exists w such that u.w is a valid program
70
Recovery is tricky
• Heuristics for dropping tokens, skipping to semicolon, etc.
71
Building the Parse Tree
72
Adding semantic actions
• Can add an action to perform on each production rule
• Can build the parse tree
  – Every function returns an object of type Node
  – Every Node maintains a list of children
  – Function calls can add new children
73
Building the parse tree
Node E() {
  result = new Node();
  result.name = "E";
  if (current ∈ {TRUE, FALSE}) {        // E → LIT
    result.addChild(LIT());
  } else if (current == LPAREN) {       // E → ( E OP E )
    result.addChild(match(LPAREN));
    result.addChild(E());
    result.addChild(OP());
    result.addChild(E());
    result.addChild(match(RPAREN));
  } else if (current == NOT) {          // E → not E
    result.addChild(match(NOT));
    result.addChild(E());
  } else {
    error;
  }
  return result;
}
static int Parse_Expression(Expression **expr_p) {
    Expression *expr = *expr_p = new_expression();
    /* try to parse a digit */
    if (Token.class == DIGIT) {
        expr->type = 'D'; expr->value = Token.repr - '0';
        get_next_token();
        return 1;
    }
    /* try to parse a parenthesized expression */
    if (Token.class == '(') {
        expr->type = 'P'; get_next_token();
        if (!Parse_Expression(&expr->left)) Error("missing expression");
        if (!Parse_Operator(&expr->oper)) Error("missing operator");
        /* right operand of ( E op E ); this line is absent from the slide
           text and the field name "right" is an assumption */
        if (!Parse_Expression(&expr->right)) Error("missing expression");
        if (Token.class != ')') Error("missing )");
        get_next_token();
        return 1;
    }
    return 0;
}
74
Parser for Fully Parenthesized Expressions
75
Bottom-up parsing
76
Intuition: Bottom-Up Parsing
• Begin with the user's program
• Guess parse (sub)trees
• Check if root is the start symbol
77
Bottom-up parsing
Unambiguous grammar:
E → E * T
E → T
T → T + F
T → F
F → id
F → num
F → ( E )
[Figure, built up over three slides: a parse tree grown bottom-up over an input made of the tokens 1, 2, 3, + and *. First the bare input, then a single F node above a leaf, then further F and T nodes above the leaves.]
80
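Top-Down vs Bottom-Up
• Top-down (predict, match/scan, complete)
• Bottom-up (shift, reduce)
Grammar: A → aAb | c
Input: aacbb$
[Figure: the input split into an "already read…" part and a "to be read…" part. The top-down side shows a partial tree grown from the root A downward (A → a A b, inner A → a A b, innermost A → c). The bottom-up side shows partial trees grown from the leaves upward (A above c, then A above a A b).]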
82
Bottom-up parsing: LR(k) Grammars
• A grammar is in the class LR(k) when it can be derived via:
  – Bottom-up derivation
  – Scanning the input from left to right (L)
  – Producing the rightmost derivation (R)
  – With lookahead of k tokens (k)
83
Bottom-up parsing: LR(k) Grammars
• A language is said to be LR(k) if it has an LR(k) grammar
• The simplest case is LR(0), which we will discuss
84
Terminology: Reductions & Handles
• The opposite of derivation is called reduction
  – Let A → α be a production rule
  – Derivation: βAµ ⇒ βαµ
  – Reduction: βαµ ⇒ βAµ
• A handle is the reduced substring
  – α is the handle for βαµ
85
Goal: Reduce the Input to the Start Symbol
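Grammar:
E → E * B | E + B | B
B → 0 | 1
Example:
0 + 0 * 1
B + 0 * 1
E + 0 * 1
E + B * 1
E * 1
E * B
E
Go over the input so far, and upon seeing a right-hand side of a rule, "invoke" the rule and replace the right-hand side with the left-hand side (reduce)
[Figure: the resulting parse tree. The root E has children E, * and B (B → 1); the inner E has children E, + and B (B → 0); the innermost E derives B → 0.]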
86
Use Shift & Reduce
In each stage, we shift a symbol from the input to the stack, or reduce according to one of the rules.
87
Use Shift & Reduce
In each stage, we shift a symbol from the input to the stack, or reduce according to one of the rules.
• A state will keep the info gathered on handle(s)
  – A state in the "control" of the PDA
  – Also (part of) the stack alphabet
• A table will tell it "what to do" based on current state and next token
  – The transition function of the PDA
• A stack will record the "nesting level"
  – Prefixes of handles
Set of LR(0) items
90
LR item
N → α • β
α: already matched; β: to be matched (in the input)
Hypothesis about αβ being a possible handle: so far we've matched α, expecting to see β
Example: LR(0) Items
• All items can be obtained by placing a dot at every position for every production:
91
Grammar:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )
LR(0) items:
1: S → • E $
2: S → E • $
3: S → E $ •
4: E → • T
5: E → T •
6: E → • E + T
7: E → E • + T
8: E → E + • T
9: E → E + T •
10: T → • i
11: T → i •
12: T → • ( E )
13: T → ( • E )
14: T → ( E • )
15: T → ( E ) •
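Enumerating the items is mechanical, as the following sketch shows. The tuple encoding (lhs, rhs, dot) and the helper names `lr0_items` and `show` are assumptions made for the example; the token id is written i, as in the item list.

```python
GRAMMAR = [
    ("S", ["E", "$"]),
    ("E", ["T"]),
    ("E", ["E", "+", "T"]),
    ("T", ["i"]),
    ("T", ["(", "E", ")"]),
]

def lr0_items(grammar):
    """One item (lhs, rhs, dot) for every dot position of every production."""
    return [(lhs, tuple(rhs), dot)
            for lhs, rhs in grammar
            for dot in range(len(rhs) + 1)]

def show(item):
    """Pretty-print an item with a bullet at the dot position."""
    lhs, rhs, dot = item
    text = f"{lhs} → {' '.join(rhs[:dot])} • {' '.join(rhs[dot:])}"
    return text.replace("  ", " ").strip()

ITEMS = lr0_items(GRAMMAR)
```

A production with n symbols on its right-hand side contributes n + 1 items, which gives the 15 items listed for this grammar.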
92
LR(0) items
N → α • β    Shift item
N → αβ •     Reduce item
93
States and LR(0) Items
Grammar:
E → E * B | E + B | B
B → 0 | 1
• The state will "remember" the potential derivation rules given the part that was already identified
• For example, if we have already identified E then the state will remember the two alternatives: (1) E → E * B, (2) E → E + B
• Actually, we will also remember where we are in each of them: (1) E → E ● * B, (2) E → E ● + B
• A derivation rule with a location marker is called an LR(0) item
• The state is actually a set of LR(0) items. E.g., q13 = { E → E ● * B , E → E ● + B }
94
Intuition
• Gather input token by token until we find a right-hand side of a rule and then replace it with the non-terminal on the left-hand side
  – Going over a token and remembering it in the stack is a shift
    • Each shift moves to a state that remembers what we've seen so far
  – A reduce replaces a string in the stack with the non-terminal that derives it
95
Model of an LR parser
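[Figure: an LR parser that reads the input (id + id + id $), keeps a stack of alternating states and symbols (0 T 2 + 7 id 5), consults the action table (indexed by terminals) and the goto table (indexed by non-terminals), and produces output.]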
96
LR parser stack
• Sequence made of state, symbol pairs
• For instance, a possible stack for the grammar
S → E $
E → T
E → E + T
T → id
T → ( E )
could be: 0 T 2 + 7 id 5 (the stack grows toward the right end)
Form of LR parsing table
97
state      terminals (shift/reduce actions)      non-terminals (goto part)
0
1          sn, rk                                gm
...
Legend: sn = shift state n; rk = reduce by rule k; gm = goto state m; acc = accept; empty entry = error
98
LR parser table example
Columns id through $ are the action part; columns E and T are the goto part.

STATE   id   +    (    )    $     E    T
0       s5        s7              g1   g6
1            s3             acc
2
3       s5        s7                   g4
4       r3   r3   r3   r3   r3
5       r4   r4   r4   r4   r4
6       r2   r2   r2   r2   r2
7       s5        s7              g8   g6
8            s3        s9
9       r5   r5   r5   r5   r5
99
Shift move
• If action[q, a] = sn
[Figure: LR parsing program with action and goto tables; the top of the stack is state q; the input is $ … a …, with a the current token.]
Result of shift
100
[Figure: a and state n have been pushed, so the stack now ends with q a n; the remaining input is $ … a ….]
101
Reduce move
• If action[qn, a] = rk
• Production: (k) A → β
• If β = σ1 … σn, the top of the stack looks like q1 σ1 … qn σn
• goto[q, A] = qm
[Figure: LR parsing program with action and goto tables; the input is $ … a …; the top 2*|β| stack entries, ending in qn, sit above state q.]
102
Result of reduce move
[Figure: the top 2*|β| entries have been replaced, so the stack now ends with q A qm; the input $ … a … is unchanged.]
Accept move
103
[Figure: LR parsing program with action and goto tables; the top of the stack is state q; the input is $ a ….]
If action[q, a] = accept, parsing is completed
Error move
104
[Figure: LR parsing program with action and goto tables; the top of the stack is state q; the input is $ … a ….]
If action[q, a] = error (usually an empty entry), parsing discovered a syntactic error
105
Example
Z → E $
E → T | E + T
T → i | ( E )
106
Example: parsing with LR items
Grammar:
Z → E $
E → T | E + T
T → i | ( E )
Input: i + i $
Items:
Z → • E $
E → • T
E → • E + T
T → • i
T → • ( E )
Why do we need these additional LR items?
Where do they come from?
What do they mean?
107
ε-closure
• Given a set S of LR(0) items
• If P → α • N β is in S
• then for each rule N → γ in the grammar, S must also contain N → • γ
ε-closure({ Z → • E $ }) = { Z → • E $, E → • T, E → • E + T, T → • i, T → • ( E ) }
Grammar:
Z → E $
E → T | E + T
T → i | ( E )
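The closure rule can be sketched as a worklist computation. The item encoding (lhs, rhs, dot) and the name `closure` are assumptions made for the example, not notation from the lecture.

```python
GRAMMAR = {
    "Z": [("E", "$")],
    "E": [("T",), ("E", "+", "T")],
    "T": [("i",), ("(", "E", ")")],
}

def closure(items, grammar=GRAMMAR):
    """Add N → • γ for every nonterminal N that appears right after a dot."""
    result = set(items)
    worklist = list(items)
    while worklist:
        lhs, rhs, dot = worklist.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for gamma in grammar[rhs[dot]]:
                new_item = (rhs[dot], gamma, 0)
                if new_item not in result:
                    result.add(new_item)
                    worklist.append(new_item)
    return result

START = closure({("Z", ("E", "$"), 0)})
```

Starting from Z → • E $, the E after the dot pulls in both E rules, and the T after the dot in E → • T pulls in both T rules, giving the five items of the example.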
108
Example: parsing with LR items
Grammar:
Z → E $
E → T | E + T
T → i | ( E )
Input: i + i $
Initial item set (items denote possible future handles; we remember the position from which we're trying to reduce):
Z → • E $
E → • T
E → • E + T
T → • i
T → • ( E )
Steps (slides 108–118):
1. Match items with the current token i: T → i • is a reduce item.
2. Reduce i to T; the input is now T + i $. E → T • is a reduce item.
3. Reduce T to E; the input is now E + i $.
4. Move over E from the initial item set: E → E • + T, Z → E • $.
5. Move over +: E → E + • T, and the closure adds T → • i and T → • ( E ).
6. Shift i and reduce it to T; the input is now E + T $.
7. E → E + T • is a reduce item.
8. Reduce E + T to E; the input is now E $. Items: Z → E • $, E → E • + T.
9. Move over $: Z → E $ • is a reduce item.
10. Reduce E $ to Z.
119
GOTO/ACTION tables
State   i    +    (    )    $    E    T    action
q0      q5        q7             q1   q6   shift
q1           q3             q2             shift
q2                                         Z → E $
q3      q5        q7                  q4   shift
q4                                         E → E + T
q5                                         T → i
q6                                         E → T
q7      q5        q7             q8   q6   shift
q8           q3        q9                  shift
q9                                         T → ( E )
GOTO table: the columns i through T. ACTION table: the action column.
empty – error move
120
LR(0) parser tables
• Two types of rows:
  – Shift row – tells which state to GOTO for current token
  – Reduce row – tells which rule to reduce (independent of current token)
    • GOTO entries are blank
121
LR parser data structures
• Input – remainder of text to be processed
• Stack – sequence of pairs N, qi
  – N – symbol (terminal or non-terminal)
  – qi – state at which decisions are made
• Initial stack contains q0
Example: input: + i $; stack: q0 i q5
122
LR(0) pushdown automaton
• Two moves: shift and reduce
• Shift move
  – Remove first token from input
  – Push it on the stack
  – Compute next state based on GOTO table
  – Push new state on the stack
  – If new state is error – report error
Example (table row q0: i → q5, ( → q7, E → q1, T → q6; action: shift):
before the shift: input: i + i $; stack: q0
after the shift: input: + i $; stack: q0 i q5
(the stack grows toward the right end)
123
LR(0) pushdown automaton
• Reduce move
  – Using a rule N → α
  – Symbols in α and their following states are removed from stack
  – New state computed based on GOTO table (using top of stack, before pushing N)
  – N is pushed on the stack
  – New state pushed on top of N
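The shift and reduce moves can be put together in a short driver. This is a sketch, not the lecture's implementation: the GOTO entries and reduce rules are transcribed from the GOTO/ACTION table for Z → E $, E → T | E + T, T → i | ( E ), while the dictionary encoding and the name `lr0_parse` are assumptions. Reducing by Z → E $ is treated as acceptance.

```python
# Reduce rows: state -> (left-hand side, length of right-hand side).
RULES = {"q2": ("Z", 2), "q4": ("E", 3), "q5": ("T", 1), "q6": ("E", 1), "q9": ("T", 3)}
# Shift rows: state -> {symbol: next state}; a missing entry is an error move.
GOTO = {
    "q0": {"i": "q5", "(": "q7", "E": "q1", "T": "q6"},
    "q1": {"+": "q3", "$": "q2"},
    "q3": {"i": "q5", "(": "q7", "T": "q4"},
    "q7": {"i": "q5", "(": "q7", "E": "q8", "T": "q6"},
    "q8": {"+": "q3", ")": "q9"},
}

def lr0_parse(tokens):
    """Return the left-hand sides of the reductions performed, in order."""
    tokens = list(tokens)
    stack = ["q0"]                 # alternating states and symbols; the top is the end
    pos, reductions = 0, []
    while True:
        state = stack[-1]
        if state in RULES:         # reduce move, independent of the current token
            lhs, size = RULES[state]
            del stack[len(stack) - 2 * size:]   # pop the symbols of α and their states
            reductions.append(lhs)
            if lhs == "Z":
                if pos != len(tokens):
                    raise SyntaxError("input continues after $")
                return reductions
            nxt = GOTO[stack[-1]].get(lhs)      # uses the top of stack before pushing N
            if nxt is None:
                raise SyntaxError(f"no GOTO entry for {lhs} in {stack[-1]}")
            stack += [lhs, nxt]
        else:                      # shift move
            if pos >= len(tokens):
                raise SyntaxError("unexpected end of input")
            token = tokens[pos]
            nxt = GOTO[state].get(token)
            if nxt is None:
                raise SyntaxError(f"unexpected token {token} in state {state}")
            stack += [token, nxt]
            pos += 1

reductions = lr0_parse(["i", "+", "i", "$"])
```

For i + i $ the reductions come out as T, E, T, E, Z, which is the rightmost derivation in reverse order.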