Recursive-Descent Parsing
CS F331 Programming Languages
CSCE A331 Programming Language Concepts
Lecture Slides
Wednesday, February 15, 2017

Glenn G. Chappell
Department of Computer Science
University of Alaska

Review
Parsing

Two phases:
§ Lexical analysis (lexing)
§ Syntax analysis (parsing)

The output of a parser is often an abstract syntax tree (AST). Specifications of these can vary.

[Figure: the character stream "cout << ff(12.6);" enters the Lexer, which produces a lexeme stream whose lexemes are labeled id, op, lit, and punct. The Parser takes the lexeme stream and outputs an AST or an error. The AST shown has root binOp: <<, whose children are id: cout and a funcCall with children id: ff and numLit: 12.6.]

15 Feb 2017 CS F331 / CSCE A331 Spring 2017

Review
Introduction to Syntax Analysis - Categories of Parsers

Parsing algorithms can be divided into two broad categories.

Top-Down Parsing Algorithms
§ Go through the derivation from top to bottom, expanding nonterminals.
§ Usually produce a leftmost derivation.
§ Important subclass: LL (read Left-to-right, Leftmost derivation).
§ Common reason a top-down parser may not be LL: doing lookahead.
§ Often hand-coded.
§ Example algorithm we will look at: Recursive Descent.

Bottom-Up Parsing Algorithms
§ Go through the derivation from bottom to top, collapsing substrings.
§ Usually produce a rightmost derivation.
§ Important subclass: LR (read Left-to-right, Rightmost derivation).
§ Common reason a bottom-up parser may not be LR: doing lookahead.
§ Almost always automatically generated.
§ Example algorithm we will look at: Shift-Reduce.

Review
Introduction to Syntax Analysis - Categories of Grammars

Grammars that LL parsers can use are LL grammars.
Grammars that LR parsers can use are LR grammars.

[Figure: nested regions, from outermost to innermost: All Grammars, CFGs, LR Grammars, LL Grammars.]

Review
Recursive-Descent Parsing - Intro, How It Works, Example #1

Recursive Descent is a top-down, LL parsing algorithm.
§ There is one parsing function for each nonterminal.
§ A parsing function is responsible for parsing all strings that its nonterminal can be expanded into.

We wrote a Recursive-Descent parser based on Grammar 1.

In a parsing function:
[ … ] Brackets (optional: 0 or 1) become a conditional.
§ Check for the possible initial lexemes inside the brackets. If found, parse everything inside the brackets. Otherwise skip the brackets.
{ … } Braces (optional, repeatable: 0 or more) become a loop.
§ Loop body: Check for the possible initial lexemes inside the braces. If not found, then exit loop, moving to just after the braces. If found, parse everything inside the braces, and then REPEAT.

TO DO
§ Write a Recursive-Descent parser based on Grammar 2.
Done. See rdparser2.cpp.
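
The bracket and brace rules above can be sketched in Lua as follows. This is only an illustration under stated assumptions: the grammar thing → "x" [ "=" "y" ] { "," "z" } is hypothetical (it is not Grammar 1 or Grammar 2), the input is given as an array of lexeme strings rather than coming from a lexer, and the helper matchString and the wrapper parse are assumed names.

```lua
-- Hypothetical grammar, for illustration only:
--   thing → "x" [ "=" "y" ] { "," "z" }

local lexemes, pos   -- input lexemes and index of the current lexeme

-- If the current lexeme equals str, advance past it and return true.
local function matchString(str)
    if lexemes[pos] == str then
        pos = pos + 1
        return true
    end
    return false
end

local function parse_thing()
    if not matchString("x") then
        return false
    end
    -- [ "=" "y" ]: brackets become a conditional
    if matchString("=") then
        if not matchString("y") then
            return false
        end
    end
    -- { "," "z" }: braces become a loop
    while matchString(",") do
        if not matchString("z") then
            return false
        end
    end
    return true
end

-- Return true if the entire input is a valid thing.
function parse(input)
    lexemes, pos = input, 1
    return parse_thing() and pos > #lexemes
end

print(parse({ "x", "=", "y", ",", "z" }))   -- true
print(parse({ "x", "=" }))                  -- false
```

In both constructs the test on the initial lexeme ("=" or ",") decides whether to enter the optional part; once inside, everything else in the brackets or braces is required.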

Review
Recursive-Descent Parsing - Example #3: Expressions [1/5]

Now we bump up our standards. We wish to parse arithmetic expressions in their usual form, with variables, numeric literals, binary +, -, *, and / operators, and parentheses. When given a syntactically correct expression, our parser should return an abstract syntax tree (AST).
§ All operators will be binary and left-associative, so that, for example, “a + b + c” means “(a + b) + c”.
§ Precedence will be as usual, so that “a + b * c” means “a + (b * c)”.
§ Precedence and associativity may be overridden using parentheses: “(a + b) * c”.

Due to the limitations of our lexer, the expression “k-4” will need to be rewritten as “k - 4”.

Review
Recursive-Descent Parsing - Example #3: Expressions [2/5]

We begin with the following grammar, with start symbol expr.

Grammar 3
expr → term
     | expr ( “+” | “-” ) term
term → factor
     | term ( “*” | “/” ) factor
factor → ID
       | NUMLIT
       | “(” expr “)”

Grammar 3 encodes our associativity and precedence rules.

Review
Recursive-Descent Parsing - Example #3: Expressions [3/5]

Following the grammar is part of a parsing function for nonterminal expr.

Grammar 3
expr → term
     | expr ( “+” | “-” ) term
term → factor
     | term ( “*” | “/” ) factor
factor → ID
       | NUMLIT
       | “(” expr “)”

function parse_expr()
    if parse_term() then
        return true
    elseif parse_expr() then
        …

What is wrong with this code?

Review
Recursive-Descent Parsing - Example #3: Expressions [4/5]

function parse_expr()
    if parse_term() then
        return true
    elseif parse_expr() then
        …

What is wrong with this code?
§ First, if the call to parse_term returns false, then the position in the input may have changed. Fixing this requires backtracking, which can lead to extreme inefficiency.
§ But even if we can solve that, there is a more serious problem. Suppose parse_expr is called with input that does not begin with a valid term. What happens? Answer: infinite recursion!
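
The infinite recursion can be observed with a small experiment. The sketch below is not from the slides: parse_term is a stub that always fails (standing in for input that does not begin with a valid term), MAX_DEPTH is an artificial guard added only so the demonstration terminates, and runDemo is an assumed helper name.

```lua
-- Hypothetical demo of the left-recursion problem. MAX_DEPTH is an
-- artificial guard (not in the slides) so that the demo terminates.
local MAX_DEPTH = 100
local depth = 0

-- Stub: pretend the input does not begin with a valid term.
local function parse_term()
    return false
end

function parse_expr()
    depth = depth + 1
    if depth > MAX_DEPTH then
        error("recursion guard tripped; no input was consumed")
    end
    if parse_term() then
        return true
    elseif parse_expr() then   -- same function, same input position
        return true
    end
    return false
end

-- Run the demo. Returns false (the guard raised an error) and the
-- depth that was reached.
function runDemo()
    depth = 0
    local ok = pcall(parse_expr)
    return ok, depth
end

print(runDemo())   -- prints false and 101
```

Without the guard, each call to parse_expr would immediately call parse_expr again with nothing consumed, so the recursion would never end on its own.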

Review
Recursive-Descent Parsing - Example #3: Expressions [5/5]

In fact, without lookahead, it is impossible to write a Recursive-Descent parser for Grammar 3.

Grammar 3
expr → term
     | expr ( “+” | “-” ) term
term → factor
     | term ( “*” | “/” ) factor
factor → ID
       | NUMLIT
       | “(” expr “)”

Recall that a Recursive-Descent parser requires an LL grammar. But Grammar 3 is not an LL grammar. Next we look at LL grammars. We return to the expression-parsing problem later.

An LL grammar is a CFG that can be handled by an LL parsing algorithm, such as Recursive Descent, if multiple-lexeme lookahead is not done.
Recall the origin of the name: these parsers handle their input in a strictly Left-to-right order, and they go through the steps required to generate a Leftmost derivation.
Now we look at some of the properties that an LL grammar must have.
The grammar below illustrates a more general problem.

Grammar B
xx → “a” yy | “a” zz
yy → “*”
zz → “/”

We cannot even begin to write a Recursive-Descent parser for Grammar B. How would the code for function parse_xx begin? Should it take the first or second option? There is no way to tell, without lookahead.
We say the first production in Grammar B is not left-factored. An LL grammar can only contain left-factored productions.
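
To see what left-factoring buys us, here is a sketch of a parser for a left-factored rewrite of Grammar B. The rewritten grammar xx → “a” ( yy | zz ) is an assumption made for this illustration (it is not given in the slides), as are the helper matchString, the wrapper parse, and the use of an array of lexeme strings as input.

```lua
-- Left-factored rewrite of Grammar B (an assumption for illustration):
--   xx → "a" ( yy | zz )
--   yy → "*"
--   zz → "/"

local lexemes, pos   -- input lexemes and index of the current lexeme

-- If the current lexeme equals str, advance past it and return true.
local function matchString(str)
    if lexemes[pos] == str then
        pos = pos + 1
        return true
    end
    return false
end

local function parse_yy()
    return matchString("*")
end

local function parse_zz()
    return matchString("/")
end

local function parse_xx()
    -- The common prefix "a" is consumed before any choice is made.
    if not matchString("a") then
        return false
    end
    -- Now the current lexeme alone tells us which option to take.
    if lexemes[pos] == "*" then
        return parse_yy()
    elseif lexemes[pos] == "/" then
        return parse_zz()
    end
    return false
end

-- Return true if the entire input is a valid xx.
function parse(input)
    lexemes, pos = input, 1
    return parse_xx() and pos > #lexemes
end

print(parse({ "a", "*" }))   -- true
print(parse({ "a", "/" }))   -- true
print(parse({ "a" }))        -- false
```

The rewrite generates the same two strings as Grammar B, but the decision between yy and zz is postponed until the lexeme that distinguishes them is the current one.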
In Grammar C, the empty string can be derived from either yy or zz. So if there is no more input, then there is no basis for making the yy-or-zz decision in the first production.
The strings “a” and “aa” lie in the language generated by Grammar D. But imagine a Recursive-Descent parser based on Grammar D, attempting to parse these strings. What would happen?
It turns out that the problems presented by Grammars A–D illustrate all the reasons a CFG might not be LL.
Fact.* Suppose that a context-free grammar G has the following three properties.
1. If A → α and A → β are distinct productions in G, then there do not exist two strings, one derived from α, the other derived from β, that begin with the same (terminal) symbol.
2. If A → α and A → β are distinct productions in G, then it is not the case that the empty string can be derived from both α and β.
3. If A → α and A → β are distinct productions in G, and the empty string can be derived from β, then there is no (terminal) symbol x that begins a string that can be derived from α, such that x can follow a string derived from A.
Then Grammar G is an LL grammar.

*Adapted from A.V. Aho, R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools, 1986, p. 192.

(1) does not hold for Grammars A, AA, and B; (2) does not hold for Grammar C; and (3) does not hold for Grammar D.

Fact. Suppose that G is an LL grammar. Then,
§ G is not ambiguous, and
§ G does not contain left recursion.

In general, when there is a choice to be made, an LL parser must be able to make that choice based on the current lexeme. If this cannot be done, then the grammar is not LL.
Now suppose—as in our expression-parsing example—that we wish to write a Recursive-Descent parser, but our grammar is not LL. What can we do about this?
If a grammar is not LL, this does not mean that the grammar must be completely useless as a basis for a Recursive-Descent parser. We might be able to transform the grammar into an LL grammar that generates the same language.
For example, here is Grammar A, which is not LL, along with an LL grammar that generates the same language.
It is not at all uncommon to be faced with a grammar that is not LL, but that can be transformed easily to one that is LL. In particular, this is common in the specification of programming-language syntax.
Note, however, that there are context-free languages that cannot be generated by any LL grammar at all.

Recursive-Descent Parsing
Back to Example #3: Expressions - Left-Associativity [1/3]

Now we return to our expression grammar. It is given below. Recall that this is not an LL grammar.

Grammar 3
expr → term
     | expr ( “+” | “-” ) term
term → factor
     | term ( “*” | “/” ) factor
factor → ID
       | NUMLIT
       | “(” expr “)”

An easy fix is to reorder the operands; for example, expr ( “+” | “-” ) term becomes term ( “+” | “-” ) expr. I will also use [ … ] to make the grammar more concise.

Recursive-Descent Parsing
Back to Example #3: Expressions - Left-Associativity [2/3]

Here is the resulting grammar. This is an LL grammar.

Grammar 3a
expr → term [ ( “+” | “-” ) expr ]
term → factor [ ( “*” | “/” ) term ]
factor → ID
       | NUMLIT
       | “(” expr “)”

But now we have a new problem: Grammar 3a is LL, but it encodes right-associative binary operators. We want our operators to be left-associative.

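
The right-associativity of Grammar 3a can be made visible with a small experiment. The following is a sketch, not the course's code: it takes an array of lexeme strings instead of using a lexer, treats any lexeme beginning with a letter, digit, or underscore as an ID or NUMLIT, and builds a nested array { operator, left, right } for each binary operation. The names parse3a, astToString, and matchString are assumptions.

```lua
local lexemes, pos   -- input lexemes and index of the current lexeme

-- If the current lexeme equals str, advance past it and return true.
local function matchString(str)
    if lexemes[pos] == str then
        pos = pos + 1
        return true
    end
    return false
end

local parse_expr   -- forward declaration; factor refers to expr

-- factor → ID | NUMLIT | "(" expr ")"
local function parse_factor()
    local lx = lexemes[pos]
    if lx and lx:match("^[%w_]") then   -- assumed test for ID or NUMLIT
        pos = pos + 1
        return true, lx
    elseif matchString("(") then
        local good, ast = parse_expr()
        if not good or not matchString(")") then
            return false
        end
        return true, ast
    end
    return false
end

-- term → factor [ ( "*" | "/" ) term ]
local function parse_term()
    local good, ast = parse_factor()
    if not good then return false end
    local op = lexemes[pos]
    if matchString("*") or matchString("/") then
        local good2, rest = parse_term()   -- recursion on the RIGHT operand
        if not good2 then return false end
        ast = { op, ast, rest }
    end
    return true, ast
end

-- expr → term [ ( "+" | "-" ) expr ]
function parse_expr()
    local good, ast = parse_term()
    if not good then return false end
    local op = lexemes[pos]
    if matchString("+") or matchString("-") then
        local good2, rest = parse_expr()   -- recursion on the RIGHT operand
        if not good2 then return false end
        ast = { op, ast, rest }
    end
    return true, ast
end

-- Parse the whole input; return the tree, or nil on a syntax error.
function parse3a(input)
    lexemes, pos = input, 1
    local good, ast = parse_expr()
    if good and pos > #lexemes then return ast end
    return nil
end

-- Render a tree as a fully parenthesized string.
function astToString(ast)
    if type(ast) == "string" then return ast end
    return "(" .. astToString(ast[2]) .. " " .. ast[1] .. " "
               .. astToString(ast[3]) .. ")"
end

print(astToString(parse3a({ "a", "-", "b", "-", "c" })))   -- (a - (b - c))
```

The grouping a - (b - c) is the wrong one for ordinary arithmetic: everything after the first operator is parsed as a single nested expr, which becomes the right operand.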
Fortunately, all is not lost …

Recursive-Descent Parsing
Back to Example #3: Expressions - Left-Associativity [3/3]

Here is how we do it.

Grammar 3b
expr → term { ( “+” | “-” ) term }
term → factor { ( “*” | “/” ) factor }
factor → ID
       | NUMLIT
       | “(” expr “)”

Grammar 3b is what we want. It is LL, and we can use it to parse left-associative binary operators.

function parse_expr()
    -- An expr must begin with a term
    if not parse_term() then
        return false
    end
    -- { ( "+" | "-" ) term }: braces become a loop
    while true do
        if not matchString("+")
          and not matchString("-")
        then break
        end
        -- An operator must be followed by a term
        if not parse_term() then
            return false
        end
    end
    return true
end

However, we still need to generate an AST.

Recursive-Descent Parsing
Back to Example #3: Expressions - ASTs [1/7]

We want to write a parser that returns an abstract syntax tree (AST). First, we need to specify the format of an AST.

Recall that a parse tree, or concrete syntax tree, includes one leaf node for each lexeme in the input, and one non-leaf node for each nonterminal in the derivation.

However, an AST is more sparse. For example, below are reasonable ASTs for the expressions a + 2 and (a + 2) * b.

    +              *
   / \            / \
  a   2          +   b
                / \
               a   2

Lexemes that only guide parsing are omitted from an AST: semicolons to end statements, parentheses in expressions, etc.

Recursive-Descent Parsing
Back to Example #3: Expressions - ASTs [2/7]

We need to represent these trees in Lua. Represent a single lexeme by its string form. If there is more than one node in an AST, then represent it as an array whose first item represents the root node and whose remaining items each represent one of the subtrees rooted at the child nodes, in order.

Here is the first AST above in Lua: { "+", "a", "2" }

And here is the second AST: { "*", { "+", "a", "2" }, "b" }
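
To illustrate how a caller might walk a tree in this format, here is a small recursive evaluator. It is only a sketch of one possible use, not part of the slides: the function name evalAST and the env table of variable values are assumptions, and no error handling is attempted (an undefined variable would cause an arithmetic error on nil).

```lua
-- Evaluate an AST in the array format described above.
-- env (hypothetical) maps variable names to numbers.
function evalAST(ast, env)
    if type(ast) == "string" then
        -- Single node: a numeric literal or an identifier
        return tonumber(ast) or env[ast]
    end
    -- Array: root operator first, then the two subtrees in order
    local op = ast[1]
    local left = evalAST(ast[2], env)
    local right = evalAST(ast[3], env)
    if op == "+" then
        return left + right
    elseif op == "-" then
        return left - right
    elseif op == "*" then
        return left * right
    else   -- assumed to be "/"
        return left / right
    end
end

local env = { a = 1, b = 5 }
print(evalAST({ "+", "a", "2" }, env))                  -- 3
print(evalAST({ "*", { "+", "a", "2" }, "b" }, env))    -- 15
```

The type test on the node is what distinguishes a single-lexeme AST (a string) from a multi-node AST (an array); the parentheses of the original expression never appear, since the nesting of the arrays carries that information.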

Recursive-Descent Parsing
Back to Example #3: Expressions - ASTs [3/7]

It is better to describe our ASTs in a way that does not require drawings of trees. So we specify the format of an AST for each line in our grammar.
(1) expr. If there is only a term, then the AST for the expr is the AST for the term. Otherwise, the AST is { OO, AA, BB }, where OO is the string form of the last operator, AA is the AST for everything before it, and BB is the AST for the last term.

Recursive-Descent Parsing
Back to Example #3: Expressions - ASTs [4/7]

(2) term. If there is only a factor, then the AST for the term is the AST for the factor. Otherwise, the AST is { OO, AA, BB }, where OO is the string form of the last operator, AA is the AST for everything before it, and BB is the AST for the last factor.

Recursive-Descent Parsing
Back to Example #3: Expressions - ASTs [5/7]

(3) factor: ID. AST for the factor: string form of the ID.
(4) factor: NUMLIT. AST for the factor: string form of the NUMLIT.
(5) factor: “(” expr “)”. AST for the factor: AST for the expr.

Recursive-Descent Parsing
Back to Example #3: Expressions - ASTs [6/7]

Applying the various rules, the AST for (a + 2) * b is { "*", { "+", "a", "2" }, "b" }

Each parsing function can now return a pair: a boolean and an AST. The boolean indicates a correct parse, as before. The AST is only valid if the boolean is true, in which case it will be in the specified format.

Recursive-Descent Parsing
Back to Example #3: Expressions - ASTs [7/7]

(1) expr. If there is only a term, then the AST for the expr is the AST for the term. Otherwise, the AST is { OO, AA, BB }, where OO is the string form of the last operator, AA is the AST for everything before it, and BB is the AST for the last term.
(2) term. Similar to (1).
(3) factor: ID. AST for the factor: string form of the ID.
(4) factor: NUMLIT. AST for the factor: string form of the NUMLIT.
(5) factor: “(” expr “)”. AST for the factor: AST for the expr.

TO DO
§ Based on Grammar 3b, write a Recursive-Descent parser that produces an AST, as described.
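
One possible shape for such a parser is sketched below. It is not the course's solution. The lexer lex is a deliberately minimal stand-in (an assumption; unlike the course lexer, it happens to split “k-4” into three lexemes), any lexeme beginning with a letter, digit, or underscore is treated as an ID or NUMLIT, and the names lex, matchString, and parse are assumed. Each parsing function returns a boolean and an AST, following rules (1) through (5).

```lua
-- Minimal stand-in lexer (an assumption): identifiers, numeric
-- literals, and single-character lexemes; whitespace is skipped.
local function lex(s)
    local result, p = {}, 1
    while p <= #s do
        local ws = s:match("^%s+", p)
        if ws then
            p = p + #ws
        else
            local str = s:match("^[%a_][%w_]*", p)
                or s:match("^%d+%.?%d*", p)
                or s:sub(p, p)
            result[#result + 1] = str
            p = p + #str
        end
    end
    return result
end

local lexemes, pos   -- input lexemes and index of the current lexeme

-- If the current lexeme equals str, advance past it and return true.
local function matchString(str)
    if lexemes[pos] == str then
        pos = pos + 1
        return true
    end
    return false
end

local parse_expr   -- forward declaration; factor refers to expr

-- Rules (3), (4), (5)
local function parse_factor()
    local lx = lexemes[pos]
    if lx and lx:match("^[%w_]") then   -- assumed test for ID or NUMLIT
        pos = pos + 1
        return true, lx
    elseif matchString("(") then
        local good, ast = parse_expr()
        if not good or not matchString(")") then
            return false, nil
        end
        return true, ast   -- parentheses do not appear in the AST
    end
    return false, nil
end

-- Rule (2): term → factor { ( "*" | "/" ) factor }
local function parse_term()
    local good, ast = parse_factor()
    if not good then return false, nil end
    while true do
        local op = lexemes[pos]
        if not matchString("*") and not matchString("/") then break end
        local good2, right = parse_factor()
        if not good2 then return false, nil end
        ast = { op, ast, right }   -- everything so far becomes the LEFT subtree
    end
    return true, ast
end

-- Rule (1): expr → term { ( "+" | "-" ) term }
function parse_expr()
    local good, ast = parse_term()
    if not good then return false, nil end
    while true do
        local op = lexemes[pos]
        if not matchString("+") and not matchString("-") then break end
        local good2, right = parse_term()
        if not good2 then return false, nil end
        ast = { op, ast, right }
    end
    return true, ast
end

-- Returns a boolean and an AST; the AST is valid only if the boolean is true.
function parse(s)
    lexemes, pos = lex(s), 1
    local good, ast = parse_expr()
    return good and pos > #lexemes, ast
end

local good, ast = parse("(a + 2) * b")
print(good, ast[1], ast[2][1], ast[3])   -- true, then * + b
```

Left-associativity comes from the loop: each time an operator is found, the tree built so far is wrapped as the left child of a new node, which matches "AA is the AST for everything before it" in rule (1).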
The ASTs we have specified are not quite what we want.
We need to know whether each node represents an operator, an identifier, etc. The lexer already figured this out, but then we did not store this information in the AST.
And there is other information we could store. For example, in many PLs, “-” can be either a binary operator (a - b) or a unary operator (-x). The lexer does not know which it is. But the parser knows, and the parser could return this information to its caller.
To give the caller additional information, we mark each node in the AST, indicating what kind of entity it is. So far, we have three kinds of nodes: binary operators, identifiers, and numeric literals. So we mark each node as being one of these three.
In the Lua form of our ASTs, we can replace each string with a two-item array. The first item in the array will be one of three constants: BIN_OP, ID_VAL, or NUMLIT_VAL. The second item will be the string form of the lexeme.
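
As a sketch of what the marked form might look like, the example below writes out the AST for (a + 2) * b by hand and then walks it. The numeric values given to BIN_OP, ID_VAL, and NUMLIT_VAL are arbitrary assumptions (only their distinctness matters), and collectIds is a hypothetical helper showing how a caller can use the marks.

```lua
-- The three node kinds. The numeric values are arbitrary assumptions.
BIN_OP     = 1
ID_VAL     = 2
NUMLIT_VAL = 3

-- Marked AST for (a + 2) * b: each string has become { kind, string }.
local ast = {
    { BIN_OP, "*" },
    { { BIN_OP, "+" }, { ID_VAL, "a" }, { NUMLIT_VAL, "2" } },
    { ID_VAL, "b" },
}

-- Hypothetical helper: gather the names of all identifiers, in order.
-- A node is a two-item array whose first item is a kind constant;
-- a larger tree is an array whose first item is itself a node.
function collectIds(tree, out)
    out = out or {}
    if type(tree[1]) == "number" then
        -- Single node: { kind, string }
        if tree[1] == ID_VAL then
            out[#out + 1] = tree[2]
        end
    else
        -- Root node first, then subtrees; only subtrees can hold IDs
        for i = 2, #tree do
            collectIds(tree[i], out)
        end
    end
    return out
end

print(table.concat(collectIds(ast), ", "))   -- a, b
```

With the marks in place, the caller no longer has to guess from the string whether "a" is an identifier or "2" is a numeric literal; the parser passes along what the lexer already determined.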