Module 9 – Top Down Parser – Pre-processing
In this module, the role of a parser in the context of a compiler is discussed. Types of parsers
and the pre-processing that is necessary for a top-down parser are dealt with in detail. The
pre-processing steps of eliminating left recursion and left factoring are covered with an
algorithm and examples.
9.1 Role of the Parser
The parser is typically integrated with the lexical phase of the compiler. This avoids a separate
pass just for parsing, although a compiler will still have more than one pass overall. Integration
of a parser with the lexical phase is depicted in figure 9.1 (a), and a more elaborate
representation is given in figure 9.1 (b). From figure 9.1 (a) it can be understood that the scanner
issues tokens to the parser, and the parser validates the token sequence and converts it to an
intermediate representation (IR).
Figure 9.1 (a) Integration of lexer and parser
Figure 9.1 (b) Interaction of lexer and parser
Figure 9.1 (b) gives more clarity on the interaction between the lexer and the parser. The lexer
scans the input code, tokenizes it, and issues a token to the parser whenever a "GetNextToken"
request is issued by the parser. The lexer records lexical errors during the scanning process,
while the parser records any syntax errors. The semantic phase is one more component of the
front end of the compiler, which checks for semantic errors. All these phases interact with the
symbol table.
9.2 Brief discussion on Grammar
Before we discuss the types of parsers, a brief discussion of grammars is necessary.
A grammar denotes the sentence structure of a particular language. Grammars are of 4 types:
type 0, type 1, type 2 and type 3. Type 0 grammars define the largest class of languages, while
type 3 grammars define the smallest. For compilers, type 2 grammars, otherwise called Context
Free Grammars, are used. A Context Free Grammar (CFG) is defined formally as a four tuple
(V, T, P, S) where V is the set of variables / non-terminals, T is the set of terminals that
constitute the strings, P is the set of productions, each with an LHS and an RHS where the LHS
can be replaced by the RHS to derive a string, and S is a special symbol called the start symbol,
which is an element of V. A string is said to belong to the language of a grammar if and only if
it can be derived from the start symbol.
The following conventions are typically used in the context of grammars.
• Terminals are specified using characters a, b, c, etc. Other specific terminals could be 0, 1,
id, +, -, *, etc.
• Non-terminals are specified using A, B, C, etc. Other specific non-terminals include expr,
term, stmt
• Grammar symbols are denoted X, Y, Z ∈ (V ∪ T)
• Strings of terminals are indicated using symbols u, v, w, x, y, z ∈ T*
• Strings of grammar symbols are denoted using Greek letters α, β, γ ∈ (V ∪ T)*
A derivation is the process of deriving a string from the start symbol of the grammar. A one-step
derivation is defined by
αAβ ⇒ αγβ (9.1)
where A → γ is a production in the grammar and ⇒ indicates one step of derivation. In
addition, we define the following terminology for the process of derivation:
• The derivation is leftmost (⇒lm) if α does not contain a non-terminal
• The derivation is rightmost (⇒rm) if β does not contain a non-terminal
• Transitive closure ⇒* (zero or more steps)
• Positive closure ⇒+ (one or more steps)
The language generated by G is defined by
L(G) = {w | S ⇒+ w} (9.2)
Consider the following grammar defined as ({E}, {+, *, (, ), -, id}, P, E) where P is defined as
follows:
E → E + E
E → E * E
E → ( E )
E → - E
E → id (9.3)
This grammar is a subset of the expression grammar over the terminals +, *, id, (, ). The terminal
"id" indicates an identifier while the other terminals are arithmetic operators. In this grammar,
strings are derived from the start symbol "E". Consider the following derivation:
E ⇒rm E + E ⇒rm E + id ⇒rm id + id (9.4)
In equation (9.4), "E" is first replaced with "E + E", and then each "E" is replaced with "id"
from right to left; hence this is a rightmost derivation, yielding the string "id + id".
Other possible derivations are
E ⇒ - E ⇒ - id (9.5)
E ⇒lm E * E ⇒lm id * E ⇒lm id * E + E ⇒lm id * id + E ⇒lm id * id + id (9.6)
Equation (9.6) applies leftmost derivation to derive the string "id * id + id". The process of
derivation can be depicted as a tree, called a derivation tree, where the LHS non-terminal of
each applied production is the parent and the symbols on the RHS of the production are its
children.
The grammar is used to derive strings. Given an input string, one can start from the start symbol
and apply one derivation after another to derive the string; this process is called top-down
derivation. On the other hand, given a string, if its symbols are successively combined (reduced)
and this yields the start symbol, the process is called bottom-up derivation. In either case, if the
string is derived from the start symbol, or if the reduction yields the start symbol of the
grammar, the string is said to belong to the grammar.
This process of derivation is used by the parsers to validate a string or a sequence of strings.
9.3 Types of Parsers
The functionality of the parser is to verify whether a sequence of tokens forms a correct
sentence structure. The validation is based on rules that have to be followed for constructing a
sentence. The rules are specified using one type of grammar. Using this grammar, when an
input sentence is given, the parser checks whether the input sentence belongs to the grammar.
There are various types of parsers. The following is one of the classifications of parsers:
• Universal Parsers
– Cocke-Younger-Kasami (CYK)
– Earley Parser
• Top-Down Parsers
• Bottom Up Parsers
9.3.1 Universal Parsers
The universal parsers CYK and Earley can parse any context-free grammar given to them. They
are typically used in natural language processing to validate the syntax of natural language
sentences using a predefined grammar for the natural language. However, these parsers are too
inefficient for a compiler, even though the sentence structure of programming languages is also
based on a context free grammar.
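To make the universal approach concrete, the CYK idea can be sketched in a few lines. The sketch below is our own illustration, not part of this module; it assumes the grammar is already in Chomsky Normal Form, where every production is either A → a or A → B C:

```python
def cyk_parse(tokens, unit_rules, pair_rules, start="S"):
    """Return True if `tokens` can be derived from `start`.

    unit_rules: list of (lhs, terminal) for productions A -> a
    pair_rules: list of (lhs, (B, C)) for productions A -> B C
    """
    n = len(tokens)
    # table[i][l] = set of non-terminals deriving tokens[i : i + l + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = {lhs for lhs, rhs in unit_rules if rhs == tok}
    for length in range(2, n + 1):            # length of the span
        for i in range(n - length + 1):       # start of the span
            for split in range(1, length):    # where the span is divided
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in pair_rules:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]
```

For instance, with the CNF grammar S → A B, A → a, B → b, the call `cyk_parse(["a", "b"], [("A", "a"), ("B", "b")], [("S", ("A", "B"))])` accepts the input. The triple loop makes the running time cubic in the input length, which is one reason CYK is avoided in compilers.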
9.3.2 Top Down Parsers
The top-down parser gets its name from the way the derivation tree is constructed, from the root
downwards, to validate a string against a particular grammar. If the input string can be derived
from the start symbol of the grammar, then we conclude that the string belongs to the grammar.
For example, consider the grammar
S → c A d
A → ab | a (9.7)
and let “S” be the start symbol and consider the input “cad”. The first step of the derivation is
given by the following
      S
    / | \
   c  A  d
Consider replacing “A” with one of its productions “ab” and this would result in the following
      S
    / | \
   c  A  d
     / \
    a   b
The derived string is read from the leaves of the derivation tree from left to right. Thus the
string formed is "cabd", and this does not match the input string. So we try the alternate
production for "A", and this results in the string "cad".
      S
    / | \
   c  A  d
      |
      a
Thus in top-down parsing, we try all possible productions, and if the derived string doesn't
match, we backtrack and try an alternate production to derive the string.
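The backtracking just described can be sketched as a hand-written recognizer for grammar (9.7). The function names below are our own; this is only an illustrative sketch, not the module's implementation:

```python
def match_A(s, pos):
    """Try each A-production in order; return the positions after A."""
    results = []
    if s[pos:pos + 2] == "ab":    # A -> a b, tried first
        results.append(pos + 2)
    if s[pos:pos + 1] == "a":     # A -> a, the alternative on backtrack
        results.append(pos + 1)
    return results

def match_S(s):
    """S -> c A d: succeed iff the whole input matches."""
    if s[:1] != "c":
        return False
    for after_A in match_A(s, 1):   # backtrack over A's alternatives
        if s[after_A:] == "d":      # the rest of the input must be 'd'
            return True
    return False
```

On input "cad" the first alternative A → ab fails to leave a "d", so the recognizer backtracks to A → a and succeeds; on "cabd" the first alternative succeeds immediately.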
As backtracking is a typical characteristic of top-down parsing, these parsers have to handle it
while validating sentences; such parsers are called recursive descent parsers. A variation of the
recursive descent parser is the predictive parser, which avoids backtracking. One such type of
predictive parser is the LL parser, where the first "L" stands for the input being scanned from
left to right and the second "L" stands for leftmost derivation.
9.3.3 Bottom Up Parsers
As already discussed, bottom-up parsing starts from the string and combines (reduces) its
symbols to reach the start symbol in a bottom-up fashion. Consider the same string "cad" for the
grammar given in (9.7).
c a d
c A d    (reducing "a" to A)
  S      (reducing "c A d" to S)
As the start symbol "S" could be reached, this string belongs to the grammar. LR parsers are
bottom-up parsers, where the "L" indicates the input being scanned from left to right and the
"R" indicates that a rightmost derivation (in reverse) is applied.
9.3.4 Property used by parsers
Top-down and bottom-up parsers parse the string based on the viable-prefix property. The
property states that the parser detects an error as soon as it has seen a prefix of the input that is
not a prefix of any string in the language, and it can then recover from the error before the
string is fully processed. This property is based on identifying the possible prefixes of all
strings that belong to a given context free grammar. All programming language constructs are
defined using context free grammars. For example, consider the following grammar with "stmt"
as the start symbol, {stmt, E} as non-terminals, {if, then, else, a, b} as terminals, and
productions defined as follows:
stmt → if E then stmt else stmt
stmt → if E then stmt
stmt → a
E → b
This grammar defines the CFG for the "if-else" programming construct. Context free grammars
are defined in this way for all programming constructs, and all strings that are part of the
programming language are based on these constructs. Hence, parsers are designed with the CFG
in mind. The following are some of the top-down and bottom-up parsers which we will look
into in this course.
• Top-Down parser - LL(k): Input is scanned from left-to-right, Leftmost derivation, k
tokens lookahead
• Bottom Up Parsers - LR(k): Input is scanned from left-to-right, Rightmost derivation, k
tokens lookahead
9.4 LL Parsers
LL parsers are non-recursive top-down parsers. The parsing action is driven by a table to avoid
backtracking. LL(k) parsers use k tokens of lookahead. In this module, the value of k is chosen
as 1, where we consider only 1 symbol of lookahead.
However, LL parsers cannot handle a left recursive grammar or a grammar that has left
factoring. Formally, a grammar is left recursive if
∃A ∈ V such that there is a derivation A ⇒+ Aα, for some string α ∈ (V ∪ T)*
For example, in the grammar A → Aα | β | γ, the production A → Aα is said to be left
recursive, as the left-hand side non-terminal is A and the first symbol on the right-hand side of
the production is also A.
On the other hand, when a non-terminal has two or more productions that share a common
prefix of grammar symbols, the grammar is said to have the left-factor property. Formally, a
grammar has the left-factor property if
∃α ∈ (V ∪ T)+ and strings β1, β2, …, βn such that there exist productions A → αβ1 | αβ2 | … | αβn
For example, consider a grammar involving the following productions:
A → αβ1 | αβ2 | … | αβn | γ
In this grammar, the prefix α is common to the first n productions; hence, while deriving a
string, one would not know which production to substitute. This involves a lot of backtracking,
and therefore the grammar should be left factored before LL(1) parsing.
9.4.1 Elimination of Left Recursion
This is the first step of the pre-processing that is necessary for a grammar to be parsed by an
LL(1) parser. As already discussed, if A → Aα is a production, it is a left recursive
A-production, and this needs to be removed from the grammar. The algorithm for removing left
recursion is given in algorithm 9.1.
LeftrecursionEliminate(G)
{
  Arrange the non-terminals in some order A1, A2, …, An
  for i = 1, …, n do
    for j = 1, …, i-1 do
      replace each production of the form
        Ai → Aj γ
      with
        Ai → δ1 γ | δ2 γ | … | δk γ
      where
        Aj → δ1 | δ2 | … | δk are all the current Aj-productions
    enddo
    eliminate the immediate left recursion among the Ai-productions
  enddo
}
Algorithm 9.1 Left recursion elimination
The first step arranges all the non-terminals in some order, starting from the start symbol. The
two nested loops pair every non-terminal with each non-terminal that precedes it in this order.
The logic behind the elimination algorithm is to substitute for any non-terminal that appears at
the start of another non-terminal's production, thus exposing any indirect left recursion as
immediate left recursion. After forming the new productions, the immediate left recursion is
eliminated using the following procedure. Consider a non-terminal A with the following left
recursive and non-left-recursive productions:
A → Aα1 | Aα2
A → β1 | β2
Left recursion is eliminated by converting the grammar to a right-recursive grammar using the
following procedure. A new non-terminal is introduced for every non-terminal that has a left
recursive production. The non-left-recursive productions are suffixed with this new
non-terminal, and the new non-terminal produces the suffix (the α part) of each left recursive
production's RHS, again suffixed by the new non-terminal, together with an ε-production:
A → β1 AR | β2 AR
AR → α1 AR | α2 AR | ε
As can be seen from the above set of productions, the grammar has now become right recursive
on the new non-terminal AR.
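The procedure above can be sketched in code. The helper below is our own illustration (using our own naming convention nt + "R" for the new non-terminal); productions are represented as lists of symbols, and the empty list stands for ε:

```python
def remove_immediate_left_recursion(nt, productions):
    """Split A's productions into A -> A alpha (left recursive) and
    A -> beta, then rebuild them as A -> beta AR, AR -> alpha AR | eps."""
    alphas = [p[1:] for p in productions if p and p[0] == nt]
    betas = [p for p in productions if not p or p[0] != nt]
    if not alphas:                        # no immediate left recursion
        return {nt: productions}
    new_nt = nt + "R"                     # fresh non-terminal, e.g. ER
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],  # [] is eps
    }
```

Applied to the productions E → E + T | T of Example 9.2, `remove_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])` yields E → T ER and ER → + T ER | ε, the right-recursive form described above.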
Example 9.1
Remove left recursion from the following grammar
A → B C | a
B → C A | A b
C → A B | C C | a
The non-terminals are arranged in the order A, B, C to check for direct and indirect left
recursion. The first iteration of "i" doesn't do anything, as "j" only runs up to "i-1".
i = 1: nothing to do
The second iteration considers the non-terminal B for removing left recursion. For this iteration
of “i" there is only one iteration for “j”.
i = 2, j = 1: B → C A | A b
becomes B → C A | B C b | a b
Substituting the A-productions (B C and a) into A b introduces an immediate left recursion in
the production B → B C b. This is removed by introducing a new non-terminal for B, called BR:
(imm) B → C A BR | a b BR
      BR → C b BR | ε
The third iteration of "i" has two iterations of "j", checking the indirect left recursion of C with
respect to A as well as B. As can be seen from the first iteration of "j", substituting for A does
not produce an immediate left recursion:
i = 3, j = 1: C → A B | C C | a
becomes C → B C B | a B | C C | a
With this as the basis, the second iteration of "j" substitutes for B, and this introduces a left
recursion:
i = 3, j = 2: C → B C B | a B | C C | a
becomes C → C A BR C B | a b BR C B | a B | C C | a
The immediate left recursion for C is removed by introducing a new non-terminal for C, called
CR:
(imm) C → a b BR C B CR | a B CR | a CR
      CR → A BR C B CR | C CR | ε
Example 9.2
Remove left recursion from the following grammar
E → E + T | T
T → T * F | F
F → ( E ) | id
The non-terminals E and T are directly left recursive. No T-production or F-production begins
with E, and no F-production begins with T, so the substitutions in the inner loop do nothing and
no indirect left recursion arises. Hence, the non-terminals E and T need two new non-terminals
E' and T', and the grammar with left recursion removed is given below:
• E → T E'
• E' → + T E' | ε
• T → F T'
• T' → * F T' | ε
• F → ( E ) | id
9.4.2 Left Factoring
After removing left recursion from the grammar, left factoring of the grammar is done. Left
factoring also involves introducing a new non-terminal, which produces the suffixes that follow
the common prefix. The existing non-terminal produces the common prefix followed by the
new non-terminal. In the following example, α is the common prefix and the new non-terminal
is AR; thus A is made to produce α AR along with its other productions, and AR produces all
the distinct suffixes of the factored productions of A.
Input grammar:
A → αβ1 | αβ2 | … | αβn | γ
Left-factored grammar:
A → α AR | γ
AR → β1 | β2 | … | βn
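One left-factoring step can likewise be sketched in code. The helpers below are our own illustration (again using nt + "R" for the new non-terminal); productions are lists of symbols, the empty list stands for ε, and for simplicity the sketch assumes at most one group of productions shares a first symbol:

```python
def common_prefix(seqs):
    """Longest list of symbols shared as a prefix by all of `seqs`."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) > 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor_once(nt, productions):
    """One left-factoring pass over the productions of `nt`."""
    groups = {}
    for p in productions:                     # group by first symbol
        groups.setdefault(p[0] if p else None, []).append(p)
    result = {nt: []}
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[nt].extend(group)          # nothing to factor here
        else:
            alpha = common_prefix(group)      # the common prefix
            new_nt = nt + "R"                 # fresh non-terminal, e.g. SR
            result[nt].append(alpha + [new_nt])
            result[new_nt] = [p[len(alpha):] for p in group]  # suffixes
    return result
```

Applied to Example 9.3 below, the two S-productions starting with i share the prefix i C t S, so S becomes S → i C t S SR | a with SR → ε | e S.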
Example 9.3
Left factor the following grammar.
S → iCtS | iCtSeS | a
C → b
This grammar does not have left recursion and hence can be directly left factored. A new
non-terminal S' is introduced; the common prefix iCtS of the first two S-productions is suffixed
with S', and the other production of S is retained. The new non-terminal S' produces the suffixes
of the productions that share the common prefix, which in this example are "ε" and "eS".
• S → iCtSS' | a
• S' → eS | ε
• C → b
The modified grammar is now ready for the implementation of LL(1) parser.
Summary
In this module, derivations and derivation trees for deriving a string using a Context Free
Grammar are discussed. The types of derivation and parsing and the various parsers are briefly
described. The pre-processing tasks of left recursion elimination and left factoring, which are
needed for the implementation of an LL(1) parser, are discussed. The next module will focus on
constructing the LL(1) parsing table from the pre-processed grammar.