Page 1: • a Process of Recognizing the Lexical Components in A

Scanning

• A process of recognizing the lexical components in a source string

• Type-3 grammars: A ::= tB | t or A ::= Bt | t

• Type-2 grammars: A ::= π, where π is a string of terminal and nonterminal symbols

• The lexical features of a language can be specified using Type-3 (regular) grammars

• This facilitates automatic construction of efficient recognizers for the lexical features of the language

• In fact, the scanner generator LEX generates such recognizers from the string specifications input to it.

Page 2:

Scanning

• E.g.

    DO 10 I = 1, 2
    DO 10 I = 1.2

• The former is a DO statement while the latter is an assignment to a variable named DO10I (blanks are ignored)

• Thus the scanner cannot classify the leading DO until it has looked further ahead: the presence of the ‘,’ identifies the former as a DO statement, and its absence identifies the latter as an assignment statement

• Fortunately, modern PLs do not contain such constructs.
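
The DO example can be sketched in a few lines of Python. This is only an illustration, not a Fortran scanner: the function name `classify` is hypothetical, and the sketch assumes statements of exactly the simple shape shown on the slide (no parentheses, labels or string constants).

```python
def classify(stmt: str) -> str:
    """Classify a statement of the shape used in the DO example."""
    compact = stmt.replace(" ", "")          # blanks are ignored
    left, _, right = compact.partition("=")
    # Only a ',' after the '=' shows that the leading DO is a keyword.
    if compact.startswith("DO") and "," in right:
        return "DO statement"
    return f"assignment to {left}"

print(classify("DO 10 I = 1, 2"))   # DO statement
print(classify("DO 10 I = 1.2"))    # assignment to DO10I
```

The decision depends on a symbol that appears well after the first lexical component, which is why such a construct is awkward for a scanner.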

Page 3:

Scanning

Reason for separating scanning from parsing:

• It is clear that each Type-3 production specifying a lexical component is also a Type-2 production

• Hence it is possible to write a single set of Type-2 productions which specifies both lexical and syntactic components of the source language

• However, a recognizer for Type-3 productions is simpler, easier to build and more efficient during execution than a recognizer for Type-2 productions

Page 4:

Finite State Automaton

• A Finite State Automaton (FSA) is a triple (S, Σ, T) where

S is a finite set of states, one of which is the initial state sinit, and one or more of which are the final states

Σ is the alphabet of source symbols

T is a finite set of state transitions defining transitions out of each si ∈ S on encountering the symbols of Σ

Page 5:

Finite State Automaton

• A transition out of si ∈ S on encountering a symbol symb ∈ Σ has the label symb

• We say a symbol symb is recognized by an FSA when the FSA makes a transition labeled symb.

• The transitions in an FSA can be represented in the form of a state transition table (STT) which has one row for each state si ∈ S and one column for each symbol symb ∈ Σ

Page 6:

Finite State Automaton

• An entry STT(si, symb) in the table indicates the id of the new state entered by the FSA if there exists a transition labeled symb in state si

• If the FSA does not contain a transition out of state si for symb, we leave STT(si, symb) blank

Page 7:

Finite State Automaton

• A state transition can also be represented as a triple (old state, source symbol, new state)

• Thus, the entry STT (si, symb) = sj and the triple (si, symb, sj) are equivalent
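
The equivalence between STT entries and triples can be illustrated with a short Python sketch. The automaton below (states `s0`, `s1` over the symbols `a`, `b`) is hypothetical and chosen only for illustration, as are the helper names. A missing dictionary key plays the role of a blank table entry.

```python
# Hypothetical two-state FSA over the symbols 'a' and 'b'.
STT = {
    ("s0", "a"): "s1",
    ("s1", "a"): "s1",
    ("s1", "b"): "s0",
}

def stt_to_triples(stt):
    """Each entry STT(si, symb) = sj becomes the triple (si, symb, sj)."""
    return {(old, symb, new) for (old, symb), new in stt.items()}

def triples_to_stt(triples):
    """Inverse mapping; assumes at most one triple per (state, symbol) pair."""
    return {(old, symb): new for old, symb, new in triples}

triples = stt_to_triples(STT)
print(("s0", "a", "s1") in triples)        # True
print(triples_to_stt(triples) == STT)      # True
print(STT.get(("s0", "b")))                # None, i.e. a blank entry
```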

Page 8:

Finite State Automaton

• The operation of an FSA is determined by its current state sc.

• The FSA actions are limited to the following:

– Given a source symbol x at its input, it checks to see if STT(sc, x) is defined, that is, if STT(sc, x) = sj for some sj.

Page 9:

Deterministic Finite State Automaton

• A deterministic finite state automaton (DFA) is an FSA such that t1 ∈ T, t1 ≡ (si, symb, sj) implies there does not exist t2 ∈ T, t2 ≡ (si, symb, sk) with sk ≠ sj

• Transitions in a DFA are deterministic, that is, at most one transition exists in state si for a symbol symb
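
The determinism condition is easy to check mechanically when the transitions are given as triples. The following is a minimal sketch; the function name `is_deterministic` and the sample transitions are assumptions made for illustration.

```python
def is_deterministic(transitions):
    """Return True if no (state, symbol) pair leads to two different states."""
    targets = {}
    for old, symb, new in transitions:
        # setdefault records the first target seen for (old, symb)
        if targets.setdefault((old, symb), new) != new:
            return False
    return True

print(is_deterministic([("start", "d", "int"), ("int", "d", "int")]))  # True
print(is_deterministic([("s0", "a", "s1"), ("s0", "a", "s2")]))        # False
```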

Page 10:

Deterministic Finite State Automaton

• At any point of time, the DFA would have recognized some prefix of the source string, possibly the null string

• It would next recognize the symbol pointed to by the pointer next symbol

Page 11:

Deterministic Finite State Automaton

• The operation of a DFA is history-sensitive because its current state is a function of the prefix recognized by it

• The DFA halts when all the symbols in the source string are recognized, or an error condition is encountered

• It can be seen that a DFA recognizes the longest valid prefix before stopping

Page 12:

Deterministic Finite State Automaton

• The validity of a string is determined by presenting it as input to a DFA that is in its initial state

• The string is valid iff the DFA recognizes every symbol in the string and finds itself in a final state at the end of the string.

• This fact follows from the deterministic nature of transitions in the DFA

Page 13:

EXAMPLE

<integer> ::= d | <integer>d

State      Next symbol
           d
start      int
int        int

[Diagram: the same DFA as a state graph, with a transition labeled d from start to int and a transition labeled d from int back to int]

• A transition from state si to sj on symbol symb is depicted by an arrow labeled symb from si to sj

• The initial and final states of the DFA are start and int respectively

Page 14:

EXAMPLE

• Transitions during the recognition of string 539 are as given:

Current state    Input symbol    New state
start            5               int
int              3               int
int              9               int

• The string leaves the DFA in the state int which is the final state, hence the string is a valid integer string.

• A string 5ab9 is an invalid string because no transition marked ‘letter’ exists in state int
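
The recognition of 539 and the rejection of 5ab9 can be reproduced with a table-driven sketch in Python. The STT is the one given for `<integer>`; the helper names `symbol_class` and `run` are assumptions, and mapping characters to the classes `d` and `letter` is one possible way to treat the input alphabet.

```python
STT = {("start", "d"): "int", ("int", "d"): "int"}
INITIAL, FINALS = "start", {"int"}

def symbol_class(ch):
    """Map a character to the symbol class used as a column of the STT."""
    if ch.isdigit():
        return "d"
    if ch.isalpha():
        return "letter"
    return ch

def run(source):
    """Return (is_valid, rows); each row is (current state, input symbol, new state)."""
    state, rows = INITIAL, []
    for ch in source:
        new_state = STT.get((state, symbol_class(ch)))
        if new_state is None:        # blank STT entry: error, the DFA halts
            return False, rows
        rows.append((state, ch, new_state))
        state = new_state
    # valid only if every symbol was recognized and we end in a final state
    return state in FINALS, rows

for row in run("539")[1]:
    print(row)
print(run("5ab9"))   # (False, [('start', '5', 'int')])
```

For 5ab9 the rows returned cover only the prefix 5, which is the longest valid prefix recognized before the DFA stops.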

Page 15:

Regular Expressions

• In the preceding example, a single Type-3 rule was adequate to specify a lexical component

• However, many Type-3 rules would be needed to specify complex lexical components like real constants

• Hence we use a generalization of Type-3 productions called a regular expression

Page 16:

Example

• An organization uses an employee code which is obtained by concatenating the section id of an employee, which is alphabetic in nature, with a numeric code

• The structure of the employee code can be specified as

<section code> ::= l | <section code>l

<numeric code> ::= d | <numeric code>d

<employee code> ::= <section code><numeric code>

Page 17:

Example

• Note that a specification like

<s_code> ::= l | d | <s_code> l | <s_code> d

would be incorrect, because it also admits codes that begin with a digit or that interleave letters and digits!

• The regular expression generalizes on Type-3 rules by permitting multiple occurrences of a string form, and concatenation of strings

Page 18:

Regular Expression

Regular expression     Meaning
r                      string r
s                      string s
r.s or rs              concatenation of r and s
(r)                    same meaning as r
r | s or (r | s)       alternation, i.e., string r or string s
(r) | (s)              alternation
[r]                    an optional occurrence of string r
(r)*                   ≥ 0 occurrences of string r
(r)+                   ≥ 1 occurrences of string r

Page 19:

Example

• Thus the employee codes can be specified by the regular expression

(l)+ (d)+

• Some other examples of regular expressions are

Integer                               [ + | - ] (d)+
Real number                           [ + | - ] (d)+ . (d)+
Real number with optional fraction    [ + | - ] (d)+ . (d)*
Identifier                            l (l | d)*
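
For readers who want to experiment, these expressions can be transcribed into the syntax of Python's standard `re` module. The transcription is an assumption of this sketch: `l` is taken to mean an ASCII letter and `d` an ASCII digit, the optional `[ + | - ]` becomes `[+-]?`, and the constant names are hypothetical.

```python
import re

EMPLOYEE_CODE = re.compile(r"[A-Za-z]+[0-9]+")          # (l)+ (d)+
INTEGER = re.compile(r"[+-]?[0-9]+")                    # [ + | - ] (d)+
REAL = re.compile(r"[+-]?[0-9]+\.[0-9]+")               # [ + | - ] (d)+ . (d)+
REAL_OPT_FRACTION = re.compile(r"[+-]?[0-9]+\.[0-9]*")  # [ + | - ] (d)+ . (d)*
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")        # l (l | d)*

def matches(pattern, text):
    """True if the whole of text is described by the pattern."""
    return pattern.fullmatch(text) is not None

print(matches(EMPLOYEE_CODE, "acct123"))     # True
print(matches(EMPLOYEE_CODE, "a1b2"))        # False: letters and digits interleaved
print(matches(REAL, "3."))                   # False: the fraction is required
print(matches(REAL_OPT_FRACTION, "3."))      # True: the fraction may be empty
```

`fullmatch` is used so that the entire string, not just a prefix, must fit the expression.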

Page 20:

Building DFA

• The lexical components of a source language can be specified by a set of regular expressions

• Since an input string may contain any one of these lexical components, it is necessary to use a single DFA as a recognizer for valid lexical strings in the language

• Such a DFA has a single initial state and one or more final states for each lexical component

Page 21:

Example

State     Next symbol
          l       d       .
start     id      int
id        id      id
int               int     s2
s2                real
real              real

[Diagram: the same DFA as a state graph over the states start, id, int, s2 and real, with transitions labeled l, d and .]
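
The combined table can be run directly. The sketch below is one possible encoding in Python: the STT entries are taken from the table, while treating id, int and real as the final states, the helper names, and the mapping of characters to the classes `l` and `d` are assumptions.

```python
STT = {
    ("start", "l"): "id",  ("start", "d"): "int",
    ("id", "l"): "id",     ("id", "d"): "id",
    ("int", "d"): "int",   ("int", "."): "s2",
    ("s2", "d"): "real",
    ("real", "d"): "real",
}
FINAL_STATES = {"id", "int", "real"}   # s2 is not final: "4." is incomplete

def symbol_class(ch):
    """Map a character to a column of the STT."""
    if ch.isalpha():
        return "l"
    if ch.isdigit():
        return "d"
    return ch

def recognize(lexeme):
    """Return the final state reached for lexeme, or None if it is invalid."""
    state = "start"
    for ch in lexeme:
        state = STT.get((state, symbol_class(ch)))
        if state is None:              # blank entry in the table
            return None
    return state if state in FINAL_STATES else None

for text in ("abc1", "42", "4.5", "4.", "4a"):
    print(text, recognize(text))
```

The final state in which the DFA stops tells the scanner which kind of lexical component it has just recognized.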

Page 22:

Performing Semantic Actions

• Semantic actions during scanning concern table building and construction of tokens for lexical components

• These actions are associated with the final states of a DFA

• The semantic actions associated with a final state sf are performed after the DFA recognizes the longest valid prefix of the source string corresponding to sf

Page 23:

Writing a Scanner

Regular expression: [ + | - ] (d)+
Semantic actions: {Enter the string in the table of integer constants, say in entry n. Return the token Int#n}

Regular expression: [ + | - ] ((d)+ . (d)* | (d)+ . (d)+)
Semantic actions: {Enter the string in the table of real constants, say in entry m. Return the token Real#m}

Regular expression: l (l | d)*
Semantic actions: {Compare with reserved words. If a match is found, return the token Kw#k, else enter in the symbol table and return the token Id#i}
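
A minimal Python sketch of these semantic actions follows. Several details are assumptions made for illustration and are not stated on the slide: the class name `Scanner`, the reserved-word list, the 1-based entry numbers, and the decision to reuse an existing table entry when the same string is seen again. Lexemes are assumed to be already separated.

```python
import re

RESERVED = ["if", "then", "else"]              # hypothetical reserved words
INTEGER = re.compile(r"[+-]?[0-9]+")
REAL = re.compile(r"[+-]?[0-9]+\.[0-9]*")      # covers both alternatives above
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

class Scanner:
    def __init__(self):
        self.int_table, self.real_table, self.symbol_table = [], [], []

    @staticmethod
    def _enter(table, lexeme):
        """Enter lexeme in table if absent; return its 1-based entry number."""
        if lexeme not in table:
            table.append(lexeme)
        return table.index(lexeme) + 1

    def token_for(self, lexeme):
        """Perform the semantic action for the expression that matches lexeme."""
        if INTEGER.fullmatch(lexeme):
            return f"Int#{self._enter(self.int_table, lexeme)}"
        if REAL.fullmatch(lexeme):
            return f"Real#{self._enter(self.real_table, lexeme)}"
        if IDENTIFIER.fullmatch(lexeme):
            if lexeme in RESERVED:             # compare with reserved words
                return f"Kw#{RESERVED.index(lexeme) + 1}"
            return f"Id#{self._enter(self.symbol_table, lexeme)}"
        raise ValueError(f"invalid lexical component: {lexeme!r}")

scanner = Scanner()
print([scanner.token_for(x) for x in "if x1 then 42 else 3.5 x1".split()])
# ['Kw#1', 'Id#1', 'Kw#2', 'Int#1', 'Kw#3', 'Real#1', 'Id#1']
```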

Page 24:

Parsing

• The goals of parsing are

– To check the validity of a source string, and

– To determine its syntactic structure

Page 25:

Parsing

• For an invalid string the parser issues diagnostic messages reporting the cause and nature of error(s) in the string

• For a valid string it builds a parse tree to reflect the sequence of derivations or reductions

• The parse tree is passed on to the subsequent phases of the compiler

Page 26:

Parsing

• The fundamental step in parsing is to derive a string from an NT, or reduce a string to an NT

• This gives rise to two fundamental approaches to parsing:

– Top down parsing

– Bottom up parsing

Page 27:

Parse Trees

• A parse tree depicts the steps in parsing, hence it is useful for understanding the process of parsing

• However, it is a poor intermediate representation for a source string because it contains too much information as far as subsequent processing in the compiler is concerned

Page 28:

Abstract Syntax Trees

• An abstract syntax tree (AST) represents the structure of a source string in a more economical manner

• The word ‘abstract’ implies that it is a representation designed by a compiler designer for his own purposes

• Thus the designer has total control over the information represented in an AST

Page 29:

Example

[Figure: a parse tree with nodes labeled E, T, F, P and <id> and the operators + and *, shown next to the corresponding abstract syntax tree, which contains only the operators + and * and three <id> leaves]
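
The difference in size between the two trees can be made concrete with a small Python sketch. The source string `<id> + <id> * <id>` and the exact shape of both trees are assumptions made for this illustration; the node labels E, T, F, P follow the figure. Trees are written as nested tuples `(label, child, ...)` and leaves as plain strings.

```python
# Assumed parse tree for <id> + <id> * <id>
parse_tree = (
    "E",
    ("E", ("T", ("F", ("P", "<id>")))),
    "+",
    ("T",
        ("T", ("F", ("P", "<id>"))),
        "*",
        ("F", ("P", "<id>"))),
)

# Abstract syntax tree for the same string: operators with their operands
ast = ("+", "<id>", ("*", "<id>", "<id>"))

def count_nodes(tree):
    """Count the label of a node plus all nodes in its subtrees."""
    if isinstance(tree, str):
        return 1
    _label, *children = tree
    return 1 + sum(count_nodes(child) for child in children)

print(count_nodes(parse_tree))   # 16
print(count_nodes(ast))          # 5
```

The chains E, T, F, P above each operand carry no information that later phases need, which is why the AST is the more economical representation.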

Page 30:

Top Down Parsing

• Top down parsing according to grammar G attempts to derive a string matching a source string through a sequence of derivations starting with the distinguished symbol of G

• For a valid source string α, a top down parse thus determines a derivation sequence

S ⇒ … ⇒ … ⇒ α

Page 31:

Algorithm (Naive Top Down Parsing)

1. Current sentential form (CSF) := ‘S’;

2. Let CSF be of the form βAπ, such that β is a string of Ts (note that β may be null), and A is the leftmost NT in CSF. Exit with success if CSF = α

3. Make a derivation A ⇒ β1Bδ according to a production A ::= β1Bδ of G such that β1 is a string of Ts (again β1 may be null). This makes CSF = ββ1Bδπ

4. Go to Step 2
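
A single derivation step of this algorithm, replacing the leftmost NT of the CSF by the right-hand side of a production, can be sketched in Python. The grammar (E ::= T + E | T, T ::= V * T | V, V ::= <id>) and the function name are assumptions made for illustration. The productions are picked by hand here, so the sketch shows only how the CSF evolves, not how a production is chosen.

```python
NONTERMINALS = {"E", "T", "V"}   # hypothetical grammar, see the lead-in

def derive_leftmost(csf, rhs):
    """Replace the leftmost NT in csf by rhs and return the new CSF."""
    for i, symbol in enumerate(csf):
        if symbol in NONTERMINALS:
            return csf[:i] + rhs + csf[i + 1:]
    raise ValueError("CSF contains no nonterminal")

csf = ["E"]                       # step 1: CSF := distinguished symbol
for rhs in (["T", "+", "E"], ["V"], ["<id>"], ["T"], ["V"], ["<id>"]):
    csf = derive_leftmost(csf, rhs)   # step 3, repeated
    print(" ".join(csf))
# The last line printed is: <id> + <id>
```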

Page 32:

Description

• Since we make a derivation for the leftmost NT at any stage, top down parsing is also known as LL parsing (the source string is scanned from left to right and the parse follows a leftmost derivation)

• The algorithm lacks one vital provision from a practical viewpoint

• Let CSF ≡ γCδ with C as the leftmost NT in it and let the grammar production for C be

C ::= ρ | σ where each of ρ, σ is a string of terminal and non-terminal symbols

Page 33:

Description

• Which RHS alternative should the parser choose for the next derivation?

• The alternative we choose may lead us to a string of Ts which does not match the source string

• In such cases, other alternatives would have to be tried out until we derive a sentence that matches the source string (i.e., a successful parse)

• Or, until we have systematically generated all possible sentences without obtaining a sentence that matches the source string (i.e., an unsuccessful parse)

Page 34:

Description

• A naïve approach to top down parsing would be to generate complete sentences of the source language and compare them with α to check if a match exists

• We introduce a check, called the continuation check, to determine whether the current sequence of derivations may be able to find a successful parse of α

Page 35:

Description

• This check is performed as follows:

– Let CSF be of the form βAπ, where β is a string of n Ts

– All sentential forms derived from CSF would have the form β…

– Hence, for a successful parse, β must match the first n symbols of α

– We can apply this check at every parsing step, and abandon the current sequence of derivations any time this condition is violated

Page 36:

Description

• The continuation check may be applied incrementally as follows:

– Let CSF ≡ βAπ; then the source string must be β… (else we would have abandoned this sequence of derivations earlier)

– If the prediction for A is A ⇒ β1Bδ, where β1 is a string of m terminal symbols, then β1 must match the m symbols following β in the source string

– Hence we compare β1 with the m symbols following β in the source string

• This incremental check is more economical than a continuation check which compares the string ββ1 with (n+m) symbols in the source string
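
The saving can be seen by writing both checks side by side. In this Python sketch the function names and the sample strings are assumptions. Each function returns its verdict together with the number of source symbols it has to look at in the worst case.

```python
def full_check(beta, beta1, source):
    """Compare the whole of beta followed by beta1 with the first n + m source symbols."""
    prefix = beta + beta1
    return source[:len(prefix)] == prefix, len(prefix)

def incremental_check(beta1, source, n):
    """Compare only beta1 with the m symbols that follow the first n symbols."""
    m = len(beta1)
    return source[n:n + m] == beta1, m

source = ["<id>", "+", "<id>", "*", "<id>"]
beta = ["<id>", "+"]             # n = 2 symbols, already matched earlier
beta1 = ["<id>", "*"]            # m = 2 symbols contributed by the prediction

print(full_check(beta, beta1, source))              # (True, 4)
print(incremental_check(beta1, source, len(beta)))  # (True, 2)
```

Both checks reach the same verdict; the incremental one simply relies on β having been matched before.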

Page 37:

Predictions and Backtracking

• A typical stage in top-down parsing can be depicted as follows:

CSF ≡ βAπ

Source string: β t …
                 ↑
                 SSM

where CSF ≡ βAπ implies S ⇒* βAπ, and SSM (the source string marker) points at the first symbol following β in the source string, i.e., at the terminal symbol ‘t’

Page 38:

Predictions and Backtracking

• Parsing proceeds as follows:

– Identify the leftmost non terminal in CSF, i.e., A

– Select an alternative on the RHS of the production of A

– Since we do not know whether the string will satisfy the continuation check, we call this choice a prediction

Page 39:

Predictions and Backtracking

– The continuation check is applied incrementally to the terminal symbol(s), if any, occurring in the leftmost position(s) of the prediction

– If the check succeeds, SSM is incremented and parsing continues

– If the check fails, one or more predictions are discarded and SSM is reset to its value before the rejected prediction(s) were made

– This is called backtracking. Parsing is now resumed
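
Predictions, the incremental continuation check and backtracking fit together as in the following Python sketch. It is one possible realization, not the algorithm of the slides verbatim: the grammar (E ::= T + E | T, T ::= V * T | V, V ::= <id>), the function names and the prediction log are assumptions. Recursion does the bookkeeping, so SSM is "reset" simply because each call keeps its own value of `ssm`.

```python
GRAMMAR = {                       # hypothetical grammar without left recursion
    "E": [["T", "+", "E"], ["T"]],
    "T": [["V", "*", "T"], ["V"]],
    "V": [["<id>"]],
}

def derive(csf, ssm, source, log):
    """Try to match the unmatched part of the CSF against source[ssm:]."""
    if not csf:
        return ssm == len(source)            # success only if all input is used
    head, rest = csf[0], csf[1:]
    if head not in GRAMMAR:                  # terminal: incremental continuation check
        if ssm < len(source) and source[ssm] == head:
            return derive(rest, ssm + 1, source, log)   # SSM is incremented
        return False                         # check failed
    for alternative in GRAMMAR[head]:        # each choice is a prediction
        log.append((head, alternative, ssm))
        if derive(alternative + rest, ssm, source, log):
            return True
        # prediction rejected: backtrack, ssm still has its earlier value
    return False

def parse(source):
    """Return (is_valid, log of predictions made)."""
    log = []
    return derive(["E"], 0, source, log), log

ok, log = parse(["<id>", "+", "<id>", "*", "<id>"])
print(ok, len(log))
print(parse(["<id>", "+"])[0])
```

Because the grammar has no left recursion, every chain of predictions reaches a terminal and the search terminates. The log shows that several predictions are typically made and discarded before the successful one.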