-
MODULE III
Introduction to compiling: Compilers, Analysis of a source program, The phases of a compiler.
Lexical Analysis: The role of the lexical analyzer, Input buffering, Specification of tokens, Recognition of tokens, Finite automata, Conversion of an NFA to a DFA, From a regular expression to an NFA.
-
COMPILERS
-
Introduction to Compilers
Translator
A translator is a program that takes a program written in one programming language as input and produces a program in another language as output. If the source language is a high-level language and the object language is a low-level language, then such a translator is called a compiler.
(Figure: Source Program -> Compiler -> Object Program)
-
Analysis of Source Program
The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them.
It then uses this structure to create an intermediate representation of the source program.
If the analysis part detects any error, it must provide informative messages, so the user can take corrective action.
The analysis part also collects information about the source program and stores it in a data structure, the symbol table (SYMTAB), which is passed along with the intermediate code to the synthesis part.
-
The synthesis part constructs the desired target program from
the intermediate representation and the information in the
SYMTAB.
The analysis part is often called the front end and the synthesis part is called the back end.
-
Phases of a compiler
(Figure: the phases of a compiler, in order.)
Source program -> Lexical Analyzer -> token stream -> Syntax Analyzer -> syntax tree -> Semantic Analyzer -> syntax tree -> Intermediate code generator -> intermediate representation -> Machine independent Code optimizer -> intermediate representation -> Code generator -> target machine code -> Machine dependent Code optimizer -> target machine code
The Symbol Table is shared by all phases.
-
Lexical Analysis (Scanning)
The first phase of a compiler.
The lexical analyzer reads the stream of characters from the source program and groups the characters into meaningful sequences called lexemes.
For each lexeme, the lexical analyzer produces a token as output of the form <token-name, attribute-value>, where token-name is an abstract symbol that is used during syntax analysis, and attribute-value points to an entry in the symbol table for this token.
-
Eg. position = initial + rate * 60
The lexemes and tokens are:
1) position is a lexeme that would be mapped into a token <id, 1>, where id is identifier and 1 points to the SYMTAB entry for position.
2) = is a lexeme that is mapped into a token <=>. Since this token needs no attribute value, we have omitted the second component.
3) initial - <id, 2>
4) + - <+>
5) rate - <id, 3>
6) * - <*>
7) 60 - <60>
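The mapping from lexemes to tokens above can be tried out with a small Python sketch. This is only an illustration under stated assumptions: the function name tokenize, the list-based symbol table, and the token name "number" for the constant 60 are choices made here, not part of the slides.

```python
import re

def tokenize(source):
    """Return (tokens, symtab) for a tiny assignment-statement language."""
    symtab = []   # hypothetical symbol table: position i holds the name for <id, i+1>
    tokens = []
    # One alternative per kind of lexeme: identifiers, integer constants, operators.
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[=+*]", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            if lexeme not in symtab:
                symtab.append(lexeme)           # install the name on first sight
            tokens.append(("id", symtab.index(lexeme) + 1))
        elif lexeme.isdigit():
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme, None))       # operators need no attribute value
    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
```

Here tokens holds one pair per lexeme, and symtab lists position, initial and rate in the order in which they were first seen.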
-
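Syntax Analysis (Parsing)
The second phase of the compiler.
The parser uses the first components of the tokens produced by the lexical analyzer to create syntax trees.
(Figure: syntax tree for the above example. The root is =, with children <id, 1> and +; the + node has children <id, 2> and *; the * node has children <id, 3> and 60.)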
-
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the SYMTAB to check the source program for semantic consistency with the language definition.
It also gathers type information and saves it in either the syntax tree or the SYMTAB, for subsequent use during intermediate code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands (eg, the compiler must report an error if a float value is used as an array index).
Eg. Suppose position, initial and rate are float numbers. The lexeme 60 is an integer. The type checker in the semantic analyzer discovers that the operator * is applied to a float number rate and an int 60, so the int 60 is converted to float.
-
Intermediate Code Generation
In the process of translation from source to target code, the compiler may construct one or more intermediate representations.
These intermediate representations should be (a) easy to produce and (b) easy to translate into the target machine language.
Eg.
    t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3
-
Code Optimization
The machine independent code optimization phase attempts to improve the intermediate code so that better target code will result.
A simple intermediate code generation algorithm followed by code optimization is a reasonable way to generate good target code.
The optimizer can deduce that the conversion of 60 from int to float can be done once, at compile time. So the inttofloat operation can be eliminated by replacing int 60 by float 60.0.
Eg.
    t1 = id3 * 60.0
    id1 = id2 + t1
-
Code Generation
The code generator takes as input an intermediate representation of the source program and maps it to the target language.
If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then the intermediate instructions are translated into sequences of machine instructions.
Eg.
    LDF  R2, id3
    MULF R2, R2, #60.0
    LDF  R1, id2
    ADDF R1, R1, R2
    STF  id1, R1
-
SYMBOL TABLE MANAGEMENT
An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name.
The symbol table is the data structure that holds one record per name. It should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.
-
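Translation of an assignment statement
(Figure: the statement position = initial + rate * 60 as it passes through each phase.)
Lexical Analyzer: <id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
Syntax Analyzer: syntax tree with = at the root, + below it, and * below +, ending in 60
Semantic Analyzer: the same tree with 60 replaced by inttofloat(60)
-
Intermediate code generator:
    t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3
Code optimizer:
    t1 = id3 * 60.0
    id1 = id2 + t1
-
Code Generator:
    LDF  R2, id3
    MULF R2, R2, #60.0
    LDF  R1, id2
    ADDF R1, R1, R2
    STF  id1, R1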
-
Role of Lexical Analyzer
The main task of the lexical analyzer is to read the input characters, group them into lexemes and produce tokens.
The stream of tokens is sent to the parser for syntax analysis.
-
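(Figure: Source Program -> Lexical Analyzer -> token -> Parser -> to semantic analysis. The parser requests each token with getnexttoken; both the lexical analyzer and the parser use the Symbol Table.)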
-
Tasks (Role) of Lexical Analyzer
1. Identification of lexemes.
2. Removal of comments and white spaces (blank, newline, tab etc).
3. Correlating error messages generated by the compiler with the source program.
Lexical Analyzer processes
1. Scanning consists of simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive white spaces into one.
2. Lexical analysis is the more complex process, where the scanner produces the sequence of tokens as output.
-
Tokens, Patterns and Lexemes
A token is a pair consisting of a token name and an optional attribute value.
A pattern is a description of the form that the lexemes of a token may take.
A lexeme is a sequence of characters in the source program that matches the pattern for a token.
-
INPUT BUFFERING
Specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.
Two buffers are alternately reloaded. Each buffer is of the same size N, where N is usually the size of a disk block.
(Figure: a buffer pair holding E = M * C * * 2 eof, with the pointers lexemeBegin and forward.)
-
Input Buffering
Two pointers are required:
lexemeBegin - marks the beginning of the current lexeme.
forward - scans ahead until a pattern match is found.
Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
-
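Sentinels
A sentinel is a special character placed at the end of each buffer, so that one test both detects the end of a buffer and identifies the character read.
The natural choice is the character eof.
Any eof that appears other than at the end of a buffer means that the input is at an end.
(Figure: sentinels at the end of each buffer. The first buffer holds E = M * eof and the second holds C * * 2 eof eof, with the pointers lexemeBegin and forward.)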
-
switch (*forward++) {
    case eof:
        if (forward is at the end of first buffer) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if (forward is at the end of second buffer) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
}
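The sentinel test above can be simulated in Python. This is a minimal sketch, not the scheme a production scanner would use: the class name TwoBufferInput, the choice of "\0" as the eof sentinel (assumed never to occur in the source text) and the tiny buffer size are all assumptions made for illustration, and the lexemeBegin pointer and retraction are left out.

```python
import io

EOF = "\0"   # assumed sentinel character; must not occur in the source text

class TwoBufferInput:
    """Two buffer halves of size n, each followed by a sentinel slot."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Layout: [0..n-1] first half, [n] sentinel,
        #         [n+1..2n] second half, [2n+1] sentinel.
        self.buf = [EOF] * (2 * n + 2)
        self._reload(0)
        self.forward = 0

    def _reload(self, start):
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            # A short read leaves eof inside the half: that marks end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        c = self.buf[self.forward]
        self.forward += 1
        if c != EOF:
            return c                      # common case: a single test per character
        at = self.forward - 1
        if at == self.n:                  # sentinel ending the first half
            self._reload(self.n + 1)      # forward already points at the second half
            return self.next_char()
        if at == 2 * self.n + 1:          # sentinel ending the second half
            self._reload(0)
            self.forward = 0
            return self.next_char()
        self.forward = at                 # eof within a buffer: end of input
        return None

inp = TwoBufferInput(io.StringIO("E=M*C**2"), n=4)
chars = []
while True:
    c = inp.next_char()
    if c is None:
        break
    chars.append(c)
```

With n = 4 the input E=M*C**2 fills both halves exactly, so both reload branches are exercised before the end of input is detected.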
-
SPECIFICATION OF TOKENS
-
Strings and Languages
An alphabet is a finite set of symbols. The set {0, 1} is the binary alphabet.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
|s| represents the length of a string s. Ex: banana is a string of length 6. The empty string, denoted ε, has length 0.
A language is any countable set of strings over some fixed alphabet. Abstract languages such as ∅, the empty set, or {ε}, the set containing only the empty string, are also languages.
The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.
Exponentiation of strings: s^0 is ε, and for all i > 0, s^i is s^(i-1)s. Since εs = s, it follows that s^1 = s, s^2 = ss, s^3 = sss, and so on.
-
Operations on Languages
-
Operations on Languages (contd.)
Example: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits. Other languages constructed from L and D are:
1. L ∪ D is the set of letters and digits; strictly speaking, the language with 62 (52 + 10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52 × 10) strings of length two, each consisting of one letter followed by one digit. Ex: A1, a1, B0, etc.
3. L^4 is the set of all 4-letter strings. Ex: aaba, bcef.
4. L* is the set of all strings of letters, including ε.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
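The counts in items 1 and 2 can be checked with Python sets of strings. This is a small sketch; the helper names concat and power are chosen here for illustration.

```python
import string

L = set(string.ascii_letters)   # the 52 one-letter strings
D = set(string.digits)          # the 10 one-digit strings

def concat(X, Y):
    """Concatenation XY: every string of X followed by every string of Y."""
    return {x + y for x in X for y in Y}

def power(X, i):
    """X^i, with X^0 = {ε} (the empty string is "" in Python)."""
    result = {""}
    for _ in range(i):
        result = concat(result, X)
    return result

L_union_D = L | D
LD = concat(L, D)
```

L* and D+ are infinite, so they cannot be built as sets; only finite powers such as power({"a", "b"}, 4) can be enumerated this way.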
-
Regular Expression & Regular language
Regular Expression
A notation that allows us to define a pattern in a high level language.
Regular language
Each regular expression r denotes a language L(r) (the set of sentences relating to the regular expression r).
Note: Each word in a program can be expressed by a regular expression.
-
Eg. Suppose we want to describe the set of valid C identifiers.
If letter_ stands for any letter or the underscore, and digit stands for any digit, then we would describe the language of C identifiers by:
    letter_ (letter_ | digit)*
The | means union. The parentheses ( ) are used to group subexpressions. The * means "zero or more occurrences of" the expression it follows.
The juxtaposition of letter_ with the remainder of the expression signifies concatenation.
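The same pattern can be written with Python's re module. This is a sketch for illustration: the character classes [A-Za-z_] and [A-Za-z_0-9] stand in for letter_ and (letter_ | digit), and the name is_identifier is chosen here.

```python
import re

# letter_ (letter_ | digit)*
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def is_identifier(s):
    """True if the whole string s matches the identifier pattern."""
    return IDENTIFIER.fullmatch(s) is not None
```

For example, is_identifier("rate") and is_identifier("_tmp1") hold, while is_identifier("60x") does not, because the first symbol must come from letter_.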
-
Rules for constructing regular expressions
The regular expressions are built recursively out of smaller regular expressions using the following rules.
BASIS:
1. ε is a regular expression denoting {ε}, the language containing only the empty string. L(ε) = {ε}.
2. If a is a symbol in the alphabet Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position. (We use italics for symbols and boldface for their corresponding regular expressions.)
-
INDUCTION:
Let r and s be regular expressions with languages L(r) and L(s). Then
1. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting the language (L(r))*.
4. (r) is a regular expression denoting the language L(r).
-
Precedence
* has the highest precedence.
Concatenation has the second highest precedence.
| has the lowest precedence.
Eg. (a) | ((b)*(c)) may be replaced by a | b*c.
-
Examples
-
Algebraic laws of Regular Expressions
-
Regular Definitions
We can give names to certain regular expressions and use those names in subsequent expressions.
    d1 -> r1
    d2 -> r2
    .....
    dn -> rn
-
e.g. C identifiers are strings of letters, digits and underscores.
    letter_ -> A | B | ... | Z | a | b | ... | z | _
    digit   -> 0 | 1 | 2 | ... | 9
    id      -> letter_ (letter_ | digit)*
This can also be written as
    letter_ -> [A-Za-z_]
    digit   -> [0-9]
    id      -> letter_ (letter_ | digit)*
We shall conventionally use italics for the symbols defined in regular definitions.
-
Recognition of tokens
In this topic we study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
-
Consider the following example.
    stmt -> if expr then stmt
          | if expr then stmt else stmt
          | ε
    expr -> term relop term
          | term
    term -> id
          | number
A grammar for branching statements
-
For relop, we use the comparison operators.
The patterns for the tokens are:
    digit  -> [0-9]
    digits -> digit+
    number -> digits (. digits)? (E [+-]? digits)?
    letter -> [A-Za-z]
    id     -> letter (letter | digit)*
    if     -> if
    then   -> then
    else   -> else
    relop  -> < | > | <= | >= | = | <>
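The regular definition for number can be assembled step by step in Python, mirroring the way digits is defined from digit and number from digits. This is only a sketch; the variable names follow the definitions above, and the dot is escaped because a bare dot means "any character" in Python's regular expression syntax.

```python
import re

digit = r"[0-9]"
digits = rf"{digit}+"
# digits (. digits)? (E [+-]? digits)?
number = rf"{digits}(\.{digits})?(E[+-]?{digits})?"

NUMBER = re.compile(number)

def is_number(s):
    """True if the whole string s is an unsigned number under the pattern above."""
    return NUMBER.fullmatch(s) is not None
```

Strings such as 60, 3.14 and 6.02E+23 match, while 1. and .5 do not, since the pattern requires digits on both sides of the decimal point.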
-
The token for white space is
    ws -> (blank | tab | newline)+
Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace.
-
LEXEMES       TOKEN NAME    ATTRIBUTE VALUE
Any ws        -             -
if            if            -
then          then          -
else          else          -
Any id        id            Pointer to table entry
Any number    number        Pointer to table entry
<             relop         LT
<=            relop         LE
=             relop         EQ
<>            relop         NE
>             relop         GT
>=            relop         GE
Tokens, their patterns, and attribute values
-
Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns into "transition diagrams".
Transition diagrams have a collection of nodes or circles, called states.
Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
-
Edges are directed from one state of the transition diagram to another.
Each edge is labeled by a symbol or set of symbols.
All our transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.
-
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found. (We always indicate an accepting state by a double circle, and if there is an action to be taken, typically returning a token and an attribute value to the parser, we shall attach that action to the accepting state.)
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state.
-
3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.
-
Transition diagram for relop
We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <> or <=.
State 4 has a * to indicate that we must retract the input one position.
If, in state 0, we see any character besides <, = or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.
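The relop diagram can be turned into code state by state. The Python sketch below is one possible rendering: the function name get_relop and the attribute names LT, LE, EQ, NE, GT, GE are assumptions, and apart from states 0 and 4, which the text mentions, the state numbers in the comments are assumed for illustration.

```python
def get_relop(text, pos=0):
    """Follow the relop diagram from text[pos].

    Returns (attribute, next_pos) if a relop lexeme starts at pos, else None.
    """
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands for end of input

    c = peek(pos)                        # state 0: the start state
    if c == "<":
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("LE", pos + 2)
        if nxt == ">":
            return ("NE", pos + 2)
        return ("LT", pos + 1)           # state 4 (*): the extra symbol is not consumed
    if c == "=":
        return ("EQ", pos + 1)
    if c == ">":
        if peek(pos + 1) == "=":
            return ("GE", pos + 2)
        return ("GT", pos + 1)           # also a retracting state
    return None                          # not a relop: this diagram is not used
```

Retraction shows up as returning pos + 1 rather than pos + 2: the character that revealed the end of the lexeme stays in the input for the next token.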
-
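Recognition of Reserved Words and Identifiers
Usually, keywords like if or then are reserved, so they are not identifiers even though they look like identifiers.
(Figure: transition diagram for id's and keywords. From start state 9, an edge labeled letter leads to state 10; state 10 loops on letter or digit; an edge labeled other leads to accepting state 11, marked *, with the action return(getToken(), installID()).)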
-
There are two ways that we can handle reserved words that look like identifiers:
1) Install the reserved words in the symbol table initially. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found.
Any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id.
-
The function getToken examines the symbol table entry for the lexeme found, and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table.
2) Create separate transition diagrams for each keyword.
(Figure: transition diagram for then. From the start state, edges labeled t, h, e, n are followed in sequence, and then an edge labeled nonlet/dig leads to an accepting state marked *.)
-
A transition diagram for unsigned numbers
(Figure: transition diagram for unsigned numbers.)
-
A transition diagram for whitespace
(Figure: transition diagram for whitespace, with edges labeled delim and an accepting state marked *.)
Here we look for one or more whitespace characters, represented by delim. These characters would be blank, tab, newline etc.
In state 24, we have found a block of consecutive whitespace characters, followed by a non-whitespace character. We retract the input to begin at the non-whitespace character, but we do not return to the parser.
-
Design of Lexical Analyzer
The initial step is to draw flowcharts for the valid possible tokens.
A flowchart for a lexical analyzer is known as a transition diagram.
Its components are:
States - represented by the circles.
Edges - the arrows connecting the states.
The labels on the edges indicate the input characters that can appear after that state.
-
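Transition diagram for identifier
(Fig: Transition diagram for identifier. From start state 0, an edge labeled letter leads to state 1; state 1 loops on letter or digit; an edge labeled delimiter leads to state 2, which is marked *.)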
-
The next step is to produce code for each of the states.
The code for State 0
    State 0 : C := GETCHAR( );
              if LETTER( C ) then goto state 1
              else FAIL( )
Here LETTER is a boolean valued function that returns true if C is a letter. FAIL is a routine which retracts the lookahead pointer and starts up the next transition diagram or calls the error routine.
-
The code for State 1
    State 1 : C := GETCHAR( );
              if LETTER( C ) or DIGIT( C ) then goto state 1
              else if DELIMITER( C ) then goto state 2
              else FAIL( )
Here DIGIT is a boolean valued function that returns true if C is one of the digits 0, 1, ..., 9. DELIMITER is a procedure which returns true whenever C is a character that could follow an identifier.
-
The code for State 2
    State 2 : RETRACT( );
              return (id, INSTALL( ))
State 2 indicates that an identifier has been found. Since the delimiter is not part of the token found, the function RETRACT will move the lookahead pointer one character back.
A * indicates a state on which input retraction must take place.
The INSTALL( ) procedure will install the identifier into the symbol table if it is not already there.
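The three states can be combined into one small Python routine. This is a sketch under assumptions: the function name scan_identifier, the list used as a symbol table, and the decision to treat every character that is neither a letter nor a digit (including end of input) as a delimiter are illustrative choices, not the slides' definitions of DELIMITER and INSTALL.

```python
def scan_identifier(text, start, symtab):
    """Run states 0, 1 and 2 from text[start].

    Returns (("id", index), next_pos) on success, or (None, start) on FAIL.
    """
    pos = start
    state = 0
    while True:
        c = text[pos] if pos < len(text) else ""   # "" models end of input
        pos += 1                                   # GETCHAR advances the lookahead
        if state == 0:
            if c.isalpha():                        # LETTER(C)
                state = 1
            else:
                return None, start                 # FAIL: retract to the start
        else:                                      # state 1
            if c.isalnum():                        # LETTER(C) or DIGIT(C)
                continue
            # Assumption: anything else acts as a delimiter, leading to state 2.
            pos -= 1                               # state 2: RETRACT
            lexeme = text[start:pos]
            if lexeme not in symtab:               # INSTALL: add only if absent
                symtab.append(lexeme)
            return ("id", symtab.index(lexeme)), pos

table = []
token, after = scan_identifier("rate*60", 0, table)
```

Note that str.isalpha and str.isalnum also accept non-ASCII letters and digits, which is broader than the LETTER and DIGIT functions described above.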
-
Token         Code    Value
begin         1       ---
end           2       ---
if            3       ---
then          4       ---
else          5       ---
Identifier    6       pointer to symbol table
Constant      7       pointer to symbol table
<             8       1
<=            8       2
=             8       3
<>            8       4
>             8       5
>=            8       6
Fig: Tokens recognized
-
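Keywords:
(Figure: transition diagram for keywords, states 0 to 22. From the start state, the letter sequences B E G I N, E N D, I F, T H E N and E L S E are followed, each ending with a blank/newline edge into a starred accepting state with the actions return(1, ), return(2, ), return(3, ), return(4, ) and return(5, ) respectively.)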
-
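Identifier:
(Figure: from start state 23, an edge labeled letter leads to state 24; state 24 loops on letter or digit; an edge labeled "not letter or digit" leads to state 25, marked *, with the action return(6, INSTALL( )).)
Constant:
(Figure: from start state 26, an edge labeled digit leads to state 27; state 27 loops on digit; an edge labeled "not digit" leads to state 28, marked *, with the action return(7, INSTALL( )).)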
-
Relops:
(Figure: transition diagram for relops, states 29 to 37, with starred states 31, 34 and 36 and the actions return(8, 1), return(8, 2), return(8, 3), return(8, 4), return(8, 5) and return(8, 6).)
-
Regular Expressions: Strings and Languages
Alphabet or character class: denotes any finite set of symbols. Eg: {0, 1} is an alphabet with two symbols, 0 and 1.
String: a finite sequence of symbols. Eg: 001, 10101, ...
-
Operations with strings
Length: |x| denotes the length of string x, which is the number of characters in x. ε is the empty string, and |ε| = 0.
Concatenation of x and y is denoted by x.y or xy, and is formed by appending string y to x. Eg: if x = abc and y = de then x.y = abcde. Also εx = xε = x, where ε is the identity under concatenation.
Exponentiation: x^i means string x repeated i times. Eg: x^1 = x; x^2 = xx; x^3 = xxx; ... and x^0 = ε.
-
Prefix of x is obtained by discarding 0 or more trailing symbols of x. Eg: abc, abcd, a, ... are prefixes of abcde.
Suffix of x is obtained by discarding 0 or more leading symbols of x. Eg: cde, e, ... are suffixes of abcde.
Substring of x is obtained by deleting a prefix and a suffix from x. Eg: cd, abc, de, abcde are substrings of abcde.
Every suffix and every prefix is a substring, but a substring need not be a suffix or a prefix. ε and x itself are prefixes, suffixes, and substrings of x.
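These three definitions translate directly into Python slices. The function names below are chosen for illustration.

```python
def prefixes(x):
    """Discard 0 or more trailing symbols of x."""
    return {x[:i] for i in range(len(x) + 1)}

def suffixes(x):
    """Discard 0 or more leading symbols of x."""
    return {x[i:] for i in range(len(x) + 1)}

def substrings(x):
    """Delete a prefix and a suffix from x."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}
```

For x = abcde, the string cd is a substring but neither a prefix nor a suffix, while the empty string "" and abcde itself belong to all three sets.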
-
Language: a set of strings formed from a specific alphabet.
If L and M are two languages, the possible operations are:
Concatenation: the concatenation of L and M is denoted as L.M and can be found by selecting a string x from L and a string y from M and joining them in that order.
    LM = {xy | x is in L and y is in M}
    L∅ = ∅L = ∅
Exponentiation: L^i = LLL...L (i times)
    L^0 = {ε}, {ε}L = L{ε} = L
-
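Union: L ∪ M = {x | x is in L or x is in M}
    ∅ ∪ L = L ∪ ∅ = L
Closure: * denotes 0 or more instances of; L* is the union of L^i over all i >= 0.
Eg: let L = {aa}. L* is the set of all strings with an even number of a's: L^0 = {ε}, L^1 = {aa}, L^2 = {aaaa}, ...
Positive closure: + means one or more instances of. The union starts at i = 1, leaving out L^0 = {ε}, so L+ is L.(L*):
    $L \cdot L^* = L \cdot \bigcup_{i=0}^{\infty} L^i = \bigcup_{i=0}^{\infty} L^{i+1} = \bigcup_{i=1}^{\infty} L^i = L^+$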
-
Regular Expressions
Regular expressions are used to describe the tokens and, more generally, to define a language. Eg: for identifier, identifier = letter (letter | digit)*
Regular Expression construction rules
1. ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
2. For each a in Σ, a is a regular expression denoting {a}, the language with only one string, that string consisting of the single symbol a.
3. If R and S are regular expressions denoting languages LR and LS respectively, then
   (i) (R) | (S) is a regular expression denoting LR ∪ LS
   (ii) (R).(S) is a regular expression denoting LR.LS
   (iii) (R)* is a regular expression denoting LR*
-
A regular expression is defined in terms of primitive regular expressions (basis) and compound regular expressions (induction rules). So rules 1 and 2 form the basis, and rule 3 forms the inductive portion.
Eg: Some Regular Expressions
1. a* - denotes all strings of 0 or more a's
2. aa* - denotes all strings of one or more a's (a+)
3. (a | b)* - the set of all strings of a's and b's, i.e. (a*b*)*
4. (aa | ab | ba | bb)* - all strings of even length
5. ε | a | b - strings of length 0 or 1
6. (a | b)(a | b)(a | b) denotes strings of length 3, so (a | b)(a | b)(a | b)(a | b)* denotes strings of length 3 or more
7. ε | a | b | (a | b)(a | b)(a | b)(a | b)* - all strings whose length is not 2
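Claims such as "all strings of even length" can be spot-checked by enumerating every string over {a, b} up to a small length and testing it against the expression. The Python sketch below does this with the re module; the helper names strings_up_to and language, and the length bound 6, are assumptions made for the check, and ε is written as an empty alternative.

```python
import re
from itertools import product

def strings_up_to(n, alphabet="ab"):
    """Yield every string over the alphabet of length 0..n."""
    for k in range(n + 1):
        for symbols in product(alphabet, repeat=k):
            yield "".join(symbols)

def language(pattern, n=6):
    """The strings of length <= n that the whole pattern matches."""
    rx = re.compile(pattern)
    return {s for s in strings_up_to(n) if rx.fullmatch(s)}

even_length = language(r"(aa|ab|ba|bb)*")
length_not_two = language(r"(|a|b|(a|b)(a|b)(a|b)(a|b)*)")
all_strings = language(r"(a|b)*")
```

Such a check only covers strings up to the chosen bound, so it can refute a claim but does not prove it.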
-
Regular Expressions for tokens (here / denotes alternation)
    Keyword = BEGIN / END / IF / THEN / ELSE
    Identifier = letter (letter / digit)*
    Constant = digit+
    relop = < / <= / = / <> / > / >=
If two regular expressions R and S denote the same language, then R and S are equivalent, i.e. (a/b)* = (a*b*)*
Algebraic laws with Regular Expressions
1. R/S = S/R ( / is commutative)
2. (R/S)/T = R/(S/T) ( / is associative)
3. R(ST) = (RS)T ( . is associative)
4. R(S/T) = RS/RT and (S/T)R = SR/TR ( . distributes over / )
5. εR = Rε = R (ε is the identity for concatenation)
-
Finite Automata
Language Recognizer
A recognizer for a language L is a program that identifies the presence of a token on the input. It takes a string x as its input and answers yes if x is a sentence of L and no otherwise.
How it works?
To determine whether x belongs to the language denoted by a regular expression R, x is decomposed into a sequence of substrings denoted by the primitive subexpressions in R.
Example
Given R = (a/b)*abb, the set of all strings ending in abb, and x = aabb. Since R = R1R2, where R1 = (a/b)* and R2 = abb, it is easy to show that a is an element of the language denoted by R1 and that abb matches R2, so x is in the language denoted by R.
-
Nondeterministic Automata
A nondeterministic finite automaton (NFA) is the generalized transition diagram that is derived from the expression.
(Fig: A nondeterministic finite automaton for (a/b)*abb, with states 0, 1, 2, 3. State 0 is the start state and loops on a and b; the edges 0 -> 1 on a, 1 -> 2 on b and 2 -> 3 on b lead to the accepting state 3.)
The nodes are called states and the labeled edges are called transitions. Edges can be labeled by ε as well as by characters. Also, the same character can label two or more transitions out of one state. An NFA has one start state and can have one or more final states (accepting states).
-
Transition table
The transition table is the tabular form representing the transitions of an NFA. In the transition table, there is a row for each state and a column for each admissible input symbol and ε. The entry for row i and symbol a is the set of possible next states for state i on input a.
    State    Input symbol
             a         b
    0        {0,1}     {0}
    1        -         {2}
    2        -         {3}
Fig: Transition table
-
The path for the input string aabb can be represented by the following sequence of moves:
    State    Remaining i/p
    0        aabb
    0        abb
    1        bb
    2        b
    3        ε
The language defined by an NFA is the set of input strings it accepts.
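The transition table can be used directly to simulate the NFA by tracking the set of states it could be in after each input symbol. The Python sketch below encodes the table above as a dictionary; the names NFA, FINAL and accepts are chosen for illustration, and this particular NFA has no ε-transitions, so none are handled.

```python
# Transition table of the NFA for (a/b)*abb: state -> symbol -> set of next states.
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
}
FINAL = {3}

def accepts(x, start=0):
    """True if some path labeled x leads from the start state to a final state."""
    current = {start}
    for symbol in x:
        # Follow every possible transition from every state we might be in.
        current = set().union(*(NFA.get(s, {}).get(symbol, set()) for s in current))
    return bool(current & FINAL)
```

The input aabb is accepted because the path 0, 0, 1, 2, 3 shown above is one of the paths the state sets cover; aab is rejected because no path for it ends in state 3.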
-
NFA accepting aa* / bb*
(Figure: an NFA with start state 0 and states 1, 2, 3, 4, with edges labeled a and b.)
-
Algorithm to construct an NFA from a Regular Expression
Input: A regular expression R over alphabet Σ.
Output: An NFA N accepting the language denoted by R.
Method: Decompose R into its primitive components. For each component, we construct a finite automaton inductively using the basis and induction rules.
-
Finite Automata construction from regular expression
The basis and induction rules are:
1. NFA for ε
(Figure: i --ε--> f), where i and f are a new initial state and a new final state.
2. NFA for a
(Figure: i' --a--> f'); each state should be new.
-
Each time we need a new state, we give that state a new name. Even if a appears several times in the regular expression R, we give each instance of a a separate finite automaton with its own states.
3. Having constructed components for the basis regular expressions, we proceed to combine them in ways that correspond to the way compound regular expressions are formed from smaller regular expressions.
-
3. NFA for R1 / R2
Let N1 and N2 be the NFAs corresponding to R1 and R2 respectively.
(Figure: a new initial state i and a new final state f placed around N1 and N2.)
There is an ε-transition from the new initial state i to the initial states of N1 and N2. There is an ε-transition from the final states of N1 and N2 to the new final state f. Any path from i to f must go through either N1 or N2.
-
4. NFA for R1R2
Let N1 and N2 be the NFAs corresponding to R1 and R2 respectively.
(Figure: N1 followed by N2, from initial state i to final state f.)
The initial state of N2 is identified with the accepting state of N1. A path from i to f must go first through N1, then through N2.
-
5. NFA for R1*
(Figure: a new initial state i and a new final state f placed around N1.)
In this NFA, we can go from i to f directly along a path labeled ε, or go through N1 one or more times.
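Rules 2 to 5 can be sketched in Python by representing an NFA as a triple (start, final, edges), where each edge is (from, label, to) and the label None stands for ε. This representation and all the names below (symbol, union, concat, star, closure, accepts) are assumptions made for illustration; the rule for ε alone is omitted for brevity.

```python
from itertools import count

_fresh = count()   # hypothetical supply of new state names 0, 1, 2, ...

def symbol(a):
    """Rule 2: two new states joined by an edge labeled a."""
    i, f = next(_fresh), next(_fresh)
    return i, f, [(i, a, f)]

def union(n1, n2):
    """NFA for R1 / R2: new i and f, with ε-edges into and out of N1 and N2."""
    (s1, f1, e1), (s2, f2, e2) = n1, n2
    i, f = next(_fresh), next(_fresh)
    return i, f, e1 + e2 + [(i, None, s1), (i, None, s2), (f1, None, f), (f2, None, f)]

def concat(n1, n2):
    """NFA for R1R2: the initial state of N2 is identified with the final state of N1."""
    (s1, f1, e1), (s2, f2, e2) = n1, n2
    merged = [(f1 if p == s2 else p, a, f1 if q == s2 else q) for p, a, q in e2]
    return s1, f2, e1 + merged

def star(n1):
    """NFA for R1*: skip N1 along ε, or go through it one or more times."""
    s1, f1, e1 = n1
    i, f = next(_fresh), next(_fresh)
    return i, f, e1 + [(i, None, s1), (i, None, f), (f1, None, s1), (f1, None, f)]

def closure(states, edges):
    """All states reachable from the given states on ε-edges only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for p, a, q in edges:
            if p == s and a is None and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def accepts(nfa, x):
    start, final, edges = nfa
    current = closure({start}, edges)
    for ch in x:
        moved = {q for p, a, q in edges if p in current and a == ch}
        current = closure(moved, edges)
    return final in current

# (a/b)*abb, built bottom-up from its primitive components
ab_star = star(union(symbol("a"), symbol("b")))
nfa = concat(concat(concat(ab_star, symbol("a")), symbol("b")), symbol("b"))
```

The resulting automaton has 11 states, although the state names depend on the order in which the pieces are created and need not agree with the numbering used in any figure.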
-
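Decomposition of (a / b)*abb
(Figure: decomposition tree.)
    R11 = R9 R10,  R10 = b
    R9  = R7 R8,   R8  = b
    R7  = R5 R6,   R6  = a
    R5  = (R4)*
    R4  = (R3)
    R3  = R1 / R2, R1 = a, R2 = b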
-
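R1 = a, N1: 2 --a--> 3
R2 = b, N2: 4 --b--> 5
R3 = R1 / R2, N3: new initial state 1 and new final state 6, with ε-transitions 1 -> 2, 1 -> 4, 3 -> 6 and 5 -> 6
R4 = (R3), N4 is the same as N3
-
R5 = (R4)*, N5: new initial state 0 and new final state 7, with ε-transitions 0 -> 1, 0 -> 7, 6 -> 1 and 6 -> 7
-
R6 = a, N6: 7' --a--> 8
R7 = R5R6, N7: N5 followed by 7 --a--> 8 (state 7' is identified with state 7)
-
R8 = b, N8: 8' --b--> 9
R9 = R7R8, N9: N7 followed by 8 --b--> 9
-
R10 = b, N10: 9' --b--> 10
R11 = R9R10, N11: N9 followed by 9 --b--> 10; state 0 is the start state and state 10 is the accepting state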
-
Deterministic Automata (DFA)
Since the transition function of an NFA is multivalued, it is difficult to simulate an NFA with a computer program.
A finite automaton is deterministic if
1. it has no transitions on input ε, and
2. for each state s and input symbol a, there is at most one edge labeled a leaving s.
For each NFA, we can find a DFA accepting the same language.
-
(Figure: the NFA N11 for (a/b)*abb, with states 0 to 10.)
ε-closure(0) = {0, 1, 2, 4, 7} ---- (A). From A: on a we reach {3, 8}, on b we reach {5}.
ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} ---- (B). From B: on a we reach {3, 8}, on b we reach {5, 9}.
-
ε-closure({5}) = {1, 2, 4, 5, 6, 7} ---- (C). From C: on a we reach {3, 8}, on b we reach {5}.
ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} ---- (D). From D: on a we reach {3, 8}, on b we reach {5, 10}.
ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} ---- (E). From E: on a we reach {3, 8}, on b we reach {5}.
-
    State         Input symbol
                  a      b
    A (Start)     B      C
    B             B      D
    C             B      C
    D             B      E
    E (Accept)    B      C
(Figure: the DFA with states A, B, C, D, E and start state A.)
-
Minimizing the number of states
    State         Input symbol
                  a      b
    A (Start)     B      A
    B             B      D
    D             B      E
    E (Accept)    B      A
(Figure: the minimized DFA with states A, B, D, E and start state A.)
-
Constructing DFA from NFA: Algorithm
Input: an NFA N.
Output: a DFA D accepting the same language.
Let us define the function ε-CLOSURE(s) to be the set of states of N built by applying the following rules:
1. s is added to ε-CLOSURE(s).
2. If t is in ε-CLOSURE(s), and there is an edge labeled ε from t to u, then u is added to ε-CLOSURE(s) if u is not already there. Rule 2 is repeated until no more states can be added to ε-CLOSURE(s).
Thus, ε-CLOSURE(s) is the set of states that can be reached from s on ε-transitions only. If T is a set of states, then ε-CLOSURE(T) is the union over all states s in T of ε-CLOSURE(s).
-
Constructing DFA from NFA: Algorithm ε-CLOSURE
    push all states in T onto stack;
    ε-closure(T) := T;
    while stack is not empty do begin
        pop s, the top element, off the stack;
        for each state t with an edge from s to t labeled ε do
            if t is not in ε-closure(T) then begin
                add t to ε-closure(T);
                push t onto stack
            end
    end
-
Constructing DFA from NFA: Algorithm Subset construction
    initially, ε-CLOSURE(s0) is the only state of D and it is unmarked, where s0 is the start state of N;
    while there is an unmarked state x = {s1, s2, ..., sn} of D do begin
        mark x;
        for each input symbol a do begin
            let T be the set of states to which there is a transition on a from some state si in x;
            y := ε-CLOSURE(T);
            if y has not yet been added to the set of states of D then
                make y an unmarked state of D;
            add a transition from x to y labeled a if not already present
        end
    end
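The two algorithms can be sketched together in Python and run on the NFA for (a/b)*abb with states 0 to 10 from the worked example. The dictionary encoding of the NFA, the use of None for ε and the function names are assumptions made for this illustration.

```python
EPS = None   # label used here for ε-transitions

# NFA for (a/b)*abb: (state, label) -> set of next states
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}

def eps_closure(T):
    """States reachable from the states in T on ε-transitions only."""
    stack, closure = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(start, alphabet="ab"):
    """Return (start state, set of DFA states, transition table) of D."""
    d_start = eps_closure({start})
    d_states, unmarked, d_tran = {d_start}, [d_start], {}
    while unmarked:
        x = unmarked.pop()                       # taking x off the list marks it
        for a in alphabet:
            T = set().union(*(NFA.get((s, a), set()) for s in x))
            y = eps_closure(T)
            if y not in d_states:
                d_states.add(y)
                unmarked.append(y)
            d_tran[x, a] = y
    return d_start, d_states, d_tran

d_start, d_states, d_tran = subset_construction(0)
accepting = {s for s in d_states if 10 in s}     # subsets containing NFA state 10
```

Each DFA state is a frozenset of NFA states, so the five sets found here can be compared directly with the sets labeled A to E in the worked example.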
-
Minimizing the number of states in DFA: Algorithm
Input: a DFA M.
Output: a minimum state DFA M'.
1. If some states in M ignore some inputs, add transitions to a dead state.
2. Let P = {the set of all accepting states, the set of all nonaccepting states}.
3. Loop: Let P' = {}. For each group G in P, partition G into subgroups so that s and t (in G) belong to the same subgroup if and only if, for each input a, states s and t have transitions to states in the same group of P; put those subgroups in P'.
4. If P' != P, then set P := P' and goto Loop.
5. Remove any dead states and unreachable states.
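The partitioning loop can be sketched in Python and applied to the five-state DFA A to E obtained earlier for (a/b)*abb. This is an illustration under assumptions: the dictionary encoding and the function name minimize are chosen here, every state is assumed to have a transition on every input, and the removal of dead and unreachable states is left out.

```python
# DFA for (a/b)*abb: state -> input symbol -> next state
DFA = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
ACCEPTING = {"E"}

def minimize(dfa, accepting, alphabet="ab"):
    """Return the final partition of the states into groups."""
    partition = [g for g in (set(accepting), set(dfa) - set(accepting)) if g]
    while True:
        # Which group of the current partition each state belongs to.
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        new_partition = []
        for group in partition:
            # States stay together only if, on every input, they go to the same group.
            buckets = {}
            for s in group:
                key = tuple(group_of[dfa[s][a]] for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        # Groups are only ever split, so an unchanged count means P' = P.
        if len(new_partition) == len(partition):
            return new_partition
        partition = new_partition

groups = minimize(DFA, ACCEPTING)
```

States A and C end up in the same group, which is why the minimized table keeps A, B, D and E and redirects every transition into C to A.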
-
NFA to DFA Example-2
(Figure: an NFA with start state 0 and states 1 to 8, with edges labeled a and b.)
ε-closure({0}) = {0,1,3,7}
    subset({0,1,3,7}, a) = {2,4,7}
    subset({0,1,3,7}, b) = {8}
ε-closure({2,4,7}) = {2,4,7}
    subset({2,4,7}, a) = {7}
    subset({2,4,7}, b) = {5,8}
ε-closure({8}) = {8}
    subset({8}, a) = ∅
    subset({8}, b) = {8}
ε-closure({7}) = {7}
    subset({7}, a) = {7}
    subset({7}, b) = {8}
...
-
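DFA states:
    A = {0,1,3,7}
    B = {2,4,7}
    C = {8}
    D = {7}
    E = {5,8}
    F = {6,8}
(Figure: the resulting DFA with start state A and states B, C, D, E, F; accepting states carry the pattern labels a1, a2 and a3.)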
-
Minimizing the Number of States of a DFA
(Figure: a five-state DFA over the states A, B, C, D, E, and the equivalent four-state DFA over A, B, D, E obtained by minimization.)
-
A language for specifying Lexical Analyzers
A LEX source program is a specification of a lexical analyzer, consisting of a set of regular expressions together with an action for each regular expression.
The action is a piece of code which is to be executed whenever a token specified by the corresponding regular expression is recognized.
The output of LEX is a lexical analyzer program constructed from the LEX source specification.
-
Creating a Lexical Analyzer with Lex
(Figure: lex source program -> lex compiler -> Lexical Analyzer L; input stream -> Lexical Analyzer L -> sequence of tokens.)
-
A LEX source program consists of 2 parts: auxiliary definitions and translation rules.
Auxiliary Definitions
The auxiliary definitions are statements of the form
    D1 = R1
    D2 = R2
    ...
    Dn = Rn
Eg:
    letter = A | B | ... | Z
    digit = 0 | 1 | ... | 9
    identifier = letter (letter | digit)*
-
Translation Rules
The translation rules of a LEX program are statements of the form
    P1 {A1}
    P2 {A2}
    ...
    Pm {Am}
where each Pi is a regular expression called a pattern and each Ai is a program fragment.
The patterns describe the form of the tokens.
The program fragment describes what action the lexical analyzer should take when token Pi is found.
-
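AUXILIARY DEFINITIONS
    letter = A | B | ... | Z
    digit = 0 | 1 | ... | 9
TRANSLATION RULES
    BEGIN                     {return 1}
    END                       {return 2}
    IF                        {return 3}
    THEN                      {return 4}
    ELSE                      {return 5}
    letter(letter | digit)*   {LEXVAL := INSTALL(); return 6}
    digit+                    {LEXVAL := INSTALL(); return 7}
    <                         {LEXVAL := 1; return 8}
    <=                        {LEXVAL := 2; return 8}
    =                         {LEXVAL := 3; return 8}
    <>                        {LEXVAL := 4; return 8}
    >                         {LEXVAL := 5; return 8}
    >=                        {LEXVAL := 6; return 8}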
-
Regular Expressions in Lex
    x           match the character x
    \.          match the character .
    "string"    match contents of string of characters
    .           match any character except newline
    ^           match beginning of a line
    $           match the end of a line
    [xyz]       match one character x, y, or z (use \ to escape -)
    [^xyz]      match any character except x, y, and z
    [a-z]       match one of a to z
    r*          closure (match zero or more occurrences)
    r+          positive closure (match one or more occurrences)
    r?          optional (match zero or one occurrence)
    r1r2        match r1 then r2 (concatenation)
    r1|r2       match r1 or r2 (union)
    (r)         grouping
    r1/r2       match r1 when followed by r2
    {d}         match the regular expression defined by d