SystemSoftware-compilers

MODULE III

Introduction to compiling: Compilers, analysis of a source program, the phases of a compiler.

Lexical analysis: The role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens, finite automata, conversion of an NFA to a DFA, from a regular expression to an NFA.

COMPILERS

Introduction to Compilers

Translator

A translator is a program that takes a program written in one programming language (the source language) as input and produces a program in another language (the object language) as output. If the source language is a high-level language and the object language is a low-level language, the translator is called a compiler.

Fig: Source Program -> Compiler -> Object Program

Analysis of Source Program

The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them.

It then uses this structure to create an intermediate code of the source program.

If the analysis part detects any error, it must provide informative messages, so the user can take corrective action.

The analysis part also collects information about the source program and stores it in the data structure SYMTAB, which is passed along with the intermediate code to the synthesis phase.

The synthesis part constructs the desired target program from the intermediate representation and the information in the SYMTAB.

The analysis part is often called the front end, and the synthesis part is called the back end.

Phases of a compiler

Source program
-> Lexical Analyzer -> token stream
-> Syntax Analyzer -> syntax tree
-> Semantic Analyzer -> syntax tree
-> Intermediate code generator -> intermediate representation
-> Machine independent code optimizer -> intermediate representation
-> Code generator -> target machine code
-> Machine dependent code optimizer -> target machine code

The Symbol Table is used by all of the phases.

Lexical Analysis (Scanning)

The first phase of a compiler

The lexical analyzer reads the stream of characters from the source program and groups the characters into meaningful sequences called lexemes.

For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value>, where token-name is an abstract symbol that is used during syntax analysis, and attribute-value points to an entry in the symbol table for this token.

Eg. position = initial + rate * 60

The lexemes and tokens are:
1) position is a lexeme that is mapped into the token <id, 1>, where id is an abstract symbol standing for identifier and 1 points to the SYMTAB entry for position.
2) = is a lexeme that is mapped into the token <=>. Since this token needs no attribute value, the second component is omitted.
3) initial - <id, 2>
4) + - <+>
5) rate - <id, 3>
6) * - <*>
7) 60 - <60>
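The mapping from lexemes to tokens above can be tried out with a small sketch. This is only an illustration under assumptions: the pattern table TOKEN_SPEC, the function name tokenize and the use of a Python list as SYMTAB are hypothetical choices, and tokens without an attribute are written as one-element tuples.

```python
import re

# Hypothetical patterns covering only the lexemes of this one statement.
TOKEN_SPEC = [
    ("id", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("number", r"[0-9]+"),
    ("op", r"[=+*]"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (tokens, symtab); identifiers get a 1-based SYMTAB index as attribute."""
    symtab = []          # stands in for SYMTAB: entry k holds the k-th distinct identifier
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                      # white space is not passed on to the parser
        if kind == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append(("id", symtab.index(lexeme) + 1))
        else:
            tokens.append((lexeme,))      # operators and 60 carry no attribute here
    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
print(tokens)
print(symtab)
```

A real scanner would usually give the number its own token name with an attribute; the one-element form is kept here only to mirror the example.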

Syntax Analysis (Parsing)

The second phase of the compiler.

The parser uses the first components of the tokens produced by the lexical analyzer to create a syntax tree. The syntax tree for the above example is:

=
  <id, 1>
  +
    <id, 2>
    *
      <id, 3>
      60

Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the SYMTAB to check the source program for semantic consistency with the language definition.

It also gathers type information and saves it in either the syntax tree or the SYMTAB, for subsequent use during intermediate code generation.

An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands (e.g., the compiler must report an error if a float value is used as an array index).

Eg. Suppose position, initial and rate are float numbers. The lexeme 60 is an integer. The type checker in the semantic analyzer discovers that the operator * is applied to a float number rate and an int 60, so the int 60 is converted to float.

Intermediate Code Generation

In the process of translation from source to target code, the compiler may construct one or more intermediate representations. An intermediate representation should be (a) easy to produce and (b) easy to translate into the target machine code.

Eg.
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Code Optimization

The machine independent code optimization phase attempts to improve the intermediate code so that better target code will result. A simple intermediate code generation algorithm followed by code optimization is a reasonable way to generate good target code.

The optimizer can deduce that the conversion of 60 from int to float can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the int 60 by the float 60.0. Since t3 is used only to pass its value on to id1, it can be eliminated as well.

Eg.
t1 = id3 * 60.0
id1 = id2 + t1

Code Generation

The code generator takes as input an intermediate representation of the source program and maps it to the target language. If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then the intermediate instructions are translated into sequences of machine instructions.

Eg.
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1

SYMBOL TABLE MANAGEMENT

An essential function of a compiler is to record the variable names used in the source program and to collect information about the various attributes of each name.

This data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.

Fig: Translation of an assignment statement. The statement position = initial + rate * 60 is shown passing through the lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator, code optimizer and code generator.

Role of Lexical Analyzer

The main task of the lexical analyzer is to read the input characters, group them into lexemes, and produce tokens. The stream of tokens is sent to the parser for syntax analysis.

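Fig: The parser issues getnexttoken to the lexical analyzer, which reads the source program and returns the next token. The output of the parser goes on to semantic analysis. Both the lexical analyzer and the parser use the symbol table.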

Tasks (Role) of Lexical Analyzer

1. Identification of lexemes.
2. Removal of comments and white space (blank, newline, tab, etc.).
3. Correlating error messages generated by the compiler with the source program.

Lexical analyzer processes

1. Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive white space characters into one.
2. Lexical analysis proper is the more complex process, where the scanner produces the sequence of tokens as output.

Tokens, Patterns and Lexemes

A token is a pair consisting of a token name and an optional attribute value.

A pattern is a description of the form that the lexemes of a token may take.

A lexeme is a sequence of characters in the source program that matches the pattern for a token.

INPUT BUFFERING

Specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. Two buffers are alternately reloaded. Each buffer is of the same size N, where N is the size of a disk block.

Fig: A pair of input buffers holding E = M * C ** 2 eof, with the pointers lexemeBegin and forward.

Two pointers are required:
lexemeBegin - marks the beginning of the current lexeme.
forward - scans ahead until a pattern match is found.

Advancing forward requires that we first test whether we have reached the end of one of the buffers; if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.

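Sentinels

A sentinel is a special character placed at the end of each buffer, so that a single test detects both the end of a buffer and the end of the input. The natural choice is the character eof, which cannot be part of the source program. Any eof that appears other than at the end of a buffer means that the input is at an end.

Fig: Sentinels at the end of each buffer: E = M * eof C ** 2 eof ... eof, with the pointers lexemeBegin and forward.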

switch (*forward++) {
    case eof:
        if (forward is at the end of first buffer) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if (forward is at the end of second buffer) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    /* cases for the other characters */
}
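The sentinel scheme can be simulated in a few lines. This is a minimal sketch, assuming a NUL character as the eof sentinel (so it must not occur in the source text) and a tiny buffer size; the class name TwoBufferReader and the helper read_all are hypothetical.

```python
import io

EOF = "\0"   # assumed sentinel character; must not appear in the source text

class TwoBufferReader:
    """Read characters through two N-character halves, each followed by a sentinel."""

    def __init__(self, stream, n=4):
        self.stream, self.n = stream, n
        # Layout: [half 0: n chars][EOF][half 1: n chars][EOF]
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = 0

    def _load(self, half):
        start = half * (self.n + 1)
        data = self.stream.read(self.n)
        for i in range(self.n):
            # A short read leaves an EOF inside the half, marking the end of input.
            self.buf[start + i] = data[i] if i < len(data) else EOF

    def next_char(self):
        c = self.buf[self.forward]
        self.forward += 1
        if c != EOF:
            return c                          # common case: a single test per character
        if self.forward == self.n + 1:        # sentinel at the end of the first half
            self._load(1)
            return self.next_char()
        if self.forward == 2 * self.n + 2:    # sentinel at the end of the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                           # eof within a buffer: end of input

def read_all(text, n=4):
    reader = TwoBufferReader(io.StringIO(text), n)
    out = []
    while True:
        c = reader.next_char()
        if c is None:
            return "".join(out)
        out.append(c)

print(read_all("E=M*C**2"))
```

The point of the design is that the ordinary path tests each character once (against the sentinel) instead of testing for the end of a buffer and then for the character itself.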

SPECIFICATION OF TOKENS

Strings and Languages

An alphabet is a finite set of symbols. The set {0,1} is the binary alphabet.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet. |s| represents the length of a string s. Ex: banana is a string of length 6. The empty string, denoted ε, is the string of length zero.

A language is any countable set of strings over some fixed alphabet. Abstract languages such as ∅, the empty set, or {ε}, the set containing only the empty string, are also languages.

The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.

Exponentiation of strings: s^0 is ε, and for all i > 0, s^i is s^(i-1)s. Since εs = s, it follows that s^1 = s, s^2 = ss, s^3 = sss, and so on.

Operations on Languages

Operations on Languages (contd.)

Example: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. L and D are, respectively, the alphabet of uppercase and lowercase letters and the alphabet of digits. Other languages constructed from L and D are:

1. L ∪ D is the set of letters and digits; strictly speaking, the language with 62 (52 + 10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52 × 10) strings of length two, each consisting of one letter followed by one digit. Ex: A1, a1, B0, etc.
3. L^4 is the set of all 4-letter strings. Ex: aaba, bcef.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
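The counts in items 1 and 2 can be checked by treating languages as Python sets of strings. This is a sketch for finite languages only; the helper names concat and power are hypothetical, and the infinite languages L*, L(L ∪ D)* and D+ are not enumerated.

```python
import string

L = set(string.ascii_letters)   # 52 strings of length one
D = set(string.digits)          # 10 strings of length one

def concat(A, B):
    """Concatenation of two finite languages: every x in A followed by every y in B."""
    return {x + y for x in A for y in B}

def power(A, i):
    """A^i, with A^0 = {empty string}."""
    result = {""}
    for _ in range(i):
        result = concat(result, A)
    return result

letters_or_digits = L | D       # union
LD = concat(L, D)

print(len(letters_or_digits), len(LD))   # sizes of L ∪ D and LD
print(sorted(power({"a", "b"}, 2)))      # a small exponentiation example
```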

Regular Expression & Regular language

Regular Expression

A notation that allows us to define a pattern in a high level language.

Regular language

Each regular expression r denotes a language L(r) (the set of sentences relating to the regular expression r).

Note: Each word in a program can be described by a regular expression.

Eg. Suppose we want to describe the set of valid C identifiers. If letter_ stands for any letter or the underscore, and digit stands for any digit, then we would describe the language of C identifiers by:

letter_ (letter_ | digit)*

The | means union, the parentheses ( ) are used to group subexpressions, and * means zero or more occurrences of the preceding expression. The juxtaposition of letter_ with the remainder of the expression signifies concatenation.

Rules for constructing regular expressions

Regular expressions are built recursively out of smaller regular expressions using the following rules.

BASIS:
1. ε is a regular expression denoting {ε}, the language containing only the empty string. L(ε) = {ε}.
2. If a is a symbol in the alphabet Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position. (We use italics for symbols and boldface for their corresponding regular expressions.)

INDUCTION: Let r and s be regular expressions denoting the languages L(r) and L(s). Then
1. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting the language (L(r))*.
4. (r) is a regular expression denoting the language L(r).

Precedence

1. * has the highest precedence.
2. Concatenation has the second highest precedence.
3. | has the lowest precedence.

Eg. (a) | ((b)*(c)) may be replaced by a | b*c.

Examples

Algebraic laws of Regular Expressions

Regular Definitions

We can give names to certain regular expressions and use those names in subsequent expressions:

d1 -> r1
d2 -> r2
...
dn -> rn

E.g. C identifiers are strings of letters, digits and underscores.

letter_ -> A | B | ... | Z | a | b | ... | z | _
digit -> 0 | 1 | ... | 9
id -> letter_ (letter_ | digit)*

This can also be written as

letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ (letter_ | digit)*
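The regular definition for id translates almost directly into Python's re module. A small sketch, assuming the names LETTER_, DIGIT, ID and is_identifier (all chosen here for illustration); note that it accepts keywords too, since the definition itself does not exclude them.

```python
import re

LETTER_ = r"[A-Za-z_]"          # letter_ -> [A-Za-z_]
DIGIT = r"[0-9]"                # digit   -> [0-9]
ID = re.compile(f"{LETTER_}({LETTER_}|{DIGIT})*")   # id -> letter_ (letter_ | digit)*

def is_identifier(s):
    """True if the whole string s matches the id definition."""
    return ID.fullmatch(s) is not None

for candidate in ["rate", "_tmp1", "60rate", "a-b", ""]:
    print(repr(candidate), is_identifier(candidate))
```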

We shall conventionally use italics for the symbols defined in the regular expressions.

Recognition of tokens

In this topic we study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

Consider the following example.

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term
      | term
term -> id
      | number

A grammar for branching statements

For relop, we use the comparison operators <, <=, =, <>, > and >=. The patterns for the tokens are:

digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>

The token for white space is

ws -> (blank | tab | newline)+

Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace.
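The number and ws patterns can be exercised with Python's re module. This is a sketch under assumptions: the function name next_number is hypothetical, and the scanner simply skips white space and then looks for the longest number lexeme at that position.

```python
import re

DIGITS = r"[0-9]+"                                                # digits -> digit+
NUMBER = re.compile(rf"{DIGITS}(\.{DIGITS})?(E[+-]?{DIGITS})?")   # number pattern
WS = re.compile(r"[ \t\n]+")                                      # ws -> (blank | tab | newline)+

def next_number(text, pos=0):
    """Skip leading white space, then return (lexeme, new_pos) for a number, or None."""
    m = WS.match(text, pos)
    if m:
        pos = m.end()          # ws is recognized but not returned to the parser
    m = NUMBER.match(text, pos)
    return (m.group(), m.end()) if m else None

print(next_number("  6.02E23+x"))
print(next_number("60."))
print(next_number("abc"))
```

The second call shows why retraction matters: the trailing dot is not followed by digits, so it is left in the input for the next token.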

LEXEMES        TOKEN NAME   ATTRIBUTE VALUE
Any ws         -            -
if             if           -
then           then         -
else           else         -
Any id         id           Pointer to table entry
Any number     number       Pointer to table entry
<              relop        LT
<=             relop        LE
=              relop        EQ
<>             relop        NE
>              relop        GT
>=             relop        GE

Tokens, their patterns, and attribute values


Transition Diagrams

As an intermediate step in the construction of a lexical analyzer, we first convert patterns into "transition diagrams". Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.

Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. All our transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.

Some important conventions about transition diagrams are:

1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found. (We always indicate an accepting state by a double circle, and if there is an action to be taken, typically returning a token and an attribute value to the parser, we shall attach that action to the accepting state.)

2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state.

3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

Transition diagram for relop

We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <> or <=.

State 4 has a * to indicate that we must retract the input one position.

If, in state 0, we see any character besides <, = or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.
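A transition diagram like the one for relop can be coded as a loop over a state variable. The sketch below is an assumption-laden illustration: only states 0 and 4 are named in the text, so the other state numbers (1 and 6), the attribute names LT, LE, EQ, NE, GT, GE and the function name get_relop are chosen here for the example.

```python
def get_relop(text, pos=0):
    """Return ((token, attribute), new_pos), or None if no relop starts at pos."""
    state = 0
    forward = pos
    while True:
        c = text[forward] if forward < len(text) else ""   # "" stands for end of input
        forward += 1
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), forward
            elif c == ">":
                state = 6
            else:
                return None                       # not a relop: try another diagram
        elif state == 1:                          # seen "<"
            if c == "=":
                return ("relop", "LE"), forward
            if c == ">":
                return ("relop", "NE"), forward
            return ("relop", "LT"), forward - 1   # state 4 with *: retract one position
        elif state == 6:                          # seen ">"
            if c == "=":
                return ("relop", "GE"), forward
            return ("relop", "GT"), forward - 1   # retract one position

print(get_relop("<= b"))
print(get_relop("<5"))
print(get_relop("a >= b", 2))
```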

Usually, keywords like if or then are reserved so they are not identifiers even though they look like identifiers.

Recognition of Reserved Words and Identifiers

Fig: Transition diagram for id's and keywords. From start state 9, a letter leads to state 10; state 10 loops on letter or digit; any other character leads to accepting state 11 (marked *), where the action is return(getToken(), installID()).

There are two ways that we can handle reserved words that look like identifiers:

1) Install the reserved words in the symbol table initially. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id. The function getToken examines the symbol-table entry for the lexeme found and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table.

2) Create separate transition diagrams for each keyword.
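The first approach can be sketched with a small symbol table class. This is only an illustration: the class SymbolTable, its list-based layout and the Python-style method names install_id and get_token (standing in for installID and getToken) are assumptions, and "pointer" is modeled as an index into the table.

```python
KEYWORDS = ("if", "then", "else")

class SymbolTable:
    def __init__(self):
        self.entries = []        # each entry is (lexeme, token_name)
        self.index = {}          # lexeme -> position in entries
        for kw in KEYWORDS:      # reserved words are installed before scanning starts
            self._add(kw, kw)

    def _add(self, lexeme, token_name):
        self.index[lexeme] = len(self.entries)
        self.entries.append((lexeme, token_name))
        return self.index[lexeme]

    def install_id(self, lexeme):
        """Return the entry position for lexeme, adding it as an id if it is new."""
        if lexeme in self.index:
            return self.index[lexeme]
        return self._add(lexeme, "id")

    def get_token(self, lexeme):
        """Return the token name recorded for an already installed lexeme."""
        return self.entries[self.index[lexeme]][1]

table = SymbolTable()
for lexeme in ["then", "rate", "rate"]:
    pointer = table.install_id(lexeme)
    print(lexeme, table.get_token(lexeme), pointer)
```

Because the keywords are already present, a lexeme such as then is found rather than inserted, and get_token reports the keyword token instead of id.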

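Fig: Transition diagram for then. From the start state, the edges t, h, e, n are followed in turn, and then a nonletter/nondigit character (nonlet/dig) leads to the accepting state, which is marked * for retraction.

Fig: A transition diagram for unsigned numbers.

A transition diagram for whitespace

Here we look for one or more whitespace characters, represented by delim. These characters would be blank, tab, newline, etc.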

In state 24, we have found a block of consecutive whitespace characters, followed by a non whitespace character. We retract the input to begin at the non whitespace, but we do not return to the parser.

Design of Lexical Analyzer

The initial step is to draw flowcharts for the valid possible tokens. A flowchart for a lexical analyzer is known as a transition diagram. Its components are:

States - represented by circles.
Edges - the arrows connecting the states.
Labels - the labels on the edges indicate the input characters that can appear after that state.

Fig: Transition diagram for identifier. Start state 0 goes to state 1 on a letter; state 1 loops on letter or digit; a delimiter leads from state 1 to state 2, which is marked *.

The next step is to produce code for each of the states

The code for State 0

State 0: C := GETCHAR();
         if LETTER(C) then goto State 1
         else FAIL()

Here LETTER is a boolean-valued function that returns true if C is a letter. FAIL is a routine which retracts the lookahead pointer and starts up the next transition diagram, or calls the error routine.

The code for State 1

State 1: C := GETCHAR();
         if LETTER(C) or DIGIT(C) then goto State 1
         else if DELIMITER(C) then goto State 2
         else FAIL()

Here DIGIT is a boolean-valued function that returns true if C is one of the digits 0, 1, ..., 9. DELIMITER is a procedure which returns true whenever C is a character that could follow an identifier.

The code for State 2

State 2: RETRACT();
         return (id, INSTALL())

State 2 indicates that an identifier has been found. Since the delimiter is not part of the token found, the function RETRACT moves the lookahead pointer one character back. A * indicates a state on which input retraction must take place. The INSTALL() procedure installs the identifier into the symbol table if it is not already there.

Token        Code   Value
begin        1      -
end          2      -
if           3      -
then         4      -
else         5      -
identifier   6      pointer to symbol table
constant     7      pointer to symbol table
<            8      1
<=           8      2
=            8      3
<>           8      4
>            8      5
>=           8      6

Fig: Tokens recognized

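Fig: Transition diagrams for the tokens.

Keywords: from the start state, the letters of BEGIN, END, ELSE, IF and THEN are followed one by one; a blank or newline after the last letter leads to an accepting state marked *, with the actions return(1, ), return(2, ), return(5, ), return(3, ) and return(4, ) respectively.

Identifier: a letter leads from the start state to a state that loops on letter or digit; a character that is not a letter or digit leads to an accepting state marked *, with the action return(6, INSTALL()).

Constant: a digit leads from the start state to a state that loops on digit; a character that is not a digit leads to an accepting state marked *, with the action return(7, INSTALL()).

Relops: the relational operators lead to accepting states with the actions return(8,1) to return(8,6); the states reached on a lookahead character that is not part of the operator are marked *.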

Regular Expressions

Strings and Languages

Alphabet (or character class) denotes any finite set of symbols. Eg: {0,1} is an alphabet with two symbols, 0 and 1.

String: a finite sequence of symbols. Eg: 001, 10101, ...

Operations on strings

Length: |x| denotes the length of string x, i.e. the number of characters in x. ε is the empty string, and |ε| = 0.

Concatenation of x and y is denoted by x.y or xy, and is formed by appending string y to x. Eg: if x = abc and y = de, then x.y = abcde. εx = xε = x, where ε is the identity for concatenation.

Exponentiation: x^i means string x repeated i times. Eg: x^1 = x; x^2 = xx; x^3 = xxx; ... and x^0 = ε.

Prefix of x is obtained by discarding 0 or more trailing symbols of x. Eg: abc, abcd, a, ... are prefixes of abcde.

Suffix of x is obtained by discarding 0 or more leading symbols of x. Eg: cde, e, ε are suffixes of abcde.

Substring of x is obtained by deleting a prefix and a suffix from x. Eg: cd, abc, de, abcde are substrings of abcde. Every prefix and every suffix of x is a substring of x, but a substring need not be a prefix or a suffix. ε and x are prefixes, suffixes and substrings of x.

Language: a set of strings formed from a specific alphabet. If L and M are two languages, the possible operations are:

Concatenation: the concatenation of L and M is denoted L.M or LM, and is found by selecting a string x from L and a string y from M and joining them in that order.
LM = {xy | x is in L and y is in M}
L∅ = ∅L = ∅

Exponentiation: L^i = LL...L (i times)
L^0 = {ε}, and {ε}L = L{ε} = L

Union: L ∪ M = {x | x is in L or x is in M}
∅ ∪ L = L ∪ ∅ = L

Closure: * denotes zero or more instances of. L* = ∪ (i = 0 to ∞) L^i
Eg: let L = {aa}. L* is the set of all strings with an even number of a's: L^0 = {ε}, L^1 = {aa}, L^2 = {aaaa}, ...

Positive closure: + means one or more instances of. L+ leaves L^0 = {ε} out of the union, so L+ = L.(L*):

L.(L*) = L . ∪ (i = 0 to ∞) L^i = ∪ (i = 0 to ∞) L^(i+1) = ∪ (i = 1 to ∞) L^i = L+

Regular expressions are used to describe the tokens and to define a language. Eg: for identifier, identifier = letter (letter | digit)*

Regular expression construction rules

1. ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
2. For each a in Σ, a is a regular expression denoting {a}, the language with only one string, that string consisting of the single symbol a.
3. If R and S are regular expressions denoting the languages LR and LS respectively, then
   (i) (R) | (S) is a regular expression denoting LR ∪ LS
   (ii) (R).(S) is a regular expression denoting LR.LS
   (iii) (R)* is a regular expression denoting LR*

A regular expression is defined in terms of primitive regular expressions (the basis) and compound regular expressions (the induction rules). So rules 1 and 2 form the basis, and rule 3 forms the inductive portion.

Eg: Some regular expressions
1. a* denotes all strings of 0 or more a's.
2. aa* denotes all strings of one or more a's (a+).
3. (a | b)* denotes the set of all strings of a's and b's, i.e. the same language as (a*b*)*.
4. (aa | ab | ba | bb)* denotes all strings of even length.
5. ε | a | b denotes the strings of length 0 or 1.
6. (a | b)(a | b)(a | b) denotes the strings of length 3, so (a | b)(a | b)(a | b)(a | b)* denotes the strings of length 3 or more, and ε | a | b | (a | b)(a | b)(a | b)(a | b)* denotes all strings whose length is not 2.

Regular expressions for the tokens

keyword = BEGIN | END | IF | THEN | ELSE
identifier = letter (letter | digit)*
constant = digit+
relop = < | <= | = | <> | > | >=

If two regular expressions R and S denote the same language, then R and S are equivalent, i.e. R = S. Eg: (a | b)* = (a*b*)*

Algebraic laws for regular expressions
1. R | S = S | R (| is commutative)
2. (R | S) | T = R | (S | T) (| is associative)
3. R(ST) = (RS)T (. is associative)
4. R(S | T) = RS | RT and (S | T)R = SR | TR (. distributes over |)
5. εR = Rε = R (ε is the identity for concatenation)

Finite Automata

A recognizer for a language L is a program that takes a string x as its input and answers "yes" if x is a sentence of L and "no" otherwise. A finite automaton is such a program; it identifies the presence of a token on the input.

Language Recognizer

To determine whether x belongs to the language L denoted by a regular expression R, x is decomposed into a sequence of substrings denoted by the primitive subexpressions in R.

How it works

Given R = (a | b)*abb, the set of all strings of a's and b's ending in abb, and x = aabb. Since R = R1R2, where R1 = (a | b)* and R2 = abb, it is easy to show that a is an element of the language denoted by R1 and that abb matches R2, so x = aabb is in the language denoted by R.

Nondeterministic Automata

A nondeterministic finite automaton (NFA) is the generalized transition diagram that is derived from the expression.

Fig: A nondeterministic finite automaton for (a | b)*abb, with states 0 (start), 1, 2 and 3 (accepting).

The nodes are called states and the labeled edges are called transitions. Edges can be labeled by ε as well as by characters. Also, the same character can label two or more transitions out of one state. An NFA has one start state and can have one or more final (accepting) states.

The transitions of an NFA can be represented in tabular form. In the transition table there is a row for each state and a column for each admissible input symbol and for ε. The entry for row i and symbol a is the set of possible next states for state i on input a.

Transition table

State    Input symbol
         a        b
0        {0,1}    {0}
1        -        {2}
2        -        {3}

Fig: Transition table

The path for the input string aabb can be represented by the following sequence of moves:

State    Remaining input
0        aabb
0        abb
1        bb
2        b
3        ε

The language defined by an NFA is the set of input strings it accepts.
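The transition table can be used directly to simulate the NFA by tracking the set of states it could be in after each input symbol. A minimal sketch, assuming a dictionary encoding of the table and taking state 3 as the only accepting state; the names NFA and accepts are chosen for illustration, and this particular NFA has no ε-transitions to follow.

```python
# Transition table of the NFA for (a | b)*abb: state -> symbol -> set of next states
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START, ACCEPTING = 0, {3}

def accepts(s):
    """True if some path labeled s leads from the start state to an accepting state."""
    current = {START}
    for ch in s:
        # Follow every possible transition on ch from every state we might be in.
        current = {t for q in current for t in NFA[q].get(ch, set())}
    return bool(current & ACCEPTING)

for s in ["aabb", "abb", "babb", "aab", ""]:
    print(repr(s), accepts(s))
```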

Fig: NFA accepting aa* | bb*, with states 0 (start), 1, 2, 3 and 4 and edges labeled a and b.

Algorithm to construct an NFA from a Regular Expression

Input: A regular expression R over alphabet Σ.
Output: An NFA N accepting the language denoted by R.
Method: Decompose R into its primitive components. For each component, we construct a finite automaton inductively using the basis and induction rules.

Finite automata construction from a regular expression

The basis and induction rules are:

1. NFA for ε

Fig: i --ε--> f, where i and f are a new initial state and a new final state.

2. NFA for a

Fig: i' --a--> f', where each state should be new.

Each time we need a new state, we give that state a new name. Even if a appears several times in the regular expression R, we give each instance of a a separate finite automaton with its own states.

Having constructed components for the basis regular expressions, we proceed to combine them in ways that correspond to the way compound regular expressions are formed from smaller regular expressions.

3. NFA for R1 | R2

Let N1 and N2 be the NFAs corresponding to R1 and R2 respectively. There is a transition on ε from the new initial state i to the initial states of N1 and N2. There is an ε-transition from the final states of N1 and N2 to the new final state f. Any path from i to f must go through either N1 or N2.

4. NFA for R1R2

Let N1 and N2 be the NFAs corresponding to R1 and R2 respectively. The initial state of N2 is identified with the accepting state of N1. A path from i to f must go first through N1, then through N2.

5. NFA for R1*

In this, we can go from i to f directly along a path labeled ε, or go through N1 one or more times.
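The five rules can be sketched as methods that each return the (initial, final) pair of the NFA they build. This is an illustration under assumptions: the class Builder, the edge-list representation, the use of None as the ε label and the helper accepts are all hypothetical, and states are numbered in order of creation rather than as in the figures.

```python
from itertools import count

EPS = None   # label used here for ε-transitions (an assumption of this sketch)

class Builder:
    def __init__(self):
        self.ids = count()
        self.edges = []                 # list of (from_state, label, to_state)

    def new_state(self):
        return next(self.ids)

    def symbol(self, a):                # rules 1 and 2: i --a--> f (a may be EPS)
        i, f = self.new_state(), self.new_state()
        self.edges.append((i, a, f))
        return i, f

    def union(self, n1, n2):            # rule 3: new i and f around N1 and N2
        i, f = self.new_state(), self.new_state()
        for start, final in (n1, n2):
            self.edges.append((i, EPS, start))
            self.edges.append((final, EPS, f))
        return i, f

    def concat(self, n1, n2):           # rule 4: identify start of N2 with final of N1
        (i1, f1), (i2, f2) = n1, n2
        self.edges = [(f1 if s == i2 else s, a, f1 if t == i2 else t)
                      for (s, a, t) in self.edges]
        return i1, f2

    def star(self, n1):                 # rule 5: skip N1, or go through it repeatedly
        i, f = self.new_state(), self.new_state()
        start, final = n1
        self.edges += [(i, EPS, start), (final, EPS, f), (final, EPS, start), (i, EPS, f)]
        return i, f

def accepts(edges, start, final, text):
    """Simulate the NFA on text, following ε-edges after every step."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for (p, a, q) in edges:
                if p == s and a is EPS and q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen
    current = closure({start})
    for ch in text:
        current = closure({q for (p, a, q) in edges if p in current and a == ch})
    return final in current

b = Builder()
ab_star = b.star(b.union(b.symbol("a"), b.symbol("b")))          # (a | b)*
start, final = b.concat(b.concat(b.concat(ab_star, b.symbol("a")),
                                 b.symbol("b")), b.symbol("b"))   # (a | b)*abb
states = {s for (s, _, t) in b.edges} | {t for (s, _, t) in b.edges}
print(len(states), accepts(b.edges, start, final, "aabb"))
```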

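Decomposition of (a | b)*abb

The expression is decomposed into the subexpressions R1 to R11, and an NFA is built for each:

R1 = a. N1: 2 --a--> 3
R2 = b. N2: 4 --b--> 5
R3 = R1 | R2. N3: new initial state 1 and new final state 6 around N1 and N2.
R4 = (R3). N4 is the same as N3.
R5 = (R4)*. N5: new initial state 0 and new final state 7 around N4.
R6 = a. N6: 7' --a--> 8
R7 = R5R6. N7: N5 followed by N6, with 7' identified with 7.
R8 = b. N8: 8' --b--> 9
R9 = R7R8. N9: N7 followed by N8, with 8' identified with 8.
R10 = b. N10: 9' --b--> 10
R11 = R9R10. N11: N9 followed by N10, with 9' identified with 9. State 0 is the start state and state 10 is the accepting state.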

Deterministic Automata (DFA)

Since the transition function of an NFA is multivalued and ε-transitions are allowed, it is difficult to simulate an NFA with a computer program.

A finite automaton is deterministic if
1. it has no transitions on input ε, and
2. for each state s and input symbol a, there is at most one edge labeled a leaving s.

For each NFA, we can find a DFA accepting the same language.

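Applying the conversion to the NFA N11 for (a | b)*abb (states 0 to 10):

ε-closure({0}) = {0, 1, 2, 4, 7} = A. From A: on a to {3, 8}, on b to {5}.
ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B. From B: on a to {3, 8}, on b to {5, 9}.
ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C. From C: on a to {3, 8}, on b to {5}.
ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D. From D: on a to {3, 8}, on b to {5, 10}.
ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E. From E: on a to {3, 8}, on b to {5}.

State         Input symbol
              a    b
A (Start)     B    C
B             B    D
C             B    C
D             B    E
E (Accept)    B    C

Fig: Transition diagram of the DFA with states A (start), B, C, D and E (accepting).

Minimizing the number of states

States A and C have the same transitions on both inputs and are both nonaccepting, so they can be merged:

State         Input symbol
              a    b
A (Start)     B    A
B             B    D
D             B    E
E (Accept)    B    A

Fig: Transition diagram of the minimized DFA with states A (start), B, D and E (accepting).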

Constructing DFA from NFA

Algorithm

Input: an NFA N.
Output: a DFA D accepting the same language.

Let us define the function ε-CLOSURE(s) to be the set of states of N built by applying the following rules:
1. s is added to ε-CLOSURE(s).
2. If t is in ε-CLOSURE(s), and there is an edge labeled ε from t to u, then u is added to ε-CLOSURE(s) if u is not already there. Rule 2 is repeated until no more states can be added to ε-CLOSURE(s).

Thus, ε-CLOSURE(s) is the set of states that can be reached from s on ε-transitions only. If T is a set of states, then ε-CLOSURE(T) is the union over all states s in T of ε-CLOSURE(s).

Constructing DFA from NFA

Algorithm: ε-CLOSURE

push all states in T onto stack;
ε-CLOSURE(T) := T;
while stack is not empty do
begin
    pop s, the top element, off the stack;
    for each state t with an edge from s to t labeled ε do
        if t is not in ε-CLOSURE(T) then
        begin
            add t to ε-CLOSURE(T);
            push t onto stack
        end
end
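The stack-based closure computation can be written out as follows. The sketch assumes the ε-edges of the 11-state NFA for (a | b)*abb numbered 0 to 10 (0 to 1 and 7, 1 to 2 and 4, 3 to 6, 5 to 6, 6 to 1 and 7); the names EPS_EDGES and eps_closure are chosen for illustration.

```python
# ε-edges only: state -> list of states reachable by a single ε-transition
EPS_EDGES = {0: [1, 7], 1: [2, 4], 3: [6], 5: [6], 6: [1, 7]}

def eps_closure(T):
    """Set of states reachable from the states in T using ε-transitions only."""
    closure = set(T)
    stack = list(T)                   # push all states in T onto the stack
    while stack:
        s = stack.pop()
        for t in EPS_EDGES.get(s, []):
            if t not in closure:      # add each newly reached state once
                closure.add(t)
                stack.append(t)
    return closure

print(sorted(eps_closure({0})))
print(sorted(eps_closure({3, 8})))
```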

Constructing DFA from NFA

Algorithm: Subset construction

Initially, ε-CLOSURE({s0}), where s0 is the start state of N, is the only state of D, and it is unmarked.

while there is an unmarked state x = {s1, s2, ..., sn} of D do
begin
    mark x;
    for each input symbol a do
    begin
        let T be the set of states to which there is a transition on a from some state si in x;
        y := ε-CLOSURE(T);
        if y has not yet been added to the set of states of D then
            make y an unmarked state of D;
        add a transition from x to y labeled a if not already present
    end
end

A state of D is accepting if it contains at least one accepting state of N.
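The subset construction can be run on the 11-state NFA for (a | b)*abb. This is a sketch under assumptions: the NFA is encoded as a dictionary with the empty string standing for ε, empty target sets are skipped instead of being turned into a dead state, and the names NFA, eps_closure, move and subset_construction are chosen here.

```python
EPS = ""   # label used for ε in this sketch
NFA = {
    0: {EPS: {1, 7}}, 1: {EPS: {2, 4}}, 2: {"a": {3}}, 3: {EPS: {6}},
    4: {"b": {5}}, 5: {EPS: {6}}, 6: {EPS: {1, 7}}, 7: {"a": {8}},
    8: {"b": {9}}, 9: {"b": {10}}, 10: {},
}

def eps_closure(T):
    closure, stack = set(T), list(T)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPS, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(T, a):
    """States reachable from some state in T by one transition on a."""
    return {t for s in T for t in NFA[s].get(a, set())}

def subset_construction(start, alphabet):
    start_set = frozenset(eps_closure({start}))
    states = [start_set]              # position in this list = DFA state number
    dtran = {}
    unmarked = [start_set]
    while unmarked:
        x = unmarked.pop()            # mark x
        for a in alphabet:
            y = frozenset(eps_closure(move(x, a)))
            if not y:
                continue              # no transition on a (dead state omitted)
            if y not in states:
                states.append(y)
                unmarked.append(y)
            dtran[(states.index(x), a)] = states.index(y)
    return states, dtran

states, dtran = subset_construction(0, "ab")
names = "ABCDE"
for (x, a), y in sorted(dtran.items()):
    print(names[x], a, "->", names[y])
```

With this order of exploration the five sets are discovered in the order A, B, C, D, E used in the worked example; the only state containing NFA state 10 is E, which is therefore the accepting state.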

Minimizing the number of states in DFA

Algorithm

Input: a DFA M.
Output: a minimum-state DFA M'.

1. If some states in M ignore some inputs, add transitions to a dead state.
2. Let P = {all accepting states, all nonaccepting states}.
3. Let P' = {}.
4. Loop: for each group G in P do
       partition G into subgroups so that s and t (in G) belong to the same subgroup if and only if, for each input a, states s and t have transitions to states in the same group of P;
       put those subgroups in P'.
5. If P != P' then set P := P', P' := {} and go to Loop.
6. Remove any dead states and unreachable states.
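The partitioning loop can be sketched as repeated refinement of a list of groups. The example below assumes the five-state DFA for (a | b)*abb (states A to E, E accepting) with a transition defined for every state and input, so no dead state is needed; the names DFA, ACCEPTING and minimize are chosen for illustration, and the function returns only the final groups rather than building M'.

```python
DFA = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
ACCEPTING = {"E"}

def minimize(dfa, accepting, alphabet):
    """Return the final partition of the states into groups of equivalent states."""
    partition = [g for g in (set(accepting), set(dfa) - set(accepting)) if g]
    while True:
        def group_of(state):
            return next(i for i, g in enumerate(partition) if state in g)
        new_partition = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                # States stay together only if every input leads to the same groups.
                key = tuple(group_of(dfa[s][a]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):   # nothing was split: P = P'
            return new_partition
        partition = new_partition

groups = minimize(DFA, ACCEPTING, "ab")
print(sorted(sorted(g) for g in groups))
```

Each group of the final partition becomes one state of the minimized DFA; here A and C end up in the same group, which matches merging C into A.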

NFA to DFA: Example 2

Fig: An NFA with start state 0 and states 1 to 8, with edges labeled a and b.

ε-closure({0}) = {0,1,3,7}
subset({0,1,3,7}, a) = {2,4,7}
subset({0,1,3,7}, b) = {8}

ε-closure({2,4,7}) = {2,4,7}
subset({2,4,7}, a) = {7}
subset({2,4,7}, b) = {5,8}

ε-closure({8}) = {8}
subset({8}, a) = ∅
subset({8}, b) = {8}

ε-closure({7}) = {7}
subset({7}, a) = {7}
subset({7}, b) = {8}

DFA states:
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}

Fig: The resulting DFA with states A (start) to F, with the labels a1, a2 and a3 attached to states.

Fig: Minimizing the Number of States of a DFA. A DFA with states A (start), B, C, D and E is reduced to one with states A (start), B, D and E.

A language for specifying Lexical Analyzers

A LEX source program is a specification of a lexical analyzer, consisting of a set of regular expressions together with an action for each regular expression.

The action is a piece of code which is to be executed whenever a token specified by the corresponding regular expression is recognized.

The output of LEX is a lexical analyzer program constructed from the LEX source specification.

Creating a Lexical Analyzer with Lex

Fig: lex source program -> lex compiler -> lexical analyzer L; input stream -> lexical analyzer L -> sequence of tokens.

A LEX source program consists of two parts: auxiliary definitions and translation rules.

Auxiliary Definitions

The auxiliary definitions are statements of the form

D1 = R1
D2 = R2
...
Dn = Rn

Eg:
letter = A | B | ... | Z
digit = 0 | 1 | ... | 9
identifier = letter (letter | digit)*

Translation Rules

The translation rules of a LEX program are statements of the form

P1 {A1}
P2 {A2}
...
Pm {Am}

where each Pi is a regular expression called a pattern and each Ai is a program fragment. The patterns describe the form of the tokens. The program fragment describes what action the lexical analyzer should take when token Pi is found.

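AUXILIARY DEFINITIONS

letter = A | B | ... | Z
digit = 0 | 1 | ... | 9

TRANSLATION RULES

BEGIN                      {return 1}
END                        {return 2}
IF                         {return 3}
THEN                       {return 4}
ELSE                       {return 5}
letter (letter | digit)*   {LEXVAL := INSTALL(); return 6}
digit+                     {LEXVAL := INSTALL(); return 7}
<                          {LEXVAL := 1; return 8}
<=                         {LEXVAL := 2; return 8}
=                          {LEXVAL := 3; return 8}
<>                         {LEXVAL := 4; return 8}
>                          {LEXVAL := 5; return 8}
>=                         {LEXVAL := 6; return 8}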

Regular Expressions in Lex

x          match the character x
\.         match the character .
"string"   match the contents of string as literal characters
.          match any character except newline
^          match the beginning of a line
$          match the end of a line
[xyz]      match one character x, y, or z (use \ to escape -)
[^xyz]     match any character except x, y, and z
[a-z]      match one of a to z
r*         closure (match zero or more occurrences)
r+         positive closure (match one or more occurrences)
r?         optional (match zero or one occurrence)
r1r2       match r1 then r2 (concatenation)
r1|r2      match r1 or r2 (union)
( r )      grouping
r1/r2      match r1 when followed by r2
{d}        match the regular expression defined by d
