Lecture # 3 Chapter #3: Lexical Analysis
Role of Lexical Analyzer
• It is the first phase of a compiler
• Its main task is to read the input characters and produce a sequence of tokens that the parser uses for syntax analysis
• Reasons to make it a separate phase:
– Simplifies the design of the compiler
– Allows a more efficient implementation
– Improves portability
Interaction of the Lexical Analyzer with the Parser (Fig 3.1)
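[Figure: the source program feeds the lexical analyzer. The parser sends "get next token" requests, and the lexical analyzer answers with a (token, tokenval) pair. Both phases consult the symbol table, and both can report errors.]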
Tokens, Patterns, and Lexemes
• A token is a classification of lexical units
– For example: id and num
• Lexemes are the specific character strings that make up a token
– For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token
– For example: “letter followed by letters and digits” and “non-empty sequence of digits”
Difference between Token, Lexeme, and Pattern
Token      Lexemes                Pattern
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or > or >=
id         y, x                   letter followed by letters and digits
num        31, 28                 any numeric constant
operator   +, *, -, /             any arithmetic operator: + or * or - or /
Attributes of Tokens
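[Figure: the lexical analyzer turns the source text into a stream of <token, tokenval> pairs for the parser; tokenval is the token attribute.]
Input to the lexical analyzer: y := 31 + 28*x
Output to the parser: <id, “y”> <assign, :=> <num, 31> <operator, +> <num, 28> <operator, *> <id, “x”>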
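The token/attribute pairs for y := 31 + 28*x can be reproduced with a small regular-expression scanner. This is only a sketch: the names TOKEN_SPEC, MASTER, and tokenize are made up for illustration, and the patterns are assumptions chosen to match this one example rather than a full language.

```python
import re

# Hypothetical token table: (token name, pattern). Alternatives are tried
# left to right, so the order of the entries matters.
TOKEN_SPEC = [
    ("num", r"[0-9]+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("assign", r":="),
    ("operator", r"[+\-*/]"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, tokenval) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        if kind == "num":
            tokens.append((kind, int(lexeme)))   # attribute is the numeric value
        elif kind != "skip":
            tokens.append((kind, lexeme))        # attribute is the lexeme itself
        pos = m.end()
    return tokens


print(tokenize("y := 31 + 28*x"))
```

White space is matched and then discarded, so it never reaches the parser.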
Section # 3.3: Specification of Tokens
• Alphabet: a finite, nonempty set of symbols, conventionally written $\Sigma$. Example: the binary alphabet $\Sigma = \{0, 1\}$
• String: a finite sequence of symbols from an alphabet, e.g. 0011001
• Empty string: the string with zero occurrences of symbols from the alphabet. The empty string is denoted by $\varepsilon$
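• Length of a string: the number of positions for symbols in the string. $|w|$ denotes the length of string $w$
Example: $|0110| = 4$; $|\varepsilon| = 0$
• Powers of an alphabet: $\Sigma^k$ = the set of strings of length $k$ with symbols from $\Sigma$
Example: for $\Sigma = \{0, 1\}$, $\Sigma^0 = \{\varepsilon\}$, $\Sigma^1 = \{0, 1\}$, $\Sigma^2 = \{00, 01, 10, 11\}$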
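Powers of an alphabet are easy to enumerate for small k. The sketch below assumes the binary alphabet {0, 1}; the function name power is chosen here for illustration, and the empty Python string stands in for the empty string.

```python
from itertools import product


def power(alphabet, k):
    """Return the set of all strings of length k over the given alphabet."""
    return {"".join(symbols) for symbols in product(sorted(alphabet), repeat=k)}


sigma = {"0", "1"}
for k in range(3):
    print(k, sorted(power(sigma, k)))
```

For k = 0 the result is the one-element set containing the empty string, not the empty set.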
• The set of all strings over $\Sigma$ is denoted $\Sigma^*$; that is, $\Sigma^* = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \cdots$
• Language: a specific set of strings over some fixed alphabet
Examples:
– The set of legal English words
– The set of strings consisting of n 0's followed by n 1's
– $L_P$ = the set of binary numbers whose value is prime
Concatenation and Exponentiation
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
$s^0 = \varepsilon$
$s^i = s^{i-1}s$ for $i > 0$
Note that $s\varepsilon = \varepsilon s = s$
Language Operations
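• Union: $L \cup M = \{s \mid s \in L \text{ or } s \in M\}$
• Concatenation: $LM = \{xy \mid x \in L \text{ and } y \in M\}$
• Exponentiation: $L^0 = \{\varepsilon\}$; $L^i = L^{i-1}L$
• Kleene closure: $L^* = \bigcup_{i=0}^{\infty} L^i$
• Positive closure: $L^+ = \bigcup_{i=1}^{\infty} L^i$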
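For finite languages these operations map directly onto Python sets. The sketch below is illustrative only: the function names are invented here, the empty Python string plays the role of the empty string, and because the Kleene closure is infinite for any language containing a nonempty string, kleene_upto computes just the union of the first few powers.

```python
def union(L, M):
    """All strings that are in L or in M."""
    return L | M


def concat(L, M):
    """All strings xy with x in L and y in M."""
    return {x + y for x in L for y in M}


def power(L, i):
    """L^0 contains only the empty string; L^i is L^(i-1) followed by L."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_upto(L, n):
    """Finite approximation of the Kleene closure: L^0 through L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out


L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(kleene_upto(L, 2)))
```

The positive closure differs only in starting the union at i = 1, so it omits the empty string unless L itself contains it.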
Regular Expressions
• Basis symbols:
– $\varepsilon$ is a regular expression denoting the language $\{\varepsilon\}$
– a is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
– r|s is a regular expression denoting $L(r) \cup M(s)$
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set or a regular language
Regular Definitions
• Regular definitions introduce a naming convention:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over $\Sigma \cup \{d_1, d_2, \ldots, d_{i-1}\}$
• Example:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
• The following shorthands are often used:
r+ = rr*
r? = r | $\varepsilon$
[a-z] = a | b | c | … | z
• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
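Regular definitions translate almost one-to-one into Python regular expressions when each name is built from earlier names. In this sketch the variable and function names are illustrative, and letter is assumed to mean the ASCII letters.

```python
import re

# Each definition only refers to names defined before it.
digit = r"[0-9]"
letter = r"[A-Za-z]"
ident = rf"{letter}(?:{letter}|{digit})*"
num = rf"{digit}+(?:\.{digit}+)?(?:E[+-]?{digit}+)?"


def is_num(s):
    """True when the whole string s matches the num definition."""
    return re.fullmatch(num, s) is not None


def is_id(s):
    """True when the whole string s matches the id definition."""
    return re.fullmatch(ident, s) is not None


for lexeme in ["31", "3.14", "6.02E+23", "1E5", "3.", "E5", "x1"]:
    print(lexeme, is_num(lexeme), is_id(lexeme))
```

Note that "3." is rejected because the optional fraction needs at least one digit after the point, and "E5" is an id rather than a num because num must start with a digit.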
Regular Definitions and Grammars
Grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num
Regular definitions:
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
Coding Regular Definitions in Transition Diagrams
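[Figure: transition diagrams for relop and id. States marked * are reached on a character that is not part of the lexeme.]
relop → < | <= | <> | > | >= | =
– Start state 0
– State 0 on <: go to state 1
– State 0 on =: go to state 5, return(relop, EQ)
– State 0 on >: go to state 6
– State 1 on =: go to state 2, return(relop, LE)
– State 1 on >: go to state 3, return(relop, NE)
– State 1 on other: go to state 4*, return(relop, LT)
– State 6 on =: go to state 7, return(relop, GE)
– State 6 on other: go to state 8*, return(relop, GT)
id → letter ( letter | digit )*
– Start state 9
– State 9 on letter: go to state 10
– State 10 on letter or digit: stay in state 10
– State 10 on other: go to state 11*, return(gettoken(), install_id())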
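A transition diagram of this kind can be coded by hand as nested branches on the current and the next character. The sketch below covers only relop; the function name, the return shape, and the state numbers in the comments are assumptions based on the usual textbook layout of the diagram.

```python
def relop(text, pos=0):
    """Try to recognize a relational operator starting at text[pos].

    Returns ((token, attribute), next_pos), or None when no relational
    operator starts at pos. The "other" branches do not consume the
    lookahead character, which corresponds to the states marked *.
    """
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands for end of input

    c = peek(pos)
    if c == "<":                                   # state 0 -> state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2        # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2        # state 3
        return ("relop", "LT"), pos + 1            # state 4*
    if c == "=":
        return ("relop", "EQ"), pos + 1            # state 5
    if c == ">":                                   # state 0 -> state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2        # state 7
        return ("relop", "GT"), pos + 1            # state 8*
    return None


for source in ["<= y", "<>", "< 3", "=", ">=", ">x", "x"]:
    print(repr(source), relop(source))
```

Returning the next position instead of retracting a shared input pointer is a design choice of this sketch; it keeps the function free of global state.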
Section 3.6 : Finite Automata
• Finite Automata are used as a model for:
– Software for designing digital circuits
– The lexical analyzer of a compiler
– Searching for keywords in a file or on the web
– Software for verifying finite-state systems, such as communication protocols
Design of a Lexical Analyzer Generator
• Translate regular expressions to an NFA
• Translate the NFA to an efficient DFA
[Figure: regular expressions → NFA → DFA, where the NFA-to-DFA step is optional. Either the NFA is simulated directly to recognize tokens, or the DFA is simulated to recognize tokens.]
Nondeterministic Finite Automata
• An NFA is a 5-tuple $(S, \Sigma, \delta, s_0, F)$ where
– $S$ is a finite set of states
– $\Sigma$ is a finite set of symbols, the alphabet
– $\delta$ is a mapping from $S \times \Sigma$ to a set of states
– $s_0 \in S$ is the start state
– $F \subseteq S$ is the set of accepting (or final) states
Transition Graph
• An NFA can be diagrammatically represented by a labeled directed graph called a transition graph
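[Figure: transition graph with start state 0 and accepting state 3. State 0 loops to itself on a and on b; the remaining edges are 0 → 1 on a, 1 → 2 on b, and 2 → 3 on b.]
$S = \{0, 1, 2, 3\}$
$\Sigma = \{a, b\}$
$s_0 = 0$
$F = \{3\}$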
Transition Table
• The mapping $\delta$ of an NFA can be represented in a transition table
State   Input a   Input b
0       {0, 1}    {0}
1                 {2}
2                 {3}
$\delta(0, a) = \{0, 1\}$
$\delta(0, b) = \{0\}$
$\delta(1, b) = \{2\}$
$\delta(2, b) = \{3\}$
The Language Defined by an NFA
• An NFA accepts an input string x if and only if there is some path with edges labeled with symbols from x in sequence from the start state to some accepting state in the transition graph
• A state transition from one state to another on the path is called a move
• The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example NFA
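Acceptance can be checked by tracking every state the NFA could be in after each input symbol, which amounts to following all paths at once. The sketch below uses the transition table of the example NFA (start state 0, accepting state 3); the names DELTA and accepts are chosen here for illustration.

```python
# Transition table of the example NFA: (state, symbol) -> set of next states.
# A missing entry means the NFA has no move on that symbol.
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}


def accepts(x):
    """True when some path labeled x leads from START to an accepting state."""
    current = {START}
    for symbol in x:
        current = {t for s in current for t in DELTA.get((s, symbol), set())}
    return bool(current & ACCEPTING)


for x in ["abb", "aabb", "babb", "ab", "abba", ""]:
    print(repr(x), accepts(x))
```

This NFA has no moves on the empty string, so no closure step is needed between symbols.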
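From Regular Expression to $\varepsilon$-NFA (Thompson’s Construction)
[Figure: one NFA fragment per kind of regular expression, each with a start state i and an accepting state f.]
• $\varepsilon$: a single $\varepsilon$-edge from i to f
• a: a single edge labeled a from i to f
• r1 | r2: $\varepsilon$-edges from i to the start states of N(r1) and N(r2), and $\varepsilon$-edges from their accepting states to f
• r1 r2: N(r1) followed by N(r2), where the accepting state of N(r1) is the start state of N(r2)
• r*: $\varepsilon$-edges from i into N(r) and from N(r) to f, an $\varepsilon$-edge from i to f that skips N(r), and an $\varepsilon$-edge from the accepting state of N(r) back to its start state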
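The construction is naturally recursive over the structure of the regular expression. The sketch below is one possible rendering, not the slide's own code: the tuple encoding of expressions ("sym", "alt", "cat", "star", "eps"), the Builder class, and the helper names are all invented for illustration, and concatenation glues the two fragments with an extra edge on the empty string instead of merging states, which yields a slightly larger but equivalent NFA.

```python
from itertools import count

EPS = None  # label used for edges on the empty string (arbitrary choice)


class Builder:
    def __init__(self):
        self.ids = count()
        self.edges = []                      # (source, label, target)

    def new_pair(self):
        return next(self.ids), next(self.ids)

    def build(self, r):
        """Return (start, accept) of the NFA fragment for expression r."""
        kind = r[0]
        if kind == "eps":
            i, f = self.new_pair()
            self.edges.append((i, EPS, f))
        elif kind == "sym":
            i, f = self.new_pair()
            self.edges.append((i, r[1], f))
        elif kind == "alt":
            i1, f1 = self.build(r[1])
            i2, f2 = self.build(r[2])
            i, f = self.new_pair()
            self.edges += [(i, EPS, i1), (i, EPS, i2), (f1, EPS, f), (f2, EPS, f)]
        elif kind == "cat":
            i, f1 = self.build(r[1])
            i2, f = self.build(r[2])
            self.edges.append((f1, EPS, i2))  # simplification: glue, do not merge
        elif kind == "star":
            i1, f1 = self.build(r[1])
            i, f = self.new_pair()
            self.edges += [(i, EPS, i1), (f1, EPS, f), (i, EPS, f), (f1, EPS, i1)]
        else:
            raise ValueError(f"unknown node kind {kind!r}")
        return i, f


def closure(states, edges):
    """All states reachable from states using only EPS edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(r, x):
    """Build the NFA for r and simulate it on the string x."""
    b = Builder()
    start, final = b.build(r)
    current = closure({start}, b.edges)
    for ch in x:
        moved = {dst for src, label, dst in b.edges if src in current and label == ch}
        current = closure(moved, b.edges)
    return final in current


a, b_ = ("sym", "a"), ("sym", "b")
R = ("cat", ("cat", ("cat", ("star", ("alt", a, b_)), a), b_), b_)  # (a|b)*abb
for x in ["abb", "babb", "ab", ""]:
    print(repr(x), accepts(R, x))
```

Every case adds at most two new states, so the NFA stays linear in the size of the expression.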
Combining the NFAs of a Set of Regular Expressions
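Patterns and actions:
a { action1 }
abb { action2 }
a*b+ { action3 }
[Figure: one NFA per pattern. For a: 1 → 2 on a. For abb: 3 → 4 on a, 4 → 5 on b, 5 → 6 on b. For a*b+: state 7 loops on a, 7 → 8 on b, state 8 loops on b. The combined NFA adds a new start state 0 with $\varepsilon$-edges to states 1, 3, and 7; states 2, 6, and 8 are accepting.]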
Deterministic Finite Automata
• A deterministic finite automaton is a special case of an NFA
– No state has an $\varepsilon$-transition
– For each state s and input symbol a there is at most one edge labeled a leaving s
• Each entry in the transition table is a single state
– At most one path exists to accept a string
– The simulation algorithm is simple
Example DFA
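[Figure: DFA with start state 0 and accepting state 3. Edges: 0 → 1 on a, 0 → 0 on b, 1 → 1 on a, 1 → 2 on b, 2 → 1 on a, 2 → 3 on b, 3 → 1 on a, 3 → 0 on b.]
A DFA that accepts (a|b)*abb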
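Simulating a DFA needs only one current state and one table lookup per input symbol. The table below is an assumption: it is the usual four-state DFA for strings over {a, b} that end in abb, taken to match the figure. DTRAN and accepts are names chosen for this sketch.

```python
# Transition table: exactly one next state per (state, symbol) pair.
DTRAN = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START, ACCEPTING = 0, {3}


def accepts(x):
    """Run the DFA on x and report whether it ends in an accepting state."""
    state = START
    for symbol in x:
        if (state, symbol) not in DTRAN:
            return False          # symbol outside the alphabet: reject
        state = DTRAN[(state, symbol)]
    return state in ACCEPTING


for x in ["abb", "aabb", "abab", "abbb", ""]:
    print(repr(x), accepts(x))
```

The loop does constant work per symbol, which is why DFA simulation is the fast path in a generated scanner.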
Conversion of an NFA into a DFA
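• The subset construction algorithm converts an NFA into a DFA using:
– $\varepsilon\text{-closure}(s) = \{s\} \cup \{t \mid s \xrightarrow{\varepsilon} \cdots \xrightarrow{\varepsilon} t\}$
– $\varepsilon\text{-closure}(T) = \bigcup_{s \in T} \varepsilon\text{-closure}(s)$
– $\mathit{move}(T, a) = \{t \mid s \xrightarrow{a} t \text{ and } s \in T\}$
• The algorithm produces:
– Dstates, the set of states of the new DFA, each consisting of a set of states of the NFA
– Dtran, the transition table of the new DFA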
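$\varepsilon$-closure and move Examples
[Figure: the combined NFA for a, abb, and a*b+ (start state 0 with $\varepsilon$-edges to states 1, 3, and 7), together with the state sets reached while reading the input a a b a.]
$\varepsilon\text{-closure}(\{0\}) = \{0, 1, 3, 7\}$
$\mathit{move}(\{0, 1, 3, 7\}, a) = \{2, 4, 7\}$
$\varepsilon\text{-closure}(\{2, 4, 7\}) = \{2, 4, 7\}$
$\mathit{move}(\{2, 4, 7\}, a) = \{7\}$
$\varepsilon\text{-closure}(\{7\}) = \{7\}$
$\mathit{move}(\{7\}, b) = \{8\}$
$\varepsilon\text{-closure}(\{8\}) = \{8\}$
$\mathit{move}(\{8\}, a) = \emptyset$
State sets: {0, 1, 3, 7} on a → {2, 4, 7} on a → {7} on b → {8} on a → none
Also used to simulate NFAs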
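The closure and move steps above can be replayed mechanically. In this sketch the edge list is an assumption reconstructed from the values shown (start state 0 with empty-string edges to 1, 3, and 7; one branch each for a, abb, and a*b+), and the names EDGES, eps_closure, move, and simulate are illustrative.

```python
EPS = ""  # label for edges on the empty string in this sketch

# Assumed edges of the combined NFA: (state, label) -> set of next states.
EDGES = {
    (0, EPS): {1, 3, 7},
    (1, "a"): {2},
    (3, "a"): {4}, (4, "b"): {5}, (5, "b"): {6},
    (7, "a"): {7}, (7, "b"): {8}, (8, "b"): {8},
}


def eps_closure(T):
    """States reachable from any state in T using only EPS edges."""
    stack, closure = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in EDGES.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(T, a):
    """States reachable from some state in T by one edge labeled a."""
    return {t for s in T for t in EDGES.get((s, a), set())}


def simulate(x):
    """Return the set of current states after each prefix of x."""
    current = eps_closure({0})
    trace = [current]
    for a in x:
        current = eps_closure(move(current, a))
        trace.append(current)
    return trace


for states in simulate("aaba"):
    print(sorted(states))
```

An empty state set means the NFA is stuck, so a scanner would stop there and fall back to the last set that contained an accepting state.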
Subset Construction Example 1
[Figure: the NFA for (a|b)*abb with states 0 to 10, and the DFA with states A to E (start state A) obtained from it by the subset construction.]
Dstates
A = {0,1,2,4,7}
B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
D = {1,2,4,5,6,7,9}
E = {1,2,4,5,6,7,10}
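The Dstates listed for this example can be recomputed with a short worklist version of the subset construction. The NFA edge list below is an assumption: it is the usual eleven-state NFA for (a|b)*abb, chosen because it reproduces the sets A to E given above. All names are illustrative.

```python
EPS = ""  # label for edges on the empty string in this sketch

# Assumed NFA edges: (state, label) -> set of next states.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
ALPHABET = "ab"


def eps_closure(T):
    stack, closure = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(T, a):
    return {t for s in T for t in NFA.get((s, a), set())}


def subset_construction(start):
    """Return (dstates, dtran); dstates is in order of discovery."""
    dstates = [frozenset(eps_closure({start}))]
    dtran = {}
    i = 0
    while i < len(dstates):              # states at index >= i are unmarked
        T = dstates[i]
        for a in ALPHABET:
            U = frozenset(eps_closure(move(T, a)))
            if not U:
                continue                 # no transition on a from T
            if U not in dstates:
                dstates.append(U)
            dtran[(T, a)] = U
        i += 1
    return dstates, dtran


dstates, dtran = subset_construction(0)
names = {T: chr(ord("A") + i) for i, T in enumerate(dstates)}
for T in dstates:
    row = {a: names[dtran[(T, a)]] for a in ALPHABET if (T, a) in dtran}
    print(names[T], sorted(T), row)
```

Because new sets are named in discovery order, the letters come out in the same order as in the list above.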
Subset Construction Example 2
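[Figure: the combined NFA for a, abb, and a*b+, whose accepting states 2, 6, and 8 are labeled a1, a2, and a3, and the DFA with states A to F (start state A) obtained by the subset construction. DFA edges: A → B on a, A → C on b, B → D on a, B → E on b, C → C on b, D → D on a, D → C on b, E → F on b, F → C on b.]
Dstates
A = {0,1,3,7}
B = {2,4,7} (accepting: a1)
C = {8} (accepting: a3)
D = {7}
E = {5,8} (accepting: a3)
F = {6,8} (accepting: a2, a3)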