Lexical Analyzer (Checker)
Lexical Analyzer
• The lexical analyzer reads the source program character by character to produce tokens.
• Normally a lexical analyzer does not return the whole list of tokens at once; it returns one token each time the parser asks for one.
Tokens, Lexemes, and Patterns
• Tokens include keywords, operators, identifiers, constants, literal strings, and punctuation symbols
  – e.g.: identifier, number, addop, assgop
• A lexeme is a sequence of characters in the source program representing a token
  – e.g.: newval, oldval
• A pattern is a rule describing the set of lexemes that can represent a particular token
  – e.g.: the pattern for identifier describes the set of strings that start with a letter and continue with letters and digits
Attributes
• Since a token can represent more than one lexeme, attributes provide additional information about tokens
• For simplicity, a token may have a single attribute
  – For an identifier, the attribute is a pointer to the symbol table
• Examples of some attributes:
  – <id,attr> where attr is a pointer to the symbol table
  – <assgop,_> no attribute is needed (there is only one assignment operator)
  – <num,val> where val is the actual value of the number
• A token and its attribute uniquely identify a lexeme.
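A token with a single attribute can be sketched in C as a tagged pair. This is only an illustration: the names TokenKind, Token, make_id, make_num, and make_assgop are assumptions, and the symbol table pointer is modeled as a plain index.

```c
/* Token kinds taken from the attribute examples (illustrative subset). */
typedef enum { TOK_ID, TOK_NUM, TOK_ASSGOP } TokenKind;

/* A token paired with a single attribute. */
typedef struct {
    TokenKind kind;
    union {
        int symtab_index;  /* TOK_ID: position of the entry in an assumed symbol table array */
        double value;      /* TOK_NUM: the actual value of the number */
    } attr;                /* unused for TOK_ASSGOP */
} Token;

/* <id,attr> */
static Token make_id(int symtab_index) {
    Token t;
    t.kind = TOK_ID;
    t.attr.symtab_index = symtab_index;
    return t;
}

/* <num,val> */
static Token make_num(double value) {
    Token t;
    t.kind = TOK_NUM;
    t.attr.value = value;
    return t;
}

/* <assgop,_> : the attribute field is set to 0 and never read */
static Token make_assgop(void) {
    Token t;
    t.kind = TOK_ASSGOP;
    t.attr.symtab_index = 0;
    return t;
}
```

The union keeps the token small: only the field that matches the token kind is meaningful.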
Strings and Languages
• Alphabet – any finite set of symbols (e.g. ASCII, binary alphabet, or a set of tokens)
• String – A finite sequence of symbols drawn from an alphabet
• Language – A set of strings over a fixed alphabet
Operations on Languages
• Union: L ∪ M = { s | s is in L or s is in M }
• Concatenation: LM = { st | s is in L and t is in M }
• Kleene closure: L* = ⋃_{i=0}^{∞} L^i
  – Zero or more concatenations
• Positive closure: L+ = ⋃_{i=1}^{∞} L^i
  – One or more concatenations
Regular Expressions
• Can give “names” to regular expressions
• Convention: names in boldface (to distinguish them from symbols)
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
Notational Shorthands
• One or more instances: r+ denotes rr*
• Zero or one instance: r? denotes r|ε
• Character classes: [a-z] denotes a|b|…|z
digit → [0-9]
digits → digit+
optional_fraction → ( . digits )?
num → digits optional_fraction
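The definition of num can be turned into a small hand-written matcher. The sketch below is one possible reading, not a prescribed implementation; the function name match_num is an assumption. It returns the length of the longest prefix that matches digit+ ( . digit+ )?.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s matching  digit+ ( . digit+ )? , or 0 if none. */
static size_t match_num(const char *s) {
    size_t i = 0;

    /* digits: one or more digit characters */
    while (isdigit((unsigned char)s[i])) i++;
    if (i == 0) return 0;

    /* optional_fraction: taken only if '.' is followed by at least one digit */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i])) i++;
    }
    return i;
}
```

For the input "12." the matcher stops after "12", because the fraction needs at least one digit after the dot.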
Limitations
• Regular expressions cannot describe balanced or nested constructs
  – Example: the set of all strings of balanced parentheses
  – This can be done with a context-free grammar (CFG)
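One way to see the limitation: a recognizer for balanced parentheses has to track the nesting depth, which can grow without bound, while a finite automaton has only a fixed number of states. A minimal sketch (the function name balanced is an assumption):

```c
#include <stdbool.h>

/* Returns true if every '(' in s has a matching ')' and no ')' comes too early. */
static bool balanced(const char *s) {
    int depth = 0;  /* unbounded counter: more than a fixed set of states can remember */
    for (; *s; s++) {
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (--depth < 0) return false;  /* closing parenthesis without an opener */
        }
    }
    return depth == 0;
}
```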
Grammar Fragment (Pascal)
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
Related Regular Expression Definitions
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )?
delim → blank | tab | newline
ws → delim+
Tokens and Attributes
Regular Expression   Token   Attribute Value
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE
Transition Diagrams
• A stylized flowchart
• Transition diagrams consist of states connected by edges
• Edges leaving a state s are labeled with input characters that may occur after reaching state s
• Assumed to be deterministic
• There is one start state and at least one accepting (final) state
Transition Diagram for “relop”
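The diagram itself is not reproduced here. As a rough sketch, the code below follows the relop definition and the attribute values from the token table: after '<' or '>' it looks one character ahead, and on any other character it "retracts" by accepting the one-character lexeme. The names Relop, scan_relop, and NOT_RELOP are assumptions.

```c
/* Attribute values of the relop token; NOT_RELOP is an added "no match" marker. */
typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } Relop;

/* Recognizes a relational operator at the start of s.
   Stores the lexeme length in *len and returns the attribute value. */
static Relop scan_relop(const char *s, int *len) {
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1;                 /* other character: retract, accept "<" */
        return LT;
    case '=':
        *len = 1;
        return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1;                 /* other character: retract, accept ">" */
        return GT;
    default:
        *len = 0;                 /* not a relational operator */
        return NOT_RELOP;
    }
}
```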
Identifiers and Keywords
• Share a transition diagram
  – After reaching the accepting state, code determines whether the lexeme is a keyword or an identifier
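A minimal sketch of that idea, assuming a small keyword table that is searched after the shared pattern letter ( letter | digit )* has matched. The names Kind, scan_word, and keywords are illustrative and do not come from the slides.

```c
#include <ctype.h>
#include <string.h>

typedef enum { T_IF, T_THEN, T_ELSE, T_ID, T_NONE } Kind;

/* Scans  letter ( letter | digit )*  at the start of s, then decides whether
   the lexeme is a keyword or an identifier. Stores the lexeme length in *len. */
static Kind scan_word(const char *s, size_t *len) {
    static const struct { const char *text; Kind kind; } keywords[] = {
        { "if", T_IF }, { "then", T_THEN }, { "else", T_ELSE }
    };
    size_t n = 0;

    if (!isalpha((unsigned char)s[0])) {   /* must start with a letter */
        *len = 0;
        return T_NONE;
    }
    while (isalnum((unsigned char)s[n])) n++;   /* take the longest match */
    *len = n;

    /* Accepting state reached: look the lexeme up in the keyword table. */
    for (size_t k = 0; k < sizeof keywords / sizeof keywords[0]; k++) {
        if (strlen(keywords[k].text) == n && strncmp(s, keywords[k].text, n) == 0)
            return keywords[k].kind;
    }
    return T_ID;
}
```

Because the whole lexeme is scanned before the lookup, an input such as "ifx" is reported as an identifier, not as the keyword if followed by x.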
Numbers
Finding the Next Token
token nexttoken(void) {
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();
            if (c == ' ' || c == '\t' || c == '\n') {
                /* skip whitespace and advance the start of the lexeme */
                state = 0;
                lexeme_beginning++;
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = next_td();   /* not a relop: try the next transition diagram */
            break;
        … /* other cases here */
        }
    }
}
Trying Transition Diagrams
int next_td(void) {
    switch (start) {
    case 0:  start = 9;  break;
    case 9:  start = 20; break;
    case 20: start = 25; break;
    case 25: recover();  break;
    default: error("invalid start state");
    }
    /* Possibly additional actions here */
    return start;
}
Finite Automata
• Generalized transition diagrams that act as a “recognizer” for a language
• Can be nondeterministic (NFA) or deterministic (DFA)
  – NFAs can have ε-transitions, DFAs cannot
  – NFAs can have multiple edges with the same symbol leaving a state, DFAs cannot
  – Both can recognize exactly what regular expressions can denote
NFAs
• A set of states S
• A set of input symbols Σ (the input alphabet)
• A transition function move that maps (state, symbol) pairs to sets of states
• A single start state s0
• A set of accepting (or final) states F
• An NFA accepts a string s if and only if there exists a path from the start state to an accepting state such that the edge labels spell out s
NFA (Example)
Transition graph of the NFA:
  start → 0
  0 –a→ 0,  0 –b→ 0,  0 –a→ 1,  1 –b→ 2
S = {0,1,2}
Σ = {a,b}
0 is the start state s0
{2} is the set of final states F
The language recognized by this NFA is (a|b)*ab
Transition Tables
State   Input Symbol
        a        b
0       {0,1}    {0}
1       ---      {2}
2       ---      {3}
DFAs
• No state has an ε-transition
• For each state s and input symbol a, there is at most one edge labeled a leaving s
Example: r = (a|b)*abb
Functions ε-closure and move
• ε-closure(s) is the set of NFA states reachable from NFA state s on ε-transitions alone
• move(T,a) is the set of NFA states to which there is a transition on input a from any NFA state s in T
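Both functions can be sketched with state sets stored as bit masks. The four-state NFA below is hypothetical and exists only to exercise the code (it accepts a*b); the names StateSet, eps, delta, eps_closure, and move_on are assumptions. The closure is generalized from one state to a set of states T, which is the form used during simulation.

```c
#define NSTATES 4

typedef unsigned StateSet;  /* bit i set means NFA state i is in the set */

/* Hypothetical NFA over {a, b}, for illustration only:
   0 -ε-> 1, 0 -ε-> 2, 1 -ε-> 2, 1 -a-> 1, 2 -b-> 3 */
static const StateSet eps[NSTATES] = { 0x6, 0x4, 0x0, 0x0 };
static const StateSet delta[NSTATES][2] = {   /* column 0 = 'a', column 1 = 'b' */
    { 0x0, 0x0 },
    { 0x2, 0x0 },
    { 0x0, 0x8 },
    { 0x0, 0x0 }
};

/* ε-closure(T): states reachable from some state in T on ε-transitions alone. */
static StateSet eps_closure(StateSet T) {
    StateSet closure = T, prev;
    do {                                   /* repeat until no new state is added */
        prev = closure;
        for (int s = 0; s < NSTATES; s++)
            if (closure & (1u << s)) closure |= eps[s];
    } while (closure != prev);
    return closure;
}

/* move(T, a): states reached on input a from some state s in T.
   Assumes a is 'a' or 'b'. */
static StateSet move_on(StateSet T, char a) {
    StateSet out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) out |= delta[s][a - 'a'];
    return out;
}
```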
Constructed DFA
Simulating a DFA
s := s0
c := nextchar
while c != eof do
    s := move(s, c)
    c := nextchar
end
if s is in F then
    return “yes”
else
    return “no”
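The same loop in C, as a sketch only. It uses a hand-built four-state DFA for (a|b)*abb, which may differ from the DFA produced by the subset construction in these slides; the names dfa_move and dfa_accepts are assumptions.

```c
#include <stdbool.h>

/* Transition table of a 4-state DFA for (a|b)*abb; column 0 = 'a', column 1 = 'b'. */
static const int dfa_move[4][2] = {
    { 1, 0 },   /* state 0: no useful suffix seen yet */
    { 1, 2 },   /* state 1: input so far ends in a    */
    { 1, 3 },   /* state 2: input so far ends in ab   */
    { 1, 0 }    /* state 3: input so far ends in abb (accepting) */
};

/* Runs the DFA over x and reports whether it stops in the accepting state. */
static bool dfa_accepts(const char *x) {
    int s = 0;                                   /* s := s0 */
    for (; *x; x++) {
        if (*x != 'a' && *x != 'b') return false;  /* symbol outside the alphabet */
        s = dfa_move[s][*x - 'a'];               /* s := move(s, c) */
    }
    return s == 3;                               /* s is in F = {3} */
}
```

Each input character costs one table lookup, which is where the O(|x|) running time of DFA simulation comes from.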
Simulating an NFA
S := ε-closure({s0})
a := nextchar
while a != eof do
    S := ε-closure(move(S,a))
    a := nextchar
end
if S ∩ F != Ø then
    return “yes”
else
    return “no”
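A sketch of this loop in C, using the NFA from the transition table earlier with state sets stored as bit masks. Two assumptions are made: a fourth row is added for state 3, which has no outgoing edges, and state 3 is taken to be the only accepting state. That NFA has no ε-transitions, so ε-closure leaves every set unchanged. The names Set, nfa_move, and nfa_accepts are illustrative.

```c
#include <stdbool.h>

typedef unsigned Set;   /* bit i set means NFA state i is in the set */

/* Transition table as bit sets; column 0 = 'a', column 1 = 'b'. */
static const Set nfa_move[4][2] = {
    { 0x3, 0x1 },   /* 0: a -> {0,1}, b -> {0} */
    { 0x0, 0x4 },   /* 1: b -> {2} */
    { 0x0, 0x8 },   /* 2: b -> {3} */
    { 0x0, 0x0 }    /* 3: assumed accepting state, no outgoing edges */
};

/* Tracks the set of all states the NFA could be in after each input symbol. */
static bool nfa_accepts(const char *x) {
    Set S = 0x1;                        /* ε-closure({s0}) = {0}, no ε-edges here */
    for (; *x; x++) {
        if (*x != 'a' && *x != 'b') return false;   /* symbol outside the alphabet */
        Set next = 0;
        for (int s = 0; s < 4; s++)     /* move(S, a) */
            if (S & (1u << s)) next |= nfa_move[s][*x - 'a'];
        S = next;                       /* ε-closure is the identity for this NFA */
    }
    return (S & 0x8) != 0;              /* S ∩ F != Ø with F = {3} */
}
```

On the input abb the set S goes from {0} to {0,1}, then {0,2}, then {0,3}, so the string is accepted.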
Space/Time Tradeoff (Worst Case)
        Space        Time
NFA     O(|r|)       O(|r|*|x|)
DFA     O(2^|r|)     O(|x|)
(|r| is the length of the regular expression, |x| is the length of the input string)
Simulating a Regular Expression
• First use Thompson’s Construction to convert the RE to an NFA
• Then there are two choices:
  – Use subset construction to convert the NFA to a DFA, then simulate the DFA
  – Simulate the NFA directly
Some Other Issues in Lexical Analyzer
• The lexical analyzer has to recognize the longest possible string.
  – Ex: for the identifier newval, the prefixes n, ne, new, newv, and newva also match the pattern, but the token is newval
• What marks the end of a token? Is there a character that always marks the end of a token?
Some Other Issues in Lexical Analyzer (cont.)
• Skipping comments
  – Normally we don’t return a comment as a token.
  – So comments are processed only by the lexical analyzer and don’t complicate the syntax of the language.
• Symbol table interface
  – The symbol table holds information about tokens (at least the lexemes of identifiers).
  – How to implement the symbol table, and what kinds of operations to support:
    • hash table: open addressing, chaining
    • putting a token into the hash table, finding the position of a token from its lexeme
• Positions of the tokens in the file (for error handling).
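As one possible shape for the symbol table interface, here is a minimal hash table with chaining that supports the two operations listed above. Everything in it is an assumption for illustration: the names Entry, symtab_hash, symtab_lookup, and symtab_insert, the bucket count, and the hash function.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211   /* arbitrary bucket count chosen for the sketch */

typedef struct Entry {
    char *lexeme;
    struct Entry *next;   /* chaining: entries that hash to the same bucket */
} Entry;

static Entry *buckets[NBUCKETS];

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned symtab_hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Finds the position (entry) of a token from its lexeme, or returns NULL. */
static Entry *symtab_lookup(const char *lexeme) {
    for (Entry *e = buckets[symtab_hash(lexeme)]; e; e = e->next)
        if (strcmp(e->lexeme, lexeme) == 0) return e;
    return NULL;
}

/* Returns the existing entry for a lexeme, or inserts a new one.
   Returns NULL only if memory allocation fails. */
static Entry *symtab_insert(const char *lexeme) {
    Entry *e = symtab_lookup(lexeme);
    if (e) return e;

    e = malloc(sizeof *e);
    if (!e) return NULL;
    e->lexeme = malloc(strlen(lexeme) + 1);
    if (!e->lexeme) { free(e); return NULL; }
    strcpy(e->lexeme, lexeme);

    unsigned h = symtab_hash(lexeme);
    e->next = buckets[h];     /* push onto the front of the bucket's chain */
    buckets[h] = e;
    return e;
}
```

Returning the entry pointer from symtab_insert gives the lexical analyzer a natural attribute value for an id token.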