Lexical Analysis

Wednesday, April 3, 2013

May 11, 2018


lydang

Previously on CSE 131b...

The Structure of a Modern Compiler

Source Code → Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → IR Optimization → Code Generation → Optimization → Machine Code

Where are we?

Lexical Analysis, the first stage between source code and machine code.

Goal of Lexical Analysis

Break the program down into words, or “tokens”.

Input: code (character stream)

while (ip < z)
    ++ip;

w h i l e ( i p < z ) \n \t + + i p ;

Goal of Lexical Analysis

Output: token stream

T_While ( T_Ident < T_Ident ) ++ T_Ident

The three T_Ident tokens carry the lexemes ip, z, ip.
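A minimal sketch of this step in Python, assuming a small regular-expression table per token type. T_While and T_Ident follow the slide; T_Op, T_Punct and SKIP are hypothetical catch-all names introduced only for this illustration.

```python
import re

# Hypothetical token table. Order matters: Python's alternation takes the
# first alternative that matches, so the keyword rule precedes T_Ident.
TOKEN_SPEC = [
    ("T_While", r"while\b"),
    ("T_Ident", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("T_Op",    r"\+\+|<"),
    ("T_Punct", r"[();]"),
    ("SKIP",    r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token type, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for kind, lexeme in tokenize("while (ip < z)\n\t++ip;"):
    print(kind, repr(lexeme))
```

The character stream goes in, and a list of typed tokens comes out; whitespace is recognized but not passed on.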

The token stream is then used as input for the parser (syntax analysis).

[Figure: syntax tree built from the token stream, with nodes While, <, ++ and three Ident leaves (ip, z, ip)]

What’s a token?

• What’s a lexical unit of code?


Scanning a Source File

w h i l e ( 1 3 7 < i ) \n \t + + i ;


What is my name ?


Token Type


• Keyword: for int if else while

• Punctuation: ( ) { } ;

• Operator: + - ++

• Relation: < > =

• Identifier (variable name, function name): foo foo_2

• Integer, floating point, string: 2345 2.0 “hello world”

• Whitespace, comment: /* this code is awesome */


Scanning a Source File

w h i l e ( 1 3 7 < i ) \n \t + + i ;

Tokens scanned so far: T_While ( T_IntConst (attribute: 137)

Token: for example, T_While

Lexeme: the piece of the original program from which we made the token

Some tokens can have attributes that store extra information about the token. Here we store which integer is represented.

Lexical Analyzer

• Recognize substrings that correspond to tokens: lexemes

• Lexeme: actual text of the token

• For each lexeme, identify the token type

• Emit a pair: <token type, attribute>

• Attribute: optional extra information, often a numeric value
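One way to picture the <token type, attribute> pair is as a small record. The sketch below is an assumption for illustration only: the Token class, its field names, and the make_token helper are hypothetical, and the helper classifies a lexeme that has already been isolated rather than scanning a whole program.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Token:
    kind: str                                    # token type, e.g. "T_IntConst"
    lexeme: str                                  # actual text taken from the program
    attribute: Optional[Union[int, str]] = None  # optional extra information


def make_token(lexeme: str) -> Token:
    """Classify one already-isolated lexeme (a toy classifier, not a full scanner)."""
    if lexeme == "while":
        return Token("T_While", lexeme)
    if lexeme.isascii() and lexeme.isdigit():
        # The attribute records which integer the lexeme represents.
        return Token("T_IntConst", lexeme, int(lexeme))
    if lexeme.isidentifier():
        # For identifiers, the attribute is the name itself.
        return Token("T_Ident", lexeme, lexeme)
    # Punctuation and operators stand for themselves and need no attribute.
    return Token(lexeme, lexeme)


for lexeme in ["while", "(", "137", "<", "i", ")"]:
    print(make_token(lexeme))
```

Only T_IntConst and T_Ident carry an attribute here; keywords and punctuation are fully described by their type.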

Why is this process hard?


Scanning is Hard

● FORTRAN: Whitespace is irrelevant

DO 5 I = 1,25

DO5I = 1.25

With whitespace ignored, DO 5 I = 1.25 is the assignment DO5I = 1.25, while DO 5 I = 1,25 is a loop header. The scanner cannot tell which one it is reading until it reaches the comma or the period.

Thanks to Prof. Alex Aiken

● C++: Nested template declarations

vector<vector<int>> myVector

vector < vector < int >> myVector

(vector < (vector < (int >> myVector)))

● If the scanner treats >> as a single right-shift operator, the declaration reads like the nested expression on the last line. Again, it can be difficult to determine where to split.

Challenges for Lexical Analyzer

• How do we determine which lexemes are associated with each token?

• When there are multiple ways we could scan the input, how do we know which one to pick?

• Example: if (a keyword) versus if1 (an identifier)

• How do we address these concerns efficiently?
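A common answer to the "which one to pick" question is to take the longest match and, on a tie, the rule listed first. The slides do not state this rule at this point, so the sketch below is an assumption, and the rule names (T_If, T_Shift, and so on) are hypothetical.

```python
import re

# Rules in priority order: when two rules match the same length, the earlier wins.
RULES = [
    ("T_If",      re.compile(r"if")),
    ("T_Ident",   re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("T_Shift",   re.compile(r">>")),
    ("T_Greater", re.compile(r">")),
    ("T_Less",    re.compile(r"<")),
]


def next_token(text, pos):
    """Return (kind, lexeme) for the longest match at pos, or None if no rule matches."""
    best = None
    for kind, pattern in RULES:
        match = pattern.match(text, pos)
        # Strictly longer matches replace the current best, so ties keep the earlier rule.
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (kind, match.group())
    return best


def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        token = next_token(text, pos)
        if token is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append(token)
        pos += len(token[1])
    return tokens


print(scan("if if1"))
print(scan("vector<vector<int>> myVector"))
```

Under this rule, if is a keyword and if1 is an identifier, but the same rule also reads >> in the template declaration as one shift token, which is exactly the C++ difficulty described earlier.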

Associating Lexemes with Tokens

• Tokens categorize lexemes by what information they provide.

• Associating lexemes with tokens: pattern matching

• How do we describe the patterns?

Token: Lexemes

Finite possible lexemes:

• Keyword: for int if else while

• Punctuation: ( ) { } ;

• Operator: + - ++

• Relation: < > =

Infinite possible lexemes:

• Identifier (variable name, function name): foo foo_2

• Integer, floating point, string: 2345 2.0 “hello world”

• Whitespace, comment: /* this code is awesome */
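The difference shows up directly in code: a finite set of lexemes can simply be listed, while an infinite set needs a pattern. The sketch below is only an illustration; the identifier pattern, including the underscore so that foo_2 is accepted, is an assumption and not taken from the slides.

```python
import re

# Finite: the keywords can be enumerated.
KEYWORDS = frozenset({"for", "int", "if", "else", "while"})

# Infinite: identifiers can only be described by a rule (assumed pattern).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_keyword(lexeme):
    return lexeme in KEYWORDS


def is_identifier(lexeme):
    # A keyword has the shape of an identifier, so it is excluded explicitly.
    return IDENTIFIER.fullmatch(lexeme) is not None and lexeme not in KEYWORDS


for lexeme in ["while", "foo", "foo_2", "2foo"]:
    print(lexeme, is_keyword(lexeme), is_identifier(lexeme))
```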

• How do we describe which (potentially infinite) set of lexemes is associated with each token type?


Formal Languages

● A formal language is a set of strings.

● Many infinite languages have finite descriptions:

● Define the language using an automaton.

● Define the language using a grammar.

● Define the language using a regular expression.

● We can use these compact descriptions of the language to define sets of strings.

● Over the course of this class, we will use all of these approaches.


• What type of formal language should we use to describe tokens?


Regular Expressions

● Regular expressions are a family of descriptions that can be used to capture certain languages (the regular languages).

● Often provide a compact and human-readable description of the language.

● Used as the basis for numerous software systems, including the flex tool we will use in this course.


Atomic Regular Expressions

● The regular expressions we will use in this course begin with two simple building blocks.

● The symbol ε is a regular expression that matches the empty string.

● For any symbol a, the symbol a is a regular expression that just matches a.


Compound Regular Expressions

● If R₁ and R₂ are regular expressions, R₁R₂ is a regular expression that represents the concatenation of the languages of R₁ and R₂.

● If R₁ and R₂ are regular expressions, R₁ | R₂ is a regular expression representing the union of R₁ and R₂.

● If R is a regular expression, R* is a regular expression for the Kleene closure of R.

● If R is a regular expression, (R) is a regular expression with the same meaning as R.
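Since a language is a set of strings, the three operators can be sketched as operations on Python sets. This is an illustration under one stated assumption: the Kleene closure is infinite, so star below is cut off at a chosen maximum length. The function names are hypothetical.

```python
def concat(a, b):
    """Concatenation: every string of a followed by every string of b."""
    return {x + y for x in a for y in b}


def union(a, b):
    """Union: strings that are in a or in b."""
    return a | b


def star(a, max_len):
    """Strings of the Kleene closure of a whose length is at most max_len."""
    result = {""}       # zero repetitions always give the empty string
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more piece; keep only unseen, short-enough ones.
        frontier = {x + y for x in frontier for y in a if len(x + y) <= max_len} - result
        result |= frontier
    return result


bits = union({"0"}, {"1"})                # the language of (0 | 1)
language = concat(star(bits, 2), {"00"})  # (0 | 1)*00, prefix limited to length 2
print(sorted(language))
```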

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*


110111001010000

11111011110011111
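The expression can be tried out with Python's re module, whose syntax for |, * and parentheses agrees with the notation used here. A small sketch; the helper name contains_00 is hypothetical, and fullmatch is used so that the whole string has to match.

```python
import re

CONTAINS_00 = re.compile(r"(0|1)*00(0|1)*")


def contains_00(s):
    """True if s is a string of 0s and 1s with 00 as a substring."""
    return CONTAINS_00.fullmatch(s) is not None


for s in ["110111001010000", "11111011110011111", "10101"]:
    print(s, contains_00(s))
```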


Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.

● A regular expression for even numbers is

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

42

+1370

-3248

-9999912
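A quick check of the even-number expression in Python. One adjustment is needed: in Python's re syntax a bare + is a repetition operator, so the sign is written \+ here. The helper name is_even_number is hypothetical.

```python
import re

# Optional sign, any digits, then a final even digit.
EVEN = re.compile(r"(\+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)")


def is_even_number(s):
    return EVEN.fullmatch(s) is not None


for s in ["42", "+1370", "-3248", "-9999912", "137"]:
    print(s, is_even_number(s))
```

The final group must not be optional: with a ? after it, the expression would also accept odd numbers such as 137 and even the empty string.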

• More examples

• Whitespace: [ \t\n]+

• Integers: [+\-]?[0-9]+

• Hex numbers: 0x[0-9a-f]+

• Identifier: [A-Za-z]([A-Za-z]|[0-9])*
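These four patterns are valid as written in Python's re syntax, so they can be tested directly. A small sketch with hypothetical constant names:

```python
import re

WHITESPACE = re.compile(r"[ \t\n]+")
INTEGER = re.compile(r"[+\-]?[0-9]+")
HEX = re.compile(r"0x[0-9a-f]+")
IDENTIFIER = re.compile(r"[A-Za-z]([A-Za-z]|[0-9])*")


def matches(pattern, s):
    """True if the whole string s is described by the pattern."""
    return pattern.fullmatch(s) is not None


print(matches(INTEGER, "-42"), matches(INTEGER, "4.2"))
print(matches(HEX, "0x1f"), matches(HEX, "0x1F"))
print(matches(IDENTIFIER, "foo2"), matches(IDENTIFIER, "foo_2"))
```

Two limits of the patterns as given become visible: the hex pattern accepts only lowercase digits, and the identifier pattern has no underscore, so a name like foo_2 would need [A-Za-z_] and [0-9_] added to the character classes.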

• Use regular expressions to describe token types

• How do we match regular expressions?


Recognizing Regular Language

What is the machine that recognize regular language??

Wednesday, April 3, 13

Page 73: Lexical Analysis - Computer Science and Engineering are we? Where We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization

Recognizing Regular Language

• Finite Automata

• DFA (Deterministic Finite Automata)

• NFA (Non-deterministic Finite Automata)

What is the machine that recognize regular language??

Wednesday, April 3, 13

Page 74: Lexical Analysis - Computer Science and Engineering are we? Where We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization

" "start

A,B,C,...,Z

A Simple Automaton

Wednesday, April 3, 13

Page 75: Lexical Analysis - Computer Science and Engineering are we? Where We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization

" "start

A,B,C,...,Z

Each circle is a state of the

automaton. The automaton's

configuration is determined

by what state(s) it is in.

Each circle is a state of the

automaton. The automaton's

configuration is determined

by what state(s) it is in.

A Simple Automaton

Wednesday, April 3, 13

Page 76: Lexical Analysis - Computer Science and Engineering are we? Where We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization

" "start

A,B,C,...,Z

These arrows are called

transitions. The automaton

changes which state(s) it is in

by following transitions.

These arrows are called

transitions. The automaton

changes which state(s) it is in

by following transitions.

A Simple Automaton

Wednesday, April 3, 13

Page 77: Lexical Analysis - Computer Science and Engineering are we? Where We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Finite Automata: Takes an input string and determines whether it’s a valid sentence of a language

accept or reject

Wednesday, April 3, 13

Page 78: Lexical Analysis - Computer Science and Engineering are we? Where We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

The automaton takes a string

as input and decides whether

to accept or reject the string.

The automaton takes a string

as input and decides whether

to accept or reject the string.

Wednesday, April 3, 13

A Simple Automaton

[Figure: automaton with a start state, transitions labeled " and A,B,C,...,Z, and a double-circled state]

Input: " H E Y A "

The double circle indicates that this state is an accepting state. The automaton accepts the string if it ends in an accepting state.
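
The acceptance rule can be sketched in a few lines of Python. The diagram itself is not recoverable from the transcript, so the wiring below is an assumption based only on the edge labels: an opening `"`, any number of uppercase letters, then a closing `"` into the accepting state. The state names and the `accepts` helper are hypothetical.

```python
import string

# Assumed wiring: q0 --"--> q1, q1 --A..Z--> q1, q1 --"--> q2 (accepting).
START, IN_STRING, DONE = "q0", "q1", "q2"
ACCEPTING = {DONE}

def step(state, ch):
    """Return the next state, or None if no transition exists."""
    if state == START and ch == '"':
        return IN_STRING
    if state == IN_STRING and ch in string.ascii_uppercase:
        return IN_STRING
    if state == IN_STRING and ch == '"':
        return DONE
    return None

def accepts(text):
    """Accept only if the whole input is consumed and we end in an accepting state."""
    state = START
    for ch in text:
        state = step(state, ch)
        if state is None:  # stuck: no transition for this character
            return False
    return state in ACCEPTING

print(accepts('"HEYA"'))  # True
print(accepts('"HEYA'))   # False: input ends before the closing quote
```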

An Even More Complex Automaton

[Figure: automaton with a start state and three ε-transitions; the remaining edges are labeled "a, b", "a, c", "b, c", "c", "b", and "a"]

These are called ε-transitions. These transitions are followed automatically and without consuming any input.

Input: b c b a
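
One way to run an automaton with ε-transitions is to track the set of states it could be in and take the ε-closure after every step. The sketch below is illustrative only: the wiring is a guess reconstructed from the edge labels (three branches reached by ε from the start state, each looping on two letters and accepting on the third), and all state names are hypothetical.

```python
# Assumed wiring, reconstructed from the edge labels only:
#   start --ε--> p_ab, p_ac, p_bc
#   p_ab loops on a, b and moves to an accepting state on c (likewise for the others).
EPSILON = ""
TRANSITIONS = {
    ("start", EPSILON): {"p_ab", "p_ac", "p_bc"},
    ("p_ab", "a"): {"p_ab"}, ("p_ab", "b"): {"p_ab"}, ("p_ab", "c"): {"f_c"},
    ("p_ac", "a"): {"p_ac"}, ("p_ac", "c"): {"p_ac"}, ("p_ac", "b"): {"f_b"},
    ("p_bc", "b"): {"p_bc"}, ("p_bc", "c"): {"p_bc"}, ("p_bc", "a"): {"f_a"},
}
ACCEPTING = {"f_a", "f_b", "f_c"}

def epsilon_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for nxt in TRANSITIONS.get((state, EPSILON), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def accepts(text):
    """Track every state the NFA could be in; accept if any is accepting at the end."""
    current = epsilon_closure({"start"})
    for ch in text:
        moved = set()
        for state in current:
            moved |= TRANSITIONS.get((state, ch), set())
        current = epsilon_closure(moved)
    return bool(current & ACCEPTING)

print(accepts("bcba"))  # True under the assumed wiring
print(accepts("abca"))  # False under the assumed wiring
```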

Lexer Generator

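• Given regular expressions that describe the language (token types), a lexer generator:

• Generates an NFA that recognizes the regular language they define

• using existing algorithms (e.g., Thompson's construction)

• Transforms the NFA into a DFA

• using existing algorithms (e.g., the subset construction)

• Tools: lex, flex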
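
The NFA-to-DFA step can be illustrated with a small subset construction. This is a minimal sketch, not how lex or flex implement it: the dictionary encoding, the function names, and the example NFA for the keywords "do" and "double" are all assumptions made for illustration.

```python
from collections import deque

EPSILON = ""

def epsilon_closure(nfa, states):
    """All NFA states reachable from `states` using only ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPSILON), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    dfa_start = epsilon_closure(nfa, {start})
    dfa, seen, queue = {}, {dfa_start}, deque([dfa_start])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            moved = set()
            for s in current:
                moved |= nfa.get((s, ch), set())
            if not moved:
                continue  # no transition on ch: the DFA is stuck here
            target = epsilon_closure(nfa, moved)
            dfa[(current, ch)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_start, dfa

# Hypothetical NFA: state 3 accepts "do", state 10 accepts "double".
NFA = {
    (0, EPSILON): {1, 4},
    (1, "d"): {2}, (2, "o"): {3},
    (4, "d"): {5}, (5, "o"): {6}, (6, "u"): {7},
    (7, "b"): {8}, (8, "l"): {9}, (9, "e"): {10},
}
dfa_start, dfa = nfa_to_dfa(NFA, 0, sorted(set("double")))
print(len(dfa))  # 6 transitions: d, o, u, b, l, e
```

Each DFA state records which NFA states are simultaneously possible, so after reading "do" the DFA is in the state {3, 6}: "do" has been accepted and "double" is still in progress.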

Challenges for Lexical Analyzer

• How do we determine which lexemes are associated with each token?

• Regular expression to describe token type

• When there are multiple ways we could scan the input, how do we know which one to pick?

• How do we address these concerns efficiently?

Lexing Ambiguities

T_For for

T_Identifier [A-Za-z_][A-Za-z0-9_]*

f o r t

[Figure: the input "fort" repeated, each copy split into tokens in a different way]

Conflict Resolution

● Assume all tokens are specified as regular expressions.

● Algorithm: Left-to-right scan.

● Tiebreaking rule one: Maximal munch.

● Always match the longest possible prefix of the remaining text.

Lexing Ambiguities

T_For for

T_Identifier [A-Za-z_][A-Za-z0-9_]*

f o r t

With maximal munch, the whole input "fort" is matched as a single T_Identifier.

Implementing Maximal Munch

● Given a set of regular expressions, how can we use them to implement maximal munch?

● Idea:

● Convert expressions to NFAs.

● Run all NFAs in parallel, keeping track of the last match.

● When all automata get stuck, report the last match and restart the search at that point.
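
The idea above can be sketched as follows. For brevity each rule is hand-encoded as a small DFA rather than built from a regular expression, and the rule set reuses the T_Do / T_Double / T_Mystery example. The encoding, the helper names, and the test input "doubdouble" are assumptions chosen for illustration. When several rules accept at the same position, this sketch simply records the first one listed.

```python
import string

def keyword_dfa(word):
    """DFA for a fixed keyword: state i means 'matched the first i characters'."""
    delta = {(i, ch): i + 1 for i, ch in enumerate(word)}
    return delta, {len(word)}

def single_letter_dfa():
    """DFA for [A-Za-z]: exactly one letter."""
    delta = {(0, ch): 1 for ch in string.ascii_letters}
    return delta, {1}

RULES = [
    ("T_Do", keyword_dfa("do")),
    ("T_Double", keyword_dfa("double")),
    ("T_Mystery", single_letter_dfa()),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        states = [0] * len(RULES)  # one live state per automaton; None = stuck
        last_match = None          # (end position, token name) of the longest match
        i = pos
        while i < len(text) and any(s is not None for s in states):
            ch = text[i]
            i += 1
            # Advance every automaton that is still alive.
            for k, (_, (delta, _)) in enumerate(RULES):
                if states[k] is not None:
                    states[k] = delta.get((states[k], ch))
            # Remember the most recent position at which some automaton accepted.
            for k, (name, (_, accepting)) in enumerate(RULES):
                if states[k] in accepting:
                    last_match = (i, name)
                    break
        if last_match is None:
            raise ValueError(f"no token matches at position {pos}")
        end, name = last_match
        tokens.append((name, text[pos:end]))
        pos = end  # restart the search right after the last match
    return tokens

print(tokenize("doubdouble"))
# [('T_Do', 'do'), ('T_Mystery', 'u'), ('T_Mystery', 'b'), ('T_Double', 'double')]
```

On "doubdouble" the scanner reads as far as "doubd" before every automaton is stuck, then falls back to the last recorded match ("do") and resumes scanning at "u".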

• Example

Implementing Maximal Munch

T_Do do

T_Double double

T_Mystery [A-Za-z]

[Figure: one automaton per rule, each with its own start state: d o; d o u b l e; Σ]

Input: D O U B L E D O U B

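A Minor Simplification

[Figure: the three automata (d o; d o u b l e; Σ) joined into a single automaton: one start state with an ε-transition into each]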

Other Conflicts

T_Do do

T_Double double

T_Identifier [A-Za-z_][A-Za-z0-9_]*

d o u b l e

More Tiebreaking

● When two regular expressions apply, choose the one with the greater “priority.”

● Simple priority system: pick the rule that was defined first.
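
A minimal sketch of this tiebreak, using Python's `re` module in place of hand-built automata (that substitution, and the `next_token` helper, are assumptions for illustration). Rules are tried in definition order, and a later rule replaces the current best only if its match is strictly longer, so equal-length ties go to the rule defined first.

```python
import re

# Rules in definition order: earlier rules have higher priority.
RULES = [
    ("T_Do", re.compile(r"do")),
    ("T_Double", re.compile(r"double")),
    ("T_Identifier", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
]

def next_token(text, pos=0):
    """Longest match wins; among equally long matches, the first rule wins."""
    best = None  # (match length, token name)
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strict '>' keeps the earlier rule when lengths tie.
        if m and (best is None or m.end() - pos > best[0]):
            best = (m.end() - pos, name)
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    length, name = best
    return name, text[pos:pos + length]

print(next_token("double"))   # ('T_Double', 'double'): ties with T_Identifier, defined first
print(next_token("doubles"))  # ('T_Identifier', 'doubles'): the longer match wins
```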

Other Conflicts

T_Do do

T_Double double

T_Identifier [A-Za-z_][A-Za-z0-9_]*

d o u b l e

Both T_Double and T_Identifier match the entire input "double"; T_Double is defined first, so it wins.

Implement a lexical analyzer

• Use regular expressions to describe token types (keyword, identifier, integer constant, ...)

• Use a DFA/NFA to recognize the regular language

• But, good news: you don't need to implement the algorithms that transform your regular expressions into a DFA/NFA yourself

• flex: given regular expressions -> outputs C code that does lexical analysis (it internally

Summary

• Lexical Analysis

• Tokens

• Lexemes

• Regular expressions

• DFA

• NFA

• Flex: a tool for you to build a lexical analyzer
