Dec 21, 2015

Page 1: Lexical Analysis Compiler Baojian Hua bjhua@ustc.edu.cn.

Lexical Analysis

Compiler

Baojian Hua

bjhua@ustc.edu.cn

Page 2:

Compiler

source program -> compiler -> target program

Page 3:

Front and Back Ends

source program -> front end -> IR -> back end -> target program

Page 4:

Front End

source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR

Page 5:

Lexical Analyzer

The lexical analyzer translates the source program into a stream of lexical tokens.

Source program:
  a stream of characters
  varies from language to language (ASCII or Unicode, or …)

Lexical token:
  a compiler-internal data structure that represents the occurrence of a terminal symbol
  varies from compiler to compiler

Page 6:

Conceptually

character sequence -> lexical analyzer -> token sequence

Page 7:

Example

Recall the min-ML language in “code3”:

prog -> decs
decs -> dec; decs
      |
dec  -> val id = exp
      | val _ = printInt exp
exp  -> id
      | num
      | exp + exp
      | true
      | false
      | if (exp) then exp else exp
      | (exp)

Page 8:

Example

val x = 3;
val y = 4;
val z = if (2) then (x) else y;
val _ = printInt z;

Lexical analysis turns this program into the token stream:

VAL IDENT(x) ASSIGN INT(3) SEMICOLON
VAL IDENT(y) ASSIGN INT(4) SEMICOLON
VAL IDENT(z) ASSIGN IF LPAREN INT(2) RPAREN THEN LPAREN IDENT(x) RPAREN ELSE IDENT(y) SEMICOLON
VAL UNDERSCORE ASSIGN PRINTINT IDENT(z) SEMICOLON EOF
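As a rough illustration of the character-to-token translation, here is a minimal Python sketch of a hand-written tokenizer for the sample program. The token names VAL, IDENT, ASSIGN, INT, and so on follow the slide; the names PLUS, TRUE, FALSE, the regular expressions, the helper name `tokenize`, and the choice that identifiers start with a letter are all assumptions made for this sketch, not part of the course code.

```python
import re

# Rule order matters: keywords come before IDENT so they win on equal length.
# Token names follow the slide where possible; the patterns are assumptions.
TOKEN_SPEC = [
    ("VAL", r"val\b"),
    ("IF", r"if\b"),
    ("THEN", r"then\b"),
    ("ELSE", r"else\b"),
    ("TRUE", r"true\b"),
    ("FALSE", r"false\b"),
    ("PRINTINT", r"printInt\b"),
    ("INT", r"[0-9]+"),
    ("IDENT", r"[a-zA-Z][a-zA-Z0-9_]*"),
    ("UNDERSCORE", r"_"),
    ("ASSIGN", r"="),
    ("PLUS", r"\+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(src):
    """Return the token stream for src as a list of strings, ending with EOF."""
    tokens = []
    pos = 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise SyntaxError(f"illegal character {src[pos]!r} at offset {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue  # whitespace produces no token
        # Only identifiers and integers carry their lexeme.
        tokens.append(f"{kind}({text})" if kind in ("IDENT", "INT") else kind)
    tokens.append("EOF")
    return tokens


if __name__ == "__main__":
    print(" ".join(tokenize("val x = 3;\nval _ = printInt x;")))
```

Writing even this small lexer by hand already requires care about keyword boundaries and rule order, which is the motivation for generating lexers from a declarative specification.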

Page 9:

Lexer Implementation

Options:
  Write a lexer by hand from scratch
    boring, error-prone, and too much work
    see the Dragon Book, Sec. 3.4
  Automatic lexer generator
    quick and easy

Page 10:

Lexer Implementation

declarative specification -> lexical analyzer

Page 11:

Regular Expressions

How to specify a lexer?
  Develop another language: regular expressions

What’s a lexer generator?
  Another compiler…

Page 12:

Basic Definitions

Alphabet: the char set (say ASCII or Unicode)

String: a finite sequence of chars from the alphabet

Language: a set of strings
  finite or infinite
  say, the C language

Page 13:

Regular Expression (RE)

Construction by induction:
  each c \in alphabet is an RE: a denotes {a}
  the empty string \eps is an RE: \eps denotes {\eps}
  for REs M and N, M|N is an RE: (a|b) = {a, b}
  for REs M and N, MN is an RE: (a|b)(c|d) = {ac, ad, bc, bd}
  for an RE M, M* (Kleene closure) is an RE: (a|b)* = {\eps, a, aa, b, ab, abb, baa, …}
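The inductive definition can be checked mechanically. The following Python sketch enumerates the strings of bounded length denoted by an RE; the tuple encoding ("chr", "alt", "cat", "star", "eps") and the function name `lang` are assumptions chosen for this illustration.

```python
def lang(e, max_len):
    """Return the set of strings of length <= max_len in the language of RE e.

    REs are nested tuples (an encoding assumed for this sketch):
    ("eps",), ("chr", c), ("alt", m, n), ("cat", m, n), ("star", m).
    """
    tag = e[0]
    if tag == "eps":
        return {""}
    if tag == "chr":
        return {e[1]} if max_len >= 1 else set()
    if tag == "alt":
        return lang(e[1], max_len) | lang(e[2], max_len)
    if tag == "cat":
        left, right = lang(e[1], max_len), lang(e[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if tag == "star":
        # Drop the empty string from the base so every step makes progress.
        base = lang(e[1], max_len) - {""}
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown RE node: {tag}")


a, b, c, d = (("chr", ch) for ch in "abcd")
a_or_b = ("alt", a, b)
c_or_d = ("alt", c, d)

if __name__ == "__main__":
    print(sorted(lang(("cat", a_or_b, c_or_d), 2)))
    print(sorted(lang(("star", a_or_b), 2)))
```

The length bound is only there to keep the enumeration finite; the language of a starred RE is infinite whenever its body matches a non-empty string.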

Page 14:

Regular Expression

Or more formally:

e -> \eps | c | e | e | e e | e*

(In “e | e” the middle bar is the RE alternation operator; the other bars separate the grammar alternatives.)

Page 15:

Example

C’s identifier:
  starts with a letter (“_” counts as a letter)
  followed by zero or more letters or digits

(…) (…)
(_|a|b|…|z|A|B|…|Z) (…)
(_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)
(_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)*

It’s really error-prone and tedious…

Page 16:

Syntax Sugar

More syntax sugar:
  [a-z]   == a|b|…|z
  e+      == one or more of e
  e?      == zero or one of e
  “a*”    == the literal string a* itself (not a Kleene closure)
  e{i, j} == at least i and at most j repetitions of e
  .       == any char except \n

All of these can be translated into core REs.

Page 17:

Example Revisited

C’s identifier:
  starts with a letter (“_” counts as a letter)
  followed by zero or more letters or digits

(…) (…)
(_|a|b|…|z|A|B|…|Z) (…)
(_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)
[_a-zA-Z][_a-zA-Z0-9]*

What about the keyword “if”?

Page 18:

Ambiguous Rule

A single RE is not ambiguous.
But in a language, there may be many REs:
  [_a-zA-Z][_a-zA-Z0-9]*
  “if”

So, for a given string, which RE should match?

Page 19:

Ambiguous Rule

Two conventions:

Longest match: The regular expression that matches the longest string takes precedence.

Rule Priority: The regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.
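The two conventions can be sketched in a few lines of Python. The rule list, the rule names IF, ID, NUM, and the function name `next_token` are hypothetical; each rule is tried separately because the alternation operator of Python's `re` module prefers the leftmost alternative rather than the longest match.

```python
import re

# Rules in priority order (earlier = higher priority); contents are illustrative.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
]


def next_token(text, pos=0):
    """Pick the rule with the longest match at pos; ties go to the earlier rule."""
    best = None  # (match length, rule name, lexeme)
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strict '>' means a later rule never displaces an equally long earlier one.
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), name, m.group())
    if best is None or best[0] == 0:
        raise SyntaxError(f"no rule matches at offset {pos}")
    return best[1], best[2]


if __name__ == "__main__":
    for s in ("if", "iffy", "if8", "42x"):
        print(s, "->", next_token(s))
```

With these rules, “if” is matched by both IF and ID with the same length, so rule priority selects IF, while “iffy” is matched longest by ID, so longest match selects ID.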

Page 20:

Lexer Generator History

Lexical analysis was once a performance bottleneck
  certainly not true today!

As a result, early research investigated methods for efficient lexical analysis

While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use

Page 21:

History: A long-standing goal

In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler)

declarative compiler specification -> compiler

Page 22:

History: Unix and C

In the late 1960s and early 1970s at Bell Labs, Ritchie and others were developing Unix
A key part of this project was the development of C and a compiler for it
Johnson et al., in 1968, proposed the use of finite state machines for lexical analysis [CACM 11(12), 1968]
  read the accompanying paper on the course page
Lex, developed later at Bell Labs by Lesk, realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers

Page 23:

The Lex tool

The original Lex generated lexers written in C (C in C)
Today every major language has its own lex tool(s):
  sml-lex, ocamllex, JLex, C#lex, …

Our topic next: sml-lex
  concepts and techniques apply to other tools

Page 24:

SML-Lex Specification

A lexical specification consists of 3 parts (yet another programming language):

User Declarations (plain SML types, values, functions)
%%
SML-LEX Definitions (RE abbreviations, special stuff)
%%
Rules (association of REs with tokens)
      (each token will be represented in plain SML)

Page 25:

User Declarations

User Declarations:
  The user can define various values that are available to the action fragments.
  Two values must be defined in this section:
    type lexresult
      type of the value returned by each rule action.
    fun eof ()
      called by the lexer when the end of the input stream (EOF) is reached.

Page 26:

SML-LEX Definitions

ML-LEX Definitions:
  The user can define regular expression abbreviations:
    digits = [0-9]+;
    letter = [a-zA-Z];
  Define multiple lexers to work together. Each is given a unique name:
    %s lex1 lex2 lex3;

Page 27:

Rules

Rules:

<lexerList> regularExp => (action) ;

A rule consists of a pattern and an action:
  The pattern is a regular expression.
  The action is a fragment of ordinary SML code.
  Longest match & rule priority are used for disambiguation.
Rules may be prefixed with the list of lexers that are allowed to use this rule.

Page 28:

Rules

Rule actions can use any value defined in the User Declarations section, including:
  type lexresult
    type of the value returned by each rule action
  val eof : unit -> lexresult
    called by the lexer when the end of the input stream is reached
Special variables:
  yytext: input substring matched by the regular expression
  yypos: file position of the beginning of the matched string
  continue (): doesn’t return a token; recursively calls the lexer

Page 29:

Example #1

(* A language called Toy *)
prog   -> word prog
       ->
word   -> symbol
       -> number
symbol -> [_a-zA-Z][_0-9a-zA-Z]*
number -> [0-9]+

Page 30:

Example #1

(* Lexer Toy, see the accompanying code for detail *)
datatype token = Symbol of string * int
               | Number of string * int
exception End
type lexresult = unit
fun eof () = raise End
fun output x = …;
%%
letter = [_a-zA-Z];
digit = [0-9];
ld = {letter}|{digit};
symbol = {letter} {ld}*;
number = {digit}+;
%%
<INITIAL>{symbol} => (output (Symbol (yytext, yypos)));
<INITIAL>{number} => (output (Number (yytext, yypos)));

Page 31:

Example #2

(* Expression Language
 * C-style comment, i.e. /* … */
 *)
prog -> stms
stms -> stm; stms
     ->
stm  -> id = e
     -> print e
e    -> id
     -> num
     -> e bop e
     -> (e)
bop  -> + | - | * | /

Page 32:

Sample Program

x = 4;
y = 5;
z = x+y*3;
print z;

Page 33:

Example #2

(* All terminals *)
prog -> stms
stms -> stm; stms
     ->
stm  -> id = e
     -> print e
e    -> id
     -> num
     -> e bop e
     -> (e)
bop  -> + | - | * | /

Terminals appearing in this grammar: ; = print id num ( ) + - * /

Page 34:

Example #2 in Lex

(* Expression language, see the accompanying code
 * for detail.
 * Part 1: user code
 *)
datatype token = Id of string * int
               | Number of string * int
               | Print of string * int
               | Plus of string * int
               | … (* all other stuffs *)
exception End
type lexresult = unit
fun eof () = raise End
fun output x = …;

Page 35:

Example #2 in Lex, cont’

(* Expression language, see the accompanying code
 * for detail.
 * Part 2: lex definition
 *)
%%
(* COMMENT is the start state used by the comment rules in Part 3 *)
%s COMMENT;
letter = [_a-zA-Z];
digit = [0-9];
ld = {letter}|{digit};
sym = {letter} {ld}*;
num = {digit}+;
ws = [\ \t];
nl = [\n];

Page 36:

Example #2 in Lex, cont’

(* Expression language, see the accompanying code
 * for detail.
 * Part 3: rules
 *)
%%
<INITIAL>{ws} => (continue ());
<INITIAL>{nl} => (continue ());
<INITIAL>"+"  => (output (Plus (yytext, yypos)));
<INITIAL>"-"  => (output (Minus (yytext, yypos)));
<INITIAL>"*"  => (output (Times (yytext, yypos)));
<INITIAL>"/"  => (output (Divide (yytext, yypos)));
<INITIAL>"("  => (output (Lparen (yytext, yypos)));
<INITIAL>")"  => (output (Rparen (yytext, yypos)));
<INITIAL>"="  => (output (Assign (yytext, yypos)));
<INITIAL>";"  => (output (Semi (yytext, yypos)));

Page 37:

Example #2 in Lex, cont’

(* Expression language, see the accompanying code
 * for detail.
 * Part 3: rules cont’
 *)
<INITIAL>"print" => (output (Print (yytext, yypos)));
<INITIAL>{sym}   => (output (Id (yytext, yypos)));
<INITIAL>{num}   => (output (Number (yytext, yypos)));
<INITIAL>"/*"    => (YYBEGIN COMMENT; continue ());
<COMMENT>"*/"    => (YYBEGIN INITIAL; continue ());
<COMMENT>{nl}    => (continue ());
<COMMENT>.       => (continue ());
<INITIAL>.       => (error (…));

Page 38:

Lex Implementation

Lex accepts regular expressions (along with others)
So SML-lex is a compiler from REs to a lexer
Internally: RE -> NFA -> DFA -> table-driven algorithm

Page 39:

Finite-state Automata (FA)

Input string -> M -> {Yes, No}

M = (\Sigma, S, q0, F, \delta)

  \Sigma: input alphabet
  S: state set
  q0: initial state
  F: final states
  \delta: transition function

Page 40:

Transition functions

DFA  \delta : S \times \Sigma -> S

NFA  \delta : S \times \Sigma -> P(S)
  (P(S) is the set of subsets of S; for an NFA with \eps-moves, \Sigma is extended with \eps)

Page 41:

DFA example

Which strings of as and bs are accepted?

Transition function:
{ (q0,a) -> q1, (q0,b) -> q0,
  (q1,a) -> q2, (q1,b) -> q1,
  (q2,a) -> q2, (q2,b) -> q2 }

[Diagram: states 0, 1, 2; a-edges 0 -> 1 -> 2; b self-loops on 0 and 1; a,b self-loop on 2]
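To experiment with the question on this slide, the transition function can be run directly. This Python sketch copies the six transitions from the slide; the accepting set {q2} is an assumption, since the transcript does not preserve which state the diagram marks as final, and `accepts` is a name chosen for the sketch.

```python
# Transition function copied from the slide.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q2", ("q1", "b"): "q1",
    ("q2", "a"): "q2", ("q2", "b"): "q2",
}
START = "q0"
FINAL = {"q2"}  # assumption: q2 is the only accepting state


def accepts(s):
    """Run the DFA on s (a string over {a, b}) and report acceptance."""
    state = START
    for ch in s:
        state = DELTA[(state, ch)]  # exactly one successor per (state, char)
    return state in FINAL


if __name__ == "__main__":
    for s in ("", "ab", "aa", "babab", "bbb"):
        print(repr(s), accepts(s))
```

Under that assumption the machine counts a's up to two and then stays in q2, so it accepts exactly the strings containing at least two a's.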

Page 42:

NFA example

Transition function:
{ (q0,a) -> {q0,q1}, (q0,b) -> {q1},
  (q1,a) -> \phi,    (q1,b) -> {q0,q1} }

[Diagram: states 0 and 1, with edges labeled as in the transition function]

Page 43:

RE -> NFA: Thompson algorithm

Break the RE down to atoms
  construct small NFAs directly for atoms
  inductively construct larger NFAs from small NFAs
Easy to implement
  a small recursive algorithm

Page 44:

RE -> NFA: Thompson algorithm

e -> \eps
  -> c
  -> e1 e2
  -> e1 | e2
  -> e1*

[Diagrams: NFA fragments for \eps, for a single character c, and for the concatenation e1 e2]

Page 45:

RE -> NFA: Thompson algorithm

e -> \eps
  -> c
  -> e1 e2
  -> e1 | e2
  -> e1*

[Diagrams: NFA fragments for the alternation e1 | e2 and for the closure e1*]
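The recursive construction can be sketched in Python as follows. The tuple encoding of REs, the `NFA` class, and the helper names are assumptions made for this sketch; it also wraps every case in a fresh start and accept state joined by \eps-edges, a common simplification of the textbook construction that yields a few more states but the same language.

```python
from itertools import count


class NFA:
    """Edge list keyed by state; a label of None stands for an eps-move."""

    def __init__(self):
        self.edges = {}
        self._ids = count()

    def new_state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def add(self, src, label, dst):
        self.edges[src].append((label, dst))


def thompson(nfa, e):
    """Build a fragment for RE e; return its (start, accept) states."""
    tag = e[0]
    start, accept = nfa.new_state(), nfa.new_state()
    if tag == "eps":
        nfa.add(start, None, accept)
    elif tag == "chr":
        nfa.add(start, e[1], accept)
    elif tag == "cat":
        s1, a1 = thompson(nfa, e[1])
        s2, a2 = thompson(nfa, e[2])
        nfa.add(start, None, s1)
        nfa.add(a1, None, s2)
        nfa.add(a2, None, accept)
    elif tag == "alt":
        s1, a1 = thompson(nfa, e[1])
        s2, a2 = thompson(nfa, e[2])
        nfa.add(start, None, s1)
        nfa.add(start, None, s2)
        nfa.add(a1, None, accept)
        nfa.add(a2, None, accept)
    elif tag == "star":
        s1, a1 = thompson(nfa, e[1])
        nfa.add(start, None, s1)      # enter the loop body
        nfa.add(start, None, accept)  # or skip it (zero repetitions)
        nfa.add(a1, None, s1)         # repeat
        nfa.add(a1, None, accept)     # or leave
    else:
        raise ValueError(f"unknown RE node: {tag}")
    return start, accept


def closure(nfa, states):
    """All states reachable from `states` through eps-moves only."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for label, t in nfa.edges[s]:
            if label is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def matches(e, text):
    """Simulate the Thompson NFA of e on text."""
    nfa = NFA()
    start, accept = thompson(nfa, e)
    current = closure(nfa, {start})
    for ch in text:
        moved = {t for s in current for label, t in nfa.edges[s] if label == ch}
        current = closure(nfa, moved)
    return accept in current
```

For example, `matches(("star", ("alt", ("chr", "a"), ("chr", "b"))), "abba")` simulates the NFA for (a|b)* on the input abba.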

Page 46:

Example

%%
letter = [_a-zA-Z];
digit = [0-9];
id = {letter} ({letter}|{digit})* ;
%%
<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));

(* Equivalent to:
 *   "if" | {id}
 *)

Page 47:

Example

<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));

[Diagram: NFA fragment for "if", with edges labeled i and f]

Page 48:

NFA -> DFA: Subset construction algorithm

(* subset construction: workList algorithm *)
q0 <- e-closure (n0)
Q <- {q0}
workList <- {q0}
while (workList != \phi)
  remove q from workList
  foreach (character c)
    t <- e-closure (move (q, c))
    D[q, c] <- t
    if (t \not\in Q)
      add t to Q and workList
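The worklist algorithm can be sketched in Python as follows. The NFA used here is an assumed reconstruction of the “if” / identifier automaton from the later example slides: the state numbers 0 to 8 and the resulting state sets follow those slides, but the exact edges are inferred, and the names `e_closure`, `move`, and `subset_construction` are chosen for the sketch.

```python
import string

EPS = None
LETTER = frozenset(string.ascii_letters + "_")
LD = LETTER | frozenset(string.digits)

# (source, label, target); a label is EPS or a set of characters.
# Assumed reconstruction of the NFA for:  "if" | {id}
NFA_EDGES = [
    (0, EPS, 1), (0, EPS, 5),
    (1, frozenset("i"), 2), (2, EPS, 3), (3, frozenset("f"), 4),
    (5, LETTER, 6), (6, EPS, 7), (7, LD, 7), (7, EPS, 8),
]


def e_closure(states):
    """States reachable from `states` using eps-edges only."""
    result, work = set(states), list(states)
    while work:
        s = work.pop()
        for src, label, dst in NFA_EDGES:
            if src == s and label is EPS and dst not in result:
                result.add(dst)
                work.append(dst)
    return frozenset(result)


def move(states, c):
    """States reachable from `states` by consuming the character c."""
    return {dst for src, label, dst in NFA_EDGES
            if src in states and label is not EPS and c in label}


def subset_construction(start, alphabet):
    """Return (q0, Q, D): DFA start state, state set, and transition table."""
    q0 = e_closure({start})
    Q, D, work = {q0}, {}, [q0]
    while work:
        q = work.pop()
        for c in alphabet:
            t = e_closure(move(q, c))
            if not t:
                continue  # no transition on c: implicit error state
            D[q, c] = t
            if t not in Q:
                Q.add(t)
                work.append(t)
    return q0, Q, D


if __name__ == "__main__":
    q0, Q, D = subset_construction(0, sorted(LD))
    for q in sorted(Q, key=sorted):
        print(sorted(q))
```

Each DFA state is a frozen set of NFA states, so set equality decides whether a state has been seen before; this is what guarantees termination, since there are only finitely many subsets.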

Page 49:

NFA -> DFA: \eps-closure

(* \eps-closure: fixpoint algorithm *)
(* Dragon Fig 3.33 gives a DFS-like algorithm.
 * Here we give a recursive version. (Simpler)
 *)
X <- \phi
fun eps (t) =
  X <- X ∪ {t}
  foreach (s \in one-eps(t))
    if (s \not\in X)
      then eps (s)

Page 50:

NFA -> DFA: \eps-closure

(* \eps-closure: fixpoint algorithm *)
(* Dragon Fig 3.33 gives a DFS-like algorithm.
 * Here we give a recursive version. (Simpler)
 *)
fun e-closure (T) =
  X <- T
  foreach (t \in T)
    X <- X ∪ eps(t)

Page 51:

NFA -> DFA: \eps-closure

(* \eps-closure: fixpoint algorithm *)
(* A BFS-like algorithm. *)
X <- empty;
fun e-closure (T) =
  Q <- T
  X <- T
  while (Q not empty)
    q <- deQueue (Q)
    foreach (s \in one-eps(q))
      if (s \not\in X)
        enQueue (Q, s)
        X <- X ∪ {s}
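Both closure variants can be written side by side in Python to check that they agree. The `ONE_EPS` map below is a hypothetical set of \eps-edges (it deliberately contains a cycle 6 -> 7 -> 8 -> 6 to show why the “not in X” test is needed), and the function names are chosen for this sketch.

```python
from collections import deque

# Hypothetical eps-edges: state -> states reachable by one eps-move.
ONE_EPS = {0: {1, 5}, 2: {3}, 6: {7}, 7: {8}, 8: {6}}


def closure_recursive(T):
    """Recursive (DFS-like) eps-closure of the state set T."""
    X = set()

    def eps(t):
        X.add(t)
        for s in ONE_EPS.get(t, ()):
            if s not in X:  # without this test the cycle would recurse forever
                eps(s)

    for t in T:
        eps(t)
    return X


def closure_bfs(T):
    """Queue-based (BFS-like) eps-closure of the state set T."""
    X = set(T)
    Q = deque(T)
    while Q:
        q = Q.popleft()
        for s in ONE_EPS.get(q, ()):
            if s not in X:
                X.add(s)
                Q.append(s)
    return X


if __name__ == "__main__":
    print(sorted(closure_recursive({0})), sorted(closure_bfs({2, 6})))
```

The two functions visit states in a different order but compute the same fixpoint; the queue-based version avoids deep recursion on long \eps-chains.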

Page 52:

Example

<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));

[Diagram: NFA for the two rules, with states 0 to 8 and edges labeled i, f, [_a-zA-Z], and [_a-zA-Z0-9]]

Page 53:

Example

q0 = {0, 1, 5}                     Q = {q0}
D[q0, “i”] = {2, 3, 6, 7, 8}       Q ∪ {q1}
D[q0, _]   = {6, 7, 8}             Q ∪ {q2}
D[q1, “f”] = {4, 7, 8}             Q ∪ {q3}

(Here _ stands for any other character that has a transition.)

[Diagram: the NFA with states 0 to 8, and the partial DFA: q0 -i-> q1, q1 -f-> q3, q0 -_-> q2]

Page 54:

Example

D[q1, _] = {7, 8}       Q ∪ {q4}
D[q2, _] = {7, 8}       Q
D[q3, _] = {7, 8}       Q
D[q4, _] = {7, 8}       Q

[Diagram: the NFA with states 0 to 8, and the DFA: q0 -i-> q1, q1 -f-> q3, q0 -_-> q2, and _-edges from q1, q2, q3, and q4 to q4]

Page 55:

Example

q0 = {0, 1, 5}
q1 = {2, 3, 6, 7, 8}
q2 = {6, 7, 8}
q3 = {4, 7, 8}
q4 = {7, 8}

[Diagram: the NFA with states 0 to 8, and the resulting DFA:
 q0 -“i”-> q1, q0 -(letter-“i”)-> q2, q1 -“f”-> q3, q1 -(ld-“f”)-> q4,
 q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]


Page 57:

Table-driven Algorithm

Conceptually, an FA is a directed graph
Pragmatically, there are many different strategies to encode an FA:
  Matrix (adjacency matrix)
    sml-lex
  Array of list (adjacency list)
  Hash table
  Jump table (switch statements)
    flex
Balance between time and space

Page 58:

Example

<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));

[Diagram: the DFA with states q0 to q4 and edges labeled “i”, “f”, letter-“i”, ld-“f”, and ld]

state\char   “i”   “f”   letter-“i”-“f”   …   other
q0           q1    q2    q2               …   error
q1           q4    q3    q4               …   error
q2           q4    q4    q4               …   error
q3           q4    q4    q4               …   error
q4           q4    q4    q4               …   error

state    q0   q1   q2   q3   q4
action        Id   Id   IF   Id
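A table-driven scanner over this DFA can be sketched in Python. The transition and action tables follow the slide; the digit column is elided on the slide (“…”), so the digit entries below are an assumption derived from the identifier pattern, and `char_class` and `scan` are names chosen for the sketch.

```python
def char_class(c):
    """Map a character to its column in the transition table."""
    if c == "i":
        return "i"
    if c == "f":
        return "f"
    if c.isascii() and (c.isalpha() or c == "_"):
        return "letter"  # letter - "i" - "f"
    if c.isascii() and c.isdigit():
        return "digit"   # assumed column; elided on the slide
    return "other"


ALL_TO_Q4 = {"i": "q4", "f": "q4", "letter": "q4", "digit": "q4"}
# A missing entry means "error".
TABLE = {
    "q0": {"i": "q1", "f": "q2", "letter": "q2"},
    "q1": {"i": "q4", "f": "q3", "letter": "q4", "digit": "q4"},
    "q2": ALL_TO_Q4,
    "q3": ALL_TO_Q4,
    "q4": ALL_TO_Q4,
}
ACTION = {"q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id"}  # q0 has no action


def scan(text, pos=0):
    """Longest match starting at pos; returns (action, lexeme, next position)."""
    state, last, i = "q0", None, pos
    while i < len(text):
        nxt = TABLE[state].get(char_class(text[i]))
        if nxt is None:  # error entry: stop and fall back to the last accept
            break
        state, i = nxt, i + 1
        if state in ACTION:
            last = (ACTION[state], i)
    if last is None:
        raise SyntaxError(f"no token at offset {pos}")
    action, end = last
    return action, text[pos:end], end


if __name__ == "__main__":
    print(scan("if"), scan("iffy+1"), scan("x9 y"))
```

Remembering the last accepting position is what implements longest match: the scanner keeps consuming input until the table reports an error, then returns the most recent accepted prefix.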

Page 59:

DFA Minimization: Hopcroft’s Algorithm

[Diagram: the DFA with states q0 to q4 and edges labeled “i”, “f”, letter-“i”, ld-“f”, and ld]

state    q0   q1   q2   q3   q4
action        Id   Id   IF   Id


Page 61:

DFA Minimization: Hopcroft’s Algorithm

[Diagram: the minimized DFA, in which q2 and q4 are merged into one state:
 q0 -“i”-> q1, q0 -(letter-“i”)-> {q2, q4}, q1 -“f”-> q3, q1 -(ld-“f”)-> {q2, q4},
 q3 -ld-> {q2, q4}, {q2, q4} -ld-> {q2, q4}]

state    q0   q1   q2, q4   q3
action        Id   Id       IF
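The merge of q2 and q4 can be reproduced with a small partition-refinement sketch in Python. Note that this is the simpler Moore-style refinement, not Hopcroft's worklist algorithm itself; both compute the same minimal partition. The tables follow the slides, while the digit column, the explicit error state, and the function name `minimize` are assumptions made for the sketch.

```python
CLASSES = ["i", "f", "letter", "digit"]  # "digit" column is assumed
ERR = "error"
DELTA = {
    "q0": {"i": "q1", "f": "q2", "letter": "q2", "digit": ERR},
    "q1": {"i": "q4", "f": "q3", "letter": "q4", "digit": "q4"},
    "q2": dict.fromkeys(CLASSES, "q4"),
    "q3": dict.fromkeys(CLASSES, "q4"),
    "q4": dict.fromkeys(CLASSES, "q4"),
    ERR: dict.fromkeys(CLASSES, ERR),
}
ACTION = {"q0": None, "q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id", ERR: None}


def renumber(keys):
    """Replace arbitrary block keys by small integers."""
    ids = {}
    return {s: ids.setdefault(k, len(ids)) for s, k in keys.items()}


def minimize(delta, action, classes):
    """Return the set of equivalence classes of states (as frozensets)."""
    # Start by separating states with different actions.
    block = renumber({s: action[s] for s in delta})
    while True:
        # Two states stay together only if they agree on their own block
        # and on the block reached for every character class.
        sig = {s: (block[s],) + tuple(block[delta[s][c]] for c in classes)
               for s in delta}
        refined = renumber(sig)
        if len(set(refined.values())) == len(set(block.values())):
            break  # no block was split: fixpoint reached
        block = refined
    groups = {}
    for s, b in block.items():
        groups.setdefault(b, set()).add(s)
    return {frozenset(g) for g in groups.values()}


if __name__ == "__main__":
    for group in sorted(minimize(DELTA, ACTION, CLASSES), key=sorted):
        print(sorted(group))
```

q1 cannot be merged with q2 and q4 even though all three carry the action Id, because on “f” it moves to q3 (action IF) while the other two stay in an Id state.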

Page 62:

Summary

A lexer:
  input: stream of characters
  output: stream of tokens

Writing lexers by hand is boring, so we use a lexer generator: ml-lex
  RE -> NFA -> DFA -> table-driven algorithm

Moral: don’t underestimate your theory classes!
  a great application of cool theory developed in mathematics.
  we’ll see more cool apps as the course progresses