THEORY OF COMPILATION, Lecture 02 – Lexical Analysis, Eran Yahav

Theory of Compilation

Jan 22, 2016

Kelli

Transcript
Page 1: Theory of Compilation

THEORY OF COMPILATION
Lecture 02 – Lexical Analysis

Eran Yahav

Page 2: Theory of Compilation

You are here

[Figure: Source text (txt) → Compiler → Executable code (exe). Compiler stages: Lexical Analysis → Syntax Analysis (Parsing) → Semantic Analysis → Inter. Rep. (IR) → Code Gen. The current stage is Lexical Analysis.]

Page 3: Theory of Compilation

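You are here…

[Figure: detailed compiler pipeline, with the data passed between stages shown in parentheses. Source text (txt) → Process text input → (characters) → Lexical Analysis → (tokens) → Syntax Analysis → (AST) → Sem. Analysis → (Annotated AST) → Intermediate code generation → (IR) → Intermediate code optimization → (IR) → Code generation → (Symbolic Instructions, SI) → Target code optimization → (SI) → Machine code generation → (MI) → Write executable output → Executable code (exe). The later stages are labeled "Back End".]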

Page 4: Theory of Compilation

From characters to tokens

What is a token?
- Roughly: a "word" in the source language
  - Identifiers
  - Values
  - Language keywords
- Really: anything that should appear in the input to syntax analysis
- Technically: usually a pair of (kind, value)

Page 5: Theory of Compilation

Example Tokens

Type        Examples
Identifier  x, y, z, foo, bar
NUM         42
FLOATNUM    3.141592654
STRING      "so long, and thanks for all the fish"
LPAREN      (
RPAREN      )
IF          if

Page 6: Theory of Compilation

Strings with special handling

Type                     Examples
Comments                 /* Ceci n'est pas un commentaire */
Preprocessor directives  #include<foo.h>
Macros                   #define THE_ANSWER 42
White spaces             \t \n

Page 7: Theory of Compilation

From characters to tokens

Source text (txt):
    x = b*b - 4*a*c

Token stream:
    <ID,"x"> <EQ> <ID,"b"> <MULT> <ID,"b"> <MINUS> <INT,4> <MULT> <ID,"a"> <MULT> <ID,"c">
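The conversion above can be sketched as a small hand-written scanner in C. This is a minimal illustration, not the lecture's implementation: the names `TokenKind`, `Token` and `next_token` are hypothetical, and only the token kinds needed for this one line are covered.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds matching the token stream shown above. */
typedef enum { TOK_ID, TOK_INT, TOK_EQ, TOK_MULT, TOK_MINUS,
               TOK_ERROR, TOK_EOF } TokenKind;

/* A token is a (kind, value) pair; the value is a slice of the source text. */
typedef struct {
    TokenKind kind;
    const char *start;  /* first character of the lexeme */
    size_t length;      /* number of characters in the lexeme */
} Token;

/* Scan one token starting at *src and advance *src past it. */
Token next_token(const char **src) {
    const char *p = *src;
    while (*p == ' ' || *p == '\t' || *p == '\n') p++;   /* skip whitespace */

    Token t = { TOK_ERROR, p, 1 };   /* default: one unrecognized character */
    if (*p == '\0') {
        t.kind = TOK_EOF;
        t.length = 0;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        const char *q = p;
        while (isalnum((unsigned char)*q) || *q == '_') q++;
        t.kind = TOK_ID;
        t.length = (size_t)(q - p);
    } else if (isdigit((unsigned char)*p)) {
        const char *q = p;
        while (isdigit((unsigned char)*q)) q++;
        t.kind = TOK_INT;
        t.length = (size_t)(q - p);
    } else if (*p == '=') {
        t.kind = TOK_EQ;
    } else if (*p == '*') {
        t.kind = TOK_MULT;
    } else if (*p == '-') {
        t.kind = TOK_MINUS;
    }

    *src = p + t.length;
    return t;
}
```

Calling `next_token` repeatedly on a pointer to `"x = b*b - 4*a*c"` yields the eleven tokens listed above, followed by `TOK_EOF`. The sketch makes no attempt to reject inputs such as `3oranges`; it would split them into two tokens.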

Page 8: Theory of Compilation

Errors in lexical analysis

pi = 3.141.562    →  illegal token
pi = 3oranges     →  illegal token
pi = oranges3     →  <ID,"pi">, <EQ>, <ID,"oranges3">

Page 9: Theory of Compilation

How can we define tokens?

Keywords are easy: if, then, else, for, while, …

Identifiers? Numerical values? Strings?

How can we characterize unbounded sets of values using a bounded description?

Page 10: Theory of Compilation

Regular Expressions

Basic patterns         Matching
x                      The character x
.                      Any character, usually except a newline
[xyz]                  Any of the characters x, y, z

Repetition operators
R?                     An R or nothing (= optionally an R)
R*                     Zero or more occurrences of R
R+                     One or more occurrences of R

Composition operators
R1R2                   An R1 followed by an R2
R1|R2                  Either an R1 or an R2

Grouping
(R)                    R itself

Page 11: Theory of Compilation

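Examples

ab*|cd? = { a, ab, abb, … } ∪ { c, cd }
(a|b)* = all strings over {a, b}, including the empty string
(0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)* = all strings of decimal digits, including the empty string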
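These example expressions can be tried out with the POSIX regular-expression functions `regcomp`, `regexec` and `regfree` from `<regex.h>`. The helper name `matches` is hypothetical, and the sketch assumes a POSIX system, since `<regex.h>` is not part of ISO C. The caller wraps each expression in `^( … )$` so that the whole string has to match.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the POSIX extended regex in pattern,
   0 if it does not, and -1 if the pattern does not compile. */
int matches(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}
```

For example, `matches("^(ab*|cd?)$", "abbb")` and `matches("^(ab*|cd?)$", "cd")` succeed, while `matches("^(ab*|cd?)$", "cdd")` does not.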

Page 12: Theory of Compilation

Escape characters

What is the expression for one or more + symbols?
- (+)+ won't work
- (\+)+ will

A backslash \ before an operator turns it into an ordinary character: \*, \?, \+, …

Page 13: Theory of Compilation

Shorthands

Use names for expressions:
    letter  = a | b | … | z | A | B | … | Z
    letter_ = letter | _
    digit   = 0 | 1 | 2 | … | 9
    id      = letter_ (letter_ | digit)*

Use a hyphen to denote a range:
    letter = a-z | A-Z
    digit  = 0-9

Page 14: Theory of Compilation

Examples

    digit  = 0-9
    digits = digit+
    number = digits (ε | .digits (ε | e (ε | + | -) digits))
    if     = if
    then   = then
    relop  = < | > | <= | >= | = | <>
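To make the nesting in the `number` definition concrete, here is a hand-coded matcher that follows the expression alternative by alternative. It is a sketch under the definition exactly as written on the slide (so an exponent is only allowed after a fractional part); the function names `match_digits` and `match_number` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the run of digits starting at s (0 if there is none). */
static size_t match_digits(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Length of the longest prefix of s that matches
     number = digits (eps | .digits (eps | e (eps|+|-) digits))
   or 0 if no prefix matches. */
size_t match_number(const char *s) {
    size_t n = match_digits(s);
    if (n == 0) return 0;                  /* leading digits are mandatory */

    if (s[n] != '.') return n;             /* outer eps alternative */
    size_t frac = match_digits(s + n + 1);
    if (frac == 0) return n;               /* '.' must be followed by digits */
    n += 1 + frac;

    if (s[n] != 'e') return n;             /* inner eps alternative */
    size_t k = n + 1;
    if (s[k] == '+' || s[k] == '-') k++;   /* optional sign */
    size_t exp_len = match_digits(s + k);
    if (exp_len == 0) return n;            /* 'e' must be followed by digits */
    return k + exp_len;
}
```

With this sketch, `match_number("3.14e-5")` covers all seven characters, while `match_number("3.141.562")` stops after `3.141`, which is where a scanner would then have to decide how to report the leftover `.562`.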

Page 15: Theory of Compilation

Ambiguity

    if = if
    id = letter_ (letter_ | digit)*

"if" is a valid word in the language of identifiers… so what should it be?

How about the identifier "iffy"?

Solution
- Always find the longest matching token
- Break ties using the order of definitions: the first definition wins (=> list rules for keywords before identifiers)
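The two rules can be sketched for the keyword/identifier case as follows. Because every keyword is also a word in the language of `id`, the longest match over all rules is simply the longest identifier-shaped prefix; the keyword rules only matter when they match that same prefix, and then they win because they are listed first. The names `Kind` and `scan_word` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { KIND_NONE, KIND_IF, KIND_THEN, KIND_ID } Kind;

/* Scan the longest prefix of s matching letter_ (letter_ | digit)*,
   store its length in *len, and classify it. */
Kind scan_word(const char *s, size_t *len) {
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        n = 1;
        while (isalnum((unsigned char)s[n]) || s[n] == '_') n++;
    }
    *len = n;
    if (n == 0) return KIND_NONE;

    /* Tie-break: keyword rules are listed before the identifier rule,
       so they win when they match the same (longest) prefix. */
    if (n == 2 && strncmp(s, "if", 2) == 0) return KIND_IF;
    if (n == 4 && strncmp(s, "then", 4) == 0) return KIND_THEN;
    return KIND_ID;
}
```

Here `"if"` is classified as the keyword, while `"iffy"` is an identifier of length 4, because the longest-match rule is applied before the tie-breaker.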

Page 16: Theory of Compilation

Creating a lexical analyzer

Input
- List of token definitions (pattern name, regex)
- String to be analyzed

Output
- List of tokens

How do we build an analyzer?

Page 17: Theory of Compilation

Character classification

#define is_end_of_input(ch) ((ch) == '\0')
#define is_uc_letter(ch) ('A' <= (ch) && (ch) <= 'Z')
#define is_lc_letter(ch) ('a' <= (ch) && (ch) <= 'z')
#define is_letter(ch) (is_uc_letter(ch) || is_lc_letter(ch))
#define is_digit(ch) ('0' <= (ch) && (ch) <= '9')

Page 18: Theory of Compilation

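Main reading routine

void get_next_token(void) {
    int c;  /* int rather than char, so that EOF can be represented */
    while ((c = getchar()) != EOF) {
        /* case labels must be constants, so the tests are written as an if-chain */
        if (is_letter(c))     { recognize_identifier(c); return; }
        else if (is_digit(c)) { recognize_number(c); return; }
        …
    }
}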

Page 19: Theory of Compilation

But we have a much better way!

Generate a lexical analyzer automatically from token definitions

Main idea: use finite-state automata to match regular expressions

Page 20: Theory of Compilation

Reminder: Finite-State Automaton

Deterministic automaton M = (Σ, Q, δ, q0, F)
- Σ: alphabet
- Q: finite set of states
- q0 ∈ Q: initial state
- F ⊆ Q: final states
- δ : Q × Σ → Q: transition function
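A deterministic automaton maps directly onto a lookup table. The sketch below encodes a three-state DFA for the identifier pattern `letter_ (letter_ | digit)*`; the state names, the grouping of the alphabet into three character classes, and the function names are all assumptions made for illustration.

```c
#include <ctype.h>

enum { NUM_STATES = 3, NUM_CLASSES = 3 };
enum { START = 0, IN_ID = 1, DEAD = 2 };           /* Q, with q0 = START */
enum { C_LETTER = 0, C_DIGIT = 1, C_OTHER = 2 };   /* alphabet, grouped into classes */

/* The transition function delta : Q x Sigma -> Q as a table. */
static const int delta[NUM_STATES][NUM_CLASSES] = {
    /* START */ { IN_ID, DEAD,  DEAD },
    /* IN_ID */ { IN_ID, IN_ID, DEAD },
    /* DEAD  */ { DEAD,  DEAD,  DEAD },
};

/* F = { IN_ID } */
static const int is_final[NUM_STATES] = { 0, 1, 0 };

/* Map a character to its class (letters and '_' share one class). */
static int char_class(char ch) {
    unsigned char c = (unsigned char)ch;
    if (isalpha(c) || c == '_') return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;
}

/* Return 1 if the DFA accepts the whole word w, 0 otherwise. */
int dfa_accepts(const char *w) {
    int q = START;
    for (; *w != '\0'; w++)
        q = delta[q][char_class(*w)];
    return is_final[q];
}
```

Each input character costs exactly one table lookup, which is why running a DFA takes time proportional to the length of the word.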

Page 21: Theory of Compilation

Reminder: Finite-State Automaton

Non-deterministic automaton M = (Σ, Q, δ, q0, F)
- Σ: alphabet
- Q: finite set of states
- q0 ∈ Q: initial state
- F ⊆ Q: final states
- δ : Q × (Σ ∪ {ε}) → 2^Q: transition function

Possible ε-transitions

For a word w, M can reach a number of states or get stuck. If some state reached is final, M accepts w.
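The acceptance condition can be sketched by tracking the set of all states the NFA could be in, stored as a bitmask. The five-state NFA below, for the language `a | a*b+`, is a made-up example, and all names are hypothetical.

```c
#include <stdint.h>

enum { NUM_STATES = 5 };

/* Transition relation of a small example NFA for  a | a*b+ :
     q0 -eps-> q1, q0 -eps-> q3            (choose a branch)
     q1 -a-> q2                            (q2 accepts "a")
     q3 -a-> q3, q3 -b-> q4, q4 -b-> q4    (q4 accepts a*b+)
   Each entry is a bitmask of successor states, i.e. an element of 2^Q. */
static const uint32_t on_eps[NUM_STATES] = { (1u << 1) | (1u << 3), 0, 0, 0, 0 };
static const uint32_t on_a[NUM_STATES]   = { 0, 1u << 2, 0, 1u << 3, 0 };
static const uint32_t on_b[NUM_STATES]   = { 0, 0, 0, 1u << 4, 1u << 4 };
static const uint32_t final_states       = (1u << 2) | (1u << 4);

/* Add every state reachable through eps-transitions alone. */
static uint32_t eps_closure(uint32_t set) {
    uint32_t prev;
    do {
        prev = set;
        for (int q = 0; q < NUM_STATES; q++)
            if (set & (1u << q)) set |= on_eps[q];
    } while (set != prev);
    return set;
}

/* Return 1 if some run of the NFA on w ends in a final state. */
int nfa_accepts(const char *w) {
    uint32_t current = eps_closure(1u << 0);        /* start in q0 */
    for (; *w != '\0'; w++) {
        uint32_t next = 0;
        for (int q = 0; q < NUM_STATES; q++) {
            if (!(current & (1u << q))) continue;
            if (*w == 'a') next |= on_a[q];
            else if (*w == 'b') next |= on_b[q];
        }
        current = eps_closure(next);                /* empty set: M is stuck */
    }
    return (current & final_states) != 0;
}
```

Each input character requires a pass over all states, which is where the extra factor in the running time of NFA simulation comes from.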

Page 22: Theory of Compilation

From regular expressions to NFA

Step 1: assign expression names and obtain pure regular expressions R1…Rm
Step 2: construct an NFA Mi for each regular expression Ri
Step 3: combine all Mi into a single NFA

Ambiguity resolution: prefer longest accepting word

Page 23: Theory of Compilation

Basic constructs

R = ε
R = ∅
R = a

[Figures: one small NFA for each construct; the NFA for R = a has a single transition labeled a.]

Page 24: Theory of Compilation

Composition

R = R1 | R2
[Figure: NFA built from M1 and M2 side by side]

R = R1R2
[Figure: NFA built from M1 followed by M2]

Page 25: Theory of Compilation

Repetition

R = R1*
[Figure: NFA built from M1]

Page 26: Theory of Compilation

What now?

Naïve approach: try each automaton separately

Given a word w:
- Try M1(w)
- Try M2(w)
- …
- Try Mn(w)

Requires resetting after every attempt

Page 27: Theory of Compilation

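Combine automata

[Figure: a single NFA whose start state 0 branches to one sub-automaton per pattern]

Pattern   Sub-automaton
a         1 -a-> 2
abb       3 -a-> 4 -b-> 5 -b-> 6
a*b+      7 -b-> 8, with an a-loop on 7 and a b-loop on 8
abab      9 -a-> 10 -b-> 11 -a-> 12 -b-> 13

State 0 is connected to states 1, 3, 7 and 9 by ε-transitions. States 2, 6, 8 and 13 are accepting.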

Page 28: Theory of Compilation

Ambiguity resolution

Recall…
- Longest word
- Tie-breaker based on order of rules when words have same length

Recipe
- Turn NFA to DFA
- Run until stuck, remember last accepting state, this is the token to be returned

Page 29: Theory of Compilation

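Corresponding DFA

[Figure: DFA obtained from the combined NFA; each DFA state is a set of NFA states]

State          on a          on b        Accepts
(0 1 3 7 9)    (2 4 7 10)    (8)         -
(2 4 7 10)     (7)           (5 8 11)    a
(7)            (7)           (8)         -
(8)            stuck         (8)         a*b+
(5 8 11)       (12)          (6 8)       a*b+
(6 8)          stuck         (8)         abb (also matches a*b+)
(12)           stuck         (13)        -
(13)           stuck         stuck       abab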

Page 30: Theory of Compilation

Examples

[Figure: the same DFA as on the previous slide]

abaa: gets stuck after aba in state (12), backs up to state (5 8 11), pattern is a*b+, token is ab

abba: stops after the second b in (6 8), token is abb because it comes first in the spec
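The "run until stuck, remember the last accepting state" recipe can be sketched on this DFA as follows. The transition table was derived from the NFA states named on the slides, the integer numbering of the DFA states is an assumption of this sketch, and the names `Pattern` and `longest_match` are hypothetical.

```c
#include <stddef.h>

typedef enum { P_NONE, P_A, P_ABB, P_ASTAR_BPLUS, P_ABAB } Pattern;

enum { DEAD = -1 };

/* delta[state][0] is the move on 'a', delta[state][1] the move on 'b'. */
static const int delta[8][2] = {
    /* 0: (0 1 3 7 9) */ { 1,    3 },
    /* 1: (2 4 7 10)  */ { 2,    4 },
    /* 2: (7)         */ { 2,    3 },
    /* 3: (8)         */ { DEAD, 3 },
    /* 4: (5 8 11)    */ { 6,    5 },
    /* 5: (6 8)       */ { DEAD, 3 },
    /* 6: (12)        */ { DEAD, 7 },
    /* 7: (13)        */ { DEAD, DEAD },
};

/* Pattern reported in each state; in (6 8) abb wins over a*b+
   because it comes first in the spec. */
static const Pattern accepting[8] = {
    P_NONE, P_A, P_NONE, P_ASTAR_BPLUS, P_ASTAR_BPLUS, P_ABB, P_NONE, P_ABAB
};

/* Return the length of the longest prefix of s that is a token (0 if
   there is none) and store the matching pattern in *pattern. */
size_t longest_match(const char *s, Pattern *pattern) {
    int state = 0;
    size_t last_len = 0;           /* end of the last accepted prefix */
    *pattern = P_NONE;
    for (size_t i = 0; s[i] == 'a' || s[i] == 'b'; i++) {
        state = delta[state][s[i] == 'b'];
        if (state == DEAD) break;  /* stuck: fall back to the last accept */
        if (accepting[state] != P_NONE) {
            last_len = i + 1;
            *pattern = accepting[state];
        }
    }
    return last_len;
}
```

On `"abaa"` the function returns 2 with pattern `a*b+`, and on `"abba"` it returns 3 with pattern `abb`, in line with the two examples above. A full scanner would call it again starting at the first unconsumed character.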

Page 31: Theory of Compilation

Good News

All of this construction is done automatically for you by common tools

lex is your friend: it automatically generates a lexical analyzer from a declaration file

[Figure: Declaration file → lex → Lexical Analysis; characters → Lexical Analysis → tokens]

Page 32: Theory of Compilation

Lex declarations file

%{
#include "lex.h"
Token_Type Token;
int line_number = 1;
%}
whitespace [ \t]
letter     [a-zA-Z]
digit      [0-9]
    /* assumed definition: the rules below use {identifier} without defining it */
identifier {letter}({letter}|{digit})*
%%
{digit}+     { return INTEGER; }
{identifier} { return IDENTIFIER; }
{whitespace} { /* ignore whitespace */ }
\n           { line_number++; }
.            { return ERROR; }
%%
void start_lex(void) {}
void get_next_token(void) {…}

Page 33: Theory of Compilation

Summary

- The lexical analyzer turns a character stream into a token stream
- Tokens are defined using regular expressions
- Regular expressions -> NFA -> DFA construction for identifying tokens
- Automated construction of a lexical analyzer using lex

Page 34: Theory of Compilation

Coming up next time

Syntax analysis

Page 35: Theory of Compilation

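NFA vs. DFA

(a|b)*a(a|b)(a|b)…(a|b)    (the trailing (a|b) is repeated n times)

Automaton  SPACE      TIME
NFA        O(|r|)     O(|r|*|w|)
DFA        O(2^|r|)   O(|w|)

Here r is the regular expression and w is the input word. For the expression above, a DFA has to remember the last n+1 input characters, so its number of states grows exponentially in n.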