
Compilation 0368-3133

Lecture 2: Lexical Analysis

Noam Rinetzky

Lexical Analysis

Modern Compiler Design: Chapter 2.1

Conceptual Structure of a Compiler

[Figure: Source text (txt) → Compiler → Executable code (exe). Inside the compiler, a Frontend and a Backend are connected by a Semantic Representation. Stages shown: Lexical Analysis, Syntax Analysis (Parsing), Semantic Analysis, Intermediate Representation (IR), Code Generation.]

• Lexical analysis works on the words of the program; syntax analysis works on its sentences

What does Lexical Analysis do?

• Language: fully parenthesized expressions

  Expr → Num | LP Expr Op Expr RP
  Num  → Dig | Dig Num
  Dig  → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
  LP   → '('
  RP   → ')'
  Op   → '+' | '*'

• Example input: ( ( 23 + 7 ) * 19 )

• The Expr rule is context-free; the Num, Dig, LP, RP and Op rules define regular languages
• The lexical analyzer turns the input ( ( 23 + 7 ) * 19 ) into the token sequence

  LP LP Num Op Num RP Op Num RP

• Each token carries a kind and a value

What does Lexical Analysis do?

• Partitions the input into a stream of tokens
  – Numbers
  – Identifiers
  – Keywords
  – Punctuation
• Tokens are usually represented as (kind, value) pairs
  – (Num, 23)
  – (Op, '*')
• A token is a "word" in the source language
• A token is "meaningful" to the syntactical analysis
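As a rough illustration of partitioning input into (kind, value) pairs, here is a minimal hand-written sketch in C for the fully parenthesized expression language. The names TokenKind, Token and tokenize are hypothetical and chosen only for this example; they do not come from the lecture.

```c
#include <ctype.h>

/* Token kinds of the fully parenthesized expression language. */
typedef enum { TOK_LP, TOK_RP, TOK_OP, TOK_NUM } TokenKind;

/* A token is a (kind, value) pair. */
typedef struct {
    TokenKind kind;
    int value;   /* the number for TOK_NUM, the operator character for TOK_OP, else 0 */
} Token;

/* Splits src into tokens. Returns the token count, or -1 on an illegal
   character or when more than max tokens would be needed. */
int tokenize(const char *src, Token *out, int max) {
    int n = 0;
    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;
        if (isspace(c)) { src++; continue; }      /* whitespace is not a token */
        if (n == max) return -1;
        if (isdigit(c)) {                         /* Num: one or more digits */
            int v = 0;
            while (isdigit((unsigned char)*src)) v = v * 10 + (*src++ - '0');
            out[n++] = (Token){ TOK_NUM, v };
        } else if (c == '(') {
            out[n++] = (Token){ TOK_LP, 0 }; src++;
        } else if (c == ')') {
            out[n++] = (Token){ TOK_RP, 0 }; src++;
        } else if (c == '+' || c == '*') {
            out[n++] = (Token){ TOK_OP, c }; src++;
        } else {
            return -1;                            /* no token starts with this character */
        }
    }
    return n;
}
```

On the input ( ( 23 + 7 ) * 19 ) this yields the nine tokens LP LP Num Op Num RP Op Num RP, with values 23, '+', 7, '*' and 19 attached to the Num and Op tokens.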

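From scanning to parsing

[Figure: the program text ((23 + 7) * x) enters the Lexical Analyzer, which emits the token stream LP LP Num OP Num RP OP Id RP. The Parser, using the grammar Expr → ... | Id, Id → 'a' | ... | 'z', either reports a syntax error or, for valid input, builds the Abstract Syntax Tree: Op(*) over Op(+) and Id(x), where Op(+) is over Num(23) and Num(7).]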

Why Lexical Analysis?

• Well, not strictly necessary, but …
  – Regular languages ⊆ Context-free languages
• Simplifies the syntax analysis (parsing)
  – And language definition
• Modularity
• Reusability
• Efficiency

Lecture goals

• Understand role & place of lexical analysis
• Lexical analysis theory
• Using program generating tools

Lecture Outline

✓ Role & place of lexical analysis
• What is a token?
• Regular languages
• Lexical analysis
• Error handling
• Automatic creation of lexical analyzers

What is a token? (Intuitively)

• A "word" in the source language
  – Anything that should appear in the input to syntax analysis
    • Identifiers
    • Values
    • Language keywords
• Usually, represented as a pair of (kind, value)

Example Tokens

Type      Examples
ID        foo, n_14, last
NUM       73, 00, 517, 082
REAL      66.1, .5, 5.5e-10
IF        if
COMMA     ,
NOTEQ     !=
LPAREN    (
RPAREN    )

Example Non Tokens

Type                      Examples
comment                   /* ignored */
preprocessor directive    #include <foo.h>
                          #define NUMS 5.6
macro                     NUMS
whitespace                \t, \n, \b, ' '

Some basic terminology

• Lexeme (aka symbol) - a series of letters separated from the rest of the program according to a convention (space, semicolon, comma, etc.)
• Pattern - a rule specifying a set of strings. Example: "an identifier is a string that starts with a letter and continues with letters and digits"
  – (Usually) a regular expression
• Token - a pair of (pattern, attributes)

Example

void match0(char *s) /* find a zero */
{
    if (!strncmp(s, "0.0", 3))
        return 0.0 ;
}

VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN
LBRACE
IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
RETURN REAL(0.0) SEMI
RBRACE
EOF

Example Non Tokens

• Comments, preprocessor directives, macros and whitespace are lexemes that are recognized but get consumed rather than transmitted to the parser
  – if
  – i/*comment*/f

Lecture Outline

✓ Role & place of lexical analysis
✓ What is a token?
• Regular languages
• Lexical analysis
• Error handling
• Automatic creation of lexical analyzers

How can we define tokens?

• Keywords – easy!
  – if, then, else, for, while, …
• Identifiers?
• Numerical Values?
• Strings?
• Characterize unbounded sets of values using a bounded description?

Regular languages

• Formal languages
  – Σ = finite set of letters
  – Word = sequence of letters
  – Language = set of words
• Regular languages defined equivalently by
  – Regular expressions
  – Finite-state automata

Common format for reg-exps

Basic Patterns          Matching
x                       The character x
.                       Any character, usually except a new line
[xyz]                   Any of the characters x, y, z
[^x]                    Any character except x

Repetition Operators
R?                      An R or nothing (= optionally an R)
R*                      Zero or more occurrences of R
R+                      One or more occurrences of R

Composition Operators
R1R2                    An R1 followed by R2
R1|R2                   Either an R1 or R2

Grouping
(R)                     R itself

Examples

• ab*|cd? =
• (a|b)* =
• (0|1|2|3|4|5|6|7|8|9)* =

Escape characters

• What is the expression for one or more + symbols?
  – (+)+ won't work
  – (\+)+ will
• A backslash \ before an operator turns it into a standard character
  – \*, \?, \+, a\(b\+\*, (a\(b\+\*)+, …
• Double quotes surrounding text have the same effect
  – "a(b+*", "a(b+*"+

Shorthands

• Use names for expressions
  – letter = a | b | … | z | A | B | … | Z
  – letter_ = letter | _
  – digit = 0 | 1 | 2 | … | 9
  – id = letter_ (letter_ | digit)*
• Use hyphen to denote a range
  – letter = a-z | A-Z
  – digit = 0-9
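To make the id shorthand concrete, the following C sketch checks a whole string against id = letter_ (letter_ | digit)*. The function names are hypothetical, and the range comparisons assume an ASCII-compatible character set.

```c
#include <stdbool.h>

/* letter_ = a-z | A-Z | _   (assumes an ASCII-compatible character set) */
static bool is_letter_(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* digit = 0-9 */
static bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* Returns true iff the whole string s matches id = letter_ (letter_ | digit)* */
bool is_id(const char *s) {
    if (!is_letter_(*s)) return false;          /* must start with letter_ */
    for (s++; *s != '\0'; s++)
        if (!is_letter_(*s) && !is_digit(*s)) return false;
    return true;
}
```

For instance, n_14 and _x are accepted, while a string that starts with a digit or contains a hyphen is rejected.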

Examples

• if = if
• then = then
• relop = < | > | <= | >= | = | <>
• digit = 0-9
• digits = digit+

Example

• A number is

  number = ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+
           ( ε | \. ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+
             ( ε | E ( ε | \+ | - ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+ )
           )

• Using shorthands it can be written as

  number = digits ( ε | \. digits ( ε | E ( ε | \+ | - ) digits ) )
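The nesting of the number pattern, digits ( ε | \. digits ( ε | E ( ε | + | - ) digits ) ), can be followed step by step in code. The C sketch below is a hand-written matcher, not a generated one; skip_digits and is_number are hypothetical names. Note that under this pattern an exponent is only allowed after a fractional part.

```c
#include <stddef.h>

/* Returns the length of the run of digits at the start of s. */
static size_t skip_digits(const char *s) {
    size_t n = 0;
    while (s[n] >= '0' && s[n] <= '9') n++;
    return n;
}

/* Returns 1 iff the whole string s matches the number pattern, else 0. */
int is_number(const char *s) {
    size_t n = skip_digits(s);
    if (n == 0) return 0;              /* digits = digit+ needs at least one digit */
    s += n;
    if (*s == '\0') return 1;          /* outer ε alternative: integer only */
    if (*s != '.') return 0;
    s++;
    n = skip_digits(s);
    if (n == 0) return 0;              /* the dot must be followed by digits */
    s += n;
    if (*s == '\0') return 1;          /* inner ε alternative: no exponent */
    if (*s != 'E') return 0;
    s++;
    if (*s == '+' || *s == '-') s++;   /* optional sign */
    n = skip_digits(s);
    return n > 0 && s[n] == '\0';
}
```

With this matcher 42, 3.14 and 6.02E+23 are numbers, while .5, 1. and 1E5 are not.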

Exercise 1 - Question

• Language of rational numbers in decimal representation (no leading, ending zeros)
  – 0
  – 123.757
  – .933333
  – Not 007
  – Not 0.30

Exercise 1 - Answer

• Language of rational numbers in decimal representation (no leading, ending zeros)
  – Digit = 1 | 2 | … | 9
  – Digit0 = 0 | Digit
  – Num = Digit Digit0*
  – Frac = Digit0* Digit
  – Pos = Num | \.Frac | 0\.Frac | Num\.Frac
  – PosOrNeg = (ε | -) Pos
  – R = 0 | PosOrNeg
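One way to sanity-check the answer is to transcribe the definitions Num, Frac, Pos and R into code and try the examples from the question. The C sketch below does this by hand; the function names are hypothetical and the code is only an illustration of the regular expression, not a generated scanner.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 iff the first n characters of s are all decimal digits. */
static int all_digits(const char *s, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (s[i] < '0' || s[i] > '9') return 0;
    return 1;
}

/* Num = Digit Digit0*: non-empty, all digits, no leading zero. */
static int is_num(const char *s, size_t n) {
    return n > 0 && all_digits(s, n) && s[0] != '0';
}

/* Frac = Digit0* Digit: non-empty, all digits, no trailing zero. */
static int is_frac(const char *s, size_t n) {
    return n > 0 && all_digits(s, n) && s[n - 1] != '0';
}

/* R = 0 | (ε|-) Pos, where Pos = Num | \.Frac | 0\.Frac | Num\.Frac */
int is_rational(const char *s) {
    if (strcmp(s, "0") == 0) return 1;
    if (*s == '-') s++;                               /* optional sign */
    const char *dot = strchr(s, '.');
    if (dot == NULL) return is_num(s, strlen(s));     /* Pos = Num */
    size_t int_len = (size_t)(dot - s);
    int int_ok = int_len == 0                         /* \.Frac   */
              || (int_len == 1 && s[0] == '0')        /* 0\.Frac  */
              || is_num(s, int_len);                  /* Num\.Frac */
    return int_ok && is_frac(dot + 1, strlen(dot + 1));
}
```

The examples from the question behave as expected: 0, 123.757 and .933333 are accepted; 007 and 0.30 are rejected.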

Exercise 2 - Question

• Equal number of opening and closing parenthesis: [ⁿ]ⁿ = [], [[]], [[[]]], …

Exercise 2 - Answer

• Equal number of opening and closing parenthesis: [ⁿ]ⁿ = [], [[]], [[[]]], …
• Not regular
• Context-free
• Grammar: S ::= [] | [S]
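The grammar S ::= [] | [S] translates directly into a recursive function, which is exactly the kind of unbounded nesting a finite automaton cannot track. A minimal C sketch, with hypothetical names match_S and is_balanced:

```c
#include <stddef.h>

/* Tries to match S ::= "[]" | "[" S "]" at the start of s.
   Returns a pointer just past the match, or NULL if there is no match. */
static const char *match_S(const char *s) {
    if (*s != '[') return NULL;
    s++;
    if (*s == ']') return s + 1;             /* S ::= []    */
    s = match_S(s);                           /* S ::= [ S ] */
    if (s == NULL || *s != ']') return NULL;
    return s + 1;
}

/* Returns 1 iff the whole string has the form [ⁿ]ⁿ for some n >= 1. */
int is_balanced(const char *s) {
    const char *end = match_S(s);
    return end != NULL && *end == '\0';
}
```

The recursion depth equals the nesting depth n, which is the unbounded memory that the regular-language formalism lacks.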

Challenge: Ambiguity

• If = if
• Id = Letter (Letter | Digit)*
• "if" is a valid identifier… what should it be?
• "iffy" is also a valid identifier
• Solution
  – Longest matching token
  – Break ties using order of definitions…
    • Keywords should appear before identifiers

Creating a lexical analyzer

• Given a list of token definitions (pattern name, regex), write a program such that
  – Input: String to be analyzed
  – Output: List of tokens
• How do we build an analyzer?

Building a Scanner – Take I

• Input: String
• Output: Sequence of tokens

Building a Scanner – Take I

Token nextToken()
{
    int c;                      /* int rather than char, so that EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;        /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin); return Plus;   /* push back the lookahead */
        }
    case '<': …
    case 'w': …
    }
}

There must be a better way!

A better way

• Automatically generate a scanner
• Define tokens using regular expressions
• Use finite-state automata for detection

Reg-exp vs. automata

• Regular expressions are declarative
  – Good for humans
  – Not "executable"
• Automata are operative
  – Define an algorithm for deciding whether a given word is in a regular language
  – Not a natural notation for humans

Overview

• Define tokens using regular expressions
• Construct a nondeterministic finite-state automaton (NFA) from regular expression
• Determinize the NFA into a deterministic finite-state automaton (DFA)
• DFA can be directly used to identify tokens

Automata theory: a bird's-eye view

Deterministic Automata (DFA)

• M = (Σ, Q, δ, q0, F)
  – Σ - alphabet
  – Q – finite set of states
  – q0 ∈ Q – initial state
  – F ⊆ Q – final states
  – δ: Q × Σ → Q - transition function
• For a word w, M reaches some state x
  – M accepts w if x ∈ F
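The definition maps almost one-to-one onto a table-driven program: δ becomes a two-dimensional array, F a membership table, and acceptance a loop over the word. The C sketch below uses a small made-up DFA over Σ = {a, b} that accepts words ending in ab; this automaton is an assumption for illustration and is not one of the automata drawn in the lecture.

```c
#include <stddef.h>

#define NUM_STATES  3
#define NUM_SYMBOLS 2            /* alphabet Σ = {a, b} */

/* Transition function δ: Q × Σ → Q (column 0 is a, column 1 is b). */
static const int delta[NUM_STATES][NUM_SYMBOLS] = {
    /* state 0 */ {1, 0},
    /* state 1 */ {1, 2},
    /* state 2 */ {1, 0},
};

/* F = {2} */
static const int is_final[NUM_STATES] = {0, 0, 1};

/* Runs the DFA on w starting from q0 = 0; returns 1 iff the state reached is in F.
   A character outside Σ is treated as a missing transition (reject). */
int dfa_accepts(const char *w) {
    int q = 0;
    for (size_t i = 0; w[i] != '\0'; i++) {
        if (w[i] != 'a' && w[i] != 'b') return 0;
        q = delta[q][w[i] - 'a'];
    }
    return is_final[q];
}
```

Each input letter costs one table lookup, which is why the DFA form is the one used at scanning time.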

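DFA in pictures

• An automaton is defined by states and transitions

[Figure: a DFA drawn as a graph, showing the start state (marked by an incoming "start" arrow), an accepting state, and transitions labeled with letters from {a, b, c}.]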

Accepting Words

• Words are read left-to-right
• Missing transition = non-acceptance
  – "Stuck state"

[Figure: step-by-step run of the DFA on the word cba, which is accepted.]

Rejecting Words

• Words are read left-to-right
• Missing transition means non-acceptance

[Figure: run of the DFA on the word cbb, which is rejected.]

Non-deterministic Automata (NFA)

• M = (Σ, Q, δ, q0, F)
  – Σ - alphabet
  – Q – finite set of states
  – q0 ∈ Q – initial state
  – F ⊆ Q – final states
  – δ: Q × (Σ ∪ {ε}) → 2^Q - transition function
• DFA: δ: Q × Σ → Q
• For a word w, M can reach a number of states X
  – M accepts w if X ∩ F ≠ {}
• Possible: X = {}
• Possible ε-transitions
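An NFA can be run directly by tracking the set X of states it may be in. The C sketch below represents X as a bitmask and applies the ε-closure after every step. The example automaton (for a*b, with states 0 to 2) is a made-up illustration, and all identifiers are hypothetical.

```c
#include <stdint.h>

#define NUM_STATES 3
typedef uint32_t StateSet;       /* bit i set <=> state i is in the set */

/* Example NFA for a*b:
   0 -a-> 0, 0 -a-> 1, 0 -ε-> 1, 1 -b-> 2, F = {2}. */
static const StateSet on_a[NUM_STATES]   = { 0x3, 0x0, 0x0 };
static const StateSet on_b[NUM_STATES]   = { 0x0, 0x4, 0x0 };
static const StateSet on_eps[NUM_STATES] = { 0x2, 0x0, 0x0 };
static const StateSet final_states = 0x4;

/* Adds every state reachable through ε-transitions alone. */
static StateSet eps_closure(StateSet x) {
    StateSet prev;
    do {
        prev = x;
        for (int q = 0; q < NUM_STATES; q++)
            if (x & (1u << q)) x |= on_eps[q];
    } while (x != prev);
    return x;
}

/* Tracks the set X of states reachable on w; accepts iff X ∩ F is non-empty. */
int nfa_accepts(const char *w) {
    StateSet x = eps_closure(1u << 0);          /* start from q0 = 0 */
    for (; *w != '\0'; w++) {
        StateSet next = 0;
        for (int q = 0; q < NUM_STATES; q++) {
            if (!(x & (1u << q))) continue;
            if (*w == 'a') next |= on_a[q];
            else if (*w == 'b') next |= on_b[q];
        }
        x = eps_closure(next);                  /* may become empty: X = {} */
    }
    return (x & final_states) != 0;
}
```

Once X becomes empty it stays empty, which corresponds to the "Possible: X = {}" case above.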

NFA

• Allow multiple transitions from given state labeled by same letter

Accepting words

• Maintain set of states
• Accept word if reached an accepting state

[Figure: step-by-step run of an NFA over {a, b, c} on the word cba, tracking the set of current states.]

NFA+ε automata

• ε transitions can "fire" without reading the input

NFA+ε run example

• An ε transition can take place non-deterministically
• Word accepted

[Figure: step-by-step run of an NFA with an ε transition on the word cba, ending in an accepting state.]

From regular expressions to NFA

• Step 1: assign expression names and obtain pure regular expressions R1 … Rm
• Step 2: construct an NFA Mi for each regular expression Ri
• Step 3: combine all Mi into a single NFA
• Ambiguity resolution: prefer longest accepting word

From reg. exp. to automata

• Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression
• Proof: by induction on the structure of the regular expression

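Basic constructs

• R = ε
• R = ∅
• R = a

[Figure: the NFA for each basic construct.]

Composition

• R = R1 | R2

[Figure: M1 and M2 placed in parallel and connected to the surrounding states by ε transitions.]

• R = R1R2

[Figure: M1 followed by M2, connected by ε transitions.]

Repetition

• R = R1*

[Figure: M1 wrapped with ε transitions.]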

Naïve approach

• Try each automaton separately
• Given a word w:
  – Try M1(w)
  – Try M2(w)
  – …
  – Try Mn(w)
• Requires resetting after every attempt

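Actually, we combine automata

• Combine the automata for a, abb, a*b+ and abab into a single NFA:
  – New start state 0 with ε transitions to states 1, 3, 7 and 9
  – a: 1 -a-> 2, accepting in state 2
  – abb: 3 -a-> 4 -b-> 5 -b-> 6, accepting in state 6
  – a*b+: 7 -a-> 7, 7 -b-> 8, 8 -b-> 8, accepting in state 8
  – abab: 9 -a-> 10 -b-> 11 -a-> 12 -b-> 13, accepting in state 13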

Corresponding DFA

• Each DFA state is a set of NFA states:
  – {0,1,3,7,9}: start state; a → {2,4,7,10}, b → {8}
  – {2,4,7,10}: accepts a; a → {7}, b → {5,8,11}
  – {7}: a → {7}, b → {8}
  – {8}: accepts a*b+; b → {8}
  – {5,8,11}: accepts a*b+; a → {12}, b → {6,8}
  – {6,8}: accepts abb (it also matches a*b+); b → {8}
  – {12}: b → {13}
  – {13}: accepts abab
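The DFA for the patterns a, abb, a*b+ and abab can be derived mechanically by the subset construction. The C sketch below encodes the combined NFA (state 0 with ε-moves to 1, 3, 7 and 9; 1 -a-> 2; 3 -a-> 4 -b-> 5 -b-> 6; 7 -a-> 7, 7 -b-> 8, 8 -b-> 8; 9 -a-> 10 -b-> 11 -a-> 12 -b-> 13) as bitmask tables and explores the reachable state sets with a worklist. The identifiers and the MAX_DFA bound are assumptions made for this example.

```c
#include <stdint.h>

#define NFA_STATES 14
#define MAX_DFA    32
typedef uint32_t StateSet;            /* bit i set <=> NFA state i is in the set */
#define BIT(i) ((StateSet)1 << (i))

/* Combined NFA for a, abb, a*b+, abab. */
static const StateSet nfa_eps[NFA_STATES] = {
    [0] = BIT(1) | BIT(3) | BIT(7) | BIT(9) };
static const StateSet nfa_a[NFA_STATES] = {
    [1] = BIT(2), [3] = BIT(4), [7] = BIT(7), [9] = BIT(10), [11] = BIT(12) };
static const StateSet nfa_b[NFA_STATES] = {
    [4] = BIT(5), [5] = BIT(6), [7] = BIT(8), [8] = BIT(8),
    [10] = BIT(11), [12] = BIT(13) };

/* States reachable from x using ε-moves only (x itself included). */
static StateSet closure(StateSet x) {
    StateSet prev;
    do {
        prev = x;
        for (int q = 0; q < NFA_STATES; q++)
            if (x & BIT(q)) x |= nfa_eps[q];
    } while (x != prev);
    return x;
}

/* States reachable from x by reading one letter, given that letter's table. */
static StateSet move(StateSet x, const StateSet *table) {
    StateSet out = 0;
    for (int q = 0; q < NFA_STATES; q++)
        if (x & BIT(q)) out |= table[q];
    return out;
}

/* Subset construction. Fills sets[i] with the NFA states of DFA state i and
   edges[i][0], edges[i][1] with the targets on a and b (-1 if there is none).
   Returns the number of DFA states, or -1 if MAX_DFA is too small. */
int determinize(StateSet sets[MAX_DFA], int edges[MAX_DFA][2]) {
    const StateSet *tables[2] = { nfa_a, nfa_b };
    int count = 0;
    sets[count++] = closure(BIT(0));
    for (int i = 0; i < count; i++) {          /* sets[] doubles as the worklist */
        for (int c = 0; c < 2; c++) {
            StateSet t = closure(move(sets[i], tables[c]));
            int j = -1;
            if (t != 0) {
                for (j = 0; j < count && sets[j] != t; j++) { }
                if (j == count) {               /* a state set not seen before */
                    if (count == MAX_DFA) return -1;
                    sets[count++] = t;
                }
            }
            edges[i][c] = j;
        }
    }
    return count;
}
```

Only state sets that are actually reachable are created, which is why the construction tends to stay small in practice even though the number of subsets is exponential in principle.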

Scanning with DFA

• Run until stuck
  – Remember last accepting state
• Go back to accepting state
• Return token

Ambiguity resolution

• Longest word
• Tie-breaker based on order of rules when words have same length

Examples

• abaa: gets stuck after aba in state 12, backs up to state (5 8 11); pattern is a*b+, token is ab
• Tokens: <a*b+, ab> <a, a> <a, a>

• abba: stops after second b in (6 8); token is abb because it comes first in spec
• Tokens: <abb, abb> <a, a>
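The two examples can be replayed with a small table-driven scanner. The C sketch below hard-codes the DFA for a, abb, a*b+ and abab, numbering its states 0 to 7 in the order {0,1,3,7,9}, {2,4,7,10}, {7}, {8}, {5,8,11}, {6,8}, {12}, {13}. That numbering, and the names edges, accept_rule and next_token, are assumptions made for this illustration.

```c
#include <stddef.h>

#define DEAD (-1)

/* edges[state][0] is the target on a, edges[state][1] the target on b. */
static const int edges[8][2] = {
    /* 0: {0,1,3,7,9}  */ { 1,    3    },
    /* 1: {2,4,7,10}   */ { 2,    4    },
    /* 2: {7}          */ { 2,    3    },
    /* 3: {8}          */ { DEAD, 3    },
    /* 4: {5,8,11}     */ { 6,    5    },
    /* 5: {6,8}        */ { DEAD, 3    },
    /* 6: {12}         */ { DEAD, 7    },
    /* 7: {13}         */ { DEAD, DEAD },
};

/* Pattern reported by each accepting state; ties are already resolved by
   rule order (state 5 reports abb rather than a*b+). NULL = not accepting. */
static const char *const accept_rule[8] = {
    NULL, "a", NULL, "a*b+", "a*b+", "abb", NULL, "abab"
};

/* Scans one token at the start of input. Stores the matched pattern in *rule
   and returns the lexeme length, or 0 (with *rule == NULL) if nothing matches. */
size_t next_token(const char *input, const char **rule) {
    int state = 0;
    size_t pos = 0, last_len = 0;
    *rule = NULL;
    while (input[pos] == 'a' || input[pos] == 'b') {
        state = edges[state][input[pos] - 'a'];
        if (state == DEAD) break;               /* stuck */
        pos++;
        if (accept_rule[state] != NULL) {       /* remember last accepting state */
            *rule = accept_rule[state];
            last_len = pos;
        }
    }
    return last_len;                            /* back up to the last accepting position */
}
```

Calling next_token repeatedly, each time advancing the input by the returned length, reproduces the token sequences listed for abaa and abba.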

Summary of Construction

• Describe tokens as regular expressions
  – Decide attributes (values) to save for each token
• Regular expressions turned into a DFA
  – Also, records which attributes (values) to keep
• Lexical analyzer simulates the run of an automaton with the given transition table on any input string

A Few Remarks

• Turning an NFA to a DFA is expensive, but
  – Exponential in the worst case
  – In practice, works fine
• The construction is done once per-language
  – At compiler construction time
  – Not at compilation time

Implementation

Implementation by Example

if                                  { return IF; }
[a-z][a-z0-9]*                      { return ID; }
[0-9]+                              { return NUM; }
[0-9]"."[0-9]+|[0-9]*"."[0-9]+      { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)          { ; }
.                                   { error(); }

Sample inputs: if matches IF; xy, i, zs98 match ID; 3, 32, 032 match NUM; 0.55, 33.1 match REAL; --comm\n, \n, \t and " " are whitespace.

[Figure: the DFA for these rules, with accepting states labeled IF, ID, NUM, REAL, error and w.s. (whitespace).]

int edges[][256] = { /* …, 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
/* state 0 */  {0, …, 0, 0, …, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0},
/* state 1 */  {13, …, 7, 7, 7, 7, …, 9, 4, 4, 4, 4, 2, 4, …, 13, 13},
/* state 2 */  {0, …, 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, …, 0, 0},
/* state 3 */  {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, …, 0, 0},
/* state 4 */  {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, …, 0, 0},
/* state 5 */  {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0},
/* state 6 */  {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
/* state 7 */
/* state … */  ...
/* state 13 */ {0, …, 0, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0}
};

Pseudo Code for Scanner

char* input = … ;

Token nextToken() {
    lastFinal = 0;
    currentState = 1;
    inputPositionAtLastFinal = input;
    currentPosition = input;
    while (not(isDead(currentState))) {
        nextState = edges[currentState][*currentPosition];
        if (isFinal(nextState)) {
            lastFinal = nextState;
            inputPositionAtLastFinal = currentPosition;
        }
        currentState = nextState;
        advance currentPosition;
    }
    input = inputPositionAtLastFinal + 1;
    return action[lastFinal];
}

Example

Input: "if  --not-a-com" (two blanks after "if")

final   state   input
0       1       if  --not-a-com
2       2       if  --not-a-com
3       3       if  --not-a-com
3       0       if  --not-a-com      return IF

final   state   input
0       1         --not-a-com
12      12        --not-a-com
12      12        --not-a-com
12      0         --not-a-com        found whitespace

final   state   input
0       1       --not-a-com
9       9       --not-a-com
9       10      --not-a-com
10      10      --not-a-com
10      10      --not-a-com
10      0       --not-a-com          error

final   state   input
0       1       -not-a-com
9       9       -not-a-com
9       0       -not-a-com
9       0       -not-a-com
9       0       -not-a-com           error

Concluding remarks

• Efficient scanner
• Minimization
• Error handling
• Automatic creation of lexical analyzers

Efficient Scanners

• Efficient state representation
• Input buffering
• Using switch and gotos instead of tables

Minimization

• Create a non-deterministic automaton (NDFA) from every regular expression
• Merge all the automata using epsilon moves (like the | construction)
• Construct a deterministic finite automaton (DFA)
  – State priority
• Minimize the automaton
  – Separate accepting states by token kinds

Example

if                { return IF; }
[a-z][a-z0-9]*    { return ID; }
[0-9]+            { return NUM; }

[Figure: the automata constructed for these rules, with accepting states labeled IF, ID, NUM and error.]

Modern compiler implementation in ML, Andrew Appel, (c) 1998, Figures 2.7, 2.8

Error Handling

• Many errors cannot be identified at this stage
• Example: "fi(a==f(x))". Should "fi" be "if"? Or is it a routine name?
  – We will discover this later in the analysis
  – At this point, we just create an identifier token
• Sometimes the lexeme does not match any pattern
  – Easiest: eliminate letters until the beginning of a legitimate lexeme
  – Alternatives: eliminate/add/replace one letter, replace order of two adjacent letters, etc.
• Goal: allow the compilation to continue
• Problem: errors that spread all over
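The "easiest" recovery strategy, dropping characters until something that can begin a legitimate lexeme, might look like the C sketch below. The predicate can_start_lexeme is hypothetical: which characters may start a lexeme depends on the language being scanned, and the set used here is chosen only for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical predicate: can c begin a lexeme? The character set below is
   an assumption made for this example, not taken from a real language. */
static int can_start_lexeme(unsigned char c) {
    return isalnum(c) || c == '_' || c == '(' || c == ')' || c == ';' || c == '+';
}

/* Returns how many characters to discard before scanning can resume.
   Stops at whitespace, at the end of input, or at a possible lexeme start. */
size_t skip_to_lexeme_start(const char *input) {
    size_t n = 0;
    while (input[n] != '\0' &&
           !isspace((unsigned char)input[n]) &&
           !can_start_lexeme((unsigned char)input[n]))
        n++;
    return n;
}
```

A scanner using this approach would report one error for the discarded characters and then continue from the returned offset, so that the rest of the compilation can proceed.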

Automatically generated scanners

• Use of Program-Generating Tools
  – Specification → Part of compiler
  – Compiler-Compiler

Use of Program-Generating Tools

• Input: regular expressions and actions
  – Action = Java code
• Output: a scanner program that
  – Produces a stream of tokens
  – Invokes actions when pattern is matched

[Figure: JFlex takes regular expressions and generates a scanner; the scanner reads the input program and produces a stream of tokens.]

Line Counting Example

• Create a program that counts the number of lines in a given input text file

Creating a Scanner using Flex

%{
int num_lines = 0;
%}
%option noyywrap
%%
\n      ++num_lines;
.       ;
%%
int main(void) {
    yylex();
    printf("# of lines = %d\n", num_lines);
    return 0;
}

[Figure: the automaton for the Flex line-counting specification, with states initial, newline and other; reading \n leads to newline, and any other character ([^\n]) leads to other.]

JFlex Spec File

User code: copied directly to the Java file
  – Possible source of javac errors down the road
%%
JFlex directives: macros, state names
  – DIGIT = [0-9]
  – LETTER = [a-zA-Z]
  – YYINITIAL
%%
Lexical analysis rules:
  – Optional state, regular expression, action
  – How to break input to tokens
  – Action when token matched
  – {LETTER}({LETTER}|{DIGIT})*

Creating a Scanner using JFlex

import java_cup.runtime.*;
%%
%cup
%{
  private int lineCounter = 0;
%}

%eofval{
  System.out.println("line number=" + lineCounter);
  return new Symbol(sym.EOF);
%eofval}

NEWLINE=\n
%%
{NEWLINE}       { lineCounter++; }
[^{NEWLINE}]    { }

Catching errors

• What if input doesn't match any token definition?
• Trick: Add a "catch-all" rule that matches any character and reports an error
  – Add after all other rules

A JFlex specification of C Scanner

import java_cup.runtime.*;
%%
%cup
%{
  private int lineCounter = 0;
%}
Letter= [a-zA-Z_]
Digit= [0-9]
%%
"\t"      { }
"\n"      { lineCounter++; }
";"       { return new Symbol(sym.SemiColumn); }
"++"      { return new Symbol(sym.PlusPlus); }
"+="      { return new Symbol(sym.PlusEq); }
"+"       { return new Symbol(sym.Plus); }
"while"   { return new Symbol(sym.While); }
{Letter}({Letter}|{Digit})*
          { return new Symbol(sym.Id, yytext() ); }
"<="      { return new Symbol(sym.LessOrEqual); }
"<"       { return new Symbol(sym.LessThan); }

Missing

• Creating a lexical analysis by hand
• Table compression
• Symbol Tables
• Nested Comments
• Handling Macros

Lexical Analysis: What

• Input: program text (file)
• Output: sequence of tokens

Lexical Analysis: How

• Define tokens using regular expressions
• Construct a nondeterministic finite-state automaton (NFA) from regular expression
• Determinize the NFA into a deterministic finite-state automaton (DFA)
• DFA can be directly used to identify tokens

Lexical Analysis: Why

• Read input file
• Identify language keywords and standard identifiers
• Handle include files and macros
• Count line numbers
• Remove whitespaces
• Report illegal symbols
• [Produce symbol table]

The Real Anatomy of a Compiler

[Figure: Source text (txt) → Process text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → Sem. Analysis → Annotated AST → Intermediate code generation → IR → Intermediate code optimization → IR → Code generation → Symbolic Instructions (SI) → Target code optimization → SI → Machine code generation → MI → Write executable output → Executable code (exe). The Lexical Analysis and Syntax Analysis stages are highlighted.]
