Code Generator Translator Architecture Parser Tokenizer string of characters (source code) string of tokens abstract program string of integers (object code)


Apr 01, 2015

Transcript
Page 1

Translator Architecture

string of characters (source code) -> Tokenizer -> string of tokens -> Parser -> abstract program -> Code Generator -> string of integers (object code)

Page 2

A Tokenizing Machine

Another great machine!

[Diagram: a machine with an input labeled In (Chars), an output labeled Out (Tokens), and a light labeled "Ready to Dispense?"]

Page 3

Tokenizing Machine Continued…

[Diagram: the same machine, with In, Out, and the "Ready to Dispense?" light annotated as follows]

1: Characters go in here
2: Light comes on
3: Tokens come out here

Page 4

Tokenizing Machine Continued…

Type

BL_Tokenizing_Machine_Kernel is modeled by (
        buffer: string of character
        ready_to_dispense: boolean
    )
    constraint …

Initial Value

(empty_string, false)

Page 5

Tokenizing Machine Continued…

Operations

m.Insert (ch)
m.Dispense (token_text, token_kind)
m.Is_Ready_To_Dispense ()
m.Flush_A_Token (token_text, token_kind)
m.Size ()

Page 6

A State-Transition View

[Diagram: two states, "not ready to dispense" and "ready to dispense", with transitions labeled Insert (twice), Flush_A_Token, Dispense, Size (twice), and Is_Ready_To_Dispense (twice)]

Page 7

Tokenizing BL Programs

Token Types

KEYWORD
IDENTIFIER
CONDITION
WHITE_SPACE
COMMENT
ERROR

Page 8

A Very Useful Extension

procedure_body Get_Next_Token (
        alters Character_IStream& str,
        produces Text& token_text,
        produces Integer& token_kind
    )
{
    while (not self.Is_Ready_To_Dispense () and not str.At_EOS ())
    {
        object Character ch;
        str.Read (ch);
        self.Insert (ch);
    }
    if (self.Is_Ready_To_Dispense ())
    {
        self.Dispense (token_text, token_kind);
    }
    else
    {
        self.Flush_A_Token (token_text, token_kind);
    }
}
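For readers who want to try the same control flow outside Resolve/C++, here is a minimal Python sketch. The function name `get_next_token`, the snake_case method names on `machine`, and the convention that `stream.read(1)` returns an empty string at end of input are all assumptions of this sketch; results come back as a tuple rather than through `produces` parameters.

```python
def get_next_token(machine, stream):
    """Return (token_text, token_kind) for the next token read from stream.

    `machine` is any object offering is_ready_to_dispense(), insert(ch),
    dispense() and flush_a_token(). These names are assumed stand-ins for
    the kernel operations, not the actual component's interface.
    """
    # Feed characters until a token is complete or the input runs out.
    while not machine.is_ready_to_dispense():
        ch = stream.read(1)  # assumed: "" signals end of stream
        if ch == "":
            break
        machine.insert(ch)
    if machine.is_ready_to_dispense():
        return machine.dispense()
    # End of stream with an unfinished token: take whatever is buffered.
    return machine.flush_a_token()
```

Any text stream with a `read(1)` method, such as `io.StringIO`, can play the role of `str` here.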

Page 9

Another Useful Extension

procedure_body Get_Next_Non_Separator_Token (
        alters Character_IStream& str,
        produces Text& token_text,
        produces Integer& token_kind
    )
{
    self.Get_Next_Token (str, token_text, token_kind);
    while ((token_kind == WHITE_SPACE) or (token_kind == COMMENT))
    {
        self.Get_Next_Token (str, token_text, token_kind);
    }
}
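The separator-skipping loop can be sketched in Python as follows. This is an illustration only: the token kinds are represented as strings (the slides use Integer constants), and `get_next_token` is assumed to be a zero-argument callable that returns a `(token_text, token_kind)` pair each time it is called.

```python
# Assumed stand-ins for the WHITE_SPACE and COMMENT token kinds.
WHITE_SPACE = "WHITE_SPACE"
COMMENT = "COMMENT"

def get_next_non_separator_token(get_next_token):
    """Call get_next_token() until it yields a token that is not a separator."""
    token_text, token_kind = get_next_token()
    # Discard white space and comments; stop at the first other kind.
    while token_kind in (WHITE_SPACE, COMMENT):
        token_text, token_kind = get_next_token()
    return token_text, token_kind
```

Passing the token source in as a callable keeps the sketch independent of any particular machine or stream.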

Page 10

How Does Insert Work?

Here’s another character: ‘ ’

#m:  buffer: “PROGRAM”, ready_to_dispense: false
m:   buffer: “PROGRAM ”, ready_to_dispense: true

Page 11

The Specification of Insert

procedure Insert (
        preserves Character ch
    ) is_abstract;
/*!
    requires
        self.ready_to_dispense = false
    ensures
        self.buffer = #self.buffer * <ch>  and
        self.ready_to_dispense =
            IS_COMPLETE_TOKEN_TEXT (#self.buffer, ch)
!*/

Page 12

An Important Math Operation

math definition IS_COMPLETE_TOKEN_TEXT (
        s: string of character,
        c: character
    ): boolean is
    ( s is in OK_STRINGS  and
      s * <c> is not in OK_STRINGS )  or
    ( <c> is in PREFIX (OK_STRINGS)  and
      s * <c> is not in PREFIX (OK_STRINGS) )

First clause: s is a complete “valid” token.
Second clause: c can start a “valid” token, but s * <c> doesn’t start a “valid” token.
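To experiment with this definition, the following Python sketch may help. It rests on a loud assumption: OK_STRINGS is approximated by a regular expression covering identifier-shaped names (which is assumed to include keywords and condition names), runs of white space, and ‘#’ comments that stop before a newline. The slides do not show the actual definitions of IS_KEYWORD, IS_IDENTIFIER and the other predicates, so the helper names `is_ok_string` and `is_ok_prefix` are hypothetical.

```python
import re

# Assumed approximation of OK_STRINGS: identifier-like names,
# white space runs, and '#' comments that do not include '\n'.
_OK = re.compile(r"[A-Za-z][A-Za-z0-9-]*|[ \n\t]+|#[^\n]*")

def is_ok_string(s):
    """s is in OK_STRINGS (under the approximation above)."""
    return _OK.fullmatch(s) is not None

def is_ok_prefix(s):
    """s is in PREFIX(OK_STRINGS).

    For these token shapes every non-empty prefix of an OK string is
    itself an OK string, so only the empty string needs adding.
    """
    return s == "" or is_ok_string(s)

def is_complete_token_text(s, c):
    """Direct transcription of the two clauses of the math definition."""
    return (is_ok_string(s) and not is_ok_string(s + c)) or (
        is_ok_prefix(c) and not is_ok_prefix(s + c)
    )
```

With these assumptions, `is_complete_token_text("PROGRAM", " ")` holds by the first clause, while `is_complete_token_text("$", "a")` holds by the second: "$" is not a valid token, but ‘a’ can start one and "$a" cannot.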

Page 13

Other Math Definitions

OK_STRINGS =
    {s: string of character (IS_KEYWORD (s))} union
    {s: string of character (IS_IDENTIFIER (s))} union
    {s: string of character (IS_CONDITION_NAME (s))} union
    {s: string of character (IS_WHITE_SPACE (s))} union
    {s: string of character (IS_COMMENT (s))}

PREFIX (s_set) =
    {x: string of character
        (there exists y: string of character
            (x * y is in s_set))}

Page 14

PREFIX Examples

s_set = {“abc”}
PREFIX (s_set) = {“”, “a”, “ab”, “abc”}

s_set = {“abc”, “de”}
PREFIX (s_set) = {“”, “a”, “ab”, “abc”, “d”, “de”}
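For a finite set of strings, PREFIX can be computed directly. The short Python sketch below does this; the function name `prefix` is chosen here for illustration, and the approach does not extend to infinite sets such as OK_STRINGS.

```python
def prefix(s_set):
    """Return every x such that x * y is in s_set for some string y."""
    # s[:i] for i = 0..len(s) enumerates all prefixes of s,
    # including the empty string and s itself.
    return {s[:i] for s in s_set for i in range(len(s) + 1)}
```

Calling `prefix({"abc", "de"})` yields the six strings listed in the second example.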

Page 15

Tokenizing Machine: Implementation

Obvious Representation

Text buffer_rep
Boolean token_ready

Insert (ch)?

check if IS_COMPLETE_TOKEN_TEXT (self[buffer_rep], ch), and set self[token_ready] accordingly
append ch at end of self[buffer_rep]

Page 16

Tokenizing Machine: Implementation Continued…

Dispense (token_text, token_kind)?

set token_text to all but the last character of self[buffer_rep]
set token_kind to the value of WHICH_KIND (token_text)
set self[token_ready] to false
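The Insert and Dispense steps for the obvious representation might look like this in Python. This is a sketch under stated assumptions, not the actual component: the completeness test and WHICH_KIND are passed in as callables because the slides have not yet said how to compute them, and leaving the last character in the buffer after Dispense is an assumption (the slides only say that token_text gets all but the last character).

```python
class TokenizingMachine:
    """Sketch of the (buffer_rep, token_ready) representation."""

    def __init__(self, is_complete_token_text, which_kind):
        self.buffer_rep = ""
        self.token_ready = False
        # Hypothetical hooks standing in for the math definitions.
        self._is_complete = is_complete_token_text
        self._which_kind = which_kind

    def insert(self, ch):
        if self.token_ready:
            raise ValueError("Insert requires ready_to_dispense = false")
        # Check completeness against the old buffer, then append ch.
        self.token_ready = self._is_complete(self.buffer_rep, ch)
        self.buffer_rep += ch

    def dispense(self):
        if not self.token_ready:
            raise ValueError("Dispense called with no complete token")
        token_text = self.buffer_rep[:-1]
        # Assumption: the last character starts the next token, so keep it.
        self.buffer_rep = self.buffer_rep[-1]
        self.token_ready = False
        return token_text, self._which_kind(token_text)
```

Note that `insert` evaluates the predicate on every call, which is exactly the cost the following slides set out to reduce.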

Page 17

Tokenizing Machine: Implementation Continued…

How do we “check if IS_COMPLETE_TOKEN_TEXT (self[buffer_rep], ch)”?

How do we determine “WHICH_KIND (token_text)”?

How do we do these things quickly?

Page 18

Making Decisions Quickly

Keep track of the “state” of the buffer by adding one field to the representation:

Text buffer_rep
Boolean token_ready
Integer buffer_state

Page 19

Possible Buffer States

How many interestingly different buffer states do you think there may be?

Let’s start enumerating them…

Page 20

Buffer States Continued…

Initial state (empty buffer)

How many states after inserting the first character?

‘B’, ‘D’, ‘E’, ‘I’, ‘P’, ‘T’, ‘W’, ‘n’, ‘r’, ‘t’
identifier (any other letter)
white_space (‘ ’, ‘\n’, ‘\t’)
comment (‘#’)
error (any other character)

Page 21

Buffer States Continued…

How many states after inserting the second character?

“BE”, “DO”, “EL”, “EN”, “IF”, “IN”, “IS”, “PR”, “TH”, “WH”, “ne”, “ra”, “tr”
identifier (any other id character)
white_space (‘ ’, ‘\n’, ‘\t’)
comment (any other character but ‘\n’)
error (any character that cannot start a new “good” token)

Page 22

A State Transition Diagram: Transitions Out of ‘empty’ Only

[Diagram: transitions from the state empty, listed as input -> next state]

‘B’ -> B
‘D’ -> D
‘E’ -> E
‘I’ -> I
‘P’ -> P
‘T’ -> T
‘W’ -> W
‘n’ -> n
‘r’ -> r
‘t’ -> t
any other letter -> identifier
‘ ’, ‘\n’, ‘\t’ -> white_space
‘#’ -> comment
any other character -> error

Page 23

Structure of Body of Insert

case_select (self[buffer_state])
{
    case empty:
        // case for buffer = empty_string
    case B:
        // case for buffer = “B”
    case D:
        // case for buffer = “D”
    case E:
        // case for buffer = “E”
    . . .
    case error:
        // case for buffer holding an error token
}

Page 24

A Simplified View

Buffer States

EMPTY_BS
ID_OR_KEYWORD_OR_CONDITION_BS
WHITE_SPACE_BS
COMMENT_BS
ERROR_BS

Page 25

The State Transition Diagram

Transitions out of EMPTY_BS:

‘a’..‘z’, ‘A’..‘Z’ -> ID_OR_KEYWORD_OR_CONDITION_BS
‘ ’, ‘\n’, ‘\t’ -> WHITE_SPACE_BS
‘#’ -> COMMENT_BS
any other character -> ERROR_BS

Self-loops:

ID_OR_KEYWORD_OR_CONDITION_BS: ‘a’..‘z’, ‘A’..‘Z’, ‘0’..‘9’, ‘-’
WHITE_SPACE_BS: ‘ ’, ‘\n’, ‘\t’
COMMENT_BS: any character except ‘\n’
ERROR_BS: any character except ‘a’..‘z’, ‘A’..‘Z’, ‘ ’, ‘\n’, ‘\t’, ‘#’
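The simplified diagram can be read as a next-state function. The Python sketch below is one possible reading, not the deck's implementation: the names `BufferState`, `state_after_first_char` and `next_state` are invented for illustration, and returning `None` to mean "this character ends the current token" is an assumption, since the diagram itself shows only the transitions that keep a token growing.

```python
import string
from enum import Enum, auto

class BufferState(Enum):
    EMPTY_BS = auto()
    ID_OR_KEYWORD_OR_CONDITION_BS = auto()
    WHITE_SPACE_BS = auto()
    COMMENT_BS = auto()
    ERROR_BS = auto()

WHITE_SPACE_CHARS = " \n\t"
ID_CHARS = string.ascii_letters + string.digits + "-"

def state_after_first_char(ch):
    """State reached from EMPTY_BS when the single character ch is inserted."""
    if ch in string.ascii_letters:
        return BufferState.ID_OR_KEYWORD_OR_CONDITION_BS
    if ch in WHITE_SPACE_CHARS:
        return BufferState.WHITE_SPACE_BS
    if ch == "#":
        return BufferState.COMMENT_BS
    return BufferState.ERROR_BS

def next_state(state, ch):
    """Return the state after inserting ch, or None if ch ends the token."""
    if state is BufferState.EMPTY_BS:
        return state_after_first_char(ch)
    if state is BufferState.ID_OR_KEYWORD_OR_CONDITION_BS:
        extends = ch in ID_CHARS
    elif state is BufferState.WHITE_SPACE_BS:
        extends = ch in WHITE_SPACE_CHARS
    elif state is BufferState.COMMENT_BS:
        extends = ch != "\n"
    else:
        # ERROR_BS: stay only while ch cannot start a new token.
        extends = state_after_first_char(ch) is BufferState.ERROR_BS
    return state if extends else None
```

Each call does a constant number of character-class checks, which is the point of tracking buffer_state instead of re-examining the whole buffer.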

Page 26

Useful Private Functions

Is_White_Space_Character (ch)
Is_Digit_Character (ch)
Is_Alphabetic_Character (ch)
Is_Identifier_Character (ch)
Can_Start_Token (ch)
Id_Or_Keyword_Or_Condition (t)
Buffer_Type (ch)
Token_Kind (bs, str)