
Compilation Encapsulation

Feb 23, 2016





Compilation Encapsulation
Or: Why Every Component Should Just Do Its Damn Job

"When a negative int literal (e.g. -5) appears in the code, should it be a single integer token whose value is -5, or two tokens: a minus and an integer whose value is 5?"

Well, in theory...

We can write a lexer (maybe not with flex) with lookbehind, one that makes sure the last token was neither a number nor a variable. (Or a function call, or a field reference. A pretty complicated lexer.)
But just because we can do it, does that make it a good idea?

But what if we change the syntax?
Professor Moriarty wants IC to be more like Matlab. He asks you to support scalar operations on arrays of scalars:
array - scalar = [a1 - scalar, ..., an - scalar]
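The deck does not show such a lexer, so here is a hypothetical sketch of what it would take: a hand-written tokenizer (token names like MINUS and INT are invented for illustration) that keeps one token of lookbehind to decide whether "-5" is one token or two. Notice how the "could the previous token end an expression?" test already has to know about identifiers, numbers, closing parentheses and closing brackets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a lexer with one token of lookbehind that folds
// "-5" into a single negative literal only when the previous token could
// not have ended an expression.
public class LookbehindLexer {
    public static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        String prev = null;                       // previously emitted lexeme
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '-') {
                // '-' is a binary minus if the last token could end an
                // expression: a number, an identifier, ')' or ']'.
                boolean binary = prev != null
                        && (Character.isLetterOrDigit(prev.charAt(prev.length() - 1))
                            || prev.endsWith(")") || prev.endsWith("]"));
                if (!binary && i + 1 < src.length() && Character.isDigit(src.charAt(i + 1))) {
                    int j = i + 1;
                    while (j < src.length() && Character.isDigit(src.charAt(j))) j++;
                    prev = src.substring(i, j);   // one token: "-5"
                    tokens.add("INT(" + prev + ")");
                    i = j;
                    continue;
                }
                prev = "-";
                tokens.add("MINUS");
                i++;
            } else if (Character.isDigit(c)) {
                int j = i;
                while (j < src.length() && Character.isDigit(src.charAt(j))) j++;
                prev = src.substring(i, j);
                tokens.add("INT(" + prev + ")");
                i = j;
            } else if (Character.isLetter(c)) {
                int j = i;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                prev = src.substring(i, j);
                tokens.add("ID(" + prev + ")");
                i = j;
            } else {                              // brackets, operators: single chars
                prev = String.valueOf(c);
                tokens.add(prev);
                i++;
            }
        }
        return tokens;
    }
}
```

With this sketch, "x = -5" lexes to ID(x), =, INT(-5) while "x - 5" lexes to ID(x), MINUS, INT(5). The moment the grammar changes, this hard-coded list of "expression-ending" tokens is exactly what breaks.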

And suddenly new int[n] - 6 is valid.

Generic compiler structure
[Diagram: source text (txt) -> compiler frontend (analysis) -> semantic representation -> backend (synthesis) -> executable code (exe)]

The IC compiler
[Diagram: IC program (ic) -> lexical analysis -> syntax analysis (parsing) -> AST, symbol table, etc. -> intermediate representation (IR) -> code generation -> x86 executable (exe)]

Encapsulation: what does it mean?
It means each component needs to do its job, without regard for what the other components are doing.
The tokenizer only cares about dividing the stream into tokens:
- invalid characters
- keywords
- strings and comments
The parser only cares about building a structure out of tokens:
- assumes a valid stream of tokens
- structural rules with no meaning
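For contrast, here is a hypothetical sketch of the encapsulated alternative (same invented token names as above, not the course's flex spec): the lexer never looks behind. '-' is always MINUS and a digit run is always a non-negative INT; deciding what the minus means is the parser's job. Note that new int[n] - 6 needs no lexer change at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the encapsulated lexer. No lookbehind, no special
// cases. '-' is always MINUS; a digit run is always a non-negative INT.
// Interpreting the minus sign is left to the parser.
public class SimpleLexer {
    public static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '-') { tokens.add("MINUS"); i++; }
            else if (Character.isDigit(c)) {
                int j = i;
                while (j < src.length() && Character.isDigit(src.charAt(j))) j++;
                tokens.add("INT(" + src.substring(i, j) + ")");
                i = j;
            } else if (Character.isLetter(c)) {
                int j = i;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                tokens.add("ID(" + src.substring(i, j) + ")");
                i = j;
            } else { tokens.add(String.valueOf(c)); i++; }
        }
        return tokens;
    }
}
```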

The semantic checker is free to only worry about semantics:
- assumes a valid AST
- actually worries about meaning

Fake Exam Question #1
Professor Xavier wants IC to be more like Python. He asks you to support array and string multiplication.

"abc"*3 abcabcabc(new MyClass[5] * 2).length 10But supposeSuppose you decided to define your strings like so:

\" { //move to state to handle content yybegin(STRING); in_string_literal = true; }


\"/{VALID_STR_POSTFIX} { //found the end of a string, finish. yybegin(YYINITIAL); in_string_literal = false; return new Token (sym.QUOTE,yyline + 1, string.toString()); } \" { throw new LexicalError(yyline+1); } //longest token only if invalid aheadBut supposeNow we have to go back and fix the lexer, too.When, in reality, there was no real reason to perform that test:Theres no case of something after the string the syntax wont be able to cope with.Back to the tokenizer not caringWhat needs changing?Lexer:NothingSyntax:NothingSemantic checksType checkCode generationFunctionality of the operationFake Exam Question #2We want IC to support binary numeric literalsWith the following syntax: 0b010010101 (leading zeros after the binary signifier allowed)With the same range restrictionsSolution #1Well add a new lexer token type, BINNUMBER0b[01]+And a new syntax rule for a BinNumber literalWhich, really, is only BINNUMBERAnd then check its rangeWhich is actually a lot easier than with decimalsA short interlude: where does X go?Is property X lexical, syntactic or semantic?Two main deciding factorsCorrectness:Is there enough data to make the call right now?Laziness:What will be gained by doing this right now?Is this the place where its easiest to do?

Example A: range of decimal literals
- Correctness: in any two's complement implementation of integers, the bound is not going to be symmetric. So we can't make the call until we know whether we have a positive or a negative number on our hands.
- Laziness: writing a lot of code that looks at the child expressions during syntax analysis is usually a bad sign.

Example B: range of binary literals
- Correctness: all the data is there the second we get the token.
- Laziness: postponing the check means a continued separation between binary and decimal literals. If we check right now, we can convert the value to a number and forget all about it.

Back to Fake Question #2
So we can actually do it this way:
We'll add a new lexer rule, 0b[01]+, and check the range right there. And then:
return new Token(sym.INTEGER, yyline + 1, bin2decimal(yytext()));

Where does Y go?
Place the following property: "a call to method foo() is a static call."
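The slides call a bin2decimal helper but never show it; here is a hypothetical sketch (the name and the error handling are assumptions, not the course's code). It also shows why the binary case is easier than the decimal one: a 0b literal is never negative, so a single upper bound suffices, whereas for decimals -2147483648 is legal but 2147483648 is not.

```java
// Hypothetical sketch of the bin2decimal helper used by the lexer action:
// convert a 0b... literal to an int and check its range on the spot.
// A binary literal is never negative, so the check is a single bound --
// unlike decimal literals, where the two's complement range is asymmetric.
public class BinLiterals {
    public static int bin2decimal(String text) {
        String digits = text.substring(2);        // strip the "0b" signifier
        long value = 0;                           // long: no overflow while accumulating
        for (char c : digits.toCharArray()) {
            value = value * 2 + (c - '0');
            if (value > Integer.MAX_VALUE)        // out of range: a lexical error
                throw new IllegalArgumentException("binary literal out of range: " + text);
        }
        return (int) value;
    }
}
```

For example, bin2decimal("0b010010101") yields 149, and a literal with more than 31 significant bits is rejected immediately, so later phases never see a separate "binary" kind of number.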

Our guiding principle here is correctness: Lexer? Syntax?

Where does Y go?
Syntax breaks method calls up into three groups: definitely a static call, definitely not a static call, and...

foo() - ???
So, is that correct? A plain foo() could be either; only the symbol table can tell. So: not syntax.

Fake Question #3
We want to allow type inference in IC:

var a = new A();
A b = a;
C c = a; // type error

Q3: Lexer
A new token type, VAR.

Q3: Syntax
We want an init expression whose type is VAR. Do we add VAR to the types? No, we treat it like void.
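As a hypothetical sketch of how the pieces could fit together (the names declare, initType and the string-based "types" are invented; only the TBD placeholder idea comes from the deck): a variable declared with var carries a placeholder type, and the real type, taken from the initializer expression, is what actually enters the symbol table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the semantic step for 'var': a declaration whose
// type field is the TBD placeholder gets its real type from the computed
// type of its init expression, and that type enters the symbol table.
public class InferenceSketch {
    public static final String TBD = "TBD";               // stands in for the VAR "type"
    public static final Map<String, String> symbolTable = new HashMap<>();

    // declaredType is TBD for 'var'; initType is the type computed for the
    // init expression. Returns the type recorded in the symbol table.
    public static String declare(String name, String declaredType, String initType) {
        String actual = declaredType.equals(TBD) ? initType : declaredType;
        symbolTable.put(name, actual);                    // business as usual from here on
        return actual;
    }
}
```

After declare("a", TBD, "A"), the symbol table maps a to A, and a later A b = a type-checks exactly as if a had been declared with an explicit type.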

How about the AST representation?
We modify our LocalVariable class to keep TBD in its type.

Q3: Semantics
To determine the new variable's type: instead of computing its type field (which is TBD), compute the type of the init expression. Put that value into the symbol table, and all else is business as usual!

Good luck on the exam!
