========================================
Decorating a Syntax Tree

The calculator language we've been using for examples doesn't have
sufficiently interesting semantics.  Consider an extended version with
types and declarations:

    program     → stmt_list $$
    stmt_list   → decl stmt_list | stmt stmt_list | ε
    decl        → int id | real id
    stmt        → id := expr | read id | write expr
    expr        → term term_tail
    term_tail   → add_op term term_tail | ε
    term        → factor factor_tail
    factor_tail → mul_op factor factor_tail | ε
    factor      → ( expr ) | id | int_const | real_const
                | float ( expr ) | trunc ( expr )
    add_op      → + | -
    mul_op      → * | /

Now we can
- require declaration before use
- require type match on arithmetic ops

We could do some of this checking while building the AST.  We could even
do it while building an explicit parse tree.  The more common strategy is
to implement checks once the AST is built:
- easier -- the tree has nicer structure
- more flexible -- can accommodate non-depth-first, non-left-to-right
  traversals, e.g. for
  - mutually recursive definitions (e.g., methods of a class in most
    languages)
  - type inference based on use
  - switch statement label checking
  - etc.

Assume the parser builds the AST and tags every node with a source
location.  Tagging of tree nodes is annotation: inside the compiler, tree
nodes are structs; annotations and pointers to children are fields.
(Annotation can also be done to an explicit parse tree; we'll stick to
ASTs.)

But first: what do we want the AST to look like?  One appealing way to
specify it is a tree grammar.  Each "production" of a tree grammar has a
parent on the LHS and children on the RHS.
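To make the "nodes are structs" idea concrete, here is a minimal Python sketch of an annotated AST node.  The representation (a single `Node` class, the field names, and the locations) is my own assumption, not from the text; the point is just that children and annotations are ordinary fields.

```python
# AST nodes as structs: child pointers and annotations (here a source
# location tag from the parser, plus a type slot for later semantic
# checks) are just fields.  Names and layout are assumed for illustration.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    kind: str                                   # "int_decl", "assign", "+", "id", ...
    children: list = field(default_factory=list)
    name: Optional[str] = None                  # for id-bearing nodes
    location: tuple = (0, 0)                    # (line, column) annotation
    type: Optional[str] = None                  # filled in by semantic checks

# A tagged fragment for "a := a + 2" at line 3 (locations invented):
rhs = Node("+", [Node("id", name="a", location=(3, 9)),
                 Node("int_const", location=(3, 13))], location=(3, 11))
stmt = Node("assign", [rhs], name="a", location=(3, 6))
```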
This is not for parsing; it's to describe the trees that
- we want the parser to build
- we need to annotate

Example for the extended calculator language:

    program → item
    int_decl   : item → id item        // item is next decl or stmt
    real_decl  : item → id item
    assign     : item → id expr item
    read       : item → id item
    write      : item → expr item
    null       : item → ε
    '+'        : expr → expr expr
    '-'        : expr → expr expr
    '*'        : expr → expr expr
    '/'        : expr → expr expr
    float      : expr → expr
    trunc      : expr → expr
    id         : expr → ε              // no children
    int_const  : expr → ε
    real_const : expr → ε

The A:B syntax on the left means that A is one kind of a B, and may
appear wherever a B is expected on a RHS.  Note that "program → item"
does not mean that a program "is" an item (the way it does in a CFG), but
merely that a program node in a syntax tree has one child, which is an
item.

Here's a syntax tree for a tiny program.  Structure is given by the tree
grammar.  Construction would be via execution of appropriate action
routines embedded in a CFG.

Remember: tree grammars are not CFGs.  The language of a CFG is the set
of possible fringes of parse trees.  The language of a tree grammar is
the set of possible whole trees.  There is no comparable notion of
parsing: the structure of the tree is self-evident.

Our tree grammar helps guide us as we write (by hand) the action routines
to build the AST.  It can also help guide us in writing recursive
tree-walking routines to perform semantic checks and (later) generate
mid-level intermediate code (next lecture).
- Helpful to augment the tree grammar with semantic rules that describe
  relationships among annotations of parent and children.
- Semantic rules are like action routines, but without explicit
  specification of what is executed when.
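Since a tree's structure is self-evident, checking that a given tree conforms to the tree grammar is a straightforward recursive walk rather than a parse.  Here is a hypothetical Python sketch; the table encoding, the `node` helper, and the convention that a child slot may name either a category (`item`, `expr`) or a specific kind (`id`) are my own assumptions.

```python
# Tree grammar as a table: each node kind maps to (its category, the
# categories/kinds of its children, in order).  Representation assumed.
def node(kind, *children, **attrs):
    return {"kind": kind, "children": list(children), **attrs}

TREE_GRAMMAR = {
    "program":    ("program", "item"),
    "int_decl":   ("item", "id", "item"),
    "real_decl":  ("item", "id", "item"),
    "assign":     ("item", "id", "expr", "item"),
    "read":       ("item", "id", "item"),
    "write":      ("item", "expr", "item"),
    "null":       ("item",),
    "+":          ("expr", "expr", "expr"),
    "-":          ("expr", "expr", "expr"),
    "*":          ("expr", "expr", "expr"),
    "/":          ("expr", "expr", "expr"),
    "float":      ("expr", "expr"),
    "trunc":      ("expr", "expr"),
    "id":         ("expr",),
    "int_const":  ("expr",),
    "real_const": ("expr",),
}

def conforms(t, expected):
    """True iff tree t may stand where the grammar expects `expected`
    (either t's exact kind, or its category per the A:B convention)."""
    cat, *child_cats = TREE_GRAMMAR[t["kind"]]
    if expected not in (t["kind"], cat):
        return False
    kids = t["children"]
    return (len(kids) == len(child_cats)
            and all(conforms(k, c) for k, c in zip(kids, child_cats)))

# "int a  read a  write a" as a tree: each item's last child is the next item.
tree = node("program",
            node("int_decl", node("id", name="a"),
                 node("read", node("id", name="a"),
                      node("write", node("id", name="a"),
                           node("null")))))
```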
A CFG or tree grammar with semantic rules is an attribute grammar (AG).

Not used much in production compilers, but useful for prototyping (e.g.,
the first validated Ada implementation [Dewar et al., 1980]) and in some
cool language-based tools:
- syntax-directed editing [Reps, 1984]
- parallel CSS [Meyerovich et al., 2013]

The book goes into a bit of AG theory, talking about
- synthesized attributes (depend only on information below the current
  node in the tree)
- inherited attributes (depend at least in part on info from above or to
  the side)

Remember that an AG doesn't actually specify the order in which rules
should be evaluated.  There exist tools to figure that out, and a rich
theory of classes of grammars with varying attribute flow (non-circular,
circular but converging, ...).

When basing an AG on a CFG, it's desirable to have attribute flow that's
consistent with the order in which the parser builds the tree:
- bottom-up parsers need S-attributed grammars -- all attributes are
  synthesized
- top-down parsers can use L-attributed grammars, which are a superset --
  attributes are synthesized or depend on stuff to the left

See the text for more info.

Our CFG w/ action routines to build the AST could be written as an AG by
making each action routine a semantic rule and then listing the rules for
each production w/out actually embedding them in the RHS.  For something
as simple as AST construction, not having to specify what is done when
isn't much of a savings -- a tool to find an evaluation order consistent
w/ attribute flow wouldn't be useful (it was useful in the tools
mentioned above).  In practice, people do a hand-written tree walk on
ASTs.  The book gives an extended example of declaration and type
checking for the extended calculator grammar.
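As a tiny illustration of a purely synthesized (S-attributed) attribute: the `val` attribute below depends only on a node's children, so a single bottom-up pass computes it, which is exactly the order a bottom-up parser produces.  The tuple representation and the constant-evaluation example are my own, not from the text.

```python
# val is a synthesized attribute: computed from children only, bottom-up.
def val(t):
    kind, *kids = t
    if kind == "int_const":
        return kids[0]                       # the leaf carries its value
    if kind == "+":
        return val(kids[0]) + val(kids[1])
    if kind == "*":
        return val(kids[0]) * val(kids[1])
    raise ValueError(f"unknown node kind: {kind}")

# (2 + 3) * 4
tree = ("*", ("+", ("int_const", 2), ("int_const", 3)), ("int_const", 4))
```

An inherited attribute (say, the set of names declared to the left) could not be computed this way; it would have to flow down and across the tree, which is why bottom-up parsers are restricted to S-attributed grammars.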
Written as a pure AG, with the following attributes:

    program     errors     - list of all static semantic errors
                             (type clash, undefined/redefined names)
    item, expr  symtab     - list with types of all names declared to left
    item        errors_in  - list of all static semantic errors to left
                errors_out - list of all static semantic errors through here
    expr        type
                errors     - list of all static semantic errors inside
    everything  location

More common to make the symbol table and error lists global variables:
- insert errors, as found, into a list or tree, sorted by source location
- for symtab, label each construct with a list of active scopes; look up
  <name, scope> pairs, starting with the closest scope
- for the calculator language, which has no scopes, we can enforce
  declare-before-use in a simple left-to-right traversal of the tree
  - complain at any re-definition
  - or at any use w/out prior definition

To avoid cascading errors, it's common to have an "error" value for an
attribute that means "I already complained about this."  So, for example,
in

    int a
    real b
    int c
    a := b + c

we label the '+' tree node with type "error" so we don't generate a
second message for the ':=' node.

A few example rules (with error list and symtab as globals):

    int_decl : item1 → id item2        // item2 is rest of program
        ▷ if <id.name, ?> ∈ symtab
              errors.insert("redefinition of " id.name, item1.location)
          else
              symtab.insert(<id.name, int>)

    id : expr → ε
        ▷ if <id.name, A> ∈ symtab
              expr.type := A
          else
              errors.insert(id.name " undefined", id.location)
              expr.type := error

    '+' : expr1 → expr2 expr3
        ▷ if expr2.type = error or expr3.type = error
              expr1.type := error
          else if expr2.type <> expr3.type
              expr1.type := error
              errors.insert("type clash", expr1.location)
          else
              expr1.type := expr2.type

The right-pointing triangle (▷) here is meant to introduce a semantic
rule.  (This is not standard notation, but it matches what's in the
text.)  In these particular cases there is only one rule per
"production," but in a more complicated grammar there could be many.
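A hand-written tree walk enforcing rules like these could look as follows.  This is a Python sketch under assumed conventions: the dict node representation, helper names, and the `assign` check (which is not among the three rules shown, but demonstrates how the "error" poisoning suppresses the second message) are all my own.

```python
# Tree walk with symtab and errors as globals; "error" type = "already
# complained."  Representation and helper names assumed for illustration.
symtab = {}    # name -> "int" | "real"
errors = []    # (location, message), in source order

def node(kind, *children, name=None, loc=0):
    return {"kind": kind, "children": list(children),
            "name": name, "location": loc, "type": None}

def check_item(t):
    kind = t["kind"]
    if kind in ("int_decl", "real_decl"):
        if t["name"] in symtab:
            errors.append((t["location"], f"redefinition of {t['name']}"))
        else:
            symtab[t["name"]] = "int" if kind == "int_decl" else "real"
        check_item(t["children"][0])            # rest of the program
    elif kind == "read":
        if t["name"] not in symtab:
            errors.append((t["location"], f"{t['name']} undefined"))
        check_item(t["children"][0])
    elif kind == "write":
        check_expr(t["children"][0])
        check_item(t["children"][1])
    elif kind == "assign":
        rhs, rest = t["children"]
        check_expr(rhs)
        lhs_type = symtab.get(t["name"])
        if lhs_type is None:
            errors.append((t["location"], f"{t['name']} undefined"))
        elif rhs["type"] not in (lhs_type, "error"):  # "error" already reported
            errors.append((t["location"], "type clash"))
        check_item(rest)
    # kind == "null": nothing to check

def check_expr(t):
    kind = t["kind"]
    if kind == "id":
        if t["name"] in symtab:
            t["type"] = symtab[t["name"]]
        else:
            errors.append((t["location"], f"{t['name']} undefined"))
            t["type"] = "error"                 # "I already complained"
    elif kind in ("int_const", "real_const"):
        t["type"] = kind[:-6]                   # "int" or "real"
    elif kind in ("+", "-", "*", "/"):
        a, b = t["children"]
        check_expr(a)
        check_expr(b)
        if "error" in (a["type"], b["type"]):
            t["type"] = "error"                 # suppress cascading messages
        elif a["type"] != b["type"]:
            errors.append((t["location"], "type clash"))
            t["type"] = "error"
        else:
            t["type"] = a["type"]

# The cascading-error example:  int a  real b  int c  a := b + c
plus = node("+", node("id", name="b", loc=4), node("id", name="c", loc=4), loc=4)
prog = node("int_decl",
            node("real_decl",
                 node("int_decl",
                      node("assign", plus, node("null"), name="a", loc=4),
                      name="c", loc=3),
                 name="b", loc=2),
            name="a", loc=1)
check_item(prog)
```

After the walk, `errors` holds exactly one "type clash" (at the '+'); the ':=' node sees the poisoned "error" type and stays quiet.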
Formal AG notation would require no side effects (no globals) and would
specify each semantic rule as Si.ax := f(Sj.ay, ..., Sk.az) -- e.g.,

    ▷ expr.type := if <id.name, A> ∈ symtab then A else error
    ▷ expr.errors := if <id.name, A> ∈ symtab then null
                     else [id.name "undefined at" id.location]

We can see how these rules would be enforced while walking the syntax
tree.  In a more complicated language, we might make multiple passes over
the tree -- perhaps
- one to fill in the symbol table;
- a second to check types, check for undeclared names, match parameter
  lists to declarations, etc.; and
- a third to generate mid-level IR.

[Figures: the syntax tree, per the tree grammar, for the program

    int a
    read a
    real b
    read b
    write (float(a) + b) / 2.0
]
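The side-effect-free rules for id sketched above (expr.type and expr.errors as pure functions of the node's inputs) can be illustrated in Python; function names and the threading convention are my own assumptions, in the spirit of the book's errors_in/errors_out attributes.

```python
# Pure-function style: no globals; each rule returns new attribute values,
# threading the symbol table and error list through as inputs and outputs.
def declare(name, ty, location, symtab, errors_in):
    """int_decl / real_decl : item -> id item; returns (symtab', errors_out)."""
    if name in symtab:
        return symtab, errors_in + [f"redefinition of {name} at {location}"]
    return {**symtab, name: ty}, errors_in

def id_type(name, location, symtab, errors_in):
    """id : expr -> epsilon; returns (expr.type, errors_out)."""
    if name in symtab:
        return symtab[name], errors_in
    return "error", errors_in + [f"{name} undefined at {location}"]

# Threading the attributes through three rule applications:
symtab, errs = declare("a", "int", 1, {}, [])
symtab, errs = declare("a", "real", 2, symtab, errs)   # redefinition
ty, errs = id_type("b", 3, symtab, errs)               # undefined use
```

Because nothing is mutated, an AG evaluator is free to run these rules in any order consistent with the attribute flow, which is exactly the property the formal notation is designed to expose.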