6/24/2004 SEKE 2004
GOLD: A Grammar Oriented Parsing System
Devin Cook and Du Zhang
Department of Computer Science
California State University
Sacramento, CA 95819-6021
Introduction
• What is a Parser?
– Software which breaks a source program into its various grammatical units w.r.t. a formal grammar
– Used to convert a source program into an internal representation
• The common approach to creating parsers is through a compiler-compiler, or parser generator
• Each parser generator is designed for a specific programming language. There is no consistent parser generator
– Different grammatical notations
– Features and interfaces of tools vary in both look and behavior
Goals
• Design and implement a generalized parsing system that supports development of multiple programming languages
• Offer a consistent development environment for the language developers
GOLD
• Grammar Oriented Language Developer.
• Separates the component that generates parse tables for a target grammar from the component that does the actual parsing.
• Supports the full Unicode character set.
• Includes a set of tools that aid the language development process.
System Structure
• Builder
– Analyzes a target grammar and creates DFA and LALR parse tables
– These tables are saved to a Compiled Grammar Table file
• Compiled Grammar Table file
– Intermediary between the Builder and the Engine
– The file format is platform independent
– Format is designed to be very easy to read and extend in future versions
• Engine
– Reads the tables and parses the source text
– Can be implemented in different programming languages, as needed
Development Flow
1. Grammar is defined and loaded
– Any text editor can be used
2. Builder
– Grammar is analyzed and errors reported
– The parse tables are created and saved to a .cgt file
3. Engine
– Reads the tables, parses the source string, and produces parsing results
– Can be implemented in different programming languages, as needed
The Builder
• GOLD meta-language
• Compiled grammar table (.cgt) file
• Skeleton program creation for the Engine from program templates
• Interactive source string testing
• Display of various parse table information
• Export parse tables to a web page, XML file, or formatted text
GOLD Meta-Language
• The GOLD Meta-Language is used to define a target grammar
• It must not contain features that are programming language dependent
• Its notation stays very close to the standard notations (regular expressions and Backus-Naur Form)
• It supports all language attributes (including those which cannot be specified using BNF or regular expressions)
GOLD Meta-Language (contd.)
• Format
– Parameters are used to specify attributes about the grammar
– Character Sets are used to define the character domain for terminals
– Terminals are defined using regular expressions
– Rules are defined using Backus-Naur Form
Defining Parameters
• Used for Name, Author, Case Sensitive, Start Symbol, ...
• Parameter names are delimited by double quotes
• Parameters
– "Name", "Author", "Version", "About" are informative
– "Start Symbol" specifies the initial / start rule in the grammar
Parameters
• "Name", "Version", "Author", "About": Informative fields. These have no effect on table generation.
• "Case Sensitive": If set to True, the system will construct case-sensitive tokenizer tables.
• "Character Mapping": Some characters overlap ordinal values between ANSI and Unicode. If set to ANSI, the system will populate both.
• "Auto Whitespace": If not set to False, the system will automatically define a terminal to accept whitespace.
• "Start Symbol": The initial/start rule of the grammar. This parameter is required.
Example Parameters
"Name" = 'My Programming Language'
"Version" = '1.0 beta'
"Author" = 'John Q. Public'
"About" = 'This is a test declaration.'
| 'Multiple lines are available'
| 'by using the "pipe" symbol'
"Case Sensitive" = 'False'
"Start Symbol" = <Statement>
Defining Sets
• Character sets are used to aid the construction of regular expressions used to define terminals
• Literal sets of characters are delimited using ‘[’ and ‘]’
• Names of user-defined sets are delimited by ‘{’ and ‘}’
• Sets can be defined by adding and subtracting previously declared sets
Example Sets
Declaration                              Resulting characters
{Bracket}  = [']']                       ]
{Quote}    = ['']                        '
{Vowels}   = [aeiou]                     aeiou
{Vowels 2} = {Vowels} + [y]              aeiouy
{Set 1}    = [abc]                       abc
{Set 2}    = {Set 1} + [12] - [c]        ab12
{Set 3}    = {Set 2} + {Digit}           ab0123456789
{Hex Char} = {Digit} + [ABCDEF]          0123456789ABCDEF
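The adding and subtracting of sets shown above corresponds to ordinary set union and difference. The following Python sketch reproduces a few of the example sets. The variable names and the members helper are illustrative only, and {Digit} is assumed here to mean the ten ASCII digits.

```python
import string

# Assumption: {Digit} is taken to be the ASCII digits 0-9.
DIGIT = set(string.digits)

vowels = set("aeiou")                    # {Vowels}   = [aeiou]
vowels_2 = vowels | set("y")             # {Vowels 2} = {Vowels} + [y]
set_1 = set("abc")                       # {Set 1}    = [abc]
set_2 = (set_1 | set("12")) - set("c")   # {Set 2}    = {Set 1} + [12] - [c]
set_3 = set_2 | DIGIT                    # {Set 3}    = {Set 2} + {Digit}
hex_char = DIGIT | set("ABCDEF")         # {Hex Char} = {Digit} + [ABCDEF]


def members(chars):
    """Return the set's characters as a sorted string, for display."""
    return "".join(sorted(chars))
```

Operators are applied left to right, which is why {Set 2} loses the c that {Set 1} contributed.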
Pre-defined Character Sets
• There are many sets of characters which are not accessible via keyboard, or so commonly used that it would be repetitive and time-consuming to redefine in each grammar
• GOLD meta-language contains a collection of useful pre-defined sets
• These include sets often used for defining terminals as well as characters not accessible via keyboard
Individual Characters
• Some control characters that cannot be specified on a standard keyboard
Commonly used Character Sets
{Digit}
{Letter}
{Alphanumeric}
{Printable}
{Whitespace}
{Letter Extended}
{Printable Extended}
{ANSI Mapped}
{ANSI Printable}
Unicode Character Sets
• GOLD meta-language contains 43 pre-defined Unicode character sets
• The names of those sets are based on standard names of the Unicode Consortium
Comments
• GOLD meta-language allows both line comments and block comments
Defining Terminals
• Terminals are used to define reserved words, symbols, and recognized patterns (identifiers) in a grammar
• Each terminal is defined using a regular expression which is used to construct the Deterministic Finite Automata used by the tokenizer
• Implicit declaration of frequently used reserved words and symbols
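As a rough illustration of how a tokenizer can be driven by a DFA table rather than by hand-written matching code, here is a minimal Python sketch. The two-state table, the terminal name "Identifier", and the next_token function are all hypothetical and are not taken from GOLD; the sketch only shows the idea of following character-set edges and remembering the last accepting state.

```python
import string

LETTERS = set(string.ascii_letters)
ALNUM = LETTERS | set(string.digits)

# Hypothetical DFA for a terminal "a letter followed by letters or digits".
# Each state lists (character set, target state) edges, like rows of a table.
DFA = {
    0: [(LETTERS, 1)],
    1: [(ALNUM, 1)],
}
ACCEPTING = {1: "Identifier"}   # accepting state -> terminal it recognizes


def next_token(text, pos=0):
    """Return (terminal name, lexeme) for the longest match at pos, or None."""
    state, last_accept = 0, None
    i = pos
    while i < len(text):
        for charset, target in DFA.get(state, []):
            if text[i] in charset:
                state = target
                break
        else:
            break  # no edge accepts this character, so stop scanning
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], text[pos:i])
    return last_accept
```

Because the matching logic never mentions letters or digits directly, a different terminal only requires a different table, which is the property that lets tables be generated separately from the code that uses them.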
Example Terminals
Declaration                   Accepted strings
Example1     = a b c*         ab, abc, abcc, abccc, ...
Example2     = a b? c         abc, ac
Example3     = a|b|c          a, b, c
Example4     = a[12]*b        ab, a1b, a2b, a12b, a21b, ...
Example5     = {Letter}+      cat, dog, Sacramento, ...
ListFunction = c[ad]+r        car, cdr, caar, cadr, ...
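The example terminals can be checked against their listed strings with Python's re module. This is only an approximation: GOLD's regular-expression notation is not Python's (for instance, the examples separate items with spaces), and {Letter} is assumed here to be the ASCII letters. The TERMINALS table and accepts function are illustrative names.

```python
import re

# Assumption: {Letter} is approximated by ASCII letters, and each pattern is
# a hand translation of the corresponding example into Python syntax.
TERMINALS = {
    "Example1": r"abc*",
    "Example2": r"ab?c",
    "Example3": r"a|b|c",
    "Example4": r"a[12]*b",
    "Example5": r"[A-Za-z]+",
    "ListFunction": r"c[ad]+r",
}


def accepts(name, text):
    """True if the whole text matches the named terminal's pattern."""
    return re.fullmatch(TERMINALS[name], text) is not None
```

For example, accepts("ListFunction", "cadr") holds, while accepts("ListFunction", "cr") does not, because + requires at least one a or d.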
Defining Rules
• Use Backus-Naur Form
• Nonterminals are delimited by angle brackets < and >
• Terminals are delimited by single quotes or not delimited at all
Example: Lists
• Lists are specified using recursive rules
Identifier = {Letter}{Alphanumeric}*
<List> ::= <List Item> ',' <List> | <List Item>
<List Item> ::= Identifier
Recursion: <List> appears on the right-hand side of its own rule
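To show how the recursive <List> rule unfolds, here is a small hand-written recursive-descent sketch in Python. GOLD itself generates LALR tables, so this is not how the Builder or Engine processes the rule; the tokenize and parse_list functions are hypothetical, and {Letter} and {Alphanumeric} are approximated by ASCII ranges.

```python
import re

# Assumption: Identifier = {Letter}{Alphanumeric}* restricted to ASCII.
TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9]*|,)")


def tokenize(text):
    """Split text into Identifier and ',' tokens; raise on anything else."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_list(tokens, i=0):
    """<List> ::= <List Item> ',' <List> | <List Item>"""
    if i >= len(tokens) or tokens[i] == ",":
        raise SyntaxError("expected Identifier")
    item = tokens[i]
    if i + 1 < len(tokens) and tokens[i + 1] == ",":
        rest, end = parse_list(tokens, i + 2)   # the recursive alternative
        return [item] + rest, end
    return [item], i + 1                        # the single-item alternative
```

Each comma triggers one more recursive call, so a list of n items is recognized by n nested applications of the rule.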
Example: Optional Rules
• Optional rules are specified with an empty production, i.e. an alternative containing no symbols
• This allows the developer to specify both optional items and lists containing 0 or more members
<Series> ::= <s-Expression> <Series>
           |                              (empty alternative: zero or more)
<Quote>  ::= ''
           |                              (empty alternative: optional rule)
"Name" = 'LISP'
"Author" = 'John McCarthy'
"Version" = 'Minimal'
"About" = 'LISP organizes ALL data around "lists".'
Compiled Grammar Table (.cgt) File
• A file format designed to store the parse tables and other information generated by the Builder
• Design considerations
– Easy to implement on different platforms
– Flexibility for data structures to be added or expanded
– Room for future growth (additional new types of data)
.cgt File Structure
• The file consists of a number of records
• Each record contains a number of entries
.cgt Record
• The header contains name and version info
• A record has the following format
Parameter Record
• The Parameter record occurs only once in the .cgt file. It contains information about the grammar as well as attributes that affect how the grammar functions. The record starts with a byte field containing the value 80, the ASCII code for the letter 'P'.
Table Size Record
• The Table Size record appears before any records containing symbol, set, rule, or state table information. The first field of the record contains a byte with the value 84, the ASCII code for the letter 'T'. Each remaining value gives the total number of objects in one of the listed tables.
Other Types of Records
• Character set table member
• Symbol table member
• Initial states (for both DFA and LALR)
• Rule table member
• DFA state table member
• LALR state table member
• To illustrate, only one of each record type is included
Table Counts record ('M', 6 entries)
Type   Value   Field
'b'    'T'     (record type)
'I'    14      Symbol Table
'I'    13      Character Set Table
'I'    8       Rule Table
'I'    23      DFA Table
'I'    18      LALR Table
Symbol record ('M', 4 entries)
Type   Value   Field
'b'    'S'     (record type)
'I'    0       Index
'S'    EOF     Name
'I'    3       Kind
Character Set record ('M', 3 entries)
Type   Value   Field
'b'    'C'     (record type)
'I'    4       Index
'S'    Ii      Characters
Rule record ('M', 6 entries)
Type   Value   Field
'b'    'R'     (record type)
'I'    0       Index
'I'    13      Nonterminal
'E'            (Empty)
'I'    12      Symbol 0
'I'    13      Symbol 1
DFA State record ('M', 8 entries)
Type   Value   Field
'b'    'D'     (record type)
'I'    1       Index
'B'    1       Accept State
'I'    2       Accept Index
'E'            (Empty)
'I'    0       Character Set Index 0
'I'    1       Target Index 0
'E'            (Empty) 0
LALR State record ('M', 7 entries)
Type   Value   Field
'b'    'L'     (record type)
'I'    7       Index
'E'            (Empty)
'I'    9       Symbol Index 0
'I'    1       Action 0
'I'    8       Target 0
'E'            (Empty) 0
Parameters record ('M', 7 entries)
Type   Value        Field
'b'    'P'          (record type)
'S'    Example      Name
'S'    1.0          Version
'S'    Devin Cook   Author
'S'    N/A          About
'B'    0            Case Sensitive
'I'    13           Start Symbol
Initial States record ('M', 3 entries)
Type   Value   Field
'b'    'I'     (record type)
'I'    0       DFA
'I'    0       LALR
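The record layouts on these slides (an 'M' marker, an entry count, then entries that each begin with a one-letter type code) can be decoded with a short loop. The Python sketch below is a minimal illustration, not the Engine's actual reader. It assumes a 2-byte little-endian entry count, 2-byte little-endian integers, one-byte boolean and byte entries, and null-terminated UTF-16 strings; none of these encodings are stated on the slides, and read_record is a hypothetical name.

```python
import struct


def read_record(data, pos=0):
    """Decode one record starting at pos; return (entries, next position)."""
    if data[pos:pos + 1] != b"M":
        raise ValueError("expected record marker 'M'")
    (count,) = struct.unpack_from("<H", data, pos + 1)  # assumed 2-byte count
    pos += 3
    entries = []
    for _ in range(count):
        kind = data[pos:pos + 1]
        pos += 1
        if kind == b"E":                      # empty entry, no payload
            entries.append(None)
        elif kind == b"b":                    # single byte
            entries.append(data[pos])
            pos += 1
        elif kind == b"B":                    # boolean stored in one byte
            entries.append(data[pos] == 1)
            pos += 1
        elif kind == b"I":                    # assumed 2-byte unsigned integer
            (value,) = struct.unpack_from("<H", data, pos)
            entries.append(value)
            pos += 2
        elif kind == b"S":                    # assumed null-terminated UTF-16
            end = pos
            while data[end:end + 2] != b"\x00\x00":
                if end >= len(data):
                    raise ValueError("unterminated string")
                end += 2
            entries.append(data[pos:end].decode("utf-16-le"))
            pos = end + 2
        else:
            raise ValueError(f"unknown entry type {kind!r}")
    return entries, pos
```

Under these assumptions, the Symbol record shown above would decode to the four entries 83 (the code for 'S'), 0, "EOF", and 3. Because every entry carries its own type code, a reader can skip entry types it does not understand, which fits the stated goal of leaving room for future growth.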
The Remaining Builder Features
• Besides the meta-language and the .cgt file, the Builder provides
– Skeleton program creation for the Engine from program templates
– Interactive source string testing
– Display of various parse table information
– Export of parse tables to a web page, XML file, or formatted text
Application Layout
[Screenshot of the Builder window, with labels: Online Help, Toolbar, Grammar Editor, Next Button, Status Message]
Program Templates
• When developing an Engine that interacts with the tables of rules and symbols in the .cgt file, manually typing constant definitions can be tedious and error-prone
• Program templates are designed to help automate the Engine development
• The Builder can use a program template to create a “skeleton program” for an implementation of the Engine
Program Templates (contd.)
• Skeleton program contains
– Necessary declarations of constants and variables
– Function calls
– Case statements, pre-processor statements
– Ready-to-use programs
• Notation designed to not conflict with known languages
• Program templates are saved in a subfolder
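The idea of generating constant declarations from a template can be sketched in a few lines of Python. GOLD's own template notation is not shown on these slides, so this example substitutes Python's string.Template as a stand-in; the template text, the SYMBOL_ prefix, and the function names are all hypothetical.

```python
from string import Template

# Hypothetical template text; GOLD's real template notation is not shown here.
SKELETON = Template(
    "# Constants generated for grammar: $name\n"
    "$constants\n"
)


def constant_name(symbol):
    """Turn a symbol name such as 'List Item' into SYMBOL_LIST_ITEM."""
    cleaned = "".join(c if c.isalnum() else "_" for c in symbol)
    return "SYMBOL_" + cleaned.upper()


def make_skeleton(grammar_name, symbols):
    """Render one constant per symbol, numbered by its table index."""
    lines = [f"{constant_name(s)} = {i}" for i, s in enumerate(symbols)]
    return SKELETON.substitute(name=grammar_name, constants="\n".join(lines))
```

Calling make_skeleton("Example", ["EOF", "List Item"]) yields a header comment followed by SYMBOL_EOF = 0 and SYMBOL_LIST_ITEM = 1, so the indices never have to be typed by hand.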
Display of Symbol Table
• Symbol table display
Display of Rule Table
• Rule table display
Display of Log Information
• Log info: general information about the number of symbols, which ones were defined implicitly, table counts, and any errors that occur
Display of DFA State Table
• DFA state table
Display of LALR State Table
• LALR state table
Export Parse Tables
• Parse tables can be exported to a web page, formatted text, or an XML file
Web Page Export
• An example of webpage export
A Short Demo
• A simple grammar
• ANSI C
The Engine
• Different implementations of the Engine
• Object-oriented approach
• Its design is centered around the “GOLDParser” object, which performs all the parsing logic
• The remaining objects are used for storage or to support the GOLDParser object
• Available in: Visual Basic .NET, ANSI C, C#, C++ (MFC), Delphi 5 & 6, Java, Python, Visual Basic 6
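To make the split between tables and parsing logic concrete, here is a minimal Python sketch of a table-driven shift-reduce loop. It is not the GOLDParser object or its API; the tables are hand-built for a hypothetical two-rule list grammar, and every name (ACTION, GOTO, RULES, parse) is an assumption made for illustration. A real Engine would load equivalent tables from the .cgt file.

```python
# Hand-built tables for:  1: <List> ::= Identifier ',' <List>
#                         2: <List> ::= Identifier
# ("id" stands for Identifier and "$" for end of input; all hypothetical.)
ACTION = {
    (0, "id"): ("shift", 1),
    (1, ","): ("shift", 2),
    (1, "$"): ("reduce", 2),
    (2, "id"): ("shift", 1),
    (3, "$"): ("reduce", 1),
    (4, "$"): ("accept", None),
}
GOTO = {(0, "List"): 4, (2, "List"): 3}
RULES = {1: ("List", 3), 2: ("List", 1)}   # rule -> (head, body length)


def parse(tokens):
    """Run the shift-reduce loop; return the rule numbers in reduction order."""
    stack, reductions = [0], []
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        step = ACTION.get((stack[-1], tokens[pos]))
        if step is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")
        kind, arg = step
        if kind == "shift":
            stack.append(arg)              # push the target state
            pos += 1
        elif kind == "reduce":
            head, length = RULES[arg]
            del stack[len(stack) - length:]          # pop one state per symbol
            stack.append(GOTO[(stack[-1], head)])    # then follow the goto
            reductions.append(arg)
        else:                              # accept
            return reductions
```

The loop contains nothing specific to lists: swapping in different ACTION, GOTO, and RULES tables changes the language it accepts, which is why the same Engine design can be ported to many programming languages.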
Testing and Development
• Extensive tests on the Builder’s algorithms to generate the LALR and DFA tables
– Small grammars
– Grammars for real-world programming languages