
Efficient Semantic Analysis for Text Editors∗

Elad Lahav

School of Computer Science, University of Waterloo

[email protected]

August 16, 2004

Abstract

Meddle is a programmer’s text editor designed to provide as-you-type semantic information to the user. This is accomplished by using algorithms for tracking changes to the editor’s text buffer, incremental scanning and incremental parsing. These algorithms are presented and explained.

1 Introduction

Almost all modern text editors provide syntactic information, usually in the form of syntax highlighting. Some text editors, such as Emacs1 and Kate2, even provide limited semantic information, such as displaying matching sets of parentheses. It is hard, however, to find a text editor that provides full semantic information about the file being edited. Two examples of editors that accomplish this task are SourceInsight3 and Visual SlickEdit4, both of which are closed-source, commercial editors with price tags of several hundred dollars. These applications, however, show a decisive advantage over the aforementioned free editors in that they give the programmer a much better view of the edited files, including such information as the distinction between global and local variables, function declarations vs. function calls, type resolution for non-primitive types, etc. This information cannot be obtained without the construction of a complete parse tree for the file being edited.

It is easy to see that semantic information is much harder to generate than syntactic information: while the latter task only requires scanning the text and extracting tokens with a simple finite automaton, the former relies on a parser for context-free grammars. Even though some semantic information can be extracted from the text by simple extensions to the scanner’s automaton (e.g., a parentheses stack), the construction of a complete parse tree requires a full-blown parser.

∗ Submitted as a final project for CS842: Programming Language Design and Implementation. Sadly, this work is self-funded.
1 www.emacs.org
2 kate.kde.org
3 www.sourceinsight.com
4 www.slickedit.com


Figure 1: The editing-scanning-parsing cycle. The user makes changes to the editor’s text buffer, which are recorded for later use by the incremental scanner. Scanning is triggered by a timer, and the incremental scanner generates a stream of tokens for the parser. These tokens can be used to display syntactic information, which is not dependent upon parsing, such as syntax highlighting. A second timer invokes the incremental parser, which generates the parse tree, and provides semantic information that can be displayed in various forms.

The real problem, however, is not the construction of parsers for programming languages. This is relatively easy to do using the numerous parser generators available, such as Yacc and Bison. These parsers, however, are designed to work in batch mode: they require, as input, the complete stream of tokens extracted from a text document in order to build a parse tree. This method, while appropriate for tasks such as compilation, is much too time-consuming for rapidly-changing environments, such as text editors.

Luckily, not every change to the editor’s text buffer requires a complete re-parse of the entire stream of tokens. In many cases, a correct parse tree can be obtained by reusing some of the sub-trees from the previous version of the tree, a procedure referred to as incremental parsing. Incremental parsing is based upon incremental scanning of the text, a process that identifies the modified tokens. Incremental scanning, in turn, requires a method for tracking changes to the text buffer. The complete editing-scanning-parsing cycle is depicted in figure 1.

Strangely enough, even though algorithms for incremental parsing have been known for more than 20 years [3], it seems that they are rarely incorporated into modern text editors. A possible reason may be that previous papers on these algorithms were either incomplete, or described the algorithms in a manner that was not implementation-oriented. The primary goal of this paper is to give a complete description of the incremental parsing process, as implemented in a demo text editor called Meddle. The source code for Meddle is freely available, and its documentation follows this paper, so it can be used as a tutorial for constructing other semantic text editors.

Since this paper is implementation-oriented, some of the algorithms presented here are not necessarily optimal. Instead, I have tried to maintain a balance between performance and ease of implementation.

The remainder of this paper is structured as follows: Section 2 discusses some of the previous work published on incremental scanning and parsing; sections 3 to 5 describe the implementation of the editing-scanning-parsing cycle in Meddle; finally, section 6 suggests some ideas for future work.

2 Related Work

Jalili and Gallier were the first to present a complete and correct algorithm for incremental parsing of LR(1) grammars [3]. Their algorithm is based on a state-matching criterion, which assigns to each node in the parse tree the state of the parser at the time it was shifted. Note that unlike traditional parsing, in this case the parse stack does not hold just the states of the finite state machine, but rather (state, node) pairs. Jalili and Gallier showed that sub-trees of the parse tree can be reused when parsing the new token stream, based on the following key observation:

Proposition 1. Let G = (VT, VN, P, S) be an LR(1) grammar, let x, y ∈ (VT ∪ VN)+ be (non-empty) strings in the grammar vocabulary, and let a ∈ VT be a terminal symbol. Assume that the text to parse is xya, and let s be the top state on the parse stack after parsing x. If y is reduced to a single non-terminal A based on the look-ahead symbol a, then it will also be reduced to A when parsing x′ya for any string x′ such that the top state after parsing x′ is s.

In other words, a deterministic LR(1) parser will reduce a string to a single symbol based solely on the current state and look-ahead symbol, and regardless of the symbols and actions that have led to that state. This means that a sub-tree rooted at a node A can be reused if, after performing all reductions using First(A), the following conditions hold:

1. The parser is in the same state as it was when A was shifted during the construction of the original tree; and

2. The look-ahead symbol is the same one that was used to determine that A should be shifted.

The term “reuse” means that the parser can push the state goto(s, A) on the top of the stack without further examination of the contents of the terminal string spanned by A.
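The state-matching test can be phrased as a simple predicate over parse-tree nodes. The following sketch is purely illustrative: the class and its fields (`shift_state`, `lookahead`) are hypothetical names chosen for this example, not taken from Meddle’s source.

```python
class TreeNode:
    """A parse-tree node that remembers the parser state under which it
    was shifted, and the look-ahead symbol used at that time."""
    def __init__(self, symbol, shift_state, lookahead):
        self.symbol = symbol
        self.shift_state = shift_state
        self.lookahead = lookahead

def can_reuse(subtree, current_state, current_lookahead):
    """Jalili-Gallier state-matching test: the sub-tree rooted at A may
    be pushed as-is iff the parser is in the same state as when A was
    originally shifted, with the same look-ahead symbol."""
    return (subtree.shift_state == current_state
            and subtree.lookahead == current_lookahead)
```

On a match, the parser can push goto(s, A) directly; on a mismatch, the sub-tree is broken up and its children are considered for reuse instead.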

Another key feature of the Jalili-Gallier algorithm is that if a sub-tree cannot be reused in its entirety, it can be broken into sub-trees that can in turn be considered for reuse. Meddle uses the state-matching test as well as the Divide, Undo-Reductions and Replace operations described in [3]. The Delete operation is implemented differently. The use of two stacks, a parse stack and a tree-reuse stack, is replaced by a data structure that is described in section 5.1.


Unfortunately, [3] assumes that the changes to the token list are given, and so does not provide any details as to how these changes are recorded.

A complete description of the editing-scanning-parsing cycle for a text editor is given by Beetem and Beetem in [2]. Their method for tracking changes, based on a data structure called a kmn-list, is used by Meddle, and is described in detail in section 3. However, while [2] presents a single complex procedure for manipulating the kmn-list, Meddle uses two separate procedures, one for text insertions and one for range deletions. The complex procedure is only required in order to support text replacement, which can easily be broken into two separate actions of first deleting the old text and then inserting the new one.

Meddle also follows the gist of the algorithm presented in [2] for incremental scanning, as described in section 4. However, the algorithm given in that paper is based on different data structures than the ones used here, and its presentation, in the form of pseudo-code, is difficult to comprehend. Therefore, the algorithm presented in section 4 was developed from scratch, based only on the guidelines given in [2].

On the other hand, the incremental parsing algorithm given in [2] is based on a recursive descent parser, suitable for languages described by LL(1) grammars. Most modern programming languages, however, are based on LALR(1) grammars, and common parser generators such as Yacc require that languages be specified using such grammars. The parser described in [2] is therefore inadequate for a real-life text editor (in fact, the editor described in [2] is tailored around a specific programming language used for research purposes.)

Much of this paper has been inspired by the work of the “Harmonia” project5, led by Susan L. Graham. This project has resulted in numerous papers on text editing in general and incremental parsing in particular. Unfortunately, it seems that none of the editors mentioned in these papers (Pan, Ensemble and Harmonia) were ever released for public use6, so the actual implementation of the data structures and algorithms presented in these papers is unknown. I hope to remedy this by providing the full source code for Meddle along with this paper.

In [1], Ballance, Butcher and Graham suggest the use of grammar abstractions for text editors. Informally, one grammar is referred to as an abstraction of another grammar if every structure described by the abstract grammar is also described by the original grammar. While abstract grammars are usually not suitable for compilation, as they may ignore structures required for correct code generation, they may be suitable for text editors, which are not required to provide all the semantics of the edited text. For example, abstract grammars may ignore operator precedence in algebraic expressions. Abstract grammars may thus result in faster parsers, making them more adequate for dynamic environments such as text editors.

All of the papers mentioned so far describe methods for incremental parsing of unambiguous programming languages. In real life, however, many of the more commonly used languages, such as C, contain semantic ambiguities. This happens when syntactic analysis is dependent upon semantic information, such as with the use of type definitions in C: a scanner cannot identify a token as a type without knowing that it has been declared as such (except for primitive types, such as int and char.) This knowledge, in turn, depends upon correct parsing of the preceding code (including header files.) This problem is described and resolved in [4], where generalised grammars are used for the creation of incremental parsers. Semantic information, such as type definitions, is propagated through the parse tree for correct scanning.

5 harmonia.cs.berkeley.edu
6 A Harmonia plug-in for Emacs is available in binary form only.

Finally, Wagner and Graham presented in [5] an optimal incremental parsing algorithm, based on the Jalili-Gallier algorithm, that is guaranteed to reuse as much of the previous parse tree as possible. The algorithm used by Meddle is sub-optimal, but hopefully simpler and easier to understand. As opposed to the text editor described in that paper, Meddle does not require multiple versions of the parse tree to be kept. Instead, it generalises the notion of a parse tree to that of an ordered parse forest, which is changed in-place. This is described in section 5. Another difference between the two algorithms is that Meddle uses the state-matching criterion, while [5] suggests a new mechanism, called “sentential-form” parsing. The first reason for preferring the latter is ascribed by the authors to the additional space required for holding the parser state for each node in the state-matching method. Since the size of each node in Meddle’s parse tree is 56 bytes, saving 4 bytes can hardly justify a change in the algorithm. The second, and more substantial, reason stated in [5] is that the state-matching condition, while suitable for LALR(1) parsers, is too restrictive for LR(1) parsers. Meddle uses the output of Bison to generate the parser’s tables, which are given in LALR(1) form. Thus the state-matching algorithm of Jalili and Gallier seems to be adequate in this case.

Wagner and Graham have also shown that the use of two stacks, a parsing stack and a tree-reuse stack, is not necessary, as these can be inferred with the use of a correct data structure. The P-Tree structure, presented in section 5.1, follows this result.

3 Tracking Changes to the Text Buffer

A typical scenario for a text-editor session starts with the user loading a file into the editor’s buffer. The contents of this file can then be scanned and parsed using batch-like methods that create the initial stream of tokens and parse tree. The major difference between a compiler-like tool and a text editor is unveiled when the user begins modifying the file. This usually involves rapid changes to the buffer at random, unpredictable locations. In order for the changes to be reflected, first in the token stream and then in the parse tree, the editor needs to rescan and re-parse the buffer. Since an editor needs to be highly responsive, the batch methods are replaced with more efficient algorithms for incremental scanning and parsing.

A natural assumption would be that incremental scanning and parsing should not commence while the user is actively editing the file; that is, rapid changes should be logged until the editor is free of user commands to process, and can dedicate its resources to text analysis. Note that a multi-threaded approach would usually not be useful in this case: since we would like the results of the scanner and parser to affect the visual appearance of the text, changes must not occur in the buffer before scanning and parsing are completed.

One trivial, yet important, observation is that although modern text editors support a variety of user commands for text manipulation, such as cutting, pasting, multiple undo/redo levels, etc., the underlying text buffer can only be modified in two ways: text insertion and range deletion. Some papers ([3], [2]) also suggest that text replacement is a primitive operation, which leads to complex algorithms to support it. It is easy to see that text replacement is actually a combination of range deletion followed by text insertion, which significantly simplifies the algorithms involved in the editing-scanning-parsing cycle.
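The decomposition of a replacement into the two primitives is straightforward. The sketch below illustrates it on a plain string buffer (an illustration only; Meddle’s buffer is not a Python string):

```python
def delete_range(buf, pos, length):
    """Primitive 1: remove `length` characters starting at `pos`."""
    return buf[:pos] + buf[pos + length:]

def insert_text(buf, pos, text):
    """Primitive 2: insert `text` at `pos`."""
    return buf[:pos] + text + buf[pos:]

def replace_range(buf, pos, length, text):
    """A replacement is just a range deletion followed by an insertion
    at the same position."""
    return insert_text(delete_range(buf, pos, length), pos, text)
```

For example, `replace_range("The brown fox", 10, 3, "ferret")` yields `"The brown ferret"`.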

An easy and efficient way to keep track of changes to the text buffer is given in [2]. The algorithm described by Beetem and Beetem is based on a data structure called a kmn-list. This structure is a linked list of nodes, each of which is composed of three fields:

1. k: The number of (consecutive) unmodified characters;

2. m: The number of following characters that were deleted;

3. n: The number of characters inserted after the unmodified ones.

For example, a node written as (5, 2, 4) describes a segment of the text composed of 5 unmodified characters, followed by 2 deleted characters and 4 inserted characters (as in the transformation of commons to commodity.) In the rest of this paper the segment of the text matching the k field will be referred to as the stable region of the node; the one matching the m field as the deleted region; and the one matching the n field as the inserted region.

Consider a text buffer that holds k characters. This buffer can be represented by a kmn-list that consists of a single node, namely (k, 0, 0). As changes are applied to the buffer, this list may grow and take the form (k1, m1, n1), ..., (kl, ml, nl). An important property of this list is that the old buffer, before any modifications, is depicted by the sequence k1, m1, k2, m2, ..., kl, ml: the buffer contained k1 characters that were not changed, followed by m1 characters that were deleted, followed by k2 characters that were not changed, etc. Similarly, the new buffer is depicted by the sequence k1, n1, k2, n2, ..., kl, nl.
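The depiction property is easy to check in code. The following sketch (an illustration under the definitions above, not Meddle’s implementation) rebuilds the new buffer from the old one, given the kmn-list and the text of each inserted region:

```python
def old_length(kmn):
    # The old buffer is depicted by the sequence k1, m1, ..., kl, ml.
    return sum(k + m for (k, m, n) in kmn)

def new_length(kmn):
    # The new buffer is depicted by the sequence k1, n1, ..., kl, nl.
    return sum(k + n for (k, m, n) in kmn)

def rebuild(old, kmn, inserted):
    """Rebuild the new buffer: keep each stable region, skip each
    deleted region, and splice in the text of each inserted region."""
    out, pos = [], 0
    for (k, m, n), ins in zip(kmn, inserted):
        out.append(old[pos:pos + k])   # stable region survives unchanged
        pos += k + m                   # deleted region is skipped
        assert len(ins) == n           # inserted text must span n chars
        out.append(ins)
    return "".join(out)

# The single node (5, 2, 4) transforms "commons" into "commodity":
kmn = [(5, 2, 4)]
assert old_length(kmn) == len("commons")
assert new_length(kmn) == len("commodity")
assert rebuild("commons", kmn, ["dity"]) == "commodity"
```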

Example 1. A text buffer holds the following sentence:

The brown fox jumps over the lazy dog.

We now change this sentence to read

The brown ferret jumps all over the dog.

The underlined segments in the original sentence were thus deleted, and the underlined segments in the new sentence were inserted. These changes are represented by the following kmn-list:

(11, 2, 5); (5, 0, 4); (10, 5, 0); (4, 0, 0)

This list is visually depicted in figure 2. Note that the list is compact: the first node contains both the deleted region and the inserted region that follow the first stable region. □


Figure 2: The kmn-list described in example 1. The upper bar illustrates the buffer before the changes, as described by the k and m fields of the list nodes. The lower bar shows the state of affairs in the new buffer, as described by the k and n fields of the list.

In order to maintain the kmn-list, [2] suggests the use of a single function, called KMN-Insert. This function accepts the position of a change pos, the number of characters deleted from pos, and the number of characters inserted at pos. The function then modifies the kmn-list to include the changes specified by these parameters. The use of a single, unnecessarily complex, function is derived from the need to support text replacement as a primitive buffer operation. This function can be replaced by two much simpler ones, for the separate handling of text insertions and range deletions.

We first look at insertions. The insert operation is handled by the function KMN-Insert, described in algorithm 1. Line 1 skips all kmn nodes that are not affected by the change. Note that in order to calculate the segment of the buffer spanned by a node we take into account both its stable and its inserted regions, which are the fields that describe the state of the current buffer. Once we have found the node that is affected by the change, we are left with two cases to consider:

Case 1 Insertion was made inside the stable region of a node (lines 3-8.) The node is then split into two: the first node holds the part of the stable region (up to the insert position), and the newly inserted region; the second node holds the other part of the original stable region (after the insert position) as well as all previous changes recorded by the original node.

Case 2 Insertion was made inside the inserted region of a node, or immediately after it (lines 9-10.) In this case the inserted region of the node is simply increased to include the new region.

Note that, unlike the KMN-Insert function described in [2], text insertion does not require any node merges in order for the list to remain compact.

Range deletion, on the other hand, is somewhat harder. This operation is handled by the function KMN-Delete, described in algorithm 2. As with text insertion, we begin by finding the first node N affected by the change (lines 1-2.) We then consider two cases:

Case 1 The deleted range is contained in its entirety within the stable region of N (lines 3-7.) We split N in a way that is similar to that described in algorithm 1, only with the m field now taken into account.


Case 2 The deleted range goes to the end of the stable region of N, and perhaps beyond that (lines 8-31.) This case involves several stages (the variable d always holds the number of characters that still need to be deleted):

1. Delete as many characters as possible from the stable region of N, updating its stable and deleted regions (lines 9-13.)

2. For this stage we note that the inserted region of a node, along with the stable region of the next node, always forms a contiguous segment of the text buffer. Remove all nodes for which the inserted region of their predecessor, along with their stable region, is completely included within the deleted range (lines 14-19.) All previous deletions in these nodes are incorporated into the deleted region of N, and the inserted region of N is set to be the inserted region of the last node removed (since all other previously-inserted regions were deleted.)

3. Delete as many characters as possible from the inserted region of N (lines 20-21.)

4. If there are still characters to delete, remove them from the stable region of the next node (lines 22-24.)

5. N may now contain no changes at all. If that is the case, merge it with the next node (lines 25-30.)

The time complexity of both algorithms is linear in the length of the kmn-list. The original paper raises concerns as to the increase in this length, and suggests that it should be bounded by a fixed size [2]. Such a bound requires further handling by the algorithms. Beetem and Beetem give 12 as a reasonable upper bound, based on their experiments. My own experiments, however, show that the list never grows, in practice, beyond two nodes: Meddle rescans the text every 300 milliseconds, and successful scanning resets the kmn-list to a single node (see section 4.) For the list to grow beyond two nodes, a user needs to perform two non-consecutive changes in less than 0.3 seconds, which is practically impossible. The only way for the list to grow further is for scanner errors to occur, which is quite rare.

As a result of this analysis I have decided not to complicate the algorithms by using an upper bound. In fact, even though kmn-lists correctly handle any number of changes, they may not be necessary in practice: if scanning is triggered by the first non-consecutive change instead of a timer, we can simply use a single record of the change’s position, type and length. This technique is left for future study.

4 Incremental Scanning

Scanning refers to the process that takes as input a stream of characters, and outputs a stream of tokens. In the case of programming languages, tokens are the terminal symbols defined by the grammar of the language7. The objective of incremental scanning is to scan as few tokens as possible, and still achieve

7 Some tokens required for syntactic analysis, such as comments and preprocessor directives, are not defined in the grammar, and will be treated in a special way.


Algorithm 1 KMN-Insert(inspos, len)

1: Skip all nodes for which pos[node] + node.k + node.n ≤ inspos
2: node ← first node for which the above condition does not hold
3: if pos[node] + node.k > inspos then
4:   Add a node new before node
5:   new.k ← inspos − pos[node], new.m ← 0, new.n ← len
6:   node.k ← node.k − new.k
7: else
8:   node.n ← node.n + len
9: end if
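For concreteness, here is a direct Python transcription of the insertion procedure, representing the kmn-list as a Python list of [k, m, n] triples. This is a sketch for illustration (Meddle itself uses a linked list), and it adds a guard for insertion at the very end of the buffer:

```python
def kmn_insert(nodes, inspos, length):
    """Record the insertion of `length` characters at `inspos`
    (coordinates relative to the current buffer)."""
    pos, i = 0, 0
    # Skip all nodes for which pos + k + n <= inspos; the stable and
    # inserted regions together span the node's share of the new buffer.
    while i < len(nodes) and pos + nodes[i][0] + nodes[i][2] <= inspos:
        pos += nodes[i][0] + nodes[i][2]
        i += 1
    if i == len(nodes):
        # Insertion at the very end of the buffer: grow the last node's
        # inserted region (guard added in this sketch for robustness).
        nodes[-1][2] += length
        return
    k, m, n = nodes[i]
    if pos + k > inspos:
        # Case 1: inside the stable region -- split the node in two.
        new = [inspos - pos, 0, length]
        nodes[i] = [k - new[0], m, n]
        nodes.insert(i, new)
    else:
        # Case 2: inside the inserted region -- simply grow it.
        nodes[i][2] = n + length
```

For example, starting from a clean 7-character buffer, `[[7, 0, 0]]`, inserting 5 characters at position 3 splits the node into `[[3, 0, 5], [4, 0, 0]]`.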

Algorithm 2 KMN-Delete(delpos, len)

1: Skip all nodes for which pos[node] + node.k + node.n ≤ delpos
2: node ← first node for which the above condition does not hold
3: if pos[node] + node.k > delpos + len then
4:   Add a node new before node
5:   new.k ← delpos − pos[node]
6:   new.m ← len
7:   new.n ← 0
8:   node.k ← node.k − new.k − len
9: else
10:   t ← min{len, pos[node] + node.k − delpos}
11:   d ← len − t
12:   node.k ← node.k − t
13:   node.m ← node.m + t
14:   while next[node] ≠ Nil and d ≥ node.n + next[node].k do
15:     d ← d − node.n − next[node].k
16:     node.n ← next[node].n
17:     node.m ← node.m + next[node].k + next[node].m
18:     Remove next[node]
19:   end while
20:   t ← min{d, node.n}, node.n ← node.n − t
21:   d ← d − t
22:   if d > 0 then
23:     next[node].k ← next[node].k − d, node.m ← node.m + d
24:   end if
25:   if node.m = 0 and node.n = 0 and next[node] ≠ Nil then
26:     node.k ← node.k + next[node].k
27:     node.m ← next[node].m
28:     node.n ← next[node].n
29:     Remove next[node]
30:   end if
31: end if
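As with insertion, the deletion procedure can be transcribed into Python over a list of [k, m, n] triples. This sketch follows the five stages described above; note that characters removed from a following node’s stable region are also credited to the current node’s deleted region, so that the k/m sequence keeps depicting the old buffer.

```python
def kmn_delete(nodes, delpos, length):
    """Record the deletion of `length` characters starting at `delpos`
    (coordinates relative to the current buffer)."""
    pos, i = 0, 0
    while pos + nodes[i][0] + nodes[i][2] <= delpos:
        pos += nodes[i][0] + nodes[i][2]
        i += 1
    node = nodes[i]
    if pos + node[0] > delpos + length:
        # Case 1: the range lies wholly inside the stable region -- split.
        new = [delpos - pos, length, 0]
        node[0] -= new[0] + length
        nodes.insert(i, new)
        return
    # Case 2, stage 1: consume the tail of the stable region.
    t = min(length, pos + node[0] - delpos)
    node[0] -= t
    node[1] += t
    d = length - t                      # characters still to delete
    # Stage 2: swallow whole (inserted, next-stable) segments.
    while i + 1 < len(nodes) and d >= node[2] + nodes[i + 1][0]:
        nxt = nodes.pop(i + 1)
        d -= node[2] + nxt[0]
        node[1] += nxt[0] + nxt[1]      # deleted old chars join our m field
        node[2] = nxt[2]                # keep only the last inserted region
    # Stage 3: consume the inserted region.
    t = min(d, node[2])
    node[2] -= t
    d -= t
    # Stage 4: consume the head of the next node's stable region.
    if d > 0:
        nodes[i + 1][0] -= d
        node[1] += d
    # Stage 5: merge away a node that no longer records any changes.
    if node[1] == 0 and node[2] == 0 and i + 1 < len(nodes):
        nxt = nodes.pop(i + 1)
        node[0] += nxt[0]
        node[1], node[2] = nxt[1], nxt[2]
```

For instance, deleting a range that exactly cancels a previous insertion, as in `kmn_delete([[2, 0, 2], [4, 0, 0]], 2, 2)`, collapses the list back to the single node `[[6, 0, 0]]` (stage 5).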


Figure 3: The token streams for the code segments given in example 2. Each token holds a terminal symbol from the grammar, and a pair of coordinates (p, l), which specify its position and length, respectively.

a complete and correct stream of tokens for the edited text. This requires an iteration over a list of modifications made to the buffer since the last scan, and correct identification of the tokens that were affected by these changes.

The task of identifying the tokens that need to be rescanned may be harder than it seems at first sight. First of all, not all tokens affected by a change to the buffer need to actually be rescanned: all tokens that follow a change (either insertion or deletion) to the text are shifted, but do not necessarily change their contents. On the other hand, some tokens that reside outside the changed region may still need to be rescanned, as shown in example 2.

Example 2. Consider the following code written in some programming language:

nn := 10

pri nt n

The errors in the code are then corrected by the programmer, and the new codehas the following form:

n := 10

print n

The token streams for the original and modified code are given in figure 3a and figure 3b, respectively. Each token shows the terminal symbol it represents, as well as its position in the buffer and its length (print is assumed to be a reserved keyword.) The first token was changed, and needs to be rescanned. The next 3 tokens changed their positions, but no scanning is required for them. The next token, however, must be rescanned, even though no change was made to the region spanned by it. If the scanner does not recognise that, it will leave pri as a separate token from nt, which results in a completely different semantic interpretation. □

The algorithm for incremental scanning is based on the structure of the kmn-list. Informally, this algorithm can be stated as follows:

As long as we are in a stable region, and the current token is fully contained within that region, advance to the next token. Otherwise, start scanning tokens and separators until we are again in a stable region. [2]


While this definition is correct, and captures the gist of the algorithm, it is not accurate enough for a concrete algorithm to be developed. For instance, it does not specify where exactly scanning should start, and which tokens are replaced during (re)scanning. The pseudo-code for incremental scanning given in [2] is, on the other hand, difficult to comprehend and relies heavily on the implementation of the parse tree given in that paper. These difficulties required that a new algorithm for incremental scanning be developed for Meddle. The function Inc-Scan, given in algorithm 3, is based on the general idea quoted above, but provides complete details as to the parts of the text that need to be rescanned, and the tokens that are replaced as a result.

Inc-Scan accepts a list of text modifications (in the form of a kmn-list), and a list of tokens. The variable node holds the head of the kmn-list, shift is the value by which to change the position of tokens (positive or negative), and last points to the last token that does not require scanning (initially set to Nil to suggest that all tokens are candidates for rescanning.)

The algorithm begins by evaluating the changes in the first node of the kmn-list (since the list is compact, each node in that list, with the exception of the last, must contain some changes.) It first skips all tokens that are contained in the stable region defined by this node (lines 3-6.) A token token is said to be within the stable region of the node node if

pos[token] + length[token] + shift < pos[node] + node.k

Note that this is a strict inequality, or else tokens that abut unstable regions will not be rescanned (such as pri in example 2.)
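The containment test, including the strict inequality, is trivial to state in code. The sketch below is illustrative only, with the fields written as plain parameters:

```python
def in_stable_region(token_pos, token_len, shift, node_pos, node_k):
    """True iff the (shifted) token lies strictly inside the stable
    region of the kmn node. A token that merely touches the region's
    right edge must still be rescanned."""
    return token_pos + token_len + shift < node_pos + node_k
```

With a stable region of 5 characters starting at position 0, a 3-character token at position 0 is safely inside, while a 5-character token at position 0 touches the edge and must be rescanned.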

The next stage is to rescan tokens (lines 7-17.) Scanned tokens are inserted into a temporary list, so scanner errors do not damage the original token list. Scanning is then started on the first character that immediately follows last (lines 8-12.) Recall that last holds the last token that does not need to be rescanned. This means that token delimiters may be rescanned, even if they are contained within a stable region. Though this may seem somewhat less efficient, it greatly simplifies the incremental scanning algorithm (and overall increases its efficiency.)

The tokens are scanned and appended to the temporary list, until a stable region is reached again (lines 13-17.) Once a token has been scanned, the function KMN-Move, described in algorithm 4, sets the current kmn node to be the one that spans the new position of the scanner. This, however, is not done by a simple iteration over the kmn-list. Instead, all nodes before the position passed to KMN-Move are considered as "consumed", i.e., the changes that were represented by these nodes were applied. KMN-Move therefore resets the current node, and merges the changes represented by the next one into it. This is why node does not change, in effect, but is rather expanded to contain its successors. The shift value is updated while kmn nodes are consumed.

Before the new tokens can be merged into the token list, the ones invalidated by the recent changes need to be removed (lines 18-19.) The tokens that should be deleted are all the successors of last that end before the new position of the scanner. However, this position needs to be translated first (line 18), as it is given in coordinates relative to the current buffer, while the tokens to be deleted hold their positions relative to the old buffer.


Algorithm 3 Inc-Scan(kmn list, token list)

1: node ← head[kmn list], shift ← 0, last ← Nil
2: repeat
3:   for all token in token list such that token ∈ stable[node] do
4:     pos[token] ← pos[token] + shift
5:     last ← token
6:   end for
7:   Create an empty token list temp list
8:   if last ≠ Nil then
9:     Start scanning at pos[last] + length[last]
10:  else
11:    Start scanning at position 0
12:  end if
13:  repeat
14:    Read a new token token
15:    Append token to temp list
16:    shift ← KMN-Move(pos[token] + length[token], node, shift)
17:  until token ∈ stable[node]
18:  old scan pos ← pos[token] + length[token] − shift
19:  Remove all tokens after last that end before old scan pos
20:  Merge temp list into token list after last
21:  last ← last scanned token
22: until The kmn list contains no changes, or a scanner error has occurred

Algorithm 4 KMN-Move(pos, node, shift)

1: while Position[node] + node.k + node.n ≤ pos do
2:   shift ← shift + node.n − node.m
3:   node.k ← node.k + node.n
4:   if Next[node] ≠ Nil then
5:     node.k ← node.k + Next[node].k
6:     node.m ← Next[node].m
7:     node.n ← Next[node].n
8:   else
9:     node.m ← 0
10:    node.n ← 0
11:  end if
12: end while
13: return shift
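A direct Python transcription of KMN-Move may clarify how kmn nodes are "consumed". This is a sketch under two stated assumptions: the pseudo-code does not say what happens to a consumed successor, so the sketch unlinks it from the list; and the sketch stops once the last node (which carries no changes) has been folded in, since its fields can no longer affect the shift.

```python
class KmnNode:
    """Hypothetical kmn node: k unchanged characters starting at
    `position`, followed by m deleted and n inserted characters."""
    def __init__(self, position, k, m, n, next=None):
        self.position, self.k, self.m, self.n = position, k, m, n
        self.next = next

def kmn_move(pos, node, shift):
    # Consume every node whose modified span ends at or before pos,
    # folding its changes into the current node (algorithm 4).
    while node.position + node.k + node.n <= pos:
        shift += node.n - node.m
        node.k += node.n
        if node.next is not None:
            nxt = node.next
            node.k += nxt.k
            node.m, node.n = nxt.m, nxt.n
            node.next = nxt.next  # assumption: the consumed node is unlinked
        else:
            node.m, node.n = 0, 0
            break  # assumption: the final, changeless node ends the loop
    return shift
```

For a two-node list describing "3 stable, delete 1, insert 2" followed by "4 stable, delete 2, insert 1", moving past position 5 consumes the first change (shift grows by 2 − 1 = 1) and merges the second node into the first.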


File Size    Batch Scanner    Incremental Scanner
1 Kb         3.2 ms           0.2 ms
10 Kb        13.1 ms          0.4 ms
100 Kb       143.2 ms         3.3 ms

Table 1: The average scan times for different file sizes. Each file was modified using the same set of tests, described in appendix A. Note that the times specified for the incremental scanner include the initial (batch-like) scan.

The final step is to merge the temporary list into the original stream, replacing the deleted tokens (lines 20-21.) The algorithm goes back to the first stage, using the updated kmn node, until all changes are applied, or a scanner error occurs. A successful scan resets the kmn list, so that the single remaining node does not contain any changes (i.e., node.m = node.n = 0.) A scanner error causes the algorithm to abort. This should happen immediately after the scanner returns with an error result, so that the kmn nodes holding the next changes are not consumed.

Even though only modified tokens are rescanned, Inc-Scan is linear in the number of tokens in the stream. This is because it needs to iterate over the entire stream in order to determine which tokens need to be rescanned, and to shift the tokens that are kept. This raises the natural question of whether incremental scanning is at all necessary. This concern is further justified by the nature of the scanning process itself: scanners are based on deterministic finite automata, so their time complexity is linear in the size of the input text.

Empirical tests show, however, that incremental scanning is significantly faster than batch-mode scanning (see table 1.) Moreover, the difference between the scanning times of the two methods increases with the size of the text. These results cannot be solely ascribed to the difference between linearity in the size of the token stream vs. linearity in the size of the text (characters), as this should have resulted in a constant improvement only. A better explanation is that scanning involves more than just identifying tokens. It involves allocating memory, maintaining the stream (which, in the case of Meddle, involves more than just a linked list, as explained in section 5) and other tasks, such as updating the syntax highlighting tags.

5 Incremental Parsing

The most challenging part of the editing-scanning-parsing cycle is incremental parsing. Not only is parsing inherently more difficult than scanning, but identifying and managing sub-trees is also significantly harder than identifying and managing linear streams of tokens.

5.1 The P-Tree Data Structure

The purpose of the P-Tree (for Parse Tree) data structure is to facilitate incremental parsing, based on the Jalili-Gallier algorithm. Moreover, a P-Tree provides multiple views, both as a parse tree and as a list of tokens⁸, which allows it to be used for the scanning procedure as well as for parsing. Finally, a P-Tree is used to implement both stacks (the parse stack and tree-reuse stack) required by the algorithm.

The P-Tree is actually not a parse tree, but rather an ordered parse forest. At any given point, a P-Tree holds a list of correctly parsed trees, referred to as the stream, which are candidates for reuse as parts of a new tree being constructed. These trees may also be singletons, in which case the node is a terminal node that needs to be parsed.

Each node in a P-Tree has the following fields:

• The grammar symbol (terminal or non-terminal) represented by this node;

• Position in the text buffer (leftmost character);

• Length (total number of characters);

• The state of the scanner after reading this token (for terminal nodes only);

• The state of the parser before shifting this node;

• The state of the parser after shifting this node⁹;

• The look-ahead symbol that was used to determine that this node needs to be shifted;

• Pointers to the node’s parent, siblings, first child and last child.

The stream of trees is maintained by keeping two permanent nodes, or sentinels [5], at the beginning of the stream (BOS) and at its end (EOS.) The sibling pointers of each node on the stream point to the immediate neighbours of that node on the stream, which allows the implementation of the stream as a linked list. Note that this is a generalisation of the sibling definition for parse trees: nodes that are not on the stream are internal nodes or leaves of the trees in the parse forest, and point to their siblings in the usual way. Nodes on the stream, on the other hand, are roots of trees in the parse forest, so their siblings are roots of other trees.

In addition to the stream of trees, the P-Tree structure also maintains the list of tokens, as generated by the scanner. This list is required both for incremental scanning, as described by the algorithms presented in section 4, and for parsing, as will be shown later. The first token in the list is found by taking the node on the stream immediately following the BOS, and descending along its left branch until a terminal node (the leaf) is found. The next token operation is implemented as follows:

• If the current token has an immediate successor, it is the next token inthe list;

⁸Henceforth, we will use the term tokens to refer to the P-Tree leaves whenever we discuss the scanner, and the term terminal nodes whenever we discuss the parser. Note, however, that in effect these are the same nodes.

⁹Recall that the stack holds pairs of nodes and states. This is implemented by letting the nodes hold the state of the parser, which is also useful for tree-reuse, as will be shown later.


Figure 4: Implementing reduction on a P-Tree. The string aBb is reduced to the single non-terminal C, based on the look-ahead symbol c. A new non-terminal node for C replaces the string nodes on the stream, which become its children. The rectangular nodes are terminals, while the circular nodes are non-terminals.

• Otherwise, the current token is the right-most child of its parent. Move up the tree until a node with a successor is found. Take this node, and move along the left branch of its subtree, until a terminal node is found.
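The next token operation can be sketched as follows. This is a hypothetical Python rendering: the node layout is reduced to the parent, sibling and child pointers described above, with a node's sibling pointer doubling as the next-stream-node pointer for roots.

```python
class PNode:
    """Simplified P-Tree node: parent, next-sibling and child pointers."""
    def __init__(self, symbol, terminal, children=()):
        self.symbol, self.terminal = symbol, terminal
        self.parent = None
        self.next = None  # next sibling (for a stream node: next stream node)
        self.children = list(children)
        for i, child in enumerate(self.children):
            child.parent = self
            if i + 1 < len(self.children):
                child.next = self.children[i + 1]

def leftmost_terminal(node):
    # descend along the left branch until a leaf is found
    while not node.terminal:
        node = node.children[0]
    return node

def next_token(tok):
    """Successor of a terminal in the token-list view of a P-Tree."""
    node = tok
    while node is not None and node.next is None:
        node = node.parent  # climb until a node with a successor is found
    return leftmost_terminal(node.next) if node is not None else None
```

For the tree C → a B b, with B → x y, repeatedly applying next_token starting at a visits the tokens a, x, y, b in textual order.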

A complete parse tree is represented by a stream that contains the BOS node, followed by the root of the parse tree, and ends in the EOS node. Note, however, that if an error occurs during parsing, the P-Tree structure still holds a correct stream of terminals that were not yet shifted, and all non-terminals that were shifted before the error was encountered. This stream can thus be used for re-parsing after the error is corrected, as explained in the next sections.

5.2 The Parser

The parser works by keeping a pointer to a node on the stream, which symbolises the top of the stack. All nodes on the stream to the left of this pointer are on the stack, while the nodes to the right are tokens that were not yet parsed (in the case of terminal nodes), or the roots of trees that are considered for reuse. Thus the stacks suggested by Jalili and Gallier do not need to be implemented separately, which greatly simplifies the algorithms [5].

After the initial scan, the stream of the P-Tree is composed of terminal nodes only, which correspond to the scanned stream of tokens. The parser then works just as a batch parser: each token is considered as the look-ahead symbol of the tokens preceding it, and once all reductions have been applied, is pushed on the stack (if the scanned text is syntactically correct.) Pushing is achieved by simply advancing the parser's pointer to the next node on the stream.

Reducing a string to a single non-terminal symbol is also easy: a new node is created for the non-terminal symbol. The first and last children of the new node are set to be the left-most and right-most nodes in the string, respectively. Finally, the string is detached from the stream, by having its immediate predecessor and successor nodes point to the new node. Thus the new non-terminal node replaces the nodes of the string on the stream. This procedure is delineated in figure 4.

Note that very few pointers actually need to be changed in this procedure: the structures of the trees rooted at the string nodes do not change, nor does the order of the nodes in the string. Only the semantics of the order changes: instead of pointing to the next and previous nodes on the stream, each node in the string points to the next and previous child of the new tree root. The only pointers that need to be handled are those of the new node, the parent pointers of the string nodes, the previous node pointer of the left-most node in the string and the next node pointer of the right-most node in the string (the last two are set to Nil.)
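This reduction step can be sketched in Python. The node layout is hypothetical (doubly-linked sibling/stream pointers, plus first- and last-child pointers, as described in section 5.1); only the handful of pointers listed above are touched.

```python
class Node:
    """Simplified P-Tree node with doubly-linked sibling/stream pointers."""
    def __init__(self, symbol, terminal=True):
        self.symbol, self.terminal = symbol, terminal
        self.parent = self.prev = self.next = None
        self.first_child = self.last_child = None

def link(nodes):
    # chain nodes together as stream neighbours
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a

def reduce(symbol, first, last):
    """Replace the stream segment first..last by a new non-terminal whose
    children the segment nodes become; their internal order is untouched."""
    new = Node(symbol, terminal=False)
    new.first_child, new.last_child = first, last
    node = first
    while node is not None:                     # re-parent the string nodes
        node.parent = new
        node = node.next if node is not last else None
    new.prev, new.next = first.prev, last.next  # splice into the stream
    if new.prev is not None:
        new.prev.next = new
    if new.next is not None:
        new.next.prev = new
    first.prev = last.next = None               # detach the segment's ends
    return new
```

Reducing a B b to C in a stream BOS a B b c EOS leaves BOS C c EOS on the stream, with a, B and b as the children of C, mirroring figure 4.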

5.3 Tree Reuse

The heart of the incremental parsing algorithm is tree reuse. The goal of the algorithm is to identify as many trees as possible on the stream that can be pushed on the stack without examining their internal structure.

The incremental parsing algorithm implemented in Meddle is based on the tree reuse rule suggested in proposition 1: a non-terminal node representing a sub-tree created during a previous parse can be shifted immediately if the state of the parser and the look-ahead symbol match the ones used during the last parse.

Note, however, that while this condition is sufficient, it is not necessary. In fact, the non-terminal node can be shifted even if the look-ahead symbol has changed, as long as it is in the look-ahead set for the state of the parser before shifting [5]. However, since Meddle is based on the output of Bison, which does not provide this information, it uses the original condition as specified by Jalili and Gallier.

If the tree reuse condition cannot be matched for a given node, the tree rooted at this node needs to be broken down. There are two ways to do so, based on the part of the condition that has failed:

Left break-down All non-terminal nodes on the left-most branch (that is, all nodes on that branch but the leaf) are removed. The root of the tree, which resides on the stream, is replaced by a string composed of the immediate children of the removed nodes, from left to right.

Right break-down All non-terminal nodes on the right-most branch are removed. The root of the tree is replaced by a string composed of the immediate children of the removed nodes, from left to right.

Figure 5 shows examples of left and right break-downs. Note that these operations are referred to in [3] as Replace and Undo-Reductions, respectively.

Unfortunately, [3] does not clearly justify the use of these operations. To explain them, we need to make the following two observations:

Proposition 2. Let α1, ..., αn ∈ (VT ∪ VN)+ be strings in an LR(1) grammar G = (VT, VN, P, S). If a parser performs the set of reductions α1 ⇒ α2 ⇒ ... ⇒ αn, then the parser's state before shifting the left-most symbol of each string αi is the same.¹⁰

Proof. Let σi denote the first symbol of the string αi, and let s be the state of the parser just before shifting σi. Before the parser shifts the left-most symbol in string αi+1, denoted by σi+1, it must pop all symbols (and states) that correspond to the portion of the string in αi that is spanned by that symbol. By the correctness of the parsing algorithm, and since both σi and σi+1 are on the left-most branch of the tree, the last symbol to be popped is σi. This brings the parser's stack back to its state before σi was shifted, and thus σi+1 is shifted

¹⁰The term "reduction" and the ⇒ symbol are used here in their generalised form.


Figure 5: The operations of left (a) and right (b) break-downs.

using the same parser state as σi. By induction, this is true for all symbols in the left-most branch.

Proposition 3. Let α1, ..., αn ∈ (VT ∪ VN)+ be strings in an LR(1) grammar G = (VT, VN, P, S), and let A2, ..., An+1 be non-terminal symbols. If a parser performs the set of reductions α1 ⇒ α2A2 ⇒ ... ⇒ αnAn ⇒ An+1, then all these reductions are made based on the same look-ahead symbol.

Proof. Consider the first reduction α1 ⇒ α2A2. This reduction is performed based on the first terminal symbol that follows α1. A2 then replaces a substring of α1 that is terminated by the right-most symbol of that string. The next terminal symbol has thus not been shifted or reduced, nor was any terminal generated before it. Similarly, the right-most symbol in the yield of Ai is Ai−1 for i ≥ 3, so the next terminal symbol, which serves as the look-ahead, remains the same.

According to proposition 2, if the current state of the parser does not match the one saved on the root of the tree being considered for reuse, this tree cannot be shifted. Moreover, none of the nodes on the left branch of this tree should be shifted, and so this branch needs to be discarded (except for the leaf, which is a terminal node.) This is achieved by the left break-down operation. Note that breaking down the tree on the left branch does not mean that the internal trees should be broken down as well, nor does it guarantee their reuse. Instead, these trees are reconsidered in turn.

On the other hand, if the state is matched, but the look-ahead symbol has changed, then none of the reductions referred to by the right branch of the tree should occur (as is suggested by proposition 3.) These reductions are therefore "undone" by the right break-down operation.


Figure 6: The results of applying the Divide operation on a parse tree. The tree is divided from the root to the terminal d node. The shaded nodes are the ones being removed.

5.4 Handling Changes to The Token Stream

Trees that are considered for reuse can only be those for which the yield (the string of tokens spanned by the root of the tree) has not changed since the last parse. The stream of tokens, however, is modified between each two invocations of the parser (or else there is no need for re-parsing.) The effects of these changes on the stream and on the operation of the parser therefore need to be described.

Note that unlike the algorithms presented in [3] and [5], the one given in this paper leaves the task of modifying the parse tree as a result of changes to the token list to the scanner, instead of the parser. This delegation of responsibility is natural, as the scanner generates and modifies the token list. Two advantages of this approach are that there is no need to keep a record of modified tokens between parses, and that the P-Tree maintains a usable stream at all times. One major drawback, on the other hand, is that decisions regarding the parse tree may be taken prematurely, before the entire set of changes between parses is examined. While this does not lead to incorrect results, it may result in sub-optimal behaviour.

Changes to the stream of tokens can be either the introduction of new tokens or the removal of existing ones. As long as the P-Tree holds only terminal nodes (i.e., before the first parse), handling changes to the stream is relatively easy. Things become complicated when the P-Tree includes previously parsed trees, as trees for which the yield has changed no longer correctly describe the text being edited.

In order to present the procedures for inserting and removing nodes from a (fully or partially) parsed stream, we first need to describe the Divide procedure, as suggested in [3]. Given a leaf (terminal) node in a parse tree, Divide scans the path from the root to that node, and removes all nodes on this path (except for the leaf.) All immediate children of the removed nodes are added to the stream, in left-to-right order (see figure 6.) It is easy to see that left and right break-downs are special cases of the divide operation, where the leaves are the left-most and right-most ones, respectively.

While [3] considers divisions from the root to a terminal node only, the Divide operation can be generalised to internal (non-terminal) nodes. The algorithm works the same way as it does for terminals. In essence, the generalised Divide operation can be described as bringing a P-Tree node to the stream, by discarding its ancestors. The siblings of this node are also brought to the stream as a result.
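The generalised Divide can be sketched bottom-up: repeatedly dissolve the node's parent, splicing the parent's children into its place, until the node sits on the stream. This is a hypothetical Python rendering with the same simplified node layout as before (sibling pointers doubling as stream pointers for roots).

```python
class Node:
    """Simplified P-Tree node (sibling pointers double as stream pointers)."""
    def __init__(self, symbol):
        self.symbol = symbol
        self.parent = self.prev = self.next = None
        self.first_child = self.last_child = None

def nonterm(symbol, children):
    # build a non-terminal over a list of children
    p = Node(symbol)
    p.first_child, p.last_child = children[0], children[-1]
    for a, b in zip(children, children[1:]):
        a.next, b.prev = b, a
    for c in children:
        c.parent = p
    return p

def dissolve(p):
    # replace p, among its siblings or on the stream, by its own children
    first, last = p.first_child, p.last_child
    gp = p.parent
    node = first
    while node is not None:          # children move up one level
        node.parent = gp
        node = node.next if node is not last else None
    first.prev, last.next = p.prev, p.next
    if p.prev is not None:
        p.prev.next = first
    if p.next is not None:
        p.next.prev = last
    if gp is not None:
        if gp.first_child is p:
            gp.first_child = first
        if gp.last_child is p:
            gp.last_child = last

def divide(node):
    """Bring node to the stream by discarding all of its ancestors;
    the siblings met on the way land on the stream as well."""
    while node.parent is not None:
        dissolve(node.parent)
```

Dividing to the terminal d in a tree S → B D, with D → d e (as in figure 6), leaves the string B d e on the stream.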


Deleting tokens from a P-Tree is achieved by the procedure Remove-Tokens, described in algorithm 5. Recall from algorithm 3 that the scanner needs to remove all tokens after the last one that was not modified, and up to the new position of the scanner (line 19.) Since the list of tokens is incorporated into a P-Tree, we need to specify how this removal affects the trees in which these tokens are included, and how it affects the stream. Naturally, if all tokens to remove are on the stream (not yet parsed successfully), their removal is a simple matter of arranging the pointers of the stream nodes immediately before and immediately after them.

The algorithm accepts the first token that needs to be removed, and the last position of the deleted range. Based on this information, the algorithm identifies the last token that is contained in the deleted range (line 1.) In order to maintain a valid P-Tree, nodes can only be removed from the stream. That is, a tree in the P-Tree forest can either be deleted or kept in its entirety. The next step is, therefore, to bring the first and last tokens that need to be removed to the stream. This is accomplished using the Divide operation (lines 2-3.) Now all nodes on the stream between first and last are the roots of trees that need to be completely removed, and are therefore deleted (line 4.)

The procedure Merge-Tokens, listed in algorithm 6, is used to insert tokens into a P-Tree. This procedure accepts the token after which to insert, and a list of new tokens. It begins by finding the highest ancestor of after that is not directly affected by the new tokens¹¹. The generalised Divide procedure is then invoked on this node (line 2), after which it resides on the stream. Finally, the tokens are merged into the stream (line 3.)

5.5 The Algorithm

We are now ready to present the incremental parsing algorithm. This algorithm is based on a regular, batch-mode, LALR(1) parser. Incrementality is achieved through the Next-Terminal function, described in algorithm 7. This function is invoked whenever the parser requests the next token. In essence, Next-Terminal finds the first available terminal node on the stream, while handling all non-terminals before it.

The function accepts two arguments, which describe the current condition of the parser: top is a pointer to the stream node that represents the top of the parse stack, and state is the current state of the parser (this last parameter is not really needed, since each node also holds the state of the parser after it is shifted, so the top node also holds the current state of the parser.)

The function starts by iterating over nodes on the stream of the P-Tree. Recall that the sibling pointers of every node on the stream point to other nodes on the stream, so by starting with top and using the Next operation, we examine only stream nodes. If the node is a non-terminal, the function uses its left-most terminal as a look-ahead symbol to perform all reductions possible (line 3.) The parser is then ready to consider a non-terminal for shifting. If both the state of the parser and the look-ahead symbol match the ones used during the previous parse (lines 4-6), the node is shifted (lines 7-9.) Shifting is done by advancing the top pointer to the shifted node, and setting the new

¹¹This does not mean that the tree is correct: it may need to be modified as a result of a change to its look-ahead symbol. Handling this case is deferred until the parsing phase.


Algorithm 5 Remove-Tokens(first, endpos)

1: last ← the last token for which pos[last] + length[last] ≤ endpos
2: Divide(first)
3: Divide(last)
4: Delete all trees on the stream from first to last (inclusive)

Algorithm 6 Merge-Tokens(after, token list)

1: node ← the highest node for which after is a right-most descendant
2: Divide(node)
3: Merge token list after node and before next[node]

Algorithm 7 Next-Terminal(top, state)

1: node ← next[top]
2: while node is a non-terminal do
3:   Do all reductions based on the left-most terminal in the yield of node
4:   if prev state[node] = state then
5:     la ← the next terminal following the yield of node
6:     if lookahead[node] = la then
7:       top ← node
8:       state ← next state[node]
9:       node ← next[node]
10:    else
11:      node ← Right-Breakdown(node)
12:    end if
13:  else
14:    node ← Left-Breakdown(node)
15:  end if
16: end while
17: return node
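The control flow of Next-Terminal can be sketched independently of a full LR engine. In this hypothetical Python skeleton the LR machinery (the reduction step, the look-ahead computation, and the two break-down helpers) is passed in as callbacks, and prev_state, next_state and lookahead are assumed to be fields saved on each node during the previous parse, matching the node layout of section 5.1. Note that top and state are only updated locally here; in the real parser they are communicated back to the caller.

```python
def next_terminal(top, state, reduce_all, lookahead_of, left_bd, right_bd):
    """Skeleton of algorithm 7: walk the stream from top, reusing
    non-terminals whose saved parser state and look-ahead still match,
    and breaking down the others."""
    node = top.next
    while not node.terminal:
        reduce_all(node)                             # line 3: reductions first
        if node.prev_state == state:                 # line 4: state matches?
            if node.lookahead == lookahead_of(node): # lines 5-6: reuse
                top, state = node, node.next_state   # lines 7-9: shift
                node = node.next
            else:
                node = right_bd(node)                # line 11: undo reductions
        else:
            node = left_bd(node)                     # line 14: discard branch
    return node

# tiny demonstration with stub machinery (SimpleNamespace stands in for nodes)
from types import SimpleNamespace
c = SimpleNamespace(terminal=True, next=None)
A = SimpleNamespace(terminal=False, prev_state=0, next_state=5,
                    lookahead='c', next=c)
top = SimpleNamespace(terminal=True, next=A)
```

With a matching state (0) and look-ahead ('c'), the non-terminal A is reused and the terminal c is returned; with a mismatched state, the left break-down callback is consulted instead.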


File Size    Batch Parser    Incremental Parser
1 Kb         0.3 ms          < 0.1 ms
10 Kb        1.8 ms          0.2 ms
100 Kb       16.2 ms         2.2 ms

Table 2: Average parse times for different file sizes, based on the tests described in appendix A. The results given for the incremental parser include the initial parse, but do not include the scanning time.

state of the parser to the one saved by this node (recall that a node holds the parser's states before and after it was shifted.) If the state is the same, but the look-ahead symbol has changed, the tree needs to be broken down along its right branch (line 11), as suggested by proposition 3. Otherwise, the tree is broken along the left branch (line 14), based on proposition 2. After the tree is broken down (either to the left or to the right), its components are considered for reuse in turn. It is assumed that the functions Left-Breakdown and Right-Breakdown return the first node on the stream resulting from their execution.

5.6 Results

The incremental parser was compared with a batch parser on files of different sizes. The tests included manual insertions and deletions, as well as cut and paste operations, at the beginning, middle and end of the files (see appendix A for a description of these tests.) The results are given in table 2.

While these numbers look promising, they fail to capture the true cost of incremental parsing. Recall that some of the parser's duties were delegated to the scanner, which is usually invoked multiple times between parses¹². Thus any true comparison between the batch and incremental parsers must include the scanning times as well. Note that the scanning times given in table 1 cannot be used here, since the scanner was tested without the parsing code (the stream was kept with terminals only, so no maintenance was required by the scanner.)

The first tests that took into account both the parsing and scanning times were disappointing: the incremental parser showed only a marginal improvement over the batch parser. Indeed, Meddle felt sluggish while editing large source files (100 kilobytes.) These poor results were due to the scanner, which, when required to perform the additional tasks of maintaining the stream, exhibited performance close to that of the batch scanner.

At first it seemed that the approach described in this paper for implementing incremental parsing had failed: if indeed the scanner cannot maintain the stream efficiently, then this task needs to be returned to the parser. In that case, we no longer have a valid P-Tree at all times, which, in turn, requires that some kind of bookkeeping of the token list be kept between parses. Since this cannot be done by the P-Tree in its current form, the entire system looked like a house of cards.

¹²The results in this table may be misleading in one more sense. Comparing table 1 with table 2 suggests that parsing is faster than scanning. This, however, is not the case, as scanning times include updates to the visual tag system, while the parser, due to technical difficulties, currently builds the parse tree and nothing more.


File Size    Batch Parser    Incremental Parser
1 Kb         11.3 ms         1.0 ms
10 Kb        37.1 ms         3.2 ms
100 Kb       624.8 ms        24.2 ms

Table 3: The combined scanning and parsing results for the batch and incremental parsers. Scanning times were accumulated between parses, and added to the run time of the parser itself.

However, using a profiler to analyse the run-time behaviour of Meddle's scanner resulted in a surprising revelation: it turned out that the cause of the poor performance displayed by Meddle was not the code required for maintaining a valid stream by the scanner (i.e., the functions that implement the Remove-Tokens and Divide algorithms.) Instead, most of the time was consumed by two seemingly-harmless functions: ptree_next_term(), which is responsible for implementing the token-list view of the P-Tree, and ptree_shift(), which updates the position and length of nodes during scanning. While the footprint of these functions was quite small, thousands of calls to each of them during a single scan of a large file resulted in a significant performance loss.

The solution for the problem posed by the first of these functions was to introduce one more pointer to the P-Tree node structure. This pointer, implemented in terminal nodes only, always points to the next token. Initially, it has the same value as the pointer that points to the next node in the stream. However, while the latter may change when a token is removed from the stream and becomes the child of a non-terminal, the new pointer is kept during parsing. On the other hand, it may change when new tokens are added to the P-Tree, even if the terminal node itself is not on the stream. Thus iterations over the token list, which are required on every scan, are now much faster. The overhead of maintaining this new pointer is negligible, as this task fits well into the maintenance of the stream pointers.
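The idea can be illustrated with a toy terminal node carrying both pointers (the field names here are hypothetical, not Meddle's actual C structure):

```python
class Term:
    """Toy terminal node with the two successor pointers discussed above."""
    def __init__(self, symbol):
        self.symbol = symbol
        self.next = None        # next stream node: changes as trees are built
        self.next_token = None  # next token in the token list: kept by parsing

a, b, c = Term('a'), Term('b'), Term('c')
a.next, b.next = b, c
a.next_token, b.next_token = b, c

# When a reduction removes b from the stream, only the stream pointer of a
# changes; the scanner's token-list view (a -> b -> c) is untouched.
a.next = c
```

After the simulated reduction, the stream successor of a is c, but the scanner can still iterate a, b, c through the next-token pointers.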

The original idea behind the implementation of ptree_shift() was to use the P-Tree as a search tree for tokens, based on their positions. Thus, the scanner may not need to iterate over the entire token list to find a token that needs to be changed. For the scanner to be able to find the right token, however, all tree nodes must be kept up-to-date when the text changes, even if this change only results in tokens being shifted. In order to achieve this, ptree_shift() updates the requested terminal node, and then modifies all of its ancestors that may have changed as a result. However, since the possible gain of this method is apparently nullified by the overhead of ptree_shift(), the function was changed and now updates the terminal nodes only.

Once the above changes were implemented, the scanner again displayed good performance. In fact, despite the overhead imposed by the P-Tree maintenance tasks, the incremental parser worked almost as fast on complex parse trees as on flat token lists. Table 3 shows a comparison between the batch parser and the incremental parser, with scanning time taken into account.


5.7 Non-Grammar Tokens

An interesting problem, for which I could not find any references in the cited papers, is how to handle tokens that should not be parsed. Most programming languages allow certain tokens to be included in a programme's listing, even though they are not part of the grammar. A common example of such tokens is source comments.

Although non-grammar tokens do not provide any semantic information, they may still be useful from a syntactic point of view, and therefore need to be scanned. The P-Tree structure, however, makes it difficult to support tokens that are not a part of any parse tree. One solution is to discard these tokens during the Merge-Tokens procedure. The main problem with this method is that non-grammar tokens are not kept between scans, which leads to some anomalies in the scanner's operation.

The next-token pointer, discussed in the previous section, gives more flexibility when handling the list of tokens. Thus the actual token list can differ from the (virtual) list of terminals which the parser discovers as it iterates over the stream (note that the parser uses the ptree_next_term() function described above in its original form, as it needs to find the look-ahead symbol of a non-terminal). Maintaining these separate lists, however, requires changes to the Merge-Tokens and Remove-Tokens procedures. Currently, these changes are not implemented in Meddle, so non-grammar tokens are not handled correctly by the parser.

6 Conclusion

The P-Tree data structure provides an easy way of implementing the Jalili-Gallier algorithm for incremental parsing. The result is not optimal, as parse trees may be broken down even if the exact same tree is rebuilt during the following parse. However, interactive environments, such as text editors, are measured by their responsiveness, and not necessarily by their absolute performance. Empirical tests show that Meddle remains fast and responsive even when editing large source files, while maintaining a correct semantic representation of the edited text at all times.

The P-Tree structure, on the other hand, makes it difficult to separate the token list, as seen by the scanner, from the list of terminals, as required by the parser. This separation is sometimes useful, as in the case of tokens that are not included in the language's grammar. The inclusion of a next-token pointer with terminal nodes, one that differs from the next stream-node pointer, results in a better distinction between the parser and scanner views of the P-Tree. These views are still not completely independent, so the handling of non-grammar tokens remains an arduous task.

An issue that remains to be studied is the effect of incremental parsing on attribute grammars. Meddle includes some attribute handling in its code for maintaining the position and length of the yield of each node in the P-Tree (both of which are synthesised attributes). These attributes, however, form a special case, since they provide no semantic information and are usually not needed for code generation. It would be safe to assume that in most cases, if a tree is reused, then the values of the synthesised attributes remain unchanged for all nodes in this tree. Inherited attributes, on the other hand, may change upon shifting of the tree, and this change would need to be propagated to the rest of the nodes.

References

[1] R. A. Ballance, J. Butcher, and S. L. Graham. Grammatical abstraction and incremental syntax analysis in a language-based editor. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, pages 185–198. ACM Press, 1988.

[2] John F. Beetem and Anne F. Beetem. Incremental scanning and parsing with Galaxy. IEEE Transactions on Software Engineering, 17(7):641–651, 1991.

[3] Fahimeh Jalili and Jean H. Gallier. Building friendly parsers. In Proceedings of the 9th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 196–206. ACM Press, 1982.

[4] Tim A. Wagner and Susan L. Graham. Incremental analysis of real programming languages. In Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design and Implementation, pages 31–43. ACM Press, 1997.

[5] Tim A. Wagner and Susan L. Graham. Efficient and flexible incremental parsing. ACM Transactions on Programming Languages and Systems, 20(5):980–1013, 1998.

A Tests

The following tests were conducted to determine the performance of the incremental scanner and parser:

Test 1 Type (i.e., one letter at a time) the following code near the beginning of the file:

typedef struct test1_s {
    char* string;
    int number;
} test1_t;

Test 2 Type the following code at the end of the file (i.e., this text should be appended):

void test2(char* string, int number)
{
    printf("%s,%d\n", string, number);
}


Test 3 Paste the following code around the middle of the file:

int test3(const char* str)
{
    return strlen(str);
}

Test 4 Delete the code added in test 1 by repeatedly using the backspace key.

Test 5 Replace the text pasted in test 3 with the following code:

const char* test5(int number)
{
    char buf[20];
    sprintf(buf, "%d", number);
    return buf;
}

Replacing text can be done by first selecting the old text, and then pasting the new one.
