
Extracting Grammar from Programs: Evolutionary Approach

Matej Črepinšek1, Marjan Mernik1, Faizan Javed2, Barrett R. Bryant2, and Alan Sprague2

1 University of Maribor, Faculty of Electrical Engineering and Computer Science,
Smetanova 17, 2000 Maribor, Slovenia
{matej.crepinsek, marjan.mernik}@uni-mb.si

2 The University of Alabama at Birmingham, Department of Computer and Information Sciences,
Birmingham, AL 35294-1170, U.S.A.
{javedf, bryant, sprague}@cis.uab.edu

Abstract. The paper discusses context-free grammar (CFG) inference using genetic programming with application to inducing grammars from programs written in simple domain-specific languages. Grammar-specific heuristic operators and non-random construction of the initial population are proposed to achieve this task. The suitability of the approach is shown by small examples where the underlying CFGs are successfully inferred.

Keywords. Grammar induction, Grammar inference, Learning from positive and negative examples, Genetic programming

1 Introduction

In the accompanying paper [15] we discussed the search space of regular and context-free grammar inference. The conclusion reached was that, owing to the large search space, the exhaustive (brute-force) approach to grammar induction can only be applied to small positive samples. Hence, a need arose for a different and more efficient approach to exploring the search space. Evolutionary computation [16] is particularly suitable for such kinds of problems. In fact, genetic algorithms have already been applied to the grammar inference problem, with varying results. In this paper another evolutionary approach to CFG learning, Genetic Programming (GP), is presented. Genetic programming [3] is a successful technique for getting computers to solve problems automatically. It has been used successfully in a wide variety of application domains such as data mining, image classification and robotic control. In general, genetic programming works well for problems whose solutions can be expressed as a modestly short program. For example, methods working on typical data structures such as stacks, queues and lists have been successfully evolved using genetic programming in [4]. Specifications (BNF) for domain-specific languages are small enough that we can expect a successful solution to be found using genetic programming. Our previous work [6] was successful in inferring small context-free grammars from positive and negative samples. This paper elaborates on our recent research findings and builds on that previous work.

2 Related Work

The impact of different representations of grammars was explored in [14], where experimental results showed that an evolutionary algorithm using standard context-free grammars (BNF) outperforms those using Greibach Normal Form (GNF), Chomsky Normal Form (CNF) or bit-string representations [5]. This performance differential was attributed to the larger grammar search space of the other representations, a consequence of their more complex grammar forms. The experimental assessment in [14] was very limited due to the large processing time (processing one generation took several hours; using our system, processing one generation takes just a few seconds). This was due to the use of a chart parser, which is commonly used in natural language parsing and can accept

ACM SIGPLAN Notices 39 Vol. 40(4), Apr 2005


ambiguous grammars as well. With this approach a grammar was successfully inferred for the language of correctly balanced and nested brackets. In [1] a genetic algorithm was used on the problem of merging states in the prefix-tree automaton of regular grammar inference. It was shown that the genetic algorithm performs as well as other regular grammar inference algorithms (e.g. RPNI). Variable-length chromosomes with introns were used in stochastic context-free grammar induction in [17]. A genetic algorithm was also used on the problem of labelling nonterminals (the partitioning problem of nonterminals) in context-free grammar inference using completely structured [10] or partially structured samples [11].

3 Genetically Generated Grammars

3.1 Previous work

In this section a short overview of our previous work is presented; for more details see [6]. To infer context-free grammars for domain-specific languages, the genetic programming approach was adopted. In genetic programming, a program is constructed from a terminal set T and a user-defined function set F. The set T contains variables and constants, and the set F contains functions that are a priori believed to be useful for the problem domain. In our case, the set T consists of terminal symbols defined by regular expressions and the set F consists of nonterminal symbols. From these two sets appropriate grammars can be evolved; a grammar can be seen as a domain-specific language for expressing syntax. For effective use of an evolutionary algorithm we have to choose a suitable representation of the problem, suitable genetic operators and parameters, and an evaluation function to determine the fitness of chromosomes. For the encoding of a grammar into a chromosome we used a direct encoding as a list of BNF production rules, as suggested in [14], since this encoding outperforms bit-string representations.

Our earlier GP system starts with a population of randomly generated grammars [6], where the following additional control parameters were introduced to prevent grammars from becoming too large:

– max_prod_size: the maximum number of productions in one grammar,
– max_RHS_size: the maximum number of right-hand-side symbols in one production.

Furthermore, specific one-point crossover, mutation and heuristic operators have been proposed as genetic operators. The one-point crossover is performed in the following manner: two grammars are chosen randomly and are cut at the same random position; the second halves are then swapped between the two grammars. To ensure that both offspring are legal grammars after crossover, the breakpoint position cannot fall in the middle of a production rule. The breakpoint position is chosen randomly from the smaller of the two grammars selected for crossover. An example of the crossover operation is presented in Figure 1. After crossover, grammars undergo mutation, where a symbol in a randomly chosen production is mutated. An example of the mutation operator is presented in Figure 2.

Grammars selected for crossover (crossover point after the first production):

G1: E → int T          G2: E → T E
    T → operator E         E → T
    T → ε                  E → ε
                           T → ε

Offspring:

G1': E → T E           G2': E → int T
     T → operator E         E → T
     T → ε                  E → ε
                            T → ε

Fig. 1. The crossover operator
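Under the list-of-productions representation, the one-point crossover can be sketched as follows (an assumed implementation, not the authors' code); because the breakpoint falls only between productions, both offspring remain legal lists of productions.

```python
import random

# One-point crossover over production lists: the breakpoint is chosen from
# the smaller grammar and never splits a production rule.
def one_point_crossover(g1, g2, rng=random):
    point = rng.randint(1, min(len(g1), len(g2)) - 1)
    return g1[:point] + g2[point:], g2[:point] + g1[point:]

# The two parent grammars of Fig. 1:
parent1 = [("E", ["#int", "T"]), ("T", ["operator", "E"]), ("T", [])]
parent2 = [("E", ["T", "E"]), ("E", ["T"]), ("E", []), ("T", [])]

child1, child2 = one_point_crossover(parent1, parent2, random.Random(0))
```

With the breakpoint after the first production, this reproduces the two offspring shown in Fig. 1.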

To enhance the search, the following heuristic operators have been proposed:


Before mutation:       After mutation:

E → int T              E → int T
T → E E                T → operator E

(mutation point: the first right-hand-side symbol E of the production T → E E)

Fig. 2. The mutation operator
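The mutation operator can be sketched in the same representation (again an assumed implementation): one symbol in a randomly chosen production is replaced by another grammar symbol.

```python
import random

# Mutate a single symbol in a randomly chosen (non-empty) production.
def mutate(grammar, symbols, rng):
    g = [(lhs, list(rhs)) for lhs, rhs in grammar]   # copy the chromosome
    candidates = [i for i, (_, rhs) in enumerate(g) if rhs]
    if not candidates:                               # nothing to mutate
        return g
    i = rng.choice(candidates)                       # pick a production
    j = rng.randrange(len(g[i][1]))                  # pick a symbol position
    g[i][1][j] = rng.choice(sorted(symbols))         # replace the symbol
    return g

# The grammar on the left of Fig. 2:
grammar = [("E", ["#int", "T"]), ("T", ["E", "E"])]
mutant = mutate(grammar, {"#int", "operator", "E", "T"}, random.Random(2))
```

The mutant keeps the shape of the original grammar; only one symbol changes.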

– option operator,

– iteration* operator and

– iteration+ operator

which exploit knowledge of grammars, namely extended BNF (EBNF), where grammar symbols often appear optionally or iteratively. The heuristic operators work in a similar manner to the mutation operator: a symbol in a randomly chosen production is made to appear optionally or iteratively. An example of the option operator is presented in Figure 3.

Before:                After:

E → int T              E → int T
T → operator E         T → operator F
T → ε                  T → ε
                       F → E
                       F → ε

Fig. 3. The option operator

Similar transformations on grammars are performed by the iteration* and iteration+ operators. To ensure that a chromosome represents a legal grammar after crossover, mutation and deletion, a special procedure is performed in which non-reachable or superfluous nonterminal symbols are detected and eliminated.
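The option operator can be sketched as follows (assumed implementation): the chosen symbol X is replaced by a fresh nonterminal F, and the productions F → X and F → ε are appended, making X optional exactly as in Fig. 3.

```python
import itertools
import random

# Make one randomly chosen right-hand-side symbol optional.
def option_operator(grammar, rng):
    g = [(lhs, list(rhs)) for lhs, rhs in grammar]
    candidates = [i for i, (_, rhs) in enumerate(g) if rhs]
    if not candidates:
        return g
    i = rng.choice(candidates)
    j = rng.randrange(len(g[i][1]))
    used = {lhs for lhs, _ in g}
    fresh = next("F%d" % n for n in itertools.count(1)
                 if "F%d" % n not in used)            # fresh nonterminal name
    old = g[i][1][j]
    g[i][1][j] = fresh
    g.append((fresh, [old]))   # F -> X
    g.append((fresh, []))      # F -> epsilon
    return g

# The grammar on the left of Fig. 3:
before = [("E", ["#int", "T"]), ("T", ["operator", "E"]), ("T", [])]
after = option_operator(before, random.Random(3))
```

The iteration* and iteration+ operators differ only in the productions appended for the fresh nonterminal (e.g. F → X F and F → ε for iteration*).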

Chromosomes were evaluated at the end of each generation by testing each grammar on a set of positive and negative samples. For each grammar in the population an LR(1) parser was automatically generated using the compiler generator tool LISA [7]. The generated parser was then run on the fitness cases (Fig. 4). A grammar's fitness value is proportional to the length of the correctly parsed positive sample; thus it is desirable to have a grammar which accepts all the positive samples and rejects all the negative samples.

Many grammars can be concocted which reject the negative samples. However, our search converges towards the desired grammar more quickly when we obtain grammars which accept the positive samples. Hence, it is unproductive to search only in the space of grammars which reject the negative samples. Negative samples are therefore taken into account only when a grammar is capable of accepting all the positive samples. Another reason is that negative samples are needed mainly to prevent overgeneralization of grammars [2]. Keeping these facts in view, the fitness value of each grammar is defined to be between 0 and 1, where the interval 0 .. 0.5 denotes that the grammar did not recognize all positive samples and the interval 0.5 .. 1 denotes that the grammar recognized all positive samples but did not necessarily reject all negative samples. A grammar with fitness value 1 signifies that the generated LR(1) parser successfully parsed all positive samples and rejected all negative samples. For a given grammar[i], its fitness fj(grammar[i]) on the j-th fitness case is defined as:

    fj(grammar[i]) = s / (length(programj) * 2)

    where s = length(successfully parsed programj)

The total fitness f(grammar[i]) is defined as:


    f(grammar[i]) = ( Σ k=1..N fk(grammar[i]) ) / N

A grammar is tested on the set of negative samples only if it successfully recognizes all positive samples. Here, the portion of a negative sample that is successfully parsed is not important. Therefore, the fitness value is defined as:

    f(grammar[i]) = 1.0 − m / (M * 2)

    where m = number of recognized negative samples
          M = number of all negative samples
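The two-stage evaluation above can be sketched as follows (an assumed implementation; `parse` is a hypothetical stand-in for running the generated LR(1) parser and returning the length of the successfully parsed prefix):

```python
# Fitness in [0, 1]: below 0.5 means not all positive samples were recognized;
# 0.5 and above means all positives were accepted and each accepted negative
# sample costs 1 / (M * 2).
def fitness(parse, positives, negatives):
    # Stage 1: average over positive samples of s / (length * 2), in 0 .. 0.5.
    f = sum(parse(p) / (len(p) * 2) for p in positives) / len(positives)
    if f < 0.5:                 # not all positive samples recognized
        return f
    # Stage 2: all positives accepted; penalise each accepted negative.
    m = sum(1 for n in negatives if parse(n) == len(n))
    return 1.0 - m / (len(negatives) * 2)

# Toy fitness cases in the style of the assignment-statement example:
positives = ["a := 9 + 2", "abc := 22"]
negatives = ["22 := d", ":= 2"]

accept_all = lambda s: len(s)                 # parser that accepts everything
fitness(accept_all, positives, negatives)     # 1.0 - 2/4 = 0.5
```

A parser accepting everything scores exactly 0.5: all positives pass, but so do all negatives.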

[Figure: flow diagram. The evolutionary process (selection, crossover and mutation) maintains a population of grammars; for each grammar in the population, the LISA compiler generator produces a parser, the generated parser is run on each fitness case (positive and negative samples), and the successfulness of parsing determines the grammar's fitness value.]

Fig. 4. The evaluation of chromosomes

Experiments performed in [6] show that context-free grammars can be inferred with the described approach. But these context-free grammars have only a small number of productions (e.g. up to 5). For example, we were able to infer the context-free grammar of a language for simple robot movement [6]:

NT1 -> #Begin NT2 #End

NT2 -> #Command NT2

NT2 -> epsilon

or of a language for nested begin end blocks:

NT1 -> #Begin NT2 #End

NT2 -> NT1 NT2

NT2 -> epsilon

On examples where the underlying context-free grammar was more complex, this approach was not successful.

3.2 New ideas and approaches

It was observed that the heuristic operators considerably improved the search process, resulting in fitter induced grammars. Currently, we are using slightly modified versions of the aforementioned heuristic operators: a sequence of right-hand-side symbols (not just a single symbol) can appear optionally or iteratively. However, we were still not able to induce bigger grammars, despite the fact that the respective sub-grammars had previously been induced in a separate process. For example, we were successful in finding a context-free grammar for simple expressions and for simple assignment statements where the right-hand expression can only be a numeric value. But when both sub-languages were combined, our earlier approach failed to find a solution.

Example: The evolution of a grammar for assignment statements with simple arithmetic expressions on the right side

G=50, pop_size=500, pc=0.4, pm=0.4, pheuristic=0.2, max_prod_size=8, max_RHS_size=5

positive cases (N = 4)       negative cases (M = 5)

a := 9 + 2                   22 := d
a := 10 + 2 + 12             d := := 32
abc := 22                    := 2
i := 0                       a :=
j := 1                       i := 10 + 2 + 3
abc := 22 + 3                j := 1
                             + 6

Upon closer analysis of our results, it became clear that the randomly generated initial population was an impediment to the induction process. The search space of all possible grammars is expansive; to reduce this search space, the initial population should exploit knowledge from the positive samples by generating a few valid derivation trees by simple composition of consecutive symbols. For example, Fig. 5 presents one possible derivation tree and the corresponding context-free grammar for the positive sample a := 9 + 2 of the aforementioned example. During this process a sequence of nonterminal symbols which appears iteratively can be detected and an appropriate grammar constructed (see Figs. 6 and 7). Apart from this change in the construction of the initial population, the genetic programming system remained the same. Using this simple enhancement we were able to induce a context-free grammar for this example (Fig. 8).

NT8 → NT7 NT1
NT7 → NT5 NT6
NT6 → NT1 NT3
NT5 → NT4 NT2
NT4 → #id
NT3 → +
NT2 → :=
NT1 → #int

(leaves of the tree, left to right: #id := #int + #int)

Fig. 5. One possible derivation tree for positive sample a := 9 + 2
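This bottom-up construction can be sketched as follows (an assumed implementation, not the authors' code): consecutive symbols are paired layer by layer until a single root remains, and the internal nodes become the productions of an initial grammar. The exact pairing may differ from Fig. 5, but the resulting grammar derives exactly the sample.

```python
import itertools

# Build an initial grammar from one positive sample by simple composition
# of consecutive symbols.
def grammar_from_sample(tokens):
    counter = itertools.count(1)
    def fresh():
        return "NT%d" % next(counter)
    productions, leaf, layer = [], {}, []
    for t in tokens:                       # one nonterminal per distinct terminal
        if t not in leaf:
            leaf[t] = fresh()
            productions.append((leaf[t], [t]))
        layer.append(leaf[t])
    while len(layer) > 1:                  # pair consecutive nodes upwards
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nt = fresh()
            productions.append((nt, [layer[i], layer[i + 1]]))
            nxt.append(nt)
        if len(layer) % 2:                 # odd node carries over to next layer
            nxt.append(layer[-1])
        layer = nxt
    return productions, layer[0]

# The sample a := 9 + 2 as token classes:
prods, root = grammar_from_sample(["#id", ":=", "#int", "+", "#int"])
```

For this sample the sketch yields 8 productions with root NT8, matching the size of the grammar in Fig. 5.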

Yet, in some other cases inferring the underlying context-free grammar was still not successful, simply because composition of consecutive symbols is not always correct. What we need is to identify sub-languages and construct derivation trees for the sub-programs first. But this is as hard as the original problem. Since using completely structured [10] or partially structured [11] samples is impractical, we use an approximation: frequent sequences. A string of symbols is called a frequent sequence if it appears at least θ times, where θ is some preset threshold. Consider the above example of assignment


NT6 → NT1 NT3
NT5 → NT4 NT2
NT4 → #id
NT3 → +
NT2 → :=
NT1 → #int

(leaves of the tree, left to right: #id := #int + #int + #int)

Fig. 6. Construction of the derivation tree for positive sample a := 10 + 2 + 12 to the point where iteration of nonterminal NT6 was detected

NT9 → NT8 NT1
NT8 → NT5 NT7
NT7 → NT6 NT7
NT7 → ε
NT6 → NT1 NT3
NT5 → NT4 NT2
NT4 → #id
NT3 → +
NT2 → :=
NT1 → #int

(leaves of the tree, left to right: #id := #int + #int + #int)

Fig. 7. One possible derivation tree for positive sample a := 10 + 2 + 12

NT9 → NT8 NT7 NT9
NT9 → ε
NT8 → NT4 NT5
NT7 → NT6 NT7
NT7 → ε
NT6 → NT3 NT1
NT5 → NT2 NT1
NT4 → #id
NT3 → +
NT2 → :=
NT1 → #int

Fig. 8. Correct context-free grammar, found in generation 21


statements with simple arithmetic expressions on the right side. Some frequent sequences of length 2 in the 4 positive cases are:

pair             occurrences
#id #oper=       8
#oper= #int      8
#int #oper+      6
#oper+ #int      6

Our basic idea is to construct an initial derivation tree in which frequent sequences are recognized by a single nonterminal. For example, we might adjoin the productions FR1 → #id #oper= or FR2 → #int #oper+ into the initial grammar and then construct a valid derivation tree by composition of consecutive symbols.
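Frequent-sequence detection of length 2 can be sketched as follows (an assumed implementation; the tokenised samples and the threshold θ are illustrative):

```python
from collections import Counter

# Count every pair of consecutive tokens across the positive samples and keep
# the pairs occurring at least theta times; these become candidate FR productions.
def frequent_pairs(samples, theta):
    counts = Counter()
    for tokens in samples:
        counts.update(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in counts.items() if n >= theta}

# Toy tokenised positive samples (token classes as in the text):
samples = [
    ["#id", "#oper=", "#int"],
    ["#id", "#oper=", "#int", "#oper+", "#int"],
]
frequent_pairs(samples, theta=2)   # {('#id', '#oper='): 2, ('#oper=', '#int'): 2}
```

Each surviving pair can then be adjoined to the initial grammar as a production FRk → symbol1 symbol2, as described above.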

Table 1. Inferred grammars for some DSLs.

DSL: video store (G = 28, pop_size = 300, N = 8, M = 8)

Inferred grammar:
NT15 → NT11 NT7 NT15
NT15 → ε
NT11 → NT10 NT6
NT10 → NT5 NT10
NT10 → ε
NT7 → NT5 NT7
NT7 → ε
NT6 → #name #days
NT5 → #title #type

An example of a positive sample:
jurassicpark child
roadtrip reg
ring new
andy 3 jurassicpark child
2 roadtrip reg
ann 3 ring new

DSL: stock and sales (G = 1, pop_size = 300, N = 10, M = 3)

Inferred grammar:
NT6 → NT5 NT3
NT5 → NT4 #sales
NT4 → #stock NT2
NT2 → FR2 NT2
NT2 → ε
FR2 → #item #price #qty
NT3 → FR0 NT3
NT3 → ε
FR0 → #item #price

An example of a positive sample:
stock description
twix 0.70 10
mars 0.65 12
bar 1.09 5
sales description
mars 0.65
twix 0.70
mars 0.65

DSL: DESK (G = 1, pop_size = 300, N = 5, M = 3)

Inferred grammar:
NT7 → NT6 NT3
NT6 → NT5 #where
NT5 → NT4 NT2
NT4 → #print #id
NT3 → FR4 NT3
NT3 → ε
NT2 → FR3 NT2
NT2 → ε
FR4 → #id #assign #int
FR3 → #plus #id

An example of a positive sample:
print a + b
where a = 10 b = 20

DSL: FDL (G = 127, pop_size = 300, N = 8, M = 4)

Inferred grammar:
NT7 → NT2 NT7
NT7 → ε
NT2 → NT1 FR8
NT1 → #feature #:
FR8 → #op #( NT11 #, NT11 #)
NT11 → #feature
NT11 → FR8

An example of a positive sample:
c : all(c1, more-of(f4, f5))
c1 : one-of(f1, c2)
c2 : all(f4, f5)

4 Experimental Results

Using an evolutionary approach enhanced by grammar-specific heuristic operators and by better construction of the initial population, we were able to infer grammars for small domain-specific languages


[8] such as a video store language [13], a stock and sales language [13], the simple desk calculation language DESK [9], and a simplified version of the feature description language (FDL) [12]. The inferred grammars, as well as some other parameters (G – number of generations until a solution is found, pop_size – population size, N – number of positive samples, M – number of negative samples), are shown in Table 1. Although the results are promising, more research work still needs to be done; many ideas remain to be implemented and verified in practice.

5 Conclusions and Future Work

Previous attempts at learning context-free grammars had only limited success on real examples. We extend those works by introducing grammar-specific heuristic operators and by better construction of the initial population, in which knowledge from the positive samples is exploited. Our future work involves exploring the use of data mining techniques in grammar inference, augmenting the brute-force approach with heuristics, and investigating the Minimum Description Length (MDL) approach to context-free grammar inference.

References

1. P. Dupont. Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: The GIG Method. Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications, ICGI'94, LNAI, Vol. 862, pp. 236-245, 1994.
2. E.M. Gold. Language Identification in the Limit. Information and Control, Vol. 10, pp. 447-474, 1967.
3. J.R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
4. W.B. Langdon. Genetic Programming and Data Structures: Genetic Programming + Data Structures = Automatic Programming! Kluwer Academic Publishers, 1998.
5. S. Lucas. Structuring Chromosomes for Context-Free Grammar Evolution. 1st International Conference on Evolutionary Computing, pp. 130-135, 1994.
6. M. Mernik, G. Gerlic, V. Zumer, B. Bryant. Can a Parser be Generated from Examples? Proceedings of the ACM Symposium on Applied Computing, Melbourne, pp. 1063-1067, 2003.
7. M. Mernik, M. Lenic, E. Avdicausevic, V. Zumer. LISA: An Interactive Environment for Programming Language Development. 11th International Conference on Compiler Construction, LNCS, Vol. 2304, pp. 1-4, 2002.
8. M. Mernik, J. Heering, T. Sloane. When and How to Develop Domain-Specific Languages. CWI Technical Report, SEN-E0309, 2003.
9. J. Paakki. Attribute Grammar Paradigms - A High-Level Methodology in Language Implementation. ACM Computing Surveys, Vol. 27, No. 2, pp. 196-255, 1995.
10. Y. Sakakibara. Efficient Learning of Context-Free Grammars from Positive Structural Examples. Information and Computation, Vol. 97, pp. 23-60, 1992.
11. Y. Sakakibara, H. Muramatsu. Learning Context-Free Grammars from Partially Structured Examples. Proceedings of the 5th International Colloquium on Grammatical Inference and Applications, ICGI'00, LNAI, Vol. 1891, pp. 229-240, 2000.
12. A. van Deursen, P. Klint. Domain-Specific Language Design Requires Feature Descriptions. Journal of Computing and Information Technology, Special issue on Domain-Specific Languages, Eds: R. Lammel and M. Mernik, Vol. 9, No. 4, pp. 1-17, 2002.
13. M. Varanda Pereira, M. Mernik, T. Kosar, V. Zumer, P. Henriques. Object-Oriented Attribute Grammar based Grammatical Approach to Problem Specification. Technical Report, University of Braga, Department of Computer Science, 2002.
14. P. Wyard. Representational Issues for Context Free Grammar Induction Using Genetic Algorithm. Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications, ICGI'94, LNAI, Vol. 862, pp. 222-235, 1994.
15. M. Crepinsek, M. Mernik, V. Zumer. Extracting Grammar from Programs: Brute Force Approach. Submitted to ACM SIGPLAN Notices, 2004.
16. T. Back, D. Fogel, Z. Michalewicz (Eds.). Handbook of Evolutionary Computation. Oxford University Press, 1996.
17. T. Kammeyer, R.K. Belew. Stochastic Context-Free Grammar Induction with a Genetic Algorithm Using Local Search. Foundations of Genetic Algorithms IV, 1996.
