Extracting Grammar from Programs: Evolutionary Approach

Matej Črepinšek 1, Marjan Mernik 1, Faizan Javed 2, Barrett R. Bryant 2, and Alan Sprague 2

1 University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia
{matej.crepinsek, marjan.mernik}@uni-mb.si
2 The University of Alabama at Birmingham, Department of Computer and Information Sciences, Birmingham, AL 35294-1170, U.S.A.
{javedf, bryant, sprague}@cis.uab.edu

ACM SIGPLAN Notices, Vol. 40(4), Apr 2005

Abstract. The paper discusses context-free grammar (CFG) inference using genetic programming, with application to inducing grammars from programs written in simple domain-specific languages. Grammar-specific heuristic operators and non-random construction of the initial population are proposed to achieve this task. The suitability of the approach is shown by small examples where the underlying CFGs are successfully inferred.

Keywords. Grammar induction, Grammar inference, Learning from positive and negative examples, Genetic programming

1 Introduction

In the accompanying paper [15] we discussed the search space of regular and context-free grammar inference. The conclusion reached was that, owing to the large search space, the exhaustive (brute-force) approach to grammar induction could only be applied to small positive samples. Hence, a need arose for a different and more efficient approach to exploring the search space. Evolutionary computation [16] is particularly suitable for such problems. In fact, genetic algorithms have already been applied to the grammar inference problem, with varying results. In this paper another evolutionary approach to CFG learning, Genetic Programming (GP), is presented. Genetic programming [3] is a successful technique for getting computers to solve problems automatically. It has been used successfully in a wide variety of application domains such as data mining, image classification and robotic control.
In general, genetic programming works well for problems whose solutions can be expressed as a modestly short program. For example, methods working on typical data structures such as stacks, queues and lists have been successfully evolved using genetic programming in [4]. Specifications (BNF) for domain-specific languages are small enough that we can expect a successful solution to be found using genetic programming. Our previous work [6] was successful in inferring small context-free grammars from positive and negative samples. This paper elaborates on our recent research findings and builds on our previous work.

2 Related Work

The impact of different representations of grammars was explored in [14], where experimental results showed that an evolutionary algorithm using standard context-free grammars (BNF) outperforms those using Greibach Normal Form (GNF), Chomsky Normal Form (CNF) or bit-string representations [5]. This performance differential was attributed to the larger grammar search space of the other representations, a consequence of their more complex grammar form. The experimental assessment in [14] was very limited due to the large processing time (processing of one generation took several hours; using our system, processing of one generation takes just a few seconds). This was due to the use of a chart parser, which is commonly used in natural language parsing and can accept
ambiguous grammars as well. With this approach a grammar was successfully inferred for the language of correctly balanced and nested brackets. In [1] a genetic algorithm was used on the problem of merging states in the prefix-tree automaton of regular grammar inference. It was shown that the genetic algorithm performs as well as other regular grammar inference algorithms (e.g. RPNI). Variable-length chromosomes with introns were used in stochastic context-free grammar induction in [17]. A genetic algorithm was also used on the problem of labelling nonterminals (the partitioning problem of nonterminals) in context-free grammar inference using completely structured [10] or partially structured samples [11].
3 Genetically Generated Grammars
3.1 Previous work
In this section a short overview of our previous work is presented. For more details see [6]. To infer context-free grammars for domain-specific languages, the genetic programming approach was adopted. In genetic programming, a program is constructed from a terminal set T and a user-defined function set F. The set T contains variables and constants, and the set F contains functions that are a priori believed to be useful for the problem domain. In our case, the set T consists of terminal symbols defined with regular expressions and the set F consists of nonterminal symbols. From these two sets appropriate grammars can be evolved; a grammar can itself be seen as a domain-specific language for expressing syntax. For effective use of an evolutionary algorithm we have to choose a suitable representation of the problem, suitable genetic operators and parameters, and an evaluation function to determine the fitness of chromosomes. For the encoding of a grammar into a chromosome we used a direct encoding as a list of BNF production rules, as suggested in [14], since this encoding outperforms bit-string representations.
Our earlier GP system starts with a population of randomly generated grammars [6], where the following additional control parameters were introduced to prevent grammars from becoming too large:

– max prod size: maximum number of productions in one grammar,
– max RHS size: maximum number of right-hand symbols in one production.
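As an illustration of this encoding (our sketch, not the authors' actual data structures), a chromosome can be held as a list of (left-hand side, right-hand side) pairs, with the two size limits enforced by a simple check:

```python
# Hypothetical sketch of the direct BNF encoding described above:
# a chromosome is a list of productions, each a pair of
# (left-hand nonterminal, list of right-hand symbols).

MAX_PROD_SIZE = 8   # max prod size: productions per grammar (value assumed)
MAX_RHS_SIZE = 4    # max RHS size: symbols per right-hand side (value assumed)

# Terminal set T (token names) and function set F (nonterminals)
T = ["#Begin", "#Command", "#End"]
F = ["NT1", "NT2"]

# Example chromosome: the robot-movement grammar inferred in Section 3.1
chromosome = [
    ("NT1", ["#Begin", "NT2", "#End"]),
    ("NT2", ["#Command", "NT2"]),
    ("NT2", []),                        # epsilon production
]

def is_legal(grammar):
    """Check the size limits that keep grammars from growing too large."""
    return (len(grammar) <= MAX_PROD_SIZE and
            all(len(rhs) <= MAX_RHS_SIZE for _, rhs in grammar))
```

The concrete limit values above are assumptions for illustration; the paper does not report the settings used.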
Furthermore, specific one-point crossover, mutation and heuristic operators have been proposed as genetic operators. The one-point crossover is performed in the following manner: two grammars are chosen randomly and are cut at the same random position; the second halves are then swapped between the two grammars. To ensure that both offspring are legal grammars after crossover, the breakpoint position cannot fall in the middle of a production rule. The breakpoint position is chosen randomly within the smaller of the two grammars selected for crossover. An example of the crossover operation is presented in Figure 1. After crossover, grammars undergo mutation, where a symbol in a randomly chosen production is mutated. An example of the mutation operator is presented in Figure 2.
Grammar 1: E → int T,  T → operator E,  T → ε
Grammar 2: E → T E,  E → T,  E → ε,  T → ε

Crossover point: after the first production. Offspring:

Offspring 1: E → T E,  T → operator E,  T → ε
Offspring 2: E → int T,  E → T,  E → ε,  T → ε

Fig. 1. The crossover operator
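The one-point crossover on production lists can be sketched as follows (our illustration under the encoding above, not the paper's implementation; the explicit `point` parameter is added only to make the example deterministic):

```python
import random

def crossover(g1, g2, point=None):
    """One-point crossover on two grammars encoded as lists of
    (lhs, rhs) productions. The cut falls on a production boundary
    chosen within the smaller grammar, so both offspring remain
    legal grammars."""
    if point is None:
        point = random.randint(1, min(len(g1), len(g2)) - 1)
    return g1[:point] + g2[point:], g2[:point] + g1[point:]

# The parents from Fig. 1:
g1 = [("E", ["int", "T"]), ("T", ["operator", "E"]), ("T", [])]
g2 = [("E", ["T", "E"]), ("E", ["T"]), ("E", []), ("T", [])]

# Cutting after the first production reproduces the offspring of Fig. 1:
off1, off2 = crossover(g1, g2, point=1)
```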
To enhance the search, the following heuristic operators have been proposed: option, iteration* and iteration+. These operators exploit knowledge of grammars, namely extended BNF (EBNF), where grammar symbols often appear optionally or iteratively. Heuristic operators work in a similar manner to the mutation operator: a symbol in a randomly chosen production is made to appear optionally or iteratively. An example of the option operator is presented in Figure 3.
Before: E → int T,  T → operator E,  T → ε

Option point: the symbol E in the second production. After:

E → int T,  T → operator F,  T → ε,  F → E,  F → ε

Fig. 3. The option operator
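The option operator can be sketched like this (our illustration; the explicit `pos` parameter and the fresh-nonterminal name `F` are assumptions made to keep the example deterministic):

```python
import random

def option(grammar, pos=None, fresh="F"):
    """Heuristic 'option' operator: replace one symbol occurrence with a
    fresh nonterminal F and add F -> symbol and F -> epsilon, so the
    symbol becomes optional (as in Fig. 3)."""
    g = [(lhs, list(rhs)) for lhs, rhs in grammar]   # deep-ish copy
    occurrences = [(i, j) for i, (_, rhs) in enumerate(g)
                   for j in range(len(rhs))]
    if not occurrences:
        return g
    i, j = pos if pos is not None else random.choice(occurrences)
    sym = g[i][1][j]
    g[i][1][j] = fresh
    g.append((fresh, [sym]))   # F -> symbol
    g.append((fresh, []))      # F -> epsilon
    return g

before = [("E", ["int", "T"]), ("T", ["operator", "E"]), ("T", [])]
# Option point at the symbol E of the second production, as in Fig. 3:
after = option(before, pos=(1, 1))
```

The iteration* and iteration+ operators would differ only in the productions added for the fresh nonterminal (e.g. F → symbol F and F → ε for iteration*).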
Similar transformations on grammars are performed by the iteration* and iteration+ operators. To ensure that after crossover, mutation and deletion a chromosome still represents a legal grammar, a special procedure is performed in which non-reachable or superfluous nonterminal symbols are detected and eliminated.
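Such a cleanup can be sketched as a simple reachability pass (our illustration; it assumes the left-hand side of the first production is the start symbol, and the actual procedure in [6] may differ):

```python
def remove_unreachable(grammar):
    """Eliminate productions whose left-hand nonterminal cannot be
    reached from the start symbol (taken to be the LHS of the first
    production)."""
    if not grammar:
        return []
    nonterminals = {lhs for lhs, _ in grammar}
    start = grammar[0][0]
    reachable = {start}
    changed = True
    while changed:                      # fixed-point iteration
        changed = False
        for lhs, rhs in grammar:
            if lhs in reachable:
                for sym in rhs:
                    if sym in nonterminals and sym not in reachable:
                        reachable.add(sym)
                        changed = True
    return [(lhs, rhs) for lhs, rhs in grammar if lhs in reachable]

g = [("S", ["A"]), ("A", []), ("B", ["x"])]   # B is unreachable from S
cleaned = remove_unreachable(g)
```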
Chromosomes were evaluated at the end of each generation by testing each grammar on a set of positive and negative samples. For each grammar in the population an LR(1) parser was automatically generated using the compiler generator tool LISA [7]. The generated parser was then run on the fitness cases (Fig. 4). A grammar's fitness value is proportional to the length of the correctly parsed positive sample; thus it is desirable to have a grammar which accepts all the positive samples and rejects all the negative samples.
Many grammars can be concocted which reject the negative samples. However, our search converges to the desired grammar sooner when we obtain grammars which accept the positive samples. Hence, it is unproductive to search only in the space of grammars which reject the negative samples. Negative samples are taken into account only when a grammar is capable of accepting all the positive samples. Another reason is that negative samples are needed mainly to prevent overgeneralization of grammars [2]. Keeping these facts in view, the fitness value of each grammar is defined to be between 0 and 1, where the interval 0 .. 0.5 denotes that the grammar did not recognize all positive samples and the interval 0.5 .. 1 denotes that the grammar recognized all positive samples but may not have rejected all negative samples. A grammar with a fitness value of 1 signifies that the generated LR(1) parser successfully parsed all positive samples and rejected all negative samples. For a given grammar[i], its fitness fj(grammar[i]) on the j-th fitness case is proportional to the length of the correctly parsed part of that sample.
A grammar is tested on the negative sample set only if it successfully recognizes all positive samples. Here, the portion of a successfully parsed negative sample is not important. Therefore, its fitness value is defined as:

    f(grammar[i]) = 1.0 − m / (2 · M)

where
    m = number of recognized negative samples
    M = number of all negative samples
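The negative-sample fitness above can be written directly in code (a minimal sketch; the `parses` predicate stands in for running the generated LR(1) parser):

```python
def negative_fitness(parses, negative_samples):
    """Fitness for a grammar that already accepts all positive samples:
    f = 1.0 - m / (2*M), where m is the number of negative samples the
    grammar (wrongly) recognizes and M is the number of all negative
    samples. The value lies in [0.5, 1.0]; 1.0 means every negative
    sample is rejected."""
    M = len(negative_samples)
    m = sum(1 for s in negative_samples if parses(s))
    return 1.0 - m / (2 * M)
```

For example, a grammar rejecting both of two negative samples scores 1.0, while one accepting both scores 0.5, the bottom of the interval reserved for grammars that recognize all positive samples.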
Fig. 4. The evaluation of chromosomes. The evolutionary process (population of grammars → selection → crossover and mutation) feeds each test grammar to the LISA compiler generator for parser generation; the generated parser is run on each fitness case (the positive and negative samples), and the successfulness of parsing yields the grammar's fitness value.
Experiments performed in [6] show that with the described approach context-free grammars can be inferred. However, these context-free grammars have only a small number of productions (e.g. up to 5). For example, we were able to infer the context-free grammar of a language for simple robot movement [6]:
NT1 -> #Begin NT2 #End
NT2 -> #Command NT2
NT2 -> epsilon
or of a language for nested begin end blocks:
NT1 -> #Begin NT2 #End
NT2 -> NT1 NT2
NT2 -> epsilon
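For illustration, the nested-block grammar above accepts strings such as begin begin end end. A direct recursive-descent check of that grammar can be sketched as follows (our sketch for exposition only; the actual system uses LR(1) parsers generated by LISA):

```python
def accepts(tokens):
    """Recognizer for the inferred nested-block grammar:
    NT1 -> #Begin NT2 #End,  NT2 -> NT1 NT2,  NT2 -> epsilon."""
    pos = 0

    def nt1():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "begin":
            pos += 1
            nt2()
            if pos < len(tokens) and tokens[pos] == "end":
                pos += 1
                return True
        return False

    def nt2():
        # NT2 -> NT1 NT2 | epsilon: consume as many inner blocks as possible
        nonlocal pos
        while pos < len(tokens) and tokens[pos] == "begin":
            saved = pos
            if not nt1():
                pos = saved
                break

    ok = nt1()
    return ok and pos == len(tokens)
```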
On examples where the underlying context-free grammar was more complex, this approach was not successful.
3.2 New ideas and approaches
It was observed that the heuristic operators considerably improved the search process, resulting in fitter induced grammars. Currently, we are using slightly modified versions of the aforementioned heuristic operators: a sequence of right-hand symbols (not just a single symbol) can appear optionally or iteratively. However, we were still not able to induce bigger grammars, despite the fact that the respective sub-grammars had previously been induced in a separate process. For example, we were successful in finding a context-free grammar for simple expressions, and for simple assignment statements where the right-hand expression can only be a numeric value. But when both sub-languages were combined, our earlier approach failed to find a solution.
Example: the evolution of a grammar for assignment statements with simple arithmetic expressions, on the following samples:

Positive samples:          Negative samples:
a := 9 + 2                 22 := d
a := 10 + 2 + 12           d := := 32
i := 0                     abc := 22 := 2
j := 1                     a :=
abc := 22 + 3              + 6
i := 10 + 2 + 3
j := 1
Upon closer analysis of our results, it became clear that the randomly generated initial population was an impediment to the induction process. The search space of all possible grammars is expansive; to narrow this search space, the initial population should exploit knowledge from the positive samples by generating a few valid derivation trees through simple composition of consecutive symbols. For example, Fig. 5 presents one of the possible derivation trees, and the corresponding context-free grammar, for the positive sample a := 9 + 2 of the aforementioned example. During this process a sequence of nonterminal symbols which appears iteratively can be detected and an appropriate grammar can be constructed (see Figs. 6 and 7). Apart from this change in the construction of the initial population, the genetic programming system remained the same. Using this simple enhancement we were able to induce a context-free grammar for this example (Fig. 8).
Fig. 5. One possible derivation tree for positive sample a := 9 + 2
Yet, in some other cases inferring the underlying context-free grammar was still not successful, simply because composition of consecutive symbols is not always correct. What we need is to identify sub-languages and construct derivation trees for sub-programs first, but this is as hard as the original problem. Since using completely structured [10] or partially structured samples [11] is impractical, we use an approximation: frequent sequences. A string of symbols is called a frequent sequence if it appears at least θ times, where θ is some preset threshold. Consider the above example of assignment statements.
Our basic idea is to construct an initial derivation tree in which frequent sequences are recognized by a single nonterminal. For example, we might adjoin the productions FR1 → #id #oper=, or FR2 → #int #oper+ into the initial grammar and then construct a valid derivation tree by composition of consecutive symbols.
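Frequent-sequence detection can be approximated by counting n-gram occurrences over the tokenized positive samples (our sketch; the threshold θ, the length bound, and the token-class names are assumptions for illustration):

```python
from collections import Counter

def frequent_sequences(samples, theta=2, max_len=3):
    """Return token subsequences (n-grams, n >= 2) that occur at least
    theta times across all tokenized samples."""
    counts = Counter()
    for tokens in samples:
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {seq for seq, c in counts.items() if c >= theta}

# Tokenized positive samples, written with token classes as in the paper:
samples = [
    ["#id", "#oper=", "#int", "#oper+", "#int"],   # a := 9 + 2
    ["#id", "#oper=", "#int"],                     # i := 0
    ["#id", "#oper=", "#int", "#oper+", "#int"],   # abc := 22 + 3
]
freqs = frequent_sequences(samples, theta=2, max_len=2)
```

On these samples the sequences #id #oper= and #int #oper+ both clear the threshold, matching the FR1 and FR2 productions mentioned above.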
4 Experimental Results

Table 1. Inferred grammars for some DSLs.
DSL | G | pop size | N | M | Inferred grammar | An example of a positive sample
Using an evolutionary approach enhanced by grammar-specific heuristic operators and by better construction of the initial population, we were able to infer grammars for small domain-specific languages
[8] such as video store [13], stock and sales [13], the simple desk calculator language DESK [9], and a simplified version of the feature description language (FDL) [12]. The inferred grammars, as well as some other parameters (G – number of generations when the solution is found, pop size – population size, N – number of positive samples, M – number of negative samples), are shown in Table 1. Although the results are promising, more research work still needs to be done; many ideas remain to be implemented and verified in practice.
5 Conclusions and Future Work
Previous attempts at learning context-free grammars met with limited success on real examples. We extend those works by introducing grammar-specific heuristic operators and by facilitating better construction of the initial population, where knowledge from the positive samples is exploited. Our future work involves exploring the use of data mining techniques in grammar inference, augmenting the brute-force approach with heuristics, and investigating the Minimum Description Length (MDL) approach for context-free grammar inference.
References
1. P. Dupont. Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: The GIG Method. Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications, ICGI'94, LNAI, Vol. 862, pp. 236-245, 1994.
2. M.E. Gold. Language Identification in the Limit. Information and Control, Vol. 10, pp. 447-474, 1967.
3. J.R. Koza. Genetic Programming: On the Programming of Computers by Natural Selection. MIT Press, 1992.
4. W.B. Langdon. Genetic Programming and Data Structures: Genetic Programming + Data Structures = Automatic Programming! Kluwer Academic Publishers, 1998.
5. S. Lukas. Structuring Chromosomes for Context-Free Grammar Evolution. 1st International Conference on Evolutionary Computing, pp. 130-135, 1994.
6. M. Mernik, G. Gerlic, V. Zumer, B. Bryant. Can a Parser be Generated from Examples? Proceedings of the ACM Symposium on Applied Computing, Melbourne, pp. 1063-1067, 2003.
7. M. Mernik, M. Lenic, E. Avdicausevic, V. Zumer. LISA: An Interactive Environment for Programming Language Development. 11th International Conference on Compiler Construction, LNCS, Vol. 2304, pp. 1-4, 2002.
8. M. Mernik, J. Heering, T. Sloane. When and How to Develop Domain-Specific Languages. CWI Technical Report, SEN-E0309, 2003.
9. J. Paakki. Attribute Grammar Paradigms - A High-Level Methodology in Language Implementation. ACM Computing Surveys, Vol. 27, No. 2, pp. 196-255, 1995.
10. Y. Sakakibara. Efficient Learning of Context-Free Grammars from Positive Structural Examples. Information and Computation, Vol. 97, pp. 23-60, 1992.
11. Y. Sakakibara, H. Muramatsu. Learning Context-Free Grammars from Partially Structured Examples. Proceedings of the 5th International Colloquium on Grammatical Inference and Applications, ICGI'00, LNAI, Vol. 1891, pp. 229-240, 2000.
12. A. van Deursen, P. Klint. Domain-Specific Language Design Requires Feature Descriptions. Journal of Computing and Information Technology, Special Issue on Domain-Specific Languages, Eds: R. Lammel and M. Mernik, Vol. 9, No. 4, pp. 1-17, 2002.
13. M. Varanda Pereira, M. Mernik, T. Kosar, V. Zumer, P. Henriques. Object-Oriented Attribute Grammar Based Grammatical Approach to Problem Specification. Technical Report, University of Braga, Department of Computer Science, 2002.
14. P. Wyard. Representational Issues for Context-Free Grammar Induction Using Genetic Algorithm. Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications, LNAI, Vol. 862, pp. 222-235, 1994.
15. M. Crepinsek, M. Mernik, V. Zumer. Extracting Grammar from Programs: Brute Force Approach. Submitted to ACM SIGPLAN Notices, 2004.
16. T. Back, D. Fogel, Z. Michalewicz. Handbook of Evolutionary Computation. Oxford University Press, 1996.
17. T. Kammeyer, R.K. Belew. Stochastic Context-Free Grammar Induction with a Genetic Algorithm Using Local Search. Foundations of Genetic Algorithms IV, 1996.