Sentence Realisation from Bag-of-Words with Dependency Constraints

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science (by Research) in

Computer Science

by

Karthik Kumar G
200402018

[email protected]

Language Technologies Research Centre
International Institute of Information Technology

Hyderabad - 500 032, INDIA
May 2010

Copyright © Karthik Kumar G, 2010

All Rights Reserved

International Institute of Information Technology

Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "Sentence Realisation from Bag-of-Words with

Dependency Constraints” by Karthik Kumar G, has been carried out under my supervision and is not

submitted elsewhere for a degree.

Date Principal Co-Adviser: Prof. Rajeev Sangal

Principal Co-Adviser: Mr. Sriram Venkatapathy

To My Parents

Acknowledgments

First and foremost, I would like to thank my advisers Prof. Rajeev Sangal and Mr. Sriram Venkatapathy for their guidance and support during my research. They provided me with the opportunities to pursue my research interests and helped me develop the thinking process required for research.

I would like to thank Dr. Dipti Misra Sharma, Prof. Laxmi Bai and Dr. Soma Paul for the valuable discussions on a variety of topics and research works over the years at the Language Technologies Research Centre (LTRC). I would also like to thank all my LTRC seniors, including Prashanth Mannem, Rafiya Begum, Anil Kumar Singh, Himanshu Agarwal, Jagadeesh Gorla and Samar Hussain, from whom I learnt NLP. I would also like to thank all my LTRC colleagues Rohini, Avinesh, Manohar, Ravi Kiran, Taraka Rama, Sudheer K, Sandhya, Shilpi, Sapna, Ritu, Viswanth, Suman, Praneeth and all my juniors for their support during my stay at LTRC.

My acknowledgment would not be complete without the mention of my classmates of the UG2k4 batch at IIIT Hyderabad.

Abstract

Sentence realisation involves the generation of a well-formed sentence from a bag-of-words. Sentence realisation has a wide range of applications, ranging from machine translation to dialogue systems. In this thesis, we present five models for the task of sentence realisation. The five proposed models are the sentential language model, subtree-type based language models (STLM), head-word STLM, part-of-speech (POS) based STLM and marked head-POS based LM. The proposed models employ simple and efficient techniques based on N-gram language modeling and use only minimal syntactic information in the form of dependency constraints among the bag-of-words. This enables the models to be used in a wide range of applications that require sentence realisation.

We have evaluated the proposed models on two different types of input (gold and noisy). In the first input type, the dependency constraints among the bag-of-words are extracted from a manually annotated treebank, and in the second input type, the dependency constraints among the bag-of-words are noisy as they are automatically extracted using a parser. We have evaluated the models on noisy input in order to test their robustness. From the results, we observed that the models performed well even on the noisy input. We have achieved state-of-the-art results for the task of sentence realisation.

We have also successfully tested a graph-based nearest neighbour algorithm for the task of sentence realisation. We have shown that using the graph-based algorithm, the computational complexity can be reduced from factorial to quadratic at a cost of a 2% reduction in the overall accuracy. This makes our module suitable for use in several practical applications.

Contents

1 Introduction
  1.1 Introduction: Natural language generation
  1.2 Stages of NLG
  1.3 Dependency constraints
  1.4 Significance of sentence realisation in machine translation
      1.4.1 Transfer based Machine Translation
      1.4.2 Two-stage statistical machine translation
  1.5 Summary of contributions
  1.6 Outline

2 Sentence realisation: A review
  2.1 Issues analysed in this thesis

3 Statistical Language Modeling
  3.1 Introduction
  3.2 N-gram parameter estimation
  3.3 Smoothing

4 Sentence realisation experiments
  4.1 Experimental setup
  4.2 Experiments
      4.2.1 Model 1: Sentential Language Model
      4.2.2 Model 2: Subtree-type based Language Models (STLM)
      4.2.3 Model 3: Head-word STLM
      4.2.4 Model 4: POS based STLM
      4.2.5 Model 5: Marked Head-POS based LM
  4.3 Nearest Neighbour Algorithm

5 Results and Discussion
  5.1 Results

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future work

Bibliography

List of Figures

1.1 Tasks of Natural Language Generation
1.2 Bag-of-words with dependency constraints and head marked
4.1 Unordered dependency tree
4.2 Unordered dependency tree with partial order calculated
4.3 Dependency tree of an English sentence
4.4 Dependency tree of an English sentence with POS tags
4.5 Unordered dependency tree with POS tags attached to words
4.6 A treelet having 3 nodes
4.7 Two different treelets which would have the same best POS tag sequence
4.8 Complete directed graph using the words was, ., market, illiquid
4.9 Complete directed graph with the best path marked
5.1 Graph showing the BLEU scores of the Nearest Neighbour Algorithm for different values of K

List of Tables

4.1 The number of nodes having a particular number of children and cumulative frequencies in the test data
5.1 The results of Models 1-5
5.2 Results of the Nearest Neighbour Algorithm for different values of K
5.3 Comparison of results for English WSJ section 23

Chapter 1

Introduction

1.1 Introduction: Natural language generation

Natural language generation (NLG) is a natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form. The goal of this process can be viewed as the inverse of natural language understanding (NLU). While NLU concerns mapping from natural language to a computer-interpretable representation, NLG concerns mapping from a computer-interpretable representation into natural language [22]. In NLU, given an utterance, the task is to analyze its syntactic structure and interpret the semantic content of that utterance, whereas in NLG, the goal is to produce a well-formed natural language utterance. NLG is a critical component in many natural language processing (NLP) applications where the expected output is well-formed natural language text. The applications include machine translation, human-computer dialogue, summarization, item and event descriptions, question answering, tutorials and stories [11]. Since the task of NLG is not straightforward, it is generally divided into different modules. In the next section, we describe the different stages involved in the task of NLG.

1.2 Stages of NLG

As mentioned in the earlier section, NLG is generally decomposed into a number of steps. Fig. 1.1 shows the different steps involved in an NLG system.

The six steps involved in NLG are:

1. Content determination: The goal of content determination is to decide the type of information to

be included.

(a) It is based on creating messages (data objects used in subsequent language generation).

Messages consist of entities, concepts and relations.

(b) The choice of messages is often based on an analysis of a corpus of human-generated texts

covering the same field (the corpus-based approach).

Figure 1.1 Tasks of Natural Language Generation (tasks: content determination, discourse planning, sentence aggregation, lexicalization, referring expression generation, syntax and morphology, orthographic realization; modules: text planning, sentence planning, linguistic realisation)

2. Discourse planning: The process of imposing order and structure over the set of messages that are determined by the content determination step. It is based either on discourse relations (such as elaboration, exemplification) or on schemas (that is, general patterns according to which a text is constructed).

3. Sentence aggregation: This involves the process of grouping messages together into sentences. However, it is not always necessary: each message may be presented as a separate sentence, but good aggregation improves the quality of the text. There are several types of aggregation:

(a) Simple conjunctions

(b) Ellipsis (e.g. "John went to the bank and John deposited $100" may be combined into "John went to the bank and deposited $100")

(c) Set formation (combining messages that are identical except for a single constituent, e.g. "John bought a car, John bought a house and John bought a computer" may be combined into "John bought a car, a house and a computer". Set formation also covers instances like replacing "Monday, Tuesday, ..., Sunday" with "the whole week")

(d) Embedding (usage of relative clauses)

Problems arise when the system has to make choices between different possibilities of aggregation; the system builder has to supply it with rules that govern such choice-making.

4. Lexicalization: The process of deciding which words and phrases should be used in order to transform the underlying messages into a readable text. This is the point when pragmatic issues are taken into consideration (e.g. should the text be formal or informal). As with aggregation, a poorly lexicalized text may still be understood, but good lexicalization improves the quality

and fluency of the text. Similarly to aggregation, problems emerge when the system has to make choices between particular words.

5. Referring expression generation: Selecting words and phrases to identify entities (e.g. "Caledonian Express" or "it" or "this train"), generating deictic expressions.

6. Linguistic realization: The process of applying rules of grammar in order to produce a text which is syntactically, morphologically and orthographically correct. This process may be realized by means of the inverse parsing model, different kinds of grammars or templates.

Sometimes, one or more subtasks of NLG might be combined to form a module. In general, there

are three modules:

1. Text planning: This stage combines the content determination and discourse planning tasks.

2. Sentence planning: This stage combines sentence aggregation, lexicalization, and referring expression generation.

3. Linguistic realisation: As described above, this task involves syntactic, morphological, and orthographic processing.

Linguistic realisation is the last step in NLG. Within linguistic realisation, sentence realisation is a major step. Sentence realisation involves generating a well-formed sentence from a bag-of-words. The words in the bag may be syntactically related to each other, and the level of syntactic information attached to the bag-of-words might vary with the application. In this thesis, we present sentence realisation models which use basic syntactic information in the form of dependency constraints between the bag-of-words. In the next section, we briefly explain dependency constraints.

1.3 Dependency constraints

As mentioned in the previous section, we present sentence realisation models which take a bag-of-words with dependency constraints as input and produce a well-formed sentence. Dependency constraints are basic modifier-modified relationships between the words in the bag [2]. The main reason behind choosing dependency constraints is that they are used in most applications and hence the proposed models can be easily employed.

Fig. 1.2 shows an example of bag-of-words with dependency constraints for the sentence “Ram is

going to school”.

This bag-of-words with dependency constraints can also be seen as an unordered dependency tree. A dependency tree in which the order of the lexical items is not given is known as an unordered dependency tree.

Figure 1.2 Bag-of-words with dependency constraints and head marked

1.4 Significance of sentence realisation in machine translation

Our sentence realiser models can be applied in various Natural Language Processing applications such as transfer based machine translation and two-step statistical machine translation, and in Natural Language Generation applications such as dialogue systems. We briefly describe the first two applications and show how our realisation models can be applied to these tasks.

1.4.1 Transfer based Machine Translation

We now present the role of a sentence realiser in the task of transfer based MT. In transfer-based approaches for MT1 [12], the source sentence is first analyzed by a parser (a phrase-structure or a dependency-based parser). Then the source lexical items are transferred to the target language using a bilingual dictionary. The target language sentence is finally realised by applying transfer rules that map the grammars of the two languages. Generally, these transfer rules make use of rich analysis on the source side, such as dependency labels. The accuracy of such rich analysis (dependency labeling) is low and hence might affect the performance of the sentence realiser. Also, the approach of manually constructing transfer rules is costly, especially for divergent language pairs such as English and Hindi or English and Japanese. In this scenario, our models can be used as a robust alternative to the transfer rules. Once the source lexical items are transferred to the target language using a bilingual dictionary, we apply the realisation models trained on the target side (say Hindi or Japanese) to realise the sentence. There is one more advantage of using realisation models instead of a transfer grammar. Transfer grammar rules depend on the syntactic properties of both the source language and the target language, whereas the realisation models depend only on the target language, which makes them more general. For example, if we want to build machine translation systems from English to Hindi and from Japanese to Hindi and we follow the transfer based method, we need to build transfer grammar rules for both English-Hindi and Japanese-Hindi. But if we use realisation models, then we need to train only the Hindi models for both machine translation systems. In this way the effort is reduced.

1 http://www.isi.edu/natural-language/mteval/html/412.html

1.4.2 Two-stage statistical machine translation

A sentence realiser can also be used in the framework of a two-step statistical machine translation.

In the two-step framework, the semantic transfer and sentence realisation are decoupled into indepen-

dent modules. Semantic transfer involves selection of appropriate target language lexical items, given

the source language sentence and sentence realisation involves the production of a well-formed target

sentence from the selected target language lexical items. This provides an opportunity to develop sim-

ple and efficient modules for each of the steps. The model forglobal lexical selection and Sentence

Re-construction [1] is one such approach. In this approach, discriminative techniques are used to first

transfer semantic information of the source sentence by looking at the source sentence globally, this ob-

taining a accurate bag-of-words in the target language. Thewords in the bag might be attached with mild

syntactic information (i.e., the words they modify) [23]. We propose models that take this information

as input and produce a well-formed target sentence.

We can also use our sentence realiser as an ordering module in other approaches such as [21], where the goal is to order an unordered bag (of treelets in this case) with dependency links. In this thesis, we do not test our models within any of the applications mentioned above; we evaluate our models independently. Evaluating the models by plugging them into these applications is a direction for potential future work.

1.5 Summary of contributions

1. We propose five models for the task of sentence realisation from a bag-of-words with dependency constraints, and achieve state-of-the-art results for this task.

2. We successfully used graph-based algorithms, which, to the best of our knowledge, is the first attempt in this direction for the task of sentence realisation.

1.6 Outline

• Chapter 2 describes the related work on sentence realisation.

• Chapter 3 presents the language modeling techniques. It starts with an introduction to general

language modeling techniques and then describes the maximum likelihood estimation. Finally, it

identifies the problem of data sparsity and gives some smoothing techniques to solve the problem.

• Chapter 4 lists the experiments conducted for the task of sentence realisation. It starts with a description of the experimental setup and then explains in detail the five proposed models of sentence realisation. It also describes the application of graph-based models (the nearest neighbour algorithm) to the task of sentence realisation.

• Chapter 5 contains the results and discussion of the experiments. It tabulates the results of Models 1-5 on the standard test data. The results on the same test data using the graph-based models are also given, and the results are compared with previous methods.

• Chapter 6 contains concluding remarks and suggestions for future work on sentence realisation.

Chapter 2

Sentence realisation: A review

Most general-purpose sentence realisation systems developed to date transform the input into a well-formed sentence by statistical language modeling techniques or by the application of a set of grammar rules based on particular linguistic theories, e.g. Lexical Functional Grammar (LFG), Head-Driven Phrase Structure Grammar (HPSG), Combinatory Categorial Grammar (CCG), Tree Adjoining Grammar (TAG), etc. The grammar rules can be obtained in different ways: they can be hand-crafted, semi-automatically extracted or automatically extracted (from treebanks), while language modeling involves the estimation of probabilities of natural language sentences. We give a brief description of each of the above:

1. Hand-crafted rules:

In this method, the set of grammar rules for sentence realisation is manually written. These rules are based on a particular linguistic theory as mentioned above. For example, the work presented in FUF/SURGE [8], LKB [5], OpenCCG [24] and XLE [7] performs the task of sentence realisation using manually constructed grammar rules. There are some problems with this approach. If we want to build a sentence realiser for a new domain, then the rules have to be rewritten for that domain. The same reasoning applies to building a sentence realiser for new languages, which is a time-consuming process. Despite these drawbacks, this method is useful for languages such as Hindi and Telugu which do not have treebanks for learning the grammar rules automatically.

2. Automatically generated rules using statistical methods:

The grammar rules for sentence realisation can also be learned/extracted directly from treebanks using statistical methods. For languages like English and Chinese which have large treebanks, the grammar rules can be created automatically. The work presented in HPSG [18], LFG [4, 10] and CCG [25] extracts the grammar rules automatically from the treebank for the task of sentence realisation. The advantage of extracting the grammar rules automatically from the treebank is that it reduces the manual effort of writing the rules. The other advantage is that

these methods can be easily adapted to other domains and languages. The problem with this approach is that it requires large treebanks to learn the grammar rules automatically. For languages such as English which have large treebanks, this method can be used directly.

3. Semi-automatically extracted rules: In this method, the task of sentence realisation is done using both automatically extracted rules and hand-crafted rules. In the first step, the hand-crafted rules are used to get the list of possible realisations for a given input. Then statistical methods are used to choose the best sentence. The sentence realiser presented in Belz, 2007 [3] is an example of this method. This method carries the advantages and disadvantages of both of the methods presented above.

4. Statistical language modeling: Here, the task of sentence realisation is performed using language modeling techniques. Language modeling enables us to estimate the probability of occurrence of natural language sentences. A detailed explanation of language modeling is given in Chapter 3. Generally, language modeling is used to rank the probable candidates in sentence realisation.

2.1 Issues analysed in this thesis

This thesis mainly focuses on language modeling for the task of sentence realisation in English. One of the major issues with the previously proposed models that use language modeling is that they do not use the treelet structure of the dependency tree while building the language models. We believe that using the treelet structure while building the language models will improve the accuracy of sentence realisation. We propose five sentence realisation models which use the treelet structure while building the language models. The five proposed models are:

1. Sentential LM: A generic language model trained on the entire sentences of the training data.

2. Subtree-type based LM (STLM): Different language models are trained for subtrees of different types, differentiated by the POS tag of the head of the subtree.

3. Head-word STLM: Different LMs for the different subtree types are trained only on the heads of the subtrees.

4. POS based STLM: A modification of the head-word STLM, trained on POS tags instead of words.

5. Marked Head-POS based LM: An extension of the fourth model in which the head of the subtree is explicitly marked.

A detailed explanation of the above models is given in Section 4.2.

Chapter 3

Statistical Language Modeling

In our models for sentence realisation, we traverse the given unordered dependency tree in a bottom-up fashion. At each node we find the best phrase representing the subtree. This is achieved by scoring all the possible phrases and extracting the phrase with the best score. Scoring of the phrases is done using language modeling techniques. The best-scored phrase is assigned to the root of the subtree. The best phrase obtained at the root of the unordered dependency tree is the realised sentence. In this chapter, we discuss the basics of language modeling and the smoothing techniques which are used in our models for sentence realisation.

3.1 Introduction

The goal of statistical language modeling is to build a statistical model that can estimate the distribution of natural language sentences as accurately as possible. A statistical language model (LM) is a probability distribution p(S) over strings S that attempts to reflect the likelihood of a sentence S in a language.

Let the string for which we want to find the probability be S = w_1 w_2 ... w_n. The probability p(S) of S can be expressed as

p(S) = p(w_1, w_2, w_3, ..., w_n) = p(w_1) p(w_2|w_1) ... p(w_n|w_1, ..., w_{n-1})    (3.1)

However, it is not possible to reliably estimate all the conditional probabilities required. So, in practice, the approximation presented in equation 3.2 is made:

p(w_i|w_1, w_2, w_3, ..., w_{i-1}) ≈ p(w_i|w_{i-N+1}, ..., w_{i-1})    (3.2)

giving

p(S) = ∏_{i=1}^{n} p(w_i|w_{i-N+1}, ..., w_{i-1})    (3.3)

In equation 3.3, a word is dependent only on the N − 1 preceding words and not on all preceding words in the sequence.

The models based on the approximation in equation 3.2 are known as N-gram models. Even N-gram probabilities are difficult to estimate reliably, and so N is usually limited to N = 1, N = 2 or N = 3, giving unigram, bigram and trigram models respectively. The simplest model is the unigram, where

p(S) ≈ ∏_{i=1}^{n} p(w_i)    (3.4)

The unigram model uses only the frequencies of occurrence of words in the corpus to calculate p(S). The bigram and trigram models are richer approximations. A bigram model uses the conditional probabilities of word pairs and is defined as in equation 3.5:

p(S) ≈ p(w_1) [ ∏_{i=2}^{n} p(w_i|w_{i-1}) ]    (3.5)

A trigram model is based on the conditional probabilities of word triples and is defined as in equation 3.6:

p(S) ≈ p(w_1) p(w_2|w_1) [ ∏_{i=3}^{n} p(w_i|w_{i-2}, w_{i-1}) ]    (3.6)

A special marker "$" is used to label the beginning and the end of a word sequence. Introducing this marker allows statistics relating to the likelihoods of certain words starting or ending a sentence to be incorporated into the models. So the sentence S is

S = $, w_1, w_2, ..., w_n, $    (3.7)

and the bigram model computes the probability of S as given in equation 3.8:

p(S) ≈ p(w_1|$) [ ∏_{i=2}^{n} p(w_i|w_{i-1}) ] p($|w_n)    (3.8)

Similarly, the trigram model computes the sequence probability as shown in equation 3.9:

p(S) ≈ p(w_1|$) p(w_2|$, w_1) [ ∏_{i=3}^{n} p(w_i|w_{i-2}, w_{i-1}) ] p($|w_{n-1}, w_n)    (3.9)

3.2 N-gram parameter estimation

The parameters in a traditional N-gram model can be calculated from the frequencies of occurrence of the N-grams:

p(w_i|w_{i-N+1}, ..., w_{i-1}) = c(w_{i-N+1}, ..., w_{i-1}, w_i) / c(w_{i-N+1}, ..., w_{i-1})    (3.10)

In equation 3.10, c(w_{i-N+1}, ..., w_{i-1}, w_i) represents the count of the word sequence w_{i-N+1}, ..., w_{i-1}, w_i in the training corpus and c(w_{i-N+1}, ..., w_{i-1}) represents the count of the word sequence w_{i-N+1}, ..., w_{i-1} in the training corpus.

For the bigram approximation we get the probabilities as

p(w_i|w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})    (3.11)

For the trigram approximation we get

p(w_i|w_{i-2}, w_{i-1}) = c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})    (3.12)

Let us consider a small example. Let our training data D be composed of the three sentences:

• John read Moby Dick

• Mary read a new book

• She read a book by Cher

Let us calculate p(John read a book) for the maximum likelihood bigram model. We have

p(John|$) = c($ John) / c($) = 1/3    (3.13)

p(read|John) = c(John read) / c(John) = 1/1    (3.14)

p(a|read) = c(read a) / c(read) = 2/3    (3.15)

p(book|a) = c(a book) / c(a) = 1/2    (3.16)

p($|book) = c(book $) / c(book) = 1/2    (3.17)

giving us

p(John read a book) = p(John|$) p(read|John) p(a|read) p(book|a) p($|book)
                    = (1/3) × (1/1) × (2/3) × (1/2) × (1/2)
                    ≈ 0.06
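To make the arithmetic above concrete, the following short Python sketch estimates maximum-likelihood bigram probabilities from the three-sentence toy corpus and scores "John read a book"; the function and variable names are illustrative choices of ours, not the thesis implementation.

```python
from collections import Counter

corpus = ["John read Moby Dick",
          "Mary read a new book",
          "She read a book by Cher"]

# Collect history and bigram counts, with "$" marking sentence boundaries.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["$"] + sentence.split() + ["$"]
    unigrams.update(tokens[:-1])               # history counts (final "$" excluded)
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def p(word, history):
    """Maximum-likelihood bigram estimate p(word | history)."""
    return bigrams[(history, word)] / unigrams[history]

def p_sentence(sentence):
    tokens = ["$"] + sentence.split() + ["$"]
    prob = 1.0
    for h, w in zip(tokens[:-1], tokens[1:]):
        prob *= p(w, h)
    return prob

print(p_sentence("John read a book"))          # ≈ 0.0556, i.e. about 0.06
```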

3.3 Smoothing

Sometimes, however, the probability p(w_n|w_{n-1}) might be zero. This might be because c(w_{n-1}, w_n) is zero, in which case the probability of a valid sentence S containing the word sequence w_{n-1} w_n becomes zero. The problem arises, for example, when one of the words is a proper noun, since the training data from which the parameters are learned might not contain all proper nouns. For instance, the probability of the valid English sentence "Sam read a book" is zero, since the probability p(read|Sam) is zero; "Sam" is a proper noun which is not present in the training data. A technique called smoothing is used to address this problem. The term smoothing describes techniques for adjusting the maximum likelihood estimate of probabilities to produce more accurate probabilities. The name smoothing comes from the fact that these techniques tend to make distributions more uniform, by adjusting low probabilities such as zero probabilities upward, and high probabilities downward. Not only do smoothing methods generally prevent zero probabilities, but they also attempt to improve the accuracy of the model as a whole. Whenever a probability is estimated from few counts, smoothing has the potential to significantly improve estimation.

There are many smoothing techniques available. Some of the techniques that we tried are:

1. Add-One smoothing

2. Witten-Bell smoothing

3. Good-Turing estimation

4. Kneser-Ney smoothing

5. Katz smoothing

A detailed explanation of the above techniques is given in Chen and Goodman (2003) [6]. We have experimented with all the above smoothing techniques and found that the Kneser-Ney smoothing technique works well. In all our experiments we have used this same smoothing algorithm.
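As an illustration of the simplest technique in the list above, the sketch below applies Add-One (Laplace) smoothing to bigram estimates: with a vocabulary of size V, every count is incremented by one, so unseen bigrams such as (Sam, read) receive a small non-zero probability. This is only a toy illustration; in the thesis experiments the Kneser-Ney method mentioned above is the one actually used.

```python
from collections import Counter

def add_one_bigram_prob(word, history, bigrams, unigrams, vocab_size):
    """Add-One (Laplace) smoothed estimate of p(word | history):
    (c(history, word) + 1) / (c(history) + V)."""
    return (bigrams[(history, word)] + 1) / (unigrams[history] + vocab_size)

# Toy counts, reusing the corpus from the previous sketch.
corpus = ["John read Moby Dick", "Mary read a new book", "She read a book by Cher"]
unigrams, bigrams, vocab = Counter(), Counter(), set()
for sentence in corpus:
    tokens = ["$"] + sentence.split() + ["$"]
    vocab.update(tokens)
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

# "Sam" is unseen, so the unsmoothed estimate would be zero; Add-One is not.
print(add_one_bigram_prob("read", "Sam", bigrams, unigrams, len(vocab)))
```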

The N-gram models can also be applied over word classes instead of words, where a class may be a POS tag, a supertag or some other category to which words can be assigned. As mentioned in the previous section, the problem with word-based N-gram models is that they cannot handle new words (bigrams or trigrams that are not in the training data). In this case, we can use a part-of-speech based N-gram model, in which we train our models on part-of-speech tags instead of words. Since

the types of part-of-speech tags are limited, the problem of sparsity is avoided. In this thesis, we also explore whether language models built on POS tags improve the performance of our models. We also experimented with whether language models built on a combination of both words and part-of-speech tags improve sentence realisation.

Chapter 4

Sentence realisation experiments

In this chapter, we present the experiments conducted for the task of sentence realisation. It starts with a description of the experimental setup and then explains in detail the five proposed models for sentence realisation. The chapter concludes with an application of graph-based models (the nearest neighbour algorithm) to the task of sentence realisation.

4.1 Experimental setup

For the experiments, we use the Wall Street Journal (WSJ) portion of the Penn Treebank [14], using the standard train/development/test splits, viz. 39,832 sentences from sections 2-21 for training, 2,416 sentences from section 23 for testing and 1,700 sentences from section 22 for development. The input to our sentence realiser is a bag-of-words with dependency constraints, which are automatically extracted from the Penn Treebank using the head percolation rules used in [13]. We also use the provided part-of-speech tags in some experiments.

In most practical applications, the input to the sentence realiser is noisy. To test the robustness of our models in such scenarios, we also conducted experiments on noisy input data. The input test data is first parsed with an unlabeled projective dependency parser [19] to obtain the dependency constraints; the order information is then dropped to obtain the input to our sentence realiser. However, we still use the correct bag-of-words.

Table 4.1 shows the number of nodes having a particular number of children in the test data. From Table 4.1, we can see that about 97.9% of the nodes of the trees have five or fewer children.

Children count   Nodes   Cumulative frequency %
0                30219   53.31
1                13649   77.39
2                 5887   87.77
3                 3207   93.42
4                 1526   96.11
5                 1017   97.9
6                  685   99.1
7                  269   99.57
8                  106   99.75
> 8                119   100

Table 4.1 The number of nodes having a particular number of children and cumulative frequencies in the test data
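The cumulative percentages in Table 4.1 follow directly from the raw node counts; a minimal check, with the counts transcribed from the table, is sketched below.

```python
# Node counts per number of children, transcribed from Table 4.1.
counts = {0: 30219, 1: 13649, 2: 5887, 3: 3207, 4: 1526,
          5: 1017, 6: 685, 7: 269, 8: 106, ">8": 119}

total = sum(counts.values())
running = 0
for children, n in counts.items():
    running += n
    print(children, n, round(100 * running / total, 2))
# Nodes with five or fewer children account for roughly 97.9% of all nodes.
```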

4.2 Experiments

4.2.1 Model 1: Sentential Language Model

In this model, we traverse the tree in a bottom-up manner and find the best phrase at each subtree. The best phrase corresponding to the subtree is assigned to the root node of the subtree during the traversal. If the subtree contains only one node, i.e., the head itself, then that node is the best phrase corresponding to the subtree. Let a node n have N children represented as c_i (1 ≤ i ≤ N). During the bottom-up traversal, the children c_i are assigned best phrases before processing node n. Let the best phrase corresponding to the ith child be α_{c_i}. The best phrase corresponding to the node n is computed by exploring the permutations of n and the best phrases of all its children. The total number of permutations explored is (N + 1)!. A sentential language model, trained on the complete sentences of the training corpus, is applied to each of the candidate phrases to score them, and the one with the maximum score is selected as the best phrase.

α_n = bestPhrase( perm(n, ∀i α_{c_i}) ∘ LM )    (4.1)

For example, Fig. 4.1 is an unordered dependency tree and the expected output is "The equity market was illiquid .". When we traverse the tree in a bottom-up manner, we first reach the subtree with "illiquid" as head, which contains only one node, so the best phrase corresponding to the subtree is "illiquid". We then reach ".", "The" and "equity" and their best phrases are ".", "The" and "equity", respectively (since all these subtrees contain only one node). Then we reach the subtree with "market" as head, which contains the three nodes "The", "equity" and "market", so there are 3! (6) candidate phrases.

Figure 4.1 Unordered dependency tree

Figure 4.2 Unordered dependency tree with partial order calculated

We can apply the language model on each of these 6 phrases and assign the best phrase (say "The equity market") to the node "market", as seen in Fig. 4.2. We next go to the subtree with "was" as head. It contains the four units "was", "The equity market", "illiquid" and "." (note that here we consider "The equity market" as a unit and permute). We get 4! (24) phrases, apply the language model on each one of them and identify the best phrase. The best phrase identified at the root of the tree is the generated sentence. For this example, we should ideally get "The equity market was illiquid .".

The worst-case complexity of the model is O(n!). However, this is a rare case in which all the (n − 1) remaining words are attached to a single parent. If we consider that the dependency tree has subtrees with a maximum of k children, then the complexity is of the order of n × (k + 1)!, i.e., O(n · (k + 1)!). For a binary tree with n words, the complexity is n × 3!, i.e., O(n).
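A minimal sketch of Model 1's bottom-up search is given below. The names (Node, best_phrase) and the toy scorer are our own illustrative choices, not the thesis implementation; the language model is passed in as a scoring callback, and a real system would plug in the sentential N-gram LM here.

```python
from itertools import permutations

class Node:
    """A node of the unordered dependency tree."""
    def __init__(self, word, children=None):
        self.word = word
        self.children = children or []

def best_phrase(node, lm_score):
    """Model 1: bottom-up traversal. At each node, permute the node's word and
    its children's best phrases and keep the order the language model scores
    highest. `lm_score` maps a list of words to a score (higher is better)."""
    if not node.children:
        return [node.word]                        # single-node subtree
    units = [[node.word]] + [best_phrase(c, lm_score) for c in node.children]
    candidates = ([w for unit in order for w in unit]
                  for order in permutations(units))   # (N + 1)! orders
    return max(candidates, key=lm_score)

# Example with the tree of Fig. 4.1 and a stand-in scorer that prefers
# the reference order (only for illustration).
tree = Node("was", [Node("market", [Node("The"), Node("equity")]),
                    Node("illiquid"), Node(".")])
reference = "The equity market was illiquid .".split()
score = lambda words: -sum(w != r for w, r in zip(words, reference))
print(" ".join(best_phrase(tree, score)))         # The equity market was illiquid .
```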

4.2.2 Model 2: Subtree-type based Language Models (STLM)

Figure 4.3 Dependency tree of an English sentence

Consider the tree given in Fig. 4.3. When we traverse the dependency tree in a bottom-up manner as in Model 1, we first reach the subtree with "school" as head, which contains only one node, so the best phrase corresponding to the subtree is "school". Then we reach the subtree with "in" as head. There are two phrases, "school" and "in", in the subtree which are to be ordered to assign the best phrase to the node "in". The two possible phrases are "in school" and "school in". Both of the phrases are valid strings in English: "Ram worked in school." and "The school in Gachibowli is famous." are two valid English sentences which contain the phrases "in school" and "school in" respectively. Also, the probabilities of "school in" and "in school" might be comparable, as both of the phrases have a good chance of occurring in complete sentences. But we know that, given the fact that the node represents a prepositional phrase (which has a preposition as its root), the string "in school" should be more probable than the string "school in". This objective can be achieved by training different language models for subtrees of different types. A prepositional phrasal language model is then expected to give a much higher probability to "in school". POS tags are used to represent subtrees of various kinds and hence, different language models are built for different POS tags. So in this model we synthesize a sentence from an unordered dependency tree by traversing it in a bottom-up manner as in Model 1, but while scoring the permuted phrases we use different language models for subtrees headed by words of various POS tags.

α_n = bestPhrase( perm(n, ∀i α_{c_i}) ∘ LM_{POS(n)} )    (4.2)

Here, LM_{POS(n)} represents the language model associated with the part-of-speech of the node n.

To build STLMs, the training data is parsed first. Each subtree in the parse structure is represented by the part-of-speech tag of its head, and a different language model is created for each of the POS tags.

Figure 4.4 Dependency tree of an English sentence with POS tags

Fig. 4.4 shows an ordered dependency tree of the sentence "A record date has not been set ." with words having their POS tags. Consider the subtree with "date" as head, whose gold phrase is "A record date".

We use this phrase to build an "NN" language model. In a similar way, we use the gold phrase "been set" to build a "VBN" language model and "A record date has not been set ." to build a "VBZ" language model. Note that we do not use leaf nodes of the dependency tree for training the language models.

We have trained 44 different language models, each corresponding to a particular POS tag. For example, the "IN" language model is trained on phrases like "in hour", "of chaos", "after crash" and "in futures", and the "VBD" language model on phrases like "were criticized" and "never resumed".
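The following sketch shows, under our own simplified data structures, how per-POS training phrases for the STLMs could be collected from an ordered, POS-tagged dependency tree: each internal node contributes the yield of its subtree to the language model keyed by the node's POS tag (leaf nodes are skipped, as noted above). Names such as collect_training_phrases are illustrative, not the thesis implementation.

```python
from collections import defaultdict

class TNode:
    """A node of an ordered, POS-tagged dependency tree."""
    def __init__(self, word, pos, position, children=None):
        self.word, self.pos, self.position = word, pos, position
        self.children = children or []

def subtree_yield(node):
    """Words of the subtree in sentence order."""
    nodes, stack = [node], list(node.children)
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    return [n.word for n in sorted(nodes, key=lambda n: n.position)]

def collect_training_phrases(root, phrases=None):
    """Map each head POS tag to the phrases used to train its STLM."""
    phrases = defaultdict(list) if phrases is None else phrases
    if root.children:                             # leaf nodes are not used
        phrases[root.pos].append(subtree_yield(root))
        for child in root.children:
            collect_training_phrases(child, phrases)
    return phrases

# Tree of Fig. 4.4: "A record date has not been set ."
tree = TNode("has", "VBZ", 3, [
    TNode("date", "NN", 2, [TNode("A", "DT", 0), TNode("record", "NN", 1)]),
    TNode("not", "RB", 4),
    TNode("been", "VBN", 5, [TNode("set", "VBN", 6)]),
    TNode(".", ".", 7)])
for pos, plist in collect_training_phrases(tree).items():
    print(pos, [" ".join(p) for p in plist])
# VBZ -> "A record date has not been set ."; NN -> "A record date"; VBN -> "been set"
```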

Figure 4.5 Unordered dependency tree with POS tags attached to words

While decoding, the input is the unordered dependency tree with POS tags attached to the words. For example, Fig. 4.5 is the given unordered dependency tree with POS tags attached to the words and the expected output is "The equity market was illiquid .". When we traverse the tree in a bottom-up manner, we first reach the subtree with "illiquid" as head, which contains only one node, so the best phrase corresponding to the subtree is "illiquid". We then reach ".", "The" and "equity" in order and their best phrases are ".", "The" and "equity" respectively (since all these subtrees contain only one node). Then we reach the subtree with "market" as head, which contains the three nodes "The", "equity" and "market"; there can be 3! (6) phrases formed, and here we apply the "NN" language model on these 6 phrases and assign the best phrase (say "The equity market") to the node "market", as seen in Fig. 4.2. Next we go to the subtree with "was" as head, which contains the 4 units "was", "The equity market", "illiquid" and "." (note that here we consider "The equity market" as a unit and permute). We get 4! (24) phrases, apply the "VBD" language model on them and get the best phrase. The best phrase obtained at the tree's root is the generated sentence (in this example, we should ideally get "The equity market was illiquid .").

4.2.3 Model 3: Head-word STLM

In the models presented earlier, a node and its children are ordered using the best phrases of the children. For example, the best phrase assigned to the node "was" is computed by taking the

permutations of "was" and its children "The equity market", "illiquid" and "." and then applying the language model. In this model, instead of considering the best phrases while ordering, the heads of the children c_i are considered. For example, the best phrase assigned to the node "was" is computed in three steps.

1. The best phrases of the children are computed. The best phrase of the child c_i is α_{c_i}.

2. Permute the node "was" and the children heads "market", "illiquid", "." and then apply a language model trained to order the heads of subtrees:

α_n = bestPhrase( perm(n, ∀i c_i) ∘ LM_{POS(n)} )    (4.3)

For example, the ordered heads will be "market was illiquid .".

3. The ordered heads are replaced with the corresponding best phrases to get the best phrase associated with "was". Hence, the best phrase of the node "was" is "The equity market was illiquid .".

In a similar way, we train head-word STLM language models for each subtree type by using only the heads of the children instead of the full phrases. For example, consider the subtree with "has" as head in Fig. 4.4: we add the heads-based phrase "date has not been ." instead of "A record date has not been set ." to build the "VBZ" language model.
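The three-step procedure can be compressed into the following sketch, again with illustrative names of ours and with the treelet language model abstracted behind a scoring callback.

```python
from itertools import permutations

def head_word_stlm_order(node_word, child_heads, child_phrases, stlm_score):
    """Model 3: order the node and the *heads* of its children with the
    subtree-type LM, then substitute each head by its best phrase.
    `child_phrases` maps a head word to the best phrase already computed
    for that child; `stlm_score` scores a sequence of head words."""
    units = [node_word] + list(child_heads)
    best_heads = max(permutations(units), key=stlm_score)
    ordered = []
    for head in best_heads:
        ordered.extend(child_phrases.get(head, [head]))
    return ordered

# For the node "was" (Fig. 4.1), with a stand-in scorer preferring the
# reference head order "market was illiquid ." (a real system would use
# the head-word STLM selected by the POS of "was").
reference = ["market", "was", "illiquid", "."]
score = lambda heads: -sum(h != r for h, r in zip(heads, reference))
phrases = {"market": ["The", "equity", "market"]}
print(" ".join(head_word_stlm_order("was", ["market", "illiquid", "."],
                                    phrases, score)))
# -> The equity market was illiquid .
```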

The major advantages of using this model are:

1. The order at a node is independent of the best phrases of its descendants. This prevents the redundant words in the descendants' best phrases from being considered by the language models that score strings at the higher level.

2. Any error in the computation of the best phrases of the descendants does not affect the reordering decision at a particular node. For example, the computation of the best phrase at the node "was" is not affected by whether the best phrase of the node "market" is "The equity market" or "The market equity" or any other possibility. This reduces the propagation of errors to some extent.

3. The head words represent the entire subtree. In this manner, language modeling can also be done across non-contiguous units which are connected by a dependency relation.

4. For subtrees which have a preposition as their root, the preposition represents the grammatical information of the entire subtree during scoring with a language model.

4.2.4 Model 4: POS based STLM

We now experiment by using Part-Of-Speech (POS) tags of words for ordering the nodes. In the previous approaches, the language models were trained on the words, which were then used to compute the best strings associated with various nodes. Here, we order the node and its children using a language model trained on POS tag sequences.

For example, in Fig. 4.4, consider the subtree with "date" as head, whose gold POS phrase is "DT NN NN". We use this phrase to build an "NN" POS based STLM. In a similar way, we use the gold phrase "VBN VBN" to build a "VBN" language model and "NN VBZ RB VBN ." to build a "VBZ" language model. Note that here also we do not use leaf nodes of the dependency tree for training the language models.

For decoding, the process involves three steps:

1. Obtain the best string of POS tags for a node.

2. Substitute words in place of the POS tags (of the best POS string).

3. Apply a word-based language model to choose the best phrase.

The third step is important because a single POS tag might be substituted by several words, thus resulting in several candidates for the best phrase. A word-based language model needs to be applied to these resulting candidates in order to obtain the best phrase.

Figure 4.6 A treelet having 3 nodes

For example, let the treelet shown in Fig. 4.6 have the best POS phrase "NN NN VB", where "VB" is the POS tag of "w1" and "NN" is the POS tag of "w2" and "w3". The candidates for the best phrase are then "w2 w3 w1" and "w3 w2 w1", and a word-based language model is applied to choose the best phrase among these two.
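The decoding steps can be sketched as follows; pos_lm_score and word_lm_score are hypothetical stand-ins for the POS based STLM and the word-based LM respectively, and the toy scorers only illustrate the treelet of Fig. 4.6.

```python
from itertools import permutations

def pos_stlm_order(tagged_units, pos_lm_score, word_lm_score):
    """Model 4: pick the best POS tag order first, then choose among the
    word orders consistent with it using a word-based language model.
    `tagged_units` is a list of (word, POS) pairs for the node and its
    children heads."""
    best_tags = max(permutations(tag for _, tag in tagged_units),
                    key=pos_lm_score)
    # All word permutations whose tag sequence equals the best POS order.
    candidates = [tuple(w for w, _ in order)
                  for order in permutations(tagged_units)
                  if tuple(t for _, t in order) == best_tags]
    return max(candidates, key=word_lm_score)

# Treelet of Fig. 4.6: w1/VB with children w2/NN and w3/NN.
units = [("w1", "VB"), ("w2", "NN"), ("w3", "NN")]
pos_score = lambda tags: 1.0 if tags == ("NN", "NN", "VB") else 0.0
word_score = lambda words: 1.0 if words == ("w2", "w3", "w1") else 0.0
print(pos_stlm_order(units, pos_score, word_score))   # ('w2', 'w3', 'w1')
```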

There are two primary advantages of this model:

1. It is more general and deals with unseen words effectively.

2. It is much faster than the earlier models, as the size of a POS based language model is smaller.

4.2.5 Model 5: Marked Head-POS based LM

In the POS based STLM, the head of a particular node is not marked while applying the language model. Hence, all the nodes of the treelet are treated equally while applying the LM. For example, in Fig. 4.7, the structure of the treelets is not taken into account while applying the POS based language model; both are treated in the same manner. In this model, we experiment by marking the head information on the POS of the head word, which lets the two treelets in Fig. 4.7 be treated differently when obtaining the best phrase. As the best POS tag sequence might correspond to several orderings of the treelet, we test various word-based approaches to choose the best ordering among the many possibilities. The best phrase is generated by one of the following methods (illustrated in the sketch below):

1. Apply a word-based language model.

2. Apply a language model trained on a corpus where the POS tags are attached to the words (of the form word_POS).

3. Apply a language model trained on a corpus where only the head of a particular node has an associated POS tag.
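The three variants differ only in how the training (and candidate) strings are represented; the sketch below, with hypothetical helper names of ours, generates the three representations for the phrase of Fig. 4.4 headed by "has".

```python
def lm_inputs(tokens, pos_tags, head_index):
    """Return the three string representations used by the word-based LM,
    the word_POS LM, and the head-marked LM of Model 5 (illustrative)."""
    words_only = " ".join(tokens)
    word_pos = " ".join(f"{w}_{t}" for w, t in zip(tokens, pos_tags))
    head_marked = " ".join(f"{w}_{t}" if i == head_index else w
                           for i, (w, t) in enumerate(zip(tokens, pos_tags)))
    return words_only, word_pos, head_marked

tokens = "A record date has not been set .".split()
tags = ["DT", "NN", "NN", "VBZ", "RB", "VBN", "VBN", "."]
for variant in lm_inputs(tokens, tags, head_index=3):
    print(variant)
# A record date has not been set .
# A_DT record_NN date_NN has_VBZ not_RB been_VBN set_VBN ._.
# A record date has_VBZ not been set .
```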

Figure 4.7 Two different treelets which would have the same best POS tag sequence

4.3 Nearest Neighbour Algorithm

A major bottleneck of the models presented in Section 4.2 is that their worst-case complexity is O(N!), where N is the number of nodes in the subtree. For the models to be used in practical systems, there is a need to decrease their computational complexity. The major reason the computational complexity of the models is O(N!) is that we do an exhaustive search to find the best phrase for each subtree. In order to reduce the computational complexity, we have to modify the search technique. In this regard, we use a graph-based algorithm for searching for the best phrase of a subtree. This reduces the computational complexity from factorial to quadratic. Graph-based

algorithms for natural language applications such as parsing [15], summarization [17] and word sense

disambiguation [16] have been well explored.

We map the problem of finding the best phrase for a subtree to the graph problem of finding a least-cost Hamiltonian path in a complete graph. The problem is formulated as follows. The nodes in the graph represent the words ("$" is added to the bag-of-words to represent the beginning of the phrase), whereas the edges represent the bigram probabilities. A complete directed graph is then constructed over these vertices.

A complete directed graph is a simple graph in which every pair of distinct vertices is connected by directed edges. The complete directed graph on n vertices has n × (n − 1) edges. Every edge in the complete directed graph is associated with a score Score(x, y) that maps the edge between x and y to a real number. These scores are the negative of the conditional probability p(y|x), which is given by −(count(x, y)/count(x)). Here, we take the negative of the conditional probability only for the sake of implementation simplicity. We can see from the above formulation that the path which visits each vertex exactly once and has the least score corresponds to the required phrase of the subtree. Such a path, which visits each vertex exactly once, is called a Hamiltonian path. Now, our task of finding the best phrase is reduced to building a complete directed graph with the nodes of the subtree as vertices and finding the least-cost Hamiltonian path.

Finding the least-cost Hamiltonian path is an NP-complete problem, i.e., the search space is N! for N nodes. However, there are approximation algorithms1 which give an approximate solution with lower computational complexity. The solution might not always be optimal, but the primary advantage is the reduction of the search space from factorial to quadratic. One familiar approximation algorithm is the nearest neighbour algorithm.

In the nearest neighbour algorithm, we start with the "$" node and mark it as visited. Then, we follow the lightest edge going from the current vertex to a vertex which is not yet visited and mark that vertex as visited. We repeat this till all the nodes of the graph are visited, which gives us an approximately least-cost Hamiltonian path.

For the example discussed in Section 4.2.3, the best phrase assigned to the node "was" is computed by taking the permutations of "was" and the children heads ".", "market" and "illiquid". But in this model, instead of searching all the permutations, we use the algorithm mentioned above.

1 http://en.wikipedia.org/wiki/Approximation_algorithm

Figure 4.8 Complete directed graph using the words was, ., market, illiquid

First, we add "$" to the list of words ("was", ".", "market", "illiquid") and consider them as vertices. Then we build the complete directed graph over these vertices. Fig. 4.8 shows the complete directed graph over the words "$", "was", ".", "market" and "illiquid".

The edges are scored with the negative of the conditional probabilities mentioned above. For example, the edge from "illiquid" to "market" has the score −P(market|illiquid). In a similar way, all the edges are scored using the conditional probabilities.

Figure 4.9 Complete directed graph with the best path marked

Once the scored complete directed graph is built, the task is to find the Hamiltonian path with the minimum score. We use the nearest neighbour algorithm to achieve this. We start with the "$" vertex, follow the least-scored edge leaving "$", and mark the vertex it reaches as visited. The process is repeated until all the nodes are marked visited. Fig. 4.9 shows the state of the


graph after all the nodes are visited. From the figure, we get the best path as "$ market was illiquid .".

Once we get the best path, we follow the same steps as in model 3.
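Putting the two sketches together on this example, with purely illustrative counts (the real counts are estimated from the training corpus), the call would look roughly as follows.

    # Toy counts for illustration only; the actual counts come from the
    # training corpus.
    unigram_count = {"$": 10, "market": 8, "was": 9, "illiquid": 2, ".": 10}
    bigram_count = {("$", "market"): 6, ("market", "was"): 5,
                    ("was", "illiquid"): 2, ("illiquid", "."): 2}

    nodes, score = build_complete_graph(["was", ".", "market", "illiquid"],
                                        bigram_count, unigram_count)
    print(" ".join(nearest_neighbour_path(nodes, score)))
    # -> $ market was illiquid .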

Since there are N² possible bigram probabilities (edges), the run-time complexity of the nearest neighbour algorithm is in the order of N². This decrease in the search space in turn affects the system output, since the best output might not be explored.

The method described takes the local best at each step, so the output might not be the global best. In order to get closer to the global best, we store the K best candidates instead of only the top one at each stage. At the end, we obtain the K best phrases for the subtree and choose the phrase having the highest global probability. A higher value of K allows more phrases to be considered. Hence, the search space for the K-best nearest neighbour algorithm will be in the order of K × N².
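One way to realise this idea is a beam-style variant of the nearest neighbour search, sketched below under the same assumptions as the earlier snippets. The thesis does not prescribe this exact implementation, and the "global" score of a candidate is simplified here to the sum of its edge scores.

    import heapq

    def k_best_nearest_neighbour(nodes, score, k, start="$"):
        """Keep the k lowest-scored partial paths at every step instead of
        only the single greedy choice, then return the best complete path.
        Expanding k paths over N steps gives roughly k * N^2 work.
        """
        beam = [(0.0, [start])]                    # (total score, partial path)
        for _ in range(len(nodes) - 1):
            candidates = []
            for total, path in beam:
                for n in nodes:
                    if n not in path:
                        candidates.append((total + score[(path[-1], n)],
                                           path + [n]))
            beam = heapq.nsmallest(k, candidates)  # prune to the k best
        return min(beam)[1]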


Chapter 5

Results and Discussion

In the previous chapter, we proposed models for sentence realisation which take a bag-of-words with dependency constraints and produce a well-formed sentence. This chapter presents the results of the proposed models on the standard test data using two different metrics. The results are also compared with previous methods.

5.1 Results

Similar to most of the previous work on sentence realisation, we have used the Bilingual Evaluation Understudy (BLEU) score [20] and the percentage of exactly matched sentences as evaluation metrics. We compute these metrics by comparing the system-generated sentences with the reference sentences. As our system guarantees that every input bag-of-words can be realised as a sentence, the special coverage-dependent evaluation adopted in most grammar-based generation systems is not necessary in our experiments.
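As a rough illustration of the evaluation (the thesis does not state which BLEU implementation was used), the two metrics could be computed along these lines, here with NLTK's corpus-level BLEU.

    from nltk.translate.bleu_score import corpus_bleu

    def evaluate(system_sentences, reference_sentences):
        """Compute corpus BLEU and the percentage of exact matches.

        Both arguments are lists of tokenised sentences (lists of words);
        each system sentence is compared against a single reference.
        """
        bleu = corpus_bleu([[ref] for ref in reference_sentences],
                           system_sentences)
        exact = sum(hyp == ref for hyp, ref in
                    zip(system_sentences, reference_sentences))
        return bleu, 100.0 * exact / len(reference_sentences)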

As mentioned earlier in section 4.1, we evaluate our models on two types of input which differ in their dependency constraints. In the first input type, the dependency constraints among the bag-of-words are extracted from the manually annotated treebank; in the second input type, the dependency constraints are noisy, as they are automatically extracted from a parser.

Table 5.1 shows the results of models 1-5 for both the input extracted from the treebank and the parser output. We can observe that in model 1, the BLEU score of the parser input is higher when compared to the treebank input. This might be because the parser input is projective (as we used projective parsing), whereas the treebank input might contain some non-projective cases. In general, for all the models, the results


          Treebank (gold)          Parser (noisy)
Model     BLEU score   ExMatch     BLEU score   ExMatch
Model 1   0.5472       12.62%      0.5514       12.78%
Model 2   0.6886       18.45%      0.6870       18.29%
Model 3   0.7284       21.86%      0.7227       21.52%
Model 4   0.7890       28.52%      0.7783       27.92%
Model 5   0.8156       29.47%      0.8027       28.78%

Table 5.1 The results of Models 1-5

with noisy dependency links are comparable to the cases where gold dependency links are used, which is encouraging.

From Table 5.1, we can observe that model 5 gives the best BLEU scores for both input types, and that the difference in the BLEU scores between the two input types is only 0.0129. From this low difference, we can infer that our models work well even when the dependency constraints between the bag-of-words are noisy.

As mentioned in section 4.3, we have applied the graph-based nearest neighbour algorithm for searching for the best phrase for each subtree. We have directly applied the search algorithm on the best-performing model 5. We have also suggested storing the K best candidates instead of only the best one. Table 5.2 shows the results of the K-best nearest neighbour algorithm for different values of K.

K-best   BLEU score
1        0.4513
2        0.6211
3        0.6885
5        0.7365
10       0.7716
20       0.7868
30       0.7968

Table 5.2 Results of the Nearest Neighbour Algorithm for different values of K

We can see from Table 5.2 that, for K = 30, the BLEU score obtained for the standard test set is 0.7968. Model 5, which does an exhaustive search over N! permutations, achieved a BLEU score of 0.8156 on the same test data. The results show that there is only a slight decrease of 0.0188 in BLEU score with the decrease in computational complexity from N! to K × N² (K = 30).

Fig. 5.1 shows the results of the nearest neighbour algorithm with K on the x-axis and BLEU score on the y-axis. We observe that the BLEU score increases at a faster rate up to K = 10 and gradually stabilises after that.


Figure 5.1 Graph showing the BLEU scores of the Nearest Neighbour Algorithm for different values of K

Paper               BLEU score
Langkilde (2002)    0.757
Nakanishi (2005)    0.705
Cahill (2006)       0.6651
Hogan (2007)        0.6882
White (2007)        0.5768
Guo (2008)          0.7440
Our Model           0.8156

Table 5.3 Comparison of results for English WSJ section 23

The results given in Table 5.3 are taken from Guo et al., 2008 [9], which reports the BLEU scores of different systems on section 23 of the PTB. It is difficult to compare sentence realisers directly, as the information contained in the input varies greatly between systems.

Nevertheless, we can clearly see that our system performs better than all the other systems. The main observations from the results are summarised below:

1. Searching the entire search space of O(n!) gives the best performance.

2. Treelet LMs capture the characteristics of phrases headed by various POS tags, in contrast to the sentential LM, which is a general LM.

3. POS tags play an important role in ordering nodes of a dependency structure.

4. The head models performed better than the models that used all the nodes of the subtree.


5. Marking the head of a treelet provides vital clues to the language model for reordering.


Chapter 6

Conclusion and Future Work

6.1 Conclusion

In this thesis, we have addressed the problem of sentence realisation. The input to the sentence realisation task is a bag-of-words with dependency constraints. The expected output is a well-formed sentence built out of the bag-of-words. We have presented five n-gram based models for sentence realisation. They are:

1. Sentential Language Model

2. Subtree-type based Language Models (STLM)

3. Head-word STLM

4. POS based STLM

5. Marked Head-POS based LM

We have evaluated our models on two different types of input (gold and noisy). In the first input type, we have a bag-of-words with dependency constraints extracted from the treebank; in the second input type, the dependency constraints among the bag-of-words are extracted from a parser and are therefore noisy. From the results, we can conclude that the model 'Marked Head-POS based LM' works best, with a BLEU score of 0.8156 on dependency constraints extracted from the treebank and 0.8027 on dependency constraints extracted from the parser output. This is the best result reported for the task of sentence realisation on the standard test data. We also observe that there is only a small decrease in the results when the dependency constraints among the bag-of-words are noisy; from this, we can conclude that the models are fairly robust.


We have also tested the graph-based nearest neighbour algorithm for the task of sentence realisation. We have shown that using the graph-based algorithm can reduce the computational complexity from factorial to quadratic at the cost of a 2% reduction in the overall BLEU score. This method of decreasing the computational complexity at a very low cost makes our module suitable for use in practical applications. We have also examined the importance of storing the K best solutions at each stage and choosing the sentence with the highest global probability at the end. The BLEU score improved from 0.4513 for K = 1 to 0.7968 for K = 30.

6.2 Future work

There are several possible areas of further research as an extension to this work.

The models proposed in this thesis consider only the locally best phrases (local to the subtree) at every step. In order to retain the globally best possibilities at every step, we plan to use beam search, where we retain the K best phrases for every subtree.

We also plan to test the approach on morphologically rich languages such as Hindi, which would require us to expand our feature set. We plan to test factored models in this regard.

The most important experiment that we plan to perform is to test our system in the context of MT, where the input is more realistic and noisy.

To train more robust language models, we plan to use much larger, web-scale data.


Publications

1. Karthik Gali and Sriram Venkatapathy, Sentence Realisation from Bag of Words with dependency constraints. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

2. Karthik Gali, Sriram Venkatapathy and Taraka Rama, From Factorial to Quadratic Time Complexity for Sentence Realization using Nearest Neighbour Algorithm. In Proceedings of The 7th Brazilian Symposium in Information and Human Language Technology.


Bibliography

[1] S. Bangalore, P. Haffner, and S. Kanthak. Statistical machine translation through global lexical selection and sentence reconstruction. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 152, 2007.

[2] R. Begum, S. Husain, A. Dhwaj, D. Sharma, L. Bai, and R. Sangal. Dependency annotation scheme for Indian languages. Proceedings of IJCNLP-2008, 2008.

[3] A. Belz. Probabilistic Generation of Weather Forecast Texts. In Proceedings of NAACL HLT, 2007.

[4] A. Cahill and J. van Genabith. Robust PCFG-Based Generation Using Automatically Acquired LFG Approximations. In Annual Meeting of the Association for Computational Linguistics, volume 44, 2006.

[5] J. Carroll, A. Copestake, D. Flickinger, and V. Poznanski. An efficient chart generator for (semi-)lexicalist grammars. In Proceedings of the 7th European Workshop on Natural Language Generation (EWNLG99), pages 86-95, 1999.

[6] S. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4):359-394, 1999.

[7] D. Crouch, M. Dalrymple, R. Kaplan, T. King, J. Maxwell, and P. Newman. XLE documentation. Available on-line, 2007.

[8] M. Elhadad. FUF: The universal unifier user manual version 5.0. Department of Computer Science, Columbia University, New York, 1991.

[9] Y. Guo, J. van Genabith, and H. Wang. Dependency-Based N-Gram Models for General Purpose Sentence Realisation. Proceedings of the 22nd Conference on Computational Linguistics, 2008.

[10] D. Hogan, C. Cafferkey, A. Cahill, and J. van Genabith. Exploiting Multi-Word Units in History-Based Probabilistic Generation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.

[11] I. Langkilde-Geary. An empirical verification of coverage and correctness for a general-purpose sentence generator. In Proceedings of the 12th International Natural Language Generation Workshop, pages 17-24. Citeseer, 2002.


[12] A. Lavie, S. Vogel, L. Levin, E. Peterson, K. Probst, A. Llitjos, R. Reynolds, J. Carbonell, and R. Cohen. Experiments with a Hindi-to-English transfer-based MT system under a miserly data scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):143-163, 2003.

[13] D. Magerman. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 276-283. Association for Computational Linguistics, Morristown, NJ, USA, 1995.

[14] M. Marcus, M. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2), 1993.

[15] R. McDonald, F. Pereira, K. Ribarov, and J. Hajic. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing, pages 523-530, 2005.

[16] R. Mihalcea. Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 411-418. Association for Computational Linguistics, Morristown, NJ, USA, 2005.

[17] R. Mihalcea and P. Tarau. Multi-document Summarization with Iterative Graph-based Algorithms. In Proceedings of the First International Conference on Intelligent Analysis Methods and Tools (IA 2005), McLean, VA, 2005.

[18] H. Nakanishi and Y. Miyao. Probabilistic models for disambiguation of an HPSG-based chart generator. In Proceedings of the International Workshop on Parsing Technology, 2005.

[19] J. Nivre, J. Hall, J. Nilsson, G. Eryigit, and S. Marinov. Labeled pseudo-projective dependency parsing with support vector machines. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL), pages 221-225, 2006.

[20] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on ACL, 2001.

[21] C. Quirk, A. Menezes, and C. Cherry. Dependency treelet translation: Syntactically informed phrasal SMT. pages 271-279, 2005.

[22] E. Reiter and R. Dale. Building applied natural language generation systems. Natural Language Engineering, 3(01):57-87, 1997.

[23] S. Venkatapathy and S. Bangalore. Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction. In Proceedings of the SSST, NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 152-159, 2007.

[24] M. White. Reining in CCG Chart Realization. Lecture Notes in Computer Science, 2004.

[25] M. White, R. Rajkumar, and S. Martin. Towards Broad Coverage Surface Realization with CCG. In Proceedings of the Workshop on Using Corpora for NLG: Language Generation and Machine Translation (UCNLG+MT), 2007.


CURRICULUM VITAE

1. NAME : Karthik Kumar G

2. DATE OF BIRTH : 08-02-1987

3. PERMANENT ADDRESS:

11-9-78/2, Laxmi nagar colony,

Kothapet, Hyderabad 50032

Andhra Pradesh, India

4. EDUCATIONAL QUALIFICATIONS :

May 2010: Master of Science (by Research) and Bachelor of Technology in Computer Science

and Engineering, IIIT Hyderabad


THESIS COMMITTEE

1. GUIDES:

• Prof. Rajeev Sangal

• Mr. Sriram Venkatapathy

2. MEMBERS :

• Dr. Bruhadeshwar Bezawada

• Prof. B. Yegnanarayana
