
How Would You Say It? Eliciting Lexically Diverse Data for Supervised Semantic Parsing

Abhilasha Ravichander1∗, Thomas Manzini1∗, Matthias Grabmair1

Graham Neubig1, Jonathan Francis2, Eric Nyberg1

1Language Technologies Institute, Carnegie Mellon University
2Robert Bosch LLC, Corporate Sector Research and Advanced Engineering

{aravicha, tmanzini, mgrabmai, gneubig, ehn}@cs.cmu.edu, [email protected]

Abstract

Building dialogue interfaces for real-world scenarios often entails training semantic parsers starting from zero examples. How can we build datasets that better capture the variety of ways users might phrase their queries, and what queries are actually realistic? Wang et al. (2015) proposed a method to build semantic parsing datasets by generating canonical utterances using a grammar and having crowdworkers paraphrase them into natural wording. A limitation of this approach is that it induces bias towards using language similar to the canonical utterances. In this work, we present a methodology that elicits meaningful and lexically diverse queries from users for semantic parsing tasks. Starting from a seed lexicon and a generative grammar, we pair logical forms with mixed text-image representations and ask crowdworkers to paraphrase the generated queries and confirm their plausibility. We use this method to build a semantic parsing dataset from scratch for a dialog agent in a smart-home simulation. We find evidence that this dataset, which we have named SMARTHOME, is demonstrably more lexically diverse and difficult to parse than existing domain-specific semantic parsing datasets.

1 Introduction

Semantic parsing is the task of mapping natural language utterances to their underlying meaning representations. This is an essential component of many tasks that require understanding natural language dialogue (Woods, 1977; Zelle and Mooney, 1996; Berant et al., 2013; Branavan et al., 2009; Azaria et al., 2016; Gulwani and Marron, 2014; Krishnamurthy and Kollar, 2013). Orienting a dialogue-capable intelligent system is accomplished by training its semantic parser with utterances that capture the nuances of the domain. An inherent challenge lies in building datasets with enough lexical diversity to grant the system robustness against natural language variation in query-based dialogue. With the advent of data-driven methods for semantic parsing (Dong and Lapata, 2016; Jia and Liang, 2016), constructing realistic and sufficiently large dialog datasets for specific domains becomes especially important, and is often the bottleneck for applying semantic parsers to new tasks.

∗ The indicated authors contributed equally to this work.

Figure 1: Crowdsourcing pipeline for building semantic parsers for new domains

Wang et al. (2015) propose a methodology for efficient creation of semantic parsing data that starts with the set of target logical forms and generates example natural language utterances for these logical forms. Specifically, the authors of the parser specify a seed lexicon with canonical phrase/predicate pairs for a particular domain, and subsequently a generic grammar constructs canonical utterances paired with logical forms. Because the canonical utterances may be ungrammatical or stilted, they are then paraphrased by crowd workers into more natural queries in the target language. We argue that this approach has three limitations when constructing semantic parsers for new domains: (1) the seed utterances may induce bias towards the language of the canonical utterance, specifically with regard to lexical choice, (2) the generic grammar suggested cannot be used to generate all the queries we may want to support in a new domain, and (3) there is no check on the correctness or naturalness of the canonical utterances themselves, which may not be logically plausible. This is problematic as even unlikely canonical utterances can be paraphrased fluently.

In this paper, we propose and evaluate a new approach for creating lexically diverse and plausible utterances for semantic parsing (Figure 1). First, inspired by the use of images in the creation of datasets for paraphrasing (Lin et al., 2014) and for natural language generation (Novikova et al., 2016), we seek to reduce linguistic bias by using a lexicon consisting of images. Second, a generative grammar, tailored to the domain, combines these images to form mixed text-image representations. Using these two approaches, we retain many of the advantages of existing approaches, such as ease of supervision and completeness of the dataset, with the added bonus of promoting lexical diversity in the natural language utterances and supporting queries relevant to our domain. Finally, we add a simple step to the crowdsourcing experiment in which crowdworkers evaluate the plausibility of the generated canonical utterances. At training time, we conjecture that optionally adding a term to upweight plausible queries might be useful for deploying a semantic parser in real-world settings. Encouraging the parser to focus on queries that make sense reduces emphasis on things that a user is unlikely to ask.

We evaluate our method by building a semantic parser from scratch for a dialogue agent in a smart home simulation. The dialogue agent is capable of answering questions about various sensor activations, and about higher-level concepts which map to these activations. Such a task requires understanding the natural language queries of the user, which could be varied and even indirect. For example, in SMARTHOME, 'where can I go to cool off?' corresponds to the canonical utterance 'which room contains the AC that is in the house?'. Similarly, 'is the temp in the chillspace broke?' corresponds to 'are the thermometers in the living room malfunctioning?'.

As a result of our analysis, we find that the proposed method of eliciting utterances using image-based representations results in considerably more diverse utterances than the previous text-based approach. We also find evidence that the SMARTHOME dataset, constructed using this approach, is more diverse than other domain-specific datasets for semantic parsing, such as GEOQUERY or ATIS. We release this dataset to the community1 as a new benchmark.

2 Example Domain: Smart Home

While our proposed data collection methodology could conceivably be used in a number of domains, for illustrative purposes we choose the domain of a smart home simulation for all our examples. We define a smart home as a home populated with sensors and appliances that are streaming data which can be read. A fully connected dialog agent could reason about and discuss these data streams. Our work attempts to develop a question answering system to support dialogue in this environment.

In the smart home domain, queries could range from complex, such as a user trying to determine the optimal time to start cooking dinner given a party schedule, to simple, such as asking for a temperature reading. While we believe that many queries could be handled with the methodology that we describe, we have limited the types of queries that can be asked to a reasonable subset, primarily single-turn queries about entity states (for example, 'did I leave the lights in the bedroom on?' or 'is the dog safe?').

3 Approach Overview

1 https://github.com/oaqa/resources

Our approach to building a dialog interface for a new domain D first requires analysis of the domain and identification of the entities involved. This builds on the methodology of Wang et al. (2015), but with three significant additions to elicit diversity and capture domain-relevant queries:

1. An additional step of specifying images for the entities in the domain,

2. A domain-specific grammar that captures queries relevant to the particular domain, and

3. A crowdsourcing methodology that includes crowdworkers annotating canonical utterances for plausibility.

After analyzing the domain and the queries we want to support, we construct a seed lexicon and a generative grammar. The generative grammar generates matched pairs of canonical utterances and logical forms. As our seed lexicon contains images, the canonical forms generated are mixed text-image representations. These representations are then shown to workers from Amazon Mechanical Turk2 to paraphrase in natural language.

3.1 Seed Lexicon

Essential to the goal of reducing lexical bias is the use of images to describe the entities in the domain. It is beneficial here to choose images which are representative and will be well understood. The images we used for entities within the SMARTHOME domain are shown in Figure 3. It is not necessary that all entities be assigned images; in fact, it is possible for entities to be named or abstract, and not have any associated images. In these cases, we simply use the natural language description of the entity.

We specify a seed lexicon L, consisting of entities e in our domain and associated images i (when available). Our lexicon consists of a set of rules ⟨e, (i) → t[e]⟩, where t is a domain type. For our smart home domain, we define the possible domain types to be appliances, rooms, food, weather, and entities, and their associated subtypes and states (Figure 2).
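To make the structure of such a lexicon concrete, the sketch below shows one possible way to represent the ⟨e, (i) → t[e]⟩ rules in code. The entity names, image file names, and type labels are illustrative stand-ins in the spirit of the SMARTHOME domain, not entries from the released lexicon.

    # Illustrative sketch of a seed lexicon L: each rule maps an entity e
    # (optionally paired with an image i) to a typed entry t[e].
    # Names, image paths, and types are hypothetical examples only.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LexiconEntry:
        entity: str               # e, the natural language description
        image: Optional[str]      # i, path to an image (None for abstract entities)
        domain_type: str          # t, e.g. appliance, room, food, weather, state

    seed_lexicon = [
        LexiconEntry("thermometer", "images/thermometer.png", "appliance"),
        LexiconEntry("living room", "images/living_room.png", "room"),
        LexiconEntry("milk", "images/milk.png", "food"),
        LexiconEntry("expired", None, "state"),  # abstract state: no image, text is used
    ]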

3.2 Generative Grammar

Next, we utilize a generative grammar G to produce canonical utterance and logical form pairs (c, z), similar to Wang et al. (2015). Our grammar differs from theirs in that, in our work, the grammar G is not a generic grammar but is written to generate the kinds of queries we would actually like to support in our domain D. The rules are of the form α1 β1 γ1 ... → t[z], where α, β, γ are token sequences and t is the domain type. A complete description of our grammar is included in the supplementary material.

2 https://www.mturk.com/mturk/welcome
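As a rough illustration of how such a domain-specific rule might expand into (c, z) pairs, the sketch below interleaves fixed token sequences with lexicon slots. The rule, slot fillers, and logical-form template are hypothetical and are not taken from the grammar in the supplementary material.

    # Hypothetical sketch of one grammar rule: token sequences interleaved with
    # lexicon slots, where each expansion yields a (canonical utterance c,
    # logical form z) pair. Slot fillers and the logical-form template are
    # illustrative assumptions, not the paper's actual grammar.
    from itertools import product

    appliances = ["thermometer", "AC"]   # would come from the seed lexicon
    rooms = ["living room", "bedroom"]

    def expand_state_query_rule():
        """Expand: 'is the' APPLIANCE 'in the' ROOM 'malfunctioning ?'"""
        pairs = []
        for appliance, room in product(appliances, rooms):
            canonical = f"is the {appliance} in the {room} malfunctioning?"
            logical_form = f"checkState(getAppliance('{room}', '{appliance}'), 'malfunctioning')"
            pairs.append((canonical, logical_form))
        return pairs

    for c, z in expand_state_query_rule():
        print(c, "=>", z)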

3.3 Canonical Utterances and Logical Forms

We generate canonical form and logical form pairs (c, z) exhaustively using the seed lexicon L and grammar G for domain D. This resulted in exactly 948 canonical and logical form pairs in our domain.

The logical formalism we utilize closely corresponds to Python syntax. It consists of functional programs: all questions in our smart-home domain are formulated with the help of a context tree, and each question is defined as spans over this tree, as shown in Figure 4. The root node of the tree is the environment that we are operating in, and at the surface level are sensors. These spans are then used to construct a single-line Python statement that is executed against our smart home simulation to retrieve an answer. From this construct, we are able to execute logical forms against the simulation seamlessly after having retrieved them.
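The following is a minimal sketch of this idea under our own assumptions about the simulation interface: a span over the context tree is rendered as a one-line Python expression and evaluated against a toy simulation object. The class, attribute names, and span encoding are illustrative only, not the released execution code.

    # Hypothetical sketch: a logical form (a span over the context tree rooted at
    # the environment, with sensors at the leaves) is compiled into a one-line
    # Python expression and evaluated against the simulation. The simulation
    # object and its attributes are illustrative assumptions.
    class Simulation:
        """Toy stand-in for the smart-home simulation state."""
        def __init__(self):
            self.rooms = {"living room": {"thermometer": {"reading": 21.5,
                                                          "malfunctioning": False}}}

    def execute(logical_form: str, sim: Simulation):
        # A single-line Python statement is evaluated against the simulation.
        return eval(logical_form, {"sim": sim})

    sim = Simulation()
    # e.g. span (environment -> living room -> thermometer -> malfunctioning)
    answer = execute("sim.rooms['living room']['thermometer']['malfunctioning']", sim)
    print(answer)  # False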

3.4 Data Collection Methodology

The next step after forming canonical utterance and logical form pairs is generating paraphrases for each pair. We use Amazon Mechanical Turk to distribute our data collection task. Over a span of three days, we collected data from nearly 200 Turkers, some of whom participated in the data collection task multiple times.

During the first stage of the task, the Turkers were instructed to paraphrase canonical utterances as naturally as possible, as well as mark the utterances themselves as likely to be asked or not asked. They were also shown a small number of examples, and possible paraphrases. These examples were created using images not present in the lexicon, so as to avoid biasing the Turkers.

In the next stage, the Turkers were asked to enter their paraphrases. Each worker was asked to enter a total of 60 paraphrases over the course of the task. These paraphrases were presented to the worker over 3 pages, with 2 paraphrases per canonical utterance. Turkers were also asked to state if they believed that the question that they were paraphrasing was likely or not. This annotation could subsequently be used for curation, or to bias semantic parsing models towards answers that users labeled as likely. Most canonical forms had a single image inserted into the text (875, or 92.3%), some had no images inserted into the text (58, or 6.1%), and even fewer had two images inserted into the text (15, or 1.6%). Each logical form was shown to five Turkers for paraphrasing, resulting in approximately ten paraphrases for each logical form.

Figure 2: The lexicon used to generate canonical and logical forms.

Figure 3: Images for terms in the seed lexicon

Figure 4: An example of a concept tree that could be used to define the logical form structure.

Finally, we took several post-processing steps to remove improper paraphrases from our dataset. Firstly, a large portion of Turker mistakes arose from making real-world assumptions and neglecting to mention locations in their utterances. We automatically shortlisted all paraphrases missing location information. We then manually inspected each of these paraphrases and discarded the ones identified as invalid. In all, this post-processing step took less than one day and could easily have been delegated to crowd workers, had it been necessary. Secondly, we automatically pruned all paraphrases in our dataset which were associated with more than one logical form. This left us with 8294 paraphrases.
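A sketch of the second pruning step might look as follows; the (paraphrase, logical form) pair representation is an assumption about how the raw collection could be stored.

    # Sketch of the automatic pruning step: drop any paraphrase string that was
    # collected for more than one distinct logical form.
    from collections import defaultdict

    def prune_ambiguous_paraphrases(pairs):
        """pairs: iterable of (paraphrase, logical_form). Returns the unambiguous subset."""
        pairs = list(pairs)
        forms_per_paraphrase = defaultdict(set)
        for paraphrase, logical_form in pairs:
            forms_per_paraphrase[paraphrase].add(logical_form)
        return [(p, lf) for p, lf in pairs if len(forms_per_paraphrase[p]) == 1]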

4 Data Statistics and Analysis

In this section, we describe some statistics of our data set, perform a comparative analysis with the data collection paradigm of existing work, and contrast the statistics of our dataset with other semantic parsing datasets.

4.1 Data Statistics

In its uncurated form, our dataset consists of 10522 paraphrases spread across 948 distinct canonical and logical form pairs. Each pair has a minimum of 10 paraphrases and a maximum of 28 paraphrases. These paraphrases were collected over 195 Turker sessions using the methodology described in the previous section. Following the removal of duplicate paraphrases and paraphrases missing location information, we are left with 8294 paraphrases over the same 948 logical forms.

4.2 Effect of Data Collection Methodology

We ran an experiment on purely text-based representations, as suggested by Wang et al. (2015), to compare and contrast with our mixed text-image representations. In an effort to subdue domain variance, we utilize our domain-specific grammar to generate text-based canonical representations. We randomly subsample 100 logical form and canonical utterance pairs from this dataset and recreate the crowdsourcing experiment suggested by Wang et al. (2015), wherein each canonical utterance is shown to ten Turkers to paraphrase and each Turker receives four canonical utterances to paraphrase. The workers are asked to reformulate the canonical utterance in natural language or state that it is incomprehensible. In this way, we collect 1000 paraphrases associated with the 100 logical forms. For each of these logical forms, we randomly subsample paraphrases from the set gathered using the proposed mixed text-image methodology. We then compare the two and observe the results shown in Table 1. We evaluate the results on three metrics:

Lexical Diversity We estimate the lexical diversity elicited by the two methodologies by comparing the total vocabulary size as well as the type-to-token ratio, as shown in Table 1. We find that both the total vocabulary size and the type-to-token ratio of the paraphrases collected using the proposed crowdsourcing methodology are considerably higher than those of an equivalent number of paraphrases collected using the methodology suggested in Wang et al. (2015).
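For concreteness, the two diversity measures can be computed roughly as in the sketch below; whitespace tokenization and lowercasing are our assumptions, since the exact tokenization is not specified.

    # Sketch of the lexical diversity measures: total vocabulary size and
    # type-to-token ratio (TTR) over a list of collected paraphrases.
    def vocab_size_and_ttr(paraphrases):
        tokens = [tok.lower() for p in paraphrases for tok in p.split()]
        types = set(tokens)
        return len(types), len(types) / len(tokens)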

Lexical Bias We estimate bias by computing the average lexical overlap between the paraphrase generated by the Turker and the canonical utterance they were shown. For the text-image experiment, we consider the equivalent text representation of the canonical utterance, obtained by substituting the images with terms from the lexicon. We find that the proposed crowdsourcing methodology elicits considerably less lexical bias, as shown in Table 1.
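A corresponding sketch for the lexical overlap measure, again assuming whitespace tokenization and lowercasing, is:

    # Sketch of the lexical bias measure: for each (canonical utterance, paraphrase)
    # pair, count the word types the two share, then average over the dataset.
    def average_lexical_overlap(pairs):
        """pairs: iterable of (canonical_utterance, paraphrase) strings."""
        overlaps = [len(set(c.lower().split()) & set(p.lower().split()))
                    for c, p in pairs]
        return sum(overlaps) / len(overlaps)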

Relevance We estimate relevance by randomly sampling one paraphrase each for one hundred logical forms using the two methodologies. We then manually annotate them for relevance. Here, relevance is defined as a paraphrase exactly expressing the meaning of the original canonical form.

We performed this analysis on both our final dataset and the data that was collected in the same manner as described in Wang et al. (2015). We find that our dataset had an estimated relevance of 60% when compared directly with the same random logical forms sampled from the data collected in the manner of Wang et al. (2015), which had an estimated relevance of 69%. Randomly sampling from our entire curated dataset, we find that we have an estimated relevance of 66%.

Representation             Vocab Size   TTR    Lexical Overlap
Text (Wang et al., 2015)   291          .044   5.50
Text-Image (ours)          438          .066   4.79

Table 1: Comparison of the data creation methodology of Wang et al. (2015) and this work. 'Vocab size' is the total vocabulary size across an equal number of paraphrases collected for the same logical forms using the two methodologies. TTR is the word type:token ratio. Lexical overlap measures the average number of words in common between the canonical utterances and the paraphrases in the two methodologies.

4.3 Comparison with Other Data Sets

In order to examine the lexical diversity in the original dataset, we examine the ratio of the total number of word types seen in the natural language representations to the total number of token types in the meaning representation. We compare against four publicly accessible datasets:

OVERNIGHT The OVERNIGHT dataset (Wang et al., 2015) consists of 26k examples distributed across eight different domains. These examples are obtained by asking crowdworkers to paraphrase slightly ungrammatical natural language realizations of a logical form.

GEO880 GEOQUERY is a benchmark dataset for semantic parsing (Zettlemoyer and Collins, 2005) which contains 880 queries to a U.S. geography database. The dataset is divided into canonical train-test splits, with the first 680 examples used for training and the last 200 examples used for testing.

ATIS This dataset is another benchmark semantic parsing dataset that contains queries for a flights database, each with an associated meaning representation in lambda calculus. The dataset consists of 5,410 queries and is traditionally divided into 4,480 training instances, 480 development instances, and 450 test instances.


Dataset     Example (natural language / meaning representation)

GEO
  how many states border the state with the largest population?
  answer(A,count(B,(state(B),next_to(B,C),largest(D,(state(C),population(C,D)))),A))

JOBS
  what jobs desire a degree but don't use c++?
  answer(A,(job(A),des_deg(A),\+((language(A,C),const(C,'c++')))))

ATIS
  what flights from tacoma to orlando on saturday
  ( lambda $0 e ( and ( flight $0 ) ( from $0 tacoma:ci ) ( to $0 orlando:ci ) ( day $0 saturday:da ) ) )

OVERNIGHT
  what players made less than three assists over a season
  ( call SW.listValue ( call SW.getProperty ( ( lambda s ( call SW.filter ( var s ) ( call SW.ensureNumericProperty ( string num_assists ) ) ( string < ) ( call SW.ensureNumericEntity ( number 3 assist ) ) ) ) ( call SW.domain ( string player ) ) ) ( string player ) ) )

SMARTHOME
  has the milk gone bad?
  ROOT["(None, 'refrigerator', 'milk', 'getFood>checkState-expired state')"]

Table 2: Examples from the GEO, JOBS, ATIS, OVERNIGHT, and SMARTHOME datasets

Dataset             NL Types   MR Types   NL/MR Ratio
GEO                 283        148        1.91
ATIS                934        489        1.91
JOBS                387        226        1.71
OVERNIGHT           1422       199        7.14
SMARTHOME (ours)    1356       83         16.33

Table 3: Number of word types in the natural language compared to the number of word types in the logical form. A larger ratio indicates more lexical diversity for the same complexity of the logical form.

JOBS The JOBS dataset (Zettlemoyer and Collins, 2005) consists of 640 queries to a job listing database, where each query is associated with Prolog-style semantics. This dataset is traditionally divided into 500 examples for training and 140 examples for testing.

An example of the kind of query that can be found in each of these datasets is given in Table 2.

In the analysis, we find that on average SMARTHOME exhibits nearly twice the word-type to meaning-representation-token ratio of most existing semantic parsing datasets, as shown in Table 3.

4.4 Logical Form Plausibility

For each canonical utterance, Turkers were asked to state whether the canonical form was 'likely' or 'not likely'. By examining the most polar of these ratings, we see interesting patterns. For example, the canonical form 'what are the readings of the thermometers in the hallway?' is rated as a highly likely form according to Turkers and does indeed seem like a question that could be asked in the real world. On the other hand, one of the less likely forms according to the Turkers, 'are the televisions in the bathroom on?', is indeed not likely, as bathrooms are arguably among the least likely rooms in which one would encounter multiple televisions. Overall, 752 out of 948 logical forms were identified as very plausible by at least 60% of the Turkers who paraphrased them, indicating that they were reasonable questions to ask.
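The tally behind this plausibility figure can be sketched as below; the vote representation is an assumption about how the 'likely'/'not likely' annotations might be stored.

    # Sketch of the plausibility tally: a logical form counts as very plausible
    # if at least 60% of the Turkers who paraphrased it marked it 'likely'.
    def very_plausible_forms(votes, threshold=0.6):
        """votes: dict mapping logical_form -> list of booleans (True = 'likely')."""
        return [lf for lf, vs in votes.items()
                if vs and sum(vs) / len(vs) >= threshold]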

5 Semantic Parsing Experiments

Finally, it is of interest how the data collection methodology influences the realism and difficulty of the semantic parsing task. In this section, we run several baseline models to measure this effect.

5.1 Models

We present three different baselines on our dataset, including a state-of-the-art neural model with an attention-based copying mechanism (Jia and Liang, 2016).


Figure 5: Neural Reranking Model

Jaccard First, we experiment with a simple baseline using Jaccard similarity, which is given by J(A, B) = |A ∩ B| / |A ∪ B|. For each query in the test set, we find the paraphrase in the training set which has the highest Jaccard similarity score with the test query and return its associated logical form.
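A sketch of this baseline, assuming whitespace tokenization over queries and paraphrases, is:

    # Sketch of the Jaccard baseline: return the logical form of the training
    # paraphrase whose word set has the highest Jaccard similarity with the query.
    def jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0

    def jaccard_baseline(query, train_pairs):
        """train_pairs: list of (paraphrase, logical_form)."""
        _, best_form = max(train_pairs, key=lambda pair: jaccard(query, pair[0]))
        return best_form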

Neural Reranking Model We next experiment with a neural reranking model for semantic parsing, which learns a distribution over the logical forms by learning a distribution over their associated paraphrases as a proxy. This model has the added advantage of being independent of the choice of the formal language, and has been used for tasks such as answer selection (Wang and Nyberg, 2015; Tan et al., 2015), but not for semantic parsing. The basic model is shown in Figure 5. We generate a representation of both the test query and the paraphrase using a bidirectional LSTM and use a hinge loss function as specified:

L = max(0,M − d(p∗, p+) + d(p∗, p−))

where M is the margin, d is a distance function, p* is the test query, p+ is a paraphrase that has the same meaning representation as p*, and p− is a paraphrase that does not. For our experiments, we choose d to be the product of the Euclidean distance and the sigmoid of the cosine distance between the two representations, and M to be 0.05.
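A sketch of this distance and loss, written directly from the formula above and assuming the bidirectional-LSTM encodings are already available as vectors, is:

    # Sketch of the reranker's distance and loss. d is the product of the
    # Euclidean distance and the sigmoid of the cosine distance, M = 0.05,
    # and the loss follows the formula given above verbatim.
    import numpy as np

    def distance(u, v):
        euclidean = np.linalg.norm(u - v)
        cosine_dist = 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return euclidean * (1.0 / (1.0 + np.exp(-cosine_dist)))

    def hinge_loss(query_vec, pos_vec, neg_vec, margin=0.05):
        # L = max(0, M - d(p*, p+) + d(p*, p-)), as written in the text
        return max(0.0, margin - distance(query_vec, pos_vec) + distance(query_vec, neg_vec))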

We group all the paraphrases by logical form, and create training examples by picking all possible combinations within one grouping as positive samples, and randomly sampling from the remaining top-25 matching paraphrases for negative examples. At test time, we first identify the twenty-five most likely candidates using a Jaccard-based search engine over the paraphrases in the training data. We then identify the most likely paraphrase from amongst these using the neural reranking model.
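One way this training-pair construction could be implemented is sketched below; details beyond what is stated above (e.g. how many negatives are drawn per positive pair) are our assumptions.

    # Sketch of training-triple construction: positives are all combinations of
    # paraphrases that share a logical form; negatives are sampled from the
    # top-25 Jaccard matches that belong to a different logical form.
    import random
    from itertools import combinations

    def _jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0

    def build_training_triples(pairs, negatives_per_positive=1):
        """pairs: list of (paraphrase, logical_form). Returns (anchor, positive, negative) triples."""
        groups = {}
        for paraphrase, lf in pairs:
            groups.setdefault(lf, []).append(paraphrase)
        triples = []
        for lf, group in groups.items():
            others = [p for p, other_lf in pairs if other_lf != lf]
            for anchor, positive in combinations(group, 2):
                candidates = sorted(others, key=lambda p: _jaccard(anchor, p), reverse=True)[:25]
                for negative in random.sample(candidates, min(negatives_per_positive, len(candidates))):
                    triples.append((anchor, positive, negative))
        return triples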

Neural Semantic Parsing Model We also implement the neural semantic parsing model with an attention-based copying mechanism from Jia and Liang (2016). We use the same setting of hyperparameters that gave the best results on GEO, OVERNIGHT, and ATIS. Specifically, we run the experiments with 200 hidden units and 100-dimensional word vectors, and all the parameters of the network are initialized from the interval [-0.1, 0.1]. We train the model for 30 epochs, starting with a learning rate of 0.1 and halving the learning rate every 5 epochs from the 15th epoch onwards. We refer the reader to Jia and Liang (2016) for further details about the model.
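One reading of this learning-rate schedule is sketched below; whether the first halving happens at the 15th or the 20th epoch is our interpretation of the wording above.

    # Sketch of the learning-rate schedule: start at 0.1, halve every 5 epochs
    # from the 15th epoch onwards, for 30 epochs total (one possible reading).
    def learning_rate(epoch, base_lr=0.1):
        """epoch is 1-indexed."""
        if epoch < 15:
            return base_lr
        return base_lr * (0.5 ** ((epoch - 15) // 5 + 1))

    # epochs 1-14: 0.1, 15-19: 0.05, 20-24: 0.025, 25-29: 0.0125, 30: 0.00625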

5.2 Results and Discussion

We evaluate these models on independent data in the form of the OVERNIGHT and GEO datasets. We use the standard train-test splits suggested by Zettlemoyer and Collins (2005) and Wang et al. (2015). The full results are presented in Table 4. We observe that the neural semantic parsing model performs relatively poorly on the SMARTHOME dataset compared to OVERNIGHT or GEO. Careful error analysis suggests that most of the errors stem from the following types of queries in our dataset, which are not present in OVERNIGHT or GEO:

• The model not differentiating between the singular and plural forms (for example, 'which room in the house can you find the stereo?' maps to the logical form for plural radios instead of the singular),

• The model not recognizing terms which have not been seen in the training data, i.e. unseen vocabulary (for example, 'does bob not have any energy?' does not map to the logical form for checking if Bob is tired, because the model has never seen that to not have energy means being tired for living entities),

• The model not being able to respond to indirect queries in the test set (for example, 'how long will the heat have to run?' does not map to the logical form for how long the weather will be cold, or 'do i need to change the lights in the living room?' does not map to the logical form for the living room lights not working correctly),

• Errors with and between complementary-valued variables such as on/off and malfunctioning/not malfunctioning (for example, 'does the tv in the bathroom work?' maps to the logical form for the TV malfunctioning, when it should map to the logical form for the TV not malfunctioning).

System                          SMARTHOME (ours)   OVERNIGHT   GEO
Jaccard                         18.0%              24.82%      40.7%
Neural Reranker                 30.3%              41.91%      60.2%
Seq2Seq (Jia and Liang, 2016)   42.1%              75.8%       85.0%

Table 4: Test accuracy results of different systems on the SMARTHOME dataset as compared to OVERNIGHT and GEO

We are aware that by accounting for plural nouns, we added a dimension of difficulty for all canonical forms that have a plural/singular sibling, which is not present in the datasets to which we compare. We found that 29.7% of the Seq2Seq model's mistakes contained a wrong quantity. Similarly, the smart-home domain includes complementary terms that sometimes form the only difference between two canonical forms (e.g. functioning vs. malfunctioning, on vs. off). We measure that 43.2% of the Seq2Seq model's errors contain an incorrect complementary term, and 9.8% contain both a wrong quantity and a wrong complementary term. We conclude that handling plurals and complementary forms makes the task more difficult, particularly as they are often not differentiated well in conversational language. The remaining 36.9% of errors made by the model can largely be attributed to lexical diversity, indirect queries, or confusion between entity states.

This work represents a first step in considering lexical diversity as an important criterion while creating semantic parsing datasets. Due to the ambiguity introduced by images (though it is hard to say whether this ambiguity stems only from the crowdworkers' interpretation of the images, or from the overall difficulty of paraphrasing a mixed text-image representation), this could come at the cost of generating slightly less relevant queries. We hope this starts the conversation and inspires further research into better ways of introducing lexical diversity.

6 Related Work

Semantic parsing has been used in dialog systems with significant success (Zhu et al., 2014; Padmakumar et al., 2017; Engel, 2006). Supervised semantic parsing is of special practical interest because, when building dialogue systems for new domains, it is important to be able to adapt to domain-specific language. Domains exhibit varied linguistic phenomena, and every domain has its own vocabulary (Kushman and Barzilay, 2013; Matuszek et al., 2012; Tellex et al., 2011; Krishnamurthy and Kollar, 2013; Wang et al., 2015; Quirk et al., 2015). Training a semantic parser for these domains involves understanding the kinds of language used in a domain; however, the cost of supervision in associating natural language with equivalent logical forms is prohibitive.

In an attempt to overcome this overhead of supervision, several approaches have been suggested, including learning from denotation matching (Clarke et al., 2010; Liang et al., 2011). As the authors of Wang et al. (2015) point out, paraphrasing overcomes this overhead by being a considerably more lightweight form of supervision. However, methods such as theirs which utilize text induce lexical bias.

Novikova et al. (2016) show that using images reduces this lexical bias for natural language generation tasks. In this work, we unite these strands of research by presenting a methodology in which we construct a seed lexicon from images and use a generative grammar to combine these images into questions, each paired with an associated logical form. These can then be paraphrased by workers from Amazon Mechanical Turk. Our experiment provides evidence that partially replacing canonical form text with images leads to measurably higher lexical diversity in crowdsourced paraphrases. In contrast to Wang et al. (2015), we operate only inside a single domain and observe the linguistic patterns specific to the smart-home setting (see Sec. 5.2). It remains to be examined whether the observed large increase in diversity can be reproduced in a different domain with different language patterns and colloquialisms. Another immediate research direction, inspired by Novikova et al. (2016), is replacing more of the canonical form representation with images to further reduce lexical bias and increase variety. This would require the development of a symbol set that is sufficiently expressive while not being overly ambiguous. We anticipate this converging to a tradeoff between the diversity and relevance measures (see Sec. 4.2).

7 Conclusion

The primary goal of this paper is to highlight steps to be taken in order to apply semantic parsing in the real world, where systems need robustness against variation in natural language. In this work, we propose a novel crowdsourcing methodology for semantic parsing that elicits lexical diversity in the training data, with the aim of promoting future research in constructing less brittle semantic parsing systems. We utilize combined text-image representations which, we believe, reduce lexical bias towards language from the lexicon, at the cost of additional ambiguity introduced by the use of images. We find that this crowdsourcing methodology elicits demonstrably more lexical diversity than previous crowdsourcing methodologies suggested for creating semantic parsing datasets. The dataset created using this methodology offers unique challenges that result in lower performance of semantic parsing models as compared to standard semantic parsing benchmark datasets. The dataset contains both direct and indirect conversational queries, and we believe that learning to recognize the semantics of such varied queries will open up new directions of research for the community.

Acknowledgments

This research has been supported by funds provided by Robert Bosch LLC to Eric Nyberg's research group as part of a joint project with CMU on the development of a context-aware, dialog-capable personal assistant. The authors are grateful to Bosch Pittsburgh researcher Alessandro Oltramari for valuable insights and productive collaboration. We would also like to thank Alan Black and Carolyn Rose for helpful discussions, and Shruti Rijhwani, Rajat Kulshreshtha, and Pengcheng Yin for their suggestions on the writing of this paper. We are grateful to all the students from the Language Technologies Institute who helped us test our methodology before releasing it to Turkers, especially Maria Ryskina, Soumya Wadhwa, and Evangelia Spiliopoulou. We also thank the anonymous reviewers for helpful and constructive feedback.

References

Amos Azaria, Jayant Krishnamurthy, and Tom M. Mitchell. 2016. Instructable intelligent personal agent. In Dale Schuurmans and Michael P. Wellman, editors, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. AAAI Press, pages 2681–2689.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP. Association for Computational Linguistics, pages 1533–1544.

S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, pages 82–90.

James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, pages 18–27.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. CoRR abs/1601.01280. http://arxiv.org/abs/1601.01280.

Ralf Engel. 2006. SPIN: A semantic parser for spoken dialog systems. In Proceedings of the Fifth Slovenian and First International Language Technology Conference (IS-LTC 2006).

Sumit Gulwani and Mark Marron. 2014. NLyze: Interactive programming by natural language for spreadsheet data analysis and manipulation. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, SIGMOD '14, pages 803–814.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. CoRR abs/1606.03622. http://arxiv.org/abs/1606.03622.

Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. Transactions of the Association for Computational Linguistics 1:193–206.

Nate Kushman and Regina Barzilay. 2013. Using semantic unification to generate regular expressions from natural language. In North American Chapter of the Association for Computational Linguistics (NAACL).

Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, pages 590–599.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, pages 740–755.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. arXiv preprint arXiv:1206.6423.

Jekaterina Novikova, Oliver Lemon, and Verena Rieser. 2016. Crowd-sourcing NLG data: Pictures elicit better data. In Proceedings of the 9th International Natural Language Generation Conference. Association for Computational Linguistics, pages 265–273.

Aishwarya Padmakumar, Jesse Thomason, and Raymond J. Mooney. 2017. Integrated learning of dialog strategies and semantic parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Chris Quirk, Raymond J. Mooney, and Michel Galley. 2015. Language to code: Learning semantic parsers for if-this-then-that recipes. In Association for Computational Linguistics.

Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI Conference on Artificial Intelligence.

Di Wang and Eric Nyberg. 2015. CMU OAQA at TREC 2015 LiveQA: Discovering the right answer with clues. Technical report, Carnegie Mellon University, Pittsburgh, United States.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Association for Computational Linguistics.

William A. Woods. 1977. Lunar rocks in natural English: Explorations in natural language question answering. In Linguistic Structures Processing. North Holland, pages 521–569.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2. AAAI Press, AAAI'96, pages 1050–1055.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the 21st Conference on Uncertainty in AI, pages 658–666.

Su Zhu, Lu Chen, Kai Sun, Da Zheng, and Kai Yu. 2014. Semantic parser enhancement for dialogue domain extension with little data. In Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, pages 336–341.