POLITECNICO DI MILANO
Faculty of Engineering
Master of Science programme in Computer Engineering

MATCHING NATURAL LANGUAGE MULTIDOMAIN QUERIES TO SEARCH SERVICES

Advisor: Ing. Marco BRAMBILLA
Co-advisor: Prof. Stefano CERI

Master's thesis of: Claudia Farè, student ID 721154

Academic Year 2008-2009
Matching Natural Language Multi Domain Queries to Search Service
In recent years much research effort has been devoted to information retrieval, both in full-text search and in document indexing. The main fruits of these efforts are the general-purpose search engines that everyone uses daily, such as Yahoo™ and Google™; the latter has even become a verb in the English language, given the popularity of the term. These engines give us the possibility to retrieve any document available on the Web about the topic we are searching for. If the democratization of information availability began with the World Wide Web, with these search engines it reached its peak. However, this simple but broad kind of search brought along some limitations. Users no longer want to look for generic documents about a topic; they want answers to specific questions, as if the search engine were a human being that understood their needs and satisfied them. To look for an answer with a general-purpose engine such as Google™, users usually have to hope that someone has already asked that question in some document, or read through a number of documents hoping to find what they were looking for. Much research has explored this field, and one notable effort is represented by knowledge-based search systems. These systems allow the user to ask a specific question against a knowledge base built on large ontologies that can select the right answers.
This is very effective for "non-changing" information such as technical, mathematical, geographical and physics questions, but it is really unreliable for ever-changing data like news and events. Moreover, the number of domains that a request can involve is restricted to one: only a specific question about one topic at a time can be asked. The objective of future research is therefore to lift the limitation of single-domain questions and to provide results not only about precise facts, but also for questions whose answers can involve several domains, with possible rankings based on features. For example, the question "I want a cheap Chinese restaurant near piazza Duomo in Milan" involves two domains, "place" and "Chinese restaurants", and it requires a ranking based on the price. In recent years web services have grown in popularity. These services offer a software interface which allows other systems to interact with them through the HTTP protocol. The proliferation of open and accessible web search services has allowed the world to access, aggregate and mix data in previously unthought-of ways. From these premises the SeCo project at Politecnico di Milano was born. The project is currently under active development and it aims at building a system that pushes the boundaries of current search engines.
1.2 The Problem
Although many discoveries have been made in the theoretical and formal aspects of distributing multi-domain queries and merging back the results, a lot of work still has to be done on interfacing the system with the user in the most natural way. Interfaces for such services are usually complex and have to be configured manually, sometimes with a rather user-unfriendly syntax. By contrast, services such as Yahoo!™ and Google™ have popularized the simple text box where free text can be entered; the filtering and understanding is left entirely to the service, while users write as they would write to another understanding entity. This is the main problem that led us to the current research project in the field of query analysis, specifically oriented towards understanding, translating and matching to the right services the queries submitted to the SeCo project, which can span more than one domain.
The registration flow comprises all the activities that deal with the registration of new domains, domain descriptions and search services. It is only briefly described here because it does not directly concern the thesis project.
The domain framework deals with domains and their definitions and addresses the problems of semantic annotation, storage, management, and access to domains and their descriptions. The whole multi-domain search engine is based on the concept of domain: a domain is a self-standing field of interest for the user. The domain repository is a data structure that stores domains organized as a taxonomy, representing a tree of domain/sub-domain relationships. Information about the domains can be retrieved by other components through an API.
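For illustration, the taxonomy and its API access can be sketched as follows. The class names and the lookup method are hypothetical, not the actual SeCo repository interface:

```python
# Minimal sketch of a domain repository organized as a taxonomy
# (a tree of domain/sub-domain relationships). Names are illustrative.

class Domain:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Return the domain/sub-domain path from the root."""
        node, parts = self, []
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

class DomainRepository:
    def __init__(self, root):
        self.root = root

    def find(self, name):
        """Depth-first lookup by name, as a component-facing API might do."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            if node.name == name:
                return node
            stack.extend(node.children)
        return None

# Example taxonomy
root = Domain("entertainment")
restaurants = Domain("restaurants", parent=root)
chinese = Domain("chinese", parent=restaurants)
repo = DomainRepository(root)
```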
The search service framework defines a conceptual model of search services and addresses their semantic annotation, storage, management, and access. Its main function is to enable the annotation of the request/response interface of the services. Such annotation uses the WordNet vocabulary and adds labels to each service, its operations, and the input-output parameters of each operation. The framework is concerned only with those operations of a Web service which perform data retrieval, particularly with operations which return itemized and ranked information.
The service analyzer addresses the following problems: the clustering of the available services based on their similarity, the mapping of services to domains, and the definition of join connections between services.
2.2.2 The query execution flow
The main components along the query execution flow are the query analysis, the query-to-domain mapper, the query planner, the query engine and the results transformation. A query sent by the user first passes through the query analysis and the query-to-domain mapper, where the different domains and properties are extracted from the natural language query. It then goes to the query planner, which creates an execution plan taking into account the different costs associated with executing the query, in order to produce the most efficient execution. The different sub-queries are then sent to the domain and service frameworks, which take care of calling the external services through a Web or messaging interface. The results are then collected and, according to the plan, merged back together. The final results are transformed before being sent back to the user.
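The flow described above can be sketched as a pipeline of stages. Every function below is a deliberately trivial stub (keyword lookup instead of semantic mapping, sub-query length as a stand-in for cost), introduced only to make the stage boundaries concrete; none of it is the actual SeCo implementation:

```python
# Illustrative sketch of the query execution flow:
# analysis -> domain mapping -> planning -> execution -> transformation.

def query_analysis(query):
    # split the natural language query into sub-queries (naively, on "and")
    return [part.strip() for part in query.split(" and ")]

def query_to_domain_mapper(sub_queries):
    # associate each sub-query with a domain (hypothetical keyword lookup)
    keywords = {"restaurant": "restaurants", "hotel": "hotels"}
    return [(sq, next((d for k, d in keywords.items() if k in sq), None))
            for sq in sub_queries]

def query_planner(mapped):
    # order sub-queries so that the "cheapest" (here: shortest) runs first
    return sorted(mapped, key=lambda pair: len(pair[0]))

def query_engine(plan):
    # invoke one (fake) service per sub-query and collect raw tuples
    return [(domain, sq) for sq, domain in plan]

def results_transformation(raw):
    # present raw tuples in a user-facing format
    return ["[%s] %s" % (domain, sq) for domain, sq in raw]

def execute(query):
    return results_transformation(
        query_engine(query_planner(query_to_domain_mapper(query_analysis(query)))))
```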
2.2.2.1 Query analysis
In this phase, high-level multi-domain user queries are analyzed and split into sub-queries. A high-level query is the specification of a user's information need at a high level of abstraction. High-level queries are assumed to be quasi-natural language descriptions of the user request, which may require extracting information from multiple domains. The query analysis component decomposes the high-level queries into sub-queries, each representing one search objective in a specific domain. For processing the natural language query, an open source tool developed by the Stanford Natural Language Processing Group is used.
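The decomposition into sub-queries can be illustrated with a much cruder heuristic than the parser-based approach described in the next chapters: splitting on coordinating conjunctions and a few connective markers. The marker list is an arbitrary assumption for the example, not the actual splitting logic:

```python
# A naive illustration of decomposing a high-level query into sub-queries,
# one per search objective. The real component relies on the Stanford
# parser's tree; splitting on a fixed set of connectives is only a very
# rough approximation.
import re

SPLIT_MARKERS = re.compile(r"\b(?:and|then|near|close to)\b", re.IGNORECASE)

def split_query(query):
    parts = SPLIT_MARKERS.split(query)
    return [p.strip(" ,.") for p in parts if p.strip(" ,.")]
```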
2.2.2.2 Query to domain and service mapping
This component addresses the problem of mapping sub-queries to domains, and subsequently to the associated search services, with the purpose of defining low-level queries. To successfully map a sub-query to a domain, we need to retrieve for each sub-query a defined subset of similar domains that allows a crisp identification of the sub-query's semantics, which, due to the use of natural language, can be ambiguous and imprecise. Several techniques can be applied to optimize the recognition of query/sub-query structures which comply with the separation into distinct domains of concern; some of these methods will be analyzed, in their meaning and implementation, in the next chapters.
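A minimal sketch of such a mapping, assuming each domain is described by a bag of labels (in SeCo these would come from the WordNet annotations) and scoring candidates by plain word overlap; the domain dictionary and scoring rule are illustrative assumptions:

```python
# Hedged sketch of mapping a sub-query to the most similar domain by
# word overlap with each domain's label set.

DOMAINS = {
    "restaurants": {"restaurant", "food", "eat", "dinner", "cuisine"},
    "hotels": {"hotel", "room", "stay", "accommodation"},
    "movies": {"movie", "film", "cinema", "theater"},
}

def map_to_domain(sub_query, domains=DOMAINS):
    words = set(sub_query.lower().split())
    scores = {name: len(words & labels) for name, labels in domains.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```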
2.2.2.3 Query Planner
A low-level query is a composite query over a number of services. A query plan is a well-defined scheduling of service invocations, possibly parallelized, that complies with their access modes and exploits the ranking order in which search services return results to rank the combined results. The Query Planner addresses the problem of generating query plans and evaluating them against a cost metric, so as to choose the most promising one for execution. It accepts as input low-level queries, i.e. conjunctive queries that list the specific services to be invoked, already chosen by the Query-to-Domain Mapper. It then schedules the invocations of Web services and the composition of their inputs and outputs, progressively refining its choices and producing an access plan by performing the following steps:
1. Given that services may be accessed according to different patterns, the Query Planner chooses a specific access pattern for each of the services involved in the query, provided that it is compatible with the query.
2. Once the access patterns are fixed, there may still be some indeterminacy in the order of invocation of the different services, some of which may be invoked in parallel. The Query Planner fixes such an order.
3. The main operation for combining search services in our conjunctive setting is the join. The Query Planner selects an execution strategy for each join.
4. Optimality of execution primarily depends upon the cost and time of execution of requests/responses to services. The Query Planner determines the expected number of requests associated with each service in order to obtain the desired number of results, so as to associate an execution cost to each plan.
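Step 4 can be made concrete with a back-of-the-envelope cost model: if a service returns results in chunks of a fixed size, obtaining k results takes about ceil(k / chunk size) requests, and a plan's cost can be approximated as the sum of requests times per-request cost over the invoked services. Both the chunking assumption and the additive cost are simplifications, not SeCo's actual metric:

```python
# Illustrative cost estimation for a query plan.
import math

def expected_requests(desired_results, chunk_size):
    # number of calls needed if the service pages its ranked results
    return math.ceil(desired_results / chunk_size)

def plan_cost(services, desired_results):
    """services: list of (chunk_size, cost_per_request) pairs."""
    return sum(expected_requests(desired_results, chunk) * cost
               for chunk, cost in services)
```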
8/2/2019 Matching Natural Language Multi Domain Queries to Search Service
2.2.2.4 Query engine
The query engine deals with the generation and processing of query execution schedules: it takes the low-level plan from the query planner and executes the different service calls in parallel, merging and ordering when required. The results generated and the combinations returned are collected, as they become available, in their "raw" internal format of tuples of values, and passed to the Result Transformation module to be processed for presentation to the user.
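The "merging and ordering when required" step can be illustrated as a k-way merge of ranked result streams, which preserves the order in which each service returns its tuples. This is a simplification of the real engine, shown only for intuition:

```python
# Merge several ranked result streams into one globally ranked list.
# Each stream yields (score, item) pairs already sorted by ascending score.
import heapq

def merge_ranked(*streams):
    return [item for _, item in heapq.merge(*streams)]
```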
2.2.2.5 Result transformation and Interfaces
This component is dedicated to the definition of proper interfaces for the submission of multi-domain user queries and the transformation of the results into the format requested by the final user. It deals with building an interface for the user to express multi-domain queries in a facilitated way, and building an interface for presenting results. In the latter, the user can drill down into the result set and understand where each piece of information comes from, enabling query refinement, or can peruse the results of past queries to better reformulate his information need.
2.3 Service Marts
The Service Mart component is an abstraction used to manage the publication of, and access to, the data sources in the Search Computing architecture. The goal of a service mart is to ease the publication of a special class of software services, called search services, whose responses are ranked lists of objects. Every service mart is mapped to one "Web object" available on the Internet; therefore, we may have service marts for "hotels", "flights", "doctors", and so on. Thus, service marts are consistent with a view of the "Internet of objects" which is gaining popularity as a new way to reinterpret concept organization on the Web and go beyond the unstructured organization of Web pages.
A Service Mart is a component with a known interface, defined at project time, which manages a collection of similar or semantically correlated services. The Service Mart can invoke these services, presenting itself as a standard interface between the request from a query and its result. The underlying complexity can then be hidden from the higher levels, and the result can be a completely relational model, simplified with respect to the original complexity of the web services model.
A Service Mart is defined by an Id, a Name and a Description which documents its functionality. It is then divided into different levels of abstraction. The highest level is the Service Mart Signature: it contains a description of the service mart attributes, i.e. the sample input and output data that the Mart can handle, and repeating groups consisting of a non-empty set of sub-attributes that collectively define a property of the service mart. At the underlying level there are the Access Patterns. Their structure is analogous to the Signature's, and each of them specifies an additional possible invocation mode. Each parameter in an Access Pattern is identified by a data type, a "mandatory" flag and a direction (input or output). At the third, lowest level there are the Service Interfaces. A Service Interface is a concrete description of an access pattern: it has an interface with its attributes and it is linked to a service implementation, the real link to the web service (to retrieve data from local or remote sources).
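The three abstraction levels can be transcribed, for illustration, into simple data structures. Field names are paraphrased from the text above and the example mart is invented; this is not the actual SeCo model:

```python
# Sketch of the Service Mart levels: Signature, Access Patterns,
# Service Interfaces. Repeating groups are omitted for brevity.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Parameter:
    name: str
    data_type: str
    mandatory: bool
    direction: str          # "input" or "output"

@dataclass
class ServiceInterface:     # third level: concrete, linked to a real service
    name: str
    endpoint: str           # the real link to the web service

@dataclass
class AccessPattern:        # second level: one possible invocation mode
    name: str
    parameters: List[Parameter]
    interfaces: List[ServiceInterface] = field(default_factory=list)

@dataclass
class ServiceMart:          # top level: Id, Name, Description + Signature
    mart_id: str
    name: str
    description: str
    signature: List[str]    # attribute names
    access_patterns: List[AccessPattern] = field(default_factory=list)

hotel_mart = ServiceMart(
    mart_id="sm1",
    name="hotels",
    description="ranked hotel search",
    signature=["city", "price", "rating"],
    access_patterns=[AccessPattern(
        name="by_city",
        parameters=[Parameter("city", "string", True, "input"),
                    Parameter("price", "number", False, "output")],
        interfaces=[ServiceInterface("hotel_api_v1",
                                     "https://example.com/hotels")])],
)
```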
Connection patterns represent the coupling of service marts (at the conceptual level) and of service interfaces (at the physical level). Each pattern has a conceptual name and a logical specification, consisting of a sequence of simple comparison predicates between pairs of attributes or sub-attributes of the two services, which are interpreted as a conjunctive Boolean expression and can therefore be implemented by joining the results returned by calling the service implementations.
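Such a conjunctive specification can be evaluated, in the simplest case, as an equi-join over the two services' result tuples. The attribute names and sample data below are invented for the example:

```python
# Evaluate a connection pattern as a join over two result sets.

def join_on(results_a, results_b, predicates):
    """predicates: list of (attr_a, attr_b) pairs that must all be equal,
    i.e. a conjunctive Boolean expression as described in the text."""
    return [{**a, **b}
            for a in results_a
            for b in results_b
            if all(a[ka] == b[kb] for ka, kb in predicates)]

restaurants = [{"name": "Li", "city": "Milan"}, {"name": "Wok", "city": "Rome"}]
hotels = [{"hotel": "Duomo Inn", "town": "Milan"}]
combined = join_on(restaurants, hotels, [("city", "town")])
```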
This tool implements a probabilistic lexical parser of English natural language
sentences. The outcome of the parser is a tree representation of the sentences that
is suitable for the problem of splitting the queries into sub-queries to be assigned
to different domains.
Probabilistic parsing uses dynamic programming algorithms to compute the most likely parse(s) of a given sentence, given a statistical model of the syntactic structure of a language. Models have been developed for parsing several languages: English (the one used in this research), Chinese, Arabic, and German.
The Stanford Parser is a Natural Language Processing suite of tools and libraries that can be used in various tasks related to natural language analysis. In the context of this research, it is used for its parsing abilities. It is based on a probabilistic model and is implemented as a Java library accompanied by a dictionary file that is used as training data.
The very detailed parse of a sentence makes it possible to try many different approaches to the splitting and analysis of natural language. In this framework, two main approaches have been researched: first-level splitting and clause-level splitting. The research and result details about these approaches are examined in the next chapters.
2.8 Named Entity Recognition
The tool we used for Named Entity Recognition (NER) is the CRF (Conditional Random Field)-based NER system developed by the Stanford NLP Group [2].
Named entity recognition (also known as entity identification and entity extraction) is a subtask of information extraction that seeks to locate and classify atomic elements in a text into predefined categories, such as the names of persons, organizations, locations etc.
Following is an overview of the main technologies and tools used to implement
the framework.
JavaScript Object Notation The JavaScript Object Notation, or JSON, is a lightweight data-interchange format similar to XML. It is a text-based, human-readable format for representing simple data structures and associative arrays (called objects), and is based on the JavaScript syntax for describing data structures. It supports a variety of data structures, the ones most commonly used in high-level languages. It was chosen over other data exchange formats, such as XML, for its simplicity and readability. Its usability and the ease of mapping it to the data types provided by most languages make it very natural to convert back and forth; it is also supported across a multitude of languages and frameworks, libraries having been implemented in every popular high-level language.
CouchDB CouchDB is an Apache Foundation project for a document-based database server written in Erlang, a highly efficient language for concurrent and distributed applications. It diverges from the model of relational databases in many ways and offers a very different performance profile. CouchDB stores free-form documents instead of the records found in a regular relational database. Its schemas are flexible, and the elements can change from one document to another within the same database. This can be useful in many applications, such as ones where schemas are highly likely to change over time, or in situations where the rows are very sparse, that is, many fields are present but only a few are actually used in a single document. The server is accessible via a RESTful JSON API. JSON is its native data format, and this makes it very flexible in terms of what data types can be stored. It also supports computed views, which replace indices and are written in JavaScript by the user. These views follow the Map/Reduce paradigm, where a first function (map) is tasked with going over every document, emitting key/value pairs in which both key and value can be any JSON element. The second function (reduce) then sorts and groups elements by their keys, and transforms and reduces the array of values associated with each key into a single atomic element. The contract is that the computation of one element is totally independent of the computation of any other, allowing the system to distribute the work, cache it aggressively and reorder it as needed to improve performance. CouchDB also supports keeping multiple revisions of a single document, allowing the user to request a particular version. This makes it possible to offer optimistic conflict resolution for updates: during an update operation, the sender is required to state which version its change is based on. If that version corresponds to the currently most up-to-date one, the update is made without any trouble. Otherwise, if another user has already updated the same document, an error message is sent to the user, who is then given the opportunity to rebase on the latest version. Another interesting feature of CouchDB is its core support for master-master replication, where two nodes can be synchronized and both can still act as master, unlike the normal master-slave model where slaves are only used for read operations and the master is the unique point of update. CouchDB was chosen for this project with the idea that the schema was most likely to change greatly over the course of the research, and that the objects we would need to store would not fit a relational database very well.
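The map/reduce contract of a CouchDB view can be imitated in a few lines. In CouchDB the two functions are written in JavaScript and run server-side over every stored document; the transcription below, with its sample documents, is only meant to show the emit-then-group-then-reduce mechanics:

```python
# Minimal imitation of a CouchDB view: a map function emitting
# key/value pairs and a reduce collapsing each key's values.
from collections import defaultdict

def run_view(documents, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):     # map: emit(key, value) pairs
            grouped[key].append(value)
    # reduce: collapse each key's values into a single atomic element
    return {key: reduce_fn(values) for key, values in grouped.items()}

docs = [{"type": "hotel", "price": 80},
        {"type": "hotel", "price": 120},
        {"type": "restaurant", "price": 30}]

by_type = run_view(
    docs,
    map_fn=lambda d: [(d["type"], d["price"])],
    reduce_fn=sum,
)
```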
Ruby Ruby is a high-level programming language known for being highly dynamic and flexible with regard to its syntax. Ruby supports multiple programming paradigms, including functional, object-oriented, imperative and reflective. It also has a dynamic type system and automatic memory management. While its implementation is relatively slower than that of other languages, it has become famous for allowing the creation of DSLs, Domain-Specific Languages, where the host language itself is adapted in order to create a more natural syntax suited to the task at hand. In particular, it has become famous for its use in the Web domain, where it now sports a host of libraries for quickly and efficiently creating Web applications. It is a pure object-oriented language, where every method or function is actually activated by sending a message to the desired instance. Every element
defining the type of each variable. In addition, it supports higher-order functions, pattern matching, and an evolution of interfaces and abstract classes called traits, inspired by Ruby mixins. Among its remarkable features is a library that offers a new perspective on concurrent systems, called actors. This feature, taken from languages such as Erlang and Smalltalk, allows a developer to conceptualize systems as a series of independent processes called actors, which communicate through referentially-transparent messages. Actors are implemented using a mailbox, in effect a queue where messages are stored. The actor can then define its act method to handle these messages, often using pattern matching to dispatch on the type of the message, which can be arbitrary. Scala was primarily chosen because it offers access to the wide library of Java applications. It was also chosen over Java itself because it is more suited to explorative programming, where one does not know exactly the shape the result will take, as was the case at the beginning of this project.
Kestrel Kestrel is a queuing service we use to distribute work tasks amongst the workers, and to send them from the server where the manager of the workers can reach them. While it is quite new, it has proven its worth through use at Twitter Inc., where it powers much of that hugely popular communication service. The particularity of this service is that it complies with the Memcached protocol. Memcached is the most widely used service for storing transient data, used as a cache to avoid repeating costly operations. While Kestrel changes the semantics of this protocol, the fact that it respects the simple get and set contract of Memcached allows the use of a great number of client libraries that, while originally made for Memcached itself, can now be used transparently to send tasks to the Kestrel server. Its basic semantics are that a set operation associates a key with a queue, and the payload given in the operation is added to the end of that named queue. A get operation instead takes the first element from that same named queue, or returns a special message if no element can be found. Kestrel itself is implemented as a daemon in Scala, a high-level language that takes most of its inspiration from Java and is in fact compiled to Java bytecode, allowing it to run seamlessly in the Java Virtual Machine.
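The get/set queue semantics just described can be sketched in memory: set appends the payload to the named queue, get pops from its head or returns a marker when the queue is empty. Kestrel itself speaks the Memcached wire protocol over the network; this sketch only mirrors the contract, not the protocol:

```python
# In-memory illustration of Kestrel's queue semantics.
from collections import defaultdict, deque

class QueueStore:
    EMPTY = None  # stand-in for Memcached's "not found" response

    def __init__(self):
        self._queues = defaultdict(deque)

    def set(self, queue_name, payload):
        # append the payload to the end of the named queue
        self._queues[queue_name].append(payload)

    def get(self, queue_name):
        # take the first element, or the empty marker if none is available
        q = self._queues[queue_name]
        return q.popleft() if q else self.EMPTY
```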
that have similar meaning or bear some relation to those in the query, increasing the chances of matching words in relevant documents. Expanded terms are generally taken from a thesaurus. Even with query expansion methods, no really satisfactory results were achieved, mainly because of some practical limitations of the WordNet tool:
• Two terms that seem to be interrelated may have different parts of speech in WordNet. This is the case for stochastic (adjective) and statistic (noun). Since words in WordNet are grouped by part of speech, it is not possible to find a relationship between terms with different parts of speech.
• Most relationships between two terms are not found in WordNet. For example, how do we know that Mizuho Bank is a Japanese company?
• Some terms are not included in WordNet at all (proper names, locations etc.).
3.2 WordNet Domains
This tool has been used mainly in the field of word sense disambiguation. The underlying hypothesis is that domain labels, such as Medicine, Architecture and Sport, provide a useful way to establish semantic relations among word senses, which can be profitably exploited during the disambiguation process. One of the first approaches to word domain disambiguation through WordNet domains is found in [5], where words in a text are tagged with a domain label in place of a sense label, originally taken from the classic WordNet dictionary. They adopted frequency measures based respectively on the intra-text frequency and the intra-word frequency of a domain label.
In [6] the Domain Relevance Estimation (DRE) technique is presented. Given a certain domain, DRE distinguishes between relevant and non-relevant texts by means of a Gaussian Mixture model that describes the frequency distribution of domain words inside a large-scale corpus; DRE is a fully unsupervised text categorization technique.
The topic of query splitting, or query segmentation, has been analyzed in many papers, and very different approaches have been tested. The one examined in [11] is based on retrieved results: the aim of this approach is to find interesting documents that link two queries functioning as "stepping stones". This way of proceeding is particularly useful in the field of academic and scientific articles. The two queries can be provided by the user himself, or they can be identified by the system through the examination of the single query provided; this is done with an unsupervised method that analyzes the various documents retrieved for the query and groups them according to common terms and characteristics.
In [12] an unsupervised approach is proposed, based on a query word-frequency matrix derived from web statistics. They first adopt the N-gram model to estimate the query terms' frequency matrix based on word occurrence statistics on the web. They then devise a strategy to select the principal eigenvectors of the matrix. Finally, they calculate the similarity of query words for segmentation.
In [13] a generative query model is used to recover a query's underlying concepts, which compose its original segmented form. The model's parameters are estimated using an expectation-maximization (EM) algorithm, optimizing the minimum description length objective function on a partial corpus that is specific to the query. To augment this unsupervised learning, they incorporate evidence from Wikipedia, exploiting external knowledge to make sure that the output segments are well-formed concepts, not just frequent patterns.
Most effective approaches to query splitting use unsupervised methods. There is some natural-language-based query analysis research, but it often concerns very structured settings or specific domains, more like natural language interfaces to databases than natural language analyzers.
The query matching subject hasn't been approached widely, but we can find a significant piece of research in [14], where a generic query is routed to a proper search service after an analysis by an automated query routing system, Q-Pilot. Off-line, Q-Pilot takes as input a set of search engines' URLs and creates, for each engine, an approximate textual model of that engine's content or scope, something conceptually similar to SeCo's semantic annotation for Service Marts. On-line, Q-Pilot takes a user query as input, applies a query expansion technique to the query, and then clusters the output of query expansion to suggest multiple topics that the user may be interested in investigating. Each topic is associated with a set of search engines for the query to be routed to, and with a phrase that characterizes the topic. For example, for the query "Python", Q-Pilot enables the user to choose between movie-related search engines under the heading "movie — monty python" and software-oriented resources under the headings "object-oriented programming in python" and "jpython — python in Java". A key point in the Q-Pilot design is to use the neighborhood-based identification of search engines' topics in combination with query expansion. This approach gives quite good results, as reported in the article; query expansion fills the gap between the short query and the small number of terms in the search engines' topic models. This system, though quite efficient, is well suited only for very short, single-domain queries.
Complex queries make it possible to extract answers from complex data, rather than from within a single Web page; but complex data require a data integration process. In the SeCo project this process is query-specific, because answering queries about very different topics requires intrinsically different data sources. However, data integration is one of the hardest problems in computing, because it requires full understanding of the semantics of data sources; as such, it cannot be done without human intervention. A data source is any data collection accessible on the Web. The Search Computing motto is that each data source should be focused on its single domain of expertise (e.g., travel, music, shows, food, movies, health, genetic diseases), but pairs of data sources which share information can be linked to each other to build complex results. This classification of the data into different domain groups, represented by the service marts, is the basis for the upper-level query elaboration that tries to match the input with the available data sources.
In fact, the main objective of the thesis project is to enhance the existing natural language analyzer framework and add a service mart matching function to match the natural language queries with the available service marts.
Figure 4.3: The semantic modelization of the Service Mart
As you can see, we only kept the elements with a semantic value, so we omitted the representation of repeating groups, because they only carry a structural value. We also hypothesized that we only had semantic attributes, and not quality indicators like ranking. This feature is not yet retrievable from the queries through the framework we built, so we imagined that for now it could be treated automatically (as an intrinsic property of the order of the results) or parametrically (the user is given the opportunity to decide about it).
Another feature that has not been considered is the join relation between different service marts or access patterns. This feature is a very important one in the SeCo architecture: the possibility to link different search services through join paths gives the power to answer the greatest part of multi-domain queries; in fact, it is assumed that a user won't repeat every piece of data in the request as many times as the splitting into domains requires. The repetition of the linking parameter through a join path is vital in these cases. With our implementation we can match only a smaller range of multi-domain queries to service marts.
To efficiently analyze a natural language query the system has to recognize and identify as many input parameters as possible. If a parameter can be labeled with its format, the probability of a good match between the query and the service attributes is higher. For this reason we decided to use a Named Entity Recognizer (NER) to extract named entities from the queries. We considered several NERs and finally chose the one implemented by the Stanford group, because it is completely compatible with the libraries of the parser already in use. This NER can recognize entities of three kinds: Persons, Locations and Organizations. These "proper noun" words are recognized by means of a large training set that functions as a database of information. Many other NERs perform in a similar way and can recognize more entity kinds, such as numbers, dates and so on; we chose this particular NER firstly because of its compatibility with the project already developed, and secondly because we believe that a more efficient recognition of "standardizable types" can be achieved with regular expressions. This led us toward a simpler NER rather than a complex, multi-function one. We call "standardizable types" all the data types that follow a standard pattern in their expression.
Prices, for instance, are always numbers followed or preceded by the symbol or the name of the currency; titles, if written correctly, are delimited by double quotes ("I'm a Title"); distances have the same characteristics as prices, with a unit of measurement symbol; and so on. We chose regular expressions because the ability to change the expressions in our program freed us from depending on a NER for these types. NERs can be very useful for entity recognition on "natural" words, where the machine learning algorithms and big training sets involved are not an easy task to handle, but they may not be powerful enough for standardizable types.
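As an illustration, the kind of regular expressions we refer to can be sketched as follows; the patterns below are simplified assumptions for demonstration, not the exact expressions used in the project.

```scala
// Hypothetical, simplified patterns for three "standardizable types";
// real expressions would have to cover more currencies, units and formats.
val pricePattern    = """(?:[$€£]\s?\d+(?:\.\d+)?|\d+(?:\.\d+)?\s?(?:dollars?|euros?))""".r
val distancePattern = """\d+(?:\.\d+)?\s?(?:km|mi|miles?|meters?)""".r
val titlePattern    = "\"[^\"]+\"".r

// label a token with the first standardizable type whose pattern it matches
def labelToken(token: String): Option[String] =
  if (pricePattern.findFirstIn(token).isDefined) Some("Price")
  else if (distancePattern.findFirstIn(token).isDefined) Some("Distance")
  else if (titlePattern.findFirstIn(token).isDefined) Some("Title")
  else None
```

Changing a pattern then only requires editing one expression, with no retraining of any statistical model.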
4.5 Mapping to domains
From the basic objects identified in the clauses, nouns and verbs, we use another set of tools and techniques to extract domains; subsequently, in the SeCo application, domains can be mapped to a web service. For this purpose we focused on the tools provided by the WordNet project, and especially the WordNet Domains add-on. The approach is to parse the dictionary of WordNet, which is organized in words that relate to one or more synonym sets, or senses, of a word, also called synsets. Each synset has a unique identifier consisting of its offset within the WordNet database. We use this identifier to connect a synset to its associated domains within the WordNet-Domains database, where the key is the synset offset and its values are one or more domains. Fig. 4.4 represents the relationships we follow to get the domains.
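The offset-based lookup can be sketched as follows, with tiny hand-made maps standing in for the WordNet index and the WordNet-Domains database (the offsets and domain labels are invented for illustration):

```scala
// word -> synset offsets (stand-in for the WordNet index)
val senseIndex: Map[String, List[Long]] =
  Map("bank" -> List(8420278L, 9213565L))

// synset offset -> domain labels (stand-in for WordNet-Domains)
val domainIndex: Map[Long, List[String]] =
  Map(8420278L -> List("economy"),
      9213565L -> List("geography", "geology"))

// follow the chain word -> synsets -> domains
def domainsFor(word: String): List[String] =
  senseIndex.getOrElse(word, Nil)
    .flatMap(offset => domainIndex.getOrElse(offset, Nil))
```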
Figure 4.4: The WordNet Domains Hierarchy
The domain retrieval process can yield a large number of domains from a single word. The perfect approach (from the human point of view) would be to identify, for every word, which sense it refers to in the given sentence, and retrieve the domains accordingly. Given the inherent difficulty of this task, in order to get the most relevant domains we use the tf-idf [10] information retrieval technique. This is a sorting mechanism that calculates the importance of a single domain by its relative frequency among the domains of a single word, weighted by how common it is across all the domains we retrieved from the objects of the sub-entry. A second
technique that was evaluated is to retrieve the domain relationship directly from WordNet, which gives a word definition a relationship to another word of which it is the topic. WordNet is organized as an index of words to their possible senses, and a database containing details about such senses. In particular, information about the relationships between the current sense and others is kept in that file. There are many kinds of relationships, such as is-a, is-part-of or, as we wish to extract, is-member-of-this-domain. These relationships allow one to go from sense, or synset, to sense, forming a graph spanning the whole database. The
frequent couples. Using this criterion, "distant" or very different domains can be assigned a higher score on the basis that they have been found together in a query many times.
• Most frequent couples in Service Marts: the same approach can be ap-
plied to the scoring of the Service Marts domains. An offline analysis can be
done and a bonus can be assigned to the most frequent couples of domains
among the Service Marts annotations.
• Nearest domains: a bonus is assigned to the domains in the query that are
“near” considering the distance on the WordNet domains tree.
Only the third method was actually implemented, for several reasons, first and foremost the absence of a testable, reliable and sufficiently large database of queries and service marts.
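A minimal sketch of this implemented method, over a small invented fragment of the domain hierarchy, could compute the bonus from the tree distance between two domains:

```scala
// Hand-made fragment of the WordNet Domains tree, child -> parent
val parent = Map(
  "tennis" -> "sport", "soccer" -> "sport",
  "sport" -> "free_time", "tourism" -> "social_science",
  "free_time" -> "root", "social_science" -> "root")

def pathToRoot(d: String): List[String] =
  d :: parent.get(d).map(pathToRoot).getOrElse(Nil)

// number of edges between two domains through their lowest common ancestor
def distance(a: String, b: String): Int = {
  val pa = pathToRoot(a)
  val pb = pathToRoot(b)
  val common = pa.find(pb.contains).get
  pa.indexOf(common) + pb.indexOf(common)
}

// the nearer the two domains, the larger the bonus
def bonus(a: String, b: String): Double = 1.0 / (1 + distance(a, b))
```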
Future research along this line can involve more complex data mining methods and approaches. The biggest problem in this domain scoring approach is the small number of domains: even if we could achieve a perfectly ordered list of domains for each sub-entry, its meaning would be very poor with respect to the possible annotations available on the service marts side. Another possible approach to this problem would be to use the retrieved domains to find matching synsets that could be useful in the following matching process.
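For reference, the tf-idf sorting of retrieved domains described earlier in this section can be sketched as follows (the sample domain lists are invented):

```scala
// Score a domain by its frequency among the domains retrieved for one
// object, discounted by how common it is across all objects of the sub-entry.
def tfIdf(domain: String, forObject: List[String], all: List[List[String]]): Double = {
  val tf  = forObject.count(_ == domain).toDouble / forObject.size
  val idf = math.log(all.size.toDouble / all.count(_.contains(domain)))
  tf * idf
}

val perObject = List(
  List("tourism", "tourism", "transport"),  // domains retrieved for object 1
  List("transport", "economy"),             // ... for object 2
  List("gastronomy", "transport"))          // ... for object 3

// domains of object 1, most relevant first
val sorted = perObject.head.distinct
  .map(d => (d, tfIdf(d, perObject.head, perObject)))
  .sortBy(-_._2)
```

Here "transport" scores zero because it appears in every object's domain list, while "tourism" stands out as specific to the first object.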
4.6 The Service Mart Repository
The SeCo project is still a work in progress and the registration of service marts is not active yet. Therefore, to test our query analysis and matching processes efficiently, we decided to create a list of fictitious service marts with characteristics and parameters very close to the real ones, using as model and inspiration the ones presented in the YQL database. The semantic value of these service marts spans a great number of domains, and they are complete with data type descriptions and multiple access patterns. We thus populated a repository with approximately 70 service marts that we used in our experiments.
Every Access Pattern is composed of a number of service attributes that define the searching capabilities of the mart. These service attributes are annotated semantically with domains and synsets. We also hypothesize that the service provider will indicate, for every attribute, a data type chosen from the enumeration we defined in the UML of fig. 4.3. With the definition of these annotations we can then match every parameter, starting from the mandatory ones, to the available access patterns. The matching is done respecting the order of the data both in the Extracted Data structure and in the Access Pattern one. We assume that this will be an advantage for temporal and spatial parameters, which are usually placed in a certain order:
retrieved from WordNet and WordNet Domains. These annotations can be useful when more than one name parameter is needed: through the calculation of a matching score, very similar to the one computed previously for the sub-query/service mart matching, we sort the eligible names by compatibility and choose the most suitable one.
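A sketch of this choice among several eligible name parameters, using invented domain annotations, could look like:

```scala
// candidate name parameters extracted from the sub-query, each with the
// domains retrieved for it (annotations invented for illustration)
case class Candidate(value: String, domains: Set[String])

// score = overlap between the attribute's annotations and the candidate's
def matchingScore(attributeDomains: Set[String], c: Candidate): Int =
  attributeDomains.intersect(c.domains).size

def bestCandidate(attributeDomains: Set[String], cs: List[Candidate]): Candidate =
  cs.maxBy(c => matchingScore(attributeDomains, c))

val hotelName = Set("tourism", "buildings")
val candidates = List(
  Candidate("Fiji",   Set("geography")),
  Candidate("Hilton", Set("tourism", "buildings")))
```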
4.8.2 Evaluation Criteria and Statistics
Some improvements to the original Sift application were required to filter efficiently the queries among the raw set we acquired from our source, Yahoo! Answers.
The Yahoo! Answers input structure requires users to type a "title" and a proper question in the form; the Sift application only acquires the "title" of the question, since the complete text of the question is often too long and filled with other objects, such as links, that are useless for our analysis.
Due to this choice, a lot of filtering has to be done to eliminate incomplete queries or inconclusive "titles".
The preprocessing
During the preprocessing of the queries a basic filtering is manually applied, and entries are deleted if no question is asked or if there are grammatical or spelling errors in the keywords of the phrase. As an improvement to this phase, a correction form has been added to the application for every retrieved entry; it can be used to correct and update sentences with typos, spelling errors and abbreviations without having to eliminate them. The option to eliminate one or more entries altogether has also been added.
set of data. At the center of it all is the web front-end, which powers the creation of the corpus of queries and also functions as a visualization tool both for the outputs of the algorithms employed in the underlying application to analyze the queries and extract the domains, and for the statistical results section. The front end communicates with the outside by retrieving questions from the Yahoo! Answers web service. This feature is on request, so a user browsing the Sift web page can ask to retrieve questions from the outside; these questions are shown as new unrated entries and saved in the database.
The load of the process required to analyze the queries is non-negligible, both in terms of CPU usage and memory, so it is not advisable to require the extraction and analysis to be done in real time in the same environment as the database and the Web front-end. Therefore a mechanism to offload the work onto another computer has been devised. It is based on a standard and simple architecture for background workers: the Web front-end, or the user through the command line, can post work items on a queuing server, where they will be picked up by one of the clients, the first one available to compute. The client then processes the task, given the input parameters to elaborate, and stores the results back in the database, from which the web front end will retrieve them later upon a user request.
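A minimal sketch of this worker architecture, with a local blocking queue standing in for the queuing server used in the real system, could be:

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

case class WorkItem(id: String, sentence: String)

val queue   = new LinkedBlockingQueue[WorkItem]()
val results = mutable.Map[String, String]()

// the front-end (or the command line) posts work items
def postWork(item: WorkItem): Unit = queue.put(item)

// one step of a background worker: take an item, analyze it (a stand-in
// here), and store the result where the front-end can retrieve it later
def workerStep(): Unit = {
  val item = queue.take()
  results(item.id) = "analyzed: " + item.sentence
}
```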
5.2 The Sift Application
Sift is the Web Application composed of the front-end and the tool used to extract data from the Yahoo! Answers Web Service. It is based on the Sinatra framework, and thus it has been written in the Ruby programming language. In addition to the main Sinatra library for web application development, it imports the Ruby libraries used to interface with the CouchDB server, the Kestrel queue service and the Yahoo! Answers Web Service. The application is divided into three principal parts: the Models, the Controllers and the Views.
The last part is the view module, where a template is processed, taking as input the different variables prepared by the controller. In the index page, where the list of entries is shown, the output HTML contains the list of all the entries, each entry containing the results of any processing done, although this part is hidden at first. The index page shows a list of summaries that can be navigated either with the mouse or with the keyboard. Elements can be given a rating by clicking the corresponding star to their right. It is also possible to change the rating of more than one element at once by selecting them first and then using the drop-down action menu or the keyboard shortcuts to give them a new rating. To see the complete details for a single entry one has to click on it; the screen will then show all the retrieved data, provided the data has previously been processed by the background worker system, in which case it is also possible to see the alternative strategies and rate them individually. Otherwise the phrase "This entry hasn't been parsed yet" will appear on the screen. This interface was created using the jQuery toolkit for JavaScript, which provides a high-level view of the web page, allowing one to query and manipulate elements, as well as make asynchronous calls to the server.
named by their identifier. Once a message is received, they first check the database to see whether this particular instance of the chain has already been executed. If that is the case, the existing data is fetched and parsed, and tasks are skipped as long as no errors have been encountered and no version of a task has changed. Once all the remaining tasks have been executed, the worker asks the database actor to store the updated document before resuming listening on the queue.
Splitting Tasks
Splitting Tasks are a specialization of a generic task, in which the parsing and serialization are already taken care of. The user only has to define a function, split, that will transform the input from a tree-based representation into a series of parts. Each part is an instance of a class composed of two fields: the first is the phrase, or sub-sentence, that is considered as that part; the second is the list of objects retrieved from that phrase, where each object is a couple made of the object itself and its part of speech (e.g. verb, noun or adjective).
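The structures involved can be sketched as follows (the class and field names are illustrative, not the project's actual definitions):

```scala
// a parse tree whose leaves carry a word and its part of speech
sealed trait Tree
case class Node(children: List[Tree]) extends Tree
case class Atom(word: String, pos: String) extends Tree

// each part pairs a sub-sentence with its (object, part-of-speech) couples
case class TaggedObject(word: String, pos: String)
case class Part(phrase: String, objects: List[TaggedObject])

// the user of a splitting task only has to supply this transformation
type Split = Tree => List[Part]

val example = Part("hotel in Fiji",
  List(TaggedObject("hotel", "noun"), TaggedObject("Fiji", "noun")))
```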
Domain Extraction and Matching Task
These tasks have been united in a single big task because of implementation and resource requirements. They correspond to the third phase of the query analysis process, where the different parts of a sentence are analyzed in order to obtain a series of domains that are later mapped to different query services. This interface once again has one single method to implement. This method, named extract, takes the list of parts obtained from the previous operation and has to return, for each part of the sentence, a list of possible domains and a list of matched Service Marts. If nothing of importance is found, or if a word is not recognized, the output lists can be empty.
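Such an interface could be sketched as follows, with a trivial table lookup standing in for the real extraction (the word-to-domain table is invented):

```scala
case class Part(phrase: String, objects: List[(String, String)]) // (word, pos)
case class Extraction(domains: List[String], serviceMarts: List[String])

trait DomainExtractionTask {
  // for each part: the possible domains and the matched Service Marts
  def extract(parts: List[Part]): List[Extraction]
}

object TableLookupTask extends DomainExtractionTask {
  private val table = Map("hotel" -> "tourism", "flight" -> "transport")
  def extract(parts: List[Part]): List[Extraction] =
    parts.map { p =>
      val domains = p.objects.flatMap { case (word, _) => table.get(word) }
      Extraction(domains, Nil)   // empty lists are allowed outputs
    }
}
```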
Once the framework was established and stabilized, the first tasks were implemented. The first of these were the parsing strategies, which take as input the natural language sentence and output its grammatical structure. This output has the form of an arbitrarily-deep tree, where a leaf represents an atom, a word of the sentence, while an internal node represents a grouping of these words in some structure, for example a noun phrase or a verb phrase.
Parsers evaluated Different parsers were evaluated to test their performance with respect to our elaboration needs. The first parser to be evaluated was the Stanford Natural Language Parser. Distributed as a Java library, it requires little code to use: one simply loads the parser with the chosen training data file, and then applies it to a sentence to get a resulting tree. This tree is then transformed from the native tree representation of the Stanford Parser into a generic one that is used by the later tasks. Here is an example of how to load and apply the parser, in Scala.
// dataFile points to the training data set on the hard drive
// input contains the natural language sentence
import edu.stanford.nlp.parsers._
val parser = new LexicalizedParser(dataFile)
val tree = parser.apply(input)
Another parser tested was the Shallow Parser, developed at the University of Illinois at Urbana-Champaign. It takes a different approach to obtain the final result: it works by using a series of different tools that process the input into a progressively more complex form. The first of these steps is to sanitize the
input, making sure that every element is well tokenized; that is, every element in the sentence is spaced out, even the punctuation. It also performs some slight transformations and normalization operations. The output of this first operation is then sent to a second program, which takes care of tagging each element of the sentence with its most probable part of speech, be it noun, verb, adjective or other. This is then finally sent to a server called the chunking server, which takes the annotated input and groups, or chunks, elements into what it thinks are the primordial structures of the sentence. The name "shallow" thus comes from the fact that this grouping operation is done at only one level, which means that the output can be formally defined as a sequence of elements that are either atoms or sub-sequences of such atoms. This output, given as text by the server, is then parsed by the task and put into a tree representation, although it is only one level deep.
5.3.2 Sentence Splitting Strategies
Once we have a tree with a satisfying parsing structure, in our case the Stanford tree version, we proceed to divide that structure into many parts, with the expectation that each part will correspond to a single semantic domain. In output, each part of the sentence is represented by an instance of the "part" class, which contains the extracted sub-sentence as well as the objects that are considered important for the definition of the domain.
First-Level Split
A first strategy to split the sentences is to suppose that the first level at which a
separation of the sentence occurs defines the various domains. The purpose of the
task is thus to find the first internal node that has more than one child, and take
each child as a different part from which a domain will be extracted. From each
sub-tree, we look for interesting elements, noun and verb atoms, and take them
as objects. While this first attempt at splitting the tree is simple and does not take
into account the subtleties of the resulting parse tree, it gives a good baseline and provides a jumping-off point from which we can explore better techniques.
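The first-level split can be sketched as follows on a simplified tree type (labels and structure are illustrative):

```scala
sealed trait Tree
case class Node(label: String, children: List[Tree]) extends Tree
case class Atom(word: String, pos: String) extends Tree

// collect the noun and verb atoms of a sub-tree as the objects of a part
def objects(t: Tree): List[Atom] = t match {
  case a: Atom     => if (a.pos == "noun" || a.pos == "verb") List(a) else Nil
  case Node(_, cs) => cs.flatMap(objects)
}

// descend to the first internal node with more than one child and take
// each of its children as a separate part
def firstLevelSplit(t: Tree): List[List[Atom]] = t match {
  case a: Atom             => List(objects(a))
  case Node(_, List(only)) => firstLevelSplit(only)
  case Node(_, children)   => children.map(objects)
}

val tree = Node("ROOT", List(Node("S", List(
  Node("NP", List(Atom("hotels", "noun"))),
  Node("VP", List(Atom("near", "prep"), Atom("beach", "noun")))))))
```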
Clauses Extraction
Given the fact that sentences are most of the time organized in subject-verb-object form, and that the object is the element most likely to carry a subordinate or relative clause, we can expect the tree to lean to the right most of the time, a fact that the previous technique does not take into account. In order to fix that, a second technique has been implemented, where the tree is visited in a depth-first, left-to-right manner, buffering elements in a domain until a new clause is encountered. Such a clause is encountered when we find an internal node
Named Entity Extraction The named entity extraction tool is used to examine each object extracted from the sub-entry. To do this, we first initialize the classifier with the training set; then, using as input the string value to examine, we extract a list structured as List<Triple<String,Integer,Integer>>. This structure contains the names of the entities extracted, either Location, Organization or Person, and the offsets of the values they refer to in the examined string.
// initialize the classifier training set
var serializedClassifier = "classifiers/ner-eng-ie.crf-3-all2008-distsim.ser.gz"
// initialize the classifier
var classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)
// extract the information
var g = classifier.testStringAndGetCharacterOffsets(sentence)
From the variable g we then get the types of the extracted entities and label the examined values accordingly. These entity labels will then be used to extract the domains from WordNet, a task that would otherwise be impossible for proper nouns.
Domain Extraction All the sub-entries output by the splitting methods are stored in specific objects called "part". These objects contain the original version of the sentence in the sub-entry, all the noun and verb objects retrieved by WordNet, as well as the nouns retrieved by the Named Entity Recognizer. The schema below shows the structure of the retrieval of the domains from each object saved in the part structure (O1, O2, O3). This retrieval is done by exploring the WordNet Domains database; for each one of the objects there may be more than one group of domains to retrieve: groups D1, D2 and D3 all refer to O1, due to the subdivision into multiple synsets (fig. 4.4).
The last step in our analysis is to match sub-queries to appropriate service marts that can satisfy their search requests. To do this we implemented a complex comparison algorithm. As said earlier, we extract from each part a datatype object which contains the different data types, identified either by regular expressions or by the entity recognizer.
The matching happens specifically between the available service attributes contained in the access patterns and the data from the sub-queries. If the matching is satisfied, the service mart from which the access pattern comes will be a suitable candidate for the final search. All the candidate service marts are sorted according to the domain matching score they earned during the mapping of the sub-queries (see section 4.8).
The mapping For each type we implemented a proper mapping function. Each type is treated separately, and the mapping schema in the figure below is applied to each DataType/AccessPattern combination. The output of the function is a list of results containing all the data that will be needed for the following invocation of the search service: Service Id, data, and requested type format. The updated DataType, with all the elements still available for matching, is also given in output, so that it can be the input to the following calls to the mapping function.
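Such a per-type mapping function can be sketched as follows; the structures and sample access pattern are illustrative stand-ins for the project's actual definitions:

```scala
// service attributes of an access pattern, each annotated with a data type
case class ServiceAttribute(serviceId: String, name: String, tpe: String)
// one pairing ready for the later service invocation
case class Mapped(serviceId: String, value: String, tpe: String)

// pair the values recognized as `tpe` with the attributes requesting it,
// and return the values still available for the following calls
def mapType(tpe: String,
            available: List[String],
            attrs: List[ServiceAttribute]): (List[Mapped], List[String]) = {
  val wanted = attrs.filter(_.tpe == tpe)
  val paired = wanted.zip(available).map { case (a, v) => Mapped(a.serviceId, v, a.tpe) }
  (paired, available.drop(paired.size))
}

val attrs = List(
  ServiceAttribute("hotelSearch", "city",    "Location"),
  ServiceAttribute("hotelSearch", "checkIn", "Date"))
```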
Figure 5.10: Mapping schema
Since not every type is perfectly recognizable with the tools we used, we decided to implement an enhanced comparison. We compare the attributes not only with their correspondents belonging to the same type but also, as a backup matching, with more generic types that can be a suitable match. For instance, if the service
attribute needs a "Price" type and we cannot find one in the DataType structure extracted from the sub-query, we look for a simple "Number" type.
The following is a detailed summary of every data type comparison and its backups:
• Number and Word types are compared only with their corresponding types.
• Price: compared with the Price type. If not matched, it is compared with the Number type.
• Date: the Date structure is formed by Day/Month/Year. The comparison happens between Date types; if Day or Month types are requested singularly by the services, the match is searched for in a Date structure.
• Time: the Time structure is formed by Hour/Minute. The comparison happens between Time types; if Hour types are requested singularly by the services, the match is searched for in a Time structure.
• Organization, Location, Person, Title: these types are compared with their respective types in the datatype structure. If no match is found, they are compared with the Word type, using a semantic matching.
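The backup comparisons above can be sketched as a fallback lookup (only the generic backups are shown, not the Date/Time structure lookups):

```scala
// specific type -> more generic backup type
val backup: Map[String, String] = Map(
  "Price"        -> "Number",
  "Organization" -> "Word",
  "Location"     -> "Word",
  "Person"       -> "Word",
  "Title"        -> "Word")

// look for a value of the wanted type; if none, try its backup type
def findValue(wanted: String, data: Map[String, List[String]]): Option[String] =
  data.get(wanted).flatMap(_.headOption)
    .orElse(backup.get(wanted).flatMap(b => data.get(b).flatMap(_.headOption)))
```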
Figure 6.1: The Main Screen of the Sift Application
Most of the entries from Yahoo! Answers we rated had a very low score and thus did not correspond to our needs. A lot of questions had to be pruned because they contained special characters, misspelled words or other nonsense punctuation (e.g. "Want 2 find an hotel in tokio—-ASAP!!!!!!!PLS!"). The number of real multi-domain queries we extracted from our source is very low, and most of those entries are not suitable for our analysis because they do not contain all the data required to successfully invoke a service mart. The main reason is that a lot of parameters are implicit in a question; take for example this sub-entry:
“I want to find an hotel in Fiji for me and my son for the next week”
A human reading that question can extract some useful information, such as:
• How many persons are interested in the hotel? parent + son = 2
• What is the date of the vacation? next week = date of this week + 7 days
These data are available in the question but, due to the limited power of our extraction methods, we cannot identify them. That is why a lot of real multi-domain questions will not have a matching service mart in our algorithm.
Moreover, we have to consider the nature of the Yahoo! Answers service. This service was in fact originally born to give people the possibility to ask questions to other people, not specifically multi-domain ones. Indeed, most of the questions we found asked for advice or opinions about something, the sort of things one would ask only another human being and never an automatic online service; for example "How's the hotel X? I want to go there with my kids of 3 and 5 years, will it be a good choice?" or "I heard bad things about that neighborhood, is it really bad? I'm moving there next week". Both of these questions are completely unanswerable by any automatic service; it is even possible that the user who asked them wants to gather multiple opinions on the matter and then decide for himself, a completely different approach from what we expect of a user approaching our multi-domain search service.
Another downside of using Yahoo! Answers is the fact that the form of the site allows the user to give a "title" and then a specific and longer explanation of the request. We decided to retrieve only the title section because of the useless elements, such as links and attachments, that can be found in the "text" section. This is one of the reasons for the great number of low-rated entries: often the "title", even when it is comprehensible and well structured, refers only to the main topic of a request that in its body may be multi-domain.
We retrieved approximately 1200 entries but we had to discard some of them for the reasons stated above, finally obtaining 1064 entries. 759 of them were rated as one star, which means that they were completely inappropriate for the
The main problem found in this section is the excessively unbalanced and small corpus of domains available in the database. The WordNet Domains Hierarchy contains fewer than 200 domains (the complete structure can be found in appendix 3), an exiguous number compared to the various annotation needs of the entries and services. Moreover, the variety of domains is very unbalanced: for example, the "tourism" domain is categorized under the label "social sciences" and is not detailed in any way, while the "sport" domain is divided into 29 subcategories that detail every possible sport discipline. This can really affect the annotation of an entry or a service; for instance, a touristic service can only count on one domain in its annotation, which greatly reduces its semantic potential in the matching process.
From the auto-generated statistics we can see that 726 sub-entries have at least one service mart that matches their data. This result can be considered successful for the technical aspects of our matching algorithm: a good number of queries is successfully matched to a service mart that is therefore invokable with every required service parameter. Despite the correctness of the algorithm, we cannot say anything about the semantic correctness and effectiveness of the matching, since we do not have the actual results but only the name and description of a fictitious service mart.
6.5 A complete example of information extraction, splitting and matching
To examine properly every detail of the analysis and processing of an entry we
chose a multi-domain question that gave good results in almost every section.
The original question in input:
The objective of this thesis project was the research and creation of a matching service that could help pair natural language queries with the most suitable search services, a long and tedious task if operated by hand by a single user. In the research process we first had to enhance and enrich the analysis environment and implement some automatic statistics tools. Using them, we validated some techniques that were previously used to split queries and retrieve domains, and we researched, tested and evaluated some new approaches for sorting the extracted domains. Then we extracted, through the use of different tools, the data information from the sub-queries. Combining the data and domain information, we then developed a new task for the matching with services. Finally, we presented the results obtained and validated the approaches used. The final application presents the complete process of acquisition, analysis and matching of the entries.
The results obtained from the program indicate that the approaches used were quite successful in their technical aspects, as explained in the evaluation section, and are a strong base for future testing and development of the tasks researched.