Top Banner
NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University of Michigan
36

NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Jan 01, 2016

Download

Documents

Douglas Walsh
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

NaLIXNatural Language Interface for

querying XML

Huahai Yang

Department of Information Studies

Joint work with Yunyao Li and H.V. Jagadish

at University of Michigan

Page 2: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Outline

• Motivation• Basic ideas• System design• Live demonstration of NaLIX• User study• Conclusion• Q & A

Page 3: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

XML is Everywhere

• Extensible Markup Language (XML) for data exchange and storage

• Almost every application domain is moving towards XML based document formats – Digital libraries– Office documents– GIS– ...

Page 4: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

How to Query XML?

• We can of course use keywords search. Can we do better?

After all, XML has structures.

Page 5: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Example XML Fragment

<bookinfo>

<book>

<title>One Fish Two Fish</title>

<author>John Meyer</author>

<author >Peter Smith</author>

<price>7.95</price>

</book>

<book>

<title>Goodnight Moon</title>

<author >Margaret Brown</author>

<price>10.55</price>

</book>

</bookinfo>

Page 6: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Example Query

Query: Find titles and prices of books by ‘Meyer’

bookinfo

Just Lost

book

title author

author

price

Mercy Meyer

Gina Meyer

$5.75

book

titleprice

Brown Hedi

$13.95

Page 7: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

More Powerful Queries

• Aggregation– What are the books with more than 5

authors?• Value Join

– Find all the books, where an author of each book is the same as an author of Pride and Prejudice.

• Nesting– Find all the articles published in 2000 and

with keyword “XML”.

Page 8: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Standard Approach - XQuery

• XQuery –a structured query language – search for specific information within

XML documents primarily based on path expression

Page 9: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Standard Approach - XQuery

for $b in document(“bib.xml”)//book

where $b/author contains ‘Meyer’

return <result>

<title> $b/title </title><price> $b/price </price>

</result>

Query: Find titles and prices of books by ‘Meyer’

book

titleauthor

author

price

book

titleprice

bookinfo

Page 10: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Another Document Structure

• The same XQuery no longer works

author

name

Dr. Meyer

author

namebook

M. Brown

Goodnight Moon

title

book

title

price

One Fish Two Fish

$12.50

book

title price

Cat in the Hat

$14.95

bookinfo

Page 11: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Standard Approach - XQuery

• XQuery is powerful, but ……

The user may NOT KNOW the structure of bib.xml!

Page 12: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Solution 1 – Study Schema

• Ask the user to study the structure of bib.xml and write a query in XQuery

Page 13: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Solution 1 – Study Schema

More problems• Data evolution: The document structure may change

over time• Heterogeneity: multiple XML documents with similar

content but different structures• Plain unrealistic from an usability point of view

Page 14: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Solution 2 – Keyword Search

• Keyword Search – IR approach- Discard all tags:

“Mary”

- Treat tags as keywords:

“book article year title author Mary”

Page 15: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Query: What are the titles and years of the publications, of which Mary is an author?

===> “year title author Mary”

Solution 2 – Keyword Search

bibliography(1)

bib(2)bib(11)

year(3)year(12)

book (4)article(7)

title(5) author(6)

title(8)

author(10)

book(13)article(16)

title(14)

author(15)title(17)

author(18)19992000

XMLBob

HTMLMary

DatabaseCodd

C++John

Joe

author(9)

Page 16: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Solution 2 – Keyword Search

Pro

-Required no knowledge of document structure

Con -Does not take advantage of the structure

-Cannot express complex query semantic

Page 17: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Our Approach

• Taking advantage of whatever partial knowledge user may have on document schema

• Support wide range of queries – from regular XQuery to keyword

search• Minimum effort required for the user

–Natural language query is desirable

Page 18: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Basic Ideas

• Map parsed natural language query into XQuery.

– Proximity in natural language parse tree should correspond to proximity in matched XML tree

• XML is human readable and is human created!

• Interactive query formulation to help users pose system-understandable queries.

Page 19: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

What are the Challenges?

Challenge 1:

Automatically understand user intent of an arbitrary natural language question.

Challenge 2:

How to map user intent to XML schema?- Should I use “author” or “writer”?

- Is “Gone with the wind” a book or a movie?

- Are books grouped by year or by author in the bibliography?

Page 20: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Our focus is on the second challenge

User Intent => XML Schema

• Match users’ limited schema knowledge with the actual document schema.– Schema-Free Xquery

• Meaningful query focus

Page 21: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Meaningful Query Focus

• Without knowing the document structure, the user can still specify WHICH nodes should be meaningfully related

• The Meaningful Query Focus (MQF) is the most specific XML structure in which the nodes are related.

authortitle

Mary

year authortitle

Mary

year

MQF (year, title, author)

Page 22: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Intuition

- A node in a XML document usually represent a real-world entity

book art icle

t it le author

J oe

t it le

year

author

Bob

bib

1999

XML HTML

(2)

(3)

(9)(8)

(7)

(6)(5)

(4)

author(10)

Mary

- Two nodes are related to each other by their lowest common ancestor

article is one of the most specific entities that contain entity title and author

Page 23: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Intuition

The entity represented by a lowest common ancestor node may not be the most specific type of entity that contains the types of entities each of the nodes represents

NOT all lowest common ancestors are meaningful !

book art icle

t it le author

J oe

t it le

year

author

Bob

bib

1999

X ML X ML

(2)

(3)

(9)(8)

(7)

(6)(5)

(4)

author(10)

Mary

Page 24: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Meaningful Lowest Common Ancestor

Given the sets of nodes with tag name title, year and author

bibliography(1)

bibbib

yearyear

bookarticle

title author

title

author

book article

titleauthor

titleauthor1999

2000

XMLBob

HTMLMary

DatabaseCodd

C++John

Joe

author

(4)

(5) (6)

(8)

(7)

(9)

(10)

(3)

(14)(15)

(13) (16)

(17)(18)

(2)

(12)

(11)

Page 25: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Meaningful Lowest Common Ancestor Structure

Given the sets of nodes with tag name title, year and author , Meaningful Query Focus of these nodes are:

bibliography(1)

bibbib

yearyear

bookarticle

title author

title

author

book article

titleauthor

titleauthor1999

2000

XMLBob

HTMLMary

DatabaseCodd

C++John

Joe

author

(4)

(5) (6)

(8)

(7)

(9)

(10)

(3)

(14)(15)

(13) (16)

(17)(18)

(2)

(12)

(11)

Page 26: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Vocabulary Problem

• Users may not know the exact tag names• Solution: term expansion

– WordNet® • English nouns, verbs, adjectives and

adverbs are organized into synonym sets, each representing one underlying lexical concept.

– Ontology mapping• Domain knowledge

Page 27: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

for $b in MQF doc(“bib.xml”)//expand(writer) $c in MQF doc(“bib.xml”)//title $d in MQF doc(“bib.xml”)//yearwhere $b = “Mary” return $c, $d

Term expansion

User may not know the exact tag names

term expansion

for $b in doc(“bib.xml”)//author $c in doc(“bib.xml”)//title $d in doc(“bib.xml”)//yearwhere mlca($b, $c, $d) and $b = “Mary” return $c, $d

Page 28: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Query Translation Process

English Sentence Classified Parse Tree Schema-Free XQuery Dependency Parse Tree

List (V)

title (N)

published (A)

by (Prep)

Addison-Wesley (N)

the (Det)

book (N)

of (Prep)

List (CMT)

title (NT)

published by (CM)

publisher(NT)

Addison-Wesley (VT)

the (MM)

book (NT)

of (CM) for $m0 in timber-mlca($v0, document("sdblp.xml")//title,

$v1, document("sdblp.xml")//book,

$v2, document("sdblp.xml")//publisher)

where $v2 = "Addison-Wesley"

return <result>{$v0}</result>

List the titles of all the books published by Addison-Wesley .

Page 29: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

System Architecture

XMLDBMS

Classifier Validator Translator

user

GUI

Dependency Parser

NLQ NLQ Classified Parse tree

Valid Parse tree

NLQParsetree

Query Repository

Command

History & Template

Message Generator

Errors

Warnings

Feedbacks

Results

Feedbacks

Results

Command

History&Template

Page 30: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Implementation● NaLIX itself is a Java application

● Web version is under development● Contains off-the-shelf components

● Natural Language Parser: MiniPar● XML Database: Timber● Thesaurus: WordNet● Ontology: manually constructed

• Under development:– User interactive domain ontology construction– Machine learning of domain ontology

Page 31: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Interactive Query Formulation

Invalid Query: List all the books with the same author as an author of “Advanced Programming in the Unix Environment”

FeedbackMessage

Suggestions

ExampleUsage

Page 32: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Experiment

• NaLIX vs. Keywords search

• Subjects: 18 UAlbany students

- No training on NaLIX was offered

• Tasks: 12 standard XMP use cases

Dataset: subset of DBLP

- All the books and SIGMOD conference articles by year 2003.

Page 33: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Ease of Use

Number of query reformulation

Time to formulate an acceptable query

Page 34: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Search Quality

Precision

Recall

Page 35: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Conclusions

• Taking advantage of inherent structure of XML, precise and powerful queries with natural language can be achieved

• Using NaLIX, naive user can effectively search unfamiliar XML data with ease and precision

• Iterative query reformulation can be an effective search strategy.

Page 36: NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.

Thanks you!

Huahai [email protected]