© 2014 Adobe Systems Incorporated. All Rights Reserved.

Query and Document Understanding

Rishiraj Saha Roy | Computer Scientist, Adobe Research Labs India | [email protected]

Query and Document Understanding - Max Planck …people.mpi-inf.mpg.de/~rsaharo/qdu_pesit.pdfQuery and Document Understanding. ... Shikhar Dhawanhas much better shot placement than

Apr 07, 2018

Overview

Simple techniques in query and document understanding

Lucene – A simple open-source text search library with wide commercial use

Take-home assignment on basic Information Retrieval

Industry positions for text mining and IR skills

Basics

What is “not” understanding?

Query: compare performance shikhar dhawan rohit sharma

Document: Shikhar Dhawan has much better shot placement than Rohit Sharma.

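[Figure: the query and the document each reduced to an unordered bag of words. Query: compare, performance, shikhar, dhawan, rohit, sharma. Document: has, than, better, shot, shikhar, dhawan, rohit, placement, much, sharma.]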

Basics

Much more to queries and documents than keywords and their frequencies!!!

Basics

Query: create hyperlinks in excel

Documents: Forums

create hyperlinks in word …. Filters in excel have to be specified with…

Documents: Spam (?)

Zingo.com – Your one stop tech guide. Best excel tips | Best hyperlinks in your page | Create your own blog…

Basics

Query 1: us open home page

Query 2: chrome cant open home page

US open official site by IBM. Cant view page properly? Best viewed in Google Chrome.
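
A minimal sketch of why keyword overlap alone is not enough, using the two queries and the page text above. The tokenizer and the `overlap` function are illustrative assumptions, not something taken from the slides.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query, document):
    """Fraction of query words that also occur in the document (word order ignored)."""
    q = tokens(query)
    return len(q & tokens(document)) / len(q)

page = ("US open official site by IBM. Cant view page properly? "
        "Best viewed in Google Chrome.")

for query in ("us open home page", "chrome cant open home page"):
    print(f"{query!r}: {overlap(query, page):.2f}")
```

Both queries share most of their words with the page (0.75 and 0.80 respectively), so a pure keyword match cannot tell the navigational query from the troubleshooting one.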

Basics

Relative word order is important

china detains india traders latest news

Query segmentation

glass office windows

open office windows

Entities, Attributes and Relations

france capital, polio symptoms, bon jovi age

barclays capital, capital punishment?!

Basics

And much more!!!

Term proximities

Term dependencies

Term and page annotations

Endless research areas…

Query Lengths

The mean length of Web search queries is increasing

Year    Mean query length (words)
2000    2.21
2006    3.5
2010    3.98

Query length    Category          Share of queries
> 8 words       Long queries      3.2%
3 to 8 words    Medium queries    80%
< 3 words       Short queries     14%

Motivation

Query understanding: Why? How?

Queries do not follow any formal grammar

“EMERGENCY HATCH PENGUIN EGGS HOW”

medicines for high pressure otc only

samsung galaxy gprs config at&t

(Some more) Motivation

Reordering, no function words, multiword expressions, partly natural language (NL)

Natural language processing (NLP) / Linguistics-based techniques fail!

Computationally expensive!

Simple data-driven statistical approaches

Empirical formulations

Provide noticeable improvements!!

Query Segmentation

Query segmentation

Why?

A simple how

Extracting Entities and Attributes

Why?

Some simple hows

Query Segmentation

Dividing a query into individual semantic units (Bergsma and Wang, 2007)

Example

australian open home page →

australian open | home page

australian | open home | page

Query Segmentation

Goes beyond multiword named entity recognition (gprs config, history of, how to)

Helps in better query understanding

Query expansion, query suggestions

Can improve IR performance by increasing precision

north america versus north of america

Query Segmentation

Simple algorithm – Pointwise Mutual Information

$$\mathrm{PMI}(a, b) = \log_2 \frac{p(ab)}{p(a)\, p(b)}$$

Compute probabilities from any source – documents, queries, page titles, anchor text

Microsoft Web n-gram services

http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx
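
A small sketch of estimating PMI from raw counts. It assumes probabilities are estimated as count / total with a single shared total, which is a simplification; the function name and the counts are hypothetical.

```python
import math

def pmi(count_ab, count_a, count_b, total):
    """PMI(a, b) = log2( p(ab) / (p(a) * p(b)) ), each probability estimated as count / total."""
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab / (p_a * p_b))

# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
print(pmi(count_ab=8, count_a=16, count_b=32, total=1024))  # 4.0
```

A value of 0 means the two words co-occur exactly as often as chance predicts; larger values indicate a stronger bond.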

Query Segmentation

PMI measures strength of bonding – by chance or by choice?

Meaningful bigrams have high PMI – harry potter, blood pressure, jurassic park, difference between

Measure PMI of adjacent word pairs

Fix significance threshold

Insert boundary whenever PMI falls below threshold

Query Segmentation

Input: australian open home page

PMI(australian, open) = 15.89

PMI(open, home) = 5.43

PMI(home, page) = 13.92

Threshold: 8.50

Output: australian open | home page

Problem: Each boundary is decided locally, so the segmentation is not optimized over the whole query!!
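
The thresholding procedure can be sketched as follows, using the PMI values and threshold from the example above. The function name `segment` and the dictionary layout are illustrative assumptions, and the sketch assumes a non-empty query.

```python
def segment(words, pmi_scores, threshold):
    """Insert a boundary between adjacent words whose PMI is below the threshold."""
    segments = [[words[0]]]
    for prev, word in zip(words, words[1:]):
        if pmi_scores[(prev, word)] < threshold:
            segments.append([word])      # weak bond: start a new segment
        else:
            segments[-1].append(word)    # strong bond: extend the current segment
    return " | ".join(" ".join(seg) for seg in segments)

# PMI values and threshold as given in the example above.
scores = {
    ("australian", "open"): 15.89,
    ("open", "home"): 5.43,
    ("home", "page"): 13.92,
}
print(segment("australian open home page".split(), scores, threshold=8.50))
```

Each adjacent pair is judged on its own, which is why the result depends heavily on the chosen threshold.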

(Named) Entities

jetbeam rrt-01

Where to buy? How to use? Life? Weight? ….

roger federer

Return information in structured form

lotr cast

Book? Movie? Game?

Detecting Entities

Simplest – List based approach

Wikipedia titles act as a very good resource

http://dumps.wikimedia.org/enwiki/latest/

5 million entries, 2 GB RAM, no problem
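
A rough sketch of turning a list of titles into an in-memory lookup set. It assumes one title per line with underscores in place of spaces; the function name `load_titles` and the sample lines are hypothetical.

```python
def load_titles(lines):
    """Build a set of normalized entity names from an iterable of title lines."""
    titles = set()
    for line in lines:
        # Assumed format: one title per line, underscores instead of spaces.
        title = line.strip().replace("_", " ").lower()
        if title:
            titles.add(title)
    return titles

# Hypothetical sample lines; a real run would stream the lines of a titles file instead.
sample = ["Roger_Federer\n", "The_Dark_Knight\n", "\n"]
entities = load_titles(sample)
print("roger federer" in entities)  # True
```

Because the function accepts any iterable of lines, an open file handle can be passed directly without reading the whole file into a list first.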

Detecting Entities

Efficient data structures – Trie, Dictionary

Low memory

Fast search

Lists work great, extensive commercial use

Annotate both queries and documents
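
One possible shape for a list-based annotator is a word-level trie with greedy longest-match lookup. This is a sketch under assumptions: the entity list is a hypothetical stand-in for a real title list, and longest-match is only one of several reasonable matching policies.

```python
def build_trie(entities):
    """Build a word-level trie; the key None marks the end of a complete entity."""
    root = {}
    for entity in entities:
        node = root
        for word in entity.split():
            node = node.setdefault(word, {})
        node[None] = True
    return root

def annotate(text, trie):
    """Return (start, end, entity) for the longest entity found at each position, left to right."""
    words = text.lower().split()
    found, i = [], 0
    while i < len(words):
        node, end = trie, None
        for j in range(i, len(words)):
            node = node.get(words[j])
            if node is None:
                break
            if None in node:
                end = j + 1          # remember the longest complete match so far
        if end is None:
            i += 1                   # no entity starts here
        else:
            found.append((i, end, " ".join(words[i:end])))
            i = end                  # continue after the matched entity
    return found

# Hypothetical entity list for illustration.
trie = build_trie(["roger federer", "the dark knight", "the dark knight rises"])
print(annotate("the dark knight rises cast", trie))
```

The same `annotate` call works on query strings and on document sentences, since both are treated as plain word sequences.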

Detecting Entities

howard shore music director

Detecting Entities

Often need to view very large files – lists, logs

LTF Viewer – An unsung hero

http://www.swiftgear.com/ltfviewer/features.html

Vim, Cygwin, command-based

Edit programmatically only

Problems

More than one match

the dark knight, the dark knight rises

tom cruise ship scene

False positives – Match, but not entity

list of capitals
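
To make the ambiguity concrete, the sketch below lists every word n-gram of a query that appears in an entity list. The entity set, the `max_len` limit, and the function name are assumptions made for illustration.

```python
def all_matches(text, entities, max_len=4):
    """Return every (start, end, phrase) whose word n-gram appears in the entity set."""
    words = text.lower().split()
    matches = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_len, len(words)) + 1):
            phrase = " ".join(words[i:j])
            if phrase in entities:
                matches.append((i, j, phrase))
    return matches

# Hypothetical entity list for illustration.
entities = {"tom cruise", "cruise ship", "the dark knight",
            "the dark knight rises", "list of capitals"}
print(all_matches("tom cruise ship scene", entities))
print(all_matches("the dark knight rises", entities))
```

The first query yields two overlapping candidates that share the word "cruise", and the second yields one candidate nested inside another, so list lookup by itself cannot decide which reading is intended.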

Identifying Attributes

Why?

User wants specific results

galaxy note specs

Intent diversification

galaxy note (What about it??)

Pictures, specs, stores, prices, accessories

Identifying Attributes

Using documents: Template based

What is the A of I <what … A … I>

I’s A

Who was A of I <who … A … I>

A of I

A in I

Identifying Attributes

Ps2’s accessories

Accessories of galaxy note

New Delhi is the capital of India

Paris is the capital of France

Narendra Modi is the prime minister of India

??? is the prime minister of Pakistan
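
A minimal sketch of template matching with regular expressions, where A is the attribute and I the instance. The patterns below are hypothetical approximations of the "I's A", "the A of I", and "A of I / A in I" templates, not the exact ones used in practice.

```python
import re

# Hypothetical regular-expression versions of the templates, tried in order.
TEMPLATES = [
    re.compile(r"^(?P<I>[\w ]+?)['’]s (?P<A>[\w ]+)$"),        # I's A
    re.compile(r"\bthe (?P<A>[\w ]+?) of (?P<I>[\w ]+)"),      # ... the A of I
    re.compile(r"^(?P<A>[\w ]+?) (?:of|in) (?P<I>[\w ]+)$"),   # A of I, A in I
]

def extract(text):
    """Return (attribute, instance) from the first matching template, or None."""
    text = text.lower().strip()
    for template in TEMPLATES:
        m = template.search(text)
        if m:
            return m.group("A"), m.group("I")
    return None

for sentence in ("Ps2’s accessories",
                 "Accessories of galaxy note",
                 "Narendra Modi is the prime minister of India"):
    print(sentence, "->", extract(sentence))
```

Templates of this kind fire on any string with the right surface shape, so phrases that merely look like "A of I" are extracted too.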

Identifying Attributes

Challenges

Hall of fame

Wall of shame

Schindler’s list

Beijing’s mist

Identifying Attributes

Using query logs or documents – Co-occurrence counts

Common wisdom: Attributes are frequent words

More robust statistic: Attributes co-occur with a larger number of distinct words

Identifying Attributes

nikon camera prices, winter coats prices, property prices in bengaluru

nikon camera prices, nikon camera models, nikon camera for sale

Issues: Where to draw the line?

lyrics, recipe, cast

after, test, centre, black, server
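
The difference between plain frequency and co-occurrence diversity can be sketched on the example queries above, treated here as a tiny query log. The function name `word_stats` is an assumption, and a real log would be far larger.

```python
from collections import Counter, defaultdict

def word_stats(queries):
    """Return (frequency, diversity): queries containing each word, and its count of distinct co-occurring words."""
    frequency = Counter()
    neighbours = defaultdict(set)
    for query in queries:
        words = set(query.lower().split())
        frequency.update(words)
        for word in words:
            neighbours[word] |= words - {word}
    diversity = {word: len(others) for word, others in neighbours.items()}
    return frequency, diversity

# The example queries above, used as a toy query log.
log = [
    "nikon camera prices",
    "winter coats prices",
    "property prices in bengaluru",
    "nikon camera models",
    "nikon camera for sale",
]
frequency, diversity = word_stats(log)
for word in ("prices", "nikon", "camera"):
    print(word, frequency[word], diversity[word])
```

All three words appear in three queries, but "prices" co-occurs with seven distinct words against five for "nikon" and "camera", which is the signal the diversity statistic relies on.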

Summary

Keyword-based retrieval is good, but not enough

Query and document understanding are required to boost IR performance

Methods used need to be fast and scalable

Query segmentation is a first step towards better query representation

Entities and attributes can be identified effectively using simple approaches

References: http://bit.ly/19b2dMC

How to Use Lucene

Files: http://cse.iitkgp.ac.in/resgrp/cnerg/qa/ForLucene.zip

Basic IR Assignment

Industry Scope

Questions?
