Improving Query Results using Answer Corroboration
Amélie Marian, Rutgers University
10/18/2006 Amélie Marian - Rutgers University 2
Motivations
- Queries on databases traditionally return exact answers: (sets of) tuples that match the query exactly.
- Queries in information retrieval traditionally return the best documents containing the answer: (lists of) documents within which users have to find the relevant information.
- Both query models are insufficient for today's information needs.
- New models have been used and studied: top-k queries, question answering (QA).
- But these models consider answers individually (except for some QA systems).
Data Corroboration
- Data sources cannot be fully trusted:
  - Low-quality data (e.g., data integration, user-input data)
  - Web data (anybody can say anything on the web)
- Non-exact query models: top-k answers are requested.
- Repeated information lends more credence to the quality of the information: aggregate similar information and increase its score.
Outline
- Answer Corroboration for Data Cleaning (joint work with Yannis Kotidis and Divesh Srivastava)
  - Motivations
  - Multiple Join Path Framework
  - Our Approach
  - Experimental Evaluation
- Answer Corroboration for Web Search
  - Motivations
  - Our Approach
  - Query Interface
Motivating Example
[Figure: schema linking four applications (Sales, Ordering, Provisioning, Inventory) through shared fields such as TN, BAN, ORN, PON, SubPON, CustName, and CircuitID]
Legend: TN: Telephone Number; ORN: Order Number; BAN: Billing Account Number; PON: Provisioning Order Number; SubPON: Related PON
Query: What is the Circuit ID associated with a Telephone Number that appears in SALES?
Motivations
- Data applications with overlapping features:
  - Data integration
  - Web sources
- Data quality issues (duplicates, nulls, default values, data inconsistencies):
  - Data-entry problems
  - Data integration problems
Contributions
- Multiple Join Path (MJP) framework
  - Quantifies answer quality
  - Takes corroborating evidence into account
  - Agglomerative scoring of answers
- Answer computation techniques
  - Designed for MJP scoring methodologies
  - Several output options (top-k, top-few)
- Experimental evaluation on real data
  - VIP integration platform
  - Quality of answers
  - Efficiency of our techniques
Multiple Join Path Framework: Problem Definition
- Queries of the form: "Given X = a, find the value of Y"
- Examples:
  - Given the telephone number of a customer, find the ID of the circuit to which the telephone line is attached. (One answer expected)
  - Given a circuit ID, find the names of the customers whose telephones are attached to that circuit. (Possibly several answers)
Schema Graph
- Directed acyclic graph whose nodes are field names
- Intra-application edges link fields within the same application
- Inter-application edges link fields across applications
- All (non-source, non-sink) nodes in the schema graph are (possibly approximate) primary or foreign keys of their applications
Data Graph
- Given a specific value of the source node X, what are the values of the sink node Y?
- Considers all join paths from X to Y in the schema graph
[Figure: data graph for an example query; one branch has no corresponding SALES.BAN, and two paths lead to the answer c1]
Scoring Answers
- Which are the correct values?
  - Unclean data, no a priori knowledge
- Technique to score data edges: what is the probability that the fields associated by the edge are correct?
- Probabilistic interpretation of data edge scores to score full join paths
  - Edge score aggregation
  - Independent of the length of the path
Scoring Data Edges
- Rely on functional dependencies (we are considering fields that are keys)
- Data edge scores model the error in the data
- Intra-application edge, for fields A and B within the same application (A -> B, and symmetrically for B -> A):
  score(a, b) = 1 / |{(a, b_i), i = 1, ..., n}|
  where the b_i are the values instantiated from querying the application with value a; a clean functional dependency gives n = 1 and a score of 1
- Inter-application edge: score equals 1, unless approximate matching is used, in which case the score is similarly normalized by the number of candidate pairs (a_i, b_j) matched between A and B
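As a rough illustration of the intra-application edge score above, the following Python sketch (function and value names are invented for the example) penalizes key values that violate the functional dependency:

```python
from collections import defaultdict

def intra_edge_scores(pairs):
    """Given (a, b) pairs drawn from one application, score each pair
    1/n, where n is the number of distinct b values seen with that a.
    A clean key (n = 1) keeps score 1.0; ambiguous keys are penalized."""
    values = defaultdict(set)
    for a, b in pairs:
        values[a].add(b)
    return {(a, b): 1.0 / len(bs) for a, bs in values.items() for b in bs}

# Hypothetical data: telephone number tn1 maps to two order numbers
# (a functional-dependency violation), tn2 maps to exactly one.
scores = intra_edge_scores([("tn1", "ord1"), ("tn1", "ord2"), ("tn2", "ord3")])
# scores[("tn2", "ord3")] is 1.0; each tn1 pair gets 0.5
```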
Scoring Data Paths
- A single data path is scored using a simple sequential composition of its data edge probabilities:
  pathScore = edgeScore_1 * edgeScore_2 * ... * edgeScore_n
- Data paths leading to the same answer are scored using parallel composition (under an independence assumption):
  parallelPathScore = pathScore_1 + pathScore_2 - pathScore_1 * pathScore_2
- Example: a single path X -> a -> b -> Y with edge scores 0.5, 0.8, and 0.6:
  pathScore = 0.5 * 0.8 * 0.6 = 0.24
- Adding a second path X -> c -> Y with edge scores 0.5 and 0.4 (pathScore = 0.2):
  parallelPathScore = 0.24 + 0.2 - (0.24 * 0.2) = 0.392
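The two compositions can be sketched in a few lines of Python (function names are illustrative, not from the talk); note that the parallel composition generalizes to any number of paths as 1 - prod(1 - p_i):

```python
from functools import reduce

def path_score(edge_scores):
    """Sequential composition: product of the edge probabilities."""
    return reduce(lambda x, y: x * y, edge_scores, 1.0)

def parallel_score(path_scores):
    """Parallel composition under the independence assumption:
    probability that at least one path is correct."""
    return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), path_scores, 1.0)

p1 = path_score([0.5, 0.8, 0.6])   # 0.24
p2 = path_score([0.5, 0.4])        # 0.20
total = parallel_score([p1, p2])   # 0.24 + 0.2 - 0.24*0.2 = 0.392
```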
Identifying Answers
- Only interested in the best answers
- Standard top-k techniques do not apply: answer scores can always be increased by new information
- We keep score-range information and return top answers as soon as they are identified, possibly without their complete scores (similar to NRA by Fagin et al.)
- Two return strategies: top-k and top-few (a weaker stop condition)
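A minimal sketch of the NRA-style stop condition implied above (the function and the score ranges are invented for illustration): an answer can be returned before its exact score is known, once its lower bound dominates every other candidate's upper bound.

```python
def top1_identified(ranges):
    """ranges: answer -> (lower, upper) score bounds.
    The top-1 answer is safe to return once its lower bound is at
    least every other candidate's upper bound, even though further
    corroborating paths could still raise its own score."""
    best = max(ranges, key=lambda a: ranges[a][0])
    best_lower = ranges[best][0]
    if all(ranges[a][1] <= best_lower for a in ranges if a != best):
        return best
    return None  # bounds still overlap; keep probing

# c1's score can only grow from 0.6, and c2 can never exceed 0.5:
assert top1_identified({"c1": (0.6, 0.9), "c2": (0.1, 0.5)}) == "c1"
# here c2 could still overtake c1, so no answer is identified yet:
assert top1_identified({"c1": (0.3, 0.9), "c2": (0.1, 0.5)}) is None
```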
Computing Answers
- Take advantage of early pruning: only interested in the best answers
- Incremental data graph computation
  - Probes to each application
  - The cost model is the number of probes
- Standard graph-search techniques (DFS, BFS) do not take advantage of score information
- We propose a technique based on the notion of maximum benefit
Maximum Benefit
- The benefit computation for a path uses two components:
  - Known scores of the explored data edges
  - The best way to augment an answer's score, using the residual benefit of unexplored schema edges
- Our strategy makes choices that aim at maximizing this benefit metric
VIP Experimental Platform
- Integration platform developed at AT&T
- 30 legacy systems, real data
- Developed as a platform for resolving disputes between applications that are due to data inconsistencies
- Front-end web interface
VIP Queries
- Random sample of 150 user queries
- Analysis shows that queries can be classified according to the number of answers they retrieve:
  - noAnswer (nA): 56 queries
  - anyAnswer (aA): 94 queries
  - oneLarge (oL): 47 queries
  - manyLarge (mL): 4 queries
  - manySmall (mS): 8 queries
  - heavyHitters (hH): 10 queries that returned between 128 and 257 answers per query
VIP Schema Graph
[Figure: for each of the 94 queries with answers, the paths leading to an answer vs. the paths leading to the top-1 answer]
Not considering all paths may lead to missing top-1 answers.
Number of Parallel Paths Contributing to the Top-1 Answer
[Figure: histogram of frequency counts by number of parallel paths]
Average of 10 parallel paths per answer, of which 2.5 are significant.
Cost of Execution
Related Work (Data Cleaning)
- Keyword search in DBMSs (BANKS, DBXplorer, DISCOVER, ObjectRank)
  - Query is a set of keywords; top-k query model; DB as a data graph
  - Do not agglomerate scores
- Top-k query evaluation (TA, MPro, Upper)
  - Consider each tuple as an entity; wait for exact answers (except for NRA)
  - Do not agglomerate scores
- Probabilistic ranking of DB results
  - Queries are not selective; large answer sets
- We take corroborative evidence into account to rank query results
Contributions
- Multiple Join Path framework
  - Uses corroborating evidence to identify high-quality results
  - Looks at all paths in the schema graph
- Scoring mechanism
  - Probabilistic interpretation
  - Takes schema information into account
- Techniques to compute answers
  - Take agglomerative scoring into account
  - Top-k and top-few
Outline
- Answer Corroboration for Data Cleaning
  - Motivations; Multiple Join Path Framework; Our Approach; Experimental Evaluation
- Answer Corroboration for Web Search
  - Motivations; Our Approach; Challenges
Motivations
- Information on web sources is unreliable: erroneous, misleading, biased, or outdated
- Users check many web sites to confirm the information (data corroboration)
- Can we do that automatically to save time?
Example: What is the gas mileage of my Honda Civic?
- Query: "honda civic 2005 gas mileage" on MSN Search
- Is the top hit, the carhybrids.com site, trustworthy?
- Is the Honda web site unbiased?
- Are all these values referring to the correct model year?
- The result pages report "36 mpg", "48 mpg", "37 mpg", "47 mpg", and "44 mpg"
- Users may check several web sites to get an answer
Example: Aggregating Results using Data Corroboration
- Combines similar values
- Uses the frequency of the answer as the ranking measure
- (Out of the first 10 pages; one page had no answer)
Challenges
- Designing a meaningful ranking function that combines:
  - The frequency of the answer in the result set
  - The importance of the web pages containing the answer, as measured by the search engine (e.g., PageRank)
  - The importance of the answer within the page
    - Use of formatting information within the page
    - Proximity of the answer to the query terms
    - Multiple answers per page
  - The similarity of the page with other pages (a dampening factor)
    - Reduce the impact of copy-paste sites
    - Reduce the impact of pages from the same domain
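One plausible shape for such a ranking function, as a hedged sketch (the weighting scheme, the `decay` knob, and the sample values are invented here, not the talk's actual scoring function): each extracted value accumulates weight from every page that reports it, with lower-ranked pages contributing less.

```python
from collections import defaultdict

def corroborate(extractions, decay=0.5):
    """extractions: (rank, value) pairs extracted from search results,
    with rank 0 the best-ranked page. Each occurrence of a value adds
    a weight that shrinks with the page's rank, so values that are
    both frequent and found on important pages score highest."""
    scores = defaultdict(float)
    for rank, value in extractions:
        scores[value] += 1.0 / (1.0 + decay * rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical extractions from the gas-mileage example:
hits = [(0, "36 mpg"), (1, "48 mpg"), (2, "37 mpg"), (3, "36 mpg"), (4, "47 mpg")]
ranked = corroborate(hits)
# "36 mpg" appears twice, once on the top page, so it ranks first
```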
Challenges (cont.)
- Selecting the result set (web pages)
  - How deep in the search engine results do we go?
  - Low-ranked pages will not contribute much to the score: use top-k pruning techniques
- Extracting information from the web pages
  - Use existing Information Extraction (IE) and Question Answering (QA) techniques
Current Work
- Focus on numerical queries
  - Analysis of MSN queries shows that they have a higher clickthrough rate than general queries
  - Answers are easier to identify in the text
- Scoring function
  - Currently a simple aggregation of individual parameter scores
  - Working on a probabilistic approach
- Number of pages accessed
  - Dynamic selection based on score information
Evaluation
- 15 million query logs from MSN
- Focus on:
  - Queries with high clickthrough rates
  - Numerical-value queries (for now)
- Compare clickthrough with the best-ranked sites to measure precision and recall
- User studies
Interface
Related Work (Web Search)
- Web search: our interface is built on top of a standard search engine
- Question answering systems (START, askMSR, MULDER)
  - Some have used the frequency of an answer to increase its score (askMSR, MULDER)
  - We are considering more complex scoring mechanisms
- Information extraction (Snowball)
  - We can use existing techniques to identify information within a page
  - Our problem is much simpler than standard IE
- Top-k queries (TA, Upper, MPro)
  - We need pruning functionality to stop retrieving web search results
Conclusions
- Large amounts of low-quality data: users have to rummage through a lot of information
- Data corroboration can improve the quality of query results
  - It has not been used much in practice
  - It makes sense in many applications
- Standard ranking techniques have to be modified to handle corroborative scoring
  - Standard ranking scores each answer individually; corroborative ranking combines answers
  - Pruning conditions in top-k queries do not work on corroborative answers