Ziyang Liu, Peng Sun, Yi Chen
Arizona State University
STRUCTURED QUERY RESULT DIFFERENTIATION
2 KEYWORD SEARCH ON STRUCTURED DATA
Effective techniques have been developed to help users find relevant results:
Ranking: sort the results in order of estimated relevance.
Snippet: provide a summary of each result to help users judge relevance.
50% of keyword searches are information exploration queries, which inherently have multiple relevant results.
Users intend to investigate and compare multiple relevant results.
How can we help users compare relevant results?
[Figure: keywords are submitted to a search engine over structured data, which returns results (relevant data fragments).]
Web search: 50% navigation, 50% information exploration (Broder, SIGIR 02)
3 RESULTS AND SNIPPETS
Result 1:
store
  city: Phoenix
  name: BHPhoto
  merchandises
    camera
      category: DSLR
      brand: Canon
      megapixel: 12
    camera
      category: DSLR
      brand: Sony
      megapixel: 12
    ……
Result 2:
store
  city: Phoenix
  name: Adorama
  merchandises
    camera
      category: Compact
      brand: HP
      megapixel: 14
    camera
      category: Compact
      brand: Canon
      megapixel: 12
    ……
Query: “Phoenix, camera, store”
Snippet of result 1:
store
  city: Phoenix
  name: BHPhoto
  merchandises
    camera
      brand: Canon
      megapixel: 12
    camera
      brand: Canon
Snippet of result 2:
store
  city: Phoenix
  name: Adorama
  merchandises
    camera
      category: Compact
      brand: Canon
      megapixel: 12
Snippets are unhelpful in differentiating query results.
(Huang et al. SIGMOD 09)
4 DIFFERENTIATION FEATURE SETS (DFS)
DFS of result 1 (BHPhoto)
Feature Type        Value
store: name         BHPhoto
camera: brand       Canon, Canon, Sony
camera: category    DSLR
DFS of result 2 (Adorama)
Feature Type        Value
store: name         Adorama
camera: brand       Canon, HP
camera: category    Compact
Feature: (entity, attribute, value)
Bank websites usually allow users to compare selected credit cards, but only on a pre-defined feature set.
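A minimal sketch of how (entity, attribute, value) features could be collected from one result. The nested dict/list encoding of the result and the function name extract_features are assumptions made for illustration; the slides do not specify a data model.

```python
from collections import Counter

def extract_features(entity, node, out=None):
    """Count (entity, attribute, value) triples in a nested result.

    Assumed encoding: a dict maps child names to a leaf value, a nested
    dict (a sub-entity), or a list of either (repeated children).
    """
    out = Counter() if out is None else out
    for key, child in node.items():
        children = child if isinstance(child, list) else [child]
        for c in children:
            if isinstance(c, dict):
                extract_features(key, c, out)   # key names a sub-entity
            else:
                out[(entity, key, c)] += 1      # key names an attribute
    return out

# The visible part of result 1 from the example slide.
result1 = {
    "city": "Phoenix",
    "name": "BHPhoto",
    "merchandises": {
        "camera": [
            {"category": "DSLR", "brand": "Canon", "megapixel": 12},
            {"category": "DSLR", "brand": "Sony", "megapixel": 12},
        ]
    },
}
features1 = extract_features("store", result1)
```

The counts attached to each feature are what a later step would need in order to judge how dominant a feature is within its result.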
5 CHALLENGES OF RESULT DIFFERENTIATION
How to automatically generate DFSs that highlight the differences among results?
How to measure the quality of a set of DFSs?
DFSs should obviously maximize the difference among results. How to quantify it?
What are other desirable properties?
Can DFSs be efficiently generated from results?
6 CONTRIBUTIONS
First work on automatically differentiating structured search results
Application domains: online shopping, employee hiring, job/institution hunting, etc.
Identifying 3 desiderata for good DFSs
Quantifying the differentiation power of a set of DFSs
Proving the NP-hardness of DFS generation
Tackling the problem using two local optimality criteria: single-swap and multi-swap optimality
Implementing XRed: XML Result Differentiation
Empirically verifying the effectiveness & efficiency of XRed
7 ROADMAP
Desiderata for good DFSs
Problem definition
Local optimality and algorithms
Experiments
8 DESIDERATUM 1: BEING SMALL
A small DFS is easy for users to read and compare with other DFSs.
The size of each DFS, |D|, cannot exceed a user-specified upper bound L
|D| ≤ L
9 DESIDERATUM 2: SUMMARIZING QUERY RESULTS
DFSs that do not summarize the results show useless & misleading differences.
DFS 1
Feature Type     DFS
camera:brand     HP
DFS 2
Feature Type     DFS
camera:brand     Canon
This store sells only a few HP cameras.
10 DESIDERATUM 2: SUMMARIZING QUERY RESULTS
DFSs that do not summarize the results show useless & misleading differences.
DFS 1
Feature Type     DFS
camera:brand     Canon
camera:brand     HP
DFS 2
Feature Type     DFS
camera:brand     Canon
camera:brand     HP
This store sells only a few HP cameras.
11 DESIDERATUM 2: SUMMARIZING QUERY RESULTS
A DFS is valid only if it summarizes the corresponding result.
Dominance Ordered: features of the same type should be included in decreasing order of their number of occurrences in the result.
Distribution Preserved: the ratio of two features in the DFS should be roughly the same as in the result.
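The two validity conditions can be sketched as checks on a single feature type. This is an illustration under stated assumptions, not the authors' definition: the slides do not say how close "roughly the same" must be, so the relative tolerance tol=0.25 is a hypothetical parameter, and the function names are invented here. The brand counts are the store statistics given on the single-swap slide.

```python
from collections import Counter
from itertools import combinations

def is_dominance_ordered(dfs_values, result_counts):
    """Listed values must be the most frequent ones, most frequent first."""
    listed = list(dict.fromkeys(dfs_values))  # distinct values, in DFS order
    if not listed:
        return True
    counts = [result_counts.get(v, 0) for v in listed]
    if min(counts) == 0:  # a listed value never occurs in the result
        return False
    if any(a < b for a, b in zip(counts, counts[1:])):
        return False      # a less frequent value is listed before a more frequent one
    omitted = [c for v, c in result_counts.items() if v not in listed]
    return max(omitted, default=0) <= counts[-1]

def is_distribution_preserved(dfs_values, result_counts, tol=0.25):
    """Pairwise count ratios in the DFS must stay within tol (assumed) of the result's."""
    dfs_counts = Counter(dfs_values)
    for a, b in combinations(dfs_counts, 2):
        dfs_ratio = dfs_counts[a] / dfs_counts[b]
        result_ratio = result_counts[a] / result_counts[b]
        if abs(dfs_ratio - result_ratio) > tol * result_ratio:
            return False
    return True

def is_valid(dfs_values, result_counts, tol=0.25):
    # Dominance is checked first, so every listed value is known to occur in the result.
    return (is_dominance_ordered(dfs_values, result_counts)
            and is_distribution_preserved(dfs_values, result_counts, tol))

store1_brands = {"Canon": 103, "Sony": 50, "Nikon": 25, "HP": 22}
store2_brands = {"Canon": 80, "HP": 70}
```

Under these assumptions, "Canon, Canon, Sony" is a valid brand summary for store 1, while "HP" alone or "Canon, HP" is not, because Sony occurs more often than HP and is skipped.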
12 DESIDERATUM 3: DIFFERENTIATING QUERY RESULTS
Differentiation unit: feature type.
A feature type t in two DFSs D1 and D2 is differentiable if either:
The order of the features of type t is different, or
The ratio of two features of type t is different.
Examples:
D1. Camera: brand: Canon
D2. Camera: brand: HP
D1. Camera: brand: Canon
D2. Camera: brand: Canon, Camera: brand: HP
D1. Camera: brand: Canon, Camera: brand: HP
D2. Camera: brand: Canon, Camera: brand: Canon, Camera: brand: HP
13 DESIDERATUM 3: DIFFERENTIATING QUERY RESULTS
Degree of Differentiation (DoD) of two DFSs = number of differentiable feature types.
DFS 1
Feature Type       DFS
store:name         BHPhoto
camera:brand       Canon, Canon, Sony
camera:category    DSLR
DFS 2
Feature Type       DFS
store:name         Adorama
camera:brand       Canon, HP
camera:category    Compact
DoD = 3
DoD of multiple DFSs = the sum of the DoD of every pair of DFSs.
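The DoD computation can be sketched as follows. The representation (a dict from feature type to the list of values shown in the DFS) and all function names are assumptions. The sketch also assumes a feature type only counts when both DFSs contain it, which is consistent with the DoD values in the deck's examples but is not stated explicitly on this slide.

```python
from collections import Counter
from functools import reduce
from itertools import combinations
from math import gcd

def signature(values):
    """Order of the distinct values plus their count ratio in lowest terms."""
    counts = Counter(values)
    order = tuple(dict.fromkeys(values))      # distinct values, in listed order
    g = reduce(gcd, counts.values())          # assumes values is non-empty
    return order, tuple(counts[v] // g for v in order)

def differentiable(d1, d2, ftype):
    """True if both DFSs show ftype and differ in order or in ratio."""
    if ftype not in d1 or ftype not in d2:
        return False
    return signature(d1[ftype]) != signature(d2[ftype])

def dod(d1, d2):
    """Degree of Differentiation of two DFSs."""
    return sum(differentiable(d1, d2, t) for t in d1.keys() & d2.keys())

def total_dod(dfss):
    """DoD of a set of DFSs: the sum over every pair."""
    return sum(dod(a, b) for a, b in combinations(dfss, 2))

dfs1 = {"store:name": ["BHPhoto"],
        "camera:brand": ["Canon", "Canon", "Sony"],
        "camera:category": ["DSLR"]}
dfs2 = {"store:name": ["Adorama"],
        "camera:brand": ["Canon", "HP"],
        "camera:category": ["Compact"]}
example_dod = dod(dfs1, dfs2)
```

Reducing the counts by their greatest common divisor makes "Canon, HP" and "Canon, Canon, HP, HP" compare as equal, since neither the order nor the ratio differs.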
15 DFS GENERATION PROBLEM
Given a set of results and a size limit L, generate a DFS for each result such that:
Their DoD is maximized.
Every DFS is valid (a good summary).
Every DFS's size does not exceed L.
We proved the NP-hardness of this problem by reduction from X3C (Exact Cover by 3-Sets).
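The problem statement can also be written as a constrained optimization. The notation below is assumed for illustration: R_1, ..., R_n are the query results and D_i is the DFS generated for R_i.

```latex
\begin{aligned}
\max_{D_1,\dots,D_n}\quad & \sum_{1 \le i < j \le n} \mathrm{DoD}(D_i, D_j) \\
\text{s.t.}\quad & D_i \text{ is a valid summary of } R_i, && i = 1,\dots,n, \\
& |D_i| \le L, && i = 1,\dots,n.
\end{aligned}
```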
17 LOCAL OPTIMALITY
To tackle this hard problem, instead of achieving global optimality, we propose two local optimality criteria:
Single-swap Optimality
Multi-swap Optimality
18 SINGLE SWAP
A set of DFSs is Single-Swap Optimal if adding / changing a single feature in a single DFS (subject to validity and size limit) cannot increase the DoD.
DFS of STORE 1
Feature Type         Value
store: name          BHPhoto
store: city          Phoenix
camera: megapixel    12
camera: category     DSLR
DFS of STORE 2
Feature Type         Value
store: name          Adorama
camera: brand        Canon, HP
camera: megapixel    12
DoD = 1
DFS of STORE 2 after changing one feature
Feature Type         Value
store: name          Adorama
camera: brand        Canon, HP
camera: category     Compact
DoD increases to 2
Achieved Single-Swap Optimal
STORE 1
# of cameras: 200
Category: DSLR: 188, Others: 12
Brand: Canon: 103, Sony: 50, Nikon: 25, HP: 22
Megapixel: 12: 160, 13: 15, 14: 20
STORE 2
# of cameras: 150
Category: Compact: 140, Others: 10
Brand: Canon: 80, HP: 70
Megapixel: 12: 105, 13: 5, 14: 19
19 ALGORITHM FOR SINGLE-SWAP OPTIMALITY
Start from a randomly generated DFS for each result.
Repeatedly add a feature / change a feature in a DFS.
Stop when the DoD no longer increases.
Does this algorithm terminate in polynomial time?
Yes:
The maximum possible DoD for a set of DFSs is polynomial (at most the number of DFS pairs times the number of feature types).
Each iteration increases the DoD by at least 1.
Each iteration takes polynomial time.
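The loop above can be sketched as a hill climb. This is a simplified illustration, not the XRed implementation. It assumes each result comes with one precomputed valid summary per feature type (so validity holds by construction), a move adds or replaces one whole feature-type entry, the size of a DFS counts every listed value, and the starting DFSs are passed in rather than generated randomly. The size limit of 5 and all names are hypothetical.

```python
from itertools import combinations

def dod(d1, d2):
    # Simplification: entries are precomputed summaries, compared as whole tuples.
    return sum(1 for t in d1.keys() & d2.keys() if d1[t] != d2[t])

def total_dod(dfss):
    return sum(dod(a, b) for a, b in combinations(dfss, 2))

def size(dfs):
    return sum(len(values) for values in dfs.values())

def single_moves(dfs, candidates, limit):
    """Yield every DFS reachable by adding one entry or replacing one entry."""
    for t in (t for t in candidates if t not in dfs):
        added = {**dfs, t: candidates[t]}
        if size(added) <= limit:
            yield added
        for old in dfs:
            swapped = {k: v for k, v in dfs.items() if k != old}
            swapped[t] = candidates[t]
            if size(swapped) <= limit:
                yield swapped

def single_swap(dfss, candidates, limit):
    """Apply improving single moves until none increases the total DoD."""
    dfss = [dict(d) for d in dfss]
    best = total_dod(dfss)
    improved = True
    while improved:
        improved = False
        for i in range(len(dfss)):
            for cand in single_moves(dfss[i], candidates[i], limit):
                trial = dfss[:i] + [cand] + dfss[i + 1:]
                score = total_dod(trial)
                if score > best:          # strict increase guarantees termination
                    dfss, best, improved = trial, score, True
                    break
    return dfss, best

store1 = {"store:name": ("BHPhoto",), "store:city": ("Phoenix",),
          "camera:brand": ("Canon", "Canon", "Sony"),
          "camera:megapixel": ("12",), "camera:category": ("DSLR",)}
store2 = {"store:name": ("Adorama",), "store:city": ("Phoenix",),
          "camera:brand": ("Canon", "HP"),
          "camera:megapixel": ("12",), "camera:category": ("Compact",)}
start = [{t: store1[t] for t in ("store:name", "store:city",
                                 "camera:megapixel", "camera:category")},
         {t: store2[t] for t in ("store:name", "camera:brand",
                                 "camera:megapixel")}]
final, final_dod = single_swap(start, [store1, store2], limit=5)
```

Starting from the two DFSs on the single-swap slide (DoD = 1), the sketch stops at DoD = 2: no single addition or replacement in either DFS can raise it further, even though a better pair of DFSs exists.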
20 MULTI-SWAP OPTIMALITY
A set of DFSs is Multi-Swap Optimal if adding / changing any number of features in a single DFS (subject to validity and size limit) cannot increase the DoD.
DFS of STORE 1
Feature Type         Value
store:name           BHPhoto
store:city           Phoenix
camera:megapixel     12
camera:category      DSLR
DFS of STORE 2
Feature Type         Value
store:name           Adorama
camera:brand         Canon, HP
camera:category      Compact
DoD = 2
DFS of STORE 1 after changing multiple features
Feature Type         Value
store:name           BHPhoto
camera:brand         Canon, Canon, Sony
camera:category      DSLR
DoD increases to 3
STORE 1
# of cameras: 200
Category: DSLR: 188, Others: 12
Brand: Canon: 103, Sony: 50, Nikon: 25, HP: 22
Megapixel: 12: 160, 13: 15, 14: 20
STORE 2
# of cameras: 150
Category: Compact: 140, Others: 10
Brand: Canon: 80, HP: 70
Megapixel: 12: 105, 13: 5, 14: 19
21 ALGORITHM FOR MULTI-SWAP OPTIMALITY
Start from a randomly generated DFS for each result.
Repeatedly add / change multiple features in a DFS.
Stop when the DoD no longer increases.
A naive implementation of this algorithm has exponential time complexity!
We designed a novel dynamic programming algorithm, which takes pseudo-polynomial time.
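To make the multi-swap move concrete, the sketch below rebuilds one DFS from scratch by trying every subset of that result's candidate entries. This is the naive exponential step, not the authors' dynamic programming algorithm, which the slides do not describe. It assumes one precomputed valid summary per feature type, a size that counts every listed value, and a hypothetical size limit of 5.

```python
from itertools import combinations

def dod(d1, d2):
    # Simplification: entries are precomputed summaries, compared as whole tuples.
    return sum(1 for t in d1.keys() & d2.keys() if d1[t] != d2[t])

def best_response(i, dfss, candidates, limit):
    """Best replacement for DFS i with all other DFSs fixed (exponential search)."""
    others = dfss[:i] + dfss[i + 1:]
    best = dfss[i]
    best_score = sum(dod(best, o) for o in others)
    types = list(candidates)
    for r in range(len(types) + 1):
        for subset in combinations(types, r):
            cand = {t: candidates[t] for t in subset}
            if sum(len(v) for v in cand.values()) > limit:
                continue                      # violates the size limit
            score = sum(dod(cand, o) for o in others)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score

store1 = {"store:name": ("BHPhoto",), "store:city": ("Phoenix",),
          "camera:brand": ("Canon", "Canon", "Sony"),
          "camera:megapixel": ("12",), "camera:category": ("DSLR",)}
dfs1 = {t: store1[t] for t in ("store:name", "store:city",
                               "camera:megapixel", "camera:category")}
dfs2 = {"store:name": ("Adorama",), "camera:brand": ("Canon", "HP"),
        "camera:category": ("Compact",)}
new_dfs1, new_dod = best_response(0, [dfs1, dfs2], store1, limit=5)
```

On the multi-swap example, this replaces city and megapixel in the first DFS with the brand summary and raises the DoD from 2 to 3, a change that no single addition or replacement could achieve within the assumed limit.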
22 EVALUATION
We have implemented XRed (XML Result Differentiation) and evaluated it empirically.
Data sets: Film (http://infolab.stanford.edu/pub/movies), Camera Retailer (synthetic)
Result generation: XSeek (http://xseek.asu.edu/)
DFS size limit: 10% of the # of feature types
Metrics: quality (DoD), efficiency
Comparison system: an exponential algorithm that generates the optimal solution.
23 DFS QUALITY
[Figure: DoD achieved by Single-Swap, Multi-Swap and Optimal. Two panels: Camera Retailer (queries QC1 to QC8) and Film (queries QF1 to QF8).]
24 EFFICIENCY
[Figure: Time (s) taken by Single-Swap, Multi-Swap and Optimal. Two panels: Camera Retailer (queries QC1 to QC8) and Film (queries QF1 to QF8).]
Result size: 1KB ~ 9KB
# of results: 2 ~ 52
25 SCALABILITY
[Figure: scalability of Single-Swap and Multi-Swap. Three panels, with x-axes Query Result Size (KB), DFS Size Limit, and # of Query Results; the labeled y-axis is Time (s).]
26 CONCLUSIONS
We initiate the study of automatically differentiating structured query results, which is useful for information exploration queries.
We define a Differentiation Feature Set (DFS) for each result, and identify three desiderata for DFSs.
We formalize the DFS generation problem, and prove its NP-hardness.
We propose two local optimality criteria: single-swap and multi-swap, and design algorithms to efficiently achieve them.
We implemented the XRed system, and verified its effectiveness and efficiency through experiments.
27 FUTURE WORK
Result differentiation is a new area and opens opportunities for new research topics.
Is there a better way of selecting feature types, e.g., by considering users’ interests?
Is there a better way of measuring the quality of DFSs besides DoD?
Are there approximation / randomized algorithms for the DFS generation problem that achieve better quality / efficiency?