Search by strategy
Post on 14-Jan-2016
Arjen P. de Vries, arjen@acm.org
CWI, Spinque, Delft University of Technology
Interactive Information Access
Feedback: Interaction improves information representation
Faceted Browsing: Interaction can let user take over where machine would fail
Search by Strategy: Interaction can let user take over where system designer would fail
Two extreme views on ‘search’
What is DB?
From business applications
Deductive reasoning
Precise and efficient query processing
Users with technical skills (SQL, XQuery, etc.) and precise information needs
Selection: Books where category=‘CS’
What is IR?
From digital libraries, patent collections, etc.
Inductive reasoning
Best-effort processing
Users with low technical skills and imprecise information needs
Ranking: Books about CS
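The contrast between the two views can be sketched in a few lines of Python. This is only an illustration: the `books` records, the function names, and the term-overlap score are assumptions made for the example, not part of the talk. Selection returns an unordered set of exact matches, while ranking returns a best-effort ordering by score.

```python
# Hypothetical toy collection; titles, categories and texts are invented.
books = [
    {"title": "Database Systems", "category": "CS",
     "text": "relational query processing and transactions"},
    {"title": "Garden Design", "category": "Home",
     "text": "plants soil and computer aided garden planning"},
    {"title": "Learning to Rank", "category": "CS",
     "text": "ranking models for computer science search engines"},
]

def select_by_category(items, category):
    """DB view: exact selection, a book either matches or it does not."""
    return [b for b in items if b["category"] == category]

def rank_by_text(items, query):
    """IR view: score each book by how many query terms its text contains."""
    terms = query.lower().split()
    scored = []
    for b in items:
        words = b["text"].lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            scored.append((score, b["title"]))
    return sorted(scored, reverse=True)

selected = select_by_category(books, "CS")
ranked = rank_by_text(books, "computer science")
print([b["title"] for b in selected])
print(ranked)
```

Note that the two answers differ: the selection includes a CS book whose text never mentions the query terms, and the ranking includes a non-CS book that mentions one of them.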
Note: SemWeb more DB than IR!!!
Complex search tasks (or, “not-so-simple” search tasks, if you like)
I want to buy a house in Amsterdam, and I want it with ‘sfeer’ (Dutch for atmosphere or character) but still in good shape.
I can afford about €350K. I need 3 bedrooms, and the size should be about 80 m². It should have a balcony or a backyard.
The closer to the station and an AH, the better. BUT… I do not want to live in Amsterdam-Noord, unless there is a quick bus connection to the ferry.
I may be willing to drop some of these constraints, but I’m not sure which.
Search = IR + DB
Complex search tasks mix exact (DB) and ranked (IR) searches, on structured (DB) and unstructured (IR) data
Current technical solutions support either/or
SQL or Excel for structured data (DB)
Google or Lucene for unstructured data (IR)
Combining results requires significant effort
Copy & paste result sets between interfaces, “human (probabilistic) joins”
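A minimal sketch of what mixing exact and ranked operations in one query could look like, using the house-hunting task as a toy. All records, field names, and the scoring function are hypothetical; the point is only that the exact filter and the ranking run in one place, so no result sets need to be copied between tools.

```python
# Hypothetical listings; ids, prices and descriptions are invented.
houses = [
    {"id": "h1", "price": 340_000, "bedrooms": 3,
     "description": "charming canal house with balcony"},
    {"id": "h2", "price": 420_000, "bedrooms": 3,
     "description": "charming renovated house with backyard"},
    {"id": "h3", "price": 330_000, "bedrooms": 3,
     "description": "modern flat near station"},
    {"id": "h4", "price": 345_000, "bedrooms": 2,
     "description": "charming loft with balcony"},
]

def text_score(description, query_terms):
    """Ranked (IR) part: fraction of query terms found in the description."""
    words = set(description.lower().split())
    return sum(1 for t in query_terms if t in words) / len(query_terms)

def search(items, max_price, min_bedrooms, query):
    """Exact (DB) filter first, then rank the survivors by text score."""
    terms = query.lower().split()
    candidates = [h for h in items
                  if h["price"] <= max_price and h["bedrooms"] >= min_bedrooms]
    scored = [(h["id"], text_score(h["description"], terms)) for h in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

results = search(houses, max_price=350_000, min_bedrooms=3,
                 query="charming balcony")
print(results)
```

Turning a hard constraint into a soft one (for example, price) would mean moving it from the filter into the score, which is the kind of decision the user in the example is unsure about.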
Search = IR on-top-of DB ?
IR on-top-of DB: let exact and ranked operations both be processed by the same engine, so they can be mixed freely
IR responsible for ranking models, using DB as a data-access layer; no physical details necessary
DB responsible for reliable, dynamically optimised, data access; no logical details necessary
IR on-top-of DB???
Traditional, general-purpose DB technology cannot compete with custom IR search tools
Working assumption: using column stores should solve the efficiency problem
IR on-top-of DB – part I
let $c := doc("papers.xml")//DOC[author = "John Smith"]
let $q := "//text[about(.//Abstract, IR DB integration)]"
for $res in tijah-query($c, $q)
return $res/title/text()
I am an IR and DB expert
Hiemstra, Rode, Van Os, Flokstra, OSIR 2006
PF/Tijah: text search in an XML database system
IR on-top-of DB
DATA AND QUERIES
DATA MANAGEMENT
ENGINEERING
Data Model Mismatch!!!
To implement IR tasks on top of DB is a tough job!
FORMALISM
MAPPING
Rölleke, Tsikrika, Kazai
A general matrix framework for modelling Information Retrieval
“IR concepts to matrix spaces”
Cornacchia, De Vries, Van Ballegooij, Kersten
SRAM (Sparse Relational Array Mapping)
“matrix spaces to relational tables”
SRAM – query lifecycle
IR on-top-of DB – part II
let $c := doc("papers.xml")//DOC[author = "John Smith"]
let $q := "//text[about(.//Abstract, IR DB integration)]"
for $res in tijah-query($c, $q)
return $res/title/text()
I am an IR expert (needn’t be a DB expert as well)
Cornacchia, De Vries, ECIR 2007
A Parametrised Search System
SRAM for IR
Retrieval_Models.ram (excerpt)
# Language Modelling
langmod(Q,d,λ) = sum( [ lm_t(d,t,λ) * Q(t) | t ] )
lm_t(d,t,λ) = log( λ * DTf(d,t) + (1-λ) * Tf(t) )
# Okapi BM25
bm25(Q,d,k1,b) = sum( [ w_t(d,t,k1,b) * Q(t) | t ] )
w_t(d,t,k1,b) = Tidf(t) * (k1+1) * DTf(d,t) / ( DTf(d,t) + k1 * ((1-b) + b * Dnl(d)/avgDl) )
Indexing.ram (excerpt)
DTnl = mxMult(mxTrnsp(LT), LD)      # doc-term freq.
Dnl = mvMult(mxTrnsp(LD), L)        # doc length
Tnl = mvMult(mxTrnsp(LT), L)        # term freq.
DTf = [ DTnl(d,t)/Dnl(d) | d,t ]    # doc-term norm. freq.
Tf = [ Tnl(t)/nLocs | t ]           # term norm. freq.
Linear_Algebra.ram (excerpt)
mxTrnsp(A) = [ A(j,i) | i,j ]
mxMult(A,B) = [ sum([ A(i,k) * B(k,j) | k ]) | i,j ]
mvMult(A,V) = [ sum([ A(i,j) * V(j) | j ]) | i ]
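As a rough cross-check of the language-modelling definition, the same quantities can be computed directly in plain Python. This is a sketch, not the SRAM implementation: the two toy documents are invented, and the counts are built with dictionaries instead of the location matrices LT and LD. The names `dnl`, `dtnl`, `tnl`, `dtf`, `tf` and `langmod` mirror the slide's Dnl, DTnl, Tnl, DTf, Tf and langmod.

```python
import math
from collections import Counter

# Hypothetical two-document collection, already tokenised.
docs = {
    "d1": "ir db integration".split(),
    "d2": "db db storage".split(),
}

n_locs = sum(len(words) for words in docs.values())           # nLocs
dnl = {d: len(words) for d, words in docs.items()}            # Dnl: doc length
dtnl = {d: Counter(words) for d, words in docs.items()}       # DTnl: doc-term freq.
tnl = Counter(t for words in docs.values() for t in words)    # Tnl: term freq.

def dtf(d, t):
    """DTf: doc-term frequency normalised by document length."""
    return dtnl[d][t] / dnl[d]

def tf(t):
    """Tf: collection term frequency normalised by number of locations."""
    return tnl[t] / n_locs

def langmod(query, d, lam):
    """langmod(Q,d,λ): sum over query terms of log-smoothed probabilities.

    `query` maps each term to its weight Q(t); terms with Q(t) = 0 are
    simply left out. Every query term must occur somewhere in the
    collection, otherwise the logarithm is undefined.
    """
    return sum(math.log(lam * dtf(d, t) + (1 - lam) * tf(t)) * q
               for t, q in query.items())

query = {"ir": 1, "db": 1}
scores = {d: langmod(query, d, 0.5) for d in docs}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

With these toy documents, d1 scores higher than d2 because it contains both query terms, while d2 relies on smoothing for "ir".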
Parameterised Search System (PSS)
Can we not ‘remove’ this IR engineer from the loop, like DBMS software removes the data engineer from the loop?
Cornacchia, De Vries, ECIR 2007
A Parametrised Search System
Search by Strategy
Visually construct search strategies by connecting building blocks
Strategy Builder
From Patent to Inventor
Search by Strategy
Visually construct search strategies by connecting building blocks
Each block describes either data or actions upon that data
A search strategy may include multiple data sources
Data sources are internally represented as quadruples: triples extended with an additional probability value
Actions are actually scripts expressed in Fuhr and Roelleke’s PRA (Probabilistic Relational Algebra, TOIS 1997)
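To make the quadruple representation concrete, here is a small sketch of three PRA-style operators over (subject, predicate, object, probability) tuples. The data and function names are hypothetical, and the probability rules shown (multiplication for joins, 1 - ∏(1 - p) when projection merges duplicates) assume independent events, which is only one of the assumptions PRA supports.

```python
from collections import namedtuple

# A triple extended with a probability value.
Quad = namedtuple("Quad", "subject predicate object p")

# Hypothetical data for illustration only.
data = [
    Quad("patent1", "assignee", "univ_a", 1.0),
    Quad("patent2", "assignee", "univ_a", 1.0),
    Quad("patent1", "inventor", "alice", 0.9),
    Quad("patent2", "inventor", "alice", 0.5),
    Quad("patent2", "inventor", "bob", 0.8),
]

def select(quads, predicate):
    """Selection: keep quadruples with the given predicate."""
    return [q for q in quads if q.predicate == predicate]

def join_on_subject(left, right):
    """Join on subject; probabilities multiply (independence assumed)."""
    return [(l.subject, l.object, r.object, l.p * r.p)
            for l in left for r in right if l.subject == r.subject]

def project(rows, index):
    """Projection on one column; duplicates merge as 1 - prod(1 - p)."""
    miss = {}
    for row in rows:
        key = row[index]
        miss[key] = miss.get(key, 1.0) * (1.0 - row[-1])
    return {key: 1.0 - m for key, m in miss.items()}

joined = join_on_subject(select(data, "assignee"), select(data, "inventor"))
inventors = project(joined, 2)
print(inventors)
```

Here alice is supported by two patents, so her merged probability (about 0.95) ends up higher than either individual value.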
Real-life patent search example:
Which researchers associated with universities and colleges should our Human Resources manager know to hire the right people on time?
1. Which universities/colleges hold patents?
2. Who are the inventors named in those patents?
3. Which inventors are active in the area of our company?
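The patent example can be read as a chain of blocks, each consuming the output of the previous one. The sketch below is an assumption-laden toy, not the Strategy Builder: the patent records, the keyword test for "university or college", and the overlap score are all invented, and for simplicity the area score is computed on patents before moving to inventors.

```python
# Hypothetical patent records for illustration only.
patents = [
    {"id": "p1", "assignee": "Example University",
     "inventors": ["alice"], "abstract": "solar cell efficiency"},
    {"id": "p2", "assignee": "Acme Corp",
     "inventors": ["bob"], "abstract": "solar panel mounting"},
    {"id": "p3", "assignee": "Example College",
     "inventors": ["carol", "alice"], "abstract": "database indexing"},
]

def held_by_academia(items):
    """Block 1 (exact): patents whose assignee is a university or college."""
    return [p for p in items
            if "university" in p["assignee"].lower()
            or "college" in p["assignee"].lower()]

def score_by_area(items, area_terms):
    """Block 3 (ranked): fraction of area terms found in each abstract."""
    scored = []
    for p in items:
        words = set(p["abstract"].lower().split())
        scored.append((p, sum(t in words for t in area_terms) / len(area_terms)))
    return scored

def inventors_of(scored):
    """Block 2: move from patents to inventors, keeping each one's best score."""
    best = {}
    for p, s in scored:
        for name in p["inventors"]:
            best[name] = max(best.get(name, 0.0), s)
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))

def strategy(items, area):
    """Connect the blocks into one search strategy."""
    return inventors_of(score_by_area(held_by_academia(items), area.lower().split()))

people = strategy(patents, "solar cell")
print(people)
```

Swapping a block (say, a different definition of "active in our area") changes the strategy without touching the others, which is the appeal of composing strategies from building blocks.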
Generate Search Engine
Work In Progress…
How Strategies Help
Strategies improve communication between search intermediary and user
Encapsulate domain expert knowledge
Abstract representation of search expert knowledge
Analyze information seeking process at any stage
Strategies facilitate knowledge management
Store / share / publish / refine
Strategies mix exact (DB) and ranked (IR) searches
Avoid the need for “human (probabilistic) joins”
Implementation
PRA translates into SQL (!)
Current system setup using CWI’s MonetDB column-store
Strategies are dynamically transformed into a REST API
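A minimal sketch of how a probabilistic join might look once expressed in SQL. It uses Python's built-in `sqlite3` module purely so the example is self-contained; the actual system runs on MonetDB, and the table name `quad`, its columns, and the data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table of quadruples: a triple plus a probability column.
conn.execute(
    "CREATE TABLE quad (subject TEXT, predicate TEXT, object TEXT, p REAL)"
)
conn.executemany(
    "INSERT INTO quad VALUES (?, ?, ?, ?)",
    [
        ("patent1", "assignee", "univ_a", 1.0),
        ("patent1", "inventor", "alice", 0.9),
        ("patent2", "assignee", "corp_b", 1.0),
        ("patent2", "inventor", "bob", 0.8),
    ],
)

# Join assignee and inventor quadruples on the patent; the probability
# of each result is the product of its inputs (independence assumed).
rows = conn.execute(
    """
    SELECT i.object AS inventor, a.p * i.p AS prob
    FROM quad AS a
    JOIN quad AS i ON a.subject = i.subject
    WHERE a.predicate = 'assignee' AND a.object = 'univ_a'
      AND i.predicate = 'inventor'
    ORDER BY prob DESC
    """
).fetchall()
print(rows)
conn.close()
```

The exact condition (assignee is `univ_a`) and the ranked output (ordered by `prob`) are handled by the same engine in one statement.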
Conclusion
Hand over control to the user (or, most likely, the search intermediary)
Patent information specialists
Digital forensics detectives
Librarians / archivists
Real estate agents
Travel agency
Open Issues
Help the user make the best of their increased level of control
Integrate usage data from the live system to help improve or adapt strategies
Handle “even larger” scale data
Patent demo runs fine on ~17GB of semi-structured data (i.e., Fairview Research’s Green Energy collection), without specific optimizations, even with fairly large strategies
Close the loop!
Current Situation
index ; repeat { specify ; retrieve } until
IR + DB
IE
Desirable Situation
repeat { index ; specify ; retrieve } until
IR + DB + IE