Structured Querying of Web Text: A Technical Challenge
Kulsawasd Jitkajornwanich
University of Texas at Arlington
[email protected]
CSE6339 Web Mining | April 16, 2009 | 9:30 am
Paper by Cafarella, Ré, Suciu, Etzioni & Banko
Introduction
What is structured-query?
1. Structured-query
Has a "condition" in the query
Can express a complicated query
ex. "SQL query"
List employees whose name starts with 'David' and whose salary > 5000
SELECT E.name, E.salary FROM Employee E WHERE E.name LIKE 'David%' AND E.salary > 5000
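A minimal sketch of the structured query above, run against an in-memory SQLite database. The `Employee` rows are invented for illustration; only the query itself comes from the slide.

```python
import sqlite3

# Hypothetical Employee table; the rows below are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("David Smith", 6000), ("David Jones", 4000), ("Maria Lopez", 7000)],
)

# Both conditions must hold: name starts with 'David' AND salary > 5000.
rows = conn.execute(
    "SELECT E.name, E.salary FROM Employee E "
    "WHERE E.name LIKE 'David%' AND E.salary > 5000"
).fetchall()
print(rows)
```

Only the row that satisfies both conditions is returned, which is what makes the query "structured".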
Introduction
What is structured-query?
2. Unstructured-query
ex. "Keyword Search"
No "condition" in the query
Simply does "string matching"
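For contrast, keyword search can be sketched as plain substring matching. The documents and the `keyword_search` helper are hypothetical, and a real search engine does far more (indexing, ranking); the point is only that no condition such as "salary > 5000" can be expressed.

```python
# Invented example documents.
documents = [
    "David earns a salary of 3000.",
    "Maria earns a salary of 7000.",
    "The salary cap was raised last year.",
]

def keyword_search(docs, keyword):
    """Return every document containing the keyword, ignoring case."""
    needle = keyword.lower()
    return [doc for doc in docs if needle in doc.lower()]

hits = keyword_search(documents, "salary")
print(len(hits))
```

Every document mentioning the keyword matches, whether or not it is relevant to the intended question.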
Introduction
--> we just talked about types of query <--
What about types of data?
2 types of data:
2. Unstructured-data
ex. Web documents
Introduction
Objective of the paper:
To propose a tool called ExDB for making structured queries on web documents (unstructured data)
[Diagram labels: Relational Database (structured data); Web Text (unstructured data); SQL Query; Search Engine; ExDB; Unstructured-query (Keyword Search); Structured-query (complicated query like SQL-query)]
How it works: Big Picture of ExDB
[Diagram: Collection of web documents -> ExDB Extractor -> Fact Table, Type Table, Constraint Table (RDBMS Database). User -> q(?x,?y):- invented(?x,?y) -> ExDB Compiler -> Resulting Table]
Outline
1st Component: ExDB Extractor
What does it do, and how, in more detail?
2nd Component: ExDB Compiler
What does it do, and how, in more detail?
Test your understanding!
Working on tasks
Compare results: ExDB & Google
Conclusion
How ExDB Works
1st Component: ExDB Extractor
What does it do?
To extract data from the web documents & put it into the tables
How ExDB Works
2nd Component: ExDB Compiler
What does it do?
To process the user's structured query on the tables from the 1st component (ExDB Extractor) and give the resulting table back to the user
ex. q(?x, ?y):- invented(?x, ?y)
<we will study this query syntax later on>
How it works: Big Picture of ExDB
[Diagram: Collection of web documents (ex. "…was surprising. In 1877, Edison invented the light bulb. Although he …") -> ExDB Extractor -> Fact Table, Type Table, Constraint Table (RDBMS Database). User: make a query using ExDB syntax]
What does it do?
To extract data from the web documents & put it into the tables
There are 3 tables:
1. Fact Table
2. Type Table
3. Constraint Table
Additional column: stores the tuple probability
Discussion: Why do we need this column?
0 < p < 1, … p_i = 1
One way to assign probability: counting occurrence frequency
Assume independence among tuples
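One way to picture "counting occurrence frequency" is the sketch below. The slides do not say how a count becomes a probability, so the formula `count / (count + 1)` is purely an assumption chosen to keep every value strictly between 0 and 1; the extracted tuples are invented as well.

```python
from collections import Counter

# Invented extractions: the same fact seen on several web pages.
extractions = [
    ("Edison", "invented", "light bulb"),
    ("Edison", "invented", "light bulb"),
    ("Edison", "invented", "light bulb"),
    ("Bell", "invented", "telephone"),
]

def tuple_probabilities(extracted):
    """Map each distinct tuple to a probability that grows with its count.

    Assumption (not from the slides): p = count / (count + 1).
    """
    counts = Counter(extracted)
    return {fact: count / (count + 1) for fact, count in counts.items()}

probs = tuple_probabilities(extractions)
for fact, p in probs.items():
    print(fact, p)
```

A fact seen three times gets a higher probability than a fact seen once, which is the only property this sketch is meant to show.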
1.1 Fact Table
Stores fact information
ex. "Edison invented light bulb"
Uses TextRunner to extract
What does it look like?
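As a rough picture of such a table, the sketch below stores one binary fact per row together with its tuple probability. The table name, column names, and probability value are assumptions made for illustration; they are not taken from the paper or from TextRunner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a binary predicate linking two objects, plus a probability.
conn.execute(
    "CREATE TABLE fact (object1 TEXT, predicate TEXT, object2 TEXT, p REAL)"
)
# Invented row based on the slide's example sentence.
conn.execute("INSERT INTO fact VALUES ('Edison', 'invented', 'light bulb', 0.9)")

fact_rows = conn.execute(
    "SELECT object1, object2, p FROM fact WHERE predicate = 'invented'"
).fetchall()
print(fact_rows)
```

Because each row holds exactly two objects, only binary predicates fit this layout.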
Example 2: shows how to get the Type table
1.3 Constraint Table
Stores constraint information of objects or predicates
There are 2 types of constraints discussed in this paper: Synonym and Inclusion Dependency
Uses DIRT to extract
1. Synonym
example for predicate: did-invented = invented
example for object: Edison T. = Edison
2. Inclusion Dependency
example for predicate: be-guardian ⊇ be-parent
example for object: relative ⊇ sister
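A minimal sketch of how synonym and inclusion-dependency constraints could be applied when looking up facts. The dictionaries, the `lookup` helper, and the fact about "Mary" and "Tom" are hypothetical. The direction of inclusion (every be-parent fact also counts as a be-guardian fact) is an assumption based on the meaning of the terms, and constraint probabilities are ignored here.

```python
# Assumed constraint tables, mirroring the slide's examples.
SYNONYMS = {"did-invented": "invented", "Edison T.": "Edison"}
# Assumed direction: broader term -> narrower terms it includes.
INCLUSIONS = {"be-guardian": {"be-parent"}, "relative": {"sister"}}

# Invented facts as (object1, predicate, object2).
FACTS = [
    ("Edison T.", "did-invented", "light bulb"),
    ("Mary", "be-parent", "Tom"),
]

def canonical(term):
    """Replace a term by its canonical synonym, if one is known."""
    return SYNONYMS.get(term, term)

def matching_predicates(predicate):
    """The predicate itself plus every predicate it includes."""
    predicate = canonical(predicate)
    return {predicate} | INCLUSIONS.get(predicate, set())

def lookup(predicate):
    """Return (object1, object2) pairs whose predicate satisfies the query."""
    wanted = matching_predicates(predicate)
    return [
        (canonical(o1), canonical(o2))
        for (o1, pred, o2) in FACTS
        if canonical(pred) in wanted
    ]

print(lookup("invented"))
print(lookup("be-guardian"))
```

The second lookup succeeds even though no fact uses the predicate be-guardian directly, because the inclusion constraint lets be-parent facts answer it.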
Key point summary of 1st component (ExDB Extractor):
1. ExDB Extractor uses different kinds of existing extractors: TextRunner, KnowItAll, and DIRT.
2. A probability column is used to indicate the degree of correctness and to deal with the uncertainty problem.
3. Drawback of the fact table: only binary predicates are allowed.
ExDB syntax:
?x = variable x
w = constant value w
q(?x,?y):- = defines the resulting table q consisting of columns x and y
invented(?x,?y) = returns the list of objects x and y related by the predicate "invented"
invented(<scientists> ?x,?y) = returns the list of objects x whose type is <scientists>, and y, related by the predicate "invented"
This syntax is called "Datalog-like notation"
Let's try some examples!
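The two query forms above can be mimicked with a toy evaluator. This is only a sketch under several assumptions: the `FACTS` and `TYPES` rows are invented, the regular expression handles just a single binary predicate with an optional type on the first variable, and type-table probabilities are ignored. It is not the ExDB compiler.

```python
import re

# Invented fact rows: (object1, predicate, object2, probability).
FACTS = [
    ("Edison", "invented", "light bulb", 0.9),
    ("Bell", "invented", "telephone", 0.8),
    ("Acme Corp", "invented", "widget", 0.4),
]
# Invented type rows: (object, type).
TYPES = {("Edison", "scientists"), ("Bell", "scientists")}

# Matches bodies such as invented(?x,?y) or invented(<scientists> ?x,?y).
BODY = re.compile(r"(\w[\w-]*)\(\s*(?:<(\w+)>\s*)?\?(\w+)\s*,\s*\?(\w+)\s*\)")

def run_query(query):
    """Evaluate 'q(?x,?y):- pred(?x,?y)' and rank rows by probability."""
    _head, body = query.split(":-")
    match = BODY.fullmatch(body.strip())
    if match is None:
        raise ValueError("unsupported query body")
    predicate, x_type, _x, _y = match.groups()
    rows = [
        (o1, o2, p)
        for (o1, pred, o2, p) in FACTS
        if pred == predicate and (x_type is None or (o1, x_type) in TYPES)
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

print(run_query("q(?x,?y):- invented(?x,?y)"))
print(run_query("q(?x,?y):- invented(<scientists> ?x,?y)"))
```

The typed query drops the row whose first object is not listed as a scientist, and both results come back as a table ordered by decreasing probability.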
Discussion: In this case, what can we do to answer this query?
2. Make a Query: Problem Scenario
example8: (this example involves PROJECTION) list all names of those who invented something
Solving the Problem Scenario by using the "Panel of Experts" technique
New Way: using the "Panel of Experts" technique
Principle:
1. Define the number n of duplicate outputs to consider, ex. n = 5 (meaning that if there are 10 duplicate outputs in total, we consider only 5 and eliminate the other 5), to eliminate low-quality output.
2. newProb = the max value among those n duplicate outputs:
newProb = max { duplicateProb_i : 1 ≤ i ≤ n }
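The principle can be sketched as follows. The slides do not say which n duplicates are kept, so this sketch assumes the n with the highest probability; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict

def project_with_panel(rows, n):
    """Collapse duplicate names after a projection.

    rows: (name, probability) pairs, where the same name may repeat.
    Assumption (not from the slides): the panel is the n highest-probability
    duplicates of each name.
    """
    groups = defaultdict(list)
    for name, p in rows:
        groups[name].append(p)
    result = []
    for name, probs in groups.items():
        panel = sorted(probs, reverse=True)[:n]  # keep at most n duplicates
        result.append((name, max(panel)))        # newProb = max over the panel
    return sorted(result, key=lambda row: row[1], reverse=True)

# Invented projection output for "list all names who invented something".
projected = [("Edison", 0.9), ("Edison", 0.3), ("Edison", 0.2), ("Bell", 0.8)]
print(project_with_panel(projected, n=2))
```

Under this particular assumption the maximum over the panel equals the maximum over all duplicates, so n only matters if the panel is chosen by some other rule or combined with a function other than max.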
2. Make a Query: Problem Scenario
example8: (problem caused by the projection operation) list all names of those who invented something
Key points summary of 2nd Component (ExDB Compiler):
1. ExDB has its own syntax.
2. The result is in table format.
3. The last column is the probability value; rows are ranked in decreasing order of probability. The assumption is that the higher the probability, the more accurate the result.
4. Can implement top K to reduce time complexity (increase performance).
5. In case of a JOIN, the resulting probability is the product of the probabilities of the two joined tuples.
6. In case of PROJECTION, use Panel of Experts to solve the problem.
7. In case the user's query contains a relation which does not exist in the Fact Table, we can use the Constraint Table to answer such a query.
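Points 4 and 5 can be illustrated with a short sketch. The relations and their probabilities are invented, and `join_on_first_object` and `top_k` are hypothetical helpers. Note that this version ranks after the full join has been computed, whereas a real top-K implementation would try to avoid producing low-probability rows in the first place.

```python
import heapq

# Invented relations: (object1, object2, probability).
invented = [("Edison", "light bulb", 0.9), ("Bell", "telephone", 0.8)]
born_in = [("Edison", "Ohio", 0.5), ("Bell", "Scotland", 0.5)]

def join_on_first_object(left, right):
    """Join rows sharing object1; the joined probability is the product,
    which relies on the independence assumption among tuples."""
    return [
        (l1, l2, r2, lp * rp)
        for (l1, l2, lp) in left
        for (r1, r2, rp) in right
        if l1 == r1
    ]

def top_k(rows, k):
    """Return the k rows with the highest probability (last column)."""
    return heapq.nlargest(k, rows, key=lambda row: row[-1])

joined = join_on_first_object(invented, born_in)
best = top_k(joined, 1)
print(best)
```

Each joined row is less probable than either of its inputs, since multiplying two values below 1 can only shrink them.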
Compare results: ExDB & Google
Test query: list all scientists who created something
[Figures: output from ExDB; output from Google]
Comments:
ExDB performs much better than Google on this test query.
For the Google results, after investigating all the links, only 1 document comes close to the answer.
For ExDB, although the results contain some redundancy, the answer is still better.
Conclusion
Only binary predicates are allowed.
Results are returned in table format (unlike the Google search engine).
How ExDB gets its answers makes more sense, since it integrates all the data together before we make a query on it.
The extractor has to run beforehand, before the user is allowed to make a query.
The IE systems involved in this paper are TextRunner, KnowItAll, and DIRT.
The user is not expected to know the schema of the tables; instead, the system itself tries to match as much as it can to answer the query (using synonyms and inclusion dependencies).
Questions?
References
N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, pages 551