From Dirt to Shovels: Automatic Tool Generation for Ad Hoc Data
David Walker
Princeton University
with David Burke, Kathleen Fisher, Peter White & Kenny Q. Zhu
who am I?
why am I here?
Our Common Communication Infrastructure
Much information is represented in standardized data formats:
• Web pages in HTML
• Pictures in JPEG
• Movies in MPEG
• “Universal” information format XML
• Standard relational database formats
A plethora of data processing tools:
• Visualizers (browsers display JPEG, HTML, ...)
• Query languages allow users to extract information (SQL, XQuery)
• Programmers get easy access through standard libraries
► Java XML libraries --- JAXP
• Many applications handle it natively and convert back and forth
► MS Word
Ad Hoc Data
Massive amounts of data are stored in XML, HTML or relational databases but there’s even more data that isn’t
An ad hoc data format is any nonstandard, but structured data format for which convenient parsing, querying, visualizing, transformation tools are not available. (not natural language)
Ad Hoc Data from Web Server Logs (CLF)
Ad Hoc data from www.geneontology.org
!autogenerated-by: DAG-Edit version 1.419 rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: 3.223 $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:0003673
<biological_process ; GO:0008150
%behavior ; GO:0007610 ; synonym:behaviour
%adult behavior ; GO:0030534 ; synonym:adult behaviour
%adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour % feeding behavior ; GO:0007631
%adult locomotory behavior ; GO:0008344 ;
...
The Challenge of Ad Hoc Data
Data arrives “as is.”
Documentation is often out-of-date or nonexistent.
Data is buggy.
• Missing data, “extra” data, …
• Human error, malfunctioning machines, software bugs (e.g. race conditions on log entries), …
• Errors are sometimes the most interesting portion of the data.
Data sources may be enormous
• AT&T sources can generate up to 2GB/second
There are no software libraries, manuals, or armies of consultants to help you....
[Figure: data-processing workflow. Labels: Email; Raw Data; Data Entry: Create Format Description; Data Analysis; Data Exit: Data Transformation; External Systems]
• Description libraries
• Automatic inference
• Manual customization
• Visual support
Key points to know:
• Descriptions based on programming language “types”
• Broad collection of “base types” (ints, strings, dates, ip addresses...)
• Structured types include “structs,” “unions” and “arrays”
• ... but has many other features: dependency, constraints, recursion, ...
• has formal semantics & proven properties
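To make the idea of type-based descriptions concrete, here is a minimal sketch in Python. It is not PADS syntax and not the PADS implementation: the class names (Base, Struct, Union, Array), the parse function, and the regular expressions are all assumptions chosen for illustration. It only shows how base types, structs, unions and arrays can compose into a description that doubles as a parser.

```python
import re
from dataclasses import dataclass

# Hypothetical description constructors (not the PADS language itself).
@dataclass
class Base:
    name: str
    pattern: str  # regular expression recognizing the base type

@dataclass
class Struct:
    fields: list  # sub-descriptions matched one after another

@dataclass
class Union:
    branches: list  # alternatives tried in order

@dataclass
class Array:
    element: object  # description of one element
    separator: str   # literal separator between elements

def parse(desc, text, pos=0):
    """Return (value, new_position), or None if the text does not match."""
    if isinstance(desc, Base):
        match = re.compile(desc.pattern).match(text, pos)
        return (match.group(), match.end()) if match else None
    if isinstance(desc, Struct):
        values = []
        for field in desc.fields:
            result = parse(field, text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    if isinstance(desc, Union):
        for branch in desc.branches:
            result = parse(branch, text, pos)
            if result is not None:
                return result
        return None
    if isinstance(desc, Array):
        values = []
        result = parse(desc.element, text, pos)
        while result is not None:
            value, pos = result
            values.append(value)
            if not text.startswith(desc.separator, pos):
                break
            result = parse(desc.element, text, pos + len(desc.separator))
        return values, pos
    raise TypeError(f"unknown description: {desc!r}")

# Example description: an IP address, a space, then either an int or a dash.
record = Struct([
    Base("ip", r"\d+\.\d+\.\d+\.\d+"),
    Base("space", " "),
    Union([Base("int", r"\d+"), Base("dash", "-")]),
])
int_list = Array(Base("int", r"\d+"), ",")
```

For instance, parse(record, "127.0.0.1 42") yields the three matched fields, while a record whose last field is neither an int nor a dash is rejected with None. Dependencies, constraints and recursion are left out of this sketch.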
The PADS System (version 2.0)
Top-down, divide-and-conquer algorithm:
• Compute various statistics from tokenized data
• Guess a top-level type constructor
• Partition tokenized data into smaller chunks
• Recursively analyze and compute types from smaller chunks
Cluster tokens into groups with similar histograms
• Similar histograms
► strong evidence tokens coexist in same description component
► use symmetric relative entropy to measure similarity
• Only the “shape” of the histogram matters
► normalize histograms by sorting columns in descending size
► result: comma & quote grouped together
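The clustering step can be sketched as follows. This is an illustration under stated assumptions, not the system's code: the symmetric form used here is the average of the relative entropies of each histogram to their midpoint, the greedy grouping strategy and the threshold value are invented for the example, and the sample histograms are made-up numbers. A histogram is a list where entry j counts the records in which a token occurs exactly j times.

```python
import math

def normalize(hist):
    """Sort columns in descending size and scale them to sum to 1."""
    columns = sorted(hist, reverse=True)
    total = sum(columns)
    return [c / total for c in columns]

def relative_entropy(p, q):
    """Relative entropy of p with respect to q; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_relative_entropy(h1, h2):
    """Average relative entropy of each normalized histogram to their midpoint."""
    n = max(len(h1), len(h2))
    p = normalize(h1 + [0] * (n - len(h1)))
    q = normalize(h2 + [0] * (n - len(h2)))
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * relative_entropy(p, mid) + 0.5 * relative_entropy(q, mid)

def cluster(histograms, threshold=0.01):
    """Greedily group tokens whose histograms are close to a group's first member.

    The threshold is an arbitrary value chosen for this example.
    """
    groups = []
    for token, hist in histograms.items():
        for group in groups:
            representative = histograms[group[0]]
            if symmetric_relative_entropy(hist, representative) <= threshold:
                group.append(token)
                break
        else:
            groups.append([token])
    return groups

# Made-up histograms over 100 records: every record has exactly two quotes
# and exactly one comma; ints and strings vary from record to record.
histograms = {
    "Quote": [0, 0, 100],
    "Comma": [0, 100],
    "Int": [40, 60],
    "String": [10, 30, 40, 20],
}
groups = cluster(histograms)
```

Because the columns are sorted before comparison, the quote histogram (all mass at two occurrences) and the comma histogram (all mass at one occurrence) have the same shape and land in the same group.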
Find most promising token group to divide and conquer:
• Structs == Groups with high coverage & low “residual mass”
• Arrays == Groups with high coverage, sufficient width & high “residual mass”
• Unions == Other token groups
Struct involving comma, quote identified in histogram above
Overall procedure gives good starting point for rewriting system
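One way to read the struct/array/union heuristic in code is sketched below. The slides do not define coverage, width or residual mass precisely, so the definitions here are assumptions, as are the function names and every threshold value. Chunks are represented as lists of token names.

```python
from collections import Counter

def token_histogram(chunks, token):
    """hist[j] = number of chunks in which `token` occurs exactly j times."""
    counts = Counter(chunk.count(token) for chunk in chunks)
    return [counts.get(j, 0) for j in range(max(counts) + 1)]

def coverage(hist):
    """Number of chunks that contain the token at least once."""
    return sum(hist[1:])

def width(hist):
    """Number of distinct nonzero occurrence counts observed."""
    return sum(1 for column in hist[1:] if column > 0)

def residual_mass(hist):
    """Fraction of chunks outside the tallest nonzero column (assumed definition)."""
    if len(hist) < 2:
        return 1.0
    return 1.0 - max(hist[1:]) / sum(hist)

def guess_constructor(chunks, group, min_coverage=0.9, max_mass=0.1, min_width=3):
    """Guess a top-level constructor for a token group; thresholds are illustrative."""
    hists = [token_histogram(chunks, token) for token in group]
    if not all(coverage(h) >= min_coverage * len(chunks) for h in hists):
        return "union"
    if all(residual_mass(h) <= max_mass for h in hists):
        return "struct"
    if all(width(h) >= min_width and residual_mass(h) > max_mass for h in hists):
        return "array"
    return "union"

# Made-up tokenized records: a quoted string, a comma, then an int or a string.
records = [
    ["Quote", "String", "Quote", "Comma", "Int"],
    ["Quote", "String", "Quote", "Comma", "Int"],
    ["Quote", "String", "Quote", "Comma", "String"],
]
# Made-up comma-separated lists of varying length.
lists = [["Int"] + ["Comma", "Int"] * k for k in (1, 2, 3, 4)]
```

Here the quote and comma tokens occur the same number of times in every record, so the group is classified as a struct; the comma count in the list data varies widely from chunk to chunk, which points to an array.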
[Figure: token histograms for Quote, Comma, Integer and String; axis ticks from 0 to 100]
Format Refinement
Reanalyze example data with aid of rough description
Rewrite format description to:
• simplify presentation
Phase 2: Constraint inference
► Analyze table and infer constraints
► Use TANE algorithm [Huhtala et al. 99]
Phase 3: Format rewriting
► Use inferred constraints & type isomorphisms to rewrite rough description
► Greedy search to optimize information-theoretic score
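The kind of constraint that Phase 2 looks for can be illustrated with a brute-force check over a small table. This is not TANE, which searches the space of dependencies far more efficiently; it is a naive stand-in that finds constant columns and single-column functional dependencies. The table layout, the column names (tag1, tag2, int1) and the data are hypothetical.

```python
from itertools import permutations

def constant_columns(rows):
    """Columns whose value is identical in every row, with that value."""
    first = rows[0]
    return {
        column: first[column]
        for column in first
        if all(row[column] == first[column] for row in rows)
    }

def functional_dependencies(rows):
    """Pairs (a, b) such that the value in column a determines the value in column b."""
    columns = list(rows[0])
    dependencies = []
    for a, b in permutations(columns, 2):
        seen = {}  # value of a -> first value of b observed with it
        if all(seen.setdefault(row[a], row[b]) == row[b] for row in rows):
            dependencies.append((a, b))
    return dependencies

# Hypothetical table built by parsing sample records with a rough description:
# tag1 and tag2 record which union branch was taken, int1 the first int's value
# (None when the first union did not take the int branch).
table = [
    {"tag1": "int", "tag2": "int", "int1": 0},
    {"tag1": "str", "tag2": "str", "int1": None},
    {"tag1": "int", "tag2": "int", "int1": 0},
]
dependencies = functional_dependencies(table)
int_rows = [row for row in table if row["tag1"] == "int"]
```

In this table the branch taken by the second union determines the branch taken by the first, and within the rows that take the int branch the first int is always 0. Constraints of this kind are what the rewriting phase can use to merge the two unions and replace a general int with a constant.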
Refinement: Simple Example
[Figure: a rough description (a struct containing unions over int and str, with comma separators) is refined by constraint inference and then rule-based structure rewriting. Inferred constraint: the first union is “int” whenever the second union is “int”. The rewritten description is more accurate: the first int = 0, and “int , alpha-string” records are ruled out.]
Biggest Weakness
Degree of success often hinges on the inference system having a tokenization scheme that matches the tokenization scheme of the data source.
Good tokens capture high-level, human abstractions compactly.
Techniques for learning tokenizations from data directly?
Techniques for using multiple, ambiguous tokenization schemes simultaneously?
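A small sketch of why the tokenization scheme matters. The two token sets below are invented for this example and are not the tokens the inference system uses; the point is only that the same text yields a compact token sequence under one scheme and a long, low-level one under another.

```python
import re

def make_tokenizer(token_defs):
    """Build a tokenizer from ordered (name, regex) pairs; earlier pairs win."""
    pattern = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in token_defs))

    def tokenize(text):
        return [match.lastgroup for match in pattern.finditer(text)]

    return tokenize

# Hypothetical low-level scheme: numbers, words, whitespace, single characters.
low_level = make_tokenizer([
    ("Int", r"\d+"),
    ("Word", r"[A-Za-z]+"),
    ("White", r"\s+"),
    ("Punct", r"."),
])

# The same scheme with one extra high-level token for IP addresses.
high_level = make_tokenizer([
    ("IP", r"\d+\.\d+\.\d+\.\d+"),
    ("Int", r"\d+"),
    ("Word", r"[A-Za-z]+"),
    ("White", r"\s+"),
    ("Punct", r"."),
])

line = "127.0.0.1 200"
low_tokens = low_level(line)
high_tokens = high_level(line)
```

With the IP token available the line becomes three tokens (IP, White, Int); without it the address breaks into seven Int and Punct tokens, and histogram-based inference has to rediscover the address structure on its own.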
Related Work
Most common domains for grammar inference:
• xml/html
• natural language
Systems that focus on ad hoc data are rare, and the few that exist don’t support the PADS tool suite:
• Rufus system ’93, TSIMMIS ’94, Potter’s Wheel ’01
Top-down structure discovery
• Arasu & Garcia-Molina ’03 (extracting data from web pages)
Grammar induction using MDL & grammar rewriting search
• Stolcke and Omohundro ’94 “Inducing probabilistic grammars...”
• T. W. Hong ’02, Ph.D. thesis on information extraction from web pages
• Higuera ’01 “Current trends in grammar induction”
Conclusions
Still a work in progress, but we are able to produce XML and statistical reports fully automatically from ad hoc data sources.
We’ve tested on approximately 15 real, mostly systemy data sources (web logs, crash reports, AT&T phone call data, etc.) with what we believe is relatively good success.
For papers & software, see our website at: http://www.padsproj.org/