Ad Hoc Data: From Uggh to Smug
Ad Hoc Data is Everywhere
• Lots of data in databases ==> even more data that isn’t
• Ad Hoc Data: sets of semi-structured data files for which standard data processing tools are unavailable
• Tasks: “getting the data into a database” (and other kinds of transformations), data cleaning, querying, editing, parsing...
• Anne: A “Mark-up Language” for Ad Hoc Data [PLDI 2010]
• with Qian Xi (Princeton)
• Forest: A Language for Specifying Environmental Assumptions
• with Kathleen Fisher (AT&T)
• Nate Foster (Princeton)
• Kenny Zhu (Shanghai Jiao Tong University)
Anne: A Context-free Mark-up Language for Ad Hoc Data
[PLDI 2010]
Qian Xi
The Problem
What is the fastest, most reliable way to go from data like this:
To a parse tree like this:
And generate documentation (a grammar) and tools such as a parser, printer, query engine, editor, XML converter, ...
Our Solution: Anne
• Develop a “mark-up language” for ordinary text
• programmers annotate raw text using a set of “grammatical directives”
• a simple, predictable algorithm generates a complete grammar & processing tools from directives + the surrounding raw data
Pros:
• really easy to use
• directives are simple -- applied when & where needed
• you can do it at 3am
• predictable
• documentation and tools may be generated automatically
Cons:
• not completely automatic
• but I’m skeptical any other more magical bullet exists anyway
Generated Grammar:
Loc ::= {[^ ]*}
ID ::= ‘-’ + word
Entry ::= Loc ‘ ‘ – ‘ ‘ ID ‘ ‘ ‘”’ ‘GET’ ... int ‘ ‘ int
Document:
any string terminated by a space
$ directs the system to infer a terminating symbol
a space follows the closing brace
Interjection: The Config File
default.config:
def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am\|AM\|pm\|PM
def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
...
exp Time {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?
exp IP {trip}\.{trip}\.{trip}\.{trip}
• A config file provides a mechanism for defining regular expressions and giving them names
• def is an internal definition
• exp is an exported named regular expression
• The default config file provides regular expressions for common systems data (IP, dates, times, URL, email, ... )
• Generated Artifacts:
• PADS description (and from there, the PADS tool suite)
• XML & CSS for debugging
• Semantics: connections to Relevance Logic [see PLDI 10]
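The def/exp mechanism described above can be sketched in Python. This is only an illustration, not Anne's implementation: the helper names `to_python_regex` and `load_config` are hypothetical, and the sketch assumes the config file uses a dialect in which `\|` is alternation, `\(` and `\)` are grouping, and `{name}` refers to an earlier definition.

```python
import re

# The definitions shown on the slide (the slide's trailing "..." is left out).
DEFAULT_CONFIG = r"""
def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am\|AM\|pm\|PM
def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
exp Time {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?
exp IP {trip}\.{trip}\.{trip}\.{trip}
"""

# Assumed dialect: \| is alternation, \( and \) are grouping.
_ESCAPES = {"|": "|", "(": "(?:", ")": ")"}


def to_python_regex(body, names):
    """Translate one definition body into Python `re` syntax."""
    body = re.sub(r"\\([|()])", lambda m: _ESCAPES[m.group(1)], body)
    # Expand {name} references; wrap each in a group so that an
    # alternation inside the definition keeps its precedence.
    return re.sub(r"\{([A-Za-z_]\w*)\}",
                  lambda m: "(?:" + names[m.group(1)] + ")", body)


def load_config(text):
    """Return the exported (exp) definitions as compiled patterns."""
    names, exported = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, name, body = line.split(None, 2)
        names[name] = to_python_regex(body, names)  # def and exp are both referable
        if kind == "exp":                            # only exp is visible outside
            exported[name] = re.compile(names[name])
    return exported
```

With these assumptions, `load_config(DEFAULT_CONFIG)` exposes only `Time` and `IP`; `db`, `zone`, `ampm` and `trip` stay internal, which is the distinction the slide draws between def and exp.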
{Record*[|]:9152271|9152271|1|0|0|0|0|1}
Elem ::= int
Record ::= (Elem (‘|’ Elem)* )?
Repetition (2)
Repetition (1)
Kleene Star with elements separated by ‘|’ and defined by first element
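As a rough illustration of the generated grammar for the Record example, a parser for Record ::= (Elem (‘|’ Elem)* )? can be sketched in a few lines of Python. The function name `parse_record` is hypothetical, and the sketch assumes that `int` means an unsigned decimal integer; Anne itself produces a PADS description rather than code like this.

```python
import re


def parse_record(text, sep="|"):
    """Parse Record ::= (Elem (sep Elem)*)? with Elem ::= int."""
    if text == "":
        return []                    # the optional body is absent
    elems = []
    for field in text.split(sep):
        # Assumption: an int is one or more ASCII digits, no sign.
        if not re.fullmatch(r"[0-9]+", field):
            raise ValueError(f"not an int: {field!r}")
        elems.append(int(field))
    return elems
```

Applied to the annotated text on the slide, `parse_record("9152271|9152271|1|0|0|0|0|1")` yields a list of eight integers, and an empty record yields an empty list.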
Various causes for errors:
• Missing files
• Directories/files in wrong locations
• Wrong permissions
• Links to wrong targets
If only we could...
• Describe required file and directory structure, including permissions, etc.
• Check that the actual file system matches the spec.
• Eliminate a whole class of errors!
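The describe-then-check idea can be sketched in Python as follows. This is not Forest; `FileSpec`, `DirSpec` and `check` are hypothetical names used only to show what a specification and a generated checker might look like, and the sketch covers just existence, file-versus-directory kind, and permission bits.

```python
import os
import stat
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FileSpec:
    mode: Optional[int] = None       # required permission bits, e.g. 0o644


@dataclass
class DirSpec:
    entries: dict = field(default_factory=dict)   # name -> FileSpec or DirSpec
    mode: Optional[int] = None


def check(path, spec):
    """Return a list of mismatches between the file system at path and spec."""
    if not os.path.exists(path):     # also true for a link to a missing target
        return [f"missing: {path}"]
    st = os.stat(path)
    is_dir = stat.S_ISDIR(st.st_mode)
    if isinstance(spec, DirSpec) != is_dir:
        kind = "directory" if isinstance(spec, DirSpec) else "file"
        return [f"expected {kind}: {path}"]
    errors = []
    actual = stat.S_IMODE(st.st_mode)
    if spec.mode is not None and actual != spec.mode:
        errors.append(f"wrong permissions on {path}: {actual:o}")
    if is_dir:
        for name, child in spec.entries.items():
            errors += check(os.path.join(path, name), child)
    return errors
```

A description such as `DirSpec({"Config": FileSpec(mode=0o644), "logs": DirSpec()})` then reports missing files, entries of the wrong kind, and wrong permissions in one pass, which covers three of the four error causes listed earlier.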
CORAL Monitoring System
• Monitoring system for an “Internet-scale, self-organizing, web-content distribution network” developed by Mike Freedman, Princeton.
Observations on Monitoring
• Coral is similar to other monitoring systems: PlanetLab and a multitude of systems at AT&T.
• Often a configuration file specifies which hosts to monitor, what data to collect, and how often.
• File and directory names encode meta-data.
• Want to ask questions such as:
• what was the total load on planetlab1 last week?
• on what days and at what times are files missing?
• what is the maximum memory usage?
• Answering questions requires formulating queries both in terms of the contents of files and the structure of the file system (directory names, file names)
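A query such as "what was the total load on planetlab1 last week?" mixes both kinds of information. The Python sketch below assumes a purely hypothetical layout, `<root>/<host>/<YYYY-MM-DD>/load.txt` with one number per line; the slides do not show the real CORAL layout, and the names `days` and `total_load` are invented for illustration.

```python
import os
from datetime import date, timedelta


def days(start, end):
    """Yield every date from start to end, inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)


def total_load(root, host, start, end, filename="load.txt"):
    """Sum the load for host over a date range and list days with no file.

    Host and date come from directory names (structure); the numbers
    come from inside the files (contents).
    """
    total, missing = 0.0, []
    for day in days(start, end):
        path = os.path.join(root, host, day.isoformat(), filename)
        if not os.path.isfile(path):
            missing.append(day)
            continue
        with open(path) as f:
            total += sum(float(line) for line in f if line.strip())
    return total, missing
```

The same loop answers two of the questions above at once: the first result is the total load, and the second is the list of days on which the expected file is missing.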
Other Possible Examples
• Filesystem Hierarchy Standard (FHS) for Unix-like installations
• Haskell code base, PADS Source Tree
• Cabal system for GHC libraries
• Disk cache for browser history, IMAP mail
• Scientific data sets
• CVS, SVN, other source control systems
To Do!
• We need a language not just for specifying the contents (formats) of ad hoc data files but also for the structure of file system fragments
• specify files
• directory structure
• dependencies (config files determine file system structure)
• meta-data (permissions, sizes, owners, modification times)
• The Plan
• Build such a specification language on top of PADS
• Generate a checker from the specifications
• Interface that allows programs to slurp up specified data from the file system
• Stand-alone tools: query engine, monitor, etc...
ptype host_d(h::phostname, t::pdate) = pdirectory { ... as before ... }
ptype host_d () = pdirectory { hosts is [t::date_d(t) | t <- pdate]; }
ptype coral_d () = pdirectory { hostNames is “Config” :: conf_t; hosts is [h::host_d | h <= hostNames]; }
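One way to read the coral_d description above is that the contents of the Config file determine which host directories must exist. The Python sketch below illustrates that dependency only. It assumes Config holds one host name per line, because the slides do not show conf_t, and the function names `read_host_names` and `check_coral_dir` are hypothetical.

```python
import os


def read_host_names(config_path):
    """Assumed format: one host name per non-blank line."""
    with open(config_path) as f:
        return [line.strip() for line in f if line.strip()]


def check_coral_dir(root):
    """Return the hosts named in Config that have no directory under root."""
    host_names = read_host_names(os.path.join(root, "Config"))
    return [h for h in host_names
            if not os.path.isdir(os.path.join(root, h))]
```

The point of the sketch is that the set of required directories is not fixed in the description; it is computed from data read earlier in the same description.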
Current & Future Plans
• Designing a semantics based on a classical logic of trees
• We considered using one of the substructural (“separating”) tree logics but we discarded it as the substructural logics gave us the wrong defaults & made the system harder to design and understand (especially in the presence of parent pointers)
• Building a “file system parser” & tool generation infrastructure in Haskell
• Leverage type-directed programming.
• Leverage laziness in loading structures.
• Envision a collection of file system management tools based on descriptions
• valid -desc d -- check for conformance to d
• ls -desc d -- list files described by d
• grep pattern -desc d -- grep for pattern in files described by d
• mv -desc d foo bar -- move files described by d rooted at foo to bar
• Thinking about a query engine & continuous monitoring system
• Considering extensions to handle other elements of the programming environment: environment variables