Dec 20, 2015


Pads/ML: A Functional Data Description Language

David Walker

Princeton University

with: Yitzhak Mandelbaum (Princeton),

Kathleen Fisher and Mary Fernandez (AT&T)

Data, data everywhere!

Incredible amounts of data stored in well-behaved formats:

• Databases
• XML

Tools
• Schema
• Browsers
• Query languages
• Standards
• Libraries
• Books, documentation
• Conversion tools
• Vendor support
• Consultants…

Ad hoc Data

• Vast amounts of data in ad hoc formats.
• Ad hoc data is semi-structured:
  – Not free text.
  – Not as rigid as data in relational databases.
• Examples from many different areas:
  – Physics
  – Computer system maintenance and administration
  – Biology
  – Finance
  – Government
  – Healthcare
  – More!

Ad Hoc Data in Biology

format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824,PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

www.geneontology.org

Ad Hoc Data in Chemistry

O=C([C@@H]2OC(C)=O)[C@@]3(C)[C@]([C@](CO4)(OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H](OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C)[C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O)C5=CC=CC=C5)=O)C1

[Figure: chemical structure diagram]

Ad Hoc Data in Finance

HA00000000START OF TEST CYCLE
aA00000001BXYZ U1AB0000040000100B0000004200
HL00000002START OF OPEN INTEREST
d 00000003FZYX G1AB0000030000300000
HM00000004END OF OPEN INTEREST
HE00000005START OF SUMMARY
f 00000006NYZX B1QB00052000120000070000B000050000000520000 00490000005100+00000100B00000005300000052500000535000
HF00000007END OF SUMMARY

www.opradata.com

Ad Hoc Data from Web Server Logs (CLF)

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -

Ad Hoc Data: DNS packets

00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................
000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co

Properties of Ad hoc Data

• Data arrives “as is”: you don’t choose the format.
• Documentation is often out-of-date or nonexistent.
  – Hijacked fields.
  – Undocumented “missing value” representations.
• Data is buggy.
  – Missing data, “extra” data, …
  – Human error, malfunctioning machines, software bugs (e.g. race conditions on log entries), …
  – Errors are sometimes the most interesting portion of the data.
• Data sources often have high volume.
  – Data might not fit into main memory.
• Data can be created by malicious sources attempting to exploit software vulnerabilities.
  – cf. Ethereal network monitoring system

The Goal(s)

• What can we do about ad hoc data?
  – how do we read it into programs?
  – how do we detect errors?
  – how do we correct errors?
  – how do we query it?
  – how do we discover its structure and properties?
  – how do we view it?
  – how do we transform it into standard formats like CSV, XML?
  – how do we merge multiple data sources?
• In short: how do we do all the things we take for granted when dealing with standard formats in a fault-tolerant and efficient, yet nearly effortless way?

Enter Pads

• Pads: a system for Processing Ad hoc Data Sources
• Three main components:
  – a data description language
    • for concise and precise specifications of ad hoc data formats and semantic properties
  – a compiler that automatically generates a suite of programming libraries & end-to-end applications
  – a visual interface to support both novice and expert users

One Description, Many Tools

[Diagram: a compiler takes a Data Description (Type T) and generates a parser, printer, query engine, visual data browser, XML translator, and more, delivered as a programming library or a complete application.]

Some Advantages Over Ad Hoc Methods

• Big bang for buck: 1 description, many tools
• Descriptions document data sources
  – the documentation IS the tool generator, so documentation is automatically kept up-to-date with implementation
• Descriptions are easy to write, easy to understand.
  – descriptions are high-level & declarative
  – description syntax exploits programmer intuition concerning types
• Tools are robust
  – Error handling code generated automatically; doesn’t clutter documentation.
• Descriptions & generated tools can be analyzed and reasoned about
  – e.g. data size, tool termination & safety properties, coherence of generated parsers & printers

The PADS Project

• PADS/C [PLDI 05; POPL 06]
  – Based on C type structure.
  – Generates C libraries.
    • too bad C doesn’t actually support libraries ....
  – LaunchPADS visual interface [Daly et al., SIGMOD 06]
• PADS/ML (Mandelbaum’s thesis)
  – Based on the ML type structure.
    • polymorphic, dependent datatypes
  – Generates ML modules.
    • better reuse & library structure
    • functional data processing = far greater programmer productivity
  – New framework for tool development.
    • Format-independent algorithms architected using functors vs macros
  – Implementation status.
    • Version 1.0 up and running
    • Many more exciting things to do
• Describe real formats:
  – Newick tree-structured data
  – Reglens galaxy catalogues
  – Palm PDA databases
  – AT&T call-detail data
  – AT&T billing data
  – Web server logs
  – Gene ontologies
  – DNS packets
  – OPRA data
  – More …

Outline

• Motivation and PADS Overview
• Data Description in PADS/ML
• Implementation architecture
• The Semantics of PADS
• Conclusions

Base Types and Records

• Base types: C (e).
  – Describe atomic portions of data.
  – Parameterized by host-language expression.
  – Examples:
    • Pint, Pchar, Pstring_FW(n), Pstring(c).
• Tuples and Records: t * t' and {x:t; y:t'}.
  – Record fields are dependent: field names can be referenced by types of later fields.
  – Example to follow.

Base Types and Records

Movie-director Bowling Score (MBS) Format

122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40

Pint * Pstring('|') * Pchar matches the prefix 125, Tim, | of the fourth entry.
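To make the tuple type concrete, here is a minimal hand-written OCaml sketch of what a parser for Pint * Pstring('|') * Pchar might do. It is not the code PADS/ML generates: the `parser` type and the names `pint`, `pstring`, `pchar`, `triple`, and `entry_prefix` are assumptions chosen for illustration, and error reporting is reduced to `None`.

```ocaml
(* Illustrative only: every name below is hypothetical, not PADS/ML runtime API. *)

(* A parser takes the input and an offset; it returns the parsed value and
   the next offset, or None on failure. *)
type 'a parser = string -> int -> ('a * int) option

let is_digit c = c >= '0' && c <= '9'

(* Pint: one or more decimal digits. *)
let pint : int parser = fun s pos ->
  let n = String.length s in
  let rec stop i = if i < n && is_digit s.[i] then stop (i + 1) else i in
  let j = stop pos in
  if j = pos then None
  else Some (int_of_string (String.sub s pos (j - pos)), j)

(* Pstring(c): all characters up to, but not including, the terminator c. *)
let pstring (term : char) : string parser = fun s pos ->
  match String.index_from_opt s pos term with
  | Some j -> Some (String.sub s pos (j - pos), j)
  | None -> None

(* Pchar: exactly one character. *)
let pchar : char parser = fun s pos ->
  if pos < String.length s then Some (s.[pos], pos + 1) else None

(* t * t' * t'': run three parsers in sequence, threading the offset. *)
let triple p1 p2 p3 : ('a * 'b * 'c) parser = fun s pos ->
  match p1 s pos with
  | None -> None
  | Some (a, pos) ->
    match p2 s pos with
    | None -> None
    | Some (b, pos) ->
      match p3 s pos with
      | None -> None
      | Some (c, pos) -> Some ((a, b, c), pos)

let entry_prefix = triple pint (pstring '|') pchar
```

Applied to the fourth MBS entry at offset 0, `entry_prefix` yields the id, the first name, and the separator character, and leaves the offset just after the first `|`.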

Base Types and Records

Bookshelf Listing (BL) Format

13C Programming
31Types and Programming Languages
20Twenty Years of PLDI
36Modern Compiler Implementation in ML
27Elements o f ML Programming

{ width: Pint; title: Pstring_FW(width) }

Example entry: 13C Programming
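The dependency between the two fields can be sketched in plain OCaml as follows. This is an illustration under assumptions, not generated PADS/ML code: `book`, `parse_book`, and `parse_shelf` are hypothetical names, and the sketch assumes titles do not begin with a digit, since the width is read greedily.

```ocaml
(* Illustrative only: names are hypothetical, not PADS/ML output. *)
type book = { width : int; title : string }

let is_digit c = c >= '0' && c <= '9'

(* Parse one BL entry starting at [pos]; return the entry and the next offset. *)
let parse_book (s : string) (pos : int) : (book * int) option =
  let n = String.length s in
  let rec stop i = if i < n && is_digit s.[i] then stop (i + 1) else i in
  let j = stop pos in
  if j = pos then None
  else
    let width = int_of_string (String.sub s pos (j - pos)) in
    (* The dependency: the width just parsed decides how many characters
       the title field consumes. *)
    if j + width > n then None
    else Some ({ width; title = String.sub s j width }, j + width)

(* Parse entries back to back until the input is exhausted. *)
let parse_shelf (s : string) : book list option =
  let rec go pos acc =
    if pos = String.length s then Some (List.rev acc)
    else
      match parse_book s pos with
      | None -> None
      | Some (b, pos') -> go pos' (b :: acc)
  in
  go 0 []
```

Because each title's length comes from the data itself, no separator between entries is needed.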

Constraints

• Constrained types: [x:t | e].
  – Enforce the constraint e on the underlying type t.

125Tim|Burton|30|82|71

[c:Pchar | c = '|'] (written simply as '|')

ptype Scores = { min: Pint; '|';
                 max: [m:Pint | min ≤ m]; '|';
                 avg: [a:Pint | min ≤ a & a ≤ max] }
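A rough OCaml sketch of what checking the Scores constraints could look like is given below. It is an assumption-laden illustration: `scores_check`, `check_scores`, and `parse_scores` are hypothetical names, and the choice to keep the parsed data alongside a list of violated constraints is one possible design, not a statement of how PADS/ML reports errors.

```ocaml
(* Illustrative only: names and error representation are hypothetical. *)
type scores = { min : int; max : int; avg : int }

(* Which constraints, if any, were violated. *)
type scores_check = Valid | Violated of string list

let check_scores (s : scores) : scores_check =
  let errs =
    (if s.min <= s.max then [] else [ "max: min <= m" ])
    @ (if s.min <= s.avg && s.avg <= s.max then []
       else [ "avg: min <= a & a <= max" ])
  in
  if errs = [] then Valid else Violated errs

(* Parse "min|max|avg", then check the constraints on the result.
   The data is returned even when a constraint fails. *)
let parse_scores (text : string) : (scores * scores_check) option =
  match List.map int_of_string_opt (String.split_on_char '|' text) with
  | [ Some min; Some max; Some avg ] ->
    let s = { min; max; avg } in
    Some (s, check_scores s)
  | _ -> None
```

Returning the record together with the check result mirrors the point made earlier in the talk that erroneous data is often still worth inspecting.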

Datatypes

122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40

• Describe alternatives in data source with datatypes.
  – Parser tries each alternative in order.

pdatatype Id = None of "n/a" | Some of Pint
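The "tries each alternative in order" behaviour can be sketched as an ordered-choice combinator in OCaml. Everything here is hypothetical illustration: the `parser` type, `literal`, `<|>`, `map`, and `parse_id` are not PADS/ML API, and the constructors are renamed `Absent` and `Present` so they do not shadow OCaml's built-in `None` and `Some`.

```ocaml
(* Illustrative only: names are hypothetical, not PADS/ML output. *)
type id = Absent | Present of int   (* slide: None of "n/a" | Some of Pint *)

type 'a parser = string -> int -> ('a * int) option

(* Match a literal string at the current offset. *)
let literal (lit : string) : unit parser = fun s pos ->
  let k = String.length lit in
  if pos + k <= String.length s && String.sub s pos k = lit
  then Some ((), pos + k)
  else None

(* One or more decimal digits. *)
let pint : int parser = fun s pos ->
  let n = String.length s in
  let rec stop i =
    if i < n && s.[i] >= '0' && s.[i] <= '9' then stop (i + 1) else i in
  let j = stop pos in
  if j = pos then None
  else Some (int_of_string (String.sub s pos (j - pos)), j)

(* Ordered choice: try p first, and only if it fails try q. *)
let ( <|> ) (p : 'a parser) (q : 'a parser) : 'a parser = fun s pos ->
  match p s pos with
  | Some _ as r -> r
  | None -> q s pos

(* Wrap a parser's result in a constructor. *)
let map (f : 'a -> 'b) (p : 'a parser) : 'b parser = fun s pos ->
  match p s pos with
  | Some (v, pos') -> Some (f v, pos')
  | None -> None

let parse_id : id parser =
  map (fun () -> Absent) (literal "n/a") <|> map (fun n -> Present n) pint
```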

Recursive Datatypes

• Describe inductively-defined formats.

pdatatype IntList = Cons of Pint * '|' * IntList
                  | Last of Pint

79|31|71|40

Polymorphic Types

• Parameterize types by other types.

pdatatype IntList = Cons of Pint * '|' * IntList
                  | Last of Pint

pdatatype (Elt) List = Cons of Elt * '|' * (Elt) List
                     | Last of Elt

ptype IntList = Pint List

ptype CharList = Pchar List

Dependent Types

• Parameterize types by values.

pdatatype IntList = Cons of Pint * '|' * IntList
                  | Nil of Pint

pdatatype (Elt) List (x:char) = Cons of Elt * x * (Elt) List(x)
                              | Nil of Elt

ptype IntListBar = Pint List('|')

ptype CharListComma = Pchar List(',')
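As a sketch of how a description parameterized by both a type and a value might behave, the OCaml function below takes the element parser and the separator as ordinary arguments. It is an assumption-based illustration: `plist`, `pint`, `pchar`, `int_list_bar`, and `char_list_comma` are hypothetical names, and the result is flattened into an OCaml list rather than a Cons/Nil datatype.

```ocaml
(* Illustrative only: names are hypothetical, not PADS/ML output. *)
type 'a parser = string -> int -> ('a * int) option

(* (Elt) List(x): parse one Elt, then try the Cons branch (separator x
   followed by the rest of the list); if that fails, fall back to Nil. *)
let rec plist (elt : 'a parser) (sep : char) : 'a list parser = fun s pos ->
  match elt s pos with
  | None -> None
  | Some (v, pos') ->
    let cons =
      if pos' < String.length s && s.[pos'] = sep
      then plist elt sep s (pos' + 1)
      else None
    in
    (match cons with
     | Some (rest, stop) -> Some (v :: rest, stop)   (* Cons of Elt * x * List(x) *)
     | None -> Some ([ v ], pos'))                   (* Nil of Elt *)

(* One or more decimal digits. *)
let pint : int parser = fun s pos ->
  let n = String.length s in
  let rec stop i =
    if i < n && s.[i] >= '0' && s.[i] <= '9' then stop (i + 1) else i in
  let j = stop pos in
  if j = pos then None
  else Some (int_of_string (String.sub s pos (j - pos)), j)

(* Exactly one character. *)
let pchar : char parser = fun s pos ->
  if pos < String.length s then Some (s.[pos], pos + 1) else None

let int_list_bar = plist pint '|'       (* Pint List('|') *)
let char_list_comma = plist pchar ','   (* Pchar List(',') *)
```

The same `plist` definition serves both instantiations, which is the reuse that polymorphic and dependent descriptions are meant to provide.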

More Dependent Types

• “Switched” datatypes:

pdatatype GuidedOption (tag: int) =
  pmatch tag of
    0 => Zero of Pstring
  | 1 => One of Pint
  | 2 => Two of Pint * Pint
  | _ => None

ptype source = {tag: Pint; payload: GuidedOption (tag)}
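The switched datatype can be mimicked in OCaml by letting the parsed tag select the payload parser. The sketch below rests on assumptions that are not in the slide: the concrete syntax is taken to be space-separated tokens (for example `2 10 20`), the names `guided_option`, `parse_payload`, and `parse_source` are hypothetical, and the default branch is called `Nothing` to avoid shadowing OCaml's `None`.

```ocaml
(* Illustrative only: names and the space-separated syntax are assumptions. *)
type guided_option =
  | Zero of string
  | One of int
  | Two of int * int
  | Nothing                      (* slide: the default branch None *)

(* The tag, parsed earlier, decides which branch interprets the payload. *)
let parse_payload (tag : int) (fields : string list) : guided_option option =
  match tag, fields with
  | 0, [ s ] -> Some (Zero s)
  | 1, [ a ] -> Option.map (fun n -> One n) (int_of_string_opt a)
  | 2, [ a; b ] ->
    (match int_of_string_opt a, int_of_string_opt b with
     | Some x, Some y -> Some (Two (x, y))
     | _ -> None)
  | (0 | 1 | 2), _ -> None       (* known tag, malformed payload *)
  | _, _ -> Some Nothing         (* any other tag selects the default branch *)

(* {tag: Pint; payload: GuidedOption(tag)} *)
let parse_source (line : string) : (int * guided_option) option =
  match String.split_on_char ' ' line with
  | [] -> None
  | t :: rest ->
    match int_of_string_opt t with
    | None -> None
    | Some tag -> Option.map (fun p -> (tag, p)) (parse_payload tag rest)
```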

Sample Regulus Data:

2:3004092508||5001|dns1=abc.com;dns2=xyz.com|c=slow link;w=lost packets|INTERNATIONAL
3:|3004097201|5074|dns1=bob.com;dns2=alice.com|src_addr=192.168.0.10;dst_addr=192.168.23.10;start_time=1234567890;end_time=1234568000;cycle_time=17412|SPECIAL

PADS/ML Regulus Format:

ptype Timestamp = Ptimestamp_explicit_FW(8, "%H:%M:%S", gmt)

ptype Pip = Puint8 * '.' * Puint8 * '.' * Puint8 * '.' * Puint8

ptype (Alpha) Pnvp(p : string -> bool) =
  { name : [name : Pstring('=') | p name]; '=';
    value : Alpha }

ptype (Alpha) Nvp(name:string) = Alpha Pnvp(fun s -> s = name)

ptype SVString = Pstring_SE("/;|\\|/")

ptype Nvp_a = SVString Pnvp(fun _ -> true)

ptype Details = {
  source : Pip Nvp("src_addr"); ';';
  dest : Pip Nvp("dest_addr"); ';';
  start_time : Timestamp Nvp("start_time"); ';';
  end_time : Timestamp Nvp("end_time"); ';';
  cycle_time : Puint32 Nvp("cycle_time")
}

ptype Semicolon = Pcharlit(';')
ptype Vbar = Pcharlit('|')

pdatatype Info(alarm_code : int) =
  Pmatch alarm_code with
    5074 -> Details of Details
  | _ -> Generic of (Nvp_a,Semicolon,Vbar) Plist

pdatatype Service =
    Dom of "DOMESTIC"
  | Int of "INTERNATIONAL"
  | Spec of "SPECIAL"

ptype Raw_alarm = {
  alarm : [ i : Puint32 | i = 2 or i = 3]; ':';
  start : Timestamp Popt; '|';
  clear : Timestamp Popt; '|';
  code : Puint32; '|';
  src_dns : SVString Nvp("dns1"); ';';
  dest_dns : SVString Nvp("dns2"); '|';
  info : Info(code); '|';
  service : Service
}

let checkCorr ra = ...

ptype Alarm = [x:Raw_alarm | checkCorr x]
ptype Source = (Alarm,Peor,Peof) Plist

Outline

• Motivation and PADS Overview
• Data Description in PADS/ML
• Implementation architecture
• The Semantics of PADS
• Conclusions

Parsing With PADS

[Diagram: the compiler turns a data description (type T) into a parser. The parser reads raw data (0100100100111) and produces a data rep (type ~ T) and a parse descriptor (type ~ T), both consumed by user code.]

Example: MBS Representation

n/aEd|Wood|10|47|31

PADS/ML description:

ptype MBS-Entry = { id: Id; first: Pstring('|'); '|'; last: Pstring('|'); '|'; scores: Scores }

pdatatype Id = None of "n/a" | Some of Pint

Generated representation:

type MBS-Entry = { id: Id; first: string; last: string; scores: Scores }

datatype Id = None | Some of int
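The pairing of a representation with a parse descriptor can be sketched in OCaml as below. This is a hand-written approximation under stated assumptions: the `pd` type (an error count plus the names of bad fields) and the function `parse_entry` are hypothetical, the constructors are renamed `NoId` and `Id` to stay clear of OCaml's `option`, and default values stand in for fields that fail to parse.

```ocaml
(* Illustrative only: the pd layout and all names are assumptions. *)
type id = NoId | Id of int       (* slide: None | Some of int *)
type scores = { min : int; max : int; avg : int }
type mbs_entry = { id : id; first : string; last : string; scores : scores }

(* Hypothetical parse descriptor: error count and the fields that failed. *)
type pd = { nerr : int; errors : string list }

(* Always returns a representation; problems are recorded in the pd. *)
let parse_entry (line : string) : mbs_entry * pd =
  let n = String.length line in
  let errors = ref [] in
  let err msg = errors := msg :: !errors in
  let is_digit c = c >= '0' && c <= '9' in
  (* id: the literal "n/a" or a run of digits *)
  let id, pos =
    if n >= 3 && String.sub line 0 3 = "n/a" then (NoId, 3)
    else begin
      let rec stop i = if i < n && is_digit line.[i] then stop (i + 1) else i in
      let j = stop 0 in
      if j = 0 then (err "id"; (NoId, 0))
      else (Id (int_of_string (String.sub line 0 j)), j)
    end
  in
  let int_field name text =
    match int_of_string_opt text with
    | Some v -> v
    | None -> err name; 0
  in
  let rep =
    match String.split_on_char '|' (String.sub line pos (n - pos)) with
    | [ first; last; mn; mx; av ] ->
      let min = int_field "min" mn in
      let max = int_field "max" mx in
      let avg = int_field "avg" av in
      { id; first; last; scores = { min; max; avg } }
    | _ ->
      err "record shape";
      { id; first = ""; last = ""; scores = { min = 0; max = 0; avg = 0 } }
  in
  (rep, { nerr = List.length !errors; errors = List.rev !errors })
```

User code can inspect `nerr` first and decide whether to trust, repair, or discard the accompanying representation.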

Tool Generation With PADS/ML

[Diagram: the compiler turns a data description (type T) into a parser, which maps raw data (0100100100111) to a data rep (type ~ T) and a parse descriptor (type ~ T), and into a format-specific traversal functor, which is applied to a format-independent tool module.]

Tools in this pattern: accumulator, debugger, histograms, clusters, format converters

Types as Modules

• PADS/ML generates a module for each type/description
• Parameterized types ==> Functors
• Recursive types ==> Recursive modules
  – sigh: combination of recursive modules & functors not supported in O’Caml, so we’re reduced to a bit of a hack for recursion

sig
  type rep
  type pd
  val parser : Pads.handle -> rep * pd
  module Traverse (tool : TOOL) : sig ... end
end
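The functor-based tool pattern can be illustrated with a small, self-contained OCaml sketch. The `TOOL` signature shown here is invented for the example (the slide does not give its contents), as are the `Entry`, `Sum`, and `Count` modules; the point is only that one format-specific traversal can be combined with several format-independent tools.

```ocaml
(* Illustrative only: this TOOL signature and all modules are hypothetical. *)

(* A format-independent tool: how to fold over base values of any format. *)
module type TOOL = sig
  type state
  val init : state
  val on_int : state -> int -> state
  val on_string : state -> string -> state
end

(* Stand-in for a generated module for one description. *)
module Entry = struct
  type rep = { name : string; scores : int list }

  (* Format-specific traversal, parameterized by any tool. *)
  module Traverse (T : TOOL) = struct
    let run (r : rep) : T.state =
      let st = T.on_string T.init r.name in
      List.fold_left T.on_int st r.scores
  end
end

(* Tool 1: sum every integer, ignore strings. *)
module Sum : TOOL with type state = int = struct
  type state = int
  let init = 0
  let on_int acc n = acc + n
  let on_string acc _ = acc
end

(* Tool 2: count all base values. *)
module Count : TOOL with type state = int = struct
  type state = int
  let init = 0
  let on_int acc _ = acc + 1
  let on_string acc _ = acc + 1
end

module EntrySum = Entry.Traverse (Sum)
module EntryCount = Entry.Traverse (Count)
```

Adding a new tool means writing one more module against `TOOL`; no per-format code has to change.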

Outline

• Motivation and PADS Overview
• Data Description in PADS/ML
• Implementation architecture
• The Semantics of PADS
• Conclusions

Motivation

• To crystallize design principles.
  – Example: error counting methodology in PADS/C.
• To ensure system correctness.
  – Example: parsers return data of expected type.
• As basis for evolution and experimentation.
  – Critical to design of PADS/ML.
• To communicate core ideas.
  – Designing the next 700 data description languages.

PADS and DDC

• Developed a semantic framework based on the Data Description Calculus (DDC).
• Explains PADS/ML and other languages with DDC.
• Gives a denotational semantics to DDC.

[Diagram: PADS/C, PADS/ML, and “The Next 700” all point to DDC.]

Data Description Calculus

• DDC: calculus of dependent types for describing data.
• Expressions e with types drawn from F-omega.
• A kinding judgment specifies well-formed descriptions.

t ::= unit | bottom | C(e)
    | Σx:t.t | t + t | t & t | {x:t | e}
    | t seq(t,e,t) | λx.t | t e
    | λα.t | t t | α | μα.t
    | compute(e:σ) | absorb(t) | scan(t)

Choosing a Semantics

• The semantics of REs and CFGs is given as sets of strings, but this fails to account for:
  – Relationship between internal and external data.
  – Error handling.
  – Types of representation and parse descriptor.
• DDC
  – Denotational semantics of types as parsers in F-omega

A 3-Fold Semantics

[Diagram: a Description t has three interpretations. The Parser [ t ] reads 0100100100... and produces a Representation of type [ t ]rep and a Parse Descriptor of type [ t ]pd.]

Interpretations of t:

[ {x:t | e} ]rep = [ t ]rep + [ t ]rep
[ Σx:t.t' ]rep = [ t ]rep * [ t' ]rep

[ {x:t | e} ]pd = hdr * [ t ]pd
[ Σx:t.t' ]pd = hdr * [ t ]pd * [ t' ]pd
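The representation and parse-descriptor equations can be transcribed into OCaml types, together with a parser for the constrained type. This is a loose sketch under assumptions, not the calculus's actual definition: `hdr` is reduced to an error count, the base parse descriptor is taken to be just a header, and `Holds`, `Fails`, `pint`, and `constrain` are hypothetical names.

```ocaml
(* Illustrative only: a simplified reading of the equations above. *)
type hdr = { nerr : int }

(* [ {x:t | e} ]rep = [ t ]rep + [ t ]rep: a sum recording whether e held. *)
type 'rep constrained = Holds of 'rep | Fails of 'rep

(* [ {x:t | e} ]pd = hdr * [ t ]pd *)
type 'pd constrained_pd = hdr * 'pd

(* Dependent pair: rep is a product, pd is a header plus both sub-pds. *)
type ('r1, 'r2) pair_rep = 'r1 * 'r2
type ('p1, 'p2) pair_pd = hdr * 'p1 * 'p2

(* A parser always returns a representation, a pd, and the next offset. *)
type ('rep, 'pd) parser = string -> int -> 'rep * 'pd * int

(* Base type: decimal digits; on failure, a default value and one error. *)
let pint : (int, hdr) parser = fun s pos ->
  let n = String.length s in
  let rec stop i =
    if i < n && s.[i] >= '0' && s.[i] <= '9' then stop (i + 1) else i in
  let j = stop pos in
  if j = pos then (0, { nerr = 1 }, pos)
  else (int_of_string (String.sub s pos (j - pos)), { nerr = 0 }, j)

(* Parser for {x:t | e}: parse t, test e, and record a failure in both
   the representation (Fails) and the header (one more error). *)
let constrain (p : ('rep, hdr) parser) (e : 'rep -> bool)
  : ('rep constrained, hdr constrained_pd) parser = fun s pos ->
  let (r, pd, pos') = p s pos in
  if e r then (Holds r, ({ nerr = pd.nerr }, pd), pos')
  else (Fails r, ({ nerr = pd.nerr + 1 }, pd), pos')
```

The type of `constrain p e` makes the shape of the result explicit: whatever the input, the parser returns a value of the representation type paired with a value of the parse-descriptor type.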

Type Correctness

[Diagram: a Description t has three interpretations. The Parser [ t ] reads 0100100100... and produces a Representation of type [ t ]rep and a Parse Descriptor of type [ t ]pd.]

Theorem: [ t ] : bits → [ t ]rep * [ t ]pd

Outline

• Motivation and PADS Overview
• Data Description in PADS/ML
• Implementation architecture
• The Semantics of PADS
• Conclusions

Related Work

• Parser generator technology:
  – Lex & Yacc
    • no dependency
    • semantic actions entwined with data description
    • no higher-level tools
  – Parser combinators
    • semantic actions entwined with data description
    • no higher-level tools

Reminder: One Description, Many Tools

[Diagram: a compiler takes a Data Description (Type T) and generates a parser, printer, query engine, visual data browser, XML translator, and more, delivered as a programming library or a complete application.]

Parser combinators: One algorithm, One Tool

[Diagram: the only output is a parser.]

Related Work

• Other “data description” languages
  – Data Format Description Language (DFDL)
  – Binary Format Description Language (BFD)
  – PacketTypes [SIGCOMM ’00]
  – DataScript [GPCE ’02]
• None have a well-defined semantics or Pads tool support

Current & Future Work

• Tools and Applications
  – Description inference.
  – Support for specific domains (microbiology)
• Language Design
  – Transformation language for ad hoc data.
  – Description language for distributed
    • Describe locations, versions, timing, relationships, etc.
• Theory
  – Analyze data descriptions for interesting properties, e.g. equivalence, data size, termination, emptiness (always fails).
  – Coherence of parsing & printing

Summary

• The PADS vision: reliable, efficient and effortless ad hoc data processing
• PADS/ML:
  – Data description based on polymorphic, dependent datatypes
  – “Types as modules” implementation
  – Solid theoretical basis.
• Visit www.padsproj.org

The End

Questions?

Existing Approaches

• C, Perl, or shell scripts: most popular.
  – Time consuming & error prone to hand code parsers.
  – Difficult to maintain (worse than the ad hoc data itself in some cases!).
  – Often incomplete, particularly with respect to errors.
    • Error code, if written, swamps main-line computation.
    • If not written, errors can corrupt “good” data.
• Lex & Yacc
  – Good match for programming languages.
  – Bad match for ad hoc data.
• Compiler converts descriptions into robust, format-specific tools.

Parsing With PADS

• Robust parser at the core of generated tools.

Using Ad hoc Data

• Parsing only brings you part way.
  – Queries must be written in ML.
  – A lot of work.
• What about a declarative query?

122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40

• Can Ed Wood bowl?

From Ad hoc Data To XML

• XML
  – Encoding for semi-structured data.
  – Good match!
• XQuery
  – Declarative XML query language for semi-structured sources.
  – Standardized by W3C, many implementations.

PADX = PADS + XQuery

• Galax [Fernandez, et al.]
  – Complete, open-source XQuery implementation.
• PADX
  – Integrates PADS and Galax.
  – Supports declarative queries over ad hoc data sources.

Using PADX

• User describes format in PADS.
• PADX provides
  – XML “view” of data in XML Schema.
  – Customized XQuery engine.
    • Query PADS-specific and other XML sources.
• User provides
  – Ad hoc data
  – Queries expressed in XQuery.

Describing MBS Format

• Example Movie-director Bowling Score data

n/aEd|Wood|10|47|31

• PADS/ML Description

ptype MBS-Entry = { id: Id; first: Pstring('|'); '|'; last: Pstring('|'); '|'; scores: Scores }

Viewing and Querying MBS

ptype MBS-Entry = { id: Id; first: Pstring('|'); '|'; last: Pstring('|'); '|'; scores: Scores }

n/aEd|Wood|10|47|31

• Virtual XML view

<MBS-Entry>
  <id><None>n/a</None></id>
  <first>Ed</first>
  <last>Wood</last>
  <scores>
    <min>10</min>
    <max>47</max>
    <avg>31</avg>
  </scores>
</MBS-Entry>

• Query: What is Ed Wood’s maximum score?

$pads/Psource/MBS-Entry[first = "Ed"][last = "Wood"]/scores/max

• Query: Which directors have scored less than 50?

$pads/Psource/MBS-Entry[scores/min < 50]

Challenges & Solutions

• Semantics
  – Map PADS language to XML Schema.
• Re-engineer Galax Data Model
  – Create abstract data model.
  – Generate description-specific concrete data models.
• Efficiently query large-scale data sources.
  – Provide lazy access to data.
  – Implement custom memory-management.