FlashProfile: A Framework for Synthesizing Data Profiles

Saswat Padhi 1 †, Prateek Jain 2, Daniel Perelman 3,
Oleksandr Polozov 4, Sumit Gulwani 3, Todd Millstein 1

1 University of California, Los Angeles, CA
2 Microsoft Research Lab, India
3 Microsoft Corporation, Redmond, WA
4 Microsoft Research, Redmond, WA

(OOPSLA)

† Contributed during an internship with the PROSE team at Microsoft


The Challenges of “Big” Data

High Volume
> 2.5 M TB of data generated every day!

High Velocity
∼ 4 M Google searches, ∼ 1/2 M tweets, > 1 K Amazon shipments ... per minute!

High Variety
90 % of generated data is unstructured!
Data may be incomplete, inconsistent, may contain multiple formats ...

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 2 / 15


State of The Art

Reference ID
PMC5079771
doi: 10.1016/S1387-7003(03)00113-8
ISBN: 2-287-34069-6
...
ISBN: 0-006-08903-1
PMC9473786
...
doi: 10.13039/100005795
ISBN: 1-158-23466-X
not_available
PMC9035311

Microsoft SSDT: (does not describe all data formats)
• doi: +10\.\d\d\d\d\d/\d+ (110)
• .* (113)
• ISBN: 0-\d\d\d-\d\d\d\d\d-\d (204)
• PMC\d+ (1024)

Ataccama One: (coarse grained, no constants & fixed-width patterns)
• W_W (5)
• W: N.N/LN-N(N)N-D (11)
• W: D-N-N-L (34)
• W: N.N/N (110)
• W: D-N-N-D (267)
• WN (1024)

FlashProfile:
• ‘not_available’ (5)
• ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)
• ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-X’ (34)
• ‘doi:’ + ‘10.13039/’ D+ (110)
• ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-’ D (267)
• ‘PMC’ D7 (1024)


Can We Do Even Better?

Reference ID
PMC5079771
doi: 10.1016/S1387-7003(03)00113-8
ISBN: 2-287-34069-6
...
ISBN: 0-006-08903-1
PMC9473786
...
doi: 10.13039/100005795
ISBN: 1-158-23466-X
not_available
PMC9035311

Allowing domain-experts to profile with custom patterns:
• ‘not_available’ (5)
• ‘doi:’ + DOI (121)
• ‘ISBN:’ ISBN10 (301)
• ‘PMC’ D7 (1024)

Interactive refinement to gradually drill into data:
• ‘not_available’ (5)
• ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)
• ‘doi:’ + ‘10.13039/’ D+ (110)
• ‘ISBN:’ ISBN10 (301)
• ‘PMC’ D7 (1024)

Default profile from FlashProfile:
• ‘not_available’ (5)
• ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)
• ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-X’ (34)
• ‘doi:’ + ‘10.13039/’ D+ (110)
• ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-’ D (267)
• ‘PMC’ D7 (1024)


Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pattern, or { ‘1817’, ‘1907’ }?

• prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific patterns

• for example, a user may define a pattern 1800s (= the regex 18.*)

• Exponentially many ways of partitioning a given set of strings
  ◦ Clustering, with similarity ≈ Pattern score

• Exponentially many ways of generalizing strings to a pattern
  ◦ Efficient synthesis of complex patterns

Inductive program synthesis to the rescue!
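
The ambiguity described on this slide can be illustrated with a small Python sketch. It is not FlashProfile's algorithm: the helper `group_by_first_match`, the pattern names, and the first-match priority rule are assumptions made for illustration only. The one thing taken from the slide is the user-defined pattern 1800s, written as the regex 18.*.

```python
import re
from collections import defaultdict

def group_by_first_match(strings, named_patterns):
    """Assign each string to the first pattern (in priority order) that fully matches it."""
    groups = defaultdict(list)
    for s in strings:
        for name, regex in named_patterns:
            if re.fullmatch(regex, s):
                groups[name].append(s)
                break
        else:
            groups["<unmatched>"].append(s)
    return dict(groups)

years = ["1817", "1813?", "1907"]

# Shape-only patterns: the kind of fixed bias that groups '1817' with '1907'.
shape_only = [("D4", r"\d{4}"), ("D4 '?'", r"\d{4}\?")]

# Same patterns, but a user-defined '1800s' pattern (the regex 18.*) is tried first.
with_custom = [("1800s", r"18.*")] + shape_only

by_shape = group_by_first_match(years, shape_only)
by_custom = group_by_first_match(years, with_custom)
print(by_shape)   # '1817' and '1907' share a group
print(by_custom)  # '1817' and '1813?' share a group
```

With the custom pattern in place, the same three strings are partitioned differently, which is the kind of disambiguation the slide attributes to user-defined patterns.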


Main Contributions

An application of a supervised learning technique (inductive program synthesis) to the unsupervised learning problem of syntactic profiling.

We present:

• a definition for syntactic profiling as a pattern-aware clustering problem
• a technique using inductive program synthesis
• practical optimizations for fast, approximate profiling
• FlashProfile, and evaluation of its performance and accuracy
• profile-guided interaction for traditional PBE workflows


Overview of FlashProfile

[Figure: FlashProfile architecture diagram with the labels Dataset, Custom Patterns, Number of Patterns, Approximation Parameters, Pattern Synthesizer, Clustering Procedure, and Profile.]

FlashProfile provides:

• Support for user-defined patterns
• Support for arbitrary constants and fixed-width patterns
• Interactive refinement of profiles
• Control over accuracy vs. performance trade-off

FlashProfile is publicly available as a cross-platform C# library (Matching.Text), as part of the Microsoft PROSE SDK.


Profiling via Clustering

• Pattern-Aware Partitioning
  ◦ Clustering: Agglomerative hierarchical clustering
  ◦ Objective: Minimize the cost of describing partitions
  ◦ Similarity: Minimum cost of describing 2 strings

• Dependencies
  ◦ A pattern learner L
  ◦ A cost function C

• Optimizations
  ◦ Approximate similarity using previous patterns
  ◦ Profiling small chunks → Full profile

[Figure: hierarchy of patterns over a column of years. Patterns without example strings: .+, ‘1’ .+, ‘1’ D3. Patterns with example strings: ‘18’ D2 (1813 ··· 1898), ‘190’ D (1900 ··· 1903), ‘18’ D2 ‘?’ (1850? ··· 1875?), ‘?’ (?), Empty (ϵ). Labels on the hierarchy: Suggested, Refined.]

(see our paper for details)
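
The clustering loop outlined above can be sketched in Python under strong simplifying assumptions. The function `learn_pattern` is a toy stand-in for the pattern learner L and the cost function C (it only handles equal-length strings, character by character), the numeric costs are hypothetical, and clusters are merged by the cost of describing their union rather than by the pairwise similarity described in the paper.

```python
from itertools import combinations

# Hypothetical per-character costs: more specific atoms are cheaper.
CONST_COST, DIGIT_COST, ANY_COST = 1.0, 2.0, 4.0
MIXED_LENGTH_COST = 10.0  # cost of the catch-all pattern '.+'

def learn_pattern(strings):
    """Toy stand-in for the learner L and cost function C: returns (pattern, cost)."""
    lengths = {len(s) for s in strings}
    if len(lengths) != 1:
        return ".+", MIXED_LENGTH_COST
    atoms, total = [], 0.0
    for column in zip(*strings):
        chars = set(column)
        if len(chars) == 1:
            atoms.append(f"'{column[0]}'")  # constant character
            total += CONST_COST
        elif all(c.isdigit() for c in chars):
            atoms.append("D")               # any digit
            total += DIGIT_COST
        else:
            atoms.append(".")               # any character
            total += ANY_COST
    width = max(lengths.pop(), 1)  # avoid dividing by zero for empty strings
    return " ".join(atoms), total / width

def profile(strings, k):
    """Merge the cheapest-to-describe pair of clusters until only k remain."""
    clusters = [[s] for s in sorted(set(strings))]
    while len(clusters) > max(k, 1):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: learn_pattern(clusters[pair[0]] + clusters[pair[1]])[1],
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [(learn_pattern(c)[0], sorted(c)) for c in clusters]

years = ["1813", "1898", "1900", "1903", "1850?", "1875?"]
result = profile(years, k=3)
for pattern, members in result:
    print(pattern, members)
```

On this input the sketch ends with three clusters whose patterns correspond to ‘18’ D2, ‘190’ D, and ‘18’ D2 ‘?’ from the figure, written one character per atom.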


Pattern Synthesis

• A Language $L_{FP}$:

  Pattern $P[s] := \mathrm{Empty}(s) \mid P[\mathrm{SuffixAfter}(s, \alpha)]$
  Atom $\alpha := \mathrm{Class}^{n}_{c} \mid \mathrm{RegEx}_{r} \mid \mathrm{Funct}_{f} \mid \mathrm{Const}_{s}$

• A Pattern Learner $L_{FP}$
  ◦ recursively reduces a synthesis problem
  ◦ sound and complete over a given set of atoms

• A Cost Function $C_{FP}$
  ◦ tradeoff between specificity and simplicity
  ◦ weighted sum of costs of individual atoms

(see our paper for details)
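
The recursive reduction named above can be sketched as follows, as one reading of the grammar rather than the PROSE implementation. Each atom consumes a non-empty prefix of every string, and the learner recurses on the remaining suffixes until all strings are empty. The atom set `ATOMS`, the regex encoding of atoms, and the static costs are hypothetical, and the cost is simplified to a plain sum of per-atom costs.

```python
import re
from functools import lru_cache

# Each atom: (name, regex for the prefix it matches, static cost). Hypothetical values.
ATOMS = [
    ("'PMC'", r"PMC", 1.0),
    ("Upper+", r"[A-Z]+", 3.0),
    ("Digit+", r"[0-9]+", 2.0),
    ("AlphaNum+", r"[A-Za-z0-9]+", 8.0),
]

def suffix_after(s, regex):
    """Return s without the prefix matched by the atom, or None if the atom fails."""
    m = re.match(regex, s)
    return s[m.end():] if m and m.end() > 0 else None

def learn(strings):
    """Return (cost, atoms) of the cheapest pattern matching every string, or None."""
    @lru_cache(maxsize=None)
    def best(state):
        if all(s == "" for s in state):
            return (0.0, ())  # corresponds to Empty(s)
        candidates = []
        for name, regex, cost in ATOMS:
            suffixes = tuple(suffix_after(s, regex) for s in state)
            if None in suffixes:
                continue  # the atom must match a prefix of every string
            rest = best(suffixes)  # corresponds to P[SuffixAfter(s, atom)]
            if rest is not None:
                candidates.append((cost + rest[0], (name,) + rest[1]))
        return min(candidates) if candidates else None
    return best(tuple(strings))

best_pattern = learn(["PMC5079771", "PMC9473786", "PMC9035311"])
print(best_pattern)
```

Because every atom removes at least one character from every string, the recursion terminates, and it enumerates every atom sequence that matches all inputs, which is the sense in which such a learner is complete over its atom set.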



Profile-Guided Interaction for PBE

Traditional PBE interaction: users typically provide their desired outputs sequentially.

  Birthdays            Years    Order
  8/20 ’92             1992     1
  1986 June 07         1986     2
  3/24 ’88             1988     3
  1994 November 23     1994
  ...
  13-08-83             1983     4
  4/21 ’79             1979
  24-11-91             1991

Profile-guided interaction: the system proactively requests outputs for syntactically discrepant inputs.

  Birthdays            Years    Order
  8/20 ’92             1992     3
  1986 June 07         1986     2
  3/24 ’88             1988
  1994 November 23     1994
  ...
  13-08-83             1983     1
  4/21 ’79             1979
  24-11-91             1991

(Order: the sequence in which example outputs are provided.)
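
The idea of requesting outputs for syntactically discrepant inputs can be sketched as follows. This is only an illustration under strong assumptions: the coarse `signature` function (digit runs become `D`, letter runs become `L`) is a hypothetical stand-in for the patterns FlashProfile learns, and `examples_to_request` simply returns the first input seen for each signature. It does not model the order in which the real system asks.

```python
import re
from typing import Dict, List


def signature(s: str) -> str:
    """Coarse syntactic shape of a string: digit runs -> 'D', letter runs -> 'L'.
    A hypothetical stand-in for a learned pattern."""
    return re.sub(
        r"[0-9]+|[A-Za-z]+",
        lambda m: "D" if m.group()[0].isdigit() else "L",
        s,
    )


def examples_to_request(inputs: List[str]) -> List[str]:
    """Pick one representative input per distinct signature (first occurrence wins)."""
    representatives: Dict[str, str] = {}
    for s in inputs:
        representatives.setdefault(signature(s), s)
    return list(representatives.values())


birthdays = [
    "8/20 '92",
    "1986 June 07",
    "3/24 '88",
    "1994 November 23",
    "13-08-83",
    "4/21 '79",
    "24-11-91",
]

# One request per syntactic format instead of walking the column top to bottom.
print(examples_to_request(birthdays))
# ["8/20 '92", '1986 June 07', '13-08-83']
```

With three formats in the column, three requested outputs cover every format, whereas sequential entry only reaches the `13-08-83` format after several redundant examples.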


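Quality of Generated Profiles

• A profile P̃ is learned from 20% of a dataset S.
• P̃ is tested against the remaining 80% of S and against strings randomly sampled from other datasets (the figure labels both test sets "80%"), counting true positives (tp), false negatives (fn), false positives (fp), and true negatives (tn).
• Quality(P̃) = F1 Score

[Figure: quality of profiles generated by FlashProfile, Ataccama One, and Microsoft SSDT]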
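
As a rough illustration of the quality measure, the sketch below computes an F1 score for a profile treated as a yes/no matcher. The function name `f1_quality`, the toy regex profile, and the sample strings are all hypothetical; only the definition of F1 from tp, fp, and fn is standard.

```python
import re
from typing import Callable, Iterable


def f1_quality(matches: Callable[[str], bool],
               positives: Iterable[str],
               negatives: Iterable[str]) -> float:
    """F1 score of a profile, viewed as a classifier for membership in dataset S.

    positives: held-out strings from S (should match).
    negatives: strings sampled from other datasets (should not match).
    """
    positives = list(positives)
    negatives = list(negatives)
    tp = sum(1 for s in positives if matches(s))
    fn = len(positives) - tp
    fp = sum(1 for s in negatives if matches(s))
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy profile: years of the form 18DD with an optional trailing '?'.
def toy_profile(s: str) -> bool:
    return re.fullmatch(r"18\d\d\??", s) is not None


held_out = ["1813", "1850?", "1903", "1898"]     # from the same dataset
others = ["abc", "1875", "2020-01-01", "x"]      # from other datasets

print(f1_quality(toy_profile, held_out, others))  # 0.75
```

Here the toy profile misses `1903` (one false negative) and wrongly accepts `1875` from another dataset (one false positive), so precision and recall are both 3/4.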


End-to-End Profiling Performance


Related Work

• Microsoft SQL Server Data Tools (SSDT) [ https://docs.microsoft.com/en-us/sql/ssdt ]
  ◦ Recognizes constants and fixed-width atoms.
  ◦ Not extensible. No refinement. Profiles are sometimes not comprehensive.

• Ataccama One [ https://one.ataccama.com/ ]
  ◦ Comprehensive profiles. Recognizes fixed-width atoms.
  ◦ A small fixed set of atoms. No refinement. Does not recognize constants.

• Trifacta Wrangler [ https://cloud.trifacta.com ]
  ◦ Recognizes fixed-width atoms. Generates readable profiles.
  ◦ Not extensible. No refinement. Does not recognize constants.

• Google OpenRefine [ http://openrefine.org/ ]
  ◦ No patterns, only clusters based on character-wise similarity.

• Potter’s Wheel [ Vijayshankar Raman and Joseph M. Hellerstein. VLDB 2001 ]
  ◦ Extensible set of atoms.
  ◦ Only learns the most-frequent pattern and shows outliers, not a profile.

• LearnPADS++ [ Kathleen Fisher et al. SIGMOD 2008 ; Kenny Q. Zhu et al. PADL 2012 ]
  ◦ Not extensible. No refinement. Generates C-style structures.


Conclusion & Future Work

A novel composition of hierarchical clustering and program synthesis techniques for efficient pattern-based data profiling

Future Work:

• Automatically selecting costs for atoms
  ◦ Machine-learnt costs to maximize the quality of profiles

• Alternate approximation strategies with better guarantees
  ◦ Compute the overall “goodness” of an approximation and refine if needed

• Identify and classify semantic entities as well
  ◦ For example, combine with named-entity recognition (NER) techniques


Publicly-Available Artifacts

• The Matching.Text NuGet package: https://www.nuget.org/packages/Microsoft.ProgramSynthesis.Matching.Text/

• Documentation for Matching.Text library: https://microsoft.github.io/prose/documentation/matching-text/intro/

• OOPSLA artifacts (a C# app showing Matching.Text API usage): https://github.com/SaswatPadhi/FlashProfileDemo

• Contact: [email protected]
