Topes: Enabling End-User Programmers to Validate and Reformat Data
Christopher Scaffidi
Committee:
Mary Shaw (chair) Institute for Software Research, Carnegie Mellon University
Sebastian Elbaum Computer Science & Engineering, University of Nebraska-Lincoln
Jim Herbsleb Institute for Software Research, Carnegie Mellon University
Brad Myers Human-Computer Interaction Institute, Carnegie Mellon University
Target population
• In 2012, there will be 90 million computer end users in American workplaces.
• Of these, at least 55 million will create spreadsheets, databases, web applications, or other programs.
– Spreadsheets for computing budgets
– Spreadsheets and databases for storing information
– Web applications for collecting data from coworkers
– And similar programs for automating a wide range of tedious or error-prone work tasks
Another person’s task: validate web forms, but he didn’t know JavaScript / regexps
1. Identify valid, invalid, and questionable values
• Data is sometimes questionable… yet valid.
– E.g.: an unusually long email address
– In practice, person names and other proper nouns are never validated with regexps… too brittle.
– Life is full of corner cases and exceptions.
• If code can identify questionable data, then it can double-check the data:
– Ask an application end user to confirm the input
– Flag the input for checking by a system administrator
– Compare the value to a list of known exceptions
– Call up a server and see if it can confirm the value
• Two different strings can be equivalent.
– What if an end user types a date in the wrong format?
– “Jan-3-2007” and “1/3/2007” mean the same thing because of the category that they are in: date.
– Sometimes the interpretation is ambiguous. In real life, preferences and experience guide interpretation.
• If code can transform among formats, then it can put data in an unambiguous format as needed.
– Display result so users can check/fix interpretation
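A minimal sketch of the idea, assuming a date category with just the two formats mentioned above. The format names, the `DATE_FORMATS` table, and the choice of ISO 8601 as the "unambiguous" target are illustrative assumptions, not part of the topes tools. Reading “1/3/2007” month-first is itself one of the interpretations the slide warns about.

```python
from datetime import datetime

# Hypothetical format table for a "date" category (names and patterns are assumptions).
DATE_FORMATS = {
    "mon-d-yyyy": "%b-%d-%Y",  # e.g. Jan-3-2007 (month names assume an English locale)
    "m/d/yyyy": "%m/%d/%Y",    # e.g. 1/3/2007, read month-first
}


def parse_date(text):
    """Return (format_name, date) for the first format that parses, else None."""
    for name, pattern in DATE_FORMATS.items():
        try:
            return name, datetime.strptime(text, pattern).date()
        except ValueError:
            continue
    return None


def to_iso(text):
    """Transform a recognized date string into ISO 8601 (yyyy-mm-dd)."""
    parsed = parse_date(text)
    if parsed is None:
        raise ValueError(f"not a recognized date: {text!r}")
    return parsed[1].isoformat()
```

Both example strings map to the same ISO date, which a tool could then display so the user can check or fix the interpretation.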
4. Reusability across programming environments (“platforms”)
• Validity does not depend on whether the string is in a spreadsheet or a webform or a database
• To validate a kind of data, people don’t want to write
– JavaScript for webforms on the client side
– C#/Java/PHP for webforms on the server side
– Stored procedures for databases
– VBScript for spreadsheets
A tope has functions for recognizing and transforming instances of a data category
• Each tope implementation has executable functions:
– 1 isa: string → [0,1] function per format, for recognizing instances of the format (a fuzzy set)
– 0 or more trf: string → string functions linking formats, for transforming values from one format to another
• Validation function: φ(str) = max_f isa_f(str), where f ranges over the tope’s formats
– Valid when φ(str) = 1
– Invalid when φ(str) = 0
– Questionable when 0 < φ(str) < 1
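The validation rule can be sketched in a few lines: score a string with every format's isa function and take the maximum. The phone-number formats, the 0.5 score for the dotted format, and all function names below are assumptions made for illustration.

```python
import re


# Each isa function maps a string to a score in [0, 1] (hypothetical formats).
def isa_dashed_phone(s):
    return 1.0 if re.fullmatch(r"\d{3}-\d{3}-\d{4}", s) else 0.0


def isa_paren_phone(s):
    return 1.0 if re.fullmatch(r"\(\d{3}\) \d{3}-\d{4}", s) else 0.0


def isa_dotted_phone(s):
    # Assumed to be an unusual format, so a match is only "questionable".
    return 0.5 if re.fullmatch(r"\d{3}\.\d{3}\.\d{4}", s) else 0.0


PHONE_ISAS = [isa_dashed_phone, isa_paren_phone, isa_dotted_phone]


def validate(s, isas):
    """Validation score: the maximum isa score over the tope's formats."""
    return max(isa(s) for isa in isas)


def classify(score):
    """Map a validation score to valid / invalid / questionable."""
    if score == 1:
        return "valid"
    if score == 0:
        return "invalid"
    return "questionable"
```

A caller would accept "valid" strings, reject "invalid" ones, and route "questionable" ones to a confirmation step.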
Two other common kinds of topes: numeric and hierarchical
• Numeric, e.g.: human masses
– Numeric and in a certain range
– Values slightly outside range might be questionable
– Sometimes labeled with an explicit unit
– Transformation usually by multiplication
• Hierarchical, e.g.: address lines
– Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
– Simple isas can be implemented with regexps.
– Transformations involve permutation of parts, lookup tables, and changes to separators & capitalization.
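For the numeric case, a rough sketch: an isa function that returns 1 inside a usual range, an intermediate score slightly outside it, and 0 otherwise, plus a transformation by multiplication. The range thresholds and the 0.5 score are made-up values for illustration; only the pound-to-kilogram factor is a fixed constant.

```python
KG_PER_LB = 0.45359237  # exact definition of the avoirdupois pound in kilograms


def isa_mass_kg(value):
    """Score a human mass in kilograms (thresholds are illustrative assumptions)."""
    if 2 <= value <= 200:
        return 1.0  # within the usual range: valid
    if 0 < value <= 500:
        return 0.5  # outside the usual range but conceivable: questionable
    return 0.0      # non-positive or implausibly large: invalid


def trf_lb_to_kg(pounds):
    """Transform a mass from pounds to kilograms by multiplication."""
    return pounds * KG_PER_LB
```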
Role of good tool support
• Some simple isa functions could be implemented as
– Enumerations
– Regular expressions / formal grammars
• But for many topes, we also need to support questionable values and reformatting
• And usability can almost always be improved by tailoring the tools to the problem domain
– Integrate with users’ familiar tools
– Match the user interface to the problem’s structure
User highlights cells
Clicks “New” button on our Validation toolbar
System infers a boilerplate tope and presents it for review and customization
Induction steps:
1. Identify number & word parts
2. Align parts based on punctuation
3. Infer simple constraints on parts
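One way the three induction steps could look in code is sketched below. This is a simplified stand-in, not the inference algorithm used by the actual tool: it assumes a non-empty list of examples, aligns examples only when their punctuation skeleton matches exactly, and infers nothing beyond length ranges for each part.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\d+|[A-Za-z]+|[^\dA-Za-z]")


def split_parts(s):
    """Step 1: split a string into ('num' | 'word' | 'punct', text) parts."""
    parts = []
    for tok in TOKEN.findall(s):
        if tok.isdigit():
            kind = "num"
        elif tok.isalpha():
            kind = "word"
        else:
            kind = "punct"
        parts.append((kind, tok))
    return parts


def signature(parts):
    """Step 2: alignment key that keeps punctuation literally and other parts by kind."""
    return tuple(tok if kind == "punct" else kind for kind, tok in parts)


def infer_format(examples):
    """Step 3: infer length ranges per part for the most common signature."""
    parsed = [split_parts(e) for e in examples]
    common_sig, _ = Counter(signature(p) for p in parsed).most_common(1)[0]
    aligned = [p for p in parsed if signature(p) == common_sig]
    constraints = []
    for i, slot in enumerate(common_sig):
        if slot in ("num", "word"):
            lengths = [len(p[i][1]) for p in aligned]
            constraints.append((slot, min(lengths), max(lengths)))
        else:
            constraints.append(("punct", slot))
    return constraints
```

The result is a boilerplate description (for example, "3 digits, hyphen, 3 digits, hyphen, 4 digits") that a user could then review and customize.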
Efficient recommendation
• Only consider a tope if its instances could possibly have the “character content” of each example string (e.g.: could this have 12 letters & 1 space?)
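The character-content filter might be approximated as follows: count letters, digits, spaces, and other characters in each example, and keep only topes whose bounds admit those counts. The per-tope bounds in `TOPE_BOUNDS` are invented for this sketch; a real repository would derive them from the tope definitions.

```python
def char_content(s):
    """Count the kinds of characters in a string."""
    return {
        "letters": sum(c.isalpha() for c in s),
        "digits": sum(c.isdigit() for c in s),
        "spaces": sum(c.isspace() for c in s),
        "other": sum(not (c.isalnum() or c.isspace()) for c in s),
    }


# Hypothetical (min, max) bounds on character content per tope.
TOPE_BOUNDS = {
    "phone number": {"letters": (0, 0), "digits": (7, 15), "spaces": (0, 3), "other": (0, 4)},
    "person name": {"letters": (2, 40), "digits": (0, 0), "spaces": (0, 4), "other": (0, 3)},
}


def candidate_topes(examples, bounds=TOPE_BOUNDS):
    """Keep topes whose bounds admit the character content of every example."""
    contents = [char_content(e) for e in examples]
    return [
        name
        for name, b in bounds.items()
        if all(b[k][0] <= c[k] <= b[k][1] for c in contents for k in b)
    ]
```

Because the check only counts characters, it is cheap enough to run against every tope before any full isa function is evaluated.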
Search repository by label and/or examples
Note: many repositories will be organization-specific
• RedRover
– Spreadsheet auditing
– They already support formula auditing
– Goal: Using topes for checking strings
• LogicBlox
– Decision-support
– Helping users enter data & make decisions from it
– Goal: Using topes for validating data
– Goal: Using topes for data de-duplication
Evaluation
• Many evaluations rely on the EUSES Spreadsheet Corpus (collected by Univ. Nebraska)
– In particular, 4250 spreadsheet columns that contained at least 20 strings
• These evaluations generally use the F1 statistic as a measure of accuracy
1. Get strings from the corpus
2. Manually validate the strings
3. Automatically validate the strings (e.g.: with topes)
4. Compute F1 to check agreement
F1 = precision * recall / ( (precision + recall)/2 )
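The agreement computation in step 4 can be written directly from the formula. The sketch below assumes the manual and automatic judgments are parallel lists of booleans, with `True` marking the positive class (here assumed to mean "flagged as invalid"); the slides do not say which class was treated as positive.

```python
def f1_score(manual, automatic):
    """F1 agreement between manual (ground truth) and automatic judgments."""
    tp = sum(m and a for m, a in zip(manual, automatic))        # both flag the string
    fp = sum((not m) and a for m, a in zip(manual, automatic))  # only the tool flags it
    fn = sum(m and (not a) for m, a in zip(manual, automatic))  # only the human flags it
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision * recall / ((precision + recall) / 2)
```

For example, with one true positive, one false positive, and one false negative, precision and recall are both 0.5, so F1 is 0.5.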
• Implemented topes for spreadsheet data
– Created 32 topes for the most common categories
   • Covering 1199 columns, which was ~69% of the 1713 categorized columns, or ~28% of all 4250 columns
   • Up to 5 formats per tope
– Compared to current practice
   • Validate w/ tope, simulate asking user on questionable inputs: F1 = 0.7
   • Validate w/ regexps or enumerations if available, but accept all inputs when no regexp or enumeration is available: F1 = 0.19
– Tope-based validation was 3 times as accurate
   • Big benefit from supporting multi-format topes
   • Moderate benefit from validating currently-unvalidated categories
   • Small benefit from double-checking questionable values
Evaluating support for data cleaning
• Used topes to put web data into consistent formats
– Again with the 5 columns in the Hurricane Katrina website
– Used transformation functions to put each string into the most common format for that data category
– Increased number of duplicate strings found by 10%
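The cleaning procedure described above can be sketched as: detect each string's format, pick the most common one in the column, transform everything into it, and then count duplicates. The phone-number formats and all function names are assumptions chosen for illustration; they are not the data categories or code from the study.

```python
import re
from collections import Counter


def detect_phone_format(s):
    """Return a format name for a recognized string, else None (hypothetical formats)."""
    if re.fullmatch(r"\d{3}-\d{3}-\d{4}", s):
        return "dashed"
    if re.fullmatch(r"\(\d{3}\) \d{3}-\d{4}", s):
        return "paren"
    return None


def transform_phone(s, target):
    """Rewrite a recognized phone string into the target format."""
    d = re.sub(r"\D", "", s)  # keep digits only
    if target == "dashed":
        return f"{d[:3]}-{d[3:6]}-{d[6:]}"
    return f"({d[:3]}) {d[3:6]}-{d[6:]}"


def normalize_column(strings, detect, transform):
    """Put every recognized string into the column's most common format."""
    formats = [detect(s) for s in strings]
    known = [f for f in formats if f is not None]
    if not known:
        return list(strings)
    target = Counter(known).most_common(1)[0][0]
    return [transform(s, target) if f is not None else s
            for s, f in zip(strings, formats)]


def count_duplicates(strings):
    """Number of strings that repeat another string in the column."""
    return len(strings) - len(set(strings))
```

Unrecognized strings are left untouched, so normalization can only add duplicate matches, never remove them.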
Closing the mismatch between data abstractions and the real world
• People often work with strings that are possibly-questionable instances of multi-format categories.
• These categories are application-agnostic and often common to many people.
• By capturing rules for validating and reformatting strings (including distinguishing questionable strings and multiple formats), topes…
– Increase the accuracy of validation
– Help users to accomplish validation and reformatting activities quickly and effectively
– Improve the reusability of validation code
For more information on end users and topes
- End users’ counts and needs: VL/HCC’05, VL/HCC’07
- Topes model: ICSE’08
- Format inference: ICEIS’07
- Integration with other systems: WEUSE’08 & FSE’08
- Our latest tools + usability validation: ISEUD’09 & IUI’09
For more information on some related work
- Dependent types, eg: X. Ou, Dynamic Typing with Dependent Types, Tech Rpt TR-695-04, Princeton Univ, 2004
- Regexp induction, eg: K. Lerman, S. Minton. Learning the Common Structure of Data, Proc. AAAI, 2000.
- Lapis system: R. Miller, Lightweight structure in text, Tech Rpt CMU-CS-02-134, Carnegie Mellon Univ., 2002.
- SWYN regexp editor: A. Blackwell, See What You Need: Helping End-users to Build Abstractions, JVLC, 2001.
- Federated databases, eg: A. Sheth, J. Larsen, Federated database systems for managing distributed, heterogeneous, and autonomous databases, CSUR, 1990.
- ETL Tools, eg: E. Rahn, H. Do, Data Cleaning: Problems and Current Approaches, IEEE Data Eng. Bulletin, 2000.
- Potter’s Wheel: V. Raman, J. Hellerstein, Potter's Wheel: An Interactive Data Cleaning System, VLDB, 2001.
- Forms/3 : M. Burnett et al, End-user software engineering with assertions in the spreadsheet paradigm, ICSE, 2003.
- -calculus: M. Erwig, M. Burnett, Adding Apples and Oranges. Symp. Practical Aspects of Declarative Lang., 2002.
- Named entities, eg: Message Understanding Conference series.
Professional programmers use lots of tricks to simplify validation code. E.g.: njtransit.com
Split inputs into many easy-to-validate fields.
Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field?
Make users pick from drop-downs.
Who cares if it’s faster for users to type “NJ” or “1/2007”?
(Disclaimer: drop-downs sometimes are good!)
Even with these tricks, writing validation is still very time-consuming.
Overall, the site had over 1100 lines of JavaScript just for validation….
Plus equivalent server-side Java code (too bad code isn’t platform-independent)
if (!rfcCheckEmail(frm.primaryemail.value))
    return messageHelper(frm.primaryemail, "Please enter a valid Primary Email address.");
var atloc = frm.primaryemail.value.indexOf('@');
if (atloc > 31 || atloc < frm.primaryemail.value.length-33)
    return messageHelper(frm.primaryemail,
        "Sorry. You may only enter 32 characters or less for your email name\r\n" +
        "and 32 characters or less for your email domain (including @).");
That was worst case. Best case: reusable regexps.
• Many IDEs allow the programmer to enter one regular expression for validating each input field.
– Usually, this drastically reduces the amount of code, since most validation ain’t fancy.
– Yet programmers don’t validate most inputs.
[Figure: system architecture. Labels: End user applications (Microsoft Excel – spreadsheets; Visual Studio.NET – web forms; Robofox – web macros; Vegemite/Co-Scripter – web macros; ...); Add-ins; Topei; Toped++; Topeg; Local repository; Remote repositories]
As a tool builder, what do I have to do so that people can use topes in my tool?
You need to make an add-in
1. Figure out what kind of fields you want to help your
• Code can ask an oracle, “Is this a person name?”, and the oracle replies yes, no, almost definitely, probably not, and other shades of gray.
• Code allows input in any reasonable format, since the code can ask the oracle to put the input into the format that is actually needed.
• Regardless of whether they are working in spreadsheets, webforms, or another programming environment, end users can teach the oracle about a new data category by concisely stating its parts and constraints.
A Word-like part that almost always contains 1-6 words that each always have 1-8 lowercase letters per word and only hyphens or ampersands between words: