Data validation infrastructure: the validate package
Mark van der Loo and Edwin de Jonge
Statistics Netherlands
Validate
Goal
To make checking your data against domain knowledge and technical demands as easy as possible.

Content of this talk
- Basic concepts and workflow
- Examples of possibilities and syntax
- Outlook
Basic concepts and workflow
Basic concepts of the validate package
Example: retailers data
library(validate)
data(retailers)
dat <- retailers[4:6]
head(dat)
##   turnover other.rev total.rev
## 1       NA        NA      1130
## 2     1607        NA      1607
## 3     6886       -33      6919
## 4     3861        13      3874
## 5       NA        37      5602
## 6       25        NA        25
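The rest of the workflow (define rules, confront the data with them, inspect the result) can be sketched as follows. The three rules are illustrative assumptions chosen to fit the columns shown above; they are not taken from the slides.

```r
library(validate)
data(retailers)
dat <- retailers[4:6]

# Illustrative rules (assumed for this sketch, not from the talk)
v <- validator(
  turnover >= 0,
  other.rev >= 0,
  turnover + other.rev == total.rev
)

# Confront the data with the rules
cf <- confront(dat, v)

# One row per rule: items checked, passes, fails, number of NA, ...
s <- summary(cf)
print(s)

# Record-by-rule results (TRUE, FALSE or NA)
head(values(cf))
```

Missing values in the data propagate to NA in the results by default, so rows 1, 2, 5 and 6 above cannot be fully judged by the balance rule.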
Other features
- Rules (validator objects)
  - Select from validator objects using []
  - Extract or set rule metadata (label, description, timestamp, ...)
  - Get affected variable names, rule linkage
  - Summarize validators
  - Read/write to yaml format
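A minimal sketch of these validator features, assuming two illustrative rules on the retailers columns; the rule names, labels and the temporary file are hypothetical choices made for this example.

```r
library(validate)

# Two illustrative rules (assumed for this sketch)
v <- validator(
  turnover >= 0,
  turnover + other.rev == total.rev
)

# Select rules with []
v1 <- v[1]

# Extract or set metadata
names(v) <- c("nonneg", "balance")      # hypothetical rule names
label(v) <- c("Turnover is non-negative",
              "Revenue components add up")
label(v)
created(v)                               # timestamps

# Affected variables and rule linkage
vars <- variables(v)
variables(v, as = "matrix")              # rules x variables, TRUE where used

# Summarize the rule set
summary(v)

# Write to yaml and read back
f <- tempfile(fileext = ".yaml")
export_yaml(v, file = f)
v2 <- validator(.file = f)
```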
- Confront
  - Control behaviour on NA
  - Raise errors, warnings
  - Set machine rounding limit
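The confront options can be sketched as below. The small data frame is hypothetical toy data made up for this example; the options na.value, lin.eq.eps and raise are passed through confront() as documented under voptions.

```r
library(validate)

# Hypothetical toy data (not from the retailers set)
dat <- data.frame(
  turnover  = c(1607, NA, 25),
  other.rev = c(0, 37, 0.004),
  total.rev = c(1607, 5602, 25)
)
v <- validator(turnover >= 0, turnover + other.rev == total.rev)

# Default: NA in the data gives NA in the result
cf_default <- confront(dat, v)
values(cf_default)

# Count NA as a failure instead
cf_strict <- confront(dat, v, na.value = FALSE)
values(cf_strict)

# Loosen the tolerance used for linear equalities
cf_tol <- confront(dat, v, lin.eq.eps = 0.01)
values(cf_tol)

# By default, errors are caught and stored; raise = "all" rethrows them
bad <- validator(no_such_column > 0)
errors(confront(dat, bad))
res <- tryCatch(
  confront(dat, bad, raise = "all"),
  error = function(e) "raised"
)
```

Raising everything is mainly useful while debugging a rule set; for routine runs the default of catching and reporting keeps one faulty rule from stopping the whole check.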
Outlook
In the works / ideas
- More analyses of rules
- More programmability
- More (interactive) visualisations
- Roxygen-like metadata specification
- More support for reporting
- ...

We’d ♥ to hear your comments, suggestions, bug reports
Validate is just the beginning!
See github.com/data-cleaning
- Van der Loo (2015). A formal typology of data validation functions. In United Nations Economic Commission for Europe Work Session on Statistical Data Editing, Budapest.
- Di Zio et al. (2015). Methodology for data validation. ESSNet on validation, deliverable.
- Van der Loo, M. and E. de Jonge (2016). Statistical Data Cleaning with Applications in R. Wiley (in preparation).