Automatic Data Validation & Cleaning with PySemantic
Jaidev Deshpande, Data Scientist, Cube26 Software Pvt Ltd

Feb 12, 2017

Transcript
Page 1: Automatic Data Validation and Cleaning with PySemantic

Automatic Data Validation & Cleaning with PySemantic

Jaidev Deshpande
Data Scientist, Cube26 Software Pvt Ltd

Page 2: Automatic Data Validation and Cleaning with PySemantic

About Me

● Data Scientist at Cube26 Software Pvt Ltd
● Previously software developer at Enthought
● Research assistant at TIFR and UoP
● Active contributor to the SciPy stack

/ jaidevd


Page 3: Automatic Data Validation and Cleaning with PySemantic

Typical Data Pipeline

Page 4: Automatic Data Validation and Cleaning with PySemantic

The Problem

● Curating the data and standardizing it across the team
● Data quality problems:
  ○ Unstructured data
  ○ Unorganized data
  ○ Duplicated data
  ○ Irrelevant data
● Communication problems:
  ○ Large and distributed teams
  ○ “What has happened to get the dataset to the current stage?”
  ○ Messier data means more communication.

HOW DO I DESCRIBE THE STRUCTURE OF THE DATA EFFECTIVELY?

Page 5: Automatic Data Validation and Cleaning with PySemantic
Page 6: Automatic Data Validation and Cleaning with PySemantic

PySemantic

Page 7: Automatic Data Validation and Cleaning with PySemantic

Pythonically, PySemantic is:
● A wrapper around pandas parsers and dataframe manipulation routines
● Not a parser
● A loader for feature extraction for machine learning tasks
● A logger for all operations on a dataset

PySemantic supports:
● Recursive elimination of parser errors
● Automatic validation based on rules
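
To make "automatic validation based on rules" concrete, here is a minimal sketch written directly against pandas. It is not PySemantic's implementation: the function name `apply_rules` and the rule keys `min`, `max` and `allowed` are hypothetical, chosen only to show the idea of declaring per-column rules once and filtering every loaded dataset against them.

```python
import pandas as pd


def apply_rules(frame, rules):
    """Keep only the rows that satisfy every rule declared for every column.

    `rules` maps a column name to a dict of hypothetical rule keys:
    "min" / "max" (inclusive bounds) and "allowed" (permitted values).
    """
    keep = pd.Series(True, index=frame.index)
    for column, rule in rules.items():
        values = frame[column]
        if "min" in rule:
            keep &= values >= rule["min"]
        if "max" in rule:
            keep &= values <= rule["max"]
        if "allowed" in rule:
            keep &= values.isin(rule["allowed"])
    return frame[keep].reset_index(drop=True)


# Made-up sample data: three of the five rows break a rule.
raw = pd.DataFrame(
    {
        "age": [25, -3, 41, 230, 37],
        "city": ["Pune", "Delhi", "Atlantis", "Mumbai", "Mumbai"],
    }
)
rules = {
    "age": {"min": 0, "max": 120},
    "city": {"allowed": ["Pune", "Delhi", "Mumbai"]},
}
clean = apply_rules(raw, rules)
```

Keeping the rules in a plain mapping means they can live in a shared configuration file rather than in each analyst's script, which is the standardization problem the earlier slides describe.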

Page 8: Automatic Data Validation and Cleaning with PySemantic

How it works

$ semantic add mydictionary.yaml

mydataset:
  path: /path/to/mydataset.csv
  nrows: 100
  use_columns:
    - col_a
    - col_b
    - col_c

>>> from pysemantic import Project
>>> project = Project("myproject")
>>> project.load_dataset("mydataset")
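
For readers who have not seen the library, the sketch below shows roughly what a data dictionary entry like the one above amounts to in plain pandas. It is an illustration under assumptions, not PySemantic's code: `SCHEMA_TO_PARSER`, `load_from_spec` and `demo` are hypothetical names, and the key mapping is a guess at how `nrows` and `use_columns` might be forwarded to `pandas.read_csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mapping from schema keys to pandas.read_csv keyword names.
SCHEMA_TO_PARSER = {"nrows": "nrows", "use_columns": "usecols"}


def load_from_spec(spec):
    """Load a CSV using a dict shaped like one entry of the data dictionary."""
    kwargs = {
        SCHEMA_TO_PARSER[key]: value
        for key, value in spec.items()
        if key in SCHEMA_TO_PARSER
    }
    return pd.read_csv(spec["path"], **kwargs)


def demo():
    """Write a throwaway CSV, then load it through a schema-like dict."""
    # The file has 200 rows and one column (col_d) the schema does not ask for.
    frame = pd.DataFrame({name: range(200) for name in ["col_a", "col_b", "col_c", "col_d"]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mydataset.csv")
        frame.to_csv(path, index=False)
        spec = {
            "path": path,
            "nrows": 100,
            "use_columns": ["col_a", "col_b", "col_c"],
        }
        return load_from_spec(spec)
```

Calling `demo()` returns a frame limited to the first 100 rows and the three requested columns. The loading options stay in one declarative place, and every caller gets the same frame.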

Page 9: Automatic Data Validation and Cleaning with PySemantic

PySemantic Internals

● Infer and validate parser arguments from the schema using traits
● Dynamically change parser arguments based on the errors raised, if any
● Log everything
● Post loading a dataset, apply common preprocessing methods by default
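
The second and third points above can be sketched as a retry loop around `pandas.read_csv`. This is a simplified illustration, not the library's internals: `read_with_fallback` and `find_bad_column` are hypothetical helpers, and the only recovery strategy shown is dropping a declared dtype that the data does not satisfy.

```python
import io
import logging

import pandas as pd

logger = logging.getLogger("loader_sketch")


def find_bad_column(text, dtypes):
    """Return the first column whose declared dtype cannot be applied, if any."""
    for column, dtype in dtypes.items():
        try:
            pd.read_csv(io.StringIO(text), usecols=[column], dtype={column: dtype})
        except (ValueError, TypeError):
            return column
    return None


def read_with_fallback(text, dtypes):
    """Parse CSV text, relaxing one declared dtype per failed attempt.

    Returns the parsed frame and the list of columns whose dtype was dropped.
    """
    dtypes = dict(dtypes)  # work on a copy so the caller's schema is untouched
    relaxed = []
    while True:
        try:
            frame = pd.read_csv(io.StringIO(text), dtype=dtypes)
            logger.info("loaded %d rows with dtypes %s", len(frame), dtypes)
            return frame, relaxed
        except (ValueError, TypeError) as err:
            culprit = find_bad_column(text, dtypes)
            if culprit is None:
                raise  # the error is not a dtype problem, so give up
            logger.warning("dropping dtype for %r after error: %s", culprit, err)
            del dtypes[culprit]
            relaxed.append(culprit)


# Made-up input: col_b is declared as an integer but contains a stray string.
CSV_TEXT = "col_a,col_b\n1,1\n2,unknown\n3,3\n"
frame, relaxed = read_with_fallback(CSV_TEXT, {"col_a": "int64", "col_b": "int64"})
```

Each failed attempt removes one argument and is logged, so the loop always terminates and leaves a record of which parts of the schema the data violated.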

Page 10: Automatic Data Validation and Cleaning with PySemantic

Software Development Practices

● Fully test-driven
● Fully documented
● Pylint score > 9.0

Page 11: Automatic Data Validation and Cleaning with PySemantic

Limitations

● Only supports local files and MySQL tables (untested)
● Not as smart as MS Excel
● Architecture isn’t very clean - the main classes are somewhat confusing

Page 12: Automatic Data Validation and Cleaning with PySemantic

Feedback, Issues, PRs Welcome!

http://github.com/jaidevd/pysemantic