Data Wrangling and MongoDB
Instructor: Ningning Wu
University of Arkansas at Little Rock
CDO-1 Certificate Program: Foundations for Chief Data Officers
Sept. 26-29, 2016 (c) 2016 iCDO@UAL
What is Data Wrangling?
• Refers to any data transformations required to prepare a dataset for downstream analysis, visualization, operational consumption, etc.
• Accounts for roughly 80% of the time of a data analysis project
Data Wrangling Activities
• Covers both traditional data curation and modern data analysis
• Understand what data is available
• Choose what data to use and at what level of detail
• Understand how to meaningfully combine multiple sources of data
• Decide how to distill the results to a size and shape that can drive downstream analysis
Data Wrangling Process
• In broad strokes, the data wrangling process involves
• Acquisition: the extraction portion of the ETL pipeline
• Transformation: the functional aspect of wrangling
• Profiling: motivates and validates transformation
• Output: corresponds to the completion of the data wrangling process
• Together, transformation and profiling form the core of data wrangling
Acquisition
• Extraction portion of the ETL pipeline
• Involves pulling data either by scraping various internet endpoints or by linking to existing data stores.
Transformation
• A fundamental aspect of data wrangling.
• Involves changing data forms and validating/altering contents to meet the needs of downstream data analysis:
• Cleaning
• Enhancement
• Integration
Transformation Process – Cleaning
• Data cleaning is an iterative process:
• Detecting errors
• Correcting errors
• Example:
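As an illustration of the detect/correct loop, here is a minimal sketch in JavaScript; the records, the state-code rule, and the lookup table are all hypothetical, not from the slides:

```javascript
// Hypothetical customer records with a dirty `state` column.
const records = [
  { name: "Alice", state: "AR" },
  { name: "Bob", state: "Arkansas" }, // inconsistent representation
  { name: "Carol", state: "" },       // missing value
];

// Detect: flag values that are not two-letter state codes.
const isDirty = (r) => !/^[A-Z]{2}$/.test(r.state);
const dirty = records.filter(isDirty);

// Correct: normalize known long forms; mark true gaps for manual review.
const stateCodes = { Arkansas: "AR" }; // hypothetical lookup table
const cleaned = records.map((r) => ({
  ...r,
  state: stateCodes[r.state] || (isDirty(r) ? null : r.state),
}));
```

Each pass of detection typically surfaces new issues (here, a remaining null), which is why the slide describes cleaning as iterative.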
Sources of Dirty Data
• User entry errors
• Different schemas
• Legacy systems
• Evolving applications
• No unique identifiers
• Data migration
• Programmer error
• Corruption in transmission
Data is never clean
• What is clean data? What is clean enough?
• Can I work with the data? (Is it usable?)
• Do I trust the data? (Is it credible?)
• Can I learn from it? (Is it useful?)
Usability, Credibility, and Usefulness
• Data is usable if it can be parsed and manipulated by computational tools. Data usability is thus defined in conjunction with the tools by which it is to be processed.
• Data is credible if, according to one’s subjective assessment, it is suitably representative of a phenomenon to enable productive analysis
• Data is useful if it is usable, credible, and responsive to one’s inquiry.
Other Data Quality Issues
• Accuracy: data is free of errors and conforms to a gold standard
• Completeness: no missing values
• Consistency: matches other data
• Validity: conforms to a schema.
Blueprint for Cleaning
• Access your data
• Create a data cleaning plan
• Identify causes
• Define operations
• Test
• Execute the plan
• Manually correct
Assessing Accuracy
• Difficult because it requires a gold standard for the data
• Need to compare values with known correct data
• Some data errors are tolerable while others are not.
Assessing Completeness
• Schema completeness: the degree to which entities and attributes are not missing from the schema, i.e., whether a key piece of information or an entire entity is missing from the schema
• Column completeness: the degree to which a column of a table is free of missing values.
• Population completeness: the degree to which all members of the population that should be present are present.
Assessing Consistency
• Consistency between foreign key and primary key values
• Consistency between two related data elements
• Functional dependencies: zip → state
• Business rules: quantity → discount %
• Semantic relationships: sentiment orientation → numeric rating
• Consistency among multiple copies of the same data item
• E.g., John Doe's EmpID should be the same throughout the DB.
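A functional dependency such as zip → state can be checked mechanically. The sketch below uses hypothetical rows and is just one way to detect violations:

```javascript
// A dependency zip -> state holds if each zip maps to exactly one state.
const rows = [
  { zip: "72204", state: "AR" },
  { zip: "72204", state: "AR" },
  { zip: "72204", state: "TX" }, // violates zip -> state
];

const seen = new Map();        // first state observed for each zip
const violations = [];
for (const row of rows) {
  if (seen.has(row.zip) && seen.get(row.zip) !== row.state) {
    violations.push(row);      // conflicts with an earlier row
  } else {
    seen.set(row.zip, row.state);
  }
}
```

The same pattern (map each left-hand-side value to its first right-hand-side value, flag conflicts) works for business rules and other pairwise consistency checks.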
Assessing Validity
Determine the constraints on individual fields and ensure the field values adhere to those constraints:
• Foreign key constraints
• Cross-field constraints
• Data type
• Range
• Format
• Uniqueness
• Mandatory
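These checks can be expressed as per-field predicates. The fields and rules below are hypothetical, chosen to cover type, range, format, and mandatory constraints:

```javascript
// One predicate per field; a record is valid if every predicate passes.
const constraints = {
  quantity: (v) => Number.isInteger(v) && v >= 1 && v <= 1000, // type + range
  email:    (v) => /^[^@\s]+@[^@\s]+$/.test(v),                // format
  id:       (v) => v !== null && v !== undefined,              // mandatory
};

// Return the names of the fields that fail their constraint.
const validate = (record) =>
  Object.entries(constraints)
    .filter(([field, ok]) => !ok(record[field]))
    .map(([field]) => field);

const bad = validate({ quantity: 0, email: "not-an-email", id: "A17" });
```

Foreign-key and cross-field constraints need access to more than one field or table, but fit the same pass/fail shape.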
Major Types of Data Correction
• Remove/correct typographical errors
• Fill in missing data
• Validate data types, formats, and constraints
• Cross-check data
• Data enhancement:
• Structuring
• Enriching
• Standardizing
Transformation Process – Enhancement
• Structuring: manipulating the schema of the dataset. It involves
• Modifying the schema by splitting a column/field
• Collapsing multiple columns/fields into one
• Removing columns/fields entirely
• Changing the granularity of the dataset
• Enriching: adding columns/fields that bring new information to the dataset
• E.g., converting counts to percentages, deriving customer sentiment from their comments
• Standardization: uniformity of data types, storage formats, units, etc.
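The counts-to-percentages enrichment mentioned above can be sketched as follows (the category names and counts are made up):

```javascript
// Enriching: derive a pct field from raw counts, keeping the originals.
const counts = { books: 30, music: 50, games: 20 };
const total = Object.values(counts).reduce((a, b) => a + b, 0);

const enriched = Object.fromEntries(
  Object.entries(counts).map(([k, n]) => [k, { n, pct: (100 * n) / total }])
);
```

Note that enrichment adds information rather than replacing it: the raw count `n` survives next to the derived `pct`.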
Transformation Process – Integration
• Combines data from disparate sources into meaningful and valuable information.
• Challenges:
• Scale of source data
• Semi-structured data
• Heterogeneity:
• Source type heterogeneity
• Schema heterogeneity
• Data type heterogeneity
• Data value heterogeneity
• Semantic heterogeneity
Profiling
• Provides descriptive statistics and information about the data, and helps users decide which transformations to apply
• Assess whether the data can be used
• Assess data quality
• Discover metadata about the source data, including value patterns and distributions, candidate keys, foreign keys, and functional dependencies
• Understand data challenges
Information Obtained by Profiling
• Descriptive statistics:
• Central tendency: mean, median, mode
• Dispersion: min, max, standard deviation
• Other: variance, frequency, aggregate functions such as sum and count
• Metadata: data type, length, discrete values, occurrence of nulls, uniqueness, string patterns, etc.
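A minimal sketch of computing the central-tendency and dispersion statistics above for a hypothetical numeric column:

```javascript
// Hypothetical numeric column to profile.
const col = [4, 8, 15, 16, 23, 42];

const mean = col.reduce((a, b) => a + b, 0) / col.length;

const sorted = [...col].sort((a, b) => a - b);
const mid = sorted.length / 2;
const median = sorted.length % 2
  ? sorted[Math.floor(mid)]                // odd count: middle value
  : (sorted[mid - 1] + sorted[mid]) / 2;   // even count: average of middle two

const min = sorted[0];
const max = sorted[sorted.length - 1];

// Population variance and standard deviation.
const variance = col.reduce((s, x) => s + (x - mean) ** 2, 0) / col.length;
const stddev = Math.sqrt(variance);
```

Mode and the metadata items (null occurrence, uniqueness, string patterns) follow the same one-pass-over-the-column shape.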
Output
• Corresponds to the completion of the wrangling process
• Main material outputs:
• Wrangled datasets
• Script of transformation logic
• Documentation of data lineage/provenance
Free Data Wrangling Tools
• Tabula: converts PDF tables into spreadsheets
• OpenRefine: a friendly GUI for describing and manipulating data
• R package
• DataWrangler
• CSVKit
• Python and Pandas
• Mr. Data Converter
Introduction to MongoDB
• MongoDB – What, Why, Where & Advantages
• MongoDB Basics & Definition
• Notations & Terminology
• Key Features
• Comparison to SQL
• User Interface
What is MongoDB
• MongoDB => Humongous DB
• Document database
• MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.
• Optimal model parameters
• Different from the typical relational model
• MongoDB obviates the need for an Object-Relational Mapping (ORM) layer to facilitate development. (Source: mongodb.org)
Why MongoDB
• Document-oriented storage: data is stored in the form of JSON-style documents
• Index on any attribute
• Replication & High Availability
• Auto-Sharding
• Rich Queries
• Fast In-Place Updates
• Professional support by MongoDB
When to Use MongoDB
• Big Data applications
• Content Management & Delivery
• Mobile and Social Infrastructure
• User Data Management
• Location Services
• Analytics
MongoDB - Advantages
• Schema-less
• Structure of a single object is clear
• No complex joins
• Tuning
• Easy to Scale
• Faster access to data
Key Features
• High performance
• Indexes support faster queries
• Embedded data models reduce I/O on the DB
• High availability
• Replica sets, automatic failover, data redundancy
• Automatic scaling
• Horizontal scalability as part of its core functionality
• Automatic sharding, high throughput
Document Database - Definition
• A record in MongoDB is a document, which is a data structure composed of field and value pairs.
• MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.
Document Database - Example
Advantages:
• Documents (i.e. objects) correspond to native data types in many programming languages.
• Embedded documents and arrays reduce the need for expensive joins.
• Dynamic schema supports fluent polymorphism.
Example
• In an RDBMS, the schema design for these user requirements would need at least three tables.
Example (cont.)
• In MongoDB, the schema design has a single collection, post, with the following structure
Example (cont.)
• So to display the data, an RDBMS needs to join three tables, while MongoDB serves it from one collection only.
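The original slide images are not reproduced here, so as a sketch, one plausible shape of such a post document, with tags and comments embedded; all field names and values below are hypothetical:

```javascript
// A single `post` document: tags and comments are embedded, so no joins
// are needed to display a post. Hypothetical fields and values.
const post = {
  _id: "POST-1",                      // MongoDB would generate an ObjectId
  title: "Intro to Data Wrangling",
  body: "A short post body.",
  tags: ["wrangling", "mongodb"],     // a separate table in an RDBMS
  comments: [                         // likewise a separate table
    { user: "kiran", message: "Nice post", dateCreated: "2016-09-26" },
  ],
};
```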
MongoDB – JSON Format
• Data is described as name/value pairs
• Syntax: a name/value pair consists of a field name followed by a colon and then a value
• Example: "name": "Ningning Wu"
• Pairs are separated by commas
• Example: "name": "Ningning Wu", "univ": "UALR"
• Curly braces hold objects
• Example: {"name": "Ningning Wu", "univ": "UALR", "dept": "IFSC"}
• An array is enclosed in brackets []
• Example: [{"name": "Ningning Wu", "univ": "UALR", "dept": "IFSC"}, {"name": "Kiran", "univ": "UALR", "dept": "INFQ"}]
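The array example above is valid JSON once every field name is quoted; parsing it programmatically confirms the structure:

```javascript
// The slide's array example as a strict-JSON string (field names quoted).
const text =
  '[{"name": "Ningning Wu", "univ": "UALR", "dept": "IFSC"},' +
  ' {"name": "Kiran", "univ": "UALR", "dept": "INFQ"}]';

// Parsing yields an array of two objects, matching the slide's description.
const people = JSON.parse(text);
```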
MongoDB – Create Database
• use DATABASE_NAME is the command to create a database in MongoDB
• Syntax• use DATABASE_NAME
• In MongoDB, the default database is test if you do not create one.
• All collections will be stored there.
MongoDB – Drop Database
• db.dropDatabase() is the command to drop a database in MongoDB
• Syntax• db.dropDatabase()
• In MongoDB, the default database is test, so if you execute the drop database command without selecting another database, it deletes test.
MongoDB – Data Types
• String, Integer
• Boolean, Double
• Min/Max keys, Arrays
• Timestamp
• Object, Object ID
• Null, Symbol
• Date, Binary data
• Code, Regular expression
Collections in MongoDB
• MongoDB stores all documents in collections. A collection is a group of related documents that share a common set of indexes. Collections are analogous to tables in relational databases.
Query in MongoDB
Query Interface - SELECT
Data Modification - INSERT
Data Modification - UPDATE
Data Modification - DELETE
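The query and data-modification slides above originally showed shell screenshots. As a sketch, the corresponding mongo-shell (3.x) calls, and a plain-JavaScript mimic of their query-by-example semantics, look like this (the collection, field names, and values are hypothetical):

```javascript
// Mongo-shell equivalents (require a running server; shown as comments):
//   db.users.find({ age: { $gt: 25 } })                       // SELECT
//   db.users.insert({ name: "Dana", age: 31 })                // INSERT
//   db.users.update({ name: "Dana" }, { $set: { age: 32 } })  // UPDATE
//   db.users.remove({ name: "Dana" })                         // DELETE

// The same semantics mimicked on an in-memory array:
let users = [{ name: "Ann", age: 22 }, { name: "Bo", age: 28 }];

users.push({ name: "Dana", age: 31 });                  // insert
const adults = users.filter((u) => u.age > 25);         // find: age > 25
users = users.map((u) =>
  u.name === "Dana" ? { ...u, age: 32 } : u);           // update with $set
users = users.filter((u) => u.name !== "Dana");         // remove
```

In each call, the first argument is a query document that selects matching documents by example, which is the pattern the screenshots illustrated.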
MongoDB – Sort Records
• The sort() method is used to sort documents in MongoDB.
• sort() accepts a document containing a list of fields along with their sort order.
• By default, the sort() method returns documents in ascending order.
• 1 is used for ascending order
• -1 is used for descending order.
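The 1/-1 order flags can be illustrated in plain JavaScript; this only mimics the semantics, and in the mongo shell the equivalent call on a hypothetical collection would be db.users.find().sort({age: -1}):

```javascript
// Hypothetical documents to sort.
const docs = [
  { name: "a", age: 30 },
  { name: "b", age: 25 },
  { name: "c", age: 35 },
];

// order = 1 sorts ascending, order = -1 descending, as in sort({field: order}).
const sortBy = (field, order) =>
  [...docs].sort((x, y) => order * (x[field] - y[field]));

const ascending = sortBy("age", 1);
const descending = sortBy("age", -1);
```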
MongoDB – Indexing
• Indexes support the efficient resolution of queries.
• Syntax: db.COLLECTION_NAME.ensureIndex({KEY:1})
• Here KEY is the name of the field on which you want to create the index, and 1 specifies ascending order.
• To create an index in descending order, use -1.
MongoDB – Aggregation
• Aggregation operations process data records and return computed results.
• Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
• Syntax:
db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
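What such an aggregation computes can be illustrated in plain JavaScript. The pipeline in the comment below, a $group stage with a $sum over hypothetical orders, groups documents by a key and sums a field per group:

```javascript
// Mongo-shell equivalent (requires a running server; shown as a comment):
//   db.orders.aggregate([{ $group: { _id: "$cust", total: { $sum: "$amt" } } }])

const orders = [
  { cust: "A", amt: 100 },
  { cust: "B", amt: 40 },
  { cust: "A", amt: 50 },
];

// Group by customer and sum amounts, as $group/$sum would.
const totals = {};
for (const o of orders) {
  totals[o.cust] = (totals[o.cust] || 0) + o.amt;
}
```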
MongoDB – Replication
• Replication is the process of synchronizing data across multiple servers.
MongoDB – Sharding
• Sharding is the process of storing data records across multiple machines and it is MongoDB's approach to meeting the demands of data growth.
MongoDB – Create Backup
• The mongodump command is used to create a backup of a database. It dumps all the data on your server into the dump directory.
• Syntax: mongodump
• Restore data:
• To restore backed-up data, MongoDB's mongorestore command is used. It restores all of the data from the backup directory.
• Syntax: mongorestore
SQL - MongoDB Mapping Chart
RDBMS                      MongoDB
Database                   Database
Table                      Collection
Tuple/Row                  Document
Column                     Field
Table Join                 Embedded Documents and Linking
Primary Key                Primary Key (default key _id provided by MongoDB itself)

Database server and client:
mysqld / Oracle            mongod
mysql / sqlplus            mongo
SQL to MongoDB Aggregation
MongoDB Support for Drivers
• JavaScript, Python, Ruby
• PHP, Perl, Java, Scala
• C#, C, C++
• Haskell, Erlang
MongoDB – Limitations
• No Transactional Support
• No Relational Integrity
• No Joins
• RAM Intensive
References
• The MongoDB 3.2 Manual
• https://docs.mongodb.org/getting-started/shell/
• https://docs.mongodb.org/v3.0/MongoDB-manual-v3.0.pdf