Data Wrangling and MongoDB
Instructor: Ningning Wu
University of Arkansas at Little Rock
CDO-1 Certificate Program: Foundations for Chief Data Officers
Sept. 26-29, 2016 (c) 2016 iCDO@UAL
What is Data Wrangling?
• Refers to any data transformations required to prepare a dataset for downstream analysis, visualization, operational consumption, etc.
• Accounts for roughly 80% of the time of a data analysis project
Data Wrangling Activities
• Covers both traditional data curation and modern data analysis
• Understand what data is available
• Choose what data to use and at what level of detail
• Understand how to meaningfully combine multiple sources of data
• Decide how to distill the results to a size and shape that can drive downstream analysis
Data Wrangling Process
• In broad strokes, the data wrangling process involves
• Acquisition: the extraction portion of the ETL pipeline
• Transformation: the functional aspect of wrangling
• Profiling: motivates and validates transformation
• Output: corresponds to the completion of the data wrangling process
• Together, transformation and profiling form the core of data wrangling
Acquisition
• Extraction portion of the ETL pipeline
• Involves pulling data either by scraping various internet endpoints or by linking to existing data stores.
Transformation
• A fundamental aspect of data wrangling.
• Involves changing data forms and validating/altering contents to meet the needs of downstream data analysis:
• Cleaning
• Enhancement
• Integration
Transformation Process – Cleaning
• Data cleaning is an iterative process:
• Detecting errors
• Correcting errors
• Example:
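As an illustration of the detect/correct loop, here is a minimal sketch in JavaScript; the records, the state-code rule, and the lookup table are all hypothetical, not from the slides:

```javascript
// Hypothetical customer records with a dirty `state` column.
const records = [
  { name: "Alice", state: "AR" },
  { name: "Bob", state: "Arkansas" }, // inconsistent representation
  { name: "Carol", state: "" },       // missing value
];

// Detect: flag values that are not two-letter state codes.
const isDirty = (r) => !/^[A-Z]{2}$/.test(r.state);
const dirty = records.filter(isDirty);

// Correct: normalize known long forms; mark true gaps for manual review.
const stateCodes = { Arkansas: "AR" }; // hypothetical lookup table
const cleaned = records.map((r) => ({
  ...r,
  state: stateCodes[r.state] || (isDirty(r) ? null : r.state),
}));
```

Each pass of detection typically surfaces new issues (here, a remaining null), which is why the slide describes cleaning as iterative.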
Sources of Dirty Data
• User entry errors
• Different schemas
• Legacy systems
• Evolving applications
• No unique identifiers
• Data migration
• Programmer error
• Corruption in transmission
Data is never clean
• What is clean data? What is clean enough?
• Can I work with the data? (Is it usable?)
• Do I trust the data? (Is it credible?)
• Can I learn from it? (Is it useful?)
Usability, Credibility, and Usefulness
• Data is usable if it can be parsed and manipulated by computational tools. Data usability is thus defined in conjunction with the tools by which it is to be processed.
• Data is credible if, according to one’s subjective assessment, it is suitably representative of a phenomenon to enable productive analysis
• Data is useful if it is usable, credible, and responsive to one’s inquiry.
Other Data Quality Issues
• Accuracy: data is free of errors and conforms to a gold standard
• Completeness: no missing values
• Consistency: matches other data
• Validity: conforms to a schema.
Blueprint for Cleaning
• Access your data
• Create a data cleaning plan
• Identify causes
• Define operations
• Test
• Execute the plan
• Manually correct
Assessing Accuracy
• Difficult because it requires a gold standard for the data
• Need to compare values with known correct data
• Some data errors are tolerable while others are not.
Assessing Completeness
• Schema completeness: the degree to which entities and attributes are not missing from the schema, i.e., whether a key piece of information or an entire entity is missing from the schema
• Column completeness: the degree to which a column of a table is free of missing values.
• Population completeness: the degree to which all members of the population that should be present are present.
Assessing Consistency
• Consistency between foreign key and primary key values
• Consistency between two related data elements
• Functional dependencies: zip → state
• Business rules: quantity → discount %
• Semantic relationships: sentiment orientation → numeric rating
• Consistency among multiple copies of the same data item
• E.g., John Doe's EmpID should be the same throughout the DB.
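A functional dependency such as zip → state can be checked mechanically. The sketch below uses hypothetical rows and is just one way to detect violations:

```javascript
// A dependency zip -> state holds if each zip maps to exactly one state.
const rows = [
  { zip: "72204", state: "AR" },
  { zip: "72204", state: "AR" },
  { zip: "72204", state: "TX" }, // violates zip -> state
];

const seen = new Map();        // first state observed for each zip
const violations = [];
for (const row of rows) {
  if (seen.has(row.zip) && seen.get(row.zip) !== row.state) {
    violations.push(row);      // conflicts with an earlier row
  } else {
    seen.set(row.zip, row.state);
  }
}
```

The same pattern (map each left-hand-side value to its first right-hand-side value, flag conflicts) works for business rules and other pairwise consistency checks.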
Assessing Validity
Determine the constraints on individual fields and ensure the field values adhere to those constraints:
• Foreign key constraints
• Cross-field constraints
• Data type
• Range
• Format
• Uniqueness
• Mandatory
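These checks can be expressed as per-field predicates. The fields and rules below are hypothetical, chosen to cover type, range, format, and mandatory constraints:

```javascript
// One predicate per field; a record is valid if every predicate passes.
const constraints = {
  quantity: (v) => Number.isInteger(v) && v >= 1 && v <= 1000, // type + range
  email:    (v) => /^[^@\s]+@[^@\s]+$/.test(v),                // format
  id:       (v) => v !== null && v !== undefined,              // mandatory
};

// Return the names of the fields that fail their constraint.
const validate = (record) =>
  Object.entries(constraints)
    .filter(([field, ok]) => !ok(record[field]))
    .map(([field]) => field);

const bad = validate({ quantity: 0, email: "not-an-email", id: "A17" });
```

Foreign-key and cross-field constraints need access to more than one field or table, but fit the same pass/fail shape.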
Major Types of Data Correction
• Remove/correct typographical errors
• Fill in missing data
• Validate data types, formats, and constraints
• Cross-check data
• Data enhancement:
• Structuring
• Enriching
• Standardizing
Transformation Process – Enhancement
• Structuring: manipulating the schema of the dataset. It involves
• Modifying the schema by splitting a column/field
• Collapsing multiple columns/fields into one
• Removing columns/fields entirely
• Changing the granularity of the dataset
• Enriching: adding columns/fields that bring new information to the dataset
• E.g., converting counts to percentages, deriving customer sentiment from their comments
• Standardization: uniformity of data types, storage formats, units, etc.
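The counts-to-percentages enrichment mentioned above can be sketched as follows (the category names and counts are made up):

```javascript
// Enriching: derive a pct field from raw counts, keeping the originals.
const counts = { books: 30, music: 50, games: 20 };
const total = Object.values(counts).reduce((a, b) => a + b, 0);

const enriched = Object.fromEntries(
  Object.entries(counts).map(([k, n]) => [k, { n, pct: (100 * n) / total }])
);
```

Note that enrichment adds information rather than replacing it: the raw count `n` survives next to the derived `pct`.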
Transformation Process – Integration
• Combines data from disparate sources into meaningful and valuable information.
• Challenges:
• Scale of source data
• Semi-structured data
• Heterogeneity:
• Source type heterogeneity
• Schema heterogeneity
• Data type heterogeneity
• Data value heterogeneity
• Semantic heterogeneity
Profiling
• Provides descriptive statistics and information about the data, and helps users decide which transformations to apply
• Assess whether the data can be used
• Assess data quality
• Discover metadata about the source data, including value patterns and distributions, candidate keys, foreign keys, and functional dependencies
• Understand data challenges
Information Obtained by Profiling
• Descriptive statistics:
• Central tendency: mean, median, mode
• Dispersion: min, max, standard deviation
• Other: variance, frequency, aggregate functions such as sum and count
• Metadata: data type, length, discrete values, occurrence of nulls, uniqueness, string patterns, etc.
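A minimal sketch of computing the central-tendency and dispersion statistics above for a hypothetical numeric column:

```javascript
// Hypothetical numeric column to profile.
const col = [4, 8, 15, 16, 23, 42];

const mean = col.reduce((a, b) => a + b, 0) / col.length;

const sorted = [...col].sort((a, b) => a - b);
const mid = sorted.length / 2;
const median = sorted.length % 2
  ? sorted[Math.floor(mid)]                // odd count: middle value
  : (sorted[mid - 1] + sorted[mid]) / 2;   // even count: average of middle two

const min = sorted[0];
const max = sorted[sorted.length - 1];

// Population variance and standard deviation.
const variance = col.reduce((s, x) => s + (x - mean) ** 2, 0) / col.length;
const stddev = Math.sqrt(variance);
```

Mode and the metadata items (null occurrence, uniqueness, string patterns) follow the same one-pass-over-the-column shape.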
Output
• Corresponds to the completion of the wrangling process
• Main material outputs:
• Wrangled datasets
• Script of transformation logic
• Documentation of data lineage/provenance
Free Data Wrangling Tools
• Tabula: converts PDF tables into spreadsheets
• OpenRefine: a friendly GUI for describing and manipulating data
• R package
• DataWrangler
• CSVKit
• Python and Pandas
• Mr. Data Converter
Introduction to MongoDB
• MongoDB – What, Why, Where & Advantages
• MongoDB Basics & Definition
• Notations & Terminology
• Key Features
• Comparison to SQL
• User Interface
What is MongoDB
• MongoDB => Humongous DB
• Document database
• MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.
• Optimal model parameters
• Different from the typical relational model
• MongoDB obviates the need for an Object-Relational Mapping (ORM) layer to facilitate development. (Source: mongodb.org)
Why MongoDB
• Document-oriented storage: data is stored in the form of JSON-style documents
• Index on any attribute
• Replication & High Availability
• Auto-Sharding
• Rich Queries
• Fast In-Place Updates
• Professional support by MongoDB
When to Use MongoDB
• Big Data applications
• Content Management & Delivery
• Mobile and Social Infrastructure
• User Data Management
• Location Services
• Analytics
MongoDB - Advantages
• Schema-less
• Structure of a single object is clear
• No complex joins
• Tuning
• Easy to Scale
• Faster access to data
Key Features
• High performance
• Indexes support faster queries
• Embedded data models reduce I/O on the DB
• High availability
• Replica sets, automatic failover, data redundancy
• Automatic scaling
• Horizontal scalability as part of its core functionality
• Automatic sharding, high throughput
Document Database - Definition
• A record in MongoDB is a document, which is a data structure composed of field and value pairs.
• MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.
Document Database - Example
Advantages:
• Documents (i.e. objects) correspond to native data types in many programming languages.
• Embedded documents and arrays reduce the need for expensive joins.
• Dynamic schema supports fluent polymorphism.
Example
• In an RDBMS, the schema design for these user requirements would need at least three tables.
Example (cont.)
• In MongoDB, the schema design has a single collection, post, with the following structure
Example (cont.)
• So to display the data, an RDBMS needs to join three tables, while MongoDB serves it from one collection only.
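The original slide images are not reproduced here, so as a sketch, one plausible shape of such a post document, with tags and comments embedded; all field names and values below are hypothetical:

```javascript
// A single `post` document: tags and comments are embedded, so no joins
// are needed to display a post. Hypothetical fields and values.
const post = {
  _id: "POST-1",                      // MongoDB would generate an ObjectId
  title: "Intro to Data Wrangling",
  body: "A short post body.",
  tags: ["wrangling", "mongodb"],     // a separate table in an RDBMS
  comments: [                         // likewise a separate table
    { user: "kiran", message: "Nice post", dateCreated: "2016-09-26" },
  ],
};
```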
MongoDB – JSON Format
• Data is described as name/value pairs
• Syntax: a name/value pair consists of a field name followed by a colon and then a value
• Example: "name": "Ningning Wu"
• Pairs are separated by commas
• Example: "name": "Ningning Wu", "univ": "UALR"
• Curly braces hold objects
• Example: {"name": "Ningning Wu", "univ": "UALR", "dept": "IFSC"}
• An array is enclosed in brackets []
• Example: [{"name": "Ningning Wu", "univ": "UALR", "dept": "IFSC"}, {"name": "Kiran", "univ": "UALR", "dept": "INFQ"}]
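The array example above is valid JSON once every field name is quoted; parsing it programmatically confirms the structure:

```javascript
// The slide's array example as a strict-JSON string (field names quoted).
const text =
  '[{"name": "Ningning Wu", "univ": "UALR", "dept": "IFSC"},' +
  ' {"name": "Kiran", "univ": "UALR", "dept": "INFQ"}]';

// Parsing yields an array of two objects, matching the slide's description.
const people = JSON.parse(text);
```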
MongoDB – Create Database
• use DATABASE_NAME is the command to create a database in MongoDB
• Syntax• use DATABASE_NAME
• In MongoDB, the default database is test if you do not create one.
• All collections will be stored there.
MongoDB – Drop Database
• db.dropDatabase() is the command to drop a database in MongoDB
• Syntax• db.dropDatabase()
• In MongoDB, the default database is test, so if you execute the drop database command without selecting another database, it deletes test.
MongoDB – Data Types
• String, Integer
• Boolean, Double
• Min/Max keys, Arrays
• Timestamp
• Object, Object ID
• Null, Symbol
• Date, Binary data
• Code, Regular expression
Collections in MongoDB
• MongoDB stores all documents in collections. A collection is a group of related documents that share a common set of indexes. Collections are analogous to tables in relational databases.
Query in MongoDB
Query Interface - SELECT
Data Modification - INSERT
Data Modification - UPDATE
Data Modification - DELETE
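The query and data-modification slides above originally showed shell screenshots. As a sketch, the corresponding mongo-shell (3.x) calls, and a plain-JavaScript mimic of their query-by-example semantics, look like this (the collection, field names, and values are hypothetical):

```javascript
// Mongo-shell equivalents (require a running server; shown as comments):
//   db.users.find({ age: { $gt: 25 } })                       // SELECT
//   db.users.insert({ name: "Dana", age: 31 })                // INSERT
//   db.users.update({ name: "Dana" }, { $set: { age: 32 } })  // UPDATE
//   db.users.remove({ name: "Dana" })                         // DELETE

// The same semantics mimicked on an in-memory array:
let users = [{ name: "Ann", age: 22 }, { name: "Bo", age: 28 }];

users.push({ name: "Dana", age: 31 });                  // insert
const adults = users.filter((u) => u.age > 25);         // find: age > 25
users = users.map((u) =>
  u.name === "Dana" ? { ...u, age: 32 } : u);           // update with $set
users = users.filter((u) => u.name !== "Dana");         // remove
```

In each call, the first argument is a query document that selects matching documents by example, which is the pattern the screenshots illustrated.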
MongoDB – Sort Records
• The sort() method is used to sort documents in MongoDB.
• sort() accepts a document containing a list of fields along with their sort order.
• By default, the sort() method returns documents in ascending order.
• 1 is used for ascending order
• -1 is used for descending order.
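The 1/-1 order flags can be illustrated in plain JavaScript; this only mimics the semantics, and in the mongo shell the equivalent call on a hypothetical collection would be db.users.find().sort({age: -1}):

```javascript
// Hypothetical documents to sort.
const docs = [
  { name: "a", age: 30 },
  { name: "b", age: 25 },
  { name: "c", age: 35 },
];

// order = 1 sorts ascending, order = -1 descending, as in sort({field: order}).
const sortBy = (field, order) =>
  [...docs].sort((x, y) => order * (x[field] - y[field]));

const ascending = sortBy("age", 1);
const descending = sortBy("age", -1);
```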
MongoDB – Indexing
• Indexes support the efficient resolution of queries.
• Syntax: db.COLLECTION_NAME.ensureIndex({KEY:1})
• Here KEY is the name of the field on which you want to create the index, and 1 specifies ascending order.
• To create an index in descending order, use -1.
MongoDB – Aggregation
• Aggregation operations process data records and return computed results.
• Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
• Syntax:
db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
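What such an aggregation computes can be illustrated in plain JavaScript. The pipeline in the comment below, a $group stage with a $sum over hypothetical orders, groups documents by a key and sums a field per group:

```javascript
// Mongo-shell equivalent (requires a running server; shown as a comment):
//   db.orders.aggregate([{ $group: { _id: "$cust", total: { $sum: "$amt" } } }])

const orders = [
  { cust: "A", amt: 100 },
  { cust: "B", amt: 40 },
  { cust: "A", amt: 50 },
];

// Group by customer and sum amounts, as $group/$sum would.
const totals = {};
for (const o of orders) {
  totals[o.cust] = (totals[o.cust] || 0) + o.amt;
}
```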
MongoDB – Replication
• Replication is the process of synchronizing data across multiple servers.
MongoDB – Sharding
• Sharding is the process of storing data records across multiple machines and it is MongoDB's approach to meeting the demands of data growth.
MongoDB – Create Backup
• The mongodump command is used to create a backup of a database. It dumps all the data on your server into the dump directory.
• Syntax: mongodump
• Restore data:
• To restore backed-up data, MongoDB's mongorestore command is used. It restores all of the data from the backup directory.
• Syntax: mongorestore
SQL - MongoDB Mapping Chart
RDBMS                      MongoDB
Database                   Database
Table                      Collection
Tuple/Row                  Document
Column                     Field
Table Join                 Embedded Documents and Linking
Primary Key                Primary Key (default key _id provided by MongoDB itself)

Database server and client:
mysqld / Oracle            mongod
mysql / sqlplus            mongo
SQL to MongoDB Aggregation
MongoDB Support for Drivers
• JavaScript, Python, Ruby
• PHP, Perl, Java, Scala
• C#, C, C++
• Haskell, Erlang
MongoDB – Limitations
• No Transactional Support
• No Relational Integrity
• No Joins
• RAM Intensive
References
• The MongoDB 3.2 Manual
• https://docs.mongodb.org/getting-started/shell/
• https://docs.mongodb.org/v3.0/MongoDB-manual-v3.0.pdf