Lecture @DHBW: Data Warehouse
01 Introduction & Motivation
Andreas Buckenhofer
May 22, 2020
Daimler TSS GmbH (a Daimler AG company)
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com
Registered office and court of registration: Ulm / Commercial register no. HRB 3844 / Management: Martin Haselbach (Chairman), Steffen Bäuerle
Andreas Buckenhofer, Senior DB Professional
Since 2009 at Daimler TSS
Department: Machine Learning Solutions
Business Unit: Analytics
• Oracle ACE Associate
• DOAG responsible for InMemory DB
• Lecturer at DHBW
• Certified Data Vault Practitioner 2.0
• Certified Oracle Professional
• Certified IBM Big Data Architect
• Over 20 years of experience with database technologies
• Over 20 years of experience with Data Warehousing
• International project experience
Daimler TSS Data Warehouse / DHBW 3
Change Log
Date Changes
02.10.2019 Initial version
What you will learn today
• Data Warehousing is a major topic of computer science
• After the end of this lecture you will be able to
• Understand current challenges towards a data-driven future
• Understand the basic business and technology drivers for data warehousing and Big Data
• Describe the characteristics of a data warehouse
• Describe the differences between production and data warehouse systems
We’re entering a new world in which data may be more important than software.
[Tim O’Reilly, Founder O’Reilly Media]
Data is a precious thing and will last longer than the systems themselves.
[Tim Berners-Lee, Father of the World Wide Web]
Information is the oil of the 21st century
[Peter Sondergaard, Gartner]
Everything we do in the digital realm ... creates a data trail. And if that trail exists, chances are someone is using it.
[Douglas Rushkoff, Author]
Data creation is exploding
[Gavin Belson, HBO’s Silicon Valley]
Data is the new gold
[Open Data Initiative, European Commission]
In a world deluged by irrelevant information, clarity is power.
[Yuval Noah Harari, Author]
Big data is not about the data
[Gary King, Harvard University]
Applications come, applications go.
The data, however, lives forever.
It is not about building applications;
it really is about the data underneath these applications
(Tom Kyte, Oracle)
What do you think is the biggest challenge in data?
• Technology?
• People and their know-how?
• Privacy & ethics?
• Data quality?
• Data sharing culture?
• Transparency, trust and security?
Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
Introduction
Data in, intelligence out? Data producers vs data consumers
• Internet of Everything
• Industry 4.0
• Artificial Intelligence
• Connected Cars
• Data Cybersecurity
• Smart Cities
• Digital twins
• Data Ethics and Data Privacy
• Robotics
• Digitization
• Social Media
• Alexa, Cortana, Siri
• Online Transaction Processing (OLTP)
• Audio/Video Streaming
Data producers
Source: Barry Devlin: Business unIntelligence: Insight and Innovation beyond Analytics and Big Data, Technics Publications 2013, chapter 6.5
Information technology (1960s to 1980s)
• Many systems throughout the enterprise for dedicated purposes
• Support daily transactions / day-to-day business
• Target: replace manual and time-consuming activities
• Data embedded in process-specific applications
• Process orientation + dedicated purpose
• Customer data, order data, etc. spread over many systems in many variations and with contradictions
Sample applications for an airline
• Flight Reservation System
• Airline Frequent Flyer System
• Internal Human Resources System
• Inventory Purchasing Systems
• Operational Planning
• Maintenance Tracking
• Billing System
• CRM System, e.g. campaigns
[Figure: data labels attached to these systems, several of them to more than one system: customer data, planes, crews, seats, food / drinks]
Need for decision support system / management information system
[Figure: the airline applications and their data as on the previous slide, with a DSS / MIS added]
Early decision support systems (1960s to 1980s)
Can be characterized as “Unplanned decision support” or “Unplanned Management Information Systems (MIS)”
• Management needs reports / combined data from different systems to make decisions for the company
• Reports are written manually by IT staff
• Extract, combine, accumulate data
• Writing a report and getting the data can take several days
• Error-prone and labour-intensive
• Relevant information may be forgotten or combined in a wrong way
Did not really work
Information technology today - further requirements
• Data still spread across many applications, but additional requirements
• Data as Asset, getting more and more important across all industries
• Not only classical data-intensive companies like Google or Facebook
• Increasing interest e.g. in insurance, health care, automotive, …
• Connected cars, Smart Home, Tailor-made insurances, etc.
• Hype technologies
• New database technologies like NoSQL and Big Data
• DWH still booming with additional stimuli coming from Big Data, Digitization, Internet of Things (IoT), Industry 4.0, Real Time, Time to Market, etc.
Exercise – data services/products
• What are data services that enrich or replace "hard" products?
• Where does data improve or influence the customer product experience?
Sample data services
• Driving style: car insurers offer dedicated products, e.g. cheaper rates for good drivers
• Health care
• Today: one drug fits all
• Tomorrow: personalized therapy based on patient profiles
• Connected, autonomous cars/vans/etc.: no driver required, fewer accidents
• Airbnb: does not own hotels, acts as a broker
• 360-degree view of the customer: dedicated offers, e.g. on the smartphone
Caution: privacy, ethics
OLTP: Online Transaction Processing
Exercise – OLTP systems
Outline at least 5 operational systems for a vehicle manufacturer
• Which data is stored by these systems?
• Which operations do they perform?
• Which questions can be answered by these systems (and which questions cannot be answered = major problems for decision support)?
Sample OLTP systems
• Vehicle production: Vehicle, Plant, Worker, Robot
• Car rentals: Driver, Booking, Vehicle, Route
• Parts logistics: Part, Plant, Supplier, Route
• Financial services: Credit, Customer, Bank account
• Workshop: Repair data, Parts, Vehicle, Diagnostic data
• Vehicle sales: Customer, Seller, Vehicle, Production date
• Truck fleet management: Truck, Route, Driver
• Engineering, research and development: Engineer, Prototype, Vehicle, Tests
• Website and car configurator: Vehicle, CRM Lead, Interior
• etc. …
Challenge
How to get an overall view across OLTP applications / functions that works?
Major problems for effective decision support
Distributed data
Different data structures
Historic data
System workload
Inadequate technology
Distributed data
Problem: Data resides on
• different systems / storages
• different applications
• different technologies
Solution: Data has to be accumulated on one system for further analysis
• Data is inhomogeneous, e.g. each system has its own customer number or order number, etc.
• How to combine the data?
• Data must be ingested regularly (e.g. daily), not ad hoc
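One common way to approach the question of how to combine such data is a cross-reference table that maps each source system's own customer number to a single warehouse key. The following is only a minimal sketch: the system names (`reservation`, `frequent_flyer`), the field names, the sample records, and the choice to match customers on e-mail address are all assumptions made for illustration. Real matching logic is usually far more involved.

```python
from itertools import count

# Hypothetical extracts: each source system uses its own customer number.
reservation = [{"cust_no": "R-17", "email": "john@example.com"}]
frequent_flyer = [
    {"member_id": 90211, "email": "john@example.com"},
    {"member_id": 90212, "email": "jim@example.com"},
]


def build_key_map(sources):
    """Assign one warehouse key per customer, matched on e-mail (an assumption)."""
    next_key = count(1)
    key_by_email = {}
    key_map = {}  # (system, source key) -> warehouse key
    for system, key_field, rows in sources:
        for row in rows:
            email = row["email"].lower()
            if email not in key_by_email:
                key_by_email[email] = next(next_key)
            key_map[(system, row[key_field])] = key_by_email[email]
    return key_map


key_map = build_key_map([
    ("reservation", "cust_no", reservation),
    ("frequent_flyer", "member_id", frequent_flyer),
])
```

With such a map, rows from both systems that belong to the same customer can be loaded under one warehouse key, while the original source keys stay traceable.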
Different data structures
Problem: Systems developed independently from each other
• Different data types
• E.g.: zip-code as integer or character string
• Different encodings
• E.g.: kilometers or miles
• Different data modeling
• E.g.: last name / first name in different fields vs last name / first name (badly modelled) in one single field
Solution: Dedicated system required that harmonizes / standardizes the data
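The three kinds of differences listed above (data types, encodings, modeling) can be illustrated with small harmonization functions. This is a sketch under stated assumptions, not a prescribed implementation: the function names are hypothetical, zip codes are assumed to have five digits, and names are assumed to arrive either as "Last, First" or as "First Last".

```python
MILES_TO_KM = 1.609344  # international mile, exact by definition


def harmonize_zip(value):
    """Store zip codes as 5-character strings so leading zeros survive."""
    return str(value).strip().zfill(5)


def to_km(distance, unit):
    """Convert a distance to kilometers; `unit` is 'km' or 'mi'."""
    if unit == "km":
        return float(distance)
    if unit == "mi":
        return distance * MILES_TO_KM
    raise ValueError(f"unknown unit: {unit}")


def split_name(full_name):
    """Split 'Last, First' or 'First Last' into (first, last)."""
    if "," in full_name:
        last, first = (part.strip() for part in full_name.split(",", 1))
    else:
        first, _, last = full_name.strip().rpartition(" ")
    return first, last
```

For example, a zip code delivered as the integer 1067 would be stored as the string "01067", and "Brown, John" and "John Brown" would both end up as first name "John" and last name "Brown".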
Issues with historic data
Problem: Data is updated and deleted or archived after max. 3 months
• daily transactions produce lots of data
• limited size of storage → large amounts of data fill up systems
Historic data is required for decision support
• e.g. how did sales figures develop compared to last month / last year / etc.
Solution: All data (changes) have to be stored in a system capable of dealing with huge amounts of data
Issues with system workload
Problem: Performance not optimized for new workloads
• Systems stressed by additional load (due to reports)
• Not optimized for this kind of workload
• Performance of daily transaction business jeopardized
• May possibly lead to system failure!
• Imagine what happens if a system like Amazon gets slow
Solution: Dedicated system that handles complex (arithmetic) queries on huge amounts of data, i.e. a system that is optimized for that kind of workload
Inadequate technology
Problem: Tooling and technology different from OLTP
• Inadequate tools for data integration and analysis
• Infrastructure configured for OLTP transactions and not for DWH load
• Storage systems and processors too weak to fulfill the requirements
Solution: Standard tools and technology that help to increase productivity and solve such problems, e.g. reporting tools for data analysis or ETL tools for data ingestion/load
Challenges are similar today, or even greater
• Large amounts of data
• Multiple technologies
• Multiple data formats
• Multiple data schemas
• Rapid changes in data schemas
• Complexity of legacy data
• Data quality challenges
Conclusion
Operational systems are not suitable for analytical evaluations
Need for a new, separate system
• fast answers, ad-hoc questions possible
• no interference with daily transaction business
→ Data Warehouse
Exercise: Data Warehouse user
List possible (functional and non-functional) requirements for a data warehouse end-user. Think of deficiencies of transactional systems like
• Distributed data
• Different data structures
• Problem with historic data
• Problem with system workload
• Inadequate technology
What are requirements from a Data Warehouse user perspective? (List at least 5 requirements)
Data Warehouse User
• Wants to trust the data: quality assured data
• Wants to access and analyze all data in a single database
• Wants to get a complete analysis including history, e.g. where did the customer live 5 years ago, or how did bookings develop over the last 10 days?
• Wants fast data access for their queries
• Wants to understand the data model = one single and easy data model instead of many different applications
• Wants to browse through combined data sets to identify correlations or new insights
Data Warehouse - summary
• Contains data from different systems
• Imports data from different systems on a regular basis
• Contains detailed data and summarized data
• Provides historic data
• Generates metadata
• OLTP applications remain, DWH is a completely new system
• Overcomes difficulties when using existing transaction systems for those tasks
Definition
Definitions … not always agreed on a single one
Data Warehouse
Business Intelligence
Big Data
First DWH architecture by Devlin/Murphy (1988)
• "Users can now focus on the use of the information rather than on how to obtain it" (p. 61)
• "Although data may reside in multiple locations, the appearance is of a single source" (p. 63)
• "Each user sees information from different company tables combined in a way that makes the data most meaningful" (p. 67)
• "Business Data Warehouse (BDW): the BDW is the single logical storehouse of all information used to report to the business" (p. 67)
• "For the first time, the end user is given the benefit of the information stored in the Data Dictionary" (p. 75)
Source: http://www.9sight.com/pdfs/EBIS_Devlin_&_Murphy_1988.pdf
http://www.9sight.com/1988/02/art-ibmsj-ebis/
Data Warehouse definitions by two “fathers” of the DWH
Ralph Kimball: “A data warehouse is a copy of transaction data specifically structured for querying and reporting”
William Harvey “Bill” Inmon: “A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process”
Subject-oriented
• A data warehouse is organized around the major subjects (business entities) of the enterprise like
• Customer
• Vendor
• Car
• Transaction or activity
• In contrast to the application/process/functional orientation such as
• Booking application
• Delivery handling
Subject-oriented - example
OLTP (organized by application)
• Flight Reservation System: Passengers, Bookings
• Flight Operation System: Crews, Planes
• Airline Frequent Flyer System: Customer, Points
DWH (organized by subject): Customer, Planes
• Marketing: Which are popular destinations, e.g. Paris, and make the customer an exclusive offer.
• Planning: How many flight kilometers and flight times do planes have? When does a plane need maintenance?
• Capacity planning: What is the forecasted passenger demand for flights to London? Is a larger plane required on the route?
Integrated
Data contained in the warehouse are integrated.
Aspects of integration
• consistent naming conventions
• consistent measurement of variables
• consistent encoding structures
• consistent physical attributes of data (data types)
Integrated - example
OLTP → DWH
• System1: m,w / System2: male, female / System3: 1, 0 → m,w
• System1: John Brown / System2: Brown, J. / System3: Brown, Jo → John Brown
• System1: Varchar(5) / System2: Number(8) / System3: Char(12) → Varchar(12)
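The consistent-encoding aspect of the example above can be sketched as a per-system lookup table that translates each source code into the single warehouse code. This is an illustrative sketch only: the function name is hypothetical, and the slide does not say which of the values 1 and 0 in System3 stands for which gender, so that assignment is an assumption.

```python
# Per-system code tables based on the example above. The target codes follow
# the slide ("m", "w"). Assumption: in System3, 1 means "m" and 0 means "w".
CODE_MAPS = {
    "System1": {"m": "m", "w": "w"},
    "System2": {"male": "m", "female": "w"},
    "System3": {"1": "m", "0": "w"},
}


def integrate_gender(system, raw_value):
    """Translate a source-specific code into the warehouse code."""
    code = str(raw_value).strip().lower()
    try:
        return CODE_MAPS[system][code]
    except KeyError:
        # Unknown systems or codes are rejected instead of being guessed.
        raise ValueError(f"unmapped code {raw_value!r} from {system}") from None
```

Rejecting unmapped codes rather than passing them through is a design choice: it surfaces data quality problems during the load instead of hiding them in the warehouse.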
Nonvolatile
• Operations in an operational environment
• Insert
• Delete
• Update
• Select
• Operations in a data warehouse
• Insert: the initial and additional loading of data by (batch) processes
• Select: the access of data
• (almost) no updates and deletes (technical updates / deletes only)
Nonvolatile - example
OLTP (Flight Reservation System)
• Passenger John flies from Stuttgart to London on 15.02. at 06:00
• Insert into DB: Passenger John, From Stuttgart to London, 15.02. 06:00
• Passenger John changes his mind and flies at 10:00
• Update in DB: Passenger John, 15.02. 10:00
DWH
• Insert into DB: Passenger John, From Stuttgart to London, 15.02. 06:00
• Insert into DB: Passenger John, From Stuttgart to London, 15.02. 10:00
Nonvolatile - example
• What happens in the OLTP system if the customer cancels his booking?
• Delete operation in OLTP
• Seat becomes available again and can be sold to another passenger
• What happens in the DWH?
• Insert operation in DWH with e.g. a flag indicating that the customer cancelled/deleted his booking
• Business can analyze cancelled bookings: why might the customer have cancelled? How can this customer or other customers be kept from cancelling next time?
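The insert-only behaviour described in the nonvolatile examples can be sketched as an append-only store. The class and field names below are hypothetical and the store is just an in-memory list; the point is only that a rebooking and a cancellation both arrive as new rows, never as updates or deletes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookingRecord:
    passenger: str
    origin: str
    destination: str
    departure: str  # kept as text, as on the slide, e.g. "15.02. 06:00"
    cancelled: bool = False


class BookingHistory:
    """Insert-only store: records are appended, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def insert(self, record):
        self._rows.append(record)

    def rows(self):
        return tuple(self._rows)  # read-only view for queries


history = BookingHistory()
history.insert(BookingRecord("John", "Stuttgart", "London", "15.02. 06:00"))
# The rebooking arrives as a second insert, not as an update.
history.insert(BookingRecord("John", "Stuttgart", "London", "15.02. 10:00"))
# The cancellation is also an insert, marked with a flag.
history.insert(
    BookingRecord("John", "Stuttgart", "London", "15.02. 10:00", cancelled=True)
)

cancellations = [row for row in history.rows() if row.cancelled]
```

Because nothing is overwritten, the original 06:00 booking remains available for analysis even after the rebooking and the cancellation.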
Time-variant
• All data in the data warehouse is accurate as of some moment in time
• Has to be associated with a time stamp
• Once data is correctly recorded in the data warehouse, it cannot be updated or deleted
• Data warehouse data is, for all practical purposes, a long series of snapshots
• In the operational environment data is accurate as of the moment of access
• Operational data, being accurate as of the moment of access, can be updated as the need arises
Time-variant - example
DWH
• Insert into DB: Passenger John, From Stuttgart to London, 15.02. 06:00 (DB insert timestamp: 02.02. 15:03:21)
• Insert into DB: Passenger John, From Stuttgart to London, 15.02. 10:00 (DB insert timestamp: 02.02. 15:04:29)
• Insert into DB: Passenger Jim, From Hamburg to Munich, 18.02. 15:00 (DB insert timestamp: 05.02. 12:15:03)
• Insert into DB: Passenger Mike, From Hamburg to Munich, 15.02. 10:00 (DB insert timestamp: 05.02. 12:15:11)
• Insert into DB: Passenger John, From Stuttgart to London, 15.02. 10:00, Cancel Flag (DB insert timestamp: 08.02. 09:52:33)
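Because every row carries its insert timestamp, the warehouse can answer the question of what was known at a given moment. The sketch below replays the rows from the example above. It rests on several assumptions: the slide gives no year, so 2020 is used as a placeholder; the tuple layout and function names are hypothetical; and each passenger is assumed to hold only one booking, so the passenger name serves as the key.

```python
from datetime import datetime

YEAR = 2020  # assumption: the slide gives no year


def ts(day, month, time):
    """Build a load timestamp from day, month and 'HH:MM:SS'."""
    return datetime.strptime(
        f"{YEAR}-{month:02d}-{day:02d} {time}", "%Y-%m-%d %H:%M:%S"
    )


# (load timestamp, passenger, route, departure, cancelled)
rows = [
    (ts(2, 2, "15:03:21"), "John", "Stuttgart-London", "15.02. 06:00", False),
    (ts(2, 2, "15:04:29"), "John", "Stuttgart-London", "15.02. 10:00", False),
    (ts(5, 2, "12:15:03"), "Jim", "Hamburg-Munich", "18.02. 15:00", False),
    (ts(5, 2, "12:15:11"), "Mike", "Hamburg-Munich", "15.02. 10:00", False),
    (ts(8, 2, "09:52:33"), "John", "Stuttgart-London", "15.02. 10:00", True),
]


def state_as_of(rows, moment):
    """Latest known row per passenger among rows loaded up to `moment`."""
    latest = {}
    for row in sorted(rows, key=lambda r: r[0]):
        if row[0] <= moment:
            latest[row[1]] = row  # later snapshots replace earlier ones
    return latest
```

Calling `state_as_of(rows, ts(6, 2, "00:00:00"))` would show John on the 10:00 flight and not cancelled, while the same call for 9 February would show the cancellation: the series of snapshots lets any past state be reconstructed.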
Data warehouse definition update by Inmon (WWDVC conference 2018)
Source: https://twitter.com/lecyberax/status/996723448092266497
Integrated got a different meaning (storing raw data for various reasons)
Exercise - DWH
You outlined OLTP systems for a vehicle manufacturer in an earlier exercise.
Now start designing a Data Warehouse:
• Describe what data can be stored in it. Define at least 5 subject areas!
• Which questions can/should be answered with this information?
DWH – Subject areas
• Customer: Driver, Bank account, CRM Lead, Individual or company?
• Part: Supplier, Color, Partnumber, Description
• Vehicle: Truck, Prototype, Car, Formula-1 car
• Car Rental: GPS data, Rental start time, Rental end time, Bill
• Plant: Robots, Cars built, Location
Exercise – sample questions
Which customers own a car and use car rental regularly?
Which parts have the most defects? Can diagnostic data be used to predict potential defects and warn customers?
Which areas and times are popular for car rentals? Does it make sense to relocate cars to these areas? (e.g. cinema in the evening/night)
Data Warehouse (DWH) or Business Intelligence (BI)?
• DWH or BI: often used as synonyms
• DWH has a more technical focus
• central repository containing data from many sources: subject-oriented, integrated, nonvolatile, time-variant
• BI more business / process oriented with a broader focus
• “Business intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision making.” (Boris Evelson, Forrester Research, 2008)
This lecture has a broader focus – not just DWH as a central repository
OLTP vs OLAP – oversimplified view
[Figure: three OLTP applications (each could be a microservice), two "Decision Management" boxes, and four DBs]
OLTP vs OLAP (operational system vs DWH)
OLTP | OLAP
Online Transaction Processing | Online Analytical Processing
Transaction-oriented system | Query-oriented system
Optimized for insert and update consistency | Optimized for complex queries with short response times; ad-hoc queries
Many users change data | Only ETL process writes data
Selective queries on the data | Evaluations of all data including history (complex queries)
Avoid redundancy | Redundant data storage
Normalized data management (3NF) | De-normalized data management
Relational data modeling | Several layers with different data models, one model usually dimensional data modeling
Operative vs Integrated data
Aspect | Operative data | Integrated data
Handling | Structured, parallel processes with short and isolated ("atomic") transactions | Information for management (decision support)
Modeling | Process- and function-oriented, individual for each application | Different data models in one DWH; historic, stable and summarized data
# users | Many | Few(er) but increasing user base
System return time | Milliseconds | Seconds to minutes (even hours)
Operative vs analytical databases
Aspect | Operative DBs | Analytical DBs
Purpose | Processing of daily business transactions | Information for management (decision support)
Content | Detailed, complete, most recent data | Historic, stable and summarized data
Data amount | Small amount of data per transaction. Nested loop joins | Large amount of data for load, and often per query. Hash joins common
Data structure | Suitable for operational transactions | Several models; suitable for long-term storage and business analyses
Transactions | ACID; very short read/write transactions | Long load operations, longer read transactions
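The two join strategies named in the comparison above can be sketched in a few lines. This is a simplified illustration with hypothetical function names and invented sample rows, not how a database engine implements them: real engines typically pair nested loop joins with index lookups and spill large hash tables to disk.

```python
def nested_loop_join(outer, inner, key):
    """Compare every outer row with every inner row: fine for a few rows."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]


def hash_join(build, probe, key):
    """Build a hash table on one input, then probe it once per row of the other."""
    table = {}
    for b in build:
        table.setdefault(b[key], []).append(b)
    return [(b, p) for p in probe for b in table.get(p[key], [])]


# Invented sample data for illustration.
customers = [{"cust": 1, "name": "John"}, {"cust": 2, "name": "Jim"}]
bookings = [
    {"cust": 1, "to": "London"},
    {"cust": 1, "to": "Paris"},
    {"cust": 2, "to": "Munich"},
]

nl = nested_loop_join(customers, bookings, "cust")
hj = hash_join(customers, bookings, "cust")
```

Both functions return the same matching pairs. The nested loop does work proportional to the product of the input sizes, while the hash join reads each input once, which is why it tends to suit the large data volumes of analytical queries.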
What happens in an internet minute?
Source: https://www.allaccess.com/merge/archive/28030/2018-update-what-happens-in-an-internet-minute#sthash.IKyiTou1.uxfs
Big Data characteristics
Volume
• The amount of data
Velocity
• The speed at which data is generated
Variety
• The different types of data
Veracity
• The trustworthiness / accuracy of data
Volume 1(2)
What is a high amount of data?
• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data (2,560 terabytes when counting 1,024 terabytes per petabyte), the equivalent of 167 times the information contained in all the books in the US Library of Congress
• Internet: Google processed about 24 petabytes of data per day in 2009
Volume 2(2)
What is a high amount of data?
• Telecommunications (usage): AT&T transfers about 30 petabytes of data through its networks each day.
• As of January 2013, Facebook users had uploaded over 240 billion photos, with 350 million new photos every day. For each uploaded photo, Facebook generates and stores four images of different sizes, which translated to a total of 960 billion images and an estimated 357 petabytes of storage
Small data / smart data
Data vs information
1 Kilobyte (kB) = 1,000 Bytes = 10^3 Bytes
1 Megabyte (MB) = 1,000,000 Bytes = 10^6 Bytes
1 Gigabyte (GB) = 1,000,000,000 Bytes = 10^9 Bytes
1 Terabyte (TB) = 10^12 Bytes
1 Petabyte (PB) = 10^15 Bytes
1 Exabyte (EB) = 10^18 Bytes
1 Zettabyte (ZB) = 10^21 Bytes
1 Yottabyte (YB) = 10^24 Bytes
Source: https://www.cmswire.com/cms/information-management/big-data-smart-data-and-the-fallacy-that-lies-between-017956.php#null
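The unit table above can be turned into a small conversion helper, which also makes it easy to check figures quoted on the Volume slides. This is a sketch with hypothetical function names. It uses the decimal units from the table, so 2.5 PB comes out as 2,500 TB; the figure of 2,560 TB quoted for Walmart only results from the binary convention of 1,024 TB per PB.

```python
# Decimal (SI) units as listed above: each step is a factor of 1,000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value in SI units to bytes (1 kB = 10**3 bytes)."""
    return value * 10 ** (3 * UNITS.index(unit))


def convert(value, unit, target):
    """Convert between two SI units, e.g. PB to TB."""
    return to_bytes(value, unit) / to_bytes(1, target)


walmart_tb = convert(2.5, "PB", "TB")   # decimal units
walmart_tb_binary = 2.5 * 1024          # binary convention: 1 PB = 1,024 TB
facebook_images = 240 * 10**9 * 4       # four stored sizes per uploaded photo
```

The last line reproduces the Facebook figure from the Volume slide: 240 billion photos times four stored sizes gives 960 billion images.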
Velocity
What is high velocity?
• The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% of these streams, there are 100 collisions of interest per second
• Internet of Things
• Connected, autonomous Cars
Variety
• Structured data like tables typically stored in relational databases
• Unstructured data, usually generated by humans, e.g. natural language, voice, Wikipedia, Twitter posts, video, images
• Semi-structured data has some structure in tags, but the structure varies between documents, e.g. HTML, XML, JSON files, server logs
"Unstructured data" is a misleading term: tweets, for example, are structured, too.
Better: data with low information density.
Veracity
• Data involves some uncertainty and ambiguities
• Mistakes can be introduced by humans and machines
• #FakeNews
Data Quality is vital!
Garbage In – Garbage Out
Garbage data + perfect model => garbage results
Big Data definition 1(2)
• Still no agreed definition
• Originally and most used:
• Volume +
• Velocity +
• Variety
• Big data is a term used to refer to systems that are too complex for traditional data processing (it is often said that an RDBMS does not suffice anymore). Big data challenges include capturing data, data storage, data analysis, search, etc.
• Part of this lecture
Big Data definition 2(2)
• Another usage of the term "big data" refers to advanced data analytics or data science methods that extract value from data
• Not part of this lecture
Big Data landscape
Source: http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final.png