Data Warehouse and Data Mining
2015-01-08
Naeem Ahmed
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: naeemmahoto@gmail.com
Post on 01-May-2018
Data Warehouse and Data Mining Lecture No. 01
Introduction to Data Warehouse
Outline
• Introduction to Data Warehouse
• Data Warehouse versus Operational Database
• OLTP vs. DW
• Applications of DW
Source: www.stonebridgegroup.com
Data Warehouse
• Purpose of the data warehouse
  – Realize the value of the data
    • Data / information is an asset
    • Data / information can be sold
    • Methods to realize the value: reporting, analysis, data mining, etc.
• Make better decisions
  – Turn data into information
  – Create competitive advantages
  – Methods to support the decision-making process: DSS, etc.
Why data warehouse?
• Bad decisions can lead to disasters
  – Data warehousing is at the base of decision support systems
• Data warehousing is a data-driven decision-support system
• Data warehousing helps to
  – Understand the information hidden within the organization’s data
    • See data from different angles: product, client, time, geographical area
    • Get a glimpse of the future
Why data warehouse?
• DBMS approach
  – List of all items that were sold last month?
  – List of all items purchased by Naeem?
  – The total sales of the last month grouped by branch?
  – How many sales transactions occurred during the month of January?
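The DBMS-approach questions above map directly onto ordinary SQL. A minimal sketch using Python's built-in sqlite3 module, assuming a hypothetical single `sales` table (the table, its columns and the sample rows are illustrative and not part of the lecture):

```python
import sqlite3

# Hypothetical toy schema; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        item      TEXT,
        customer  TEXT,
        branch    TEXT,
        sale_date TEXT,   -- ISO date, e.g. 2015-01-08
        amount    REAL
    );
    INSERT INTO sales VALUES
        ('toothpaste', 'Naeem', 'Jamshoro',  '2015-01-05', 2.5),
        ('soap',       'Ali',   'Jamshoro',  '2015-01-20', 1.0),
        ('toothpaste', 'Naeem', 'Hyderabad', '2015-01-21', 2.5),
        ('shampoo',    'Sara',  'Hyderabad', '2015-02-02', 4.0);
""")

# List of all items purchased by Naeem
items_by_naeem = [row[0] for row in conn.execute(
    "SELECT DISTINCT item FROM sales WHERE customer = ? ORDER BY item",
    ("Naeem",))]

# Total sales of one month (January 2015 here), grouped by branch
january_by_branch = dict(conn.execute(
    """SELECT branch, SUM(amount) FROM sales
       WHERE sale_date >= '2015-01-01' AND sale_date < '2015-02-01'
       GROUP BY branch"""))

# Number of sales transactions during January
january_count = conn.execute(
    """SELECT COUNT(*) FROM sales
       WHERE sale_date >= '2015-01-01' AND sale_date < '2015-02-01'"""
).fetchone()[0]
```

Each of these is a fixed lookup or aggregate whose shape is known in advance, which is why an operational DBMS handles them well.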
Why data warehouse?
• Intelligent enterprise
  – Which items sell together? Which items to stock?
  – Where and how to place the items? What discounts to offer?
  – How best to target customers to increase sales at a branch?
  – Which customers are most likely to respond to the next promotional campaign, and why?
Why data warehouse?
• Businesses want much more…
  – What happened?
  – Why did it happen?
  – What will happen?
  – What is happening?
  – What do you want to happen?
What is a data warehouse?
• Basically a very large database…
  – Not all very large databases are data warehouses, but all data warehouses are pretty large databases
  – Nowadays a warehouse is considered to start at around 800 GB and go up to several TB
  – It spans several servers and needs an impressive amount of computing power
What is a data warehouse?
• More specifically, a collective data repository
  – Containing snapshots of the operational data (history)
  – Obtained through data cleansing and ETL (Extract-Transform-Load)
  – Useful for analytics
What is a data warehouse?
• Compared to other solutions it…
  – Is suitable for a tactical/strategic focus
  – Implies a small number of transactions
  – Implies large transactions spanning a long period of time
Some Definitions
• Ralph Kimball: “a copy of transaction data specifically structured for query and analysis”
• Bill Inmon (father of data warehousing, in 1993): a data warehouse is a
    • subject-oriented
    • integrated
    • non-volatile
    • time-variant
  collection of data in support of management’s decisions
Data Warehouse
• Subject oriented: Data is arranged by subject area rather than by application. Data is organized so that all the data elements relating to the same real-world event or object are linked together
  – Typical subject areas in DWs are Customer, Product, Order, Claim, Account, …
Data Warehouse
• Subject oriented:
  – Example: customer as subject in a DW
    • The DW is organized in this case by the customer
    • It may consist of 10, 100 or more physical tables, all related
Data Warehouse
• Integrated: Data is collected from the multiple, diverse sources of an organization's operational systems and stored in one consistent form
  – E.g. consistent encoding of gender, units of measurement, resolution of conflicting keys, …
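Integration of this kind usually happens in the transform step of ETL. The sketch below assumes two hypothetical source systems, `billing` and `crm`, that disagree on gender encoding, on the unit of a height measurement, and on customer keys. All field names, code mappings and the key-prefixing scheme are assumptions made for illustration, and prefixing is only one simple way to avoid key collisions.

```python
# Hypothetical source records; field names and encodings are assumptions.
billing_rows = [{"cust_id": "B-17", "gender": "M", "height_in": 70.0}]
crm_rows = [{"cust_id": 17, "gender": "male", "height_cm": 178.0}]

# One target encoding for all the variants found in the sources.
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "0": "F"}


def unify(row, source):
    """Map one source record onto the warehouse's single convention."""
    # Conflicting keys: prefix the source system so ids cannot collide.
    key = f"{source}:{row['cust_id']}"
    # Gender: translate every source encoding to 'M' / 'F'.
    gender = GENDER_CODES[str(row["gender"]).lower()]
    # Measurement: store centimetres everywhere.
    if "height_cm" in row:
        height_cm = row["height_cm"]
    else:
        height_cm = round(row["height_in"] * 2.54, 1)
    return {"customer_key": key, "gender": gender, "height_cm": height_cm}


warehouse_rows = ([unify(r, "billing") for r in billing_rows]
                  + [unify(r, "crm") for r in crm_rows])
```

After the transform, every row in `warehouse_rows` has the same fields, the same gender codes and the same unit, regardless of which system it came from.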
Data Warehouse
• Non-volatile: Data in the data warehouse is never overwritten or deleted. Once committed, the data is static, read-only, and retained for future reporting. Data is loaded, but not updated
  – When subsequent changes occur, a new snapshot record is written
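The snapshot rule above can be sketched as an append-only table. This is a minimal illustration using Python's sqlite3 module; the `customer_snapshot` table, its columns and the helper functions are hypothetical names, not something the lecture prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_snapshot (
        customer_id INTEGER,
        city        TEXT,
        loaded_on   TEXT      -- ISO date of the load
    )
""")


def load_snapshot(customer_id, city, loaded_on):
    """Append a new snapshot row; existing rows are never updated or deleted."""
    conn.execute("INSERT INTO customer_snapshot VALUES (?, ?, ?)",
                 (customer_id, city, loaded_on))


def city_as_of(customer_id, as_of):
    """Return the city from the latest snapshot on or before `as_of`."""
    row = conn.execute(
        """SELECT city FROM customer_snapshot
           WHERE customer_id = ? AND loaded_on <= ?
           ORDER BY loaded_on DESC LIMIT 1""",
        (customer_id, as_of)).fetchone()
    return row[0] if row else None


load_snapshot(1, "Jamshoro", "2014-03-01")
load_snapshot(1, "Karachi", "2015-01-08")  # customer moved: new row, no UPDATE
```

Because old rows stay in place, `city_as_of(1, "2014-12-31")` still sees the earlier city while `city_as_of(1, "2015-06-01")` sees the later one, which is what makes reports over time possible.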
Data Warehouse
• Time-variant: The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time
  – Different environments have different time horizons associated with them
    • While a 60-to-90 day time horizon is normal for operational systems, a data warehouse has a 5-to-10 year horizon
General Definition
• More generally, a DW is a
  – Repository of an organization’s electronically stored data
  – Designed to facilitate reporting and analysis
General Definition
A complete repository of historical corporate data extracted from transaction systems that is available for ad-hoc access by knowledge workers
• Transaction systems
  – Management Information System (MIS)
• Ad-hoc access
  – Does not have a certain access pattern
  – Queries are not known in advance
  – Difficult to write SQL in advance
• Knowledge workers
  – Typically NOT IT literate (executives, analysts, managers)
Data Warehousing
• A paradigm specifically designed for strategic business information or decision making
• Data warehousing is a data-driven decision-support system
Data Warehouse (definitions)
• Used for decision making, duplicates existing data, combination of hardware, specialized software and data – Dyche
• A copy of transaction data specifically structured for query and analysis – Kimball
• A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way that can be understood and used in a business context – Barry Devlin
Data Warehouse (definitions)
• A data warehouse is a database where data is collected for the purpose of being analyzed
• A data warehouse is used to help people make better decisions
• A data warehouse is defined by the use to which it is put, not its underlying architecture
Typical Features
• DWs typically…
  – Reside on computers dedicated to this function
  – Run on a DBMS such as Oracle, IBM DB2, Teradata or Microsoft SQL Server
  – Retain data for long periods of time
  – Consolidate data obtained from a variety of sources
  – Are built around their own carefully designed data model
What can be warehoused?
• Customer records
• Customer purchases
• Click stream, web traffic
• Product records
• Product purchase records
• Inventory movement
Use case
• DW stands for big data volume. Let's take an example of two big companies: a retailer, Walmart, and an RDBMS vendor, Sybase
  – Walmart CIO: I want to keep track of sales in all my stores simultaneously
  – Sybase consultant: You need our wonderful RDBMS software. You can stuff data in as sales are rung up at cash registers and simultaneously query data right in your office
• Walmart buys a $1 million Sun E10000 multi-CPU server, a $500,000 Sybase license and a book, “Database Design for Smarties”, and builds itself a normalized SQL data model
Use case
• After a few months of stuffing data into the tables... a Walmart executive asks…
  – I have noticed that there was a Colgate promotion recently, directed to people who live in small towns. How much toothpaste did we sell in those towns yesterday?
  – Translation to a query:
      select sum(sales.quantity_sold)
      from sales, products, product_categories, manufacturers, stores, cities
      where manufacturer_name = 'Colgate'
        and product_category_name = 'toothpaste'
        and cities.population < 40000
        and trunc(sales.date_time_of_sale) = trunc(sysdate - 1)
        and sales.product_id = products.product_id
        and sales.store_id = stores.store_id
        and products.product_category_id = product_categories.product_category_id
        and products.manufacturer_id = manufacturers.manufacturer_id
        and stores.city_id = cities.city_id
Use case
• The tables contain large volumes of data and the query involves a 6-way join, so it will take some time to execute
• The tables are at the same time also updated by new sales
• Soon after executives start their quest for marketing information, store employees notice that there are times during the day when it is impossible to process a sale
  – Any attempt to update the database freezes the computer for 20 minutes
Use case
• In minutes... the Walmart CIO calls Sybase tech support
  – Walmart CIO: WE TYPE IN THE TOOTHPASTE QUERY AND OUR SYSTEM HANGS!!!
  – Sybase support: Of course it does! You built an on-line transaction processing (OLTP) system. You can’t feed it a decision support system (DSS) query and expect things to work!
  – Walmart CIO: !@%$#. I thought this was the whole point of SQL and your RDBMS... to query and insert simultaneously!!
Use case
  – Sybase support: Uh, not exactly. If you’re reading from the database, nobody can write to the database. If you’re writing to the database, nobody can read from the database. So if you’ve got a query that takes 20 minutes to run and don’t specify special locking instructions, nobody can update those tables for 20 minutes.
  – Walmart CIO: It sounds like a bug.
  – Sybase support: Actually it is a feature. We call it pessimistic locking.
  – Walmart CIO: Can you fix your system so that it doesn’t lock up???
Use case
  – Sybase support: No. But we made this great loader tool so that you can copy everything from your OLTP system into a separate data warehouse system at 100 GB/hour
• After a while…
How does it work?
• Business user needs info
• Business user sends a request to the IT people
• IT people do system analysis and design
• IT people create reports
• IT people send reports to business user
• Business user may get answers
• Answers result in more questions
Data Warehouse vs. Operational Database
Data Warehouse        Operational Database
• Subject oriented    • Application oriented
• Integrated          • Multiple diverse sources
• Non-volatile        • Updateable
• Time-variant        • Real-time, current
OnLine Transaction Processing
• OLTP (OnLine Transaction Processing):
  – Also known as operational data, it represents day-to-day operational business activities:
    • Purchasing, sales, production, distribution, …
  – Typically for data entry and retrieval transaction processing
  – Reflects only the current state of the data
OnLine Analytical Processing
• OLAP (OnLine Analytical Processing):
  – Represents front-end analytics based on a DW repository
  – It provides information for activities like:
    • Resource planning, capital budgeting, marketing initiatives, ...
  – It is decision oriented
OLTP vs. DW
• Properties
Operational DB             DW
Mostly updates             Mostly reads
Many small transactions    Queries long, complex
MB-TB of data              GB-PB of data
Raw data                   Summarized data
Clerical users             Decision makers
Up-to-date data            May be slightly outdated
OLTP vs. DW
                    OLTP                                    Data Warehouse
users               clerk, IT professional                  knowledge worker
function            day-to-day operations                   decision support
DB design           application-oriented                    subject-oriented
data                current, up-to-date, detailed,          historical, summarized, multidimensional,
                    flat relational, isolated               integrated, consolidated
usage               repetitive                              ad-hoc
access              read/write, index/hash on prim. key     lots of scans
unit of work        short, simple transaction               complex query
# records accessed  tens                                    millions
# users             thousands                               hundreds
DB size             100 MB-GB                               100 GB-TB
metric              transaction throughput                  query throughput, response
OLTP vs. DW
• Basic insights from comparing OLTP and DWs
  – A DW is a separate (RDBMS) installation that contains copies of data from on-line systems
    • Physically separate hardware may not be absolutely necessary if one has lots of extra computing power, but it is recommended
  – With an optimistic locking DBMS one might even be able to get away for a while with keeping just one copy of the data
OLTP vs. DW
• There is an essentially different pattern of hardware utilization between on-line and analytical processing
Applications of DW
• Typical questions which can be answered with DW & OLAP
  – How much did sales unit A earn in January?
  – How much did sales unit B earn in February?
  – What was their combined sales amount for the first quarter?
• Answering these questions with SQL queries is difficult
  – Complex query formulation necessary
  – Process is likely to be slow due to complex joins and multiple scans
Applications of DW
• Why can such questions be answered better with a DW?
  – Because in a DW tables are rearranged and pre-aggregated (known as computing cubes)
    • The table arrangement is subject oriented, usually some star schema
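A small sketch of what "rearranged and pre-aggregated" can look like, using Python's sqlite3 module. The star schema here (`fact_sales` with the dimensions `dim_unit` and `dim_month`), the summary table `agg_unit_month` and all sample figures are hypothetical and chosen only to mirror the sales-unit questions above. A real OLAP cube covers many more dimension combinations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star schema: one fact table referencing small dimension tables.
    CREATE TABLE dim_unit  (unit_id INTEGER PRIMARY KEY, unit_name TEXT);
    CREATE TABLE dim_month (month_id INTEGER PRIMARY KEY,
                            month_name TEXT, quarter INTEGER);
    CREATE TABLE fact_sales (unit_id INTEGER, month_id INTEGER, amount REAL);

    INSERT INTO dim_unit  VALUES (1, 'A'), (2, 'B');
    INSERT INTO dim_month VALUES
        (1, 'January', 1), (2, 'February', 1), (4, 'April', 2);
    INSERT INTO fact_sales VALUES
        (1, 1, 100.0), (1, 1, 50.0), (2, 2, 70.0), (1, 2, 30.0), (2, 4, 500.0);

    -- Pre-aggregation: compute the totals once, at load time.
    CREATE TABLE agg_unit_month AS
        SELECT unit_id, month_id, SUM(amount) AS total
        FROM fact_sales
        GROUP BY unit_id, month_id;
""")


def unit_month_total(unit_name, month_name):
    """How much did a sales unit earn in a given month?"""
    row = conn.execute(
        """SELECT a.total FROM agg_unit_month a
           JOIN dim_unit u  ON u.unit_id = a.unit_id
           JOIN dim_month m ON m.month_id = a.month_id
           WHERE u.unit_name = ? AND m.month_name = ?""",
        (unit_name, month_name)).fetchone()
    return row[0] if row else 0.0


def quarter_total(quarter):
    """Combined sales amount of all units for one quarter."""
    return conn.execute(
        """SELECT COALESCE(SUM(a.total), 0) FROM agg_unit_month a
           JOIN dim_month m ON m.month_id = a.month_id
           WHERE m.quarter = ?""",
        (quarter,)).fetchone()[0]
```

The analytical questions now read from the small summary table instead of scanning every sale, at the cost of recomputing `agg_unit_month` whenever new data is loaded.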
Applications of DW
• A DW is the base repository for front-end analytics
  – OLAP
  – KDD
  – Data visualization
  – Reporting
KDD (Knowledge Discovery in Databases): a data mining process
Applications of DW
• OLAP is a form of information processing and thus needs to provide timely, accurate and understandable information
  – Timely is however a relative term:
    • In OLTP one expects an update to go through in a matter of seconds
    • In OLAP answering a query can take minutes, hours or even longer
• There are many flavors of OLAP
  – ROLAP, DOLAP, MOLAP, WOLAP, HOLAP, …
Applications of DW
• KDD (Data Mining): Constructs models of the data in question
  – Models can be viewed as high-level summaries of the underlying data
  – Example: a query returns the data that fulfills the constraints
    • SELECT * FROM CUSTOMER_TABLE WHERE TOTAL_SPENT > 100;  -- TOTAL_SPENT in €
Applications of DW
  – Data mining might return the following set of rules for customers spending more than €100:
    • IF AGE > 35 AND CAR = ‘MINIVAN’ THEN TOTAL SPENT > €100
    • IF SEX = ‘M’ AND ZIP = 38106 THEN TOTAL SPENT > €100
  – It answers questions like
    • Which products or customers are more profitable
    • Which outlets have sold the least this year
  – In consequence it motivates decisions like
    • Which products should have their production increased
    • Which customers should be targeted for special promotions
    • Which outlets should be closed
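To make the IF ... THEN rules above concrete, the sketch below scores one candidate rule against a handful of made-up customer records. It only evaluates a given rule; it does not discover rules, and the slides do not say which mining algorithm would produce them. Support and confidence are the measures commonly used for such rules; the records and the function name `rule_quality` are assumptions for illustration.

```python
# Hypothetical customer records; attribute names follow the rules above.
customers = [
    {"age": 41, "car": "MINIVAN", "total_spent": 180},
    {"age": 52, "car": "MINIVAN", "total_spent": 130},
    {"age": 38, "car": "MINIVAN", "total_spent": 60},
    {"age": 29, "car": "SEDAN",   "total_spent": 150},
    {"age": 45, "car": "SEDAN",   "total_spent": 40},
]


def rule_quality(rows, condition, consequence):
    """Return (support, confidence) of the rule IF condition THEN consequence."""
    matching = [r for r in rows if condition(r)]          # IF part holds
    correct = [r for r in matching if consequence(r)]     # THEN part also holds
    support = len(correct) / len(rows)
    confidence = len(correct) / len(matching) if matching else 0.0
    return support, confidence


# IF AGE > 35 AND CAR = 'MINIVAN' THEN TOTAL SPENT > 100
support, confidence = rule_quality(
    customers,
    condition=lambda r: r["age"] > 35 and r["car"] == "MINIVAN",
    consequence=lambda r: r["total_spent"] > 100,
)
```

Unlike the SELECT query, which lists the customers who spent more than €100, a rule with high confidence describes who tends to spend that much, which is what supports decisions such as targeting a promotion.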
DW User
• Users of a DW are called DSS analysts and usually are business persons
  – Their primary job is to define and discover information used in corporate decision-making
  – The way they think
    • “Give me what I say I want, and then I can tell you what I really want”
    • They work in an explorative manner
DW User
  – Typical explorative line of work
    • “Ah! Now that I see what the possibilities are, I can tell what I really want to see. But until I know what the possibilities are, I cannot describe exactly what I want...”
  – This usage has a profound effect on the way a DW is developed
    • The classical system development life cycle assumes that the requirements are known at the start of design
    • The DSS analyst starts with existing requirements, but factoring in new requirements is almost impossible