© 2015 IBM Corporation
© 2015 IBM Corporation
DDB-1560: Hadoop as Storage for Aging Data - Simplicity with Big SQL and DB2
Eugen Stoianovici ([email protected])
Software Developer, IBM
02:00 pm - 03:00 pm
Thursday, October 29th, 2015
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal
without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction
and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or
legal obligation to deliver any material, code or functionality. Information about potential future
products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our
products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a
controlled environment. The actual throughput or performance that any user will experience will vary
depending upon many factors, including considerations such as the amount of multiprogramming in the
user’s job stream, the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results similar to those stated
here.
Please Note:
2
• Problem description – data aging
• Big SQL – SQL on Hadoop for online query archive
• Data movement tools with BigInsights
• Big SQL – Tips for maximizing usability
Agenda
2
Modern Warehouse
[Architecture diagram: all data sources (video/audio, network/sensor, etc.) flow through information ingestion and operational information into a real-time analytic zone (stream processing, Streams); a landing area, analytics zone and archive; and exploration, integrated warehouse, and mart zones. Data integration and master data connect the zones, which feed analytic applications – BI and predictive analytics, decision management, discovery, deep reflection, operational, predictive, entity analytics, intelligence analysis – over raw data, structured data, text analytics, data mining, entity analytics and machine learning. Information governance, security and business continuity span the whole stack.]
IBM BigInsights for Hadoop
[Same architecture diagram as the previous slide, highlighting where IBM BigInsights for Hadoop sits: the landing area, analytics zone and archive.]
• Create a large area of storage for cold data
• Immediate cost savings, freeing EDW capacity
Placing lowest detailed schemas in BigInsights / Hadoop
Offloading ETL into a cheaper, more elastic MPP environment
Removing cost of HA/DR from data
• Data remains online, instead of nearline or on tape
More detailed and more variety of data
Retain longer history of detailed data
Queryable Archive
5
• Hadoop is not a single piece of software – you can't install "Hadoop"
• It is an ecosystem of software components that work together
Hadoop Core (APIs)
HDFS (File system)
MapReduce (Data processing)
Hive (SQL access)
HBase (NoSQL database)
Spark (compute engine)
Sqoop (Data movement)
Oozie (Job workflow)
…. There is a LOT of "Hadoop" software
• Open Data Platform Initiative – facilitates adoption of Hadoop
What is Hadoop
6
Overview of BigInsights
[Stack diagram:]
• IBM BigInsights Enterprise Management: POSIX distributed filesystem; multi-workload, multi-tenant scheduling
• IBM BigInsights Data Scientist: text analytics; machine learning on Big R; Big R (R support)
• IBM BigInsights Analyst: industry-standard SQL (Big SQL); spreadsheet-style tool (BigSheets)
• IBM Open Platform with Apache Hadoop (HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper)
Free Quick Start (non-production):
• IBM Open Platform
• BigInsights Analyst, Data Scientist features
• Community support
IBM BigInsights for Apache Hadoop
IBM InfoSphere BigInsights is now IBM BigInsights for Apache Hadoop!
Part II - Hadoop and SQL
• Data in Hadoop is truly “schema-less”
• Hadoop is API based at lower levels
Requires strong programming expertise
Steep learning curve
Simple operations can be tedious
• In most cases, data is structured
e.g. aging old data
• SQL when possible
Already lots of expertise available
Separation of what you want vs. how to get it
Robust ecosystem of tools
SQL on Hadoop
9
• Part of the value-add packages for BigInsights
• IBM’s SQL for Hadoop
Opens Hadoop data to a wider audience
• Complements the Data Warehouse
Exploratory analytics
Sandbox
Data Lake
• Use familiar SQL tools
Cognos
MicroStrategy
DSM
What is Big SQL?
Real-time Analytics: InfoSphere Streams
Data Governance and Security: Data Click, LDAP and Secured Cluster
Data Exploration: BigSheets "schema-on-read" tooling
Predictive Modeling: Big R, scalable data mining on R
Text Analytics: text processing with AQL
Application Tooling: toolkits and accelerators
• Key advantages of Big SQL
Performance
SQL “Richness”
Enterprise features & security
• How we’re doing it
Design principles
Architecture
Big SQL - topics
11
Big SQL V4.1 vs Hive 1.2.1 @ 1TB TPC-DS*
12
*Not an official TPC-DS Benchmark.
• In 99 / 99 queries, Big SQL was faster
• On average, Big SQL was 21X faster
• Excluding the top 5 and bottom 5 results, Big SQL was 19X faster
14
Big SQL runs more SQL out-of-box
Porting effort: Big SQL 4.1 – 1 hour; Spark SQL 1.5.0 – 3-4 weeks
Big SQL can execute all 99 queries with minimal porting effort
Single stream results:
• Big SQL was faster than Spark SQL on 76 / 99 queries
• When Big SQL was slower, it was only slower by 1.6X on average
• Query vs. query, Big SQL is on average 5.5X faster
• Removing top 5 / bottom 5, Big SQL is 2.5X faster
• Leverage IBM’s rich SQL history and expertise
Modern SQL:2011 capabilities
DB2 compatible SQL – PL support
• SQL bodied functions
• SQL bodied stored procedures
• Robust error handling
• Application logic/security encapsulation
• Flow of control logic
• Same SQL syntax used in the EDW works on Big SQL.
• Grouping sets with CUBE and ROLLUP
• Windowing and OLAP aggregates
Big SQL - SQL Capabilities
15
Hadoop-DS: Performance Test update: Big SQL V4.1 vs. Spark SQL 1.5.1 @ 1 TB, single stream*
17
*Not an official TPC-DS Benchmark.
But … what happens when you scale it?

Scale: 1 TB
• Single stream:
  – Big SQL was faster on 76 / 99 queries
  – Big SQL averaged 5.5X faster
  – Removing top / bottom 5, Big SQL averaged 2.5X faster
• 4 concurrent streams:
  – Spark SQL FAILED on 3 queries
  – Big SQL was 4.4X faster*

Scale: 10 TB
• Single stream:
  – Big SQL was faster on 80 / 99 queries
  – Spark SQL FAILED on 7 queries
  – Big SQL averaged 6.2X faster*
  – Removing top / bottom 5, Big SQL averaged 4.6X faster
• 4 concurrent streams:
  – Big SQL elapsed time for the workload was better than linear
  – Spark SQL could not complete the workload (numerous issues); partial results possible with only 2 concurrent streams

*Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)
What is the verdict? Use the right tool for the right job

Big SQL – ideal tool for BI data analysts and production workloads:
• Migrating existing workloads to Hadoop
• Security
• Many concurrent users
• Best-in-class performance

Spark SQL – ideal tool for data scientists and discovery:
• Machine learning
• Simpler SQL
• Good performance
• Modern security model
GRANT/REVOKE support
GROUP permissions
ROW level permissions
COLUMN level permissions
• Data Auditing tools
e.g. IBM InfoSphere Guardium
Big SQL enterprise grade features
20
Big SQL Architecture
21
[Architecture diagram:]
• Management nodes: the Big SQL Master Node and the Big SQL Scheduler, together with the database service, Hive Metastore, Hive Server, and DDL / UDF FMPs
• Compute nodes (repeated across the cluster): a Big SQL Worker Node (Java I/O FMP, Native I/O FMP, UDF FMP, temp data) co-located with the HDFS Data Node, MR Task Tracker and other services over the HDFS data
*FMP = Fenced mode process
• Head node HA Support
• Easy to extend
Add data nodes – more storage
Worker nodes – processing power
• Fault tolerance
Only use online workers for computations
MPP – Flexible architecture
22
Big SQL – data
23
CREATE HADOOP TABLE bigsql.sales (
  part_id INTEGER,
  part_name VARCHAR(32),
  quantity INTEGER,
  cost DECIMAL(10,4)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/bigsql/sales';

12321,apple,10,10.20
32881,banana,5,11.91
43321,salt,1,0.90
44552,sugar,1,0.80

Big SQL does not actually "store" the data – it interprets it in place.
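The "interpret, don't store" point can be sketched in miniature. This is illustrative Python, not Big SQL internals – the schema and sample rows simply mirror the bigsql.sales example above:

```python
# Toy "schema-on-read" reader: the table definition (column names/types,
# the ',' delimiter) is applied when the file is READ, not when it is
# written -- the raw file on HDFS is left exactly as-is.
from decimal import Decimal

# Schema as declared in the bigsql.sales example
SCHEMA = [("part_id", int), ("part_name", str),
          ("quantity", int), ("cost", Decimal)]

def read_delimited(lines, schema=SCHEMA, delimiter=","):
    """Interpret raw delimited lines as typed rows at read time."""
    rows = []
    for line in lines:
        fields = line.strip().split(delimiter)
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, fields)})
    return rows

raw = ["12321,apple,10,10.20", "32881,banana,5,11.91"]
rows = read_delimited(raw)
print(rows[0]["part_name"], rows[0]["cost"])  # apple 10.20
```

Because the schema lives outside the data, a different tool (or a second table definition) can interpret the very same files another way.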
• Support for all major file formats
ORC
Parquet
Avro
Sequence
etc.
• Does not lock users in
Files can be used with external tools (Spark, MR, etc)
Big SQL – open storage
24
Part III – Tools to offload data to Online Archive
• Moves data from EDWs and external stores into HDFS
• Bulk import
Import individual tables or entire databases
• SQL import
Import result of SQL execution
• Data interaction
Support for interacting with data through Java classes
• Data export
Offload data from HDFS into relational databases
Simple with BigInsights - Sqoop
26
[Diagram: Sqoop moves data between the EDW and Hadoop / HDFS in both directions.]
• SQL statement executable in Big SQL
• Extracts data from external sources
RDBMS
EDW
HDFS
• Loads data into Big SQL tables
Understands Big SQL tables (e.g. columns, partitioning, etc.)
Understands Big SQL table properties (e.g. delimiters, etc.)
Simple with Big SQL - LOAD
27
CREATE HADOOP TABLE staff(
id INT,
name VARCHAR(64),
job VARCHAR(10))
PARTITIONED BY (years INT);
LOAD HADOOP USING
  JDBC CONNECTION URL
    'jdbc:db2://edw:5000/SAMPLE'
  WITH PARAMETERS (
    'credential.file' =
      '/user/bigsql/mycreds'
  )
FROM TABLE STAFF
  COLUMNS(ID, NAME, JOB, YEARS)
  WHERE 'YEARS is not null'
WITH TARGET TABLE PROPERTIES (
  'delimited.drop.loadedlims' = 'true'
)
WITH LOAD PROPERTIES (
  'max.rejected.records' = 50,
  'num.map.tasks' = 2)
• Transparent
acts as a single source for all data sources
• Extensible
Bring heterogeneous sources together
• High functionality
use the same rich SQL against all sources
• High performance
understands capabilities of sources and pushes down computation
Simple with Big SQL – Federation
28
Part IV – Tips for better SQL usability of your Online Archive
• SQL is standard but internals are not
• Some things require more attention
Point queries
Normalization
DML (UPDATE/DELETE)
Tips for better performance
30
• Table partitioning for data elimination
• Special file formats
ORC
Parquet
• HBase with secondary indexes
Point query optimization
31
• Partitioning is defined on one or more columns
• Each unique value becomes a partition
• Query predicates are used to eliminate partitions
Big SQL table partitioning
32
CREATE TABLE demo.sales (
part_id int,
part_name string,
qty int,
cost double
)
PARTITIONED BY (
state char(2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
/biginsights/hive/warehouse/demo.db/sales/
  state=NJ/  state=AR/  state=CA/  state=NY/
  (each partition directory holds its own data files, e.g. data1.csv, data2.csv)

select * from demo.sales
where state in ('NJ', 'CA');
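The elimination step can be sketched as follows. This is illustrative Python, not the Big SQL planner – the directory names just mirror the example layout:

```python
# Partition elimination at the file-system level: each distinct value of the
# partitioning column is a directory (e.g. .../sales/state=NJ), so a predicate
# on that column lets whole directories be skipped before any data is read.
partition_dirs = ["state=NJ", "state=AR", "state=CA", "state=NY"]

def eliminate_partitions(dirs, column, wanted_values):
    """Keep only the partition directories matching the predicate."""
    kept = []
    for d in dirs:
        col, _, value = d.partition("=")
        if col == column and value in wanted_values:
            kept.append(d)
    return kept

# select * from demo.sales where state in ('NJ', 'CA');
to_scan = eliminate_partitions(partition_dirs, "state", {"NJ", "CA"})
print(to_scan)  # ['state=NJ', 'state=CA']
```

Only the surviving directories are scanned; the other partitions cost nothing for this query.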
• What they are:
Columnar file formats
Store statistics
• Pros
Efficient compression
Predicates are pushed down and evaluated at file level!
We skip files for which predicates don’t match
• Cons
Not easy to integrate outside of Hadoop
Big SQL special file formats (ORC, Parquet)
33
CREATE HADOOP TABLE orc_table
  (col1 int, col2 string, col3 bigint)
STORED AS ORC;

CREATE HADOOP TABLE parquet_table
  (col1 int, col2 string, col3 bigint)
STORED AS PARQUETFILE;
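The file-skipping behaviour can be sketched like this. Illustrative Python only – the per-file min/max statistics are invented to show the mechanism, not read from real ORC or Parquet metadata:

```python
# Columnar formats such as ORC and Parquet keep min/max statistics per file
# (and per row group). A range predicate can be evaluated against those
# statistics alone, so files that cannot match are never opened.
files = [
    {"name": "part-0001", "col1_min": 1,    "col1_max": 999},
    {"name": "part-0002", "col1_min": 1000, "col1_max": 4999},
    {"name": "part-0003", "col1_min": 5000, "col1_max": 9999},
]

def files_to_read(stats, lo, hi):
    """Skip any file whose [min, max] range cannot satisfy col1 BETWEEN lo AND hi."""
    return [f["name"] for f in stats
            if f["col1_max"] >= lo and f["col1_min"] <= hi]

# WHERE col1 BETWEEN 2000 AND 3000 -> only one file needs scanning
print(files_to_read(files, 2000, 3000))  # ['part-0002']
```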
• Clustered Environment
Master and Region Servers
Automatic split (sharding)
• Key-value store
Key and value are byte arrays
Efficient access using row key
• HBase supports secondary
indexes!
HBase
34
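A minimal sketch of the two access patterns above – illustrative Python, not HBase itself; the row keys and the secondary-index layout are invented:

```python
# HBase-style access: rows are kept sorted by key, so point lookups by row
# key are cheap (here, binary search). Finding rows by VALUE needs a full
# scan unless a secondary index (value -> row keys) is maintained.
from bisect import bisect_left

rows = sorted([(b"row-001", b"apple"), (b"row-002", b"banana"),
               (b"row-003", b"apple")])
keys = [k for k, _ in rows]

def get(row_key):
    """Efficient point lookup by row key over the sorted key list."""
    i = bisect_left(keys, row_key)
    if i < len(keys) and keys[i] == row_key:
        return rows[i][1]
    return None

# Secondary index: invert value -> row keys so value lookups avoid a scan.
secondary = {}
for k, v in rows:
    secondary.setdefault(v, []).append(k)

print(get(b"row-002"))      # b'banana'
print(secondary[b"apple"])  # [b'row-001', b'row-003']
```

This is why HBase (with secondary indexes) is attractive for point queries that partitioning and columnar statistics cannot serve well.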
[De]normalization – complex types
35
CREATE HADOOP TABLE people(
  id INT,
  name VARCHAR(64),
  phoneNo ARRAY<VARCHAR(32)>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
SELECT p.id, p.name, tel.no
FROM people p
UNNEST (p.phoneNo) tel(no)
ID  NAME               Number
1   Eugen Stoianovici  +353 8733 -----
2   Eugen Stoianovici  +353 8700 -----
3   John Smith         +353 8732 -----
…   …                  …
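What UNNEST does to the ARRAY column can be sketched as follows – illustrative Python, with the helper and sample rows invented (phone numbers masked as on the slide):

```python
# UNNEST flattens an ARRAY-typed column: each element of the array becomes
# its own output row, repeating the parent row's other columns.
people = [
    {"id": 1, "name": "Eugen Stoianovici",
     "phoneNo": ["+353 8733 -----", "+353 8700 -----"]},
    {"id": 2, "name": "John Smith", "phoneNo": ["+353 8732 -----"]},
]

def unnest(rows, array_col):
    """Produce one row per array element, dropping the array column."""
    out = []
    for row in rows:
        for element in row[array_col]:
            flat = {k: v for k, v in row.items() if k != array_col}
            flat["no"] = element
            out.append(flat)
    return out

# SELECT p.id, p.name, tel.no FROM people p UNNEST (p.phoneNo) tel(no)
for r in unnest(people, "phoneNo"):
    print(r["id"], r["name"], r["no"])
```

Storing the phone numbers as an array keeps the data denormalized in one file, while UNNEST still lets queries treat it relationally.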
• Fault tolerance to data (errors in semi-structured data do not cause queries to fail)
• Fault tolerance to hardware
Hardware maintenance performed by threshold
Hardware failure does not impact system
• Elastic clusters
Add more nodes when more capacity is required
Reassign nodes when batch job is done
Big SQL – elastic and fault tolerant
36
Notices and Disclaimers
37
Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form
without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for
accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to
update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO
EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,
LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted
according to the terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as
illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other
results in other operating environments may vary.
References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services
available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the
views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or
other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the
identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will
ensure that the customer is in compliance with any law.
Notices and Disclaimers (con’t)
38
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document
Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM
SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,
OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,
Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.
Questions?
39
• BigInsights – Hadoop open platform integration in the warehouse
• Big SQL architecture and compatibility with enterprise data warehouses
• Data movement tools
• Tips for Big SQL performance
Wrap up
40
41
Get started with Big SQL: External resources
Hadoop Dev: links to videos, white paper, lab, …
https://developer.ibm.com/hadoop/
We Value Your Feedback!
Don’t forget to submit your Insight session and speaker
feedback! Your feedback is very important to us – we use it
to continually improve the conference.
Access your surveys at insight2015survey.com to quickly
submit your surveys from your smartphone, laptop or
conference kiosk.
42
© 2015 IBM Corporation
Thank You