Page 1
BIGSQL homerun or merely a major bluff?
Copyright © 2016 ITGAIN GmbH 1
PER STRICKER, THOMAS KALB
07.02.2017, HEART OF TEXAS DB2 USER GROUP, AUSTIN
08.02.2017, DB2 FORUM USER GROUP, DALLAS
INITIAL EVALUATION BIGSQL FOR HORTONWORKS (Homerun or merely a major bluff?)
Page 2
Agenda
• Introduction
• The MPP Architecture
• DB2 DPF and Hadoop (HDFS)
• Installation stumbling blocks
   Red Hat or CentOS
   The HDP Installation with Ambari (See Appendix)
   The BigSQL Installation
• Working with BigSQL: the Familiar and the New
   a. DB2 Interface
   b. HDFS Interface
• The Big Data Deployment (SQL for unstructured Data)
• DB2 Engine versus HDFS Engine
   Functional Differences
   Performance Differences
• BIG SQL and Hive
• Conclusion – Sham or Masterstroke?
• Questions and Discussions
Page 3
Hadoop (HDFS)
http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Hadoop-Cluster.PNG
Page 4
Hadoop Distribution
Cloudera / Hortonworks / MapR / IOP (Worldwide Market share)
Cloudera 53 %
Hortonworks 16 %
MapR 11 %
Others 20 %
Source: https://www.dezyre.com/article/top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93
Page 5
Hadoop Appraisal
Source: https://www.cloudera.com/content/dam/www/static/documents/analyst-reports/forrester-wave-big-data-hadoop-distributions.pdf
Page 6
Hadoop SQL Engines
Source: IBM Big SQL – Vendor Landscape © 2014 IBM Corporation
Page 7
Page 8
Big SQL and MPP-Architecture
IBM Big SQL is a high-performance SQL-on-Apache-Hadoop engine
The IBM MPP engine (C++) replaces the MapReduce layer (Java)
Big SQL is an MPP (Massively Parallel Processing) SQL engine
Hive extends Hadoop with data warehouse features
HBase is a distributed column-oriented database
HDFS is a highly available file system for storing very large volumes of data distributed across many nodes
Source: Big SQL: A Technical Introduction © 2016 IBM Corporation
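To picture how HDFS spreads one large file over many nodes, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size and the replication factor of 3 are the usual Hadoop 2.x defaults; the round-robin placement and the node names are simplifying assumptions for illustration, not the real (rack-aware) HDFS placement policy.

```python
import math

BLOCK_SIZE_MB = 128   # common Hadoop 2.x default (dfs.blocksize)
REPLICATION = 3       # common default (dfs.replication)


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION):
    """Return {block_index: [node, ...]} using naive round-robin placement.

    Real HDFS placement is rack-aware; this only illustrates that every
    block is stored on several different nodes.
    """
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    copies = min(replication, len(nodes))  # cannot replicate beyond node count
    layout = {}
    for block in range(n_blocks):
        layout[block] = [nodes[(block + r) % len(nodes)] for r in range(copies)]
    return layout


if __name__ == "__main__":
    # Hypothetical four-node cluster and a 1 GB file.
    for block, holders in place_blocks(1024, ["node1", "node2", "node3", "node4"]).items():
        print(block, holders)
```

Losing a single node in this picture never removes the only copy of a block, which is the property the slide summarizes as high availability.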
Page 9
SMP vs. MPP Architecture
SMP (symmetric multiprocessing): running processes are distributed dynamically across all available processors, which share the system resources (multiprocessor systems)
Page 10
SMP vs. MPP Architecture
MPP: Distributes a task across multiple independent nodes, each with its own processors, RAM and I/O (shared-nothing architecture)
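The shared-nothing idea can be sketched in a few lines: rows are assigned to nodes by key, every node aggregates only its own rows, and a coordinator merges the partial results. The modulo distribution, the function names and the sample rows are assumptions chosen for illustration; they are not how Big SQL or DB2 DPF distribute data internally.

```python
from collections import defaultdict


def node_for(key, n_nodes):
    """Pick the owning node for an integer key (simple modulo distribution)."""
    return key % n_nodes


def mpp_sum(rows, n_nodes):
    """Sum the amount column the shared-nothing way.

    1. Distribute rows to nodes by key.
    2. Each node aggregates only its own rows.
    3. A coordinator merges the partial sums.
    """
    partitions = defaultdict(list)
    for key, amount in rows:
        partitions[node_for(key, n_nodes)].append(amount)
    partial_sums = {node: sum(values) for node, values in partitions.items()}
    return partial_sums, sum(partial_sums.values())


if __name__ == "__main__":
    rows = [(1, 10), (2, 20), (3, 30), (4, 40)]  # (key, amount), made-up data
    partials, total = mpp_sum(rows, n_nodes=2)
    print(partials, total)
```

Adding nodes shrinks each partition, which is why this architecture scales horizontally while SMP scales only by growing a single machine.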
Page 11
SMP Scaling
Vertical Scaling
Page 12
Horizontal Scaling
Page 13
Page 14
DB2 DPF versus Hadoop (HDFS)
[Figure: Hadoop cluster (from a diploma thesis) next to DB2 DPF]
Page 15
DB2 DPF
Source: toadworld.com
Page 16
Big SQL – IBM Slide
Source: Big SQL: A Technical Introduction © 2016 IBM Corporation
Page 17
BIG SQL – ITGAIN Slide
Page 18
Page 19
Installation Stumbling Blocks
ITGAIN Test Environment
Installing two nodes
• Hardware
2 virtual Servers with 8 Cores / 10 GB RAM / SSDs
• Software
Red Hat Enterprise Linux 7.2 / CentOS 7.2
Ambari 2.2.2.0
Hortonworks Data Platform (HDP) 2.4.2
BETA: Big SQL 4.2 for Hortonworks Data Platform
Extending with two additional identical nodes (DataNode / WorkerNode)
Page 20
Installation Stumbling Blocks Red Hat or CentOS?
IBM BigInsights for Apache Hadoop 4.2 only supports
Red Hat Enterprise Linux (RHEL) Server 6.7
Red Hat Enterprise Linux (RHEL) Server 7.2
Hortonworks Data Platform HDP 2.4.2 supports
Red Hat Enterprise Linux (RHEL) 6.x - 7.x
CentOS 6.x - 7.x
Debian 7.x
Oracle Linux 6.x - 7.x
SUSE Linux Enterprise Server (SLES) v11 SP3 / SP4
Ubuntu Precise v12.04
Ubuntu Trusty v14.04
Page 21
Installation Stumbling Blocks Red Hat or CentOS?
Recommendation for the BETA on Hortonworks: Red Hat Enterprise Linux (RHEL) Server 7.2
Test-Cluster on
Red Hat Enterprise Linux (RHEL) Server 7.2
CentOS 7.2
Installation on both OSes was successful
Page 22
Installation Stumbling Blocks The HDP Installation with Ambari
Page 23
Installation Stumbling Blocks The HDP Installation with Ambari
Tips and Tricks:
• Very simple installation with Ambari, provided there are no errors
• Therefore: before the installation, take the time to clear every warning reported by Confirm Hosts and the check scripts
• In case of errors: check the error output on stderr
   If stderr is empty (which is often the case), the typical cause is a timeout
   If stderr contains errors, attempt to correct the error and retry
• If the installation crashes, it is often easier to retry with a fresh OS installation than to repair the existing OS and retry
Page 24
Installation Stumbling Blocks The BigSQL Installation
Recommendation: run the Big SQL pre-checker before the installation
The pre-checker scripts are included in the installation package but need to be extracted first:
rpm2cpio BigInsights-HDP-1.2.0.0-2.4.el7.x86_64.rpm | cpio -ivd \
  ./var/lib/ambari-server/resources/stacks/HDP/2.4/services/BIGSQL/package/scripts/bigsql-precheck.sh
rpm2cpio BigInsights-HDP-1.2.0.0-2.4.el7.x86_64.rpm | cpio -ivd \
  ./var/lib/ambari-server/resources/stacks/HDP/2.4/services/BIGSQL/package/scripts/bigsql-util.sh
All errors should be cleared before starting the installation
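Because the two extraction commands differ only in the script name, they can be generated instead of typed twice. The sketch below only assembles the command text from the RPM name and path given on this slide; the helper name extract_command is hypothetical, and nothing is executed.

```python
import shlex

RPM = "BigInsights-HDP-1.2.0.0-2.4.el7.x86_64.rpm"
SCRIPTS_DIR = ("./var/lib/ambari-server/resources/stacks/HDP/2.4/"
               "services/BIGSQL/package/scripts")


def extract_command(script_name, rpm=RPM):
    """Build the rpm2cpio | cpio pipeline that extracts one script from the RPM."""
    member = f"{SCRIPTS_DIR}/{script_name}"
    # shlex.quote leaves these names unchanged but protects unusual file names.
    return f"rpm2cpio {shlex.quote(rpm)} | cpio -ivd {shlex.quote(member)}"


if __name__ == "__main__":
    for name in ("bigsql-precheck.sh", "bigsql-util.sh"):
        print(extract_command(name))
```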
Page 25
Installation Stumbling Blocks The BigSQL Installation
Run the pre-checker on ALL servers!
Start the installation only once it has completed successfully on every server
Page 26
Installation Stumbling Blocks The BigSQL Installation
Add the Service to a Cluster
Page 27
Installation Stumbling Blocks The BigSQL Installation
Page 28
Installation Stumbling Blocks The BigSQL Installation
Additional Big SQL workers can be added to an individual host at any time via the Add Services option under Hosts
However, this is not possible on a Big SQL head node!
Page 29
Installation Stumbling Blocks Extending the Cluster with Ambari
Additional hosts can easily be added with the Add New Hosts wizard
Page 30
Installation Stumbling Blocks Extending the Cluster with Ambari
Page 31
Installation Stumbling Blocks Extending the Cluster with Ambari
Page 32
Installation Stumbling Blocks Extending the Cluster with Ambari
Page 33
Installation Stumbling Blocks Extending the Cluster with Ambari
Data must be redistributed after the extension
Page 34
Page 35
Working with BigSQL – The New and the Familiar
DB2 Interface
Page 36
Working with BigSQL – The New and the Familiar
Where does one find the tables in HDFS?
/apps/hive/warehouse/bigsql.db/firsttable
Page 37
Working with BigSQL – The New and the Familiar
Or via the Command line (HDFS Browse):
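A minimal sketch of such a listing, assuming the warehouse path /apps/hive/warehouse/bigsql.db/firsttable named on the previous slide and the standard hdfs dfs -ls command. The helper names are hypothetical, and lowercasing the schema and table names is an assumption based on that path.

```python
def table_dir(schema, table, warehouse="/apps/hive/warehouse"):
    """Return the HDFS directory of a Hadoop table below the Hive warehouse.

    Assumption: directory names are the lowercased schema and table names,
    as in /apps/hive/warehouse/bigsql.db/firsttable.
    """
    return f"{warehouse}/{schema.lower()}.db/{table.lower()}"


def list_command(schema, table):
    """Argument list for listing the table's files with the HDFS shell."""
    return ["hdfs", "dfs", "-ls", table_dir(schema, table)]


if __name__ == "__main__":
    # Prints the command line; pass the list to subprocess.run on a cluster node.
    print(" ".join(list_command("BIGSQL", "FIRSTTABLE")))
```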
Page 38
Working with BigSQL – The New and the Familiar
Not everything works with the DB2 Command line: For example loading data into a Hadoop Table
What now?
Page 39
Working with BigSQL – The New and the Familiar
There is also a command line for Big SQL: JSqsh (Java SQL Shell), pronounced "jay-skwish"
According to the docs it should be found in:
/usr/ibmpacks/common-utils/current/jsqsh
BUT:
Page 40
Working with BigSQL – The New and the Familiar
SOLUTION: JSqsh isn’t part of the BigSQL-Installation
Page 41
Working with BigSQL – The New and the Familiar
JSqsh appears in the list of installed clients
JSqsh can also be installed via the open-source GitHub project
Page 42
Working with BigSQL – The New and the Familiar
JSqsh Setup:
Page 43
Working with BigSQL – The New and the Familiar
JSqsh Setup: driver selection
Page 44
Working with BigSQL – The New and the Familiar
JSqsh Setup: Customize the Connection details and save
Page 45
Working with BigSQL – The New and the Familiar
Requesting the table list with JSqsh
JSqsh command help: \help
Defining the current schema, e.g.: use BIGSQL
Requesting a table list in a given schema: \show tables
Page 46
Working with BigSQL – The New and the Familiar
Starting point: load data into the tables
Tip: for better performance, copy the load file into HDFS first
hdfs dfs -copyFromLocal /tmp/firsttable.csv /tmp/
hdfs dfs -chmod 777 /tmp/firsttable.csv
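Once the file is in HDFS, the load itself can be issued from JSqsh. The sketch below only assembles the statement text. The LOAD HADOOP USING FILE URL form and the field.delimiter source property are recalled from the Big SQL documentation and should be verified against the installed version; the target table bigsql.firsttable and the comma delimiter are assumptions based on the earlier slides.

```python
def load_hadoop_statement(hdfs_path, table, delimiter=","):
    """Assemble a LOAD HADOOP statement that appends a delimited HDFS file to a table."""
    return (
        f"LOAD HADOOP USING FILE URL '{hdfs_path}' "
        f"WITH SOURCE PROPERTIES ('field.delimiter'='{delimiter}') "
        f"INTO TABLE {table} APPEND"
    )


if __name__ == "__main__":
    # Path taken from the copyFromLocal command above; table name assumed.
    print(load_hadoop_statement("/tmp/firsttable.csv", "bigsql.firsttable") + ";")
```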
Page 47
Working with BigSQL – The New and the Familiar
What happened in the HDFS file system? A new file has appeared
Page 48
Working with BigSQL – The New and the Familiar
db2top also works: For example, LOAD
Page 49
Working with BigSQL – The New and the Familiar
Even db2pd works: for example, for LOAD
However, LIST UTILITIES does not work
Page 50
Page 51
Loading the Benchmark BIGSQL HDFS Table
Page 52
The HDFS (DB2-) Blocks
Page 53
BIGSQL HDFS versus DB2 DPF
Page 54
BIGSQL HDFS versus DB2 DPF
Page 55
DB2 DPF Restrictions
Page 56
DB2 DPF Restrictions
Page 57
Performance differences DB2 DPF versus DB2 HDFS
Loading 10 million rows
DB2 HDFS: 64 sec.
DB2 DPF: 22 sec.
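Converting the two load times into throughput makes the gap easier to compare. The short calculation below uses only the row count and timings from this slide.

```python
ROWS = 10_000_000
SECONDS = {"DB2 HDFS": 64, "DB2 DPF": 22}  # load times from the slide


def rows_per_second(rows, seconds):
    """Average load throughput in rows per second."""
    return rows / seconds


if __name__ == "__main__":
    for engine, secs in SECONDS.items():
        print(f"{engine}: {rows_per_second(ROWS, secs):,.0f} rows/s")
    # Prints 156,250 rows/s for DB2 HDFS and 454,545 rows/s for DB2 DPF.
    ratio = SECONDS["DB2 HDFS"] / SECONDS["DB2 DPF"]
    print(f"DB2 DPF loads about {ratio:.1f}x faster")  # about 2.9x
```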
Page 58
Performance differences DB2 DPF versus DB2 HDFS
Random I/O Benchmark (reading 1,023 rows)
[Figure: cold and warm timings, DB2 DPF versus DB2 HDFS]
Page 59
Performance differences DB2 DPF versus DB2 HDFS
Read-Ahead I/O Benchmark (reading 10 million rows)
[Figure: cold and warm timings, DB2 DPF versus DB2 HDFS]
Page 60
Page 61
The Big Data Deployment (SQL for unstructured Data)
Working with datatypes for complex data (partially structured)
ARRAY: Collection of data of the same datatype
MAP: Collection of Key-Value pairs
STRUCT: Collection of data with different datatypes
Working with unstructured data is possible via a Serializer and Deserializer (SerDe)
The SerDe interface defines how the data blocks are to be processed
There are many built-in SerDes, for example for JSON, Avro, Parquet, regular expressions, etc.
Many SerDes are publicly available
Any specific SerDe that may be required can be developed in Java
Page 62
Big Data – Working with the ARRAY Data Type
Collection of data of the same datatype
Page 63
Big Data – Working with MAP Types
Collection of Key-Value pairs
Page 64
Big Data – Working with STRUCTs
Collection of data with different data types
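The three complex types can be pictured with Python analogues: a list for ARRAY, a dictionary for MAP and a small record class for STRUCT. The Customer and Address records and their values are made up for illustration and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Address:
    """STRUCT analogue: named fields with different data types."""
    street: str
    zip_code: int


@dataclass
class Customer:
    name: str
    phones: List[str]            # ARRAY analogue: elements share one type
    attributes: Dict[str, str]   # MAP analogue: key-value pairs
    address: Address             # STRUCT analogue


if __name__ == "__main__":
    c = Customer("Ada", ["555-0100", "555-0101"], {"tier": "gold"},
                 Address("Main St 1", 78701))
    # Element access by position, by key and by field name.
    print(c.phones[0], c.attributes["tier"], c.address.zip_code)
```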
Page 65
Big Data – Unstructured Data
Using SerDes in BigSQL
Before a SerDe .jar file can be used, it needs to be registered in Big SQL; it only becomes available to Big SQL once the registration has succeeded
3 steps to register:
Hive servers: copy the SerDe .jar file into the /lib/ directory
Big SQL nodes: copy the SerDe .jar file into the /userlib/ directory of each individual node
Restart all Big SQL services
Page 66
Big Data – Example of Unstructured Data
Example: Parsing log files with Regular Expression (RegexSerDe)
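What a regular-expression SerDe does with each log line can be sketched in plain Python: one capture group per column. The pattern below targets the Apache Common Log Format, and the column names and the sample line are assumptions; the table definition actually used on the slide is not reproduced here.

```python
import re

# One named group per column of the Apache Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of column values, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
              '"GET /apache_pb.gif HTTP/1.0" 200 2326')
    print(parse_line(sample))
```

Every captured value is a string, so numeric columns such as status and size still need a cast before they can be aggregated.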
Page 67
Big Data – Example of Unstructured Data
select * from apache_log fetch first 5 rows only
For example, to correlate Client Data with Web Browser data for analysis of user behavior
Page 68
Page 69
Big SQL versus Hive
SQLReplayer
Page 70
Hive Big SQL Object Synchronization
Create a table in Hive:
Page 71
Hive Big SQL Object Synchronization
Synchronize the Hive Tables:
Page 72
Hive Big SQL Object Synchronization
Test the Big SQL Table:
Page 73
Hive Big SQL Data Synchronization (Refresh)
Edit the HDFS File:
Page 74
Hive Big SQL Data Synchronization (Refresh)
Select the Hive Table:
Page 75
Hive Big SQL Data Synchronization (Refresh)
Synchronization (Refresh):
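The object synchronization and the data refresh on the preceding slides can be sketched as two stored procedure calls. This assumes the SYSHADOOP.HCAT_SYNC_OBJECTS and SYSHADOOP.HCAT_CACHE_SYNC procedures as described in the Big SQL documentation; the argument values shown, the schema bigsql and the table name hivetable are assumptions and should be checked against the installed version. The sketch only assembles the statement text for use in JSqsh.

```python
def sql_literal(value):
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def sync_objects_call(schema, name_pattern, object_types="a",
                      exists_action="REPLACE", error_action="CONTINUE"):
    """Statement that imports Hive object definitions into the Big SQL catalog."""
    args = ", ".join(sql_literal(a) for a in
                     (schema, name_pattern, object_types, exists_action, error_action))
    return f"CALL SYSHADOOP.HCAT_SYNC_OBJECTS({args})"


def cache_sync_call(schema, table):
    """Statement that refreshes Big SQL's cached metadata after the HDFS files changed."""
    return f"CALL SYSHADOOP.HCAT_CACHE_SYNC({sql_literal(schema)}, {sql_literal(table)})"


if __name__ == "__main__":
    print(sync_objects_call("bigsql", "hivetable") + ";")   # after creating the table in Hive
    print(cache_sync_call("bigsql", "hivetable") + ";")     # after editing the HDFS file
```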
Page 76
Page 77
BIGSQL – Sham or Masterstroke?
Sham
DB2 DPF for HDFS
Masterstroke
The right strategy at the right time
Reuse of existing investments
Increased acceptance via the reuse of SQL
Simple integration of Big Data in an existing infrastructure
Page 78
The Big Data Solution
Big SQL Hadoop-Tables are not a replacement for OLTP-DBMS Technology
Big SQL makes it possible to run SQL queries against existing Hadoop data (no proprietary storage formats)
All the data are Hadoop files in HDFS
Big SQL was developed to make effective and efficient use of the Hadoop infrastructure
Most organizations already have experienced SQL developers
No UPDATE or DELETE is possible on a Hadoop table
Much lower license costs than DPF
Good SQL compatibility
Great monitoring with Speedgain for BIGSQL is available
Page 79
The Big Data Solution
Primary Use cases would be:
To move rarely referenced data out of the data warehouse and onto cheaper hardware while keeping the ability to query it via SQL
To set up a new data warehouse
To filter and analyze unstructured data (such as log files, sensor data and social media) and to connect this data to existing structured data (for example via federation)
Page 80
Conclusion
Bluff = Homerun
Page 81
Q & A