Page 1
© 2013 IBM Corporation1
The Data Scientist's Workplace of the Future - Data Science Connect, 22nd of July, 2014
Romeo Kienzler
IBM Center of Excellence for Data Science, Cognitive Systems and BigData (a joint venture between IBM Research Zurich and IBM Innovation Center DACH)
Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg
Page 2
What is DataScience?
Source: Statoo.com http://slidesha.re/1kmNiX0
Page 3
DataScience at present
● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
  ● SQL (42%)
  ● R (33%)
  ● Python (26%)
  ● Excel (25%)
  ● Java, Ruby, C++ (17%)
  ● SPSS, SAS (9%)
● Limitations (single-node usage)
  ● Main memory
  ● CPU <> main memory bandwidth
  ● CPU
  ● Storage <> main memory bandwidth (either single node or SAN)
Page 4
What is BIG data?
Page 5
What is BIG data?
Page 6
What is BIG data?
Big Data
Hadoop
Page 7
What is BIG data?
Business Intelligence
Data Warehouse
Page 8
BigData == Hadoop?
Hadoop BigData
Hadoop
Page 9
What is beyond “Data Warehouse”?
Data Lake
Data Warehouse
Page 10
First “BigData” use case?
● Google Index
  ● 40 X 10^9 = 40.000.000.000 => 40 billion pages indexed
  ● Will break the 100 PB barrier soon
  ● Derived from MapReduce
  ● Now “Caffeine”, based on “Percolator”
    ● Incremental vs. batch
    ● In-memory vs. disk
Page 11
Map-Reduce → Hadoop → BigInsights
Page 12
BigData Analytics – Predictive Analytics
"sometimes it's not who has the best algorithm that wins; it's who has the most data."
(C) Google Inc.
The Unreasonable Effectiveness of Data¹
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
No Sampling => Work with full dataset => No p-Value/z-Scores anymore
Page 13
Aggregated Bandwidth between CPU, Main Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
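The scan times above follow from dividing the data volume by the aggregated bandwidth. A minimal sketch of that arithmetic, assuming the data is spread evenly over the nodes and every node sustains the full 10 GByte/s (the function name is illustrative only):

```python
def scan_time_seconds(data_bytes: float, per_node_bandwidth: float, nodes: int) -> float:
    """Ideal time to read the data once when it is spread evenly over `nodes`."""
    return data_bytes / (per_node_bandwidth * nodes)


TB = 10**12
GB = 10**9

# Figures from the slide: 1 TB of data, 10 GByte/s per node.
scan_times = {n: scan_time_seconds(1 * TB, 10 * GB, n) for n in (1, 10, 100, 1000)}

for n, seconds in scan_times.items():
    print(f"{n:>5} nodes: {seconds:g} s")
```

Real clusters will not reach this ideal, since skew, coordination and network transfers all cost time.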
Page 14
Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM,
3TB SEAGATE Barracuda 7200.14
< CHF 500
100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD
MTBF ~ 365 d > 1,5 d
Source: http://www.cloudcomputingpatterns.org/Watchdog
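The cluster figures above can be re-derived with a few lines. This is a sketch under stated assumptions: node failures are independent with a constant rate (so the cluster MTBF is the node MTBF divided by the node count), each node has 2 cores, 8 GB RAM and a 3 TB disk as listed in the hardware line, and the 200 TB disk figure is read as usable capacity under 3-way replication, which the slide does not state.

```python
def cluster_mtbf_days(node_mtbf_days: float, nodes: int) -> float:
    """Expected time between failures anywhere in the cluster.

    Assumes independent node failures with a constant failure rate.
    """
    return node_mtbf_days / nodes


NODES = 200          # roughly CHF 100 K at < CHF 500 per node
REPLICATION = 3      # assumption: 3 copies of every block

cores = NODES * 2                      # 2 cores per node
ram_tb = NODES * 8 / 1000              # 8 GB RAM per node
usable_disk_tb = NODES * 3 / REPLICATION  # 3 TB raw disk per node
mtbf = cluster_mtbf_days(365, NODES)

print(cores, ram_tb, usable_disk_tb, mtbf)
```

With a node MTBF of about a year, some node in a 200-node cluster is expected to fail every couple of days, which is why fault tolerance has to be built into the software.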
Page 15
“Elastic” Scale-Out
Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload
Page 19
“Elastic” Scale-Out of CPU Cores, Storage, Memory
Page 20
“Elastic” Scale-Out
linear
Source: http://www.cloudcomputingpatterns.org/Elastic_Platform
Page 21
How do Databases Scale-Out?
Shared Disk Architectures
Page 22
How do Databases Scale-Out?
Shared Nothing Architectures
Page 23
Hadoop?
Shared Nothing Architecture?
Shared Disk Architecture?
6-node Hadoop cluster for free: http://bluemix.net/
Page 24
Data Science on Hadoop
SQL (42%)
R (33%)
Python (26%)
Excel (25%)
Java, Ruby, C++ (17%)
SPSS, SAS (9%)
Data Science Hadoop
Page 25
SQL on Hadoop
● IBM BigSQL (ANSI 92 compliant)
● HIVE, Presto
● Cloudera Impala
● Lingual
● Shark
● ...
SQL Hadoop
Page 26
Two types of SQL Engines
● Type I
  ● Compiler and optimizer SQL -> MapReduce
● Type II
  ● Brings its own distributed execution engine on the data nodes
  ● Brings its own task scheduler
● The Hadoop SQL ecosystem is evolving very fast
Page 27
Hive
● Runs on top of MapReduce
● → Type I
Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg
Page 28
Lingual
● ANSI SQL layer on top of Cascading
● Cascading
  ● Java API to express DAGs
  ● Runs on top of MapReduce
● → Type I
Page 29
Limits of MapReduce
● Disk writes between Map and Reduce
● Slow for computations which depend on previously computed values
● JOINs are very slow and difficult to implement
  ● Only sequential data access
  ● Only tuple-wise data access
  ● Map-side joins have sort and size constraints
  ● Reduce-side joins require secondary sorting of values
  ● …
● ...
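To make the join limitation concrete, here is a minimal in-memory sketch of a reduce-side join. It is not Hadoop code: the `shuffle` function stands in for the framework's sort and shuffle, the table contents are made up, and the reducer buffers all rows per key in memory. On a real cluster the values arrive as a stream, which is why the tagged rows need the secondary sort mentioned above.

```python
from collections import defaultdict
from itertools import product


def map_phase(tag, rows, key_index):
    """Emit (join_key, (tag, row)) so the reducer can tell the two inputs apart."""
    for row in rows:
        yield row[key_index], (tag, row)


def shuffle(pairs):
    """Group values by key; stands in for the framework's sort and shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Per key, combine every left row with every right row (inner join)."""
    for key in sorted(groups):
        left = [row for tag, row in groups[key] if tag == "L"]
        right = [row for tag, row in groups[key] if tag == "R"]
        for l_row, r_row in product(left, right):
            yield key, l_row, r_row


# Hypothetical sample data: (department id, name)
employees = [(1, "ann"), (2, "bob"), (1, "cid")]
departments = [(1, "sales"), (3, "hr")]

pairs = list(map_phase("L", employees, 0)) + list(map_phase("R", departments, 0))
joined = list(reduce_phase(shuffle(pairs)))
for record in joined:
    print(record)
```

Every row of both tables crosses the shuffle, even rows that never find a partner, which is one reason this join style is slow.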
Page 30
Impala (Type II)
http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png
Page 31
Presto (Type II)
https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
Page 32
Spark / Shark (Type II)
Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
Page 33
BigSQL V3.0 (Type II)
As in Spark, MapReduce has been kicked out :) (no JobTracker, no TaskTracker, but HDFS/GPFS remains)
Page 34
BigSQL V3.0 – Architecture
Putting the story together…
● Big SQL shares a common SQL dialect with DB2
● Big SQL shares the same client drivers with DB2
Page 35
BigSQL V3.0 – Performance
Query rewrites
● Exhaustive query rewrite capabilities
● Leverages additional metadata such as constraints and nullability
Optimization
● Statistics and heuristic driven query optimization
● Query optimizer based upon decades of IBM RDBMS experience
Tools and metrics
● Highly detailed explain plans and query diagnostic tools
● Extensive number of available performance metrics
SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE
  PERIOD.PERKEY=DAILY_SALES.PERKEY AND
  PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND
  STORE.STOREKEY=DAILY_SALES.STOREKEY AND
  CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND
  STORE_NUMBER='03' AND
  CATEGORY=72
GROUP BY ITEM_DESC
Query transformation
● Dozens of query transformations
Access plan generation
● Hundreds or thousands of access plan options
[Figure: alternative join trees over Store, Product, Daily Sales and Period, built from NLJOIN, HSJOIN and ZZJOIN operators]
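A toy count shows why even a four-table query yields hundreds of access plan options. The sketch below enumerates only left-deep join trees and treats the three operator names from the figure as freely interchangeable. Both are simplifications: a real optimizer also weighs bushy trees, access paths and operator applicability, and prunes most candidates early.

```python
from itertools import permutations, product

TABLES = ("PERIOD", "DAILY_SALES", "PRODUCT", "STORE")
JOIN_METHODS = ("NLJOIN", "HSJOIN", "ZZJOIN")  # operator names as shown in the figure


def left_deep_plans(tables, methods):
    """Yield every left-deep join tree as a nested string."""
    for order in permutations(tables):
        for ops in product(methods, repeat=len(tables) - 1):
            plan = order[0]
            for table, op in zip(order[1:], ops):
                plan = f"{op}({plan}, {table})"
            yield plan


plans = list(left_deep_plans(TABLES, JOIN_METHODS))
print(len(plans))   # 4! join orders x 3^3 operator choices
print(plans[0])
```

Costing each candidate against table statistics is what separates a cost-based optimizer from a fixed rewrite of SQL into MapReduce jobs.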
Page 36
BigSQL V3.0 – Performance
You are substantially faster if you don't use MapReduce
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification.
Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql
Page 37
BigSQL V3.0 – Query Federation
[Figure: one head node running Big SQL and four compute nodes, each running Task Tracker, Data Node and BigSQL]
Page 38
BigSQL V1.0 – Demo (small)
● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
Page 39
BigSQL V1.0 – Demo (small)
CREATE EXTERNAL TABLE trace (
  hour integer, employeeid integer,
  departmentid integer, clientid integer,
  date string, timestamp string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/biadmin/32Gtest';
Page 40
BigSQL V1.0 – Demo (small)
Page 41
BigSQL V1.0 – Demo (small)
Page 42
BigSQL V1.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace1;
+----------+
|          |
+----------+
| 11416740 |
+----------+
1 row in results (first row: 39.78s; total: 39.78s)
Page 43
BigSQL V1.0 – Demo (small)
select count(hour), hour from trace group by hour order by hour
30 rows in results (first row: 37.98s; total: 37.99s)
Page 44
BigSQL V1.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner join trace2 t4 on t3.hour=t4.hour;
+--------+
| |
+--------+
| 477340 |
+--------+
1 row in results (first row: 32.24s; total: 32.25s)
Page 45
BigSQL V3.0 – Demo (small)
CREATE HADOOP TABLE trace3 (
  hour int, employeeid int,
  departmentid int, clientid int,
  date varchar(30), timestamp varchar(30) )
row format delimited
fields terminated by '|'
stored as textfile;
Page 46
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace3;
+----------+
| 1        |
+----------+
| 12014733 |
+----------+
1 row in results (first row: 2.94s; total: 2.95s)
Page 47
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner join trace4 t4 on t3.hour=t4.hour;
+--------+
| 1 |
+--------+
| 504360 |
+--------+
1 row in results (first row: 0.79s; total: 0.80s)
Page 48
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3 group by hour order by hour;
29 rows in results (first row: 1.88s; total: 1.89s)
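The total run times from the V1.0 and V3.0 demo slides can be put side by side. This is only an indicative comparison: the queries run against different tables (trace1/trace2 vs. trace3/trace4) with slightly different row counts, so the ratios are rough speedups, not a benchmark.

```python
# (query, BigSQL V1.0 total seconds, BigSQL V3.0 total seconds), copied from the demo slides
TIMINGS = [
    ("count(*)", 39.78, 2.95),
    ("inner join on hour", 32.25, 0.80),
    ("group by hour", 37.99, 1.89),
]

speedups = {query: v1 / v3 for query, v1, v3 in TIMINGS}

for query, factor in speedups.items():
    print(f"{query:<20} {factor:5.1f}x faster")
```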
Page 49
R on Hadoop
● IBM BigR (based on the SystemML project from IBM Research Almaden)
● RHadoop
● RHIPE
● ...
“R” Hadoop
Page 51
Goal: Find column mean
Problems:
• Column vector cannot fit into memory
You have to partition and parallelize
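A minimal sketch of "partition and parallelize" for the column mean: each partition returns only a partial sum and a row count, and the driver combines them. Threads and in-memory lists stand in for cluster workers and HDFS blocks here; the function names and sample data are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(chunk):
    """Stream over one partition and return (sum, count); the chunk is never copied."""
    total, count = 0.0, 0
    for value in chunk:
        total += value
        count += 1
    return total, count


def partitioned_mean(chunks, workers=4):
    """Combine per-partition sums and counts into the global mean."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Three partitions of unequal size
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean = partitioned_mean(chunks)
print(mean)
```

Note that averaging the per-partition means would give the wrong answer when partitions differ in size, so the counts have to travel with the sums.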
Page 52
● Sampling
  ● Full dataset > RAM
  ● Example: use 1% vs. 100% of the dataset
  ● Precision loss from skewed/sparse data
● Numerical stability
  ● Limitation from finite precision in computing
  ● Algorithms must be carefully implemented
  ● Instability causes errors to cascade throughout your analysis
● Catastrophic cancellation error: 6.375 – 5.625
  ● 6.375 rounds to 6.0
  ● 5.625 rounds to 6.0
  ● True value: 0.75
  ● Computed: 0
  ● Relative error: 1.0
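The cancellation example can be reproduced directly. In this sketch, Python's `round` to whole numbers stands in for a machine that keeps only one significant digit; the helper name is illustrative.

```python
def relative_error(true_value: float, computed: float) -> float:
    """Relative error of a computed result against the true value."""
    return abs(computed - true_value) / abs(true_value)


a, b = 6.375, 5.625

true_diff = a - b                      # both inputs are exact in binary floating point
computed_diff = round(a) - round(b)    # inputs rounded first, then subtracted

print(true_diff, computed_diff, relative_error(true_diff, computed_diff))
```

Each input carries only a small rounding error on its own, but the subtraction cancels all the digits that were kept, so the result is entirely wrong.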
Page 53
Data in Hadoop
You
R User
Data in distributed memory
Page 54
Data in Hadoop: Can run R on a single node
R User
Data in distributed memory
You
Page 55
BigR (based on SystemML)
SystemML compiles hybrid runtime plans ranging from in-memory, single-machine (CP) to large-scale, cluster (MR) compute
● Challenge
  ● Guaranteed hard memory constraints (budget of JVM size)
  ● for arbitrarily complex ML programs
● Key technical innovations
  ● CP & MR runtime: single-machine & MR operations, integrated runtime
  ● Caching: reuse and eviction of in-memory objects
  ● Cost model: accurate time and worst-case memory estimates
  ● Optimizer: cost-based runtime plan generation
  ● Dyn. recompiler: re-optimization for initial unknowns
[Figure: run time vs. data size across CP, CP/MR and MR plans. Hybrid plans gradually exploit MR parallelism: high performance computing for small data sizes, scalable computing for large data sizes.]
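The idea of choosing between CP and MR under a hard memory budget can be illustrated with a toy decision rule. This is not SystemML's cost model or API: the function names, the dense 8-bytes-per-cell estimate and the 2 GiB budget are all assumptions made for this sketch.

```python
def estimate_dense_matrix_bytes(rows: int, cols: int, bytes_per_cell: int = 8) -> int:
    """Worst-case size of a dense matrix of doubles (hypothetical estimate)."""
    return rows * cols * bytes_per_cell


def choose_plan(worst_case_bytes: int, memory_budget_bytes: int) -> str:
    """Run in memory on a single machine (CP) if the operand is guaranteed to fit,
    otherwise fall back to a cluster plan (MR)."""
    return "CP" if worst_case_bytes <= memory_budget_bytes else "MR"


BUDGET = 2 * 1024**3  # hypothetical JVM memory budget of 2 GiB

small = choose_plan(estimate_dense_matrix_bytes(10_000, 100), BUDGET)
large = choose_plan(estimate_dense_matrix_bytes(100_000_000, 100), BUDGET)
print(small, large)
```

A real compiler makes this decision per operation, which is how one program can end up as a hybrid CP/MR plan.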
Page 56
BigR Architecture
[Figure: architecture diagram with R clients, IBM R packages, embedded R execution, the SystemML statistics engine and data sources. Annotations: pull data (summaries) to the R client, or push R functions right onto the data.]
Page 57
Big R Data Structures: Proxy to entire dataset
data <- bigr.frame(…)
Appears and acts like all of the data is on your laptop
You
Page 58
BigR Demo (small)
● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
Page 59
BigR Demo (small)
library(bigr)
# Connect to the cluster
bigr.connect(host="bigdata",
             port=7052, database="default",
             user="biadmin", password="xxx")
is.bigr.connected()
# Proxy to the delimited file in HDFS
tbr <- bigr.frame(dataSource="DEL",
                  coltypes=c("numeric","numeric","numeric","numeric","character","character"),
                  dataPath="/user/biadmin/32Gtest", delimiter=",",
                  header=FALSE, useMapReduce=TRUE)
# Histogram statistics for the first column (V1), 24 bins
h <- bigr.histogram.stats(tbr$V1, nbins=24)
Page 60
BigR Demo (small)
  class bins   counts centroids
1   ALL    0 18289280  1.583333
2   ALL    1    15360  2.750000
3   ALL    2    55040  3.916667
4   ALL    3   189440  5.083333
5   ALL    4   579840  6.250000
6   ALL    5  5292160  7.416667
7   ALL    6  8074880  8.583333
8   ALL    7 15653120  9.750000
...
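Histogram statistics like these parallelize well, because per-partition bin counts simply add up. The sketch below is not the Big R implementation; function names and sample data are made up. The centroids printed above are consistent with 24 equal-width bins over the range 1 to 29, which is an inference from the numbers, not something the slide states.

```python
def bin_index(value, lo, hi, nbins):
    """Index of the equal-width bin holding `value`; the upper edge goes into the last bin."""
    width = (hi - lo) / nbins
    return min(int((value - lo) / width), nbins - 1)


def partial_counts(chunk, lo, hi, nbins):
    """Bin counts for one partition of the column."""
    counts = [0] * nbins
    for value in chunk:
        counts[bin_index(value, lo, hi, nbins)] += 1
    return counts


def merge_counts(partials):
    """Add up the per-partition counts bin by bin."""
    return [sum(column) for column in zip(*partials)]


def centroids(lo, hi, nbins):
    """Midpoint of every bin."""
    width = (hi - lo) / nbins
    return [lo + (i + 0.5) * width for i in range(nbins)]


# Two hypothetical partitions, 4 bins over [0, 4]
chunks = [[0.0, 1.0, 1.5], [3.9, 4.0, 2.0]]
merged = merge_counts([partial_counts(c, 0.0, 4.0, 4) for c in chunks])
print(merged, centroids(0.0, 4.0, 4))
```

Only the small count vectors travel to the client, never the rows themselves.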
Page 61
BigR Demo (small)
Page 62
BigR Demo (small)
jpeg('hist.jpg')
# This command runs on 32 GB / ~650.000.000 rows in HDFS
bigr.histogram(tbr$V1, nbins=24)
dev.off()
Page 63
SPSS on Hadoop
Page 64
SPSS on Hadoop
Page 65
BigSheets Demo (small)
● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
Page 66
BigSheets Demo (small)
Page 67
BigSheets Demo (small)
This command runs on 32 GB /
~650.000.000 rows in HDFS
Page 68
BigSheets Demo (small)
Page 69
Text Extraction (SystemT, AQL)
Page 70
Text Extraction (SystemT, AQL)
Page 71
If this is not enough? → BigData AppStore
Page 72
BigData AppStore, Eclipse Tooling
● Write your apps in
  ● Java (MapReduce)
  ● PigLatin, Jaql
  ● BigSQL/Hive/BigR
● Deploy it to BigInsights via Eclipse
● Automatically
  ● Schedule
  ● Update
    ● hdfs files
    ● BigSQL tables
    ● BigSheets collections
Page 73
Questions?
http://www.ibm.com/software/data/bigdata/
Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps