Kyle Bader - Red Hat
Yuan Zhou, Yong Fu, Jian Zhang - Intel
May, 2018
Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Optimization Notice
Agenda
▪ Background and Motivations
▪ The Workloads, Reference Architecture Evolution and Performance Optimization
▪ Performance Comparison with Remote HDFS
▪ Summary & Next Steps
DISCONTINUITY IN BIG DATA INFRASTRUCTURE - WHY?
HADOOP, SPARKSQL, SPARK, HIVE, MAPREDUCE, PRESTO, IMPALA, KAFKA, NiFi
CONGESTION in busy analytic clusters causing missed SLAs.
MULTIPLE TEAMS COMPETING and sharing the same big data resources.
MODERN BIG DATA ANALYTICS PIPELINE - KEY TERMINOLOGY
DATA GENERATION
INGEST
DATA SCIENCE
STREAM PROCESSING
TRANSFORM, MERGE, JOIN
MACHINE LEARNING
DATA ANALYTICS
MODERN BIG DATA ANALYTICS PIPELINE - KEY TERMINOLOGY
• DATA GENERATION: sensors, click-stream, transactions, call-detail records
• INGEST: NiFi, Kafka
• STREAM PROCESSING: Kafka, Hadoop, Spark
• TRANSFORM, MERGE, JOIN: Spark, Hadoop
• DATA ANALYTICS: Presto, Impala, SparkSQL
• MACHINE LEARNING: TensorFlow
• DATA SCIENCE
CAUSING CUSTOMERS TO PICK A SOLUTION
#1 Get a bigger cluster for many teams to share.
#2 Give each team their own dedicated cluster, each with a copy of PBs of data.
#3 Give teams the ability to spin-up/spin-down clusters which can share data sets.
#1 SINGLE LARGE CLUSTER
• Lacks isolation: noisy neighbors hinder SLAs.
• Lacks elasticity: a single rigid cluster.
#2 MULTIPLE SMALL CLUSTERS
• No dataset sharing
• Cost of duplicate storage
• Still lacks elasticity
• Can’t scale
#3 ON-DEMAND ANALYTIC CLUSTERS WITH A SHARED DATA LAKE
HIT SERVICE-LEVEL AGREEMENTS: give teams their own compute clusters.
ELIMINATE IDLE RESOURCES: by right-sizing de-coupled compute and storage.
BUY 10s OF PBs INSTEAD OF 100s: share data sets across clusters instead of duplicating them.
INCREASE AGILITY: with spin-up/spin-down clusters.
#3 ON-DEMAND ANALYTIC CLUSTERS WITH A SHARED DATA LAKE
PUBLIC CLOUD (AWS): AWS EC2 provisioning; shared datasets on AWS S3; Hadoop, Presto, and Spark clusters.
PRIVATE CLOUD: OpenStack provisioning; shared datasets on Ceph; Hadoop, Presto, and Spark clusters.
GENERATION 1: MONOLITHIC HADOOP STACKS
Analytics vendors provide both the analytics software and single-purpose infrastructure (ANALYTICS + INFRASTRUCTURE).
GENERATION 2: DECOUPLED STACK WITH PRIVATE CLOUD INFRASTRUCTURE
Analytics vendors provide analytics software.
Private cloud provides infrastructure services:
• Provisioned compute pool via OpenStack
• Shared datasets on Ceph object store
Workloads
Simple Read/Write
▪ DFSIO: TestDFSIO is the canonical benchmark for measuring a storage system's capacity for reading and writing bulk data.
▪ Terasort: a popular benchmark that measures the time taken to sort one terabyte of randomly distributed data on a given system.
Data Transformation
▪ ETL: taking data as it is originally generated and transforming it into a format (Parquet, ORC) that is better tuned for analytical workloads.
Batch Analytics
▪ Consistently executing analytical processes over large data sets.
▪ Leveraging 54 queries derived from TPC-DS*, with intensive reads across objects in different buckets.
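The simple read/write workloads above boil down to timing bulk sequential IO and reporting throughput. A minimal local sketch of a DFSIO-style measurement (a toy model against the local filesystem, not the actual TestDFSIO MapReduce job):

```python
import os
import tempfile
import time

def dfsio_style_benchmark(n_files=4, file_mb=8):
    """Write n_files of file_mb MiB each, read them back, and report
    aggregate write/read throughput in MB/s -- the metric DFSIO reports."""
    payload = os.urandom(1024 * 1024)  # 1 MiB block reused for every write
    with tempfile.TemporaryDirectory() as d:
        paths = [os.path.join(d, f"part-{i}") for i in range(n_files)]

        t0 = time.perf_counter()
        for p in paths:
            with open(p, "wb") as f:
                for _ in range(file_mb):
                    f.write(payload)
        write_s = time.perf_counter() - t0

        t0 = time.perf_counter()
        total = 0
        for p in paths:
            with open(p, "rb") as f:
                while chunk := f.read(1024 * 1024):
                    total += len(chunk)
        read_s = time.perf_counter() - t0

    mb = n_files * file_mb
    assert total == mb * 1024 * 1024  # every written byte was read back
    return mb / write_s, mb / read_s

write_mbps, read_mbps = dfsio_style_benchmark()
print(f"write: {write_mbps:.0f} MB/s, read: {read_mbps:.0f} MB/s")
```

The real benchmark distributes this pattern across mappers; the per-file sequential access is what makes it a pure bandwidth test.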
Bigdata on Object Storage Performance Overview -- Batch Analytics
• Significant performance improvement from Hadoop 2.7.3/Spark 2.1.1 to Hadoop 2.8.1/Spark 2.2.0 (improvements in s3a)
• Batch analytics performance of the 10-node Intel all-flash array is almost on par with the 60-node HDD cluster
[Chart: 1TB Dataset Batch Analytics and Interactive Query, Hadoop/Spark/Presto comparison (lower is better). HDD config: 740 HDD OSDs, 40 compute nodes, 20 storage nodes. SSD config: 20 SSDs, 5 compute nodes, 5 storage nodes. Series: Hadoop 2.8.1/Spark 2.2.0, Presto 0.170, Presto 0.177. Query times (s) of 2244, 5060, 2120, 3446, 3968, 3852, 6573, 7719 across Parquet (SSD), ORC (SSD), Parquet (HDD), ORC (HDD).]
Hardware Configuration -- Dedicated LB
5x Compute Node:
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 128GB mem
• 2x 82599 10Gb NIC
• 2x SSDs
• 3x data storage drives (can be eliminated)
• Software: Hadoop 2.7.3, Spark 2.1.1, Hive 2.2.1, Presto 0.177, RHEL 7.3
5x Storage Node, 2 RGW nodes, 1 LB node:
• Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz, 128GB memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as journal
• 4x 1.6TB Intel® SSD DC S3510 as data drives
• 2x 400G S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3, RHCS 2.3
[Diagram: 5 compute nodes (Hadoop/Hive/Spark/Presto, 1x10Gb NIC each) -> LB on the head node (4x10Gb NIC, bonded) -> 2 RGW nodes (2x10Gb NIC) -> 5 storage nodes (OSD1-OSD5, MON, 2x10Gb NIC each).]
Improve Query Success Ratio with Functional Trouble-shooting
[Chart: 1TB Query Success % (54 TPC-DS queries) for hive-parquet, spark-parquet, presto-parquet (untuned), and presto-parquet (tuned); companion chart: 1TB & 10TB Query Success % for spark-parquet, spark-orc, presto-parquet (untuned and tuned).]
[Chart: count of issues by type (0-16): Ceph issues, compatibility issues, deployment issues, improper default configuration, middleware issues, runtime issues, S3A driver issues.]
• 100% of the selected TPC-DS queries passed with tunings
• Improper default configuration included: small capacity sizes, wrong middleware configuration, and improper Hadoop/Spark configuration for data of different sizes and formats
Optimizing HTTP Requests -- The Bottlenecks
• Compute time takes the biggest share (compute time = read data + sort)
• New connections are opened every time; connections are not reused
`ss` output, sampled at a 2-second interval, shows an almost entirely new set of ESTAB connections from 10.0.2.36 to 10.0.2.254:80 on ever-increasing ephemeral ports (ports 44326-44456 in the first sample, 44442-44526 two seconds later), confirming that HTTP connections are not reused:
ESTAB 0 0 ::ffff:10.0.2.36:44446 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44454 ::ffff:10.0.2.254:80
... 2 seconds later ...
ESTAB 0 0 ::ffff:10.0.2.36:44508 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44524 ::ffff:10.0.2.254:80
HTTP 500 errors in the RGW log:
2017-07-18 14:53:52.273940 7fddd7fff700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -5
2017-07-18 14:53:52.274223 7fddd7fff700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2017-07-18 14:53:52.274253 7fddd7fff700 0 ERROR: s->cio->send_content_length() returned err=-5
2017-07-18 14:53:52.274267 7fddd7fff700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
Optimizing HTTP Requests -- S3A Input Policy
Background: the S3A filesystem client supports the notion of input policies, similar to the POSIX fadvise() API call. This tunes the behavior of the S3A client to optimize HTTP GET requests for various use cases, via the experimental input policy fs.s3a.experimental.input.fadvise.
On every seek in an input stream, S3A computes diff = targetPos - pos:
• diff > 0: skip forward within the open stream
• diff < 0: backward seek -- close the stream and open a connection again
• diff = 0: sequential read
Solution: enable the random read policy in Hadoop:
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
<property>
  <name>fs.s3a.readahead.range</name>
  <value>64K</value>
</property>
By reducing the cost of closing existing HTTP requests, this is highly efficient for file IO accessing a binary file through a series of `PositionedReadable.read()` and `PositionedReadable.readFully()` calls.
Ticket: https://issues.apache.org/jira/browse/HADOOP-13203
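The seek logic above can be sketched as a toy model. This is an illustrative simplification, not Hadoop's actual classes: in "sequential" mode a backward seek forces a close-and-reopen of the whole-object HTTP stream, while in "random" mode every read is a small ranged GET, so seeks never tear a stream down:

```python
class ModelS3AStream:
    """Toy model of the S3A input-stream seek behaviour (class and
    attribute names are illustrative, not Hadoop's)."""

    def __init__(self, policy="sequential"):
        self.policy = policy
        self.pos = 0
        self.reopens = 0  # close-stream/open-connection events

    def seek(self, target_pos):
        diff = target_pos - self.pos  # the diff computed on every seek
        if self.policy == "sequential" and diff < 0:
            # Backward seek: the whole-object GET must be aborted,
            # i.e. close the stream and open a connection again.
            self.reopens += 1
        # diff > 0: skip forward inside the open stream.
        # diff == 0: sequential read, nothing to do.
        # "random" policy: each read is its own small ranged GET,
        # so no long-lived stream ever has to be torn down.
        self.pos = target_pos

# Columnar (Parquet/ORC) access: read the footer, then jump backwards
# to individual column chunks -- exactly the pattern that hurts the
# default sequential policy.
pattern = [990_000, 10_000, 500_000, 20_000, 700_000]
seq, rnd = ModelS3AStream("sequential"), ModelS3AStream("random")
for p in pattern:
    seq.seek(p)
    rnd.seek(p)
print(seq.reopens, rnd.reopens)  # 2 0
```

Every backward jump costs the sequential policy a fresh TCP/HTTP setup, which matches the flood of new ephemeral-port connections seen in the `ss` output.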
Optimizing HTTP Requests -- Performance
• The readahead feature is supported from Hadoop 2.8.1 onward, but is not enabled by default. Applying the random read policy fixed the HTTP 500 issue and improved performance 3x
• The all-flash storage architecture also shows a great performance benefit and lower TCO compared with HDD storage
[Chart: 1TB Batch Analytics Query on Parquet, seconds: Hadoop 2.7.3/Spark 2.1.1: 8305; Hadoop 2.8.1/Spark 2.2.0 (untuned): 8267; Hadoop 2.8.1/Spark 2.2.0 (tuned): 2659.]
[Chart: 1TB Dataset Batch Analytics Query, Hadoop/Spark comparison (lower is better), seconds: SSD: Parquet 2659, ORC 2120; HDD: Parquet 4810, ORC 3346.]
New Bottleneck on the Load Balancer
• The load balancer became the bottleneck on network bandwidth
• Many messages were observed blocked at the load balancer server (sending to the s3a driver), but not much was blocked on the s3a driver's receiving side
[Charts: load balancer network IO (sum of rxkB/s and txkB/s over time, up to ~2,500,000 kB/s), with a zoom-in on a single query.]
Hardware Configuration -- More RGWs with Round-Robin DNS
5x Compute Node: Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 128GB mem; 2x 82599 10Gb NIC; 2x SSDs; 3x data storage drives (can be eliminated). Software: Hadoop 2.7.3, Spark 2.1.1, Hive 2.2.1, Presto 0.177, RHEL 7.3
5x Storage Node, 5 RGW nodes: Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz; 128GB memory; 2x 82599 10Gb NIC; 1x Intel® P3700 1.0TB SSD journal; 4x 1.6TB Intel® SSD DC S3510 as data drives; 2x 400G S3700 SSDs; 1 OSD instance on each S3510 SSD; RHEL 7.3, RHCS 2.3
[Diagram: 5 compute nodes (Hadoop/Hive/Spark/Presto, 1x10Gb NIC each) resolve RGW endpoints through a DNS server on the head node; 5 RGW nodes (1x10Gb NIC each) front 5 storage nodes (OSD1-OSD5, MON, 2x10Gb NIC each).]
Performance Evaluation -- More RGWs and Round-Robin DNS
• 18% performance improvement with more RGWs and round-robin DNS
• Query 42 (which has less shuffle) is 1.64x faster in the new architecture
[Chart: 1TB Batch Analytics Query (Parquet), seconds: dedicated single load balancer 2659 vs DNS round-robin 2244.]
[Chart: Query 42 query time (10TB dataset, Parquet), seconds: dedicated load balancer 129 vs DNS RR 78.]
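A round-robin DNS entry maps one hostname to several A records, so each new client connection lands on a different RGW without any load-balancer hop in the data path. A minimal sketch of the effect (the hostname, IPs, and `itertools.cycle`-based rotation are all illustrative):

```python
import itertools

# A records behind a single shared RGW hostname (illustrative addresses).
rgw_a_records = ["10.0.2.51", "10.0.2.52", "10.0.2.53", "10.0.2.54", "10.0.2.55"]
_rotation = itertools.cycle(rgw_a_records)

def resolve(hostname):
    """Return the next A record for the shared RGW hostname, the way a
    DNS server rotates the record order between queries."""
    return next(_rotation)

# 10 S3A client connections spread evenly over the 5 gateways.
targets = [resolve("rgw.ceph.example") for _ in range(10)]
counts = {ip: targets.count(ip) for ip in rgw_a_records}
print(counts)  # each gateway receives 2 of the 10 connections
```

Because the clients talk to the gateways directly, aggregate bandwidth scales with the number of RGW NICs instead of being capped by one LB server's NICs.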
Key Resource Utilization Comparison
• The compute side (Hadoop s3a driver) can read data from the OSDs faster: the DNS deployment brings a bigger improvement in network throughput than a single gateway with bonding/teaming
[Charts: S3A driver network IO (sum of rxkB/s and txkB/s, up to ~800,000 kB/s) under DNS round-robin vs under the dedicated LB server.]
Hardware Configuration -- RGW and OSD Collocated
5x Compute Node: Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 128GB mem; 2x 82599 10Gb NIC; 2x SSDs; 3x data storage drives (can be eliminated). Software: Hadoop 2.7.3, Spark 2.1.1, Hive 2.2.1, Presto 0.177, RHEL 7.3
5x Storage Node (RGW collocated): Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz; 128GB memory; 2x 82599 10Gb NIC; 1x Intel® P3700 1.0TB SSD as WAL and RocksDB; 4x 1.6TB Intel® SSD DC S3510 as data drives; 2x 400G S3700 SSDs; 1 OSD instance on each S3510 SSD; RHEL 7.3, RHCS 2.3
[Diagram: 5 compute nodes (Hadoop/Hive/Spark/Presto, 1x10Gb NIC each) -> DNS server on the head node -> 5 storage nodes, each running an RGW+OSD pair (RGW1+OSD1 ... RGW5+OSD5), with 3x10Gb NIC (ECMP); MON on the head.]
Performance under RGW & OSD Collocation
• No extra dedicated RGW servers needed; RGW instance and OSD traffic go through different network interfaces by enabling ECMP
• No performance degradation, but lower TCO
[Chart: 1TB Batch Analytics Query on Parquet, seconds: dedicated RGWs 2242 vs RGW+OSD co-location 2162.]
[Chart: other workloads (10TB Parquet ETL, 1TB DFSIO write), seconds (0-2500): RGW+OSD co-location is comparable to dedicated RGWs.]
RGW & OSD Collocated -- RGW Scaling
• Scaling out RGWs improves performance until the OSDs (storage) saturate
• The number of RGWs that yields the best performance is therefore determined by the bandwidth of each RGW server and the aggregate throughput of the OSDs
[Chart: 1TB DFSIO write throughput (MB/s) vs number of RGWs: 1: 950; 2: 1730; 3: 2119; 4: 2233; 5: 2254.]
[Chart: 10TB ETL time (seconds, lower is better) vs number of RGWs: 6: 337; 9: 252; 18: 250.]
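The scaling rule above amounts to a simple saturation model: aggregate throughput is capped by either the gateways or the OSD back end, whichever is smaller. A sketch with illustrative numbers (the per-RGW and OSD limits below are assumptions, not the measured cluster limits):

```python
def cluster_throughput(n_rgws, per_rgw_mbps=1000, osd_limit_mbps=2300):
    """Expected aggregate throughput (MB/s) with n RGWs in front of a
    fixed OSD back end: linear in n until the OSDs saturate."""
    return min(n_rgws * per_rgw_mbps, osd_limit_mbps)

scaling = [cluster_throughput(n) for n in range(1, 6)]
print(scaling)  # [1000, 2000, 2300, 2300, 2300] -- flat once OSDs saturate
```

The measured DFSIO curve (950, 1730, 2119, 2233, 2254 MB/s for 1-5 RGWs) shows the same shape: near-linear scaling at first, then a plateau at the storage limit.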
Using Ceph RBD as Shuffle Storage -- Eliminate the Physical Drive on the Compute Node
• Remote RBD volumes on the compute nodes act as shuffle devices instead of physical shuffle drives
• For most queries the performance is not impacted
[Chart: 1TB Batch Analytics, per-query time (0-200s) for the 54 TPC-DS queries (query3.sql through query97.sql), physical shuffle storage vs RBD shuffle storage.]
Compute-Side Caching
5x Compute Node: Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 128GB mem; 2x 82599 10Gb NIC; 2x SSDs; 3x data storage drives (can be eliminated). Software: Hadoop 2.7.3, Spark 2.1.1, Hive 2.2.1, Presto 0.177, RHEL 7.3
5x Storage Node, 5 RGW nodes: Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz; 128GB memory; 2x 82599 10Gb NIC; 1x Intel® P3700 1.0TB SSD as WAL and RocksDB; 4x 1.6TB Intel® SSD DC S3510 as data drives; 1 OSD instance on each S3510 SSD; RHEL 7.3, RHCS 2.3
5x Caching Node (co-located with compute): 1TB SSD (P3700). Software: Alluxio* 1.7.0
[Diagram: 5 compute nodes, each with an SSD caching layer, -> DNS server on the head node -> 5 storage nodes, each running an RGW+OSD pair with 3x10Gb NIC; MON on the head.]
Compute-Side Caching for I/O Intensive Queries
• Compute-side caching brings better efficiency (10% - 30%) for I/O intensive queries
[Chart: I/O intensive query performance, seconds, for query19, query42, query43, query52, query55, query63, query68. S3 series: 25.55, 17.10, 13.83, 16.88, 16.49, 16.53, 33.15. Compute-side caching series: 21.84, 12.13, 12.50, 12.65, 12.30, 12.96, 21.89. A -37% improvement is annotated on the chart.]
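A compute-side cache such as Alluxio serves repeated reads from local SSD instead of going back to the object store. A minimal read-through-cache sketch of that idea (the class, the backing-store function, and the object names are all illustrative, not Alluxio's API):

```python
class ReadThroughCache:
    """Serve reads from a local store, fetching from the remote object
    store only on a miss -- the read-through pattern a compute-side
    cache applies to hot datasets."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn  # slow path: read from the object store
        self.store = {}
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = self.fetch_fn(key)  # fetch from S3/RGW on a miss
        self.store[key] = value     # keep a local copy for next time
        return value

remote_reads = []
def fetch_from_object_store(key):
    remote_reads.append(key)        # record every trip to the object store
    return f"data:{key}"

cache = ReadThroughCache(fetch_from_object_store)
for _ in range(3):                  # an I/O-intensive query re-reads
    for obj in ["part-0", "part-1"]:  # the same objects several times
        cache.read(obj)
print(cache.misses, cache.hits, len(remote_reads))  # 2 4 2
```

Only the first read of each object crosses the network; the re-reads that dominate I/O-intensive queries are served locally, which is where the 10%-30% gain comes from.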
Hardware Configuration -- Remote HDFS
5x Compute Node:
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 128GB mem
• 2x 82599 10Gb NIC
• 2x S3700 SSDs as shuffle storage
• Software: Hadoop 2.7.3, Spark 2.1.1, Hive 2.2.1, Presto 0.177, RHEL 7.3
5x Data Node:
• Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz, 128GB memory
• 2x 82599 10Gb NIC
• 7x 400G S3700 SSDs as data store
• RHEL 7.3
[Diagram: name node + node manager + Spark on the head, node manager + Spark on the other compute nodes (1x10Gb NIC each), against 5 data nodes (1x10Gb NIC each).]
Bigdata on Cloud vs. Remote HDFS -- Batch Analytics
On-par performance compared with remote HDFS
• With the optimizations, bigdata analytics on object storage is on par with remote HDFS, especially on Parquet-format data
• The performance of the s3a driver is close to the native DFS client, demonstrating that the separated compute/storage solution performs comparably to the combined solution
[Chart: 1TB Dataset Batch Analytics and Interactive Query comparison with remote HDFS (lower is better), seconds: Parquet: object store 2244 vs remote HDFS 1921; ORC: object store 5060 vs remote HDFS 2839.]
Bigdata on Cloud vs. Remote HDFS -- DFSIO
• DFSIO write performance on Ceph is better than remote HDFS (43%), but read performance is 34% lower
• Writes to Ceph hit the disk bottleneck
• Shuffle storage blocks while consuming data from the data lake
[Chart: disk bandwidth (sum of rkB/s and wkB/s over time, up to ~300,000 kB/s).]
[Chart: DFSIO 1TB on Parquet, throughput (MB/s), series dfsio write / dfsio read, for remote HDFS (replication 1), remote HDFS (replication 3), and Ceph over s3a; extracted values: 4257, 1644, 2354, 3845, 3532 (one value was lost in extraction).]
Bigdata on Cloud vs. Remote HDFS -- Terasort
• Time cost at the Reduce stage is the biggest part
• Reads and writes happen concurrently
[Charts: OSD data drive disk bandwidth (sum of rkB/s and wkB/s) and OSD data drive IO latencies (sum of await and svctm) during the Terasort run.]
[Chart: 1TB Terasort total throughput (MB/s): remote HDFS 887 vs Ceph over s3a 442.]
Bigdata on Cloud vs. Remote HDFS -- Ongoing Rename Optimizations
"Renaming" overhead can be improved!
• DirectOutputCommitter: an implementation in Spark 1.6 that returns the destination address as the working directory, so task output never needs to be renamed/moved; poor robustness against failures; removed in Spark 2.0
• IBM's "Stocator" committer: targets OpenStack Swift; good robustness, but it is another file system alongside s3a
• Staging committer: one choice of new s3a committer; needs large hard-disk capacity for staged data
• Magic committer: another choice of new s3a committer; if you know your object store is consistent, or you use S3Guard, this committer has higher performance
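Why rename matters here: on HDFS a commit-by-rename is a cheap metadata operation, but an S3-like object store has no rename, so it becomes copy-then-delete and the cost grows with the output size. A toy model of that difference (the function and numbers are illustrative, not measurements):

```python
def commit_cost_bytes(output_bytes, store):
    """Bytes that must be physically moved to 'rename' committed task
    output into its final location."""
    if store == "hdfs":
        return 0                 # rename is a NameNode metadata update
    if store == "s3":
        return output_bytes      # server-side COPY re-writes every byte
    raise ValueError(f"unknown store: {store}")

one_tb = 10**12
print(commit_cost_bytes(one_tb, "hdfs"))  # 0
print(commit_cost_bytes(one_tb, "s3"))    # 1000000000000
```

The committers listed above are different strategies for avoiding that O(data) copy: writing directly to the destination, staging locally, or using multipart-upload tricks to make the final commit a metadata-sized operation.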
Summary and Next Steps
Summary
• Bigdata on a Ceph data lake is functionally ready, validated with the industry-standard decision-support workload TPC-DS
• Bigdata on the cloud delivers on-par performance with remote HDFS for batch analytics; write-intensive operations still need further optimization
• All-flash solutions demonstrated a significant TCO benefit compared with HDD solutions
Next
• Expand the scope of analytic workloads
• Optimize rename operations to improve performance
• Accelerate performance with:
  • a speed-up layer for shuffle
  • compute-side caching
Experiment Environment
Hadoop head (1 node):
• Roles: Hadoop name node, secondary name node, resource manager, data node, node manager, Hive metastore service, YARN history server, Spark history server, Presto server
• Processor: Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz, 44 cores, HT enabled
• Memory: 128GB; Storage: 4x 1TB HDD, 2x Intel S3510 480GB SSD; Network: 10Gb
Hadoop slave (5 nodes):
• Roles: data node, node manager, Presto server
• Processor: Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz, 44 cores, HT enabled
• Memory: 128GB; Storage: 1x Intel S3510 480GB SSD; Network: 40Gb
Load balancer (1 node):
• Roles: HAProxy
• Memory: 128GB; Network: 10Gb + 10Gb
OSD (5 nodes):
• Roles: Ceph OSD
• Memory: 128GB; Storage: 1x Intel® P3700 1.6TB as journal, 4x 1.6TB Intel® SSD DC S3510 + 2x 400GB S3700 as data store; Network: 10Gb
RGW (5 nodes):
• Roles: Ceph RADOS gateway
• Processor: Intel® Xeon® CPU E31280 @ 3.50GHz, 4 cores, HT enabled
• Memory: 32GB; Storage: 1x Intel S3510 480GB SSD
SW Configuration
Hadoop version: 2.7.3 / 2.8.1
Spark version: 2.1.1 / 2.2.0
Hive version: 2.2.1
Presto version: 0.177
Executor memory: 22GB
Executor cores: 5
# of executors: 24
JDK version: 1.8.0_131
memory.overhead: 5GB
S3A Key Performance Configuration
fs.s3a.connection.maximum: 10
fs.s3a.threads.max: 30
fs.s3a.socket.send.buffer: 8192
fs.s3a.socket.recv.buffer: 8192
fs.s3a.threads.keepalivetime: 60
fs.s3a.max.total.tasks: 1000
fs.s3a.multipart.size: 100M
fs.s3a.block.size: 32M
fs.s3a.readahead.range: 64k
fs.s3a.fast.upload: true
fs.s3a.fast.upload.buffer: array
fs.s3a.fast.upload.active.blocks: 4
fs.s3a.experimental.input.fadvise: random
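The S3A settings above live in Hadoop's core-site.xml as `<property>` elements. A small helper to render a dict of such properties as that XML fragment (a convenience sketch, not part of Hadoop; the subset of keys below is just an example):

```python
from xml.etree import ElementTree as ET

# A subset of the S3A tunings from the table above.
S3A_TUNINGS = {
    "fs.s3a.connection.maximum": "10",
    "fs.s3a.threads.max": "30",
    "fs.s3a.readahead.range": "64k",
    "fs.s3a.fast.upload": "true",
    "fs.s3a.experimental.input.fadvise": "random",
}

def to_core_site(props):
    """Render a dict of Hadoop properties as a <configuration> fragment
    in core-site.xml's <property><name>/<value> layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

xml = to_core_site(S3A_TUNINGS)
print(xml)
```

Using ElementTree rather than string concatenation guarantees the fragment stays well-formed regardless of the values.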
Legal Disclaimer & Optimization Notice
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Resource Utilization on 1TB Parquet
[Charts for the compute node, OSD node, and RGW node: CPU utilization (average %idle, %steal, %iowait), disk bandwidth (sum of rkB/s, wkB/s), memory utilization, and network IO (sum of rxkB/s, txkB/s) over the run.]
Low resource utilizations
Resource Utilization on 10TB Parquet
[Charts for the compute node, OSD node, and RGW node: CPU utilization (average %idle, %steal, %iowait), disk bandwidth (sum of rkB/s, wkB/s), memory utilization, and network IO (sum of rxkB/s, txkB/s) over the run.]