§ The network must provide high bandwidth and low latency, and should scale seamlessly with the Hadoop cluster to deliver predictable performance
§ And many more, such as:
§ Integration with operational data systems
§ Authentication, authorization, and encryption
§ Centralized management
Infrastructure Requirements
Figure 1.2: Picture of a row of servers in a Google WSC, 2012.
1.6.1 STORAGE
Disk drives or Flash devices are connected directly to each individual server and managed by a global distributed file system (such as Google's GFS [58]), or they can be part of Network Attached Storage (NAS) devices directly connected to the cluster-level switching fabric. A NAS tends to be a simpler solution to deploy initially because it allows some of the data management responsibilities to be outsourced to a NAS appliance vendor. Keeping storage separate from computing nodes also makes it easier to enforce quality-of-service guarantees, since the NAS runs no compute jobs besides the storage server. In contrast, attaching disks directly to compute nodes can reduce hardware costs (the disks leverage the existing server enclosure) and improve networking fabric utilization (each server network port is effectively dynamically shared between the computing tasks and the file system).
The replication model between these two approaches is also fundamentally different. A NAS tends to provide high availability through replication or error correction capabilities within each appliance, whereas systems like GFS implement replication across different machines and consequently will use more networking bandwidth to complete write operations. However, GFS-like systems are able to keep data available even after the loss of an entire server enclosure or rack, and may allow higher aggregate read bandwidth because the same data can be sourced from multiple replicas.
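To make the write-bandwidth trade-off concrete, here is a minimal back-of-the-envelope sketch in Python; the replication factor of 3 and the assumption that the first replica lands on the writing node are illustrative, not figures from the text:

    # Back-of-the-envelope: network bytes needed to commit a write in a
    # GFS-like system that replicates each block across machines.
    def write_network_bytes(data_bytes, replication_factor=3, first_replica_local=True):
        # Illustrative assumption: one replica may be stored on the writing
        # node itself, so only the remaining copies cross the network.
        remote_copies = replication_factor - (1 if first_replica_local else 0)
        return data_bytes * remote_copies

    one_gib = 1 << 30
    print(write_network_bytes(one_gib))  # 2 GiB of network traffic for a 1 GiB write

A NAS appliance that handles redundancy internally would instead put roughly one copy of the data on the fabric per write, which is why cross-machine replication consumes more networking bandwidth.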
Will my infrastructure meet my needs now and in the future without putting my business at risk?
Where to Deploy your Hadoop Cluster?
When enterprises adopt Hadoop, one of the decisions they must make is the deployment model. There are four options as illustrated in Figure 1:
• On-premise full custom. With this option, businesses purchase commodity hardware, install the software, and operate it themselves. This option gives businesses full control of the Hadoop cluster.
• Hadoop appliance. This preconfigured Hadoop cluster allows businesses to bypass detailed technical configuration decisions and jumpstart data analysis.
• Hadoop hosting. Much as with a traditional ISP model, organizations rely on a service provider to deploy and operate Hadoop clusters on their behalf.
• Hadoop-as-a-Service. This option gives businesses instant access to Hadoop clusters with a pay-per-use consumption model, providing greater business agility.
To determine which of these options presents the right deployment model, organizations must consider five key areas. The first is the price-performance ratio, and it is the focus of this paper. The Hadoop-as-a-service model is typically cloud-based and uses virtualization technology to automate deployment and operation processes (in comparison, the other models typically use physical machines directly).
Two divergent views exist regarding the price-performance ratio of Hadoop deployments. One view holds that a virtualized Hadoop cluster is slower, because Hadoop's workload involves intensive I/O operations, which tend to run slowly in virtualized environments. The other view holds that the cloud-based model provides compelling cost savings, because its individual server nodes tend to be less expensive and Hadoop is horizontally scalable.
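The tension between the two views reduces to simple arithmetic: a virtualized node may process data more slowly, but it also costs less per hour, and the cluster can be scaled out. A minimal sketch, in which every price and throughput figure is invented purely for illustration:

    # Hypothetical cost-to-complete comparison for a fixed workload.
    # All numbers below are illustrative assumptions, not benchmark results.
    def cost_to_process(tb_of_data, nodes, tb_per_node_hour, price_per_node_hour):
        hours = tb_of_data / (nodes * tb_per_node_hour)
        return hours * nodes * price_per_node_hour

    workload_tb = 100
    physical = cost_to_process(workload_tb, nodes=10, tb_per_node_hour=1.0, price_per_node_hour=2.00)
    virtual  = cost_to_process(workload_tb, nodes=20, tb_per_node_hour=0.7, price_per_node_hour=0.60)
    print(f"physical: ${physical:.2f}, virtualized: ${virtual:.2f}")

Note that under linear scaling the cost is independent of node count, since doubling the nodes halves the runtime; what actually decides the comparison is price per unit of throughput.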
The second area of consideration is data privacy, which is a common concern when storing data outside of corporate-owned infrastructure. Cloud-based deployment requires a comprehensive cloud-data privacy strategy that encompasses areas such as proper implementation of legal requirements, well-orchestrated data-protection technologies, and the organization's culture with regard to adopting emerging technologies. The Accenture Cloud Data Privacy Framework outlines a detailed approach to help clients address this issue.
The third area is data gravity. Once data volume reaches a certain point, physical data migration becomes prohibitively slow, which means that many organizations are locked into their current data platform. Therefore, the portability of data, the anticipated future growth of data, and the location of data must all be carefully considered.
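A quick calculation shows why data gravity bites; the link speed, efficiency factor, and volume below are illustrative assumptions:

    # How long does it take to migrate a large dataset over a WAN link?
    def migration_days(volume_tb, link_gbps, efficiency=0.8):
        # efficiency is an assumed factor for protocol overhead and contention
        bits = volume_tb * 8 * 10**12
        seconds = bits / (link_gbps * 10**9 * efficiency)
        return seconds / 86400

    print(f"{migration_days(1000, 1):.0f} days")  # ~1 PB over 1 Gbps: roughly 116 days

At petabyte scale even a dedicated link keeps the data in place for months, which is exactly the lock-in effect described above.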
A related and fourth area is data enrichment, which involves leveraging multiple datasets to uncover new insights. For example, combining a consumer's purchase history and social-networking activities can yield a deeper understanding of the consumer's lifestyle and key personal events, and therefore enable companies to introduce new services and products of interest. The primary challenge is that these multiple datasets add up to a large volume of data, and moving them between locations is slow. Therefore, many organizations choose to co-locate these datasets. Given volume and portability considerations, most organizations choose to move the smaller datasets to the location of the larger ones. Thus, thinking strategically about where to house your data, considering both current and future needs, is key.
The fifth area is the productivity of developers and data scientists. They tap into the datasets, create a “sandbox” environment, explore the data analysis ideas, and deploy them into production. Cloud’s self-service deployment model tends to expedite this process.
Figure 1. The spectrum of Hadoop deployment options
§ Oracle R Support for Big Data
§ R is an open-source language and environment for statistical analysis and graphing
§ The standard R distribution is installed on all nodes of Oracle Big Data Appliance
§ Oracle R Connector for Hadoop provides R users with high-performance, native access to HDFS and the MapReduce programming framework
§ Oracle R Enterprise is a separate package that provides real-time access to Oracle Database.
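Oracle R Connector for Hadoop itself is an R package, but the kind of programmatic HDFS access it provides can be illustrated with a generic Python analogue using the open-source hdfs (HdfsCLI) client. This is not the Oracle R API; the NameNode URL, port, user, and file path below are placeholders:

    # Generic illustration of programmatic HDFS access (not the Oracle R API).
    # Requires: pip install hdfs; URL, user, and paths are placeholders.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode.example.com:50070", user="analyst")
    print(client.list("/data"))            # browse a directory in HDFS
    with client.read("/data/sales.csv") as reader:
        content = reader.read()            # stream a file without copying it locally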
§ Oracle NoSQL Database
§ Oracle NoSQL Database is a distributed key-value database built on the storage technology of Berkeley DB Java Edition.
§ An intelligent driver on top of Berkeley DB keeps track of the underlying storage topology, shards the data and knows where data can be placed with the lowest latency
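The driver's routing behavior can be sketched in a few lines: hash the major key to pick a shard, so any client can send a request directly to the node that owns the key. The hash function and shard count here are illustrative assumptions, not Oracle's actual algorithm:

    # Minimal sketch of key-hash sharding as done by a topology-aware driver.
    # The hash function and shard count are illustrative assumptions.
    import hashlib

    NUM_SHARDS = 12

    def shard_for(major_key: str) -> int:
        digest = hashlib.md5(major_key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    # All records sharing a major key land on the same shard, so multi-part
    # reads under one major key stay on a single node.
    print(shard_for("user/42"))  # the same major key always routes to the same shard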
§ Oracle Big Data Lite VM
§ http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
§ MOS Notes
§ Information Center: Oracle Big Data Appliance (Doc ID 1445762.2)
§ Big Data Connectors (Doc ID 1487399.2)
§ Sqoop Frequently Asked Questions (FAQ) (Doc ID 1510470.1)
Technical white paper | HP Reference Architecture for MapR M5
This section specifies which server to use and the rationale behind it. The Reference Architectures section will provide topologies for the deployment of control and worker services across the nodes for clusters of varying sizes.
Processor configuration
MapR manages the amount of work each server can undertake via the number of Map/Reduce slots configured for that server. The more cores available to the server, the more Map/Reduce slots can be configured for it (see the Computation section for more detail). We recommend six-core processors for a good balance of price and performance, and we recommend enabling Hyper-Threading.
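A minimal sketch of the relationship between cores and slots; the oversubscription ratio, reserved cores, and map-to-reduce split below are assumptions for illustration, not MapR defaults:

    # Illustrative sizing helper: derive Map/Reduce slot counts from core count.
    # The ratios below are assumptions, not MapR defaults.
    def suggest_slots(physical_cores, hyperthreading=True, reserved_cores=2):
        logical = physical_cores * (2 if hyperthreading else 1)
        usable = max(logical - reserved_cores, 1)  # leave headroom for OS/daemons
        map_slots = round(usable * 2 / 3)          # assumed 2:1 map-to-reduce split
        reduce_slots = usable - map_slots
        return map_slots, reduce_slots

    print(suggest_slots(12))  # two six-core CPUs with Hyper-Threading -> (15, 7)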
Drive configuration
Redundancy is built into the MapR architecture, so there is no need for RAID or additional hardware components to improve redundancy on the server; it is all coordinated and managed by the MapR software.
MapR Benefit
Drives should use a Just a Bunch of Disks (JBOD) configuration, which can be achieved with the HP P420 RAID controller by configuring each individual disk as a separate RAID 0 volume. We recommend disabling array acceleration on the controller to better handle large block I/Os in the Hadoop environment.
Lastly, servers should provide a large amount of storage capacity, which increases the total capacity of the distributed file system; we recommend at least twelve 2TB Large Form Factor (LFF) drives for optimum I/O performance. The DL380e supports 14 LFF drives, which allows one either to use all 14 drives for data or to use 12 drives for data and the remaining 2 to mirror the operating system and MapR runtime. Hot-pluggable drives are recommended so that drives can be replaced without restarting the server.
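For capacity planning, the usable space per worker works out as follows; a replication factor of 3 is assumed here as a common default, and filesystem overhead is ignored:

    # Usable capacity of one DL380e worker under the drive options above.
    # Assumes replication factor 3 (a common default); ignores FS overhead.
    def usable_tb(data_drives, drive_tb=2, replication=3):
        return data_drives * drive_tb / replication

    print(usable_tb(14))  # all 14 LFF drives for data   -> ~9.3 TB usable
    print(usable_tb(12))  # 12 data + 2 OS mirror drives ->  8.0 TB usable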
Memory configuration
Servers running the node processes should have sufficient memory for either HBase or the number of Map/Reduce slots configured on the server. A server with a larger RAM configuration will deliver optimum performance for both HBase and Map/Reduce. To ensure optimal memory performance and bandwidth, we recommend using 8GB or 16GB DIMMs to populate each of the 6 memory channels as needed.
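The channel arithmetic behind that recommendation, as a quick sketch; one DIMM per channel is an assumption for illustration:

    # Balanced memory totals for the 6 channels mentioned above,
    # assuming one DIMM per channel (an illustrative assumption).
    for dimm_gb in (8, 16):
        print(f"6 channels x {dimm_gb} GB DIMMs = {6 * dimm_gb} GB per server")
    # 48 GB or 96 GB; populating every channel evenly keeps memory
    # bandwidth balanced across channels.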
Network configuration
The DL380e includes four 1GbE NICs onboard. MapR automatically identifies the available NICs on the server and bonds them via the MapR software to increase throughput.
MapR Benefit
Each of the reference architecture configurations below specifies an additional top-of-rack switch for redundancy. To make best use of this, we recommend cabling the ProLiant DL380e worker nodes so that NIC 1 is cabled to Switch 1 and NIC 2 to Switch 2, repeating the same pattern for NICs 3 and 4. Each NIC in the server should have its own IP subnet rather than sharing a subnet with the other NICs.
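A sketch of the per-NIC addressing scheme this implies, using Python's standard ipaddress module; the subnet ranges are placeholders:

    # Example per-NIC subnet plan: each NIC gets its own subnet so the
    # software-bonded links do not share a broadcast domain. Ranges are
    # placeholders, not values from the reference architecture.
    import ipaddress

    def nic_plan(node_index, nics=4):
        plan = {}
        for nic in range(1, nics + 1):
            subnet = ipaddress.ip_network(f"10.10.{nic}.0/24")
            plan[f"eth{nic - 1}"] = f"{subnet.network_address + node_index}/{subnet.prefixlen}"
        return plan

    print(nic_plan(11))  # worker 11: eth0=10.10.1.11/24, eth1=10.10.2.11/24, ...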
HP ProLiant DL380e Gen8
The HP ProLiant DL380e Gen8 (2U) is an excellent choice as the server platform for the worker nodes.
Appliance, Cloud or DIY?
Oracle BDA
+ High-performance, scalable network architecture
+ Highly integrated into the Oracle ecosystem
+ Complete software stack, Oracle & Hadoop
+ Single point of support
+ Competitive price/performance ratio for enterprise-class demands
Amazon EC2 Instances
+ Fast and easy deployment
+ Scales from very small to very large cluster setups
+ Capacity on demand on an hourly basis
+ Optional enterprise-class Hadoop distribution
+ Attractive price model for volatile utilization and capacity on demand
Do it Yourself
+ Low entry point
+ Free choice of hardware
+ Free choice of software stack