Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017

Apr 05, 2017

ENABLE FAST BIG DATA ANALYTICS ON CEPH WITH ALLUXIO

Adit Madan

March 2017

ABOUT ME

Adit Madan, Software Engineer @ Alluxio, Inc

Master’s @ Carnegie Mellon University

Bachelor’s @ Indian Institute of Technology, Delhi

Email: [email protected]

ALLUXIO INTRODUCTION

FASTEST-GROWING BIG DATA PROJECT

• Fastest growing open-source project in the big data ecosystem

• 400+ contributors from 100+ organizations

• Running the world’s largest production clusters

• You are welcome to join the community!

BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO

Enabling applications to access data from any storage system at memory speed

Application-facing interfaces:
• Native File System
• Hadoop Compatible File System
• Native Key-Value Interface
• FUSE Compatible File System

Storage-facing interfaces:
• HDFS Interface
• Amazon S3 Interface
• Swift Interface
• GlusterFS Interface

WHY ALLUXIO

Co-located with compute, provides memory-speed access to data

Virtualized across different storage systems under a unified global namespace

Distributed system, scale-out architecture

Software only, no changes needed to existing applications
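
To make the "unified global namespace" point concrete, here is a toy sketch of how one logical path tree could be mapped onto several backing stores with a longest-prefix mount table. It is an illustration only, not Alluxio's implementation; the mount points, hosts, and bucket names are hypothetical.

```python
# Toy mount table: logical mount point -> backing storage URI.
# All mount points and URIs below are hypothetical examples.
MOUNTS = {
    "/": "hdfs://namenode:9000/alluxio",
    "/ceph": "swift://analytics/data",
    "/s3": "s3a://example-bucket/logs",
}

def resolve(path, mounts=MOUNTS):
    """Map a logical path to a backing-store URI using longest-prefix match.

    Raises ValueError if no mount point covers the path
    (for example, a path that does not start with "/").
    """
    best = max(
        (m for m in mounts if path == m or path.startswith(m.rstrip("/") + "/")),
        key=len,
    )
    suffix = path[len(best):].lstrip("/")
    base = mounts[best].rstrip("/")
    return f"{base}/{suffix}" if suffix else base

print(resolve("/ceph/2017/events.txt"))
print(resolve("/tmp/scratch"))
```

An application only ever sees paths such as /ceph/2017/events.txt; which storage system serves them is decided by the table.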

ALLUXIO BENEFITS

• Unification: New workflows across any data in any storage system

• Performance: Orders of magnitude improvement in run time

• Flexibility: Choice in compute and storage; grow each independently, buy only what is needed

USE CASE – ACCELERATE I/O TO/FROM REMOTE STORAGE

• Compute and Storage Separation
  • Advantages
    • Meet different compute and storage hardware requirements efficiently
    • Scale compute and storage independently
    • Store data in traditional filers/SANs and object stores cost effectively
    • Compute on data in existing storage via Big Data computational frameworks
  • Disadvantage
    • Accessing data requires remote I/O

USE CASE WITHOUT ALLUXIO

Diagram: Spark reading from remote Storage

• Within Spark: low latency, memory throughput

• Between Spark and Storage: high latency, network throughput

USE CASE WITH ALLUXIO

Spark

Storage

Alluxio

Keeping data in Alluxio accelerates data access

ACCELERATE I/O TO/FROM REMOTE STORAGE

"The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds."
- Baidu

RESULTS

• Data queries are now 30x faster with Alluxio

• The Alluxio cluster runs stably, providing over 50TB of RAM space

• By using Alluxio, batch queries that usually lasted over 15 minutes became interactive queries taking less than 30 seconds
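
As a quick check on how the quoted timings relate to the speedup figures on this slide, the arithmetic can be sketched as below. It uses only the numbers stated here: the quote's 100-150 s versus 10-15 s, and the batch case's "over 15 minutes" versus "less than 30 seconds".

```python
def speedup(before_s, after_s):
    """Ratio of runtime before to runtime after; larger means faster."""
    return before_s / after_s

# Quote: 100-150 s with Spark SQL alone, 10-15 s with Alluxio.
quote_min = speedup(100, 15)   # slowest "after" against fastest "before"
quote_max = speedup(150, 10)   # fastest "after" against slowest "before"

# Batch case: over 15 minutes before, under 30 seconds after,
# so this ratio is a lower bound.
batch_min = speedup(15 * 60, 30)

print(round(quote_min, 1), quote_max, batch_min)
```

On these numbers the quote corresponds to roughly a 7x to 15x range, while the 30x figure matches the batch-to-interactive case.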

Baidu’s PMs and analysts run interactive queries to gain insights into their products and business

• 200+ nodes deployment

• 2+ petabytes of storage

• Mix of memory + HDD

Diagram: Alluxio layered on top of Baidu File System

ALLUXIO ON CEPH

Stack (top to bottom): Spark, Alluxio, Ceph Object Storage

● Connect using RADOS Gateway
  ○ Swift Object Storage API
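
A minimal sketch of what the Alluxio side of this connection might look like, written as a small helper that renders a site-properties fragment. The property names are the ones I recall from the Alluxio 1.x Swift under-storage documentation and should be confirmed against the docs for the version in use. The helper function, container, folder, user, tenant, password, and gateway URL are all hypothetical, and whether a tenant is needed depends on the authentication method.

```python
def swift_ufs_properties(container, folder, user, tenant, password,
                         auth_url, auth_method):
    """Render an Alluxio site-properties fragment for a Swift-API under store.

    Property names are assumed from the Alluxio 1.x Swift docs; verify them
    for your Alluxio version before use.
    """
    props = [
        ("alluxio.underfs.address", f"swift://{container}/{folder}"),
        ("fs.swift.user", user),
        ("fs.swift.tenant", tenant),
        ("fs.swift.password", password),
        ("fs.swift.auth.url", auth_url),
        ("fs.swift.auth.method", auth_method),
    ]
    return "\n".join(f"{key}={value}" for key, value in props)

# Hypothetical values; the auth URL points at a RADOS Gateway endpoint.
fragment = swift_ufs_properties(
    container="analytics",
    folder="demo",
    user="spark",
    tenant="demo-tenant",
    password="CHANGE_ME",
    auth_url="http://rgw.example.com:7480/auth/1.0",
    auth_method="swiftauth",  # assumed value for gateway-native auth
)
print(fragment)
```

Spark jobs would then read through Alluxio paths rather than addressing the gateway directly.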

EC2 CONFIGURATION

● 1 Compute Master
  ○ Spark and Alluxio Masters

● 3 Compute Workers
  ○ Spark and Alluxio Workers

● 1 Storage Manager
  ○ Ceph RadosGW and Monitor

● 2 Storage Devices
  ○ Ceph OSDs

● Instance type: r3.xlarge

● Availability Zone: us-east-1a
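
For a rough sense of scale, the sketch below totals the instances and the RAM across the compute workers. It assumes the published r3.xlarge figure of 30.5 GiB per instance. The slides do not say how much of that memory was given to Alluxio, so the worker total is only an upper bound.

```python
# Published memory for an r3.xlarge instance (assumption: unchanged from AWS specs).
R3_XLARGE_MEMORY_GIB = 30.5

# Node counts as listed on the slide.
CLUSTER = {
    "compute master": 1,    # Spark and Alluxio masters
    "compute workers": 3,   # Spark and Alluxio workers
    "storage manager": 1,   # Ceph RadosGW and monitor
    "storage devices": 2,   # Ceph OSDs
}

def total_instances(cluster=CLUSTER):
    """Number of EC2 instances in the whole setup."""
    return sum(cluster.values())

def worker_memory_gib(cluster=CLUSTER, per_instance=R3_XLARGE_MEMORY_GIB):
    """Upper bound on RAM across compute workers, before Spark and OS overhead."""
    return cluster["compute workers"] * per_instance

DATASET_GB = 60  # dataset size used in the demo

print(total_instances())                  # 7
print(worker_memory_gib())                # 91.5
print(DATASET_GB < worker_memory_gib())   # True, treating GB and GiB loosely
```

On these assumptions the 60GB demo dataset is smaller than the combined worker memory, which is consistent with repeated reads being served from memory.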

SOFTWARE VERSIONS

● Ceph Version: 0.94.9

● Alluxio Version: 1.4.0
  ○ Custom JOSS library 0.9.13-SNAPSHOT

● Spark Version: 1.6.1

DEMO OF THE SOLUTION

● Spark, Alluxio and Ceph cluster pre-deployed

● Ceph pre-populated with a 60GB dataset

● Launch spark shell
  a. First ‘count’
  b. Second ‘count’
  c. <Restart shell>
  d. Third ‘count’

● Ad-hoc queries w/ Alluxio
  a. ‘wordcount’ w/ intermediate data
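
The slide does not spell out what each count is meant to show. A plausible reading is that the first count pulls data from Ceph, the second is served from Alluxio, and the third shows that the cached data survives a shell restart because it lives outside the Spark process. The toy simulation below illustrates that read-through behaviour in plain Python. It is not Spark or Alluxio code, and the class names and sample data are hypothetical.

```python
class RemoteStore:
    """Stands in for the remote object store (Ceph in the demo)."""

    def __init__(self, objects):
        self.objects = objects
        self.reads = 0  # number of remote reads served

    def read(self, key):
        self.reads += 1
        return self.objects[key]


class ReadThroughCache:
    """Stands in for a cache layer that lives outside the compute process."""

    def __init__(self, store):
        self.store = store
        self.cached = {}

    def read(self, key):
        # Fetch from the remote store only on a miss, then keep the data.
        if key not in self.cached:
            self.cached[key] = self.store.read(key)
        return self.cached[key]


def count_lines(fs, keys):
    """A stateless 'count' job: total number of lines across all objects."""
    return sum(len(fs.read(key).splitlines()) for key in keys)


store = RemoteStore({"part-0": "a\nb\n", "part-1": "c\n"})
cache = ReadThroughCache(store)
keys = ["part-0", "part-1"]

first = count_lines(cache, keys)    # misses: both objects come from the store
second = count_lines(cache, keys)   # hits: no additional remote reads
# "Restart shell": the job keeps no state of its own, the cache object survives.
third = count_lines(cache, keys)

print(first, second, third, store.reads)
```

All three counts return the same result, while the remote store is read only during the first one.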

SPARK COUNT PERFORMANCE

Count on 60 GB dataset

● 20x improvement for repeated access

FOR MORE INFORMATION

Please take a look at our whitepaper!

● Blog: https://alluxio.com/blog/accelerating-data-analytics-on-ceph-object-storage-with-alluxio

● Whitepaper: https://alluxio.com/resources/accelerating-data-analytics-on-ceph-object-storage-with-alluxio

Thank you!

Contact: [email protected] or [email protected]
@Alluxio
Websites: www.alluxio.com and www.alluxio.org
