Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Posted on 05-Apr-2017
ABOUT ME
Adit Madan, Software Engineer @ Alluxio, Inc
Master’s @ Carnegie Mellon University
Bachelor’s @ Indian Institute of Technology, Delhi
Email: adit@alluxio.com
FASTEST-GROWING BIG DATA PROJECT
• Fastest growing open-source project in the big data ecosystem
• 400+ contributors from 100+ organizations
• Running world’s largest production clusters
• Join the community!
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM WITH ALLUXIO
FUSE Compatible File System
Hadoop Compatible File System
Native Key-Value Interface
Native File System
Enabling Application to Access Data from any Storage System at Memory-speed
BIG DATA ECOSYSTEM ISSUES
GlusterFS Interface
Amazon S3 Interface
Swift Interface
HDFS Interface
WHY ALLUXIO
Co-located with compute, provides memory-speed access to data
Virtualized across different storage systems under a unified global namespace
Distributed system, scale-out architecture
Software only, no change needed to existing application
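The "unified global namespace" point can be pictured as a mount table that maps logical paths onto different under-storage systems. The sketch below is only an illustration of that idea, not Alluxio's implementation: the `MountTable` class, the host name `namenode:9000`, and the container name `analytics` are all hypothetical.

```python
from typing import Dict, Optional


class MountTable:
    """Toy unified namespace: maps logical path prefixes to under-storage URIs."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        # Normalize so "/ceph/" and "/ceph" name the same mount point.
        self._mounts[logical_prefix.rstrip("/") or "/"] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        """Return the under-storage URI backing a logical path (longest prefix wins)."""
        best: Optional[str] = None
        for prefix in self._mounts:
            covers = (
                prefix == "/"
                or logical_path == prefix
                or logical_path.startswith(prefix + "/")
            )
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount covers {logical_path}")
        # Strip the mount prefix, then append what is left to the storage URI.
        tail = logical_path if best == "/" else logical_path[len(best):]
        remainder = tail.lstrip("/")
        base = self._mounts[best]
        return f"{base}/{remainder}" if remainder else base


# Hypothetical mounts: one HDFS cluster and one Swift container behind a single tree.
table = MountTable()
table.mount("/", "hdfs://namenode:9000/data")
table.mount("/ceph", "swift://analytics")
```

With these mounts, an application addresses `/ceph/logs/day1.txt` and `/tables/users` through one path tree, while the table decides which storage system actually holds each file.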
ALLUXIO BENEFITS
Unification: New workflows across any data in any storage system
Performance: Orders of magnitude improvement in run time
Flexibility: Choice in compute and storage – grow each independently, buy only what is needed
USE CASE – ACCELERATE I/O TO/FROM REMOTE STORAGE
• Compute and Storage Separation
• Advantages
  ○ Meet different compute and storage hardware requirements efficiently
  ○ Scale compute and storage independently
  ○ Store data in traditional filers/SANs and object stores cost effectively
  ○ Compute on data in existing storage via Big Data computational frameworks
• Disadvantage
  ○ Accessing data requires remote I/O
ACCELERATE I/O TO/FROM REMOTE STORAGE
"The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds." - Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB of RAM space
• With Alluxio, batch queries that usually took over 15 minutes became interactive queries taking less than 30 seconds
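The figures above can be cross-checked with simple arithmetic. The sketch below assumes the stated bounds are used as-is (15 minutes and 30 seconds) and, for the quoted Spark SQL ranges, that the low ends and the high ends pair up; the slides do not say how the ranges correspond, and `speedup` is a hypothetical helper name.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of the old runtime to the new runtime."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


# Batch query: "over 15 minutes" down to "less than 30 seconds".
batch_speedup = speedup(15 * 60, 30)

# Quoted Spark SQL query: 100-150 s down to 10-15 s (assumed pairing of range ends).
quote_low = speedup(100, 10)
quote_high = speedup(150, 15)
```

Under these assumptions the batch-to-interactive case works out to a factor of 30, in line with the "30x faster" bullet, while the quoted Spark SQL example corresponds to roughly a factor of 10.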
Baidu’s PMs and analysts run interactive queries to gain insights into their products and business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
ALLUXIO
Baidu File System
ALLUXIO ON CEPH
Spark
Ceph Object Storage
Alluxio
● Connect using RADOS Gateway
  ○ Swift Object Storage API
EC2 CONFIGURATION
● 1 Compute Master
  ○ Spark and Alluxio Masters
● 3 Compute Workers
  ○ Spark and Alluxio Workers
● 1 Storage Manager
  ○ Ceph RadosGW and Monitor
● 2 Storage Devices
  ○ Ceph OSDs
● Instance type: r3.xlarge
● Availability Zone: us-east-1a
SOFTWARE VERSIONS
● Ceph Version: 0.94.9
● Alluxio Version: 1.4.0
  ○ Custom JOSS library 0.9.13-SNAPSHOT
● Spark Version: 1.6.1
DEMO OF THE SOLUTION
● Spark, Alluxio and Ceph Cluster pre-deployed
● Ceph pre-populated with a 60GB dataset
● Launch Spark shell
  a. First ‘count’
  b. Second ‘count’
  c. <Restart shell>
  d. Third ‘count’
● Ad-hoc queries w/ Alluxio
  a. ‘wordcount’ w/ intermediate data
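The slides do not record the demo's timings, but the count / count / restart / count sequence is the usual way to show a read-through cache that lives outside the compute process. The toy model below illustrates that expected behavior only; `UnderStore`, `CacheLayer`, and `run_shell_session` are hypothetical stand-ins for Ceph, Alluxio, and a Spark shell session, and no real Spark or Alluxio API is used.

```python
from typing import Dict, List


class UnderStore:
    """Stands in for remote object storage; counts how often it is read."""

    def __init__(self, objects: Dict[str, List[str]]) -> None:
        self._objects = objects
        self.remote_reads = 0

    def read(self, key: str) -> List[str]:
        self.remote_reads += 1
        return self._objects[key]


class CacheLayer:
    """Read-through cache that outlives any single compute session."""

    def __init__(self, store: UnderStore) -> None:
        self._store = store
        self._cached: Dict[str, List[str]] = {}

    def read(self, key: str) -> List[str]:
        # Only a cache miss goes to the remote store.
        if key not in self._cached:
            self._cached[key] = self._store.read(key)
        return self._cached[key]


def run_shell_session(cache: CacheLayer, key: str, counts: int) -> List[int]:
    """Models one shell session; it keeps no data of its own between sessions."""
    return [len(cache.read(key)) for _ in range(counts)]


store = UnderStore({"dataset": ["line one", "line two", "line three"]})
cache = CacheLayer(store)

first_session = run_shell_session(cache, "dataset", counts=2)   # first and second 'count'
second_session = run_shell_session(cache, "dataset", counts=1)  # third 'count', after restart
```

In this model only the first count touches the remote store; the second count and the post-restart third count are both served from the cache layer, because the cached data does not live inside the shell session.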
FOR MORE INFORMATION ….
Please take a look at our Whitepaper!
● Blog: https://alluxio.com/blog/accelerating-data-analytics-on-ceph-object-storage-with-alluxio
● Whitepaper: https://alluxio.com/resources/accelerating-data-analytics-on-ceph-object-storage-with-alluxio