Cloudian® S3 Cloud Storage Platform
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object Storage
Paul Turner
Cloudian Inc.
June 11th 2014
Nov 18, 2014
About Cloudian
• Hybrid cloud storage startup in Silicon Valley
– Strong venture backing: Goldman Sachs, Intel Capital
– Solid management with storage, big data, enterprise software and telco expertise
– 50 employees, offices in Foster City, Japan and China
• Production hardened product
• Target market: mid- to large-enterprises & regional service providers
• GTM: traditional storage distribution/VARs
CLOUDIAN PARTNERS
The Challenge
• Business problem = Analysis of log data from our customer systems to improve support (classic ‘Internet of Things’ content)
• Existing system required transformation of the data into HDFS for analytics (slow and costly)
Goal : Reduce cost and provide faster results
Use Case: Support Analytics
• Abnormal Operations Analysis
– Compare system statistics and usage patterns to previous normal results
• End User Analysis to root cause issues
– Identify all operations for a particular user and review patterns and any faults
• Trend Analysis for Capacity Planning and Traffic Patterns
– Build capacity and traffic trend lines based on statistical analysis of all traffic
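The end-user analysis in this use case maps naturally onto MapReduce: map each log line to its user, group by user, then reduce to a per-user summary. The following is a minimal in-process sketch, assuming a hypothetical log line format of `timestamp user operation http_status`. The slides do not give the real log layout, so the format, the sample data, and the "status >= 400 is a fault" rule are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical log format (an assumption, not Cloudian's actual layout):
#   <timestamp> <user> <operation> <http_status>
SAMPLE_LOG = """\
2014-06-11T10:00:01 alice PUT 200
2014-06-11T10:00:02 bob GET 200
2014-06-11T10:00:03 alice GET 500
2014-06-11T10:00:04 alice GET 200
2014-06-11T10:00:05 bob DELETE 404
"""


def map_line(line):
    """Map step: emit (user, (operation, is_fault)) for one log line."""
    _timestamp, user, operation, status = line.split()
    # Assumption: any HTTP status of 400 or above counts as a fault.
    return user, (operation, int(status) >= 400)


def reduce_user(records):
    """Reduce step: summarize all operations seen for one user."""
    by_operation = defaultdict(int)
    faults = 0
    total = 0
    for operation, is_fault in records:
        total += 1
        faults += int(is_fault)
        by_operation[operation] += 1
    return {"operations": total, "faults": faults,
            "by_operation": dict(by_operation)}


def analyze(log_text):
    """Run map, shuffle (group by user), and reduce in one process."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        if line.strip():
            user, value = map_line(line)
            grouped[user].append(value)
    return {user: reduce_user(values) for user, values in grouped.items()}


if __name__ == "__main__":
    for user, summary in sorted(analyze(SAMPLE_LOG).items()):
        print(user, summary)
```

On a real cluster the map and reduce functions would run as separate tasks over log objects read in place from storage; the grouping step here stands in for the shuffle.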
• 100 tps S3 server = 83 million lines of info log = 3.5 GB/day
• 10-server system = 35 GB/day ~ 1 TB/month
• 100 customer systems => 1.2 PB annually
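The sizing figures can be reproduced with a few lines of arithmetic. This sketch uses only the per-server numbers from the slide and assumes decimal units (1 TB = 1000 GB) and a 30-day month. The slide's 1.2 PB figure follows from rounding to 1 TB/month; the unrounded 365-day estimate comes out slightly higher.

```python
# Inputs taken from the slide.
GB_PER_SERVER_PER_DAY = 3.5   # info log volume of one 100 tps S3 server
SERVERS_PER_SYSTEM = 10
CUSTOMER_SYSTEMS = 100


def daily_gb_per_system():
    """Log volume of one 10-server system per day, in GB."""
    return GB_PER_SERVER_PER_DAY * SERVERS_PER_SYSTEM


def monthly_tb_per_system(days_per_month=30):
    """Log volume of one system per month, in TB (decimal units assumed)."""
    return daily_gb_per_system() * days_per_month / 1000


def annual_pb_all_systems(days_per_year=365):
    """Log volume of all customer systems per year, in PB (decimal units assumed)."""
    return daily_gb_per_system() * days_per_year * CUSTOMER_SYSTEMS / 1_000_000


if __name__ == "__main__":
    print(f"{daily_gb_per_system():.1f} GB/day per system")
    print(f"{monthly_tb_per_system():.2f} TB/month per system")
    print(f"{annual_pb_all_systems():.2f} PB/year across all systems")
```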
Traditional Big Data Flow
[Diagram. Sources: Consumer Activity (Events, GPS, WiFi); Social Media; Device Tracking and Logs (Event, Configuration, Usage, Performance). Components: Event Processing Platform, Big Data Storage Platform, Analytics Platform, Content Storage. Flows: Real Time Events, Big Data, Result of analysis]
Traditional Big Data Flow
[Diagram. Components: Event Processing Platform, Content Storage (Object, NAS), Analytics Platform (HDFS). Data: Logs, Config]
• Wasted storage = separate storage for content and for analytics
• Transforming data into HDFS can be costly
• High overhead of HDFS (3-copy replication) for content which may be of poor quality
S3 and Hadoop
• Apache Hadoop has supported S3 since Jan 2008
– http://wiki.apache.org/hadoop/AmazonS3
• Well-proven by Amazon with Elastic MapReduce
• State-of-the-art and advancing quickly to provide much easier Hadoop over S3
– e.g. Netflix Genie: https://github.com/Netflix/genie
Cloudian Approach
[Diagram. Components: Event Processing Platform, Analytics, Cloudian HyperStore Storage]
• No redundant storage of data
• HyperStore scales out with your data – adding nodes for I/O
• Analyze more - allows for efficient bulk data analysis in place
• Take advantage of multi-core CPUs – makes sense for MapReduce
• Can feed smarter data for subsequent analytic systems
• Faster time to decision
Cloudian Hadoop Configuration
• Hadoop 2.2
• Configured for native S3 file system (etc/hadoop/core-site.xml)
– S3N: native file system for reading and writing regular files on S3. The advantage of this file system is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop.
• Configure Hadoop to use Cloudian (etc/hadoop/jets3t.properties)
– s3service.s3-endpoint=CLOUDIAN_ENDPOINT
– s3service.s3-endpoint-http-port=CLOUDIAN_PORT
Note: you can also dedicate a bucket for Hadoop analytics and then Hadoop will chunk the content into blocks for storage – like HDFS
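As an illustration of the two configuration files named above, the following sketch renders a candidate `core-site.xml` and `jets3t.properties`. Only the two `s3service.s3-endpoint*` keys come from the slide. The credential properties `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` are Hadoop's standard S3N settings; making the bucket the default file system via `fs.defaultFS`, and the JetS3t options `s3service.https-only` and `s3service.disable-dns-buckets`, are assumptions that may or may not suit a given deployment. The bucket name, keys, host, and port are placeholders.

```python
import xml.etree.ElementTree as ET


def render_core_site(bucket, access_key, secret_key):
    """Render a core-site.xml that selects the S3N file system.

    Assumption: the bucket is used as the default file system; jobs can
    instead address it explicitly with s3n://<bucket>/<path> URIs.
    """
    properties = {
        "fs.defaultFS": f"s3n://{bucket}",
        "fs.s3n.awsAccessKeyId": access_key,
        "fs.s3n.awsSecretAccessKey": secret_key,
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def render_jets3t(endpoint, http_port):
    """Render a jets3t.properties that points JetS3t at a non-AWS endpoint."""
    lines = [
        f"s3service.s3-endpoint={endpoint}",             # from the slide
        f"s3service.s3-endpoint-http-port={http_port}",  # from the slide
        "s3service.https-only=false",          # assumption: plain HTTP endpoint
        "s3service.disable-dns-buckets=true",  # assumption: path-style bucket access
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(render_core_site("hadoop-logs", "ACCESS_KEY", "SECRET_KEY"))
    print(render_jets3t("CLOUDIAN_ENDPOINT", "CLOUDIAN_PORT"))
```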
Cloudian HyperStore® Software
[Diagram. Access interfaces: S3, NFS]
• Scalable peer-to-peer architecture
• Multi-data center replication
• Multi-Tenancy and Chargeback
• Hybrid cloud-ready (any S3 cloud)
• 100s of supported applications
• Optimized for any workload
• Storage for OpenStack & CloudStack
Elastic, Distributed and Reliable
NoSQL database distributes and replicates data
[Diagram. Logical ring spanning data centers DC1 and DC2]
Data is automatically replicated to multiple nodes.
Location of data can be designated, for instance, to multiple datacenters and per rack.
In theory, the number of nodes in a logical ring can be up to 2^127 (almost infinite).
Data load can be rebalanced when a node is added or removed.
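The logical ring idea can be sketched with consistent hashing: each node and each object key is hashed to a token on a ring of size 2^127, and an object is stored on the first few nodes found clockwise from its token. This is a generic illustration, not HyperStore's implementation. The MD5-based token function, the single token per node, and the replica count of 3 are assumptions, and the data center and rack placement mentioned above is not modeled.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token space size, matching the figure on the slide


def token(key):
    """Hash a string to a position on the ring (MD5-based, an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class Ring:
    """A minimal consistent-hash ring with one token per node."""

    def __init__(self, nodes=()):
        self._tokens = sorted((token(node), node) for node in nodes)

    def add(self, node):
        bisect.insort(self._tokens, (token(node), node))

    def remove(self, node):
        self._tokens.remove((token(node), node))

    def replicas(self, key, count=3):
        """Return the first `count` nodes clockwise from the key's token."""
        if not self._tokens:
            return []
        start = bisect.bisect_left(self._tokens, (token(key), ""))
        chosen = []
        for offset in range(len(self._tokens)):
            node = self._tokens[(start + offset) % len(self._tokens)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == count:
                break
        return chosen


if __name__ == "__main__":
    ring = Ring(f"node{i}" for i in range(6))
    print(ring.replicas("bucket/object-1"))
    ring.add("node6")  # only keys near node6's token change owners
    print(ring.replicas("bucket/object-1"))
```

Because each key's replicas depend only on its neighbors on the ring, adding or removing a node changes ownership for a limited range of tokens rather than reshuffling all data, which is what makes rebalancing practical.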
Enhanced HyperStore® Technology
• Policies tailored for different object types
• Optimized for all data
• Chunking for better performance
• Erasure Coding for deep archive efficiency
• Reliable storage across multi-node failures
[Diagram. HyperStore (Patent Pending): Small Objects, Large Objects, Active Content, File System, NoSQL DB, Erasure Coding, Deep Archives]
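To make the "deep archive efficiency" point concrete, the raw capacity needed for replication can be compared with that of erasure coding. The slides do not state which erasure coding scheme HyperStore uses, so the 4 data + 2 parity split below is purely a hypothetical example; the formulas themselves are the standard ones for n-copy replication and k+m erasure coding.

```python
def replication_raw_tb(usable_tb, copies=3):
    """Raw capacity needed to hold `usable_tb` with full copies."""
    return usable_tb * copies


def erasure_coded_raw_tb(usable_tb, data_fragments, parity_fragments):
    """Raw capacity needed with k data + m parity fragments per object."""
    total_fragments = data_fragments + parity_fragments
    return usable_tb * total_fragments / data_fragments


if __name__ == "__main__":
    usable = 100  # TB of user data, an illustrative figure
    print("3-copy replication:", replication_raw_tb(usable), "TB raw")
    # Hypothetical 4+2 scheme: survives the loss of any 2 fragments.
    print("4+2 erasure coding:", erasure_coded_raw_tb(usable, 4, 2), "TB raw")
```

The trade-off is that reading or rebuilding an erasure-coded object involves several fragments, which is why it is positioned here for deep archives rather than active content.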
Cloudian Complete S3 API
• Core REST API – Get, Put, Post, Head, Delete
• Multi-part uploads: Allows uploading large objects in multiple parts
• Versioning: Multiple versions of same object
• Bucket Lifecycle: Auto-expiration using rules
• Server side encryption: Managed by Cloudian
• Location Constraint: Assign data to specific region (e.g. for HIPAA compliance)
• Bucket Website: Create buckets as websites to host web content
• Access control lists (ACLs) define access rights to bucket and object
• And more...
Cloudian Complete S3 API
[Table: Products vs. S3 API support]
Cloudian
AmpliData
Basho
Caringo
Cleversafe
EMC Atmos
NetApp Bycast
Scality
OpenStack Swift
Seamless tiering to Amazon S3, Glacier and other S3 Service Providers
• Cloudian deployed as On-Premises S3 cloud behind the firewall
• Automatically migrates data to AWS using Bucket Lifecycle Policies
– Optional migration to Glacier
– Metadata maintained for search/list of objects
• Configurable to reduce overhead
• Read/Writes to migrated objects
– restore by default, option to redirect to AWS/S3 Service Provider
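A bucket lifecycle policy of the kind described above is expressed in the S3 API as an XML document attached to the bucket. The sketch below builds one in the standard Amazon S3 lifecycle format, with a transition rule and an optional expiration rule. The slides do not show how Cloudian selects the tiering destination, so treat this as the generic S3 shape only; the rule ID, prefix, and day counts are hypothetical.

```python
import xml.etree.ElementTree as ET


def lifecycle_rule_xml(rule_id, prefix, transition_days,
                       storage_class="GLACIER", expiration_days=None):
    """Build an S3-style LifecycleConfiguration document with one rule."""
    root = ET.Element("LifecycleConfiguration")
    rule = ET.SubElement(root, "Rule")
    ET.SubElement(rule, "ID").text = rule_id
    ET.SubElement(rule, "Prefix").text = prefix
    ET.SubElement(rule, "Status").text = "Enabled"

    # Move matching objects to another storage class after N days.
    transition = ET.SubElement(rule, "Transition")
    ET.SubElement(transition, "Days").text = str(transition_days)
    ET.SubElement(transition, "StorageClass").text = storage_class

    # Optionally delete matching objects after a longer period.
    if expiration_days is not None:
        expiration = ET.SubElement(rule, "Expiration")
        ET.SubElement(expiration, "Days").text = str(expiration_days)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical policy: tier logs after 30 days, expire after a year.
    print(lifecycle_rule_xml("tier-old-logs", "logs/", 30, expiration_days=365))
```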
[Diagram. Client/Application uses S3 to reach the On-Premises S3 cloud behind the Firewall; content is migrated or restored via Bucket Lifecycle Policies to Amazon S3 and Amazon Glacier, with an option to redirect migrated content]
Big Data Storage Platform
[Diagram. Inputs: Consumer Activity (Events, GPS, WiFi); Social media; Business Tracking (goods, inventory, campaign, sales). Event Processing Platform: Input I/F, CEP Engine (Filter, Judge, Aggregate), Real Time Analysis, Recommend. Big Data Storage Platform: Big Data Analysis, Analyze, Recommend. Data Analysis and Storage Platform: Content Storage. Outcome: Smarter Business]
Future Work
• Delivery of Cloudian Hadoop-ready object storage (2H CY14)
• Integration with key Hadoop distributions
• Locality awareness
• Potentially use new drive technology for processing (e.g. HGST Ethernet drive)
• Find out more – Booth 139
Cloudian® S3 Cloud Storage Platform
Thank You!
Questions?
www.cloudian.com
“The Leading Provider of Hybrid Cloud Storage”