Transcript
1
Simplify and Automate Tiering & Archive for Hadoop Today
2
Reasons For Storage Tiering with Hadoop:
• A single tier creates a large imbalance between compute and storage resources
• More applications create varying workloads
• A large percentage of data is cold in most cases
• More recently ingested data can be better balanced
• Fewer nodes per GB with archive nodes
• Lower infrastructure costs
Tiering on Hadoop
Existing Tier Node: Medium Compute, Medium Capacity
Cold Tier Node: Low Compute, High-Density Capacity (4x Less Per GB)
Name Nodes
Accessed Data
Cold Data
Archive Node Example
3
• Over 65% less hardware
• 60% fewer nodes (software licensing)
• Significant performance improvement
• Immediate ROI for cloud and private infrastructures
Tiered HDFS Storage: Disk Data Nodes 20% + Archive Data Nodes 80%
Single Tier HDFS Storage: Disk Data Nodes 100%
“The price per GB of the ARCHIVE tier is 4x less” -eBay Hadoop Engineering Blog
Example of Archive Tiering Benefits
4x Fewer Nodes
Capacity 10PB in each configuration
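The "4x fewer nodes" and "60% fewer nodes" claims above can be sketched as simple capacity arithmetic. The node capacities below are hypothetical assumptions (the deck only states the 4x density ratio, the ~80% cold fraction, and the 10PB total):

```python
# Rough sketch of the node-count math behind "4x fewer nodes" /
# "60% fewer nodes". Node sizes are illustrative assumptions,
# not figures from the deck.
DISK_NODE_TB = 100      # assumed usable capacity of a standard data node
ARCHIVE_NODE_TB = 400   # assumed archive node at 4x the density
TOTAL_TB = 10 * 1000    # 10PB cluster, as on the slide

cold_tb = TOTAL_TB * 8 // 10   # deck assumes ~80% of data is cold
hot_tb = TOTAL_TB - cold_tb

single_tier_nodes = TOTAL_TB // DISK_NODE_TB
tiered_nodes = hot_tb // DISK_NODE_TB + cold_tb // ARCHIVE_NODE_TB

print(single_tier_nodes)  # 100 nodes, single tier
print(tiered_nodes)       # 40 nodes, tiered: 60% fewer
```

Under these assumed node sizes, 100 single-tier nodes shrink to 40 tiered nodes, which is where a "60% fewer nodes" figure can come from.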
4
Pillars of Intelligent Tiering for Hadoop
HEAT · AGE · SIZE · USAGE
HEAT: Access frequency of data is the most important metric for effective tiering.
AGE: Age is the easiest to determine. CAUTION: Some data is long-term active, so this cannot be the only criterion.
SIZE: Zero-byte and small files should be treated differently when tiering Hadoop. Large cold files should have priority for archive.
USAGE: Knowing how long data is accessed once ingested provides better capacity planning for your tiers.
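The pillars above can be combined into a simple archive decision. This is a hypothetical sketch, not FactorData's actual logic; all names and thresholds are illustrative:

```python
from dataclasses import dataclass

# Illustrative sketch of the four tiering pillars as an archive
# decision. Thresholds are assumptions, not values from the deck.
SMALL_FILE_BYTES = 1 << 20   # treat files under ~1 MiB differently

@dataclass
class FileStats:
    path: str
    size_bytes: int
    age_days: int    # days since ingest (AGE)
    idle_days: int   # days since last access (HEAT)

def should_archive(f: FileStats, min_age: int = 120, min_idle: int = 90) -> bool:
    # SIZE: zero/small files are handled separately, never auto-archived here
    if f.size_bytes <= SMALL_FILE_BYTES:
        return False
    # HEAT and AGE together, per the CAUTION that age alone is not enough
    return f.age_days >= min_age and f.idle_days >= min_idle

print(should_archive(FileStats("/data/logs/2019", 10 << 30, 400, 200)))  # True
print(should_archive(FileStats("/data/dim/users", 10 << 30, 400, 2)))    # False: old but still hot
```

Note that the second file is old but recently accessed, so checking heat alongside age keeps long-term-active data out of the archive tier.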
5
Tier Hadoop HDFS By Heat, Age, Size & Activity In Three Easy Steps: The FactorData Approach
01/ INSTALL WITHOUT CHANGES TO CLUSTER
Installed on a server or VM outside your existing Hadoop cluster, without inserting any proprietary technology on the cluster or in the data path.
02/ VISUALIZE & REPORT
Report data usage (heat), small files, user activity, replication, and HDFS tier utilization. Customize rules and queries to properly utilize infrastructure and plan better for future scale.
03/ AUTOMATE OPTIMIZATION
Automatically archive, promote, or change the replication factor of data based on usage patterns and user-defined rules.
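Step 03 maps onto capabilities HDFS already ships with: built-in storage policies (e.g. COLD for archive-only placement), the block mover, and replication changes. The sketch below only builds the standard HDFS CLI command strings such a tool could issue; the paths and the helper functions are hypothetical:

```python
# Sketch of automated optimization expressed as standard HDFS CLI
# commands. The helper names and paths are hypothetical; the commands
# themselves are stock HDFS tooling (storage policies, mover, setrep).
def archive_commands(path: str) -> list[str]:
    return [
        # COLD is a built-in HDFS storage policy: all replicas on ARCHIVE storage
        f"hdfs storagepolicies -setStoragePolicy -path {path} -policy COLD",
        # the HDFS mover migrates existing blocks to satisfy the new policy
        f"hdfs mover -p {path}",
    ]

def demote_replication(path: str, replicas: int = 2) -> str:
    # lowering replication on cold data frees capacity on existing nodes
    return f"hdfs dfs -setrep -w {replicas} {path}"

for cmd in archive_commands("/data/clickstream/2018"):
    print(cmd)
print(demote_replication("/data/clickstream/2018"))
```

Promotion back to the hot tier would be the same pattern with the HOT policy.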
6
FactorData HDFSplus Automates HDFS Tiering
1. Query list based on size, heat, activity, and age
2. Apply storage policy based on custom query
3. Files are optimized during the normal balancing window

Custom Query Example:
• Move all files 120 days old and not accessed for 90 days to ARCHIVE
• FactorData creates a data list based on the query

Automated Tiering:
• Limit an automated run by max files or capacity
• FactorData tracks completion of each run
• Data can be excluded from a run according to path, size, and application
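The query-plus-run-limits flow above can be sketched in a few lines. This is an illustrative reimplementation under assumed data shapes; FactorData's actual query engine is not public:

```python
# Illustrative sketch of an archive-tiering run: filter candidates by
# age/idle thresholds, honor path exclusions, then cap the run by file
# count or total capacity. Field names and defaults are assumptions.
def plan_run(files, min_age=120, min_idle=90,
             exclude_prefixes=("/tmp",), max_files=1000, max_bytes=10 << 40):
    run, total = [], 0
    # query list: e.g. files 120 days old and not accessed for 90 days
    candidates = [f for f in files
                  if f["age_days"] >= min_age and f["idle_days"] >= min_idle
                  and not f["path"].startswith(exclude_prefixes)]
    # largest cold files first: they free the most disk-tier capacity
    for f in sorted(candidates, key=lambda f: -f["size"]):
        if len(run) >= max_files or total + f["size"] > max_bytes:
            break
        run.append(f["path"])
        total += f["size"]
    return run

files = [
    {"path": "/data/a", "size": 5 << 30, "age_days": 400, "idle_days": 200},
    {"path": "/tmp/b",  "size": 9 << 30, "age_days": 400, "idle_days": 200},
    {"path": "/data/c", "size": 1 << 30, "age_days": 30,  "idle_days": 5},
]
print(plan_run(files))  # ['/data/a'] -- /tmp excluded, /data/c too young and hot
```

Tracking completion per run (as the slide describes) would then just mean persisting the returned list and checking each path's storage policy afterwards.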
7
FactorData HDFSplus Architecture
Completely out of the data path: FactorData HDFSplus sits outside the Hadoop cluster and collects only metadata from the Hadoop cluster.
No software to install on the existing Hadoop cluster: Because HDFSplus leverages only existing Hadoop APIs and features, there is no software to install on the cluster.
A highly scalable solution in a small footprint: HDFS visibility and automation for thousands of Hadoop nodes on a single node, VM, or server.
HDFSplus communicates with the namenodes via the existing Hadoop API.
VM or Physical Machine: 32GB RAM, 4 CPU or vCPU, 500GB Free Disk
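Metadata-only collection through existing Hadoop APIs is exactly what the standard WebHDFS REST interface provides (e.g. `GET /webhdfs/v1/<dir>?op=LISTSTATUS`). The sketch below parses a hand-made sample payload in the documented response shape; a real collector would fetch it from the active namenode over HTTP:

```python
import json

# Sketch of metadata-only collection via the stock WebHDFS REST API.
# The payload is a hand-made sample mimicking a LISTSTATUS response;
# no cluster is contacted here.
sample = json.loads("""
{"FileStatuses":{"FileStatus":[
  {"pathSuffix":"events.parquet","type":"FILE","length":1073741824,
   "accessTime":1546300800000,"modificationTime":1514764800000,
   "replication":3}
]}}
""")

for st in sample["FileStatuses"]["FileStatus"]:
    # WebHDFS reports times as epoch milliseconds
    print(st["pathSuffix"], st["length"], st["accessTime"] // 1000)
```

One practical caveat: heat metrics depend on HDFS access times, whose precision is governed by the namenode's `dfs.namenode.accesstime.precision` setting (coarse by default), so per-access granularity should not be assumed.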
8
Simplify and Automate Archive and Tiering in Hadoop Today
• Move less-accessed data to storage-dense nodes for better utilization
• Lower software licensing costs
• Free resources on existing namenodes and datanodes
FactorData Tiering & Archive on Hadoop
How can we get more performance out of our existing Hadoop cluster?
How can we move data not accessed for 90 days to archive nodes?
How can we better plan for future scale with real Hadoop storage metrics?
Result: Better Performance, Lower Hardware Costs, Lower Software Costs
Plus: Get Necessary Storage Visibility To Answer These Questions & More with FactorData HDFSplus