Transcript
1
Simplify and Automate Tiering & Archive for Hadoop Today
2
Reasons For Storage Tiering with Hadoop:
• A single tier creates a large imbalance between compute and storage resources
• More applications create varying workloads
• A large percentage of data is cold in most cases
• More recently ingested data can be better balanced
• Fewer nodes per GB with archive nodes
• Lower infrastructure costs
Tiering on Hadoop
Existing Tier Node: Medium Compute, Medium Capacity
Cold Tier Node: Low Compute, High-Density Capacity (4x Less Per GB)
Name Nodes
Accessed Data
Cold Data
Archive Node Example
3
• Over 65% less hardware
• 60% fewer nodes (software licensing)
• Significant performance improvement
• Immediate ROI for cloud and private infrastructures
Tiered HDFS Storage: Disk Data Nodes 20% + Archive Data Nodes 80%
Single Tier HDFS Storage: Disk Data Nodes 100%
“The price per GB of the ARCHIVE tier is 4x less” -eBay Hadoop Engineering Blog
Example of Archive Tiering Benefits
4x Fewer Nodes
Capacity 10PB in each configuration
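The "4x fewer nodes" and "60% fewer nodes" claims above can be sketched as simple capacity arithmetic. The node capacities below are hypothetical assumptions (the deck only states the 4x density ratio, the ~80% cold fraction, and the 10PB total):

```python
# Rough sketch of the node-count math behind "4x fewer nodes" /
# "60% fewer nodes". Node sizes are illustrative assumptions,
# not figures from the deck.
DISK_NODE_TB = 100      # assumed usable capacity of a standard data node
ARCHIVE_NODE_TB = 400   # assumed archive node at 4x the density
TOTAL_TB = 10 * 1000    # 10PB cluster, as on the slide

cold_tb = TOTAL_TB * 8 // 10   # deck assumes ~80% of data is cold
hot_tb = TOTAL_TB - cold_tb

single_tier_nodes = TOTAL_TB // DISK_NODE_TB
tiered_nodes = hot_tb // DISK_NODE_TB + cold_tb // ARCHIVE_NODE_TB

print(single_tier_nodes)  # 100 nodes, single tier
print(tiered_nodes)       # 40 nodes, tiered: 60% fewer
```

Under these assumed node sizes, 100 single-tier nodes shrink to 40 tiered nodes, which is where a "60% fewer nodes" figure can come from.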
4
Pillars of Intelligent Tiering for Hadoop
HEAT · AGE · SIZE · USAGE
HEAT: Access frequency of data is the most important metric for effective tiering.
AGE: Age is the easiest to determine. CAUTION: Some data is long-term active, so this cannot be the only criterion.
SIZE: Zero-byte and small files should be treated differently when tiering Hadoop. Large cold files should have priority for archive.
USAGE: Knowing how long data is accessed once ingested provides better capacity planning for your tiers.
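The pillars above can be combined into a simple archive decision. This is a hypothetical sketch, not FactorData's actual logic; all names and thresholds are illustrative:

```python
from dataclasses import dataclass

# Illustrative sketch of the four tiering pillars as an archive
# decision. Thresholds are assumptions, not values from the deck.
SMALL_FILE_BYTES = 1 << 20   # treat files under ~1 MiB differently

@dataclass
class FileStats:
    path: str
    size_bytes: int
    age_days: int    # days since ingest (AGE)
    idle_days: int   # days since last access (HEAT)

def should_archive(f: FileStats, min_age: int = 120, min_idle: int = 90) -> bool:
    # SIZE: zero/small files are handled separately, never auto-archived here
    if f.size_bytes <= SMALL_FILE_BYTES:
        return False
    # HEAT and AGE together, per the CAUTION that age alone is not enough
    return f.age_days >= min_age and f.idle_days >= min_idle

print(should_archive(FileStats("/data/logs/2019", 10 << 30, 400, 200)))  # True
print(should_archive(FileStats("/data/dim/users", 10 << 30, 400, 2)))    # False: old but still hot
```

Note that the second file is old but recently accessed, so checking heat alongside age keeps long-term-active data out of the archive tier.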
5
Tier Hadoop HDFS By Heat, Age, Size & Activity In Three Easy Steps: The FactorData Approach
01/ INSTALL WITHOUT CHANGES TO CLUSTER
Installed on a server or VM outside your existing Hadoop cluster, without inserting any proprietary technology on the cluster or in the data path.
02/ VISUALIZE & REPORT
Report data usage (heat), small files, user activity, replication, and HDFS tier utilization. Customize rules and queries to properly utilize infrastructure and plan better for future scale.
03/ AUTOMATE OPTIMIZATION
Automatically archive, promote, or change the replication factor of data based on usage patterns and user-defined rules.
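Step 03 maps onto capabilities HDFS already ships with: built-in storage policies (e.g. COLD for archive-only placement), the block mover, and replication changes. The sketch below only builds the standard HDFS CLI command strings such a tool could issue; the paths and the helper functions are hypothetical:

```python
# Sketch of automated optimization expressed as standard HDFS CLI
# commands. The helper names and paths are hypothetical; the commands
# themselves are stock HDFS tooling (storage policies, mover, setrep).
def archive_commands(path: str) -> list[str]:
    return [
        # COLD is a built-in HDFS storage policy: all replicas on ARCHIVE storage
        f"hdfs storagepolicies -setStoragePolicy -path {path} -policy COLD",
        # the HDFS mover migrates existing blocks to satisfy the new policy
        f"hdfs mover -p {path}",
    ]

def demote_replication(path: str, replicas: int = 2) -> str:
    # lowering replication on cold data frees capacity on existing nodes
    return f"hdfs dfs -setrep -w {replicas} {path}"

for cmd in archive_commands("/data/clickstream/2018"):
    print(cmd)
print(demote_replication("/data/clickstream/2018"))
```

Promotion back to the hot tier would be the same pattern with the HOT policy.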
6
FactorData HDFSplus Automates HDFS Tiering
1. Query list based on size, heat, activity, and age
2. Apply storage policy based on custom query
3. Files are optimized during the normal balancing window

Custom Query Example:
• Move all files 120 days old and not accessed for 90 days to ARCHIVE
• FactorData creates a data list based on the query

Automated Tiering:
• Limit an automated run by max files or capacity
• FactorData tracks completion of each run
• Data can be excluded from a run according to path, size, and application
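The query-plus-run-limits flow above can be sketched in a few lines. This is an illustrative reimplementation under assumed data shapes; FactorData's actual query engine is not public:

```python
# Illustrative sketch of an archive-tiering run: filter candidates by
# age/idle thresholds, honor path exclusions, then cap the run by file
# count or total capacity. Field names and defaults are assumptions.
def plan_run(files, min_age=120, min_idle=90,
             exclude_prefixes=("/tmp",), max_files=1000, max_bytes=10 << 40):
    run, total = [], 0
    # query list: e.g. files 120 days old and not accessed for 90 days
    candidates = [f for f in files
                  if f["age_days"] >= min_age and f["idle_days"] >= min_idle
                  and not f["path"].startswith(exclude_prefixes)]
    # largest cold files first: they free the most disk-tier capacity
    for f in sorted(candidates, key=lambda f: -f["size"]):
        if len(run) >= max_files or total + f["size"] > max_bytes:
            break
        run.append(f["path"])
        total += f["size"]
    return run

files = [
    {"path": "/data/a", "size": 5 << 30, "age_days": 400, "idle_days": 200},
    {"path": "/tmp/b",  "size": 9 << 30, "age_days": 400, "idle_days": 200},
    {"path": "/data/c", "size": 1 << 30, "age_days": 30,  "idle_days": 5},
]
print(plan_run(files))  # ['/data/a'] -- /tmp excluded, /data/c too young and hot
```

Tracking completion per run (as the slide describes) would then just mean persisting the returned list and checking each path's storage policy afterwards.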
7
FactorData HDFSplus Architecture
Completely out of the data path: FactorData HDFSplus sits outside the Hadoop cluster and collects only metadata from the Hadoop cluster.
No software to install on the existing Hadoop cluster: Because HDFSplus leverages only existing Hadoop APIs and features, there is no software to install on the cluster.
A highly scalable solution in a small footprint: HDFS visibility and automation for thousands of Hadoop nodes on a single node, VM, or server.
HDFSplus communicates with the namenodes via the existing Hadoop API.
VM or Physical Machine: 32GB RAM, 4 CPU or vCPU, 500GB Free Disk
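Metadata-only collection through existing Hadoop APIs is exactly what the standard WebHDFS REST interface provides (e.g. `GET /webhdfs/v1/<dir>?op=LISTSTATUS`). The sketch below parses a hand-made sample payload in the documented response shape; a real collector would fetch it from the active namenode over HTTP:

```python
import json

# Sketch of metadata-only collection via the stock WebHDFS REST API.
# The payload is a hand-made sample mimicking a LISTSTATUS response;
# no cluster is contacted here.
sample = json.loads("""
{"FileStatuses":{"FileStatus":[
  {"pathSuffix":"events.parquet","type":"FILE","length":1073741824,
   "accessTime":1546300800000,"modificationTime":1514764800000,
   "replication":3}
]}}
""")

for st in sample["FileStatuses"]["FileStatus"]:
    # WebHDFS reports times as epoch milliseconds
    print(st["pathSuffix"], st["length"], st["accessTime"] // 1000)
```

One practical caveat: heat metrics depend on HDFS access times, whose precision is governed by the namenode's `dfs.namenode.accesstime.precision` setting (coarse by default), so per-access granularity should not be assumed.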
8
Simplify and Automate Archive and Tiering in Hadoop Today
• Move less-accessed data to storage-dense nodes for better utilization
• Lower software licensing costs
• Free resources on existing namenodes and datanodes
FactorData Tiering & Archive on Hadoop
How can we get more performance out of our existing Hadoop cluster?
How can we move data not accessed for 90 days to archive nodes?
How can we better plan for future scale with real Hadoop storage metrics?
Result: Better Performance, Lower Hardware Costs, Lower Software Costs
Plus: Get Necessary Storage Visibility To Answer These Questions & More with FactorData HDFSplus