Execution Environments for Distributed Computing Self-Adapting, Energy-Conserving Distributed File Systems EEDC 34330 European Master in Distributed Computing - EMDC EEDC Presentation Mário Almeida - 4knahs[@]gmail.com www.marioalmeida.eu

Self-Adapting, Energy-Conserving Distributed File Systems

Jun 08, 2015



Mário Almeida

Overview of self-adapting, energy-conserving distributed file systems. Case study: GreenHDFS
Transcript
Page 1: Self-Adapting, Energy-Conserving Distributed File Systems

Execution Environments for Distributed Computing

Self-Adapting, Energy-Conserving Distributed File Systems

EEDC 34330

European Master in Distributed Computing - EMDC

EEDC Presentation
Mário Almeida - 4knahs[@]gmail.com

www.marioalmeida.eu

Page 2: Self-Adapting, Energy-Conserving Distributed File Systems

Outline

● Introduction
  ○ Green Computing
  ○ Distributed File Systems
  ○ DFS issues
● Hadoop Distributed File System
  ○ Overview
  ○ Evaluation
● Green HDFS
  ○ Overview
  ○ Design
  ○ Goal
  ○ Energy-management policies
  ○ Machine learning
  ○ Evaluation
● Conclusions
● References

Page 3: Self-Adapting, Energy-Conserving Distributed File Systems

Introduction - Green Computing

● Environmentally sustainable computing with minimal impact on the environment.
● Reduction of energy consumption, greenhouse gas emissions and operational costs.

Page 4: Self-Adapting, Energy-Conserving Distributed File Systems

Introduction - Distributed FS

● A Distributed File System (DFS) is any file system that lets multiple hosts access shared files over a computer network.
● May include facilities for transparent replication and fault tolerance.

Page 5: Self-Adapting, Energy-Conserving Distributed File Systems

Introduction - DFS Issues

● Distributed File Systems are often built to run on a large number of commodity servers.
● This means that:
  ○ the cluster generates heat and consumes large amounts of energy.
  ○ costs depend not only on the initial acquisition but also on power, cooling, etc.

Page 6: Self-Adapting, Energy-Conserving Distributed File Systems

Introduction - DFS Issues

● Common approach:
  ○ Scale-down: transitioning servers into low-power states.
  ○ Other approaches, not exclusive to DFS, include renewable energy, free cooling, etc.

Page 7: Self-Adapting, Energy-Conserving Distributed File Systems

HDFS Overview

● Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
● HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

Page 8: Self-Adapting, Energy-Conserving Distributed File Systems

HDFS Evaluation

In 2010, a detailed analysis of files was done in a production Yahoo! Hadoop cluster with the following characteristics:

● 2600 servers
● 34 million files
● Over 5 PB of data
● 3 months of observation

Page 9: Self-Adapting, Energy-Conserving Distributed File Systems

HDFS Evaluation

Key observations:

● Files are heterogeneous in access and lifespan patterns.
● 60% of data is "cold" or dormant.
● 95-98% of files have a very short "hotness" lifespan of less than 3 days.
● 90% of files were dormant or "cold" for more than 18 days.
● Majority of the data had a news-server-like access pattern.

Page 10: Self-Adapting, Energy-Conserving Distributed File Systems

GHDFS Overview

● Self-Adaptive - depends only on HDFS and file access patterns
● Applies Data-Classification techniques
● Energy-Aware placement of data
● Trades off cost, performance and power by separating the cluster into logical zones.

Page 11: Self-Adapting, Energy-Conserving Distributed File Systems

GHDFS Design

● Hot Zone
  ○ Files currently accessed and newly created
  ○ High energy usage and performance
● Cold Zone
  ○ Files with low to rare access
  ○ Low energy use and sleeping mode

Page 12: Self-Adapting, Energy-Conserving Distributed File Systems

GHDFS - Management Policies

GreenHDFS uses three different management policies:

● FMP - File Migration Policy
● SCP - Server Power Conserver Policy
● FRP - File Reversal Policy

[Diagram: Hot Zone, Cold Zone]

Page 13: Self-Adapting, Energy-Conserving Distributed File Systems

GHDFS - File Migration Policy

[Diagram: Hot Zone to Cold Zone when Coldness > Threshold; Cold Zone to Hot Zone when Hotness > Threshold]

● FMP monitors the dormancy of files
● Runs in the Hot Zone
● Gives higher storage efficiency for the Hot Zone, as less accessed files are moved to the Cold Zone
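The migration rule above can be sketched in a few lines of Python. This is only an illustration, not GreenHDFS code: the `FileInfo` record, the `migrate_cold_files` function, the idea of measuring dormancy as days since last access, and the 18-day threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    last_access_day: int   # day number of the most recent access
    zone: str = "hot"      # "hot" or "cold"


def migrate_cold_files(files, today, coldness_threshold_days):
    """Move hot-zone files that have been dormant longer than the threshold.

    Hypothetical sketch of the File Migration Policy: dormancy is taken to be
    the number of days since the last access (an assumption of this example).
    """
    moved = []
    for f in files:
        dormancy = today - f.last_access_day
        if f.zone == "hot" and dormancy > coldness_threshold_days:
            f.zone = "cold"
            moved.append(f.path)
    return moved


files = [
    FileInfo("/logs/2010-01-01", last_access_day=1),   # dormant for 29 days
    FileInfo("/logs/2010-01-29", last_access_day=29),  # dormant for 1 day
]
moved = migrate_cold_files(files, today=30, coldness_threshold_days=18)
print(moved)
```

Only the first file crosses the assumed threshold, so it alone is reassigned to the Cold Zone; the recently accessed file stays hot.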

Page 14: Self-Adapting, Energy-Conserving Distributed File Systems

GHDFS - Power Conserver Policy

[Diagram: Cold Zone]

● SCP runs in the Cold Zone.
● Determines which servers can go to standby/sleep mode.
● Uses hardware techniques to transition CPU, disks and DRAM into a low power state.
● Wakes the server up only if:
  ○ Data on that server is accessed
  ○ New data needs to be placed on that server
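The sleep and wake-up rules above might look roughly like the following sketch. The `ColdServer` class, the function names, and the use of an idle-time limit to decide which servers may sleep are assumptions for illustration; the slides state only when a server is woken, not how sleep candidates are chosen.

```python
from dataclasses import dataclass, field


@dataclass
class ColdServer:
    name: str
    last_activity: int                 # time of last read or write (arbitrary units)
    asleep: bool = False
    files: set = field(default_factory=set)


def sleep_idle_servers(servers, now, idle_limit):
    """Put servers with no recent activity to sleep (idle_limit is an assumption)."""
    for s in servers:
        if not s.asleep and now - s.last_activity >= idle_limit:
            s.asleep = True


def read_file(server, path, now):
    """Wake-up condition 1: data on a sleeping server is accessed."""
    if path not in server.files:
        raise KeyError(path)
    server.asleep = False
    server.last_activity = now


def place_file(server, path, now):
    """Wake-up condition 2: new data needs to be placed on the server."""
    server.asleep = False
    server.files.add(path)
    server.last_activity = now


s1 = ColdServer("cold-1", last_activity=0, files={"/a"})
s2 = ColdServer("cold-2", last_activity=95)
sleep_idle_servers([s1, s2], now=100, idle_limit=50)
states_after_sleep = (s1.asleep, s2.asleep)   # s1 idle for 100, s2 idle for 5

read_file(s1, "/a", now=101)                  # access wakes cold-1
print(states_after_sleep, s1.asleep)
```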

Page 15: Self-Adapting, Energy-Conserving Distributed File Systems

GHDFS - File Reversal Policy

● FRP runs in the Cold Zone.
● Ensures that QoS, bandwidth and response time are well managed in case a file becomes popular.

[Diagram: Cold Zone to Hot Zone when #accesses > Threshold]
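A minimal sketch of the reversal rule, assuming accesses are counted from a simple log of file paths. The function name `reverse_popular_files`, the `zone_of` mapping, and the threshold value are hypothetical and are not taken from GreenHDFS.

```python
from collections import Counter


def reverse_popular_files(access_log, zone_of, access_threshold):
    """Move cold files back to the hot zone once #accesses > threshold.

    access_log: iterable of accessed file paths (assumed log format).
    zone_of:    dict mapping path -> "hot" or "cold"; updated in place.
    """
    counts = Counter(p for p in access_log if zone_of.get(p) == "cold")
    promoted = sorted(p for p, n in counts.items() if n > access_threshold)
    for p in promoted:
        zone_of[p] = "hot"
    return promoted


zone_of = {"/old/report": "cold", "/old/archive": "cold", "/new/data": "hot"}
access_log = ["/old/report"] * 4 + ["/old/archive", "/new/data"]
promoted = reverse_popular_files(access_log, zone_of, access_threshold=3)
print(promoted)
```

Here only `/old/report` exceeds the assumed threshold of 3 accesses, so it is the only file moved back to the Hot Zone.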

Page 16: Self-Adapting, Energy-Conserving Distributed File Systems

GHDFS - Machine Learning

● Designing and developing algorithms that allow computers to evolve behaviors based on empirical data.
● Recognize patterns and make decisions based on data.

Page 17: Self-Adapting, Energy-Conserving Distributed File Systems

GHDFS - Machine Learning

GHDFS uses:

● Supervised machine learning.
● A variant of Multiple Linear Regression to find the statistical correlation between directory and file attributes.
● Training data preparation - audit logs and metadata.
● Predicts the file's lifespan, size and heat upon creation of the file.

It works because there is a high correlation between the directory hierarchy and file attributes in a well-laid-out and partitioned namespace.
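To make the idea concrete, here is a small sketch that fits plain ordinary least squares (not the specific regression variant GreenHDFS uses, which these slides do not describe) on features derived from a file's directory path, and then predicts the lifespan of a file in a directory combination it has not seen. The directory names, the feature encoding, and the training values are invented for illustration.

```python
def directory_features(path):
    """Encode a path as [intercept, under 'search'?, under 'logs'?].

    The directory names are hypothetical; a real system would derive
    features from its own namespace layout.
    """
    parts = path.strip("/").split("/")
    return [1.0, float("search" in parts), float("logs" in parts)]


def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, path):
    return sum(w * x for w, x in zip(weights, directory_features(path)))


# Made-up training data: (path, observed lifespan in days),
# standing in for what audit logs and metadata would provide.
training = [
    ("/ads/tables/t1", 10.0),
    ("/search/tables/t2", 15.0),
    ("/ads/logs/l1", 3.0),
    ("/ads/logs/l2", 3.0),
]
X = [directory_features(p) for p, _ in training]
y = [days for _, days in training]
weights = fit_linear_regression(X, y)

# A new file in a directory combination absent from the training data.
predicted_lifespan = predict(weights, "/search/logs/new_file")
print(weights, predicted_lifespan)
```

The point of the sketch is that the prediction is available at file creation time, because it depends only on where the file is placed in the namespace.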

Page 23: Self-Adapting, Energy-Conserving Distributed File Systems

GHDFS - Evaluation

● Energy consumption reduced by 24%, saving $2.1 million in energy costs per annum (38000 servers).
● Maximizes the usage of the power budget by allowing the infrastructure to expand. More Hot Zone servers offer more availability and performance.
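As a rough sanity check on the figures above, the following back-of-the-envelope calculation derives the implied baseline energy bill and the per-server saving. It assumes the 24% reduction and the $2.1 million saving refer to the same annual energy cost for the same 38000 servers, which the slide does not state explicitly.

```python
savings_usd = 2.1e6      # annual saving quoted on the slide
reduction = 0.24         # fractional reduction quoted on the slide
servers = 38000          # cluster size quoted on the slide

# If savings = reduction * baseline, then baseline = savings / reduction.
implied_baseline_usd = savings_usd / reduction
savings_per_server_usd = savings_usd / servers
baseline_per_server_usd = implied_baseline_usd / servers

print(round(implied_baseline_usd))
print(round(savings_per_server_usd, 2))
print(round(baseline_per_server_usd, 2))
```

Under that assumption the baseline comes to about $8.75 million per year, or roughly $230 per server, of which about $55 per server is saved.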

Page 24: Self-Adapting, Energy-Conserving Distributed File Systems

Conclusions

● Machine learning can be applied to build a predictive, self-managed energy control system that achieves better results than reactive approaches.
● Good energy management policies can result in high savings in energy consumption.
● Data-classification techniques can help achieve a better energy-aware placement of data in Distributed File Systems.
● The presented techniques, applied in conjunction with other more common green computing technologies, can significantly reduce the maintenance costs of the cluster.

Page 25: Self-Adapting, Energy-Conserving Distributed File Systems

References

● GreenHDFS: Towards an Energy-Conserving, Storage-Efficient, Hybrid Hadoop Compute Cluster
● Evaluation and Analysis of GreenHDFS: A Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System
● Predictive Data and Energy Management in GreenHDFS
● The Hadoop Distributed File System
● Introduction to Machine Learning (Adaptive Computation and Machine Learning)
