Improving Energy Efficiency of Scientific Data Compression with Decision Trees

ENERGY 2020

Jun.-Prof. Dr. Michael Kuhn
michael.kuhn@ovgu.de

2020-09-28 – 2020-10-01

Parallel Computing and I/O
Institute for Intelligent Cooperating Systems
Faculty of Computer Science
Otto von Guericke University Magdeburg
http://parcio.ovgu.de


Introduction and Motivation

Energy-E�icient Data Compression

Evaluation and Results

Conclusion and Future Work


Motivation

• Scientific applications and large-scale experiments generate a deluge of data
• Gap between the computational power and other hardware is widening [2]

• More data can be computed than can be stored efficiently
• Additional investments for storage hardware are necessary
• New hardware requires additional power and incurs additional costs for energy

• Data-intensive fields have increasing costs for storage and energy
• German Climate Computing Center: each petabyte of disk storage costs roughly 100,000 € plus 3,680 € annually for electricity

• Almost 200,000 € per year in total for its 54 PiB storage system
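The yearly figure can be reproduced with a short calculation. This is only a rough sketch: it assumes that the per-petabyte electricity cost applies directly to the 54 PiB capacity (ignoring the PB/PiB difference), and the variable names are illustrative.

```python
# Figures quoted on the slide (German Climate Computing Center).
ELECTRICITY_EUR_PER_PB_YEAR = 3_680
ACQUISITION_EUR_PER_PB = 100_000
CAPACITY_PB = 54  # assumption: treat 54 PiB as 54 PB for a rough estimate

# Recurring electricity cost and one-time hardware cost for the whole system.
annual_electricity_eur = CAPACITY_PB * ELECTRICITY_EUR_PER_PB_YEAR
acquisition_eur = CAPACITY_PB * ACQUISITION_EUR_PER_PB

print(f"Electricity per year: {annual_electricity_eur:,} EUR")
print(f"One-time acquisition: {acquisition_eur:,} EUR")
```

Under this assumption the electricity alone comes to just under 200,000 € per year, which matches the order of magnitude quoted above.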


Motivation (continued)

• Data reduction is a solution to minimize energy consumption
• Reducing the amount of storage hardware required to store the data
• Can consume significant amounts of energy, negating its benefits

• Energy efficiency in supercomputers has been investigated extensively
• Impact of data reduction on energy efficiency remains largely unexplored

• Data reduction is of great interest in scientific software
• Algorithms have to be appropriate for their data and must be tuned
• Decreasing runtime performance has to be avoided
• Choosing a compression algorithm is a technical decision

• Decision depends on the data and the software/hardware environments


Motivation (continued)

• Ultimate goal: automate the decision-making process for users
• Poor manual choices can lead to low compression ratios (CR), decreased performance and increased energy consumption

• Focus on scientific applications in the context of HPC with huge amounts of data

• We have already analyzed the energy impact of data reduction in [1]
• Extended to select algorithms and settings intelligently
• Emphasis on the algorithms’ energy consumption
• Users are able to specify additional criteria to tune behavior


Energy-Efficient Data Compression

• Extended the Scientific Compression Library (SCIL) [6]
• SCIL is a meta-compressor
• Should pick a suitable chain of algorithms satisfying the user’s requirements

• Extended by a decision component that can use different criteria for selection

• Support energy-efficient data reduction by using machine learning approaches

• Main goal: provide the most appropriate data reduction strategy for a given data set


Decision Component

• Decision component takes into account information about the data’s structure
• Machine learning techniques infer best suited algorithms and settings
• A training step is currently done separately from application runs

• Output data set is split up into individual variables and then analyzed
• Collect compression ratios, processor utilization, energy consumption etc.
• Use wide range of data sets to have a sufficiently large pool of training data

• Decision component currently makes use of decision trees
• Planned to be extended with other techniques
• Data is split into a training set and a test set to prevent overfitting
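Collecting per-variable metrics could look roughly like the following sketch. It is not the SCIL tooling: zlib from the Python standard library stands in for the real compressors, the sample data and the `measure` helper are hypothetical, and energy is left out because it requires an external wattmeter.

```python
import time
import zlib


def measure(label, data, level):
    """Compress `data` once and return ratio and runtime for one setting."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    runtime_s = time.perf_counter() - start
    return {
        "compressor": label,
        "ratio": len(data) / len(compressed),  # original size / compressed size
        "runtime_s": runtime_s,
    }


# Hypothetical stand-in for one output variable: a repetitive byte pattern.
variable = bytes(i % 64 for i in range(200_000))

# One record per compressor setting; such records would form the training pool.
rows = [measure(f"zlib-{level}", variable, level) for level in (1, 6, 9)]
for row in rows:
    print(row)
```

Each record pairs a compressor setting with its measured behavior for one variable, which is the kind of data a classifier can later be trained on.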


Decision Component (continued)

• Trees are created using the DecisionTreeClassifier from scikit-learn
• Tree representation is then parsed into a file that is usable in SCIL

• Decision component’s behavior can be separated into two distinct modes
1. Data structure is known
   • Can select the optimal compression algorithm and settings
   • Mainly useful for production runs of known applications
   • E.g., train for specific application and choose the best compressor for each subsequent run
2. Data structure is unknown
   • Use machine learning techniques to infer algorithm and settings
   • Information about storage size, number of elements, data dimensions, data type etc.
   • Possible to tune decision for energy efficiency, compression ratio or performance
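Evaluating an exported tree for a variable with unknown structure could be sketched as follows. The file format used by SCIL is not described here, so the nested-dictionary layout, the feature names, the thresholds and the labels are all hypothetical.

```python
# Hypothetical tree: thresholds and labels are made up for illustration.
TREE = {
    "feature": "storage_size",
    "threshold": 4096,
    "true": {"label": "lz4"},
    "false": {
        "feature": "number_of_elements",
        "threshold": 1_000_000,
        "true": {"label": "zstd"},
        "false": {"label": "zstd-11"},
    },
}


def choose_compressor(node, features):
    """Walk the tree until a leaf is reached and return its label."""
    while "label" not in node:
        # Inner nodes test "feature <= threshold", as scikit-learn trees do.
        branch = "true" if features[node["feature"]] <= node["threshold"] else "false"
        node = node[branch]
    return node["label"]


small = {"storage_size": 2048, "number_of_elements": 512}
large = {"storage_size": 8_000_000, "number_of_elements": 2_000_000}
print(choose_compressor(TREE, small))
print(choose_compressor(TREE, large))
```

Only cheap metadata is inspected at run time, so the decision itself adds very little overhead to each write.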


Training

• Fine-grained power measurement is necessary
• Used the ArduPower wattmeter [4], v2 presented at ENERGY 2020

• Designed to measure different components inside computing systems
• Provides 16 channels with a variable sampling rate of 480–5,880 Hz

• Main metrics: compression ratio, runtime and consumed energy of each algorithm
• Used three data sets from different scientific domains

1. ECOHAM: 17 GB, ecosystem model for the North Sea [5] (climate science)
2. PETRA III: 14 GB, tomography experiments from P06 beamline [3] (photon science)
3. ECHAM: 4 GB, atmospheric model [7] (climate science)
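Consumed energy follows from the sampled power by integrating over time. The sketch below uses the trapezoidal rule on equally spaced samples; it is not the ArduPower software, and the function name and the sample trace are assumptions made for illustration.

```python
def energy_joules(power_watts, sampling_rate_hz):
    """Integrate equally spaced power samples with the trapezoidal rule."""
    if len(power_watts) < 2:
        return 0.0
    # End points count half, inner samples count fully.
    weighted_sum = 0.5 * (power_watts[0] + power_watts[-1]) + sum(power_watts[1:-1])
    # Dividing by the rate multiplies by the sample spacing dt = 1 / rate.
    return weighted_sum / sampling_rate_hz


# Hypothetical trace: a constant 50 W load sampled at 480 Hz for 10 seconds.
rate_hz = 480
samples = [50.0] * (rate_hz * 10 + 1)
energy = energy_joules(samples, rate_hz)
print(f"{energy:.1f} J")
```

For this constant trace the result is 50 W times 10 s, i.e. 500 J; higher sampling rates mainly matter when the load changes quickly during a compression call.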


Training (continued)

• Comparable compression ratios with different energy consumptions

• Compare mafisc and zstd for ECOHAM

• Necessary to select the compression algorithm intelligently

• Avoid wasting performance/energy

• Detailed analysis available in [1]

[Figure: two bar charts, energy consumption [kJ] (axis 0–600) and compression ratio (axis 0–6), per compression algorithm (off, blosc-lz4, mafisc, lz4, zstd, zstd-11, zstd-22, scil) for the ECOHAM, PETRA III and ECHAM data sets]


Evaluation

• Run ECOHAM using our SCIL HDF5 plugin to evaluate our approach¹

• Trained the decision component using two different sets of training data
1. ECOHAM data
   • Represents case with known application that has been run before
   • Still only 75% of ECOHAM’s output data is used for training
   • Uncertainty corresponds to updated application or changed output structure
2. ECHAM data
   • Represents case with new application
   • Decision component has to use information gathered from other applications
   • Try to map decisions that make sense for other data sets to the current data
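Holding back part of the data for testing can be sketched as below. The slides state that 75% of ECOHAM’s output is used for training but not how the split was drawn, so the seeded random shuffle and the record names are assumptions.

```python
import random


def split_records(records, train_fraction=0.75, seed=0):
    """Shuffle a copy of the records and split it into training and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical per-variable records; the names are placeholders.
records = [f"variable_{i}" for i in range(100)]
train, test = split_records(records)
print(len(train), len(test))
```

The held-back quarter plays the role of output that the decision component has not seen before, for example after an application update.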

¹ Code and data are available at https://github.com/wr-hamburg/energy2020-compression


Evaluation (continued)

• Two different optimization targets, which correspond to different use cases
1. Optimized for minimal energy consumption per compression ratio
   • Allows shrinking the data with the least amount of energy
2. Optimized for maximal compression ratio per time
   • Performance is not degraded excessively
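The two targets can be expressed as simple ratios over the measured metrics. In the sketch below the measurements are invented and the helper names are hypothetical; it only illustrates how each target could pick a label for one variable.

```python
# Hypothetical measurements for one variable; the numbers are invented.
measurements = {
    "lz4": {"ratio": 1.5, "energy_kj": 2.0, "runtime_s": 1.0},
    "zstd": {"ratio": 2.4, "energy_kj": 3.0, "runtime_s": 2.0},
    "zstd-22": {"ratio": 3.0, "energy_kj": 9.0, "runtime_s": 12.0},
}


def best_for_energy(results):
    """Compressor with the lowest energy consumption per compression ratio."""
    return min(results, key=lambda name: results[name]["energy_kj"] / results[name]["ratio"])


def best_for_throughput(results):
    """Compressor with the highest compression ratio per unit of time."""
    return max(results, key=lambda name: results[name]["ratio"] / results[name]["runtime_s"])


print(best_for_energy(measurements))
print(best_for_throughput(measurements))
```

With these made-up numbers the two targets disagree (zstd for energy per ratio, lz4 for ratio per time), which is why separate trees are trained per target.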

• Decision tree using ECOHAM training data
• Maximal compression ratio per time
• Multitude of metrics are taken into account
• Array dimensions, data size, number of elements, size of each dimension and information about data types

[Figure: decision tree; leaves are listed left to right as in the original plot]

DM2 ≤ 71.5 (gini=0.637, samples=104, value=[41,2,10,46,5], class=zstd-11)
    True: Storage_Size ≤ 2816.0 (gini=0.727, samples=16, value=[5,2,6,1,2], class=zstd)
        Leaf: gini=0.735, samples=7, value=[2,2,1,0,2], class=blosc
        Leaf: gini=0.568, samples=9, value=[3,0,5,1,0], class=zstd
    False: Number_of_Elements ≤ 1400704.0 (gini=0.568, samples=88, value=[36,0,4,45,3], class=zstd-11)
        Leaf: gini=0.0, samples=1, value=[0,0,0,0,1], class=zstd-22
        Leaf: gini=0.559, samples=87, value=[36,0,4,45,2], class=zstd-11
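The gini values in the tree nodes follow from the per-class sample counts via the Gini impurity, 1 − Σ pᵢ². The following sketch recomputes them from the `value` lists in the tree; the helper name is illustrative.

```python
def gini(class_counts):
    """Gini impurity 1 - sum(p_i^2) for a list of per-class sample counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# (class counts, gini value reported in the tree) for every node.
slide_nodes = [
    ([41, 2, 10, 46, 5], 0.637),
    ([5, 2, 6, 1, 2], 0.727),
    ([36, 0, 4, 45, 3], 0.568),
    ([2, 2, 1, 0, 2], 0.735),
    ([3, 0, 5, 1, 0], 0.568),
    ([0, 0, 0, 0, 1], 0.0),
    ([36, 0, 4, 45, 2], 0.559),
]

for counts, reported in slide_nodes:
    print(counts, f"computed={gini(counts):.4f}", f"reported={reported}")
```

A value of 0 marks a pure leaf, whereas values above 0.7 indicate that several compressors are almost equally common among the samples in that node.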


Evaluation (continued)

• Run ECOHAM with important compressors

• Static mode and decision trees
• Decision trees have access to all compressors

• Four decision trees have been used
1. ecoham-1: ECOHAM data, minimal energy consumption per CR
2. ecoham-2: ECOHAM data, maximal compression ratio per time
3. echam-1: ECHAM data, minimal energy consumption per CR
4. echam-2: ECHAM data, maximal compression ratio per time

[Figure: two bar charts, energy consumption [kJ] (axis 0–250) and compression ratio (axis 0–6), for ECOHAM runs with off, blosc, lz4, zstd, ecoham-1, ecoham-2, echam-1 and echam-2]


Evaluation (continued)

• Decision component correctly uses energy-efficient algorithms

• ecoham-1: ECOHAM’s data structure is known, data is reduced effectively

• echam-1: data structure unknown, still chooses appropriate compressors

• Optimization for maximal compression ratio increases energy consumption

• echam-2: data structure unknown, choices are not as effective as in previous case

[Figure: the same two bar charts, energy consumption [kJ] and compression ratio for off, blosc, lz4, zstd, ecoham-1, ecoham-2, echam-1 and echam-2]


Conclusion

• Amount of data saved by compression depends heavily on the data structure
• Preconditioners, algorithms and settings might work well for one data set, but might increase energy consumption for others

• Fine-grained per-variable analyses identified strategies for three data sets
• Trained the decision component for our real-world evaluation

• Demonstrated that decision component chooses appropriate compressors for both known and unknown applications

• Can be further tuned for energy efficiency or compression ratio
• Achieved satisfactory compression ratios without increases in energy consumption
• Slight increases in energy consumption allow significantly boosting compression ratios


Future Work

• Training currently has to be performed in a separate step
• Training data collection can be more tightly integrated with production runs
• Training mode to capture and analyze applications’ output data during regular runs
• Send samples to a training service that analyzes them in more detail

• HDF5’s current filter interface is too limiting to fully exploit all possibilities
• Operates on opaque buffers, impossible to access single data points
• Use data variance to further tune compressor’s behavior
• Chains of compressors can lead to additional space savings
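Why a chain of compressors can save additional space is easiest to see with a preconditioner in front of a general-purpose compressor. The sketch below chains a delta encoding with zlib from the Python standard library; it is not SCIL's implementation, and the sample signal and helper names are hypothetical.

```python
import struct
import zlib
from itertools import accumulate


def delta_encode(values):
    """Replace each value by its difference to the predecessor."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by computing running sums."""
    return list(accumulate(deltas))


def pack(values):
    """Serialize integers as little-endian signed 32-bit values."""
    return struct.pack(f"<{len(values)}i", *values)


# Hypothetical smooth signal: a slowly increasing integer series.
signal = [i * 3 for i in range(50_000)]

plain = zlib.compress(pack(signal))
chained = zlib.compress(pack(delta_encode(signal)))
print(len(plain), len(chained))
```

The delta step needs access to individual data points and their types, which is exactly what a filter operating on opaque buffers cannot easily provide.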


Questions

Thank you for listening! If you have any questions, please send me an e-mail: michael.kuhn@ovgu.de


References


[1] Yevhen Alforov, Thomas Ludwig, Anastasiia Novikova, Michael Kuhn, and Julian M. Kunkel. Towards Green Scientific Data Compression Through High-Level I/O Interfaces. In 30th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2018, Lyon, France, September 24-27, 2018, pages 209–216, 2018.

[2] Renhai Chen, Zili Shao, and Tao Li. Bridging the I/O performance gap for big data workloads: A new NVDIMM-based approach. In 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 15-19, 2016, pages 9:1–9:12, 2016.

[3] DESY. PETRA III. http://petra3.desy.de/index_eng.html, 2015. Retrieved: April, 2020.


[4] Manuel F. Dolz, Mohammad Reza Heidari, Michael Kuhn, Thomas Ludwig, and Germán Fabregat. ArduPower: A low-cost wattmeter to improve energy efficiency of HPC applications. In Sixth International Green and Sustainable Computing Conference, IGSC 2015, Las Vegas, NV, USA, December 14-16, 2015, pages 1–8, 2015.

[5] Fabian Große et al. Looking beyond stratification: a model-based analysis of the biological drivers of oxygen depletion in the North Sea. Biogeosciences Discussions, pages 2511–2535, 2015. Retrieved: April, 2020.

[6] Julian M. Kunkel, Anastasiia Novikova, Eugen Betke, and Armin Schaare. Toward Decoupling the Selection of Compression Algorithms from Quality Constraints. In High Performance Computing - ISC High Performance 2017 International Workshops, DRBSD, ExaComm, HCPM, HPC-IODC, IWOPH, IXPUG, P^3MA, VHPC, Visualization at Scale, WOPSSS, Frankfurt, Germany, June 18-22, 2017, Revised Selected Papers, pages 3–14, 2017.

[7] Erich Roeckner et al. The atmospheric general circulation model ECHAM 5. Max Planck Institute for Meteorology, 2003.