COSC 6339
Big Data Analytics
Data Formats –
HDF5 and Parquet files
Edgar Gabriel
Fall 2018
File Formats - Motivation
• Use-case: Analysis of all flights in the US between 2004
and 2008 using Apache Spark
File Format           File Size   Processing Time
CSV                   3.4 GB      525 sec
JSON                  12 GB       2245 sec
Hadoop sequence file  3.7 GB      1745 sec
Parquet               0.55 GB     100 sec
Scientific data libraries
• Handle data on a higher level
• Provide additional information typically not available in
flat data files (Metadata)
– Size and type of data structures
– Data format
– Name
– Units
• Two widely used libraries available
– NetCDF
– HDF-5
HDF-5
• Hierarchical Data Format (HDF) developed since 1988
at NCSA (University of Illinois)
– http://hdf.ncsa.uiuc.edu/HDF5/
• Has gone through a long history of changes; the current
version, HDF-5, has been available since 1999
• HDF-5 supports
– Very large files
– Parallel I/O interface
– Fortran, C, Java, Python bindings
HDF-5 dataset
• Multi-dimensional array of basic data elements
• A dataset consists of
– Header + data
• Header consists of
– Name
– Datatype: basic (e.g. H5T_NATIVE_FLOAT) or
compound datatypes
– Dataspace: defines size and shape of a multidimensional
array. Dimensions can be fixed or unlimited.
– Storage layout: defines how multidimensional arrays are
stored in file. Can be contiguous or chunked.
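Each of the header components above maps directly onto an argument of h5py's create_dataset(). The sketch below illustrates this; the file name, dataset name, and attribute value are made up for the example:

```python
import numpy as np
import h5py

# Sketch: each header component of an HDF-5 dataset expressed as a
# create_dataset() argument (names and values here are illustrative).
with h5py.File('header_demo.h5', 'w') as f:
    d = f.create_dataset(
        'temperature',            # name
        shape=(3, 8, 4),          # dataspace: current size
        maxshape=(None, 8, 4),    # first dimension unlimited
        dtype='f4',               # datatype: 32-bit IEEE float
        chunks=(1, 8, 4))         # storage layout: chunked
    d.attrs['units'] = 'celsius'  # attribute stored with the dataset
    d[:] = np.zeros((3, 8, 4))    # the data itself
```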
Example of an HDF-5 file

HDF5 "tempseries.h5" {
GROUP "/" {
   GROUP "tempseries" {
      DATASET "height" {
         DATATYPE { H5T_STD_I32BE }
         DATASPACE { ARRAY ( 4 ) ( 4 ) }
         DATA {
            0, 50, 100, 150
         }
         ATTRIBUTE "units" {
            DATATYPE { "undefined string" }
            DATASPACE { ARRAY ( 0 ) ( 0 ) }
            DATA {
               unable to print
            }
         }
      }
      DATASET "temperature" {
         DATATYPE { H5T_IEEE_F32BE }
         DATASPACE { ARRAY ( 3, 8, 4 ) ( H5S_UNLIMITED, 8, 4 ) }
         DATA { … }
      }
   }
}
}
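A file with this structure could be produced with h5py roughly as follows. This is a sketch: the value of the "units" attribute is an assumption, since the dump above could not print it.

```python
import numpy as np
import h5py

# Sketch: recreating the structure of "tempseries.h5" shown above.
with h5py.File('tempseries.h5', 'w') as f:
    g = f.create_group('tempseries')
    height = g.create_dataset(
        'height', data=np.array([0, 50, 100, 150], dtype='>i4'))
    height.attrs['units'] = 'meters'        # assumed value
    g.create_dataset('temperature', shape=(3, 8, 4), dtype='>f4',
                     maxshape=(None, 8, 4),  # H5S_UNLIMITED first dimension
                     chunks=(1, 8, 4))
```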
Storage layout: contiguous vs. chunked

[Figure: an 8×8 array whose 64 elements are numbered in storage order. In the contiguous layout the elements are stored row by row in one linear sequence (1–64); in the chunked layout the array is split into four 4×4 chunks, and the 16 elements of each chunk are stored together.]
Advantages and disadvantages of chunking
• Accessing rows and columns requires the same number of accesses
• Data can be extended in all dimensions
• Efficient storage of sparse arrays
• Can improve caching
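One practical payoff of chunking is extensibility: a chunked dataset with an unlimited dimension can be grown after creation, which a contiguous layout does not allow. A minimal h5py sketch (file and dataset names are made up):

```python
import numpy as np
import h5py

# Sketch: a chunked 8x8 dataset with an unlimited first dimension,
# later extended to 16x8 with resize().
with h5py.File('chunk_demo.h5', 'w') as f:
    d = f.create_dataset('series', shape=(8, 8), maxshape=(None, 8),
                         chunks=(4, 4), dtype='i4')
    d[:] = np.arange(64).reshape(8, 8)
    d.resize((16, 8))                 # extend along the unlimited dimension
    d[8:] = np.arange(64, 128).reshape(8, 8)
```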
HDF-5 API
• HDF-5 naming convention
– All API functions start with an H5
– The next character identifies category of functions
• H5F: functions handling files
• H5G: functions handling groups
• H5D: functions handling datasets
• H5S: functions handling dataspaces
• H5A: functions handling attributes
• An HDF-5 group is a collection of datasets
– Comparable to a directory in a UNIX-like file system
h5py
• Python interface to the HDF5 binary data format
• Uses NumPy and Python abstractions such as dictionary
and NumPy array syntax
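In h5py, groups behave like Python dictionaries (keys() and [] indexing) and datasets support NumPy-style slicing. A short sketch with made-up file and group names:

```python
import numpy as np
import h5py

# Sketch: groups act like dictionaries, datasets like NumPy arrays.
with h5py.File('groups_demo.h5', 'w') as f:
    g = f.create_group('experiment1')       # like mkdir in a file system
    g.create_dataset('samples', data=np.arange(10))

with h5py.File('groups_demo.h5', 'r') as f:
    print(list(f.keys()))                   # -> ['experiment1']
    print(f['experiment1/samples'][2:5])    # NumPy-style slicing -> [2 3 4]
```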
Reading and Writing an HDF-5 file
using h5py
import numpy as np
import h5py

# Write a 100x20 array of random floats to an HDF-5 file
MyData = np.random.random(size=(100,20))
h5f = h5py.File('data.h5', 'w')
h5f.create_dataset('dataset_1', data=MyData)
h5f.close()

# Read the dataset back into a NumPy array
h5f = h5py.File('data.h5', 'r')
MyData = h5f['dataset_1'][:]
h5f.close()
Setting datatypes and compression in
h5py
# Store an array as 8-byte integers
arr = np.arange(100000)               # example data
f = h5py.File('integer_8.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i8')
d[:] = arr
f.close()

# Store an array as 16-byte floats, gzip-compressed
arr = np.random.random(100)           # example data
f = h5py.File('float.hdf5', 'w')
d = f.create_dataset('dataset', (100,), dtype='f16', compression="gzip")
d[:] = arr
f.close()
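The effect of compression can be checked directly by comparing file sizes. A sketch with made-up file names, using highly compressible all-zero data so the difference is visible:

```python
import os
import numpy as np
import h5py

# Sketch: the same dataset written with and without gzip compression.
arr = np.zeros(100000, dtype='i8')    # all-zero data compresses very well

with h5py.File('raw.h5', 'w') as f:
    f.create_dataset('dataset', data=arr)

with h5py.File('packed.h5', 'w') as f:
    f.create_dataset('dataset', data=arr, compression='gzip')

print(os.path.getsize('raw.h5'), os.path.getsize('packed.h5'))
```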
Parquet files
• Columnar data representation
• Available to many projects in the Hadoop ecosystem
• Built on the record shredding and assembly algorithm
described in Google's Dremel paper
• Run-length encoding
– sequences in which the same data value occurs in many
consecutive data elements are stored as a single data value and
count
– Useful for the definition levels of sparse columns
• Dictionary encoding
– searches for matches between the text to be compressed and a
set of strings contained in a 'dictionary'
– When the encoder finds a match, it substitutes a reference to
the string's position in the data structure.
– Useful for columns with a small (up to ~60k) set of distinct values
Encodings
• Delta encoding (new in Parquet 2)
Image source: Julien Le Dem, Nong Li: "Efficient Data Storage for Analytics with Apache Parquet 2.0", https://www.slideshare.net/cloudera/hadoop-summit-36479635
Formats comparison (II)
Image source: Julien Le Dem, Nong Li: "Efficient Data Storage for Analytics with Apache Parquet 2.0", https://www.slideshare.net/cloudera/hadoop-summit-36479635