Programming Hive Reading #4 @just_do_neet
Chapters 11 and 15
•Chapter 11. ‘Other File Formats and Compression’
•Choosing / Enabling / Action / HAR / etc...
•Chapter 15. ‘Customizing Hive File and Record Formats’
•Demystifying DML / File Formats / etc...
•"SerDe"-related topics are excluded from this presentation...
#11 Determining Installed Codecs
$ hive -e "set io.compression.codecs"
io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
  org.apache.hadoop.io.compress.DefaultCodec,
  com.hadoop.compression.lzo.LzoCodec,
  org.apache.hadoop.io.compress.SnappyCodec
#11 Choosing a Compression Codec
•Advantages :
•less network I/O, less disk space
•Disadvantage :
•CPU overhead
•in short : a trade-off
#11 Choosing a Compression Codec
•“why do we need different compression schemes?”
•speed
•minimizing size
•‘splittable’ or not.
•ref. http://comphadoop.weebly.com/ (codec comparison chart)
take a break : algorithm
•lossless compression
•LZ77(LZSS), LZ78, etc...
•DEFLATE (LZ77 with Huffman coding)
•LZH (LZ77 with Static Huffman coding)
•BZIP2(Burrows–Wheeler transform, Move-to-Front, Huffman Coding)
•lossy compression
•for JPEG, MPEG, etc... (snip.)
•ref. http://www.slideshare.net/moaikids/ss-2638826 (algorithm illustrations)
take a break : algorithm
•Burrows–Wheeler Transform (BWT)
•a.k.a. block sorting
•BWT("abracadabra$") = "ard$rcaaaabb"
sorted rotations of "abracadabra$" :
$abracadabra
a$abracadabr
abra$abracad
abracadabra$
acadabra$abr
adabra$abrac
bra$abracada
bracadabra$a
cadabra$abra
dabra$abraca
ra$abracadab
racadabra$ab
first column : "$aaaaabbcdrr"
last column (BWT output) : "ard$rcaaaabb"
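The block sorting above can be sketched in a few lines of Java — a naive rotation-sort, not the suffix-array construction real compressors like bzip2 use:

```java
import java.util.Arrays;

// Minimal sketch of the Burrows-Wheeler Transform: sort all rotations
// of the input and emit the last column (O(n^2 log n) construction).
public class BwtDemo {

    static String bwt(String s) {
        int n = s.length();
        String doubled = s + s;                 // rotation i = doubled.substring(i, i + n)
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++) {
            rotations[i] = doubled.substring(i, i + n);
        }
        Arrays.sort(rotations);                 // block sorting
        StringBuilder last = new StringBuilder(n);
        for (String rot : rotations) {
            last.append(rot.charAt(n - 1));     // last column of the sorted matrix
        }
        return last.toString();
    }

    public static void main(String[] args) {
        // '$' sorts before the letters, acting as the end-of-string marker.
        System.out.println(bwt("abracadabra$"));  // → ard$rcaaaabb
    }
}
```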
take a break : algorithm
•BWT with Suffix Array
•ref. http://d.hatena.ne.jp/naoya/20081016/1224173077
•ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt
take a break : algorithm
•LZO
•“Compression is comparable in speed to DEFLATE compression.”
•“Very fast decompression”
• http://www.oberhumer.com/opensource/lzo/
take a break : algorithm
•Google Snappy
•“very high speeds and reasonable compression”
• https://code.google.com/p/snappy/
•ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889
take a break : algorithm
•LZ4
•“very fast lossless compression algorithm”
• https://code.google.com/p/lz4/
•ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4
take a break : algorithm
•“Add support for LZ4 compression”
•fixed in : 0.23.1, 0.24.0 (CDH4)
•ref. https://issues.apache.org/jira/browse/HADOOP-7657
take a break : Implementing a Codec
public class HogeCodec implements CompressionCodec {

    @Override
    public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
            throws IOException {
        return new BlockCompressorStream(out, compressor, bufferSize, compressionOverhead);
    }

    @Override
    public Class<? extends Compressor> getCompressorType() {
        return HogeCompressor.class;
    }

    @Override
    public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
        return createOutputStream(out, createCompressor());
    }

    @Override
    public Compressor createCompressor() {
        return new HogeCompressor();
    }

    @Override
    public CompressionInputStream createInputStream(InputStream in) throws IOException {
        return createInputStream(in, createDecompressor());
    }
    ............
ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
#11 Enabling Compression
•Intermediate Compression(hive, mapred)
•Final Output Compression(hive, mapred)
#11 Enabling Compression
•Intermediate Compression(hive, mapred)
•setting enable flag
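As a sketch (property names from the Hive 0.x / Hadoop 1.x era of the book, where the Hive flag wraps Hadoop's map-output flag):

```sql
-- Hive-level switch for compressing intermediate (between-stage) data
SET hive.exec.compress.intermediate=true;
-- the underlying Hadoop property it maps onto
SET mapred.compress.map.output=true;
```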
#11 Enabling Compression
•Intermediate Compression(hive, mapred)
•setting codec
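The codec is likewise a session property; SnappyCodec here is an illustrative choice, not the only option:

```sql
-- codec used for compressed map output (intermediate data)
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```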
#11 Enabling Compression
•Final Output Compression(hive, mapred)
•setting enable flag
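A minimal example, mirroring the intermediate case (Hive flag plus the Hadoop property it wraps):

```sql
-- Hive-level switch for compressing final query output
SET hive.exec.compress.output=true;
-- the underlying Hadoop property it maps onto
SET mapred.output.compress=true;
```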
#11 Enabling Compression
•Final Output Compression(hive, mapred)
•setting codec
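A sketch; GzipCodec is an illustrative choice for final output, where compression ratio often matters more than speed:

```sql
-- codec used for the final output files
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```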
#11 Sequence File
•Sequence File Format
• Header
• Record
• Record length
• Key length
• Key
• Value
• a sync marker every few hundred bytes or so
•ref. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
#11 Sequence File
•Compression Type
•NONE : no compression
•RECORD : compresses each record (values only)
•BLOCK : compresses multiple records per block (keys and values)
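For sequence-file output the type is a session property; a sketch (BLOCK generally gives the best ratio):

```sql
-- choose NONE, RECORD, or BLOCK for SequenceFile output
SET mapred.output.compression.type=BLOCK;
```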
#11 Compression in Action
•(DEMO)
#11 Archive Partition
•Using ‘HAR’
•ref. http://hadoop.apache.org/docs/r1.0.4/hadoop_archives.html
•Archiving
•Unarchiving
SET hive.archive.enabled=true;
ALTER TABLE hoge ARCHIVE PARTITION (folder='fuga');
ALTER TABLE hoge UNARCHIVE PARTITION (folder='fuga');
Break :)
#15 Record Format
•TEXTFILE
•SEQUENCEFILE
•RCFILE
CREATE TABLE hoge (.........)
STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE]
#15 Record Format
•RCFile(Record Columnar File)
•fast data loading
•fast query processing
•highly efficient storage space utilization
•a strong adaptivity to dynamic data access patterns.
•ref. "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems" (ICDE ’11)
• http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
#15 Record Format
•RCFile Format
•1 HDFS Block = one or more Row Groups
•1 Row Group = a number of rows, stored column-wise
•Row Group :
•a sync marker
•metadata header
•table data
•the ‘metadata header’ section is compressed with the RLE algorithm
#15 Record Format
•Implementation of RCFile
•Input Format
•o.a.h.h.ql.io.RCFileInputFormat
•Output Format
•o.a.h.h.ql.io.RCFileOutputFormat
•SerDe
•o.a.h.h.serde2.columnar.ColumnarSerDe
#15 Record Format
•Tuning of RCFile
•“hive.io.rcfile.record.buffer.size”
•defines the “Row Group” size (default: 4MB)
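The property takes a byte count; doubling the 4MB default would look like this (the value is illustrative):

```sql
-- 8MB Row Groups instead of the 4MB default
SET hive.io.rcfile.record.buffer.size=8388608;
```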
#15 Record Format
•ref. “HDFS and Hive storage - comparing file formats and compression methods”
• http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/
•"In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results."
•"In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate."
#Appendix - trevni
•ref. https://github.com/cutting/trevni/
•ref. http://avro.apache.org/docs/current/trevni/spec.html
•file layout
•file = file header + column × N
•file header = magic + number of rows + number of columns + file metadata + column metadata × N
•column = number of blocks + block × M
•block = block descriptor + row × K
•block descriptor = number of rows + uncompressed bytes + compressed bytes
•column metadata = start position + name / type / codec / etc...
#Appendix - ORCFile
•ref. http://hortonworks.com/blog/100x-faster-hive/
•ref. https://issues.apache.org/jira/browse/HIVE-3874
•ref. https://issues.apache.org/jira/secure/attachment/12564124/OrcFileIntro.pptx
•(chart: data size comparison)
•(chart: feature comparison)
#Appendix - Column-Oriented Storage
•ref. http://arxiv.org/pdf/1105.4252.pdf
#Appendix - more information
http://scholar.google.co.jp/scholar?hl=ja&q=hdfs+columnar&btnG=&lr=
Thanks for listening :)