Programming Hive Reading #4 @just_do_neet
Chapters 11 and 15
•Chapter 11. ‘Other File Formats and Compression’
•Choosing / Enabling / Action / HAR / etc...
•Chapter 15. ‘Customizing Hive File and Record Formats’
•Demystifying DML / File Formats / etc...
•"SerDe"-related topics are excluded from this presentation...
#11 Determining Installed Codecs
$ hive -e "set io.compression.codecs"
io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
  org.apache.hadoop.io.compress.DefaultCodec,
  com.hadoop.compression.lzo.LzoCodec,
  org.apache.hadoop.io.compress.SnappyCodec
#11 Choosing a Compression Codec
•Advantages :
•less network I/O, less disk space
•Disadvantage :
•CPU overhead
•in short : a trade-off
#11 Choosing a Compression Codec
•“why do we need different compression schemes?”
•speed
•minimizing size
•‘splittable’ or not.
•ref. http://comphadoop.weebly.com/ (codec comparison chart)
take a break : algorithm
•lossless compression
•LZ77(LZSS), LZ78, etc...
•DEFLATE (LZ77 with Huffman coding)
•LZH (LZ77 with Static Huffman coding)
•BZIP2(Burrows–Wheeler transform, Move-to-Front, Huffman Coding)
•lossy compression
•for JPEG, MPEG, etc... (snip.)
•ref. http://www.slideshare.net/moaikids/ss-2638826 (algorithm illustrations)
take a break : algorithm
•Burrows–Wheeler Transform (BWT)
•a.k.a. block sorting
•BWT("abracadabra$") = "ard$rcaaaabb"
sorted rotations of "abracadabra$" :
$abracadabra
a$abracadabr
abra$abracad
abracadabra$
acadabra$abr
adabra$abrac
bra$abracada
bracadabra$a
cadabra$abra
dabra$abraca
ra$abracadab
racadabra$ab
first column : "$aaaaabbcdrr"
last column (BWT output) : "ard$rcaaaabb"
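The block sorting above can be sketched in a few lines of Java — a naive rotation-sort, not the suffix-array construction real compressors like bzip2 use:

```java
import java.util.Arrays;

// Minimal sketch of the Burrows-Wheeler Transform: sort all rotations
// of the input and emit the last column (O(n^2 log n) construction).
public class BwtDemo {

    static String bwt(String s) {
        int n = s.length();
        String doubled = s + s;                 // rotation i = doubled.substring(i, i + n)
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++) {
            rotations[i] = doubled.substring(i, i + n);
        }
        Arrays.sort(rotations);                 // block sorting
        StringBuilder last = new StringBuilder(n);
        for (String rot : rotations) {
            last.append(rot.charAt(n - 1));     // last column of the sorted matrix
        }
        return last.toString();
    }

    public static void main(String[] args) {
        // '$' sorts before the letters, acting as the end-of-string marker.
        System.out.println(bwt("abracadabra$"));  // → ard$rcaaaabb
    }
}
```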
take a break : algorithm
•BWT with Suffix Array
•ref. http://d.hatena.ne.jp/naoya/20081016/1224173077
•ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt
take a break : algorithm
•LZO
•“Compression is comparable in speed to DEFLATE compression.”
•“Very fast decompression”
• http://www.oberhumer.com/opensource/lzo/
take a break : algorithm
•Google Snappy
•“very high speeds and reasonable compression”
• https://code.google.com/p/snappy/
•ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889
take a break : algorithm
•LZ4
•“very fast lossless compression algorithm”
• https://code.google.com/p/lz4/
•ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4
take a break : algorithm
•“Add support for LZ4 compression”
•fixed in : 0.23.1, 0.24.0 (CDH4)
•ref. https://issues.apache.org/jira/browse/HADOOP-7657
take a break : Implementing a Codec
public class HogeCodec implements CompressionCodec {

    @Override
    public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
            throws IOException {
        return new BlockCompressorStream(out, compressor, bufferSize, compressionOverhead);
    }

    @Override
    public Class<? extends Compressor> getCompressorType() {
        return HogeCompressor.class;
    }

    @Override
    public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
        return createOutputStream(out, createCompressor());
    }

    @Override
    public Compressor createCompressor() {
        return new HogeCompressor();
    }

    @Override
    public CompressionInputStream createInputStream(InputStream in) throws IOException {
        return createInputStream(in, createDecompressor());
    }
    ............
ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
#11 Enabling Compression
•Intermediate Compression(hive, mapred)
•Final Output Compression(hive, mapred)
#11 Enabling Compression
•Intermediate Compression(hive, mapred)
•setting enable flag
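As a sketch (property names from the Hive 0.x / Hadoop 1.x era of the book, where the Hive flag wraps Hadoop's map-output flag):

```sql
-- Hive-level switch for compressing intermediate (between-stage) data
SET hive.exec.compress.intermediate=true;
-- the underlying Hadoop property it maps onto
SET mapred.compress.map.output=true;
```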
#11 Enabling Compression
•Intermediate Compression(hive, mapred)
•setting codec
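The codec is likewise a session property; SnappyCodec here is an illustrative choice, not the only option:

```sql
-- codec used for compressed map output (intermediate data)
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```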
#11 Enabling Compression
•Final Output Compression(hive, mapred)
•setting enable flag
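A minimal example, mirroring the intermediate case (Hive flag plus the Hadoop property it wraps):

```sql
-- Hive-level switch for compressing final query output
SET hive.exec.compress.output=true;
-- the underlying Hadoop property it maps onto
SET mapred.output.compress=true;
```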
#11 Enabling Compression
•Final Output Compression(hive, mapred)
•setting codec
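A sketch; GzipCodec is an illustrative choice for final output, where compression ratio often matters more than speed:

```sql
-- codec used for the final output files
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```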
#11 Sequence File
•Sequence File Format
• Header
• Record
• Record length
• Key length
• Key
• Value
• a sync marker every few hundred bytes or so
•ref. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
#11 Sequence File
•Compression Type
•NONE : no compression
•RECORD : compresses each record (values only)
•BLOCK : compresses multiple records per block (keys and values)
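For sequence-file output the type is a session property; a sketch (BLOCK generally gives the best ratio):

```sql
-- choose NONE, RECORD, or BLOCK for SequenceFile output
SET mapred.output.compression.type=BLOCK;
```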
#11 Compression in Action
•(DEMO)
#11 Archive Partition
•Using ‘HAR’
•ref. http://hadoop.apache.org/docs/r1.0.4/hadoop_archives.html
•Archiving
•Unarchiving
SET hive.archive.enabled=true;
ALTER TABLE hoge ARCHIVE PARTITION (folder='fuga');
ALTER TABLE hoge UNARCHIVE PARTITION (folder='fuga');
Break :)
#15 Record Format
•TEXTFILE
•SEQUENCEFILE
•RCFILE
CREATE TABLE hoge (.........)
STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE]
#15 Record Format
•RCFile(Record Columnar File)
•fast data loading
•fast query processing
•highly efficient storage space utilization
•a strong adaptivity to dynamic data access patterns.
•ref. "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems" (ICDE ’11)
• http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
#15 Record Format
•RCFile Format
•1 HDFS Block = one or more Row Groups
•1 Row Group = a number of rows, stored column-wise
•Row Group :
•a sync marker
•metadata header
•table data
•the ‘metadata header’ section is compressed with the RLE algorithm
#15 Record Format
•Implementation of RCFile
•Input Format
•o.a.h.h.ql.io.RCFileInputFormat
•Output Format
•o.a.h.h.ql.io.RCFileOutputFormat
•SerDe
•o.a.h.h.serde2.columnar.ColumnarSerDe
#15 Record Format
•Tuning of RCFile
•“hive.io.rcfile.record.buffer.size”
•defines the “Row Group” size (default: 4MB)
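The property takes a byte count; doubling the 4MB default would look like this (the value is illustrative):

```sql
-- 8MB Row Groups instead of the 4MB default
SET hive.io.rcfile.record.buffer.size=8388608;
```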
#15 Record Format
•ref. “HDFS and Hive storage - comparing file formats and compression methods”
• http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/
•"In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results."
•"In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate."
#Appendix - trevni
•ref. https://github.com/cutting/trevni/
•ref. http://avro.apache.org/docs/current/trevni/spec.html
•file layout
•file = file header + column × N
•file header = magic + number of rows + number of columns + file metadata + column metadata × N
•column = number of blocks + block × M
•block = block descriptor + row × K
•block descriptor = number of rows + uncompressed bytes + compressed bytes
•column metadata = start position + name / type / codec / etc...
#Appendix - ORCFile
•ref. http://hortonworks.com/blog/100x-faster-hive/
•ref. https://issues.apache.org/jira/browse/HIVE-3874
•ref. https://issues.apache.org/jira/secure/attachment/12564124/OrcFileIntro.pptx
•(chart: data size comparison)
•(chart: feature comparison)
#Appendix - Column-Oriented Storage
•ref. http://arxiv.org/pdf/1105.4252.pdf
#Appendix - more information
http://scholar.google.co.jp/scholar?hl=ja&q=hdfs+columnar&btnG=&lr=
Thanks for listening :)