Programming Hive Reading #4 @just_do_neet

Nov 28, 2014


moai kids

 
Page 1: Programming Hive Reading #4

Programming Hive Reading #4

@just_do_neet

Page 2: Programming Hive Reading #4
Page 3: Programming Hive Reading #4

Chapter 11. and 15.

•Chapter 11. ‘Other File Formats and Compression’

•Choosing / Enabling / Action / HAR / etc...

•Chapter 15. ‘Customizing Hive File and Record Formats’

•Demystifying DML / File Formats / etc...

•SerDe-related topics are excluded from this presentation.

Page 4: Programming Hive Reading #4

#11 Determining Installed Codecs

$ hive -e "set io.compression.codecs"
io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.SnappyCodec

Page 5: Programming Hive Reading #4

#11 Choosing a Compression Codec

•Advantage :

•less network I/O and disk space.

•Disadvantage :

•CPU overhead.

•In short : it is a trade-off.

Page 6: Programming Hive Reading #4

#11 Choosing a Compression Codec

•“why do we need different compression schemes?”

•speed

•minimizing size

•whether the compressed file is ‘splittable’ or not.

Page 7: Programming Hive Reading #4

#11 Choosing a Compression Codec

•“why do we need different compression schemes?”

•ref. http://comphadoop.weebly.com/

Page 8: Programming Hive Reading #4

take a break : algorithm

•lossless compression

•LZ77(LZSS), LZ78, etc...

•DEFLATE (LZ77 with Huffman coding)

•LZH (LZ77 with Static Huffman coding)

•BZIP2 (Burrows–Wheeler transform, Move-to-Front, Huffman coding)

•lossy compression

•for JPEG, MPEG, etc. (omitted here)

Page 9: Programming Hive Reading #4

take a break : algorithm

•ref. http://www.slideshare.net/moaikids/ss-2638826

Page 10: Programming Hive Reading #4

take a break : algorithm

•ref. http://www.slideshare.net/moaikids/ss-2638826

Page 11: Programming Hive Reading #4

take a break : algorithm

•Burrows–Wheeler Transform(BWT)

•block sorting

•BWT of “abracadabra$” = “ard$rcaaaabb”

Rotations of “abracadabra$”:

abracadabra$
bracadabra$a
racadabra$ab
acadabra$abr
cadabra$abra
adabra$abrac
dabra$abraca
abra$abracad
bra$abracada
ra$abracadab
a$abracadabr
$abracadabra

Sorted rotations:

$abracadabra
a$abracadabr
abra$abracad
abracadabra$
acadabra$abr
adabra$abrac
bra$abracada
bracadabra$a
cadabra$abra
dabra$abraca
ra$abracadab
racadabra$ab

First column: $aaaaabbcdrr

Last column (BWT output): ard$rcaaaabb

Original (restored) string: abracadabra$

Page 12: Programming Hive Reading #4

take a break : algorithm

•BWT with Suffix Array

•ref. http://d.hatena.ne.jp/naoya/20081016/1224173077

•ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt

Page 13: Programming Hive Reading #4

take a break : algorithm

•LZO

•“Compression is comparable in speed to DEFLATE compression.”

•“Very fast decompression”

• http://www.oberhumer.com/opensource/lzo/

Page 14: Programming Hive Reading #4

take a break : algorithm

•Google Snappy

•“very high speeds and reasonable compression”

• https://code.google.com/p/snappy/

•ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889

Page 15: Programming Hive Reading #4

take a break : algorithm

•LZ4

•“very fast lossless compression algorithm”

• https://code.google.com/p/lz4/

•ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4

Page 16: Programming Hive Reading #4

take a break : algorithm

•“Add support for LZ4 compression”

•fix versions : 0.23.1, 0.24.0 (CDH4)

•ref. https://issues.apache.org/jira/browse/HADOOP-7657

Page 17: Programming Hive Reading #4

take a break : implementing a Codec

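import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.io.compress.BlockCompressorStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;

// Skeleton of a custom codec. HogeCompressor, bufferSize and
// compressionOverhead belong to the part of the class omitted on the slide.
public class HogeCodec implements CompressionCodec {

    // Wrap the raw stream so that data is compressed block by block.
    @Override
    public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
            throws IOException {
        return new BlockCompressorStream(out, compressor, bufferSize, compressionOverhead);
    }

    @Override
    public Class<? extends Compressor> getCompressorType() {
        return HogeCompressor.class;
    }

    // Convenience overload: use a freshly created compressor.
    @Override
    public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
        return createOutputStream(out, createCompressor());
    }

    @Override
    public Compressor createCompressor() {
        return new HogeCompressor();
    }

    @Override
    public CompressionInputStream createInputStream(InputStream in) throws IOException {
        return createInputStream(in, createDecompressor());
    }

    // ............ (remaining methods omitted)
}

ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html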

Page 18: Programming Hive Reading #4

#11 Enabling Compression

•Intermediate Compression(hive, mapred)

•Final Output Compression(hive, mapred)

Page 19: Programming Hive Reading #4

#11 Enabling Compression

•Intermediate Compression(hive, mapred)

•setting enable flag
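
A minimal sketch of the enable flag, assuming the Hive property meant here is hive.exec.compress.intermediate (as in chapter 11 of the book) and the mapred-side counterpart is the Hadoop 1.x name mapred.compress.map.output. Neither name appears on the slide itself, and newer Hadoop releases use different property names.

```sql
-- Hive: compress the files passed between the MapReduce jobs of one query (default: false)
SET hive.exec.compress.intermediate=true;

-- Hadoop 1.x counterpart (assumed name): compress map output before the shuffle
SET mapred.compress.map.output=true;
```

Both settings only apply to the current session; putting them in hive-site.xml would make them the default.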

Page 20: Programming Hive Reading #4

#11 Enabling Compression

•Intermediate Compression(hive, mapred)

•setting codec
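
A sketch of the codec setting, assuming the Hadoop 1.x property mapred.map.output.compression.codec and a cluster where SnappyCodec shows up in io.compression.codecs. Snappy is only one reasonable choice for intermediate data, picked here because it costs little CPU.

```sql
-- Codec used for intermediate (map output) data.
-- SnappyCodec is an illustrative choice; any installed codec class can be named here.
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```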

Page 21: Programming Hive Reading #4

#11 Enabling Compression

•Final Output Compression(hive, mapred)

•setting enable flag
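
A sketch of the enable flag for final output, assuming the Hive property is hive.exec.compress.output and the mapred-side counterpart is the Hadoop 1.x name mapred.output.compress (both names are assumptions, not taken from the slide).

```sql
-- Hive: compress the final output of a query, e.g. the files of a table it writes (default: false)
SET hive.exec.compress.output=true;

-- Hadoop 1.x counterpart (assumed name): compress job output
SET mapred.output.compress=true;
```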

Page 22: Programming Hive Reading #4

#11 Enabling Compression

•Final Output Compression(hive, mapred)

•setting codec
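
A sketch of the output codec setting, assuming the Hadoop 1.x property mapred.output.compression.codec. GzipCodec is used only as an example of a codec that favors size over speed.

```sql
-- Codec used for final output files.
-- Gzip gives small files, but a gzip-compressed text file cannot be split across mappers.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```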

Page 23: Programming Hive Reading #4

#11 Sequence File

•Sequence File Format

• Header

• Record

• Record length

• Key length

• Key

• Value

• A sync-marker every few 100 bytes or so.

•ref. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html

Page 24: Programming Hive Reading #4

#11 Sequence File

•Compression Type

•NONE : no compression

•RECORD : each record is compressed individually

•BLOCK : records are grouped into blocks and each block is compressed as a unit

Page 25: Programming Hive Reading #4

#11 Compression in Action

•(DEMO)
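
The demo itself is not part of the transcript. A session along the following lines would exercise the settings from the previous slides. It is only a sketch: the source table a, the target table name, and the warehouse path are hypothetical, and the property names are the Hadoop 1.x ones.

```sql
-- Compress final output with gzip, using block compression inside the sequence file.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;

-- Hypothetical tables: copy table `a` into a compressed sequence-file table.
CREATE TABLE final_comp_on_gz_seq
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
AS SELECT * FROM a;

-- Inspect the files Hive wrote (path assumes the default warehouse directory).
dfs -ls /user/hive/warehouse/final_comp_on_gz_seq;
```

Reading the table back with a plain SELECT should need no extra settings, since the sequence file header records the codec that was used.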

Page 26: Programming Hive Reading #4

#11 Archive Partition

•Using ‘HAR’

•ref. http://hadoop.apache.org/docs/r1.0.4/hadoop_archives.html

•Archiving

hive> SET hive.archive.enabled=true;
hive> ALTER TABLE hoge ARCHIVE PARTITION(folder='fuga');

•Unarchiving

hive> ALTER TABLE hoge UNARCHIVE PARTITION(folder='fuga');

Page 27: Programming Hive Reading #4

Break :)

Page 28: Programming Hive Reading #4

#15 Record Format

•TEXTFILE

•SEQUENCEFILE

•RCFILE

CREATE TABLE hoge (.........) STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE]

Page 29: Programming Hive Reading #4

#15 Record Format

•RCFile(Record Columnar File)

•fast data loading

•fast query processing

•highly efficient storage space utilization

•a strong adaptivity to dynamic data access patterns.

•ref. "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems" (ICDE’11)

• http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf

Page 30: Programming Hive Reading #4

#15 Record Format

•RCFile Format

•1 record = some Row Group

•1 HDFS Block = several Row Groups

•Row Group

•a sync marker

•metadata header

•table data

•uses the RLE algorithm to compress the ‘metadata header’ section.

Page 31: Programming Hive Reading #4

#15 Record Format

•Implementation of RCFile

•Input Format

•o.a.h.h.ql.io.RCFileInputFormat

•Output Format

•o.a.h.h.ql.io.RCFileOutputFormat

•SerDe

•o.a.h.h.serde2.columnar.ColumnarSerDe
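
The three classes above (o.a.h.h standing for org.apache.hadoop.hive) can also be named explicitly in the DDL instead of writing STORED AS RCFILE. A minimal sketch, with a hypothetical table name and columns:

```sql
-- Hypothetical table; the class names are the fully qualified forms of the ones listed above.
CREATE TABLE columnTable (key STRING, value STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';
```

Spelling the classes out is mostly useful when one of them is to be swapped for a custom implementation.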

Page 32: Programming Hive Reading #4

#15 Record Format

•Tuning of RCFile

•“hive.io.rcfile.record.buffer.size”

•define “RowGroup” size(default: 4MB)
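
As a sketch of how this knob is set in a session: the value is given in bytes, and the 8 MB figure below is an arbitrary illustration, not a recommendation from the book or the slides.

```sql
-- Raise the RCFile row group buffer from the 4 MB default (4194304) to 8 MB.
SET hive.io.rcfile.record.buffer.size=8388608;
```

A larger row group tends to compress better but needs more memory per writer, so the right value depends on the workload.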

Page 33: Programming Hive Reading #4

#15 Record Format

•ref. “HDFS and Hive storage - comparing file formats and compression methods”

• http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/

•"In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results."

•"In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate."

Page 34: Programming Hive Reading #4

#Appendix - trevni

•ref. https://github.com/cutting/trevni/

•ref. http://avro.apache.org/docs/current/trevni/spec.html

Page 35: Programming Hive Reading #4

#Appendix - trevni

[Diagram: Trevni file layout]

•file = file header, column, column, ......

•file header = magic, number of rows, number of columns, file metadata, column metadata, column start position

•column metadata = name, type, codec, etc...

•column = number of blocks, block descriptor, ......, block, block, ......

•block descriptor = number of rows, uncompressed bytes, compressed bytes

•block = row, row, row, ......

Page 37: Programming Hive Reading #4

#Appendix - ORCFile

•ref. data size

Page 38: Programming Hive Reading #4

#Appendix - ORCFile

•ref. comparison

Page 39: Programming Hive Reading #4

#Appendix - Column-Oriented Storage

•ref. http://arxiv.org/pdf/1105.4252.pdf

Page 40: Programming Hive Reading #4

#Appendix - more information

http://scholar.google.co.jp/scholar?hl=ja&q=hdfs+columnar&btnG=&lr=

Page 41: Programming Hive Reading #4

Thanks for listening :)