Doug Cutting 21 July, 2010 Introduction to Apache Avro
Transcript
Page 1: 3 avro hug-2010-07-21

Doug Cutting, 21 July, 2010

Introduction to Apache Avro

Page 2: 3 avro hug-2010-07-21

Avro is...

● data serialization

● file format

● RPC format

Page 3: 3 avro hug-2010-07-21

Existing Serialization Systems: Protocol Buffers & Thrift

● expressive
● efficient (small & fast)
● but not very dynamic
  ● cannot browse arbitrary data
  ● viewing a new datatype
    – requires code generation & load
  ● writing a new datatype
    – requires generating schema text
    – plus code generation & load

Page 4: 3 avro hug-2010-07-21

Avro Serialization

● specifies a serialization format
● schema language is in JSON
  ● each language already has a JSON parser
● each language implements a data reader & writer
  ● in normal code
● code generation is optional
  ● sometimes useful in statically typed languages
● data is untagged
  ● schema required to read/write
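To make "untagged" concrete, here is a minimal sketch in plain Java, written without the Avro library. It assumes the binary encoding described in the Avro specification: ints and longs are zig-zag coded and written as variable-length bytes, and a string is its UTF-8 byte length followed by the bytes. The class and method names (`UntaggedEncoding`, `writeLong`, `writeString`, `encodeRecord`) and the two-field record are illustrative assumptions, not Avro's API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: a hand-rolled encoder, not the Avro library's API.
class UntaggedEncoding {

    // Zig-zag maps small negative and positive values to small unsigned
    // values, which are then written 7 bits at a time, low bits first.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte length (as a long) followed by the bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes a hypothetical two-field record {string id; int length}.
    // Field values are concatenated in schema order: no names, no tags.
    static byte[] encodeRecord(String id, int length) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, id);
        writeLong(out, length);
        return out.toByteArray();
    }
}
```

Because nothing in the output says where one field ends and the next begins, a reader can only make sense of these bytes if it has the schema that was used to write them.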

Page 5: 3 avro hug-2010-07-21

Avro Schema Evolution

● writer's schema always provided to reader
● so reader can compare:
  ● the schema used to write with
  ● the schema expected by application
● fields that match (name & type) are read
● fields written that don't match are skipped
● expected fields not written can be identified
● same features as provided by numeric field ids
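The comparison described above can be sketched in a few lines of plain Java. This is a simplification under stated assumptions: each schema is reduced to an ordered map from field name to type name, and the names `SchemaResolution` and `resolve` are hypothetical. Avro's actual resolution rules also cover cases such as type promotion and default values, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: schemas are reduced to maps of field name -> type name.
class SchemaResolution {
    final List<String> read = new ArrayList<>();     // in both schemas, same type
    final List<String> skipped = new ArrayList<>();  // written, but not expected
    final List<String> missing = new ArrayList<>();  // expected, but not written

    static SchemaResolution resolve(Map<String, String> writer,
                                    Map<String, String> reader) {
        SchemaResolution r = new SchemaResolution();
        // Walk the writer's fields in the order they appear in the data.
        for (Map.Entry<String, String> f : writer.entrySet()) {
            String expectedType = reader.get(f.getKey());
            if (f.getValue().equals(expectedType)) {
                r.read.add(f.getKey());
            } else {
                r.skipped.add(f.getKey());
            }
        }
        // Anything the application expects that was not read is missing.
        for (String name : reader.keySet()) {
            if (!r.read.contains(name)) {
                r.missing.add(name);
            }
        }
        return r;
    }
}
```

For example, if the writer's fields are `id`, `length`, `checksum` and the reader expects `id`, `length`, `hosts`, then `id` and `length` are read, `checksum` is skipped, and `hosts` is reported as missing.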

Page 6: 3 avro hug-2010-07-21

Avro JSON Schemas

// a simple three-element record
{"name": "Block", "type": "record",
 "fields": [
   {"name": "id", "type": "string"},
   {"name": "length", "type": "int"},
   {"name": "hosts", "type": {"type": "array", "items": "string"}}
 ]}

// a linked list of strings or ints
{"name": "MyList", "type": "record",
 "fields": [
   {"name": "value", "type": ["string", "int"]},
   {"name": "next", "type": ["MyList", "null"]}
 ]}

Page 7: 3 avro hug-2010-07-21

Avro IDL Schemas

// a simple three-element record
record Block {
  string id;
  int length;
  array<string> hosts;
}

// a linked list of strings or ints
record MyList {
  union {string, int} value;
  union {MyList, null} next;
}

Page 8: 3 avro hug-2010-07-21

Hadoop Data Formats

● Today, primarily
  ● text
    – pro: interoperable
    – con: not expressive, inefficient
  ● Java Writable
    – pro: expressive, efficient
    – con: platform-specific, fragile

Page 9: 3 avro hug-2010-07-21

Avro Data

● expressive
● small & fast
● dynamic
  ● schema stored with data
    – but factored out of instances
  ● APIs permit reading & creating
    – new datatypes without generating & loading code
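The idea of storing the schema with the data, but only once rather than in every instance, can be illustrated with a toy layout in plain Java. This is an assumption-laden sketch: the layout (schema text, then a count, then the values) and the names `SchemaOnceFile`, `write`, `readSchema` and `readValues` are invented for illustration and are not the real Avro file format, which has its own header, metadata and block structure.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy layout, NOT the real Avro container file format.
class SchemaOnceFile {

    // Layout: schema JSON (once), value count, then the bare values.
    static byte[] write(String schemaJson, List<String> values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(schemaJson);
            out.writeInt(values.size());
            for (String v : values) {
                out.writeUTF(v);   // instance data only; no schema repeated
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A reader with no prior knowledge recovers the schema from the header.
    static String readSchema(byte[] file) {
        try {
            return new DataInputStream(new ByteArrayInputStream(file)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<String> readValues(byte[] file) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
            in.readUTF();                  // skip over the schema header
            int count = in.readInt();
            List<String> values = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                values.add(in.readUTF());
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost of the schema is paid once per file, so it stays negligible as the number of instances grows, while any tool that opens the file can still discover what the data looks like.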

Page 10: 3 avro hug-2010-07-21

Avro Data

● includes a file format
  ● replacement for SequenceFile
● includes a textual encoding
● handles versioning
  ● if schema changes
  ● can still process data
● hope Hadoop apps will
  ● upgrade from text; standardize on Avro for data

Page 11: 3 avro hug-2010-07-21

Avro MapReduce API

● Single-valued inputs and outputs
  ● key/value pairs only required for intermediate data
● map(IN, Collector<OUT>)
  ● map-only jobs never need to create k/v pairs
● map(IN, Collector<Pair<K,V>>)
● reduce(K, Iterable<V>, Collector<OUT>)
● if IN and OUT are pairs, default is sort
● In Avro trunk today, built on Hadoop 0.20 APIs
  ● in Avro 1.4.0 release next month

Page 12: 3 avro hug-2010-07-21

Avro MapReduce Example

public void map(Utf8 text, AvroCollector<Pair<Utf8,Long>> c,
                Reporter r) throws IOException {
  StringTokenizer i = new StringTokenizer(text.toString());
  while (i.hasMoreTokens())
    c.collect(new Pair<Utf8,Long>(new Utf8(i.nextToken()), 1L));
}

public void reduce(Utf8 word, Iterable<Long> counts,
                   AvroCollector<Pair<Utf8,Long>> c,
                   Reporter r) throws IOException {
  long sum = 0;
  for (long count : counts)
    sum += count;
  c.collect(new Pair<Utf8,Long>(word, sum));
}
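To see the data flow of the word count without a cluster, the same map and reduce logic can be run in a single JVM. The sketch below is an assumption, not the Avro MapReduce API: `Collector` and `Pair` are small hypothetical stand-ins for the Avro types, `String` replaces `Utf8`, the `Reporter` argument is dropped, and `run` plays the role of the framework by grouping intermediate pairs by key in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative only: stand-ins for the framework types, so the word count
// logic can run in one JVM.
class LocalWordCount {

    interface Collector<T> { void collect(T datum); }

    static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    // Emits (word, 1) for every token in the input line.
    static void map(String text, Collector<Pair<String, Long>> c) {
        StringTokenizer i = new StringTokenizer(text);
        while (i.hasMoreTokens()) {
            c.collect(new Pair<String, Long>(i.nextToken(), 1L));
        }
    }

    // Sums the counts collected for one word.
    static void reduce(String word, Iterable<Long> counts,
                       Collector<Pair<String, Long>> c) {
        long sum = 0;
        for (long count : counts) {
            sum += count;
        }
        c.collect(new Pair<String, Long>(word, sum));
    }

    // Plays the role of the framework: map every input, group the
    // intermediate pairs by key in sorted order, then reduce each group.
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> groups = new TreeMap<>();
        for (String line : lines) {
            map(line, p -> groups.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value));
        }
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> g : groups.entrySet()) {
            reduce(g.getKey(), g.getValue(), p -> totals.put(p.key, p.value));
        }
        return totals;
    }
}
```

Calling `LocalWordCount.run(Arrays.asList("a b a", "b c"))` yields a map with `a` counted twice, `b` twice and `c` once. Key/value pairs appear only between `map` and `reduce`; the inputs are single values.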

Page 13: 3 avro hug-2010-07-21

Avro RPC

● leverage versioning support
  ● permit different versions of services to interoperate
● for Hadoop, will
  ● let apps talk to clusters running different versions
  ● provide cross-language access

Page 14: 3 avro hug-2010-07-21

Avro IDL Protocol

@namespace("org.apache.avro.test")
protocol HelloWorld {
  record Greeting { string who; string what; }

  Greeting hello(Greeting greeting);
}
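As a rough picture of how such a protocol might look from Java, the sketch below writes the record and the message by hand as a class and a one-method interface. Everything here is an assumption for illustration: the names `HelloWorldSketch` and `EchoServer` are hypothetical, the types are hand-written rather than produced by Avro's tooling, and the "server" is just an in-process object, with no RPC transport involved.

```java
// Illustrative only: hand-written types, not generated by Avro's tooling.
class HelloWorldSketch {

    // Mirrors the Greeting record: two string fields.
    static final class Greeting {
        final String who;
        final String what;
        Greeting(String who, String what) {
            this.who = who;
            this.what = what;
        }
    }

    // One Java method per message declared in the protocol.
    interface HelloWorld {
        Greeting hello(Greeting greeting);
    }

    // A trivial in-process implementation standing in for a remote server.
    static final class EchoServer implements HelloWorld {
        public Greeting hello(Greeting greeting) {
            return new Greeting("server", "hello, " + greeting.who);
        }
    }
}
```

A caller would invoke `new HelloWorldSketch.EchoServer().hello(new HelloWorldSketch.Greeting("client", "hi"))` and receive a `Greeting` in reply, which is the request/response shape the protocol declares.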

Page 15: 3 avro hug-2010-07-21

Avro Status

● Current
  ● C, C++, Java, Python & Ruby APIs
  ● Interoperable RPC and data
  ● MapReduce API for Java
● Upcoming
  ● MapReduce APIs for other languages
    – efficient, rich data
  ● RPC used in Flume, HBase, Cassandra, Hadoop, etc.
    – inter-version compatibility
    – non-Java clients

Page 16: 3 avro hug-2010-07-21

Questions?