ACADGILD
INTRODUCTION
This blog provides in-depth information about Avro in Hive. Here we discuss the importance of Avro, why it is needed, and how to implement it in Hive. Through this blog you will get a clear idea of Avro and how to use it in your Hadoop projects.
What is Avro?
Avro is one of the preferred data serialization systems because of its language neutrality. Since Hadoop's Writable classes lack language portability, Avro is a natural choice: it handles multiple data formats that can be further processed by multiple languages. Avro is therefore widely used for serializing data in Hadoop.
It uses JSON to define data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it provides both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
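To make "compact binary format" concrete, here is a minimal Python sketch of how Avro encodes a long, per the Avro specification: the value is zig-zag encoded (so small negative numbers stay small) and then written as a variable-length integer. This is an illustration, not the production codec.

```python
def encode_long(n: int) -> bytes:
    """Encode a long the way Avro's binary encoding does:
    zig-zag encode, then emit as a little-endian base-128 varint."""
    # Zig-zag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... keeps small
    # magnitudes (positive or negative) in few bytes.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

print(encode_long(1).hex())   # 02 -- one byte
print(encode_long(-1).hex())  # 01 -- one byte
print(encode_long(64).hex())  # 8001 -- two bytes once zig-zag exceeds 127
```

This is why Avro files stay small: most integers in real data are near zero and cost a single byte on disk.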
In short, Avro is a file format introduced with Hadoop to store data in a predefined format. This file format can be used in any of Hadoop's tools, such as Pig and Hive.
Implementing Avro file format in Hive
Before we take a look at how the Avro file format is implemented, let’s have a quick introduction to the
Avro schema and how to create Avro records, Hive tables and much more.
Avro Schema
Avro relies on schemas. When Avro data is read, the schema used for writing it is always present. This permits each datum to be written with no per-value overhead, making serialization both fast and compact. It also works well with dynamic scripting languages, since data together with its schema is fully self-describing.
When Avro data is stored in a file, its schema is stored with it, so that the file may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
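The resolution just described can be sketched in a few lines of Python. This is a conceptual illustration only, not the Avro library's actual implementation, and the field names are made up: fields the writer stored but the reader does not know are ignored, and fields the reader expects but the writer never wrote are filled from the reader schema's defaults.

```python
def resolve(record, writer_fields, reader_fields):
    """Conceptual sketch of Avro reader/writer schema resolution.
    writer_fields: set of field names the data was written with.
    reader_fields: {name: default} mapping the reading program expects."""
    out = {}
    for name, default in reader_fields.items():
        if name in writer_fields:
            out[name] = record[name]   # field present in the data
        else:
            out[name] = default        # missing field: use the default
    # extra writer fields absent from reader_fields are simply dropped
    return out

# The writer stored id and name; the reader's newer schema also
# expects an 'email' field with a default.
writer_fields = {"id", "name"}
reader_fields = {"id": None, "name": "", "email": "unknown"}
record = {"id": 7, "name": "asha"}
print(resolve(record, writer_fields, reader_fields))
```

Because both schemas travel with the data (or the handshake), this matching can happen automatically, without the reading program being recompiled.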
When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since the client and server each have the other's full schema, correspondences between same-named fields, missing fields, extra fields, and so on can all be easily resolved.
https://acadgild.com/blog/avro-in-hive/
Avro schemas are defined in JSON. This facilitates implementation in languages that already have JSON libraries. Using Avro, we can convert unstructured and semi-structured data into properly structured data using its schemas.
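As a concrete example, here is a small, hypothetical Avro schema (the record and field names are made up for illustration). A record schema names its type, name, and namespace, and lists its fields, each with a type of its own:

```json
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The `["null", "string"]` union with a `null` default makes `email` optional, which is the usual Avro idiom for a nullable column.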
Creating a table to store the data in Avro format
This process begins with the creation of a JSON-based schema, so that data can be serialized in a format that has the schema built in.
Avro has its own parser that returns the provided schema as an object. The created object then allows us to create records conforming to that schema.
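Avro ships schema parsers for many languages. As a dependency-free sketch of the same idea, the standard `json` module can load the schema document into an object we can build records against. The schema below is hypothetical, and this simplified stand-in skips the name and type validation the real Avro parser performs.

```python
import json

# A minimal, made-up schema document (illustration only).
SCHEMA_JSON = """
{"type": "record", "name": "Customer",
 "fields": [{"name": "id", "type": "long"},
            {"name": "name", "type": "string"}]}
"""

def parse_schema(text):
    """Stand-in for Avro's schema parser: returns the schema as an
    object (here just a dict; the real parser also validates it)."""
    schema = json.loads(text)
    assert schema["type"] == "record", "only record schemas handled here"
    return schema

def make_record(schema, **values):
    """Create a record containing exactly the schema's fields."""
    field_names = [f["name"] for f in schema["fields"]]
    missing = [n for n in field_names if n not in values]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {n: values[n] for n in field_names}

schema = parse_schema(SCHEMA_JSON)
print(make_record(schema, id=1, name="asha"))
```

The point to notice is that the record is shaped entirely by the parsed schema object, never hard-coded by the program writing the data.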
We can embed our schema inside the table properties while creating a Hive table.
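For example, a Hive table backed by Avro can embed the schema directly in TBLPROPERTIES via `avro.schema.literal`. The table and field names below are made up for illustration; the SerDe and input/output format classes are Hive's standard Avro ones (on recent Hive versions, `STORED AS AVRO` is a shorter equivalent of the ROW FORMAT and STORED AS clauses):

```sql
CREATE TABLE customers
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}');
```

Note that no column list is given in the DDL: the AvroSerDe derives the table's columns and their Hive types from the embedded Avro schema.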