Apache Avro

Hadoop, as we all know, is a distributed processing system, and no distributed processing system is complete without a serialization and de-serialization API. Serialization and de-serialization let us write data to a file, transfer it to another machine, and read it back at the receiving end. Hadoop provides the Writable and WritableComparable interfaces for this purpose: any record in Hadoop can be written to disk by implementing Writable or WritableComparable.
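
To make this concrete, here is a minimal sketch of what implementing Writable looks like. The PersonWritable class and its fields are hypothetical, but write and readFields are the standard Writable contract:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical record type implementing Hadoop's Writable contract.
public class PersonWritable implements Writable {
    private String name;
    private int age;

    // Serialize the fields to the binary stream.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(age);
    }

    // Deserialize the fields in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        age = in.readInt();
    }
}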

The Writable API of Hadoop is compact, but it is not easy to extend and cannot be used from languages other than Java, even though people who use the streaming API come from many different backgrounds. There are other serialization APIs, such as Thrift from Facebook and Protocol Buffers from Google, but the most popular in the Hadoop ecosystem is Avro. It was designed by Doug Cutting specifically to overcome the disadvantages of Writables.

Avro Schema: Before storing any data using Avro, we need to create a schema file. This schema file describes the structure of the data.

For the purpose of this blog, we will assume that our input data is in a JSON file. Suppose the following is the JSON data. Save it in a file named example.json.

{"name": "Andy", "age": 25, "city": "bangalore"}
{"name": "Harry", "age": 17, "city": "new york"}
{"name": "Charlie", "age": null, "city": "london"}

Defining Schema: Let us create an Avro schema file for this input data. As part of creating the schema, we will mention a namespace and the different fields in the data along with their types. Our example data has three fields: 1. name, 2. age and 3. city. All records in our data will have these fields, and each field has a type: name is of type string, age is of type int, and city is also of type string. Note that with age declared as int, every record must carry an integer age; if age could be missing, you would declare it as a union such as ["null", "int"] instead. Following is the way to create the schema:

{
    "namespace": "com.harjeet.avro",
    "type": "record",
    "name": "AvroExample",
    "fields": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "age",
            "type": "int"
        },
        {
            "name": "city",
            "type": "string"
        }
    ]
}

Store this schema in a file named AvroExample.avsc.
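
If you want to work with the schema from Java code, you can load the .avsc file with Avro's Schema.Parser. A small sketch, assuming the file sits in the working directory:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;

public class LoadSchema {
    public static void main(String[] args) throws IOException {
        // Parse the schema definition we saved above.
        Schema schema = new Schema.Parser().parse(new File("AvroExample.avsc"));
        System.out.println(schema.getName());      // AvroExample
        System.out.println(schema.getNamespace()); // com.harjeet.avro
        System.out.println(schema.getFields());    // the three field definitions
    }
}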

Avro supports many data types in schema definitions. Following is the list of primitive types:


  1. null: no value
  2. boolean: a binary value
  3. int: 32-bit signed integer
  4. long: 64-bit signed integer
  5. float: single precision (32-bit) IEEE 754 floating-point number
  6. double: double precision (64-bit) IEEE 754 floating-point number
  7. bytes: sequence of 8-bit unsigned bytes
  8. string: unicode character sequence

Now that this schema is created, we can convert the JSON data into Avro format. We need the avro-tools jar file for this. Following is the command we can use:

java -jar avro-tools-<version>.jar fromjson --schema-file AvroExample.avsc example.json > avro-example.avro
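
The same conversion can be done programmatically. Below is a minimal sketch that builds a GenericRecord against our schema (the values mirror the first line of example.json) and writes it to an Avro container file:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteAvro {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(new File("AvroExample.avsc"));

        // Build a record that conforms to the schema.
        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "Andy");
        record.put("age", 25);
        record.put("city", "bangalore");

        // Write the record to an Avro container file.
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("avro-example.avro"));
        writer.append(record);
        writer.close();
    }
}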

For using Snappy compression, add the --codec option:

java -jar avro-tools-<version>.jar fromjson --codec snappy --schema-file AvroExample.avsc example.json > avro-example.snappy.avro
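
In the Java API, the equivalent is to set the codec on the DataFileWriter before creating the file, a one-line change to the writing sketch above (this assumes the snappy-java library is on the classpath):

import org.apache.avro.file.CodecFactory;

// Must be called before writer.create(...).
writer.setCodec(CodecFactory.snappyCodec());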

For getting JSON data back from an Avro file, you can run the following commands respectively:

java -jar avro-tools-<version>.jar tojson avro-example.avro

java -jar avro-tools-<version>.jar tojson avro-example.snappy.avro


Replace <version> with the version of your jar.
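
Reading the file back in Java is just as short. This sketch iterates over the records in avro-example.avro using the generic datum reader; no .avsc file is needed because the schema is embedded in the container file itself:

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadAvro {
    public static void main(String[] args) throws IOException {
        // The reader discovers the schema from the file header.
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("avro-example.avro"), new GenericDatumReader<GenericRecord>());
        while (reader.hasNext()) {
            GenericRecord record = reader.next();
            System.out.println(record); // prints each record as JSON-like text
        }
        reader.close();
    }
}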


