Using the LZO compression codec in Hadoop

When working with Hadoop, the files we handle are usually very large, so it makes sense to compress them before using them with Hive or Pig. Hadoop supports several compression formats, each with its own advantages and disadvantages.

Let us start with the compression options available to us.

Name      Tool     Splittable
gzip      gzip     No
LZO       lzop     Yes (if indexed)
bzip2     bzip2    Yes
Snappy    NA       No

Normally you will want to choose a format that can be split, so that MapReduce can process the file in parallel. Otherwise you will be forced to use a single mapper to process the whole file.
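
To see why splittability matters, here is a rough sketch of how many map tasks a single file can yield. It assumes a 128 MB HDFS block size and one split per block; both are illustrative assumptions (the real split size depends on your cluster configuration), and "estimate_map_tasks" is a hypothetical helper, not a Hadoop API.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (128 MB); clusters differ


def estimate_map_tasks(file_size_bytes, splittable, block_size=BLOCK_SIZE):
    """Rough estimate of the map tasks created for one input file.

    A splittable file yields about one split per block; a non-splittable
    file is always read end to end by a single mapper.
    """
    if file_size_bytes <= 0:
        return 0
    if not splittable:
        return 1
    return math.ceil(file_size_bytes / block_size)


ten_gb = 10 * 1024 ** 3
print(estimate_map_tasks(ten_gb, splittable=False))  # gzip: 1
print(estimate_map_tasks(ten_gb, splittable=True))   # indexed LZO or bzip2: 80
```

Under these assumptions a 10 GB gzip file is handled by one mapper, while the same data in a splittable format can be spread over about 80.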

I normally prefer the LZO format because data compressed with it is very fast to decompress, which makes such data faster to read and process. However, LZO does not reduce file size as much as bzip2 does.

How to work with big files

First of all, you need the LZO tools installed on your machine. The following commands install them.

              
  Installing on Mac
  sudo port install lzop lzo2

  Installing on Red Hat and CentOS based systems
  sudo yum install lzo-devel

  Installing on Debian based systems (Ubuntu, Lubuntu, etc.)
  sudo apt-get install liblzo2-dev


Now you need the hadoop-lzo jar file. You can download a prebuilt jar, or compile and build it yourself from the project site.

Put the jar file in the lib folder of your Hadoop installation. If you followed the Hadoop installation steps from this blog, copy the jar file into the "/home/hduser/hadoop/lib" folder. If you are using a Cloudera installation, the path is most likely "/usr/local/hadoop/lib".

To use this compression format with MapReduce, we also need to set some properties.
Copy the following properties into the "core-site.xml" file, inside the configuration tag. They tell Hadoop which compression codecs are available.

<property>
<name>io.compression.codecs</name>
<value>com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
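
A quick way to double-check this step is to parse core-site.xml and confirm that both LZO codecs are listed. The sketch below uses only Python's standard library; "missing_lzo_codecs" is a hypothetical helper, and it is shown here on an inline XML string. In practice you would read your own core-site.xml, whose location depends on your installation.

```python
import xml.etree.ElementTree as ET

LZO_CODECS = {
    "com.hadoop.compression.lzo.LzoCodec",
    "com.hadoop.compression.lzo.LzopCodec",
}


def missing_lzo_codecs(core_site_xml):
    """Return the LZO codec classes NOT listed in io.compression.codecs."""
    root = ET.fromstring(core_site_xml)
    configured = set()
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "io.compression.codecs":
            value = prop.findtext("value", "")
            # The value is a comma-separated list of codec class names.
            configured = {c.strip() for c in value.split(",") if c.strip()}
    return LZO_CODECS - configured


sample = """<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>
</configuration>"""

print(missing_lzo_codecs(sample))  # an empty set when both codecs are registered
```

A non-empty result names the codec classes that still need to be added to the property value.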

To also compress the job output, add the following properties.

<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>

<property>
<name>mapred.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

If you also want to compress the intermediate output of a MapReduce program, i.e. the output of the mapper tasks, set the following properties as well.

<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>

<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
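
The property names above use the older "mapred.*" spelling, while the Hive session settings later in this post use the newer "mapreduce.*" spelling. To my knowledge, Hadoop 2 and later treat the old names as deprecated aliases of the new ones as listed below. Treat this mapping as a convenience sketch and check the deprecated-properties list for your Hadoop version; "modern_name" is a hypothetical helper.

```python
# Older name -> newer name, as I understand Hadoop's deprecated-properties
# list (an assumption to verify against your Hadoop version).
DEPRECATED_TO_NEW = {
    "mapred.output.compress": "mapreduce.output.fileoutputformat.compress",
    "mapred.output.compression.codec": "mapreduce.output.fileoutputformat.compress.codec",
    "mapred.compress.map.output": "mapreduce.map.output.compress",
    "mapred.map.output.compression.codec": "mapreduce.map.output.compress.codec",
}


def modern_name(prop):
    """Return the newer spelling of a property, or the name unchanged."""
    return DEPRECATED_TO_NEW.get(prop, prop)


for old in DEPRECATED_TO_NEW:
    print(f"{old} -> {modern_name(old)}")
```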

After all the configuration is done, you can use the LZO compressor.

Suppose I have a text file that I want to compress and load into a Hive table. I can do the following to achieve this.

Run the following command to compress the "input.txt" file and load it into the Hadoop file system.

          lzop -c "/home/hduser/input.txt" | hadoop fs -put - /user/hduser/input.lzo

This command compresses the file and puts it on the Hadoop file system at the "/user/hduser/input.lzo" path.

Our next step is to index this file using the Hadoop LZO indexer so that MapReduce can split it.

Run the following command to achieve this.

hadoop jar /home/hduser/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /user/hduser/input.lzo

This will create the index file, "input.lzo.index", in the "/user/hduser/" folder in HDFS.
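
To make the role of the index concrete: as far as I understand the hadoop-lzo format, the index file is simply a sequence of 8-byte big-endian integers, each one the byte offset at which a compressed block starts in the .lzo file. A split can only begin at one of those offsets. The sketch below decodes such bytes and moves a desired split start forward to the next block boundary. The sample offsets are made up, and "read_block_offsets" and "align_split_start" are hypothetical helpers, not part of hadoop-lzo.

```python
import bisect
import struct


def read_block_offsets(index_bytes):
    """Decode an index as consecutive 8-byte big-endian block offsets."""
    count = len(index_bytes) // 8
    return list(struct.unpack(f">{count}q", index_bytes[: count * 8]))


def align_split_start(offsets, desired_start):
    """Return the first block offset at or after desired_start, or None."""
    i = bisect.bisect_left(offsets, desired_start)
    return offsets[i] if i < len(offsets) else None


# Made-up offsets standing in for the contents of a real index file.
fake_index = struct.pack(">4q", 38, 70000, 141500, 212900)
offsets = read_block_offsets(fake_index)
print(offsets)                              # [38, 70000, 141500, 212900]
print(align_split_start(offsets, 100000))   # 141500
```

Without an index there are no known block boundaries to align to, which is why an unindexed LZO file ends up with a single mapper.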

Now open the Hive shell and set the following properties.

SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;

Now we can create a table that uses the compressed data.

create table user_tb
(id int,
 name string
) row format delimited
fields terminated by ','
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
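
The table definition expects each line of the input file to hold an id and a name separated by a comma. As a small illustration, this sketch renders a few such lines; the sample rows and the "format_rows" helper are made up for the example.

```python
def format_rows(rows):
    """Render (id, name) pairs as comma-delimited lines for user_tb."""
    lines = []
    for user_id, name in rows:
        if "," in name:
            # The table uses ',' as the field delimiter with no escaping,
            # so a comma inside a name would shift the columns.
            raise ValueError(f"name contains the delimiter: {name!r}")
        lines.append(f"{int(user_id)},{name}")
    return "".join(line + "\n" for line in lines)


sample_rows = format_rows([(1, "alice"), (2, "bob")])
print(sample_rows, end="")
# Save text like this as input.txt before compressing it with lzop.
```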

Once the table is successfully created, we need to load the data into it with the following command.

load data inpath '/user/hduser/input*' into table user_tb;

After this step, you can use this table like any other table in Hive.

Please feel free to leave any comments and feedback. Thanks... :)


Comments

  1. I wish this worked for me. I have followed directions to the letter, yet my jobs fail with a warning:

    you may need to add the LZO codec to your io.compression.codecs configuration in core-site.xml.

    It is in there exactly as you show above and the cluster has been restarted. Hadoop makes me want to pull my hair out by the roots sometimes...