Posts

Showing posts from July, 2015

Using Lzo compression codec in Hadoop

When working with Hadoop, the files we handle are usually very large, so we generally need to compress them before using them with Hive or Pig. Hadoop provides various compression formats, each with its own advantages and disadvantages.

Let us start with the compression options available to us.

Name      Tool     Splittable
gzip      gzip     No
LZO       lzop     Yes (if indexed)
bzip2     bzip2    Yes
Snappy    NA       No

Normally, you will want to choose an option that lets the file be split, so that you can use the power of MapReduce to process it. Otherwise, you will be forced to use a single mapper to process that file. I normally prefer LZO form

Data Science : Kaggle Titanic Problem

Introduction

When I started participating in Kaggle competitions, I started with this problem. I feel this is a very easy problem for starting to learn data science, so I thought I would share how I approached it. Please feel free to share your comments. I used IPython to solve this problem, then converted the IPython notebook into an HTML file, so if the UI of this page looks broken, you can see this notebook here.

The problem statement is very simple. Given example data about people who survived and people who died on the Titanic, we need to predict, given a new person's data, whether that person will be saved or not. You can read more about this problem statement here.

Data

You can also get the data from the same location. There are two important files there: train.csv and test.csv. train.csv is the file that contains the examples. We will analyze this data and create a model that learns patterns about the people who were saved and those who died. Th