
Showing posts from 2014

Data Science | Steps to approach a Machine Learning Problem

It has been some time since I wrote a blog post; I just managed to get some time to put these things down. In my day-to-day work, I come across many people who are trying their hands at machine learning. Machine learning is amazing: once you fall in love with it, you will enjoy doing it all day. The availability of so many libraries has made it very easy for us to develop machine learning solutions, but many people who are new to machine learning use these APIs directly and skip important steps, which leads to poor results. So I thought of sharing some of the important steps in this post. The following steps go into a good machine learning solution.
1. Data collection
2. Data preprocessing
    1) Data cleaning
    2) Feature creation and feature selection
    3) Feature scaling and normalization
    4) Dividing data into training and testing sets (you can create a cross-validation set as well; see the sketch below)
3. Build a model on the training data.
4. Evaluate the model.
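To make the train/test split in step 2 concrete, here is a minimal, self-contained sketch; the 80/20 ratio and the toy records are my own illustrative choices, not from the post:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class TrainTestSplit {
        public static void main(String[] args) {
            // Toy dataset: in practice these would be your preprocessed feature rows.
            List<String> records = new ArrayList<String>();
            for (int i = 0; i < 100; i++) {
                records.add("record-" + i);
            }

            // Shuffle first so the split is random rather than ordered.
            Collections.shuffle(records);

            // Keep 80% for training and hold out the remaining 20% for testing.
            int cut = (int) (records.size() * 0.8);
            List<String> train = records.subList(0, cut);
            List<String> test = records.subList(cut, records.size());

            System.out.println("Training records: " + train.size());  // 80
            System.out.println("Testing records: " + test.size());    // 20
        }
    }

Evaluating only on the held-out test set is what tells you whether the model generalizes instead of just memorizing its training data.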

Moving Hadoop Namenode out of safemode.

Hi there! People have been asking me to shed some light on Hadoop's safemode, so let's see what it is. Many times when you start Hadoop, it gets stuck in safemode. So what exactly is this safemode? When Hadoop starts, it normally puts itself into safemode. In this mode you cannot write any new data to Hadoop; it is a read-only mode. Hadoop essentially says: until I get heartbeats from a fixed number of datanodes, I will keep myself in safemode. I have seen this happen even in pseudo-distributed mode. So if your Hadoop is in safemode, you will not be able to write any new data or create any folder on Hadoop, and you have to bring it out of safemode. The following is the command for bringing Hadoop out of safemode:
    hadoop dfsadmin -safemode leave
After this, you will be able to use Hadoop as normal. Enjoy, keep coding, keep facing issues, keep learning :)
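Two related subcommands of the same dfsadmin tool are also worth knowing here:

    hadoop dfsadmin -safemode get     # prints whether safemode is ON or OFF
    hadoop dfsadmin -safemode wait    # blocks until the namenode leaves safemode on its own

The second one is handy in startup scripts: instead of forcing Hadoop out of safemode, you simply wait for enough datanodes to report in.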

Machine Learning: Naive Bayes Part 1

My major area of work is text analytics and machine learning, and I always get excited to solve problems in this area, so I thought I would share some of my knowledge on it too :). We will start with the Naive Bayes algorithm. It is a supervised learning classification algorithm. Supervised learning means that before running on actual data, we have to train the algorithm on a training set and show it which records belong to which class. E.g., before trying my NB (Naive Bayes algorithm) on test data, I will show the algorithm some examples of what a spam message looks like and what a non-spam message looks like. Once it is trained, we can go ahead and try it on real-world data. Before trying NB, we need to know some basics about probability, so let's go through that. Let's assume we have a die. A die has six faces, and each face shows a distinct number between 1 and 6. If the die is not biased, what is the probability that we will get a particular number?
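To answer the die question and preview the math this series builds on (a standard illustration; the truncated excerpt cuts off before it):

    P(rolling any particular number, say 4) = 1/6, since all six faces are equally likely.
    P(rolling an even number) = 3/6 = 1/2, since three of the six faces (2, 4, 6) qualify.

Naive Bayes itself rests on Bayes' theorem:

    P(A|B) = P(B|A) * P(A) / P(B)

e.g. for spam filtering, P(spam | word) = P(word | spam) * P(spam) / P(word).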

Pig Installation

Today we will learn how to install Pig. Installing Pig is very simple and straightforward. The following are the steps to install Pig. Pig requires Hadoop and Java to be installed already; if you have not installed them, follow the link here. 1. Download Pig from the Apache Pig website. Check that the Pig version you are downloading is compatible with the Hadoop already installed on your machine. I am going to download Pig 0.10.1. 2. After the download is complete, go to the download directory and extract pig-0.10.1.tar.gz. 3. Copy the extracted folder into the $HOME/pig directory. 4. Edit /etc/bash.bashrc with the following command:
    sudo leafpad /etc/bash.bashrc
Then go to the last line and add the following (note the $ before PIG_HOME in the PATH entry):
    export PIG_HOME=$HOME/pig
    export PATH=$PATH:$PIG_HOME/bin
Also set JAVA_HOME if it was not set earlier, using the following command. In my case Java is installed at /usr/lib/jvm/java-6-jdk-i386:
    export JAVA_HOME=/usr/lib/jvm/java-6-jdk-i386
5. After…
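Once the edits are saved, a quick sanity check (an extra step of my own, not part of the truncated post) is to reload the file and start Pig in local mode, which does not need the Hadoop daemons running:

    source /etc/bash.bashrc
    pig -x local      # should drop you into the grunt> shell

If the grunt> prompt appears, PIG_HOME and PATH are set correctly.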

Hadoop Installation Video

Hi friends! As promised earlier, I have created a few videos as a Hadoop installation tutorial. Currently these videos cover the pseudo-distributed mode installation of Hadoop; I will create a few videos for fully distributed mode installation as well. For now, please find the videos below. 1. Video for the installation of Lubuntu on Windows. For this you should download VMware Player from the VMware site and install it on your Windows machine. You will also need the .iso file for Lubuntu, which you can download from the Lubuntu website. 2. After you have installed Lubuntu, you can go through the following video and install Hadoop in pseudo-distributed mode. Stay tuned for more stuff on this blog. Please share your feedback and let me know if you want a post on any specific topic.

Hadoop Series: Hadoop Distributed File System

In previous posts we learned how to install Hadoop, got an introduction to Hadoop, etc. Today we will learn about HDFS (Hadoop Distributed File System). HDFS is the component of Hadoop that handles the storage part. HDFS follows a master-slave architecture, so let us first discuss what a master-slave architecture is. In a master-slave architecture we have two kinds of machines: a master and a set of slaves. The master does two things: 1. Plan 2. Monitor. The master is like the manager of your team: if there is some work to do, the master plans whom to assign that work to. The slaves do two things: 1. Work 2. Report. A slave is like a developer on your team (:P please don't feel offended, it is just an analogy). The slaves do the actual work: the master assigns work to the slaves, and the slaves complete it, similar to a manager who wants to build some software and plans who is going to develop which component. Main…
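In HDFS terms, the master is the NameNode and the slaves are the DataNodes. You can see the master's view of its slaves with the following command (a quick illustration of the analogy, not from the original post):

    hadoop dfsadmin -report

It prints the configured capacity of the cluster and the list of datanodes currently reporting in, which is exactly the "report" half of the slaves' work-and-report duties.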

Hive UDF Example

A UDF (User Defined Function) is a very important piece of functionality provided by Hive, and it is very simple to create one. In this tutorial we will learn how to create a UDF and how to use it with Hive. There are two possible ways to create a UDF: 1. using org.apache.hadoop.hive.ql.exec.UDF 2. using org.apache.hadoop.hive.ql.udf.generic.GenericUDF. If the input and output of your custom function are basic types, e.g. Text, FloatWritable, DoubleWritable, IntWritable, etc., then use org.apache.hadoop.hive.ql.exec.UDF. If your input and output can be map, set, or list types of data structures, then use org.apache.hadoop.hive.ql.udf.generic.GenericUDF. We will discuss the first type of UDF here; I will write one more post to discuss the second approach. First of all, let's assume I want to create a Hive function called toUpper which will convert a string to uppercase. Follow these steps to achieve it (a minimal version of the class itself is sketched below). 1. Download and install Eclipse from here. 2. Hive should be installed, if…
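For reference, here is roughly what such a toUpper class looks like; the excerpt is cut off before the code, so the package and class names below are my own illustrative choices:

    package com.example.hive.udf;   // hypothetical package name

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A simple Hive UDF: extend UDF and provide an evaluate() method.
    public class ToUpper extends UDF {
        public Text evaluate(Text input) {
            if (input == null) {
                return null;    // let NULLs pass through unchanged
            }
            return new Text(input.toString().toUpperCase());
        }
    }

After packaging the class into a jar, it would be registered and used from the Hive shell along these lines:

    ADD JAR /path/to/toupper.jar;
    CREATE TEMPORARY FUNCTION toUpper AS 'com.example.hive.udf.ToUpper';
    SELECT toUpper(name) FROM some_table;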

Common Hadoop Problems

Today we will look at some common problems that people face while installing Hadoop. A few of them are listed below.
1. Problem with ssh configuration.
    error: connection refused to port 22
2. Namenode not reachable
    error: Retrying to connect 127.0.0.1
1. Problem with ssh configuration: In this case you may face many kinds of errors, but the most common one while installing Hadoop is "connection refused to port 22". Here you should check whether the machine you are trying to log in to has an ssh server installed. If you are using Ubuntu/Lubuntu, you can install the ssh server using the following command:
    sudo apt-get install openssh-server
On CentOS or Red Hat you can install the ssh server using the yum package manager:
    sudo yum install openssh-server
After installing the ssh server, make sure you have configured the keys properly and shared the public key with the machine you want to log in to (see the commands below). If the problem persists, then check for configurat…
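The key configuration mentioned above is usually the standard passwordless-ssh recipe used in pseudo-distributed Hadoop setups (the truncated excerpt does not spell it out, so treat this as the conventional version):

    ssh-keygen -t rsa -P ""                             # generate a key pair with an empty passphrase
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys     # authorize the key for logins to this machine
    ssh localhost                                       # should now log in without asking for a password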

HBase Installation On Ubuntu / Lubuntu

Hi everybody! Today we will learn how to install HBase. HBase can be installed in two modes: 1. Standalone mode 2. Distributed mode. Distributed mode can itself be of two types: pseudo-distributed mode and fully distributed mode. In this tutorial we will discuss the pseudo-distributed mode installation of HBase; I will soon share another post on how to install HBase in fully distributed mode. UPDATE: A video for this topic is available here. The following are the steps for HBase installation. 1. Before installing HBase you should have Java and Hadoop installed already. If you have not installed Hadoop, please follow the link here. 2. Next, choose an HBase version that is compatible with your Hadoop installation. I am using Hadoop 1.0.3, so I am installing HBase 0.94.8. 3. Download HBase from HBase 0.94.8 and extract it into "$HOME/hbase"; in my case that is "/home/hduser/hbase". 4. Edit $HOME/hbase/conf/hbase-env.sh with the following command…
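The excerpt cuts off at the configuration step, but for orientation, a pseudo-distributed hbase-site.xml typically ends up looking something like the following. The HDFS address is my assumption of the usual hdfs://localhost:9000 from a default pseudo-distributed Hadoop 1.x setup; adjust it to match your core-site.xml:

    <configuration>
      <!-- HBase stores its data under this HDFS path -->
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
      </property>
      <!-- false would mean standalone mode on the local filesystem -->
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
    </configuration>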

Hadoop series: Hadoop Introduction

Big data... big data everywhere... well, that has been the state of the industry as of now. Everybody is talking about big data, hiring big data professionals, starting projects on big data. People who have knowledge of big data tools are getting very good salaries. So what exactly is this big data? For me, it is a problem. Consider the scenario before smartphones or internet-enabled phones: for shopping we used to go to markets (at least in India), and for paying utility bills, transferring money, booking rail tickets, etc., we had to stand in queues or physically go to the place. Now, with more and more usage of the internet (thanks to smartphones), we are doing everything online: shopping, bill payments, investment in the share market, connecting with friends, applying for a passport or driving licence, etc. All these things have suddenly increased the amount of digital data we produce. I think big data has been around for quite some time, but now it has come in the form of digi…

Installing Hive

Hi guys! Today we will learn how to install Hive. It is very easy and needs only a few steps. Before you start installing Hive, you should have already installed Hadoop; if not, please check out the hadoop installation post. Steps to install Hive: 1. Download Hive from Hive 0.11.0. If you want to install another version of Hive, check its compatibility with the Hadoop version you have already installed; in my case I am using Hadoop 1.0.3. 2. Go to the downloads folder, right-click, and extract hive-0.11.0.tar.gz. 3. Copy the extracted folder into /home/hduser/hive. 4. Edit /etc/bash.bashrc and export HADOOP_HOME if it is not set.
    On Lubuntu:
        sudo leafpad /etc/bash.bashrc
    On Ubuntu:
        sudo gedit /etc/bash.bashrc
For people new to Linux, leafpad and gedit are two editors for editing text files, similar to Notepad on Windows. 5. Insert the following statement in the file:
    export HADOOP_HOME=/home/hduser/hadoop
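The excerpt stops at HADOOP_HOME, but installs like this conventionally add the Hive paths to the same file as well (my assumption of how the remaining steps go, mirroring the Pig post above):

    export HIVE_HOME=/home/hduser/hive
    export PATH=$PATH:$HIVE_HOME/bin

After sourcing the file, running hive should drop you into the hive> shell.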

Hadoop series: Pseudo Distributed Mode Hadoop Installation

In this tutorial, we will learn the steps required to set up Hadoop on a single node, also called pseudo-distributed mode. As part of this series we will also set up Hadoop on multiple machines, and we will learn MapReduce, Hive, Pig, etc. So stay tuned, here it comes... :) We need the following to start with. Ubuntu: I always prefer Ubuntu as my Linux flavour. However, if you are using a very low-end machine with very little RAM, you can install Lubuntu instead; Lubuntu is very lightweight, and any low-end machine should work. If you have a Windows machine, you can install Hadoop using Cygwin, VirtualBox, or VMware Player. Here we go... steps for the installation of Hadoop. If you are using Windows, follow steps 1 to 4. 1. Download and install VirtualBox from https://www.virtualbox.org/wiki/Downloads 2. Download Ubuntu from http://www.ubuntu.com/download/desktop. If you have a low-configuration machine, then you can use Lubuntu; download Lubuntu from…
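The excerpt ends before the Hadoop configuration itself, but the heart of a pseudo-distributed setup is pointing everything at localhost. For Hadoop 1.x, a typical conf/core-site.xml (my illustration, using the same hdfs://localhost:9000 address assumed in the HBase post above) looks like:

    <configuration>
      <!-- the NameNode address that HDFS clients connect to -->
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>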