Data Science | Steps to Approach a Machine Learning Problem

It has been some time since I wrote a blog post, and I finally managed to find some time to put these thoughts down. In my day-to-day work, I come across many people who are trying their hands at machine learning.
Machine learning is amazing. Once you fall in love with it, you will enjoy doing it all day. The availability of many libraries has made it very easy for us to develop machine learning solutions. However, many people who are new to machine learning use these APIs directly and skip important steps, which leads to poor results. So I thought of sharing some of the important steps in this post.
Following are the steps to create a good machine learning solution.

1. Data collection
2. Data preprocessing
    1) Data cleaning
    2) Feature creation and feature selection
    3) Feature scaling and normalization
    4) Divide the data into training and testing sets (you can create a cross-validation set as well)
3. Build a model on the training data.
4. Evaluate the model on the test data.
5. If the performance is satisfactory, deploy to the real system.
6. If the performance is not good, check for overfitting and underfitting.
7. Regularize your algorithm and go back to step 3 (a minimal code sketch of steps 3 to 7 follows this list).
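To make the flow concrete, here is a minimal sketch of steps 2.4 through 7. It assumes scikit-learn is available and uses one of its built-in data sets in place of your own collected data; the model and parameter choices are only illustrative.

```python
# A minimal sketch of splitting data, training, evaluating and regularizing.
# Assumes scikit-learn is installed; replace the example data with your own.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Example data set, standing in for your collected and preprocessed data.
X, y = load_breast_cancer(return_X_y=True)

# Step 2.4: divide the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: build a model on the training data.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Step 4: evaluate the model on the test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Steps 6-7: if the model overfits, increase regularization (smaller C in
# scikit-learn's LogisticRegression) and train again; if it underfits,
# decrease regularization or add better features, then repeat from step 3.
regularized_model = LogisticRegression(C=0.1, max_iter=5000)
regularized_model.fit(X_train, y_train)
print("Regularized test accuracy:", accuracy_score(y_test, regularized_model.predict(X_test)))
```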

This process is iterative, and you can add more steps in between depending on the situation. Let’s understand each step:

1. Data Collection: 
At this stage we collect data from the available sources. For analysing user click behaviour, you would collect web log data. For predicting whether a mail is spam or not, you would collect emails. For predicting the sentiment of Twitter messages, you may want to collect data from Twitter.

2. Data Preprocessing:
The data that you receive from any source may not be in a readily usable form. You may want to pre-process it so that your algorithm can make the best use of the collected information.
Following are the things you may want to do as part of it.

1) Data Cleaning: You may end up collecting data that has wrong or null values for some of the records. The wrong or missing values may be very obvious (for example, a negative age or an empty field), and you can either drop such records or fill in sensible values, as in the sketch below.
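Here is a small sketch of basic cleaning with pandas. The DataFrame, its column names and the fill strategy are all illustrative assumptions, not part of any particular data set.

```python
import pandas as pd
import numpy as np

# Illustrative data: 'age' has a missing value and an obviously wrong one.
df = pd.DataFrame({
    "age": [25, np.nan, 34, -5, 42],
    "income": [40000, 52000, 61000, 48000, 75000],
})

# Drop records whose 'age' is missing.
df = df.dropna(subset=["age"])

# Remove obviously wrong values (a negative age is impossible).
df = df[df["age"] > 0]

# Alternatively, fill missing values with the column median instead of dropping rows:
# df["age"] = df["age"].fillna(df["age"].median())
print(df)
```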

2) Viewing Data: You may want to make some plots of the data to see which parameters affect the output of your records. It will also give you a picture of whether your data is skewed or normally distributed. Viewing data in the form of plots and histograms may completely surprise you. If you have data about Facebook users, you may make a plot to see whether male users or female users have more friends. If you plot age against the number of people of that age, it will give you a very clear picture of which age group is most active on Facebook.
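A quick sketch of such plots using pandas and matplotlib follows. The user data and column names ('age', 'gender', 'friend_count') are hypothetical and only stand in for the Facebook example above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical user data with age, gender and friend count.
users = pd.DataFrame({
    "age": [18, 22, 25, 31, 40, 52, 19, 27, 33, 45],
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M", "F", "M"],
    "friend_count": [120, 340, 280, 150, 90, 60, 400, 210, 260, 75],
})

# Histogram of ages: shows which age groups dominate and whether the data is skewed.
users["age"].plot(kind="hist", bins=5, title="Age distribution")
plt.xlabel("age")
plt.show()

# Average friend count by gender: do male or female users have more friends?
users.groupby("gender")["friend_count"].mean().plot(kind="bar", title="Average friends by gender")
plt.show()
```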

3) Data Transformation: Depending on what data you have, you may want to convert some features to another form. For example, if age is one of the features of your data and you want only 4 groups, minor (0-18), young (19-45), old (46-65) and senior citizen (66 and above), you may transform the age feature into a categorical variable. In some complex scenarios you may also want to convert low-dimensional data to high dimensions (e.g. the SVM algorithm using kernels, which we will discuss in a separate post) or high dimensions to low dimensions (e.g. PCA, for dimensionality reduction).
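A small sketch of the age-binning example using pandas.cut, one common way to do this transformation; the sample ages are made up.

```python
import pandas as pd

ages = pd.DataFrame({"age": [12, 17, 25, 38, 50, 64, 70, 82]})

# Convert the numerical 'age' feature into the four categories described above.
bins = [0, 18, 45, 65, 120]
labels = ["minor", "young", "old", "senior citizen"]
ages["age_group"] = pd.cut(ages["age"], bins=bins, labels=labels)
print(ages)
```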

In general we work with both numerical and categorical data. Numerical data consists of actual numbers, while categorical data has a few discrete values. Examples of categorical data include marriage status, month of birth, employment type or gender. A categorical variable can be a number, but there is no meaning in adding two values of a categorical variable, e.g. zip code. There may or may not be an order to categorical data.
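One common way to handle such variables is one-hot encoding, for example with pandas.get_dummies as sketched below, so that the model does not read a false numeric ordering into them. The records and column names here are illustrative.

```python
import pandas as pd

records = pd.DataFrame({
    "employment_type": ["salaried", "self-employed", "student", "salaried"],
    "zip_code": [10001, 94105, 60601, 10001],  # numeric, but adding two zip codes is meaningless
})

# One-hot encode the categorical columns into separate 0/1 indicator columns.
encoded = pd.get_dummies(records, columns=["employment_type", "zip_code"])
print(encoded)
```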
