Data Science : Kaggle Titanic Problem
Introduction¶
When I started participating in Kaggle competitions, I started with this problem. I feel it is a very easy problem for starting to learn data science, so I thought I would share how I approached it. Please feel free to share your comments. I used IPython to solve this problem and then converted the IPython notebook into an HTML file, so if the layout of this page looks broken, you can see the notebook here.
The problem statement is very simple. Given example data about people who survived and people who died on the Titanic, we need to predict whether a new person, given their data, would have survived. You can read more about the problem statement here.
Data¶
You can also get the data from the same location. There are two important files there:
- train.csv
- test.csv
train.csv is the file that contains the examples. We will analyze this data and create a model that learns patterns about the people who were saved and those who died. This exercise needs python, ipython, numpy, pandas, matplotlib and sklearn to be installed on your machine. I will shortly create a post with details on installing all of these.
Before starting, let us first look at what columns we have in our data.
- survival: Survival (0 = No; 1 = Yes)
- pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- name: Name
- sex: Sex
- age: Age
- sibsp: Number of Siblings/Spouses Aboard
- parch: Number of Parents/Children Aboard
- ticket: Ticket Number
- fare: Passenger Fare
- cabin: Cabin
- embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Let us start the fun now. First we will import some libraries.
Some Imports¶
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
We have imported pandas, numpy, sklearn and matplotlib. pandas is a great library for loading data and doing exploratory data analysis (EDA) on it. sklearn will help us with different transformations and with creating a model. matplotlib helps us plot data for visualizations.
"%matplotlib inline" is a magic function in IPython. It initializes matplotlib and displays the created plots in the notebook itself instead of in a separate window.
Read Data¶
Let us now read the CSV files.
df = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
Start Exploring¶
To get some information about the loaded data, type the following:
df.info()
To get basic summary statistics (count, mean, standard deviation, min, max and quartiles) for the numeric variables, call the describe function.
df.describe()
df.Survived.value_counts().plot(kind='bar')
No of Males vs Females¶
df.Sex.value_counts().plot(kind='bar')
No of people who boarded from different points¶
df.Embarked.value_counts().plot(kind='bar')
No of people in different classes¶
df.Pclass.value_counts().plot(kind='bar')
Age distribution¶
df.Age.hist()
pclass_crosstab = pd.crosstab(df.Pclass,df.Survived)
pclass_crosstab
Let's explore by percentage¶
pclass_pct = pclass_crosstab.div(pclass_crosstab.sum(1).astype(float) , axis=0)
pclass_pct
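To make the row-normalization step concrete, here is a minimal sketch on a small made-up table (the ten rows below are invented for illustration and are not taken from train.csv). It divides each row of the cross tab by its row total, and also shows the `normalize="index"` option of `pd.crosstab`, which should give the same proportions in one call.

```python
import pandas as pd

# Made-up toy data standing in for the Pclass and Survived columns
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

counts = pd.crosstab(toy.Pclass, toy.Survived)

# Manual approach: divide every row by its own row total
manual_pct = counts.div(counts.sum(axis=1), axis=0)

# Built-in approach: ask crosstab to normalize each row
builtin_pct = pd.crosstab(toy.Pclass, toy.Survived, normalize="index")

print(manual_pct)
```

Each row of the result sums to 1, so the values can be read directly as the share of each class that died (column 0) or survived (column 1).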
Let's visualize the cross tab table¶
pclass_pct.plot(kind='bar')
pclass_pct.plot(kind='bar' , stacked=True)
Exploring how much gender affected survival¶
First get the unique genders, then map the Sex column to a new gend_map column. Because the labels are sorted alphabetically, female will be converted to 0 and male to 1.
gend_mapping = dict(zip(np.sort(df.Sex.unique()), range(len(df.Sex.unique()))))
df['gend_map']= df.Sex.map(gend_mapping).astype(int)
df.head()
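As a quick check of which label gets which code, here is a small sketch on four made-up rows (not real passengers). It builds the same kind of mapping with `sorted` and `enumerate`, which I find a little easier to read; the column name `gend_map` follows the post, everything else is illustrative.

```python
import pandas as pd

# Made-up rows, only to show how the labels are turned into codes
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Sort the distinct labels alphabetically, then number them 0, 1, ...
labels = sorted(toy.Sex.unique())
mapping = {label: code for code, label in enumerate(labels)}

toy["gend_map"] = toy.Sex.map(mapping).astype(int)
print(mapping)
print(toy)
```

Since "female" sorts before "male", female gets code 0 and male gets code 1.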
Gender vs Survival Analysis¶
gend_crosstab = pd.crosstab(df.gend_map , df.Survived)
gend_pct = gend_crosstab.div(gend_crosstab.sum(1).astype(float) , axis=0)
gend_pct.plot(kind='bar' , stacked = True)
Analyze Combined effect of gender and class on Survival output¶
#count number of males and females in each class
uniq_pclass = df.Pclass.unique()
print(uniq_pclass)
print("Males in 1st class : ",len(df[(df.Sex == 'male') & (df.Pclass == 1)]))
print("Females in 1st class : ",len(df[(df.Sex == 'female') & (df.Pclass == 1)]))
print("Males in 2nd class : ",len(df[(df.Sex == 'male') & (df.Pclass == 2)]))
print("Females in 2nd class : ",len(df[(df.Sex == 'female') & (df.Pclass == 2)]))
print("Males in 3rd class : ",len(df[(df.Sex == 'male') & (df.Pclass == 3)]))
print("Females in 3rd class : ",len(df[(df.Sex == 'female') & (df.Pclass == 3)]))
#for p_class in uniq_pclass :
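The six counts above can probably also be obtained in one step with a cross tab of class against sex. This is a sketch on made-up rows, not the real passenger list:

```python
import pandas as pd

# Made-up rows standing in for the Pclass and Sex columns
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex":    ["male", "female", "male", "male", "male", "female"],
})

# One row per class, one column per sex, each cell is a passenger count
class_by_sex = pd.crosstab(toy.Pclass, toy.Sex)
print(class_by_sex)
```

On the real data the equivalent call would be `pd.crosstab(df.Pclass, df.Sex)`.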
Get only Female data¶
#female Survival plot by class
female_df = df[df.Sex == 'female']
female_df_crtb = pd.crosstab(female_df.Pclass,female_df.Survived)
female_df_crtb = female_df_crtb.div(female_df_crtb.sum(axis=1).astype(float),axis=0)
Analyze how female survival rate was affected by Passenger class¶
female_df_crtb.plot(kind='bar', stacked=True)
So the majority of first-class and second-class female passengers were saved, while comparatively fewer females from third class were saved.
Analyze how male survival rate was affected by Passenger class¶
Select only Male data and plot¶
male_df = df[df.Sex == 'male']
male_df_crtb=pd.crosstab(male_df.Pclass,male_df.Survived)
male_df_crtb=male_df_crtb.div(male_df_crtb.sum(axis=1).astype(float), axis=0)
male_df_crtb.plot(kind='bar' , stacked=True)
Handling Null values¶
Let us see which rows have a null value in the Embarked column¶
df[df.Embarked.isnull()]
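The line above shows the rows themselves. If only the number of missing values is needed, a sketch like the following may be handier (the toy column is made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up column with two missing ports
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# Count of missing values in a single column
n_missing = toy.Embarked.isnull().sum()

# Count of missing values in every column at once
missing_per_column = toy.isnull().sum()

print(n_missing)
print(missing_per_column)
```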
Encode the categorical variable Embarked as numerical values instead of text values¶
#Encode Embarked column
embark_uniq = np.sort(df.Embarked.astype(str).unique())
embark_enc = dict(zip(embark_uniq,range(len(embark_uniq))))
embark_enc
Create new Encoded column Embarked_map¶
df['Embarked_map'] = df.Embarked.map(embark_enc)
Where did most people start their journey?¶
df.Embarked_map.hist(bins=len(embark_uniq),range=(0,3))
Replace null values with the most frequent occurrence¶
df.Embarked_map[df.Embarked_map.isnull()]
# 2 is the code for 'S', the most frequent port; assign the result back instead of filling in place
df['Embarked_map'] = df.Embarked_map.fillna(2)
df.Embarked_map[df.Embarked_map.isnull()]
df.Embarked[df.Embarked.isnull()]
df['Embarked'] = df.Embarked.fillna('S')
df.Embarked[df.Embarked.isnull()]
df.Embarked.unique()
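Instead of hard-coding 'S', the most frequent value could be looked up with `Series.mode()`. This is a minimal sketch on a made-up column, under the assumption that there is a single most frequent value:

```python
import numpy as np
import pandas as pd

# Made-up column with one missing port
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", "S"]})

# mode() ignores missing values by default; take the first (most frequent) entry
most_common = toy.Embarked.mode()[0]

# Assign the filled column back rather than filling in place
toy["Embarked"] = toy.Embarked.fillna(most_common)
print(most_common)
print(toy.Embarked.tolist())
```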
Analysing Age¶
Age is a continuous numeric variable. It also has some missing values. There can be many strategies for filling missing values:
- Replace the missing value with the most frequent value (we did this in the Embarked case).
- Replace the missing value with the mean value.
Here we could replace missing values in Age with the overall mean. A more interesting option is to fill each missing value using similar records: let us fill a missing age from the ages of passengers in the same passenger class and of the same gender.
Let us see the rows with a missing Age value, along with their Pclass and gender.
age_null = df[df.Age.isnull()][ ['Pclass','Sex' , 'Age']]
age_null.shape
There are 177 records with null values for Age.
Let us create a new column 'Age_enc' with no null values. Each missing age is replaced with the median age for the same Pclass and gender (the mean can give awkward fractional ages, so let us use the median).
df['Age_enc'] = df['Age']
# transform keeps the original row index, so the result lines up with df
df['Age_enc'] = df.Age_enc.groupby([df.gend_map , df.Pclass]).transform(lambda x: x.fillna(x.median()))
len(df[df.Age_enc.isnull()])
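To see what the group-wise fill does, here is a small sketch with made-up ages (two groups, one missing age in each). It uses `transform("median")` to compute the median of each (Sex, Pclass) group, aligned row by row with the original frame, and then fills only the gaps.

```python
import numpy as np
import pandas as pd

# Made-up passengers: one missing age in each (Sex, Pclass) group
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age of each group, repeated on every row of that group
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Keep known ages, fill only the missing ones
toy["Age_enc"] = toy.Age.fillna(group_median)
print(toy)
```

The missing first-class male age becomes 45.0 (the median of 40 and 50) and the missing third-class female age becomes 25.0 (the median of 20 and 30).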
Age vs Survival Column¶
Let us do Cross tab analysis of Age variable with Survival column
age_crstb = pd.crosstab(df.Age_enc,df.Survived)
age_crstb.plot(kind='bar' , stacked=True)
Looks Like we have to bin the data
#bins= (df.Age_enc.max()/len(df.Age_enc))
print(df.Age_enc.max())
# the number of bins must be an integer, so use integer division
age_crstb.hist(bins=int(df.Age_enc.max() // 10), range=(1, df.Age_enc.max()), stacked=True)
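Another way to bin, shown here only as a sketch, is to cut the ages into bands with `pd.cut` first and then cross-tabulate the bands against survival. The ages, survival flags and bin edges below are all made up for illustration.

```python
import pandas as pd

# Made-up ages and survival flags
toy = pd.DataFrame({
    "Age_enc":  [4, 9, 15, 22, 28, 35, 41, 58, 63, 71],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Hypothetical band edges; each band includes its right edge, e.g. (0, 10]
bin_edges = [0, 10, 20, 40, 60, 80]
toy["AgeBand"] = pd.cut(toy.Age_enc, bins=bin_edges)

# Counts per age band, then the share of each band that died or survived
band_counts = pd.crosstab(toy.AgeBand, toy.Survived)
band_pct = band_counts.div(band_counts.sum(axis=1), axis=0)
print(band_pct)
```

On the real data, `band_pct.plot(kind='bar', stacked=True)` would give a bar per age band, in the same style as the class and gender plots earlier.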
The plot above does not give any clear picture. Let us create a density distribution by class
# sort the classes so the curves are drawn in the same order as the legend labels
p_classes = np.sort(df.Pclass.unique())
for cl in p_classes :
    df.Age_enc[df.Pclass == cl].plot(kind='kde')
plt.legend(('1st Class', '2nd Class', '3rd Class'), loc='best')
It looks like first-class passengers were generally older than second-class passengers, who in turn were generally older than third-class passengers.
Creating new Features¶
Data science generally also involves creating new features by combining multiple existing features. Let us find out how we can do this in this context.
Let us combine the 'Parch' and 'SibSp' columns and create a new family size column, 'FamilySz'.
df['FamilySz'] = df.Parch + df.SibSp
df.head(5)
df.FamilySz.value_counts().plot(kind='bar')
Plot Family Size vs Survival¶
fmly_crstb = pd.crosstab(df.FamilySz , df.Survived)
fmly_crstb.plot(kind = 'bar' , stacked = True)
fmly_crstb.div(fmly_crstb.sum(axis=1),axis=0)
These do not give a clear picture. Let us try running a machine learning algorithm on this data.
Apply Machine Learning on data¶
Dropping unnecessary columns¶
We are not going to use 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age', 'SibSp', 'Parch', 'PassengerId'. So let us drop these columns.
Let us put all of our cleaning work in a function so that we can use it with the test data as well.
def clean_data(df):
    #Encoding Gender Column
    gend_mapping = dict(zip(np.sort(df.Sex.unique()), range(len(df.Sex.unique()))))
    df['gend_map'] = df.Sex.map(gend_mapping).astype(int)
    #Encode Embarked column
    df['Embarked'] = df.Embarked.fillna('S')
    embark_uniq = np.sort(df.Embarked.astype(str).unique())
    embark_enc = dict(zip(embark_uniq, range(len(embark_uniq))))
    df['Embarked_map'] = df.Embarked.map(embark_enc)
    # Fill in missing values of Fare (and only Fare) with the average Fare
    df['Fare'] = df.Fare.fillna(df.Fare.mean())
    #Encode Age: fill gaps with the median age of the same gender and class
    df['Age_enc'] = df.Age
    df['Age_enc'] = df.Age_enc.groupby([df.gend_map, df.Pclass]).transform(lambda x: x.fillna(x.median()))
    #Create Family size column
    df['FamilySz'] = df.Parch + df.SibSp
    # Drop the columns we won't use:
    df = df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age', 'SibSp', 'Parch', 'PassengerId'], axis=1)
    return df
train = clean_data(df)
train.head(5)
Create a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
Training the classifier
y = train.Survived
train = train.drop(['Survived'] , axis=1)
clf.fit(train , y)
clf.score(train,y)
test = clean_data(test)
test.head()
Trying a hold-out validation split with sklearn
from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
# Split 80-20 train vs test data
train_x, test_x, train_y, test_y = train_test_split(train,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=0)
print(train.shape, y.shape)
print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)
Creating Classification report¶
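from sklearn.metrics import classification_report
# Refit on the 80% split only, so the held-out 20% is unseen by the model
clf.fit(train_x, train_y)
print(classification_report(test_y, clf.predict(test_x)))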