Data Science: Kaggle Titanic Problem

Introduction

When I started participating in Kaggle competitions, this was the first problem I worked on. I feel it is a very good problem for starting to learn Data Science, so I thought I would share how I approached it. Please feel free to share your comments. I used IPython to solve this problem and then converted the IPython notebook into an HTML file, so if the UI of this page looks broken, you can see the notebook here

The problem statement is very simple. Given example data about people who survived and people who died on the Titanic, we need to predict, for a new person's data, whether that person survived or not. You can read more about this problem statement here

Data

You can also get the data from the same location. There are two important files there:

  1. train.csv
  2. test.csv

train.csv is the file which contains the examples. We will analyze this data and create a model which will learn the patterns that distinguish people who survived from people who died. This exercise needs python, ipython, numpy, pandas, matplotlib and sklearn to be installed on your machine. I will shortly create a post which has details on installing all of these.
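In the meantime, here is a minimal install sketch (assuming you already have Python and pip; the package names below are the usual PyPI ones):

# Run inside an IPython/Jupyter cell; from a plain shell, drop the leading "!"
!pip install numpy pandas matplotlib scikit-learn ipython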

Before starting, let us first see what kind of columns we have in our data.

  survival    Survival (0 = No; 1 = Yes)
  pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  name        Name
  sex         Sex
  age         Age
  sibsp       Number of Siblings/Spouses Aboard
  parch       Number of Parents/Children Aboard
  ticket      Ticket Number
  fare        Passenger Fare
  cabin       Cabin
  embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Let us start the fun now. First we will import some libraries.

Some Imports

In [1]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt

%matplotlib inline

We have imported pandas, numpy, sklearn and matplotlib. pandas is a great library for loading data and doing EDA on it. sklearn will help us with different transformations and with creating a model. matplotlib helps us plot data for visualizations.

"%matplotlib inline" is a magic function in IPython. it helps us intialize matplotlib and display created plots in the notebook itself instead of a separate window.

Read Data

Let us now read the csv files

In [2]:
df = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')

Start Exploring

To get some information about the loaded data, run the following

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 73.1+ KB

To get basic statistics for the different variables, call the describe function.

In [4]:
df.describe()
Out[4]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Data Plotting as part of Exploration

Survived vs Died

In [5]:
df.Survived.value_counts().plot(kind='bar')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0xacb529ac>

No of Males vs Females

In [6]:
df.Sex.value_counts().plot(kind='bar')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0xacb78f4c>

No of people who boarded from different points

In [7]:
df.Embarked.value_counts().plot(kind='bar')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0xaca5a18c>

No of people in different classes

In [8]:
df.Pclass.value_counts().plot(kind='bar')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0xacacae8c>

Age distribution

In [9]:
df.Age.hist()
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0xaca08c6c>

Exploring Relationship between two variables

How many people survived in different classes

In [10]:
pclass_crosstab = pd.crosstab(df.Pclass,df.Survived)
pclass_crosstab
Out[10]:
Survived 0 1
Pclass
1 80 136
2 97 87
3 372 119

Let us explore it as percentages

In [11]:
pclass_pct = pclass_crosstab.div(pclass_crosstab.sum(1).astype(float) , axis=0)
pclass_pct
Out[11]:
Survived 0 1
Pclass
1 0.370370 0.629630
2 0.527174 0.472826
3 0.757637 0.242363

Let us visualize the crosstab table

In [12]:
pclass_pct.plot(kind='bar')
pclass_pct.plot(kind='bar' , stacked=True)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xac9570ac>

Exploring what part gender played in survival

First get the unique genders and then map the Sex column using gend_mapping: male will be converted to 1 and female to 0.

In [13]:
gend_mapping = dict(zip(np.sort(df.Sex.unique()), range(len(df.Sex.unique()))))
In [14]:
df['gend_map']= df.Sex.map(gend_mapping).astype(int)
df.head()
Out[14]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked gend_map
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C 0
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S 0
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S 1

Gender vs Survival Analysis

In [15]:
gend_crosstab = pd.crosstab(df.gend_map , df.Survived)
gend_pct = gend_crosstab.div(gend_crosstab.sum(1).astype(float) , axis=0)
In [16]:
gend_pct.plot(kind='bar' , stacked = True)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0xac9c54ec>

Analyze the combined effect of gender and class on survival

In [17]:
#count number of males and females in each class
uniq_pclass = df.Pclass.unique()
print(uniq_pclass)
print("Males in 1st class : ",len(df[(df.Sex == 'male') & (df.Pclass == 1)]))
print("Females in 1st class : ",len(df[(df.Sex == 'female') & (df.Pclass == 1)]))
print("Males in 2nd class : ",len(df[(df.Sex == 'male') & (df.Pclass == 2)]))
print("Females in 2nd class : ",len(df[(df.Sex == 'female') & (df.Pclass == 2)]))
print("Males in 3rd class : ",len(df[(df.Sex == 'male') & (df.Pclass == 3)]))
print("Females in 3rd class : ",len(df[(df.Sex == 'female') & (df.Pclass == 3)]))
#for p_class in uniq_pclass :
    
[3 1 2]
Males in 1st class :  122
Females in 1st class :  94
Males in 2nd class :  108
Females in 2nd class :  76
Males in 3rd class :  347
Females in 3rd class :  144

Get only Female data

In [18]:
#female Survival plot by class
female_df = df[df.Sex == 'female']

female_df_crtb = pd.crosstab(female_df.Pclass,female_df.Survived)
female_df_crtb = female_df_crtb.div(female_df_crtb.sum(axis=1).astype(float),axis=0)

Analyze how female survival rate was affected by Passenger class

In [19]:
female_df_crtb.plot(kind='bar', stacked=True)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0xac9fb3ec>

So a majority of the female passengers in first and second class were saved, while a comparatively smaller fraction of the females in 3rd class were saved.

Analyze how male survival rate was affected by Passenger class

Select only Male data and plot

In [20]:
male_df = df[df.Sex == 'male']
male_df_crtb=pd.crosstab(male_df.Pclass,male_df.Survived)
male_df_crtb=male_df_crtb.div(male_df_crtb.sum(axis=1).astype(float), axis=0)
In [21]:
male_df_crtb.plot(kind='bar' , stacked=True)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0xac8ca16c>

Handling Null values

Let us see how many null values the Embarked column has

In [22]:
df[df.Embarked.isnull()]
Out[22]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked gend_map
61 62 1 1 Icard, Miss. Amelie female 38 0 0 113572 80 B28 NaN 0
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0 113572 80 B28 NaN 0

Encode the categorical variable Embarked to numerical values instead of text values

In [23]:
#Encode Embarked column
embark_uniq = np.sort(df.Embarked.astype(str).unique())
embark_enc = dict(zip(embark_uniq,range(len(embark_uniq))))
embark_enc
Out[23]:
{'C': 0, 'Q': 1, 'S': 2, 'nan': 3}

Create new Encoded column Embarked_map

In [24]:
df['Embarked_map'] = df.Embarked.map(embark_enc)

From where did most of the people start their journey?

In [25]:
df.Embarked_map.hist(bins=len(embark_uniq),range=(0,3))
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0xac8a124c>

Replace null values with the most frequent occurrence

In [26]:
df.Embarked_map[df.Embarked_map.isnull()]
Out[26]:
61    NaN
829   NaN
Name: Embarked_map, dtype: float64
In [27]:
df.Embarked_map.fillna(2,inplace=True)
In [28]:
df.Embarked_map[df.Embarked_map.isnull()]
Out[28]:
Series([], Name: Embarked_map, dtype: float64)
In [29]:
df.Embarked[df.Embarked.isnull()]
Out[29]:
61     NaN
829    NaN
Name: Embarked, dtype: object
In [30]:
df.Embarked.fillna('S',inplace=True)
In [31]:
df.Embarked[df.Embarked.isnull()]
Out[31]:
Series([], Name: Embarked, dtype: object)
In [32]:
df.Embarked.unique()
Out[32]:
array(['S', 'C', 'Q'], dtype=object)

Analysing Age

Age is a numeric variable. It also has some missing values. There can be many strategies to fill missing values:

  1. Replace missing values with the most frequent value (we did this in the Embarked case).
  2. Replace missing values with the mean value.

Here we will replace the missing values in Age with an average value. One more interesting thing we can try is replacing a missing value using similar records: let us replace a missing age with the typical age of passengers with the same passenger class and gender.
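Before doing that, here is a minimal sketch of the simpler, global fill strategies for comparison (illustrative only, not part of the original notebook; it assumes the df loaded above):

# Global fills: every missing age gets the same value.
mean_filled = df.Age.fillna(df.Age.mean())      # overall mean age
median_filled = df.Age.fillna(df.Age.median())  # overall median, avoids fractional ages
print(mean_filled.isnull().sum(), median_filled.isnull().sum())  # both print 0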

Let us see the rows with a missing Age value, along with their Pclass and gender

In [33]:
age_null = df[df.Age.isnull()][ ['Pclass','Sex' , 'Age']]
age_null.shape
Out[33]:
(177, 3)

There are 177 records with null values for Age.

Let us create a new column 'Age_enc' which has no null values: each missing age is replaced with the median age (median rather than mean, so that we do not introduce fractional ages) of passengers with the same Pclass and gender.

In [34]:
df['Age_enc'] = df['Age']
df.Age_enc = df.Age_enc.groupby([df.gend_map , df.Pclass]).apply(lambda x: x.fillna(x.median()))
In [35]:
len(df[df.Age_enc.isnull()])
Out[35]:
0

Age vs Survival Column

Let us do a crosstab analysis of the Age variable against the Survived column

In [36]:
age_crstb = pd.crosstab(df.Age_enc,df.Survived)
age_crstb.plot(kind='bar' , stacked=True)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0xac878c8c>

Looks like we have to bin the data

In [37]:
#bins= (df.Age_enc.max()/len(df.Age_enc))
print(df.Age_enc.max())
age_crstb.hist(bins= (df.Age_enc.max()/10),range=(1, df.Age_enc.max()) , stacked = True)
80.0
Out[37]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0xac732bac>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xac66f12c>]], dtype=object)

The plot above does not give any clear picture. Let us create a density distribution by class

In [38]:
p_classes = df.Pclass.unique()

for cl in p_classes :
    df.Age_enc[df.Pclass == cl].plot(kind='kde')
    
plt.legend(('1st Class', '2nd Class', '3rd Class'), loc='best')
Out[38]:
<matplotlib.legend.Legend at 0xac63be6c>

It looks like first class passengers were generally older than second class passengers, who in turn were older than third class passengers.

Creating new Features

Data science generally also involves creating new features by combining multiple existing features. Let us see how we can do that here.

Let us combine the 'Parch' and 'SibSp' columns and create a new family size column, 'FamilySz'.

In [39]:
df['FamilySz'] = df.Parch + df.SibSp
df.head(5)
Out[39]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked gend_map Embarked_map Age_enc FamilySz
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S 1 2 22 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C 0 0 38 1
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S 0 2 26 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S 0 2 35 1
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S 1 2 35 0
In [40]:
df.FamilySz.value_counts().plot(kind='bar')
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0xa9ee3a8c>

Plot Family Size vs Survival

In [41]:
fmly_crstb = pd.crosstab(df.FamilySz , df.Survived)
fmly_crstb.plot(kind = 'bar' , stacked = True)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0xac5d18ac>
In [42]:
fmly_crstb.div(fmly_crstb.sum(axis=1),axis=0)
Out[42]:
Survived 0 1
FamilySz
0 0.696462 0.303538
1 0.447205 0.552795
2 0.421569 0.578431
3 0.275862 0.724138
4 0.800000 0.200000
5 0.863636 0.136364
6 0.666667 0.333333
7 1.000000 0.000000
10 1.000000 0.000000

This does not give any clear picture. Let us try running a machine learning algorithm on this data.

Apply Machine Learning on data

Dropping unnecessary columns

We are not going to use 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age', 'SibSp', 'Parch', 'PassengerId'. So let us drop these columns.

Let us put all of our cleaning work into a function, so that we can use it on the test data as well.

In [43]:
def clean_data(df):
    #Encoding Gender Column
    gend_mapping = dict(zip(np.sort(df.Sex.unique()), range(len(df.Sex.unique()))))
    df['gend_map']= df.Sex.map(gend_mapping).astype(int)
    
    #Encode Embarked column
    df.Embarked.fillna('S',inplace=True)
    embark_uniq = np.sort(df.Embarked.astype(str).unique())
    embark_enc = dict(zip(embark_uniq,range(len(embark_uniq))))
    df['Embarked_map'] = df.Embarked.map(embark_enc)
    
    # Fill in missing values of Fare with the average Fare
    if len(df[df.Fare.isnull()]) > 0:
        avg_fare = df.Fare.mean()
        df.Fare.fillna(avg_fare, inplace=True)

    #Encode Age
    df['Age_enc'] = df.Age
    df.Age_enc = df.Age_enc.groupby([df.gend_map , df.Pclass]).apply(lambda x: x.fillna(x.median()))
    
    #Create Family size column
    df['FamilySz'] = df.Parch + df.SibSp
                   
    # Drop the columns we won't use:
    df = df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked','Age', 'SibSp', 'Parch', 'PassengerId' ], axis=1)
    
    return df
    
In [44]:
train = clean_data(df)
train.head(5)
Out[44]:
Survived Pclass Fare gend_map Embarked_map Age_enc FamilySz
0 0 3 7.2500 1 2 22 1
1 1 1 71.2833 0 0 38 1
2 1 3 7.9250 0 2 26 0
3 1 1 53.1000 0 2 35 1
4 0 3 8.0500 1 2 35 0

Create a RandomForestClassifier

In [45]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)

Training the classifier

In [46]:
y = train.Survived
train = train.drop(['Survived'] , axis=1)
clf.fit(train , y)
Out[46]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [47]:
clf.score(train,y)
Out[47]:
0.98092031425364756
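Note that this score is computed on the same data the model was trained on, so it is an optimistic estimate. As an aside (a small illustrative sketch, not part of the original notebook), we can also inspect which features the fitted forest relies on most via its feature_importances_ attribute:

# Illustrative only: pair each training column with its importance in the fitted forest.
for name, importance in zip(train.columns, clf.feature_importances_):
    print(name, importance)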
In [48]:
test = clean_data(test)
test.head()
Out[48]:
Pclass Fare gend_map Embarked_map Age_enc FamilySz
0 3 7.8292 1 1 34.5 0
1 3 7.0000 0 2 47.0 1
2 2 9.6875 1 1 62.0 0
3 3 8.6625 1 2 27.0 0
4 3 12.2875 0 2 22.0 2

Trying Cross Validation with SKLearn

In [49]:
from sklearn import metrics
from sklearn.cross_validation import train_test_split

# Split 80-20 train vs test data
train_x, test_x, train_y, test_y = train_test_split(train, 
                                                    y, 
                                                    test_size=0.20, 
                                                    random_state=0)
print (train.shape, y.shape)
print (train_x.shape, train_y.shape)
print (test_x.shape, test_y.shape)
(891, 6) (891,)
(712, 6) (712,)
(179, 6) (179,)
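Note that clf was already fit on the full training data in In [46], so the scores we compute on test_x below are not a fully independent estimate. A minimal sketch of actual k-fold cross-validation with cross_val_score (which, in newer scikit-learn versions, lives in sklearn.model_selection rather than sklearn.cross_validation) would look like this:

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

# Illustrative sketch: 5-fold cross-validation on a fresh classifier,
# so each fold is scored by a model that never saw it during training.
cv_clf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(cv_clf, train, y, cv=5)
print(scores, scores.mean())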

Creating Classification report

In [50]:
from sklearn.metrics import classification_report
print(classification_report(test_y,clf.predict(test_x)))
             precision    recall  f1-score   support

          0       0.99      0.98      0.99       110
          1       0.97      0.99      0.98        69

avg / total       0.98      0.98      0.98       179
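Finally, to actually submit to Kaggle we would need predictions for the test passengers. A minimal sketch (not part of the original notebook; it assumes the cleaned test DataFrame from In [48] and re-reads test.csv for the PassengerId column, since clean_data drops it):

# Illustrative sketch: predict on the cleaned test data and write a Kaggle submission file.
test_ids = pd.read_csv('./input/test.csv').PassengerId  # IDs were dropped by clean_data
predictions = clf.predict(test)
submission = pd.DataFrame({'PassengerId': test_ids, 'Survived': predictions})
submission = submission[['PassengerId', 'Survived']]  # keep the column order Kaggle expects
submission.to_csv('titanic_submission.csv', index=False)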
