Austin Animal Shelter Analysis

Posted on Wed 21 December 2016 in Projects

Austin Animal Shelter

Animal welfare is something that is really close to my heart so I am really excited to go through this dataset for analysis. This data was posted on Kaggle but I did not discover the dataset until after the competition expired. The goal posted by Austin Animal Shelter for this data was to predict the outcome of each animal.

In [187]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

import category_encoders as ce


import xgboost as xgb
import seaborn as sns

%matplotlib inline

Load Data

In [188]:
df = pd.read_csv('train.csv')
orig_df = pd.read_csv('train.csv')
In [189]:
df.shape
Out[189]:
(26729, 10)
In [190]:
df.head()
Out[190]:
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome AgeuponOutcome Breed Color
0 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner NaN Dog Neutered Male 1 year Shetland Sheepdog Mix Brown/White
1 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering Cat Spayed Female 1 year Domestic Shorthair Mix Cream Tabby
2 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster Dog Neutered Male 2 years Pit Bull Mix Blue/White
3 A683430 NaN 2014-07-11 19:09:00 Transfer Partner Cat Intact Male 3 weeks Domestic Shorthair Mix Blue Cream
4 A667013 NaN 2013-11-15 12:52:00 Transfer Partner Dog Neutered Male 2 years Lhasa Apso/Miniature Poodle Tan
In [191]:
df.columns
Out[191]:
Index([u'AnimalID', u'Name', u'DateTime', u'OutcomeType', u'OutcomeSubtype',
       u'AnimalType', u'SexuponOutcome', u'AgeuponOutcome', u'Breed',
       u'Color'],
      dtype='object')

Convert age column to age in years

The age column contains strings and the unit of age is not consistent. So, we will convert the age column into age in years. There are some entries with no age value. We will first fill in those null values with -99, and then replace them with the average age.

In [192]:
df['AgeuponOutcome'].unique()
Out[192]:
array(['1 year', '2 years', '3 weeks', '1 month', '5 months', '4 years',
       '3 months', '2 weeks', '2 months', '10 months', '6 months',
       '5 years', '7 years', '3 years', '4 months', '12 years', '9 years',
       '6 years', '1 weeks', '11 years', '4 weeks', '7 months', '8 years',
       '11 months', '4 days', '9 months', '8 months', '15 years',
       '10 years', '1 week', '0 years', '14 years', '3 days', '6 days',
       '5 days', '5 weeks', '2 days', '16 years', '1 day', '13 years', nan,
       '17 years', '18 years', '19 years', '20 years'], dtype=object)
In [193]:
Ages = df['AgeuponOutcome'].astype(str)
y = [float(age.split()[0]) if 'year' in age
     else float(age.split()[0])/52. if 'week' in age
     else float(age.split()[0])/12. if 'month' in age
     else float(age.split()[0])/352. if 'week' in age
     else -99.0
     for age in Ages]
In [194]:
y[y == -99] = np.mean(y[y != -99])

Here, we replace the missing ages with the average age of the animals

In [195]:
df['AgeuponOutcome'] = pd.DataFrame({'AgeuponOutcome': y})
In [196]:
df.head()
Out[196]:
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome AgeuponOutcome Breed Color
0 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner NaN Dog Neutered Male 1.000000 Shetland Sheepdog Mix Brown/White
1 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering Cat Spayed Female 1.000000 Domestic Shorthair Mix Cream Tabby
2 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster Dog Neutered Male 2.000000 Pit Bull Mix Blue/White
3 A683430 NaN 2014-07-11 19:09:00 Transfer Partner Cat Intact Male 0.057692 Domestic Shorthair Mix Blue Cream
4 A667013 NaN 2013-11-15 12:52:00 Transfer Partner Dog Neutered Male 2.000000 Lhasa Apso/Miniature Poodle Tan

Basic Stats for Dogs vs Cats

Average age upon Outcome

In [197]:
df[df['AnimalType'] == 'Dog']['AgeuponOutcome'].mean()
Out[197]:
2.2319505758749227
In [198]:
df[df['AnimalType'] == 'Cat']['AgeuponOutcome'].mean()
Out[198]:
-1.6021397843519833

Number of dogs vs cats

In [199]:
len(df[df['AnimalType'] == 'Dog'])
Out[199]:
15595
In [200]:
len(df[df['AnimalType'] == 'Cat'])
Out[200]:
11134

Summary on basic stats: The average age for a dog (2.75 yrs) is greater than the average age of cats (1.36 yrs) upon the outcome. The shelter also has sheltered more dogs than cats over the course of 3 years.

Create columns for male vs female, fixed vs intact

In [201]:
df['SexuponOutcome'].unique()
Out[201]:
array(['Neutered Male', 'Spayed Female', 'Intact Male', 'Intact Female',
       'Unknown', nan], dtype=object)
In [202]:
df['SexuponOutcome'] = df['SexuponOutcome'].replace(np.nan,'Unknown')
In [203]:
sex = ['male' if 'Male' in df['SexuponOutcome'].iloc[i] 
       else 'female' if 'Female' in df['SexuponOutcome'].iloc[i] 
       else 'unknown'
       for i in range(len(df['SexuponOutcome']))]

spayed_neutered = ['fixed' if any(x in df['SexuponOutcome'].iloc[i] for x in ['Neutered','Spayed'])
                    else 'not_fixed' if 'Intact' in df['SexuponOutcome'].iloc[i]
                    else 'unknown_fixed'
                    for i in range(len(df['SexuponOutcome']))]
In [204]:
set(spayed_neutered)
Out[204]:
{'fixed', 'not_fixed', 'unknown_fixed'}
In [205]:
set(sex)
Out[205]:
{'female', 'male', 'unknown'}
In [206]:
df['Fixed'] = pd.DataFrame({'fixed':spayed_neutered})
df['Sex'] = pd.DataFrame({'Sex': sex})
df = df.drop('SexuponOutcome',axis = 1)

Simplify the Breed Feature

In [207]:
df.head()
Out[207]:
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType AgeuponOutcome Breed Color Fixed Sex
0 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner NaN Dog 1.000000 Shetland Sheepdog Mix Brown/White fixed male
1 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering Cat 1.000000 Domestic Shorthair Mix Cream Tabby fixed female
2 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster Dog 2.000000 Pit Bull Mix Blue/White fixed male
3 A683430 NaN 2014-07-11 19:09:00 Transfer Partner Cat 0.057692 Domestic Shorthair Mix Blue Cream not_fixed male
4 A667013 NaN 2013-11-15 12:52:00 Transfer Partner Dog 2.000000 Lhasa Apso/Miniature Poodle Tan fixed male
In [208]:
len(df['Breed'].unique())
Out[208]:
1380
Currently, there are 1380 different breed classifications among cats and dogs
In [209]:
df['Breed'].value_counts()[0:10]
Out[209]:
Domestic Shorthair Mix       8810
Pit Bull Mix                 1906
Chihuahua Shorthair Mix      1766
Labrador Retriever Mix       1363
Domestic Medium Hair Mix      839
German Shepherd Mix           575
Domestic Longhair Mix         520
Siamese Mix                   389
Australian Cattle Dog Mix     367
Dachshund Mix                 318
Name: Breed, dtype: int64
In [210]:
df[df['AnimalType'] == 'Cat']['Breed'].value_counts()[0:10]
Out[210]:
Domestic Shorthair Mix      8810
Domestic Medium Hair Mix     839
Domestic Longhair Mix        520
Siamese Mix                  389
Domestic Shorthair           143
Snowshoe Mix                  75
Maine Coon Mix                44
Manx Mix                      44
Domestic Medium Hair          42
Russian Blue Mix              33
Name: Breed, dtype: int64
In [211]:
len(df[df['AnimalType'] == 'Cat']['Breed'].unique())
Out[211]:
60
In [212]:
len(df[df['AnimalType'] == 'Dog']['Breed'].unique())
Out[212]:
1320

Domestic Shorthair Mix seems to be a classification for cats only

  • Cat breeds seem to be somewhat limited with 60 different breeds while dogs have 1320 different classifications. We will only simplify dog breeds.
In [213]:
df[df['AnimalType'] == 'Dog']['Breed'].value_counts()[0:20]
Out[213]:
Pit Bull Mix                 1906
Chihuahua Shorthair Mix      1766
Labrador Retriever Mix       1363
German Shepherd Mix           575
Australian Cattle Dog Mix     367
Dachshund Mix                 318
Boxer Mix                     245
Miniature Poodle Mix          233
Border Collie Mix             229
Australian Shepherd Mix       163
Rat Terrier Mix               157
Catahoula Mix                 157
Jack Russell Terrier Mix      146
Yorkshire Terrier Mix         143
Chihuahua Longhair Mix        142
Siberian Husky Mix            138
Miniature Schnauzer Mix       136
Beagle Mix                    124
Rottweiler Mix                113
American Bulldog Mix          109
Name: Breed, dtype: int64
Let's remove the "Mix" from the dog breeds and also take only the first breed from classifications in the form "breed 1 / breed 2", assuming that the first breed listed is the dominant breed
In [214]:
import re
breeds = list(df[df['AnimalType'] == 'Dog']['Breed'])
dog_breeds = [re.sub(' Mix', '', dog) if 'Mix' in dog else dog.split("/")[0] for dog in breeds]
# for i in range(len(dog_breeds)):
#     if ('Mix' in dog_breeds[i]):
#         dog_breeds[i] = re.sub(' Mix', '', dog_breeds[i])
#     else:
#         dog_breeds[i] = dog_breeds[i].split("/")[0]
    
            
In [215]:
len(set(dog_breeds))
Out[215]:
188
In [216]:
dog_breeds_df = pd.DataFrame({'Breed':dog_breeds})
We were able to narrow down the different classes of dog breeds from 1320 to 188
In [217]:
dog_breeds_df['Breed'].value_counts()[0:10]
Out[217]:
Chihuahua Shorthair      2145
Pit Bull                 2113
Labrador Retriever       1915
German Shepherd           826
Australian Cattle Dog     511
Dachshund                 510
Boxer                     360
Border Collie             334
Miniature Poodle          310
Australian Shepherd       229
Name: Breed, dtype: int64
Now let's replace all the dog breed classifications from the original dataframe with the new breed classifications
In [218]:
df.ix[df.AnimalType == 'Dog', 'Breed'] = dog_breeds
In [219]:
len(df['Breed'].unique())
Out[219]:
248
In [220]:
x = df['Breed'].value_counts().index.tolist()
y = df['Breed'].value_counts().tolist()
In [221]:
common_breed = df['Breed'].value_counts().index.tolist()[0:10]
pd.crosstab(df[df['Breed'].isin(common_breed)]['Breed'],df['OutcomeType'])
Out[221]:
OutcomeType Adoption Died Euthanasia Return_to_owner Transfer
Breed
Australian Cattle Dog 243 1 20 126 121
Chihuahua Shorthair 966 13 90 478 598
Dachshund 243 0 15 95 157
Domestic Longhair Mix 199 12 56 48 205
Domestic Medium Hair Mix 357 11 62 26 383
Domestic Shorthair Mix 3273 112 535 352 4538
German Shepherd 372 3 37 237 177
Labrador Retriever 857 4 82 498 474
Pit Bull 625 8 287 678 515
Siamese Mix 173 5 30 27 154

Simplify Dog Color Column

The color column has 366 different colors. We will simplify this column similar to the breed column, taking the first color as the dominant color.

In [222]:
len(df['Color'].unique())
Out[222]:
366
In [223]:
len(df[df['AnimalType'] == 'Dog']['Color'].unique())
Out[223]:
262
In [224]:
df[df['AnimalType'] == 'Dog']['Color'].value_counts()[0:15]
Out[224]:
Black/White            1730
Brown/White             882
Black                   851
White                   806
Tan/White               773
Tricolor                751
Black/Tan               672
Brown                   637
Tan                     627
White/Brown             562
White/Black             475
Brown Brindle/White     450
Black/Brown             435
Blue/White              414
White/Tan               389
Name: Color, dtype: int64
In [225]:
df[df['AnimalType'] == 'Cat']['Color'].value_counts()[0:15]
Out[225]:
Brown Tabby           1635
Black                 1441
Black/White           1094
Brown Tabby/White      940
Orange Tabby           841
Tortie                 530
Calico                 517
Orange Tabby/White     455
Blue Tabby             433
Blue                   385
Torbie                 335
Blue/White             288
Blue Tabby/White       241
Cream Tabby            198
White/Black            168
Name: Color, dtype: int64
In [226]:
# Take the first color
colors = df['Color']
color = [c.split("/")[0] for c in df['Color']]
In [227]:
len(pd.DataFrame({'Color':color})['Color'].unique())
Out[227]:
57
In [228]:
pd.DataFrame({'Color':color})['Color'].value_counts()[0:15]
Out[228]:
Black            6422
White            3344
Brown Tabby      2592
Brown            1951
Tan              1674
Orange Tabby     1299
Blue             1199
Tricolor          800
Red               779
Brown Brindle     699
Blue Tabby        678
Tortie            580
Calico            552
Chocolate         448
Torbie            398
Name: Color, dtype: int64

Taking the first listed color as the dominant color, we reduced the number of possibilites for color definition from 366 to 57

In [229]:
df['Color'] = color
In [230]:
df.head()
Out[230]:
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType AgeuponOutcome Breed Color Fixed Sex
0 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner NaN Dog 1.000000 Shetland Sheepdog Brown fixed male
1 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering Cat 1.000000 Domestic Shorthair Mix Cream Tabby fixed female
2 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster Dog 2.000000 Pit Bull Blue fixed male
3 A683430 NaN 2014-07-11 19:09:00 Transfer Partner Cat 0.057692 Domestic Shorthair Mix Blue Cream not_fixed male
4 A667013 NaN 2013-11-15 12:52:00 Transfer Partner Dog 2.000000 Lhasa Apso Tan fixed male

Simplify the date and time column

In [231]:
df['DateTime'].head()
Out[231]:
0    2014-02-12 18:22:00
1    2013-10-13 12:44:00
2    2015-01-31 12:28:00
3    2014-07-11 19:09:00
4    2013-11-15 12:52:00
Name: DateTime, dtype: object
In [232]:
from datetime import datetime
d = df['DateTime'][0].split()[0]
dates = [df['DateTime'][i].split()[0] for i in range(len(df['DateTime']))]
date = [datetime.strptime(d,'%Y-%m-%d') for d in dates]
In [233]:
# season = ['Winter' if d.month % 11 <= 2
#          else 'Spring' if (d.month % 11 >= 3) & (d.month % 11 <= 5)
#          else 'Summer' if (d.month % 11 >= 6) & (d.month % 11 <= 8)
#          else 'Fall'
#          for d in date]
In [234]:
week = [1 if d.day <= 7
       else 2 if (d.day > 7) & (d.day <= 14)
       else 3 if (d.day > 14) & (d.day <= 21)
       else 4
       for d in date]
In [235]:
# df['Season'] = season
df['Week'] = week
In [236]:
df.head()
Out[236]:
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType AgeuponOutcome Breed Color Fixed Sex Week
0 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner NaN Dog 1.000000 Shetland Sheepdog Brown fixed male 2
1 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering Cat 1.000000 Domestic Shorthair Mix Cream Tabby fixed female 2
2 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster Dog 2.000000 Pit Bull Blue fixed male 4
3 A683430 NaN 2014-07-11 19:09:00 Transfer Partner Cat 0.057692 Domestic Shorthair Mix Blue Cream not_fixed male 2
4 A667013 NaN 2013-11-15 12:52:00 Transfer Partner Dog 2.000000 Lhasa Apso Tan fixed male 3

Create Indicator Column whether the animal was given a name or not

In [237]:
df['Name'] = df['Name'].isnull() * 1.
df = df.rename(columns = {'Name':'NoName'})

Create column for indicator of Animal Type

In [238]:
df['AnimalType'].unique()
Out[238]:
array(['Dog', 'Cat'], dtype=object)
In [239]:
df['AnimalType'] = (df['AnimalType'] == 'Dog') * 1.

Create Dummy Variables for Columns with Categorical Variables

In [240]:
categorical_var = ['Color', 'Breed','Fixed', 'Sex']
binary = ce.binary.BinaryEncoder(cols = categorical_var)
binary.fit(df)
dat = binary.transform(df)

# dat = pd.concat([df.drop(categorical_var, axis = 1), pd.get_dummies(df[categorical_var])], axis = 1)
In [241]:
pd.set_option('display.max_columns', 100)
dat.head()
Out[241]:
Color_0 Color_1 Color_2 Color_3 Color_4 Color_5 Breed_0 Breed_1 Breed_2 Breed_3 Breed_4 Breed_5 Breed_6 Breed_7 Fixed_0 Fixed_1 Sex_0 Sex_1 AnimalID NoName DateTime OutcomeType OutcomeSubtype AnimalType AgeuponOutcome Week
0 1 0 1 1 0 1 1 1 0 1 0 0 0 0 1 0 0 0 A671945 0.0 2014-02-12 18:22:00 Return_to_owner NaN 1.0 1.000000 2
1 0 0 1 1 0 1 0 1 0 0 1 1 1 1 1 0 0 1 A656520 0.0 2013-10-13 12:44:00 Euthanasia Suffering 0.0 1.000000 2
2 0 1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 0 0 A686464 0.0 2015-01-31 12:28:00 Adoption Foster 1.0 2.000000 4
3 1 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 0 A683430 1.0 2014-07-11 19:09:00 Transfer Partner 0.0 0.057692 2
4 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 A667013 1.0 2013-11-15 12:52:00 Transfer Partner 1.0 2.000000 3

Check dimensions of data frame to check dummy function

Create Labels

In [242]:
lab = dat['OutcomeType']
In [243]:
lab.value_counts()
Out[243]:
Adoption           10769
Transfer            9422
Return_to_owner     4786
Euthanasia          1555
Died                 197
Name: OutcomeType, dtype: int64
In [244]:
le = LabelEncoder()
labels = le.fit_transform(dat['OutcomeType'])

Drop columns not needed for models

In [245]:
dat.drop(['AnimalID','DateTime','OutcomeType','OutcomeSubtype'],axis = 1).dtypes.unique()
Out[245]:
array([dtype('int64'), dtype('float64')], dtype=object)
In [246]:
dat = dat.drop(['AnimalID','DateTime','OutcomeType','OutcomeSubtype', 'AnimalType'],axis = 1)

Create Training and Testing Sets

In [247]:
X_train, X_test, Y_train, Y_test = train_test_split(dat, labels, test_size = 0.25)
In [248]:
df.OutcomeType.value_counts()/float(df.shape[0])
Out[248]:
Adoption           0.402896
Transfer           0.352501
Return_to_owner    0.179056
Euthanasia         0.058177
Died               0.007370
Name: OutcomeType, dtype: float64
In [249]:
np.bincount(Y_train)/float(len(Y_train))
Out[249]:
array([ 0.40177592,  0.00768233,  0.05706874,  0.17863913,  0.35483388])
In [250]:
np.bincount(Y_test)/float(len(Y_test))
Out[250]:
array([ 0.40625468,  0.00643424,  0.06149933,  0.18030824,  0.34550352])

We now need to normalize the training and testing set.

In [251]:
X_train = normalize(X_train, axis = 1)
X_test = normalize(X_test, axis = 1)

Logistic Regression Grid Search and Cross Validation

In [252]:
log_model = LogisticRegression(penalty = 'l1')
params = {'C':[0.25, 0.5, 0.75, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]}
gs_model = GridSearchCV(log_model, params)
gs_model.fit(X_train, Y_train)
Out[252]:
GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.25, 0.5, 0.75, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
In [253]:
gs_model.best_params_
Out[253]:
{'C': 4.5}
In [254]:
C_val = gs_model.best_params_['C']

best_log_model = LogisticRegression(C = C_val, penalty = 'l1')
best_log_model.fit(X_train, Y_train)
best_log_model.score(X_test, Y_test)
Out[254]:
0.6304055065090528
In [255]:
log_loss(Y_test, best_log_model.predict_proba(X_test))
Out[255]:
0.92958350834823378
In [256]:
cm = confusion_matrix(Y_test, best_log_model.predict(X_test))

plt.matshow(cm)
plt.colorbar()
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')

print cm

le.classes_
[[2317    0    0  301   97]
 [   4    0    0    5   34]
 [  62    0    0  109  240]
 [ 574    0    0  504  127]
 [ 704    0    0  213 1392]]
Out[256]:
array(['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'], dtype=object)

Gaussian Naive Bayes

In [257]:
nb_model = GaussianNB()
nb_model.fit(X_train, Y_train)
nb_model.score(X_test, Y_test)
Out[257]:
0.43872512344755349

Gradient Boosting

Grid Search for Gradient Boosting

In [258]:
# params = {'n_estimators': [200, 300, 450], 'learning_rate':[0.1, 0.5],
#           'max_depth':[5, 6, 7, 8], 'max_features':['sqrt', .40], 'min_samples_split':[2, 20, 30, 50, 100]}
params = {'n_estimators':[200, 300], 'max_features':['sqrt', 'log2'], 'max_depth':[5, 7]}
grid_search_gb = GridSearchCV(GradientBoostingClassifier(), params, verbose = 1)
grid_search_gb.fit(X_train, Y_train)
Fitting 3 folds for each of 8 candidates, totalling 24 fits
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  5.6min finished
Out[258]:
GridSearchCV(cv=None, error_score='raise',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [200, 300], 'max_features': ['sqrt', 'log2'], 'max_depth': [5, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
In [259]:
grid_search_gb.best_params_
Out[259]:
{'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 200}
In [260]:
num_estimators = grid_search_gb.best_params_['n_estimators']
depth = grid_search_gb.best_params_['max_depth']
features = grid_search_gb.best_params_['max_features']
In [261]:
best_gb_model = GradientBoostingClassifier(n_estimators = num_estimators, learning_rate = 0.1, max_depth = depth)
best_gb_model.fit(X_train, Y_train)
best_gb_model.score(X_test, Y_test)
Out[261]:
0.63055513990722734
In [262]:
log_loss(Y_test, best_gb_model.predict_proba(X_test))
Out[262]:
0.89120449184557038
In [263]:
cm = confusion_matrix(Y_test, best_gb_model.predict(X_test))

plt.matshow(cm)
plt.colorbar()
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')

print cm

le.classes_
[[2244    3   12  289  167]
 [  10    0    0    2   31]
 [  57    1   62  102  189]
 [ 557    2   21  517  108]
 [ 637    8   53  220 1391]]
Out[263]:
array(['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'], dtype=object)
In [264]:
feat_imp = pd.DataFrame({'features':dat.columns.values,'values':best_gb_model.feature_importances_})
In [265]:
top_feat = feat_imp.sort('values').iloc[-10:]
top_feat
/home/truong/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  if __name__ == '__main__':
Out[265]:
features values
2 Color_2 0.042355
1 Color_1 0.044922
0 Color_0 0.047059
8 Breed_2 0.048157
17 Sex_1 0.049916
5 Color_5 0.051596
18 NoName 0.059181
14 Fixed_0 0.079279
20 Week 0.082495
19 AgeuponOutcome 0.120870
In [266]:
plt.figure(figsize = (20,10))
plt.title('Top 10 Features')
plt.xticks(fontsize = 15)
sns.barplot(top_feat['features'], 100.0 * top_feat['values']/np.max(best_gb_model.feature_importances_))
Out[266]:

XgBoost

In [267]:
params = {'n_estimators':[200, 300], 'max_depth':[4, 5, 6], 'subsample':[1, 0.8]}
gs_xgb = GridSearchCV(xgb.XGBClassifier(), params, verbose = 1)
gs_xgb.fit(X_train, Y_train)
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  2.1min finished
Out[267]:
GridSearchCV(cv=None, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [200, 300], 'subsample': [1, 0.8], 'max_depth': [4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
In [268]:
gs_xgb.best_params_
Out[268]:
{'max_depth': 4, 'n_estimators': 200, 'subsample': 1}
In [269]:
depth = gs_xgb.best_params_['max_depth']
estimators = gs_xgb.best_params_['n_estimators']
sub_sample = gs_xgb.best_params_['subsample']
In [278]:
xgb_model = xgb.XGBClassifier(n_estimators = estimators, max_depth = depth, subsample = sub_sample,
                             objective = 'multi:softmax')
xgb_model.fit(X_train, Y_train, eval_metric = 'merror', eval_set = [(X_test, Y_test)],
              early_stopping_rounds = 150,verbose = False)
xgb_model.score(X_test, Y_test)
Out[278]:
0.64132874457578937
In [279]:
log_loss(Y_test, xgb_model.predict_proba(X_test))
Out[279]:
0.86109146072771603

Majority Vote

In [275]:
mv_model = VotingClassifier([('nb',nb_model), ('xgb',best_gb_model), ('log',best_log_model)], voting = 'soft')
mv_model.fit(X_train, Y_train)
mv_model.score(X_test, Y_test)
Out[275]:
0.59120155618734105
In [276]:
log_loss(Y_test, mv_model.predict_proba(X_test))
Out[276]:
0.99900816780210555

ZeroR Baseline

In [274]:
lab.value_counts()[0]/float(len(labels))
Out[274]:
0.40289573122825395

Using the zero rule baseline, we would get a 40% accuracy by "guessing" all animals will be adopted. After testing numerous models, the best result is 64% prediction accuracy.

In [ ]: