Improved Austin Animal Shelter Analysis with Data Visualizations
Posted on Sun 05 March 2017 in Projects
One of the first notebooks I uploaded was my analysis of the animal outcome dataset for the Austin Animal Shelter. Recently, I ran across their open dataset on Austin's open government data portal: https://data.austintexas.gov/. The dataset on this site includes both intake and outcome data, while the dataset hosted on Kaggle contained only the outcome data. I was very interested in how much the intake data would help improve results, so we will run a similar analysis, but this time integrating both the intake and outcome data and adding some data visualizations. Both datasets contain records from 2013 to 2017.
from sqlalchemy import create_engine
import psycopg2
import pandas as pd
from datetime import datetime
import numpy as np
import re
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
# import plotly.plotly
# plotly.offline.init_notebook_mode()
import plotly.plotly as py
import plotly.graph_objs as go
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_columns', 100)
import plotly
plotly.tools.set_credentials_file(username='****', api_key='*****')
intake = pd.read_csv('intakes.csv')
outcome = pd.read_csv('outcomes.csv')
intake.columns
outcome.columns
intake = intake[intake['Animal Type'].isin(['Dog', 'Cat'])]
outcome = outcome[outcome['Animal Type'].isin(['Dog', 'Cat'])]
intake = intake.rename(columns = {'Animal ID': 'ID', 'Found Location': 'Found_Location', 'Intake Type': 'Intake_Type',
'Intake Condition': 'Intake_Condition', 'Animal Type': 'Animal_Type',
'Sex upon Intake':'Sex_Intake','Age upon Intake': 'Age_Intake'})
outcome = outcome.rename(columns = {'Animal ID': 'ID', 'Date of Birth': 'Birth_Date', 'Outcome Type': 'Outcome',
'Outcome Subtype': 'Outcome_Subtype','Animal Type': 'Animal_Type',
'Sex upon Outcome': 'Sex_Outcome', 'Age upon Outcome': 'Outcome_Age'})
Preview Data
Taking a look at the intake dataset, we can see that there are several features that should be quite valuable to this analysis, such as the condition of the animal when it first enters the shelter (Intake_Condition), the age of the animal on intake (Age_Intake), the way the animal was found: stray, surrendered by owner, etc. (Intake_Type), and the month and day when the animal entered the shelter. I think it's safe to assume that all of these features have a strong influence on the outcome of an animal.
intake.head()
The outcome table is very similar, with no additional features compared to the outcome dataset hosted on Kaggle.
outcome.head()
We will now merge the intake and outcome datasets and start cleaning and preparing the data for some exploratory and predictive analysis.
df = pd.merge(intake, outcome, on = 'ID', how = 'inner', suffixes = ('_intake', '_outcome'))
df[df.duplicated()].head()
df = df.drop_duplicates('ID')
df.shape
df.head()
df.columns
df = df.drop(['Name_intake', 'Animal_Type_intake', 'Breed_intake', 'Color_intake'], axis = 1).reset_index(drop = True)
df.head(10)
Determine duration in shelter
Let's parse the datetime columns in our dataset into Python datetime objects so we can calculate the duration an animal spends in the shelter from intake to outcome.
intake_dates = [df.DateTime_intake[i].split()[0] for i in range(len(df.DateTime_intake))]
intake_date = [datetime.strptime(d,'%m/%d/%Y') for d in intake_dates]
outcome_dates = [df.DateTime_outcome[i].split()[0] for i in range(len(df.DateTime_outcome))]
outcome_date = [datetime.strptime(d,'%m/%d/%Y') for d in outcome_dates]
days_in = [abs(intake_date[i] - outcome_date[i]).days for i in range(len(intake_date))]
df['Days_In'] = days_in
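As an aside, pandas can do the same parsing and subtraction in vectorized form. A minimal equivalent sketch, assuming the same df as above:
# Vectorized alternative: parse both datetime columns with pandas
# and take the absolute difference in days directly
intake_dt = pd.to_datetime(df.DateTime_intake)
outcome_dt = pd.to_datetime(df.DateTime_outcome)
df['Days_In'] = (outcome_dt - intake_dt).dt.days.abs()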
Convert age
The age column contains mixed units (years, months, weeks, days), so we will convert each age to years.
def convert_age(ages):
    Ages = ages.astype(str)
    # Convert each age string to years; ages are given in years, months, weeks, or days
    y = np.asarray([float(age.split()[0]) if 'year' in age
                    else float(age.split()[0])/52. if 'week' in age
                    else float(age.split()[0])/12. if 'month' in age
                    else float(age.split()[0])/365. if 'day' in age
                    else -99.0
                    for age in Ages])
    # Impute missing/unparsed ages with the mean of the known ages
    y[y == -99] = np.mean(y[y != -99])
    return y
df['Age_Intake'] = convert_age(df.Age_Intake)
df['Outcome_Age'] = convert_age(df.Outcome_Age)
df.head()
Convert Sex and Fixed Status
The sex and fixed status of an animal are combined in one column (Sex_Intake), so let's break this column up into a gender column and a fixed-status column.
df['Sex_Intake'].unique()
df['Sex_Intake'] = df['Sex_Intake'].replace(np.nan,'Unknown')
df['Sex_Intake'].unique()
def convert_sex(sex):
    n = len(sex)
    gender = ['male' if 'Male' in str(sex[i])
              else 'female' if 'Female' in str(sex[i])
              else 'unknown'
              for i in range(n)]
    spayed_neutered = ['fixed' if any(x in str(sex[i]) for x in ['Neutered', 'Spayed'])
                       else 'not_fixed' if 'Intact' in str(sex[i])
                       else 'unknown_fixed'
                       for i in range(n)]
    return gender, spayed_neutered
df['Sex'], df['Fixed'] = convert_sex(df['Sex_Outcome'])
df.head()
Simplify breed and color features
There are 1747 different classifications of dog breeds in our set. Some classifications are redundant, so let's simplify the column. We will remove the word 'Mix' from the breed and take the dominant breed. There are only 72 different classifications of breeds for cats, so we will leave the cats alone.
len(df[df.Animal_Type_outcome == 'Dog' ].Breed_outcome.unique())
len(df[df.Animal_Type_outcome == 'Cat' ].Breed_outcome.unique())
breeds = list(df[df['Animal_Type_outcome'] == 'Dog']['Breed_outcome'])
dog_breeds = [re.sub(' Mix', '', dog) if 'Mix' in dog else dog.split("/")[0] for dog in breeds]
df.loc[df.Animal_Type_outcome == 'Dog', 'Breed_outcome'] = dog_breeds
df.head()
Removing the 'Mix' classification narrows down the set of different dog breeds to 269, a huge improvement compared to 1747.
len(df.Breed_outcome.unique())
There are 319 different color classifications for dogs and 199 for cats. For both, we will take the first color listed as the dominant color and replace each classification with the dominant color for each animal. Doing this, we reduce the number of unique classifications in total to only 57.
# number of distinct colors for dogs
len(df[df.Animal_Type_outcome == 'Dog' ].Color_outcome.unique())
# number of distinct colors for cats
len(df[df.Animal_Type_outcome == 'Cat' ].Color_outcome.unique())
# Take the first color
def get_color(animal_color):
color = [c.split("/")[0] for c in animal_color]
return color
df['Color_outcome'] = get_color(df['Color_outcome'])
len(df.Color_outcome.unique())
df.head()
Now, let's create a feature column that indicates the month the animal enters the animal shelter.
dates = [df['DateTime_intake'][i].split()[0] for i in range(len(df['DateTime_intake']))]
date = [datetime.strptime(d,'%m/%d/%Y') for d in dates]
month = ['January' if d.month == 1
else 'February' if d.month == 2
else 'March' if d.month == 3
else 'April' if d.month == 4
else 'May' if d.month == 5
else 'June' if d.month == 6
else 'July' if d.month == 7
else 'August' if d.month == 8
else 'September' if d.month == 9
else 'October' if d.month == 10
else 'November' if d.month == 11
else 'December'
for d in date]
df['Month_Intake'] = month
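Incidentally, the month names can also be read straight off the parsed dates; a one-line equivalent of the chain above:
# strftime('%B') returns the full month name for each parsed date
df['Month_Intake'] = [d.strftime('%B') for d in date]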
df.head()
df = df[df.Outcome.notnull()]
Name vs No Name
We'll construct a column that indicates whether the animal was given a name while it was in the shelter.
df['Given_Name'] = df.Name_outcome.notnull() * 1.0
df.head()
Plots
Here, we will construct some data visualizations for our data. Let's take a look at some statistics for the 10 breeds that most frequently come into the shelter.
f = {'Breed_outcome':'count', 'Days_In': 'mean'}
days_avg = df.groupby('Breed_outcome').agg(f)
df2 = days_avg.sort_values(['Breed_outcome','Days_In'])[-10:]
df2 = df2.rename(columns = {'Breed_outcome': 'Count'})
df2
ann_text = ['Count = ' + list(df2['Count'].astype(str))[i] for i in range(len(df2))]
Here, we take a look at the average amount of time each breed spends in the shelter from 2013 to 2017. As you hover over the chart, it shows the average days in shelter and the count of how many times that breed has come into the shelter over the full period.
trace1 = go.Bar(
y = df2['Days_In'],
x = df2.index,
name = 'Average Days',
text = ann_text
)
layout = go.Layout(
title = 'Average Days in Shelter by Breed (2013 - 2017)',
yaxis = dict(title = 'Days')
)
data = [trace1]
fig = go.Figure(data = data, layout = layout)
# plotly.offline.iplot(fig)
py.iplot(fig)
Summary:
- Pitbulls spend the longest time in this shelter on average, about 44 days. They are also the second most common breed that this shelter receives.
- Dachshunds spend the least time in this shelter on average, about 11 days. They are the ninth most common breed that this shelter receives.
- The other breeds, including cats, spend roughly similar amounts of time in this shelter, about 20 days.
summary_outcome = pd.crosstab(df.Breed_outcome, df.Outcome).apply(lambda r: r/r.sum() * 100., axis = 1)
summary_outcome.head()
Now let's take a look at outcome rates by breed. The worst possible outcome is euthanasia, and the best outcomes are adoption and return to owner. We can see that the combination of the return-to-owner and adoption rates dominates the other outcome rates, which is a good sign. However, the transfer rates for some of these breeds are quite high, such as the domestic short hair, a cat breed.
cols = summary_outcome.columns
summary = summary_outcome[summary_outcome.index.isin(list(df2.index))][-10:]
data = []
for col in cols:
trace = go.Bar(
y = summary[col],
x = summary.index,
name = col
)
data.append(trace)
layout = go.Layout(
title = 'Outcome by Breed (2013 - 2017)',
yaxis = dict(title = 'Percentage'),
# xaxis = dict(title = 'Breed')
)
fig = go.Figure(data = data, layout = layout)
# plotly.offline.iplot(fig)
py.iplot(fig)
Summary:
- Among dogs, the majority of outcomes for every breed in the top 10 result in adoption or return to owner, which is a very good sign.
- Among cats (domestic long/medium/short hair, and Siamese mix), the transfer rate is high, which is not very indicative of their final outcome. In fact, for domestic short/medium hair, the transfer rates are as high as the adoption rates. There is also a very low return-to-owner rate for cats, about 7%. However, this is intuitive given the high number of stray cats.
year = [date[i].year for i in range(len(date))]
df3 = pd.concat([df, pd.DataFrame({'year': year})], axis = 1)
df3.head()
summary_by_year = pd.crosstab([df3.Breed_outcome, df3.Outcome], df3.year)
summary_by_year.head()
df4 = summary_by_year[summary_by_year.index.get_level_values(0).isin(df2.index)]
breeds = df4.index.get_level_values(0).unique().tolist()
show = np.array(range(1, 65))
button = []
data = []
count = 0
for breed in breeds:
    # Determine visibility for drop-down menu
outcomes = df4.loc[breed].index.get_level_values(0).unique().tolist()
s = ((show <= (len(outcomes) + count)) & (show > count)).tolist()
count += len(outcomes)
button.append(dict(args = ['visible', s], label = breed, method = 'restyle'))
for outcome in outcomes:
xlab = df4.loc[breed].columns.tolist()
totals = df4.groupby(level = 0).sum().loc[breed]
ylab = df4.loc[breed].loc[outcome]/totals * 100
#Use to define first breed visibility in plot
if breed == breeds[0]:
v = True
else:
v = False
trace = go.Scatter(
x = xlab,
y = ylab,
name = outcome,
visible = v
)
data.append(trace)
layout = go.Layout(
title = 'Outcome Percentages by Breed',
xaxis = dict(title = 'Year'),
yaxis = dict(title = 'Percentage %'),
updatemenus = list([
dict(
x = -0.05,
y = 1,
yanchor = 'top',
buttons = button
)
])
)
fig = go.Figure(data = data, layout = layout)
Now, let's take a look at how the outcome percentages for each breed change over the years between 2013 and 2017. Below is a plot that shows the outcome rates by breed. On the left, we can change the breed with the drop-down menu.
# plotly.offline.iplot(fig)
py.iplot(fig)
Quick Summary:
- The rates that stand out the most for dog breeds are the positive-outcome rates, adoption and return to owner, for Pitbulls and Labrador Retrievers. They have increased from year to year, and both breeds' transfer rates have decreased. Additionally, both breeds' euthanasia rates decreased tremendously from year to year.
- Domestic longhair mix and domestic medium hair mix adoption rates have increased as well.
- The other breeds' adoption and return-to-owner rates fluctuate from year to year. However, the breeds with decreasing adoption rates have increasing return-to-owner rates, or vice versa.
Remark: Keep in mind that this data only runs through March 2017, so the 2017 rates come from a small sample of what the full year will turn out to be.
Predictive Analysis
Now, we will use machine learning to see how well we can predict an animal's outcome using the features listed below, and also see how much integrating the intake data improves our predictions over using the outcome data alone.
Features we will use:
- Intake Type
- Intake Condition
- Age Intake
- Given Name
- Animal Type
- Breed Outcome
- Color Outcome
- Sex
- Fixed
- Month Intake
- Days In
Labels
- Outcome
features = ['Intake_Type', 'Intake_Condition', 'Age_Intake', 'Given_Name', 'Animal_Type_outcome', 'Breed_outcome',
'Color_outcome', 'Sex', 'Fixed', 'Month_Intake','Days_In', 'Outcome']
df1 = df[features]
df1.head()
Convert categorical variables
To deal with our categorical variables, I will use binary encoding. Typically, one would use one-hot encoding, but recent work has shown that binary encoding not only performs better under many circumstances, it also uses less memory than one-hot encoding, which will help when running our models. A quick illustration follows the lists below.
Features that need to be converted to binary
- Intake Type
- Intake Condition
- Animal Type
- Breed Outcome
- Color Outcome
- Sex
- Fixed
- Month Intake
Features that do not need to be converted to binary
- Age Intake
- Days In
- Given Name
Label Encode:
- Outcome
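To see why binary encoding is more compact, here is a minimal sketch with a made-up feature of 8 levels: one-hot encoding needs 8 columns, while binary encoding needs only ceil(log2(8)) = 3.
# Hypothetical feature with 8 distinct levels (for illustration only)
s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# One-hot encoding: one column per level -> 8 columns
print(pd.get_dummies(s).shape[1])
# Binary encoding: write each level's integer code in base 2 -> 3 columns
codes = s.astype('category').cat.codes.values
n_bits = int(np.ceil(np.log2(s.nunique())))
bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1
print(bits.shape[1])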
categorical_vars = ['Intake_Type', 'Intake_Condition', 'Animal_Type_outcome', 'Breed_outcome', 'Color_outcome',
'Sex','Fixed', 'Month_Intake']
binary = ce.binary.BinaryEncoder(cols = categorical_vars)
binary.fit(df1)
dat = binary.transform(df1)
labels = dat['Outcome']
labels.value_counts()
le = LabelEncoder()
lab = le.fit_transform(labels)
le.classes_
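# Drop the label column; note that Days_In is also excluded from the feature matrix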
dat = dat.drop(['Outcome', 'Days_In'], axis = 1)
This will be the final dataset that we will use for our machine learning models. We can see that all of the categorical variables have been binary encoded and that our outcome labels have been encoded with the following correspondence:
- 0: Adoption
- 1: Died
- 2: Disposal
- 3: Euthanasia
- 4: Missing
- 5: Return to owner
- 6: Transfer
dat.head()
lab[10:25]
Get training and testing sets
We will do a random split of our data to construct the training and testing sets. The training set will contain 70% of the original data and the test set will contain the remaining 30%.
X_train, X_test, Y_train, Y_test = train_test_split(dat, lab, test_size = 0.30)
np.bincount(Y_train)/float(len(Y_train))
# Normalize training and testing set
X_train = normalize(X_train, axis = 0)
X_test = normalize(X_test, axis = 0)
Gradient Boosting
First, we will do a grid search to find the parameters that give the best results. With 3 choices of n_estimators, 3 of max_depth, and 2 of max_features, there are 18 parameter combinations, and with 3-fold cross validation that means 54 fits in total. After running this model, we will compare the results with the model from my last analysis, which used just the outcome data. As a reminder, the results from the last analysis were a 64% prediction accuracy and a 0.86 log loss using XGBoost.
params = {'n_estimators':[250, 300, 350], 'max_depth': [3, 4, 5], 'max_features':['sqrt', 'log2']}
gb_gs = GridSearchCV(GradientBoostingClassifier(), params, verbose = 1)
gb_gs.fit(X_train, Y_train)
gb_gs.best_params_
# Get best parameters
depth = gb_gs.best_params_['max_depth']
feat = gb_gs.best_params_['max_features']
estimators = gb_gs.best_params_['n_estimators']
gb_model = GradientBoostingClassifier(n_estimators = estimators, max_depth = depth, max_features = feat)
gb_model.fit(X_train, Y_train)
print('Test Accuracy :', gb_model.score(X_test, Y_test) * 100)
log_loss(Y_test, gb_model.predict_proba(X_test))
cm = confusion_matrix(Y_test, gb_model.predict(X_test))
cm
list(le.classes_)
fig = plt.figure(figsize = (10,8))
sns.heatmap(cm, annot = True, fmt = 'd', annot_kws = {'size': 20}, cmap = 'coolwarm',
cbar = 0, xticklabels = list(le.classes_), yticklabels = list(le.classes_) )
plt.xticks(rotation = 35, fontsize = 12)
plt.yticks(rotation = 25, fontsize = 12)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix for Test Set Predictions', fontsize = 16)
plt.tight_layout(h_pad = 5.0, w_pad = 3.0)
Considering the confusion matrix, we can see that the model does well at predicting adoption, return to owner, and transfer outcomes. When the model does not accurately predict adoption, most of the predictions are return to owner or transfer, and the same holds for the converse. In a confusion matrix, we want the diagonal elements, which indicate correct predictions, to hold the highest values. Our confusion matrix exhibits this property for most of the outcomes, with the exception of the died, disposal, and missing outcomes. This is most likely due to the very low number of occurrences of each of these outcomes, so the model was not able to capture features that accurately determine them.
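One quick way to quantify this per class is to divide the diagonal of the confusion matrix by its row totals, giving per-class recall; a small sketch using the cm and le computed above:
# Per-class recall: correct predictions divided by the number of
# true instances of each outcome (row sums of the confusion matrix)
recall_per_class = cm.diagonal() / cm.sum(axis = 1).astype(float)
for name, r in zip(le.classes_, recall_per_class):
    print('{}: {:.2%}'.format(name, r))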
feat_imp = pd.DataFrame({'Features':dat.columns.values,'Values':gb_model.feature_importances_})
top_feat = feat_imp.sort_values(by = 'Values').iloc[-10:]
top_feat
Using Gradient Boosting, we get 69% accuracy, which is a 5% increase over my last analysis using just the outcome data. This is a big improvement, and we can see from the table above that among the top 10 features influencing the model, intake age is by far the most important feature in our set. Intake age was not in the outcome set, which is one indicator that the intake dataset contains valuable information that improves prediction accuracy. Another improvement over the last model is the log loss: 0.79 for this model compared to 0.86 for the last one. This tells us that we can be more confident in the prediction probabilities from this model.
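For a quick visual of the feature importance table above, a simple horizontal bar chart works; a sketch using the top_feat frame computed earlier:
# Horizontal bar chart of the top 10 feature importances
fig = plt.figure(figsize = (8, 6))
plt.barh(range(len(top_feat)), top_feat['Values'])
plt.yticks(range(len(top_feat)), top_feat['Features'])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances')
plt.tight_layout()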
Further Improvements
I will continue working on this dataset as time opens up. Some things I would like to work on are more data visualizations, ensemble methods to improve prediction accuracy, trying XGBoost with GPU support to improve computation time and accuracy, and looking at precision and recall results. Additionally, I would like to set up a dashboard using Plotly, which provides a point-and-click tool for easily assembling plots into dashboards via its online UI.