Improved Austin Animal Shelter Analysis with Data Visualizations
Posted on Sun 05 March 2017 in Projects
One of the first notebooks I uploaded was my analysis of the animal outcome dataset for the Austin Animal Shelter. Recently, I ran across their open dataset on Austin's open government data portal: https://data.austintexas.gov/. The dataset on this site includes both intake and outcome data, while the dataset hosted on Kaggle contained only the outcome data. I was very interested in how much the intake data would help improve results, so we will run a similar analysis, but this time integrating both the intake and outcome data and adding some data visualizations. Both datasets contain records from 2013 to 2017.
from sqlalchemy import create_engine
import psycopg2
import pandas as pd
from datetime import datetime
import numpy as np
import re
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
# import plotly.plotly
# plotly.offline.init_notebook_mode()
import plotly.plotly as py
import plotly.graph_objs as go
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_columns', 100)
import plotly
plotly.tools.set_credentials_file(username='****', api_key='*****')
intake = pd.read_csv('intakes.csv')
outcome = pd.read_csv('outcomes.csv')
intake.columns
outcome.columns
intake = intake[intake['Animal Type'].isin(['Dog', 'Cat'])]
outcome = outcome[outcome['Animal Type'].isin(['Dog', 'Cat'])]
intake = intake.rename(columns = {'Animal ID': 'ID', 'Found Location': 'Found_Location', 'Intake Type': 'Intake_Type',
'Intake Condition': 'Intake_Condition', 'Animal Type': 'Animal_Type',
'Sex upon Intake':'Sex_Intake','Age upon Intake': 'Age_Intake'})
outcome = outcome.rename(columns = {'Animal ID': 'ID', 'Date of Birth': 'Birth_Date', 'Outcome Type': 'Outcome',
'Outcome Subtype': 'Outcome_Subtype','Animal Type': 'Animal_Type',
'Sex upon Outcome': 'Sex_Outcome', 'Age upon Outcome': 'Outcome_Age'})
Preview Data
Taking a look at the intake dataset, we can see that there are several features that should be quite valuable to this analysis, such as the condition of the animal when it first enters the shelter (Intake_Condition), the age of the animal on intake (Age_Intake), the way the animal was found: stray, surrendered by owner, etc. (Intake_Type), and the month and day when the animal entered the shelter. I think it's safe to assume that all of these features have a strong influence on the outcome of an animal.
intake.head()
The outcome table is very similar, with no additional features compared to the outcome dataset hosted on Kaggle.
outcome.head()
We will now merge the intake and outcome datasets and start cleaning and preparing the data for some exploratory and predictive analysis.
df = pd.merge(intake, outcome, on = 'ID', how = 'inner', suffixes = ('_intake', '_outcome'))
df[df.duplicated()].head()
df = df.drop_duplicates('ID')
df.shape
df.head()
df.columns
df = df.drop(['Name_intake', 'Animal_Type_intake', 'Breed_intake', 'Color_intake'], axis = 1).reset_index(drop = True)
df.head(10)
Determine duration in shelter
Let's parse the datetime columns in our dataset into Python datetime objects so we can calculate the duration an animal spends in the shelter from intake to outcome.
intake_dates = [df.DateTime_intake[i].split()[0] for i in range(len(df.DateTime_intake))]
intake_date = [datetime.strptime(d,'%m/%d/%Y') for d in intake_dates]
outcome_dates = [df.DateTime_outcome[i].split()[0] for i in range(len(df.DateTime_outcome))]
outcome_date = [datetime.strptime(d,'%m/%d/%Y') for d in outcome_dates]
days_in = [abs(intake_date[i] - outcome_date[i]).days for i in range(len(intake_date))]
df['Days_In'] = days_in
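As an aside, pandas can do the same parsing and subtraction in vectorized form. A minimal equivalent sketch, assuming the same df as above:
# Vectorized alternative: parse both datetime columns with pandas
# and take the absolute difference in days directly
intake_dt = pd.to_datetime(df.DateTime_intake)
outcome_dt = pd.to_datetime(df.DateTime_outcome)
df['Days_In'] = (outcome_dt - intake_dt).dt.days.abs()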
Convert age
The age column contains mixed units (years, months, weeks, days), so we will convert each age to years.
def convert_age(ages):
    Ages = ages.astype(str)
    # Convert each age string to years; ages are given in years, months, weeks, or days
    y = np.asarray([float(age.split()[0]) if 'year' in age
                    else float(age.split()[0])/52. if 'week' in age
                    else float(age.split()[0])/12. if 'month' in age
                    else float(age.split()[0])/365. if 'day' in age
                    else -99.0
                    for age in Ages])
    # Impute missing/unparsed ages with the mean of the known ages
    y[y == -99] = np.mean(y[y != -99])
    return y
df['Age_Intake'] = convert_age(df.Age_Intake)
df['Outcome_Age'] = convert_age(df.Outcome_Age)
df.head()
Convert Sex and Fixed Status
The sex and fixed status of an animal are combined in one column (Sex_Intake), so let's break this column up into a gender column and a fixed-status column.
df['Sex_Intake'].unique()
df['Sex_Intake'] = df['Sex_Intake'].replace(np.nan,'Unknown')
df['Sex_Intake'].unique()
def convert_sex(sex):
    n = len(sex)
    gender = ['male' if 'Male' in str(sex[i])
              else 'female' if 'Female' in str(sex[i])
              else 'unknown'
              for i in range(n)]
    spayed_neutered = ['fixed' if any(x in str(sex[i]) for x in ['Neutered', 'Spayed'])
                       else 'not_fixed' if 'Intact' in str(sex[i])
                       else 'unknown_fixed'
                       for i in range(n)]
    return gender, spayed_neutered
df['Sex'], df['Fixed'] = convert_sex(df['Sex_Outcome'])
df.head()
Simplify breed and color features
There are 1747 different classifications of dog breeds in our set. Some classifications are redundant, so let's simplify the column. We will remove the word 'Mix' from the breed and take the dominant breed. There are only 72 different classifications of breeds for cats, so we will leave the cats alone.
len(df[df.Animal_Type_outcome == 'Dog' ].Breed_outcome.unique())
len(df[df.Animal_Type_outcome == 'Cat' ].Breed_outcome.unique())
breeds = list(df[df['Animal_Type_outcome'] == 'Dog']['Breed_outcome'])
dog_breeds = [re.sub(' Mix', '', dog) if 'Mix' in dog else dog.split("/")[0] for dog in breeds]
df.loc[df.Animal_Type_outcome == 'Dog', 'Breed_outcome'] = dog_breeds
df.head()
Removing the 'Mix' classification narrows down the set of different dog breeds to 269, a huge improvement compared to 1747.
len(df.Breed_outcome.unique())
There are 319 different color classifications for dogs and 199 for cats. For both, we will take the first color listed as the dominant color and replace each classification with the dominant color for each animal. Doing this, we reduce the number of unique classifications in total to only 57.
# number of distinct colors for dogs
len(df[df.Animal_Type_outcome == 'Dog' ].Color_outcome.unique())
# number of distinct colors for cats
len(df[df.Animal_Type_outcome == 'Cat' ].Color_outcome.unique())
# Take the first color
def get_color(animal_color):
color = [c.split("/")[0] for c in animal_color]
return color
df['Color_outcome'] = get_color(df['Color_outcome'])
len(df.Color_outcome.unique())
df.head()
Now, let's create a feature column that indicates the month the animal enters the animal shelter.
dates = [df['DateTime_intake'][i].split()[0] for i in range(len(df['DateTime_intake']))]
date = [datetime.strptime(d,'%m/%d/%Y') for d in dates]
month = ['January' if d.month == 1
else 'February' if d.month == 2
else 'March' if d.month == 3
else 'April' if d.month == 4
else 'May' if d.month == 5
else 'June' if d.month == 6
else 'July' if d.month == 7
else 'August' if d.month == 8
else 'September' if d.month == 9
else 'October' if d.month == 10
else 'November' if d.month == 11
else 'December'
for d in date]
df['Month_Intake'] = month
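Incidentally, the month names can also be read straight off the parsed dates; a one-line equivalent of the chain above:
# strftime('%B') returns the full month name for each parsed date
df['Month_Intake'] = [d.strftime('%B') for d in date]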
df.head()
df = df[df.Outcome.notnull()]
Name vs No Name
We'll construct a column that indicates whether the animal was given a name while it was in the shelter.
df['Given_Name'] = df.Name_outcome.notnull() * 1.0
df.head()
Plots
Here, we will construct some data visualizations for our data. Let's take a look at some statistics for the 10 breeds that most frequently come into the shelter.
f = {'Breed_outcome':'count', 'Days_In': 'mean'}
days_avg = df.groupby('Breed_outcome').agg(f)
df2 = days_avg.sort_values(['Breed_outcome','Days_In'])[-10:]
df2 = df2.rename(columns = {'Breed_outcome': 'Count'})
df2
ann_text = ['Count = ' + list(df2['Count'].astype(str))[i] for i in range(len(df2))]
Here, we take a look at the average amount of time each breed spends in the shelter from 2013 to 2017. As you hover over the chart, it shows the average days in shelter and the count of how many times that breed has come into the shelter over the full period.
trace1 = go.Bar(
y = df2['Days_In'],
x = df2.index,
name = 'Average Days',
text = ann_text
)
layout = go.Layout(
title = 'Average Days in Shelter by Breed (2013 - 2017)',
yaxis = dict(title = 'Days')
)
data = [trace1]
fig = go.Figure(data = data, layout = layout)
# plotly.offline.iplot(fig)
py.iplot(fig)
Summary:
- Pitbulls spend the longest time in this shelter on average, about 44 days. They are also the second most common breed that this shelter receives.
- Dachshunds spend the least time in this shelter on average, about 11 days. They are the ninth most common breed that this shelter receives.
- The other breeds, including cats, spend roughly similar amounts of time in this shelter, about 20 days.
summary_outcome = pd.crosstab(df.Breed_outcome, df.Outcome).apply(lambda r: r/r.sum() * 100., axis = 1)
summary_outcome.head()
Now let's take a look at outcome rates by breed. The worst possible outcome is euthanasia, and the best outcomes are adoption and return to owner. We can see that the combination of the return-to-owner and adoption rates dominates the other outcome rates, which is a good sign. However, the transfer rates for some of these breeds are quite high, such as the domestic short hair, a cat breed.
cols = summary_outcome.columns
summary = summary_outcome[summary_outcome.index.isin(list(df2.index))][-10:]
data = []
for col in cols:
trace = go.Bar(
y = summary[col],
x = summary.index,
name = col
)
data.append(trace)
layout = go.Layout(
title = 'Outcome by Breed (2013 - 2017)',
yaxis = dict(title = 'Percentage'),
# xaxis = dict(title = 'Breed')
)
fig = go.Figure(data = data, layout = layout)
# plotly.offline.iplot(fig)
py.iplot(fig)
Summary:
- Among dogs, the majority of outcomes for every breed in the top 10 result in adoption or return to owner, which is a very good sign.
- Among cats (domestic long/medium/short hair, and Siamese mix), the transfer rate is high, which is not very indicative of their final outcome. In fact, for domestic short/medium hair, the transfer rates are as high as the adoption rates. There is also a very low return-to-owner rate for cats, about 7%. However, this is intuitive given the high number of stray cats.
year = [date[i].year for i in range(len(date))]
df3 = pd.concat([df, pd.DataFrame({'year': year})], axis = 1)
df3.head()
summary_by_year = pd.crosstab([df3.Breed_outcome, df3.Outcome], df3.year)
summary_by_year.head()
df4 = summary_by_year[summary_by_year.index.get_level_values(0).isin(df2.index)]
breeds = df4.index.get_level_values(0).unique().tolist()
show = np.array(range(1, 65))
button = []
data = []
count = 0
for breed in breeds:
    # Determine visibility for drop-down menu
outcomes = df4.loc[breed].index.get_level_values(0).unique().tolist()
s = ((show <= (len(outcomes) + count)) & (show > count)).tolist()
count += len(outcomes)
button.append(dict(args = ['visible', s], label = breed, method = 'restyle'))
for outcome in outcomes:
xlab = df4.loc[breed].columns.tolist()
totals = df4.groupby(level = 0).sum().loc[breed]
ylab = df4.loc[breed].loc[outcome]/totals * 100
#Use to define first breed visibility in plot
if breed == breeds[0]:
v = True
else:
v = False
trace = go.Scatter(
x = xlab,
y = ylab,
name = outcome,
visible = v
)
data.append(trace)
layout = go.Layout(
title = 'Outcome Percentages by Breed',
xaxis = dict(title = 'Year'),
yaxis = dict(title = 'Percentage %'),
updatemenus = list([
dict(
x = -0.05,
y = 1,
yanchor = 'top',
buttons = button
)
])
)
fig = go.Figure(data = data, layout = layout)
Now, let's take a look at how the outcome percentages for each breed change over the years between 2013 and 2017. Below is a plot that shows the outcome rates by breed. On the left, we can change the breed with the drop-down menu.
# plotly.offline.iplot(fig)
py.iplot(fig)
Quick Summary:
- The rates that stand out the most for dog breeds are the positive-outcome rates, adoption and return to owner, for Pitbulls and Labrador Retrievers. They have increased from year to year, and both breeds' transfer rates have decreased. Additionally, both breeds' euthanasia rates decreased tremendously from year to year.
- Domestic longhair mix and domestic medium hair mix adoption rates have increased as well.
- The other breeds' adoption and return-to-owner rates fluctuate from year to year. However, the breeds with decreasing adoption rates have increasing return-to-owner rates, or vice versa.
Remark: Keep in mind that this data only runs through March 2017, so the 2017 rates come from a small sample of what the full year will turn out to be.
Predictive Analysis
Now, we will use machine learning to see how well we can predict an animal's outcome using the features listed below, and also see how much integrating the intake data improves our predictions over using the outcome data alone.
Features we will use:
- Intake Type
- Intake Condition
- Age Intake
- Given Name
- Animal Type
- Breed Outcome
- Color Outcome
- Sex
- Fixed
- Month Intake
- Days In
Labels
- Outcome
features = ['Intake_Type', 'Intake_Condition', 'Age_Intake', 'Given_Name', 'Animal_Type_outcome', 'Breed_outcome',
'Color_outcome', 'Sex', 'Fixed', 'Month_Intake','Days_In', 'Outcome']
df1 = df[features]
df1.head()
Convert categorical variables
To deal with our categorical variables, I will use binary encoding. Typically, one would use one-hot encoding, but recent work has shown that binary encoding not only performs better under many circumstances, it also uses less memory than one-hot encoding, which will help when running our models. A quick illustration follows the lists below.
Features that need to be converted to binary
- Intake Type
- Intake Condition
- Animal Type
- Breed Outcome
- Color Outcome
- Sex
- Fixed
- Month Intake
Features that do not need to be converted to binary
- Age Intake
- Days In
- Given Name
Label Encode:
- Outcome
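To see why binary encoding is more compact, here is a minimal sketch with a made-up feature of 8 levels: one-hot encoding needs 8 columns, while binary encoding needs only ceil(log2(8)) = 3.
# Hypothetical feature with 8 distinct levels (for illustration only)
s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# One-hot encoding: one column per level -> 8 columns
print(pd.get_dummies(s).shape[1])
# Binary encoding: write each level's integer code in base 2 -> 3 columns
codes = s.astype('category').cat.codes.values
n_bits = int(np.ceil(np.log2(s.nunique())))
bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1
print(bits.shape[1])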
categorical_vars = ['Intake_Type', 'Intake_Condition', 'Animal_Type_outcome', 'Breed_outcome', 'Color_outcome',
'Sex','Fixed', 'Month_Intake']
binary = ce.binary.BinaryEncoder(cols = categorical_vars)
binary.fit(df1)
dat = binary.transform(df1)
labels = dat['Outcome']
labels.value_counts()
le = LabelEncoder()
lab = le.fit_transform(labels)
le.classes_
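# Drop the label column; note that Days_In is also excluded from the feature matrix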
dat = dat.drop(['Outcome', 'Days_In'], axis = 1)
This will be the final dataset that we will use for our machine learning models. We can see that all of the categorical variables have been binary encoded and that our outcome labels have been encoded with the following correspondence:
- 0: Adoption
- 1: Died
- 2: Disposal
- 3: Euthanasia
- 4: Missing
- 5: Return to owner
- 6: Transfer
dat.head()
lab[10:25]
Get training and testing sets
We will do a random split of our data to construct the training and testing sets. The training set will contain 70% of the original data and the test set will contain the remaining 30%.
X_train, X_test, Y_train, Y_test = train_test_split(dat, lab, test_size = 0.30)
np.bincount(Y_train)/float(len(Y_train))
# Normalize training and testing set
X_train = normalize(X_train, axis = 0)
X_test = normalize(X_test, axis = 0)
Gradient Boosting
First, we will do a grid search to find the parameters that give the best results. With 3 choices of n_estimators, 3 of max_depth, and 2 of max_features, there are 18 parameter combinations, and with 3-fold cross validation that means 54 fits in total. After running this model, we will compare the results with the model from my last analysis, which used just the outcome data. As a reminder, the results from the last analysis were a 64% prediction accuracy and a 0.86 log loss using XGBoost.
params = {'n_estimators':[250, 300, 350], 'max_depth': [3, 4, 5], 'max_features':['sqrt', 'log2']}
gb_gs = GridSearchCV(GradientBoostingClassifier(), params, verbose = 1)
gb_gs.fit(X_train, Y_train)
gb_gs.best_params_
# Get best parameters
depth = gb_gs.best_params_['max_depth']
feat = gb_gs.best_params_['max_features']
estimators = gb_gs.best_params_['n_estimators']
gb_model = GradientBoostingClassifier(n_estimators = estimators, max_depth = depth, max_features = feat)
gb_model.fit(X_train, Y_train)
print('Test Accuracy :', gb_model.score(X_test, Y_test) * 100)
log_loss(Y_test, gb_model.predict_proba(X_test))
cm = confusion_matrix(Y_test, gb_model.predict(X_test))
cm
list(le.classes_)
fig = plt.figure(figsize = (10,8))
sns.heatmap(cm, annot = True, fmt = 'd', annot_kws = {'size': 20}, cmap = 'coolwarm',
cbar = 0, xticklabels = list(le.classes_), yticklabels = list(le.classes_) )
plt.xticks(rotation = 35, fontsize = 12)
plt.yticks(rotation = 25, fontsize = 12)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix for Test Set Predictions', fontsize = 16)
plt.tight_layout(h_pad = 5.0, w_pad = 3.0)
Considering the confusion matrix, we can see that the model does well at predicting adoption, return to owner, and transfer outcomes. When the model does not accurately predict adoption, most of the predictions are return to owner or transfer, and the same holds for the converse. In a confusion matrix, we want the diagonal elements, which indicate correct predictions, to hold the highest values. Our confusion matrix exhibits this property for most of the outcomes, with the exception of the died, disposal, and missing outcomes. This is most likely due to the very low number of occurrences of each of these outcomes, so the model was not able to capture features that accurately determine them.
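One quick way to quantify this per class is to divide the diagonal of the confusion matrix by its row totals, giving per-class recall; a small sketch using the cm and le computed above:
# Per-class recall: correct predictions divided by the number of
# true instances of each outcome (row sums of the confusion matrix)
recall_per_class = cm.diagonal() / cm.sum(axis = 1).astype(float)
for name, r in zip(le.classes_, recall_per_class):
    print('{}: {:.2%}'.format(name, r))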
feat_imp = pd.DataFrame({'Features':dat.columns.values,'Values':gb_model.feature_importances_})
top_feat = feat_imp.sort_values(by = 'Values').iloc[-10:]
top_feat
Using Gradient Boosting, we get 69% accuracy, which is a 5% increase over my last analysis using just the outcome data. This is a big improvement, and we can see from the table above that among the top 10 features influencing the model, intake age is by far the most important feature in our set. Intake age was not in the outcome set, which is one indicator that the intake dataset contains valuable information that improves prediction accuracy. Another improvement over the last model is the log loss: 0.79 for this model compared to 0.86 for the last one. This tells us that we can be more confident in the prediction probabilities from this model.
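For a quick visual of the feature importance table above, a simple horizontal bar chart works; a sketch using the top_feat frame computed earlier:
# Horizontal bar chart of the top 10 feature importances
fig = plt.figure(figsize = (8, 6))
plt.barh(range(len(top_feat)), top_feat['Values'])
plt.yticks(range(len(top_feat)), top_feat['Features'])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances')
plt.tight_layout()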
Further Improvements
I will continue working on this dataset as time opens up. Some things I would like to work on are more data visualizations, ensemble methods to improve prediction accuracy, trying XGBoost with GPU support to improve computation time and accuracy, and looking at precision and recall results. Additionally, I would like to set up a dashboard using Plotly, which provides a point-and-click tool for easily assembling plots into dashboards via its online UI.