Using Hidden Markov Models to Make Smarter Pitching Decisions
Posted on Sun 08 January 2017 in Projects
Analysis Topic:
One topic that has been discussed a lot in baseball recently is the third time through the order penalty (TTOP): the theory that a pitcher becomes less effective once he has gone through the opposing lineup twice and faces each batter for the third time.
The topic has come up several times over the years. Here are a few articles discussing TTOP, along with a set of interviews with team managers about their thoughts on the penalty:
http://www.businessinsider.com/ap-third-time-through-the-order-can-be-tough-to-navigate-2016-4
http://sabr.org/latest/lichtman-pitch-types-and-times-through-order-penalty
Interviews: http://www.fangraphs.com/blogs/managers-on-the-third-time-through-the-order/
A study of the penalty was presented at the 2012 MIT Sloan Sports Analytics Conference. That same year, the Colorado Rockies took the idea of the third time through the order penalty and experimented with pulling their pitchers once they reached the third time through the order. They called this strategy 'paired pitching': use two pitchers, each going through the order at most twice. The experiment did not go well; the team won only 64 of 162 games that season.
Since then, TTOP has remained somewhat controversial among sabermetricians and baseball professionals.
Goal:
The goal of this analysis is to explore the third time through the order penalty. I will use data published by MLB to, first, find evidence that pitchers do exhibit the penalty and, second, use a statistical model, namely a Hidden Markov Model, to explore an analytical approach teams can take to deal with the penalty, if it exists.
Data
The data is parsed using the pitchRx package in R, written by Carson Sievert, which scrapes the XML files published by MLB. I will parse the 2013 and 2014 seasons. The data is parsed and put into a sqlite3 database. The parser creates several tables; I will only be using two of them, atbat and pitch. The atbat table has features on each plate appearance and the pitch table has features on each pitch.
Remark: There is no particular reason I chose the 2013 and 2014 seasons; any seasons would do. For two seasons, the parser takes a little over an hour to scrape the data and create the database, so grabbing more data would require a longer run time.
from sqlalchemy import create_engine
import pandas as pd
import sqlite3
import numpy as np
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from sklearn.externals import joblib
from hmmlearn.hmm import MultinomialHMM
from hmmlearn.hmm import GaussianHMM
from pomegranate import *
import plotly.plotly as py
# plotly.offline.init_notebook_mode()
from plotly.graph_objs import *
from plotly.tools import FigureFactory as FF
%load_ext rpy2.ipython
# %load_ext pymatbridge
np.set_printoptions(precision = 5, suppress = True)
# Parse using pitchRx package
# %%R
# library('DBI')
# library('dplyr')
# library('ggplot2')
# library('pitchRx')
# db <- src_sqlite("pitchRx1314.sqlite3", create = TRUE)
# scrape(start = "2013-01-01",end = "2014-12-01", connect = db$con)
# scrape(start = "2013-01-01", end = "2014-12-01", suffix = files, connect = db$con)
conn = sqlite3.connect('pitchRx1314.sqlite3')
Preview of the atbat table:
pd.read_sql_query('''
SELECT *
FROM atbat
LIMIT 10
''', conn).head(4)
Preview of the pitch table:
pd.read_sql_query('''
SELECT *
FROM pitch
LIMIT 2000
''', conn).iloc[1000:1005]
We will be doing a lot of dataset slicing and calculating of baseball statistics, so I will write functions so we don't have to repeat the process. The following functions will be used:
get_split_index: Finds the row that separates the first two times through the order from the third time through the order.
split_3rd: Splits the dataset into two: one with only the first two times through the order, the other with the third time through the order and beyond. This works on a single game or on multiple games (or an entire season).
innings_pitched: Returns the number of innings pitched.
strikeout_ratio: Calculates the strikeout ratio per 9 innings.
fip: Calculates the FIP value.
whip: Calculates the WHIP.
walks_ratio: Calculates the walks ratio per 9 innings.
Equations:
$$ FIP = \frac{13 \times \text{Home Runs} + 3 \times (\text{Walks} + \text{Hit By Pitch}) - 2 \times \text{Strikeouts}}{\text{Innings Pitched}} + \text{FIP Constant}, \quad \text{FIP Constant} = 3.10 $$

$$ WHIP = \frac{\text{Walks} + \text{Hits}}{\text{Innings Pitched}} $$

$$ \text{K/9} = 9 \times \frac{\text{Strikeouts}}{\text{Innings Pitched}} $$

$$ \text{BB/9} = 9 \times \frac{\text{Walks}}{\text{Innings Pitched}} $$

def get_split_index(df):
    count = 0
    i = 0
    for i in range(len(df) - 1):
        batter_old = df.batter[i]
        batter_new = df.batter[i + 1]
        if batter_old != batter_new:
            count += 1
        if count == 18:
            break
    return (i + 1)

def split_3rd(df):
    before_3rd = pd.DataFrame()
    after_3rd = pd.DataFrame()
    dates = df.date.unique()
    for d in dates:
        game_df = df[df.date == d].reset_index(drop = True)
        split_index = get_split_index(game_df)  # Row that starts the third time through the order
        before_3rd = before_3rd.append(game_df.iloc[0:split_index, :])  # Append dataframes
        after_3rd = after_3rd.append(game_df.iloc[split_index:, :])
    return before_3rd, after_3rd.reset_index(drop = True)

def innings_pitched(df):
    one_out_plays = ['Groundout', 'Strikeout', 'Lineout', 'Sac Bunt', 'Pop Out', 'Flyout', 'Sac Fly',
                     'Forceout', 'Runner Out', 'Fielders Choice Out', 'Bunt Groundout',
                     'Fielders Choice', 'Bunt Pop Out', 'Bunt Lineout']
    two_out_plays = ['Double Play', 'Grounded Into DP', 'Strikeout - DP', 'Sac Fly DP', 'Sacrifice Bunt DP']
    three_out_plays = ['Triple Play']
    dates = list(df.date.unique())
    ip = 0
    for d in dates:
        max_inning = max(df[df.date == d].inning).astype(int)
        game_ip = 0
        for i in range(1, (max_inning + 1)):
            df1 = df[(df.date == d) & (df.inning == i)]
            one_outs = sum(df1.event.isin(one_out_plays))
            two_outs = 2 * sum(df1.event.isin(two_out_plays))
            three_outs = 3 * sum(df1.event.isin(three_out_plays))
            game_ip += (one_outs + two_outs + three_outs)/3.0
            ip += (one_outs + two_outs + three_outs)/3.0
    return ip

def strikeout_ratio(df):
    return 9.0 * sum(df.event.isin(['Strikeout', 'Strikeout - DP']))/innings_pitched(df)

def fip(df):
    hr = sum(df.event.isin(['Home Run']))
    bb = sum(df.event.isin(['Walk']))
    hbp = sum(df.event.isin(['Hit By Pitch']))
    k = sum(df.event.isin(['Strikeout', 'Strikeout - DP']))
    ip = innings_pitched(df)
    return (13 * hr + 3 * (bb + hbp) - 2 * k)/ip + 3.10

def whip(df):
    bb = sum(df.event.isin(['Walk']))
    hits = sum(df.event.isin(['Single', 'Home Run', 'Double', 'Triple']))
    return (bb + hits)/innings_pitched(df)

def walks_ratio(df):
    return 9.0 * sum(df.event.isin(['Walk']))/innings_pitched(df)

def get_game(df, game):
    return df[df.date == df.date.unique()[game]].reset_index(drop = True).drop('index', axis = 1)
Find Evidence:
First, we will try to find evidence that pitchers are affected by the third time through the order penalty. I will take the starting pitchers who started 10 or more games in the 2013 season and split each pitcher's data into two sets: the first two times through the order, and the third time through the order and beyond. Then I will calculate their stats (FIP, WHIP, strikeouts per 9, etc.) on each set and count how many pitchers exhibit the penalty effect. I grabbed the list of pitchers from baseball-reference:
http://www.baseball-reference.com/leagues/MLB/2013-standard-pitching.shtml
pitchers = list(pd.read_csv('pitchers.csv')['Name'])
before_fip = []
after_fip = []
before_whip = []
after_whip = []
before_strikeout = []
after_strikeout = []
before_walks = []
after_walks = []
pitch = []
for pitcher in pitchers:
    # print pitcher
    sql_query = '''
    SELECT *
    FROM atbat
    WHERE pitcher_name == ? AND (date > '2013_03_31' AND date < '2013_09_29')
    '''
    player_bat_df = pd.read_sql_query(sql_query, con = conn, params = [pitcher])
    if len(player_bat_df) == 0:
        continue
    player_bat_df = player_bat_df.sort_values(['date', 'num']).reset_index(drop = True)
    before, after = split_3rd(player_bat_df)
    before_fip.append(fip(before))
    after_fip.append(fip(after))
    before_whip.append(whip(before))
    after_whip.append(whip(after))
    before_strikeout.append(strikeout_ratio(before))
    after_strikeout.append(strikeout_ratio(after))
    before_walks.append(walks_ratio(before))
    after_walks.append(walks_ratio(after))
    pitch.append(pitcher)
stats_df = pd.DataFrame({'aaname': pitch,
                         'before_fip': before_fip,
                         'after_fip': after_fip,
                         'before_whip': before_whip,
                         'after_whip': after_whip,
                         'before_strikeout': before_strikeout,
                         'after_strikeout': after_strikeout,
                         'before_walks': before_walks,
                         'after_walks': after_walks})
cols = ['aaname', 'before_fip', 'after_fip', 'before_whip', 'after_whip',
        'before_strikeout', 'after_strikeout', 'before_walks', 'after_walks']
stats_df[cols].head()
print 'Percentage of Pitchers Affected:'
print 'FIP:', len(stats_df[stats_df.before_fip < stats_df.after_fip])/float(len(stats_df))
print 'WHIP:',len(stats_df[stats_df.before_whip < stats_df.after_whip])/float(len(stats_df))
print 'Strikeout per9:',len(stats_df[stats_df.before_strikeout > stats_df.after_strikeout])/float(len(stats_df))
print 'Walk ratio per9:',len(stats_df[stats_df.before_walks < stats_df.after_walks])/float(len(stats_df))
As we can see, over 50% of the pitchers in the 2013 season are affected by TTOP. A stacked bar chart of 20 pitchers can be seen below.
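To go beyond eyeballing the percentages, we could ask how surprising such a split would be if no penalty existed. Below is a sketch of an exact one-sided sign test; the counts used are hypothetical placeholders, not the actual output of the cell above. Under the null hypothesis, each pitcher is equally likely to look better or worse after the second time through, so the number who look worse is Binomial(n, 1/2).

```python
def sign_test_p(worse, n):
    """One-sided exact sign test: P(X >= worse) for X ~ Binomial(n, 0.5)."""
    total, coeff = 0.0, 1.0
    for k in range(n + 1):
        if k >= worse:
            total += coeff
        coeff = coeff * (n - k) / (k + 1.0)  # advance C(n, k) -> C(n, k + 1)
    return total / 2.0 ** n

# Hypothetical example: 85 of 140 qualifying starters had a worse FIP
# the third time through the order.
print(round(sign_test_p(85, 140), 4))
```

A small p-value here would say the observed imbalance is unlikely under "no penalty"; a value near 0.5 would say a split like this is exactly what chance produces.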
x_trace = []
y_before = []
y_after = []
for i in range(20):
    pitcher_name = stats_df.aaname[i]
    x_trace.append(pitcher_name)
    y_before.append(stats_df.ix[stats_df.aaname == pitcher_name, 'before_fip'][i])
    y_after.append(stats_df.ix[stats_df.aaname == pitcher_name, 'after_fip'][i])
trace1 = Bar(x = x_trace, y = y_before, name = 'Before TTO', marker = dict(color = 'rgb(49,130,189)'))
trace2 = Bar(x = x_trace, y = y_after, name = 'After TTO', marker = dict(color = 'rgb(150,0,0)'))
layout = Layout(title = 'Before and After Third Time Through Order FIP',
                xaxis = dict(title = 'Pitcher'), yaxis = dict(title = 'FIP'), barmode = 'stack')
data = [trace1, trace2]
fig1 = Figure(data = data, layout = layout)
# plotly.offline.iplot(fig)
x_trace = []
y_before = []
y_after = []
for i in range(20):
    pitcher_name = stats_df.aaname[i]
    x_trace.append(pitcher_name)
    y_before.append(stats_df.ix[stats_df.aaname == pitcher_name, 'before_whip'][i])
    y_after.append(stats_df.ix[stats_df.aaname == pitcher_name, 'after_whip'][i])
trace1 = Bar(x = x_trace, y = y_before, name = 'Before TTO', marker = dict(color = 'rgb(49,130,189)'))
trace2 = Bar(x = x_trace, y = y_after, name = 'After TTO', marker = dict(color = 'rgb(150,0,0)'))
layout = Layout(title = 'Before and After Third Time Through Order WHIP',
                xaxis = dict(title = 'Pitcher'), yaxis = dict(title = 'WHIP'), barmode = 'stack')
data = [trace1, trace2]
fig2 = Figure(data = data, layout = layout)
py.iplot(fig1)
py.iplot(fig2)
Hidden Markov Model
A Hidden Markov Model (HMM) is composed of a sequence of hidden states $s_1, s_2, s_3, ..., s_n$, where each state generates an observation $y_1, y_2, ..., y_n$. The key assumption is that the next state depends only on the current state. This is a first-order HMM, and the sequence of states is called a Markov chain.
Let's cook up a scenario as an example: the weather on a given day (sunny or raining) influences what I do that day (go shopping or watch a movie at home). The weather conditions form the sequence of states, and my activities form the sequence of observations. One possible sequence of states is {raining, sunny, sunny, raining}, and one possible sequence of observations is {shop, movie, shop, shop}.
Notice that each state (raining or sunny) generates an observation (shop or movie). Now suppose we have no other way of recalling the weather over the last four days, but I tell you what I did on each of them, e.g. {shop, shop, movie, shop}. From there, we are tasked with determining the most likely weather on each day. The Hidden Markov Model gives us a way to tackle this task.
Given a sequence of observations, an HMM lets us infer the most likely corresponding sequence of states. We can also estimate the probabilities of transitioning from one state to another; for example, the probability of going from raining on one day to sunny on the next. Additionally, we can estimate the probability of each state emitting a specific observation, such as the probability that I shop given that the weather is in a sunny state.
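To make the weather example concrete, here is a minimal Viterbi decoding sketch. All of the probabilities below are made up for illustration; they are not estimated from anything. Given the observation sequence {shop, shop, movie, shop}, it recovers the most likely weather sequence.

```python
import numpy as np

states = ['raining', 'sunny']
obs_names = ['shop', 'movie']

# Assumed (invented) parameters for this toy example:
start_p = np.array([0.5, 0.5])          # P(weather on day 1)
trans_p = np.array([[0.6, 0.4],         # P(tomorrow | today), rows = today
                    [0.3, 0.7]])
emit_p  = np.array([[0.9, 0.1],         # P(activity | weather), rows = weather
                    [0.2, 0.8]])

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an integer observation sequence."""
    T, n = len(obs), len(start_p)
    delta = np.zeros((T, n))             # best log-prob of a path ending in each state
    back = np.zeros((T, n), dtype=int)   # backpointers to the best previous state
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n):
            scores = delta[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    path = [int(np.argmax(delta[-1]))]   # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

obs = [0, 0, 1, 0]  # {shop, shop, movie, shop}
print([states[s] for s in viterbi(obs, start_p, trans_p, emit_p)])
# -> ['raining', 'raining', 'sunny', 'raining']
```

With these assumed parameters, the movie day is decoded as sunny and the shopping days as raining, which is exactly the kind of inference we will ask the model to make about pitching states.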
Applying Hidden Markov Models to Baseball to Make Smarter Decisions
We will extend this idea to the problem we are trying to analyze: the third time through the order penalty. I will assume the pitcher moves among three states during a game: a good pitching state, a mediocre pitching state, and a bad pitching state. The observations will be the outcome of each pitch: a strike, a ball, or a hit (single, double, etc.). For each pitcher, we will fit a Hidden Markov Model to the sequence of observations (strike, ball, hit) to determine the most likely sequence of states the pitcher was in on each pitch. From there, we can see how often the pitcher was in a good pitching state, and so on. We can also recover the state transition probabilities and emission probabilities.
As history shows (the 2012 Colorado Rockies season), pulling a pitcher as soon as he finishes facing his 18th batter (the second time through the order) does not produce good results. While a majority of pitchers do suffer from TTOP, certain pitchers actually perform better the third time through the order, so applying one strategy to all pitchers proved unsuccessful. One possible aid for in-game decision-making is to look at each pitcher's state transition probabilities. If John Doe, our starting pitcher, has a very low probability of transitioning out of a bad state, then we can conclude that when John starts to pitch poorly during a game, he is unlikely to recover and return to a state where he is pitching well again. Thus, especially during the third or fourth time through the order, we would take John out of the game as soon as he starts performing poorly.
If we are in the middle of the season and our starting pitchers are pitching their 13th game this week, we can take the sequence of observations for their previous 12 games and look at their state transitioning probabilities. This will help us make smarter decisions during this week's games on whether to pull them out of the game if they start to enter a bad pitching state.
Recovering the State Sequence and Estimating Transition and Emission Probabilities
I will use the hmmlearn package in Python to apply the Hidden Markov Model to this data. I experimented with other options, including Pomegranate, the HMM package in R, Matlab's built-in HMM functions, and my own Python and Matlab code for recovering the state sequence and probabilities. I settled on hmmlearn for its speed and ease of use. I will train the Hidden Markov Model on the entire 2013 season for a specified pitcher. For a quick analysis, I will recover transition and emission probabilities for two pitchers, one who had a successful 2013 season and one who did not: Adam Wainwright and Jeremy Guthrie.
index = (stats_df.before_fip - stats_df.after_fip).sort_values().index
# stats_df.aaname[index[0]]
def hidden_markov(pitcher):
    sql_query = '''
    SELECT
        X.date as date,
        X.event as event,
        pitch.inning as inning,
        X.o as o,
        X.b as b,
        X.s as s,
        X.pitcher_name as pitcher_name,
        X.batter_name as batter,
        pitch.type as type,
        pitch.x0 as x_location,
        pitch.z0 as z_location
    FROM (
        SELECT *
        FROM atbat
        WHERE pitcher_name = ? AND (date > '2013_03_30' AND date < '2013_09_29')
    ) AS X
    INNER JOIN pitch ON X.gameday_link = pitch.gameday_link AND X.num = pitch.num
    ORDER BY date, inning, id, o ASC
    '''
    player_df = pd.read_sql_query(sql_query, con = conn, params = [pitcher])
    out_plays = ['Groundout', 'Strikeout', 'Lineout', 'Sac Bunt', 'Pop Out',
                 'Flyout', 'Sac Fly', 'Forceout',
                 'Runner Out', 'Fielders Choice Out', 'Bunt Groundout',
                 'Fielders Choice', 'Bunt Pop Out',
                 'Bunt Lineout',
                 'Double Play', 'Grounded Into DP', 'Strikeout - DP',
                 'Sac Fly DP', 'Sacrifice Bunt DP',
                 'Triple Play']
    # Relabel balls in play ('X') that became outs as strikes, so that 'X' means a hit:
    observations = [str(player_df.type[i]) if player_df.type[i] != 'X'
                    else 'S' if ((player_df.event[i] in out_plays) & (player_df.type[i] == 'X'))
                    else 'X'
                    for i in range(len(player_df))]
    # NOTE: the next line overrides the relabeling above and uses the raw
    # B (ball) / S (strike) / X (in play) pitch types as the observations.
    observations = player_df.type
    le = LabelEncoder()
    obs = le.fit_transform(observations)
    obs = obs.reshape(len(observations), 1)
    lengths = [len(player_df[player_df.date == d]) for d in player_df.date.unique()]
    dates = player_df.date.unique()
    hmm = MultinomialHMM(n_components = 3, n_iter = 1000, tol = 1e-5)
    hmm.startprob_ = np.array([1/4., 1/2., 1/4.])
    hmm.fit(obs, lengths)
    states = hmm.decode(obs, lengths)[1]
    joblib.dump(hmm, pitcher + ".pkl")  # Save model
    hmmdf = pd.DataFrame({'states': states,
                          'type': player_df.type,
                          'date': player_df.date,
                          'batter': player_df.batter,
                          'event': player_df.event,
                          'inning': player_df.inning
                          }).reset_index()
    return player_df, hmmdf, hmm.transmat_, hmm.emissionprob_
def get_plot(Pitcher):
    player_df, hmmdf, A, b = hidden_markov(Pitcher)
    df_A = pd.DataFrame(A, columns = ['State 0', 'State 1', 'State 2']) * 100
    df_A['States'] = ['State 0', 'State 1', 'State 2']
    df_A.set_index('States', drop = True, inplace = True)
    table1 = FF.create_table(df_A, index = True, index_title = 'Transition Matrix')
    # plotly.offline.iplot(table1)
    py.iplot(table1)
    df_b = pd.DataFrame(b, columns = ['Ball', 'Strike', 'Hit']) * 100
    df_b['States'] = ['State 0', 'State 1', 'State 2']
    df_b.set_index('States', drop = True, inplace = True)
    table2 = FF.create_table(df_b, index = True, index_title = 'Emission Matrix')
    # plotly.offline.iplot(table2)
    py.iplot(table2)
    data = []
    button = []
    dates = player_df.date.unique()
    n = len(dates)
    button.append(dict(args = ['visible', [1] * n], label = 'All', method = 'restyle'))
    for i in range(0, n):
        t = Scatter(x = range(1, len(get_game(hmmdf, i)) + 1), y = get_game(hmmdf, i).states,
                    name = 'Game ' + str(i + 1))
        show = [i == j for j in range(n)]  # Show only game i when its button is selected
        button.append(dict(args = ['visible', show], label = dates[i] + ' / Game ' + str(i + 1),
                           method = 'restyle'))
        data.append(t)
    layout = Layout(
        title = 'States Chain',
        xaxis = dict(title = 'Pitch'),
        yaxis = dict(title = 'State'),
        updatemenus = list([
            dict(
                x = -0.05,
                y = 1,
                yanchor = 'top',
                buttons = button
            )
        ])
    )
    fig = Figure(data = data, layout = layout)
    # plotly.offline.iplot(fig)
    return table1, table2, fig
Decisions, Decisions
Below are the tables of the transition and emission matrices for Adam Wainwright and Jeremy Guthrie. In the transition matrix, the row gives the current state and the column gives the next state; for example, the probability of Adam Wainwright transitioning from State 0 to State 1 is 61.18%. In the emission matrix, each entry is the probability of the row's state emitting the column's observation; for example, if Adam Wainwright is currently in state 0, he has a 60.5% chance of throwing a ball and a 16.3% chance of giving up a hit.
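As a reading aid, here is how the row/column convention works on a small transition matrix. The numbers below are hypothetical stand-ins, not the fitted values shown in the tables: each row is a probability distribution over the next state.

```python
import numpy as np

# Hypothetical 3-state transition matrix: rows = current state, columns = next state.
trans = np.array([[0.20, 0.61, 0.19],
                  [0.15, 0.70, 0.15],
                  [0.44, 0.56, 0.00]])

# Every row must sum to 1, since the pitcher has to land in some state next.
assert np.allclose(trans.sum(axis=1), 1.0)

# Reading off an entry: P(next = State 1 | current = State 0)
print(trans[0, 1])  # 0.61
```

The emission matrix is read the same way, except the columns index observations (Ball, Strike, Hit) rather than next states.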
Adam Wainwright
table1, table2, fig = get_plot('Adam Wainwright')
py.iplot(table2)
py.iplot(table1)
Notice that the states here are numbered, as opposed to the labels we assumed earlier: good, mediocre, bad. Because this is a mathematical model, the states had to be encoded numerically to recover the probabilities and the sequence of states. However, from the tables we can infer which state is the bad state, which is the mediocre state, and which is the good state.
First, look at Wainwright's emission matrix. If he is in state 0, he has a 60.5% chance of throwing a ball, the highest of all the states, so this is the mediocre state. Next, if he is in state 1, he is 68.78% likely to throw a strike, the highest percentage of any state; state 1 will be our good state. Finally, if he is in state 2, he is 58.89% likely to give up a hit, the highest probability of any state, so state 2 will be our bad state. So for Wainwright we have established:
State 2 = Bad State
State 0 = Mediocre State
State 1 = Good State
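This labeling step can be automated by taking the argmax of each state's emission row. The matrix below uses hypothetical numbers loosely patterned on the quoted percentages, not the fitted matrix itself, and the mapping from dominant emission to label (Ball = mediocre, Strike = good, Hit = bad) is the assumption described above.

```python
import numpy as np

obs_names = ['Ball', 'Strike', 'Hit']
label_for_obs = {'Ball': 'Mediocre', 'Strike': 'Good', 'Hit': 'Bad'}  # assumed mapping

# Hypothetical emission matrix: rows = states 0-2, columns = Ball, Strike, Hit.
emission = np.array([[0.605, 0.232, 0.163],
                     [0.212, 0.688, 0.100],
                     [0.241, 0.170, 0.589]])

for s in range(3):
    dominant = obs_names[int(np.argmax(emission[s]))]
    print('State %d -> %s state (most likely emission: %s)' % (s, label_for_obs[dominant], dominant))
```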
Next, observe his transition matrix. When he is in his bad state (state 2), he has a 0% chance of self-transitioning, i.e. of remaining in state 2. Instead, he is most likely to transition to state 1 (56.03%), his good state. When he is in his mediocre state, he also has the highest probability of transitioning to state 1, his good state, at 61.18%. This tells us that when Wainwright falls into a bad or mediocre state during a game, he is most likely to leave it and return to his good state.
Below are plots of the Markov chains, the sequence of states for each game of the season. Feel free to use the drop-down menu to navigate to any individual game and see the sequence of states for the pitcher.
py.iplot(fig)
Jeremy Guthrie
table1, table2, fig = get_plot('Jeremy Guthrie')
py.iplot(table2)
py.iplot(table1)
If we do the same analysis for Jeremy Guthrie, we can conclude that his corresponding states are:
State 2 = Bad State
State 0 = Mediocre State
State 1 = Good State
Looking at his transition matrix, we can see that when Guthrie is in his bad state (state 2), he has a 17.69% chance of staying there. The biggest difference from Wainwright is that when Guthrie is in either his good or his mediocre state, his highest transition probability is back into his bad state. In fact, when Guthrie is in his good state, he has only a 17.8% chance of staying in it. And unlike Wainwright, who has a 0% chance of staying in his bad state, Guthrie has a 17.69% chance of staying in his. The transition and emission matrices draw a clear distinction between these two pitchers.
py.iplot(fig)
Results
Running the HMM on Adam Wainwright and Jeremy Guthrie shows a clear distinction in their transition and emission matrices. Wainwright's transition probabilities favored returning to his good state; Guthrie's matrices did not show the same feature. From Guthrie's transition matrix, we see that he can struggle, transitioning into his bad state often. The differences in their HMM results offer one explanation for their season results: in 2013, Wainwright had a 2.55 FIP and 2.94 ERA (Earned Run Average), while Guthrie had a 4.79 FIP and 4.04 ERA.
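One compact way to summarize the contrast is a back-of-the-envelope calculation using the self-transition probabilities quoted above: if a pitcher stays in a state with probability p on each pitch, the number of consecutive pitches spent in that state is geometric with mean 1/(1 - p).

```python
def expected_run_length(p_stay):
    """Expected consecutive pitches in a state, given its self-transition probability."""
    return 1.0 / (1.0 - p_stay)

# Self-transition probabilities of the bad state, as quoted above:
print(expected_run_length(0.0))     # Wainwright: expected to leave the bad state after 1 pitch
print(expected_run_length(0.1769))  # Guthrie: ~1.21 pitches on average
```

The per-pitch difference looks small, but combined with Guthrie's high probability of transitioning back into the bad state from his other states, it compounds over a start.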
More Analysis:
I will do more analysis soon. We still need to look at more pitchers to see whether their transition and emission matrices line up with their season results. As time permits, I will run more comparisons and produce more charts. I am excited about these results, as they show promise as an aid for smarter in-game decisions, and I am interested in extending this analysis to relievers to measure their effectiveness.