Using Hidden Markov Models to Make Smarter Pitching Decisions
Posted on Sun 08 January 2017 in Projects
Analysis Topic:
One topic that has been discussed a lot in baseball recently is the third time through the order penalty (TTOP): the theory that a pitcher becomes less effective once he has gone through the opposing lineup twice and faces each batter for the third time.
The topic has come up several times over the years. Here are a few articles discussing TTOP, along with a set of interviews with team managers about their thoughts on the penalty:
http://www.businessinsider.com/ap-third-time-through-the-order-can-be-tough-to-navigate-2016-4
http://sabr.org/latest/lichtman-pitch-types-and-times-through-order-penalty
Interviews: http://www.fangraphs.com/blogs/managers-on-the-third-time-through-the-order/
A study of the penalty was presented at the 2012 MIT Sloan Sports Analytics Conference. That same year, the Colorado Rockies took the idea of the third time through the order penalty and experimented with pulling their pitchers once they reached the third time through the order. They called this strategy 'paired pitching': use two pitchers, each going through the order at most twice. The experiment did not go well; the team won only 64 of 162 games that season.
Since then, TTOP has remained somewhat controversial among sabermetricians and baseball professionals.
Goal:
The goal of this analysis is to explore the third time through the order penalty. I will use data published by MLB to, first, find evidence that pitchers do exhibit the penalty and, second, use a statistical model, namely a Hidden Markov Model, to explore an analytical approach teams can take to deal with the penalty, if it exists.
Data
The data is parsed using the pitchRx package in R, written by Carson Sievert, which scrapes the XML files published by MLB. I will parse the 2013 and 2014 seasons. The data is parsed and put into a sqlite3 database. The parser creates several tables; I will only be using two of them, atbat and pitch. The atbat table has features on each plate appearance and the pitch table has features on each pitch.
Remark: There is no particular reason I chose the 2013 and 2014 seasons; any seasons would do. For two seasons, the parser takes a little over an hour to scrape the data and create the database, so grabbing more data would require a longer run time.
from sqlalchemy import create_engine
import pandas as pd
import sqlite3
import numpy as np
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from sklearn.externals import joblib
from hmmlearn.hmm import MultinomialHMM
from hmmlearn.hmm import GaussianHMM
from pomegranate import *
import plotly.plotly as py
# plotly.offline.init_notebook_mode()
from plotly.graph_objs import *
from plotly.tools import FigureFactory as FF
%load_ext rpy2.ipython
# %load_ext pymatbridge
np.set_printoptions(precision = 5, suppress = True)
# Parse using pitchRx package
# %%R
# library('DBI')
# library('dplyr')
# library('ggplot2')
# library('pitchRx')
# db <- src_sqlite("pitchRx1314.sqlite3", create = TRUE)
# scrape(start = "2013-01-01",end = "2014-12-01", connect = db$con)
# scrape(start = "2013-01-01", end = "2014-12-01", suffix = files, connect = db$con)
conn = sqlite3.connect('pitchRx1314.sqlite3')
Preview of the atbat table:
pd.read_sql_query('''
SELECT *
FROM atbat
LIMIT 10
''', conn).head(4)
Preview of the pitch table:
pd.read_sql_query('''
SELECT *
FROM pitch
LIMIT 2000
''', conn).iloc[1000:1005]
We will be doing a lot of dataset slicing and calculating of baseball statistics, so I will write functions so we don't have to repeat the process. The following functions will be used:
get_split_index: Finds the row that separates the first two times through the order from the third time through the order.
split_3rd: Splits the dataset into two: one with only the first two times through the order, the other with the third time through the order and beyond. This works on a single game or on multiple games (or an entire season).
innings_pitched: Returns the number of innings pitched.
strikeout_ratio: Calculates the strikeout ratio per 9 innings.
fip: Calculates the FIP value.
whip: Calculates the WHIP.
walks_ratio: Calculates the walks ratio per 9 innings.
Equations:
$$ FIP = \frac{13 \times \text{Home Runs} + 3 \times (\text{Walks} + \text{Hit By Pitch}) - 2 \times \text{Strikeouts}}{\text{Innings Pitched}} + \text{FIP Constant}, \quad \text{FIP Constant} = 3.10 $$

$$ WHIP = \frac{\text{Walks} + \text{Hits}}{\text{Innings Pitched}} $$

$$ \text{K/9} = 9 \times \frac{\text{Strikeouts}}{\text{Innings Pitched}} $$

$$ \text{BB/9} = 9 \times \frac{\text{Walks}}{\text{Innings Pitched}} $$

def get_split_index(df):
    count = 0
    i = 0
    for i in range(len(df) - 1):
        batter_old = df.batter[i]
        batter_new = df.batter[i + 1]
        if batter_old != batter_new:
            count += 1
        if count == 18:
            break
    return (i + 1)

def split_3rd(df):
    before_3rd = pd.DataFrame()
    after_3rd = pd.DataFrame()
    dates = df.date.unique()
    for d in dates:
        game_df = df[df.date == d].reset_index(drop = True)
        split_index = get_split_index(game_df)  # Row that starts the third time through the order
        before_3rd = before_3rd.append(game_df.iloc[0:split_index, :])  # Append dataframes
        after_3rd = after_3rd.append(game_df.iloc[split_index:, :])
    return before_3rd, after_3rd.reset_index(drop = True)

def innings_pitched(df):
    one_out_plays = ['Groundout', 'Strikeout', 'Lineout', 'Sac Bunt', 'Pop Out', 'Flyout', 'Sac Fly',
                     'Forceout', 'Runner Out', 'Fielders Choice Out', 'Bunt Groundout',
                     'Fielders Choice', 'Bunt Pop Out', 'Bunt Lineout']
    two_out_plays = ['Double Play', 'Grounded Into DP', 'Strikeout - DP', 'Sac Fly DP', 'Sacrifice Bunt DP']
    three_out_plays = ['Triple Play']
    dates = list(df.date.unique())
    ip = 0
    for d in dates:
        max_inning = max(df[df.date == d].inning).astype(int)
        game_ip = 0
        for i in range(1, (max_inning + 1)):
            df1 = df[(df.date == d) & (df.inning == i)]
            one_outs = sum(df1.event.isin(one_out_plays))
            two_outs = 2 * sum(df1.event.isin(two_out_plays))
            three_outs = 3 * sum(df1.event.isin(three_out_plays))
            game_ip += (one_outs + two_outs + three_outs)/3.0
            ip += (one_outs + two_outs + three_outs)/3.0
    return ip

def strikeout_ratio(df):
    return 9.0 * sum(df.event.isin(['Strikeout', 'Strikeout - DP']))/innings_pitched(df)

def fip(df):
    hr = sum(df.event.isin(['Home Run']))
    bb = sum(df.event.isin(['Walk']))
    hbp = sum(df.event.isin(['Hit By Pitch']))
    k = sum(df.event.isin(['Strikeout', 'Strikeout - DP']))
    ip = innings_pitched(df)
    return (13 * hr + 3 * (bb + hbp) - 2 * k)/ip + 3.10

def whip(df):
    bb = sum(df.event.isin(['Walk']))
    hits = sum(df.event.isin(['Single', 'Home Run', 'Double', 'Triple']))
    return (bb + hits)/innings_pitched(df)

def walks_ratio(df):
    return 9.0 * sum(df.event.isin(['Walk']))/innings_pitched(df)

def get_game(df, game):
    return df[df.date == df.date.unique()[game]].reset_index(drop = True).drop('index', axis = 1)
Find Evidence:
First, we will try to find evidence that pitchers are affected by the third time through the order penalty. I will take the starting pitchers who started 10 or more games in the 2013 season and split each pitcher's data into two sets: the first two times through the order, and the third time through the order and beyond. Then I will calculate their stats (FIP, WHIP, strikeouts per 9, etc.) on each set and count how many pitchers exhibit the penalty effect. I grabbed the list of pitchers from baseball-reference:
http://www.baseball-reference.com/leagues/MLB/2013-standard-pitching.shtml
pitchers = list(pd.read_csv('pitchers.csv')['Name'])
before_fip = []
after_fip = []
before_whip = []
after_whip = []
before_strikeout = []
after_strikeout = []
before_walks = []
after_walks = []
pitch = []
for pitcher in pitchers:
    # print pitcher
    sql_query = '''
    SELECT *
    FROM atbat
    WHERE pitcher_name == ? AND (date > '2013_03_31' AND date < '2013_09_29')
    '''
    player_bat_df = pd.read_sql_query(sql_query, con = conn, params = [pitcher])
    if len(player_bat_df) == 0:
        continue
    player_bat_df = player_bat_df.sort_values(['date', 'num']).reset_index(drop = True)
    before, after = split_3rd(player_bat_df)
    before_fip.append(fip(before))
    after_fip.append(fip(after))
    before_whip.append(whip(before))
    after_whip.append(whip(after))
    before_strikeout.append(strikeout_ratio(before))
    after_strikeout.append(strikeout_ratio(after))
    before_walks.append(walks_ratio(before))
    after_walks.append(walks_ratio(after))
    pitch.append(pitcher)
stats_df = pd.DataFrame({'aaname': pitch,
                         'before_fip': before_fip,
                         'after_fip': after_fip,
                         'before_whip': before_whip,
                         'after_whip': after_whip,
                         'before_strikeout': before_strikeout,
                         'after_strikeout': after_strikeout,
                         'before_walks': before_walks,
                         'after_walks': after_walks})
cols = ['aaname', 'before_fip', 'after_fip', 'before_whip', 'after_whip',
        'before_strikeout', 'after_strikeout', 'before_walks', 'after_walks']
stats_df[cols].head()
print 'Percentage of Pitchers Affected:'
print 'FIP:', len(stats_df[stats_df.before_fip < stats_df.after_fip])/float(len(stats_df))
print 'WHIP:',len(stats_df[stats_df.before_whip < stats_df.after_whip])/float(len(stats_df))
print 'Strikeout per9:',len(stats_df[stats_df.before_strikeout > stats_df.after_strikeout])/float(len(stats_df))
print 'Walk ratio per9:',len(stats_df[stats_df.before_walks < stats_df.after_walks])/float(len(stats_df))
As we can see, over 50% of the pitchers in the 2013 season are affected by TTOP. A stacked bar chart of 20 pitchers can be seen below.
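To go beyond eyeballing the percentages, we could ask how surprising such a split would be if no penalty existed. Below is a sketch of an exact one-sided sign test; the counts used are hypothetical placeholders, not the actual output of the cell above. Under the null hypothesis, each pitcher is equally likely to look better or worse after the second time through, so the number who look worse is Binomial(n, 1/2).

```python
def sign_test_p(worse, n):
    """One-sided exact sign test: P(X >= worse) for X ~ Binomial(n, 0.5)."""
    total, coeff = 0.0, 1.0
    for k in range(n + 1):
        if k >= worse:
            total += coeff
        coeff = coeff * (n - k) / (k + 1.0)  # advance C(n, k) -> C(n, k + 1)
    return total / 2.0 ** n

# Hypothetical example: 85 of 140 qualifying starters had a worse FIP
# the third time through the order.
print(round(sign_test_p(85, 140), 4))
```

A small p-value here would say the observed imbalance is unlikely under "no penalty"; a value near 0.5 would say a split like this is exactly what chance produces.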
x_trace = []
y_before = []
y_after = []
for i in range(20):
    pitcher_name = stats_df.aaname[i]
    x_trace.append(pitcher_name)
    y_before.append(stats_df.ix[stats_df.aaname == pitcher_name, 'before_fip'][i])
    y_after.append(stats_df.ix[stats_df.aaname == pitcher_name, 'after_fip'][i])
trace1 = Bar(x = x_trace, y = y_before, name = 'Before TTO', marker = dict(color = 'rgb(49,130,189)'))
trace2 = Bar(x = x_trace, y = y_after, name = 'After TTO', marker = dict(color = 'rgb(150,0,0)'))
layout = Layout(title = 'Before and After Third Time Through Order FIP',
                xaxis = dict(title = 'Pitcher'), yaxis = dict(title = 'FIP'), barmode = 'stack')
data = [trace1, trace2]
fig1 = Figure(data = data, layout = layout)
# plotly.offline.iplot(fig)
x_trace = []
y_before = []
y_after = []
for i in range(20):
    pitcher_name = stats_df.aaname[i]
    x_trace.append(pitcher_name)
    y_before.append(stats_df.ix[stats_df.aaname == pitcher_name, 'before_whip'][i])
    y_after.append(stats_df.ix[stats_df.aaname == pitcher_name, 'after_whip'][i])
trace1 = Bar(x = x_trace, y = y_before, name = 'Before TTO', marker = dict(color = 'rgb(49,130,189)'))
trace2 = Bar(x = x_trace, y = y_after, name = 'After TTO', marker = dict(color = 'rgb(150,0,0)'))
layout = Layout(title = 'Before and After Third Time Through Order WHIP',
                xaxis = dict(title = 'Pitcher'), yaxis = dict(title = 'WHIP'), barmode = 'stack')
data = [trace1, trace2]
fig2 = Figure(data = data, layout = layout)
py.iplot(fig1)
py.iplot(fig2)
Hidden Markov Model
A Hidden Markov Model (HMM) is composed of a sequence of hidden states $s_1, s_2, s_3, ..., s_n$, where each state generates an observation $y_1, y_2, ..., y_n$. The key assumption is that the next state depends only on the current state. This is a first-order HMM, and the sequence of states is called a Markov chain.
Let's cook up a scenario as an example: the weather on a given day (sunny or raining) influences what I do that day (go shopping or watch a movie at home). The weather conditions form the sequence of states, and my activities form the sequence of observations. One possible sequence of states is {raining, sunny, sunny, raining}, and one possible sequence of observations is {shop, movie, shop, shop}.
Notice that each state (raining or sunny) generates an observation (shop or movie). Now suppose we have no other way of recalling the weather over the last four days, but I tell you what I did on each of them, e.g. {shop, shop, movie, shop}. From there, we are tasked with determining the most likely weather on each day. The Hidden Markov Model gives us a way to tackle this task.
Given a sequence of observations, an HMM lets us infer the most likely corresponding sequence of states. We can also estimate the probabilities of transitioning from one state to another; for example, the probability of going from raining on one day to sunny on the next. Additionally, we can estimate the probability of each state emitting a specific observation, such as the probability that I shop given that the weather is in a sunny state.
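To make the weather example concrete, here is a minimal Viterbi decoding sketch. All of the probabilities below are made up for illustration; they are not estimated from anything. Given the observation sequence {shop, shop, movie, shop}, it recovers the most likely weather sequence.

```python
import numpy as np

states = ['raining', 'sunny']
obs_names = ['shop', 'movie']

# Assumed (invented) parameters for this toy example:
start_p = np.array([0.5, 0.5])          # P(weather on day 1)
trans_p = np.array([[0.6, 0.4],         # P(tomorrow | today), rows = today
                    [0.3, 0.7]])
emit_p  = np.array([[0.9, 0.1],         # P(activity | weather), rows = weather
                    [0.2, 0.8]])

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an integer observation sequence."""
    T, n = len(obs), len(start_p)
    delta = np.zeros((T, n))             # best log-prob of a path ending in each state
    back = np.zeros((T, n), dtype=int)   # backpointers to the best previous state
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n):
            scores = delta[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    path = [int(np.argmax(delta[-1]))]   # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

obs = [0, 0, 1, 0]  # {shop, shop, movie, shop}
print([states[s] for s in viterbi(obs, start_p, trans_p, emit_p)])
# -> ['raining', 'raining', 'sunny', 'raining']
```

With these assumed parameters, the movie day is decoded as sunny and the shopping days as raining, which is exactly the kind of inference we will ask the model to make about pitching states.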
Applying Hidden Markov Models to Baseball to Make Smarter Decisions
We will extend this idea to the problem we are trying to analyze: the third time through the order penalty. I will assume the pitcher moves among three states during a game: a good pitching state, a mediocre pitching state, and a bad pitching state. The observations will be the outcome of each pitch: a strike, a ball, or a hit (single, double, etc.). For each pitcher, we will fit a Hidden Markov Model to the sequence of observations (strike, ball, hit) to determine the most likely sequence of states the pitcher was in on each pitch. From there, we can see how often the pitcher was in a good pitching state, and so on. We can also recover the state transition probabilities and emission probabilities.
As history shows (the 2012 Colorado Rockies season), pulling a pitcher as soon as he finishes facing his 18th batter (the second time through the order) does not produce good results. While a majority of pitchers do suffer from TTOP, certain pitchers actually perform better the third time through the order, so applying one strategy to all pitchers proved unsuccessful. One possible aid for in-game decision-making is to look at each pitcher's state transition probabilities. If John Doe, our starting pitcher, has a very low probability of transitioning out of a bad state, then we can conclude that when John starts to pitch poorly during a game, he is unlikely to recover and return to a state where he is pitching well again. Thus, especially during the third or fourth time through the order, we would take John out of the game as soon as he starts performing poorly.
If we are in the middle of the season and our starting pitchers are pitching their 13th game this week, we can take the sequence of observations for their previous 12 games and look at their state transitioning probabilities. This will help us make smarter decisions during this week's games on whether to pull them out of the game if they start to enter a bad pitching state.
Recovering the State Sequence and Estimating Transition and Emission Probabilities
I will use the hmmlearn package in Python to apply the Hidden Markov Model to this data. I experimented with other options, including Pomegranate, the HMM package in R, Matlab's built-in HMM functions, and my own Python and Matlab code for recovering the state sequence and probabilities. I settled on hmmlearn for its speed and ease of use. I will train the Hidden Markov Model on the entire 2013 season for a specified pitcher. For a quick analysis, I will recover transition and emission probabilities for two pitchers, one who had a successful 2013 season and one who did not: Adam Wainwright and Jeremy Guthrie.
index = (stats_df.before_fip - stats_df.after_fip).sort_values().index
# stats_df.aaname[index[0]]
def hidden_markov(pitcher):
    sql_query = '''
    SELECT
        X.date as date,
        X.event as event,
        pitch.inning as inning,
        X.o as o,
        X.b as b,
        X.s as s,
        X.pitcher_name as pitcher_name,
        X.batter_name as batter,
        pitch.type as type,
        pitch.x0 as x_location,
        pitch.z0 as z_location
    FROM (
        SELECT *
        FROM atbat
        WHERE pitcher_name = ? AND (date > '2013_03_30' AND date < '2013_09_29')
    ) AS X
    INNER JOIN pitch ON X.gameday_link = pitch.gameday_link AND X.num = pitch.num
    ORDER BY date, inning, id, o ASC
    '''
    player_df = pd.read_sql_query(sql_query, con = conn, params = [pitcher])
    out_plays = ['Groundout', 'Strikeout', 'Lineout', 'Sac Bunt', 'Pop Out',
                 'Flyout', 'Sac Fly', 'Forceout',
                 'Runner Out', 'Fielders Choice Out', 'Bunt Groundout',
                 'Fielders Choice', 'Bunt Pop Out',
                 'Bunt Lineout',
                 'Double Play', 'Grounded Into DP', 'Strikeout - DP',
                 'Sac Fly DP', 'Sacrifice Bunt DP',
                 'Triple Play']
    # Relabel balls in play ('X') that became outs as strikes, so that 'X' means a hit:
    observations = [str(player_df.type[i]) if player_df.type[i] != 'X'
                    else 'S' if ((player_df.event[i] in out_plays) & (player_df.type[i] == 'X'))
                    else 'X'
                    for i in range(len(player_df))]
    # NOTE: the next line overrides the relabeling above and uses the raw
    # B (ball) / S (strike) / X (in play) pitch types as the observations.
    observations = player_df.type
    le = LabelEncoder()
    obs = le.fit_transform(observations)
    obs = obs.reshape(len(observations), 1)
    lengths = [len(player_df[player_df.date == d]) for d in player_df.date.unique()]
    dates = player_df.date.unique()
    hmm = MultinomialHMM(n_components = 3, n_iter = 1000, tol = 1e-5)
    hmm.startprob_ = np.array([1/4., 1/2., 1/4.])
    hmm.fit(obs, lengths)
    states = hmm.decode(obs, lengths)[1]
    joblib.dump(hmm, pitcher + ".pkl")  # Save model
    hmmdf = pd.DataFrame({'states': states,
                          'type': player_df.type,
                          'date': player_df.date,
                          'batter': player_df.batter,
                          'event': player_df.event,
                          'inning': player_df.inning
                          }).reset_index()
    return player_df, hmmdf, hmm.transmat_, hmm.emissionprob_
def get_plot(Pitcher):
    player_df, hmmdf, A, b = hidden_markov(Pitcher)
    df_A = pd.DataFrame(A, columns = ['State 0', 'State 1', 'State 2']) * 100
    df_A['States'] = ['State 0', 'State 1', 'State 2']
    df_A.set_index('States', drop = True, inplace = True)
    table1 = FF.create_table(df_A, index = True, index_title = 'Transition Matrix')
    # plotly.offline.iplot(table1)
    py.iplot(table1)
    df_b = pd.DataFrame(b, columns = ['Ball', 'Strike', 'Hit']) * 100
    df_b['States'] = ['State 0', 'State 1', 'State 2']
    df_b.set_index('States', drop = True, inplace = True)
    table2 = FF.create_table(df_b, index = True, index_title = 'Emission Matrix')
    # plotly.offline.iplot(table2)
    py.iplot(table2)
    data = []
    button = []
    dates = player_df.date.unique()
    n = len(dates)
    button.append(dict(args = ['visible', [1] * n], label = 'All', method = 'restyle'))
    for i in range(0, n):
        t = Scatter(x = range(1, len(get_game(hmmdf, i)) + 1), y = get_game(hmmdf, i).states,
                    name = 'Game ' + str(i + 1))
        show = [i == j for j in range(n)]  # Show only game i when its button is selected
        button.append(dict(args = ['visible', show], label = dates[i] + ' / Game ' + str(i + 1),
                           method = 'restyle'))
        data.append(t)
    layout = Layout(
        title = 'States Chain',
        xaxis = dict(title = 'Pitch'),
        yaxis = dict(title = 'State'),
        updatemenus = list([
            dict(
                x = -0.05,
                y = 1,
                yanchor = 'top',
                buttons = button
            )
        ])
    )
    fig = Figure(data = data, layout = layout)
    # plotly.offline.iplot(fig)
    return table1, table2, fig
Decisions, Decisions
Below are the tables of the transition and emission matrices for Adam Wainwright and Jeremy Guthrie. In the transition matrix, the row gives the current state and the column gives the next state; for example, the probability of Adam Wainwright transitioning from State 0 to State 1 is 61.18%. In the emission matrix, each entry is the probability of the row's state emitting the column's observation; for example, if Adam Wainwright is currently in state 0, he has a 60.5% chance of throwing a ball and a 16.3% chance of giving up a hit.
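As a reading aid, here is how the row/column convention works on a small transition matrix. The numbers below are hypothetical stand-ins, not the fitted values shown in the tables: each row is a probability distribution over the next state.

```python
import numpy as np

# Hypothetical 3-state transition matrix: rows = current state, columns = next state.
trans = np.array([[0.20, 0.61, 0.19],
                  [0.15, 0.70, 0.15],
                  [0.44, 0.56, 0.00]])

# Every row must sum to 1, since the pitcher has to land in some state next.
assert np.allclose(trans.sum(axis=1), 1.0)

# Reading off an entry: P(next = State 1 | current = State 0)
print(trans[0, 1])  # 0.61
```

The emission matrix is read the same way, except the columns index observations (Ball, Strike, Hit) rather than next states.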
Adam Wainwright
table1, table2, fig = get_plot('Adam Wainwright')
py.iplot(table2)
py.iplot(table1)
Notice that the states here are numbered, as opposed to the labels we assumed earlier: good, mediocre, bad. Because this is a mathematical model, the states had to be encoded numerically to recover the probabilities and the sequence of states. However, from the tables we can infer which state is the bad state, which is the mediocre state, and which is the good state.
First, look at Wainwright's emission matrix. If he is in state 0, he has a 60.5% chance of throwing a ball, the highest of all the states, so this is the mediocre state. Next, if he is in state 1, he is 68.78% likely to throw a strike, the highest percentage of any state; state 1 will be our good state. Finally, if he is in state 2, he is 58.89% likely to give up a hit, the highest probability of any state, so state 2 will be our bad state. So for Wainwright we have established:
State 2 = Bad State
State 0 = Mediocre State
State 1 = Good State
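This labeling step can be automated by taking the argmax of each state's emission row. The matrix below uses hypothetical numbers loosely patterned on the quoted percentages, not the fitted matrix itself, and the mapping from dominant emission to label (Ball = mediocre, Strike = good, Hit = bad) is the assumption described above.

```python
import numpy as np

obs_names = ['Ball', 'Strike', 'Hit']
label_for_obs = {'Ball': 'Mediocre', 'Strike': 'Good', 'Hit': 'Bad'}  # assumed mapping

# Hypothetical emission matrix: rows = states 0-2, columns = Ball, Strike, Hit.
emission = np.array([[0.605, 0.232, 0.163],
                     [0.212, 0.688, 0.100],
                     [0.241, 0.170, 0.589]])

for s in range(3):
    dominant = obs_names[int(np.argmax(emission[s]))]
    print('State %d -> %s state (most likely emission: %s)' % (s, label_for_obs[dominant], dominant))
```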
Next, observe his transition matrix. When he is in his bad state (state 2), he has a 0% chance of self-transitioning, i.e. of remaining in state 2. Instead, he is most likely to transition to state 1 (56.03%), his good state. When he is in his mediocre state, he also has the highest probability of transitioning to state 1, his good state, at 61.18%. This tells us that when Wainwright falls into a bad or mediocre state during a game, he is most likely to leave it and return to his good state.
Below are plots of the Markov chains, the sequence of states for each game of the season. Feel free to use the drop-down menu to navigate to any individual game and see the sequence of states for the pitcher.
py.iplot(fig)
Jeremy Guthrie
table1, table2, fig = get_plot('Jeremy Guthrie')
py.iplot(table2)
py.iplot(table1)
If we do the same analysis for Jeremy Guthrie, we can conclude that his corresponding states are:
State 2 = Bad State
State 0 = Mediocre State
State 1 = Good State
Looking at his transition matrix, we can see that when Guthrie is in his bad state (state 2), he has a 17.69% chance of staying there. The biggest difference from Wainwright is that when Guthrie is in either his good or his mediocre state, his highest transition probability is back into his bad state. In fact, when Guthrie is in his good state, he has only a 17.8% chance of staying in it. And unlike Wainwright, who has a 0% chance of staying in his bad state, Guthrie has a 17.69% chance of staying in his. The transition and emission matrices draw a clear distinction between these two pitchers.
py.iplot(fig)
Results
Running the HMM on Adam Wainwright and Jeremy Guthrie shows a clear distinction in their transition and emission matrices. Wainwright's transition probabilities favored returning to his good state; Guthrie's matrices did not show the same feature. From Guthrie's transition matrix, we see that he can struggle, transitioning into his bad state often. The differences in their HMM results offer one explanation for their season results: in 2013, Wainwright had a 2.55 FIP and 2.94 ERA (Earned Run Average), while Guthrie had a 4.79 FIP and 4.04 ERA.
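One compact way to summarize the contrast is a back-of-the-envelope calculation using the self-transition probabilities quoted above: if a pitcher stays in a state with probability p on each pitch, the number of consecutive pitches spent in that state is geometric with mean 1/(1 - p).

```python
def expected_run_length(p_stay):
    """Expected consecutive pitches in a state, given its self-transition probability."""
    return 1.0 / (1.0 - p_stay)

# Self-transition probabilities of the bad state, as quoted above:
print(expected_run_length(0.0))     # Wainwright: expected to leave the bad state after 1 pitch
print(expected_run_length(0.1769))  # Guthrie: ~1.21 pitches on average
```

The per-pitch difference looks small, but combined with Guthrie's high probability of transitioning back into the bad state from his other states, it compounds over a start.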
More Analysis:
I will do more analysis soon. We still need to look at more pitchers to see whether their transition and emission matrices line up with their season results. As time permits, I will run more comparisons and produce more charts. I am excited about these results, as they show promise as an aid for smarter in-game decisions, and I am interested in extending this analysis to relievers to measure their effectiveness.