Using Hidden Markov Models to Make Smarter Pitching Decisions

Posted on Sun 08 January 2017 in Projects

Analysis Topic:

One topic that has been discussed recently in baseball is the third time through the order penalty (TTOP). The TTOP is the theory that a pitcher becomes less effective once he has gone through the batting lineup twice and starts facing each batter for the third time.

This topic has come up several times throughout the years. Here are a few links to articles discussing TTOP along with a group of interviews conducted with team general managers and their thoughts on the penalty:

http://www.sabermagician.com/visiting-the-fatigue-effect-in-starting-pitchers-the-times-through-the-order-penalty/

http://www.businessinsider.com/ap-third-time-through-the-order-can-be-tough-to-navigate-2016-4

http://sabr.org/latest/lichtman-pitch-types-and-times-through-order-penalty

Interviews: http://www.fangraphs.com/blogs/managers-on-the-third-time-through-the-order/

In 2012, a study was presented at the Sloan Conference by MIT. The Colorado Rockies took the idea of the third time through the order penalty and experimented with pulling their pitchers once they reached the third time through the order. They called this strategy 'paired pitching': use two pitchers, each going through the order at most twice. The experiment did not go well, as the Rockies won only 64 of 162 games that season.

From then on, the topic of TTOP has been somewhat controversial among sabermetricians and baseball professionals.

Goal:

The goal of this analysis is to explore the third time through the order penalty. I will use data published by MLB to, first, find evidence that pitchers do exhibit this penalty and, second, use a statistical model, namely the Hidden Markov Model, to explore an analytical approach teams can take to deal with the penalty, if it exists.

Data

The data is parsed using the pitchRx package in R. This package was written by Carson Sievert and parses the XML files published by MLB. I will parse the 2013 and 2014 seasons. The data is parsed and stored in a sqlite3 database. The parser creates several tables; I will only use two of them, atbat and pitch. The atbat table has features on each plate appearance and the pitch table has features on each pitch.

Remark: There is no particular reason I chose the 2013 and 2014 seasons; any season or seasons would do. For two seasons, the parser takes a little over an hour to parse the data and create the database, so grabbing more data would require a longer run time.
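To make the relationship between the two tables concrete, here is a minimal sketch of how pitch rows attach to plate appearances. It uses an in-memory SQLite database with made-up rows; the table names and the join keys (gameday_link and num) follow the query used later in this post, while the pitcher name "John Doe", the game id, and the trimmed column list are assumptions for illustration only.

```python
import sqlite3

# In-memory toy database. Table and key names mirror atbat/pitch from the
# post; every row below is invented for illustration.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE atbat (gameday_link TEXT, num INTEGER, pitcher_name TEXT, event TEXT);
    CREATE TABLE pitch (gameday_link TEXT, num INTEGER, id INTEGER, type TEXT);
    INSERT INTO atbat VALUES ('gid_demo_1', 1, 'John Doe', 'Strikeout');
    INSERT INTO atbat VALUES ('gid_demo_1', 2, 'John Doe', 'Single');
    INSERT INTO pitch VALUES ('gid_demo_1', 1, 3, 'S');
    INSERT INTO pitch VALUES ('gid_demo_1', 1, 4, 'S');
    INSERT INTO pitch VALUES ('gid_demo_1', 1, 5, 'S');
    INSERT INTO pitch VALUES ('gid_demo_1', 2, 9, 'B');
    INSERT INTO pitch VALUES ('gid_demo_1', 2, 10, 'X');
""")

# One output row per pitch, carrying the outcome of its plate appearance.
rows = conn_demo.execute("""
    SELECT atbat.num, atbat.event, pitch.id, pitch.type
    FROM atbat
    INNER JOIN pitch
        ON atbat.gameday_link = pitch.gameday_link AND atbat.num = pitch.num
    WHERE atbat.pitcher_name = ?
    ORDER BY pitch.id
""", ("John Doe",)).fetchall()

for row in rows:
    print(row)
```

Each plate appearance fans out into as many rows as it had pitches, which is why the pitch-level analysis later works with far more rows than the atbat-level statistics.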

In [1]:
from sqlalchemy import create_engine
import pandas as pd
import sqlite3
import numpy as np


pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from sklearn.externals import joblib

from hmmlearn.hmm import MultinomialHMM
from hmmlearn.hmm import GaussianHMM
from pomegranate import *



import plotly.plotly as py
# plotly.offline.init_notebook_mode()
from plotly.graph_objs import *
from plotly.tools import FigureFactory as FF


%load_ext rpy2.ipython
# %load_ext pymatbridge
np.set_printoptions(precision = 5, suppress = True)
In [2]:
# Parse using pitchRx package
# %%R
# library('DBI')
# library('dplyr')
# library('ggplot2')
# library('pitchRx')
# db <- src_sqlite("pitchRx1314.sqlite3", create = TRUE)
# scrape(start = "2013-01-01",end = "2014-12-01", connect = db$con)
# scrape(start = "2013-01-01", end = "2014-12-01", suffix = files, connect = db$con)
In [3]:
conn = sqlite3.connect('pitchRx1314.sqlite3')

Preview of the atbat table:

In [4]:
pd.read_sql_query('''
    SELECT *
    FROM atbat
    LIMIT 10
    ''', conn).head(4)
Out[4]:
pitcher batter num b s o start_tfs start_tfs_zulu stand b_height p_throws atbat_des atbat_des_es event score home_team_runs away_team_runs url inning_side inning next_ event2 event3 batter_name pitcher_name event4 gameday_link date event_num event_es event2_es event3_es event4_es
0 493157.000 570805.000 21.000 0.000 3.000 1.000 185159 2013-02-24T18:51:59Z L 6-2 R Oscar Taveras strikes out swinging. Oscar Taveras se poncha tirándole. Strikeout None None None http://gd2.mlb.com/components/game/mlb/year_20... bottom 3.000 Y None None Oscar Taveras Koji Uehara None gid_2013_02_24_bosmlb_slnmlb_1 2013_02_24 None None None None None
1 493157.000 434624.000 22.000 0.000 0.000 3.000 192028 2013-03-03T19:20:28Z R 5-11 R Jayson Nix flies out to center fielder Jacoby ... Jayson Nix batea elevado de out a jardinero ce... Flyout None None None http://gd2.mlb.com/components/game/mlb/year_20... top 4.000 Y None None Jayson Nix Koji Uehara None gid_2013_03_03_nyamlb_bosmlb_1 2013_03_03 None None None None None
2 493157.000 461858.000 50.000 0.000 0.000 3.000 200446 2013-03-07T20:04:46Z R 6-2 R Trevor Plouffe grounds out, third baseman Drew... Trevor Plouffe batea rodado de out, tercera ba... Groundout None None None http://gd2.mlb.com/components/game/mlb/year_20... bottom 5.000 Y None None Trevor Plouffe Koji Uehara None gid_2013_03_07_bosmlb_minmlb_1 2013_03_07 None None None None None
3 493157.000 425491.000 49.000 0.000 0.000 2.000 200223 2013-03-07T20:02:23Z L 6-1 R Ryan Doumit grounds out to first baseman Mike ... Ryan Doumit batea rodado de out a primera base... Groundout None None None http://gd2.mlb.com/components/game/mlb/year_20... bottom 5.000 Y None None Ryan Doumit Koji Uehara None gid_2013_03_07_bosmlb_minmlb_1 2013_03_07 None None None None None

Preview of the pitch table:

In [5]:
pd.read_sql_query('''
    SELECT *
    FROM pitch
    LIMIT 2000
    ''', conn).iloc[1000:1005]
Out[5]:
des des_es id type tfs tfs_zulu x y sv_id start_speed end_speed sz_top sz_bot pfx_x pfx_z px pz x0 y0 z0 vx0 vy0 vz0 ax ay az break_y break_angle break_length pitch_type type_confidence zone nasty spin_dir spin_rate cc mt url inning_side inning next_ num on_1b on_2b on_3b count gameday_link event_num play_guid
1000 In play, out(s) En juego, out(s) 521.000 X 221041 2013-02-22T22:10:41Z 127.040 140.740 130222_151027 89.300 81.200 3.380 1.760 -9.530 9.900 -0.995 2.571 -1.499 50.000 6.330 4.456 -130.727 -6.660 -16.225 31.717 -15.251 23.700 42.000 5.400 FT 0.910 11.000 None 223.793 2601.810 http://gd2.mlb.com/components/game/mlb/year_20... bottom 8.000 Y 64.000 552662.000 nan nan 1-0 gid_2013_02_22_sdnmlb_seamlb_1 None None
1001 Called Strike Strike cantado 529.000 S 221347 2013-02-22T22:13:47Z 102.150 165.780 130222_151347 90.100 82.800 3.440 1.560 -13.100 3.880 0.067 1.681 -2.798 50.000 5.226 11.847 -131.608 -4.321 -22.903 28.689 -25.322 23.800 37.900 7.700 FT 1.007 8.000 None 253.344 2635.282 http://gd2.mlb.com/components/game/mlb/year_20... top 9.000 N 65.000 nan nan nan 0-0 gid_2013_02_22_sdnmlb_seamlb_1 None None
1002 Foul Foul 530.000 S 221359 2013-02-22T22:13:59Z 103.860 155.420 130222_151359 91.800 83.900 3.380 1.500 -13.720 3.500 -0.137 2.120 -2.878 50.000 5.279 11.923 -134.062 -3.448 -24.769 31.130 -25.775 23.700 39.800 7.800 FT 0.840 8.000 None 255.514 2768.439 http://gd2.mlb.com/components/game/mlb/year_20... top 9.000 N 65.000 nan nan nan 0-1 gid_2013_02_22_sdnmlb_seamlb_1 None None
1003 Foul Foul 531.000 S 221426 2013-02-22T22:14:26Z 87.550 132.110 130222_151426 94.400 86.700 3.380 1.500 -10.040 9.010 0.445 2.848 -2.722 50.000 5.262 12.172 -137.854 -3.860 -19.329 30.496 -14.756 23.800 46.900 4.800 FF 1.312 3.000 None 227.976 2738.270 http://gd2.mlb.com/components/game/mlb/year_20... top 9.000 N 65.000 nan nan nan 0-2 gid_2013_02_22_sdnmlb_seamlb_1 None None
1004 Ball Bola mala 532.000 B 221444 2013-02-22T22:14:44Z 63.520 164.060 130222_151445 89.500 82.500 3.390 1.540 -1.550 7.670 1.140 1.585 -2.694 50.000 5.414 10.404 -130.717 -6.217 -2.675 28.076 -18.849 23.800 1.600 4.700 FC 1.018 14.000 None 191.350 1508.401 http://gd2.mlb.com/components/game/mlb/year_20... top 9.000 N 65.000 nan nan nan 0-2 gid_2013_02_22_sdnmlb_seamlb_1 None None

We will be doing a lot of dataset slicing and calculating baseball statistics, so I will write functions to avoid repeating the process. The following functions will be used:

get_split_index: Function to find the row index that separates the first two times through the order from the third time through the order.

split_3rd: Function that splits the dataset into two: one that includes only the first two times through the order and another that includes only the third time through the order and beyond. This function works on either one game or multiple games (or an entire season).

innings_pitched: Returns the number of innings pitched.

strikeout_ratio: Calculates the strikeout ratio per 9 innings.

fip: Calculates the FIP value.

whip: Calculates the WHIP.

walks_ratio: Calculates the walks ratio per 9 innings.

Equations:

$$ \text{FIP} = \frac{13 \cdot \text{Home Runs} + 3 \cdot (\text{Walks} + \text{Hit By Pitch}) - 2 \cdot \text{Strikeouts}}{\text{Innings Pitched}} + \text{FIP Constant},\quad \text{FIP Constant} = 3.10 $$

$$ \text{WHIP} = \frac{\text{Walks} + \text{Hits}}{\text{Innings Pitched}} $$

$$ \text{K/9} = 9 \cdot \frac{\text{Strikeouts}}{\text{Innings Pitched}} $$

$$ \text{BB/9} = 9 \cdot \frac{\text{Walks}}{\text{Innings Pitched}} $$
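
As a quick sanity check on these formulas, here is a small worked example. The stat line (200 innings, 15 home runs, 40 walks, 5 hit batters, 180 strikeouts, 170 hits) is hypothetical, and the helper names are made up for this sketch; they are not the functions defined in the next cell.

```python
def fip_from_counts(hr, bb, hbp, k, ip, fip_constant=3.10):
    # FIP as written above, with the same 3.10 constant
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / float(ip) + fip_constant

def whip_from_counts(bb, hits, ip):
    return (bb + hits) / float(ip)

def per_nine(count, ip):
    # Shared form of K/9 and BB/9
    return 9.0 * count / float(ip)

# Hypothetical season line, for illustration only
print(fip_from_counts(hr=15, bb=40, hbp=5, k=180, ip=200.0))  # (195 + 135 - 360) / 200 + 3.10
print(whip_from_counts(bb=40, hits=170, ip=200.0))            # 210 / 200
print(per_nine(180, 200.0))                                   # K/9
print(per_nine(40, 200.0))                                    # BB/9
```

With these numbers the FIP works out to about 2.95, the WHIP to 1.05, K/9 to 8.1, and BB/9 to 1.8.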
In [6]:
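def get_split_index(df):
    # Count batter changes between consecutive rows; the 18th change marks
    # the start of the third time through the order.
    # Note: if a multi-row game ends before 18 changes are counted, the loop
    # exits with i = len(df) - 2, so the final row still lands in the
    # "after" split.
    count = 0
    i = 0
    for i in range(len(df) - 1):
        batter_old = df.batter[i]
        batter_new = df.batter[i + 1]
        if batter_old != batter_new:
            count += 1
        if count == 18:
            break
    return (i + 1)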

def split_3rd(df):
    before_3rd = pd.DataFrame()
    after_3rd = pd.DataFrame()
    dates = df.date.unique()
    for d in dates:
        game_df = df[df.date == d].reset_index(drop = True)
        split_index = get_split_index(game_df) # Get row number that splits third time through order
        before_3rd = before_3rd.append(game_df.iloc[0:split_index,:]) # Append dataframes
        after_3rd = after_3rd.append(game_df.iloc[split_index:,:])
        
    return before_3rd, after_3rd.reset_index(drop = True)


def innings_pitched(df):
    one_out_plays = ['Groundout','Strikeout','Lineout','Sac Bunt','Pop Out','Flyout','Sac Fly','Forceout',
                     'Runner Out','Fielders Choice Out','Bunt Groundout','Fielders Choice','Bunt Pop Out','Sac Fly',
                    'Bunt Lineout']
    two_out_plays = ['Double Play','Grounded Into DP','Strikeout - DP','Sac Fly DP','Sacrifice Bunt DP']
    three_out_plays = ['Triple Play']
    dates = list(df.date.unique())
    ip = 0
    for d in dates:
        max_inning = max(df[df.date == d].inning).astype(int)
        game_ip = 0
        for i in range(1, (max_inning + 1)) :
            df1 = df[(df.date == d) & (df.inning == i)]
            one_outs = sum(df1.event.isin(one_out_plays))
            two_outs = 2 * sum(df1.event.isin(two_out_plays))
            three_outs = 3 * sum(df1.event.isin(three_out_plays))
            game_ip += (one_outs + two_outs + three_outs)/3.0
            ip += (one_outs + two_outs + three_outs)/3.0
    return ip

def strikeout_ratio(df):
    return 9.0 * sum(df.event.isin(['Strikeout','Strikeout - DP']))/innings_pitched(df)

def fip(df):
    hr = sum(df.event.isin(['Home Run']))
    bb = sum(df.event.isin(['Walk']))
    hbp = sum(df.event.isin(['Hit By Pitch']))
    k = sum(df.event.isin(['Strikeout','Strikeout - DP']))
    ip = innings_pitched(df)
    return (13 * hr + 3 * (bb + hbp) - 2*k)/ip + 3.10

def whip(df):
    bb = sum(df.event.isin(['Walk']))
    hits = sum(df.event.isin(['Single','Home Run','Double','Triple']))
    return (bb + hits)/innings_pitched(df)

def walks_ratio(df):
    return sum(df.event.isin(['Walk']))/innings_pitched(df) * 9.0

def get_game(df, game):
    return df[df.date == df.date.unique()[game]].reset_index(drop = True).drop('index', axis = 1)
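
The splitting idea can also be sketched without pandas. The following is an illustrative variant, not the code used for the results in this post: it takes a plain list of batter ids for one game (one entry per plate appearance) and returns an empty "after" list when the pitcher never faces a 19th batter. The function name and the toy lineups are assumptions.

```python
def split_after_n_batters(batter_ids, n_batters=18):
    """Split one game's plate appearances into (before, after), where
    `before` covers the first n_batters batters faced."""
    faced = 0
    previous = None
    for idx, batter in enumerate(batter_ids):
        if batter != previous:
            faced += 1          # a new batter has stepped in
            previous = batter
        if faced > n_batters:
            # idx is the first plate appearance of the third time through
            return batter_ids[:idx], batter_ids[idx:]
    return batter_ids, []       # never reached the third time through

# Toy lineup of nine batter ids: two full turns plus three more batters
long_outing = list(range(1, 10)) * 2 + [1, 2, 3]
before, after = split_after_n_batters(long_outing)
print(len(before), after)

# A short outing of 11 batters has nothing in the "after" split
short_before, short_after = split_after_n_batters(list(range(1, 10)) + [1, 2])
print(len(short_before), short_after)
```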

Find Evidence:

First, we will try to find evidence that pitchers are affected by the third time through the order penalty. I will take the starting pitchers who started 10 or more games in the 2013 season and split each pitcher's data into one set covering the first two times through the order and another covering the third time through the order and beyond. Then I calculate their stats (FIP, WHIP, strikeouts per 9, walks per 9) for each set and determine how many pitchers exhibit the penalty. I grabbed the list of pitchers from Baseball-Reference:

http://www.baseball-reference.com/leagues/MLB/2013-standard-pitching.shtml

In [7]:
pitchers = list(pd.read_csv('pitchers.csv')['Name'])
In [8]:
before_fip = []
after_fip = []
before_whip = []
after_whip = []
before_strikeout = []
after_strikeout = []
before_walks = []
after_walks = []
pitch = []
for pitcher in pitchers:
#     print pitcher
    sql_query = '''
    SELECT *
    FROM atbat
    WHERE pitcher_name == ? AND (date >  '2013_03_31' AND date < '2013_09_29' )
    '''
    player_bat_df = pd.read_sql_query(sql_query, con = conn, params = [pitcher])
    if len(player_bat_df) == 0:
        continue

    player_bat_df = player_bat_df.sort_values(['date', 'num']).reset_index(drop = True)
    
    before,after = split_3rd(player_bat_df)
    
    before_fip.append(fip(before))
    after_fip.append(fip(after))
    before_whip.append(whip(before))
    after_whip.append(whip(after))
    before_strikeout.append(strikeout_ratio(before))
    after_strikeout.append(strikeout_ratio(after))
    before_walks.append(walks_ratio(before))
    after_walks.append(walks_ratio(after))
    pitch.append(pitcher)
    
stats_df = pd.DataFrame({'aaname':pitch,
                        'before_fip':before_fip,
                        'after_fip':after_fip,
                       'before_whip':before_whip,
                       'after_whip':after_whip,
                       'before_strikeout':before_strikeout,
                       'after_strikeout':after_strikeout,
                       'before_walks':before_walks,
                       'after_walks':after_walks})
                            
In [9]:
cols = ['aaname','before_fip','after_fip','before_whip','after_whip',
       'before_strikeout','after_strikeout','before_walks','after_walks']

stats_df[cols].head()
Out[9]:
             aaname  before_fip  after_fip  before_whip  after_whip  before_strikeout  after_strikeout  before_walks  after_walks
0   Adam Wainwright       2.595      2.544        1.043       1.100             8.806            7.089         1.400        0.938
1     James Shields       3.542      3.498        1.339       1.108             8.453            6.723         2.965        2.277
2  Justin Verlander       3.122      3.932        1.358       1.309             8.803            8.959         3.745        1.964
3       R.A. Dickey       4.036      5.744        1.259       1.208             7.307            6.750         2.539        3.432
4   Hisashi Iwakuma       3.678      2.970        1.036       0.909             7.325            8.178         1.513        1.688
In [10]:
print 'Percentage of Pitchers Affected:'
print 'FIP:', len(stats_df[stats_df.before_fip < stats_df.after_fip])/float(len(stats_df))
print 'WHIP:',len(stats_df[stats_df.before_whip < stats_df.after_whip])/float(len(stats_df))
print 'Strikeout per9:',len(stats_df[stats_df.before_strikeout > stats_df.after_strikeout])/float(len(stats_df))
print 'Walk ratio per9:',len(stats_df[stats_df.before_walks < stats_df.after_walks])/float(len(stats_df))
Percentage of Pitchers Affected:
FIP: 0.634920634921
WHIP: 0.68253968254
Strikeout per9: 0.785714285714
Walk ratio per9: 0.619047619048

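As we can see, a majority of the starting pitchers in the 2013 season are affected by TTOP on every metric, ranging from about 62% (walks per 9) to about 79% (strikeouts per 9). Stacked bar charts of FIP and WHIP for the first 20 pitchers can be seen below.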
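
To show how such a share is computed, here is a minimal sketch that uses only the five rows displayed in Out[9]. Because it covers five pitchers rather than the full list, its shares differ from the figures printed above; the variable names are made up for this example.

```python
# The five rows shown in Out[9]:
# (name, before_fip, after_fip, before_strikeout, after_strikeout)
sample = [
    ("Adam Wainwright",  2.595, 2.544, 8.806, 7.089),
    ("James Shields",    3.542, 3.498, 8.453, 6.723),
    ("Justin Verlander", 3.122, 3.932, 8.803, 8.959),
    ("R.A. Dickey",      4.036, 5.744, 7.307, 6.750),
    ("Hisashi Iwakuma",  3.678, 2.970, 7.325, 8.178),
]

# A pitcher counts as "affected" when FIP rises or K/9 falls after the split
fip_worse = [after_fip > before_fip for _, before_fip, after_fip, _, _ in sample]
k9_worse = [after_k9 < before_k9 for _, _, _, before_k9, after_k9 in sample]

share_fip = sum(fip_worse) / float(len(sample))
share_k9 = sum(k9_worse) / float(len(sample))
print(share_fip)
print(share_k9)
```

In this small sample, two of five pitchers have a higher FIP and three of five have a lower strikeout rate after the split, which is a reminder that individual pitchers can move in either direction even when the majority shows the penalty.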

In [11]:
x_trace = []
y_before = []
y_after = []
for i in range(20):
    pitcher_name = stats_df.aaname[i]
    x_trace.append(pitcher_name)
    y_before.append(stats_df.ix[stats_df.aaname == pitcher_name, 'before_fip'][i])
    y_after.append(stats_df.ix[stats_df.aaname == pitcher_name, 'after_fip'][i])
trace1 = Bar(x = x_trace, y = y_before, name = 'Before TTO', marker = dict(color = 'rgb(49,130,189)'))
trace2 = Bar(x = x_trace, y = y_after, name = 'After TTO', marker = dict(color = 'rgb(150,0,0)'))
layout = Layout(title = 'Before and After Third Time Through Order FIP',
                xaxis = dict(title = 'Pitcher'), yaxis = dict(title = 'FIP'), barmode = 'stack' )
data = [trace1, trace2]
fig1 = Figure(data = data, layout = layout)
# plotly.offline.iplot(fig)

x_trace = []
y_before = []
y_after = []
for i in range(20):
    pitcher_name = stats_df.aaname[i]
    x_trace.append(pitcher_name)
    y_before.append(stats_df.ix[stats_df.aaname == pitcher_name, 'before_whip'][i])
    y_after.append(stats_df.ix[stats_df.aaname == pitcher_name, 'after_whip'][i])
trace1 = Bar(x = x_trace, y = y_before, name = 'Before TTO', marker = dict(color = 'rgb(49,130,189)'))
trace2 = Bar(x = x_trace, y = y_after, name = 'After TTO', marker = dict(color = 'rgb(150,0,0)'))
layout = Layout(title = 'Before and After Third Time Through Order WHIP',
                xaxis = dict(title = 'Pitcher'), yaxis = dict(title = 'WHIP'), barmode = 'stack' )
data = [trace1, trace2]
fig2 = Figure(data = data, layout = layout)
In [12]:
py.iplot(fig1)
Out[12]:
In [13]:
py.iplot(fig2)
Out[13]:

Hidden Markov Model

A Hidden Markov Model (HMM) is composed of a sequence of hidden states $s_1, s_2, s_3, ..., s_n$, where each state generates an observation, $y_1, y_2,..., y_n$. The key assumption of a hidden Markov model is that the next state depends only on the current state. This is considered a first-order HMM, and the sequence of states is called a Markov chain.
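
Written out, the first-order assumption means the joint probability of a state sequence and its observations factors into a starting term, one transition term per step, and one emission term per observation. This is the standard HMM factorization, stated here in the notation used above:

$$ P(s_1, \ldots, s_n, y_1, \ldots, y_n) = P(s_1) \prod_{t=2}^{n} P(s_t \mid s_{t-1}) \prod_{t=1}^{n} P(y_t \mid s_t) $$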

Let's cook up a scenario for an example: the weather (sunny or raining) on a given day influences what I do on that day (go shopping or watch a movie at home). Here the weather conditions form the sequence of states, and my actions form the sequence of observations. One possible sequence of states is {raining, sunny, sunny, raining}, and one possible sequence of observations is {shop, movie, shop, shop}.

Notice that each state (raining or sunny) generates an observation (shop or movie). Now, let's say we have no way of recalling what the weather was over the last four days, but I tell you what I did on each of those days, i.e. {shop, shop, movie, shop}. From there, we are tasked with determining the most likely weather on each day. The hidden Markov model is a method that allows us to tackle this task.

The hidden Markov model assumes we have a sequence of observations, from which we can infer the most likely corresponding sequence of states. We can also estimate the probability of one state transitioning to another; for example, the probability of going from raining on one day to sunny on the next. Additionally, we can estimate the probability of each state emitting a specific observation; for example, the probability that I shop given that the weather is in the sunny state.
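
To make the weather example concrete, here is a minimal sketch of the Viterbi algorithm, the standard way to recover the most likely state sequence once the probabilities are known. All of the start, transition, and emission probabilities below are invented for illustration; in practice they would be estimated from data.

```python
import math

states = ["raining", "sunny"]

# Hypothetical probabilities, chosen only for this example
start_p = {"raining": 0.5, "sunny": 0.5}
trans_p = {
    "raining": {"raining": 0.6, "sunny": 0.4},
    "sunny":   {"raining": 0.3, "sunny": 0.7},
}
emit_p = {
    "raining": {"shop": 0.1, "movie": 0.9},
    "sunny":   {"shop": 0.8, "movie": 0.2},
}

def viterbi(observations):
    """Return the most likely state sequence for the observations."""
    # best[t][s] = (log-prob of the best path ending in s at time t, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that gives the highest score for landing in s
            score, prev = max((best[-1][p][0] + math.log(trans_p[p][s]), p)
                              for p in states)
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

print(viterbi(["shop", "shop", "movie", "shop"]))
# → ['sunny', 'sunny', 'raining', 'sunny']
```

Under these made-up numbers, the movie day is decoded as a rainy day and the shopping days as sunny days, which matches the intuition behind the example.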

Applying Hidden Markov Model to Baseball to Make Smarter Decisions

We will extend this idea to the problem we are trying to analyze: the third time through the order penalty. I will assume that the pitcher has three states during a game: good pitching state, mediocre pitching state, and bad pitching state. The observation sequence will consist of: the pitch was a strike, the pitch was a ball, and the pitch resulted in a hit (single, double, etc.). For each pitcher, we will apply the hidden markov model to the sequence of observations (strike, ball, hit) to determine the most likely sequence of states that the pitcher was in before each pitch. From here, we can see how often the pitcher was in a good pitching state and so on. Additionally, we can recover the state transitioning probabilities and emission probabilities.
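
Before a model can be fit, the pitch outcomes have to be turned into integer codes, and the model needs to know where one game ends and the next begins. The sketch below mimics that preparation in plain Python on made-up records. Sorting the labels reproduces the ordering that scikit-learn's LabelEncoder gives for strings, so 'B', 'S', 'X' map to 0, 1, 2; the dates and pitches are assumptions.

```python
# Hypothetical pitch-level records: (game date, pitch type)
pitches = [
    ("2013_04_01", "B"), ("2013_04_01", "S"), ("2013_04_01", "X"),
    ("2013_04_06", "S"), ("2013_04_06", "S"),
]

# Sorted unique labels get codes 0, 1, 2, ...
classes = sorted(set(ptype for _, ptype in pitches))
code = dict((label, i) for i, label in enumerate(classes))

# One row per pitch with a single column, the shape used for the HMM input
obs = [[code[ptype]] for _, ptype in pitches]

# Number of pitches per game, so each game is treated as its own sequence
lengths = []
last_date = None
for date, _ in pitches:
    if date != last_date:
        lengths.append(0)
        last_date = date
    lengths[-1] += 1

print(code)
print(obs)
print(lengths)
```

The alphabetical ordering is what ties column 0 of the emission matrix to balls, column 1 to strikes, and column 2 to the 'X' outcome.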

As history shows (2012 Colorado Rockies season), pulling a pitcher as soon as he finishes pitching to his 18th batter (second time through the order) does not necessarily produce good results. While a majority of pitchers do suffer from TTOP, certain pitchers actually perform better the third time through the order, so applying one strategy to all pitchers was unsuccessful. One possible aid for smart in-game decision-making is to look at the state transition probabilities for each pitcher. If John Doe, our starting pitcher, has a very low probability of transitioning out of a bad state, then we can conclude that when John starts to pitch poorly during a game, he is unlikely to recover and get back into a state where he is pitching well. Thus, especially during the third or fourth time through the order, we would take John out of the game if he starts performing poorly.

If we are in the middle of the season and our starting pitchers are pitching their 13th game this week, we can take the sequence of observations for their previous 12 games and look at their state transitioning probabilities. This will help us make smarter decisions during this week's games on whether to pull them out of the game if they start to enter a bad pitching state.
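
One way to turn a transition matrix into a number a manager could act on is the probability that a pitcher who is currently in his bad state reaches his good state within the next few pitches. The sketch below computes that for two hypothetical pitchers; the matrices, the state ordering (0 = good, 1 = mediocre, 2 = bad), and the function name are all assumptions made for illustration.

```python
def prob_reach_within(trans, start, target, k):
    """Probability of visiting `target` at least once within k transitions,
    starting from `start`, where each row of `trans` sums to 1."""
    n = len(trans)
    # dist[s] = probability of being in s without having visited target yet
    dist = [0.0] * n
    dist[start] = 1.0
    reached = 0.0
    for _ in range(k):
        new = [0.0] * n
        for i in range(n):
            for j in range(n):
                new[j] += dist[i] * trans[i][j]
        reached += new[target]
        new[target] = 0.0   # paths that already reached the target are done
        dist = new
    return reached

# Hypothetical transition matrices (rows: current state, columns: next state)
sticky = [[0.7, 0.2, 0.1],
          [0.3, 0.5, 0.2],
          [0.1, 0.2, 0.7]]
resilient = [[0.7, 0.2, 0.1],
             [0.5, 0.4, 0.1],
             [0.6, 0.3, 0.1]]

for name, matrix in [("sticky", sticky), ("resilient", resilient)]:
    print(name, prob_reach_within(matrix, start=2, target=0, k=3))
```

A pitcher with the "sticky" matrix rarely gets out of his bad state within three pitches, while the "resilient" one usually does, which is the kind of distinction the pull-or-keep decision described above relies on.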

Recovering State Sequence and Estimating State Transition Probabilities and Emission Probabilities

I will use the hmmlearn package in Python to apply the hidden Markov model to this data. I also experimented with other options, including Pomegranate, HMM in R, and the built-in HMM for Matlab, and I wrote my own code in Python and Matlab to recover the state sequence and probabilities. I decided on hmmlearn because it is fast and very easy to use. I will train the hidden Markov model on the entire 2013 season for a specified pitcher. For a quick analysis, I will recover transition and emission probabilities for two pitchers, one who had a successful 2013 season and one who did not: Adam Wainwright and Jeremy Guthrie.

In [17]:
index = (stats_df.before_fip - stats_df.after_fip).sort_values().index
# stats_df.aaname[index[0]]

def hidden_markov(pitcher):
    sql_query = '''
    SELECT 
    X.date as date,
    X.event as event,
    pitch.inning as inning,
    X.o as o,
    X.b as b, 
    X.s as s, 
    X.pitcher_name as pitcher_name,
    X.batter_name as batter,
    pitch.type as type,
    pitch.x0 as x_location,
    pitch.z0 as z_location
    FROM (
        SELECT *
        FROM atbat
        WHERE pitcher_name = ? AND (date >  '2013_03_30' AND date < '2013_09_29')
    ) AS X
    INNER JOIN pitch on X.gameday_link = pitch.gameday_link AND X.num = pitch.num
    ORDER BY date,inning, id, o ASC
    '''
    
    player_df = pd.read_sql_query(sql_query, con = conn, params = [pitcher])

    stats_df[stats_df.aaname == pitcher][cols]

    out_plays = ['Groundout','Strikeout','Lineout','Sac Bunt','Pop Out',
                                                     'Flyout','Sac Fly','Forceout',
                                                     'Runner Out','Fielders Choice Out','Bunt Groundout',
                                                     'Fielders Choice','Bunt Pop Out','Sac Fly',
                                                     'Bunt Lineout',
                                                     'Double Play','Grounded Into DP','Strikeout - DP',
                                                     'Sac Fly DP','Sacrifice Bunt DP', 
                                                     'Triple Play'
                                                    ]
    observations = [str(player_df.type[i]) if player_df.type[i] != 'X'
                    else 'S' if ((player_df.event[i] in out_plays) & (player_df.type[i] == 'X'))
                    else 'X'
                    for i in range(len(player_df))
                   ]

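    # Note: this assignment replaces the relabeled list built above, so the
    # model is trained on the raw pitch types and 'X' covers every ball put
    # in play (outs included).
    observations = player_df.type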
    le = LabelEncoder()
    obs = le.fit_transform(observations)
    obs = obs.reshape(len(observations), 1)

    lengths = [len(player_df[player_df.date == d]) for d in player_df.date.unique()]
    dates = player_df.date.unique()

    hmm = MultinomialHMM(n_components = 3, n_iter = 1000, tol = 1e-5)
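    # Note: with hmmlearn's default init_params, fit() re-initializes the
    # start probabilities, so this value is only kept if 's' is excluded
    # from init_params.
    hmm.startprob_ = np.array([1/4., 1/2., 1/4.])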
    hmm.fit(obs, lengths)
    states = hmm.decode(obs, lengths)[1]
    joblib.dump(hmm, pitcher + ".pkl") # Save Model
    
    hmmdf = pd.DataFrame({'states':states,
                          'type':player_df.type,
                          'date':player_df.date,
                          'batter':player_df.batter, 
                          'event':player_df.event,
                          'inning': player_df.inning
                         }).reset_index()
    
    return player_df, hmmdf, hmm.transmat_, hmm.emissionprob_
In [18]:
def get_plot(Pitcher):
    player_df, hmmdf, A, b = hidden_markov(Pitcher)
    df_A = pd.DataFrame(A, columns = ['State 0', 'State 1', 'State 2']) * 100
    df_A['States'] = ['State 0', 'State 1', 'State 2']
    df_A.set_index('States', drop = True, inplace = True)
    table1 = FF.create_table(df_A, index = True, index_title = 'Transition Matrix')
#     plotly.offline.iplot(table1)
    py.iplot(table1)
    
    df_b = pd.DataFrame(b, columns = ['Ball', 'Strike', 'Hit']) * 100
    df_b['States'] = ['State 0', 'State 1', 'State 2']
    df_b.set_index('States', drop = True, inplace = True)
    table2 = FF.create_table(df_b, index = True, index_title = 'Emission Matrix')
#     plotly.offline.iplot(table2)
    py.iplot(table2)
    
    
    data = []
    button = []
    dates = player_df.date.unique()
    n = len(dates)
    button.append(dict(args = ['visible', [1] * n], label = 'All', method = 'restyle'))    
    for i in range(0, n):
        t = Scatter( x = range(1, len(get_game(hmmdf, i)) + 1 ), y = get_game(hmmdf, i).states,
                   name = 'Game' + ' ' + str(i + 1))

        show = [i == j for j in range(n)]
        button.append(dict(args = ['visible', show], label = dates[i] + '/' + ' ' + 'Game' + ' ' + str(i + 1),
                           method = 'restyle'))
        data.append(t)


    layout = Layout(
        title = 'States Chain',
        xaxis = dict(title = 'Pitch'),
        yaxis = dict(title = 'State'),
        updatemenus = list([
                dict(
                    x = -0.05,
                    y = 1,
                    yanchor = 'top',
                    buttons = button  
                )
            ])
    )
    fig = Figure(data = data, layout = layout)
#     plotly.offline.iplot(fig)
    return table1, table2, fig

Decisions Decisions

Below are the tables of the transition and emission matrices for Adam Wainwright and Jeremy Guthrie. In the transition matrix, the row gives the current state and the column gives the next state. For example, the probability of Adam Wainwright transitioning from State 0 to State 1 is 61.18%. In the emission matrix, each value is the probability of emitting the outcome in that column while in the state given by the row. For example, if Adam Wainwright is currently in state 0, he has a 60.5% chance of throwing a ball and a 16.3% chance of giving up a hit.

Adam Wainwright

In [19]:
table1, table2, fig = get_plot('Adam Wainwright')
In [20]:
py.iplot(table2)
Out[20]:
In [21]:
py.iplot(table1)
Out[21]:

Notice that the states here are identified by number rather than by the labels we assumed earlier (good, mediocre, bad). The model has no notion of what each state means, so the states are simply numbered when the probabilities and state sequence are recovered. However, from the tables, we can infer which state is the bad state, the mediocre state, and the good state.

First, look at Wainwright's emission matrix. If he is in state 0, he has a 60.5% chance of throwing a ball, which is the highest of all the states; thus, this state is the mediocre state. Next, if he is in state 1, he is 68.78% likely to throw a strike, which is the highest percentage of all states; state 1 will be our good state. Finally, if he is in state 2, he is 58.89% likely to give up a hit, the highest probability of all states, so state 2 will be our bad state. So we have established for Wainwright that:

State 2 = Bad State

State 0 = Mediocre State

State 1 = Good State
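
The labeling rule used above (name each state after its most likely outcome) can be sketched in a few lines. In the dictionary below, only the 60.5%, 16.3%, 68.78%, and 58.89% entries are the values quoted in this post. The 23.2% strike entry for state 0 is the remainder assuming the row sums to 100%, and the other entries for states 1 and 2 are placeholders invented so that each row sums to 100%.

```python
# Emission probabilities in percent, keyed by state number
emission = {
    0: {"Ball": 60.5, "Strike": 23.2, "Hit": 16.3},    # Strike is the implied remainder
    1: {"Ball": 20.0, "Strike": 68.78, "Hit": 11.22},  # Ball and Hit are placeholders
    2: {"Ball": 15.0, "Strike": 26.11, "Hit": 58.89},  # Ball and Strike are placeholders
}

# Interpretation used in this post: strikes -> good, balls -> mediocre, hits -> bad
quality = {"Strike": "good", "Ball": "mediocre", "Hit": "bad"}

state_labels = {}
for state, row in emission.items():
    dominant = max(row, key=row.get)   # outcome with the highest probability
    state_labels[state] = quality[dominant]

print(state_labels)
```

This rule only gives a clean answer when each state has a different dominant outcome; if two states shared the same one, a finer comparison of the rows would be needed.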

Next, observe his transition matrix. When he is in his bad state (state 2), he has a 0% chance of self-transitioning, i.e. staying in state 2. Instead, he is most likely to transition to state 1 (56.03%), which is his good state. When he is in his mediocre state, he also has the highest probability of transitioning to state 1, his good state, with 61.18% probability. This tells us that when he goes into a bad or mediocre state during a game, he is most likely to leave that state and go back into his good state.

Below are the plots of the Markov Chains, the sequence of states for each game in the season. Feel free to use the drop down menu to navigate to any individual game to see the sequence of states for the pitcher.

In [22]:
py.iplot(fig)
Out[22]:

Jeremy Guthrie

In [23]:
table1, table2, fig = get_plot('Jeremy Guthrie')
In [24]:
py.iplot(table2)
Out[24]:
In [25]:
py.iplot(table1)
Out[25]:

If we do the same analysis for Jeremy Guthrie, we can conclude that his corresponding states are:

State 2 = Bad State

State 0 = Mediocre State

State 1 = Good State

Looking at his transition matrix, the biggest difference between Guthrie and Wainwright is that when Guthrie is in either his good state or his mediocre state, he has the highest probability of transitioning into his bad state. In fact, when Guthrie is in his good state, he only has a 17.8% chance of staying there. Additionally, unlike Wainwright, who has a 0% chance of staying in his bad state, Guthrie has a 17.69% chance of staying in his bad state (state 2). We can see the distinction between these two pitchers in their transition and emission matrices.
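
A self-transition probability can be read as an expected streak length: in a Markov chain, the number of consecutive steps spent in a state once it is entered is geometric, with mean 1 / (1 - p) where p is the probability of staying. The short sketch below applies that to the self-transition figures quoted above; the function name is made up for this example.

```python
def expected_run_length(self_transition_prob):
    """Mean number of consecutive pitches spent in a state, given the
    probability of staying in that state on the next pitch."""
    return 1.0 / (1.0 - self_transition_prob)

# Self-transition probabilities quoted in the text
wainwright_bad = expected_run_length(0.0)      # 0% chance of staying in the bad state
guthrie_bad = expected_run_length(0.1769)      # 17.69% chance of staying in the bad state
guthrie_good = expected_run_length(0.178)      # 17.8% chance of staying in the good state

print(wainwright_bad)
print(round(guthrie_bad, 2))
print(round(guthrie_good, 2))
```

By this measure, individual streaks are short for both pitchers (about 1.2 pitches for Guthrie in either state), so the more telling difference is how often each pitcher enters the bad state rather than how long he stays once there.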

In [26]:
py.iplot(fig)
Out[26]:

Results

We can see from running the HMM on Adam Wainwright and Jeremy Guthrie that their transition and emission matrices differ. Adam Wainwright had transition probabilities that favored going back into his good state. Jeremy Guthrie's matrices did not show the same feature: his transition matrix suggests that he can struggle and transition to his bad state often. The differences between their HMM results may help explain their season results. Wainwright had a 2.55 FIP and 2.94 ERA (Earned Run Average), while Guthrie had a 4.79 FIP and 4.04 ERA in 2013.

More Analysis:

I will do more analysis soon. We still need to look at more pitchers to see whether their transition and emission matrices are indicative of their season results. As time permits, I will do more comparisons and produce more charts to show results. I am excited about these results, as they show promise as a method to aid smarter in-game decisions. I am also interested in extending this analysis to relievers to measure their effectiveness.