This is the third post in a series called Making the Model, which documents my journey to building a data set and creating the best model to predict baseball daily fantasy sports (DFS) scores. With the projected performances from such a model, a user can set an optimal lineup in FanDuel or DraftKings and dominate the competition. However, before reaching this goal, there will be many obstacles along the way. This post describes a couple of interesting feature engineering techniques I performed to improve my model.

Feature engineering is one aspect of the machine learning workflow that truly requires a human element. It is the process of transforming the features in the data into features that are more meaningful and will lead to better predictions, and it often requires a certain amount of domain knowledge. For example, if you are predicting the price of a home, using the house’s latitude and longitude as model features will probably not be very helpful. Instead, using the latitude and longitude to determine the house’s neighborhood, and then using external data to find the median home price in that neighborhood, would produce a much more meaningful predictor than the raw coordinates.

The two approaches I’ll describe both rely on the mathematical concept of taking a weighted mean. As defined by Wikipedia: “The weighted arithmetic mean is similar to an ordinary arithmetic mean, except that instead of each of the data points contributing equally to the final average, some data points contribute more than others.”
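In code, this is simply NumPy’s ‘average’ function with its ‘weights’ parameter. A toy illustration (numbers made up):

import numpy as np

values = [0.300, 0.250]
weights = [3, 1]  # the first point counts three times as much as the second
np.average(values, weights=weights)  # 0.2875, versus a plain mean of 0.275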

1. Aggregating batting features across all nine batters

This technique is specifically applicable to the model predicting pitching performance. In trying to predict how well a pitcher will do in a game, it’s necessary to know the skill of the opposing batters. On one hand, there are nine opposing batters, and if you include features for each batter individually, the features describing the opposing hitters will far outnumber the features describing the pitcher’s own ability. On the other hand, if you only use overall batting stats for the opposing team as a whole, you lose important game-level detail. Imagine a scenario where a pitcher is facing the best hitting team in baseball, but the team’s two best hitters are out with the flu. There may be a big gap between the team’s “usual” ability and its ability for this particular game.

I figured that the best solution would be to average each hitting feature across all nine batters. However, it became apparent that the player batting ninth should not receive the same weight as the player batting first; players batting closer to the front of the order are expected to have more at bats (and more opportunities to accrue hits) than players batting towards the end of the order.

The question then becomes: “How many times is a batter expected to come up to bat based on lineup position?” Fortunately, I came across a Fangraphs article that answers exactly this question. This analysis found that the number of plate appearances per start (PA/GS) is 4.65 for those batting 1st and gradually decreases down to 3.77 for those batting 9th. I’ll use these values of PA/GS as the weights for averaging features across the batters.

The code to take the weighted average of all the batting features is shown below. As an example, for batting average, the features ‘batter1_AVG_season’, ‘batter2_AVG_season’ … ‘batter9_AVG_season’ are averaged into a single column called ‘batter_AVG_season’. I can’t use the basic NumPy ‘average’ function with the ‘weights’ parameter because the data has missing values, so I use NumPy’s masked array instead. The mask is a Boolean array in which False marks values that are valid for analysis and True marks values that are not; here, the missing elements are marked True. I can then take the weighted average of the masked data, where the weights are specified in the ‘weights’ list and axis=1 indicates that we’re averaging across columns.

import numpy as np

# PA/GS by lineup slot (from the Fangraphs analysis): earlier slots get more weight
weights = [4.65, 4.55, 4.43, 4.33, 4.24, 4.13, 4.01, 3.90, 3.77]
batters = ["batter" + str(i) for i in range(1, 10)]
# Strip the "batter1_" prefix (8 characters) to recover the underlying stat names
stats = [i[8:] for i in df_pitching2.columns if "batter1" in i]

for stat in stats:
    features = [b + "_" + stat for b in batters]
    # Mask missing values so they are excluded from the weighted average
    masked_data = np.ma.masked_array(df_pitching2[features], np.isnan(df_pitching2[features]))
    new_stat = "batter_" + stat
    df_pitching2[new_stat] = np.ma.average(masked_data, axis=1, weights=weights)
    df_pitching2 = df_pitching2.drop(columns=features)
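As a quick sanity check of the missing-value handling, here is the same operation on a single made-up row where the third batter’s value is missing. np.ma.average drops the masked element and renormalizes the remaining weights:

row = np.array([[0.310, 0.285, np.nan, 0.270, 0.250, 0.240, 0.230, 0.225, 0.210]])
masked = np.ma.masked_array(row, np.isnan(row))
np.ma.average(masked, axis=1, weights=weights)  # ~0.2546; the missing slot's weight is excluded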

2. Combining the ‘previous season’ and ‘season-to-date’ features

Many of the features in the data I put together come in two varieties: the statistic’s value for the season to date (i.e., what the season value would be right before the next game) and the statistic’s overall value in the previous season. The relative importance of these two groups of features is expected to change over the course of a season. Let’s say you’re trying to predict how well a batter will perform in a game in early April. The batter’s season-to-date strikeout rate won’t be very meaningful, since it’s early in the season and the sample size is very small; the previous season’s strikeout rate will probably be a better representation of the batter’s actual strikeout rate. The reverse holds as well: for games towards the end of the season, the season-to-date statistic should be more meaningful than the previous season’s, since it is closer in time to the game at hand and has a sufficient sample size.

One approach to handling this situation would be to simply leave the data as is, make sure the month is included as a feature, and let the model pick up the interaction between month and the rest of the features. Another approach worth attempting is to combine the previous season feature with the season-to-date feature, weighting by the percent progress through the season. For example, a game taking place 5% of the way through the season would compute the weighted strikeout rate as (.95 * previous season K%) + (.05 * season-to-date K%).

The first step in coding this transformation was to create a feature indicating the percent progress through the season, ‘sn_pct’.

import pandas as pd

# first_gm18 / first_gm19 hold the day-of-year of each season's first game
day_of_year = pd.to_datetime(df_pitching2[["year", "month", "day"]]).dt.dayofyear
df_pitching2["day_of_sn"] = np.where(df_pitching2.year == 2018,
                                     day_of_year - first_gm18,
                                     day_of_year - first_gm19)

# Dividing by each season's final day gives the percent progress through the season
daymax18 = df_pitching2.query("year==2018")["day_of_sn"].max()
daymax19 = df_pitching2.query("year==2019")["day_of_sn"].max()
df_pitching2["sn_pct"] = np.where(df_pitching2.year == 2018,
                                  df_pitching2.day_of_sn / daymax18,
                                  df_pitching2.day_of_sn / daymax19)
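A quick sanity check: the new feature should run from 0.0 on each season’s opening day to 1.0 on its final day.

df_pitching2.groupby("year")["sn_pct"].agg(["min", "max"])  # expect min 0.0 and max 1.0 for both years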

Next, I created a function to perform the weighted average and created lists of the season-to-date features (‘season_feats’) and the previous season features (‘prevseason_feats’). Both lists were sorted to ensure that features would match at the same indices.

def featAvg(ser1, ser2, wt):
    """Blend two series: ser1 weighted by wt, ser2 weighted by (1 - wt)."""
    return (ser1 * wt) + (ser2 * (1 - wt))

# Sort both lists so that paired features line up at the same indices
season_feats = sorted(i for i in df_pitching2.columns if "_season" in i and "points" not in i)
prevseason_feats = sorted(i for i in df_pitching2.columns if "_prevseason" in i and "points" not in i)
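As a quick check that ‘featAvg’ matches the formula above, a game 5% of the way through the season (with made-up strikeout rates of .30 season-to-date and .20 last season) should blend to (.05 * .30) + (.95 * .20) = .205:

featAvg(pd.Series([0.30]), pd.Series([0.20]), 0.05)  # returns 0.205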

Lastly, I looped through each feature pair, created the weighted value (‘x’) using the previously created ‘featAvg’ function, then continually concatenated it to a dataframe. In the last line, I combined the newly created features with the rest of the original dataset.

new_feats = pd.DataFrame()

for (a, b) in zip(season_feats, prevseason_feats):
    # Weight the season-to-date value by sn_pct and the previous season's value by (1 - sn_pct)
    x = {a.replace("_season", ""): featAvg(df_pitching2[a], df_pitching2[b], df_pitching2["sn_pct"])}
    new_feats = pd.concat([new_feats, pd.DataFrame(x)], axis=1)

df_pitching2 = pd.concat([new_feats, df_pitching2.drop(columns=season_feats + prevseason_feats)],
                         axis=1, sort=False)
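A lightweight check confirms the swap worked, assuming every season-to-date feature has a previous-season counterpart: there should be one blended column per pair, and each should now live in the main dataframe.

assert len(new_feats.columns) == len(season_feats)  # one blended column per feature pair
assert all(col in df_pitching2.columns for col in new_feats.columns)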

This project will have plenty more instances of feature engineering, but this post highlights a couple of interesting techniques I tried. The real measure of success will be whether the changes actually lead to better model performance. Such success may vary by model type; feature engineering that improves a linear regression model will not necessarily improve a random forest model. I look forward to continually digging in and exploring new ways to make the best model possible.