Revisiting All-NBA Predictions
Hey folks. I’m writing this post about 2 months after the last one. I wanted to revisit this project because I’ve gotten a bit more acquainted with data science tools and methods, and I wanted to apply them back to the model I built. There are a few reasons why I felt the uncontrollable urge to come back…
1. Automation
The entire time I was writing the last 2 posts where I made predictions on all-NBA players, first using an ensemble model and second using tree-based models (namely gradient boosted trees with xgboost), I was constantly bothered by the fact that I was listing out all the players and their probabilities and then hand-picking the ones who would make the first, second, and third teams. It wasn’t very efficient, and I was slightly ashamed of myself at the time, but I was just too excited that I’d gotten any model working to take the time to automate that process. Going down the list and picking the top 6 guards, 6 forwards, and 3 centers is an extremely simple script to write, and I aim to do that now.
2. Cross Validation
Cross validation is one of the most important parts of building a model, as it helps guard against overfitting and gives a more honest estimate of out-of-sample error. The biggest reason I didn’t implement cross validation before was that I didn’t know how to use the cross validation libraries in R or python. python’s sklearn has the model selection and grid search modules with a ton of tools for cross validation (something I was not finding as easily in R, where models seem to have their own cross validation functions). Since I had modeled in R and scripted in python in the previous posts, I ran into some headaches switching back and forth between the two… Cross validation would’ve been much easier if I had worked within a single language. Since I wrote my last post, I’ve become much more familiar with sklearn. I’ve also played quite a bit with the python implementation of xgboost, so I want to try to cross validate my previous result to get a more accurate score.
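(For anyone who hasn’t touched these modules before, here’s a tiny, generic illustration of what sklearn’s cross validation tooling looks like. This is synthetic data and a plain logistic regression using the newer model_selection module, not the model or data used in this post.)
# Generic 5-fold cross validation example on synthetic data (illustration only)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples = 1000, n_features = 20, random_state = 0)
scores = cross_val_score(LogisticRegression(), X, y, cv = 5, scoring = 'roc_auc')
print(scores.mean())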
Cross Validation Strategy
Although I mentioned that I had become a bit more familiar with the sklearn grid search module, I’m literally just realizing now that I don’t think I can actually use it the way I want to perform cross validation. Yup, I’m talking about the time between me finishing the last paragraph and starting this one.
Okay, in my own defense, I just don’t have that much experience with model selection methods yet, and I’m quickly finding that different data sets actually call for different validation methods.
With the NBA data, cross-validation should theoretically tell me how accurate my model is, right? Well, what is the objective I’m actually trying to solve? It can be approached in two ways:
- If I’m just using the model_selection tools to perform straight cross validation (let’s say 5-fold CV), I would have my data split 80% train and 20% test for five iterations. My measurement of correctness would not take into account that there are only 6 G’s, 6 F’s, and 3 C’s that make the team.
- If I take a step back, the real accuracy of my model should be averaged over all the years of data we have, because what we really care about is how many of the 15 players we got right each year. To measure it that way, we have to train the model on all the other ~29 years (I think we have around 30 years of data) and measure accuracy with a “leave-one-out” cross validation approach, where the “one out” is not one sample but one year’s worth of data (a minimal sketch of this split follows this list).
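To make that “leave-one-year-out” idea concrete, here’s a minimal sketch using sklearn’s LeaveOneGroupOut, treating each season as one group. It assumes the newer sklearn.model_selection API, and the dataframe / column names here are placeholders rather than the variables used later in this post.
# Minimal "leave-one-year-out" splitter sketch: each season is one CV group
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_year_out_splits(df, feature_cols, label_col, year_col = 'season_start_year'):
    # Yield (train_index, test_index) pairs where the held-out fold is one full season
    logo = LeaveOneGroupOut()
    X = df[feature_cols].values
    y = df[label_col].values
    groups = df[year_col].values
    for train_index, test_index in logo.split(X, y, groups):
        yield train_index, test_index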
There are two cases where cross-validation is used:
- To get a sense of the end-model accuracy
- To tune parameters
Once I have my model, I can absolutely use the “leave-one-year-out” cross validation to gauge model score. If I tried to use the same approach for parameter tuning, I’d have 30x however many combinations of parameters I have. Even if I wanted to tune, let’s say, 10 sets of parameters, I’d be training 300 models… this might be a bit too much.
I think to tune parameters, I’m going to just use GridSearchCV from sklearn, and to measure overall model accuracy, I’ll use the “leave-one-year-out” CV.
Leggo.
Data Load & Initial Setup
# Enable plots in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
# Seaborn makes our plots prettier
import seaborn
seaborn.set(style = 'ticks')
import numpy as np
import pandas as pd
import copy
# Retrieve player stats from S3
playerAggDfAllNbaAllStar = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfAllNbaAllStar20170606.csv', index_col = 0)
pd.set_option('display.max_rows', len(playerAggDfAllNbaAllStar.dtypes))
print playerAggDfAllNbaAllStar.dtypes
pd.reset_option('display.max_rows')
# Select broader set of features manually
selectedCols = [
'perGameStats_Age',
'perGameStats_G',
'perGameStats_GS',
'perGameStats_MP',
'per100Stats_FG',
'per100Stats_FGA',
'per100Stats_FGPerc',
'per100Stats_3P',
'per100Stats_3PA',
'per100Stats_3PPerc',
'per100Stats_2P',
'per100Stats_2PA',
'per100Stats_2PPerc',
'per100Stats_FT',
'per100Stats_FTA',
'per100Stats_FTPerc',
'per100Stats_ORB',
'per100Stats_DRB',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV',
'per100Stats_PF',
'per100Stats_PTS',
'per100Stats_ORtg',
'per100Stats_DRtg',
'advancedStats_PER',
'advancedStats_TSPerc',
'advancedStats_3PAr',
'advancedStats_FTr',
'advancedStats_ORBPerc',
'advancedStats_DRBPerc',
'advancedStats_TRBPerc',
'advancedStats_ASTPerc',
'advancedStats_STLPerc',
'advancedStats_BLKPerc',
'advancedStats_TOVPerc',
'advancedStats_USGPerc',
'advancedStats_OWS',
'advancedStats_DWS',
'advancedStats_WS',
'advancedStats_WS48',
'advancedStats_OBPM',
'advancedStats_DBPM',
'advancedStats_BPM',
'advancedStats_VORP',
'accolades_all_nba'
]
playerAggDfAllNbaAllStarInitFeatures = playerAggDfAllNbaAllStar[selectedCols]
# Drop GS & 3P%
playerAggDfAllNbaAllStarInitFeatures = playerAggDfAllNbaAllStarInitFeatures.drop(['perGameStats_GS', 'per100Stats_3PPerc'], 1)
# Drop any rows with remaining NA's (FT% has 3 rows)
print playerAggDfAllNbaAllStarInitFeatures.shape
playerAggDfAllNbaAllStarInitFeatures.dropna(inplace = True)
print playerAggDfAllNbaAllStarInitFeatures.shape
Building xgboost Model
# Load xgboost & sklearn modules
import xgboost as xgb
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import LabelEncoder
# Setting up x and y for training
x_train = playerAggDfAllNbaAllStarInitFeatures.ix[:, playerAggDfAllNbaAllStarInitFeatures.columns != 'accolades_all_nba']
y_train = playerAggDfAllNbaAllStarInitFeatures['accolades_all_nba']
# Hold out 10% for xgboost early stopping testing
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size = 0.1, random_state = 1, stratify = y_train)
# Encode our labels ('All-NBA' / 'Not All-NBA') to integers 0 and 1
lb = LabelEncoder()
y_train_encoded = lb.fit_transform(y_train)
y_test_encoded = lb.transform(y_test)
print 'Y (train) set has the following labels {} and has {} elements'.format(np.unique(y_train_encoded), len(y_train_encoded))
print 'Y (test) set has the following labels {} and has {} elements'.format(np.unique(y_test_encoded), len(y_test_encoded))
# Load our train and test data into xgboost DMatrix objects
xgb_train = xgb.DMatrix(x_train, label = y_train_encoded)
xgb_test = xgb.DMatrix(x_test, label = y_test_encoded)
# Instantiate model
xgb_model = xgb.XGBClassifier()
# Set parameters (for GridSearchCV, every value must be in a list, even if there is only 1 value we want to test)
param_grid_search = {
'max_depth': [1, 5, 10],
'learning_rate': [0.05, 0.1, 0.5, 1],
'objective': ['binary:logistic'],
'silent': [1],
'n_estimators': [10000]
}
fit_params = {
'early_stopping_rounds': 100,
'eval_metric': 'auc',
'eval_set': [[x_test, y_test_encoded]],
'verbose': 500
}
# Set up grid search
clf = GridSearchCV(
xgb_model,
param_grid_search,
fit_params = fit_params,
n_jobs = -1,
cv = 5,
scoring = 'roc_auc',
verbose = 2,
refit = True
)
# Fit model
clf.fit(x_train, y_train_encoded)
# Check scores
clf.grid_scores_
So it looks like it really doesn’t matter. I mean, the fact that we’re at 99% AUC is already good enough, ya know? We see that a learning rate of 1 consistently yielded (barely) lower scores, so let’s go with something like a learning rate of 0.1 (to speed things up) and a max depth of 5. Let’s do some tuning on subsample and colsample_bytree, which randomly sample rows and columns, respectively, for each tree. I didn’t actually tune any parameters last time, so let’s see if this model does even better.
# Instantiate model
xgb_model = xgb.XGBClassifier()
# Set parameters (for GridSearchCV, every value must be in a list, even if there is only 1 value we want to test)
param_grid_search = {
'max_depth': [5],
'learning_rate': [0.1],
'subsample': [0.8, 0.9, 1],
'colsample_bytree': [0.8, 0.9, 1],
'objective': ['binary:logistic'],
'silent': [1],
'n_estimators': [10000]
}
fit_params = {
'early_stopping_rounds': 100,
'eval_metric': 'auc',
'eval_set': [[x_test, y_test_encoded]],
'verbose': 500
}
# Set up grid search
clf = GridSearchCV(
xgb_model,
param_grid_search,
fit_params = fit_params,
n_jobs = -1,
cv = 5,
scoring = 'roc_auc',
verbose = 2,
refit = True
)
# Fit model
clf.fit(x_train, y_train_encoded)
# Check scores
clf.grid_scores_
# Best model
clf.best_params_
They are all really good models to be honest, but let’s just take the one that GridSearchCV found as the best.
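Since refit = True, GridSearchCV has already re-fit that best parameter combination on the full training split, so it’s worth knowing where it lives. These are standard GridSearchCV attributes; just a quick peek, not new modeling.
# Mean cross-validated AUC of the best parameter combination
print(clf.best_score_)
# The re-fit model itself (this is what clf.predict_proba() delegates to later)
best_xgb = clf.best_estimator_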
“Leave One Year Out” Cross Validation
Okay, so I have a model that seems to be doing pretty well. The next thing I need to do is go through each year and figure out what my final “accuracy” score is for this model. That is, on average, how many of the 15 chosen all-NBA players did I pick correctly each year?
To re-iterate my point in the intro paragraphs, I can get a ranking of probabilities of how likely the model thinks the player is all-NBA, but it is not enough to simply pick the top 15 players that come up. What if the top 15 players with the best all-NBA probabilities are all guards? It’s impossible to have an all-NBA team with only guards as per the definition of an all-NBA TEAM. I need to write a bit of logic to take the
- top 6 guards (2 per team)
- top 6 forwards (2 per team)
- top 3 centers (1 per team)
Also, there is an additional nuance: prior to the 1988 – 1989 season, there were only two all-NBA teams; since then, there have been three. So, realistically, I will be counting the following (sketched in code right after this list):
- Before 1988 – 1989 season:
- top 4 guards (2 per team)
- top 4 forwards (2 per team)
- top 2 centers (1 per team)
- 1988 – 1989 season to present:
- top 6 guards (2 per team)
- top 6 forwards (2 per team)
- top 3 centers (1 per team)
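As a sanity check, that quota rule boils down to something like the little helper below (the function name is mine, just for illustration; the select_team() function further down implements the same logic inline):
# A small sketch of the per-year position quotas described above
def position_quotas(year):
    # Two all-NBA teams before the 1988-89 season, three from then on
    teams = 2 if year < 1988 else 3
    return {'G': 2 * teams, 'F': 2 * teams, 'C': 1 * teams}

print(position_quotas(1985))  # 4 guards, 4 forwards, 2 centers
print(position_quotas(1995))  # 6 guards, 6 forwards, 3 centers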
Let’s try it out.
# Review our dataframe of original full test set
playerAggDfAllNbaAllStar.head()
# View cleaned test set for model training
playerAggDfAllNbaAllStarInitFeatures.head()
# Merge season, player name, and position back into the cleaned dataframe
playerAggDfAllNbaAllStarInitFeaturesFullSet = playerAggDfAllNbaAllStarInitFeatures.merge(
playerAggDfAllNbaAllStar[['season_start_year', 'perGameStats_Player', 'perGameStats_Pos']],
how = 'left',
left_index = True,
right_index = True
)
# Check results
playerAggDfAllNbaAllStarInitFeaturesFullSet.tail()
# Unique years
unique_years_list = playerAggDfAllNbaAllStarInitFeaturesFullSet['season_start_year'].unique().tolist()
print 'There are {} seasons from {} to {}'.format(len(unique_years_list), np.min(unique_years_list), np.max(unique_years_list))
Let’s first calculate the probabilities of all-NBA team selection based on the model we built. I’ll generate the probabilities using GridSearchCV’s predict_proba method.
# Calculate all-NBA probabilities for our entire data set
# Build test data to model on (filter for year)
x_test_full = copy.deepcopy(playerAggDfAllNbaAllStarInitFeaturesFullSet)
x_test_selected_features = x_test_full[playerAggDfAllNbaAllStarInitFeatures.columns]
x_test_no_labels = x_test_selected_features.drop('accolades_all_nba', 1)
print 'Input test set contains shape {}'.format(x_test_no_labels.shape)
# Make predictions using our xgboost model (the refit best estimator is stored in clf, which exposes a predict_proba() method)
y_test_pred = clf.predict_proba(x_test_no_labels)
y_test_pred_df = pd.DataFrame(y_test_pred, columns = ['all_NBA_proba', 'not_all_NBA_proba'])
print 'Predicted labels contains shape {}'.format(y_test_pred_df.shape)
# Merge predictions back into full data set so we can access positions and true labels
x_test_full['y_test_pred_proba'] = y_test_pred_df['all_NBA_proba'].tolist()
# Check results
x_test_full.head()
# This function takes in a dataframe of the players' names, positions, and predicted probability of all-NBA and returns
# the predicted team selected by the model
def select_team(x_test_full, year):
    # Define dict of number of players to select
    player_counters = {}
    # Define the number of guards to select depending on the year
    if year < 1988:
        player_counters['G'] = 4
    else:
        player_counters['G'] = 6
    # Define the number of forwards and centers to select (both dependent on guards)
    player_counters['F'] = copy.deepcopy(player_counters['G'])
    player_counters['C'] = copy.deepcopy(player_counters['G']) / 2
    # Define empty dataframe to store players we found
    all_nba_selected_df = None
    # Sort dataframe by y_test_pred_proba
    x_test_full.sort_values('y_test_pred_proba', ascending = False, inplace = True)
    print 'Year {}: Finding the top {} guards, {} forwards, and {} centers'.format(year, player_counters['G'], player_counters['F'], player_counters['C'])
    # Loop through each position and select top guards, forwards, and centers
    for position in player_counters:
        x_test_full_position = x_test_full[x_test_full['perGameStats_Pos'].str.contains(position)][['season_start_year', 'perGameStats_Player', 'perGameStats_Pos', 'accolades_all_nba', 'y_test_pred_proba']].head(player_counters[position])
        # Append the results from each position to the final results dataframe
        if all_nba_selected_df is None:
            all_nba_selected_df = copy.deepcopy(x_test_full_position)
        else:
            all_nba_selected_df = pd.concat([all_nba_selected_df, x_test_full_position])
    # Return dataframe of all selected players in the year
    return all_nba_selected_df
# Initiate empty dataframe to store all the predicted results
all_nba_all_year_predicted_df = None
# Loop and predict for every year, calculating the number of players predicted correctly
for year in playerAggDfAllNbaAllStarInitFeaturesFullSet['season_start_year'].unique().tolist():
    print 'Starting year {}'.format(year)
    # Build test data to model on (filter for year)
    all_nba_selected_df = select_team(x_test_full[x_test_full['season_start_year'] == year], year)
    # Append the results from each year to the final results dataframe
    if all_nba_all_year_predicted_df is None:
        all_nba_all_year_predicted_df = copy.deepcopy(all_nba_selected_df)
    else:
        all_nba_all_year_predicted_df = pd.concat([all_nba_all_year_predicted_df, all_nba_selected_df])
print all_nba_all_year_predicted_df
# Merge selections back into main data frame
all_nba_all_year_predicted_df['y_test_pred'] = 'All-NBA'
x_test_full = x_test_full.merge(
all_nba_all_year_predicted_df[['y_test_pred']],
how = 'left',
left_index = True,
right_index = True
)
# The left join merge above leaves Non All-NBA selections as NaN, let's replace them with "Not All-NBA" to match y_test values
x_test_full['y_test_pred'].fillna('Not All-NBA', inplace = True)
# Check results
x_test_full[['accolades_all_nba', 'y_test_pred']].head()
Let’s check the importance plot of the model using xgboost’s plot_importance method:
# Plot feature importance with xgboost
from xgboost import plot_importance
plot_importance(clf.best_estimator_, importance_type = 'gain')
Like we saw in earlier posts, the importance plot (using “gain” as the method of calculating importance, which measures the improvement in the model from splits made on that specific feature) shows that WS and PER are by far the most important features, with USG% and VORP trailing them, and all the other features following but seemingly insignificant.
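If you want the raw numbers behind that plot, something like the snippet below should work. One caveat: it assumes an xgboost version where XGBClassifier exposes get_booster() (older releases named the method booster()), so treat it as a sketch rather than gospel.
# Pull the gain-based importances behind the plot as a sorted series (version-dependent API)
booster = clf.best_estimator_.get_booster()
gain_importance = pd.Series(booster.get_score(importance_type = 'gain')).sort_values(ascending = False)
print(gain_importance.head(10))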
Okay, so now we’ve automated the selection of our all-NBA players according to the criteria from earlier (4 G / 4 F / 2 C before 1988, 6 G / 6 F / 3 C from the 1988 – 1989 season onward). Let’s see how many of the players we classified are actually all-NBA and how many are actually not all-NBA.
all_nba_all_year_predicted_df['accolades_all_nba'].value_counts().plot(kind = 'bar')

# Show tallies of predictions
all_nba_all_year_predicted_df['accolades_all_nba'].value_counts()
490 predicted correctly out of 510… that’s good for _**96% accuracy!!!!!!**_ Let’s take a look at some of those that we predicted incorrectly.
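To tie this back to the “how many of the 15 did we get each year” framing, here’s a quick sketch of a per-year tally on the prediction dataframe we just built (nothing fancy, just a groupby):
# Per-year hit rate of the model's picks: fraction of selected players who actually made an all-NBA team
per_year_hit_rate = all_nba_all_year_predicted_df.groupby('season_start_year')['accolades_all_nba'].apply(lambda s: (s == 'All-NBA').mean())
# The years where the model struggled the most
print(per_year_hit_rate.sort_values().head())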
# View all players predicted by our model as all-NBA but actually did not make a team
all_nba_all_year_predicted_df[all_nba_all_year_predicted_df['accolades_all_nba'] == 'Not All-NBA']
# Plot distribution of model-predicted all-NBA players who are actually not all-NBA
all_nba_all_year_predicted_df[all_nba_all_year_predicted_df['accolades_all_nba'] == 'Not All-NBA']['y_test_pred_proba'].hist()
It looks like 4 of the players in here (’84 ‘Nique, ’92 Shaq, ’05 Pau, ’06 Tony Parker) had >90% probability of making the team. I think the model did the right thing in choosing these players, but clearly they were beaten out by players with intangibles or situational advantages (being on a winning team, etc.). I can investigate this later perhaps.
We see some ~20-odd folks here over the years whom our model picked for a team despite predicted probabilities of under 30%. This doesn’t sound right, but I’m thinking maybe those positions in those years were very low on talent, and just to round out the all-NBA teams, they had to reach to the bottom of the barrel… Let’s look at one year in depth:
In ’97, KG made the team while our model gave him only a 6.7% chance of making it. What did the all-NBA forwards look like that year?
# Define set of columns to look at so we can make our analysis more concise
cols_to_view = [
'season_start_year',
'perGameStats_Player',
'perGameStats_Pos',
'advancedStats_USGPerc',
'advancedStats_WS',
'advancedStats_BPM',
'advancedStats_VORP',
'advancedStats_PER',
'accolades_all_nba',
'y_test_pred',
'y_test_pred_proba'
]
# Look at forwards in all-NBA team in '97
x_test_full[(x_test_full['accolades_all_nba'] == 'All-NBA') & (x_test_full['season_start_year'] == 1997) & (x_test_full['perGameStats_Pos'].str.contains('F'))][cols_to_view]
Hmm, it looks like there are only 5 forwards across the all-NBA teams…
# Look at all-NBA team in '97
x_test_full[(x_test_full['accolades_all_nba'] == 'All-NBA') & (x_test_full['season_start_year'] == 1997) & (x_test_full['perGameStats_Pos'].str.contains('C'))][cols_to_view]
And there are 4 centers! According to the Wikipedia page for the ’97 season, Vin Baker took the other forward spot, while basketball-reference has him listed as a C. The forward landscape that year must’ve been quite thin and / or Vin Baker was pretty good that year as well (decent VORP and great Win Shares). Regardless, my programmed logic for selecting all-NBA players by position prevented the model from picking Vin Baker, because the other centers (Robinson, Shaq, Mutombo) all had great stats.
Let’s check another year here where Pau got left off the ’05 team:
# Look at all-NBA team in '05
x_test_full[((x_test_full['accolades_all_nba'] == 'All-NBA') & (x_test_full['season_start_year'] == 2005) & (x_test_full['perGameStats_Pos'].str.contains('F'))) | ((x_test_full['season_start_year'] == 2005) & (x_test_full['perGameStats_Player'] == 'Pau Gasol'))][cols_to_view]
It looks like Pau was very deserving of an all-NBA spot… in fact, his stats were pretty crazy across the board. The 12 WS speaks volumes, as we saw how important WS was in the tree’s decisions. Melo was the proud recipient of third team honours in Pau’s place, and the model gives him a much lower confidence of making the all-NBA team (86% vs Pau’s 99%). Melo definitely has a very good usage rate and PER, but nowhere near the WS or VORP of anyone else on this list.
It seems that Pau is the clear choice, so I’m basically forced to nitpick at why he wasn’t chosen. I can’t find too many articles online, but from the bit of reading I just did, it seems that Melo led the Nuggets to the division title that year and did it seemingly alone (hence the super high usage rate). Perhaps Melo’s team was bad enough that even getting to 9.4 WS was a feat. Realistically, however, Pau didn’t have it any better than Melo did, and he squeezed out more wins too. Simply put, basketball-wise, it seems that Pau was more deserving, which leads me to think there were simply too many PFs on the all-NBA teams. I’ve been generalizing both PFs and SFs to Fs, but I can understand the positional diversity logic if Pau had been picked (it would’ve resulted in 5 PFs and only 1 true SF in LeBron). Interesting stuff…
Let’s take a look at maybe one more. Let’s go 2014 Kawhi.
# Look at all-NBA team in '14
x_test_full[((x_test_full['accolades_all_nba'] == 'All-NBA') & (x_test_full['season_start_year'] == 2014) & (x_test_full['perGameStats_Pos'].str.contains('F'))) | ((x_test_full['season_start_year'] == 2014) & (x_test_full['perGameStats_Player'] == 'Kawhi Leonard'))][cols_to_view]
Another case of the ambiguously labeled C?
# Look at all-NBA team in '14
x_test_full[(x_test_full['accolades_all_nba'] == 'All-NBA') & (x_test_full['season_start_year'] == 2014) & (x_test_full['perGameStats_Pos'].str.contains('C'))][cols_to_view]
This kinda makes sense. Tim was always shuffling between C and PF, so it’s absolutely believable that Tim would make it over Kawhi (which is exactly what happened).
OK – LET’S TRY ONE MORE… This is actually getting kind of addicting…
Let’s go ’06 Tony Parker. Another one that the model predicted to have a great probability of success but fell flat.
# Look at all-NBA team in '06
x_test_full[((x_test_full['accolades_all_nba'] == 'All-NBA') & (x_test_full['season_start_year'] == 2006) & (x_test_full['perGameStats_Pos'].str.contains('G'))) | ((x_test_full['season_start_year'] == 2006) & (x_test_full['perGameStats_Player'] == 'Tony Parker'))][cols_to_view]
Ok. My question has now changed. Gilbert Arenas is the one that made it over TP, but I totally get it because Gilbert went into god mode with a pretty bad team. Gilbert Arenas was BADASS that season and has every right to be on that team purely because of all the buzzer beaters and games he just took over that season. This doesn’t take away from the fact that TP was performing great as well.
The model says TP had around a 96% chance of making the team and Chauncey Billups around 85%, so my question now is: why did Chauncey make it over TP? In the 2006 – 2007 season, Mr. Big Shot led his team to a 53-win record while TP led his team to a 58-win record… hmm… Mr. Big Shot has a higher WS though, responsible for ~2 more of his team’s wins than TP was for his. Does this speak to how efficient those Pistons were, that Chauncey was able to generate so many Win Shares out of so little usage? At the end of the day, if we look at the two most important features, WS and USG%, it kinda is a tossup between the two…
Okay. I’m done diving into these scenarios. I do have one more question before I wrap it up here. Earlier, I looked at the distribution of probabilities for players the model predicted to make a team but who ultimately didn’t. Now I want to look at the distribution the other way around – players who actually made a team, but whom the model predicted wouldn’t.
# Plot distribution of model-predicted non all-NBA players who are actually all-NBA
x_test_full[(x_test_full['accolades_all_nba'] == 'All-NBA') & (x_test_full['y_test_pred'] == 'Not All-NBA')]['y_test_pred_proba'].hist()

x_test_full[(x_test_full['accolades_all_nba'] == 'All-NBA') & (x_test_full['y_test_pred'] == 'Not All-NBA')][cols_to_view]
One pattern right off the bat here is that something like 2 / 3 of these are CENTERS!!! We had difficulties with centers before, and the fact that only half as many centers are taken as guards or forwards doesn’t help either. It seems it’s relatively difficult to classify centers, or rather, there is more room for subjectivity here. In the modern era, the center is generally not the centerpiece (heh…) of the team either; it’s usually an agile G or F who can take over, and the C nowadays provides extra rebounding or spacing. We see many deserving candidates at C, and this is no surprise, as fewer spots simply means more chances of being snubbed by our model. The fact that the distribution shows such high probabilities (only 4 players under ~70%) is a sign that there were many good players and something had to give.
One LAST thing I’d like to look at… the model predicted all of the 2016 – 2017 all-NBA team members successfully! We didn’t get this with our last model, so it looks like cross validation (and, admittedly, the fact that I trained and tested on largely the same data, with only a small portion held out) has immediately noticeable benefits! I got about 13 / 15 with my last model, good for 86%. My cross-validated score on the entire dataset is a frickin’ 96%.