Let’s get right into it. Last post, I looked at, well, really I learned what PER, BPM, VORP, and WS were. Basketball-reference’s blurbs now make some sense to me haha:
- PER – Sum up all a player’s positive accomplishments, subtract the negative accomplishments, and return a per-minute rating of a player’s performance
- BPM – Box score estimate of the points per 100 possessions that a player contributed above a league-average player, translated to an average team
- VORP – Convert the BPM rate into an estimate of each player’s overall contribution to the team, measured vs. what a theoretical “replacement player” would provide, where the “replacement player” is defined as a player on minimum salary or not a normal member of a team’s rotation, accounting for amount of playing time
- WS – A player statistic which attempts to divvy up credit for team success to the individuals on the team. Important things to note are that it is calculated using player, team and league-wide statistics and the sum of player win shares on a given team will be roughly equal to that team’s win total for the season
I took a look at how these stats are distributed across past seasons, and I left off wondering what the difference between VORP and WS was. I made the possibly unethical decision to skip over the math itself and instead compare these stats against the opinions of media voters to see if I could spot the differences that way. For this, I’ll be looking at the all-NBA and all-star teams.
All-NBA & All-Star Team Correlation to VORP & WS
To me, the all-NBA team is probably the best measure of subjective success. The runner-up is all-star selections, which are definitely not as concrete because voting starts within the first half of the season (the season has barely begun in the grand scheme of things) and the starters are voted in by the fans (which changed this year so that the fan vote only counts for 50%, because the fans are horrible, horrible people, including myself). Both are interesting in their own right and, at the end of the day, usually bubble up the best players in the league (whether individual or team).
The all-NBA selections are made up of three teams, each with five players. These guys are voted in by media and broadcasting people, so, while they aren’t the players or coaches themselves, they’re at least more knowledgeable about the game than the lowly fans (see the 2017 Zaza Pachulia near-mishap). Generally, there is a G / G / F / F / C lineup. I guess it tries to balance all the positions, which may be a pro or a con depending on the year. If all the centers are garbage that year, a center still has to be chosen and may get the spot over a more deserving forward… etc. The all-star rules have relaxed the F / C distinction, which could hedge against this a little bit, so I’ll somewhat reluctantly add in all-stars as well.
Let’s scrape bball-ref for all-nba and all-star information.
# Load libraries & initial config
%load_ext rpy2.ipython
%R library(ggplot2)
%R library(gridExtra)
%R library(scales)
%R library(ggbiplot)
%matplotlib nbagg
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import boto3
from StringIO import StringIO
import warnings
warnings.filterwarnings('ignore')
# Retrieve aggregated player stats from S3
playerAggDfToAnalyze = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfToAnalyze20170606.csv', index_col = 0)
pd.set_option('display.max_rows', len(playerAggDfToAnalyze.dtypes))
print playerAggDfToAnalyze.dtypes
pd.reset_option('display.max_rows')
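# Filter to remove outliers: player must have averaged over 10 minutes per game and played in over 20 games on the season
# (.copy() so that columns added later go onto an independent dataframe rather than a slice of the original)
playerAggDfToAnalyzeMin10Min20Games = playerAggDfToAnalyze[(playerAggDfToAnalyze['perGameStats_MP'] > 10) & (playerAggDfToAnalyze['perGameStats_G'] > 20)].copy()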
Scrape All-NBA Team Information
All-NBA team information is cleanly laid out within basketball-reference here.
It’s all in one table, which makes for easy scraping, but there are definitely some formatting issues that we have to knock down after scraping before going any further:
- Season will have to be cleaned up to only show the starting year of the season (to match our dataframe of players)
- Each player name is followed by the position; I’ll have to strip the positions out as I’m not so concerned about these right now
- The layout of the table is also not what I’d like if I’m to join these values back into the main dataframe; examples of the current format and my desired format are below
Current Data Format
pd.DataFrame(['2015-16', '1st', 'DeAndre Jordan C', 'Kawhi Leonard F', 'LeBron James F', 'Stephen Curry G', 'Russell Westbrook G']).transpose().rename(columns = {0: 'Season', 1: 'Tm', 2: '', 3: '', 4: '', 5: '', 6: ''})
Desired Data Format
pd.DataFrame({
'Season': [2015, 2015, 2015, 2015, 2015],
'Tm': ['1st', '1st', '1st', '1st', '1st'],
'Player': ['DeAndre Jordan', 'Kawhi Leonard', 'LeBron James', 'Stephen Curry', 'Russell Westbrook']
})[['Season', 'Tm', 'Player']]
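Before scraping, here is a minimal sketch of those three cleanup steps applied to the single example row above, using only the standard library rather than pandas. The helper names `clean_season`, `strip_position`, and `to_long_rows` are mine and purely illustrative; they are not part of the notebook.

```python
import re

def clean_season(season):
    """'2015-16' -> 2015 (the starting year of the season)."""
    return int(season[:4])

def strip_position(player):
    """'DeAndre Jordan C' -> 'DeAndre Jordan' (drops a trailing C, F or G)."""
    return re.sub(r' [CFG]$', '', player)

def to_long_rows(wide_row):
    """Turn one wide row (season, team, five players) into one row per player."""
    season, team = wide_row[0], wide_row[1]
    return [(clean_season(season), team, strip_position(p)) for p in wide_row[2:]]

wide_row = ['2015-16', '1st', 'DeAndre Jordan C', 'Kawhi Leonard F',
            'LeBron James F', 'Stephen Curry G', 'Russell Westbrook G']
long_rows = to_long_rows(wide_row)
for row in long_rows:
    print(row)
```

The real scrape does the same thing with `pd.melt` and a regex replace; this is just the idea in miniature.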
Let’s scrape!
# Format URL to scan
urlToScan = 'http://www.basketball-reference.com/awards/all_league.html'
# Pull data from HTML table
allNbaDf = pd.read_html(
io = urlToScan,
header = None,
attrs = {'class': 'stats_table'}
)[0]
# Fix some formatting issues (extra header rows in the middle of table) from bball ref
allNbaDf = allNbaDf.dropna()
allNbaDf.columns = ['Season', 'Lg', 'Tm', 'C', 'F1', 'F2', 'G1', 'G2']
# Use pandas melt function to repivot to my desired format
allNbaDfFormatted = pd.melt(
allNbaDf,
id_vars = ['Season', 'Lg', 'Tm'],
value_vars = ['C', 'F1', 'F2', 'G1', 'G2'],
var_name = 'Position',
value_name = 'Player'
)
# Fix remainder of formatting issues
# Removing position from player name field (e.g. LeBron James F)
allNbaDfFormatted['Player'] = allNbaDfFormatted['Player'].replace(r' [FCG]$', '', regex = True)
allNbaDfFormatted['season_start_year'] = allNbaDfFormatted['Season'].apply(lambda x: x[:4])
allNbaDfFormatted = allNbaDfFormatted[[
'season_start_year',
'Tm',
'Player'
]]
# Change season_start_year to int to match master dataframe
allNbaDfFormatted['season_start_year'] = allNbaDfFormatted['season_start_year'].astype(int)
print allNbaDfFormatted
Scrape All-Star Team Information
Unfortunately, the all-star team information isn’t broken down as easily as the all-NBA information is on basketball-reference. In fact, I couldn’t actually find the data in the same format. I think you’d have to go year by year through the different all-star pages to get the actual rosters.
Kind of a hack, but I found this website that sums the rosters up pretty nicely on one page that’s conducive to scraping. The formatting is a bit wonky, but it shouldn’t be too hard to follow these steps:
- Scrape and get all the tables into an array
- Loop through array and, for each table (per season)
- Extract the season (the season here is the third column of the headers) and subtract one to get the start year of the season
- Extract the east and west players and append them to a master dataframe
Leggo.
# Format URL to scan
urlToScan = 'http://www.nba-allstar.com/allstargame/rosters.htm'
# Pull data from HTML table, every year's all-star roster table is in a dataframe within this list
allStarTableList = pd.read_html(
io = urlToScan,
header = None,
attrs = {'class': 'stats'}
)
print 'There have been {} all star games'.format(len(allStarTableList))
# Loop through list of all star rosters and compile in a master dataframe
allStarDf = None
for table in allStarTableList:
    # Extract the year from the table, minus 1 to get season_start_year
    year = table[2][0] - 1
    # Extract the columns of names that are the east and west all stars
    eastAllStars = table[0][1:].dropna()
    westAllStars = table[3][1:].dropna()
    # Concatenate east and west all stars (currently stored as a pandas series) to pandas dataframe
    concatDf = pd.DataFrame({
        'Player': pd.concat([eastAllStars, westAllStars])
    })
    # Fill in year as an extra column on dataframe so we can use to join back to master dataframe later
    concatDf['season_start_year'] = year
    # Append this year's all stars to the aggregate all star dataframe
    if allStarDf is None:
        allStarDf = concatDf
    else:
        allStarDf = pd.concat([allStarDf, concatDf])
# Add a label to join back into the master dataframe later
allStarDf['all_star'] = 'All Star'
# Change season_start_year to int to match master dataframe
allStarDf['season_start_year'] = allStarDf['season_start_year'].astype(int)
print allStarDf
Joining Data Back to Master Dataframe
Okay, so I now have the data by player and year, and can probably join this back to the master dataframe. How do I join? Names are always an interesting field to join on because of their inconsistent nature.
Let’s take the following names:
- Shaquille O’Neal
- Shaquille O Neal
- Shaq
- shaquille oneal
These are of course the same person, but neither I nor my computer is smart enough to perform this fuzzy matching at a large scale. Or rather, there’s probably no need to dive into that much detail when cleaning up the names is good enough for this specific use case.
If I make the assumption that these reference websites are using players’ FULL NAMES
- Stephen Curry, not “Steph” Curry
- Carmelo Anthony, not “Melo”
- Shaquille O’Neal, not “Shaq O’Neal”
I think I’d basically just be able to change all characters to lower case and strip out anything not alphanumeric
- Stephen Curry becomes stephencurry
- Carmelo Anthony becomes carmeloanthony
- Shaquille O’Neal becomes shaquilleoneal
I don’t believe I’ve ever seen two players with the exact same name in the all-star game, so there shouldn’t be any duplicate joins… WELP, LET’S TRY THIS.
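As a quick sanity check on that idea, here is a minimal standalone sketch of the normalization using the standard library's `re` module. The function name `normalize_name` is just for illustration; the notebook does the equivalent with pandas string methods.

```python
import re

def normalize_name(name):
    """Lower-case a name and strip every non-word character (spaces, apostrophes, periods...)."""
    return re.sub(r'\W', '', name.lower())

examples = ['Stephen Curry', 'Carmelo Anthony', "Shaquille O'Neal", 'Shaquille O Neal', 'shaquille oneal', 'Shaq']
for name in examples:
    print('{} -> {}'.format(name, normalize_name(name)))
```

The three spellings of Shaquille O'Neal collapse to the same key, while a nickname like "Shaq" still would not match, which is exactly the limitation accepted above. One quirk worth knowing: `\W` treats the underscore as a word character, so underscores would survive, though that should not matter for player names.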
# Format player names in each dataframe to join
playerAggDfToAnalyzeMin10Min20Games['player_formatted'] = playerAggDfToAnalyzeMin10Min20Games['perGameStats_Player'].str.lower().replace(r'\W', '', regex = True)
allNbaDfFormatted['player_formatted'] = allNbaDfFormatted['Player'].str.lower().replace(r'\W', '', regex = True)
allStarDf['player_formatted'] = allStarDf['Player'].str.lower().replace(r'\W', '', regex = True)
# Join dataframes
playerAggDfAllNbaAllStar = playerAggDfToAnalyzeMin10Min20Games.merge(
allNbaDfFormatted,
how = 'left',
on = ['player_formatted', 'season_start_year']
).merge(
allStarDf,
how = 'left',
on = ['player_formatted', 'season_start_year']
)
print playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['player_formatted'] == 'lebronjames'][[
'perGameStats_Player',
'player_formatted',
'season_start_year',
'Tm',
'all_star'
]]
Noice! I’ve got the all-NBA and all-star flags in my dataframe now. Seems to make sense. I guess LeBron didn’t make any teams in his first season, and has basically just been owning the league since then. 2016 all-NBA is missing because it hasn’t been announced yet, but otherwise everything looks right!
Let’s just check Shaq for quality control as well.
print playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['player_formatted'] == 'shaquilleoneal'][[
'perGameStats_Player',
'player_formatted',
'season_start_year',
'Tm',
'all_star'
]]
Okay, we see a few duplicates here actually, but this is when he was traded midseason from MIA to PHO, so no worries it’s not a product of the join.
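A cheap way to convince yourself that duplicates like this come from the stats table and not from the join is to count the join keys on each side. Here is a minimal sketch with made-up key lists (the tuples below are illustrative, not pulled from the real dataframes), and `duplicate_keys` is a hypothetical helper name.

```python
from collections import Counter

def duplicate_keys(keys):
    """Return the (player, season) keys that appear more than once."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)

# Toy data: the stats table legitimately repeats a traded player
# (one row per team plus a total row)...
stats_keys = [('shaquilleoneal', 2007), ('shaquilleoneal', 2007),
              ('shaquilleoneal', 2007), ('lebronjames', 2007)]
# ...but an accolade lookup table should list a player at most once per season.
lookup_keys = [('lebronjames', 2007), ('kobebryant', 2007)]

print('Duplicates in stats table: {}'.format(duplicate_keys(stats_keys)))
print('Duplicates in lookup table: {}'.format(duplicate_keys(lookup_keys)))
```

If the lookup side has no repeated keys, a left join cannot multiply rows, so any repeats in the result were already in the left table.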
I’m just going to make a single column now that reflects both accolades. I think the highest honour is to be named to an all-NBA team, but if a player didn’t make an all-NBA team and did make an all-star team, that’s worth noting as well. The final column will have these labels, with the following priority:
- All-NBA First Team
- All-NBA Second Team
- All-NBA Third Team
- All-Star Team
playerAggDfAllNbaAllStar['accolades'] = np.where(
playerAggDfAllNbaAllStar['Tm'] == '1st',
'All-NBA First Team',
np.where(
playerAggDfAllNbaAllStar['Tm'] == '2nd',
'All-NBA Second Team',
np.where(
playerAggDfAllNbaAllStar['Tm'] == '3rd',
'All-NBA Third Team',
np.where(
playerAggDfAllNbaAllStar['all_star'] == 'All Star',
'All-Star Team',
'No Team'
)
)
)
)
print playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['player_formatted'] == 'shaquilleoneal'][[
'perGameStats_Player',
'season_start_year',
'accolades'
]]
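For what it's worth, the same priority rule can be written as a small plain-Python function, which is easier to read and to test than four nested `np.where` calls. This is only a sketch: `accolade` and `ALL_NBA_LABELS` are names I made up, and it assumes the same `'1st'` / `'2nd'` / `'3rd'` and `'All Star'` values used above, with anything else (including NaN from the left join) meaning "not selected".

```python
ALL_NBA_LABELS = {
    '1st': 'All-NBA First Team',
    '2nd': 'All-NBA Second Team',
    '3rd': 'All-NBA Third Team',
}

def accolade(tm, all_star):
    """All-NBA selections take priority; all-star is the fallback; otherwise no team."""
    if tm in ALL_NBA_LABELS:
        return ALL_NBA_LABELS[tm]
    if all_star == 'All Star':
        return 'All-Star Team'
    return 'No Team'

nan = float('nan')  # what a missing value looks like after the left join
print(accolade('1st', 'All Star'))
print(accolade(nan, 'All Star'))
print(accolade(nan, nan))
```

Something like this could be applied row by row with `DataFrame.apply`, at the cost of being slower than the vectorized version.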
Sweet. Let’s graph this on the VORP / WS plot. I have no other expectations than to see every single person in the top right hand corner.
%%R -i playerAggDfAllNbaAllStar -w 700 -u px
ggplot(
playerAggDfAllNbaAllStar,
aes(
x = advancedStats_VORP,
y = advancedStats_WS,
color = accolades
)
) +
geom_point()
Perfect. Exactly what I wanted to see. A gradient starting at the top right hand corner trickling down as we go lower in the priorities. We see all those outliers in the top right as NBA 1st teams and some second teams.
First thing that jumps out to me for the all-NBA teams… PLAYING TIME AND GAMES PLAYED MATTER. The fact that all those first teamers are basically dominating that corner of the plot implies that these guys played the most minutes and likely played a very high number of games to achieve that VORP rating.
All-stars are a bit more variable, and they’re kinda mixed in with those 2nd and 3rd teamers, but this is what we’d expect: these all-NBA teams can only hold 5 people per team. If it was easy to determine who these 5 people should be, there wouldn’t be a voting system, right? People can approach the game and the analysis of the game in different ways… VERY different ways, yielding different votes for these teams. This implies that an all-star could have narrowly missed an all-NBA team and, for that matter, a player with no accolades at all could have narrowly missed either of these teams.
I just want to forget about specific teams for a second. I just want to make the distinction between someone who is on one of these teams and someone who is not.
playerAggDfAllNbaAllStar['accolades_any_team'] = np.where(
playerAggDfAllNbaAllStar['accolades'] == 'No Team',
'No Team',
'On Team'
)
%%R -i playerAggDfAllNbaAllStar -w 700 -u px
ggplot(
playerAggDfAllNbaAllStar,
aes(
x = advancedStats_VORP,
y = advancedStats_WS,
color = accolades_any_team
)
) +
geom_point()
I’m super interested in those regular players that were left off all teams, and the (presumably) all stars that were extremely low on the VORP and WS scale…
Great Players, No Accolades
Let’s check out the players with great cumulative VORP + WS, but no accolades
# Create aggregate VORP + WS metric
playerAggDfAllNbaAllStar['VORP_WS_sum'] = playerAggDfAllNbaAllStar['advancedStats_VORP'] + playerAggDfAllNbaAllStar['advancedStats_WS']
# Filter for those on no teams and sort by highest VORP / WS sum
playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['accolades_any_team'] == 'No Team'].sort_values('VORP_WS_sum', ascending = False)[[
'season_start_year',
'perGameStats_Player',
'perGameStats_Tm',
'perGameStats_G',
'perGameStats_MP',
'advancedStats_VORP',
'advancedStats_WS',
'per100Stats_PTS',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV'
]].head(20)
Perfect, there you go. Two 2016 seasons in there, so I can actually speak to what I know. The two players are Rudy Gobert and KAT.
Literally a quote from post #16 where I analyzed the PCA bi-plots and picked out all-stars this year:
Wow, west forwards are pretty stacked. I see why the likes of KAT and Gobert got cut
And there lies at least one reason why these guys exist… there are only so many spots every year. Note that these 2016 guys haven’t had all-NBA teams come out yet, and Gobert could possibly be on the list here, but in all likelihood KAT probably won’t make any team despite his 12.4 / 5.2 on WS / VORP…
One thing is extremely clear from the numbers of the top 20 “snubs”: they all played a ton of minutes in a ton of games. Nobody on this list played fewer than 78 games, and pretty much all of them averaged at least 35 minutes in the games they played. That is MINUTES, clearly a factor for both WS and VORP.
We see a great 2012 Curry in there with great splits as well. If we look at the west guards who made it that season… CP3, Kobe, James Harden, Tony Parker, Russ. What were their stats like?
playerAggDfAllNbaAllStar[
(playerAggDfAllNbaAllStar['season_start_year'] == 2012) &
(playerAggDfAllNbaAllStar['perGameStats_Player'].isin(['Stephen Curry', 'James Harden', 'Chris Paul', 'Kobe Bryant', 'Tony Parker', 'Russell Westbrook']))
][[
'season_start_year',
'perGameStats_Player',
'perGameStats_Tm',
'perGameStats_G',
'perGameStats_MP',
'advancedStats_VORP',
'advancedStats_WS',
'per100Stats_PTS',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV'
]]
All of these seem deserving for sure. The only questionable stat line there is Tony Parker’s, but even then he was putting up some pretty efficient numbers despite his lower VORP / WS. Tony Parker was playing fewer minutes than most as well, but he had a solid 31 / 12 on PTS / AST per 100… nothing to laugh at. It’s also worth noting that Tony Parker played on a team that won a lot more. By the beginning of Feb (around all-star time), SAS had 10 more wins than GSW did, so it looks like the media elected to reward winning. Perhaps a thought for later is to actually add winning into this equation.
All in all, the case can still be made for Steph Curry not making any teams, but it’s pretty clear he was already balling out 5 years ago.
Great Accolades, Questionable Stats
Let’s do the opposite for players who made all-star teams / all-nba teams and had pretty low VORP / WS.
# Filter for those on at least one team and sort by lowest VORP / WS sum
playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['accolades_any_team'] == 'On Team'].sort_values('VORP_WS_sum', ascending = True)[[
'season_start_year',
'perGameStats_Player',
'perGameStats_Age',
'perGameStats_Tm',
'perGameStats_G',
'perGameStats_MP',
'advancedStats_VORP',
'advancedStats_WS',
'per100Stats_PTS',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV',
'accolades'
]].head(20)
Lol, this is pretty funny. You’ve got Kobe’s last 2 seasons. 2 of AI’s last seasons. In fact, I threw in age here because, just from the memory test, there are legends who are voted into the all-star game because they’re close to retirement, or they’ve just always been there, so if they perform relatively well they get in by the popular fan vote. Looking at the “accolades” field, all of these guys are all-stars who didn’t make an all-nba team. A 41 year old KAJ, a 34 year old Shaq, a 34 year old Nique.
You definitely have some younger guys in there as well (Isiah @ 20yo)! Unfortunately, I don’t know enough about some of these older teams to comment.
I’m curious to just take away all-stars and see what the scatterplot looks like.
playerAggDfAllNbaAllStar['accolades_all_nba'] = np.where(
pd.isnull(playerAggDfAllNbaAllStar['Tm']),
'Not All-NBA',
'All-NBA'
)
%%R -i playerAggDfAllNbaAllStar -w 700 -u px
ggplot(
playerAggDfAllNbaAllStar,
aes(
x = advancedStats_VORP,
y = advancedStats_WS,
color = accolades_all_nba
)
) +
geom_point()
Wow, there are even a few all-NBA players who rank pretty low on the VORP / WS scale. The all-NBA calibre players hold up better on WS than on VORP though: you basically have to have over 7.5 WS to be considered all-NBA, while there are all-NBA players near 0 VORP… 0 value above a replacement player!!!
My gut instinct is that these guys are great defensive players who were part of championship teams… let’s check this category out…
All-NBA, High WS, Low VORP
playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['accolades_all_nba'] == 'All-NBA'].sort_values('advancedStats_VORP', ascending = True)[[
'season_start_year',
'perGameStats_Player',
'perGameStats_Age',
'perGameStats_Tm',
'perGameStats_G',
'perGameStats_MP',
'advancedStats_BPM',
'advancedStats_VORP',
'advancedStats_WS',
'per100Stats_PTS',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV'
]].head(20)
So, somehow, nobody on this list has a VORP higher than 1.8. Since VORP is essentially BPM above the -2 replacement level, scaled by playing time, these are players whose BPM sat around league average or below, or who missed a lot of time. In fact, some have VORPs close to 0, which means they are being valued close to a replacement player. How is this possible?
2013-2014 Tony Parker. VORP of 0.8, BPM of -1.2. The guy had 5.9 WS, which is decent, but -1.2 BPM when a replacement player is supposedly -2 BPM? If you replaced Tony Parker with a guy who’s out of any NBA rotation, you’d only be losing 0.8 points?? That sounds blasphemous. That year, Tony Parker had 29 / 10 on PTS / AST per 100 poss. Also had 4 TOV per 100 poss, which I’d have to look back and see if TOV is extremely highly weighted in the BPM regression model, but the only other thing that really jumps out at me is that he missed 14 games on the season. But his VORP is friggin 0.8 which means unadjusted for games played he’d only be sitting at like 1.
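To make the playing-time adjustment concrete, here is a rough sketch of how BPM turns into VORP. My understanding of bball-ref’s write-up is that VORP is (BPM minus the -2.0 replacement level) × share of team minutes played × (team games / 82). Treat that exact form as an assumption on my part, and note that the inputs below are made-up round numbers, not any real player’s line.

```python
REPLACEMENT_BPM = -2.0  # replacement level, as quoted above

def vorp(bpm, minutes_played, team_games=82, minutes_per_game=48.0):
    """Approximate VORP from BPM and minutes (assumed formula, ignores overtime)."""
    # Share of the minutes one roster slot could have played over the season
    share_of_minutes = minutes_played / (team_games * minutes_per_game)
    return (bpm - REPLACEMENT_BPM) * share_of_minutes * (team_games / 82.0)

# Hypothetical league-average player (BPM 0.0) at two different workloads
half_season_minutes = 1968   # half of 82 games * 48 minutes
heavy_minutes = 2952         # three quarters of 82 games * 48 minutes
print('VORP at {} minutes: {:.2f}'.format(half_season_minutes, vorp(0.0, half_season_minutes)))
print('VORP at {} minutes: {:.2f}'.format(heavy_minutes, vorp(0.0, heavy_minutes)))
```

The point is only that the same BPM yields a different VORP depending on minutes, so a low VORP can come from a mediocre BPM, from missed games, or from both.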
2015-2016 Klay Thompson is also an interesting one with a VORP of 1.8. He was basically their second scoring option that season and went to the finals. He even dropped 37 points in a quarter (although that was actually the season before, in January 2015). He had 32 PTS per 100 poss, but I guess he didn’t do much else. 5.5 / 3 / 1 / 1 on TRB / AST / STL / BLK is nothing to really celebrate, and I suppose these could be the numbers of a “league average” player. I guess it’s just such an interesting thought because you know Klay is one of the best two-way players in the league and is usually tasked with guarding the best player on the opposing team. He’s also a key offensive piece in the league-best GSW offense.
Looking at what terms make up, or are most important in bball-ref’s BPM calculation:
Bball-ref’s BPM page goes through in much more detail what each of these sections / terms mean, so I won’t repeat it here, but they use a bunch of advanced stats to calculate their coefficients. Maybe I should look at a few of these metrics (e.g. bball-ref uses AST% instead of just vanilla AST / 100 poss) to see if it makes more sense.
playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['accolades_all_nba'] == 'All-NBA'].sort_values('advancedStats_VORP', ascending = True)[[
'season_start_year',
'perGameStats_Player',
'perGameStats_Age',
'perGameStats_Tm',
'perGameStats_G',
'perGameStats_MP',
'advancedStats_BPM',
'advancedStats_VORP',
'per100Stats_ORtg',
'per100Stats_DRtg',
'advancedStats_WS',
'advancedStats_TSPerc',
'advancedStats_ORBPerc',
'advancedStats_DRBPerc',
'advancedStats_ASTPerc',
'advancedStats_STLPerc',
'advancedStats_BLKPerc',
'advancedStats_TOVPerc',
'advancedStats_USGPerc',
]].head(20)
Well, these are now completely new numbers unfortunately, and I’ll definitely have to do a scatterplot matrix of some of these really quick.
# Print out summary statistics
playerAggDfAllNbaAllStar[[
'advancedStats_TSPerc',
'advancedStats_ORBPerc',
'advancedStats_DRBPerc',
'advancedStats_ASTPerc',
'advancedStats_STLPerc',
'advancedStats_BLKPerc',
'advancedStats_TOVPerc',
'advancedStats_USGPerc'
]].describe()
from pandas.tools.plotting import scatter_matrix
# Build scatterplot matrix
ax = scatter_matrix(playerAggDfAllNbaAllStar[[
'advancedStats_TSPerc',
'advancedStats_ORBPerc',
'advancedStats_DRBPerc',
'advancedStats_ASTPerc',
'advancedStats_STLPerc',
'advancedStats_BLKPerc',
'advancedStats_TOVPerc',
'advancedStats_USGPerc'
]])
# We have to set axis labels manually with pandas' scatter_matrix function. Maybe there's a better function out there for
# scatterplot matrices, but for now, this is fairly simple.
[plt.setp(item.xaxis.get_label(), 'size', 8) for item in ax.ravel()]
[plt.setp(item.yaxis.get_label(), 'size', 8) for item in ax.ravel()]

Okay, sorry, a ton of information. Definitely not the most efficient way to look at this data, but I have a bit of a better foothold on the numbers now.
If I take the guys I picked out and look at their % stats across the board, I dunno… I guess they kind of are average. Klay is basically below average across the board except for his shooting, which is a bit of an outlier in the favourable direction. But by these stats, I guess he really didn’t rebound, assist, steal, or block much more than your average player. Tony Parker scores quite a bit and gets quite a few assists, but he, too, has “average” stats across the board otherwise.
Now look at a guy like the 2015-2016 Andre Drummond. Decent scoring, great rebounding, decent steals, decent blocks, decent turnovers… he’s above average in many categories, but there he is, with the 4th lowest VORP all-time for an all-nba guy…
Looking at the coefficients a bit closer, it seems that a few of the terms have significant coefficients (to my naive eye).
- STL%
- TO%*USG%
- AST%*TRB%
Andre Drummond played 81 games that season at a decent amount of minutes, so it’s not the playing time / opportunities that dropped his VORP. The only thing I can think of here is that interaction term between AST and TRB, because he assisted on very, very few shots. Let’s do an exercise to see how his BPM would have been tweaked given a higher AST%.
Andre Drummond had 15% ORB% and 34% DRB%. Let’s just generalize this to 30%. He only had an AST% of 4%.
If he had brought his AST% up to, let’s say, 8% (I’m not sure how unreasonable that is), he would have had
He would have raised his BPM by nearly 3.5 points. This would have catapulted him out of this list for sure. Even an increase of 2% in AST% would have brought his BPM up by almost 2 points, moving him up quite a few spots on this list.
The interaction term definitely seems to play a role here. Otherwise, we see many folks who simply didn’t play that many games. Maybe one of the most interesting ones is Yao Ming in 06-07, making an all-nba team while only playing 48 games. Holy cow. Then there’s Rajon Rondo playing 53 games (albeit putting up an ABSURD AST% of 53% and still maintaining an average TRB%)… Crazy. I guess a lot of these guys just meant a lot to their team (they have high win shares after all… right?). Clearly VORP doesn’t tell the whole story, as we’ve seen here.
Win Shares, on the other hand, is… well… it’s complicated. As bball-ref states…
The formulas are quite detailed, so I would point you to Oliver’s book Basketball on Paper for complete details.
Win shares are made up of offensive win shares and defensive win shares, which are built on the concepts of points produced and points allowed respectively (leading to ORtg and DRtg for players). These are calculated from pure box score statistics and kind of work in the opposite direction from VORP.
VORP takes the data and finds coefficients for the different predictors using regression. In essence, we’re letting the data speak for itself to figure out how much the different box score stats matter. WS, on the other hand (like most other metrics), explicitly assigns a chosen value to the different predictors to emulate ORtg and DRtg, then uses the pythagorean formula to calculate the “wins”.
We saw that win shares are generally not below 7.5 for an all-nba type of player. Again, our wins formula is as follows:
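Written out as a small function, the pythagorean step looks like the sketch below. The exponent of 14.91 is simply the value used in the plotting code in this post; the function name `pythagorean_win_pct` is mine, and the ratings passed in are arbitrary examples.

```python
def pythagorean_win_pct(ortg, drtg, exponent=14.91):
    """Expected win fraction from offensive and defensive ratings (points per 100 possessions)."""
    return ortg ** exponent / (ortg ** exponent + drtg ** exponent)

# Equal ratings give a coin flip; a better offense than defense pushes the fraction above 0.5
print('ORtg 105 vs DRtg 105: {:.3f}'.format(pythagorean_win_pct(105.0, 105.0)))
print('ORtg 110 vs DRtg 105: {:.3f}'.format(pythagorean_win_pct(110.0, 105.0)))
print('ORtg 100 vs DRtg 105: {:.3f}'.format(pythagorean_win_pct(100.0, 105.0)))
```

Note that only the ratio of ORtg to DRtg matters here, which is why the output is a fraction of games won rather than a win total.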
Let’s take an average player with an average DRtg:
# Average DRtg
averageDrtg = playerAggDfAllNbaAllStar['per100Stats_DRtg'].mean()
print 'The average player\'s DRtg is {}'.format(averageDrtg)
And then let’s plot what a player’s ORtg would have to be to generate at least 7.5 win shares.
# Generate a list of ORtg to test
ortg = range(80, 150)
# Generate the wins
wins = [(x**14.91) / (x**14.91 + averageDrtg**14.91) for x in ortg]
# Plot ORtg vs Wins generated
pd.DataFrame({
'ortg': ortg,
'wins': wins
}).plot(
kind = 'scatter',
x = 'ortg',
y = 'wins'
)

Okay, so the wins as a function of ORtg is essentially a sigmoid function. Let’s dive deeper into the realistic values
# Generate a list of ORtg to test
ortg = range(95, 120)
# Generate the wins
wins = [(x**14.91) / (x**14.91 + averageDrtg**14.91) for x in ortg]
# Plot ORtg vs Wins generated
pd.DataFrame({
'ortg': ortg,
'wins': wins
}).plot(
kind = 'scatter',
x = 'ortg',
y = 'wins'
)

In the middle of the plot there, every 1 point increase in ORtg increases our % of wins by roughly 4%! Near the ends, a 1 point increase in ORtg increases our % of wins by roughly 2%. That’s a pretty significant increase, so we see how WS values both offense and defense: whatever widens the gap between the two.
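Those per-point gains can be checked directly instead of being eyeballed off the plot. The sketch below assumes an average DRtg of 106, which is a placeholder of mine (the real value is whatever `averageDrtg` printed above), and `gain_per_point` is a hypothetical helper name.

```python
def pythagorean_win_pct(ortg, drtg, exponent=14.91):
    """Same pythagorean formula as in the plotting code above."""
    return ortg ** exponent / (ortg ** exponent + drtg ** exponent)

def gain_per_point(ortg, drtg):
    """Change in win fraction from adding one point of ORtg."""
    return pythagorean_win_pct(ortg + 1, drtg) - pythagorean_win_pct(ortg, drtg)

ASSUMED_AVG_DRTG = 106.0  # placeholder, not the value computed from the dataframe

for ortg in (95, 106, 119):
    print('ORtg {} -> {}: {:+.1%} win fraction'.format(
        ortg, ortg + 1, gain_per_point(ortg, ASSUMED_AVG_DRTG)))
```

Under that assumption the gain is largest when ORtg is close to DRtg, between 3 and 4 percentage points per point of ORtg, and it tapers off towards both ends of the range, which lines up with the shape of the curve.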
Let’s go back to our objective here… What would cause someone to have high win shares, but low VORP? Well, I think the point has to come back to those interaction terms of BPM… in my mind (and I really don’t know the answer here and I think it’s completely debatable) BPM is just a bit more of a complex model, or rather, it’s just a different model. Again, those interaction terms are really important to VORP! Look back at Andre Drummond with a 1.0 VORP… he has 7.4 win shares. The 2015 Pistons only had 44 wins, so he accounted for roughly 17% of his team’s wins! Not too shabby. Win shares valued him decently high, clearly better than a replacement player, right? He is a relatively one-dimensional big man, and that seems like the type of player that VORP doesn’t like. VORP likes talented players doing a bit of everything, and it’s just unfortunate if you’re better at blocking than stealing, because stealing is worth so much more. Going down the list, we see many one-dimensional big men… Dikembe, Drummond, Aldridge, Jermaine O’Neal, Amare… were these guys bad? Definitely not, they made an all-nba team for god’s sake, but they definitely were not the type of players that VORP liked!
For Funsies…
Highest VORP of All Time
There’s a guy that has insane VORP, clearly better than the rest, and not an all-nba player. Let’s check him out.
# Player with highest VORP of all time (no all-nba team)
playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['advancedStats_VORP'] == playerAggDfAllNbaAllStar['advancedStats_VORP'].max()]
Okay, I swear Russ was on the tip of my tongue… This season isn’t over yet. He’ll probably make an All-NBA team. Crazy that he is THE HIGHEST OF ALL TIME though… NEXT!
Let’s save the aggregated player / all-star / all-nba data so we can use it next time.
# Save data to S3 so we can use it for future analysis
csv_buffer = StringIO()
playerAggDfAllNbaAllStar.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object('2017edmfasatb', 'fas_boto/data/playerAggDfAllNbaAllStar20170606.csv').put(Body=csv_buffer.getvalue())