Ok, let’s try to refocus. For the past two posts, I’ve been looking at rebounds per game and turnovers per game, and trying to correlate them with ORtg, which is a per-100-possessions stat. Intuitively, I guess I didn’t think that pace was too different from team to team. I mean, I know that the Warriors play at a much faster pace than, let’s say, the Spurs, but I just thought it was a matter of a few possessions here and there. Kind of like that turnover chart where the difference between the 2016 4th best team (raps) and the 5th worst team (rockets) was 2.5 turnovers. Again, although I know the logic is fundamentally flawed, I thought the gap in pace would be small enough that per-game stats would still show some correlation, even if it wasn’t a strong one.
I’m now getting flashbacks of
I’m not sure what you guys see, but this is basically what I see:
Yeah, yeah, it’s all my fault, I can’t analyze worth jack I GET IT BUT THAT’S WHAT THIS BLOG IS FOR.
Pace
In the basketball-reference data, I have something called “Pace” for each team. Basketball-reference defines this as
An estimate of possessions per 48 minutes
Awesome, this is basically what I need! Essentially, let’s go back to the drawing board with ORB / DRB, and let’s try to find ORB / 100 possessions to correlate exactly with ORtg, which is
Points scored per 100 possessions
Sorry, am I getting carried away with the markdown block quotes now? I just discovered them. Anyways, let’s see what else I have. If I let the following “variables” represent the following metrics:
- a = Rebounds per game
- b = Minutes played per game (not always 48!)
- c = Possessions per 48 minutes
Let’s work backwards, starting with what I want:

$$\text{Rebounds per 100 possessions} = \frac{100}{c} \times \frac{48}{b} \times a$$
Lol, what? I have to do a series of conversions to get to rebounds per 100 possessions, and the formula above seems pretty much like black magic. Let’s work it out though:
- The stat I currently have to start with is rebounds / game. Since I don’t have possessions / game directly, I have to go through the intermediary metric Pace to get there. But Pace is possessions per 48 minutes, so I first have to figure out how many minutes we’re playing every game
- Note that the answer here should be close to 48, but not always!
- We now have the rebounds / game in terms of minutes, but before we get to incorporate Pace, we need to convert this to rebounds / 48 minutes.
- Phew, we now have rebounds / 48 minutes (not a game, but 48 minutes… not a game, not a game, not a game…). We can now find rebounds / c possessions!
- Aaaaaaand we do the same thing as two steps before to standardize to 100 possessions
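The steps above can be sketched as a small Python function. This is only a minimal illustration using the a, b, c variables defined earlier; the function name and the example numbers are made up for the sketch and do not come from the basketball-reference data.

```python
def per_100_possessions(a, b, c):
    """Convert a per-game stat to a per-100-possessions stat.

    a: stat per game (e.g. rebounds per game)
    b: minutes played per game (close to 48, a bit more with overtime)
    c: Pace, i.e. possessions per 48 minutes
    """
    per_minute = a / b                   # stat per minute of game time
    per_48_minutes = per_minute * 48     # stat per 48 minutes
    per_possession = per_48_minutes / c  # c possessions happen in those 48 minutes
    return per_possession * 100          # standardize to 100 possessions


# Hypothetical team: 10 rebounds per game, 48.4 minutes per game, Pace of 96
print(round(per_100_possessions(10, 48.4, 96), 2))
```

A team that plays exactly 48 minutes at exactly 100 possessions per 48 minutes comes out unchanged, which is a handy sanity check.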
KAPICHE? I won’t disclose the amount of time it took me to think about, fix, and write that out. Had another “what am I doing with my life” moment in there as well, but each time it gets less painful.
Turnovers / 100 Possessions
Let’s try this calculation with TOV / 100 possessions to continue where we left off last time.
%load_ext rpy2.ipython
%%R
library(ggplot2)
library(gridExtra)
library(scales)
# Load libraries & initial config
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import boto3
from io import StringIO
# Retrieve team stats from S3
teamAggDfToAnalyze = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/teamAggDfToAnalyze.csv', index_col = 0)
print(teamAggDfToAnalyze.dtypes)
teamPer100Metrics = teamAggDfToAnalyze[[
'season_start_year',
'perGameStats_Tm',
'perGameStats_ORB',
'perGameStats_DRB',
'perGameStats_TOV',
'perGameStats_MP',
'baseStats_Pace',
'baseStats_ORtg',
'baseStats_W/L%'
]]
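# rename returns a new dataframe, which avoids modifying a slice of teamAggDfToAnalyze in place
teamPer100Metrics = teamPer100Metrics.rename(index = str, columns = {'baseStats_W/L%': 'baseStats_WLPerc'})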
Let’s try to regenerate that turnover graphic for 2016 we created in the last post. The last post outlined turnovers PER GAME, discounting pace, and looked something like this:
teamTovPerGame20162017 = teamPer100Metrics[teamPer100Metrics['season_start_year'] == 2016]
%%R -i teamTovPerGame20162017 -w 900 -h 350 -u px
ggplot(
teamTovPerGame20162017,
aes(
x = reorder(perGameStats_Tm, -perGameStats_TOV),
y = perGameStats_TOV
)
) +
geom_bar(stat = 'identity') +
geom_text(aes(label = perGameStats_TOV), vjust = -0.5) +
ggtitle('2016-2017 Turnovers / Game') +
theme(axis.title.x=element_blank())
Let’s try the same thing now with TOV / 100 Poss
teamPer100Metrics['TOV_per_100_poss'] = (100/teamPer100Metrics['baseStats_Pace'])*(48/(teamPer100Metrics['perGameStats_MP']/5))*teamPer100Metrics['perGameStats_TOV']
# Take an explicit copy so new columns can be added to the 2016 subset later
teamTovPerGame20162017 = teamPer100Metrics[teamPer100Metrics['season_start_year'] == 2016].copy()
%%R -i teamTovPerGame20162017 -w 900 -h 350 -u px
ggplot(
teamTovPerGame20162017,
aes(
x = reorder(perGameStats_Tm, -TOV_per_100_poss),
y = TOV_per_100_poss
)
) +
geom_bar(stat = 'identity') +
geom_text(aes(label = sprintf("%0.1f", round(TOV_per_100_poss, digits = 1))), vjust = -0.5) +
ggtitle('2016-2017 Turnovers / 100 Possessions') +
theme(axis.title.x=element_blank())
From this, it’s kind of hard to actually see which teams have changed a lot. The curve looks more or less the same. I just want to quickly see which teams adjusting for pace helped and which teams it hurt.
teamTovPerGame20162017['TOV_pace_adjusted_diff'] = teamTovPerGame20162017['TOV_per_100_poss'] - teamTovPerGame20162017['perGameStats_TOV']
%%R -i teamTovPerGame20162017 -w 900 -h 350 -u px
ggplot(
teamTovPerGame20162017,
aes(
x = reorder(perGameStats_Tm, -TOV_pace_adjusted_diff),
y = TOV_pace_adjusted_diff
)
) +
geom_bar(stat = 'identity') +
geom_text(aes(label = sprintf("%0.1f", round(TOV_pace_adjusted_diff, digits = 1))), vjust = -0.5) +
ggtitle('2016-2017 Difference in Turnovers After Adjusting for Pace') +
theme(axis.title.x=element_blank())
To me, this is essentially a ranking of pace. Minutes / game and the actual turnovers themselves are sure to play a part, but they are relatively similar for all teams (how many overtime games are there in a season, even?), so pace would be the more heavily deciding factor. I just want to double check this really quickly and see if it correlates with pace.
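Besides eyeballing the two bar charts, one number that could back this up is the Pearson correlation between the pace-adjusted difference and pace itself. The sketch below is only an illustration: the helper name `pace_correlation` and the toy dataframe are made up, with every team given the same turnovers and minutes so that pace is the only thing that varies.

```python
import pandas as pd


def pace_correlation(df, column, pace_column='baseStats_Pace'):
    """Pearson correlation between one column and pace."""
    return df[column].corr(df[pace_column])


# Made-up numbers for illustration only, NOT real team data
toy = pd.DataFrame({
    'baseStats_Pace': [94.0, 96.0, 98.0, 100.0, 102.0],
    'perGameStats_TOV': [14.0, 14.0, 14.0, 14.0, 14.0],
    'perGameStats_MP': [241.0] * 5,
})

# Same conversion as in the post
toy['TOV_per_100_poss'] = (100/toy['baseStats_Pace'])*(48/(toy['perGameStats_MP']/5))*toy['perGameStats_TOV']
toy['TOV_pace_adjusted_diff'] = toy['TOV_per_100_poss'] - toy['perGameStats_TOV']

print(pace_correlation(toy, 'TOV_pace_adjusted_diff'))
```

In this toy case the correlation is strongly negative: the faster the pace, the more the per-100 number shrinks relative to the per-game number. On the real data the same call would be `pace_correlation(teamTovPerGame20162017, 'TOV_pace_adjusted_diff')`.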
%%R -i teamTovPerGame20162017 -w 900 -h 350 -u px
ggplot(
teamTovPerGame20162017,
aes(
x = reorder(perGameStats_Tm, baseStats_Pace),
y = baseStats_Pace
)
) +
geom_bar(stat = 'identity', fill = 'blue4') +
geom_text(aes(label = sprintf("%0.1f", round(baseStats_Pace, digits = 1))), vjust = -0.5) +
ggtitle('2016-2017 Team Pace') +
theme(axis.title.x=element_blank())
Very similar ordering. I also want to see if this hit any teams hard in terms of where they rank vs other teams. This is more of an absolute metric that I don’t think I can use in any real quantitative analysis, but just for my own curiosity.
teamTovPerGame20162017Ranking = teamTovPerGame20162017.sort_values('perGameStats_TOV').reset_index(drop = True)
teamTovPerGame20162017Ranking['TOV_per_game_ranking'] = teamTovPerGame20162017Ranking.index
teamTovPerGame20162017Ranking2 = teamTovPerGame20162017Ranking.sort_values('TOV_per_100_poss').reset_index(drop = True)
teamTovPerGame20162017Ranking2['TOV_per_100_poss_ranking'] = teamTovPerGame20162017Ranking2.index
teamTovPerGame20162017Ranking2['TOV_ranking_diff'] = teamTovPerGame20162017Ranking2['TOV_per_game_ranking'] - teamTovPerGame20162017Ranking2['TOV_per_100_poss_ranking']
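As an aside, the same kind of ranking change can be computed without sorting twice by using pandas' `rank`. This is an alternative sketch, not what the cells here do; the helper name and the three toy teams are hypothetical, and ties may be broken differently than with `sort_values`.

```python
import pandas as pd


def ranking_change(df, before, after):
    """Rank change between two columns, where rank 0 is the lowest value."""
    rank_before = df[before].rank(method='first') - 1
    rank_after = df[after].rank(method='first') - 1
    return (rank_before - rank_after).astype(int)


# Made-up teams and numbers for illustration only
toy = pd.DataFrame({
    'perGameStats_Tm': ['AAA', 'BBB', 'CCC'],
    'perGameStats_TOV': [12.0, 13.0, 14.0],
    'TOV_per_100_poss': [13.5, 13.0, 12.9],
})
toy['TOV_ranking_diff'] = ranking_change(toy, 'perGameStats_TOV', 'TOV_per_100_poss')
print(toy)
```

A negative value means the team slid toward the "more turnovers" end after adjusting for pace, and a positive value means it moved toward the "fewer turnovers" end.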
%%R -i teamTovPerGame20162017Ranking2 -w 900 -h 350 -u px
ggplot(
teamTovPerGame20162017Ranking2,
aes(
x = reorder(perGameStats_Tm, TOV_ranking_diff),
y = TOV_ranking_diff
)
) +
geom_bar(stat = 'identity') +
geom_text(aes(label = sprintf("%0.1f", round(TOV_ranking_diff, digits = 1))), vjust = -0.5) +
ggtitle('2016-2017 Turnover Team Ranking Change After Adjusting for Pace') +
theme(axis.title.x=element_blank())
Nothing too crazy here. UTA, SAC, SAS all pretty slow in pace, they move down in rankings. GSW, HOU pretty high in pace, they move up in rankings. PORTLAND?! I don’t really understand this one. They must’ve played more minutes than usual…?
# Average minutes played per game for POR
teamTovPerGame20162017[teamTovPerGame20162017['perGameStats_Tm'] == 'POR'][['perGameStats_MP']]/5
# League distribution of minutes played per game
ax = (teamTovPerGame20162017['perGameStats_MP']/5).plot(kind = 'hist', title = '2016-2017 Average Minutes / Game')
ax.set_xlabel("Minutes Played / Game")
Okay, my hunch is correct, but hmm I didn’t expect it to make that big of a difference. Anyways, I’m not going to use the rankings and it’s quite clear from the relative difference that POR didn’t experience that big of a jump in turnovers after adjusting for pace. Just interesting to see.
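For a sense of scale, average minutes per game can be translated back into a rough count of overtime periods. The sketch below assumes an 82-game season and 5-minute overtime periods; the function name and the 48.5 input are made up for illustration and are not POR's actual number.

```python
def overtime_periods_per_season(avg_minutes_per_game, games=82, regulation=48.0, overtime_length=5.0):
    """Rough number of overtime periods implied by average minutes per game."""
    extra_minutes = (avg_minutes_per_game - regulation) * games
    return extra_minutes / overtime_length


# Hypothetical team averaging 48.5 minutes per game
print(round(overtime_periods_per_season(48.5), 1))
```

Even half a minute above 48 works out to a handful of overtime periods, which is why the minutes term in the conversion only nudges the numbers.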
Let’s try the same for ORB and try to correlate it with ORtg again…
Offensive Rebounds / 100 Possessions
# Take copy of original data (plain assignment would only create a second name for the same dataframe)
teamOrbPerGame = teamPer100Metrics.copy()
# Calculate ORB / 100 poss
teamOrbPerGame['ORB_per_100_poss'] = (100/teamOrbPerGame['baseStats_Pace'])*(48/(teamOrbPerGame['perGameStats_MP']/5))*teamOrbPerGame['perGameStats_ORB']
# Calculate ORB difference after adjusting for pace
teamOrbPerGame['ORB_pace_adjusted_diff'] = teamOrbPerGame['ORB_per_100_poss'] - teamOrbPerGame['perGameStats_ORB']
# Filter for 2016
teamOrbPerGame20162017 = teamOrbPerGame[teamOrbPerGame['season_start_year'] == 2016]
%%R -i teamOrbPerGame20162017 -w 900 -h 350 -u px
ggplot(
teamOrbPerGame20162017,
aes(
x = reorder(perGameStats_Tm, -ORB_pace_adjusted_diff),
y = ORB_pace_adjusted_diff
)
) +
geom_bar(stat = 'identity') +
geom_text(aes(label = sprintf("%0.1f", round(ORB_pace_adjusted_diff, digits = 1))), vjust = -0.5) +
ggtitle('2016-2017 Difference in Offensive Rebounds After Adjusting for Pace') +
theme(axis.title.x=element_blank())
Alright, let’s try that black hole graphic again!
%%R -i teamOrbPerGame
# ORB scatterplot
ggplot(
teamOrbPerGame,
aes(
x = ORB_per_100_poss,
y = baseStats_ORtg
)
) +
geom_point() +
ggtitle('ORB / 100 Possessions vs ORtg') +
scale_y_continuous(limits = c(90, 115))
This is just getting stupid now… how about correlating it to just straight up Win %…?
%%R -i teamOrbPerGame
# ORB scatterplot
ggplot(
teamOrbPerGame,
aes(
x = ORB_per_100_poss,
y = baseStats_WLPerc
)
) +
geom_point() +
ggtitle('ORB / 100 Possessions vs W/L%')
At this point I just don’t think offensive rebounding matters… Looking at the last graph, the team with the highest W/L% in history was one of the worst offensive rebounding teams ever… what??
The team with the second highest W/L%, on the other hand, was one of the better offensive rebounding teams in history… This backs the thought that offensive rebounding just doesn’t matter on its own, or rather, that it is one small, small piece in the overall equation, and may even be the effect rather than the cause in a causal relationship.
Let’s check out what these two teams were like across the board:
# Sort descending by W/L% and take top two teams in history, only take some columns we are interested in
topTwoWLPercTeams = teamAggDfToAnalyze.sort_values('baseStats_W/L%', ascending = False).head(2)[[
'season_start_year',
'perGameStats_Tm',
'baseStats_W',
'baseStats_W/L%',
'perGameStats_MP',
'baseStats_Pace',
'baseStats_Rel_Pace',
'baseStats_ORtg',
'baseStats_Rel_ORtg',
'baseStats_DRtg',
'baseStats_Rel_DRtg',
'perGameStats_PTS',
'perGameStats_2PA',
'perGameStats_2P%',
'perGameStats_3PA',
'perGameStats_3P%',
'perGameStats_FTA',
'perGameStats_FT%',
'perGameStats_ORB',
'perGameStats_DRB',
'perGameStats_AST',
'perGameStats_STL',
'perGameStats_BLK',
'perGameStats_TOV'
]]
# This is a wide dataframe (24 columns), and I want to see all the columns, so we tweak some pandas display options before we print the actual dataframe itself
pd.set_option('display.max_columns', len(topTwoWLPercTeams.columns))
print(topTwoWLPercTeams)
pd.reset_option('display.max_columns')
First off, I knew that the 15-16 warriors had that historic season last year, but it really didn’t hit me that one of the greatest teams that ever existed, if not the greatest team came from an era where I was an avid basketball watcher. Despite the era of social media and facts just hitting you in the face left and right, you just somehow don’t believe that you’re watching it live.
But anyways, a few things that jump out to me right off the bat just looking at basketball-reference data:
- Both teams could SCORE, goddamn… We talk about the 15-16 GSW as one of the greatest scoring teams ever, and sure enough at ~115 pts per game, they sure were, scoring on average 10 more points per game than the bulls
- Both teams were EFFICIENT – despite the last point, the bulls were MORE EFFICIENT than the warriors! And both were way more efficient than the rest of the league at the time
- This leads me to PACE, the warriors averaged 8 more possessions roughly every single game!
- The warriors also had much different shot selection… look at the amount of THREES attempted! Literally double that of the bulls…
- There’s our culprit… the beloved offensive rebound. 10 ORB per game for the warriors… 10! The difference between these two teams was literally the difference between the best and worst team in the league today. Clearly focusing on offensive rebounding was the wrong metric haha
No other stats really jump out at me, though. There are discrepancies here and there among the teams (DRB, BLK, AST…) but nothing that jumps out at you quite like pace or threes. Let’s try to adjust for pace on these metrics and see if they’re more comparable.
# At this point, I'm just going to define the pace conversion logic (per game to per 100 possessions) in a function because I'll be using it multiple times here
def paceConversion(df, listOfFields):
    # Work on a copy so the dataframe passed in (often a slice) is left untouched
    df = df.copy()
    for field in listOfFields:
        df['{}_per_100_poss'.format(field)] = (100/df['baseStats_Pace'])*(48/(df['perGameStats_MP']/5))*df[field]
    return df
topTwoWLPercTeamsPaceAdjusted = paceConversion(
topTwoWLPercTeams,
[
'perGameStats_PTS',
'perGameStats_2PA',
'perGameStats_3PA',
'perGameStats_FTA',
'perGameStats_ORB',
'perGameStats_DRB',
'perGameStats_AST',
'perGameStats_STL',
'perGameStats_BLK',
'perGameStats_TOV'
]
)
pd.set_option('display.max_columns', len(topTwoWLPercTeamsPaceAdjusted.columns))
print(topTwoWLPercTeamsPaceAdjusted)
pd.reset_option('display.max_columns')
Okay, there’s a lot happening here now. Let’s re-evaluate our 4 key components of pace:
Shooting
The bulls crank out
The warriors crank out
The bulls are definitely doing more with their possessions in terms of scoring opportunities. They get about 7.5 more scoring opportunities per 100 possessions than the warriors do.
In terms of percentages, the bulls and warriors both shot at quite an elite level with ~50% on twos, ~40% on threes, and ~75% from the charity stripe. How did the warriors match the bulls’ ORtg given 7.5 fewer scoring opportunities? I guess 3’s?
Let’s say the bulls just used those ~8 possessions and scored 2 twos, 1 three, and missed / turned over the rest of the possessions. They would’ve gotten an extra 7 points from those possessions. A decent difference to say the least.
The warriors, however, even though they attempted fewer shots overall, attempted 13.5 more threes per 100 possessions than the bulls. They hit them at around 40%, so this gives them an additional 5-6 points per 100 possessions! I’m sure I’m not considering all the factors here, but there you go, just from threes, they were able to make up a few traditional possessions’ worth.
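To keep the back-of-the-envelope math honest, here is the arithmetic written out, using only the rough numbers quoted above (13.5 extra three-point attempts, ~40% on threes, ~50% on twos). The helper name is made up, and reading the 5-6 figure as "one bonus point per made three compared with a made two" is my assumption about what the estimate means.

```python
def points_from_attempts(attempts, make_rate, points_per_make):
    """Expected points from a number of attempts at a given make rate."""
    return attempts * make_rate * points_per_make


extra_three_attempts = 13.5  # extra 3PA per 100 possessions (from the post)
three_pt_rate = 0.40         # ~40% on threes
two_pt_rate = 0.50           # ~50% on twos

# Bulls scenario from the post: 2 made twos and 1 made three on the extra possessions
bulls_extra_points = 2 * 2 + 1 * 3

# Warriors: extra made threes, total points from them, and two ways to compare with twos
made_threes = extra_three_attempts * three_pt_rate
points_from_threes = points_from_attempts(extra_three_attempts, three_pt_rate, 3)
points_if_taken_as_twos = points_from_attempts(extra_three_attempts, two_pt_rate, 2)
bonus_per_make = made_threes * 1  # one extra point per make versus a made two

print(bulls_extra_points)
print(round(made_threes, 1), round(points_from_threes, 1))
print(round(points_if_taken_as_twos, 1), round(bonus_per_make, 1))
```

The extra attempts produce about 5.4 made threes, which lines up with the 5-6 figure if each make is counted as one bonus point over a two. If the same attempts are instead compared with 50% two-point shooting, the edge from threes is smaller, so treat the estimate as rough.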
Offensive Rebounding
Perhaps (not coincidentally…?) those extra possessions came from offensive rebounds? The bulls got about 6.5 more offensive rebounds per 100 possessions, correlating quite closely to the 7.5 more scoring opportunities they got per 100 possessions. This seems to make sense.
It seems that the warriors were so good at 3 point shooting, they could basically afford to not really concentrate on offensive boards and get the same offensive output… wow.
I’m not saying that the warriors didn’t need to offensive rebound, nor am I saying shooting 3’s is the most important factor, or even an important factor in general (both teams played elite defense as well), but perhaps either shooting 3’s evened out their lack of offensive rebounding power, or maybe the fact that they shot 3’s took away opportunities to offensive rebound, because we’ve all seen one of these (click to go to YouTube link)…
Turnovers
Honestly here I have no clue because both of these numbers would rank in the bottom third of the league today… I won’t go analyzing the 80’s style of basketball just yet, but suffice it to say both of these teams racked up quite a few turnovers. This leads me to believe that the turnover vs ORtg graph also looks something like:
Or maybe even proportionately related! If only a team could attain 25 turnovers per 100 possessions. They’d probably get like, I dunno, 103 wins or something.
All I have the energy to say right now is that it seems that both teams managed to get points, and get points efficiently whether it be from shooting well, shooting 3’s, extending possessions by rebounding, or perhaps just having that tenacity to put the ball in the hole (Steph / Jordan?).
Just looking at these two teams is really interesting though. I was 7 years old in 1995, and to say that I watched basketball back then, let alone understood or liked it, would be a gross overstatement. I have to do my homework in understanding exactly how the style of play impacts the stats here. That offensive rebounding horror was enough to basically deck me in the face and tell me that using simple analytical techniques that look at just the numbers is outright stupid.
I might just have to look at offense as a whole suite, defense as a whole suite, and how they play off each other as well. I probably also need to get off my ass and go outside and actually play some basketball with some of these factors in mind. Still, really good learning experience just to get to this realization. Through all the <0.2 covariance values, I have found some peace.