All-NBA Predict #14 – Exploration of Historical NBA Players (Part V, PCA to Cluster Playing Styles)

In our last post, we harnessed the power of PCA even further to look at different ways to break up the PCA bi-plot. With ggbiplot’s “ellipse” argument, we were able to check out where different players lie on the plot. To review, my 3 main questions were:

  • How closely can I map different “types” of players to sections of the plot?
  • How can I account for players who exhibit qualities across multiple “types” of players?
  • What does that bottom-right hand side of the plot communicate exactly?

Some of these questions may answer the others, but I’d say all 3 are worth considering, and I hope to answer them by the end of this post.

Q1 – “Types of Players” and “Playing Styles” Mapping on PCA Bi-plot

In an ideal world, I’d see distinct clusters… something like this

Then I can be like “OH MY GOD IT ALL MAKES SENSE THERE ARE MY SCORERS AND THERE ARE MY BRUTE FORCE DEFENDERS AND THERE ARE MY LUNCH PAIL BIG MEN GRABBING ALL THE OFFENSIVE BOARDS” and life would be just dandy. However, basketball is more complex than that and I guess humans have the ability to score AND rebound. Man who woulda thunk it. In fact, rebounding can even LEAD TO scoring! No way!! We kinda saw this with the placement of Shaq’s data points on the last post right?

As a bit of a reminder, let’s generate our general plot again based on our base metrics (not advanced metrics).

In [8]:
%load_ext rpy2.ipython
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
In [9]:
%%R
# Load libraries & initial config
library(ggplot2)
library(gridExtra)
library(scales)
In [10]:
# Load libraries & initial config
%matplotlib nbagg
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import boto3
from io import StringIO
import warnings
warnings.filterwarnings('ignore')
In [11]:
# Retrieve team stats from S3
playerAggDfToAnalyze = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfToAnalyze.csv', index_col = 0)

pd.set_option('display.max_rows', len(playerAggDfToAnalyze.dtypes))
print(playerAggDfToAnalyze.dtypes)
pd.reset_option('display.max_rows')
season_start_year          int64
perGameStats_Player       object
perGameStats_Pos          object
perGameStats_Age           int64
perGameStats_Tm           object
perGameStats_G             int64
perGameStats_GS          float64
perGameStats_MP          float64
per100Stats_FG           float64
per100Stats_FGA          float64
per100Stats_FGPerc       float64
per100Stats_3P           float64
per100Stats_3PA          float64
per100Stats_3PPerc       float64
per100Stats_2P           float64
per100Stats_2PA          float64
per100Stats_2PPerc       float64
per100Stats_FT           float64
per100Stats_FTA          float64
per100Stats_FTPerc       float64
per100Stats_ORB          float64
per100Stats_DRB          float64
per100Stats_TRB          float64
per100Stats_AST          float64
per100Stats_STL          float64
per100Stats_BLK          float64
per100Stats_TOV          float64
per100Stats_PF           float64
per100Stats_PTS          float64
per100Stats_ORtg         float64
per100Stats_DRtg         float64
advancedStats_PER        float64
advancedStats_TSPerc     float64
advancedStats_3PAr       float64
advancedStats_FTr        float64
advancedStats_ORBPerc    float64
advancedStats_DRBPerc    float64
advancedStats_TRBPerc    float64
advancedStats_ASTPerc    float64
advancedStats_STLPerc    float64
advancedStats_BLKPerc    float64
advancedStats_TOVPerc    float64
advancedStats_USGPerc    float64
advancedStats_OWS        float64
advancedStats_DWS        float64
advancedStats_WS         float64
advancedStats_WS48       float64
advancedStats_OBPM       float64
advancedStats_DBPM       float64
advancedStats_BPM        float64
advancedStats_VORP       float64
dtype: object
In [12]:
# Filter to remove outliers: a player must have averaged over 10 minutes per game and played in over 20 games on the season
playerAggDfToAnalyzeMin10Min20Games = playerAggDfToAnalyze[(playerAggDfToAnalyze['perGameStats_MP'] > 10) & (playerAggDfToAnalyze['perGameStats_G'] > 20)]
In [67]:
# Select subset of features
playerAggDfToAnalyzeMin10Min20GamesPCAFeatures = playerAggDfToAnalyzeMin10Min20Games[[
    'season_start_year',
    'perGameStats_Player',
    'per100Stats_FGA',
    'per100Stats_3PA',
    'per100Stats_2PA',
    'per100Stats_2PPerc',
    'per100Stats_FTA',
    'per100Stats_FTPerc',
    'per100Stats_ORB',
    'per100Stats_DRB',
    'per100Stats_AST',
    'per100Stats_STL',
    'per100Stats_BLK',
    'per100Stats_TOV',
    'per100Stats_PF',
    'per100Stats_PTS'
]].dropna()

playerAggDfToAnalyzeMin10Min20GamesPCAFeaturesLabel = playerAggDfToAnalyzeMin10Min20GamesPCAFeatures['perGameStats_Player'].tolist()
playerAggDfToAnalyzeMin10Min20GamesPCAFeaturesData = playerAggDfToAnalyzeMin10Min20GamesPCAFeatures.drop(['season_start_year', 'perGameStats_Player'], axis = 1)
In [68]:
%%R -i playerAggDfToAnalyzeMin10Min20GamesPCAFeaturesData -w 800 -u px -o firstTwoComponents

library(ggbiplot)

# Fit PCA & output bi-plot
pca = prcomp(playerAggDfToAnalyzeMin10Min20GamesPCAFeaturesData, center = T, scale = T)
ggbiplot(pca, obs.scale = 1, var.scale = 1, circle = TRUE, ellipse = TRUE, alpha = 0.05)

# Pass the first two PC's back to python so we can analyze later
firstTwoComponents = pca$x[,c(1,2)]
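
For anyone following along without R, the prcomp call above can be approximated in plain Python. This is only a minimal sketch, not the code that produced the plot: it centers each column, divides by the sample standard deviation (which is what scale = T does), and takes an SVD. The function name pca_scores and the random toy matrix are made up for illustration, and the sign of each component may come out flipped relative to R.

```python
import numpy as np

def pca_scores(data, n_components=2):
    """Return the first n_components principal component scores.

    Mirrors prcomp(center = T, scale = T): columns are centered and
    divided by their sample standard deviation (ddof=1) before the SVD.
    """
    X = np.asarray(data, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal axes (the "rotation" in prcomp terms)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X.dot(Vt[:n_components].T)

# Toy stand-in for the per-100 feature matrix (hypothetical data)
rng = np.random.RandomState(0)
toy = rng.normal(size=(200, 5))
toy[:, 1] += 2 * toy[:, 0]  # make two columns correlated

scores = pca_scores(toy)
print(scores.shape)
```

The two returned columns play the same role as PC1 and PC2 in firstTwoComponents.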

I think the best way to slice this up is to break up our plot into 9 distinct sections, 8 of them being more outliers based and 1 of them smack dab in the middle. This is what I mean:

I’m not necessarily trying to define clusters here, but I do want to just get a sense of what these players at the edge are like. They are the outliers of the outliers, they will in a sense define what that area of the plot looks like. They are the model players in that direction / axis of the plot. How many more ways can I say this?

I’m literally just going to eyeball the boundaries and… yeah… please just accept that.

In [71]:
# Get the first two components from R and concatenate back to our original dataframe
firstTwoComponentsDf = pd.DataFrame(firstTwoComponents, columns = ['PC1', 'PC2'])
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults = pd.concat(
    [
        playerAggDfToAnalyzeMin10Min20GamesPCAFeatures.reset_index(drop = True),
        firstTwoComponentsDf.reset_index(drop = True)
    ],
    axis = 1
)

playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] = np.nan
In [56]:
# Define my boundaries in a dict
boundaries = {}

boundaries[1] = {
    'left' : -5,
    'right' : -3,
    'bottom' : 1.5,
    'top' : 2.5
}

boundaries[2] = {
    'left': -3.75,
    'right': -1.25,
    'bottom': 4,
    'top': 5.5
}

boundaries[3] = {
    'left': 0,
    'right': 2.5,
    'bottom': 4.5,
    'top': 6
}

boundaries[4] = {
    'left': -5,
    'right': -2.5,
    'bottom': -2,
    'top': -1
}

boundaries[5] = {
    'left': -1,
    'right': 1,
    'bottom': -1,
    'top': 1
}

boundaries[6] = {
    'left': 4,
    'right': 6,
    'bottom': 1,
    'top': 3
}

boundaries[7] = {
    'left': -1.25,
    'right': 1.25,
    'bottom': -6,
    'top': -4
}

boundaries[8] = {
    'left': 1.25,
    'right': 3.75,
    'bottom': -4,
    'top': -3
}

boundaries[9] = {
    'left': 5,
    'right': 7.5,
    'bottom': -3,
    'top': -1
}
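
Since the boxes are eyeballed, it’s worth a quick sanity check that no two of them share any area, otherwise a point could belong to two regions. A small sketch: the boxes dict repeats the numbers from boundaries above as (left, right, bottom, top) tuples, and boxes_overlap is a helper name I made up for this check.

```python
from itertools import combinations

# Same boxes as the boundaries dict above: (left, right, bottom, top)
boxes = {
    1: (-5, -3, 1.5, 2.5),
    2: (-3.75, -1.25, 4, 5.5),
    3: (0, 2.5, 4.5, 6),
    4: (-5, -2.5, -2, -1),
    5: (-1, 1, -1, 1),
    6: (4, 6, 1, 3),
    7: (-1.25, 1.25, -6, -4),
    8: (1.25, 3.75, -4, -3),
    9: (5, 7.5, -3, -1),
}

def boxes_overlap(a, b):
    """True if two (left, right, bottom, top) boxes share any area."""
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]

overlapping = [
    (i, j) for i, j in combinations(sorted(boxes), 2)
    if boxes_overlap(boxes[i], boxes[j])
]
print(overlapping)  # → []
```

Regions 7 and 8 only touch at a single corner point (1.25, -4), so no box shares area with another.
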
In [95]:
# Assign region to each observation in the population dataframe using PC1 and PC2 components from PCA
for region in range(1, 10):
    playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] = np.where(
        np.isnan(playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region']),
        np.where(
            ((playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['PC1'] >= boundaries[region]['left']) & (playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['PC1'] <= boundaries[region]['right']) & (playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['PC2'] >= boundaries[region]['bottom']) & (playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['PC2'] <= boundaries[region]['top'])),
            region,
            np.nan
        ),
        playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region']
    )
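
The nested np.where above is a bit hard on the eyes. One alternative with the same "first matching region wins" behaviour is np.select, sketched here on a tiny toy frame. The function name assign_regions, the two-box toy_boundaries dict and the toy points are all assumptions for illustration; only the dict layout (left / right / bottom / top) is taken from the cell above.

```python
import numpy as np
import pandas as pd

def assign_regions(df, boundaries):
    """Label each row with the first region whose box contains (PC1, PC2)."""
    regions = sorted(boundaries)
    conditions = [
        (
            df['PC1'].between(boundaries[r]['left'], boundaries[r]['right'])
            & df['PC2'].between(boundaries[r]['bottom'], boundaries[r]['top'])
        ).values
        for r in regions
    ]
    # np.select takes the first condition that is True; unmatched rows stay NaN
    return np.select(conditions, [float(r) for r in regions], default=np.nan)

# Hypothetical two-box example and three toy points
toy_boundaries = {
    1: {'left': -5, 'right': -3, 'bottom': 1.5, 'top': 2.5},
    5: {'left': -1, 'right': 1, 'bottom': -1, 'top': 1},
}
toy = pd.DataFrame({'PC1': [-4.0, 0.2, 3.0], 'PC2': [2.0, -0.5, 3.0]})
toy['cluster_region'] = assign_regions(toy, toy_boundaries)
print(toy)
```

The first point falls in region 1, the second in region 5, and the third is outside every box and stays NaN.
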
In [94]:
# Check out the number of unique players within each region
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults.groupby('cluster_region')['perGameStats_Player'].nunique()
Out[94]:
cluster_region
1.0     50
2.0     26
3.0     23
4.0     65
5.0    566
6.0     27
7.0     37
8.0     33
9.0     22
Name: perGameStats_Player, dtype: int64

Perfect. Each outlier region holds only a few dozen unique players (22 to 65). The exception is region 5, the one right in the middle of the plot, where 566 players are concentrated.

In [118]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults.groupby('cluster_region').agg({
    'per100Stats_FGA': [np.mean],
    'per100Stats_FTA': [np.mean],
    'per100Stats_3PA': [np.mean],
    'per100Stats_2PPerc': [np.mean],
    'per100Stats_FTPerc': [np.mean],
    'per100Stats_ORB': [np.mean],
    'per100Stats_DRB': [np.mean],
    'per100Stats_AST': [np.mean],
    'per100Stats_STL': [np.mean],
    'per100Stats_BLK': [np.mean],
    'per100Stats_PTS': [np.mean]
})[[
    'per100Stats_PTS',
    'per100Stats_FGA',
    'per100Stats_FTA',
    'per100Stats_3PA',
    'per100Stats_2PPerc',
    'per100Stats_FTPerc',
    'per100Stats_ORB',
    'per100Stats_DRB',
    'per100Stats_AST',
    'per100Stats_STL',
    'per100Stats_BLK'
]]
Out[118]:
Mean per-100-possession stats by cluster_region (per100Stats_ prefix dropped from column names)

cluster_region        PTS        FGA        FTA        3PA     2PPerc     FTPerc        ORB        DRB        AST        STL        BLK
1.0             29.822772  23.692079   8.152475   5.053465   0.477149   0.832475   1.094059   4.127723   9.069307   2.017822   0.335644
2.0             36.525275  27.383516  11.101099   2.929670   0.505670   0.807407   2.383516   6.249451   5.649451   1.853846   0.841758
3.0             32.207463  23.353731  10.983582   0.205970   0.523000   0.726478   4.695522  11.050746   3.100000   1.414925   3.429851
4.0             20.178761  17.881416   3.753097   7.168142   0.440628   0.833628   0.806195   3.693805   9.647788   2.050442   0.202655
5.0             19.997287  17.156672   4.786510   2.653812   0.476179   0.749863   2.593622   5.957845   3.721334   1.684971   0.838930
6.0             18.157778  13.297778   6.891111   0.033333   0.550711   0.545578   6.224444  10.980000   1.340000   1.115556   4.102222
7.0             11.197959  10.679592   1.353061   6.924490   0.360000   0.724939   0.812245   4.355102   4.136735   1.502041   0.479592
8.0              8.853704   8.872222   1.824074   1.953704   0.426278   0.598037   2.425926   5.746296   2.525926   1.605556   0.975926
9.0              7.646809   6.802128   2.627660   0.097872   0.473702   0.469872   5.248936   9.746809   1.568085   1.042553   3.729787
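
The table lists offensive and defensive rebounds separately, while the write-ups below quote total rebounds (TRB). As a quick worked check, TRB is just ORB + DRB. The numbers here are copied from the table above and rounded to two decimals, so treat the output as approximate.

```python
# Mean ORB and DRB per 100 possessions by region, copied from the table above
orb = {1: 1.09, 2: 2.38, 3: 4.70, 4: 0.81, 5: 2.59, 6: 6.22, 7: 0.81, 8: 2.43, 9: 5.25}
drb = {1: 4.13, 2: 6.25, 3: 11.05, 4: 3.69, 5: 5.96, 6: 10.98, 7: 4.36, 8: 5.75, 9: 9.75}

# Total rebounds per 100 possessions for each region
trb = {region: round(orb[region] + drb[region], 2) for region in sorted(orb)}
for region in sorted(trb):
    print(region, trb[region])
```

The three big-man regions (3, 6 and 9) come out at roughly 16, 17 and 15 rebounds per 100 possessions.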

Region 1

In [97]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults[playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] == 1]['perGameStats_Player'].value_counts()
Out[97]:
Isiah Thomas*       7
Allen Iverson*      6
Magic Johnson*      4
Kyrie Irving        4
Kevin Johnson       4
Monta Ellis         4
Ray Allen           4
Sam Cassell         4
Stephon Marbury     4
Tony Parker         3
Lou Williams        3
Gary Payton*        3
Walter Davis        3
Kevin Martin        2
Paul George         2
Robert Pack         2
Reggie Jackson      2
Eric Bledsoe        2
Deron Williams      2
Eric Gordon         2
Ben Gordon          2
Terrell Brandon     2
Mitch Richmond*     2
Calvin Murphy*      2
Gus Williams        1
Michael Redd        1
Richard Hamilton    1
Kobe Bryant         1
James Harden        1
Will Bynum          1
Reggie Miller*      1
Derrick Rose        1
Mike Newlin         1
Paul Westphal       1
Jerry Stackhouse    1
Rod Strickland      1
Joe Johnson         1
Gilbert Arenas      1
C.J. McCollum       1
Devin Booker        1
Byron Scott         1
Tony Wroten         1
Damian Lillard      1
Manu Ginobili       1
Latrell Sprewell    1
Jeremy Lin          1
Isaiah Thomas       1
Joe Dumars*         1
Ricky Pierce        1
Reggie Theus        1
Name: perGameStats_Player, dtype: int64
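
The same filter-then-count lookup gets repeated for every region in this post. It could be wrapped in a small helper, sketched here with a toy frame. The name players_in_region and the toy data are hypothetical; the helper only assumes the two columns used in the cell above.

```python
import pandas as pd

def players_in_region(df, region):
    """Count player-seasons per player inside one cluster region."""
    in_region = df['cluster_region'] == region
    return df.loc[in_region, 'perGameStats_Player'].value_counts()

# Hypothetical toy frame with the two columns the helper relies on
toy = pd.DataFrame({
    'perGameStats_Player': ['Player A', 'Player A', 'Player B', 'Player C'],
    'cluster_region': [1.0, 1.0, 1.0, 2.0],
})
print(players_in_region(toy, 1))
```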

Alrighty, Region 1. All these outlier regions, again, are pretty elite, with only ~20-65 unique players each. Region 1 has 50, which is on the higher side for sure! Looking through the names, there are a lot of HOFers in there (Isiah, AI, Magic, GP, Reggie…). Many of these guys are regarded as masters of both scoring and passing.

To no surprise really, these guys averaged 30 / 9 / 2 on PTS / AST / STL per 100 and shot 83% from the FT line. BEASTS!!!

That really does look like AI’s stat lines all the time. Isiah preceded me a bit but he appears in this region 7 times, the most of anyone, so I can only assume he was beasting his whole career.

Monta shows up in this region multiple times, and so does Lou Will lol. Wtf? I suppose this is what makes Lou Will a 6th man… This guy usually comes off the bench and is pretty much averaging, per 100, the same calibre of stats that AI was averaging. Now, there’s something to be said about AI doing this 45 minutes a night vs Lou Will, who I assume would play maybe 30 minutes or so per game, but still, to be doing that off the bench, that’s what I call PRODUCTION.

Some interesting newbies in here, CJ McCollum, Devin Booker, Dame Dolla, Isaiah… fits quite well with the names we’d be hearing these days. Devin Booker tho… how old is this guy? Sheesh.

Region 2

In [98]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults[playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] == 2]['perGameStats_Player'].value_counts()
Out[98]:
LeBron James          11
Dominique Wilkins*    10
Carmelo Anthony        8
Dwyane Wade            6
Michael Jordan*        6
Adrian Dantley*        6
Dirk Nowitzki          4
Kevin Durant           4
Mark Aguirre           4
Bernard King*          3
Ricky Pierce           3
George Gervin*         3
Alex English*          3
John Drew              3
World B. Free          2
Karl Malone*           2
Paul Pierce            2
Tracy McGrady          2
Kobe Bryant            2
Richard Hamilton       1
Julius Erving*         1
Xavier McDaniel        1
DeMar DeRozan          1
Tom Chambers           1
Grant Hill             1
Quintin Dailey         1
Name: perGameStats_Player, dtype: int64

Region 2… These guys are scorers in every sense of the word, and in fact, to go one level deeper, shooters.

Averaging the most points per 100 possessions out of any region of the chart, these guys have a combined statline of 37 / 9 / 6 on PTS / TRB / AST on the most attempts of any region as well, 27 FGA and 11 FTA. By that math, these guys basically take 1/3 of their team’s shots whenever they’re on the floor.
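
The "1/3 of their team’s shots" figure depends on how many field goals a team attempts per 100 possessions, which isn’t in the dataframe. Here is a rough check under an assumption of mine (not a number from the data) that a team attempts somewhere between 80 and 90 field goals per 100 possessions. The helper name shot_share is made up for this sketch.

```python
def shot_share(player_fga_per_100, team_fga_per_100):
    """Fraction of the team's field goal attempts taken by the player."""
    return player_fga_per_100 / team_fga_per_100

# Region 2 mean FGA per 100 possessions, from the summary table
region_2_fga = 27.38

# ASSUMPTION: plausible team FGA per 100 possessions, not taken from the dataset
for assumed_team_fga in (80.0, 85.0, 90.0):
    print(assumed_team_fga, round(shot_share(region_2_fga, assumed_team_fga), 3))
```

Across that assumed range the share stays between roughly 30% and 34%, so "about a third" holds up as a ballpark.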

Some of the usual suspects are here in terms of greatest scorers of all time. LeBron (11 times!!!!!), Melo, Wade (2003 draft class anyone?), Dirk, KD, PP, Tracy, Kobe, and… DeMar??? Yeah, he’s probably reached that level this season lol.

A lot of legends that I can’t necessarily speak to but definitely hear about, Nique, MJ, Adrian Dantley, Bernard King, Iceman, Alex English, Karl Malone, and some guy named… World B. Free?? Looks like this is a real guy and made the all-star team once.

I guess, to me, this group is your ice cold scorers. I’d think Kobe would be in here more, but honestly Kobe might even be beyond this box I’ve outlined.

Region 3

In [99]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults[playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] == 3]['perGameStats_Player'].value_counts()
Out[99]:
Tim Duncan              10
Patrick Ewing*           9
Hakeem Olajuwon*         7
Alonzo Mourning*         6
Moses Malone*            6
David Robinson*          4
Yao Ming*                4
Amar'e Stoudemire        3
Jermaine O'Neal          2
Shawn Kemp               2
Shaquille O'Neal*        2
Anthony Davis            1
Al Jefferson             1
Kareem Abdul-Jabbar*     1
Chris Gatling            1
Robert Parish*           1
Charles Barkley*         1
DeMarcus Cousins         1
Enes Kanter              1
Dwight Howard            1
Brook Lopez              1
Elton Brand              1
Rik Smits                1
Name: perGameStats_Player, dtype: int64

Elite big men. Nuff said.

Tim Duncan, Patrick Ewing, Hakeem, David Robinson, Yao, Amare, Shaq. ELITE. I think we see a lot of your “traditional” big men in here because this side of the plot values rebounds, blocks, generally higher FG%.

These guys collectively averaged a whopping 32 / 16 / 3 on PTS / TRB / BLK while shooting better than a coin flip at 52% on 2PA. Look at that rebounding though… The rebound rate of these guys is probably nuts.

Region 4

In [100]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults[playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] == 4]['perGameStats_Player'].value_counts()
Out[100]:
Rickey Green            6
Jose Calderon           5
Mookie Blaylock         4
Nick Van Exel           4
Jason Williams          4
John Lucas              3
Spud Webb               3
Michael Adams           3
Chauncey Billups        3
Tim Hardaway            3
Mike Bibby              3
Rafer Alston            3
Ricky Rubio             3
Greivis Vasquez         2
Earl Boykins            2
Chris Whitney           2
Jannero Pargo           2
Jameer Nelson           2
Dana Barros             2
J.J. Redick             2
D.J. Augustin           2
Sarunas Jasikevicius    2
Troy Hudson             2
Isaiah Canaan           2
Jordan Farmar           2
Darrell Armstrong       2
Damon Stoudamire        2
Eric Gordon             1
John Lucas III          1
A.J. Guyton             1
                       ..
Kenny Smith             1
Jamaal Tinsley          1
Derek Anderson          1
Eddie House             1
Craig Hodges            1
Trey Burke              1
Aaron Brooks            1
Patty Mills             1
John Starks             1
Tyler Ennis             1
Sleepy Floyd            1
Keith Jennings          1
Lou Williams            1
Toney Douglas           1
Joe Dumars*             1
Vernon Maxwell          1
Cory Alexander          1
Scott Skiles            1
Johnny Dawkins          1
C.J. Watson             1
Scott Brooks            1
Johnny Moore            1
Brandon Jennings        1
Dedric Willoughby       1
Mike Conley             1
Baron Davis             1
Jimmer Fredette         1
Marcus Williams         1
Darrick Martin          1
Greg Anthony            1
Name: perGameStats_Player, dtype: int64

This is kind of a weird group. Or perhaps it just seems weird to me on the surface. If I had to give this group a name… I guess it would be something like… good dependable point guard? In general, these guys seem to be starting point guards (backups in some cases), some all-stars, some not. But I guess the point here is that they were all quite efficient while they were on the floor, racking up a 20 / 10 / 2 on PTS / AST / STL statline. Shooting well, shooting lots of 3’s as well.

I guess it’s just kind of weird looking at this seeing Jose Calderon on here a bunch of times. I think these were in his Toronto years too. I remember watching Jose play back when I didn’t know anything, didn’t think much of it probably because he was an international player without much flash and the team was pretty bad during those years anyways, but this, I guess, sheds some light on just how efficient he was and perhaps that’s how we even made it into the playoffs (along with Bosh of course).

It just seems like a relatively unreliable correlation in this group because you’ve got guards like Rafer Alston in the same group as Chauncey Billups, but we also have to keep in mind that these were perhaps Rafer’s BEST years and Chauncey’s average years. Also, Earl Boykins in here with Spud Webb makes you think about just how good they were for their natural gifts.

Region 5

In [101]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults[playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] == 5]['perGameStats_Player'].value_counts()
Out[101]:
Tyrone Corbin        12
Tim Thomas           12
Tony Allen           12
Shawn Marion         10
Sam Perkins          10
Jeff Green           10
Ersan Ilyasova       10
Tayshaun Prince       9
Al Harrington         9
Rodney Rogers         9
Robert Reid           9
Fred Roberts          9
Thaddeus Young        8
Rodney McCray         8
Derrick McKey         8
Brad Miller           8
Blue Edwards          8
Reggie Williams       8
Wilson Chandler       8
Mike Sanders          7
Boris Diaw            7
Chris Mills           7
Sam Mitchell          7
Rasheed Wallace       7
Troy Murphy           6
Andres Nocioni        6
Scott Wedman          6
DeShawn Stevenson     6
Detlef Schrempf       6
Mike Dunleavy         6
                     ..
Kawhi Leonard         1
Jim Spanarkel         1
Danilo Gallinari      1
Marc Jackson          1
Bernard Robinson      1
Don Collins           1
Sly Williams          1
Mickey Johnson        1
James Wilkes          1
Andray Blatche        1
Kevin Martin          1
Josh Smith            1
Tom Hammonds          1
Derek Strong          1
Trent Tucker          1
Eric Bledsoe          1
Jiri Welsch           1
Randy Brown           1
Perry Moss            1
Nene Hilario          1
Tim Kempton           1
Clyde Drexler*        1
Corey Maggette        1
Luis Scola            1
Don Ford              1
Larry Robinson        1
Rick Calloway         1
Jason Smith           1
Wayne Robinson        1
Wang Zhizhi           1
Name: perGameStats_Player, dtype: int64

This entire 4-5-6 band is made up of your roughly 20 point / 100 possession scorers. These mostly seem like non-all-star starters in a wing type of role. I could maybe make the assumption that they were pretty decent on defense as 1v1 defenders within their position and even on switches, and saw a decent amount of playing time because of their 2-way ability. You see a lot of 2-way players here like Tony Allen, Shawn Marion, Sheed, Tayshaun, Thaddeus, and Wilson Chandler, who could light up the scoreboard but never did so consistently, yet were seemingly still efficient in their play and managed to gather 20 / 9 / 4 on PTS / REB / AST per 100 possessions. I can’t tell much more about defense without looking at advanced metrics, although these guys did get 1.7 / 0.8 on STL / BLK as well.

Region 6

In [102]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults[playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] == 6]['perGameStats_Player'].value_counts()
Out[102]:
DeAndre Jordan      5
Dikembe Mutombo*    4
Jahidi White        3
Alton Lister        3
Andre Drummond      2
Rudy Gobert         2
Chris Andersen      2
Jerome James        2
JaVale McGee        2
Nazr Mohammed       2
Alonzo Mourning*    2
Shawn Bradley       1
Serge Ibaka         1
Dan Gadzuric        1
Danny Fortson       1
Hamed Haddadi       1
Ed Davis            1
Keon Clark          1
Greg Oden           1
Jayson Williams     1
Shawnelle Scott     1
Festus Ezeli        1
Stanley Roberts     1
Tree Rollins        1
Clint Capela        1
Justin Williams     1
Larry Sanders       1
Name: perGameStats_Player, dtype: int64

These guys had a line of 18 / 17 / 4 on PTS / TRB / BLK shooting the rock at 55% on 2PA. Christ. They basically rebounded as much as they scored the entire time. Add in a bunch of blocks for insult to injury as well.

It seems to me that these guys are essentially the Region 3 guys without as much scoring. Still dependable, but perhaps not as gifted as the Shaqs and Yaos of the world. The most frequent offenders here are DeAndre and Dikembe. None (I believe) were known for their ability to dominate from a scoring perspective; rather, their defense was stellar while they were still able to generate points in the right situation, perhaps with a good point guard.

At the end of the day, these guys are great rebounders, get a lot of blocks, and can put up the few points you need them to put up per game.

Region 7

In [103]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults[playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] == 7]['perGameStats_Player'].value_counts()
Out[103]:
DeShawn Stevenson      5
James Jones            3
Brandon Rush           2
Lee Mayberry           2
Dan Majerle            2
Steve Blake            2
Shane Battier          2
Pablo Prigioni         2
Brian Cardinal         1
Sasha Vujacic          1
John Paxson            1
Damon Jones            1
J.R. Smith             1
Sidney Lowe            1
Roger Mason            1
Mike Miller            1
Anthony Tolliver       1
Eric Snow              1
Keith Bogans           1
Jason Kidd             1
Bruce Bowen            1
Bobby Simmons          1
Darius Miller          1
Rashad Vaughn          1
Jason Terry            1
Andre Barrett          1
Luke Babbitt           1
Michael Curry          1
Anthony Brown          1
Matthew Dellavedova    1
Royal Ivey             1
Chris Duhon            1
Steve Novak            1
Kyle Singler           1
Brad Lohaus            1
John Salmons           1
Daniel Gibson          1
Name: perGameStats_Player, dtype: int64

Just by being in this corner of the plot, if you look at it, you

  1. Are not a volume scorer
  2. Do not exhibit big men qualities
  3. Rack up more 3PA than traditional guard qualities (by a small margin)

With that, and by looking at some of the names here, I feel like these guys are guards who are in the rotation but have more of a spot up 3 role. Some of the most prolific (let’s say in recent memory to account for my age bias here) 3 ball guys are here. James Jones 3 times (alongside LeBron James), Shane Battier (alongside LeBron James), Damon Jones (alongside LeBron James), Mike Miller (alongside LeBron James), JR Smith (alongside LeBron James)… wait… did LeBron basically make this category possible? Just kidding, there are other players here too, but jesus it is kinda scary how many of your LeBron 3-ball guys are in here.

These guys are getting 11 / 5 / 4 on PTS / REB / AST; however, the key here is that 7 of their 11 FGA are 3PA. Basically 2 / 3 of all their shots are 3’s. Now, I doubt they’re choosing to pull up from 3 about 65% of the time on their own, which implies that their role is probably to shoot 3’s, or that in many cases the 3 is the best shot they can take and they’re capable of hitting it.
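
To put that 3-point reliance next to the other regions, here is the share of field goal attempts that are 3’s for each region. The inputs are the mean 3PA and FGA values from the summary table, rounded to two decimals, so this is a ratio of regional means rather than a per-player average.

```python
# Mean 3PA and FGA per 100 possessions by region, copied from the summary table
three_pa = {1: 5.05, 2: 2.93, 3: 0.21, 4: 7.17, 5: 2.65, 6: 0.03, 7: 6.92, 8: 1.95, 9: 0.10}
fga = {1: 23.69, 2: 27.38, 3: 23.35, 4: 17.88, 5: 17.16, 6: 13.30, 7: 10.68, 8: 8.87, 9: 6.80}

# Fraction of each region's field goal attempts that are 3-pointers
three_point_share = {r: three_pa[r] / fga[r] for r in sorted(fga)}
for r in sorted(three_point_share):
    print(r, round(three_point_share[r], 2))
```

Region 7 comes out near 0.65, well ahead of Region 4 (about 0.40), which is the next highest.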

Region 8

In [104]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults[playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] == 8]['perGameStats_Player'].value_counts()
Out[104]:
Jason Collins        6
Trenton Hassell      5
Jon Koncak           3
Quinton Ross         3
T.R. Dunn            3
Eduardo Najera       2
Jared Jeffries       2
Bob Hansen           2
Keith Askins         2
Bruce Bowen          2
Derrick McKey        2
Dante Cunningham     1
Troy Murphy          1
Mark Pope            1
Clifford Robinson    1
Sasha Pavlovic       1
Harvey Grant         1
Calbert Cheaney      1
Danny Vranes         1
Tony Battie          1
Jud Buechler         1
Grant Long           1
Dominic McGuire      1
Walter McCarty       1
Kyle Singler         1
Luc Mbah a Moute     1
Robert Horry         1
Scott Hastings       1
Elston Turner        1
Thabo Sefolosha      1
LaSalle Thompson     1
Jeff Cook            1
Joe Wolf             1
Name: perGameStats_Player, dtype: int64

I feel like your Region 8 guys are probably your worst guys given our constraints that the player must have played more than 10 minutes per game and more than 20 games on the season. If you’re getting that kind of minutes, you’re probably doing something right. You must have something to offer. Defense. Length. Rebounding. Intensity. The ability to just not turn it over when all the stars are on the bench and you are on the floor. Maybe your team is just… short of players or trying to tank *cough* develop their young players.

These guys collectively had stats like 9 / 8 / 3 / 1.6 / 1 on PTS / TRB / AST / STL / BLK. Well, actually, when you look at it that way, it’s not horrible. But translated to what this would look like in an actual game and not just per 100 possessions, I think we’d see something more along the lines of 5 / 5 / 2 / 1 / 0.5. Pretty average across the board, and not too good at one thing or another.
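
The jump from per-100 numbers to "an actual game" can be made explicit. A player’s possessions on the floor are roughly the team’s possessions per game times the fraction of the game he plays. The pace of 96 possessions per game and the 30 minutes a night below are illustrative assumptions of mine, not values from the dataset; together they give a scaling factor of 0.6. The function name per100_to_per_game is also made up for this sketch.

```python
def per100_to_per_game(stat_per_100, minutes_per_game, team_pace, game_minutes=48.0):
    """Scale a per-100-possession stat to a per-game estimate.

    team_pace is team possessions per game of game_minutes length.
    """
    possessions_on_floor = team_pace * minutes_per_game / game_minutes
    return stat_per_100 * possessions_on_floor / 100.0

# Region 8 means from the summary table (TRB = ORB + DRB), rounded
region_8 = {'PTS': 8.85, 'TRB': 8.17, 'AST': 2.53, 'STL': 1.61, 'BLK': 0.98}

# ASSUMPTIONS for illustration only: 30 minutes a night at a pace of 96
for stat in ['PTS', 'TRB', 'AST', 'STL', 'BLK']:
    print(stat, round(per100_to_per_game(region_8[stat], 30.0, 96.0), 1))
```

Under those assumptions the line lands close to the 5 / 5 / 2 / 1 / 0.5 guess in the text; fewer minutes or a slower pace would push it lower.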

An interesting one here is Bruce Bowen, who was a key piece of those Spurs teams and famously not a stats guy. Other names here, though (Jason Collins, Quinton Ross, Eduardo Najera, Jared Jeffries), are all really not known for making too much of an impact in games other than just not being a liability.

Region 9

In [105]:
playerAggDfToAnalyzeMin10Min20GamesPCAWithResults[playerAggDfToAnalyzeMin10Min20GamesPCAWithResults['cluster_region'] == 9]['perGameStats_Player'].value_counts()
Out[105]:
Manute Bol          5
DeSagana Diop       4
Joel Przybilla      4
Lorenzo Williams    4
Eric Montross       3
Chris Dudley        3
Larry Smith         3
Dikembe Mutombo*    2
Andrew Bogut        2
Jim McIlvaine       2
Ervin Johnson       2
George Johnson      2
Michael Ruffin      2
Adrian Caldwell     1
Greg Kite           1
Duane Causwell      1
Jamaal Magloire     1
Furkan Aldemir      1
Dennis Rodman*      1
Mark Eaton          1
Leon Douglas        1
Andris Biedrins     1
Name: perGameStats_Player, dtype: int64

Final region… Phew… This, along with Region 7 (spot up 3 guys), is probably the most interesting of the bunch.

These guys look like they’re basically incapable of scoring. 47% on both 2P% AND FT%!! These guys are not shooters, and probably did not grow up practicing shooting too much, averaging only 8 PTS on 7 FGA per 100 possessions. Instead, they look like they were blessed with other gifts: being long, athletic, and tall, scraping up 15 / 4 on TRB / BLK per 100 possessions. That’s no joke for efficiency, as good as the rest of the big men types in Regions 3 and 6.

While Region 8 kind of implies the players weren’t really good at any particular thing, Region 9 implies that these players exhibit elite big men qualities. We even have two HOFers in here, Mutombo and Rodman.

Any more gifted on the offensive end, however, you’d see these guys creep up into Region 6.

Regions, Regions, Regions

Great. So I got 9 regions. So what? To me, this exercise was about abstracting stats up to a level understandable from a basketball perspective. Sure, AST / STL / 3PA might be correlated, but that doesn’t mean one causes the other. Rather, the human who is playing the game has the physical attributes, and has the coaching staff that put him in a situation where he is able to succeed in getting these types of numbers. If you’re not a tall guy, you probably won’t get too many rebounds (generally speaking!).

While I’ll never say “oh you have low point numbers and low big men attributes, and not that many assists, you belong in Region 7 and I will now label you as a spot up shooter”, there is a natural correlation that a spot up shooter will exhibit many of these attributes, and PCA is kind of helping us define these natural tendencies. PCA isn’t defining specific roles, or perhaps even natural correlation actually, but it is painting a story of how the game has been historically played (without necessarily any information on how the game will evolve in the future). Historically, there are stronger correlations between certain factors, and whether it’s due to some sort of natural cause or causation through coaching… etc, these relationships do exist statistically and it’s quite interesting to walk this fine line between watching a basketball game unfolding before your eyes and seeing the same story being told from a numbers perspective.

Next Steps

From this, I want to dive into team lineups, and that might be all I do with the basic metrics for now before diving into advanced metrics to better explore the All-Star / All-NBA / MVP / HOF race.
