# All-NBA Predict #10 – Exploration of Historical NBA Players (Part II, Scatterplot Matrices)

## Data Exploration

So, I have a crapton of data. I didn’t really take too scientific an approach last time when analyzing team stats. I had some ideas in my head that I wanted to execute against, and essentially one thing just led to another. Eventually I came to a point where I was analyzing offenses and defenses, and then I started watching film. It never really had a structure, and I got further and further away from the numbers.

My ADD mind wants to try to come at this from all angles. In one case, the film was explaining the numbers to me. In this case, I want to start purely from the numbers, essentially without any context of what basketball is or how basketball works (okay, I still need to know what rebounds and steals are… and the fact that certain shots are worth 3 points… and that basketball is played with a ball… and that hoops are involved…).

Again, I’m writing these posts primarily to learn about basketball, but I’m also an Engineer working in “analytics”, so the inner nerd wants to jump into advanced analytics and machine learning right off the bat. After all, I am trying to predict basketball outcomes so I can get rich and quit my day job (sorry, employer).

I will have to preface this whole post by saying I’m not a data scientist, but I’m a wannabe data scientist.

A data scientist knows what he wants and how he wants to get there. I’m not saying things don’t change along the way, or that they don’t use exploratory methods to inquire about data sets they have no context of, but I am saying that a data scientist probably has an understanding of what types of unknown information he will know after applying specific methods. Me? I have none of that. I don’t know what I want, I don’t know how to get what I want because I don’t know what I want, and I’m going off shiny new objects more than a methodical approach.

We come back to the idea of failing extremely fast and extremely frequently.

1. I don’t have the analytical experience to really know what to look for and how to look for it
2. I don’t have the understanding of the methods to understand how to use them in the right situations

The following, my friends, is called the “shotgun” approach. Let’s throw a bunch of stuff at some arbitrary target to see what we get!

What is the objective, you ask? It’s simply to know more information coming out of this post than I knew coming in. That’s it.

## Scatterplot Matrices

There’s so many ways to explore data. Man, that was a dumb statement, but it’s true lol. We saw that I started with a bunch of stats. There’s per game, there’s per 36, there’s per 100, there’s advanced… Each category came with a good 10+ features. I narrowed it down by just taking some per 100 and advanced stats, but even then, I have… let’s check.

In [1]:
```%load_ext rpy2.ipython
```
In [2]:
```%%R
# Load libraries & initial config
library(ggplot2)
library(gridExtra)
library(scales)
```
In [3]:
```# Load libraries & initial config
%matplotlib nbagg
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import boto3
from io import StringIO
import warnings
warnings.filterwarnings('ignore')
```
In [4]:
```# Retrieve team stats from S3
playerAggDfToAnalyze = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfToAnalyze.csv', index_col = 0)

pd.set_option('display.max_rows', len(playerAggDfToAnalyze.dtypes))
print(playerAggDfToAnalyze.dtypes)
pd.reset_option('display.max_rows')
```
```season_start_year          int64
perGameStats_Player       object
perGameStats_Pos          object
perGameStats_Age           int64
perGameStats_Tm           object
perGameStats_G             int64
perGameStats_GS          float64
perGameStats_MP          float64
per100Stats_FG           float64
per100Stats_FGA          float64
per100Stats_FGPerc       float64
per100Stats_3P           float64
per100Stats_3PA          float64
per100Stats_3PPerc       float64
per100Stats_2P           float64
per100Stats_2PA          float64
per100Stats_2PPerc       float64
per100Stats_FT           float64
per100Stats_FTA          float64
per100Stats_FTPerc       float64
per100Stats_ORB          float64
per100Stats_DRB          float64
per100Stats_TRB          float64
per100Stats_AST          float64
per100Stats_STL          float64
per100Stats_BLK          float64
per100Stats_TOV          float64
per100Stats_PF           float64
per100Stats_PTS          float64
per100Stats_ORtg         float64
per100Stats_DRtg         float64
dtype: object
```
In [5]:
```print('There are {} columns in the data set'.format(len(playerAggDfToAnalyze.dtypes)))
```
```There are 51 columns in the data set
```

How in the world do we even make sense of 51 columns simultaneously? I’ve been using the pandas plot function with bar charts and histograms this entire time. Those are great for more targeted approaches.

Like in the last post, I wanted to check Kobe’s numbers per year. Easy, two variables. Year, points. There are a multitude of graphs that I could use to communicate this information. A bar chart is as good and as effective as any out there.

I also wanted to see the distribution of points scored by each player / 100 possessions per season. A single variable! Even easier! That’s what a histogram was built for.

Now I want to see 51 variables at a time, maybe get a sense of where I should start, perhaps see some interesting things that I wouldn’t see in two dimensions. As humans, we’re straight up dumb. Beyond basically 3 dimensions, we’re useless. We live in a 3-dimensional world. It’s not fair for me to try to think in 8 dimensions when I can’t even live in a world where 8 dimensions exist. 51 dimensions? Cool, I’d imagine it’s something like this:

And that I’d react something like this:

But NOT because 51 dimensions is so beautiful, but because I’d be so successful in my career and I’d be known as the guy that can think in 51 dimensions. Scatterplot what? Principal components what? Until then, though, let’s try out some methods.

While a scatterplot doesn’t allow us to think in 51 dimensions, it allows us to think in 2 dimensions over and over again very quickly. Let’s build a quick scatterplot matrix of some “key” metrics.

In [13]:
```from pandas.plotting import scatter_matrix

# Build scatterplot matrix
ax = scatter_matrix(playerAggDfToAnalyze[[
'per100Stats_FGPerc',
'per100Stats_FTPerc',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV'
]])

# We have to set axis labels manually with Pandas' scatter_matrix function. Maybe there's a better function out there for
#   scatterplot matrices, but for now, this is fairly simple.
[plt.setp(item.xaxis.get_label(), 'size', 10) for item in ax.ravel()]
[plt.setp(item.yaxis.get_label(), 'size', 10) for item in ax.ravel()]
```
Out[13]:

*(scatterplot matrix of FG%, FT%, TRB, AST, STL, BLK, and TOV per 100 possessions)*

Nothing really interesting here. Not sure if I was really expecting something, but nice to see that basically nothing correlates. If there’s any semblance of a correlation, it would be between AST and FT%. The more assists you get, the more likely you’re a guard? If you’re a guard, you probably shoot free throws pretty well too? Sure, it’s something, but it’s not really anything new or groundbreaking.

Maybe what I’m getting the most out of this, though, is all the outliers in the data. You’re gonna tell me that there are folks averaging 50 turnovers / 100 possessions? So they must first have the ball for at least half of their team’s possessions, and then they must turn it over all 50 times? That can’t be right. This is probably just guys who played only a few possessions during garbage time, got something like 2 turnovers in 4 possessions, and had it extrapolated out.
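For what it’s worth, that extrapolation is just a rate calculation. A quick sketch with made-up numbers (not pulled from the data set):

```python
def per_100(raw_stat, possessions):
    """Extrapolate a raw counting stat to a per-100-possessions rate."""
    return raw_stat / possessions * 100.0

# A garbage-time player with 2 turnovers in only 4 possessions...
print(per_100(2, 4))    # 50.0 TOV per 100 possessions

# ...looks identical to a rotation player somehow committing 35
# turnovers over 70 possessions, which never actually happens.
print(per_100(35, 70))  # 50.0
```

Same rate, wildly different sample sizes, which is exactly why the tiny-minutes guys blow up the scatterplots.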

Let’s take a look at anyone with over 40 turnovers / 100 possessions.

In [15]:
```playerAggDfToAnalyze[playerAggDfToAnalyze['per100Stats_TOV'] > 40]
```
Out[15]:
121 1985 Claude Gregory PF 27 WSB 2 0.0 1.0 24.2 48.5 50.0 83.6 0.0 0.0 0.0 -0.640 -11.5 1.4 -10.2 0.0
322 1993 Chuck Nevitt C 34 SAS 1 0.0 1.0 0.0 0.0 27.5 100.0 0.0 0.0 0.0 -0.174 0.8 -29.3 -28.5 0.0
429 2000 Larry Robinson SG 33 CLE 1 0.0 1.0 0.0 0.0 100.0 45.2 0.0 0.0 0.0 -1.521 -48.1 0.1 -48.0 0.0
312 2009 Coby Karl SG 26 CLE 3 0.0 1.7 0.0 0.0 100.0 37.3 -0.1 0.0 -0.1 -1.112 -45.3 -0.9 -46.2 -0.1

4 rows × 51 columns

I know literally none of these players, and none averaged more than 2 minutes per game that season. Pretty much what I thought. It probably makes sense to remove these guys, as they really are causing false outliers in the data.

I don’t think anyone should be averaging over even like… 10 turnovers per 100 possessions. That’s a lot! Let’s see how many minutes players have to play to shield them from this extrapolation error. Well, not error… “phenomenon”.

In [18]:
```ax = scatter_matrix(playerAggDfToAnalyze[[
'perGameStats_MP',
'per100Stats_TOV'
]])
```

This scatterplot matrix is actually quite helpful to figure out how much data I’d be cutting out if I only took players over a certain MP.

If we just take an arbitrary cutoff of players who play more than, let’s say, 10 minutes, we see that the turnovers / 100 poss come right down to about 10. Let’s take a closer look at folks with turnovers between 10 and 20 (look at the TOV histogram… it’s a very small amount of data, which makes me think this is fine).

In [21]:
```playerAggDfToAnalyze[(playerAggDfToAnalyze['per100Stats_TOV'] > 10) & (playerAggDfToAnalyze['per100Stats_TOV'] < 20)][[
'perGameStats_MP'
]].plot(kind = 'hist')
```
Out[21]:

*(histogram of perGameStats_MP for players with 10–20 TOV / 100 possessions)*

This data set contains nobody that played over 10 minutes. Looks about right. I’m going to go ahead and cut the data set off to only those who played a minimum of 10 minutes. That should be enough to get rid of the outliers. My assumption here is that the turnover numbers will actually come back down to a sane range once these barely-played extrapolations are gone.

In [25]:

```playerAggDfToAnalyzeMin10Min = playerAggDfToAnalyze[(playerAggDfToAnalyze['perGameStats_MP'] > 10)]
```
In [26]:
```# Build scatterplot matrix
ax2 = scatter_matrix(playerAggDfToAnalyzeMin10Min[[
'per100Stats_FGPerc',
'per100Stats_FTPerc',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV'
]])

# We have to set axis labels manually with Pandas' scatter_matrix function. Maybe there's a better function out there for
#   scatterplot matrices, but for now, this is fairly simple.
[plt.setp(item.xaxis.get_label(), 'size', 10) for item in ax2.ravel()]
[plt.setp(item.yaxis.get_label(), 'size', 10) for item in ax2.ravel()]
```
Out[26]:

*(scatterplot matrix of the same seven per-100 stats, now only for players averaging over 10 MP)*

Some of these other ones also look kinda odd… I feel like 30 rebounds / 100 possessions is also kinda crazy, no? Although you know what, there’s a decent amount of folks even in that 20-25 rebound range… crazy.

Let’s look at these culprits over 25 rebounds / 100 possessions and see if they make sense.

In [30]:
```playerAggDfToAnalyzeMin10Min[(playerAggDfToAnalyzeMin10Min['per100Stats_TRB'] > 25)][[
'perGameStats_MP'
]].plot(kind = 'hist')

playerAggDfToAnalyzeMin10Min[(playerAggDfToAnalyzeMin10Min['per100Stats_TRB'] > 25)][[
'season_start_year',
'perGameStats_Player',
'perGameStats_Tm',
'perGameStats_G',
'perGameStats_MP',
'per100Stats_TRB'
]]
```
Out[30]:
| | season_start_year | perGameStats_Player | perGameStats_Tm | perGameStats_G | perGameStats_MP | per100Stats_TRB |
| --- | --- | --- | --- | --- | --- | --- |
| 345 | 1994 | Dennis Rodman* | SAS | 49 | 32.0 | 26.6 |
| 340 | 1996 | Anthony Miller | ATL | 1 | 14.0 | 27.6 |
| 203 | 2008 | Drew Gooden | SAC | 1 | 26.0 | 25.5 |
| 40 | 2012 | Earl Barron | NYK | 1 | 37.0 | 26.0 |
| 108 | 2013 | Andrew Bynum | IND | 2 | 18.0 | 27.4 |
| 540 | 2013 | Malcolm Thomas | SAS | 1 | 15.0 | 30.3 |

First of all, holy F#&@. Dennis Rodman. What the hell, mate, the guy grabbed over a quarter of the boards that happened while he was on the court that season… out of the 10 guys on the floor, 1 guy grabbed over 25% of the rebounds. SURE, WHY NOT.

But you can see a pattern with these other guys. It comes back down to the fact that each row represents one player per one season per one team. I don’t even remember Andrew Bynum being on Indiana, but I guess it happened.
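That row shape (one row per player, per season, per team) is easy to see on a toy frame. The rows below are made up for illustration, but use the same column naming convention as the real data set:

```python
import pandas as pd

# Toy frame: a mid-season trade gives the same player two rows in one season
toy = pd.DataFrame({
    'season_start_year':   [2013, 2013, 2013],
    'perGameStats_Player': ['Andrew Bynum', 'Andrew Bynum', 'Malcolm Thomas'],
    'perGameStats_Tm':     ['CLE', 'IND', 'SAS'],
    'perGameStats_G':      [24, 2, 1],
})

# Counting rows per player-season flags the multi-team seasons
rows = toy.groupby(['perGameStats_Player', 'season_start_year']).size()
print(rows)
# Andrew Bynum shows up twice in 2013: one row per team
```

So a 2-game Bynum stint in Indiana gets a full row of per-100 stats, just like a full season would.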

Here comes another filter. My gut feel is that, for a player to be truly represented by their stats that year, they should have played at least a third of the season. I get injuries and whatnot wiping out half the season, but if an injury or trade wipes out more than 66% of the season, is it even a good representation of what happened? Let’s throw GP into the scatterplot (that’s probably what I should have done with MP, to be honest) and see.

In [33]:
```# Build scatterplot matrix
ax3 = scatter_matrix(playerAggDfToAnalyzeMin10Min[[
'perGameStats_G',
'per100Stats_FGPerc',
'per100Stats_FTPerc',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV'
]])

# We have to set axis labels manually with Pandas' scatter_matrix function. Maybe there's a better function out there for
#   scatterplot matrices, but for now, this is fairly simple.
[plt.setp(item.xaxis.get_label(), 'size', 10) for item in ax3.ravel()]
[plt.setp(item.yaxis.get_label(), 'size', 10) for item in ax3.ravel()]
```
Out[33]:

*(scatterplot matrix of the seven per-100 stats with perGameStats_G added)*

From these charts, it looks like 20 games is a pretty good cutoff, or maybe even 15. If we look at TOV, TRB, and FG%, things pretty much equalize by 20 games.
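One way to sanity-check a cutoff before committing to it is to count how much data each threshold throws away. A sketch on a fake games-played column (the real version would run on playerAggDfToAnalyzeMin10Min['perGameStats_G']):

```python
import numpy as np
import pandas as pd

# Fake games-played column: 1,000 player-seasons, uniform over 1-82 games
rng = np.random.default_rng(0)
games = pd.Series(rng.integers(1, 83, size=1000), name='perGameStats_G')

# Fraction of rows each candidate cutoff would keep
for cutoff in (10, 15, 20, 25):
    kept = (games > cutoff).mean()
    print('G > {:>2}: keeps {:.0%} of rows'.format(cutoff, kept))
```

On the real data the distribution is nowhere near uniform, but the same three lines tell you exactly what a 15- vs 20-game cutoff costs.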

Let’s go with 20 and try the scatterplot again lol.

In [35]:
```playerAggDfToAnalyzeMin10Min20Games = playerAggDfToAnalyze[(playerAggDfToAnalyze['perGameStats_MP'] > 10) & (playerAggDfToAnalyze['perGameStats_G'] > 20)]
```
In [37]:
```# Build scatterplot matrix
ax4 = scatter_matrix(playerAggDfToAnalyzeMin10Min20Games[[
'per100Stats_FGPerc',
'per100Stats_FTPerc',
'per100Stats_PTS',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV'
]])

# We have to set axis labels manually with Pandas' scatter_matrix function. Maybe there's a better function out there for
#   scatterplot matrices, but for now, this is fairly simple.
[plt.setp(item.xaxis.get_label(), 'size', 10) for item in ax4.ravel()]
[plt.setp(item.yaxis.get_label(), 'size', 10) for item in ax4.ravel()]
```
Out[37]:

*(scatterplot matrix of the eight per-100 stats after both the 10-MP and 20-game filters)*

Look at how nicely the data fills out the boxes now. Nice histograms throughout as well.

Your average player, per 100 possessions, has the following statline:

In [38]:
```playerAggDfToAnalyzeMin10Min20Games[[
'per100Stats_FGPerc',
'per100Stats_FTPerc',
'per100Stats_PTS',
'per100Stats_TRB',
'per100Stats_AST',
'per100Stats_STL',
'per100Stats_BLK',
'per100Stats_TOV'
]].describe()
```
Out[38]:
| | per100Stats_FGPerc | per100Stats_FTPerc | per100Stats_PTS | per100Stats_TRB | per100Stats_AST | per100Stats_STL | per100Stats_BLK | per100Stats_TOV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 13220.000000 | 13217.000000 | 13220.000000 | 13220.000000 | 13220.000000 | 13220.000000 | 13220.000000 | 13220.000000 |
| mean | 0.456676 | 0.741509 | 19.998911 | 8.851528 | 4.563026 | 1.661271 | 1.049720 | 3.092700 |
| std | 0.054780 | 0.105516 | 5.839397 | 3.987094 | 3.006272 | 0.675376 | 1.057753 | 0.962238 |
| min | 0.158000 | 0.000000 | 3.100000 | 1.900000 | 0.000000 | 0.000000 | 0.000000 | 0.400000 |
| 25% | 0.420000 | NaN | 15.900000 | 5.400000 | 2.400000 | 1.200000 | 0.300000 | 2.400000 |
| 50% | 0.454000 | NaN | 19.600000 | 8.100000 | 3.600000 | 1.600000 | 0.700000 | 3.000000 |
| 75% | 0.491000 | NaN | 23.600000 | 11.900000 | 6.100000 | 2.000000 | 1.400000 | 3.700000 |
| max | 0.748000 | 1.000000 | 46.400000 | 26.600000 | 19.400000 | 5.800000 | 9.300000 | 7.700000 |
• 20 PTS
• 9 TRB
• 4.5 AST
• 1.5 STL
• 1 BLK
• 3 TOV

Good to know! Do I know more now than I did before? I think so.

In the next post, I’ll try to throw Principal Components Analysis at the problem because… well, basically because I want to learn it, regardless of whether it applies here or not!
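As a tiny preview of where that’s headed: PCA finds the few orthogonal directions that explain most of the variance in a wide table. Here’s a minimal by-hand sketch on fake data (just column-centering plus an SVD; the actual post will use the real stat columns):

```python
import numpy as np

rng = np.random.default_rng(42)

# Fake "player stats": 200 players, 8 columns driven by only 2 latent skills
latent = rng.normal(size=(200, 2))               # two underlying skill axes
mixing = rng.normal(size=(2, 8))                 # each stat blends both axes
stats = latent @ mixing + rng.normal(scale=0.1, size=(200, 8))

# PCA by hand: center the columns, then take the SVD
centered = stats - stats.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = S**2 / (S**2).sum()

# The first two components should soak up nearly all of the variance
print(explained[:2].sum())
```

If something similar happens with the 51 real columns, that would mean a handful of components capture most of what makes players different. We’ll see.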