All-NBA Predict #12 – Exploration of Historical NBA Players (Part IV, Isomap)

Looking Back

So far, we’ve thrown two data exploration methods at the data with really no objectives. Or rather, the objective was just to explore the methods themselves: see how to structure the inputs, run the method / model, read the outputs, and understand how to apply them. Each method so far has come into use not necessarily in the way that I thought it would, and that’s what makes this exploration so worth it for me.

Scatterplot matrices were very easy to understand and apply. You get scatterplots between pairs of variables and histograms of single variables. The histograms obviously let you understand the distributions, which is extremely useful, and the scatterplots in theory show any correlations between pairs of variables. In this case, we used them to determine that there were essentially no correlations among any of the variables. I’m finding that, with each method, a different part of the story is unravelled. The scatterplot matrix here helped me filter out players that didn’t play much and were inflating their per-100 stats. I didn’t really go into the scatterplot matrix thinking “oh, I’ll be able to filter out players inflating their stats!”, but that was actually the largest takeaway from the method for me. The best thing is, I never even plotted Minutes or Games Played!

PCA was interesting and basically told a story within itself. I honestly didn’t really know what to expect going in; I thought the results might be similar to those of the scatterplot matrix, because finding axes of variance in a linear world is conceptually very close to finding linear correlations, but the bi-plot painted a picture that reaffirmed the attributes of guards vs the attributes of big men. We also saw a bit of how the game has changed, with centers in the bi-plot demonstrating a wider variety of attributes than guards.

Looking Forward

I probably just want to explore one more data exploration / dimension reduction method, and that’s Isomap. I may do a bit more data exploration on advanced metrics, but I likely won’t use any more new methods.

Isomap

I’ve said it before and I’ll say it again: I want to check out the Isomap because I saw it in a video once and it just looked beautifully intuitive. Check out this scikit-learn talk by Jake VanderPlas around the 55 minute mark.

AIN’T THAT BEAUTIFUL? Ignorance is bliss, if only I could just admire colors all day without actually having to understand what they mean.

Truth be told, all I know about the isomap is that it seems to be a non-linear dimension reduction algorithm that somehow uses clustering…

Because this is my blog and I do what I want, I’m just going to use isomap and see what happens, and possibly look for an explanation later haha.

I’ll reference the youtube video above to replicate the code.

In [1]:
%load_ext rpy2.ipython
In [2]:
%%R
# Load libraries & initial config
library(ggplot2)
library(gridExtra)
library(scales)
In [3]:
# Load libraries & initial config
%matplotlib nbagg
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import boto3
from io import StringIO
import warnings
warnings.filterwarnings('ignore')
In [4]:
# Retrieve team stats from S3
playerAggDfToAnalyze = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfToAnalyze.csv', index_col = 0)

pd.set_option('display.max_rows', len(playerAggDfToAnalyze.dtypes))
print(playerAggDfToAnalyze.dtypes)
pd.reset_option('display.max_rows')
season_start_year          int64
perGameStats_Player       object
perGameStats_Pos          object
perGameStats_Age           int64
perGameStats_Tm           object
perGameStats_G             int64
perGameStats_GS          float64
perGameStats_MP          float64
per100Stats_FG           float64
per100Stats_FGA          float64
per100Stats_FGPerc       float64
per100Stats_3P           float64
per100Stats_3PA          float64
per100Stats_3PPerc       float64
per100Stats_2P           float64
per100Stats_2PA          float64
per100Stats_2PPerc       float64
per100Stats_FT           float64
per100Stats_FTA          float64
per100Stats_FTPerc       float64
per100Stats_ORB          float64
per100Stats_DRB          float64
per100Stats_TRB          float64
per100Stats_AST          float64
per100Stats_STL          float64
per100Stats_BLK          float64
per100Stats_TOV          float64
per100Stats_PF           float64
per100Stats_PTS          float64
per100Stats_ORtg         float64
per100Stats_DRtg         float64
advancedStats_PER        float64
advancedStats_TSPerc     float64
advancedStats_3PAr       float64
advancedStats_FTr        float64
advancedStats_ORBPerc    float64
advancedStats_DRBPerc    float64
advancedStats_TRBPerc    float64
advancedStats_ASTPerc    float64
advancedStats_STLPerc    float64
advancedStats_BLKPerc    float64
advancedStats_TOVPerc    float64
advancedStats_USGPerc    float64
advancedStats_OWS        float64
advancedStats_DWS        float64
advancedStats_WS         float64
advancedStats_WS48       float64
advancedStats_OBPM       float64
advancedStats_DBPM       float64
advancedStats_BPM        float64
advancedStats_VORP       float64
dtype: object
In [5]:
# Filter to remove outliers, player must have played over 10 minutes and in over 20 games on the season
playerAggDfToAnalyzeMin10Min20Games = playerAggDfToAnalyze[(playerAggDfToAnalyze['perGameStats_MP'] > 10) & (playerAggDfToAnalyze['perGameStats_G'] > 20)]
In [12]:
# Select subset of features
playerAggDfToAnalyzeMin10Min20GamesIsomapFeatures = playerAggDfToAnalyzeMin10Min20Games[[
    'perGameStats_Pos',
    'per100Stats_FGA',
    'per100Stats_3PA',
    'per100Stats_2PA',
    'per100Stats_2PPerc',
    'per100Stats_FTA',
    'per100Stats_FTPerc',
    'per100Stats_ORB',
    'per100Stats_DRB',
    'per100Stats_AST',
    'per100Stats_STL',
    'per100Stats_BLK',
    'per100Stats_TOV',
    'per100Stats_PF',
    'per100Stats_PTS'
]].dropna()

playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesLabel = playerAggDfToAnalyzeMin10Min20GamesIsomapFeatures['perGameStats_Pos']
playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesData = playerAggDfToAnalyzeMin10Min20GamesIsomapFeatures.drop('perGameStats_Pos', axis = 1)
In [59]:
from sklearn.manifold import Isomap

iso = Isomap(n_components = 2)
dataProjected = iso.fit_transform(playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesData)
In [49]:
# Combine the isomap results with the position labels so we can map in R
dataProjectedDf = pd.concat(
    [
        pd.DataFrame(dataProjected, columns = ['component_1', 'component_2']).reset_index(drop = True),
        pd.DataFrame(playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesLabel).reset_index(drop = True)
    ],
    axis = 1
)

Wow, first thing right off the bat: that algo took about 5 minutes to run. A MUCH MORE SIGNIFICANT amount of time than PCA, which, I want to say, was essentially instantaneous. If I look at what’s happening with PCA under the hood, it’s essentially performing a handful of calculations to decompose a matrix into sub-parts, and the number of calculations doesn’t really blow up as the matrix grows (not in an exponential manner, anyway). As for isomap, since I don’t know how it works yet, I’ll leave this for now and research it later.
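Just to sanity check that gap, here’s a rough timing sketch, assuming the same feature matrix from the cells above (and fair warning: the Isomap fit will again take a few minutes to run).

# Rough timing comparison (a sketch, not a proper benchmark): PCA vs Isomap
# on the same feature matrix used above
import time
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X = playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesData.values

start = time.time()
PCA(n_components = 2).fit_transform(X)
print('PCA: {:.2f}s'.format(time.time() - start))

start = time.time()
Isomap(n_components = 2).fit_transform(X)
print('Isomap: {:.2f}s'.format(time.time() - start))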

In [58]:
%%R -i dataProjectedDf -w 700 -u px

ggplot(
    dataProjectedDf,
    aes(
        x = component_1,
        y = component_2,
        color = perGameStats_Pos
    )
) +
geom_point(alpha = 0.3)
[Figure all_nba_predict_12_1: Isomap components 1 and 2 of the per-100 player stats, coloured by position]

Alright, this bears resemblance to what we got out of PCA. We see the spectrum of PG / SG / SF / PF / C represented by the same colors. Whereas PCA found its axes of largest variance along a more slanted direction, the spread of the different positions here lines up pretty much exactly vertically. We also see that it’s the C’s this time that are more tightly bound than the PG’s and SG’s… interesting.

Why does any of this happen? I have no idea. In fact, I don’t even know what components 1 and 2 mean. This is about the right time to actually learn what isomap does.

Actually Learning Isomap

So after watching a few videos on isomap, namely one from Dr Ali Ghodsi from the University of Waterloo:

It seems that isomap is a non-linear dimension reduction method, but before we dive into isomap itself, we need to go one level up…

Multidimensional Scaling

There exists another dimension reduction method that forms the fundamental engine of an isomap: MDS, or Multidimensional Scaling.

Recall that PCA determines its first two components by looking for the 1st and 2nd axes of greatest variance. It does this using the SVD decomposition of the data and, in essence, finds these two components through a matrix geometry approach.

So how did isomap find its two components? Isomap, and MDS underneath it, finds its components by… well, actually it doesn’t seem to matter. The axes are simply an abstract concept that defines the number of dimensions we want to work in. We tell MDS we want a 2D result, and MDS will give us a two-axis interpretation of the data because that’s the nature of 2D, but the axes themselves don’t mean anything.

Why? Well, all MDS is doing is finding the distance between all the points in the 7D, 8D, 60D, whatever-dimensional space our initial data lives in. It’s got a map saying: in 60D space, the distance between points X and Y is 60 units. It has not only the distance between X and Y, but between X and every other point in the data set. It then jumps into the 2D space, or whatever dimension we want our output to be in, and tries to arrange the points in such a way that the

Distances between the points in the 2D space are as close to the distances in the 60D space as possible
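Here’s a minimal sketch of that idea (a hypothetical standalone example, not part of the NBA notebook): embed some random high-dimensional points into 2D and check that the pairwise distances are roughly preserved.

# MDS in a nutshell: find a 2D arrangement whose pairwise distances mimic
# the pairwise distances in the original high-dimensional space
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
X_highD = rng.rand(50, 10)                        # 50 points in 10D

mds = MDS(n_components = 2, random_state = 0)
X_2d = mds.fit_transform(X_highD)                 # the same 50 points, now in 2D

# Pairwise distances before vs after -- MDS tries to keep these as close as it can
print(np.corrcoef(pdist(X_highD), pdist(X_2d))[0, 1])
print(mds.stress_)                                # residual mismatch the optimizer couldn't remove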

Here is a common example that I’m seeing throughout my research:

Let’s say we have the distances measured between these different US cities.

In reality, this data is already in 2D, so going to 2D is a bit silly, but it gives great, digestible insight into how MDS works. Also keep in mind that these distances don’t necessarily need to be generated out of a 2D world. Sure, cities live in 2D, but the “euclidean” distance between Lebron and Kyrie also exists in the 6-dimensional space of PTS, DRB, ORB, AST, STL, BLK, right? That distance is nothing but a number, just like the numbers we have between the US cities.

Now, we tell MDS that we want to work in 2D. MDS then goes to work and runs an iterative optimization algorithm. Once it has converged, we get something like this:

It’s now very easy to see why the axes don’t really mean anything. It seems that north and south are flipped: Miami is north of NY, which is obviously incorrect geographically! Theoretically, this chart could be rotated completely vertical and it would still satisfy the objective of MDS! I read that R then runs PCA and orders the x-axis by the first principal component, which may be what happened in this output, but I’m not quite sure what bells and whistles the sklearn isomap algo adds on top.
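To make that concrete, here’s a toy version of the cities example (the distance matrix below is rough, eyeballed mileage between four cities, purely for illustration): hand MDS a precomputed distance matrix and let it place the points.

# MDS on a precomputed distance matrix -- the output coordinates are only
# meaningful relative to each other; the axes themselves are arbitrary
import numpy as np
from sklearn.manifold import MDS

cities = ['NY', 'Chicago', 'Miami', 'LA']

# Approximate pairwise distances in miles (rough numbers for illustration)
dist = np.array([
    [   0,  790, 1280, 2790],
    [ 790,    0, 1380, 2015],
    [1280, 1380,    0, 2730],
    [2790, 2015, 2730,    0]
])

mds = MDS(n_components = 2, dissimilarity = 'precomputed', random_state = 0)
coords = mds.fit_transform(dist)

for city, (x, y) in zip(cities, coords):
    print('{}: ({:.0f}, {:.0f})'.format(city, x, y))
# The relative spacing comes out right, but the whole map can be rotated or
# mirrored and still satisfy MDS just as well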

So we see how MDS works. The key here is that

MDS uses the euclidean distance between points in the original space as its “distance” metric

and this leads nicely into isomap.

Back To Isomap

Now, the only difference between MDS and isomap is that

isomap assumes a geodesic distance instead of euclidean distance

Why is this important, and why do we care? Well, this is the lifeblood that allows us to detect non-linear constraints that the data follows. A very simple image showing the difference between euclidean and geodesic distance is below:

Above, d15 is the euclidean distance between points 1 and 5, and the route that flows through d12, d23, d34, and d45 makes up an approximation of the geodesic distance. Of course, the true geodesic distance is the one that follows the curve of the manifold as closely as possible. That is only possible if

  1. We know the manifold ahead of time and we can model it
  2. We have infinite points that will allow us to track the manifold with 100% accuracy

Unfortunately, neither of these is usually true, so we try to approximate 2. as closely as possible with the data that we have. We can see how this logic makes sense by looking at the image again: by tracing the distances between the nearest neighbours of each point, we definitely get a better sense of a non-linear trend, if one exists!
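Here’s a tiny sketch of that difference, using points sampled along a half circle (a made-up example, not the NBA data): the euclidean distance between the endpoints cuts straight across, while hopping along nearest neighbours recovers something close to the arc length.

# Euclidean vs approximated geodesic distance on a simple curved manifold
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# 50 points along a half circle: a 1D manifold curled up in 2D
theta = np.linspace(0, np.pi, 50)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# Euclidean distance between the two endpoints cuts straight across the gap
euclid = np.linalg.norm(X[0] - X[-1])

# Geodesic approximation: build a nearest-neighbour graph and sum the hops
knn_graph = kneighbors_graph(X, n_neighbors = 5, mode = 'distance')
geodesic = shortest_path(knn_graph, directed = False)[0, -1]

print('euclidean: {:.2f}, geodesic: {:.2f}'.format(euclid, geodesic))  # ~2.00 vs ~3.14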

This is where the nearest neighbours method comes into play. Let’s recap how the wikipedia page outlines the isomap method:

  1. Determine the neighbours at each point
  2. Construct a neighbourhood graph
  3. Compute the shortest path between two nodes
  4. Compute lower-dimensional embedding

1. Determine the neighbours at each point

Generally, it seems that KNN is used. The sklearn isomap function, in fact, has this as its first parameter, n_neighbors, which defaults to 5. As k changes, weird things can start happening (a quick sketch of trying a few values follows the list below):

  • If k is too small, we may not even be able to connect every single point, that is, we may not have an interconnected network where each node can access every other node
  • If k is too large, isomap can become vanilla MDS
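Here’s that quick sketch of trying a few values of n_neighbors on a subsample of the same feature matrix from above (subsampled just so each fit doesn’t take minutes); reconstruction_error() gives a rough sense of how the fit changes with k.

# Try a few neighbourhood sizes -- too small risks a disconnected graph,
# too large drifts towards plain MDS
from sklearn.manifold import Isomap

sampled = playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesData.sample(2000, random_state = 0)

for k in [5, 15, 50]:
    iso = Isomap(n_neighbors = k, n_components = 2)
    iso.fit(sampled)
    print('k = {}: reconstruction error = {:.2f}'.format(k, iso.reconstruction_error()))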

2. Construct a neighbourhood graph

Once we have all the KNNs mapped out, we are now left with a connected graph of our population (hopefully, per the note about k above)… maybe something that looks like this:

3. Compute the shortest path between two nodes

Here, we now build a database of the shortest path between every pair of nodes, which generally uses something like Dijkstra’s Algorithm.

4. Compute lower-dimensional embedding

FINALLY… we get back to MDS. Remember the distances between the cities we had? Well, we’ve essentially calculated the same thing, except we jumped through the hoops of KNN and Dijkstra’s to get there, to account for the possibility of a non-linear path.
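To tie the four steps together, here’s a hand-rolled sketch on synthetic data (an S-curve from sklearn.datasets, so not the NBA stats, and using iterative metric MDS rather than the classical MDS that sklearn’s Isomap uses internally):

# Steps 1-4 stitched together by hand: kNN graph -> shortest paths -> MDS
from sklearn.datasets import make_s_curve
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS

X, t = make_s_curve(n_samples = 500, random_state = 0)

# Steps 1 & 2: find each point's neighbours and build the neighbourhood graph
graph = kneighbors_graph(X, n_neighbors = 10, mode = 'distance')

# Step 3: shortest path between every pair of nodes (Dijkstra under the hood)
geodesic_dist = shortest_path(graph, method = 'D', directed = False)

# Step 4: MDS on the geodesic distance matrix gives the low-dimensional embedding
embedding = MDS(n_components = 2, dissimilarity = 'precomputed', random_state = 0).fit_transform(geodesic_dist)
print(embedding.shape)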


Now, I finally see why the isomap algorithm could’ve taken so long. We did a whole KNN, we did Dijkstra’s, and we went through an iterative optimization process for MDS to actually arrive at our mapping. And at the end of the day the axes don’t even mean anything lol.

This is definitely interesting because I never really thought about the possibility of a non-linear manifold. A common manifold that I saw while researching MDS and isomap was the “swiss roll”. If we try to project something like this onto 2D space, we get the following results using different methods:

b) above used a method based on euclidean distance, and c) above used a method based on geodesic distance. We clearly see the difference in this case, as euclidean distance doesn’t quite represent the situation of the swiss roll being “rolled out”, which is what we’re after. Instead, it seems to just have been “flattened”.
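For the curious, here’s a sketch of that comparison on a generated swiss roll (synthetic data from sklearn, nothing to do with the NBA stats): the same points embedded with plain euclidean MDS next to Isomap’s geodesic version.

# Swiss roll: euclidean-based MDS "flattens" it, geodesic-based Isomap "unrolls" it
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, Isomap

X, color = make_swiss_roll(n_samples = 800, random_state = 0)

embeddings = [
    ('MDS (euclidean)', MDS(n_components = 2, random_state = 0).fit_transform(X)),
    ('Isomap (geodesic)', Isomap(n_neighbors = 10, n_components = 2).fit_transform(X))
]

fig, axes = plt.subplots(1, 2, figsize = (10, 4))
for ax, (name, embedded) in zip(axes, embeddings):
    ax.scatter(embedded[:, 0], embedded[:, 1], c = color, s = 5)
    ax.set_title(name)
plt.show()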

Back To The NBA…

To me, right now, the NBA (and basketball in general) is a simple game. Mentally complex in terms of strategy, but simple in terms of the structure of the game from a math and physics perspective. Everything seems quite linear: the more minutes you play, the more points you score, rebounds you grab, blocks you make… in a generally linear fashion! It’s not like the more minutes you play, the more points you score exponentially. There may be instances where a player steps it up in the 4th quarter, but that’s less a mathematical relationship and more the mental / strategy side of the game affecting the numbers. That can’t necessarily be modelled (I’ll caveat that with: can’t be modelled by my dumb brain).

With that, there probably isn’t much left we can do with isomaps, because they’re essentially doing what PCA does in this linear world. The results tell the same story: what we got with isomap was essentially a rotated version of the PCA output, and in fact the PCA output in this case gave us more information via the bi-plot, since we could see which features spanned which directions of the principal components. Not ragging on isomaps or MDS here at all, just recognizing that the nature of the basketball game likely doesn’t require us to go into the non-linear world.

It’s great to know that, in the worst case scenario, an isomap does what PCA does, and in the best case scenario, it will account for non-linearities, given that we choose our neighbours properly and have enough samples close enough together to trace the non-linear manifold.
