So far, we’ve thrown two data exploration methods at the data with really no objectives. Or rather, the objective was just to explore the methods and see how to structure the inputs, run the method / model, reading the outputs, and understanding the application methods. Each method so far has come into use not necessarily in the way that I thought they would be, and that’s what makes this exploration so worth it for me.
Scatterplot matrices were very easy to understand and apply. You get scatterplots between two variables and histograms of single variables. The histogram obviously allows you to understand the distribution, extremely useful, and the scatterplots theoretically show any correlations between two variables. The scatterplots in theory should be quite useful, and in this case we used it to determine that there were essentially no correlations among any of the variables. I’m finding that, with each method, a different part of the story is unravelled. The scatterplot matrix in this case helped me filter out players that didn’t play much and were inflating their per 100 stats. I didn’t really go into the scatterplot matrix thinking “oh, I’ll be able to filter out players inflating their stats!”, but that was actually the largest takeaway from the method for me. The best thing is, I never even plotted Minutes or Games Played!
PCA was interesting and basically just told a story within itself. I honestly didn’t really know what to expect going in, I thought honestly the results might be similar to that of the scatterplot matrix because finding variances in a linear world is very close conceptually to finding linear correlations, but the bi-plot painted a picture that reaffirmed the attributes of guards vs the attributes of big men. We also saw a bit how the game has changed with centers within the bi-plot demonstrating a wider variety of attributes vs guards.
I probably just want to explore one more data exploration / dimension reduction method, and that’s the Isomap. I may do a bit more data exploration on advanced metrics, but I likely won’t use any more new methods.
I’ve said it before and I’ll said it again, I want to check out the Isomap because I saw it in a video once and it just looked beautifully intuitive. Check out this scikitlearn talk by Jake VanderPlas around the 55 minute mark.
AIN’T THAT BEAUTIFUL? Ignorance is bliss, if only I could just admire colors all day without actually having to understand what they mean.
Truth be told, all I know about the isomap is that it seems to be a non-linear dimension reduction algorithm that somehow uses clustering…
Because this is my blog and I do what I want, I’m just going to use isomap and see what happens, and possibly look for an explanation later haha.
I’ll reference the youtube video above to replicate the code
%%R # Load libraries & initial config library(ggplot2) library(gridExtra) library(scales)
# Load libraries & initial config %matplotlib nbagg import pandas as pd import numpy as np import matplotlib.pyplot as plt import boto3 from StringIO import StringIO import warnings warnings.filterwarnings('ignore')
# Retrieve team stats from S3 playerAggDfToAnalyze = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfToAnalyze.csv', index_col = 0) pd.set_option('display.max_rows', len(playerAggDfToAnalyze.dtypes)) print playerAggDfToAnalyze.dtypes pd.reset_option('display.max_rows')
season_start_year int64 perGameStats_Player object perGameStats_Pos object perGameStats_Age int64 perGameStats_Tm object perGameStats_G int64 perGameStats_GS float64 perGameStats_MP float64 per100Stats_FG float64 per100Stats_FGA float64 per100Stats_FGPerc float64 per100Stats_3P float64 per100Stats_3PA float64 per100Stats_3PPerc float64 per100Stats_2P float64 per100Stats_2PA float64 per100Stats_2PPerc float64 per100Stats_FT float64 per100Stats_FTA float64 per100Stats_FTPerc float64 per100Stats_ORB float64 per100Stats_DRB float64 per100Stats_TRB float64 per100Stats_AST float64 per100Stats_STL float64 per100Stats_BLK float64 per100Stats_TOV float64 per100Stats_PF float64 per100Stats_PTS float64 per100Stats_ORtg float64 per100Stats_DRtg float64 advancedStats_PER float64 advancedStats_TSPerc float64 advancedStats_3PAr float64 advancedStats_FTr float64 advancedStats_ORBPerc float64 advancedStats_DRBPerc float64 advancedStats_TRBPerc float64 advancedStats_ASTPerc float64 advancedStats_STLPerc float64 advancedStats_BLKPerc float64 advancedStats_TOVPerc float64 advancedStats_USGPerc float64 advancedStats_OWS float64 advancedStats_DWS float64 advancedStats_WS float64 advancedStats_WS48 float64 advancedStats_OBPM float64 advancedStats_DBPM float64 advancedStats_BPM float64 advancedStats_VORP float64 dtype: object
# Filter to remove outliers, player must have played over 10 minutes and in over 20 games on the season playerAggDfToAnalyzeMin10Min20Games = playerAggDfToAnalyze[(playerAggDfToAnalyze['perGameStats_MP'] > 10) & (playerAggDfToAnalyze['perGameStats_G'] > 20)]
# Select subset of features playerAggDfToAnalyzeMin10Min20GamesIsomapFeatures = playerAggDfToAnalyzeMin10Min20Games[[ 'perGameStats_Pos', 'per100Stats_FGA', 'per100Stats_3PA', 'per100Stats_2PA', 'per100Stats_2PPerc', 'per100Stats_FTA', 'per100Stats_FTPerc', 'per100Stats_ORB', 'per100Stats_DRB', 'per100Stats_AST', 'per100Stats_STL', 'per100Stats_BLK', 'per100Stats_TOV', 'per100Stats_PF', 'per100Stats_PTS' ]].dropna() playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesLabel = playerAggDfToAnalyzeMin10Min20GamesIsomapFeatures['perGameStats_Pos'] playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesData = playerAggDfToAnalyzeMin10Min20GamesIsomapFeatures.drop('perGameStats_Pos', 1)
from sklearn.manifold import Isomap iso = Isomap(n_components = 2) dataProjected = iso.fit_transform(playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesData)
# Combine the isomap results with the position labels so we can map in R dataProjectedDf = pd.concat( [ pd.DataFrame(dataProjected, columns = ['component_1', 'component_2']).reset_index(drop = True), pd.DataFrame(playerAggDfToAnalyzeMin10Min20GamesIsomapFeaturesLabel).reset_index(drop = True) ], axis = 1 )
Wow, first thing right off the bat, that algo took about 5 minutes to run. MUCH MORE SIGNIFICANT amount of time than PCA. PCA, I want to say, was essentially instantaneous. If I look at what’s happening with PCA under the hood, it’s essentially performing a handful of calculations to decompose a matrix into sub-parts. The number of calculations don’t really grow with resizing of the matrix (not in an exponential manner anyways). As for isomap, since I don’t know how that works. I will leave this for now and research later.
%%R -i dataProjectedDf -w 700 -u px ggplot( dataProjectedDf, aes( x = component_1, y = component_2, color = perGameStats_Pos ) ) + geom_point(alpha = 0.3)
Alright, this bares resemblance to whatever we got out of PCA. We see the spectrum of PG / SG / SF / PF / C represented by the same colors. As PCA found the axis of largest variance along a more slanted axis, we see the spread of the different positions here pretty much line up exactly vertically. We also see that it’s the C’s this time that are more tightly bound than the G and SG’s… interesting.
Why does any of this happen? I have no idea. In fact, I don’t even know what components 1 and 2 mean. This is about the right time to actually learn what isomap does.
Actually Learning Isomap
So after watching a few videos on isomap, namely one from Dr Ali Ghodsi from the University of Waterloo:
It seems that isomap is a non-linear dimension reduction, but before we dive into isomap, we need to go even one level higher it seems…
There exists another dimension reduction method that forms the fundamental engine of an isomap, the MDS or Multidimensional Scaling method.
Recall, PCA determines its first two components by looking at the 1st and 2nd axis of greatest variance. It does this by using the SVD decomposition of the data, and in essence, finds these two components using a matrix geometry approach.
So how did isomap find its two components? Isomap, and inherently MDA, finds its components by… well actually it doesn’t seem to matter. The axis are simply an abstract concept that simply define the number of dimensions we want to work in. We tell MDS we want a 2D result, and MDS will give us a two-axis interpretation of the data because that’s the nature of 2D, but the axis themselves don’t mean anything.
Why? Well all MDS is doing is basically finding the distance between all the points within the 7D, 8D, 60D, whatever dimensional space our initial data is in. It’s got a map saying, in 60D space, the distance between points X and Y is 60 units. It not only has the distance between X and Y, but between X and every other point in the data set. It then jumps into the 2D space, or whatever dimension we want our output to be in, and it tries to arrange the points in such a way that the
Distances between the points in the 2D space are as close to the distances in the 60D space as possible
Here is a common example that I’m seeing throughout my research:
Let’s say we have the distances measured between these different US cities.
In reality, this data is already in 2D format, so to go to 2D is a bit silly, but it will give great digestible insight into how MDS works. Also keep in mind that these distances that we have don’t necessarily need to be generated out of a 2D world. Sure, cities are in 2D, but the “euclidean” distance between Lebron and Kyrie also exists in the 6-dimensional space of PTS, DRB, ORB, AST, STL, BLK right? That distance is nothing but a number that we have between the US cities as well.
Now, we tell MDS that we want to work in 2D. MDS then goes to work and runs an optimization algorithm using an iterative approach. Once optimized, we get something like this:
It’s very easy to now see why the axis don’t really mean anything. It seems that north and south are mixed up. Miami is north of NY and that obviously is incorrect geographically! Theoretically, this chart could be completely vertical and it would still satisfy the parameters of MDS! I read that R then runs PCA and organizes the x-axis by the first principal component, which is maybe what the output here did, but anyways I’m not quite sure about the bells and whistles on top for the sklearn isomap algo.
So we see how MDS works. The key here is that
MDS uses the euclidean distance between points in the original space as it’s “distance” metric
and this leads nicely into isomap.
Back To Isomap
Now, the only difference between MDS and isomap is that
isomap assumes a geodesic distance instead of euclidean distance
Why is this important / why do we care? Well, this is the lifeblood that allows us to detect non-linear constraints that the data follows. A very simple image that shows the difference between euclidean and geodesic distance is shown below
Above, d15 is the euclidean distance between points 1 and 5, and the route that flows through d12, d23, d34, and d45 make up an approximation of the geodesic distance. Of course, the true geodesic distance is the one that follows the boundary as close as possible. This is only possible if
- We know the manifold ahead of time and we can model it
- We have infinite points that will allow us to track the manifold with 100% accuracy
Unfortunately, neither of these are usually true, so we try to follow 2. as closely as possible with the data that we have. We can see how this logic makes sense by looking at the image again, by tracing the distance, between the nearest neighbours of a point, we definitely get a better sense of a non-linear trend, if one exists!
This is where the nearest neighbours method comes into play as well that we mentioned earlier from the wikipedia page. Let’s recap how the wiki page outlines the isomap method:
- Determine the neighbours at each point
- Construct a neighbourhood graph
- Compute the shortest path between two nodes
- Compute lower-dimensional embedding
1. Determine the neighbours at each point
Generally, it seems that KNN is used. The sklearn isomap function, in fact, has k as it’s first parameter which defaults to k=5. As k changes, weird things can start happening:
- If k is too small, we may not even be able to connect every single point, that is, we may not have an interconnected network where each node can access every other node
- If k is too large, isomap can become vanilla MDS
2. Construct a neighbourhood graph
Once we have all the KNNs mapped out, we are now left with a fully connected graph of our population… maybe something that looks like this:
3. Compute the shortest path between two nodes
Here, we now build a database of the shortest path between every single node, which seems to generally use something like Djikstra’s Algorithm.
4. Compute lower-dimensional embedding
FINALLY… we get back to MDS. Remember the distances between the cities we had? Well we’ve essentially calculated that, except we jumped through loops of KNN and Djikstra’s to get there to account for the possibility of a non-linear path.
Now, I finally see why the isomap algorithm could’ve taken so long. We did a whole KNN, we did Djikstra’s, and we went through an iterative optimization process for MDS to actually arrive at our mapping. And at the end of the day the axis don’t even mean anything lol.
This is definitely interesting because I never really thought about the possibility of a non-linear manifold. A common manifold that I saw while researching MDS and isomap was the “swiss roll”. If we try to project something like this onto 2D space, we get the following results using different methods:
b) above used a method based on Euclidean distance, and c) above used a method based on geodesic distance. We clearly see the differences in this case as euclidean distance doesn’t quite represent the situation if the swiss roll was “rolled out”, which is what we are after. Instead, it seems to just have been “flattened”.
Back To The NBA…
To me, right now, the NBA (and basketball in general) is a simple game. Mentally complex in terms of strategy, but simple in terms of the structure of the game from a math and physics perspective. Everything seems quite linear, the more minutes you play, the more points you score, rebounds you grab, blocks you make… in a generally linear fashion! It’s not like the more minutes you play, the more points you score exponentially. There may be instances where a player steps it up in the 4th quarter, but that’s not necessarily following so much a relationship than it is more the mental / strategy side of the game affecting the numbers. That can’t necessarily be modelled (I’ll caveat that with can’t be modelled by my dumb brain).
With that, there probably isn’t much left we can do with isomaps because it’s essentially doing what PCA is doing in this linear world. The results speak that story as well as what we got with isomap was essentially a rotated version of the PCA output, and in fact the PCA output in this case gave us more information with the bi-plot as we could see which features spanned in which directions of the principal components. Not ragging on isomaps or MDS here at all, just recognizing the nature of the basketball game likely doesn’t require us going into the non-linear world.
It’s great to know that in the worst case scenario, an isomap does what PCA does, and in the best case scenario, it will account for non-linearities given that we choose our neighbours properly and that we have enough samples close enough to each other to trace the non-linear manifold.