All-NBA Predict #20 – Classifying All-NBA Players (Part II – Linear Discriminant Analysis)

Okay, so last time, I selected two decision boundaries – one a rectangle, one a straight line. Anything on one side of the shape was considered all-NBA; anything on the other was considered non all-NBA. Did it work? For the rectangle, no. Well… I guess I didn’t really try to optimize it, but it was clear just by looking at it that the shape of the decision boundary wasn’t optimal. For the line, we got pretty good results – a boundary that classified with about 92% accuracy for both classes. 92% is a pretty good number! I’m happy with that!

I then considered the situation where my hand-drawn boundary no longer fits the axis along which the data actually separates, and how I would compensate for that from an automation perspective.

Luckily, I’ve learned about a model called Linear Discriminant Analysis that does… well… just that!

Multivariate Gaussian

The basis of LDA is quite simple. It assumes that each class has a multi-variate Gaussian distribution. A Gaussian distribution is, of course, the normal distribution:

Cool. If I didn’t already know that, I probably wouldn’t have gotten this far. A multivariate Gaussian distribution is simply a data set that has a normal distribution across two (or more) dimensions, or across multiple variables / predictors / features:

Essentially we see a normal distribution in the x direction, and a normal distribution in the y direction. Depending on how the means of these distributions line up and depending on what the variances of the two distributions are like, we frequently see the ‘ellipse’ shape that we see in the all-NBA data!
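To make that concrete, here’s a tiny numpy sketch (my own illustration, not part of the original analysis – the mean and covariance values are made up) that samples a 2D Gaussian with correlated x and y. Plotting the samples gives exactly that tilted ellipse shape.

import numpy as np
import matplotlib.pyplot as plt

# Made-up mean and covariance -- the off-diagonal covariance terms are what
# tilt the point cloud into an ellipse
mean = [3, 5]
cov = [[2.0, 1.5],
       [1.5, 2.0]]
samples = np.random.multivariate_normal(mean, cov, size = 1000)

fig, ax = plt.subplots()
ax.scatter(samples[:, 0], samples[:, 1], s = 5, alpha = 0.3)
ax.set_xlabel('x')
ax.set_ylabel('y')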

In fact, we see that each category (all-NBA, not all-NBA) is its own distribution! Let’s take a look at that graph again.

In [2]:
# Load libraries & initial config
%load_ext rpy2.ipython

%R library(ggplot2)
%R library(gridExtra)
%R library(scales)
%R library(ggbiplot)

%matplotlib nbagg
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import boto3
from StringIO import StringIO
import warnings
warnings.filterwarnings('ignore')
In [3]:
# Retrieve player stats from S3
playerAggDfAllNbaAllStar = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfAllNbaAllStar.csv', index_col = 0)

pd.set_option('display.max_rows', len(playerAggDfAllNbaAllStar.dtypes))
print playerAggDfAllNbaAllStar.dtypes
pd.reset_option('display.max_rows')
season_start_year          int64
perGameStats_Player       object
perGameStats_Pos          object
perGameStats_Age           int64
perGameStats_Tm           object
perGameStats_G             int64
perGameStats_GS          float64
perGameStats_MP          float64
per100Stats_FG           float64
per100Stats_FGA          float64
per100Stats_FGPerc       float64
per100Stats_3P           float64
per100Stats_3PA          float64
per100Stats_3PPerc       float64
per100Stats_2P           float64
per100Stats_2PA          float64
per100Stats_2PPerc       float64
per100Stats_FT           float64
per100Stats_FTA          float64
per100Stats_FTPerc       float64
per100Stats_ORB          float64
per100Stats_DRB          float64
per100Stats_TRB          float64
per100Stats_AST          float64
per100Stats_STL          float64
per100Stats_BLK          float64
per100Stats_TOV          float64
per100Stats_PF           float64
per100Stats_PTS          float64
per100Stats_ORtg         float64
per100Stats_DRtg         float64
advancedStats_PER        float64
advancedStats_TSPerc     float64
advancedStats_3PAr       float64
advancedStats_FTr        float64
advancedStats_ORBPerc    float64
advancedStats_DRBPerc    float64
advancedStats_TRBPerc    float64
advancedStats_ASTPerc    float64
advancedStats_STLPerc    float64
advancedStats_BLKPerc    float64
advancedStats_TOVPerc    float64
advancedStats_USGPerc    float64
advancedStats_OWS        float64
advancedStats_DWS        float64
advancedStats_WS         float64
advancedStats_WS48       float64
advancedStats_OBPM       float64
advancedStats_DBPM       float64
advancedStats_BPM        float64
advancedStats_VORP       float64
player_formatted          object
Tm                        object
Player_x                  object
Player_y                  object
all_star                  object
accolades                 object
accolades_any_team        object
VORP_WS_sum              float64
accolades_all_nba         object
dtype: object
In [4]:
%%R -i playerAggDfAllNbaAllStar -w 700 -u px

allNbaPlot = ggplot(
    NULL
) +
geom_point(
    data = playerAggDfAllNbaAllStar,
    aes(
        x = advancedStats_VORP,
        y = advancedStats_WS,
        color = accolades_all_nba
    )
)

allNbaPlot
[Figure all_nba_predict_20_1: WS vs. VORP scatter, colored by all-NBA accolade]

We see the ellipse shape in both the reds and the blues above. Each category has its own multivariate Gaussian distribution.

LDA

What, then, is LDA? LDA is a classification and dimension reduction algorithm that takes into account two metrics for each category of data:

  1. Maximizing the distance between the means of the classes
  2. Minimizing the variance within each class

What does this mean exactly? Well let’s take a look.

I found a youtube tutorial online that helped me understand LDA quite easily. Let’s say you’re trying to measure whether or not a drug works on patients with varying transcript levels of a single gene.

If we expand this to two genes, we might get better classification / separation:

LDA, in essence, is a form of dimension reduction. It’s trying to find a single axis on this graph that we can project the data onto so as to maximize the separation of the categories. This is where the two metrics come in:

You see that in this single dimension (we basically just try to find the right axis), we want to maximize the distance between the means of the classes and minimize the variance within each class.

In this case, we see that the axis on the gene data might look something like this:

The figure below demonstrates why we need to optimize both the distance between the class means and the variance within each class!

We see that when we only maximize the distance between the class means, we actually do not find the axis of best separation – the case where we optimize both criteria clearly provides a much larger gap between the classes.

LDA On All-NBA Data

Okay, so now that we know a bit more about LDA, let’s try to apply it to what we have here. Just by looking at the data, we can already kind of guess at what the axis of dimension reduction would be. Not surprisingly, my mind tells me it’s the axis perpendicular to the decision boundary I drew last time. The method I used last time was simply by eye, and who really knows if the slope was completely optimized for best separation – I saw 93% / 92% and I was happy with the result from just eyeballing it! LDA should allow us to actually put a formula to work and find the axis which optimizes our separation:

\max\left(\frac{d^2}{s_1^2 + s_2^2}\right)

where d is the distance between the class means after projecting onto the candidate axis, and s_1^2 and s_2^2 are the scatter (variance) of each class after that same projection.
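Before handing this off to R, here’s a rough numpy sketch of that criterion for our two-class, two-feature case. It assumes the playerAggDfAllNbaAllStar dataframe loaded above, the variable names are my own, and it’s just the textbook two-class Fisher direction – not necessarily the exact computation R does under the hood.

import numpy as np

# Split the two features by class
feats = ['advancedStats_VORP', 'advancedStats_WS']
allNba = playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['accolades_all_nba'] == 'All-NBA'][feats].values
notAllNba = playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['accolades_all_nba'] == 'Not All-NBA'][feats].values

# Class means and within-class scatter matrices
m1, m2 = allNba.mean(axis = 0), notAllNba.mean(axis = 0)
s1 = np.cov(allNba, rowvar = False) * (len(allNba) - 1)
s2 = np.cov(notAllNba, rowvar = False) * (len(notAllNba) - 1)

# The two-class Fisher direction is proportional to Sw^-1 (m1 - m2)
w = np.linalg.solve(s1 + s2, m1 - m2)

# Project onto w and evaluate d^2 / (s1^2 + s2^2)
p1, p2 = allNba.dot(w), notAllNba.dot(w)
print((p1.mean() - p2.mean()) ** 2 / (p1.var() + p2.var()))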

Luckily, R’s MASS package gives us an lda() function which provides us LDA superpowers.

In [16]:
# Fit an LDA model predicting the all-NBA accolade from VORP and WS
%R library(MASS)
%R allNbaLda = lda(accolades_all_nba ~ advancedStats_VORP + advancedStats_WS, data = playerAggDfAllNbaAllStar)
Out[16]:
<ListVector - Python:0x000000000C45F288 / R:0x000000000E208348>
[FactorVector, Matrix, Matrix]
  class: <class 'rpy2.robjects.vectors.FactorVector'>
  <FactorVector - Python:0x0000000008B87B08 / R:0x000000000F6A46A0>
[       1,        2,        2, ...,        2,        2,        2]
  posterior: <class 'rpy2.robjects.vectors.Matrix'>
  <Matrix - Python:0x000000000C481FC8 / R:0x000000000F6B1560>
[0.999584, 0.000021, 0.108310, ..., 0.999991, 0.999996, 0.999991]
  x: <class 'rpy2.robjects.vectors.Matrix'>
  <Matrix - Python:0x000000000C481848 / R:0x000000000F6F1E90>
[-4.775113, 0.562312, -1.929384, ..., 0.811234, 1.068619, 0.796734]

Okay, so now I have an LDA object from R that… well I’m not really sure what it did lol. This is my first time using it so please forgive the naivety. Let’s think about this logically…

What I THINK LDA did for sure is find the axis of greatest separation – perhaps even multiple axes of separation. Like we had with principal components, there was an item in the list that gave us the dimension-reduced coordinates of each sample. I’d expect to find that here.
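To help myself poke at what an LDA fit actually contains, here’s a hedged side-experiment in Python with scikit-learn’s LDA (assuming scikit-learn is available – this is a separate model from the R object above, so treat it as a sanity check rather than the same thing): it exposes the per-class means, the discriminant direction, and, much like PCA, a transform() that hands back the dimension-reduced coordinate of every player.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

feats = ['advancedStats_VORP', 'advancedStats_WS']
X = playerAggDfAllNbaAllStar[feats].values
y = playerAggDfAllNbaAllStar['accolades_all_nba'].values

skLda = LinearDiscriminantAnalysis()
skLda.fit(X, y)

print(skLda.means_)            # per-class means of VORP and WS
print(skLda.scalings_)         # the discriminant direction (axis of separation)
print(skLda.transform(X)[:5])  # dimension-reduced coordinates, analogous to PCA scores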

Even before that, though, I know there’s an aspect of covariance that I haven’t considered, but have read about. Covariance is something that comes into play with multivariate Gaussian distributions. At the beginning of this post, we looked at a univariate Gaussian and a multivariate Gaussian, right?

A univariate Gaussian distribution has the parameters mean \mu and variance \sigma^2, and in both cases these are scalars.

A multivariate Gaussian distribution, then, has the parameters mean \mu and covariance \Sigma, where the mean is a vector whose length is the number of dimensions / variables, and the covariance is a square matrix whose side length is the number of dimensions / variables.
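For reference (this is just the standard definition, not something specific to our data), the density of a multivariate Gaussian ties these two parameters together:

f(x) = \frac{1}{(2\pi)^{k/2}\,\lvert\Sigma\rvert^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)

where k is the number of dimensions / variables.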

The covariance governs the shape of the multivariate normal distribution, and it takes into account the distribution of both variables. The covariance works with SVD in mind (I go through this in the first principal components analysis post), as a multivariate Gaussian distribution is basically a multidimensional unit circle (the identity-covariance distribution) scaled, rotated, and shifted to the data’s liking.
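Here’s a small numpy sketch of that idea (my own illustration, with made-up rotation, scale, and shift values): start from the identity-covariance ‘unit circle’ distribution, scale it, rotate it, and shift it, and the covariance of the result is exactly the rotation and scaling baked together.

import numpy as np

theta = np.pi / 6                                  # made-up rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],     # rotation matrix
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([2.0, 0.5])                            # made-up scaling along each axis
mu = np.array([3.0, 5.0])                          # made-up shift (the mean)

# Start from the 'unit circle': standard normal samples with identity covariance
z = np.random.standard_normal((1000, 2))

# Scale, rotate, then shift each sample
x = z.dot(S).dot(R.T) + mu

# The implied covariance is R S S^T R^T -- compare it against the empirical one
print(R.dot(S).dot(S.T).dot(R.T))
print(np.cov(x, rowvar = False))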

Here are a number of multivariate Gaussian distributions with various means and covariances:

Now why do I go into all this? Because LDA makes the assumption that both classes have the same covariance. Whether this is correct in this scenario, I have yet to determine both theoretically and practically.

I don’t really know where to start – I could be looking at the data, I could be looking at the model, I could be trying to predict results to see if the predictions tell me anything… Since I just created the model, let me just poke around and see what’s in there as a quick win.

It looks like R gives us a pretty nice plot function to view the density space within the first linear discriminant component.

In [51]:
# Nice! R's native plot() function works out of the box with an LDA model object!
%R plot(allNbaLda, type = 'both')
[Figure all_nba_predict_20_2: per-class histograms / densities along the first linear discriminant]

Okay, so what is this telling us… Remember, this is showing histograms of each class within the first linear discriminant component! That means this was LDA’s axis of largest separation. Is this better than what we got last time? I’m not quite sure… I can see that in this first linear discriminant component, somewhere around x = 0 or x = 0.5 would probably be a good place to split the data – at that point, the tails of both groups seem to be minimized about equally.

I don’t even actually know where the decision boundary sits though… is it actually at x = 0.5?
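One hedged way to check (using the scikit-learn side model from earlier rather than the R object, so the LD1 scaling convention may not line up exactly with the plot above): look at the range of LD1 values assigned to each predicted class. For a two-class LDA, the prediction flips at a single threshold along LD1, so the two ranges bracket the cut point.

# Reusing skLda and X from the scikit-learn sketch above
ld1 = skLda.transform(X).ravel()
preds = skLda.predict(X)

for label in ['All-NBA', 'Not All-NBA']:
    mask = preds == label
    print('{}: LD1 from {:.3f} to {:.3f}'.format(label, ld1[mask].min(), ld1[mask].max()))

That said, klaR’s partimat() below just draws the whole boundary for us directly in WS / VORP space.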

In [59]:
# klaR's partimat() plots the LDA decision boundary directly in feature space
%R library(klaR)
%R partimat(accolades_all_nba ~ advancedStats_WS + advancedStats_VORP, data = playerAggDfAllNbaAllStar, method = 'lda', col.mean = 1)
[Figure all_nba_predict_20_3: partimat() LDA decision boundary over WS and VORP]

Well, in about 2 lines of code, I’m pretty much fucking mind blown… This thing did what I did by eye automatically in 0.00001 seconds. Okay, maybe 1 second. Compared to the decision boundary that I explored in my last post, this one does have a very similar axis of separation – a line from the top left portion of the graph extending to the bottom right. It looks like it didn’t make as deep of a cut into the non all-NBA portion of the graph as I did (I’m guessing the non all-NBA class is denoted by ‘N’ here, whereas all-NBA is denoted by an ‘A’), but we have to remember that LDA isn’t going off the raw data points directly but rather the means and covariances of the data, and also that it assumes the covariance matrices of the two classes are the same!!

If we look at the density distributions in the first linear discriminant component per class, we can easily see that the distributions are different – in fact, the non all-NBA group isn’t even true to the Gaussian shape; it’s a bit left skewed. The all-NBA group, however, is more like a true Gaussian, but it’s much fatter than the non all-NBA distribution. They are not the same distribution, so LDA’s assumption is a bit off here (yes, I get that no distribution is ever a true Gaussian, unfortunately, but this is some low hanging fruit for us). To fix this, we can look at the concept of Quadratic Discriminant Analysis, which can assume a different covariance matrix for each class and draw non-linear decision boundaries as necessary.
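One quick way to put numbers on how different those two distributions really are (a sketch of my own, reusing the dataframe loaded earlier): compute the sample covariance of VORP and WS within each class and compare them side by side.

feats = ['advancedStats_VORP', 'advancedStats_WS']

# Sample covariance of the two features within each class
for label in ['All-NBA', 'Not All-NBA']:
    classDf = playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['accolades_all_nba'] == label]
    print(label)
    print(classDf[feats].cov())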

Before we jump into QDA, however, let’s try to actually predict using this LDA model and see how it goes…

In [74]:
# Predict using the existing data and model that we have
%R allNbaLdaPrediction = predict(allNbaLda)

# Generate confusion matrix and set -o flag to send results back to python
%R -o allNbaLdaConfMatrix allNbaLdaConfMatrix = as.data.frame(table(playerAggDfAllNbaAllStar[, c('accolades_all_nba')], allNbaLdaPrediction$class))
Out[74]:
Var1 Var2 Freq
1 All-NBA All-NBA 347
2 Not All-NBA All-NBA 231
3 All-NBA Not All-NBA 155
4 Not All-NBA Not All-NBA 12487
In [84]:
# Label dataframe indexes and columns correctly
allNbaLdaConfMatrix.index = ['All-NBA - Successfully Classified', 'Not All-NBA - Wrongly Classified', 'All-NBA - Wrongly Classified', 'Not All-NBA - Successfully Classified']
allNbaLdaConfMatrix.columns = ['True Value', 'Predicted Value', 'Freq']
allNbaLdaConfMatrix
Out[84]:
                                       True Value    Predicted Value   Freq
All-NBA - Successfully Classified      All-NBA       All-NBA             347
Not All-NBA - Wrongly Classified       Not All-NBA   All-NBA             231
All-NBA - Wrongly Classified           All-NBA       Not All-NBA         155
Not All-NBA - Successfully Classified  Not All-NBA   Not All-NBA       12487
In [90]:
print 'All-NBA was classified correctly {} / {} ({})'.format(
    allNbaLdaConfMatrix.get_value('All-NBA - Successfully Classified', 'Freq'),
    allNbaLdaConfMatrix.get_value('All-NBA - Successfully Classified', 'Freq') + allNbaLdaConfMatrix.get_value('All-NBA - Wrongly Classified', 'Freq'),
    float(allNbaLdaConfMatrix.get_value('All-NBA - Successfully Classified', 'Freq')) / float(allNbaLdaConfMatrix.get_value('All-NBA - Successfully Classified', 'Freq') + allNbaLdaConfMatrix.get_value('All-NBA - Wrongly Classified', 'Freq'))*100
)

print 'Not All-NBA was classified correctly {} / {} ({})'.format(
    allNbaLdaConfMatrix.get_value('Not All-NBA - Successfully Classified', 'Freq'),
    allNbaLdaConfMatrix.get_value('Not All-NBA - Successfully Classified', 'Freq') + allNbaLdaConfMatrix.get_value('Not All-NBA - Wrongly Classified', 'Freq'),
    float(allNbaLdaConfMatrix.get_value('Not All-NBA - Successfully Classified', 'Freq')) / float(allNbaLdaConfMatrix.get_value('Not All-NBA - Successfully Classified', 'Freq') + allNbaLdaConfMatrix.get_value('Not All-NBA - Wrongly Classified', 'Freq'))*100
)
All-NBA was classified correctly 347 / 502 (69.1235059761)
Not All-NBA was classified correctly 12487 / 12718 (98.1836766787)

Cool, this model actually does a lot worse than the model I built by eye, with a bit of sensitivity analysis calibrating the y-intercept of my decision boundary. Again, LDA assumes both classes have the same covariance in their distributions!! I can absolutely see how that would cause the results to skew toward predicting “Not All-NBA” correctly.

Because the distribution of our non all-NBA class is more compact, it gets the benefit of the doubt: the common covariance matrix has to be something in between the two classes, so the assumed distribution of the all-NBA class ends up tighter than it should be, and the assumed distribution of the non all-NBA class ends up wider than it should be. As a result, we see the all-NBA class really suffer in the predictions.
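To see that effect numerically (another hedged sketch of my own, reusing the per-class covariance idea from above): the common covariance LDA works with is essentially a sample-size-weighted blend of the two class covariances, so with roughly 12,700 non all-NBA seasons against ~500 all-NBA ones, the blend leans heavily toward the tighter non all-NBA covariance.

import numpy as np

feats = ['advancedStats_VORP', 'advancedStats_WS']
covs, dfs = [], []
for label in ['All-NBA', 'Not All-NBA']:
    classDf = playerAggDfAllNbaAllStar[playerAggDfAllNbaAllStar['accolades_all_nba'] == label]
    covs.append(classDf[feats].cov().values)
    dfs.append(len(classDf) - 1)

# Pooled covariance: each class's covariance weighted by its degrees of freedom
pooledCov = (dfs[0] * covs[0] + dfs[1] * covs[1]) / float(dfs[0] + dfs[1])
print(pooledCov)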

QDA

In the next post, I’ll extend this model to include non-linear boundaries and try out QDA.
