Okay, so last time, I selected two boundaries – one a rectangle, one a straight line. Anything on one side of the shape was considered all-NBA, anything on the other was considered non all-NBA. Did it work? For the rectangle, no. Well… I guess I didn’t really try to optimize it, but it was clear that the decision boundary shape was not optimal just by looking at it. For the line, we got pretty good results. We found a boundary that classified with about 92% accuracy for both classes. 92% is a pretty good number! I’m happy with that!

I then considered the situation where my boundary no longer fits that specific axis of intersection, and how I would compensate for that from an automation perspective.

Luckily, I’ve learned about a model called Linear Discriminant Analysis that does… well… just that!

### Multivariate Gaussian

The basis of LDA is quite simple. It assumes that each class has a multi-variate Gaussian distribution. A Gaussian distribution is, of course, the normal distribution:

Cool. If I didn’t already know that, I probably wouldn’t have gotten this far. A multivariate Gaussian distribution is simply a data set that has a normal distribution in two dimensions, or across two variables / predictors / features:

Essentially we see a normal distribution in the x direction, and a normal distribution in the y direction. Depending on how the means of these distributions line up, and depending on what the variances (and covariance) of the two distributions are like, we frequently see the ‘ellipse’ shape that we see in the all-NBA data!
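
Just to make that concrete, here’s a quick sketch with toy numbers (my own, nothing to do with the NBA data yet) of sampling a 2D Gaussian in numpy. The off-diagonal covariance between the two variables is what tilts and stretches the cloud into that ellipse.

```
# A toy sketch: sample from a 2D Gaussian and plot it to see the 'ellipse' shape.
# The mean and covariance below are made up purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

toy_mean = [0, 0]                  # means of the x and y distributions
toy_cov = [[1.0, 0.8],             # variances on the diagonal...
           [0.8, 1.0]]             # ...covariance between x and y off the diagonal

toy_samples = np.random.multivariate_normal(toy_mean, toy_cov, size = 1000)
plt.scatter(toy_samples[:, 0], toy_samples[:, 1], s = 5)
plt.show()
```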

In fact, we see that each category (all-NBA, not all-NBA) is its own distribution! Let’s take a look at that graph again.

```
# Load libraries & initial config
%load_ext rpy2.ipython
%R library(ggplot2)
%R library(gridExtra)
%R library(scales)
%R library(ggbiplot)
%matplotlib nbagg
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import boto3
from StringIO import StringIO
import warnings
warnings.filterwarnings('ignore')
```

```
# Retrieve player aggregate stats from S3
playerAggDfAllNbaAllStar = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfAllNbaAllStar.csv', index_col = 0)
pd.set_option('display.max_rows', len(playerAggDfAllNbaAllStar.dtypes))
print playerAggDfAllNbaAllStar.dtypes
pd.reset_option('display.max_rows')
```

```
%%R -i playerAggDfAllNbaAllStar -w 700 -u px
allNbaPlot = ggplot(
NULL
) +
geom_point(
data = playerAggDfAllNbaAllStar,
aes(
x = advancedStats_VORP,
y = advancedStats_WS,
color = accolades_all_nba
)
)
allNbaPlot
```

We see the *ellipse* shape in the reds and the blues above. Each category has its own multivariate Gaussian distribution.

### LDA

What, then, is LDA? LDA is a classification and dimension reduction algorithm that takes into account two metrics for each category of data:

- **Maximizing the distance between the means of the classes**
- **Minimizing the variance within each class**

What does this mean exactly? Well let’s take a look.

I found a youtube tutorial online that helped me understand LDA quite easily. Let’s say you’re trying to measure whether or not a drug works on patients with varying transcript counts of a single gene.

If we expand this to two genes, we might get better classification / separation results:

LDA, in essence, is a form of dimension reduction. It’s trying to find a single axis on this graph that we can reduce the data onto, one that maximizes the separation of the categories. This is where the two metrics come in:

You see that in this single dimension (we basically just try to find the right axis), we want to *maximize the distances between the means of the classes* and *minimize the variance within each class*. In this case, we see that the axis on the gene data might look something like this:

Below demonstrates the reason why we need to optimize both the distance between the class means *as well as* the variance within each class!

We see that in the case where we only maximize the distance between the class means, we actually do *not* find the axis of best separation, as the case where we optimize both clearly provides a much larger gap *for the purposes of separation*.
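
To put a toy number on those two metrics, here’s a tiny sketch of the quantity LDA is effectively optimizing (Fisher’s criterion): the squared distance between the projected class means, divided by the sum of the within-class variances. The values below are made up; the point is just that a bigger ratio means a better axis.

```
# Toy sketch of Fisher's criterion on made-up 1D projections of two classes.
# Bigger ratio = class means further apart and/or tighter spread within each class.
import numpy as np

class_a = np.array([1.0, 1.2, 0.8, 1.1, 0.9])   # hypothetical projected values, class A
class_b = np.array([3.0, 3.3, 2.7, 3.1, 2.9])   # hypothetical projected values, class B

between_class = (class_a.mean() - class_b.mean()) ** 2   # distance between the means (squared)
within_class = class_a.var() + class_b.var()             # variance within each class, summed

print('Fisher ratio: {:.2f}'.format(between_class / within_class))
```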

### LDA On All-NBA Data

Okay, so now that we know a *bit* more about LDA, let’s try to apply it to what we have here. Just by looking at the data, we can already kind of guess at what the axis of dimension reduction would be. Not surprisingly, my mind tells me that it’s the axis perpendicular to the decision boundary I drew last time. The method I used last time was simply by eye, and who really knows if the slope was completely optimized for best separation. I saw 93% / 92% and I was happy with the result from just eyeballing it! LDA should allow us to actually put a formula to work and find the axis which optimizes our separation:

Luckily, in R there’s a function called ‘lda’ within the MASS package which provides us LDA superpowers.

```
# Fit an LDA model: classify all-NBA status from VORP and WS using MASS's lda() function
%R library(MASS)
%R allNbaLda = lda(accolades_all_nba ~ advancedStats_VORP + advancedStats_WS, data = playerAggDfAllNbaAllStar)
```
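
As a quick aside (and purely a sketch on my part, this isn’t what the rest of the post uses), scikit-learn has an equivalent on the Python side if you’d rather not jump into R. This assumes the same dataframe and column names loaded earlier.

```
# Rough Python equivalent of the MASS::lda() call above, via scikit-learn (a sketch).
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = playerAggDfAllNbaAllStar[['advancedStats_VORP', 'advancedStats_WS']]
y = playerAggDfAllNbaAllStar['accolades_all_nba']

sklearnLda = LinearDiscriminantAnalysis()
sklearnLda.fit(X, y)

# scalings_ holds the direction of the first linear discriminant, analogous to the
# 'scaling' element of the R lda object
print(sklearnLda.scalings_)
```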

Okay, so now I have an LDA object from R that… well I’m not really sure what it did lol. This is my first time using it so please forgive the naivety. Let’s think about this logically…

What I *THINK* LDA did for sure is find the axis of the greatest separation. Perhaps even multiple axes of separation. Like we had with principal components, there was an item in the list that gave us the dimension-reduced coordinates of each sample. I’d expect to find that here.
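
Sticking with the scikit-learn sketch from a second ago (again, just my own aside, not the R object), that dimension-reduced coordinate is exactly what transform() hands back.

```
# Sketch: the dimension-reduced (first linear discriminant) coordinate of each player,
# using the scikit-learn fit from the aside above.
ld1 = sklearnLda.transform(X)   # one column per discriminant; the binary case gives 1 column
print(ld1[:5])                  # first few players' positions along the discriminant axis
```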

Even before that, though, I know there’s an aspect of covariance that I haven’t considered, but have read about. Covariance is something that comes into play with *multivariate Gaussian* distributions. At the beginning of this post, we looked at a univariate Gaussian and a multivariate Gaussian, right?

A univariate Gaussian distribution has the parameters μ (mean) and σ² (variance), and, in both cases, these are **scalars**.

A multivariate Gaussian distribution then has the parameters μ (mean) and Σ (covariance), where the mean is a *vector* with length equal to the number of dimensions / variables, and the covariance is a **square matrix** with the length of each side of the matrix equal to the number of dimensions / variables.

The *covariance* governs the shape of the multivariate normal distribution, and it takes into account the distribution of **both** variables. The covariance works with the SVD decomposition in mind (I go through this in the first principal components analysis post), as the multivariate Gaussian distribution is basically a multidimensional unit circle (identity distribution) scaled, rotated, and shifted to the data’s liking.

Here are a number of multivariate Gaussian distributions with certain means and covariances:
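
To sketch that ‘unit circle scaled, rotated, and shifted’ idea with toy numbers of my own: build a covariance matrix from a rotation and a couple of axis scales, sample from it, and the sample covariance should come back looking like the matrix you built.

```
# Toy sketch: construct a covariance matrix as (rotation * scale) applied to the identity,
# then verify by sampling. All numbers here are made up for illustration.
import numpy as np

theta = np.deg2rad(30)                                      # rotate the ellipse by 30 degrees
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
scale = np.diag([2.0, 0.5])                                 # stretch along each principal axis

toy_cov = rotation.dot(scale).dot(scale.T).dot(rotation.T)  # covariance implied by scale + rotation
toy_mean = np.array([5.0, 3.0])                             # shift away from the origin

toy_samples = np.random.multivariate_normal(toy_mean, toy_cov, size = 5000)
print(np.cov(toy_samples, rowvar = False))                  # should come back close to toy_cov
```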

Now why do I go into all this? Because LDA *makes the assumption that both classes have the same covariance*. Whether this is correct in this scenario, I have yet to determine both theoretically and practically.

I don’t really know where to start. I could be looking at the data, I could be looking at the model, I could be trying to predict results to see if the results are any indication of anything… Since I just created the model, let me just poke around and see what’s in there as a quick win.

It looks like R gives us a pretty nice plot function to view the density space within the *first linear discriminant component*.

```
# Nice! R's native plot() function works out of the box with an LDA model object!
%R plot(allNbaLda, type = 'both')
```

Okay, so what is this telling us… Remember, this is showing histograms of each class within the *first linear discriminant component*! That means this was LDA’s axis of **largest separation**. Is this better than what we got last time? I’m not quite sure… I can see that in this first linear discriminant component, around x = 0 or x = 0.5 would probably be a good place to split the data. At this point, the tails of each group *seem* to be minimized equally.

I don’t even actually know where the decision boundary sits though… is it actually at x = 0.5?
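
One hedged way to eyeball it, using the scikit-learn sketch from earlier rather than the R object: a binary LDA is just a linear classifier under the hood, so its boundary in the original VORP / WS plane is the line where coef · x + intercept = 0.

```
# Sketch: recover the decision boundary line from the scikit-learn LDA fit above.
# For a binary LDA, points where coef_ . x + intercept_ = 0 sit exactly on the boundary.
w = sklearnLda.coef_[0]          # weights on VORP and WS (in that column order)
b = sklearnLda.intercept_[0]

# Rearrange w[0]*VORP + w[1]*WS + b = 0 into WS as a function of VORP for plotting
slope = -w[0] / w[1]
intercept = -b / w[1]
print('Boundary (sketch): WS = {:.3f} * VORP + {:.3f}'.format(slope, intercept))
```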

```
# klaR's partimat() plots the 2D partition (decision regions) implied by the LDA model
%R library(klaR)
%R partimat(accolades_all_nba ~ advancedStats_WS + advancedStats_VORP, data = playerAggDfAllNbaAllStar, method = 'lda', col.mean = 1)
```

Well, in about 2 lines of code, I’m pretty much fucking mind blown… This thing did what I did by eye automatically in 0.00001 seconds. Okay, maybe 1 second. Compared to the decision boundary that I explored in my last post, this one does have a very similar axis of separation: a line from the top left portion of the graph extending to the bottom right hand side of the graph. It looks like it didn’t make as deep of a cut as I did into the non all-NBA portion of the graph (I’m guessing it’s denoted by ‘N’ here, whereas all-NBA is denoted by an ‘A’), but we have to remember that LDA is not only *not going off the data directly*, but off the covariances of the data, and also that it assumes **the covariance matrices of the two classes are the same!!**

If we look at the density distributions in the first linear discriminant component per class, we can easily see that the distributions are different, and in fact, the non all-NBA group is not even true to the Gaussian shape. It’s a bit left skewed. The all-NBA group, however, is more like a true Gaussian, but it’s much fatter than the non all-NBA distribution. They *are not the same distribution*, so LDA’s assumption is a bit off here (yes, I get that no distribution is ever a true Gaussian, unfortunately, but this is some low-hanging fruit for us). To fix this, we can look at the concept of **Quadratic Discriminant Analysis**, which has the capability to assume *different covariance matrices for each class* and draw *non-linear decision boundaries* as necessary.

Before we jump into that, however, let’s try to actually predict using this model and see how it goes…

```
# Predict using the existing data and model that we have
%R allNbaLdaPrediction = predict(allNbaLda)
# Generate confusion matrix and set -o flag to send results back to python
%R -o allNbaLdaConfMatrix allNbaLdaConfMatrix = as.data.frame(table(playerAggDfAllNbaAllStar[, c('accolades_all_nba')], allNbaLdaPrediction$class))
```

```
# Label dataframe indexes and columns correctly
allNbaLdaConfMatrix.index = ['All-NBA - Successfully Classified', 'Not All-NBA - Wrongly Classified', 'All-NBA - Wrongly Classified', 'Not All-NBA - Successfully Classified']
allNbaLdaConfMatrix.columns = ['True Value', 'Predicted Value', 'Freq']
allNbaLdaConfMatrix
```

```
# Pull the counts out of the confusion matrix and report per-class accuracy
allNbaCorrect = allNbaLdaConfMatrix.get_value('All-NBA - Successfully Classified', 'Freq')
allNbaWrong = allNbaLdaConfMatrix.get_value('All-NBA - Wrongly Classified', 'Freq')
notAllNbaCorrect = allNbaLdaConfMatrix.get_value('Not All-NBA - Successfully Classified', 'Freq')
notAllNbaWrong = allNbaLdaConfMatrix.get_value('Not All-NBA - Wrongly Classified', 'Freq')

print 'All-NBA was classified correctly {} / {} ({:.2f}%)'.format(
    allNbaCorrect,
    allNbaCorrect + allNbaWrong,
    float(allNbaCorrect) / float(allNbaCorrect + allNbaWrong) * 100
)
print 'Not All-NBA was classified correctly {} / {} ({:.2f}%)'.format(
    notAllNbaCorrect,
    notAllNbaCorrect + notAllNbaWrong,
    float(notAllNbaCorrect) / float(notAllNbaCorrect + notAllNbaWrong) * 100
)
```
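
For what it’s worth, here’s a quick cross-check of those numbers using the scikit-learn fit from my earlier aside (not the post’s R pipeline). Since both fits default to empirical class priors, the counts should land very close to the R table above.

```
# Sketch: confusion matrix from the scikit-learn LDA fit, to sanity-check the R results.
from sklearn.metrics import confusion_matrix

sklearnPred = sklearnLda.predict(X)
print(confusion_matrix(y, sklearnPred))   # rows = true class, columns = predicted class
```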

Cool, this model actually does *a lot worse* than the model I built by eye and by doing a bit of sensitivity analysis by calibrating the y-intercept of my decision boundary. Again, **LDA is assuming both classes have the same covariance in their distribution!!** I can absolutely see how that would cause the results to skew towards predicting “Not All-NBA” correctly. Because the distribution of our non all-NBA class is more compact, it’s getting the benefit of the doubt, because the common covariance matrix will have to be something in between the two classes. The modeled distribution of the all-NBA class is *smaller* than it should be, and the modeled distribution of the non all-NBA class is *larger* than it should be. As a result, we see the all-NBA class really suffer in the predictions.
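
A quick way to sanity-check that claim (a sketch, assuming the dataframe is still sitting in the Python session as loaded above) is to just compute the covariance matrix of VORP and WS separately for each class.

```
# Sketch: per-class covariance matrices of the two predictors. If these differ a lot
# between the classes, LDA's shared-covariance assumption is being stretched.
covCols = ['advancedStats_VORP', 'advancedStats_WS']

for label, group in playerAggDfAllNbaAllStar.groupby('accolades_all_nba'):
    print(label)
    print(group[covCols].cov())
```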

### QDA

In the next post, I’ll extend this model to include non-linear boundaries and try out QDA.