All-NBA Predict #27 – Classifying All-NBA Players (Part IX – K-Nearest Neighbours)

We go from one of the most complex models to one of the least complex: K-NN, here we go. There’s not much to explain. We pick the K closest observations by Euclidean distance and take a vote. If the majority of them are all-NBA, the observation is classified as all-NBA. If the majority of them are not, it’s classified as not all-NBA. If it’s a tie, we break it at random. Anything else to explain? I don’t really think so. Let’s go.
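
To make the voting rule concrete, here is a minimal from-scratch sketch in Python. It is not the R `knn` call used later in this post; the function name `knn_predict` and the toy data are made up purely for illustration.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k, rng=None):
    """Classify one query point by majority vote among its k nearest neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Euclidean distance from the query to every training row
    dists = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists, kind="stable")[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    # Keep every label tied for the most votes, then break the tie at random
    winners = labels[counts == counts.max()]
    return rng.choice(winners)

# Made-up toy data: two features, two classes
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
train_y = np.array(["Not All-NBA", "Not All-NBA", "Not All-NBA",
                    "All-NBA", "All-NBA", "All-NBA"])

pred_high = knn_predict(train_x, train_y, np.array([2.8, 3.0]), k=3)
pred_low = knn_predict(train_x, train_y, np.array([0.1, 0.1]), k=3)
print(pred_high, pred_low)
```

The first query sits inside the All-NBA cluster and the second inside the other one, so each vote is unanimous here. The random tie-break only kicks in when k is even and the votes split.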

In [1]:
# Load libraries & initial config
%load_ext rpy2.ipython

%R library(ggplot2)
%R library(gridExtra)
%R library(scales)
%R library(ggbiplot)
%R library(dplyr)

%matplotlib nbagg
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import boto3
from io import StringIO  # works on Python 2.6+ and Python 3
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Retrieve player stats (with all-star / all-NBA labels) from S3
playerAggDfAllNbaAllStar = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfAllNbaAllStar.csv', index_col = 0)

pd.set_option('display.max_rows', len(playerAggDfAllNbaAllStar.dtypes))
print(playerAggDfAllNbaAllStar.dtypes)
pd.reset_option('display.max_rows')
season_start_year          int64
perGameStats_Player       object
perGameStats_Pos          object
perGameStats_Age           int64
perGameStats_Tm           object
perGameStats_G             int64
perGameStats_GS          float64
perGameStats_MP          float64
per100Stats_FG           float64
per100Stats_FGA          float64
per100Stats_FGPerc       float64
per100Stats_3P           float64
per100Stats_3PA          float64
per100Stats_3PPerc       float64
per100Stats_2P           float64
per100Stats_2PA          float64
per100Stats_2PPerc       float64
per100Stats_FT           float64
per100Stats_FTA          float64
per100Stats_FTPerc       float64
per100Stats_ORB          float64
per100Stats_DRB          float64
per100Stats_TRB          float64
per100Stats_AST          float64
per100Stats_STL          float64
per100Stats_BLK          float64
per100Stats_TOV          float64
per100Stats_PF           float64
per100Stats_PTS          float64
per100Stats_ORtg         float64
per100Stats_DRtg         float64
advancedStats_PER        float64
advancedStats_TSPerc     float64
advancedStats_3PAr       float64
advancedStats_FTr        float64
advancedStats_ORBPerc    float64
advancedStats_DRBPerc    float64
advancedStats_TRBPerc    float64
advancedStats_ASTPerc    float64
advancedStats_STLPerc    float64
advancedStats_BLKPerc    float64
advancedStats_TOVPerc    float64
advancedStats_USGPerc    float64
advancedStats_OWS        float64
advancedStats_DWS        float64
advancedStats_WS         float64
advancedStats_WS48       float64
advancedStats_OBPM       float64
advancedStats_DBPM       float64
advancedStats_BPM        float64
advancedStats_VORP       float64
player_formatted          object
Tm                        object
Player_x                  object
Player_y                  object
all_star                  object
accolades                 object
accolades_any_team        object
VORP_WS_sum              float64
accolades_all_nba         object
dtype: object

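We scale the inputs first because K-NN relies on Euclidean distance, and a feature measured on a larger scale would otherwise dominate the distance (and therefore the vote).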
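
As a rough illustration of why scaling matters, here is a small Python sketch. The two feature columns are made-up numbers on deliberately different scales (they are not win shares or VORP), and `zscore` is a hypothetical helper that centres each column and divides by the sample standard deviation, which is what R's `scale` does by default.

```python
import numpy as np

def zscore(column):
    """Centre to mean 0 and divide by the sample standard deviation (ddof=1)."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std(ddof=1)

# Made-up values on very different scales (not taken from the data set)
feature_small = np.array([0.5, 1.0, 1.5, 2.0])
feature_large = np.array([1000.0, 3000.0, 2000.0, 4000.0])

raw = np.column_stack([feature_small, feature_large])
scaled = np.column_stack([zscore(feature_small), zscore(feature_large)])

# Distance between the first two rows, before and after scaling
raw_dist = np.linalg.norm(raw[0] - raw[1])
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])
print(raw_dist, scaled_dist)
```

Before scaling, the distance is almost entirely the gap in `feature_large`. After scaling, both columns have mean 0 and standard deviation 1, so each contributes on comparable terms.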

In [10]:
%%R -i playerAggDfAllNbaAllStar

# Scale inputs
playerAggDfAllNbaAllStar['advancedStats_WS_scaled'] = scale(playerAggDfAllNbaAllStar['advancedStats_WS'])
playerAggDfAllNbaAllStar['advancedStats_VORP_scaled'] = scale(playerAggDfAllNbaAllStar['advancedStats_VORP'])
In [12]:
%%R

library(class)

# Prepare x and y vars
x = playerAggDfAllNbaAllStar[,c('advancedStats_WS_scaled', 'advancedStats_VORP_scaled')]
y = playerAggDfAllNbaAllStar[,c('accolades_all_nba')]

# Build model
knnModel = knn(x, x, y, k = 30)
In [14]:
%R # Output prediction results
%R knnModelConfMatrix = as.data.frame(table(y, knnModel))
%R print(knnModelConfMatrix)
Out[14]:
            y    knnModel  Freq
1     All-NBA     All-NBA   286
2 Not All-NBA     All-NBA    82
3     All-NBA Not All-NBA   216
4 Not All-NBA Not All-NBA 12636

We’re looking at 99% / 57% (non all-NBA / all-NBA). Pretty bad. There’s a glaring problem that this data set probably has with K-NN, though. Remember how roughly 95% of the data is non all-NBA observations? Well, where all-NBA and non all-NBA overlap, a point’s neighbours are going to include way more non all-NBA observations than all-NBA ones. They’re just… everywhere.
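
For anyone who wants to check the arithmetic, the two percentages come straight from the k = 30 confusion matrix above. A quick Python sketch, with the counts copied by hand from that output (the variable names are just for illustration):

```python
# Counts copied from the k = 30 confusion matrix (y = actual, knnModel = predicted)
all_nba_correct = 286        # actual All-NBA, predicted All-NBA
all_nba_missed = 216         # actual All-NBA, predicted Not All-NBA
not_all_nba_wrong = 82       # actual Not All-NBA, predicted All-NBA
not_all_nba_correct = 12636  # actual Not All-NBA, predicted Not All-NBA

all_nba_total = all_nba_correct + all_nba_missed
not_all_nba_total = not_all_nba_correct + not_all_nba_wrong

# Share of each actual class that was predicted correctly
all_nba_rate = all_nba_correct / float(all_nba_total)
not_all_nba_rate = not_all_nba_correct / float(not_all_nba_total)

# Share of all rows that are Not All-NBA (the class imbalance)
not_all_nba_share = not_all_nba_total / float(all_nba_total + not_all_nba_total)

print(all_nba_rate, not_all_nba_rate, not_all_nba_share)
```

That works out to roughly 57% of all-NBA seasons and 99% of non all-NBA seasons classified correctly, with non all-NBA rows making up about 96% of this table.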

Since we’re kinda cheating here and using the same training set as our test set, we can’t really set k to 1 (the nearest neighbour will always be the observation itself, so it would just predict its own label). Think about that: we are using THE SAME TRAINING SET AS OUR TEST SET and it still only catches 57% of all-NBA players. k = 2 isn’t really fair either, since one of the two neighbours is always the observation itself, so the worst that can happen is a tie. Let’s try k = 3?
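
The "nearest neighbour is yourself" problem is easy to see on toy data. The Python sketch below uses random points with random labels (so there is no real signal to learn), and `nearest_labels` is a hypothetical helper written for this illustration, not part of any library.

```python
import numpy as np

def nearest_labels(train_x, train_y, k, exclude_self=False):
    """Return the labels of each training row's k nearest training rows."""
    # Pairwise Euclidean distances between all training rows
    diffs = train_x[:, None, :] - train_x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    if exclude_self:
        # Leave-one-out style: a row may not count itself as a neighbour
        np.fill_diagonal(dists, np.inf)
    order = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return train_y[order]

rng = np.random.default_rng(0)
train_x = rng.normal(size=(200, 2))     # made-up points
train_y = rng.integers(0, 2, size=200)  # made-up labels, pure noise

with_self = nearest_labels(train_x, train_y, k=1)[:, 0]
without_self = nearest_labels(train_x, train_y, k=1, exclude_self=True)[:, 0]

acc_with_self = (with_self == train_y).mean()
acc_without_self = (without_self == train_y).mean()
print(acc_with_self, acc_without_self)
```

With k = 1 and the training set reused as the test set, every row finds itself at distance zero and the "accuracy" is perfect even though the labels are noise. Excluding the row itself gives a more honest number. On the R side, the `class` package also provides `knn.cv` for leave-one-out cross-validation, which would be one way to get the same effect.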

In [17]:
%%R

# Build model
knnModelk3 = knn(x, x, y, k = 3)
In [19]:
%R # Output prediction results
%R knnModelk3ConfMatrix = as.data.frame(table(y, knnModelk3))
%R print(knnModelk3ConfMatrix)
Out[19]:
            y  knnModelk3  Freq
1     All-NBA     All-NBA   327
2 Not All-NBA     All-NBA    65
3     All-NBA Not All-NBA   175
4 Not All-NBA Not All-NBA 12653

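Here, we’re only at 65% on all-NBA prediction (327 of 502), and that’s with every observation still getting to vote for itself. With all-NBA accuracy that low, it doesn’t even matter what non all-NBA is… Let’s move on.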
