Jesus christ… we are on our 10th classifier… I’m going to have to make this my last one because I feel like I’m not doing any analysis anymore haha. NOT THAT THIS ISN’T FUN… but I’d like to actually be able to compare some of these methods together to get a better sense of how they function compared to each other. I’m basically just playing with very very shiny things right now.
Jesus I’m sorry that is scary. While I am extremely excited by the shiny object, I don’t quite feel the fear that the cat feels in that photo.
Now… gradient boosted trees… what the hell does that mean exactly. Let’s start easy. What is a tree?
Oh man, what a beautiful photo. Nice wide angle, the glare from the sun peeking through the trees, the contrast from the time of day… perfect. But not the tree we're talking about, unfortunately. Turn it upside down, though, and we get something that we explored in post #23!
The two photos are pretty much mirror images of each other, right? Well, trees make up the base of the classifier we are about to use, but first we have to explore gradient boosting. Gradient boosting is essentially an ensemble method that combines a bunch of trees together. Wait, didn't we already do this with bagging and random forests in post #24? The answer is yes: all 3 are ensemble methods using the basic decision tree as their base learner. However, gradient boosting is a beast that works in a very different fashion. I'm not going to pretend I've been there since the beginning of gradient boosting, or machine learning in general, but the buzz is that gradient boosting has become one of the biggest breakthrough concepts of the last decade, and the xgboost library in particular, which we're going to use today, is owning all the kaggle competitions right now. It seems that, along with deep learning neural nets, gradient boosting is basically magical.
Ok, let's get into gradient boosting. The main explanation I hear about gradient boosting, including how ESL introduces it, is that boosting builds a committee of learners by combining multiple weak learners. In general, the ensemble approach is analogous to the "ask the audience" lifeline in Who Wants to Be a Millionaire. How many times have you seen the phone-a-friend guy get the answer wrong? Plenty. But the audience (albeit that lifeline is often used in the earlier stages of the game) is almost never wrong! Boosting takes this one step further by making each of these learners an expert in one section of the data. It's like being able to phone 100 friends, each of them an expert at something. In fact, boosting is even crazier than that: after we've seen the question, we can, in hindsight, pick the specific friend who knows the subject matter and increase our chances dramatically that way.
Before fully understanding boosting from ESL, I first stumbled upon this video by Alexander Ihler @ UC Irvine who does a beautiful job just visualizing the steps of boosting.
We start out with a data set and we fit a very simple model. Here, our step function can essentially be seen as a 1-node tree.
On the left hand side, we see our raw data. Here, we only have 1 feature (x axis) and an output (y axis). We "fit" the regression tree, which here is really just a single if / then rule:

- if $x < c$, then predict $\hat{y} = a$
- if $x \geq c$, then predict $\hat{y} = b$

where $c$ is the split point and $a$ and $b$ are simply the means of the outputs on either side of the split.
On the right hand side, we see the residuals of our first tree against our data. This is where our second learner comes in. In classic boosting, we then fit another regression tree to the residuals.
Our model then becomes our original tree plus some weighted version of our second tree.
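In symbols (generic notation on my part, nothing specific to the video): if $F_{m-1}$ is the model so far and $h_m$ is the new tree fit to its residuals, the update is

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$

where $\gamma_m$ is the weight (often shrunk by a learning rate) applied to the newest tree.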
What do we do with this new model? That’s right, we fit another tree to its residuals.
What really ends up happening is that we build a model that gets closer and closer to the actual shape of the data. Like a neural network of sorts, we're essentially able to take something simple like a tree (depicted by a step function here) and sum more and more of them together to build the model or decision boundary. If we took infinite trees, just like if we took infinite step functions or sigmoids, we could essentially build any function we want. Also like the neural network, boosting loses a bit of its interpretability, because at the end of the day it's hundreds or thousands of trees, and good luck understanding the placement of each one, but we'd see something like this:
We see above that our residuals are shrinking more and more, because all we've done here is take the latest model, find its weaknesses, and fit a new model specifically geared towards fixing those weaknesses. In a sense, the newest tree at every iteration is trained to be an expert at everything the old model missed. Often, single node trees are used for the weak learners, so it takes a few iterations for the trees to start becoming experts at very specific portions of the data, or even specific observations themselves, but in general the newest tree is picking up the slack left by the sum of the previous trees.
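To make that loop concrete, here's a tiny Python sketch of classic boosting on residuals using depth-1 regression trees (stumps) from scikit-learn. This is just my own toy illustration of the idea on made-up data, not xgboost itself (no gradients, no regularization terms):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-feature regression problem, made up purely for illustration
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.2, 200)

learning_rate = 0.3
n_trees = 50

# Start from a constant prediction, then repeatedly fit a stump to the residuals
prediction = np.full_like(y, y.mean())
trees = []
for m in range(n_trees):
    residuals = y - prediction                      # what the current model still misses
    stump = DecisionTreeRegressor(max_depth = 1)    # a 1-split weak learner
    stump.fit(X, residuals)
    prediction += learning_rate * stump.predict(X)  # add a damped version of the new tree
    trees.append(stump)
    if m % 10 == 0:
        print('tree %d: mean squared residual = %.4f' % (m, np.mean(residuals ** 2)))

Each pass fits a stump to whatever the running prediction still gets wrong and adds a damped version of it to the model, which is exactly the shrinking-residuals picture above.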
Gradient boosting is an extension of classic boosting where, instead of fitting the residuals directly, we define a loss function, as we always do, and descend through its gradient to the hells of uninterpretable statistical accuracy. Amazing.
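To connect the two pictures (this is just the standard derivation restated in generic notation): at each iteration, the new tree is fit to the negative gradient of the loss with respect to the current model's predictions, the so-called pseudo-residuals

$$r_{im} = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}$$

With squared error loss $L(y, F) = \frac{1}{2}(y - F)^2$, this negative gradient is exactly $y_i - F_{m-1}(x_i)$, the plain residual, so classic boosting on residuals is just the special case of gradient boosting under squared error.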
In the world of classification, there are a few loss functions that are usually used:
This plot is straight out of ESL and does a great job of visualizing how our misclassifications are being penalized.
In the case of misclassification loss, our rule is simple: we penalize a constant amount for wrong classifications, and we don't penalize at all for correct classifications. We can see situations where this may not be ideal, though: a point that is grossly misclassified (sitting far on the wrong side of the decision boundary) arguably deserves a harsher penalty than a borderline miss. A margin value of less than 0 is a misclassification, by the way, and above 0 is a correct classification.
If we look at squared error loss, we see that we’d actually be penalizing correct classifications! This means that points well within the correct side of the decision boundary will get penalized the more correct they are!
We then get into the loss functions more commonly used for classification, like binomial deviance, where we don't penalize grossly wrong classifications quite as harshly (more robust to outliers) and we also penalize observations less and less the more correct they become.
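For reference, the losses in that ESL plot can be written as functions of the margin $y f(x)$ with $y \in \{-1, +1\}$ (I'm just restating the standard forms here, not deriving anything):

- Misclassification loss: $I\big(\operatorname{sign}(f(x)) \neq y\big)$
- Exponential loss (AdaBoost): $e^{-y f(x)}$
- Binomial deviance: $\log\big(1 + e^{-2 y f(x)}\big)$
- Squared error: $(y - f(x))^2 = (1 - y f(x))^2$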
This brings us to xgboost. This is a gradient boosted trees algorithm that… well… I need to learn more about lol.
# Load libraries & initial config
%load_ext rpy2.ipython
%R library(ggplot2)
%R library(gridExtra)
%R library(scales)
%R library(ggbiplot)
%R library(dplyr)
%matplotlib nbagg
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import boto3
from StringIO import StringIO
import warnings
warnings.filterwarnings('ignore')
# Retrieve player stats from S3
playerAggDfAllNbaAllStar = pd.read_csv('https://s3.ca-central-1.amazonaws.com/2017edmfasatb/fas_boto/data/playerAggDfAllNbaAllStar.csv', index_col = 0)
pd.set_option('display.max_rows', len(playerAggDfAllNbaAllStar.dtypes))
print playerAggDfAllNbaAllStar.dtypes
pd.reset_option('display.max_rows')
So xgboost apparently suggests that you download and install from github rather than CRAN, because CRAN lags behind by quite a bit, but I'm having a bit of trouble getting the xgboost package to compile right now. I'm on a windows machine unfortunately, and there are a few extra steps that have taken me 15 or so minutes of googling now. I just want to get my hands on xgboost to play around at a high level, so I'll just download it from CRAN.
For xgboost, we have to format the data in a numeric matrix. This can be done simply with the data.matrix() function. Our labels also have to be a numeric matrix, so we just have to modify them to be 1 or 0, with 1 indicating all-NBA status.
%%R -i playerAggDfAllNbaAllStar -o gbModel
library(xgboost)

# Prepare x and y vars
x = data.matrix(playerAggDfAllNbaAllStar[,c('advancedStats_WS', 'advancedStats_VORP')])
y = data.matrix(ifelse(playerAggDfAllNbaAllStar[,c('accolades_all_nba')] == 'All-NBA', 1, 0))

# Build gradient boosted tree model with 10-fold cross validation and 2 iterations of boosted trees
gbModel = xgb.cv(
data = x,
label = y,
nround = 2,
objective = 'binary:logistic',
eval_metric = 'auc',
nfold = 10
)
print gbModel
Alright, so I’m basically just following this tutorial here from Tong He, the creator and maintainer of the R package.
If you watch the video, you'll see that he has a cross validation section. I'm just following his lead here and building a quick GBM with 2 iterations, but we see real quick that even at 1 iteration we're at 0.957 AUC (note that I'm looking at the test_auc_mean metric here). Recall that with the neural net we were sitting at almost 0.98. Not bad, and I'm sure the AUC will increase as we increase the number of trees.
Real quick though, I just wanted to review (also from the video) that for binary 1-0 classification, xgboost seems to default to the log-loss function:

$$L(y_i, \hat{p}_i) = -\big[\, y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \,\big]$$
The xgboost calculation also takes into account a regularization term:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

For a total objective to minimize of:

$$\text{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$$

Notice here that the regularization term is essentially controlling the depth of the tree, as $T$ is the number of terminal nodes and $w_j$ is the value of the $j$th terminal node.
Usually, decision trees use something like the Gini index or cross entropy as we had described in the decision trees post, but xgboost operates on this objective function by taking the second order Taylor approximation of the loss. Taking the Taylor approximation and solving for the optimum value of the leaf weight $w_j$ yields

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

where

- $g_i$ is the first order term in the Taylor approximation of the loss for observation $i$
- $h_i$ is the second order term in the Taylor approximation of the loss for observation $i$
- $I_j$ is the set of observations that land in terminal node $j$

If we take this and plug it back in above, we can then come to the following method of measuring node "purity":

$$\text{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
Okay, so to back up a bit: in boosted trees, we are iterating to continuously optimize the loss function. With each iteration, we add one tree that becomes an expert catering to the residuals from the last iteration's model. How do we know when to stop? This question actually applies on two levels…
- Within each iteration, how do we know when to stop building the tree?
- The answer here ties directly to the node purity objective defined a few lines ago… A tree in xgboost is grown to its maximum depth based on certain parameters set by the user, and it is then pruned back using the node purity objective function (if the combined purity scores of the child nodes are worse than that of the parent node, we prune the tree back; see the gain formula sketched right after this list).
- Among all the iterations, when do we stop adding trees?
- This can depend on a number of factors, many of which are manual parameters in the xgboost package, including nround (max number of iterations) and early.stop.round (in the xgb.cv function, this sets a threshold for the number of consecutive rounds the test set performance can fail to improve before we decide to stop building trees).
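For completeness, the pruning decision in the first bullet comes down to the gain of a split, which falls out of the purity objective above (this is the standard formula from the xgboost documentation, restated here rather than derived):

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of the first and second order gradient terms in the left and right child nodes, and $\gamma$ is the same minimum-loss-reduction parameter that shows up in the parameter list below. If the gain of a split is negative, keeping the split hurts the regularized objective, and the branch gets pruned back.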
Last but not least, there are quite a few parameters to tune, so I just want to review some here before I go any further:
- General parameters
- nthread – number of parallel threads to compute with
- Parameters for boosting process
- eta – step size (between 0 and 1, default 0.3)
- gamma – minimum loss reduction required to make a split (between 0 and ∞, default 0)
- max_depth – max depth of a tree (between 1 and ∞, default 6)
- min_child_weight – minimum sum of instance weight needed in a child (between 0 and ∞, default 1)
The ideal way to tune the parameters is to use xgb.cv, use early.stop.round to avoid wasting time once the test set error leaves its optimum, and, in the case of overfitting, reduce the eta step size and increase the nround max iterations.
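Since we're in a Python notebook anyway, here's roughly what that workflow looks like with the Python xgboost package (just a sketch for reference, not what I actually ran below; the parameter spellings differ a bit from the R interface, e.g. early_stopping_rounds instead of early.stop.round):

import xgboost as xgb

# Same two features and binary label as the R cells, built on the pandas side
x_py = playerAggDfAllNbaAllStar[['advancedStats_WS', 'advancedStats_VORP']].values
y_py = (playerAggDfAllNbaAllStar['accolades_all_nba'] == 'All-NBA').astype(int).values
dtrain = xgb.DMatrix(x_py, label = y_py)

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'eta': 0.1,       # smaller step size
    'max_depth': 1    # stumps, which we'll also try later in this post
}

# Set a generous cap on rounds and let early stopping find the sweet spot
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round = 1000,
    nfold = 10,
    early_stopping_rounds = 20
)
print(cv_results['test-auc-mean'].max())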
That was a mouthful… it feels like it's been a month since I even ran the first model. In our first model, we got a pretty good AUC of 0.963 on the GBM with only 2 iterations. We saw a pretty good jump from the 1st iteration to the 2nd as well. Let's try this with… I dunno… 50 iterations?
%%R -o gbModelRounds50EvaluationLog
# Build gradient boosted tree model with 10-fold cross validation and 50 iterations of boosted trees
gbModelRounds50 = xgb.cv(
data = x,
label = y,
nround = 50,
objective = 'binary:logistic',
eval_metric = 'auc',
nfold = 10
)
gbModelRounds50EvaluationLog = data.frame(gbModelRounds50['evaluation_log'])
# Print the results dataframe
print gbModelRounds50EvaluationLog
This is probably better to look at in a plot haha.
# Plot training and test AUC
fig, ax = plt.subplots(1, 1)
gbModelRounds50EvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.train_auc_mean',
ax = ax
)
gbModelRounds50EvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.test_auc_mean',
ax = ax
)

Awesome, this makes way more sense. We’re looking for something that can match an AUC of 0.979, and we got… pretty close actually! At about the 19th iteration we peak at a value of 0.976, only slightly off from what we had before!
I kind of just want to do like 1000 iterations and just see what happens.
%%R -o gbModelRounds1000EvaluationLog
# Build gradient boosted tree model with 10-fold cross validation and 1000 iterations of boosted trees
gbModelRounds1000 = xgb.cv(
data = x,
label = y,
nround = 1000,
objective = 'binary:logistic',
eval_metric = 'auc',
nfold = 10
)
gbModelRounds1000EvaluationLog = data.frame(gbModelRounds1000['evaluation_log'])
# Plot training and test AUC
fig1, ax1 = plt.subplots(1, 1)
gbModelRounds1000EvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.train_auc_mean',
ax = ax1
)
gbModelRounds1000EvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.test_auc_mean',
ax = ax1
)

Okay! Simple enough. 20 was obviously the best choice. I could've theoretically saved myself the time of the other… what… 980 iterations if I had used the early.stop.round option?
%%R -o gbModelRounds1000EarlyStopEvaluationLog
# Build gradient boosted tree model with 10-fold cross validation, up to 1000 iterations of boosted trees, and early stopping
gbModelRounds1000EarlyStop = xgb.cv(
data = x,
label = y,
nround = 1000,
objective = 'binary:logistic',
eval_metric = 'auc',
nfold = 10,
early.stop.round = 10
)
gbModelRounds1000EarlyStopEvaluationLog = data.frame(gbModelRounds1000EarlyStop['evaluation_log'])
# Plot training and test AUC
fig1, ax2 = plt.subplots(1, 1)
gbModelRounds1000EarlyStopEvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.train_auc_mean',
ax = ax2
)
gbModelRounds1000EarlyStopEvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.test_auc_mean',
ax = ax2
)

Nice, that only took like 5 seconds! Funny enough though, the best iteration here is actually the 26th iteration, and we've gotten up to 0.979. What if I keep my early stop mechanism and decrease my step size? Also, just looking at the data a bit, I'm not 100% sure an early stop threshold of 10 non-improving iterations is enough… I'm seeing stretches of 8 before the model jumps to another maximum AUC, and that's not too far off from 10, so I wonder if I'm cutting off the iterations prematurely in some of these runs. I'll try an early stop threshold of 20 iterations.
%%R -o gbModelRounds1000EarlyStopSmallEtaEvaluationLog
# Build gradient boosted tree model with 10-fold cross validation, up to 1000 iterations, early stopping, and a smaller eta
gbModelRounds1000SmallEtaEarlyStop = xgb.cv(
data = x,
label = y,
nround = 1000,
objective = 'binary:logistic',
eval_metric = 'auc',
nfold = 10,
early.stop.round = 20,
eta = 0.1
)
gbModelRounds1000EarlyStopSmallEtaEvaluationLog = data.frame(gbModelRounds1000SmallEtaEarlyStop['evaluation_log'])
# Plot training and test AUC
fig1, ax3 = plt.subplots(1, 1)
gbModelRounds1000EarlyStopSmallEtaEvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.train_auc_mean',
ax = ax3
)
gbModelRounds1000EarlyStopSmallEtaEvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.test_auc_mean',
ax = ax3
)

It obviously took us a bit longer to climb to the max (59th iteration here), but we're getting to around the same accuracy. I've also read that sometimes the best performing boosted trees are simply stumps (each iteration only makes one split). Let's try that out with the current options. I'm going to cut down the variable names a bit because they're getting out of hand (and I'm probably also butchering best practices anyways).
%%R -o gbModelMaxDepth1EvaluationLog
# Build gradient boosted tree model with 10-fold cross validation, early stopping, a smaller eta, and max tree depth of 1
gbModelMaxDepth1 = xgb.cv(
data = x,
label = y,
nround = 1000,
objective = 'binary:logistic',
eval_metric = 'auc',
nfold = 10,
early.stop.round = 20,
eta = 0.1,
max_depth = 1
)
gbModelMaxDepth1EvaluationLog = data.frame(gbModelMaxDepth1['evaluation_log'])
# Plot training and test AUC
fig1, ax4 = plt.subplots(1, 1)
gbModelMaxDepth1EvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.train_auc_mean',
ax = ax4
)
gbModelMaxDepth1EvaluationLog.plot(
kind = 'line',
x = 'evaluation_log.iter',
y = 'evaluation_log.test_auc_mean',
ax = ax4
)

Here, the max AUC is… 0.9796! With the neural network, we were seeing 0.9797. We're basically getting the same results! If I ran this a few more times, I'm sure we'd see something equivalent!
–2 seconds later–
Yup just ran another one and got 0.9799!
–2 seconds later–
Just ran another and got 0.98!
–5 minutes later–
Okay, I don't think I can beat 0.98 lol. But how cool is that? I see why xgboost, and gradient boosting in general, is the bee's knees! The unfortunate thing right now is that I think I'm getting super super caught up in a kaggle-crazed accuracy prediction mode. All I'm caring about, and all I'm getting excited over, is that one AUC number. It's like cocaine to me right now. I need more AUC, all of the AUC in fact. The downside to this mindset is… well… this frickin model is an aggregate of, in this last case, over 100 trees. How the hell am I supposed to explain this to someone else?
“TRUST ME MAN!!! THE AUC WAS 0.98!”
“Wait, what? I thought we were looking at WS and VORP? What’s AUC?”
“AUC IS AREA UNDER THE CURVE. IT WAS 0.98, AND IT WAS GLORIOUS”
“What curve? What are you talking about?”
“AREA UNDER THE CURVE OF THE SENSITIVITY AND SPECIFICITY TRADE OFF PLOT OF THE AGGREGATED BOOSTED TREES STEMMING FROM WS AND VORP”
“Boosted… what? Trees? Trees have sensitive curves? I gotta go man… My… mom is calling me…”
“NO COME ON LET ME SHOW YOU THIS CHAPTER IN ESL I SWEAR IT WILL MAKE SENSE AFTER”
“The all-NBA teams have already been released man… Bye”
If I've learned anything in consulting so far, it's that some people just do not have the time of day. Through no fault of their own, they simply won't get the message unless it's simple and relatable to them. So when I've gone through and tuned not even 1/3 of the parameters that I probably should have played around with to really get a feel for the model, how can I expect someone else to pick it up in one conversation? Contrast this to the decision tree, which literally gave me a picture of a frickin tree that I can put into a slide or something and had an AUC of 0.971 anyways, and it's pretty easy to say which model I'd go with in the business world. In the data science / Kaggle world, of course I'd pick the boosted tree. A higher AUC is objective… it's simply the best I've seen, and that works for me because interpretability is not an issue I have to worry about there. The flip side is me banging my head on a desk 80 times because I lost a battle with an executive who would rather do 1000 other tasks of higher priority than listen to me justify the 0.009 AUC difference (again, what's AUC again?), so if I can save that conversation and take the 0.009 AUC hit, I'll probably do it.
Another thing I've noticed about boosted trees is that the results are slightly different every time! Part of that is the random assignment of observations to folds in xgb.cv, and, like a random forest, xgboost can also subsample rows and columns on each iteration (via the subsample and colsample_bytree parameters) to average over a variety of trees and decrease variance.
Like the random forest as well, it seems that xgboost also gives us a matrix of feature importance:
%%R -o gbModelFinal
# Build final gradient boosted tree model with 120 iterations of boosted trees
gbModelFinal = xgboost(
data = x,
label = y,
nround = 120,
objective = 'binary:logistic',
eval_metric = 'auc',
eta = 0.1,
max_depth = 1
)
gbModelFinalEvaluationLog = data.frame(gbModelFinal['evaluation_log'])
%R print(data.frame(xgb.importance(names(data.frame(x)), model = gbModelFinal)))
Interestingly enough, it's giving me the same idea that our decision tree gave us (remember when our first shallow tree's splits were all WS?), but these two deviate a bit from what random forest told us for feature importance (there, both features were ranked more evenly, although WS was still more important). I won't go into too much analysis as to why they're different (because I don't know the mechanisms of each model by heart yet), but it's something I'll definitely keep in the back of my mind! Let's just check the ROC from the ROCR package, because I want to see the sensitivity and specificity values at the best cutoff.
%%R
library(ROCR)
# Use the fitted xgboost model to predict probabilities
gbModelFinalPred = predict(gbModelFinal, x)
# Use the ROCR library to build the ROC curve
gbModelFinalPredObj = prediction(gbModelFinalPred, ifelse(playerAggDfAllNbaAllStar['accolades_all_nba'] == 'All-NBA', TRUE, FALSE))
# Run performance evaluation for sensitivity vs specificity
gbModelFinalRocEval = performance(gbModelFinalPredObj, 'sens', 'spec')
plot(gbModelFinalRocEval, colorize = T)
text(
0.2,
0.08,
labels = paste("AUC = ", round(performance(gbModelFinalPredObj, 'auc')@y.values[[1]], digits = 4), sep= ""),
adj = 1
)

# Retrieve the iterative cut-off sensitivity analysis that ROCR did behind the scenes
%R cutoffs = data.frame(cut = gbModelFinalRocEval@alpha.values[[1]], sens = gbModelFinalRocEval@x.values[[1]], spec = gbModelFinalRocEval@y.values[[1]])
# Calculate the metrics sensitivity + specificity. This will help us gauge the accuracy of both classes simultaneously.
# E.g. if we were guessing each class 100% correctly (there is a very distinct decision boundary), then we would have 1 + 1 = 2
%R cutoffs['sens_plus_spec'] = cutoffs['sens'] + cutoffs['spec']
# See the last few rows of this dataframe where the sensitivity + specificity are at its max
%R tail(cutoffs[order(cutoffs$sens_plus_spec),])
95% / 92%. Definitely the best we've seen so far considering both, as noted by the AUC. Also, note that the AUC on the ROC plot is now 0.9811, which is the AUC on the training data. This is because I had to use xgboost instead of xgb.cv in order to get the feature importances and predictions, so 0.9811 is higher than the test set AUC (0.98) we were looking at earlier. As we add more iterations, gradient boosting can basically fit our training data 100% (unless two data points overlap exactly), because with more iterations we fit trees specifically geared towards certain observations, or even single observations, so we'll always be able to pick out all the spatial intricacies. This value of nround is what cross validation chose for us, so we'll go ahead with this model. Even though the model is probably overfitting slightly, it's actually quite interesting to see how flexible the GBM is compared to other models. With certain models, it would be basically impossible to achieve this AUC even on the training data, because the models and the boundaries they produce are way too rigid. Cool stuff.