Chi / Larissa Face Detection #8 – Cutting Cloud Costs with Infrastructure Automation (Part III – Preparing and Training Code on AWS – MNIST)


At this point, we have a working EC2 instance loaded with the AWS Deep Learning AMI, with the jupyter notebook tested and ready to go.

The last step in automating this analysis is to get a notebook ready so I can just import it and run it. I’ll start with the MNIST data set because I already have that code written and tested on my own laptop. All I need to do is:

  • Prepare a jupyter notebook
  • Host it on git
  • Fire up my EC2
  • Clone the project and code onto the EC2
  • Train on the EC2
  • View / save results
  • Terminate the EC2 and stop the billing
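The steps above can be sketched as a small command plan. This is only a rough sketch under my own assumptions: the repository URL and notebook name are hypothetical placeholders, running the notebook headlessly through `jupyter nbconvert` is one option (not necessarily how I ran it), and tearing the instance down with `terraform destroy` assumes the instance was created by the Terraform script from the earlier posts. Nothing is executed here; the function only returns the commands and where each one would run.

```python
# Hypothetical placeholders: swap in your own repository URL and notebook name.
REPO_URL = "https://github.com/example-user/example-project.git"
NOTEBOOK = "mnist_tflearn.ipynb"


def plan_run(repo_url, notebook):
    """Return the workflow as (where, argv) pairs. Nothing is executed."""
    # Derive the directory that `git clone` would create from the URL.
    project_dir = repo_url.rstrip("/").split("/")[-1]
    if project_dir.endswith(".git"):
        project_dir = project_dir[:-4]
    notebook_path = project_dir + "/" + notebook
    return [
        # On the EC2: fetch the code.
        ("ec2", ["git", "clone", repo_url]),
        # On the EC2: run every cell and save the outputs into the notebook.
        ("ec2", ["jupyter", "nbconvert", "--to", "notebook",
                 "--execute", "--inplace", notebook_path]),
        # On the EC2: push the results back before the instance disappears.
        ("ec2", ["git", "-C", project_dir, "commit", "-am", "Add training results"]),
        ("ec2", ["git", "-C", project_dir, "push"]),
        # Locally: tear everything down so the billing stops.
        ("local", ["terraform", "destroy", "-auto-approve"]),
    ]


for where, argv in plan_run(REPO_URL, NOTEBOOK):
    print("[{}] {}".format(where, " ".join(argv)))
```

Keeping the push step ahead of the destroy step is the important part of the ordering: once the instance is terminated, any results that were not pushed are gone.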

Theoretically, this should take no more than about 5 minutes to set up, plus however long the model takes to train (I’m hoping for less than 30 minutes for 10 epochs, but I honestly have no clue what to expect). I’ll use this as the notebook to clone and run on the EC2.


Please refer to post #3 for the fully documented and commented code. To keep this post light, I’m just going to reuse the code from post #3. MNIST should be a good example for automation too, because it comes with a built-in function that loads the data, so we don’t even have to worry too much about the ETL portion.


Before getting into the notebook, I want to note how much this exercise should cost me. If things go well, my balance afterwards will be only $0.20 higher than it is right now. Current balance:

That exchange rate is killing me… but what can I really do?

In [2]:
# Install tflearn
import os
os.system("sudo pip install tflearn")
In [3]:
# TFlearn libraries
import tflearn
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
import tflearn.datasets.mnist as mnist

# General purpose libraries
import matplotlib.pyplot as plt
import numpy as np
import math
In [4]:
# Extract data from mnist.load_data()
x, y, x_test, y_test = mnist.load_data(one_hot = True)
Downloading MNIST...
Succesfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting mnist/train-images-idx3-ubyte.gz
Downloading MNIST...
Succesfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting mnist/train-labels-idx1-ubyte.gz
Downloading MNIST...
Succesfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting mnist/t10k-images-idx3-ubyte.gz
Downloading MNIST...
Succesfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting mnist/t10k-labels-idx1-ubyte.gz
In [5]:
# Reshape x
x_reshaped = x.reshape([-1, 28, 28, 1])
print('x_reshaped has the shape {}'.format(x_reshaped.shape))
x_reshaped has the shape (55000, 28, 28, 1)
In [6]:
# Reshape x_test
x_test_reshaped = x_test.reshape([-1, 28, 28, 1])
print('x_test_reshaped has the shape {}'.format(x_test_reshaped.shape))
x_test_reshaped has the shape (10000, 28, 28, 1)
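For anyone wondering what the two reshape cells above actually do, here is a minimal sketch on fake data. The array `flat` is a made-up stand-in for the flattened 784-value rows that the MNIST loader hands back; only the `reshape([-1, 28, 28, 1])` call is taken from the notebook.

```python
import numpy as np

# Stand-in for MNIST: 5 fake "images", each flattened to 28 * 28 = 784 values.
flat = np.arange(5 * 784, dtype=np.float32).reshape(5, 784)

# -1 lets NumPy infer the number of images; the trailing 1 is the single
# grayscale channel that the conv layers expect.
images = flat.reshape([-1, 28, 28, 1])

print(images.shape)

# Reshape is row-major, so pixel (row r, col c) of image i is flat[i, r * 28 + c].
assert images[2, 3, 4, 0] == flat[2, 3 * 28 + 4]
```

No pixel values change here. The same numbers are just indexed as image, row, column, channel instead of image, position.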
In [7]:
# sentdex's code to build the neural net using tflearn
#   Input layer --> conv layer w/ max pooling --> conv layer w/ max pooling --> fully connected layer --> output layer
convnet = input_data(shape = [None, 28, 28, 1], name = 'input')

convnet = conv_2d(convnet, 32, 2, activation = 'relu')
convnet = max_pool_2d(convnet, 2)

convnet = conv_2d(convnet, 64, 2, activation = 'relu')
convnet = max_pool_2d(convnet, 2)

convnet = fully_connected(convnet, 1024, activation = 'relu')
# convnet = dropout(convnet, 0.8)

convnet = fully_connected(convnet, 10, activation = 'softmax')
convnet = regression(convnet, optimizer = 'sgd', learning_rate = 0.01, loss = 'categorical_crossentropy', name = 'targets')
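To get a feel for what this network does to the data, here is a rough sketch that traces the tensor shape through each layer in plain Python. It assumes tflearn's default settings as I understand them: 'same' padding everywhere, stride 1 for the conv layers, and a pooling stride equal to the pooling kernel size (2). The helper names are mine, not tflearn's.

```python
import math


def same_padding_out(size, stride):
    """Output length of a 'same'-padded layer: ceil(size / stride)."""
    return int(math.ceil(size / float(stride)))


def trace_shapes(height=28, width=28, channels=1):
    """Follow one image through the conv net defined in the cell above."""
    shapes = [("input", (height, width, channels))]
    for index, filters in enumerate((32, 64), start=1):
        # Conv with stride 1 and 'same' padding keeps height and width,
        # and sets the channel count to the number of filters.
        channels = filters
        shapes.append(("conv{}".format(index), (height, width, channels)))
        # 2x2 max pooling with stride 2 halves height and width.
        height = same_padding_out(height, 2)
        width = same_padding_out(width, 2)
        shapes.append(("pool{}".format(index), (height, width, channels)))
    # The fully connected layer sees the flattened feature map.
    shapes.append(("flatten", (height * width * channels,)))
    shapes.append(("fully_connected", (1024,)))
    shapes.append(("output", (10,)))
    return shapes


for name, shape in trace_shapes():
    print("{:<16} {}".format(name, shape))
```

Under those assumptions the 28x28 image shrinks to 7x7 with 64 channels before the fully connected layer, which is 3136 inputs feeding 1024 units.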
In [11]:
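# Wrap the network in a tflearn model, then train for 5 epochs,
# validating against the MNIST test set
model = tflearn.DNN(convnet)
model.fit(
    {'input': x_reshaped},
    {'targets': y},
    n_epoch = 5,
    validation_set = ({'input': x_test_reshaped}, {'targets': y_test}),
    snapshot_step = 500,
    show_metric = True
)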
Training Step: 4299  | time: 10.886s
| SGD | epoch: 005 | loss: 0.00000 - acc: 0.0000 -- iter: 54976/55000
Training Step: 4300  | time: 12.184s
| SGD | epoch: 005 | loss: 0.00000 - acc: 0.0000 | val_loss: 0.11502 - val_acc: 0.9648 -- iter: 55000/55000
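The step counter in that log can be sanity-checked with a little arithmetic. This sketch assumes a batch size of 64, which I never set explicitly (as far as I know it is tflearn's default for `fit`); the sample count and epoch count come from the cells above.

```python
import math

n_samples = 55000   # training images, from the reshape output above
batch_size = 64     # assumption: tflearn's default, not set in the notebook
n_epoch = 5

# The last batch of each epoch is a partial one, so round up.
steps_per_epoch = int(math.ceil(n_samples / float(batch_size)))
total_steps = steps_per_epoch * n_epoch

# Samples seen just before that final partial batch of an epoch.
samples_before_last_step = (steps_per_epoch - 1) * batch_size

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("iter before last step:", samples_before_last_step)
```

Both numbers line up with the log: 4300 total steps, and 54976/55000 on the second-to-last step.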


Again, it was easier to just make a quick video of the model running on AWS to really get a sense of how fast it trained: 12-13 seconds per epoch!!

After 1 minute / 5 epochs, we got to ~96.4%! To reiterate the findings of the video, training the model on AWS was 23 times faster than training on my laptop: it took less than 5% of the time. Amazing gains for a very low price. You could argue that it took a while just to get this whole environment up and running, but the lead time to get it up again for the next analysis is next to nothing. We have our Terraform script, the AWS Deep Learning AMI, and a local jupyter environment to prepare our code in before porting it to AWS to run on a GPU-optimized instance.

23 times faster… I had an idea it would be faster, but I’m blown away by how much faster.
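Just to show where the "less than 5%" comes from, here is the arithmetic as a tiny sketch. The 23x figure and the 12-13 seconds per epoch are the numbers quoted above; the implied laptop time per epoch is only derived from them, not something I measured separately here.

```python
SPEEDUP = 23.0                     # AWS vs. laptop, as quoted above
AWS_EPOCH_SECONDS = (12.0, 13.0)   # observed range per epoch on AWS

# Being N times faster means taking 1/N of the time.
fraction_of_laptop_time = 1.0 / SPEEDUP

# Implied laptop time per epoch (derived, not measured).
laptop_epoch_seconds = tuple(t * SPEEDUP for t in AWS_EPOCH_SECONDS)

print("AWS takes {:.1%} of the laptop time".format(fraction_of_laptop_time))
print("Implied laptop epoch: {:.0f}-{:.0f} seconds".format(*laptop_epoch_seconds))
```

So 23 times faster works out to roughly 4.3% of the laptop time, and somewhere just under 5 minutes per epoch on the laptop.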

Revisiting Cost

Let’s see where our costs are at. Before I do that, I just want to say that I probably ran the EC2 3-4 times rather than the single time I had planned. This was straight-up poor planning on my part: one time I messed up spinning up the instance, and another time I didn’t push the results back to git, so I didn’t have the AWS-run TFlearn metrics. I think it’s going to cost more than 20 cents here, but it shouldn’t cost more than 1 dollar.

We’ve spent a grand total of 92 cents USD here. My guess is that I probably spent 80 cents on spinning up the EC2 4 times and 12 cents on other services during that time. Not the 20 cents I was looking for, but I didn’t even have to spend a dollar to do this analysis! Just what I was looking for.

Another side note about EC2 instance-hour billing. Amazon bills EC2s with the hour as the smallest increment. This means that even if you spin up an EC2 for a few seconds, as I did when I messed up spinning up the EC2 in the first place (wrong public DNS settings), you get charged for a full hour. Just more parameters to make our lives more difficult… As if learning a NN wasn’t hard enough, now we gotta count minutes!
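Here is a minimal sketch of that round-up-to-the-hour billing. The 20 cents per hour is an assumption taken from my own per-run estimate above, not an official price, and the four run durations are made up purely for illustration.

```python
import math

# Assumption: 20 cents per instance-hour, based on the per-run estimate above.
HOURLY_RATE_CENTS = 20


def billed_cents(run_seconds, hourly_rate_cents=HOURLY_RATE_CENTS):
    """Cost of one run when every started hour is billed in full."""
    if run_seconds <= 0:
        return 0
    billed_hours = int(math.ceil(run_seconds / 3600.0))
    return billed_hours * hourly_rate_cents


# Hypothetical durations in seconds: one botched launch plus three real runs.
runs = [30, 600, 900, 1200]
total = sum(billed_cents(seconds) for seconds in runs)
print("billed: {} cents for {} runs".format(total, len(runs)))
```

Every one of those runs is billed as a full hour, so four short runs come to 80 cents, which matches my guess for the EC2 share of the 92 cents.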

I think I’m ready to try it on our faces!
