Chi / Larissa Face Detection #2 – Convolutional Neural Networks

Reviewing Neural Networks

Back when I was trying to predict all-NBA players, I built a simple feed-forward neural network in R using the neuralnet package. That package was pretty simple: you just define how many hidden layers you want and how many nodes you want in each hidden layer. I won’t get too into how a regular feed-forward neural network works again here, but essentially we are building linear combinations of sigmoid functions to create a potentially highly non-linear decision boundary:

When using a NN to perform image recognition, generally we take each pixel of the image as an input and we generate a NN with many inputs (just think if we had a simple 28 x 28 pixel image, we would have 784 inputs to our NN). Aside from the fact that a NN with more inputs takes more time to train, the image can easily shift… even if you’re trying your hardest to recreate the image in the same way!
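
To make the "784 inputs" point concrete, here is a minimal sketch of how a 28 x 28 image gets unrolled into one flat list of inputs. The blank image and the flatten helper are just placeholders I made up for illustration, not anything from a particular library:

```python
# A hypothetical blank 28 x 28 greyscale image, stored as rows of pixel intensities.
HEIGHT, WIDTH = 28, 28
image = [[0.0 for _ in range(WIDTH)] for _ in range(HEIGHT)]


def flatten(img):
    """Unroll a 2D image row by row into one flat list of NN inputs."""
    return [pixel for row in img for pixel in row]


inputs = flatten(image)
print(len(inputs))  # 784

# The pixel at (row, col) ends up at index row * WIDTH + col,
# so the network only ever sees a position in a long list, not a 2D location.
image[3][5] = 1.0
print(flatten(image).index(1.0))  # 89
```

Once the image is flattened like this, nothing tells the network that input 89 and input 90 were neighbors in the original image.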

In reality, we can’t expect the image to come out the exact same way in the exact same pixels every time because we usually are training on real-world data.

Convolutional Neural Networks

To continue our point from the last section, let’s take a look at the classic MNIST data set. The MNIST (Modified National Institute of Standards and Technology) dataset is a database of handwritten digits used for training image processing models. It includes 60k training samples (TensorFlow’s loader holds 5k of these out for validation, leaving ~55k) and 10k test samples, all consisting of 28 x 28 greyscale images of handwritten digits.

I say we can’t expect the image to come out exactly the same way in the same pixels because, well, let’s just look at these two digits below:

If we pretend that each of these is its own 28 x 28 pixel image and visualize each pixel of each image as an input to the NN, we obviously will not have consistent inputs.

Take these two zeroes for example. Sure, we can tell they’re zeroes, but to a simple feed-forward NN, each pixel is its own input, so the pixels are treated as independent of each other.

Going back to our all-NBA example, we had two inputs: WS and VORP. We could easily have thrown in more inputs as well: PTS, REB, AST, MP, WS48, DBPM… you name it… but the NN would look at these inputs as random variables that we’re feeding it. In the case of image processing, contextual location matters.

Pixelized handwritten digits are obviously not independent in each pixel. A stroke of a pen / pencil / whatever writing utensil will have a width depending on the width of the utensil tip. This is evident as the second zero above is wider than the first. If we wanted to get REALLY REALLY nitpicky, the outer pixels of the second zero will be 1’s (representing a black pixel) whereas those same pixels would be 0’s (representing a white pixel) on the first zero. This means that a feed-forward NN will view these input pixels and simply go “they’re not the same”. It has no context that the overall shapes of the two images are very, very similar.

In another example, let’s look at 1’s:

The matrix representation is shown on the right (here, there are continuous values representing the darkness of the pixel rather than a binary 0 / 1, white / black representation, but the idea is the same). Let’s say the 1 is moved five pixels to the left… The two input signals would be completely different. Hardly any of the dark pixels would land in the same place! From a feed-forward NN perspective, the inputs can vary so much that the training process wouldn’t be very robust.
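
Here is a small sketch of that shift, assuming a made-up binary "1" (a vertical stroke two pixels wide) rather than a real MNIST digit. The helper names are mine and only there for illustration:

```python
SIZE = 28


def make_one(column):
    """Draw a hypothetical digit 1: a vertical stroke, 2 pixels wide, on a blank image."""
    img = [[0] * SIZE for _ in range(SIZE)]
    for r in range(4, 24):
        img[r][column] = 1
        img[r][column + 1] = 1
    return img


def shift_left(img, pixels):
    """Shift every row left, padding the right edge with blank pixels."""
    return [row[pixels:] + [0] * pixels for row in img]


def shared_ink(a, b):
    """Count positions where both images have a dark pixel."""
    return sum(pa * pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


original = make_one(column=14)
shifted = shift_left(original, 5)

print(shared_ink(original, original))  # 40
print(shared_ink(original, shifted))   # 0
```

Same digit, same amount of ink, and yet not a single dark pixel lines up after the shift. A network that treats each pixel position as its own variable has nothing to hold on to here.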

Enter the convolutional NN.

A convolutional NN uses the concepts of convolution and max pooling to simplify the image first before throwing it into a fully-connected layer and output layer.

I’d like to first throw out this video which I will be referencing for more or less the rest of this CNN intro. I obviously cannot take any of the credit for this video and just want to thank Brandon Rohrer for creating this great video that I’m sure everyone and their mothers who are involved in CNNs have stumbled upon at this point. Upon checking his LinkedIn, it’s clear that he likely knows what he’s talking about as he has had certain credentials since he was 18 that I will never have lol.

Brandon begins by illustrating some of the points I was trying to make above. In his video, he tries to classify a 2D image that can either be an X or an O.

But, of course, we don’t all write X’s and O’s the same way…

We need to find a way to mitigate this variance because, again, a computer sees each pixel as an independent variable:

At this point, the video does a great job of explaining the convolution and max pooling concepts. Let’s get into it.


Convolution is the process of using parts of the image to match and summarize the larger image. For example, if we localize our scope of the two X’s above, we see that the image is actually made of the same smaller pieces and represents the same “idea” of an image, but obviously the smaller pieces are just shifted around a bit. In reality, we can think of many use cases that have this type of structural makeup… in our current objective, face detection, to some extent all humans have eyes, a nose, a mouth… just shifted in different places on a face.

We start convolution with a set of filters (a filter bank) whose values are randomly initialized. Once trained, the filters might look something like:

with the filters on either side representing the straight line strokes of the X and the filter in the middle representing the cross of the two strokes of the X.

The convolution algorithm goes as follows:
For each filter…

  1. Line up the filter and image patch
  2. Multiply each filter pixel by the corresponding image pixel (a pixelwise product, not a cross product)
  3. Add them up
  4. Divide by the total number of pixels
  5. Represent the image patch with the total
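
For a single patch, steps 2 to 4 can be sketched in a few lines. This assumes pixels are coded as +1 (dark) and -1 (light), as in the video, and the 3 x 3 diagonal filter and the patches are values I typed in by hand for illustration:

```python
def match_patch(filt, patch):
    """Multiply matching pixels, add them up, then divide by the number of pixels."""
    products = [f * p
                for f_row, p_row in zip(filt, patch)
                for f, p in zip(f_row, p_row)]
    return sum(products) / len(products)


# A hand-written diagonal filter (+1 = dark pixel, -1 = light pixel).
diagonal = [[ 1, -1, -1],
            [-1,  1, -1],
            [-1, -1,  1]]

# A patch that matches the filter exactly, and one with a single pixel off.
perfect = [[ 1, -1, -1],
           [-1,  1, -1],
           [-1, -1,  1]]
partial = [[ 1, -1, -1],
           [-1,  1, -1],
           [-1, -1, -1]]

print(match_patch(diagonal, perfect))            # 1.0
print(round(match_patch(diagonal, partial), 2))  # 0.78 (8 matches, 1 mismatch: 7 / 9)
```

A score of 1.0 means every pixel agreed, and each disagreement pulls the score down.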

Okay… that was a bit quick… let’s take a closer look at these steps…

Let’s say we’re using the first filter to match parts of the image. Here, in step 1, we take the first filter and match it to a part of the image that matches up exactly:

In steps 2, 3, and 4, we essentially aggregate this entire patch matching process down to a single number: the dot product of the filter and the patch, divided by the number of pixels:

In the first 3 images above, we are performing step 2, that is, the pixelwise multiplication between the filter and the image patch. Below, we complete steps 3 to 5: we sum the results of the pixelwise multiplication, average them into a single value, and use that value to represent how well the filter matched up at that point in the image.

Taking a look at how the filter matches up in another area of the image…

And completing the image…

And performing this process for all 3 filters available…

Notice that the filters representing diagonals are prominent along the axis of those diagonals, and the filter representing the cross is prominent at the center of the image… We are “generalizing” the larger image into these smaller features and we obtain a stack of results, 1 layer of the stack for each filter.
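
The whole convolution step can be sketched as sliding each filter over every position of the image. This is a rough sketch, not a real library call: the 9 x 9 X image and the three 3 x 3 filters are my own hand-built stand-ins for the ones in the video, with +1 for dark pixels and -1 for light ones, and I only slide the filter over positions where it fully fits:

```python
def match_patch(filt, patch):
    """Mean of the pixelwise products between a filter and an image patch."""
    products = [f * p
                for f_row, p_row in zip(filt, patch)
                for f, p in zip(f_row, p_row)]
    return sum(products) / len(products)


def convolve(image, filt):
    """Slide a square filter over a square image, one match score per position."""
    k = len(filt)
    out_size = len(image) - k + 1  # only positions where the filter fully fits
    result = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            patch = [image_row[c:c + k] for image_row in image[r:r + k]]
            row.append(match_patch(filt, patch))
        result.append(row)
    return result


# Hypothetical 9 x 9 image of an X with a one-pixel blank border.
image = [[1 if 1 <= r <= 7 and c in (r, 8 - r) else -1 for c in range(9)]
         for r in range(9)]

diag_down = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
cross = [[1, -1, 1], [-1, 1, -1], [1, -1, 1]]
diag_up = [[-1, -1, 1], [-1, 1, -1], [1, -1, -1]]

# One 7 x 7 layer per filter.
stack = [convolve(image, f) for f in (diag_down, cross, diag_up)]

print(len(stack), len(stack[0]), len(stack[0][0]))  # 3 7 7
print(stack[0][1][1])  # 1.0, the down-diagonal filter sits right on the stroke
print(stack[1][3][3])  # 1.0, the cross filter matches the center exactly
```

The 9 x 9 input becomes three 7 x 7 layers, which is where the 7 x 7 size in the pooling step comes from.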

At this point, our convolution step is done! Let’s talk about max pooling.

Max Pooling

Max pooling is a relatively easy concept. It’s a step that

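  1. Reduces the number of end-features and end-computations when we feed it into our feed-forward NN
  2. Helps limit overfitting by summarizing each neighborhood with its strongest response, which makes the result less sensitive to exactly where a feature shows up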

Max pooling simply slides a window across the image and represents that entire window by the maximum value found within that window. For example, the following situation uses a 2×2 pixel window sliding across our 7×7 convolved image:

Notice that on that last one, the sliding window hangs off the image. One way to handle this, as seen above, is to only consider the pixels that are there and treat all other pixels in the window as non-existent.
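
A minimal sketch of that sliding window, assuming the window moves 2 pixels at a time (a stride of 2, which is my assumption here). Windows that hang off the edge simply use whatever pixels exist:

```python
def max_pool(feature_map, window=2, stride=2):
    """Keep the maximum of each window; slices truncate at the image edge."""
    pooled = []
    for r in range(0, len(feature_map), stride):
        row = []
        for c in range(0, len(feature_map[0]), stride):
            values = [v
                      for fm_row in feature_map[r:r + window]
                      for v in fm_row[c:c + window]]
            row.append(max(values))
        pooled.append(row)
    return pooled


# Tiny made-up example: the right column and bottom row are partial windows.
small = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(max_pool(small))  # [[5, 6], [8, 9]]

# A 7 x 7 layer shrinks to 4 x 4.
layer = [[0.0] * 7 for _ in range(7)]
pooled = max_pool(layer)
print(len(pooled), len(pooled[0]))  # 4 4
```

Python slices quietly stop at the end of a list, which happens to be exactly the "ignore the missing pixels" behavior described above.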

Completing the image…

And pooling all 3 layers…

Notice that, at the end of this pooling process, we have a smaller image with fewer inputs that still represents the overall pattern of our original image!!


The last of our three tools is the normalization layer. The normalization layer is also quite easy to grasp: it simply uses the Rectified Linear Unit function (ReLU) to zero out any negative values while keeping positive values:

Applying the RELU function on a layer is simple:

What does this mean? The ReLU layer typically follows the convolution layer, before max pooling. With pixels coded as +1 and -1 as in the video, a negative value means that, in the convolution layer, a filter matched less than half of the pixels in that patch of the image. If there were an equal number of pixels that matched and didn’t match, we’d get a zero value. ReLU therefore treats every match below the halfway mark the same way: it sets them all to zero.
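
As a quick sketch, ReLU on a layer is just an elementwise max with zero. The 2 x 2 layer below uses made-up match scores:

```python
def relu(feature_map):
    """Zero out negative values and keep positive values unchanged."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Made-up match scores from a convolution layer.
layer = [[0.77, -0.11],
         [-0.55, 1.0]]

print(relu(layer))  # [[0.77, 0.0], [0.0, 1.0]]
```

Notice that -0.11 and -0.55 both end up as 0.0: once a match is below the halfway mark, how far below no longer matters.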

Putting It All Together

Using these 3 tools (convolution, max pooling, normalization), we can combine the layers to continue shrinking and averaging our image down to a size that we are comfortable dealing with.
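
To see the shrinking in action, here is a rough sketch that chains the three tools in the order convolution, then ReLU, then max pooling. Everything in it is illustrative: the 9 x 9 X image, the single cross filter, and the stride of 2 are my own assumptions, not values from a trained network:

```python
def convolve(image, filt):
    """Slide filt over image; each output is the mean of the pixelwise products."""
    k = len(filt)
    n = len(image) - k + 1
    return [[sum(filt[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k)) / (k * k)
             for c in range(n)]
            for r in range(n)]


def relu(fmap):
    """Zero out negative match scores."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, window=2, stride=2):
    """Keep the largest value in each window (edge windows use what exists)."""
    return [[max(v for row in fmap[r:r + window] for v in row[c:c + window])
             for c in range(0, len(fmap[0]), stride)]
            for r in range(0, len(fmap), stride)]


# Hypothetical 9 x 9 image of an X (+1 dark, -1 light) and a cross filter.
image = [[1 if 1 <= r <= 7 and c in (r, 8 - r) else -1 for c in range(9)]
         for r in range(9)]
cross = [[1, -1, 1], [-1, 1, -1], [1, -1, 1]]

convolved = convolve(image, cross)
pooled = max_pool(relu(convolved))

print(len(image), len(convolved), len(pooled))  # 9 7 4
print(pooled[1][1])  # 1.0, the perfect match at the center survives pooling
```

One pass takes the image from 9 x 9 down to 4 x 4 while keeping the strongest evidence for the feature, and the same three steps can be stacked again on the result.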

A simple CNN can look like this:

A more complex CNN can look like this:

Fully Connected Layer

The last section of the CNN is a basic DNN. Each value in the convolved / pooled stacks coming out of the convolutional layers acts as an input to the DNN. With this, we basically get a sense of how each of the filters contributes to the final classification task.
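
As a minimal sketch of that last step, the pooled stack gets flattened into one list and each class computes a weighted sum over it. The stack values, the weights, and the two class labels below are all made up by me for illustration. In a real network the weights would be learned during training, and the scores would usually pass through an activation function as well:

```python
def flatten(stack):
    """Unroll a stack of 2D layers into one flat list of features."""
    return [v for layer in stack for row in layer for v in row]


def fully_connected(features, weights, biases):
    """One weighted sum of all the features per output class."""
    return [sum(w * x for w, x in zip(class_weights, features)) + b
            for class_weights, b in zip(weights, biases)]


# Two made-up 2 x 2 pooled layers (one per filter).
stack = [[[1.0, 0.5], [0.5, 1.0]],
         [[0.5, 1.0], [1.0, 0.5]]]
features = flatten(stack)

labels = ["X", "O"]
# Hypothetical weights: each class "listens" to a different set of features.
weights = [[1, 0, 0, 1, 0, 1, 1, 0],
           [0, 1, 1, 0, 1, 0, 0, 1]]
biases = [0.0, 0.0]

scores = fully_connected(features, weights, biases)
prediction = labels[scores.index(max(scores))]

print(len(features))  # 8
print(scores)         # [4.0, 2.0]
print(prediction)     # X
```

Each weight says how much a given feature counts as a vote for that class, which is the "how each filter contributes" idea in code form.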

We can also build deep layers to end off our CNN.

Final Network

Our final network can then look something like this:

Wow… that was a ride… well… that’s a CNN! Let’s explore TFLearn in the next post, a high-level abstraction on top of the popular Python NN library, TensorFlow!
