Hello! I’m back for another random data science project to try, fail, and sometimes succeed at! Key word… probably “fail”.
All good, though, I’ve failed so many times now that it’s basically become second nature!
In the last post, I tried to build a Convolutional Neural Network to tell my girlfriend’s face apart from mine. Well, I guess it started out with me thinking it would be a face detection algorithm. Realistically, it was more of a silhouette detection algorithm, because the face was only one small part of the overall detection scheme among other factors like background color and hair color and shape. In fact, these factors were more important in some test cases than the actual face itself, with the algorithm classifying Larissa with her hair up as me, and me with dark fabric draped around my head to simulate long hair as Larissa. Talk about lessons learned. Anyways, you can read or venture as much as you’d like into that post here.
What to do now… This is getting fun – I need to keep up the momentum. I’ve explored most linear and non-linear classification models, non-linear regression to generate a heat map, scratched the surface of NNs, scratched the surface of cloud… I NEED MOAR!! Is this what a data scientist feels like? Just diving directly into a black hole of stuff I never learned in school and pretending like I know it after one blog post?
In all seriousness, I really do feel much more comfortable with all the common data science tools after so much practice. The workflow feels more familiar every time I start a new project, and I can complete a task a lot more efficiently than I could before I started this blog. That familiarity with the workflow is one of the foundational pillars that my skills will continue to build off of. Alongside the mathematical understanding of the models, I think a lack of familiarity with the workflow is the other huge demotivator that makes someone give up on going any deeper into data science. I mean, at one point before this blog, I was just writing Python in Notepad… Now I’m firing up Jupyter, taking advantage of hosting services in GitHub and S3, exploring and learning all these new libraries, leveraging cloud services for more firepower… the impacts of this workflow span far and wide. In some cases, it’s simply making me more efficient – understanding how to handle data is key to even getting a model to work, and knowing the right techniques to clean and format the data saves hours and hours of head banging. In other cases, it’s completely opening up new realms of possibilities… we saw how much faster it was to run a CNN on AWS vs my own laptop (23x faster, as a reminder), and that speed gain literally makes it possible to perform agile, continuous model development without having to wait days to train a rather complex model.
I want to continue this path of learning, but where do we turn to next?
NYC Open Data & NYPD Public Complaints
I’ll be honest. I have no idea why I’m choosing to look at this data set. I wanted to do more with open data, I was looking for a dataset with a rather large number of rows, and I guess somehow I ended up on the NYC open data site. A few months ago, when I was watching videos and doing a bit of side learning on Spark, I came across this video of Sameer Farooqui from Databricks (the largest contributor to the Spark project) analyzing some SF Fire Department call data with Spark:
I think subconsciously I’ve always wanted to try out Spark and use it to analyze a larger data set to take advantage of the parallel computing, and that urge drove me to explore a major metropolitan city’s open data portal, even if it wasn’t SF at the end of the day.
A few days ago, I was following the Carmelo drama and whether or not he’d get traded to the Rockets. Carmelo –> Knicks –> New York –> New York Open Data. That train of thought may not make sense to you, but that’s how my brain works, and I’m not quite proud of it. Random sparks of inspiration from… well… random things haha. Anyways, the data set I’m going to explore in this project is a 10-year history (2006 – 2016) of public complaints that the NYPD has received. It contains about 5.3M rows of data. Not too crazy, but good for about 1.5GB of data. I’m in no way claiming “LOOK AT ME, I’M DOING BIG DATA. MY DATA IS SO BIG, IT HAS MILLIONS OF ROWS“, but this is a larger data set than I’d generally deal with because, again, my Mac has 4 measly GB of RAM. I have to move in baby steps as well, so maybe I’ll go a bit bigger in future projects. However, it’s of course not how big your data is, it’s how you use it ;).
But, yeah, at the end of the day, this is what I’ve decided to look into. Just like the all-NBA predict project, I honestly don’t know what my objective here is yet, and I may come out of this project without really achieving anything other than simply learning more about crime in NYC, but my primary goal with this project is to actually dive a bit deeper into newer technologies – AWS EMR and Spark, to be more specific.
Technology & Tools
As I said, there are two main things I want to discover in this project:
AWS EMR & Distributed Computing
AWS was such a good experience in my CNN face detection project. It was so easy to get going with an architecture that could handle the compute and memory requirements, so I’m going to dive deeper into the AWS rabbit hole here.
EMR is AWS’ multi-node distributed computing platform based around Hadoop and Spark. In the music genre clustering project, we talked about multiprocessing on a single node. With EMR, we’ll talk about multiprocessing across multiple nodes. That is, there are multiple machines working simultaneously on the same computation! This is another way to achieve a more efficient analysis engine, depending on the types of computations we want to perform. EMR is basically the service that spins up and provisions multiple nodes for your architecture, without you having to go through the trouble of installing and syncing all the software across every node.
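To give a flavor of what that looks like in practice, here’s a sketch of requesting a small Spark-ready cluster with one AWS CLI call. Treat this as a config sketch, not a recipe – the cluster name, key pair, release label, and instance size/count are all placeholders I made up, and you’d adjust them (and your IAM roles) to your own account before running anything.

```shell
# Sketch: ask EMR for a 3-node cluster with Hadoop and Spark pre-installed.
# Name, key pair, release label, and instance type/count are placeholders.
aws emr create-cluster \
    --name "nypd-complaints-sandbox" \
    --release-label emr-5.8.0 \
    --applications Name=Hadoop Name=Spark \
    --instance-type m4.large \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair
```

EMR does the installing and wiring-up of Hadoop and Spark across all three nodes for you – just remember to terminate the cluster when you’re done, because billing is per instance-hour.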
Spark & In-Memory Distributed Computing
Spark is one of the pieces of software that EMR can spin up for us. EMR can spin up a bunch of crap, most of which I won’t go into. I’ll try to explore Hadoop a bit, but I’ll mostly be trying to get into Spark. Spark is a multi-node framework that can leverage the RAM of multiple machines to perform computation. When we’re working in Pandas on our local laptop and we create a dataframe, we load and store the data in RAM. When we’re working with RAM, there are no physical discs spinning, no needles panning… none of that.
Most computers these days have 8GB or 16GB of RAM. Mine has 4GB… sadface. At 16GB, we’re getting into a few hundred more bucks being spent as well. I’ve already spent like 8 bucks on AWS so far, so it’s not hard to see a world where that 8 becomes 800 after a year or two. So is it worth it? Hard to say… I don’t know my objectives well enough yet to make an investment into more RAM (but more importantly, I’m lazy as hell). In our music genre clustering project, the reason we spun up multiprocessing in the first place was that we didn’t have enough RAM to fit all the data at the same time… we had to feature engineer in batches. Spark aims to solve this problem. Maybe I can’t fit all the data into my 4GB of RAM, but if I had 3 machines, each with 4GB, I’d have 12GB to work with, and maybe, JUST MAYBE, that’s enough.
Mathematics & Statistics / Machine Learning
I’m not sure exactly what I want to cover here because I’m not really sure what my objective is yet. Somewhat to the tune of machine learning, Spark has a built-in distributed ML package called… Spark ML! Appropriate name if I may say so myself. Spark ML has essentially found a way to perform the logic needed for certain algorithms across multiple nodes. Although this is probably more of an extension to the Technology & Tools section, perhaps I’ll get a little bit into how the algorithms are distributed and learn more about machine learning in that sense. Otherwise, I expect to learn little new here, at least until I look at the data and see if there’s anything interesting to classify or regress on.
Domain Knowledge
Haha, I’m not sure what I can say here either. Again, I really just chose this dataset at random with more of a focus on EMR and Spark, so… yeah… I want to learn about crime in NYC. Nuff said!
Hmm… a lot of ground to cover here before we even start to take a look at the data (or, in my case, before I can even load the data on my Mac). Let’s get back to AWS in our next post.