NYPD Crime #20 – Conclusion

What A Journey

It basically took me 20 posts to come to some conclusions and then kind of contradict them. In terms of the data set itself, I definitely found some interesting insights, but my understanding of the topic kinda followed this trajectory:

In [1]:
# I'm starting to think the single graph this setup will produce is not worth it
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [10]:
# Plot trajectory of my understanding of crime in NYC
sns.regplot(
    x='time',
    y='nyc_crime_understanding',
    data=pd.DataFrame({
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'nyc_crime_understanding': [2, 4, 7, 8, 9, 7, 5, 4, 5, 5]
    }),
    order=3,
    ci=None
)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x111de0350>
[Figure nypd_crime_20_1: cubic fit of my NYC crime understanding over time]

The beginning is when I was trying to figure out Spark, the first peak is where I got Datashader working, the dip is when I created the biplots, and the little tail at the end is me fooling myself into thinking I’ve learned something again. You’ve come out of this project a better, more knowledgeable, more intelligent person, Chi… You didn’t just waste 2 months of your life… haha, if you say something enough, you’ll speak it into existence.

Nah, I’m being too harsh on myself. Again, what is this blog all about? Trying AND FAILING at data science. I’m not sure if this was a failure per se, but I’ll say that I didn’t learn as much as I had hoped coming out of it. Well, maybe I should rephrase that as well. Objectively, I actually learned a lot. On the development tools side, I learned a ton… AWS EMR, Spark, Datashader… I’d say most of my notebooks in this project were created to debug these tools and get them working. I could view this project from the perspective that even having my understanding trajectory swerve up and down, or having a trajectory at all, is a victory. My understanding of the topic was, after all, guided by the products of these tools. However, I just wish I had come out with a better understanding of “these types of crimes happen here”.

ANYWAYS, let’s summarize in a bit more detail what we learned here.

What Did I Fool Myself Into Thinking I Learned This Time?

Development Tools

EMR

I continued my journey through AWS in this project. Early in the project I explored the topic of parallel computing and Hadoop a little bit. The data set I was dealing with was hardly any excuse to be using EMR, but I thought it was a nice and cheap playground to experiment, learn, and fail in.

Every time I spun up a cluster, I would work on it for about 2-4 hours. This resulted in nightly charges of $0.30 – $1.00… not horrible at all! I was generally using 2x m4.larges ($0.03 spot instance + $0.03 EMR charge per instance) for my worker nodes, and switched between an m4.large and an m4.xlarge ($0.06 spot instance + $0.06 EMR charge per instance) as my project progressed. At this point, the only EMR expenses I’ve accrued have to do with this project, and over the entire history of my AWS account, I have spent $11.84 on EMR. Seeing as how the EC2 spot instance charges basically mirror the EMR charges, I can just double that to get a grand total of **$23.68 spent on this project!** I’ve definitely spent more than that ordering 3 drinks at the bar, so I’m going to go ahead and file this into the _money-well-spent_ cabinet.
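As a quick sanity check on that math, here’s a back-of-the-envelope estimate. The hourly rates are just the approximate spot + EMR prices quoted above (not official AWS pricing), and the cluster shape is an assumption based on what I typically ran:

# Rough per-session EMR cost, using the approximate hourly rates above
# (spot price + EMR surcharge, per instance)
m4_large_hourly = 0.03 + 0.03     # worker nodes
m4_xlarge_hourly = 0.06 + 0.06    # the larger instance size

session_hours = 4                 # a typical 2-4 hour session
cluster_hourly = m4_xlarge_hourly + 2 * m4_large_hourly   # assuming 1 larger node + 2x m4.large workers
print('~${:.2f} per session'.format(cluster_hourly * session_hours))
# ~$0.96 per session, right around the $0.30 – $1.00 range above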

I didn’t explore much else of the EMR infrastructure other than Spark, but Ganglia came into play quite a bit when I started exploring memory usage. I generally had three Chrome tabs open alongside my jupyter notebook, one for each node of the cluster. It was extremely useful to view server load in real time like that, and it played a major part in helping me understand how Spark interacts between its master and workers (although I would still rank myself a 2/10 in Spark knowledge).

I also became familiar with how seamlessly EMR integrates with the rest of the AWS ecosystem in terms of networking and security.

Spark

Ahh… Spark… I think I’ve already started off on a love-hate relationship with Spark.

The pros: The thing I absolutely loved most was the integration of Spark SQL. Man, was it nice to not have to use Pandas / PySpark syntax to perform some queries. Being able to fire up such a familiar language was a nice bell and whistle to have. The fact that it doesn’t slow down the execution of the query is great as well, and it almost provides an advantage depending on who else you’re working with on the project. If the other person is most familiar with SQL, Spark SQL will yield more interpretable code as well! Awesome. Spark also has a nice look and feel to it, especially when still working in a jupyter notebook and switching back and forth between the Python and Spark kernels (sketch below). The workflow felt really familiar, but…
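A minimal sketch of that Spark SQL workflow (here, spark is the SparkSession provided by the PySpark kernel, and the crimes view and column names are placeholders rather than the actual NYPD schema):

# Register a Spark DataFrame as a temporary view, then query it with plain SQL
crimes_df.createOrReplaceTempView('crimes')

top_offenses = spark.sql('''
    SELECT boro, offense, COUNT(*) AS n
    FROM crimes
    GROUP BY boro, offense
    ORDER BY n DESC
    LIMIT 10
''')
top_offenses.show()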

The cons: As much as PySpark is built to mimic Pandas, it really isn’t there yet. The fact that I had to spend a few days trying to figure out how to simply concatenate two dataframes along the horizontal axis was a bit unbearable. I ended up having to manually assign incrementing indexes to each dataframe and perform a join, not to mention I had to take a huge pit stop to debug memory issues. Now, a lot of this was due to my own lack of knowledge around Spark and distributed computing (hey, if I could build a better Spark, I would have built it), but all I’m saying is that it’s not as seamless a transition as some may make it seem. Spark also has some error messages that left me scratching my head a few times. Often, Spark would just throw a Java error. If we go deep enough into the logs, we may be able to get some clues, but with my lack of experience in Spark and distributed computing, some errors left me debugging for literally days. My hope is obviously that, with more experience, I’ll become more comfortable with the environment and its tendencies.
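For reference, the index-and-join workaround looked something like this (a minimal sketch rather than my exact code, and it comes with exactly the memory caveat I just mentioned):

from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

def concat_horizontally(df_a, df_b):
    # Assign an incrementing row index to each DataFrame, then join on it.
    # Assumes both DataFrames have the same number of rows. Note that a
    # window with no partitionBy() pulls all the data onto one partition,
    # which is part of why this approach can blow up memory on larger data.
    w = Window.orderBy(monotonically_increasing_id())
    df_a = df_a.withColumn('_row_id', row_number().over(w))
    df_b = df_b.withColumn('_row_id', row_number().over(w))
    return df_a.join(df_b, on='_row_id').drop('_row_id')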

Neither a pro nor a con is the experience I’ve now had with distributed computing. The length of some of those debugging sessions left me with the conclusion that it’s a different ball game. Master memory? Worker memory? Spark’s own memory settings? EMR’s settings that limit Spark’s resources? Understanding which settings to even look at to optimize your code or solve your bug?

Again, I’m being very careful to point out that I’m extremely inexperienced. I don’t want to be so quick to blame Spark, because if I don’t have the license to ride a motorcycle, I probably shouldn’t be complaining about not being able to ride it! I’m relatively lucky that the issues I ran into seemed to be issues every newbie runs into when firing up Spark on EMR. For example, the setting that tells EMR to maximize all resources on every node fixed the concatenation problem I was describing above. For the second big memory issue, I simply increased my master node size after looking at the Ganglia graphs for a bit. It took a lot of time and googling to come to those solutions, but they were simple fixes at the end of the day. Again, with more experience hopefully comes less stress… until the next big parallel computing platform comes out and the learning curve starts again… 🙂

Datashader

Datashader was the least of my worries here haha. Datashader only runs in a single-node, Python-based environment, so we’re back to… the other ball game! Datashader allowed us to plot a ton of data on a plot where that much data had no business being individually plotted… say that 10 times fast. In this world of (larger) datasets (trying not to say “big data” here…), we constantly see this idea of mapping and reducing. In this case, each pixel on screen basically figures out what it needs to show from the points that fall inside it, and all of those pixels get stitched together into the final image. It’s nice to know that, with a tool like Datashader, I’ll be able to visualize large data sets (as long as they fit in memory).
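The core API is pleasantly small, too. Something like the sketch below covers most of what I needed; crimes_pdf and its x / y columns are placeholders for a pandas DataFrame of projected coordinates:

import datashader as ds
import datashader.transfer_functions as tf

# Aggregate the points into a fixed-size pixel grid (the 'reduce' step),
# then map each pixel's count to a color
canvas = ds.Canvas(plot_width=800, plot_height=800)
agg = canvas.points(crimes_pdf, 'x', 'y')    # per-pixel counts
img = tf.shade(agg, cmap=['lightblue', 'darkred'], how='log')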

Mathematics & Statistics

I didn’t learn anything here in this project, but surely I did enough talking in the “Development Tools” section to tide me over.

Machine Learning

I didn’t learn anything here in this project, but surely I did enough talking in the “Development Tools” section to tide me over. Well… I used PCA… but that’s really nothing new, as I’ve used it in previous projects before. Spark has an entire SparkML library that I haven’t even tapped into yet… maybe in the future!

Domain Knowledge

Well, this one was a doozy. I’d probably go back to post “15 – Takeaways” to sum up what I learned here, although I slightly contradicted some of those takeaways when I generated the biplots. Still, there is a lot of high-level information I was able to gather.

This plot, for example, cannot be disputed. These are the areas of NYC where the top 3% of crimes are committed… fact. This one image paints a picture of how dangerous midtown Manhattan, The Bronx, Harlem, and Brooklyn can be.

Another takeaway is perhaps how much more “domesticated” (is that the right word?) Queens and Staten Island are. Most crimes in these areas occur in houses rather than apartments or public housing. Does this mean that there are more houses there? Not necessarily, but I think it’s quite a safe assumption.

How about the types of crime? Well, I thought I had it figured out in post #15, but post #19 came along and, well… you know the rest. Again, there are just different ways to look at this: NYC is a city of 8.5 million people. That’s almost a quarter of the entire population of Canada. That’s a lot of people doing a lot of different things and committing a lot of crimes. With that many people packed into such a small space, you’re bound to have a bit of scatter and variance. If I really wanted to get serious about this, I’d have to dive a bit deeper into demographic data instead of chalking my failures up to this, but there is some truth to it.

At the end of the day, however, the data is the data. The picture that it paints is the picture that it paints, period. My mind wants to come away with some simple understandings and insights, but perhaps life is just not like that. Again, with that many people, the data is complex! Should I be so ashamed? Part of me thinks that if I had a better grasp of statistical methods, I would’ve found some way to summarize the data, no matter how complex. I can tell this will be one of the recurring questions in probably any project I work on in the future. If only I were smarter…

 
