Intro
At this point, we’ve gotten a feel for the data, but we really still don’t know what’s in it. After 9 posts, we’re finally ready to actually explore haha… Oh, the miraculous world of data science…
Similar to the All-NBA prediction project I was doing, I really have no objective here. I came in just looking to use Spark and explore a hefty amount of Police calls, so let’s just start with whatever is interesting to us.
Incident Timestamps
I’ve been wondering about the time-based trends of the complaints ever since I was cleaning up the data… What’s the trend year over year? Which months are most active with complaints? Day of the week? Hour of the day? Let’s explore all of these.
import os
os.system("sudo pip install findspark sql_magic pyspark_dist_explore")
# Use findspark package to connect Jupyter to Spark shell
import findspark
findspark.init('/usr/lib/spark')
# Load SparkSession object
import pyspark
from pyspark.sql import SparkSession
# Load other libraries
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import DateType
from functools import reduce
import pandas as pd
import numpy as np
# Graphing with matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
# Initiate SparkSession as "spark"
spark = SparkSession\
.builder\
.getOrCreate()
# Load sql_magic and connect to Spark
%load_ext sql_magic
%config SQL.conn_name = 'spark'
A few notes above:
- I should be doing a lot more graphing from now on, so I’m going to try out the ggplot style of matplotlib, which is supposed to mimic R’s graph styles and color schemes. I’ve always liked ggplot’s color schemes, so let’s try it out.
- When importing functions from the pyspark.sql package, I’ve been importing every single function separately… that’s like importing every Pandas dataframe command individually as you need it… I’ve seen multiple folks import the entire pyspark.sql.functions package as the literal ‘F’, so I’ll try this (tiny illustration below).
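For illustration only (the dataframe and column names here are made up, not part of the notebook flow): with the alias, every SQL-style function is just F.<name> away, instead of an ever-growing import line.
# Hypothetical example of the 'F' alias -- one import instead of many
from pyspark.sql import functions as F
some_df = some_df.withColumn('LENGTH_HOURS', F.round(F.col('LENGTH_MINUTES') / 60, 1))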
%%time
# Read NYPD Complaint Data
df = spark.read.parquet("s3n://2017edmfasatb/nypd_complaints/data/df_clean.parquet")
The CSV, for reference, took ~50 seconds to load. The parquet load only took 15s, which I assume is because we don’t have to infer a schema: the schema is built into the parquet file itself, so all Spark has to do is read the metadata. Good stuff!
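For comparison, this is roughly what a CSV load with schema inference looks like (a sketch only; the CSV path and options here are assumptions, not the actual earlier notebook code).
# Hypothetical CSV read -- inferSchema forces an extra pass over the data to guess types
df_csv = spark.read.csv(
    "s3n://2017edmfasatb/nypd_complaints/data/df_clean.csv",  # assumed path, for illustration
    header = True,
    inferSchema = True
)
# The parquet read above skips that inference pass entirely, since the schema lives in the file metadata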
df.printSchema()
# Add table to SQL Context
df.createOrReplaceTempView("df")
Time-Based Analysis
Annual Trend
Let’s start with a simple annual trend.
complaint_start_result = %read_sql \
SELECT \
COMPLAINT_START_TIMESTAMP_YEAR as COMPLAINT_YEAR, \
COUNT(*) as NUM_COMPLAINTS \
FROM df \
GROUP BY COMPLAINT_START_TIMESTAMP_YEAR
# Plot number of complaints per year
complaint_start_result.sort_values('COMPLAINT_YEAR').plot(kind = 'bar', x = 'COMPLAINT_YEAR', y = 'NUM_COMPLAINTS', figsize = (15, 5))
So the documentation for the data set says that it’s got data from 2006 to the end of 2016, about 10 years’ worth. This is absolutely NOT what I’m seeing in the data itself…
complaint_start_result.sort_values('COMPLAINT_YEAR').tail(20)
For some reason, there are complaints pre-2006… how many records?
# Create a boolean flag for those records greater than or equal to the year 2006
complaint_start_result['2006_or_later'] = np.where(
complaint_start_result['COMPLAINT_YEAR'].astype('int') >= 2006,
'2006 Or Later',
'Before 2006'
)
# Sum up number of complaints per category
complaint_start_result_grouped = complaint_start_result.groupby('2006_or_later').agg({'NUM_COMPLAINTS': [np.sum]}).reset_index()
# Calculate percentage of total
complaint_start_result_grouped['NUM_COMPLAINTS_PERCENT'] = (complaint_start_result_grouped['NUM_COMPLAINTS']['sum'] / complaint_start_result_grouped['NUM_COMPLAINTS']['sum'].sum())*100
complaint_start_result_grouped
Okay, there is a negligible number of rows where the incident started before 2006. Maybe the data set meant complaints that were reported in 2006 or after?
report_result = %read_sql \
SELECT \
REPORTED_DATE_TIMESTAMP_YEAR as REPORTED_YEAR, \
COUNT(*) as NUM_COMPLAINTS \
FROM df \
GROUP BY REPORTED_DATE_TIMESTAMP_YEAR
# Plot number of complaints per year
report_result.sort_values('REPORTED_YEAR').plot(kind = 'bar', x = 'REPORTED_YEAR', y = 'NUM_COMPLAINTS')
Okay, I guess that is what the dataset description meant: anything that was REPORTED in 2006 or later. This is fine, but it implies that there were incidents that were reported, in some cases, DECADES after they actually happened. There was even one in… 1015? I mean… the Declaration of Independence was signed on July 4th, 1776. Plus, anyone who was alive in 1015 would be over 1,000 years old by now…
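Out of curiosity, a quick sketch (descriptive only, not conclusive) of how long after the incident start these reports come in, using the two year columns we already have:
# Rough look at the reporting delay in years (reported year minus incident start year)
report_delay_result = spark.sql("""
    SELECT
        CAST(REPORTED_DATE_TIMESTAMP_YEAR AS INT) - CAST(COMPLAINT_START_TIMESTAMP_YEAR AS INT) AS REPORT_DELAY_YEARS,
        COUNT(*) AS NUM_COMPLAINTS
    FROM df
    GROUP BY CAST(REPORTED_DATE_TIMESTAMP_YEAR AS INT) - CAST(COMPLAINT_START_TIMESTAMP_YEAR AS INT)
""").toPandas()
# Most rows should have a delay of 0, with a long tail caused by the typo years
report_delay_result.sort_values('REPORT_DELAY_YEARS', ascending = False).head(10)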
Let’s see what the 1015 one is…
# Look at those rows which the incident started on the year 1015
df[df['COMPLAINT_START_TIMESTAMP_YEAR'] == 1015].toPandas()
Welp, it looks like these were definitely typos. They should all be in the 2015 – 2016 range. It seems that some 1015s should be 2015, and some should be 2016. For example, for row id #6, you’d imagine that a small misdemeanor of Petit Larceny in Brooklyn lasted from 2016 to 2016… even then, though, it indicates that the incident lasted from February 14th to the 20th, which I honestly can’t be too sure of. In any event, my point is that I don’t want to make assumptions about any of these dates or typos because I simply don’t know much about the criminal system, or crimes in general. I think I’m going to take out all of these years that are in the 1000s – they are typos and may mess with my analysis when I look at the time it took to report or the time it took for the incident to finish.
Let’s take a look at the 1900 one:
# Look at those rows which the incident started on the year 1900
df[df['COMPLAINT_START_TIMESTAMP_YEAR'] == 1900].toPandas()
Hmm… many of these look like typos too. Row index #1 has one going from 1900 to 2010, which is clearly a typo. I’m not sure exactly what the typo should be though… 1900 to 2010 is not even close. The ones from 1900 to 1900 are also kinda weird… why would someone in 2006 report a crime that happened in the 1900s? Maybe there is something I’m missing here, but it just doesn’t make sense to me. Nowhere in the NYC statutes of limitations does any crime let you report 100 years later, and it’s unlikely any of the parties involved would even still be alive by then…
I’m not convinced by these 1900 ones so far… let’s try 1980 to see an example of a crime where it’s at least plausible that the person reporting it is still alive.
# Look at those rows which the incident started on the year 1980
df[df['COMPLAINT_START_TIMESTAMP_YEAR'] == 1980].toPandas()
These look fishy too; some of these are clearly typos as well. I’m going to choose not to believe any of this, and just filter for incidents that started in 2006 or later.
# Look at those rows which the incident started on the year 2000
df[df['COMPLAINT_START_TIMESTAMP_YEAR'] == 2000].toPandas().head(20)
Still shady… For example, on row index #9, can an Assault case really last from 2000 to 2009? In the case of row index #14, a Public Order misdemeanor went on from 2000 to 2009? I’m not believing it. Many of these have 2009 as an end date, so perhaps these are typos as well, since the 0 and 9 keys are close together. In any event, I will filter out anything with a starting date before 2006.
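Just to put a rough number on that hunch (a hedged check, nothing more), here’s a quick count of the end years for complaints that supposedly started in 2000:
# How many of the year-2000 starts end in 2009 specifically?
year_2000_end_years = spark.sql("""
    SELECT
        COMPLAINT_END_TIMESTAMP_YEAR AS COMPLAINT_END_YEAR,
        COUNT(*) AS NUM_COMPLAINTS
    FROM df
    WHERE CAST(COMPLAINT_START_TIMESTAMP_YEAR AS INT) = 2000
    GROUP BY COMPLAINT_END_TIMESTAMP_YEAR
    ORDER BY NUM_COMPLAINTS DESC
""").toPandas()
year_2000_end_years.head(10)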
# Filter out rows with complaint start year before 2006
df_filtered = df[df['COMPLAINT_START_TIMESTAMP_YEAR'] >= 2006]
df_filtered.count()
# Add table to SQL Context
df_filtered.createOrReplaceTempView("df_filtered")
complaint_start_result = %read_sql \
SELECT \
COMPLAINT_START_TIMESTAMP_YEAR as COMPLAINT_YEAR, \
COUNT(*) as NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY COMPLAINT_START_TIMESTAMP_YEAR
# Plot number of complaints per year
complaint_start_result.sort_values('COMPLAINT_YEAR').plot(kind = 'bar', x = 'COMPLAINT_YEAR', y = 'NUM_COMPLAINTS')
Looks better to me! Crime is decreasing – yay?
Length Of Incident
Before I go ahead and look at any more time-series trends, I want to stop and take a look at the length of incidents because it’s quite clear that typos exist in the data. This is an after-the-fact data cleansing exercise as my trust is a bit shaken right now.
# Generate summary statistics for complaint length
complaint_length_summary = df_filtered[['COMPLAINT_LENGTH']].describe().toPandas()
complaint_length_summary
Remember, I loaded these up to be in minutes… I probably should have generated them as hours or days haha. Let’s do days…
# Add additional column expressing length of complaint in days
complaint_length_summary['COMPLAINT_LENGTH_DAYS'] = complaint_length_summary['COMPLAINT_LENGTH'].astype(float)/60/24
complaint_length_summary
The average complaint length is 3.5 days, with a standard deviation of about 1.5 months (44 days)… That seems very long, especially when I take a look at the max value, which is 82 years long. More typos are coming…
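A quick peek at the worst offenders helps here (a sketch, limited to columns we already know exist):
# Eyeball the 10 longest complaints by COMPLAINT_LENGTH (minutes)
df_filtered \
    .select('COMPLAINT_START_TIMESTAMP_YEAR', 'COMPLAINT_END_TIMESTAMP_YEAR', 'COMPLAINT_LENGTH') \
    .orderBy(F.desc('COMPLAINT_LENGTH')) \
    .limit(10) \
    .toPandas()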
Ugh, I thought I was done with the cleansing step last post – it has a way of just pulling you back in. I guess we’ll have to take a look at the complaint end date as well.
complaint_end_result = %read_sql \
SELECT \
COMPLAINT_END_TIMESTAMP_YEAR as COMPLAINT_END_YEAR, \
COUNT(*) as NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY COMPLAINT_END_TIMESTAMP_YEAR
# Plot complaint ends per year
complaint_end_result.sort_values('COMPLAINT_END_YEAR').plot(kind = 'bar', x = 'COMPLAINT_END_YEAR', y = 'NUM_COMPLAINTS')
Well the 2001, 2017, and 2090 values definitely explain some of the bad summary statistics… Let’s filter these out too.
# Filter end date for 2006 - 2016 inclusive
df_filtered = df_filtered[(df_filtered['COMPLAINT_END_TIMESTAMP_YEAR'] >= 2006) & (df_filtered['COMPLAINT_END_TIMESTAMP_YEAR'] <= 2016)]
# Add table to SQL Context
df_filtered.createOrReplaceTempView("df_filtered")
complaint_end_result = %read_sql \
SELECT \
COMPLAINT_END_TIMESTAMP_YEAR as COMPLAINT_END_YEAR, \
COUNT(*) as NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY COMPLAINT_END_TIMESTAMP_YEAR
# Plot complaint ends per year
complaint_end_result.sort_values('COMPLAINT_END_YEAR').plot(kind = 'bar', x = 'COMPLAINT_END_YEAR', y = 'NUM_COMPLAINTS')
Let’s try calculating complaint length summary statistics again.
# Generate summary statistics for complaint length
complaint_length_summary = df_filtered[['COMPLAINT_LENGTH']].describe().toPandas()
complaint_length_summary['COMPLAINT_LENGTH_DAYS'] = complaint_length_summary['COMPLAINT_LENGTH'].astype(float)/60/24
complaint_length_summary
You know what, this would actually be a very good use of a histogram… I secretly have been holding out on histograms because, embarrassingly enough, how do I actually plot a histogram in Spark? I don’t think Spark has a graphing library… It looks like there’s a library called pyspark_dist_explore that leverages Spark to compute the histogram bins and then uses matplotlib to draw them.
from pyspark_dist_explore import hist
fig, ax = plt.subplots()
hist(ax, df_filtered[['COMPLAINT_LENGTH']], bins = 10)
Let’s create a days column and plot that instead.
# Create permanent COMPLAINT_LENGTH_DAYS column (minutes -> days)
df_filtered = df_filtered.withColumn('COMPLAINT_LENGTH_DAYS', df_filtered['COMPLAINT_LENGTH']/60/24)
fig, ax = plt.subplots()
hist(ax, df_filtered[['COMPLAINT_LENGTH_DAYS']], bins = 10)
The output of hist() from the pyspark_dist_explore package gives us two arrays:
- The first array holds the count per histogram bin
- The second array holds the bin boundaries (the lower edge of each bin)
If we only wanted the raw numbers and not a plot, Spark can compute them directly; see the sketch below.
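This is a minimal sketch using the RDD API’s histogram method (assuming COMPLAINT_LENGTH_DAYS is numeric; nulls are dropped first):
# Compute histogram buckets and counts on the cluster, without plotting
buckets, counts = (
    df_filtered
    .select('COMPLAINT_LENGTH_DAYS')
    .dropna()
    .rdd
    .flatMap(lambda row: row)   # unwrap Row objects into plain numbers
    .histogram(10)              # 10 evenly spaced buckets between min and max
)
print(buckets)   # 11 bucket boundaries
print(counts)    # 10 counts, one per bucket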
Remember, our mean was around 3 days and the standard deviation was around 40 days, so theoretically most of the data should lie between 0 and 200 days (and 200 is probably being generous). Here, our buckets are a bit too large, so let’s break it up into more buckets.
fig, ax = plt.subplots()
hist(ax, df_filtered[['COMPLAINT_LENGTH_DAYS']], bins = 50)
I think 0 – 500 is my sweet spot. Let’s just do a quick sanity check to see how many records are between 0 and 500. Anything under 0 doesn’t make sense and is likely an error (how can an incident end before it starts?). If it’s over 500… well, the histogram tells us that it’s quite an outlier… I’ll take these out for now just to make my own life easier. In fact, anything over 365 days is a bit shady to me… a crime that lasts longer than a year? I’m not sure about that, so why don’t I actually make that my cutoff!
# Create flag to determine whether or not length of complaint in days is under a year
df_filtered = df_filtered.withColumn(
'COMPLAINT_LENGTH_UNDER_ONE_YEAR',
F.when((df_filtered['COMPLAINT_LENGTH_DAYS'] >= 0) & (df_filtered['COMPLAINT_LENGTH_DAYS'] <= 365), True) \
.otherwise(False)
)
# Add table to SQL Context
df_filtered.createOrReplaceTempView("df_filtered")
complaint_length_groupby = %read_sql SELECT \
COMPLAINT_LENGTH_UNDER_ONE_YEAR, \
COUNT(*) AS NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY \
COMPLAINT_LENGTH_UNDER_ONE_YEAR
# Add percentages column
complaint_length_groupby['NUM_COMPLAINTS_PERCENTAGE'] = (complaint_length_groupby['NUM_COMPLAINTS'] / complaint_length_groupby['NUM_COMPLAINTS'].sum())*100
complaint_length_groupby
Once again, barely any of the rows fall over a year or take on a negative value, so I feel comfortable discarding these as well. Let’s make the change in the original Spark dataframe.
# Filter for those with complaint lengths under one year
df_filtered = df_filtered[(df_filtered['COMPLAINT_LENGTH_DAYS'] >= 0) & (df_filtered['COMPLAINT_LENGTH_DAYS'] <= 365)]
fig, ax = plt.subplots()
hist(ax, df_filtered[['COMPLAINT_LENGTH_DAYS']], bins = 10)
This looks fine to me for now. I’m inferring that the large majority of the crimes are actually 0 days long (i.e. a single instant in time, starting and ending at the exact same moment). Let’s confirm this:
complaint_length_groupby = %read_sql SELECT \
COMPLAINT_LENGTH_ZERO_TIME, \
COUNT(*) AS NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY \
COMPLAINT_LENGTH_ZERO_TIME
# Add percentages column
complaint_length_groupby['NUM_COMPLAINTS_PERCENTAGE'] = (complaint_length_groupby['NUM_COMPLAINTS'] / complaint_length_groupby['NUM_COMPLAINTS'].sum())*100
complaint_length_groupby
Ah, not quite. Maybe most are within 1 – 2 days?
fig, ax = plt.subplots()
hist(ax, df_filtered[df_filtered['COMPLAINT_LENGTH_DAYS'] <= 5][['COMPLAINT_LENGTH_DAYS']], bins = 10)
Finally, we get to a histogram that actually gives me a good sense of the distribution of complaint lengths.
- ~90% of the incidents took less than half a day (remember that a length of 0 makes up about 1/3 of all the rows in the dataset)
- Maybe ~3-4% take half a day to a day to complete
- The rest take over a day to complete, up to our arbitrary limit of 365 days (1 year); see the quick quantile check below for slightly firmer numbers
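Those percentages are eyeballed from the histogram. If we wanted firmer figures without pulling everything into pandas, Spark’s approxQuantile can give approximate percentiles (a sketch; the last argument is the allowed relative error):
# Approximate median, 90th, and 99th percentile of complaint length in days
quantiles = df_filtered.approxQuantile('COMPLAINT_LENGTH_DAYS', [0.5, 0.9, 0.99], 0.01)
print(quantiles)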
Let’s move on.
Monthly, Weekly, Daily, Hourly Trends
Monthly
Let’s take a look at how many crimes happen per month, day of month, day of week, and hour of day.
# Add table to SQL Context
df_filtered.createOrReplaceTempView("df_filtered")
monthly_result = %read_sql \
SELECT \
CAST(COMPLAINT_START_TIMESTAMP_MONTH AS INT) AS COMPLAINT_MONTH, \
COUNT(*) AS NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY COMPLAINT_START_TIMESTAMP_MONTH
# Plot number of complaints per month
monthly_result.sort_values('COMPLAINT_MONTH', ascending = True).plot(kind = 'bar', x = 'COMPLAINT_MONTH', y = 'NUM_COMPLAINTS')
More complaints during the summer months and very few complaints in February. I actually think the February dip is partly because February has fewer days than the rest of the months. Nothing else is too out of the ordinary here. Winter in NYC can be cold, so perhaps fewer crimes end up being committed.
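One way to sanity-check the February dip is to normalize by the number of days in each month (a rough sketch, ignoring leap years):
# Complaints per day, per month -- days_in_month ignores leap years
days_in_month = {1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30,
                 7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}
monthly_normalized = monthly_result.copy()
monthly_normalized['COMPLAINTS_PER_DAY'] = monthly_normalized['NUM_COMPLAINTS'] / monthly_normalized['COMPLAINT_MONTH'].map(days_in_month)
monthly_normalized.sort_values('COMPLAINT_MONTH').plot(kind = 'bar', x = 'COMPLAINT_MONTH', y = 'COMPLAINTS_PER_DAY')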
Daily
daily_result = %read_sql \
SELECT \
CAST(COMPLAINT_START_TIMESTAMP_DAY AS INT) AS COMPLAINT_DAY, \
COUNT(*) AS NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY COMPLAINT_START_TIMESTAMP_DAY
# Plot number of complaints per day
daily_result.sort_values('COMPLAINT_DAY', ascending = True).plot(kind = 'bar', x = 'COMPLAINT_DAY', y = 'NUM_COMPLAINTS')
We see the effect of February again with the drops on the 29th, 30th, and 31st of the month. The 31st is also missing from about half of the months, of course, so relatively speaking its count is actually quite high.
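Same normalization idea as with the months, this time dividing by how many months of the year actually contain each day (a rough sketch, again ignoring leap years):
# Days 1-28 occur in all 12 months, the 29th and 30th in 11, and the 31st in only 7
occurrences = {day: 12 for day in range(1, 29)}
occurrences.update({29: 11, 30: 11, 31: 7})
daily_normalized = daily_result.copy()
daily_normalized['COMPLAINTS_PER_OCCURRENCE'] = daily_normalized['NUM_COMPLAINTS'] / daily_normalized['COMPLAINT_DAY'].map(occurrences)
daily_normalized.sort_values('COMPLAINT_DAY').plot(kind = 'bar', x = 'COMPLAINT_DAY', y = 'COMPLAINTS_PER_OCCURRENCE')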
A few dates pop out for some reason.
- There are far more complaints than average on the 1st
- There are slightly more complaints than average on the 5th, 10th, 15th, and 20th
The first point I could see. The 1st of every month is a milestone. If someone gets paid on the first of the month, maybe that could spark a theft or burglary… there are lots of reasons someone might plan something for the first of the month, simply because it’s an easy date to remember, and crimes and complaints may revolve around those events.
The second point, I’m not so sure… I honestly have no clue right now. I’ll have to do more research and come back to this later.
weekday_result = %read_sql \
SELECT \
COMPLAINT_START_TIMESTAMP_WEEKDAY AS COMPLAINT_WEEKDAY, \
COUNT(*) AS NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY COMPLAINT_START_TIMESTAMP_WEEKDAY
# Manually reorder the grouped rows into weekday order before plotting
# (the index positions here depend on the order the GROUP BY returned the rows in)
weekday_result.reindex([1, 6, 4, 2, 5, 3, 0]).plot(kind = 'bar', x = 'COMPLAINT_WEEKDAY', y = 'NUM_COMPLAINTS')
More complaints on Fridays, fewer on Mondays and Sundays… I can see both of those. The same number of complaints on a Wednesday as on a Saturday, though? I would’ve thought Saturday would have the most complaints, but it’s good to know what the data says.
Hourly
hourly_result = %read_sql \
SELECT \
CAST(COMPLAINT_START_TIMESTAMP_HOUR AS INT) AS COMPLAINT_HOUR, \
COUNT(*) AS NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY COMPLAINT_START_TIMESTAMP_HOUR
# Plot number of complaints per hour
hourly_result.sort_values('COMPLAINT_HOUR', ascending = True).plot(kind = 'bar', x = 'COMPLAINT_HOUR', y = 'NUM_COMPLAINTS')
Nothing too out of the ordinary here, either. Most incidents happen in the evening, with only a small fraction happening between the hours of 1AM – 8AM. We peak around 3PM – 8PM, when everyone is generally off work. I’d expect this to change on the weekends (maybe a flatter histogram?).
hourly_result = %read_sql \
SELECT \
CAST(COMPLAINT_START_TIMESTAMP_HOUR AS INT) AS COMPLAINT_HOUR, \
COUNT(*) AS NUM_COMPLAINTS \
FROM df_filtered \
WHERE COMPLAINT_START_TIMESTAMP_WEEKDAY IN ('Sat', 'Sun') \
GROUP BY COMPLAINT_START_TIMESTAMP_HOUR
# Plot number of complaints per hour on the weekend
hourly_result.sort_values('COMPLAINT_HOUR', ascending = True).plot(kind = 'bar', x = 'COMPLAINT_HOUR', y = 'NUM_COMPLAINTS')
Okay, maybe the histogram is not as flat as I thought it would be… I guess people still need to sleep lol. The 1AM – 8AM lull becomes more of a 5AM – 8AM lull. In fact, on these days, the most complaints occur at 12AM! In many cases, this is when people are heading out to hit up the bars, wreaking havoc and whatnot, and the incidents continue all the way until 3AM – 4AM. We still see the peak in the evening as well.
Busiest Day Of The Year
I just wanted to answer one more question for my own curiosity… What is generally the busiest day of the year for the NYPD? New Years? July 4th? Christmas? Some random day like September 2nd? Let’s find out.
busiest_day_of_year_results = %read_sql \
SELECT \
CAST(COMPLAINT_START_TIMESTAMP_MONTH AS INT) AS COMPLAINT_MONTH, \
COMPLAINT_START_TIMESTAMP_DAY AS COMPLAINT_DAY, \
COUNT(*) AS NUM_COMPLAINTS \
FROM df_filtered \
GROUP BY \
COMPLAINT_START_TIMESTAMP_MONTH, \
COMPLAINT_START_TIMESTAMP_DAY
# Add percentage-of-total column and show the 20 busiest days
busiest_day_of_year_results['NUM_COMPLAINTS_PERC_TOT'] = (busiest_day_of_year_results['NUM_COMPLAINTS'] / busiest_day_of_year_results['NUM_COMPLAINTS'].sum())*100
busiest_day_of_year_results.sort_values('NUM_COMPLAINTS', ascending = False).head(20)
Hmm, well, I was right about New Year’s… but something tells me it’s not necessarily because of New Year’s… The top 12 days with the most complaints throughout the year are the 1st of every single month. This can’t be a coincidence, but I’m not sure why the 1st specifically is so popular…
Then come the 15ths, 20ths, and 25ths… also quite confusing to me. I think there’s something about the police system or complaint process that I’m not understanding right now (one quick check I might run later is sketched below). Anyways, it’s good to keep that in the back of my head; in the next post, let’s explore some non-datetime-based fields.
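A possible follow-up, purely descriptive and just a sketch using columns we already have: compare the hour-of-day profile of complaints that start on the 1st of a month against everything else, to see whether those 1st-of-the-month spikes look like normal days or something odd.
# Hour-of-day counts, split by whether the complaint started on the 1st of a month
first_of_month_hours = spark.sql("""
    SELECT
        CAST(COMPLAINT_START_TIMESTAMP_DAY AS INT) = 1 AS IS_FIRST_OF_MONTH,
        CAST(COMPLAINT_START_TIMESTAMP_HOUR AS INT) AS COMPLAINT_HOUR,
        COUNT(*) AS NUM_COMPLAINTS
    FROM df_filtered
    GROUP BY
        CAST(COMPLAINT_START_TIMESTAMP_DAY AS INT) = 1,
        CAST(COMPLAINT_START_TIMESTAMP_HOUR AS INT)
""").toPandas()
first_of_month_hours.sort_values(['IS_FIRST_OF_MONTH', 'COMPLAINT_HOUR']).head(10)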
# Save filtered dataframe back to S3 as parquet
df_filtered.write.parquet('s3n://2017edmfasatb/nypd_complaints/data/df_filtered.parquet')