NYPD Crime #7 – Data Exploration (Part II – Data Cleansing)

Intro

In the last post, we wanted to dive into data exploration, but got caught up looking at the basic workflow of simple Spark processes. In fact, we really did no exploration whatsoever other than looking at the top 10 rows of the data set haha. It’s okay though, that deeper dive was necessary and will help with troubleshooting in the future. Let’s load the dataset from S3 into a Spark dataframe again.

In [77]:
# Use findspark package to connect Jupyter to Spark shell
import findspark
findspark.init('/usr/lib/spark')

# Load SparkSession object
import pyspark
from pyspark.sql import SparkSession

# Load other libraries
from datetime import datetime
from pyspark.sql.functions import col, udf, count, isnan, lit, sum, coalesce, concat, to_timestamp, when
from pyspark.sql.types import DateType
from functools import reduce
import pandas as pd

# Initiate SparkSession as "spark"
spark = SparkSession\
    .builder\
    .getOrCreate()

# Load sql_magic and connect to Spark
%load_ext sql_magic
%config SQL.conn_name = 'spark'
The sql_magic extension is already loaded. To reload it, use:
  %reload_ext sql_magic
In [2]:
%%time
# Read NYPD Complaint Data
df = spark.read.csv(
    "s3n://2017edmfasatb/nypd_complaints/data/NYPD_Complaint_Data_Historic.csv", 
    header = True, 
    inferSchema = True
)
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 50.7 s
In [3]:
# Describe df
df.printSchema()
root
 |-- CMPLNT_NUM: integer (nullable = true)
 |-- CMPLNT_FR_DT: string (nullable = true)
 |-- CMPLNT_FR_TM: string (nullable = true)
 |-- CMPLNT_TO_DT: string (nullable = true)
 |-- CMPLNT_TO_TM: string (nullable = true)
 |-- RPT_DT: string (nullable = true)
 |-- KY_CD: integer (nullable = true)
 |-- OFNS_DESC: string (nullable = true)
 |-- PD_CD: integer (nullable = true)
 |-- PD_DESC: string (nullable = true)
 |-- CRM_ATPT_CPTD_CD: string (nullable = true)
 |-- LAW_CAT_CD: string (nullable = true)
 |-- JURIS_DESC: string (nullable = true)
 |-- BORO_NM: string (nullable = true)
 |-- ADDR_PCT_CD: integer (nullable = true)
 |-- LOC_OF_OCCUR_DESC: string (nullable = true)
 |-- PREM_TYP_DESC: string (nullable = true)
 |-- PARKS_NM: string (nullable = true)
 |-- HADEVELOPT: string (nullable = true)
 |-- X_COORD_CD: integer (nullable = true)
 |-- Y_COORD_CD: integer (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Lat_Lon: string (nullable = true)

In [4]:
# Register temporary table
df.createOrReplaceTempView("df")
In [5]:
# Perform the SQL equivalent of "head"
result = %read_sql SELECT * FROM df LIMIT 10;
result
Query started at 09:57:13 PM UTC; Query executed in 0.01 m
Out[5]:
CMPLNT_NUM CMPLNT_FR_DT CMPLNT_FR_TM CMPLNT_TO_DT CMPLNT_TO_TM RPT_DT KY_CD OFNS_DESC PD_CD PD_DESC ADDR_PCT_CD LOC_OF_OCCUR_DESC PREM_TYP_DESC PARKS_NM HADEVELOPT X_COORD_CD Y_COORD_CD Latitude Longitude Lat_Lon
0 101109527 12/31/2015 23:45:00 None None 12/31/2015 113 FORGERY 729.0 FORGERY,ETC.,UNCLASSIFIED-FELO 44 INSIDE BAR/NIGHT CLUB None None 1007314 241257 40.828848 -73.916661 (40.828848333, -73.916661142)
1 153401121 12/31/2015 23:36:00 None None 12/31/2015 101 MURDER & NON-NEGL. MANSLAUGHTER NaN None 103 OUTSIDE None None None 1043991 193406 40.697338 -73.784557 (40.697338138, -73.784556739)
2 569369778 12/31/2015 23:30:00 None None 12/31/2015 117 DANGEROUS DRUGS 503.0 CONTROLLED SUBSTANCE,INTENT TO 28 None OTHER None None 999463 231690 40.802607 -73.945052 (40.802606608, -73.945051911)
3 968417082 12/31/2015 23:30:00 None None 12/31/2015 344 ASSAULT 3 & RELATED OFFENSES 101.0 ASSAULT 3 105 INSIDE RESIDENCE-HOUSE None None 1060183 177862 40.654549 -73.726339 (40.654549444, -73.726338791)
4 641637920 12/31/2015 23:25:00 12/31/2015 23:30:00 12/31/2015 344 ASSAULT 3 & RELATED OFFENSES 101.0 ASSAULT 3 13 FRONT OF OTHER None None 987606 208148 40.738002 -73.987891 (40.7380024, -73.98789129)
5 365661343 12/31/2015 23:18:00 12/31/2015 23:25:00 12/31/2015 106 FELONY ASSAULT 109.0 ASSAULT 2,1,UNCLASSIFIED 71 FRONT OF DRUG STORE None None 996149 181562 40.665023 -73.957111 (40.665022689, -73.957110763)
6 608231454 12/31/2015 23:15:00 None None 12/31/2015 235 DANGEROUS DRUGS 511.0 CONTROLLED SUBSTANCE, POSSESSI 7 OPPOSITE OF STREET None None 987373 201662 40.720200 -73.988735 (40.720199996, -73.988735082)
7 265023856 12/31/2015 23:15:00 12/31/2015 23:15:00 12/31/2015 118 DANGEROUS WEAPONS 792.0 WEAPONS POSSESSION 1 & 2 46 FRONT OF STREET None None 1009041 247401 40.845707 -73.910398 (40.845707148, -73.910398033)
8 989238731 12/31/2015 23:15:00 12/31/2015 23:30:00 12/31/2015 344 ASSAULT 3 & RELATED OFFENSES 101.0 ASSAULT 3 48 INSIDE RESIDENCE – APT. HOUSE None None 1014154 251416 40.856711 -73.891900 (40.856711291, -73.891899956)
9 415095955 12/31/2015 23:10:00 12/31/2015 23:10:00 12/31/2015 341 PETIT LARCENY 338.0 LARCENY,PETIT FROM BUILDING,UN 19 INSIDE DRUG STORE None None 994327 218211 40.765618 -73.963623 (40.765617688, -73.96362342)

10 rows × 24 columns

Data Cleansing

Removing NAs

Okay, so let’s actually start cleaning up some of this data. The first thing I want to see is how many NAs exist in each column.

In [6]:
# Find number of rows in dataframe
df_num_rows = df.count()
df_num_rows
Out[6]:
5580035
In [17]:
# Function copied from https://stackoverflow.com/questions/33900726/count-number-of-non-nan-entries-in-each-column-of-spark-dataframe-with-pyspark
# Retrieves all the non-na rows
def count_not_null(c, nan_as_null=False):
    pred = col(c).isNotNull() & (~isnan(c) if nan_as_null else lit(True))
    return sum(pred.cast("integer")).alias(c)

def show_null_count(df, df_num_rows):
    df_na_pandas = df.agg(*[count_not_null(c) for c in df.columns]).toPandas().T
    df_na_pandas.columns = ['non_na_rows']
    df_na_pandas['total_rows'] = df_num_rows
    df_na_pandas['na_rows'] = df_na_pandas['total_rows'] - df_na_pandas['non_na_rows']
    df_na_pandas['na_rows_ratio'] = df_na_pandas['na_rows'] / df_na_pandas['total_rows']
    return df_na_pandas
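As a sanity check on the logic, here’s the same per-column non-null count over a toy list of plain Python dicts (the rows below are made up; only the column names are borrowed from the dataset):

```python
# Toy rows mimicking the dataframe; None stands in for nulls
rows = [
    {"BOROUGH": "BRONX",  "PARK_NAME": None},
    {"BOROUGH": None,     "PARK_NAME": None},
    {"BOROUGH": "QUEENS", "PARK_NAME": "ASTORIA PARK"},
]

columns = ["BOROUGH", "PARK_NAME"]

# For each column, count rows where the value is not null -- the plain-Python
# equivalent of sum(col(c).isNotNull().cast("integer"))
non_na_rows = {c: sum(r[c] is not None for r in rows) for c in columns}
na_rows = {c: len(rows) - non_na_rows[c] for c in columns}

print(non_na_rows)  # {'BOROUGH': 2, 'PARK_NAME': 1}
print(na_rows)      # {'BOROUGH': 1, 'PARK_NAME': 2}
```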
In [10]:
# Show nulls
show_null_count(df, df_num_rows)
Out[10]:
non_na_rows total_rows na_rows na_rows_ratio
CMPLNT_NUM 5580035 5580035 0 0.000000
CMPLNT_FR_DT 5579380 5580035 655 0.000117
CMPLNT_FR_TM 5579987 5580035 48 0.000009
CMPLNT_TO_DT 4107249 5580035 1472786 0.263938
CMPLNT_TO_TM 4111153 5580035 1468882 0.263239
RPT_DT 5580035 5580035 0 0.000000
KY_CD 5580035 5580035 0 0.000000
OFNS_DESC 5561143 5580035 18892 0.003386
PD_CD 5575126 5580035 4909 0.000880
PD_DESC 5575126 5580035 4909 0.000880
CRM_ATPT_CPTD_CD 5580028 5580035 7 0.000001
LAW_CAT_CD 5580035 5580035 0 0.000000
JURIS_DESC 5580035 5580035 0 0.000000
BORO_NM 5579572 5580035 463 0.000083
ADDR_PCT_CD 5579645 5580035 390 0.000070
LOC_OF_OCCUR_DESC 4356643 5580035 1223392 0.219245
PREM_TYP_DESC 5544837 5580035 35198 0.006308
PARKS_NM 12538 5580035 5567497 0.997753
HADEVELOPT 277817 5580035 5302218 0.950212
X_COORD_CD 5384167 5580035 195868 0.035102
Y_COORD_CD 5384167 5580035 195868 0.035102
Latitude 5384167 5580035 195868 0.035102
Longitude 5384167 5580035 195868 0.035102
Lat_Lon 5384167 5580035 195868 0.035102

I’m actually not a big fan of the column names. I’m going to clean them up a bit based on the data dictionary.

In [11]:
oldColumns = df.schema.names
newColumns = [
    'COMPLAINT_NUMBER',
    'COMPLAINT_START_DATE',
    'COMPLAINT_START_TIME',
    'COMPLAINT_END_DATE',
    'COMPLAINT_END_TIME',
    'REPORTED_DATE',
    'OFFENSE_ID',
    'OFFENSE_DESCRIPTION',
    'OFFENSE_INTERNAL_CODE',
    'OFFENSE_INTERNAL_DESCRIPTION',
    'OFFENSE_RESULT',
    'OFFENSE_LEVEL',
    'JURISDICTION',
    'BOROUGH',
    'PRECINCT',
    'SPECIFIC_LOCATION',
    'PREMISE_DESCRIPTION',
    'PARK_NAME',
    'HOUSING_NAME',
    'X_COORD_NYC',
    'Y_COORD_NYC',
    'LAT',
    'LON',
    'LAT_LON'
]

df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), df)
df.printSchema()
root
 |-- COMPLAINT_NUMBER: integer (nullable = true)
 |-- COMPLAINT_START_DATE: string (nullable = true)
 |-- COMPLAINT_START_TIME: string (nullable = true)
 |-- COMPLAINT_END_DATE: string (nullable = true)
 |-- COMPLAINT_END_TIME: string (nullable = true)
 |-- REPORTED_DATE: string (nullable = true)
 |-- OFFENSE_ID: integer (nullable = true)
 |-- OFFENSE_DESCRIPTION: string (nullable = true)
 |-- OFFENSE_INTERNAL_CODE: integer (nullable = true)
 |-- OFFENSE_INTERNAL_DESCRIPTION: string (nullable = true)
 |-- OFFENSE_RESULT: string (nullable = true)
 |-- OFFENSE_LEVEL: string (nullable = true)
 |-- JURISDICTION: string (nullable = true)
 |-- BOROUGH: string (nullable = true)
 |-- PRECINCT: integer (nullable = true)
 |-- SPECIFIC_LOCATION: string (nullable = true)
 |-- PREMISE_DESCRIPTION: string (nullable = true)
 |-- PARK_NAME: string (nullable = true)
 |-- HOUSING_NAME: string (nullable = true)
 |-- X_COORD_NYC: integer (nullable = true)
 |-- Y_COORD_NYC: integer (nullable = true)
 |-- LAT: double (nullable = true)
 |-- LON: double (nullable = true)
 |-- LAT_LON: string (nullable = true)

In [12]:
# Show nulls
show_null_count(df, df_num_rows)
Out[12]:
non_na_rows total_rows na_rows na_rows_ratio
COMPLAINT_NUMBER 5580035 5580035 0 0.000000
COMPLAINT_START_DATE 5579380 5580035 655 0.000117
COMPLAINT_START_TIME 5579987 5580035 48 0.000009
COMPLAINT_END_DATE 4107249 5580035 1472786 0.263938
COMPLAINT_END_TIME 4111153 5580035 1468882 0.263239
REPORTED_DATE 5580035 5580035 0 0.000000
OFFENSE_ID 5580035 5580035 0 0.000000
OFFENSE_DESCRIPTION 5561143 5580035 18892 0.003386
OFFENSE_INTERNAL_CODE 5575126 5580035 4909 0.000880
OFFENSE_INTERNAL_DESCRIPTION 5575126 5580035 4909 0.000880
OFFENSE_RESULT 5580028 5580035 7 0.000001
OFFENSE_LEVEL 5580035 5580035 0 0.000000
JURISDICTION 5580035 5580035 0 0.000000
BOROUGH 5579572 5580035 463 0.000083
PRECINCT 5579645 5580035 390 0.000070
SPECIFIC_LOCATION 4356643 5580035 1223392 0.219245
PREMISE_DESCRIPTION 5544837 5580035 35198 0.006308
PARK_NAME 12538 5580035 5567497 0.997753
HOUSING_NAME 277817 5580035 5302218 0.950212
X_COORD_NYC 5384167 5580035 195868 0.035102
Y_COORD_NYC 5384167 5580035 195868 0.035102
LAT 5384167 5580035 195868 0.035102
LON 5384167 5580035 195868 0.035102
LAT_LON 5384167 5580035 195868 0.035102

A few observations

  • Overall, the data quality looks great! Most fields I’m interested in don’t have more than 5% of their data missing
  • According to the data dictionary, the COMPLAINT_START/END_DATE/TIME fields are when the incident itself took place, and the COMPLAINT_END fields are only populated if they differ from the COMPLAINT_START fields
  • Fields like PARK_NAME and HOUSING_NAME are almost entirely missing, and can likely be excluded
  • About 3.5% of rows are missing coordinates (LAT / LON and friends); maybe I can throw these out for now, as I’d prefer not to deal with missing data

Let’s remove all rows in which the following columns are NA:

  • COMPLAINT_START_DATE
  • COMPLAINT_START_TIME
  • OFFENSE_DESCRIPTION
  • OFFENSE_RESULT
  • BOROUGH
  • PRECINCT
  • LAT / LON
In [43]:
# Drop rows with any NA values in the specified columns
df_na_drop = df.na.drop(subset=[
    'COMPLAINT_START_DATE',
    'COMPLAINT_START_TIME',
    'OFFENSE_DESCRIPTION',
    'OFFENSE_RESULT',
    'BOROUGH',
    'PRECINCT',
    'LAT',
    'LON'
])

Okay, let’s see how many rows remain.

In [44]:
# Count number of rows remaining
df_num_rows_na = df_na_drop.count()
In [45]:
print('{} out of {} rows remain after dropping NAs ({:.2f}%)'.format(df_num_rows_na, df_num_rows, df_num_rows_na / df_num_rows * 100))
5365257 out of 5580035 rows remain after dropping NAs (96.15%)

We ended up dropping less than 4% of our total rows. That’s a number I can live with!

In [46]:
# Show nulls
show_null_count(df_na_drop, df_num_rows_na)
Out[46]:
non_na_rows total_rows na_rows na_rows_ratio
COMPLAINT_NUMBER 5365257 5365257 0 0.000000
COMPLAINT_START_DATE 5365257 5365257 0 0.000000
COMPLAINT_START_TIME 5365257 5365257 0 0.000000
COMPLAINT_END_DATE 3954848 5365257 1410409 0.262878
COMPLAINT_END_TIME 3958596 5365257 1406661 0.262180
REPORTED_DATE 5365257 5365257 0 0.000000
OFFENSE_ID 5365257 5365257 0 0.000000
OFFENSE_DESCRIPTION 5365257 5365257 0 0.000000
OFFENSE_INTERNAL_CODE 5360350 5365257 4907 0.000915
OFFENSE_INTERNAL_DESCRIPTION 5360350 5365257 4907 0.000915
OFFENSE_RESULT 5365257 5365257 0 0.000000
OFFENSE_LEVEL 5365257 5365257 0 0.000000
JURISDICTION 5365257 5365257 0 0.000000
BOROUGH 5365257 5365257 0 0.000000
PRECINCT 5365257 5365257 0 0.000000
SPECIFIC_LOCATION 4230081 5365257 1135176 0.211579
PREMISE_DESCRIPTION 5332116 5365257 33141 0.006177
PARK_NAME 11789 5365257 5353468 0.997803
HOUSING_NAME 271212 5365257 5094045 0.949450
X_COORD_NYC 5365257 5365257 0 0.000000
Y_COORD_NYC 5365257 5365257 0 0.000000
LAT 5365257 5365257 0 0.000000
LON 5365257 5365257 0 0.000000
LAT_LON 5365257 5365257 0 0.000000

Formatting Dates & Times

The second thing that jumped out at me is that dates and times are still formatted as strings. It looks like the inferSchema parameter of Spark’s read.csv() wasn’t able to detect the dates and times.

Even before that, however, we mentioned earlier that the COMPLAINT_END fields are populated only if there exists an end date that’s different than the start date. This means that, if the COMPLAINT_END fields are not populated, they are in fact the same as the COMPLAINT_START dates. Let’s make that change now.
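The tool for this is Spark’s coalesce(), which returns the first non-null of its arguments. A plain-Python sketch of the same backfill logic:

```python
def coalesce(*values):
    """Return the first argument that is not None (mimics Spark's coalesce)."""
    return next((v for v in values if v is not None), None)

# End date missing: fall back to the start date
print(coalesce(None, "12/31/2015"))          # 12/31/2015
# End date present: keep it
print(coalesce("01/01/2016", "12/31/2015"))  # 01/01/2016
```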

In [47]:
df_na_drop = df_na_drop.withColumn('COMPLAINT_END_DATE', coalesce(df_na_drop['COMPLAINT_END_DATE'], df_na_drop['COMPLAINT_START_DATE']))
df_na_drop = df_na_drop.withColumn('COMPLAINT_END_TIME', coalesce(df_na_drop['COMPLAINT_END_TIME'], df_na_drop['COMPLAINT_START_TIME'])) 
In [48]:
# Show nulls
show_null_count(df_na_drop, df_num_rows_na)
Out[48]:
non_na_rows total_rows na_rows na_rows_ratio
COMPLAINT_NUMBER 5365257 5365257 0 0.000000
COMPLAINT_START_DATE 5365257 5365257 0 0.000000
COMPLAINT_START_TIME 5365257 5365257 0 0.000000
COMPLAINT_END_DATE 5365257 5365257 0 0.000000
COMPLAINT_END_TIME 5365257 5365257 0 0.000000
REPORTED_DATE 5365257 5365257 0 0.000000
OFFENSE_ID 5365257 5365257 0 0.000000
OFFENSE_DESCRIPTION 5365257 5365257 0 0.000000
OFFENSE_INTERNAL_CODE 5360350 5365257 4907 0.000915
OFFENSE_INTERNAL_DESCRIPTION 5360350 5365257 4907 0.000915
OFFENSE_RESULT 5365257 5365257 0 0.000000
OFFENSE_LEVEL 5365257 5365257 0 0.000000
JURISDICTION 5365257 5365257 0 0.000000
BOROUGH 5365257 5365257 0 0.000000
PRECINCT 5365257 5365257 0 0.000000
SPECIFIC_LOCATION 4230081 5365257 1135176 0.211579
PREMISE_DESCRIPTION 5332116 5365257 33141 0.006177
PARK_NAME 11789 5365257 5353468 0.997803
HOUSING_NAME 271212 5365257 5094045 0.949450
X_COORD_NYC 5365257 5365257 0 0.000000
Y_COORD_NYC 5365257 5365257 0 0.000000
LAT 5365257 5365257 0 0.000000
LON 5365257 5365257 0 0.000000
LAT_LON 5365257 5365257 0 0.000000

PySpark has some interesting notation… the withColumn() function seems to be used a lot and takes the place of a straight assignment operator =… Why? I’m not sure yet…
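If I had to guess at the why: Spark DataFrames are immutable, so every transformation returns a new DataFrame that has to be assigned back to a variable. Python strings behave the same way:

```python
# Spark DataFrames are immutable: withColumn() returns a NEW DataFrame rather
# than mutating in place -- much like Python string methods.
s = "nypd"
s.upper()      # returns a new string; s itself is unchanged
print(s)       # nypd

s = s.upper()  # reassign to keep the result, just like df = df.withColumn(...)
print(s)       # NYPD
```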

Alright, now we can try to format our dates. Let’s start by concatenating the COMPLAINT date and time fields, in hopes of combining each pair into a single field of TimestampType. REPORTED_DATE is also a date field, but contains no time component.
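One gotcha worth flagging: Spark’s to_timestamp() takes a Java SimpleDateFormat-style pattern ('MM/dd/yyyy HH:mm:ss'), not a Python strptime pattern. The equivalent parse on one row’s values in plain Python would be:

```python
from datetime import datetime

# Spark pattern 'MM/dd/yyyy HH:mm:ss'  <->  Python '%m/%d/%Y %H:%M:%S'
combined = "12/31/2015" + " " + "23:45:00"   # date + ' ' + time, as in the concat()
ts = datetime.strptime(combined, "%m/%d/%Y %H:%M:%S")
print(ts.isoformat(sep=" "))  # 2015-12-31 23:45:00
```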

In [58]:
# Combine date and time fields and create new timestamp field for COMPLAINT fields
df_na_drop = df_na_drop.withColumn(
    'COMPLAINT_START_TIMESTAMP', 
    to_timestamp(
        concat(df_na_drop['COMPLAINT_START_DATE'], lit(' '), df_na_drop['COMPLAINT_START_TIME']),
        'MM/dd/yyyy HH:mm:ss'
    )
)

df_na_drop = df_na_drop.withColumn(
    'COMPLAINT_END_TIMESTAMP', 
    to_timestamp(
        concat(df_na_drop['COMPLAINT_END_DATE'], lit(' '), df_na_drop['COMPLAINT_END_TIME']),
        'MM/dd/yyyy HH:mm:ss'
    )
)

# Convert REPORTED_DATE
df_na_drop = df_na_drop.withColumn(
    'REPORTED_DATE_TIMESTAMP', 
    to_timestamp(
        df_na_drop['REPORTED_DATE'],
        'MM/dd/yyyy'
    )
)
In [59]:
df_na_drop.printSchema()
root
 |-- COMPLAINT_NUMBER: integer (nullable = true)
 |-- COMPLAINT_START_DATE: string (nullable = true)
 |-- COMPLAINT_START_TIME: string (nullable = true)
 |-- COMPLAINT_END_DATE: string (nullable = true)
 |-- COMPLAINT_END_TIME: string (nullable = true)
 |-- REPORTED_DATE: string (nullable = true)
 |-- OFFENSE_ID: integer (nullable = true)
 |-- OFFENSE_DESCRIPTION: string (nullable = true)
 |-- OFFENSE_INTERNAL_CODE: integer (nullable = true)
 |-- OFFENSE_INTERNAL_DESCRIPTION: string (nullable = true)
 |-- OFFENSE_RESULT: string (nullable = true)
 |-- OFFENSE_LEVEL: string (nullable = true)
 |-- JURISDICTION: string (nullable = true)
 |-- BOROUGH: string (nullable = true)
 |-- PRECINCT: integer (nullable = true)
 |-- SPECIFIC_LOCATION: string (nullable = true)
 |-- PREMISE_DESCRIPTION: string (nullable = true)
 |-- PARK_NAME: string (nullable = true)
 |-- HOUSING_NAME: string (nullable = true)
 |-- X_COORD_NYC: integer (nullable = true)
 |-- Y_COORD_NYC: integer (nullable = true)
 |-- LAT: double (nullable = true)
 |-- LON: double (nullable = true)
 |-- LAT_LON: string (nullable = true)
 |-- COMPLAINT_START_TIMESTAMP: timestamp (nullable = true)
 |-- COMPLAINT_END_TIMESTAMP: timestamp (nullable = true)
 |-- REPORTED_DATE_TIMESTAMP: timestamp (nullable = true)

In [60]:
# View timestamp columns
df_na_drop.createOrReplaceTempView("df_na_drop")
%read_sql SELECT COMPLAINT_START_DATE, COMPLAINT_START_TIME, COMPLAINT_END_DATE, COMPLAINT_END_TIME, REPORTED_DATE, COMPLAINT_START_TIMESTAMP, COMPLAINT_END_TIMESTAMP, REPORTED_DATE_TIMESTAMP FROM df_na_drop LIMIT 10;
Query started at 10:43:59 PM UTC; Query executed in 0.19 m
Out[60]:
COMPLAINT_START_DATE COMPLAINT_START_TIME COMPLAINT_END_DATE COMPLAINT_END_TIME REPORTED_DATE COMPLAINT_START_TIMESTAMP COMPLAINT_END_TIMESTAMP REPORTED_DATE_TIMESTAMP
0 12/31/2015 23:45:00 12/31/2015 23:45:00 12/31/2015 2015-12-31 23:45:00 2015-12-31 23:45:00 2015-12-31
1 12/31/2015 23:36:00 12/31/2015 23:36:00 12/31/2015 2015-12-31 23:36:00 2015-12-31 23:36:00 2015-12-31
2 12/31/2015 23:30:00 12/31/2015 23:30:00 12/31/2015 2015-12-31 23:30:00 2015-12-31 23:30:00 2015-12-31
3 12/31/2015 23:30:00 12/31/2015 23:30:00 12/31/2015 2015-12-31 23:30:00 2015-12-31 23:30:00 2015-12-31
4 12/31/2015 23:25:00 12/31/2015 23:30:00 12/31/2015 2015-12-31 23:25:00 2015-12-31 23:30:00 2015-12-31
5 12/31/2015 23:18:00 12/31/2015 23:25:00 12/31/2015 2015-12-31 23:18:00 2015-12-31 23:25:00 2015-12-31
6 12/31/2015 23:15:00 12/31/2015 23:15:00 12/31/2015 2015-12-31 23:15:00 2015-12-31 23:15:00 2015-12-31
7 12/31/2015 23:15:00 12/31/2015 23:15:00 12/31/2015 2015-12-31 23:15:00 2015-12-31 23:15:00 2015-12-31
8 12/31/2015 23:15:00 12/31/2015 23:30:00 12/31/2015 2015-12-31 23:15:00 2015-12-31 23:30:00 2015-12-31
9 12/31/2015 23:10:00 12/31/2015 23:10:00 12/31/2015 2015-12-31 23:10:00 2015-12-31 23:10:00 2015-12-31

Looks good to me! Let’s take a look at some other fields as well.

Offenses

The next most interesting thing to me is probably the offenses themselves. What are the offenses? How standardized are they? How many are there? I’d imagine quite a few. Should I keep all of them? Cluster them? Lots of questions to be answered, obviously. Let’s see how many there are first.

In [64]:
%%read_sql
SELECT 
    COUNT(DISTINCT OFFENSE_ID), 
    COUNT(DISTINCT OFFENSE_DESCRIPTION),
    COUNT(DISTINCT OFFENSE_INTERNAL_CODE),
    COUNT(DISTINCT OFFENSE_INTERNAL_DESCRIPTION),
    COUNT(DISTINCT OFFENSE_RESULT),
    COUNT(DISTINCT OFFENSE_LEVEL)
FROM df_na_drop;
Query started at 01:28:04 AM UTC; Query executed in 0.89 m
Out[64]:
count(DISTINCT OFFENSE_ID) count(DISTINCT OFFENSE_DESCRIPTION) count(DISTINCT OFFENSE_INTERNAL_CODE) count(DISTINCT OFFENSE_INTERNAL_DESCRIPTION) count(DISTINCT OFFENSE_RESULT) count(DISTINCT OFFENSE_LEVEL)
0 73 69 395 385 2 3

Okay, so it looks like the offense descriptions may cause us some trouble. 69 distinct offense types actually sounds quite reasonable to me – I’d imagined there would be many more, but of course that’s where the internal descriptions come in. According to the data dictionary, the internal description is where the more detailed descriptions are kept. (Note the 73 distinct OFFENSE_IDs against 69 descriptions: a few descriptions, like DANGEROUS DRUGS, appear under more than one ID.)

Let’s just take a look at the 1st level offense descriptions… perhaps the top 20 or something.

In [69]:
%%read_sql
SELECT 
    OFFENSE_ID,
    OFFENSE_DESCRIPTION,
    COUNT(*) AS NUM_RECORDS,
    (COUNT(*) / (SELECT COUNT(*) FROM df_na_drop))*100 AS NUM_REC_PERC
FROM df_na_drop
GROUP BY
    OFFENSE_ID,
    OFFENSE_DESCRIPTION
ORDER BY
    COUNT(*) DESC
LIMIT 25;
Query started at 01:43:47 AM UTC; Query executed in 1.33 m
Out[69]:
OFFENSE_ID OFFENSE_DESCRIPTION NUM_RECORDS NUM_REC_PERC
0 341 PETIT LARCENY 882672 16.451626
1 578 HARRASSMENT 2 661299 12.325579
2 344 ASSAULT 3 & RELATED OFFENSES 565040 10.531462
3 351 CRIMINAL MISCHIEF & RELATED OF 462647 8.623017
4 109 GRAND LARCENY 460576 8.584416
5 361 OFF. AGNST PUB ORD SENSBLTY & 302367 5.635648
6 235 DANGEROUS DRUGS 297365 5.542419
7 105 ROBBERY 208459 3.885350
8 107 BURGLARY 201975 3.764498
9 106 FELONY ASSAULT 201152 3.749159
10 126 MISCELLANEOUS PENAL LAW 121186 2.258718
11 359 OFFENSES AGAINST PUBLIC ADMINI 107107 1.996307
12 110 GRAND LARCENY OF MOTOR VEHICLE 105329 1.963168
13 121 CRIMINAL MISCHIEF & RELATED OF 80621 1.502649
14 236 DANGEROUS WEAPONS 75759 1.412029
15 347 INTOXICATED & IMPAIRED DRIVING 73349 1.367111
16 352 CRIMINAL TRESPASS 68871 1.283648
17 117 DANGEROUS DRUGS 66199 1.233846
18 112 THEFT-FRAUD 60842 1.134000
19 348 VEHICLE AND TRAFFIC LAWS 58592 1.092063
20 118 DANGEROUS WEAPONS 54900 1.023250
21 113 FORGERY 54269 1.011489
22 340 FRAUDS 34728 0.647276
23 232 POSSESSION OF STOLEN PROPERTY 20916 0.389842
24 358 OFFENSES INVOLVING FRAUD 18699 0.348520

Alrighty. Petit Larceny. I’ll admit I’ve heard of Grand Larceny before, but I never knew what larceny actually meant… You learn something new every day, I guess:

Larceny – A crime involving the unlawful taking of the personal property of another person or business.

Hmm, that sounds… a lot like Theft? But wait there is Robbery as a category too… and Burglary. What’s the difference between them?

The basic distinction between robbery and larceny is that robbery involves the use of force, whereas larceny doesn’t

Ok fair enough. How about Burglary?

The crime of burglary, though most often equated with theft, doesn’t actually require that a theft occur, or even be intended. Burglary is simply the unlawful entry into a structure, such as a home or business, with the intent to commit a crime inside. Although many burglaries involve theft, the crime intended can be any crime from theft, to murder, to making pot brownies.

Wow, never had a clue about those distinctions. I’ll have to add this to my domain knowledge learnings lol. How about some other ones… oh wait, Petit Larceny and Grand Larceny… what’s the difference? In New York, $1,000! Anything at or below is Petit and anything larger is Grand.

How about Harassment 2 and Assault 3? It looks like the trailing number in this notation means the offense in the nth degree.

  • 2nd Degree Harassment: Intent to harass, annoy or alarm another person.
  • 3rd Degree Assault: Intentionally or recklessly causes injury to another person, or if he is criminally negligent with a weapon.

How about… Criminal Mischief? Looks like it’s when you damage or vandalize someone else’s property.

OFF. AGNST PUB ORD SENSBLTY &? I’m going to go ahead and assume it means Offenses Against Public Order, which are violations that interfere with the normal operations of society. These crimes go against publicly shared values, norms, or customs. One example given is public drunkenness – fair enough!

There’s also Felony Assault, which just seems like a more severe Assault.

I think those are the major ones. It’s important to point out that the top 3 (Petit Larceny, Harassment, and Assault) make up almost 40% of all complaints! Including Criminal Mischief, Grand Larceny, Public Order, and Dangerous Drugs, we’re almost up to 70%. No other category makes up more than 4% of the total. Let’s keep the top 10 categories here and lump the rest into ‘Other’.
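The lumping itself is just a conditional map: keep a value if it’s on the whitelist, otherwise replace it with ‘OTHER’. In plain Python (with a trimmed keep list just for illustration):

```python
crimes_to_keep = {"PETIT LARCENY", "GRAND LARCENY", "ROBBERY"}  # trimmed for illustration

def bin_offense(description):
    """Keep whitelisted offense descriptions; map everything else to 'OTHER'."""
    return description if description in crimes_to_keep else "OTHER"

print(bin_offense("PETIT LARCENY"))  # PETIT LARCENY
print(bin_offense("THEFT-FRAUD"))    # OTHER
```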

In [78]:
# List of crimes to keep
crimes_to_keep = [
    'PETIT LARCENY',
    'HARRASSMENT 2',
    'ASSAULT 3 & RELATED OFFENSES',
    'CRIMINAL MISCHIEF & RELATED OF',
    'GRAND LARCENY',
    'OFF. AGNST PUB ORD SENSBLTY &',
    'DANGEROUS DRUGS',
    'ROBBERY',
    'BURGLARY',
    'FELONY ASSAULT'
]

# Anything not in the list becomes 'OTHER'
df_na_drop = df_na_drop.withColumn(
    'OFFENSE_DESCRIPTION', 
    when(df_na_drop['OFFENSE_DESCRIPTION'].isin(crimes_to_keep), df_na_drop['OFFENSE_DESCRIPTION']).otherwise('OTHER')
)
In [80]:
# Add table to SQL Context
df_na_drop.createOrReplaceTempView("df_na_drop")
In [81]:
%%read_sql
SELECT 
    OFFENSE_DESCRIPTION,
    COUNT(*) AS NUM_RECORDS,
    (COUNT(*) / (SELECT COUNT(*) FROM df_na_drop))*100 AS NUM_REC_PERC
FROM df_na_drop
GROUP BY
    OFFENSE_DESCRIPTION
ORDER BY
    COUNT(*) DESC
LIMIT 25;
Query started at 03:02:16 AM UTC; Query executed in 1.05 m
Out[81]:
OFFENSE_DESCRIPTION NUM_RECORDS NUM_REC_PERC
0 OTHER 974885 18.170332
1 PETIT LARCENY 882672 16.451626
2 HARRASSMENT 2 661299 12.325579
3 ASSAULT 3 & RELATED OFFENSES 565040 10.531462
4 CRIMINAL MISCHIEF & RELATED OF 543268 10.125666
5 GRAND LARCENY 460576 8.584416
6 DANGEROUS DRUGS 363564 6.776264
7 OFF. AGNST PUB ORD SENSBLTY & 302367 5.635648
8 ROBBERY 208459 3.885350
9 BURGLARY 201975 3.764498
10 FELONY ASSAULT 201152 3.749159

Perfect – Other is now our largest category, but it lumps together the roughly 60 remaining offense types. The named categories make up around 80% of all complaints. I might just leave it at this without diving deeper into the internal offense codes for now. This is about as detailed as I was thinking anyways.

Let’s take a look at OFFENSE_RESULT and OFFENSE_LEVEL before we wrap up offenses, just to be aware of what’s there.

In [82]:
%%read_sql
SELECT DISTINCT OFFENSE_RESULT FROM df_na_drop
Query started at 03:09:39 AM UTC; Query executed in 0.81 m
Out[82]:
OFFENSE_RESULT
0 ATTEMPTED
1 COMPLETED
In [83]:
%%read_sql
SELECT DISTINCT OFFENSE_LEVEL FROM df_na_drop
Query started at 03:10:27 AM UTC; Query executed in 0.56 m
Out[83]:
OFFENSE_LEVEL
0 FELONY
1 MISDEMEANOR
2 VIOLATION

Ok… more education time… what’s the difference between these 3 haha.

  • VIOLATION – Punishable by up to 15 days in jail
  • MISDEMEANOR – Punishable by between 15 days and 1 year in jail
  • FELONY – Punishable by over 1 year in jail

Location (Environment)

The next area I want to clean is the location, and I’m talking specifically about the SPECIFIC_LOCATION and PREMISE_DESCRIPTION fields. Note that these are not the lat/lon coordinate fields, but text descriptions of the location and environment of the incident.

In [84]:
%%read_sql
SELECT 
    COUNT(DISTINCT SPECIFIC_LOCATION), 
    COUNT(DISTINCT PREMISE_DESCRIPTION)
FROM df_na_drop;
Query started at 04:58:33 AM UTC; Query executed in 0.79 m
Out[84]:
count(DISTINCT SPECIFIC_LOCATION) count(DISTINCT PREMISE_DESCRIPTION)
0 6 70

Let’s check out SPECIFIC_LOCATION first:

In [85]:
%%read_sql
SELECT DISTINCT SPECIFIC_LOCATION FROM df_na_drop;
Query started at 05:00:06 AM UTC; Query executed in 0.55 m
Out[85]:
SPECIFIC_LOCATION
0 OPPOSITE OF
1 REAR OF
2 None
3 INSIDE
4 OUTSIDE
5 FRONT OF
6

Not too many descriptors, which will be great for us. I’m honestly not even sure how useful this column will be at all, so I won’t spend too much time cleaning or standardizing. There are the None values and the blank values, which to me, are the same… but maybe there’s a bit of context that I’m missing. Maybe “None” indicates that the premise doesn’t have descriptors that apply to it, whereas blank encapsulates “other” or something like that… who knows. Let’s leave this for now and maybe take a look at it in tandem with PREMISE_DESCRIPTION before we make a call.

Now PREMISE_DESCRIPTION (let’s take the top 25 again here).

In [86]:
%%read_sql
SELECT 
    PREMISE_DESCRIPTION,
    COUNT(*) AS NUM_RECORDS,
    (COUNT(*) / (SELECT COUNT(*) FROM df_na_drop))*100 AS NUM_REC_PERC
FROM df_na_drop
GROUP BY
    PREMISE_DESCRIPTION
ORDER BY
    COUNT(*) DESC
LIMIT 25;
Query started at 05:02:11 AM UTC; Query executed in 1.34 m
Out[86]:
PREMISE_DESCRIPTION NUM_RECORDS NUM_REC_PERC
0 STREET 1781985 33.213414
1 RESIDENCE – APT. HOUSE 1117541 20.829217
2 RESIDENCE-HOUSE 531870 9.913225
3 RESIDENCE – PUBLIC HOUSING 408787 7.619150
4 COMMERCIAL BUILDING 140784 2.623994
5 OTHER 139490 2.599875
6 DEPARTMENT STORE 105062 1.958191
7 TRANSIT – NYC SUBWAY 104692 1.951295
8 CHAIN STORE 104357 1.945051
9 PUBLIC SCHOOL 68299 1.272987
10 GROCERY/BODEGA 62718 1.168965
11 RESTAURANT/DINER 61447 1.145276
12 BAR/NIGHT CLUB 61071 1.138268
13 PARK/PLAYGROUND 58227 1.085260
14 CLOTHING/BOUTIQUE 45604 0.849987
15 DRUG STORE 41792 0.778938
16 PUBLIC BUILDING 33844 0.630799
17 None 33141 0.617696
18 PARKING LOT/GARAGE (PRIVATE) 30384 0.566310
19 PARKING LOT/GARAGE (PUBLIC) 30073 0.560514
20 FOOD SUPERMARKET 29256 0.545286
21 STORE UNCLASSIFIED 24983 0.465644
22 FAST FOOD 24982 0.465625
23 HOSPITAL 24717 0.460686
24 SMALL MERCHANT 23299 0.434257

Let’s see the most popular combinations of the 2 fields:

In [87]:
%%read_sql
SELECT 
    PREMISE_DESCRIPTION,
    SPECIFIC_LOCATION,
    COUNT(*) AS NUM_RECORDS,
    (COUNT(*) / (SELECT COUNT(*) FROM df_na_drop))*100 AS NUM_REC_PERC
FROM df_na_drop
GROUP BY
    PREMISE_DESCRIPTION,
    SPECIFIC_LOCATION
ORDER BY
    COUNT(*) DESC
LIMIT 25;
Query started at 05:10:47 AM UTC; Query executed in 1.31 m
Out[87]:
PREMISE_DESCRIPTION SPECIFIC_LOCATION NUM_RECORDS NUM_REC_PERC
0 RESIDENCE – APT. HOUSE INSIDE 976115 18.193257
1 STREET None 848046 15.806251
2 STREET FRONT OF 758509 14.137422
3 RESIDENCE-HOUSE INSIDE 386413 7.202134
4 RESIDENCE – PUBLIC HOUSING INSIDE 275264 5.130490
5 RESIDENCE-HOUSE FRONT OF 115625 2.155069
6 COMMERCIAL BUILDING INSIDE 114823 2.140121
7 RESIDENCE – APT. HOUSE FRONT OF 113835 2.121706
8 STREET OPPOSITE OF 113023 2.106572
9 TRANSIT – NYC SUBWAY None 103736 1.933477
10 DEPARTMENT STORE INSIDE 100498 1.873126
11 CHAIN STORE INSIDE 96010 1.789476
12 RESIDENCE – PUBLIC HOUSING FRONT OF 91393 1.703423
13 OTHER INSIDE 73107 1.362600
14 PUBLIC SCHOOL INSIDE 56360 1.050462
15 RESTAURANT/DINER INSIDE 48490 0.903778
16 GROCERY/BODEGA INSIDE 48441 0.902864
17 BAR/NIGHT CLUB INSIDE 46934 0.874776
18 STREET INSIDE 44116 0.822253
19 CLOTHING/BOUTIQUE INSIDE 42311 0.788611
20 PARK/PLAYGROUND None 41294 0.769656
21 DRUG STORE INSIDE 38804 0.723246
22 OTHER FRONT OF 32770 0.610782
23 RESIDENCE – PUBLIC HOUSING REAR OF 25585 0.476864
24 FOOD SUPERMARKET INSIDE 25195 0.469595

It looks like the “None” value implies “On” the premise: on a street, on an NYC subway, or on the grounds of a park. I guess in some cases it can also mean “In”, but “Inside” is already explicitly stated as a value. In any event, I’ll leave None for now because it really does apply; no single word would be a better replacement. I’m still not sure what the blank means, though it only occurs in a very small portion of the data anyways.

Let’s just do what we did last time and take the top 10 PREMISE_DESCRIPTION.

In [88]:
# List of premises to keep
premises_to_keep = [
    'STREET',
    'RESIDENCE - APT. HOUSE',
    'RESIDENCE-HOUSE',
    'RESIDENCE - PUBLIC HOUSING',
    'COMMERCIAL BUILDING',
    'DEPARTMENT STORE',
    'TRANSIT - NYC SUBWAY',
    'CHAIN STORE',
    'PUBLIC SCHOOL',
    'GROCERY/BODEGA',
    'RESTAURANT/DINER',
    'BAR/NIGHT CLUB',
    'PARK/PLAYGROUND'
]

# Anything not in the list becomes 'OTHER'
df_na_drop = df_na_drop.withColumn(
    'PREMISE_DESCRIPTION', 
    when(df_na_drop['PREMISE_DESCRIPTION'].isin(premises_to_keep), df_na_drop['PREMISE_DESCRIPTION']).otherwise('OTHER')
)
In [91]:
# Add table to SQL Context
df_na_drop.createOrReplaceTempView("df_na_drop")
In [92]:
%%read_sql
SELECT 
    PREMISE_DESCRIPTION,
    COUNT(*) AS NUM_RECORDS,
    (COUNT(*) / (SELECT COUNT(*) FROM df_na_drop))*100 AS NUM_REC_PERC
FROM df_na_drop
GROUP BY
    PREMISE_DESCRIPTION
ORDER BY
    COUNT(*) DESC
LIMIT 25;
Query started at 05:25:17 AM UTC; Query executed in 1.05 m
Out[92]:
PREMISE_DESCRIPTION NUM_RECORDS NUM_REC_PERC
0 STREET 1781985 33.213414
1 RESIDENCE – APT. HOUSE 1117541 20.829217
2 OTHER 758417 14.135707
3 RESIDENCE-HOUSE 531870 9.913225
4 RESIDENCE – PUBLIC HOUSING 408787 7.619150
5 COMMERCIAL BUILDING 140784 2.623994
6 DEPARTMENT STORE 105062 1.958191
7 TRANSIT – NYC SUBWAY 104692 1.951295
8 CHAIN STORE 104357 1.945051
9 PUBLIC SCHOOL 68299 1.272987
10 GROCERY/BODEGA 62718 1.168965
11 RESTAURANT/DINER 61447 1.145276
12 BAR/NIGHT CLUB 61071 1.138268
13 PARK/PLAYGROUND 58227 1.085260

I lied, I kept the top 13 PREMISE_DESCRIPTION values because I thought restaurants, bars, and parks would be interesting to explore later on… sly… I know…

“OTHER” falls in third this time, with more crime happening in streets and apartments. Awesome. Well, I mean… not awesome I guess, but… interesting. Yes, let’s go with that!

With that, I could argue that I’m actually done here! Let’s build some features in the next post.

 
