NYPD Crime #19 – Clustering To Explore Neighbourhoods (Part IV – Continued Because Spark Hates Me)

Review

I ran into a major dead end in the last post. The problem? Data pre-processing… You don’t often think that data processing would be the activity that prevents you from moving forward eh? As a novice data scientist, you’re so infatuated with the high level objective, the meat of the analysis, the sexy chart or graph or insight that you’re trying to find… Luckily I’ve had enough experience cleaning data in my previous jobs to know how important of a step it really is, and this is a great example of that. This dead end is literally making me pivot and change directions…

The last post, however, wasn’t a complete wash despite the tragic ending. It allowed me to go deeper into Spark’s memory management. I took a very close look at the memory usage of each node (1 master + 2 workers) as my notebook ran, and I was able to understand a bit better the intricacies of the master memory usage vs worker memory usage. The fact that I was even able to get to the one hot encoding step was a product of this memory monitoring. I’m sure this will help us going forward with my new method as well.

This project just seems to be sprawling in different directions and losing scope with every notebook, but I guess this is what data science looks like haha.

Let’s get our dataframe back to the way it was before one hot encoding, and try another method.

In [2]:
import os
os.system("sudo pip install findspark sql_magic pyspark_dist_explore seaborn")
Out[2]:
0
In [3]:
# Use findspark package to connect Jupyter to Spark shell
import findspark
findspark.init('/usr/lib/spark')

# Load SparkSession object
import pyspark
from pyspark.sql import SparkSession

# Initiate SparkSession as "spark"
spark = SparkSession\
    .builder\
    .getOrCreate()

# Load sql_magic and connect to Spark
%load_ext sql_magic
%config SQL.conn_name = 'spark'

# Load other libraries
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import DateType
from functools import reduce
import pandas as pd
import numpy as np
import seaborn as sns

# Graphing with matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

Load Data

In [4]:
%%time
# Read NYPD Complaint Data
df_filtered = spark.read.parquet("s3n://2017edmfasatb/nypd_complaints/data/df_filtered.parquet")
df_filtered.cache()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 7.97 s

K-Means

In [5]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Set seed for ability to reproduce results, 20 clusters
kmeans = KMeans(k = 20, seed = 1)

# Initiate and transform columns into vector
vecAssembler = VectorAssembler(inputCols = ['LAT', 'LON'], outputCol = "features")
k_means_input = vecAssembler.transform(df_filtered)
In [6]:
%%time
# Refit model
model = kmeans.fit(k_means_input[['features']])
CPU times: user 24 ms, sys: 4 ms, total: 28 ms
Wall time: 2min 24s
In [7]:
%%time
# Use model to assign the samples a cluster to belong to
prediction = model.transform(k_means_input[['features']])
print(prediction.head(5))
[Row(features=DenseVector([40.8288, -73.9167]), prediction=0), Row(features=DenseVector([40.6973, -73.7846]), prediction=10), Row(features=DenseVector([40.8026, -73.9451]), prediction=19), Row(features=DenseVector([40.6545, -73.7263]), prediction=10), Row(features=DenseVector([40.738, -73.9879]), prediction=6)]
CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 401 ms

Join K-Means Clusters Into Original Dataframe

In [8]:
# Since there are no common column between these two dataframes add row_index so that it can be joined
df_filtered_indexed = df_filtered.withColumn('row_index', F.monotonically_increasing_id())
df_filtered.unpersist()

prediction_indexed = prediction.withColumn('row_index', F.monotonically_increasing_id())
prediction.unpersist()
Out[8]:
DataFrame[features: vector, prediction: int]
In [9]:
# Perform join on our generated ID row_index
df_predicted = df_filtered_indexed.join(prediction_indexed, on = ['row_index'], how = 'left').drop('row_index')
df_filtered_indexed.unpersist()
prediction_indexed.unpersist()
Out[9]:
DataFrame[features: vector, prediction: int, row_index: bigint]

Viewing Clusters

In [10]:
# Add table to SQL Context
df_predicted.createOrReplaceTempView("df_predicted")
In [11]:
cluster_stats_result = %read_sql \
SELECT \
    prediction, \
    COUNT(*) AS NUM_SAMPLES, \
    AVG(LAT) AS LAT_CENTRE, \
    AVG(LON) AS LON_CENTRE \
FROM df_predicted \
GROUP BY \
    prediction
Query started at 04:13:36 AM UTC; Query executed in 0.38 m
In [12]:
# See cluster centres
fig, ax = plt.subplots()
sns.regplot(
    x = "LON_CENTRE", 
    y = "LAT_CENTRE", 
    data = cluster_stats_result, 
    fit_reg = False, 
    scatter_kws = {'s': cluster_stats_result['NUM_SAMPLES'] / cluster_stats_result['NUM_SAMPLES'].max() * 150},
    ax = ax
)
cluster_stats_result[['LON_CENTRE','LAT_CENTRE','prediction']].apply(lambda x: ax.text(*x), axis=1);
nypd_crime_19_1

Alternate Cluster Summary Processing

Alright. Let’s review our task again. Remember, we’re trying to get a dataframe that looks like this:

In [13]:
pd.DataFrame({
    'COMPLAINT_NUMBER': [1, 2, 3, 4, 5],
    'CLUSTER': [1, 1, 2, 2, 2],
    'OFFENSE_LEVEL': ['VIOLATION', 'VIOLATION', 'MISDEMEANOR', 'FELONY', 'FELONY']
})
Out[13]:
CLUSTER COMPLAINT_NUMBER OFFENSE_LEVEL
0 1 1 VIOLATION
1 1 2 VIOLATION
2 2 3 MISDEMEANOR
3 2 4 FELONY
4 2 5 FELONY

Into a dataframe like this:

In [14]:
pd.DataFrame({
    'COMPLAINT_NUMBER': [1, 2, 3, 4, 5],
    'CLUSTER': [1, 1, 2, 2, 2],
    'OFFENSE_LEVEL_VIOLATION': [1, 1, 0, 0, 0],
    'OFFENSE_LEVEL_MISDEMEANOR': [0, 0, 1, 0, 0],
    'OFFENSE_LEVEL_FELONY': [0, 0, 0, 1, 1]
})
Out[14]:
CLUSTER COMPLAINT_NUMBER OFFENSE_LEVEL_FELONY OFFENSE_LEVEL_MISDEMEANOR OFFENSE_LEVEL_VIOLATION
0 1 1 0 0 1
1 1 2 0 0 1
2 2 3 0 1 0
3 2 4 1 0 0
4 2 5 1 0 0

Would have loved to resort to a cleaner option, but the only thing I can think of right now is to hardcode the mappings between the values and the encoded columns within SQL itself. It’s going to be messy, but at least I’m confident it will work (well, I guess I shouldn’t speak so soon…).

In [15]:
%read_sql SELECT \
    OFFENSE_LEVEL, \
    CASE \
        WHEN (OFFENSE_LEVEL = 'MISDEMEANOR') THEN 1 \
        ELSE 0 \
    END AS OFFENSE_LEVEL_MISDEMEANOR, \
    CASE \
        WHEN (OFFENSE_LEVEL = 'VIOLATION') THEN 1 \
        ELSE 0 \
    END AS OFFENSE_LEVEL_VIOLATION, \
    CASE \
        WHEN (OFFENSE_LEVEL = 'FELONY') THEN 1 \
        ELSE 0 \
    END AS OFFENSE_LEVEL_FELONY \
FROM df_predicted \
LIMIT 10
Query started at 04:14:00 AM UTC; Query executed in 0.22 m
Out[15]:
OFFENSE_LEVEL OFFENSE_LEVEL_MISDEMEANOR OFFENSE_LEVEL_VIOLATION OFFENSE_LEVEL_FELONY
0 FELONY 0 0 1
1 FELONY 0 0 1
2 FELONY 0 0 1
3 MISDEMEANOR 1 0 0
4 MISDEMEANOR 1 0 0
5 FELONY 0 0 1
6 MISDEMEANOR 1 0 0
7 FELONY 0 0 1
8 MISDEMEANOR 1 0 0
9 MISDEMEANOR 1 0 0

That works. The code is going to look absolutely horrible when I add in OFFENSE_DESCRIPTION and PREMISE_DESCRIPTION as well. Hooray.

In [16]:
# Perform manual one hot encoding to dense columns through SQL
df_encoded = spark.sql(" \
    SELECT \
        prediction as CLUSTER, \
        LAT, \
        LON, \
        OFFENSE_DESCRIPTION, \
        CASE \
            WHEN (OFFENSE_DESCRIPTION = 'PETIT LARCENY') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_DESCRIPTION_PETIT_LARCENY, \
        CASE \
            WHEN (OFFENSE_DESCRIPTION = 'HARRASSMENT 2') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_DESCRIPTION_HARRASSMENT, \
        CASE \
            WHEN (OFFENSE_DESCRIPTION = 'ASSAULT 3 & RELATED OFFENSES') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_DESCRIPTION_ASSAULT_3, \
        CASE \
            WHEN (OFFENSE_DESCRIPTION = 'CRIMINAL MISCHIEF & RELATED OF') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_DESCRIPTION_MISCHIEF, \
        CASE \
            WHEN (OFFENSE_DESCRIPTION = 'GRAND LARCENY') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_DESCRIPTION_GRAND_LARCENY, \
        CASE \
            WHEN (OFFENSE_DESCRIPTION = 'DANGEROUS DRUGS') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_DESCRIPTION_DANGEROUS_DRUGS, \
        CASE \
            WHEN (OFFENSE_DESCRIPTION = 'FELONY ASSAULT') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_DESCRIPTION_FELONY_ASSAULT, \
        OFFENSE_LEVEL, \
        CASE \
            WHEN (OFFENSE_LEVEL = 'MISDEMEANOR') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_LEVEL_MISDEMEANOR, \
        CASE \
            WHEN (OFFENSE_LEVEL = 'VIOLATION') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_LEVEL_VIOLATION, \
        CASE \
            WHEN (OFFENSE_LEVEL = 'FELONY') THEN 1 \
            ELSE 0 \
        END AS OFFENSE_LEVEL_FELONY, \
        PREMISE_DESCRIPTION, \
        CASE \
            WHEN (PREMISE_DESCRIPTION = 'RESIDENCE - APT. HOUSE') THEN 1 \
            ELSE 0 \
        END AS PREMISE_DESCRIPTION_RESIDENCE_APARTMENT, \
        CASE \
            WHEN (PREMISE_DESCRIPTION = 'RESIDENCE - PUBLIC HOUSING') THEN 1 \
            ELSE 0 \
        END AS PREMISE_DESCRIPTION_RESIDENCE_PUBLIC_HOUSING, \
        CASE \
            WHEN (PREMISE_DESCRIPTION = 'TRANSIT - NYC SUBWAY') THEN 1 \
            ELSE 0 \
        END AS PREMISE_DESCRIPTION_SUBWAY, \
        CASE \
            WHEN (PREMISE_DESCRIPTION = 'STREET') THEN 1 \
            ELSE 0 \
        END AS PREMISE_DESCRIPTION_STREET, \
        CASE \
            WHEN (PREMISE_DESCRIPTION = 'COMMERCIAL BUILDING') THEN 1 \
            ELSE 0 \
        END AS PREMISE_DESCRIPTION_COMMERCIAL_BUILDING, \
        CASE \
            WHEN (PREMISE_DESCRIPTION = 'PUBLIC SCHOOL') THEN 1 \
            ELSE 0 \
        END AS PREMISE_DESCRIPTION_PUBLIC_SCHOOL, \
        CASE \
            WHEN (PREMISE_DESCRIPTION = 'RESIDENCE-HOUSE') THEN 1 \
            ELSE 0 \
        END AS PREMISE_DESCRIPTION_RESIDENCE_HOUSE, \
        CASE \
            WHEN (PREMISE_DESCRIPTION = 'RESTAURANT/DINER') THEN 1 \
            ELSE 0 \
        END AS PREMISE_DESCRIPTION_RESTAURANT_DINER, \
        CASE \
            WHEN (PREMISE_DESCRIPTION = 'BAR/NIGHT CLUB') THEN 1 \
            ELSE 0 \
        END AS PREMISE_DESCRIPTION_BAR_NIGHT_CLUB \
    FROM df_predicted \
")
In [17]:
# Add table to SQL Context
df_encoded.createOrReplaceTempView("df_encoded")

Now we can actually summarize all the metrics per cluster by adding up all the instances of each columns. E.g., we’d be looking for the amount of Dangerous Drugs, Grand Larcenies, and Harassment per cluster. It might be useful to then take some summary statistics after the fact, but let’s do the grouping first.

In [18]:
%read_sql DESCRIBE df_encoded
Query started at 04:14:14 AM UTC; Query executed in 0.00 m
Out[18]:
col_name data_type comment
0 CLUSTER int None
1 LAT double None
2 LON double None
3 OFFENSE_DESCRIPTION string None
4 OFFENSE_DESCRIPTION_PETIT_LARCENY int None
5 OFFENSE_DESCRIPTION_HARRASSMENT int None
6 OFFENSE_DESCRIPTION_ASSAULT_3 int None
7 OFFENSE_DESCRIPTION_MISCHIEF int None
8 OFFENSE_DESCRIPTION_GRAND_LARCENY int None
9 OFFENSE_DESCRIPTION_DANGEROUS_DRUGS int None
10 OFFENSE_DESCRIPTION_FELONY_ASSAULT int None
11 OFFENSE_LEVEL string None
12 OFFENSE_LEVEL_MISDEMEANOR int None
13 OFFENSE_LEVEL_VIOLATION int None
14 OFFENSE_LEVEL_FELONY int None
15 PREMISE_DESCRIPTION string None
16 PREMISE_DESCRIPTION_RESIDENCE_APARTMENT int None
17 PREMISE_DESCRIPTION_RESIDENCE_PUBLIC_HOUSING int None
18 PREMISE_DESCRIPTION_SUBWAY int None
19 PREMISE_DESCRIPTION_STREET int None
20 PREMISE_DESCRIPTION_COMMERCIAL_BUILDING int None
21 PREMISE_DESCRIPTION_PUBLIC_SCHOOL int None
22 PREMISE_DESCRIPTION_RESIDENCE_HOUSE int None
23 PREMISE_DESCRIPTION_RESTAURANT_DINER int None
24 PREMISE_DESCRIPTION_BAR_NIGHT_CLUB int None
In [46]:
cluster_summary = %read_sql \
SELECT \
    CLUSTER, \
    COUNT(*) AS NUM_CRIMES, \
    AVG(LAT) AS LAT_CENTRE, \
    AVG(LON) AS LON_CENTRE, \
    SUM(OFFENSE_DESCRIPTION_PETIT_LARCENY) AS OFFENSE_DESCRIPTION_PETIT_LARCENY, \
    SUM(OFFENSE_DESCRIPTION_HARRASSMENT) AS OFFENSE_DESCRIPTION_HARRASSMENT, \
    SUM(OFFENSE_DESCRIPTION_ASSAULT_3) AS OFFENSE_DESCRIPTION_ASSAULT_3, \
    SUM(OFFENSE_DESCRIPTION_MISCHIEF) AS OFFENSE_DESCRIPTION_MISCHIEF, \
    SUM(OFFENSE_DESCRIPTION_GRAND_LARCENY) AS OFFENSE_DESCRIPTION_GRAND_LARCENY, \
    SUM(OFFENSE_DESCRIPTION_DANGEROUS_DRUGS) AS OFFENSE_DESCRIPTION_DANGEROUS_DRUGS, \
    SUM(OFFENSE_DESCRIPTION_FELONY_ASSAULT) AS OFFENSE_DESCRIPTION_FELONY_ASSAULT, \
    SUM(OFFENSE_LEVEL_MISDEMEANOR) AS OFFENSE_LEVEL_MISDEMEANOR, \
    SUM(OFFENSE_LEVEL_VIOLATION) AS OFFENSE_LEVEL_VIOLATION, \
    SUM(OFFENSE_LEVEL_FELONY) AS OFFENSE_LEVEL_FELONY, \
    SUM(PREMISE_DESCRIPTION_RESIDENCE_APARTMENT) AS PREMISE_DESCRIPTION_RESIDENCE_APARTMENT, \
    SUM(PREMISE_DESCRIPTION_RESIDENCE_PUBLIC_HOUSING) AS PREMISE_DESCRIPTION_RESIDENCE_PUBLIC_HOUSING, \
    SUM(PREMISE_DESCRIPTION_SUBWAY) AS PREMISE_DESCRIPTION_SUBWAY, \
    SUM(PREMISE_DESCRIPTION_STREET) AS PREMISE_DESCRIPTION_STREET, \
    SUM(PREMISE_DESCRIPTION_COMMERCIAL_BUILDING) AS PREMISE_DESCRIPTION_COMMERCIAL_BUILDING, \
    SUM(PREMISE_DESCRIPTION_PUBLIC_SCHOOL) AS PREMISE_DESCRIPTION_PUBLIC_SCHOOL, \
    SUM(PREMISE_DESCRIPTION_RESIDENCE_HOUSE) AS PREMISE_DESCRIPTION_RESIDENCE_HOUSE, \
    SUM(PREMISE_DESCRIPTION_RESTAURANT_DINER) AS PREMISE_DESCRIPTION_RESTAURANT_DINER, \
    SUM(PREMISE_DESCRIPTION_BAR_NIGHT_CLUB) AS PREMISE_DESCRIPTION_BAR_NIGHT_CLUB \
FROM df_encoded \
GROUP BY \
    CLUSTER
Query started at 04:25:21 AM UTC; Query executed in 0.87 m
In [47]:
# View results
cluster_summary
Out[47]:
CLUSTER NUM_CRIMES LAT_CENTRE LON_CENTRE OFFENSE_DESCRIPTION_PETIT_LARCENY OFFENSE_DESCRIPTION_HARRASSMENT OFFENSE_DESCRIPTION_ASSAULT_3 OFFENSE_DESCRIPTION_MISCHIEF OFFENSE_DESCRIPTION_GRAND_LARCENY OFFENSE_DESCRIPTION_DANGEROUS_DRUGS OFFENSE_LEVEL_FELONY PREMISE_DESCRIPTION_RESIDENCE_APARTMENT PREMISE_DESCRIPTION_RESIDENCE_PUBLIC_HOUSING PREMISE_DESCRIPTION_SUBWAY PREMISE_DESCRIPTION_STREET PREMISE_DESCRIPTION_COMMERCIAL_BUILDING PREMISE_DESCRIPTION_PUBLIC_SCHOOL PREMISE_DESCRIPTION_RESIDENCE_HOUSE PREMISE_DESCRIPTION_RESTAURANT_DINER PREMISE_DESCRIPTION_BAR_NIGHT_CLUB
0 12 183862 40.592803 -73.969842 31425 23652 18246 21816 16606 8878 58228 35787 17031 2657 60503 2850 2941 24498 1355 974
1 1 179745 40.634055 -74.006811 31264 24012 17975 22601 14541 7050 54697 40004 24 2335 73211 4491 1924 21388 1835 1693
2 13 340058 40.873135 -73.879422 44995 41210 39286 35203 19735 34636 97845 94490 16764 4562 132004 3882 5044 24844 1848 1188
3 6 504025 40.751261 -73.986319 144494 42817 34024 27482 107264 19018 181763 64500 8081 23188 130883 40552 4486 2633 20270 23645
4 16 304220 40.643766 -73.945032 46132 38306 34849 30413 25158 15752 102234 80934 2146 2532 113464 4457 3330 42006 2064 1367
5 3 363679 40.667012 -73.897975 39278 45391 47810 36411 19544 35872 116011 76484 46986 6425 124636 4247 4338 45934 1359 533
6 5 87249 40.722177 -73.744405 11299 11890 8523 11673 7940 1032 33232 8882 100 1 28357 1365 1501 32501 560 223
7 19 477917 40.804788 -73.947876 70849 59316 48222 42459 34119 50061 129901 109286 89216 16472 146018 6453 5992 3712 4936 2631
8 15 174254 40.688913 -73.842615 26406 23293 18613 23029 13447 4603 58580 28546 2239 2636 66049 4139 1444 37158 1400 1638
9 9 126405 40.757722 -73.814995 25342 16991 11718 15496 13325 1287 42683 19342 4527 556 38413 4498 1846 23831 1601 1362
10 17 220482 40.745377 -73.875549 40665 23723 23799 22368 18560 6720 72087 44627 198 2336 78128 4182 1363 29610 2398 3200
11 4 68818 40.595491 -73.781780 7991 11796 8994 8814 3155 1699 21192 16834 7282 975 16429 562 1056 14342 326 294
12 8 269773 40.835591 -73.855600 38698 34945 29373 28215 15694 27267 70118 58679 28964 1798 91872 4428 3997 26994 1555 848
13 7 154088 40.757822 -73.926834 28413 20209 15786 18193 15162 3942 49051 33007 13032 2327 52902 6291 1725 7722 1954 2976
14 10 206507 40.689167 -73.783596 26819 26608 24693 22050 13538 4670 75605 23259 3541 2391 71147 3938 2678 59128 983 943
15 11 180555 40.617069 -74.109248 24793 33742 19662 25606 7707 9480 39424 24127 10371 1 50032 4515 2981 53038 1300 801
16 14 425112 40.691671 -73.937964 59782 51643 50083 46426 28036 32302 142103 110384 43600 9346 153055 6847 5823 25780 3073 3351
17 2 73904 40.554887 -74.180490 14764 14141 5091 11365 4414 2233 14983 2275 13 0 16252 3048 1034 32524 710 436
18 0 621772 40.835427 -73.912472 74073 74857 78301 57190 28722 79880 164775 188248 65028 10958 229429 7157 8815 12151 3330 2479
19 18 373752 40.705183 -73.989868 91674 40043 28965 34738 46994 16934 112696 47142 48062 13085 105155 21392 5680 6174 8380 10308

20 rows × 23 columns

Awesome! We now have all our cluster metrics in a pandas dataframe. Note that we are now working in PYTHON and not PYSPARK anymore. Our dataframe is 20 x 22, so we have more then enough memory on our master node to just work in master memory! The first thing I want to do is to convert each of my metrics columns into z-score.

In [48]:
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

# Define columns to take z-score of
zscore_cols = [
    'OFFENSE_DESCRIPTION_PETIT_LARCENY',
    'OFFENSE_DESCRIPTION_HARRASSMENT',
    'OFFENSE_DESCRIPTION_ASSAULT_3',
    'OFFENSE_DESCRIPTION_MISCHIEF',
    'OFFENSE_DESCRIPTION_GRAND_LARCENY',
    'OFFENSE_DESCRIPTION_DANGEROUS_DRUGS',
    'OFFENSE_DESCRIPTION_FELONY_ASSAULT',
    'OFFENSE_LEVEL_MISDEMEANOR',
    'OFFENSE_LEVEL_VIOLATION',
    'OFFENSE_LEVEL_FELONY',
    'PREMISE_DESCRIPTION_RESIDENCE_APARTMENT',
    'PREMISE_DESCRIPTION_RESIDENCE_PUBLIC_HOUSING',
    'PREMISE_DESCRIPTION_SUBWAY',
    'PREMISE_DESCRIPTION_STREET',
    'PREMISE_DESCRIPTION_COMMERCIAL_BUILDING',
    'PREMISE_DESCRIPTION_PUBLIC_SCHOOL',
    'PREMISE_DESCRIPTION_RESIDENCE_HOUSE',
    'PREMISE_DESCRIPTION_RESTAURANT_DINER',
    'PREMISE_DESCRIPTION_BAR_NIGHT_CLUB'
]

# Initialize scaler
scaler = StandardScaler()

# Replace initial columns with z-score columns
cluster_summary_standard = cluster_summary.drop(zscore_cols, axis = 1).merge(
    pd.DataFrame(scaler.fit_transform(cluster_summary[zscore_cols]), columns = zscore_cols), 
    left_index = True, 
    right_index = True
)

# View results
cluster_summary_standard.head()
Out[48]:
CLUSTER NUM_CRIMES LAT_CENTRE LON_CENTRE OFFENSE_DESCRIPTION_PETIT_LARCENY OFFENSE_DESCRIPTION_HARRASSMENT OFFENSE_DESCRIPTION_ASSAULT_3 OFFENSE_DESCRIPTION_MISCHIEF OFFENSE_DESCRIPTION_GRAND_LARCENY OFFENSE_DESCRIPTION_DANGEROUS_DRUGS OFFENSE_LEVEL_FELONY PREMISE_DESCRIPTION_RESIDENCE_APARTMENT PREMISE_DESCRIPTION_RESIDENCE_PUBLIC_HOUSING PREMISE_DESCRIPTION_SUBWAY PREMISE_DESCRIPTION_STREET PREMISE_DESCRIPTION_COMMERCIAL_BUILDING PREMISE_DESCRIPTION_PUBLIC_SCHOOL PREMISE_DESCRIPTION_RESIDENCE_HOUSE PREMISE_DESCRIPTION_RESTAURANT_DINER PREMISE_DESCRIPTION_BAR_NIGHT_CLUB
0 12 183862 40.592803 -73.969842 -0.402207 -0.572521 -0.570313 -0.437944 -0.277314 -0.470620 -0.511930 -0.447990 -0.134995 -0.420059 -0.544142 -0.472868 -0.224976 -0.115843 -0.395071 -0.399264
1 1 179745 40.634055 -74.006811 -0.407374 -0.550305 -0.585838 -0.372603 -0.371546 -0.563246 -0.588419 -0.351381 -0.824596 -0.472647 -0.300606 -0.284282 -0.723560 -0.315950 -0.283969 -0.260616
2 13 340058 40.873135 -73.879422 0.033286 0.511025 0.635092 0.676352 -0.134528 0.834557 0.346262 0.896860 -0.145821 -0.108941 0.826104 -0.354269 0.806020 -0.093581 -0.280960 -0.357998
3 6 504025 40.751261 -73.986319 3.226442 0.610197 0.333626 0.033678 3.859686 0.043182 2.164110 0.209808 -0.497900 2.933000 0.804621 3.859910 0.532460 -1.522704 3.983035 3.972491
4 16 304220 40.643766 -73.945032 0.069775 0.331812 0.380891 0.277646 0.112939 -0.122309 0.441337 0.586300 -0.738553 -0.440474 0.470803 -0.288189 -0.034268 1.010675 -0.230964 -0.323480

5 rows × 23 columns

PCA

Perfect! Perhaps the best way to explore the correlations amongst all these is to take the PCA and see if certain correlations can be found within the clusters.

In [49]:
from sklearn.decomposition import PCA

# Initialize PCA with 2 components
pca = PCA(n_components=2)

# Fit data
cluster_summary_pca = pca.fit_transform(cluster_summary[zscore_cols])

# Look at the variance explained by the components
print(pca.explained_variance_ratio_)
[ 0.89296386  0.06960998]

We see that 91% of the variance is explained through the first axis! This means there should be high correlations amongst the metrics across all the clusters (neighbourhoods), or in other words, the crimes in new york can be highly explained by the neighbourhood they occur in.

In [50]:
# Define function to view biplot
def myplot(score, coeff, line_labels, scatter_labels):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
#     plt.scatter(xs * scalex, ys * scaley)
    for i in range(len(score[:,0])):
        plt.text(xs[i-1] * scalex, ys[i-1] * scaley, scatter_labels[i], color = 'r', ha = 'center', va = 'center')
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, line_labels[i], color = 'g', ha = 'center', va = 'center')

plt.figure(figsize=(15,15))
plt.xlim(-1,1)
plt.ylim(-1,1)
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.grid()

# View biplot
myplot(cluster_summary_pca[:,0:2],np.transpose(pca.components_[0:2, :]), zscore_cols, cluster_summary['CLUSTER'].tolist())
nypd_crime_19_2

Let’s remind ourselves what our cluster centers look like as well.

In [51]:
# See cluster centres
fig, ax = plt.subplots()
sns.regplot(
    x = "LON_CENTRE", 
    y = "LAT_CENTRE", 
    data = cluster_stats_result, 
    fit_reg = False, 
    scatter_kws = {'s': cluster_stats_result['NUM_SAMPLES'] / cluster_stats_result['NUM_SAMPLES'].max() * 150},
    ax = ax
)
cluster_stats_result[['LON_CENTRE','LAT_CENTRE','prediction']].apply(lambda x: ax.text(*x), axis=1);
nypd_crime_19_3

The PCA biplot is unfortunately a bit messy… I’m not really getting too much insight out of it. If anything, I can take away that Larcenies don’t often happen in Apartments. I was hoping to get a little more than that. There are probably other takeaways, but the graph is pretty clustered as it is. Let’s take a look at each of the categories separately.

In [60]:
def pca_and_biplot(cluster_summary, zscore_cols):
    # PCA
    pca = PCA(n_components=2)
    cluster_summary_pca = pca.fit_transform(cluster_summary[zscore_cols])
    print(pca.explained_variance_ratio_)
    
    # Biplot
    plt.figure(figsize=(7, 7))
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
    
    myplot(cluster_summary_pca[:,0:2],np.transpose(pca.components_[0:2, :]), zscore_cols, cluster_summary['CLUSTER'].tolist())

Offense Description

In [61]:
# View offense description biplot
offense_description_cols = [field for field in cluster_summary.dtypes.index.tolist() if ('OFFENSE_DESCRIPTION' in field)]
pca_and_biplot(cluster_summary, offense_description_cols)
[ 0.73850206  0.24437015]
nypd_crime_19_4

This looks much nicer. I can actually see all the text. What a luxury! From this graph, we see a very clear pattern. There are two groups of crime that seem to happen in correlation (with some hot spot clusters):

  • Larcenies (12, 16)
  • Everything else (18)

Offense Level

In [62]:
# View offense description biplot
offense_level_cols = [field for field in cluster_summary.dtypes.index.tolist() if ('OFFENSE_LEVEL' in field)]
pca_and_biplot(cluster_summary, offense_level_cols)
[ 0.97972816  0.01902775]
nypd_crime_19_5

Here, we see that misdemeanors and felonies are prevalent:

  • Misdemeanor (18, 15)
  • Felony (16, 2)

Premise Description

In [63]:
# View offense description biplot
premise_description_cols = [field for field in cluster_summary.dtypes.index.tolist() if ('PREMISE_DESCRIPTION' in field)]
pca_and_biplot(cluster_summary, premise_description_cols)
[ 0.87723059  0.05664648]
nypd_crime_19_6

This one is pretty interesting, saying that there are three groups of locations that crimes usually happen in:

  • Public Housing (15, 12)
  • Houses (3, 11, but none that are too prevalent)
  • Apartments / Streets (18)

Summary

Some hot spot clusters that were identified were:

  • 2 (Staten Island, mild felonies)
  • 11 (Staten Island, houses)
  • 3 (Brooklyn / Queens Border, houses)
  • 12 (Brooklyn, public housing, larcenies)
  • 16 (Brooklyn, felonies, larcenies)
  • 15 (Queens, public housing, misdemeanors)
  • 18 (Downtown Manhattan, apartments / streets, a mix of every crime)

A few things that I’m noticing comparing these biplots with my datashader plots:

  1. The lack of identity of Harlem & the Bronx (clusters 8 & 13) – Or rather, the diversity of crimes that happen. When looking at Harlem & The Bronx before, we saw that there were many more Dangerous Drugs crimes than Grand Larcenies. While this is still true (as datashader wouldn’t lie to us), there is much more to the story. Handpicking 2 crimes painted an accurate, but misleading story at the same time. Harlem & The Bronx basically experiences all crimes!
  2. Oddly enough, downtown manhattan actually seems to lean more towards Dangerous Drugs rather than Grand Larceny on the biplots, which is the opposite of what our datashader plots tell us. The difference here is that our biplots include many more crimes than just pitting Grand Larceny and Dangerous Drugs together. In fact, we went over how there is a ton of Assault in downtown manhattan as well, but our biplots just happen to tell us that Assault and Dangerous Drugs trend in the same direction.

The discrepancies in our biplots vs datashader plots is a matter of context. When we use datashader, we, unfortunately, can only few 3-4 factors simultaneously before the graph becomes less easy to understand. We are easily and clearly able to see how the few factors pit against each other, but we are limited to only looking at the scope of these factors. When we use the biplot, we’re basically doing the opposite. We’re telling python to try to take a bunch of factors and summarize them in a way which describes variance. We are squeezing multiple dimensions into two, and along the way, we are sure to have a loss of data.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s