I’m extremely happy with how gradient boosting turned out in my last post. I think I’ve just had xgboost on the brain… since it’s worked so well in my last 2 projects. KNN should’ve been the first thing that came to my head when I thought about this haha. I don’t think I’ve actually done a KNN before, so this is exciting!
K-Nearest Neighbours
K-NN… one of the easiest algorithms lol. Take the K nearest points around you (with, let’s say, Euclidean distance), average those values, and voila – you have your answer!
In this classification example (remember, we're doing regression in our property assessment project, but classification is just easier to visualize), the boundary lines are drawn for k = 15 with uniform weighting (all points have equal weight). That means, for every point on the grid, if we took the 15 nearest neighbours and held a vote on which class the point should belong to, we'd get the decision boundaries above. Hopefully I can get something like that here, but as a heat map for regression!
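Just to make the mechanics concrete before touching the real data, here's a tiny from-scratch sketch of KNN regression (purely illustrative — the function name and toy arrays are made up, and the actual modelling below uses scikit-learn): compute Euclidean distances to every training point, grab the k nearest, and average their targets.
import numpy as np

def knn_predict(X_train, y_train, x_query, k = 3):
    """Predict by averaging the targets of the k nearest training points."""
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis = 1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Regression is just the mean of the neighbours' target values
    return y_train[nearest].mean()

# Toy data: five 2-D points with made-up target values
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([10.0, 12.0, 11.0, 100.0, 110.0])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k = 3))  # -> 11.0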
# Enable plots in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
# Seaborn makes our plots prettier
import seaborn
seaborn.set(style = 'ticks')
import numpy as np
import pandas as pd
import os
import gmaps
import warnings
warnings.filterwarnings('ignore')
# Load data set
edm_data = pd.read_csv('../data/Property_Assessment_Data.csv')
edm_data.dtypes
# Strip dollar signs and cast to int ('$' is a regex anchor, so pass regex = False)
edm_data['Assessed Value'] = edm_data['Assessed Value'].str.replace('$', '', regex = False).astype(int)
# Filter for only residential buildings (copy so we can add columns without warnings)
edm_data_res = edm_data[edm_data['Assessment Class'] == 'Residential'].copy()
# Import ML libraries
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
# Scale to mean 0 and standard deviation 1
lat_scaler = StandardScaler()
lng_scaler = StandardScaler()
# StandardScaler expects 2-D input, so pass single-column frames and flatten the result
edm_data_res['Latitude Scaled'] = lat_scaler.fit_transform(edm_data_res[['Latitude']]).ravel()
edm_data_res['Longitude Scaled'] = lng_scaler.fit_transform(edm_data_res[['Longitude']]).ravel()
print('Latitude has mean {} and standard deviation {} ({} / {} before)'.format(
    edm_data_res['Latitude Scaled'].mean(),
    edm_data_res['Latitude Scaled'].std(),
    edm_data_res['Latitude'].mean(),
    edm_data_res['Latitude'].std()
))
print('Longitude has mean {} and standard deviation {} ({} / {} before)'.format(
    edm_data_res['Longitude Scaled'].mean(),
    edm_data_res['Longitude Scaled'].std(),
    edm_data_res['Longitude'].mean(),
    edm_data_res['Longitude'].std()
))
# Define x and y
x = edm_data_res[['Latitude Scaled', 'Longitude Scaled']].values
y = edm_data_res['Assessed Value'].values
print('x has shape {}'.format(x.shape))
print('y has shape {}'.format(y.shape))
# Set up grid search CV object to tune number of neighbours
k = np.arange(1, 16, 1)
print('Testing knn for the following number of neighbours: {}'.format(k))
parameters = {'n_neighbors': k}
knn = KNeighborsRegressor()
clf = GridSearchCV(knn, parameters, cv = 5, verbose = 2)
# Fit model
clf.fit(x, y)
clf.best_params_
# Plot mean CV score for each k (grid_scores_ was replaced by cv_results_ in sklearn)
plt.plot(k, clf.cv_results_['mean_test_score'])
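To read exact numbers off that curve, we can also tabulate the search results (a small aside, assuming a scikit-learn version where GridSearchCV exposes cv_results_; the cv_summary name is just for this sketch):
# Tabulate mean and std of the CV score (R^2 by default for a regressor) per k
cv_summary = pd.DataFrame({
    'k': k,
    'mean_test_score': clf.cv_results_['mean_test_score'],
    'std_test_score': clf.cv_results_['std_test_score']
})
print(cv_summary.round(4))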
GridSearchCV says 7 nearest neighbours gives the best model, and the score curve basically flattens out after 7, so extra neighbours don't buy us much. Let's build our grid of sample points again and map the predictions with gmaps.
# Generate statistics per neighbourhood
edm_data_neighbour_grouped = edm_data_res.groupby(['Neighbourhood', 'Assessment Class']).agg({
    'Assessed Value': ['mean', 'size'],
    'Latitude': ['mean'],
    'Longitude': ['mean']
}).reset_index()
# Show all neighbourhoods with greater than 20 units
neighbourhoods = edm_data_neighbour_grouped[edm_data_neighbour_grouped[('Assessed Value', 'size')] > 20].sort_values([('Assessed Value', 'mean')], ascending = False)
# Flatten the two-level column index (order follows the agg spec above)
neighbourhoods.columns = neighbourhoods.columns.droplevel(-1)
neighbourhoods.columns = ['Neighbourhood', 'Assessment Class', 'Assessment Value Mean', 'Units', 'Latitude', 'Longitude']
neighbourhoods.tail()
# Define city boundaries
lng_min = -113.709582
lng_max = -113.297595
lat_min = 53.396169
lat_max = 53.672860
# Set padding if we want to expand map by certain amount
map_padding = 0
lng_min -= map_padding
lng_max += map_padding
lat_min -= map_padding
lat_max += map_padding
# Number of steps
num_steps = 50
# Calculate step sizes
lng_step_size = (lng_max - lng_min) / num_steps
lat_step_size = (lat_max - lat_min) / num_steps
print('Longitude step size: {}'.format(lng_step_size))
print('Latitude step size: {}'.format(lat_step_size))
# Import geopy (vincenty was removed in geopy 2.x; geodesic is its replacement)
from geopy.distance import geodesic
# Generate grid of lat / lng points
lat_lng_threshold = 1000  # keep only grid points within 1 km of a neighbourhood centre
# Compute each neighbourhood's (lat, lng) tuple once, outside the loop
neighbourhoods['Lat Lng Tuple'] = neighbourhoods[['Latitude', 'Longitude']].apply(tuple, axis = 1)
lng_pts = []
lat_pts = []
total_points = num_steps**2
i = 0
for lng in np.arange(lng_min, lng_max, lng_step_size):
    for lat in np.arange(lat_min, lat_max, lat_step_size):
        # Print progress only every 100 iterations
        if i % 100 == 0:
            print('Iteration {} / {}'.format(i, total_points))
        # Distance from this grid point to every neighbourhood centre
        neighbourhoods['Distance'] = neighbourhoods['Lat Lng Tuple'].apply(lambda pt: geodesic(pt, (lat, lng)).meters)
        if neighbourhoods['Distance'].min() <= lat_lng_threshold:
            lng_pts.append(lng)
            lat_pts.append(lat)
        i += 1
len(lng_pts)
# Set up input dataframe, along with the scaled variables for input to the model
# (the scalers need 2-D input, so reshape the point lists and flatten the output)
x_edm_grid_test = pd.DataFrame({
    'Latitude': lat_pts,
    'Latitude Scaled': lat_scaler.transform(np.array(lat_pts).reshape(-1, 1)).ravel(),
    'Longitude': lng_pts,
    'Longitude Scaled': lng_scaler.transform(np.array(lng_pts).reshape(-1, 1)).ravel()
})
print('Latitude has mean {} and standard deviation {} ({} / {} before)'.format(
    x_edm_grid_test['Latitude Scaled'].mean(),
    x_edm_grid_test['Latitude Scaled'].std(),
    x_edm_grid_test['Latitude'].mean(),
    x_edm_grid_test['Latitude'].std()
))
print('Longitude has mean {} and standard deviation {} ({} / {} before)'.format(
    x_edm_grid_test['Longitude Scaled'].mean(),
    x_edm_grid_test['Longitude Scaled'].std(),
    x_edm_grid_test['Longitude'].mean(),
    x_edm_grid_test['Longitude'].std()
))
# Make predictions
x_edm_grid_test['y_edm_grid_pred'] = clf.predict(x_edm_grid_test[['Latitude Scaled', 'Longitude Scaled']])
# Create a signed-log10 column to compress the huge value range
x_edm_grid_test['y_edm_grid_pred_log'] = x_edm_grid_test['y_edm_grid_pred'].apply(lambda v: np.sign(v) * np.log10(np.abs(v) + 1))
# Keep only predictions between 10^3 and 10^8 dollars, dropping heavy outliers
x_edm_grid_test_no_outliers = x_edm_grid_test[x_edm_grid_test['y_edm_grid_pred_log'].between(3, 8)].copy()
# Scale response from 0 to 1 to match the gmaps opacity parameter
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_edm_grid_test_no_outliers['y_edm_grid_pred_log_scaled'] = scaler.fit_transform(x_edm_grid_test_no_outliers[['y_edm_grid_pred_log']]).ravel()
# Configure gmaps (imported at the top) with our API key
gmaps.configure(api_key = os.environ['GOOGLE_API_KEY'])
# Plot gmaps heatmap of the predicted values
edm_grid_heatmap = gmaps.heatmap_layer(
    x_edm_grid_test_no_outliers[['Latitude', 'Longitude']],
    max_intensity = 10,
    weights = x_edm_grid_test_no_outliers['y_edm_grid_pred_log_scaled'].values * 7,
    opacity = 0.4
)
edm_grid_fig = gmaps.figure()
edm_grid_fig.add_layer(edm_grid_heatmap)
edm_grid_fig
Cool, the map is redder in general, but we see many of the same hot spots! Along the river, in the south, in the southwest, up north, and that little island thing by Sherwood Park. I'm not quite sure which one is better, but I think the map generated from xgboost is a bit clearer about where the hot spots are. Perhaps if we widened our KNN net, we'd get smoother hot spots as well! Let's try, say, k = 15.
# Generate and fit a k = 15 knn regressor
knn_15 = KNeighborsRegressor(n_neighbors = 15)
knn_15.fit(x, y)
# Make predictions
x_edm_grid_test['y_edm_grid_pred_knn_15'] = knn_15.predict(x_edm_grid_test[['Latitude Scaled', 'Longitude Scaled']])
# Create the signed-log10 column again
x_edm_grid_test['y_edm_grid_pred_knn_15_log'] = x_edm_grid_test['y_edm_grid_pred_knn_15'].apply(lambda v: np.sign(v) * np.log10(np.abs(v) + 1))
# Keep only predictions between 10^3 and 10^8 dollars, dropping heavy outliers
x_edm_grid_test_no_outliers = x_edm_grid_test[x_edm_grid_test['y_edm_grid_pred_knn_15_log'].between(3, 8)].copy()
# Re-use the scaler fitted on the k = 7 predictions so both maps share the same colour scale
x_edm_grid_test_no_outliers['y_edm_grid_pred_knn_15_log_scaled'] = scaler.transform(x_edm_grid_test_no_outliers[['y_edm_grid_pred_knn_15_log']]).ravel()
# Plot gmaps heatmap for k = 15
edm_grid_heatmap_knn_15 = gmaps.heatmap_layer(
    x_edm_grid_test_no_outliers[['Latitude', 'Longitude']],
    max_intensity = 10,
    weights = x_edm_grid_test_no_outliers['y_edm_grid_pred_knn_15_log_scaled'].values * 7,
    opacity = 0.4
)
edm_grid_fig_knn_15 = gmaps.figure()
edm_grid_fig_knn_15.add_layer(edm_grid_heatmap_knn_15)
edm_grid_fig_knn_15
The hot spots aren't as big or red… I think I like k = 7 better for visualization purposes. This makes sense: the more neighbours you use, the more the visualization converges toward one mid-range colour, because each prediction averages over more and more of the data.
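As a quick sanity check on that intuition (a standalone toy sketch, not the assessment data — all names here are made up): as k approaches the size of the training set, every prediction collapses to the global mean, so the spread of predicted values, and hence the colour variation on a map, shrinks.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(size = (200, 2))        # 200 random 2-D points
y_toy = rng.uniform(100, 1000, size = 200)  # random target values

for k_demo in [1, 7, 50, 200]:
    model = KNeighborsRegressor(n_neighbors = k_demo).fit(X_toy, y_toy)
    spread = model.predict(X_toy).std()
    print('k = {:>3}: std of predictions = {:.1f}'.format(k_demo, spread))
# The spread shrinks as k grows; at k = 200 every prediction is y_toy.mean() exactly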
For me, I think that wraps it up for K-NN! Short and sweet post for a very simple but effective method!