Analyzing and predicting Service Request Types in DC¶
The flow adopted in this notebook is as follows:
- Read in the datasets using ArcGIS API for Python
- Merge datasets
- Construct a model that predicts service type
- How many requests does each neighborhood make?
- What kind of requests does each neighborhood mostly make?
- Next Steps
The datasets used in this notebook are:
- City Service Requests in 2018
- Neighborhood Clusters
Both datasets can be found at opendata.dc.gov.
We start by importing the ArcGIS API for Python so we can load each dataset from its service URL.
import arcgis
from arcgis.features import FeatureLayer
1. Read in the datasets using ArcGIS API for Python¶
1.1 Read in service requests for 2018¶
Link to Service Requests 2018 dataset
requests_url = 'https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/ServiceRequests/MapServer/9'
requests_layer = FeatureLayer(requests_url)
requests_layer
# Extract all the data and display number of rows
requests_features = requests_layer.query()
print('Total number of rows in the dataset: ')
print(len(requests_features.features))
This dataset is updated continuously, so the number of rows may differ each time the notebook runs.
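Since the layer changes over time, a lightweight way to check its current size is to ask the server for a count only, instead of pulling every feature (a small sketch; assumes the service honors the standard return_count_only query parameter):
# Ask the server for just the row count, without transferring any features
row_count = requests_layer.query(where='1=1', return_count_only=True)
print('Current row count:', row_count)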
# store as dataframe
requests = requests_features.df
# View first 5 rows
requests.head()
1.2 Read in Neighborhood Clusters dataset¶
Link to this dataset
neighborhood_url = 'https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Administrative_Other_Boundaries_WebMercator/MapServer/17'
neighborhood_layer = FeatureLayer(neighborhood_url)
neighborhood_layer
# Extract all the data and display number of rows
neighborhood_features = neighborhood_layer.query()
print('Total number of rows in the dataset: ')
print(len(neighborhood_features.features))
# store as dataframe
neighborhood = neighborhood_features.df
# View first 5 rows
neighborhood.head()
2. Merge datasets¶
We now merge the two datasets.
# Connect to the GIS
from arcgis.gis import GIS
gis = GIS('http://dcdev.maps.arcgis.com/', 'username', 'password')
# Spatially join each service request to the neighborhood cluster it falls within
requests_with_neighborhood = arcgis.features.analysis.join_features(
    requests_url, neighborhood_url,
    spatial_relationship='Intersects',
    output_name='serviceRequests_Neighborhood_DC_1')
requests_with_neighborhood.share(everyone=True)
# The join result is a new hosted feature service; layer 0 holds the joined features
requests_with_neighborhood_url = str(requests_with_neighborhood.url)+'/0/'
layer = FeatureLayer(requests_with_neighborhood_url)
features = layer.query()
print('Total number of rows in the dataset: ')
print(len(features.features))
merged = features.df
merged.head()
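As a quick sanity check (a sketch using only what we have already loaded), we can compare row counts before and after the join; any difference would be requests whose points fall outside every cluster boundary:
# Requests that intersect no cluster are dropped by the join,
# so a difference here counts the out-of-boundary points
print('Requests before join:', len(requests))
print('Requests after join: ', len(merged))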
3. Construct a model that predicts service type¶
The variables used to build the model are:
- City Quadrant
- Neighborhood cluster
- Ward (Geographical unit)
- Organization acronym
- Status Code
3.1 Data preprocessing¶
quads = ['NE', 'NW', 'SE', 'SW']

def generate_quadrant(x):
    '''Extract the city quadrant (NE/NW/SE/SW) from the end of a street address.'''
    try:
        temp = x[-2:]
        if temp in quads:
            return temp
        else:
            return 'NaN'
    except TypeError:  # missing or non-string addresses
        return 'NaN'
merged['QUADRANT'] = merged['STREETADDRESS'].apply(generate_quadrant)
merged['QUADRANT'].head()
merged['QUADRANT'].unique()
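It is also worth checking how many addresses failed to yield a quadrant (a quick inspection sketch):
# Count rows that fell back to the 'NaN' placeholder
merged['QUADRANT'].value_counts()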
# NAME values look like 'Cluster 39'; drop the 8-character 'Cluster ' prefix
merged['CLUSTER'] = merged['NAME'].apply(lambda x: x[8:])
merged['CLUSTER'].head()
merged['CLUSTER'] = merged['CLUSTER'].astype(int)
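The slice x[8:] assumes every NAME begins with the literal prefix 'Cluster '; a slightly more defensive alternative (a sketch, not what the notebook runs) extracts the trailing digits explicitly:
# Alternative: pull the trailing digits regardless of prefix length
merged['CLUSTER'] = merged['NAME'].str.extract(r'(\d+)$', expand=False).astype(int)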
merged['ORGANIZATIONACRONYM'].unique()
merged['STATUS_CODE'].unique()
Let's count the number of possible outcomes, i.e. the number of unique classes in the target variable, and also take a look at the values.
len(merged['SERVICETYPECODEDESCRIPTION'].unique())
merged['SERVICETYPECODEDESCRIPTION'].unique()
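The classes are unlikely to be evenly represented, which is why the train/test split below stratifies on the target; a quick look at the distribution (inspection sketch):
# Relative frequency of each service type, largest first
merged['SERVICETYPECODEDESCRIPTION'].value_counts(normalize=True).head(10)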
3.2 Model building¶
# Import necessary packages
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Convert categorical (text) fields to numbers
number = LabelEncoder()
merged['SERVICETYPE_NUMBER'] = number.fit_transform(merged['SERVICETYPECODEDESCRIPTION'].astype('str'))
merged['STATUS_CODE_NUMBER'] = number.fit_transform(merged['STATUS_CODE'].astype('str'))
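Note that the same LabelEncoder instance is refit for each column, so after this cell only the STATUS_CODE fit can be inverted. If you want to map numeric predictions back to service-type names later, keep a dedicated encoder per column (a small sketch; service_encoder is a hypothetical name):
# One encoder per column keeps inverse_transform usable later
service_encoder = LabelEncoder()
merged['SERVICETYPE_NUMBER'] = service_encoder.fit_transform(
    merged['SERVICETYPECODEDESCRIPTION'].astype('str'))
# e.g. service_encoder.inverse_transform(y_pred) recovers the names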
# Extract desired fields
data = merged[['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'QUADRANT', 'CLUSTER', 'WARD', 'ORGANIZATIONACRONYM', 'STATUS_CODE', 'STATUS_CODE_NUMBER']]
# drop=True so the old index does not leak in as a feature column
data.reset_index(drop=True, inplace=True)
data.head()
Let's binarize (one-hot encode) the values in the QUADRANT (4 values) and ORGANIZATIONACRONYM (8 values) fields. Wondering why we are not doing the same for CLUSTER? Adjacent clusters carry adjacent numbers, so keeping CLUSTER numeric preserves that spatial ordering.
import pandas as pd
data = pd.get_dummies(data=data, columns=['QUADRANT', 'ORGANIZATIONACRONYM'])
data.head()
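To confirm the encoding, we can list the indicator columns that get_dummies created (inspection sketch):
# Indicator columns created by get_dummies
[c for c in data.columns if c.startswith(('QUADRANT_', 'ORGANIZATIONACRONYM_'))]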
# Extract input dataframe
model_data = data.drop(['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'STATUS_CODE'], axis=1)
model_data.head()
def handle_ward(x):
    '''Map any ward value outside the valid range to 0.'''
    accept = range(1, 9)  # DC wards are numbered 1 through 8
    if x not in accept:
        return 0
    else:
        return x
model_data['WARD'] = model_data['WARD'].apply(handle_ward)
# Define independent and dependent variables
y = data['SERVICETYPE_NUMBER'].values
X = model_data.values
# Split data into training and test samples of 70%-30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=522, stratify=y)
# n_estimators = number of trees in the forest
# min_samples_leaf = minimum number of samples required to be at a leaf node for the tree
rf = RandomForestClassifier(n_estimators=2500, min_samples_leaf=5, random_state=522)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(y_pred)
print('Accuracy: ', accuracy_score(y_test, y_pred))
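Random forests also expose per-feature importances, which hint at where this accuracy comes from (an inspection sketch; feature_importances_ is a standard attribute of a fitted RandomForestClassifier):
# Pair each input column with its learned importance, highest first
importances = sorted(zip(model_data.columns, rf.feature_importances_),
                     key=lambda t: t[1], reverse=True)
for name, score in importances[:10]:
    print(name, round(score, 3))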
3.3 Alternate model, excluding the department codes¶
data = merged[['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'QUADRANT', 'CLUSTER', 'WARD', 'ORGANIZATIONACRONYM', 'STATUS_CODE', 'STATUS_CODE_NUMBER']]
# drop=True so the old index does not leak in as a feature column
data.reset_index(drop=True, inplace=True)
data.head()
data1 = pd.get_dummies(data=data, columns=['QUADRANT'])
data1.head()
model_data1 = data1.drop(['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'STATUS_CODE', 'ORGANIZATIONACRONYM'], axis=1)
model_data1.head()
model_data1['WARD'] = model_data1['WARD'].apply(handle_ward)
y = data['SERVICETYPE_NUMBER'].values
X = model_data1.values
# Split data into training and test samples of 70%-30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=522, stratify=y)
# n_estimators = number of trees in the forest
# min_samples_leaf = minimum number of samples required to be at a leaf node for the tree
rf = RandomForestClassifier(n_estimators=2500, min_samples_leaf=5, random_state=522)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(y_pred)
print('Accuracy: ', accuracy_score(y_test, y_pred))
The drop in accuracy from 68.39% to 48.78% shows how much predictive power the organization field carries and, more generally, demonstrates the importance of choosing informative predictors.
4. How many requests does each neighborhood make?¶
# Count of service requests per cluster
cluster_count = merged.groupby('NAME').size().reset_index(name='counts')
cluster_count.head()
# merge the counts back into the neighborhood dataframe
neighborhood = pd.merge(neighborhood, cluster_count, on='NAME')
neighborhood.head()
temp = neighborhood.sort_values(['counts'], ascending=[False])
temp[['NAME', 'NBH_NAMES', 'counts']]
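For a quick non-spatial view of the same numbers, the counts can be plotted straight from the dataframe (a sketch; assumes matplotlib is available in the environment):
# Horizontal bar chart of the ten busiest clusters
temp.head(10).plot.barh(x='NAME', y='counts', legend=False)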
# Viewing the map
search_result = gis.content.search("Neighborhood_Service_Requests")
search_result[0]
5. What kind of requests does each neighborhood mostly make?¶
import scipy.stats
merged.columns
df = merged[['NAME', 'SERVICECODEDESCRIPTION']]
# Extract the most frequently occurring service request type, and its count
df1 = df.groupby('NAME').agg(lambda x: scipy.stats.mode(x)[0][0])  # mode value
df2 = df.groupby('NAME').agg(lambda x: scipy.stats.mode(x)[1][0])  # mode count
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
df2 = df2.rename(columns={'SERVICECODEDESCRIPTION':'SERVICECODEDESCRIPTION_COUNT'})
# merge the two datasets
final_df = pd.merge(df1, df2, on='NAME')
final_df.head()
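The same result can be computed in a single pandas pass instead of two scipy.stats.mode aggregations (an equivalent sketch; value_counts sorts descending, so the first entry is the mode, and the column names here are illustrative):
# One groupby: most common request type and its count per cluster
grp = df.groupby('NAME')['SERVICECODEDESCRIPTION']
alt = pd.DataFrame({
    'TOP_REQUEST': grp.agg(lambda s: s.value_counts().index[0]),
    'TOP_REQUEST_COUNT': grp.agg(lambda s: s.value_counts().iloc[0]),
}).reset_index()
alt.head()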
# merge it with neighborhood clusters
neighborhood_data = pd.merge(neighborhood, final_df, on='NAME')
# view the map
search_result = gis.content.search("Neighborhood_Service_DC")
search_result[0]