Analyzing and predicting Service Request Types in DC

The datasets used in this notebook are:

  1. City Service Requests in 2018
  2. Neighborhood Clusters

These datasets can be found at opendata.dc.gov

We start by importing the ArcGIS package to load the data using a service URL

import arcgis
from arcgis.features import FeatureLayer

Read in service requests for 2018

Link to Service Requests 2018 dataset

requests_url = 'https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/ServiceRequests/MapServer/9'

requests_layer = FeatureLayer(requests_url)
requests_layer
<FeatureLayer url:"https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/ServiceRequests/MapServer/9">
# Extract all the data and display number of rows
requests_features = requests_layer.query()
print('Total number of rows in the dataset: ')
print(len(requests_features.features))

This dataset is updated continuously, so the number of rows may differ each time the notebook is run.

# store as dataframe
requests = requests_features.sdf

# View first 5 rows
requests.head()

Read in Neighborhood Clusters dataset

Link to this dataset

neighborhood_url = 'https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Administrative_Other_Boundaries_WebMercator/MapServer/17'

neighborhood_layer = FeatureLayer(neighborhood_url)
neighborhood_layer
# Extract all the data and display number of rows
neighborhood_features = neighborhood_layer.query()
print('Total number of rows in the dataset: ')
print(len(neighborhood_features.features))
# store as dataframe
neighborhood = neighborhood_features.sdf

# View first 5 rows
neighborhood.head()

We now spatially join the two datasets

# Connect to the GIS
from arcgis.gis import GIS
gis = GIS('http://dcdev.maps.arcgis.com/', 'username', 'password')
# Perform a spatial join between the service requests layer and the neighborhood clusters layer
requests_with_neighborhood = arcgis.features.analysis.join_features(requests_url, neighborhood_url, spatial_relationship='Intersects', output_name='serviceRequests_Neighborhood_DC_1')
requests_with_neighborhood.share(everyone=True)
{'itemId': '21ecb944fdc3495ea23e8cee411d0e29', 'notSharedWith': []}
requests_with_neighborhood_url = str(requests_with_neighborhood.url)+'/0/'
layer = FeatureLayer(requests_with_neighborhood_url)
features = layer.query()
print('Total number of rows in the dataset: ')
print(len(features.features))
Total number of rows in the dataset: 
80957
merged = features.sdf
merged.head()
|   | ADDDATE | CITY | DETAILS | INSPECTIONDATE | INSPECTIONFLAG | INSPECTORNAME | Join_Count | LATITUDE | LONGITUDE | MARADDRESSREPOSITORYID | … | STATE | STATUS_CODE | STREETADDRESS | TYPE | WARD | WEB_URL | XCOORD | YCOORD | ZIPCODE | SHAPE |
| 0 | 1514882874000 | WASHINGTON | jg - Not Found – Close SR | NaN | N | None | 1 | 38.894965 | -76.935862 | 19821.0 | … | DC | CLOSED | 4531 EADS STREET NE | Original | 7 | http://planning.dc.gov/ | 405564.01 | 136348.29 | 20019.0 | {'x': 405564.0099999979, 'y': 136348.2899999991} |
| 1 | 1514883836000 | WASHINGTON | am - Collected – Close SR | NaN | N | None | 1 | 38.966837 | -77.077571 | 285265.0 | … | DC | CLOSED | 5727 WESTERN AVENUE NW | Original | 3 | http://planning.dc.gov/ | 393277.48 | 144327.59 | 20015.0 | {'x': 393277.4799999967, 'y': 144327.58999999985} |
| 2 | 1514877377000 | WASHINGTON | None | NaN | N | None | 1 | 38.909417 | -77.040607 | 238156.0 | … | DC | CLOSED | 1750 P STREET NW | Original | 2 | http://planning.dc.gov/ | 396478.07 | 137951.41 | 20036.0 | {'x': 396478.0700000003, 'y': 137951.41000000015} |
| 3 | 1514882441000 | WASHINGTON | Per T. Duckett 1-9-18. closed by A .Hedgeman 0... | 1.515497e+12 | Y | None | 1 | 38.927450 | -77.097581 | 224717.0 | … | DC | CLOSED | 2895 UNIVERSITY TERRACE NW | Original | 3 | http://planning.dc.gov/ | 391538.72 | 139956.98 | 20016.0 | {'x': 391538.7199999988, 'y': 139956.98000000045} |
| 4 | 1514882008000 | WASHINGTON | Collect on 1-6-18 by A.Hill | NaN | N | None | 1 | 38.862290 | -76.989317 | 67895.0 | … | DC | CLOSED | 1319 MAPLE VIEW PLACE SE | Original | 8 | http://planning.dc.gov/ | 400927.19 | 132719.19 | 20020.0 | {'x': 400927.1899999976, 'y': 132719.19000000134} |

5 rows × 34 columns

Construct model that predicts service type

The variables used to build the model are:

  1. City Quadrant
  2. Neighborhood cluster
  3. Ward (Geographical unit)
  4. Organization acronym
  5. Status Code

Data preprocessing

quads = ['NE', 'NW', 'SE', 'SW']
def generate_quadrant(x):
    '''Extract the quadrant from the street address suffix'''
    try:
        temp = x[-2:]
        if temp in quads:
            return temp
        else:
            return 'NaN'
    except TypeError:
        # Non-string values (e.g. None) have no quadrant suffix
        return 'NaN'
merged['QUADRANT'] = merged['STREETADDRESS'].apply(generate_quadrant)
merged['QUADRANT'].head()
0    NE
1    NW
2    NW
3    NW
4    SE
Name: QUADRANT, dtype: object
merged['QUADRANT'].unique()
array(['NE', 'NW', 'SE', 'NaN', 'SW'], dtype=object)
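To sanity-check the extraction, the function can be exercised on a few sample inputs (the function is repeated here so the snippet is self-contained; the addresses are made up):

```python
# Self-contained copy of the quadrant extractor, exercised on sample inputs.
quads = ['NE', 'NW', 'SE', 'SW']

def generate_quadrant(x):
    '''Extract the two-letter quadrant suffix from a street address.'''
    try:
        temp = x[-2:]
        return temp if temp in quads else 'NaN'
    except TypeError:
        # Non-string values (e.g. None) have no suffix to extract
        return 'NaN'

print(generate_quadrant('4531 EADS STREET NE'))  # NE
print(generate_quadrant('1600 MAIN AVENUE'))     # NaN (no quadrant suffix)
print(generate_quadrant(None))                   # NaN
```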
merged['CLUSTER'] = merged['NAME'].apply(lambda x: x[8:])
merged['CLUSTER'].head()
0    30
1    10
2     6
3    13
4    28
Name: CLUSTER, dtype: object
merged['CLUSTER'] = merged['CLUSTER'].astype(int)
merged['ORGANIZATIONACRONYM'].unique()
array(['DPW', 'DDOT', 'FEMS', 'DOH', 'OUC', 'DOEE', 'DMV', 'ORM',
       'DC-ICH', 'DDS'], dtype=object)
merged['STATUS_CODE'].unique()
array(['CLOSED', 'OPEN'], dtype=object)

Let's check the number of possible outcomes, i.e. the number of unique classes in the target variable, and also take a look at the values

len(merged['SERVICETYPECODEDESCRIPTION'].unique())
23
merged['SERVICETYPECODEDESCRIPTION'].unique()
array(['SNOW', 'PEMA- Parking Enforcement Management Administration',
       'Toa-Street & Bridge Maintenance', 'Urban Forrestry',
       'SWMA- Solid Waste Management Admistration',
       'Transportation Operations Administration', 'SIOD',
       'Department of Transportation', 'Driver Vehicle Services',
       'Toa-Trans Sys Mnt-Signs', 'Toa- Trans Sys Mnt',
       'DOH- Department Of Health', 'Tru-311',
       'Transportation Policy & Planning Administration',
       'FEMS-Smoke Alarms', 'FEMS-Special Events',
       'Department of Energy and Environment', 'Adjudication Services',
       '311- Call Center', 'HOMYDRPR- How Is My Driving Program',
       'DC Interagency Council on Homelessness', '311- Emergencies',
       'Department of Disability Services'], dtype=object)

Model building

# Import necessary packages
from sklearn.preprocessing import *
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Convert categorical (text) fields to numbers
# (each fit_transform call refits the encoder on the given column)
number = LabelEncoder()
merged['SERVICETYPE_NUMBER'] = number.fit_transform(merged['SERVICETYPECODEDESCRIPTION'].astype('str'))
merged['STATUS_CODE_NUMBER'] = number.fit_transform(merged['STATUS_CODE'].astype('str'))
# Extract desired fields
data = merged[['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'QUADRANT', 'CLUSTER', 'WARD', 'ORGANIZATIONACRONYM', 'STATUS_CODE', 'STATUS_CODE_NUMBER']]
data.reset_index(inplace=True)
data.head()
|   | index | SERVICETYPECODEDESCRIPTION | SERVICETYPE_NUMBER | QUADRANT | CLUSTER | WARD | ORGANIZATIONACRONYM | STATUS_CODE | STATUS_CODE_NUMBER |
| 0 | 0 | SWMA- Solid Waste Management Admistration | 15 | NE | 30 | 7 | DPW | CLOSED | 0 |
| 1 | 1 | SWMA- Solid Waste Management Admistration | 15 | NW | 10 | 3 | DPW | CLOSED | 0 |
| 2 | 2 | PEMA- Parking Enforcement Management Administr... | 12 | NW | 6 | 2 | DPW | CLOSED | 0 |
| 3 | 3 | SNOW | 14 | NW | 13 | 3 | DPW | CLOSED | 0 |
| 4 | 4 | SWMA- Solid Waste Management Admistration | 15 | SE | 28 | 8 | DPW | CLOSED | 0 |
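The LabelEncoder used above simply maps each distinct string to an integer code (assigned in sorted order), and inverse_transform recovers the labels. A minimal illustration with a few of the service type values:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(['SNOW', 'Tru-311', 'SNOW', '311- Call Center'])
print(list(codes))                         # [1, 2, 1, 0] -- codes follow sorted class order
print(list(enc.inverse_transform(codes)))  # original strings recovered
```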

Let's binarize (one-hot encode) the values in the fields QUADRANT (4 categories) and ORGANIZATIONACRONYM (10 categories)

Why not do the same for CLUSTER? Cluster numbers are assigned such that adjacent clusters get similar numbers, so keeping it as an ordinal value preserves geographic proximity.

import pandas as pd
data = pd.get_dummies(data=data, columns=['QUADRANT', 'ORGANIZATIONACRONYM'])
data.head()
|   | index | SERVICETYPECODEDESCRIPTION | SERVICETYPE_NUMBER | CLUSTER | WARD | STATUS_CODE | STATUS_CODE_NUMBER | QUADRANT_NE | QUADRANT_NW | QUADRANT_NaN | … | ORGANIZATIONACRONYM_DC-ICH | ORGANIZATIONACRONYM_DDOT | ORGANIZATIONACRONYM_DDS | ORGANIZATIONACRONYM_DMV | ORGANIZATIONACRONYM_DOEE | ORGANIZATIONACRONYM_DOH | ORGANIZATIONACRONYM_DPW | ORGANIZATIONACRONYM_FEMS | ORGANIZATIONACRONYM_ORM | ORGANIZATIONACRONYM_OUC |
| 0 | 0 | SWMA- Solid Waste Management Admistration | 15 | 30 | 7 | CLOSED | 0 | 1 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | SWMA- Solid Waste Management Admistration | 15 | 10 | 3 | CLOSED | 0 | 0 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2 | PEMA- Parking Enforcement Management Administr... | 12 | 6 | 2 | CLOSED | 0 | 0 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 3 | SNOW | 14 | 13 | 3 | CLOSED | 0 | 0 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 4 | SWMA- Solid Waste Management Admistration | 15 | 28 | 8 | CLOSED | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

5 rows × 22 columns

# Extract input dataframe
model_data = data.drop(['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'STATUS_CODE'], axis=1)
model_data.head()
|   | index | CLUSTER | WARD | STATUS_CODE_NUMBER | QUADRANT_NE | QUADRANT_NW | QUADRANT_NaN | QUADRANT_SE | QUADRANT_SW | ORGANIZATIONACRONYM_DC-ICH | ORGANIZATIONACRONYM_DDOT | ORGANIZATIONACRONYM_DDS | ORGANIZATIONACRONYM_DMV | ORGANIZATIONACRONYM_DOEE | ORGANIZATIONACRONYM_DOH | ORGANIZATIONACRONYM_DPW | ORGANIZATIONACRONYM_FEMS | ORGANIZATIONACRONYM_ORM | ORGANIZATIONACRONYM_OUC |
| 0 | 0 | 30 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 10 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2 | 6 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 3 | 13 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 4 | 28 | 8 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
def handle_ward(x):
    '''Replace values outside the valid DC ward range (1-8) with 0'''
    accept = range(1, 9)
    if x not in accept:
        return 0
    else:
        return x
model_data['WARD'] = model_data['WARD'].apply(handle_ward)
# Define independent and dependent variables
y = data['SERVICETYPE_NUMBER'].values
X = model_data.values
# Split data into training and test samples of 70%-30%
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .3, random_state=522, stratify=y)
# n_estimators = number of trees in the forest
# min_samples_leaf = minimum number of samples required to be at a leaf node for the tree
rf = RandomForestClassifier(n_estimators=2500, min_samples_leaf=5, random_state=522)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(y_pred)
[15 15 12 ... 15 21 15]
print('Accuracy: ', accuracy_score(y_test, y_pred))
Accuracy:  0.6824769433465085
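The stratify=y argument above keeps each class's share the same in the training and test samples, which matters with 23 imbalanced classes. A toy illustration of the effect (synthetic data, not the service requests):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)   # 10% minority class
X = np.arange(100).reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=522, stratify=y)
print(len(y_te), int(y_te.sum()))   # 30 test rows, exactly 3 from the minority class
```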

Alternate model, excluding the department codes

data = merged[['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'QUADRANT', 'CLUSTER', 'WARD', 'ORGANIZATIONACRONYM', 'STATUS_CODE', 'STATUS_CODE_NUMBER']]
data.reset_index(inplace=True)
data.head()
|   | index | SERVICETYPECODEDESCRIPTION | SERVICETYPE_NUMBER | QUADRANT | CLUSTER | WARD | ORGANIZATIONACRONYM | STATUS_CODE | STATUS_CODE_NUMBER |
| 0 | 0 | SWMA- Solid Waste Management Admistration | 15 | NE | 30 | 7 | DPW | CLOSED | 0 |
| 1 | 1 | SWMA- Solid Waste Management Admistration | 15 | NW | 10 | 3 | DPW | CLOSED | 0 |
| 2 | 2 | PEMA- Parking Enforcement Management Administr... | 12 | NW | 6 | 2 | DPW | CLOSED | 0 |
| 3 | 3 | SNOW | 14 | NW | 13 | 3 | DPW | CLOSED | 0 |
| 4 | 4 | SWMA- Solid Waste Management Admistration | 15 | SE | 28 | 8 | DPW | CLOSED | 0 |
data1 = pd.get_dummies(data=data,columns=['QUADRANT'])
data1.head()
|   | index | SERVICETYPECODEDESCRIPTION | SERVICETYPE_NUMBER | CLUSTER | WARD | ORGANIZATIONACRONYM | STATUS_CODE | STATUS_CODE_NUMBER | QUADRANT_NE | QUADRANT_NW | QUADRANT_NaN | QUADRANT_SE | QUADRANT_SW |
| 0 | 0 | SWMA- Solid Waste Management Admistration | 15 | 30 | 7 | DPW | CLOSED | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | SWMA- Solid Waste Management Admistration | 15 | 10 | 3 | DPW | CLOSED | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2 | PEMA- Parking Enforcement Management Administr... | 12 | 6 | 2 | DPW | CLOSED | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 3 | SNOW | 14 | 13 | 3 | DPW | CLOSED | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 4 | SWMA- Solid Waste Management Admistration | 15 | 28 | 8 | DPW | CLOSED | 0 | 0 | 0 | 0 | 1 | 0 |
model_data1 = data1.drop(['SERVICETYPECODEDESCRIPTION', 'SERVICETYPE_NUMBER', 'STATUS_CODE', 'ORGANIZATIONACRONYM'], axis=1)
model_data1.head()
|   | index | CLUSTER | WARD | STATUS_CODE_NUMBER | QUADRANT_NE | QUADRANT_NW | QUADRANT_NaN | QUADRANT_SE | QUADRANT_SW |
| 0 | 0 | 30 | 7 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 10 | 3 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2 | 6 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 3 | 13 | 3 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 4 | 28 | 8 | 0 | 0 | 0 | 0 | 1 | 0 |
model_data1['WARD'] = model_data1['WARD'].apply(handle_ward)
y = data['SERVICETYPE_NUMBER'].values
X = model_data1.values
# Split data into training and test samples of 70%-30%
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .3, random_state=522, stratify=y)
# n_estimators = number of trees in the forest
# min_samples_leaf = minimum number of samples required to be at a leaf node for the tree
rf = RandomForestClassifier(n_estimators=2500, min_samples_leaf=5, random_state=522)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(y_pred)
[15 15 12 ... 15 15 15]
print('Accuracy: ', accuracy_score(y_test, y_pred))
Accuracy:  0.4862895256916996

A drop in accuracy from roughly 68.2% to 48.6% shows how much predictive power the organization acronym carries, and demonstrates the importance of choosing informative predictors.
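One way to see which predictors carry the signal is to inspect rf.feature_importances_ on the fitted model. A synthetic sketch of the idea (hypothetical data, not the notebook's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(522)
informative = rng.integers(0, 3, 1000)   # fully determines the target
noise = rng.integers(0, 3, 1000)         # unrelated to the target
X = np.column_stack([informative, noise])
y = informative

rf = RandomForestClassifier(n_estimators=100, random_state=522).fit(X, y)
print(rf.feature_importances_)  # the informative feature dominates
```

Dropping the dominant feature from X here would hurt accuracy the same way dropping ORGANIZATIONACRONYM did above.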

How many requests does each neighborhood make?

# Count of service requests per cluster
cluster_count = merged.groupby('NAME').size().reset_index(name='counts')
cluster_count.head()
|   | NAME | counts |
| 0 | Cluster 1 | 1991 |
| 1 | Cluster 10 | 1707 |
| 2 | Cluster 11 | 2475 |
| 3 | Cluster 12 | 767 |
| 4 | Cluster 13 | 1625 |
# merge with original file
neighborhood = pd.merge(neighborhood, cluster_count, on='NAME')
neighborhood.head()
|   | NAME | NBH_NAMES | OBJECTID | Shape_Area | Shape_Length | TYPE | WEB_URL | SHAPE | counts |
| 0 | Cluster 39 | Congress Heights, Bellevue, Washington Highlands | 1 | 4.886463e+06 | 10711.668010 | Original | http://planning.dc.gov/ | {'rings': [[[-8570934.978117127, 4699521.51243... | 2360 |
| 1 | Cluster 38 | Douglas, Shipley Terrace | 2 | 2.367958e+06 | 8229.486324 | Original | http://planning.dc.gov/ | {'rings': [[[-8568786.426828014, 4700618.41227... | 733 |
| 2 | Cluster 36 | Woodland/Fort Stanton, Garfield Heights, Knox ... | 3 | 1.119573e+06 | 4746.344457 | Original | http://planning.dc.gov/ | {'rings': [[[-8568124.617559846, 4701733.64556... | 378 |
| 3 | Cluster 27 | Near Southeast, Navy Yard | 4 | 1.619167e+06 | 7286.968902 | Original | http://planning.dc.gov/ | {'rings': [[[-8570182.535376322, 4704085.08115... | 480 |
| 4 | Cluster 32 | River Terrace, Benning, Greenway, Dupont Park | 5 | 4.286254e+06 | 11251.012821 | Original | http://planning.dc.gov/ | {'rings': [[[-8564654.618529493, 4705921.46259... | 1285 |
temp = neighborhood.sort_values(['counts'], ascending=[False])
temp[['NAME', 'NBH_NAMES', 'counts']]
|    | NAME | NBH_NAMES | counts |
| 33 | Cluster 2 | Columbia Heights, Mt. Pleasant, Pleasant Plain... | 5291 |
| 30 | Cluster 25 | Union Station, Stanton Park, Kingman Park | 5277 |
| 20 | Cluster 18 | Brightwood Park, Crestwood, Petworth | 5137 |
| 13 | Cluster 6 | Dupont Circle, Connecticut Avenue/K Street | 4629 |
| 38 | Cluster 26 | Capitol Hill, Lincoln Park | 3876 |
| 5 | Cluster 8 | Downtown, Chinatown, Penn Quarters, Mount Vern... | 3763 |
| 32 | Cluster 21 | Edgewood, Bloomingdale, Truxton Circle, Eckington | 3554 |
| 21 | Cluster 11 | Friendship Heights, American University Park, ... | 2475 |
| 23 | Cluster 17 | Takoma, Brightwood, Manor Park | 2467 |
| 6 | Cluster 5 | West End, Foggy Bottom, GWU | 2445 |
| 29 | Cluster 34 | Twining, Fairlawn, Randle Highlands, Penn Bran... | 2429 |
| 14 | Cluster 3 | Howard University, Le Droit Park, Cardozo/Shaw | 2412 |
| 0 | Cluster 39 | Congress Heights, Bellevue, Washington Highlands | 2360 |
| 9 | Cluster 7 | Shaw, Logan Circle | 2270 |
| 12 | Cluster 4 | Georgetown, Burleith/Hillandale | 2243 |
| 16 | Cluster 33 | Capitol View, Marshall Heights, Benning Heights | 2145 |
| 11 | Cluster 23 | Ivy City, Arboretum, Trinidad, Carver Langston | 2110 |
| 34 | Cluster 22 | Brookland, Brentwood, Langdon | 2066 |
| 8 | Cluster 31 | Deanwood, Burrville, Grant Park, Lincoln Heigh... | 1999 |
| 31 | Cluster 1 | Kalorama Heights, Adams Morgan, Lanier Heights | 1991 |
| 15 | Cluster 9 | Southwest Employment Area, Southwest/Waterfron... | 1965 |
| 25 | Cluster 10 | Hawthorne, Barnaby Woods, Chevy Chase | 1707 |
| 22 | Cluster 19 | Lamont Riggs, Queens Chapel, Fort Totten, Plea... | 1628 |
| 17 | Cluster 13 | Spring Valley, Palisades, Wesley Heights, Foxh... | 1625 |
| 18 | Cluster 20 | North Michigan Park, Michigan Park, University... | 1427 |
| 37 | Cluster 15 | Cleveland Park, Woodley Park, Massachusetts Av... | 1322 |
| 4 | Cluster 32 | River Terrace, Benning, Greenway, Dupont Park | 1285 |
| 36 | Cluster 14 | Cathedral Heights, McLean Gardens, Glover Park | 1070 |
| 45 | Cluster 45 | National Mall, Potomac River | 952 |
| 35 | Cluster 24 | Woodridge, Fort Lincoln, Gateway | 941 |
| 7 | Cluster 30 | Mayfair, Hillbrook, Mahaning Heights | 778 |
| 19 | Cluster 12 | North Cleveland Park, Forest Hills, Van Ness | 767 |
| 1 | Cluster 38 | Douglas, Shipley Terrace | 733 |
| 24 | Cluster 16 | Colonial Village, Shepherd Park, North Portal ... | 715 |
| 26 | Cluster 28 | Historic Anacostia | 678 |
| 27 | Cluster 35 | Fairfax Village, Naylor Gardens, Hillcrest, Su... | 642 |
| 28 | Cluster 37 | Sheridan, Barry Farm, Buena Vista | 524 |
| 3 | Cluster 27 | Near Southeast, Navy Yard | 480 |
| 2 | Cluster 36 | Woodland/Fort Stanton, Garfield Heights, Knox ... | 378 |
| 10 | Cluster 29 | Eastland Gardens, Kenilworth | 191 |
| 43 | Cluster 44 | Joint Base Anacostia-Bolling | 77 |
| 40 | Cluster 41 | Rock Creek Park | 51 |
| 44 | Cluster 46 | Arboretum, Anacostia River | 41 |
| 42 | Cluster 43 | Saint Elizabeths | 26 |
| 39 | Cluster 40 | Walter Reed | 12 |
| 41 | Cluster 42 | Observatory Circle | 3 |
# Viewing the map
search_result = gis.content.search("Neighborhood_Service_Requests")
search_result[0]
Neighborhood_Service_Requests
Web Map by mmajumdar_dcdev
Last Modified: January 22, 2018
0 comments, 0 views

What kind of requests does each neighborhood mostly make?

import scipy.stats
merged.columns
Index(['ADDDATE', 'CITY', 'DETAILS', 'INSPECTIONDATE', 'INSPECTIONFLAG',
       'INSPECTORNAME', 'Join_Count', 'LATITUDE', 'LONGITUDE',
       'MARADDRESSREPOSITORYID', 'NAME', 'NBH_NAMES', 'OBJECTID',
       'ORGANIZATIONACRONYM', 'PRIORITY', 'RESOLUTIONDATE', 'SERVICECALLCOUNT',
       'SERVICECODE', 'SERVICECODEDESCRIPTION', 'SERVICEDUEDATE',
       'SERVICEORDERDATE', 'SERVICEORDERSTATUS', 'SERVICEREQUESTID',
       'SERVICETYPECODEDESCRIPTION', 'STATE', 'STATUS_CODE', 'STREETADDRESS',
       'TYPE', 'WARD', 'WEB_URL', 'XCOORD', 'YCOORD', 'ZIPCODE', 'SHAPE',
       'QUADRANT', 'CLUSTER', 'SERVICETYPE_NUMBER', 'STATUS_CODE_NUMBER'],
      dtype='object')
df = merged[['NAME', 'SERVICECODEDESCRIPTION']]
# Extract the most frequently occurring service request type, and its count
# (with older SciPy versions, stats.mode returns 1-element arrays, hence the [0][0] / [1][0] indexing)
df1 = df.groupby('NAME').agg(lambda x: scipy.stats.mode(x)[0][0])
df2 = df.groupby('NAME').agg(lambda x: scipy.stats.mode(x)[1][0])
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
df2 = df2.rename(columns={'SERVICECODEDESCRIPTION':'SERVICECODEDESCRIPTION_COUNT'})
# merge the two datasets
final_df = pd.merge(df1, df2, on='NAME')
final_df.head()
|   | NAME | SERVICECODEDESCRIPTION | SERVICECODEDESCRIPTION_COUNT |
| 0 | Cluster 1 | Parking Enforcement | 383 |
| 1 | Cluster 10 | Bulk Collection | 342 |
| 2 | Cluster 11 | Parking Meter Repair | 863 |
| 3 | Cluster 12 | Roadway Signs | 111 |
| 4 | Cluster 13 | Pothole | 254 |
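The same most-frequent-type table can be built without scipy, using value_counts inside a single groupby. A sketch on a tiny hypothetical frame (column names match the notebook's, the rows are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    'NAME': ['Cluster 1', 'Cluster 1', 'Cluster 1', 'Cluster 2', 'Cluster 2'],
    'SERVICECODEDESCRIPTION': ['Parking Enforcement', 'Parking Enforcement',
                               'Pothole', 'Bulk Collection', 'Bulk Collection'],
})

# value_counts() sorts descending, so idxmax/max give the mode and its count
top = (demo.groupby('NAME')['SERVICECODEDESCRIPTION']
           .agg(SERVICECODEDESCRIPTION=lambda s: s.value_counts().idxmax(),
                SERVICECODEDESCRIPTION_COUNT=lambda s: s.value_counts().max())
           .reset_index())
print(top)
```

This yields one row per cluster, matching the structure of final_df above.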
# merge it with neighborhood clusters
neighborhood_data = pd.merge(neighborhood, final_df, on='NAME')
# view the map
search_result = gis.content.search("Neighborhood_Service_DC")
search_result[0]
Neighborhood_Service_DC
Web Map by mmajumdar_dcdev
Last Modified: April 05, 2018
0 comments, 0 views
