Introduction
This notebook demonstrates the application of AutoML to tabular data and shows the improvement it can deliver over conventional workflows. In part 1 of this notebook series, implementing AutoML produced a considerable increase in model fit; in this notebook, the result is further enhanced using spatial feature engineering. The new features are estimated by identifying, and then extracting, the inherent spatial patterns present in the data.
The percentage of voter turnout by county in the 2016 US general election will be predicted using the demographic and socioeconomic characteristics of US counties.
Imports
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
from IPython.display import Image, HTML
from fastai.imports import *
from datetime import datetime as dt
import arcgis
from arcgis.gis import GIS
from arcgis.learn import prepare_tabulardata, AutoML, MLModel
import arcpy
Connecting to ArcGIS
gis = GIS("home")
Accessing & Visualizing datasets
The 2016 election data is downloaded from the portal as a zipped shapefile, which is then unzipped and processed as follows.
voter_zip = gis.content.get('650e7d6aa8fb4601a75d632a2c114425')
voter_zip
import os, zipfile
# download the zipped shapefile and extract it next to the downloaded archive
filepath_new = voter_zip.download(file_name=voter_zip.name)
with zipfile.ZipFile(filepath_new, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath_new).parent)
# the shapefile is expected in a folder named after the archive (without its extension)
output_path = os.path.splitext(filepath_new)[0]
output_path = os.path.join(output_path, "VotersTurnoutCountyEelction2016.shp")
The attribute table contains voter turnout data per county for the entire US, which is extracted here as a pandas dataframe. The voter_turn field in the dataframe contains voter turnout percentages for each county for the 2016 election. This will be used as the dependent variable and will be predicted using the various demographic and socioeconomic variables of each county.
# getting the attribute table from the shapefile which will be used for building the model
sdf_main = pd.DataFrame.spatial.from_featureclass(output_path)
sdf_main.head()
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | NNeighbors | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 01001 | Autauga | Alabama | 0.613738 | 38.6 | 25553 | 4.96 | ... | 44 | 0.21158 | 0.154568 | 0 | 0 | 249674.500799 | 2208597808.5 | 133735.292502 | 0 | {"rings": [[[-9619465, 3856529.0001000017], [-... |
1 | 1 | 1 | 2 | 01003 | Baldwin | Alabama | 0.627364 | 42.9 | 31429 | 4.64 | ... | 22 | 0.358894 | 0.057952 | 0 | 0 | 1642763.26146 | 5671095677.35 | 241925.196426 | 3 | {"rings": [[[-9746859, 3539643.0001000017], [-... |
2 | 2 | 1 | 3 | 01005 | Barbour | Alabama | 0.513816 | 40.2 | 16876 | 3.49 | ... | 62 | -0.868722 | -0.498354 | 1 | 1 | 320297.06515 | 3257816458.5 | 0.0 | 0 | {"rings": [[[-9468394, 3771591.0001000017], [-... |
3 | 3 | 1 | 4 | 01007 | Bibb | Alabama | 0.501364 | 39.3 | 19360 | 3.64 | ... | 43 | -1.003341 | 0.28644 | 0 | 0 | 227910.108916 | 2311954706.0 | 170214.485759 | 7 | {"rings": [[[-9692114, 3928124.0001000017], [-... |
4 | 4 | 1 | 5 | 01009 | Blount | Alabama | 0.603064 | 40.9 | 21785 | 3.86 | ... | 51 | 0.096177 | -0.336198 | 0 | 1 | 291875.255483 | 2456919058.5 | 21128.568784 | 7 | {"rings": [[[-9623907, 4063676.0001000017], [-... |
5 rows × 97 columns
sdf_main.shape
(3112, 97)
Here, the data is visualized by mapping the voter turnout field into five classes. It can be observed that there are belts running along the eastern and southern parts of the country that represent comparatively lower voter turnout of less than 55%.
In part 1, the AutoML process significantly improved the fit compared to the standalone random forest model, and the validation R-squared reached a new high. The visualization above also reveals a spatial pattern in the data. Next, this spatial pattern will be estimated and included as spatial features to further improve the model.
Estimating Spatial Autocorrelation
This spatial pattern, in which nearby counties tend to have similar values, is known as spatial autocorrelation. It is measured by the Moran's I index, which is estimated here using the ClustersOutliers tool available in ArcPy.
import shutil
# First, specify the ArcPy workspace in which the tool outputs will be saved
arcpy.env.workspace = os.path.join(os.path.dirname(output_path), "arcpy_test_env")
if os.path.exists(arcpy.env.workspace):
    shutil.rmtree(arcpy.env.workspace)
os.makedirs(arcpy.env.workspace)
Calculating Local Moran's I
The ClustersOutliers tool calculates the local Moran's I index for each county and identifies statistically significant hot spots, cold spots, and spatial outliers. As input, the tool takes the shapefile containing the data, the name of the field to be analyzed for clustering, and the name of the output shapefile. For each feature, it outputs a Moran's I value, a z-score, and a pseudo p-value, plus a code representing the cluster type for each statistically significant feature. The z-scores and pseudo p-values represent the statistical significance of the computed index values.
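To make the index concrete, the following is a minimal NumPy sketch of the textbook (Anselin) form of local Moran's I with row-standardized inverse-distance weights. The function name `local_morans_i`, the toy turnout values, and the toy centroids are assumptions for illustration only; the ArcPy tool works on polygon features and may normalize the index differently, so the numbers are not expected to match its output. The sketch also assumes that no two locations coincide.

```python
import numpy as np

def local_morans_i(values, coords):
    """Local Moran's I with row-standardized inverse-distance weights (illustrative)."""
    x = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)
    n = x.size

    # Pairwise Euclidean distances between feature locations
    diff = xy[:, None, :] - xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Inverse-distance weights; a feature is not its own neighbor
    w = np.zeros_like(dist)
    off_diag = ~np.eye(n, dtype=bool)
    w[off_diag] = 1.0 / dist[off_diag]
    w /= w.sum(axis=1, keepdims=True)  # row standardization

    z = x - x.mean()          # deviation of each feature from the mean
    m2 = (z ** 2).sum() / n   # variance used as the normalizer
    spatial_lag = w @ z       # weighted mean deviation of the neighbors
    return z * spatial_lag / m2

# Two tight groups: low values on the left, high values on the right (made-up numbers)
turnout = [0.45, 0.47, 0.46, 0.70, 0.72, 0.71]
centroids = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1)]
print(local_morans_i(turnout, centroids).round(3))
```

Every feature in this toy example sits among similar neighbors, so all six index values are positive; a feature surrounded by dissimilar neighbors would receive a negative value.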
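# run the Cluster and Outlier Analysis (Anselin Local Moran's I) tool on the voter turnout field
result = arcpy.stats.ClustersOutliers(output_path,                          # input feature class
                                      "voter_turn",                         # field to analyze
                                      "voters_turnout_ClusterOutlier.shp",  # output feature class
                                      'INVERSE_DISTANCE',                   # conceptualization of spatial relationships
                                      'EUCLIDEAN_DISTANCE',                 # distance method
                                      'ROW',                                # row standardization
                                      "#", "#",                             # default distance band, no weights matrix file
                                      "NO_FDR",                             # no FDR correction
                                      499)                                  # number of permutations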
# accessing the attribute table from the output shapefile
sdf_main_LMi = pd.DataFrame.spatial.from_featureclass(result[0])
sdf_main_LMi.head()
| | FID | SOURCE_ID | voter_turn | LMiIndex | LMiZScore | LMiPValue | COType | NNeighbors | ZTransform | SpatialLag | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0.613738 | 0.032693 | 0.95507 | 0.176 | | 44 | 0.21158 | 0.154568 | {"rings": [[[-9619465, 3856529.0001000017], [-... |
1 | 1 | 1 | 0.627364 | 0.020792 | 0.318927 | 0.376 | | 22 | 0.358894 | 0.057952 | {"rings": [[[-9746859, 3539643.0001000017], [-... |
2 | 2 | 2 | 0.513816 | 0.432791 | 3.572737 | 0.002 | LL | 62 | -0.868722 | -0.498354 | {"rings": [[[-9468394, 3771591.0001000017], [-... |
3 | 3 | 3 | 0.501364 | -0.287305 | -1.828555 | 0.042 | LH | 43 | -1.003341 | 0.28644 | {"rings": [[[-9692114, 3928124.0001000017], [-... |
4 | 4 | 4 | 0.603064 | -0.032324 | -2.317385 | 0.012 | HL | 51 | 0.096177 | -0.336198 | {"rings": [[[-9623907, 4063676.0001000017], [-... |
Here, the Moran's I value is stored in the LMiIndex field, with its z-score and pseudo p-value in the fields LMiZScore and LMiPValue, respectively, and the cluster type code in COType.
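The two-letter COType code can be read as a pair: the first letter describes the county itself and the second its neighborhood. The sketch below applies the standard Moran scatterplot quadrant rule to the ZTransform, SpatialLag, and LMiPValue values of the five rows shown above. The helper name `cluster_type` and the 0.05 cutoff are assumptions for illustration; the rule reproduces the codes in these sample rows, but the tool's exact significance handling may differ.

```python
def cluster_type(z_value, spatial_lag, p_value, alpha=0.05):
    """Return HH, LL, HL or LH for a significant feature, else an empty string."""
    if p_value > alpha:
        return ""                                # not statistically significant
    own = "H" if z_value > 0 else "L"            # county above or below the mean
    neighbors = "H" if spatial_lag > 0 else "L"  # neighborhood above or below the mean
    return own + neighbors

# (ZTransform, SpatialLag, LMiPValue) for the five counties in the table above
rows = [
    (0.21158, 0.154568, 0.176),
    (0.358894, 0.057952, 0.376),
    (-0.868722, -0.498354, 0.002),
    (-1.003341, 0.28644, 0.042),
    (0.096177, -0.336198, 0.012),
]
codes = [cluster_type(z, lag, p) for z, lag, p in rows]
print(codes)  # ['', '', 'LL', 'LH', 'HL']
```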
Visualizing the spatial autocorrelation
The COType field in the output feature class is HH for a statistically significant cluster of high values and LL for a statistically significant cluster of low values. It also indicates whether a feature has a high value and is surrounded by features with low values (HL), or has a low value and is surrounded by features with high values (LH). This is visualized in the map below:
# visualizing spatial autocorrelation in voters turnout
m2 = gis.map('United States')
m2.legend.enabled = True
m2

m2.content.add(sdf_main_LMi)
m2.zoom_to_layer(sdf_main_LMi)
Apply symbology to the feature layer
sm_manager = m2.content.renderer(0).smart_mapping()
sm_manager.unique_values_renderer(field="COType")
The black pixels in the map above show that there is spatial clustering of low voter turnout along the eastern coast, while the white pixels in the northeastern, central, and northwestern portions of the country indicate areas of spatial clustering of high voter turnout.
To include this information as spatial features, the counties with the most significant (lowest) p-values will be identified, and the distance and angle (direction) from each county to the nearest of those counties will be measured. These two variables, the distance and the angle, are included as the new spatial features in the model.
# checking the field names having the p values
sdf_main_LMi.columns
Index(['FID', 'SOURCE_ID', 'voter_turn', 'LMiIndex', 'LMiZScore', 'LMiPValue', 'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'SHAPE'], dtype='object')
Selecting highly significant spatial clustering counties
The most significant (lowest) p-value here is 0.002. All counties with this p-value are flagged in a new field, which is then used to generate a shapefile containing only these highly significant counties. A second field flags counties with p-values less than or equal to 0.05, representing all significantly clustered counties; it is intended as the third spatial feature in the final model.
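The floor of 0.002 follows from the 499 permutations requested in the tool call. A minimal sketch of the usual permutation pseudo p-value formula is shown below; the function name `pseudo_p_value` is hypothetical, and the formula is the standard one for permutation tests rather than something read from the tool's source.

```python
def pseudo_p_value(n_as_extreme, n_permutations):
    """Permutation pseudo p-value: (extreme count + 1) / (permutations + 1)."""
    return (n_as_extreme + 1) / (n_permutations + 1)

# With 499 permutations, the smallest attainable value is 1 / 500
print(pseudo_p_value(0, 499))
# A p-value of 0.042 corresponds to 20 permuted statistics at least as extreme
print(pseudo_p_value(20, 499))
```

This is also why every pseudo p-value in the output table is a multiple of 0.002.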
# creating new fields with highly significant clustering counties
sdf_main_LMi['LMi_hi_sig<.002'] = np.where(sdf_main_LMi['LMiPValue']<=.002, 1,0)
sdf_main_LMi['LMi_sig_<0.05'] = np.where(sdf_main_LMi['LMiPValue']<=.05, 1,0)
sdf_main_LMi.head()
| | FID | SOURCE_ID | voter_turn | LMiIndex | LMiZScore | LMiPValue | COType | NNeighbors | ZTransform | SpatialLag | SHAPE | LMi_hi_sig<.002 | LMi_sig_<0.05 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0.613738 | 0.032693 | 0.95507 | 0.176 | | 44 | 0.21158 | 0.154568 | {"rings": [[[-9619465, 3856529.0001000017], [-... | 0 | 0 |
1 | 1 | 1 | 0.627364 | 0.020792 | 0.318927 | 0.376 | | 22 | 0.358894 | 0.057952 | {"rings": [[[-9746859, 3539643.0001000017], [-... | 0 | 0 |
2 | 2 | 2 | 0.513816 | 0.432791 | 3.572737 | 0.002 | LL | 62 | -0.868722 | -0.498354 | {"rings": [[[-9468394, 3771591.0001000017], [-... | 1 | 1 |
3 | 3 | 3 | 0.501364 | -0.287305 | -1.828555 | 0.042 | LH | 43 | -1.003341 | 0.28644 | {"rings": [[[-9692114, 3928124.0001000017], [-... | 0 | 1 |
4 | 4 | 4 | 0.603064 | -0.032324 | -2.317385 | 0.012 | HL | 51 | 0.096177 | -0.336198 | {"rings": [[[-9623907, 4063676.0001000017], [-... | 0 | 1 |
# create new dataframe for LMi_hi_sig<.002
LMi_hi_sig_county_main = sdf_main_LMi[sdf_main_LMi['LMi_hi_sig<.002']==1].copy()
LMi_hi_sig_county_main.columns
Index(['FID', 'SOURCE_ID', 'voter_turn', 'LMiIndex', 'LMiZScore', 'LMiPValue', 'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'SHAPE', 'LMi_hi_sig<.002', 'LMi_sig_<0.05'], dtype='object')
# creating a new shapefile for the most significant clustering counties from spatial dataframe
near_dist_from_main_county = sdf_main_LMi.spatial.to_featureclass('voters_turnout_train_LMi'+str(dt.now().microsecond))
near_dist_to_hi_sig_county = LMi_hi_sig_county_main.spatial.to_featureclass('LMi_hi_sig_county_train'+str(dt.now().microsecond))
Estimating distances and angle of counties from highly clustered counties
The Near (Analysis) tool from ArcPy is used to calculate the distance and angle from every county to the nearest highly significant clustering county. As input, it takes the shapefile containing all of the counties, followed by the shapefile of high-significance counties to which the distance and angle are measured.
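The idea behind the two outputs can be sketched with plain points. The following simplified planar version finds, for one location, the closest target and the distance and direction to it. The function name `near` and the coordinates are made up for illustration; the notebook's call uses the GEODESIC method on polygon features, which measures along the ellipsoid and may use a different angle convention, so this is not a reimplementation of the tool.

```python
import math

def near(point, targets):
    """Return (index, distance, angle in degrees) of the closest target point."""
    best = None
    for idx, (tx, ty) in enumerate(targets):
        dx, dy = tx - point[0], ty - point[1]
        dist = math.hypot(dx, dy)
        if best is None or dist < best[1]:
            # Planar angle measured from the x-axis: 0 = east, 90 = north
            best = (idx, dist, math.degrees(math.atan2(dy, dx)))
    return best

# Made-up centroids of two highly significant counties
hi_sig_centroids = [(0.0, 0.0), (10.0, 10.0)]
print(near((3.0, 4.0), hi_sig_centroids))  # nearest is target 0, 5 units away
print(near((0.0, 0.0), hi_sig_centroids))  # a target itself: distance 0, angle 0
```

The second call mirrors what the output table shows for counties that are themselves highly significant: both NEAR_DIST and NEAR_ANGLE are 0.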
# Using the Near tool to calculate distance and angle
dist_to_nearest_hi_sig = arcpy.analysis.Near(near_dist_from_main_county,near_dist_to_hi_sig_county,'#','#','ANGLE','GEODESIC')
# Accessing the attribute table from the resulting shapefile
sdf_nearest_hi_sig = pd.DataFrame.spatial.from_featureclass(dist_to_nearest_hi_sig[0])
sdf_nearest_hi_sig.head()
| | FID | Id | source_id | voter_turn | l_mi_index | l_mi_z_sco | l_mi_p_val | co_type | n_neighbor | z_transfor | spatial_la | l_mi_hi_si | l_mi_sig_0 | NEAR_FID | NEAR_DIST | NEAR_ANGLE | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0.613738 | 0.032693 | 0 | 0 | | 0 | 0 | 0 | 0 | 0 | 7 | 67162.144028 | 89.80393 | {"rings": [[[-9619465, 3856529.0001000017], [-... |
1 | 1 | 0 | 1 | 0.627364 | 0.020792 | 0 | 0 | | 0 | 0 | 0 | 0 | 0 | 0 | 182315.087799 | 76.372022 | {"rings": [[[-9746859, 3539643.0001000017], [-... |
2 | 2 | 0 | 2 | 0.513816 | 0.432791 | 0 | 0 | LL | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | {"rings": [[[-9468394, 3771591.0001000017], [-... |
3 | 3 | 0 | 3 | 0.501364 | -0.287305 | 0 | 0 | LH | 0 | 0 | 0 | 0 | 0 | 7 | 111571.054693 | 97.612838 | {"rings": [[[-9692114, 3928124.0001000017], [-... |
4 | 4 | 0 | 4 | 0.603064 | -0.032324 | 0 | 0 | HL | 0 | 0 | 0 | 0 | 0 | 10 | 0.0 | 0.0 | {"rings": [[[-9623907, 4063676.0001000017], [-... |
sdf_nearest_hi_sig.columns
Index(['FID', 'Id', 'source_id', 'voter_turn', 'l_mi_index', 'l_mi_z_sco', 'l_mi_p_val', 'co_type', 'n_neighbor', 'z_transfor', 'spatial_la', 'l_mi_hi_si', 'l_mi_sig_0', 'NEAR_FID', 'NEAR_DIST', 'NEAR_ANGLE', 'SHAPE'], dtype='object')
LMi_hi_sig_county_main.columns
Index(['FID', 'SOURCE_ID', 'voter_turn', 'LMiIndex', 'LMiZScore', 'LMiPValue', 'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'SHAPE', 'LMi_hi_sig<.002', 'LMi_sig_<0.05'], dtype='object')
In the resulting dataframe above, the fields NEAR_DIST and NEAR_ANGLE (the third and second fields from the last) represent the distance and angle of each county from the nearest highly significant clustering county, while the field l_mi_sig_0 (the truncated name given to LMi_sig_<0.05 when it was written to the shapefile) flags all of the significant counties. All three are intended as spatial predictors in the final model.
sdf_main.head(2)
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | NNeighbors | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 01001 | Autauga | Alabama | 0.613738 | 38.6 | 25553 | 4.96 | ... | 44 | 0.21158 | 0.154568 | 0 | 0 | 249674.500799 | 2208597808.5 | 133735.292502 | 0 | {"rings": [[[-9619465, 3856529.0001000017], [-... |
1 | 1 | 1 | 2 | 01003 | Baldwin | Alabama | 0.627364 | 42.9 | 31429 | 4.64 | ... | 22 | 0.358894 | 0.057952 | 0 | 0 | 1642763.26146 | 5671095677.35 | 241925.196426 | 3 | {"rings": [[[-9746859, 3539643.0001000017], [-... |
2 rows × 97 columns
# drop the precomputed cluster statistics and geometry columns from the main table;
# they are replaced by the newly calculated values when the tables are merged
sdf_main_final = sdf_main.drop(['SOURCE_ID', 'voter_tu_1',
                                'Shape_Leng', 'Shape_Area', 'LMiIndex', 'LMiZScore', 'LMiPValue',
                                'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'LMi_hi_sig',
                                'LMi_normal', 'NEAR_FID', 'Shape_Le_1', 'Shape_Ar_1', 'LMiHiDist', 'SHAPE'], axis=1)
sdf_main_final.head(2)
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | City9Dist | City9Ang | City8Dist | City8Ang | City7Dist | City7Ang | City6Dist | City6Ang | City5Dist | City5Ang |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 01001 | Autauga | Alabama | 0.613738 | 38.6 | 25553 | 4.96 | ... | 383948.84777 | -0.847576 | 10748.108812 | 109.277531 | 76082.644216 | -6.321051 | 0.0 | 0.0 | 358644.11945 | -74.872116 |
1 | 1 | 1 | 2 | 01003 | Baldwin | Alabama | 0.627364 | 42.9 | 31429 | 4.64 | ... | 472377.29045 | -25.580055 | 4252.349631 | -85.916425 | 17821.080946 | 94.801172 | 0.0 | 0.0 | 356543.92578 | -44.723872 |
2 rows × 79 columns
Final dataset with spatial cluster variables
# joining the newly calculated spatial features with the main dataset
sdf_main_final_merged = sdf_main_final.merge(sdf_nearest_hi_sig, on='FID', how='inner')
# checking the final merged data
sdf_main_final_merged.head(2)
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn_x | gender_med | householdi | electronic | ... | co_type | n_neighbor | z_transfor | spatial_la | l_mi_hi_si | l_mi_sig_0 | NEAR_FID | NEAR_DIST | NEAR_ANGLE | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 01001 | Autauga | Alabama | 0.613738 | 38.6 | 25553 | 4.96 | ... | | 0 | 0 | 0 | 0 | 0 | 7 | 67162.144028 | 89.80393 | {"rings": [[[-9619465, 3856529.0001000017], [-... |
1 | 1 | 1 | 2 | 01003 | Baldwin | Alabama | 0.627364 | 42.9 | 31429 | 4.64 | ... | | 0 | 0 | 0 | 0 | 0 | 0 | 182315.087799 | 76.372022 | {"rings": [[[-9746859, 3539643.0001000017], [-... |
2 rows × 95 columns
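The target column appears as voter_turn_x in the merged table because both inputs carry a voter_turn column, and pandas disambiguates overlapping non-key columns with the default suffixes _x (left) and _y (right). A small sketch with made-up frames (`left`, `right`, and their values are illustrative only):

```python
import pandas as pd

# Both frames share the key 'FID' and a non-key column 'voter_turn'
left = pd.DataFrame({"FID": [0, 1, 2],
                     "voter_turn": [0.61, 0.63, 0.51],
                     "county": ["A", "B", "C"]})
right = pd.DataFrame({"FID": [0, 1, 2],
                      "voter_turn": [0.61, 0.63, 0.51],
                      "NEAR_DIST": [67162.1, 182315.1, 0.0]})

merged = left.merge(right, on="FID", how="inner")
print(merged.columns.tolist())
# ['FID', 'voter_turn_x', 'county', 'voter_turn_y', 'NEAR_DIST']
```

This is why the later cells refer to the dependent variable as voter_turn_x, and why voter_turn_y shows up among the training columns.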
Model Building
Next, the dataset containing the new spatial variables will be used to fit the AutoML model for further model improvements.
Train-Test split
Here, the dataset with 3112 samples is split into training and test datasets with a 90 to 10 ratio.
from sklearn.model_selection import train_test_split
# Splitting data with test size of 10% data for validation
test_size = 0.10
sdf_train, sdf_test = train_test_split(sdf_main_final_merged, test_size = test_size, random_state=32)
# checking train-test split
print(sdf_train.shape)
print(sdf_test.shape)
(2800, 95) (312, 95)
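The split sizes above follow from simple rounding: with a fractional test size, the test count is rounded up and the remainder goes to training. A quick check in plain Python, assuming this rounding rule (variable names are illustrative):

```python
import math

n_samples = 3112
test_size = 0.10

n_test = math.ceil(test_size * n_samples)  # 311.2 rounds up to 312
n_train = n_samples - n_test               # remaining rows are used for training
print(n_train, n_test)  # 2800 312
```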
sdf_train.head(2)
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn_x | gender_med | householdi | electronic | ... | co_type | n_neighbor | z_transfor | spatial_la | l_mi_hi_si | l_mi_sig_0 | NEAR_FID | NEAR_DIST | NEAR_ANGLE | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2244 | 2244 | 1 | 2245 | 42061 | Huntingdon | Pennsylvania | 0.534472 | 43.0 | 23471 | 3.73 | ... | | 0 | 0 | 0 | 0 | 0 | 417 | 50738.617273 | 138.368342 | {"rings": [[[-8647447, 4972564.000100002], [-8... |
2710 | 2710 | 1 | 2711 | 48435 | Sutton | Texas | 0.547776 | 39.3 | 31334 | 3.77 | ... | | 0 | 0 | 0 | 0 | 0 | 892 | 41735.520773 | -1.056234 | {"rings": [[[-11144914, 3540920.0001000017], [... |
2 rows × 95 columns
sdf_train.columns
Index(['FID', 'Join_Count', 'TARGET_FID', 'FIPS', 'county', 'state', 'voter_turn_x', 'gender_med', 'householdi', 'electronic', 'raceandhis', 'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3', 'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3', 'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor', 'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1', 'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6', 'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1', 'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6', 'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2', 'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6', 'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1', 'financia_1', 'miscellane', 'state_vote', 'state_vo_1', 'randomized', 'random_num', 'City10Dist', 'City10Ang', 'City9Dist', 'City9Ang', 'City8Dist', 'City8Ang', 'City7Dist', 'City7Ang', 'City6Dist', 'City6Ang', 'City5Dist', 'City5Ang', 'Id', 'source_id', 'voter_turn_y', 'l_mi_index', 'l_mi_z_sco', 'l_mi_p_val', 'co_type', 'n_neighbor', 'z_transfor', 'spatial_la', 'l_mi_hi_si', 'l_mi_sig_0', 'NEAR_FID', 'NEAR_DIST', 'NEAR_ANGLE', 'SHAPE'], dtype='object')
Data Preprocessing
Here, X is the list of explanatory variables chosen from the new feature data that will be used for predicting voter turnout. The new spatial cluster features in the list are NEAR_DIST and NEAR_ANGLE, as explained in the previous section; the significance flag (exported as l_mi_sig_0) does not appear in the list below. Some additional pre-calculated spatial features (City10Ang, City9Ang, City8Ang, etc.) are also included to account for the direction of each county, expressed as its angle from cities of various grades.
Categorical variables are marked with a True value inside a tuple. The scaler is defined in the preprocessors.
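As a reminder of what the scaler does to each numeric column, here is a minimal sketch of min-max scaling to the [0, 1] range, applied to the householdi values from the five sample rows shown at the start of the notebook. The helper `min_max_scale` is hypothetical and only mirrors the formula; the notebook itself relies on scikit-learn's MinMaxScaler.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# 'householdi' values of the first five counties
scaled = min_max_scale([25553, 31429, 16876, 19360, 21785])
print([round(s, 3) for s in scaled])
```

Scaling puts variables with very different units, such as incomes, distances in meters, and angles in degrees, on a comparable range before model fitting.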
# listing explanatory variables
X =[('county',True), ('state',True),'gender_med', 'householdi', 'electronic', 'raceandhis',
('voter_laws',True), 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
'City10Ang', 'City9Dist', 'City9Ang',
'City8Dist', 'City8Ang', 'City7Dist', 'City7Ang', 'City6Dist',
'City6Ang', 'City5Dist', 'City5Ang', 'NEAR_DIST', 'NEAR_ANGLE']
from sklearn.preprocessing import MinMaxScaler
# defining the preprocessors for scaling data
preprocessors = [('county', 'state','gender_med', 'householdi', 'electronic', 'raceandhis',
'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
'City10Ang', 'City9Dist', 'City9Ang',
'City8Dist', 'City8Ang', 'City7Dist', 'City7Ang', 'City6Dist',
'City6Ang', 'City5Dist', 'City5Ang', 'NEAR_DIST', 'NEAR_ANGLE', MinMaxScaler())]
# preparing data for the model
data = prepare_tabulardata(sdf_train,
variable_predict='voter_turn_x',
explanatory_variables=X,
preprocessors=preprocessors)
data.show_batch()
| | City10Ang | City5Ang | City5Dist | City6Ang | City6Dist | City7Ang | City7Dist | City8Ang | City8Dist | City9Ang | ... | miscellane | psychogr_1 | psychograp | raceandh_1 | raceandhis | state | state_vo_1 | state_vote | voter_laws | voter_turn_x |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
970 | -38.097491 | 160.82218 | 253980.119335 | 0.0 | 0.0 | 27.338939 | 31509.472596 | 69.548398 | 10415.735546 | -53.27879 | ... | 19 | 40.48 | 7.15 | 27.7 | 13.36 | Kentucky | 0.298375 | 574117 | nonphotoid | 0.651014 |
1182 | 30.162913 | -65.473209 | 314965.336032 | 0.0 | 0.0 | -56.759361 | 38224.808568 | -84.374642 | 60282.91058 | -32.611268 | ... | 21 | 44.69 | 5.98 | 43.7 | 23.2 | Maryland | 0.264164 | 734759 | no_doc | 0.696712 |
1761 | 81.04786 | 73.563287 | 96856.433863 | 0.0 | 0.0 | 123.794591 | 863.024943 | 89.125132 | 17583.489697 | -126.132351 | ... | 37 | 33.98 | 8.45 | 67.4 | 46.1 | New Jersey | 0.141027 | 546345 | no_doc | 0.701084 |
2163 | -179.686912 | 58.574999 | 104948.236884 | 0.0 | 0.0 | -89.628491 | 20827.282684 | -91.785635 | 27726.568576 | -89.205592 | ... | 11 | 46.9 | 7.22 | 48.9 | 28.15 | Oklahoma | 0.363912 | 528761 | nonphotoid | 0.475694 |
2473 | -22.481323 | 83.390854 | 47420.679826 | 143.000934 | 23345.291534 | 90.218356 | 151289.075466 | 118.863339 | 49455.218555 | -104.587926 | ... | 7 | 47.26 | 10.38 | 6.7 | 2.94 | Tennessee | 0.260057 | 652230 | strict_photoid | 0.429226 |
5 rows × 74 columns
Fitting a random forest model
First, a random forest model is fitted to the new spatial data.
Model Initialization
The MLModel is initialized with the Random Forest regressor from scikit-learn, along with its model parameters.
# defining the model along with the parameters
model = MLModel(data, 'sklearn.ensemble.RandomForestRegressor', n_estimators=500, random_state=43)
model.fit()
model.score()
0.7336351093652952
# validating trained model on test dataset
voter_county_mlmodel_predicted = model.predict(sdf_test, prediction_type='dataframe')
voter_county_mlmodel_predicted.head(2)
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn_x | gender_med | householdi | electronic | ... | n_neighbor | z_transfor | spatial_la | l_mi_hi_si | l_mi_sig_0 | NEAR_FID | NEAR_DIST | NEAR_ANGLE | SHAPE | prediction_results |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
115 | 115 | 1 | 116 | 05067 | Jackson | Arkansas | 0.371336 | 41.8 | 17956 | 3.33 | ... | 0 | 0 | 0 | 0 | 0 | 25 | 0.0 | 0.0 | {"rings": [[[-10133692, 4284820.000100002], [-... | 0.471488 |
1529 | 1529 | 1 | 1530 | 29153 | Ozark | Missouri | 0.591207 | 51.4 | 18827 | 3.92 | ... | 0 | 0 | 0 | 0 | 0 | 34 | 0.0 | 0.0 | {"rings": [[[-10253900, 4410465.000100002], [-... | 0.541081 |
2 rows × 96 columns
import sklearn.metrics as metrics
# calculating validation model score
r_square_voter_county_mlmodel_Test = metrics.r2_score(voter_county_mlmodel_predicted['voter_turn_x'], voter_county_mlmodel_predicted['prediction_results'])
print('r_square_voter_county_mlmodel_Test: ', round(r_square_voter_county_mlmodel_Test,2))
r_square_voter_county_mlmodel_Test: 0.79
The validation R-squared of 0.79 for the random forest model is satisfactory; next, AutoML will be used to try to improve it.
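For reference, the R-squared reported above compares the squared prediction errors with the variance of the observed values. A plain-Python sketch of the formula follows; the function name `r_squared` is illustrative, and the notebook itself uses `sklearn.metrics.r2_score`, which takes the observed values first and the predictions second.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1.0 - ss_res / ss_tot

# Perfect predictions give 1.0; always predicting the mean gives 0.0
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

Because the formula is not symmetric in its two arguments, swapping the observed and predicted columns would change the score.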
Fitting Using AutoML
The same data object obtained from the prepare_tabulardata function is used as input for the AutoML model. Here, the model is initialized using the Compete mode, which is the best-performing of the available modes.
# initializing AutoML model with the Compete mode
AutoML_voters_county_obj_compete = AutoML(data, eval_metric='r2', mode='Compete', n_jobs=1)
# training the AutoML model
AutoML_voters_county_obj_compete.fit()
Neural Network algorithm was disabled because it doesn't support n_jobs parameter.
AutoML directory: ~\AppData\Local\Temp\scratch\tmpgbw18ede
The task is regression with evaluation metric r2
AutoML will use algorithms: ['Linear', 'Decision Tree', 'Random Trees', 'Extra Trees', 'LightGBM', 'Xgboost']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'mix_encoding', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree r2 0.384239 trained in 3.43 seconds
Adjust validation. Remove: 1_DecisionTree
Validation strategy: 10-fold CV Shuffle
* Step simple_algorithms will try to check up to 4 models
1_DecisionTree r2 0.416282 trained in 32.05 seconds
2_DecisionTree r2 0.476178 trained in 30.31 seconds
3_DecisionTree r2 0.475658 trained in 28.41 seconds
4_Linear r2 0.657478 trained in 32.74 seconds
* Step default_algorithms will try to check up to 4 models
5_Default_LightGBM r2 0.772731 trained in 68.24 seconds
6_Default_Xgboost r2 0.769644 trained in 114.5 seconds
7_Default_RandomTrees r2 0.591756 trained in 367.89 seconds
8_Default_ExtraTrees r2 0.531776 trained in 67.03 seconds
* Step not_so_random will try to check up to 36 models
18_LightGBM r2 0.783642 trained in 55.11 seconds
9_Xgboost r2 0.755805 trained in 123.99 seconds
27_RandomTrees r2 0.590433 trained in 165.51 seconds
36_ExtraTrees r2 0.50989 trained in 51.77 seconds
19_LightGBM r2 0.757933 trained in 41.83 seconds
10_Xgboost r2 0.748685 trained in 171.85 seconds
28_RandomTrees r2 0.528941 trained in 153.77 seconds
37_ExtraTrees r2 0.468456 trained in 55.49 seconds
20_LightGBM r2 0.775966 trained in 73.04 seconds
11_Xgboost r2 0.769376 trained in 57.62 seconds
29_RandomTrees r2 0.658831 trained in 379.98 seconds
Skip mix_encoding because of the time limit.
Not enough time to perform features selection. Skip
Time needed for features selection ~ 461.0 seconds
Please increase total_time_limit to at least (4671 seconds) to have features selection
Skip insert_random_feature because no parameters were generated.
Skip features_selection because no parameters were generated.
* Step hill_climbing_1 will try to check up to 21 models
38_LightGBM r2 0.782465 trained in 60.21 seconds
39_LightGBM r2 0.779481 trained in 62.88 seconds
40_LightGBM r2 0.771906 trained in 79.11 seconds
41_LightGBM r2 0.780096 trained in 74.6 seconds
42_LightGBM r2 0.767546 trained in 104.47 seconds
43_Xgboost r2 0.768957 trained in 119.14 seconds
* Step hill_climbing_2 will try to check up to 20 models
44_LightGBM r2 0.78336 trained in 54.25 seconds
45_LightGBM r2 0.780685 trained in 56.42 seconds
46_LightGBM r2 0.777007 trained in 64.33 seconds
47_LightGBM r2 0.779373 trained in 63.48 seconds
48_Xgboost r2 0.770172 trained in 95.72 seconds
* Step boost_on_errors will try to check up to 1 model
18_LightGBM_BoostOnErrors r2 0.777597 trained in 49.65 seconds
* Step ensemble will try to check up to 1 model
Ensemble r2 0.790207 trained in 2.34 seconds
* Step stack will try to check up to 22 models
18_LightGBM_Stacked r2 0.781946 trained in 38.77 seconds
48_Xgboost_Stacked r2 0.765817 trained in 56.7 seconds
29_RandomTrees_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 65.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
8_Default_ExtraTrees_Stacked r2 0.787562 trained in 70.85 seconds
44_LightGBM_Stacked r2 0.78109 trained in 39.31 seconds
6_Default_Xgboost_Stacked r2 0.767522 trained in 59.69 seconds
7_Default_RandomTrees_Stacked r2 0.786985 trained in 252.41 seconds
36_ExtraTrees_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 2.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
38_LightGBM_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 1.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
11_Xgboost_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 2.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
27_RandomTrees_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 11.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
* Step ensemble_stacked will try to check up to 1 model
Ensemble_Stacked r2 0.791633 trained in 3.28 seconds
AutoML fit time: 3612.34 seconds
AutoML best model: Ensemble_Stacked
All the evaluated models are saved in the path ~\AppData\Local\Temp\scratch\tmpgbw18ede
Here, the stacked ensemble (Ensemble_Stacked) is the best model, with a cross-validated R-squared of about 0.79, which reflects the final improvement achieved after including the new, spatially engineered variables. The best model's diagnostics and related reports, such as feature importance and model performance, are saved in the folder mentioned in the output message for further reference.
# train score of the model
AutoML_voters_county_obj_compete.score()
0.9681373663713142
Model output
# The output diagnostics can also be printed in a report form
AutoML_voters_county_obj_compete.report()
In case the report html is not rendered appropriately in the notebook, the same can be found in the path ~\AppData\Local\Temp\scratch\tmpgbw18ede\README.html
AutoML Leaderboard
name | model_type | metric_type | metric_value | train_time | Best model |
---|---|---|---|---|---|
1_DecisionTree | Decision Tree | r2 | 0.416282 | 32.71 | |
2_DecisionTree | Decision Tree | r2 | 0.476178 | 30.9 | |
3_DecisionTree | Decision Tree | r2 | 0.475658 | 28.98 | |
4_Linear | Linear | r2 | 0.657478 | 33.38 | |
5_Default_LightGBM | LightGBM | r2 | 0.772731 | 68.96 | |
6_Default_Xgboost | Xgboost | r2 | 0.769644 | 115.43 | |
7_Default_RandomTrees | Random Trees | r2 | 0.591756 | 368.61 | |
8_Default_ExtraTrees | Extra Trees | r2 | 0.531776 | 67.78 | |
18_LightGBM | LightGBM | r2 | 0.783642 | 55.81 | |
9_Xgboost | Xgboost | r2 | 0.755805 | 124.72 | |
27_RandomTrees | Random Trees | r2 | 0.590433 | 166.24 | |
36_ExtraTrees | Extra Trees | r2 | 0.50989 | 52.53 | |
19_LightGBM | LightGBM | r2 | 0.757933 | 42.58 | |
10_Xgboost | Xgboost | r2 | 0.748685 | 172.61 | |
28_RandomTrees | Random Trees | r2 | 0.528941 | 154.5 | |
37_ExtraTrees | Extra Trees | r2 | 0.468456 | 56.17 | |
20_LightGBM | LightGBM | r2 | 0.775966 | 73.71 | |
11_Xgboost | Xgboost | r2 | 0.769376 | 58.29 | |
29_RandomTrees | Random Trees | r2 | 0.658831 | 380.73 | |
38_LightGBM | LightGBM | r2 | 0.782465 | 60.99 | |
39_LightGBM | LightGBM | r2 | 0.779481 | 63.63 | |
40_LightGBM | LightGBM | r2 | 0.771906 | 79.93 | |
41_LightGBM | LightGBM | r2 | 0.780096 | 75.51 | |
42_LightGBM | LightGBM | r2 | 0.767546 | 105.09 | |
43_Xgboost | Xgboost | r2 | 0.768957 | 119.88 | |
44_LightGBM | LightGBM | r2 | 0.78336 | 54.91 | |
45_LightGBM | LightGBM | r2 | 0.780685 | 57.12 | |
46_LightGBM | LightGBM | r2 | 0.777007 | 65.04 | |
47_LightGBM | LightGBM | r2 | 0.779373 | 64.24 | |
48_Xgboost | Xgboost | r2 | 0.770172 | 96.34 | |
18_LightGBM_BoostOnErrors | LightGBM | r2 | 0.777597 | 50.34 | |
Ensemble | Ensemble | r2 | 0.790207 | 2.34 | |
18_LightGBM_Stacked | LightGBM | r2 | 0.781946 | 39.46 | |
48_Xgboost_Stacked | Xgboost | r2 | 0.765817 | 57.29 | |
8_Default_ExtraTrees_Stacked | Extra Trees | r2 | 0.787562 | 71.47 | |
44_LightGBM_Stacked | LightGBM | r2 | 0.78109 | 39.99 | |
6_Default_Xgboost_Stacked | Xgboost | r2 | 0.767522 | 60.35 | |
7_Default_RandomTrees_Stacked | Random Trees | r2 | 0.786985 | 252.98 | |
Ensemble_Stacked | Ensemble | r2 | 0.791633 | 3.28 | the best |
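The leaderboard is listed in training order, so the strongest models are not easy to spot at a glance. The sketch below ranks a handful of entries by their r2 value and picks the best entry per model type. The tuples are copied from the leaderboard above; the helper names `top_models` and `best_per_type` are hypothetical and are not part of arcgis.learn.

```python
# (name, model_type, r2, train_time_seconds), copied from the leaderboard
leaderboard = [
    ("4_Linear", "Linear", 0.657478, 33.38),
    ("5_Default_LightGBM", "LightGBM", 0.772731, 68.96),
    ("18_LightGBM", "LightGBM", 0.783642, 55.81),
    ("Ensemble", "Ensemble", 0.790207, 2.34),
    ("8_Default_ExtraTrees_Stacked", "Extra Trees", 0.787562, 71.47),
    ("7_Default_RandomTrees_Stacked", "Random Trees", 0.786985, 252.98),
    ("Ensemble_Stacked", "Ensemble", 0.791633, 3.28),
]


def top_models(rows, n=3):
    """Return the n rows with the highest r2, best first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


def best_per_type(rows):
    """Map each model_type to its row with the highest r2."""
    best = {}
    for row in rows:
        model_type = row[1]
        if model_type not in best or row[2] > best[model_type][2]:
            best[model_type] = row
    return best


for name, model_type, r2, seconds in top_models(leaderboard):
    print(f"{name:32s} {model_type:14s} r2={r2:.6f} time={seconds}s")
```

Ranked this way, the two ensembles come first, followed by the stacked Extra Trees and Random Trees models, all within about 0.005 of each other in r2.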
AutoML Performance
AutoML Performance Boxplot
Spearman Correlation of Models
Voter turnout prediction & Validation
# validating trained model on test dataset
voter_county_automl_predicted = AutoML_voters_county_obj_compete.predict(sdf_test, prediction_type='dataframe')
voter_county_automl_predicted.head(2)
index | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn_x | gender_med | householdi | electronic | ... | n_neighbor | z_transfor | spatial_la | l_mi_hi_si | l_mi_sig_0 | NEAR_FID | NEAR_DIST | NEAR_ANGLE | SHAPE | prediction_results |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
115 | 115 | 1 | 116 | 05067 | Jackson | Arkansas | 0.371336 | 41.8 | 17956 | 3.33 | ... | 0 | 0 | 0 | 0 | 0 | 25 | 0.0 | 0.0 | {"rings": [[[-10133692, 4284820.000100002], [-... | 0.429061 |
1529 | 1529 | 1 | 1530 | 29153 | Ozark | Missouri | 0.591207 | 51.4 | 18827 | 3.92 | ... | 0 | 0 | 0 | 0 | 0 | 34 | 0.0 | 0.0 | {"rings": [[[-10253900, 4410465.000100002], [-... | 0.567492 |
2 rows × 96 columns
Estimate model metrics for validation
import sklearn.metrics as metrics
r_square_voter_county_automl_Test = metrics.r2_score(voter_county_automl_predicted['voter_turn_x'], voter_county_automl_predicted['prediction_results'])
print('r_square_voter_county_automl_Test: ', round(r_square_voter_county_automl_Test,2))
r_square_voter_county_automl_Test: 0.84
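For reference, the R-squared reported above is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared differences between observed and predicted values and SS_tot is the sum of squared differences between the observed values and their mean. The sketch below spells this out in plain Python and adds RMSE, which expresses the error in the same units as the turnout fraction. The function names are hypothetical; in the notebook itself `sklearn.metrics.r2_score` does this work.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target variable."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5


# Only the two rows displayed by head(2) above, for illustration.
# This is not the full test set, so the values are not the notebook's score.
observed = [0.371336, 0.591207]
predicted = [0.429061, 0.567492]
print("r2  :", r2_manual(observed, predicted))
print("rmse:", rmse(observed, predicted))
```

A perfect prediction gives an R-squared of 1, and always predicting the mean of the observed values gives 0, which is why the metric is read as the share of variance explained.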
Conclusion
In the first part of this notebook series, AutoML was applied to a regression dataset, where it was able to achieve significant improvements over traditional methods of modeling. In this notebook, the model's fit was further improved by extracting the spatial patterns in the voter turnout dataset and including them as additional spatial features.
The spatial feature engineering consisted of quantifying the spatial autocorrelation in the data with the Cluster and Outlier Analysis tool from ArcPy, then measuring each county's distance and angle from the highly significant cluster counties. Including these new spatial variables improved the model further. The same process could be applied to other spatial dataframes.
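The distance and angle features can be illustrated with a small planar sketch. It is not the ArcPy implementation: it assumes each county is reduced to a projected (x, y) point, that the nearest cluster point is the one of interest (as the NEAR_FID, NEAR_DIST, and NEAR_ANGLE field names suggest), and that angles are measured counterclockwise from east in degrees. The function name `near_distance_and_angle` is hypothetical.

```python
import math


def near_distance_and_angle(point, targets):
    """Return (index, distance, angle_degrees) for the target nearest to point.

    Distances are planar (Euclidean). The angle is measured counterclockwise
    from the positive x axis (east), in the range -180 to 180 degrees.
    """
    px, py = point
    best = None
    for i, (tx, ty) in enumerate(targets):
        dist = math.hypot(tx - px, ty - py)
        if best is None or dist < best[1]:
            angle = math.degrees(math.atan2(ty - py, tx - px))
            best = (i, dist, angle)
    return best


# Hypothetical coordinates, chosen only to make the arithmetic easy to follow.
clusters = [(10.0, 0.0), (3.0, 4.0)]
print(near_distance_and_angle((0.0, 0.0), clusters))
```

A point that coincides with a cluster point gets a distance of 0.0 and an angle of 0.0 under these assumptions.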
Data resources & References
Reference | Source | Link |
---|---|---|
Voters turnout by county for 2016 US general election | Esri | https://www.arcgis.com/home/item.html?id=650e7d6aa8fb4601a75d632a2c114425 |