Predicting voter turnout for the 2016 US election using AutoML and spatial feature engineering - Part II

Introduction

The objective of this notebook is to demonstrate the application of AutoML on tabular data and show the improvements that can be achieved using this method, rather than conventional workflows. In part 1 of this notebook series, a considerable improvement in model fit was obtained when implementing AutoML, and in this notebook, the result will be further enhanced using spatial feature engineering. These new features will be estimated by considering, and subsequently extracting, the inherent spatial patterns present in the data.

The percentage of voter turnout by county for the 2016 US general election will be predicted using the demographic characteristics of US counties and their socioeconomic parameters.

Imports

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from IPython.display import Image, HTML
from sklearn.preprocessing import MinMaxScaler,RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
from fastai.imports import *
from datetime import datetime as dt

import arcgis
from arcgis.gis import GIS
from arcgis.learn import prepare_tabulardata, AutoML, MLModel
import arcpy

Connecting to ArcGIS

gis = GIS("home")

Accessing & Visualizing datasets

The 2016 election data is downloaded from the portal as a zipped shapefile, which is then unzipped and processed in the following steps.

voter_zip = gis.content.get('650e7d6aa8fb4601a75d632a2c114425') 
voter_zip

VotersTurnoutCountyEelction2016
voters turnout 2016

Shapefile by api_data_owner
Last Modified: August 23, 2021
0 comments, 78 views

import os, shutil, zipfile
from pathlib import Path

filepath_new = voter_zip.download(file_name=voter_zip.name)
with zipfile.ZipFile(filepath_new, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath_new).parent)
output_path = Path(os.path.join(os.path.splitext(filepath_new)[0]))
output_path = os.path.join(output_path,"VotersTurnoutCountyEelction2016.shp")

The attribute table contains voter turnout data per county for the entire US, which is extracted here as a pandas dataframe. The voter_turn field in the dataframe contains voter turnout percentages for each county for the 2016 election. This will be used as the dependent variable and will be predicted using the various demographic and socioeconomic variables of each county.

# getting the attribute table from the shapefile which will be used for building the model
sdf_main = pd.DataFrame.spatial.from_featureclass(output_path)
sdf_main.head()

	FID	Join_Count	TARGET_FID	FIPS	county	state	voter_turn	gender_med	householdi	electronic	...	NNeighbors	ZTransform	SpatialLag	LMi_hi_sig	LMi_normal	Shape_Le_1	Shape_Ar_1	LMiHiDist	NEAR_FID	SHAPE
0	0	1	1	01001	Autauga	Alabama	0.613738	38.6	25553.0	4.96	...	44	0.211580	0.154568	0	0	2.496745e+05	2.208598e+09	133735.292502	0	{"rings": [[[-9619465, 3856529.0001000017], [-...
1	1	1	2	01003	Baldwin	Alabama	0.627364	42.9	31429.0	4.64	...	22	0.358894	0.057952	0	0	1.642763e+06	5.671096e+09	241925.196426	3	{"rings": [[[-9746859, 3539643.0001000017], [-...
2	2	1	3	01005	Barbour	Alabama	0.513816	40.2	16876.0	3.49	...	62	-0.868722	-0.498354	1	1	3.202971e+05	3.257816e+09	0.000000	0	{"rings": [[[-9468394, 3771591.0001000017], [-...
3	3	1	4	01007	Bibb	Alabama	0.501364	39.3	19360.0	3.64	...	43	-1.003341	0.286440	0	0	2.279101e+05	2.311955e+09	170214.485759	7	{"rings": [[[-9692114, 3928124.0001000017], [-...
4	4	1	5	01009	Blount	Alabama	0.603064	40.9	21785.0	3.86	...	51	0.096177	-0.336198	0	1	2.918753e+05	2.456919e+09	21128.568784	7	{"rings": [[[-9623907, 4063676.0001000017], [-...

5 rows × 97 columns

sdf_main.shape

(3112, 97)

Here, the data is visualized by mapping the voter turnout field into five classes. It can be observed that there are belts running along the eastern and southern parts of the country that represent comparatively lower voter turnout of less than 55%.

In part 1, the AutoML process significantly improved the fit compared to the standalone random forest model, and the validation R-squared reached a new high. The map of voter turnout described above also reveals a spatial pattern in the data. Next, this spatial pattern will be estimated and included as spatial features to further improve the model.

Estimating Spatial Autocorrelation

This tendency of nearby counties to have similar values is known as spatial autocorrelation. It is measured by the Moran's I index, which is estimated here using the ClustersOutliers tool available in ArcPy.

# First, the ArcPy workspace is specified; it will be used for saving the results of the ArcPy tools
arcpy.env.workspace = os.path.join(os.path.dirname(output_path), "arcpy_test_env")

if os.path.exists(arcpy.env.workspace):
    shutil.rmtree(arcpy.env.workspace)

os.makedirs(arcpy.env.workspace)

Calculating Local Moran's I

The ClustersOutliers tool calculates the local Moran's I index for each county and identifies statistically significant hot spots, cold spots, and spatial outliers. As input, the tool takes the shapefile containing the data, the name of the field for which clustering is to be estimated, and the name of the output shapefile. For each feature, it outputs a Moran's I value, a z-score, and a pseudo p-value, along with a code representing the cluster type for each statistically significant feature. The z-scores and pseudo p-values represent the statistical significance of the computed index values.

# Local Moran's I on the voter_turn field, using inverse-distance weights,
# Euclidean distances, row standardization, no FDR correction, and 499 permutations
result = arcpy.stats.ClustersOutliers(output_path,
                                      "voter_turn", "voters_turnout_ClusterOutlier.shp",
                                      'INVERSE_DISTANCE',
                                      'EUCLIDEAN_DISTANCE', 'ROW', "#", "#", "NO_FDR", 499)

# accessing the attribute table from the output shapefile
sdf_main_LMi = pd.DataFrame.spatial.from_featureclass(result[0])
sdf_main_LMi.head()

	FID	SOURCE_ID	voter_turn	LMiIndex	LMiZScore	LMiPValue	COType	NNeighbors	ZTransform	SpatialLag	SHAPE
0	0	0	0.613738	0.032693	0.947522	0.172		44	0.211580	0.154568	{"rings": [[[-9619465, 3856529.0001000017], [-...
1	1	1	0.627364	0.020792	0.284678	0.376		22	0.358894	0.057952	{"rings": [[[-9746859, 3539643.0001000017], [-...
2	2	2	0.513816	0.432791	3.658427	0.002	LL	62	-0.868722	-0.498354	{"rings": [[[-9468394, 3771591.0001000017], [-...
3	3	3	0.501364	-0.287305	-1.723358	0.042	LH	43	-1.003341	0.286440	{"rings": [[[-9692114, 3928124.0001000017], [-...
4	4	4	0.603064	-0.032324	-2.045158	0.024	HL	51	0.096177	-0.336198	{"rings": [[[-9623907, 4063676.0001000017], [-...

Here, the Moran's I value is stored in the LMiIndex field, with its z-score and pseudo p-value in the fields LMiZScore and LMiPValue, respectively, and the cluster type code in COType.
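As a rough illustration of what the index measures, the sketch below computes a simplified local Moran's I on toy data. Each value is z-transformed, a spatial lag is formed as the row-standardized, inverse-distance-weighted average of the other features' z-scores, and the index is their product. The function name, toy values, and coordinates are hypothetical, and this is not the ArcPy implementation: it omits the permutation test and the tool's exact variance term. In the table above, LMiIndex is nevertheless close to ZTransform multiplied by SpatialLag.

```python
import numpy as np

def local_morans_i_sketch(values, coords):
    """Simplified local Moran's I (illustrative only, not the ArcPy tool)."""
    x = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)

    # z-transform of the analysis field
    z = (x - x.mean()) / x.std()

    # pairwise Euclidean distances between features
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)

    # inverse-distance weights, with zero weight on the diagonal (self)
    weights = np.zeros_like(dist)
    mask = dist > 0
    weights[mask] = 1.0 / dist[mask]

    # row standardization: each row of weights sums to 1
    weights /= weights.sum(axis=1, keepdims=True)

    # spatial lag = weighted mean of the neighbours' z-scores
    lag = weights @ z
    return z * lag, z, lag

# hypothetical data: a low-turnout group and a high-turnout group, far apart
toy_turnout = [0.40, 0.42, 0.41, 0.70, 0.72, 0.71]
toy_coords = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
lmi, z, lag = local_morans_i_sketch(toy_turnout, toy_coords)
print(np.round(lmi, 3))
```

Every toy feature gets a positive index, because each one resembles its nearest neighbours: low values sit next to low values, and high values next to high values.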

Visualizing the spatial autocorrelation

The COType field in the output feature class is HH for a statistically significant cluster of high values and LL for a statistically significant cluster of low values. It also indicates whether a feature has a high value and is surrounded by features with low values (HL), or has a low value and is surrounded by features with high values (LH). This is visualized in the map below:

# visualizing spatial autocorrelation in voters turnout
m2= GIS().map('United States', zoomlevel=4)
sdf_main_LMi.spatial.plot(map_widget = m2, renderer_type='u', col='COType', line_width=0.5)
m2.legend=True
m2

The black pixels in the map above show that there is spatial clustering of low voter turnout along the eastern coast, while the white pixels in the northeastern, central, and northwestern portions of the country indicate areas of spatial clustering of high voter turnout.
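The four cluster codes can be read off the signs of the ZTransform and SpatialLag fields, as the sketch below illustrates using the first, third, fourth, and fifth rows of the output table shown earlier. The function name is hypothetical, and the 0.05 significance threshold is an assumption made for illustration; the tool assigns the codes itself.

```python
def cluster_type_sketch(z_value, spatial_lag, p_value, alpha=0.05):
    """Return HH, LL, HL, LH, or '' when the feature is not significant."""
    if p_value > alpha:
        return ""
    own = "H" if z_value > 0 else "L"             # the feature's own value
    neighbours = "H" if spatial_lag > 0 else "L"  # its neighbours' values
    return own + neighbours

# (ZTransform, SpatialLag, LMiPValue) taken from the output table
rows = [
    (0.211580, 0.154568, 0.172),
    (-0.868722, -0.498354, 0.002),
    (-1.003341, 0.286440, 0.042),
    (0.096177, -0.336198, 0.024),
]
codes = [cluster_type_sketch(*row) for row in rows]
print(codes)  # ['', 'LL', 'LH', 'HL']
```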

To include this data as spatial features, the counties with the most significant (lowest) p-values will be identified, and the distance and the angle or direction of each county will be measured from those lowest p-value counties. These two variables, the distance and the angle, are included as the new spatial features in the model.

# checking the field names having the p values
sdf_main_LMi.columns

Index(['FID', 'SOURCE_ID', 'voter_turn', 'LMiIndex', 'LMiZScore', 'LMiPValue',
       'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'SHAPE'],
      dtype='object')

Selecting highly significant spatial clustering counties

The most significant (lowest) p-value here is 0.002. All counties with this p-value will be selected, and a field will be created that will be used to generate a shapefile containing these highly significant counties. Another field will also be created for counties with p-values less than or equal to 0.05, representing all of the significantly clustering counties, which will be used as the third spatial feature in the final model.

# creating new fields with highly significant clustering counties
sdf_main_LMi['LMi_hi_sig<.002'] =  np.where(sdf_main_LMi['LMiPValue']<=.002, 1,0)
sdf_main_LMi['LMi_sig_<0.05'] =  np.where(sdf_main_LMi['LMiPValue']<=.05, 1,0)
sdf_main_LMi.head()

	FID	SOURCE_ID	voter_turn	LMiIndex	LMiZScore	LMiPValue	COType	NNeighbors	ZTransform	SpatialLag	SHAPE	LMi_hi_sig<.002	LMi_sig_<0.05
0	0	0	0.613738	0.032693	0.947522	0.172		44	0.211580	0.154568	{"rings": [[[-9619465, 3856529.0001000017], [-...	0	0
1	1	1	0.627364	0.020792	0.284678	0.376		22	0.358894	0.057952	{"rings": [[[-9746859, 3539643.0001000017], [-...	0	0
2	2	2	0.513816	0.432791	3.658427	0.002	LL	62	-0.868722	-0.498354	{"rings": [[[-9468394, 3771591.0001000017], [-...	1	1
3	3	3	0.501364	-0.287305	-1.723358	0.042	LH	43	-1.003341	0.286440	{"rings": [[[-9692114, 3928124.0001000017], [-...	0	1
4	4	4	0.603064	-0.032324	-2.045158	0.024	HL	51	0.096177	-0.336198	{"rings": [[[-9623907, 4063676.0001000017], [-...	0	1
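The value 0.002 is not arbitrary. With a permutation test, the pseudo p-value cannot fall below a floor set by the number of permutations. The sketch below assumes that this floor is 1 / (permutations + 1), which gives 0.002 for the 499 permutations used above, and rebuilds the two flags on a toy dataframe holding the five p-values from the table. The toy dataframe and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

permutations = 499
# assumed lower bound of the pseudo p-value for this number of permutations
min_pseudo_p = 1 / (permutations + 1)
print(min_pseudo_p)  # 0.002

# pseudo p-values of the first five counties in the table above
toy = pd.DataFrame({"LMiPValue": [0.172, 0.376, 0.002, 0.042, 0.024]})
toy["hi_sig"] = np.where(toy["LMiPValue"] <= min_pseudo_p, 1, 0)
toy["sig_05"] = np.where(toy["LMiPValue"] <= 0.05, 1, 0)
print(toy)
```

The resulting flags match the two new columns in the table: only the third county is highly significant, while the last three are significant at the 0.05 level.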

# create new dataframe for LMi_hi_sig<.002 
LMi_hi_sig_county_main = sdf_main_LMi[sdf_main_LMi['LMi_hi_sig<.002']==1].copy()

LMi_hi_sig_county_main.columns

Index(['FID', 'SOURCE_ID', 'voter_turn', 'LMiIndex', 'LMiZScore', 'LMiPValue',
       'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'SHAPE',
       'LMi_hi_sig<.002', 'LMi_sig_<0.05'],
      dtype='object')

# creating a new shapefile for the most significant clustering counties from spatial dataframe 
near_dist_from_main_county = sdf_main_LMi.spatial.to_featureclass('voters_turnout_train_LMi'+str(dt.now().microsecond))
near_dist_to_hi_sig_county = LMi_hi_sig_county_main.spatial.to_featureclass('LMi_hi_sig_county_train'+str(dt.now().microsecond))

Estimating distances and angle of counties from highly clustered counties

The Near (Analysis) tool from ArcPy is used to calculate the distance and the angle from each county to the nearest highly significant clustering county. As input, it takes the shapefile containing all of the counties for which the distance and the angle are to be calculated, followed by the shapefile containing the counties of high significance to which they are measured.

# Using the Near tool to calculate distance and angle
dist_to_nearest_hi_sig = arcpy.analysis.Near(near_dist_from_main_county,near_dist_to_hi_sig_county,'#','#','ANGLE','GEODESIC')

# Accessing the attribute table from the resulting shapefile
sdf_nearest_hi_sig = pd.DataFrame.spatial.from_featureclass(dist_to_nearest_hi_sig[0])
sdf_nearest_hi_sig.head()

	FID	SOURCE_ID	voter_turn	LMiIndex	LMiZScore	LMiPValue	COType	NNeighbors	ZTransform	SpatialLag	NEAR_FID	NEAR_DIST	NEAR_ANGLE	SHAPE
0	0	0	0.613738	0.032693	0.947522	0.172		44	0.211580	0.154568	12	36505.871660	89.398064	{"rings": [[[-9619465, 3856529.0001000017], [-...
1	1	1	0.627364	0.020792	0.284678	0.376		22	0.358894	0.057952	7	171810.775382	69.106286	{"rings": [[[-9746859, 3539643.0001000017], [-...
2	2	2	0.513816	0.432791	3.658427	0.002	LL	62	-0.868722	-0.498354	0	0.000000	0.000000	{"rings": [[[-9468394, 3771591.0001000017], [-...
3	3	3	0.501364	-0.287305	-1.723358	0.042	LH	43	-1.003341	0.286440	12	92415.309691	119.739256	{"rings": [[[-9692114, 3928124.0001000017], [-...
4	4	4	0.603064	-0.032324	-2.045158	0.024	HL	51	0.096177	-0.336198	4	0.000000	0.000000	{"rings": [[[-9623907, 4063676.0001000017], [-...

sdf_nearest_hi_sig.columns

Index(['FID', 'Id', 'SOURCE_ID', 'voter_turn', 'LMiIndex', 'LMiZScore',
       'LMiPValue', 'COType', 'NNeighbors', 'ZTransform', 'SpatialLag',
       'LMi_hi_sig', 'LMi_sig_<0', 'NEAR_FID', 'NEAR_DIST', 'NEAR_ANGLE',
       'SHAPE'],
      dtype='object')

#LMi_hi_sig_county_main.columns

Index(['FID', 'SOURCE_ID', 'voter_turn', 'LMiIndex', 'LMiZScore', 'LMiPValue',
       'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'SHAPE',
       'LMi_hi_sig<.002', 'LMi_sig_<0.05'],
      dtype='object')

In the resulting dataframe above, the fields NEAR_DIST and NEAR_ANGLE (the third and second fields from the last) represent the distance and angle of each county from the nearest highly significant clustering county, while the field named LMi_sig_<0 flags all of the significant counties. All three will be used as the spatial predictors in the final model. Note that shapefile field names are limited to 10 characters, which is why LMi_hi_sig<.002 and LMi_sig_<0.05 appear here as LMi_hi_sig and LMi_sig_<0.
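To make the two distance-based features more concrete, the sketch below computes a nearest distance and angle for toy points. It is a deliberately simplified stand-in for the Near tool: it uses points instead of county polygons, planar instead of geodesic distances, and an angle measured counterclockwise from the positive x-axis, all of which are assumptions made for illustration. The function name and coordinates are hypothetical.

```python
import numpy as np

def nearest_distance_and_angle(points, targets):
    """For each point, find the nearest target, its distance, and its angle."""
    pts = np.asarray(points, dtype=float)
    tgt = np.asarray(targets, dtype=float)

    # offsets from every point to every target: shape (n_points, n_targets, 2)
    delta = tgt[None, :, :] - pts[:, None, :]
    dist = np.hypot(delta[..., 0], delta[..., 1])

    # index of the closest target for each point
    nearest = dist.argmin(axis=1)
    rows = np.arange(len(pts))

    # planar angle in degrees, in the range -180 to 180
    angle = np.degrees(np.arctan2(delta[rows, nearest, 1],
                                  delta[rows, nearest, 0]))
    return nearest, dist[rows, nearest], angle

near_id, near_dist, near_angle = nearest_distance_and_angle(
    [(0, 0), (10, 1)],   # hypothetical "all counties"
    [(3, 4), (10, 0)],   # hypothetical "highly significant counties"
)
print(near_id, near_dist, np.round(near_angle, 2))
```

In the real output, a county that is itself highly significant, or that touches one, gets a NEAR_DIST of 0, because the tool measures between polygon boundaries.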

sdf_main.head(2)

	FID	Join_Count	TARGET_FID	FIPS	county	state	voter_turn	gender_med	householdi	electronic	...	NNeighbors	ZTransform	SpatialLag	LMi_hi_sig	LMi_normal	Shape_Le_1	Shape_Ar_1	LMiHiDist	NEAR_FID	SHAPE
0	0	1	1	01001	Autauga	Alabama	0.613738	38.6	25553.0	4.96	...	44	0.211580	0.154568	0	0	2.496745e+05	2.208598e+09	133735.292502	0	{"rings": [[[-9619465, 3856529.0001000017], [-...
1	1	1	2	01003	Baldwin	Alabama	0.627364	42.9	31429.0	4.64	...	22	0.358894	0.057952	0	0	1.642763e+06	5.671096e+09	241925.196426	3	{"rings": [[[-9746859, 3539643.0001000017], [-...

2 rows × 97 columns

# dropping the previously estimated p-value and cluster columns from the main table, to be replaced by the newly calculated values
sdf_main_final = sdf_main.drop(['SOURCE_ID', 'voter_tu_1',
       'Shape_Leng', 'Shape_Area', 'LMiIndex', 'LMiZScore', 'LMiPValue',
       'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'LMi_hi_sig',
       'LMi_normal', 'NEAR_FID', 'Shape_Le_1', 'Shape_Ar_1', 'LMiHiDist',  'SHAPE'], axis=1)

sdf_main_final.head(2)

	FID	Join_Count	TARGET_FID	FIPS	county	state	voter_turn	gender_med	householdi	electronic	...	City9Dist	City9Ang	City8Dist	City8Ang	City7Dist	City7Ang	City6Dist	City6Ang	City5Dist	City5Ang
0	0	1	1	01001	Autauga	Alabama	0.613738	38.6	25553.0	4.96	...	383948.84777	-0.847576	10748.108812	109.277531	76082.644216	-6.321051	0.0	0.0	358644.11945	-74.872116
1	1	1	2	01003	Baldwin	Alabama	0.627364	42.9	31429.0	4.64	...	472377.29045	-25.580055	4252.349631	-85.916425	17821.080946	94.801172	0.0	0.0	356543.92578	-44.723872

2 rows × 79 columns

Final dataset with spatial cluster variables

# joining the newly calculated spatial features with the main dataset 
sdf_main_final_merged = sdf_main_final.merge(sdf_nearest_hi_sig, on='FID', how='inner')

# checking the final merged data
sdf_main_final_merged.head(2)

	FID	Join_Count	TARGET_FID	FIPS	county	state	voter_turn_x	gender_med	householdi	electronic	...	COType	NNeighbors	ZTransform	SpatialLag	LMi_hi_sig	LMi_sig_<0	NEAR_FID	NEAR_DIST	NEAR_ANGLE	SHAPE
0	0	1	1	01001	Autauga	Alabama	0.613738	38.6	25553.0	4.96	...		44	0.211580	0.154568	0	0	12	36505.871660	89.398064	{'rings': [[[-9619465, 3856529.0001000017], [-...
1	1	1	2	01003	Baldwin	Alabama	0.627364	42.9	31429.0	4.64	...		22	0.358894	0.057952	0	0	7	171810.775382	69.106286	{'rings': [[[-9746859, 3539643.0001000017], [-...

2 rows × 95 columns
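Both tables carry a voter_turn column, so the merge renames them voter_turn_x (from the main table) and voter_turn_y (from the Near output). This is why voter_turn_x is used as the variable to predict later on. The sketch below reproduces this behaviour with pandas' default suffixes on two hypothetical toy dataframes.

```python
import pandas as pd

# hypothetical stand-ins for the main table and the Near output
left = pd.DataFrame({"FID": [0, 1, 2],
                     "voter_turn": [0.61, 0.63, 0.51],
                     "state": ["Alabama"] * 3})
right = pd.DataFrame({"FID": [0, 1, 2],
                      "voter_turn": [0.61, 0.63, 0.51],
                      "NEAR_DIST": [36505.9, 171810.8, 0.0]})

# overlapping non-key columns receive the default suffixes _x and _y
merged = left.merge(right, on="FID", how="inner")
print(merged.columns.tolist())
# ['FID', 'voter_turn_x', 'state', 'voter_turn_y', 'NEAR_DIST']
```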

Model Building

Next, the dataset containing the new spatial variables will be used to fit the AutoML model for further model improvements.

Train-Test split

Here, the dataset with 3112 samples is split into training and test datasets with a 90 to 10 ratio.

# Splitting data with test size of 10% data for validation 
test_size = 0.10
sdf_train, sdf_test = train_test_split(sdf_main_final_merged, test_size = test_size, random_state=32)

# checking train-test split
print(sdf_train.shape)
print(sdf_test.shape)

(2800, 95)
(312, 95)
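The row counts above follow directly from the split ratio. The short sketch below reproduces them, assuming the test count is rounded up and the remaining rows go to the training set.

```python
import math

n_samples = 3112
test_size = 0.10

# assumption: the fractional test count is rounded up
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 2800 312
```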

sdf_train.head(2)

	FID	Join_Count	TARGET_FID	FIPS	county	state	voter_turn_x	gender_med	householdi	electronic	...	COType	NNeighbors	ZTransform	SpatialLag	LMi_hi_sig	LMi_sig_<0	NEAR_FID	NEAR_DIST	NEAR_ANGLE	SHAPE
2244	2244	1	2245	42061	Huntingdon	Pennsylvania	0.534472	43.0	23471.0	3.73	...		41	-0.645395	-0.198609	0	0	683	35718.647361	114.935085	{'rings': [[[-8647447, 4972564.000100002], [-8...
2710	2710	1	2711	48435	Sutton	Texas	0.547776	39.3	31334.0	3.77	...		22	-0.501568	-0.259770	0	0	876	41735.520773	-1.056234	{'rings': [[[-11144914, 3540920.0001000017], [...

2 rows × 95 columns

sdf_train.columns

Index(['FID', 'Join_Count', 'TARGET_FID', 'FIPS', 'county', 'state',
       'voter_turn_x', 'gender_med', 'householdi', 'electronic', 'raceandhis',
       'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
       'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
       'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
       'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
       'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
       'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
       'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
       'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
       'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
       'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
       'financia_1', 'miscellane', 'state_vote', 'state_vo_1', 'randomized',
       'random_num', 'City10Dist', 'City10Ang', 'City9Dist', 'City9Ang',
       'City8Dist', 'City8Ang', 'City7Dist', 'City7Ang', 'City6Dist',
       'City6Ang', 'City5Dist', 'City5Ang', 'Id', 'SOURCE_ID', 'voter_turn_y',
       'LMiIndex', 'LMiZScore', 'LMiPValue', 'COType', 'NNeighbors',
       'ZTransform', 'SpatialLag', 'LMi_hi_sig', 'LMi_sig_<0', 'NEAR_FID',
       'NEAR_DIST', 'NEAR_ANGLE', 'SHAPE'],
      dtype='object')

Data Preprocessing

Here, X is the list of explanatory variables chosen from the new feature data that will be used for predicting voter turnout. The new spatial cluster features used here are NEAR_DIST, NEAR_ANGLE, and LMi_sig_<0, as explained in the previous section. Some additional pre-calculated spatial features (City10Ang, City9Ang, City8Ang, etc.) are also included to account for the direction of each county, expressed as its angle from cities of various grades.

Also, the categorical variables are marked with a True value inside of a tuple. The scaler is defined in the preprocessors.

#listing explanatory variables
X =[('county',True), ('state',True),'gender_med', 'householdi', 'electronic', 'raceandhis',
       ('voter_laws',True), 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
       'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
       'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
       'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
       'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
       'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
       'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
       'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
       'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
       'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
       'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
        'City10Ang', 'City9Dist', 'City9Ang',
       'City8Dist', 'City8Ang', 'City7Dist', 'City7Ang', 'City6Dist',
       'City6Ang', 'City5Dist', 'City5Ang', 'LMi_sig_<0', 'NEAR_DIST', 'NEAR_ANGLE']

# defining the preprocessors for scaling data
preprocessors = [('county', 'state','gender_med', 'householdi', 'electronic', 'raceandhis',
       'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
       'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
       'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
       'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
       'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
       'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
       'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
       'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
       'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
       'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
       'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
        'City10Ang', 'City9Dist', 'City9Ang',
       'City8Dist', 'City8Ang', 'City7Dist', 'City7Ang', 'City6Dist',
       'City6Ang', 'City5Dist', 'City5Ang', 'LMi_sig_<0', 'NEAR_DIST', 'NEAR_ANGLE', MinMaxScaler())]

# preparing data for the model
data = prepare_tabulardata(sdf_train,
                           variable_predict='voter_turn_x',
                           explanatory_variables=X, 
                           preprocessors=preprocessors)

C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_automl_26Octb\lib\site-packages\arcgis\learn\_utils\tabular_data.py:1035: UserWarning:

Column county has more than 20 unique value. Sure this is categorical?

C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_automl_26Octb\lib\site-packages\arcgis\learn\_utils\tabular_data.py:1035: UserWarning:

Column state has more than 20 unique value. Sure this is categorical?
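The MinMaxScaler passed in the preprocessors rescales each numeric column to the range 0 to 1. The sketch below applies the same formula, (x - min) / (max - min), by hand to the five householdi values shown in the first table. The helper function is hypothetical, and in the actual workflow the scaler is fitted on the full training data rather than on five values.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a column to [0, 1] using (x - min) / (max - min)."""
    col = np.asarray(column, dtype=float)
    return (col - col.min()) / (col.max() - col.min())

# householdi values of the first five counties, sorted for readability
toy_income = [16876.0, 19360.0, 21785.0, 25553.0, 31429.0]
scaled = min_max_scale(toy_income)
print(np.round(scaled, 3))
```

The smallest value maps to 0, the largest to 1, and the ordering of the values in between is preserved.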

data.show_batch()

	City10Ang	City5Ang	City5Dist	City6Ang	City6Dist	City7Ang	City7Dist	City8Ang	City8Dist	City9Ang	...	miscellane	psychogr_1	psychograp	raceandh_1	raceandhis	state	state_vo_1	state_vote	voter_laws	voter_turn_x
970	-38.097491	160.822180	253980.119335	0.000000	0.000000	27.338939	31509.472596	69.548398	10415.735546	-53.278790	...	19.0	40.48	7.15	27.7	13.36	Kentucky	0.298375	574117.0	nonphotoid	0.651014
1182	30.162913	-65.473209	314965.336032	0.000000	0.000000	-56.759361	38224.808568	-84.374642	60282.910580	-32.611268	...	21.0	44.69	5.98	43.7	23.20	Maryland	0.264164	734759.0	no_doc	0.696712
1761	81.047860	73.563287	96856.433863	0.000000	0.000000	123.794591	863.024943	89.125132	17583.489697	-126.132351	...	37.0	33.98	8.45	67.4	46.10	New Jersey	0.141027	546345.0	no_doc	0.701084
2163	-179.686912	58.574999	104948.236884	0.000000	0.000000	-89.628491	20827.282684	-91.785635	27726.568576	-89.205592	...	11.0	46.90	7.22	48.9	28.15	Oklahoma	0.363912	528761.0	nonphotoid	0.475694
2473	-22.481323	83.390854	47420.679826	143.000934	23345.291534	90.218356	151289.075466	118.863339	49455.218555	-104.587926	...	7.0	47.26	10.38	6.7	2.94	Tennessee	0.260057	652230.0	strict_photoid	0.429226

5 rows × 75 columns

Fitting a random forest model

First, a random forest model is fitted to the new spatial data.

Model Initialization

The MLModel is initialized with the Random Forest model from scikit-learn, along with its model parameters.

# defining the model along with the parameters 
model = MLModel(data, 'sklearn.ensemble.RandomForestRegressor', n_estimators=500, random_state=43)

model.fit()

model.score()

0.7367267033752201

# validating trained model on test dataset
voter_county_mlmodel_predicted = model.predict(sdf_test, prediction_type='dataframe')

voter_county_mlmodel_predicted.head(2)

	FID	Join_Count	TARGET_FID	FIPS	county	state	voter_turn_x	gender_med	householdi	electronic	...	NNeighbors	ZTransform	SpatialLag	LMi_hi_sig	LMi_sig_<0	NEAR_FID	NEAR_DIST	NEAR_ANGLE	SHAPE	prediction_results
115	115	1	116	05067	Jackson	Arkansas	0.371336	41.8	17956.0	3.33	...	44	-2.409135	-1.089202	0	0	27	0.000000	0.000000	{'rings': [[[-10133692, 4284820.000100002], [-...	0.471453
1529	1529	1	1530	29153	Ozark	Missouri	0.591207	51.4	18827.0	3.92	...	42	-0.032009	-0.297026	0	0	43	26336.364318	179.584017	{'rings': [[[-10253900, 4410465.000100002], [-...	0.584314

2 rows × 96 columns

# calculating validation model score
r_square_voter_county_mlmodel_Test = metrics.r2_score(voter_county_mlmodel_predicted['voter_turn_x'], voter_county_mlmodel_predicted['prediction_results']) 
print('r_square_voter_county_mlmodel_Test: ', round(r_square_voter_county_mlmodel_Test,2))

r_square_voter_county_mlmodel_Test:  0.79
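For reference, R-squared compares the squared prediction errors with the squared deviations of the observed values from their mean. The sketch below computes it by hand on a small set of hypothetical values; the function name is illustrative, and the arguments follow the same order as in the call above, with observed values first and predictions second.

```python
import numpy as np

def r2_sketch(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# hypothetical observed and predicted values
observed = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
print(round(r2_sketch(observed, predicted), 4))  # 0.9486
```

A value of 1 means perfect predictions, while a value of 0 means the model does no better than always predicting the mean.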

The validation R-squared of 0.79 for the random forest model is satisfactory, and now AutoML will be used to improve it.

Fitting Using AutoML

The same data obtained using the prepare_tabulardata function is used as input for the AutoML model. Here, the model is initialized using the Compete mode, which is the best performing option of the available modes.

# initializing AutoML model with the Compete mode 
AutoML_voters_county_obj_compete = AutoML(data, eval_metric='r2', mode='Compete', n_jobs=1)

# training the AutoML model
AutoML_voters_county_obj_compete.fit()

Neural Network algorithm was disabled because it doesn't support n_jobs parameter.
AutoML directory: AutoML_8
The task is regression with evaluation metric r2
AutoML will use algorithms: ['Linear', 'Decision Tree', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost']
AutoML will stack models
AutoML will ensemble availabe models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree r2 0.469355 trained in 1.49 seconds
Adjust validation. Remove: 1_DecisionTree
Validation strategy: 10-fold CV Shuffle
* Step simple_algorithms will try to check up to 4 models
1_DecisionTree r2 0.399429 trained in 13.92 seconds
2_DecisionTree r2 0.448983 trained in 13.09 seconds
3_DecisionTree r2 0.445253 trained in 13.18 seconds
4_Linear r2 0.66238 trained in 13.15 seconds
* Step default_algorithms will try to check up to 4 models
5_Default_LightGBM r2 0.770059 trained in 58.69 seconds
6_Default_Xgboost r2 0.766568 trained in 167.82 seconds
7_Default_RandomForest r2 0.587453 trained in 64.63 seconds
8_Default_ExtraTrees r2 0.526491 trained in 23.81 seconds
* Step not_so_random will try to check up to 36 models
18_LightGBM r2 0.777045 trained in 28.4 seconds
9_Xgboost r2 0.751944 trained in 160.33 seconds
27_RandomForest r2 0.585405 trained in 47.67 seconds
36_ExtraTrees r2 0.505466 trained in 22.52 seconds
19_LightGBM r2 0.754122 trained in 21.66 seconds
10_Xgboost r2 0.744709 trained in 243.49 seconds
28_RandomForest r2 0.522452 trained in 41.12 seconds
37_ExtraTrees r2 0.46222 trained in 21.93 seconds
20_LightGBM r2 0.773785 trained in 44.6 seconds
11_Xgboost r2 0.774746 trained in 42.8 seconds
29_RandomForest r2 0.654542 trained in 87.34 seconds
38_ExtraTrees r2 0.611703 trained in 29.63 seconds
21_LightGBM r2 0.760481 trained in 79.12 seconds
12_Xgboost r2 0.760265 trained in 26.37 seconds
30_RandomForest r2 0.653129 trained in 63.72 seconds
39_ExtraTrees r2 0.591046 trained in 25.27 seconds
22_LightGBM r2 0.748574 trained in 22.28 seconds
13_Xgboost r2 0.769551 trained in 34.65 seconds
31_RandomForest r2 0.682243 trained in 74.34 seconds
40_ExtraTrees r2 0.629549 trained in 28.03 seconds
23_LightGBM r2 0.771215 trained in 40.33 seconds
14_Xgboost r2 0.780234 trained in 39.87 seconds
32_RandomForest r2 0.676783 trained in 86.29 seconds
41_ExtraTrees r2 0.638892 trained in 32.57 seconds
24_LightGBM r2 0.774532 trained in 45.83 seconds
15_Xgboost r2 0.770555 trained in 116.43 seconds
* Step kmeans_features will try to check up to 3 models
14_Xgboost_KMeansFeatures r2 0.767433 trained in 43.52 seconds
18_LightGBM_KMeansFeatures r2 0.772894 trained in 38.32 seconds
11_Xgboost_KMeansFeatures r2 0.767903 trained in 58.95 seconds
* Step insert_random_feature will try to check up to 1 model
14_Xgboost_RandomFeature r2 0.773451 trained in 44.52 seconds
Drop features ['City5Dist', 'raceandh_1', 'hispanic_1', 'househol_3', 'households', 'language_1', 'househo_10', 'state_vo_1', 'househol_2', 'City6Dist', 'gender_med', 'househol_9', 'disposab_3', 'househol_5', 'random_feature', 'hispanicor', 'language_2', 'disposab_2', 'disposab_4', 'househol_6', 'househol_4']
* Step features_selection will try to check up to 4 models
14_Xgboost_SelectedFeatures r2 0.774603 trained in 32.27 seconds
18_LightGBM_SelectedFeatures r2 0.781371 trained in 30.29 seconds
31_RandomForest_SelectedFeatures r2 0.683045 trained in 64.27 seconds
41_ExtraTrees_SelectedFeatures r2 0.640023 trained in 32.86 seconds
* Step hill_climbing_1 will try to check up to 20 models
42_LightGBM_SelectedFeatures r2 0.78418 trained in 33.8 seconds
43_Xgboost r2 0.775862 trained in 40.78 seconds
44_Xgboost r2 0.766297 trained in 30.97 seconds
45_LightGBM r2 0.777641 trained in 36.05 seconds
46_Xgboost r2 0.776048 trained in 54.6 seconds
47_Xgboost r2 0.770999 trained in 41.35 seconds
48_Xgboost_SelectedFeatures r2 0.77757 trained in 38.38 seconds
49_Xgboost_SelectedFeatures r2 0.765088 trained in 29.81 seconds
50_LightGBM r2 0.779526 trained in 38.57 seconds
* Step hill_climbing_2 will try to check up to 17 models
51_LightGBM_SelectedFeatures r2 0.782666 trained in 32.02 seconds
52_LightGBM_SelectedFeatures r2 0.783625 trained in 38.86 seconds
53_LightGBM_SelectedFeatures r2 0.784112 trained in 31.33 seconds
54_LightGBM_SelectedFeatures r2 0.783465 trained in 33.09 seconds
55_Xgboost r2 0.773724 trained in 44.82 seconds
56_Xgboost r2 0.768679 trained in 38.39 seconds
57_LightGBM r2 0.780059 trained in 43.84 seconds
58_LightGBM r2 0.780877 trained in 43.36 seconds
* Step boost_on_errors will try to check up to 1 model
42_LightGBM_SelectedFeatures_BoostOnErrors r2 0.783428 trained in 34.72 seconds
* Step ensemble will try to check up to 1 model
Ensemble r2 0.792354 trained in 8.42 seconds
* Step stack will try to check up to 35 models
42_LightGBM_SelectedFeatures_Stacked r2 0.780498 trained in 26.64 seconds
14_Xgboost_Stacked r2 0.783106 trained in 28.13 seconds
31_RandomForest_SelectedFeatures_Stacked r2 0.786935 trained in 114.92 seconds
41_ExtraTrees_SelectedFeatures_Stacked r2 0.791264 trained in 39.03 seconds
53_LightGBM_SelectedFeatures_Stacked r2 0.783177 trained in 24.41 seconds
48_Xgboost_SelectedFeatures_Stacked r2 0.783833 trained in 29.0 seconds
31_RandomForest_Stacked r2 0.790533 trained in 109.56 seconds
41_ExtraTrees_Stacked r2 0.791692 trained in 41.03 seconds
52_LightGBM_SelectedFeatures_Stacked r2 0.779776 trained in 27.97 seconds
46_Xgboost_Stacked r2 0.776083 trained in 38.06 seconds
32_RandomForest_Stacked r2 0.786649 trained in 144.92 seconds
* Step ensemble_stacked will try to check up to 1 model
Ensemble_Stacked r2 0.795538 trained in 12.06 seconds
AutoML fit time: 3621.83 seconds
AutoML best model: Ensemble_Stacked
All the evaluated models are saved in the path  C:\Users\sup10432\review_notebooks\voters_turnout\part II\2\AutoML_8

Here, the stacked ensemble (Ensemble_Stacked) is the best model, with a validation R-squared of about 0.80 (0.795538 in the log above). This score shows the final improvement achieved after including the new, spatially engineered variables. The best model diagnostics and related reports, like feature importance, model performance, etc., are saved in the folder mentioned in the output message for further reference.

# train score of the model
AutoML_voters_county_obj_compete.score()

0.9614068198030955

Model output

# The output diagnostics can also be printed in a report form
AutoML_voters_county_obj_compete.report()

C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_automl_26Octb\lib\site-packages\arcgis\learn\models\_auto_ml.py:284: UserWarning:

In case the report html is not rendered appropriately in the notebook, the same can be found in the path AutoML_8\README.html

AutoML Leaderboard

Best model	name	model_type	metric_type	metric_value	train_time
	1_DecisionTree	Decision Tree	r2	0.399429	14.33
	2_DecisionTree	Decision Tree	r2	0.448983	13.5
	3_DecisionTree	Decision Tree	r2	0.445253	13.6
	4_Linear	Linear	r2	0.66238	13.57
	5_Default_LightGBM	LightGBM	r2	0.770059	59.24
	6_Default_Xgboost	Xgboost	r2	0.766568	168.39
	7_Default_RandomForest	Random Forest	r2	0.587453	65.2
	8_Default_ExtraTrees	Extra Trees	r2	0.526491	24.35
	18_LightGBM	LightGBM	r2	0.777045	28.97
	9_Xgboost	Xgboost	r2	0.751944	160.86
	27_RandomForest	Random Forest	r2	0.585405	48.18
	36_ExtraTrees	Extra Trees	r2	0.505466	23.13
	19_LightGBM	LightGBM	r2	0.754122	22.17
	10_Xgboost	Xgboost	r2	0.744709	244.06
	28_RandomForest	Random Forest	r2	0.522452	41.72
	37_ExtraTrees	Extra Trees	r2	0.46222	22.43
	20_LightGBM	LightGBM	r2	0.773785	45.16
	11_Xgboost	Xgboost	r2	0.774746	43.31
	29_RandomForest	Random Forest	r2	0.654542	87.83
	38_ExtraTrees	Extra Trees	r2	0.611703	30.18
	21_LightGBM	LightGBM	r2	0.760481	79.71
	12_Xgboost	Xgboost	r2	0.760265	26.88
	30_RandomForest	Random Forest	r2	0.653129	64.25
	39_ExtraTrees	Extra Trees	r2	0.591046	25.79
	22_LightGBM	LightGBM	r2	0.748574	22.81
	13_Xgboost	Xgboost	r2	0.769551	35.16
	31_RandomForest	Random Forest	r2	0.682243	74.91
	40_ExtraTrees	Extra Trees	r2	0.629549	28.58
	23_LightGBM	LightGBM	r2	0.771215	40.85
	14_Xgboost	Xgboost	r2	0.780234	40.45
	32_RandomForest	Random Forest	r2	0.676783	86.82
	41_ExtraTrees	Extra Trees	r2	0.638892	33.17
	24_LightGBM	LightGBM	r2	0.774532	46.36
	15_Xgboost	Xgboost	r2	0.770555	117
	14_Xgboost_KMeansFeatures	Xgboost	r2	0.767433	44.16
	18_LightGBM_KMeansFeatures	LightGBM	r2	0.772894	38.96
	11_Xgboost_KMeansFeatures	Xgboost	r2	0.767903	59.52
	14_Xgboost_RandomFeature	Xgboost	r2	0.773451	45.62
	14_Xgboost_SelectedFeatures	Xgboost	r2	0.774603	32.83
	18_LightGBM_SelectedFeatures	LightGBM	r2	0.781371	30.84
	31_RandomForest_SelectedFeatures	Random Forest	r2	0.683045	64.85
	41_ExtraTrees_SelectedFeatures	Extra Trees	r2	0.640023	33.41
	42_LightGBM_SelectedFeatures	LightGBM	r2	0.78418	34.36
	43_Xgboost	Xgboost	r2	0.775862	41.33
	44_Xgboost	Xgboost	r2	0.766297	31.5
	45_LightGBM	LightGBM	r2	0.777641	36.61
	46_Xgboost	Xgboost	r2	0.776048	55.15
	47_Xgboost	Xgboost	r2	0.770999	41.84
	48_Xgboost_SelectedFeatures	Xgboost	r2	0.77757	39.02
	49_Xgboost_SelectedFeatures	Xgboost	r2	0.765088	30.36
	50_LightGBM	LightGBM	r2	0.779526	39.18
	51_LightGBM_SelectedFeatures	LightGBM	r2	0.782666	32.57
	52_LightGBM_SelectedFeatures	LightGBM	r2	0.783625	39.45
	53_LightGBM_SelectedFeatures	LightGBM	r2	0.784112	31.89
	54_LightGBM_SelectedFeatures	LightGBM	r2	0.783465	33.66
	55_Xgboost	Xgboost	r2	0.773724	45.43
	56_Xgboost	Xgboost	r2	0.768679	38.96
	57_LightGBM	LightGBM	r2	0.780059	44.44
	58_LightGBM	LightGBM	r2	0.780877	43.93
	42_LightGBM_SelectedFeatures_BoostOnErrors	LightGBM	r2	0.783428	35.28
	Ensemble	Ensemble	r2	0.792354	8.42
	42_LightGBM_SelectedFeatures_Stacked	LightGBM	r2	0.780498	27.23
	14_Xgboost_Stacked	Xgboost	r2	0.783106	28.67
	31_RandomForest_SelectedFeatures_Stacked	Random Forest	r2	0.786935	115.47
	41_ExtraTrees_SelectedFeatures_Stacked	Extra Trees	r2	0.791264	39.52
	53_LightGBM_SelectedFeatures_Stacked	LightGBM	r2	0.783177	25.04
	48_Xgboost_SelectedFeatures_Stacked	Xgboost	r2	0.783833	29.56
	31_RandomForest_Stacked	Random Forest	r2	0.790533	110.11
	41_ExtraTrees_Stacked	Extra Trees	r2	0.791692	41.59
	52_LightGBM_SelectedFeatures_Stacked	LightGBM	r2	0.779776	28.53
	46_Xgboost_Stacked	Xgboost	r2	0.776083	38.55
	32_RandomForest_Stacked	Random Forest	r2	0.786649	145.44
the best	Ensemble_Stacked	Ensemble	r2	0.795538	12.06
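
The leaderboard can also be ranked programmatically. The following is a minimal sketch, not part of the original notebook, that uses a handful of rows copied from the table above. The `LeaderboardRow` type and the variable names are illustrative assumptions, and only the Python standard library is used.

```python
from typing import NamedTuple


class LeaderboardRow(NamedTuple):
    """Hypothetical container for one leaderboard entry."""
    name: str
    model_type: str
    r2: float          # validation R-squared reported by AutoML
    train_time: float  # seconds


# A few rows copied by hand from the leaderboard above (not the full table).
rows = [
    LeaderboardRow("1_DecisionTree", "Decision Tree", 0.399429, 14.33),
    LeaderboardRow("4_Linear", "Linear", 0.66238, 13.57),
    LeaderboardRow("42_LightGBM_SelectedFeatures", "LightGBM", 0.78418, 34.36),
    LeaderboardRow("14_Xgboost", "Xgboost", 0.780234, 40.45),
    LeaderboardRow("14_Xgboost_Stacked", "Xgboost", 0.783106, 28.67),
    LeaderboardRow("31_RandomForest_Stacked", "Random Forest", 0.790533, 110.11),
    LeaderboardRow("41_ExtraTrees_Stacked", "Extra Trees", 0.791692, 41.59),
    LeaderboardRow("Ensemble", "Ensemble", 0.792354, 8.42),
    LeaderboardRow("Ensemble_Stacked", "Ensemble", 0.795538, 12.06),
]

# Rank the models from highest to lowest validation R-squared.
ranked = sorted(rows, key=lambda row: row.r2, reverse=True)

# Keep the best-scoring entry for each model type. Because `ranked` is
# already sorted, the first entry seen for a type is its best one.
best_by_type = {}
for row in ranked:
    best_by_type.setdefault(row.model_type, row)

for row in best_by_type.values():
    print(f"{row.model_type:<14} {row.name:<30} r2={row.r2:.4f}")
```

In this excerpt, `ranked[0]` is `Ensemble_Stacked`, which matches the best model reported by AutoML.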

AutoML Performance

AutoML Performance Boxplot

Spearman Correlation of Models

Metric details of individual models

Model	MAE	MSE	RMSE	R2	MAPE
10_Xgboost	0.0340579	0.00218046	0.0466954	0.744709	0.0602362
11_Xgboost	0.03173	0.00192391	0.0438624	0.774746	0.0559427
11_Xgboost_KMeansFeatures	0.0324512	0.00198235	0.0445236	0.767903	0.0573705
12_Xgboost	0.0328491	0.0020476	0.0452504	0.760265	0.0577756
13_Xgboost	0.0321627	0.00196828	0.0443653	0.769551	0.0567343
14_Xgboost	0.0316635	0.00187704	0.0433248	0.780234	0.0557259
14_Xgboost_KMeansFeatures	0.0319659	0.00198637	0.0445687	0.767433	0.0563932
14_Xgboost_RandomFeature	0.0319354	0.00193497	0.0439883	0.773451	0.0560987
14_Xgboost_SelectedFeatures	0.0317485	0.00192513	0.0438763	0.774603	0.0558524
14_Xgboost_Stacked	0.0310336	0.0018525	0.0430407	0.783106	0.0550723
15_Xgboost	0.0321593	0.00195971	0.0442686	0.770555	0.0568882
18_LightGBM	0.0314767	0.00190427	0.043638	0.777045	0.0555492
18_LightGBM_KMeansFeatures	0.0317851	0.00193973	0.0440423	0.772894	0.056089
18_LightGBM_SelectedFeatures	0.0311585	0.00186732	0.0432125	0.781371	0.0549145
19_LightGBM	0.0330713	0.00210006	0.0458264	0.754122	0.0583661
1_DecisionTree	0.0555485	0.00512952	0.0716207	0.399429	0.0978215
20_LightGBM	0.0318725	0.00193212	0.0439559	0.773785	0.056256
21_LightGBM	0.0329359	0.00204574	0.0452299	0.760481	0.0581459
22_LightGBM	0.0338231	0.00214744	0.0463405	0.748574	0.0596131
23_LightGBM	0.032038	0.00195407	0.0442049	0.771215	0.0566979
24_LightGBM	0.0316783	0.00192574	0.0438833	0.774532	0.0558973
27_RandomForest	0.045692	0.00354109	0.059507	0.585405	0.0812284
28_RandomForest	0.0491693	0.00407877	0.0638653	0.522452	0.0873597
29_RandomForest	0.0411637	0.00295058	0.0543193	0.654542	0.0730317
2_DecisionTree	0.0526576	0.00470628	0.0686023	0.448983	0.0927066
30_RandomForest	0.0412679	0.00296265	0.0544302	0.653129	0.0733034
31_RandomForest	0.0391919	0.00271399	0.052096	0.682243	0.069494
31_RandomForest_SelectedFeatures	0.0392941	0.00270713	0.0520301	0.683045	0.0696175
31_RandomForest_SelectedFeatures_Stacked	0.0305907	0.0018198	0.0426591	0.786935	0.0538265
31_RandomForest_Stacked	0.0305075	0.00178907	0.0422974	0.790533	0.0536043
32_RandomForest	0.0395469	0.00276061	0.0525416	0.676783	0.0700435
32_RandomForest_Stacked	0.0307335	0.00182224	0.0426877	0.786649	0.0540244
36_ExtraTrees	0.0499657	0.00422385	0.0649912	0.505466	0.0890381
37_ExtraTrees	0.0523113	0.00459321	0.0677733	0.46222	0.0931283
38_ExtraTrees	0.0438293	0.00331648	0.0575889	0.611703	0.0777161
39_ExtraTrees	0.0451548	0.0034929	0.0591008	0.591046	0.0800956
3_DecisionTree	0.0527654	0.00473813	0.0688341	0.445253	0.0929209
40_ExtraTrees	0.0425645	0.00316404	0.0562498	0.629549	0.0754384
41_ExtraTrees	0.0419765	0.00308425	0.055536	0.638892	0.0743221

Voter turnout prediction & Validation

# validating trained model on test dataset
voter_county_automl_predicted = AutoML_voters_county_obj_compete.predict(sdf_test, prediction_type='dataframe')

voter_county_automl_predicted.head(2)

	FID	Join_Count	TARGET_FID	FIPS	county	state	voter_turn_x	gender_med	householdi	electronic	...	NNeighbors	ZTransform	SpatialLag	LMi_hi_sig	LMi_sig_<0	NEAR_FID	NEAR_DIST	NEAR_ANGLE	SHAPE	prediction_results
115	115	1	116	05067	Jackson	Arkansas	0.371336	41.8	17956.0	3.33	...	44	-2.409135	-1.089202	0	0	27	0.000000	0.000000	{'rings': [[[-10133692, 4284820.000100002], [-...	0.423994
1529	1529	1	1530	29153	Ozark	Missouri	0.591207	51.4	18827.0	3.92	...	42	-0.032009	-0.297026	0	0	43	26336.364318	179.584017	{'rings': [[[-10253900, 4410465.000100002], [-...	0.584422

2 rows × 96 columns

Estimate model metrics for validation

import sklearn.metrics as metrics

# r2_score expects the observed values first, followed by the predictions
r_square_voter_county_automl_Test = metrics.r2_score(voter_county_automl_predicted['voter_turn_x'], voter_county_automl_predicted['prediction_results'])
print('r_square_voter_county_automl_Test: ', round(r_square_voter_county_automl_Test,2))

r_square_voter_county_automl_Test:  0.84
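
For readers who want to see what the R-squared score measures, here is a minimal sketch that computes R-squared, MAE, and RMSE from their definitions using only the Python standard library. It is not part of the original notebook: the function name `regression_metrics` is hypothetical, and the two sample values are the rows displayed by `head(2)` above, used purely for illustration, so the result does not reproduce the score on the full test dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R-squared, MAE and RMSE for two equal-length sequences.

    Note: R-squared is undefined (division by zero) if all observed
    values are identical.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Illustration only: observed turnout (voter_turn_x) and predictions
# (prediction_results) for the two counties shown by head(2).
observed = [0.371336, 0.591207]
predicted = [0.423994, 0.584422]

sample_metrics = regression_metrics(observed, predicted)
print(sample_metrics)
```

Passing the full `voter_turn_x` and `prediction_results` columns as lists should give the same R-squared as `metrics.r2_score`, up to floating-point rounding.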

Conclusion

In the first part of this notebook series, AutoML was applied to a regression problem and achieved a considerable improvement over conventional modeling workflows. In this notebook, the model's fit was further improved by extracting the spatial patterns in the voter turnout dataset and including them as additional spatial features.

The spatial feature engineering consisted of measuring the spatial autocorrelation in the data with the Cluster and Outlier Analysis (Anselin Local Moran's I) tool from ArcPy, then calculating the distance and angle from each county to the counties with highly significant clustering. Including these new spatial variables improved the model further, giving an R-squared of 0.84 on the test dataset. The same process could be applied to other spatial datasets.
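
To make the idea of local spatial autocorrelation concrete, the sketch below computes a textbook form of Local Moran's I on a toy example. It is an illustration under stated assumptions, not the ArcPy tool's implementation: the six values, the chain-shaped neighbor lists, and the function name `local_morans_i` are hypothetical, and the spatial weights are simply row-standardized (each neighbor of a location gets an equal share).

```python
from statistics import fmean


def local_morans_i(values, neighbors):
    """Return (z_scores, spatial_lags, local_i) for each location.

    values    : one numeric observation per location
    neighbors : neighbors[i] is the list of indices adjacent to location i
    """
    n = len(values)
    mean = fmean(values)
    deviations = [v - mean for v in values]
    # Population standard deviation, used to standardize the values.
    std = (sum(d * d for d in deviations) / n) ** 0.5
    z_scores = [d / std for d in deviations]
    # Row-standardized spatial lag: the average z-score of the neighbors.
    spatial_lags = [fmean([z_scores[j] for j in neighbors[i]]) for i in range(n)]
    # Local Moran's I: a location's z-score times its spatial lag.
    local_i = [z_scores[i] * spatial_lags[i] for i in range(n)]
    return z_scores, spatial_lags, local_i


# Hypothetical turnout values for six locations arranged in a chain:
# three high values followed by three low values.
turnout = [0.70, 0.68, 0.72, 0.40, 0.42, 0.38]
chain_neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]

z, lags, local_i = local_morans_i(turnout, chain_neighbors)
for idx, value in enumerate(local_i):
    print(f"location {idx}: local I = {value:+.3f}")
```

Positive values mark locations surrounded by similar values (high next to high, or low next to low), while negative values mark locations whose neighbors differ from them, as happens here at the boundary between the high and low groups.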

Data resources & References

Reference	Source	Link
Voters turnout by county for 2016 US general election	Esri	https://www.arcgis.com/home/item.html?id=650e7d6aa8fb4601a75d632a2c114425
