Introduction
The objective of this notebook is to demonstrate the application of AutoML to tabular data and show the improvements that can be achieved with this method over conventional workflows. The newly added AutoML module in arcgis.learn is based on the MLJAR library, which automates algorithm selection, data preprocessing, model training, model explainability, and final model evaluation. With these functionalities, it can perform automatic exploratory data analysis, algorithm selection, and hyperparameter tuning to find the best model. Automatic documentation can also be generated as markdown reports, with details about all of the evaluated models.
Once the desired improvements are obtained using AutoML, the result will be further enhanced using spatial feature engineering in the second part of the notebook.
The dataset used here is the percentage voter turnout by county in the 2016 United States general election, which will be predicted using the demographic and socioeconomic characteristics of US counties.
Imports
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Image, HTML
from sklearn.preprocessing import MinMaxScaler,RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
from fastai.imports import *
from datetime import datetime as dt
import arcgis
from arcgis.gis import GIS
from arcgis.learn import prepare_tabulardata, AutoML, MLModel
import arcpy
Connecting to ArcGIS
gis = GIS("home")
Accessing & Visualizing datasets
Here, the 2016 election data is downloaded from the portal as a zipped shapefile, which is then unzipped and processed as follows.
voter_zip = gis.content.get('650e7d6aa8fb4601a75d632a2c114425')
voter_zip
import os, zipfile
filepath_new = voter_zip.download(file_name=voter_zip.name)
with zipfile.ZipFile(filepath_new, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath_new).parent)
# building the path to the extracted shapefile
output_path = Path(os.path.splitext(filepath_new)[0])
output_path = os.path.join(output_path, "VotersTurnoutCountyEelction2016.shp")
output_path
'C:\\Users\\sup10432\\AppData\\Local\\Temp\\VotersTurnoutCountyEelction2016\\VotersTurnoutCountyEelction2016.shp'
The attribute table contains the voter turnout data by county for the entire US, which is extracted here as a pandas DataFrame. The voter_turn field in the DataFrame contains the voter turnout in percentage terms for each county in the 2016 election. This will be used as the dependent variable and will be predicted using the various demographic and socioeconomic variables for each county.
# getting the attribute table from the shapefile which will be used for building the model
sdf_main = pd.DataFrame.spatial.from_featureclass(output_path)
sdf_main.head()
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | NNeighbors | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 01001 | Autauga | Alabama | 0.613738 | 38.6 | 25553.0 | 4.96 | ... | 44 | 0.211580 | 0.154568 | 0 | 0 | 2.496745e+05 | 2.208598e+09 | 133735.292502 | 0 | {"rings": [[[-9619465, 3856529.0001000017], [-... |
1 | 1 | 1 | 2 | 01003 | Baldwin | Alabama | 0.627364 | 42.9 | 31429.0 | 4.64 | ... | 22 | 0.358894 | 0.057952 | 0 | 0 | 1.642763e+06 | 5.671096e+09 | 241925.196426 | 3 | {"rings": [[[-9746859, 3539643.0001000017], [-... |
2 | 2 | 1 | 3 | 01005 | Barbour | Alabama | 0.513816 | 40.2 | 16876.0 | 3.49 | ... | 62 | -0.868722 | -0.498354 | 1 | 1 | 3.202971e+05 | 3.257816e+09 | 0.000000 | 0 | {"rings": [[[-9468394, 3771591.0001000017], [-... |
3 | 3 | 1 | 4 | 01007 | Bibb | Alabama | 0.501364 | 39.3 | 19360.0 | 3.64 | ... | 43 | -1.003341 | 0.286440 | 0 | 0 | 2.279101e+05 | 2.311955e+09 | 170214.485759 | 7 | {"rings": [[[-9692114, 3928124.0001000017], [-... |
4 | 4 | 1 | 5 | 01009 | Blount | Alabama | 0.603064 | 40.9 | 21785.0 | 3.86 | ... | 51 | 0.096177 | -0.336198 | 0 | 1 | 2.918753e+05 | 2.456919e+09 | 21128.568784 | 7 | {"rings": [[[-9623907, 4063676.0001000017], [-... |
5 rows × 97 columns
sdf_main.shape
(3112, 97)
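Before modeling, a quick summary of the dependent variable helps confirm its scale; note that voter_turn stores turnout as a fraction between 0 and 1, so 0.55 corresponds to 55%. A minimal sketch, assuming sdf_main is loaded as above:
# summarizing the distribution of the dependent variable
sdf_main['voter_turn'].describe()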
The data is visualized here by mapping the voter turnout field into five classes. A belt running through the southeastern part of the country stands out, with comparatively low voter turnout of less than 55%.
# visualizing voter turnout percentage by county
m1 = GIS().map('United States', zoomlevel=4)
sdf_main.spatial.plot(map_widget=m1, renderer_type='c', col='voter_turn', line_width=0.2,
                      method='esriClassifyNaturalBreaks', class_count=5, cmap='gist_heat_r', alpha=0.7)
m1.legend = True
m1

Model Building
Once the dataset is divided into training and test sets, the training data is ready to be used for modeling.
Train-Test Data split
The dataset above has 3112 samples, each representing a US county, its voter turnout, and the related variables. Next, it will be split into training and test datasets in a 90:10 ratio, for training and validation respectively.
# Splitting data with a test size of 10% for validation
test_size = 0.10
sdf_train_base, sdf_test_base = train_test_split(sdf_main, test_size = test_size, random_state=42)
sdf_train_base.head(2)
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | NNeighbors | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1798 | 1798 | 1 | 1799 | 36001 | Albany | New York | 0.587546 | 40.1 | 38227.0 | 5.00 | ... | 31 | -0.071597 | -0.099099 | 0 | 0 | 225997.958853 | 2.550720e+09 | 150330.823825 | 589 | {"rings": [[[-8201660, 5279044.000100002], [-8... |
1003 | 1003 | 1 | 1004 | 21081 | Grant | Kentucky | 0.541740 | 36.9 | 20490.0 | 3.59 | ... | 87 | -0.566824 | -0.125723 | 0 | 0 | 148480.244249 | 1.108905e+09 | 84976.878589 | 339 | {"rings": [[[-9419382, 4693378.000100002], [-9... |
2 rows × 97 columns
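A quick shape check confirms the 90/10 split of the 3112 counties:
# confirming the sizes of the training and test sets
sdf_train_base.shape, sdf_test_base.shape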
# checking the columns in the dataset
sdf_main.columns
Index(['FID', 'Join_Count', 'TARGET_FID', 'FIPS', 'county', 'state', 'voter_turn', 'gender_med', 'householdi', 'electronic', 'raceandhis', 'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3', 'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3', 'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor', 'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1', 'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6', 'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1', 'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6', 'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2', 'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6', 'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1', 'financia_1', 'miscellane', 'state_vote', 'state_vo_1', 'randomized', 'random_num', 'City10Dist', 'City10Ang', 'City9Dist', 'City9Ang', 'City8Dist', 'City8Ang', 'City7Dist', 'City7Ang', 'City6Dist', 'City6Ang', 'City5Dist', 'City5Ang', 'SOURCE_ID', 'voter_tu_1', 'Shape_Leng', 'Shape_Area', 'LMiIndex', 'LMiZScore', 'LMiPValue', 'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'LMi_hi_sig', 'LMi_normal', 'Shape_Le_1', 'Shape_Ar_1', 'LMiHiDist', 'NEAR_FID', 'SHAPE'], dtype='object')
Data Preparation
First, a list of explanatory variables is chosen, consisting of the feature data that will be used to predict voter turnout. By default, variables are treated as continuous; for a categorical variable, the variable name is passed inside a tuple along with the value True. Here county, state, and voter_laws are categorical variables.
# listing explanatory variables
X =[('county',True), ('state',True),'gender_med', 'householdi', 'electronic', 'raceandhis',
('voter_laws',True), 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
'City10Dist','City9Dist', 'City8Dist', 'City7Dist','City6Dist',
'City5Dist']
The preprocessor uses a scaler to transform the explanatory variables, which is defined as follows:
# defining the preprocessors for scaling data
preprocessors = [('county', 'state','gender_med', 'householdi', 'electronic', 'raceandhis',
'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
'City10Dist', 'City9Dist',
'City8Dist', 'City7Dist','City6Dist',
'City5Dist', MinMaxScaler())]
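MinMaxScaler rescales each listed column to the [0, 1] range. The RobustScaler imported earlier is a drop-in alternative when outlier counties distort column ranges; a hedged sketch for a few of the columns, illustrative only — the rest of the notebook keeps MinMaxScaler:
# alternative preprocessor: RobustScaler centers on the median and scales by the
# interquartile range, making it less sensitive to outliers (not used below)
preprocessors_robust = [('gender_med', 'householdi', 'electronic', RobustScaler())]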
Finally, using the list of explanatory variables above, the preprocessors, and the prediction variable of voter turnout, the prepare_tabulardata function prepares the data to be fed into the model.
# preparing data for the model
data_base_model = prepare_tabulardata(sdf_train_base,
variable_predict='voter_turn',
explanatory_variables=X,
preprocessors=preprocessors)
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_automl_26Octb\lib\site-packages\arcgis\learn\_utils\tabular_data.py:1035: UserWarning: Column county has more than 20 unique value. Sure this is categorical? C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_automl_26Octb\lib\site-packages\arcgis\learn\_utils\tabular_data.py:1035: UserWarning: Column state has more than 20 unique value. Sure this is categorical?
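A few rows of the prepared data can be previewed before training. This assumes the show_batch method is available on the returned tabular data object, as in other arcgis.learn tabular workflows:
# previewing a sample of the prepared training data
data_base_model.show_batch()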
Fitting a random forest model
First, a random forest model is fitted to the data and its performance is measured.
Model Initialization
The MLModel is initialized with the RandomForestRegressor model from scikit-learn, along with its model parameters.
# defining the model along with the parameters
model = MLModel(data_base_model, 'sklearn.ensemble.RandomForestRegressor', n_estimators=500, random_state=43)
model.fit()
model.score()
0.6388235590727049
# validating trained model on test dataset
voter_county_mlmodel_predicted = model.predict(sdf_test_base, prediction_type='dataframe')
voter_county_mlmodel_predicted.head(2)
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE | prediction_results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
557 | 557 | 1 | 558 | 16073 | Owyhee | Idaho | 0.529332 | 37.1 | 19701.0 | 3.66 | ... | -0.700966 | -0.496409 | 0 | 0 | 942771.682539 | 3.679323e+10 | 484820.245797 | 672 | {"rings": [[[-12970046, 5356298.000100002], [-... | 0.533127 |
416 | 416 | 1 | 417 | 13119 | Franklin | Georgia | 0.506977 | 42.2 | 18965.0 | 3.44 | ... | -0.942663 | -0.089913 | 0 | 0 | 152970.676619 | 1.015526e+09 | 129997.253626 | 591 | {"rings": [[[-9245266, 4095289.0001000017], [-... | 0.533401 |
2 rows × 98 columns
# calculating validation model score
r_square_voter_county_mlmodel_Test = metrics.r2_score(voter_county_mlmodel_predicted['voter_turn'], voter_county_mlmodel_predicted['prediction_results'])
print('r_square_voter_county_mlmodel_Test: ', round(r_square_voter_county_mlmodel_Test,2))
r_square_voter_county_mlmodel_Test: 0.71
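R-square alone does not convey the magnitude of the errors in turnout terms, so MAE and RMSE can also be computed from the same predictions. A short sketch reusing the validation dataframe above:
# computing additional error metrics for the random forest validation predictions
import numpy as np
mae = metrics.mean_absolute_error(voter_county_mlmodel_predicted['voter_turn'],
                                  voter_county_mlmodel_predicted['prediction_results'])
rmse = np.sqrt(metrics.mean_squared_error(voter_county_mlmodel_predicted['voter_turn'],
                                          voter_county_mlmodel_predicted['prediction_results']))
print('MAE:', round(mae, 4), 'RMSE:', round(rmse, 4))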
The validation r-square for the random forest model is satisfactory; next, AutoML will be used to improve on it.
Fitting Using AutoML
The same data obtained using the prepare_tabulardata function is next used as input for the AutoML model. Of the various AutoML modes available, the Compete mode is used here; it applies 10-fold cross-validation (CV) and evaluates Decision Tree, Random Forest, Extra Trees, XGBoost, Neural Network, Nearest Neighbors, Ensemble, and Stacking models to achieve higher machine learning accuracy.
# initializing AutoML model with the Compete mode
AutoML_voters_county_base_compete = AutoML(data_base_model, eval_metric='r2', mode='Compete', n_jobs=1)
In the above initialization, the Compete mode is selected out of the three available modes: Explain, Perform, and Compete. While Compete is the best-performing mode, it also consumes a significant amount of time and resources, so it is only recommended when the best possible results are necessary. In other cases, the Explain or Perform modes can be used for a faster, more basic fit.
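For comparison, a lower-budget run only requires changing the mode argument; a minimal sketch, not executed in this notebook:
# a faster alternative for exploratory runs (illustrative only)
# AutoML_voters_county_base_explain = AutoML(data_base_model, eval_metric='r2', mode='Explain')
# AutoML_voters_county_base_explain.fit()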
# training the AutoML model
AutoML_voters_county_base_compete.fit()
Neural Network algorithm was disabled because it doesn't support n_jobs parameter.
AutoML directory: AutoML_1
The task is regression with evaluation metric r2
AutoML will use algorithms: ['Linear', 'Decision Tree', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost']
AutoML will stack models
AutoML will ensemble availabe models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree r2 0.278123 trained in 0.82 seconds
Adjust validation. Remove: 1_DecisionTree
Validation strategy: 10-fold CV Shuffle
* Step simple_algorithms will try to check up to 4 models
1_DecisionTree r2 0.403535 trained in 6.96 seconds
2_DecisionTree r2 0.458642 trained in 6.55 seconds
3_DecisionTree r2 0.458642 trained in 6.7 seconds
4_Linear r2 0.661466 trained in 6.97 seconds
* Step default_algorithms will try to check up to 4 models
5_Default_LightGBM r2 0.772467 trained in 39.19 seconds
6_Default_Xgboost r2 0.76779 trained in 130.98 seconds
7_Default_RandomForest r2 0.587323 trained in 58.24 seconds
8_Default_ExtraTrees r2 0.532677 trained in 16.98 seconds
* Step not_so_random will try to check up to 36 models
18_LightGBM r2 0.784405 trained in 25.39 seconds
9_Xgboost r2 0.759364 trained in 137.63 seconds
27_RandomForest r2 0.585097 trained in 47.95 seconds
36_ExtraTrees r2 0.525521 trained in 14.03 seconds
19_LightGBM r2 0.756595 trained in 15.0 seconds
10_Xgboost r2 0.74418 trained in 207.35 seconds
28_RandomForest r2 0.527252 trained in 39.33 seconds
37_ExtraTrees r2 0.465072 trained in 19.44 seconds
20_LightGBM r2 0.774992 trained in 46.14 seconds
11_Xgboost r2 0.778035 trained in 35.9 seconds
29_RandomForest r2 0.660641 trained in 81.68 seconds
38_ExtraTrees r2 0.610569 trained in 30.66 seconds
21_LightGBM r2 0.765569 trained in 76.77 seconds
12_Xgboost r2 0.767627 trained in 21.05 seconds
30_RandomForest r2 0.652762 trained in 55.28 seconds
39_ExtraTrees r2 0.593974 trained in 18.17 seconds
22_LightGBM r2 0.759384 trained in 15.49 seconds
13_Xgboost r2 0.779413 trained in 30.12 seconds
31_RandomForest r2 0.683753 trained in 70.05 seconds
40_ExtraTrees r2 0.629244 trained in 22.78 seconds
23_LightGBM r2 0.779126 trained in 45.39 seconds
14_Xgboost r2 0.772772 trained in 27.06 seconds
32_RandomForest r2 0.677973 trained in 114.41 seconds
41_ExtraTrees r2 0.637788 trained in 26.65 seconds
24_LightGBM r2 0.775953 trained in 35.92 seconds
15_Xgboost r2 0.772261 trained in 106.44 seconds
33_RandomForest r2 0.585173 trained in 53.87 seconds
42_ExtraTrees r2 0.52614 trained in 19.97 seconds
25_LightGBM r2 0.779734 trained in 27.09 seconds
16_Xgboost r2 0.779104 trained in 35.84 seconds
34_RandomForest r2 0.586946 trained in 47.75 seconds
43_ExtraTrees r2 0.51673 trained in 19.5 seconds
* Step kmeans_features will try to check up to 3 models
18_LightGBM_KMeansFeatures r2 0.774187 trained in 30.75 seconds
25_LightGBM_KMeansFeatures r2 0.770623 trained in 30.81 seconds
13_Xgboost_KMeansFeatures r2 0.773129 trained in 43.43 seconds
* Step insert_random_feature will try to check up to 1 model
18_LightGBM_RandomFeature r2 0.779158 trained in 55.47 seconds
Drop features ['households', 'disposab_9', 'City10Dist', 'househo_10', 'househol_2', 'househol_4', 'disposab_2', 'random_feature', 'state_vo_1', 'language_2']
* Step features_selection will try to check up to 4 models
18_LightGBM_SelectedFeatures r2 0.78134 trained in 25.4 seconds
13_Xgboost_SelectedFeatures r2 0.78038 trained in 32.15 seconds
31_RandomForest_SelectedFeatures r2 0.684658 trained in 63.17 seconds
41_ExtraTrees_SelectedFeatures r2 0.639667 trained in 26.54 seconds
* Step hill_climbing_1 will try to check up to 22 models
44_LightGBM r2 0.777434 trained in 20.63 seconds
45_LightGBM_SelectedFeatures r2 0.774463 trained in 20.38 seconds
46_Xgboost_SelectedFeatures r2 0.781583 trained in 28.49 seconds
47_Xgboost_SelectedFeatures r2 0.781911 trained in 35.19 seconds
48_LightGBM r2 0.777738 trained in 32.0 seconds
49_LightGBM r2 0.757144 trained in 19.63 seconds
50_Xgboost r2 0.775644 trained in 31.74 seconds
51_Xgboost r2 0.779625 trained in 35.18 seconds
52_Xgboost r2 0.776825 trained in 35.21 seconds
53_Xgboost r2 0.774031 trained in 37.76 seconds
54_RandomForest_SelectedFeatures r2 0.681516 trained in 59.11 seconds
55_RandomForest_SelectedFeatures r2 0.682717 trained in 79.23 seconds
* Step hill_climbing_2 will try to check up to 17 models
56_LightGBM r2 0.777941 trained in 27.36 seconds
57_Xgboost_SelectedFeatures r2 0.777909 trained in 30.4 seconds
58_Xgboost_SelectedFeatures r2 0.782662 trained in 35.95 seconds
59_Xgboost_SelectedFeatures r2 0.781634 trained in 28.9 seconds
60_Xgboost_SelectedFeatures r2 0.779578 trained in 31.72 seconds
61_LightGBM_SelectedFeatures r2 0.77983 trained in 27.04 seconds
62_Xgboost_SelectedFeatures r2 0.780135 trained in 28.38 seconds
63_Xgboost_SelectedFeatures r2 0.777918 trained in 32.16 seconds
* Step boost_on_errors will try to check up to 1 model
18_LightGBM_BoostOnErrors r2 0.777478 trained in 26.16 seconds
* Step ensemble will try to check up to 1 model
Ensemble r2 0.793829 trained in 11.24 seconds
* Step stack will try to check up to 39 models
18_LightGBM_Stacked r2 0.779208 trained in 20.48 seconds
58_Xgboost_SelectedFeatures_Stacked r2 0.777255 trained in 27.07 seconds
31_RandomForest_SelectedFeatures_Stacked r2 0.787421 trained in 121.84 seconds
41_ExtraTrees_SelectedFeatures_Stacked r2 0.790031 trained in 35.82 seconds
18_LightGBM_SelectedFeatures_Stacked r2 0.778824 trained in 20.4 seconds
47_Xgboost_SelectedFeatures_Stacked r2 0.778543 trained in 26.14 seconds
31_RandomForest_Stacked r2 0.787983 trained in 111.51 seconds
41_ExtraTrees_Stacked r2 0.788431 trained in 37.75 seconds
61_LightGBM_SelectedFeatures_Stacked r2 0.779328 trained in 21.21 seconds
59_Xgboost_SelectedFeatures_Stacked r2 0.778012 trained in 24.09 seconds
55_RandomForest_SelectedFeatures_Stacked r2 0.788015 trained in 129.65 seconds
40_ExtraTrees_Stacked r2 0.790068 trained in 36.66 seconds
25_LightGBM_Stacked r2 0.774845 trained in 22.96 seconds
46_Xgboost_SelectedFeatures_Stacked r2 0.779182 trained in 25.36 seconds
54_RandomForest_SelectedFeatures_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 5.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
* Step ensemble_stacked will try to check up to 1 model
Ensemble_Stacked r2 0.794873 trained in 16.19 seconds
AutoML fit time: 3631.5 seconds
AutoML best model: Ensemble_Stacked
All the evaluated models are saved in the path C:\Users\sup10432\review_notebooks\voters_turnout\part I\2\AutoML_1
AutoML significantly improves the fit compared to the standalone random forest model, and the validation r-square jumps to a new high. Recall that the earlier visualization of the data revealed a spatial pattern; in the second part of the notebook, this pattern is estimated and included as a spatial feature to further improve the model.
# train score of the model
AutoML_voters_county_base_compete.score()
0.9560210208116269
Model output
# The output diagnostics can also be printed in a report form
AutoML_voters_county_base_compete.report()
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_automl_26Octb\lib\site-packages\arcgis\learn\models\_auto_ml.py:284: UserWarning: In case the report html is not rendered appropriately in the notebook, the same can be found in the path AutoML_1\README.html

AutoML Leaderboard
| Best model | name | model_type | metric_type | metric_value | train_time |
|---|---|---|---|---|---|
| | 1_DecisionTree | Decision Tree | r2 | 0.403535 | 7.36 |
| | 2_DecisionTree | Decision Tree | r2 | 0.458642 | 6.95 |
| | 3_DecisionTree | Decision Tree | r2 | 0.458642 | 7.1 |
| | 4_Linear | Linear | r2 | 0.661466 | 7.4 |
| | 5_Default_LightGBM | LightGBM | r2 | 0.772467 | 39.76 |
| | 6_Default_Xgboost | Xgboost | r2 | 0.76779 | 131.51 |
| | 7_Default_RandomForest | Random Forest | r2 | 0.587323 | 58.82 |
| | 8_Default_ExtraTrees | Extra Trees | r2 | 0.532677 | 17.57 |
| | 18_LightGBM | LightGBM | r2 | 0.784405 | 25.94 |
| | 9_Xgboost | Xgboost | r2 | 0.759364 | 138.21 |
| | 27_RandomForest | Random Forest | r2 | 0.585097 | 48.54 |
| | 36_ExtraTrees | Extra Trees | r2 | 0.525521 | 14.61 |
| | 19_LightGBM | LightGBM | r2 | 0.756595 | 15.53 |
| | 10_Xgboost | Xgboost | r2 | 0.74418 | 207.95 |
| | 28_RandomForest | Random Forest | r2 | 0.527252 | 39.99 |
| | 37_ExtraTrees | Extra Trees | r2 | 0.465072 | 20.03 |
| | 20_LightGBM | LightGBM | r2 | 0.774992 | 46.68 |
| | 11_Xgboost | Xgboost | r2 | 0.778035 | 36.43 |
| | 29_RandomForest | Random Forest | r2 | 0.660641 | 82.2 |
| | 38_ExtraTrees | Extra Trees | r2 | 0.610569 | 31.21 |
| | 21_LightGBM | LightGBM | r2 | 0.765569 | 77.33 |
| | 12_Xgboost | Xgboost | r2 | 0.767627 | 21.57 |
| | 30_RandomForest | Random Forest | r2 | 0.652762 | 55.8 |
| | 39_ExtraTrees | Extra Trees | r2 | 0.593974 | 18.72 |
| | 22_LightGBM | LightGBM | r2 | 0.759384 | 16.02 |
| | 13_Xgboost | Xgboost | r2 | 0.779413 | 30.67 |
| | 31_RandomForest | Random Forest | r2 | 0.683753 | 70.59 |
| | 40_ExtraTrees | Extra Trees | r2 | 0.629244 | 23.31 |
| | 23_LightGBM | LightGBM | r2 | 0.779126 | 45.99 |
| | 14_Xgboost | Xgboost | r2 | 0.772772 | 27.63 |
| | 32_RandomForest | Random Forest | r2 | 0.677973 | 114.97 |
| | 41_ExtraTrees | Extra Trees | r2 | 0.637788 | 27.15 |
| | 24_LightGBM | LightGBM | r2 | 0.775953 | 36.51 |
| | 15_Xgboost | Xgboost | r2 | 0.772261 | 107.03 |
| | 33_RandomForest | Random Forest | r2 | 0.585173 | 54.42 |
| | 42_ExtraTrees | Extra Trees | r2 | 0.52614 | 20.54 |
| | 25_LightGBM | LightGBM | r2 | 0.779734 | 27.67 |
| | 16_Xgboost | Xgboost | r2 | 0.779104 | 36.36 |
| | 34_RandomForest | Random Forest | r2 | 0.586946 | 48.32 |
| | 43_ExtraTrees | Extra Trees | r2 | 0.51673 | 20.02 |
| | 18_LightGBM_KMeansFeatures | LightGBM | r2 | 0.774187 | 31.37 |
| | 25_LightGBM_KMeansFeatures | LightGBM | r2 | 0.770623 | 31.39 |
| | 13_Xgboost_KMeansFeatures | Xgboost | r2 | 0.773129 | 44.06 |
| | 18_LightGBM_RandomFeature | LightGBM | r2 | 0.779158 | 56.57 |
| | 18_LightGBM_SelectedFeatures | LightGBM | r2 | 0.78134 | 25.97 |
| | 13_Xgboost_SelectedFeatures | Xgboost | r2 | 0.78038 | 32.71 |
| | 31_RandomForest_SelectedFeatures | Random Forest | r2 | 0.684658 | 63.75 |
| | 41_ExtraTrees_SelectedFeatures | Extra Trees | r2 | 0.639667 | 27.12 |
| | 44_LightGBM | LightGBM | r2 | 0.777434 | 21.19 |
| | 45_LightGBM_SelectedFeatures | LightGBM | r2 | 0.774463 | 20.91 |
| | 46_Xgboost_SelectedFeatures | Xgboost | r2 | 0.781583 | 29.08 |
| | 47_Xgboost_SelectedFeatures | Xgboost | r2 | 0.781911 | 35.72 |
| | 48_LightGBM | LightGBM | r2 | 0.777738 | 32.54 |
| | 49_LightGBM | LightGBM | r2 | 0.757144 | 20.19 |
| | 50_Xgboost | Xgboost | r2 | 0.775644 | 32.27 |
| | 51_Xgboost | Xgboost | r2 | 0.779625 | 35.71 |
| | 52_Xgboost | Xgboost | r2 | 0.776825 | 35.71 |
| | 53_Xgboost | Xgboost | r2 | 0.774031 | 38.29 |
| | 54_RandomForest_SelectedFeatures | Random Forest | r2 | 0.681516 | 59.71 |
| | 55_RandomForest_SelectedFeatures | Random Forest | r2 | 0.682717 | 79.82 |
| | 56_LightGBM | LightGBM | r2 | 0.777941 | 27.94 |
| | 57_Xgboost_SelectedFeatures | Xgboost | r2 | 0.777909 | 30.96 |
| | 58_Xgboost_SelectedFeatures | Xgboost | r2 | 0.782662 | 36.5 |
| | 59_Xgboost_SelectedFeatures | Xgboost | r2 | 0.781634 | 29.42 |
| | 60_Xgboost_SelectedFeatures | Xgboost | r2 | 0.779578 | 32.25 |
| | 61_LightGBM_SelectedFeatures | LightGBM | r2 | 0.77983 | 27.6 |
| | 62_Xgboost_SelectedFeatures | Xgboost | r2 | 0.780135 | 28.91 |
| | 63_Xgboost_SelectedFeatures | Xgboost | r2 | 0.777918 | 32.69 |
| | 18_LightGBM_BoostOnErrors | LightGBM | r2 | 0.777478 | 26.68 |
| | Ensemble | Ensemble | r2 | 0.793829 | 11.24 |
| | 18_LightGBM_Stacked | LightGBM | r2 | 0.779208 | 21.05 |
| | 58_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.777255 | 27.63 |
| | 31_RandomForest_SelectedFeatures_Stacked | Random Forest | r2 | 0.787421 | 122.42 |
| | 41_ExtraTrees_SelectedFeatures_Stacked | Extra Trees | r2 | 0.790031 | 36.41 |
| | 18_LightGBM_SelectedFeatures_Stacked | LightGBM | r2 | 0.778824 | 20.92 |
| | 47_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.778543 | 26.65 |
| | 31_RandomForest_Stacked | Random Forest | r2 | 0.787983 | 112.04 |
| | 41_ExtraTrees_Stacked | Extra Trees | r2 | 0.788431 | 38.29 |
| | 61_LightGBM_SelectedFeatures_Stacked | LightGBM | r2 | 0.779328 | 21.7 |
| | 59_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.778012 | 24.61 |
| | 55_RandomForest_SelectedFeatures_Stacked | Random Forest | r2 | 0.788015 | 130.19 |
| | 40_ExtraTrees_Stacked | Extra Trees | r2 | 0.790068 | 37.16 |
| | 25_LightGBM_Stacked | LightGBM | r2 | 0.774845 | 23.48 |
| | 46_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.779182 | 25.87 |
| the best | Ensemble_Stacked | Ensemble | r2 | 0.794873 | 16.19 |
AutoML Performance
AutoML Performance Boxplot
Spearman Correlation of Models
Voter turnout prediction & validation
# validating trained model on test dataset
voter_county_automl_predicted = AutoML_voters_county_base_compete.predict(sdf_test_base, prediction_type='dataframe')
voter_county_automl_predicted.head(2)
| | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE | prediction_results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
557 | 557 | 1 | 558 | 16073 | Owyhee | Idaho | 0.529332 | 37.1 | 19701.0 | 3.66 | ... | -0.700966 | -0.496409 | 0 | 0 | 942771.682539 | 3.679323e+10 | 484820.245797 | 672 | {"rings": [[[-12970046, 5356298.000100002], [-... | 0.519827 |
416 | 416 | 1 | 417 | 13119 | Franklin | Georgia | 0.506977 | 42.2 | 18965.0 | 3.44 | ... | -0.942663 | -0.089913 | 0 | 0 | 152970.676619 | 1.015526e+09 | 129997.253626 | 591 | {"rings": [[[-9245266, 4095289.0001000017], [-... | 0.525310 |
2 rows × 98 columns
Estimate model metrics for validation
import sklearn.metrics as metrics
r_square_voter_county_automl_Test = metrics.r2_score(voter_county_automl_predicted['voter_turn'], voter_county_automl_predicted['prediction_results'])
print('r_square_voter_county_automl_Test: ', round(r_square_voter_county_automl_Test,2))
r_square_voter_county_automl_Test: 0.78
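Placing the two validation scores side by side makes the improvement from AutoML explicit:
# comparing validation r-squares of the random forest and AutoML models
pd.DataFrame({'model': ['RandomForestRegressor (MLModel)', 'AutoML Compete (Ensemble_Stacked)'],
              'validation_r2': [round(r_square_voter_county_mlmodel_Test, 2),
                                round(r_square_voter_county_automl_Test, 2)]})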
Conclusion
In this notebook, AutoML was applied to a regression dataset and achieved a significant improvement over traditional modeling methods. Data visualization also showed the presence of spatial autocorrelation in voter turnout across the country. The fit of the model can be further improved by extracting this spatial pattern from the data, a process that is elaborated in part two of this notebook.
Data resources
Reference | Source | Link |
---|---|---|
Voters turnout by county for 2016 US general election | Esri | https://www.arcgis.com/home/item.html?id=650e7d6aa8fb4601a75d632a2c114425 |