Predicting voter turnout for the 2016 US election using AutoML - Part I

Introduction

The objective of this notebook is to demonstrate the application of AutoML to tabular data and to show the improvements it can achieve over conventional workflows. The newly added AutoML module in arcgis.learn is based on the MLJAR library, which automates algorithm selection, data preprocessing, model training, model explainability, and final model evaluation. With these functionalities, it can perform automatic exploratory data analysis, algorithm selection, and hyperparameter tuning to find the best model, and it can generate automatic documentation as markdown reports detailing all of the evaluated models.

Once the desired improvements are obtained using AutoML, the result will be further enhanced using spatial feature engineering in the second part of the notebook.

The dataset used here contains the percentage voter turnout by county for the 2016 United States general election, which will be predicted using the demographic and socioeconomic characteristics of US counties.

Imports

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import Image, HTML
from sklearn.preprocessing import MinMaxScaler,RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
from fastai.imports import *
from pathlib import Path  # used later when extracting the zipped shapefile
from datetime import datetime as dt

import arcgis
from arcgis.gis import GIS
from arcgis.learn import prepare_tabulardata, AutoML, MLModel
import arcpy

Connecting to ArcGIS

gis = GIS("home")

Accessing & Visualizing datasets

Here, the 2016 election data is downloaded from the portal as a zipped shapefile, which is then unzipped and processed as follows.

voter_zip = gis.content.get('650e7d6aa8fb4601a75d632a2c114425') 
voter_zip
VotersTurnoutCountyEelction2016
voters turnout 2016, Shapefile by api_data_owner
Last Modified: August 23, 2021
0 comments, 79 views
import os, zipfile

# download the zipped shapefile and extract it alongside the download location
filepath_new = voter_zip.download(file_name=voter_zip.name)
with zipfile.ZipFile(filepath_new, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath_new).parent)

# path to the extracted shapefile
output_path = os.path.join(os.path.splitext(filepath_new)[0], "VotersTurnoutCountyEelction2016.shp")
output_path
'C:\\Users\\sup10432\\AppData\\Local\\Temp\\VotersTurnoutCountyEelction2016\\VotersTurnoutCountyEelction2016.shp'

The attribute table contains the voter turnout data by county for the entire US, which is extracted here as a pandas dataframe. The voter_turn field in the dataframe contains the voter turnout data in percentages for each county for the 2016 election. This will be used as the dependent variable and will be predicted using the various demographic and socioeconomic variables for each county.

# getting the attribute table from the shapefile which will be used for building the model
sdf_main = pd.DataFrame.spatial.from_featureclass(output_path)
sdf_main.head()
 | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | NNeighbors | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE
0 | 0 | 1 | 1 | 01001 | Autauga | Alabama | 0.613738 | 38.6 | 25553.0 | 4.96 | ... | 44 | 0.211580 | 0.154568 | 0 | 0 | 2.496745e+05 | 2.208598e+09 | 133735.292502 | 0 | {"rings": [[[-9619465, 3856529.0001000017], [-...
1 | 1 | 1 | 2 | 01003 | Baldwin | Alabama | 0.627364 | 42.9 | 31429.0 | 4.64 | ... | 22 | 0.358894 | 0.057952 | 0 | 0 | 1.642763e+06 | 5.671096e+09 | 241925.196426 | 3 | {"rings": [[[-9746859, 3539643.0001000017], [-...
2 | 2 | 1 | 3 | 01005 | Barbour | Alabama | 0.513816 | 40.2 | 16876.0 | 3.49 | ... | 62 | -0.868722 | -0.498354 | 1 | 1 | 3.202971e+05 | 3.257816e+09 | 0.000000 | 0 | {"rings": [[[-9468394, 3771591.0001000017], [-...
3 | 3 | 1 | 4 | 01007 | Bibb | Alabama | 0.501364 | 39.3 | 19360.0 | 3.64 | ... | 43 | -1.003341 | 0.286440 | 0 | 0 | 2.279101e+05 | 2.311955e+09 | 170214.485759 | 7 | {"rings": [[[-9692114, 3928124.0001000017], [-...
4 | 4 | 1 | 5 | 01009 | Blount | Alabama | 0.603064 | 40.9 | 21785.0 | 3.86 | ... | 51 | 0.096177 | -0.336198 | 0 | 1 | 2.918753e+05 | 2.456919e+09 | 21128.568784 | 7 | {"rings": [[[-9623907, 4063676.0001000017], [-...

5 rows × 97 columns

sdf_main.shape
(3112, 97)
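
Before mapping the target, a quick summary of the dependent variable helps confirm that voter_turn is expressed as a fraction between 0 and 1 and shows its spread. This is a minimal sanity check using standard pandas methods, not part of the original workflow:

# inspecting the distribution of the dependent variable
sdf_main['voter_turn'].describe()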

The data is visualized here by mapping the voter turnout field into five classes. A belt running across the southeastern part of the country stands out, with comparatively low voter turnout of less than 55%.

# Visualizing voters turnout in percentages by county
m1 = GIS().map('United States', zoomlevel=4)
sdf_main.spatial.plot(map_widget=m1, renderer_type='c', col='voter_turn', line_width=0.2,
                      method='esriClassifyNaturalBreaks', class_count=5, cmap='gist_heat_r', alpha=0.7)
m1.legend=True
m1

Model Building

Once the dataset has been divided into training and test sets, the training data is ready for modeling.

Train-Test Data split

The dataset above has 3112 samples, each representing a US county, its voter turnout, and related variables. Next, it is split into training and test sets in a 90:10 ratio for training and validation, respectively.

# Splitting data with a test size of 10% for validation 
test_size = 0.10
sdf_train_base, sdf_test_base = train_test_split(sdf_main, test_size = test_size, random_state=42)
sdf_train_base.head(2)
 | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | NNeighbors | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE
1798 | 1798 | 1 | 1799 | 36001 | Albany | New York | 0.587546 | 40.1 | 38227.0 | 5.00 | ... | 31 | -0.071597 | -0.099099 | 0 | 0 | 225997.958853 | 2.550720e+09 | 150330.823825 | 589 | {"rings": [[[-8201660, 5279044.000100002], [-8...
1003 | 1003 | 1 | 1004 | 21081 | Grant | Kentucky | 0.541740 | 36.9 | 20490.0 | 3.59 | ... | 87 | -0.566824 | -0.125723 | 0 | 0 | 148480.244249 | 1.108905e+09 | 84976.878589 | 339 | {"rings": [[[-9419382, 4693378.000100002], [-9...

2 rows × 97 columns
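
The sizes of the two sets can be confirmed with a quick shape check; a minimal sketch using standard pandas attributes:

# confirming the roughly 90:10 split
print(sdf_train_base.shape, sdf_test_base.shape)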

# checking the columns in the dataset
sdf_main.columns
Index(['FID', 'Join_Count', 'TARGET_FID', 'FIPS', 'county', 'state',
       'voter_turn', 'gender_med', 'householdi', 'electronic', 'raceandhis',
       'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
       'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
       'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
       'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
       'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
       'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
       'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
       'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
       'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
       'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
       'financia_1', 'miscellane', 'state_vote', 'state_vo_1', 'randomized',
       'random_num', 'City10Dist', 'City10Ang', 'City9Dist', 'City9Ang',
       'City8Dist', 'City8Ang', 'City7Dist', 'City7Ang', 'City6Dist',
       'City6Ang', 'City5Dist', 'City5Ang', 'SOURCE_ID', 'voter_tu_1',
       'Shape_Leng', 'Shape_Area', 'LMiIndex', 'LMiZScore', 'LMiPValue',
       'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'LMi_hi_sig',
       'LMi_normal', 'Shape_Le_1', 'Shape_Ar_1', 'LMiHiDist', 'NEAR_FID',
       'SHAPE'],
      dtype='object')

Data Preparation

First, a list of explanatory variables is chosen, consisting of the feature data that will be used to predict voter turnout. By default, variables are treated as continuous; for categorical variables, a tuple of the field name and True is passed instead. Here, county, state, and voter_laws are categorical variables.

# listing explanatory variables
X =[('county',True), ('state',True),'gender_med', 'householdi', 'electronic', 'raceandhis',
       ('voter_laws',True), 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
       'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
       'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
       'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
       'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
       'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
       'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
       'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
       'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
       'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
       'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
        'City10Dist','City9Dist', 'City8Dist', 'City7Dist','City6Dist',
        'City5Dist']

The preprocessors apply a scaler to transform the explanatory variables and are defined as follows:

# defining the preprocessors for scaling data
preprocessors = [('county', 'state','gender_med', 'householdi', 'electronic', 'raceandhis',
       'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
       'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
       'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
       'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
       'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
       'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
       'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
       'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
       'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
       'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
       'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
        'City10Dist', 'City9Dist',
       'City8Dist', 'City7Dist','City6Dist',
       'City5Dist', MinMaxScaler())]
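
Note that RobustScaler was imported alongside MinMaxScaler; if the county-level variables contain strong outliers, it can be swapped in using the same tuple pattern. A minimal sketch, where all_cols is a convenience variable introduced here (not part of the original notebook) to avoid repeating the full column list:

# optional alternative: RobustScaler is less sensitive to outliers than MinMaxScaler
all_cols = [v[0] if isinstance(v, tuple) else v for v in X]   # plain field names from X
preprocessors_robust = [(*all_cols, RobustScaler())]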

Finally, using the explanatory variables list above, the preprocessors, and voter turnout as the prediction variable, the prepare_tabulardata method prepares the data to be fed into the model.

# preparing data for the model
data_base_model = prepare_tabulardata(sdf_train_base,
                           variable_predict='voter_turn',
                           explanatory_variables=X, 
                           preprocessors=preprocessors)
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_automl_26Octb\lib\site-packages\arcgis\learn\_utils\tabular_data.py:1035: UserWarning:

Column county has more than 20 unique value. Sure this is categorical?

C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_automl_26Octb\lib\site-packages\arcgis\learn\_utils\tabular_data.py:1035: UserWarning:

Column state has more than 20 unique value. Sure this is categorical?
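
The warnings flag that county and state have a large number of unique values. Whether they are sensible categorical variables can be verified with a simple cardinality check; a minimal sketch using pandas:

# checking the cardinality of the flagged categorical columns
sdf_train_base[['county', 'state', 'voter_laws']].nunique()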

Fitting a random forest model

First, a random forest model is fitted to the data, and its performance is measured.

Model Initialization

The MLModel is initialized with the RandomForestRegressor model from scikit-learn, along with its model parameters.

# defining the model along with the parameters 
model = MLModel(data_base_model, 'sklearn.ensemble.RandomForestRegressor', n_estimators=500, random_state=43)
model.fit()
model.score()
0.6388235590727049
# validating trained model on test dataset
voter_county_mlmodel_predicted = model.predict(sdf_test_base, prediction_type='dataframe')
voter_county_mlmodel_predicted.head(2)
 | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE | prediction_results
557 | 557 | 1 | 558 | 16073 | Owyhee | Idaho | 0.529332 | 37.1 | 19701.0 | 3.66 | ... | -0.700966 | -0.496409 | 0 | 0 | 942771.682539 | 3.679323e+10 | 484820.245797 | 672 | {"rings": [[[-12970046, 5356298.000100002], [-... | 0.533127
416 | 416 | 1 | 417 | 13119 | Franklin | Georgia | 0.506977 | 42.2 | 18965.0 | 3.44 | ... | -0.942663 | -0.089913 | 0 | 0 | 152970.676619 | 1.015526e+09 | 129997.253626 | 591 | {"rings": [[[-9245266, 4095289.0001000017], [-... | 0.533401

2 rows × 98 columns

# calculating validation model score
r_square_voter_county_mlmodel_Test = metrics.r2_score(voter_county_mlmodel_predicted['voter_turn'], voter_county_mlmodel_predicted['prediction_results']) 
print('r_square_voter_county_mlmodel_Test: ', round(r_square_voter_county_mlmodel_Test,2))
r_square_voter_county_mlmodel_Test:  0.71

The validation r-square for the random forest model is satisfactory; next, AutoML will be used to improve on it.

Fitting Using AutoML

The same data obtained using the prepare_tabulardata function is next used as input for the AutoML model. Out of the various AutoML modes available, the Compete mode is used here; it applies 10-fold cross-validation (CV) and trains Decision Tree, Random Forest, Extra Trees, XGBoost, Neural Network, Nearest Neighbors, Ensemble, and Stacking models to achieve higher machine learning accuracy.

# initializing AutoML model with the Compete mode 
AutoML_voters_county_base_compete = AutoML(data_base_model, eval_metric='r2', mode='Compete', n_jobs=1)

In the above initialization, the Compete mode is selected out of the three available modes: Explain, Perform, and Compete. While Compete is the best-performing mode, it also consumes a significant amount of resources and time, so it is only recommended when the best results are necessary. In other cases, the Explain or Perform modes can be used for a faster, more basic fit.
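
For exploratory runs, the same prepared data can be fitted much faster by changing only the mode argument; a hedged sketch, not executed in this notebook:

# a lighter alternative for quick exploration
AutoML_voters_county_base_explain = AutoML(data_base_model, eval_metric='r2', mode='Explain', n_jobs=1)
# AutoML_voters_county_base_explain.fit()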

# training the AutoML model
AutoML_voters_county_base_compete.fit()
Neural Network algorithm was disabled because it doesn't support n_jobs parameter.
AutoML directory: AutoML_1
The task is regression with evaluation metric r2
AutoML will use algorithms: ['Linear', 'Decision Tree', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost']
AutoML will stack models
AutoML will ensemble availabe models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree r2 0.278123 trained in 0.82 seconds
Adjust validation. Remove: 1_DecisionTree
Validation strategy: 10-fold CV Shuffle
* Step simple_algorithms will try to check up to 4 models
1_DecisionTree r2 0.403535 trained in 6.96 seconds
2_DecisionTree r2 0.458642 trained in 6.55 seconds
3_DecisionTree r2 0.458642 trained in 6.7 seconds
4_Linear r2 0.661466 trained in 6.97 seconds
* Step default_algorithms will try to check up to 4 models
5_Default_LightGBM r2 0.772467 trained in 39.19 seconds
6_Default_Xgboost r2 0.76779 trained in 130.98 seconds
7_Default_RandomForest r2 0.587323 trained in 58.24 seconds
8_Default_ExtraTrees r2 0.532677 trained in 16.98 seconds
* Step not_so_random will try to check up to 36 models
18_LightGBM r2 0.784405 trained in 25.39 seconds
9_Xgboost r2 0.759364 trained in 137.63 seconds
27_RandomForest r2 0.585097 trained in 47.95 seconds
36_ExtraTrees r2 0.525521 trained in 14.03 seconds
19_LightGBM r2 0.756595 trained in 15.0 seconds
10_Xgboost r2 0.74418 trained in 207.35 seconds
28_RandomForest r2 0.527252 trained in 39.33 seconds
37_ExtraTrees r2 0.465072 trained in 19.44 seconds
20_LightGBM r2 0.774992 trained in 46.14 seconds
11_Xgboost r2 0.778035 trained in 35.9 seconds
29_RandomForest r2 0.660641 trained in 81.68 seconds
38_ExtraTrees r2 0.610569 trained in 30.66 seconds
21_LightGBM r2 0.765569 trained in 76.77 seconds
12_Xgboost r2 0.767627 trained in 21.05 seconds
30_RandomForest r2 0.652762 trained in 55.28 seconds
39_ExtraTrees r2 0.593974 trained in 18.17 seconds
22_LightGBM r2 0.759384 trained in 15.49 seconds
13_Xgboost r2 0.779413 trained in 30.12 seconds
31_RandomForest r2 0.683753 trained in 70.05 seconds
40_ExtraTrees r2 0.629244 trained in 22.78 seconds
23_LightGBM r2 0.779126 trained in 45.39 seconds
14_Xgboost r2 0.772772 trained in 27.06 seconds
32_RandomForest r2 0.677973 trained in 114.41 seconds
41_ExtraTrees r2 0.637788 trained in 26.65 seconds
24_LightGBM r2 0.775953 trained in 35.92 seconds
15_Xgboost r2 0.772261 trained in 106.44 seconds
33_RandomForest r2 0.585173 trained in 53.87 seconds
42_ExtraTrees r2 0.52614 trained in 19.97 seconds
25_LightGBM r2 0.779734 trained in 27.09 seconds
16_Xgboost r2 0.779104 trained in 35.84 seconds
34_RandomForest r2 0.586946 trained in 47.75 seconds
43_ExtraTrees r2 0.51673 trained in 19.5 seconds
* Step kmeans_features will try to check up to 3 models
18_LightGBM_KMeansFeatures r2 0.774187 trained in 30.75 seconds
25_LightGBM_KMeansFeatures r2 0.770623 trained in 30.81 seconds
13_Xgboost_KMeansFeatures r2 0.773129 trained in 43.43 seconds
* Step insert_random_feature will try to check up to 1 model
18_LightGBM_RandomFeature r2 0.779158 trained in 55.47 seconds
Drop features ['households', 'disposab_9', 'City10Dist', 'househo_10', 'househol_2', 'househol_4', 'disposab_2', 'random_feature', 'state_vo_1', 'language_2']
* Step features_selection will try to check up to 4 models
18_LightGBM_SelectedFeatures r2 0.78134 trained in 25.4 seconds
13_Xgboost_SelectedFeatures r2 0.78038 trained in 32.15 seconds
31_RandomForest_SelectedFeatures r2 0.684658 trained in 63.17 seconds
41_ExtraTrees_SelectedFeatures r2 0.639667 trained in 26.54 seconds
* Step hill_climbing_1 will try to check up to 22 models
44_LightGBM r2 0.777434 trained in 20.63 seconds
45_LightGBM_SelectedFeatures r2 0.774463 trained in 20.38 seconds
46_Xgboost_SelectedFeatures r2 0.781583 trained in 28.49 seconds
47_Xgboost_SelectedFeatures r2 0.781911 trained in 35.19 seconds
48_LightGBM r2 0.777738 trained in 32.0 seconds
49_LightGBM r2 0.757144 trained in 19.63 seconds
50_Xgboost r2 0.775644 trained in 31.74 seconds
51_Xgboost r2 0.779625 trained in 35.18 seconds
52_Xgboost r2 0.776825 trained in 35.21 seconds
53_Xgboost r2 0.774031 trained in 37.76 seconds
54_RandomForest_SelectedFeatures r2 0.681516 trained in 59.11 seconds
55_RandomForest_SelectedFeatures r2 0.682717 trained in 79.23 seconds
* Step hill_climbing_2 will try to check up to 17 models
56_LightGBM r2 0.777941 trained in 27.36 seconds
57_Xgboost_SelectedFeatures r2 0.777909 trained in 30.4 seconds
58_Xgboost_SelectedFeatures r2 0.782662 trained in 35.95 seconds
59_Xgboost_SelectedFeatures r2 0.781634 trained in 28.9 seconds
60_Xgboost_SelectedFeatures r2 0.779578 trained in 31.72 seconds
61_LightGBM_SelectedFeatures r2 0.77983 trained in 27.04 seconds
62_Xgboost_SelectedFeatures r2 0.780135 trained in 28.38 seconds
63_Xgboost_SelectedFeatures r2 0.777918 trained in 32.16 seconds
* Step boost_on_errors will try to check up to 1 model
18_LightGBM_BoostOnErrors r2 0.777478 trained in 26.16 seconds
* Step ensemble will try to check up to 1 model
Ensemble r2 0.793829 trained in 11.24 seconds
* Step stack will try to check up to 39 models
18_LightGBM_Stacked r2 0.779208 trained in 20.48 seconds
58_Xgboost_SelectedFeatures_Stacked r2 0.777255 trained in 27.07 seconds
31_RandomForest_SelectedFeatures_Stacked r2 0.787421 trained in 121.84 seconds
41_ExtraTrees_SelectedFeatures_Stacked r2 0.790031 trained in 35.82 seconds
18_LightGBM_SelectedFeatures_Stacked r2 0.778824 trained in 20.4 seconds
47_Xgboost_SelectedFeatures_Stacked r2 0.778543 trained in 26.14 seconds
31_RandomForest_Stacked r2 0.787983 trained in 111.51 seconds
41_ExtraTrees_Stacked r2 0.788431 trained in 37.75 seconds
61_LightGBM_SelectedFeatures_Stacked r2 0.779328 trained in 21.21 seconds
59_Xgboost_SelectedFeatures_Stacked r2 0.778012 trained in 24.09 seconds
55_RandomForest_SelectedFeatures_Stacked r2 0.788015 trained in 129.65 seconds
40_ExtraTrees_Stacked r2 0.790068 trained in 36.66 seconds
25_LightGBM_Stacked r2 0.774845 trained in 22.96 seconds
46_Xgboost_SelectedFeatures_Stacked r2 0.779182 trained in 25.36 seconds
54_RandomForest_SelectedFeatures_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 5.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
* Step ensemble_stacked will try to check up to 1 model
Ensemble_Stacked r2 0.794873 trained in 16.19 seconds
AutoML fit time: 3631.5 seconds
AutoML best model: Ensemble_Stacked
All the evaluated models are saved in the path  C:\Users\sup10432\review_notebooks\voters_turnout\part I\2\AutoML_1

AutoML significantly improves the fit compared to the standalone random forest model: the train score below reaches about 0.96, and the validation r-square, computed in the prediction section that follows, rises from 0.71 to 0.78. The earlier visualization of the data also revealed a spatial pattern in voter turnout; in the second part of the notebook, this pattern is estimated and included as a spatial feature to further improve the model.

# train score of the model
AutoML_voters_county_base_compete.score()
0.9560210208116269

Model output

# The output diagnostics can also be printed in a report form
AutoML_voters_county_base_compete.report()
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_automl_26Octb\lib\site-packages\arcgis\learn\models\_auto_ml.py:284: UserWarning:

In case the report html is not rendered appropriately in the notebook, the same can be found in the path AutoML_1\README.html
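
If the report does not render inline, the saved HTML noted in the warning can be displayed directly with the HTML class imported earlier; a minimal sketch, assuming the default AutoML_1 output directory:

# rendering the saved report in the notebook
HTML(filename='AutoML_1/README.html')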

AutoML Leaderboard

Best model | name | model_type | metric_type | metric_value | train_time (s)
 | 1_DecisionTree | Decision Tree | r2 | 0.403535 | 7.36
 | 2_DecisionTree | Decision Tree | r2 | 0.458642 | 6.95
 | 3_DecisionTree | Decision Tree | r2 | 0.458642 | 7.1
 | 4_Linear | Linear | r2 | 0.661466 | 7.4
 | 5_Default_LightGBM | LightGBM | r2 | 0.772467 | 39.76
 | 6_Default_Xgboost | Xgboost | r2 | 0.76779 | 131.51
 | 7_Default_RandomForest | Random Forest | r2 | 0.587323 | 58.82
 | 8_Default_ExtraTrees | Extra Trees | r2 | 0.532677 | 17.57
 | 18_LightGBM | LightGBM | r2 | 0.784405 | 25.94
 | 9_Xgboost | Xgboost | r2 | 0.759364 | 138.21
 | 27_RandomForest | Random Forest | r2 | 0.585097 | 48.54
 | 36_ExtraTrees | Extra Trees | r2 | 0.525521 | 14.61
 | 19_LightGBM | LightGBM | r2 | 0.756595 | 15.53
 | 10_Xgboost | Xgboost | r2 | 0.74418 | 207.95
 | 28_RandomForest | Random Forest | r2 | 0.527252 | 39.99
 | 37_ExtraTrees | Extra Trees | r2 | 0.465072 | 20.03
 | 20_LightGBM | LightGBM | r2 | 0.774992 | 46.68
 | 11_Xgboost | Xgboost | r2 | 0.778035 | 36.43
 | 29_RandomForest | Random Forest | r2 | 0.660641 | 82.2
 | 38_ExtraTrees | Extra Trees | r2 | 0.610569 | 31.21
 | 21_LightGBM | LightGBM | r2 | 0.765569 | 77.33
 | 12_Xgboost | Xgboost | r2 | 0.767627 | 21.57
 | 30_RandomForest | Random Forest | r2 | 0.652762 | 55.8
 | 39_ExtraTrees | Extra Trees | r2 | 0.593974 | 18.72
 | 22_LightGBM | LightGBM | r2 | 0.759384 | 16.02
 | 13_Xgboost | Xgboost | r2 | 0.779413 | 30.67
 | 31_RandomForest | Random Forest | r2 | 0.683753 | 70.59
 | 40_ExtraTrees | Extra Trees | r2 | 0.629244 | 23.31
 | 23_LightGBM | LightGBM | r2 | 0.779126 | 45.99
 | 14_Xgboost | Xgboost | r2 | 0.772772 | 27.63
 | 32_RandomForest | Random Forest | r2 | 0.677973 | 114.97
 | 41_ExtraTrees | Extra Trees | r2 | 0.637788 | 27.15
 | 24_LightGBM | LightGBM | r2 | 0.775953 | 36.51
 | 15_Xgboost | Xgboost | r2 | 0.772261 | 107.03
 | 33_RandomForest | Random Forest | r2 | 0.585173 | 54.42
 | 42_ExtraTrees | Extra Trees | r2 | 0.52614 | 20.54
 | 25_LightGBM | LightGBM | r2 | 0.779734 | 27.67
 | 16_Xgboost | Xgboost | r2 | 0.779104 | 36.36
 | 34_RandomForest | Random Forest | r2 | 0.586946 | 48.32
 | 43_ExtraTrees | Extra Trees | r2 | 0.51673 | 20.02
 | 18_LightGBM_KMeansFeatures | LightGBM | r2 | 0.774187 | 31.37
 | 25_LightGBM_KMeansFeatures | LightGBM | r2 | 0.770623 | 31.39
 | 13_Xgboost_KMeansFeatures | Xgboost | r2 | 0.773129 | 44.06
 | 18_LightGBM_RandomFeature | LightGBM | r2 | 0.779158 | 56.57
 | 18_LightGBM_SelectedFeatures | LightGBM | r2 | 0.78134 | 25.97
 | 13_Xgboost_SelectedFeatures | Xgboost | r2 | 0.78038 | 32.71
 | 31_RandomForest_SelectedFeatures | Random Forest | r2 | 0.684658 | 63.75
 | 41_ExtraTrees_SelectedFeatures | Extra Trees | r2 | 0.639667 | 27.12
 | 44_LightGBM | LightGBM | r2 | 0.777434 | 21.19
 | 45_LightGBM_SelectedFeatures | LightGBM | r2 | 0.774463 | 20.91
 | 46_Xgboost_SelectedFeatures | Xgboost | r2 | 0.781583 | 29.08
 | 47_Xgboost_SelectedFeatures | Xgboost | r2 | 0.781911 | 35.72
 | 48_LightGBM | LightGBM | r2 | 0.777738 | 32.54
 | 49_LightGBM | LightGBM | r2 | 0.757144 | 20.19
 | 50_Xgboost | Xgboost | r2 | 0.775644 | 32.27
 | 51_Xgboost | Xgboost | r2 | 0.779625 | 35.71
 | 52_Xgboost | Xgboost | r2 | 0.776825 | 35.71
 | 53_Xgboost | Xgboost | r2 | 0.774031 | 38.29
 | 54_RandomForest_SelectedFeatures | Random Forest | r2 | 0.681516 | 59.71
 | 55_RandomForest_SelectedFeatures | Random Forest | r2 | 0.682717 | 79.82
 | 56_LightGBM | LightGBM | r2 | 0.777941 | 27.94
 | 57_Xgboost_SelectedFeatures | Xgboost | r2 | 0.777909 | 30.96
 | 58_Xgboost_SelectedFeatures | Xgboost | r2 | 0.782662 | 36.5
 | 59_Xgboost_SelectedFeatures | Xgboost | r2 | 0.781634 | 29.42
 | 60_Xgboost_SelectedFeatures | Xgboost | r2 | 0.779578 | 32.25
 | 61_LightGBM_SelectedFeatures | LightGBM | r2 | 0.77983 | 27.6
 | 62_Xgboost_SelectedFeatures | Xgboost | r2 | 0.780135 | 28.91
 | 63_Xgboost_SelectedFeatures | Xgboost | r2 | 0.777918 | 32.69
 | 18_LightGBM_BoostOnErrors | LightGBM | r2 | 0.777478 | 26.68
 | Ensemble | Ensemble | r2 | 0.793829 | 11.24
 | 18_LightGBM_Stacked | LightGBM | r2 | 0.779208 | 21.05
 | 58_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.777255 | 27.63
 | 31_RandomForest_SelectedFeatures_Stacked | Random Forest | r2 | 0.787421 | 122.42
 | 41_ExtraTrees_SelectedFeatures_Stacked | Extra Trees | r2 | 0.790031 | 36.41
 | 18_LightGBM_SelectedFeatures_Stacked | LightGBM | r2 | 0.778824 | 20.92
 | 47_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.778543 | 26.65
 | 31_RandomForest_Stacked | Random Forest | r2 | 0.787983 | 112.04
 | 41_ExtraTrees_Stacked | Extra Trees | r2 | 0.788431 | 38.29
 | 61_LightGBM_SelectedFeatures_Stacked | LightGBM | r2 | 0.779328 | 21.7
 | 59_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.778012 | 24.61
 | 55_RandomForest_SelectedFeatures_Stacked | Random Forest | r2 | 0.788015 | 130.19
 | 40_ExtraTrees_Stacked | Extra Trees | r2 | 0.790068 | 37.16
 | 25_LightGBM_Stacked | LightGBM | r2 | 0.774845 | 23.48
 | 46_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.779182 | 25.87
the best | Ensemble_Stacked | Ensemble | r2 | 0.794873 | 16.19

AutoML Performance

[Figure: AutoML performance plot]

AutoML Performance Boxplot

[Figure: AutoML performance boxplot]

Spearman Correlation of Models

[Figure: Spearman correlation of the evaluated models]

Voter turnout prediction & Validation

# validating trained model on test dataset
voter_county_automl_predicted = AutoML_voters_county_base_compete.predict(sdf_test_base, prediction_type='dataframe')
voter_county_automl_predicted.head(2)
 | FID | Join_Count | TARGET_FID | FIPS | county | state | voter_turn | gender_med | householdi | electronic | ... | ZTransform | SpatialLag | LMi_hi_sig | LMi_normal | Shape_Le_1 | Shape_Ar_1 | LMiHiDist | NEAR_FID | SHAPE | prediction_results
557 | 557 | 1 | 558 | 16073 | Owyhee | Idaho | 0.529332 | 37.1 | 19701.0 | 3.66 | ... | -0.700966 | -0.496409 | 0 | 0 | 942771.682539 | 3.679323e+10 | 484820.245797 | 672 | {"rings": [[[-12970046, 5356298.000100002], [-... | 0.519827
416 | 416 | 1 | 417 | 13119 | Franklin | Georgia | 0.506977 | 42.2 | 18965.0 | 3.44 | ... | -0.942663 | -0.089913 | 0 | 0 | 152970.676619 | 1.015526e+09 | 129997.253626 | 591 | {"rings": [[[-9245266, 4095289.0001000017], [-... | 0.525310

2 rows × 98 columns

Estimate model metrics for validation

import sklearn.metrics as metrics
r_square_voter_county_automl_Test = metrics.r2_score(voter_county_automl_predicted['voter_turn'], voter_county_automl_predicted['prediction_results']) 
print('r_square_voter_county_automl_Test: ', round(r_square_voter_county_automl_Test,2))
r_square_voter_county_automl_Test:  0.78
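
Beyond r-square, complementary error metrics and a predicted-versus-actual plot give a fuller picture of the validation fit. The following is a minimal sketch using the sklearn and matplotlib imports already loaded; the column names follow the prediction dataframe above:

# supplementary validation metrics
rmse = metrics.mean_squared_error(voter_county_automl_predicted['voter_turn'],
                                  voter_county_automl_predicted['prediction_results'],
                                  squared=False)
mae = metrics.mean_absolute_error(voter_county_automl_predicted['voter_turn'],
                                  voter_county_automl_predicted['prediction_results'])
print('RMSE:', round(rmse, 4), ' MAE:', round(mae, 4))

# predicted vs. actual scatter for a visual check
plt.scatter(voter_county_automl_predicted['voter_turn'],
            voter_county_automl_predicted['prediction_results'], alpha=0.5)
plt.xlabel('Actual voter turnout')
plt.ylabel('Predicted voter turnout')
plt.show()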

Conclusion

In this notebook, AutoML was applied to a regression dataset and achieved a significant improvement over a conventional modeling workflow. Data visualization also showed the presence of spatial autocorrelation in voter turnout across the country. The fit of the model can be further improved by extracting this spatial pattern from the data, a process elaborated on in part two of this notebook.

Data resources

Reference | Source | Link
Voter turnout by county for the 2016 US general election | Esri | https://www.arcgis.com/home/item.html?id=650e7d6aa8fb4601a75d632a2c114425
