
Predicting voters turnout for US election in 2016 using AutoML - Part I

Introduction

The objective of this notebook is to demonstrate the application of AutoML to tabular data and to show the improvements it can achieve over conventional workflows. The AutoML module in arcgis.learn is based on the MLJar library, which automates algorithm selection, data preprocessing, model training, model explainability, and final model evaluation. With these capabilities, it can perform automatic exploratory data analysis, algorithm selection, and hyperparameter tuning to find the best model. It can also generate markdown reports that document the analysis, with details about every model.

Once the desired improvements are obtained using AutoML, the result will be further enhanced using spatial feature engineering in the second part of the notebook.

The dataset used here is the percentage of voter turnout by county for the general election for the United States in 2016, which will be predicted using the demographic characteristics of US counties and their socioeconomic parameters.

Imports

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
from IPython.display import Image, HTML
from fastai.imports import *
from datetime import datetime as dt

import arcgis
from arcgis.gis import GIS
from arcgis.learn import prepare_tabulardata, AutoML, MLModel
import arcpy

Connecting to ArcGIS

gis = GIS('home')

Accessing & Visualizing datasets

Here, the 2016 election data is downloaded from the portal as a zipped shapefile, which is then unzipped and processed in the following steps.

voter_zip = gis.content.get('650e7d6aa8fb4601a75d632a2c114425') 
voter_zip
VotersTurnoutCountyEelction2016
voters turnout 2016
Shapefile by api_data_owner
Last Modified: August 23, 2021
0 comments, 142 views
import os, zipfile
filepath_new = voter_zip.download(file_name=voter_zip.name)
with zipfile.ZipFile(filepath_new, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath_new).parent)
output_path = Path(os.path.join(os.path.splitext(filepath_new)[0]))
output_path = os.path.join(output_path,"VotersTurnoutCountyEelction2016.shp")  
output_path
'~\AppData\\Local\\Temp\\VotersTurnoutCountyEelction2016\\VotersTurnoutCountyEelction2016.shp'
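
The download-and-unzip step above can also be written with pathlib alone. The following is a minimal sketch, not the notebook's own code: the helper name extract_and_locate and the demo archive are hypothetical, and the helper extracts into a folder named after the archive rather than into the parent folder as the cell above does.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_locate(zip_path, suffix=".shp"):
    """Extract an archive into a sibling folder and return the first file with `suffix`."""
    zip_path = Path(zip_path)
    # Hypothetical layout choice: extract into <parent>/<archive stem>/
    target_dir = zip_path.parent / zip_path.stem
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
    # Search recursively, since the archive may or may not contain a subfolder
    matches = sorted(target_dir.rglob(f"*{suffix}"))
    if not matches:
        raise FileNotFoundError(f"no {suffix} file found in {zip_path.name}")
    return matches[0]


# Demo with a throwaway archive (a placeholder file, not a real shapefile)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo_item.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("demo_layer.shp", b"placeholder")
    shp_path = extract_and_locate(demo_zip)
    found_name = shp_path.name
    found_exists = shp_path.exists()

print(found_name, found_exists)
```

Searching for the .shp file avoids hard-coding the shapefile name, which helps if the archive layout differs from what is expected.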

The attribute table contains the voter turnout data by county for the entire US, which is extracted here as a pandas dataframe. The voter_turn field in the dataframe contains the voter turnout for each county in the 2016 election, stored as a fraction between 0 and 1 (for example, 0.61 for roughly 61%). This will be used as the dependent variable and will be predicted using the various demographic and socioeconomic variables for each county.

# getting the attribute table from the shapefile which will be used for building the model
sdf_main = pd.DataFrame.spatial.from_featureclass(output_path)
sdf_main.head()
FIDJoin_CountTARGET_FIDFIPScountystatevoter_turngender_medhouseholdielectronic...NNeighborsZTransformSpatialLagLMi_hi_sigLMi_normalShape_Le_1Shape_Ar_1LMiHiDistNEAR_FIDSHAPE
001101001AutaugaAlabama0.61373838.6255534.96...440.211580.15456800249674.5007992208597808.5133735.2925020{"rings": [[[-9619465, 3856529.0001000017], [-...
111201003BaldwinAlabama0.62736442.9314294.64...220.3588940.057952001642763.261465671095677.35241925.1964263{"rings": [[[-9746859, 3539643.0001000017], [-...
221301005BarbourAlabama0.51381640.2168763.49...62-0.868722-0.49835411320297.065153257816458.50.00{"rings": [[[-9468394, 3771591.0001000017], [-...
331401007BibbAlabama0.50136439.3193603.64...43-1.0033410.2864400227910.1089162311954706.0170214.4857597{"rings": [[[-9692114, 3928124.0001000017], [-...
441501009BlountAlabama0.60306440.9217853.86...510.096177-0.33619801291875.2554832456919058.521128.5687847{"rings": [[[-9623907, 4063676.0001000017], [-...

5 rows × 97 columns

sdf_main.shape
(3112, 97)

The data is visualized here by mapping the voter turnout field into five classes. It can be seen that there is a belt running along the southeastern part of the country, which represents comparatively lower voter turnout of less than 55%.

# Visualizing voter turnout in percentages by county
m1 = GIS().map('United States')
m1.legend.enabled = True
m1
m1.content.add(sdf_main)
m1.zoom_to_layer(sdf_main)

Applying symbology on the feature layer

sm_manager = m1.content.renderer(0).smart_mapping()
sm_manager.class_breaks_renderer(field="voter_turn", break_type = "color", num_classes=5)
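
To make the idea of mapping turnout into five classes concrete, here is a small sketch that bins values into equal-interval classes. This is an assumption for illustration only: the classification method the renderer actually applies is not stated in this notebook and may differ, and the function names are hypothetical. The sample values are the voter_turn entries from the first five rows shown earlier.

```python
def equal_interval_breaks(values, num_classes=5):
    """Return the upper bound of each class for equal-width classes."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / num_classes
    return [lo + step * i for i in range(1, num_classes + 1)]


def classify(value, breaks):
    """Return the 0-based class index of `value` given class upper bounds."""
    for idx, upper in enumerate(breaks):
        if value <= upper:
            return idx
    # Guard against floating-point rounding at the top of the range
    return len(breaks) - 1


# voter_turn values from the first five rows of the attribute table
turnout = [0.613738, 0.627364, 0.513816, 0.501364, 0.603064]
breaks = equal_interval_breaks(turnout)
classes = [classify(v, breaks) for v in turnout]
print(breaks)
print(classes)
```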

Model Building

Once the dataset is divided into training and test sets, the training data is ready to be used for modeling.

Train-Test Data split

The dataset above has 3112 samples, each representing a US county and its voter turnout, along with related variables. Next, it will be split into training and test datasets in a 90 to 10 ratio, for training and validation respectively.

from sklearn.model_selection import train_test_split
# Splitting data with a test size of 10% for validation 
test_size = 0.10
sdf_train_base, sdf_test_base = train_test_split(sdf_main, test_size = test_size, random_state=42)
sdf_train_base.head(2)
FIDJoin_CountTARGET_FIDFIPScountystatevoter_turngender_medhouseholdielectronic...NNeighborsZTransformSpatialLagLMi_hi_sigLMi_normalShape_Le_1Shape_Ar_1LMiHiDistNEAR_FIDSHAPE
179817981179936001AlbanyNew York0.58754640.1382275.0...31-0.071597-0.09909900225997.9588532550719523.0150330.823825589{"rings": [[[-8201660, 5279044.000100002], [-8...
100310031100421081GrantKentucky0.5417436.9204903.59...87-0.566824-0.12572300148480.2442491108905122.584976.878589339{"rings": [[[-9419382, 4693378.000100002], [-9...

2 rows × 97 columns
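
As a quick sanity check on the split, the expected row counts can be worked out by hand. The sketch below assumes that scikit-learn rounds the test count up and the training count down when test_size is a fraction; the helper name split_sizes is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Expected (train, test) row counts for a fractional test_size.

    Assumption: the test count is rounded up and the train count rounded down.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor((1 - test_size) * n_samples)
    return n_train, n_test


n_train, n_test = split_sizes(3112, 0.10)
print(n_train, n_test)
```

Under that assumption, the 3112 counties divide into 2800 training rows and 312 test rows, which can be compared against sdf_train_base.shape and sdf_test_base.shape.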

# checking the columns in the dataset
sdf_main.columns
Index(['FID', 'Join_Count', 'TARGET_FID', 'FIPS', 'county', 'state',
       'voter_turn', 'gender_med', 'householdi', 'electronic', 'raceandhis',
       'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
       'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
       'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
       'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
       'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
       'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
       'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
       'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
       'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
       'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
       'financia_1', 'miscellane', 'state_vote', 'state_vo_1', 'randomized',
       'random_num', 'City10Dist', 'City10Ang', 'City9Dist', 'City9Ang',
       'City8Dist', 'City8Ang', 'City7Dist', 'City7Ang', 'City6Dist',
       'City6Ang', 'City5Dist', 'City5Ang', 'SOURCE_ID', 'voter_tu_1',
       'Shape_Leng', 'Shape_Area', 'LMiIndex', 'LMiZScore', 'LMiPValue',
       'COType', 'NNeighbors', 'ZTransform', 'SpatialLag', 'LMi_hi_sig',
       'LMi_normal', 'Shape_Le_1', 'Shape_Ar_1', 'LMiHiDist', 'NEAR_FID',
       'SHAPE'],
      dtype='object')

Data Preparation

First, a list of explanatory variables is chosen; these are the features that will be used to predict voter turnout. By default, each variable is treated as continuous. A categorical variable is passed as a tuple of the variable name and True. Here county, state, and voter_laws are categorical variables.

# listing explanatory variables
X =[('county',True), ('state',True),'gender_med', 'householdi', 'electronic', 'raceandhis',
       ('voter_laws',True), 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
       'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
       'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
       'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
       'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
       'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
       'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
       'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
       'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
       'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
       'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
        'City10Dist','City9Dist', 'City8Dist', 'City7Dist','City6Dist',
        'City5Dist']
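
The tuple convention used in the list above can be unpacked with a few lines of plain Python, which is handy for checking which variables were marked as categorical. This is an illustrative sketch: split_variables is a hypothetical helper, not part of arcgis.learn, and X_sample is a shortened copy of the list above.

```python
def split_variables(explanatory_variables):
    """Separate (name, True) tuples from plain continuous variable names."""
    categorical, continuous = [], []
    for item in explanatory_variables:
        if isinstance(item, tuple):
            name, is_categorical = item
            (categorical if is_categorical else continuous).append(name)
        else:
            continuous.append(item)
    return categorical, continuous


# Shortened copy of the explanatory variable list, for illustration
X_sample = [('county', True), ('state', True), 'gender_med', 'householdi',
            ('voter_laws', True), 'City5Dist']
categorical, continuous = split_variables(X_sample)
print(categorical)
print(continuous)
```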

The preprocessor, defined as follows, applies a MinMaxScaler to the explanatory variables:

from sklearn.preprocessing import MinMaxScaler
# defining the preprocessors for scaling data
preprocessors = [('county', 'state','gender_med', 'householdi', 'electronic', 'raceandhis',
       'voter_laws', 'educationa', 'educatio_1', 'educatio_2', 'educatio_3',
       'maritalsta', 'F5yearincr', 'F5yearin_1', 'F5yearin_2', 'F5yearin_3',
       'F5yearin_4', 'F5yearin_5', 'F5yearin_6', 'language_a', 'hispanicor',
       'hispanic_1', 'raceandh_1', 'atrisk_avg', 'disposable', 'disposab_1',
       'disposab_2', 'disposab_3', 'disposab_4', 'disposab_5', 'disposab_6',
       'disposab_7', 'disposab_8', 'disposab_9', 'disposa_10', 'househol_1',
       'househol_2', 'househol_3', 'househol_4', 'househol_5', 'househol_6',
       'househol_7', 'househol_8', 'househol_9', 'language_1', 'language_2',
       'households', 'househo_10', 'educatio_4', 'educatio_5', 'educatio_6',
       'educatio_7', 'psychograp', 'psychogr_1', 'financial_', 'financial1',
       'financia_1', 'miscellane', 'state_vote', 'state_vo_1',
        'City10Dist', 'City9Dist',
       'City8Dist', 'City7Dist','City6Dist',
       'City5Dist', MinMaxScaler())]
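
For reference, min-max scaling maps each value x to (x - min) / (max - min), so the smallest value in a column becomes 0 and the largest becomes 1. The sketch below reproduces that formula in plain Python on the gender_med values from the first five rows shown earlier; it is a simplified illustration with a hypothetical helper name, not the scikit-learn implementation.

```python
def min_max_scale(values):
    """Scale values to the [0, 1] range using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; map every value to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# gender_med values from the first five rows of the attribute table
gender_med = [38.6, 42.9, 40.2, 39.3, 40.9]
scaled = min_max_scale(gender_med)
print([round(s, 3) for s in scaled])
```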

Finally, using the explanatory variables list above, the preprocessors, and the prediction variable of voter turnout, the prepare_tabulardata function prepares the data to be fed into the model.

# preparing data for the model
data_base_model = prepare_tabulardata(sdf_train_base,
                           variable_predict='voter_turn',
                           explanatory_variables=X, 
                           preprocessors=preprocessors)

Fitting a random forest model

First a random forest model is fitted to the data, and its performance is measured.

Model Initialization

The MLModel is initialized with the Random Forest model from Sklearn, along with its model parameters.

# defining the model along with the parameters 
model = MLModel(data_base_model, 'sklearn.ensemble.RandomForestRegressor', n_estimators=500, random_state=43)
model.fit()
model.score()
0.6400988935906538
# validating trained model on test dataset
voter_county_mlmodel_predicted = model.predict(sdf_test_base, prediction_type='dataframe')
voter_county_mlmodel_predicted.head(2)
FIDJoin_CountTARGET_FIDFIPScountystatevoter_turngender_medhouseholdielectronic...ZTransformSpatialLagLMi_hi_sigLMi_normalShape_Le_1Shape_Ar_1LMiHiDistNEAR_FIDSHAPEprediction_results
557557155816073OwyheeIdaho0.52933237.1197013.66...-0.700966-0.49640900942771.68253936793231250.5484820.245797672{"rings": [[[-12970046, 5356298.000100002], [-...0.531904
416416141713119FranklinGeorgia0.50697742.2189653.44...-0.942663-0.08991300152970.6766191015526410.0129997.253626591{"rings": [[[-9245266, 4095289.0001000017], [-...0.534624

2 rows × 98 columns

import sklearn.metrics as metrics
# calculating validation model score
r_square_voter_county_mlmodel_Test = metrics.r2_score(voter_county_mlmodel_predicted['voter_turn'], voter_county_mlmodel_predicted['prediction_results']) 
print('r_square_voter_county_mlmodel_Test: ', round(r_square_voter_county_mlmodel_Test,2))
r_square_voter_county_mlmodel_Test:  0.71

The validation r square of 0.71 for the random forest model is satisfactory, and AutoML will now be used to improve on it.
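
The r square reported by metrics.r2_score is one minus the ratio of the residual sum of squares to the total sum of squares. The following sketch computes it by hand on hypothetical values (not taken from this dataset) to show the two reference points: a perfect prediction scores 1, and always predicting the mean scores 0.

```python
def r2_score_manual(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot, with the true values passed first."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical turnout fractions, for illustration only
y_true = [0.50, 0.55, 0.60, 0.65]
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)

perfect = r2_score_manual(y_true, y_true)
baseline = r2_score_manual(y_true, mean_prediction)
print(perfect, baseline)
```

The argument order matters: the observed values go first and the predictions second, as in the cell above.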

Fitting Using AutoML

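The same data prepared by the prepare_tabulardata function is next used as input for the AutoML model. Of the AutoML modes available, the Compete mode is used here. It is described as using 10-fold CV (cross-validation) with the Decision Tree, Random Forest, Extra Trees, XGBoost, Neural Network, Nearest Neighbors, Ensemble, and Stacking models to achieve higher accuracy. In the run shown below, the training log reports a 5-fold shuffled CV, the Neural Network algorithm is disabled because it does not support the n_jobs parameter, and the algorithms actually used are Linear, Decision Tree, Random Trees, Extra Trees, LightGBM, and Xgboost.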

# initializing AutoML model with the Compete mode 
AutoML_voters_county_base_compete = AutoML(data_base_model, eval_metric='r2', mode='Compete', n_jobs=1)

In the above initialization, the Compete mode is selected out of the three available modes: Explain, Perform, and Compete. While Compete is the best performing mode, it also consumes a significant amount of resources and time, and it is only recommended when the best results are necessary. In other cases, the Explain or Perform modes can be used for a faster basic fit.
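
Cross-validation, which the Compete mode relies on, splits the training rows into k folds and uses each fold once for validation while training on the rest. The sketch below shows shuffled k-fold index generation in plain Python; it is a simplified illustration with hypothetical names and sizes, not the MLJar implementation.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, validation) index lists for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical sample count, chosen small for readability
splits = list(k_fold_indices(20, n_folds=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Every row is used for validation exactly once, so the reported score averages over k held-out folds instead of relying on a single split.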

# training the AutoML model
AutoML_voters_county_base_compete.fit()
Neural Network algorithm was disabled because it doesn't support n_jobs parameter.
AutoML directory: ~\AppData\Local\Temp\scratch\tmp065_13k2
The task is regression with evaluation metric r2
AutoML will use algorithms: ['Linear', 'Decision Tree', 'Random Trees', 'Extra Trees', 'LightGBM', 'Xgboost']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'mix_encoding', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree r2 0.339807 trained in 6.53 seconds
Adjust validation. Remove: 1_DecisionTree
Validation strategy: 5-fold CV Shuffle
* Step simple_algorithms will try to check up to 4 models
1_DecisionTree r2 0.384567 trained in 11.87 seconds
2_DecisionTree r2 0.438367 trained in 10.57 seconds
3_DecisionTree r2 0.440063 trained in 10.52 seconds
4_Linear r2 0.651521 trained in 12.55 seconds
* Step default_algorithms will try to check up to 4 models
5_Default_LightGBM r2 0.773111 trained in 48.03 seconds
6_Default_Xgboost r2 0.765353 trained in 45.17 seconds
7_Default_RandomTrees r2 0.582249 trained in 94.64 seconds
8_Default_ExtraTrees r2 0.525917 trained in 34.52 seconds
* Step not_so_random will try to check up to 36 models
18_LightGBM r2 0.778586 trained in 19.69 seconds
9_Xgboost r2 0.74601 trained in 60.14 seconds
27_RandomTrees r2 0.584312 trained in 76.86 seconds
36_ExtraTrees r2 0.509314 trained in 34.63 seconds
19_LightGBM r2 0.745846 trained in 17.38 seconds
10_Xgboost r2 0.741448 trained in 69.99 seconds
28_RandomTrees r2 0.518539 trained in 61.65 seconds
37_ExtraTrees r2 0.461816 trained in 27.51 seconds
20_LightGBM r2 0.772607 trained in 34.54 seconds
11_Xgboost r2 0.773569 trained in 21.86 seconds
29_RandomTrees r2 0.654122 trained in 161.82 seconds
38_ExtraTrees r2 0.611619 trained in 37.09 seconds
21_LightGBM r2 0.758954 trained in 62.74 seconds
12_Xgboost r2 0.752604 trained in 20.04 seconds
30_RandomTrees r2 0.648906 trained in 110.54 seconds
39_ExtraTrees r2 0.592089 trained in 29.68 seconds
22_LightGBM r2 0.754357 trained in 17.2 seconds
13_Xgboost r2 0.775421 trained in 23.6 seconds
31_RandomTrees r2 0.675828 trained in 118.82 seconds
40_ExtraTrees r2 0.623064 trained in 38.19 seconds
23_LightGBM r2 0.774772 trained in 25.83 seconds
14_Xgboost r2 0.771997 trained in 24.1 seconds
32_RandomTrees r2 0.668177 trained in 199.31 seconds
41_ExtraTrees r2 0.634077 trained in 29.54 seconds
24_LightGBM r2 0.776354 trained in 35.68 seconds
15_Xgboost r2 0.773189 trained in 43.42 seconds
33_RandomTrees r2 0.58038 trained in 116.53 seconds
42_ExtraTrees r2 0.52028 trained in 33.13 seconds
25_LightGBM r2 0.768833 trained in 20.25 seconds
* Step mix_encoding will try to check up to 1 model
13_Xgboost_categorical_mix r2 0.775825 trained in 22.61 seconds
* Step insert_random_feature will try to check up to 1 model
18_LightGBM_RandomFeature r2 0.774383 trained in 134.49 seconds
Drop features ['disposab_7', 'random_feature', 'hispanic_1', 'educatio_2', 'disposab_4', 'disposa_10', 'F5yearin_2', 'househol_2', 'disposab_9', 'disposab_1', 'househol_4', 'disposab_6', 'miscellane', 'county_brown', 'county_pulaski', 'county_carbon', 'county_rock', 'county_grundy', 'county_la', 'county_miami', 'county_river', 'county_butte', 'county_richland', 'county_anderson', 'county_pierce', 'county_kent', 'county_henderson', 'county_richmond', 'county_randolph', 'county_caldwell', 'county_green', 'county_garfield', 'county_allen', 'county_sheridan', 'county_sherman', 'county_sullivan', 'county_valley', 'county_wilson', 'county_middlesex', 'county_mitchell', 'county_phillips', 'county_lancaster', 'county_george', 'county_park', 'county_red', 'county_dallas', 'county_charles', 'county_clair', 'county_dodge', 'county_delaware', 'county_dawson', 'county_wood', 'county_putnam', 'county_benton', 'county_perry', 'county_cumberland', 'county_custer', 'county_davis', 'county_decatur', 'county_dekalb', 'county_fayette', 'county_howard', 'county_jasper', 'county_carroll', 'county_orange', 'county_columbia', 'county_newton', 'county_new', 'county_morgan', 'county_montgomery', 'county_monroe', 'county_mineral', 'county_mercer', 'county_mason', 'county_martin', 'county_crawford', 'county_clinton', 'county_marion', 'county_henry', 'county_floyd', 'county_franklin', 'county_fulton', 'county_grand', 'county_clay', 'county_hamilton', 'county_hancock', 'county_hardin', 'county_harrison', 'county_grant', 'county_york', 'county_greene', 'county_essex', 'county_douglas', 'county_carter', 'county_cass', 'county_cherokee', 'county_city', 'county_clark', 'county_clarke', 'county_boone', 'county_marshall', 'county_madison', 'county_washington', 'county_macon', 'county_shelby', 'county_camden', 'county_st', 'county_taylor', 'county_union', 'county_van', 'county_warren', 'county_wayne', 'county_san', 'county_webster', 'county_wheeler', 'county_white', 'county_saline', 'county_russell', 
'county_pike', 'county_adams', 'county_polk', 'county_santa', 'county_scott', 'county_calhoun', 'county_jackson', 'county_lyon', 'county_logan', 'county_jefferson', 'county_johnson', 'county_jones', 'county_king', 'county_knox', 'county_lafayette', 'county_butler', 'county_stone', 'county_lake', 'county_lawrence', 'county_lee', 'county_lewis', 'county_lincoln', 'county_linn', 'county_livingston', 'county_lamar', 'county_campbell', 'language_1', 'disposable', 'financia_1', 'househol_3', 'househol_5', 'atrisk_avg']
* Step features_selection will try to check up to 4 models
18_LightGBM_SelectedFeatures r2 0.780683 trained in 20.16 seconds
13_Xgboost_categorical_mix_SelectedFeatures r2 0.774165 trained in 19.28 seconds
31_RandomTrees_SelectedFeatures r2 0.678155 trained in 93.54 seconds
41_ExtraTrees_SelectedFeatures r2 0.637948 trained in 21.32 seconds
* Step hill_climbing_1 will try to check up to 24 models
43_LightGBM_SelectedFeatures r2 0.783672 trained in 18.01 seconds
44_LightGBM r2 0.772547 trained in 18.42 seconds
45_LightGBM r2 0.769483 trained in 26.0 seconds
46_Xgboost r2 0.770869 trained in 23.87 seconds
47_Xgboost r2 0.774883 trained in 23.82 seconds
48_Xgboost r2 0.775008 trained in 26.4 seconds
49_Xgboost r2 0.776888 trained in 24.93 seconds
50_Xgboost_SelectedFeatures r2 0.776082 trained in 20.86 seconds
51_Xgboost_SelectedFeatures r2 0.775982 trained in 21.4 seconds
52_RandomTrees_SelectedFeatures r2 0.686752 trained in 94.93 seconds
53_RandomTrees_SelectedFeatures r2 0.674463 trained in 98.79 seconds
* Step hill_climbing_2 will try to check up to 22 models
54_LightGBM_SelectedFeatures r2 0.781741 trained in 18.24 seconds
55_LightGBM_SelectedFeatures r2 0.785634 trained in 23.24 seconds
56_LightGBM r2 0.782188 trained in 25.46 seconds
57_Xgboost r2 0.772739 trained in 26.52 seconds
58_Xgboost r2 0.77337 trained in 28.45 seconds
59_Xgboost_SelectedFeatures r2 0.777363 trained in 21.46 seconds
60_Xgboost_SelectedFeatures r2 0.780775 trained in 21.44 seconds
61_Xgboost_SelectedFeatures r2 0.770041 trained in 19.53 seconds
62_Xgboost_SelectedFeatures r2 0.776709 trained in 21.37 seconds
63_RandomTrees_SelectedFeatures r2 0.685755 trained in 88.82 seconds
* Step boost_on_errors will try to check up to 1 model
55_LightGBM_SelectedFeatures_BoostOnErrors r2 0.782378 trained in 22.47 seconds
* Step ensemble will try to check up to 1 model
Ensemble r2 0.792575 trained in 10.53 seconds
* Step stack will try to check up to 38 models
55_LightGBM_SelectedFeatures_Stacked r2 0.776583 trained in 16.33 seconds
60_Xgboost_SelectedFeatures_Stacked r2 0.764339 trained in 17.01 seconds
52_RandomTrees_SelectedFeatures_Stacked r2 0.784741 trained in 243.34 seconds
41_ExtraTrees_SelectedFeatures_Stacked r2 0.787197 trained in 32.22 seconds
43_LightGBM_SelectedFeatures_Stacked r2 0.776042 trained in 15.71 seconds
59_Xgboost_SelectedFeatures_Stacked r2 0.766169 trained in 16.9 seconds
63_RandomTrees_SelectedFeatures_Stacked r2 0.78466 trained in 263.52 seconds
41_ExtraTrees_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 8.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
56_LightGBM_Stacked r2 0.775774 trained in 18.89 seconds
49_Xgboost_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 2.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
31_RandomTrees_SelectedFeatures_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 46.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
* Step ensemble_stacked will try to check up to 1 model
Ensemble_Stacked r2 0.792647 trained in 13.55 seconds
AutoML fit time: 3662.3 seconds
AutoML best model: Ensemble_Stacked
All the evaluated models are saved in the path  ~\AppData\Local\Temp\scratch\tmp065_13k2
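
A training log of this length is easier to compare once it is parsed. The sketch below extracts the model name, r2, and training time from a few lines copied from the log above and picks the highest score. The regular expression is an assumption based on the line format shown here, and the names LOG_LINE and parse_results are hypothetical.

```python
import re

# Matches lines such as "4_Linear r2 0.651521 trained in 12.55 seconds"
LOG_LINE = re.compile(
    r"^(?P<name>\S+) r2 (?P<r2>-?\d+(?:\.\d+)?) "
    r"trained in (?P<seconds>\d+(?:\.\d+)?) seconds$"
)

# A few lines copied from the training log above
log_excerpt = """\
4_Linear r2 0.651521 trained in 12.55 seconds
5_Default_LightGBM r2 0.773111 trained in 48.03 seconds
55_LightGBM_SelectedFeatures r2 0.785634 trained in 23.24 seconds
Ensemble r2 0.792575 trained in 10.53 seconds
41_ExtraTrees_Stacked not trained. Stop training after the first fold.
Ensemble_Stacked r2 0.792647 trained in 13.55 seconds
"""


def parse_results(text):
    """Return (name, r2, seconds) tuples for every line that reports a score."""
    results = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            results.append(
                (match["name"], float(match["r2"]), float(match["seconds"]))
            )
    return results


results = parse_results(log_excerpt)
best_name, best_r2, _ = max(results, key=lambda row: row[1])
print(best_name, best_r2)
```

Lines that do not report a score, such as the "not trained" message, are skipped.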

AutoML improves the fit compared to the standalone random forest model: the best model, Ensemble_Stacked, reaches a cross-validated r2 of about 0.79. The earlier map of the data also reveals a spatial pattern, and in the second part of the notebook this pattern is estimated and included as a spatial feature to further improve the model.

# train score of the model
AutoML_voters_county_base_compete.score()
0.9602925513426736

Model output

# The output diagnostics can also be printed in a report form
AutoML_voters_county_base_compete.report()
In case the report html is not rendered appropriately in the notebook, the same can be found in the path ~\AppData\Local\Temp\scratch\tmp065_13k2\README.html

AutoML Leaderboard

| Best model | name | model_type | metric_type | metric_value | train_time |
| | 1_DecisionTree | Decision Tree | r2 | 0.384567 | 12.38 |
| | 2_DecisionTree | Decision Tree | r2 | 0.438367 | 11.18 |
| | 3_DecisionTree | Decision Tree | r2 | 0.440063 | 11.12 |
| | 4_Linear | Linear | r2 | 0.651521 | 13.03 |
| | 5_Default_LightGBM | LightGBM | r2 | 0.773111 | 48.66 |
| | 6_Default_Xgboost | Xgboost | r2 | 0.765353 | 45.81 |
| | 7_Default_RandomTrees | Random Trees | r2 | 0.582249 | 95.19 |
| | 8_Default_ExtraTrees | Extra Trees | r2 | 0.525917 | 35.09 |
| | 18_LightGBM | LightGBM | r2 | 0.778586 | 20.18 |
| | 9_Xgboost | Xgboost | r2 | 0.74601 | 60.83 |
| | 27_RandomTrees | Random Trees | r2 | 0.584312 | 77.44 |
| | 36_ExtraTrees | Extra Trees | r2 | 0.509314 | 35.17 |
| | 19_LightGBM | LightGBM | r2 | 0.745846 | 17.95 |
| | 10_Xgboost | Xgboost | r2 | 0.741448 | 70.57 |
| | 28_RandomTrees | Random Trees | r2 | 0.518539 | 62.28 |
| | 37_ExtraTrees | Extra Trees | r2 | 0.461816 | 28.06 |
| | 20_LightGBM | LightGBM | r2 | 0.772607 | 35.05 |
| | 11_Xgboost | Xgboost | r2 | 0.773569 | 22.44 |
| | 29_RandomTrees | Random Trees | r2 | 0.654122 | 162.33 |
| | 38_ExtraTrees | Extra Trees | r2 | 0.611619 | 37.62 |
| | 21_LightGBM | LightGBM | r2 | 0.758954 | 63.31 |
| | 12_Xgboost | Xgboost | r2 | 0.752604 | 20.55 |
| | 30_RandomTrees | Random Trees | r2 | 0.648906 | 111.06 |
| | 39_ExtraTrees | Extra Trees | r2 | 0.592089 | 30.34 |
| | 22_LightGBM | LightGBM | r2 | 0.754357 | 17.78 |
| | 13_Xgboost | Xgboost | r2 | 0.775421 | 24.17 |
| | 31_RandomTrees | Random Trees | r2 | 0.675828 | 119.35 |
| | 40_ExtraTrees | Extra Trees | r2 | 0.623064 | 38.83 |
| | 23_LightGBM | LightGBM | r2 | 0.774772 | 26.44 |
| | 14_Xgboost | Xgboost | r2 | 0.771997 | 24.73 |
| | 32_RandomTrees | Random Trees | r2 | 0.668177 | 199.86 |
| | 41_ExtraTrees | Extra Trees | r2 | 0.634077 | 30.12 |
| | 24_LightGBM | LightGBM | r2 | 0.776354 | 36.22 |
| | 15_Xgboost | Xgboost | r2 | 0.773189 | 43.94 |
| | 33_RandomTrees | Random Trees | r2 | 0.58038 | 117.07 |
| | 42_ExtraTrees | Extra Trees | r2 | 0.52028 | 33.73 |
| | 25_LightGBM | LightGBM | r2 | 0.768833 | 20.79 |
| | 13_Xgboost_categorical_mix | Xgboost | r2 | 0.775825 | 23.21 |
| | 18_LightGBM_RandomFeature | LightGBM | r2 | 0.774383 | 135.37 |
| | 18_LightGBM_SelectedFeatures | LightGBM | r2 | 0.780683 | 20.69 |
| | 13_Xgboost_categorical_mix_SelectedFeatures | Xgboost | r2 | 0.774165 | 19.8 |
| | 31_RandomTrees_SelectedFeatures | Random Trees | r2 | 0.678155 | 94.07 |
| | 41_ExtraTrees_SelectedFeatures | Extra Trees | r2 | 0.637948 | 21.92 |
| | 43_LightGBM_SelectedFeatures | LightGBM | r2 | 0.783672 | 18.52 |
| | 44_LightGBM | LightGBM | r2 | 0.772547 | 18.93 |
| | 45_LightGBM | LightGBM | r2 | 0.769483 | 26.54 |
| | 46_Xgboost | Xgboost | r2 | 0.770869 | 24.41 |
| | 47_Xgboost | Xgboost | r2 | 0.774883 | 24.33 |
| | 48_Xgboost | Xgboost | r2 | 0.775008 | 26.93 |
| | 49_Xgboost | Xgboost | r2 | 0.776888 | 25.48 |
| | 50_Xgboost_SelectedFeatures | Xgboost | r2 | 0.776082 | 21.39 |
| | 51_Xgboost_SelectedFeatures | Xgboost | r2 | 0.775982 | 22.05 |
| | 52_RandomTrees_SelectedFeatures | Random Trees | r2 | 0.686752 | 95.53 |
| | 53_RandomTrees_SelectedFeatures | Random Trees | r2 | 0.674463 | 99.41 |
| | 54_LightGBM_SelectedFeatures | LightGBM | r2 | 0.781741 | 18.85 |
| | 55_LightGBM_SelectedFeatures | LightGBM | r2 | 0.785634 | 23.83 |
| | 56_LightGBM | LightGBM | r2 | 0.782188 | 26.1 |
| | 57_Xgboost | Xgboost | r2 | 0.772739 | 27.12 |
| | 58_Xgboost | Xgboost | r2 | 0.77337 | 29.02 |
| | 59_Xgboost_SelectedFeatures | Xgboost | r2 | 0.777363 | 22.17 |
| | 60_Xgboost_SelectedFeatures | Xgboost | r2 | 0.780775 | 22.08 |
| | 61_Xgboost_SelectedFeatures | Xgboost | r2 | 0.770041 | 20.12 |
| | 62_Xgboost_SelectedFeatures | Xgboost | r2 | 0.776709 | 21.94 |
| | 63_RandomTrees_SelectedFeatures | Random Trees | r2 | 0.685755 | 89.42 |
| | 55_LightGBM_SelectedFeatures_BoostOnErrors | LightGBM | r2 | 0.782378 | 23.06 |
| | Ensemble | Ensemble | r2 | 0.792575 | 10.53 |
| | 55_LightGBM_SelectedFeatures_Stacked | LightGBM | r2 | 0.776583 | 16.85 |
| | 60_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.764339 | 17.56 |
| | 52_RandomTrees_SelectedFeatures_Stacked | Random Trees | r2 | 0.784741 | 243.91 |
| | 41_ExtraTrees_SelectedFeatures_Stacked | Extra Trees | r2 | 0.787197 | 32.86 |
| | 43_LightGBM_SelectedFeatures_Stacked | LightGBM | r2 | 0.776042 | 16.31 |
| | 59_Xgboost_SelectedFeatures_Stacked | Xgboost | r2 | 0.766169 | 17.49 |
| | 63_RandomTrees_SelectedFeatures_Stacked | Random Trees | r2 | 0.78466 | 264.09 |
| | 56_LightGBM_Stacked | LightGBM | r2 | 0.775774 | 19.59 |
| the best | Ensemble_Stacked | Ensemble | r2 | 0.792647 | 13.55 |

AutoML Performance

[Figure: AutoML performance]

AutoML Performance Boxplot

[Figure: AutoML performance boxplot]

Spearman Correlation of Models

[Figure: Spearman correlation of models]

Voters turnout prediction & Validation

# validating trained model on test dataset
voter_county_automl_predicted = AutoML_voters_county_base_compete.predict(sdf_test_base, prediction_type='dataframe')
voter_county_automl_predicted.head(2)
FIDJoin_CountTARGET_FIDFIPScountystatevoter_turngender_medhouseholdielectronic...ZTransformSpatialLagLMi_hi_sigLMi_normalShape_Le_1Shape_Ar_1LMiHiDistNEAR_FIDSHAPEprediction_results
557557155816073OwyheeIdaho0.52933237.1197013.66...-0.700966-0.49640900942771.68253936793231250.5484820.245797672{"rings": [[[-12970046, 5356298.000100002], [-...0.524788
416416141713119FranklinGeorgia0.50697742.2189653.44...-0.942663-0.08991300152970.6766191015526410.0129997.253626591{"rings": [[[-9245266, 4095289.0001000017], [-...0.520055

2 rows × 98 columns

Estimate model metrics for validation

import sklearn.metrics as metrics
r_square_voter_county_automl_Test = metrics.r2_score(voter_county_automl_predicted['voter_turn'], voter_county_automl_predicted['prediction_results']) 
print('r_square_voter_county_automl_Test: ', round(r_square_voter_county_automl_Test,2))
r_square_voter_county_automl_Test:  0.78
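
The two test scores reported in this notebook can be compared directly. The short sketch below uses only the rounded values printed above (0.71 for the random forest model and 0.78 for AutoML), so the results are approximate; the variable names are hypothetical.

```python
# Rounded test r2 values printed earlier in this notebook
rf_test_r2 = 0.71
automl_test_r2 = 0.78

# Absolute and relative change in r2
absolute_gain = automl_test_r2 - rf_test_r2
relative_gain_pct = 100 * absolute_gain / rf_test_r2

# Share of the variance left unexplained by the random forest
# that the AutoML model additionally explains
unexplained_reduction_pct = 100 * absolute_gain / (1 - rf_test_r2)

print(round(absolute_gain, 2))
print(round(relative_gain_pct, 1))
print(round(unexplained_reduction_pct, 1))
```

Expressing the gain relative to the unexplained variance gives a sense of how much of the remaining error AutoML recovers, which a raw difference in r2 can understate.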

Conclusion

In this notebook, AutoML was applied to a regression dataset and improved the test r square from 0.71 for the standalone random forest model to 0.78. Data visualization also showed the presence of spatial autocorrelation in voter turnout across the country. The fit of the model can be further improved by extracting this spatial pattern from the data, and this process is elaborated on in part two of this notebook.

Data resources

| Reference | Source | Link |
| Voters turnout by county for 2016 US general election | Esri | https://www.arcgis.com/home/item.html?id=650e7d6aa8fb4601a75d632a2c114425 |
