Predicting voter turnout for the 2016 US election using AutoML - Part I

Introduction

This notebook demonstrates how AutoML can be applied to tabular data and shows the improvement it can deliver over a conventional workflow. The AutoML module in arcgis.learn is based on the MLJar library and automates algorithm selection, data preprocessing, model training, model explainability, and final model evaluation. With these capabilities it performs automatic exploratory data analysis, algorithm selection, and hyperparameter tuning to find the best model. It can also document the analysis automatically, as markdown reports with details about all of the models.
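As a rough illustration of what automated algorithm selection means, the sketch below fits two candidate regressors on a training split, scores each on a held-out split with R², and keeps the better one. It is a toy written with the standard library only: the data, the two candidates, and the helper names (`fit_mean`, `fit_linear`, `r2`, `select_best`) are assumptions made for illustration and do not reflect how arcgis.learn or MLJar implement their search.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Candidate: one-variable least-squares line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def r2(model, xs, ys):
    """Coefficient of determination on held-out data."""
    my = mean(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def select_best(candidates, train, valid):
    """Fit every candidate on train, score it on valid, return the best name and all scores."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = r2(model, *valid)
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy data (not from the election dataset): y is roughly 2*x + 1
train = ([1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8])
valid = ([5.0, 6.0, 7.0], [11.0, 13.1, 14.9])

best_name, all_scores = select_best({"mean": fit_mean, "linear": fit_linear}, train, valid)
print(best_name, all_scores)
```

A real AutoML run searches over many more algorithms and hyperparameters, but the underlying loop of fit, score, and compare is the same idea.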

Once AutoML has delivered the desired improvement, the second part of the notebook enhances the result further using spatial feature engineering.

The dataset used here is the county-level voter turnout for the 2016 United States general election. Turnout is predicted from the demographic characteristics and socioeconomic parameters of US counties.

Imports

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import Image, HTML
from sklearn.preprocessing import MinMaxScaler,RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
from fastai.imports import *
from datetime import datetime as dt

import arcgis
from arcgis.gis import GIS
from arcgis.learn import prepare_tabulardata, AutoML, MLModel
import arcpy

Connecting to ArcGIS

In [2]:
gis = GIS("home")

Accessing & Visualizing datasets

Here, the 2016 election data is downloaded from the portal as a zipped shapefile, which is then unzipped and processed in the cells that follow.

In [3]:
voter_zip = gis.content.get('650e7d6aa8fb4601a75d632a2c114425') 
voter_zip
Out[3]:
VotersTurnoutCountyEelction2016
voters turnout 2016
Shapefile by api_data_owner
Last Modified: August 23, 2021
0 comments, 79 views
In [4]:
import os, zipfile
from pathlib import Path  # explicit import; Path is otherwise only available via fastai.imports
In [5]:
# Download the zipped shapefile and extract it next to the downloaded archive
filepath_new = voter_zip.download(file_name=voter_zip.name)
with zipfile.ZipFile(filepath_new, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath_new).parent)
# The archive extracts into a folder named after the zip file; build the path to the .shp inside it
output_path = os.path.join(os.path.splitext(filepath_new)[0], "VotersTurnoutCountyEelction2016.shp")
In [6]:
output_path
Out[6]:
'C:\\Users\\sup10432\\AppData\\Local\\Temp\\VotersTurnoutCountyEelction2016\\VotersTurnoutCountyEelction2016.shp'
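The download-and-extract step can be made less dependent on a hard-coded file name. The sketch below, using only the standard library, extracts an archive and searches the extracted files for a `.shp`. The helper name `extract_and_find` and the demo archive are hypothetical. Unlike the cell above, it extracts into a folder named after the archive rather than into the archive's parent directory.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_and_find(zip_path, suffix=".shp"):
    """Extract zip_path into a sibling folder and return the first file ending in suffix."""
    zip_path = Path(zip_path)
    target_dir = zip_path.parent / zip_path.stem
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(target_dir)
    # Search recursively, since archives may or may not contain a top-level folder
    matches = sorted(target_dir.rglob(f"*{suffix}"))
    if not matches:
        raise FileNotFoundError(f"no {suffix} file found in {zip_path.name}")
    return matches[0]

# Demo with a hypothetical archive built on the fly (a stand-in for the downloaded item)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo_counties.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("demo_counties.shp", b"placeholder")
        zf.writestr("demo_counties.dbf", b"placeholder")
    shp_path = extract_and_find(demo_zip)
    found_name = shp_path.name
    print(found_name)
```

In the notebook, the path returned by `voter_zip.download(...)` would take the place of `demo_zip`.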

The shapefile's attribute table contains voter turnout by county for the entire US and is read here into a pandas dataframe. The voter_turn field holds each county's turnout in the 2016 election, stored as a fraction (for example, 0.61 for 61%). It is the dependent variable, to be predicted from the demographic and socioeconomic variables of each county.

In [7]:
# Read the shapefile's attribute table into a dataframe that will be used to build the model
sdf_main = pd.DataFrame.spatial.from_featureclass(output_path)
sdf_main.head()
Out[7]:
FID Join_Count TARGET_FID FIPS county state voter_turn gender_med householdi electronic ... NNeighbors ZTransform SpatialLag LMi_hi_sig LMi_normal Shape_Le_1 Shape_Ar_1 LMiHiDist NEAR_FID SHAPE
0 0 1 1 01001 Autauga Alabama 0.613738 38.6 25553.0 4.96 ... 44 0.211580 0.154568 0 0 2.496745e+05 2.208598e+09 133735.292502 0 {"rings": [[[-9619465, 3856529.0001000017], [-...
1 1 1 2 01003 Baldwin Alabama 0.627364 42.9 31429.0 4.64 ... 22 0.358894 0.057952 0 0 1.642763e+06 5.671096e+09 241925.196426 3 {"rings": [[[-9746859, 3539643.0001000017], [-...
2 2 1 3 01005 Barbour Alabama 0.513816 40.2 16876.0 3.49 ... 62 -0.868722 -0.498354 1 1 3.202971e+05 3.257816e+09 0.000000 0 {"rings": [[[-9468394, 3771591.0001000017], [-...
3 3 1 4 01007 Bibb Alabama 0.501364 39.3 19360.0 3.64 ... 43 -1.003341 0.286440 0 0 2.279101e+05 2.311955e+09 170214.485759 7 {"rings": [[[-9692114, 3928124.0001000017], [-...
4 4 1 5 01009 Blount Alabama 0.603064 40.9 21785.0 3.86 ... 51 0.096177 -0.336198 0 1 2.918753e+05 2.456919e+09 21128.568784 7 {"rings": [[[-9623907, 4063676.0001000017], [-...

5 rows × 97 columns

In [8]:
sdf_main.shape
Out[8]:
(3112, 97)
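To make the split between the dependent variable and the explanatory variables concrete, the sketch below separates `voter_turn` from the remaining columns for three rows copied from the preview above. It uses plain dictionaries rather than a dataframe so that it runs without any dependencies. Treating `FIPS`, `county`, and `state` as identifiers and the three numeric columns as explanatory variables is an assumption made for illustration, and `split_target` is a hypothetical helper.

```python
# Three rows copied from the attribute table preview; only a few of the 97 columns are included
rows = [
    {"FIPS": "01001", "county": "Autauga", "state": "Alabama",
     "voter_turn": 0.613738, "gender_med": 38.6, "householdi": 25553.0, "electronic": 4.96},
    {"FIPS": "01003", "county": "Baldwin", "state": "Alabama",
     "voter_turn": 0.627364, "gender_med": 42.9, "householdi": 31429.0, "electronic": 4.64},
    {"FIPS": "01005", "county": "Barbour", "state": "Alabama",
     "voter_turn": 0.513816, "gender_med": 40.2, "householdi": 16876.0, "electronic": 3.49},
]

TARGET = "voter_turn"
ID_COLUMNS = {"FIPS", "county", "state"}  # assumed identifiers, not predictors

def split_target(rows, target, id_columns):
    """Return (X, y): explanatory columns per row and the list of target values."""
    X = [{k: v for k, v in r.items() if k != target and k not in id_columns} for r in rows]
    y = [r[target] for r in rows]
    return X, y

X, y = split_target(rows, TARGET, ID_COLUMNS)

# voter_turn is stored as a fraction; multiply by 100 to read it as a percentage
y_percent = [round(v * 100, 1) for v in y]
print(sorted(X[0]))
print(y_percent)
```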

The data is visualized here by mapping the voter turnout field in five classes. A belt of comparatively low turnout, below 55%, runs across the southeastern part of the country.

In [9]:
# Visualize voter turnout by county, grouped into five natural-breaks classes
m1 = GIS().map('United States', zoomlevel=4)
sdf_main.spatial.plot(
    map_widget=m1, renderer_type='c', col='voter_turn', line_width=0.2,
    method='esriClassifyNaturalBreaks', class_count=5,
    cmap='gist_heat_r', alpha=0.7,
)
m1.legend = True
m1
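The 55% threshold mentioned for the low-turnout belt can also be checked numerically. The sketch below flags counties whose turnout fraction falls below 0.55, using only the five Alabama rows shown in the attribute table preview. It is a minimal illustration on that sample, not an analysis of the full dataset, and the variable names are hypothetical.

```python
# Turnout fractions for the five counties shown in the attribute table preview
turnout = {
    "Autauga": 0.613738,
    "Baldwin": 0.627364,
    "Barbour": 0.513816,
    "Bibb": 0.501364,
    "Blount": 0.603064,
}

LOW_TURNOUT = 0.55  # 55%, the threshold quoted for the southeastern belt

# Counties below the threshold, and their share of this small sample
low = sorted(name for name, value in turnout.items() if value < LOW_TURNOUT)
share_low = len(low) / len(turnout)
print(low, share_low)
```

On the full dataframe the same comparison could be applied to the `voter_turn` column as a whole.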