COVID-19 case forecasting using TimeSeriesModel from arcgis.learn

Introduction

COVID-19 forecasting has been vital for planning health care policy efficiently during the pandemic. Many forecasting models exist, and some of them require explanatory variables such as population or social distancing measures. This notebook uses the deep learning TimeSeriesModel from arcgis.learn to model the case counts and forecast future trends.

To demonstrate the utility of this method, this notebook analyzes confirmed cases for all counties in Alabama. The dataset contains the unique county FIPS ID, county name, state ID, and cumulative confirmed cases by date for each county. The dataset ranges from January 2020 to February 2022, with the last 14 days (19 January to 1 February 2022) held out to validate the quality of the forecast. The approach used to forecast future COVID-19 cases involves: (a) data processing (calculating the seven-day moving average to remove noise, and vertically stacking the county data), (b) creating functions for train-test splitting, tabular data preparation, model fitting using InceptionTime with a sequence length of 15, and forecasting, and (c) validation and visualization of the predicted data and results.

Importing Libraries

Input
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics

from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata

Connecting to your GIS

Input
gis = GIS("home")

Accessing the dataset

The latest dataset can be downloaded from USAFacts: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/

Input
# Access the data table
data_table = gis.content.get("b222748b885e4741839f3787f207b2b1")
data_table
Output
USA Covid Confirmed Cases Dataset
This data contains the confirmed covid cases from 01/22/2020 to 02/01/2022 for USA counties. The latest dataset can be downloaded from USAFacts: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
CSV by api_data_owner
Last Modified: May 12, 2022
0 comments, 2 views
Input
# Download the CSV and save it to a local folder
data_path = data_table.get_data()
Input
# Read the CSV file
confirmed = pd.read_csv(data_path)
confirmed.head()
Output
countyFIPS County Name State StateFIPS 2020-01-22 2020-01-23 2020-01-24 2020-01-25 2020-01-26 2020-01-27 ... 2022-01-23 2022-01-24 2022-01-25 2022-01-26 2022-01-27 2022-01-28 2022-01-29 2022-01-30 2022-01-31 2022-02-01
0 0 Statewide Unallocated AL 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1001 Autauga County AL 1 0 0 0 0 0 0 ... 13019 13251 13251 13251 13251 13251 13251 13251 13251 14826
2 1003 Baldwin County AL 1 0 0 0 0 0 0 ... 49168 50313 50313 50313 50313 50313 50313 50313 50313 53083
3 1005 Barbour County AL 1 0 0 0 0 0 0 ... 4902 5054 5054 5054 5054 5054 5054 5054 5054 5297
4 1007 Bibb County AL 1 0 0 0 0 0 0 ... 5663 5795 5795 5795 5795 5795 5795 5795 5795 6158

5 rows × 746 columns

Raw data cleaning

Input
# Extract the Alabama rows (county FIPS codes from 1000 to 1133)
confirmed_AL = confirmed.loc[
    (confirmed["countyFIPS"] >= 1000) & (confirmed["countyFIPS"] <= 1133)]
Input
# Stack the table for cumulative confirmed cases
confirmed_AL = confirmed_AL.set_index(["countyFIPS"])
confirmed_AL = confirmed_AL.drop(columns=["State", "County Name", "StateFIPS"])
confirmed_stacked_df = (
    confirmed_AL.stack()
    .reset_index()
    .rename(columns={"level_1": "OriginalDate", 0: "ConfirmedCases"})
)
confirmed_stacked_df
Output
countyFIPS OriginalDate ConfirmedCases
0 1001 2020-01-22 0
1 1001 2020-01-23 0
2 1001 2020-01-24 0
3 1001 2020-01-25 0
4 1001 2020-01-26 0
... ... ... ...
49709 1133 2022-01-28 6323
49710 1133 2022-01-29 6323
49711 1133 2022-01-30 6323
49712 1133 2022-01-31 6323
49713 1133 2022-02-01 7057

49714 rows × 3 columns
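
To make the stacking step easier to follow, here is a minimal sketch of the same wide-to-long reshaping on a toy table. The three date columns and the case values are made up for illustration; only the column names mirror the cell above.

```python
import pandas as pd

# Toy wide table: one row per county, one column per date (values are made up).
wide = pd.DataFrame(
    {
        "countyFIPS": [1001, 1003],
        "2020-01-22": [0, 0],
        "2020-01-23": [1, 2],
        "2020-01-24": [3, 5],
    }
)

# Index by county, then stack the date columns into rows.
# reset_index() names the stacked level "level_1" and the values column 0.
long_df = (
    wide.set_index("countyFIPS")
    .stack()
    .reset_index()
    .rename(columns={"level_1": "OriginalDate", 0: "ConfirmedCases"})
)

print(long_df)
```

Each county contributes one row per date, so two counties and three dates give six rows, in the same layout as the stacked Alabama table.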

Input
# Convert the date strings to the datetime format
# (an explicit format is used because infer_datetime_format is deprecated in pandas 2.0 and later)
confirmed_stacked_df["DateTime"] = pd.to_datetime(
    confirmed_stacked_df["OriginalDate"], format="%Y-%m-%d"
)
confirmed_stacked_df = confirmed_stacked_df.drop(columns=["OriginalDate"])
confirmed_stacked_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49714 entries, 0 to 49713
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   countyFIPS      49714 non-null  int64         
 1   ConfirmedCases  49714 non-null  int64         
 2   DateTime        49714 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(2)
memory usage: 1.1 MB

Calculate moving average for confirmed cases

Here, we calculate a 7-day simple moving average to smooth out the data and remove noise caused by spikes in testing results.

Input
# Set the moving average window to 7 days
SMA_Window = 7
# Copy the dataframe and map the positional index of each column to average to its name
df = confirmed_stacked_df.copy()
cols = {1: "ConfirmedCases"}
Input
# Compute the rolling mean county by county so that a window never spans two counties
for fips in df.countyFIPS.unique():
    for col in cols:
        field = f"{cols[col]}_SMA{SMA_Window}"
        df.loc[df["countyFIPS"] == fips, field] = (
            df.loc[df["countyFIPS"] == fips]
            .iloc[:, col]
            .rolling(window=SMA_Window)
            .mean()
        )
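
The per-county loop above can also be written with a grouped rolling mean. The following is a minimal sketch on toy data, not the notebook's own cell: the two county IDs and the case values are made up, and only the column names follow the ones used above.

```python
import pandas as pd

SMA_WINDOW = 7

# Toy long table: two hypothetical counties with 8 days each (values are made up).
toy = pd.DataFrame(
    {
        "countyFIPS": [1] * 8 + [2] * 8,
        "ConfirmedCases": list(range(8)) + [10 * v for v in range(8)],
    }
)

# Rolling mean computed within each county, so a window never spans two counties.
toy["ConfirmedCases_SMA7"] = toy.groupby("countyFIPS")["ConfirmedCases"].transform(
    lambda s: s.rolling(window=SMA_WINDOW).mean()
)

print(toy)
```

The first six rows of each county are NaN because a full 7-day window is not yet available, which is why the first six days are dropped in the next step.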

Cut off the first 6 days' data

As the first moving average value starts from the seventh day, we will disregard the first 6 days.

Input
firstMADay = df["DateTime"].iloc[0] + pd.DateOffset(days=SMA_Window - 1)
firstMADay
Output
Timestamp('2020-01-28 00:00:00')
Input
# Keep rows from the first full moving average day onward; chaining reset_index avoids modifying a slice in place
df_FirstMADay = df.loc[df["DateTime"] >= firstMADay].reset_index(drop=True)
df_FirstMADay
Output
countyFIPS ConfirmedCases DateTime ConfirmedCases_SMA7
0 1001 0 2020-01-28 0.000000
1 1001 0 2020-01-29 0.000000
2 1001 0 2020-01-30 0.000000
3 1001 0 2020-01-31 0.000000
4 1001 0 2020-02-01 0.000000
... ... ... ... ...
49307 1133 6323 2022-01-28 6248.714286
49308 1133 6323 2022-01-29 6285.857143
49309 1133 6323 2022-01-30 6323.000000
49310 1133 6323 2022-01-31 6323.000000
49311 1133 7057 2022-02-01 6427.857143

49312 rows × 4 columns

Time series data preprocessing

Preprocessing the data for univariate time series modeling includes selecting the required columns, converting the time field into the datetime format, and collecting all the counties of the state.

Input
# Selecting the required columns for modeling
df = df_FirstMADay[["DateTime", "ConfirmedCases_SMA7", "countyFIPS"]].copy()
df.columns = ["date", "cases", "countyFIPS"]
df.date = pd.to_datetime(df.date, format="%Y-%m-%d")
Input
df.tail()
Output
date cases countyFIPS
49307 2022-01-28 6248.714286 1133
49308 2022-01-29 6285.857143 1133
49309 2022-01-30 6323.000000 1133
49310 2022-01-31 6323.000000 1133
49311 2022-02-01 6427.857143 1133

Collecting the counties of Alabama

Input
# This cell collects all counties by their Unique FIPS IDs.
counties = df.countyFIPS.unique()
counties = [county for county in counties if county != 0]
len(counties)
Output
67

The next cell can be used to forecast for a specific county. You can declare the county to forecast by using its FIPS ID.

Input
# counties = df.countyFIPS.unique()
# counties = [county for county in counties if county == 1001]

Time series modeling and forecasting

Here, we will create the different functions for preparing tabular data, modeling, and forecasting that will later be called for each county.

Input
# This function selects the data for the specified county and splits it into train and test sets
def CountyData(county, test_size):
    data_file = df[df["countyFIPS"] == county].reset_index(drop=True)
    train, test = train_test_split(data_file, test_size=test_size, shuffle=False)
    return train, test
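
Because the split uses shuffle=False with an integer test size, the test set is simply the last test_size rows in time order. The sketch below shows the equivalent slicing on a toy list; `chronological_split` is a hypothetical helper written for illustration and is not part of scikit-learn or arcgis.learn.

```python
def chronological_split(rows, test_size):
    """Keep the last `test_size` rows for testing and the rest for training."""
    if not 0 < test_size < len(rows):
        raise ValueError("test_size must be between 1 and len(rows) - 1")
    return rows[:-test_size], rows[-test_size:]


# Toy series of 30 daily values (made up), ordered by date.
daily_values = list(range(1, 31))
train, test = chronological_split(daily_values, test_size=14)

print(len(train), len(test))
print(test[0], test[-1])
```

Keeping the order intact matters for forecasting: every training observation precedes every test observation, so the model is never validated on days that come before its training data.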

The next function prepares the tabular data, initializes the model with one of the available backbones (InceptionTime, ResCNN, ResNet, and FCN), trains it, and forecasts. The sequence length is set to 15, a value found by performing a grid search. The model is trained with the model.fit method, which is given the number of training epochs and the learning rate.

Input
def Model(train, seq_len, test_size):
    data = prepare_tabulardata(
        train, variable_predict="cases", index_field="date", seed=42
    )  # Preparing the tabular data
    tsmodel = TimeSeriesModel(
        data, seq_len=seq_len, model_arch="InceptionTime"
    )  # Model initialization
    lr_rate = tsmodel.lr_find()  # Finding the Learning rate
    tsmodel.fit(100, lr=lr_rate, checkpoint=False)  # Model training
    sdf_forecasted = tsmodel.predict(
        train, prediction_type="dataframe", number_of_predictions=test_size
    )  # Forecasting using the trained TimeSeriesModel
    return sdf_forecasted
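
The sequence length controls how many past values are used for each prediction. The sketch below illustrates the general sliding-window idea on a toy series. It is an assumption about the concept only, not a description of how arcgis.learn builds its windows internally, and `make_windows` is a hypothetical helper.

```python
def make_windows(series, seq_len):
    """Pair each run of `seq_len` consecutive values with the value that follows it."""
    windows, targets = [], []
    for start in range(len(series) - seq_len):
        windows.append(series[start : start + seq_len])
        targets.append(series[start + seq_len])
    return windows, targets


# Toy series of 20 daily values (made up).
series = [float(v) for v in range(20)]
windows, targets = make_windows(series, seq_len=15)

print(len(windows))
print(windows[0], "->", targets[0])
```

With 20 values and a sequence length of 15, only 5 window and target pairs are available, which shows why a longer sequence length leaves fewer training samples for a series of fixed length.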
Input
# This function evaluates the model metrics and returns them in a dictionary
def evaluate(test, sdf_forecasted):
    # The forecasted values are the last len(test) rows of the prediction dataframe
    predicted = sdf_forecasted["cases_results"][-len(test):]
    r2_test = r2_score(test["cases"], predicted)
    mse = metrics.mean_squared_error(test["cases"], predicted)
    mae = metrics.mean_absolute_error(test["cases"], predicted)
    return {
        "DATE": test["date"],
        "cases_actual": test["cases"],
        "cases_predicted": predicted,
        "R-square": round(r2_test, 2),
        "V_RMSE": round(np.sqrt(mse), 4),
        "MAE": round(mae, 4),
    }
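
For reference, the three metrics can be computed by hand. The sketch below uses only the standard library and made-up numbers; `regression_metrics` is a hypothetical helper that follows the usual definitions of R-square, RMSE, and MAE rather than reproducing the scikit-learn code.

```python
import math


def regression_metrics(actual, predicted):
    """Return R-square, RMSE, and MAE for two equally long sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    # Note: ss_tot is zero when all actual values are identical, so R-square is undefined there.
    return {
        "R-square": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
    }


# Made-up actual and predicted case counts for four days.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 129.0]
result = regression_metrics(actual, predicted)

print(result)
```

RMSE and MAE are expressed in the same unit as the target (cases), so they scale with county size, while R-square is unitless.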
Input
# This class wraps the functions defined above for a single county
class CovidModel(object):
    seq_len = 15
    test_size = 14

    def __init__(self, county):
        self.county = county

    def CountyData(self):
        self.train, self.test = CountyData(self.county, self.test_size)

    def Model(self):
        self.sdf_forecasted = Model(self.train, self.seq_len, self.test_size)

    def evaluate(self):
        return evaluate(self.test, self.sdf_forecasted)

Training the model for all counties and saving the metrics in the dictionary.

Input
dct = {}

for i, county in enumerate(counties):
    covidmodel = CovidModel(county)
    covidmodel.CountyData()
    covidmodel.Model()
    dct[county] = covidmodel.evaluate()
epoch train_loss valid_loss time
0 0.022296 0.072083 00:00
1 0.021528 0.056567 00:00
2 0.019676 0.056948 00:00
3 0.017958 0.063523 00:00
4 0.016059 0.051571 00:00
5 0.014255 0.022249 00:00
6 0.012279 0.007439 00:00
7 0.010936 0.004679 00:00
8 0.010135 0.003242 00:00
9 0.008722 0.002114 00:00
10 0.007479 0.001448 00:00
11 0.006344 0.000820 00:00
12 0.005539 0.000626 00:00
13 0.004810 0.000156 00:00
14 0.004176 0.000091 00:00
15 0.003613 0.000100 00:00
16 0.003132 0.000077 00:00
17 0.002894 0.000230 00:00
18 0.002629 0.000161 00:00
19 0.002283 0.000087 00:00
20 0.002073 0.000320 00:00
21 0.001855 0.000131 00:00
22 0.001651 0.000139 00:00
23 0.001483 0.000202 00:00
24 0.001333 0.000078 00:00
25 0.001308 0.000239 00:00
26 0.001160 0.000315 00:00
27 0.001031 0.000212 00:00
28 0.000994 0.000078 00:00
29 0.001058 0.000162 00:00
30 0.000948 0.000865 00:00
31 0.000862 0.000587 00:00
32 0.000797 0.000192 00:00
33 0.000722 0.000051 00:00
34 0.000668 0.000104 00:00
35 0.000648 0.000066 00:00
36 0.000594 0.000543 00:00
37 0.000642 0.000080 00:00
38 0.000594 0.000081 00:00
39 0.000528 0.000192 00:00
40 0.000490 0.000520 00:00
41 0.000479 0.000175 00:00
42 0.000474 0.000509 00:00
43 0.000467 0.000559 00:00
44 0.000547 0.000214 00:00
45 0.000499 0.000077 00:00
46 0.000451 0.000152 00:00
47 0.000444 0.001184 00:00
48 0.000469 0.000031 00:00
49 0.000413 0.000155 00:00
50 0.000397 0.000246 00:00
51 0.000362 0.000137 00:00
52 0.000329 0.000027 00:00
53 0.000295 0.000019 00:00
54 0.000265 0.000049 00:00
55 0.000239 0.000032 00:00
56 0.000224 0.000048 00:00
57 0.000247 0.000253 00:00
58 0.000257 0.000026 00:00
59 0.000266 0.000036 00:00
60 0.000248 0.000051 00:00
61 0.000224 0.000021 00:00
62 0.000213 0.000013 00:00
63 0.000207 0.000080 00:00
64 0.000229 0.000025 00:00
65 0.000210 0.000038 00:00
66 0.000195 0.000048 00:00
67 0.000183 0.000049 00:00
68 0.000206 0.000037 00:00
69 0.000188 0.000019 00:00
70 0.000172 0.000014 00:00
71 0.000169 0.000020 00:00
72 0.000158 0.000017 00:00
73 0.000149 0.000028 00:00
74 0.000158 0.000022 00:00
75 0.000174 0.000066 00:00
76 0.000180 0.000097 00:00
77 0.000176 0.000171 00:00
78 0.000170 0.000046 00:00
79 0.000161 0.000037 00:00
80 0.000170 0.000012 00:00
81 0.000164 0.000024 00:00
82 0.000158 0.000030 00:00
83 0.000155 0.000033 00:00
84 0.000147 0.000022 00:00
85 0.000137 0.000019 00:00
86 0.000136 0.000018 00:00
87 0.000126 0.000011 00:00
88 0.000122 0.000013 00:00
89 0.000117 0.000016 00:00
90 0.000109 0.000013 00:00
91 0.000107 0.000010 00:00
92 0.000106 0.000008 00:00
93 0.000099 0.000010 00:00
94 0.000103 0.000011 00:00
95 0.000099 0.000019 00:00
96 0.000099 0.000036 00:00
97 0.000103 0.000018 00:00
98 0.000097 0.000012 00:00
99 0.000109 0.000014 00:00

Result Visualization

Finally, the actual and forecasted values are plotted to visualize their distribution over the validation period, with the orange line representing the forecasted values and the blue line representing the actual values.

Input
# Specify a few counties for visualizing the results
viz_counties = [1007, 1113]

for i, county in enumerate(viz_counties):
    result_df = pd.DataFrame(dct[county])
    plt.figure(figsize=(20, 5))
    plt.plot(result_df["DATE"], result_df[["cases_actual", "cases_predicted"]])
    plt.xlabel("Date")
    plt.ylabel("Covid Cases")
    plt.legend(["Cases_Actual", "Cases_Predicted"], loc="upper left")
    plt.title(str(county) + ": Covid Forecast Result")
    plt.show()
Input
# Access the Alabama counties feature layer and convert it to a spatial dataframe
item = gis.content.get("41e8eb46285d4e1f85ee6e826b05e077")
flayer = item.layers[0]
f_sdf = flayer.query().sdf
Input
# Add the RMSE and MAE from the output dictionary to the spatial dataframe
# (values are assigned by position, so the rows of f_sdf must be in the same order as counties)
RMSE = []
MAE = []
for i, county in enumerate(counties):
    MAE.append(dct[county]["MAE"])
    RMSE.append(dct[county]["V_RMSE"])

f_sdf = f_sdf.assign(RMSE=RMSE, MAE=MAE)
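
Assigning the metrics by position relies on the layer rows and the counties list being in the same order. If that cannot be guaranteed, joining on the county key is a safer option. The sketch below uses a small stand-in dataframe: the "FIPS" and "NAME" column names are assumptions about the layer schema, and the metric values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the counties layer, deliberately not sorted by FIPS.
layer_df = pd.DataFrame(
    {"FIPS": [1003, 1001, 1005], "NAME": ["Baldwin", "Autauga", "Barbour"]}
)

# Metrics keyed by county FIPS, shaped like the output dictionary (values are made up).
metrics_by_county = {
    1001: {"V_RMSE": 12.5, "MAE": 9.1},
    1003: {"V_RMSE": 40.2, "MAE": 31.7},
    1005: {"V_RMSE": 8.3, "MAE": 6.0},
}

# Turn the dictionary into a table with one row per county.
metrics_df = (
    pd.DataFrame.from_dict(metrics_by_county, orient="index")[["V_RMSE", "MAE"]]
    .rename(columns={"V_RMSE": "RMSE"})
    .rename_axis("FIPS")
    .reset_index()
)

# Join on the key; validate raises an error if a FIPS code is duplicated on either side.
joined = layer_df.merge(metrics_df, on="FIPS", how="left", validate="one_to_one")

print(joined)
```

A left join keeps every row of the layer in its original order, and a county without metrics would show up as NaN instead of silently receiving another county's values.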

Next, we will publish this spatial dataframe as a feature layer.

Input
published_sdf = gis.content.import_data(f_sdf, title='Alabama Covid Time Series Model Metrics')
published_sdf
Output
Alabama Covid Time Series Model Metrics
This is the feature layer containing the RMSE and MAE errors resulted from time series analysis of COVID cases using TimeSeriesModel from arcgis.learn
Feature Layer Collection by api_data_owner
Last Modified: May 12, 2022
0 comments, 3 views

Next, we will open the published web layer by passing its item ID to gis.content.get() and add it to a map.

Input
item = gis.content.get("9d197a4870a1479c81ddfd6b739816da")
map1 = gis.map("Alabama")
map1.add_layer(item)
map1.legend = True
map1

The map shows that most counties have an RMSE between 18 and 400 cases, represented by the blue polygons. A smaller number of counties, shown in green and cream, have a higher RMSE, and the single red county has the highest RMSE. This suggests that InceptionTime performs well for most of the state, and that other backbones could be tried to reduce the RMSE in the counties where it is higher.

Conclusion

This study conducted a univariate time series analysis using the deep learning TimeSeriesModel from the arcgis.learn library and forecasted the COVID-19 confirmed cases for the counties in Alabama. The raw data was smoothed with a seven-day moving average to dampen sudden spikes. The methodology also included preparing a time series dataset using the prepare_tabulardata() method, followed by modeling, forecasting, and validating against the test dataset. The TimeSeriesModel from arcgis.learn includes backbones, such as InceptionTime, ResCNN, ResNet, and FCN, that do not need fine-tuning of multiple hyperparameters before fitting the model. The method produced reasonably accurate results for most counties, and users can change the sequence length or backbone when forecasting for other areas.
