COVID-19 case forecasting using TimeSeriesModel from arcgis.learn

Introduction

COVID-19 forecasting has been vital for efficiently planning health care policy during the pandemic. Many forecasting models exist, some of which require explanatory variables such as population or social distancing measures. This notebook instead uses the deep learning TimeSeriesModel from arcgis.learn, which here forecasts future case counts from the historical case series alone.

To demonstrate the utility of this method, this notebook analyzes confirmed cases for all counties in Alabama. The dataset contains the unique county FIPS ID, county name, state ID, and cumulative confirmed cases by date for each county. The dataset ranges from January 2020 to February 2022, with the final 14 days (January 19 to February 1, 2022) held out to validate the quality of the forecast. The approach used to forecast future COVID-19 cases involves: (a) data processing (calculating the seven-day moving average to remove noise and vertically stacking the county data), (b) creating functions for train-test splitting, tabular data preparation, model fitting using InceptionTime with a sequence length of 15, and forecasting, and (c) validation and visualization of the predicted data and results.

Importing Libraries

%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics

from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata

Connecting to your GIS

gis = GIS("home")

Accessing the dataset

The latest dataset can be downloaded from USAFacts: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/

# Access the data table
data_table = gis.content.get("b222748b885e4741839f3787f207b2b1")
data_table
USA Covid Confirmed Cases Dataset
This data contains the confirmed covid cases from 01/22/2020 to 02/01/2022 for USA counties. The latest dataset can be downloaded from USAFacts: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
CSV by api_data_owner
Last Modified: May 12, 2022

# Download the CSV and save it to a local folder
data_path = data_table.get_data()
# Read the CSV file
confirmed = pd.read_csv(data_path)
confirmed.head()
   countyFIPS            County Name  State  StateFIPS  2020-01-22  2020-01-23  2020-01-24  2020-01-25  2020-01-26  2020-01-27  ...  2022-01-23  2022-01-24  2022-01-25  2022-01-26  2022-01-27  2022-01-28  2022-01-29  2022-01-30  2022-01-31  2022-02-01
0           0  Statewide Unallocated     AL          1           0           0           0           0           0           0  ...           0           0           0           0           0           0           0           0           0           0
1        1001         Autauga County     AL          1           0           0           0           0           0           0  ...       13019       13251       13251       13251       13251       13251       13251       13251       13251       14826
2        1003         Baldwin County     AL          1           0           0           0           0           0           0  ...       49168       50313       50313       50313       50313       50313       50313       50313       50313       53083
3        1005         Barbour County     AL          1           0           0           0           0           0           0  ...        4902        5054        5054        5054        5054        5054        5054        5054        5054        5297
4        1007            Bibb County     AL          1           0           0           0           0           0           0  ...        5663        5795        5795        5795        5795        5795        5795        5795        5795        6158

5 rows × 746 columns

Raw data cleaning

# Extract the data of Alabama State
confirmed_AL = confirmed.loc[
    (confirmed["countyFIPS"] >= 1000) & (confirmed["countyFIPS"] <= 1133)]
# Stack the table for cumulative confirmed cases
confirmed_AL = confirmed_AL.set_index(["countyFIPS"])
confirmed_AL = confirmed_AL.drop(columns=["State", "County Name", "StateFIPS"])
confirmed_stacked_df = (
    confirmed_AL.stack()
    .reset_index()
    .rename(columns={"level_1": "OriginalDate", 0: "ConfirmedCases"})
)
confirmed_stacked_df
       countyFIPS  OriginalDate  ConfirmedCases
0            1001    2020-01-22               0
1            1001    2020-01-23               0
2            1001    2020-01-24               0
3            1001    2020-01-25               0
4            1001    2020-01-26               0
...           ...           ...             ...
49709        1133    2022-01-28            6323
49710        1133    2022-01-29            6323
49711        1133    2022-01-30            6323
49712        1133    2022-01-31            6323
49713        1133    2022-02-01            7057

49714 rows × 3 columns
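
The reshaping step above can be checked on a small made-up table. The sketch below uses invented counts and only borrows the column names from this notebook; it suggests that `stack()` and `melt()` give the same long table when there are no missing values.

```python
import pandas as pd

# Toy wide table: one row per county, one column per date (values are made up)
wide = pd.DataFrame(
    {
        "countyFIPS": [1001, 1003],
        "2020-01-22": [0, 1],
        "2020-01-23": [2, 3],
    }
)

# Approach used in this notebook: index by county, then stack the date columns
stacked = (
    wide.set_index("countyFIPS")
    .stack()
    .reset_index()
    .rename(columns={"level_1": "OriginalDate", 0: "ConfirmedCases"})
)

# Equivalent reshaping with melt, sorted to match the stacked row order
melted = (
    wide.melt(id_vars="countyFIPS", var_name="OriginalDate", value_name="ConfirmedCases")
    .sort_values(["countyFIPS", "OriginalDate"])
    .reset_index(drop=True)
)
```

Either form yields one row per county and date, which is the layout the later per-county filtering relies on.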

# Convert the date strings to datetime format
confirmed_stacked_df["DateTime"] = pd.to_datetime(
    confirmed_stacked_df["OriginalDate"], format="%Y-%m-%d"
)
confirmed_stacked_df = confirmed_stacked_df.drop(columns=["OriginalDate"])
confirmed_stacked_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49714 entries, 0 to 49713
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   countyFIPS      49714 non-null  int64         
 1   ConfirmedCases  49714 non-null  int64         
 2   DateTime        49714 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(2)
memory usage: 1.1 MB

Calculate Moving Average for Confirmed cases

Here, we calculate a 7-day simple moving average to smooth out the data and remove noise caused by spikes in testing results.

# Set the moving average window to 7 days
SMA_Window = 7
# Copy the dataframe and specify the columns to average (column position: column name)
df = confirmed_stacked_df.copy()
cols = {1: "ConfirmedCases"}
for fips in df.countyFIPS.unique():
    for col in cols:
        field = f"{cols[col]}_SMA{SMA_Window}"
        df.loc[df["countyFIPS"] == fips, field] = (
            df.loc[df["countyFIPS"] == fips]
            .iloc[:, col]
            .rolling(window=SMA_Window)
            .mean()
        )
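
As an alternative to the explicit loop, the same per-county moving average can be sketched with `groupby` and `transform`. This is a minimal illustration on made-up counts, with a 3-day window instead of 7 to keep the toy data small; it is not the notebook's own code.

```python
import pandas as pd

SMA_WINDOW = 3  # shorter than the notebook's 7 so the toy data stays small

# Toy long table with two counties and made-up cumulative counts
toy = pd.DataFrame(
    {
        "countyFIPS": [1001] * 4 + [1003] * 4,
        "ConfirmedCases": [0, 3, 6, 9, 10, 10, 40, 40],
    }
)

# Rolling mean computed within each county, so a window never spans two counties
toy["ConfirmedCases_SMA"] = toy.groupby("countyFIPS")["ConfirmedCases"].transform(
    lambda s: s.rolling(window=SMA_WINDOW).mean()
)
```

The first `SMA_WINDOW - 1` rows of each county come out as NaN because the window is not yet full, which is the reason the first six days are discarded in the next step.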

Cut off the first 6 days of data

Because the first moving average value is available on the seventh day, we discard the first 6 days.

firstMADay = df["DateTime"].iloc[0] + pd.DateOffset(days=SMA_Window - 1)
firstMADay
Timestamp('2020-01-28 00:00:00')
df_FirstMADay = df.loc[df["DateTime"] >= firstMADay].reset_index(drop=True)
df_FirstMADay
       countyFIPS  ConfirmedCases    DateTime  ConfirmedCases_SMA7
0            1001               0  2020-01-28             0.000000
1            1001               0  2020-01-29             0.000000
2            1001               0  2020-01-30             0.000000
3            1001               0  2020-01-31             0.000000
4            1001               0  2020-02-01             0.000000
...           ...             ...         ...                  ...
49307        1133            6323  2022-01-28          6248.714286
49308        1133            6323  2022-01-29          6285.857143
49309        1133            6323  2022-01-30          6323.000000
49310        1133            6323  2022-01-31          6323.000000
49311        1133            7057  2022-02-01          6427.857143

49312 rows × 4 columns
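
The cutoff date and the resulting row count can be verified in isolation. The sketch below only uses the dates and the county count reported in this notebook; the variable names are illustrative.

```python
import pandas as pd

SMA_WINDOW = 7
first_day = pd.Timestamp("2020-01-22")  # first date column in the dataset

# A 7-day window is first complete on the 7th observation, i.e. 6 days after the start
first_ma_day = first_day + pd.DateOffset(days=SMA_WINDOW - 1)

# Number of daily rows left per county from the cutoff through 2022-02-01
n_days = len(pd.date_range(first_ma_day, "2022-02-01", freq="D"))

# 67 Alabama counties, each with n_days rows
n_rows = 67 * n_days
```

With these inputs the per-county series has 736 days, and 67 counties give the 49312 rows shown in the table above.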

Time series data preprocessing

The preprocessing of the data for univariate time series modeling includes selecting the required columns, converting the time field into date-time format, and collecting all the counties of the state.

# Selecting the required columns for modeling
df = df_FirstMADay[["DateTime", "ConfirmedCases_SMA7", "countyFIPS"]].copy()
df.columns = ["date", "cases", "countyFIPS"]
df.date = pd.to_datetime(df.date, format="%Y-%m-%d")
df.tail()
            date        cases  countyFIPS
49307 2022-01-28  6248.714286        1133
49308 2022-01-29  6285.857143        1133
49309 2022-01-30  6323.000000        1133
49310 2022-01-31  6323.000000        1133
49311 2022-02-01  6427.857143        1133

Collecting the counties of Alabama

# This cell collects all counties by their Unique FIPS IDs.
counties = df.countyFIPS.unique()
counties = [county for county in counties if county != 0]
len(counties)
67

The next cell can be used to forecast for a specific county. You can declare the county to forecast by using its FIPS ID.

# counties = df.countyFIPS.unique()
# counties = [county for county in counties if county == 1001]

Time series modeling and forecasting

Here, we will create the different functions for preparing tabular data, modeling, and forecasting that will later be called for each county.

# This function selects the data for the specified county and splits it into train and test sets
def CountyData(county, test_size):
    data_file = df[df["countyFIPS"] == county].reset_index(drop=True)
    # shuffle=False keeps the rows in time order, so the test set is the last test_size days
    train, test = train_test_split(data_file, test_size=test_size, shuffle=False)
    return train, test
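
To see what the split above does with time-ordered rows, here is a small sketch on a stand-in array of 20 consecutive days. The numbers are made up; only the `test_size=14` and `shuffle=False` arguments mirror this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

days = np.arange(20)  # stand-in for 20 consecutive daily observations

# An integer test_size keeps exactly that many rows; shuffle=False keeps them in order
train, test = train_test_split(days, test_size=14, shuffle=False)

# The same holdout written as plain slicing, for comparison
train_slice, test_slice = days[:-14], days[-14:]
```

Because nothing is shuffled, the test set is always the most recent block of observations, so the model is never trained on days that come after the ones it is evaluated on.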

The next function prepares the tabular data and initializes the model with one of the available backbones (InceptionTime, ResCNN, ResNet, and FCN); InceptionTime is used here. The sequence length is set to 15, a value found by performing a grid search. The model is trained with the model.fit method, which takes the number of training epochs and the learning rate. The function then forecasts test_size steps beyond the training data.

def Model(train, seq_len, test_size):
    data = prepare_tabulardata(
        train, variable_predict="cases", index_field="date", seed=42
    )  # Preparing the tabular data
    tsmodel = TimeSeriesModel(
        data, seq_len=seq_len, model_arch="InceptionTime"
    )  # Model initialization
    lr_rate = tsmodel.lr_find()  # Finding the Learning rate
    tsmodel.fit(100, lr=lr_rate, checkpoint=False)  # Model training
    sdf_forecasted = tsmodel.predict(
        train, prediction_type="dataframe", number_of_predictions=test_size
    )  # Forecasting using the trained TimeSeriesModel
    return sdf_forecasted
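
How TimeSeriesModel builds its training samples internally is not shown in this notebook. As a rough illustration only, a common way to frame a series for sequence models is to pair each window of `seq_len` consecutive values with the value that follows it. The helper `make_windows` below is hypothetical and is not part of arcgis.learn.

```python
import numpy as np


def make_windows(series, seq_len):
    """Pair each run of `seq_len` values with the value that follows it (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - seq_len  # assumes the series is longer than seq_len
    X = np.stack([series[i : i + seq_len] for i in range(n_samples)])
    y = series[seq_len:]
    return X, y


# 20 made-up daily values and a sequence length of 15 give 5 input/target pairs
X, y = make_windows(np.arange(20), seq_len=15)
```

Under this framing, a longer sequence length gives the model more history per sample but fewer samples overall, which is one reason the value is worth tuning.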

# This function evaluates the model metrics and returns them in a dictionary
def evaluate(test, sdf_forecasted):
    # The forecasted rows are the last len(test) rows of the prediction dataframe
    predicted = sdf_forecasted["cases_results"][-len(test):]
    r2_test = r2_score(test["cases"], predicted)
    mse = metrics.mean_squared_error(test["cases"], predicted)
    mae = metrics.mean_absolute_error(test["cases"], predicted)
    return {
        "DATE": test["date"],
        "cases_actual": test["cases"],
        "cases_predicted": predicted,
        "R-square": round(r2_test, 2),
        "V_RMSE": round(np.sqrt(mse), 4),
        "MAE": round(mae, 4),
    }
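
The three metrics returned above can be worked through by hand on a few made-up values. The arrays below are invented for illustration and are unrelated to the county results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([100.0, 110.0, 120.0, 130.0])     # made-up observed values
predicted = np.array([102.0, 108.0, 123.0, 127.0])  # made-up forecasts

mae = mean_absolute_error(actual, predicted)           # mean of |actual - predicted|
rmse = np.sqrt(mean_squared_error(actual, predicted))  # square root of the mean squared error
r2 = r2_score(actual, predicted)                       # 1 - SS_res / SS_tot

# The same R-squared written out from its definition
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
```

MAE and RMSE are expressed in the same units as the series (cases), while R-squared is unitless; RMSE penalizes large individual errors more heavily than MAE.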
# This class calls all the defined functions
class CovidModel(object):
    seq_len = 15
    test_size = 14

    def __init__(self, county):
        self.county = county

    def CountyData(self):
        self.train, self.test = CountyData(self.county, self.test_size)

    def Model(self):
        self.sdf_forecasted = Model(self.train, self.seq_len, self.test_size)

    def evaluate(self):
        return evaluate(self.test, self.sdf_forecasted)

Training the model for all counties and saving the metrics in the dictionary.

dct = {}

for i, county in enumerate(counties):
    covidmodel = CovidModel(county)
    covidmodel.CountyData()
    covidmodel.Model()
    dct[county] = covidmodel.evaluate()
<Figure size 432x288 with 1 Axes>
epoch  train_loss  valid_loss  time
0      0.022296    0.072083    00:00
1      0.021528    0.056567    00:00
2      0.019676    0.056948    00:00
3      0.017958    0.063523    00:00
4      0.016059    0.051571    00:00
5      0.014255    0.022249    00:00
6      0.012279    0.007439    00:00
7      0.010936    0.004679    00:00
8      0.010135    0.003242    00:00
9      0.008722    0.002114    00:00
10     0.007479    0.001448    00:00
11     0.006344    0.000820    00:00
12     0.005539    0.000626    00:00
13     0.004810    0.000156    00:00
14     0.004176    0.000091    00:00
15     0.003613    0.000100    00:00
16     0.003132    0.000077    00:00
17     0.002894    0.000230    00:00
18     0.002629    0.000161    00:00
19     0.002283    0.000087    00:00
20     0.002073    0.000320    00:00
21     0.001855    0.000131    00:00
22     0.001651    0.000139    00:00
23     0.001483    0.000202    00:00
24     0.001333    0.000078    00:00
25     0.001308    0.000239    00:00
26     0.001160    0.000315    00:00
27     0.001031    0.000212    00:00
28     0.000994    0.000078    00:00
29     0.001058    0.000162    00:00
30     0.000948    0.000865    00:00
31     0.000862    0.000587    00:00
32     0.000797    0.000192    00:00
33     0.000722    0.000051    00:00
34     0.000668    0.000104    00:00
35     0.000648    0.000066    00:00
36     0.000594    0.000543    00:00
37     0.000642    0.000080    00:00
38     0.000594    0.000081    00:00
39     0.000528    0.000192    00:00
40     0.000490    0.000520    00:00
41     0.000479    0.000175    00:00
42     0.000474    0.000509    00:00
43     0.000467    0.000559    00:00
44     0.000547    0.000214    00:00
45     0.000499    0.000077    00:00
46     0.000451    0.000152    00:00
47     0.000444    0.001184    00:00
48     0.000469    0.000031    00:00
49     0.000413    0.000155    00:00
50     0.000397    0.000246    00:00
51     0.000362    0.000137    00:00
52     0.000329    0.000027    00:00
53     0.000295    0.000019    00:00
54     0.000265    0.000049    00:00
55     0.000239    0.000032    00:00
56     0.000224    0.000048    00:00
57     0.000247    0.000253    00:00
58     0.000257    0.000026    00:00
59     0.000266    0.000036    00:00
60     0.000248    0.000051    00:00
61     0.000224    0.000021    00:00
62     0.000213    0.000013    00:00
63     0.000207    0.000080    00:00
64     0.000229    0.000025    00:00
65     0.000210    0.000038    00:00
66     0.000195    0.000048    00:00
67     0.000183    0.000049    00:00
68     0.000206    0.000037    00:00
69     0.000188    0.000019    00:00
70     0.000172    0.000014    00:00
71     0.000169    0.000020    00:00
72     0.000158    0.000017    00:00
73     0.000149    0.000028    00:00
74     0.000158    0.000022    00:00
75     0.000174    0.000066    00:00
76     0.000180    0.000097    00:00
77     0.000176    0.000171    00:00
78     0.000170    0.000046    00:00
79     0.000161    0.000037    00:00
80     0.000170    0.000012    00:00
81     0.000164    0.000024    00:00
82     0.000158    0.000030    00:00
83     0.000155    0.000033    00:00
84     0.000147    0.000022    00:00
85     0.000137    0.000019    00:00
86     0.000136    0.000018    00:00
87     0.000126    0.000011    00:00
88     0.000122    0.000013    00:00
89     0.000117    0.000016    00:00
90     0.000109    0.000013    00:00
91     0.000107    0.000010    00:00
92     0.000106    0.000008    00:00
93     0.000099    0.000010    00:00
94     0.000103    0.000011    00:00
95     0.000099    0.000019    00:00
96     0.000099    0.000036    00:00
97     0.000103    0.000018    00:00
98     0.000097    0.000012    00:00
99     0.000109    0.000014    00:00
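
Once the loop has finished, the scalar metrics for each county can be gathered into a single table for inspection. The sketch below uses a made-up stand-in for `dct` that holds only the scalar keys; with the real dictionary, the Series-valued entries (`DATE`, `cases_actual`, `cases_predicted`) would need to be left out first.

```python
import pandas as pd

# Made-up stand-in for `dct`: one entry per county with scalar metrics only
toy_dct = {
    1001: {"R-square": 0.91, "V_RMSE": 120.5, "MAE": 98.2},
    1003: {"R-square": 0.75, "V_RMSE": 410.0, "MAE": 350.7},
    1005: {"R-square": 0.88, "V_RMSE": 35.4, "MAE": 30.1},
}

# One row per county, sorted so the counties with the largest error come first
summary = (
    pd.DataFrame.from_dict(toy_dct, orient="index")
    .rename_axis("countyFIPS")
    .sort_values("V_RMSE", ascending=False)
)
```

A table like this makes it easy to spot the counties where a different backbone or sequence length might be worth trying.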

Result Visualization

Finally, the actual and forecasted values are plotted to visualize their distribution over the validation period, with the orange line representing the forecasted values and the blue line representing the actual values.

# Specify a few counties for visualizing the results
viz_counties = [1007,1113]

for i, county in enumerate(viz_counties):
    result_df = pd.DataFrame(dct[county])
    plt.figure(figsize=(20, 5))
    plt.plot(result_df["DATE"], result_df[["cases_actual", "cases_predicted"]])
    plt.xlabel("Date")
    plt.ylabel("Covid Cases")
    plt.legend(["Cases_Actual", "Cases_Predicted"], loc="upper left")
    plt.title(str(county) + ": Covid Forecast Result")
    plt.show()
<Figure size 1440x360 with 1 Axes>
<Figure size 1440x360 with 1 Axes>

# Access the Alabama counties feature layer and convert it to a spatially enabled DataFrame
item = gis.content.get("41e8eb46285d4e1f85ee6e826b05e077")
flayer = item.layers[0]
f_sdf = flayer.query().sdf
# Add the RMSE and MAE from the output dictionary to the spatial dataframe
# Note: values are assigned by position, so the layer rows must be in the same order as `counties`
RMSE = []
MAE = []
for i, county in enumerate(counties):
    MAE.append(dct[county]["MAE"])
    RMSE.append(dct[county]["V_RMSE"])

f_sdf = f_sdf.assign(RMSE=RMSE, MAE=MAE)
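
Assigning the lists by position works only if the layer's rows follow the same order as `counties`. Where that is not guaranteed, a key-based join is a safer pattern. The sketch below uses made-up frames, and the `FIPS` column name is an assumption; the actual field name in the feature layer is not shown in this notebook.

```python
import pandas as pd

# Made-up stand-in for the county layer; the "FIPS" column name is an assumption
layer_df = pd.DataFrame(
    {"FIPS": [1003, 1001, 1005], "NAME": ["Baldwin", "Autauga", "Barbour"]}
)

# Made-up metrics keyed by county FIPS, in a different row order than the layer
metrics_df = pd.DataFrame(
    {"FIPS": [1001, 1003, 1005], "RMSE": [120.5, 410.0, 35.4], "MAE": [98.2, 350.7, 30.1]}
)

# A left merge on the key attaches each metric to the right county regardless of row order
joined = layer_df.merge(metrics_df, on="FIPS", how="left", validate="one_to_one")
```

The `validate="one_to_one"` argument makes pandas raise an error if a FIPS code is duplicated on either side, which guards against silently multiplying rows.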

Next, we will publish this spatial dataframe as a feature layer.

published_sdf = gis.content.import_data(f_sdf, title='Alabama Covid Time Series Model Metrics')
published_sdf
Alabama Covid Time Series Model Metrics
This is the feature layer containing the RMSE and MAE errors resulting from time series analysis of COVID cases using TimeSeriesModel from arcgis.learn
Feature Layer Collection by api_data_owner
Last Modified: May 12, 2022

Next, we will open the published web layer by passing its item ID to gis.content.get and adding it to a map.

item = gis.content.get("9d197a4870a1479c81ddfd6b739816da")
map1 = gis.map("Alabama")
map1.add_layer(item)
map1.legend = True
map1

From the map, it can be seen that most counties have an RMSE in the range of 18 to 400 cases, represented by the blue polygons. A smaller number of counties, shown in green and cream, have higher RMSE, and the single red county has the maximum RMSE. This indicates that InceptionTime performs well for most of this state, and that other backbones could be tried to reduce the RMSE in the counties where it is higher.

Conclusion

This study conducted a univariate time series analysis using the deep learning TimeSeriesModel from the arcgis.learn library and forecasted the COVID-19 confirmed cases for the counties in Alabama. The raw cumulative counts were smoothed with a seven-day moving average to dampen sudden spikes. The methodology also included preparing a time series dataset using the prepare_tabulardata() method, followed by modeling, predicting, and validating against the test dataset. The TimeSeriesModel from arcgis.learn includes backbones, such as InceptionTime, ResCNN, ResNet, and FCN, that do not need fine-tuning of multiple hyperparameters before fitting the model. Our method produced reasonably accurate results, and users can change the sequence length or backbone when forecasting for other areas.
