Forecasting monthly rainfall in California using Deep Learning Time Series techniques

Introduction

Forest fires of historical proportions are ravaging various parts of California, started by a long and continuous period of drought. To help in dealing with this growing environmental emergency, utilizing prior knowledge of rainfall is critical. In this sample study, the deepelarning timeseries model from ArcGIS learn will be used to predict monthly rainfall for a whole year at a certain location in the Sierra Nevada foothills, around 30 miles north east of Fresno California, using the location's historic rainfall data. Data from January to December of 2019 will be used to validate the quality of the forecast.

Weather forecasting is a popular field for the application of timeseries modelling. There are various approaches used for solving timeseries problems, including classical statistical methods, such as ARIMA group of models, machine learning models, and deep learning models. The current implementation of the ArcGIS learn timeseries model uses state of the art convolutional neural networks backbones especially curated for timeseries datasets. These include InceptionTime, ResCNN, Resnet, and FCN. What makes timeseries modeling unique is that, in the classical methodology of ARIMA, multiple hyperparameters require fine tuning before fitting the model, while with the current deep learning technique, most of the parameters are learned by the model itself from the data.

Imports

%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import math
from datetime import datetime as dt
from IPython.display import Image, HTML

from sklearn.model_selection import train_test_split

from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata
from arcgis.features import FeatureLayer, FeatureLayerCollection

Connecting to ArcGIS

gis = GIS("home")

Accessing & Visualizing datasets

The dataset used in this sample study is a univariate timeseries dataset of total monthly rainfall from a fixed location of 1 sqkm area in the state of California, ranging from the January 1980 to December 2019.

The following cell downloads the California rainfall dataset:

url = 'https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/cali_precipitation/FeatureServer/0'
table = FeatureLayer(url)

Next, we preprocess and sort the downloaded dataset by datetime sequence.

cali_rainfall_df1 = table.query().sdf
cali_rainfall_df1 = cali_rainfall_df1.drop("ObjectId", axis=1)
cali_rainfall_df1_sorted = cali_rainfall_df1.sort_values(by='date')
cali_rainfall_df1_sorted.head()
dateprcp_mm_
11980-01-31224.31
41980-02-29181.84
51980-03-3178.86
81980-04-3034.19
91980-05-3117.57
cali_rainfall_df1_sorted.shape
(480, 2)
cali_rainfall_df1_sorted.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 479
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      480 non-null    datetime64[ns]
 1   prcp_mm_  480 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 11.2 KB

Timeseries Data Preparation

Preparing the data for timeseries modeling consists of the following three steps:

Sequencing Timeseries data

The first step consist of establishing the sequence of the timeseries data, which is done by creating a new index that is used by the model for processing the sequential data.

# The first step consist of reindexing the timeseries data into a sequential series  
cali_rainfall_reindexed = cali_rainfall_df1_sorted.reset_index()
cali_rainfall_reindexed = cali_rainfall_reindexed.drop("index", axis=1)
cali_rainfall_reindexed.head()
dateprcp_mm_
01980-01-31224.31
11980-02-29181.84
21980-03-3178.86
31980-04-3034.19
41980-05-3117.57

Datatypes of timeseries variable

The next step validates the data types of the variables.

# check the data types of the variables
cali_rainfall_reindexed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      480 non-null    datetime64[ns]
 1   prcp_mm_  480 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 7.6 KB

Here, the timeseries variable is a float, which should be the expected data type. If the variable is not of a float data type, then it needs to be changed to a float data type, as demonstrated by the commented out line in the next cell.

#cali_rainfall_reindexed['prcp_mm_'].astype('float64')

Checking autocorrelation of timeseries variable

The most important step in this process is to determine if the timeseries sequence is autocorrelated. To ensure that our timeseries data can be modeled well, the strength of correlation of the variable with its past data must be estimated.

from pandas.plotting import autocorrelation_plot
plt.figure(figsize=(20,5))
autocorrelation_plot(cali_rainfall_reindexed["prcp_mm_"])
plt.show()
<Figure size 1440x360 with 1 Axes>

The plot above shows us that there is significant correlation of the data with its immediate time lagged terms, and that it gradually decreases over time as the lag increases.

Train - Test split of timeseries dataset

The dataset above has 480 data samples each representing monthly ranifall of california for 40 years(1980-2019). Out of this 39 years(1980-2018) of data will be used for training the model and the rest 1 year or a total of 12 months of data are held out for validation. Accordingly the dataset is now split into train and test in the following.

# Splitting timeseries data retaining the original sequence by keeping shuffle=False, and test size of 12 months for validation 
test_size = 12
train, test = train_test_split(cali_rainfall_reindexed, test_size = test_size, shuffle=False)
train 
dateprcp_mm_
01980-01-31224.31
11980-02-29181.84
21980-03-3178.86
31980-04-3034.19
41980-05-3117.57
.........
4632018-08-310.00
4642018-09-300.00
4652018-10-3113.53
4662018-11-30113.04
4672018-12-3137.26

468 rows × 2 columns

Model Building

Once the dataset is divided into the training and test dataset, the training data is ready to be used for modeling.

Data Preprocessing

In this example, the data used is a univariate timeseries of total monthly rainfall in millimeters. This single variable will be used to forecast the 12 months of rainfall for the months subsequent to the last date in the training data, or put simply, a single variable will be used to predict the future values of that same variable. In the case of a multivariate timeseries model, there would be a list of multiple explanatory variables.

Once the variables are identified, the preprocessing of the data is performed by the prepare_tabulardata method from the arcgis.learn module in the ArcGIS API for Python. This function will take either a non spatial dataframe, a feature layer, or a spatial dataframe containing the dataset as input, and will return a TabularDataObject that can be fed into the model.

The primary input parameters required for the tool are:

  • input_features : non spatial dataframe, feature layer, or spatial dataframe containing the primary dataset and the explanatory variables, if there are any
  • variable_predict : field name containing the y-variable to be forecasted from the input feature layer/dataframe
  • explanatory_variables : list of the field names as 2-sized tuples containing the explanatory variables as mentioned above. Since there are none in this example, it is not required here
  • index_field : field name containing the timestamp

At this point, preprocessors could be used for scaling the data using a scaler as follows, depending on the data distribution.

#from sklearn.preprocessing import MinMaxScaler
#preprocessors = [('prcp_mm_', MinMaxScaler())]
#data = prepare_tabulardata(train, variable_predict='prcp_mm_', index_field='date', preprocessors=preprocessors)

In this example, preprocessors are not used, as the data is normalized by default.

data = prepare_tabulardata(train, variable_predict='prcp_mm_', index_field='date', seed=42)
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_dl_28October\lib\site-packages\arcgis\learn\_utils\tabular_data.py:876: UserWarning: Dataframe is not spatial, Rasters and distance layers will not work
  warnings.warn("Dataframe is not spatial, Rasters and distance layers will not work")
# Visualize the entire timeseries data
data.show_batch(graph=True)
<Figure size 1800x360 with 1 Axes>
# Here sequence length is used as 12 which also indicates the seasonality of the data
seq_len=12
# visualize the timeseries in batches, here the sequence length is mentioned which would be treated as the batch length
data.show_batch(rows=4,seq_len=seq_len)
<Figure size 720x720 with 16 Axes>

Model Initialization

This is the most significant step for fitting a timeseries model. Here, along with the data, the backbone for training the model and the sequence length are passed as parameters. Out of these three, the sequence length has to be selected carefully, as it can make or break the model. The sequence length is usually the cycle of the data, which in this case is 12, as it is monthly data and the pattern repeats after 12 months.

# In model initialization, the data and the backbone is selected from the available set of InceptionTime, ResCNN, Resnet, FCN
ts_model = TimeSeriesModel(data, seq_len=seq_len, model_arch='InceptionTime')
# Finding the learning rate for training the model
l_rate = ts_model.lr_find()
<Figure size 432x288 with 1 Axes>

Model Training

Finally, the model is now ready for training. To train the model, the model.fit method is called and is provided the number of epochs for training and the estimated learning rate suggested by lr_find in the previous step:

ts_model.fit(100, lr=l_rate)
epochtrain_lossvalid_losstime
00.0830080.03760900:00
10.0613290.03661700:00
20.0508080.03649900:00
30.0448770.03683500:00
40.0396190.03647900:00
50.0348150.03467100:00
60.0310230.03200200:00
70.0281470.03069400:00
80.0251520.02662700:00
90.0227420.02463300:00
100.0203270.02656000:00
110.0182060.02335200:00
120.0163120.02465200:00
130.0146410.02680800:00
140.0131420.02694400:00
150.0118420.02446400:00
160.0107300.02464700:00
170.0098800.02331900:00
180.0090310.02579100:00
190.0082970.02494700:00
200.0076180.02680700:00
210.0071050.02544200:00
220.0068660.02650000:00
230.0065730.02841500:00
240.0063380.02318400:00
250.0062560.03071400:00
260.0063760.02858600:00
270.0065480.02634200:00
280.0066650.02044800:00
290.0065320.02944100:00
300.0063060.02525500:00
310.0061370.02517400:00
320.0059780.02355700:00
330.0056870.02323400:00
340.0053320.02556200:00
350.0049550.02333100:00
360.0046720.02373500:00
370.0043300.02384700:00
380.0040950.02660800:00
390.0038490.02236900:00
400.0035770.02323800:00
410.0033760.02375000:00
420.0031580.02384300:00
430.0029630.02301300:00
440.0028010.02202800:00
450.0026520.02318000:00
460.0025580.02440900:00
470.0024170.02247900:00
480.0022640.02394500:00
490.0021720.02292000:00
500.0020850.02199700:00
510.0019900.02379300:00
520.0018730.02417000:00
530.0017740.02369900:00
540.0017240.02323700:00
550.0016110.02316400:00
560.0015270.02290800:00
570.0014510.02304000:00
580.0013900.02215500:00
590.0013470.02204100:00
600.0013330.02226500:00
610.0012910.02256100:00
620.0012240.02177900:00
630.0011590.02244500:00
640.0010800.02274000:00
650.0010230.02269100:00
660.0009710.02329000:00
670.0009410.02285400:00
680.0008760.02243200:00
690.0008290.02299400:00
700.0007740.02258700:00
710.0007220.02257800:00
720.0006810.02234800:00
730.0006430.02264700:00
740.0006080.02239500:00
750.0005690.02251800:00
760.0005330.02256100:00
770.0005100.02214600:00
780.0004770.02235400:00
790.0004540.02208400:00
800.0004390.02216600:00
810.0004200.02205300:00
820.0004010.02213200:00
830.0003700.02244500:00
840.0003550.02207200:00
850.0003440.02207100:00
860.0003220.02219600:00
870.0003050.02232600:00
880.0002890.02246000:00
890.0002640.02254500:00
900.0002510.02260700:00
910.0002450.02260900:00
920.0002380.02255600:00
930.0002250.02244400:00
940.0002200.02233000:00
950.0002150.02232600:00
960.0002120.02231400:00
970.0002020.02230300:00
980.0002260.02231500:00
990.0002130.02231400:00
# the train vs valid losses is plotted to check quality of the trained model, and whether the model needs more training 
ts_model.plot_losses()
<Figure size 432x288 with 1 Axes>
# the predicted values by the trained model is printed for the test set
ts_model.show_results(rows=5)
<Figure size 720x720 with 10 Axes>

The figures above display the training and the validation of the prediction attained by the model while training.

Rainfall Forecast & Validation

Forecasting Using the trained Timeseries Model

During forecasting, the model will use the same training dataset as input and will use the last sequence length number of terms from that dataset's tail to predict the rainfall for the number of months specified by the user.

from datetime import datetime  
# checking the training dataset
train
dateprcp_mm_
01980-01-31224.31
11980-02-29181.84
21980-03-3178.86
31980-04-3034.19
41980-05-3117.57
.........
4632018-08-310.00
4642018-09-300.00
4652018-10-3113.53
4662018-11-30113.04
4672018-12-3137.26

468 rows × 2 columns

Forecasting requires the format of the date column to be in datetime. If the date column is not in the datetime format, it can be changed to datetime by using the pd.to_datetime() method.

# checking if the datatype of the 'date' column is in datetime format
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 468 entries, 0 to 467
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      468 non-null    datetime64[ns]
 1   prcp_mm_  468 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 11.0 KB

In this example, the date column is already in the required datetime format.

train.tail(5)
dateprcp_mm_
4632018-08-310.00
4642018-09-300.00
4652018-10-3113.53
4662018-11-30113.04
4672018-12-3137.26

Finally the predict function is used to forecast for a period of the 12 months subsequent to the last date in the training dataset. As such, this will be forecasting rainfall for the 12 months of 2019, starting from January of 2019.

# Here the forecast is returned as a dataframe, since it is non spatial data, mentioned in the 'prediction_type'     
sdf_forecasted = ts_model.predict(train, prediction_type='dataframe', number_of_predictions=test_size)
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_dl_28October\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
# final forecasted result returned by the model
sdf_forecasted
dateprcp_mm_prcp_mm__results
01980-01-31224.31224.310000
11980-02-29181.84181.840000
21980-03-3178.8678.860000
31980-04-3034.1934.190000
41980-05-3117.5717.570000
............
4752019-08-20NaN-5.099367
4762019-09-18NaN10.751540
4772019-10-17NaN25.207242
4782019-11-15NaN83.685172
4792019-12-14NaN142.722045

480 rows × 3 columns

# Formating the result into actual vs the predicted columns
sdf_forecasted = sdf_forecasted.tail(test_size)
sdf_forecasted = sdf_forecasted[['date','prcp_mm__results']]
sdf_forecasted['actual'] = test[test.columns[-1]].values
sdf_forecasted = sdf_forecasted.set_index(sdf_forecasted.columns[0])
sdf_forecasted.head()
prcp_mm__resultsactual
date
2019-01-29129.882450149.69
2019-02-27144.863222222.38
2019-03-28143.764554102.48
2019-04-2648.77259425.12
2019-05-2561.366233102.49

Estimate model metrics for validation

The accuracy of the forecasted values is measured by comparing the forecasted values against the actual values of the 12 months.

from sklearn.metrics import r2_score
import sklearn.metrics as metrics
r2_test = r2_score(sdf_forecasted['actual'],sdf_forecasted['prcp_mm__results'])
print('R-Square: ', round(r2_test, 2))
R-Square:  0.8

A considerably high r-squared value indicates a high similarity between the forecasted and the actual sales values.

mse_RF_train = metrics.mean_squared_error(sdf_forecasted['actual'], sdf_forecasted['prcp_mm__results'])
print('RMSE: ', round(np.sqrt(mse_RF_train), 4))

mean_absolute_error_RF_train = metrics.mean_absolute_error(sdf_forecasted['actual'], sdf_forecasted['prcp_mm__results'])
print('MAE: ', round(mean_absolute_error_RF_train, 4))
RMSE:  32.2893
MAE:  25.5549

The error terms of RMSE and MAE in the forecasting are 32.28mm and 25.55mm respectively, which are quite low.

Result Visualization

Finally, the actual and forecasted values are plotted to visualize their distribution over the 12 months period, with the blue lines indicating forecasted values and orange line showing the actual values.

plt.figure(figsize=(20,5))
plt.plot(sdf_forecasted)
plt.ylabel('prcp_mm__results')
plt.legend(sdf_forecasted.columns.values,loc='upper right')
plt.title( 'Rainfall Forecast')
plt.show()
<Figure size 1440x360 with 1 Axes>

Conclusion

The newly implemented deeplearning timeseries model from the arcgis.learn library was used to forecast monthly rainfall for a location of 1 sqkm in California, for the period of January to December 2019, which it was able to model with a high accuracy. The notebook elaborates on the methodology of applying the model for forecasting time series data. The process includes first preparing a timeseries dataset using the prepare_tabulardata() method, followed by modeling and fitting the dataset. Usually, timeseries modelling requires fine tuning several hyperparameters for properly fitting the data, most of which has been internalized in this current implementation, leaving the user responsible for configuring only a few significant parameters, like the sequence length.

Summary of methods used

MethodDescriptionExamples
prepare_tabulardataprepare data including imputation, normalization and train-test splitprepare data ready for fitting a Timeseries Model
model.lr_find()find an optimal learning ratefinalize a good learning rate for training the Timeseries model
TimeSeriesModel()Model Initialization by selecting the Timeseries Deeplearning algorithm to be used for fittingSelected Timsereis algorithm from Fastai timeseries regression can be used
model.fit()train a model with epochs & learning rate as inputtraining the Timeseries model with sutiable input
model.score()find the model metric of R-squared of the trained modelreturns R-squared value after training the Timeseries Model
model.predict()predict on a test setforecast values using the trained models on test input

Data resources

DatasetSourceLink
California Daily weather dataMODIS - Daily Surface Weather Data on a 1-km Grid for North America, Version 3https://daac.ornl.gov/DAYMET/guides/Daymet_V3_CFMosaics.html

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.