Forecasting monthly rainfall in California using Deep Learning Time Series techniques

Introduction

Forest fires of historical proportions are ravaging various parts of California, started by a long and continuous period of drought. To help in dealing with this growing environmental emergency, utilizing prior knowledge of rainfall is critical. In this sample study, the deepelarning timeseries model from ArcGIS learn will be used to predict monthly rainfall for a whole year at a certain location in the Sierra Nevada foothills, around 30 miles north east of Fresno California, using the location's historic rainfall data. Data from January to December of 2019 will be used to validate the quality of the forecast.

Weather forecasting is a popular field for the application of timeseries modelling. There are various approaches used for solving timeseries problems, including classical statistical methods, such as ARIMA group of models, machine learning models, and deep learning models. The current implementation of the ArcGIS learn timeseries model uses state of the art convolutional neural networks backbones especially curated for timeseries datasets. These include InceptionTime, ResCNN, Resnet, and FCN. What makes timeseries modeling unique is that, in the classical methodology of ARIMA, multiple hyperparameters require fine tuning before fitting the model, while with the current deep learning technique, most of the parameters are learned by the model itself from the data.

Imports

%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import math
from datetime import datetime as dt
from IPython.display import Image, HTML

from sklearn.model_selection import train_test_split

from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata
from arcgis.features import FeatureLayer, FeatureLayerCollection

Connecting to ArcGIS

gis = GIS("home")

Accessing & Visualizing datasets

The dataset used in this sample study is a univariate timeseries dataset of total monthly rainfall from a fixed location of 1 sqkm area in the state of California, ranging from the January 1980 to December 2019.

The following cell downloads the California rainfall dataset:

url = 'https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/cali_precipitation/FeatureServer/0'
table = FeatureLayer(url)

Next, we preprocess and sort the downloaded dataset by datetime sequence.

cali_rainfall_df1 = table.query().sdf
cali_rainfall_df1 = cali_rainfall_df1.drop("ObjectId", axis=1)
cali_rainfall_df1_sorted = cali_rainfall_df1.sort_values(by='date')
cali_rainfall_df1_sorted.head()

	date	prcp_mm_
1	1980-01-31	224.31
4	1980-02-29	181.84
5	1980-03-31	78.86
8	1980-04-30	34.19
9	1980-05-31	17.57

cali_rainfall_df1_sorted.shape

(480, 2)

cali_rainfall_df1_sorted.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 479
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      480 non-null    datetime64[ns]
 1   prcp_mm_  480 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 11.2 KB

Timeseries Data Preparation

Preparing the data for timeseries modeling consists of the following three steps:

Sequencing Timeseries data

The first step consist of establishing the sequence of the timeseries data, which is done by creating a new index that is used by the model for processing the sequential data.

# The first step consist of reindexing the timeseries data into a sequential series  
cali_rainfall_reindexed = cali_rainfall_df1_sorted.reset_index()
cali_rainfall_reindexed = cali_rainfall_reindexed.drop("index", axis=1)
cali_rainfall_reindexed.head()

	date	prcp_mm_
0	1980-01-31	224.31
1	1980-02-29	181.84
2	1980-03-31	78.86
3	1980-04-30	34.19
4	1980-05-31	17.57

Datatypes of timeseries variable

The next step validates the data types of the variables.

# check the data types of the variables
cali_rainfall_reindexed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      480 non-null    datetime64[ns]
 1   prcp_mm_  480 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 7.6 KB

Here, the timeseries variable is a float, which should be the expected data type. If the variable is not of a float data type, then it needs to be changed to a float data type, as demonstrated by the commented out line in the next cell.

#cali_rainfall_reindexed['prcp_mm_'].astype('float64')

Checking autocorrelation of timeseries variable

The most important step in this process is to determine if the timeseries sequence is autocorrelated. To ensure that our timeseries data can be modeled well, the strength of correlation of the variable with its past data must be estimated.

from pandas.plotting import autocorrelation_plot

plt.figure(figsize=(20,5))
autocorrelation_plot(cali_rainfall_reindexed["prcp_mm_"])
plt.show()

The plot above shows us that there is significant correlation of the data with its immediate time lagged terms, and that it gradually decreases over time as the lag increases.

Train - Test split of timeseries dataset

The dataset above has 480 data samples each representing monthly ranifall of california for 40 years(1980-2019). Out of this 39 years(1980-2018) of data will be used for training the model and the rest 1 year or a total of 12 months of data are held out for validation. Accordingly the dataset is now split into train and test in the following.

# Splitting timeseries data retaining the original sequence by keeping shuffle=False, and test size of 12 months for validation 
test_size = 12
train, test = train_test_split(cali_rainfall_reindexed, test_size = test_size, shuffle=False)

train

	date	prcp_mm_
0	1980-01-31	224.31
1	1980-02-29	181.84
2	1980-03-31	78.86
3	1980-04-30	34.19
4	1980-05-31	17.57
...	...	...
463	2018-08-31	0.00
464	2018-09-30	0.00
465	2018-10-31	13.53
466	2018-11-30	113.04
467	2018-12-31	37.26

468 rows × 2 columns

Model Building

Once the dataset is divided into the training and test dataset, the training data is ready to be used for modeling.

Data Preprocessing

In this example, the data used is a univariate timeseries of total monthly rainfall in millimeters. This single variable will be used to forecast the 12 months of rainfall for the months subsequent to the last date in the training data, or put simply, a single variable will be used to predict the future values of that same variable. In the case of a multivariate timeseries model, there would be a list of multiple explanatory variables.

Once the variables are identified, the preprocessing of the data is performed by the prepare_tabulardata method from the arcgis.learn module in the ArcGIS API for Python. This function will take either a non spatial dataframe, a feature layer, or a spatial dataframe containing the dataset as input, and will return a TabularDataObject that can be fed into the model.

The primary input parameters required for the tool are:

input_features : non spatial dataframe, feature layer, or spatial dataframe containing the primary dataset and the explanatory variables, if there are any
variable_predict : field name containing the y-variable to be forecasted from the input feature layer/dataframe
explanatory_variables : list of the field names as 2-sized tuples containing the explanatory variables as mentioned above. Since there are none in this example, it is not required here
index_field : field name containing the timestamp

At this point, preprocessors could be used for scaling the data using a scaler as follows, depending on the data distribution.

#from sklearn.preprocessing import MinMaxScaler

#preprocessors = [('prcp_mm_', MinMaxScaler())]
#data = prepare_tabulardata(train, variable_predict='prcp_mm_', index_field='date', preprocessors=preprocessors)

In this example, preprocessors are not used, as the data is normalized by default.

data = prepare_tabulardata(train, variable_predict='prcp_mm_', index_field='date', seed=42)

C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_dl_28October\lib\site-packages\arcgis\learn\_utils\tabular_data.py:876: UserWarning: Dataframe is not spatial, Rasters and distance layers will not work
  warnings.warn("Dataframe is not spatial, Rasters and distance layers will not work")

# Visualize the entire timeseries data
data.show_batch(graph=True)

# Here sequence length is used as 12 which also indicates the seasonality of the data
seq_len=12

# visualize the timeseries in batches, here the sequence length is mentioned which would be treated as the batch length
data.show_batch(rows=4,seq_len=seq_len)

Model Initialization

This is the most significant step for fitting a timeseries model. Here, along with the data, the backbone for training the model and the sequence length are passed as parameters. Out of these three, the sequence length has to be selected carefully, as it can make or break the model. The sequence length is usually the cycle of the data, which in this case is 12, as it is monthly data and the pattern repeats after 12 months.

# In model initialization, the data and the backbone is selected from the available set of InceptionTime, ResCNN, Resnet, FCN
ts_model = TimeSeriesModel(data, seq_len=seq_len, model_arch='InceptionTime')

Learning Rate Search

# Finding the learning rate for training the model
l_rate = ts_model.lr_find()

Model Training

Finally, the model is now ready for training. To train the model, the model.fit method is called and is provided the number of epochs for training and the estimated learning rate suggested by lr_find in the previous step:

ts_model.fit(100, lr=l_rate)

epoch	train_loss	valid_loss	time
0	0.083008	0.037609	00:00
1	0.061329	0.036617	00:00
2	0.050808	0.036499	00:00
3	0.044877	0.036835	00:00
4	0.039619	0.036479	00:00
5	0.034815	0.034671	00:00
6	0.031023	0.032002	00:00
7	0.028147	0.030694	00:00
8	0.025152	0.026627	00:00
9	0.022742	0.024633	00:00
10	0.020327	0.026560	00:00
11	0.018206	0.023352	00:00
12	0.016312	0.024652	00:00
13	0.014641	0.026808	00:00
14	0.013142	0.026944	00:00
15	0.011842	0.024464	00:00
16	0.010730	0.024647	00:00
17	0.009880	0.023319	00:00
18	0.009031	0.025791	00:00
19	0.008297	0.024947	00:00
20	0.007618	0.026807	00:00
21	0.007105	0.025442	00:00
22	0.006866	0.026500	00:00
23	0.006573	0.028415	00:00
24	0.006338	0.023184	00:00
25	0.006256	0.030714	00:00
26	0.006376	0.028586	00:00
27	0.006548	0.026342	00:00
28	0.006665	0.020448	00:00
29	0.006532	0.029441	00:00
30	0.006306	0.025255	00:00
31	0.006137	0.025174	00:00
32	0.005978	0.023557	00:00
33	0.005687	0.023234	00:00
34	0.005332	0.025562	00:00
35	0.004955	0.023331	00:00
36	0.004672	0.023735	00:00
37	0.004330	0.023847	00:00
38	0.004095	0.026608	00:00
39	0.003849	0.022369	00:00
40	0.003577	0.023238	00:00
41	0.003376	0.023750	00:00
42	0.003158	0.023843	00:00
43	0.002963	0.023013	00:00
44	0.002801	0.022028	00:00
45	0.002652	0.023180	00:00
46	0.002558	0.024409	00:00
47	0.002417	0.022479	00:00
48	0.002264	0.023945	00:00
49	0.002172	0.022920	00:00
50	0.002085	0.021997	00:00
51	0.001990	0.023793	00:00
52	0.001873	0.024170	00:00
53	0.001774	0.023699	00:00
54	0.001724	0.023237	00:00
55	0.001611	0.023164	00:00
56	0.001527	0.022908	00:00
57	0.001451	0.023040	00:00
58	0.001390	0.022155	00:00
59	0.001347	0.022041	00:00
60	0.001333	0.022265	00:00
61	0.001291	0.022561	00:00
62	0.001224	0.021779	00:00
63	0.001159	0.022445	00:00
64	0.001080	0.022740	00:00
65	0.001023	0.022691	00:00
66	0.000971	0.023290	00:00
67	0.000941	0.022854	00:00
68	0.000876	0.022432	00:00
69	0.000829	0.022994	00:00
70	0.000774	0.022587	00:00
71	0.000722	0.022578	00:00
72	0.000681	0.022348	00:00
73	0.000643	0.022647	00:00
74	0.000608	0.022395	00:00
75	0.000569	0.022518	00:00
76	0.000533	0.022561	00:00
77	0.000510	0.022146	00:00
78	0.000477	0.022354	00:00
79	0.000454	0.022084	00:00
80	0.000439	0.022166	00:00
81	0.000420	0.022053	00:00
82	0.000401	0.022132	00:00
83	0.000370	0.022445	00:00
84	0.000355	0.022072	00:00
85	0.000344	0.022071	00:00
86	0.000322	0.022196	00:00
87	0.000305	0.022326	00:00
88	0.000289	0.022460	00:00
89	0.000264	0.022545	00:00
90	0.000251	0.022607	00:00
91	0.000245	0.022609	00:00
92	0.000238	0.022556	00:00
93	0.000225	0.022444	00:00
94	0.000220	0.022330	00:00
95	0.000215	0.022326	00:00
96	0.000212	0.022314	00:00
97	0.000202	0.022303	00:00
98	0.000226	0.022315	00:00
99	0.000213	0.022314	00:00

# the train vs valid losses is plotted to check quality of the trained model, and whether the model needs more training 
ts_model.plot_losses()

# the predicted values by the trained model is printed for the test set
ts_model.show_results(rows=5)

The figures above display the training and the validation of the prediction attained by the model while training.

Rainfall Forecast & Validation

Forecasting Using the trained Timeseries Model

During forecasting, the model will use the same training dataset as input and will use the last sequence length number of terms from that dataset's tail to predict the rainfall for the number of months specified by the user.

from datetime import datetime

# checking the training dataset
train

	date	prcp_mm_
0	1980-01-31	224.31
1	1980-02-29	181.84
2	1980-03-31	78.86
3	1980-04-30	34.19
4	1980-05-31	17.57
...	...	...
463	2018-08-31	0.00
464	2018-09-30	0.00
465	2018-10-31	13.53
466	2018-11-30	113.04
467	2018-12-31	37.26

468 rows × 2 columns

Forecasting requires the format of the date column to be in datetime. If the date column is not in the datetime format, it can be changed to datetime by using the pd.to_datetime() method.

# checking if the datatype of the 'date' column is in datetime format
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 468 entries, 0 to 467
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      468 non-null    datetime64[ns]
 1   prcp_mm_  468 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 11.0 KB

In this example, the date column is already in the required datetime format.

train.tail(5)

	date	prcp_mm_
463	2018-08-31	0.00
464	2018-09-30	0.00
465	2018-10-31	13.53
466	2018-11-30	113.04
467	2018-12-31	37.26

Finally the predict function is used to forecast for a period of the 12 months subsequent to the last date in the training dataset. As such, this will be forecasting rainfall for the 12 months of 2019, starting from January of 2019.

# Here the forecast is returned as a dataframe, since it is non spatial data, mentioned in the 'prediction_type'     
sdf_forecasted = ts_model.predict(train, prediction_type='dataframe', number_of_predictions=test_size)

C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_dl_28October\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)

# final forecasted result returned by the model
sdf_forecasted

	date	prcp_mm_	prcp_mm__results
0	1980-01-31	224.31	224.310000
1	1980-02-29	181.84	181.840000
2	1980-03-31	78.86	78.860000
3	1980-04-30	34.19	34.190000
4	1980-05-31	17.57	17.570000
...	...	...	...
475	2019-08-20	NaN	-5.099367
476	2019-09-18	NaN	10.751540
477	2019-10-17	NaN	25.207242
478	2019-11-15	NaN	83.685172
479	2019-12-14	NaN	142.722045

480 rows × 3 columns

# Formating the result into actual vs the predicted columns
sdf_forecasted = sdf_forecasted.tail(test_size)
sdf_forecasted = sdf_forecasted[['date','prcp_mm__results']]
sdf_forecasted['actual'] = test[test.columns[-1]].values
sdf_forecasted = sdf_forecasted.set_index(sdf_forecasted.columns[0])
sdf_forecasted.head()

	prcp_mm__results	actual
date
2019-01-29	129.882450	149.69
2019-02-27	144.863222	222.38
2019-03-28	143.764554	102.48
2019-04-26	48.772594	25.12
2019-05-25	61.366233	102.49

Estimate model metrics for validation

The accuracy of the forecasted values is measured by comparing the forecasted values against the actual values of the 12 months.

from sklearn.metrics import r2_score
import sklearn.metrics as metrics

r2_test = r2_score(sdf_forecasted['actual'],sdf_forecasted['prcp_mm__results'])
print('R-Square: ', round(r2_test, 2))

R-Square:  0.8

A considerably high r-squared value indicates a high similarity between the forecasted and the actual sales values.

mse_RF_train = metrics.mean_squared_error(sdf_forecasted['actual'], sdf_forecasted['prcp_mm__results'])
print('RMSE: ', round(np.sqrt(mse_RF_train), 4))

mean_absolute_error_RF_train = metrics.mean_absolute_error(sdf_forecasted['actual'], sdf_forecasted['prcp_mm__results'])
print('MAE: ', round(mean_absolute_error_RF_train, 4))

RMSE:  32.2893
MAE:  25.5549

The error terms of RMSE and MAE in the forecasting are 32.28mm and 25.55mm respectively, which are quite low.

Result Visualization

Finally, the actual and forecasted values are plotted to visualize their distribution over the 12 months period, with the blue lines indicating forecasted values and orange line showing the actual values.

plt.figure(figsize=(20,5))
plt.plot(sdf_forecasted)
plt.ylabel('prcp_mm__results')
plt.legend(sdf_forecasted.columns.values,loc='upper right')
plt.title( 'Rainfall Forecast')
plt.show()

Conclusion

The newly implemented deeplearning timeseries model from the arcgis.learn library was used to forecast monthly rainfall for a location of 1 sqkm in California, for the period of January to December 2019, which it was able to model with a high accuracy. The notebook elaborates on the methodology of applying the model for forecasting time series data. The process includes first preparing a timeseries dataset using the prepare_tabulardata() method, followed by modeling and fitting the dataset. Usually, timeseries modelling requires fine tuning several hyperparameters for properly fitting the data, most of which has been internalized in this current implementation, leaving the user responsible for configuring only a few significant parameters, like the sequence length.

Summary of methods used

Method	Description	Examples
prepare_tabulardata	prepare data including imputation, normalization and train-test split	prepare data ready for fitting a Timeseries Model
model.lr_find()	find an optimal learning rate	finalize a good learning rate for training the Timeseries model
TimeSeriesModel()	Model Initialization by selecting the Timeseries Deeplearning algorithm to be used for fitting	Selected Timsereis algorithm from Fastai timeseries regression can be used
model.fit()	train a model with epochs & learning rate as input	training the Timeseries model with sutiable input
model.score()	find the model metric of R-squared of the trained model	returns R-squared value after training the Timeseries Model
model.predict()	predict on a test set	forecast values using the trained models on test input

Data resources

Dataset	Source	Link
California Daily weather data	MODIS - Daily Surface Weather Data on a 1-km Grid for North America, Version 3	https://daac.ornl.gov/DAYMET/guides/Daymet_V3_CFMosaics.html