
Forecasting monthly rainfall in California using Deep Learning Time Series techniques

Introduction

Forest fires of historical proportions are ravaging various parts of California, fueled by a long and continuous period of drought. To help deal with this growing environmental emergency, prior knowledge of rainfall is critical. In this sample study, the deep learning timeseries model from arcgis.learn will be used to predict monthly rainfall for a whole year at a location in the Sierra Nevada foothills, around 30 miles northeast of Fresno, California, using the location's historic rainfall data. Data from January to December of 2019 will be used to validate the quality of the forecast.

Weather forecasting is a popular application of timeseries modeling. Various approaches are used for solving timeseries problems, including classical statistical methods such as the ARIMA family of models, machine learning models, and deep learning models. The current implementation of the arcgis.learn timeseries model uses state-of-the-art convolutional neural network backbones curated specifically for timeseries datasets, including InceptionTime, ResCNN, ResNet, and FCN. A key difference from the classical ARIMA methodology, where multiple hyperparameters must be fine tuned before fitting the model, is that with the deep learning technique most of the parameters are learned by the model itself from the data.
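
To make the contrast concrete, the short sketch below is illustrative only: it assumes statsmodels is available, uses a synthetic series, and is not part of this notebook's workflow. It shows how a classical ARIMA fit requires the analyst to choose the (p, d, q) order up front, whereas the deep learning model configured later in this notebook mainly needs a backbone and a sequence length.

# Illustrative comparison only -- statsmodels and the synthetic data are not part of this notebook
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A small synthetic monthly series, used purely for demonstration
dates = pd.date_range('2000-01-31', periods=120, freq='M')
synthetic_rain = pd.Series(50 + 40 * np.sin(np.arange(120) * 2 * np.pi / 12), index=dates)

# Classical approach: the (p, d, q) order must be chosen (or grid searched) by the analyst
arima_fit = ARIMA(synthetic_rain, order=(1, 0, 1)).fit()
print(arima_fit.aic)

# Deep learning approach used later in this notebook: mainly the backbone and sequence
# length are supplied; the remaining parameters are learned from the data
# ts_model = TimeSeriesModel(data, seq_len=12, model_arch='InceptionTime')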

Imports

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import math
from datetime import datetime as dt
from IPython.display import Image, HTML

from sklearn.model_selection import train_test_split

from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata
from arcgis.features import FeatureLayer, FeatureLayerCollection

Connecting to ArcGIS

In [3]:
gis = GIS("home")

Accessing & Visualizing datasets

The dataset used in this sample study is a univariate timeseries of total monthly rainfall at a fixed 1 sq. km location in the state of California, ranging from January 1980 to December 2019.

The following cell downloads the California rainfall dataset:

In [4]:
url = 'https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/cali_precipitation/FeatureServer/0'
table = FeatureLayer(url)

Next, we preprocess the downloaded dataset and sort it by date.

In [8]:
cali_rainfall_df1 = table.query().sdf
cali_rainfall_df1 = cali_rainfall_df1.drop("ObjectId", axis=1)
cali_rainfall_df1_sorted = cali_rainfall_df1.sort_values(by='date')
cali_rainfall_df1_sorted.head()
Out[8]:
date prcp_mm_
1 1980-01-31 224.31
4 1980-02-29 181.84
5 1980-03-31 78.86
8 1980-04-30 34.19
9 1980-05-31 17.57
In [9]:
cali_rainfall_df1_sorted.shape
Out[9]:
(480, 2)
In [10]:
cali_rainfall_df1_sorted.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 479
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      480 non-null    datetime64[ns]
 1   prcp_mm_  480 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 11.2 KB

Timeseries Data Preparation

Preparing the data for timeseries modeling consists of the following three steps:

Sequencing Timeseries data

The first step consists of establishing the sequence of the timeseries data, which is done by creating a new index that the model uses for processing the sequential data.

In [11]:
# The first step consists of reindexing the timeseries data into a sequential series
cali_rainfall_reindexed = cali_rainfall_df1_sorted.reset_index()
cali_rainfall_reindexed = cali_rainfall_reindexed.drop("index", axis=1)
cali_rainfall_reindexed.head()
Out[11]:
date prcp_mm_
0 1980-01-31 224.31
1 1980-02-29 181.84
2 1980-03-31 78.86
3 1980-04-30 34.19
4 1980-05-31 17.57

Datatypes of timeseries variable

The next step validates the data types of the variables.

In [12]:
# check the data types of the variables
cali_rainfall_reindexed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      480 non-null    datetime64[ns]
 1   prcp_mm_  480 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 7.6 KB

Here, the timeseries variable is a float, which is the expected data type. If the variable is not a float, it needs to be converted to one, as demonstrated by the commented out line in the next cell.

In [ ]:
#cali_rainfall_reindexed['prcp_mm_'] = cali_rainfall_reindexed['prcp_mm_'].astype('float64')

Checking autocorrelation of timeseries variable

The most important step in this process is to determine if the timeseries sequence is autocorrelated. To ensure that our timeseries data can be modeled well, the strength of correlation of the variable with its past data must be estimated.

In [13]:
from pandas.plotting import autocorrelation_plot
In [14]:
plt.figure(figsize=(20,5))
autocorrelation_plot(cali_rainfall_reindexed["prcp_mm_"])
plt.show()

The plot above shows that there is significant correlation of the data with its immediate time-lagged terms, and that this correlation gradually decreases as the lag increases.
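
To put approximate numbers on this, the autocorrelation at a few lags of interest can be computed directly with pandas. This is an optional check, not part of the original workflow.

# Optional: quantify the autocorrelation at the first lag and at the seasonal (12-month) lag
for lag in (1, 6, 12):
    print('Lag', lag, 'autocorrelation:', round(cali_rainfall_reindexed['prcp_mm_'].autocorr(lag=lag), 3))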

Train - Test split of timeseries dataset

The dataset above has 480 samples, each representing the total monthly rainfall in California over 40 years (1980-2019). Of this, 39 years (1980-2018) of data will be used for training the model, and the remaining year, a total of 12 months, is held out for validation. Accordingly, the dataset is now split into train and test sets.

In [15]:
# Split the timeseries data, retaining the original sequence with shuffle=False and a test size of 12 months for validation
test_size = 12
train, test = train_test_split(cali_rainfall_reindexed, test_size = test_size, shuffle=False)
In [16]:
train 
Out[16]:
date prcp_mm_
0 1980-01-31 224.31
1 1980-02-29 181.84
2 1980-03-31 78.86
3 1980-04-30 34.19
4 1980-05-31 17.57
... ... ...
463 2018-08-31 0.00
464 2018-09-30 0.00
465 2018-10-31 13.53
466 2018-11-30 113.04
467 2018-12-31 37.26

468 rows × 2 columns

Model Building

Once the dataset is divided into the training and test dataset, the training data is ready to be used for modeling.

Data Preprocessing

In this example, the data used is a univariate timeseries of total monthly rainfall in millimeters. This single variable will be used to forecast rainfall for the 12 months subsequent to the last date in the training data, or put simply, a single variable will be used to predict the future values of that same variable. In the case of a multivariate timeseries model, there would be a list of multiple explanatory variables.

Once the variables are identified, the preprocessing of the data is performed by the prepare_tabulardata method from the arcgis.learn module in the ArcGIS API for Python. This function takes either a non-spatial dataframe, a feature layer, or a spatial dataframe containing the dataset as input and returns a TabularDataObject that can be fed into the model.

The primary input parameters required for the tool are:

  • input_features : non-spatial dataframe, feature layer, or spatial dataframe containing the primary dataset and the explanatory variables, if there are any
  • variable_predict : field name containing the y-variable to be forecasted from the input feature layer/dataframe
  • explanatory_variables : list of the field names containing the explanatory variables mentioned above (2-sized tuples can be used to mark categorical fields). Since there are none in this example, it is not required here; a hypothetical multivariate call is sketched after this list
  • index_field : field name containing the timestamp
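
For reference, a multivariate setup would pass the additional fields through explanatory_variables. The field names below ('temp_c_', 'humidity_') are hypothetical and do not exist in this dataset; the cell is kept commented out, like the other optional cells in this notebook.

# Hypothetical multivariate example -- 'temp_c_' and 'humidity_' are not fields in this dataset
# data = prepare_tabulardata(train,
#                            variable_predict='prcp_mm_',
#                            explanatory_variables=['temp_c_', 'humidity_'],
#                            index_field='date')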

At this point, a preprocessor could be used to scale the data, depending on its distribution, as follows.

In [17]:
#from sklearn.preprocessing import MinMaxScaler
In [18]:
#preprocessors = [('prcp_mm_', MinMaxScaler())]
#data = prepare_tabulardata(train, variable_predict='prcp_mm_', index_field='date', preprocessors=preprocessors)

In this example, preprocessors are not used, as the data is normalized by default.

In [19]:
data = prepare_tabulardata(train, variable_predict='prcp_mm_', index_field='date', seed=42)
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_dl_28October\lib\site-packages\arcgis\learn\_utils\tabular_data.py:876: UserWarning: Dataframe is not spatial, Rasters and distance layers will not work
  warnings.warn("Dataframe is not spatial, Rasters and distance layers will not work")
In [20]:
# Visualize the entire timeseries data
data.show_batch(graph=True)
In [21]:
# A sequence length of 12 is used here, which also corresponds to the seasonality of the data
seq_len=12
In [22]:
# Visualize the timeseries in batches; the sequence length supplied here is treated as the batch length
data.show_batch(rows=4,seq_len=seq_len)

Model Initialization

This is the most significant step for fitting a timeseries model. Here, along with the data, the backbone for training the model and the sequence length are passed as parameters. Out of these three, the sequence length has to be selected carefully, as it can make or break the model. The sequence length is usually the cycle of the data, which in this case is 12, as it is monthly data and the pattern repeats after 12 months.
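
Before committing to a sequence length, a quick optional check (not part of the original workflow) is to average the training data by calendar month, which makes the 12-month cycle visible.

# Optional: mean rainfall per calendar month in the training data, to confirm the annual cycle
monthly_means = train.groupby(train['date'].dt.month)['prcp_mm_'].mean()
monthly_means.plot(kind='bar', figsize=(10, 3), title='Mean monthly rainfall, 1980-2018')
plt.show()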

In [23]:
# In model initialization, the data and the backbone are selected from the available set of InceptionTime, ResCNN, ResNet, FCN
ts_model = TimeSeriesModel(data, seq_len=seq_len, model_arch='InceptionTime')
In [24]:
# Finding the learning rate for training the model
l_rate = ts_model.lr_find()

Model Training

Finally, the model is now ready for training. To train the model, the model.fit method is called and is provided the number of epochs for training and the estimated learning rate suggested by lr_find in the previous step:

In [25]:
ts_model.fit(100, lr=l_rate)
epoch train_loss valid_loss time
0 0.083008 0.037609 00:00
1 0.061329 0.036617 00:00
2 0.050808 0.036499 00:00
3 0.044877 0.036835 00:00
4 0.039619 0.036479 00:00
5 0.034815 0.034671 00:00
6 0.031023 0.032002 00:00
7 0.028147 0.030694 00:00
8 0.025152 0.026627 00:00
9 0.022742 0.024633 00:00
10 0.020327 0.026560 00:00
11 0.018206 0.023352 00:00
12 0.016312 0.024652 00:00
13 0.014641 0.026808 00:00
14 0.013142 0.026944 00:00
15 0.011842 0.024464 00:00
16 0.010730 0.024647 00:00
17 0.009880 0.023319 00:00
18 0.009031 0.025791 00:00
19 0.008297 0.024947 00:00
20 0.007618 0.026807 00:00
21 0.007105 0.025442 00:00
22 0.006866 0.026500 00:00
23 0.006573 0.028415 00:00
24 0.006338 0.023184 00:00
25 0.006256 0.030714 00:00
26 0.006376 0.028586 00:00
27 0.006548 0.026342 00:00
28 0.006665 0.020448 00:00
29 0.006532 0.029441 00:00
30 0.006306 0.025255 00:00
31 0.006137 0.025174 00:00
32 0.005978 0.023557 00:00
33 0.005687 0.023234 00:00
34 0.005332 0.025562 00:00
35 0.004955 0.023331 00:00
36 0.004672 0.023735 00:00
37 0.004330 0.023847 00:00
38 0.004095 0.026608 00:00
39 0.003849 0.022369 00:00
40 0.003577 0.023238 00:00
41 0.003376 0.023750 00:00
42 0.003158 0.023843 00:00
43 0.002963 0.023013 00:00
44 0.002801 0.022028 00:00
45 0.002652 0.023180 00:00
46 0.002558 0.024409 00:00
47 0.002417 0.022479 00:00
48 0.002264 0.023945 00:00
49 0.002172 0.022920 00:00
50 0.002085 0.021997 00:00
51 0.001990 0.023793 00:00
52 0.001873 0.024170 00:00
53 0.001774 0.023699 00:00
54 0.001724 0.023237 00:00
55 0.001611 0.023164 00:00
56 0.001527 0.022908 00:00
57 0.001451 0.023040 00:00
58 0.001390 0.022155 00:00
59 0.001347 0.022041 00:00
60 0.001333 0.022265 00:00
61 0.001291 0.022561 00:00
62 0.001224 0.021779 00:00
63 0.001159 0.022445 00:00
64 0.001080 0.022740 00:00
65 0.001023 0.022691 00:00
66 0.000971 0.023290 00:00
67 0.000941 0.022854 00:00
68 0.000876 0.022432 00:00
69 0.000829 0.022994 00:00
70 0.000774 0.022587 00:00
71 0.000722 0.022578 00:00
72 0.000681 0.022348 00:00
73 0.000643 0.022647 00:00
74 0.000608 0.022395 00:00
75 0.000569 0.022518 00:00
76 0.000533 0.022561 00:00
77 0.000510 0.022146 00:00
78 0.000477 0.022354 00:00
79 0.000454 0.022084 00:00
80 0.000439 0.022166 00:00
81 0.000420 0.022053 00:00
82 0.000401 0.022132 00:00
83 0.000370 0.022445 00:00
84 0.000355 0.022072 00:00
85 0.000344 0.022071 00:00
86 0.000322 0.022196 00:00
87 0.000305 0.022326 00:00
88 0.000289 0.022460 00:00
89 0.000264 0.022545 00:00
90 0.000251 0.022607 00:00
91 0.000245 0.022609 00:00
92 0.000238 0.022556 00:00
93 0.000225 0.022444 00:00
94 0.000220 0.022330 00:00
95 0.000215 0.022326 00:00
96 0.000212 0.022314 00:00
97 0.000202 0.022303 00:00
98 0.000226 0.022315 00:00
99 0.000213 0.022314 00:00
In [26]:
# the train vs valid losses are plotted to check the quality of the trained model and whether it needs more training
ts_model.plot_losses()
In [27]:
# the values predicted by the trained model are printed for the validation set
ts_model.show_results(rows=5)

The figures above compare the actual values with the predictions attained by the model during training and validation.

Rainfall Forecast & Validation

Forecasting using the trained Timeseries Model

During forecasting, the model takes the same training dataset as input and uses the last sequence-length terms from the tail of that dataset to predict rainfall for the number of months specified by the user.

In [29]:
from datetime import datetime  
In [30]:
# checking the training dataset
train
Out[30]:
date prcp_mm_
0 1980-01-31 224.31
1 1980-02-29 181.84
2 1980-03-31 78.86
3 1980-04-30 34.19
4 1980-05-31 17.57
... ... ...
463 2018-08-31 0.00
464 2018-09-30 0.00
465 2018-10-31 13.53
466 2018-11-30 113.04
467 2018-12-31 37.26

468 rows × 2 columns

Forecasting requires the date column to be of datetime type. If the date column is not in datetime format, it can be converted using the pd.to_datetime() method.
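
If a conversion were needed, it would look like the commented line below; it is not required for this dataset.

# Only needed when the 'date' column is not already of datetime type
# train['date'] = pd.to_datetime(train['date'])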

In [31]:
# checking if the datatype of the 'date' column is in datetime format
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 468 entries, 0 to 467
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      468 non-null    datetime64[ns]
 1   prcp_mm_  468 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 11.0 KB

In this example, the date column is already in the required datetime format.

In [32]:
train.tail(5)
Out[32]:
date prcp_mm_
463 2018-08-31 0.00
464 2018-09-30 0.00
465 2018-10-31 13.53
466 2018-11-30 113.04
467 2018-12-31 37.26

Finally, the predict function is used to forecast the 12 months subsequent to the last date in the training dataset. As such, this will forecast rainfall for the 12 months of 2019, starting from January 2019.

In [33]:
# The forecast is returned as a dataframe, since the data is non-spatial, as specified by prediction_type='dataframe'
sdf_forecasted = ts_model.predict(train, prediction_type='dataframe', number_of_predictions=test_size)
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_dl_28October\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
In [34]:
# final forecasted result returned by the model
sdf_forecasted
Out[34]:
date prcp_mm_ prcp_mm__results
0 1980-01-31 224.31 224.310000
1 1980-02-29 181.84 181.840000
2 1980-03-31 78.86 78.860000
3 1980-04-30 34.19 34.190000
4 1980-05-31 17.57 17.570000
... ... ... ...
475 2019-08-20 NaN -5.099367
476 2019-09-18 NaN 10.751540
477 2019-10-17 NaN 25.207242
478 2019-11-15 NaN 83.685172
479 2019-12-14 NaN 142.722045

480 rows × 3 columns

In [35]:
# Formatting the result into actual vs. predicted columns
sdf_forecasted = sdf_forecasted.tail(test_size)
sdf_forecasted = sdf_forecasted[['date','prcp_mm__results']]
sdf_forecasted['actual'] = test[test.columns[-1]].values
sdf_forecasted = sdf_forecasted.set_index(sdf_forecasted.columns[0])
sdf_forecasted.head()
Out[35]:
prcp_mm__results actual
date
2019-01-29 129.882450 149.69
2019-02-27 144.863222 222.38
2019-03-28 143.764554 102.48
2019-04-26 48.772594 25.12
2019-05-25 61.366233 102.49

Estimate model metrics for validation

The accuracy of the forecast is measured by comparing the forecasted values against the actual values for the 12 months.

In [36]:
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
In [37]:
r2_test = r2_score(sdf_forecasted['actual'],sdf_forecasted['prcp_mm__results'])
print('R-Square: ', round(r2_test, 2))
R-Square:  0.8

A considerably high R-squared value indicates strong agreement between the forecasted and the actual rainfall values.

In [38]:
mse_RF_train = metrics.mean_squared_error(sdf_forecasted['actual'], sdf_forecasted['prcp_mm__results'])
print('RMSE: ', round(np.sqrt(mse_RF_train), 4))

mean_absolute_error_RF_train = metrics.mean_absolute_error(sdf_forecasted['actual'], sdf_forecasted['prcp_mm__results'])
print('MAE: ', round(mean_absolute_error_RF_train, 4))
RMSE:  32.2893
MAE:  25.5549

The RMSE and MAE of the forecast are 32.29 mm and 25.55 mm respectively, which are reasonably low.
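
As an additional, optional sanity check (not part of the original notebook), the MAE can be put in context against the mean of the actual 2019 monthly values.

# Optional context: MAE relative to the mean observed monthly rainfall in 2019
mean_actual = sdf_forecasted['actual'].mean()
print('MAE as a fraction of mean monthly rainfall:', round(mean_absolute_error_RF_train / mean_actual, 2))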

Result Visualization

Finally, the actual and forecasted values are plotted to visualize their distribution over the 12-month period, with the blue line indicating forecasted values and the orange line showing the actual values.

In [39]:
plt.figure(figsize=(20,5))
plt.plot(sdf_forecasted)
plt.ylabel('prcp_mm__results')
plt.legend(sdf_forecasted.columns.values,loc='upper right')
plt.title( 'Rainfall Forecast')
plt.show()

Conclusion

The newly implemented deep learning timeseries model from the arcgis.learn library was used to forecast monthly rainfall for a 1 sq. km location in California for the period of January to December 2019, which it was able to model with high accuracy. The notebook elaborates on the methodology of applying the model to forecast timeseries data. The process includes first preparing a timeseries dataset using the prepare_tabulardata() method, followed by modeling and fitting the dataset. Timeseries modeling usually requires fine tuning several hyperparameters to properly fit the data, most of which have been internalized in this implementation, leaving the user responsible for configuring only a few significant parameters, such as the sequence length.

Summary of methods used

Method | Description | Examples
prepare_tabulardata | prepare data, including imputation, normalization, and train-test split | prepare data ready for fitting a TimeSeriesModel
model.lr_find() | find an optimal learning rate | finalize a good learning rate for training the TimeSeriesModel
TimeSeriesModel() | model initialization by selecting the timeseries deep learning algorithm to be used for fitting | selected timeseries algorithms from the fastai timeseries regression framework can be used
model.fit() | train a model with epochs and learning rate as input | train the TimeSeriesModel with suitable input
model.score() | find the R-squared metric of the trained model | returns the R-squared value after training the TimeSeriesModel
model.predict() | predict on a test set | forecast values using the trained model on test input

Data resources

Dataset | Source | Link
California Daily weather data | MODIS - Daily Surface Weather Data on a 1-km Grid for North America, Version 3 | https://daac.ornl.gov/DAYMET/guides/Daymet_V3_CFMosaics.html
