Forecasting monthly rainfall in California using Deep Learning Time Series techniques
Forest fires of historical proportions are ravaging various parts of California, brought on by a long and continuous period of drought. To help in dealing with this growing environmental emergency, advance knowledge of rainfall is critical. In this sample study, the deep learning timeseries model from arcgis.learn will be used to predict monthly rainfall for a whole year at a location in the Sierra Nevada foothills, around 30 miles northeast of Fresno, California, using the location's historic rainfall data. Data from January to December 2019 will be used to validate the quality of the forecast.
Weather forecasting is a popular field for the application of timeseries modeling. There are various approaches to solving timeseries problems, including classical statistical methods such as the ARIMA group of models, machine learning models, and deep learning models. The current implementation of the arcgis.learn timeseries model uses state-of-the-art convolutional neural network backbones especially curated for timeseries datasets. These include InceptionTime, ResCNN, Resnet, and FCN. A key difference between the approaches is that classical methods such as ARIMA require multiple hyperparameters to be fine-tuned before fitting the model, while with the current deep learning technique, most of the parameters are learned by the model itself from the data.
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import math
from datetime import datetime as dt
from IPython.display import Image, HTML

from sklearn.model_selection import train_test_split

from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata
from arcgis.features import FeatureLayer, FeatureLayerCollection
gis = GIS("home")
The dataset used in this sample study is a univariate timeseries of total monthly rainfall at a fixed location of 1 sq km in the state of California, ranging from January 1980 to December 2019.
The following cell downloads the California rainfall dataset:
url = 'https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/cali_precipitation/FeatureServer/0'
table = FeatureLayer(url)
Next, we preprocess and sort the downloaded dataset by datetime sequence.
cali_rainfall_df1 = table.query().sdf
cali_rainfall_df1 = cali_rainfall_df1.drop("ObjectId", axis=1)
cali_rainfall_df1_sorted = cali_rainfall_df1.sort_values(by='date')
cali_rainfall_df1_sorted.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 479
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   date      480 non-null    datetime64[ns]
 1   prcp_mm_  480 non-null    float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 11.2 KB
Timeseries Data Preparation
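Preparing the data for timeseries modeling consists of the following three steps: sequencing the timeseries data, checking the data type of the timeseries variable, and checking the autocorrelation of the timeseries variable.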
The first step consists of establishing the sequence of the timeseries data, which is done by creating a new index that the model uses for processing the sequential data.
# The first step consists of reindexing the timeseries data into a sequential series
cali_rainfall_reindexed = cali_rainfall_df1_sorted.reset_index()
cali_rainfall_reindexed = cali_rainfall_reindexed.drop("index", axis=1)
cali_rainfall_reindexed.head()
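As an aside, the sort and reindex steps can be combined into a single chained call. The following is a minimal sketch on a small made-up DataFrame (the `toy` values are illustrative and are not the rainfall data), using `reset_index(drop=True)` so that no extra `index` column has to be dropped afterwards.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of order (not the real rainfall table)
toy = pd.DataFrame({
    "date": pd.to_datetime(["1980-03-01", "1980-01-01", "1980-02-01"]),
    "prcp_mm_": [30.0, 10.0, 20.0],
})

# Sort chronologically, then replace the old index with 0..n-1.
# drop=True discards the old index instead of keeping it as an "index" column.
toy_sequenced = toy.sort_values(by="date").reset_index(drop=True)
print(toy_sequenced)
```

The result has the same shape as the two-step version used in this notebook: rows in date order with a fresh sequential index.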
# check the data types of the variables
cali_rainfall_reindexed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   date      480 non-null    datetime64[ns]
 1   prcp_mm_  480 non-null    float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 7.6 KB
Here, the timeseries variable is a float, which should be the expected data type. If the variable is not of a float data type, then it needs to be changed to a float data type, as demonstrated by the commented out line in the next cell.
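A conversion of this kind could look like the sketch below. It uses a small made-up DataFrame whose rainfall column is stored as strings (an assumption for illustration; the real column is already float64) and casts it with pandas `astype`.

```python
import pandas as pd

# Hypothetical toy column stored as strings (not the real rainfall table)
toy = pd.DataFrame({"prcp_mm_": ["12.5", "0", "48.1"]})

# Cast the variable to be forecast to float64 before modeling
toy["prcp_mm_"] = toy["prcp_mm_"].astype("float64")
print(toy.dtypes)
```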
Checking autocorrelation of timeseries variable
The most important step in this process is to determine if the timeseries sequence is autocorrelated. To ensure that our timeseries data can be modeled well, the strength of correlation of the variable with its past data must be estimated.
from pandas.plotting import autocorrelation_plot
plt.figure(figsize=(20,5))
autocorrelation_plot(cali_rainfall_reindexed["prcp_mm_"])
plt.show()
The plot above shows us that there is significant correlation of the data with its immediate time lagged terms, and that it gradually decreases over time as the lag increases.
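To make the quantity behind such a plot concrete, the sketch below computes the standard sample autocorrelation at a single lag in plain Python. The `autocorr` helper and the `synthetic` series are illustrative assumptions: the series is a pure 12-month cosine cycle, not the rainfall data, so its values are idealized.

```python
import math

def autocorr(values, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    # Sum of squared deviations over the whole series (lag-0 term)
    variance_sum = sum(d * d for d in deviations)
    # Sum of products of each deviation with the one `lag` steps later
    covariance_sum = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return covariance_sum / variance_sum

# Synthetic 10-year monthly series with a pure 12-month cycle (not the real data)
synthetic = [50 + 40 * math.cos(2 * math.pi * month / 12) for month in range(120)]

print(round(autocorr(synthetic, 12), 2))  # same month, one year apart
print(round(autocorr(synthetic, 6), 2))   # opposite points in the yearly cycle
```

For a strictly seasonal series like this one, the value at lag 12 is strongly positive and the value at lag 6 is strongly negative, which is the kind of structure a sequence length of 12 is meant to capture.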
The dataset above has 480 samples, each representing one month of rainfall in California over 40 years (1980-2019). Of these, 39 years (1980-2018) of data will be used to train the model, and the remaining year, a total of 12 months of data, is held out for validation. The dataset is split into training and test sets accordingly in the following cell.
# Splitting timeseries data retaining the original sequence by keeping shuffle=False, and test size of 12 months for validation
test_size = 12
train, test = train_test_split(cali_rainfall_reindexed, test_size = test_size, shuffle=False)
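With shuffling disabled and an integer test size, the split above amounts to cutting the ordered series at a boundary. A minimal sketch of that idea in plain Python, using a hypothetical list `records` as a stand-in for the 480 monthly rows:

```python
# Hypothetical stand-in for the 480 monthly records, already in date order
records = list(range(480))
test_size = 12

# Without shuffling, a chronological split is just a slice at the boundary:
# everything before the last 12 months trains the model, the rest validates it
train_part = records[:-test_size]
test_part = records[-test_size:]

print(len(train_part), len(test_part))  # 468 12
```

Keeping the order intact matters here because a shuffled split would let the model see months from 2019 during training.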
468 rows × 2 columns
Once the dataset is divided into the training and test dataset, the training data is ready to be used for modeling.
In this example, the data used is a univariate timeseries of total monthly rainfall in millimeters. This single variable will be used to forecast the 12 months of rainfall for the months subsequent to the last date in the training data, or put simply, a single variable will be used to predict the future values of that same variable. In the case of a multivariate timeseries model, there would be a list of multiple explanatory variables.
Once the variables are identified, the preprocessing of the data is performed by the prepare_tabulardata method from the arcgis.learn module in the ArcGIS API for Python. This function will take either a non spatial dataframe, a feature layer, or a spatial dataframe containing the dataset as input, and will return a TabularDataObject that can be fed into the model.
The primary input parameters required for the tool are:
- input_features : non spatial dataframe, feature layer, or spatial dataframe containing the primary dataset and the explanatory variables, if there are any
- variable_predict : field name containing the y-variable to be forecasted from the input feature layer/dataframe
- explanatory_variables : list of the field names as 2-sized tuples containing the explanatory variables as mentioned above. Since there are none in this example, it is not required here
- index_field : field name containing the timestamp
At this point, preprocessors could be used for scaling the data using a scaler as follows, depending on the data distribution.
#from sklearn.preprocessing import MinMaxScaler
#preprocessors = [('prcp_mm_', MinMaxScaler())]
#data = prepare_tabulardata(train, variable_predict='prcp_mm_', index_field='date', preprocessors=preprocessors)
In this example, preprocessors are not used, as the data is normalized by default.
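For reference, min-max scaling of the kind mentioned above maps the smallest value to 0 and the largest to 1. The sketch below shows the arithmetic in plain Python; the `min_max_scale` helper and the sample values are illustrative assumptions and are not part of arcgis.learn or scikit-learn.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant series has no spread; map everything to 0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

monthly_mm = [0.0, 50.0, 125.0, 200.0]  # hypothetical rainfall totals in mm
print(min_max_scale(monthly_mm))
```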
data = prepare_tabulardata(train, variable_predict='prcp_mm_', index_field='date', seed=42)
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_dl_28October\lib\site-packages\arcgis\learn\_utils\tabular_data.py:876: UserWarning: Dataframe is not spatial, Rasters and distance layers will not work
  warnings.warn("Dataframe is not spatial, Rasters and distance layers will not work")
# Visualize the entire timeseries data
data.show_batch(graph=True)

# Here sequence length is used as 12 which also indicates the seasonality of the data
seq_len=12

# visualize the timeseries in batches, here the sequence length is mentioned which would be treated as the batch length
data.show_batch(rows=4,seq_len=seq_len)
Model Initialization
This is the most significant step for fitting a timeseries model. Here, along with the data, the backbone for training the model and the sequence length are passed as parameters. Out of these three, the sequence length has to be selected carefully, as it can make or break the model. The sequence length is usually the cycle of the data, which in this case is 12, as it is monthly data and the pattern repeats after 12 months.
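To illustrate what a sequence length means in practice, the sketch below cuts a series into windows of `seq_len` consecutive values, each paired with the value that follows it. This shows the general sliding-window idea only; how arcgis.learn builds its batches internally is not described here, and the `make_windows` helper and toy series are assumptions for illustration.

```python
def make_windows(series, seq_len):
    """Pair each run of seq_len consecutive values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(series) - seq_len):
        inputs.append(series[start:start + seq_len])
        targets.append(series[start + seq_len])
    return inputs, targets

series = list(range(1, 37))  # 36 hypothetical monthly values
inputs, targets = make_windows(series, seq_len=12)

print(len(inputs))   # number of (window, next value) pairs
print(inputs[0])     # first 12-month window
print(targets[0])    # the month that follows it
```

With a sequence length equal to the seasonal cycle, every window contains exactly one full year, so each input always covers all twelve calendar months.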
# In model initialization, the data and the backbone is selected from the available set of InceptionTime, ResCNN, Resnet, FCN
ts_model = TimeSeriesModel(data, seq_len=seq_len, model_arch='InceptionTime')

# Finding the learning rate for training the model
l_rate = ts_model.lr_find()
Model Training
Finally, the model is now ready for training. To train the model, the model.fit method is called and is provided the number of epochs for training and the estimated learning rate suggested by lr_find in the previous step.
# the train vs valid losses is plotted to check quality of the trained model, and whether the model needs more training
ts_model.plot_losses()

# the predicted values by the trained model is printed for the test set
ts_model.show_results(rows=5)
The figures above show the training and validation losses, along with the predictions the model produced on the validation data during training.
Forecasting Using the Trained Timeseries Model
During forecasting, the model takes the same training dataset as input and uses the last seq_len terms from that dataset's tail (here, the final 12 months) to predict the rainfall for the number of months specified by the user.
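One common way to forecast several steps from the tail of a series is to predict one value at a time and feed each prediction back in as input. The sketch below illustrates that rolling idea only. Whether the arcgis.learn model forecasts this way internally is an assumption, and `seasonal_naive` is a hypothetical stand-in for a trained model.

```python
def rolling_forecast(history, seq_len, steps, predict_next):
    """Forecast `steps` values, feeding each prediction back in as input."""
    extended = list(history)
    for _ in range(steps):
        window = extended[-seq_len:]           # the most recent seq_len terms
        extended.append(predict_next(window))  # the prediction joins the input
    return extended[len(history):]

def seasonal_naive(window):
    """Hypothetical stand-in for a trained model: repeat the value from one cycle ago."""
    return window[0]

history = [10, 20, 30, 40] * 3  # toy series with a 4-step cycle
forecast = rolling_forecast(history, seq_len=4, steps=6, predict_next=seasonal_naive)
print(forecast)
```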
from datetime import datetime
# checking the training dataset
train
468 rows × 2 columns
Forecasting requires the date column to be in datetime format. If the date column is not in datetime format, it must be converted before forecasting.
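One common way to perform such a conversion is pandas `to_datetime`. The sketch below applies it to a small made-up DataFrame whose dates are stored as strings (an assumption for illustration; the notebook's own data does not need this step).

```python
import pandas as pd

# Hypothetical toy data with dates stored as plain strings
toy = pd.DataFrame({
    "date": ["2018-11-01", "2018-12-01"],
    "prcp_mm_": [55.0, 80.2],
})

# Parse the strings into a datetime column
toy["date"] = pd.to_datetime(toy["date"])
print(toy.dtypes)
```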
# checking if the datatype of the 'date' column is in datetime format
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 468 entries, 0 to 467
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   date      468 non-null    datetime64[ns]
 1   prcp_mm_  468 non-null    float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 11.0 KB
In this example, the date column is already in the required datetime format.
Finally, the predict function is used to forecast the 12 months subsequent to the last date in the training dataset. As such, this will forecast rainfall for the 12 months of 2019, starting from January 2019.
# Here the forecast is returned as a dataframe, since it is non spatial data, mentioned in the 'prediction_type'
sdf_forecasted = ts_model.predict(train, prediction_type='dataframe', number_of_predictions=test_size)
C:\Users\sup10432\AppData\Local\ESRI\conda\envs\pro_dl_28October\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
# final forecasted result returned by the model
sdf_forecasted
480 rows × 3 columns
# Formatting the result into actual vs. predicted columns
# .copy() makes the selection independent, so adding a column does not raise a SettingWithCopyWarning
sdf_forecasted = sdf_forecasted.tail(test_size)[['date','prcp_mm__results']].copy()
sdf_forecasted['actual'] = test[test.columns[-1]].values
# index by the date column so that plots use dates on the x-axis
sdf_forecasted = sdf_forecasted.set_index('date')
sdf_forecasted.head()
Estimate model metrics for validation
The accuracy of the forecasted values is measured by comparing the forecasted values against the actual values of the 12 months.
from sklearn.metrics import r2_score
import sklearn.metrics as metrics

r2_test = r2_score(sdf_forecasted['actual'],sdf_forecasted['prcp_mm__results'])
print('R-Square: ', round(r2_test, 2))
A considerably high R-squared value indicates a high similarity between the forecasted and the actual rainfall values.
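For readers who want to see what the score measures, R-squared is one minus the ratio of the squared forecast errors to the squared deviations of the actual values from their mean. The sketch below computes it by hand; the `r_squared` helper and the four sample values are illustrative assumptions, not results from this study.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [10.0, 20.0, 30.0, 40.0]     # hypothetical observed values
predicted = [12.0, 18.0, 33.0, 37.0]  # hypothetical forecasts
print(round(r_squared(actual, predicted), 3))
```

A perfect forecast gives 1, while a forecast that always returns the mean of the actual values gives 0.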
# RMSE and MAE of the forecast against the 12 held-out months
mse_test = metrics.mean_squared_error(sdf_forecasted['actual'], sdf_forecasted['prcp_mm__results'])
print('RMSE: ', round(np.sqrt(mse_test), 4))
mae_test = metrics.mean_absolute_error(sdf_forecasted['actual'], sdf_forecasted['prcp_mm__results'])
print('MAE: ', round(mae_test, 4))
RMSE: 32.2893
MAE: 25.5549
The RMSE and MAE of the forecast are 32.29 mm and 25.55 mm respectively, which are quite low.
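Both error terms are in the same unit as the data (millimeters here). The sketch below computes them by hand on made-up values to show the difference: RMSE squares the errors before averaging, so it penalizes large misses more heavily and is never smaller than MAE. The helper names and sample values are illustrative assumptions.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: average size of the error, ignoring its sign."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

actual = [10.0, 20.0, 30.0, 40.0]     # hypothetical observed values
predicted = [12.0, 18.0, 33.0, 37.0]  # hypothetical forecasts
print(round(rmse(actual, predicted), 4), mae(actual, predicted))
```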
Finally, the actual and forecasted values are plotted to visualize their distribution over the 12-month period, with the blue line indicating the forecasted values and the orange line showing the actual values.
plt.figure(figsize=(20,5))
plt.plot(sdf_forecasted)
plt.ylabel('prcp_mm__results')
plt.legend(sdf_forecasted.columns.values,loc='upper right')
plt.title( 'Rainfall Forecast')
plt.show()
The newly implemented deep learning timeseries model from the arcgis.learn library was used to forecast monthly rainfall for a 1 sq km location in California for the period of January to December 2019, which it was able to model with high accuracy. The notebook elaborates on the methodology of applying the model to forecast timeseries data. The process consists of first preparing a timeseries dataset using the prepare_tabulardata() method, followed by modeling and fitting the dataset. Usually, timeseries modeling requires fine-tuning several hyperparameters to fit the data properly. Most of these have been internalized in the current implementation, leaving the user responsible for configuring only a few significant parameters, like the sequence length.
| Method | Description | Usage |
|---|---|---|
| prepare_tabulardata() | Prepares data, including imputation, normalization, and train-test split | Prepare data for fitting a timeseries model |
| model.lr_find() | Finds an optimal learning rate | Finalize a good learning rate for training the timeseries model |
| TimeSeriesModel() | Initializes the model by selecting the timeseries deep learning algorithm to be used for fitting | A selected timeseries algorithm from fastai timeseries regression can be used |
| model.fit() | Trains a model with epochs and learning rate as input | Train the timeseries model with suitable input |
| model.score() | Finds the R-squared metric of the trained model | Returns the R-squared value after training the timeseries model |
| model.predict() | Predicts on a test set | Forecast values using the trained model on test input |