Forecasting Air Temperature in California using ResCNN model

Introduction

A rise in air temperature is directly correlated with global warming and changing climatic conditions, and air temperature is one of the main factors in predicting other meteorological variables, such as streamflow, evapotranspiration, and solar radiation. Accurate forecasting of this variable is therefore vital for mitigating environmental and economic damage. Including the dependency of air temperature on other variables, such as wind speed and precipitation, helps in deriving more precise predictions. In this study, the deep learning TimeSeriesModel from arcgis.learn is used to predict monthly air temperature for roughly two years at a ground station at the Fresno Yosemite International Airport in California, USA. The dataset ranges from 1948 to 2015. Data from January 2014 to November 2015 is used to validate the quality of the forecast.

Univariate time series modeling is one of the more popular applications of time series analysis. This study instead uses multivariate time series analysis, which is more involved because the dataset contains more than one time-dependent variable. The TimeSeriesModel from arcgis.learn includes backbones such as InceptionTime, ResCNN, ResNet, and FCN, which do not need fine-tuning of multiple hyperparameters before fitting the model. Here is the schematic flow chart of the methodology:

Importing libraries

Input
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot as aplot

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics

from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata
from arcgis.features import FeatureLayer, FeatureLayerCollection

Connecting to your GIS

Input
gis = GIS('home')

Accessing & visualizing the dataset

The data used in this sample study is a multivariate monthly time series dataset recorded at a ground station in the Fresno Yosemite International Airport, California, USA. It ranges from January 1948 to November 2015.

Input
# Location of the ground station
location = gis.map(location="Fresno Yosemite International California", zoomlevel=12)
location

Input
# Access the data table
data_table = gis.content.get("8c58e808aabd40408f7bc4eeac64fffb")
data_table
Output
Weather Data of Fresno International California
This Feature layer contains meteorological dataset from ground instrument placed at Fresno Yosemite International, California USA. The data is available at monthly temporal scale ranging from 1948 to 2015.
Feature Layer Collection by api_data_owner
Last Modified: November 17, 2021
0 comments, 30 views
Input
# Visualize as pandas dataframe
climate_data = data_table.tables[0]
climate_df = climate_data.query().sdf
climate_df.head()
Output
STATION NAME DATE AWND PRCP PSUN SNOW TAVG TMAX TMIN TSUN WSFG ObjectId
0 USW00093193 FRESNO YOSEMITE INTERNATIONAL, CA US 1948-01-01 None 0.00 None 0 51.2 66.3 36.2 None None 1
1 USW00093193 FRESNO YOSEMITE INTERNATIONAL, CA US 1948-02-01 None 0.78 None 0 49.0 62.2 35.8 None None 2
2 USW00093193 FRESNO YOSEMITE INTERNATIONAL, CA US 1948-03-01 None 2.29 None 0 53.2 65.6 40.7 None None 3
3 USW00093193 FRESNO YOSEMITE INTERNATIONAL, CA US 1948-04-01 None 2.28 None 0 59.8 71.8 47.7 None None 4
4 USW00093193 FRESNO YOSEMITE INTERNATIONAL, CA US 1948-05-01 None 0.96 None 0 65.2 79.7 50.7 None None 5

The dataframe above contains columns for station ID (STATION), station name (NAME), Date (DATE), Wind speed (AWND), precipitation (PRCP), possible sunshine (PSUN), snow cover (SNOW), average temperature (TAVG), maximum temperature (TMAX), minimum temperature (TMIN), total sunshine (TSUN), and peak wind gust speed (WSFG).

Input
climate_df.shape
Output
(815, 13)

Next, the dataset is prepared by dropping the columns for object ID, station ID, station name, possible sunshine, snow cover, maximum temperature, minimum temperature, total sunshine, and peak wind gust speed. Then, the dataset is narrowed to the data from 1987 onward, to avoid missing values.

Input
climate_df = climate_df.drop(
    ["ObjectId", "STATION", "NAME", "PSUN", "SNOW", "TSUN",'TMAX', 'TMIN', "WSFG"], axis=1
)
Input
climate_df.columns
Output
Index(['DATE', 'AWND', 'PRCP', 'TAVG'], dtype='object')
Input
# Select data from 1987 onward to get a continuous series without NaN values
selected_df = climate_df[climate_df.DATE > "1987"]
selected_df.head()
Output
DATE AWND PRCP TAVG
469 1987-02-01 5.8 1.36 52.7
470 1987-03-01 6.3 2.39 55.6
471 1987-04-01 6.9 0.07 66.6
472 1987-05-01 7.4 0.87 71.8
473 1987-06-01 7.4 0.01 78.4

Here, TAVG is the variable to be predicted, and PRCP and AWND are the predictors, included to capture their influence on temperature.

Input
selected_df.shape
Output
(346, 4)
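
The first retained row above is February 1987, not January. A minimal sketch of why, using a small hypothetical frame (`demo` is not part of the original notebook): pandas parses the bare string "1987" as 1987-01-01, so a strict `>` comparison excludes January 1987. Whether that month was dropped deliberately is not stated in the source.

```python
import pandas as pd

# Hypothetical five-month frame straddling the 1987 boundary
demo = pd.DataFrame({"DATE": pd.date_range("1986-11-01", periods=5, freq="MS")})

# "1987" is parsed as 1987-01-01, so ">" drops January 1987 itself
strict = demo[demo["DATE"] > "1987"]

# ">=" with an explicit date keeps January 1987
inclusive = demo[demo["DATE"] >= "1987-01-01"]

print(strict["DATE"].min().date())
print(inclusive["DATE"].min().date())
```

If January 1987 should be part of the series, the inclusive form is the safer way to express the cutoff.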

Time series data preprocessing

The preprocessing of the data for multivariate time series modeling involves the following steps:

Converting into time series format

The dataset is now transformed into a time series data format by creating a new index that will be used by the model for processing the sequential data.

Input
final_df = selected_df.reset_index()
final_df = final_df.drop("index", axis=1)
final_df.head()
Output
DATE AWND PRCP TAVG
0 1987-02-01 5.8 1.36 52.7
1 1987-03-01 6.3 2.39 55.6
2 1987-04-01 6.9 0.07 66.6
3 1987-05-01 7.4 0.87 71.8
4 1987-06-01 7.4 0.01 78.4

Data types of time series variables

Here we check the data types of the variables.

Input
final_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 346 entries, 0 to 345
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   DATE    346 non-null    datetime64[ns]
 1   AWND    346 non-null    object        
 2   PRCP    346 non-null    float64       
 3   TAVG    346 non-null    float64       
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 10.9+ KB

The time-dependent variables should be of type float. If a time-dependent variable is not of a float data type, it needs to be converted to float. Here, wind speed (AWND) is converted from object dtype to float64, as shown in the next cell.

Input
final_df["AWND"] = final_df["AWND"].astype("float64")
final_df.head()
Output
DATE AWND PRCP TAVG
0 1987-02-01 5.8 1.36 52.7
1 1987-03-01 6.3 2.39 55.6
2 1987-04-01 6.9 0.07 66.6
3 1987-05-01 7.4 0.87 71.8
4 1987-06-01 7.4 0.01 78.4
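
The conversion above works because every AWND value in the selected range is numeric. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"`, which converts unparseable entries to NaN instead of raising. The sample values and the "M" flag are hypothetical and only illustrate the behavior.

```python
import pandas as pd

# Hypothetical object-dtype column; "M" stands in for a non-numeric flag
raw = pd.Series(["5.8", "6.3", "M", "7.4"], dtype="object")

# errors="coerce" turns unparseable entries into NaN instead of raising
wind = pd.to_numeric(raw, errors="coerce")

print(wind.dtype)
print(int(wind.isna().sum()))
```

Counting the NaNs afterwards shows how many values failed to convert, which is worth checking before modeling.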

Checking autocorrelation of time dependent variables

The next step will determine if the time series sequence is autocorrelated. To ensure that our time series data can be modeled well, the strength of correlation of the variable with its past data must be estimated.

Input
variables = ["AWND", "PRCP", "TAVG"]
for variable in variables:
    plt.figure(figsize=(20, 2))
    autocorr = aplot(final_df[variable], color="blue")
    plt.title(variable)

The plots show a significant correlation of each variable with its immediate time-lagged terms, which gradually decreases as the lag increases.
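
The same idea can be checked numerically. The sketch below is an illustration on a synthetic monthly series (not the Fresno data) and uses `Series.autocorr`, which returns the Pearson correlation between a series and a lagged copy of itself.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a 12-month cycle plus a little noise (assumed values)
rng = np.random.default_rng(0)
months = np.arange(240)
series = pd.Series(65 + 15 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, 240))

# Correlation of the series with itself at lags of 1, 6, and 12 months
acf = {lag: series.autocorr(lag=lag) for lag in (1, 6, 12)}
for lag, value in acf.items():
    print(lag, round(value, 2))
```

For a strongly seasonal series like this one, the correlation is high at lag 1, strongly negative at lag 6, and high again at lag 12, which is the kind of pattern that motivates a sequence length of 12 for monthly data.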

Creating dataset for prediction

Here, in a copy of the dataset, the column to be predicted, average temperature (TAVG), is populated with NaNs for the forecasting period of 2014-2015. This format is required by the model.predict() function in time series analysis, which fills in the NaN values with forecasted temperatures.

Input
predict_df = final_df.copy()
predict_df.loc[predict_df["DATE"] > "2013-12-01", "TAVG"] = None
predict_df.tail()
Output
DATE AWND PRCP TAVG
341 2015-07-01 8.1 0.43 NaN
342 2015-08-01 7.6 0.00 NaN
343 2015-09-01 5.8 0.12 NaN
344 2015-10-01 4.7 0.49 NaN
345 2015-11-01 3.6 1.74 NaN
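
A quick sanity check on the masking step, sketched on a stand-in frame with the same monthly date range as the data above (the TAVG values here are placeholders, and `demo` and `masked` are hypothetical names): the number of masked rows should equal the forecast horizon.

```python
import numpy as np
import pandas as pd

# Stand-in frame covering February 1987 to November 2015, one row per month
dates = pd.date_range("1987-02-01", "2015-11-01", freq="MS")
demo = pd.DataFrame({"DATE": dates, "TAVG": np.linspace(50.0, 80.0, len(dates))})

# Blank out the target for every month after December 2013
masked = demo.copy()
masked.loc[masked["DATE"] > "2013-12-01", "TAVG"] = np.nan

n_missing = int(masked["TAVG"].isna().sum())
first_missing = masked.loc[masked["TAVG"].isna(), "DATE"].min()
print(len(masked), n_missing, first_missing.date())
```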

Train - Test split of time series dataset

Out of these roughly 29 years (1987-2015), about 27 years of data (323 months, February 1987 to December 2013) are used for training the model, with the remaining 23 months (January 2014 to November 2015) being used for forecasting and validation. As we are splitting time series data, we set shuffle=False to keep the sequence intact, and we set a test size of 23 months for validation.

Input
test_size = 23
train, test = train_test_split(final_df, test_size=test_size, shuffle=False)
Input
train
Output
DATE AWND PRCP TAVG
0 1987-02-01 5.8 1.36 52.7
1 1987-03-01 6.3 2.39 55.6
2 1987-04-01 6.9 0.07 66.6
3 1987-05-01 7.4 0.87 71.8
4 1987-06-01 7.4 0.01 78.4
... ... ... ... ...
318 2013-08-01 7.2 0.00 83.0
319 2013-09-01 6.5 0.01 77.8
320 2013-10-01 3.4 0.03 66.6
321 2013-11-01 2.5 0.54 58.5
322 2013-12-01 2.2 0.15 47.4

323 rows × 4 columns
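
With shuffle=False, the split simply takes the last rows as the test set. A minimal sketch of the same chronological split using positional slicing, on a stand-in frame that contains only the dates (`demo`, `train_demo`, and `test_demo` are hypothetical names):

```python
import pandas as pd

# Stand-in frame with the same monthly date range as the data above
dates = pd.date_range("1987-02-01", "2015-11-01", freq="MS")
demo = pd.DataFrame({"DATE": dates})

test_size = 23
train_demo = demo.iloc[:-test_size]  # everything except the last 23 months
test_demo = demo.iloc[-test_size:]   # the last 23 months

# In a chronological split, every training date precedes every test date
print(len(train_demo), train_demo["DATE"].max().date())
print(len(test_demo), test_demo["DATE"].min().date())
```

Checking that the latest training date precedes the earliest test date is a cheap guard against accidentally shuffling a time series.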

Time series model building

After the train and test sets are created, the training set is ready for modeling.

Data preprocessing

In this example, the dataset contains 'AWND' (Windspeed), 'PRCP' (Precipitation), and 'TAVG' (Average Air temperature) as time-dependent variables leading to a multivariate time series analysis at a monthly time scale. These variables are used to forecast the next 23 months of air temperature for the months after the last date in the training data, or, in other words, these multiple explanatory variables are used to predict the future values of the dependent air temperature variable.

Once the variables are identified, the preprocessing of the data is performed by the prepare_tabulardata method from the arcgis.learn module in the ArcGIS API for Python. This function takes either a non-spatial data frame, a feature layer, or a spatial data frame containing the dataset as input and returns a TabularDataObject that can be fed into the model. By default, prepare_tabulardata scales/normalizes the numerical columns in a dataset using StandardScaler. The primary input parameters required for the tool are:

  • input_features : The input dataset; in this example, the non-spatial training dataframe
  • variable_predict : The field name of the forecasting variable
  • explanatory_variables : A list of the field names that are used as time-dependent variables in multivariate time series
  • index_field : The field name containing the timestamp that will be used as the index field for the data and to visualize values on the x-axis in the time series
Input
data = prepare_tabulardata(
    train,
    variable_predict="TAVG",
    explanatory_variables=["AWND", "PRCP"],
    index_field="DATE",
    seed=42,
)
C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\arcgis\learn\_utils\tabular_data.py:936: UserWarning:

Dataframe is not spatial, Rasters and distance layers will not work
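
As noted above, numerical columns are scaled with StandardScaler by default. The sketch below shows what standard scaling computes, using the first five TAVG values from the training set. It only illustrates the transform; how arcgis.learn applies and inverts it internally is not shown in this notebook.

```python
import numpy as np

# First five TAVG values from the training data shown above
tavg = np.array([52.7, 55.6, 66.6, 71.8, 78.4])

# Standard scaling: subtract the mean, divide by the population standard deviation
mean, std = tavg.mean(), tavg.std()  # ddof=0, the convention StandardScaler uses
scaled = (tavg - mean) / std

# The transform is invertible, so predictions can be mapped back to original units
restored = scaled * std + mean

print(np.round(scaled, 3))
print(np.round(restored, 1))
```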

Input
# Visualize the entire timeseries data
data.show_batch(graph=True)
Input
# Sequence length is set to 12, matching the 12-month seasonality of the data
seq_len = 12

Next, we visualize the timeseries in batches. Here, we will pass the sequence length as the batch length.

Input
data.show_batch(rows=4, seq_len=seq_len)

Model initialization

This is an important step for fitting a time series model. Here, the input dataset, the backbone for training the model, and the sequence length are passed as parameters. Of these three, the sequence length has to be selected carefully. It is usually the cycle length of the data, which in this case is 12, as the data is monthly and the pattern repeats every 12 months. The backbone is selected from the available set of InceptionTime, ResCNN, ResNet, and FCN.

Input
tsmodel = TimeSeriesModel(data, seq_len=seq_len, model_arch="ResCNN")
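
To make the role of the sequence length concrete, here is a generic sliding-window sketch: each training sample holds seq_len consecutive values and the target is the value that follows. The helper `make_windows` is hypothetical and illustrates the general idea only; the batching that TimeSeriesModel performs internally may differ.

```python
import numpy as np

def make_windows(values, seq_len):
    """Return (X, y): each row of X holds seq_len consecutive values, y the next value."""
    values = np.asarray(values, dtype=float)
    n = len(values) - seq_len
    X = np.stack([values[i : i + seq_len] for i in range(n)])
    y = values[seq_len:]
    return X, y

# Placeholder series with the same length as the training set (323 months)
series = np.arange(323, dtype=float)
X, y = make_windows(series, seq_len=12)

print(X.shape, y.shape)
print(X[0], y[0])
```

With 323 monthly values and a window of 12, this yields 311 samples, each covering one full seasonal cycle.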

Here, we find the optimal learning rate for training the model.

Input
lr_rate = tsmodel.lr_find()

Model training

The model is now ready for training. To train the model, the model.fit method is used and is provided with the number of epochs for training and the learning rate suggested above as parameters:

Input
tsmodel.fit(100, lr=lr_rate)
epoch train_loss valid_loss time
0 0.960341 0.327997 00:00
1 0.905986 0.398620 00:00
2 0.854629 0.495778 00:00
3 0.803305 0.601204 00:00
4 0.750918 0.714522 00:00
5 0.694405 0.837259 00:00
6 0.629638 0.899321 00:00
7 0.564060 0.799259 00:00
8 0.498104 0.516647 00:00
9 0.436980 0.270044 00:00
10 0.382555 0.127830 00:00
11 0.335652 0.072086 00:00
12 0.296507 0.011030 00:00
13 0.263554 0.007625 00:00
14 0.235444 0.007725 00:00
15 0.210686 0.016559 00:00
16 0.189255 0.008922 00:00
17 0.170306 0.004350 00:00
18 0.153841 0.010464 00:00
19 0.139377 0.015209 00:00
20 0.126539 0.004292 00:00
21 0.115118 0.005252 00:00
22 0.104859 0.002869 00:00
23 0.095638 0.002625 00:00
24 0.087301 0.006485 00:00
25 0.080027 0.009469 00:00
26 0.073334 0.009204 00:00
27 0.067324 0.005014 00:00
28 0.061931 0.003779 00:00
29 0.056923 0.003239 00:00
30 0.052320 0.003748 00:00
31 0.048077 0.002600 00:00
32 0.044202 0.004696 00:00
33 0.040764 0.005033 00:00
34 0.037607 0.001809 00:00
35 0.034591 0.002268 00:00
36 0.031892 0.002683 00:00
37 0.029392 0.002046 00:00
38 0.027079 0.003014 00:00
39 0.025031 0.002233 00:00
40 0.023200 0.003286 00:00
41 0.021619 0.004038 00:00
42 0.020044 0.002866 00:00
43 0.018576 0.002538 00:00
44 0.017208 0.002113 00:00
45 0.015948 0.003038 00:00
46 0.014801 0.001666 00:00
47 0.013740 0.005640 00:00
48 0.012854 0.001948 00:00
49 0.012011 0.003482 00:00
50 0.011200 0.001851 00:00
51 0.010443 0.003678 00:00
52 0.009682 0.002429 00:00
53 0.009051 0.002110 00:00
54 0.008452 0.002501 00:00
55 0.007954 0.002407 00:00
56 0.007382 0.001861 00:00
57 0.006923 0.002152 00:00
58 0.006464 0.002342 00:00
59 0.006101 0.001794 00:00
60 0.005727 0.001933 00:00
61 0.005591 0.003044 00:00
62 0.005265 0.004069 00:00
63 0.005186 0.002659 00:00
64 0.004908 0.002070 00:00
65 0.004728 0.002142 00:00
66 0.004427 0.001737 00:00
67 0.004143 0.002057 00:00
68 0.003877 0.001939 00:00
69 0.003609 0.001804 00:00
70 0.003400 0.001770 00:00
71 0.003187 0.001897 00:00
72 0.003073 0.001739 00:00
73 0.002866 0.001978 00:00
74 0.002682 0.001729 00:00
75 0.002523 0.001811 00:00
76 0.002354 0.001935 00:00
77 0.002253 0.001690 00:00
78 0.002108 0.001890 00:00
79 0.001960 0.002030 00:00
80 0.001826 0.001838 00:00
81 0.001734 0.001769 00:00
82 0.001704 0.001844 00:00
83 0.001645 0.001852 00:00
84 0.001591 0.001816 00:00
85 0.001580 0.001718 00:00
86 0.001504 0.001718 00:00
87 0.001406 0.001782 00:00
88 0.001502 0.001797 00:00
89 0.001413 0.001755 00:00
90 0.001378 0.001745 00:00
91 0.001311 0.001726 00:00
92 0.001237 0.001727 00:00
93 0.001171 0.001742 00:00
94 0.001100 0.001734 00:00
95 0.001056 0.001722 00:00
96 0.000998 0.001727 00:00
97 0.000968 0.001729 00:00
98 0.001004 0.001731 00:00
99 0.000939 0.001726 00:00

To check the quality of the trained model and whether the model needs more training, we generate a train vs validation loss plot below:

Input
tsmodel.plot_losses()

Next, the predicted values of the model and the actual values are printed for the training dataset.

Input
tsmodel.show_results(rows=5)

Air temperature forecast & validation

Forecasting using the trained TimeSeriesModel

During forecasting, the model uses the dataset prepared above with NaN values as input, with the prediction_type set as dataframe.

Input
# Checking the input dataset
predict_df.tail(23)
Output
DATE AWND PRCP TAVG
323 2014-01-01 2.5 0.57 NaN
324 2014-02-01 4.9 2.11 NaN
325 2014-03-01 5.8 0.62 NaN
326 2014-04-01 6.9 0.74 NaN
327 2014-05-01 8.9 0.04 NaN
328 2014-06-01 8.1 0.00 NaN
329 2014-07-01 7.4 0.01 NaN
330 2014-08-01 6.7 0.00 NaN
331 2014-09-01 6.3 0.18 NaN
332 2014-10-01 4.5 0.50 NaN
333 2014-11-01 3.1 0.40 NaN
334 2014-12-01 4.0 2.30 NaN
335 2015-01-01 2.2 0.21 NaN
336 2015-02-01 3.8 1.13 NaN
337 2015-03-01 5.4 0.06 NaN
338 2015-04-01 6.9 1.25 NaN
339 2015-05-01 7.6 0.57 NaN
340 2015-06-01 7.6 0.01 NaN
341 2015-07-01 8.1 0.43 NaN
342 2015-08-01 7.6 0.00 NaN
343 2015-09-01 5.8 0.12 NaN
344 2015-10-01 4.7 0.49 NaN
345 2015-11-01 3.6 1.74 NaN
Input
df_forecasted = tsmodel.predict(predict_df, prediction_type="dataframe")
Input
# Final forecasted result returned by the model
df_forecasted
Output
DATE AWND PRCP TAVG TAVG_results
0 1987-02-01 5.8 1.36 52.7 52.700000
1 1987-03-01 6.3 2.39 55.6 55.600000
2 1987-04-01 6.9 0.07 66.6 66.600000
3 1987-05-01 7.4 0.87 71.8 71.800000
4 1987-06-01 7.4 0.01 78.4 78.400000
... ... ... ... ... ...
341 2015-07-01 8.1 0.43 NaN 84.969053
342 2015-08-01 7.6 0.00 NaN 81.919138
343 2015-09-01 5.8 0.12 NaN 77.078820
344 2015-10-01 4.7 0.49 NaN 69.666935
345 2015-11-01 3.6 1.74 NaN 58.852926

346 rows × 5 columns

Next, we format the results into actual vs predicted columns.

Input
result_df = pd.DataFrame()
result_df["DATE"] = test["DATE"]
result_df["Airtemp_actual"] = test["TAVG"]
result_df["Airtemp_predicted"] = df_forecasted["TAVG_results"][-23:]
result_df = result_df.set_index(result_df.columns[0])
result_df
Output
Airtemp_actual Airtemp_predicted
DATE
2014-01-01 53.2 48.777484
2014-02-01 56.8 52.200898
2014-03-01 62.3 59.264070
2014-04-01 66.8 65.984249
2014-05-01 74.2 72.340793
2014-06-01 80.9 77.236778
2014-07-01 86.9 84.582398
2014-08-01 84.4 81.742578
2014-09-01 80.7 76.948140
2014-10-01 72.0 67.554652
2014-11-01 57.7 58.481853
2014-12-01 51.9 47.955960
2015-01-01 49.0 47.175748
2015-02-01 57.0 50.230618
2015-03-01 64.0 56.937533
2015-04-01 64.3 64.488977
2015-05-01 68.5 71.976515
2015-06-01 81.9 78.548443
2015-07-01 83.1 84.969053
2015-08-01 82.4 81.919138
2015-09-01 78.7 77.078820
2015-10-01 71.3 69.666935
2015-11-01 52.0 58.852926

Estimate model metrics for validation

The accuracy of the forecasted values is measured by comparing the forecasted values against the actual values for the 23 months chosen for testing.

Input
r2 = r2_score(result_df["Airtemp_actual"], result_df["Airtemp_predicted"])
mse = metrics.mean_squared_error(
    result_df["Airtemp_actual"], result_df["Airtemp_predicted"]
)
mae = metrics.mean_absolute_error(
    result_df["Airtemp_actual"], result_df["Airtemp_predicted"]
)
print(
    "RMSE:     ",
    round(np.sqrt(mse), 4),
    "\n" "MAE:      ",
    round(mae, 4),
    "\n" "R-Square: ",
    round(r2, 2),
)
RMSE:      3.661 
MAE:       3.1054 
R-Square:  0.91

A high R-square value of 0.91 indicates close agreement between the forecasted values and the actual values. Furthermore, the RMSE of 3.661, expressed in the same units as TAVG, is low relative to the seasonal range of roughly 49 to 87 in the validation data, indicating a good fit by the model.
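
The three metrics can also be computed by hand from the actual and predicted columns of the results table above, which makes their definitions explicit. This is a sketch using only the standard library. The mean error (`bias`) is an extra diagnostic that is not part of the original notebook.

```python
import math

# Actual and predicted values copied from the results table above
actual = [53.2, 56.8, 62.3, 66.8, 74.2, 80.9, 86.9, 84.4, 80.7, 72.0, 57.7, 51.9,
          49.0, 57.0, 64.0, 64.3, 68.5, 81.9, 83.1, 82.4, 78.7, 71.3, 52.0]
predicted = [48.777484, 52.200898, 59.264070, 65.984249, 72.340793, 77.236778,
             84.582398, 81.742578, 76.948140, 67.554652, 58.481853, 47.955960,
             47.175748, 50.230618, 56.937533, 64.488977, 71.976515, 78.548443,
             84.969053, 81.919138, 77.078820, 69.666935, 58.852926]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

# MAE: mean of absolute errors; RMSE: square root of the mean squared error
mae = sum(abs(e) for e in errors) / n
ss_res = sum(e * e for e in errors)
rmse = math.sqrt(ss_res / n)

# R-square: 1 minus residual sum of squares over total sum of squares
mean_actual = sum(actual) / n
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

# Mean error (actual minus predicted); positive means forecasts run low on average
bias = sum(errors) / n

print(round(rmse, 4), round(mae, 4), round(r2, 2), round(bias, 2))
```

The results should match the RMSE, MAE, and R-square values printed above, up to rounding. The positive mean error suggests the forecasts are, on average, slightly lower than the observed temperatures over this validation period.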

Result visualization

Finally, the actual and forecasted values are plotted to visualize their distribution over the validation period, with the orange line representing the forecasted values and the blue line representing the actual values.

Input
plt.figure(figsize=(20, 5))
plt.plot(result_df)
plt.ylabel("Air Temperature")
plt.legend(result_df.columns.values, loc="upper right")
plt.title("Forecasted Air Temperature")
plt.show()

Conclusion

The study conducted a multivariate time series analysis using the deep learning TimeSeriesModel from the arcgis.learn library and forecasted the monthly air temperature for a station in California. The model was trained with about 27 years of data (1987-2013) and was used to forecast a period of roughly 2 years (2014-2015) with high accuracy. The independent variables were wind speed and precipitation. The methodology included preparing a time series dataset using the prepare_tabulardata() method, followed by modeling, predicting, and validating against the test dataset. Usually, time series modeling requires fine-tuning several hyperparameters to fit the data properly. Most of these have been internalized in this model, leaving the user responsible for configuring only a few significant parameters, like the sequence length.

Summary of methods used

Method | Description | Examples
prepare_tabulardata | prepare data including imputation, scaling and train-test split | prepare data ready for fitting a Timeseries Model
model.lr_find() | finds an optimal learning rate | finalize a good learning rate for training the Timeseries model
TimeSeriesModel() | Model Initialization by selecting the TimeSeriesModel algorithm to be used for fitting | Selected Timeseries algorithm from Fastai time series regression can be used
model.fit() | trains a model with epochs & learning rate as input | training the Timeseries model with suitable input
model.predict() | predicts on a test set | forecast values using the trained models on the test input

References

Data resources

Dataset | Source | Link
Global Summary of the Month | NOAA Climate Data Online | https://www.ncdc.noaa.gov/cdo-web/search
