Forecasting Air Temperature in California using ResCNN model

Introduction

A rise in air temperature is directly correlated with Global warming and change in climatic conditions and is one of the main factors in predicting other meteorological variables, like streamflow, evapotranspiration, and solar radiation. As such, accurate forecasting of this variable is vital in pursuing the mitigation of environmental and economic destruction. Including the dependency of air temperature in other variables, like wind speed or precipitation, helps in deriving more precise predictions. In this study, the deep learning TimeSeriesModel from arcgis.learn is used to predict monthly air temperature for two years at a ground station at the Fresno Yosemite International Airport in California, USA. The dataset ranges from 1948-2015. Data from January 2014 to November 2015 is used to validate the quality of the forecast.

Univariate time series modeling is one of the more popular applications of time series analysis. This study includes multivariate time series analysis, which is a bit more convoluted, as the dataset contains more than one time-dependent variable. The TimeSeriesModel from arcgis.learn includes backbones, such as InceptionTime, ResCNN, ResNet and FCN, which do not need fine-tuning of multiple hyperparameters before fitting the model. Here is the schematic flow chart of the methodology:

Importing libraries

%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot as aplot

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics

from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata
from arcgis.features import FeatureLayer, FeatureLayerCollection

Connecting to your GIS

gis = GIS('home')

Accessing & visualizing the dataset

The data used in this sample study is a multivariate monthly time series dataset recorded at a ground station in the Fresno Yosemite International Airport, California, USA. It ranges from January 1948 to November 2015.

# Location of the ground station
location = gis.map(location="Fresno Yosemite International California", zoomlevel=12)
location
# Access the data table
data_table = gis.content.get("8c58e808aabd40408f7bc4eeac64fffb")
data_table
Weather Data of Fresno International California
This Feature layer contains meteorological dataset from ground instrument placed at Fresno Yosemite International, California USA. The data is available at monthly temporal scale ranging from 1948 to 2015.Feature Layer Collection by api_data_owner
Last Modified: November 17, 2021
0 comments, 30 views
# Visualize as pandas dataframe
climate_data = data_table.tables[0]
climate_df = climate_data.query().sdf
climate_df.head()
STATIONNAMEDATEAWNDPRCPPSUNSNOWTAVGTMAXTMINTSUNWSFGObjectId
0USW00093193FRESNO YOSEMITE INTERNATIONAL, CA US1948-01-01None0.00None051.266.336.2NoneNone1
1USW00093193FRESNO YOSEMITE INTERNATIONAL, CA US1948-02-01None0.78None049.062.235.8NoneNone2
2USW00093193FRESNO YOSEMITE INTERNATIONAL, CA US1948-03-01None2.29None053.265.640.7NoneNone3
3USW00093193FRESNO YOSEMITE INTERNATIONAL, CA US1948-04-01None2.28None059.871.847.7NoneNone4
4USW00093193FRESNO YOSEMITE INTERNATIONAL, CA US1948-05-01None0.96None065.279.750.7NoneNone5

The dataframe above contains columns for station ID (STATION), station name (NAME), Date (DATE), Wind speed (AWND), precipitation (PRCP), possible sunshine (PSUN), snow cover (SNOW), average temperature (TAVG), maximum temperature (TMAX), minimum temperature (TMIN), total sunshine (TSUN), and peak wind gust speed (WSFG).

climate_df.shape
(815, 13)

Next, the dataset is prepared by dropping the variables for station, possible sunshine, snow cover, maximum temperature, minimum temperature, total sunshine, and peak wind gust speed. Then, the dataset is narrowed to the data from 1987 on, to avoid missing values.

climate_df = climate_df.drop(
    ["ObjectId", "STATION", "NAME", "PSUN", "SNOW", "TSUN",'TMAX', 'TMIN', "WSFG"], axis=1
)
climate_df.columns
Index(['DATE', 'AWND', 'PRCP', 'TAVG'], dtype='object')
# Selecting dataset from year 1987 to get continous data without NAN values
selected_df = climate_df[climate_df.DATE > "1987"]
selected_df.head()
DATEAWNDPRCPTAVG
4691987-02-015.81.3652.7
4701987-03-016.32.3955.6
4711987-04-016.90.0766.6
4721987-05-017.40.8771.8
4731987-06-017.40.0178.4

Here, TAVG is our variable to be predicted, with PRCP and AWND being the predictors used, showing their influence on temperature.

selected_df.shape
(346, 4)

Time series data preprocessing

The preprocessing of the data for multivariate time series modeling involves the following steps:

Converting into time series format

The dataset is now transformed into a time series data format by creating a new index that will be used by the model for processing the sequential data.

final_df = selected_df.reset_index()
final_df = final_df.drop("index", axis=1)
final_df.head()
DATEAWNDPRCPTAVG
01987-02-015.81.3652.7
11987-03-016.32.3955.6
21987-04-016.90.0766.6
31987-05-017.40.8771.8
41987-06-017.40.0178.4

Data types of time series variables

Here we check the data types of the variables.

final_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 346 entries, 0 to 345
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   DATE    346 non-null    datetime64[ns]
 1   AWND    346 non-null    object        
 2   PRCP    346 non-null    float64       
 3   TAVG    346 non-null    float64       
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 10.9+ KB

The time-dependent variables should of the type float. If a time-dependent variable is not of a float data type, then it needs to be changed to float. Here, Windspeed (AWND) is converted from object dtype to float64, as shown in the next cell.

final_df["AWND"] = final_df["AWND"].astype("float64")
final_df.head()
DATEAWNDPRCPTAVG
01987-02-015.81.3652.7
11987-03-016.32.3955.6
21987-04-016.90.0766.6
31987-05-017.40.8771.8
41987-06-017.40.0178.4

Checking autocorrelation of time dependent variables

The next step will determine if the time series sequence is autocorrelated. To ensure that our time series data can be modeled well, the strength of correlation of the variable with its past data must be estimated.

variables = ["AWND", "PRCP", "TAVG"]
for variable in variables:
    plt.figure(figsize=(20, 2))
    autocorr = aplot(final_df[variable], color="blue")
    plt.title(variable)
<Figure size 1440x144 with 1 Axes><Figure size 1440x144 with 1 Axes><Figure size 1440x144 with 1 Axes>

The plots are showing a significant correlation of the data with its immediate time-lagged terms, and that it gradually decreases over time as the lag increases.

Creating dataset for prediction

Here, in the original dataset, the variable predict column of Average Temperature (TAVG) is populated with NaNs for the forecasting period of 2014-2015. This format is required for the model.predict() function in time series analysis, which will fill up the NaN values with forecasted temperatures.

predict_df = final_df.copy()
predict_df.loc[predict_df["DATE"] > "2013-12-01", "TAVG"] = None
predict_df.tail()
DATEAWNDPRCPTAVG
3412015-07-018.10.43NaN
3422015-08-017.60.00NaN
3432015-09-015.80.12NaN
3442015-10-014.70.49NaN
3452015-11-013.61.74NaN

Train - Test split of time series dataset

Out of these 27 years(1987-2015), 25 years of data is used for training the model, with the remaining 23 months (2014-2015) being used for forecasting and validation. As we are splitting timeseries data, we set shuffle=False to keep the sequence intact and we set a test size of 23 months for validation.

test_size = 23
train, test = train_test_split(final_df, test_size=test_size, shuffle=False)
train
DATEAWNDPRCPTAVG
01987-02-015.81.3652.7
11987-03-016.32.3955.6
21987-04-016.90.0766.6
31987-05-017.40.8771.8
41987-06-017.40.0178.4
...............
3182013-08-017.20.0083.0
3192013-09-016.50.0177.8
3202013-10-013.40.0366.6
3212013-11-012.50.5458.5
3222013-12-012.20.1547.4

323 rows × 4 columns

Time series model building

After the train and test sets are created, the training set is ready for modeling.

Data preprocessing

In this example, the dataset contains 'AWND' (Windspeed), 'PRCP' (Precipitation), and 'TAVG' (Average Air temperature) as time-dependent variables leading to a multivariate time series analysis at a monthly time scale. These variables are used to forecast the next 23 months of air temperature for the months after the last date in the training data, or, in other words, these multiple explanatory variables are used to predict the future values of the dependent air temperature variable.

Once the variables are identified, the preprocessing of the data is performed by the prepare_tabulardata method from the arcgis.learn module in the ArcGIS API for Python. This function takes either a non-spatial data frame, a feature layer, or a spatial data frame containing the dataset as input and returns a TabularDataObject that can be fed into the model. By default, prepare_tabulardata scales/normalizes the numerical columns in a dataset using StandardScaler. The primary input parameters required for the tool are:

  • input_features : Takes the spatially enabled dataframe as a feature layer in this model
  • variable_predict : The field name of the forecasting variable
  • explanatory_variables : A list of the field names that are used as time-dependent variables in multivariate time series
  • index_field : The field name containing the timestamp that will be used as the index field for the data and to visualize values on the x-axis in the time series
data = prepare_tabulardata(
    train,
    variable_predict="TAVG",
    explanatory_variables=["AWND", "PRCP"],
    index_field="DATE",
    seed=42,
)
C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\arcgis\learn\_utils\tabular_data.py:936: UserWarning:

Dataframe is not spatial, Rasters and distance layers will not work

# Visualize the entire timeseries data
data.show_batch(graph=True)
<Figure size 1800x1080 with 3 Axes>
# Here sequence length is used as 12 which also indicates the seasonality of the data
seq_len = 12

Next, we visualize the timeseries in batches. Here, we will pass the sequence length as the batch length.

data.show_batch(rows=4, seq_len=seq_len)
<Figure size 720x720 with 16 Axes>

Model initialization

This is an important step for fitting a time series model. Here, along with the input dataset, the backbone for training the model and the sequence length are passed as parameters. Out of these three, the sequence length has to be selected carefully. The sequence length is usually the cycle of the data, which in this case is 12, as it is monthly data and the pattern repeats after 12 months. In model initialization, the data and the backbone are selected from the available set of InceptionTime, ResCNN, Resnet, and FCN.

tsmodel = TimeSeriesModel(data, seq_len=seq_len, model_arch="ResCNN")

Here, we find the optimal learning rate for training the model.

lr_rate = tsmodel.lr_find()
<Figure size 432x288 with 1 Axes>

Model training

The model is now ready for training. To train the model, the model.fit method is used and is provided with the number of epochs for training and the learning rate suggested above as parameters:

tsmodel.fit(100, lr=lr_rate)
epochtrain_lossvalid_losstime
00.9603410.32799700:00
10.9059860.39862000:00
20.8546290.49577800:00
30.8033050.60120400:00
40.7509180.71452200:00
50.6944050.83725900:00
60.6296380.89932100:00
70.5640600.79925900:00
80.4981040.51664700:00
90.4369800.27004400:00
100.3825550.12783000:00
110.3356520.07208600:00
120.2965070.01103000:00
130.2635540.00762500:00
140.2354440.00772500:00
150.2106860.01655900:00
160.1892550.00892200:00
170.1703060.00435000:00
180.1538410.01046400:00
190.1393770.01520900:00
200.1265390.00429200:00
210.1151180.00525200:00
220.1048590.00286900:00
230.0956380.00262500:00
240.0873010.00648500:00
250.0800270.00946900:00
260.0733340.00920400:00
270.0673240.00501400:00
280.0619310.00377900:00
290.0569230.00323900:00
300.0523200.00374800:00
310.0480770.00260000:00
320.0442020.00469600:00
330.0407640.00503300:00
340.0376070.00180900:00
350.0345910.00226800:00
360.0318920.00268300:00
370.0293920.00204600:00
380.0270790.00301400:00
390.0250310.00223300:00
400.0232000.00328600:00
410.0216190.00403800:00
420.0200440.00286600:00
430.0185760.00253800:00
440.0172080.00211300:00
450.0159480.00303800:00
460.0148010.00166600:00
470.0137400.00564000:00
480.0128540.00194800:00
490.0120110.00348200:00
500.0112000.00185100:00
510.0104430.00367800:00
520.0096820.00242900:00
530.0090510.00211000:00
540.0084520.00250100:00
550.0079540.00240700:00
560.0073820.00186100:00
570.0069230.00215200:00
580.0064640.00234200:00
590.0061010.00179400:00
600.0057270.00193300:00
610.0055910.00304400:00
620.0052650.00406900:00
630.0051860.00265900:00
640.0049080.00207000:00
650.0047280.00214200:00
660.0044270.00173700:00
670.0041430.00205700:00
680.0038770.00193900:00
690.0036090.00180400:00
700.0034000.00177000:00
710.0031870.00189700:00
720.0030730.00173900:00
730.0028660.00197800:00
740.0026820.00172900:00
750.0025230.00181100:00
760.0023540.00193500:00
770.0022530.00169000:00
780.0021080.00189000:00
790.0019600.00203000:00
800.0018260.00183800:00
810.0017340.00176900:00
820.0017040.00184400:00
830.0016450.00185200:00
840.0015910.00181600:00
850.0015800.00171800:00
860.0015040.00171800:00
870.0014060.00178200:00
880.0015020.00179700:00
890.0014130.00175500:00
900.0013780.00174500:00
910.0013110.00172600:00
920.0012370.00172700:00
930.0011710.00174200:00
940.0011000.00173400:00
950.0010560.00172200:00
960.0009980.00172700:00
970.0009680.00172900:00
980.0010040.00173100:00
990.0009390.00172600:00

To check the quality of the trained model and whether the model needs more training, we generate a train vs validation loss plot below:

tsmodel.plot_losses()
<Figure size 432x288 with 1 Axes>

Next, the predicted values of the model and the actual values are printed for the training dataset.

tsmodel.show_results(rows=5)
<Figure size 720x720 with 10 Axes>

Air temperature forecast & validation

Forecasting using the trained TimeSeriesModel

During forecasting, the model uses the dataset prepared above with NaN values as input, with the prediction_type set as dataframe.

# Checking the input dataset
predict_df.tail(23)
DATEAWNDPRCPTAVG
3232014-01-012.50.57NaN
3242014-02-014.92.11NaN
3252014-03-015.80.62NaN
3262014-04-016.90.74NaN
3272014-05-018.90.04NaN
3282014-06-018.10.00NaN
3292014-07-017.40.01NaN
3302014-08-016.70.00NaN
3312014-09-016.30.18NaN
3322014-10-014.50.50NaN
3332014-11-013.10.40NaN
3342014-12-014.02.30NaN
3352015-01-012.20.21NaN
3362015-02-013.81.13NaN
3372015-03-015.40.06NaN
3382015-04-016.91.25NaN
3392015-05-017.60.57NaN
3402015-06-017.60.01NaN
3412015-07-018.10.43NaN
3422015-08-017.60.00NaN
3432015-09-015.80.12NaN
3442015-10-014.70.49NaN
3452015-11-013.61.74NaN
df_forecasted = tsmodel.predict(predict_df, prediction_type="dataframe")
# Final forecasted result returned by the model
df_forecasted
DATEAWNDPRCPTAVGTAVG_results
01987-02-015.81.3652.752.700000
11987-03-016.32.3955.655.600000
21987-04-016.90.0766.666.600000
31987-05-017.40.8771.871.800000
41987-06-017.40.0178.478.400000
..................
3412015-07-018.10.43NaN84.969053
3422015-08-017.60.00NaN81.919138
3432015-09-015.80.12NaN77.078820
3442015-10-014.70.49NaN69.666935
3452015-11-013.61.74NaN58.852926

346 rows × 5 columns

Next, we format the results into actual vs predicted columns.

result_df = pd.DataFrame()
result_df["DATE"] = test["DATE"]
result_df["Airtemp_actual"] = test["TAVG"]
result_df["Airtemp_predicted"] = df_forecasted["TAVG_results"][-23:]
result_df = result_df.set_index(result_df.columns[0])
result_df
Airtemp_actualAirtemp_predicted
DATE
2014-01-0153.248.777484
2014-02-0156.852.200898
2014-03-0162.359.264070
2014-04-0166.865.984249
2014-05-0174.272.340793
2014-06-0180.977.236778
2014-07-0186.984.582398
2014-08-0184.481.742578
2014-09-0180.776.948140
2014-10-0172.067.554652
2014-11-0157.758.481853
2014-12-0151.947.955960
2015-01-0149.047.175748
2015-02-0157.050.230618
2015-03-0164.056.937533
2015-04-0164.364.488977
2015-05-0168.571.976515
2015-06-0181.978.548443
2015-07-0183.184.969053
2015-08-0182.481.919138
2015-09-0178.777.078820
2015-10-0171.369.666935
2015-11-0152.058.852926

Estimate model metrics for validation

The accuracy of the forecasted values is measured by comparing the forecasted values against the actual values for the 23 months chosen for testing.

r2 = r2_score(result_df["Airtemp_actual"], result_df["Airtemp_predicted"])
mse = metrics.mean_squared_error(
    result_df["Airtemp_actual"], result_df["Airtemp_predicted"]
)
rmse = metrics.mean_absolute_error(
    result_df["Airtemp_actual"], result_df["Airtemp_predicted"]
)
print(
    "RMSE:     ",
    round(np.sqrt(mse), 4),
    "\n" "MAE:      ",
    round(rmse, 4),
    "\n" "R-Square: ",
    round(r2, 2),
)
RMSE:      3.661 
MAE:       3.1054 
R-Square:  0.91

A considerably high r-square value of .91 indicates a high similarity between the forecasted values and the actual values. Furthermore, the RMSE error of 3.661 is quite low, indicating a good fit by the model.

Result visualization

Finally, the actual and forecasted values are plotted to visualize their distribution over the validation period, with the orange line representing the forecasted values and the blue line representing the actual values.

plt.figure(figsize=(20, 5))
plt.plot(result_df)
plt.ylabel("Air Temperature")
plt.legend(result_df.columns.values, loc="upper right")
plt.title("Forecasted Air Temperature")
plt.show()
<Figure size 1440x360 with 1 Axes>

Conclusion

The study conducted a multivariate time series analysis using the Deep learning TimeSeriesModel from the arcgis.learn library and forecasted the monthly Air temperature for a station in California. The model was trained with 25 years of data (1987-2013) that was used to forecast a period of 2 years (2014-2015) with high accuracy. The independent variables were wind speed and precipitation. The methodology included preparing a times series dataset using the prepare_tabulardata() method, followed by modeling, predicting, and validating the test dataset. Usually, time series modeling requires fine-tuning several hyperparameters for properly fitting the data, most of which has been internalized in this Model, leaving the user responsible for configuring only a few significant parameters, like the sequence length.

Summary of methods used

MethodDescriptionExamples
prepare_tabulardataprepare data including imputation, scaling and train-test splitprepare data ready for fitting a Timeseries Model
model.lr_find()finds an optimal learning ratefinalize a good learning rate for training the Timeseries model
TimeSeriesModel()Model Initialization by selecting the TimeSeriesModel algorithm to be used for fittingSelected Timeseries algorithm from Fastai time series regression can be used
model.fit()trains a model with epochs & learning rate as inputtraining the Timeseries model with suitable input
model.predict()predicts on a test setforecast values using the trained models on the test input

References

Data resources

DatasetSourceLink
Global Summary of the MonthNOAA Climate Data Onlinehttps://www.ncdc.noaa.gov/cdo-web/search

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.