# Forecasting Air Temperature in California using ResCNN model

## Introduction

A rise in air temperature is directly correlated with global warming and changing climatic conditions, and air temperature is one of the main factors in predicting other meteorological variables, like streamflow, evapotranspiration, and solar radiation. As such, accurate forecasting of this variable is vital to mitigating environmental and economic damage. Including the dependency of air temperature on other variables, like wind speed or precipitation, helps derive more precise predictions. In this study, the deep learning TimeSeriesModel from arcgis.learn is used to predict monthly air temperature for two years at a ground station at the Fresno Yosemite International Airport in California, USA. The dataset ranges from 1948 to 2015, and data from January 2014 to November 2015 is used to validate the quality of the forecast.

Univariate time series modeling is one of the more popular applications of time series analysis. This study involves multivariate time series analysis, which is more complex, as the dataset contains more than one time-dependent variable. The TimeSeriesModel from arcgis.learn includes backbones, such as InceptionTime, ResCNN, ResNet, and FCN, that do not require fine-tuning of multiple hyperparameters before fitting the model.

## Importing libraries

```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot as aplot
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
from arcgis.gis import GIS
from arcgis.learn import TimeSeriesModel, prepare_tabulardata
from arcgis.features import FeatureLayer, FeatureLayerCollection
```

## Connecting to your GIS

```
gis = GIS('home')
```

## Accessing & visualizing the dataset

The data used in this sample study is a multivariate monthly time series dataset recorded at a ground station in the Fresno Yosemite International Airport, California, USA. It ranges from January 1948 to November 2015.

```
# Location of the ground station
location = gis.map(location="Fresno Yosemite International California", zoomlevel=12)
location
```

```
# Access the data table
data_table = gis.content.get("8c58e808aabd40408f7bc4eeac64fffb")
data_table
```

```
# Visualize as pandas dataframe
climate_data = data_table.tables[0]
climate_df = climate_data.query().sdf
climate_df.head()
```

The dataframe above contains columns for station ID (STATION), station name (NAME), date (DATE), wind speed (AWND), precipitation (PRCP), possible sunshine (PSUN), snow cover (SNOW), average temperature (TAVG), maximum temperature (TMAX), minimum temperature (TMIN), total sunshine (TSUN), and peak wind gust speed (WSFG).

```
climate_df.shape
```

Next, the dataset is prepared by dropping the variables for station, possible sunshine, snow cover, maximum temperature, minimum temperature, total sunshine, and peak wind gust speed. Then, the dataset is narrowed to the data from 1987 on, to avoid missing values.
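The 1987 cutoff can be verified by checking where the missing values occur. Below is a minimal sketch of that kind of check on a toy dataframe (synthetic values mimicking the layout, not the station data):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout: a predictor with NaNs in the earliest months
df = pd.DataFrame({
    "DATE": pd.date_range("1985-01-01", periods=6, freq="MS").astype(str),
    "TAVG": [55.0, 54.0, 60.0, 62.0, 58.0, 57.0],
    "AWND": [np.nan, np.nan, 5.1, 4.8, 5.0, 5.2],
})

print(df.isna().sum())         # NaN count per column
complete = df.dropna()
print(complete["DATE"].min())  # first date with complete records
```

Filtering with `df[df.DATE > cutoff]`, using the first complete date as the cutoff, then yields a gap-free series, which is the same idea applied to 1987 in this study.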

```
climate_df = climate_df.drop(
    ["ObjectId", "STATION", "NAME", "PSUN", "SNOW", "TSUN", "TMAX", "TMIN", "WSFG"],
    axis=1,
)
```

```
climate_df.columns
```

```
# Selecting data from 1987 onward to get continuous data without NaN values
selected_df = climate_df[climate_df.DATE > "1987"]
selected_df.head()
```

Here, **TAVG** is our variable to be predicted, with **PRCP** and **AWND** being the predictors used, showing their influence on temperature.

```
selected_df.shape
```

## Time series data preprocessing

The preprocessing of the data for multivariate time series modeling involves the following steps:

### Converting into time series format

The dataset is now transformed into a time series data format by creating a new index that will be used by the model for processing the sequential data.

```
final_df = selected_df.reset_index()
final_df = final_df.drop("index", axis=1)
final_df.head()
```

### Data types of time series variables

Here we check the data types of the variables.

```
final_df.info()
```

The time-dependent variables should be of type float. If a time-dependent variable is not of a float data type, it needs to be converted. Here, wind speed (AWND) is converted from the object dtype to float64, as shown in the next cell.

```
final_df["AWND"] = final_df["AWND"].astype("float64")
final_df.head()
```

### Checking autocorrelation of time-dependent variables

The next step determines whether the time series sequence is autocorrelated. To ensure that our time series data can be modeled well, the strength of the correlation of each variable with its own past values must be estimated.

```
variables = ["AWND", "PRCP", "TAVG"]
for variable in variables:
    plt.figure(figsize=(20, 2))
    autocorr = aplot(final_df[variable], color="blue")
    plt.title(variable)
```

The plots show a significant correlation of the data with its immediate time-lagged terms, one that gradually decreases as the lag increases.
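The same pattern can be checked numerically with pandas' `Series.autocorr`, which returns the lag-k correlation that the plots are drawing. A self-contained sketch on a synthetic monthly series with a 12-month cycle (not the station data):

```python
import numpy as np
import pandas as pd

# 20 years of synthetic monthly values: a 12-month sine cycle plus noise
rng = np.random.default_rng(0)
months = np.arange(240)
series = pd.Series(10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, 240))

lag1 = series.autocorr(lag=1)    # correlation with the previous month
lag12 = series.autocorr(lag=12)  # correlation one full cycle back
print(round(lag1, 2), round(lag12, 2))
```

Both lags come out strongly positive for a seasonal series like this one, mirroring the behavior seen in the plots above.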

### Creating dataset for prediction

Here, in the original dataset, the column of the variable to predict, average temperature (TAVG), is populated with NaNs for the 2014-2015 forecasting period. This format is required by the `model.predict()` function in time series analysis, which fills in the NaN values with forecasted temperatures.

```
predict_df = final_df.copy()
predict_df.loc[predict_df["DATE"] > "2013-12-01", "TAVG"] = None
predict_df.tail()
```

### Train-test split of the time series dataset

Of this dataset (1987-2015), the first 27 years (1987-2013) are used for training the model, with the remaining 23 months (2014-2015) used for forecasting and validation. As we are splitting time series data, we set shuffle=False to keep the sequence intact, and we set a test size of 23 months for validation.

```
test_size = 23
train, test = train_test_split(final_df, test_size=test_size, shuffle=False)
```

```
train
```

## Time series model building

After the train and test sets are created, the training set is ready for modeling.

### Data preprocessing

In this example, the dataset contains 'AWND' (wind speed), 'PRCP' (precipitation), and 'TAVG' (average air temperature) as time-dependent variables, leading to a multivariate time series analysis at a monthly time scale. These variables are used to forecast air temperature for the 23 months after the last date in the training data; in other words, these explanatory variables are used to predict the future values of the dependent air temperature variable.

Once the variables are identified, the data is preprocessed by the `prepare_tabulardata` method from the `arcgis.learn` module in the ArcGIS API for Python. This method takes either a non-spatial dataframe, a feature layer, or a spatial dataframe containing the dataset as input and returns a TabularDataObject that can be fed into the model. By default, `prepare_tabulardata` scales/normalizes the numerical columns in a dataset using StandardScaler.

The primary input parameters required for the method are:

- input_features: the dataframe or feature layer containing the dataset; here, the training dataframe
- variable_predict: the field name of the variable to forecast
- explanatory_variables: a list of the field names used as time-dependent explanatory variables in the multivariate time series
- index_field: the field name containing the timestamp, used as the index for the data and to label the x-axis of the time series plots

```
data = prepare_tabulardata(
    train,
    variable_predict="TAVG",
    explanatory_variables=["AWND", "PRCP"],
    index_field="DATE",
    seed=42,
)
```

```
# Visualize the entire timeseries data
data.show_batch(graph=True)
```

```
# A sequence length of 12 is used, which also matches the seasonality of the data
seq_len = 12
```

Next, we visualize the timeseries in batches. Here, we will pass the sequence length as the batch length.

```
data.show_batch(rows=4, seq_len=seq_len)
```

### Model initialization

This is an important step for fitting a time series model. Here, along with the input dataset, the backbone for training the model and the sequence length are passed as parameters. Of these, the sequence length must be selected carefully: it is usually the cycle of the data, which in this case is 12, as the pattern of this monthly data repeats every 12 months. In model initialization, the data and the backbone are supplied, with the backbone selected from the available set of InceptionTime, ResCNN, ResNet, and FCN.

```
tsmodel = TimeSeriesModel(data, seq_len=seq_len, model_arch="ResCNN")
```

### Learning rate search

Here, we find the optimal learning rate for training the model.

```
lr_rate = tsmodel.lr_find()
```

### Model training

The model is now ready for training. To train the model, the `model.fit` method is used, provided with the number of epochs and the learning rate suggested above as parameters:

```
tsmodel.fit(100, lr=lr_rate)
```

To check the quality of the trained model and whether the model needs more training, we generate a train vs validation loss plot below:

```
tsmodel.plot_losses()
```

Next, the predicted values of the model and the actual values are printed for the training dataset.

```
tsmodel.show_results(rows=5)
```

## Air temperature forecast & validation

### Forecasting using the trained TimeSeriesModel

During forecasting, the model uses the dataset prepared above with NaN values as input, with `prediction_type` set to `dataframe`.

```
# Checking the input dataset
predict_df.tail(23)
```

```
df_forecasted = tsmodel.predict(predict_df, prediction_type="dataframe")
```

```
# Final forecasted result returned by the model
df_forecasted
```

Next, we format the results into actual vs predicted columns.

```
result_df = pd.DataFrame()
result_df["DATE"] = test["DATE"]
result_df["Airtemp_actual"] = test["TAVG"]
result_df["Airtemp_predicted"] = df_forecasted["TAVG_results"][-23:]
result_df = result_df.set_index(result_df.columns[0])
result_df
```

### Estimate model metrics for validation

The accuracy of the forecasted values is measured by comparing the forecasted values against the actual values for the 23 months chosen for testing.

```
r2 = r2_score(result_df["Airtemp_actual"], result_df["Airtemp_predicted"])
mse = metrics.mean_squared_error(
    result_df["Airtemp_actual"], result_df["Airtemp_predicted"]
)
mae = metrics.mean_absolute_error(
    result_df["Airtemp_actual"], result_df["Airtemp_predicted"]
)
print("RMSE: ", round(np.sqrt(mse), 4))
print("MAE: ", round(mae, 4))
print("R-Square: ", round(r2, 2))
```

A high R-squared value of 0.91 indicates close agreement between the forecasted and actual values, and the RMSE of 3.661 is relatively low, indicating a good fit by the model.
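As a quick sanity check on how these metrics relate, RMSE is the square root of MSE, and R-squared compares the residual sum of squares to the total variance. A tiny sketch with made-up numbers (not the station data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

actual = np.array([60.0, 65.0, 70.0, 75.0])
predicted = np.array([62.0, 64.0, 71.0, 73.0])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(actual, predicted))

# R^2 = 1 - SS_res / SS_tot
ss_res = ((actual - predicted) ** 2).sum()
ss_tot = ((actual - actual.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

print(round(rmse, 4), round(r2_manual, 2), round(r2_score(actual, predicted), 2))
```

The manual R-squared matches sklearn's `r2_score`, confirming the interpretation used above.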

## Result visualization

Finally, the actual and forecasted values are plotted to visualize their distribution over the validation period, with the orange line representing the forecasted values and the blue line representing the actual values.

```
plt.figure(figsize=(20, 5))
plt.plot(result_df)
plt.ylabel("Air Temperature")
plt.legend(result_df.columns.values, loc="upper right")
plt.title("Forecasted Air Temperature")
plt.show()
```

## Conclusion

This study conducted a multivariate time series analysis using the deep learning TimeSeriesModel from the arcgis.learn library and forecasted monthly air temperature for a station in California. The model was trained with 27 years of data (1987-2013) and used to forecast a period of nearly two years (2014-2015) with high accuracy. The explanatory variables were wind speed and precipitation. The methodology included preparing a time series dataset using the prepare_tabulardata() method, followed by modeling, prediction, and validation on the test dataset. Time series modeling usually requires fine-tuning several hyperparameters to properly fit the data; most of this has been internalized in this model, leaving the user responsible for configuring only a few significant parameters, like the sequence length.

## Summary of methods used

| Method | Description | Examples |
|---|---|---|
| prepare_tabulardata | prepares data, including imputation, scaling, and train-test split | prepare data ready for fitting a TimeSeriesModel |
| model.lr_find() | finds an optimal learning rate | finalize a good learning rate for training the TimeSeriesModel |
| TimeSeriesModel() | model initialization, selecting the TimeSeriesModel algorithm to be used for fitting | selected time series algorithm from Fastai time series regression can be used |
| model.fit() | trains a model with epochs & learning rate as input | training the TimeSeriesModel with suitable input |
| model.predict() | predicts on a test set | forecast values using the trained model on the test input |

## References

Cifuentes, J., et al., 2020. "Air Temperature Forecasting Using Machine Learning Techniques: A Review". https://doi.org/10.3390/en13164215

Xuejie, G., et al., 2001. "Climate change due to greenhouse effects in China as simulated by a regional climate model". https://doi.org/10.1007/s00376-001-0036-y

NOAA, "GSOM/GSOY documentation". https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/gsom-gsoy_documentation.pdf

"Prediction task with Multivariate Time Series and VAR model". https://towardsdatascience.com/prediction-task-with-multivariate-timeseries-and-var-model-47003f629f9

## Data resources

| Dataset | Source | Link |
|---|---|---|
| Global Summary of the Month | NOAA Climate Data Online | https://www.ncdc.noaa.gov/cdo-web/search |