Predicting El Niño–Southern Oscillation through correlation and time series analysis/deep learning
This example uses correlation analysis and time series analysis to predict El Niño–Southern Oscillation (ENSO) based on climate variables and indices. ENSO is an irregular periodical variation in winds and sea surface temperatures over the tropical eastern Pacific Ocean.
from IPython.display import YouTubeVideo YouTubeVideo('d6s0T0m3F8s')
ENSO can have tremendous potential impact such as droughts, floods, and tropical storms. You can look at a few story maps below to get a better understanding of the process.
from arcgis.gis import GIS agol = GIS() story_maps = agol.content.search('What El Niño Means for Southern California') story_maps
Accurate characterization of ENSO is critical for understanding the trends. In climate science, ENSO is characterized through Southern Oscillation Index (SOI), a standardized index based on the observed sea level pressure differences between Tahiti and Darwin, Australia. If SOI exhibits warm (greater than 0.5) or cool phase conditions for at least five consecutive values, it officially becomes an El Niño or La Niña event. Therefore, predicting SOI is the first step of ENSO forecasting. This notebooks consists of 3 sections: (1) Data exploration (2) correlation analysis; (3) Time series analysis
The following four variables were used in this examplea as ENSO is beleived to be related to sea surface temperature, sea level pressure, precipitation, etc.
- Oceanic Nino Index (ONI), a climate index used for sea surface temperature (SST).
- Eastern Tropical Pacific SST (Nino 3), another climate index used for SST focusing on a slightly different region.
- Pacific North American Index (PNA). PNA is a closely related phenomena to ENSO.
- Precipitation monthly mean. Historical global precipitation monthly mean in raster format.
Note: to run this sample, you need a few extra libraries in your conda environment. If you don't have the libraries, install them by running the following commands from cmd.exe or your shell
conda update conda conda install matplotlib conda install scikit-learn conda install -c conda-forge scikit-image conda install -c conda-forge keras
Part 1. Data exploration
The first three variables along with SOI has been put into a .CSV file. To give you an overview, let's read it and look at the first few lines.
%matplotlib inline import os import numpy as np import pandas as pd from pandas import read_csv from pandas import datetime from matplotlib import pyplot def parser(x): if x.endswith('11') or x.endswith('12')or x.endswith('10'): return datetime.strptime(x, '%Y%m') else: return datetime.strptime(x, '%Y0%m') enso_original_path = os.path.join('.', 'data', 'enso_original.csv') df = read_csv(enso_original_path, header=0, parse_dates=, index_col=0, date_parser=parser) start = 336 df = df.iloc[start:] df = (df - df.mean()) / df.std() print(df.head())
soi oni nino3 pna date 1979-01-01 -0.441750 -0.059963 -0.150376 -1.537109 1979-02-01 0.997371 0.056451 -0.271512 -2.725606 1979-03-01 0.072222 0.172865 -0.139364 0.080846 1979-04-01 -0.133367 0.289280 0.213033 -0.148864 1979-05-01 0.483399 0.172865 0.036835 1.269344
Let us examine how SOI changes over time.
pyplot.figure(figsize=(15,5)) pyplot.plot(df.soi) pyplot.title('How SOI changes over Time') pyplot.xlabel('Time')
Text(0.5, 0, 'Time')
Let us put all four variables together and see how they are distributed or if there is any visual relationship.
i = 1 fig = pyplot.figure(figsize=(15,10)) for col in df.columns.tolist(): fig.add_subplot(len(df.columns.tolist()), 1, i) df[col].plot() pyplot.title(col, y=0.8, loc='right') if i != len(df.columns.tolist()): pyplot.tick_params( axis='x', # changes apply to the x-axis which='both', # both major and minor ticks are affected bottom=False, # ticks along the bottom edge are off top=False, # ticks along the top edge are off labelbottom=False) # labels along the bottom edge are off pyplot.xlabel('') i += 1 pyplot.show()
As we can see, there is some relationship between the input variables (i.e., oni, nino3, pna) and output variable, soi. For example, oni and soi look highly negatively correlated. We will cover how to model SOI using these variables in the time series analysis, but first let's look at how to bring in the last variable - precipitation.
Part 2. Correlation analysis
Global precipitation monthly mean is avaialble in raster format on ArcGIS online. In this part, we will cover how to indentify the most correlated (on time dimension) grid cell through lagged correlation analysis. There are two rationals behind this.
- In climate science, a phenomenon that is happening now could be a result of what has happend in the past. In other words, SOI of this month could be most correlated with the ONI of last month or the month before depending on the actual physical process.
- ENOS is a type of teleconnection which refers to climate anomalies being related to each other at large distances.
First, let us search and acess the precipitation data from ArcGIS Online.
data = agol.content.search('Global Monthly Precipitation 1979-2017') data
data_item = agol.content.get(data.id)
Once the data is download to local, we can read it and visualize it through skimage and matplotlib to get a sense of what the data looks like. Because the spatial resolution is 2.5 by 2.5 degrees, there are 72 rows and 144 columns. The third dimension is the time dimension as the data contains monthly mean from 1970.1 to 2017.10, which is 466 months in total.
from skimage import io import math from scipy.stats.stats import pearsonr from numpy import unravel_index precipitation_path = os.path.join('.', 'data', 'precipitation.tif') precip_full = io.imread(precipitation_path) precip_full = np.flip(precip_full, 0) # print the dimension of the precip_full, (lat, lon, time) print(precip_full.shape) # let's print out the first month/band of this data pyplot.figure(figsize=(15,6)) pyplot.imshow(precip_full[:,:,0], extent=[-180, +180, -90, 90])
(72, 144, 466)
<matplotlib.image.AxesImage at 0x1c2c385be0>
Before calculating the correlation, let's transform the precipitation monthly mean to precipitation anomaly, which a usually provides better signal than absolute precipitation. For short, anomaly means how much precipitation departs from what is normal for that time of year at a specific location. Let's say, we would like to know the anomaly of a specific grid in October this year. We need to compute the historial mean of October precipitation and then caculate its difference with value of this October.
# calculate the historical mean for each month a = np.zeros(shape = (precip_full.shape, precip_full.shape, math.ceil(precip_full.shape/12)*12)) a[:,:,0:precip_full.shape] = precip_full a = a.reshape(a.shape, a.shape, int(a.shape/12), 12) monthly_sum = np.sum(a, axis=2) montly_mean_1_10 = monthly_sum[:,:,0:10]/a.shape montly_mean_11_12 = monthly_sum[:,:,-2:]/(a.shape - 1) monthly_mean = np.append(montly_mean_1_10, montly_mean_11_12, axis=2) # calculate the difference for i in range(a.shape): a[:,:,i,:] -=monthly_mean # reshape the data back to its orginial dimension a = a.reshape(a.shape, a.shape, a.shape*a.shape) # visualize the first month anomaly pyplot.figure(figsize=(15,6)) pyplot.imshow(a[:,:,0], extent=[-180, +180, -90, 90])
<matplotlib.image.AxesImage at 0x1c33362940>
Now we have the anomaly. Let us calculate Pearson correlation coefficient between SOI and precipitation anomaly across all the grids based on a range of time lag from 0 to 5. The goal is the find the grid that has the greatest absolute Pearson correlation.
lag = 5 fig, axs = pyplot.subplots(nrows=3, ncols=2, figsize=(14, 12)) fig.subplots_adjust(left=0.03, right=0.97, hspace=0.4, wspace=0.4) for t in range(lag+1): soi = df.values[t:,0] soi = soi.reshape(soi.shape, 1) precip = a[:,:,0:-4-t] r2 =  for i in range(precip.shape): for j in range(precip.shape): r2_index = pearsonr(soi, precip[i,j,:].reshape(precip.shape, 1)) r2.append(r2_index) r2_map = np.array(r2).reshape(precip.shape, precip.shape) max_index = unravel_index(r2_map.argmax(), r2_map.shape) im = axs.flat[t-1].imshow(np.abs(r2_map), extent=[-180, +180, -90, 90]) fig.colorbar(im, ax = axs[t//2, t%2]) # we only care only absolute correlation in this case r2_map = np.abs(r2_map) max_index = unravel_index(r2_map.argmax(), r2_map.shape) axs.flat[t-1].set_title('lag=' + str(t) + '\n' + 'max_index=' + str(max_index) + '\n' + 'max_absolute_correlation=' + str(float("%.3f" % r2_map[max_index])))
As can be seen, the largest correlation results from lag=0 at (row, column) = (34, 74). We will then take the precipitation anomaly of this grid and combine it with the other variables to predict SOI in another notebook.
Part 3. Time Series Analysis/LSTM
The third part of ENSO prediction focuses on time seris analysis. Here our SOI time series prediction problem is formulated as a regression problem and the idea is to use prior time steps to predict the next time step(s). Specifically, this analysis consists of three sections:
- Data cleanup and transformation: transform the raw data into format that supervised machine learning algorithms can take, and split out training and test datasets
- Model training: train a Long short-term memory (LSTM) neural network.
- Prediction and evaluation: evaluate the models on test dataset and plot the results.
Data cleanup and transformation
Time series data can be phrased as supervised learning if we think of by previous time steps as input variables and the next time step(s) as the output variable. Before we do any transformationm, let's first import all the necessary packages and take a look at what our data looks like.
%matplotlib inline import os.path import warnings import numpy as np from math import sqrt from matplotlib import pyplot from pandas import read_csv from pandas import DataFrame from pandas import concat from pandas import datetime from sklearn.metrics import mean_squared_error from keras.models import Sequential from keras.layers import Dense from keras.layers import LSTM from keras.models import load_model warnings.filterwarnings('ignore') def parser(x): if x.endswith('11') or x.endswith('12')or x.endswith('10'): return datetime.strptime(x, '%Y%m') else: return datetime.strptime(x, '%Y0%m') enso_ready_path = os.path.join('.', 'data', 'enso_ready.csv') df = read_csv(enso_ready_path, header=0, parse_dates=, index_col=0, date_parser=parser) df.head()
Using TensorFlow backend.
As precipitation data is not available until 1979, let's remove the first few rows, standardize each column by calcuting the z-score, and move soi (input variable) to the last column of the table.
start = 336 df = df.iloc[start:] df = (df - df.mean()) / df.std() cols = df.columns.tolist() cols = cols[1:] + cols[:1] df = df[cols] df.head()
Most supervised learning algorithms require all input variables to be on the same row with its corresponding output variable. In SOI prediction, the goal is to use the variables (i.e., oni, nino3, pna, precip, and soi) of the previous time steps (e.g. 12) to predict the SOI of the next time steps (e.g. 3). Formally, the use of prior time steps to predict the next time step is called the sliding window approach (aka window or lag method) in time series analysis/prediction. Therefore, let's define a method that transforms panda time series into format that supervised learning algorithms can take.
""" This method takes a time series and returns transformed data. data: time series as pandas dataframe n_in: number of previous time steps as input (X) n_out: number of next time steps as output (y) dropnan: whether or not to drop rows with NaN values """ def series_to_supervised(data, n_in=1, n_out=1, dropnan=True): n_vars = 1 if type(data) is list else data.shape df = DataFrame(data) cols, names = list(), list() # input sequence (t-n, ... t-1) for i in range(n_in, 0, -1): cols.append(df.shift(i)) names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)] # forecast sequence (t, t+1, ... t+n) for i in range(0, n_out): cols.append(df.shift(-i).iloc[:,-1]) if i == 0: names += ['VAR(t)'] else: names += ['VAR(t+%d)' % i] # put it all together agg = concat(cols, axis=1) agg.columns = names # drop rows with NaN values if dropnan: agg.dropna(inplace=True) return agg
Note that after transformation, there are some edges cases where the input or output variable could be NaN. For example, as data before 1979 is not available, there is no way to form a row for SOI of 1979. That's why there is a dropnan option is the method.
# specify the size of our sliding window and number of features enso = df.values.astype('float32') lag = 12 ahead = 3 n_features = 1 reframed = series_to_supervised(enso, lag, ahead) reframed.head()
5 rows × 63 columns
Now we have the data, let's define a method that trains a LSTM model. For the purpose of simplicity, we define a two layer neural network with one LSTM layer and one dense layer.
""" This method takes training data and returns a LSTM model train: training data n_lag: number of previous time steps n_ahead: number of next time steps nb_epoch: number of epochs n_neurons: number of n_neurons in the first layer """ def fit_lstm(train, n_lag, n_ahead, n_batch, nb_epoch, n_neurons): # reshape training into [samples, timesteps, features] X, y = train[:, :-n_ahead], train[:, -n_ahead:] X = X.reshape(X.shape, n_lag, int(X.shape/n_lag)) # design neural network architecture. This is a simple LSTM just for demo purpose model = Sequential() model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape, X.shape), stateful=True)) model.add(Dense(n_ahead)) model.compile(loss='mean_squared_error', optimizer='adam') # fit the NN for i in range(nb_epoch): model.fit(X, y, epochs=1, batch_size=n_batch, verbose=2, shuffle=False) model.reset_states() return model
Let's split our transformed data into training and test sets and feed the model training method. The first 80 precent will be used for training purpose and the last 20 percent will be using as test set. Note that in time sereis analysis, we don't do random shuffle because it's important to preserve time dependency/order.
values = reframed.values n_train = int(len(values) * 0.8) train = values[:n_train, :] test = values[n_train:, :] # fit a LSTM model with the transformed data # model fitting can be very time-consuming, therefore a pre-trained model is included in the data folder model_path = os.path.join('.', 'data', 'my_model.h5') if not os.path.exists(model_path): model = fit_lstm(train, lag, ahead, 1, 30, 30) model.save(model_path) else: model = load_model(model_path)
WARNING:tensorflow:From /anaconda3/envs/geosaurus/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. WARNING:tensorflow:From /anaconda3/envs/geosaurus/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
_________________________________________________________________ Layer (type) Output Shape Param # ================================================================= lstm_4 (LSTM) (1, 30) 4320 _________________________________________________________________ dense_4 (Dense) (1, 3) 93 ================================================================= Total params: 4,413 Trainable params: 4,413 Non-trainable params: 0 _________________________________________________________________
Prediction and evaluation
Now let's apply the model to the test set and evaluate the accuracy for each of those three next time steps.
# predict the SOI values for next three time steps given a single input sample def forecast_lstm(model, X, n_batch, n_lag): # reshape input pattern to [samples, timesteps, features] X = X.reshape(1, n_lag, int(len(X)/n_lag)) # make forecast forecast = model.predict(X, batch_size=n_batch) # convert to array return [x for x in forecast[0, :]] # make prediciton for a list of input samples def make_forecasts(model, n_batch, train, test, n_lag, n_ahead): forecasts = list() for i in range(len(test)): X = test[i, :-n_ahead] # make forecast forecast = forecast_lstm(model, X, n_batch, n_lag) # store the forecast forecasts.append(forecast) return forecasts
forecasts = make_forecasts(model, 1, train, test, lag, ahead) # pring out the output for the first input sample forecasts
[-1.5042609, -1.4436339, -1.52989]
As mentioned in the very beginning, time series prediction is treated as a regression problem in our case, so let's compute mean square error (MSE) for each next time step.
def evaluate_forecasts(y, forecasts, n_lag, n_seq): print('Evaluation results (RMSE) for each next tim step:') for i in range(n_seq): actual = [row[i] for row in y] predicted = [forecast[i] for forecast in forecasts] rmse = sqrt(mean_squared_error(actual, predicted)) print('t+%d time step: %f' % ((i+1), rmse)) # evaluate forecasts actual = [row[-ahead:] for row in test] evaluate_forecasts(actual, forecasts, lag, ahead)
Evaluation results (RMSE) for each next tim step: t+1 time step: 0.795490 t+2 time step: 0.843505 t+3 time step: 0.907717
To to a better understanding of the prediction result, let's plot it out and compare with with the original data.
# plot the forecasts in the context of the original dataset, multiple segments def plot_forecasts(series, forecasts, n_test, xlim, ylim, n_ahead, linestyle = None): pyplot.figure(figsize=(15,8)) if linestyle==None: pyplot.plot(series, label='observed') else: pyplot.plot(series, linestyle, label='observed') pyplot.xlim(xlim, ylim) pyplot.legend(loc='upper right') # plot the forecasts in red for i in range(len(forecasts)): # this ensures not all segements are plotted, it is plotted every n_ahead if i%n_ahead ==0: off_s = len(series) - n_test + 2 + i - 1 off_e = off_s + len(forecasts[i]) + 1 xaxis = [x for x in range(off_s, off_e)] yaxis = [series[off_s]] + forecasts[i] pyplot.plot(xaxis, yaxis, 'r') pyplot.show() plot_forecasts(df['soi'].values, forecasts, test.shape + ahead - 1, 0, 500, ahead)
Here the blue line is the ogriginal time series, the red line is the prediction results. As we can see, it is doing a reasonable job. ENSO prediction is considered one of the most difficult task in climate science, but with more sophisicated modeling tuning and architecture desgin, we believe better results could be achieved.
In this notebook, we observed how ENSO prediction can be done with the aid of ArcGIS API for Python. We started with a correlation to find the most correlated grid in terms of precipitation. And then we performed time series analysis and LSTM to predict SOI based a few input variables including precipitation from prior time steps. With a basic LSTM example, we are able to acheive a reasonable accuracy as ENSO prediction is one of the most challenge tasks in climatology.