Time Series Analysis of the 2019 Novel Coronavirus Pandemic

This notebook performs analysis and time-series charting of the 2019 novel coronavirus disease (COVID-19) globally (updated as of June 28th, 2021):

1. Import Data

The data source repo used here is created and maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, and their official dashboard can be viewed online. If you have trouble reading the raw CSV files directly, try downloading the data from their GitHub repo first.

The CSV files cover three types of cases - Confirmed, Deaths, and Recovered - reported inside and outside Mainland China, across the time span from 1/22/2020 to the present.

Necessary Imports

from io import BytesIO
import requests
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt

Read the Time-Series CSVs into DataFrames

Global Dataset

First, read the three CSV files - namely 'confirmed', 'deaths', and 'recovered' - into three individual DataFrames, and append them one after another to an empty list.

cases = ['confirmed', 'deaths', 'recovered']
sheet = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'
suffix = '_global.csv'
df_list = []

for case in cases:
    ts_url = sheet + case + suffix
    # read each time-series CSV directly from the JHU CSSE GitHub repo
    df = pd.read_csv(ts_url, header=0, escapechar='\\')
    display(df.head(3))
    df_list.append(df)
    # also expose each DataFrame under its category name (confirmed, deaths, recovered)
    exec("{0}=df".format(case))
[Output: first 3 rows of the global confirmed-cases DataFrame (Afghanistan, Albania, Algeria), with columns Province/State, Country/Region, Lat, Long, and one column per date from 1/22/20 through 6/27/21; 3 rows × 527 columns]

[Output: first 3 rows of the global deaths DataFrame (Afghanistan, Albania, Algeria), same column layout; 3 rows × 527 columns]

[Output: first 3 rows of the global recovered DataFrame (Afghanistan, Albania, Algeria), same column layout; 3 rows × 527 columns]

# shape of matrices for confirmed, death, and recovered
df_list[0].shape, df_list[1].shape, df_list[2].shape
((279, 527), (279, 527), (264, 527))

U.S. Dataset

cases = ['confirmed', 'deaths']
sheet = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'
suffix = '_US.csv'
us_df_list = []

for case in cases:
    us_ts_url = sheet + case + suffix
    # read each U.S. per-county time-series CSV (confirmed and deaths only)
    df = pd.read_csv(us_ts_url, header=0, escapechar='\\')
    display(df.head(3))
    us_df_list.append(df)
    exec("{0}=df".format(case))
[Output: first 3 rows of the U.S. confirmed-cases DataFrame (Autauga, Baldwin, and Barbour counties in Alabama), with columns UID, iso2, iso3, code3, FIPS, Admin2, Province_State, Country_Region, Lat, Long_, and one column per date; 3 rows × 534 columns]

[Output: first 3 rows of the U.S. deaths DataFrame, same counties and a similar column layout; 3 rows × 535 columns]

# shape of matrices for confirmed and deaths
us_df_list[0].shape, us_df_list[1].shape
((3261, 175), (3261, 176))
Repair and summarize the U.S. Data

In the U.S. Dataset, there could be one or more administrative regions per state, and in order to summarize and simplify the dataset, the following function sum_all_admins_in_state is to be declared and used to sum all Admin inside one state into a single record.

def sum_all_admins_in_state(df, state):
    
    # query all county-level (Admin2) sub-records of the selected state
    tmp_df = df[df["Province_State"]==state]
    
    # create a new row that sums all statistics of this state, so that
    # each date column of the new row holds the state-wide total
    sum_row = tmp_df.sum(axis=0)
    
    # assign constants to the identifier columns; note that the Province_State value
    # is temporarily set to the state name + ", Sum" and renamed back at the end.
    sum_row.loc['UID'] = "NaN"
    sum_row.loc['Admin2'] = "NaN"
    sum_row.loc['FIPS'] = "NaN"
    sum_row.loc['iso2'] = "US"
    sum_row.loc['iso3'] = "USA"
    sum_row.loc['code3'] = 840
    sum_row.loc['Country_Region'] = "US"
    sum_row.loc['Province_State'] = state + ", Sum"
    sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
    sum_row.loc['Long_'] = tmp_df['Long_'].values[0]
    
    # append the new row to the original DataFrame, and
    # remove the county-level sub-records of the selected state.
    df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
    #display(df[df["Province_State"].str.contains(state + ", Sum")])
    df=df[df['Province_State'] != state]
    df.loc[df.Province_State == state+", Sum", 'Province_State'] = state
    
    return df
# collapse every state's county-level rows into a single summed row, for both confirmed and deaths
for i in range(2):
    usa_ts_df=us_df_list[i]
    for state in usa_ts_df.Province_State.unique():
        usa_ts_df = sum_all_admins_in_state(usa_ts_df, state)
    us_df_list[i]=usa_ts_df
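
As a quick sanity check (a minimal sketch, relying on the Province_State column used above), we can confirm that each state now appears as a single summed row in both DataFrames:

# every Province_State value should now map to exactly one (summed) row
for df in us_df_list:
    assert df['Province_State'].is_unique
    print(df.shape)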

Analysis

Now, let's look at these three categories ('confirmed', 'deaths', 'recovered') individually.

Confirmed Cases

As shown below, of the 279 rows of records in the first DataFrame parsed from the CSV, 34 rows report cases for individual provinces in Mainland China.

df_confirmed = df_list[0]
print(df_confirmed[df_confirmed['Country/Region'] == 'China'].shape)
df_confirmed[df_confirmed['Country/Region'] == 'China'].head()
(34, 527)
[Output: first 5 rows of the confirmed cases for China (Anhui, Beijing, Chongqing, Fujian, Gansu); 5 rows × 527 columns]

While some countries/regions, such as China, appear as multiple rows (one per province with reported cases), others are listed in the DataFrame as a single row summing confirmed cases across their states/provinces, such as the United States shown below:

df_confirmed_usa = df_confirmed[df_confirmed['Country/Region'] == 'US']
print(df_confirmed_usa.shape)
df_confirmed_usa.head()
(1, 527)
[Output: the single row for the US in the global confirmed-cases DataFrame, with roughly 33.6 million cumulative cases by 6/27/21; 1 row × 527 columns]

Recovered Cases

Likewise, for the purpose of comparison, we need the figures for the other two categories besides confirmed cases: recovered and deaths.

df_recovered = df_list[2]
df_recovered.tail(3)
[Output: last 3 rows of the global recovered DataFrame (Yemen, Zambia, Zimbabwe); 3 rows × 527 columns]

Death Cases

df_death = df_list[1]
df_death.tail(3)
[Output: last 3 rows of the global deaths DataFrame (Yemen, Zambia, Zimbabwe); 3 rows × 527 columns]

As we can tell from the snapshots of these DataFrames, cases are reported at three kinds of geographic administrative units:

  1. for countries/regions with significant numbers of confirmed/death/recovered cases (e.g. Mainland China), the number of cases is reported per province/state;
  2. for other regions/countries, the number of cases is summarized per region/country (e.g. Australia or Canada);
  3. the global DataFrame also lists cruise ships (e.g. the Diamond Princess) as separate entries, since several cruise ships accounted for a considerable number of confirmed cases and deaths.

2. Parse the Time-Series Data (Mainland China)

Now that we have obtained a DataFrame for each type of coronavirus case count across the globe, we will apply a filter to each DataFrame and analyze the Confirmed, Recovered, and Death cases separately, to see how the time series evolved inside Mainland China.

Time-Series of Confirmed Cases in Mainland China

First, a list called provinces_list needs to be extracted from the selected rows and then concatenated with the category suffix (e.g. _Confirmed), in order to differentiate it from the other two categories (_Recovered and _Death).

provinces_list = df_confirmed[df_confirmed['Country/Region'] == 'China'].iloc[:,0:1].T.values.tolist()[0]
map_output = map(lambda x: x + '_Confirmed', provinces_list)
list_map_output = list(map_output)

Next, let's drop the leading columns from the DataFrame with iloc[:, 5:]. The Province/State, Country/Region, Lat, and Long columns are not needed for time-series charting (this slice also drops the first date column, 1/22/20, which is why the series below starts on 1/23/20). We then assign list_map_output as the index and perform a transpose, so that the date_time values become the row index.

df0 = df_confirmed[df_confirmed['Country/Region'] == 'China'].iloc[:,5:].fillna(0)
df0.index = pd.Index(list_map_output, name='date_time')
df0 = df0.T
df0.tail(3)
[Output: the last 3 rows of df0 (6/25/21 through 6/27/21), with one column per province named Anhui_Confirmed through Zhejiang_Confirmed; 3 rows × 34 columns]

Also, we need to standardize the date_time strings (note that the year is given with two digits) and convert the index from strings to a datetime type:

df0.index
Index(['1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20',
       '1/29/20', '1/30/20', '1/31/20', '2/1/20',
       ...
       '6/18/21', '6/19/21', '6/20/21', '6/21/21', '6/22/21', '6/23/21',
       '6/24/21', '6/25/21', '6/26/21', '6/27/21'],
      dtype='object', length=522)
df0.index = pd.to_datetime(df0.index, format='%m/%d/%y', exact = False)

If the datetime conversion is successful, use the following cell to validate and check how many rows of datetime records are in the dataframe.

print("Dataframe shape: ", df0.shape)
time_diff = (df0.index[-1] - df0.index[0])
print("Number of hours between start and end dates: ", time_diff.total_seconds()/3600 + 1)
Dataframe shape:  (522, 34)
Number of hours between start and end dates:  12505.0

The following cells produce three different plots:

  1. Plotting all the time series on one axis (line-plot)
  2. Plotting them all on separate subplots to see them more clearly (sharing the x axis)
  3. Plotting all the time series on one axis (scatterplot)
df0.plot(figsize=(15,10.5), title='Plotting all the time series on one axis (line-plot)').legend(loc='upper left')
plt.xlabel('Date Time'); text = plt.ylabel('Num of Cases')
ax_array = df0.plot(subplots=True, figsize=(15,18))
for ax in ax_array:
    ax.legend(loc='upper left')
plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
text = plt.title('Plotting all time-series on separate subplots (sharing the x axis)', pad="-120", y=2.0, loc="center")
df0.plot(y=list_map_output, linestyle=':', linewidth=4, 
         figsize=(15,10.5), grid=True,
         title="Plotting all time series on one axis (scatterplot)").legend(loc='upper left')
plt.xlabel('Date Time'); text = plt.ylabel('Num of Cases')

From the three plots shown above, we can tell that within Mainland China, Hubei province has the largest number of confirmed COVID-19 cases, followed by Guangdong province and the city of Shanghai.

The Recovered and Death Cases in Mainland China

Now that the confirmed COVID-19 cases for each province in Mainland China have been shown above, let's define a function plot_per_country to help plot other countries/regions, not only for the confirmed cases but also for the other categories (Recovered or Death).

def plot_per_country(df, country_name, category = "Confirmed", ref_df = df0):
    """to help us plot other countries/regions, not only for the confirmed cases, 
    but also other cases (Recovered or Deaths).
    """
    if 'Country/Region' in df.columns:
        provinces_list = df[df['Country/Region'] == country_name].iloc[:,0:1].T.values.tolist()[0]
    else:
        provinces_list = df[df['Country_Region'] == country_name].iloc[:,6:7].T.values.tolist()[0]
        
    map_output = map(lambda x: x + '_' + category, provinces_list)
    list_map_output = list(map_output)

    if 'Country/Region' in df.columns:
        df0 = df[df['Country/Region'] == country_name].iloc[:,5:].fillna(0)
    else:
        df0 = df[df['Country_Region'] == country_name].iloc[:,11:].fillna(0)
    
    df0.index = pd.Index(list_map_output, name='date_time')
    df0 = df0.loc[:, ~df0.columns.str.contains('^Unnamed')]
    df0 = df0.T
    df0.index = pd.to_datetime(df0.index, format='%m/%d/%y', exact = False)
    
    width_multiplier = df0.shape[1]/5

    df0.plot(figsize=(15,2*width_multiplier), 
             title='Plotting all the time series on one axis (line-plot)').legend(loc='upper left')
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    
    ax_array = df0.plot(subplots=True, figsize=(15,3*width_multiplier))
    for ax in ax_array:
        ax.legend(loc='upper left')
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    text = plt.title('Plotting all time-series on separate subplots (sharing the x axis)', pad="-120",
                     y=2.0, loc="center")
    
    df0.plot(y=list_map_output, linestyle=':', linewidth=4, 
             grid=True, figsize=(15,2*width_multiplier),
              title="Plotting all time series on one axis (scatterplot)").legend(loc='upper left')
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    
    return df0
df_recovered_china = plot_per_country(df_recovered, "China", "Recovered")

The three plots shown above indicate that Hubei province also has the largest number of recovered COVID-19 cases, followed by Guangdong province and the city of Shanghai.

df_death_china = plot_per_country(df_death, "China", "Death")

Unfortunately, among all provinces of Mainland China, Hubei again has the largest number of deaths caused by COVID-19.

Cross-Comparisons within different categories

Now that we have seen the confirmed, recovered, and death cases caused by COVID-19 for each province in Mainland China, let's perform a cross-comparison of these three categories by viewing the three time series in a single plot for each province.

First, make sure these three DataFrames are of the same length (in time-series).

df0.shape, df_recovered_china.shape, df_death_china.shape
((522, 34), (522, 34), (522, 34))
def cross_compare_per_province(province="Hubei"):
    """ Used to plot the time-series of the confirmed, recovered and death 
    cases per province;
        Input: string for province name
        Output: provides three plots of the cross comparison per province
    """
    key0 = province + '_Confirmed'
    key1 = province + '_Recovered'
    key2 = province + '_Death'
    
    df_l = df0.loc[~df0.index.duplicated(keep='first')]
    df_m = df_recovered_china.loc[~df_recovered_china.index.duplicated(keep='first')]
    df_r = df_death_china.loc[~df_death_china.index.duplicated(keep='first')]
    
    df_all_china = pd.concat([df_l[key0], df_m[key1], df_r[key2]], axis=1)
    df_all_china.plot(figsize=(15,2))
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    
    df_all_china.plot(subplots=True, figsize=(15,3))
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    
    df_all_china.plot(y=[key0, key1, key2], style='.', figsize=(15,2), linestyle=":")
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    
    return df_all_china
df_hubei = cross_compare_per_province()

In order to help understand the development trend of COVID-19, we need to obtain the Recovered_Rate and Death_Rate per state/province/region/country, and identify the turning point at which the Recovered_Rate surpassed the Death_Rate.

Recovered_Rate = (# of Recovered Cases)/(# of Confirmed Cases)
Death_Rate = (# of Death Cases)/(# of Confirmed Cases)
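
As a minimal sketch (assuming the df0, df_recovered_china, and df_death_china DataFrames and the column-naming convention used above), these two rates and the crossover date for a single province could also be computed directly:

# drop any duplicated dates first, as the plotting helpers above do
df_c = df0.loc[~df0.index.duplicated(keep='first')]
df_r = df_recovered_china.loc[~df_recovered_china.index.duplicated(keep='first')]
df_d = df_death_china.loc[~df_death_china.index.duplicated(keep='first')]

recovered_rate = df_r['Hubei_Recovered'] / df_c['Hubei_Confirmed']
death_rate = df_d['Hubei_Death'] / df_c['Hubei_Confirmed']

# idxmax() returns the first index label at which the condition becomes True
turning_point = (recovered_rate > death_rate).idxmax()
print("Recovered_Rate first exceeds Death_Rate on:", turning_point)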

Function calc_rate_per_province is defined to plot the trends of these two rates:

def calc_rate_per_province(province="Hubei"):
    """ Used to plot the time-series of the recovered and death 
    rates per province;
        Input: string for province name
        Output: provides three plots of the cross comparison per province
    """
    key0 = province + '_Confirmed'
    key1 = province + '_Recovered'
    key2 = province + '_Death'
    
    df_l = df0.loc[~df0.index.duplicated(keep='first')]
    df_m = df_recovered_china.loc[~df_recovered_china.index.duplicated(keep='first')]
    df_r = df_death_china.loc[~df_death_china.index.duplicated(keep='first')]

    df_all_china = pd.concat([(df_m[key1]*1.0).div((df_l[key0]*1.0), axis='index'),
                              (df_r[key2]*1.0).div((df_l[key0]*1.0), axis='index')],
                             keys=[key1 + '_Rate', key2 + '_Rate'],
                             axis=1)
    
    df_all_china.plot(figsize=(15,2))
    plt.xlabel('Date Time'); plt.ylabel('% of Confirmed Cases')
    
    df_all_china.plot(subplots=True, figsize=(15,3))
    plt.xlabel('Date Time'); plt.ylabel('% of Confirmed Cases')
    
    df_all_china.plot(style='.', figsize=(15,2), linestyle=":")   
    plt.xlabel('Date Time'); plt.ylabel('% of Confirmed Cases')
calc_rate_per_province()

We can tell from the plots above that the Recovered Rate surpassed the Death Rate for Hubei Province after 02/05/2020.

Cross-comparisons within provinces

After looking at the different categories per province, let us now view several provinces in one chart to see the variance from a different dimension. To start, we pick the three provinces/municipalities in China with the largest numbers of confirmed COVID-19 cases.

df1 = df_confirmed[df_confirmed['Province/State'].isin(['Hubei', 'Guangdong', 'Shanghai'])].iloc[:,5:].fillna(0)
df1
[Output: the Guangdong (row 63), Hubei (row 71), and Shanghai (row 83) rows of df_confirmed with only the date columns (1/23/20 through 6/27/21) retained; 3 rows × 522 columns]

df1.index = pd.Index(['Guangdong_Confirmed', 'Hubei_Confirmed', 'Shanghai_Confirmed'],name='date_time')
df1 = df1.T
df1.index = pd.to_datetime(df1.index, format='%m/%d/%y', exact = False)
df1
date_time   Guangdong_Confirmed  Hubei_Confirmed  Shanghai_Confirmed
2020-01-23                   32              444                  16
2020-01-24                   53              549                  20
2020-01-25                   78              761                  33
2020-01-26                  111             1058                  40
2020-01-27                  151             1423                  53
...                         ...              ...                 ...
2021-06-23                 2717            68160                2190
2021-06-24                 2723            68160                2196
2021-06-25                 2727            68160                2199
2021-06-26                 2728            68160                2200
2021-06-27                 2733            68160                2201

522 rows × 3 columns

print("Dataframe shape: ", df1.shape)
time_diff = (df1.index[-1] - df1.index[0])
print("Number of hours between start and end dates: ", time_diff.total_seconds()/3600 + 1)
Dataframe shape:  (522, 3)
Number of hours between start and end dates:  12505.0
df1.plot(figsize=(15,2))
plt.xlabel('Date Time'); text=plt.ylabel('Num of Cases')
    
df1.plot(subplots=True, figsize=(15,3))
plt.xlabel('Date Time'); text=plt.ylabel('Num of Cases')

df1.plot(y=['Guangdong_Confirmed', 'Hubei_Confirmed', 'Shanghai_Confirmed'], linestyle=':', figsize=(15,2))
plt.xlabel('Date Time'); text=plt.ylabel('Num of Cases')

It is obvious that the curves of confirmed cases for Guangdong province and the city of Shanghai look quite similar, while that of Hubei province displays a different pattern.

Identify periodicities & similarities

Next, let us explore how compositing can be done to identify periodicities and similarities among the many observations along the time series. This is especially useful when one is looking for responses to a specific event that are mixed with noise from many other influences. Examples of compositing include the climatic response to a volcanic eruption, the global weather response to El Niño, calculating the mean diurnal cycle of surface temperature in Dallas, Texas, or determining whether precipitation responds to the phase of the moon. The last two natural phenomena relate to sorting out the true amplitude of cyclic responses [1].

Often compositing will reveal periodic phenomena with fixed phase that cannot be extracted from spectral analysis when the signal is small compared to the noise. Compositing makes no assumption of linearity, and it is good at separating small signals from noise, provided the sample is large enough [1].

What we need to do here is average the data in some clever way in relation to the event: the event signal will remain, and all other influences will tend to average out. The overall workflow [1] includes:

  • Select the basis for compositing and define the categories;
  • Compute the means and statistics for each category;
  • Organize and display the results;
  • Validate the results.

Let's now split the time series up into its constituent cycles and stack them together via pandas and matplotlib. We also prepare an 8-day downsampled version (df_daily below), which can make the plots a bit clearer and quicker to generate.

minima = ["2020-01-23", "2020-01-31", 
          "2020-02-07", "2020-02-15", "2020-02-23", 
          "2020-03-03", "2020-03-11", "2020-03-19", 
          "2020-03-27", "2020-04-03", "2020-04-11",
          "2020-04-19", "2020-04-27", "2020-05-05", 
          "2020-05-13", "2020-05-21", 
          "2020-05-29", "2020-06-05", "2020-06-13",
          "2020-06-21", "2020-06-29", "2020-07-03"]

def split_into_cycles(province = 'Hubei', df2 = df_hubei):
    """Returns a list of dataframes, one for each cycle"""
    if df2 is None:
        key0 = province + '_Confirmed'
        key1 = province + '_Recovered'
        key2 = province + '_Death'
        df0.reset_index(drop= True)
        df_recovered_china.reset_index(drop= True)
        df_death_china.reset_index(drop= True)
        df2 = pd.concat([df0[key0],
                         df_recovered_china[key1],
                         df_death_china[key2]],
                        axis=1, ignore_index=True)
    else:
        print("Use existing DataFrame")
    df2.index = pd.to_datetime(df2.index, format='%m/%d/%Y %H:%M', exact = False)
    # optional 8-day downsampled view of the series (not used for the cycle split below)
    df_daily = df2.resample("8D").mean()
    
    cycles = []
    # Split by cycle
    for start, end in zip(minima[0:-1], minima[1:]):
        cycle = df2[start:end]
        # Convert from dates to days from minimum
        cycle.index = (cycle.index - cycle.index[0]).days
        # Drop any duplicated day offsets, then extend each cycle to a full
        # 100 days (missing days are filled with NaN) so that the cycles can be
        # summed and averaged together later
        cycle = cycle[~cycle.index.duplicated()]
        cycle = cycle.reindex(pd.RangeIndex(0, 100))
        cycles.append(cycle)
    return cycles

cycles = split_into_cycles('Hubei', df_hubei)
Use existing DataFrame

By manually compiling the list minima, we expect the output of split_into_cycles to be a list (cycles in this case) containing 21 DataFrames, each holding a different cycle. For each DataFrame, we have changed the index to the number of days from the minimum and used .reindex() to fix them all to the same length, so that we can perform arithmetic operations on them together. The following will create a plot of each parameter, with the cycles superposed over each other.

In this example, we first create the figure and its axes using matplotlib directly (using sharex=True to link the x-axes of the plots), then direct the pandas plotting commands to the axis we want each series drawn onto using the ax kwarg. We also calculate the mean of the stacked time series.

def plot_with_cycles(province = "Hubei"):
    fig, axes = plt.subplots(3, 1, figsize=(15,25), sharex=True)
    plt.subplots_adjust(wspace=0.25, 
                        hspace=0.25)
    key0 = province + '_Confirmed'
    key1 = province + '_Recovered'
    key2 = province + '_Death'
    for i, cycle in enumerate(cycles):
        cycle[key0].plot(ax=axes[0], label=f"Cycle {i}")
        cycle[key1].plot(ax=axes[1])
        cycle[key2].plot(ax=axes[2])
    N_cycles = len(cycles)
    (sum(cycles)[key0]/N_cycles).plot(ax=axes[0], color="black", label='Mean_'+key0)
    (sum(cycles)[key1]/N_cycles).plot(ax=axes[1], color="orange", label='Mean_'+key1)
    (sum(cycles)[key2]/N_cycles).plot(ax=axes[2], color="pink", label='Mean_'+key2)
    
    axes[0].set_ylabel(key0)
    axes[1].set_ylabel(key1)
    axes[2].set_ylabel(key2)
    axes[2].set_xlabel("Days since minimum")
    for ax in axes:
        ax.grid()
    axes[0].legend(loc="lower left")

plot_with_cycles()

This helps us see how the cycles differ from each other. For example, the two most recent cycles (cycles 19 and 20 in this case) are consistently higher than the mean for both the Recovered and Confirmed cases, while the other cycles stay below the mean. Overall, no recurring pattern between cycles is detected from the plots.

By constructing the mean of the cycles, we are actually reinforcing the similar pattern over each cycle and reducing the effect of the random noise. This is the basis of a technique called superposed epoch analysis, which is useful for identifying periodicities and similarities between noisy time series.
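
As a variation (a hedged sketch reusing the cycles list built above), the composite mean can also be computed in a NaN-tolerant way, so that cycles of different lengths are averaged over whatever data they do have rather than producing NaN in the tail:

# stack all cycles and average them by their days-since-minimum index;
# groupby(...).mean() skips NaN values, so shorter cycles simply contribute
# fewer samples at the larger day offsets
composite_mean = pd.concat(cycles).groupby(level=0).mean()
composite_mean.plot(figsize=(15, 4), title='Composite (superposed epoch) mean across cycles')
plt.xlabel('Days since minimum'); text = plt.ylabel('Num of Cases')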

3. Parse the Time-Series Data (U.S., Australia & Canada)

Let's summarize the data we have seen so far:

  1. for countries/regions with significant numbers of confirmed/death/recovered cases (e.g. Mainland China), the number of cases is reported per province/state;
  2. for other regions/countries, the number of cases is summarized per region/country (e.g. Australia or Canada);
  3. the global DataFrame also lists cruise ships (e.g. the Diamond Princess) as separate entries, since several cruise ships accounted for a considerable number of confirmed cases and deaths.

So far, we have covered the first case for Mainland China. Next, using the plot_per_country function defined previously, we can also look into the statistics for regions other than Mainland China that fall under case 1.

Confirmed Cases in Australia & Canada

df_au_confirmed = plot_per_country(df_confirmed, "Australia")
df_ca_confirmed = plot_per_country(df_confirmed, "Canada")

As shown in the plots above, Victoria tops Australia with roughly 20,000 confirmed cases, and Ontario tops Canada with over 500,000 confirmed cases.

Next, we will look into the statistics collected for the United States.

Confirmed Cases in the U.S.

Note that the global DataFrame (a.k.a. df_confirmed) created previously contains only one row for the overall statistics of the entire United States, so we will instead use the other DataFrame (i.e. us_df_list[0]) imported and parsed from the U.S. CSV on GitHub, which carries the per-state cases for the United States.

df_usa_confirmed = plot_per_country(us_df_list[0], "US")

The top 3 states with the highest numbers of confirmed cases in the U.S. so far are the States of California, Florida and New York.

Confirmed Cases on Cruise Ships

In case 3 of this dataset, cruise ships themselves are listed under the Country/Region column. We then plot the time series of their data, which appear in two DataFrames - df_confirmed and us_df_list[0] - for cruise ships attributed to other countries and to the United States, respectively.

A customized function time_series_per_location is declared and used below to draw the time-series of confirmed cases reported on cruise ships.

def time_series_per_location(df=us_df_list[0], location="Diamond Princess", op=1):
    """ Used to plot the time-series of confirmed cases on cruise ships
    """
    if op==0:
        if 'Country/Region' in df.columns:
            df0 = df[df['Country/Region'] == location].iloc[:,5:].fillna(0)
        else:
            df0 = df[df['Country_Region'] == location].iloc[:,11:].fillna(0)
    elif op==1:
        if 'Province/State' in df.columns:
            df0 = df[df['Province/State'] == location].iloc[:,5:].fillna(0)
        else:
            df0 = df[df['Province_State'] == location].iloc[:,11:].fillna(0)
    
    df0 = df0.loc[:, ~df0.columns.str.contains('^Unnamed')]
    df0 = df0.T
    df0.index = pd.to_datetime(df0.index, format='%m/%d/%y', exact = False)
    
    # guard against a vanishingly small figure when only one location is plotted
    width_multiplier = max(df0.shape[1]/5, 1)

    df0.plot(figsize=(15,2*width_multiplier), 
             title='Plotting the time-series of confirmed cases (line-plot)').legend(loc='upper left')
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
df_confirmed[df_confirmed['Country/Region'].str.contains('Princess')]
[Output: the Diamond Princess row in the global confirmed-cases DataFrame, with 712 cumulative cases by 6/27/21; 1 row × 527 columns]

time_series_per_location(df=df_confirmed, location="Diamond Princess", op=0)
us_df_list[0][us_df_list[0]['Province_State']=='Diamond Princess']
[Output: the Diamond Princess row in the U.S. confirmed-cases DataFrame, with 49 cumulative cases; 1 row × 534 columns]

time_series_per_location()
us_df_list[0][us_df_list[0]['Province_State']=='Grand Princess']
[Output: the Grand Princess row in the U.S. confirmed-cases DataFrame, with 103 cumulative cases; 1 row × 534 columns]

time_series_per_location(location="Grand Princess")

Comparing these three time-series plots, we can see that the number of cases on the Diamond Princess (the global record) rose rapidly in mid-February 2020, while the rapid climbs in the U.S. records for the Diamond Princess and the Grand Princess happened in mid-March and late March 2020, respectively.

4. Time Series Lag Scatter Plots

Time series modeling assumes a relationship between an observation and the previous observation. Previous observations in a time series are called lags, with the observation at the previous time step called lag1, the observation two time steps ago called lag2, and so on.

A useful type of plot for exploring the relationship between each observation and a lag of that observation is the scatter plot. Pandas has a built-in function for exactly this, called lag_plot(). It plots the observation at time t on the x-axis and the lag1 observation (t-1) on the y-axis.

  • If the points cluster along a diagonal line from the bottom-left to the top-right of the plot, it suggests a positive correlation relationship.
  • If the points cluster along a diagonal line from the top-left to the bottom-right, it suggests a negative correlation relationship.
  • Either relationship is good as they can be modeled.

More points tighter to the diagonal line suggest a stronger relationship, while more spread from the line suggests a weaker relationship. A ball in the middle or a spread across the plot suggests a weak or no relationship.

Below is an example of a lag plot for the Hubei_Confirmed Series.

from matplotlib import pyplot
from pandas.plotting import lag_plot
lag_plot(df1['Hubei_Confirmed'])
pyplot.show()

Because the points cluster along the diagonal line from the bottom-left to the top-right, and most of them sit close to that line rather than spreading away from it, we can say the plot suggests a strong positive correlation between the observation and its lag1.

We can repeat this process for an observation and any lag value: perhaps the observation at the same time last week, last month, or last year, or any other lag suggested by domain-specific knowledge. For example, we can create a scatter plot of the observation against each value from the previous seven days. Below is an example of this for the Hubei_Confirmed series.

In the cell below, a new DataFrame is first created with the lag values as new columns, and the columns are named appropriately. Then a subplot is created for each lag, plotting the current observation against that lag value.

values = df1['Hubei_Confirmed']
lags = 7
columns = [values]
for i in range(1,(lags + 1)):
    columns.append(values.shift(i))
dataframe = pd.concat(columns, axis=1)
columns = ['t+1']
for i in range(1,(lags + 1)):
    columns.append('t-' + str(i))
dataframe.columns = columns
pyplot.figure(1, figsize=(12,9))
for i in range(1,(lags + 1)):
    ax = pyplot.subplot(240 + i)
    pyplot.subplots_adjust(bottom=0.4, wspace = 0.4, hspace = 0.4)
    ax.set_title('t+1 vs t-' + str(i))
    pyplot.scatter(x=dataframe['t+1'].values, y=dataframe['t-'+str(i)].values)
pyplot.show()

Running the code block above suggests that the strongest relationship is between an observation and its lag1 value, but there is generally a good positive correlation with each value from the previous week.
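
To quantify what the lag scatter plots suggest (a small sketch reusing the dataframe of lagged columns built above), we can also compute the Pearson correlation between the 't+1' column and each lagged column:

# corr() ignores the NaN rows introduced by shifting;
# each value is the correlation between the series and one of its lags
lag_corr = dataframe.corr()['t+1'].drop('t+1')
print(lag_corr)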

5. Time Series Autocorrelation Plots

We can quantify the strength and type of relationship between observations and their lags. In statistics, this is called correlation, and when calculated against lag values in time series, it is called autocorrelation (self-correlation).

A correlation value calculated between two groups of numbers, such as observations and their lag1 values, results in a number between -1 and 1. The sign of the number indicates a positive or negative correlation, a value close to zero suggests a weak correlation, and a value closer to -1 or 1 indicates a strong correlation.
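
For a single coefficient, pandas also offers Series.autocorr(); as a minimal sketch, the lag-1 autocorrelation of the Hubei_Confirmed series can be computed as:

# Pearson correlation of the series with itself shifted by one day
print(df1['Hubei_Confirmed'].autocorr(lag=1))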

Correlation values, called correlation coefficients, can be calculated for each observation and different lag values. Once calculated, a plot can be created to help better understand how this relationship changes over the lag. This type of plot is called an autocorrelation plot, and Pandas provides it built in as the autocorrelation_plot() function.

The cell below creates an autocorrelation plot for the Hubei_Confirmed Series:

from pandas.plotting import autocorrelation_plot
ax = autocorrelation_plot(df1['Hubei_Confirmed'], label="Value of Self-Correlation")
ax.set_xlabel("Lag (in days)")
ax.set_ylabel("Value of Autocorrelation")
pyplot.show()

The resulting plot shows the lag (in days) along the x-axis and the value of the correlation on the y-axis. The dotted lines indicate the bounds beyond which correlation values are statistically significant. We can see that the Hubei_Confirmed series shows cycles of strong positive correlation before day 6 of lag. This captures the relationship of an observation with its past observations; there are no apparent seasonal or recurrent trends.

Conclusions

While many countries or regions have struggled to cope with the rapid spread of the virus, it is important for everyone to closely monitor the situation and take precautionary measures. To that end, this notebook has discussed how to perform data analysis and time-series charting of COVID-19 cases across the globe at various levels, including all provinces in Mainland China, the United States, Australia, Canada, and cruise ships with reported cases. It has also demonstrated cross-comparisons across categories and across provinces/states, to help identify which provinces/states in the selected countries are impacted the most or are at critical phases of the outbreak. Finally, time-series lag scatter plots and autocorrelation plots were drawn to study the time-lag correlation and the autocorrelation of the COVID-19 time series.

More info can be found at storymaps. You can also check out Part 3 notebook for predictive analysis of COVID-19 trends.

References

[1] "Compositing or Superposed Epoch Analysis", https://atmos.washington.edu/~dennis/552_Notes_2.pdf, accessed on 07/04/2020
