
Time Series Analysis of the 2019 Novel Coronavirus Pandemic

This notebook performs analysis and time-series charting of the 2019 novel coronavirus disease (COVID-19) globally (updated as of June 28th, 2021):

1. Import Data

The data source repo used here is created and maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, and the official maps can be viewed here. If you have trouble accessing the data, try downloading the CSV files directly from their GitHub repo.

The CSV files cover three types of cases - Confirmed, Deaths, and Recovered - reported inside and outside Mainland China, across the time span from 1/22/2020 to the current date.
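
To preview that wide layout before downloading everything, here is a minimal, self-contained sketch (assuming network access) that fetches only the first row of the confirmed-cases CSV:

In [ ]:
# Peek at the raw CSV layout: the leading columns are Province/State,
# Country/Region, Lat, Long, followed by one column per date.
import requests
import pandas as pd
from io import BytesIO

url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
       'csse_covid_19_data/csse_covid_19_time_series/'
       'time_series_covid19_confirmed_global.csv')
resp = requests.get(url)
resp.raise_for_status()
peek = pd.read_csv(BytesIO(resp.content), nrows=1)
print(peek.columns[:6].tolist())
# expected, e.g.: ['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20']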

Necessary Imports

In [1]:
from io import BytesIO
import requests
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt

Read the Time-Series CSV Files into DataFrames

Global Dataset

First, read the three time-series CSV files - namely 'confirmed', 'deaths', and 'recovered' - into three individual DataFrames, and append them one after another to an empty list.

In [2]:
cases = ['confirmed', 'deaths', 'recovered']
sheet = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'
suffix = '_global.csv'
df_list = []

for case in cases:
    ts_url = sheet + case + suffix
    df = pd.read_csv(ts_url, header=0, escapechar='\\')
    display(df.head(3))
    df_list.append(df)
    # also bind each DataFrame to a variable named after its category
    exec("{0}=df".format(case))
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/18/21 6/19/21 6/20/21 6/21/21 6/22/21 6/23/21 6/24/21 6/25/21 6/26/21 6/27/21
0 NaN Afghanistan 33.93911 67.709953 0 0 0 0 0 0 ... 98734 98734 98734 103902 105749 107957 109532 111592 111592 111592
1 NaN Albania 41.15330 20.168300 0 0 0 0 0 0 ... 132484 132488 132490 132490 132496 132497 132499 132506 132509 132512
2 NaN Algeria 28.03390 1.659600 0 0 0 0 0 0 ... 135219 135586 135821 136294 136679 137049 137403 137772 138113 138465

3 rows × 527 columns

Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/18/21 6/19/21 6/20/21 6/21/21 6/22/21 6/23/21 6/24/21 6/25/21 6/26/21 6/27/21
0 NaN Afghanistan 33.93911 67.709953 0 0 0 0 0 0 ... 3934 3934 3934 4215 4293 4366 4452 4519 4519 4519
1 NaN Albania 41.15330 20.168300 0 0 0 0 0 0 ... 2454 2454 2454 2454 2455 2455 2455 2455 2456 2456
2 NaN Algeria 28.03390 1.659600 0 0 0 0 0 0 ... 3615 3624 3631 3641 3650 3660 3669 3678 3685 3693

3 rows × 527 columns

Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/18/21 6/19/21 6/20/21 6/21/21 6/22/21 6/23/21 6/24/21 6/25/21 6/26/21 6/27/21
0 NaN Afghanistan 33.93911 67.709953 0 0 0 0 0 0 ... 63426 63875 64401 65071 65565 66102 66799 67183 67483 68288
1 NaN Albania 41.15330 20.168300 0 0 0 0 0 0 ... 129903 129910 129918 129918 129941 129955 129966 129975 129982 129990
2 NaN Algeria 28.03390 1.659600 0 0 0 0 0 0 ... 94093 94336 94571 94822 95084 95331 95599 95878 96108 96335

3 rows × 527 columns
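
As a side note, the exec call above can be avoided by keeping the DataFrames in a dictionary keyed by category; a minimal alternative sketch (re-reading the same files):

In [ ]:
# Alternative without exec: a dict keyed by category.
dfs = {case: pd.read_csv(sheet + case + suffix, header=0, escapechar='\\')
       for case in cases}
dfs['confirmed'].head(3)  # look up by name instead of a generated variable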

In [3]:
# shape of matrices for confirmed, deaths, and recovered
df_list[0].shape, df_list[1].shape, df_list[2].shape
Out[3]:
((279, 527), (279, 527), (264, 527))

U.S. Dataset

In [4]:
cases = ['confirmed', 'deaths']
sheet = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'
suffix = '_US.csv'
us_df_list = []

for case in cases:
    us_ts_url = sheet + case + suffix
    df = pd.read_csv(us_ts_url, header=0, escapechar='\\')
    display(df.head(3))
    us_df_list.append(df)
    # also bind each DataFrame to a variable named after its category
    exec("{0}=df".format(case))
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 6/18/21 6/19/21 6/20/21 6/21/21 6/22/21 6/23/21 6/24/21 6/25/21 6/26/21 6/27/21
0 84001001 US USA 840 1001.0 Autauga Alabama US 32.539527 -86.644082 ... 7241 7241 7241 7242 7242 7244 7244 7244 7244 7244
1 84001003 US USA 840 1003.0 Baldwin Alabama US 30.727750 -87.722071 ... 21868 21868 21868 21901 21901 21921 21921 21945 21945 21945
2 84001005 US USA 840 1005.0 Barbour Alabama US 31.868263 -85.387129 ... 2345 2345 2345 2345 2345 2344 2344 2344 2344 2344

3 rows × 534 columns

UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 6/18/21 6/19/21 6/20/21 6/21/21 6/22/21 6/23/21 6/24/21 6/25/21 6/26/21 6/27/21
0 84001001 US USA 840 1001.0 Autauga Alabama US 32.539527 -86.644082 ... 113 113 113 113 113 113 113 113 113 113
1 84001003 US USA 840 1003.0 Baldwin Alabama US 30.727750 -87.722071 ... 314 314 314 314 314 314 314 314 314 314
2 84001005 US USA 840 1005.0 Barbour Alabama US 31.868263 -85.387129 ... 59 59 59 59 59 60 60 60 60 60

3 rows × 535 columns

In [5]:
# shape of matrices for confirmed and deaths
us_df_list[0].shape, us_df_list[1].shape
Out[5]:
((3261, 175), (3261, 176))

Repair and summarize the U.S. Data

In the U.S. dataset, each state can contain one or more administrative regions (the Admin2 column). To summarize and simplify the dataset, the following function, sum_all_admins_in_state, is declared and used to sum all Admin2 records within one state into a single record.

In [5]:
def sum_all_admins_in_state(df, state):
    
    # query all sub-records (admin regions) of the selected state
    tmp_df = df[df["Province_State"]==state]
    
    # create a new row that sums the statistics of all admin regions in this
    # state, assigning the summed values to each date_time column
    sum_row = tmp_df.sum(axis=0)
    
    # assign constants to the identifier columns ('UID', 'Admin2', 'FIPS', etc.);
    # note that the Province_State value is temporarily set to the state name
    # + ", Sum" to distinguish the new row from the original sub-records.
    sum_row.loc['UID'] = "NaN"
    sum_row.loc['Admin2'] = "NaN"
    sum_row.loc['FIPS'] = "NaN"
    sum_row.loc['iso2'] = "US"
    sum_row.loc['iso3'] = "USA"
    sum_row.loc['code3'] = 840
    sum_row.loc['Country_Region'] = "US"
    sum_row.loc['Province_State'] = state + ", Sum"
    sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
    sum_row.loc['Long_'] = tmp_df['Long_'].values[0]
    
    # append the new row to the original DataFrame, remove the per-admin
    # sub-records of the selected state, and rename the summed row back to
    # the plain state name.
    df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
    #display(df[df["Province_State"].str.contains(state + ", Sum")])
    df = df[df['Province_State'] != state]
    df.loc[df.Province_State == state+", Sum", 'Province_State'] = state
    
    return df
In [6]:
# collapse every state's admin regions into a single row,
# for both the confirmed and deaths DataFrames
for i in range(2):
    usa_ts_df = us_df_list[i]
    for state in usa_ts_df.Province_State.unique():
        usa_ts_df = sum_all_admins_in_state(usa_ts_df, state)
    us_df_list[i] = usa_ts_df
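
A quick sanity check (a minimal sketch) confirms that each state now appears exactly once in the summed DataFrame:

In [ ]:
# After summing, every state should be represented by a single record.
counts = us_df_list[0]['Province_State'].value_counts()
assert (counts == 1).all(), "some states still have multiple admin rows"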

Analysis

Now, let's proceed through these three categories ('confirmed', 'deaths', 'recovered') individually.

Confirmed Cases

As shown below, of the 279 rows of records in the first DataFrame parsed from the CSV file, 34 rows report cases for individual provinces in Mainland China.

In [7]:
df_confirmed = df_list[0]
print(df_confirmed[df_confirmed['Country/Region'] == 'China'].shape)
df_confirmed[df_confirmed['Country/Region'] == 'China'].head()
(34, 527)
Out[7]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/18/21 6/19/21 6/20/21 6/21/21 6/22/21 6/23/21 6/24/21 6/25/21 6/26/21 6/27/21
58 Anhui China 31.8257 117.2264 1 9 15 39 60 70 ... 1004 1004 1004 1004 1004 1004 1004 1004 1004 1004
59 Beijing China 40.1824 116.4142 14 22 36 41 68 80 ... 1073 1073 1075 1075 1075 1075 1076 1076 1077 1079
60 Chongqing China 30.0572 107.8740 6 9 27 57 75 110 ... 598 598 598 598 598 598 598 598 598 598
61 Fujian China 26.0789 117.9874 1 5 10 18 35 59 ... 650 651 652 659 660 661 661 665 669 672
62 Gansu China 35.7518 104.2861 0 2 2 4 7 14 ... 194 194 194 194 194 194 195 195 195 195

5 rows × 527 columns

While some countries/regions, like China, appear as multiple rows, each representing the confirmed cases of one province, others are listed in the DataFrame as a single row that sums confirmed cases across all states/provinces, such as the United States shown below:

In [8]:
df_confirmed_usa = df_confirmed[df_confirmed['Country/Region'] == 'US']
print(df_confirmed_usa.shape)
df_confirmed_usa.head()
(1, 527)
Out[8]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/18/21 6/19/21 6/20/21 6/21/21 6/22/21 6/23/21 6/24/21 6/25/21 6/26/21 6/27/21
254 NaN US 40.0 -100.0 1 1 2 2 5 5 ... 33529475 33537995 33541887 33554275 33565215 33577651 33590481 33614196 33621499 33624983

1 rows × 527 columns

Recovered Cases

Likewise, for the purpose of comparison, we need the figures for the other two categories besides confirmed cases - recovered and deaths.

In [9]:
df_recovered = df_list[2]
df_recovered.tail(3)
Out[9]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/18/21 6/19/21 6/20/21 6/21/21 6/22/21 6/23/21 6/24/21 6/25/21 6/26/21 6/27/21
261 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 ... 3846 3855 3953 3958 3980 3990 3995 4005 4019 4025
262 NaN Zambia -13.133897 27.849332 0 0 0 0 0 0 ... 103884 105960 108960 110450 111844 113109 115898 119411 121965 124582
263 NaN Zimbabwe -19.015438 29.154857 0 0 0 0 0 0 ... 37143 37167 37184 37200 37288 37477 37524 37604 37761 37817

3 rows × 527 columns

Death Cases

In [10]:
df_death = df_list[1]
df_death.tail(3)
Out[10]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/18/21 6/19/21 6/20/21 6/21/21 6/22/21 6/23/21 6/24/21 6/25/21 6/26/21 6/27/21
276 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 ... 1352 1353 1355 1355 1355 1355 1356 1357 1358 1360
277 NaN Zambia -13.133897 27.849332 0 0 0 0 0 0 ... 1554 1595 1644 1691 1744 1794 1855 1915 1967 2022
278 NaN Zimbabwe -19.015438 29.154857 0 0 0 0 0 0 ... 1656 1666 1672 1685 1691 1692 1709 1721 1725 1736

3 rows × 527 columns

As we can tell from the snapshots of these DataFrames, cases are reported at three levels of geographic administrative unit (see the quick check after this list):

  1. for countries/regions with significant numbers of confirmed/death/recovered cases (e.g. Mainland China, Australia, or Canada), the number of cases is reported per province/state;
  2. for other countries/regions, the number of cases is summarized per country/region (e.g. the United States in this global dataset, as shown above);
  3. the global DataFrame also lists cruise ships, since the Diamond Princess and several other cruise ships themselves account for a considerable number of confirmed/death cases.
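
A quick check (a sketch against the confirmed DataFrame) makes this granularity visible:

In [ ]:
# Countries with several province-level rows vs. a single national row.
per_country = df_list[0]['Country/Region'].value_counts()
print(per_country.head(5))             # the most subdivided countries/regions
print((per_country == 1).sum(), "countries/regions appear as a single summary row")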

2. Parse the Time-Series Data (Mainland China)

Now that we have obtained a DataFrame for each type of coronavirus case across the globe, we will apply a filter to each DataFrame and analyze the Confirmed, Recovered, and Death cases separately, to see how the time series evolved inside Mainland China.

Time-Series of Confirmed Cases in Mainland China

First, a list called provinces_list needs to be extracted from the selected rows, and then concatenated with the category suffix (e.g. _Confirmed) to differentiate it from the other two categories (_Recovered and _Deaths).

In [11]:
provinces_list = df_confirmed[df_confirmed['Country/Region'] == 'China'].iloc[:,0:1].T.values.tolist()[0]
In [12]:
map_output = map(lambda x: x + '_Confirmed', provinces_list)
list_map_output = list(map_output)

Next, let's drop the first five columns of the DataFrame (Province/State, Country/Region, Lat, Long, and the initial 1/22/20 date column), which are not needed for time-series charting; then assign list_map_output as the index of the matrix, and perform a transpose so that the date_time values appear as row indices.

In [19]:
df0 = df_confirmed[df_confirmed['Country/Region'] == 'China'].iloc[:,5:].fillna(0)
df0.index = pd.Index(list_map_output, name='date_time')
df0 = df0.T
df0.tail(3)
Out[19]:
date_time Anhui_Confirmed Beijing_Confirmed Chongqing_Confirmed Fujian_Confirmed Gansu_Confirmed Guangdong_Confirmed Guangxi_Confirmed Guizhou_Confirmed Hainan_Confirmed Hebei_Confirmed ... Shandong_Confirmed Shanghai_Confirmed Shanxi_Confirmed Sichuan_Confirmed Tianjin_Confirmed Tibet_Confirmed Unknown_Confirmed Xinjiang_Confirmed Yunnan_Confirmed Zhejiang_Confirmed
6/25/21 1004 1076 598 665 195 2727 275 147 188 1317 ... 883 2199 253 1090 400 1 0 980 401 1385
6/26/21 1004 1077 598 669 195 2728 275 147 188 1317 ... 883 2200 253 1091 401 1 0 980 406 1385
6/27/21 1004 1079 598 672 195 2733 275 147 188 1317 ... 883 2201 253 1095 401 1 0 980 411 1385

3 rows × 34 columns

Also, we need to convert the date_time index from strings to a datetime type; note that the source dates use a two-digit year, which the %y format code below handles:

In [20]:
df0.index
Out[20]:
Index(['1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20',
       '1/29/20', '1/30/20', '1/31/20', '2/1/20',
       ...
       '6/18/21', '6/19/21', '6/20/21', '6/21/21', '6/22/21', '6/23/21',
       '6/24/21', '6/25/21', '6/26/21', '6/27/21'],
      dtype='object', length=522)
In [21]:
df0.index = pd.to_datetime(df0.index, format='%m/%d/%y', exact = False)

If the datetime conversion is successful, use the following cell to validate it and to check how many rows of datetime records are in the DataFrame.

In [22]:
print("Dataframe shape: ", df0.shape)
time_diff = (df0.index[-1] - df0.index[0])
print("Number of hours between start and end dates: ", time_diff.total_seconds()/3600 + 1)
Dataframe shape:  (522, 34)
Number of hours between start and end dates:  12505.0

The following cells produce three different plots:

  1. Plotting all the time series on one axis (line-plot)
  2. Plotting them all on separate subplots to see them more clearly (sharing the x axis)
  3. Plotting all the time series on one axis (scatterplot)
In [24]:
df0.plot(figsize=(15,10.5), title='Plotting all the time series on one axis (line-plot)').legend(loc='upper left')
plt.xlabel('Date Time'); text = plt.ylabel('Num of Cases')
In [34]:
ax_array = df0.plot(subplots=True, figsize=(15,18))
for ax in ax_array:
    ax.legend(loc='upper left')
plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
text = plt.title('Plotting all time-series on separate subplots (sharing the x axis)', pad="-120", y=2.0, loc="center")
In [48]:
df0.plot(y=list_map_output, linestyle=':', linewidth=4, 
         figsize=(15,10.5), grid=True,
         title="Plotting all time series on one axis (scatterplot)").legend(loc='upper left')
plt.xlabel('Date Time'); text = plt.ylabel('Num of Cases')

From the three plots shown above, we can tell that within Mainland China, Hubei province has the largest number of confirmed COVID-19 cases, followed by Guangdong province and the city of Shanghai.
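
A one-line check against df0 (a sketch using the most recent date) backs up this ranking:

In [ ]:
# Top three provinces by the latest confirmed count.
print(df0.iloc[-1].sort_values(ascending=False).head(3))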

The Recovered and Death Cases in Mainland China

Now that the confirmed COVID-19 cases for each province in Mainland China have been shown above, let's define a function plot_per_country to help plot other countries/regions, not only for the confirmed cases but also for the other categories (Recovered or Deaths).

In [59]:
def plot_per_country(df, country_name, category = "Confirmed", ref_df = df0):
    """to help us plot other countries/regions, not only for the confirmed cases, 
    but also other cases (Recovered or Deaths).
    """
    if 'Country/Region' in df.columns:
        provinces_list = df[df['Country/Region'] == country_name].iloc[:,0:1].T.values.tolist()[0]
    else:
        provinces_list = df[df['Country_Region'] == country_name].iloc[:,6:7].T.values.tolist()[0]
        
    map_output = map(lambda x: x + '_' + category, provinces_list)
    list_map_output = list(map_output)

    if 'Country/Region' in df.columns:
        df0 = df[df['Country/Region'] == country_name].iloc[:,5:].fillna(0)
    else:
        df0 = df[df['Country_Region'] == country_name].iloc[:,11:].fillna(0)
    
    df0.index = pd.Index(list_map_output, name='date_time')
    df0 = df0.loc[:, ~df0.columns.str.contains('^Unnamed')]
    df0 = df0.T
    df0.index = pd.to_datetime(df0.index, format='%m/%d/%y', exact = False)
    
    # scale the figure height with the number of series being plotted
    width_multiplier = df0.shape[1]/5

    df0.plot(figsize=(15,2*width_multiplier), 
             title='Plotting all the time series on one axis (line-plot)').legend(loc='upper left')
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    
    ax_array = df0.plot(subplots=True, figsize=(15,3*width_multiplier))
    for ax in ax_array:
        ax.legend(loc='upper left')
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    text = plt.title('Plotting all time-series on separate subplots (sharing the x axis)', pad="-120",
                     y=2.0, loc="center")
    
    df0.plot(y=list_map_output, linestyle=':', linewidth=4, 
             grid=True, figsize=(15,2*width_multiplier),
              title="Plotting all time series on one axis (scatterplot)").legend(loc='upper left')
    plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    
    return df0
In [60]:
df_recovered_china = plot_per_country(df_recovered, "China", "Recovered")

The three plots shown above indicate that Hubei province also has the largest number of recovered COVID-19 cases, followed by Guangdong province and the city of Shanghai.

In [61]:
df_death_china = plot_per_country(df_death, "China", "Death")
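
The same helper works for any country/region that is broken down by province/state in the global dataset; for example, a sketch for Australia's death cases (Australia, like China, is reported province-by-province):

In [ ]:
# Per-state death cases for Australia, using the global 'Country/Region' branch.
df_death_australia = plot_per_country(df_death, "Australia", "Deaths")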