ArcGIS API for Python

Time Series Analysis of the 2019 Novel Coronavirus Pandemic

This notebook performs analysis and time-series charting of the 2019 novel coronavirus disease (COVID-19) globally.

1. Import Data

The data source repository used here is created and maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, and the official maps can be viewed here. If you have trouble accessing the data, try downloading the CSV files provided on their GitHub repo.

The CSV files track three types of cases - Confirmed, Deaths, and Recovered - reported inside and outside Mainland China, across the time span from 1/22/2020 to the current date.

Necessary Imports

In [1]:
from io import BytesIO
import requests
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt

Read the CSV Files into DataFrames

Global Dataset

First, read the three time-series CSV files - namely 'confirmed', 'deaths', and 'recovered' - into three individual DataFrames, and append them one after another to an empty list.

In [2]:
cases = ['confirmed', 'deaths', 'recovered']
base_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'
suffix = '_global.csv'
df_list = []

for case in cases:
    ts_url = base_url + case + suffix
    df = pd.read_csv(ts_url, header=0, escapechar='\\')
    display(df.head(3))
    df_list.append(df)
    # also bind the DataFrame to a variable named after its case type
    # (confirmed, deaths, recovered)
    exec("{0}=df".format(case))
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/24/20 6/25/20 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 29640 30175 30451 30616 30967 31238 31517 31836 32022 32324
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 2114 2192 2269 2330 2402 2466 2535 2580 2662 2752
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 12248 12445 12685 12968 13273 13571 13907 14272 14657 15070

3 rows × 168 columns

Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/24/20 6/25/20 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 639 675 683 703 721 733 746 774 807 819
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 47 49 51 53 55 58 62 65 69 72
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 869 878 885 892 897 905 912 920 928 937

3 rows × 168 columns

Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/24/20 6/25/20 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 9869 10174 10306 10674 12604 13934 14131 15651 16041 17331
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 1217 1250 1298 1346 1384 1438 1459 1516 1559 1592
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 8792 8920 9066 9202 9371 9674 9897 10040 10342 10832

3 rows × 168 columns

In [3]:
# shape of matrices for confirmed, deaths, and recovered
df_list[0].shape, df_list[1].shape, df_list[2].shape
Out[3]:
((266, 168), (266, 168), (253, 168))
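
The recovered table has fewer rows (253) than the other two (266), likely because some countries report recoveries at a coarser geographic level than confirmed cases. A minimal check, assuming the df_list built above, lists which (Province/State, Country/Region) pairs are absent from the recovered table:

In [ ]:
# hedged sketch: find rows present in the confirmed table but missing from
# the recovered table
conf_keys = set(zip(df_list[0]['Province/State'].fillna(''), df_list[0]['Country/Region']))
rec_keys = set(zip(df_list[2]['Province/State'].fillna(''), df_list[2]['Country/Region']))
sorted(conf_keys - rec_keys)[:10]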

U.S. Dataset

In [4]:
cases = ['confirmed', 'deaths']
base_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'
suffix = '_US.csv'
us_df_list = []

for case in cases:
    us_ts_url = base_url + case + suffix
    df = pd.read_csv(us_ts_url, header=0, escapechar='\\')
    display(df.head(3))
    us_df_list.append(df)
    # bind the DataFrame to a variable named after its case type
    exec("{0}=df".format(case))
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 6/24/20 6/25/20 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20
0 16 AS ASM 16 60.0 NaN American Samoa US -14.2710 -170.1320 ... 0 0 0 0 0 0 0 0 0 0
1 316 GU GUM 316 66.0 NaN Guam US 13.4443 144.7937 ... 226 231 247 247 247 253 257 267 280 280
2 580 MP MNP 580 69.0 NaN Northern Mariana Islands US 15.0979 145.6739 ... 30 30 30 30 30 30 30 30 31 31

3 rows × 175 columns

UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 6/24/20 6/25/20 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20
0 16 AS ASM 16 60.0 NaN American Samoa US -14.2710 -170.1320 ... 0 0 0 0 0 0 0 0 0 0
1 316 GU GUM 316 66.0 NaN Guam US 13.4443 144.7937 ... 5 5 5 5 5 5 5 5 5 5
2 580 MP MNP 580 69.0 NaN Northern Mariana Islands US 15.0979 145.6739 ... 2 2 2 2 2 2 2 2 2 2

3 rows × 176 columns

In [5]:
# shape of matrices for confirmed and deaths
us_df_list[0].shape, us_df_list[1].shape
Out[5]:
((3261, 175), (3261, 176))
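
Note that the deaths table has one more column than the confirmed table. A quick comparison (a minimal check, assuming the us_df_list built above) reveals the extra field:

In [ ]:
# hedged sketch: identify the column that exists only in the deaths table
set(us_df_list[1].columns) - set(us_df_list[0].columns)
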
Repair and Summarize the U.S. Data

In the U.S. dataset, there can be one or more administrative regions (Admin2) per state. To summarize and simplify the dataset, the following function sum_all_admins_in_state is declared and used to sum all Admin2 records within one state into a single record.

In [93]:
def sum_all_admins_in_state(df, state):
    
    # query all sub-records (Admin2 rows) of the selected state
    tmp_df = df[df["Province_State"]==state]
    
    # create a new row that sums all statistics of this state, so that each
    # date_time column of the new row holds the state-wide total
    sum_row = tmp_df.sum(axis=0)
    
    # assign constants to the identifier columns ('UID', 'Admin2', 'FIPS', etc.);
    # note that the Province_State column is temporarily renamed from the
    # state name to state name + ", Sum".
    sum_row.loc['UID'] = "NaN"
    sum_row.loc['Admin2'] = "NaN"
    sum_row.loc['FIPS'] = "NaN"
    sum_row.loc['iso2'] = "US"
    sum_row.loc['iso3'] = "USA"
    sum_row.loc['code3'] = 840
    sum_row.loc['Country_Region'] = "US"
    sum_row.loc['Province_State'] = state + ", Sum"
    sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
    sum_row.loc['Long_'] = tmp_df['Long_'].values[0]
    
    # append the new summed row to the original DataFrame, remove the
    # sub-records of the selected state, and restore the state name.
    df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
    #display(df[df["Province_State"].str.contains(state + ", Sum")])
    df = df[df['Province_State'] != state]
    df.loc[df.Province_State == state+", Sum", 'Province_State'] = state
    
    return df
In [94]:
for i in range(2):
    usa_ts_df=us_df_list[i]
    for state in usa_ts_df.Province_State.unique():
        usa_ts_df = sum_all_admins_in_state(usa_ts_df, state)
    us_df_list[i]=usa_ts_df
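
As a quick sanity check (a minimal sketch, assuming the loop above completed), every state should now appear exactly once, with the row counts shrunk from thousands of county-level records to a few dozen state-level records:

In [ ]:
# hedged sanity check: each Province_State should now be a single summed row
assert us_df_list[0]['Province_State'].value_counts().max() == 1
us_df_list[0].shape, us_df_list[1].shape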

Analysis

Now, let's look at each of the three categories ('confirmed', 'deaths', 'recovered') individually.

Confirmed Cases

As shown below, of the 266 rows of records in the first DataFrame parsed from the CSV file, 33 rows report cases per province in Mainland China.

In [8]:
df_confirmed = df_list[0]
print(df_confirmed[df_confirmed['Country/Region'] == 'China'].shape)
df_confirmed[df_confirmed['Country/Region'] == 'China'].head()
(33, 168)
Out[8]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/24/20 6/25/20 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20
49 Anhui China 31.8257 117.2264 1 9 15 39 60 70 ... 991 991 991 991 991 991 991 991 991 991
50 Beijing China 40.1824 116.4142 14 22 36 41 68 80 ... 863 874 891 905 912 919 922 923 925 926
51 Chongqing China 30.0572 107.8740 6 9 27 57 75 110 ... 582 582 582 582 582 582 582 582 582 582
52 Fujian China 26.0789 117.9874 1 5 10 18 35 59 ... 363 363 363 363 363 363 363 363 363 363
53 Gansu China 37.8099 101.0583 0 2 2 4 7 14 ... 161 161 162 163 163 164 164 164 164 164

5 rows × 168 columns

While some countries/regions, like China, are displayed as multiple rows, each representing the confirmed cases of one province, others are listed in the DataFrame as a single row that sums confirmed cases across all states/provinces, such as the United States shown below:

In [9]:
df_confirmed_usa = df_confirmed[df_confirmed['Country/Region'] == 'US']
print(df_confirmed_usa.shape)
df_confirmed_usa.head()
(1, 168)
Out[9]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/24/20 6/25/20 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20
225 NaN US 37.0902 -95.7129 1 1 2 2 5 5 ... 2382426 2422299 2467554 2510259 2549294 2590668 2636414 2687588 2742049 2794153

1 rows × 168 columns

Recovered Cases

Likewise, for the purpose of comparison, we need the figures of the other two categories - recovered and deaths - besides the confirmed cases.

In [10]:
df_recovered = df_list[2]
df_recovered.tail(3)
Out[10]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/24/20 6/25/20 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20
250 NaN Comoros -11.645500 43.333300 0 0 0 0 0 0 ... 159 161 161 161 161 161 200 200 200 241
251 NaN Tajikistan 38.861034 71.276093 0 0 0 0 0 0 ... 4194 4267 4331 4391 4448 4506 4506 4627 4690 4690
252 NaN Lesotho -29.609988 28.233608 0 0 0 0 0 0 ... 2 2 4 4 4 4 4 11 11 11

3 rows × 168 columns

Death Cases

In [11]:
df_death = df_list[1]
df_death.tail(3)
Out[11]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 6/24/20 6/25/20 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20
263 NaN Comoros -11.645500 43.333300 0 0 0 0 0 0 ... 7 7 7 7 7 7 7 7 7 7
264 NaN Tajikistan 38.861034 71.276093 0 0 0 0 0 0 ... 52 52 52 52 52 52 52 52 52 52
265 NaN Lesotho -29.609988 28.233608 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 168 columns

As we can tell from the snapshots of these DataFrames, cases are reported in three kinds of geographic administrative units (illustrated by the quick check after this list):

  1. for countries/regions with significant numbers of confirmed/deaths/recovered cases (e.g. Mainland China), the number of cases is reported per province/state;
  2. for other countries/regions, the number of cases is summarized per country/region (e.g. the United States, as shown above);
  3. the global DataFrame also lists cruise ships, since the Diamond Princess and several other cruise ships account for a considerable number of confirmed cases and deaths.
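
A quick look at the value counts (a minimal sketch against the df_confirmed loaded above) illustrates the first two reporting patterns:

In [ ]:
# countries reported per province/state appear with many rows, while
# countries summarized at the national level appear exactly once
df_confirmed['Country/Region'].value_counts().head()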

2. Parse the Time-Series Data (Mainland China)

Now we have obtained a DataFrame for each type of coronavirus case across the globe. We will apply a filter to each DataFrame and analyze the Confirmed, Recovered, and Death cases separately, to see how the time series evolved inside Mainland China.

Time-Series of Confirmed Cases in Mainland China

First, a list called provinces_list needs to be extracted from the selected rows, and then concatenated with the category suffix (e.g. _Confirmed) in order to differentiate it from the other two categories (_Recovered and _Deaths).

In [12]:
# extract the province names (the first column) of all rows for China
provinces_list = df_confirmed[df_confirmed['Country/Region'] == 'China'].iloc[:,0:1].T.values.tolist()[0]
In [13]:
# append the category suffix to each province name, e.g. Hubei -> Hubei_Confirmed
map_output = map(lambda x: x + '_Confirmed', provinces_list)
list_map_output = list(map_output)

Next, let's drop the first five columns from the DataFrame (Province/State, Country/Region, Lat, Long, and the first date column), which are not needed for time-series charting, assign the new index to the matrix, and perform a transpose so that the date_time values become the row index.

In [14]:
df0 = df_confirmed[df_confirmed['Country/Region'] == 'China'].iloc[:,5:].fillna(0)
df0.index = pd.Index(list_map_output, name='date_time')
df0 = df0.T
df0.tail(3)
Out[14]:
date_time Anhui_Confirmed Beijing_Confirmed Chongqing_Confirmed Fujian_Confirmed Gansu_Confirmed Guangdong_Confirmed Guangxi_Confirmed Guizhou_Confirmed Hainan_Confirmed Hebei_Confirmed ... Shaanxi_Confirmed Shandong_Confirmed Shanghai_Confirmed Shanxi_Confirmed Sichuan_Confirmed Tianjin_Confirmed Tibet_Confirmed Xinjiang_Confirmed Yunnan_Confirmed Zhejiang_Confirmed
7/1/20 991 923 582 363 164 1642 254 147 171 349 ... 320 792 713 198 595 198 1 76 185 1269
7/2/20 991 925 582 363 164 1642 254 147 171 349 ... 320 792 714 198 595 198 1 76 186 1269
7/3/20 991 926 582 363 164 1643 254 147 171 349 ... 320 792 715 198 595 198 1 76 186 1269

3 rows × 33 columns

Also, we need to standardize the date_time strings (in particular, the year should be represented as four digits instead of two), and then convert the index from string type to datetime type:

In [15]:
df0.index
Out[15]:
Index(['1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20',
       '1/29/20', '1/30/20', '1/31/20', '2/1/20',
       ...
       '6/24/20', '6/25/20', '6/26/20', '6/27/20', '6/28/20', '6/29/20',
       '6/30/20', '7/1/20', '7/2/20', '7/3/20'],
      dtype='object', length=163)
In [16]:
# represent the year with four digits, e.g. '1/22/20' -> '1/22/2020'
df0.index = df0.index.str.replace('/20', '/2020')
In [17]:
# repair dates whose day also begins with 20 (e.g. '1/20/20', which the
# previous replacement turned into '1/2020/2020')
df0.index = df0.index.str.replace('2020/2020', '20/2020')
In [18]:
df0.index = pd.to_datetime(df0.index, format='%m/%d/%Y', exact = False)
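
As an aside, the two string replacements in In [16] and In [17] could likely be avoided altogether: the %y directive parses a two-digit year directly, mapping '20' to 2020. A minimal self-contained sketch:

In [ ]:
# alternative sketch: parse two-digit-year strings without any replacement
pd.to_datetime(pd.Index(['1/22/20', '1/20/20']), format='%m/%d/%y')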

If the datetime conversion succeeded, use the following cell to validate the result and check the shape of the DataFrame and the time span its index covers.

In [19]:
print("Dataframe shape: ", df0.shape)
time_diff = (df0.index[-1] - df0.index[0])
print("Number of hours between start and end dates: ", time_diff.total_seconds()/3600 + 1)
Dataframe shape:  (163, 33)
Number of hours between start and end dates:  3889.0
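
A further check (hedged, assuming the conversion above) confirms the index is a contiguous daily range:

In [ ]:
# the converted index should increase monotonically in one-day steps
assert df0.index.is_monotonic_increasing
assert (df0.index.to_series().diff().dropna() == pd.Timedelta(days=1)).all()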

The following cells produce three different plots:

  1. Plotting all the time series on one axis (line-plot)
  2. Plotting them all on separate subplots to see them more clearly (sharing the x axis)
  3. Plotting all the time series on one axis (scatterplot)
In [20]:
df0.plot(figsize=(15,10.5), title='Plotting all the time series on one axis (line-plot)').legend(loc='upper left')
plt.axis(aspect='equal'); plt.xlabel('Date Time'); text = plt.ylabel('Num of Cases')
In [21]:
ax_array = df0.plot(subplots=True, figsize=(15,18))
for ax in ax_array:
    ax.legend(loc='upper left')
plt.axis(aspect='equal'); plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
text = plt.title('Plotting all time-series on separate subplots (sharing the x axis)', pad="-120")
In [22]:
df0.plot(y=list_map_output, style='.', figsize=(15,10.5),
         title="Plotting all time series on one axis (scatterplot)").legend(loc='upper left')
plt.axis(aspect='equal'); plt.xlabel('Date Time'); text = plt.ylabel('Num of Cases')

From the three plots shown above, we can tell that within Mainland China, Hubei province has the largest number of confirmed COVID-19 cases, followed by Guangdong and Zhejiang provinces.
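
This ranking can be double-checked numerically. A minimal sketch, assuming the df0 built above, sorts the provinces by their most recent confirmed counts:

In [ ]:
# rank provinces by the last row (most recent date) of the time series
df0.iloc[-1].sort_values(ascending=False).head(3)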

The Recovered and Death Cases in Mainland China

Now that the confirmed COVID-19 cases for each province in Mainland China have been charted above, let's define a function plot_per_country that helps plot other countries/regions, not only for the confirmed cases but also for the other categories (Recovered or Deaths).

In [23]:
def plot_per_country(df, country_name, category = "Confirmed", ref_df = df0):
    """to help us plot other countries/regions, not only for the confirmed cases, 
    but also other cases (Recovered or Deaths).
    """
    if 'Country/Region' in df.columns:
        # global dataset: province/state names are in the first column
        provinces_list = df[df['Country/Region'] == country_name].iloc[:,0:1].T.values.tolist()[0]
    else:
        # U.S. dataset: province/state names are in the seventh column
        provinces_list = df[df['Country_Region'] == country_name].iloc[:,6:7].T.values.tolist()[0]
        
    map_output = map(lambda x: x + '_' + category, provinces_list)
    list_map_output = list(map_output)

    if 'Country/Region' in df.columns:
        # global dataset: keep only the date columns
        df0 = df[df['Country/Region'] == country_name].iloc[:,5:].fillna(0)
    else:
        # U.S. dataset: the identifier columns occupy the first eleven positions
        df0 = df[df['Country_Region'] == country_name].iloc[:,11:].fillna(0)
    
    df0.index = pd.Index(list_map_output, name='date_time')
    df0 = df0.loc[:, ~df0.columns.str.contains('^Unnamed')]
    df0 = df0.T
    df0.index = df0.index.str.replace('/20', '/2020')
    df0.index = df0.index.str.replace('2020/2020', '20/2020')
    df0.index = pd.to_datetime(df0.index, format='%m/%d/%Y', exact = False)
    
    # scale the figure height with the number of plotted series
    width_multiplier = df0.shape[1]/5

    df0.plot(figsize=(15,2*width_multiplier), 
             title='Plotting all the time series on one axis (line-plot)').legend(loc='upper left')
    plt.axis(aspect='equal'); plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    
    ax_array = df0.plot(subplots=True, figsize=(15,3*width_multiplier))
    for ax in ax_array:
        ax.legend(loc='upper left')
    plt.axis(aspect='equal'); plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    text = plt.title('Plotting all time-series on separate subplots (sharing the x axis)', pad="-120")
    
    df0.plot(y=list_map_output, style='.', figsize=(15,2*width_multiplier),
         title="Plotting all time series on one axis (scatterplot)").legend(loc='upper left')
    plt.axis(aspect='equal'); plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    
    return df0
In [24]:
df_recovered_china = plot_per_country(df_recovered, "China", "Recovered")

The three plots shown above indicate that Hubei province also has the largest number of recovered COVID-19 cases, followed by Zhejiang and Hunan provinces.

In [25]:
df_death_china = plot_per_country(df_death, "China", "Death")
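
The same helper should also work for the summarized U.S. DataFrames built earlier, since their Country_Region / Province_State column layout matches the second branch of plot_per_country. A hedged usage sketch (the df_confirmed_us name is illustrative):

In [ ]:
# hedged usage sketch: chart confirmed cases per U.S. state with the helper
df_confirmed_us = plot_per_country(us_df_list[0], "US", "Confirmed")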