Time Series Analysis of the 2019 Novel Coronavirus Pandemic¶
This notebook performs analysis and time-series charting of the 2019 novel coronavirus disease (COVID-19) globally:
1. Import Data¶
The data source repository used here is created and maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, and the official maps can be viewed here. If you are having an issue accessing the Google Sheet, please try downloading the data source provided on their GitHub repo.
The csv files contain three types of cases - Confirmed, Recovered, and Deaths - reported inside and outside Mainland China, across the time span from 1/21/2020 to the current date.
Necessary Imports¶
from io import BytesIO
import requests
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
cases = ['confirmed', 'deaths', 'recovered']
sheet = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'
suffix = '_global.csv'
df_list = []
for i in range(len(cases)):
    ts_url = sheet + cases[i] + suffix
    df = pd.read_csv(ts_url, header=0, escapechar='\\')
    display(df.head(3))
    df_list.append(df)
    # also expose each DataFrame under its category name ('confirmed', 'deaths', 'recovered')
    exec("{0}=df".format(cases[i]))
# shape of matrices for confirmed, death, and recovered
df_list[0].shape, df_list[1].shape, df_list[2].shape
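As a quick consistency check (a minimal sketch, assuming df_list holds the confirmed, deaths, and recovered DataFrames in that order), we can confirm that the three global DataFrames share the same column layout and see the most recent reporting date:
# minimal consistency check, assuming df_list = [confirmed, deaths, recovered] as loaded above
same_columns = all(df_list[0].columns.equals(df.columns) for df in df_list[1:])
print("All three DataFrames share the same column layout:", same_columns)
print("Most recent reporting date:", df_list[0].columns[-1])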
U.S. Dataset¶
cases = ['confirmed', 'deaths']
sheet = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'
suffix = '_US.csv'
us_df_list = []
for i in range(len(cases)):
    us_ts_url = sheet + cases[i] + suffix
    df = pd.read_csv(us_ts_url, header=0, escapechar='\\')
    display(df.head(3))
    us_df_list.append(df)
    exec("{0}=df".format(cases[i]))
# shape of matrices for confirmed and deaths
us_df_list[0].shape, us_df_list[1].shape
Repair and summarize the U.S. Data¶
In the U.S. dataset, there can be one or more administrative regions (Admin2) per state. To summarize and simplify the dataset, the following function sum_all_admins_in_state is declared and used to sum all Admin2 rows within one state into a single record.
def sum_all_admins_in_state(df, state):
    # query all sub-records (admin regions) of the selected state
    tmp_df = df[df["Province_State"]==state]
    # create a new row that sums all statistics of this state;
    # the date_time columns of the new row receive the summed values of all sub-records
    sum_row = tmp_df.sum(axis=0)
    # assign constants to the metadata columns (UID, Admin2, FIPS, iso2, iso3, code3,
    # Country_Region, Province_State, Lat, Long_); note that Province_State is temporarily
    # renamed from the plain state name to state name + ", Sum".
    sum_row.loc['UID'] = "NaN"
    sum_row.loc['Admin2'] = "NaN"
    sum_row.loc['FIPS'] = "NaN"
    sum_row.loc['iso2'] = "US"
    sum_row.loc['iso3'] = "USA"
    sum_row.loc['code3'] = 840
    sum_row.loc['Country_Region'] = "US"
    sum_row.loc['Province_State'] = state + ", Sum"
    sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
    sum_row.loc['Long_'] = tmp_df['Long_'].values[0]
    # append the new row to the original DataFrame, and
    # remove the sub-records of the selected state.
    df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
    #display(df[df["Province_State"].str.contains(state + ", Sum")])
    df = df[df['Province_State'] != state]
    # rename the summed row back to the plain state name
    df.loc[df.Province_State == state+", Sum", 'Province_State'] = state
    return df
for i in range(2):
    usa_ts_df = us_df_list[i]
    for state in usa_ts_df.Province_State.unique():
        usa_ts_df = sum_all_admins_in_state(usa_ts_df, state)
    us_df_list[i] = usa_ts_df
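As a quick sanity check (a minimal sketch, assuming us_df_list has been repaired as above), we can confirm that each state now appears in only one row of the U.S. confirmed-cases DataFrame:
# sanity check, assuming us_df_list was repaired by sum_all_admins_in_state above
state_counts = us_df_list[0]['Province_State'].value_counts()
print("Number of states/territories:", state_counts.shape[0])
print("Any state still split across multiple rows?", bool((state_counts > 1).any()))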
As shown below, of the 266 rows of records in the first DataFrame parsed from the data source, 33 rows report cases for individual provinces in Mainland China.
df_confirmed = df_list[0]
print(df_confirmed[df_confirmed['Country/Region'] == 'China'].shape)
df_confirmed[df_confirmed['Country/Region'] == 'China'].head()
While some countries/regions, like China, are displayed with multiple rows, each representing the confirmed cases of one province with reported cases, others are listed in the DataFrame as a single row that sums confirmed cases across their states/provinces, such as the United States shown below:
df_confirmed_usa = df_confirmed[df_confirmed['Country/Region'] == 'US']
print(df_confirmed_usa.shape)
df_confirmed_usa.head()
Recovered Cases¶
Same here, for the purpose of comparison, we would need to get the figures of the other two categories - recovered
and deaths
besides confirmed
cases.
df_recovered = df_list[2]
df_recovered.tail(3)
Death Cases¶
df_death = df_list[1]
df_death.tail(3)
As we can tell from the snapshots of these DataFrames, cases are reported at three levels of geographic administrative units (see the quick check after this list):
- for countries/regions with significant numbers of confirmed/death/recovered cases (e.g. Mainland China), the number of cases is reported per province/state;
- for other countries/regions, the number of cases is summarized per country/region (e.g. Australia, or Canada);
- also, the global DataFrame lists Cruise Ship entries, since the Diamond Princess and several other cruise ships themselves account for a considerable number of confirmed/death cases.
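To see these reporting patterns directly (a minimal sketch, assuming df_confirmed is the global confirmed DataFrame defined above), we can count the number of rows per country/region:
# count rows per country/region in the global confirmed DataFrame (assumes df_confirmed from above)
rows_per_country = df_confirmed['Country/Region'].value_counts()
print(rows_per_country.head(5))                    # countries/regions reported per province/state
print((rows_per_country == 1).sum(), "countries/regions are reported as a single summary row")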
2. Parse the Time-Series Data (Mainland China)¶
Now that we have obtained a DataFrame for each type of coronavirus case across the globe, we will apply a filter to each of the DataFrames and analyze the Confirmed, Recovered, and Death cases separately, to see how the time series evolved inside Mainland China.
Time-Series of Confirmed Cases in Mainland China¶
First, a list called provinces_list needs to be extracted from the selected rows and then concatenated with the category suffix (e.g. _Confirmed), in order to differentiate it from the other two categories (e.g. _Recovered and _Deaths).
provinces_list = df_confirmed[df_confirmed['Country/Region'] == 'China'].iloc[:,0:1].T.values.tolist()[0]
map_output = map(lambda x: x + '_Confirmed', provinces_list)
list_map_output = list(map_output)
Next, let's remove the first five columns from the DataFrame (the row#, Province/State, Country/Region, Unnamed:2, and Unnamed:3 columns, which are not needed for time-series charting), assign list_map_output as the new index, and perform a Transpose so that the date_time index appears as the row index.
df0 = df_confirmed[df_confirmed['Country/Region'] == 'China'].iloc[:,5:].fillna(0)
df0.index = pd.Index(list_map_output, name='date_time')
df0 = df0.T
df0.tail(3)
Also, we need to standardize the date_time strings (in particular, the year should be represented as four digits rather than two), and then convert them from the string type to the datetime type:
df0.index
df0.index = df0.index.str.replace('/20', '/2020')
df0.index = df0.index.str.replace('2020/2020', '20/2020')
df0.index = pd.to_datetime(df0.index, format='%m/%d/%Y', exact = False)
If the datetime conversion is successful, use the following cell to validate and check how many rows of datetime records are in the dataframe.
print("Dataframe shape: ", df0.shape)
time_diff = (df0.index[-1] - df0.index[0])
print("Number of hours between start and end dates: ", time_diff.total_seconds()/3600 + 1)
The following cells produce three different plots:
- Plotting all the time series on one axis (line-plot)
- Plotting them all on separate subplots to see them more clearly (sharing the x axis)
- Plotting all the time series on one axis (scatterplot)
df0.plot(figsize=(15,10.5), title='Plotting all the time series on one axis (line-plot)').legend(loc='upper left')
plt.axis(aspect='equal'); plt.xlabel('Date Time'); text = plt.ylabel('Num of Cases')
ax_array = df0.plot(subplots=True, figsize=(15,18))
for ax in ax_array:
    ax.legend(loc='upper left')
plt.axis(aspect='equal'); plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
text = plt.title('Plotting all time-series on separate subplots (sharing the x axis)', pad="-120")
df0.plot(y=list_map_output, style='.', figsize=(15,10.5),
title="Plotting all time series on one axis (scatterplot)").legend(loc='upper left')
plt.axis(aspect='equal'); plt.xlabel('Date Time'); text = plt.ylabel('Num of Cases')
From the three plots shown above, we can tell that within Mainland China, Hubei province has the largest number of confirmed COVID-19 cases, followed by Guangdong and Zhejiang provinces.
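This ranking can also be confirmed numerically (a minimal sketch, assuming df0 is the transposed confirmed-cases DataFrame built above) by sorting the values of the most recent date:
# rank provinces by confirmed cases on the most recent date (assumes df0 from above)
latest = df0.iloc[-1].sort_values(ascending=False)
print(latest.head(3))   # the three provinces with the most confirmed cases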
The Recovered and Death Cases in Mainland China¶
Now that the confirmed COVID-19 cases for each province in Mainland China have been shown above, we will define a function plot_per_country to help plot other countries/regions, not only for the confirmed cases but also for the other categories (Recovered or Deaths).
def plot_per_country(df, country_name, category = "Confirmed", ref_df = df0):
    """to help us plot other countries/regions, not only for the confirmed cases,
    but also other cases (Recovered or Deaths).
    """
    if 'Country/Region' in df.columns:
        provinces_list = df[df['Country/Region'] == country_name].iloc[:,0:1].T.values.tolist()[0]
    else:
        provinces_list = df[df['Country_Region'] == country_name].iloc[:,6:7].T.values.tolist()[0]
    map_output = map(lambda x: x + '_' + category, provinces_list)
    list_map_output = list(map_output)
    if 'Country/Region' in df.columns:
        df0 = df[df['Country/Region'] == country_name].iloc[:,5:].fillna(0)
    else:
        df0 = df[df['Country_Region'] == country_name].iloc[:,11:].fillna(0)
    df0.index = pd.Index(list_map_output, name='date_time')
    df0 = df0.loc[:, ~df0.columns.str.contains('^Unnamed')]
    df0 = df0.T
    df0.index = df0.index.str.replace('/20', '/2020')
    df0.index = df0.index.str.replace('2020/2020', '20/2020')
    df0.index = pd.to_datetime(df0.index, format='%m/%d/%Y', exact = False)
    width_multiplier = df0.shape[1]/5
    df0.plot(figsize=(15,2*width_multiplier),
             title='Plotting all the time series on one axis (line-plot)').legend(loc='upper left')
    plt.axis(aspect='equal'); plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    ax_array = df0.plot(subplots=True, figsize=(15,3*width_multiplier))
    for ax in ax_array:
        ax.legend(loc='upper left')
    plt.axis(aspect='equal'); plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    text = plt.title('Plotting all time-series on separate subplots (sharing the x axis)', pad="-120")
    df0.plot(y=list_map_output, style='.', figsize=(15,2*width_multiplier),
             title="Plotting all time series on one axis (scatterplot)").legend(loc='upper left')
    plt.axis(aspect='equal'); plt.xlabel('Date Time'); plt.ylabel('Num of Cases')
    return df0
df_recovered_china = plot_per_country(df_recovered, "China", "Recovered")
The three plots shown above indicate that Hubei province also has the largest number of recovered COVID-19 cases, followed by Zhejiang and Hunan provinces.
df_death_china = plot_per_country(df_death, "China", "Death")