Predictive Analysis of the 2019 Novel Coronavirus Pandemic

Introduction

Dashboards, statistics, and other information about COVID-19 are all over the internet, and different countries or regions are adopting varied strategies, from complete lockdown to social distancing to herd immunity. You might be confused about which strategy is right and which information is valid. This notebook does not provide a final answer, but rather tools and methods that you can try yourself to model and analyze the data and predict the spread of COVID-19 with the ArcGIS API for Python and other libraries such as pandas and NumPy. Hopefully, given the workflow demonstrations, you will be able to find the basic facts, current patterns, and future trends behind the common notions about how COVID-19 spreads, from a dataset perspective [1,2,3,4].

Before we dive into data science and analytics, let's start with importing the necessary modules:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from arcgis.gis import GIS

Import and Understand Source Dataset

Among all the official and unofficial data sources on the web providing COVID-19 related data, one of the most widely used datasets today is the one provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), which can be accessed on GitHub under the name Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE [5,10,11,12]. The consolidated time-series data needed for all the analysis performed in this notebook fall into two categories (the corresponding raw-file URLs are sketched right after the list):

  1. Type: confirmed cases, deaths, and recovered cases;
  2. Geography: global, and the United States only.
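All of these tables are served as CSV files from the same folder of the GitHub repository, so the raw-file URLs used throughout this notebook share one base path. A minimal sketch (listing only the four files actually used later in this notebook):

base_url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
            'csse_covid_19_data/csse_covid_19_time_series/')
csv_urls = {
    'us_confirmed':     base_url + 'time_series_covid19_confirmed_US.csv',
    'global_confirmed': base_url + 'time_series_covid19_confirmed_global.csv',
    'global_deaths':    base_url + 'time_series_covid19_deaths_global.csv',
    'global_recovered': base_url + 'time_series_covid19_recovered_global.csv',
}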

Now let's first look at the U.S. dataset.

Time-series data for the United States

The dataset can be imported directly into DataFrames with the read_csv method in pandas. Rather than downloading the files manually and then reading them, it is preferable to use the URLs (which point to the CSV files archived on GitHub), because as the situation changes it becomes easier to reload the data and refresh the analysis.

Now, let's read the time-series data of confirmed COVID-19 cases in the United States from the GitHub source URL into a pandas DataFrame:

# read time-series csv
usa_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'
usa_ts_df = pd.read_csv(usa_ts_url, header=0, escapechar='\\')
usa_ts_df.head(5)
[usa_ts_df.head(5) output: 5 rows × 177 columns. Metadata columns (UID, iso2, iso3, code3, FIPS, Admin2, Province_State, Country_Region, Lat, Long_, ...) are followed by daily counts through 7/5/20; the first five rows are the territories American Samoa, Guam, Northern Mariana Islands, Puerto Rico, and Virgin Islands.]

usa_ts_df.columns
Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
       'Country_Region', 'Lat', 'Long_',
       ...
       '6/26/20', '6/27/20', '6/28/20', '6/29/20', '6/30/20', '7/1/20',
       '7/2/20', '7/3/20', '7/4/20', '7/5/20'],
      dtype='object', length=177)

As we can see from the printout of usa_ts_df.columns, the first 11 columns are location metadata - the 10 shown above ('UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Lat', 'Long_') plus one more hidden by the ellipsis - while the remaining columns are dates from 1/22/20 to the most current date on record.

date_list = usa_ts_df.columns.tolist()[11:]
date_list[0], date_list[-1]
('1/22/20', '7/5/20')

Repair and Summarize the state-wide time-series data

Look at the last five rows of the DataFrame usa_ts_df, and note that they are all records with Province_State == "Utah":

usa_ts_df.tail()
[usa_ts_df.tail() output: 5 rows × 177 columns. The last five rows are the Utah local health districts Central Utah, Southeast Utah, Southwest Utah, TriCounty, and Weber-Morgan, all with Province_State == "Utah".]

Some rows in the DataFrame follow a many-rows-to-one-state relationship, while others follow a one-row-to-one-state relationship. All records in usa_ts_df with Admin2 not equal to NaN fall into the many-rows-to-one-state category.

usa_ts_df[usa_ts_df["Admin2"].notna()].head()
[output: 5 rows × 177 columns. The first rows with a non-null Admin2 are the Alabama counties Autauga, Baldwin, Barbour, Bibb, and Blount.]

As shown in the output of the previous two cells, the usa_ts_df we created and parsed from the U.S. dataset contains multiple rows per state, each representing a different administrative area with reported cases. Next, we will define and use the function sum_all_admins_in_state() to summarize all administrative areas in a state into a single row.

def sum_all_admins_in_state(df, state):
    
    # query all sub-records of the selected state
    tmp_df = df[df["Province_State"]==state]
    
    # create a new row that sums all statistics of this state; the date columns
    # of the new row hold the summed values of all sub-records
    sum_row = tmp_df.sum(axis=0)
    
    # assign constants to the metadata columns; note that Province_State is set
    # to the state name + ", Sum" here and renamed back to the state name below.
    sum_row.loc['UID'] = "NaN"
    sum_row.loc['Admin2'] = "NaN"
    sum_row.loc['FIPS'] = "NaN"
    sum_row.loc['iso2'] = "US"
    sum_row.loc['iso3'] = "USA"
    sum_row.loc['code3'] = 840
    sum_row.loc['Country_Region'] = "US"
    sum_row.loc['Province_State'] = state + ", Sum"
    sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
    sum_row.loc['Long_'] = tmp_df['Long_'].values[0]
    
    # append the new row to the original DataFrame, and 
    # remove the sub-records of the selected country.
    df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
    #display(df[df["Province_State"].str.contains(state + ", Sum")])
    df=df[df['Province_State'] != state]
    df.loc[df.Province_State == state+", Sum", 'Province_State'] = state
    
    return df
# loop thru all states in the U.S.
for state in usa_ts_df.Province_State.unique():
    usa_ts_df = sum_all_admins_in_state(usa_ts_df, state)

Now, with sum_all_admins_in_state applied to all states, we expect every row in usa_ts_df to follow the one-row-to-one-state matching. Let's browse the last five rows of the DataFrame to validate the results.

usa_ts_df.tail()
[usa_ts_df.tail() output: 5 rows × 177 columns. Each state or territory is now a single summed row; the last five are West Virginia, Wisconsin, Wyoming, Diamond Princess, and Grand Princess.]
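As an extra check (a minimal sketch), we can confirm programmatically that each state or territory now appears in exactly one row:

# every Province_State value should now occur exactly once
assert usa_ts_df["Province_State"].value_counts().max() == 1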

Explore the state-wide time-series data

If you wonder in which state(s) the first COVID-19 case was confirmed and reported, use the cell below to check for the first occurrence - in this case, Washington State.

usa_ts_df_all_states = usa_ts_df.groupby('Province_State').sum()[date_list]
usa_ts_df_all_states[usa_ts_df_all_states['1/22/20']>0]
[output: 1 row × 166 columns. Washington is the only state with a non-zero value (1 confirmed case) on 1/22/20.]

Or, if you want to query the top 10 states with the highest numbers of confirmed cases as of the most recent date, run the following cell to display the query results.

usa_ts_df_all_states[date_list[-1]].sort_values(ascending = False).head(10)
Province_State
New York         397131
California       264681
Florida          200111
Texas            194932
New Jersey       173402
Illinois         147251
Massachusetts    109974
Arizona           98103
Georgia           95516
Pennsylvania      94403
Name: 7/5/20, dtype: int64

Compared to the top 10 states we collected on May 05, 2020, we can see that California has climbed to 2nd place, while New Jersey has dropped to 5th place.
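The May 05 snapshot below was produced the same way, by indexing that date's column instead of the latest one (a sketch; it assumes the '5/5/20' column is present in the frame):

usa_ts_df_all_states['5/5/20'].sort_values(ascending=False).head(10)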

Province_State
New York         321192
New Jersey       130593
Massachusetts     70271
Illinois          65889
California        58456
Pennsylvania      53864
Michigan          44451
Florida           37439
Texas             33912
Connecticut       30621
Name: 5/5/20, dtype: int64

The approach is quite similar if you want to query the top 10 states with the lowest numbers of confirmed cases; simply change the ascending argument from False to True:

usa_ts_df_all_states[date_list[-1]].sort_values(ascending = True).head(10)
Province_State
American Samoa                 0
Northern Mariana Islands      31
Diamond Princess              49
Grand Princess               103
Virgin Islands               111
Guam                         280
Hawaii                      1023
Alaska                      1134
Montana                     1212
Vermont                     1249
Name: 7/5/20, dtype: int64

For comparison, we can also check the difference against the statistics obtained on May 05, 2020 (below):

Province_State
American Samoa                0
Northern Mariana Islands     14
Diamond Princess             49
Virgin Islands               66
Grand Princess              103
Guam                        145
Alaska                      371
Montana                     456
Wyoming                     604
Hawaii                      625
Name: 5/5/20, dtype: int64

As shown above, from May to July the state with the highest number of confirmed cases remained New York, while American Samoa had the lowest (as of 07/05/2020). Also, if you are only interested in finding out which state has the most confirmed cases, rather than the top 10, you can run the cells below for the exact query result and its time series:

# state name, and the current number of confirmed
usa_ts_df_all_states[date_list[-1]].idxmax(), usa_ts_df_all_states[date_list[-1]].max()
('New York', 397131)
usa_ts_df[usa_ts_df['Province_State']=="New York"].sum()[date_list]
1/22/20         0
1/23/20         0
1/24/20         0
1/25/20         0
1/26/20         0
            ...  
7/1/20     394079
7/2/20     394954
7/3/20     395872
7/4/20     396598
7/5/20     397131
Length: 166, dtype: object

Map the confirmed cases per state

With the time-series DataFrame for the United States ready to use, we can map the number of confirmed cases reported per state in a time-enabled manner. The following cells create an animation running from the end of January to the most current date:

gis = GIS('home', verify_cert=False)
"""Confirmed Cases per State shown on map widget"""
map0 = gis.map("US")
map0
<IPython.core.display.Image object>
map0.basemap = "dark-gray"
map0.zoom = 4
map0.center = {  'spatialReference': {'latestWkid': 3857, 'wkid': 102100},
                 'x': -9452434.7482555,
                 'y': 4595401.402885098}
for d in date_list:
    map0.clear_graphics()
    map0.remove_layers()
    map0.legend=False
    
    df = usa_ts_df[['Province_State', "Lat", "Long_", d]]
    df.rename(columns={d: 'Confirmed'}, inplace=True)

    fc = gis.content.import_data(df,
                                 {"Address":"Province_State"})
    fset=fc.query()
    fset.sdf.spatial.plot(map_widget=map0,
                          renderer_type="c",
                          method="esriClassifyNaturalBreaks",
                          class_count=13,
                          col="Confirmed",
                          min_value=13,
                          cmap="coolwarm",
                          alpha=0.7)
                          
    map0.legend=True
    time.sleep(3)

Chart the top 20 states with highest confirmed cases

Besides visualizing how the epicenter shifted from the West Coast to the East Coast in the previous animation, we can also show the evolution of each state over time using a bar chart. Run the cell below to create an animation of how the top 20 states with the highest confirmed cases in the U.S. changed during the pandemic:

from IPython.display import clear_output
import matplotlib.pyplot as plot
"""Chart the top 20 states with highest confirmed cases"""
time.sleep(3)
for d in date_list:
    clear_output(wait=True)
    top_20_per_d = usa_ts_df.groupby('Province_State')[['Province_State', d]].sum().sort_values(by=d, ascending=False).head(20)
    top_20_per_d.plot(kind='barh', log=True, figsize=(8,6))
    plt.ylabel("Province_State", labelpad=14)
    plt.xlabel("# of Confirmed Cases (log=True)", labelpad=14)
    plt.title("Chart the confirmed cases per state", y=1.02)
    plt.show()
    time.sleep(1)
<IPython.core.display.Image object>

Time-series data at the country level around the world

Besides the U.S. dataset, we can also read the global time-series data of confirmed COVID-19 cases from its GitHub source URL into a pandas DataFrame:

world_confirmed_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
world_confirmed_ts_df = pd.read_csv(world_confirmed_ts_url, header=0, escapechar='\\')
world_confirmed_ts_df.head(5)
[world_confirmed_ts_df.head(5) output: 5 rows × 170 columns. The columns Province/State, Country/Region, Lat, and Long are followed by daily counts from 1/22/20 through 7/5/20; the first five rows are Afghanistan, Albania, Algeria, Andorra, and Angola.]

Splitting records by different matching types

Though most of the rows in the world_confirmed_ts_df DataFrame are country-level statistics (one-row-to-one-country matching), several countries are listed in the DataFrame as an aggregation of provinces or states (many-rows-to-one-country matching). For instance, as shown below, the confirmed cases for Australia are spread across 8 rows, one per state or territory.

world_confirmed_ts_df[world_confirmed_ts_df["Province/State"].notna()].head(8)
[output: 8 rows × 170 columns. Australia's confirmed cases are split across its eight states and territories: Australian Capital Territory, New South Wales, Northern Territory, Queensland, South Australia, Tasmania, Victoria, and Western Australia.]

Browsing the selected rows above, we come to the conclusion that five countries - Australia, Canada, China, Denmark, and France - use many-rows-to-one-country matching, and we need to summarize these rows into a single record if such a row does not already exist. The following cell identifies which of these countries already provide a one-row-to-one-country record in addition to their province-level rows.

for country in ["Australia", "Canada", "China", "Denmark", "France"]:
    df = world_confirmed_ts_df[world_confirmed_ts_df["Province/State"].isna()]
    df = df[df["Country/Region"]==country]
    if df.size:
        display(df)
[display output: two 1-row × 170-column frames. Denmark and France each already have a country-level row (Province/State is NaN) in addition to their province-level rows.]

To summarize, we now can split the records into three matching types:

  1. Countries with both many-rows-to-one-country matching and one-row-to-one-country matching data, e.g. Denmark, and France.
  2. Countries with only many-rows-to-one-country matching data, e.g. Australia, Canada, and China.
  3. Countries with only one-row-to-one-country matching data - rest of all countries in the DataFrame.

That being said, we only need to summarize and repair the data for countries of type #2, which are Australia, Canada, and China.
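As a cross-check (a sketch against the current snapshot of the data), the type #2 countries can also be derived programmatically instead of listing them by hand:

# countries that appear with province-level rows
with_provinces = set(world_confirmed_ts_df.loc[
    world_confirmed_ts_df["Province/State"].notna(), "Country/Region"])
# countries that already have a country-level row (Province/State is NaN)
with_country_row = set(world_confirmed_ts_df.loc[
    world_confirmed_ts_df["Province/State"].isna(), "Country/Region"])
sorted(with_provinces - with_country_row)
# expected here: ['Australia', 'Canada', 'China']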

Summarize and repair the time-series global confirmed cases

The method sum_all_provinces_in_country to be defined in the next cell performs these tasks:

  • query all sub-records of the selected country
  • create a new row which is to sum all statistics of this country, and assign the summed value of all sub-records to the date_time column of the new row
  • assign the constants to the ['Province/State', 'Country/Region', 'Lat', 'Long'] columns; note that the Country/Region column will be renamed from solely the country name to country name + ", Sum".
  • append the new row to the original DataFrame, and remove the sub-records of the selected country.

We will run the method against the subset of countries in type #2, which are Australia, Canada, and China.

def sum_all_provinces_in_country(df, country):
    
    # query all sub-records of the selected country
    tmp_df = df[df["Country/Region"]==country]
    
    # create a new row which is to sum all statistics of this country, and 
    # assign the summed value of all sub-records to the date_time column of the new row
    sum_row = tmp_df.sum(axis=0)
    
    # assign the constants to the ['Province/State', 'Country/Region', 'Lat', 'Long'] columns; 
    # note that the Country/Region column will be renamed from solely the country name to country name + ", Sum".
    sum_row.loc['Province/State'] = "NaN"
    sum_row.loc['Country/Region'] = country + ", Sum"
    sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
    sum_row.loc['Long'] = tmp_df['Long'].values[0]
    
    # append the new row to the original DataFrame, and 
    # remove the sub-records of the selected country.
    df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
    display(df[df["Country/Region"].str.contains(country + ", Sum")])
    df=df[df['Country/Region'] != country]
    df.loc[df['Country/Region']== country+", Sum", 'Country/Region'] = country
    
    return df
for country in ["Australia", "Canada", "China"]:
    world_confirmed_ts_df = sum_all_provinces_in_country(world_confirmed_ts_df, country)
[display output: three 1-row × 170-column frames showing the new "Australia, Sum", "Canada, Sum", and "China, Sum" rows, with 7/5/20 totals of 8,583, 107,394, and 84,871 confirmed cases respectively.]

Now with the data summarized and repaired, we are good to query for the top 10 countries with the highest numbers of confirmed cases around the globe. You can choose to list the entire time-series for these 10 countries, or just pick the country/region name, and the most current statistics.

world_confirmed_ts_df.sort_values(by = date_list[-1], ascending = False).head(10)
[output: 10 rows × 170 columns. The top 10 countries by confirmed cases on 7/5/20 are US, Brazil, India, Russia, Peru, Chile, United Kingdom, Mexico, Spain, and Italy; the simplified list below shows the counts.]

# a simplified list
world_confirmed_ts_df[['Country/Region', date_list[-1]]].sort_values(by = date_list[-1], ascending = False).head(10)
     Country/Region   7/5/20
206  US              2888635
20   Brazil          1603055
112  India            697413
168  Russia           680283
162  Peru             302718
29   Chile            295532
204  United Kingdom   285416
139  Mexico           256848
182  Spain            250545
118  Italy            241611

Compared to the top 10 countries with confirmed COVID-19 cases that we fetched and sorted on 03/23/2020 (Table 1 below), we can see how the list of countries has changed.

Country_Region    Confirmed  Recovered  Deaths
China                 81591      73280    3281
Italy                 69176       8326    6820
United States         53660          0     703
Spain                 39885       3794    2808
Germany               32986       3243     157
Iran                  24811       8913    1934
France                22622       3288    1102
Switzerland            9877        131     122
South Korea            9037       3507     120
United Kingdom         8164        140     423

Table 1. The top 10 countries with confirmed COVID-19 cases collected at 03/23/2020.

Map the confirmed cases per country

Similar to the previous approach of mapping the confirmed cases in the United States over time, here we visualize the confirmed cases per country/region along the timeline (from the end of January to the most current date). The animation shows the shift of epicenters from East Asia to North America.

As displayed in the legend, countries with the highest numbers of confirmed cases appear as red points, and the colormap changes from red to light blue and then to dark blue as the number of cases decreases.

"""Map the Confirmed Cases per country/region world-wide in the MapView"""
map1 = gis.map("Italy")
map1
<IPython.core.display.Image object>
map1.basemap = "dark-gray"
map1.zoom = 1
for d in date_list:
    map1.clear_graphics()
    map1.remove_layers()
    map1.legend=False
    
    df = world_confirmed_ts_df[['Country/Region', d]]
    df.rename(columns={d: 'Confirmed'}, inplace=True)
    fc = gis.content.import_data(df, 
                                 {"CountryCode":"Country_Region"},
                                 title="Confirmed cases per country of "+d)

    fset=fc.query()
    fset.sdf.spatial.plot(map_widget=map1,
                          renderer_type="c",
                          method="esriClassifyNaturalBreaks",
                          class_count=13,
                          col="Confirmed",
                          min_value=250,
                          cmap="coolwarm",
                          alpha=0.7)
    map1.legend=True
    time.sleep(3)
Chart the confirmed cases per country

Similarly, let us plot the top 20 countries with the largest numbers of confirmed cases as bar charts (in a time-enabled manner):

"""Chart the top 20 countries with highest confirmed cases"""
time.sleep(3)
for d in date_list:
    clear_output(wait=True)
    top_20_per_d = world_confirmed_ts_df.groupby('Country/Region')[['Country/Region', d]].sum().sort_values(by=d, ascending=False).head(20)
    top_20_per_d.plot(kind='barh', log=True, figsize=(8,6))
    plt.ylabel("Country/Region", labelpad=14)
    plt.xlabel("# of Confirmed Cases (log=True)", labelpad=14)
    plt.title("Chart the confirmed cases per country/region", y=1.02)
    plt.show()
    time.sleep(1)
<IPython.core.display.Image object>

Summarize and repair the time-series COVID-19 deaths globally

Similar to what we did in the previous section, we will apply the same approach to fetch, parse, and repair the global COVID-19 deaths data, which includes these steps:

  1. fetch and read the online CSV source into DataFrame
  2. summarize and repair the DataFrame
  3. list the top 10 countries
world_deaths_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
world_deaths_ts_df = pd.read_csv(world_deaths_ts_url, header=0, escapechar='\\')
world_deaths_ts_df.head(5)
[world_deaths_ts_df.head(5) output: 5 rows × 170 columns. Same structure as the confirmed-cases frame, with daily death counts; the first five rows are Afghanistan, Albania, Algeria, Andorra, and Angola.]

for country in ["Australia", "Canada", "China"]:
    world_deaths_ts_df = sum_all_provinces_in_country(world_deaths_ts_df, country)
[display output: the summed "Australia, Sum", "Canada, Sum", and "China, Sum" rows, with 7/5/20 death totals of 106, 8,739, and 4,641 respectively.]

Now with the deaths statistics summarized and repaired, we are good to query for the top 10 countries with the highest numbers of COVID-19 deaths around the globe.

world_deaths_ts_df.sort_values(by = date_list[-1], ascending = False).head(10)
[output: 10 rows × 170 columns. The top 10 countries by COVID-19 deaths on 7/5/20 are US, Brazil, United Kingdom, Italy, Mexico, France, Spain, India, Iran, and Peru; the simplified list below shows the counts.]

# a simplified view
world_deaths_ts_df[['Country/Region', date_list[-1]]].sort_values(by = date_list[-1], ascending = False).head(10)
     Country/Region   7/5/20
206  US               129947
20   Brazil            64867
204  United Kingdom    44220
118  Italy             34861
139  Mexico            30639
97   France            29813
182  Spain             28385
112  India             19693
114  Iran              11571
162  Peru              10589

Summarize and repair the time-series global recovered cases

Like what we did in the previous sections, we will apply the same approach to fetch, parse, and repair the global recovered-cases data, which includes these steps:

  1. fetch and read the online CSV source into DataFrame
  2. summarize and repair the DataFrame
  3. list the top 10 countries
world_recovered_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'
world_recovered_ts_df = pd.read_csv(world_recovered_ts_url, header=0, escapechar='\\')
world_recovered_ts_df.head(5)
[world_recovered_ts_df.head(5) output: 5 rows × 170 columns. Same structure, with daily recovered counts; the first five rows are Afghanistan, Albania, Algeria, Andorra, and Angola.]

for country in ["Australia", "Canada", "China"]:
    world_recovered_ts_df = sum_all_provinces_in_country(world_recovered_ts_df, country)
[display output: the summed "Australia, Sum", "Canada, Sum", and "China, Sum" rows, with 7/5/20 recovered totals of 7,420, 70,772, and 79,718 respectively.]

Now with the statistics summarized and repaired, we are good to query for the top 10 countries with the highest numbers of COVID-19 recovered cases around the globe.

world_recovered_ts_df.sort_values(by = date_list[-1], ascending = False).head(10)
[output: 10 rows × 170 columns. The top 10 countries by recovered cases on 7/5/20 are Brazil, US, Russia, India, Chile, Iran, Mexico, Peru, Italy, and Germany; the simplified list below shows the counts.]

# a simplified view
world_recovered_ts_df[['Country/Region', date_list[-1]]].sort_values(by = date_list[-1], ascending = False).head(10)
     Country/Region   7/5/20
21   Brazil          1029045
216  US               906763
175  Russia           449995
116  India            424433
30   Chile            261039
118  Iran             201330
145  Mexico           199914
169  Peru             193957
122  Italy            192108
103  Germany          181719

Having explored, summarized, and analyzed the existing data, let us now model it with Polynomial Regression, Neural Network, and SIR epidemic models and try to predict the count of cases in the upcoming days.

Predictive Analysis

Before performing the predictive analysis using the time-series data for the United States, we first need to transpose the time-series matrix and map the date fields to the number of days since the first reported case. In the following, we perform this transformation three times to create three analysis-ready DataFrames, namely new_usa_confirmed_df, new_usa_deaths_df, and new_usa_recovered_df [15,16,17,18].

usa_overall_confirmed_ts_df = world_confirmed_ts_df[world_confirmed_ts_df["Country/Region"] == "US"]
new_usa_confirmed_df = usa_overall_confirmed_ts_df[date_list].T
new_usa_confirmed_df.columns = ["confirmed"]
new_usa_confirmed_df = new_usa_confirmed_df.assign(days=[1 + 
                                               i for i in range(len(new_usa_confirmed_df))])[['days'] + 
                                               new_usa_confirmed_df.columns.tolist()]
new_usa_confirmed_df
         days  confirmed
1/22/20     1          1
1/23/20     2          1
1/24/20     3          2
1/25/20     4          2
1/26/20     5          5
...       ...        ...
6/30/20   161    2636414
7/1/20    162    2687588
7/2/20    163    2742049
7/3/20    164    2794153
7/4/20    165    2839436

165 rows × 2 columns

usa_overall_deaths_ts_df = world_deaths_ts_df[world_deaths_ts_df["Country/Region"] == "US"]
new_usa_deaths_df = usa_overall_deaths_ts_df[date_list].T
new_usa_deaths_df.columns = ["deaths"]
new_usa_deaths_df = new_usa_deaths_df.assign(days=[1 +
                                                   i for i in range(len(new_usa_deaths_df))])[['days'] + 
                                                   new_usa_deaths_df.columns.tolist()]
new_usa_deaths_df
         days  deaths
1/22/20     1       0
1/23/20     2       0
1/24/20     3       0
1/25/20     4       0
1/26/20     5       0
...       ...     ...
6/30/20   161  127432
7/1/20    162  128105
7/2/20    163  128803
7/3/20    164  129434
7/4/20    165  129676

165 rows × 2 columns

usa_overall_recovered_ts_df = world_recovered_ts_df[world_recovered_ts_df["Country/Region"] == "US"]
new_usa_recovered_df = usa_overall_recovered_ts_df[date_list].T
new_usa_recovered_df.columns = ["recovered"]
new_usa_recovered_df = new_usa_recovered_df.assign(days=[1 +
                                                         i for i in range(len(new_usa_recovered_df))])[['days'] + 
                                                         new_usa_recovered_df.columns.tolist()]
new_usa_recovered_df
         days  recovered
1/22/20     1          0
1/23/20     2          0
1/24/20     3          0
1/25/20     4          0
1/26/20     5          0
...       ...        ...
6/30/20   161     720631
7/1/20    162     729994
7/2/20    163     781970
7/3/20    164     790404
7/4/20    165     894325

165 rows × 2 columns
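The three cells above repeat the same reshaping. A small helper (a sketch using only names already defined in this notebook) makes that explicit and could replace them:

def to_days_df(ts_df, value_name):
    """Transpose a one-row time series into a (days, value) DataFrame."""
    out = ts_df[date_list].T
    out.columns = [value_name]
    out["days"] = np.arange(1, len(out) + 1)
    return out[["days", value_name]]

# e.g. new_usa_confirmed_df = to_days_df(usa_overall_confirmed_ts_df, "confirmed")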

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

class PolynomialRegressionModel:

    def __init__(self, model_name, polynomial_degree):
        self.__model_name = model_name
        self.__polynomial_degree = polynomial_degree
        self.__model = None

    def train(self, x, y):
        polynomial_features = PolynomialFeatures(degree=self.__polynomial_degree)
        x_poly = polynomial_features.fit_transform(x)
        self.__model = LinearRegression()
        self.__model.fit(x_poly, y)

    def get_predictions(self, x):
        polynomial_features = PolynomialFeatures(degree=self.__polynomial_degree)
        x_poly = polynomial_features.fit_transform(x)
        return np.round(self.__model.predict(x_poly), 0).astype(np.int32)

    def get_model_polynomial_str(self):
        coef = self.__model.coef_
        intercept = self.__model.intercept_
        poly = "{0:.3f}".format(intercept)

        for i in range(1, len(coef)):
            if coef[i] >= 0:
                poly += " + "
            else:
                poly += " - "
            poly += "{0:.3f}".format(coef[i]).replace("-", "") + "X^" + str(i)

        return poly
training_set = new_usa_confirmed_df
x = np.array(training_set["days"]).reshape(-1, 1)
y = training_set["confirmed"]
training_set_deaths = new_usa_deaths_df
x_deaths = np.array(training_set_deaths["days"]).reshape(-1, 1)
y_deaths = training_set_deaths["deaths"]

Polynomial Regression

Let us first try to fit the data with a Polynomial Regression model, since the time-series chart of confirmed cases looks like an exponential curve:

Confirmed Cases with PR model

degrees = 2
regression_model = PolynomialRegressionModel("Cases using Polynomial Regression", 2)
regression_model.train(x, y)
y_pred = regression_model.get_predictions(x)
y_pred
array([ -68133,  -70086,  -71795,  -73260,  -74481,  -75456,  -76188,
        -76674,  -76917,  -76915,  -76668,  -76177,  -75442,  -74462,
        -73237,  -71768,  -70055,  -68097,  -65894,  -63447,  -60756,
        -57820,  -54640,  -51215,  -47546,  -43632,  -39474,  -35071,
        -30424,  -25532,  -20396,  -15016,   -9390,   -3521,    2593,
          8952,   15555,   22402,   29494,   36831,   44412,   52237,
         60307,   68621,   77180,   85983,   95031,  104323,  113860,
        123641,  133667,  143937,  154452,  165211,  176214,  187463,
        198955,  210692,  222674,  234900,  247370,  260085,  273045,
        286248,  299697,  313390,  327327,  341509,  355935,  370606,
        385521,  400681,  416085,  431734,  447627,  463764,  480147,
        496773,  513644,  530760,  548120,  565724,  583573,  601667,
        620005,  638587,  657414,  676485,  695801,  715361,  735166,
        755216,  775509,  796048,  816830,  837857,  859129,  880645,
        902406,  924411,  946660,  969155,  991893, 1014876, 1038104,
       1061576, 1085292, 1109253, 1133458, 1157908, 1182602, 1207541,
       1232725, 1258152, 1283825, 1309741, 1335903, 1362308, 1388958,
       1415853, 1442992, 1470376, 1498004, 1525876, 1553994, 1582355,
       1610961, 1639811, 1668906, 1698246, 1727830, 1757658, 1787731,
       1818048, 1848610, 1879416, 1910467, 1941762, 1973302, 2005086,
       2037115, 2069388, 2101906, 2134668, 2167675, 2200926, 2234421,
       2268161, 2302146, 2336375, 2370848, 2405566, 2440528, 2475735,
       2511187, 2546883, 2582823, 2619008, 2655437, 2692111, 2729029,
       2766192, 2803599, 2841250, 2879147])
def print_forecast(model_name, model, beginning_day=0, limit=10):

    next_days_x = np.array(range(beginning_day, beginning_day + limit)).reshape(-1, 1)
    next_days_pred = model.get_predictions(next_days_x)

    print("The forecast for " + model_name + " in the following " + str(limit) + " days is:")
    for i in range(0, limit):
        print("Day " + str(i + 1) + ": " + str(next_days_pred[i]))
print_forecast("Cases using Polynomial Regression", regression_model, 
               beginning_day=len(x), 
               limit=10)
The forecast for Cases using Polynomial Regression in the following 10 days is:
Day 1: 2879147
Day 2: 2917287
Day 3: 2955672
Day 4: 2994302
Day 5: 3033176
Day 6: 3072294
Day 7: 3111657
Day 8: 3151265
Day 9: 3191117
Day 10: 3231213
import operator

def plot_graph(model_name, x, y, y_pred):

    plt.scatter(x, y, s=10)
    sort_axis = operator.itemgetter(0)
    sorted_zip = sorted(zip(x, y_pred), key=sort_axis)
    x, y_pred = zip(*sorted_zip)

    plt.plot(x, y_pred, color='m')
    plt.title("Amount of " + model_name + " in each day")
    plt.xlabel("Day")
    plt.ylabel(model_name)
    plt.show()
plot_graph("Cases using Polynomial Regression", x, y, y_pred)
<Figure size 432x288 with 1 Axes>
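Rather than judging the fit only by eye, we can also quantify it. A minimal sketch using scikit-learn's metrics (scikit-learn is already imported above for the regression itself):

from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y, y_pred))
print(f"RMSE: {rmse:,.0f}   R^2: {r2_score(y, y_pred):.4f}")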

The approximation between the fitted and the actual curves does not look ideal. We can increase the degree of the Polynomial Regression model to 3 and see what happens:

degrees = 3
regression_model = PolynomialRegressionModel("Confirmed cases using Polynomial Regression (d=3)", 3)
regression_model.train(x, y)
y_pred = regression_model.get_predictions(x)
print_forecast("Confirmed cases using Polynomial Regression (d=3)", regression_model, 
               beginning_day=len(x), 
               limit=10)
plot_graph("Confirmed cases using Polynomial Regression (d=3)", x, y, y_pred)
The forecast for Confirmed cases using Polynomial Regression (d=3) in the following 10 days is:
Day 1: 2664083
Day 2: 2685999
Day 3: 2707664
Day 4: 2729074
Day 5: 2750221
Day 6: 2771100
Day 7: 2791706
Day 8: 2812031
Day 9: 2832070
Day 10: 2851818
<Figure size 432x288 with 1 Axes>

Though the approximation between the fitted curve and the actual data improved for days < 80, the two curves diverge toward days = 100.

The same approach can be done with the death cases using Polynomial Regression model (degree=2, and 3):

Deaths with PR

degrees = 2
regression_model = PolynomialRegressionModel("Deaths using Polynomial Regression", 2)
regression_model.train(x_deaths, y_deaths)
y_deaths_pred = regression_model.get_predictions(x_deaths)
print_forecast("Deaths using Polynomial Regression", regression_model, 
               beginning_day=len(x_deaths), 
               limit=10)
plot_graph("Deaths using Polynomial Regression", x_deaths, y_deaths, y_deaths_pred)
The forecast for Deaths using Polynomial Regression in the following 10 days is:
Day 1: 152437
Day 2: 154332
Day 3: 156237
Day 4: 158153
Day 5: 160081
Day 6: 162020
Day 7: 163970
Day 8: 165930
Day 9: 167902
Day 10: 169885
<Figure size 432x288 with 1 Axes>
regression_model = PolynomialRegressionModel("Deaths using Polynomial Regression (d=3)", 3)
regression_model.train(x_deaths, y_deaths)
y_deaths_pred = regression_model.get_predictions(x_deaths)
print_forecast("Deaths using Polynomial Regression (d=3)", regression_model, 
               beginning_day=len(x_deaths), 
               limit=10)
plot_graph("Deaths using Polynomial Regression (d=3)", x_deaths, y_deaths, y_deaths_pred)
The forecast for Deaths using Polynomial Regression (d=3) in the following 10 days is:
Day 1: 129623
Day 2: 129797
Day 3: 129928
Day 4: 130018
Day 5: 130065
Day 6: 130069
Day 7: 130029
Day 8: 129945
Day 9: 129815
Day 10: 129639
<Figure size 432x288 with 1 Axes>

Since the model fitting results obtained so far with Polynomial Regression (degrees 2 and 3) are not ideal, let us turn to Neural Network modeling and see whether its results are better:

Neural Network

Confirmed Cases with NN

from sklearn.neural_network import MLPRegressor

class NeuralNetModel:

    def __init__(self, model_name):
        self.__model_name = model_name
        self.__model = None

    def train(self, x, y, hidden_layer_sizes=[10,], learning_rate=0.001, max_iter=2000):
        self.__model = MLPRegressor(solver="adam", activation="relu", alpha=1e-5, random_state=0, 
                                    hidden_layer_sizes=hidden_layer_sizes, verbose=False, tol=1e-5, 
                                    learning_rate_init=learning_rate, max_iter=max_iter)
        self.__model.fit(x, y)

    def get_predictions(self, x):
        return np.round(self.__model.predict(x), 0).astype(np.int32)
neural_net_model = NeuralNetModel("Confirmed Cases using Neural Network")
neural_net_model.train(x, y, [80, 80], 0.001, 50000)
y_pred = neural_net_model.get_predictions(x)
y_pred
array([  -2452,    -386,    1681,    3748,    5812,    6725,    5716,
          4642,    3569,    2495,    1421,     347,    -726,   -1800,
         -2874,   -3947,   -2357,   -2283,   -2228,   -2177,   -2213,
         -2250,   -2287,   -2323,   -2360,   -2397,   -2434,   -2470,
         -2507,   -2544,   -2580,   -2617,   -2654,   -2691,   -2727,
         -2764,   -2801,   -2837,   -2874,   -2911,   -2948,   -2984,
         -3021,   -2152,   -1039,      74,    1187,    2300,    3414,
          4527,    5640,    6753,    7866,    8979,   10092,   11205,
         12318,   13432,   14545,   15658,   28702,   49546,   70389,
         95895,  121508,  147121,  172734,  198347,  223961,  249574,
        275187,  300800,  326414,  352027,  377640,  403253,  428867,
        454480,  480093,  505706,  531320,  556933,  582546,  608159,
        633772,  659386,  684999,  710612,  736225,  761839,  787452,
        813065,  838678,  864292,  889905,  915518,  941131,  966745,
        992358, 1017971, 1043584, 1069197, 1094811, 1120424, 1146037,
       1171650, 1197264, 1222877, 1248490, 1274103, 1299717, 1325330,
       1350943, 1376556, 1402170, 1427783, 1453396, 1479009, 1504622,
       1530236, 1555849, 1581462, 1607075, 1632689, 1658302, 1683915,
       1709528, 1735142, 1760755, 1786368, 1811981, 1837595, 1863208,
       1888821, 1914434, 1940047, 1965661, 1991274, 2016887, 2042500,
       2068114, 2093727, 2119340, 2144953, 2170567, 2196180, 2221793,
       2247406, 2273019, 2298633, 2324246, 2349859, 2375472, 2401086,
       2426699, 2452312, 2477925, 2503539, 2529152, 2554765, 2580378,
       2605992, 2631605, 2657218, 2682831])
print_forecast("Confirmed Cases using Neural Network", neural_net_model, 
               beginning_day=len(x), 
               limit=10)
The forecast for Confirmed Cases using Neural Network in the following 10 days is:
Day 1: 2682831
Day 2: 2708305
Day 3: 2732643
Day 4: 2756752
Day 5: 2780861
Day 6: 2804970
Day 7: 2829079
Day 8: 2853188
Day 9: 2877298
Day 10: 2901407
plot_graph("Confirmed Cases using Neural Network", x, y, y_pred)
<Figure size 432x288 with 1 Axes>

Deaths with NN

neural_net_model = NeuralNetModel("Deaths using Neural Network")
neural_net_model.train(x_deaths, y_deaths, [100, 100], 0.001, 50000)
y_deaths_pred = neural_net_model.get_predictions(x_deaths)
print_forecast("Deaths using Neural Network", neural_net_model, 
               beginning_day=len(x_deaths), 
               limit=10)
plot_graph("Deaths using Neural Network", x_deaths, y_deaths, y_deaths_pred)
The forecast for Deaths using Neural Network in the following 10 days is:
Day 1: 133677
Day 2: 134568
Day 3: 135459
Day 4: 136350
Day 5: 137241
Day 6: 138132
Day 7: 139023
Day 8: 139914
Day 9: 140805
Day 10: 141696
<Figure size 432x288 with 1 Axes>

For Neural Network modeling, the approximation between the fitted and the actual curves looks much better than that of Polynomial Regression.

To summarize, the numbers of confirmed cases and deaths predicted for the next 10 days (as of EOD 07/14/2020) in the United States differ across the models:

Model      Confirmed   Deaths    Death Rate
PR (d=2)   3,231,213   169,885   5.26%
PR (d=3)   2,851,818   129,639   4.55%
NN         2,901,407   141,696   4.88%

Table 2. The predicted numbers for the confirmed cases and deaths for the next 10 days (as of EOD 07/14/2020).
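The Death Rate column is simply the predicted deaths divided by the predicted confirmed cases. As a quick check of the arithmetic (using the values from Table 2):

predicted = {
    'PR (d=2)': (3231213, 169885),
    'PR (d=3)': (2851818, 129639),
    'NN':       (2901407, 141696),
}
for model, (confirmed, deaths) in predicted.items():
    print(f"{model}: {deaths / confirmed:.2%}")
# PR (d=2): 5.26%,  PR (d=3): 4.55%,  NN: 4.88%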

SIR Model

The Polynomial Regression and Neural Network models used in the previous section did provide reasonable predictions of the confirmed and death numbers for the upcoming days. However, these models do not consider the nature of infectious disease behind the figures. Next, we will look at how compartmental modeling techniques can be used to model infectious diseases (in this case, COVID-19); the simplest compartmental model is the SIR model. You can check out the description of the model and its building blocks on its Wikipedia page. The SIR model is reasonably predictive for infectious diseases that are transmitted from human to human and where recovery confers lasting resistance, such as measles, mumps, and rubella [6,7,8,9].

As we can tell from the name of the model, it consists of three compartments: S for the number of susceptible, I for the number of infectious, and R for the number of recovered or deceased (or immune) individuals. Each member of the population typically progresses from susceptible to infectious to recovered.

dS/dt = -β*S*I
dI/dt = β*S*I - γ*I
dR/dt = γ*I

Equation 1. The SIR model is an ordinary differential equation model and can be described by the equations above (source: ASU Public Course).

The parameters used in the equations include the following (a code sketch of the corresponding derivative function follows this list):

  • The infectious rate, β, controls the rate of spread and represents the probability of transmitting the disease between a susceptible and an infectious individual.
  • D, the average duration of the infection, in days.
  • The recovery rate, γ = 1/D, is determined by the average duration.
  • N, the total population of the study.
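In code, these equations translate directly into a derivative function of the same shape as the SIR function defined inside the learner later in this notebook (a minimal sketch; beta and gamma are assumed given, and the social-distancing factor rho is omitted here):

def sir_derivatives(t, y, beta, gamma):
    # y holds the current sizes of the three compartments
    S, I, R = y
    dS_dt = -beta * S * I                # susceptible individuals becoming infected
    dI_dt = beta * S * I - gamma * I     # new infections minus recoveries
    dR_dt = gamma * I                    # recoveries (or deaths)
    return [dS_dt, dI_dt, dR_dt]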

Basic Reproduction Number

One of the critical questions that epidemiologists ask during each disease outbreak is how far and how fast a virus can spread through a population, and the basic reproduction number (a.k.a. the R naught, or R0) is often used to answer it.

R0 refers to "how many other people one sick person is likely to infect on average in a group that's susceptible to the disease (meaning they don’t already have immunity from a vaccine or from fighting off the disease before)", and is "super important in the context of public health because it foretells how big an outbreak will be. The higher the number, the greater likelihood a large population will fall sick" [1].

Another frequently asked question is how contagious or deadly a virus is; together with R0, these depict the most important properties of a virus. Take measles as an example: as the most contagious virus researchers know of, measles can linger in the air of a closed environment and sicken people up to two hours after an infected person who coughed or sneezed there has left. The R0 of measles can be as high as 18 if the people exposed to the virus aren't vaccinated. Compare that to Ebola, which is more deadly but much less efficient at spreading: the R0 of Ebola is typically just 2, which in part may be because many infected individuals die before they can pass the virus to someone else.

How is R0 computed?

In a basic scenario, the basic reproduction number R0 equals

R0 = β/γ = β*D
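As a quick worked example with hypothetical numbers (not taken from the data above): if β = 0.5 per day and the average infection lasts D = 4 days, then γ = 1/D = 0.25 and R0 = β/γ = β*D = 2.0:

beta, D = 0.5, 4      # hypothetical values
gamma = 1 / D         # recovery rate
print(beta / gamma)   # 2.0, i.e. beta * D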

However, in a fully susceptible population, R0 is the number of secondary infections generated by the first infectious individual over the course of the infectious period, which is defined as

R0 = S*L*β

in which S = the number of susceptible hosts, L = length of infection, and β = transmissibility.

How to understand R0?

According to this Healthline article, there are three possibilities for the potential spread or decline of a disease, depending on its R0 value:

  • If less than 1, each existing infection causes less than one new infection, and the disease will decline and eventually die out.
  • If equal to 1, each existing infection causes one new infection, and the disease will stay alive and stable, but there will not be an outbreak or an epidemic.
  • If more than 1, each existing infection causes more than one new infection, and the disease will spread between people, and there may be an outbreak or epidemic.

Fit the SIR Model per country

In order to fit the SIR model to the actual numbers of confirmed cases (and the recovered and death tolls), we need to estimate the β and γ parameters. In the process, we will use the solve_ivp function from the scipy module to solve the ordinary differential equations of the SIR model.

from scipy.integrate import solve_ivp
from scipy.optimize import minimize
from datetime import timedelta, datetime

"""The dict defined here includes the countries to run the model against,
   and the date that the first confirmed case was reported.
"""
START_DATE = {
  'China': '1/22/20',
  'Japan': '1/22/20',
  'Korea, South': '1/22/20',
  'US': '1/22/20',
  'Italy': '1/31/20',
  'Spain': '2/1/20',
  'Iran': '2/19/20'
}
# Adding into the consideration the social distancing factor (to be explained in the next section)
rho = 1
class Learner(object):
    """constructs an SIR model learner to load training set, train the model,
       and make predictions at country level.
    """
    def __init__(self, country, loss, start_date, predict_range,s_0, i_0, r_0):
        self.country = country
        self.loss = loss
        self.start_date = start_date
        self.predict_range = predict_range
        self.s_0 = s_0
        self.i_0 = i_0
        self.r_0 = r_0

    def load_confirmed(self, country):
      """
      Load confirmed cases downloaded from pre-made dataframe
      """
      country_df = world_confirmed_ts_df[world_confirmed_ts_df["Country/Region"] == country]
      return country_df.iloc[0].loc[START_DATE[country]:]

    def load_dead(self, country):
      """
      Load deaths downloaded from pre-made dataframe
      """
      country_df = world_deaths_ts_df[world_deaths_ts_df["Country/Region"] == country]
      return country_df.iloc[0].loc[START_DATE[country]:]

    def load_recovered(self, country):
      """
      Load recovered cases downloaded from pre-made dataframe
      """
      country_df = world_recovered_ts_df[world_recovered_ts_df["Country/Region"] == country]
      return country_df.iloc[0].loc[START_DATE[country]:]

    def extend_index(self, index, new_size):
        values = index.values
        current = datetime.strptime(index[-1], '%m/%d/%y')
        while len(values) < new_size:
            current = current + timedelta(days=1)
            values = np.append(values, datetime.strftime(current, '%m/%d/%y'))
        return values

    def predict_0(self, beta, gamma, data):
        """
        Simplified version.
        Predict how the number of people in each compartment can be changed through time toward the future.
        The model is formulated with the given beta and gamma.
        """
        predict_range = 150
        new_index = self.extend_index(data.index, predict_range)
        size = len(new_index)
        def SIR(t, y):
            S = y[0]
            I = y[1]
            R = y[2]
            return [-rho*beta*S*I, rho*beta*S*I-gamma*I, gamma*I]
        extended_actual = np.concatenate((data.values, [None] * (size - len(data.values))))
        return new_index, extended_actual, solve_ivp(SIR, [0, size], [self.s_0, self.i_0, self.r_0], t_eval=np.arange(0, size, 1))
    
    def predict(self, beta, gamma, data, recovered, death, country, s_0, i_0, r_0):
        """
        Predict how the number of people in each compartment can be changed through time toward the future.
        The model is formulated with the given beta and gamma.
        """
        new_index = self.extend_index(data.index, self.predict_range)
        size = len(new_index)

        def SIR(t, y):
            S = y[0]
            I = y[1]
            R = y[2]
            return [-rho*beta*S*I, rho*beta*S*I-gamma*I, gamma*I]

        extended_actual = np.concatenate((data.values, [None] * (size - len(data.values))))
        extended_recovered = np.concatenate((recovered.values, [None] * (size - len(recovered.values))))
        extended_death = np.concatenate((death.values, [None] * (size - len(death.values))))
        solved = solve_ivp(SIR, [0, size], [s_0,i_0,r_0], t_eval=np.arange(0, size, 1))
        return new_index, extended_actual, extended_recovered, extended_death, solved

    def train_0(self):
        """
        Simplified version.
        Run the optimization to estimate the beta and gamma fitting the given confirmed cases.
        """
        data = self.load_confirmed(self.country)
        optimal = minimize(
            loss,
            [0.001, 0.001],
            args=(data),
            method='L-BFGS-B',
            bounds=[(0.00000001, 0.4), (0.00000001, 0.4)]
        )
        beta, gamma = optimal.x
        new_index, extended_actual, prediction = self.predict_0(beta, gamma, data)
        df = pd.DataFrame({
            'Actual': extended_actual,
            'S': prediction.y[0],
            'I': prediction.y[1],
            'R': prediction.y[2]
        }, index=new_index)
        fig, ax = plt.subplots(figsize=(15, 10))
        ax.set_title(self.country)
        df.plot(ax=ax)
        fig.savefig(f"{self.country}.png")
        
    def train(self):
        """
        Run the optimization to estimate the beta and gamma fitting the given confirmed cases.
        """
        recovered = self.load_recovered(self.country)
        death = self.load_dead(self.country)
        data = (self.load_confirmed(self.country) - recovered - death)

        optimal = minimize(loss, [0.001, 0.001], 
                           args=(data, recovered, self.s_0, self.i_0, self.r_0), 
                           method='L-BFGS-B', 
                           bounds=[(0.00000001, 0.4), (0.00000001, 0.4)])
        print(optimal)

        beta, gamma = optimal.x
        new_index, extended_actual, extended_recovered, extended_death, prediction = self.predict(beta, gamma, data, 
                                                                                                  recovered, death, 
                                                                                                  self.country, self.s_0, 
                                                                                                  self.i_0, self.r_0)
        df = pd.DataFrame({'Infected data': extended_actual, 'Recovered data': extended_recovered, 
                           'Death data': extended_death, 'Susceptible': prediction.y[0], 
                           'Infected': prediction.y[1], 'Recovered': prediction.y[2]}, index=new_index)

        fig, ax = plt.subplots(figsize=(15, 10))
        ax.set_title(self.country)
        df.plot(ax=ax)
        print(f"country={self.country}, beta={beta:.8f}, gamma={gamma:.8f}, r_0:{(beta/gamma):.8f}")
        fig.savefig(f"{self.country}.png")

We also need to define the loss function used in the optimization. Note that it combines two root mean squared error (RMSE) terms: one against the confirmed (infected) cases and one against the recovered, weighted by the factor alpha.

def loss(point, data, recovered, s_0, i_0, r_0):
    """Weighted RMSE of the SIR solution against the observed infected and recovered series."""
    size = len(data)
    beta, gamma = point

    def SIR(t, y):
        S = y[0]
        I = y[1]
        R = y[2]
        # rho is the social distancing factor set at the notebook level (1 for the base case)
        return [-rho*beta*S*I, rho*beta*S*I-gamma*I, gamma*I]

    solution = solve_ivp(SIR, [0, size], [s_0,i_0,r_0], t_eval=np.arange(0, size, 1), vectorized=True)
    # l1: error against the infected series; l2: error against the recovered series
    l1 = np.sqrt(np.mean((solution.y[1] - data)**2))
    l2 = np.sqrt(np.mean((solution.y[2] - recovered)**2))
    # alpha weights the two terms; with alpha = 0.1, the recovered series dominates the fit
    alpha = 0.1

    return alpha * l1 + (1 - alpha) * l2

Now let's run the estimation process against major countries suffering from the COVID-19 outbreak. The following cells illustrate how the number of people in each compartment changes over time, according to the SIR model.

How do we decide the initial values of the compartments? Two considerations drive the choice: a rough inference of the maximum number of susceptible people in each country, and the computational cost (the fitting takes a very long time with a susceptible population of 1 million or more). Although the SIR model assumes the susceptible population is homogeneous, well-mixed, and mutually accessible, using a country's entire population is clearly unrealistic. Let's therefore start with a susceptible compartment of 100,000.

Also in the initial settings, the number of initially infectious people is set to 2 for every country, while the start of the outbreak varies by country, as specified in the START_DATE dict object.
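As a rough sanity check before fitting, a candidate s_0 can be compared against the latest cumulative confirmed count for that country. The helper below is a minimal sketch, assuming a DataFrame in the JHU CSSE global confirmed layout (a Country/Region column followed by date columns of cumulative counts); the variable name global_conf_df is hypothetical.

# A minimal sanity check on a candidate susceptible value (illustrative only).
# Assumes the JHU CSSE global confirmed layout: 'Province/State', 'Country/Region',
# 'Lat', 'Long', then one column per date holding cumulative confirmed counts.
def check_s0(country, s_0, confirmed_df):
    latest_total = confirmed_df[confirmed_df['Country/Region'] == country].iloc[:, 4:].sum(axis=0).iloc[-1]
    if s_0 < latest_total:
        print(f"{country}: s_0={s_0} is below the latest confirmed total ({latest_total}).")
    else:
        print(f"{country}: s_0={s_0} exceeds the latest confirmed total ({latest_total}).")

# Example usage (global_conf_df is a hypothetical DataFrame read from the global confirmed CSV):
# check_s0('US', 100000, global_conf_df)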

The initial batch of SIR Model fitting

countries = ['China', 'Korea, South']
predict_range = 250
s_0 = 100000
i_0 = 2
r_0 = 10

for country in countries:
    start_date = START_DATE[country]
    learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
    learner.train()
      fun: 10280.378870137927
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([4227.74755862,   -7.67140591])
  message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
     nfev: 375
      nit: 27
   status: 2
  success: False
        x: array([7.53235845e-06, 2.06861562e-02])
country=China, beta=0.00000753, gamma=0.02068616, r_0:0.00036413
      fun: 5302.062155691667
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([ 2.09337929e+08, -1.32026589e+03])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 237
      nit: 19
   status: 0
  success: True
        x: array([1.01684995e-06, 4.17528180e-02])
country=Korea, South, beta=0.00000102, gamma=0.04175282, r_0:0.00002435
[Figures: SIR fit vs. actual data for China and South Korea]

How do we decide if the model is a good approximation? First, by checking whether the Infected data and the projected Infected curves lie close together; second, by checking whether the Recovered data and the projected Recovered curves do as well. For instance, in the cell output above, the model approximates the Mainland China data fairly well, but is less convincing for South Korea.
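Beyond visually inspecting the curves, the fit can also be summarized numerically, for example with the RMSE of the projected Infected and Recovered curves against the observed data over the observed window. The helper below is a sketch only; prediction, data, and recovered stand for the objects built inside Learner.train (they are not returned by the class as written, so you would need to expose them).

# A sketch of a quantitative goodness-of-fit check.
# prediction: the solve_ivp result; data/recovered: the observed series used for fitting.
def fit_rmse(prediction, data, recovered):
    n = len(data)
    rmse_infected = np.sqrt(np.mean((prediction.y[1][:n] - data.values) ** 2))
    rmse_recovered = np.sqrt(np.mean((prediction.y[2][:n] - recovered.values) ** 2))
    return rmse_infected, rmse_recovered

Lower values, relative to the scale of the observed counts, indicate a closer approximation.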

countries = ['Japan', 'Italy', 'Spain']
predict_range = 250
s_0 = 100000
i_0 = 2
r_0 = 10

for country in countries:
    start_date = START_DATE[country]
    learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
    learner.train()
      fun: 8039.341910449291
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([135990.51944766, -13777.95169901])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 162
      nit: 6
   status: 0
  success: True
        x: array([4.45162819e-07, 4.45177249e-07])
country=Japan, beta=0.00000045, gamma=0.00000045, r_0:0.99996759
      fun: 42082.071337634945
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([1.38366158e+02, 1.67347025e-02])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 399
      nit: 34
   status: 0
  success: True
        x: array([2.42577465e-06, 5.09067457e-02])
country=Italy, beta=0.00000243, gamma=0.05090675, r_0:0.00004765
      fun: 37210.55846013384
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([-4.30707999e+09,  1.31464732e+05])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 297
      nit: 23
   status: 0
  success: True
        x: array([2.89569389e-06, 9.88570287e-02])
country=Spain, beta=0.00000290, gamma=0.09885703, r_0:0.00002929
[Figures: SIR fit vs. actual data for Japan, Italy, and Spain]

As seen in the cell output above, the model approximation looks reasonably good for Italy and Spain, but not for Japan: the gap between the Infected data and the projected Infected curve is too large, and so is the gap between the Recovered data and the projected Recovered curve.

predict_range = 250
s_0 = 100000
i_0 = 2
r_0 = 10

country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
      fun: 317178.7735714818
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([9.87201929, 0.20954758])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 45
      nit: 5
   status: 0
  success: True
        x: array([0.02542935, 0.02778655])
country=US, beta=0.02542935, gamma=0.02778655, r_0:0.91516763
[Figure: SIR fit vs. actual data for the United States (s_0 = 100,000)]

Table 3 (shown below) summarizes what we have obtained so far for the studied countries. The last column indicates whether the model approximation is good; where it is not, the parameters need to be modified before running the 2nd batch of fitting.

| Countries | Beta | Gamma | R0 | Good approximation? |
|-----------|------------|------------|------------|---------------------|
| China | 0.00000753 | 0.02068616 | 0.00036413 | True |
| Japan | 0.00000045 | 0.00000045 | 0.9999675 | False |
| S. Korea | 0.00000102 | 0.04175282 | 0.00002435 | False |
| Italy | 0.00000243 | 0.05090675 | 0.00004765 | True |
| Spain | 0.00000290 | 0.09885703 | 0.00002929 | True |
| US | 0.02542935 | 0.02778655 | 0.91516763 | False |

Table 3. The computed Beta, Gamma, and R0 values for studied countries (with data collected up to EOD 07/04/2020).

For China, Italy, and Spain, the SIR model now fits the actual data for both confirmed and recovered cases reasonably well. The R0 values of Japan and the United States, however, come out significantly higher than those of the other countries, which might be caused by relatively lower recovery rates in those two countries, by poorly chosen initial values, or by the optimizer failing to find the global minimum.

Moreover, the total number of confirmed cases in the United States is far higher than the initial susceptible value we picked here, so the resulting S, I, and R trends and the computed R0 value do not make much sense. In the following cells, we will experiment with larger susceptible numbers (and/or longer prediction periods) and see how the results differ.

The second batch of SIR Model fitting

Let's run the model again with modified parameters for the countries that (1) showed unreasonably high R0 values, or (2) failed to produce a good approximation in the initial batch.

predict_range = 250
s_0 = 25000
i_0 = 2
r_0 = 10

country="Japan"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
      fun: 1422.4699520979984
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([279725.30349416,    589.61854847])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 432
      nit: 16
   status: 0
  success: True
        x: array([5.16087340e-06, 4.11036952e-02])
country=Japan, beta=0.00000516, gamma=0.04110370, r_0:0.00012556
[Figure: SIR fit vs. actual data for Japan (s_0 = 25,000)]
countries = ['Korea, South', 'US']
predict_range = 250
s_0_values = [25000, 1000000]
i_0 = 2
r_0 = 10

for country, s_0 in zip(countries, s_0_values):
    start_date = START_DATE[country]
    learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
    learner.train()
      fun: 2177.392810393044
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([258762.58505377,    387.30468077])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 315
      nit: 26
   status: 0
  success: True
        x: array([1.01798066e-05, 6.15592306e-03])
country=Korea, South, beta=0.00001018, gamma=0.00615592, r_0:0.00165366
      fun: 132485.37921448034
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([1.83150386e+10, 4.76380734e+06])
  message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
     nfev: 300
      nit: 15
   status: 2
  success: False
        x: array([2.31863766e-07, 1.31293805e-01])
country=US, beta=0.00000023, gamma=0.13129380, r_0:0.00000177
[Figures: SIR fit vs. actual data for South Korea (s_0 = 25,000) and the United States (s_0 = 1,000,000)]

Table 4 (shown below) lists the modified Beta, Gamma, and R0 values for Japan, South Korea, and the US; the other countries carry their values over from Table 3. The last column indicates whether the model approximation is good; where it is not, the parameters need to be modified again before the 3rd batch of fitting.

| Countries | Beta | Gamma | R0 | Good approximation? |
|-----------|------------|------------|------------|---------------------|
| China | 0.00000753 | 0.02068616 | 0.00036413 | True |
| Japan | 0.00000516 | 0.04110370 | 0.00012556 | True |
| S. Korea | 0.00001018 | 0.00615592 | 0.00165366 | True |
| Italy | 0.00000243 | 0.05090675 | 0.00004765 | True |
| Spain | 0.00000290 | 0.09885703 | 0.00002929 | True |
| US | 0.00000023 | 0.13129380 | 0.00000177 | False |

Table 4. The Beta, Gamma, and R0 values computed out of the 2nd batch model fitting for studied countries (with data collected up to EOD 07/04/2020).

Up to this point, we have run two batches of SIR model fitting, and all of the studied countries except the United States have shown good approximations to the model (after modifying the parameters). Because the susceptible values set for the United States so far have been far too low compared with its actual confirmed cases, we will keep increasing the s_0 value for the United States in the next section and see whether the model fit improves.
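The runs in the next section could also be scripted as a sweep over candidate s_0 values rather than repeated one-off cells. A minimal sketch, reusing the Learner class and loss defined above (the candidate values simply mirror the cells that follow):

# A sketch of sweeping candidate susceptible sizes for the United States.
# Each iteration re-runs the same fitting procedure and saves a plot per run.
candidate_s0 = [2000000, 3000000, 3500000, 4000000]
country = 'US'
for s_0 in candidate_s0:
    learner = Learner(country, loss, START_DATE[country], 250, s_0, 2, 10)
    learner.train()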

The final batch of SIR Model fitting

predict_range = 250
s_0 = 2000000
i_0 = 2
r_0 = 10

country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
      fun: 273805.4913666793
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([ 1849.03037734, -1823.58780876])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 30
      nit: 5
   status: 0
  success: True
        x: array([0.00161177, 0.00161186])
country=US, beta=0.00161177, gamma=0.00161186, r_0:0.99994539
[Figure: SIR fit vs. actual data for the United States (s_0 = 2,000,000)]
predict_range = 250
s_0 = 3000000
i_0 = 2
r_0 = 10

country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
      fun: 361448.0875488295
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([3502.73912773,  -58.84794518])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 24
      nit: 3
   status: 0
  success: True
        x: array([0.00099983, 0.00106169])
country=US, beta=0.00099983, gamma=0.00106169, r_0:0.94172925
[Figure: SIR fit vs. actual data for the United States (s_0 = 3,000,000)]

The previous two fits still suggest that the susceptible value is set too low, so let's continue to increase s_0.

predict_range = 250
s_0 = 3500000
i_0 = 2
r_0 = 10

country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
      fun: 224079.8306076969
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([ 3.91748893e+12, -2.12266260e+06])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 207
      nit: 7
   status: 0
  success: True
        x: array([3.09456750e-08, 9.17191515e-03])
country=US, beta=0.00000003, gamma=0.00917192, r_0:0.00000337
[Figure: SIR fit vs. actual data for the United States (s_0 = 3,500,000)]
predict_range = 250
s_0 = 4000000
i_0 = 2
r_0 = 10

country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
      fun: 369649.23863352626
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([ 7.68412486e+08, -1.60154421e+04])
  message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
     nfev: 186
      nit: 4
   status: 2
  success: False
        x: array([1.72969697e-08, 1.72955155e-08])
country=US, beta=0.00000002, gamma=0.00000002, r_0:1.00008408
[Figure: SIR fit vs. actual data for the United States (s_0 = 4,000,000)]
| s_0 | Beta | Gamma | R0 | Date of Peak | Max Infected Population |
|------|------------|------------|------------|----------------|-------------------------|
| 0.1M | 0.02542935 | 0.02778655 | 0.91516763 | Not applicable | Not applicable |
| 1M | 0.00000023 | 0.13129380 | 0.00000177 | Not applicable | Not applicable |
| 2M | 0.00161177 | 0.00161186 | 0.99994539 | Not applicable | Not applicable |
| 3M | 0.00099983 | 0.00106169 | 0.94172925 | Not applicable | Not applicable |
| 3.5M | 0.00000003 | 0.00917192 | 0.00000337 | 07/19/2020 | 2.5M |
| 4M | 0.00000002 | 0.00000002 | 1.00008408 | >10/28/2020 | >4M |

Table 5. The Beta, Gamma, and R0 values computed out of the final batch model fitting for the United States (with data collected up to EOD 07/04/2020).

Consider the social distancing effect in SIR Model fitting

Social distancing efforts can mitigate the spread of an infectious disease by reducing the transmission/infection rate, β. Let's introduce a new factor, ρ, to describe this social distancing effect. It ranges from 0 to 1, where 0 means everyone is locked down and quarantined as individuals, and 1 is equivalent to the base case demonstrated above. With ρ defined, the SIR equations become:

S' = -ρ*β*S*I
I' = ρ*β*S*I - γ*I
R' = γ*I

We have already seen the SIR model fit with ρ=1. Now let's set ρ to 0.8 (lower values such as 0.5 can be explored in the same way) and observe the flattening effect as the social distancing effort increases over the modeled period.

rho = 0.8
predict_range = 250
s_0 = 3500000
i_0 = 2
r_0 = 10

country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
      fun: 254442.13493214428
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([4.47150648e+12, 5.36130878e+06])
  message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
     nfev: 288
      nit: 5
   status: 2
  success: False
        x: array([3.81849319e-08, 1.66008689e-02])
country=US, beta=0.00000004, gamma=0.01660087, r_0:0.00000230
[Figure: SIR fit vs. actual data for the United States (s_0 = 3,500,000, rho = 0.8)]

As listed in Table 6, social distancing efforts are likely to improve survival rates by lowering and delaying the peak, giving more time for treatments and supplies to be developed.

| Rho | Beta | Gamma | R0 | Date of Peak | Max Infected Population |
|-----|------------|------------|------------|--------------|-------------------------|
| 1 | 0.00000003 | 0.00917192 | 0.00000337 | 07/19/2020 | 2.5M |
| 0.8 | 0.00000004 | 0.01660087 | 0.00000230 | 07/26/2020 | 1.95M |

Table 6. The Beta, Gamma, and R0 values computed with different `Rho` factors (when `S`=3.5M) for the United States (with data collected up to EOD 07/04/2020).
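Because SIR() reads ρ as a notebook-level global, additional social distancing scenarios can be compared by simply reassigning rho and re-running the fit. A minimal sketch (only rho = 1 and rho = 0.8 are tabulated above; the other value is illustrative):

# A sketch of comparing social distancing scenarios by varying the global rho.
# Each iteration re-runs the same fitting procedure used above for the United States.
for rho in [1.0, 0.8, 0.5]:
    learner = Learner('US', loss, START_DATE['US'], 250, 3500000, 2, 10)
    learner.train()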

Comparison between models

The SIR model we have used so far is a simplified approximation of the early dynamics of an epidemic with exponential growth. Based on a system of differential equations over three distinct populations (S, I, and R), it has been around for decades. Future extensions of this study could explore more complex approaches, such as the SEIR model (an adaptation of SIR with an additional Exposed population), or models that fit a curve to epidemics that are further along and recalibrate locally, such as the CHIME model. A minimal SEIR sketch is given after the comparison table below.

| Category | SIR | SEIR [18] | CHIME [18] |
|----------|-----|-----------|------------|
| Definition | Simple & basic compartmental model | A derivative of the SIR model that contains an Exposed compartment (infected but not yet infectious) | COVID-19 Hospital Impact Model for Epidemics |
| Latency Period | No | Yes | No |
| Progress of Individuals | S-I-R | S-E-I-R | S-I-R |
| Examples | N = S+I+R | N = S+E+I+R | N = S+I+R |
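For reference, a SEIR right-hand side differs from the SIR() used above only by the extra Exposed compartment and an incubation rate sigma (roughly 1 / incubation period). The sketch below is illustrative; sigma is an assumed value rather than a fitted one.

# A minimal SEIR sketch for comparison with the SIR() used above (not fitted here).
def SEIR(t, y, beta, gamma, sigma=1/5.2):
    S, E, I, R = y
    return [-beta * S * I,             # susceptible -> exposed
            beta * S * I - sigma * E,  # exposed become infectious after the latency period
            sigma * E - gamma * I,     # infectious -> removed
            gamma * I]

# It can be integrated with the same solver, e.g.:
# solve_ivp(SEIR, [0, size], [s_0, 0, i_0, r_0], t_eval=np.arange(0, size, 1), args=(beta, gamma))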

Conclusions

Until COVID-19 vaccines become available to everyone, all that governments can do is slow down the transmission. Doing so does not actually stop the spread, but it keeps transmission, and hence the number of active cases at any given time, within the handling capacity of medical facilities. This is what we call Flattening The Curve. As advised by experts, there are at least three measures that help flatten the curve:

  1. Place travel restrictions: transmission can be reduced by restricting people from traveling into or out of regions with high infection rates.
  2. Keep social distance: COVID-19 is widely estimated to have a high R0, with each infected person going on to infect roughly 2 or 3 others, so social distancing may be the most significant measure for reducing transmission from the infected to others.
  3. Have more testing done: testing is necessary to quickly identify and isolate the infected from the non-infected. Given that COVID-19 has a long incubation period (reported to range from 2 up to 26 days, with symptoms usually appearing after 5-7 days), an infected individual can spread the infection to others without even knowing it.

As Tomas Pueyo argued in his Medium post [10], when transmission reduction and travel restrictions are combined, cutting the COVID-19 transmission rate by even 25% could delay the peak by almost 14 weeks, and further reductions would delay it even more.
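This kind of claim can be explored with the same machinery used above: solve the SIR system twice, once with a fitted beta and once with beta reduced by 25%, and compare when the infected curve peaks. The sketch below is illustrative; beta, gamma, and the initial values are placeholders for fitted or assumed numbers.

# A sketch of comparing peak timing with and without a 25% reduction in transmission.
def infected_peak_day(beta, gamma, s_0, i_0, r_0, days=250):
    def SIR(t, y):
        S, I, R = y
        return [-beta * S * I, beta * S * I - gamma * I, gamma * I]
    sol = solve_ivp(SIR, [0, days], [s_0, i_0, r_0], t_eval=np.arange(0, days, 1))
    return int(np.argmax(sol.y[1]))  # day index of the infected peak

# base = infected_peak_day(beta, gamma, s_0, i_0, r_0)
# reduced = infected_peak_day(0.75 * beta, gamma, s_0, i_0, r_0)
# print(f"A 25% lower transmission rate delays the peak by {reduced - base} days.")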

Together with what we have seen from the SIR modeling, reducing the transmission rate and keeping the infected compartment small are crucial to flattening the curve. Let's all act together and fight the pandemic!

References

[1] https://www.vox.com/2020/2/18/21142009/coronavirus-covid19-china-wuhan-deaths-pandemic

[2] https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

[3] https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

[4] https://github.com/CSSEGISandData/COVID-19/tree/web-data

[5] http://gabgoh.github.io/COVID/index.html

[6] https://scipython.com/book/chapter-8-scipy/additional-examples/the-sir-epidemic-model/

[7] https://www.lewuathe.com/covid-19-dynamics-with-sir-model.html

[8] https://github.com/Lewuathe/COVID19-SIR

[9] https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_India

[10] https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca

[11] https://science.sciencemag.org/content/early/2020/03/05/science.aba9757

[12] https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology

[13] https://science.sciencemag.org/content/early/2020/03/05/science.aba9757/tab-figures-data

[14] https://thespinoff.co.nz/politics/22-03-2020/siouxsie-wiles-toby-morris-what-does-level-two-mean-and-why-does-it-matter/

[15] https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

[16] https://www.weforum.org/agenda/2020/03/covid19-coronavirus-countries-infection-trajectory/

[17] https://in.springboard.com/blog/data-modelling-covid/

[18] https://penn-chime.phl.io/
