## Introduction

Dashboards, statistics, and other information about the `COVID-19`

are floating all over the internet, and different countries or regions are adopting varied strategies, from complete lockdown, to social distancing, to herd immunity, you might be confused at what is the right strategy, and which information is valid. This notebook is not providing you a final answer, but tools or methods that you can try yourself in performing data modeling, analyzing, and predicting the spread of COVID-19 with the `ArcGIS API for Python`

, and other libraries such as `pandas`

and `numpy`

. Hopefully, given the workflow demonstrations, you are able to find the basic facts, current patterns, and future trends behind the common notions about how COVID-19 spread from a dataset perspective [1,2,3,4].

Before we dive into data science and analytics, let's start with importing the necessary modules:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from arcgis.gis import GIS
```

## Import and Understand Source Dataset

Among all the official and unofficial data sources on the web providing COVID-19 related data, one of the most widely used dataset today is the one provided by the John Hopkins University's Center for Systems Science and Engineering (JHU CSSE), which can be accessed on GitHub under the name - Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE [5,10,11,12]. The time-series consolidated data needed for all the analysis to be performed in this notebook fall into these two categories:

- Type: Confirmed Cases, Deaths, and the Recovered;
- Geography: Global, and the United States only.

Now let's first look at the U.S. dataset.

### Time-series data for the United States

The dataset can be directly imported into data-frames with `read_csv`

method in `Pandas`

. Compared to downloading the file manually and then read it, it is preferred to use the URLs (which point to the CSV files archived on GitHub) because as situation changes, it becomes easier to load and refresh the analysis with new data.

Now, let's read the time-series data of the confirmed COVID-19 cases in the United States from the GitHub source url, into a Pandas DataFrame:

```
# read time-series csv
usa_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'
usa_ts_df = pd.read_csv(usa_ts_url, header=0, escapechar='\\')
usa_ts_df.head(5)
```

UID | iso2 | iso3 | code3 | FIPS | Admin2 | Province_State | Country_Region | Lat | Long_ | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 16 | AS | ASM | 16 | 60.0 | NaN | American Samoa | US | -14.2710 | -170.1320 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

1 | 316 | GU | GUM | 316 | 66.0 | NaN | Guam | US | 13.4443 | 144.7937 | ... | 247 | 247 | 247 | 253 | 257 | 267 | 280 | 280 | 280 | 280 |

2 | 580 | MP | MNP | 580 | 69.0 | NaN | Northern Mariana Islands | US | 15.0979 | 145.6739 | ... | 30 | 30 | 30 | 30 | 30 | 30 | 31 | 31 | 31 | 31 |

3 | 630 | PR | PRI | 630 | 72.0 | NaN | Puerto Rico | US | 18.2208 | -66.5901 | ... | 6922 | 7066 | 7189 | 7250 | 7465 | 7537 | 7608 | 7683 | 7787 | 7916 |

4 | 850 | VI | VIR | 850 | 78.0 | NaN | Virgin Islands | US | 18.3358 | -64.8963 | ... | 81 | 81 | 81 | 81 | 81 | 90 | 92 | 98 | 111 | 111 |

5 rows × 177 columns

`usa_ts_df.columns`

Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Lat', 'Long_', ... '6/26/20', '6/27/20', '6/28/20', '6/29/20', '6/30/20', '7/1/20', '7/2/20', '7/3/20', '7/4/20', '7/5/20'], dtype='object', length=177)

As we can see from the printouts of `usa_ts_df.columns`

, the first 11 columns are displayed as `['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Lat', 'Long_']`

, while the rest of the columns are dates from `1/22/20`

to the most current date on record.

```
date_list = usa_ts_df.columns.tolist()[11:]
date_list[0], date_list[-1]
```

('1/22/20', '7/5/20')

#### Repair and Summarize the state-wide time-series data

Look at the last five rows of the DataFrame `usa_ts_df`

, and see that they are all cases for `Province_State`

=="Utah":

`usa_ts_df.tail()`

UID | iso2 | iso3 | code3 | FIPS | Admin2 | Province_State | Country_Region | Lat | Long_ | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

3256 | 84070016 | US | USA | 840 | NaN | Central Utah | Utah | US | 39.372319 | -111.575868 | ... | 127 | 134 | 143 | 159 | 169 | 173 | 180 | 194 | 201 | 209 |

3257 | 84070017 | US | USA | 840 | NaN | Southeast Utah | Utah | US | 38.996171 | -110.701396 | ... | 33 | 34 | 35 | 35 | 36 | 37 | 39 | 40 | 40 | 40 |

3258 | 84070018 | US | USA | 840 | NaN | Southwest Utah | Utah | US | 37.854472 | -111.441876 | ... | 1302 | 1361 | 1428 | 1467 | 1519 | 1553 | 1584 | 1625 | 1660 | 1689 |

3259 | 84070019 | US | USA | 840 | NaN | TriCounty | Utah | US | 40.124915 | -109.517442 | ... | 45 | 46 | 48 | 48 | 50 | 50 | 52 | 53 | 55 | 55 |

3260 | 84070020 | US | USA | 840 | NaN | Weber-Morgan | Utah | US | 41.271160 | -111.914512 | ... | 814 | 846 | 872 | 919 | 954 | 1004 | 1042 | 1090 | 1172 | 1195 |

5 rows × 177 columns

Some rows in the DataFrame are of `many-rows-to-one-state`

matching, while others are of the `one-row-to-one-state`

matching replationship. All records listed in the `usa_ts_df`

with `Admin2`

not equal to NaN, are those rows that fall into the category of "many-rows-to-one-state" matching.

`usa_ts_df[usa_ts_df["Admin2"].notna()].head()`

UID | iso2 | iso3 | code3 | FIPS | Admin2 | Province_State | Country_Region | Lat | Long_ | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

5 | 84001001 | US | USA | 840 | 1001.0 | Autauga | Alabama | US | 32.539527 | -86.644082 | ... | 482 | 492 | 497 | 521 | 530 | 545 | 553 | 560 | 583 | 607 |

6 | 84001003 | US | USA | 840 | 1003.0 | Baldwin | Alabama | US | 30.727750 | -87.722071 | ... | 500 | 539 | 559 | 626 | 663 | 686 | 735 | 828 | 846 | 864 |

7 | 84001005 | US | USA | 840 | 1005.0 | Barbour | Alabama | US | 31.868263 | -85.387129 | ... | 309 | 314 | 314 | 319 | 322 | 323 | 333 | 345 | 347 | 349 |

8 | 84001007 | US | USA | 840 | 1007.0 | Bibb | Alabama | US | 32.996421 | -87.125115 | ... | 150 | 158 | 159 | 162 | 167 | 171 | 176 | 186 | 187 | 190 |

9 | 84001009 | US | USA | 840 | 1009.0 | Blount | Alabama | US | 33.982109 | -86.567906 | ... | 181 | 185 | 186 | 196 | 204 | 214 | 218 | 226 | 230 | 235 |

5 rows × 177 columns

As shown in the output of the previous two cells, we can see that for the `usa_ts_df`

which we have created and parsed from the U.S. Dataset, there are multiple rows per state representing different administrative areas of the state with reported cases. Next, we will use the to-be-defined function `sum_all_admins_in_state()`

to summarize all administrative areas in one state into a single row.

```
def sum_all_admins_in_state(df, state):
# query all sub-records of the selected state
tmp_df = df[df["Province_State"]==state]
# create a new row which is to sum all statistics of this state, and
# assign the summed value of all sub-records to the date_time column of the new row
sum_row = tmp_df.sum(axis=0)
# assign the constants to the ['Province/State', 'Country/Region', 'Lat', 'Long'] columns;
# note that the Province/State column will be renamed from solely the country name to country name + ", Sum".
sum_row.loc['UID'] = "NaN"
sum_row.loc['Admin2'] = "NaN"
sum_row.loc['FIPS'] = "NaN"
sum_row.loc['iso2'] = "US"
sum_row.loc['iso3'] = "USA"
sum_row.loc['code3'] = 840
sum_row.loc['Country_Region'] = "US"
sum_row.loc['Province_State'] = state + ", Sum"
sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
sum_row.loc['Long_'] = tmp_df['Long_'].values[0]
# append the new row to the original DataFrame, and
# remove the sub-records of the selected country.
df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
#display(df[df["Province_State"].str.contains(state + ", Sum")])
df=df[df['Province_State'] != state]
df.loc[df.Province_State == state+", Sum", 'Province_State'] = state
return df
```

```
# loop thru all states in the U.S.
for state in usa_ts_df.Province_State.unique():
usa_ts_df = sum_all_admins_in_state(usa_ts_df, state)
```

Now with `sum_all_admins_in_state`

applied to all states, we shall be expecting `usa_ts_df`

to be with all rows converted to `one-row-to-one-state`

matching. Let's browse the last five rows in the DataFrame to validate the results.

`usa_ts_df.tail()`

UID | iso2 | iso3 | code3 | FIPS | Admin2 | Province_State | Country_Region | Lat | Long_ | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

54 | NaN | US | USA | 840 | NaN | NaN | West Virginia | US | 39.1307 | -80.0035 | ... | 2730 | 2782 | 2832 | 2870 | 2905 | 2979 | 3053 | 3126 | 3205 | 3262 |

55 | NaN | US | USA | 840 | NaN | NaN | Wisconsin | US | 43.9697 | -89.7678 | ... | 26747 | 27286 | 27743 | 28058 | 28659 | 29199 | 29738 | 30317 | 31055 | 31577 |

56 | NaN | US | USA | 840 | NaN | NaN | Wyoming | US | 41.655 | -105.724 | ... | 1368 | 1392 | 1417 | 1450 | 1487 | 1514 | 1550 | 1582 | 1606 | 1634 |

57 | NaN | US | USA | 840 | NaN | NaN | Diamond Princess | US | 0 | 0 | ... | 49 | 49 | 49 | 49 | 49 | 49 | 49 | 49 | 49 | 49 |

58 | NaN | US | USA | 840 | NaN | NaN | Grand Princess | US | 0 | 0 | ... | 103 | 103 | 103 | 103 | 103 | 103 | 103 | 103 | 103 | 103 |

5 rows × 177 columns

#### Explore the state-wide time-series data

If you wonder in which state(s) the first COVID-19 case was confirmed and reported, use the cell below to check for first occurrence - in this case, the **Washington** State.

```
usa_ts_df_all_states = usa_ts_df.groupby('Province_State').sum()[date_list]
usa_ts_df_all_states[usa_ts_df_all_states['1/22/20']>0]
```

1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | 1/30/20 | 1/31/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Province_State | |||||||||||||||||||||

Washington | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 30855 | 31404 | 31752 | 32253 | 32824 | 33435 | 34151 | 34778 | 35247 | 35898 |

1 rows × 166 columns

Or if you want to query for the **top 10 states** with the highest numbers of confirmed cases for the time being, run the following cell to display the query results.

`usa_ts_df_all_states[date_list[-1]].sort_values(ascending = False).head(10)`

Province_State New York 397131 California 264681 Florida 200111 Texas 194932 New Jersey 173402 Illinois 147251 Massachusetts 109974 Arizona 98103 Georgia 95516 Pennsylvania 94403 Name: 7/5/20, dtype: int64

Compared to what we have collected as the top 10 states on May 05, 2020, we can see California has climbed up to the 2nd place, while New Jersey rescinded to the 5th place.

```
Province_State
New York 321192
New Jersey 130593
Massachusetts 70271
Illinois 65889
California 58456
Pennsylvania 53864
Michigan 44451
Florida 37439
Texas 33912
Connecticut 30621
Name: 5/5/20, dtype: int64
```

The approach is quite similar if you want to query for the top 10 states with the lowest numbers of confirmed cases, just by simply changing the `ascending`

order from `False`

to `True`

:

`usa_ts_df_all_states[date_list[-1]].sort_values(ascending = True).head(10)`

Province_State American Samoa 0 Northern Mariana Islands 31 Diamond Princess 49 Grand Princess 103 Virgin Islands 111 Guam 280 Hawaii 1023 Alaska 1134 Montana 1212 Vermont 1249 Name: 7/5/20, dtype: int64

Comparatively, we can also check what is the difference against the statistics obtained on May 05, 2020 (as below),

```
Province_State
American Samoa 0
Northern Mariana Islands 14
Diamond Princess 49
Virgin Islands 66
Grand Princess 103
Guam 145
Alaska 371
Montana 456
Wyoming 604
Hawaii 625
Name: 5/5/20, dtype: int64
```

As shown above, from May to July the state with the highest confirmed cases is the **New York** State, while the **American Samoa** is that of the lowest confirmed cases (as of 07/05/2020). Also, if you are only interested in finding out which state has the highest confirmed cases instead of the top 10, you can just run the cells below for an exact query result, and its time-series:

```
# state name, and the current number of confirmed
usa_ts_df_all_states[date_list[-1]].idxmax(), usa_ts_df_all_states[date_list[-1]].max()
```

('New York', 397131)

`usa_ts_df[usa_ts_df['Province_State']=="New York"].sum()[date_list]`

1/22/20 0 1/23/20 0 1/24/20 0 1/25/20 0 1/26/20 0 ... 7/1/20 394079 7/2/20 394954 7/3/20 395872 7/4/20 396598 7/5/20 397131 Length: 166, dtype: object

#### Map the confirmed cases per state

With the time-series DataFrame for the United States ready-to-use, we can then map the number of confirmed cases reported per state in a time-enabled manner. Next, we will see an animation being created from the end of January to Current:

`gis = GIS('home', verify_cert=False)`

```
"""Confirmed Cases per State shown on map widget"""
map0 = gis.map("US")
map0
```

```
map0.basemap.basemap = "dark-gray-vector"
map0.zoom = 4
```

```
for d in date_list:
map0.content.remove_all()
map0.legend.enabled=False
df = usa_ts_df[['Province_State', "Lat", "Long_", d]]
df.rename(columns={d: 'Confirmed'}, inplace=True)
fc = gis.content.import_data(df,
{"Address":"Province_State"})
fset=fc.query()
fset.sdf.spatial.plot(map0)
renderer_manager = map0.content.renderer(0)
renderer_manager.smart_mapping().class_breaks_renderer(break_type="color", field="Confirmed", class_count=13, classification_method="natural-breaks")
map0.legend.enabled=True
time.sleep(3)
```

#### Chart the top 20 states with highest confirmed cases

Besides visualizing how the epicenter shifted from the West Coast to East Coast from previous animation, we can also show the evolution of the each state over time using a bar chart. Run the cell below to create an animation of how the top 20 states with highest confirmed cases in the U.S. changed during the pandemic:

```
from IPython.display import clear_output
import matplotlib.pyplot as plot
```

```
"""Chart the top 20 states with highest confirmed cases"""
time.sleep(3)
for d in date_list:
clear_output(wait=True)
top_20_per_d = usa_ts_df.groupby('Province_State')[['Province_State', d]].sum().sort_values(by=d, ascending=False).head(20)
top_20_per_d.plot(kind='barh', log=True, figsize=(8,6))
plt.ylabel("Province_State", labelpad=14)
plt.xlabel("# of Confirmed Cases (log=True)", labelpad=14)
plt.title("Chart the confirmed cases per state", y=1.02)
plt.show()
time.sleep(1)
```

### Time-series data at the country level around the world

Besides the U.S. dataset, we can also read the time-series data of the confirmed COVID-19 cases globally from the GitHub source url, into a Pandas DataFrame, again:

```
world_confirmed_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
world_confirmed_ts_df = pd.read_csv(world_confirmed_ts_url, header=0, escapechar='\\')
world_confirmed_ts_df.head(5)
```

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | NaN | Afghanistan | 33.0000 | 65.0000 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 30451 | 30616 | 30967 | 31238 | 31517 | 31836 | 32022 | 32324 | 32672 | 32951 |

1 | NaN | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2269 | 2330 | 2402 | 2466 | 2535 | 2580 | 2662 | 2752 | 2819 | 2893 |

2 | NaN | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 12685 | 12968 | 13273 | 13571 | 13907 | 14272 | 14657 | 15070 | 15500 | 15941 |

3 | NaN | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 855 | 855 | 855 | 855 | 855 | 855 | 855 | 855 | 855 | 855 |

4 | NaN | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 212 | 259 | 267 | 276 | 284 | 291 | 315 | 328 | 346 | 346 |

5 rows × 170 columns

#### Splitting records by different matching types

Though most of the rows in `world_confirmed_ts_df`

DataFrame are the statistics at country-level (`one-row-to-one-country matching`

), there are several countries listed in the DataFrame as an aggregation of provinces or states (`many-rows-to-one-country matching`

). For instance, as shown below, the confirmed cases of **Australia** can be shown in the 8 rows printed.

`world_confirmed_ts_df[world_confirmed_ts_df["Province/State"].notna()].head(8)`

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

8 | Australian Capital Territory | Australia | -35.4735 | 149.0124 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 108 | 108 | 108 | 108 | 108 | 108 | 108 | 108 | 108 | 108 |

9 | New South Wales | Australia | -33.8688 | 151.2093 | 0 | 0 | 0 | 0 | 3 | 4 | ... | 3174 | 3177 | 3184 | 3189 | 3203 | 3211 | 3211 | 3405 | 3419 | 3429 |

10 | Northern Territory | Australia | -12.4634 | 130.8456 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 29 | 29 | 29 | 29 | 29 | 30 | 30 | 30 | 30 | 30 |

11 | Queensland | Australia | -28.0167 | 153.4000 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1067 | 1067 | 1067 | 1067 | 1067 | 1067 | 1067 | 1067 | 1067 | 1067 |

12 | South Australia | Australia | -34.9285 | 138.6007 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 440 | 440 | 440 | 443 | 443 | 443 | 443 | 443 | 443 | 443 |

13 | Tasmania | Australia | -41.4545 | 145.9707 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 228 | 228 | 228 | 228 | 228 | 228 | 228 | 228 | 228 | 228 |

14 | Victoria | Australia | -37.8136 | 144.9631 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1947 | 2028 | 2099 | 2159 | 2231 | 2303 | 2368 | 2368 | 2536 | 2660 |

15 | Western Australia | Australia | -31.9505 | 115.8605 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 608 | 609 | 609 | 611 | 611 | 611 | 611 | 611 | 612 | 618 |

8 rows × 170 columns

Browsing all the selected rows from above, we come to an conclusion that these five countries - `Australia, Canada, China, Denmark, and France`

- are of the `many-rows-to-one-country matching`

, and we need to summarize these rows into a single record, if such a row does not exist already. The following cell is to rule out the countries that are provided not only `many-rows-to-one-country matching`

but also `one-row-to-one-country matching`

.

```
for country in ["Australia", "Canada", "China", "Denmark", "France"]:
df = world_confirmed_ts_df[world_confirmed_ts_df["Province/State"].isna()]
df = df[df["Country/Region"]==country]
if df.size:
display(df)
```

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

94 | NaN | Denmark | 56.2639 | 9.5018 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 12675 | 12675 | 12675 | 12751 | 12768 | 12794 | 12815 | 12832 | 12832 | 12832 |

1 rows × 170 columns

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

116 | NaN | France | 46.2276 | 2.2137 | 0 | 0 | 2 | 3 | 3 | 3 | ... | 193346 | 193152 | 192429 | 194109 | 194373 | 194985 | 195458 | 195904 | 195546 | 195535 |

1 rows × 170 columns

To summarize, we now can split the records into three matching types:

- Countries with both
`many-rows-to-one-country matching`

and`one-row-to-one-country matching`

data, e.g.**Denmark, and France**. - Countries with only
`many-rows-to-one-country matching`

data, e.g.**Australia, Canada, and China**. - Countries with only
`one-row-to-one-country matching`

data - rest of all countries in the DataFrame.

That being said, we would only need to summarize and repair the data for countries of type #2, which are *Australia, Canada, and China*.

#### Summarize and repair the time-series global confirmed cases

The method `sum_all_provinces_in_country`

to be defined in the next cell performs these tasks:

- query all sub-records of the selected country
- create a new row which is to sum all statistics of this country, and assign the summed value of all sub-records to the date_time column of the new row
- assign the constants to the ['Province/State', 'Country/Region', 'Lat', 'Long'] columns; note that the
`Country/Region`

column will be renamed from solely the country name to country name + ", Sum". - append the new row to the original DataFrame, and remove the sub-records of the selected country.

We will run the method against the subset of countries in type #2, which are *Australia, Canada, and China*.

```
def sum_all_provinces_in_country(df, country):
# query all sub-records of the selected country
tmp_df = df[df["Country/Region"]==country]
# create a new row which is to sum all statistics of this country, and
# assign the summed value of all sub-records to the date_time column of the new row
sum_row = tmp_df.sum(axis=0)
# assign the constants to the ['Province/State', 'Country/Region', 'Lat', 'Long'] columns;
# note that the Country/Region column will be renamed from solely the country name to country name + ", Sum".
sum_row.loc['Province/State'] = "NaN"
sum_row.loc['Country/Region'] = country + ", Sum"
sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
sum_row.loc['Long'] = tmp_df['Long'].values[0]
# append the new row to the original DataFrame, and
# remove the sub-records of the selected country.
df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
display(df[df["Country/Region"].str.contains(country + ", Sum")])
df=df[df['Country/Region'] != country]
df.loc[df['Country/Region']== country+", Sum", 'Country/Region'] = country
return df
```

```
for country in ["Australia", "Canada", "China"]:
world_confirmed_ts_df = sum_all_provinces_in_country(world_confirmed_ts_df, country)
```

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

266 | NaN | Australia, Sum | -35.4735 | 149.012 | 0 | 0 | 0 | 0 | 4 | 5 | ... | 7601 | 7686 | 7764 | 7834 | 7920 | 8001 | 8066 | 8260 | 8443 | 8583 |

1 rows × 170 columns

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

259 | NaN | Canada, Sum | 53.9333 | -116.576 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 104629 | 104878 | 105193 | 105830 | 106097 | 106288 | 106643 | 106962 | 107185 | 107394 |

1 rows × 170 columns

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

246 | NaN | China, Sum | 31.8257 | 117.226 | 548 | 643 | 920 | 1406 | 2075 | 2877 | ... | 84725 | 84743 | 84757 | 84780 | 84785 | 84816 | 84830 | 84838 | 84857 | 84871 |

1 rows × 170 columns

Now with the data summarized and repaired, we are good to query for the **top 10 countries with the highest numbers of confirmed cases around the globe**. You can choose to list the entire time-series for these 10 countries, or just pick the country/region name, and the most current statistics.

`world_confirmed_ts_df.sort_values(by = date_list[-1], ascending = False).head(10)`

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

206 | NaN | US | 37.0902 | -95.7129 | 1 | 1 | 2 | 2 | 5 | 5 | ... | 2467554 | 2510259 | 2549294 | 2590668 | 2636414 | 2687588 | 2742049 | 2794153 | 2839436 | 2888635 |

20 | NaN | Brazil | -14.235 | -51.9253 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1274974 | 1313667 | 1344143 | 1368195 | 1402041 | 1448753 | 1496858 | 1539081 | 1577004 | 1603055 |

112 | NaN | India | 21 | 78 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 508953 | 528859 | 548318 | 566840 | 585481 | 604641 | 625544 | 648315 | 673165 | 697413 |

168 | NaN | Russia | 60 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 619936 | 626779 | 633563 | 640246 | 646929 | 653479 | 660231 | 666941 | 673564 | 680283 |

162 | NaN | Peru | -9.19 | -75.0152 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 272364 | 275989 | 279419 | 282365 | 285213 | 288477 | 292004 | 295599 | 299080 | 302718 |

29 | NaN | Chile | -35.6751 | -71.543 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 263360 | 267766 | 271982 | 275999 | 279393 | 282043 | 284541 | 288089 | 291847 | 295532 |

204 | NaN | United Kingdom | 55.3781 | -3.436 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 281675 | 282308 | 282703 | 283307 | 283710 | 283770 | 283774 | 284276 | 284900 | 285416 |

139 | NaN | Mexico | 23.6345 | -102.553 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 208392 | 212802 | 216852 | 220657 | 226089 | 231770 | 238511 | 245251 | 252165 | 256848 |

182 | NaN | Spain | 40 | -4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 247905 | 248469 | 248770 | 248970 | 249271 | 249659 | 250103 | 250545 | 250545 | 250545 |

118 | NaN | Italy | 43 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 239961 | 240136 | 240310 | 240436 | 240578 | 240760 | 240961 | 241184 | 241419 | 241611 |

10 rows × 170 columns

```
# a simplified list
world_confirmed_ts_df[['Country/Region', date_list[-1]]].sort_values(by = date_list[-1], ascending = False).head(10)
```

Country/Region | 7/5/20 | |
---|---|---|

206 | US | 2888635 |

20 | Brazil | 1603055 |

112 | India | 697413 |

168 | Russia | 680283 |

162 | Peru | 302718 |

29 | Chile | 295532 |

204 | United Kingdom | 285416 |

139 | Mexico | 256848 |

182 | Spain | 250545 |

118 | Italy | 241611 |

That compared to the top 10 countries with confirmed COVID-19 cases we fetched and sorted on 03/23/2020, we can see how the list of countries changed.

Country_Region | Confirmed | Recovered | Deaths |
---|---|---|---|

China | 81591 | 73280 | 3281 |

Italy | 69176 | 8326 | 6820 |

United States | 53660 | 0 | 703 |

Spain | 39885 | 3794 | 2808 |

Germany | 32986 | 3243 | 157 |

Iran | 24811 | 8913 | 1934 |

France | 22622 | 3288 | 1102 |

Switzerland | 9877 | 131 | 122 |

South Korea | 9037 | 3507 | 120 |

United Kingdom | 8164 | 140 | 423 |

Table 1. The top 10 countries with confirmed COVID-19 cases collected at 03/23/2020.

##### Map the confirmed cases per country

Similar to the previous approach of mapping the confirmed cases in the United states through a timely evolutionary manner, here we are to visualize the confirmed cases per country/region along the timeline (from end of January to Current), and we shall be seeing the shift of epicenters from East Asia to North America in the animation.

As displayed in the legend, countries with the highest confirmed cases are to appear as red points, and the colormap changes from red to light blue and then to dark blue as the number of cases reduce.

```
"""Map the Confirmed Cases per country/region world-wide in the MapView"""
map1 = gis.map("Italy")
map1
```

```
map1.basemap.basemap = "dark-gray-vector"
map1.zoom = 1
```

```
for d in date_list:
map1.remove_layers()
map1.legend.enabled=False
df = world_confirmed_ts_df[['Country/Region', d]]
df.rename(columns={d: 'Confirmed'}, inplace=True)
fc = gis.content.import_data(df,
{"CountryCode":"Country_Region"},
title="Confirmed cases per country of "+d)
fset=fc.query()
fset.sdf.spatial.plot(map1)
renderer_manager = map1.content.renderer(0)
renderer_manager.smart_mapping().class_breaks_renderer(break_type="color", field="Confirmed", class_count=13, classification_method="natural-breaks")
map1.legend.enabled=True
time.sleep(3)
```

##### Chart the confirmed cases per country

Similarly, let us plot the top 20 countries with largest confirmed cases to be shown in vertical bar charts (in a time-enabled manner):

```
"""Chart the top 20 countries with highest confirmed cases"""
time.sleep(3)
for d in date_list:
clear_output(wait=True)
top_20_per_d = world_confirmed_ts_df.groupby('Country/Region')[['Country/Region', d]].sum().sort_values(by=d, ascending=False).head(20)
top_20_per_d.plot(kind='barh', log=True, figsize=(8,6))
plt.ylabel("Country/Region", labelpad=14)
plt.xlabel("# of Confirmed Cases (log=True)", labelpad=14)
plt.title("Chart the confirmed cases per country/region", y=1.02)
plt.show()
time.sleep(1)
```

#### Summarize and repair the time-series COVID-19 deaths globally

Similar to what we have done in the previous section, we will perform the similar approach to fetch, parse and repair the data for COVID-19 deaths globally, which includes steps to:

- fetch and read the online CSV source into DataFrame
- summarize and repair the DataFrame
- list the top 10 countries

```
world_deaths_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
world_deaths_ts_df = pd.read_csv(world_deaths_ts_url, header=0, escapechar='\\')
world_deaths_ts_df.head(5)
```

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | NaN | Afghanistan | 33.0000 | 65.0000 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 683 | 703 | 721 | 733 | 746 | 774 | 807 | 819 | 826 | 864 |

1 | NaN | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 51 | 53 | 55 | 58 | 62 | 65 | 69 | 72 | 74 | 76 |

2 | NaN | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 885 | 892 | 897 | 905 | 912 | 920 | 928 | 937 | 946 | 952 |

3 | NaN | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 |

4 | NaN | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 10 | 10 | 11 | 11 | 13 | 15 | 17 | 18 | 19 | 19 |

5 rows × 170 columns

```
for country in ["Australia", "Canada", "China"]:
world_deaths_ts_df = sum_all_provinces_in_country(world_deaths_ts_df, country)
```

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

266 | NaN | Australia, Sum | -35.4735 | 149.012 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 104 | 104 | 104 | 104 | 104 | 104 | 104 | 104 | 104 | 106 |

1 rows × 170 columns

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

259 | NaN | Canada, Sum | 53.9333 | -116.576 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 8571 | 8576 | 8582 | 8628 | 8650 | 8678 | 8700 | 8722 | 8732 | 8739 |

1 rows × 170 columns

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

246 | NaN | China, Sum | 31.8257 | 117.226 | 17 | 18 | 26 | 42 | 56 | 82 | ... | 4641 | 4641 | 4641 | 4641 | 4641 | 4641 | 4641 | 4641 | 4641 | 4641 |

1 rows × 170 columns

Now with the deaths statistics summarized and repaired, we are good to query for the top 10 countries with the highest numbers of COVID-19 deaths around the globe.

`world_deaths_ts_df.sort_values(by = date_list[-1], ascending = False).head(10)`

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

206 | NaN | US | 37.0902 | -95.7129 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 125631 | 126120 | 126360 | 126711 | 127432 | 128105 | 128803 | 129434 | 129676 | 129947 |

20 | NaN | Brazil | -14.235 | -51.9253 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 55961 | 57070 | 57622 | 58314 | 59594 | 60632 | 61884 | 63174 | 64265 | 64867 |

204 | NaN | United Kingdom | 55.3781 | -3.436 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 43414 | 43514 | 43550 | 43575 | 43730 | 43906 | 43995 | 44131 | 44198 | 44220 |

118 | NaN | Italy | 43 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 34708 | 34716 | 34738 | 34744 | 34767 | 34788 | 34818 | 34833 | 34854 | 34861 |

139 | NaN | Mexico | 23.6345 | -102.553 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 25779 | 26381 | 26648 | 27121 | 27769 | 28510 | 29189 | 29843 | 30366 | 30639 |

97 | NaN | France | 46.2276 | 2.2137 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 29705 | 29704 | 29704 | 29736 | 29763 | 29780 | 29794 | 29812 | 29812 | 29813 |

182 | NaN | Spain | 40 | -4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 28338 | 28341 | 28343 | 28346 | 28355 | 28364 | 28368 | 28385 | 28385 | 28385 |

112 | NaN | India | 21 | 78 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 15685 | 16095 | 16475 | 16893 | 17400 | 17834 | 18213 | 18655 | 19268 | 19693 |

114 | NaN | Iran | 32 | 53 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 10239 | 10364 | 10508 | 10670 | 10817 | 10958 | 11106 | 11260 | 11408 | 11571 |

162 | NaN | Peru | -9.19 | -75.0152 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 8939 | 9135 | 9317 | 9504 | 9677 | 9860 | 10045 | 10226 | 10412 | 10589 |

10 rows × 170 columns

```
# a simplified view
world_deaths_ts_df[['Country/Region', date_list[-1]]].sort_values(by = date_list[-1], ascending = False).head(10)
```

Country/Region | 7/5/20 | |
---|---|---|

206 | US | 129947 |

20 | Brazil | 64867 |

204 | United Kingdom | 44220 |

118 | Italy | 34861 |

139 | Mexico | 30639 |

97 | France | 29813 |

182 | Spain | 28385 |

112 | India | 19693 |

114 | Iran | 11571 |

162 | Peru | 10589 |

#### Summarize and repair the time-series global recovered cases

Like what we have done in the previous sections, we will perform the similar approach to fetch, parse and repair the data for global recovered cases, which includes steps to:

- fetch and read the online CSV source into DataFrame
- summarize and repair the DataFrame
- list the top 10 countries

```
world_recovered_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'
world_recovered_ts_df = pd.read_csv(world_recovered_ts_url, header=0, escapechar='\\')
world_recovered_ts_df.head(5)
```

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | NaN | Afghanistan | 33.0000 | 65.0000 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 10306 | 10674 | 12604 | 13934 | 14131 | 15651 | 16041 | 17331 | 19164 | 19366 |

1 | NaN | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1298 | 1346 | 1384 | 1438 | 1459 | 1516 | 1559 | 1592 | 1637 | 1657 |

2 | NaN | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9066 | 9202 | 9371 | 9674 | 9897 | 10040 | 10342 | 10832 | 11181 | 11492 |

3 | NaN | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 799 | 799 | 799 | 799 | 799 | 799 | 800 | 800 | 800 | 800 |

4 | NaN | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 81 | 81 | 81 | 93 | 93 | 97 | 97 | 107 | 108 | 108 |

5 rows × 170 columns

```
for country in ["Australia", "Canada", "China"]:
world_recovered_ts_df = sum_all_provinces_in_country(world_recovered_ts_df, country)
```

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

253 | NaN | Australia, Sum | -35.4735 | 149.012 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 6960 | 6993 | 7007 | 7037 | 7040 | 7090 | 7130 | 7319 | 7399 | 7420 |

1 rows × 170 columns

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

246 | NaN | Canada, Sum | 56.1304 | -106.347 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 67182 | 67445 | 67689 | 68698 | 69120 | 69397 | 69872 | 70232 | 70507 | 70772 |

1 rows × 170 columns

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

246 | NaN | China, Sum | 31.8257 | 117.226 | 28 | 30 | 36 | 39 | 49 | 58 | ... | 79580 | 79591 | 79609 | 79619 | 79632 | 79650 | 79665 | 79680 | 79706 | 79718 |

1 rows × 170 columns

Now with the statistics summarized and repaired, we are good to query for the **top 10 countries with the highest numbers of COVID-19 recovered cases around the globe**.

`world_recovered_ts_df.sort_values(by = date_list[-1], ascending = False).head(10)`

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 6/26/20 | 6/27/20 | 6/28/20 | 6/29/20 | 6/30/20 | 7/1/20 | 7/2/20 | 7/3/20 | 7/4/20 | 7/5/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

21 | NaN | Brazil | -14.235 | -51.9253 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 702399 | 727715 | 746018 | 757811 | 788318 | 817642 | 957692 | 984615 | 990731 | 1029045 |

216 | NaN | US | 37.0902 | -95.7129 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 670809 | 679308 | 685164 | 705203 | 720631 | 729994 | 781970 | 790404 | 894325 | 906763 |

175 | NaN | Russia | 60 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 383524 | 392703 | 398436 | 402778 | 411973 | 422235 | 428276 | 437155 | 446127 | 449995 |

116 | NaN | India | 21 | 78 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 295881 | 309713 | 321723 | 334822 | 347912 | 359860 | 379892 | 394227 | 409083 | 424433 |

30 | NaN | Chile | -35.6751 | -71.543 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 223431 | 228055 | 232210 | 236154 | 241229 | 245443 | 249247 | 253343 | 257451 | 261039 |

118 | NaN | Iran | 32 | 53 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 177852 | 180661 | 183310 | 186180 | 188758 | 191487 | 194098 | 196446 | 198949 | 201330 |

145 | NaN | Mexico | 23.6345 | -102.553 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 156827 | 160721 | 164646 | 170147 | 174538 | 178526 | 183757 | 189345 | 195724 | 199914 |

169 | NaN | Peru | -9.19 | -75.0152 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 159806 | 164024 | 167998 | 171159 | 174535 | 178245 | 182097 | 185852 | 189621 | 193957 |

122 | NaN | Italy | 43 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 187615 | 188584 | 188891 | 189196 | 190248 | 190717 | 191083 | 191467 | 191944 | 192108 |

103 | NaN | Germany | 51 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 177149 | 177518 | 177657 | 177770 | 178100 | 179100 | 179800 | 180300 | 181000 | 181719 |

10 rows × 170 columns

```
# a simplfied view
world_recovered_ts_df[['Country/Region', date_list[-1]]].sort_values(by = date_list[-1], ascending = False).head(10)
```

Country/Region | 7/5/20 | |
---|---|---|

21 | Brazil | 1029045 |

216 | US | 906763 |

175 | Russia | 449995 |

116 | India | 424433 |

30 | Chile | 261039 |

118 | Iran | 201330 |

145 | Mexico | 199914 |

169 | Peru | 193957 |

122 | Italy | 192108 |

103 | Germany | 181719 |

Enough with understanding, summarizing, and exploratory analysis on the the existing data, now let us model the data on the Polynomial Regression, Neural Network, and SIR epidemic models and try to predict the count of cases in the upcoming days.

## Predictive Analysis

Before performing the predictive analysis using the time-series data for the United States, we first need to transpose the time-series matrix, and map the `date`

fields into `number of days`

since the first reported. In the following, we will perform the approach three times to make three new analysis-ready DataFrames, namely, `new_usa_confirmed_df`

, `new_usa_deaths_df`

, and `new_usa_recovered_df`

[15,16,17,18].

```
usa_overall_confirmed_ts_df = world_confirmed_ts_df[world_confirmed_ts_df["Country/Region"] == "US"]
new_usa_confirmed_df = usa_overall_confirmed_ts_df[date_list].T
new_usa_confirmed_df.columns = ["confirmed"]
new_usa_confirmed_df = new_usa_confirmed_df.assign(days=[1 +
i for i in range(len(new_usa_confirmed_df))])[['days'] +
new_usa_confirmed_df.columns.tolist()]
new_usa_confirmed_df
```

days | confirmed | |
---|---|---|

1/22/20 | 1 | 1 |

1/23/20 | 2 | 1 |

1/24/20 | 3 | 2 |

1/25/20 | 4 | 2 |

1/26/20 | 5 | 5 |

... | ... | ... |

6/30/20 | 161 | 2636414 |

7/1/20 | 162 | 2687588 |

7/2/20 | 163 | 2742049 |

7/3/20 | 164 | 2794153 |

7/4/20 | 165 | 2839436 |

165 rows × 2 columns

```
usa_overall_deaths_ts_df = world_deaths_ts_df[world_deaths_ts_df["Country/Region"] == "US"]
new_usa_deaths_df = usa_overall_deaths_ts_df[date_list].T
new_usa_deaths_df.columns = ["deaths"]
new_usa_deaths_df = new_usa_deaths_df.assign(days=[1 +
i for i in range(len(new_usa_deaths_df))])[['days'] +
new_usa_deaths_df.columns.tolist()]
new_usa_deaths_df
```

days | deaths | |
---|---|---|

1/22/20 | 1 | 0 |

1/23/20 | 2 | 0 |

1/24/20 | 3 | 0 |

1/25/20 | 4 | 0 |

1/26/20 | 5 | 0 |

... | ... | ... |

6/30/20 | 161 | 127432 |

7/1/20 | 162 | 128105 |

7/2/20 | 163 | 128803 |

7/3/20 | 164 | 129434 |

7/4/20 | 165 | 129676 |

165 rows × 2 columns

```
usa_overall_recovered_ts_df = world_recovered_ts_df[world_recovered_ts_df["Country/Region"] == "US"]
new_usa_recovered_df = usa_overall_recovered_ts_df[date_list].T
new_usa_recovered_df.columns = ["recovered"]
new_usa_recovered_df = new_usa_recovered_df.assign(days=[1 +
i for i in range(len(new_usa_recovered_df))])[['days'] +
new_usa_recovered_df.columns.tolist()]
new_usa_recovered_df
```

days | recovered | |
---|---|---|

1/22/20 | 1 | 0 |

1/23/20 | 2 | 0 |

1/24/20 | 3 | 0 |

1/25/20 | 4 | 0 |

1/26/20 | 5 | 0 |

... | ... | ... |

6/30/20 | 161 | 720631 |

7/1/20 | 162 | 729994 |

7/2/20 | 163 | 781970 |

7/3/20 | 164 | 790404 |

7/4/20 | 165 | 894325 |

165 rows × 2 columns

```
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
class PolynomialRegressionModel:
def __init__(self, model_name, polynomial_degree):
self.__model_name = model_name
self.__polynomial_degree = polynomial_degree
self.__model = None
def train(self, x, y):
polynomial_features = PolynomialFeatures(degree=self.__polynomial_degree)
x_poly = polynomial_features.fit_transform(x)
self.__model = LinearRegression()
self.__model.fit(x_poly, y)
def get_predictions(self, x):
polynomial_features = PolynomialFeatures(degree=self.__polynomial_degree)
x_poly = polynomial_features.fit_transform(x)
return np.round(self.__model.predict(x_poly), 0).astype(np.int32)
def get_model_polynomial_str(self):
coef = self.__model.coef_
intercept = self.__model.intercept_
poly = "{0:.3f}".format(intercept)
for i in range(1, len(coef)):
if coef[i] >= 0:
poly += " + "
else:
poly += " - "
poly += "{0:.3f}".format(coef[i]).replace("-", "") + "X^" + str(i)
return poly
```

```
training_set = new_usa_confirmed_df
x = np.array(training_set["days"]).reshape(-1, 1)
y = training_set["confirmed"]
```

```
training_set_deaths = new_usa_deaths_df
x_deaths = np.array(training_set_deaths["days"]).reshape(-1, 1)
y_deaths = training_set_deaths["deaths"]
```

### Polynomial Regression

Let us first try to fit the data with a Polynomial Regression model, since the time-series chart of confirmed cases looks like an exponential curve:

#### Confirmed Cases with PR model

##### degrees = 2

```
regression_model = PolynomialRegressionModel("Cases using Polynomial Regression", 2)
regression_model.train(x, y)
```

```
y_pred = regression_model.get_predictions(x)
y_pred
```

array([ -68133, -70086, -71795, -73260, -74481, -75456, -76188, -76674, -76917, -76915, -76668, -76177, -75442, -74462, -73237, -71768, -70055, -68097, -65894, -63447, -60756, -57820, -54640, -51215, -47546, -43632, -39474, -35071, -30424, -25532, -20396, -15016, -9390, -3521, 2593, 8952, 15555, 22402, 29494, 36831, 44412, 52237, 60307, 68621, 77180, 85983, 95031, 104323, 113860, 123641, 133667, 143937, 154452, 165211, 176214, 187463, 198955, 210692, 222674, 234900, 247370, 260085, 273045, 286248, 299697, 313390, 327327, 341509, 355935, 370606, 385521, 400681, 416085, 431734, 447627, 463764, 480147, 496773, 513644, 530760, 548120, 565724, 583573, 601667, 620005, 638587, 657414, 676485, 695801, 715361, 735166, 755216, 775509, 796048, 816830, 837857, 859129, 880645, 902406, 924411, 946660, 969155, 991893, 1014876, 1038104, 1061576, 1085292, 1109253, 1133458, 1157908, 1182602, 1207541, 1232725, 1258152, 1283825, 1309741, 1335903, 1362308, 1388958, 1415853, 1442992, 1470376, 1498004, 1525876, 1553994, 1582355, 1610961, 1639811, 1668906, 1698246, 1727830, 1757658, 1787731, 1818048, 1848610, 1879416, 1910467, 1941762, 1973302, 2005086, 2037115, 2069388, 2101906, 2134668, 2167675, 2200926, 2234421, 2268161, 2302146, 2336375, 2370848, 2405566, 2440528, 2475735, 2511187, 2546883, 2582823, 2619008, 2655437, 2692111, 2729029, 2766192, 2803599, 2841250, 2879147])

```
def print_forecast(model_name, model, beginning_day=0, limit=10):
next_days_x = np.array(range(beginning_day, beginning_day + limit)).reshape(-1, 1)
next_days_pred = model.get_predictions(next_days_x)
print("The forecast for " + model_name + " in the following " + str(limit) + " days is:")
for i in range(0, limit):
print("Day " + str(i + 1) + ": " + str(next_days_pred[i]))
```

```
print_forecast("Cases using Polynomial Regression", regression_model,
beginning_day=len(x),
limit=10)
```

The forecast for Cases using Polynomial Regression in the following 10 days is: Day 1: 2879147 Day 2: 2917287 Day 3: 2955672 Day 4: 2994302 Day 5: 3033176 Day 6: 3072294 Day 7: 3111657 Day 8: 3151265 Day 9: 3191117 Day 10: 3231213

```
import operator
def plot_graph(model_name, x, y, y_pred):
plt.scatter(x, y, s=10)
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x, y_pred), key=sort_axis)
x, y_pred = zip(*sorted_zip)
plt.plot(x, y_pred, color='m')
plt.title("Amount of " + model_name + " in each day")
plt.xlabel("Day")
plt.ylabel(model_name)
plt.show()
```

`plot_graph("Cases using Polynomial Regression", x, y, y_pred)`

The approximation between the fitted and the actual curves does not look too ideal. We can then increase the degree of the Polynomial Regression Model to 3, and see what happens:

##### degrees = 3

```
regression_model = PolynomialRegressionModel("Confirmed cases using Polynomial Regression (d=3)", 3)
regression_model.train(x, y)
y_pred = regression_model.get_predictions(x)
print_forecast("Confirmed cases using Polynomial Regression (d=3)", regression_model,
beginning_day=len(x),
limit=10)
plot_graph("Confirmed cases using Polynomial Regression (d=3)", x, y, y_pred)
```

The forecast for Confirmed cases using Polynomial Regression (d=3) in the following 10 days is: Day 1: 2664083 Day 2: 2685999 Day 3: 2707664 Day 4: 2729074 Day 5: 2750221 Day 6: 2771100 Day 7: 2791706 Day 8: 2812031 Day 9: 2832070 Day 10: 2851818

Though the approximation of the fitted curve and the actual data got improved for when days<80, the two curves diverged towards days=100.

The same approach can be done with the death cases using Polynomial Regression model (degree=2, and 3):

#### Deaths with PR

##### Degrees = 2

```
regression_model = PolynomialRegressionModel("Deaths using Polynomial Regression", 2)
regression_model.train(x_deaths, y_deaths)
y_deaths_pred = regression_model.get_predictions(x_deaths)
print_forecast("Deaths using Polynomial Regression", regression_model,
beginning_day=len(x_deaths),
limit=10)
plot_graph("Deaths using Polynomial Regression", x_deaths, y_deaths, y_deaths_pred)
```

The forecast for Deaths using Polynomial Regression in the following 10 days is: Day 1: 152437 Day 2: 154332 Day 3: 156237 Day 4: 158153 Day 5: 160081 Day 6: 162020 Day 7: 163970 Day 8: 165930 Day 9: 167902 Day 10: 169885

```
regression_model = PolynomialRegressionModel("Deaths using Polynomial Regression (d=3)", 3)
regression_model.train(x_deaths, y_deaths)
y_deaths_pred = regression_model.get_predictions(x_deaths)
print_forecast("Deaths using Polynomial Regression (d=3)", regression_model,
beginning_day=len(x_deaths),
limit=10)
plot_graph("Deaths using Polynomial Regression (d=3)", x_deaths, y_deaths, y_deaths_pred)
```

The forecast for Deaths using Polynomial Regression (d=3) in the following 10 days is: Day 1: 129623 Day 2: 129797 Day 3: 129928 Day 4: 130018 Day 5: 130065 Day 6: 130069 Day 7: 130029 Day 8: 129945 Day 9: 129815 Day 10: 129639

Since the model fitting results we have obtained so far with Polynomial Regression (degrees=2, and 3) are not too ideal, let us turn to Neural Network modeling and see whether its result would be better:

### Neural Network

#### Confirmed Cases with NN

```
from sklearn.neural_network import MLPRegressor
class NeuralNetModel:
def __init__(self, model_name):
self.__model_name = model_name
self.__model = None
def train(self, x, y, hidden_layer_sizes=[10,], learning_rate=0.001, max_iter=2000):
self.__model = MLPRegressor(solver="adam", activation="relu", alpha=1e-5, random_state=0,
hidden_layer_sizes=hidden_layer_sizes, verbose=False, tol=1e-5,
learning_rate_init=learning_rate, max_iter=max_iter)
self.__model.fit(x, y)
def get_predictions(self, x):
return np.round(self.__model.predict(x), 0).astype(np.int32)
```

```
neural_net_model = NeuralNetModel("Confirmed Cases using Neural Network")
neural_net_model.train(x, y, [80, 80], 0.001, 50000)
```

```
y_pred = neural_net_model.get_predictions(x)
y_pred
```

array([ -2452, -386, 1681, 3748, 5812, 6725, 5716, 4642, 3569, 2495, 1421, 347, -726, -1800, -2874, -3947, -2357, -2283, -2228, -2177, -2213, -2250, -2287, -2323, -2360, -2397, -2434, -2470, -2507, -2544, -2580, -2617, -2654, -2691, -2727, -2764, -2801, -2837, -2874, -2911, -2948, -2984, -3021, -2152, -1039, 74, 1187, 2300, 3414, 4527, 5640, 6753, 7866, 8979, 10092, 11205, 12318, 13432, 14545, 15658, 28702, 49546, 70389, 95895, 121508, 147121, 172734, 198347, 223961, 249574, 275187, 300800, 326414, 352027, 377640, 403253, 428867, 454480, 480093, 505706, 531320, 556933, 582546, 608159, 633772, 659386, 684999, 710612, 736225, 761839, 787452, 813065, 838678, 864292, 889905, 915518, 941131, 966745, 992358, 1017971, 1043584, 1069197, 1094811, 1120424, 1146037, 1171650, 1197264, 1222877, 1248490, 1274103, 1299717, 1325330, 1350943, 1376556, 1402170, 1427783, 1453396, 1479009, 1504622, 1530236, 1555849, 1581462, 1607075, 1632689, 1658302, 1683915, 1709528, 1735142, 1760755, 1786368, 1811981, 1837595, 1863208, 1888821, 1914434, 1940047, 1965661, 1991274, 2016887, 2042500, 2068114, 2093727, 2119340, 2144953, 2170567, 2196180, 2221793, 2247406, 2273019, 2298633, 2324246, 2349859, 2375472, 2401086, 2426699, 2452312, 2477925, 2503539, 2529152, 2554765, 2580378, 2605992, 2631605, 2657218, 2682831])

```
print_forecast("Confirmed Cases using Neural Network", neural_net_model,
beginning_day=len(x),
limit=10)
```

The forecast for Confirmed Cases using Neural Network in the following 10 days is: Day 1: 2682831 Day 2: 2708305 Day 3: 2732643 Day 4: 2756752 Day 5: 2780861 Day 6: 2804970 Day 7: 2829079 Day 8: 2853188 Day 9: 2877298 Day 10: 2901407

`plot_graph("Confirmed Cases using Neural Network", x, y, y_pred)`

#### Deaths with NN

```
neural_net_model = NeuralNetModel("Deaths using Neural Network")
neural_net_model.train(x_deaths, y_deaths, [100, 100], 0.001, 50000)
y_deaths_pred = neural_net_model.get_predictions(x_deaths)
print_forecast("Deaths using Neural Network", neural_net_model,
beginning_day=len(x_deaths),
limit=10)
plot_graph("Deaths using Neural Network", x_deaths, y_deaths, y_deaths_pred)
```

The forecast for Deaths using Neural Network in the following 10 days is: Day 1: 133677 Day 2: 134568 Day 3: 135459 Day 4: 136350 Day 5: 137241 Day 6: 138132 Day 7: 139023 Day 8: 139914 Day 9: 140805 Day 10: 141696

For Neural Network modeling, the approximation between the fitted and the actual curves looks much better than that of Polynomial Regression.

To summarize, the predicted numbers for the confirmed cases and deaths for the next 10 days (as of EOD 07/14/2020) in the United States are different with varied models:

Model | Confirmed | Deaths | Death Rate |
---|---|---|---|

PR (d=2) | 3,231,213 | 169,885 | 5.26% |

PR (d=3) | 2,851,818 | 129,639 | 4.55% |

NN | 2,901,407 | 141,696 | 4.88% |

Table 2. The predicted numbers for the confirmed cases and deaths for the next 10 days (as of EOD 07/14/2020).

## SIR Model

The Polynomial Regression and Neural Network models used in the previous section did provide reasonable predictions to the confirmed and death numbers in upcoming days. However, these models did not consider the nature of infectious diseases in between the figures. Next, we will talk about how `Compartmental Modeling`

techniques can be used to model infectious diseases (and in this case, the COVID-19), of which the simplest compartmental model is the `SIR model`

. You can check out the descriptions to the model and its basic blocks in the Wiki Page here. The SIR model is reasonably predictive for infectious diseases which are transmitted from human to human, and where recovery confers lasting resistance, such as measles, mumps and rubella [6,7,8,9].

As we can tell from the name of the model, it consists of three compartments: `S`

for the number of `susceptible`

, `I`

for the number of `infectious`

, and `R`

for the number of `recovered`

or deceased (or immune) individuals. Each member of the population typically progresses from susceptible to infectious to recovered.

Equation 1. SIR Model is an ordinary differential equation model, and can be described by the above equation (source:

ASU,

ASU Public Course).

Parameters being used in the equation includes,

- The infectious rate,
`β`

, controls the rate of spread which represents the probability of transmitting disease between a susceptible and an infectious individual. `D`

, the average duration, or say days of infection.- The recovery rate,
`γ = 1/D`

, is determined by the average duration. `N`

, the total population of the study.

### Basic Reproduction Number

One of the critical question that epidemiologisits ask the most during each disease outbreak, is that how far and how fast a virus can spread through a population, and the `basic reproduction number`

(a.k.a. the `R naught`

, or `R0`

) is often used to answer the question.

`R0`

refers to "how many other people one sick person is likely to infect on average in a group that's susceptible to the disease (meaning they don’t already have immunity from a vaccine or from fighting off the disease before)", and is "super important in the context of public health because it foretells how big an outbreak will be. The higher the number, the greater likelihood a large population will fall sick" [1].

Another mostly asked question is how contagious or deadly a virus is, and it together with `R0`

can depict the most important properties of a virus. Take Measles as an example, as the most contagious virus researchers know about, measles can linger in the air of a closed environment and sicken people up to two hours after an infected person who coughed or sneezed there has left. `R0`

of measles can be as high as 18, if people exposed to the virus aren't vaccinated. Let's now compare it to Ebola, which is more deadly but much less efficient in spreading. `R0`

of Ebola is typically just 2, which in part might be due to many infected individuals die before they can pass the virus to someone else.

#### How a `R0`

is computed?

In a basic scenario, the basic reproductive number `R0`

equals to

```
R0 = β/γ = β*D
```

However, in a fully susceptible population, `R0`

is the number of secondary infections generated by the first infectious individual over the course of the infectious period, which is defined as,

```
R0=S*L*β
```

in which `S`

=`the number of susceptible hosts`

, `L`

=`length of infection`

, and `β`

=`transmissibility`

.

#### How to understand `R0`

?

According to this heathline article, there are three possibilities for the potential spread or decline of a disease, depending on its `R0`

value:

- If less than 1, each existing infection causes less than one new infection, and the disease will decline and eventually die out.
- If equals to 1, each existing infection causes one new infection, and the disease will stay alive and stable, but there will not be an outbreak or an epidemic.
- If more than 1, each existing infection causes more than one new infection, and the disease will spread between people, and there may be an outbreak or epidemic.

### Fit the SIR Model per country

In order to fit the SIR model with the actual number of confirmed cases (and the recovered, and death tolls), we need to estimate the `β`

and `γ`

parameters. In the process, we will use `solve_ivp`

function in scipy module, to solve the ordinary differential equation in the SIR model.

```
from scipy.integrate import solve_ivp
from scipy.optimize import minimize
from datetime import timedelta, datetime
"""The dict defined here includes the countries to run the model against,
and the date that the first confirmed case was reported.
"""
START_DATE = {
'China': '1/22/20',
'Japan': '1/22/20',
'Korea, South': '1/22/20',
'US': '1/22/20',
'Italy': '1/31/20',
'Spain': '2/1/20',
'Iran': '2/19/20'
}
```

```
# Adding into the consideration the social distancing factor (to be explained in the next section)
rho = 1
```

```
class Learner(object):
"""constructs an SIR model learner to load training set, train the model,
and make predictions at country level.
"""
def __init__(self, country, loss, start_date, predict_range,s_0, i_0, r_0):
self.country = country
self.loss = loss
self.start_date = start_date
self.predict_range = predict_range
self.s_0 = s_0
self.i_0 = i_0
self.r_0 = r_0
def load_confirmed(self, country):
"""
Load confirmed cases downloaded from pre-made dataframe
"""
country_df = world_confirmed_ts_df[world_confirmed_ts_df["Country/Region"] == country]
return country_df.iloc[0].loc[START_DATE[country]:]
def load_dead(self, country):
"""
Load deaths downloaded from pre-made dataframe
"""
country_df = world_deaths_ts_df[world_deaths_ts_df["Country/Region"] == country]
return country_df.iloc[0].loc[START_DATE[country]:]
def load_recovered(self, country):
"""
Load recovered cases downloaded from pre-made dataframe
"""
country_df = world_recovered_ts_df[world_recovered_ts_df["Country/Region"] == country]
return country_df.iloc[0].loc[START_DATE[country]:]
def extend_index(self, index, new_size):
values = index.values
current = datetime.strptime(index[-1], '%m/%d/%y')
while len(values) < new_size:
current = current + timedelta(days=1)
values = np.append(values, datetime.strftime(current, '%m/%d/%y'))
return values
def predict_0(self, beta, gamma, data):
"""
Simplifield version.
Predict how the number of people in each compartment can be changed through time toward the future.
The model is formulated with the given beta and gamma.
"""
predict_range = 150
new_index = self.extend_index(data.index, predict_range)
size = len(new_index)
def SIR(t, y):
S = y[0]
I = y[1]
R = y[2]
return [-rho*beta*S*I, rho*beta*S*I-gamma*I, gamma*I]
extended_actual = np.concatenate((data.values, [None] * (size - len(data.values))))
return new_index, extended_actual, solve_ivp(SIR, [0, size], [S_0,I_0,R_0], t_eval=np.arange(0, size, 1))
def predict(self, beta, gamma, data, recovered, death, country, s_0, i_0, r_0):
"""
Predict how the number of people in each compartment can be changed through time toward the future.
The model is formulated with the given beta and gamma.
"""
new_index = self.extend_index(data.index, self.predict_range)
size = len(new_index)
def SIR(t, y):
S = y[0]
I = y[1]
R = y[2]
return [-rho*beta*S*I, rho*beta*S*I-gamma*I, gamma*I]
extended_actual = np.concatenate((data.values, [None] * (size - len(data.values))))
extended_recovered = np.concatenate((recovered.values, [None] * (size - len(recovered.values))))
extended_death = np.concatenate((death.values, [None] * (size - len(death.values))))
solved = solve_ivp(SIR, [0, size], [s_0,i_0,r_0], t_eval=np.arange(0, size, 1))
return new_index, extended_actual, extended_recovered, extended_death, solved
def train_0(self):
"""
Simplified version.
Run the optimization to estimate the beta and gamma fitting the given confirmed cases.
"""
data = self.load_confirmed(self.country)
optimal = minimize(
loss,
[0.001, 0.001],
args=(data),
method='L-BFGS-B',
bounds=[(0.00000001, 0.4), (0.00000001, 0.4)]
)
beta, gamma = optimal.x
new_index, extended_actual, prediction = self.predict(beta, gamma, data)
df = pd.DataFrame({
'Actual': extended_actual,
'S': prediction.y[0],
'I': prediction.y[1],
'R': prediction.y[2]
}, index=new_index)
fig, ax = plt.subplots(figsize=(15, 10))
ax.set_title(self.country)
df.plot(ax=ax)
fig.savefig(f"{self.country}.png")
def train(self):
"""
Run the optimization to estimate the beta and gamma fitting the given confirmed cases.
"""
recovered = self.load_recovered(self.country)
death = self.load_dead(self.country)
data = (self.load_confirmed(self.country) - recovered - death)
optimal = minimize(loss, [0.001, 0.001],
args=(data, recovered, self.s_0, self.i_0, self.r_0),
method='L-BFGS-B',
bounds=[(0.00000001, 0.4), (0.00000001, 0.4)])
print(optimal)
beta, gamma = optimal.x
new_index, extended_actual, extended_recovered, extended_death, prediction = self.predict(beta, gamma, data,
recovered, death,
self.country, self.s_0,
self.i_0, self.r_0)
df = pd.DataFrame({'Infected data': extended_actual, 'Recovered data': extended_recovered,
'Death data': extended_death, 'Susceptible': prediction.y[0],
'Infected': prediction.y[1], 'Recovered': prediction.y[2]}, index=new_index)
fig, ax = plt.subplots(figsize=(15, 10))
ax.set_title(self.country)
df.plot(ax=ax)
print(f"country={self.country}, beta={beta:.8f}, gamma={gamma:.8f}, r_0:{(beta/gamma):.8f}")
fig.savefig(f"{self.country}.png")
```

We need to also define the loss function to be used in the optimization process - note that the `the root mean squared error (RMSE)`

not only considers the confirmed cases, but also the recovered.

```
def loss(point, data, recovered, s_0, i_0, r_0):
size = len(data)
beta, gamma = point
def SIR(t, y):
S = y[0]
I = y[1]
R = y[2]
return [-rho*beta*S*I, rho*beta*S*I-gamma*I, gamma*I]
solution = solve_ivp(SIR, [0, size], [s_0,i_0,r_0], t_eval=np.arange(0, size, 1), vectorized=True)
l1 = np.sqrt(np.mean((solution.y[1] - data)**2))
l2 = np.sqrt(np.mean((solution.y[2] - recovered)**2))
alpha = 0.1
return alpha * l1 + (1 - alpha) * l2
```

Now let's run the estimation process against major countries suffering from the COVID-19 outbreak. The following cells illustrate how the number of the compartment can be changed over time, according to the `SIR`

model.

How to decide the initial values of compartments? This is based on two conditions: the rough inference of the max quantity of susceptible people for each country, and the computation capacity (the execution takes overly long time for 1 million susceptible population). Though SIR model expects the susceptible to be homogenous, well-mixed, and accessible to each other, setting the whole population in the country is not realistic for sure. In this case, let's set the compartment 100000 to start with.

Also in the initial settings, the initial infectious people were 2 for every country, and the start of the outbreak is varied, as specified in the `START_DATE`

dict object.

#### The initial batch of SIR Model fitting

```
countries = ['China', 'Korea, South']
predict_range = 250
s_0 = 100000
i_0 = 2
r_0 = 10
for country in countries:
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 10280.378870137927 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([4227.74755862, -7.67140591]) message: b'ABNORMAL_TERMINATION_IN_LNSRCH' nfev: 375 nit: 27 status: 2 success: False x: array([7.53235845e-06, 2.06861562e-02]) country=China, beta=0.00000753, gamma=0.02068616, r_0:0.00036413 fun: 5302.062155691667 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([ 2.09337929e+08, -1.32026589e+03]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 237 nit: 19 status: 0 success: True x: array([1.01684995e-06, 4.17528180e-02]) country=Korea, South, beta=0.00000102, gamma=0.04175282, r_0:0.00002435

How do we decide if the model is a good approximation? First, by observing if curves of the `Infected data`

and the projected `Infected`

are close together; Second, by observing if the `Recovered data`

and the projected `Recovered`

curves are also close. For instance, in the previous cell output, we can see the approximation of the model is doing fairly nice for that of Mainland China, but not as ideal for that of South Korea.

```
countries = ['Japan', 'Italy', 'Spain']
predict_range = 250
s_0 = 100000
i_0 = 2
r_0 = 10
for country in countries:
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 8039.341910449291 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([135990.51944766, -13777.95169901]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 162 nit: 6 status: 0 success: True x: array([4.45162819e-07, 4.45177249e-07]) country=Japan, beta=0.00000045, gamma=0.00000045, r_0:0.99996759 fun: 42082.071337634945 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([1.38366158e+02, 1.67347025e-02]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 399 nit: 34 status: 0 success: True x: array([2.42577465e-06, 5.09067457e-02]) country=Italy, beta=0.00000243, gamma=0.05090675, r_0:0.00004765 fun: 37210.55846013384 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([-4.30707999e+09, 1.31464732e+05]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 297 nit: 23 status: 0 success: True x: array([2.89569389e-06, 9.88570287e-02]) country=Spain, beta=0.00000290, gamma=0.09885703, r_0:0.00002929

As seen in the cell output above, the approximation of the model is looking reasonably good for that of Italy and Spain, but not as ideal for that of Japan - the gap between the `Infected data`

and the projected `Infected`

curves is overly large, and so is that between the `Recovered data`

and `Recovered`

curves.

```
predict_range = 250
s_0 = 100000
i_0 = 2
r_0 = 10
country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 317178.7735714818 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([9.87201929, 0.20954758]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 45 nit: 5 status: 0 success: True x: array([0.02542935, 0.02778655]) country=US, beta=0.02542935, gamma=0.02778655, r_0:0.91516763

Table 3 (as shown below) displays what we have got so far with the model, for the studied countries. Note that the last column represents if the approximation of the model is good, or else parameters need to be modified before running the 2nd batch of fitting.

Countries | Beta | Gamma | R0 | Good approximation? |
---|---|---|---|---|

China | 0.00000753 | 0.02068616 | 0.00036413 | True |

Japan | 0.00000045 | 0.00000045 | 0.9999675 | False |

S. Korea | 0.00000102 | 0.04175282 | 0.00002435 | False |

Italy | 0.00000243 | 0.05090675 | 0.00004765 | True |

Spain | 0.00000290 | 0.09885703 | 0.00002929 | True |

US | 0.02542935 | 0.02778655 | 0.91516763 | False |

Table 3. The computed Beta, Gamma, and R0 values for studied countries (with data collected up to EOD 07/04/2020).

Now the SIR model is well-fitting the actual data for both confirmed and recovered cases. The `R0`

values of Japan and the United States are significantly higher than in other countries, which might be caused by the relatively lower recovery rates in both countries, or that the initial values can be assumed wrong (or that the model fails to find the global minimum in the optimization process).

On the other hand, the total number of confirmed cases for the United States is way higher than the initial susceptible number we picked here. The trends for S, I, and R, and the R0 value hence computed would not make much sense. In the following cells, we will experiment with larger susceptible numbers (and/or longer prediction periods), and see how the results would differ.

#### The second batch of SIR Model fitting

Let's run the model again with modified parameters for the countries that (1) were showing unreasonably high R0 values, or (2) failed to make good approximation in the initial batch.

```
predict_range = 250
s_0 = 25000
i_0 = 2
r_0 = 10
country="Japan"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 1422.4699520979984 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([279725.30349416, 589.61854847]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 432 nit: 16 status: 0 success: True x: array([5.16087340e-06, 4.11036952e-02]) country=Japan, beta=0.00000516, gamma=0.04110370, r_0:0.00012556

```
countries = ['Korea, South', 'US']
predict_range = 250
s_0_values = [25000, 1000000]
i_0 = 2
r_0 = 10
for country, s_0 in zip(countries, s_0_values):
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 2177.392810393044 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([258762.58505377, 387.30468077]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 315 nit: 26 status: 0 success: True x: array([1.01798066e-05, 6.15592306e-03]) country=Korea, South, beta=0.00001018, gamma=0.00615592, r_0:0.00165366 fun: 132485.37921448034 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([1.83150386e+10, 4.76380734e+06]) message: b'ABNORMAL_TERMINATION_IN_LNSRCH' nfev: 300 nit: 15 status: 2 success: False x: array([2.31863766e-07, 1.31293805e-01]) country=US, beta=0.00000023, gamma=0.13129380, r_0:0.00000177

Table 4 (as shown below) lists the modified `Beta`

, `Gamma`

, and `R0`

values for Japan, Korea, Spain and US. Note that the last column represents if the approximation of the model is good, or else parameters need to be modified before the 3rd batch of fitting.

Countries | Beta | Gamma | R0 | Good approximation? |
---|---|---|---|---|

China | 0.00000753 | 0.02068616 | 0.00036413 | True |

Japan | 0.00000516 | 0.04110370 | 0.00012556 | True |

S. Korea | 0.00001018 | 0.00615592 | 0.00165366 | True |

Italy | 0.00000243 | 0.05090675 | 0.00004765 | True |

Spain | 0.00000290 | 0.09885703 | 0.00002929 | True |

US | 0.00000023 | 0.13129380 | 0.00000177 | False |

Table 4. The Beta, Gamma, and R0 values computed out of the 2nd batch model fitting for studied countries (with data collected up to EOD 07/04/2020).

Up to this point, we have run two batches of SIR model fitting, and all countries of study except for the United States have shown good approximations to the model (with modified parameters). Because the susceptible values set for the United States so far have been too low compared to the actual confirmed cases, we are going to continue modifying the `s_0`

values for the United States, and see if by doing so, the approximation of the model fitting can be improved in the next section.

#### The final batch of SIR Model fitting

```
predict_range = 250
s_0 = 2000000
i_0 = 2
r_0 = 10
country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 273805.4913666793 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([ 1849.03037734, -1823.58780876]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 30 nit: 5 status: 0 success: True x: array([0.00161177, 0.00161186]) country=US, beta=0.00161177, gamma=0.00161186, r_0:0.99994539

```
predict_range = 250
s_0 = 3000000
i_0 = 2
r_0 = 10
country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 361448.0875488295 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([3502.73912773, -58.84794518]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 24 nit: 3 status: 0 success: True x: array([0.00099983, 0.00106169]) country=US, beta=0.00099983, gamma=0.00106169, r_0:0.94172925

The previous two plots were due to `susceptible`

being set too low, and we will continue to adjust the `s_0`

to larger numbers.

```
predict_range = 250
s_0 = 3500000
i_0 = 2
r_0 = 10
country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 224079.8306076969 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([ 3.91748893e+12, -2.12266260e+06]) message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' nfev: 207 nit: 7 status: 0 success: True x: array([3.09456750e-08, 9.17191515e-03]) country=US, beta=0.00000003, gamma=0.00917192, r_0:0.00000337

```
predict_range = 250
s_0 = 4000000
i_0 = 2
r_0 = 10
country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 369649.23863352626 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([ 7.68412486e+08, -1.60154421e+04]) message: b'ABNORMAL_TERMINATION_IN_LNSRCH' nfev: 186 nit: 4 status: 2 success: False x: array([1.72969697e-08, 1.72955155e-08]) country=US, beta=0.00000002, gamma=0.00000002, r_0:1.00008408

s_0 | Beta | Gamma | R0 | Date of Peak | Max Infected Population |
---|---|---|---|---|---|

0.1M | 0.02542935 | 0.02778655 | 0.91516763 | Not applicable | Not applicable |

1M | 0.00000023 | 0.13129380 | 0.00000177 | Not applicable | Not applicable |

2M | 0.00161177 | 0.00161186 | 0.99994539 | Not applicable | Not applicable |

3M | 0.00099983 | 0.00106169 | 0.94172925 | Not applicable | Not applicable |

3.5M | 0.00000003 | 0.00917192 | 0.00000337 | 07/19/2020 | 2.5M |

4M | 0.00000002 | 0.00000002 | 1.00008408 | >10/28/2020 | >4M |

Table 5. The Beta, Gamma, and R0 values computed out of the final batch model fitting for the United States (with data collected up to EOD 07/04/2020).

#### Consider the social distancing effect in SIR Model fitting

Efforts of social distancing can mitigate the spread of infectious disease, and impact the transmitting/infectious rate, `β`

. Now, let's introduce this new value, `ρ`

, to describe the social distancing effect. It is going to be ranging from 0 to 1, where 0 indicates everyone is locked down and quarantined (as individuals), and 1 is equivalent to the base case demonstrated above. With `ρ`

being defined, the SIR model will be modified as:

```
S' = -ρ*β*S*I
```

We have already seen the trends of SIR model fitting when ρ=1. Now, let's set ρ to 0.8, and 0.5, and see the flattening effect as the social distancing effects got increased to contain the disease over a set period of time.

`rho = 0.8`

```
predict_range = 250
s_0 = 3500000
i_0 = 2
r_0 = 10
country="US"
start_date = START_DATE[country]
learner = Learner(country, loss, start_date, predict_range, s_0, i_0, r_0)
learner.train()
```

fun: 254442.13493214428 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64> jac: array([4.47150648e+12, 5.36130878e+06]) message: b'ABNORMAL_TERMINATION_IN_LNSRCH' nfev: 288 nit: 5 status: 2 success: False x: array([3.81849319e-08, 1.66008689e-02]) country=US, beta=0.00000004, gamma=0.01660087, r_0:0.00000230

As listed in Table 6, social distancing efforts will likely improve the survivability of the disease by giving more time for treatments and supplies to develop while keeping the peaks low.

Rho | Beta | Gamma | R0 | Date of Peak | Max Infected Population |
---|---|---|---|---|---|

1 | 0.00000003 | 0.00917192 | 0.00000337 | 07/19/2020 | 2.5M |

0.8 | 0.00000004 | 0.01660087 | 0.00000230 | 07/26/2020 | 1.95M |

Table 6. The Beta, Gamma, and R0 values computed with different `Rho` factors (when `S`=3.5M) for the United States (with data collected up to EOD 07/04/2020).

## Comparison between models

The SIR model we have been seeing so far, is a simplified approximation of early dynamics of an epidemic, with exponential growth. Based on a system of differential equations and three distinct populations (S, I, and R), it has been around for decades. For future extensions of this study, we can explore other complex approaches, such as the `SEIR`

model (an adaptation of SIR with an additional population of `Exposed`

individuals), or a polynomial curve fitting further-along epidemics and recalibrating locally, for instance, the `CHIME`

model.

Category | SIR | SEIR [18] | CHIME [18] |
---|---|---|---|

Definition | Simple & Basic compartmental model | a derivative of SIR model that contains `exposed` (infected but not infectious) | COVID19 Hospital Impact Model for Epidemics |

Latency Period | No | Yes | No |

Progress of Individuals | S-I-R | S-E-I-R | S-I-R |

Examples | N = S+I+R | N = S+E+I+R | N = S+I+R |

## Conclusions

Before vaccines to COVID-19 come available to all, what governments can do is only to slow down the transmission. By doing so, we don’t actually stop the spread but keep the transmission and the active cases at a point that in the time being well within the limits of the handling capacity of medical facilities. This is what we call as `Flattening The Curve`

. As advised by experts, there are at least three measures to help flatten the curve:

- Place travel restrictions; the transmission could be reduced by restricting people from traveling in or out of a particular region with high infections.
- Keep social distance; As shown in the SIR modeling that COVID-19 has a high
`R0`

and each infected person ends up infecting 2 or 3 people, social distancing might be the most significant in reducing transmission from the infected to the others. - Have more testing done. Testing is necessary to quickly identify and isolate the infected from the non-infected given that Covid-19 has a long incubation period (ranging from 2 to 26 days, while symptoms usually start appearing after 5-7 days), it is possible for an infected individual to spread the infection to others without his/her even knowing.

As stated by Thomas Pueyo in his Medium Post (which can be seen in the illustration below), when combining Transmission reduction and Travel restriction, if transmission rate of COVID-19 went down by even 25%, it could delay the peak by almost 14 weeks. Further reduction would delay it even more.

Together with what we have seen from the SIR modeling, reducing the transmission rate, and keeping the number of compartment small is crucial to flattening the curve. All should act together and fight the pandemic!

## References

[1] https://www.vox.com/2020/2/18/21142009/coronavirus-covid19-china-wuhan-deaths-pandemic

[2] https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

[3] https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

[4] https://github.com/CSSEGISandData/COVID-19/tree/web-data

[5] http://gabgoh.github.io/COVID/index.html

[6] https://scipython.com/book/chapter-8-scipy/additional-examples/the-sir-epidemic-model/

[7] https://www.lewuathe.com/covid-19-dynamics-with-sir-model.html

[8] https://github.com/Lewuathe/COVID19-SIR

[9] https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_India

[10] https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca

[11] https://science.sciencemag.org/content/early/2020/03/05/science.aba9757

[12] https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology

[13] https://science.sciencemag.org/content/early/2020/03/05/science.aba9757/tab-figures-data

[15] https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

[16] https://www.weforum.org/agenda/2020/03/covid19-coronavirus-countries-infection-trajectory/