Predictive Analysis of the 2019 Novel Coronavirus Pandemic¶
Introduction¶
Dashboards, statistics, and other information about COVID-19 are floating all over the internet, and different countries or regions are adopting varied strategies, from complete lockdown, to social distancing, to herd immunity. You might be confused about which strategy is right, and which information is valid. This notebook does not provide a final answer, but rather tools and methods that you can try yourself in modeling, analyzing, and predicting the spread of COVID-19 with the ArcGIS API for Python, and other libraries such as pandas and numpy. Hopefully, given the workflow demonstrations, you will be able to find the basic facts, current patterns, and future trends behind the common notions about how COVID-19 spreads, from a dataset perspective [1,2,3,4].
Before we dive into data science and analytics, let's start with importing the necessary modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from arcgis.gis import GIS
Import and Understand Source Dataset¶
Among all the official and unofficial data sources on the web providing COVID-19 related data, one of the most widely used datasets today is the one provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), which can be accessed on GitHub under the name - Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE [5,10,11,12]. The time-series consolidated data needed for all the analyses to be performed in this notebook fall into these two categories:
- Type: confirmed cases, deaths, and recoveries;
- Geography: global, and the United States only.
Now let's first look at the U.S. dataset.
Time-series data for the United States¶
The dataset can be directly imported into DataFrames with the read_csv method in pandas. Rather than downloading the file manually and then reading it, it is preferable to use the URLs (which point to the CSV files archived on GitHub), because as the situation changes, it becomes easier to reload and refresh the analysis with new data.
Now, let's read the time-series data of confirmed COVID-19 cases in the United States from the GitHub source URL into a pandas DataFrame:
# read time-series csv
usa_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'
usa_ts_df = pd.read_csv(usa_ts_url, header=0, escapechar='\\')
usa_ts_df.head(5)
usa_ts_df.columns
As we can see from the printout of usa_ts_df.columns, the first 11 columns are ['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Lat', 'Long_', 'Combined_Key'], while the rest of the columns are dates running from 1/22/20 to the most current date on record.
date_list = usa_ts_df.columns.tolist()[11:]
date_list[0], date_list[-1]
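As a quick sanity check (an illustrative aside, assuming the dataset remains a daily series with no gaps), the date strings can be parsed with pd.to_datetime to confirm they are in chronological order and contiguous:
# parse the date columns and verify they form a contiguous daily series
dates = pd.to_datetime(date_list, format='%m/%d/%y')
assert dates.is_monotonic_increasing
assert (dates[1:] - dates[:-1] == pd.Timedelta(days=1)).all()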
Repair and Summarize the state-wide time-series data¶
Look at the last five rows of the DataFrame usa_ts_df, and notice that they are all cases for Province_State == "Utah":
usa_ts_df.tail()
Some rows in the DataFrame are in a many-rows-to-one-state relationship, while others are in a one-row-to-one-state relationship. All records listed in usa_ts_df with Admin2 not equal to NaN fall into the "many-rows-to-one-state" category.
usa_ts_df[usa_ts_df["Admin2"].notna()].head()
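To quantify how many rows each state contributes before the consolidation (a quick illustrative check, not part of the original workflow), count the rows per Province_State:
# count the number of administrative-area rows per state
usa_ts_df['Province_State'].value_counts().head()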
As shown in the output of the cells above, the usa_ts_df that we created and parsed from the U.S. dataset contains multiple rows per state, representing the different administrative areas of the state with reported cases. Next, we will use the to-be-defined function sum_all_admins_in_state() to summarize all administrative areas in one state into a single row.
def sum_all_admins_in_state(df, state):
    # query all sub-records (administrative areas) of the selected state
    tmp_df = df[df["Province_State"]==state]
    # create a new row that sums all statistics of this state, with the
    # summed value of all sub-records assigned to each date column
    sum_row = tmp_df.sum(axis=0)
    # assign constants to the identifier columns (placeholder "NaN" strings
    # for the IDs that no longer apply); note that the Province_State column
    # is temporarily renamed to the state name + ", Sum".
    sum_row.loc['UID'] = "NaN"
    sum_row.loc['Admin2'] = "NaN"
    sum_row.loc['FIPS'] = "NaN"
    sum_row.loc['iso2'] = "US"
    sum_row.loc['iso3'] = "USA"
    sum_row.loc['code3'] = 840
    sum_row.loc['Country_Region'] = "US"
    sum_row.loc['Province_State'] = state + ", Sum"
    sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
    sum_row.loc['Long_'] = tmp_df['Long_'].values[0]
    # append the new row to the original DataFrame, and
    # remove the sub-records of the selected state.
    df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
    df = df[df['Province_State'] != state]
    # rename the summed row back to the plain state name
    df.loc[df.Province_State == state + ", Sum", 'Province_State'] = state
    return df
# loop through all states in the U.S. and consolidate each into a single row
for state in usa_ts_df.Province_State.unique():
    usa_ts_df = sum_all_admins_in_state(usa_ts_df, state)
Now, with sum_all_admins_in_state applied to all states, we expect usa_ts_df to have all rows converted to one-row-to-one-state matching. Let's browse the last five rows of the DataFrame to validate the results.
usa_ts_df.tail()
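As a side note, the same consolidation could be sketched more compactly with a groupby aggregation over the raw, pre-consolidation DataFrame, summing the date columns and keeping the first Lat/Long_ per state. The sketch below is an illustrative equivalent, not the method used above:
# an equivalent consolidation via groupby (illustrative sketch, to be run
# on the raw, pre-consolidation DataFrame):
agg_map = {d: 'sum' for d in date_list}
agg_map.update({'Lat': 'first', 'Long_': 'first'})
usa_by_state = usa_ts_df.groupby('Province_State', as_index=False).agg(agg_map)
usa_by_state.head()
The explicit sum_all_admins_in_state() route is kept here because it preserves the full column layout of the original DataFrame.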
Explore the state-wide time-series data¶
If you wonder in which state(s) the first COVID-19 case was confirmed and reported, use the cell below to check for the first occurrence; in this case, it is Washington State.
usa_ts_df_all_states = usa_ts_df.groupby('Province_State').sum()[date_list]
usa_ts_df_all_states[usa_ts_df_all_states['1/22/20']>0]
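Going one step further (an illustrative extension of the query above), you can find the first reporting date for every state by scanning the cumulative counts row-wise:
# first date on which each state's cumulative count became non-zero;
# states with no reported cases yet are masked out (shown as NaT)
has_cases = usa_ts_df_all_states > 0
first_report = has_cases.idxmax(axis=1).where(has_cases.any(axis=1))
pd.to_datetime(first_report, format='%m/%d/%y').sort_values().head()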
Or, if you want to query for the top 10 states with the highest numbers of confirmed cases as of the most recent date, run the following cell to display the query results.
usa_ts_df_all_states[date_list[-1]].sort_values(ascending = False).head(10)
Compared to the top 10 states we collected on May 05, 2020 (shown below), we can see that California has climbed to 2nd place, while New Jersey has slipped to 5th place.
Province_State
New York 321192
New Jersey 130593
Massachusetts 70271
Illinois 65889
California 58456
Pennsylvania 53864
Michigan 44451
Florida 37439
Texas 33912
Connecticut 30621
Name: 5/5/20, dtype: int64
The approach is quite similar if you want to query for the top 10 states with the lowest numbers of confirmed cases; simply change the ascending argument from False to True:
usa_ts_df_all_states[date_list[-1]].sort_values(ascending = True).head(10)
Comparatively, we can also check the difference against the statistics obtained on May 05, 2020 (shown below):
Province_State
American Samoa 0
Northern Mariana Islands 14
Diamond Princess 49
Virgin Islands 66
Grand Princess 103
Guam 145
Alaska 371
Montana 456
Wyoming 604
Hawaii 625
Name: 5/5/20, dtype: int64
As shown above, from May to July the state with the highest number of confirmed cases remains New York, while American Samoa has the lowest (as of 07/05/2020). Also, if you are only interested in finding out which state has the highest number of confirmed cases, rather than the top 10, you can run the cells below for the exact query result and its time-series:
# state name, and the current number of confirmed
usa_ts_df_all_states[date_list[-1]].idxmax(), usa_ts_df_all_states[date_list[-1]].max()
usa_ts_df[usa_ts_df['Province_State']=="New York"].sum()[date_list]
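Since matplotlib was imported at the beginning of the notebook, the same time-series can also be rendered as a quick line chart (an illustrative plot; the figure size and labels are chosen here for readability):
# plot New York's cumulative confirmed cases over time
ny_series = usa_ts_df[usa_ts_df['Province_State']=="New York"].sum()[date_list]
ax = ny_series.astype(float).plot(figsize=(10, 5), title='Cumulative confirmed COVID-19 cases in New York')
ax.set_xlabel('Date')
ax.set_ylabel('Confirmed cases')
plt.show()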
Map the confirmed cases per state¶
With the time-series DataFrame for the United States ready to use, we can map the number of confirmed cases reported per state in a time-enabled manner. Next, we will see an animation being created from the end of January to the current date:
gis = GIS('home', verify_cert=False)
"""Confirmed Cases per State shown on map widget"""
map0 = gis.map("US")
map0
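If you would like to see the per-state points rendered on the widget right away (a minimal sketch, assuming the Lat and Long_ columns survived the consolidation; GeoAccessor ships with the ArcGIS API for Python), you can convert the DataFrame into a spatially enabled DataFrame and plot it:
from arcgis.features import GeoAccessor

# build a spatially enabled DataFrame from the Long_/Lat columns (sketch)
usa_sdf = pd.DataFrame.spatial.from_xy(usa_ts_df, x_column='Long_', y_column='Lat')
# draw the state points on the map widget created above
usa_sdf.spatial.plot(map_widget=map0)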