
Predictive Analysis of the 2019 Novel Coronavirus Pandemic

Introduction

Dashboards, statistics, and other information about COVID-19 are spread all over the internet, and different countries and regions have adopted varied strategies, from complete lockdown to social distancing to herd immunity. You may be wondering which strategy is right and which information is valid. This notebook does not provide a final answer; instead, it offers tools and methods you can try yourself to model, analyze, and predict the spread of COVID-19 with the ArcGIS API for Python and other libraries such as pandas and numpy. Hopefully, given the workflow demonstrations, you will be able to find the basic facts, current patterns, and future trends behind the common notions about how COVID-19 spreads, from a dataset perspective [1,2,3,4].

Before we dive into data science and analytics, let's start with importing the necessary modules:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from arcgis.gis import GIS

Import and Understand Source Dataset

Among all the official and unofficial data sources on the web providing COVID-19 related data, one of the most widely used datasets today is the one provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), which can be accessed on GitHub under the name Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE [5,10,11,12]. The time-series data needed for the analysis performed in this notebook falls into two categories (see the sketch after this list):

  1. Type: Confirmed Cases, Deaths, and the Recovered;
  2. Geography: Global, and the United States only.
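The repository publishes one CSV per combination of case type and geography, so the same read-and-analyze workflow applies to each file. As a rough sketch (the file names below reflect the repository layout at the time of writing and may change), the time-series files can be addressed as follows:

In [ ]:
# Base path of the JHU CSSE time-series CSVs on GitHub (raw content)
base_url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/'
            'master/csse_covid_19_data/csse_covid_19_time_series/')

# One file per case type and geography, as published at the time of writing
ts_files = {
    'confirmed_global': 'time_series_covid19_confirmed_global.csv',
    'deaths_global':    'time_series_covid19_deaths_global.csv',
    'recovered_global': 'time_series_covid19_recovered_global.csv',
    'confirmed_us':     'time_series_covid19_confirmed_US.csv',
    'deaths_us':        'time_series_covid19_deaths_US.csv',
}

# For example, the U.S. confirmed-cases URL used in the next cell
usa_confirmed_url = base_url + ts_files['confirmed_us']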

Now let's first look at the U.S. dataset.

Time-series data for the United States

The dataset can be imported directly into DataFrames with the read_csv method in pandas. Rather than downloading the files manually and then reading them, it is preferable to use the URLs (which point to the CSV files hosted on GitHub), because as the situation changes it becomes easier to reload and refresh the analysis with new data.

Now, let's read the time-series data of the confirmed COVID-19 cases in the United States from the GitHub source url, into a Pandas DataFrame:

In [2]:
# read time-series csv
usa_ts_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'
usa_ts_df = pd.read_csv(usa_ts_url, header=0, escapechar='\\')
usa_ts_df.head(5)
Out[2]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20 7/4/20 7/5/20
0 16 AS ASM 16 60.0 NaN American Samoa US -14.2710 -170.1320 ... 0 0 0 0 0 0 0 0 0 0
1 316 GU GUM 316 66.0 NaN Guam US 13.4443 144.7937 ... 247 247 247 253 257 267 280 280 280 280
2 580 MP MNP 580 69.0 NaN Northern Mariana Islands US 15.0979 145.6739 ... 30 30 30 30 30 30 31 31 31 31
3 630 PR PRI 630 72.0 NaN Puerto Rico US 18.2208 -66.5901 ... 6922 7066 7189 7250 7465 7537 7608 7683 7787 7916
4 850 VI VIR 850 78.0 NaN Virgin Islands US 18.3358 -64.8963 ... 81 81 81 81 81 90 92 98 111 111

5 rows × 177 columns

In [3]:
usa_ts_df.columns
Out[3]:
Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
       'Country_Region', 'Lat', 'Long_',
       ...
       '6/26/20', '6/27/20', '6/28/20', '6/29/20', '6/30/20', '7/1/20',
       '7/2/20', '7/3/20', '7/4/20', '7/5/20'],
      dtype='object', length=177)

As we can see from the printout of usa_ts_df.columns, the first 11 columns are identifiers and location metadata: ['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Lat', 'Long_'], plus a 'Combined_Key' column hidden by the ellipsis in the truncated printout. The remaining columns are dates running from 1/22/20 to the most recent date on record.
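Because new date columns are appended on the right every time the source is updated, the hard-coded slice position (11) used in the next cell assumes the metadata layout stays fixed. A minimal alternative sketch, which detects date columns by trying to parse their names, is shown here for reference:

In [ ]:
# Treat a column as a date column if its name parses as m/d/yy;
# everything that fails to parse is metadata. (Illustrative only; the
# notebook itself simply slices at position 11 in the next cell.)
date_cols = [c for c in usa_ts_df.columns
             if not pd.isna(pd.to_datetime(c, format='%m/%d/%y', errors='coerce'))]
meta_cols = [c for c in usa_ts_df.columns if c not in date_cols]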

In [4]:
date_list = usa_ts_df.columns.tolist()[11:]
date_list[0], date_list[-1]
Out[4]:
('1/22/20', '7/5/20')

Repair and Summarize the state-wide time-series data

Looking at the last five rows of the DataFrame usa_ts_df, we can see that they are all records for Province_State=="Utah":

In [5]:
usa_ts_df.tail()
Out[5]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20 7/4/20 7/5/20
3256 84070016 US USA 840 NaN Central Utah Utah US 39.372319 -111.575868 ... 127 134 143 159 169 173 180 194 201 209
3257 84070017 US USA 840 NaN Southeast Utah Utah US 38.996171 -110.701396 ... 33 34 35 35 36 37 39 40 40 40
3258 84070018 US USA 840 NaN Southwest Utah Utah US 37.854472 -111.441876 ... 1302 1361 1428 1467 1519 1553 1584 1625 1660 1689
3259 84070019 US USA 840 NaN TriCounty Utah US 40.124915 -109.517442 ... 45 46 48 48 50 50 52 53 55 55
3260 84070020 US USA 840 NaN Weber-Morgan Utah US 41.271160 -111.914512 ... 814 846 872 919 954 1004 1042 1090 1172 1195

5 rows × 177 columns

Some rows in the DataFrame map many rows to one state, while others map a single row to one state. All records in usa_ts_df whose Admin2 value is not NaN fall into the "many-rows-to-one-state" category.
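To get a feel for this relationship, a quick check (not part of the original workflow) is to count how many rows each state contributes; states reported at the county level will show counts well above one:

In [ ]:
# Number of rows per state; values greater than 1 indicate states broken
# down into multiple administrative areas
usa_ts_df['Province_State'].value_counts().head()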

In [6]:
usa_ts_df[usa_ts_df["Admin2"].notna()].head()
Out[6]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20 7/4/20 7/5/20
5 84001001 US USA 840 1001.0 Autauga Alabama US 32.539527 -86.644082 ... 482 492 497 521 530 545 553 560 583 607
6 84001003 US USA 840 1003.0 Baldwin Alabama US 30.727750 -87.722071 ... 500 539 559 626 663 686 735 828 846 864
7 84001005 US USA 840 1005.0 Barbour Alabama US 31.868263 -85.387129 ... 309 314 314 319 322 323 333 345 347 349
8 84001007 US USA 840 1007.0 Bibb Alabama US 32.996421 -87.125115 ... 150 158 159 162 167 171 176 186 187 190
9 84001009 US USA 840 1009.0 Blount Alabama US 33.982109 -86.567906 ... 181 185 186 196 204 214 218 226 230 235

5 rows × 177 columns

As shown in the output of the previous two cells, usa_ts_df, which we created from the U.S. dataset, contains multiple rows per state, each representing a different administrative area with reported cases. Next, we define the function sum_all_admins_in_state() to summarize all administrative areas of one state into a single row.

In [7]:
def sum_all_admins_in_state(df, state):
    
    # query all sub-records (administrative areas) of the selected state
    tmp_df = df[df["Province_State"]==state]
    
    # create a new row that sums all statistics of this state; the summed
    # values of the sub-records are carried in the date columns of the new row
    sum_row = tmp_df.sum(axis=0)
    
    # overwrite the identifier columns ('UID', 'Admin2', 'FIPS', 'iso2', 'iso3',
    # 'code3', 'Country_Region', 'Lat', 'Long_') with constants; Province_State is
    # temporarily set to the state name + ", Sum" so the summary row can be
    # distinguished from the original sub-records
    sum_row.loc['UID'] = "NaN"
    sum_row.loc['Admin2'] = "NaN"
    sum_row.loc['FIPS'] = "NaN"
    sum_row.loc['iso2'] = "US"
    sum_row.loc['iso3'] = "USA"
    sum_row.loc['code3'] = 840
    sum_row.loc['Country_Region'] = "US"
    sum_row.loc['Province_State'] = state + ", Sum"
    sum_row.loc['Lat'] = tmp_df['Lat'].values[0]
    sum_row.loc['Long_'] = tmp_df['Long_'].values[0]
    
    # append the summary row to the original DataFrame, drop the sub-records
    # of the selected state, and rename the summary row back to the plain state name
    df = pd.concat([df, sum_row.to_frame().T], ignore_index=True)
    #display(df[df["Province_State"].str.contains(state + ", Sum")])
    df = df[df['Province_State'] != state]
    df.loc[df.Province_State == state+", Sum", 'Province_State'] = state
    
    return df
In [8]:
# loop thru all states in the U.S.
for state in usa_ts_df.Province_State.unique():
    usa_ts_df = sum_all_admins_in_state(usa_ts_df, state)

Now that sum_all_admins_in_state has been applied to all states, every row in usa_ts_df should represent exactly one state. Let's browse the last five rows of the DataFrame to validate the results.
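Besides eyeballing the tail of the DataFrame, a quick programmatic check (a sketch added here for convenience) is to confirm that every state or territory now appears in exactly one row:

In [ ]:
# After aggregation, each Province_State value should occur exactly once
assert usa_ts_df['Province_State'].is_unique
usa_ts_df['Province_State'].nunique(), len(usa_ts_df)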

In [9]:
usa_ts_df.tail()
Out[9]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20 7/4/20 7/5/20
54 NaN US USA 840 NaN NaN West Virginia US 39.1307 -80.0035 ... 2730 2782 2832 2870 2905 2979 3053 3126 3205 3262
55 NaN US USA 840 NaN NaN Wisconsin US 43.9697 -89.7678 ... 26747 27286 27743 28058 28659 29199 29738 30317 31055 31577
56 NaN US USA 840 NaN NaN Wyoming US 41.655 -105.724 ... 1368 1392 1417 1450 1487 1514 1550 1582 1606 1634
57 NaN US USA 840 NaN NaN Diamond Princess US 0 0 ... 49 49 49 49 49 49 49 49 49 49
58 NaN US USA 840 NaN NaN Grand Princess US 0 0 ... 103 103 103 103 103 103 103 103 103 103

5 rows × 177 columns

Explore the state-wide time-series data

If you wonder in which state(s) the first COVID-19 case was confirmed and reported, use the cell below to check for the first occurrence; in this case, it is Washington State.

In [10]:
usa_ts_df_all_states = usa_ts_df.groupby('Province_State').sum()[date_list]
usa_ts_df_all_states[usa_ts_df_all_states['1/22/20']>0]
Out[10]:
1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 ... 6/26/20 6/27/20 6/28/20 6/29/20 6/30/20 7/1/20 7/2/20 7/3/20 7/4/20 7/5/20
Province_State
Washington 1 1 1 1 1 1 1 1 1 1 ... 30855 31404 31752 32253 32824 33435 34151 34778 35247 35898

1 rows × 166 columns

Or, if you want to query the top 10 states with the highest numbers of confirmed cases as of the most recent date on record, run the following cell to display the query results.

In [11]:
usa_ts_df_all_states[date_list[-1]].sort_values(ascending = False).head(10)
Out[11]:
Province_State
New York         397131
California       264681
Florida          200111
Texas            194932
New Jersey       173402
Illinois         147251
Massachusetts    109974
Arizona           98103
Georgia           95516
Pennsylvania      94403
Name: 7/5/20, dtype: int64

Compared to the top 10 states we collected on May 05, 2020 (listed below), we can see that California has climbed up to 2nd place, while New Jersey has dropped to 5th place.

Province_State
New York         321192
New Jersey       130593
Massachusetts     70271
Illinois          65889
California        58456
Pennsylvania      53864
Michigan          44451
Florida           37439
Texas             33912
Connecticut       30621
Name: 5/5/20, dtype: int64
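A comparison like this can also be computed directly from the DataFrame rather than pasted in from a saved snapshot. A minimal sketch, assuming the '5/5/20' column is present in the current download:

In [ ]:
# Confirmed counts on May 05 and on the latest date, ranked by the latest date
usa_ts_df_all_states[['5/5/20', date_list[-1]]].sort_values(
    by=date_list[-1], ascending=False).head(10)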

The approach is quite similar if you want to query the top 10 states with the lowest numbers of confirmed cases; simply change the ascending argument from False to True:

In [12]:
usa_ts_df_all_states[date_list[-1]].sort_values(ascending = True).head(10)
Out[12]:
Province_State
American Samoa                 0
Northern Mariana Islands      31
Diamond Princess              49
Grand Princess               103
Virgin Islands               111
Guam                         280
Hawaii                      1023
Alaska                      1134
Montana                     1212
Vermont                     1249
Name: 7/5/20, dtype: int64

Comparatively, we can also check how these numbers differ from the statistics obtained on May 05, 2020 (shown below):

Province_State
American Samoa                0
Northern Mariana Islands     14
Diamond Princess             49
Virgin Islands               66
Grand Princess              103
Guam                        145
Alaska                      371
Montana                     456
Wyoming                     604
Hawaii                      625
Name: 5/5/20, dtype: int64

As shown above, from May to July the state with the highest number of confirmed cases is New York, while American Samoa has the lowest (as of 07/05/2020). Also, if you are only interested in finding out which state has the highest number of confirmed cases, rather than the top 10, you can run the cells below to get the exact query result and its time-series:

In [13]:
# state name, and the current number of confirmed
usa_ts_df_all_states[date_list[-1]].idxmax(), usa_ts_df_all_states[date_list[-1]].max()
Out[13]:
('New York', 397131)
In [14]:
usa_ts_df[usa_ts_df['Province_State']=="New York"].sum()[date_list]
Out[14]:
1/22/20         0
1/23/20         0
1/24/20         0
1/25/20         0
1/26/20         0
            ...  
7/1/20     394079
7/2/20     394954
7/3/20     395872
7/4/20     396598
7/5/20     397131
Length: 166, dtype: object
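Since matplotlib.pyplot was imported at the beginning of the notebook, the same time-series can also be visualized as a simple line chart. The following is a minimal sketch (the figure styling is illustrative and not part of the original workflow):

In [ ]:
# Plot the cumulative confirmed cases for New York over time
ny_series = usa_ts_df[usa_ts_df['Province_State']=="New York"].sum()[date_list]
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(pd.to_datetime(date_list), ny_series.astype(int))
ax.set_title("Cumulative confirmed COVID-19 cases in New York")
ax.set_xlabel("Date")
ax.set_ylabel("Confirmed cases")
plt.show()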

Map the confirmed cases per state

With the time-series DataFrame for the United States ready to use, we can map the number of confirmed cases reported per state in a time-enabled manner. Next, we will create an animation covering the period from the end of January to the present:

In [15]:
gis = GIS('home', verify_cert=False)
In [1]:
"""Confirmed Cases per State shown on map widget"""
map0 = gis.map("US")
map0
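One way to get the state-level counts onto the widget is to convert the DataFrame into a spatially enabled DataFrame from its Lat/Long_ columns and plot it on map0. The cell below is a hedged sketch of that step (the column selection and conversion are illustrative, not taken from the original notebook):

In [ ]:
from arcgis.features import GeoAccessor

# Keep only the columns needed for mapping: state name, location, and the
# latest confirmed count
latest = date_list[-1]
map_df = usa_ts_df[['Province_State', 'Lat', 'Long_', latest]].copy()
map_df['Lat'] = map_df['Lat'].astype(float)
map_df['Long_'] = map_df['Long_'].astype(float)
map_df[latest] = map_df[latest].astype(int)

# Build a spatially enabled DataFrame from the Long_/Lat columns (WGS84)
sdf = pd.DataFrame.spatial.from_xy(map_df, 'Long_', 'Lat')

# Draw the state points on the map widget created above
sdf.spatial.plot(map_widget=map0)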