
ArcGIS API for Python


Data Preparation - Hurricane analysis, part 1/3

Introduction

Hurricanes are large swirling storms that produce winds of 74 miles per hour (119 km/h) or higher. When hurricanes make landfall, they bring heavy rainfall, storm surges and intense flooding. Hurricanes often strike densely populated places, causing devastating loss of life and property throughout the world.

Over the past several decades, agencies such as the National Hurricane Center have collected quantitative data about hurricanes. In this study we use meteorological records of hurricanes from the past 169 years to analyze their location and intensity and to investigate whether there are any statistically significant trends. We also analyze the places most affected by hurricanes and their demographic makeup. We conclude by citing relevant articles that draw similar conclusions.

This notebook covers part 1 of this study. In this notebook, we:

  • download data from the NCEI portal
  • perform extensive pre-processing, such as removing extra header rows and merging redundant columns
  • aggregate the observations into hurricane tracks.

Note: To run this sample, you need a few extra libraries in your conda environment. If you don't have the libraries, install them by running the following commands from cmd.exe or your shell.

In [ ]:
pip install dask==2.14.0
pip install toolz
pip install fsspec==0.3.1

Download hurricane data from NCEI FTP portal

The National Centers for Environmental Information (NCEI), formerly the National Climatic Data Center, shares the historic hurricane track datasets at ftp://eclipse.ncdc.noaa.gov/pub/ibtracs/v03r09/all/csv/. We use the ftplib Python library to log in and download these datasets.

In [1]:
# imports for downloading data from FTP site
import os
from ftplib import FTP

# imports to process data using DASK
from dask import delayed
import dask.dataframe as ddf

# imports for data analysis and visualization
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# imports to perform spatial aggregation using ArcGIS GeoAnalytics server
from arcgis.gis import GIS
from arcgis.geoanalytics import get_datastores
from arcgis.geoanalytics.summarize_data import reconstruct_tracks
import arcgis

# miscellaneous imports
from pprint import pprint
from copy import deepcopy

Establish an anonymous connection to the FTP site.

In [4]:
conn = FTP(host='eclipse.ncdc.noaa.gov')
conn.login()
Out[4]:
'230 Anonymous access granted, restrictions apply'

Change directory to the folder containing the hurricane files and list them.

In [5]:
conn.cwd('/pub/ibtracs/v03r10/all/csv/year/')
file_list = conn.nlst()
len(file_list)
Out[5]:
176

Print the top 10 items.

In [6]:
file_list[:10]
Out[6]:
['Year.1842.ibtracs_all.v03r10.csv',
 'Year.1843.ibtracs_all.v03r10.csv',
 'Year.1844.ibtracs_all.v03r10.csv',
 'Year.1845.ibtracs_all.v03r10.csv',
 'Year.1846.ibtracs_all.v03r10.csv',
 'Year.1847.ibtracs_all.v03r10.csv',
 'Year.1848.ibtracs_all.v03r10.csv',
 'Year.1849.ibtracs_all.v03r10.csv',
 'Year.1850.ibtracs_all.v03r10.csv',
 'Year.1851.ibtracs_all.v03r10.csv']

Download each file into the hurricanes_raw directory

In [1]:
data_dir = r'data/hurricanes_data/'
In [31]:
if 'hurricanes_raw' not in os.listdir(data_dir):
    os.mkdir(os.path.join(data_dir,'hurricanes_raw'))

hurricane_raw_dir = os.path.join(data_dir,'hurricanes_raw')
os.listdir(data_dir)
Out[31]:
['Allstorms.ibtracs_all.v03r09.csv',
 'hurricanes_raw',
 '.nb_auth_file',
 'Allstorms.ibtracs_all.v03r09.csv.gz']

Now we are going to download the data from 1842 to 2017, which might take around 15 minutes.

In [11]:
file_path = hurricane_raw_dir
for file in file_list:
    with open(os.path.join(file_path, file), 'wb') as file_handle:
        try:
            conn.retrbinary('RETR ' + file, file_handle.write, 1024)
            print(f'Downloaded {file}')
        
        except Exception as download_ex:
            print(f'Error downloading {file} + {str(download_ex)}')
Downloaded Year.1842.ibtracs_all.v03r10.csv
Downloaded Year.1843.ibtracs_all.v03r10.csv
Downloaded Year.1844.ibtracs_all.v03r10.csv
Downloaded Year.1845.ibtracs_all.v03r10.csv
....Downloaded Year.2015.ibtracs_all.v03r10.csv
Downloaded Year.2016.ibtracs_all.v03r10.csv
Downloaded Year.2017.ibtracs_all.v03r10.csv
CPU times: user 8.63 s, sys: 12.1 s, total: 20.8 s
Wall time: 12min 5s

Process CSV files by removing header rows

The CSV files have multiple header rows. Let us start by processing one of the files as an example.

In [2]:
csv_path = os.path.join(hurricane_raw_dir,'Year.2017.ibtracs_all.v03r10.csv')
In [18]:
df = pd.read_csv(csv_path)
df.head()
Out[18]:
IBTrACS -- Version: v03r10
Serial_Num Season Num Basin Sub_basin Name ISO_time Nature Latitude Longitude Wind(WMO) Pres(WMO) Center ... jtwc_??_wrad64_rad4
NaN Year # BB BB NaN YYYY-MM-DD HH:MM:SS NaN deg_north deg_east kt mb NaN ... nmile
1874011S14064 1874 01 SI MM XXXX874148 1874-01-11 06:00:00 NR -13.70 63.90 0.0 0.0 reunion ... -999.000
1874-01-11 12:00:00 NR -999. -999. -999. -999. NaN -999. -999. main -13.75 63.86 SI ... -999.000
1874-01-11 18:00:00 NR -999. -999. -999. -999. NaN -999. -999. main -13.88 63.77 SI ... -999.000

(output truncated: the 200 column names, the units row and the data rows all collapse under the single column "IBTrACS -- Version: v03r10")

The output looks mangled. This is because the first row of the file is a version banner rather than column names, so pandas reads it as the header. Let us skip that row.

In [19]:
df = pd.read_csv(csv_path, skiprows=1)
df.head()
Out[19]:
Serial_Num Season Num Basin Sub_basin Name ISO_time Nature Latitude Longitude ... jtwc_??_wrad34_rad3 jtwc_??_wrad34_rad4 jtwc_??_wrad50_rad1 jtwc_??_wrad50_rad2 jtwc_??_wrad50_rad3 jtwc_??_wrad50_rad4 jtwc_??_wrad64_rad1 jtwc_??_wrad64_rad2 jtwc_??_wrad64_rad3 jtwc_??_wrad64_rad4
0 NaN Year # BB BB NaN YYYY-MM-DD HH:MM:SS NaN deg_north deg_east ... nmile nmile nmile nmile nmile nmile nmile nmile nmile nmile
1 1874011S14064 1874 01 SI MM XXXX874148 1874-01-11 06:00:00 NR -13.70 63.90 ... -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000
2 1874011S14064 1874 01 SI MM XXXX874148 1874-01-11 12:00:00 NR -999. -999. ... -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000
3 1874011S14064 1874 01 SI MM XXXX874148 1874-01-11 18:00:00 NR -999. -999. ... -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000
4 1874011S14064 1874 01 SI MM XXXX874148 1874-01-12 00:00:00 NR -999. -999. ... -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000

5 rows × 200 columns

A little better. But the file's third row (now the first row of the DataFrame) is another header row that holds the units. Let us drop it.

In [20]:
df.drop(labels=0, axis=0, inplace=True)
df.head()
Out[20]:
Serial_Num Season Num Basin Sub_basin Name ISO_time Nature Latitude Longitude ... jtwc_??_wrad34_rad3 jtwc_??_wrad34_rad4 jtwc_??_wrad50_rad1 jtwc_??_wrad50_rad2 jtwc_??_wrad50_rad3 jtwc_??_wrad50_rad4 jtwc_??_wrad64_rad1 jtwc_??_wrad64_rad2 jtwc_??_wrad64_rad3 jtwc_??_wrad64_rad4
1 1874011S14064 1874 01 SI MM XXXX874148 1874-01-11 06:00:00 NR -13.70 63.90 ... -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000
2 1874011S14064 1874 01 SI MM XXXX874148 1874-01-11 12:00:00 NR -999. -999. ... -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000
3 1874011S14064 1874 01 SI MM XXXX874148 1874-01-11 18:00:00 NR -999. -999. ... -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000
4 1874011S14064 1874 01 SI MM XXXX874148 1874-01-12 00:00:00 NR -999. -999. ... -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000
5 1874011S14064 1874 01 SI MM XXXX874148 1874-01-12 06:00:00 NR -14.80 63.30 ... -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000

5 rows × 200 columns

Automate across all files

Now we need to repeat the above cleaning steps for every CSV file. In the steps below, we read each CSV file, drop the extra header row, and write the result back to disk. This step is necessary because it eases subsequent processing with the Dask library.

In [9]:
file_path = hurricane_raw_dir
num_records = {}
for file in file_list:
    df = pd.read_csv(os.path.join(file_path, file), skiprows=1)
    num_records[str(file.split('.')[1])] = df.shape[0]
    
    df.drop(labels=0, axis=0, inplace=True)
    df.to_csv(os.path.join(file_path, file))
    print(f'Processed {file}')
Processed Year.1842.ibtracs_all.v03r10.csv
Processed Year.1843.ibtracs_all.v03r10.csv
Processed Year.1844.ibtracs_all.v03r10.csv
Processed Year.1845.ibtracs_all.v03r10.csv
...Processed Year.2013.ibtracs_all.v03r10.csv
Processed Year.2014.ibtracs_all.v03r10.csv
Processed Year.2015.ibtracs_all.v03r10.csv
Processed Year.2016.ibtracs_all.v03r10.csv
Processed Year.2017.ibtracs_all.v03r10.csv
CPU times: user 36.4 s, sys: 3.39 s, total: 39.8 s
Wall time: 46.8 s

Cleaning hurricane observations with Dask

The data collected from the NOAA NCDC source is too large to clean with pandas or Excel. At roughly 350,000 rows x 200 columns of dense data, it exceeds the memory of a typical computer. Traditional packages such as pandas cannot be used here, as they expect the data to fit fully in memory.

Thus, in this part of the study, we use Dask, a distributed data analysis library. Functionally, Dask provides a DataFrame object that behaves much like a pandas DataFrame: you can slice, dice and explore it in the same way. Transformative operations on the DataFrame, however, are queued and executed only when necessary. When they run, Dask reads the data in chunks, distributes them to workers (cores on a single machine or nodes in a cluster), and collects the results back for you. This lets you work with any larger-than-memory dataset, because Dask operates on chunks of it in a distributed manner.
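
To make this lazy-execution model concrete, here is a minimal, self-contained sketch using a toy DataFrame invented for illustration (not the hurricane data): the filter and the mean only build a task graph, and nothing is read or calculated until compute() is called.

import pandas as pd
import dask.dataframe as ddf

# a tiny pandas frame split into two partitions to mimic a chunked, on-disk dataset
pdf = pd.DataFrame({'wind': [40, 75, 120, 95], 'basin': ['NI', 'NA', 'WP', 'SI']})
lazy_df = ddf.from_pandas(pdf, npartitions=2)

# these operations are only recorded in the task graph; no computation happens yet
mean_hurricane_wind = lazy_df[lazy_df.wind >= 74].wind.mean()

# compute() executes the graph partition by partition and returns a concrete value
print(mean_hurricane_wind.compute())  # ~96.7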

Read input CSV data

As mentioned earlier, Dask lets you work with larger-than-memory datasets. These datasets can reside in one large file or in multiple files within a folder. For the latter, Dask lets you specify a glob pattern for the folder's files as input and gives you back a single DataFrame object that represents all the datasets combined. The operations you perform on this DataFrame are queued and executed only when necessary.

In [3]:
fld_path = hurricane_raw_dir
csv_path = os.path.join(fld_path,'*.csv')

While reading, specify the assortment of sentinel values that should be treated as nulls.

In [36]:
table_na_values=['-999.','-999','-999.000', '-1', '-1.0','0','0.0']
full_df = ddf.read_csv(csv_path, na_values=table_na_values, dtype={'Center': 'object'})
CPU times: user 1.26 s, sys: 17.6 ms, total: 1.28 s
Wall time: 1.29 s

You can query the top few (or bottom few) records as you do on a regular Pandas DataFrame object.

In [37]:
full_df.head()
Out[37]:
Unnamed: 0 Serial_Num Season Num Basin Sub_basin Name ISO_time Nature Latitude ... jtwc_??_wrad34_rad3 jtwc_??_wrad34_rad4 jtwc_??_wrad50_rad1 jtwc_??_wrad50_rad2 jtwc_??_wrad50_rad3 jtwc_??_wrad50_rad4 jtwc_??_wrad64_rad1 jtwc_??_wrad64_rad2 jtwc_??_wrad64_rad3 jtwc_??_wrad64_rad4
0 1 1842298N11080 1842 1 NI BB NOT NAMED 1842-10-25 06:00:00 NR NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 1842298N11080 1842 1 NI BB NOT NAMED 1842-10-25 12:00:00 NR NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 1842298N11080 1842 1 NI AS NOT NAMED 1842-10-25 18:00:00 NR NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 1842298N11080 1842 1 NI AS NOT NAMED 1842-10-26 00:00:00 NR NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1842298N11080 1842 1 NI AS NOT NAMED 1842-10-26 06:00:00 NR NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 201 columns
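
Similarly, you could peek at the last few records with tail(); the call below is an illustrative sketch rather than output from the original run.

# tail() reads just the last partition, so it stays inexpensive on a large dataset
full_df.tail(3)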

Drop the redundant index column ('Unnamed: 0') that was introduced when the processed CSV files were written back to disk.

In [38]:
full_df = full_df.drop(labels=['Unnamed: 0'], axis=1)
In [39]:
all_columns=list(full_df.columns)
len(all_columns)
Out[39]:
200

This dataset has 200 columns. Not all are unique, as you can see from the print out below:

In [40]:
pprint(all_columns, compact=True, width=100)
['Serial_Num', 'Season', 'Num', 'Basin', 'Sub_basin', 'Name', 'ISO_time', 'Nature', 'Latitude',
 'Longitude', 'Wind(WMO)', 'Pres(WMO)', 'Center', 'Wind(WMO) Percentile', 'Pres(WMO) Percentile',
 'Track_type', 'Latitude_for_mapping', 'Longitude_for_mapping', 'Current Basin', 'hurdat_atl_lat',
 'hurdat_atl_lon', 'hurdat_atl_grade', 'hurdat_atl_wind', 'hurdat_atl_pres', 'td9636_lat',
 'td9636_lon', 'td9636_grade', 'td9636_wind', 'td9636_pres', 'reunion_lat', 'reunion_lon',
 'reunion_grade', 'reunion_wind', 'reunion_pres', 'atcf_lat', 'atcf_lon', 'atcf_grade', 'atcf_wind',
 'atcf_pres', 'mlc_natl_lat', 'mlc_natl_lon', 'mlc_natl_grade', 'mlc_natl_wind', 'mlc_natl_pres',
 'ds824_sh_lat', 'ds824_sh_lon', 'ds824_sh_grade', 'ds824_sh_wind', 'ds824_sh_pres', 'ds824_ni_lat',
 'ds824_ni_lon', 'ds824_ni_grade', 'ds824_ni_wind', 'ds824_ni_pres', 'bom_lat', 'bom_lon',
 'bom_grade', 'bom_wind', 'bom_pres', 'ds824_au_lat', 'ds824_au_lon', 'ds824_au_grade',
 'ds824_au_wind', 'ds824_au_pres', 'jtwc_sh_lat', 'jtwc_sh_lon', 'jtwc_sh_grade', 'jtwc_sh_wind',
 'jtwc_sh_pres', 'jtwc_wp_lat', 'jtwc_wp_lon', 'jtwc_wp_grade', 'jtwc_wp_wind', 'jtwc_wp_pres',
 'td9635_lat', 'td9635_lon', 'td9635_grade', 'td9635_wind', 'td9635_pres', 'ds824_wp_lat',
 'ds824_wp_lon', 'ds824_wp_grade', 'ds824_wp_wind', 'ds824_wp_pres', 'jtwc_io_lat', 'jtwc_io_lon',
 'jtwc_io_grade', 'jtwc_io_wind', 'jtwc_io_pres', 'cma_lat', 'cma_lon', 'cma_grade', 'cma_wind',
 'cma_pres', 'hurdat_epa_lat', 'hurdat_epa_lon', 'hurdat_epa_grade', 'hurdat_epa_wind',
 'hurdat_epa_pres', 'jtwc_ep_lat', 'jtwc_ep_lon', 'jtwc_ep_grade', 'jtwc_ep_wind', 'jtwc_ep_pres',
 'ds824_ep_lat', 'ds824_ep_lon', 'ds824_ep_grade', 'ds824_ep_wind', 'ds824_ep_pres', 'jtwc_cp_lat',
 'jtwc_cp_lon', 'jtwc_cp_grade', 'jtwc_cp_wind', 'jtwc_cp_pres', 'tokyo_lat', 'tokyo_lon',
 'tokyo_grade', 'tokyo_wind', 'tokyo_pres', 'neumann_lat', 'neumann_lon', 'neumann_grade',
 'neumann_wind', 'neumann_pres', 'hko_lat', 'hko_lon', 'hko_grade', 'hko_wind', 'hko_pres',
 'cphc_lat', 'cphc_lon', 'cphc_grade', 'cphc_wind', 'cphc_pres', 'wellington_lat', 'wellington_lon',
 'wellington_grade', 'wellington_wind', 'wellington_pres', 'newdelhi_lat', 'newdelhi_lon',
 'newdelhi_grade', 'newdelhi_wind', 'newdelhi_pres', 'nadi_lat', 'nadi_lon', 'nadi_grade',
 'nadi_wind', 'nadi_pres', 'reunion_rmw', 'reunion_wind_radii_1_ne', 'reunion_wind_radii_1_se',
 'reunion_wind_radii_1_sw', 'reunion_wind_radii_1_nw', 'reunion_wind_radii_2_ne',
 'reunion_wind_radii_2_se', 'reunion_wind_radii_2_sw', 'reunion_wind_radii_2_nw',
 'bom_mn_hurr_xtnt', 'bom_mn_gale_xtnt', 'bom_mn_eye_diam', 'bom_roci', 'atcf_rmw', 'atcf_poci',
 'atcf_roci', 'atcf_eye', 'atcf_wrad34_rad1', 'atcf_wrad34_rad2', 'atcf_wrad34_rad3',
 'atcf_wrad34_rad4', 'atcf_wrad50_rad1', 'atcf_wrad50_rad2', 'atcf_wrad50_rad3', 'atcf_wrad50_rad4',
 'atcf_wrad64_rad1', 'atcf_wrad64_rad2', 'atcf_wrad64_rad3', 'atcf_wrad64_rad4', 'tokyo_dir50',
 'tokyo_long50', 'tokyo_short50', 'tokyo_dir30', 'tokyo_long30', 'tokyo_short30', 'jtwc_??_rmw',
 'jtwc_??_poci', 'jtwc_??_roci', 'jtwc_??_eye', 'jtwc_??_wrad34_rad1', 'jtwc_??_wrad34_rad2',
 'jtwc_??_wrad34_rad3', 'jtwc_??_wrad34_rad4', 'jtwc_??_wrad50_rad1', 'jtwc_??_wrad50_rad2',
 'jtwc_??_wrad50_rad3', 'jtwc_??_wrad50_rad4', 'jtwc_??_wrad64_rad1', 'jtwc_??_wrad64_rad2',
 'jtwc_??_wrad64_rad3', 'jtwc_??_wrad64_rad4']

Reading the metadata from the NOAA NCDC site, we find that each measurement gets its own column for every agency that reports it. Thus there are multiple pressure, wind speed, latitude, longitude, etc. columns with different prefixes and suffixes, and the data is sparse because it is spread across them. For our geospatial analysis, it suffices to merge these related columns and derive the location information from the consolidated coordinates.

Merge all location columns

Below we prototype merging location columns. If this succeeds, we will proceed to merge all remaining columns.

In [41]:
lat_columns = [x for x in all_columns if 'lat' in x.lower()]
lon_columns = [x for x in all_columns if 'lon' in x.lower()]
for x in zip(lat_columns, lon_columns):
    print(x)
('Latitude', 'Longitude')
('Latitude_for_mapping', 'Longitude_for_mapping')
('hurdat_atl_lat', 'hurdat_atl_lon')
('td9636_lat', 'td9636_lon')
('reunion_lat', 'reunion_lon')
('atcf_lat', 'atcf_lon')
('mlc_natl_lat', 'mlc_natl_lon')
('ds824_sh_lat', 'ds824_sh_lon')
('ds824_ni_lat', 'ds824_ni_lon')
('bom_lat', 'bom_lon')
('ds824_au_lat', 'ds824_au_lon')
('jtwc_sh_lat', 'jtwc_sh_lon')
('jtwc_wp_lat', 'jtwc_wp_lon')
('td9635_lat', 'td9635_lon')
('ds824_wp_lat', 'ds824_wp_lon')
('jtwc_io_lat', 'jtwc_io_lon')
('cma_lat', 'cma_lon')
('hurdat_epa_lat', 'hurdat_epa_lon')
('jtwc_ep_lat', 'jtwc_ep_lon')
('ds824_ep_lat', 'ds824_ep_lon')
('jtwc_cp_lat', 'jtwc_cp_lon')
('tokyo_lat', 'tokyo_lon')
('neumann_lat', 'neumann_lon')
('hko_lat', 'hko_lon')
('cphc_lat', 'cphc_lon')
('wellington_lat', 'wellington_lon')
('newdelhi_lat', 'newdelhi_lon')
('nadi_lat', 'nadi_lon')

In this dataset, when an observation is collected by one agency, the corresponding duplicate columns from the other agencies are usually empty. There may be exceptions, however. Hence we define a custom function that, for each row, picks the median value from a given list of columns. This way, we can consolidate the latitude / longitude information reported by all the agencies.

In [42]:
def pick_median_value(row, col_list):
    return row[col_list].median()
In [43]:
full_df['latitude_merged'] = full_df.apply(pick_median_value, axis=1,
                                          col_list = lat_columns)
CPU times: user 56.9 ms, sys: 5.31 ms, total: 62.2 ms
Wall time: 58.3 ms
In [44]:
full_df['longitude_merged'] = full_df.apply(pick_median_value, axis=1,
                                          col_list = lon_columns)
CPU times: user 58.3 ms, sys: 5.43 ms, total: 63.7 ms
Wall time: 59.1 ms

With Dask, the operations above were delayed and stored in a queue; they have not been evaluated yet. Next, let us evaluate the first 5 records and print the output. If the results look good, we will merge all the remaining related columns together.

In [45]:
full_df.head(5)
CPU times: user 137 ms, sys: 6.17 ms, total: 143 ms
Wall time: 141 ms
Out[45]:
Serial_Num Season Num Basin Sub_basin Name ISO_time Nature Latitude Longitude ... jtwc_??_wrad50_rad1 jtwc_??_wrad50_rad2 jtwc_??_wrad50_rad3 jtwc_??_wrad50_rad4 jtwc_??_wrad64_rad1 jtwc_??_wrad64_rad2 jtwc_??_wrad64_rad3 jtwc_??_wrad64_rad4 latitude_merged longitude_merged
0 1842298N11080 1842 1 NI BB NOT NAMED 1842-10-25 06:00:00 NR NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 10.885 79.815
1 1842298N11080 1842 1 NI BB NOT NAMED 1842-10-25 12:00:00 NR NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 10.810 78.890
2 1842298N11080 1842 1 NI AS NOT NAMED 1842-10-25 18:00:00 NR NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 10.795 77.910
3 1842298N11080 1842 1 NI AS NOT NAMED 1842-10-26 00:00:00 NR NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 10.795 76.915
4 1842298N11080 1842 1 NI AS NOT NAMED 1842-10-26 06:00:00 NR NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 10.805 75.820

5 rows × 202 columns

The results look good. Two additional columns (latitude_merged, longitude_merged) have been added. By merging related columns, the redundant sparse columns can be removed, reducing the dimensionality of the input dataset.

Now that this prototype looks good, we will proceed by identifying the lists of remaining columns that are redundant and can be merged.

Merge similar columns

To keep track of which columns have been accounted for, we will duplicate the all_columns list and remove ones that we have identified.

In [46]:
columns_tracker = deepcopy(all_columns)
len(columns_tracker)
Out[46]:
200

From the columns_tracker list, let us remove the redundant latitude and longitude columns we have already identified.

In [47]:
columns_tracker = [x for x in columns_tracker if x not in lat_columns]
columns_tracker = [x for x in columns_tracker if x not in lon_columns]
len(columns_tracker)
Out[47]:
142

Thus, we have reduced the number of columns from 200 to 142. We will progressively reduce this while retaining key information.

Merge wind columns

Wind, pressure and grade are some of the meteorological observations this dataset contains. To start, let us identify the wind columns:

In [48]:
# pick all columns that have 'wind' in name
wind_columns = [x for x in columns_tracker if 'wind' in x.lower()]

# based on metadata doc, we decide to eliminate percentile and wind distance columns
columns_to_eliminate = [x for x in wind_columns if 'radii' in x or 'percentile' in x.lower()]

# trim wind_columns by removing the ones we need to eliminate
wind_columns = [x for x in wind_columns if x not in columns_to_eliminate]
wind_columns
Out[48]:
['Wind(WMO)',
 'hurdat_atl_wind',
 'td9636_wind',
 'reunion_wind',
 'atcf_wind',
 'mlc_natl_wind',
 'ds824_sh_wind',
 'ds824_ni_wind',
 'bom_wind',
 'ds824_au_wind',
 'jtwc_sh_wind',
 'jtwc_wp_wind',
 'td9635_wind',
 'ds824_wp_wind',
 'jtwc_io_wind',
 'cma_wind',
 'hurdat_epa_wind',
 'jtwc_ep_wind',
 'ds824_ep_wind',
 'jtwc_cp_wind',
 'tokyo_wind',
 'neumann_wind',
 'hko_wind',
 'cphc_wind',
 'wellington_wind',
 'newdelhi_wind',
 'nadi_wind']
In [49]:
full_df['wind_merged'] = full_df.apply(pick_median_value, axis=1,
                                          col_list = wind_columns)
CPU times: user 56.7 ms, sys: 4.92 ms, total: 61.6 ms
Wall time: 57.6 ms

Merge pressure columns

We proceed to identify all the pressure columns. But before that, we update the columns_tracker list by removing the wind columns we just identified:

In [50]:
columns_tracker = [x for x in columns_tracker if x not in wind_columns]
columns_tracker = [x for x in columns_tracker if x not in columns_to_eliminate]
len(columns_tracker)
Out[50]:
106
In [51]:
# pick all columns that have 'pres' in name
pressure_columns = [x for x in columns_tracker if 'pres' in x.lower()]

# from metadata, we eliminate percentile and pres distance columns
if columns_to_eliminate:
    columns_to_eliminate.extend([x for x in pressure_columns if 'radii' in x or 'percentile' in x.lower()])
else:
    columns_to_eliminate = [x for x in pressure_columns if 'radii' in x or 'percentile' in x.lower()]

# trim pressure_columns by removing the ones we need to eliminate
pressure_columns = [x for x in pressure_columns if x not in columns_to_eliminate]
pressure_columns
Out[51]:
['Pres(WMO)',
 'hurdat_atl_pres',
 'td9636_pres',
 'reunion_pres',
 'atcf_pres',
 'mlc_natl_pres',
 'ds824_sh_pres',
 'ds824_ni_pres',
 'bom_pres',
 'ds824_au_pres',
 'jtwc_sh_pres',
 'jtwc_wp_pres',
 'td9635_pres',
 'ds824_wp_pres',
 'jtwc_io_pres',
 'cma_pres',
 'hurdat_epa_pres',
 'jtwc_ep_pres',
 'ds824_ep_pres',
 'jtwc_cp_pres',
 'tokyo_pres',
 'neumann_pres',
 'hko_pres',
 'cphc_pres',
 'wellington_pres',
 'newdelhi_pres',
 'nadi_pres']
In [52]:
full_df['pressure_merged'] = full_df.apply(pick_median_value, axis=1,
                                          col_list = pressure_columns)
CPU times: user 122 ms, sys: 5.33 ms, total: 127 ms
Wall time: 123 ms

Merge grade columns

In [53]:
columns_tracker = [x for x in columns_tracker if x not in pressure_columns]
columns_tracker = [x for x in columns_tracker if x not in columns_to_eliminate]
len(columns_tracker)
Out[53]:
78

Notice the length of columns_tracker is reducing progressively as we identify redundant columns.

In [54]:
# pick all columns that have 'grade' in name
grade_columns = [x for x in columns_tracker if 'grade' in x.lower()]
grade_columns
Out[54]:
['hurdat_atl_grade',
 'td9636_grade',
 'reunion_grade',
 'atcf_grade',
 'mlc_natl_grade',
 'ds824_sh_grade',
 'ds824_ni_grade',
 'bom_grade',
 'ds824_au_grade',
 'jtwc_sh_grade',
 'jtwc_wp_grade',
 'td9635_grade',
 'ds824_wp_grade',
 'jtwc_io_grade',
 'cma_grade',
 'hurdat_epa_grade',
 'jtwc_ep_grade',
 'ds824_ep_grade',
 'jtwc_cp_grade',
 'tokyo_grade',
 'neumann_grade',
 'hko_grade',
 'cphc_grade',
 'wellington_grade',
 'newdelhi_grade',
 'nadi_grade']
In [55]:
full_df['grade_merged'] = full_df.apply(pick_median_value, axis=1,
                                          col_list = grade_columns)
CPU times: user 54.9 ms, sys: 5.59 ms, total: 60.5 ms
Wall time: 56.4 ms

Merge eye diameter columns

In [56]:
columns_tracker = [x for x in columns_tracker if x not in grade_columns]
len(columns_tracker)
Out[56]:
52
In [57]:
# pick all columns that have 'eye' in name
eye_dia_columns = [x for x in columns_tracker if 'eye' in x.lower()]
eye_dia_columns
Out[57]:
['bom_mn_eye_diam', 'atcf_eye', 'jtwc_??_eye']
In [58]:
full_df['eye_dia_merged'] = full_df.apply(pick_median_value, axis=1,
                                          col_list = eye_dia_columns)
CPU times: user 53.6 ms, sys: 4.74 ms, total: 58.4 ms
Wall time: 54.8 ms

Identify remaining redundant columns

In [59]:
columns_tracker = [x for x in columns_tracker if x not in eye_dia_columns]
len(columns_tracker)
Out[59]:
49

We are down to 49 columns; let us see what they look like.

In [60]:
pprint(columns_tracker, width=119, compact=True)
['Serial_Num', 'Season', 'Num', 'Basin', 'Sub_basin', 'Name', 'ISO_time', 'Nature', 'Center', 'Track_type',
 'Current Basin', 'reunion_rmw', 'bom_mn_hurr_xtnt', 'bom_mn_gale_xtnt', 'bom_roci', 'atcf_rmw', 'atcf_poci',
 'atcf_roci', 'atcf_wrad34_rad1', 'atcf_wrad34_rad2', 'atcf_wrad34_rad3', 'atcf_wrad34_rad4', 'atcf_wrad50_rad1',
 'atcf_wrad50_rad2', 'atcf_wrad50_rad3', 'atcf_wrad50_rad4', 'atcf_wrad64_rad1', 'atcf_wrad64_rad2',
 'atcf_wrad64_rad3', 'atcf_wrad64_rad4', 'tokyo_dir50', 'tokyo_short50', 'tokyo_dir30', 'tokyo_short30', 'jtwc_??_rmw',
 'jtwc_??_poci', 'jtwc_??_roci', 'jtwc_??_wrad34_rad1', 'jtwc_??_wrad34_rad2', 'jtwc_??_wrad34_rad3',
 'jtwc_??_wrad34_rad4', 'jtwc_??_wrad50_rad1', 'jtwc_??_wrad50_rad2', 'jtwc_??_wrad50_rad3', 'jtwc_??_wrad50_rad4',
 'jtwc_??_wrad64_rad1', 'jtwc_??_wrad64_rad2', 'jtwc_??_wrad64_rad3', 'jtwc_??_wrad64_rad4']

Based on the metadata shared by the data provider, we choose to retain only the first 11 columns. We add the rest to the columns_to_eliminate list.

In [61]:
columns_to_eliminate.extend(columns_tracker[11:])
pprint(columns_to_eliminate, width=119, compact=True)
['Wind(WMO) Percentile', 'reunion_wind_radii_1_ne', 'reunion_wind_radii_1_se', 'reunion_wind_radii_1_sw',
 'reunion_wind_radii_1_nw', 'reunion_wind_radii_2_ne', 'reunion_wind_radii_2_se', 'reunion_wind_radii_2_sw',
 'reunion_wind_radii_2_nw', 'Pres(WMO) Percentile', 'reunion_rmw', 'bom_mn_hurr_xtnt', 'bom_mn_gale_xtnt', 'bom_roci',
 'atcf_rmw', 'atcf_poci', 'atcf_roci', 'atcf_wrad34_rad1', 'atcf_wrad34_rad2', 'atcf_wrad34_rad3', 'atcf_wrad34_rad4',
 'atcf_wrad50_rad1', 'atcf_wrad50_rad2', 'atcf_wrad50_rad3', 'atcf_wrad50_rad4', 'atcf_wrad64_rad1',
 'atcf_wrad64_rad2', 'atcf_wrad64_rad3', 'atcf_wrad64_rad4', 'tokyo_dir50', 'tokyo_short50', 'tokyo_dir30',
 'tokyo_short30', 'jtwc_??_rmw', 'jtwc_??_poci', 'jtwc_??_roci', 'jtwc_??_wrad34_rad1', 'jtwc_??_wrad34_rad2',
 'jtwc_??_wrad34_rad3', 'jtwc_??_wrad34_rad4', 'jtwc_??_wrad50_rad1', 'jtwc_??_wrad50_rad2', 'jtwc_??_wrad50_rad3',
 'jtwc_??_wrad50_rad4', 'jtwc_??_wrad64_rad1', 'jtwc_??_wrad64_rad2', 'jtwc_??_wrad64_rad3', 'jtwc_??_wrad64_rad4']

Drop all redundant columns

So far, we have merged similar columns together and collected the lists of redundant columns to drop. Below we compile them into a single list.

In [62]:
len(full_df.columns)
Out[62]:
206
In [63]:
columns_to_drop = lat_columns + lon_columns + wind_columns + pressure_columns + \
                    grade_columns + eye_dia_columns + columns_to_eliminate
len(columns_to_drop)
Out[63]:
189

Perform delayed computation

In Dask, all computations are delayed and queued. The apply() functions called earlier have not executed yet, even though the corresponding columns already appear in the DataFrame display above. In the cells below, we save the cleaned results to disk, which makes Dask compute on this larger-than-memory dataset.
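
As a minimal sketch of what triggers that execution (the output directory and file pattern here are assumptions, not the notebook's actual cell): dropping the redundant columns remains lazy, and writing the partitions to disk is the step that finally runs the queued task graph.

import os

# hypothetical output location for the cleaned dataset
out_dir = os.path.join(data_dir, 'hurricanes_cleaned')
os.makedirs(out_dir, exist_ok=True)

# dropping columns only extends the task graph; nothing is computed yet
cleaned_df = full_df.drop(labels=columns_to_drop, axis=1)

# to_csv() forces Dask to execute the queued tasks one partition at a time;
# the '*' in the file name is replaced by the partition number
cleaned_df.to_csv(os.path.join(out_dir, 'part-*.csv'))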

Calling visualize() on the delayed compute operation or on the DataFrame object plots the Dask task graph, as shown below. The graphic gives a glimpse of how Dask distributes its tasks and how it reads this larger-than-memory dataset in chunks and operates on them.

Drawing Dask graphs requires the graphviz Python library and the graphviz system library to be installed.

In [ ]:
!conda install --yes -c anaconda graphviz 
!conda install --yes -c conda-forge python-graphviz
In [34]:
full_df.visualize()
Out[34]:
[Dask task graph visualization]