ArcGIS API for Python

Forecasting PM2.5 using big data analysis

Prerequisites

  • The tools available in the ArcGIS API for Python geoanalytics module require an ArcGIS Enterprise deployment licensed and configured with a Linux-based ArcGIS GeoAnalytics Server.
  • When ArcGIS GeoAnalytics Server is installed on Linux, additional configuration steps are required before using the RunPythonScript operation. Install and configure Python 3.6 for Linux on each machine in your GeoAnalytics Server site, ensuring that Python is installed into the same directory on each machine. For this analysis, you also need to install the FB Prophet library in the same environment. Then, update the ArcGIS Server properties on your GeoAnalytics Server site with the pysparkPython property. The value of this property should be the path to the Python executable on your GeoAnalytics Server machines, for example, {"pysparkPython":"path/to/environment"}.

Introduction

While the arcgis.geoanalytics module offers powerful spatial analysis tools, the pyspark package includes dozens of non-spatial distributed tools for classification, prediction, clustering, and more. Using the power of distributed compute, we can analyze large datasets much faster than with non-distributed systems.

This notebook demonstrates the capability of a Spark-powered GeoAnalytics Server to forecast hourly PM2.5 levels given historic time series data for more than one time-dependent variable. The weather and pollution factors used in this analysis are PM2.5, PM10, wind speed, wind direction, and relative humidity. The levels of these pollutants are measured by the US Environmental Protection Agency (EPA), which monitors overall air quality, and we use the dataset provided by the EPA. With Spark, we will learn how to customize and extend our analysis capabilities by:

  • Querying and summarizing your data using SQL
  • Turning analysis workflows into pipelines of GeoAnalytics tools
  • Modeling data with included machine learning libraries

The forecasted result will then be displayed on a dashboard using the recently introduced arcgis.dashboard module in the ArcGIS API for Python.

Note:

  • The ability to perform big data analysis is only available on ArcGIS Enterprise licensed with a GeoAnalytics server and not yet available on ArcGIS Online.

Necessary imports

In [2]:
from datetime import datetime as dt
import pandas as pd

import arcgis
from arcgis.gis import GIS
from arcgis.geoanalytics.manage_data import run_python_script

Connect to your ArcGIS Enterprise organization

In [3]:
gis = GIS('https://pythonapi.playground.esri.com/portal', 'arcgis_python', 'amazing_arcgis_123')

Ensure your GIS supports GeoAnalytics

After connecting to the Enterprise organization, we need to ensure an ArcGIS Enterprise GIS is set up with a licensed GeoAnalytics server. To do so, we will call the is_supported() method.

In [4]:
arcgis.geoanalytics.is_supported()
Out[4]:
True

Prepare data

To register a file share or an HDFS, we need to format datasets as subfolders within a single parent folder and then register that parent folder. The parent folder becomes a datastore, and each subfolder becomes a dataset. Our folder hierarchy would look like the following:
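As a concrete (and hypothetical) illustration of this layout, the sketch below builds a parent folder with one subfolder per dataset; the folder names are made up for the example.

```python
from pathlib import Path
import tempfile

# Hypothetical layout: the parent folder is what gets registered as the
# big data file share; each subfolder under it becomes one dataset.
parent = Path(tempfile.mkdtemp()) / "air_quality_data"  # folder to register
for dataset in ("air_quality_2017", "air_quality_2018", "air_quality_2019"):
    (parent / dataset).mkdir(parents=True)  # one subfolder per dataset

print(sorted(p.name for p in parent.iterdir()))
# ['air_quality_2017', 'air_quality_2018', 'air_quality_2019']
```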

Learn more about preparing your big data file share datasets here.

The get_datastores() method of the geoanalytics module returns a DatastoreManager object that lets you search for and manage the big data file share items as Python API Datastore objects on your GeoAnalytics server.

In [5]:
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
bigdata_datastore_manager
Out[5]:
<DatastoreManager for https://pythonapi.playground.esri.com/ga/admin>

We will register air quality data as a big data file share using the add_bigdata() function on a DatastoreManager object.

When we register a directory, all subdirectories under the specified folder are also registered with the server. Always register the parent folder (for example, \\machinename\mydatashare) that contains one or more individual dataset folders as the big data file share item. To learn more, see register a big data file share.

Note: You cannot browse directories in ArcGIS Server Manager. You must provide the full path to the folder you want to register, for example, \\myserver\share\bigdata. Avoid using local paths, such as C:\bigdata, unless the same data folder is available on all nodes of the server site.

In [ ]:
# data_item = bigdata_datastore_manager.add_bigdata("air_quality_17_18_19", r"/mnt/network/data")
In [6]:
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares
Out[6]:
[<Datastore title:"/bigDataFileShares/NYC_taxi_data15" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/GA_Data" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/ServiceCallsOrleansTest" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/calls" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/all_hurricanes" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/NYCdata" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_1848_1900" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/ServiceCallsOrleans" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_dask_csv" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_dask_shp" type:"bigDataFileShare">,
 <Datastore title:"/cloudStores/cloud_store" type:"cloudStore">]

Get data for analysis

Adding a big data file share to the GeoAnalytics Server adds a corresponding big data file share item on the portal. We can search for these types of items using the item_type parameter.

In [11]:
aqs_data = gis.content.search("bigDataFileShares_GA_Data", item_type = "big data file share")[0]
aqs_data
Out[11]:
bigDataFileShares_GA_Data
Big Data File Share by arcgis_python
Last Modified: May 27, 2021
0 comments, 0 views

Accessing the layers property of the item returns a list of layers; the first one is a feature layer representing our air quality data.

In [12]:
air_lyr = aqs_data.layers[0]
In [13]:
air_lyr.properties
Out[13]:
{
  "dataStoreID": "4f498adb-8e5a-4b7e-96d1-9accacb26a50",
  "fields": [
    {
      "name": "State Code",
      "type": "esriFieldTypeInteger"
    },
    {
      "name": "County Code",
      "type": "esriFieldTypeInteger"
    },
    {
      "name": "Site Num",
      "type": "esriFieldTypeInteger"
    },
    {
      "name": "Parameter Code",
      "type": "esriFieldTypeInteger"
    },
    {
      "name": "POC",
      "type": "esriFieldTypeInteger"
    },
    {
      "name": "Latitude",
      "type": "esriFieldTypeDouble"
    },
    {
      "name": "Longitude",
      "type": "esriFieldTypeDouble"
    },
    {
      "name": "Datum",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Parameter Name",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Date Local",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Time Local",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Date GMT",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Time GMT",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Sample Measurement",
      "type": "esriFieldTypeDouble"
    },
    {
      "name": "Units of Measure",
      "type": "esriFieldTypeString"
    },
    {
      "name": "MDL",
      "type": "esriFieldTypeDouble"
    },
    {
      "name": "Uncertainty",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Qualifier",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Method Type",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Method Code",
      "type": "esriFieldTypeInteger"
    },
    {
      "name": "Method Name",
      "type": "esriFieldTypeString"
    },
    {
      "name": "State Name",
      "type": "esriFieldTypeString"
    },
    {
      "name": "County Name",
      "type": "esriFieldTypeString"
    },
    {
      "name": "Date of Last Change",
      "type": "esriFieldTypeString"
    }
  ],
  "name": "air_quality",
  "geometryType": "esriGeometryPoint",
  "type": "featureClass",
  "spatialReference": {
    "wkid": 102682,
    "latestWkid": 3452
  },
  "geometry": {
    "fields": [
      {
        "name": "Longitude",
        "formats": [
          "x"
        ]
      },
      {
        "name": "Latitude",
        "formats": [
          "y"
        ]
      }
    ]
  },
  "time": {
    "timeType": "instant",
    "timeReference": {
      "timeZone": "UTC"
    },
    "fields": [
      {
        "name": "Date Local",
        "formats": [
          "yyyy-MM-dd"
        ]
      }
    ]
  },
  "currentVersion": 10.81,
  "children": []
}
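The time settings above show that dates are stored as strings in the yyyy-MM-dd format. As a small illustration (using made-up sample values in the same format as the "Date GMT" and "Time GMT" fields), the GMT date and time can be combined into a single Python datetime for time-series work:

```python
from datetime import datetime

# Sample values in the same format as the "Date GMT" / "Time GMT" fields
record = {"Date GMT": "2017-07-05", "Time GMT": "13:00"}

timestamp = datetime.strptime(
    f"{record['Date GMT']} {record['Time GMT']}", "%Y-%m-%d %H:%M"
)
print(timestamp.isoformat())  # 2017-07-05T13:00:00
```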

Now we'll search for a feature layer depicting all the counties in the United States. We'll use the county polygons later in the notebook.

In [15]:
usa_counties = gis.content.get('5c6ef8ef57934990b543708f815d606e')
In [16]:
usa_counties
Out[16]:
usaCounties
Feature Layer Collection by api_data_owner
Last Modified: May 27, 2021
0 comments, 2 views

Since the item is a Feature Layer Collection, accessing the layers property will give us a list of Feature Layer objects. We will use the first layer for our analysis.

In [17]:
usa_counties_lyr = usa_counties.layers[0]

Uncover patterns in data

Describe data

The describe_dataset method provides an overview of your big data. By default, the tool outputs a table layer containing calculated field statistics and a dictionary outlining geometry and time settings for the input layer.

Optionally, the tool can output a feature layer representing a sample set of features using the sample_size parameter, or a single polygon feature layer representing the input feature layers' extent by setting the extent_output parameter to True.

In [18]:
from arcgis.geoanalytics.summarize_data import describe_dataset
In [19]:
description = describe_dataset(input_layer=air_lyr,
                               extent_output=True,
                               sample_size=1000,
                               output_name="Description of air quality data" + str(dt.now().microsecond),
                               return_tuple=True)
In [20]:
description.output_json
Out[20]:
{'datasetName': 'air_quality',
 'datasetSource': 'Big Data File Share - Air_Auality_17_18_19',
 'recordCount': 147535365,
 'geometry': {'geometryType': 'Point',
  'sref': {'wkid': 4326},
  'countNonEmpty': 147535365,
  'countEmpty': 0,
  'spatialExtent': {'xmin': -161.767,
   'ymin': 17.953006,
   'xmax': -65.915482,
   'ymax': 64.84568999999999}},
 'time': {'timeType': 'Instant',
  'countNonEmpty': 147535365,
  'countEmpty': 0,
  'temporalExtent': {'start': '2017-01-01 00:00:00.000',
   'end': '2019-04-30 00:00:00.000'}}}
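Because output_json is a plain dictionary, summary figures can be pulled out of it directly. The values below are copied from the output above:

```python
# Values copied from description.output_json above
output_json = {
    "recordCount": 147535365,
    "time": {"temporalExtent": {"start": "2017-01-01 00:00:00.000",
                                "end": "2019-04-30 00:00:00.000"}},
}

extent = output_json["time"]["temporalExtent"]
summary = (f"{output_json['recordCount']:,} records from "
           f"{extent['start'][:10]} to {extent['end'][:10]}")
print(summary)  # 147,535,365 records from 2017-01-01 to 2019-04-30
```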

We can also use SQL queries to return a subset of records by leveraging the ArcGIS API for Python's FeatureLayer object. When you run query() on a FeatureLayer, you get back a FeatureSet object. Calling the sdf property of the FeatureSet returns a Spatially Enabled DataFrame.

In [21]:
description.sample_layer.query().sdf
Out[21]:
State_Code County_Code Site_Num Parameter_Code POC Latitude Longitude Datum Parameter_Name Date_Local ... Method_Type Method_Code Method_Name State_Name County_Name Date_of_Last_Change INSTANT_DATETIME globalid OBJECTID SHAPE
0 56 35 700 62201 1 42.486361 -110.098861 WGS84 Relative Humidity 2017-07-05 ... Non-FRM 60 Instrumental - Vaisala 435C RH/AT Sensor Wyoming Sublette 2017-11-20 2017-07-05 {4CB48700-32CB-6126-E342-B783681A14AF} 1 {"x": -110.098861, "y": 42.486360999999995, "s...
1 9 9 27 88313 1 41.301400 -72.902871 WGS84 Black Carbon PM2.5 at 880 nm 2017-06-08 ... Non-FRM 894 Magee AE33/ TAPI M633 Aethalometer - Optical a... Connecticut New Haven 2018-01-09 2017-06-08 {3D99EDC1-EC50-6B1A-508F-A0034CD9DB02} 2 {"x": -72.90287099999999, "y": 41.3014, "spati...
2 45 79 7 42601 2 34.093959 -80.962304 WGS84 Nitric oxide (NO) 2018-06-28 ... Non-FRM 674 Instrumental - Chemiluminescence Thermo Electr... South Carolina Richland 2019-02-18 2018-06-28 {04ED8FBB-2ADF-AA44-BE30-F61E7A939657} 3 {"x": -80.962304, "y": 34.093959, "spatialRefe...
3 48 183 1 62101 1 32.378682 -94.711811 WGS84 Outdoor Temperature 2018-10-18 ... Non-FRM 40 INSTRUMENTAL - ELECTRONIC OR MACHINE AVG. Texas Gregg 2019-02-18 2018-10-18 {60E09F2E-891F-EEBD-1253-0FFA407BDEA4} 4 {"x": -94.711811, "y": 32.378682, "spatialRefe...
4 19 153 30 44201 1 41.603159 -93.643118 WGS84 Ozone 2017-12-23 ... FEM 47 INSTRUMENTAL - ULTRA VIOLET Iowa Polk 2018-01-12 2017-12-23 {73E73030-0883-7A32-ED9C-BD59ACC00A2F} 5 {"x": -93.643118, "y": 41.603159, "spatialRefe...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 27 53 962 61104 1 44.965242 -93.254759 NAD83 Wind Direction - Resultant 2018-07-15 ... Non-FRM 63 Instrumental - Climatronics Minnesota Hennepin 2019-01-17 2018-07-15 {61736B2D-7C20-DF9E-37BA-21CF788F8FDD} 1003 {"x": -93.25475899999999, "y": 44.965241999999...
996 48 43 101 44201 1 29.302650 -103.177810 WGS84 Ozone 2018-02-21 ... FEM 47 INSTRUMENTAL - ULTRA VIOLET Texas Brewster 2018-04-19 2018-02-21 {E739DA0C-7A97-8059-72A3-6505490F990D} 1004 {"x": -103.17781, "y": 29.30265, "spatialRefer...
997 19 163 17 62101 1 41.467236 -90.688451 WGS84 Outdoor Temperature 2017-07-16 ... Non-FRM 40 INSTRUMENTAL - ELECTRONIC OR MACHINE AVG. Iowa Scott 2017-08-10 2017-07-16 {1B805C6B-7621-B9B2-9C25-E35DC16D466E} 1006 {"x": -90.688451, "y": 41.467236, "spatialRefe...
998 39 145 21 62101 1 38.600611 -82.829782 NAD83 Outdoor Temperature 2017-12-20 ... Non-FRM 40 INSTRUMENTAL - ELECTRONIC OR MACHINE AVG. Ohio Scioto 2018-01-12 2017-12-20 {965A3811-00DB-88A4-A734-959041D2CF67} 1007 {"x": -82.829782, "y": 38.600611, "spatialRefe...
999 6 31 4 61103 1 36.102244 -119.565650 NAD83 Wind Speed - Resultant 2017-09-28 ... Non-FRM 20 INSTRUMENTAL - VECTOR SUMMATION California Kings 2018-03-05 2017-09-28 {85F9DBB0-C17B-A74A-2717-29AACFCDB63B} 1008 {"x": -119.56565, "y": 36.102244, "spatialRefe...

1000 rows × 28 columns
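With the sample in a DataFrame, ordinary pandas filtering applies. The toy frame below stands in for description.sample_layer.query().sdf; the parameter names follow the EPA naming seen above, but the rows are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the sampled Spatially Enabled DataFrame above
df = pd.DataFrame({
    "Parameter_Name": ["Relative Humidity",
                       "PM2.5 - Local Conditions",
                       "Ozone"],
    "Sample_Measurement": [41.0, 7.2, 0.031],
})

# Keep only PM2.5 measurements, as the forecast will
pm25 = df[df["Parameter_Name"].str.contains("PM2.5", regex=False)]
print(len(pm25))  # 1
```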

In [15]:
m1 = gis.map('USA')
m1
Out[15]: