Spatial and temporal distribution of service calls using big data tools¶
Table of Contents
- Introduction
- Necessary Imports
- Connect to your ArcGIS Enterprise organization
- Ensure your GIS supports GeoAnalytics
- Prepare the data
- Get data for analysis
- Describe data
- Extract features within New Orleans boundary
- Summarize data
- Analyze patterns
- Find statistically significant hot and cold spots
- Visualize other aspects of data
- Conclusion
Introduction¶
The arcgis.geoanalytics module provides types and functions for distributed analysis of large datasets. These GeoAnalytics tools work with big data registered in the GIS datastores as well as with feature layers.
In this notebook, we will go through the steps for setting up data to create a big data file share. We will also edit the big data file share manifest to set the spatial reference of the dataset. Once the data is registered, we will demonstrate the utility of a number of tools, including describe_dataset, aggregate_points, calculate_density, find_hot_spots, clip_layer, and run_python_script, in order to better understand our data.
The sample aims to find answers to some fundamental questions:
- What is the spatial relationship between 911 calls?
- Which block groups have the highest number of 911 calls reporting?
- What is the most common reason for 911 calls?
- How many 911 calls occur each month?
- How many 911 calls occur each hour?
The data used in this sample was originally obtained from the data.gov open data portal. You can find it by searching with keywords, for example: 'Calls for Service New Orleans'. This sample demonstrates the ability of the ArcGIS API for Python to perform big data analysis on your own infrastructure.
Note:
The ability to perform big data analysis is only available on ArcGIS Enterprise licensed with a GeoAnalytics server and not yet available on ArcGIS Online.
Necessary Imports¶
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime as dt
import arcgis
import arcgis.geoanalytics
from arcgis.gis import GIS
from arcgis.geoanalytics.summarize_data import aggregate_points, describe_dataset
from arcgis.geoanalytics.analyze_patterns import calculate_density, find_hot_spots
from arcgis.geoanalytics.manage_data import clip_layer, run_python_script
Connect to your ArcGIS Enterprise organization¶
gis = GIS(url='https://pythonapi.playground.esri.com/portal', username="arcgis_python", password="amazing_arcgis_123")
Ensure your GIS supports GeoAnalytics¶
After connecting to the Enterprise portal, we need to ensure the ArcGIS Enterprise GIS is set up with a licensed GeoAnalytics server. To do so, we will call the is_supported() method.
arcgis.geoanalytics.is_supported()
Prepare the data¶
To register a file share or an HDFS, we need to format datasets as subfolders within a single parent folder and register the parent folder. This parent folder becomes a datastore, and each subfolder becomes a dataset. Our folder hierarchy would look like below:
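For example, a parent folder holding yearly CSV files of service calls might be laid out as follows (folder and file names here are illustrative, not the actual dataset):

```
|---ServiceCallsOrleans            <- parent folder, registered as the big data file share
    |---ServiceCalls2011           <- each subfolder becomes one dataset
    |   |---ServiceCalls2011.csv
    |---ServiceCalls2012
        |---ServiceCalls2012.csv
```

Registering the parent folder makes it the datastore; each immediate subfolder is exposed as a separate dataset.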
Learn more about preparing your big data file share datasets here
Create a big data file share¶
The get_datastores() method of the geoanalytics module returns a DatastoreManager object that lets you search for and manage the big data file share items as Python API Datastore objects on your GeoAnalytics server.
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
bigdata_datastore_manager
We will register the service calls data as a big data file share using the add_bigdata() method on the DatastoreManager object.
When we register a directory, all subdirectories under the specified folder are also registered with the server. Always register the parent folder (for example, \\machinename\mydatashare) that contains one or more individual dataset folders as the big data file share item. To learn more, see register a big data file share.
Note: You cannot browse directories in ArcGIS Server Manager. You must provide the full path to the folder you want to register, for example, \\myserver\share\bigdata. Avoid using local paths, such as C:\bigdata, unless the same data folder is available on all nodes of the server site.
data_item = bigdata_datastore_manager.add_bigdata("ServiceCallsOrleans", r"\\machinename\datastore")
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares
file_share_folder = bigdata_fileshares[1]
Once a big data file share is created, the GeoAnalytics server samples its datasets to generate a manifest, which outlines the data schema and specifies any time and geometry fields. This process can take a few minutes depending on the size of your data. Once processing completes, querying the manifest property returns the schema of the datasets in your big data file share.
manifest = file_share_folder.manifest
manifest
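The manifest is a JSON document describing each dataset's schema, geometry, and time settings. A trimmed, hypothetical sketch of its structure (the dataset name and field names below are illustrative, not the actual schema):

```python
# A trimmed, illustrative manifest structure; real manifests contain
# additional keys (format, schema fields, time settings, etc.).
manifest = {
    "datasets": [
        {
            "name": "ServiceCalls2011",  # hypothetical dataset name
            "geometry": {
                "geometryType": "esriGeometryPoint",
                "spatialReference": {"wkid": 4326},
            },
        }
    ]
}

# The spatial reference of the first dataset can be read (and later
# overwritten) through plain dict access:
sr = manifest["datasets"][0]["geometry"]["spatialReference"]
print(sr["wkid"])  # → 4326
```

This is the path we will write to in the next section when correcting the spatial reference.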
Edit a big data file share¶
The manifest reports the spatial reference of the dataset as 4326, but we know this data is from New Orleans, Louisiana, and is actually stored in the Louisiana State Plane Coordinate System. We will edit the manifest to set the correct spatial reference: {"wkid": 102682, "latestWkid": 3452}.
manifest['datasets'][0]['geometry']['spatialReference'] = { "wkid": 102682, "latestWkid": 3452 }
file_share_folder.manifest = manifest
Get data for analysis¶
Search for big data file shares¶
Adding a big data file share to the GeoAnalytics server also adds a corresponding big data file share item on the portal. We can search for these items using the item_type parameter.
search_result = gis.content.search("", item_type = "big data file share", max_items=40)
search_result
data_item = search_result[0]
data_item
Querying the layers property of the item returns a feature layer for each dataset in the big data file share; each is a Python API Layer object.
data_item.layers
calls = data_item.layers[0]
calls.properties
Search for feature layers¶
block_grp_item = gis.content.search('New_Orleans_Block_Goups', 'feature layer')[0]
block_grp_item
We will use the first item for our analysis. Since the item is a Feature Layer Collection, accessing its layers property will give us a list of FeatureLayer objects.
blk_grp_lyr = block_grp_item.layers[0]
Describe data¶
The describe_dataset method provides an overview of big data. By default, the tool outputs a table layer containing calculated field statistics and a dict outlining the geometry and time settings of the input layer.
Optionally, the tool can output a feature layer representing a sample set of features using the sample_size parameter, or a single polygon feature layer representing the extent of the input layer by setting the extent_output parameter to True.
description = describe_dataset(input_layer=calls,
extent_output=True,
sample_size=1000,
output_name="Description of service calls" + str(dt.now().microsecond),
return_tuple=True)
description.output_json
description.sample_layer
description.sample_layer.query().sdf
We can see that some records have missing or invalid attribute values, including in the fields the manifest defines as the time and geometry fields. We will visualize the sample layer on a map to understand it better.
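Before mapping, a quick way to quantify the missing values is with pandas on the returned spatial DataFrame. A minimal sketch using a toy DataFrame (the column names typetext, timecreate, and x_coord are illustrative stand-ins, not necessarily the actual service-call schema):

```python
import pandas as pd

# Toy stand-in for description.sample_layer.query().sdf; column names
# here are illustrative, not the actual service-call schema.
sdf = pd.DataFrame({
    "typetext":   ["BURGLAR ALARM", None, "COMPLAINT OTHER", "THEFT"],
    "timecreate": ["1/1/2011 0:19", "1/1/2011 0:22", None, "1/1/2011 1:03"],
    "x_coord":    [3685433.0, None, 3689250.0, 3692108.0],
})

# Count missing values per field to see which records are unusable.
missing_per_field = sdf.isna().sum()
print(missing_per_field)

# Keep only rows with complete geometry and time values.
complete = sdf.dropna(subset=["x_coord", "timecreate"])
print(len(complete))  # → 2 rows with usable geometry and time
```

The same isna()/dropna() pattern applies to the real sample DataFrame, whatever its field names turn out to be.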
m1 = gis.map()
m1
m1.add_layer(description.sample_layer)
The map shows that some data points fall outside New Orleans because of missing or invalid geometries. Since we want to explore data points within the New Orleans city limits, we will use the clip_layer tool to extract only the point features within the New Orleans boundary. This also removes records with missing or invalid geometries.
Extract features within New Orleans boundary¶
The clip_layer method extracts input point, line, or polygon features from an input_layer that fall within the boundaries of features in a clip_layer. The output layer contains a subset of features from the input layer. We will clip our input calls layer to the New Orleans block group features in blk_grp_lyr.
clip_result = clip_layer(calls, blk_grp_lyr, output_name="service calls in new Orleans" + str(dt.now().microsecond))
clip_result
orleans_calls = clip_result.layers[0]
m2 = gis.map("New Orleans")
m2