Spatial and temporal distribution of service calls using big data tools
Introduction¶
The arcgis.geoanalytics module provides types and functions for distributed analysis of large datasets. These GeoAnalytics tools work with big data registered in the GIS datastores as well as with feature layers.
In this notebook, we will go through the steps for setting up data to create a big data file share. We will also edit the big data file share manifest to set the spatial reference of the dataset. Once the data is registered, we will demonstrate the utility of a number of tools, including describe_dataset, aggregate_points, calculate_density, find_hot_spots, clip_layer, and run_python_script, in order to better understand our data.
The sample aims to find answers to some fundamental questions:
- What is the spatial relationship between 911 calls?
- Which block groups report the highest number of 911 calls?
- What is the most common reason for 911 calls?
- How many 911 calls occur each month?
- How many 911 calls occur each hour?
The data used in this sample was originally obtained from the data.gov open data portal. You can obtain similar data by searching with keywords, for example: 'Calls for Service New Orleans'. This sample demonstrates the ability of the ArcGIS API for Python to perform big data analysis on your own infrastructure.
Note:
The ability to perform big data analysis is only available on ArcGIS Enterprise licensed with a GeoAnalytics server and not yet available on ArcGIS Online.
Necessary Imports¶
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime as dt
import arcgis
import arcgis.geoanalytics
from arcgis.gis import GIS
from arcgis.geoanalytics.summarize_data import aggregate_points, describe_dataset
from arcgis.geoanalytics.analyze_patterns import calculate_density, find_hot_spots
from arcgis.geoanalytics.manage_data import clip_layer, run_python_script
Connect to your ArcGIS Enterprise organization¶
gis = GIS('https://pythonapi.playground.esri.com/portal', 'arcgis_python', 'amazing_arcgis_123')
Ensure your GIS supports GeoAnalytics¶
After connecting to the Enterprise portal, we need to ensure the ArcGIS Enterprise GIS is set up with a licensed GeoAnalytics server. To do so, we will call the is_supported() method.
arcgis.geoanalytics.is_supported()
Prepare the data¶
To register a file share or an HDFS, we need to format datasets as subfolders within a single parent folder and register that parent folder. The parent folder becomes a datastore, and each subfolder becomes a dataset. Our folder hierarchy would look like the example below:
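For instance, a hypothetical layout for this sample (folder and file names here are illustrative, not the actual share) might be:

|---ServiceCallsOrleans        (parent folder: register this as the big data file share)
    |---Calls                  (subfolder: becomes a single dataset)
        |---calls_2011.csv
        |---calls_2012.csv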
Learn more about preparing your big data file share datasets here
Create a big data file share¶
The get_datastores() method of the geoanalytics module returns a DatastoreManager object that lets you search for and manage big data file share items as Python API Datastore objects on your GeoAnalytics server.
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
bigdata_datastore_manager
We will register the service calls data as a big data file share using the add_bigdata() function on a DatastoreManager object.
When we register a directory, all subdirectories under the specified folder are also registered with the server. Always register the parent folder (for example, \\machinename\mydatashare) that contains one or more individual dataset folders as the big data file share item. To learn more, see register a big data file share.
Note: You cannot browse directories in ArcGIS Server Manager. You must provide the full path to the folder you want to register, for example, \\myserver\share\bigdata. Avoid using local paths, such as C:\bigdata, unless the same data folder is available on all nodes of the server site.
# data_item = bigdata_datastore_manager.add_bigdata("ServiceCallsOrleans", r"\\machinename\datastore")
bigdata_fileshares = bigdata_datastore_manager.search(id='cff51a1a-4f27-4955-a3ef-5fa23240ccf9')
bigdata_fileshares
file_share_folder = bigdata_fileshares[0]
Once a big data file share is created, the GeoAnalytics server samples the datasets to generate a manifest, which outlines the data schema and specifies any time and geometry fields. This process can take a few minutes depending on the size of your data. Once processed, querying the manifest property returns the schema of the datasets in your big data file share.
manifest = file_share_folder.manifest
manifest
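The manifest is a Python dict describing each dataset. A simplified, hypothetical excerpt for a delimited dataset might look roughly like the following; the keys shown are typical, but field names and values here are only illustrative, so consult the manifest output above for the exact structure of your data.

{
    "datasets": [{
        "name": "Calls",
        "format": {"type": "delimited", "extension": "csv", "fieldDelimiter": ","},
        "schema": {"fields": [
            {"name": "TypeText", "type": "esriFieldTypeString"},
            {"name": "TimeCreate", "type": "esriFieldTypeString"}
            # remaining fields omitted in this sketch
        ]},
        "geometry": {"geometryType": "esriGeometryPoint",
                     "spatialReference": {"wkid": 4326},
                     "fields": [{"name": "X", "formats": ["x"]},
                                {"name": "Y", "formats": ["y"]}]},
        "time": {"timeType": "instant",
                 "timeReference": {"timeZone": "UTC"},
                 "fields": [{"name": "TimeCreate", "formats": ["MM/dd/yyyy hh:mm:ss a"]}]}
    }]
}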
Edit a big data file share¶
The spatial reference of the dataset is set to 4326 in the generated manifest, but we know this data is from New Orleans, Louisiana, and is actually stored in the Louisiana State Plane Coordinate System. We will edit the manifest to set the correct spatial reference: {"wkid": 102682, "latestWkid": 3452}.
manifest['datasets'][0]['geometry']['spatialReference'] = { "wkid": 102682, "latestWkid": 3452 }
file_share_folder.manifest = manifest
file_share_folder.manifest
Get data for analysis¶
Search for big data file shares¶
Adding a big data file share to the GeoAnalytics server adds a corresponding big data file share item on the portal. We can search for these types of items using the item_type parameter.
search_result = gis.content.search("bigDataFileShares_ServiceCallsOrleans", item_type = "big data file share", max_items=40)
search_result
data_item = search_result[0]
data_item
Querying the layers property of the item returns the datasets in the big data file share. Each object returned is a Python API Layer object.
data_item.layers
calls = data_item.layers[0]
calls.properties
Search for feature layers¶
block_grp_item = gis.content.get('9975b4dd3ca24d4bbe6177b85f9da7bb')
block_grp_item
We will use this item for our analysis. Since the item is a Feature Layer Collection, accessing its layers property gives us a list of FeatureLayer objects.
blk_grp_lyr = block_grp_item.layers[0]
Describe data¶
The describe_dataset method provides an overview of big data. By default, the tool outputs a table layer containing calculated field statistics and a dict outlining geometry and time settings for the input layer. Optionally, the tool can output a feature layer representing a sample set of features using the sample_size parameter, or a single polygon feature layer representing the input feature layer's extent by setting the extent_output parameter to True.
description = describe_dataset(input_layer=calls,
extent_output=True,
sample_size=1000,
output_name="Description of service calls" + str(dt.now().microsecond),
return_tuple=True)
description.output_json
description.sample_layer
description.sample_layer.query().sdf
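As a quick check, a small pandas tally on the sample returned above can count the missing values in each field (this is a convenience step, not part of the GeoAnalytics workflow):

sample_df = description.sample_layer.query().sdf
# count missing values per field in the 1,000-record sample
sample_df.isna().sum().sort_values(ascending=False)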
We can see that some records have missing or invalid attribute values, including in the fields the manifest defines as the time and geometry values. We will visualize the sample layer on a map to understand it better.
m1 = gis.map()
m1
m1.add_layer(description.sample_layer)
The map shows that some data points fall outside New Orleans because of missing or invalid geometries. We want to explore data points within the New Orleans city limits, so we will use the clip_layer tool to extract only the point features within the New Orleans boundary. This removes the data with missing or invalid geometries.
Extract features within New Orleans boundary¶
The clip_layer method extracts input point, line, or polygon features from an input_layer that fall within the boundaries of features in a clip_layer. The output layer contains a subset of features from the input layer. We will clip our input calls feature layer to the New Orleans blk_grp_lyr features.
clip_result = clip_layer(calls, blk_grp_lyr, output_name="service calls in new Orleans" + str(dt.now().microsecond))
clip_result
orleans_calls = clip_result.layers[0]
m2 = gis.map("New Orleans")
m2
m2.add_layer(orleans_calls)
Summarize data¶
We can use the aggregate_points method in the arcgis.geoanalytics.summarize_data submodule to group call features into individual block group features. The output polygon feature layer summarizes attribute information for all calls that fall within each block group. If no calls fall within a block group, that block group will not appear in the output.
The GeoAnalytics Tools use a process spatial reference during execution. Analyses with square or hexagon bins require a projected coordinate system. We'll use the World Cylindrical Equal Area projection (WKID 54034) below. All results are stored in the spatiotemporal datastore of the Enterprise in the WGS 84 Spatial Reference.
See the GeoAnalytics Documentation for a full explanation of analysis environment settings.
arcgis.env.process_spatial_reference = 54034
agg_result = aggregate_points(orleans_calls,
polygon_layer=blk_grp_lyr,
output_name="aggregate results of call" + str(dt.now().microsecond))
agg_result
m3 = gis.map("New Orleans")
m3
m3.add_layer(agg_result)
m3.legend = True
Analyze patterns¶
The calculate_density method creates a density map from point features by spreading known quantities of some phenomenon (represented as attributes of the points) across the map. The result is a layer of areas classified from least dense to most dense. In this example, we will create a density map by aggregating points within 1-kilometer bins. To learn more, please see here.
cal_density = calculate_density(orleans_calls,
weight='Uniform',
bin_type='Square',
bin_size=1,
bin_size_unit="Kilometers",
time_step_interval=1,
time_step_interval_unit="Years",
time_step_repeat_interval=1,
time_step_repeat_interval_unit="Months",
time_step_reference=dt(2011, 1, 1),
radius=1000,
radius_unit="Meters",
area_units='SquareKilometers',
output_name="calculate density of call" + str(dt.now().microsecond))
cal_density
m4 = gis.map("New Orleans")
m4
m4.add_layer(cal_density)
m4.legend = True
Find statistically significant hot and cold spots¶
The find_hot_spots tool analyzes point data and finds statistically significant spatial clustering of high (hot spots) and low (cold spots) numbers of incidents relative to the overall distribution of the data.
hot_spots = find_hot_spots(orleans_calls,
bin_size=100,
bin_size_unit='Meters',
neighborhood_distance=250,
neighborhood_distance_unit='Meters',
output_name="get hot spot areas" + str(dt.now().microsecond))
hot_spots
m5 = gis.map("New Orleans")
m5
m5.add_layer(hot_spots)
m5.legend = True
The darkest red features indicate areas where you can state with 99 percent confidence that the clustering of 911 call features is not the result of random chance but rather of some other variable that might be worth investigating. Similarly, the darkest blue features indicate that the lack of 911 calls is most likely not just random, but with 90% certainty you can state it is because of some variable in those locations. Features that are beige do not represent statistically significant clustering; the number of 911 calls could very likely be the result of random processes and random chance in those areas.
Visualize other aspects of data¶
The run_python_script method executes a Python script directly in an ArcGIS GeoAnalytics server site. The script can create an analysis pipeline by chaining together multiple GeoAnalytics tools without writing intermediate results to a data store. The tool can also distribute Python functionality across the GeoAnalytics server site.
GeoAnalytics Server installs a Python 3.6 environment that this tool uses. The environment includes Spark 2.2.0, the compute platform that distributes analysis across multiple cores of one or more machines in your GeoAnalytics Server site. It also includes the pyspark module, which provides a collection of distributed analysis tools for data management, clustering, regression, and more. The run_python_script task automatically imports the pyspark module so you can interact with it directly.
When using the geoanalytics and pyspark packages, most functions return analysis results as Spark DataFrame memory structures. You can write these data frames to a data store or process them further in a script. This lets you chain together multiple geoanalytics and pyspark tools while only writing out the final result, eliminating the need to create any bulky intermediate result layers.
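As a minimal sketch of such chaining (the function and output names are hypothetical, and the TypeText call-category field used here is described in the next step), several pyspark operations can be combined in memory before a single write:

def chained_counts():
    from datetime import datetime as dt
    from pyspark.sql import functions as F
    # layers[0] is the layer passed to run_python_script, exposed as a Spark DataFrame
    df = layers[0]
    # chain a filter, an aggregation, and a sort in memory; only this final result is written out
    out = (df.filter(F.col('TypeText').isNotNull())
             .groupBy('TypeText')
             .count()
             .orderBy(F.col('count').desc()))
    out.write.format("webgis").save("chained_counts" + str(dt.now().microsecond))

Such a function would be submitted the same way as the examples that follow, for example run_python_script(code=chained_counts, layers=[calls]).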
The TypeText field contains the reason for the initial 911 call. Let's investigate the most frequent purposes for 911 calls in the New Orleans area by writing our own function:
def groupby_texttype():
from datetime import datetime as dt
# Calls data is stored in a feature service and accessed as a DataFrame via the layers object
df = layers[0]
    # group the DataFrame by the TypeText field and count the number of calls for each call type
out = df.groupBy('TypeText').count()
# Write the final result to our datastore.
out.write.format("webgis").save("groupby_texttype" + str(dt.now().microsecond))
In the function above, the calls layer containing 911 call points is converted to a DataFrame. pyspark is used to group the DataFrame by the TypeText field to find the count of each category of call. The result can be saved as a feature service or other ArcGIS Enterprise layer type.
run_python_script(code=groupby_texttype, layers=[calls])
The result is saved as a portal item. We can search for the saved item using the search() method; providing a search keyword matching the name we used when writing the result will retrieve the item.
grp_cat = gis.content.search('groupby_texttype')[0]
Accessing the tables property of the item gives us the table object. We will then use the query() method to read the table as a spatially enabled DataFrame.
grp_cat_df = grp_cat.tables[0].query().sdf
Sort the values in decreasing order of the count field.
grp_cat_df.sort_values(by='count', ascending=False, inplace=True)
grp_cat_df.head(10).plot(x='TypeText', y='count', kind='barh')
We can see that COMPLAINT OTHER is the most common category of call followed by TRAFFIC INCIDENTS.
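To put these top categories in context, we can compute each category's share of all calls with a small pandas follow-up on the grp_cat_df DataFrame (the share column is added here purely for illustration):

# each category's share of all 911 calls
grp_cat_df['share'] = grp_cat_df['count'] / grp_cat_df['count'].sum()
grp_cat_df.head(5)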
Now we'll investigate the most common reasons for 911 calls within block addresses. We will define a new function to pass to the run_python_script tool, this time grouping data by the TypeText and BLOCK_ADDRESS attributes.
def grpby_cat_blkadd():
from datetime import datetime as dt
# Load the big data file share layer into a DataFrame
df = layers[0]
out = df.groupBy('TypeText', 'BLOCK_ADDRESS').count()
out.write.format("webgis").save("grpby_cat_blkadd" + str(dt.now().microsecond))
run_python_script(code=grpby_cat_blkadd, layers=[calls])
grp_cat_addr = gis.content.search('grpby_cat_blkadd')[0]
grp_cat_addr_df = grp_cat_addr.tables[0].query().sdf
grp_cat_addr_df.sort_values(by='count', ascending=False, inplace=True)
grp_cat_addr_df.head(10).plot(x='BLOCK_ADDRESS', y='count', kind='barh')
The chart shows that the 22A block address has the highest number of incidents reported, so we want to further investigate why this area had the most calls.
blk_addr_high = grp_cat_addr_df[grp_cat_addr_df['BLOCK_ADDRESS'] == '22A']
blk_addr_high.head()
blk_addr_high.sort_values(by='count', ascending=False).TypeText.head()
The result indicates the most common reason for a 911 call in the 22A Block in New Orleans is defined as WARR STOP WITH RELEASE.
Now let's investigate the 911 call data for temporal trends. We saw in the manifest that the TimeCreate field holds the time information for each instant (in the UTC time zone) as a string in the format MM/dd/yyyy hh:mm:ss a. We can parse these strings into timestamps and extract the year, month, day, hour, minute, and second to perform time analyses; in the functions below we do this with pyspark's SQL datetime functions.
Let's define a helper function to convert the TimeCreate attribute field strings into a timestamp type.
def calls_with_datetime():
from datetime import datetime as dt
# Load the big data file share layer into a DataFrame
from pyspark.sql import functions as F
df = layers[0]
out = df.withColumn('datetime', F.unix_timestamp('TimeCreate', 'MM/dd/yyyy hh:mm:ss a').cast('timestamp'))
out.write.format("webgis").save("calls_with_datetime" + str(dt.now().microsecond))
run_python_script(code=calls_with_datetime, layers=[calls])
calls_with_datetime = gis.content.search('calls_with_datetime')[0]
calls_with_datetime
calls_with_datetime_lyr = calls_with_datetime.layers[0]
We will now split the date field into year, month and hour for studying temporal trends.
def call_with_added_date_time_cols():
from datetime import datetime as dt
# Load the big data file share layer into a DataFrame
from pyspark.sql.functions import year, month, hour
df = layers[0]
df = df.withColumn('year', year(df['datetime']))
df = df.withColumn('month', month(df['datetime']))
out = df.withColumn('hour', hour(df['datetime']))
out.write.format("webgis").save("call_with_added_date_time_cols" + str(dt.now().microsecond))
run_python_script(code=call_with_added_date_time_cols, layers=[calls_with_datetime_lyr])
date_time_added_item = gis.content.search('call_with_added_date_time_cols')
date_time_added_item[0]
date_time_added_lyr = date_time_added_item[0].layers[0]
def grp_calls_by_month():
from datetime import datetime as dt
# Load the big data file share layer into a DataFrame
df = layers[0]
out = df.groupBy('month').count()
out.write.format("webgis").save("grp_calls_by_month" + str(dt.now().microsecond))
run_python_script(code=grp_calls_by_month, layers=[date_time_added_lyr])
month = gis.content.search('grp_calls_by_month')[0]
grp_month = month.tables[0]
df_month = grp_month.query().sdf
df_month
df_month.dropna().sort_values(by='month').plot(x='month', y='count', kind='bar')
The chart shows that calls are most frequent in the earlier months of the year.
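To see exactly which months rank highest, we can sort the same table by the count field (a quick pandas check on df_month):

# months with the highest call counts
df_month.dropna().sort_values(by='count', ascending=False).head(3)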
def grp_calls_by_hour():
from datetime import datetime as dt
# Load the big data file share layer into a DataFrame
df = layers[0]
out = df.groupBy('hour').count()
out.write.format("webgis").save("grp_calls_by_hour" + str(dt.now().microsecond))
run_python_script(code=grp_calls_by_hour, layers=[date_time_added_lyr])
hour = gis.content.search('grp_calls_by_hour')[0]
grp_hour = hour.tables[0]
df_hour = grp_hour.query().sdf
df_hour.dropna().sort_values(by='hour').plot(x='hour', y='count', kind='bar')
The chart above shows the frequency of calls per hour of the day. We can see that there is a high frequency of calls between 4 and 5 p.m.
Conclusion¶
We have shown spatial patterns in the distribution of emergency calls and the need to include temporal trends in the analysis. There is obvious variation in the monthly, weekly, daily, and even hourly distribution of service calls, and this should be accounted for in further analysis or in the allocation of resources.