Crime analysis and clustering using geoanalytics and pyspark.ml¶
Introduction¶
Many of the poorest neighborhoods in the City of Chicago face violent crime. As crime rises, the amount of crime data grows with it, so there is a strong need to identify crime patterns in order to reduce their occurrence. Data mining with some of the most powerful tools available in the ArcGIS API for Python is an effective way to analyze and detect patterns in data. Through this sample, we will demonstrate the utility of a number of geoanalytics tools, including find_hot_spots, aggregate_points, and calculate_density, to visually understand geographical patterns.
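The three tools follow a common pattern: each takes an input point layer plus binning parameters and returns a result layer. A minimal sketch of how this sample builds toward them is below; the bin and radius sizes are placeholder values, not recommendations, and the imports are deferred inside the function because the calls require a live GeoAnalytics server.

```python
def run_pattern_tools(crime_lyr):
    """Sketch of the pattern-analysis calls this sample builds toward.
    Bin/radius sizes are illustrative placeholders."""
    from arcgis.geoanalytics.summarize_data import aggregate_points
    from arcgis.geoanalytics.analyze_patterns import calculate_density, find_hot_spots

    # Count crimes per 1 km square bin
    agg = aggregate_points(crime_lyr, bin_size=1, bin_size_unit='Kilometers')
    # Flag statistically significant clusters of high/low crime counts
    hot = find_hot_spots(crime_lyr, bin_size=1, bin_size_unit='Kilometers')
    # Turn discrete incidents into a continuous density surface
    dens = calculate_density(crime_lyr, bin_size=1, bin_size_unit='Kilometers',
                             radius=2, radius_unit='Kilometers')
    return agg, hot, dens
```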
The pyspark module available through the run_python_script tool provides a collection of distributed analysis tools for data management, clustering, regression, and more. The run_python_script task automatically imports the pyspark module so you can interact with it directly. By calling the pyspark.ml implementation of k-means in the run_python_script tool, we will cluster crime data into a predefined number of clusters. Such clusters are also useful in identifying crime patterns. Further, based on the results of the analysis, the segmented crime map can be used to help dispatch officers efficiently throughout the city.
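The clustering step described above can be sketched as follows: a function body is shipped to the GeoAnalytics server with run_python_script, where the input layers are exposed as Spark DataFrames. The coordinate column names ("X", "Y"), k=5, and the output name "chicago_crime_clusters" are all illustrative assumptions, not values from this sample.

```python
def cluster_crime_points():
    """Runs remotely on the GeoAnalytics server, where run_python_script
    injects `layers` (the input layers as Spark DataFrames) and the
    pyspark module. Column names and k=5 are illustrative placeholders."""
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    df = layers[0]  # first layer passed via the `layers` argument
    # KMeans expects a single vector column; assemble the coordinates
    features = VectorAssembler(inputCols=["X", "Y"],
                               outputCol="features").transform(df)
    model = KMeans(k=5, seed=1).fit(features)
    clustered = model.transform(features)  # adds a `prediction` column
    # Write the result back so it appears as a new layer in the portal
    clustered.write.format("webgis").save("chicago_crime_clusters")

# Submitted for remote execution (requires a GeoAnalytics server):
# from arcgis.geoanalytics.manage_data import run_python_script
# run_python_script(code=cluster_crime_points, layers=[crime_lyr])
```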
Necessary Imports¶
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime as dt
import arcgis
import arcgis.geoanalytics
from arcgis.gis import GIS
from arcgis.geoanalytics.summarize_data import describe_dataset, aggregate_points
from arcgis.geoanalytics.analyze_patterns import calculate_density, find_hot_spots
from arcgis.geoanalytics.manage_data import clip_layer, run_python_script
Connect to your ArcGIS Enterprise Organization¶
gis = GIS(url='https://pythonapi.playground.esri.com/portal', username='arcgis_python', password='amazing_arcgis_123')
Ensure your GIS supports GeoAnalytics¶
Before executing a tool, we need to ensure an ArcGIS Enterprise GIS is set up with a licensed GeoAnalytics server. To do so, call the is_supported() method after connecting to your Enterprise portal. See the Components of ArcGIS URLs documentation for details on the URLs to enter as GIS parameters based on your particular Enterprise configuration.
arcgis.geoanalytics.is_supported()
Prepare the data¶
To register a file share or an HDFS, we need to format datasets as subfolders within a single parent folder and register the parent folder. The parent folder becomes a datastore, and each subfolder becomes a dataset. Our folder hierarchy would look like the following:
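For example, a parent folder registered as a big data file share might be laid out like this (illustrative names; each subfolder becomes one dataset):

```
|---chicago              <- parent folder, registered as the datastore
    |---crime            <- dataset 1: crime records as CSVs
    |   |---2001.csv
    |   |---2002.csv
    |---stations         <- dataset 2: police station locations
        |---stations.csv
```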
Learn more about preparing your big data file share datasets here.
Register a big data file share¶
The get_datastores() method of the geoanalytics module returns a DatastoreManager object that lets you search for and manage the big data file share items as Python API Datastore objects on your GeoAnalytics server.
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
bigdata_datastore_manager
We will register the Chicago crime data as a big data file share using the add_bigdata() method on the DatastoreManager object.
When we register a directory, all subdirectories under the specified folder are also registered with the server. Always register the parent folder (for example, \\machinename\mydatashare) that contains one or more individual dataset folders as the big data file share item. To learn more, see register a big data file share.
Note: You cannot browse directories in ArcGIS Server Manager. You must provide the full path to the folder you want to register, for example, \\myserver\share\bigdata. Avoid using local paths, such as C:\bigdata, unless the same data folder is available on all nodes of the server site.
data_item = bigdata_datastore_manager.add_bigdata("Chicago_Crime_2001_2020", r"\\machine_name\data\chicago")
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares
file_share_folder = bigdata_fileshares[2]
Once a big data file share is created, the GeoAnalytics server samples its datasets to generate a manifest, which outlines the data schema and specifies any time and geometry fields. This process can take a few minutes depending on the size of your data. Once processing completes, querying the manifest property returns the schema of the datasets in your big data file share.
manifest = file_share_folder.manifest
manifest
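A manifest is a JSON document; its exact contents depend on your data, but its general shape resembles the fragment below (the dataset name, fields, and values are illustrative, not from this sample), and it can be inspected like any Python dict.

```python
# Illustrative manifest fragment; real manifests are generated by the
# GeoAnalytics server and contain your datasets' actual schema.
manifest = {
    "datasets": [
        {
            "name": "crime",
            "format": {"type": "delimited", "extension": "csv"},
            "schema": {
                "fields": [
                    {"name": "CaseNumber", "type": "esriFieldTypeString"},
                    {"name": "X", "type": "esriFieldTypeDouble"},
                    {"name": "Y", "type": "esriFieldTypeDouble"},
                ]
            },
            "geometry": {"geometryType": "esriGeometryPoint"},
        }
    ]
}

# List each dataset and its field names
for ds in manifest["datasets"]:
    field_names = [f["name"] for f in ds["schema"]["fields"]]
    print(ds["name"], field_names)
```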
Get data for analysis¶
Adding a big data file share to the GeoAnalytics server adds a corresponding big data file share item on the portal. We can search for these types of items using the item_type parameter.
search_result = gis.content.search("bigDataFileShares_Chicago_Crime_2001_2020", item_type = "big data file share")
search_result
crime_item = search_result[0]
crime_item
Querying the layers property of the item returns a list of feature layers representing the data; each is a Python API Layer object.
crime_lyr = crime_item.layers[0]
illinois_blk_grps = gis.content.search('block_groups_illinois', 'feature layer')[0]
illinois_blk_grps
blk_lyr = illinois_blk_grps.layers[0]
We will filter the block groups by county code '031', the FIPS code for Cook County, which contains Chicago.
blk_lyr.filter = "COUNTYFP = '031'"
m2 = gis.map('chicago')
m2
m2.add_layer(blk_lyr)
Describe data¶
The describe_dataset method provides an overview of big data. By default, the tool outputs a table layer containing calculated field statistics and a dict outlining geometry and time settings for the input layer. Optionally, the tool can output a feature layer representing a sample set of features using the sample_size parameter, or a single polygon feature layer representing the input layer's extent by setting the extent_output parameter to True.
description = describe_dataset(input_layer=crime_lyr,
extent_output=True,
sample_size=1000,
output_name="Description of crime data" + str(dt.now().microsecond),
return_tuple=True)
description.output_json
sdf_desc_output = description.output.query(as_df=True)
sdf_desc_output.head()
description.sample_layer
sdf_slyr = description.sample_layer.query(as_df=True)
sdf_slyr.head()
m1 = gis.map('chicago')
m1