
ArcGIS API for Python


Crime analysis and clustering using geoanalytics and pyspark.ml

Introduction

Many of the poorest neighborhoods in the City of Chicago face violent crime. As crime increases, so does the volume of crime data, and there is a strong need to identify crime patterns in order to reduce its occurrence. Data mining with the tools available in the ArcGIS API for Python is an effective way to analyze and detect patterns in this data. In this sample, we demonstrate a number of GeoAnalytics tools, including find_hot_spots, aggregate_points, and calculate_density, to visually understand geographical patterns.

The pyspark module, available through the run_python_script tool, provides a collection of distributed analysis tools for data management, clustering, regression, and more. The run_python_script task automatically imports the pyspark module, so you can interact with it directly. By calling the k-means implementation in pyspark.ml from the run_python_script tool, we will cluster the crime data into a predefined number of clusters. Such clusters are also useful for identifying crime patterns.

Further, based on the results of the analysis, the segmented crime map can be used to help dispatch officers efficiently throughout the city.

Necessary Imports

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime as dt

import arcgis
import arcgis.geoanalytics
from arcgis.gis import GIS
from arcgis.geoanalytics.summarize_data import describe_dataset, aggregate_points
from arcgis.geoanalytics.analyze_patterns import calculate_density, find_hot_spots
from arcgis.geoanalytics.manage_data import clip_layer, run_python_script

Connect to your ArcGIS Enterprise Organization

In [2]:
gis = GIS(url='https://pythonapi.playground.esri.com/portal', username='arcgis_python', password='amazing_arcgis_123')

Ensure your GIS supports GeoAnalytics

Before executing a tool, we need to ensure that the ArcGIS Enterprise GIS is set up with a licensed GeoAnalytics server. To do so, call the is_supported() function after connecting to your Enterprise portal. See the Components of ArcGIS URLs documentation for details on the URLs to enter in the GIS parameters based on your particular Enterprise configuration.

In [3]:
arcgis.geoanalytics.is_supported()
Out[3]:
True

Prepare the data

To register a file share or an HDFS, we need to format datasets as subfolders within a single parent folder and register the parent folder. This parent folder becomes a datastore, and each subfolder becomes a dataset. In this sample, the registered parent folder contains a single dataset subfolder, crime.

Learn more about preparing your big data file share datasets here.
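The parent-folder convention described above can be sketched with the standard library alone. This is only an illustration of the layout, not how the GeoAnalytics server scans a share: the folders are created in a temporary directory, the CSV file name is hypothetical, and `list_datasets` is a helper invented for this example.

```python
import tempfile
from pathlib import Path

def list_datasets(parent_folder):
    """Return the names of the immediate subfolders, one per dataset."""
    return sorted(p.name for p in Path(parent_folder).iterdir() if p.is_dir())

# Build a throwaway layout: one parent folder (the future datastore)
# containing one subfolder (the dataset) that holds the delimited files.
root = Path(tempfile.mkdtemp())
parent = root / "chicago"                       # parent folder to register
(parent / "crime").mkdir(parents=True)          # dataset subfolder
(parent / "crime" / "crimes_2001_2020.csv").write_text("ID,Date\n")  # hypothetical file

datasets = list_datasets(parent)
print(datasets)
```

Registering `parent` would then expose one dataset per entry in `datasets`.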

Register a big data file share

The get_datastores() method of the geoanalytics module returns a DatastoreManager object that lets you search for and manage the big data file share items as Python API Datastore objects on your GeoAnalytics server.

In [ ]:
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
bigdata_datastore_manager

We will register the Chicago crime data as a big data file share using the add_bigdata() method of the DatastoreManager object.

When we register a directory, all subdirectories under the specified folder are also registered with the server. Always register the parent folder (for example, \\machinename\mydatashare) that contains one or more individual dataset folders as the big data file share item. To learn more, see register a big data file share.

Note: You cannot browse directories in ArcGIS Server Manager. You must provide the full path to the folder you want to register, for example, \\myserver\share\bigdata. Avoid using local paths, such as C:\bigdata, unless the same data folder is available on all nodes of the server site.
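As a quick sanity check before registering, the distinction between a shared (UNC) path and a local drive path can be tested with `pathlib`. This is a minimal sketch; `is_unc_path` is a hypothetical helper, not part of the ArcGIS API, and it only inspects the string without checking that the share exists.

```python
from pathlib import PureWindowsPath

def is_unc_path(path):
    """Return True for a UNC path (server and share) rather than a drive-letter path."""
    # For UNC paths, PureWindowsPath reports the server and share as the drive.
    drive = PureWindowsPath(path).drive
    return drive.startswith("\\\\")

print(is_unc_path(r"\\myserver\share\bigdata"))
print(is_unc_path(r"C:\bigdata"))
```

`PureWindowsPath` is used so the check behaves the same on any operating system.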

In [5]:
data_item = bigdata_datastore_manager.add_bigdata("Chicago_Crime_2001_2020", r"\\machine_name\data\chicago")
Created Big Data file share for Chicago_Crime_2001_2020
In [6]:
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares
Out[6]:
[<Datastore title:"/bigDataFileShares/air_quality_2017_18_19" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/air_quality_2019" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/ChicagoCrime_2001_2020" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/Chicago_Crime_2001_2020" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/pm2017" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/pm2017_4326" type:"bigDataFileShare">]
In [7]:
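# Select the share registered above: "/bigDataFileShares/Chicago_Crime_2001_2020" is at index 3 in the listing
file_share_folder = bigdata_fileshares[3]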

Once a big data file share is created, the GeoAnalytics server samples the datasets to generate a manifest, which outlines the data schema and specifies any time and geometry fields. This process can take a few minutes depending on the size of your data. Once processing is complete, querying the manifest property returns the schema of each dataset in your big data file share.

In [8]:
manifest = file_share_folder.manifest
manifest
Out[8]:
{'datasets': [{'name': 'crime',
   'format': {'quoteChar': '"',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'encoding': 'UTF-8',
    'escapeChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited',
    'extension': 'csv'},
   'schema': {'fields': [{'name': 'ID', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'Case Number', 'type': 'esriFieldTypeString'},
     {'name': 'Date', 'type': 'esriFieldTypeString'},
     {'name': 'Block', 'type': 'esriFieldTypeString'},
     {'name': 'IUCR', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'Primary Type', 'type': 'esriFieldTypeString'},
     {'name': 'Description', 'type': 'esriFieldTypeString'},
     {'name': 'Location Description', 'type': 'esriFieldTypeString'},
     {'name': 'Arrest', 'type': 'esriFieldTypeString'},
     {'name': 'Domestic', 'type': 'esriFieldTypeString'},
     {'name': 'Beat', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'District', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'Ward', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'Community Area', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'FBI Code', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'X Coordinate', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'Y Coordinate', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'Year', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'Updated On', 'type': 'esriFieldTypeString'},
     {'name': 'Latitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'Longitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'Location', 'type': 'esriFieldTypeString'}]},
   'geometry': {'geometryType': 'esriGeometryPoint',
    'spatialReference': {'wkid': 4326},
    'fields': [{'name': 'Location', 'formats': ['({y},{x})']}]},
   'time': {'timeType': 'instant',
    'timeReference': {'timeZone': 'UTC'},
    'fields': [{'name': 'Date', 'formats': ['MM/dd/yyyy hh:mm:ss a']}]}}]}
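The geometry and time entries of the manifest declare how the Location and Date strings are laid out. The sketch below shows one way those two formats could be interpreted in plain Python. It is an illustration only: the strptime pattern is a hand-written equivalent of 'MM/dd/yyyy hh:mm:ss a' (an assumption, not something the server provides), the helper names are hypothetical, and the record values are taken from the crime dataset.

```python
from datetime import datetime

# Trimmed copy of the geometry and time sections of the manifest.
manifest_excerpt = {
    "geometry": {"fields": [{"name": "Location", "formats": ["({y},{x})"]}]},
    "time": {"fields": [{"name": "Date", "formats": ["MM/dd/yyyy hh:mm:ss a"]}]},
}

# Assumption: hand-written strptime equivalent of 'MM/dd/yyyy hh:mm:ss a'.
STRPTIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def parse_location(text):
    """Parse a '({y},{x})' string into an (x, y) tuple of floats."""
    y_text, x_text = text.strip().strip("()").split(",")
    return float(x_text), float(y_text)

def parse_date(text):
    """Parse a 12-hour timestamp such as '08/04/2011 02:10:00 AM'."""
    return datetime.strptime(text, STRPTIME_FORMAT)

geometry_field = manifest_excerpt["geometry"]["fields"][0]["name"]
time_field = manifest_excerpt["time"]["fields"][0]["name"]

record = {"Date": "08/04/2011 02:10:00 AM",
          "Location": "(41.750808511, -87.572308641)"}
x, y = parse_location(record[geometry_field])
timestamp = parse_date(record[time_field])
print(x, y, timestamp.isoformat())
```

Note that the format lists latitude (y) before longitude (x), so the two values are swapped when building an (x, y) pair.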

Get data for analysis

Adding a big data file share to the GeoAnalytics server adds a corresponding big data file share item on the portal. We can search for these types of items using the item_type parameter.

In [9]:
search_result = gis.content.search("bigDataFileShares_Chicago_Crime_2001_2020", item_type = "big data file share")
search_result
Out[9]:
[<Item title:"bigDataFileShares_Chicago_Crime_2001_2020" type:Big Data File Share owner:admin>]
In [10]:
crime_item = search_result[0]
In [11]:
crime_item
Out[11]:
bigDataFileShares_Chicago_Crime_2001_2020
Big Data File Share by admin
Last Modified: April 09, 2020
0 comments, 0 views

The layers property of the item returns the layers representing the data; each one is an API Layer object.

In [12]:
crime_lyr = crime_item.layers[0]
In [13]:
illinois_blk_grps = gis.content.search('block_groups_illinois', 'feature layer')[0]
In [14]:
illinois_blk_grps
Out[14]:
block_groups_illinois
Feature Layer Collection by admin
Last Modified: February 13, 2020
0 comments, 3 views
In [15]:
blk_lyr = illinois_blk_grps.layers[0]

We will filter the block groups by county code 031, the FIPS code for Cook County, which contains Chicago.

In [ ]:
blk_lyr.filter = "COUNTYFP = '031'"
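The filter above quotes the code, which suggests COUNTYFP stores it as text, so the leading zero matters. A minimal sketch of building such a clause follows; `county_where` is a hypothetical helper, not part of the ArcGIS API, and the default field name is simply the one used in the filter.

```python
def county_where(county_fips, field="COUNTYFP"):
    """Build a where clause for a three-digit county FIPS code stored as text."""
    code = str(county_fips).zfill(3)   # pad to three characters, e.g. 31 -> '031'
    return f"{field} = '{code}'"

print(county_where(31))
print(county_where("031"))
```

Both calls produce the same clause that is assigned to `blk_lyr.filter`.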
In [17]:
m2 = gis.map('chicago')
m2
Out[17]:
In [18]:
m2.add_layer(blk_lyr)

Describe data

The describe_dataset tool provides an overview of a big data layer. By default, the tool outputs a table layer containing calculated field statistics and a dict outlining geometry and time settings for the input layer.

Optionally, the tool can output a feature layer representing a sample set of features using the sample_size parameter, or a single polygon feature layer representing the input feature layers' extent by setting the extent_output parameter to True.

In [19]:
description = describe_dataset(input_layer=crime_lyr,
                               extent_output=True,
                               sample_size=1000,
                               output_name="Description of crime data" + str(dt.now().microsecond),
                               return_tuple=True)
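The call above makes the output name unique by appending the current microsecond, which can repeat between runs. One alternative, sketched here under the assumption that any unique suffix is acceptable as an output name, combines a timestamp with a short random token. `unique_output_name` is a hypothetical helper.

```python
import uuid
from datetime import datetime

def unique_output_name(prefix):
    """Append a timestamp and a short random token to a layer name."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    token = uuid.uuid4().hex[:6]       # random part guards against same-second reruns
    return f"{prefix}_{stamp}_{token}"

name = unique_output_name("Description_of_crime_data")
print(name)
```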
In [20]:
description.output_json
Out[20]:
{'datasetName': 'crime',
 'datasetSource': 'Big Data File Share - Chicago_Crime_2001_2020',
 'recordCount': 7061128,
 'geometry': {'geometryType': 'Point',
  'sref': {'wkid': 4326},
  'countNonEmpty': 6993512,
  'countEmpty': 67616,
  'spatialExtent': {'xmin': -91.686565684,
   'ymin': 36.619446395,
   'xmax': -87.524529378,
   'ymax': 42.022910333}},
 'time': {'timeType': 'Instant',
  'countNonEmpty': 7061128,
  'countEmpty': 67616,
  'temporalExtent': {'start': '2001-01-01 00:00:00.000',
   'end': '2020-01-26 23:40:00.000'}}}
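The counts in this summary make it easy to check how much of the dataset lacks a location. The sketch below copies the geometry counts from the output above and computes the share of records with empty geometry, which comes to roughly 1%. `empty_share` is a hypothetical helper.

```python
# Values copied from the describe_dataset output_json.
record_count = 7061128
geometry_counts = {"countNonEmpty": 6993512, "countEmpty": 67616}

def empty_share(counts):
    """Return the percentage of records whose value is empty."""
    total = counts["countNonEmpty"] + counts["countEmpty"]
    return 100.0 * counts["countEmpty"] / total

# The geometry counts should add up to the record count.
counts_match = sum(geometry_counts.values()) == record_count
pct_without_geometry = empty_share(geometry_counts)
print(counts_match, f"{pct_without_geometry:.2f}% of records have no geometry")
```

Records without geometry cannot be placed on a map, so it is worth knowing this share before interpreting spatial results.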
In [21]:
sdf_desc_output = description.output.query(as_df=True)
sdf_desc_output.head()
Out[21]:
   FIELD_NAME   COUNT    COUNT_NON_EMPTY  AVG           MIN    MAX         STDDEV        RANGE       SUM           VAR           ANY                     globalid                                OBJECTID
0  ID           7061128  7061128          6.468796e+06  634.0  11969378.0  3.180550e+06  11968744.0  4.567699e+13  1.011590e+13  None                    {46B95A04-F3C3-FA20-D745-B2C7C9E7AFAF}  1
1  Case Number  7061128  7061124          NaN           NaN    NaN         NaN           NaN         NaN           NaN           JD114742                {7FCBD37F-459C-E78F-B873-CA734429AA9B}  2
2  Date         7061128  7061128          NaN           NaN    NaN         NaN           NaN         NaN           NaN           01/01/2001 12:00:00 AM  {A7E0431E-0AD4-EC59-38A9-F71177ACDF45}  3
3  Block        7061128  7061128          NaN           NaN    NaN         NaN           NaN         NaN           NaN           061XX S FAIRFIELD AVE   {FF3E7A5E-A887-D815-7812-AD995620C5A9}  4
4  IUCR         7061128  6761589          1.127044e+03  110.0  9901.0      8.126368e+02  9791.0      7.620611e+09  6.603785e+05  None                    {3A5F5858-F0FD-932D-DF6D-FF8355F9141B}  5
In [22]:
description.sample_layer
Out[22]:
<FeatureLayer url:"https://ndhagsb01.esri.com/gis/rest/services/Hosted/Description_of_crime_data956049/FeatureServer/2">
In [23]:
sdf_slyr = description.sample_layer.query(as_df=True)
sdf_slyr.head()
Out[23]:
   ID       Case_Number  Date                    Block                 IUCR    Primary_Type         Description                     Location_Description  Arrest  Domestic  ...  Y_Coordinate  Year  Updated_On              Latitude   Longitude   Location                       INSTANT_DATETIME     globalid                                OBJECTID  SHAPE
0  8196694  HT430829     08/04/2011 02:10:00 AM  079XX S MERRILL AVE   520.0   ASSAULT              AGGRAVATED:KNIFE/CUTTING INSTR  RESIDENCE             true    false     ...  1852704.0     2011  02/10/2018 03:50:01 PM  41.750809  -87.572309  (41.750808511, -87.572308641)  2011-08-04 02:10:00  {25BA0BFD-A32B-802A-72C5-D8A698A3C06F}  1         {'x': -87.572308641, 'y': 41.750808511, 'spati...
1  5139385  HM736684     11/22/2006 09:00:00 PM  019XX N MOHAWK ST     1310.0  CRIMINAL DAMAGE      TO PROPERTY                     OTHER                 false   false     ...  1913191.0     2006  02/10/2018 03:50:01 PM  41.917244  -87.642423  (41.917243909, -87.642422501)  2006-11-22 21:00:00  {A67F0D22-7EED-03EE-511A-49458AB189C7}  2         {'x': -87.642422501, 'y': 41.917243909, 'spati...
2  6257174  HP338636     05/16/2008 05:30:00 AM  108XX S LOWE AVE      915.0   MOTOR VEHICLE THEFT  TRUCK, BUS, MOTOR HOME          STREET                false   false     ...  1832936.0     2008  02/28/2018 03:56:25 PM  41.696981  -87.638886  (41.696980545, -87.638886196)  2008-05-16 05:30:00  {5FE25286-201F-EF1D-3D6F-ECF7AC8DA402}  3         {'x': -87.638886196, 'y': 41.696980545, 'spati...
3  8518985  HV195817     01/20/2012 09:00:00 AM  047XX S KNOX AVE      840.0   THEFT                FINANCIAL ID THEFT: OVER $300   RESIDENCE             false   false     ...  1872783.0     2012  02/10/2018 03:50:01 PM  41.806897  -87.739467  (41.806896849, -87.739466549)  2012-01-20 09:00:00  {F475734C-7CC7-06DC-75F3-B1D9D6D91D8E}  4         {'x': -87.739466549, 'y': 41.806896849, 'spati...
4  3930218  HL301854     04/17/2005 11:40:00 PM  039XX W ARMITAGE AVE  1220.0  DECEPTIVE PRACTICE   THEFT OF LOST/MISLAID PROP      ALLEY                 true    false     ...  1912994.0     2005  02/28/2018 03:56:25 PM  41.917175  -87.725912  (41.917175309, -87.725912468)  2005-04-17 23:40:00  {862B9571-2761-454E-56E4-F19124DCC584}  5         {'x': -87.725912468, 'y': 41.917175309, 'spati...

5 rows × 26 columns

In [24]:
m1 = gis.map('chicago')
m1
Out[24]: