ArcGIS API for Python

Information extraction from Madison city crime incident reports using Deep Learning

Introduction

Crime analysis is an essential part of efficient law enforcement for any city. It involves:

  • Collecting data in a form that can be analyzed.
  • Identifying spatial/non-spatial patterns and trends in the data.
  • Informed decision making based on the analysis.

To start the analysis, the first requirement is analyzable data. A large volume of data is present in the witness and police narratives of crime incidents. A few examples of such information are:

  • Place of crime
  • Nature of crime
  • Date and time of crime
  • Suspect
  • Witness

Extracting such information from incident reports requires tedious work. Crime analysts have to sift through piles of police reports to gather and organize this information.

With recent advancements in Natural Language Processing and deep learning, it's possible to devise an automated workflow to extract information from such unstructured text documents. In this notebook, we will extract information from crime incident reports obtained from the Madison Police Department [1] using arcgis.learn.EntityRecognizer().

Prerequisites

  • Data preparation and model training workflows using arcgis.learn have a dependency on spaCy. Refer to the section "Install deep learning dependencies of arcgis.learn module" on this page for detailed documentation on installation of the dependencies.
  • Labelled data: In order for Entity Recognizer to learn, it needs to see examples that have been labelled for all the custom categories that the model is expected to extract. Labelled data for this sample notebook is located at data/EntityRecognizer/labelled_crime_reports.json
  • To learn how to use Doccano[2] for labelling text, please see the guide on Labeling text using Doccano
  • Test documents to extract named entities are in a zipped file at data/EntityRecognizer/reports.zip
  • To learn more on how EntityRecognizer works, please see the guide on Named Entity Extraction Workflow with arcgis.learn.
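
To make the idea of labelled data concrete, here is a minimal sketch of a single labelled record and how character offsets map to tagged spans. The field names ("text", "labels"), the [start, end, tag] layout and the sample sentence are assumptions made for illustration; check the actual layout of labelled_crime_reports.json and the Doccano guide before relying on it.

```python
# Hypothetical labelled record: the field names and the [start, end, tag]
# layout are assumptions for illustration, not the confirmed file format.
record = {
    "text": "A disturbance was reported at East Towne Mall on 05/09/2016.",
    "labels": [[30, 45, "Address"], [49, 59, "Reported_date"]],
}

def labelled_spans(rec):
    """Return (substring, tag) pairs for every labelled character range."""
    return [(rec["text"][start:end], tag) for start, end, tag in rec["labels"]]

spans = labelled_spans(record)
for text, tag in spans:
    print(f"{tag}: {text}")
```

Each tag in the labelled records corresponds to one of the custom categories the model is expected to extract.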

Imports

In [1]:
import re
import os
import pandas as pd
import zipfile,unicodedata
from itertools import repeat
from pathlib import Path
from datetime import datetime

from arcgis.gis import GIS
from arcgis.learn import prepare_data, EntityRecognizer
from arcgis.geocoding import batch_geocode
In [2]:
gis = GIS('home') 

Data preparation

Data preparation involves splitting the data into training and validation sets, creating the data structures needed to load data into the model, and so on. The prepare_data() function reads the labelled training samples directly (here a JSON file, selected with dataset_type='ner_json') and automates the entire process.
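
As a rough illustration of the splitting step that prepare_data() automates, the sketch below shuffles a list of labelled records and holds out a fraction for validation. The function name split_train_val, the 20% hold-out and the fixed seed are assumptions made for this example; they are not taken from arcgis.learn, whose internal implementation may differ.

```python
import random

def split_train_val(records, val_fraction=0.2, seed=42):
    """Shuffle records and split them into (train, validation) lists."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical stand-ins for labelled crime reports
reports = [f"report_{i}" for i in range(10)]
train, val = split_train_val(reports)
print(len(train), len(val))
```

The validation records are never used for fitting, so metrics computed on them give a fairer picture of how the model behaves on unseen reports.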

In [3]:
training_data = gis.content.get('b2a1f479202244e798800fe43e0c3803')
training_data
Out[3]:
information-extraction-from-madison-city-crime-incident-reports-using-deep-learning
Image Collection by api_data_owner
Last Modified: August 26, 2020
0 comments, 0 views
In [4]:
filepath = training_data.download(file_name=training_data.name)
In [6]:
import zipfile
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)
In [7]:
# Strip only the final extension, so dots elsewhere in the path are harmless
json_path = Path(os.path.splitext(filepath)[0]) / 'labelled_crime_reports.json'
In [8]:
data = prepare_data(path= json_path, 
                    class_mapping={'address_tag':'Address'}, 
                    dataset_type='ner_json')

The show_batch() method can be used to visualize the training samples, along with labels.

In [9]:
data.show_batch()
Out[9]:
 | Address | Crime | Crime_datetime | Reported_date | Reported_time | Reporting_officer | Weapon | text
0 | [Madison Public Library's Lakeview Branch, 284... | [kicked over a trash, kicked a male officer in... | | [03/12/2019] | [9:55 AM] | [PIO Joel Despain] | | The MPD was called to the Madison Public Libra...
1 | [Central District ] | | [afternoon of 4/22] | [04/22/2017] | [6:28 PM] | [Lt. Kelly Donahue] | | During the afternoon of 4/22, two events occur...
2 | [3500 block of Anderson St.] | [road rage incident] | | [01/31/2019] | [9:07 AM] | [PIO Joel Despain] | [crowbar] | A Madison mother had her four-year-old son wit...
3 | | | | [06/06/2019] | [11:11 AM] | [PIO Joel Despain] | | A 13-year-old girl became scared last night af...
4 | [North side of Madison, Crestline Dr, Green Ri... | [windows were shot out] | | [10/31/2016] | [11:59] | [Sgt. Paul Jacobsen] | [pellet or soft air gun] | Madison Police responded to three different ca...
5 | [University Ave. near Highland Ave.] | [running in traffic , banging on the door, scr... | [Tuesday evening] | [11/22/2017] | [1:44 PM] | [PIO Joel Despain] | | Numerous callers reported an out an out contro...
6 | [Associated Bank, 4407 Cottage] | [bank robber] | | [03/23/2017] | [9:03 AM] | [PIO Joel Despain] | | The MPD is seeking help in identifying a bank ...
7 | [East Towne Mall] | [disturbance] | | [05/09/2016] | [9:52 AM] | [PIO Joel Despain] | [handgun, BB gun, facsimile firearm] | A Madison man was arrested Saturday inside Eas...

Model training

  • First, we create a model using the EntityRecognizer() constructor, passing it the data object.
  • Training the model is an iterative process. We can keep training the model with its fit() method for as long as the f1_score (maximum possible value = 1) continues to improve with each training pass, also known as an epoch. A rising f1_score indicates that the model is getting better at predicting the correct labels.
In [10]:
ner = EntityRecognizer(data)
In [11]:
lr=ner.lr_find()
In [12]:
ner.fit(epochs=80,lr=lr)
epoch losses val_loss precision_score recall_score f1_score
0 71.95 13.82 0.0 0.0 0.0
1 16.5 18.99 0.41 0.08 0.13
2 16.2 13.65 0.86 0.39 0.54
3 15.69 15.99 0.83 0.4 0.53
4 13.28 47.22 0.5 0.29 0.36
5 24.16 13.21 0.75 0.43 0.55
6 23.16 44.52 0.32 0.09 0.14
7 26.37 17.95 0.64 0.4 0.49
8 32.86 31.02 0.18 0.03 0.05
9 31.67 10.94 0.81 0.46 0.59
10 20.85 56.99 0.11 0.03 0.05
11 70.54 25.24 0.81 0.44 0.57
12 15.99 55.86 0.21 0.26 0.23
13 32.22 20.43 0.71 0.39 0.5
14 16.75 14.46 0.63 0.46 0.53
15 13.01 18.58 0.64 0.5 0.56
16 13.33 10.53 0.62 0.53 0.57
17 10.51 13.98 0.67 0.55 0.6
18 12.03 15.74 0.65 0.57 0.61
19 13.63 7.05 0.64 0.6 0.62
20 10.59 9.13 0.61 0.49 0.54
21 20.0 14.07 0.61 0.5 0.55
22 24.6 8.87 0.7 0.56 0.62
23 11.57 7.17 0.62 0.58 0.6
24 12.56 8.11 0.6 0.56 0.58
25 17.23 4.58 0.72 0.64 0.68
26 10.71 5.91 0.7 0.65 0.67
27 11.96 4.9 0.69 0.64 0.67
28 8.72 4.37 0.69 0.66 0.68
29 12.06 3.54 0.78 0.69 0.73
30 10.24 5.06 0.75 0.69 0.72
31 8.53 3.81 0.78 0.75 0.76
32 7.61 2.57 0.82 0.78 0.8
33 8.54 2.42 0.82 0.76 0.79
34 7.84 3.28 0.8 0.81 0.8
35 8.07 1.97 0.81 0.81 0.81
36 7.0 1.18 0.82 0.82 0.82
37 7.2 1.13 0.84 0.84 0.84
38 7.07 1.2 0.83 0.84 0.84
39 6.75 0.52 0.84 0.84 0.84
40 8.75 0.81 0.86 0.85 0.86
41 7.52 0.29 0.84 0.84 0.84
42 6.53 0.19 0.88 0.88 0.88
43 7.21 0.13 0.85 0.86 0.86
44 7.37 0.23 0.86 0.87 0.87
45 7.17 0.17 0.87 0.86 0.87
46 5.68 0.01 0.87 0.89 0.88
47 6.98 0.01 0.87 0.88 0.88
48 6.04 0.02 0.88 0.9 0.89
49 6.0 0.0 0.9 0.89 0.9
50 5.93 0.0 0.89 0.89 0.89
51 5.59 0.0 0.91 0.91 0.91
52 6.34 0.0 0.89 0.91 0.9
53 6.51 0.0 0.92 0.91 0.92
54 5.67 0.0 0.91 0.92 0.91
55 6.1 0.0 0.91 0.92 0.92
56 6.63 0.0 0.92 0.92 0.92
57 5.24 0.0 0.91 0.91 0.91
58 4.91 0.0 0.93 0.93 0.93
59 5.02 0.0 0.93 0.93 0.93
60 5.1 0.0 0.92 0.92 0.92
61 5.02 0.0 0.94 0.95 0.94
62 5.15 0.0 0.93 0.94 0.93
63 5.02 0.0 0.94 0.95 0.95
64 5.33 0.06 0.95 0.95 0.95
65 4.97 0.0 0.94 0.95 0.94
66 5.61 0.03 0.95 0.95 0.95
67 5.69 0.0 0.94 0.95 0.95
68 4.61 0.0 0.95 0.96 0.96
69 4.89 0.0 0.94 0.95 0.94
70 4.39 0.0 0.95 0.96 0.95
71 4.41 0.14 0.96 0.96 0.96
72 4.9 0.0 0.95 0.96 0.96
73 3.93 0.0 0.95 0.96 0.96
74 4.93 0.0 0.95 0.96 0.96
75 5.27 0.0 0.96 0.96 0.96
76 4.17 0.0 0.95 0.96 0.96
77 4.37 0.0 0.96 0.97 0.96
78 3.73 0.0 0.96 0.97 0.96
79 3.59 0.0 0.95 0.97 0.96
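
The precision_score, recall_score and f1_score columns in the log above are related: F1 is the harmonic mean of precision and recall. The sketch below recomputes it for the first three epochs, assuming the log reports the standard F1 definition. Because the logged values are rounded to two decimals, recomputing from them can differ in the last digit for some rows (epoch 3, for example).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from epochs 0, 1 and 2 of the training log
f1_values = [round(f1_score(p, r), 2) for p, r in [(0.0, 0.0), (0.41, 0.08), (0.86, 0.39)]]
print(f1_values)  # [0.0, 0.13, 0.54]
```

A high precision with a low recall, as in the early epochs, means the model labels few entities but is mostly right about the ones it does label.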

Validate results

Now that we have a trained model, let's look at how it performs.

In [13]:
ner.show_results()
100.00% [8/8 00:00<00:00]
Out[13]:
 | TEXT | Filename | Address | Crime | Crime_datetime | Reported_date | Reported_time | Reporting_officer | Weapon
0 | MPD officers responded to a female at the West... | Example_2 | West Towne Mall | disturbing and threatening electronic messages... | | 03/25/2018 | 5:03 PM | Lt. Mindy Winter | handgun,handgun
1 | MPD officers responded to a female at the West... | Example_2 | Verona Rd and McKee/PD | disturbing and threatening electronic messages... | | 03/25/2018 | 5:03 PM | Lt. Mindy Winter | handgun,handgun
2 | On June 11, 2019 at approximately 11:09pm, Mad... | Example_3 | 122 E. Gilman Street (Lakeshore Apartments) | Armed Robbery/Substantial Battery,pistol whipped | June 11, 2019 at approximately 11:09pm | 06/12/2019 | 1:15 AM | Sgt. Nathan Becker | firearm
3 | On June 11, 2019 at approximately 11:09pm, Mad... | Example_3 | N. Butler Street. | Armed Robbery/Substantial Battery,pistol whipped | June 11, 2019 at approximately 11:09pm | 06/12/2019 | 1:15 AM | Sgt. Nathan Becker | firearm
4 | On June 11, 2019 at approximately 11:09pm, Mad... | Example_3 | N. Butler Street | Armed Robbery/Substantial Battery,pistol whipped | June 11, 2019 at approximately 11:09pm | 06/12/2019 | 1:15 AM | Sgt. Nathan Becker | firearm
5 | On June 11, 2019 at approximately 11:09pm, Mad... | Example_3 | E. Gilman and E. Gorham Street | Armed Robbery/Substantial Battery,pistol whipped | June 11, 2019 at approximately 11:09pm | 06/12/2019 | 1:15 AM | Sgt. Nathan Becker | firearm
6 | A loud argument between two groups of men esca... | Example_4 | State Street Campus Ramp, 415 N. Lake St. | loud argument,made threatening statements,sexu... | early Sunday morning | 10/03/2016 | 10:13 AM | PIO Joel Despain | knife
7 | A 20-year-old East Towne employee contacted th... | Example_5 | East Towne | battery,punched in the face | Tuesday,night prior | 09/28/2016 | 12:33 PM | PIO Joel Despain |
8 | A 20-year-old East Towne employee contacted th... | Example_5 | mall's parking lot | battery,punched in the face | Tuesday,night prior | 09/28/2016 | 12:33 PM | PIO Joel Despain |
9 | On May 2nd 2019 at approximately 11:56 p.m. Ma... | Example_6 | 5000 block of Milwaukee St. | weapons violation,Multiple gunshots were fired... | May 2nd 2019 at approximately 11:56 p.m. | 05/03/2019 | 5:04 AM | Lt. Reginald Patterson |

Save and load trained models

Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on new data. Saved models can also be loaded back using the load() method, which takes the path to the EMD file as a required argument.

In [ ]:
ner.save('crime_model')

Model Inference

Now we can use the trained model to extract entities from new text documents using the extract_entities() method. This method expects either the path of the folder where the new text documents are located or a list of text documents.

In [29]:
reports = os.path.join(os.path.splitext(filepath)[0], 'reports')
In [30]:
results = ner.extract_entities(reports) # reports is a folder path; extract_entities() also accepts a list of text documents.
100.00% [1501/1501 00:18<00:00]
In [31]:
results.head()
Out[31]:
 | TEXT | Filename | Address | Crime | Crime_datetime | Reported_date | Reported_time | Reporting_officer | Weapon
0 | Officers were dispatched to a robbery of the A... | 0.txt | Associated Bank in the 1500 block of W Broadway | robbery,demanded money | | 08/09/2018 | 6:17 PM | Sgt. Jennifer Kane | No weapon
1 | The MPD was called to Pink at West Towne Mall ... | 1.txt | Pink at West Towne Mall | thefts | Tuesday night | 08/18/2016 | 10:37 | PIO Joel Despain |
2 | The MPD is seeking help locating a unique $1,5... | 10.txt | Union St. home | stolen,thief cut a bike lock | that night | 08/17/2016 | 11:09 AM | PIO Joel Despain |
3 | A Radcliffe Drive resident said three men - at... | 100.txt | Radcliffe Drive | targeted armed robbery | early this morning | 08/07/2018 | 11:17 AM | PIO Joel Despain | handguns
4 | Madison Police officers were near the intersec... | 1001.txt | intersection of Francis Street and State Street | gunshot,shooting | | 08/10/2018 | 4:20 AM | Lt. Daniel Nale |
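
Since extract_entities() is described as accepting either a folder path or a list of text documents, reports that need filtering or cleaning first can be loaded into a list beforehand. The sketch below shows one way to do that with the standard library. The helper name read_text_documents and the assumption that the reports are UTF-8 encoded .txt files are illustrative, not part of arcgis.learn.

```python
import tempfile
from pathlib import Path

def read_text_documents(folder):
    """Return the contents of every .txt file in folder, sorted by file name."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]

# Demo with a throwaway folder so the snippet runs on its own
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "0.txt").write_text("Officers were dispatched to a robbery.", encoding="utf-8")
    Path(tmp, "1.txt").write_text("The MPD was called to West Towne Mall.", encoding="utf-8")
    documents = read_text_documents(tmp)

print(len(documents))
```

The resulting list could then be passed to the model in place of the folder path, for example ner.extract_entities(documents).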

Publishing the results as a feature layer

The code below geocodes the extracted address and publishes the results as a feature layer.

In [19]:
# This function generates x,y coordinates based on the locations extracted by the model.

def geocode_locations(processed_df, city, region, address_col):
    # Append city and region to every extracted address, e.g. "East Towne Mall, Madison, WI"
    add_miner = processed_df[address_col].apply(lambda x: x + f', {city}, {region}')
    chunk_size = 200
    # Ceiling division, so no empty trailing chunk is sent to the geocoder
    chunks = (len(add_miner) + chunk_size - 1) // chunk_size
    batch = list()
    for i in range(chunks):
        batch.extend(batch_geocode(list(add_miner.iloc[chunk_size*i:chunk_size*(i+1)])))
    batch_geo_codes = []
    for item in batch:
        # Keep only confident matches that resolve to a place inside the city,
        # not to the city itself
        if (isinstance(item, dict)
                and item['score'] > 90
                and item['address'] != f'{city}, {region}'
                and item['attributes']['City'] == f'{city}'):
            batch_geo_codes.append(item['location'])
        else:
            batch_geo_codes.append('')
    processed_df['geo_codes'] = batch_geo_codes
    return processed_df
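
The batching and filtering logic of geocode_locations() can be exercised without calling the geocoding service. In the sketch below, the helper names chunked and accept_candidate are hypothetical, and the result dictionaries are hand-made for illustration rather than real geocoder output. The number of batches is computed with a ceiling division, so an input whose length is an exact multiple of the batch size does not produce an empty trailing batch.

```python
import math

def chunked(items, chunk_size=200):
    """Yield consecutive slices of at most chunk_size items; nothing for an empty list."""
    n_chunks = math.ceil(len(items) / chunk_size)
    for i in range(n_chunks):
        yield items[i * chunk_size:(i + 1) * chunk_size]

def accept_candidate(item, city, region, min_score=90):
    """Apply the same acceptance test as geocode_locations() to one result."""
    return (
        isinstance(item, dict)
        and item["score"] > min_score
        and item["address"] != f"{city}, {region}"  # matched only the city itself
        and item["attributes"]["City"] == city
    )

addresses = [f"{n} Main St, Madison, WI" for n in range(400)]
batch_sizes = [len(c) for c in chunked(addresses)]
print(batch_sizes)  # two full batches, no empty third batch

# Hand-made result dictionaries for illustration only, not real geocoder output
good = {"score": 98.5, "address": "100 Main St, Madison, Wisconsin",
        "attributes": {"City": "Madison"}}
fallback = {"score": 100, "address": "Madison, WI", "attributes": {"City": "Madison"}}
print(accept_candidate(good, "Madison", "WI"), accept_candidate(fallback, "Madison", "WI"))
```

Rejecting results whose address equals the bare city and region avoids placing every unresolved report on the same point in the middle of the city.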
In [20]:
# This function converts the dataframe to a spatially enabled dataframe.

def prepare_sdf(processed_df):
    processed_df['geo_codes_x'] = 'x'
    processed_df['geo_codes_y'] = 'y'
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, geo_code in processed_df['geo_codes'].items():
        if geo_code == '': 
            processed_df.drop(i, inplace=True) #dropping rows with empty location
        else:
            # .loc[row, column] writes to the dataframe itself, not to a temporary copy
            processed_df.loc[i, 'geo_codes_x'] = geo_code.get('x')
            processed_df.loc[i, 'geo_codes_y'] = geo_code.get('y')
    
    sdf = processed_df.reset_index(drop=True)
    sdf['geo_x_y'] = sdf['geo_codes_x'].astype('str') + ',' +sdf['geo_codes_y'].astype('str')
    sdf = pd.DataFrame.spatial.from_df(sdf, address_column='geo_x_y') #adding geometry to the dataframe
    sdf.drop(['geo_codes_x','geo_codes_y','geo_x_y','geo_codes'], axis=1, inplace=True) #dropping redundant columns
    return sdf
In [21]:
# This function publishes the spatially enabled dataframe as a feature layer.

def publish_to_feature(df, gis, layer_title:str, tags:str, city:str, 
                       region:str, address_col:str):
    processed_df = geocode_locations(df, city, region, address_col)
    sdf = prepare_sdf(processed_df)
    try:
        layer = sdf.spatial.to_featurelayer(layer_title, gis, tags)
    except Exception:
        # Retry once if the first publishing attempt fails
        layer = sdf.spatial.to_featurelayer(layer_title, gis, tags)

    return layer
In [22]:
# This will take a few minutes to run
madison_crime_layer = publish_to_feature(results, gis, layer_title='Madison_Crime' + str(datetime.now().microsecond), 
                                         tags='nlp,madison,crime', city='Madison', 
                                         region='WI', address_col='Address')
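
The layer title above is made distinct by appending datetime.now().microsecond, a value between 0 and 999999, so two runs can by chance produce the same title. A hedged alternative, assuming a longer title is acceptable in your organization, is to append the full timestamp. The helper name unique_layer_title is hypothetical.

```python
from datetime import datetime

def unique_layer_title(prefix="Madison_Crime"):
    """Append the current date and time, down to the microsecond, to prefix."""
    return f"{prefix}_{datetime.now():%Y%m%d_%H%M%S_%f}"

title = unique_layer_title()
print(title)
```

The returned string can be passed as layer_title to publish_to_feature().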
In [23]:
madison_crime_layer
Out[23]:
Madison_Crime
Feature Layer Collection by arcgis_python
Last Modified: February 24, 2020
0 comments, 0 views

Visualize crime incidents on a map

In [22]:
result_map = gis.map('Madison, Wisconsin')
result_map.basemap = 'topographic'
In [23]:
result_map
Out[23]: