Introduction
Crime analysis is an essential part of efficient law enforcement for any city. It involves:
- Collecting data in a form that can be analyzed.
- Identifying spatial/non-spatial patterns and trends in the data.
- Making informed decisions based on the analysis.
To start the analysis, the first and foremost requirement is analyzable data. A huge volume of data is present in the witness and police narratives of crime incidents. A few examples of such information are:
- Place of crime
- Nature of crime
- Date and time of crime
- Suspect
- Witness
Extracting such information from incident reports requires tedious work. Crime analysts have to sift through piles of police reports to gather and organize this information.
With recent advancements in Natural Language Processing and deep learning, it is possible to devise an automated workflow to extract information from such unstructured text documents. In this notebook, we will extract information from crime incident reports obtained from the Madison Police Department [1] using arcgis.learn's EntityRecognizer class.
Prerequisites
- Data preparation and model training workflows using arcgis.learn are based on the spaCy and Hugging Face Transformers libraries. Users can choose an appropriate backbone/library to train their model.
- Refer to the section "Install deep learning dependencies of arcgis.learn module" on this page for detailed documentation on installing the dependencies.
- Labelled data: In order for EntityRecognizer to learn, it needs to see examples that have been labelled for all the custom categories that the model is expected to extract. Labelled data for this sample notebook is located at data/EntityRecognizer/labelled_crime_reports.json; an illustrative peek at this format is shown after this list.
- To learn how to use Doccano [2] for labelling text, please see the guide on Labeling text using Doccano.
- Test documents for extracting named entities are provided in a zipped file at data/EntityRecognizer/reports.zip.
- To learn more about how EntityRecognizer works, please see the guide on Named Entity Extraction Workflow with arcgis.learn.
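For reference, a labelled record in this file should look roughly like the sketch below. This is a minimal, illustrative example that assumes a Doccano-style JSON export with one record per line, each containing the report text and [start, end, label] annotations; inspect your own file, as the exact schema may differ.
import json
# Peek at the first labelled record (illustrative sketch; assumes one JSON object
# per line with 'text' and 'labels' entries, as produced by a Doccano-style export).
with open('data/EntityRecognizer/labelled_crime_reports.json') as f:
    record = json.loads(f.readline())
print(record['text'][:100])      # beginning of the report narrative
print(record.get('labels', []))  # e.g. [[start, end, 'Address'], ...]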
Necessary Imports
import pandas as pd
import zipfile,unicodedata
from itertools import repeat
from pathlib import Path
from arcgis.gis import GIS
from arcgis.learn import prepare_textdata
from arcgis.learn.text import EntityRecognizer
from arcgis.geocoding import batch_geocode
import re
import os
import datetime
gis = GIS('home')
Data preparation
Data preparation involves splitting the data into training and validation sets, creating the necessary data structures for loading data into the model, and so on. The prepare_textdata() function can directly read training samples in one of the supported labelling formats (here, the ner_json format) and automates the entire process.
training_data = gis.content.get('b2a1f479202244e798800fe43e0c3803')
training_data
filepath = training_data.download(file_name=training_data.name)
import zipfile
with zipfile.ZipFile(filepath, 'r') as zip_ref:
zip_ref.extractall(Path(filepath).parent)
json_path = Path(os.path.join(os.path.splitext(filepath)[0], 'labelled_crime_reports.json'))
data = prepare_textdata(path=json_path, task="entity_recognition", dataset_type='ner_json', class_mapping={'address_tag':'Address'})
The show_batch() method can be used to visualize the training samples along with their labels.
data.show_batch()
text | Address | Crime | Crime_datetime | Reported_date | Reported_time | Reporting_officer | Weapon | |
---|---|---|---|---|---|---|---|---|
0 | A Madison mother had her four-year-old son wit... | [3500 block of Anderson St.] | [road rage incident] | [01/31/2019] | [9:07 AM] | [PIO Joel Despain] | [crowbar] | |
1 | A knife-wielding woman claimed the man she tri... | [Capitol Centre Market, 111 N. Broom St.] | [stab, second degree reckless endangerment, be... | [09/04/2018] | [11:08 AM] | [PIO Joel Despain] | [knife, nine-inch steak knife] | |
2 | Members of the Dane County Narcotics Task Forc... | [Badger Road area] | [peddling cocaine and heroin, possession with ... | [03/12/2019] | [12:23 PM] | [PIO Joel Despain] | [Monday morning] | |
3 | Members of the Dane County Narcotics Task Forc... | [Badger Road area] | [peddling cocaine and heroin, possession with ... | [03/12/2019] | [12:23 PM] | [PIO Joel Despain] | [Monday morning] | |
4 | Madison Police responded to three different ca... | [North side of Madison, Crestline Dr, Green Ri... | [windows were shot out] | [10/31/2016] | [11:59] | [Sgt. Paul Jacobsen] | [pellet or soft air gun] | |
5 | The MPD arrested two men last night following ... | [Alter Metal Recycling, 4400 Sycamore Ave] | [attempted burglary, attempted burglary] | [03/03/2016] | [9:02 AM] | [PIO Joel Despain] | ||
6 | A Michigan man, who attempted to swindle the E... | [E. Washington Ave. AT&T store] | [attempted to swindle] | [02/17/2016] | [12:14 PM] | [PIO Joel Despain] | ||
7 | Madison Police Officers responded to the 3500 ... | [3500 block of Ridgeway Avenue on Christmas Eve] | [entered their residence, taking all of their ... | [after 7pm] | [12/26/2015] | [9:43 AM] | [P.O. Howard Payne] |
EntityRecognizer model
The EntityRecognizer model in arcgis.learn can be used with spaCy's EntityRecognizer backbone or with Hugging Face Transformers backbones.
Run the command below to see what backbones are supported for the entity recognition task.
print(EntityRecognizer.supported_backbones)
['spacy', 'BERT', 'RoBERTa', 'DistilBERT', 'ALBERT', 'CamemBERT', 'MobileBERT', 'XLNet', 'XLM', 'XLM-RoBERTa', 'FlauBERT', 'ELECTRA', 'Longformer', 'Funnel', 'LLM']
Call the model's available_backbone_models() method with the backbone name to get the available models for that backbone. The call to available_backbone_models() lists only a few of the available models for each backbone. Visit this link to get a complete list of models for each of the transformer backbones. To learn more about choosing an appropriate transformer model for your dataset, visit this link.
Note: Only a single model is available to train the EntityRecognizer model with the spaCy backbone.
print(EntityRecognizer.available_backbone_models("spacy"))
('spacy',)
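For example, to list a few of the models available for one of the transformer backbones (illustrative; this assumes the transformer dependencies are installed):
print(EntityRecognizer.available_backbone_models("BERT"))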
First, we will create the model using the EntityRecognizer() constructor, passing it the data object.
ner = EntityRecognizer(data, backbone="spacy")
Finding optimum learning rate
The learning rate [3] is a tuning parameter that determines the step size at each iteration while moving toward a minimum of a loss function; it represents the speed at which a machine learning model "learns". arcgis.learn includes a learning rate finder, accessible through the model's lr_find() method, which can automatically select an optimum learning rate without requiring repeated experiments.
lr = ner.lr_find()
Model training
Training the model is an iterative process. We can train the model using its fit() method as long as the F1 score (maximum possible value = 1) continues to improve with each training pass, also known as an epoch. This is indicative of the model getting better at predicting the correct labels.
ner.fit(epochs=30, lr=lr)
epoch | losses | val_loss | precision_score | recall_score | f1_score | time |
---|---|---|---|---|---|---|
0 | 84.42 | 10.12 | 0.82 | 0.07 | 0.13 | 00:00:06 |
1 | 14.86 | 10.9 | 0.55 | 0.23 | 0.32 | 00:00:07 |
2 | 15.03 | 8.18 | 0.5 | 0.24 | 0.32 | 00:00:06 |
3 | 10.66 | 5.52 | 0.84 | 0.4 | 0.54 | 00:00:06 |
4 | 8.36 | 5.26 | 0.65 | 0.45 | 0.53 | 00:00:07 |
5 | 12.62 | 4.59 | 0.05 | 0.01 | 0.01 | 00:00:06 |
6 | 7.47 | 4.92 | 0.52 | 0.45 | 0.48 | 00:00:06 |
7 | 7.35 | 3.91 | 0.53 | 0.51 | 0.52 | 00:00:06 |
8 | 7.15 | 3.63 | 0.52 | 0.49 | 0.51 | 00:00:07 |
9 | 7.74 | 4.99 | 0.62 | 0.47 | 0.54 | 00:00:06 |
10 | 7.17 | 2.79 | 0.57 | 0.54 | 0.56 | 00:00:06 |
11 | 11.37 | 4.23 | 0.72 | 0.51 | 0.59 | 00:00:06 |
12 | 8.01 | 2.89 | 0.72 | 0.69 | 0.71 | 00:00:06 |
13 | 6.85 | 2.51 | 0.77 | 0.76 | 0.77 | 00:00:07 |
14 | 6.03 | 2.79 | 0.81 | 0.78 | 0.79 | 00:00:06 |
15 | 5.89 | 1.72 | 0.87 | 0.84 | 0.85 | 00:00:06 |
16 | 5.6 | 1.54 | 0.92 | 0.92 | 0.92 | 00:00:06 |
17 | 5.64 | 0.98 | 0.95 | 0.95 | 0.95 | 00:00:06 |
18 | 5.5 | 0.91 | 0.98 | 0.98 | 0.98 | 00:00:06 |
19 | 5.35 | 0.46 | 0.97 | 0.97 | 0.97 | 00:00:06 |
20 | 5.18 | 0.98 | 0.97 | 0.95 | 0.96 | 00:00:07 |
21 | 8.29 | 0.66 | 0.98 | 0.97 | 0.97 | 00:00:06 |
22 | 5.06 | 0.17 | 0.96 | 0.95 | 0.96 | 00:00:07 |
23 | 5.08 | 0.18 | 0.98 | 0.97 | 0.98 | 00:00:06 |
24 | 5.04 | 0.56 | 0.98 | 0.97 | 0.98 | 00:00:06 |
25 | 4.65 | 0.14 | 0.99 | 0.99 | 0.99 | 00:00:06 |
26 | 4.39 | 0.07 | 0.99 | 0.99 | 0.99 | 00:00:06 |
27 | 4.22 | 0.03 | 0.99 | 0.99 | 0.99 | 00:00:06 |
28 | 4.23 | 0.12 | 0.99 | 0.99 | 0.99 | 00:00:06 |
29 | 4.47 | 0.03 | 0.99 | 0.98 | 0.99 | 00:00:07 |
Evaluate model performance
Important metrics to look at while measuring the performance of the EntityRecognizer model are the precision, recall & F1 measures [4].
ner.precision_score()
0.99
ner.recall_score()
0.98
ner.f1_score()
0.99
To find the precision, recall & F1 scores per label/class, we will call the model's metrics_per_label() method.
ner.metrics_per_label()
Precision_score | Recall_score | F1_score | |
---|---|---|---|
Crime | 1.00 | 0.97 | 0.98 |
Address | 1.00 | 1.00 | 1.00 |
Crime_datetime | 1.00 | 1.00 | 1.00 |
Reported_date | 1.00 | 1.00 | 1.00 |
Reported_time | 1.00 | 1.00 | 1.00 |
Reporting_officer | 1.00 | 1.00 | 1.00 |
Weapon | 0.91 | 0.91 | 0.91 |
Validate results
Now that we have a trained model, let's look at how it performs.
ner.show_results()
TEXT | Filename | Address | Crime | Crime_datetime | Reported_date | Reported_time | Reporting_officer | Weapon | |
---|---|---|---|---|---|---|---|---|---|
0 | Madison police officers were dispatched to the... | Example_0 | East Towne Mall | overdosed on heroin,possession of heroin,proba... | 02/26/2018 | 7:40 AM | Lt. Jason Ostrenga | Syringes, a metal spoon, and other drug paraph... | |
1 | Suspect entered Azara Hookah at 429 State Stre... | Example_1 | Azara Hookah at 429 State Street | concealing various merchandise,swung the knife | 05/04/2017 | 2:18 AM | Sgt. Eugene Woehrle | knife | |
2 | Suspect entered Azara Hookah at 429 State Stre... | Example_1 | University Ave. | concealing various merchandise,swung the knife | 05/04/2017 | 2:18 AM | Sgt. Eugene Woehrle | knife | |
3 | The MPD arrested an 18-year-old man on a tenta... | Example_2 | Memorial High School | disorderly conduct | after 5:30 p.m. yesterday afternoon | 05/05/2017 | 1:55 PM | PIO Joel Despain | |
4 | The MPD arrested an 18-year-old man on a tenta... | Example_3 | Memorial High School | disorderly conduct | after 5:30 p.m. yesterday afternoon | 05/05/2017 | 1:55 PM | PIO Joel Despain | |
5 | Suspect entered Azara Hookah at 429 State Stre... | Example_4 | Azara Hookah at 429 State Street | concealing various merchandise,swung the knife | 05/04/2017 | 2:18 AM | Sgt. Eugene Woehrle | knife | |
6 | Suspect entered Azara Hookah at 429 State Stre... | Example_4 | University Ave. | concealing various merchandise,swung the knife | 05/04/2017 | 2:18 AM | Sgt. Eugene Woehrle | knife | |
7 | A Milford Rd. resident returned home after wor... | Example_5 | Milford Rd. | burglarized,Jewelry and cash were taken,break-in | 01/18/2019 | 9:54 AM | PIO Joel Despain | ||
8 | The MPD arrested an 18-year-old man on a tenta... | Example_6 | Memorial High School | disorderly conduct | after 5:30 p.m. yesterday afternoon | 05/05/2017 | 1:55 PM | PIO Joel Despain | |
9 | An Independence Lane resident said a stranger ... | Example_7 | Independence Lane | early Saturday morning | 05/09/2016 | 9:21 AM | PIO Joel Despain | handgun |
Save and load trained models
Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on new data. Saved models can also be loaded back using the load() method, which takes the path to the EMD file as a required argument.
ner.save('crime_model')
Model has been saved to ~\AppData\Local\Temp\information-extraction-from-madison-city-crime-incident-reports-using-deep-learning\models\crime_model
WindowsPath('~/AppData/Local/Temp/information-extraction-from-madison-city-crime-incident-reports-using-deep-learning/models/crime_model')
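As an illustrative sketch of loading the model back for later use (the EMD path below is hypothetical; point it to the location reported by save() above):
# Recreate the model object and load the saved weights back in.
# The path is illustrative; use the EMD file created by save() on your machine.
ner_loaded = EntityRecognizer(data, backbone="spacy")
ner_loaded.load('models/crime_model/crime_model.emd')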
Model Inference
Now we can use the trained model to extract entities from new text documents using the extract_entities() method. This method expects either the path of the folder where the new text documents are located, or a list of text documents.
reports = os.path.join(os.path.splitext(filepath)[0], 'reports')
results = ner.extract_entities(reports)  # extract_entities() also accepts a list of text documents.
results.head()
TEXT | Filename | Address | Crime | Crime_datetime | Reported_date | Reported_time | Reporting_officer | Weapon | |
---|---|---|---|---|---|---|---|---|---|
0 | Officers were dispatched to a robbery of the A... | 0.txt | Associated Bank in the 1500 block of W Broadway | robbery,demanded money | 08/09/2018 | 6:17 PM | Sgt. Jennifer Kane | No weapon was mentioned | |
1 | The MPD was called to Pink at West Towne Mall ... | 1.txt | Pink at West Towne Mall | thefts at Pink | Tuesday night | 08/18/2016 | 10:37 AM | PIO Joel Despain | |
2 | The MPD is seeking help locating a unique $1,5... | 10.txt | Union St. home | stolen,thief cut a bike lock,stolen | 08/17/2016 | 11:09 AM | PIO Joel Despain | ||
3 | A Radcliffe Drive resident said three men - at... | 100.txt | Radcliffe Drive | armed robbery | early this morning | 08/07/2018 | 11:17 AM | PIO Joel Despain | handguns |
4 | Madison Police officers were near the intersec... | 1001.txt | intersection of Francis Street | gunshot and observed a vehicle,shooting,distur... | 08/10/2018 | 4:20 AM | Lt. Daniel Nale |
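As noted above, extract_entities() also accepts a list of text documents. A minimal sketch, assuming the list form takes the document texts directly as strings:
# Illustrative only: read a handful of reports and pass their text as a list.
sample_texts = [open(os.path.join(reports, f), encoding='utf-8').read()
                for f in sorted(os.listdir(reports))[:5]]
sample_results = ner.extract_entities(sample_texts)
sample_results.head()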
Publishing the results as a feature layer
The code below geocodes the extracted address and publishes the results as a feature layer.
# This function generates x,y coordinates based on the locations extracted by the model.
def geocode_locations(processed_df, city, region, address_col):
    # Build a full address by appending the city and region
    add_miner = processed_df[address_col].apply(lambda x: x + f', {city}, {region}')
chunk_size = 200
chunks = len(processed_df[address_col])//chunk_size+1
batch = list()
for i in range(chunks):
batch.extend(batch_geocode(list(add_miner.iloc[chunk_size*i:chunk_size*(i+1)])))
batch_geo_codes = []
for i,item in enumerate(batch):
if isinstance(item,dict):
if (item['score'] > 90 and
item['address'] != f'{city}, {region}'
and item['attributes']['City'] == f'{city}'):
batch_geo_codes.append(item['location'])
else:
batch_geo_codes.append('')
else:
batch_geo_codes.append('')
processed_df['geo_codes'] = batch_geo_codes
return processed_df
# This function converts the dataframe to a spatially enabled dataframe.
def prepare_sdf(processed_df):
processed_df['geo_codes_x'] = 'x'
processed_df['geo_codes_y'] = 'y'
for i,geo_code in processed_df['geo_codes'].items():
if geo_code == '':
processed_df.drop(i, inplace=True) #dropping rows with empty location
else:
            processed_df.loc[i, 'geo_codes_x'] = geo_code.get('x')
            processed_df.loc[i, 'geo_codes_y'] = geo_code.get('y')
sdf = processed_df.reset_index(drop=True)
sdf['geo_x_y'] = sdf['geo_codes_x'].astype('str') + ',' +sdf['geo_codes_y'].astype('str')
sdf = pd.DataFrame.spatial.from_df(sdf, address_column='geo_x_y') #adding geometry to the dataframe
sdf.drop(['geo_codes_x','geo_codes_y','geo_x_y','geo_codes'], axis=1, inplace=True) #dropping redundant columns
return sdf
# This function publishes the spatially enabled dataframe as a feature layer.
def publish_to_feature(df, gis, layer_title:str, tags:str, city:str,
region:str, address_col:str):
processed_df = geocode_locations(df, city, region, address_col)
sdf = prepare_sdf(processed_df)
    try:
        layer = sdf.spatial.to_featurelayer(layer_title, gis, tags)
    except Exception:
        # Retry once in case the first publish attempt fails intermittently.
        layer = sdf.spatial.to_featurelayer(layer_title, gis, tags)
return layer
# This will take a few minutes to run
madison_crime_layer = publish_to_feature(results, gis, layer_title='Madison_Crime' + str(datetime.datetime.now().microsecond),
tags='nlp, madison, crime', city='Madison',
region='WI', address_col='Address')
madison_crime_layer
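Once published, the feature layer item can be queried back into a spatially enabled DataFrame for a quick sanity check. A minimal sketch, assuming the crime points are in the first sublayer of the item:
# Query the published layer back as a spatially enabled DataFrame.
crime_flayer = madison_crime_layer.layers[0]
crime_sdf = crime_flayer.query().sdf
crime_sdf.head()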
Visualize crime incidents on a map
result_map = gis.map('Madison, Wisconsin')
result_map.basemaps = 'topographic'
result_map
result_map.content.add(madison_crime_layer)
Create a hot spot map of crime densities
ArcGIS has a set of tools to help us identify, quantify, and visualize spatial patterns in our data by identifying areas of statistically significant clusters. The find_hot_spots tool allows us to visualize areas containing such clusters.
from arcgis.features.analyze_patterns import find_hot_spots
crime_hotspots_madison = find_hot_spots(madison_crime_layer,
context={"extent":
{"xmin":-10091700.007046243,"ymin":5225939.095608932,
"xmax":-9731528.729766665,"ymax":5422840.88047145,
"spatialReference":{"wkid":102100,"latestWkid":3857}}},
output_name="crime_hotspots_madison" + str(datetime.datetime.now().microsecond))
hotspot_map = gis.map('Madison, Wisconsin')
hotspot_map.basemaps = 'terrain'
hotspot_map
hotspot_map.content.add(crime_hotspots_madison)
hotspot_map.legend.enabled = True
Conclusion
This sample demonstrates how the EntityRecognizer class from arcgis.learn can be used to extract information from crime incident reports, which is an essential requirement for crime analysis. We then saw how this information can be geocoded and visualized on a map for further analysis.
References
[1]: Police Incident Reports (City of Madison)
[2]: Doccano: text annotation tool for humans
[3]: Learning rate