
ArcGIS API for Python

Named Entity Extraction Workflow with arcgis.learn

Introduction

Geospatial data is available not only as maps and feature/imagery layers, but also in the form of unstructured text.

What is unstructured text?

Unstructured text is written content that lacks a predefined structure and cannot readily be indexed or mapped onto standard database fields. It is often user-generated information such as emails, instant messages, news articles, documents, or social media postings. These unstructured documents can contain location information, which makes them geospatial information. Mapping information from such documents could be of great value. In this guide, we will explore how to achieve this objective with arcgis.learn.

What is Named Entity Recognition?

Named Entity Recognition (NER) is a branch of information extraction. It is used to identify entities such as "Organization", "Person", "Date", "Country", etc. that are present in the text.

Example of named entities such as PERSON, ORG & DATE in unstructured text. Source: Explosion AI blog

Prerequisites

  • Data preparation and model training workflows using arcgis.learn have a dependency on spaCy. The dependencies can be installed using conda as follows: conda install -c esri arcgis fastai pillow scikit-image
  • Labeled data: For Entity Recognizer to learn, it needs to see examples that have been labeled for all the custom categories that the model is expected to extract. Head to the next section to see the supported formats for training data.

  • If you wish to try this workflow, you can find a sample notebook along with the necessary labeled training and test datasets over here.

Supported formats for labeled training data

  • Entity Recognizer can consume labeled training data in three different formats (IOB, BILUO, ner_json).
  • Example structure for JSON:

    • Text: "Sir Chandrashekhara Venkata Raman was born in India."
    • ner_json formatted training data: {"text": "Sir Chandrashekhara Venkata Raman was born in India.", "labels": [[0, 33, "Person"], [46, 51, "Country"]]}
  • Example structure for IOB:

    • Text: "Sir Chandrashekhara Venkata Raman was born in India."
    • IOB formatted training data:
      • text.csv: 'Sir', 'Chandrashekhara', 'Venkata', 'Raman', 'was', 'born', 'in', 'India.'
      • labels.csv: 'B-Person', 'I-Person', 'I-Person', 'I-Person', 'O', 'O', 'O', 'B-Country'
  • Example structure for BILUO:

    • Text: "Sir Chandrashekhara Venkata Raman was born in India."
    • BILUO formatted training data:
      • text.csv: 'Sir', 'Chandrashekhara', 'Venkata', 'Raman', 'was', 'born', 'in', 'India.'
      • labels.csv: 'B-Person', 'I-Person', 'I-Person', 'L-Person', 'O', 'O', 'O', 'U-Country'
  • There are labeling tools available to get raw documents into the required format. Below are references to a few labeling tools:
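The three formats above encode the same information. As a rough illustration (not part of arcgis.learn), the sketch below converts the character-offset labels of the ner_json example into IOB or BILUO tags. The helper name spans_to_tags and the simple whitespace tokenization are assumptions made for this example; real labeling tools may tokenize differently.

```python
def spans_to_tags(text, labels, scheme="IOB"):
    """Convert [start, end, label] character spans into per-token tags.

    Hypothetical helper for illustration only. Tokens are produced by
    splitting on whitespace, so punctuation stays attached to words.
    """
    # Record each token together with its character offsets in the text.
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        pos = start + len(tok)
        tokens.append(tok)
        offsets.append((start, pos))

    tags = ["O"] * len(tokens)
    for span_start, span_end, label in labels:
        # Indices of the tokens that overlap the labeled span.
        idx = [i for i, (s, e) in enumerate(offsets)
               if s < span_end and e > span_start]
        for n, i in enumerate(idx):
            if scheme == "BILUO" and len(idx) == 1:
                prefix = "U"          # single-token entity
            elif n == 0:
                prefix = "B"          # first token of the entity
            elif scheme == "BILUO" and n == len(idx) - 1:
                prefix = "L"          # last token of the entity
            else:
                prefix = "I"          # token inside the entity
            tags[i] = f"{prefix}-{label}"
    return tokens, tags


record = {
    "text": "Sir Chandrashekhara Venkata Raman was born in India.",
    "labels": [[0, 33, "Person"], [46, 51, "Country"]],
}
tokens, iob_tags = spans_to_tags(record["text"], record["labels"], "IOB")
_, biluo_tags = spans_to_tags(record["text"], record["labels"], "BILUO")
print(tokens)
print(iob_tags)    # ['B-Person', 'I-Person', 'I-Person', 'I-Person', 'O', 'O', 'O', 'B-Country']
print(biluo_tags)  # ['B-Person', 'I-Person', 'I-Person', 'L-Person', 'O', 'O', 'O', 'U-Country']
```

The only difference between the two schemes is that BILUO marks the last token of a multi-token entity with L and a single-token entity with U.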

Imports

In [1]:
from arcgis.learn import prepare_data
from arcgis.learn import EntityRecognizer
import random 
import spacy
import os

Data preparation

Data preparation involves splitting the data into training and validation sets, creating the necessary data structures for loading data into the model, and so on. The prepare_data() function can directly read training samples in any of the formats described above and automates the entire process. Entities that are addresses or geographical locations should be specified using the class_mapping dict, as shown below:

In [2]:
json_path = os.path.join('data', 'EntityRecognizer',
                         'labelled_crime_reports.json')
In [3]:
data = prepare_data(path=json_path, 
                    class_mapping={'address_tag':'Address'}, 
                    dataset_type='ner_json')
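prepare_data() performs the training/validation split internally, and its exact procedure is not described in this guide. Purely as an illustration of what such a split involves, the sketch below shuffles a list of labeled records and holds out a fraction for validation. The names split_records, val_fraction, and seed are hypothetical, and the records are placeholders rather than real crime reports.

```python
import random


def split_records(records, val_fraction=0.1, seed=42):
    """Shuffle labeled records and split them into (train, validation) lists.

    Hypothetical helper; this is not how arcgis.learn is known to do it.
    """
    shuffled = list(records)                 # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder records in the ner_json layout (text plus a list of labels).
records = [{"text": f"report {i}", "labels": []} for i in range(20)]
train, val = split_records(records, val_fraction=0.2)
print(len(train), len(val))  # 16 4
```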

The show_batch() method can be used to visualize the training samples, along with labels.

In [4]:
data.show_batch()
Out[4]:
Address Crime Crime_datetime Reported_datetime Reporting_officer Weapon text
0 [Misty Mountain Games, 4672 Cottage Grove Rd.] [stealing] [12/05/2016 at 10:22 AM] [PIO Joel Despain] [hammer] The MPD caught a suspect in the act of stealin...
1 [Villager Shopping Mall, 2222 S. Park St.] [firing a gun] [07/12/2017 at 2:21 PM] [PIO Joel Despain] [BB or pellet gun, pellet gun] Late this morning, the MPD was called to the p...
2 [Misty Mountain Games, 4672 Cottage Grove Rd.] [stealing] [12/05/2016 at 10:22 AM] [PIO Joel Despain] [hammer] The MPD caught a suspect in the act of stealin...
3 [E. Washington Ave., near Zeier Rd.] [struck by a car] [Thursday afternoon] [01/29/2016 at 9:26 AM] [PIO Joel Despain] A 13-year-boy suffered a broken leg and other ...
4 [stolen car, stolen gun, home burglaries, stol... [500 block of Grand Canyon Blvd.] [06/21/2019 at 10:47 AM] [PIO Joel Despain] Two area teens were arrested last night in a s...

EntityRecognizer model

The EntityRecognizer model in arcgis.learn is built on top of spaCy's EntityRecognizer. The model training and inferencing workflow is similar to that of the computer vision models in arcgis.learn.

This model works on the Embed > Encode > Attend > Predict deep learning framework [4].

  • Embed: This is the process of turning text or sparse vectors into dense word embeddings. These embeddings are much easier to work with than other representations and do an excellent job of capturing semantic information. This is achieved by extracting word features using feature hashing[1] followed by a Multilayer Perceptron. A video description of this workflow can be found here.

  • Encode: This is the process of encoding context into a word vector. This is done using residual trigram CNNs.

  • Attend: This step reduces the matrix of context-aware word vectors produced by the Encode step into a single vector, which is then passed on to the prediction layer.

  • Predict: The final step in the model is making a prediction given the input text. Here the vector from the attention layer is passed to a Multilayer Perceptron to output the entity label ID.

[Source : Explosion AI blog on deep learning formula for NLP models]
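To make the idea of feature hashing in the Embed step more concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: the table size, the use of MD5, and the function names (make_table, hashed_row, embed) are all assumptions chosen for illustration. The point it shows is that any word, including one never seen before, is mapped by a hash function to rows of a fixed-size table, so no vocabulary has to be stored.

```python
import hashlib
import random


def make_table(n_rows, dim, seed=0):
    """Create a fixed-size table of small random vectors (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_rows)]


def hashed_row(feature, n_rows, salt):
    """Map a string feature to a row index using a salted hash."""
    digest = hashlib.md5(f"{salt}:{feature}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_rows


def embed(word, table, n_hashes=2):
    """Sum the rows selected by several salted hashes of the lowercased word."""
    n_rows, dim = len(table), len(table[0])
    vector = [0.0] * dim
    for salt in range(n_hashes):
        row = table[hashed_row(word.lower(), n_rows, salt)]
        vector = [v + r for v, r in zip(vector, row)]
    return vector


table = make_table(n_rows=1000, dim=8)
vec_known = embed("Madison", table)
vec_unseen = embed("never-seen-before-token", table)
print(len(vec_known), len(vec_unseen))  # 8 8
```

Using more than one hash per word reduces the chance that two different words end up with exactly the same vector, at the cost of a few extra lookups.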

Model training

  • First, we create the model using the arcgis.learn.EntityRecognizer() constructor, passing it the data object.
  • Training the model is an iterative process. We can keep training the model using its fit() method for as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. A decreasing validation loss indicates that the model is learning the task.
In [5]:
ner = EntityRecognizer(data)
In [6]:
ner.fit(10)
Epoch Train_loss Val_loss
0 27.42 21.31
1 32.43 34.31
2 16.96 28.75
3 26.24 62.87
4 24.9 41.33
5 18.56 21.33
6 19.57 7.66
7 15.85 0.03
8 10.09 0.53
9 10.87 1.32
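The validation loss in the run above is noisy, so it can help to look at the whole history rather than only the last epoch. The sketch below uses the Val_loss column printed above; the helper names best_epoch and still_improving are hypothetical and are not part of arcgis.learn.

```python
# Val_loss values copied from the training output above (epochs 0 to 9).
val_losses = [21.31, 34.31, 28.75, 62.87, 41.33, 21.33, 7.66, 0.03, 0.53, 1.32]


def best_epoch(losses):
    """Return (epoch index, loss) for the lowest validation loss."""
    epoch = min(range(len(losses)), key=losses.__getitem__)
    return epoch, losses[epoch]


def still_improving(losses, window=3):
    """True if the lowest loss was reached within the last `window` epochs."""
    epoch, _ = best_epoch(losses)
    return epoch >= len(losses) - window


print(best_epoch(val_losses))                 # (7, 0.03)
print(still_improving(val_losses, window=3))  # True
```

If the best loss lies within the most recent epochs, further calls to fit() may still be worthwhile; if it lies far in the past, more training is less likely to help.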

Validate results

Once we have the trained model, we can visualize the results to see how it performs.

In [7]:
ner.show_results()
Out[7]:
TEXT Reported_datetime Reporting_officer Crime Crime_datetime Address Weapon
0 On June 11, 2019 at approximately 11:09pm, Mad... 06/12/2019 at 1:15 AM Sgt. Nathan Becker June 11, 2019 at approximately 11:09pm, 122 E. Gilman Street
1 On June 11, 2019 at approximately 11:09pm, Mad... 06/12/2019 at 1:15 AM Sgt. Nathan Becker June 11, 2019 at approximately 11:09pm, Armed Robbery/Substantial Battery
2 On June 11, 2019 at approximately 11:09pm, Mad... 06/12/2019 at 1:15 AM Sgt. Nathan Becker June 11, 2019 at approximately 11:09pm, N. Butler Street
3 On June 11, 2019 at approximately 11:09pm, Mad... 06/12/2019 at 1:15 AM Sgt. Nathan Becker June 11, 2019 at approximately 11:09pm, N. Butler Street
4 On June 11, 2019 at approximately 11:09pm, Mad... 06/12/2019 at 1:15 AM Sgt. Nathan Becker June 11, 2019 at approximately 11:09pm, E. Gilman and
5 On June 11, 2019 at approximately 11:09pm, Mad... 06/12/2019 at 1:15 AM Sgt. Nathan Becker June 11, 2019 at approximately 11:09pm, E. Gorham Street
6 The MPD's North District Community Police Team... 08/28/2017 at 2:33 PM PIO Joel Despain armed robbery Friday afternoon,August 9th Northport Dr.
7 The MPD's North District Community Police Team... 08/28/2017 at 2:33 PM PIO Joel Despain armed robbery Friday afternoon,August 9th E. Washington Ave.
8 In recent days, Central District officers have... 07/24/2018 at 11:58 AM PIO Joel Despain 400 block of W. Mifflin St.
9 In recent days, Central District officers have... 07/24/2018 at 11:58 AM PIO Joel Despain 600 block of W. Main St.
10 In recent days, Central District officers have... 07/24/2018 at 11:58 AM PIO Joel Despain N. Butler St.
11 The MPD's North District Community Police Team... 08/28/2017 at 2:33 PM PIO Joel Despain armed robbery Friday afternoon,August 9th Northport Dr.
12 The MPD's North District Community Police Team... 08/28/2017 at 2:33 PM PIO Joel Despain armed robbery Friday afternoon,August 9th E. Washington Ave.
13 The MPD responded to the 5200 block of Camden ... 11/14/2017 at 9:37 AM PIO Joel Despain shots,fired,battered,punched and kicked last night 5200 block of Camden Rd.

Save and load trained models

Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on unseen data. Saved models can also be loaded back using the load() method, which takes the path to the EMD file as a required argument.

In [8]:
ner.save('crime_10.emd')
Model has been saved to data\EntityRecognizer\models\crime_10.emd
In [9]:
model_path = os.path.join('data', 'EntityRecognizer', 'models', 'crime_10', 'crime_10.emd')
model_path
Out[9]:
'data\\EntityRecognizer\\models\\crime_10\\crime_10.emd'
In [10]:
ner.load(model_path)
<spacy.lang.en.English object at 0x000001CF20811160>

Model inference

The trained model can be used to extract entities from new text documents using the extract_entities() method. This method accepts either the path of a folder containing the new text documents or a list of text documents from which the entities are to be extracted.

In [11]:
reports_path = os.path.join("data", "EntityRecognizer", "reports")
reports_path
Out[11]:
'data\\EntityRecognizer\\reports'
In [12]:
results = ner.extract_entities(reports_path)
In [13]:
results.head()
Out[13]:
TEXT Weapon Reported_datetime Crime Address Crime_datetime Reporting_officer
0 Officers were dispatched to a robbery of the ... 08/09/2018 at 6:17 PM demanded money Associated Bank in the 1500 block of W Broadway Sgt. Jennifer Kane
1 The MPD was called to Pink at West Towne Mall ... 08/18/2016 at 10:37 AM thefts Pink at West Towne Mall Tuesday night PIO Joel Despain
2 The MPD is seeking help locating a unique $1,... bike 08/17/2016 at 11:09 AM stolen,stolen Union St. home Monday.,that night PIO Joel Despain
3 Madison Police officers were near the intersec... gunshot,disturbance 08/10/2018 at 4:20 AM Lake Street ramp Lt. Daniel Nale
4 Madison Police officers were near the intersec... gunshot,disturbance 08/10/2018 at 4:20 AM Frances St. Lt. Daniel Nale
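Because extract_entities() also accepts a list of documents, reports can be read and filtered before inference. The sketch below is one possible way to build such a list; the helper read_reports, the .txt extension, and the min_chars filter are assumptions for illustration, and the call to extract_entities() is left as a comment because it requires the trained model.

```python
import os


def read_reports(folder, extension=".txt", min_chars=1):
    """Read text files from a folder into a list of strings.

    Hypothetical helper: skips files with a different extension and
    documents shorter than min_chars after stripping whitespace.
    """
    documents = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(extension):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as handle:
            text = handle.read().strip()
        if len(text) >= min_chars:
            documents.append(text)
    return documents


# Usage with the trained model from this guide (not run here):
# documents = read_reports(reports_path)
# results = ner.extract_entities(documents)
```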

Visualize entities

We can use spaCy's named entity visualizer to check the model's predictions on new text, one document at a time.

In [14]:
def color_gen():
    """Generate and return a random color as a '#rrggbb' hex string."""
    random_number = random.randint(0, 16777215)  # 16777215 == 0xFFFFFF (256**3 - 1), the largest 24-bit RGB value
    hex_number = format(random_number, '06x')  # zero-pad to six hex digits so the color is always valid
    return '#' + hex_number
In [15]:
colors = {ent.upper():color_gen() for ent in ner.entities}
options = {"ents":[ent.upper() for ent in ner.entities], "colors":colors}
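Random colors can occasionally come out very similar to each other. As an optional alternative, the sketch below spaces the hues evenly so that each entity type gets a visibly different color. The helper distinct_colors and its saturation and value defaults are assumptions for this example; the label list mirrors the entity columns shown earlier in this guide.

```python
import colorsys


def distinct_colors(labels, saturation=0.45, value=0.95):
    """Return {LABEL: '#rrggbb'} with hues spread evenly around the color wheel."""
    colors = {}
    for i, label in enumerate(labels):
        hue = i / max(len(labels), 1)
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        colors[label.upper()] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colors


entity_labels = ["Address", "Crime", "Crime_datetime",
                 "Reported_datetime", "Reporting_officer", "Weapon"]
colors = distinct_colors(entity_labels)
options = {"ents": [label.upper() for label in entity_labels], "colors": colors}
print(options["colors"])
```

In the notebook, ner.entities could be passed in place of the hard-coded entity_labels list.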
In [16]:
txt = 'Multiple officers were called to an apartment building on N. Wickham Court Saturday night following reports of a large disturbance taking place inside. Officers learned there were ongoing tensions between residents of two apartments, and that some of this was the result of a gunshot the night prior. The weapons offense had not been reported to police, but officers now learned a round was fired in a common stairwell and the bullet entered an apartment, going through a bathroom before entering a bedroom wall. No one was hurt and investigators are attempting to sort out whether someone intentionally fired a gun, or if damage was the result of an accident or careless handling of a firearm. Released 12/26/2017 at 10:50 AM by PIO Joel Despain '
In [17]:
model_folder = os.path.join('data', 'EntityRecognizer', 'models', 'crime_10')
In [18]:
nlp = spacy.load(model_folder) #path to the model folder
In [19]:
doc = nlp(txt)
In [20]:
spacy.displacy.render(doc,jupyter=True, style='ent',options=options)
Multiple officers were called to an apartment building on N. Wickham Court [Address] Saturday night [Crime_datetime] following reports of a large disturbance taking place inside. Officers learned there were ongoing tensions between residents of two apartments, and that some of this was the result of a gunshot the night prior. The weapons offense had not been reported to police, but officers now learned a round was fired in a common stairwell and the bullet entered an apartment, going through a bathroom before entering a bedroom wall. No one was hurt and investigators are attempting to sort out whether someone intentionally fired a gun [Weapon], or if damage was the result of an accident or careless handling of a firearm. Released 12/26/2017 at 10:50 AM [Reported_datetime] by PIO Joel Despain [Reporting_officer]

References

