Named Entity Extraction Workflow with arcgis.learn

Introduction

Geospatial data is available not only in the form of maps and feature/imagery layers, but also in the form of unstructured text.

What is unstructured text?

Unstructured text is written content that lacks structure and cannot readily be indexed or mapped onto standard database fields. It is often user-generated information such as emails, instant messages, news articles, documents, or social media postings. These unstructured documents can contain location information, which makes them geospatial information. Mapping information from such documents could be of great value. In this guide, we will explore how to achieve this objective with arcgis.learn.

What is Named Entity Recognition?

Named Entity Recognition (NER) is a branch of information extraction. It is used to identify entities such as "Organizations", "Person", "Date", "Country", etc. that are present in the text.

Figure1: Example of named entities such as PERSON, ORG & DATE in unstructured text. Source: Explosion AI blog

Prerequisites

The data preparation and model training workflows for entity extraction using arcgis.learn are based on the spaCy and Hugging Face Transformers libraries. A user can choose an appropriate backbone to train the model.
Refer to the section Install deep learning dependencies of arcgis.learn module for a detailed explanation of the deep learning dependencies.
Labeled data: For EntityRecognizer to learn, it needs to see examples that have been labeled for all the custom categories that the model is expected to extract. Head to the Data preparation section to see the supported formats for training data.
If you wish to try this workflow, you can find a sample notebook along with the necessary labeled training and test datasets over here.

EntityRecognizer Model Basics

The EntityRecognizer model in arcgis.learn can be created with either Hugging Face Transformers or spaCy's EntityRecognizer architecture.

Transformers Overview

Transformers in NLP are architectures that aim to solve sequence-to-sequence tasks while handling long-range dependencies with ease. Transformer models give state-of-the-art results for a wide range of tasks such as text/sequence classification, named entity recognition (NER), question answering, machine translation, text summarization, and text generation.

The Hugging Face Transformers library provides transformer models like BERT, RoBERTa, XLM, DistilBert, XLNet etc., for Natural Language Understanding (NLU) with 32+ pretrained models in 100+ languages.

A transformer consists of an encoding component, a decoding component, and connections between them.

The Encoding component is a stack of encoders (the paper stacks six of them on top of each other).
The Decoding component is a stack of decoders of the same number.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:

Self-Attention Layer
- Say the following sentence is an input sentence we want to translate:
  
  The animal didn't cross the street because it was too tired
  
  What does "it" in this sentence refer to? Is it referring to the street or to the animal? It's a simple question to a human, but not as simple to an algorithm. When the model is processing the word "it", self-attention allows the model to associate "it" with "animal".
Feed Forward Layer - The outputs of the self-attention layer are fed to a feed-forward neural network.
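
The idea behind the self-attention layer can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by any transformer library: the 2-dimensional embeddings are made up, and the learned query/key/value projections and multiple attention heads of a real transformer are left out, so queries, keys, and values are all the raw token vectors.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: attention-weighted average of all token vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings: "it" is deliberately placed close to "animal".
tokens = ["animal", "street", "it"]
embeddings = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2]]
outputs, weights = self_attention(embeddings)
it_weights = dict(zip(tokens, weights[2]))  # how much "it" attends to each token
```

With these toy vectors, "it" puts more weight on "animal" than on "street", which is the kind of association described above.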

The decoder has both those layers (self-attention & feed forward layer), but between them is an attention layer (sometimes called encoder-decoder attention) that helps the decoder focus on relevant parts of the input sentence.

Figure3: Depicting different layers and their interaction in Transformer encoder & decoder components

For a more detailed explanation of the different forms of attention, visit this page. There is also a great blog post on Visualizing attention in machine translation model that can help in understanding the attention mechanism better.

How to choose an appropriate transformer backbone for your dataset?

This page mentions different transformer architectures [3]. Not every architecture can be used to train a Named Entity Recognition model. As of now, there are 12 architectures that can be used to perform the Named Entity Recognition (NER) task: BERT [4], RoBERTa, DistilBERT, ALBERT, FlauBERT, CamemBERT, XLNet, XLM, XLM-RoBERTa, ELECTRA, Longformer, and MobileBERT.

A few considerations help in picking the right transformer architecture for the problem at hand.

Some models like BERT, RoBERTa, XLNet, and XLM-RoBERTa are highly accurate but are also larger in size. Generating inference from these models is somewhat slow.
If one is willing to sacrifice a little accuracy for higher inferencing and training speed, one can go with DistilBERT.
If model size is a constraint, one can choose either ALBERT or MobileBERT. Remember that model performance will not be as good as that of models like BERT, RoBERTa, XLNet, etc.
If the dataset is in French, one can choose FlauBERT or CamemBERT, as these language models are trained on French text.
When dealing with long sentences/sequences in the training data, one can choose from XLNet, Longformer, and Bart.
Some models like XLM and XLM-RoBERTa are multilingual models, i.e., models trained on multiple languages. If your dataset consists of text in multiple languages, you can choose the models mentioned in the above link.
- The model sizes of these transformer architectures are very large (in GBs).
- They require a large amount of memory to fine-tune on a particular dataset.
- Due to the large size of these models, inferencing with a fine-tuned model will be somewhat slow on CPU.

Entity recognition with spaCy

This model works on the Embed > Encode > Attend > Predict deep learning framework [1].

Embed: This is the process of turning text or sparse vectors into dense word embeddings. These embeddings are much easier to work with than other representations and do an excellent job of capturing semantic information. This is achieved by extracting word features using feature hashing[2] followed by a Multilayer Perceptron. A video description of this workflow can be found here.
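
The feature hashing idea used in the Embed step can be illustrated with a short sketch. It is not spaCy's actual code: the feature set (lowercase form, prefix, suffix, shape), the use of MD5, and the names hash_bucket, word_features, and hashed_ids are assumptions chosen for illustration. Each word feature is hashed to a bucket index in a fixed-size table, so no vocabulary has to be stored, and each index would then select a row of an embedding table.

```python
import hashlib

def hash_bucket(feature, num_buckets):
    """Map a string feature to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def word_shape(word):
    """Coarse orthographic shape, e.g. 'Madison' -> 'Xxxxxxx'."""
    shape = []
    for char in word:
        if char.isupper():
            shape.append("X")
        elif char.islower():
            shape.append("x")
        elif char.isdigit():
            shape.append("d")
        else:
            shape.append(char)
    return "".join(shape)

def word_features(word):
    """Hypothetical surface features of a word."""
    return {
        "norm": word.lower(),
        "prefix": word[:1],
        "suffix": word[-3:],
        "shape": word_shape(word),
    }

def hashed_ids(word, num_buckets=1000):
    """One bucket index per feature, regardless of vocabulary size."""
    return {
        name: hash_bucket(f"{name}={value}", num_buckets)
        for name, value in word_features(word).items()
    }

ids = hashed_ids("Madison")
```

Because the mapping is a pure function of the feature string, unseen words still receive indices, at the cost of occasional collisions between unrelated features.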

Encode: This is the process of encoding context into a word vector. This is done using Residual Trigram CNNs.

Attend: In this model attention refers to manually extracting features from the encoded tokens. This step has a similar effect as the attention mechanism.

Predict: The final step in the model is making a prediction given the input text. Here the vector from the attention layer is passed to a Multilayer Perceptron to output the entity label ID.

Figure2: Different components of entity recognition workflow in spaCy based on Explosion AI blog on deep learning formula for NLP models

Data preparation

Entity Recognizer can consume labeled training data in three different formats (ner_json, IOB & BILUO).
Example structure for ner_json format:
- Text : "Sir Chandrashekhara Venkata Raman was born in India"
- JSON formatted training data: {"text": "Sir Chandrashekhara Venkata Raman was born in India.", "labels": [[0, 33, "Person"], [46, 51, "Country"]]}
Example structure for IOB format:
- Text: "Sir Chandrashekhara Venkata Raman was born in India."
- IOB formatted training data:
  - Row in tokens.csv: 'Sir', 'Chandrashekhara', 'Venkata', 'Raman', 'was', 'born', 'in', 'India', '.'
  - Row in tags.csv: 'B-Person', 'I-Person', 'I-Person', 'I-Person', 'O', 'O', 'O', 'B-Country', 'O'
Example structure for BILUO format:
- Text: "Sir Chandrashekhara Venkata Raman was born in India."
- BILUO formatted training data:
  - Row in tokens.csv: 'Sir', 'Chandrashekhara', 'Venkata', 'Raman', 'was', 'born', 'in', 'India', '.'
  - Row in tags.csv: 'B-Person', 'I-Person', 'I-Person', 'L-Person', 'O', 'O', 'O', 'U-Country', 'O'
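
The relationship between the three formats can be made concrete with a small converter. This is an illustrative sketch, not part of arcgis.learn: the regex tokenizer and the function names are assumptions, and real labeling tools may tokenize differently. It turns the character spans of a ner_json record into per-token IOB or BILUO tags.

```python
import re

def tokenize_with_offsets(text):
    """Split text into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def spans_to_tags(text, labels, scheme="IOB"):
    """Convert [start, end, label] character spans into per-token tags."""
    tokens = tokenize_with_offsets(text)
    tags = ["O"] * len(tokens)
    for start, end, label in labels:
        # Indices of the tokens that lie entirely inside this labeled span.
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s >= start and e <= end]
        if not inside:
            continue
        for position, index in enumerate(inside):
            tags[index] = ("B-" if position == 0 else "I-") + label
        if scheme == "BILUO":
            if len(inside) == 1:
                tags[inside[0]] = "U-" + label   # single-token entity
            else:
                tags[inside[-1]] = "L-" + label  # last token of the entity
    return [token for token, _, _ in tokens], tags

text = "Sir Chandrashekhara Venkata Raman was born in India."
labels = [[0, 33, "Person"], [46, 51, "Country"]]
tokens, iob_tags = spans_to_tags(text, labels, scheme="IOB")
_, biluo_tags = spans_to_tags(text, labels, scheme="BILUO")
```

For the example sentence, the resulting tokens and tags match the tokens.csv and tags.csv rows shown above for both schemes.
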
There are labeling tools available to get raw documents into the required format. Below are references to a few labeling tools:
- Doccano [5] - To learn how to set up Doccano and label your own data, please refer to the doccano setup guide
- TagEditor [6]

Data preparation involves splitting the data into training and validation sets, creating the necessary data structures for loading data into the model, and so on. The prepare_textdata() function can directly read the training samples in one of the above specified formats and automate the entire process. When calling this function, the user has to provide the following arguments:

path - Path to the data directory for data in IOB or BILUO format, or the path to the JSON file if the labeled training data is in ner_json format.
dataset_type - Input format of your labeled training data (one of IOB, BILUO, ner_json).
class_mapping - A dictionary; the entity named under the 'address_tag' key will be treated as a location (addresses or geographical locations).
encoding - The encoding used to read the csv/json file (default is 'UTF-8').

import os
import spacy
import random
from arcgis.learn import prepare_textdata
from arcgis.learn.text import EntityRecognizer

json_path = os.path.join('data', 'EntityRecognizer', 'labelled_crime_reports.json')

data = prepare_textdata(path=json_path, task="entity_recognition", dataset_type='ner_json', class_mapping={'address_tag': 'Address'})

The show_batch() method can be used to visualize the training samples, along with labels.

data.show_batch()

	text	Address	Crime	Crime_datetime	Reported_date	Reported_time	Reporting_officer	Weapon
0	A McDonald's employee suffered a knee injury a...	[Odana Rd. restaurant]	[strong-armed robbery, grabbed money]	[Monday night]	[01/19/2016]	[9:44 AM]	[PIO Joel Despain]
1	A 13-year-old boy, who pointed a handgun at a ...	[1500 block of Troy]	[disorderly conduct while armed]	[last night]	[10/12/2016]	[10:11 AM]	[PIO Joel Despain]	[handgun, pellet gun, BB or pellet gun]
2	One man has been arrested and another is being...	[intersection of E. Washington Ave. and N. Sto...	[shooting, firing the gun]	[Sunday evening]	[01/04/2016]	[10:45 AM]		[BB gun, BB gun, BB gun]
3	Several deli employees and a diner - who happe...	[Stalzy's Deli, 2701 Atwood Ave.]	[burglary, stole money, stolen money]		[09/24/2018]	[9:59 AM]	[PIO Joel Despain]
4	A Madison man was arrested Saturday inside Eas...	[East Towne Mall]	[disturbance]		[05/09/2016]	[9:52 AM]	[PIO Joel Despain]	[handgun, BB gun]
5	A knife-wielding man, who threatened a couple ...	[State St., downtown]	[racial slurs and vulgarities, stab, yelling a...	[Sunday afternoon]	[11/12/2018]	[10:02 AM]	[PIO Joel Despain]	[knife, knife]
6	A MPD officer activated his squad car lights h...	[E. Gorham St.]	[crash, intoxicated, drunken driving, driving ...		[12/21/2018]	[11:29 AM]	[PIO Joel Despain]
7	A suspected drug dealer attempted to destroy 5...	[Monday, Sherman Ave. apartment,]	[destroy 50 grams of fentanyl laced heroin]		[12/04/2018]	[11:45 AM]	[PIO Joel Despain]	[handgun]

EntityRecognizer model

The EntityRecognizer model in arcgis.learn can be used with spaCy's EntityRecognizer backbone or with Hugging Face Transformers backbones. The model training and inferencing workflow is similar to that of computer vision models in arcgis.learn.

Run the command below to see what backbones are supported for the entity recognition task.

print(EntityRecognizer.supported_backbones)

['spacy', 'BERT', 'RoBERTa', 'DistilBERT', 'ALBERT', 'CamemBERT', 'MobileBERT', 'XLNet', 'XLM', 'XLM-RoBERTa', 'FlauBERT', 'ELECTRA', 'Longformer']

Apart from the 'spacy' backbone listed above, all the others are transformer backbones. The Hugging Face Transformers library provides a wide variety of models for each of the backbones listed above. To see the full list, visit this link.

A call to the available_backbone_models() method lists only a few of the available models for each backbone.
This list is not exhaustive and contains only a subset of the models listed in the link above. The method is intended to give the user a general idea of the models available for a given backbone.
That being said, the EntityRecognizer class supports any model from the 12 available transformer backbones and the spaCy backbone.
Some of the transformer models are quite large due to the high number of trainable parameters or the large number of intermediate layers. Thus, large models have large CPU/GPU memory requirements.

print(EntityRecognizer.available_backbone_models("roberta"))

('roberta-base', 'roberta-large', 'distilroberta-base')

print(EntityRecognizer.available_backbone_models("spacy"))

('spacy',)

Construct the EntityRecognizer object by passing the data and the backbone you have chosen.

ner = EntityRecognizer(data, backbone='spacy')

Model training

Finding optimum learning rate

In machine learning, the learning rate is a tuning parameter that determines the step size at each iteration while moving towards a minimum of a loss function. It also represents the speed at which a machine learning model "learns".

If the learning rate is low, then model training will take a lot of time because steps towards the minimum of the loss function are tiny.
If the learning rate is high, then training may not converge or even diverge. Weight changes can be so big that the optimizer overshoots the minimum and makes the loss worse.
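
Both failure modes can be seen on a toy problem. The sketch below is only an illustration and has nothing to do with how lr_find() works internally: it runs plain gradient descent on the one-parameter loss w**2, and the specific learning rates are picked by hand to show each regime.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimize loss(w) = w**2, whose gradient is 2*w, from a fixed start."""
    w = start
    for _ in range(steps):
        w -= lr * 2 * w  # step against the gradient, scaled by the learning rate
    return w

slow = gradient_descent(lr=0.01)     # tiny steps: still far from the minimum at 0
good = gradient_descent(lr=0.3)      # reaches the minimum quickly
diverged = gradient_descent(lr=1.1)  # every step overshoots further than the last
```

After the same 20 steps, the low rate has barely moved from the starting point, the moderate rate is essentially at the minimum, and the high rate ends up farther from the minimum than where it started.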

To find the optimum learning rate for our model, we will call the lr_find() method of the model.

Note

A user is not required to call the lr_find() method separately. If the lr argument is not provided when calling the fit() method, then fit() calls lr_find() internally to find the optimal learning rate.

ner.lr_find()

0.0005943497684706441

Training the model is an iterative process. We can keep training the model with its fit() method as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. This is indicative of the model learning the task.

ner.fit(epochs=10, lr=0.0005)

epoch	losses	val_loss	precision_score	recall_score	f1_score	time
0	19.49	14.82	0.85	0.55	0.67	00:00:04
1	15.38	9.99	0.71	0.58	0.64	00:00:04
2	11.79	8.83	0.69	0.61	0.65	00:00:04
3	11.49	6.92	0.78	0.67	0.72	00:00:04
4	11.43	10.13	0.79	0.7	0.74	00:00:04
5	12.88	9.14	0.82	0.78	0.8	00:00:04
6	14.48	11.59	0.75	0.7	0.72	00:00:04
7	19.36	11.22	0.8	0.7	0.75	00:00:04
8	15.22	4.95	0.86	0.82	0.84	00:00:04
9	9.7	4.19	0.89	0.84	0.87	00:00:04

Evaluate model performance

Important metrics to look at while measuring the performance of the EntityRecognizer model are Precision, Recall & F-measures [7].

Here is a brief description of them:

Precision - Precision describes how precise/accurate the model is: out of the entities predicted as positive, how many are actually positive.
Recall - Recall is the ability of the classifier to find all the positive samples.
F1 - F1 is the harmonic mean of precision and recall. To learn more about these metrics one can visit the following link - Precision, Recall & F1 score.
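
The three metrics can be computed by hand as follows. This is a minimal sketch that assumes exact matching of (text, label) pairs. The gold and predicted entities are made up for illustration, and the scoring inside arcgis.learn may differ in detail.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(true_entities, predicted_entities):
    """Score predictions by exact match of (text, label) pairs."""
    true_set, pred_set = set(true_entities), set(predicted_entities)
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    return precision, recall, f1_from(precision, recall)

# Hypothetical gold and predicted entities for a single report.
gold = [("East Towne Mall", "Address"), ("disturbance", "Crime"),
        ("handgun", "Weapon"), ("BB gun", "Weapon")]
pred = [("East Towne Mall", "Address"), ("disturbance", "Crime"),
        ("handgun", "Weapon"), ("Saturday", "Crime_datetime")]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

Here three of the four predictions are correct and three of the four gold entities are found, so precision, recall, and F1 all come out to 0.75.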

To find precision, recall & f1 scores per label/class we will call the model's metrics_per_label() method.

ner.metrics_per_label()

	Precision_score	Recall_score	F1_score
Reported_date	1.00	1.00	1.00
Reported_time	1.00	1.00	1.00
Crime_datetime	0.75	0.86	0.80
Address	0.88	0.88	0.88
Crime	0.76	0.62	0.68
Reporting_officer	1.00	1.00	1.00
Weapon	0.75	0.67	0.71

Validate results

Once we have the trained model, we can visualize the results to see how it performs.

ner.show_results()

100.00% [8/8 00:00<00:00]

	TEXT	Filename	Address	Crime	Crime_datetime	Reported_date	Reported_time	Reporting_officer	Weapon
0	Madison Police responded at 22:10 to the 500 b...	Example_0	500 block of South Park Street	rob her	22:10	12/26/2017	5:39 AM	Sgt. Paul Jacobsen
1	Responding officers recovered a shell casing f...	Example_1	Citgo gas station, 1423 Northport Dr.	gunfire,shot was fired	Sunday night	09/26/2016	9:43 AM	PIO Joel Despain
2	An Independence Lane resident said a stranger ...	Example_2	Independence Lane		early Saturday morning	05/09/2016	9:21 AM	PIO Joel Despain	handgun
3	Victim reporting that he was pistol whipped in...	Example_3	3400 block of N Sherman Ave			09/18/2017	9:30 PM	Sgt. Rosemarie Mansavage
4	The MPD arrested an 18-year-old man on a tenta...	Example_4	Memorial High School	disorderly conduct after	after 5:30 p.m.	05/05/2017	1:55 PM	PIO Joel Despain
5	A father and his two young children narrowly e...	Example_5	Ondossagon Way home	drunken driver reversed,second degree reckless...	Sunday morning	10/22/2018	9:36 AM	PIO Joel Despain
6	Officers responded to an alarm at Dick's Sport...	Example_6	Dick's Sporting Goods, 237 West Towne Mall			04/27/2017	3:37 AM	Lt. Timothy Radke	13 airsoft
7	A father and his two young children narrowly e...	Example_7	Ondossagon Way home	drunken driver reversed,second degree reckless...	Sunday morning	10/22/2018	9:36 AM	PIO Joel Despain

Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD file) that can be used for inferencing on unseen data. Saved models can also be loaded back using the from_model() method. The from_model() method takes the path to the emd file as a required argument.

ner.save('crime')

model_path = os.path.join('data', 'EntityRecognizer', 'models', 'crime', 'crime.emd')

ner = EntityRecognizer.from_model(model_path)

<spacy.lang.en.English object at 0x000002D58039CCC0>

Model inference

The trained model can be used to extract entities from new text documents using the extract_entities() function. This method accepts the path of the folder where new text documents are located, or a list of text documents from which the entities are to be extracted.

reports_path = os.path.join("data", "EntityRecognizer", "reports")

results = ner.extract_entities(reports_path)

100.00% [1501/1501 00:12<00:00]

results.head()

	TEXT	Filename	Address	Crime	Crime_datetime	Reported_date	Reported_time	Reporting_officer	Weapon
0	Officers were dispatched to a robbery of the A...	0.txt	Associated Bank in the 1500 block of W Broadway.	robbery,demanded money from		08/09/2018	6:17 PM	Sgt. Jennifer Kane
1	The MPD was called to Pink at West Towne Mall ...	1.txt	Pink at West Towne Mall	thefts at	Tuesday night	08/18/2016	Pink stores,10:37 AM	PIO Joel Despain
2	The MPD is seeking help locating a unique $1,5...	10.txt	Union St.	stolen from outside		08/17/2016	11:09 AM	PIO Joel Despain
3	A Radcliffe Drive resident said three men - at...	100.txt	Radcliffe Drive	handguns -entered her apartment		08/07/2018	11:17 AM	PIO Joel Despain
4	Madison Police officers were near the intersec...	1001.txt	intersection of Francis Street and State Street	shooting there immediately		08/10/2018	4:20 AM	Lt. Daniel Nale	gunshot

Visualize entities

We can use spaCy's named entity visualizer to check the model's predictions on new text, one document at a time.

def color_gen():  # Generates and returns a random color as a hex string such as '#0a1b2c'.
    random_number = random.randint(0, 16777215)  # 16777215 == 256**3 - 1, the largest 24-bit RGB value
    hex_number = format(random_number, '06x')  # zero-pad to six digits so the color is always valid
    hex_number = '#' + hex_number
    return hex_number

colors = {ent.upper():color_gen() for ent in ner.entities}
options = {"ents":[ent.upper() for ent in ner.entities], "colors":colors}

txt = 'Multiple officers were called to an apartment building on N. Wickham Court Saturday night following reports of a large disturbance taking place inside. Officers learned there were ongoing tensions between residents of two apartments, and that some of this was the result of a gunshot the night prior. The weapons offense had not been reported to police, but officers now learned a round was fired in a common stairwell and the bullet entered an apartment, going through a bathroom before entering a bedroom wall. No one was hurt and investigators are attempting to sort out whether someone intentionally fired a gun, or if damage was the result of an accident or careless handling of a firearm. Released 12/26/2017 at 10:50 AM by PIO Joel Despain '

model_folder = os.path.join('data', 'EntityRecognizer', 'models', 'crime')

nlp = spacy.load(model_folder) #path to the model folder

doc = nlp(txt)

spacy.displacy.render(doc,jupyter=True, style='ent', options=options)

Multiple officers were called to an apartment building on N. Wickham Court Saturday Address night following reports of a large disturbance taking place inside. Officers learned there were ongoing tensions between residents of two apartments, and that some of this was the result of a gunshot Crime the night prior. The weapons offense had not been reported to police, but officers now learned a round was fired in a common stairwell and the bullet entered an apartment, going through a bathroom before entering a bedroom wall. No one was hurt and investigators are attempting to sort out whether someone intentionally fired a gun Weapon , or if damage was the result of an accident or careless handling of a firearm Weapon . Released 12/26/2017 Reported_date at 10:50 AM Reported_time by PIO Joel Despain Reporting_officer

References

[1]: Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models

[2]: Feature hashing

[3]: Summary of the models (https://huggingface.co/transformers/summary.html)

[4]: BERT Paper (https://arxiv.org/pdf/1810.04805.pdf)

[5]: Doccano

[6]: TagEditor

[7]: Precision, recall and F-measures