Information extraction from Cheshire fire incident reports using Mistral language model

Introduction

As text data continues to grow rapidly, extracting meaningful insights from large amounts of information is more important than ever. Large language models (LLMs) have emerged as powerful tools for processing unstructured data, significantly enhancing the accuracy and efficiency of information extraction. One of the key tasks that can be performed using large language models is entity extraction, which involves identifying and classifying entities—such as names, organizations, locations, dates, and other specific details—within a text.

In this sample, we will explore how information extraction works using the Mistral language model in the EntityRecognizer class of the arcgis.learn API with the Cheshire fire incident reports dataset. The Cheshire fire dataset includes incident reports detailing fire incidents in Cheshire, covering information like locations, times, types of incidents, and response actions. This data can be valuable for analysis in understanding patterns, improving response strategies, and enhancing safety measures.

Key entities to extract from fire incident reports include:

  • Address
  • Date and Time
  • Incident Type
  • Number of Engines
  • Time Spent at Incident

Prerequisites

  • Refer to the section "Install Deep Learning Dependencies of arcgis.learn Module" for detailed documentation on installing the dependencies: Installation Guide.

  • Follow these steps to download and install the Mistral model backbone:

    1. Download the Mistral Model Backbone.
    2. Extract the downloaded zip file.
    3. Open the Anaconda Prompt and navigate to the folder that contains arcgis_mistral_backbone-1.0.0-py_0.tar.bz2.
    4. Run the following command:
      • conda install --offline arcgis_mistral_backbone-1.0.0-py_0.tar.bz2
  • To learn more about how EntityRecognizer works, please refer to the guide on Named Entity Extraction Workflow with arcgis.learn.

Necessary Imports

import datetime
import os
import re
import unicodedata
import zipfile
from itertools import repeat
from pathlib import Path

import pandas as pd
from arcgis.gis import GIS
from arcgis.geocoding import batch_geocode
from arcgis.learn import prepare_textdata
from arcgis.learn.text import EntityRecognizer

# Connect using the active ArcGIS session
gis = GIS('home')

Data preparation

Data preparation involves splitting the data into training and validation sets, creating the data structures needed to load data into the model, and so on. The prepare_textdata() function reads the training samples directly from a supported format (ner_json in this sample) and automates the entire process.
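
As a rough illustration of the splitting step only, the sketch below shuffles a list of labelled records and holds out a fraction for validation. It is not how prepare_textdata() is implemented; the helper name split_records, the validation fraction, and the fixed seed are assumptions chosen for the example.

```python
import random


def split_records(records, val_fraction=0.1, seed=42):
    """Shuffle records and split them into (train, validation) lists.

    Hypothetical helper for illustration; not part of arcgis.learn.
    """
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one record so the validation set is never empty
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


reports = [f"report {i}" for i in range(20)]
train, val = split_records(reports, val_fraction=0.2)
print(len(train), len(val))
```

A fixed seed keeps the split reproducible, so metrics computed on the validation set can be compared between runs.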

training_data = gis.content.get('ab3af7d1b8a24c1f8cc4e5bf3465d6bf')
training_data
information_extraction_from_cheshire_fire_incident_reports_using_mistral_language_model
Image Collection by api_data_owner
Last Modified: October 16, 2024
# Download the zipped sample data and extract it next to the downloaded file
filepath = training_data.download(file_name=training_data.name)
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)
# The JSON file is expected in a folder named after the zip file (without its extension)
json_path = Path(os.path.splitext(filepath)[0]) / 'cheshire_fire_incident_reports.json'
# Build the databunch; class_mapping identifies 'Address' as the address class
data = prepare_textdata(path=json_path, task="entity_recognition", dataset_type='ner_json', class_mapping={'address_tag': 'Address'})
data.show_batch()
  | text | Address | Date_and_Time | Incident_Type | Number_of_Engines | Title | time_spent_at_incident
0 | Industrial paper shredder fire in Widnes Time ... | [Pickerings Road, Widnes] | [26/04/2018 - 21:07] | [fire] | [Two] | [Industrial paper shredder fire in Widnes] | [around half-an-hour]
1 | Fire in field in Crewe Time of Incident: 07/03... | [Vicarage Road, Crewe] | [07/03/2018 - 12:48] | [fire] | [One] | [Fire in field in Crewe] |
2 | Car fire in Chester Time of Incident: 05/02/20... | [Waverley Terrace, Chester] | [05/02/2018 - 22:26] | [car fire] | [One] | [Car fire in Chester] |
3 | Industrial paper shredder fire in Widnes Time ... | [Pickerings Road, Widnes] | [26/04/2018 - 21:07] | [fire] | [Two] | [Industrial paper shredder fire in Widnes] | [around half-an-hour]
4 | Van fire in Chester Time of Incident: 05/10/20... | [Christleton Road, Boughton, Chester] | [05/10/2018 - 10:45] | [van fire] | [One] | [Van fire in Chester] |
5 | Industrial skip fire in Crewe Time of Incident... | [Gresty Road, Crewe] | [27/11/2018 - 09:16] | [fire] | [Two] | [Industrial skip fire in Crewe] | [45 minutes]
6 | Skip fire in Warrington Time of Incident: 25/1... | [Griffiths Street, Warrington] | [25/10/2018 - 22:24] | [skip fire] | [One] | [Skip fire in Warrington] |
7 | Fire at a dry cleaner's in Warrington Time of ... | [Gaskell Street, Warrington] | [03/12/2018 - 17:14] | [fire] | [One] | [Fire at a dry cleaner's in Warrington] | [45 minutes]
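
To make the link between labelled text and the per-class columns above more concrete, here is a minimal sketch that groups character-offset annotations by label. The record layout (a "text" field plus "labels" given as [start, end, label] spans) is an assumption about what an NER JSON record might look like; the schema of the sample's JSON file is not shown here. The report text after "Time of Incident" is also invented for the example.

```python
import json
from collections import defaultdict


def span(text, phrase, label):
    """Return [start, end, label] for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return [start, start + len(phrase), label]


# Hypothetical record: field names and span layout are assumptions,
# and the wording after the date is made up for illustration.
text = ("Fire in field in Crewe Time of Incident: 07/03/2018 - 12:48 "
        "Location: Vicarage Road, Crewe")
record = {
    "text": text,
    "labels": [
        span(text, "Fire in field in Crewe", "Title"),
        span(text, "07/03/2018 - 12:48", "Date_and_Time"),
        span(text, "Vicarage Road, Crewe", "Address"),
    ],
}


def entities_by_label(record):
    """Group the annotated substrings of a record by their label."""
    grouped = defaultdict(list)
    for start, end, label in record["labels"]:
        grouped[label].append(record["text"][start:end])
    return dict(grouped)


# Round-trip through JSON to mimic reading one record from a file
parsed = entities_by_label(json.loads(json.dumps(record)))
print(parsed)
```

Each label maps to a list because a single report can mention the same kind of entity more than once.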

EntityRecognizer model

The EntityRecognizer model in arcgis.learn can be used with Hugging Face Transformers or with large language model backbones. For this sample, we will use the Mistral model backbone to extract entities from the text.

Run the command below to see what backbones are supported for the entity recognition task.

print(EntityRecognizer.available_backbone_models("llm"))
('mistral',)

First, we create the model using the EntityRecognizer() constructor, passing the following parameters:

  • data: The databunch created using the prepare_textdata() method.
  • backbone: To use Mistral as the model backbone, set backbone="mistral".
  • prompt: An optional text string describing the task and its guardrails.

ner = EntityRecognizer(data,
                       backbone="mistral",
                       prompt="Tag the input sentences in the named entity for the given classes, no other class should be tagged."
                      )

The Mistral model will automatically infer the classes from the dataset. The list of inferred class names is as follows:

  • Address
  • Date_and_Time
  • Incident_Type
  • Number_of_Engines
  • Title
  • Time_spent_at_incident

In-context learning

The Mistral model utilizes in-context learning to generate predictions. Unlike traditional models that depend on lengthy training cycles, it can understand the task using just a few examples and a prompt. By incorporating this information into the input, the Mistral model gains a better understanding and can make more accurate predictions without needing retraining.
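
The idea of combining a prompt with a few examples can be sketched as plain string assembly. The layout below is illustrative only: the template that arcgis.learn actually sends to Mistral is not shown in this sample, and build_few_shot_prompt is a hypothetical helper.

```python
def build_few_shot_prompt(instruction, examples, new_text):
    """Join the instruction, worked examples, and the new input into one prompt.

    Hypothetical layout for illustration; not the template used by arcgis.learn.
    """
    parts = [instruction.strip(), ""]
    for text, entities in examples:
        parts.append(f"Input: {text}")
        parts.append(f"Output: {entities}")
        parts.append("")
    # The new text goes last, with the output left open for the model to complete
    parts.append(f"Input: {new_text}")
    parts.append("Output:")
    return "\n".join(parts)


examples = [
    (
        "Car fire in Chester Time of Incident: 05/02/2018 - 22:26",
        {"Incident_Type": ["car fire"], "Date_and_Time": ["05/02/2018 - 22:26"]},
    ),
]
prompt = build_few_shot_prompt(
    "Tag the input sentences in the named entity for the given classes, "
    "no other class should be tagged.",
    examples,
    "Skip fire in Warrington Time of Incident: 25/10/2018 - 22:24",
)
print(prompt)
```

Because the examples travel inside the input, changing them changes the model's behavior without touching its weights.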

Evaluate model performance

The key metrics for measuring the performance of the EntityRecognizer model are precision, recall, and F1 score.

ner.precision_score()
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
0.7
ner.recall_score()
0.64
ner.f1_score()
0.66
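
For reference, the three scores follow the standard definitions based on true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses made-up counts; it does not reproduce how arcgis.learn counts matches internally.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for any score whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to exercise the formulas
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 2), round(r, 2), round(f, 2))
```

F1 is the harmonic mean of precision and recall, so it stays low whenever either of the two is low.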

To get the precision, recall, and F1 scores for each label, call the model's metrics_per_label() method.

ner.metrics_per_label()
                       | Precision_score | Recall_score | F1_score
time_spent_at_incident | 0.17            | 0.20         | 0.18
address                | 0.00            | 0.00         | 0.00
incident_type          | 1.00            | 0.67         | 0.80
o                      | 0.71            | 0.91         | 0.80
date_and_time          | 1.00            | 0.90         | 0.95
number_of_engines      | 1.00            | 0.90         | 0.95
title                  | 1.00            | 0.90         | 0.95
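
A per-label breakdown like this can be computed by comparing gold and predicted labels token by token. The sketch below assumes token-level label sequences, with "o" standing for tokens outside any entity; whether metrics_per_label() scores tokens or whole entity spans is not stated in this sample, and token_metrics_per_label is a hypothetical helper.

```python
from collections import Counter


def token_metrics_per_label(gold, predicted):
    """Return {label: (precision, recall, f1)} from two aligned label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong for this token
            fn[g] += 1  # gold label was missed for this token
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (round(precision, 2), round(recall, 2), round(f1, 2))
    return scores


# Made-up sequences: both address tokens are predicted as "o"
gold = ["title", "title", "o", "address", "address", "o"]
predicted = ["title", "title", "o", "o", "o", "o"]
label_scores = token_metrics_per_label(gold, predicted)
for label, row in label_scores.items():
    print(label, row)
```

In this toy input the address label scores zero on all three metrics because none of its tokens are ever predicted, while the missed tokens also lower the precision of "o".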

Validate results

Now that the model is ready, let's look at its predictions.

ner.show_results()
  | TEXT | address | time_spent_at_incident | incident_type | title | number_of_engines | date_and_time
0 | Fire in the open at a farm in Chester Time of... | harthill lane, chester | 1 hour and a half | fire in the open | fire in the open at a farm in chester | one | 06/08/2018-14:29
1 | Car fire in Northwich Time of Incident: 22/08... | walnut avenue, northwich | | car fire | car fire in northwich | one | 22/08/2018-20:39
2 | Grassland fire in Macclesfield Time of Incide... | hooleyhey lane, macclesfield | | grassland fire | grassland fire in macclesfield | one | 02/07/2018-13:55
3 | Tree fire, Warrington Time of Incident: 24/02... | off shackleton close, warrington | | tree fire | tree fire, warrington | one | 24/02/2018-14:38
4 | Explosion at a house in Warrington Time of In... | heath lane, warrington | 2 hours 40 minutes | explosion | explosion at a house in warrington | three | 28/09/2018-15:01
5 | Digger fire in Warrington Time of Incident: 0... | manchester road, warrington | | digger fire | digger fire in warrington | two | 07/12/2018-20:19
6 | Small fire in a garden in Warrington Time of ... | waterside, warrington | around half-an-hour | fire | small fire in a garden in warrington | one | 05/08/2018 - 18:14
7 | Fire involving rubbish in Warrington Time of ... | bridge lane, warrington | around 20 minutes | fire | fire involving rubbish in warrington | one | 18/05/2018-08:29
8 | Small fire in the open in Warrington Time of ... | parkfields lane, poulton-with-fearnhead, warri... | 20 minutes | fire | small fire in the open in warrington | one | 25/07/2018-19:01
9 | Greenhouse fire in Widnes Time of Incident: 1... | scott avenue, widnes | | greenhouse fire | greenhouse fire in widnes | one | 10/05/2018-18:52

Save and load trained models

Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on new data. Saved models can be loaded back using the load() method, which takes the path to the EMD file as a required argument.

ner.save('cheshire_fire_model')
Computing model metrics...
Model has been saved to C:\Users\sur11226\AppData\Local\Temp\information_extraction_from_cheshire_fire_incident_reports_using_mistral_language_model\models\cheshire_fire_model
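
Since load() needs the path to the EMD file rather than the folder, a small helper can locate it. The sketch below assumes the .emd file sits directly inside the saved model folder; the file name used in the demonstration is a guess, and find_emd is a hypothetical helper rather than part of arcgis.learn.

```python
import tempfile
from pathlib import Path


def find_emd(model_dir):
    """Return the single .emd file inside model_dir, or raise if that is not the case."""
    matches = sorted(Path(model_dir).glob("*.emd"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one .emd file in {model_dir}, found {len(matches)}"
        )
    return matches[0]


# Demonstration against a throwaway folder standing in for a saved model directory
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "cheshire_fire_model"
    model_dir.mkdir()
    (model_dir / "cheshire_fire_model.emd").write_text("{}")  # placeholder content
    emd_name = find_emd(model_dir).name
    print(emd_name)
```

The path returned by such a helper could then be passed to the load() method described above.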

Model Inference

Now we can use the model to extract entities from new text documents with the extract_entities() method. This method expects either the path to a folder containing the new text documents or a list of text documents.

# Folder of plain-text reports inside the extracted sample data
reports = os.path.join(os.path.splitext(filepath)[0], 'reports')
item_names = os.listdir(reports)

# Read every .txt report into a list of strings
reports_list = []
file_encoding = "utf-8"
for filename in item_names:
    file_path = os.path.join(reports, filename)
    ext = os.path.splitext(filename)[-1].lower().replace(".", "")
    if ext == "txt":
        with open(file_path, "r", encoding=file_encoding, errors="ignore") as f:
            reports_list.append(f.read())

# Run entity extraction on the list of report texts
results = ner.extract_entities(reports_list)
100.00% [25/25 04:37<00:00]
results.head()
  | TEXT | address | time_spent_at_incident | incident_type | title | number_of_engines | date_and_time | Filename
0 | Person stuck in lift in Neston Time of Inciden... | brook street, neston | | person in lift | person stuck in lift in neston | one | 30/01/2015 - 16:36 | Example_0
1 | Arson investigation after fire in Crewe Time o... | charlesworth street, crewe | | fire | arson investigation after fire in crewe | two | 28/01/2015-23:09 | Example_1
2 | Grill pan fire in Macclesfield Time of Inciden... | buxton road, macclesford | | grill pan fire | grill pan fire in macclesfield | one | 03/01/2015 - 13:07 | Example_2
3 | Collision involving car and HGV on M6 Time of ... | m6 north, j20 | | collision | collision involving car and hgv on m6 | two,one | 28/01/2015-22:49 | Example_3
4 | Road traffic collision in Macclesfield Time of... | congleton road, gawsworth, macclesfield | n/a | road traffic collision | road traffic collision in macclesfield | one,one | 28/01/2015-17:36 | Example_4
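
The extracted strings are not fully uniform: dates appear both with and without spaces around the dash, and the engine count can hold several comma-separated words. A minimal post-processing sketch, assuming the day/month/year layout seen above and a small hand-written word list, might look like this. Both helper names are hypothetical.

```python
import re
from datetime import datetime

# Hand-written mapping for illustration; extend as needed
WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}


def parse_incident_time(value):
    """Parse 'dd/mm/yyyy - hh:mm', with or without spaces around the dash.

    Returns None when the value does not match or is not a valid date.
    """
    match = re.fullmatch(r"\s*(\d{2}/\d{2}/\d{4})\s*-\s*(\d{2}:\d{2})\s*", value)
    if not match:
        return None
    try:
        return datetime.strptime(f"{match.group(1)} {match.group(2)}", "%d/%m/%Y %H:%M")
    except ValueError:
        return None


def engine_counts(value):
    """Convert a string such as 'two,one' into [2, 1]; unknown words are skipped."""
    words = (part.strip().lower() for part in value.split(","))
    return [WORD_TO_INT[word] for word in words if word in WORD_TO_INT]


print(parse_incident_time("30/01/2015 - 16:36"))
print(parse_incident_time("28/01/2015-23:09"))
print(engine_counts("two,one"))
```

The counts are returned as a list rather than summed, because a value like "two,one" may reflect separate mentions in the report and how to combine them depends on the analysis.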

Conclusion

This sample demonstrates how EntityRecognizer() from arcgis.learn can be used for information extraction from Cheshire fire incident reports using the Mistral large language model.

References

Mistral-7B HuggingFace: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
Mistral-7B MistralAI: https://mistral.ai/news/announcing-mistral-7b
