
ArcGIS API for Python


Address Standardization and Correction using SequenceToSequence model

Introduction

Address standardization is the process of changing addresses to adhere to USPS standards. In this notebook, we aim to abbreviate addresses using standard USPS abbreviations.

Address correction aims to fix misspelled place names.
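As a toy illustration of what standardization involves, a rule-based approach might map a handful of common long forms to their USPS-style abbreviations. The sketch below is illustrative only (it covers just a few mappings, not the full USPS list); the trained model learns these mappings, plus error corrections, from data rather than from fixed rules:

```python
# Illustrative only: a tiny rule-based standardizer with a few
# USPS-style abbreviations. Not the approach used by the model.
ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "drive": "dr",
    "north": "n",
    "south": "s",
    "iowa": "ia",
    "illinois": "il",
}

def naive_standardize(address: str) -> str:
    """Lower-case the address and abbreviate known words, keeping commas."""
    out = []
    for token in address.lower().split():
        word = token.rstrip(",")
        suffix = token[len(word):]  # preserve any trailing comma
        out.append(ABBREVIATIONS.get(word, word) + suffix)
    return " ".join(out)

naive_standardize("512, new haven drive, cary, illinois, 60013")
# → "512, new haven dr, cary, il, 60013"
```

A rule table like this cannot fix misspellings such as "dvenue"; that is precisely what the learned model adds.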

We will train a model using the SequenceToSequence class of the arcgis.learn.text module to translate non-standard and erroneous addresses into their standard, correct form.

The dataset consists of pairs of non-standard, incorrect (synthetically corrupted) house addresses and the corresponding correct, standard house addresses from the United States. The correct addresses are taken from OpenAddresses data.

Disclaimer: The correct addresses were synthetically corrupted to prepare the training dataset. This could have led to some unexpected corruptions in the addresses, which will affect the translation learned by the model.
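For intuition, synthetic corruption of this kind can be as simple as swapping adjacent letters in a randomly chosen word. The sketch below is a hypothetical illustration of the idea, not the procedure actually used to build this dataset:

```python
# Hypothetical illustration of synthetic corruption: swap two adjacent
# letters in one randomly chosen alphabetic word of the address.
import random

def corrupt(address: str, rng: random.Random) -> str:
    tokens = address.split()
    # candidate words: alphabetic, long enough to swap inside
    word_idx = [i for i, t in enumerate(tokens)
                if len(t.rstrip(",")) > 3 and t.rstrip(",").isalpha()]
    if not word_idx:
        return address
    i = rng.choice(word_idx)
    word = tokens[i].rstrip(",")
    suffix = tokens[i][len(word):]  # preserve trailing comma
    j = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[j], chars[j + 1] = chars[j + 1], chars[j]  # adjacent swap
    tokens[i] = "".join(chars) + suffix
    return " ".join(tokens)

rng = random.Random(0)
corrupt("940, n pennsylvania ave, mason city, ia, 50401, us", rng)
```

A swap of adjacent letters preserves the address length and character multiset, which is why typos like "icty" for "city" appear in the samples below.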

A note on the dataset

  • The data was collected around 2020-04-29 by OpenAddresses.
  • The data licenses can be found in data/address_standardization_correction_data/LICENSE.txt.

Prerequisites

  • Data preparation and model training workflows using arcgis.learn have a dependency on transformers. Refer to the section "Install deep learning dependencies of arcgis.learn module" on this page for detailed documentation on the installation of the dependencies.

  • Labeled data: For the SequenceToSequence model to learn, it needs to see documents/texts that have been assigned a label. Labeled data for this sample notebook is located at data/address_standardization_correction_data/address_standardization_correction.csv.

  • To learn more about how SequenceToSequence works, please see the guide on How SequenceToSequence works.

Imports

In [1]:
import os
import zipfile
from pathlib import Path
from arcgis.gis import GIS
from arcgis.learn import prepare_textdata
from arcgis.learn.text import SequenceToSequence
In [2]:
gis = GIS('home')

Data preparation

Data preparation involves splitting the data into training and validation sets and creating the necessary data structures for loading it into the model. The prepare_textdata() function can directly read the training samples and automate the entire process.

In [3]:
training_data = gis.content.get('06200bcbf46a4f58b2036c02b0bff41e')
training_data
Out[3]:
address_standardization_correction_data_small
Image Collection by api_data_owner
Last Modified: January 07, 2021
0 comments, 2 views

Note: This address dataset is a subset (~15%) of the dataset available at "ea94e88b5a56412995fd1ffcb85d60e9" item id.

In [4]:
filepath = training_data.download(file_name=training_data.name)
In [5]:
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)
In [6]:
data_root = Path(os.path.join(os.path.splitext(filepath)[0]))
In [7]:
data = prepare_textdata(path=data_root, batch_size=16, task='sequence_translation', 
                        text_columns='non-std-address', label_columns='std-address', 
                        train_file='address_standardization_correction_data_small.csv')

The show_batch() method can be used to see the training samples, along with labels.

In [8]:
data.show_batch()
Out[8]:
non-std-address → std-address

  • 4967, red violet dr, dubuque, ia, 52002, us → 4967, red violet dr, dubuque, ia, 52002, us
  • 211, 7th street, carmi, illinois, 62821.0, us → 211, 7th st, carmi, il, 62821.0, us
  • 916, cleary dvenue, junction city, kansas, 66441, us → 916, cleary ave, junction city, ks, 66441, us
  • 1919, freychrn drive south west, cedar rapids, iowa, 52404.0, us → 1919, gretchen dr sw, cedar rapids, ia, 52404.0, us
  • 512, new haven drive, cary, illinois, 60013.0, us → 512, new haven dr, cary, il, 60013.0, us

SequenceToSequence model

The SequenceToSequence model in arcgis.learn.text is built on top of the Hugging Face Transformers library. Its training and inferencing workflows are similar to those of the computer vision models in arcgis.learn.

Run the command below to see what backbones are supported for the sequence translation task.

In [9]:
SequenceToSequence.supported_backbones
Out[9]:
['T5', 'Bart', 'Marian']

Call the model's available_backbone_models() method with a backbone name to get the available models for that backbone. The call lists only a few of the available models for each backbone; visit this link for the complete list of models for each backbone.

In [10]:
SequenceToSequence.available_backbone_models("T5")
Out[10]:
['t5-small',
 't5-base',
 't5-large',
 't5-3b',
 't5-11b',
 'See all T5 models at https://huggingface.co/models?filter=t5 ']

Load model architecture

Invoke the SequenceToSequence class by passing the data and your chosen backbone. Since the dataset consists of house addresses in a non-standard format with synthetic errors, we will fine-tune a t5-base pretrained model. The model will attempt to learn how to standardize and correct the input addresses.

In [11]:
model = SequenceToSequence(data, backbone='t5-base')

Model training

The learning rate [1] is a tuning parameter that determines the step size at each iteration while moving toward a minimum of the loss function; it represents the speed at which a machine learning model "learns". arcgis.learn includes a learning rate finder, accessible through the model's lr_find() method, which can automatically select an optimum learning rate without requiring repeated experiments.

In [12]:
model.lr_find()
Out[12]:
0.1445439770745928

Training the model is an iterative process. We can keep training the model using its fit() method as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. This indicates that the model is learning the task.

In [13]:
model.fit(1, 0.144)
epoch train_loss valid_loss seq2seq_acc bleu time
0 1.287657 0.951018 0.863680 0.752205 11:10

By default, the earlier layers of the model (i.e. the backbone) are frozen. Once the later layers have been sufficiently trained, the earlier layers are unfrozen (by calling the unfreeze() method of the class) to further fine-tune the model.

In [14]:
model.unfreeze()
In [15]:
lr = model.lr_find()
In [16]:
model.fit(5, lr)
epoch train_loss valid_loss seq2seq_acc bleu time
0 0.331751 0.278617 0.962188 0.916663 17:45
1 0.177372 0.153773 0.982446 0.959307 17:36
2 0.143805 0.118750 0.987322 0.970336 17:37
3 0.118908 0.105951 0.989088 0.974331 17:40
4 0.124536 0.103347 0.989461 0.975195 17:41
In [17]:
model.fit(3, lr)
epoch train_loss valid_loss seq2seq_acc bleu time
0 0.116942 0.100216 0.989321 0.974961 17:49
1 0.103494 0.088271 0.990844 0.978451 17:43
2 0.091599 0.084226 0.991426 0.979786 17:40

Validate results

Once the model is trained, we can view sample results to see how it performs.

In [18]:
model.show_results()
  • text: 940, north pennsylvania avneue, mason icty, iowa, 50401, us
    target: 940, n pennsylvania ave, mason city, ia, 50401, us
    pred: 940, n pennsylvania ave, mason city, ia, 50401, us
  • text: 24640, a-b 305th srreet, nora speings, iowa, 50458, us
    target: 24640, a-b 305th st, nora springs, ia, 50458, us
    pred: 24640, a-b 305th st, nora cetings, ia, 50458, us
  • text: 2920, 1st srteet south west, mason ciry, iowa, 50401, us
    target: 2920, 1st st sw, mason city, ia, 50401, us
    pred: 2920, 1st st sw, mason city, ia, 50401, us
  • text: 210, s rhode island ave, mason ctiy, ia, 50401, us
    target: 210, s rhode island ave, mason city, ia, 50401, us
    pred: 210, s rhode island ave, mason city, ia, 50401, us
  • text: 427, n massachudetts ave, mason coty, ia, 50401, us
    target: 427, n massachusetts ave, mason city, ia, 50401, us
    pred: 427, n massachudetts ave, mason city, ia, 50401, us

Model metrics

To get a sense of how well the model is trained, we will calculate some important metrics for our SequenceToSequence model. To see the model's accuracy [2] and BLEU score [3] on the validation dataset, we will call the model's get_model_metrics() method.

In [19]:
model.get_model_metrics()
Out[19]:
{'seq2seq_acc': 0.9914, 'bleu': 0.9798}
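For intuition, both metrics compare predicted token sequences against the targets. The simplified sketch below illustrates the ideas with token-level accuracy and clipped unigram precision (the 1-gram ingredient of BLEU); the exact formulas used by arcgis.learn may differ, so treat this only as a conceptual illustration:

```python
# Simplified illustrations of the two metric families reported above.
# These are not the exact arcgis.learn implementations.
from collections import Counter

def token_accuracy(pred: str, target: str) -> float:
    """Fraction of positions where predicted and target tokens match
    (a simplified stand-in for sequence-to-sequence accuracy)."""
    p, t = pred.split(), target.split()
    return sum(a == b for a, b in zip(p, t)) / max(len(p), len(t))

def unigram_precision(pred: str, target: str) -> float:
    """Clipped unigram precision: the 1-gram ingredient of BLEU.
    Full BLEU also uses 2-4-gram precisions and a brevity penalty."""
    p, t = Counter(pred.split()), Counter(target.split())
    overlap = sum(min(count, t[word]) for word, count in p.items())
    return overlap / max(1, sum(p.values()))

token_accuracy("210, s rhode island ave", "210, s rhode island ave")
# → 1.0
```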

Saving the trained model

Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on unseen data.

In [20]:
model.save('seq2seq_unfrozen8E_bleu_98', publish=True)
Published DLPK Item Id: ed79aa1b34dd406aae4eed0123bc4608
Out[20]:
WindowsPath('models/seq2seq_unfrozen8E_bleu_98')

Model inference

The trained model can be used to translate new text documents using the predict method. This method accepts a string or a list of strings and generates the translated output for these new documents/texts.

In [21]:
txt = ['940, north pennsylvania avneue, mason icty, iowa, 50401, us',
       '220, soyth rhodeisland aveune, mason city, iowa, 50401, us']
In [22]:
model.predict(txt, num_beams=6, max_length=50)
100.00% [1/1 00:02<00:00]
Out[22]:
[('940, north pennsylvania avneue, mason icty, iowa, 50401, us',
  '940, n pennsylvania ave, mason city, ia, 50401, us'),
 ('220, soyth rhodeisland aveune, mason city, iowa, 50401, us',
  '220, s rhode island ave, mason city, ia, 50401, us')]
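Because predict() accepts a list of strings, it can also be applied in bulk. The helper below is a hypothetical sketch (the CSV path and column name are assumptions, not part of this sample) that reuses the same predict() call shown above to standardize a whole file of addresses:

```python
# Hypothetical helper: run the trained model over a CSV of addresses.
# The column name "non-std-address" and file paths are assumptions.
import csv

def standardize_file(model, in_path, out_path, column="non-std-address"):
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))
    addresses = [row[column] for row in rows]
    # predict() returns (input, translation) pairs, as shown above
    results = model.predict(addresses, num_beams=6, max_length=50)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([column, "std-address"])
        writer.writerows(results)
```

For large files, you may want to process the addresses in chunks to bound memory use.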

Conclusion

In this notebook, we built an address standardization and correction model using the SequenceToSequence class of the arcgis.learn.text module. The dataset consisted of pairs of non-standard, incorrect (synthetically corrupted) house addresses and the corresponding correct, standard house addresses from the United States. To achieve this, we fine-tuned a t5-base pretrained transformer to build a SequenceToSequence model that standardizes and corrects input house addresses. Below are the results on sample inputs.

Non-standard, erroneous → Standard, corrected

  • 940, north pennsylvania avneue, mason icty, iowa, 50401, us → 940, n pennsylvania ave, mason city, ia, 50401, us
  • 220, soyth rhodeisland aveune, mason city, iowa, 50401, us → 220, s rhode island ave, mason city, ia, 50401, us

References

