Address Standardization and Correction using SequenceToSequence model
Introduction¶
Address Standardization is the process of changing addresses to adhere to USPS standards. In this notebook, we will aim at abbreviating the addresses as per standard USPS abbreviations.
Address Correction will aim at correcting misspelled place names.
We will train a model using the SequenceToSequence
class of the arcgis.learn.text
module to translate non-standard and erroneous addresses to their standard and correct forms.
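To make the standardization target concrete before turning to the model, here is a toy rule-based sketch of USPS-style abbreviation. The word list is a small hand-picked sample, not the full USPS standard, and this is an illustration of the mapping the model must learn, not part of the notebook's workflow:

```python
# A few common USPS standard abbreviations (illustrative subset only).
USPS_ABBREVIATIONS = {
    "avenue": "ave",
    "street": "st",
    "boulevard": "blvd",
    "north": "n",
    "south": "s",
    "iowa": "ia",
}

def abbreviate(address):
    """Replace whole words with their USPS abbreviation where one is known."""
    return " ".join(USPS_ABBREVIATIONS.get(w, w) for w in address.lower().split())

print(abbreviate("940 north pennsylvania avenue"))  # -> 940 n pennsylvania ave
```

A rule table like this cannot fix misspellings such as "avneue", which is why a learned sequence-to-sequence model is used instead.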
The dataset consists of pairs of non-standard, incorrect (synthetic errors) house addresses and the corresponding correct, standard house addresses from the United States. The correct addresses are taken from OpenAddresses data.
Disclaimer: The correct addresses were synthetically corrupted to prepare the training dataset. This could have led to some unexpected corruptions in the addresses, which will affect the translation learned by the model.
A note on the dataset
- The data was collected around 2020-04-29 by OpenAddresses.
- The data licenses can be found in
data/address_standardization_correction_data/LICENSE.txt
.
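The notebook does not show how the synthetic errors were produced. As a purely hypothetical sketch of one way character-level typos can be injected (a random adjacent-character swap), unrelated to the actual corruption procedure used for this dataset:

```python
import random

def corrupt(word, seed=None):
    """Simulate a typo by swapping one random pair of adjacent characters."""
    rng = random.Random(seed)
    if len(word) < 2:
        return word  # nothing to swap
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(corrupt("avenue", seed=0))  # e.g. a transposition like "avneue"
```

Pairing each clean address with a corrupted copy like this yields exactly the kind of (noisy input, clean target) pairs a sequence-to-sequence model trains on.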
Prerequisites¶
Data preparation and model training workflows using arcgis.learn have a dependency on transformers. Refer to the section "Install deep learning dependencies of arcgis.learn module" on this page for detailed documentation on the installation of the dependencies.
Labeled data: For the
SequenceToSequence
model to learn, it needs to see documents/texts that have been assigned a label. Labeled data for this sample notebook is located at data/address_standardization_correction_data/address_standardization_correction.csv
To learn more about how
SequenceToSequence
works, please see the guide on How SequenceToSequence works.
!pip install transformers==3.3.0
Note: Please restart the kernel before running the cells below.
Imports¶
import os
import zipfile
from pathlib import Path
from arcgis.gis import GIS
from arcgis.learn import prepare_textdata
from arcgis.learn.text import SequenceToSequence
gis = GIS('home')
Data preparation¶
Data preparation involves splitting the data into training and validation sets, creating the necessary data structures for loading data into the model and so on. The prepare_textdata()
function can directly read the training samples and automate the entire process.
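One of the steps prepare_textdata() automates is the shuffled train/validation split. A minimal stand-alone sketch of just that step (using plain Python lists of example pairs, not the actual arcgis.learn internals):

```python
import random

def train_val_split(rows, val_frac=0.1, seed=42):
    """Shuffle rows and split them into training and validation lists."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    n_val = int(len(rows) * val_frac)
    return rows[n_val:], rows[:n_val]

# Hypothetical (non-std-address, std-address) pairs for illustration.
pairs = [("940 north pa. avneue", "940 n pennsylvania ave")] * 10
train, val = train_val_split(pairs, val_frac=0.2)
print(len(train), len(val))  # -> 8 2
```

prepare_textdata() additionally tokenizes the text and builds the batched data loaders the model consumes.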
training_data = gis.content.get('06200bcbf46a4f58b2036c02b0bff41e')
training_data
Note: This address dataset is a subset (~15%) of the dataset available at "ea94e88b5a56412995fd1ffcb85d60e9" item id.
filepath = training_data.download(file_name=training_data.name)
with zipfile.ZipFile(filepath, 'r') as zip_ref:
zip_ref.extractall(Path(filepath).parent)
data_root = Path(os.path.splitext(filepath)[0])
data = prepare_textdata(path=data_root, batch_size=16, task='sequence_translation',
text_columns='non-std-address', label_columns='std-address',
train_file='address_standardization_correction_data_small.csv')
The show_batch()
method can be used to see the training samples, along with labels.
data.show_batch()
SequenceToSequence model¶
The SequenceToSequence
model in arcgis.learn.text
is built on top of the Hugging Face Transformers library. The model training and inferencing workflows are similar to those of computer vision models in arcgis.learn
.
Run the command below to see what backbones are supported for the sequence translation task.
SequenceToSequence.supported_backbones
Call the model's available_backbone_models()
method with the backbone name to get the available models for that backbone. The call to available_backbone_models method will list out only a few of the available models for each backbone. Visit this link to get a complete list of models for each backbone.
SequenceToSequence.available_backbone_models("T5")
Load model architecture¶
Invoke the SequenceToSequence
class by passing the data and the backbone you have chosen. Since the dataset consists of house addresses in a non-standard format with synthetic errors, we will fine-tune a t5-base pretrained model. The model will attempt to learn how to standardize and correct the input addresses.
model = SequenceToSequence(data, backbone='t5-base')
Model training¶
The learning rate
[1] is a tuning parameter that determines the step size at each iteration while moving toward a minimum of a loss function; it represents the speed at which a machine learning model "learns". arcgis.learn
includes a learning rate finder, accessible through the model's lr_find()
method, which can automatically select an optimum learning rate without requiring repeated experiments.
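To make the "step size" role of the learning rate concrete, here is a generic gradient-descent sketch minimizing f(x) = x² (where the gradient is 2x). This is standalone illustration, not the optimizer arcgis.learn uses:

```python
def gradient_descent(lr, x=5.0, steps=50):
    """Minimize f(x) = x**2 by stepping against the gradient 2*x,
    with the learning rate lr scaling each step."""
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(gradient_descent(lr=0.1))  # small lr: converges toward the minimum at 0
print(gradient_descent(lr=1.1))  # too-large lr: overshoots and diverges
```

The learning rate finder's job is to locate a value large enough to learn quickly but small enough to avoid the divergent regime shown above.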
lr = model.lr_find()
lr
Training the model is an iterative process. We can train the model using its fit()
method as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. This indicates that the model is learning the task.
model.fit(1, lr=lr)
By default, the earlier layers of the model (i.e. the backbone) are frozen. Once the later layers have been sufficiently trained, the earlier layers are unfrozen (by calling the unfreeze()
method of the class) to further fine-tune the model.
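Conceptually, freezing and unfreezing amount to toggling whether each parameter receives gradient updates (in PyTorch this is the parameter's requires_grad flag). A framework-free sketch of the idea, with hypothetical names, not the actual unfreeze() implementation:

```python
class Param:
    """Minimal stand-in for a framework parameter."""
    def __init__(self, name):
        self.name = name
        self.requires_grad = True  # trainable by default

class TinyModel:
    """Toy model split into a 'backbone' and a task 'head'."""
    def __init__(self):
        self.backbone = [Param("embed"), Param("encoder")]
        self.head = [Param("decoder_out")]

    def freeze_backbone(self):
        for p in self.backbone:
            p.requires_grad = False  # exclude from gradient updates

    def unfreeze(self):
        for p in self.backbone:
            p.requires_grad = True   # train the whole model

m = TinyModel()
m.freeze_backbone()
print([p.name for p in m.backbone + m.head if p.requires_grad])  # -> ['decoder_out']
m.unfreeze()
```

With the backbone frozen, only the head's parameters change during fit(); after unfreeze(), every layer is fine-tuned, which is why lr_find() is run again below.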
model.unfreeze()
lr = model.lr_find()
lr
model.fit(5, lr)
model.fit(3, lr)
Validate results¶
Once we have the trained model, we can view its results to see how it performs.
model.show_results()
model.get_model_metrics()
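For sequence translation, the reported metric is a BLEU-style score (the name used when saving the model below references one). As rough intuition only, BLEU is built from clipped n-gram precisions between prediction and reference; the toy sketch below computes just the unigram term, not the full BLEU formula (which also combines higher-order n-grams and a brevity penalty):

```python
from collections import Counter

def unigram_precision(prediction, reference):
    """Clipped unigram precision: fraction of predicted tokens that match
    the reference, counting each reference token at most once."""
    pred = prediction.split()
    ref_counts = Counter(reference.split())
    pred_counts = Counter(pred)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in pred_counts.items())
    return overlap / len(pred)

p = unigram_precision("940 n pennsylvania ave mason city",
                      "940 n pennsylvania ave mason city ia")
print(p)  # -> 1.0 (every predicted token appears in the reference)
```

A score near 1.0 (often reported as a percentage) means the predicted addresses almost exactly reproduce the reference standardized addresses.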
Saving the trained model¶
Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used to run inference on unseen data.
model.save('seq2seq_unfrozen8E_bleu_98', publish=True)
Model inference¶
The trained model can be used to translate new text documents using the predict method. This method accepts a string or a list of strings and predicts the labels for these new documents/texts.
txt=['940, north pennsylvania avneue, mason icty, iowa, 50401, us',
'220, soyth rhodeisland aveune, mason city, iowa, 50401, us']
model.predict(txt, num_beams=6, max_length=50)
Conclusion¶
In this notebook, we built an address standardization and correction model using the SequenceToSequence
class of the arcgis.learn.text
module. The dataset consisted of pairs of non-standard, incorrect (synthetic errors) house addresses and the corresponding correct, standard house addresses from the United States. To achieve this, we used a t5-base pretrained transformer to build a SequenceToSequence model that standardizes and corrects the input house addresses. Below are the results on sample inputs.
Non-Standard → Standard, Error → Correction
- 940, north pennsylvania avneue, mason icty, iowa, 50401, us → 940, n pennsylvania ave, mason city, ia, 50401, us
- 220, soyth rhodeisland aveune, mason city, iowa, 50401, us → 220, s rhode island ave, mason city, ia, 50401, us