Identifying country names from incomplete house addresses¶
Introduction¶
Geocoding is the process of taking input text, such as an address or the name of a place, and returning a latitude/longitude location for that place. In this notebook, we will work with a dataset consisting of incomplete house addresses from 10 countries. We will build a classifier using the TextClassifier class of the arcgis.learn.text module to predict the country for these incomplete house addresses.
The house addresses in the dataset consist of text in multiple languages like English, Japanese, French, Spanish, etc. The dataset is a small subset of the house addresses taken from OpenAddresses data.
A note on the dataset
- The data was collected around 2020-05-27 by OpenAddresses.
- The data licenses can be found in data/country-classifier/LICENSE.txt.
Prerequisites¶
Data preparation and model training workflows using arcgis.learn have a dependency on transformers. Refer to the section "Install deep learning dependencies of arcgis.learn module" on this page for detailed documentation on the installation of the dependencies.
Labeled data: For TextClassifier to learn, it needs to see documents/texts that have been assigned a label. Labeled data for this sample notebook is located at data/country-classifier/house-addresses.csv.
To learn more about how TextClassifier works, please see the guide on Text Classification with arcgis.learn.
Imports¶
import os
import zipfile
import pandas as pd
from pathlib import Path
from arcgis.gis import GIS
from arcgis.learn import prepare_textdata
from arcgis.learn.text import TextClassifier
gis = GIS('home')
Data preparation¶
Data preparation involves splitting the data into training and validation sets, creating the necessary data structures for loading data into the model, and so on. The prepare_textdata() function can directly read the training samples and automate the entire process.
training_data = gis.content.get('ab36969cfe814c89ba3b659cf734492a')
training_data
filepath = training_data.download(file_name=training_data.name)
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)
# The extracted folder is expected to share the zip file's name, minus the extension.
DATA_ROOT = Path(filepath.split('.')[0])
data = prepare_textdata(DATA_ROOT, "classification", train_file="house-addresses.csv",
                        text_columns="Address", label_columns="Country", batch_size=64)
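To make the idea of a training/validation split concrete, here is a minimal standalone sketch. It is not how prepare_textdata() is implemented; the function name split_train_valid, the valid_fraction parameter, and the sample rows are all hypothetical and exist only for illustration.

```python
import random


def split_train_valid(rows, valid_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, valid) lists."""
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Made-up (address, country) pairs, for illustration only.
rows = [(f"address {i}", "US" if i % 2 else "JP") for i in range(10)]
train, valid = split_train_valid(rows, valid_fraction=0.2)
print(len(train), len(valid))
```

The validation rows are held out of training so that the metrics reported later reflect addresses the model has not seen.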
The show_batch() method can be used to see the training samples, along with labels.
data.show_batch(10)
TextClassifier model¶
The TextClassifier model in arcgis.learn.text is built on top of the Hugging Face Transformers library. The model training and inferencing workflows are similar to those of the computer vision models in arcgis.learn.
Run the command below to see what backbones are supported for the text classification task.
print(TextClassifier.supported_backbones)
Call the model's available_backbone_models() method with the backbone name to get the available models for that backbone. This method lists only a few of the available models for each backbone. Visit this link to get a complete list of models for each backbone.
print(TextClassifier.available_backbone_models("xlm-roberta"))
Load model architecture¶
Invoke the TextClassifier class by passing the data and the backbone you have chosen. The dataset consists of house addresses in multiple languages like Japanese, English, French, Spanish, etc., so we will use a multi-lingual transformer backbone to train our model.
model = TextClassifier(data, backbone="xlm-roberta-base")
Model training¶
The learning rate [1] is a tuning parameter that determines the step size at each iteration while moving toward a minimum of a loss function; it represents the speed at which a machine learning model "learns". arcgis.learn includes a learning rate finder, accessible through the model's lr_find() method, that can automatically select an optimum learning rate without requiring repeated experiments.
model.lr_find()
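The effect of the step size can be seen on a toy problem. The sketch below is unrelated to how lr_find() works internally; it simply runs plain gradient descent on the hypothetical loss f(w) = (w - 3)^2 with three learning rates chosen for illustration.

```python
def gradient_descent(lr, steps=20, start=0.0):
    """Minimise f(w) = (w - 3)**2 with fixed-step gradient descent."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)**2
        w -= lr * grad          # the learning rate scales each step
    return w


# The minimum is at w = 3. Compare a small, a moderate, and a too-large rate.
for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.4f}")
```

A rate that is too small leaves w far from the minimum after 20 steps, a moderate rate gets close, and a rate that is too large overshoots further on every step and diverges. This is the trade-off a learning rate finder tries to resolve.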
Training the model is an iterative process. We can keep training the model using its fit() method as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. A falling validation loss indicates that the model is learning the task.
model.fit(epochs=4, lr=0.001)
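The stopping rule described above can be sketched in a few lines. This is a standalone illustration, not part of the arcgis.learn API: the function name epochs_until_no_improvement, the patience parameter, and the loss values are all hypothetical.

```python
def epochs_until_no_improvement(valid_losses, patience=1):
    """Return the last epoch (1-based) at which validation loss improved.

    Scanning stops once `patience` consecutive epochs pass without a new
    best loss. Returns 0 for an empty history.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs in a row
    return best_epoch


# Made-up validation losses, one per epoch, for illustration only.
history = [0.90, 0.60, 0.50, 0.55, 0.60]
print(epochs_until_no_improvement(history))
```

With this made-up history the loss stops improving after the third epoch, so further epochs would likely add training time without improving the model on unseen data.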
By default, the earlier layers of the model (i.e. the backbone) are frozen. Once the later layers have been sufficiently trained, the earlier layers can be unfrozen (by calling the unfreeze() method of the class) to further fine-tune the model.
model.unfreeze()
model.fit(epochs=6)
Validate results¶
Once we have the trained model, we can inspect the results to see how it performs.
model.show_results(15)
Test the model prediction on an input text¶
text = """1016, 8A, CL RICARDO LEON - SANTA ANA (CARTAGENA), 30319"""
print(model.predict(text))
model.accuracy()
Other important metrics to look at are precision, recall, and F1 measures [3]. To find the precision, recall, and F1 scores per label/class, we will call the model's metrics_per_label() method.
model.metrics_per_label()
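For reference, the three metrics can be computed by hand from true and predicted labels. The sketch below is a standalone illustration, not the implementation behind metrics_per_label(); the function name per_label_metrics and the label lists are made up.

```python
from collections import Counter


def per_label_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for every label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1  # correct prediction for this label
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]  # times the label was predicted
        actual = tp[label] + fn[label]     # times the label was the truth
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up labels for illustration only.
y_true = ["US", "US", "JP", "JP", "FR", "FR"]
y_pred = ["US", "JP", "JP", "JP", "FR", "US"]
metrics = per_label_metrics(y_true, y_pred)
for label, scores in metrics.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

Precision answers "of the addresses predicted as this country, how many were right?", recall answers "of the addresses from this country, how many were found?", and F1 is their harmonic mean.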
Get misclassified records¶
It's always a good idea to look at the cases where your model is not performing well. This step will help us to:
- Identify if there is a problem in the dataset.
- Identify if there is a problem with text/documents belonging to a specific label/class.
- Identify if there is a class imbalance in your dataset, due to which the model saw too little labeled data for a particular class and hence was not able to learn that class properly.
To get the misclassified records we will call the model's get_misclassified_records() method.
misclassified_records = model.get_misclassified_records()
misclassified_records.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
.set_properties(**{'text-align': "left"}).hide_index()
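The checks listed above can be partly automated. The following is a minimal sketch that assumes each record is available as a (text, true_label, predicted_label) tuple; that layout, the function name summarize_errors, and the sample records are assumptions for illustration and do not describe the DataFrame returned by get_misclassified_records().

```python
from collections import Counter


def summarize_errors(records):
    """Summarise (text, true_label, predicted_label) records.

    Returns the number of records per class, the count of each
    (true, predicted) confusion, and the misclassified fraction per class.
    """
    records = list(records)  # allow generators; we iterate more than once
    class_counts = Counter(true for _, true, _ in records)
    confusions = Counter(
        (true, pred) for _, true, pred in records if true != pred
    )
    errors_per_class = Counter()
    for (true, _), n in confusions.items():
        errors_per_class[true] += n
    error_rate = {
        label: errors_per_class[label] / n for label, n in class_counts.items()
    }
    return class_counts, confusions, error_rate


# Made-up records for illustration only; not output from the model above.
records = [
    ("addr 1", "US", "US"),
    ("addr 2", "US", "US"),
    ("addr 3", "US", "US"),
    ("addr 4", "US", "CA"),
    ("addr 5", "FR", "BE"),
    ("addr 6", "FR", "FR"),
]
class_counts, confusions, error_rate = summarize_errors(records)
print(class_counts)
print(confusions)
print(error_rate)
```

A class with few records and a high error rate points to class imbalance, while a frequent (true, predicted) pair points to two labels the model confuses with each other.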
Saving the trained model¶
Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on unseen data.
model.save("country-classifier")
Model inference¶
The trained model can be used to classify new text documents using the predict method. This method accepts a string or a list of strings to predict the labels of these new documents/text.
text_list = data._train_df.sample(15).Address.values
result = model.predict(text_list)
df = pd.DataFrame(result, columns=["Address", "CountryCode", "Confidence"])
df.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
.set_properties(**{'text-align': "left"}).hide_index()
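Since each prediction row carries a confidence value, one possible follow-up is to separate confident predictions from those that may need manual review. This is a standalone sketch: the function name flag_low_confidence, the 0.9 threshold, and the sample rows are hypothetical, and the (text, label, confidence) layout is assumed from the DataFrame columns used above.

```python
def flag_low_confidence(predictions, threshold=0.9):
    """Split (text, label, confidence) rows into confident and uncertain lists."""
    confident, uncertain = [], []
    for text, label, confidence in predictions:
        row = (text, label, confidence)
        if confidence >= threshold:
            confident.append(row)
        else:
            uncertain.append(row)
    return confident, uncertain


# Made-up prediction rows for illustration only; not real model output.
predictions = [
    ("address A", "US", 0.99),
    ("address B", "JP", 0.62),
    ("address C", "FR", 0.90),
]
confident, uncertain = flag_low_confidence(predictions, threshold=0.9)
print(len(confident), "confident;", len(uncertain), "to review")
```

The right threshold depends on how costly a wrong country label is downstream, so it is best chosen by looking at validation results rather than fixed in advance.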
Conclusion¶
In this notebook, we have built a text classifier using the TextClassifier class of the arcgis.learn.text module. The dataset consisted of house addresses from 10 countries written in languages like English, Japanese, French, Spanish, etc. Because the text is multi-lingual, we used a multi-lingual transformer backbone, XLM-RoBERTa, to build a classifier that predicts the country for an input house address.
References¶