Identifying country names from incomplete house addresses

Introduction

Geocoding is the process of taking input text, such as an address or the name of a place, and returning a latitude/longitude location for that place. In this notebook, we will use a dataset of incomplete house addresses from 10 countries. We will build a classifier using the TextClassifier class of the arcgis.learn.text module to predict the country for each of these incomplete house addresses.

The house addresses in the dataset consist of text in multiple languages, such as English, Japanese, French, and Spanish. The dataset is a small subset of the house addresses taken from the OpenAddresses data.

A note on the dataset

  • The data was collected around 2020-05-27 by OpenAddresses.
  • The data licenses can be found in data/country-classifier/LICENSE.txt.

Prerequisites

  • Data preparation and model training workflows using arcgis.learn have a dependency on transformers. Refer to the section "Install deep learning dependencies of arcgis.learn module" on this page for detailed documentation on the installation of the dependencies.

  • Labeled data: For TextClassifier to learn, it needs to see documents/texts that have been assigned a label. Labeled data for this sample notebook is located at data/country-classifier/house-addresses.csv.

  • To learn more about how TextClassifier works, please see the guide on Text Classification with arcgis.learn.

Imports

import os
import zipfile
import pandas as pd
from pathlib import Path
from arcgis.gis import GIS
from arcgis.learn import prepare_textdata
from arcgis.learn.text import TextClassifier
gis = GIS('home')

Data preparation

Data preparation involves splitting the data into training and validation sets, creating the necessary data structures for loading data into the model, and so on. The prepare_textdata() function can directly read the training samples and automate the entire process.

training_data = gis.content.get('ab36969cfe814c89ba3b659cf734492a')
training_data
country_classifier
Training data for TextClassifier class of arcgis.learn.text module
Image Collection by api_data_owner
Last Modified: December 01, 2020
0 comments, 0 views
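# Download the zipped training data and extract it next to the archive.
filepath = training_data.download(file_name=training_data.name)
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)
# The extracted folder is assumed to share the archive's name, minus the extension.
DATA_ROOT = Path(os.path.splitext(filepath)[0])
# Read the CSV, using "Address" as the text column and "Country" as the label column.
data = prepare_textdata(DATA_ROOT, "classification", train_file="house-addresses.csv",
                        text_columns="Address", label_columns="Country", batch_size=64)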
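
As a rough illustration of the splitting step described above, the sketch below shuffles a handful of labeled rows and holds out a fraction of them for validation. It does not show how prepare_textdata() is implemented; the helper name split_train_valid, the 20% validation fraction, and the fixed seed are all assumptions made for this example.

```python
import random

def split_train_valid(rows, valid_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, valid) lists.

    Hypothetical helper for illustration; valid_fraction and seed are assumed values.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

# A few (Address, Country) pairs taken from this notebook's sample data.
rows = [
    ("S/N, LG CASARES, 32170", "ES"),
    ("133, Warande, 201, 9660", "BE"),
    ("4000, 13 Avenue SE, 133, MEDICINE HAT", "CA"),
    ("12, Avenue de la République, Beauvais, 60000", "FR"),
    ("1487-6, 有馬町", "JP"),
]
train_rows, valid_rows = split_train_valid(rows)
print(len(train_rows), "training rows,", len(valid_rows), "validation row(s)")
```

The model learns only from the training rows, while the held-out rows are used to compute validation loss and accuracy.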

The show_batch() method can be used to see the training samples, along with labels.

data.show_batch(10)
| Address | Country |
|---|---|
| S/N, LG CASARES, 32170 | ES |
| SN, CALLE E. NABARRETE, PLAN DE AYALA (CAMPO CINCO), Ahome, Sinaloa | MX |
| 152, RUA SANTA RITA DURAO, Belo Horizonte, MG, 30140-110 | BR |
| 133, Warande, 201, 9660 | BE |
| 4000, 13 Avenue SE, 133, MEDICINE HAT | CA |
| 12, Avenue de la République, Beauvais, 60000 | FR |
| 1487-6, 有馬町 | JP |
| 4, Rue d'Houat, Saint-Gilles, 35590 | FR |
| 32, Hartjie My Liefie Avenue, Bloemfontein, Mangaung | ZA |
| Street, Centurion, City of Tshwane | ZA |

TextClassifier model

The TextClassifier model in arcgis.learn.text is built on top of the Hugging Face Transformers library. The model training and inferencing workflows are similar to those of the computer vision models in arcgis.learn.

Run the command below to see what backbones are supported for the text classification task.

print(TextClassifier.supported_backbones)
['BERT', 'RoBERTa', 'DistilBERT', 'ALBERT', 'FlauBERT', 'CamemBERT', 'XLNet', 'XLM', 'XLM-RoBERTa', 'Bart', 'ELECTRA', 'Longformer', 'MobileBERT']

Call the model's available_backbone_models() method with the backbone name to get the available models for that backbone. This method lists only a few of the available models for each backbone. Visit this link to get a complete list of models for each backbone.

print(TextClassifier.available_backbone_models("xlm-roberta"))
('xlm-roberta-base', 'xlm-roberta-large')

Load model architecture

Invoke the TextClassifier class by passing the data and the backbone you have chosen. The dataset consists of house addresses in multiple languages, such as Japanese, English, French, and Spanish, so we will use a multi-lingual transformer backbone to train our model.

model = TextClassifier(data, backbone="xlm-roberta-base")

Model training

The learning rate [1] is a tuning parameter that determines the step size at each iteration while moving toward a minimum of a loss function. It represents the speed at which a machine learning model "learns". arcgis.learn includes a learning rate finder, accessible through the model's lr_find() method, that can automatically select a suitable learning rate without requiring repeated experiments.

model.lr_find()
<Figure size 432x288 with 1 Axes>
0.001202264434617413
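
To make the role of the step size concrete, here is a minimal sketch of plain gradient descent on the toy function f(x) = x**2. It is not the model's loss function, and it does not show how lr_find() works internally; the function name, the starting point, and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x       # derivative of x**2
        x = x - lr * grad  # the learning rate scales each step
    return x

# Too small a rate converges slowly; too large a rate overshoots and diverges.
for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: x after 20 steps = {gradient_descent(lr):.4f}")
```

On this toy problem, the smallest rate is still far from the minimum at x = 0 after 20 steps, the middle rate gets close to it, and the largest rate moves away from it.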

Training the model is an iterative process. We can keep training the model with its fit() method as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. A falling validation loss indicates that the model is learning the task.

model.fit(epochs=6, lr=0.001)
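| epoch | train_loss | valid_loss | accuracy | error_rate | time |
|---|---|---|---|---|---|
| 0 | 0.308638 | 0.182150 | 0.929600 | 0.070400 | 05:28 |
| 1 | 0.103615 | 0.068711 | 0.970600 | 0.029400 | 05:46 |
| 2 | 0.076326 | 0.041269 | 0.981600 | 0.018400 | 05:30 |
| 3 | 0.055707 | 0.034307 | 0.986300 | 0.013700 | 05:33 |
| 4 | 0.041812 | 0.032772 | 0.986400 | 0.013600 | 05:27 |
| 5 | 0.049993 | 0.032165 | 0.986600 | 0.013400 | 05:26 |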
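
One simple way to decide whether further epochs are worthwhile is to check how much the validation loss dropped in the most recent epoch. The sketch below applies that check to the valid_loss values from the training log above. The helper still_improving and the min_delta threshold of 0.001 are hypothetical choices for this example, not part of arcgis.learn.

```python
def still_improving(valid_losses, min_delta=0.001):
    """Return True if the latest epoch lowered validation loss by more than min_delta."""
    if len(valid_losses) < 2:
        return True  # nothing to compare against yet
    return (valid_losses[-2] - valid_losses[-1]) > min_delta

# valid_loss column from the training log above.
valid_losses = [0.182150, 0.068711, 0.041269, 0.034307, 0.032772, 0.032165]
for n_epochs in range(1, len(valid_losses) + 1):
    print("after epoch", n_epochs - 1, "->", still_improving(valid_losses[:n_epochs]))
```

With this assumed threshold, the check stays true through epoch 4 and turns false at epoch 5, where the loss drops by only about 0.0006.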

Validate results

Once we have the trained model, we can inspect its predictions on the validation set to see how it performs.

model.show_results(15)
| text | target | prediction |
|---|---|---|
| SN, AVENIDA JOSE MARIA MORELOS Y PAVON OTE., APATZINGÁN DE LA CONSTITUCIÓN, Apatzingán, Michoacán de Ocampo | MX | MX |
| 906, AVENIDA JOSEFA ORTÍZ DE DOMÍNGUEZ, CIUDAD MENDOZA, Camerino Z. Mendoza, Veracruz de Ignacio de la Llave | MX | MX |
| 32, CIRCUITO JOSÉ MARÍA URIARTE, FRACCIONAMIENTO RANCHO ALEGRE, Tlajomulco de Zúñiga, Jalisco | MX | MX |
| SN, ESTRADA SP 250 SENTIDO GRAMADAO, LADO DIREITO FAZENDA SAO RAFAEL CASA 4, São Miguel Arcanjo, SP, 18230-000 | BR | BR |
| SN, CALLE JOSEFA ORTÍZ DE DOMÍNGUEZ, RINCÓN DE BUENA VISTA, Omealca, Veracruz de Ignacio de la Llave | MX | MX |
| SN, CALLE MICHOACAN, DOLORES HIDALGO CUNA DE LA INDEPENDENCIA NACIONAL, Dolores Hidalgo Cuna de la Independencia Nacional, Guanajuato | MX | MX |
| SN, CALLE VERDUZCO, COALCOMÁN DE VÁZQUEZ PALLARES, Coalcomán de Vázquez Pallares, Michoacán de Ocampo | MX | MX |
| 1712, CALLE MÁRTIRES DEL 7 DE ENERO, CIUDAD MENDOZA, Camerino Z. Mendoza, Veracruz de Ignacio de la Llave | MX | MX |
| SN, AVENIDA JACOBO GÁLVEZ, FRACCIONAMIENTO RANCHO ALEGRE, Tlajomulco de Zúñiga, Jalisco | MX | MX |
| SN, ANDADOR MZNA 6 AMP. LOS ROBLES, EL PUEBLITO (CRUCERO NACIONAL), Córdoba, Veracruz de Ignacio de la Llave | MX | MX |
| SN, CALLE SÉPTIMA PONIENTE SUR (EJE VIAL), COMITÁN DE DOMÍNGUEZ, Comitán de Domínguez, Chiapas | MX | MX |
| 18, CALLE FELIPE GORRITI / FELIPE GORRITI KALEA, Pamplona / Iruña, Pamplona / Iruña, Navarra, 31004 | ES | ES |
| SN, RUA X VINTE E SEIS, QUADRA 14 LOTE 35 SALA 3, Aparecida de Goiânia, GO, 74922-680 | BR | BR |
| SN, CALLE NINGUNO, HEROICA CIUDAD DE JUCHITÁN DE ZARAGOZA, Heroica Ciudad de Juchitán de Zaragoza, Oaxaca | MX | MX |
| 1169, RUA DOUTOR ALBUQUERQUE LINS, BLOCO B ANDAR 11 APARTAMENTO 112B, São Paulo, SP, 01203-001 | BR | BR |

Test the model prediction on an input text

text = """1016, 8A, CL RICARDO LEON - SANTA ANA (CARTAGENA), 30319"""
print(model.predict(text))
('1016, 8A, CL RICARDO LEON - SANTA ANA (CARTAGENA), 30319', 'ES', 1.0)

Model metrics

To get a sense of how well the model has been trained, we will calculate some important metrics for our text-classifier model. First, to find how accurate [2] the model is in predicting the classes in the dataset, we will call the model's accuracy() method.

model.accuracy()
0.9866
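
Accuracy is the fraction of records whose predicted label matches the target label, and the error rate reported during training is its complement. The sketch below illustrates that relationship. The accuracy helper and the toy labels are assumptions for this example and are not taken from the validation set.

```python
def accuracy(targets, predictions):
    """Fraction of predictions that match their target label."""
    correct = sum(t == p for t, p in zip(targets, predictions))
    return correct / len(targets)

# Toy labels for illustration only.
targets     = ["US", "CA", "FR", "JP", "US"]
predictions = ["US", "US", "FR", "JP", "CA"]
acc = accuracy(targets, predictions)
print("accuracy:", acc, "error rate:", round(1 - acc, 6))

# The same relationship holds for the final-epoch values in the training log.
print("error rate implied by accuracy 0.9866:", round(1 - 0.9866, 6))
```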

Other important metrics to look at are precision, recall, and F1 score [3]. To find the precision, recall, and F1 scores per label/class, we will call the model's metrics_per_label() method.

model.metrics_per_label()
100.00% [10000/10000 05:05<00:00]
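| | Precision_score | Recall_score | F1_score | Support |
|---|---|---|---|---|
| AU | 1.0000 | 1.0000 | 1.0000 | 929.0 |
| BE | 0.9990 | 0.9990 | 0.9990 | 1043.0 |
| BR | 1.0000 | 1.0000 | 1.0000 | 950.0 |
| CA | 0.9088 | 0.9709 | 0.9388 | 996.0 |
| ES | 0.9969 | 0.9980 | 0.9975 | 982.0 |
| FR | 1.0000 | 0.9990 | 0.9995 | 1009.0 |
| JP | 1.0000 | 0.9990 | 0.9995 | 989.0 |
| MX | 1.0000 | 1.0000 | 1.0000 | 1024.0 |
| US | 0.9691 | 0.9093 | 0.9383 | 1070.0 |
| ZA | 0.9990 | 0.9980 | 0.9985 | 1008.0 |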
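
The F1 score is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it for the two weakest classes, CA and US, from the reported precision and recall. The f1_score helper is written for this example and is not an arcgis.learn function. Because the reported inputs are already rounded to four decimals, the recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) for CA and US, as reported above.
reported = {"CA": (0.9088, 0.9709, 0.9388), "US": (0.9691, 0.9093, 0.9383)}
for label, (p, r, f1) in reported.items():
    print(label, "recomputed:", round(f1_score(p, r), 4), "reported:", f1)
```

CA has lower precision than recall, which means many records predicted as CA actually belong to another class. US shows the opposite pattern, with lower recall than precision.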

Get misclassified records

It's always a good idea to look at the cases where your model is not performing well. This step will help us to:

  • Identify if there is a problem in the dataset.
  • Identify if there is a problem with text/documents belonging to a specific label/class.
  • Identify if there is a class imbalance in your dataset, where the model saw too little labeled data for a particular class and therefore could not learn that class properly.

To get the misclassified records we will call the model's get_misclassified_records method.

misclassified_records = model.get_misclassified_records()
100.00% [10000/10000 05:07<00:00]
misclassified_records.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
        .set_properties(**{'text-align': "left"}).hide_index()
| Address | Target | Prediction |
|---|---|---|
| 107, HAMILTON CT, EASLEY | US | CA |
| 40443, CHEAKAMUS WAY | CA | US |
| 309, SOUTH STREET, BARABOO | US | CA |
| 19109, DUTY ST | US | CA |
| 8171, CR 29, 43357 | US | ES |
| 6565, WISCONSIN AVE | US | CA |
| 7332, 25TH AVENUE | US | CA |
| 14778, CAMINITO PUNTA ARENAS, Del Mar, 92014 | US | ES |
| 916, PINE ST | US | CA |
| 168, BROAD SOUND PL, Iredell | US | CA |
| 316, BEAUMIER LANE | US | CA |
| 1518, BARCLAY ST | US | CA |
| 235, GLADEFIELD DR | US | CA |
| 2701, CURRANT CV | US | CA |
| 94, ASPETUCK VILLAGE | US | CA |
| 27, South 10Th Avenue | US | CA |
| 254, GREEN HILLS DR | US | CA |
| 1025, BROOKFORD RD | US | CA |
| 8981, FAIRMOUNT RD SE | US | CA |
| 5, PICKWICK LA | US | CA |
| 540, CHARLESTON HWY | US | CA |
| 1763, RD, McDowell | US | CA |
| 40022, GOVERNMENT RD | CA | US |
| 435, EMORY RD | US | CA |
| 1, Bokomo Road, Malmesbury, Swartland | ZA | CA |
| 3529, BRADLEY AVE | US | CA |
| 710, 9TH ST | US | CA |
| 1421, PINOT NOIR DR | CA | US |
| 1224, ST LUKE RD | CA | US |
| 1822, RT 6 | US | CA |
| 140 | US | CA |
| 2302, RIVER MIST RD | CA | US |
| 4159, Maher St | US | CA |
| 24, DEARBORN STREET, Franklin | US | CA |
| 2109, MALDON PL | US | CA |
| Flora Road, Moquini Coastal Estate, Mossel Bay | ZA | CA |
| 5990, THIROS CIR | US | CA |
| 167, CARLSBAD CAVERNS ST | US | CA |
| 2119, E 3RD AV | CA | US |
| 505, HARLEY WAY, SHARON | US | CA |
| 1354, ST LUKE RD | CA | US |
| 3140, SOUTHWOOD RD | CA | US |
| 4205, Glenn | US | CA |
| 103, BILLETS BRIDGE RD, Courthouse, Camden | US | CA |
| 838087, 4TH LINE EAST, TOWNSHIP OF MULMUR | CA | US |
| 3317, Doncaster DR | CA | US |
| 2726, E TRUESDALE DRIVE | CA | US |
| 2131, SHAMROCK DR | CA | US |
| 1185, ST ANNES RD, Unit 99 | CA | US |
| 9109, CONTESSA CT | US | CA |
| 408, RUBY RD | US | CA |
| 2101, FONTAINE RD, 10 | US | CA |
| 52, OLD HWY | US | CA |
| 200, EAGLE SHORE DR | US | CA |
| 1450, MEADOW AV | US | CA |
| 0, BEECH ST, Rockingham | US | CA |
| 291, SPRY POINT RD, LITTLE POND, KNS | CA | US |
| 10905, YORKTOWN CV | US | CA |
| 3903, TATTLE BRANCH RD | US | CA |
| 682, ISLAND 90 SIX MILE LAKE | CA | US |
| 1887, LITITZ PIKE, UNIT 4, MANHEIM TOWNSHIP | US | CA |
| 2821, E 18TH AV | CA | US |
| 2106, MARK ST | US | CA |
| 25890, 119TH STREET | US | CA |
| 1222, VAN STEFFY AV, WYOMISSING | US | CA |
| 16772, Heritage Ln | US | CA |
| 450, LINCOLN AVENUE | US | CA |
| 27, GRANTHAM GLEN | US | CA |
| 14972, GREENBRAE ST | US | CA |
| 35, CR 1322 | US | ES |
| 2438, DOUBLETREE DR | US | CA |
| 6999, SHIELDS DR | CA | US |
| 232, COUNTY RD 5, JACKSON | US | CA |
| 2265, Coronado Parkway North, Unit B | US | CA |
| 1026, E 18TH AV | CA | US |
| 224, PINE CREST PL | US | CA |
| 6259, ROGERS RD | US | CA |
| 576, WYCHE ST | US | CA |
| 1109, GLENN AVE | US | CA |
| 4821, POSTON DR | US | CA |
| 1610, WALNUT AVE | US | CA |
| 4134, TN SUNP.A-3 T.JARAL SEC1 UE1, 29749 | ES | US |
| 6, HUQUENIN CT | US | CA |
| 3761, OLD CLAYBURN RD | CA | US |
| 4832-8 | JP | CA |
| 1209, ALSON MILLS WAY | CA | US |
| 262, BASSETT ST | US | CA |
| 216, 3RD ST | US | CA |
| 3749, CLARITY RD | US | CA |
| 2619, SQUIRE PL | US | CA |
| 1950, PITTMAN CENTER RD | US | CA |
| WILDCAT TR | US | CA |
| 28, SUNKIST VALLEY RD, Caledon | CA | US |
| 31, 4780 | BE | US |
| Cabarrus | US | CA |
| 494, Oxbow Creek | US | CA |
| 0, HIGH ST | US | CA |
| 121, WHITETAIL ARCHERY AVE | US | CA |
| 676, STATE ROUTE 179 | US | CA |
| 8455, BACARDI AVENUE, INVER GROVE HEIGHTS | US | CA |
| 41, WILLIAM BLAYDES ST | US | CA |
| 1539, 29 AV N | US | CA |
| 4250, OREGON AVE | US | CA |
| 845, NORTH MARY LAKE RD | CA | US |
| 3338, FALLS DR | US | CA |
| 3301, CONFLANS RD | US | CA |
| 3750, WEINBRENNER RD | CA | US |
| 9, BROOK SIDE | US | CA |
| 312, WHEATON STREET | US | CA |
| RAILROAD RD | US | CA |
| 1001, Steinerwaeldel, Volksberg, 67290 | FR | BE |
| 1919, POCO FARM RD | US | CA |
| 166, OLGA DR | US | CA |
| CALEDONIA | CA | US |
| 708, FAIRMEADOW DR | US | CA |
| 126, POAS CL | US | CA |
| 1136, CLARENDON CIR | US | CA |
| CREEK RD, DOUGLASS | US | CA |
| 625505, 15TH SIDEROAD, TOWNSHIP OF MELANCTHON | CA | US |
| 529, WENGLER AVE, SHARON | US | CA |
| 40114, TN SECTOR 8, 45646 | ES | US |
| 9955, East 138Th Place | US | CA |
| 1032, HENEY LAKE RD | CA | US |
| 1021, WOODCREEK OAKS BLVD | US | CA |
| 514, CLEARFIELD ST | US | CA |
| 1991, Braeburn Circle SE | US | CA |
| Boiling Spring Lakes, Brunswick | US | CA |
| 3965, SAGE DR | US | CA |
| 3175, W 34TH AV | CA | US |
| 83, ST ANDREWS CRESCENT | US | CA |
| 10990, 1ST STREET, HEWITT | US | CA |
| 22, POLLETT LN, DIEPPE, NB | CA | US |
| Kent Street, Richibucto, Kent | CA | ZA |
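
A long list like this is easier to interpret once the (target, prediction) pairs are counted. The sketch below does that with collections.Counter for the first five rows of the misclassified records above. The variable names are illustrative, and the same counting could be applied to every row.

```python
from collections import Counter

# (Target, Prediction) pairs from the first five misclassified records above.
pairs = [("US", "CA"), ("CA", "US"), ("US", "CA"), ("US", "CA"), ("US", "ES")]

confusions = Counter(pairs)
for (target, prediction), count in confusions.most_common():
    print(f"{target} predicted as {prediction}: {count}")
```

In the records listed above, most errors are confusions between US and CA, often on short addresses with no city, state, or postal code to distinguish the two countries.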

Saving the trained model

Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on unseen data.

model.save("country-classifier")
Computing model metrics...
WindowsPath('models/country-classifier')

Model inference

The trained model can be used to classify new text documents using the predict method. This method accepts a string or a list of strings to predict the labels of these new documents/text.

text_list = data._train_df.sample(15).Address.values
result = model.predict(text_list)

df = pd.DataFrame(result, columns=["Address", "CountryCode", "Confidence"])

df.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
        .set_properties(**{'text-align': "left"}).hide_index()
100.00% [15/15 00:00<00:00]
| Address | CountryCode | Confidence |
|---|---|---|
| 179, RUA JOSE BARBALHO FILHO, APARTAMENTO 103 BLOCO G, João Pessoa, PB, 58027-000 | BR | 1.000000 |
| 2531, PARTRIDGE CRES | CA | 0.834484 |
| SN, CALLE ESCUINAPA, URUAPAN, Uruapan, Michoacán de Ocampo | MX | 1.000000 |
| 44, WOODFORD DR, FREDERICKSBURG, Stafford County, VA, 22405 | US | 0.999997 |
| 587, CALLE CABO SAN LUCAS, ENSENADA, Ensenada, Baja California | MX | 1.000000 |
| 80009, Street, Fernie, Chief Albert Luthuli | ZA | 0.999997 |
| 1906, Pelton Mountain Rd, Chipman Brook, Kings County | CA | 0.999895 |
| 1, Chemin de Promelles, 1472 | BE | 0.999912 |
| 1408, Cedarglen Court, Oakville, ON | CA | 0.998583 |
| 70, POPLAR ST N | CA | 0.942083 |
| 48, CL RAMON TURRO, 8389 | ES | 1.000000 |
| 454, NORTH MANNHEIM ROAD, Hillside, 60162 | US | 0.999981 |
| 43, Qoqonga Street, Mfuleni, City of Cape Town | ZA | 1.000000 |
| 1 B, TRAVESSA GENESIO SILVEIRA, Mossoró, RN, 59600-000 | BR | 1.000000 |
| GaMaphale, Greater Letaba | ZA | 0.999998 |
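
Since each prediction comes with a confidence score, one possible follow-up is to flag low-confidence results for manual review. The sketch below filters (text, label, confidence) tuples of the shape shown in the output above. The low_confidence helper and the 0.95 threshold are assumptions for illustration, not part of arcgis.learn.

```python
def low_confidence(results, threshold=0.95):
    """Return (text, label, confidence) tuples whose confidence is below threshold."""
    return [r for r in results if r[2] < threshold]

# A few rows copied from the prediction output above.
results = [
    ("2531, PARTRIDGE CRES", "CA", 0.834484),
    ("44, WOODFORD DR, FREDERICKSBURG, Stafford County, VA, 22405", "US", 0.999997),
    ("70, POPLAR ST N", "CA", 0.942083),
    ("48, CL RAMON TURRO, 8389", "ES", 1.000000),
]
for text, label, confidence in low_confidence(results):
    print(f"{label} ({confidence:.2f}): {text}")
```

In this sample, the two flagged addresses are short, English-language street names, the same kind of input that accounts for most of the US/CA confusions among the misclassified records.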

Conclusion

In this notebook, we built a text classifier using the TextClassifier class of the arcgis.learn.text module. The dataset consisted of house addresses from 10 countries, written in languages such as English, Japanese, French, and Spanish. To handle this mix of languages, we used a multi-lingual transformer backbone, XLM-RoBERTa, to build a classifier that predicts the country for an input house address.

References

[1] [Learning Rate](https://en.wikipedia.org/wiki/Learning_rate)

[2] [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision)

[3] [Precision, recall and F1-measures](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures)
