Introduction

Geocoding is the process of taking input text, such as an address or the name of a place, and returning a latitude/longitude location for that place. In this notebook, we will work with a dataset of incomplete house addresses from 10 countries. We will build a classifier using the TextClassifier class of the arcgis.learn.text module to predict the country for these incomplete house addresses.

The house addresses in the dataset consist of text in multiple languages, such as English, Japanese, French, and Spanish. The dataset is a small subset of the house addresses taken from OpenAddresses data.

A note on the dataset

The data was collected around 2020-05-27 by OpenAddresses.
The data licenses can be found in data/country-classifier/LICENSE.txt.

Prerequisites

Data preparation and model training workflows using arcgis.learn have a dependency on transformers. Refer to the section "Install deep learning dependencies of arcgis.learn module" on this page for detailed documentation on the installation of the dependencies.
Labeled data: For TextClassifier to learn, it needs to see documents/texts that have been assigned a label. Labeled data for this sample notebook is located at data/country-classifier/house-addresses.csv.
To learn more about how TextClassifier works, please see the guide on Text Classification with arcgis.learn.

Imports

import os
import zipfile
import pandas as pd
from pathlib import Path
from arcgis.gis import GIS
from arcgis.learn import prepare_textdata
from arcgis.learn.text import TextClassifier

gis = GIS('home')

Data preparation

Data preparation involves splitting the data into training and validation sets, creating the necessary data structures for loading data into the model, and so on. The prepare_textdata() function can directly read the training samples and automate the entire process.

training_data = gis.content.get('ab36969cfe814c89ba3b659cf734492a')
training_data

country_classifier
Training data for TextClassifier class of arcgis.learn.text module

Image Collection by api_data_owner
Last Modified: December 01, 2020
0 comments, 0 views

filepath = training_data.download(file_name=training_data.name)

with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)

DATA_ROOT = Path(os.path.splitext(filepath)[0])

data = prepare_textdata(DATA_ROOT, "classification", train_file="house-addresses.csv", 
                        text_columns="Address", label_columns="Country", batch_size=64)

The show_batch() method can be used to see the training samples, along with labels.

data.show_batch(10)

Address	Country
S/N, LG CASARES, 32170	ES
SN, CALLE E. NABARRETE, PLAN DE AYALA (CAMPO CINCO), Ahome, Sinaloa	MX
152, RUA SANTA RITA DURAO, Belo Horizonte, MG, 30140-110	BR
133, Warande, 201, 9660	BE
4000, 13 Avenue SE, 133, MEDICINE HAT	CA
12, Avenue de la République, Beauvais, 60000	FR
1487-6, 有馬町	JP
4, Rue d'Houat, Saint-Gilles, 35590	FR
32, Hartjie My Liefie Avenue, Bloemfontein, Mangaung	ZA
Street, Centurion, City of Tshwane	ZA
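
As a rough illustration of the train/validation split mentioned in this section, the sketch below shuffles a few labeled rows and holds some out for validation. It uses only the standard library. The helper split_train_valid, the 20% validation fraction, and the seed are assumptions made for this example; they do not describe how prepare_textdata() works internally.

```python
import random

def split_train_valid(rows, valid_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, valid) lists.

    Hypothetical helper for illustration; fraction and seed are assumed values.
    """
    shuffled = list(rows)                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for repeatability
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

# A few (address, country) pairs copied from the batch shown above.
rows = [
    ("S/N, LG CASARES, 32170", "ES"),
    ("133, Warande, 201, 9660", "BE"),
    ("4000, 13 Avenue SE, 133, MEDICINE HAT", "CA"),
    ("12, Avenue de la République, Beauvais, 60000", "FR"),
    ("1487-6, 有馬町", "JP"),
]

train, valid = split_train_valid(rows)
print(len(train), "training rows,", len(valid), "validation rows")
```

The validation rows are never used to update the model, so the loss measured on them gives an estimate of how well the model generalizes.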

TextClassifier model

The TextClassifier model in arcgis.learn.text is built on top of the Hugging Face Transformers library. The model training and inferencing workflows are similar to those of the computer vision models in arcgis.learn.

Run the command below to see what backbones are supported for the text classification task.

print(TextClassifier.supported_backbones)

['BERT', 'RoBERTa', 'DistilBERT', 'ALBERT', 'FlauBERT', 'CamemBERT', 'XLNet', 'XLM', 'XLM-RoBERTa', 'Bart', 'ELECTRA', 'Longformer', 'MobileBERT']

Call the model's available_backbone_models() method with the backbone name to get the available models for that backbone. This method lists only a few of the available models for each backbone. Visit this link to get a complete list of models for each backbone.

print(TextClassifier.available_backbone_models("xlm-roberta"))

('xlm-roberta-base', 'xlm-roberta-large')

Load model architecture

Invoke the TextClassifier class by passing the data and the backbone you have chosen. The dataset consists of house addresses in multiple languages like Japanese, English, French, Spanish, etc., hence we will use a multi-lingual transformer backbone to train our model.

model = TextClassifier(data, backbone="xlm-roberta-base")

Model training

The learning rate [1] is a tuning parameter that determines the step size at each iteration while moving toward a minimum of a loss function; it represents the speed at which a machine learning model "learns". arcgis.learn includes a learning rate finder, accessible through the model's lr_find() method, that can automatically suggest a suitable learning rate without requiring repeated experiments.

model.lr_find()

0.001202264434617413
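
To make the idea of step size concrete, here is a toy sketch of gradient descent on f(x) = x**2. It is unrelated to how lr_find() works internally, and the three learning rates are assumptions chosen only to show slow convergence, good convergence, and divergence on this toy function; they say nothing about suitable values for a transformer model.

```python
def gradient_descent(lr, start=1.0, steps=20):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x         # derivative of x**2
        x = x - lr * grad    # the learning rate scales every update
    return x

small = gradient_descent(lr=0.01)   # small steps: still far from the minimum at 0
medium = gradient_descent(lr=0.3)   # moderate steps: very close to the minimum
large = gradient_descent(lr=1.1)    # oversized steps: overshoots and diverges

print(small, medium, large)
```

A learning rate finder aims to pick a value in the middle regime, where the loss drops quickly without becoming unstable.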

Training the model is an iterative process. We can keep training the model with its fit() method for as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. A falling validation loss indicates that the model is learning the task.

model.fit(epochs=6, lr=0.001)

epoch	train_loss	valid_loss	accuracy	error_rate	time
0	0.308638	0.182150	0.929600	0.070400	05:28
1	0.103615	0.068711	0.970600	0.029400	05:46
2	0.076326	0.041269	0.981600	0.018400	05:30
3	0.055707	0.034307	0.986300	0.013700	05:33
4	0.041812	0.032772	0.986400	0.013600	05:27
5	0.049993	0.032165	0.986600	0.013400	05:26
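
One simple way to judge whether further epochs are still useful is to check how long the validation loss keeps falling. The sketch below applies that check to the valid_loss column of the table above. The helper epochs_improving and the min_delta threshold are hypothetical names introduced for this example and are not part of arcgis.learn.

```python
# Validation losses copied from the training table above, one per epoch.
valid_losses = [0.182150, 0.068711, 0.041269, 0.034307, 0.032772, 0.032165]

def epochs_improving(losses, min_delta=0.0):
    """Count consecutive epochs (after the first) that lowered the loss
    by more than min_delta, stopping at the first epoch that did not."""
    count = 0
    for prev, curr in zip(losses, losses[1:]):
        if prev - curr > min_delta:
            count += 1
        else:
            break
    return count

any_gain = epochs_improving(valid_losses)                        # any decrease counts
meaningful_gain = epochs_improving(valid_losses, min_delta=0.001)  # ignore tiny gains

print(any_gain, meaningful_gain)
```

Here the loss decreases in every epoch, but the last improvement is smaller than 0.001, which suggests that the model is close to converging.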

Validate results

Once we have the trained model, we can compare its predictions with the target labels on a few validation samples to see how it performs.

model.show_results(15)

text	target	prediction
SN, AVENIDA JOSE MARIA MORELOS Y PAVON OTE., APATZINGÁN DE LA CONSTITUCIÓN, Apatzingán, Michoacán de Ocampo	MX	MX
906, AVENIDA JOSEFA ORTÍZ DE DOMÍNGUEZ, CIUDAD MENDOZA, Camerino Z. Mendoza, Veracruz de Ignacio de la Llave	MX	MX
32, CIRCUITO JOSÉ MARÍA URIARTE, FRACCIONAMIENTO RANCHO ALEGRE, Tlajomulco de Zúñiga, Jalisco	MX	MX
SN, ESTRADA SP 250 SENTIDO GRAMADAO, LADO DIREITO FAZENDA SAO RAFAEL CASA 4, São Miguel Arcanjo, SP, 18230-000	BR	BR
SN, CALLE JOSEFA ORTÍZ DE DOMÍNGUEZ, RINCÓN DE BUENA VISTA, Omealca, Veracruz de Ignacio de la Llave	MX	MX
SN, CALLE MICHOACAN, DOLORES HIDALGO CUNA DE LA INDEPENDENCIA NACIONAL, Dolores Hidalgo Cuna de la Independencia Nacional, Guanajuato	MX	MX
SN, CALLE VERDUZCO, COALCOMÁN DE VÁZQUEZ PALLARES, Coalcomán de Vázquez Pallares, Michoacán de Ocampo	MX	MX
1712, CALLE MÁRTIRES DEL 7 DE ENERO, CIUDAD MENDOZA, Camerino Z. Mendoza, Veracruz de Ignacio de la Llave	MX	MX
SN, AVENIDA JACOBO GÁLVEZ, FRACCIONAMIENTO RANCHO ALEGRE, Tlajomulco de Zúñiga, Jalisco	MX	MX
SN, ANDADOR MZNA 6 AMP. LOS ROBLES, EL PUEBLITO (CRUCERO NACIONAL), Córdoba, Veracruz de Ignacio de la Llave	MX	MX
SN, CALLE SÉPTIMA PONIENTE SUR (EJE VIAL), COMITÁN DE DOMÍNGUEZ, Comitán de Domínguez, Chiapas	MX	MX
18, CALLE FELIPE GORRITI / FELIPE GORRITI KALEA, Pamplona / Iruña, Pamplona / Iruña, Navarra, 31004	ES	ES
SN, RUA X VINTE E SEIS, QUADRA 14 LOTE 35 SALA 3, Aparecida de Goiânia, GO, 74922-680	BR	BR
SN, CALLE NINGUNO, HEROICA CIUDAD DE JUCHITÁN DE ZARAGOZA, Heroica Ciudad de Juchitán de Zaragoza, Oaxaca	MX	MX
1169, RUA DOUTOR ALBUQUERQUE LINS, BLOCO B ANDAR 11 APARTAMENTO 112B, São Paulo, SP, 01203-001	BR	BR

Test the model prediction on an input text

text = """1016, 8A, CL RICARDO LEON - SANTA ANA (CARTAGENA), 30319"""
print(model.predict(text))

('1016, 8A, CL RICARDO LEON - SANTA ANA (CARTAGENA), 30319', 'ES', 1.0)

Model metrics

To get a sense of how well the model is trained, we will calculate some important metrics for our text-classifier model. First, to find how accurate[2] the model is in correctly predicting the classes in the dataset, we will call the model's accuracy() method.

model.accuracy()

0.9866

Other important metrics to look at are precision, recall, and F1 score [3]. To find the precision, recall, and F1 scores per label/class, we will call the model's metrics_per_label() method.

model.metrics_per_label()

100.00% [10000/10000 05:05<00:00]

	Precision_score	Recall_score	F1_score	Support
AU	1.0000	1.0000	1.0000	929.0
BE	0.9990	0.9990	0.9990	1043.0
BR	1.0000	1.0000	1.0000	950.0
CA	0.9088	0.9709	0.9388	996.0
ES	0.9969	0.9980	0.9975	982.0
FR	1.0000	0.9990	0.9995	1009.0
JP	1.0000	0.9990	0.9995	989.0
MX	1.0000	1.0000	1.0000	1024.0
US	0.9691	0.9093	0.9383	1070.0
ZA	0.9990	0.9980	0.9985	1008.0
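
The numbers in this table can be cross-checked against each other. The F1 score is the harmonic mean of precision and recall, and for a single-label classifier the overall accuracy equals the support-weighted average of the per-class recall. The sketch below recomputes both from the values in the table. It is a worked check, not the library's implementation, and the results agree with the reported values only to within rounding because the table entries are themselves rounded.

```python
# (precision, recall, support) per country, copied from the table above.
per_label = {
    "AU": (1.0000, 1.0000, 929),
    "BE": (0.9990, 0.9990, 1043),
    "BR": (1.0000, 1.0000, 950),
    "CA": (0.9088, 0.9709, 996),
    "ES": (0.9969, 0.9980, 982),
    "FR": (1.0000, 0.9990, 1009),
    "JP": (1.0000, 0.9990, 989),
    "MX": (1.0000, 1.0000, 1024),
    "US": (0.9691, 0.9093, 1070),
    "ZA": (0.9990, 0.9980, 1008),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

total = sum(support for _, _, support in per_label.values())
# recall * support approximates the number of correctly predicted records per label.
correct = sum(recall * support for _, recall, support in per_label.values())
overall_accuracy = correct / total

f1_ca = f1(*per_label["CA"][:2])
print(f"F1 for CA: {f1_ca:.4f}")
print(f"Accuracy from recall and support: {overall_accuracy:.4f}")
```

The two classes with noticeably lower scores, CA and US, are the ones to look at when inspecting the misclassified records.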

Get misclassified records

It's always a good idea to look at the cases where your model is not performing well. This step will help us to:

Identify if there is a problem in the dataset.
Identify if there is a problem with text/documents belonging to a specific label/class.
Identify if there is a class imbalance in the dataset, because of which the model saw too little labeled data for a particular class and could not learn that class properly.

To get the misclassified records, we will call the model's get_misclassified_records() method.

misclassified_records = model.get_misclassified_records()

100.00% [10000/10000 05:07<00:00]

misclassified_records.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
        .set_properties(**{'text-align': "left"}).hide_index()

Address	Target	Prediction
107, HAMILTON CT, EASLEY	US	CA
40443, CHEAKAMUS WAY	CA	US
309, SOUTH STREET, BARABOO	US	CA
19109, DUTY ST	US	CA
8171, CR 29, 43357	US	ES
6565, WISCONSIN AVE	US	CA
7332, 25TH AVENUE	US	CA
14778, CAMINITO PUNTA ARENAS, Del Mar, 92014	US	ES
916, PINE ST	US	CA
168, BROAD SOUND PL, Iredell	US	CA
316, BEAUMIER LANE	US	CA
1518, BARCLAY ST	US	CA
235, GLADEFIELD DR	US	CA
2701, CURRANT CV	US	CA
94, ASPETUCK VILLAGE	US	CA
27, South 10Th Avenue	US	CA
254, GREEN HILLS DR	US	CA
1025, BROOKFORD RD	US	CA
8981, FAIRMOUNT RD SE	US	CA
5, PICKWICK LA	US	CA
540, CHARLESTON HWY	US	CA
1763, RD, McDowell	US	CA
40022, GOVERNMENT RD	CA	US
435, EMORY RD	US	CA
1, Bokomo Road, Malmesbury, Swartland	ZA	CA
3529, BRADLEY AVE	US	CA
710, 9TH ST	US	CA
1421, PINOT NOIR DR	CA	US
1224, ST LUKE RD	CA	US
1822, RT 6	US	CA
140	US	CA
2302, RIVER MIST RD	CA	US
4159, Maher St	US	CA
24, DEARBORN STREET, Franklin	US	CA
2109, MALDON PL	US	CA
Flora Road, Moquini Coastal Estate, Mossel Bay	ZA	CA
5990, THIROS CIR	US	CA
167, CARLSBAD CAVERNS ST	US	CA
2119, E 3RD AV	CA	US
505, HARLEY WAY, SHARON	US	CA
1354, ST LUKE RD	CA	US
3140, SOUTHWOOD RD	CA	US
4205, Glenn	US	CA
103, BILLETS BRIDGE RD, Courthouse, Camden	US	CA
838087, 4TH LINE EAST, TOWNSHIP OF MULMUR	CA	US
3317, Doncaster DR	CA	US
2726, E TRUESDALE DRIVE	CA	US
2131, SHAMROCK DR	CA	US
1185, ST ANNES RD, Unit 99	CA	US
9109, CONTESSA CT	US	CA
408, RUBY RD	US	CA
2101, FONTAINE RD, 10	US	CA
52, OLD HWY	US	CA
200, EAGLE SHORE DR	US	CA
1450, MEADOW AV	US	CA
0, BEECH ST, Rockingham	US	CA
291, SPRY POINT RD, LITTLE POND, KNS	CA	US
10905, YORKTOWN CV	US	CA
3903, TATTLE BRANCH RD	US	CA
682, ISLAND 90 SIX MILE LAKE	CA	US
1887, LITITZ PIKE, UNIT 4, MANHEIM TOWNSHIP	US	CA
2821, E 18TH AV	CA	US
2106, MARK ST	US	CA
25890, 119TH STREET	US	CA
1222, VAN STEFFY AV, WYOMISSING	US	CA
16772, Heritage Ln	US	CA
450, LINCOLN AVENUE	US	CA
27, GRANTHAM GLEN	US	CA
14972, GREENBRAE ST	US	CA
35, CR 1322	US	ES
2438, DOUBLETREE DR	US	CA
6999, SHIELDS DR	CA	US
232, COUNTY RD 5, JACKSON	US	CA
2265, Coronado Parkway North, Unit B	US	CA
1026, E 18TH AV	CA	US
224, PINE CREST PL	US	CA
6259, ROGERS RD	US	CA
576, WYCHE ST	US	CA
1109, GLENN AVE	US	CA
4821, POSTON DR	US	CA
1610, WALNUT AVE	US	CA
4134, TN SUNP.A-3 T.JARAL SEC1 UE1, 29749	ES	US
6, HUQUENIN CT	US	CA
3761, OLD CLAYBURN RD	CA	US
4832-8	JP	CA
1209, ALSON MILLS WAY	CA	US
262, BASSETT ST	US	CA
216, 3RD ST	US	CA
3749, CLARITY RD	US	CA
2619, SQUIRE PL	US	CA
1950, PITTMAN CENTER RD	US	CA
WILDCAT TR	US	CA
28, SUNKIST VALLEY RD, Caledon	CA	US
31, 4780	BE	US
Cabarrus	US	CA
494, Oxbow Creek	US	CA
0, HIGH ST	US	CA
121, WHITETAIL ARCHERY AVE	US	CA
676, STATE ROUTE 179	US	CA
8455, BACARDI AVENUE, INVER GROVE HEIGHTS	US	CA
41, WILLIAM BLAYDES ST	US	CA
1539, 29 AV N	US	CA
4250, OREGON AVE	US	CA
845, NORTH MARY LAKE RD	CA	US
3338, FALLS DR	US	CA
3301, CONFLANS RD	US	CA
3750, WEINBRENNER RD	CA	US
9, BROOK SIDE	US	CA
312, WHEATON STREET	US	CA
RAILROAD RD	US	CA
1001, Steinerwaeldel, Volksberg, 67290	FR	BE
1919, POCO FARM RD	US	CA
166, OLGA DR	US	CA
CALEDONIA	CA	US
708, FAIRMEADOW DR	US	CA
126, POAS CL	US	CA
1136, CLARENDON CIR	US	CA
CREEK RD, DOUGLASS	US	CA
625505, 15TH SIDEROAD, TOWNSHIP OF MELANCTHON	CA	US
529, WENGLER AVE, SHARON	US	CA
40114, TN SECTOR 8, 45646	ES	US
9955, East 138Th Place	US	CA
1032, HENEY LAKE RD	CA	US
1021, WOODCREEK OAKS BLVD	US	CA
514, CLEARFIELD ST	US	CA
1991, Braeburn Circle SE	US	CA
Boiling Spring Lakes, Brunswick	US	CA
3965, SAGE DR	US	CA
3175, W 34TH AV	CA	US
83, ST ANDREWS CRESCENT	US	CA
10990, 1ST STREET, HEWITT	US	CA
22, POLLETT LN, DIEPPE, NB	CA	US
Kent Street, Richibucto, Kent	CA	ZA
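
A long list like this is easier to interpret once it is tallied by (target, prediction) pair. The sketch below does that with collections.Counter on a handful of rows copied from the table above. It covers only a small sample for illustration, so the counts do not reflect the full table.

```python
from collections import Counter

# (address, target, prediction) for a few rows copied from the table above.
sample_records = [
    ("107, HAMILTON CT, EASLEY", "US", "CA"),
    ("40443, CHEAKAMUS WAY", "CA", "US"),
    ("309, SOUTH STREET, BARABOO", "US", "CA"),
    ("19109, DUTY ST", "US", "CA"),
    ("8171, CR 29, 43357", "US", "ES"),
    ("1, Bokomo Road, Malmesbury, Swartland", "ZA", "CA"),
    ("4832-8", "JP", "CA"),
    ("1001, Steinerwaeldel, Volksberg, 67290", "FR", "BE"),
]

# Count how often each (target, prediction) confusion occurs.
confusions = Counter((target, prediction) for _, target, prediction in sample_records)

for (target, prediction), n in confusions.most_common():
    print(f"{target} predicted as {prediction}: {n}")
```

Most of the errors in the table are short US and CA addresses without a city, state, or postal code, and they are confused with each other. That matches the lower precision and recall reported for those two classes.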

Saving the trained model

Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on unseen data.

model.save("country-classifier")

Computing model metrics...

WindowsPath('models/country-classifier')

Model inference

The trained model can be used to classify new text documents using the predict() method. This method accepts a string or a list of strings and returns the predicted label and confidence for each one. For illustration, the cell below samples 15 addresses from the training DataFrame.

text_list = data._train_df.sample(15).Address.values
result = model.predict(text_list)

df = pd.DataFrame(result, columns=["Address", "CountryCode", "Confidence"])

df.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
        .set_properties(**{'text-align': "left"}).hide_index()

100.00% [15/15 00:00<00:00]

Address	CountryCode	Confidence
179, RUA JOSE BARBALHO FILHO, APARTAMENTO 103 BLOCO G, João Pessoa, PB, 58027-000	BR	1.000000
2531, PARTRIDGE CRES	CA	0.834484
SN, CALLE ESCUINAPA, URUAPAN, Uruapan, Michoacán de Ocampo	MX	1.000000
44, WOODFORD DR, FREDERICKSBURG, Stafford County, VA, 22405	US	0.999997
587, CALLE CABO SAN LUCAS, ENSENADA, Ensenada, Baja California	MX	1.000000
80009, Street, Fernie, Chief Albert Luthuli	ZA	0.999997
1906, Pelton Mountain Rd, Chipman Brook, Kings County	CA	0.999895
1, Chemin de Promelles, 1472	BE	0.999912
1408, Cedarglen Court, Oakville, ON	CA	0.998583
70, POPLAR ST N	CA	0.942083
48, CL RAMON TURRO, 8389	ES	1.000000
454, NORTH MANNHEIM ROAD, Hillside, 60162	US	0.999981
43, Qoqonga Street, Mfuleni, City of Cape Town	ZA	1.000000
1 B, TRAVESSA GENESIO SILVEIRA, Mossoró, RN, 59600-000	BR	1.000000
GaMaphale, Greater Letaba	ZA	0.999998
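
The confidence column can be used to decide which predictions deserve a manual check. The sketch below filters a few (address, country code, confidence) tuples copied from the table above. The helper needs_review and the 0.95 cut-off are assumptions made for this example, not part of arcgis.learn or a recommended threshold.

```python
# (address, country_code, confidence) tuples copied from the table above.
predictions = [
    ("2531, PARTRIDGE CRES", "CA", 0.834484),
    ("70, POPLAR ST N", "CA", 0.942083),
    ("1408, Cedarglen Court, Oakville, ON", "CA", 0.998583),
    ("48, CL RAMON TURRO, 8389", "ES", 1.000000),
]

def needs_review(results, threshold=0.95):
    """Return the results whose confidence is below the (assumed) threshold."""
    return [result for result in results if result[2] < threshold]

flagged = needs_review(predictions)
for address, code, confidence in flagged:
    print(f"{address!r} -> {code} ({confidence:.3f})")
```

In this sample, the flagged rows are short addresses without a city or postal code, the same kind of input that dominates the misclassified records.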

Conclusion

In this notebook, we have built a text classifier using the TextClassifier class of the arcgis.learn.text module. The dataset consisted of house addresses from 10 countries written in languages such as English, Japanese, French, and Spanish. Because the addresses are multilingual, we used a multi-lingual transformer backbone, XLM-RoBERTa, to build a classifier that predicts the country for an input house address.

References

[1] Learning Rate

[2] Accuracy

[3] Precision, recall and F1-measures