Introduction
Geocoding is the process of taking input text, such as an address or the name of a place, and returning a latitude/longitude location for that place. In this notebook, we will work with a dataset of incomplete house addresses from 10 countries and build a classifier, using the TextClassifier class of the arcgis.learn.text module, to predict the country for each of these incomplete addresses.
The house addresses in the dataset consist of text in multiple languages like English, Japanese, French, Spanish, etc. The dataset is a small subset of the house addresses taken from OpenAddresses data.
A note on the dataset
- The data was collected around 2020-05-27 by OpenAddresses.
- The data licenses can be found in data/country-classifier/LICENSE.txt.
Prerequisites
- Data preparation and model training workflows using arcgis.learn have a dependency on transformers. Refer to the section "Install deep learning dependencies of arcgis.learn module" on this page for detailed documentation on the installation of the dependencies.
- Labeled data: For TextClassifier to learn, it needs to see documents/texts that have been assigned a label. Labeled data for this sample notebook is located at data/country-classifier/house-addresses.csv
- To learn more about how TextClassifier works, please see the guide on Text Classification with arcgis.learn.
Imports
import os
import zipfile
import pandas as pd
from pathlib import Path
from arcgis.gis import GIS
from arcgis.learn import prepare_textdata
from arcgis.learn.text import TextClassifier
gis = GIS('home')
Data preparation
Data preparation involves splitting the data into training and validation sets, creating the necessary data structures for loading data into the model, and so on. The prepare_textdata() function can directly read the training samples and automate the entire process.
training_data = gis.content.get('ab36969cfe814c89ba3b659cf734492a')
training_data
filepath = training_data.download(file_name=training_data.name)
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)
DATA_ROOT = Path(os.path.join(os.path.splitext(filepath)[0]))
data = prepare_textdata(DATA_ROOT, "classification", train_file="house-addresses.csv",
text_columns="Address", label_columns="Country", batch_size=64)
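As a rough illustration of what "splitting the data into training and validation sets" and batching mean, here is a minimal sketch in plain Python. It is not how prepare_textdata is implemented; the helper name split_and_batch, the 10% validation share, the seed, and the toy rows are all assumptions made for this example. Only batch_size=64 is taken from the call above.

```python
import random


def split_and_batch(rows, val_fraction=0.1, batch_size=64, seed=0):
    """Shuffle rows, hold out a validation share, and chunk the rest into batches."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)

    # Hold out the first n_val shuffled rows for validation.
    n_val = round(len(shuffled) * val_fraction)
    val_rows, train_rows = shuffled[:n_val], shuffled[n_val:]

    # Group the training rows into consecutive batches; the last one may be short.
    batches = [
        train_rows[i:i + batch_size]
        for i in range(0, len(train_rows), batch_size)
    ]
    return train_rows, val_rows, batches


# Hypothetical placeholder rows of (address, country), not real dataset records.
rows = [(f"{i}, EXAMPLE ST", "US" if i % 2 else "CA") for i in range(1000)]
train_rows, val_rows, batches = split_and_batch(rows)
print(len(train_rows), len(val_rows), len(batches), len(batches[-1]))
```

The point of the held-out rows is that the validation loss and accuracy reported during training are measured on addresses the model did not train on.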
The show_batch() method can be used to see the training samples, along with labels.
data.show_batch(10)
Address | Country |
---|---|
S/N, LG CASARES, 32170 | ES |
SN, CALLE E. NABARRETE, PLAN DE AYALA (CAMPO CINCO), Ahome, Sinaloa | MX |
152, RUA SANTA RITA DURAO, Belo Horizonte, MG, 30140-110 | BR |
133, Warande, 201, 9660 | BE |
4000, 13 Avenue SE, 133, MEDICINE HAT | CA |
12, Avenue de la République, Beauvais, 60000 | FR |
1487-6, 有馬町 | JP |
4, Rue d'Houat, Saint-Gilles, 35590 | FR |
32, Hartjie My Liefie Avenue, Bloemfontein, Mangaung | ZA |
Street, Centurion, City of Tshwane | ZA |
TextClassifier model
The TextClassifier model in arcgis.learn.text is built on top of the Hugging Face Transformers library. The model training and inferencing workflows are similar to those of the computer vision models in arcgis.learn.
Run the command below to see what backbones are supported for the text classification task.
print(TextClassifier.supported_backbones)
['BERT', 'RoBERTa', 'DistilBERT', 'ALBERT', 'FlauBERT', 'CamemBERT', 'XLNet', 'XLM', 'XLM-RoBERTa', 'Bart', 'ELECTRA', 'Longformer', 'MobileBERT']
Call the model's available_backbone_models() method with the backbone name to get the available models for that backbone. This call lists only a few of the available models for each backbone. The Hugging Face Transformers documentation provides a complete list of models for each backbone.
print(TextClassifier.available_backbone_models("xlm-roberta"))
('xlm-roberta-base', 'xlm-roberta-large')
Load model architecture
Invoke the TextClassifier class by passing the data and the backbone you have chosen. The dataset consists of house addresses in multiple languages like Japanese, English, French, Spanish, etc., so we will use a multilingual transformer backbone to train our model.
model = TextClassifier(data, backbone="xlm-roberta-base")
Model training
The learning rate [1] is a tuning parameter that determines the step size at each iteration while moving toward a minimum of a loss function; it represents the speed at which a machine learning model "learns". arcgis.learn includes a learning rate finder, accessible through the model's lr_find() method, that can automatically select an optimum learning rate without requiring repeated experiments.
model.lr_find()
0.001202264434617413
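To make the role of the step size concrete, the sketch below runs plain gradient descent on the toy function f(x) = x**2 with three different learning rates. This is only an illustration of the concept; it is not what lr_find() does internally, and the function, starting point, and learning rates are assumptions chosen for the example.

```python
def gradient_descent(lr, steps=20, x0=5.0):
    """Minimise f(x) = x**2 from x0 using a fixed learning rate; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2
        x = x - lr * grad     # step against the gradient, scaled by the learning rate
    return x


# The minimum is at x = 0, so the distance from 0 shows how well each rate worked.
for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: |x| after 20 steps = {abs(gradient_descent(lr)):.4f}")
```

A rate that is too small makes slow progress, a moderate rate gets close to the minimum, and a rate that is too large overshoots and moves away from it. A learning rate finder aims to suggest a value in the useful middle range.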
Training the model is an iterative process. We can keep training the model using its fit() method as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. A falling validation loss indicates that the model is learning the task.
model.fit(epochs=6, lr=0.001)
epoch | train_loss | valid_loss | accuracy | error_rate | time |
---|---|---|---|---|---|
0 | 0.308638 | 0.182150 | 0.929600 | 0.070400 | 05:28 |
1 | 0.103615 | 0.068711 | 0.970600 | 0.029400 | 05:46 |
2 | 0.076326 | 0.041269 | 0.981600 | 0.018400 | 05:30 |
3 | 0.055707 | 0.034307 | 0.986300 | 0.013700 | 05:33 |
4 | 0.041812 | 0.032772 | 0.986400 | 0.013600 | 05:27 |
5 | 0.049993 | 0.032165 | 0.986600 | 0.013400 | 05:26 |
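One way to read the training log is to check whether the validation loss is still falling and by how much. The sketch below does this with the valid_loss values copied from the table above; the variable names are chosen for this example and are not part of arcgis.learn.

```python
# valid_loss per epoch, copied from the training table above
valid_loss = [0.182150, 0.068711, 0.041269, 0.034307, 0.032772, 0.032165]

# Improvement from one epoch to the next (positive means the loss went down).
improvements = [prev - cur for prev, cur in zip(valid_loss, valid_loss[1:])]

best_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__)
still_improving = all(delta > 0 for delta in improvements)

for epoch, delta in enumerate(improvements, start=1):
    print(f"epoch {epoch}: valid_loss fell by {delta:.6f}")
print("best epoch:", best_epoch, "| improved every epoch:", still_improving)
```

In this run the validation loss drops at every epoch, but the gains shrink quickly, which suggests that further epochs would likely bring only small improvements.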
Validate results
Once we have the trained model, we can view the results to see how it performs.
model.show_results(15)
text | target | prediction |
---|---|---|
SN, AVENIDA JOSE MARIA MORELOS Y PAVON OTE., APATZINGÁN DE LA CONSTITUCIÓN, Apatzingán, Michoacán de Ocampo | MX | MX |
906, AVENIDA JOSEFA ORTÍZ DE DOMÍNGUEZ, CIUDAD MENDOZA, Camerino Z. Mendoza, Veracruz de Ignacio de la Llave | MX | MX |
32, CIRCUITO JOSÉ MARÍA URIARTE, FRACCIONAMIENTO RANCHO ALEGRE, Tlajomulco de Zúñiga, Jalisco | MX | MX |
SN, ESTRADA SP 250 SENTIDO GRAMADAO, LADO DIREITO FAZENDA SAO RAFAEL CASA 4, São Miguel Arcanjo, SP, 18230-000 | BR | BR |
SN, CALLE JOSEFA ORTÍZ DE DOMÍNGUEZ, RINCÓN DE BUENA VISTA, Omealca, Veracruz de Ignacio de la Llave | MX | MX |
SN, CALLE MICHOACAN, DOLORES HIDALGO CUNA DE LA INDEPENDENCIA NACIONAL, Dolores Hidalgo Cuna de la Independencia Nacional, Guanajuato | MX | MX |
SN, CALLE VERDUZCO, COALCOMÁN DE VÁZQUEZ PALLARES, Coalcomán de Vázquez Pallares, Michoacán de Ocampo | MX | MX |
1712, CALLE MÁRTIRES DEL 7 DE ENERO, CIUDAD MENDOZA, Camerino Z. Mendoza, Veracruz de Ignacio de la Llave | MX | MX |
SN, AVENIDA JACOBO GÁLVEZ, FRACCIONAMIENTO RANCHO ALEGRE, Tlajomulco de Zúñiga, Jalisco | MX | MX |
SN, ANDADOR MZNA 6 AMP. LOS ROBLES, EL PUEBLITO (CRUCERO NACIONAL), Córdoba, Veracruz de Ignacio de la Llave | MX | MX |
SN, CALLE SÉPTIMA PONIENTE SUR (EJE VIAL), COMITÁN DE DOMÍNGUEZ, Comitán de Domínguez, Chiapas | MX | MX |
18, CALLE FELIPE GORRITI / FELIPE GORRITI KALEA, Pamplona / Iruña, Pamplona / Iruña, Navarra, 31004 | ES | ES |
SN, RUA X VINTE E SEIS, QUADRA 14 LOTE 35 SALA 3, Aparecida de Goiânia, GO, 74922-680 | BR | BR |
SN, CALLE NINGUNO, HEROICA CIUDAD DE JUCHITÁN DE ZARAGOZA, Heroica Ciudad de Juchitán de Zaragoza, Oaxaca | MX | MX |
1169, RUA DOUTOR ALBUQUERQUE LINS, BLOCO B ANDAR 11 APARTAMENTO 112B, São Paulo, SP, 01203-001 | BR | BR |
Test the model prediction on an input text
text = """1016, 8A, CL RICARDO LEON - SANTA ANA (CARTAGENA), 30319"""
print(model.predict(text))
('1016, 8A, CL RICARDO LEON - SANTA ANA (CARTAGENA), 30319', 'ES', 1.0)
Model metrics
To get a sense of how well the model is trained, we will calculate some important metrics for our text-classifier model. First, to find how accurate [2] the model is in correctly predicting the classes in the dataset, we will call the model's accuracy() method.
model.accuracy()
0.9866
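Accuracy is simply the share of predictions that match their target label, and it is the complement of the error rate reported during training (1 - 0.0134 = 0.9866 for the last epoch). A minimal sketch of the calculation follows; the accuracy helper and the five label pairs are illustrative assumptions, not the model's validation set.

```python
def accuracy(targets, predictions):
    """Return the fraction of predictions that equal their target label."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(targets, predictions))
    return correct / len(targets)


# Illustrative labels only: three correct predictions and one US/CA swap each way.
targets = ["MX", "BR", "ES", "US", "CA"]
predictions = ["MX", "BR", "ES", "CA", "US"]
print(accuracy(targets, predictions))

# Accuracy and error rate are complements; values taken from the last training epoch.
print(round(1 - 0.013400, 4))
```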
Other important metrics to look at are precision, recall and F1-measures [3]. To find the precision, recall and f1 scores per label/class, we will call the model's metrics_per_label() method.
model.metrics_per_label()
Country | Precision_score | Recall_score | F1_score | Support |
---|---|---|---|---|
AU | 1.0000 | 1.0000 | 1.0000 | 929.0 |
BE | 0.9990 | 0.9990 | 0.9990 | 1043.0 |
BR | 1.0000 | 1.0000 | 1.0000 | 950.0 |
CA | 0.9088 | 0.9709 | 0.9388 | 996.0 |
ES | 0.9969 | 0.9980 | 0.9975 | 982.0 |
FR | 1.0000 | 0.9990 | 0.9995 | 1009.0 |
JP | 1.0000 | 0.9990 | 0.9995 | 989.0 |
MX | 1.0000 | 1.0000 | 1.0000 | 1024.0 |
US | 0.9691 | 0.9093 | 0.9383 | 1070.0 |
ZA | 0.9990 | 0.9980 | 0.9985 | 1008.0 |
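The columns in this table are related: F1 is the harmonic mean of precision and recall, and the support-weighted average of recall equals the overall accuracy for a single-label classifier. The sketch below checks both relationships using the values copied from the table above. The helper f1_score is written for this example and is not an arcgis.learn function, and small differences are expected because the table values are rounded to four decimals.

```python
# label: (precision, recall, f1, support), copied from the table above
metrics = {
    "AU": (1.0000, 1.0000, 1.0000, 929),
    "BE": (0.9990, 0.9990, 0.9990, 1043),
    "BR": (1.0000, 1.0000, 1.0000, 950),
    "CA": (0.9088, 0.9709, 0.9388, 996),
    "ES": (0.9969, 0.9980, 0.9975, 982),
    "FR": (1.0000, 0.9990, 0.9995, 1009),
    "JP": (1.0000, 0.9990, 0.9995, 989),
    "MX": (1.0000, 1.0000, 1.0000, 1024),
    "US": (0.9691, 0.9093, 0.9383, 1070),
    "ZA": (0.9990, 0.9980, 0.9985, 1008),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for label, (p, r, f1, _) in metrics.items():
    print(f"{label}: recomputed F1 = {f1_score(p, r):.4f}, reported = {f1:.4f}")

# recall * support is the number of correctly classified samples of each class.
total_support = sum(s for _, _, _, s in metrics.values())
weighted_recall = sum(r * s for _, r, _, s in metrics.values()) / total_support
print(f"support-weighted recall = {weighted_recall:.4f}")
```

The CA and US rows stand out: CA has lower precision and US has lower recall, which is consistent with US addresses being predicted as CA.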
Get misclassified records
It's always a good idea to see the cases where your model is not performing well. This step will help us to:
- Identify if there is a problem in the dataset.
- Identify if there is a problem with text/documents belonging to a specific label/class.
- Identify if there is a class imbalance in your dataset, due to which the model didn't see enough labeled data for a particular class and hence was not able to learn that class properly.
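For the class imbalance point, a quick check is to compare the number of samples per class. The sketch below uses the Support column from the metrics_per_label() output above, on the assumption that these counts reflect the class mix of the data the model was evaluated on; the variable names are chosen for this example.

```python
# Samples per class, copied from the Support column of the metrics table above
support = {
    "AU": 929, "BE": 1043, "BR": 950, "CA": 996, "ES": 982,
    "FR": 1009, "JP": 989, "MX": 1024, "US": 1070, "ZA": 1008,
}

largest = max(support, key=support.get)
smallest = min(support, key=support.get)

# A ratio close to 1 means the classes are roughly balanced.
imbalance_ratio = support[largest] / support[smallest]
print(f"largest: {largest}, smallest: {smallest}, ratio: {imbalance_ratio:.2f}")
```

Here the largest class has only about 15% more samples than the smallest, so class imbalance is unlikely to be the main cause of the errors.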
To get the misclassified records, we will call the model's get_misclassified_records() method.
misclassified_records = model.get_misclassified_records()
misclassified_records.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
.set_properties(**{'text-align': "left"}).hide_index()
Address | Target | Prediction |
---|---|---|
107, HAMILTON CT, EASLEY | US | CA |
40443, CHEAKAMUS WAY | CA | US |
309, SOUTH STREET, BARABOO | US | CA |
19109, DUTY ST | US | CA |
8171, CR 29, 43357 | US | ES |
6565, WISCONSIN AVE | US | CA |
7332, 25TH AVENUE | US | CA |
14778, CAMINITO PUNTA ARENAS, Del Mar, 92014 | US | ES |
916, PINE ST | US | CA |
168, BROAD SOUND PL, Iredell | US | CA |
316, BEAUMIER LANE | US | CA |
1518, BARCLAY ST | US | CA |
235, GLADEFIELD DR | US | CA |
2701, CURRANT CV | US | CA |
94, ASPETUCK VILLAGE | US | CA |
27, South 10Th Avenue | US | CA |
254, GREEN HILLS DR | US | CA |
1025, BROOKFORD RD | US | CA |
8981, FAIRMOUNT RD SE | US | CA |
5, PICKWICK LA | US | CA |
540, CHARLESTON HWY | US | CA |
1763, RD, McDowell | US | CA |
40022, GOVERNMENT RD | CA | US |
435, EMORY RD | US | CA |
1, Bokomo Road, Malmesbury, Swartland | ZA | CA |
3529, BRADLEY AVE | US | CA |
710, 9TH ST | US | CA |
1421, PINOT NOIR DR | CA | US |
1224, ST LUKE RD | CA | US |
1822, RT 6 | US | CA |
140 | US | CA |
2302, RIVER MIST RD | CA | US |
4159, Maher St | US | CA |
24, DEARBORN STREET, Franklin | US | CA |
2109, MALDON PL | US | CA |
Flora Road, Moquini Coastal Estate, Mossel Bay | ZA | CA |
5990, THIROS CIR | US | CA |
167, CARLSBAD CAVERNS ST | US | CA |
2119, E 3RD AV | CA | US |
505, HARLEY WAY, SHARON | US | CA |
1354, ST LUKE RD | CA | US |
3140, SOUTHWOOD RD | CA | US |
4205, Glenn | US | CA |
103, BILLETS BRIDGE RD, Courthouse, Camden | US | CA |
838087, 4TH LINE EAST, TOWNSHIP OF MULMUR | CA | US |
3317, Doncaster DR | CA | US |
2726, E TRUESDALE DRIVE | CA | US |
2131, SHAMROCK DR | CA | US |
1185, ST ANNES RD, Unit 99 | CA | US |
9109, CONTESSA CT | US | CA |
408, RUBY RD | US | CA |
2101, FONTAINE RD, 10 | US | CA |
52, OLD HWY | US | CA |
200, EAGLE SHORE DR | US | CA |
1450, MEADOW AV | US | CA |
0, BEECH ST, Rockingham | US | CA |
291, SPRY POINT RD, LITTLE POND, KNS | CA | US |
10905, YORKTOWN CV | US | CA |
3903, TATTLE BRANCH RD | US | CA |
682, ISLAND 90 SIX MILE LAKE | CA | US |
1887, LITITZ PIKE, UNIT 4, MANHEIM TOWNSHIP | US | CA |
2821, E 18TH AV | CA | US |
2106, MARK ST | US | CA |
25890, 119TH STREET | US | CA |
1222, VAN STEFFY AV, WYOMISSING | US | CA |
16772, Heritage Ln | US | CA |
450, LINCOLN AVENUE | US | CA |
27, GRANTHAM GLEN | US | CA |
14972, GREENBRAE ST | US | CA |
35, CR 1322 | US | ES |
2438, DOUBLETREE DR | US | CA |
6999, SHIELDS DR | CA | US |
232, COUNTY RD 5, JACKSON | US | CA |
2265, Coronado Parkway North, Unit B | US | CA |
1026, E 18TH AV | CA | US |
224, PINE CREST PL | US | CA |
6259, ROGERS RD | US | CA |
576, WYCHE ST | US | CA |
1109, GLENN AVE | US | CA |
4821, POSTON DR | US | CA |
1610, WALNUT AVE | US | CA |
4134, TN SUNP.A-3 T.JARAL SEC1 UE1, 29749 | ES | US |
6, HUQUENIN CT | US | CA |
3761, OLD CLAYBURN RD | CA | US |
4832-8 | JP | CA |
1209, ALSON MILLS WAY | CA | US |
262, BASSETT ST | US | CA |
216, 3RD ST | US | CA |
3749, CLARITY RD | US | CA |
2619, SQUIRE PL | US | CA |
1950, PITTMAN CENTER RD | US | CA |
WILDCAT TR | US | CA |
28, SUNKIST VALLEY RD, Caledon | CA | US |
31, 4780 | BE | US |
Cabarrus | US | CA |
494, Oxbow Creek | US | CA |
0, HIGH ST | US | CA |
121, WHITETAIL ARCHERY AVE | US | CA |
676, STATE ROUTE 179 | US | CA |
8455, BACARDI AVENUE, INVER GROVE HEIGHTS | US | CA |
41, WILLIAM BLAYDES ST | US | CA |
1539, 29 AV N | US | CA |
4250, OREGON AVE | US | CA |
845, NORTH MARY LAKE RD | CA | US |
3338, FALLS DR | US | CA |
3301, CONFLANS RD | US | CA |
3750, WEINBRENNER RD | CA | US |
9, BROOK SIDE | US | CA |
312, WHEATON STREET | US | CA |
RAILROAD RD | US | CA |
1001, Steinerwaeldel, Volksberg, 67290 | FR | BE |
1919, POCO FARM RD | US | CA |
166, OLGA DR | US | CA |
CALEDONIA | CA | US |
708, FAIRMEADOW DR | US | CA |
126, POAS CL | US | CA |
1136, CLARENDON CIR | US | CA |
CREEK RD, DOUGLASS | US | CA |
625505, 15TH SIDEROAD, TOWNSHIP OF MELANCTHON | CA | US |
529, WENGLER AVE, SHARON | US | CA |
40114, TN SECTOR 8, 45646 | ES | US |
9955, East 138Th Place | US | CA |
1032, HENEY LAKE RD | CA | US |
1021, WOODCREEK OAKS BLVD | US | CA |
514, CLEARFIELD ST | US | CA |
1991, Braeburn Circle SE | US | CA |
Boiling Spring Lakes, Brunswick | US | CA |
3965, SAGE DR | US | CA |
3175, W 34TH AV | CA | US |
83, ST ANDREWS CRESCENT | US | CA |
10990, 1ST STREET, HEWITT | US | CA |
22, POLLETT LN, DIEPPE, NB | CA | US |
Kent Street, Richibucto, Kent | CA | ZA |
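A long list like this is easier to interpret when the errors are grouped by (target, prediction) pair. The sketch below tallies the pairs from the first ten rows of the table above using only the standard library; it works on a hand-copied excerpt, not on the misclassified_records DataFrame itself.

```python
from collections import Counter

# (target, prediction) pairs copied from the first ten rows of the table above
pairs = [
    ("US", "CA"), ("CA", "US"), ("US", "CA"), ("US", "CA"), ("US", "ES"),
    ("US", "CA"), ("US", "CA"), ("US", "ES"), ("US", "CA"), ("US", "CA"),
]

# Count how often each kind of confusion occurs, most frequent first.
confusions = Counter(pairs)
for (target, prediction), count in confusions.most_common():
    print(f"{target} -> {prediction}: {count}")
```

Assuming the DataFrame columns are named Target and Prediction as displayed, a pandas equivalent over all rows would be along the lines of misclassified_records.groupby(["Target", "Prediction"]).size(). Most of the errors shown involve short US and CA addresses that contain only a street number and name, which gives the model little to distinguish the two countries.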
Saving the trained model
Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on unseen data.
model.save("country-classifier")
Computing model metrics...
WindowsPath('models/country-classifier')
Model inference
The trained model can be used to classify new text documents using the predict method. This method accepts a string or a list of strings to predict the labels of these new documents/text.
text_list = data._train_df.sample(15).Address.values
result = model.predict(text_list)
df = pd.DataFrame(result, columns=["Address", "CountryCode", "Confidence"])
df.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
.set_properties(**{'text-align': "left"}).hide_index()
Address | CountryCode | Confidence |
---|---|---|
179, RUA JOSE BARBALHO FILHO, APARTAMENTO 103 BLOCO G, João Pessoa, PB, 58027-000 | BR | 1.000000 |
2531, PARTRIDGE CRES | CA | 0.834484 |
SN, CALLE ESCUINAPA, URUAPAN, Uruapan, Michoacán de Ocampo | MX | 1.000000 |
44, WOODFORD DR, FREDERICKSBURG, Stafford County, VA, 22405 | US | 0.999997 |
587, CALLE CABO SAN LUCAS, ENSENADA, Ensenada, Baja California | MX | 1.000000 |
80009, Street, Fernie, Chief Albert Luthuli | ZA | 0.999997 |
1906, Pelton Mountain Rd, Chipman Brook, Kings County | CA | 0.999895 |
1, Chemin de Promelles, 1472 | BE | 0.999912 |
1408, Cedarglen Court, Oakville, ON | CA | 0.998583 |
70, POPLAR ST N | CA | 0.942083 |
48, CL RAMON TURRO, 8389 | ES | 1.000000 |
454, NORTH MANNHEIM ROAD, Hillside, 60162 | US | 0.999981 |
43, Qoqonga Street, Mfuleni, City of Cape Town | ZA | 1.000000 |
1 B, TRAVESSA GENESIO SILVEIRA, Mossoró, RN, 59600-000 | BR | 1.000000 |
GaMaphale, Greater Letaba | ZA | 0.999998 |
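Because each prediction comes with a confidence value, low-confidence results can be set aside for manual review. The sketch below applies a cut-off to a few (address, country code, confidence) rows copied from the table above. The 0.95 threshold and the name needs_review are assumptions for illustration; a suitable threshold depends on the application.

```python
CONFIDENCE_THRESHOLD = 0.95  # hypothetical cut-off, not taken from the notebook

# (address, country code, confidence) rows copied from the table above
results = [
    ("2531, PARTRIDGE CRES", "CA", 0.834484),
    ("70, POPLAR ST N", "CA", 0.942083),
    ("1408, Cedarglen Court, Oakville, ON", "CA", 0.998583),
    ("48, CL RAMON TURRO, 8389", "ES", 1.000000),
    ("GaMaphale, Greater Letaba", "ZA", 0.999998),
]

# Keep only predictions the model was less sure about.
needs_review = [row for row in results if row[2] < CONFIDENCE_THRESHOLD]

for address, code, confidence in needs_review:
    print(f"review: {address!r} predicted {code} with confidence {confidence:.3f}")
```

Both flagged rows are short addresses without a city or postal code, the same pattern seen in the misclassified US and CA records.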
Conclusion
In this notebook, we have built a text classifier using the TextClassifier class of the arcgis.learn.text module. The dataset consisted of house addresses from 10 countries written in languages like English, Japanese, French, Spanish, etc. To achieve this, we used a multilingual transformer backbone, XLM-RoBERTa, to build a classifier that predicts the country for an input house address.
References
[1] [Learning Rate](https://en.wikipedia.org/wiki/Learning_rate)
[2] [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision)
[3] [Precision, recall and F1-measures](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures)