Identifying country names from incomplete house addresses

Introduction

Geocoding is the process of taking input text, such as an address or the name of a place, and returning a latitude/longitude location for that place. In this notebook, we work with a dataset of incomplete house addresses from 10 countries and build a classifier, using the TextClassifier class of the arcgis.learn.text module, to predict the country of each address.

The addresses are written in multiple languages, including English, Japanese, French, and Spanish. The dataset is a small subset of the house addresses published by OpenAddresses.

A note on the dataset

  • The data was collected around 2020-05-27 by OpenAddresses.
  • The data licenses can be found in data/country-classifier/LICENSE.txt.

Prerequisites

  • Data preparation and model training workflows using arcgis.learn have a dependency on transformers. Refer to the section "Install deep learning dependencies of arcgis.learn module" on this page for detailed documentation on the installation of the dependencies.

  • Labeled data: For TextClassifier to learn, it needs to see documents/texts that have been assigned a label. Labeled data for this sample notebook is located at data/country-classifier/house-addresses.csv

  • To learn more about how TextClassifier works, please see the guide on Text Classification with arcgis.learn.

Imports

In [1]:
import os
import zipfile
import pandas as pd
from pathlib import Path
from arcgis.gis import GIS
from arcgis.learn import prepare_textdata
from arcgis.learn.text import TextClassifier
In [2]:
gis = GIS('home')

Data preparation

Data preparation involves splitting the data into training and validation sets, creating the data structures needed to load data into the model, and so on. The prepare_textdata() function reads the training samples directly and automates the entire process.

In [3]:
training_data = gis.content.get('ab36969cfe814c89ba3b659cf734492a')
training_data
Out[3]:
country_classifier
Training data for TextClassifier class of arcgis.learn.text module
Image Collection by api_data_owner
Last Modified: December 01, 2020
0 comments, 0 views
In [4]:
filepath = training_data.download(file_name=training_data.name)
In [5]:
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)
In [6]:
# The archive extracts to a folder named after the zip file (without ".zip").
DATA_ROOT = Path(filepath).with_suffix('')
In [7]:
data = prepare_textdata(DATA_ROOT, "classification", train_file="house-addresses.csv", 
                        text_columns="Address", label_columns="Country", batch_size=64)
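
As a rough illustration of the splitting step, the sketch below shuffles a list of labeled rows and holds out a fraction for validation. It is plain Python and not the internals of prepare_textdata(); the function name split_rows, the 10% validation fraction, and the seed are assumptions made for this example.

```python
import random

def split_rows(rows, valid_fraction=0.1, seed=42):
    """Shuffle rows and split them into (train, valid) lists."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

# Three (address, country) pairs from this dataset, repeated to make
# a toy dataset of 30 rows.
rows = [
    ("S/N, LG CASARES, 32170", "ES"),
    ("1487-6, 有馬町", "JP"),
    ("4, Rue d'Houat, Saint-Gilles, 35590", "FR"),
] * 10

train_rows, valid_rows = split_rows(rows)
print(len(train_rows), len(valid_rows))
```

Fixing the seed keeps the split stable between runs, which makes validation metrics comparable across experiments.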

The show_batch() method can be used to see the training samples, along with labels.

In [11]:
data.show_batch(10)
Out[11]:
Address Country
S/N, LG CASARES, 32170 ES
SN, CALLE E. NABARRETE, PLAN DE AYALA (CAMPO CINCO), Ahome, Sinaloa MX
152, RUA SANTA RITA DURAO, Belo Horizonte, MG, 30140-110 BR
133, Warande, 201, 9660 BE
4000, 13 Avenue SE, 133, MEDICINE HAT CA
12, Avenue de la République, Beauvais, 60000 FR
1487-6, 有馬町 JP
4, Rue d'Houat, Saint-Gilles, 35590 FR
32, Hartjie My Liefie Avenue, Bloemfontein, Mangaung ZA
Street, Centurion, City of Tshwane ZA

TextClassifier model

The TextClassifier model in arcgis.learn.text is built on top of the Hugging Face Transformers library. The model training and inferencing workflows are similar to those of the computer vision models in arcgis.learn.

Run the command below to see what backbones are supported for the text classification task.

In [12]:
print(TextClassifier.supported_backbones)
['BERT', 'RoBERTa', 'DistilBERT', 'ALBERT', 'FlauBERT', 'CamemBERT', 'XLNet', 'XLM', 'XLM-RoBERTa', 'Bart', 'ELECTRA', 'Longformer', 'MobileBERT']

Call the model's available_backbone_models() method with the backbone name to get the available models for that backbone. This method lists only a few of the available models for each backbone. Visit this link to get a complete list of models for each backbone.

In [13]:
print(TextClassifier.available_backbone_models("xlm-roberta"))
('xlm-roberta-base', 'xlm-roberta-large')

Load model architecture

Invoke the TextClassifier class by passing the data and the backbone you have chosen. Because the dataset contains house addresses in multiple languages (Japanese, English, French, Spanish, etc.), we will use a multilingual transformer backbone to train our model.

In [14]:
model = TextClassifier(data, backbone="xlm-roberta-base")

Model training

The learning rate[1] is a tuning parameter that determines the step size at each iteration while moving toward a minimum of a loss function; it represents the speed at which a machine learning model "learns". arcgis.learn includes a learning rate finder, accessible through the model's lr_find() method, that can automatically select an optimum learning rate without requiring repeated experiments.

In [15]:
model.lr_find()
Out[15]:
0.001202264434617413
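
To make the role of the step size concrete, here is a minimal sketch of plain gradient descent on the toy loss f(w) = w². It is unrelated to how lr_find() works internally; the function name and the three learning-rate values are assumptions chosen for illustration.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * w   # step against the gradient, scaled by lr
    return w

small = gradient_descent(lr=0.01)   # tiny steps: still far from the minimum at 0
good = gradient_descent(lr=0.3)     # moderate steps: ends close to 0
large = gradient_descent(lr=1.1)    # oversized steps: overshoots and diverges

print(abs(small), abs(good), abs(large))
```

A rate that is too small wastes epochs, while one that is too large never settles, which is why a finder that suggests a reasonable value is useful.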

Training the model is an iterative process. We can keep training the model with its fit() method as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. A falling validation loss indicates that the model is learning the task.

In [16]:
model.fit(epochs=4, lr=0.001)
epoch  train_loss  valid_loss  accuracy  error_rate  time
0      1.088629    0.687974    0.761800  0.238200    01:21
1      0.889682    0.539912    0.794900  0.205100    01:21
2      0.776112    0.456957    0.821200  0.178800    01:21
3      0.711500    0.325689    0.872900  0.127100    01:18
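
One way to express "keep training while the validation loss keeps falling" is a small check over the per-epoch losses. The sketch below is plain Python using the valid_loss values from the table above; the helper name still_improving and its min_delta parameter are hypothetical and not part of arcgis.learn.

```python
def still_improving(valid_losses, min_delta=0.0):
    """True if the latest validation loss beat the previous one by more than min_delta."""
    if len(valid_losses) < 2:
        return True  # not enough history to judge yet
    return valid_losses[-2] - valid_losses[-1] > min_delta

# valid_loss column from the four epochs above.
valid_losses = [0.687974, 0.539912, 0.456957, 0.325689]
print(still_improving(valid_losses))
```

Here the loss is still dropping after the fourth epoch, so further training is reasonable.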

By default, the earlier layers of the model (i.e. the backbone) are frozen. Once the later layers have been sufficiently trained, the earlier layers can be unfrozen (by calling the model's unfreeze() method) to fine-tune the model further.

In [17]:
model.unfreeze()

model.fit(epochs=6)
epoch  train_loss  valid_loss  accuracy  error_rate  time
0      0.308638    0.182150    0.929600  0.070400    05:28
1      0.103615    0.068711    0.970600  0.029400    05:46
2      0.076326    0.041269    0.981600  0.018400    05:30
3      0.055707    0.034307    0.986300  0.013700    05:33
4      0.041812    0.032772    0.986400  0.013600    05:27
5      0.049993    0.032165    0.986600  0.013400    05:26

Validate results

Once the model is trained, we can look at its predictions alongside the target labels to see how it performs.

In [19]:
model.show_results(15)
text target prediction
SN, AVENIDA JOSE MARIA MORELOS Y PAVON OTE., APATZINGÁN DE LA CONSTITUCIÓN, Apatzingán, Michoacán de Ocampo MX MX
906, AVENIDA JOSEFA ORTÍZ DE DOMÍNGUEZ, CIUDAD MENDOZA, Camerino Z. Mendoza, Veracruz de Ignacio de la Llave MX MX
32, CIRCUITO JOSÉ MARÍA URIARTE, FRACCIONAMIENTO RANCHO ALEGRE, Tlajomulco de Zúñiga, Jalisco MX MX
SN, ESTRADA SP 250 SENTIDO GRAMADAO, LADO DIREITO FAZENDA SAO RAFAEL CASA 4, São Miguel Arcanjo, SP, 18230-000 BR BR
SN, CALLE JOSEFA ORTÍZ DE DOMÍNGUEZ, RINCÓN DE BUENA VISTA, Omealca, Veracruz de Ignacio de la Llave MX MX
SN, CALLE MICHOACAN, DOLORES HIDALGO CUNA DE LA INDEPENDENCIA NACIONAL, Dolores Hidalgo Cuna de la Independencia Nacional, Guanajuato MX MX
SN, CALLE VERDUZCO, COALCOMÁN DE VÁZQUEZ PALLARES, Coalcomán de Vázquez Pallares, Michoacán de Ocampo MX MX
1712, CALLE MÁRTIRES DEL 7 DE ENERO, CIUDAD MENDOZA, Camerino Z. Mendoza, Veracruz de Ignacio de la Llave MX MX
SN, AVENIDA JACOBO GÁLVEZ, FRACCIONAMIENTO RANCHO ALEGRE, Tlajomulco de Zúñiga, Jalisco MX MX
SN, ANDADOR MZNA 6 AMP. LOS ROBLES, EL PUEBLITO (CRUCERO NACIONAL), Córdoba, Veracruz de Ignacio de la Llave MX MX
SN, CALLE SÉPTIMA PONIENTE SUR (EJE VIAL), COMITÁN DE DOMÍNGUEZ, Comitán de Domínguez, Chiapas MX MX
18, CALLE FELIPE GORRITI / FELIPE GORRITI KALEA, Pamplona / Iruña, Pamplona / Iruña, Navarra, 31004 ES ES
SN, RUA X VINTE E SEIS, QUADRA 14 LOTE 35 SALA 3, Aparecida de Goiânia, GO, 74922-680 BR BR
SN, CALLE NINGUNO, HEROICA CIUDAD DE JUCHITÁN DE ZARAGOZA, Heroica Ciudad de Juchitán de Zaragoza, Oaxaca MX MX
1169, RUA DOUTOR ALBUQUERQUE LINS, BLOCO B ANDAR 11 APARTAMENTO 112B, São Paulo, SP, 01203-001 BR BR

Test the model prediction on an input text

In [20]:
text = """1016, 8A, CL RICARDO LEON - SANTA ANA (CARTAGENA), 30319"""
print(model.predict(text))
('1016, 8A, CL RICARDO LEON - SANTA ANA (CARTAGENA), 30319', 'ES', 1.0)

Model metrics

To get a sense of how well the model is trained, we will calculate some important metrics for our text classifier. First, to find how accurately[2] the model predicts the classes in the dataset, we will call the model's accuracy() method.

In [21]:
model.accuracy()
Out[21]:
0.9866
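
Accuracy is the fraction of records whose predicted label equals the target, and the error rate reported during training is its complement (0.9866 and 0.0134 here). A minimal sketch in plain Python follows; the accuracy function and the toy labels are illustrative and not the library's implementation.

```python
def accuracy(targets, predictions):
    """Fraction of positions where the prediction equals the target."""
    correct = sum(t == p for t, p in zip(targets, predictions))
    return correct / len(targets)

# Toy labels for illustration only.
targets     = ["US", "CA", "ES", "US", "JP"]
predictions = ["US", "US", "ES", "CA", "JP"]

acc = accuracy(targets, predictions)
print(acc, 1 - acc)   # accuracy and the corresponding error rate
```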

Other important metrics to look at are precision, recall, and F1 score[3]. To find the precision, recall, and F1 scores per label/class, we will call the model's metrics_per_label() method.

In [22]:
model.metrics_per_label()
100.00% [10000/10000 05:05<00:00]
Out[22]:
    Precision_score  Recall_score  F1_score  Support
AU  1.0000           1.0000        1.0000    929.0
BE  0.9990           0.9990        0.9990    1043.0
BR  1.0000           1.0000        1.0000    950.0
CA  0.9088           0.9709        0.9388    996.0
ES  0.9969           0.9980        0.9975    982.0
FR  1.0000           0.9990        0.9995    1009.0
JP  1.0000           0.9990        0.9995    989.0
MX  1.0000           1.0000        1.0000    1024.0
US  0.9691           0.9093        0.9383    1070.0
ZA  0.9990           0.9980        0.9985    1008.0
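
For reference, per-label precision, recall, and F1 can be computed from target/prediction pairs as sketched below. This is a plain-Python illustration with toy labels, not the implementation behind metrics_per_label(); the function name per_label_metrics is an assumption.

```python
def per_label_metrics(targets, predictions, label):
    """Return (precision, recall, f1) for one label."""
    pairs = list(zip(targets, predictions))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
targets     = ["US", "US", "US", "CA", "CA", "ES"]
predictions = ["US", "CA", "US", "CA", "US", "ES"]

ca_precision, ca_recall, ca_f1 = per_label_metrics(targets, predictions, "CA")
print(ca_precision, ca_recall, ca_f1)
```

F1 is the harmonic mean of precision and recall. For the CA row above, 2 × 0.9088 × 0.9709 / (0.9088 + 0.9709) ≈ 0.9388, which matches the reported F1_score.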

Get misclassified records

It's always a good idea to look at the cases where your model is not performing well. This step will help us to:

  • Identify if there is a problem in the dataset.
  • Identify if there is a problem with text/documents belonging to a specific label/class.
  • Identify if there is a class imbalance in your dataset, due to which the model saw too little labeled data for a particular class and could not learn that class properly.

To get the misclassified records, we will call the model's get_misclassified_records() method.

In [23]:
misclassified_records = model.get_misclassified_records()
100.00% [10000/10000 05:07<00:00]
In [24]:
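# Left-align the table and hide the index column.
# Note: Styler.hide_index() was removed in pandas 2.0; use .hide(axis="index") there.
misclassified_records.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
        .set_properties(**{'text-align': "left"}).hide_index()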
Out[24]:
Address Target Prediction
107, HAMILTON CT, EASLEY US CA
40443, CHEAKAMUS WAY CA US
309, SOUTH STREET, BARABOO US CA
19109, DUTY ST US CA
8171, CR 29, 43357 US ES
6565, WISCONSIN AVE US CA
7332, 25TH AVENUE US CA
14778, CAMINITO PUNTA ARENAS, Del Mar, 92014 US ES
916, PINE ST US CA
168, BROAD SOUND PL, Iredell US CA
316, BEAUMIER LANE US CA
1518, BARCLAY ST US CA
235, GLADEFIELD DR US CA
2701, CURRANT CV US CA
94, ASPETUCK VILLAGE US CA
27, South 10Th Avenue US CA
254, GREEN HILLS DR US CA
1025, BROOKFORD RD US CA
8981, FAIRMOUNT RD SE US CA
5, PICKWICK LA US CA
540, CHARLESTON HWY US CA
1763, RD, McDowell US CA
40022, GOVERNMENT RD CA US
435, EMORY RD US CA
1, Bokomo Road, Malmesbury, Swartland ZA CA
3529, BRADLEY AVE US CA
710, 9TH ST US CA
1421, PINOT NOIR DR CA US
1224, ST LUKE RD CA US
1822, RT 6 US CA
140 US CA
2302, RIVER MIST RD CA US
4159, Maher St US CA
24, DEARBORN STREET, Franklin US CA
2109, MALDON PL US CA
Flora Road, Moquini Coastal Estate, Mossel Bay ZA CA
5990, THIROS CIR US CA
167, CARLSBAD CAVERNS ST US CA
2119, E 3RD AV CA US
505, HARLEY WAY, SHARON US CA
1354, ST LUKE RD CA US
3140, SOUTHWOOD RD CA US
4205, Glenn US CA
103, BILLETS BRIDGE RD, Courthouse, Camden US CA
838087, 4TH LINE EAST, TOWNSHIP OF MULMUR CA US
3317, Doncaster DR CA US
2726, E TRUESDALE DRIVE CA US
2131, SHAMROCK DR CA US
1185, ST ANNES RD, Unit 99 CA US
9109, CONTESSA CT US CA
408, RUBY RD US CA
2101, FONTAINE RD, 10 US CA
52, OLD HWY US CA
200, EAGLE SHORE DR US CA
1450, MEADOW AV US CA
0, BEECH ST, Rockingham US CA
291, SPRY POINT RD, LITTLE POND, KNS CA US
10905, YORKTOWN CV US CA
3903, TATTLE BRANCH RD US CA
682, ISLAND 90 SIX MILE LAKE CA US
1887, LITITZ PIKE, UNIT 4, MANHEIM TOWNSHIP US CA
2821, E 18TH AV CA US
2106, MARK ST US CA
25890, 119TH STREET US CA
1222, VAN STEFFY AV, WYOMISSING US CA
16772, Heritage Ln US CA
450, LINCOLN AVENUE US CA
27, GRANTHAM GLEN US CA
14972, GREENBRAE ST US CA
35, CR 1322 US ES
2438, DOUBLETREE DR US CA
6999, SHIELDS DR CA US
232, COUNTY RD 5, JACKSON US CA
2265, Coronado Parkway North, Unit B US CA
1026, E 18TH AV CA US
224, PINE CREST PL US CA
6259, ROGERS RD US CA
576, WYCHE ST US CA
1109, GLENN AVE US CA
4821, POSTON DR US CA
1610, WALNUT AVE US CA
4134, TN SUNP.A-3 T.JARAL SEC1 UE1, 29749 ES US
6, HUQUENIN CT US CA
3761, OLD CLAYBURN RD CA US
4832-8 JP CA
1209, ALSON MILLS WAY CA US
262, BASSETT ST US CA
216, 3RD ST US CA
3749, CLARITY RD US CA
2619, SQUIRE PL US CA
1950, PITTMAN CENTER RD US CA
WILDCAT TR US CA
28, SUNKIST VALLEY RD, Caledon CA US
31, 4780 BE US
Cabarrus US CA
494, Oxbow Creek US CA
0, HIGH ST US CA
121, WHITETAIL ARCHERY AVE US CA
676, STATE ROUTE 179 US CA
8455, BACARDI AVENUE, INVER GROVE HEIGHTS US CA
41, WILLIAM BLAYDES ST US CA
1539, 29 AV N US CA
4250, OREGON AVE US CA
845, NORTH MARY LAKE RD CA US
3338, FALLS DR US CA
3301, CONFLANS RD US CA
3750, WEINBRENNER RD CA US
9, BROOK SIDE US CA
312, WHEATON STREET US CA
RAILROAD RD US CA
1001, Steinerwaeldel, Volksberg, 67290 FR BE
1919, POCO FARM RD US CA
166, OLGA DR US CA
CALEDONIA CA US
708, FAIRMEADOW DR US CA
126, POAS CL US CA
1136, CLARENDON CIR US CA
CREEK RD, DOUGLASS US CA
625505, 15TH SIDEROAD, TOWNSHIP OF MELANCTHON CA US
529, WENGLER AVE, SHARON US CA
40114, TN SECTOR 8, 45646 ES US
9955, East 138Th Place US CA
1032, HENEY LAKE RD CA US
1021, WOODCREEK OAKS BLVD US CA
514, CLEARFIELD ST US CA
1991, Braeburn Circle SE US CA
Boiling Spring Lakes, Brunswick US CA
3965, SAGE DR US CA
3175, W 34TH AV CA US
83, ST ANDREWS CRESCENT US CA
10990, 1ST STREET, HEWITT US CA
22, POLLETT LN, DIEPPE, NB CA US
Kent Street, Richibucto, Kent CA ZA
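
A quick way to summarize a table like this is to count (target, prediction) pairs. The sketch below uses collections.Counter on a handful of rows copied from the output above. With the real DataFrame, the same counts could presumably be obtained by grouping on the Target and Prediction columns, though the original notebook does not show that.

```python
from collections import Counter

# (target, prediction) pairs for a few rows copied from the table above.
pairs = [
    ("US", "CA"),  # 107, HAMILTON CT, EASLEY
    ("CA", "US"),  # 40443, CHEAKAMUS WAY
    ("US", "CA"),  # 309, SOUTH STREET, BARABOO
    ("US", "ES"),  # 8171, CR 29, 43357
    ("ZA", "CA"),  # 1, Bokomo Road, Malmesbury, Swartland
    ("US", "CA"),  # 916, PINE ST
]

confusions = Counter(pairs)
for (target, prediction), count in confusions.most_common():
    print(f"{target} -> {prediction}: {count}")
```

In the full table, nearly all errors are confusions between US and CA, which is consistent with the lower precision and recall reported for those two labels.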

Saving the trained model

Once you are satisfied with the model, you can save it using the save() method. This creates an Esri Model Definition (EMD) file that can be used for inferencing on unseen data.

In [25]:
model.save("country-classifier")
Computing model metrics...
Out[25]:
WindowsPath('models/country-classifier')

Model inference

The trained model can be used to classify new text documents using the predict() method. This method accepts a string or a list of strings and predicts the label of each.

In [26]:
text_list = data._train_df.sample(15).Address.values
result = model.predict(text_list)

df = pd.DataFrame(result, columns=["Address", "CountryCode", "Confidence"])

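# Left-align the table and hide the index column.
# Note: Styler.hide_index() was removed in pandas 2.0; use .hide(axis="index") there.
df.style.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])\
        .set_properties(**{'text-align': "left"}).hide_index()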
100.00% [15/15 00:00<00:00]
Out[26]:
Address CountryCode Confidence
179, RUA JOSE BARBALHO FILHO, APARTAMENTO 103 BLOCO G, João Pessoa, PB, 58027-000 BR 1.000000
2531, PARTRIDGE CRES CA 0.834484
SN, CALLE ESCUINAPA, URUAPAN, Uruapan, Michoacán de Ocampo MX 1.000000
44, WOODFORD DR, FREDERICKSBURG, Stafford County, VA, 22405 US 0.999997
587, CALLE CABO SAN LUCAS, ENSENADA, Ensenada, Baja California MX 1.000000
80009, Street, Fernie, Chief Albert Luthuli ZA 0.999997
1906, Pelton Mountain Rd, Chipman Brook, Kings County CA 0.999895
1, Chemin de Promelles, 1472 BE 0.999912
1408, Cedarglen Court, Oakville, ON CA 0.998583
70, POPLAR ST N CA 0.942083
48, CL RAMON TURRO, 8389 ES 1.000000
454, NORTH MANNHEIM ROAD, Hillside, 60162 US 0.999981
43, Qoqonga Street, Mfuleni, City of Cape Town ZA 1.000000
1 B, TRAVESSA GENESIO SILVEIRA, Mossoró, RN, 59600-000 BR 1.000000
GaMaphale, Greater Letaba ZA 0.999998
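
Because each prediction comes with a confidence score, low-confidence results can be set aside for manual review. The sketch below works on tuples shaped like the (address, label, confidence) output shown above; the 0.95 threshold and the variable names are assumptions for illustration.

```python
# A few (address, country code, confidence) rows copied from the output above.
results = [
    ("2531, PARTRIDGE CRES", "CA", 0.834484),
    ("SN, CALLE ESCUINAPA, URUAPAN, Uruapan, Michoacán de Ocampo", "MX", 1.000000),
    ("70, POPLAR ST N", "CA", 0.942083),
    ("48, CL RAMON TURRO, 8389", "ES", 1.000000),
]

THRESHOLD = 0.95  # hypothetical cut-off; tune it for your own data

# Keep only the rows whose confidence falls below the threshold.
needs_review = [row for row in results if row[2] < THRESHOLD]
for address, code, confidence in needs_review:
    print(f"{address!r} -> {code} ({confidence:.3f})")
```

Both flagged rows are short street-only addresses predicted as CA, the label that the misclassified records show is most often confused with US.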

Conclusion

In this notebook, we built a text classifier using the TextClassifier class of the arcgis.learn.text module. The dataset consisted of house addresses from 10 countries written in languages like English, Japanese, French, Spanish, etc. Because of this, we used a multilingual transformer backbone, XLM-RoBERTa, to build a classifier that predicts the country for an input house address.

References

