HTML table to Pandas Data Frame to Portal Item
Read HTML table to Pandas Data Frame
Often we read informative articles that present data in a tabular form. If such data contains location information, it would be much more insightful if presented as a cartographic map. Thus this sample shows how Pandas can be used to extract data from a table within a web page (in this case, a Wikipedia article) and how it can be then brought into the GIS for further analysis and visualization.
Note: to run this sample, you need a few extra libraries in your conda environment. If you don't have the libraries, install them by running the following commands from cmd.exe or your shell
conda install lxml
conda install html5lib
conda install beautifulsoup4
conda install matplotlib```
import pandas as pd
Let us read the Wikipedia article on Estimated number of guns per capita by country as a pandas data frame object
df = pd.read_html("https://en.wikipedia.org/wiki/Number_of_guns_per_capita_by_country")[0]
Let us process the table by dropping some unncessary columns
df.head()
Unnamed: 0 | Country or subnational area | Estimate of civilian firearms per 100 persons | Region[3] | Subregion[3] | Population 2017 | Estimate of firearms in civilian possession | Computation method | Registered firearms | Unregistered firearms | Notes | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1.0 | United States | 120.5 | Americas | North America | 326474000.0 | 393347000 | 1.0 | 1073743.0 | 392,273,257 Est. | [note 2] |
2 | 2.0 | Falkland Islands | 62.1 | Americas | South America | 3000.0 | 2000 | 2.0 | 1705.0 | 295 | NaN |
3 | 3.0 | Yemen | 52.8 | Asia | Western Asia | 28120000.0 | 14859000 | 2.0 | NaN | NaN | NaN |
4 | 4.0 | New Caledonia | 42.5 | Oceania | Melanesia | 270000.0 | 115000 | 2.0 | 55000.0 | 60000 | NaN |
df.dropna(thresh=4, inplace=True)
del df['Unnamed: 0']
del df['Notes']
df.head()
Country or subnational area | Estimate of civilian firearms per 100 persons | Region[3] | Subregion[3] | Population 2017 | Estimate of firearms in civilian possession | Computation method | Registered firearms | Unregistered firearms | |
---|---|---|---|---|---|---|---|---|---|
1 | United States | 120.5 | Americas | North America | 326474000.0 | 393347000 | 1.0 | 1073743.0 | 392,273,257 Est. |
2 | Falkland Islands | 62.1 | Americas | South America | 3000.0 | 2000 | 2.0 | 1705.0 | 295 |
3 | Yemen | 52.8 | Asia | Western Asia | 28120000.0 | 14859000 | 2.0 | NaN | NaN |
4 | New Caledonia | 42.5 | Oceania | Melanesia | 270000.0 | 115000 | 2.0 | 55000.0 | 60000 |
5 | Serbia | 39.1 | Europe | Southern Europe | 6946000.0 | 2719000 | 2.0 | 1186086.0 | 1532914 |
If you notice, the Estimate of civilian firearms per 100 persons
for United States is not a proper integer. We can correct is as below
df.iloc[0,1] = 120.5
df.dtypes
Country or subnational area object Estimate of civilian firearms per 100 persons float64 Region[3] object Subregion[3] object Population 2017 float64 Estimate of firearms in civilian possession object Computation method float64 Registered firearms float64 Unregistered firearms object dtype: object
However, we cannot change every incorrect data entry if the table is large. Further we need the Estimate of civilian firearms per 100 persons
column in numeric format. Hence let us convert it and while doing so, convert incorrect values to NaN
which stands for Not a Number
converted_column = pd.to_numeric(df["Estimate of civilian firearms per 100 persons"], errors = 'coerce')
df['Estimate of civilian firearms per 100 persons'] = converted_column
df.head()
Country or subnational area | Estimate of civilian firearms per 100 persons | Region[3] | Subregion[3] | Population 2017 | Estimate of firearms in civilian possession | Computation method | Registered firearms | Unregistered firearms | |
---|---|---|---|---|---|---|---|---|---|
1 | United States | 120.5 | Americas | North America | 326474000.0 | 393347000 | 1.0 | 1073743.0 | 392,273,257 Est. |
2 | Falkland Islands | 62.1 | Americas | South America | 3000.0 | 2000 | 2.0 | 1705.0 | 295 |
3 | Yemen | 52.8 | Asia | Western Asia | 28120000.0 | 14859000 | 2.0 | NaN | NaN |
4 | New Caledonia | 42.5 | Oceania | Melanesia | 270000.0 | 115000 | 2.0 | 55000.0 | 60000 |
5 | Serbia | 39.1 | Europe | Southern Europe | 6946000.0 | 2719000 | 2.0 | 1186086.0 | 1532914 |
Plot as a map
Let us connect to our GIS to geocode this data and present it as a map, either by specifying username and password, e.g. in gis = GIS("https://www.arcgis.com", "arcgis_python", "P@ssword123")
or via an existing profile:
from arcgis.gis import GIS
import json
gis = GIS("home")
Since the table is now using Country (or dependent territory, subnational area, etc.)
column to signify the country code, which represents as Country__or_dependent_territory__subnational_area__etc__
as the real column name of the table, the mapping relationship is stated as below:
fc = gis.content.import_data(df, {"CountryCode":"Country or subnational area"})
map1 = gis.map('UK')
map1
Let us us smart mapping to render the points with varying sizes representing the number of firearms per 100 residents
map1.add_layer(fc, {"renderer":"ClassedSizeRenderer",
"field_name": "Estimate_of_civilian_firearms_per_100_persons"})
Publish as Portal Item
Let us publish this layer as a feature collection item in our GIS
item_properties = {
"title": "Worldwide Firearms ownership",
"tags" : "guns,violence",
"snippet": " GSR Worldwide firearms ownership",
"description": "test description",
"text": json.dumps({"featureCollection": {"layers": [dict(fc.layer)]}}),
"type": "Feature Collection",
"typeKeywords": "Data, Feature Collection, Singlelayer",
"extent" : "-102.5272,-41.7886,172.5967,64.984"
}
item = gis.content.add(item_properties)
Let us search for this item
search_result = gis.content.search("Worldwide Firearms ownership")
search_result[0]