Enrich Standard Geographies

Standard geographies are jurisdictional areas determined by government agencies. At the highest level these are the countries of the world. Within these countries the hierarchical levels have different names. If you already have a list of jurisdictional area identifiers such as postal (ZIP) codes or US Census Block Group Identifiers, these can be used directly as input to the enrich method to retrieve demographic information about these jurisdictional areas for analysis.

Example Use Case - Variable Variance

Just as before, we are going to retrieve high variance variables, but this time we are going to look up the unique identifiers for all the US Census Block Groups in Seattle.

Create a Country

Our analysis starts by identifying the country we will be working with and instantiating an arcgis.geoenrichment.Country object that references the arcgis.gis.GIS source to use for analysis.

from arcgis.geoenrichment import Country
from arcgis.gis import GIS

gis = GIS(profile="your_online_profile")
usa = Country("usa", gis=gis)

usa

<Country - United States (GIS @ https://geosaurus.maps.arcgis.com version:10.3)>

Selecting Data to Start

Next, we use pandas DataFrame filtering to select a subset of variables to focus on from the thousands available.

# data collections to keep
data_collections = [
    "occupation",
    "Wealth",
    "financial",
    "educationalattainment",
    "language",
    "healthinsurancecoverage",
    "veterans",
    "yearmovedin",
    "yearbuilt",
    "population",
    "housingcosts",
]

enrich_vars = (
    usa.enrich_variables[
        # variable names containing "cy", such as NOHS_CY
        usa.enrich_variables.name.str.lower().str.contains("cy")
        # limited to the data collections listed above
        & usa.enrich_variables.data_collection.isin(data_collections)
    ]
    .drop_duplicates("name")
    .reset_index(drop=True)
)

enrich_vars

	name	alias	data_collection	enrich_name	enrich_field_name	description	vintage	units
0	NOHS_CY	2022 Pop Age 25+: < 9th Grade	educationalattainment	educationalattainment.NOHS_CY	educationalattainment_NOHS_CY	2022 Population Age 25+: Less than 9th Grade (...	2022	count
1	SOMEHS_CY	2022 Pop Age 25+: High School/No Diploma	educationalattainment	educationalattainment.SOMEHS_CY	educationalattainment_SOMEHS_CY	2022 Population Age 25+: 9-12th Grade/No Diplo...	2022	count
2	HSGRAD_CY	2022 Pop Age 25+: High School Diploma	educationalattainment	educationalattainment.HSGRAD_CY	educationalattainment_HSGRAD_CY	2022 Population Age 25+: High School Diploma (...	2022	count
3	GED_CY	2022 Pop Age 25+: GED	educationalattainment	educationalattainment.GED_CY	educationalattainment_GED_CY	2022 Population Age 25+: GED/Alternative Crede...	2022	count
4	SMCOLL_CY	2022 Pop Age 25+: Some College/No Degree	educationalattainment	educationalattainment.SMCOLL_CY	educationalattainment_SMCOLL_CY	2022 Population Age 25+: Some College/No Degre...	2022	count
...	...	...	...	...	...	...	...	...
92	VAL1M_CY	2022 Home Value $1 Million-1499999	Wealth	Wealth.VAL1M_CY	Wealth_VAL1M_CY	2022 Home Value $1,000,000-$1,499,999 (Esri)	2022	count
93	MEDVAL_CY	2022 Median Home Value	Wealth	Wealth.MEDVAL_CY	Wealth_MEDVAL_CY	2022 Median Home Value (Esri)	2022	currency
94	AVGVAL_CY	2022 Average Home Value	Wealth	Wealth.AVGVAL_CY	Wealth_AVGVAL_CY	2022 Average Home Value (Esri)	2022	currency
95	VALBASE_CY	2022 Home Value Base	Wealth	Wealth.VALBASE_CY	Wealth_VALBASE_CY	2022 Owner Occupied Housing Units by Value Bas...	2022	count
96	WLTHINDXCY	2022 Wealth Index	Wealth	Wealth.WLTHINDXCY	Wealth_WLTHINDXCY	2022 Wealth Index (Esri)	2022	count

97 rows × 8 columns

Get the Geographic Level

Here, we retrieve the levels and use the level_name column to discover valid values for the enrich method's standard_geography_level parameter.

usa.levels

	level_name	singular_name	plural_name	alias	level_id	admin_level
0	block_groups	Block Group	Block Groups	Block Groups	US.BlockGroups
1	tracts	Census Tract	Census Tracts	Census Tracts	US.Tracts
2	places	Place	Places	Cities and Towns (Places)	US.Places
3	zip5	ZIP Code	ZIP Codes	ZIP Codes	US.ZIP5	Admin4
4	csd	County Subdivision	County Subdivisions	County Subdivisions	US.CSD
5	counties	County	Counties	Counties	US.Counties	Admin3
6	cbsa	CBSA	CBSAs	CBSAs	US.CBSA
7	cd	Congressional District	Congressional Districts	Congressional Districts	US.CD
8	dma	DMA	DMAs	DMAs	US.DMA
9	states	State	States	States	US.States	Admin2
10	whole_usa	United States of America	United States of America	Entire Country	US.WholeUSA	Admin1
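
The level_id column holds identifiers such as US.BlockGroups, which is the form used for the layers arguments of standard_geography_query in this guide, while level_name holds the values expected by the enrich method's standard_geography_level parameter. A minimal sketch of translating between the two is shown here. The dictionary is copied by hand from a few rows of the table above, and the helper name is hypothetical rather than part of the arcgis API; in practice the same mapping could be built from usa.levels.

```python
# Rows copied by hand from the usa.levels output above: level_id -> level_name.
LEVEL_NAME_BY_ID = {
    "US.BlockGroups": "block_groups",
    "US.Tracts": "tracts",
    "US.Places": "places",
    "US.ZIP5": "zip5",
    "US.Counties": "counties",
    "US.States": "states",
}


def level_name_for(level_id: str) -> str:
    """Return the level_name matching a level_id (hypothetical helper)."""
    try:
        return LEVEL_NAME_BY_ID[level_id]
    except KeyError:
        raise ValueError(f"unknown level_id: {level_id!r}") from None


print(level_name_for("US.BlockGroups"))
```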

Retrieve Seattle Block Groups

It is not uncommon to already have the standard geography unique identifiers. If you instead need to look up the identifiers within a larger area, you can retrieve them using standard_geography_query. The most versatile parameter of this function is geoquery; its options are explained under the geographyQuery parameter documentation. We can start by seeing what is returned when searching for seattle.

from arcgis.geoenrichment import standard_geography_query

standard_geography_query("usa", layers="US.Places", geoquery="seattle")

	DatasetID	Hierarchy	DataLayerID	AreaID	AreaName	MajorSubdivisionName	MajorSubdivisionAbbr	MajorSubdivisionType	CountryAbbr	Score	ObjectId
0	USA_ESRI_2022	census2020	US.Places	5363000	Seattle city	Washington	WA	State	US	100	1

Since only one location is returned, we can use the same query to retrieve its block groups by populating the sub_geography parameters.

bg_df = standard_geography_query(
    "usa",
    layers="US.Places",
    geoquery="seattle",
    sub_geography_layer="US.BlockGroups",
    return_sub_geography=True,
)

bg_df.head()

	DatasetID	Hierarchy	DataLayerID	AreaID	AreaName	MajorSubdivisionName	MajorSubdivisionAbbr	MajorSubdivisionType	CountryAbbr	Score	ObjectId
0	USA_ESRI_2022	census2020	US.BlockGroups	530330009001	530330009.001	Washington	WA	State	US	100	1
1	USA_ESRI_2022	census2020	US.BlockGroups	530330009002	530330009.002	Washington	WA	State	US	100	2
2	USA_ESRI_2022	census2020	US.BlockGroups	530330010001	530330010.001	Washington	WA	State	US	100	3
3	USA_ESRI_2022	census2020	US.BlockGroups	530330010002	530330010.002	Washington	WA	State	US	100	4
4	USA_ESRI_2022	census2020	US.BlockGroups	530330011002	530330011.002	Washington	WA	State	US	100	5
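
The AreaID values are 12-digit Census block group identifiers. As a sketch, assuming the standard Census GEOID layout (2-digit state FIPS code, 3-digit county, 6-digit tract, 1-digit block group), the helper below splits an identifier into its parts. The class and function names are hypothetical; the helper is only meant as a sanity check that identifiers are intact strings, for example not truncated or stripped of leading zeros, before they are passed to enrich.

```python
from typing import NamedTuple


class BlockGroupId(NamedTuple):
    state: str
    county: str
    tract: str
    block_group: str


def parse_block_group_id(area_id: str) -> BlockGroupId:
    """Split a 12-digit block group identifier into its components."""
    if len(area_id) != 12 or not area_id.isdigit():
        raise ValueError(f"expected 12 digits, got {area_id!r}")
    return BlockGroupId(area_id[:2], area_id[2:5], area_id[5:11], area_id[11])


# First AreaID from the output above.
parts = parse_block_group_id("530330009001")
print(parts)
```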

Enrich

Now we can pass the retrieved block groups to the enrich method to achieve the same results as before.

enrich_df = usa.enrich(
    bg_df,
    enrich_variables=enrich_vars,
    standard_geography_level="block_groups",
    standard_geography_id_column="AreaID",
)

enrich_df.head()

	std_geography_level	std_geography_name	std_geography_id	source_country	aggregation_method	population_to_polygon_size_rating	apportionment_confidence	has_data	nohs_cy	somehs_cy	...	val300k_cy	val400k_cy	val500k_cy	val750k_cy	val1m_cy	medval_cy	avgval_cy	valbase_cy	wlthindxcy	SHAPE
0	US.BlockGroups	530330009.001	530330009001	USA	Query:US.BlockGroups	2.191	2.576	1	13.0	7.0	...	2.0	2.0	68.0	103.0	176.0	1022727.0	1040915.0	366.0	357.0	{"rings": [[[-122.280028912264, 47.71915262818...
1	US.BlockGroups	530330009.002	530330009002	USA	Query:US.BlockGroups	2.191	2.576	1	0.0	0.0	...	0.0	3.0	29.0	41.0	85.0	1519784.0	1444495.0	327.0	344.0	{"rings": [[[-122.27725891127136, 47.716030628...
2	US.BlockGroups	530330010.001	530330010001	USA	Query:US.BlockGroups	2.191	2.576	1	0.0	29.0	...	10.0	31.0	137.0	46.0	4.0	634124.0	655349.0	229.0	93.0	{"rings": [[[-122.29380691276572, 47.711973626...
3	US.BlockGroups	530330010.002	530330010002	USA	Query:US.BlockGroups	2.191	2.576	1	5.0	9.0	...	5.0	10.0	114.0	76.0	74.0	789474.0	857270.0	282.0	255.0	{"rings": [[[-122.2908289113606, 47.7067966266...
4	US.BlockGroups	530330011.002	530330011002	USA	Query:US.BlockGroups	2.191	2.576	1	0.0	8.0	...	11.0	41.0	219.0	130.0	9.0	684361.0	738695.0	429.0	207.0	{"rings": [[[-122.30622891353259, 47.706644624...

5 rows × 106 columns

Calculate Variance

Variance can now be calculated for the retrieved variables to identify those with exceedingly high variance. This analysis can feed feature selection or feature reduction, addressing covariance between variables before modeling.
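
The ranking in this section uses var(ddof=0), which is the population variance: the mean of the squared deviations from the column mean. A minimal standard-library sketch of the same ranking follows. The three columns reuse the five rows of medval_cy, valbase_cy, and wlthindxcy shown in the enrich output above, and the function name is hypothetical.

```python
from statistics import pvariance

# Five rows per column, copied from the enrich output above.
columns = {
    "medval_cy": [1022727.0, 1519784.0, 634124.0, 789474.0, 684361.0],
    "valbase_cy": [366.0, 327.0, 229.0, 282.0, 429.0],
    "wlthindxcy": [357.0, 344.0, 93.0, 255.0, 207.0],
}


def top_variance(cols, n):
    """Return (name, population variance) pairs, highest variance first."""
    ranked = sorted(
        ((name, pvariance(values)) for name, values in cols.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]


for name, variance in top_variance(columns, 2):
    print(f"{name}: {variance:.3e}")
```

Because variance depends on the units of each column, currency-valued columns such as medval_cy dominate count-valued ones in this kind of ranking.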

Just as in the previous notebook, we can evaluate the variance and select the top variables.

# get just the enrich value columns
enrich_cols = [
    c for c in enrich_df if c in usa.enrich_variables.name.str.lower().values
]
enrich_df = enrich_df.set_index("std_geography_id").loc[:, enrich_cols]

# get top 20 highest variance columns
top20 = enrich_df.var(ddof=0).sort_values(ascending=False).iloc[:20]
top20.name = "variance"

# add human readable names
ev = usa.enrich_variables.copy()  # work on a copy so the Country's variable table is not modified
ev.index = ev.name.str.lower()
top20_df = ev.join(top20, how="right").loc[:, ["name", "alias", "variance"]]

top20_df.head()

	name	alias	variance
aggdi_cy	AGGDI_CY	2022 Aggregate Disposable Income	9.773137e+14
aggdi_cy	AGGDI_CY	2022 Aggregate Disposable Income	9.773137e+14
agghinc_cy	AGGHINC_CY	2022 Aggregate HH Income	2.220839e+15
agghinc_cy	AGGHINC_CY	2022 Aggregate HH Income	2.220839e+15
agghinc_cy	AGGHINC_CY	2022 Aggregate HH Income	2.220839e+15
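
The output above repeats rows, most likely because usa.enrich_variables lists the same variable name under more than one data collection, so each variance is matched several times in the join. One possible way to get a single row per variable, sketched here with small stand-in DataFrames, is to drop duplicate names before joining and then sort by variance. Only the two variance values and aliases come from the output above; the data_collection labels are placeholders.

```python
import pandas as pd

# Stand-in for usa.enrich_variables: one name appears in two data collections.
ev = pd.DataFrame(
    {
        "name": ["AGGDI_CY", "AGGDI_CY", "AGGHINC_CY"],
        "alias": [
            "2022 Aggregate Disposable Income",
            "2022 Aggregate Disposable Income",
            "2022 Aggregate HH Income",
        ],
        "data_collection": ["collection_a", "collection_b", "collection_a"],
    }
)

# Stand-in for the top-variance Series, indexed by lowercase variable name.
top = pd.Series({"agghinc_cy": 2.220839e15, "aggdi_cy": 9.773137e14}, name="variance")

# Keep one row per variable name and index it by the lowercase name.
lookup = ev.drop_duplicates("name")
lookup = lookup.set_index(lookup["name"].str.lower().rename("key"))

# Join the variances, keep the readable columns, highest variance first.
top_df = (
    lookup.join(top, how="right")
    .loc[:, ["name", "alias", "variance"]]
    .sort_values("variance", ascending=False)
)

print(top_df)
```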

Continuing Analysis

From here, a variety of techniques can be used, but with so many income and net worth variables, covariance needs to be addressed before any subsequent modeling steps. Using GeoEnrichment dramatically streamlines getting to this point. It provides easy access to thousands of demographic variables for modeling and analysis directly in Python, which makes it straightforward to integrate with data engineering pipelines.
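
One common way to address covariance, offered here as an assumption rather than as the approach prescribed by this guide, is to drop one variable out of every highly correlated pair. The sketch below does this greedily with pandas. The column names and values are made up, the function name is hypothetical, and the 0.95 threshold is arbitrary.

```python
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Keep each column only if its |Pearson r| with every kept column is <= threshold."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, other] <= threshold for other in kept):
            kept.append(col)
    return df[kept]


# Made-up values: income_b is an exact multiple of income_a; pop is only weakly related.
demo = pd.DataFrame(
    {
        "income_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "income_b": [2.0, 4.0, 6.0, 8.0, 10.0],
        "pop": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
)

print(list(drop_correlated(demo).columns))
```

The result depends on column order, since the first column of each correlated group is the one that is kept.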