Enrich Standard Geographies

Standard geographies are jurisdictional areas determined by government agencies. At the highest level these are the countries of the world. Within these countries the hierarchical levels have different names. If you already have a list of jurisdictional area identifiers such as postal (ZIP) codes or US Census Block Group Identifiers, these can be used directly as input to the enrich method to retrieve demographic information about these jurisdictional areas for analysis.

Example Use Case - Variable Variance

Just as before, we are going to retrieve high variance variables, but this time we are going to look up the unique identifiers for all the US Census Block Groups in Seattle.

Create a Country

Our analysis starts with identifying the country we are going to be working with and instantiating an arcgis.geoenrichment.Country object that references an arcgis.gis.GIS source to use for analysis.

from arcgis.geoenrichment import Country
from arcgis.gis import GIS

gis = GIS(profile="your_online_profile")
usa = Country("usa", gis=gis)

usa
<Country - United States (GIS @ https://geosaurus.maps.arcgis.com version:10.3)>

Selecting Data to Start

Next, we are using Pandas DataFrame filtering to identify a subset of variables to focus on from the thousands available.

# data collections to keep
collections = [
    "occupation",
    "Wealth",
    "financial",
    "educationalattainment",
    "language",
    "healthinsurancecoverage",
    "veterans",
    "yearmovedin",
    "yearbuilt",
    "population",
    "housingcosts",
]

all_vars = usa.enrich_variables

# keep current-year ("cy") variables from the selected data collections
enrich_vars = (
    all_vars[
        all_vars.name.str.lower().str.contains("cy")
        & all_vars.data_collection.isin(collections)
    ]
    .drop_duplicates("name")
    .reset_index(drop=True)
)

enrich_vars
| | name | alias | data_collection | enrich_name | enrich_field_name | description | vintage | units |
| 0 | NOHS_CY | 2022 Pop Age 25+: < 9th Grade | educationalattainment | educationalattainment.NOHS_CY | educationalattainment_NOHS_CY | 2022 Population Age 25+: Less than 9th Grade (... | 2022 | count |
| 1 | SOMEHS_CY | 2022 Pop Age 25+: High School/No Diploma | educationalattainment | educationalattainment.SOMEHS_CY | educationalattainment_SOMEHS_CY | 2022 Population Age 25+: 9-12th Grade/No Diplo... | 2022 | count |
| 2 | HSGRAD_CY | 2022 Pop Age 25+: High School Diploma | educationalattainment | educationalattainment.HSGRAD_CY | educationalattainment_HSGRAD_CY | 2022 Population Age 25+: High School Diploma (... | 2022 | count |
| 3 | GED_CY | 2022 Pop Age 25+: GED | educationalattainment | educationalattainment.GED_CY | educationalattainment_GED_CY | 2022 Population Age 25+: GED/Alternative Crede... | 2022 | count |
| 4 | SMCOLL_CY | 2022 Pop Age 25+: Some College/No Degree | educationalattainment | educationalattainment.SMCOLL_CY | educationalattainment_SMCOLL_CY | 2022 Population Age 25+: Some College/No Degre... | 2022 | count |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92 | VAL1M_CY | 2022 Home Value $1 Million-1499999 | Wealth | Wealth.VAL1M_CY | Wealth_VAL1M_CY | 2022 Home Value $1,000,000-$1,499,999 (Esri) | 2022 | count |
| 93 | MEDVAL_CY | 2022 Median Home Value | Wealth | Wealth.MEDVAL_CY | Wealth_MEDVAL_CY | 2022 Median Home Value (Esri) | 2022 | currency |
| 94 | AVGVAL_CY | 2022 Average Home Value | Wealth | Wealth.AVGVAL_CY | Wealth_AVGVAL_CY | 2022 Average Home Value (Esri) | 2022 | currency |
| 95 | VALBASE_CY | 2022 Home Value Base | Wealth | Wealth.VALBASE_CY | Wealth_VALBASE_CY | 2022 Owner Occupied Housing Units by Value Bas... | 2022 | count |
| 96 | WLTHINDXCY | 2022 Wealth Index | Wealth | Wealth.WLTHINDXCY | Wealth_WLTHINDXCY | 2022 Wealth Index (Esri) | 2022 | count |

97 rows × 8 columns

Get the Geographic Level

Here, we are retrieving the available levels and using the level_name column to discover valid values for the enrich method's standard_geography_level parameter.

usa.levels
| | level_name | singular_name | plural_name | alias | level_id | admin_level |
| 0 | block_groups | Block Group | Block Groups | Block Groups | US.BlockGroups | |
| 1 | tracts | Census Tract | Census Tracts | Census Tracts | US.Tracts | |
| 2 | places | Place | Places | Cities and Towns (Places) | US.Places | |
| 3 | zip5 | ZIP Code | ZIP Codes | ZIP Codes | US.ZIP5 | Admin4 |
| 4 | csd | County Subdivision | County Subdivisions | County Subdivisions | US.CSD | |
| 5 | counties | County | Counties | Counties | US.Counties | Admin3 |
| 6 | cbsa | CBSA | CBSAs | CBSAs | US.CBSA | |
| 7 | cd | Congressional District | Congressional Districts | Congressional Districts | US.CD | |
| 8 | dma | DMA | DMAs | DMAs | US.DMA | |
| 9 | states | State | States | States | US.States | Admin2 |
| 10 | whole_usa | United States of America | United States of America | Entire Country | US.WholeUSA | Admin1 |
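
The two naming schemes in this table are easy to mix up: the enrich method takes a level_name (for example block_groups), while standard_geography_query in the next step takes a level_id (for example US.BlockGroups). The sketch below shows one way to translate between them. It assumes pandas is available and uses a small hand-built DataFrame, with three rows copied from the output above, as a stand-in for usa.levels. The level_id_for function is a hypothetical helper written for illustration, not part of the ArcGIS API.

```python
import pandas as pd

# Stand-in for usa.levels: three rows copied from the output above.
levels = pd.DataFrame(
    {
        "level_name": ["block_groups", "places", "zip5"],
        "level_id": ["US.BlockGroups", "US.Places", "US.ZIP5"],
    }
)


def level_id_for(levels_df: pd.DataFrame, level_name: str) -> str:
    """Return the level_id matching a level_name (hypothetical helper)."""
    match = levels_df.loc[levels_df["level_name"] == level_name, "level_id"]
    if match.empty:
        raise KeyError(f"unknown level_name: {level_name!r}")
    return match.iloc[0]


print(level_id_for(levels, "block_groups"))  # US.BlockGroups
```

Raising on an unknown name, rather than returning None, surfaces a mistyped level before any request is sent.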

Retrieve Seattle Block Groups

It is not uncommon to already have the standard geography unique identifiers. If you instead need to look up the identifiers that fall within a larger area, you can retrieve them using standard_geography_query. The most versatile parameter in this method is geoquery. You can find more explanation of its options under the geographyQuery parameter documentation. We can start by seeing what is returned when searching for seattle.

from arcgis.geoenrichment import standard_geography_query

standard_geography_query("usa", layers="US.Places", geoquery="seattle")
| | DatasetID | Hierarchy | DataLayerID | AreaID | AreaName | MajorSubdivisionName | MajorSubdivisionAbbr | MajorSubdivisionType | CountryAbbr | Score | ObjectId |
| 0 | USA_ESRI_2022 | census2020 | US.Places | 5363000 | Seattle city | Washington | WA | State | US | 100 | 1 |

Since only one location is returned, we can use the same query to retrieve the block groups within it by populating the sub-geography parameters, sub_geography_layer and return_sub_geography.

bg_df = standard_geography_query(
    "usa",
    layers="US.Places",
    geoquery="seattle",
    sub_geography_layer="US.BlockGroups",
    return_sub_geography=True,
)

bg_df.head()
| | DatasetID | Hierarchy | DataLayerID | AreaID | AreaName | MajorSubdivisionName | MajorSubdivisionAbbr | MajorSubdivisionType | CountryAbbr | Score | ObjectId |
| 0 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330009001 | 530330009.001 | Washington | WA | State | US | 100 | 1 |
| 1 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330009002 | 530330009.002 | Washington | WA | State | US | 100 | 2 |
| 2 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330010001 | 530330010.001 | Washington | WA | State | US | 100 | 3 |
| 3 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330010002 | 530330010.002 | Washington | WA | State | US | 100 | 4 |
| 4 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330011002 | 530330011.002 | Washington | WA | State | US | 100 | 5 |
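
The AreaID values returned here follow the standard Census block group identifier layout: 12 digits made up of a 2-digit state FIPS code, a 3-digit county code, a 6-digit tract code, and a 1-digit block group number. A minimal sketch of splitting and sanity-checking an identifier before passing it to enrich follows. It uses only the standard library, and split_block_group_id is a hypothetical helper written for illustration.

```python
from typing import NamedTuple


class BlockGroupId(NamedTuple):
    state: str
    county: str
    tract: str
    block_group: str


def split_block_group_id(geoid: str) -> BlockGroupId:
    """Split a 12-digit block group identifier into its parts (hypothetical helper)."""
    if len(geoid) != 12 or not (geoid.isascii() and geoid.isdigit()):
        raise ValueError(f"expected 12 digits, got {geoid!r}")
    return BlockGroupId(geoid[:2], geoid[2:5], geoid[5:11], geoid[11])


# First AreaID from the output above: state 53 is Washington, county 033 is King County.
parts = split_block_group_id("530330009001")
print(parts)
```

The parts are kept as strings on purpose: converting identifiers to integers would drop the leading zeros that some state and county codes carry.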

Enrich

Now, we can use the retrieved block groups as input to the enrich method to achieve the same results.

enrich_df = usa.enrich(
    bg_df,
    enrich_variables=enrich_vars,
    standard_geography_level="block_groups",
    standard_geography_id_column="AreaID",
)

enrich_df.head()
| | std_geography_level | std_geography_name | std_geography_id | source_country | aggregation_method | population_to_polygon_size_rating | apportionment_confidence | has_data | nohs_cy | somehs_cy | ... | val300k_cy | val400k_cy | val500k_cy | val750k_cy | val1m_cy | medval_cy | avgval_cy | valbase_cy | wlthindxcy | SHAPE |
| 0 | US.BlockGroups | 530330009.001 | 530330009001 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 13.0 | 7.0 | ... | 2.0 | 2.0 | 68.0 | 103.0 | 176.0 | 1022727.0 | 1040915.0 | 366.0 | 357.0 | {"rings": [[[-122.280028912264, 47.71915262818... |
| 1 | US.BlockGroups | 530330009.002 | 530330009002 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 0.0 | 0.0 | ... | 0.0 | 3.0 | 29.0 | 41.0 | 85.0 | 1519784.0 | 1444495.0 | 327.0 | 344.0 | {"rings": [[[-122.27725891127136, 47.716030628... |
| 2 | US.BlockGroups | 530330010.001 | 530330010001 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 0.0 | 29.0 | ... | 10.0 | 31.0 | 137.0 | 46.0 | 4.0 | 634124.0 | 655349.0 | 229.0 | 93.0 | {"rings": [[[-122.29380691276572, 47.711973626... |
| 3 | US.BlockGroups | 530330010.002 | 530330010002 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 5.0 | 9.0 | ... | 5.0 | 10.0 | 114.0 | 76.0 | 74.0 | 789474.0 | 857270.0 | 282.0 | 255.0 | {"rings": [[[-122.2908289113606, 47.7067966266... |
| 4 | US.BlockGroups | 530330011.002 | 530330011002 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 0.0 | 8.0 | ... | 11.0 | 41.0 | 219.0 | 130.0 | 9.0 | 684361.0 | 738695.0 | 429.0 | 207.0 | {"rings": [[[-122.30622891353259, 47.706644624... |

5 rows × 106 columns

Calculate Variance

Variance can now be calculated for the retrieved variables to identify those with exceptionally high variance. This kind of analysis can be used for feature selection or feature reduction, to address covariance between variables before modeling.
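
The calculation in this section uses var(ddof=0), the population variance, which divides by the number of observations rather than by one less. Variance is also expressed in squared units of the variable, so currency-valued variables will generally dwarf count-valued ones. The sketch below illustrates both points with the standard library's statistics module rather than pandas, using the five nohs_cy and medval_cy values shown in the enrich output above.

```python
from statistics import pvariance, variance

# Values from the first five rows of the enrich output above.
counts = [13.0, 0.0, 0.0, 5.0, 0.0]  # nohs_cy (count)
home_values = [1022727.0, 1519784.0, 634124.0, 789474.0, 684361.0]  # medval_cy (currency)

pop_var = pvariance(counts)  # divides by n, like var(ddof=0)
samp_var = variance(counts)  # divides by n - 1, like var(ddof=1)

print(pop_var, samp_var)

# Currency values are orders of magnitude larger, so their variance dominates.
print(pvariance(home_values) > pop_var)  # True
```

Because of this scale effect, a ranking by raw variance mostly reflects units and magnitude, which is worth keeping in mind when reading the results.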

Just as in the previous notebook, we can evaluate the variance and select the top variables.

# get just the enrich value columns
enrich_cols = [
    c for c in enrich_df if c in usa.enrich_variables.name.str.lower().values
]
enrich_df = enrich_df.set_index("std_geography_id").loc[:, enrich_cols]

# get top 20 highest variance columns
top20 = enrich_df.var(ddof=0).sort_values(ascending=False).iloc[:20]
top20.name = "variance"

# add human readable names
ev = usa.enrich_variables
ev.index = ev.name.str.lower()
top20_df = ev.join(top20, how="right").loc[:, ["name", "alias", "variance"]]

top20_df.head()
| | name | alias | variance |
| aggdi_cy | AGGDI_CY | 2022 Aggregate Disposable Income | 9.773137e+14 |
| aggdi_cy | AGGDI_CY | 2022 Aggregate Disposable Income | 9.773137e+14 |
| agghinc_cy | AGGHINC_CY | 2022 Aggregate HH Income | 2.220839e+15 |
| agghinc_cy | AGGHINC_CY | 2022 Aggregate HH Income | 2.220839e+15 |
| agghinc_cy | AGGHINC_CY | 2022 Aggregate HH Income | 2.220839e+15 |
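
The repeated rows in this output appear to come from the join: usa.enrich_variables can list the same variable name under more than one data collection, so the lowercase-name index is not unique and each match is emitted once per duplicate. A minimal sketch of one way around this, dropping duplicate names before the lookup, is shown below. It assumes pandas and uses small stand-ins for usa.enrich_variables and the variance series. The names, aliases, and variance values are copied from the output above, while the data collection labels are invented for illustration.

```python
import pandas as pd

# Stand-in for usa.enrich_variables, with AGGDI_CY listed twice
# under different (made-up) data collections.
ev = pd.DataFrame(
    {
        "name": ["AGGDI_CY", "AGGDI_CY", "AGGHINC_CY"],
        "alias": [
            "2022 Aggregate Disposable Income",
            "2022 Aggregate Disposable Income",
            "2022 Aggregate HH Income",
        ],
        "data_collection": ["collection_a", "collection_b", "collection_a"],
    }
)

# Stand-in for the top-variance series, indexed by lowercase variable name.
top = pd.Series(
    {"agghinc_cy": 2.220839e15, "aggdi_cy": 9.773137e14}, name="variance"
)

# Work on a de-duplicated copy so the original DataFrame is left untouched.
lookup = ev.drop_duplicates("name").copy()
lookup.index = lookup["name"].str.lower().rename("key")

# Select rows in the order of the variance series, highest variance first.
top_df = lookup.loc[top.index, ["name", "alias"]].assign(variance=top)
print(top_df)
```

Working on a copy also avoids reassigning the index of the DataFrame returned by usa.enrich_variables, which the cell above does in place.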

Continuing Analysis

From here, a variety of techniques can be used, but with so many income and net worth variables, covariance needs to be addressed before subsequent modeling steps. Using GeoEnrichment dramatically streamlines getting to this point. It provides easy access to thousands of demographic variables for modeling and analysis directly in Python, making it straightforward to integrate with data engineering pipelines.
