Standard geographies are jurisdictional areas defined by government agencies. At the highest level these are the countries of the world; within each country, the hierarchical levels have different names. If you already have a list of jurisdictional area identifiers, such as postal (ZIP) codes or US Census Block Group identifiers, you can pass them directly to the enrich method to retrieve demographic information about those areas for analysis.
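Identifiers such as ZIP codes and Census identifiers are codes rather than numbers, so leading zeros are significant. The following is a minimal sketch of normalizing identifiers before using them as input. The helper name normalize_ids and the sample values are hypothetical, and it assumes some identifiers were parsed as integers when the list was loaded.

```python
def normalize_ids(raw_ids, width):
    """Return identifiers as zero-padded strings of a fixed width."""
    return [str(raw).strip().zfill(width) for raw in raw_ids]


# hypothetical input: two ZIP codes parsed as integers, one already a string
zip_codes = normalize_ids([2134, 98101, "07030"], width=5)
print(zip_codes)
```

Keeping identifiers as strings from the start avoids the need for this step.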
Example Use Case - Variable Variance
Just as before, we are going to retrieve high variance variables, but this time we are going to look up the unique identifiers for all the US Census Block Groups in Seattle.
Create a Country
Our analysis starts by identifying the country we are working with and instantiating an arcgis.geoenrichment.Country object that references an arcgis.gis.GIS source to use for analysis.
from arcgis.geoenrichment import Country
from arcgis.gis import GIS
gis = GIS(profile="your_online_profile")
usa = Country("usa", gis=gis)
usa
<Country - United States (GIS @ https://geosaurus.maps.arcgis.com version:10.3)>
Selecting Data to Start
Next, we are using Pandas DataFrame filtering to identify a subset of variables to focus on from the thousands available.
# data collections to keep (the original filter listed "occupation" twice)
collections = [
    "occupation",
    "Wealth",
    "financial",
    "educationalattainment",
    "language",
    "healthinsurancecoverage",
    "veterans",
    "yearmovedin",
    "yearbuilt",
    "population",
    "housingcosts",
]
all_vars = usa.enrich_variables
enrich_vars = (
    all_vars[
        # current-year variables: names containing "cy"
        all_vars.name.str.lower().str.contains("cy")
        & all_vars.data_collection.isin(collections)
    ]
    # the same variable can appear in several collections; keep one row per name
    .drop_duplicates("name")
    .reset_index(drop=True)
)
enrich_vars
| | name | alias | data_collection | enrich_name | enrich_field_name | description | vintage | units |
---|---|---|---|---|---|---|---|---|
0 | NOHS_CY | 2022 Pop Age 25+: < 9th Grade | educationalattainment | educationalattainment.NOHS_CY | educationalattainment_NOHS_CY | 2022 Population Age 25+: Less than 9th Grade (... | 2022 | count |
1 | SOMEHS_CY | 2022 Pop Age 25+: High School/No Diploma | educationalattainment | educationalattainment.SOMEHS_CY | educationalattainment_SOMEHS_CY | 2022 Population Age 25+: 9-12th Grade/No Diplo... | 2022 | count |
2 | HSGRAD_CY | 2022 Pop Age 25+: High School Diploma | educationalattainment | educationalattainment.HSGRAD_CY | educationalattainment_HSGRAD_CY | 2022 Population Age 25+: High School Diploma (... | 2022 | count |
3 | GED_CY | 2022 Pop Age 25+: GED | educationalattainment | educationalattainment.GED_CY | educationalattainment_GED_CY | 2022 Population Age 25+: GED/Alternative Crede... | 2022 | count |
4 | SMCOLL_CY | 2022 Pop Age 25+: Some College/No Degree | educationalattainment | educationalattainment.SMCOLL_CY | educationalattainment_SMCOLL_CY | 2022 Population Age 25+: Some College/No Degre... | 2022 | count |
... | ... | ... | ... | ... | ... | ... | ... | ... |
92 | VAL1M_CY | 2022 Home Value $1 Million-1499999 | Wealth | Wealth.VAL1M_CY | Wealth_VAL1M_CY | 2022 Home Value $1,000,000-$1,499,999 (Esri) | 2022 | count |
93 | MEDVAL_CY | 2022 Median Home Value | Wealth | Wealth.MEDVAL_CY | Wealth_MEDVAL_CY | 2022 Median Home Value (Esri) | 2022 | currency |
94 | AVGVAL_CY | 2022 Average Home Value | Wealth | Wealth.AVGVAL_CY | Wealth_AVGVAL_CY | 2022 Average Home Value (Esri) | 2022 | currency |
95 | VALBASE_CY | 2022 Home Value Base | Wealth | Wealth.VALBASE_CY | Wealth_VALBASE_CY | 2022 Owner Occupied Housing Units by Value Bas... | 2022 | count |
96 | WLTHINDXCY | 2022 Wealth Index | Wealth | Wealth.WLTHINDXCY | Wealth_WLTHINDXCY | 2022 Wealth Index (Esri) | 2022 | count |
97 rows × 8 columns
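The filter above combines three steps: a case-insensitive substring match on the variable name, a membership test on the data collection, and de-duplication by name. As a rough illustration of that logic outside of Pandas, here is a sketch over a handful of plain dictionaries. The first two rows use names from the output above; the remaining rows and the reduced set of collections are hypothetical and exist only to exercise each branch.

```python
rows = [
    {"name": "NOHS_CY", "data_collection": "educationalattainment"},
    {"name": "MEDVAL_CY", "data_collection": "Wealth"},
    # hypothetical: same variable listed again under another wanted collection
    {"name": "MEDVAL_CY", "data_collection": "housingcosts"},
    # hypothetical: name has no "cy", so it is filtered out
    {"name": "MEDVAL_FY", "data_collection": "Wealth"},
    # hypothetical: collection is not in the wanted set
    {"name": "TOTHH_CY", "data_collection": "somethingelse"},
]
wanted = {"educationalattainment", "Wealth", "housingcosts"}

selected, seen = [], set()
for row in rows:
    if "cy" not in row["name"].lower():
        continue  # mirrors .str.lower().str.contains("cy")
    if row["data_collection"] not in wanted:
        continue  # mirrors the data_collection comparison
    if row["name"] in seen:
        continue  # mirrors .drop_duplicates("name"), keeping the first match
    seen.add(row["name"])
    selected.append(row)

print([r["name"] for r in selected])
```

Note that the substring test matches "cy" anywhere in the name, not only as a suffix.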
Get the Geographic Level
Here, we retrieve levels and use the level_name column to discover valid values for the enrich method's standard_geography_level parameter.
usa.levels
| | level_name | singular_name | plural_name | alias | level_id | admin_level |
---|---|---|---|---|---|---|
0 | block_groups | Block Group | Block Groups | Block Groups | US.BlockGroups | |
1 | tracts | Census Tract | Census Tracts | Census Tracts | US.Tracts | |
2 | places | Place | Places | Cities and Towns (Places) | US.Places | |
3 | zip5 | ZIP Code | ZIP Codes | ZIP Codes | US.ZIP5 | Admin4 |
4 | csd | County Subdivision | County Subdivisions | County Subdivisions | US.CSD | |
5 | counties | County | Counties | Counties | US.Counties | Admin3 |
6 | cbsa | CBSA | CBSAs | CBSAs | US.CBSA | |
7 | cd | Congressional District | Congressional Districts | Congressional Districts | US.CD | |
8 | dma | DMA | DMAs | DMAs | US.DMA | |
9 | states | State | States | States | US.States | Admin2 |
10 | whole_usa | United States of America | United States of America | Entire Country | US.WholeUSA | Admin1 |
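In this guide, the enrich call is given a level_name value ("block_groups"), while the standard_geography_query calls are given level_id values ("US.Places", "US.BlockGroups"). A small lookup built from the table above can help keep the two in sync. This is only a sketch: the pairs are copied by hand from the output above, and the helper name to_level_id is hypothetical.

```python
# (level_name, level_id) pairs copied from the usa.levels output above
LEVELS = [
    ("block_groups", "US.BlockGroups"),
    ("tracts", "US.Tracts"),
    ("places", "US.Places"),
    ("zip5", "US.ZIP5"),
    ("counties", "US.Counties"),
    ("states", "US.States"),
]
LEVEL_ID_BY_NAME = dict(LEVELS)


def to_level_id(level_name):
    """Translate a level_name into the matching level_id."""
    try:
        return LEVEL_ID_BY_NAME[level_name]
    except KeyError:
        raise ValueError(f"unknown level_name: {level_name!r}") from None


print(to_level_id("block_groups"))
```

In a live session the same mapping could be derived from the levels DataFrame itself rather than typed in.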
Retrieve Seattle Block Groups
It is not uncommon to already have the standard geography unique identifiers. If you instead need to look up the identifiers within a larger area, you can retrieve them using standard_geography_query. The most versatile parameter in this method is geoquery. You can find more explanation of the options for the geoquery parameter under the geographyQuery parameter documentation. We can start by seeing what is returned when searching for seattle.
from arcgis.geoenrichment import standard_geography_query
standard_geography_query("usa", layers="US.Places", geoquery="seattle")
| | DatasetID | Hierarchy | DataLayerID | AreaID | AreaName | MajorSubdivisionName | MajorSubdivisionAbbr | MajorSubdivisionType | CountryAbbr | Score | ObjectId |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | USA_ESRI_2022 | census2020 | US.Places | 5363000 | Seattle city | Washington | WA | State | US | 100 | 1 |
Since only one location is returned, we can use the same query to retrieve its block groups by populating the sub_geography_layer and return_sub_geography parameters.
bg_df = standard_geography_query(
    "usa",
    layers="US.Places",
    geoquery="seattle",
    sub_geography_layer="US.BlockGroups",
    return_sub_geography=True,
)
bg_df.head()
| | DatasetID | Hierarchy | DataLayerID | AreaID | AreaName | MajorSubdivisionName | MajorSubdivisionAbbr | MajorSubdivisionType | CountryAbbr | Score | ObjectId |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330009001 | 530330009.001 | Washington | WA | State | US | 100 | 1 |
1 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330009002 | 530330009.002 | Washington | WA | State | US | 100 | 2 |
2 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330010001 | 530330010.001 | Washington | WA | State | US | 100 | 3 |
3 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330010002 | 530330010.002 | Washington | WA | State | US | 100 | 4 |
4 | USA_ESRI_2022 | census2020 | US.BlockGroups | 530330011002 | 530330011.002 | Washington | WA | State | US | 100 | 5 |
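The AreaID values returned for block groups are 12-digit Census GEOIDs: a 2-digit state FIPS code, a 3-digit county code, a 6-digit tract code, and a 1-digit block group number. The sketch below splits an identifier into those parts, which can be handy for sanity checks or for grouping block groups by tract. The class and function names are illustrative only and are not part of the arcgis package.

```python
from typing import NamedTuple


class BlockGroupId(NamedTuple):
    state: str
    county: str
    tract: str
    block_group: str


def parse_block_group_id(geoid):
    """Split a 12-digit block group GEOID into its components."""
    geoid = str(geoid)
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f"expected a 12-digit identifier, got {geoid!r}")
    return BlockGroupId(geoid[:2], geoid[2:5], geoid[5:11], geoid[11])


# first AreaID from the output above
parsed = parse_block_group_id("530330009001")
print(parsed)
```

Here state 53 is Washington and county 033 is King County, which matches the Seattle query.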
Enrich
Now, we can use the retrieved block groups as input to the enrich method to achieve the same results.
enrich_df = usa.enrich(
    bg_df,
    enrich_variables=enrich_vars,
    standard_geography_level="block_groups",
    standard_geography_id_column="AreaID",
)
enrich_df.head()
| | std_geography_level | std_geography_name | std_geography_id | source_country | aggregation_method | population_to_polygon_size_rating | apportionment_confidence | has_data | nohs_cy | somehs_cy | ... | val300k_cy | val400k_cy | val500k_cy | val750k_cy | val1m_cy | medval_cy | avgval_cy | valbase_cy | wlthindxcy | SHAPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | US.BlockGroups | 530330009.001 | 530330009001 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 13.0 | 7.0 | ... | 2.0 | 2.0 | 68.0 | 103.0 | 176.0 | 1022727.0 | 1040915.0 | 366.0 | 357.0 | {"rings": [[[-122.280028912264, 47.71915262818... |
1 | US.BlockGroups | 530330009.002 | 530330009002 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 0.0 | 0.0 | ... | 0.0 | 3.0 | 29.0 | 41.0 | 85.0 | 1519784.0 | 1444495.0 | 327.0 | 344.0 | {"rings": [[[-122.27725891127136, 47.716030628... |
2 | US.BlockGroups | 530330010.001 | 530330010001 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 0.0 | 29.0 | ... | 10.0 | 31.0 | 137.0 | 46.0 | 4.0 | 634124.0 | 655349.0 | 229.0 | 93.0 | {"rings": [[[-122.29380691276572, 47.711973626... |
3 | US.BlockGroups | 530330010.002 | 530330010002 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 5.0 | 9.0 | ... | 5.0 | 10.0 | 114.0 | 76.0 | 74.0 | 789474.0 | 857270.0 | 282.0 | 255.0 | {"rings": [[[-122.2908289113606, 47.7067966266... |
4 | US.BlockGroups | 530330011.002 | 530330011002 | USA | Query:US.BlockGroups | 2.191 | 2.576 | 1 | 0.0 | 8.0 | ... | 11.0 | 41.0 | 219.0 | 130.0 | 9.0 | 684361.0 | 738695.0 | 429.0 | 207.0 | {"rings": [[[-122.30622891353259, 47.706644624... |
5 rows × 106 columns
Calculate Variance
Variance can now be calculated for the retrieved variables to identify those with exceedingly high variance. This analysis can be used for feature selection or feature reduction, addressing covariance between variables before modeling.
Just as in the previous notebook, we can evaluate the variance and select the top variables.
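The cell that follows calls var(ddof=0), which is the population variance (dividing by n). The Pandas default is ddof=1, the sample variance (dividing by n - 1). As a small illustration of the difference, the sketch below uses the standard library on the five medval_cy values visible in the enrich_df.head() output above; five rows are far too few for real analysis and serve only to make the arithmetic concrete.

```python
from math import isclose
from statistics import pvariance, variance

# medval_cy values from the five rows of enrich_df.head() shown earlier
medval = [1022727.0, 1519784.0, 634124.0, 789474.0, 684361.0]
n = len(medval)

pop_var = pvariance(medval)    # divides by n, like DataFrame.var(ddof=0)
sample_var = variance(medval)  # divides by n - 1, like DataFrame.var(ddof=1)

# the two differ only by the factor n / (n - 1)
print(isclose(sample_var, pop_var * n / (n - 1)))
```

Because variance is expressed in squared units, variables measured in currency totals will tend to outrank counts in a raw variance ranking.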
# get just the enrich value columns
enrich_cols = [
    c for c in enrich_df if c in usa.enrich_variables.name.str.lower().values
]
enrich_df = enrich_df.set_index("std_geography_id").loc[:, enrich_cols]
# get top 20 highest variance columns
top20 = enrich_df.var(ddof=0).sort_values(ascending=False).iloc[:20]
top20.name = "variance"
# add human readable names
ev = usa.enrich_variables
ev.index = ev.name.str.lower()
top20_df = ev.join(top20, how="right").loc[:, ["name", "alias", "variance"]]
top20_df.head()
| | name | alias | variance |
---|---|---|---|
aggdi_cy | AGGDI_CY | 2022 Aggregate Disposable Income | 9.773137e+14 |
aggdi_cy | AGGDI_CY | 2022 Aggregate Disposable Income | 9.773137e+14 |
agghinc_cy | AGGHINC_CY | 2022 Aggregate HH Income | 2.220839e+15 |
agghinc_cy | AGGHINC_CY | 2022 Aggregate HH Income | 2.220839e+15 |
agghinc_cy | AGGHINC_CY | 2022 Aggregate HH Income | 2.220839e+15 |
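The repeated rows in this output are consistent with usa.enrich_variables listing the same variable name under more than one data collection, so each variance matches several metadata rows in the join. One way to avoid this is to de-duplicate the metadata by name before joining, as was done earlier with drop_duplicates("name") for enrich_vars. The plain-Python sketch below illustrates the idea with the two variables shown above; the metadata list is a hand-built stand-in and not real service output.

```python
# stand-in metadata with repeated names, mimicking the duplicated rows above
metadata = [
    ("AGGDI_CY", "2022 Aggregate Disposable Income"),
    ("AGGDI_CY", "2022 Aggregate Disposable Income"),
    ("AGGHINC_CY", "2022 Aggregate HH Income"),
    ("AGGHINC_CY", "2022 Aggregate HH Income"),
    ("AGGHINC_CY", "2022 Aggregate HH Income"),
]
# variances copied from the output above
variances = {"aggdi_cy": 9.773137e14, "agghinc_cy": 2.220839e15}

# keep the first alias seen for each lowercased name
alias_by_name = {}
for name, alias in metadata:
    alias_by_name.setdefault(name.lower(), alias)

# one row per variable, highest variance first
ranked = sorted(variances.items(), key=lambda item: item[1], reverse=True)
table = [(key, alias_by_name[key], var) for key, var in ranked]
for row in table:
    print(row)
```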
Continuing Analysis
From here, a variety of techniques can be used, but with so many income and net worth variables, covariance needs to be addressed before subsequent modeling steps. Using GeoEnrichment dramatically streamlines getting to this point: it provides easy access to thousands of demographic variables for modeling and analysis directly in Python, which makes it straightforward to integrate with data engineering pipelines.
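One simple way to start addressing covariance is to drop variables that are nearly collinear with a variable already kept. The sketch below is an illustration under stated assumptions, not a recommended procedure: it uses a hand-written Pearson correlation, a hypothetical threshold of 0.95, and only the five rows of medval_cy, avgval_cy and wlthindxcy visible in the enrich_df.head() output earlier.

```python
from math import sqrt

# values copied from the five rows of enrich_df.head() shown earlier
columns = {
    "medval_cy": [1022727.0, 1519784.0, 634124.0, 789474.0, 684361.0],
    "avgval_cy": [1040915.0, 1444495.0, 655349.0, 857270.0, 738695.0],
    "wlthindxcy": [357.0, 344.0, 93.0, 255.0, 207.0],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(spread_x * spread_y)


THRESHOLD = 0.95  # hypothetical cutoff, chosen only for illustration

# greedily keep a variable only if it is not too correlated with any kept one
kept = []
for name, values in columns.items():
    if all(abs(pearson(values, columns[k])) < THRESHOLD for k in kept):
        kept.append(name)

print(kept)
```

With the full enriched table, the same idea could be applied through a Pandas correlation matrix, and the threshold would need to be chosen for the modeling task at hand.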