Part 4 - What to enrich with? (What are Data Collections and Analysis Variables?)

Data Collections and GeoEnrichment coverage

As described earlier, a data collection is a preassembled list of attributes that will be used to enrich the input features. Collection attributes can describe various types of information, such as demographic characteristics and geographic context of the locations or areas submitted as input features.

Some data collections (such as default) can be used in all supported countries. Other data collections may be available only in a single country or a group of countries. Data Browser can be used to examine the entire global listing of variables and the associated datasets for each country.

List Countries with GeoEnrichment Data

The get_countries() method returns the countries for which GeoEnrichment data is available. In the output below, the result is a table with one row per country, which you can query further for properties such as the available datasets. This list can also be viewed here.

from arcgis.gis import GIS
from arcgis.geoenrichment import Country, enrich, get_countries

# Create a GIS Connection
gis = GIS(profile='your_online_profile')

countries = get_countries()
print("Number of countries for which GeoEnrichment data is available: " + str(len(countries)))

# Print a few countries as a sample
countries[0:10]

Number of countries for which GeoEnrichment data is available: 177

	iso2	iso3	name	alt_name	datasets	default_dataset	continent	hierarchy
0	AL	ALB	Albania	ALBANIA	[ALB_MBR_2021]	ALB_MBR_2021	Europe	[census]
1	DZ	DZA	Algeria	ALGERIA	[DZA_MBR_2021]	DZA_MBR_2021	Africa	[census]
2	AD	AND	Andorra	ANDORRA	[AND_MBR_2021]	AND_MBR_2021	Europe	[census]
3	AO	AGO	Angola	ANGOLA	[AGO_MBR_2021]	AGO_MBR_2021	Africa	[census]
4	AI	AIA	Anguilla	ANGUILLA	[AIA_MBR_2020]	AIA_MBR_2020	North America	[census]
5	AR	ARG	Argentina	ARGENTINA	[ARG_MBR_2020]	ARG_MBR_2020	South America	[census]
6	AM	ARM	Armenia	ARMENIA	[ARM_MBR_2020]	ARM_MBR_2020	Europe	[census]
7	AW	ABW	Aruba	ARUBA	[ABW_MBR_2020]	ABW_MBR_2020	North America	[census]
8	AU	AUS	Australia	AUSTRALIA	[AUS_ABS_2016, AUS_MBR_2020]	AUS_ABS_2016	Oceania	[AUS_ABS, census]
9	AT	AUT	Austria	AUSTRIA	[AUT_MBR_2021]	AUT_MBR_2021	Europe	[census]
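The datasets and default_dataset columns show that a country can carry more than one dataset (Australia lists two in the sample above). As a rough sketch, assuming a few rows of that table have been copied into plain Python dictionaries (sample_rows and the helper function are illustrative names, not part of arcgis.geoenrichment), such countries could be flagged like this:

```python
# Rows copied from the sample output above; only the columns used here are kept.
sample_rows = [
    {"iso2": "AL", "name": "Albania", "datasets": ["ALB_MBR_2021"],
     "default_dataset": "ALB_MBR_2021"},
    {"iso2": "AU", "name": "Australia", "datasets": ["AUS_ABS_2016", "AUS_MBR_2020"],
     "default_dataset": "AUS_ABS_2016"},
    {"iso2": "AT", "name": "Austria", "datasets": ["AUT_MBR_2021"],
     "default_dataset": "AUT_MBR_2021"},
]

def countries_with_alternatives(rows):
    """Return {country name: non-default datasets} for countries with more than one dataset."""
    result = {}
    for row in rows:
        extras = [d for d in row["datasets"] if d != row["default_dataset"]]
        if extras:
            result[row["name"]] = extras
    return result

alternatives = countries_with_alternatives(sample_rows)
print(alternatives)
```

The same check could be run against the real table by iterating over its rows instead of the hand-copied sample.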

Data Collections for U.S.

The data_collections property of a Country object lists its available data collections and analysis variables under each data collection as a Pandas dataframe.

To discover the data collections for a particular country, first get a reference to the country using the Country.get() method, and then fetch the data collections from its data_collections property. Once we know the data collection we would like to use, we can look at the analysisVariables available in that data collection.

# Get US as a country
usa = Country.get('US')
type(usa)

arcgis.geoenrichment.enrichment.Country

usa_df = usa.data_collections

# print a few rows of the DataFrame
usa_df.head()

	analysisVariable	alias	fieldCategory	vintage
dataCollectionID
1yearincrements	1yearincrements.AGE0_CY	2022 Population Age <1	2022 Age: 1 Year Increments (Esri)	2022
1yearincrements	1yearincrements.AGE1_CY	2022 Population Age 1	2022 Age: 1 Year Increments (Esri)	2022
1yearincrements	1yearincrements.AGE2_CY	2022 Population Age 2	2022 Age: 1 Year Increments (Esri)	2022
1yearincrements	1yearincrements.AGE3_CY	2022 Population Age 3	2022 Age: 1 Year Increments (Esri)	2022
1yearincrements	1yearincrements.AGE4_CY	2022 Population Age 4	2022 Age: 1 Year Increments (Esri)	2022

usa_df.shape

(18946, 4)

Unique Data Collections for U.S.

Each data collection and analysis variable has a unique ID. When calling the enrich() method (explained earlier in this guide) these analysis variables can be passed in the data_collections and analysis_variables parameters.

As an example, here we see a subset of the data collections for US showing 2 different data collections and multiple analysis variables for each collection.

usa_df.iloc[500:600,:]

	analysisVariable	alias	fieldCategory	vintage
dataCollectionID
1yearincrements	1yearincrements.FAGE75_FY	2027 Females Age 75	2027 Age: 1 Year Increments (Esri)	2027
1yearincrements	1yearincrements.FAGE76_FY	2027 Females Age 76	2027 Age: 1 Year Increments (Esri)	2027
1yearincrements	1yearincrements.FAGE77_FY	2027 Females Age 77	2027 Age: 1 Year Increments (Esri)	2027
1yearincrements	1yearincrements.FAGE78_FY	2027 Females Age 78	2027 Age: 1 Year Increments (Esri)	2027
1yearincrements	1yearincrements.FAGE79_FY	2027 Females Age 79	2027 Age: 1 Year Increments (Esri)	2027
...	...	...	...	...
5yearincrements	5yearincrements.MEDAGE_CY	2022 Median Age	2022 Age: 5 Year Increments (Esri)	2022
5yearincrements	5yearincrements.MALES_CY	2022 Male Population	2022 Age: 5 Year Increments (Esri)	2022
5yearincrements	5yearincrements.MALE0_CY	2022 Males Age 0-4	2022 Age: 5 Year Increments (Esri)	2022
5yearincrements	5yearincrements.MALE5_CY	2022 Males Age 5-9	2022 Age: 5 Year Increments (Esri)	2022
5yearincrements	5yearincrements.MALE10_CY	2022 Males Age 10-14	2022 Age: 5 Year Increments (Esri)	2022

100 rows × 4 columns

The table above shows 2 different data collections (1yearincrements and 5yearincrements). Since these are Age data collections, the analysisVariables for these collections are similar. vintage shows the year that the demographic data represents. For example, a vintage of 2020 means that the data represents the year 2020.

Let's get a list of unique data collections that are available for U.S.

usa_df.index.nunique()

United States has 150 unique data collections. Here are the first 10 data collections.

list(usa_df.index.unique())[:10]

['1yearincrements',
 '5yearincrements',
 'Age',
 'agebyracebysex',
 'AgeDependency',
 'AtRisk',
 'AutomobilesAutomotiveProducts',
 'BabyProductsToysGames',
 'basicFactsForMobileApps',
 'businesses']
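Because each analysisVariable starts with its dataCollectionID, the number of collections and the number of variables in each can also be counted from the variable IDs alone. The snippet below is a small sketch of that idea using a handful of IDs taken from the outputs in this guide; the variable names (analysis_variables, per_collection) are only for illustration.

```python
from collections import Counter

# A handful of analysisVariable IDs taken from the outputs shown in this guide.
analysis_variables = [
    "1yearincrements.AGE0_CY",
    "1yearincrements.AGE1_CY",
    "1yearincrements.FAGE75_FY",
    "5yearincrements.MEDAGE_CY",
    "5yearincrements.MALES_CY",
    "Age.MALE0",
]

# The text before the first dot is the dataCollectionID.
per_collection = Counter(var.split(".", 1)[0] for var in analysis_variables)

print(len(per_collection))           # number of distinct collections in the sample
print(per_collection.most_common())  # variables per collection, largest first
```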

Looking at fieldCategory is a great way to understand what a data collection is about. Each fieldCategory value combines the vintage (year), a descriptive name for the data, and the data source in parentheses. However, to query a data collection, its unique ID (dataCollectionID) must be used.

Let's look at the fieldCategory column for a few data collections in US.

usa_df.fieldCategory.unique()[:10]

array(['2022 Age: 1 Year Increments (Esri)',
       '2027 Age: 1 Year Increments (Esri)',
       '2010 Age: 1 Year Increments (U.S. Census)',
       '2022 Age: 5 Year Increments (Esri)',
       '2027 Age: 5 Year Increments (Esri)',
       '2010 Age: 5 Year Increments (U.S. Census)',
       '2016-2020 Age: 5 Year Increments (ACS)',
       '2022 Age by Sex by Race (Esri)', '2027 Age by Sex by Race (Esri)',
       '2010 Age by Sex by Race (U.S. Census)'], dtype=object)
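The labels above follow a visible pattern: an optional vintage, a topic, and an optional source in parentheses. As a sketch only, the pattern below is inferred from the sample strings in this guide rather than from a documented format, and split_field_category is a hypothetical helper name.

```python
import re

# Optional leading vintage (a year or year range), the topic, and an optional
# trailing source in parentheses. Inferred from sample labels, not a documented format.
FIELD_CATEGORY = re.compile(
    r"^(?:(?P<vintage>\d{4}(?:-\d{4})?)\s+)?"
    r"(?P<topic>.*?)"
    r"(?:\s+\((?P<source>[^()]*)\))?$"
)

def split_field_category(text):
    """Split a fieldCategory label into (vintage, topic, source); missing parts are None."""
    match = FIELD_CATEGORY.match(text)
    return match.group("vintage"), match.group("topic"), match.group("source")

for label in [
    "2022 Age: 1 Year Increments (Esri)",
    "2016-2020 Age: 5 Year Increments (ACS)",
    "Age: 5 Year Increments",
]:
    print(split_field_category(label))
```

Splitting the label this way makes it easy to group categories by source or by vintage before deciding which dataCollectionID to query.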

Data Collections by Socio-demographic Factors

You can filter the data_collections to get collections for a specific factor using Pandas expressions. Let's look at data collections for different socio-demographic factors such as Age, Population and Income.

Data Collections for Age

Age_Collections = usa_df['fieldCategory'].str.contains('Age', na=False)
usa_df[Age_Collections].fieldCategory.unique()

array(['2022 Age: 1 Year Increments (Esri)',
       '2027 Age: 1 Year Increments (Esri)',
       '2010 Age: 1 Year Increments (U.S. Census)',
       '2022 Age: 5 Year Increments (Esri)',
       '2027 Age: 5 Year Increments (Esri)',
       '2010 Age: 5 Year Increments (U.S. Census)',
       '2016-2020 Age: 5 Year Increments (ACS)',
       '2022 Age by Sex by Race (Esri)', '2027 Age by Sex by Race (Esri)',
       '2010 Age by Sex by Race (U.S. Census)',
       '2022 Age Dependency (Esri)', '2027 Age Dependency (Esri)',
       '2022 Disposable Income by Age (Esri)',
       '2010 Households by Age of Householder (U.S. Census)',
       '2016-2020 Households by Type and Size and Age (ACS)',
       '2010 Housing by Age of Householder (U.S. Census)',
       '2022 Income by Age (Esri)', '2027 Income by Age (Esri)',
       '2016-2020 Income by Age (ACS)', 'Age: 5 Year Increments',
       '2022 Net Worth by Age (Esri)',
       '2016-2020 Females by Age of Children and Employment Status (ACS)'],
      dtype=object)

Data Collections for Population

Pop_Collections = usa_df['fieldCategory'].str.contains('Population', na=False)
usa_df[Pop_Collections].fieldCategory.unique()

array(['2010 Population (U.S. Census)',
       '2016-2020 Population by Language Spoken at Home (ACS)',
       '2022 Daytime Population (Esri)',
       '2022 Population by Generation (Esri)',
       '2027 Population by Generation (Esri)',
       '2020 Group Quarters Population (U.S. Census)',
       '2020 Group Quarters Population by Type (U.S. Census)',
       '2010 Group Quarters Population (U.S. Census)',
       '2020 Hispanic Population by Race (U.S. Census)',
       '2020 Hispanic Population of Two or More Races (U.S. Census)',
       '2020 Hispanic Population <18 Years by Race (U.S. Census)',
       '2020 Hispanic Population 18+ Years by Race (U.S. Census)',
       '2020 Hispanic Population 18+ Years of Two or More Races (U.S. Census)',
       '2022 Population Time Series (Esri)',
       '2010 Population by Relationship and Household Type (U.S. Census)',
       '2016-2020 Population by Relationship and Household Type (ACS)',
       '2020 Non Hispanic Population by Race (U.S. Census)',
       '2020 Non Hispanic Population of Two or More Races (U.S. Census)',
       '2020 Non Hispanic Population <18 Years by Race (U.S. Census)',
       '2020 Non Hispanic Population 18+ Years by Race (U.S. Census)',
       '2020 Non Hispanic Population 18+ Years of Two or More Races (U.S. Census)',
       '2022 Population (Esri)', '2020 Population (U.S. Census)',
       '2020 Population by Race (U.S. Census)',
       '2020 Population of Two or More Races (U.S. Census)',
       '2020 Population <18 Years by Race (U.S. Census)',
       '2020 Population 18+ Years by Race (U.S. Census)',
       '2020 Population 18+ Years of Two or More Races (U.S. Census)'],
      dtype=object)

Data Collections for Income

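# Boolean mask: True for rows whose fieldCategory mentions 'Income'
Income_Collections = usa_df['fieldCategory'].str.contains('Income', na=False)
# Caution: this is the index of the mask itself (all rows), not of the matching rows
Income_Collections.index.unique()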

Index(['1yearincrements', '5yearincrements', 'Age', 'agebyracebysex',
       'AgeDependency', 'AtRisk', 'AutomobilesAutomotiveProducts',
       'BabyProductsToysGames', 'basicFactsForMobileApps', 'businesses',
       ...
       'travelMPI', 'unitsinstructure', 'urbanizationgroupsNEW', 'vacant',
       'vehiclesavailable', 'veterans', 'Wealth', 'women', 'yearbuilt',
       'yearmovedin'],
      dtype='object', name='dataCollectionID', length=115)

As mentioned earlier, using a data collection's unique ID (dataCollectionID) is the best way to further query a data collection. Let's look at the dataCollectionID values for the various Income data collections.

usa_df[Income_Collections].index.unique()

Index(['AtRisk', 'basicFactsForMobileApps', 'disposableincome',
       'foodstampsSNAP', 'Health', 'householdincome', 'households',
       'incomebyage', 'KeyUSFacts', 'Policy', 'population', 'Wealth'],
      dtype='object', name='dataCollectionID')
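Note the difference between the two calls above. Income_Collections is a boolean Series that shares the index of usa_df, so Income_Collections.index.unique() lists every data collection, while usa_df[Income_Collections].index.unique() lists only the matches. Here is a minimal sketch of that behavior on a toy dataframe (toy_df and its ID-to-category pairings are made up for illustration):

```python
import pandas as pd

# Toy stand-in for usa_df: dataCollectionID as the index, one fieldCategory column.
# The ID-to-category pairings are illustrative only.
toy_df = pd.DataFrame(
    {"fieldCategory": [
        "2022 Age: 1 Year Increments (Esri)",
        "2022 Income by Age (Esri)",
        "2022 Disposable Income by Age (Esri)",
        "2022 Daytime Population (Esri)",
    ]},
    index=pd.Index(
        ["1yearincrements", "incomebyage", "disposableincome", "DaytimePopulation"],
        name="dataCollectionID",
    ),
)

mask = toy_df["fieldCategory"].str.contains("Income", na=False)

all_ids = list(mask.index.unique())             # index of the mask: every row
income_ids = list(toy_df[mask].index.unique())  # index after filtering: matches only

print(all_ids)
print(income_ids)
```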

Analysis variables for Data Collections

Once we know the data collection we would like to use, we can look at all the unique variables available in that data collection using its unique ID. Let's discover analysisVariables for some of the data collections.

Analysis variables for Age data collection

usa_df.loc['Age']['analysisVariable'].unique()

array(['Age.MALE0', 'Age.MALE5', 'Age.MALE10', 'Age.MALE15', 'Age.MALE20',
       'Age.MALE25', 'Age.MALE30', 'Age.MALE35', 'Age.MALE40',
       'Age.MALE45', 'Age.MALE50', 'Age.MALE55', 'Age.MALE60',
       'Age.MALE65', 'Age.MALE70', 'Age.MALE75', 'Age.MALE80',
       'Age.MALE85', 'Age.FEM0', 'Age.FEM5', 'Age.FEM10', 'Age.FEM15',
       'Age.FEM20', 'Age.FEM25', 'Age.FEM30', 'Age.FEM35', 'Age.FEM40',
       'Age.FEM45', 'Age.FEM50', 'Age.FEM55', 'Age.FEM60', 'Age.FEM65',
       'Age.FEM70', 'Age.FEM75', 'Age.FEM80', 'Age.FEM85'], dtype=object)

Analysis variables are typically represented as dataCollectionID.<analysis variable name> as seen above.
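Since the IDs follow this dotted form, they are easy to take apart or assemble with ordinary string handling. The two helpers below are a sketch with hypothetical names (split_variable_id, build_variable_ids); they are not functions of the arcgis package.

```python
def split_variable_id(variable_id):
    """Split 'Collection.NAME' into (collection, name) at the first dot."""
    collection, _, name = variable_id.partition(".")
    if not collection or not name:
        raise ValueError(f"expected '<dataCollectionID>.<name>', got {variable_id!r}")
    return collection, name

def build_variable_ids(collection, names):
    """Prefix each variable name with its dataCollectionID."""
    return [f"{collection}.{name}" for name in names]

print(split_variable_id("Age.FEM45"))
print(build_variable_ids("Age", ["FEM45", "FEM55", "FEM65"]))
```

A list built this way has the same shape as the values passed to the analysis_variables parameter later in this guide.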

Analysis variables for DaytimePopulation data collection

usa_df.loc['DaytimePopulation']['analysisVariable'].unique()

array(['DaytimePopulation.DPOP_CY', 'DaytimePopulation.DPOPWRK_CY',
       'DaytimePopulation.DPOPRES_CY', 'DaytimePopulation.DPOPDENSCY'],
      dtype=object)

Data Collections for Another Country

Let's look at data collections for New Zealand. Data Browser can be used to examine the entire global listing of variables, and associated datasets for New Zealand.

# Get New Zealand as a country
nz = Country.get('New Zealand')
type(nz)

arcgis.geoenrichment.enrichment.Country

nz_df = nz.data_collections

# print a few rows of the DataFrame
nz_df.head()

	analysisVariable	alias	fieldCategory	vintage
dataCollectionID
15YearIncrements	15YearIncrements.PAGE01_CY	2020 Total Population Age 0-14	2020 Population Totals (MBR)	2020
15YearIncrements	15YearIncrements.PAGE02_CY	2020 Total Population Age 15-29	2020 Population Totals (MBR)	2020
15YearIncrements	15YearIncrements.PAGE03_CY	2020 Total Population Age 30-44	2020 Population Totals (MBR)	2020
15YearIncrements	15YearIncrements.PAGE04_CY	2020 Total Population Age 45-59	2020 Population Totals (MBR)	2020
15YearIncrements	15YearIncrements.PAGE05_CY	2020 Total Population Age 60+	2020 Population Totals (MBR)	2020

nz_df.shape

(193, 4)

Unique Data Collections for New Zealand

Let's get a list of unique data collections that are available for New Zealand.

nz_df.index.unique()

Index(['15YearIncrements', 'EducationalAttainment', 'Gender',
       'HouseholdsbyIncome', 'HouseholdsbyType', 'HouseholdTotals', 'KeyFacts',
       'KeyGlobalFacts', 'MaritalStatus', 'PopulationTotals',
       'PurchasingPower', 'Spending'],
      dtype='object', name='dataCollectionID')

New Zealand has 12 unique data collections.

We can look at the fieldCategory column to understand each category better.

nz_df.fieldCategory.unique()

array(['2020 Population Totals (MBR)',
       '2020 Male Population Totals (MBR)',
       '2020 Female Population Totals (MBR)',
       '2020 Educational Attainment (MBR)',
       '2020 Households by Income (MBR)', '2020 Households by Type (MBR)',
       '2020 Household Totals (MBR)', '2020 Marital Status (MBR)',
       '2020 Purchasing Power (MBR)', 'Key Demographic Indicators',
       'Age: 5 Year Increments',
       '2020 Food & Beverage Expenditures (MBR)',
       '2020 Alcoholic Beverage Expenditures (MBR)',
       '2020 Tobacco Expenditures (MBR)',
       '2020 Clothing Expenditures (MBR)',
       '2020 Footwear Expenditures (MBR)',
       '2020 Furniture & Furnishing Expenditures (MBR)',
       '2020 Household Textiles Expenditures (MBR)',
       '2020 Household Appliances Expenditures (MBR)',
       '2020 Household Utensils Expenditures (MBR)',
       '2020 House & Garden Expenditures (MBR)',
       '2020 Household Maintenance Expenditures (MBR)',
       '2020 Medical Products & Supplies Expenditures (MBR)',
       '2020 Consumer Electronics Expenditures (MBR)',
       '2020 Recreation & Culture Durable Expenditures (MBR)',
       '2020 Entertainment Expenditures (MBR)',
       '2020 Recreational & Cultural Service Expenditures (MBR)',
       '2020 Books & Stationery Expenditures (MBR)',
       '2020 Catering Services Expenditures (MBR)',
       '2020 Personal Care Expenditures (MBR)',
       '2020 Jewelry & Personal Effects Expenditures (MBR)'], dtype=object)

Looking at fieldCategory is a great way to clearly understand what the data collection is about. However, to query a data collection its unique ID (dataCollectionID) must be used.

Data Collections for Socio-demographic Factors

New Zealand has fewer data collections than the U.S. Let's look at the data collections for Key Facts, Education and Spending.

Data Collection for Key Facts

nz_df.loc['KeyGlobalFacts']

	analysisVariable	alias	fieldCategory	vintage
dataCollectionID
KeyGlobalFacts	KeyGlobalFacts.TOTPOP	Total Population	Key Demographic Indicators	NaN
KeyGlobalFacts	KeyGlobalFacts.TOTHH	Total Households	Key Demographic Indicators	NaN
KeyGlobalFacts	KeyGlobalFacts.AVGHHSZ	Average Household Size	Key Demographic Indicators	NaN
KeyGlobalFacts	KeyGlobalFacts.TOTMALES	Male Population	Age: 5 Year Increments	NaN
KeyGlobalFacts	KeyGlobalFacts.TOTFEMALES	Female Population	Age: 5 Year Increments	NaN

Data Collection for Education

nz_df.loc['EducationalAttainment']

	analysisVariable	alias	fieldCategory	vintage
dataCollectionID
EducationalAttainment	EducationalAttainment.EDUC01A_CY	2020 Pop 15+/Edu: No Qualification	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC02A_CY	2020 Pop 15+/Edu: Level 1	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC03A_CY	2020 Pop 15+/Edu: Level 2	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC04A_CY	2020 Pop 15+/Edu: Level 3	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC05A_CY	2020 Pop 15+/Edu: Level 4	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC06B_CY	2020 Pop 15+/Edu: Level 5 Diploma	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC07A_CY	2020 Pop 15+/Edu: Level 6 Diploma	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC08A_CY	2020 Pop 15+/Edu: Bachelor Degree	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC09A_CY	2020 Pop 15+/Edu: Post-graduate and Honours de...	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC10A_CY	2020 Pop 15+/Edu: Master's Degree	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC11A_CY	2020 Pop 15+/Edu: Doctorate	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC12A_CY	2020 Pop 15+/Edu: Overseas Secondary School	2020 Educational Attainment (MBR)	2020
EducationalAttainment	EducationalAttainment.EDUC13_CY	2020 Pop 15+/Edu: Not Included Elsewhere	2020 Educational Attainment (MBR)	2020

Data Collection for Spending

nz_df.loc['Spending']

	analysisVariable	alias	fieldCategory	vintage
dataCollectionID
Spending	Spending.CS01_CY	2020 Food & Beverage: Total	2020 Food & Beverage Expenditures (MBR)	2020
Spending	Spending.CS01PRM_CY	2020 Food & Beverage: Per Mill	2020 Food & Beverage Expenditures (MBR)	2020
Spending	Spending.CSPC01_CY	2020 Food & Beverage: Per Capita	2020 Food & Beverage Expenditures (MBR)	2020
Spending	Spending.CS01IDX_CY	2020 Food & Beverage: Index	2020 Food & Beverage Expenditures (MBR)	2020
Spending	Spending.CS02_CY	2020 Alcoholic Beverage: Total	2020 Alcoholic Beverage Expenditures (MBR)	2020
...	...	...	...	...
Spending	Spending.CS19IDX_CY	2020 Personal Care: Index	2020 Personal Care Expenditures (MBR)	2020
Spending	Spending.CS20_CY	2020 Personal Effects: Total	2020 Jewelry & Personal Effects Expenditures (...	2020
Spending	Spending.CS20PRM_CY	2020 Personal Effects: Per Mill	2020 Jewelry & Personal Effects Expenditures (...	2020
Spending	Spending.CSPC20_CY	2020 Personal Effects: Per Capita	2020 Jewelry & Personal Effects Expenditures (...	2020
Spending	Spending.CS20IDX_CY	2020 Personal Effects: Index	2020 Jewelry & Personal Effects Expenditures (...	2020

80 rows × 4 columns

Analysis variables for Data Collections

Analysis variables for KeyGlobalFacts data collection

nz_df.loc['KeyGlobalFacts']['analysisVariable'].unique()

array(['KeyGlobalFacts.TOTPOP', 'KeyGlobalFacts.TOTHH',
       'KeyGlobalFacts.AVGHHSZ', 'KeyGlobalFacts.TOTMALES',
       'KeyGlobalFacts.TOTFEMALES'], dtype=object)

Analysis variables for EducationalAttainment data collection

nz_df.loc['EducationalAttainment']['analysisVariable'].unique()

array(['EducationalAttainment.EDUC01A_CY',
       'EducationalAttainment.EDUC02A_CY',
       'EducationalAttainment.EDUC03A_CY',
       'EducationalAttainment.EDUC04A_CY',
       'EducationalAttainment.EDUC05A_CY',
       'EducationalAttainment.EDUC06B_CY',
       'EducationalAttainment.EDUC07A_CY',
       'EducationalAttainment.EDUC08A_CY',
       'EducationalAttainment.EDUC09A_CY',
       'EducationalAttainment.EDUC10A_CY',
       'EducationalAttainment.EDUC11A_CY',
       'EducationalAttainment.EDUC12A_CY',
       'EducationalAttainment.EDUC13_CY'], dtype=object)

Analysis variables for Spending data collection

nz_df.loc['Spending']['analysisVariable'].unique()

array(['Spending.CS01_CY', 'Spending.CS01PRM_CY', 'Spending.CSPC01_CY',
       'Spending.CS01IDX_CY', 'Spending.CS02_CY', 'Spending.CS02PRM_CY',
       'Spending.CSPC02_CY', 'Spending.CS02IDX_CY', 'Spending.CS03_CY',
       'Spending.CS03PRM_CY', 'Spending.CSPC03_CY', 'Spending.CS03IDX_CY',
       'Spending.CS04_CY', 'Spending.CS04PRM_CY', 'Spending.CSPC04_CY',
       'Spending.CS04IDX_CY', 'Spending.CS05_CY', 'Spending.CS05PRM_CY',
       'Spending.CSPC05_CY', 'Spending.CS05IDX_CY', 'Spending.CS06_CY',
       'Spending.CS06PRM_CY', 'Spending.CSPC06_CY', 'Spending.CS06IDX_CY',
       'Spending.CS07_CY', 'Spending.CS07PRM_CY', 'Spending.CSPC07_CY',
       'Spending.CS07IDX_CY', 'Spending.CS08_CY', 'Spending.CS08PRM_CY',
       'Spending.CSPC08_CY', 'Spending.CS08IDX_CY', 'Spending.CS09_CY',
       'Spending.CS09PRM_CY', 'Spending.CSPC09_CY', 'Spending.CS09IDX_CY',
       'Spending.CS10_CY', 'Spending.CS10PRM_CY', 'Spending.CSPC10_CY',
       'Spending.CS10IDX_CY', 'Spending.CS11_CY', 'Spending.CS11PRM_CY',
       'Spending.CSPC11_CY', 'Spending.CS11IDX_CY', 'Spending.CS12_CY',
       'Spending.CS12PRM_CY', 'Spending.CSPC12_CY', 'Spending.CS12IDX_CY',
       'Spending.CS13_CY', 'Spending.CS13PRM_CY', 'Spending.CSPC13_CY',
       'Spending.CS13IDX_CY', 'Spending.CS14_CY', 'Spending.CS14PRM_CY',
       'Spending.CSPC14_CY', 'Spending.CS14IDX_CY', 'Spending.CS15_CY',
       'Spending.CS15PRM_CY', 'Spending.CSPC15_CY', 'Spending.CS15IDX_CY',
       'Spending.CS16_CY', 'Spending.CS16PRM_CY', 'Spending.CSPC16_CY',
       'Spending.CS16IDX_CY', 'Spending.CS17_CY', 'Spending.CS17PRM_CY',
       'Spending.CSPC17_CY', 'Spending.CS17IDX_CY', 'Spending.CS18_CY',
       'Spending.CS18PRM_CY', 'Spending.CSPC18_CY', 'Spending.CS18IDX_CY',
       'Spending.CS19_CY', 'Spending.CS19PRM_CY', 'Spending.CSPC19_CY',
       'Spending.CS19IDX_CY', 'Spending.CS20_CY', 'Spending.CS20PRM_CY',
       'Spending.CSPC20_CY', 'Spending.CS20IDX_CY'], dtype=object)

Perform Enrichment using Data Collections and Analysis Variables

Data collections can be used to enrich various study areas. Both data_collections and analysis_variables can be passed to the enrich() method. Details about enriching study areas can be found in the Enriching Study Areas section.

Let's look at a few similar examples of GeoEnrichment here.

Enrich using Data Collections

Enrich with Age data collection

Here we see an address being enriched with data from the Age data collection.

# Enriching a single address as single-line input
age_coll = enrich(study_areas=["380 New York St Redlands CA 92373"], 
                       data_collections=['Age'])

age_coll

	source_country	x	y	area_type	buffer_units	buffer_units_alias	buffer_radii	aggregation_method	population_to_polygon_size_rating	apportionment_confidence	...	fem45	fem50	fem55	fem60	fem65	fem70	fem75	fem80	fem85	SHAPE
0	USA	-117.19479	34.057265	RingBuffer	esriMiles	Miles	1.0	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	...	366.0	392.0	365.0	345.0	322.0	277.0	168.0	103.0	132.0	{"rings": [[[-117.19479001927878, 34.071773611...

1 rows × 48 columns

age_coll.columns

Index(['source_country', 'x', 'y', 'area_type', 'buffer_units',
       'buffer_units_alias', 'buffer_radii', 'aggregation_method',
       'population_to_polygon_size_rating', 'apportionment_confidence',
       'has_data', 'male0', 'male5', 'male10', 'male15', 'male20', 'male25',
       'male30', 'male35', 'male40', 'male45', 'male50', 'male55', 'male60',
       'male65', 'male70', 'male75', 'male80', 'male85', 'fem0', 'fem5',
       'fem10', 'fem15', 'fem20', 'fem25', 'fem30', 'fem35', 'fem40', 'fem45',
       'fem50', 'fem55', 'fem60', 'fem65', 'fem70', 'fem75', 'fem80', 'fem85',
       'SHAPE'],
      dtype='object')

When a data collection is specified without specific analysis variables, all variables under the data collection are used for enrichment as can be seen above.

Enrich with Health data collection

Here we see a zip code being enriched with data from the Health data collection.

redlands = usa.subgeographies.states['California'].zip5['92373']

redlands_df = enrich(study_areas=[redlands], data_collections=['Health'] )

redlands_df

	std_geography_level	std_geography_name	std_geography_id	source_country	aggregation_method	population_to_polygon_size_rating	apportionment_confidence	has_data	rel65_hi2_oc	acscivnins	...	pop85_cy	pop18up_cy	pop21up_cy	medage_cy	hhu18_c10	medhinc_cy	s27_bus	s27_sales	s27_emp	SHAPE
0	US.ZIP5	Redlands	92373	USA	Query:US.ZIP5	2.191	2.576	1	2.0	31157.0	...	1205.0	28208.0	27076.0	42.2	3851.0	91009.0	224.0	306371.0	4093.0	{"rings": [[[-117.16767396036383, 33.976847519...

1 rows × 431 columns

redlands_df.columns

Index(['std_geography_level', 'std_geography_name', 'std_geography_id',
       'source_country', 'aggregation_method',
       'population_to_polygon_size_rating', 'apportionment_confidence',
       'has_data', 'rel65_hi2_oc', 'acscivnins',
       ...
       'pop85_cy', 'pop18up_cy', 'pop21up_cy', 'medage_cy', 'hhu18_c10',
       'medhinc_cy', 's27_bus', 's27_sales', 's27_emp', 'SHAPE'],
      dtype='object', length=431)

Enrich using Analysis Variables

Instead of a whole data collection, we can enrich our data with specific analysis variables from it. In this example, we will look at the analysis variables of the Age data collection and then use a few of them to enrich() a study area.

# Unique analysis variables for Age data collection
usa = Country.get('US')
usa.data_collections.loc['Age']['analysisVariable'].unique()

array(['Age.MALE0', 'Age.MALE5', 'Age.MALE10', 'Age.MALE15', 'Age.MALE20',
       'Age.MALE25', 'Age.MALE30', 'Age.MALE35', 'Age.MALE40',
       'Age.MALE45', 'Age.MALE50', 'Age.MALE55', 'Age.MALE60',
       'Age.MALE65', 'Age.MALE70', 'Age.MALE75', 'Age.MALE80',
       'Age.MALE85', 'Age.FEM0', 'Age.FEM5', 'Age.FEM10', 'Age.FEM15',
       'Age.FEM20', 'Age.FEM25', 'Age.FEM30', 'Age.FEM35', 'Age.FEM40',
       'Age.FEM45', 'Age.FEM50', 'Age.FEM55', 'Age.FEM60', 'Age.FEM65',
       'Age.FEM70', 'Age.FEM75', 'Age.FEM80', 'Age.FEM85'], dtype=object)

Now, we will enrich our study area with the Age.FEM45, Age.FEM55 and Age.FEM65 variables.

enrich(study_areas=["380 New York St Redlands CA 92373"], 
       analysis_variables=["Age.FEM45","Age.FEM55","Age.FEM65"])

	source_country	x	y	area_type	buffer_units	buffer_units_alias	buffer_radii	aggregation_method	population_to_polygon_size_rating	apportionment_confidence	has_data	fem45	fem55	fem65	SHAPE
0	USA	-117.19479	34.057265	RingBuffer	esriMiles	Miles	1.0	BlockApportionment:US.BlockGroups;PointsLayer:...	2.191	2.576	1	366.0	365.0	322.0	{"rings": [[[-117.19479001927878, 34.071773611...
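In the outputs in this guide, the enriched columns carry the variable name without the collection prefix: in lower case here (fem45) and in upper case in the Spatially Enabled DataFrame example further on (FEM45). Under that assumption, a small helper (find_enriched_columns is a hypothetical name) can locate the column for each requested variable regardless of case:

```python
def find_enriched_columns(columns, variable_ids):
    """Map each 'Collection.NAME' ID to the matching column, ignoring case.

    Variables with no matching column map to None.
    """
    by_lower = {col.lower(): col for col in columns}
    return {
        var: by_lower.get(var.partition(".")[2].lower())
        for var in variable_ids
    }

# Column names taken from the two kinds of output shown in this guide.
address_cols = ["source_country", "has_data", "fem45", "fem55", "fem65", "SHAPE"]
sedf_cols = ["sourceCountry", "HasData", "FEM45", "FEM55", "FEM65", "SHAPE"]
wanted = ["Age.FEM45", "Age.FEM55", "Age.FEM65"]

print(find_enriched_columns(address_cols, wanted))
print(find_enriched_columns(sedf_cols, wanted))
```

With a real result, the first argument would be the dataframe's columns.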

Enriching Spatially Enabled Dataframes

One of the most common use cases for GeoEnrichment is enriching data that already exists in feature layers. A Spatially Enabled DataFrame (SeDF) lets us bring the data from a layer into a dataframe, which can then be GeoEnriched.

Let's look at an example using an existing layer from a Covid-19 dataset. This feature layer includes the latest Covid-19 Cases, Recovered and Deaths data for the U.S. at the county level.

# Get the layer
gis = GIS(set_active=False)
covid_item = gis.content.get('628578697fb24d8ea4c32fa0c5ae1843')
print(covid_item)
covid_layer = covid_item.layers[0]
covid_layer

<Item title:"COVID-19 Cases US" type:Feature Layer Collection owner:CSSE_covid19>

<FeatureLayer url:"https://services1.arcgis.com/0MSEUqKaxRlEPj5g/arcgis/rest/services/ncov_cases_US/FeatureServer/0">

We can query the layer as a dataframe and then use the dataframe for enrichment.

covid_df = covid_layer.query(as_df=True)
covid_df.shape

(3272, 19)

covid_df.head()

	OBJECTID	Province_State	Country_Region	Last_Update	Lat	Long_	Confirmed	Recovered	Deaths	Active	Admin2	FIPS	Combined_Key	Incident_Rate	People_Tested	People_Hospitalized	UID	ISO3	SHAPE
0	1	Alabama	US	2022-10-27 17:22:34	32.539527	-86.644082	18511	<NA>	228	<NA>	Autauga	01001	Autauga, Alabama, US	33132.864379	<NA>	<NA>	84001001	USA	{"x": -86.64408226999996, "y": 32.539527450000...
1	2	Alabama	US	2022-10-27 17:22:34	30.72775	-87.722071	65973	<NA>	716	<NA>	Baldwin	01003	Baldwin, Alabama, US	29553.293853	<NA>	<NA>	84001003	USA	{"x": -87.72207057999998, "y": 30.727749910000...
2	3	Alabama	US	2022-10-27 17:22:34	31.868263	-85.387129	6930	<NA>	103	<NA>	Barbour	01005	Barbour, Alabama, US	28072.591752	<NA>	<NA>	84001005	USA	{"x": -85.38712859999998, "y": 31.868263000000...
3	4	Alabama	US	2022-10-27 17:22:34	32.996421	-87.125115	7575	<NA>	108	<NA>	Bibb	01007	Bibb, Alabama, US	33826.024828	<NA>	<NA>	84001007	USA	{"x": -87.12511459999996, "y": 32.996420640000...
4	5	Alabama	US	2022-10-27 17:22:34	33.982109	-86.567906	17320	<NA>	258	<NA>	Blount	01009	Blount, Alabama, US	29951.92474	<NA>	<NA>	84001009	USA	{"x": -86.56790592999994, "y": 33.982109180000...

To showcase GeoEnrichment, we will create a subset of the original data and then enrich() the subset.

# Create subset
test_df = covid_df.iloc[:100].copy()
test_df.shape

(100, 19)

# Check geometry
test_df.spatial.geometry_type

['point', None]

A dataframe can be passed as the value of the study_areas parameter of the enrich() method. Here we are enriching our dataframe with specific variables from the Age data collection.

# Enrich dataframe
new_df = enrich(study_areas=test_df.spatial, 
       analysis_variables=["Age.FEM45","Age.FEM55","Age.FEM65"])

new_df.head()

	ID	OBJECTID_0	sourceCountry	Long_	Country_Region	FIPS	Last_Update	Combined_Key	ISO3	...	bufferUnitsAlias	bufferRadii	aggregationMethod	populationToPolygonSizeRating	apportionmentConfidence	HasData	FEM45	FEM55	FEM65	SHAPE
0	0	1	US	-82.461707	US	45001	1596857725000	Abbeville, South Carolina, US	USA	...	Miles	1	BlockApportionment:US.BlockGroups	2.191	2.576	1	2	2	2	{"rings": [[[-82.46170657999994, 34.2378420028...
1	1	2	US	-92.414197	US	22001	1596857725000	Acadia, Louisiana, US	USA	...	Miles	1	BlockApportionment:US.BlockGroups	2.191	2.576	1	3	3	2	{"rings": [[[-92.41419697999997, 30.3095821536...
2	2	3	US	-75.632346	US	51001	1596857725000	Accomack, Virginia, US	USA	...	Miles	1	BlockApportionment:US.BlockGroups	2.191	2.576	1	14	16	14	{"rings": [[[-75.63234615, 37.781571251121655]...
3	3	4	US	-116.241552	US	16001	1596857725000	Ada, Idaho, US	USA	...	Miles	1	BlockApportionment:US.BlockGroups	2.191	2.576	1	0	0	0	{"rings": [[[-116.24155159999998, 43.467142851...
4	4	5	US	-94.471059	US	19001	1596857725000	Adair, Iowa, US	USA	...	Miles	1	BlockApportionment:US.BlockGroups	2.191	2.576	1	1	1	1	{"rings": [[[-94.47105873999998, 41.3452468224...

5 rows × 31 columns

new_df.drop(['OBJECTID_0', 'ID','Last_Update'], axis=1, inplace=True)

# Check shape
new_df.shape

(91, 28)

The enrichment returned 91 records and 31 columns. Enrichment information is not available for some areas in our dataframe, so we get 91 records instead of 100. GeoEnrichment adds several columns of its own alongside the analysis variables we requested, which gives 31 columns; dropping the duplicate and unnecessary columns (OBJECTID_0, ID and Last_Update) brings the count down to 28.
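To find out which input records did not come back, one option is to compare a key column that is present in both dataframes, such as FIPS in this example. The sketch below uses plain lists of hypothetical FIPS codes; missing_keys is an illustrative helper, and with real data the two lists would come from the input and enriched dataframes.

```python
def missing_keys(input_keys, enriched_keys):
    """Return input keys absent from the enriched result, keeping input order."""
    enriched = set(enriched_keys)
    return [key for key in input_keys if key not in enriched]

# Hypothetical FIPS codes: five inputs, four of which came back enriched.
input_fips = ["01001", "01003", "01005", "01007", "01009"]
enriched_fips = ["01001", "01003", "01007", "01009"]

dropped = missing_keys(input_fips, enriched_fips)
print(f"{len(dropped)} of {len(input_fips)} inputs were not enriched: {dropped}")
```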

Visualize on a Map

Let's visualize the enriched dataframe on a map. We will use the FEM65 column to classify our data for plotting on the map.

covid_map = gis.map('USA', 4)
covid_map

# Plot on a map
covid_map.remove_layers()
new_df.spatial.plot(map_widget=covid_map,
                    renderer_type='c',  # for class breaks renderer,
                    method='esriClassifyNaturalBreaks',  # classification algorithm,
                    class_count=5,  # choose the number of classes,
                    col='FEM65',  # numeric column to classify,
                    cmap='viridis',  # color map to pick colors from for each class,
                    alpha=0.7)

True

Conclusion

In this part of the arcgis.geoenrichment module guide series, you saw how the data_collections property of a Country object lists its available data collections and analysis variables. You explored different data collections and their analysis variables, and then used them to enrich study areas. Towards the end, you saw how Spatially Enabled DataFrames can be enriched.

In the subsequent pages, you will learn about Generating Reports and Standard Geography Queries.