Spatially Enabled DataFrames - Advanced Topics

The information in this section provides a brief introduction to advanced topics with the Spatially Enabled DataFrame structure.

One of the most important tasks for software applications is to quickly retrieve and process information. Enterprise systems, whether storing GIS information or not, all utilize the concept of indexing to allow for quick searching through large data stores to locate and select specific information for subsequent processing.

This document will outline how row and column indexing work in Spatially Enabled Dataframes and also demonstrate building a spatial index on dataframe geometries to allow for quick searching, accessing, and processing. The document will also demonstrate spatial joins to combine dataframes.

DataFrame Index

As mentioned in the Introduction to the Spatially Enabled DataFrame guide, the Pandas DataFrame structure underlies the ArcGIS API for Python's Spatially Enabled DataFrame. Pandas DataFrames are analagous to spreadsheets. They have a row axis and a column axis. Each of these axes are indexed and labeled for quick and easy identification, data alignment, and retrieval and updating of data subsets.

Let's explore the axes labels and indices and how they allow for data exploraation:

from arcgis.gis import GIS
gis = GIS()

When working with an ArcGIS Online feature layer, the query() method returns a FeatureSet object which has a sdf method to instantiate a Spatially Enabled DataFrame.

from arcgis import GIS
item = gis.content.get("85d0ca4ea1ca4b9abf0c51b9bd34de2e")
flayer = item.layers[0]
df = flayer.query(where="AGE_45_54 < 1500").sdf
df.head()
FIDNAMECLASSSTSTFIPSPLACEFIPSCAPITALPOP_CLASSPOPULATIONPOP2010...MARHH_NO_CMHH_CHILDFHH_CHILDFAMILIESAVE_FAM_SZHSE_UNITSVACANTOWNER_OCCRENTER_OCCSHAPE
01AmmoncityID16160199061518113816...113110633533523.61474727132051271{"x": -12462673.723706165, "y": 5384674.994080...
12BlackfootcityID16160784061194611899...108117438129583.31454731827881441{"x": -12506251.313993266, "y": 5341537.793529...
24BurleycityID16161126061072710345...86113935824993.37388524121831461{"x": -12667411.402393516, "y": 5241722.820606...
36ChubbuckcityID16161468061465513922...128117237035863.4496122933241408{"x": -12520053.904151963, "y": 5300220.333409...
412JeromecityID16164132061140310890...77921038526403.44398529222191474{"x": -12747828.64784961, "y": 5269214.8197742...

5 rows × 50 columns

Describing the DataFrame

The DataFrame.info() provides a concise summary of the object. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

Example: Displaying info on Spatially Enabled DataFrame (SEDF)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 316 entries, 0 to 315
Data columns (total 50 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   FID         316 non-null    Int64   
 1   NAME        316 non-null    string  
 2   CLASS       316 non-null    string  
 3   ST          316 non-null    string  
 4   STFIPS      316 non-null    string  
 5   PLACEFIPS   316 non-null    string  
 6   CAPITAL     316 non-null    string  
 7   POP_CLASS   316 non-null    Int32   
 8   POPULATION  316 non-null    Int32   
 9   POP2010     316 non-null    Int32   
 10  WHITE       316 non-null    Int32   
 11  BLACK       316 non-null    Int32   
 12  AMERI_ES    316 non-null    Int32   
 13  ASIAN       316 non-null    Int32   
 14  HAWN_PI     316 non-null    Int32   
 15  HISPANIC    316 non-null    Int32   
 16  OTHER       316 non-null    Int32   
 17  MULT_RACE   316 non-null    Int32   
 18  MALES       316 non-null    Int32   
 19  FEMALES     316 non-null    Int32   
 20  AGE_UNDER5  316 non-null    Int32   
 21  AGE_5_9     316 non-null    Int32   
 22  AGE_10_14   316 non-null    Int32   
 23  AGE_15_19   316 non-null    Int32   
 24  AGE_20_24   316 non-null    Int32   
 25  AGE_25_34   316 non-null    Int32   
 26  AGE_35_44   316 non-null    Int32   
 27  AGE_45_54   316 non-null    Int32   
 28  AGE_55_64   316 non-null    Int32   
 29  AGE_65_74   316 non-null    Int32   
 30  AGE_75_84   316 non-null    Int32   
 31  AGE_85_UP   316 non-null    Int32   
 32  MED_AGE     316 non-null    Float64 
 33  MED_AGE_M   316 non-null    Float64 
 34  MED_AGE_F   316 non-null    Float64 
 35  HOUSEHOLDS  316 non-null    Int32   
 36  AVE_HH_SZ   316 non-null    Float64 
 37  HSEHLD_1_M  316 non-null    Int32   
 38  HSEHLD_1_F  316 non-null    Int32   
 39  MARHH_CHD   316 non-null    Int32   
 40  MARHH_NO_C  316 non-null    Int32   
 41  MHH_CHILD   316 non-null    Int32   
 42  FHH_CHILD   316 non-null    Int32   
 43  FAMILIES    316 non-null    Int32   
 44  AVE_FAM_SZ  316 non-null    Float64 
 45  HSE_UNITS   316 non-null    Int32   
 46  VACANT      316 non-null    Int32   
 47  OWNER_OCC   316 non-null    Int32   
 48  RENTER_OCC  316 non-null    Int32   
 49  SHAPE       316 non-null    geometry
dtypes: Float64(5), Int32(37), Int64(1), geometry(1), string(6)
memory usage: 91.2 KB

We can see that the SHAPE column is of type geometry. This means that compared to the legacy SpatialDataFrame class, geometry columns are now unique instead of being just of type object.

We can get information about each axis label (aka, index) with the axes property on the spatial dataframe.

print("{:<15}{}\n\n{}{}".format("Row axis: ", df.axes[0], "Column axis: ", df.axes[1]))
Row axis:      RangeIndex(start=0, stop=316, step=1)

Column axis: Index(['FID', 'NAME', 'CLASS', 'ST', 'STFIPS', 'PLACEFIPS', 'CAPITAL',
       'POP_CLASS', 'POPULATION', 'POP2010', 'WHITE', 'BLACK', 'AMERI_ES',
       'ASIAN', 'HAWN_PI', 'HISPANIC', 'OTHER', 'MULT_RACE', 'MALES',
       'FEMALES', 'AGE_UNDER5', 'AGE_5_9', 'AGE_10_14', 'AGE_15_19',
       'AGE_20_24', 'AGE_25_34', 'AGE_35_44', 'AGE_45_54', 'AGE_55_64',
       'AGE_65_74', 'AGE_75_84', 'AGE_85_UP', 'MED_AGE', 'MED_AGE_M',
       'MED_AGE_F', 'HOUSEHOLDS', 'AVE_HH_SZ', 'HSEHLD_1_M', 'HSEHLD_1_F',
       'MARHH_CHD', 'MARHH_NO_C', 'MHH_CHILD', 'FHH_CHILD', 'FAMILIES',
       'AVE_FAM_SZ', 'HSE_UNITS', 'VACANT', 'OWNER_OCC', 'RENTER_OCC',
       'SHAPE'],
      dtype='object')

Row axis information informs us we can retrieve information using the the dataframe loc attribute and any value in the range 0-317 inclusive to access a row. Column axis information tells us we can use any string in the index to return an attribute column:

df.loc[0] #the first row returned
FID                                                           1
NAME                                                      Ammon
CLASS                                                      city
ST                                                           ID
STFIPS                                                       16
PLACEFIPS                                               1601990
CAPITAL                                                        
POP_CLASS                                                     6
POPULATION                                                15181
POP2010                                                   13816
WHITE                                                     13002
BLACK                                                        73
AMERI_ES                                                     67
ASIAN                                                       113
HAWN_PI                                                       9
HISPANIC                                                    884
OTHER                                                       307
MULT_RACE                                                   245
MALES                                                      6750
FEMALES                                                    7066
AGE_UNDER5                                                 1468
AGE_5_9                                                    1503
AGE_10_14                                                  1313
AGE_15_19                                                  1058
AGE_20_24                                                   734
AGE_25_34                                                  2031
AGE_35_44                                                  1767
AGE_45_54                                                  1446
AGE_55_64                                                  1136
AGE_65_74                                                   665
AGE_75_84                                                   486
AGE_85_UP                                                   209
MED_AGE                                                    29.6
MED_AGE_M                                                  28.0
MED_AGE_F                                                  30.8
HOUSEHOLDS                                                 4476
AVE_HH_SZ                                                  3.05
HSEHLD_1_M                                                  457
HSEHLD_1_F                                                  648
MARHH_CHD                                                  1618
MARHH_NO_C                                                 1131
MHH_CHILD                                                   106
FHH_CHILD                                                   335
FAMILIES                                                   3352
AVE_FAM_SZ                                                 3.61
HSE_UNITS                                                  4747
VACANT                                                      271
OWNER_OCC                                                  3205
RENTER_OCC                                                 1271
SHAPE         {'x': -12462673.723706165, 'y': 5384674.994080...
Name: 0, dtype: object
df['POP2010'] #the data from the `POP2010` attribute column
0      13816
1      11899
2      10345
3      13922
4      10890
       ...  
311     9089
312    10272
313     9761
314     7855
315     9845
Name: POP2010, Length: 316, dtype: Int32

Slicing DataFrames

We can access rows, columns and subsets of rows and columns using Python slicing:

#rows 0-9 with a subset of columns indexed as a list
df.loc[0:9][['FID', 'NAME', 'ST', 'POP2010', 'POPULATION']]
FIDNAMESTPOP2010POPULATION
01AmmonID1381615181
12BlackfootID1189911946
24BurleyID1034510727
36ChubbuckID1392214655
412JeromeID1089011403
513KunaID1521018179
621RexburgID2548428019
7143MacombIL1928819838
8151MinookaIL1092411815
9186PlanoIL1085611164

We can use indexing to access SHAPE information and draw it on a map:

df[df['NAME'].str.contains('Camp Pendleton South')]
FIDNAMECLASSSTSTFIPSPLACEFIPSCAPITALPOP_CLASSPOPULATIONPOP2010...MARHH_NO_CMHH_CHILDFHH_CHILDFAMILIESAVE_FAM_SZHSE_UNITSVACANTOWNER_OCCRENTER_OCCSHAPE
691035Camp Pendleton SouthCensus Designated PlaceCA06061056161045010616...3893716925003.692865296112558{"x": -13066535.250599463, "y": 3925680.810605...

1 rows × 50 columns

Note that, the resulting Series might not show as in row 69 everytime, so it is important to the update the following cell to point to the corresponding row according to the current output. Also, SHAPE is one of the many Key Value Pairs of the dict object created from the resulting Series, that we are going to print as it is, and render on the map.

camp_pendleton_s_geodefn = dict(df.loc[69]).get("SHAPE") #geometry definition from row 2
print(camp_pendleton_s_geodefn)
{'x': -13066535.250599463, 'y': 3925680.810605321, 'spatialReference': {'wkid': 102100, 'latestWkid': 3857}}
m = gis.map("San Diego, CA", 8)
m
# draw the camp `point`
m.draw(camp_pendleton_s_geodefn)

Spatial Index

In addition to row and column indices to search a DataFrame, we can use a spatial indexes to quickly access information based on its location and relationship with other features. They are based on the concept of a minimum bounding rectangle - the smallest rectangle that contains an entire geometric shape. Each of these rectangles are then grouped into leaf nodes representing a single shape and node structures containing groups of shapes according to whatever algorithm the different types of spatial indexing use. Querying these rectangles requires magnitudes fewer compute resources for accessing and processing geometries relative to accessing the entire feature array of coordinate pairs that compose a shape. Access to points, complex lines and irregularly-shaped polygons becomes much quicker and easier through different flavors of spatial indexing.

The Spatially Enabled DataFrame uses an implementation of spatial indexing known as QuadTree indexing, which searches nodes when determining locations, relationships and attributes of specific features. QuadTree indexes are the default spatial index, but the SEDF also supports r-tree implementations. In the DataFrame Index section of this notebook, the USA Major Cities feature layer was queried and the sdf property was called on the results to create a DataFrame. The sindex method on the DataFrame creates a QuadTree index:

si = df.spatial.sindex('quadtree',reset=False)

Let's visually inspect the external frame of the QuadTree index. We'll then plot the spatial dataframe to ensure the spatial index encompasses all our features:

midx = gis.map("United States", 3)
midx
midx.center = [39, -98]
midx.basemap = 'gray-vector'
# draw the spatial index envelope
df.spatial.plot(map_widget=midx)
True

Let's use the feature we drew earlier to define a spatial reference variable for use throughout the rest of this guide.

sp_ref = camp_pendleton_s_geodefn['spatialReference']
sp_ref
{'wkid': 102100, 'latestWkid': 3857}
import time
from arcgis.geometry import Geometry, Polygon

#define a symbol to visualize the spatial index quadrants
sym = {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [0,0,0,0],
    "outline": {
        "type": "esriSLS",
        "style": "esriSLSSolid",
        "color": [0,0,0,255],
        "width": 4}
}

# loop through the children of the root index and draw each extent
# using a different outline color
for i in range(len(si._index.children)):
    sym["outline"]["color"][i] = 255
    if i > 0:
        sym["outline"]["color"][i] = 255
        sym["outline"]["color"][i-1] = 0
    child = si._index.children[i]
    width_factor = child.width/2
    height_factor = child.width/2
    minx = child.center[0] - width_factor
    miny = child.center[1] - height_factor
    maxx = child.center[0] + width_factor
    maxy = child.center[1] + height_factor
    child_geom = Geometry({
        'rings':[[[minx,miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]],
        'spatialReference': sp_ref})
    #child_extent = Polygon(child_geom)
    midx.draw(shape = child_geom, symbol = sym)
    time.sleep(2)

Intersection with the Spatial Index

Up to this point in this guide, we've talked about using indexing for querying attributes in the dataframe. For example:

query = df['ST'] == 'MI'
df[query]
FIDNAMECLASSSTSTFIPSPLACEFIPSCAPITALPOP_CLASSPOPULATIONPOP2010...MARHH_NO_CMHH_CHILDFHH_CHILDFAMILIESAVE_FAM_SZHSE_UNITSVACANTOWNER_OCCRENTER_OCCSHAPE
1011514AllendaleCensus Designated PlaceMI26260134061970917579...9399420026393.26483422725112096{"x": -9568014.259185659, "y": 5309698.1042234...
1021523Big RapidscityMI26260830061061410601...4368232213232.88362329311362194{"x": -9515614.1943842, "y": 5419013.554990504...
1031526CadillaccityMI26261232061054710355...99917952626252.9492764725191761{"x": -9508449.13740055, "y": 5504118.53341087...
1041528ColdwatercityMI26261702061082810945...89721344526283.14482757225761679{"x": -9462596.578807637, "y": 5152066.1922501...
1051529Comstock ParkCensus Designated PlaceMI26261770061089210088...92016755726392.96465648419792193{"x": -9537405.970997674, "y": 5316589.2811759...
1061538FarmingtoncityMI26262738061067410372...11296628727352.92495933528761748{"x": -9281637.643149175, "y": 5230343.3917462...
1071546Grand HavencityMI26263334061107310412...127711532027212.825815104632391530{"x": -9597693.318165638, "y": 5320625.9222716...
1081575Muskegon HeightscityMI26265636061065710856...460143118326823.23484284620421954{"x": -9600424.098598039, "y": 5342825.5598096...
1091600SturgiscityMI26267696061094410994...83519546726323.28459550723331755{"x": -9508726.380150843, "y": 5131255.7167436...

9 rows × 50 columns

We can query multiple attributes and filter on the column output as well:

query = (df['POP2010'] > 20000) & (df['ST'] == 'OH')
df[query][['NAME','ST','POP2010','HOUSEHOLDS','HSEHLD_1_F', 'HSEHLD_1_M']]
NAMESTPOP2010HOUSEHOLDSHSEHLD_1_FHSEHLD_1_M
166AthensOH23832690324742573
195OxfordOH21371579920331850

As GIS analysts and data scientists, we also want to query based on geographic location. We can do that by building a spatial index with the sindex property of the spatial dataframe. The resulting quadtree index allows us to query based on specific geometries in relation to other geometries.

Let's continue looking at the dataframe wer're working with: US cities with a population between the ages of 45 and 54 of less than 1500.

We can draw the entire extent of our dataframe using the dataframe's geoextent property. Let's get the bounding box coordinates:

df_geoextent = df.spatial.full_extent
df_geoextent
(-17595352.55942164,
 2429395.3372018305,
 -7895099.852443745,
 6266417.1716177985)

Let's use these coordinates, place them in more descriptive variable names, then create a bounding box to make a geometry object representing the extent of our dataframe. Finally we'll draw it on the a map:

df_geoextent_geom = df.spatial.bbox
df_geoextent_geom
m1 = gis.map("United States", 3)
m1
m1.center = [39, -98]
sym_poly = {
  "type": "esriSFS",
  "style": "esriSFSSolid",
  "color": [0,0,0,0],  # hollow, no fill
    "outline": {
     "type": "esriSLS",
     "style": "esriSLSSolid",
     "color": [255,0,0,255],  # red border
     "width": 3}
}

# draw the dataframe extent with AOI
m1.draw(shape = df_geoextent_geom, symbol = sym_poly)

Now, let's define a second set of coordinates representing a bounding box for which we want to query the features from our dataframe that fall within it.

We can define our list of coordinates, and then draw it on the map to make sure it falls within our dataframe extent:

area_of_interest = [-13043219.122301877, 3911134.034258818, -13243219.102301877, 4111134.0542588173]
minx, miny, maxx, maxy = area_of_interest[0], area_of_interest[1], area_of_interest[2], area_of_interest[3]

area_of_interest_ring = [[[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]]
area_of_interest_geom = Geometry({'rings': area_of_interest_ring, 'spatialReference': sp_ref})

sym_poly_aoi = {
  "type": "esriSFS",
  "style": "esriSFSSolid",
  "color": [0,0,0,0],  # hollow, no fill
    "outline": {
     "type": "esriSLS",
     "style": "esriSLSSolid",
     "color": [0,255,0,255],   # green border
     "width": 3}
}

m1.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)

We can see that our area of interest box falls within the dataframe extent. The spatial index has an intersect method which takes a bounding box as input and returns a list of integer values from the row index of our spatial dataframe. We can use the dataframe's iloc integer-indexing attribute to then loop through the dataframe and put draw the features on a map

index_of_features = si.intersect(area_of_interest)
df.iloc[index_of_features]
FIDNAMECLASSSTSTFIPSPLACEFIPSCAPITALPOP_CLASSPOPULATIONPOP2010...MARHH_NO_CMHH_CHILDFHH_CHILDFAMILIESAVE_FAM_SZHSE_UNITSVACANTOWNER_OCCRENTER_OCCSHAPE
32454MuscoyCensus Designated PlaceCA06065013261116110644...30219430819334.7624432121268963{"x": -13063141.654215325, "y": 4049605.974486...
691035Camp Pendleton SouthCensus Designated PlaceCA06061056161045010616...3893716925003.692865296112558{"x": -13066535.250599463, "y": 3925680.810605...
701053CitrusCensus Designated PlaceCA06061356061132910866...54111625521954.332701861854761{"x": -13123874.444099307, "y": 4044252.329385...
711062CommercecityCA06061497461322712823...54916942427094.1734708816191763{"x": -13151212.145498956, "y": 4027601.902958...
771158Home GardensCensus Designated PlaceCA06063430261222311570...57313529823364.3528651051952808{"x": -13080593.064290542, "y": 4012558.291540...
811177Laguna WoodscityCA06063925961796016192...326461538732.0713016171487302572{"x": -13105613.458129246, "y": 3976538.093303...

6 rows × 50 columns

Let us plot these features that intersect on a map:

m2 = gis.map("Los Angeles, CA", 7)
m2
m2.center = [34, -118]
m2.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
pt_sym = {
    "type": "esriSMS",
    "style": "esriSMSDiamond",
    "color": [255,140,0,255],  # yellowish
    "size": 8,
    "angle": 0,
    "xoffset": 0,
    "yoffset": 0,
    "outline": {
        "color": [255,140,0,255],
        "width": 1}
 }
     
# draw the AOI that intersects
for pt_index in index_of_features:
    m2.draw(shape = df.iloc[pt_index]['SHAPE'], symbol = pt_sym)  

Thus we were able to use the spatial indexes to query features that fall within an extent.

Spatial Joins

DataFrames are table-like structures comprised of rows and columns. In relational database, SQL joins are fundamental operations that combine columns from one or more tables using values that are common to each. They occur in almost all database queries.

A Spatial join is a table operation that affixes data from one feature layer’s attribute table to another based on a spatial relationship. The spatial join involves matching rows from the Join Features (data frame1) to the Target Features (data frame2) based on their spatial relationship.

Let's look at how joins work with dataframes by using subsets of our original DataFrame and the pandas merge fucntionality. We'll then move onto examining a spatial join to combine features from one dataframe with another based on a common attribute value.

Query the DataFrame to extract 3 attribute columns of information from 2 states, Ohio and Michigan:

query = (df['ST'] == 'OH') | (df['ST'] == 'MI')
df1 = df[query][['NAME', 'ST', 'POP2010']]
df1
NAMESTPOP2010
101AllendaleMI17579
102Big RapidsMI10601
103CadillacMI10355
104ColdwaterMI10945
105Comstock ParkMI10088
106FarmingtonMI10372
107Grand HavenMI10412
108Muskegon HeightsMI10856
109SturgisMI10994
166AthensOH23832
167CambridgeOH10635
168CelinaOH10400
192GalionOH10512
193LondonOH9904
194NorthbrookOH10668
195OxfordOH21371
196SpringdaleOH11223
197TrentonOH11869
198University HeightsOH13539
199Van WertOH10846

Query the dataframe again for 8 attribute columns from one state, Ohio

query = df['ST'] == 'OH'
df2 = df[query][['NAME', 'POPULATION','BLACK', 'HAWN_PI', 'HISPANIC', 'WHITE', 'MULT_RACE', 'OTHER']]
df2
NAMEPOPULATIONBLACKHAWN_PIHISPANICWHITEMULT_RACEOTHER
166Athens2543110471057620586559138
167Cambridge105413613129985731633
168Celina1056650412939873164111
192Galion103855011401026412042
193London105435960169883028761
194Northbrook10825297463776995379141
195Oxford23054859249118719478127
196Springdale1150733554419656169331977
197Trenton1239311521981141822435
198University Heights13884313343749726215121
199Van Wert11042180143510263221130

The Pandas merge capability joins dataframes in a style similar to SQL joins, with parameters to indicate the column of shared information and the type of join to perform:

An inner join (the default), is analagous to a SQL left inner join, keeping the order from the left table in the output and returning only those records from the right table that match the value in the column specified with the on parameter:

import pandas as pd

pd.merge(df1, df2, on='NAME', how='inner')
NAMESTPOP2010POPULATIONBLACKHAWN_PIHISPANICWHITEMULT_RACEOTHER
0AthensOH238322543110471057620586559138
1CambridgeOH10635105413613129985731633
2CelinaOH104001056650412939873164111
3GalionOH10512103855011401026412042
4LondonOH9904105435960169883028761
5NorthbrookOH1066810825297463776995379141
6OxfordOH2137123054859249118719478127
7SpringdaleOH112231150733554419656169331977
8TrentonOH118691239311521981141822435
9University HeightsOH1353913884313343749726215121
10Van WertOH1084611042180143510263221130

Notice how all the rows from the left DataFrame appear in the result with all the attribute columns and values appended from the right DataFrame where the column value of NAME matched. The POP2010 attribute from the left DataFrame is combined with all the attributes from the right DataFrame.

An outer join combines all rows from both outputs together and orders the results according to the original row index:

pd.merge(df1, df2, on='NAME', how = 'outer')
NAMESTPOP2010POPULATIONBLACKHAWN_PIHISPANICWHITEMULT_RACEOTHER
0AllendaleMI17579<NA><NA><NA><NA><NA><NA><NA>
1Big RapidsMI10601<NA><NA><NA><NA><NA><NA><NA>
2CadillacMI10355<NA><NA><NA><NA><NA><NA><NA>
3ColdwaterMI10945<NA><NA><NA><NA><NA><NA><NA>
4Comstock ParkMI10088<NA><NA><NA><NA><NA><NA><NA>
5FarmingtonMI10372<NA><NA><NA><NA><NA><NA><NA>
6Grand HavenMI10412<NA><NA><NA><NA><NA><NA><NA>
7Muskegon HeightsMI10856<NA><NA><NA><NA><NA><NA><NA>
8SturgisMI10994<NA><NA><NA><NA><NA><NA><NA>
9AthensOH238322543110471057620586559138
10CambridgeOH10635105413613129985731633
11CelinaOH104001056650412939873164111
12GalionOH10512103855011401026412042
13LondonOH9904105435960169883028761
14NorthbrookOH1066810825297463776995379141
15OxfordOH2137123054859249118719478127
16SpringdaleOH112231150733554419656169331977
17TrentonOH118691239311521981141822435
18University HeightsOH1353913884313343749726215121
19Van WertOH1084611042180143510263221130

The rows where the on parameter value is the same in both tables have all attributes from both DataFrames in the result. The rows from the first DataFrame that do not have a matching NAME value in the second dataframe have values filled in with NaN values.

A spatial join works similarly on matching attribute values. However, instead of joining on an attribue field (like you did earlier), you will join based on the spatial relationship between the records in the two tables.

Example: Merging State Statistics Information with Cities

The goal is to get Wyoming's city locations and census data joined with Wyoming's state census data.

If you do not have access to the ArcPy site-package from the Python interpreter used to execute the following cells, you must authenticate to an ArcGIS Online Organization or ArcGIS Enterprise portal.

from arcgis.gis import GIS
g2 = GIS(profile="your_enterprise_profile")
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import os
data_pth = r'/path/to/your/data/census_2010/example'
cities = r"cities.shp"
states = r"states.shp"
sdf_target = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, cities))
sdf_target.head()
FIDNAMECLASSSTSTFIPSPLACEFIPCAPITALAREALANDAREAWATERPOP_CLASS...MARHH_NO_CMHH_CHILDFHH_CHILDFAMILIESAVE_FAM_SZHSE_UNITSVACANTOWNER_OCCRENTER_OCCSHAPE
00CollegeCensus Designated PlaceAK021675018.6700.4076...93615233926403.13450139723951709{"x": -147.8271911572905, "y": 64.848300194159...
11FairbanksCityAK022423031.8570.8156...2259395105871873.1512357128238637212{"x": -147.72638163006846, "y": 64.83809069704...
22KalispellCityMT30400755.4580.0046...143314748034942.92653239034582684{"x": -114.31606412429451, "y": 48.19780017936...
33Post FallsCityID16648109.6560.0456...185120546746703.13669732846111758{"x": -116.93792709825782, "y": 47.71555468018...
44DishmanCensus Designated PlaceWA53179853.3780.0006...109613134525642.96440825726351516{"x": -117.27780913774282, "y": 47.65654568420...

5 rows × 48 columns

Define a SpatialReference object to pass as the second argument in the from_featureclass function, so calling the method can also project (or transform) output GeoDataFrame to your desired spatial reference. Note: this requires arcpy to work.

from arcgis.geometry import SpatialReference

sdf_join = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, states), sr=SpatialReference(4326).as_arcpy)
sdf_join.head()
FIDSTATE_NAMEDRAWSEQSTATE_FIPSSUB_REGIONSTATE_ABBRSHAPE
00Hawaii115PacificHI{"rings": [[[-160.07380334546815, 22.004177347...
11Washington253PacificWA{"rings": [[[-122.40201531038355, 48.225216372...
22Montana330MountainMT{"rings": [[[-111.47542530020736, 44.702162369...
33Maine423New EnglandME{"rings": [[[-69.77727626137293, 44.0741483685...
44North Dakota538West North CentralND{"rings": [[[-98.73043728833767, 45.9382713702...

We will use python's list comprehensions to create lists of the attribute columns in the DataFrame, then print out the lists to see the names of all the attribute columns.

sdf_target_cols = [column for column in sdf_target.columns]
sdf_join_cols = [column for column in sdf_join.columns]

Print out a list of columns in the sdf_target dataframe created from the cities shapefile:

for a,b,c,d in zip(sdf_target_cols[::4],sdf_target_cols[1::4],sdf_target_cols[2::4], sdf_target_cols[3::4]):
    print("{:<30}{:<30}{:<30}{:<}".format(a,b,c,d))
FID                           NAME                          CLASS                         ST
STFIPS                        PLACEFIP                      CAPITAL                       AREALAND
AREAWATER                     POP_CLASS                     POP2000                       POP2007
WHITE                         BLACK                         AMERI_ES                      ASIAN
HAWN_PI                       OTHER                         MULT_RACE                     HISPANIC
MALES                         FEMALES                       AGE_UNDER5                    AGE_5_17
AGE_18_21                     AGE_22_29                     AGE_30_39                     AGE_40_49
AGE_50_64                     AGE_65_UP                     MED_AGE                       MED_AGE_M
MED_AGE_F                     HOUSEHOLDS                    AVE_HH_SZ                     HSEHLD_1_M
HSEHLD_1_F                    MARHH_CHD                     MARHH_NO_C                    MHH_CHILD
FHH_CHILD                     FAMILIES                      AVE_FAM_SZ                    HSE_UNITS
VACANT                        OWNER_OCC                     RENTER_OCC                    SHAPE

Print out a list of columns in the sdf_join dataframe created from the states shapefile:

for a,b,c,d,e in zip(sdf_join_cols[::5],sdf_join_cols[1::5],sdf_join_cols[2::5],sdf_join_cols[3::5],sdf_join_cols[4::5]):
    print("{:<20}{:<20}{:<20}{:<20}{:<}".format(a,b,c,d,e))
FID                 STATE_NAME          DRAWSEQ             STATE_FIPS          SUB_REGION

Create a DataFrame for the cities in Wyoming:

sdf_target.loc[0]['SHAPE'].as_arcpy
q = sdf_target['ST'] == 'WY'
left = sdf_target[q].copy()
left.head()
FIDNAMECLASSSTSTFIPSPLACEFIPCAPITALAREALANDAREAWATERPOP_CLASS...MARHH_NO_CMHH_CHILDFHH_CHILDFAMILIESAVE_FAM_SZHSE_UNITSVACANTOWNER_OCCRENTER_OCCSHAPE
711711Green RiverCityWY563374013.7060.3156...127811325132143.22442624931691008{"x": -109.46492712301152, "y": 41.51419117328...
712712Rock SpringsCityWY566723518.4410.0006...201222053649313.028359101152742074{"x": -109.22240010498797, "y": 41.59092714080...
715715EvanstonCityWY562562010.2450.0446...97613936929403.30466560728051253{"x": -110.96461812552366, "y": 41.26330015271...
764764LaramieCityWY564505011.1380.0196...249617458756082.831199465853795957{"x": -105.58725462620347, "y": 41.31292665660...
766766CheyenneCityWY5613900State21.1080.0827...62994901610141742.93237821458147397585{"x": -104.80204559586696, "y": 41.14554516058...

5 rows × 48 columns

Create a dataframe for the state of Wyoming:

q = sdf_join.STATE_ABBR == 'WY'
right = sdf_join[q].copy()
right.head()
FIDSTATE_NAMEDRAWSEQSTATE_FIPSSUB_REGIONSTATE_ABBRSHAPE
66Wyoming756MountainWY{"rings": [[[-104.05361529329527, 41.698218366...

Perform the spatial join:

Before performing a spatial join between these two DataFrame objects, we can check the SpatialReference of each one of them to validate if they are the same SR - this is a pre-requisite of joining two DataFrames.

left.spatial.sr
{'wkid': 4326, 'latestWkid': 4326}
right.spatial.sr
{'wkid': 4326, 'latestWkid': 4326}
sdf2 = left.spatial.join(right)
sdf2
FID_leftNAMECLASSSTSTFIPSPLACEFIPCAPITALAREALANDAREAWATERPOP_CLASS...OWNER_OCCRENTER_OCCSHAPEindex_rightFID_rightSTATE_NAMEDRAWSEQSTATE_FIPSSUB_REGIONSTATE_ABBR
0711Green RiverCityWY563374013.7060.3156...31691008{"x": -109.46492712301152, "y": 41.51419117328...66Wyoming756MountainWY
1712Rock SpringsCityWY566723518.4410.0006...52742074{"x": -109.22240010498797, "y": 41.59092714080...66Wyoming756MountainWY
2715EvanstonCityWY562562010.2450.0446...28051253{"x": -110.96461812552366, "y": 41.26330015271...66Wyoming756MountainWY
3764LaramieCityWY564505011.1380.0196...53795957{"x": -105.58725462620347, "y": 41.31292665660...66Wyoming756MountainWY
4766CheyenneCityWY5613900State21.1080.0827...147397585{"x": -104.80204559586696, "y": 41.14554516058...66Wyoming756MountainWY
51216SheridanCityWY56698458.4860.0186...44462559{"x": -106.95897260592156, "y": 44.79671814410...66Wyoming756MountainWY
61218CasperCityWY561315023.9450.3166...136166727{"x": -106.32506361818486, "y": 42.83466364743...66Wyoming756MountainWY
71219GilletteCityWY563185513.3690.0256...48672523{"x": -105.50525462413556, "y": 44.28266365145...66Wyoming756MountainWY

8 rows × 55 columns

Notice, you retain the geometry type of your left DataFrame (points) in this case, however, you get all the attributes from both the left and right DataFrames. Let us plot the results of the spatial join on a map:

m3 = gis.map("Wyoming", 6)
m3
m3.center = [43, -107]
from arcgis.geometry import Geometry

# draw the spatial join results on Wyoming state
for idx, row in sdf2.iterrows():
    m3.draw(row['SHAPE'], symbol=pt_sym)

Conclusion

Spatially Enabled DataFrame give you powerful data analysis and data wrangling capabilities. In addition to performing sql like operations on attribute data, you can perform geographic queries. This guide demonstrated some of these advanced capabilities of the SEDF.

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.