Spatially Enabled DataFrames - Advanced Topics

The information in this section provides a brief introduction to advanced topics with the Spatially Enabled DataFrame structure.

One of the most important tasks for software applications is to quickly retrieve and process information. Enterprise systems, whether storing GIS information or not, all utilize the concept of indexing to allow for quick searching through large data stores to locate and select specific information for subsequent processing.

This document will outline how row and column indexing work in Spatially Enabled Dataframes and also demonstrate building a spatial index on dataframe geometries to allow for quick searching, accessing, and processing. The document will also demonstrate spatial joins to combine dataframes.

DataFrame Index

As mentioned in the Introduction to the Spatially Enabled DataFrame guide, the Pandas DataFrame structure underlies the ArcGIS API for Python's Spatially Enabled DataFrame. Pandas DataFrames are analagous to spreadsheets. They have a row axis and a column axis. Each of these axes are indexed and labeled for quick and easy identification, data alignment, and retrieval and updating of data subsets.

Let's explore the axes labels and indices and how they allow for data exploraation:

from arcgis.gis import GIS
gis = GIS()

When working with an ArcGIS Online feature layer, the query() method returns a FeatureSet object which has a sdf method to instantiate a Spatially Enabled DataFrame.

from arcgis import GIS
item = gis.content.get("85d0ca4ea1ca4b9abf0c51b9bd34de2e")
flayer = item.layers[0]
df = flayer.query(where="AGE_45_54 < 1500").sdf
df.head()
AGE_10_14AGE_15_19AGE_20_24AGE_25_34AGE_35_44AGE_45_54AGE_55_64AGE_5_9AGE_65_74AGE_75_84...PLACEFIPSPOP2010POPULATIONPOP_CLASSRENTER_OCCSHAPESTSTFIPSVACANTWHITE
01413138111062138181514119791557525307...0468080142871498061074{"x": -12768343.256613126, "y": 3842463.708135...AZ042619196
17277386771380118513331087740661444...060204299321023962056{"x": -13613950.337588644, "y": 4931686.754090...CA062678273
25935112323276774612734122942...0610561106161186962558{"x": -13066582.116550362, "y": 3925650.676616...CA062967530
3888988900172914791443959766514280...061356010866111956761{"x": -13123874.446103057, "y": 4044249.710416...CA06865898
41086122810131822175914781112925687477...0614974128231300961763{"x": -13151212.145276317, "y": 4027601.332347...CA06886930

5 rows × 51 columns

Describing the DataFrame

The DataFrame.info() provides a concise summary of the object. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

Example: Displaying info on Spatially Enabled DataFrame (SEDF)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 51 columns):
AGE_10_14     317 non-null int64
AGE_15_19     317 non-null int64
AGE_20_24     317 non-null int64
AGE_25_34     317 non-null int64
AGE_35_44     317 non-null int64
AGE_45_54     317 non-null int64
AGE_55_64     317 non-null int64
AGE_5_9       317 non-null int64
AGE_65_74     317 non-null int64
AGE_75_84     317 non-null int64
AGE_85_UP     317 non-null int64
AGE_UNDER5    317 non-null int64
AMERI_ES      317 non-null int64
ASIAN         317 non-null int64
AVE_FAM_SZ    317 non-null float64
AVE_HH_SZ     317 non-null float64
BLACK         317 non-null int64
CAPITAL       317 non-null object
CLASS         317 non-null object
FAMILIES      317 non-null int64
FEMALES       317 non-null int64
FHH_CHILD     317 non-null int64
FID           317 non-null int64
HAWN_PI       317 non-null int64
HISPANIC      317 non-null int64
HOUSEHOLDS    317 non-null int64
HSEHLD_1_F    317 non-null int64
HSEHLD_1_M    317 non-null int64
HSE_UNITS     317 non-null int64
MALES         317 non-null int64
MARHH_CHD     317 non-null int64
MARHH_NO_C    317 non-null int64
MED_AGE       317 non-null float64
MED_AGE_F     317 non-null float64
MED_AGE_M     317 non-null float64
MHH_CHILD     317 non-null int64
MULT_RACE     317 non-null int64
NAME          317 non-null object
OBJECTID      317 non-null int64
OTHER         317 non-null int64
OWNER_OCC     317 non-null int64
PLACEFIPS     317 non-null object
POP2010       317 non-null int64
POPULATION    317 non-null int64
POP_CLASS     317 non-null int64
RENTER_OCC    317 non-null int64
SHAPE         317 non-null geometry
ST            317 non-null object
STFIPS        317 non-null object
VACANT        317 non-null int64
WHITE         317 non-null int64
dtypes: float64(5), geometry(1), int64(39), object(6)
memory usage: 126.4+ KB

We can see that the SHAPE column is of type geometry. This means that compared to the legacy SpatialDataFrame class, geometry columns are now unique instead of being just of type object.

We can get information about each axis label (aka, index) with the axes property on the spatial dataframe.

print("{:<15}{}\n\n{}{}".format("Row axis: ", df.axes[0], "Column axis: ", df.axes[1]))
Row axis:      RangeIndex(start=0, stop=317, step=1)

Column axis: Index(['AGE_10_14', 'AGE_15_19', 'AGE_20_24', 'AGE_25_34', 'AGE_35_44',
       'AGE_45_54', 'AGE_55_64', 'AGE_5_9', 'AGE_65_74', 'AGE_75_84',
       'AGE_85_UP', 'AGE_UNDER5', 'AMERI_ES', 'ASIAN', 'AVE_FAM_SZ',
       'AVE_HH_SZ', 'BLACK', 'CAPITAL', 'CLASS', 'FAMILIES', 'FEMALES',
       'FHH_CHILD', 'FID', 'HAWN_PI', 'HISPANIC', 'HOUSEHOLDS', 'HSEHLD_1_F',
       'HSEHLD_1_M', 'HSE_UNITS', 'MALES', 'MARHH_CHD', 'MARHH_NO_C',
       'MED_AGE', 'MED_AGE_F', 'MED_AGE_M', 'MHH_CHILD', 'MULT_RACE', 'NAME',
       'OBJECTID', 'OTHER', 'OWNER_OCC', 'PLACEFIPS', 'POP2010', 'POPULATION',
       'POP_CLASS', 'RENTER_OCC', 'SHAPE', 'ST', 'STFIPS', 'VACANT', 'WHITE'],
      dtype='object')

Row axis information informs us we can retrieve information using the the dataframe loc attribute and any value in the range 0-317 inclusive to access a row. Column axis information tells us we can use any string in the index to return an attribute column:

df.loc[0] #the first row returned
AGE_10_14                                                  1413
AGE_15_19                                                  1381
AGE_20_24                                                  1106
AGE_25_34                                                  2138
AGE_35_44                                                  1815
AGE_45_54                                                  1411
AGE_55_64                                                   979
AGE_5_9                                                    1557
AGE_65_74                                                   525
AGE_75_84                                                   307
AGE_85_UP                                                    93
AGE_UNDER5                                                 1562
AMERI_ES                                                    112
ASIAN                                                        55
AVE_FAM_SZ                                                 3.99
AVE_HH_SZ                                                  3.77
BLACK                                                       122
CAPITAL                                                        
CLASS                                                      city
FAMILIES                                                   3352
FEMALES                                                    7385
FHH_CHILD                                                   672
FID                                                           6
HAWN_PI                                                       8
HISPANIC                                                  13708
HOUSEHOLDS                                                 3791
HSEHLD_1_F                                                  227
HSEHLD_1_M                                                  207
HSE_UNITS                                                  4052
MALES                                                      6902
MARHH_CHD                                                  1621
MARHH_NO_C                                                  673
MED_AGE                                                    25.5
MED_AGE_F                                                  26.8
MED_AGE_M                                                  24.2
MHH_CHILD                                                   188
MULT_RACE                                                   345
NAME                                                   Somerton
OBJECTID                                                      6
OTHER                                                      4449
OWNER_OCC                                                  2717
PLACEFIPS                                               0468080
POP2010                                                   14287
POPULATION                                                14980
POP_CLASS                                                     6
RENTER_OCC                                                 1074
SHAPE         {'x': -12768343.256613126, 'y': 3842463.708135...
ST                                                           AZ
STFIPS                                                       04
VACANT                                                      261
WHITE                                                      9196
Name: 0, dtype: object
df['POP2010'] #the data from the `POP2010` attribute column
0      14287
1       9932
2      10616
3      10866
4      12823
5      10755
6      10080
7      10334
8      10588
9      10648
10     11570
11     23096
12     13544
13     12874
14     16192
15     10802
16     11768
17     13058
18     12707
19     11014
20     10234
21      9675
22     10644
23     10224
24     14494
25     10471
26     10127
27     11211
28     11602
29     29589
       ...  
287    10687
288     9840
289    11545
290    13975
291    10465
292    42034
293    10372
294    10568
295    18461
296    10461
297     9845
298    14068
299    13816
300    11899
301    10345
302    13922
303    25484
304    10494
305     9577
306    10615
307    10600
308    11070
309    13431
310    10060
311    10869
312    10125
313    11196
314    10697
315    12764
316    10571
Name: POP2010, Length: 317, dtype: int64

Slicing DataFrames

We can access rows, columns and subsets of rows and columns using Python slicing:

#rows 0-9 with a subset of columns indexed as a list
df.loc[0:9][['OBJECTID', 'NAME', 'ST', 'POP2010', 'POPULATION']]
OBJECTIDNAMESTPOP2010POPULATION
06SomertonAZ1428714980
120AndersonCA993210239
267Camp Pendleton SouthCA1061611869
385CitrusCA1086611195
494CommerceCA1282313009
5113DelhiCA1075511336
6139EmeryvilleCA1008011010
7143ExeterCA1033410642
8148FarmersvilleCA1058810768
9164Garden AcresCA1064810708

We can use indexing to access SHAPE information and draw it on a map:

camp_pendleton_s_geodefn = df.loc[2]['SHAPE'] #geometry definition from row 2
camp_pendleton_s_geodefn
{'x': -13066582.116550362,
 'y': 3925650.6766163236,
 'spatialReference': {'wkid': 102100, 'latestWkid': 3857}}
m = gis.map("San Diego, CA", 8)
m

Camp Pendleton_South_pt

m.draw(camp_pendleton_s_geodefn)

Spatial Index

In addition to row and column indices to search a DataFrame, we can use a spatial indexes to quickly access information based on its location and relationship with other features. They are based on the concept of a minimum bounding rectangle - the smallest rectangle that contains an entire geometric shape. Each of these rectangles are then grouped into leaf nodes representing a single shape and node structures containing groups of shapes according to whatever algorithm the different types of spatial indexing use. Querying these rectangles requires magnitudes fewer compute resources for accessing and processing geometries relative to accessing the entire feature array of coordinate pairs that compose a shape. Access to points, complex lines and irregularly-shaped polygons becomes much quicker and easier through different flavors of spatial indexing.

The Spatially Enabled DataFrame uses an implementation of spatial indexing known as QuadTree indexing, which searches nodes when determining locations, relationships and attributes of specific features. QuadTree indexes are the default spatial index, but the SEDF also supports r-tree implementations. In the DataFrame Index section of this notebook, the USA Major Cities feature layer was queried and the sdf property was called on the results to create a DataFrame. The sindex method on the DataFrame creates a QuadTree index:

si = df.spatial.sindex('quadtree',reset=False)

Let's visually inspect the external frame of the QuadTree index. We'll then plot the spatial dataframe to ensure the spatial index encompasses all our features:

midx = gis.map("United States", 3)
midx

Spatial Index Envelope

midx.center = [39, -98]
midx.basemap = 'gray'
df.spatial.plot(map_widget=midx)
True

Let's use the feature we drew earlier to define a spatial reference variable for use throughout the rest of this guide.

sp_ref = camp_pendleton_s_geodefn['spatialReference']
sp_ref
{'wkid': 102100, 'latestWkid': 3857}
import time
from arcgis.geometry import Geometry, Polygon

#define a symbol to visualize the spatial index quadrants
sym = {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [0,0,0,0],
    "outline": {
        "type": "esriSLS",
        "style": "esriSLSSolid",
        "color": [0,0,0,255],
        "width": 4}
}

# loop through the children of the root index and draw each extent
# using a different outline color
for i in range(len(si._index.children)):
    sym["outline"]["color"][i] = 255
    if i > 0:
        sym["outline"]["color"][i] = 255
        sym["outline"]["color"][i-1] = 0
    child = si._index.children[i]
    width_factor = child.width/2
    height_factor = child.width/2
    minx = child.center[0] - width_factor
    miny = child.center[1] - height_factor
    maxx = child.center[0] + width_factor
    maxy = child.center[1] + height_factor
    child_geom = Geometry({
        'rings':[[[minx,miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]],
        'spatialReference': sp_ref})
    #child_extent = Polygon(child_geom)
    midx.draw(shape = child_geom, symbol = sym)
    time.sleep(2)

Intersection with the Spatial Index

Up to this point in this guide, we've talked about using indexing for querying attributes in the dataframe. For example:

query = df['ST'] == 'MI'
df[query]
AGE_10_14AGE_15_19AGE_20_24AGE_25_34AGE_35_44AGE_45_54AGE_55_64AGE_5_9AGE_65_74AGE_75_84...PLACEFIPSPOP2010POPULATIONPOP_CLASSRENTER_OCCSHAPESTSTFIPSVACANTWHITE
2127857967121465130213791172818776570...2617020109451060161679{"x": -9462606.190132478, "y": 5152068.9276487...MI2657210124
2138188037351579129813591050928610461...2676960109941100761755{"x": -9508722.837630691, "y": 5131265.4579707...MI265078856
22875746054834157613571155757871382251...2601340175791854762096{"x": -9568014.52123174, "y": 5309699.01482792...MI2622715937
229313223836801146540613568360309243...2608300106011092562194{"x": -9515619.23025533, "y": 5419011.90562258...MI262939326
2305986457961435109513781081678792656...2612320103551025461761{"x": -9508450.065025873, "y": 5504121.4266238...MI266479902
231660672946181611601332972712539327...2617700100881041362193{"x": -9537617.651799908, "y": 5318637.2100821...MI264847938
2325685794561572148514931291643689622...2627380103721046461748{"x": -9281638.045680454, "y": 5230344.9560653...MI263357415
2335955425911312120214201545594878683...2633340104121074861530{"x": -9597964.623579567, "y": 5320394.2544461...MI2610469891
2348788777801463128613261126965542359...2656360108561091461954{"x": -9600424.322350215, "y": 5342825.3921295...MI268461739

9 rows × 51 columns

We can query multiple attributes and filter on the column output as well:

query = (df['POP2010'] > 20000) & (df['ST'] == 'OH')
df[query][['NAME','ST','POP2010','HOUSEHOLDS','HSEHLD_1_F', 'HSEHLD_1_M']]
NAMESTPOP2010HOUSEHOLDSHSEHLD_1_FHSEHLD_1_M
220AthensOH23832690324742573
247OxfordOH21371579920331850

As GIS analysts and data scientists, we also want to query based on geographic location. We can do that by building a spatial index with the sindex property of the spatial dataframe. The resulting quadtree index allows us to query based on specific geometries in relation to other geometries.

Let's continue looking at the dataframe wer're working with: US cities with a population between the ages of 45 and 54 of less than 1500.

We can draw the entire extent of our dataframe using the dataframe's geoextent property. Let's get the bounding box coordinates:

df_geoextent = df.spatial.full_extent
df_geoextent
(-17595357, 2429399, -7895101, 6266432)

Let's use these coordinates, place them in more descriptive variable names, then create a bounding box to make a geometry object representing the extent of our dataframe. Finally we'll draw it on the a map:

df_geoextent_geom = df.spatial.bbox
df_geoextent_geom
{'rings': [[[-17595357, 2429399],
   [-17595357, 6266432],
   [-7895101, 6266432],
   [-7895101, 2429399],
   [-17595357, 2429399]]],
 'spatialReference': {'wkid': 102100}}
m1 = gis.map("United States", 3)
m1

Dataframe Extent with AOI

m1.center = [39, -98]
sym_poly = {
  "type": "esriSFS",
  "style": "esriSFSSolid",
  "color": [0,0,0,0],  # hollow, no fill
    "outline": {
     "type": "esriSLS",
     "style": "esriSLSSolid",
     "color": [255,0,0,255],  # red border
     "width": 3}
}

m1.draw(shape = df_geoextent_geom, symbol = sym_poly)

Now, let's define a second set of coordinates representing a bounding box for which we want to query the features from our dataframe that fall within it.

We can define our list of coordinates, and then draw it on the map to make sure it falls within our dataframe extent:

area_of_interest = [-13043219.122301877, 3911134.034258818, -13243219.102301877, 4111134.0542588173]
minx, miny, maxx, maxy = area_of_interest[0], area_of_interest[1], area_of_interest[2], area_of_interest[3]

area_of_interest_ring = [[[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]]
area_of_interest_geom = Geometry({'rings': area_of_interest_ring, 'spatialReference': sp_ref})

sym_poly_aoi = {
  "type": "esriSFS",
  "style": "esriSFSSolid",
  "color": [0,0,0,0],  # hollow, no fill
    "outline": {
     "type": "esriSLS",
     "style": "esriSLSSolid",
     "color": [0,255,0,255],   # green border
     "width": 3}
}

m1.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)

We can see that our area of interest box falls within the dataframe extent. The spatial index has an intersect method which takes a bounding box as input and returns a list of integer values from the row index of our spatial dataframe. We can use the dataframe's iloc integer-indexing attribute to then loop through the dataframe and put draw the features on a map

index_of_features = si.intersect(area_of_interest)
df.iloc[index_of_features]
AGE_10_14AGE_15_19AGE_20_24AGE_25_34AGE_35_44AGE_45_54AGE_55_64AGE_5_9AGE_65_74AGE_75_84...PLACEFIPSPOP2010POPULATIONPOP_CLASSRENTER_OCCSHAPESTSTFIPSVACANTWHITE
25935112323276774612734122942...0610561106161186962558{"x": -13066582.116550362, "y": 3925650.676616...CA062967530
3888988900172914791443959766514280...061356010866111956761{"x": -13123874.446103057, "y": 4044249.710416...CA06865898
41086122810131822175914781112925687477...0614974128231300961763{"x": -13151212.145276317, "y": 4027601.332347...CA06886930
1010761118952170716511397931955524258...063430211570120256808{"x": -13080593.062843386, "y": 4012557.378155...CA061055275
1414244411315352324251138835122...0639259161921749962572{"x": -13105436.040130444, "y": 3977081.952377...CA06171414133
22104112191041140213171271727996396145...065013210644110926963{"x": -13063142.772085657, "y": 4049606.978283...CA062124459

6 rows × 51 columns

Let us plot these features that intersect on a map:

m2 = gis.map("Los Angeles, CA", 7)
m2

Area_of_interest_intersect

m2.center = [34, -118]
m2.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
pt_sym = {
    "type": "esriSMS",
    "style": "esriSMSDiamond",
    "color": [255,140,0,255],  # yellowish
    "size": 8,
    "angle": 0,
    "xoffset": 0,
    "yoffset": 0,
    "outline": {
        "color": [255,140,0,255],
        "width": 1}
 }
        
for pt_index in index_of_features:
    m2.draw(shape = df.iloc[pt_index]['SHAPE'], symbol = pt_sym)  

Thus we were able to use the spatial indexes to query features that fall within an extent.

Spatial Joins

DataFrames are table-like structures comprised of rows and columns. In relational database, SQL joins are fundamental operations that combine columns from one or more tables using values that are common to each. They occur in almost all database queries.

A Spatial join is a table operation that affixes data from one feature layer’s attribute table to another based on a spatial relationship. The spatial join involves matching rows from the Join Features (data frame1) to the Target Features (data frame2) based on their spatial relationship.

Let's look at how joins work with dataframes by using subsets of our original DataFrame and the pandas merge fucntionality. We'll then move onto examining a spatial join to combine features from one dataframe with another based on a common attribute value.

Query the DataFrame to extract 3 attribute columns of information from 2 states, Ohio and Michigan:

query = (df['ST'] == 'OH') | (df['ST'] == 'MI')
df1 = df[query][['NAME', 'ST', 'POP2010']]
df1
NAMESTPOP2010
212ColdwaterMI10945
213SturgisMI10994
220AthensOH23832
228AllendaleMI17579
229Big RapidsMI10601
230CadillacMI10355
231Comstock ParkMI10088
232FarmingtonMI10372
233Grand HavenMI10412
234Muskegon HeightsMI10856
242CambridgeOH10635
243CelinaOH10400
244GalionOH10512
245LondonOH9904
246NorthbrookOH10668
247OxfordOH21371
248SpringdaleOH11223
249TrentonOH11869
250University HeightsOH13539
251Van WertOH10846

Query the dataframe again for 8 attribute columns from one state, Ohio

query = df['ST'] == 'OH'
df2 = df[query][['NAME', 'POPULATION','BLACK', 'HAWN_PI', 'HISPANIC', 'WHITE', 'MULT_RACE', 'OTHER']]
df2
NAMEPOPULATIONBLACKHAWN_PIHISPANICWHITEMULT_RACEOTHER
220Athens2450910471057620586559138
242Cambridge104793613129985731633
243Celina1046150412939873164111
244Galion102435011401026412042
245London102795960169883028761
246Northbrook10607297463776995379141
247Oxford22270859249118719478127
248Springdale1117233554419656169331977
249Trenton1224411521981141822435
250University Heights13874313343749726215121
251Van Wert11081180143510263221130

The Pandas merge capability joins dataframes in a style similar to SQL joins, with parameters to indicate the column of shared information and the type of join to perform:

An inner join (the default), is analagous to a SQL left inner join, keeping the order from the left table in the output and returning only those records from the right table that match the value in the column specified with the on parameter:

import pandas as pd

pd.merge(df1, df2, on='NAME', how='inner')
NAMESTPOP2010POPULATIONBLACKHAWN_PIHISPANICWHITEMULT_RACEOTHER
0AthensOH238322450910471057620586559138
1CambridgeOH10635104793613129985731633
2CelinaOH104001046150412939873164111
3GalionOH10512102435011401026412042
4LondonOH9904102795960169883028761
5NorthbrookOH1066810607297463776995379141
6OxfordOH2137122270859249118719478127
7SpringdaleOH112231117233554419656169331977
8TrentonOH118691224411521981141822435
9University HeightsOH1353913874313343749726215121
10Van WertOH1084611081180143510263221130

Notice how all the rows from the left DataFrame appear in the result with all the attribute columns and values appended from the right DataFrame where the column value of NAME matched. The POP2010 attribute from the left DataFrame is combined with all the attributes from the right DataFrame.

An outer join combines all rows from both outputs together and orders the results according to the original row index:

pd.merge(df1, df2, on='NAME', how = 'outer')
NAMESTPOP2010POPULATIONBLACKHAWN_PIHISPANICWHITEMULT_RACEOTHER
0ColdwaterMI10945NaNNaNNaNNaNNaNNaNNaN
1SturgisMI10994NaNNaNNaNNaNNaNNaNNaN
2AthensOH2383224509.01047.010.0576.020586.0559.0138.0
3AllendaleMI17579NaNNaNNaNNaNNaNNaNNaN
4Big RapidsMI10601NaNNaNNaNNaNNaNNaNNaN
5CadillacMI10355NaNNaNNaNNaNNaNNaNNaN
6Comstock ParkMI10088NaNNaNNaNNaNNaNNaNNaN
7FarmingtonMI10372NaNNaNNaNNaNNaNNaNNaN
8Grand HavenMI10412NaNNaNNaNNaNNaNNaNNaN
9Muskegon HeightsMI10856NaNNaNNaNNaNNaNNaNNaN
10CambridgeOH1063510479.0361.03.0129.09857.0316.033.0
11CelinaOH1040010461.050.041.0293.09873.0164.0111.0
12GalionOH1051210243.050.01.0140.010264.0120.042.0
13LondonOH990410279.0596.00.0169.08830.0287.061.0
14NorthbrookOH1066810607.02974.06.0377.06995.0379.0141.0
15OxfordOH2137122270.0859.02.0491.018719.0478.0127.0
16SpringdaleOH1122311172.03355.044.01965.06169.0331.0977.0
17TrentonOH1186912244.0115.02.0198.011418.0224.035.0
18University HeightsOH1353913874.03133.04.0374.09726.0215.0121.0
19Van WertOH1084611081.0180.01.0435.010263.0221.0130.0

The rows where the on parameter value is the same in both tables have all attributes from both DataFrames in the result. The rows from the first DataFrame that do not have a matching NAME value in the second dataframe have values filled in with NaN values.

A spatial join works similarly on matching attribute values. However, instead of joining on an attribue field (like you did earlier), you will join based on the spatial relationship between the records in the two tables.

Example: Merging State Statistics Information with Cities

The goal is to get Wyoming's city locations and census data joined with Wyoming's state census data.

If you do not have access to the ArcPy site-package from the Python interpreter used to execute the following cells, you must authenticate to an ArcGIS Online Organization or ArcGIS Enterprise portal.

from arcgis.gis import GIS
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
g2 = GIS("https://pythonapi.playground.esri.com/portal", "arcgis_python", "amazing_arcgis_123")
import os
data_pth = r'/path/to/your/data/census_2010/example'
cities = r"cities.shp"
states = r"states.shp"
sdf_target = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, cities))
sdf_join = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, states))

We will use python's list comprehensions to create lists of the attribute columns in the DataFrame, then print out the lists to see the names of all the attribute columns.

sdf_target_cols = [column for column in sdf_target.columns]
sdf_join_cols = [column for column in sdf_join.columns]

Print out a list of columns in the sdf_target dataframe created from the cities shapefile:

for a,b,c,d in zip(sdf_target_cols[::4],sdf_target_cols[1::4],sdf_target_cols[2::4], sdf_target_cols[3::4]):
    print("{:<30}{:<30}{:<30}{:<}".format(a,b,c,d))
FID                           NAME                          CLASS                         ST
STFIPS                        PLACEFIP                      CAPITAL                       AREALAND
AREAWATER                     POP_CLASS                     POP2000                       POP2007
WHITE                         BLACK                         AMERI_ES                      ASIAN
HAWN_PI                       OTHER                         MULT_RACE                     HISPANIC
MALES                         FEMALES                       AGE_UNDER5                    AGE_5_17
AGE_18_21                     AGE_22_29                     AGE_30_39                     AGE_40_49
AGE_50_64                     AGE_65_UP                     MED_AGE                       MED_AGE_M
MED_AGE_F                     HOUSEHOLDS                    AVE_HH_SZ                     HSEHLD_1_M
HSEHLD_1_F                    MARHH_CHD                     MARHH_NO_C                    MHH_CHILD
FHH_CHILD                     FAMILIES                      AVE_FAM_SZ                    HSE_UNITS
VACANT                        OWNER_OCC                     RENTER_OCC                    SHAPE

Print out a list of columns in the sdf_join dataframe created from the states shapefile:

for a,b,c,d,e in zip(sdf_join_cols[::5],sdf_join_cols[1::5],sdf_join_cols[2::5],sdf_join_cols[3::5],sdf_join_cols[4::5]):
    print("{:<20}{:<20}{:<20}{:<20}{:<}".format(a,b,c,d,e))
FID                 STATE_NAME          STATE_FIPS          SUB_REGION          STATE_ABBR
POPULATION          POP_SQMI            POP2010             POP10_SQMI          WHITE
BLACK               AMERI_ES            ASIAN               HAWN_PI             HISPANIC
OTHER               MULT_RACE           MALES               FEMALES             AGE_UNDER5
AGE_5_9             AGE_10_14           AGE_15_19           AGE_20_24           AGE_25_34
AGE_35_44           AGE_45_54           AGE_55_64           AGE_65_74           AGE_75_84
AGE_85_UP           MED_AGE             MED_AGE_M           MED_AGE_F           HOUSEHOLDS
AVE_HH_SZ           HSEHLD_1_M          HSEHLD_1_F          MARHH_CHD           MARHH_NO_C
MHH_CHILD           FHH_CHILD           FAMILIES            AVE_FAM_SZ          HSE_UNITS
VACANT              OWNER_OCC           RENTER_OCC          NO_FARMS12          AVE_SIZE12
CROP_ACR12          AVE_SALE12          SQMI                Shape_Leng          Shape_Area

Create a DataFrame for the cities in Wyoming:

sdf_target.loc[0]['SHAPE']
{'x': -147.8271911572905,
 'y': 64.84830019415946,
 'spatialReference': {'wkid': 4326, 'latestWkid': 4326}}
q = sdf_target['ST'] == 'WY'
left = sdf_target[q].copy()
left.head()
FIDNAMECLASSSTSTFIPSPLACEFIPCAPITALAREALANDAREAWATERPOP_CLASS...MARHH_NO_CMHH_CHILDFHH_CHILDFAMILIESAVE_FAM_SZHSE_UNITSVACANTOWNER_OCCRENTER_OCCSHAPE
711711Green RiverCityWY563374013.7060.3156...127811325132143.22442624931691008{"x": -109.46492712301152, "y": 41.51419117328...
712712Rock SpringsCityWY566723518.4410.0006...201222053649313.028359101152742074{"x": -109.22240010498797, "y": 41.59092714080...
715715EvanstonCityWY562562010.2450.0446...97613936929403.30466560728051253{"x": -110.96461812552366, "y": 41.26330015271...
764764LaramieCityWY564505011.1380.0196...249617458756082.831199465853795957{"x": -105.58725462620347, "y": 41.31292665660...
766766CheyenneCityWY5613900State21.1080.0827...62994901610141742.93237821458147397585{"x": -104.80204559586696, "y": 41.14554516058...

5 rows × 48 columns

Create a dataframe for the state of Wyoming:

q = sdf_join.STATE_ABBR == 'WY'
right = sdf_join[q].copy()
right.head()
FIDSTATE_NAMESTATE_FIPSSUB_REGIONSTATE_ABBRPOPULATIONPOP_SQMIPOP2010POP10_SQMIWHITE...OWNER_OCCRENTER_OCCNO_FARMS12AVE_SIZE12CROP_ACR12AVE_SALE12SQMIShape_LengShape_AreaSHAPE
1010Wyoming56MountainWY5871066.05636265.8511279...1570776980211736.02587.02418931.0143952.097813.8921.9870327.966688{"rings": [[[-104.05361521909884, 41.698218275...

1 rows × 56 columns

Perform the spatial join:

sdf2 = left.spatial.join(right)
sdf2
FID_leftNAMECLASSSTSTFIPSPLACEFIPCAPITALAREALANDAREAWATERPOP_CLASS...VACANT_rightOWNER_OCC_rightRENTER_OCC_rightNO_FARMS12AVE_SIZE12CROP_ACR12AVE_SALE12SQMIShape_LengShape_Area
0711Green RiverCityWY563374013.7060.3156...349891570776980211736.02587.02418931.0143952.097813.8921.9870327.966688
1712Rock SpringsCityWY566723518.4410.0006...349891570776980211736.02587.02418931.0143952.097813.8921.9870327.966688
2715EvanstonCityWY562562010.2450.0446...349891570776980211736.02587.02418931.0143952.097813.8921.9870327.966688
3764LaramieCityWY564505011.1380.0196...349891570776980211736.02587.02418931.0143952.097813.8921.9870327.966688
4766CheyenneCityWY5613900State21.1080.0827...349891570776980211736.02587.02418931.0143952.097813.8921.9870327.966688
51216SheridanCityWY56698458.4860.0186...349891570776980211736.02587.02418931.0143952.097813.8921.9870327.966688
61218CasperCityWY561315023.9450.3166...349891570776980211736.02587.02418931.0143952.097813.8921.9870327.966688
71219GilletteCityWY563185513.3690.0256...349891570776980211736.02587.02418931.0143952.097813.8921.9870327.966688

8 rows × 104 columns

Notice, you retain the geometry type of your left DataFrame (points) in this case, however, you get all the attributes from both the left and right DataFrames. Let us plot the results of the spatial join on a map:

m3 = gis.map("Wyoming", 6)
m3

Wyoming Spatial Join

m3.center = [43, -107]
from arcgis.geometry import Geometry

for idx, row in sdf2.iterrows():
    m3.draw(row['SHAPE'], symbol=pt_sym)

Conclusion

Spatially Enabled DataFrame give you powerful data analysis and data wrangling capabilities. In addition to performing sql like operations on attribute data, you can perform geographic queries. This guide demonstrated some of these advanced capabilities of the SEDF.

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.