Spatially Enabled DataFrames - Advanced Topics

The information in this section provides a brief introduction to advanced topics with the Spatially Enabled DataFrame structure.

One of the most important tasks for software applications is to quickly retrieve and process information. Enterprise systems, whether storing GIS information or not, all utilize the concept of indexing to allow for quick searching through large data stores to locate and select specific information for subsequent processing.

This document will outline how row and column indexing work in Spatially Enabled Dataframes and also demonstrate building a spatial index on dataframe geometries to allow for quick searching, accessing, and processing. The document will also demonstrate spatial joins to combine dataframes.

DataFrame Index

As mentioned in the Introduction to the Spatially Enabled DataFrame guide, the Pandas DataFrame structure underlies the ArcGIS API for Python's Spatially Enabled DataFrame. Pandas DataFrames are analagous to spreadsheets. They have a row axis and a column axis. Each of these axes are indexed and labeled for quick and easy identification, data alignment, and retrieval and updating of data subsets.

Let's explore the axes labels and indices and how they allow for data exploraation:

In [1]:
from arcgis.gis import GIS
gis = GIS()

When working with an ArcGIS Online feature layer, the query() method returns a FeatureSet object which has a sdf method to instantiate a Spatially Enabled DataFrame.

In [2]:
from arcgis import GIS
item = gis.content.get("85d0ca4ea1ca4b9abf0c51b9bd34de2e")
flayer = item.layers[0]
df = flayer.query(where="AGE_45_54 < 1500").sdf
df.head()
Out[2]:
AGE_10_14 AGE_15_19 AGE_20_24 AGE_25_34 AGE_35_44 AGE_45_54 AGE_55_64 AGE_5_9 AGE_65_74 AGE_75_84 ... PLACEFIPS POP2010 POPULATION POP_CLASS RENTER_OCC SHAPE ST STFIPS VACANT WHITE
0 1413 1381 1106 2138 1815 1411 979 1557 525 307 ... 0468080 14287 14980 6 1074 {"x": -12768343.256613126, "y": 3842463.708135... AZ 04 261 9196
1 727 738 677 1380 1185 1333 1087 740 661 444 ... 0602042 9932 10239 6 2056 {"x": -13613950.337588644, "y": 4931686.754090... CA 06 267 8273
2 593 511 2323 2767 746 127 34 1229 4 2 ... 0610561 10616 11869 6 2558 {"x": -13066582.116550362, "y": 3925650.676616... CA 06 296 7530
3 888 988 900 1729 1479 1443 959 766 514 280 ... 0613560 10866 11195 6 761 {"x": -13123874.446103057, "y": 4044249.710416... CA 06 86 5898
4 1086 1228 1013 1822 1759 1478 1112 925 687 477 ... 0614974 12823 13009 6 1763 {"x": -13151212.145276317, "y": 4027601.332347... CA 06 88 6930

5 rows × 51 columns

Describing the DataFrame

The DataFrame.info() provides a concise summary of the object. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

Example: Displaying info on Spatially Enabled DataFrame (SEDF)

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 51 columns):
AGE_10_14     317 non-null int64
AGE_15_19     317 non-null int64
AGE_20_24     317 non-null int64
AGE_25_34     317 non-null int64
AGE_35_44     317 non-null int64
AGE_45_54     317 non-null int64
AGE_55_64     317 non-null int64
AGE_5_9       317 non-null int64
AGE_65_74     317 non-null int64
AGE_75_84     317 non-null int64
AGE_85_UP     317 non-null int64
AGE_UNDER5    317 non-null int64
AMERI_ES      317 non-null int64
ASIAN         317 non-null int64
AVE_FAM_SZ    317 non-null float64
AVE_HH_SZ     317 non-null float64
BLACK         317 non-null int64
CAPITAL       317 non-null object
CLASS         317 non-null object
FAMILIES      317 non-null int64
FEMALES       317 non-null int64
FHH_CHILD     317 non-null int64
FID           317 non-null int64
HAWN_PI       317 non-null int64
HISPANIC      317 non-null int64
HOUSEHOLDS    317 non-null int64
HSEHLD_1_F    317 non-null int64
HSEHLD_1_M    317 non-null int64
HSE_UNITS     317 non-null int64
MALES         317 non-null int64
MARHH_CHD     317 non-null int64
MARHH_NO_C    317 non-null int64
MED_AGE       317 non-null float64
MED_AGE_F     317 non-null float64
MED_AGE_M     317 non-null float64
MHH_CHILD     317 non-null int64
MULT_RACE     317 non-null int64
NAME          317 non-null object
OBJECTID      317 non-null int64
OTHER         317 non-null int64
OWNER_OCC     317 non-null int64
PLACEFIPS     317 non-null object
POP2010       317 non-null int64
POPULATION    317 non-null int64
POP_CLASS     317 non-null int64
RENTER_OCC    317 non-null int64
SHAPE         317 non-null geometry
ST            317 non-null object
STFIPS        317 non-null object
VACANT        317 non-null int64
WHITE         317 non-null int64
dtypes: float64(5), geometry(1), int64(39), object(6)
memory usage: 126.4+ KB

We can see that the SHAPE column is of type geometry. This means that compared to the legacy SpatialDataFrame class, geometry columns are now unique instead of being just of type object.

We can get information about each axis label (aka, index) with the axes property on the spatial dataframe.

In [4]:
print("{:<15}{}\n\n{}{}".format("Row axis: ", df.axes[0], "Column axis: ", df.axes[1]))
Row axis:      RangeIndex(start=0, stop=317, step=1)

Column axis: Index(['AGE_10_14', 'AGE_15_19', 'AGE_20_24', 'AGE_25_34', 'AGE_35_44',
       'AGE_45_54', 'AGE_55_64', 'AGE_5_9', 'AGE_65_74', 'AGE_75_84',
       'AGE_85_UP', 'AGE_UNDER5', 'AMERI_ES', 'ASIAN', 'AVE_FAM_SZ',
       'AVE_HH_SZ', 'BLACK', 'CAPITAL', 'CLASS', 'FAMILIES', 'FEMALES',
       'FHH_CHILD', 'FID', 'HAWN_PI', 'HISPANIC', 'HOUSEHOLDS', 'HSEHLD_1_F',
       'HSEHLD_1_M', 'HSE_UNITS', 'MALES', 'MARHH_CHD', 'MARHH_NO_C',
       'MED_AGE', 'MED_AGE_F', 'MED_AGE_M', 'MHH_CHILD', 'MULT_RACE', 'NAME',
       'OBJECTID', 'OTHER', 'OWNER_OCC', 'PLACEFIPS', 'POP2010', 'POPULATION',
       'POP_CLASS', 'RENTER_OCC', 'SHAPE', 'ST', 'STFIPS', 'VACANT', 'WHITE'],
      dtype='object')

Row axis information informs us we can retrieve information using the the dataframe loc attribute and any value in the range 0-317 inclusive to access a row. Column axis information tells us we can use any string in the index to return an attribute column:

In [5]:
df.loc[0] #the first row returned
Out[5]:
AGE_10_14                                                  1413
AGE_15_19                                                  1381
AGE_20_24                                                  1106
AGE_25_34                                                  2138
AGE_35_44                                                  1815
AGE_45_54                                                  1411
AGE_55_64                                                   979
AGE_5_9                                                    1557
AGE_65_74                                                   525
AGE_75_84                                                   307
AGE_85_UP                                                    93
AGE_UNDER5                                                 1562
AMERI_ES                                                    112
ASIAN                                                        55
AVE_FAM_SZ                                                 3.99
AVE_HH_SZ                                                  3.77
BLACK                                                       122
CAPITAL                                                        
CLASS                                                      city
FAMILIES                                                   3352
FEMALES                                                    7385
FHH_CHILD                                                   672
FID                                                           6
HAWN_PI                                                       8
HISPANIC                                                  13708
HOUSEHOLDS                                                 3791
HSEHLD_1_F                                                  227
HSEHLD_1_M                                                  207
HSE_UNITS                                                  4052
MALES                                                      6902
MARHH_CHD                                                  1621
MARHH_NO_C                                                  673
MED_AGE                                                    25.5
MED_AGE_F                                                  26.8
MED_AGE_M                                                  24.2
MHH_CHILD                                                   188
MULT_RACE                                                   345
NAME                                                   Somerton
OBJECTID                                                      6
OTHER                                                      4449
OWNER_OCC                                                  2717
PLACEFIPS                                               0468080
POP2010                                                   14287
POPULATION                                                14980
POP_CLASS                                                     6
RENTER_OCC                                                 1074
SHAPE         {'x': -12768343.256613126, 'y': 3842463.708135...
ST                                                           AZ
STFIPS                                                       04
VACANT                                                      261
WHITE                                                      9196
Name: 0, dtype: object
In [6]:
df['POP2010'] #the data from the `POP2010` attribute column
Out[6]:
0      14287
1       9932
2      10616
3      10866
4      12823
5      10755
6      10080
7      10334
8      10588
9      10648
10     11570
11     23096
12     13544
13     12874
14     16192
15     10802
16     11768
17     13058
18     12707
19     11014
20     10234
21      9675
22     10644
23     10224
24     14494
25     10471
26     10127
27     11211
28     11602
29     29589
       ...  
287    10687
288     9840
289    11545
290    13975
291    10465
292    42034
293    10372
294    10568
295    18461
296    10461
297     9845
298    14068
299    13816
300    11899
301    10345
302    13922
303    25484
304    10494
305     9577
306    10615
307    10600
308    11070
309    13431
310    10060
311    10869
312    10125
313    11196
314    10697
315    12764
316    10571
Name: POP2010, Length: 317, dtype: int64

Slicing DataFrames

We can access rows, columns and subsets of rows and columns using Python slicing:

In [7]:
#rows 0-9 with a subset of columns indexed as a list
df.loc[0:9][['OBJECTID', 'NAME', 'ST', 'POP2010', 'POPULATION']]
Out[7]:
OBJECTID NAME ST POP2010 POPULATION
0 6 Somerton AZ 14287 14980
1 20 Anderson CA 9932 10239
2 67 Camp Pendleton South CA 10616 11869
3 85 Citrus CA 10866 11195
4 94 Commerce CA 12823 13009
5 113 Delhi CA 10755 11336
6 139 Emeryville CA 10080 11010
7 143 Exeter CA 10334 10642
8 148 Farmersville CA 10588 10768
9 164 Garden Acres CA 10648 10708

We can use indexing to access SHAPE information and draw it on a map:

In [8]:
camp_pendleton_s_geodefn = df.loc[2]['SHAPE'] #geometry definition from row 2
camp_pendleton_s_geodefn
Out[8]:
{'x': -13066582.116550362,
 'y': 3925650.6766163236,
 'spatialReference': {'wkid': 102100, 'latestWkid': 3857}}
In [9]:
m = gis.map("San Diego, CA", 8)
m

Camp Pendleton_South_pt

In [10]:
m.draw(camp_pendleton_s_geodefn)

Spatial Index

In addition to row and column indices to search a DataFrame, we can use a spatial indexes to quickly access information based on its location and relationship with other features. They are based on the concept of a minimum bounding rectangle - the smallest rectangle that contains an entire geometric shape. Each of these rectangles are then grouped into leaf nodes representing a single shape and node structures containing groups of shapes according to whatever algorithm the different types of spatial indexing use. Querying these rectangles requires magnitudes fewer compute resources for accessing and processing geometries relative to accessing the entire feature array of coordinate pairs that compose a shape. Access to points, complex lines and irregularly-shaped polygons becomes much quicker and easier through different flavors of spatial indexing.

The Spatially Enabled DataFrame uses an implementation of spatial indexing known as QuadTree indexing, which searches nodes when determining locations, relationships and attributes of specific features. QuadTree indexes are the default spatial index, but the SEDF also supports r-tree implementations. In the DataFrame Index section of this notebook, the USA Major Cities feature layer was queried and the sdf property was called on the results to create a DataFrame. The sindex method on the DataFrame creates a QuadTree index:

In [11]:
si = df.spatial.sindex('quadtree',reset=False)

Let's visually inspect the external frame of the QuadTree index. We'll then plot the spatial dataframe to ensure the spatial index encompasses all our features:

In [12]:
midx = gis.map("United States", 3)
midx

Spatial Index Envelope

In [13]:
midx.center = [39, -98]
midx.basemap = 'gray'
In [14]:
df.spatial.plot(map_widget=midx)
Out[14]:
True

Let's use the feature we drew earlier to define a spatial reference variable for use throughout the rest of this guide.

In [15]:
sp_ref = camp_pendleton_s_geodefn['spatialReference']
sp_ref
Out[15]:
{'wkid': 102100, 'latestWkid': 3857}
In [16]:
import time
from arcgis.geometry import Geometry, Polygon

#define a symbol to visualize the spatial index quadrants
sym = {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [0,0,0,0],
    "outline": {
        "type": "esriSLS",
        "style": "esriSLSSolid",
        "color": [0,0,0,255],
        "width": 4}
}

# loop through the children of the root index and draw each extent
# using a different outline color
for i in range(len(si._index.children)):
    sym["outline"]["color"][i] = 255
    if i > 0:
        sym["outline"]["color"][i] = 255
        sym["outline"]["color"][i-1] = 0
    child = si._index.children[i]
    width_factor = child.width/2
    height_factor = child.width/2
    minx = child.center[0] - width_factor
    miny = child.center[1] - height_factor
    maxx = child.center[0] + width_factor
    maxy = child.center[1] + height_factor
    child_geom = Geometry({
        'rings':[[[minx,miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]],
        'spatialReference': sp_ref})
    #child_extent = Polygon(child_geom)
    midx.draw(shape = child_geom, symbol = sym)
    time.sleep(2)

Intersection with the Spatial Index

Up to this point in this guide, we've talked about using indexing for querying attributes in the dataframe. For example:

In [17]:
query = df['ST'] == 'MI'
df[query]
Out[17]:
AGE_10_14 AGE_15_19 AGE_20_24 AGE_25_34 AGE_35_44 AGE_45_54 AGE_55_64 AGE_5_9 AGE_65_74 AGE_75_84 ... PLACEFIPS POP2010 POPULATION POP_CLASS RENTER_OCC SHAPE ST STFIPS VACANT WHITE
212 785 796 712 1465 1302 1379 1172 818 776 570 ... 2617020 10945 10601 6 1679 {"x": -9462606.190132478, "y": 5152068.9276487... MI 26 572 10124
213 818 803 735 1579 1298 1359 1050 928 610 461 ... 2676960 10994 11007 6 1755 {"x": -9508722.837630691, "y": 5131265.4579707... MI 26 507 8856
228 757 4605 4834 1576 1357 1155 757 871 382 251 ... 2601340 17579 18547 6 2096 {"x": -9568014.52123174, "y": 5309699.01482792... MI 26 227 15937
229 313 2238 3680 1146 540 613 568 360 309 243 ... 2608300 10601 10925 6 2194 {"x": -9515619.23025533, "y": 5419011.90562258... MI 26 293 9326
230 598 645 796 1435 1095 1378 1081 678 792 656 ... 2612320 10355 10254 6 1761 {"x": -9508450.065025873, "y": 5504121.4266238... MI 26 647 9902
231 660 672 946 1816 1160 1332 972 712 539 327 ... 2617700 10088 10413 6 2193 {"x": -9537617.651799908, "y": 5318637.2100821... MI 26 484 7938
232 568 579 456 1572 1485 1493 1291 643 689 622 ... 2627380 10372 10464 6 1748 {"x": -9281638.045680454, "y": 5230344.9560653... MI 26 335 7415
233 595 542 591 1312 1202 1420 1545 594 878 683 ... 2633340 10412 10748 6 1530 {"x": -9597964.623579567, "y": 5320394.2544461... MI 26 1046 9891
234 878 877 780 1463 1286 1326 1126 965 542 359 ... 2656360 10856 10914 6 1954 {"x": -9600424.322350215, "y": 5342825.3921295... MI 26 846 1739

9 rows × 51 columns

We can query multiple attributes and filter on the column output as well:

In [18]:
query = (df['POP2010'] > 20000) & (df['ST'] == 'OH')
df[query][['NAME','ST','POP2010','HOUSEHOLDS','HSEHLD_1_F', 'HSEHLD_1_M']]
Out[18]:
NAME ST POP2010 HOUSEHOLDS HSEHLD_1_F HSEHLD_1_M
220 Athens OH 23832 6903 2474 2573
247 Oxford OH 21371 5799 2033 1850

As GIS analysts and data scientists, we also want to query based on geographic location. We can do that by building a spatial index with the sindex property of the spatial dataframe. The resulting quadtree index allows us to query based on specific geometries in relation to other geometries.

Let's continue looking at the dataframe wer're working with: US cities with a population between the ages of 45 and 54 of less than 1500.

We can draw the entire extent of our dataframe using the dataframe's geoextent property. Let's get the bounding box coordinates:

In [19]:
df_geoextent = df.spatial.full_extent
df_geoextent
Out[19]:
(-17595357, 2429399, -7895101, 6266432)

Let's use these coordinates, place them in more descriptive variable names, then create a bounding box to make a geometry object representing the extent of our dataframe. Finally we'll draw it on the a map:

In [20]:
df_geoextent_geom = df.spatial.bbox
df_geoextent_geom
Out[20]:
{'rings': [[[-17595357, 2429399],
   [-17595357, 6266432],
   [-7895101, 6266432],
   [-7895101, 2429399],
   [-17595357, 2429399]]],
 'spatialReference': {'wkid': 102100}}
In [21]:
m1 = gis.map("United States", 3)
m1

Dataframe Extent with AOI

In [22]:
m1.center = [39, -98]
In [23]:
sym_poly = {
  "type": "esriSFS",
  "style": "esriSFSSolid",
  "color": [0,0,0,0],  # hollow, no fill
    "outline": {
     "type": "esriSLS",
     "style": "esriSLSSolid",
     "color": [255,0,0,255],  # red border
     "width": 3}
}

m1.draw(shape = df_geoextent_geom, symbol = sym_poly)

Now, let's define a second set of coordinates representing a bounding box for which we want to query the features from our dataframe that fall within it.

We can define our list of coordinates, and then draw it on the map to make sure it falls within our dataframe extent:

In [24]:
area_of_interest = [-13043219.122301877, 3911134.034258818, -13243219.102301877, 4111134.0542588173]
minx, miny, maxx, maxy = area_of_interest[0], area_of_interest[1], area_of_interest[2], area_of_interest[3]

area_of_interest_ring = [[[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]]
area_of_interest_geom = Geometry({'rings': area_of_interest_ring, 'spatialReference': sp_ref})

sym_poly_aoi = {
  "type": "esriSFS",
  "style": "esriSFSSolid",
  "color": [0,0,0,0],  # hollow, no fill
    "outline": {
     "type": "esriSLS",
     "style": "esriSLSSolid",
     "color": [0,255,0,255],   # green border
     "width": 3}
}

m1.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)

We can see that our area of interest box falls within the dataframe extent. The spatial index has an intersect method which takes a bounding box as input and returns a list of integer values from the row index of our spatial dataframe. We can use the dataframe's iloc integer-indexing attribute to then loop through the dataframe and put draw the features on a map

In [25]:
index_of_features = si.intersect(area_of_interest)
In [26]:
df.iloc[index_of_features]
Out[26]:
AGE_10_14 AGE_15_19 AGE_20_24 AGE_25_34 AGE_35_44 AGE_45_54 AGE_55_64 AGE_5_9 AGE_65_74 AGE_75_84 ... PLACEFIPS POP2010 POPULATION POP_CLASS RENTER_OCC SHAPE ST STFIPS VACANT WHITE
2 593 511 2323 2767 746 127 34 1229 4 2 ... 0610561 10616 11869 6 2558 {"x": -13066582.116550362, "y": 3925650.676616... CA 06 296 7530
3 888 988 900 1729 1479 1443 959 766 514 280 ... 0613560 10866 11195 6 761 {"x": -13123874.446103057, "y": 4044249.710416... CA 06 86 5898
4 1086 1228 1013 1822 1759 1478 1112 925 687 477 ... 0614974 12823 13009 6 1763 {"x": -13151212.145276317, "y": 4027601.332347... CA 06 88 6930
10 1076 1118 952 1707 1651 1397 931 955 524 258 ... 0634302 11570 12025 6 808 {"x": -13080593.062843386, "y": 4012557.378155... CA 06 105 5275
14 14 24 44 113 153 523 2425 11 3883 5122 ... 0639259 16192 17499 6 2572 {"x": -13105436.040130444, "y": 3977081.952377... CA 06 1714 14133
22 1041 1219 1041 1402 1317 1271 727 996 396 145 ... 0650132 10644 11092 6 963 {"x": -13063142.772085657, "y": 4049606.978283... CA 06 212 4459

6 rows × 51 columns

Let us plot these features that intersect on a map:

In [27]:
m2 = gis.map("Los Angeles, CA", 7)
m2

Area_of_interest_intersect

In [28]:
m2.center = [34, -118]
In [29]:
m2.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
In [30]:
pt_sym = {
    "type": "esriSMS",
    "style": "esriSMSDiamond",
    "color": [255,140,0,255],  # yellowish
    "size": 8,
    "angle": 0,
    "xoffset": 0,
    "yoffset": 0,
    "outline": {
        "color": [255,140,0,255],
        "width": 1}
 }
        
for pt_index in index_of_features:
    m2.draw(shape = df.iloc[pt_index]['SHAPE'], symbol = pt_sym)  

Thus we were able to use the spatial indexes to query features that fall within an extent.

Spatial Joins

DataFrames are table-like structures comprised of rows and columns. In relational database, SQL joins are fundamental operations that combine columns from one or more tables using values that are common to each. They occur in almost all database queries.

A Spatial join is a table operation that affixes data from one feature layer’s attribute table to another based on a spatial relationship. The spatial join involves matching rows from the Join Features (data frame1) to the Target Features (data frame2) based on their spatial relationship.

Let's look at how joins work with dataframes by using subsets of our original DataFrame and the pandas merge fucntionality. We'll then move onto examining a spatial join to combine features from one dataframe with another based on a common attribute value.

Query the DataFrame to extract 3 attribute columns of information from 2 states, Ohio and Michigan:

In [31]:
query = (df['ST'] == 'OH') | (df['ST'] == 'MI')
df1 = df[query][['NAME', 'ST', 'POP2010']]
df1
Out[31]:
NAME ST POP2010
212 Coldwater MI 10945
213 Sturgis MI 10994
220 Athens OH 23832
228 Allendale MI 17579
229 Big Rapids MI 10601
230 Cadillac MI 10355
231 Comstock Park MI 10088
232 Farmington MI 10372
233 Grand Haven MI 10412
234 Muskegon Heights MI 10856
242 Cambridge OH 10635
243 Celina OH 10400
244 Galion OH 10512
245 London OH 9904
246 Northbrook OH 10668
247 Oxford OH 21371
248 Springdale OH 11223
249 Trenton OH 11869
250 University Heights OH 13539
251 Van Wert OH 10846

Query the dataframe again for 8 attribute columns from one state, Ohio

In [32]:
query = df['ST'] == 'OH'
df2 = df[query][['NAME', 'POPULATION','BLACK', 'HAWN_PI', 'HISPANIC', 'WHITE', 'MULT_RACE', 'OTHER']]
df2
Out[32]:
NAME POPULATION BLACK HAWN_PI HISPANIC WHITE MULT_RACE OTHER
220 Athens 24509 1047 10 576 20586 559 138
242 Cambridge 10479 361 3 129 9857 316 33
243 Celina 10461 50 41 293 9873 164 111
244 Galion 10243 50 1 140 10264 120 42
245 London 10279 596 0 169 8830 287 61
246 Northbrook 10607 2974 6 377 6995 379 141
247 Oxford 22270 859 2 491 18719 478 127
248 Springdale 11172 3355 44 1965 6169 331 977
249 Trenton 12244 115 2 198 11418 224 35
250 University Heights 13874 3133 4 374 9726 215 121
251 Van Wert 11081 180 1 435 10263 221 130

The Pandas merge capability joins dataframes in a style similar to SQL joins, with parameters to indicate the column of shared information and the type of join to perform:

An inner join (the default), is analagous to a SQL left inner join, keeping the order from the left table in the output and returning only those records from the right table that match the value in the column specified with the on parameter:

In [33]:
import pandas as pd

pd.merge(df1, df2, on='NAME', how='inner')
Out[33]:
NAME ST POP2010 POPULATION BLACK HAWN_PI HISPANIC WHITE MULT_RACE OTHER
0 Athens OH 23832 24509 1047 10 576 20586 559 138
1 Cambridge OH 10635 10479 361 3 129 9857 316 33
2 Celina OH 10400 10461 50 41 293 9873 164 111
3 Galion OH 10512 10243 50 1 140 10264 120 42
4 London OH 9904 10279 596 0 169 8830 287 61
5 Northbrook OH 10668 10607 2974 6 377 6995 379 141
6 Oxford OH 21371 22270 859 2 491 18719 478 127
7 Springdale OH 11223 11172 3355 44 1965 6169 331 977
8 Trenton OH 11869 12244 115 2 198 11418 224 35
9 University Heights OH 13539 13874 3133 4 374 9726 215 121
10 Van Wert OH 10846 11081 180 1 435 10263 221 130

Notice how all the rows from the left DataFrame appear in the result with all the attribute columns and values appended from the right DataFrame where the column value of NAME matched. The POP2010 attribute from the left DataFrame is combined with all the attributes from the right DataFrame.

An outer join combines all rows from both outputs together and orders the results according to the original row index:

In [34]:
pd.merge(df1, df2, on='NAME', how = 'outer')
Out[34]:
NAME ST POP2010 POPULATION BLACK HAWN_PI HISPANIC WHITE MULT_RACE OTHER
0 Coldwater MI 10945 NaN NaN NaN NaN NaN NaN NaN
1 Sturgis MI 10994 NaN NaN NaN NaN NaN NaN NaN
2 Athens OH 23832 24509.0 1047.0 10.0 576.0 20586.0 559.0 138.0
3 Allendale MI 17579 NaN NaN NaN NaN NaN NaN NaN
4 Big Rapids MI 10601 NaN NaN NaN NaN NaN NaN NaN
5 Cadillac MI 10355 NaN NaN NaN NaN NaN NaN NaN
6 Comstock Park MI 10088 NaN NaN NaN NaN NaN NaN NaN
7 Farmington MI 10372 NaN NaN NaN NaN NaN NaN NaN
8 Grand Haven MI 10412 NaN NaN NaN NaN NaN NaN NaN
9 Muskegon Heights MI 10856 NaN NaN NaN NaN NaN NaN NaN
10 Cambridge OH 10635 10479.0 361.0 3.0 129.0 9857.0 316.0 33.0
11 Celina OH 10400 10461.0 50.0 41.0 293.0 9873.0 164.0 111.0
12 Galion OH 10512 10243.0 50.0 1.0 140.0 10264.0 120.0 42.0
13 London OH 9904 10279.0 596.0 0.0 169.0 8830.0 287.0 61.0
14 Northbrook OH 10668 10607.0 2974.0 6.0 377.0 6995.0 379.0 141.0
15 Oxford OH 21371 22270.0 859.0 2.0 491.0 18719.0 478.0 127.0
16 Springdale OH 11223 11172.0 3355.0 44.0 1965.0 6169.0 331.0 977.0
17 Trenton OH 11869 12244.0 115.0 2.0 198.0 11418.0 224.0 35.0
18 University Heights OH 13539 13874.0 3133.0 4.0 374.0 9726.0 215.0 121.0
19 Van Wert OH 10846 11081.0 180.0 1.0 435.0 10263.0 221.0 130.0

The rows where the on parameter value is the same in both tables have all attributes from both DataFrames in the result. The rows from the first DataFrame that do not have a matching NAME value in the second dataframe have values filled in with NaN values.

A spatial join works similarly on matching attribute values. However, instead of joining on an attribue field (like you did earlier), you will join based on the spatial relationship between the records in the two tables.

Example: Merging State Statistics Information with Cities

The goal is to get Wyoming's city locations and census data joined with Wyoming's state census data.

If you do not have access to the ArcPy site-package from the Python interpreter used to execute the following cells, you must authenticate to an ArcGIS Online Organization or ArcGIS Enterprise portal.

In [35]:
from arcgis.gis import GIS
In [36]:
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
In [37]:
g2 = GIS("https://pythonapi.playground.esri.com/portal", "arcgis_python", "amazing_arcgis_123")
In [38]:
import os
data_pth = r'/path/to/your/data/census_2010/example'
cities = r"cities.shp"
states = r"states.shp"
In [39]:
sdf_target = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, cities))
sdf_join = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, states))

We will use python's list comprehensions to create lists of the attribute columns in the DataFrame, then print out the lists to see the names of all the attribute columns.

In [40]:
sdf_target_cols = [column for column in sdf_target.columns]
sdf_join_cols = [column for column in sdf_join.columns]

Print out a list of columns in the sdf_target dataframe created from the cities shapefile:

In [41]:
for a,b,c,d in zip(sdf_target_cols[::4],sdf_target_cols[1::4],sdf_target_cols[2::4], sdf_target_cols[3::4]):
    print("{:<30}{:<30}{:<30}{:<}".format(a,b,c,d))
FID                           NAME                          CLASS                         ST
STFIPS                        PLACEFIP                      CAPITAL                       AREALAND
AREAWATER                     POP_CLASS                     POP2000                       POP2007
WHITE                         BLACK                         AMERI_ES                      ASIAN
HAWN_PI                       OTHER                         MULT_RACE                     HISPANIC
MALES                         FEMALES                       AGE_UNDER5                    AGE_5_17
AGE_18_21                     AGE_22_29                     AGE_30_39                     AGE_40_49
AGE_50_64                     AGE_65_UP                     MED_AGE                       MED_AGE_M
MED_AGE_F                     HOUSEHOLDS                    AVE_HH_SZ                     HSEHLD_1_M
HSEHLD_1_F                    MARHH_CHD                     MARHH_NO_C                    MHH_CHILD
FHH_CHILD                     FAMILIES                      AVE_FAM_SZ                    HSE_UNITS
VACANT                        OWNER_OCC                     RENTER_OCC                    SHAPE

Print out a list of columns in the sdf_join dataframe created from the states shapefile:

In [42]:
for a,b,c,d,e in zip(sdf_join_cols[::5],sdf_join_cols[1::5],sdf_join_cols[2::5],sdf_join_cols[3::5],sdf_join_cols[4::5]):
    print("{:<20}{:<20}{:<20}{:<20}{:<}".format(a,b,c,d,e))
FID                 STATE_NAME          STATE_FIPS          SUB_REGION          STATE_ABBR
POPULATION          POP_SQMI            POP2010             POP10_SQMI          WHITE
BLACK               AMERI_ES            ASIAN               HAWN_PI             HISPANIC
OTHER               MULT_RACE           MALES               FEMALES             AGE_UNDER5
AGE_5_9             AGE_10_14           AGE_15_19           AGE_20_24           AGE_25_34
AGE_35_44           AGE_45_54           AGE_55_64           AGE_65_74           AGE_75_84
AGE_85_UP           MED_AGE             MED_AGE_M           MED_AGE_F           HOUSEHOLDS
AVE_HH_SZ           HSEHLD_1_M          HSEHLD_1_F          MARHH_CHD           MARHH_NO_C
MHH_CHILD           FHH_CHILD           FAMILIES            AVE_FAM_SZ          HSE_UNITS
VACANT              OWNER_OCC           RENTER_OCC          NO_FARMS12          AVE_SIZE12
CROP_ACR12          AVE_SALE12          SQMI                Shape_Leng          Shape_Area

Create a DataFrame for the cities in Wyoming:

In [43]:
sdf_target.loc[0]['SHAPE']
Out[43]:
{'x': -147.8271911572905,
 'y': 64.84830019415946,
 'spatialReference': {'wkid': 4326, 'latestWkid': 4326}}
In [44]:
q = sdf_target['ST'] == 'WY'
left = sdf_target[q].copy()
left.head()
Out[44]:
FID NAME CLASS ST STFIPS PLACEFIP CAPITAL AREALAND AREAWATER POP_CLASS ... MARHH_NO_C MHH_CHILD FHH_CHILD FAMILIES AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC SHAPE
711 711 Green River City WY 56 33740 13.706 0.315 6 ... 1278 113 251 3214 3.22 4426 249 3169 1008 {"x": -109.46492712301152, "y": 41.51419117328...
712 712 Rock Springs City WY 56 67235 18.441 0.000 6 ... 2012 220 536 4931 3.02 8359 1011 5274 2074 {"x": -109.22240010498797, "y": 41.59092714080...
715 715 Evanston City WY 56 25620 10.245 0.044 6 ... 976 139 369 2940 3.30 4665 607 2805 1253 {"x": -110.96461812552366, "y": 41.26330015271...
764 764 Laramie City WY 56 45050 11.138 0.019 6 ... 2496 174 587 5608 2.83 11994 658 5379 5957 {"x": -105.58725462620347, "y": 41.31292665660...
766 766 Cheyenne City WY 56 13900 State 21.108 0.082 7 ... 6299 490 1610 14174 2.93 23782 1458 14739 7585 {"x": -104.80204559586696, "y": 41.14554516058...

5 rows × 48 columns

Create a dataframe for the state of Wyoming:

In [45]:
q = sdf_join.STATE_ABBR == 'WY'
right = sdf_join[q].copy()
right.head()
Out[45]:
FID STATE_NAME STATE_FIPS SUB_REGION STATE_ABBR POPULATION POP_SQMI POP2010 POP10_SQMI WHITE ... OWNER_OCC RENTER_OCC NO_FARMS12 AVE_SIZE12 CROP_ACR12 AVE_SALE12 SQMI Shape_Leng Shape_Area SHAPE
10 10 Wyoming 56 Mountain WY 587106 6.0 563626 5.8 511279 ... 157077 69802 11736.0 2587.0 2418931.0 143952.0 97813.89 21.98703 27.966688 {"rings": [[[-104.05361521909884, 41.698218275...

1 rows × 56 columns

Perform the spatial join:

In [46]:
sdf2 = left.spatial.join(right)
sdf2
Out[46]:
FID_left NAME CLASS ST STFIPS PLACEFIP CAPITAL AREALAND AREAWATER POP_CLASS ... VACANT_right OWNER_OCC_right RENTER_OCC_right NO_FARMS12 AVE_SIZE12 CROP_ACR12 AVE_SALE12 SQMI Shape_Leng Shape_Area
0 711 Green River City WY 56 33740 13.706 0.315 6 ... 34989 157077 69802 11736.0 2587.0 2418931.0 143952.0 97813.89 21.98703 27.966688
1 712 Rock Springs City WY 56 67235 18.441 0.000 6 ... 34989 157077 69802 11736.0 2587.0 2418931.0 143952.0 97813.89 21.98703 27.966688
2 715 Evanston City WY 56 25620 10.245 0.044 6 ... 34989 157077 69802 11736.0 2587.0 2418931.0 143952.0 97813.89 21.98703 27.966688
3 764 Laramie City WY 56 45050 11.138 0.019 6 ... 34989 157077 69802 11736.0 2587.0 2418931.0 143952.0 97813.89 21.98703 27.966688
4 766 Cheyenne City WY 56 13900 State 21.108 0.082 7 ... 34989 157077 69802 11736.0 2587.0 2418931.0 143952.0 97813.89 21.98703 27.966688
5 1216 Sheridan City WY 56 69845 8.486 0.018 6 ... 34989 157077 69802 11736.0 2587.0 2418931.0 143952.0 97813.89 21.98703 27.966688
6 1218 Casper City WY 56 13150 23.945 0.316 6 ... 34989 157077 69802 11736.0 2587.0 2418931.0 143952.0 97813.89 21.98703 27.966688
7 1219 Gillette City WY 56 31855 13.369 0.025 6 ... 34989 157077 69802 11736.0 2587.0 2418931.0 143952.0 97813.89 21.98703 27.966688

8 rows × 104 columns

Notice, you retain the geometry type of your left DataFrame (points) in this case, however, you get all the attributes from both the left and right DataFrames. Let us plot the results of the spatial join on a map:

In [47]:
m3 = gis.map("Wyoming", 6)
m3

Wyoming Spatial Join

In [48]:
m3.center = [43, -107]
In [49]:
from arcgis.geometry import Geometry

for idx, row in sdf2.iterrows():
    m3.draw(row['SHAPE'], symbol=pt_sym)

Conclusion

Spatially Enabled DataFrame give you powerful data analysis and data wrangling capabilities. In addition to performing sql like operations on attribute data, you can perform geographic queries. This guide demonstrated some of these advanced capabilities of the SEDF.


Feedback on this topic?