The SpatialDataFrame is deprecated as of version 1.5: Please use the Spatially Enabled DataFrame instead. See this guide for more information.

Advanced Topics

The information in this section provides a brief introduction to advanced topics with the Spatial DataFrame structure.

One of the most important tasks for software applications is to quickly retrieve and process information. Enterprise systems, whether storing GIS information or not, all utilize the concept of indexing to allow for quick searching through large data stores to locate and select specific information for subsequent processing.

This document will outline how row and column indexing work in the Spatial Dataframe and also demonstrate building a spatial index on dataframe geometries to allow for quick searching, accessing, and processing. The document will also demonstrate spatial joins to combine dataframes.

Dataframe Index

As mentioned in the Introduction to the spatial dataframe guide, the Pandas dataframe structure underlies the ArcGIS API for Python Spatial Dataframe. Pandas dataframes are analagous to spreadsheets. They have a row axis and a column axis. Each of these axes are indexed and labeled for quick and easy identification, data alignment, and retrieval and updating of data subsets.

Lets explore the axis labels and indices and how they allow for data exploraation:

In [2]:
from arcgis.gis import GIS

When working with an ArcGIS Online feature layer, the query() method returns a FeatureSet object which has a df method to instantiate a Spatial Dataframe.

In [3]:
from arcgis import GIS
item = GIS().content.get("85d0ca4ea1ca4b9abf0c51b9bd34de2e")
flayer = item.layers[0]
df = flayer.query(where="AGE_45_54 < 1500").df
df.head()
Out[3]:
AGE_10_14 AGE_15_19 AGE_20_24 AGE_25_34 AGE_35_44 AGE_45_54 AGE_55_64 AGE_5_9 AGE_65_74 AGE_75_84 ... PLACEFIPS POP2010 POPULATION POP_CLASS RENTER_OCC ST STFIPS VACANT WHITE SHAPE
0 1413 1381 1106 2138 1815 1411 979 1557 525 307 ... 0468080 14287 14980 6 1074 AZ 04 261 9196 {'x': -12768343.256613126, 'y': 3842463.708135...
1 727 738 677 1380 1185 1333 1087 740 661 444 ... 0602042 9932 10239 6 2056 CA 06 267 8273 {'x': -13613950.337588644, 'y': 4931686.754090...
2 593 511 2323 2767 746 127 34 1229 4 2 ... 0610561 10616 11869 6 2558 CA 06 296 7530 {'x': -13066582.116550362, 'y': 3925650.676616...
3 888 988 900 1729 1479 1443 959 766 514 280 ... 0613560 10866 11195 6 761 CA 06 86 5898 {'x': -13123874.446103057, 'y': 4044249.710416...
4 1086 1228 1013 1822 1759 1478 1112 925 687 477 ... 0614974 12823 13009 6 1763 CA 06 88 6930 {'x': -13151212.145276317, 'y': 4027601.332347...

5 rows × 51 columns

We can get information about each axis label (aka, index) with the axes property on the spatial dataframe.

In [4]:
print("{:<15}{}\n\n{}{}".format("Row axis: ", df.axes[0], "Column axis: ", df.axes[1]))
Row axis:      RangeIndex(start=0, stop=317, step=1)

Column axis: Index(['AGE_10_14', 'AGE_15_19', 'AGE_20_24', 'AGE_25_34', 'AGE_35_44',
       'AGE_45_54', 'AGE_55_64', 'AGE_5_9', 'AGE_65_74', 'AGE_75_84',
       'AGE_85_UP', 'AGE_UNDER5', 'AMERI_ES', 'ASIAN', 'AVE_FAM_SZ',
       'AVE_HH_SZ', 'BLACK', 'CAPITAL', 'CLASS', 'FAMILIES', 'FEMALES',
       'FHH_CHILD', 'FID', 'HAWN_PI', 'HISPANIC', 'HOUSEHOLDS', 'HSEHLD_1_F',
       'HSEHLD_1_M', 'HSE_UNITS', 'MALES', 'MARHH_CHD', 'MARHH_NO_C',
       'MED_AGE', 'MED_AGE_F', 'MED_AGE_M', 'MHH_CHILD', 'MULT_RACE', 'NAME',
       'OBJECTID', 'OTHER', 'OWNER_OCC', 'PLACEFIPS', 'POP2010', 'POPULATION',
       'POP_CLASS', 'RENTER_OCC', 'ST', 'STFIPS', 'VACANT', 'WHITE', 'SHAPE'],
      dtype='object')

Row axis information informs us we can retrieve information using the the dataframe loc attribute and any value in the range 0-317 inclusive to access a row. Column axis information tells us we can use any string in the index to return an attribute column:

In [13]:
df.loc[0] #the first row returned
Out[13]:
AGE_10_14                                                  1413
AGE_15_19                                                  1381
AGE_20_24                                                  1106
AGE_25_34                                                  2138
AGE_35_44                                                  1815
AGE_45_54                                                  1411
AGE_55_64                                                   979
AGE_5_9                                                    1557
AGE_65_74                                                   525
AGE_75_84                                                   307
AGE_85_UP                                                    93
AGE_UNDER5                                                 1562
AMERI_ES                                                    112
ASIAN                                                        55
AVE_FAM_SZ                                                 3.99
AVE_HH_SZ                                                  3.77
BLACK                                                       122
CAPITAL                                                        
CLASS                                                      city
FAMILIES                                                   3352
FEMALES                                                    7385
FHH_CHILD                                                   672
FID                                                           6
HAWN_PI                                                       8
HISPANIC                                                  13708
HOUSEHOLDS                                                 3791
HSEHLD_1_F                                                  227
HSEHLD_1_M                                                  207
HSE_UNITS                                                  4052
MALES                                                      6902
MARHH_CHD                                                  1621
MARHH_NO_C                                                  673
MED_AGE                                                    25.5
MED_AGE_F                                                  26.8
MED_AGE_M                                                  24.2
MHH_CHILD                                                   188
MULT_RACE                                                   345
NAME                                                   Somerton
OBJECTID                                                      6
OTHER                                                      4449
OWNER_OCC                                                  2717
PLACEFIPS                                               0468080
POP2010                                                   14287
POPULATION                                                14980
POP_CLASS                                                     6
RENTER_OCC                                                 1074
ST                                                           AZ
STFIPS                                                       04
VACANT                                                      261
WHITE                                                      9196
SHAPE         {'x': -12768343.256613126, 'y': 3842463.708135...
Name: 0, dtype: object
In [12]:
df['POP2010'] #the data from the `POP2010` attribute column
Out[12]:
0      14287
1       9932
2      10616
3      10866
4      12823
5      10755
6      10080
7      10334
8      10588
       ...  
310    10060
311    10869
312    10125
313    11196
314    10697
315    12764
316    10571
Name: POP2010, Length: 317, dtype: int64

We can access rows, columns and subsets of rows and columns using Python slicing:

In [6]:
#rows 0-9 with a subset of columns indexed as a list
df.loc[0:9][['OBJECTID', 'NAME', 'ST', 'POP2010', 'POPULATION']]
Out[6]:
OBJECTID NAME ST POP2010 POPULATION
0 6 Somerton AZ 14287 14980
1 20 Anderson CA 9932 10239
2 67 Camp Pendleton South CA 10616 11869
3 85 Citrus CA 10866 11195
4 94 Commerce CA 12823 13009
5 113 Delhi CA 10755 11336
6 139 Emeryville CA 10080 11010
7 143 Exeter CA 10334 10642
8 148 Farmersville CA 10588 10768
9 164 Garden Acres CA 10648 10708

We can use indexing to access SHAPE information and draw it on a map:

In [15]:
camp_pendleton_s_geodefn = df.loc[2]['SHAPE'] #geometry definition from row 2
camp_pendleton_s_geodefn
Out[15]:
{'x': -13066582.116550362,
 'y': 3925650.6766163236,
 'spatialReference': {'wkid': 102100, 'latestWkid': 3857}}
In [ ]:
m = GIS().map("San Diego, CA", 8)
m

Camp Pendleton_South_pt

In [16]:
m.draw(camp_pendleton_s_geodefn)

Spatial Index

In addition to row and column indices to search a dataframe, we can use a spatial index to quickly access information based on its location and relationship with other features. It is based on the concept of a minimum bounding rectangle - the smallest rectangle that contains an entire geometric shape. Each of these rectangles are then grouped into leaf nodes representing a single shape and node structures containing groups of shapes according to whatever algorithm the different types of spatial indexing use. Querying these rectangles requires magnitudes fewer computer resources for accessing and processing geometries relative to accessing the entire feature array of coordinate pairs that compose a shape. Access to points, complex lines and irregularly-shaped polygons becomes much quicker and easier through different flavors of spatial indexing.

The Spatial DataFrame uses an implementation of spatial indexing known as QuadTree indexing, which searches nodes when determining locations, relationships and attributes of specific features. In the Dataframe Index section of this notebook, the USA Major Cities feature layer was queried and the df method was called on the results to create a data frame. The sindex method on the df creates a quad tree index:

In [10]:
si = df.sindex

Let's visually inspect the external frame of the Quadtree index. We'll then plot the spatial dataframe to ensure the spatial index encompasses all our features:

In [ ]:
midx = GIS().map("United States", 3)
midx

Spatial Index Envelope

In [12]:
midx.center = [39, -98]
midx.basemap = 'gray'
In [13]:
df.plot(kind='map',
       map_widget=midx)
Out[13]:
True

Let's use the feature we drew above to define a spatial reference variable for use throughout the rest of this guide.

In [14]:
sp_ref = camp_pendleton_s_geodefn['spatialReference']
sp_ref
Out[14]:
{'wkid': 102100, 'latestWkid': 3857}
In [15]:
import time
from arcgis.geometry import Geometry, Polygon

#define a symbol to visualize the spatial index quadrants
sym = {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [0,0,0,0],
    "outline": {
        "type": "esriSLS",
        "style": "esriSLSSolid",
        "color": [0,0,0,255],
        "width": 4}
}

# loop through the children of the root index and draw each extent
# using a different outline color
for i in range(len(si.children)):
    sym["outline"]["color"][i] = 255
    if i > 0:
        sym["outline"]["color"][i] = 255
        sym["outline"]["color"][i-1] = 0
    child = si.children[i]
    width_factor = child.width/2
    height_factor = child.width/2
    minx = child.center[0] - width_factor
    miny = child.center[1] - height_factor
    maxx = child.center[0] + width_factor
    maxy = child.center[1] + height_factor
    child_geom = Geometry({
        'rings':[[[minx,miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]],
        'spatialReference': sp_ref})
    #child_extent = Polygon(child_geom)
    midx.draw(shape = child_geom, symbol = sym)
    time.sleep(2)

Intersection with the Spatial Index

Up to this point in this guide, we've talked about using indexing for querying attributes in the dataframe. For example:

In [75]:
df[df['ST'] == 'MI']
Out[75]:
AGE_10_14 AGE_15_19 AGE_20_24 AGE_25_34 AGE_35_44 AGE_45_54 AGE_55_64 AGE_5_9 AGE_65_74 AGE_75_84 ... PLACEFIPS POP2010 POPULATION POP_CLASS RENTER_OCC ST STFIPS VACANT WHITE SHAPE
212 785 796 712 1465 1302 1379 1172 818 776 570 ... 2617020 10945 10601 6 1679 MI 26 572 10124 {'x': -9462606.190132478, 'y': 5152068.9276487...
213 818 803 735 1579 1298 1359 1050 928 610 461 ... 2676960 10994 11007 6 1755 MI 26 507 8856 {'x': -9508722.837630691, 'y': 5131265.4579707...
228 757 4605 4834 1576 1357 1155 757 871 382 251 ... 2601340 17579 18547 6 2096 MI 26 227 15937 {'x': -9568014.52123174, 'y': 5309699.01482792...
229 313 2238 3680 1146 540 613 568 360 309 243 ... 2608300 10601 10925 6 2194 MI 26 293 9326 {'x': -9515619.23025533, 'y': 5419011.90562258...
230 598 645 796 1435 1095 1378 1081 678 792 656 ... 2612320 10355 10254 6 1761 MI 26 647 9902 {'x': -9508450.065025873, 'y': 5504121.4266238...
231 660 672 946 1816 1160 1332 972 712 539 327 ... 2617700 10088 10413 6 2193 MI 26 484 7938 {'x': -9537617.651799908, 'y': 5318637.2100821...
232 568 579 456 1572 1485 1493 1291 643 689 622 ... 2627380 10372 10464 6 1748 MI 26 335 7415 {'x': -9281638.045680454, 'y': 5230344.9560653...
233 595 542 591 1312 1202 1420 1545 594 878 683 ... 2633340 10412 10748 6 1530 MI 26 1046 9891 {'x': -9597964.623579567, 'y': 5320394.2544461...
234 878 877 780 1463 1286 1326 1126 965 542 359 ... 2656360 10856 10914 6 1954 MI 26 846 1739 {'x': -9600424.322350215, 'y': 5342825.3921295...

9 rows × 51 columns

We can query multiple attributes and filter on the column output as well:

In [16]:
df[(df['POP2010'] > 20000) & (df['ST'] == 'OH')][['NAME','ST','POP2010','HOUSEHOLDS','HSEHLD_1_F', 'HSEHLD_1_M']]
Out[16]:
NAME ST POP2010 HOUSEHOLDS HSEHLD_1_F HSEHLD_1_M
220 Athens OH 23832 6903 2474 2573
247 Oxford OH 21371 5799 2033 1850

As GIS analysts and data scientists, we also want to query based on geographic location. We can do that by building a spatial index with the sindex property of the spatial dataframe. The resulting quadtree index allows us to query based on specific geometries in relation to other geometries.

Let's continue looking at the dataframe we're working with: US cities with a population between the ages of 45 and 54 of less than 1500.

We can draw the entire extent of our dataframe using the dataframe's geoextent property. Let's get the bounding box coordinates:

In [17]:
df_geoextent = df.geoextent
df_geoextent
Out[17]:
(-17595357.37142927, 2429399.0296470732, -7895101.081856203, 6266432.576333014)

Let's use these coordinates, place them in more descriptive variable names, then create a bounding box to make a geometry object representing the extent of our dataframe. Finally we'll draw it on the a map:

In [18]:
minx, miny, maxx, maxy = df.geoextent[0], df.geoextent[1], df.geoextent[2], df.geoextent[3]
bbox_ring = [[[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]]
df_geoextent_geom = Geometry({'rings': bbox_ring,
                            'spatialReference': sp_ref})
In [ ]:
m1 = GIS().map("United States", 3)
m1

Dataframe Extent with AOI

In [26]:
m1.center = [39, -98]
In [27]:
sym_poly = {
  "type": "esriSFS",
  "style": "esriSFSSolid",
  "color": [0,0,0,0],
    "outline": {
     "type": "esriSLS",
     "style": "esriSLSSolid",
     "color": [255,0,0,255],
     "width": 3}
}

m1.draw(shape = df_geoextent_geom, symbol = sym_poly)

Now, let's define a second set of coordinates representing a bounding box for which we want to query the features from our dataframe that fall within it.

We can define our list of coordinates, and then draw it on the map to make sure it falls within our dataframe extent:

In [29]:
area_of_interest = [-13043219.122301877, 3911134.034258818, -13243219.102301877, 4111134.0542588173]
minx, miny, maxx, maxy = area_of_interest[0], area_of_interest[1], area_of_interest[2], area_of_interest[3]

area_of_interest_ring = [[[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]]
area_of_interest_geom = Geometry({'rings': area_of_interest_ring, 'spatialReference': sp_ref})

sym_poly_aoi = {
  "type": "esriSFS",
  "style": "esriSFSSolid",
  "color": [0,0,0,0],
    "outline": {
     "type": "esriSLS",
     "style": "esriSLSSolid",
     "color": [0,255,0,255],
     "width": 3}
}

m1.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)

We can see that our area of interest box falls within the dataframe extent. The spatial index has an intersect method which takes a bounding box as input and returns a list of integer values (from the row index of our spatial dataframe) that fall within that bounding box. We can use the dataframe's iloc integer-indexing attribute to then loop through the dataframe and draw the features on a map.

In [30]:
index_of_features = si.intersect(area_of_interest)
In [31]:
df.iloc[index_of_features]
Out[31]:
AGE_10_14 AGE_15_19 AGE_20_24 AGE_25_34 AGE_35_44 AGE_45_54 AGE_55_64 AGE_5_9 AGE_65_74 AGE_75_84 ... PLACEFIPS POP2010 POPULATION POP_CLASS RENTER_OCC ST STFIPS VACANT WHITE SHAPE
2 593 511 2323 2767 746 127 34 1229 4 2 ... 0610561 10616 11869 6 2558 CA 06 296 7530 {'x': -13066582.116550362, 'y': 3925650.676616...
3 888 988 900 1729 1479 1443 959 766 514 280 ... 0613560 10866 11195 6 761 CA 06 86 5898 {'x': -13123874.446103057, 'y': 4044249.710416...
4 1086 1228 1013 1822 1759 1478 1112 925 687 477 ... 0614974 12823 13009 6 1763 CA 06 88 6930 {'x': -13151212.145276317, 'y': 4027601.332347...
10 1076 1118 952 1707 1651 1397 931 955 524 258 ... 0634302 11570 12025 6 808 CA 06 105 5275 {'x': -13080593.062843386, 'y': 4012557.378155...
14 14 24 44 113 153 523 2425 11 3883 5122 ... 0639259 16192 17499 6 2572 CA 06 1714 14133 {'x': -13105436.040130444, 'y': 3977081.952377...
22 1041 1219 1041 1402 1317 1271 727 996 396 145 ... 0650132 10644 11092 6 963 CA 06 212 4459 {'x': -13063142.772085657, 'y': 4049606.978283...

6 rows × 51 columns

In [ ]:
m2 = GIS().map("Los Angeles, CA", 7)
m2

Area_of_interest_intersect

In [33]:
m2.center = [34, -118]
In [34]:
m2.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
In [35]:
pt_sym = {
    "type": "esriSMS",
    "style": "esriSMSDiamond",
    "color": [255,140,0,255],        
    "size": 8,
    "angle": 0,
    "xoffset": 0,
    "yoffset": 0,
    "outline": {
        "color": [255,140,0,255],
        "width": 1}
 }
        
for pt_index in index_of_features:
    m2.draw(shape = df.iloc[pt_index]['SHAPE'], symbol = pt_sym)  

Spatial Joins

Dataframes are table-like structures comprised of rows and columns. In relational database, SQL joins are fundamental operations that combine columns from one or more tables using values that are common to each. They occur in almost all database queries.

A Spatial join is a table operation that affixes data from one feature layer’s attribute table to another based on a spatial relationship. The spatial join involves matching rows from the Join Features (data frame1) to the Target Features (data frame2) based on their spatial relationship.

Let's look at how joins work with dataframes by using subsets of our original dataframe and the pandas merge fucntionality. We'll then move onto examining a spatial join to combine features from one dataframe with another based on a common attribute value.

Query the dataframe to extract 3 attribute columns of information from 2 states, Ohio and Michigan:

In [40]:
df1 = df[(df['ST'] == 'OH') | (df['ST'] == 'MI')][['NAME', 'ST', 'POP2010']]
df1
Out[40]:
NAME ST POP2010
212 Coldwater MI 10945
213 Sturgis MI 10994
220 Athens OH 23832
228 Allendale MI 17579
229 Big Rapids MI 10601
230 Cadillac MI 10355
231 Comstock Park MI 10088
232 Farmington MI 10372
233 Grand Haven MI 10412
234 Muskegon Heights MI 10856
242 Cambridge OH 10635
243 Celina OH 10400
244 Galion OH 10512
245 London OH 9904
246 Northbrook OH 10668
247 Oxford OH 21371
248 Springdale OH 11223
249 Trenton OH 11869
250 University Heights OH 13539
251 Van Wert OH 10846

Query the dataframe again for 8 attribute columns from one state, Ohio

In [41]:
df2 = df[df['ST'] == 'OH'][['NAME', 'POPULATION','BLACK', 'HAWN_PI', 'HISPANIC', 'WHITE', 'MULT_RACE', 'OTHER']]
df2
Out[41]:
NAME POPULATION BLACK HAWN_PI HISPANIC WHITE MULT_RACE OTHER
220 Athens 24509 1047 10 576 20586 559 138
242 Cambridge 10479 361 3 129 9857 316 33
243 Celina 10461 50 41 293 9873 164 111
244 Galion 10243 50 1 140 10264 120 42
245 London 10279 596 0 169 8830 287 61
246 Northbrook 10607 2974 6 377 6995 379 141
247 Oxford 22270 859 2 491 18719 478 127
248 Springdale 11172 3355 44 1965 6169 331 977
249 Trenton 12244 115 2 198 11418 224 35
250 University Heights 13874 3133 4 374 9726 215 121
251 Van Wert 11081 180 1 435 10263 221 130

The Pandas merge capability joins dataframes in a style similar to SQL joins, with parameters to indicate the column of shared information and the type of join to perform:

An inner join (the default), is analagous to a SQL left inner join, keeping the order from the left table in the output and returning only those records from the right table that match the value in the column specified with the on parameter:

In [42]:
import pandas as pd

pd.merge(df1, df2, on='NAME', how='inner')
Out[42]:
NAME ST POP2010 POPULATION BLACK HAWN_PI HISPANIC WHITE MULT_RACE OTHER
0 Athens OH 23832 24509 1047 10 576 20586 559 138
1 Cambridge OH 10635 10479 361 3 129 9857 316 33
2 Celina OH 10400 10461 50 41 293 9873 164 111
3 Galion OH 10512 10243 50 1 140 10264 120 42
4 London OH 9904 10279 596 0 169 8830 287 61
5 Northbrook OH 10668 10607 2974 6 377 6995 379 141
6 Oxford OH 21371 22270 859 2 491 18719 478 127
7 Springdale OH 11223 11172 3355 44 1965 6169 331 977
8 Trenton OH 11869 12244 115 2 198 11418 224 35
9 University Heights OH 13539 13874 3133 4 374 9726 215 121
10 Van Wert OH 10846 11081 180 1 435 10263 221 130

Notice how all the rows from the left dataframe appear in the result with all the attribute columns and values appended from the right dataframe where the column value of NAME matched. The POP2010 attribute from the left dataframe is combined with all the attributes from the right dataframe.

An outer join combines all rows from both outputs together and orders the results according to the original row index:

In [43]:
pd.merge(df1, df2, on='NAME', how = 'outer')
Out[43]:
NAME ST POP2010 POPULATION BLACK HAWN_PI HISPANIC WHITE MULT_RACE OTHER
0 Coldwater MI 10945 NaN NaN NaN NaN NaN NaN NaN
1 Sturgis MI 10994 NaN NaN NaN NaN NaN NaN NaN
2 Athens OH 23832 24509.0 1047.0 10.0 576.0 20586.0 559.0 138.0
3 Allendale MI 17579 NaN NaN NaN NaN NaN NaN NaN
4 Big Rapids MI 10601 NaN NaN NaN NaN NaN NaN NaN
5 Cadillac MI 10355 NaN NaN NaN NaN NaN NaN NaN
6 Comstock Park MI 10088 NaN NaN NaN NaN NaN NaN NaN
7 Farmington MI 10372 NaN NaN NaN NaN NaN NaN NaN
8 Grand Haven MI 10412 NaN NaN NaN NaN NaN NaN NaN
9 Muskegon Heights MI 10856 NaN NaN NaN NaN NaN NaN NaN
10 Cambridge OH 10635 10479.0 361.0 3.0 129.0 9857.0 316.0 33.0
11 Celina OH 10400 10461.0 50.0 41.0 293.0 9873.0 164.0 111.0
12 Galion OH 10512 10243.0 50.0 1.0 140.0 10264.0 120.0 42.0
13 London OH 9904 10279.0 596.0 0.0 169.0 8830.0 287.0 61.0
14 Northbrook OH 10668 10607.0 2974.0 6.0 377.0 6995.0 379.0 141.0
15 Oxford OH 21371 22270.0 859.0 2.0 491.0 18719.0 478.0 127.0
16 Springdale OH 11223 11172.0 3355.0 44.0 1965.0 6169.0 331.0 977.0
17 Trenton OH 11869 12244.0 115.0 2.0 198.0 11418.0 224.0 35.0
18 University Heights OH 13539 13874.0 3133.0 4.0 374.0 9726.0 215.0 121.0
19 Van Wert OH 10846 11081.0 180.0 1.0 435.0 10263.0 221.0 130.0

The rows where the on parameter value is the same in both tables have all attributes from both dataframes in the result. The rows from the first dataframe that do not have a matching NAME value in the second dataframe have values filled in with NaN values.

A spatial join works similarly on matching attribute values.

Example: Merging State Statistics Information with Cities

The goal is to get Wyoming's city locations and census data joined with Wymoing's state census data.

If you do not have access to the ArcPy site-package from the Python interpreter used to execute the following cells, you must authenticate to an ArcGIS Online Organization or ArcGIS Enterprise portal.

In [44]:
g2 = GIS("https://pythonapi.playground.esri.com/portal", "arcgis_python", "amazing_arcgis_123")
In [45]:
from arcgis.features import SpatialDataFrame
In [46]:
import os
data_pth = r'/path/to/your/data/census_2010/example'
cities = r"cities.shp"
states = r"states.shp"
In [47]:
sdf_target = SpatialDataFrame.from_featureclass(os.path.join(data_pth, cities))
sdf_join = SpatialDataFrame.from_featureclass(os.path.join(data_pth, states))

We will use python's list comprehensions to create lists of the attribute columns in the dataframe, then print out the lists to see the names of all the attribute columns.

In [48]:
sdf_target_cols = [column for column in sdf_target.columns]
sdf_join_cols = [column for column in sdf_join.columns]

Print out a list of columns in the sdf_target dataframe created from the cities shapefile:

In [49]:
for a,b,c,d in zip(sdf_target_cols[::4],sdf_target_cols[1::4],sdf_target_cols[2::4], sdf_target_cols[3::4]):
    print("{:<30}{:<30}{:<30}{:<}".format(a,b,c,d))
index                         AGE_18_21                     AGE_22_29                     AGE_30_39
AGE_40_49                     AGE_50_64                     AGE_5_17                      AGE_65_UP
AGE_UNDER5                    AMERI_ES                      AREALAND                      AREAWATER
ASIAN                         AVE_FAM_SZ                    AVE_HH_SZ                     BLACK
CAPITAL                       CLASS                         FAMILIES                      FEMALES
FHH_CHILD                     HAWN_PI                       HISPANIC                      HOUSEHOLDS
HSEHLD_1_F                    HSEHLD_1_M                    HSE_UNITS                     MALES
MARHH_CHD                     MARHH_NO_C                    MED_AGE                       MED_AGE_F
MED_AGE_M                     MHH_CHILD                     MULT_RACE                     NAME
OTHER                         OWNER_OCC                     PLACEFIP                      POP2000
POP2007                       POP_CLASS                     RENTER_OCC                    SHAPE
ST                            STFIPS                        VACANT                        WHITE

Print out a list of columns in the sdf_join dataframe created from the states shapefile:

In [50]:
for a,b,c,d,e in zip(sdf_join_cols[::5],sdf_join_cols[1::5],sdf_join_cols[2::5],sdf_join_cols[3::5],sdf_join_cols[4::5]):
    print("{:<20}{:<20}{:<20}{:<20}{:<}".format(a,b,c,d,e))
index               AGE_18_21           AGE_22_29           AGE_30_39           AGE_40_49
AGE_50_64           AGE_5_17            AGE_65_UP           AGE_UNDER5          AMERI_ES
ASIAN               AVE_FAM_SZ          AVE_HH_SZ           AVG_SALE07          AVG_SIZE07
BLACK               CROP_ACR07          FAMILIES            FEMALES             FHH_CHILD
HAWN_PI             HISPANIC            HOUSEHOLDS          HSEHLD_1_F          HSEHLD_1_M
HSE_UNITS           MALES               MARHH_CHD           MARHH_NO_C          MED_AGE
MED_AGE_F           MED_AGE_M           MHH_CHILD           MULT_RACE           NO_FARMS07
OTHER               OWNER_OCC           POP00_SQMI          POP10_SQMI          POP2000
POP2010             RENTER_OCC          SHAPE               SQMI                STATE_ABBR
STATE_FIPS          STATE_NAME          SUB_REGION          VACANT              WHITE

Create a dataframe for the cities in Wyoming:

In [51]:
sdf_target.loc[0]['SHAPE']
Out[51]:
{'x': -147.8271911572905, 'y': 64.84830019415946}
In [52]:
q = sdf_target['ST'] == 'WY'
left = sdf_target[q].copy()
left.head()
Out[52]:
index AGE_18_21 AGE_22_29 AGE_30_39 AGE_40_49 AGE_50_64 AGE_5_17 AGE_65_UP AGE_UNDER5 AMERI_ES ... PLACEFIP POP2000 POP2007 POP_CLASS RENTER_OCC SHAPE ST STFIPS VACANT WHITE
711 711 715 980 1527 2384 1765 2855 769 813 160 ... 33740 11808 12183 6 1008 {'x': -109.46492712301152, 'y': 41.51419117328... WY 56 249 10879
712 712 1348 1963 2538 3299 2627 3797 1857 1279 160 ... 67235 18708 19318 6 2074 {'x': -109.22240010498797, 'y': 41.59092714080... WY 56 1011 17164
715 715 700 1106 1648 2006 1381 2840 826 1000 122 ... 25620 11507 11785 6 1253 {'x': -110.96461812552366, 'y': 41.26330015271... WY 56 607 10620
764 764 5469 5662 2966 3267 2879 3382 2191 1388 241 ... 45050 27204 28111 6 5957 {'x': -105.58725462620347, 'y': 41.31292665660... WY 56 658 24704
766 766 2597 5697 7904 8204 8083 9791 7313 3422 430 ... 13900 53011 54750 7 7585 {'x': -104.80204559586696, 'y': 41.14554516058... WY 56 1458 46707

5 rows × 48 columns

Create a dataframe for the state of Wyoming:

In [53]:
q = sdf_join.STATE_ABBR == 'WY'
right = sdf_join[q].copy()
right.head()
Out[53]:
index AGE_18_21 AGE_22_29 AGE_30_39 AGE_40_49 AGE_50_64 AGE_5_17 AGE_65_UP AGE_UNDER5 AMERI_ES ... POP2010 RENTER_OCC SHAPE SQMI STATE_ABBR STATE_FIPS STATE_NAME SUB_REGION VACANT WHITE
6 6 31070 48942 66252 82984 77968 97933 57693 30940 11133 ... 548154 58094 {'rings': (((-104.05361517875079, 41.698218253... 97813 WY 56 Wyoming Mountain 30246 454670

1 rows × 50 columns

Perform the spatial join:

In [54]:
from arcgis.features._data.geodataset.tools import spatial_join
In [55]:
sdf2 = spatial_join(df1=left, df2=right)
sdf2
Out[55]:
TARGET_OID JOIN_OID index_left AGE_18_21_left AGE_22_29_left AGE_30_39_left AGE_40_49_left AGE_50_64_left AGE_5_17_left AGE_65_UP_left ... POP2000_right POP2010 RENTER_OCC_right SQMI STATE_ABBR STATE_FIPS STATE_NAME SUB_REGION VACANT_right WHITE_right
0 711 6 711 715 980 1527 2384 1765 2855 769 ... 493782 548154 58094 97813 WY 56 Wyoming Mountain 30246 454670
1 712 6 712 1348 1963 2538 3299 2627 3797 1857 ... 493782 548154 58094 97813 WY 56 Wyoming Mountain 30246 454670
2 715 6 715 700 1106 1648 2006 1381 2840 826 ... 493782 548154 58094 97813 WY 56 Wyoming Mountain 30246 454670
3 764 6 764 5469 5662 2966 3267 2879 3382 2191 ... 493782 548154 58094 97813 WY 56 Wyoming Mountain 30246 454670
4 766 6 766 2597 5697 7904 8204 8083 9791 7313 ... 493782 548154 58094 97813 WY 56 Wyoming Mountain 30246 454670
5 1216 6 1216 956 1497 1950 2527 2552 2715 2677 ... 493782 548154 58094 97813 WY 56 Wyoming Mountain 30246 454670
6 1218 6 1218 3294 4980 6541 8020 7213 9578 6754 ... 493782 548154 58094 97813 WY 56 Wyoming Mountain 30246 454670
7 1219 6 1219 1206 2267 2783 3781 2489 4430 1190 ... 493782 548154 58094 97813 WY 56 Wyoming Mountain 30246 454670

8 rows × 99 columns

In [ ]:
m3 = GIS().map("Wyoming", 6)
m3

Wyoming Spatial Join

In [61]:
m3.center = [43, -107]
In [62]:
from arcgis.geometry import Geometry

for idx, row in sdf2.iterrows():
    m3.draw(row['SHAPE'], symbol=pt_sym)

Feedback on this topic?