The SpatialDataFrame is deprecated as of version 1.5: Please use the Spatially Enabled DataFrame instead. See this guide for more information.
Advanced Topics¶
The information in this section provides a brief introduction to advanced topics with the Spatial DataFrame
structure.
One of the most important tasks for software applications is to quickly retrieve and process information. Enterprise systems, whether storing GIS information or not, all utilize the concept of indexing to allow for quick searching through large data stores to locate and select specific information for subsequent processing.
This document will outline how row and column indexing work in the Spatial Dataframe and also demonstrate building a spatial index on dataframe geometries to allow for quick searching, accessing, and processing. The document will also demonstrate spatial joins to combine dataframes.
Dataframe Index¶
As mentioned in the Introduction to the spatial dataframe guide, the Pandas dataframe structure underlies the ArcGIS API for Python Spatial Dataframe. Pandas dataframes are analagous to spreadsheets. They have a row axis and a column axis. Each of these axes are indexed and labeled for quick and easy identification, data alignment, and retrieval and updating of data subsets.
Lets explore the axis labels and indices and how they allow for data exploraation:
from arcgis.gis import GIS
When working with an ArcGIS Online feature layer, the query()
method returns a FeatureSet
object which has a df
method to instantiate a Spatial Dataframe.
from arcgis import GIS
item = GIS().content.get("85d0ca4ea1ca4b9abf0c51b9bd34de2e")
flayer = item.layers[0]
df = flayer.query(where="AGE_45_54 < 1500").df
df.head()
We can get information about each axis label (aka, index
) with the axes
property on the spatial dataframe.
print("{:<15}{}\n\n{}{}".format("Row axis: ", df.axes[0], "Column axis: ", df.axes[1]))
Row axis information informs us we can retrieve information using the the dataframe loc
attribute and any value in the range 0-317 inclusive to access a row. Column axis information tells us we can use any string in the index to return an attribute column:
df.loc[0] #the first row returned
df['POP2010'] #the data from the `POP2010` attribute column
We can access rows, columns and subsets of rows and columns using Python slicing:
#rows 0-9 with a subset of columns indexed as a list
df.loc[0:9][['OBJECTID', 'NAME', 'ST', 'POP2010', 'POPULATION']]
We can use indexing to access SHAPE
information and draw it on a map:
camp_pendleton_s_geodefn = df.loc[2]['SHAPE'] #geometry definition from row 2
camp_pendleton_s_geodefn
m = GIS().map("San Diego, CA", 8)
m
m.draw(camp_pendleton_s_geodefn)
Spatial Index¶
In addition to row and column indices to search a dataframe, we can use a spatial index to quickly access information based on its location and relationship with other features. It is based on the concept of a minimum bounding rectangle - the smallest rectangle that contains an entire geometric shape. Each of these rectangles are then grouped into leaf
nodes representing a single shape and node
structures containing groups of shapes according to whatever algorithm the different types of spatial indexing use. Querying these rectangles requires magnitudes fewer computer resources for accessing and processing geometries relative to accessing the entire feature array of coordinate pairs that compose a shape. Access to points, complex lines and irregularly-shaped polygons becomes much quicker and easier through different flavors of spatial indexing.
The Spatial DataFrame uses an implementation of spatial indexing known as QuadTree indexing, which searches nodes when determining locations, relationships and attributes of specific features. In the Dataframe Index section of this notebook, the USA Major Cities feature layer was queried and the df
method was called on the results to create a data frame. The sindex
method on the df
creates a quad tree index:
si = df.sindex
Let's visually inspect the external frame of the Quadtree index. We'll then plot the spatial dataframe to ensure the spatial index encompasses all our features:
midx = GIS().map("United States", 3)
midx
midx.center = [39, -98]
midx.basemap = 'gray'
df.plot(kind='map',
map_widget=midx)
Let's use the feature we drew above to define a spatial reference variable for use throughout the rest of this guide.
sp_ref = camp_pendleton_s_geodefn['spatialReference']
sp_ref
import time
from arcgis.geometry import Geometry, Polygon
#define a symbol to visualize the spatial index quadrants
sym = {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [0,0,0,0],
"outline": {
"type": "esriSLS",
"style": "esriSLSSolid",
"color": [0,0,0,255],
"width": 4}
}
# loop through the children of the root index and draw each extent
# using a different outline color
for i in range(len(si.children)):
sym["outline"]["color"][i] = 255
if i > 0:
sym["outline"]["color"][i] = 255
sym["outline"]["color"][i-1] = 0
child = si.children[i]
width_factor = child.width/2
height_factor = child.width/2
minx = child.center[0] - width_factor
miny = child.center[1] - height_factor
maxx = child.center[0] + width_factor
maxy = child.center[1] + height_factor
child_geom = Geometry({
'rings':[[[minx,miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]],
'spatialReference': sp_ref})
#child_extent = Polygon(child_geom)
midx.draw(shape = child_geom, symbol = sym)
time.sleep(2)
Intersection with the Spatial Index¶
Up to this point in this guide, we've talked about using indexing for querying attributes in the dataframe. For example:
df[df['ST'] == 'MI']
We can query multiple attributes and filter on the column output as well:
df[(df['POP2010'] > 20000) & (df['ST'] == 'OH')][['NAME','ST','POP2010','HOUSEHOLDS','HSEHLD_1_F', 'HSEHLD_1_M']]
As GIS analysts and data scientists, we also want to query based on geographic location. We can do that by building a spatial index with the sindex
property of the spatial dataframe. The resulting quadtree index allows us to query based on specific geometries in relation to other geometries.
Let's continue looking at the dataframe we're working with: US cities with a population between the ages of 45 and 54 of less than 1500.
We can draw the entire extent of our dataframe using the dataframe's geoextent
property. Let's get the bounding box coordinates:
df_geoextent = df.geoextent
df_geoextent
Let's use these coordinates, place them in more descriptive variable names, then create a bounding box to make a geometry object representing the extent of our dataframe. Finally we'll draw it on the a map:
minx, miny, maxx, maxy = df.geoextent[0], df.geoextent[1], df.geoextent[2], df.geoextent[3]
bbox_ring = [[[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]]
df_geoextent_geom = Geometry({'rings': bbox_ring,
'spatialReference': sp_ref})
m1 = GIS().map("United States", 3)
m1
m1.center = [39, -98]
sym_poly = {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [0,0,0,0],
"outline": {
"type": "esriSLS",
"style": "esriSLSSolid",
"color": [255,0,0,255],
"width": 3}
}
m1.draw(shape = df_geoextent_geom, symbol = sym_poly)
Now, let's define a second set of coordinates representing a bounding box for which we want to query the features from our dataframe that fall within it.
We can define our list of coordinates, and then draw it on the map to make sure it falls within our dataframe extent:
area_of_interest = [-13043219.122301877, 3911134.034258818, -13243219.102301877, 4111134.0542588173]
minx, miny, maxx, maxy = area_of_interest[0], area_of_interest[1], area_of_interest[2], area_of_interest[3]
area_of_interest_ring = [[[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]]
area_of_interest_geom = Geometry({'rings': area_of_interest_ring, 'spatialReference': sp_ref})
sym_poly_aoi = {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [0,0,0,0],
"outline": {
"type": "esriSLS",
"style": "esriSLSSolid",
"color": [0,255,0,255],
"width": 3}
}
m1.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
We can see that our area of interest box falls within the dataframe extent. The spatial index has an intersect
method which takes a bounding box as input and returns a list of integer values (from the row index of our spatial dataframe) that fall within that bounding box. We can use the dataframe's iloc
integer-indexing attribute to then loop through the dataframe and draw the features on a map.
index_of_features = si.intersect(area_of_interest)
df.iloc[index_of_features]
m2 = GIS().map("Los Angeles, CA", 7)
m2
m2.center = [34, -118]
m2.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
pt_sym = {
"type": "esriSMS",
"style": "esriSMSDiamond",
"color": [255,140,0,255],
"size": 8,
"angle": 0,
"xoffset": 0,
"yoffset": 0,
"outline": {
"color": [255,140,0,255],
"width": 1}
}
for pt_index in index_of_features:
m2.draw(shape = df.iloc[pt_index]['SHAPE'], symbol = pt_sym)
Spatial Joins¶
Dataframes are table-like structures comprised of rows and columns. In relational database, SQL joins
are fundamental operations that combine columns from one or more tables using values that are common to each. They occur in almost all database queries.
A Spatial join is a table operation that affixes data from one feature layer’s attribute table to another based on a spatial relationship. The spatial join involves matching rows from the Join Features (data frame1) to the Target Features (data frame2) based on their spatial relationship.
Let's look at how joins work with dataframes by using subsets of our original dataframe and the pandas merge
fucntionality. We'll then move onto examining a spatial join to combine features from one dataframe with another based on a common attribute value.
Query the dataframe to extract 3 attribute columns of information from 2 states, Ohio and Michigan:
df1 = df[(df['ST'] == 'OH') | (df['ST'] == 'MI')][['NAME', 'ST', 'POP2010']]
df1
Query the dataframe again for 8 attribute columns from one state, Ohio
df2 = df[df['ST'] == 'OH'][['NAME', 'POPULATION','BLACK', 'HAWN_PI', 'HISPANIC', 'WHITE', 'MULT_RACE', 'OTHER']]
df2
The Pandas merge
capability joins dataframes in a style similar to SQL joins, with parameters to indicate the column of shared information and the type of join to perform:
An inner
join (the default), is analagous to a SQL left inner join, keeping the order from the left table in the output and returning only those records from the right table that match the value in the column specified with the on
parameter:
import pandas as pd
pd.merge(df1, df2, on='NAME', how='inner')
Notice how all the rows from the left dataframe appear in the result with all the attribute columns and values appended from the right dataframe where the column value of NAME matched. The POP2010
attribute from the left dataframe is combined with all the attributes from the right dataframe.
An outer
join combines all rows from both outputs together and orders the results according to the original row index:
pd.merge(df1, df2, on='NAME', how = 'outer')
The rows where the on parameter value is the same in both tables have all attributes from both dataframes in the result. The rows from the first dataframe that do not have a matching NAME
value in the second dataframe have values filled in with NaN values.
A spatial join works similarly on matching attribute values.
Example: Merging State Statistics Information with Cities¶
The goal is to get Wyoming's city locations and census data joined with Wymoing's state census data.
If you do not have access to the
ArcPy
site-package from the Python interpreter used to execute the following cells, you must authenticate to an ArcGIS Online Organization or ArcGIS Enterprise portal.
g2 = GIS("https://pythonapi.playground.esri.com/portal", "arcgis_python", "amazing_arcgis_123")
from arcgis.features import SpatialDataFrame
import os
data_pth = r'/path/to/your/data/census_2010/example'
cities = r"cities.shp"
states = r"states.shp"
sdf_target = SpatialDataFrame.from_featureclass(os.path.join(data_pth, cities))
sdf_join = SpatialDataFrame.from_featureclass(os.path.join(data_pth, states))
We will use python's list comprehensions to create lists of the attribute columns in the dataframe, then print out the lists to see the names of all the attribute columns.
sdf_target_cols = [column for column in sdf_target.columns]
sdf_join_cols = [column for column in sdf_join.columns]
Print out a list of columns in the sdf_target
dataframe created from the cities shapefile:
for a,b,c,d in zip(sdf_target_cols[::4],sdf_target_cols[1::4],sdf_target_cols[2::4], sdf_target_cols[3::4]):
print("{:<30}{:<30}{:<30}{:<}".format(a,b,c,d))
Print out a list of columns in the sdf_join
dataframe created from the states shapefile:
for a,b,c,d,e in zip(sdf_join_cols[::5],sdf_join_cols[1::5],sdf_join_cols[2::5],sdf_join_cols[3::5],sdf_join_cols[4::5]):
print("{:<20}{:<20}{:<20}{:<20}{:<}".format(a,b,c,d,e))
Create a dataframe for the cities in Wyoming:
sdf_target.loc[0]['SHAPE']
q = sdf_target['ST'] == 'WY'
left = sdf_target[q].copy()
left.head()
Create a dataframe for the state of Wyoming:
q = sdf_join.STATE_ABBR == 'WY'
right = sdf_join[q].copy()
right.head()
Perform the spatial join:
from arcgis.features._data.geodataset.tools import spatial_join
sdf2 = spatial_join(df1=left, df2=right)
sdf2
m3 = GIS().map("Wyoming", 6)
m3
m3.center = [43, -107]
from arcgis.geometry import Geometry
for idx, row in sdf2.iterrows():
m3.draw(row['SHAPE'], symbol=pt_sym)
Feedback on this topic?