Spatially Enabled DataFrames - Advanced Topics
The information in this section provides a brief introduction to advanced topics with the Spatially Enabled DataFrame structure.
One of the most important tasks for software applications is to quickly retrieve and process information. Enterprise systems, whether storing GIS information or not, all utilize the concept of indexing to allow for quick searching through large data stores to locate and select specific information for subsequent processing.
This document will outline how row and column indexing works in Spatially Enabled DataFrames and also demonstrate building a spatial index on dataframe geometries to allow for quick searching, accessing, and processing. The document will also demonstrate spatial joins to combine dataframes.
DataFrame Index
As mentioned in the Introduction to the Spatially Enabled DataFrame guide, the Pandas DataFrame structure underlies the ArcGIS API for Python's Spatially Enabled DataFrame. Pandas DataFrames are analogous to spreadsheets: they have a row axis and a column axis. Each of these axes is indexed and labeled for quick and easy identification, data alignment, and retrieval and updating of data subsets.
Let's explore the axes labels and indices and how they allow for data exploration:
from arcgis.gis import GIS
gis = GIS()
When working with an ArcGIS Online feature layer, the query() method returns a FeatureSet object, which has an sdf property to instantiate a Spatially Enabled DataFrame.
from arcgis import GIS
item = gis.content.get("85d0ca4ea1ca4b9abf0c51b9bd34de2e")
flayer = item.layers[0]
df = flayer.query(where="AGE_45_54 < 1500").sdf
df.head()
AGE_10_14 | AGE_15_19 | AGE_20_24 | AGE_25_34 | AGE_35_44 | AGE_45_54 | AGE_55_64 | AGE_5_9 | AGE_65_74 | AGE_75_84 | ... | PLACEFIPS | POP2010 | POPULATION | POP_CLASS | RENTER_OCC | SHAPE | ST | STFIPS | VACANT | WHITE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1413 | 1381 | 1106 | 2138 | 1815 | 1411 | 979 | 1557 | 525 | 307 | ... | 0468080 | 14287 | 14980 | 6 | 1074 | {"x": -12768343.256613126, "y": 3842463.708135... | AZ | 04 | 261 | 9196 |
1 | 727 | 738 | 677 | 1380 | 1185 | 1333 | 1087 | 740 | 661 | 444 | ... | 0602042 | 9932 | 10239 | 6 | 2056 | {"x": -13613950.337588644, "y": 4931686.754090... | CA | 06 | 267 | 8273 |
2 | 593 | 511 | 2323 | 2767 | 746 | 127 | 34 | 1229 | 4 | 2 | ... | 0610561 | 10616 | 11869 | 6 | 2558 | {"x": -13066582.116550362, "y": 3925650.676616... | CA | 06 | 296 | 7530 |
3 | 888 | 988 | 900 | 1729 | 1479 | 1443 | 959 | 766 | 514 | 280 | ... | 0613560 | 10866 | 11195 | 6 | 761 | {"x": -13123874.446103057, "y": 4044249.710416... | CA | 06 | 86 | 5898 |
4 | 1086 | 1228 | 1013 | 1822 | 1759 | 1478 | 1112 | 925 | 687 | 477 | ... | 0614974 | 12823 | 13009 | 6 | 1763 | {"x": -13151212.145276317, "y": 4027601.332347... | CA | 06 | 88 | 6930 |
5 rows × 51 columns
Describing the DataFrame
The DataFrame.info() method provides a concise summary of the object. It prints information about a DataFrame including the index dtype and column dtypes, non-null value counts, and memory usage.
Example: Displaying info on Spatially Enabled DataFrame (SEDF)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 317 entries, 0 to 316 Data columns (total 51 columns): AGE_10_14 317 non-null int64 AGE_15_19 317 non-null int64 AGE_20_24 317 non-null int64 AGE_25_34 317 non-null int64 AGE_35_44 317 non-null int64 AGE_45_54 317 non-null int64 AGE_55_64 317 non-null int64 AGE_5_9 317 non-null int64 AGE_65_74 317 non-null int64 AGE_75_84 317 non-null int64 AGE_85_UP 317 non-null int64 AGE_UNDER5 317 non-null int64 AMERI_ES 317 non-null int64 ASIAN 317 non-null int64 AVE_FAM_SZ 317 non-null float64 AVE_HH_SZ 317 non-null float64 BLACK 317 non-null int64 CAPITAL 317 non-null object CLASS 317 non-null object FAMILIES 317 non-null int64 FEMALES 317 non-null int64 FHH_CHILD 317 non-null int64 FID 317 non-null int64 HAWN_PI 317 non-null int64 HISPANIC 317 non-null int64 HOUSEHOLDS 317 non-null int64 HSEHLD_1_F 317 non-null int64 HSEHLD_1_M 317 non-null int64 HSE_UNITS 317 non-null int64 MALES 317 non-null int64 MARHH_CHD 317 non-null int64 MARHH_NO_C 317 non-null int64 MED_AGE 317 non-null float64 MED_AGE_F 317 non-null float64 MED_AGE_M 317 non-null float64 MHH_CHILD 317 non-null int64 MULT_RACE 317 non-null int64 NAME 317 non-null object OBJECTID 317 non-null int64 OTHER 317 non-null int64 OWNER_OCC 317 non-null int64 PLACEFIPS 317 non-null object POP2010 317 non-null int64 POPULATION 317 non-null int64 POP_CLASS 317 non-null int64 RENTER_OCC 317 non-null int64 SHAPE 317 non-null geometry ST 317 non-null object STFIPS 317 non-null object VACANT 317 non-null int64 WHITE 317 non-null int64 dtypes: float64(5), geometry(1), int64(39), object(6) memory usage: 126.4+ KB
We can see that the SHAPE column is of type geometry. This means that, compared to the legacy SpatialDataFrame class, geometry columns now have their own dtype instead of being just of type object.
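As a quick check, we can confirm this directly. A minimal sketch, assuming the df created above and that the accessor exposes the name of the active geometry column through df.spatial.name:
print(df.dtypes['SHAPE'])   # geometry
print(df.spatial.name)      # SHAPE -- the column currently registered as the geometry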
We can get information about each axis label (aka index) with the axes property of the spatial dataframe.
print("{:<15}{}\n\n{}{}".format("Row axis: ", df.axes[0], "Column axis: ", df.axes[1]))
Row axis: RangeIndex(start=0, stop=317, step=1) Column axis: Index(['AGE_10_14', 'AGE_15_19', 'AGE_20_24', 'AGE_25_34', 'AGE_35_44', 'AGE_45_54', 'AGE_55_64', 'AGE_5_9', 'AGE_65_74', 'AGE_75_84', 'AGE_85_UP', 'AGE_UNDER5', 'AMERI_ES', 'ASIAN', 'AVE_FAM_SZ', 'AVE_HH_SZ', 'BLACK', 'CAPITAL', 'CLASS', 'FAMILIES', 'FEMALES', 'FHH_CHILD', 'FID', 'HAWN_PI', 'HISPANIC', 'HOUSEHOLDS', 'HSEHLD_1_F', 'HSEHLD_1_M', 'HSE_UNITS', 'MALES', 'MARHH_CHD', 'MARHH_NO_C', 'MED_AGE', 'MED_AGE_F', 'MED_AGE_M', 'MHH_CHILD', 'MULT_RACE', 'NAME', 'OBJECTID', 'OTHER', 'OWNER_OCC', 'PLACEFIPS', 'POP2010', 'POPULATION', 'POP_CLASS', 'RENTER_OCC', 'SHAPE', 'ST', 'STFIPS', 'VACANT', 'WHITE'], dtype='object')
The row axis information tells us we can retrieve rows using the dataframe's loc attribute with any label in the range 0-316 inclusive. The column axis information tells us we can use any string in the index to return an attribute column:
df.loc[0] #the first row returned
AGE_10_14 1413 AGE_15_19 1381 AGE_20_24 1106 AGE_25_34 2138 AGE_35_44 1815 AGE_45_54 1411 AGE_55_64 979 AGE_5_9 1557 AGE_65_74 525 AGE_75_84 307 AGE_85_UP 93 AGE_UNDER5 1562 AMERI_ES 112 ASIAN 55 AVE_FAM_SZ 3.99 AVE_HH_SZ 3.77 BLACK 122 CAPITAL CLASS city FAMILIES 3352 FEMALES 7385 FHH_CHILD 672 FID 6 HAWN_PI 8 HISPANIC 13708 HOUSEHOLDS 3791 HSEHLD_1_F 227 HSEHLD_1_M 207 HSE_UNITS 4052 MALES 6902 MARHH_CHD 1621 MARHH_NO_C 673 MED_AGE 25.5 MED_AGE_F 26.8 MED_AGE_M 24.2 MHH_CHILD 188 MULT_RACE 345 NAME Somerton OBJECTID 6 OTHER 4449 OWNER_OCC 2717 PLACEFIPS 0468080 POP2010 14287 POPULATION 14980 POP_CLASS 6 RENTER_OCC 1074 SHAPE {'x': -12768343.256613126, 'y': 3842463.708135... ST AZ STFIPS 04 VACANT 261 WHITE 9196 Name: 0, dtype: object
df['POP2010'] #the data from the `POP2010` attribute column
0 14287 1 9932 2 10616 3 10866 4 12823 5 10755 6 10080 7 10334 8 10588 9 10648 10 11570 11 23096 12 13544 13 12874 14 16192 15 10802 16 11768 17 13058 18 12707 19 11014 20 10234 21 9675 22 10644 23 10224 24 14494 25 10471 26 10127 27 11211 28 11602 29 29589 ... 287 10687 288 9840 289 11545 290 13975 291 10465 292 42034 293 10372 294 10568 295 18461 296 10461 297 9845 298 14068 299 13816 300 11899 301 10345 302 13922 303 25484 304 10494 305 9577 306 10615 307 10600 308 11070 309 13431 310 10060 311 10869 312 10125 313 11196 314 10697 315 12764 316 10571 Name: POP2010, Length: 317, dtype: int64
Slicing DataFrames
We can access rows, columns and subsets of rows and columns using Python slicing:
#rows 0-9 with a subset of columns indexed as a list
df.loc[0:9][['OBJECTID', 'NAME', 'ST', 'POP2010', 'POPULATION']]
OBJECTID | NAME | ST | POP2010 | POPULATION | |
---|---|---|---|---|---|
0 | 6 | Somerton | AZ | 14287 | 14980 |
1 | 20 | Anderson | CA | 9932 | 10239 |
2 | 67 | Camp Pendleton South | CA | 10616 | 11869 |
3 | 85 | Citrus | CA | 10866 | 11195 |
4 | 94 | Commerce | CA | 12823 | 13009 |
5 | 113 | Delhi | CA | 10755 | 11336 |
6 | 139 | Emeryville | CA | 10080 | 11010 |
7 | 143 | Exeter | CA | 10334 | 10642 |
8 | 148 | Farmersville | CA | 10588 | 10768 |
9 | 164 | Garden Acres | CA | 10648 | 10708 |
We can use indexing to access the SHAPE information and draw it on a map:
camp_pendleton_s_geodefn = df.loc[2]['SHAPE'] #geometry definition from row 2
camp_pendleton_s_geodefn
{'x': -13066582.116550362, 'y': 3925650.6766163236, 'spatialReference': {'wkid': 102100, 'latestWkid': 3857}}
m = gis.map("San Diego, CA", 8)
m
m.draw(camp_pendleton_s_geodefn)
Spatial Index
In addition to the row and column indices used to search a DataFrame, we can use a spatial index to quickly access information based on its location and relationship to other features. Spatial indexes are based on the concept of a minimum bounding rectangle: the smallest rectangle that contains an entire geometric shape. These rectangles are grouped into leaf nodes, each representing a single shape, and node structures containing groups of shapes, according to the algorithm used by the particular type of spatial index. Querying these rectangles requires orders of magnitude fewer compute resources for accessing and processing geometries than working with the entire array of coordinate pairs that compose a shape. Access to points, complex lines, and irregularly shaped polygons becomes much quicker and easier through the different flavors of spatial indexing.
The Spatially Enabled DataFrame uses an implementation of spatial indexing known as QuadTree indexing, which searches nodes when determining locations, relationships, and attributes of specific features. QuadTree indexes are the default spatial index, but the SEDF also supports r-tree implementations. In the DataFrame Index section of this notebook, the USA Major Cities feature layer was queried and the sdf property was called on the results to create a DataFrame. The sindex method on the DataFrame creates a QuadTree index:
si = df.spatial.sindex('quadtree',reset=False)
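For comparison, an r-tree index can be requested instead. A minimal sketch, assuming 'rtree' is an accepted stype value and that the rtree package is installed in your environment:
# build an r-tree based index; reset=True forces a rebuild rather than
# returning the cached quadtree index created above
si_rtree = df.spatial.sindex('rtree', reset=True)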
Let's visually inspect the external frame of the QuadTree index. We'll then plot the spatial dataframe to ensure the spatial index encompasses all our features:
midx = gis.map("United States", 3)
midx
midx.center = [39, -98]
midx.basemap = 'gray'
df.spatial.plot(map_widget=midx)
True
Let's use the feature we drew earlier to define a spatial reference variable for use throughout the rest of this guide.
sp_ref = camp_pendleton_s_geodefn['spatialReference']
sp_ref
{'wkid': 102100, 'latestWkid': 3857}
import time
from arcgis.geometry import Geometry, Polygon
#define a symbol to visualize the spatial index quadrants
sym = {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [0,0,0,0],
"outline": {
"type": "esriSLS",
"style": "esriSLSSolid",
"color": [0,0,0,255],
"width": 4}
}
# loop through the children of the root index and draw each extent
# using a different outline color (cycling through the RGB channels)
for i in range(len(si._index.children)):
    # light up the current color channel and reset the previous one
    sym["outline"]["color"][i] = 255
    if i > 0:
        sym["outline"]["color"][i-1] = 0
    child = si._index.children[i]
    # convert the node's center, width, and height into the corner
    # coordinates of its extent (fall back to width if no height attribute)
    width_factor = child.width/2
    height_factor = child.height/2 if hasattr(child, "height") else child.width/2
    minx = child.center[0] - width_factor
    miny = child.center[1] - height_factor
    maxx = child.center[0] + width_factor
    maxy = child.center[1] + height_factor
    child_geom = Geometry({
        'rings':[[[minx,miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]],
        'spatialReference': sp_ref})
    midx.draw(shape = child_geom, symbol = sym)
    time.sleep(2)
Intersection with the Spatial Index
Up to this point in this guide, we've talked about using indexing for querying attributes in the dataframe. For example:
query = df['ST'] == 'MI'
df[query]
AGE_10_14 | AGE_15_19 | AGE_20_24 | AGE_25_34 | AGE_35_44 | AGE_45_54 | AGE_55_64 | AGE_5_9 | AGE_65_74 | AGE_75_84 | ... | PLACEFIPS | POP2010 | POPULATION | POP_CLASS | RENTER_OCC | SHAPE | ST | STFIPS | VACANT | WHITE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
212 | 785 | 796 | 712 | 1465 | 1302 | 1379 | 1172 | 818 | 776 | 570 | ... | 2617020 | 10945 | 10601 | 6 | 1679 | {"x": -9462606.190132478, "y": 5152068.9276487... | MI | 26 | 572 | 10124 |
213 | 818 | 803 | 735 | 1579 | 1298 | 1359 | 1050 | 928 | 610 | 461 | ... | 2676960 | 10994 | 11007 | 6 | 1755 | {"x": -9508722.837630691, "y": 5131265.4579707... | MI | 26 | 507 | 8856 |
228 | 757 | 4605 | 4834 | 1576 | 1357 | 1155 | 757 | 871 | 382 | 251 | ... | 2601340 | 17579 | 18547 | 6 | 2096 | {"x": -9568014.52123174, "y": 5309699.01482792... | MI | 26 | 227 | 15937 |
229 | 313 | 2238 | 3680 | 1146 | 540 | 613 | 568 | 360 | 309 | 243 | ... | 2608300 | 10601 | 10925 | 6 | 2194 | {"x": -9515619.23025533, "y": 5419011.90562258... | MI | 26 | 293 | 9326 |
230 | 598 | 645 | 796 | 1435 | 1095 | 1378 | 1081 | 678 | 792 | 656 | ... | 2612320 | 10355 | 10254 | 6 | 1761 | {"x": -9508450.065025873, "y": 5504121.4266238... | MI | 26 | 647 | 9902 |
231 | 660 | 672 | 946 | 1816 | 1160 | 1332 | 972 | 712 | 539 | 327 | ... | 2617700 | 10088 | 10413 | 6 | 2193 | {"x": -9537617.651799908, "y": 5318637.2100821... | MI | 26 | 484 | 7938 |
232 | 568 | 579 | 456 | 1572 | 1485 | 1493 | 1291 | 643 | 689 | 622 | ... | 2627380 | 10372 | 10464 | 6 | 1748 | {"x": -9281638.045680454, "y": 5230344.9560653... | MI | 26 | 335 | 7415 |
233 | 595 | 542 | 591 | 1312 | 1202 | 1420 | 1545 | 594 | 878 | 683 | ... | 2633340 | 10412 | 10748 | 6 | 1530 | {"x": -9597964.623579567, "y": 5320394.2544461... | MI | 26 | 1046 | 9891 |
234 | 878 | 877 | 780 | 1463 | 1286 | 1326 | 1126 | 965 | 542 | 359 | ... | 2656360 | 10856 | 10914 | 6 | 1954 | {"x": -9600424.322350215, "y": 5342825.3921295... | MI | 26 | 846 | 1739 |
9 rows × 51 columns
We can query multiple attributes and filter on the column output as well:
query = (df['POP2010'] > 20000) & (df['ST'] == 'OH')
df[query][['NAME','ST','POP2010','HOUSEHOLDS','HSEHLD_1_F', 'HSEHLD_1_M']]
NAME | ST | POP2010 | HOUSEHOLDS | HSEHLD_1_F | HSEHLD_1_M | |
---|---|---|---|---|---|---|
220 | Athens | OH | 23832 | 6903 | 2474 | 2573 |
247 | Oxford | OH | 21371 | 5799 | 2033 | 1850 |
As GIS analysts and data scientists, we also want to query based on geographic location. We can do that by building a spatial index with the sindex method of the spatial dataframe. The resulting quadtree index allows us to query specific geometries in relation to other geometries.
Let's continue with the dataframe we're working with: US cities where the population between the ages of 45 and 54 is less than 1,500.
We can retrieve the entire extent of our dataframe using the dataframe's full_extent property. Let's get the bounding box coordinates:
df_geoextent = df.spatial.full_extent
df_geoextent
(-17595357, 2429399, -7895101, 6266432)
Next, let's use the dataframe's bbox property to get a geometry object representing the extent of our dataframe. Then we'll draw it on a map:
df_geoextent_geom = df.spatial.bbox
df_geoextent_geom
{'rings': [[[-17595357, 2429399], [-17595357, 6266432], [-7895101, 6266432], [-7895101, 2429399], [-17595357, 2429399]]], 'spatialReference': {'wkid': 102100}}
m1 = gis.map("United States", 3)
m1
m1.center = [39, -98]
sym_poly = {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [0,0,0,0], # hollow, no fill
"outline": {
"type": "esriSLS",
"style": "esriSLSSolid",
"color": [255,0,0,255], # red border
"width": 3}
}
m1.draw(shape = df_geoextent_geom, symbol = sym_poly)
Now, let's define a second set of coordinates representing a bounding box we will use to query the features from our dataframe that fall within it. We define our list of coordinates, then draw the box on the map to make sure it falls within our dataframe extent:
area_of_interest = [-13043219.122301877, 3911134.034258818, -13243219.102301877, 4111134.0542588173]
minx, miny, maxx, maxy = area_of_interest[0], area_of_interest[1], area_of_interest[2], area_of_interest[3]
area_of_interest_ring = [[[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]]
area_of_interest_geom = Geometry({'rings': area_of_interest_ring, 'spatialReference': sp_ref})
sym_poly_aoi = {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [0,0,0,0], # hollow, no fill
"outline": {
"type": "esriSLS",
"style": "esriSLSSolid",
"color": [0,255,0,255], # green border
"width": 3}
}
m1.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
We can see that our area of interest box falls within the dataframe extent. The spatial index has an intersect method which takes a bounding box as input and returns a list of integer positions from the row index of our spatial dataframe. We can use the dataframe's iloc integer-indexing attribute to loop through those positions and draw the features on a map:
index_of_features = si.intersect(area_of_interest)
df.iloc[index_of_features]
AGE_10_14 | AGE_15_19 | AGE_20_24 | AGE_25_34 | AGE_35_44 | AGE_45_54 | AGE_55_64 | AGE_5_9 | AGE_65_74 | AGE_75_84 | ... | PLACEFIPS | POP2010 | POPULATION | POP_CLASS | RENTER_OCC | SHAPE | ST | STFIPS | VACANT | WHITE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 593 | 511 | 2323 | 2767 | 746 | 127 | 34 | 1229 | 4 | 2 | ... | 0610561 | 10616 | 11869 | 6 | 2558 | {"x": -13066582.116550362, "y": 3925650.676616... | CA | 06 | 296 | 7530 |
3 | 888 | 988 | 900 | 1729 | 1479 | 1443 | 959 | 766 | 514 | 280 | ... | 0613560 | 10866 | 11195 | 6 | 761 | {"x": -13123874.446103057, "y": 4044249.710416... | CA | 06 | 86 | 5898 |
4 | 1086 | 1228 | 1013 | 1822 | 1759 | 1478 | 1112 | 925 | 687 | 477 | ... | 0614974 | 12823 | 13009 | 6 | 1763 | {"x": -13151212.145276317, "y": 4027601.332347... | CA | 06 | 88 | 6930 |
10 | 1076 | 1118 | 952 | 1707 | 1651 | 1397 | 931 | 955 | 524 | 258 | ... | 0634302 | 11570 | 12025 | 6 | 808 | {"x": -13080593.062843386, "y": 4012557.378155... | CA | 06 | 105 | 5275 |
14 | 14 | 24 | 44 | 113 | 153 | 523 | 2425 | 11 | 3883 | 5122 | ... | 0639259 | 16192 | 17499 | 6 | 2572 | {"x": -13105436.040130444, "y": 3977081.952377... | CA | 06 | 1714 | 14133 |
22 | 1041 | 1219 | 1041 | 1402 | 1317 | 1271 | 727 | 996 | 396 | 145 | ... | 0650132 | 10644 | 11092 | 6 | 963 | {"x": -13063142.772085657, "y": 4049606.978283... | CA | 06 | 212 | 4459 |
6 rows × 51 columns
Let us plot the intersecting features on a map:
m2 = gis.map("Los Angeles, CA", 7)
m2
m2.center = [34, -118]
m2.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
pt_sym = {
"type": "esriSMS",
"style": "esriSMSDiamond",
"color": [255,140,0,255], # yellowish
"size": 8,
"angle": 0,
"xoffset": 0,
"yoffset": 0,
"outline": {
"color": [255,140,0,255],
"width": 1}
}
for pt_index in index_of_features:
m2.draw(shape = df.iloc[pt_index]['SHAPE'], symbol = pt_sym)
Thus, we were able to use the spatial index to query features that fall within an extent.
Spatial Joins
DataFrames are table-like structures composed of rows and columns. In relational databases, SQL joins are fundamental operations that combine columns from one or more tables using values common to each; they occur in almost all database queries.
A spatial join is a table operation that appends data from one feature layer's attribute table to another based on a spatial relationship. The spatial join involves matching rows from the Join Features (dataframe 1) to the Target Features (dataframe 2) based on their spatial relationship.
Let's look at how joins work with dataframes by using subsets of our original DataFrame and the pandas merge functionality. We'll then move on to examining a spatial join that combines features from one dataframe with another based on their spatial relationship.
Query the DataFrame to extract 3 attribute columns of information from 2 states, Ohio and Michigan:
query = (df['ST'] == 'OH') | (df['ST'] == 'MI')
df1 = df[query][['NAME', 'ST', 'POP2010']]
df1
NAME | ST | POP2010 | |
---|---|---|---|
212 | Coldwater | MI | 10945 |
213 | Sturgis | MI | 10994 |
220 | Athens | OH | 23832 |
228 | Allendale | MI | 17579 |
229 | Big Rapids | MI | 10601 |
230 | Cadillac | MI | 10355 |
231 | Comstock Park | MI | 10088 |
232 | Farmington | MI | 10372 |
233 | Grand Haven | MI | 10412 |
234 | Muskegon Heights | MI | 10856 |
242 | Cambridge | OH | 10635 |
243 | Celina | OH | 10400 |
244 | Galion | OH | 10512 |
245 | London | OH | 9904 |
246 | Northbrook | OH | 10668 |
247 | Oxford | OH | 21371 |
248 | Springdale | OH | 11223 |
249 | Trenton | OH | 11869 |
250 | University Heights | OH | 13539 |
251 | Van Wert | OH | 10846 |
Query the dataframe again for 8 attribute columns from one state, Ohio:
query = df['ST'] == 'OH'
df2 = df[query][['NAME', 'POPULATION','BLACK', 'HAWN_PI', 'HISPANIC', 'WHITE', 'MULT_RACE', 'OTHER']]
df2
NAME | POPULATION | BLACK | HAWN_PI | HISPANIC | WHITE | MULT_RACE | OTHER | |
---|---|---|---|---|---|---|---|---|
220 | Athens | 24509 | 1047 | 10 | 576 | 20586 | 559 | 138 |
242 | Cambridge | 10479 | 361 | 3 | 129 | 9857 | 316 | 33 |
243 | Celina | 10461 | 50 | 41 | 293 | 9873 | 164 | 111 |
244 | Galion | 10243 | 50 | 1 | 140 | 10264 | 120 | 42 |
245 | London | 10279 | 596 | 0 | 169 | 8830 | 287 | 61 |
246 | Northbrook | 10607 | 2974 | 6 | 377 | 6995 | 379 | 141 |
247 | Oxford | 22270 | 859 | 2 | 491 | 18719 | 478 | 127 |
248 | Springdale | 11172 | 3355 | 44 | 1965 | 6169 | 331 | 977 |
249 | Trenton | 12244 | 115 | 2 | 198 | 11418 | 224 | 35 |
250 | University Heights | 13874 | 3133 | 4 | 374 | 9726 | 215 | 121 |
251 | Van Wert | 11081 | 180 | 1 | 435 | 10263 | 221 | 130 |
The pandas merge capability joins dataframes in a style similar to SQL joins, with parameters to indicate the column of shared information and the type of join to perform.
An inner join (the default) is analogous to a SQL inner join, keeping the order from the left table in the output and returning only those records from the right table that match the value in the column specified with the on parameter:
import pandas as pd
pd.merge(df1, df2, on='NAME', how='inner')
NAME | ST | POP2010 | POPULATION | BLACK | HAWN_PI | HISPANIC | WHITE | MULT_RACE | OTHER | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Athens | OH | 23832 | 24509 | 1047 | 10 | 576 | 20586 | 559 | 138 |
1 | Cambridge | OH | 10635 | 10479 | 361 | 3 | 129 | 9857 | 316 | 33 |
2 | Celina | OH | 10400 | 10461 | 50 | 41 | 293 | 9873 | 164 | 111 |
3 | Galion | OH | 10512 | 10243 | 50 | 1 | 140 | 10264 | 120 | 42 |
4 | London | OH | 9904 | 10279 | 596 | 0 | 169 | 8830 | 287 | 61 |
5 | Northbrook | OH | 10668 | 10607 | 2974 | 6 | 377 | 6995 | 379 | 141 |
6 | Oxford | OH | 21371 | 22270 | 859 | 2 | 491 | 18719 | 478 | 127 |
7 | Springdale | OH | 11223 | 11172 | 3355 | 44 | 1965 | 6169 | 331 | 977 |
8 | Trenton | OH | 11869 | 12244 | 115 | 2 | 198 | 11418 | 224 | 35 |
9 | University Heights | OH | 13539 | 13874 | 3133 | 4 | 374 | 9726 | 215 | 121 |
10 | Van Wert | OH | 10846 | 11081 | 180 | 1 | 435 | 10263 | 221 | 130 |
Notice how only the rows from the left DataFrame with a matching NAME value in the right DataFrame appear in the result, with all the attribute columns and values from the right DataFrame appended. The ST and POP2010 attributes from the left DataFrame are combined with all the attributes from the right DataFrame.
An outer join combines all rows from both DataFrames and orders the results according to the original row index:
pd.merge(df1, df2, on='NAME', how = 'outer')
NAME | ST | POP2010 | POPULATION | BLACK | HAWN_PI | HISPANIC | WHITE | MULT_RACE | OTHER | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Coldwater | MI | 10945 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Sturgis | MI | 10994 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Athens | OH | 23832 | 24509.0 | 1047.0 | 10.0 | 576.0 | 20586.0 | 559.0 | 138.0 |
3 | Allendale | MI | 17579 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Big Rapids | MI | 10601 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | Cadillac | MI | 10355 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | Comstock Park | MI | 10088 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | Farmington | MI | 10372 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | Grand Haven | MI | 10412 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | Muskegon Heights | MI | 10856 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | Cambridge | OH | 10635 | 10479.0 | 361.0 | 3.0 | 129.0 | 9857.0 | 316.0 | 33.0 |
11 | Celina | OH | 10400 | 10461.0 | 50.0 | 41.0 | 293.0 | 9873.0 | 164.0 | 111.0 |
12 | Galion | OH | 10512 | 10243.0 | 50.0 | 1.0 | 140.0 | 10264.0 | 120.0 | 42.0 |
13 | London | OH | 9904 | 10279.0 | 596.0 | 0.0 | 169.0 | 8830.0 | 287.0 | 61.0 |
14 | Northbrook | OH | 10668 | 10607.0 | 2974.0 | 6.0 | 377.0 | 6995.0 | 379.0 | 141.0 |
15 | Oxford | OH | 21371 | 22270.0 | 859.0 | 2.0 | 491.0 | 18719.0 | 478.0 | 127.0 |
16 | Springdale | OH | 11223 | 11172.0 | 3355.0 | 44.0 | 1965.0 | 6169.0 | 331.0 | 977.0 |
17 | Trenton | OH | 11869 | 12244.0 | 115.0 | 2.0 | 198.0 | 11418.0 | 224.0 | 35.0 |
18 | University Heights | OH | 13539 | 13874.0 | 3133.0 | 4.0 | 374.0 | 9726.0 | 215.0 | 121.0 |
19 | Van Wert | OH | 10846 | 11081.0 | 180.0 | 1.0 | 435.0 | 10263.0 | 221.0 | 130.0 |
The rows where the on parameter value is the same in both tables have all attributes from both DataFrames in the result. The rows from the first DataFrame that do not have a matching NAME value in the second dataframe are filled in with NaN values.
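Those unmatched rows can be identified (or their gaps filled) with standard pandas tools. A short sketch using the merged result from above:
merged = pd.merge(df1, df2, on='NAME', how='outer')
# rows that came only from df1 have NaN in every column contributed by df2
unmatched = merged[merged['POPULATION'].isna()]
print(len(unmatched))   # the Michigan cities with no match in df2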
A spatial join works in a similar fashion. However, instead of joining on an attribute field (as you did earlier), you join based on the spatial relationship between the records in the two tables.
Example: Merging State Statistics Information with Cities
The goal is to get Wyoming's city locations and census data joined with Wyoming's state census data.
If you do not have access to the ArcPy site-package from the Python interpreter used to execute the following cells, you must authenticate to an ArcGIS Online organization or ArcGIS Enterprise portal.
from arcgis.gis import GIS
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
g2 = GIS("https://pythonapi.playground.esri.com/portal", "arcgis_python", "amazing_arcgis_123")
import os
data_pth = r'/path/to/your/data/census_2010/example'
cities = r"cities.shp"
states = r"states.shp"
sdf_target = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, cities))
sdf_join = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, states))
We will use Python list comprehensions to create lists of the attribute columns in each DataFrame, then print out the lists to see the names of all the attribute columns.
sdf_target_cols = [column for column in sdf_target.columns]
sdf_join_cols = [column for column in sdf_join.columns]
Print out a list of columns in the sdf_target dataframe created from the cities shapefile:
for a,b,c,d in zip(sdf_target_cols[::4],sdf_target_cols[1::4],sdf_target_cols[2::4], sdf_target_cols[3::4]):
print("{:<30}{:<30}{:<30}{:<}".format(a,b,c,d))
FID NAME CLASS ST STFIPS PLACEFIP CAPITAL AREALAND AREAWATER POP_CLASS POP2000 POP2007 WHITE BLACK AMERI_ES ASIAN HAWN_PI OTHER MULT_RACE HISPANIC MALES FEMALES AGE_UNDER5 AGE_5_17 AGE_18_21 AGE_22_29 AGE_30_39 AGE_40_49 AGE_50_64 AGE_65_UP MED_AGE MED_AGE_M MED_AGE_F HOUSEHOLDS AVE_HH_SZ HSEHLD_1_M HSEHLD_1_F MARHH_CHD MARHH_NO_C MHH_CHILD FHH_CHILD FAMILIES AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC SHAPE
Print out a list of columns in the sdf_join dataframe created from the states shapefile:
for a,b,c,d,e in zip(sdf_join_cols[::5],sdf_join_cols[1::5],sdf_join_cols[2::5],sdf_join_cols[3::5],sdf_join_cols[4::5]):
print("{:<20}{:<20}{:<20}{:<20}{:<}".format(a,b,c,d,e))
FID STATE_NAME STATE_FIPS SUB_REGION STATE_ABBR POPULATION POP_SQMI POP2010 POP10_SQMI WHITE BLACK AMERI_ES ASIAN HAWN_PI HISPANIC OTHER MULT_RACE MALES FEMALES AGE_UNDER5 AGE_5_9 AGE_10_14 AGE_15_19 AGE_20_24 AGE_25_34 AGE_35_44 AGE_45_54 AGE_55_64 AGE_65_74 AGE_75_84 AGE_85_UP MED_AGE MED_AGE_M MED_AGE_F HOUSEHOLDS AVE_HH_SZ HSEHLD_1_M HSEHLD_1_F MARHH_CHD MARHH_NO_C MHH_CHILD FHH_CHILD FAMILIES AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC NO_FARMS12 AVE_SIZE12 CROP_ACR12 AVE_SALE12 SQMI Shape_Leng Shape_Area
Create a DataFrame for the cities in Wyoming:
sdf_target.loc[0]['SHAPE']
{'x': -147.8271911572905, 'y': 64.84830019415946, 'spatialReference': {'wkid': 4326, 'latestWkid': 4326}}
q = sdf_target['ST'] == 'WY'
left = sdf_target[q].copy()
left.head()
FID | NAME | CLASS | ST | STFIPS | PLACEFIP | CAPITAL | AREALAND | AREAWATER | POP_CLASS | ... | MARHH_NO_C | MHH_CHILD | FHH_CHILD | FAMILIES | AVE_FAM_SZ | HSE_UNITS | VACANT | OWNER_OCC | RENTER_OCC | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
711 | 711 | Green River | City | WY | 56 | 33740 | 13.706 | 0.315 | 6 | ... | 1278 | 113 | 251 | 3214 | 3.22 | 4426 | 249 | 3169 | 1008 | {"x": -109.46492712301152, "y": 41.51419117328... | |
712 | 712 | Rock Springs | City | WY | 56 | 67235 | 18.441 | 0.000 | 6 | ... | 2012 | 220 | 536 | 4931 | 3.02 | 8359 | 1011 | 5274 | 2074 | {"x": -109.22240010498797, "y": 41.59092714080... | |
715 | 715 | Evanston | City | WY | 56 | 25620 | 10.245 | 0.044 | 6 | ... | 976 | 139 | 369 | 2940 | 3.30 | 4665 | 607 | 2805 | 1253 | {"x": -110.96461812552366, "y": 41.26330015271... | |
764 | 764 | Laramie | City | WY | 56 | 45050 | 11.138 | 0.019 | 6 | ... | 2496 | 174 | 587 | 5608 | 2.83 | 11994 | 658 | 5379 | 5957 | {"x": -105.58725462620347, "y": 41.31292665660... | |
766 | 766 | Cheyenne | City | WY | 56 | 13900 | State | 21.108 | 0.082 | 7 | ... | 6299 | 490 | 1610 | 14174 | 2.93 | 23782 | 1458 | 14739 | 7585 | {"x": -104.80204559586696, "y": 41.14554516058... |
5 rows × 48 columns
Create a dataframe for the state of Wyoming:
q = sdf_join.STATE_ABBR == 'WY'
right = sdf_join[q].copy()
right.head()
FID | STATE_NAME | STATE_FIPS | SUB_REGION | STATE_ABBR | POPULATION | POP_SQMI | POP2010 | POP10_SQMI | WHITE | ... | OWNER_OCC | RENTER_OCC | NO_FARMS12 | AVE_SIZE12 | CROP_ACR12 | AVE_SALE12 | SQMI | Shape_Leng | Shape_Area | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 10 | Wyoming | 56 | Mountain | WY | 587106 | 6.0 | 563626 | 5.8 | 511279 | ... | 157077 | 69802 | 11736.0 | 2587.0 | 2418931.0 | 143952.0 | 97813.89 | 21.98703 | 27.966688 | {"rings": [[[-104.05361521909884, 41.698218275... |
1 rows × 56 columns
Perform the spatial join:
sdf2 = left.spatial.join(right)
sdf2
FID_left | NAME | CLASS | ST | STFIPS | PLACEFIP | CAPITAL | AREALAND | AREAWATER | POP_CLASS | ... | VACANT_right | OWNER_OCC_right | RENTER_OCC_right | NO_FARMS12 | AVE_SIZE12 | CROP_ACR12 | AVE_SALE12 | SQMI | Shape_Leng | Shape_Area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 711 | Green River | City | WY | 56 | 33740 | 13.706 | 0.315 | 6 | ... | 34989 | 157077 | 69802 | 11736.0 | 2587.0 | 2418931.0 | 143952.0 | 97813.89 | 21.98703 | 27.966688 | |
1 | 712 | Rock Springs | City | WY | 56 | 67235 | 18.441 | 0.000 | 6 | ... | 34989 | 157077 | 69802 | 11736.0 | 2587.0 | 2418931.0 | 143952.0 | 97813.89 | 21.98703 | 27.966688 | |
2 | 715 | Evanston | City | WY | 56 | 25620 | 10.245 | 0.044 | 6 | ... | 34989 | 157077 | 69802 | 11736.0 | 2587.0 | 2418931.0 | 143952.0 | 97813.89 | 21.98703 | 27.966688 | |
3 | 764 | Laramie | City | WY | 56 | 45050 | 11.138 | 0.019 | 6 | ... | 34989 | 157077 | 69802 | 11736.0 | 2587.0 | 2418931.0 | 143952.0 | 97813.89 | 21.98703 | 27.966688 | |
4 | 766 | Cheyenne | City | WY | 56 | 13900 | State | 21.108 | 0.082 | 7 | ... | 34989 | 157077 | 69802 | 11736.0 | 2587.0 | 2418931.0 | 143952.0 | 97813.89 | 21.98703 | 27.966688 |
5 | 1216 | Sheridan | City | WY | 56 | 69845 | 8.486 | 0.018 | 6 | ... | 34989 | 157077 | 69802 | 11736.0 | 2587.0 | 2418931.0 | 143952.0 | 97813.89 | 21.98703 | 27.966688 | |
6 | 1218 | Casper | City | WY | 56 | 13150 | 23.945 | 0.316 | 6 | ... | 34989 | 157077 | 69802 | 11736.0 | 2587.0 | 2418931.0 | 143952.0 | 97813.89 | 21.98703 | 27.966688 | |
7 | 1219 | Gillette | City | WY | 56 | 31855 | 13.369 | 0.025 | 6 | ... | 34989 | 157077 | 69802 | 11736.0 | 2587.0 | 2418931.0 | 143952.0 | 97813.89 | 21.98703 | 27.966688 |
8 rows × 104 columns
Notice that you retain the geometry type of your left DataFrame (points, in this case); however, you get all the attributes from both the left and right DataFrames. Let us plot the results of the spatial join on a map:
m3 = gis.map("Wyoming", 6)
m3
m3.center = [43, -107]
from arcgis.geometry import Geometry
for idx, row in sdf2.iterrows():
m3.draw(row['SHAPE'], symbol=pt_sym)
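Before wrapping up, we can verify the join result programmatically. A minimal sketch, assuming the GeoAccessor reports the geometry type of the joined DataFrame and that join accepts how and op parameters controlling the join type and the spatial relationship (the op value shown is an assumption):
# the joined result keeps the left DataFrame's point geometries and
# carries columns from both inputs
print(sdf2.spatial.geometry_type)   # expected: ['point']
print(sdf2.shape)                   # expected: (8, 104)
# the spatial relationship used for matching can be set explicitly;
# 'within' (city points within the state polygons) is an assumed op value
sdf3 = left.spatial.join(right, how='inner', op='within')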
Conclusion
The Spatially Enabled DataFrame gives you powerful data analysis and data wrangling capabilities. In addition to performing SQL-like operations on attribute data, you can perform geographic queries. This guide demonstrated some of these advanced capabilities of the SEDF.