The information in this section provides a brief introduction to advanced topics with the Spatially Enabled DataFrame
structure.
One of the most important tasks for software applications is to quickly retrieve and process information. Enterprise systems, whether storing GIS information or not, all utilize the concept of indexing to allow for quick searching through large data stores to locate and select specific information for subsequent processing.
This document will outline how row and column indexing work in Spatially Enabled Dataframes and also demonstrate building a spatial index on dataframe geometries to allow for quick searching, accessing, and processing. The document will also demonstrate spatial joins to combine dataframes.
DataFrame Index
As mentioned in the Introduction to the Spatially Enabled DataFrame guide, the Pandas DataFrame structure underlies the ArcGIS API for Python's Spatially Enabled DataFrame. Pandas DataFrames are analagous to spreadsheets. They have a row axis and a column axis. Each of these axes are indexed and labeled for quick and easy identification, data alignment, and retrieval and updating of data subsets.
Let's explore the axes labels and indices and how they allow for data exploraation:
from arcgis.gis import GIS
gis = GIS()
When working with an ArcGIS Online feature layer, the query()
method returns a FeatureSet
object which has a sdf
method to instantiate a Spatially Enabled DataFrame.
from arcgis import GIS
item = gis.content.get("85d0ca4ea1ca4b9abf0c51b9bd34de2e")
flayer = item.layers[0]
df = flayer.query(where="AGE_45_54 < 1500").sdf
df.head()
FID | NAME | CLASS | ST | STFIPS | PLACEFIPS | CAPITAL | POP_CLASS | POPULATION | POP2010 | ... | MARHH_NO_C | MHH_CHILD | FHH_CHILD | FAMILIES | AVE_FAM_SZ | HSE_UNITS | VACANT | OWNER_OCC | RENTER_OCC | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Ammon | city | ID | 16 | 1601990 | 6 | 15181 | 13816 | ... | 1131 | 106 | 335 | 3352 | 3.61 | 4747 | 271 | 3205 | 1271 | {"x": -12462673.723706165, "y": 5384674.994080... | |
1 | 2 | Blackfoot | city | ID | 16 | 1607840 | 6 | 11946 | 11899 | ... | 1081 | 174 | 381 | 2958 | 3.31 | 4547 | 318 | 2788 | 1441 | {"x": -12506251.313993266, "y": 5341537.793529... | |
2 | 4 | Burley | city | ID | 16 | 1611260 | 6 | 10727 | 10345 | ... | 861 | 139 | 358 | 2499 | 3.37 | 3885 | 241 | 2183 | 1461 | {"x": -12667411.402393516, "y": 5241722.820606... | |
3 | 6 | Chubbuck | city | ID | 16 | 1614680 | 6 | 14655 | 13922 | ... | 1281 | 172 | 370 | 3586 | 3.4 | 4961 | 229 | 3324 | 1408 | {"x": -12520053.904151963, "y": 5300220.333409... | |
4 | 12 | Jerome | city | ID | 16 | 1641320 | 6 | 11403 | 10890 | ... | 779 | 210 | 385 | 2640 | 3.44 | 3985 | 292 | 2219 | 1474 | {"x": -12747828.64784961, "y": 5269214.8197742... |
5 rows × 50 columns
Describing the DataFrame
The DataFrame.info()
provides a concise summary of the object. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
Example: Displaying info on Spatially Enabled DataFrame (SEDF)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 316 entries, 0 to 315 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 FID 316 non-null Int64 1 NAME 316 non-null string 2 CLASS 316 non-null string 3 ST 316 non-null string 4 STFIPS 316 non-null string 5 PLACEFIPS 316 non-null string 6 CAPITAL 316 non-null string 7 POP_CLASS 316 non-null Int32 8 POPULATION 316 non-null Int32 9 POP2010 316 non-null Int32 10 WHITE 316 non-null Int32 11 BLACK 316 non-null Int32 12 AMERI_ES 316 non-null Int32 13 ASIAN 316 non-null Int32 14 HAWN_PI 316 non-null Int32 15 HISPANIC 316 non-null Int32 16 OTHER 316 non-null Int32 17 MULT_RACE 316 non-null Int32 18 MALES 316 non-null Int32 19 FEMALES 316 non-null Int32 20 AGE_UNDER5 316 non-null Int32 21 AGE_5_9 316 non-null Int32 22 AGE_10_14 316 non-null Int32 23 AGE_15_19 316 non-null Int32 24 AGE_20_24 316 non-null Int32 25 AGE_25_34 316 non-null Int32 26 AGE_35_44 316 non-null Int32 27 AGE_45_54 316 non-null Int32 28 AGE_55_64 316 non-null Int32 29 AGE_65_74 316 non-null Int32 30 AGE_75_84 316 non-null Int32 31 AGE_85_UP 316 non-null Int32 32 MED_AGE 316 non-null Float64 33 MED_AGE_M 316 non-null Float64 34 MED_AGE_F 316 non-null Float64 35 HOUSEHOLDS 316 non-null Int32 36 AVE_HH_SZ 316 non-null Float64 37 HSEHLD_1_M 316 non-null Int32 38 HSEHLD_1_F 316 non-null Int32 39 MARHH_CHD 316 non-null Int32 40 MARHH_NO_C 316 non-null Int32 41 MHH_CHILD 316 non-null Int32 42 FHH_CHILD 316 non-null Int32 43 FAMILIES 316 non-null Int32 44 AVE_FAM_SZ 316 non-null Float64 45 HSE_UNITS 316 non-null Int32 46 VACANT 316 non-null Int32 47 OWNER_OCC 316 non-null Int32 48 RENTER_OCC 316 non-null Int32 49 SHAPE 316 non-null geometry dtypes: Float64(5), Int32(37), Int64(1), geometry(1), string(6) memory usage: 91.2 KB
We can see that the SHAPE
column is of type geometry
. This means that compared to the legacy SpatialDataFrame
class, geometry columns are now unique instead of being just of type object
.
We can get information about each axis label (aka, index
) with the axes
property on the spatial dataframe.
print("{:<15}{}\n\n{}{}".format("Row axis: ", df.axes[0], "Column axis: ", df.axes[1]))
Row axis: RangeIndex(start=0, stop=316, step=1) Column axis: Index(['FID', 'NAME', 'CLASS', 'ST', 'STFIPS', 'PLACEFIPS', 'CAPITAL', 'POP_CLASS', 'POPULATION', 'POP2010', 'WHITE', 'BLACK', 'AMERI_ES', 'ASIAN', 'HAWN_PI', 'HISPANIC', 'OTHER', 'MULT_RACE', 'MALES', 'FEMALES', 'AGE_UNDER5', 'AGE_5_9', 'AGE_10_14', 'AGE_15_19', 'AGE_20_24', 'AGE_25_34', 'AGE_35_44', 'AGE_45_54', 'AGE_55_64', 'AGE_65_74', 'AGE_75_84', 'AGE_85_UP', 'MED_AGE', 'MED_AGE_M', 'MED_AGE_F', 'HOUSEHOLDS', 'AVE_HH_SZ', 'HSEHLD_1_M', 'HSEHLD_1_F', 'MARHH_CHD', 'MARHH_NO_C', 'MHH_CHILD', 'FHH_CHILD', 'FAMILIES', 'AVE_FAM_SZ', 'HSE_UNITS', 'VACANT', 'OWNER_OCC', 'RENTER_OCC', 'SHAPE'], dtype='object')
Row axis information informs us we can retrieve information using the the dataframe loc
attribute and any value in the range 0-317 inclusive to access a row. Column axis information tells us we can use any string in the index to return an attribute column:
df.loc[0] #the first row returned
FID 1 NAME Ammon CLASS city ST ID STFIPS 16 PLACEFIPS 1601990 CAPITAL POP_CLASS 6 POPULATION 15181 POP2010 13816 WHITE 13002 BLACK 73 AMERI_ES 67 ASIAN 113 HAWN_PI 9 HISPANIC 884 OTHER 307 MULT_RACE 245 MALES 6750 FEMALES 7066 AGE_UNDER5 1468 AGE_5_9 1503 AGE_10_14 1313 AGE_15_19 1058 AGE_20_24 734 AGE_25_34 2031 AGE_35_44 1767 AGE_45_54 1446 AGE_55_64 1136 AGE_65_74 665 AGE_75_84 486 AGE_85_UP 209 MED_AGE 29.6 MED_AGE_M 28.0 MED_AGE_F 30.8 HOUSEHOLDS 4476 AVE_HH_SZ 3.05 HSEHLD_1_M 457 HSEHLD_1_F 648 MARHH_CHD 1618 MARHH_NO_C 1131 MHH_CHILD 106 FHH_CHILD 335 FAMILIES 3352 AVE_FAM_SZ 3.61 HSE_UNITS 4747 VACANT 271 OWNER_OCC 3205 RENTER_OCC 1271 SHAPE {'x': -12462673.723706165, 'y': 5384674.994080... Name: 0, dtype: object
df['POP2010'] #the data from the `POP2010` attribute column
0 13816 1 11899 2 10345 3 13922 4 10890 ... 311 9089 312 10272 313 9761 314 7855 315 9845 Name: POP2010, Length: 316, dtype: Int32
Slicing DataFrames
We can access rows, columns and subsets of rows and columns using Python slicing:
#rows 0-9 with a subset of columns indexed as a list
df.loc[0:9][['FID', 'NAME', 'ST', 'POP2010', 'POPULATION']]
FID | NAME | ST | POP2010 | POPULATION | |
---|---|---|---|---|---|
0 | 1 | Ammon | ID | 13816 | 15181 |
1 | 2 | Blackfoot | ID | 11899 | 11946 |
2 | 4 | Burley | ID | 10345 | 10727 |
3 | 6 | Chubbuck | ID | 13922 | 14655 |
4 | 12 | Jerome | ID | 10890 | 11403 |
5 | 13 | Kuna | ID | 15210 | 18179 |
6 | 21 | Rexburg | ID | 25484 | 28019 |
7 | 143 | Macomb | IL | 19288 | 19838 |
8 | 151 | Minooka | IL | 10924 | 11815 |
9 | 186 | Plano | IL | 10856 | 11164 |
We can use indexing to access SHAPE
information and draw it on a map:
df[df['NAME'].str.contains('Camp Pendleton South')]
FID | NAME | CLASS | ST | STFIPS | PLACEFIPS | CAPITAL | POP_CLASS | POPULATION | POP2010 | ... | MARHH_NO_C | MHH_CHILD | FHH_CHILD | FAMILIES | AVE_FAM_SZ | HSE_UNITS | VACANT | OWNER_OCC | RENTER_OCC | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
69 | 1035 | Camp Pendleton South | Census Designated Place | CA | 06 | 0610561 | 6 | 10450 | 10616 | ... | 389 | 37 | 169 | 2500 | 3.69 | 2865 | 296 | 11 | 2558 | {"x": -13066535.250599463, "y": 3925680.810605... |
1 rows × 50 columns
Note that, the resulting Series
might not show as in row 69
everytime, so it is important to the update the following cell to point to the corresponding row according to the current output. Also, SHAPE
is one of the many Key Value Pairs of the dict
object created from the resulting Series
, that we are going to print as it is, and render on the map.
camp_pendleton_s_geodefn = dict(df.loc[69]).get("SHAPE") #geometry definition from row 2
print(camp_pendleton_s_geodefn)
{'x': -13066535.250599463, 'y': 3925680.810605321, 'spatialReference': {'wkid': 102100, 'latestWkid': 3857}}
m = gis.map("San Diego, CA", 8)
m
# draw the camp `point`
m.draw(camp_pendleton_s_geodefn)
Spatial Index
In addition to row and column indices to search a DataFrame, we can use a spatial indexes to quickly access information based on its location and relationship with other features. They are based on the concept of a minimum bounding rectangle - the smallest rectangle that contains an entire geometric shape. Each of these rectangles are then grouped into leaf
nodes representing a single shape and node
structures containing groups of shapes according to whatever algorithm the different types of spatial indexing use. Querying these rectangles requires magnitudes fewer compute resources for accessing and processing geometries relative to accessing the entire feature array of coordinate pairs that compose a shape. Access to points, complex lines and irregularly-shaped polygons becomes much quicker and easier through different flavors of spatial indexing.
The Spatially Enabled DataFrame uses an implementation of spatial indexing known as QuadTree indexing, which searches nodes when determining locations, relationships and attributes of specific features. QuadTree
indexes are the default spatial index, but the SEDF also supports r-tree
implementations. In the DataFrame Index section of this notebook, the USA Major Cities feature layer was queried and the sdf
property was called on the results to create a DataFrame. The sindex
method on the DataFrame creates a QuadTree index:
si = df.spatial.sindex('quadtree',reset=False)
Let's visually inspect the external frame of the QuadTree index. We'll then plot the spatial dataframe to ensure the spatial index encompasses all our features:
midx = gis.map("United States", 3)
midx
midx.center = [39, -98]
midx.basemap = 'gray-vector'
# draw the spatial index envelope
df.spatial.plot(map_widget=midx)
True
Let's use the feature we drew earlier to define a spatial reference variable for use throughout the rest of this guide.
sp_ref = camp_pendleton_s_geodefn['spatialReference']
sp_ref
{'wkid': 102100, 'latestWkid': 3857}
import time
from arcgis.geometry import Geometry, Polygon
#define a symbol to visualize the spatial index quadrants
sym = {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [0,0,0,0],
"outline": {
"type": "esriSLS",
"style": "esriSLSSolid",
"color": [0,0,0,255],
"width": 4}
}
# loop through the children of the root index and draw each extent
# using a different outline color
for i in range(len(si._index.children)):
sym["outline"]["color"][i] = 255
if i > 0:
sym["outline"]["color"][i] = 255
sym["outline"]["color"][i-1] = 0
child = si._index.children[i]
width_factor = child.width/2
height_factor = child.width/2
minx = child.center[0] - width_factor
miny = child.center[1] - height_factor
maxx = child.center[0] + width_factor
maxy = child.center[1] + height_factor
child_geom = Geometry({
'rings':[[[minx,miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]],
'spatialReference': sp_ref})
#child_extent = Polygon(child_geom)
midx.draw(shape = child_geom, symbol = sym)
time.sleep(2)
Intersection with the Spatial Index
Up to this point in this guide, we've talked about using indexing for querying attributes in the dataframe. For example:
query = df['ST'] == 'MI'
df[query]
FID | NAME | CLASS | ST | STFIPS | PLACEFIPS | CAPITAL | POP_CLASS | POPULATION | POP2010 | ... | MARHH_NO_C | MHH_CHILD | FHH_CHILD | FAMILIES | AVE_FAM_SZ | HSE_UNITS | VACANT | OWNER_OCC | RENTER_OCC | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
101 | 1514 | Allendale | Census Designated Place | MI | 26 | 2601340 | 6 | 19709 | 17579 | ... | 939 | 94 | 200 | 2639 | 3.26 | 4834 | 227 | 2511 | 2096 | {"x": -9568014.259185659, "y": 5309698.1042234... | |
102 | 1523 | Big Rapids | city | MI | 26 | 2608300 | 6 | 10614 | 10601 | ... | 436 | 82 | 322 | 1323 | 2.88 | 3623 | 293 | 1136 | 2194 | {"x": -9515614.1943842, "y": 5419013.554990504... | |
103 | 1526 | Cadillac | city | MI | 26 | 2612320 | 6 | 10547 | 10355 | ... | 999 | 179 | 526 | 2625 | 2.9 | 4927 | 647 | 2519 | 1761 | {"x": -9508449.13740055, "y": 5504118.53341087... | |
104 | 1528 | Coldwater | city | MI | 26 | 2617020 | 6 | 10828 | 10945 | ... | 897 | 213 | 445 | 2628 | 3.14 | 4827 | 572 | 2576 | 1679 | {"x": -9462596.578807637, "y": 5152066.1922501... | |
105 | 1529 | Comstock Park | Census Designated Place | MI | 26 | 2617700 | 6 | 10892 | 10088 | ... | 920 | 167 | 557 | 2639 | 2.96 | 4656 | 484 | 1979 | 2193 | {"x": -9537405.970997674, "y": 5316589.2811759... | |
106 | 1538 | Farmington | city | MI | 26 | 2627380 | 6 | 10674 | 10372 | ... | 1129 | 66 | 287 | 2735 | 2.92 | 4959 | 335 | 2876 | 1748 | {"x": -9281637.643149175, "y": 5230343.3917462... | |
107 | 1546 | Grand Haven | city | MI | 26 | 2633340 | 6 | 11073 | 10412 | ... | 1277 | 115 | 320 | 2721 | 2.82 | 5815 | 1046 | 3239 | 1530 | {"x": -9597693.318165638, "y": 5320625.9222716... | |
108 | 1575 | Muskegon Heights | city | MI | 26 | 2656360 | 6 | 10657 | 10856 | ... | 460 | 143 | 1183 | 2682 | 3.23 | 4842 | 846 | 2042 | 1954 | {"x": -9600424.098598039, "y": 5342825.5598096... | |
109 | 1600 | Sturgis | city | MI | 26 | 2676960 | 6 | 10944 | 10994 | ... | 835 | 195 | 467 | 2632 | 3.28 | 4595 | 507 | 2333 | 1755 | {"x": -9508726.380150843, "y": 5131255.7167436... |
9 rows × 50 columns
We can query multiple attributes and filter on the column output as well:
query = (df['POP2010'] > 20000) & (df['ST'] == 'OH')
df[query][['NAME','ST','POP2010','HOUSEHOLDS','HSEHLD_1_F', 'HSEHLD_1_M']]
NAME | ST | POP2010 | HOUSEHOLDS | HSEHLD_1_F | HSEHLD_1_M | |
---|---|---|---|---|---|---|
166 | Athens | OH | 23832 | 6903 | 2474 | 2573 |
195 | Oxford | OH | 21371 | 5799 | 2033 | 1850 |
As GIS analysts and data scientists, we also want to query based on geographic location. We can do that by building a spatial index with the sindex
property of the spatial dataframe. The resulting quadtree index allows us to query based on specific geometries in relation to other geometries.
Let's continue looking at the dataframe wer're working with: US cities with a population between the ages of 45 and 54 of less than 1500.
We can draw the entire extent of our dataframe using the dataframe's geoextent
property. Let's get the bounding box coordinates:
df_geoextent = df.spatial.full_extent
df_geoextent
(-17595352.55942164, 2429395.3372018305, -7895099.852443745, 6266417.1716177985)
Let's use these coordinates, place them in more descriptive variable names, then create a bounding box to make a geometry object representing the extent of our dataframe. Finally we'll draw it on the a map:
df_geoextent_geom = df.spatial.bbox
df_geoextent_geom
m1 = gis.map("United States", 3)
m1
m1.center = [39, -98]
sym_poly = {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [0,0,0,0], # hollow, no fill
"outline": {
"type": "esriSLS",
"style": "esriSLSSolid",
"color": [255,0,0,255], # red border
"width": 3}
}
# draw the dataframe extent with AOI
m1.draw(shape = df_geoextent_geom, symbol = sym_poly)
Now, let's define a second set of coordinates representing a bounding box for which we want to query the features from our dataframe that fall within it.
We can define our list of coordinates, and then draw it on the map to make sure it falls within our dataframe extent:
area_of_interest = [-13043219.122301877, 3911134.034258818, -13243219.102301877, 4111134.0542588173]
minx, miny, maxx, maxy = area_of_interest[0], area_of_interest[1], area_of_interest[2], area_of_interest[3]
area_of_interest_ring = [[[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]]
area_of_interest_geom = Geometry({'rings': area_of_interest_ring, 'spatialReference': sp_ref})
sym_poly_aoi = {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [0,0,0,0], # hollow, no fill
"outline": {
"type": "esriSLS",
"style": "esriSLSSolid",
"color": [0,255,0,255], # green border
"width": 3}
}
m1.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
We can see that our area of interest box falls within the dataframe extent. The spatial index has an intersect
method which takes a bounding box as input and returns a list of integer values from the row index of our spatial dataframe. We can use the dataframe's iloc
integer-indexing attribute to then loop through the dataframe and put draw the features on a map
index_of_features = si.intersect(area_of_interest)
df.iloc[index_of_features]
FID | NAME | CLASS | ST | STFIPS | PLACEFIPS | CAPITAL | POP_CLASS | POPULATION | POP2010 | ... | MARHH_NO_C | MHH_CHILD | FHH_CHILD | FAMILIES | AVE_FAM_SZ | HSE_UNITS | VACANT | OWNER_OCC | RENTER_OCC | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | 454 | Muscoy | Census Designated Place | CA | 06 | 0650132 | 6 | 11161 | 10644 | ... | 302 | 194 | 308 | 1933 | 4.76 | 2443 | 212 | 1268 | 963 | {"x": -13063141.654215325, "y": 4049605.974486... | |
69 | 1035 | Camp Pendleton South | Census Designated Place | CA | 06 | 0610561 | 6 | 10450 | 10616 | ... | 389 | 37 | 169 | 2500 | 3.69 | 2865 | 296 | 11 | 2558 | {"x": -13066535.250599463, "y": 3925680.810605... | |
70 | 1053 | Citrus | Census Designated Place | CA | 06 | 0613560 | 6 | 11329 | 10866 | ... | 541 | 116 | 255 | 2195 | 4.33 | 2701 | 86 | 1854 | 761 | {"x": -13123874.444099307, "y": 4044252.329385... | |
71 | 1062 | Commerce | city | CA | 06 | 0614974 | 6 | 13227 | 12823 | ... | 549 | 169 | 424 | 2709 | 4.17 | 3470 | 88 | 1619 | 1763 | {"x": -13151212.145498956, "y": 4027601.902958... | |
77 | 1158 | Home Gardens | Census Designated Place | CA | 06 | 0634302 | 6 | 12223 | 11570 | ... | 573 | 135 | 298 | 2336 | 4.35 | 2865 | 105 | 1952 | 808 | {"x": -13080593.064290542, "y": 4012558.291540... | |
81 | 1177 | Laguna Woods | city | CA | 06 | 0639259 | 6 | 17960 | 16192 | ... | 3264 | 6 | 15 | 3873 | 2.07 | 13016 | 1714 | 8730 | 2572 | {"x": -13105613.458129246, "y": 3976538.093303... |
6 rows × 50 columns
Let us plot these features that intersect on a map:
m2 = gis.map("Los Angeles, CA", 7)
m2
m2.center = [34, -118]
m2.draw(shape = area_of_interest_geom, symbol = sym_poly_aoi)
pt_sym = {
"type": "esriSMS",
"style": "esriSMSDiamond",
"color": [255,140,0,255], # yellowish
"size": 8,
"angle": 0,
"xoffset": 0,
"yoffset": 0,
"outline": {
"color": [255,140,0,255],
"width": 1}
}
# draw the AOI that intersects
for pt_index in index_of_features:
m2.draw(shape = df.iloc[pt_index]['SHAPE'], symbol = pt_sym)
Thus we were able to use the spatial indexes to query features that fall within an extent.
Spatial Joins
DataFrames are table-like structures comprised of rows and columns. In relational database, SQL joins
are fundamental operations that combine columns from one or more tables using values that are common to each. They occur in almost all database queries.
A Spatial join is a table operation that affixes data from one feature layer’s attribute table to another based on a spatial relationship. The spatial join involves matching rows from the Join Features (data frame1) to the Target Features (data frame2) based on their spatial relationship.
Let's look at how joins work with dataframes by using subsets of our original DataFrame and the pandas merge
fucntionality. We'll then move onto examining a spatial join to combine features from one dataframe with another based on a common attribute value.
Query the DataFrame to extract 3 attribute columns of information from 2 states, Ohio and Michigan:
query = (df['ST'] == 'OH') | (df['ST'] == 'MI')
df1 = df[query][['NAME', 'ST', 'POP2010']]
df1
NAME | ST | POP2010 | |
---|---|---|---|
101 | Allendale | MI | 17579 |
102 | Big Rapids | MI | 10601 |
103 | Cadillac | MI | 10355 |
104 | Coldwater | MI | 10945 |
105 | Comstock Park | MI | 10088 |
106 | Farmington | MI | 10372 |
107 | Grand Haven | MI | 10412 |
108 | Muskegon Heights | MI | 10856 |
109 | Sturgis | MI | 10994 |
166 | Athens | OH | 23832 |
167 | Cambridge | OH | 10635 |
168 | Celina | OH | 10400 |
192 | Galion | OH | 10512 |
193 | London | OH | 9904 |
194 | Northbrook | OH | 10668 |
195 | Oxford | OH | 21371 |
196 | Springdale | OH | 11223 |
197 | Trenton | OH | 11869 |
198 | University Heights | OH | 13539 |
199 | Van Wert | OH | 10846 |
Query the dataframe again for 8 attribute columns from one state, Ohio
query = df['ST'] == 'OH'
df2 = df[query][['NAME', 'POPULATION','BLACK', 'HAWN_PI', 'HISPANIC', 'WHITE', 'MULT_RACE', 'OTHER']]
df2
NAME | POPULATION | BLACK | HAWN_PI | HISPANIC | WHITE | MULT_RACE | OTHER | |
---|---|---|---|---|---|---|---|---|
166 | Athens | 25431 | 1047 | 10 | 576 | 20586 | 559 | 138 |
167 | Cambridge | 10541 | 361 | 3 | 129 | 9857 | 316 | 33 |
168 | Celina | 10566 | 50 | 41 | 293 | 9873 | 164 | 111 |
192 | Galion | 10385 | 50 | 1 | 140 | 10264 | 120 | 42 |
193 | London | 10543 | 596 | 0 | 169 | 8830 | 287 | 61 |
194 | Northbrook | 10825 | 2974 | 6 | 377 | 6995 | 379 | 141 |
195 | Oxford | 23054 | 859 | 2 | 491 | 18719 | 478 | 127 |
196 | Springdale | 11507 | 3355 | 44 | 1965 | 6169 | 331 | 977 |
197 | Trenton | 12393 | 115 | 2 | 198 | 11418 | 224 | 35 |
198 | University Heights | 13884 | 3133 | 4 | 374 | 9726 | 215 | 121 |
199 | Van Wert | 11042 | 180 | 1 | 435 | 10263 | 221 | 130 |
The Pandas merge
capability joins dataframes in a style similar to SQL joins, with parameters to indicate the column of shared information and the type of join to perform:
An inner
join (the default), is analagous to a SQL left inner join, keeping the order from the left table in the output and returning only those records from the right table that match the value in the column specified with the on
parameter:
import pandas as pd
pd.merge(df1, df2, on='NAME', how='inner')
NAME | ST | POP2010 | POPULATION | BLACK | HAWN_PI | HISPANIC | WHITE | MULT_RACE | OTHER | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Athens | OH | 23832 | 25431 | 1047 | 10 | 576 | 20586 | 559 | 138 |
1 | Cambridge | OH | 10635 | 10541 | 361 | 3 | 129 | 9857 | 316 | 33 |
2 | Celina | OH | 10400 | 10566 | 50 | 41 | 293 | 9873 | 164 | 111 |
3 | Galion | OH | 10512 | 10385 | 50 | 1 | 140 | 10264 | 120 | 42 |
4 | London | OH | 9904 | 10543 | 596 | 0 | 169 | 8830 | 287 | 61 |
5 | Northbrook | OH | 10668 | 10825 | 2974 | 6 | 377 | 6995 | 379 | 141 |
6 | Oxford | OH | 21371 | 23054 | 859 | 2 | 491 | 18719 | 478 | 127 |
7 | Springdale | OH | 11223 | 11507 | 3355 | 44 | 1965 | 6169 | 331 | 977 |
8 | Trenton | OH | 11869 | 12393 | 115 | 2 | 198 | 11418 | 224 | 35 |
9 | University Heights | OH | 13539 | 13884 | 3133 | 4 | 374 | 9726 | 215 | 121 |
10 | Van Wert | OH | 10846 | 11042 | 180 | 1 | 435 | 10263 | 221 | 130 |
Notice how all the rows from the left DataFrame
appear in the result with all the attribute columns and values appended from the right DataFrame
where the column value of NAME
matched. The POP2010
attribute from the left DataFrame
is combined with all the attributes from the right DataFrame
.
An outer
join combines all rows from both outputs together and orders the results according to the original row index:
pd.merge(df1, df2, on='NAME', how = 'outer')
NAME | ST | POP2010 | POPULATION | BLACK | HAWN_PI | HISPANIC | WHITE | MULT_RACE | OTHER | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Allendale | MI | 17579 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
1 | Big Rapids | MI | 10601 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
2 | Cadillac | MI | 10355 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
3 | Coldwater | MI | 10945 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
4 | Comstock Park | MI | 10088 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
5 | Farmington | MI | 10372 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
6 | Grand Haven | MI | 10412 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
7 | Muskegon Heights | MI | 10856 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
8 | Sturgis | MI | 10994 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
9 | Athens | OH | 23832 | 25431 | 1047 | 10 | 576 | 20586 | 559 | 138 |
10 | Cambridge | OH | 10635 | 10541 | 361 | 3 | 129 | 9857 | 316 | 33 |
11 | Celina | OH | 10400 | 10566 | 50 | 41 | 293 | 9873 | 164 | 111 |
12 | Galion | OH | 10512 | 10385 | 50 | 1 | 140 | 10264 | 120 | 42 |
13 | London | OH | 9904 | 10543 | 596 | 0 | 169 | 8830 | 287 | 61 |
14 | Northbrook | OH | 10668 | 10825 | 2974 | 6 | 377 | 6995 | 379 | 141 |
15 | Oxford | OH | 21371 | 23054 | 859 | 2 | 491 | 18719 | 478 | 127 |
16 | Springdale | OH | 11223 | 11507 | 3355 | 44 | 1965 | 6169 | 331 | 977 |
17 | Trenton | OH | 11869 | 12393 | 115 | 2 | 198 | 11418 | 224 | 35 |
18 | University Heights | OH | 13539 | 13884 | 3133 | 4 | 374 | 9726 | 215 | 121 |
19 | Van Wert | OH | 10846 | 11042 | 180 | 1 | 435 | 10263 | 221 | 130 |
The rows where the on parameter value is the same in both tables have all attributes from both DataFrames in the result. The rows from the first DataFrame that do not have a matching NAME
value in the second dataframe have values filled in with NaN
values.
A spatial join works similarly on matching attribute values. However, instead of joining on an attribue field (like you did earlier), you will join based on the spatial relationship between the records in the two tables.
Example: Merging State Statistics Information with Cities
The goal is to get Wyoming's city locations and census data joined with Wyoming's state census data.
If you do not have access to the
ArcPy
site-package from the Python interpreter used to execute the following cells, you must authenticate to an ArcGIS Online Organization or ArcGIS Enterprise portal.
from arcgis.gis import GIS
g2 = GIS(profile="your_enterprise_profile")
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import os
data_pth = r'/path/to/your/data/census_2010/example'
cities = r"cities.shp"
states = r"states.shp"
sdf_target = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, cities))
sdf_target.head()
FID | NAME | CLASS | ST | STFIPS | PLACEFIP | CAPITAL | AREALAND | AREAWATER | POP_CLASS | ... | MARHH_NO_C | MHH_CHILD | FHH_CHILD | FAMILIES | AVE_FAM_SZ | HSE_UNITS | VACANT | OWNER_OCC | RENTER_OCC | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | College | Census Designated Place | AK | 02 | 16750 | 18.670 | 0.407 | 6 | ... | 936 | 152 | 339 | 2640 | 3.13 | 4501 | 397 | 2395 | 1709 | {"x": -147.8271911572905, "y": 64.848300194159... | |
1 | 1 | Fairbanks | City | AK | 02 | 24230 | 31.857 | 0.815 | 6 | ... | 2259 | 395 | 1058 | 7187 | 3.15 | 12357 | 1282 | 3863 | 7212 | {"x": -147.72638163006846, "y": 64.83809069704... | |
2 | 2 | Kalispell | City | MT | 30 | 40075 | 5.458 | 0.004 | 6 | ... | 1433 | 147 | 480 | 3494 | 2.92 | 6532 | 390 | 3458 | 2684 | {"x": -114.31606412429451, "y": 48.19780017936... | |
3 | 3 | Post Falls | City | ID | 16 | 64810 | 9.656 | 0.045 | 6 | ... | 1851 | 205 | 467 | 4670 | 3.13 | 6697 | 328 | 4611 | 1758 | {"x": -116.93792709825782, "y": 47.71555468018... | |
4 | 4 | Dishman | Census Designated Place | WA | 53 | 17985 | 3.378 | 0.000 | 6 | ... | 1096 | 131 | 345 | 2564 | 2.96 | 4408 | 257 | 2635 | 1516 | {"x": -117.27780913774282, "y": 47.65654568420... |
5 rows × 48 columns
Define a SpatialReference object to pass as the second argument in the from_featureclass
function, so calling the method can also project (or transform) output GeoDataFrame to your desired spatial reference. Note: this requires arcpy to work.
from arcgis.geometry import SpatialReference
sdf_join = pd.DataFrame.spatial.from_featureclass(os.path.join(data_pth, states), sr=SpatialReference(4326).as_arcpy)
sdf_join.head()
FID | STATE_NAME | DRAWSEQ | STATE_FIPS | SUB_REGION | STATE_ABBR | SHAPE | |
---|---|---|---|---|---|---|---|
0 | 0 | Hawaii | 1 | 15 | Pacific | HI | {"rings": [[[-160.07380334546815, 22.004177347... |
1 | 1 | Washington | 2 | 53 | Pacific | WA | {"rings": [[[-122.40201531038355, 48.225216372... |
2 | 2 | Montana | 3 | 30 | Mountain | MT | {"rings": [[[-111.47542530020736, 44.702162369... |
3 | 3 | Maine | 4 | 23 | New England | ME | {"rings": [[[-69.77727626137293, 44.0741483685... |
4 | 4 | North Dakota | 5 | 38 | West North Central | ND | {"rings": [[[-98.73043728833767, 45.9382713702... |
We will use python's list comprehensions to create lists of the attribute columns in the DataFrame, then print out the lists to see the names of all the attribute columns.
sdf_target_cols = [column for column in sdf_target.columns]
sdf_join_cols = [column for column in sdf_join.columns]
Print out a list of columns in the sdf_target
dataframe created from the cities shapefile:
for a,b,c,d in zip(sdf_target_cols[::4],sdf_target_cols[1::4],sdf_target_cols[2::4], sdf_target_cols[3::4]):
print("{:<30}{:<30}{:<30}{:<}".format(a,b,c,d))
FID NAME CLASS ST STFIPS PLACEFIP CAPITAL AREALAND AREAWATER POP_CLASS POP2000 POP2007 WHITE BLACK AMERI_ES ASIAN HAWN_PI OTHER MULT_RACE HISPANIC MALES FEMALES AGE_UNDER5 AGE_5_17 AGE_18_21 AGE_22_29 AGE_30_39 AGE_40_49 AGE_50_64 AGE_65_UP MED_AGE MED_AGE_M MED_AGE_F HOUSEHOLDS AVE_HH_SZ HSEHLD_1_M HSEHLD_1_F MARHH_CHD MARHH_NO_C MHH_CHILD FHH_CHILD FAMILIES AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC SHAPE
Print out a list of columns in the sdf_join
dataframe created from the states shapefile:
for a,b,c,d,e in zip(sdf_join_cols[::5],sdf_join_cols[1::5],sdf_join_cols[2::5],sdf_join_cols[3::5],sdf_join_cols[4::5]):
print("{:<20}{:<20}{:<20}{:<20}{:<}".format(a,b,c,d,e))
FID STATE_NAME DRAWSEQ STATE_FIPS SUB_REGION
Create a DataFrame for the cities in Wyoming:
sdf_target.loc[0]['SHAPE'].as_arcpy
q = sdf_target['ST'] == 'WY'
left = sdf_target[q].copy()
left.head()
FID | NAME | CLASS | ST | STFIPS | PLACEFIP | CAPITAL | AREALAND | AREAWATER | POP_CLASS | ... | MARHH_NO_C | MHH_CHILD | FHH_CHILD | FAMILIES | AVE_FAM_SZ | HSE_UNITS | VACANT | OWNER_OCC | RENTER_OCC | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
711 | 711 | Green River | City | WY | 56 | 33740 | 13.706 | 0.315 | 6 | ... | 1278 | 113 | 251 | 3214 | 3.22 | 4426 | 249 | 3169 | 1008 | {"x": -109.46492712301152, "y": 41.51419117328... | |
712 | 712 | Rock Springs | City | WY | 56 | 67235 | 18.441 | 0.000 | 6 | ... | 2012 | 220 | 536 | 4931 | 3.02 | 8359 | 1011 | 5274 | 2074 | {"x": -109.22240010498797, "y": 41.59092714080... | |
715 | 715 | Evanston | City | WY | 56 | 25620 | 10.245 | 0.044 | 6 | ... | 976 | 139 | 369 | 2940 | 3.30 | 4665 | 607 | 2805 | 1253 | {"x": -110.96461812552366, "y": 41.26330015271... | |
764 | 764 | Laramie | City | WY | 56 | 45050 | 11.138 | 0.019 | 6 | ... | 2496 | 174 | 587 | 5608 | 2.83 | 11994 | 658 | 5379 | 5957 | {"x": -105.58725462620347, "y": 41.31292665660... | |
766 | 766 | Cheyenne | City | WY | 56 | 13900 | State | 21.108 | 0.082 | 7 | ... | 6299 | 490 | 1610 | 14174 | 2.93 | 23782 | 1458 | 14739 | 7585 | {"x": -104.80204559586696, "y": 41.14554516058... |
5 rows × 48 columns
Create a dataframe for the state of Wyoming:
q = sdf_join.STATE_ABBR == 'WY'
right = sdf_join[q].copy()
right.head()
FID | STATE_NAME | DRAWSEQ | STATE_FIPS | SUB_REGION | STATE_ABBR | SHAPE | |
---|---|---|---|---|---|---|---|
6 | 6 | Wyoming | 7 | 56 | Mountain | WY | {"rings": [[[-104.05361529329527, 41.698218366... |
Perform the spatial join:
Before performing a spatial join between these two DataFrame objects, we can check the SpatialReference
of each one of them to validate if they are the same SR - this is a pre-requisite of joining two DataFrames.
left.spatial.sr
{'wkid': 4326, 'latestWkid': 4326}
right.spatial.sr
{'wkid': 4326, 'latestWkid': 4326}
sdf2 = left.spatial.join(right)
sdf2
FID_left | NAME | CLASS | ST | STFIPS | PLACEFIP | CAPITAL | AREALAND | AREAWATER | POP_CLASS | ... | OWNER_OCC | RENTER_OCC | SHAPE | index_right | FID_right | STATE_NAME | DRAWSEQ | STATE_FIPS | SUB_REGION | STATE_ABBR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 711 | Green River | City | WY | 56 | 33740 | 13.706 | 0.315 | 6 | ... | 3169 | 1008 | {"x": -109.46492712301152, "y": 41.51419117328... | 6 | 6 | Wyoming | 7 | 56 | Mountain | WY | |
1 | 712 | Rock Springs | City | WY | 56 | 67235 | 18.441 | 0.000 | 6 | ... | 5274 | 2074 | {"x": -109.22240010498797, "y": 41.59092714080... | 6 | 6 | Wyoming | 7 | 56 | Mountain | WY | |
2 | 715 | Evanston | City | WY | 56 | 25620 | 10.245 | 0.044 | 6 | ... | 2805 | 1253 | {"x": -110.96461812552366, "y": 41.26330015271... | 6 | 6 | Wyoming | 7 | 56 | Mountain | WY | |
3 | 764 | Laramie | City | WY | 56 | 45050 | 11.138 | 0.019 | 6 | ... | 5379 | 5957 | {"x": -105.58725462620347, "y": 41.31292665660... | 6 | 6 | Wyoming | 7 | 56 | Mountain | WY | |
4 | 766 | Cheyenne | City | WY | 56 | 13900 | State | 21.108 | 0.082 | 7 | ... | 14739 | 7585 | {"x": -104.80204559586696, "y": 41.14554516058... | 6 | 6 | Wyoming | 7 | 56 | Mountain | WY |
5 | 1216 | Sheridan | City | WY | 56 | 69845 | 8.486 | 0.018 | 6 | ... | 4446 | 2559 | {"x": -106.95897260592156, "y": 44.79671814410... | 6 | 6 | Wyoming | 7 | 56 | Mountain | WY | |
6 | 1218 | Casper | City | WY | 56 | 13150 | 23.945 | 0.316 | 6 | ... | 13616 | 6727 | {"x": -106.32506361818486, "y": 42.83466364743... | 6 | 6 | Wyoming | 7 | 56 | Mountain | WY | |
7 | 1219 | Gillette | City | WY | 56 | 31855 | 13.369 | 0.025 | 6 | ... | 4867 | 2523 | {"x": -105.50525462413556, "y": 44.28266365145... | 6 | 6 | Wyoming | 7 | 56 | Mountain | WY |
8 rows × 55 columns
Notice, you retain the geometry type of your left DataFrame (points) in this case, however, you get all the attributes from both the left and right DataFrames. Let us plot the results of the spatial join on a map:
m3 = gis.map("Wyoming", 6)
m3
m3.center = [43, -107]
from arcgis.geometry import Geometry
# draw the spatial join results on Wyoming state
for idx, row in sdf2.iterrows():
m3.draw(row['SHAPE'], symbol=pt_sym)
Conclusion
Spatially Enabled DataFrame give you powerful data analysis and data wrangling capabilities. In addition to performing sql like operations on attribute data, you can perform geographic queries. This guide demonstrated some of these advanced capabilities of the SEDF.