Learn how to read from, manage, and write to shapefiles. A shapefile data source behaves like other file formats within Spark (Parquet, ORC, and so on): you can read data from it and write data to it.
In this tutorial you will read from shapefiles, write results to new shapefiles, and partition data logically.
Read shapefiles
Download the sample shapefile from ArcGIS Online. Store it in a local folder on your machine, for example C:\data\shapefile_demo.
Set up the workspace
Import the required modules.
# Import the required modules
import os, tempfile
import geoanalytics
from geoanalytics.sql import functions as ST
geoanalytics.auth(username="user1", password="p@ssword")
Set the output directory to write your formatted data to.
# Set the workspace
output_dir = os.path.normpath(r"C:/data/shapefile_demo")
Read from your shapefile and display columns of interest
Read the shapefile into a DataFrame. Note that you specify the folder containing the shapefile, not the full path to the .shp file. A folder can contain multiple shapefiles with the same schema and be read as a single DataFrame.
# Read the shapefile into a DataFrame
shapefileDF = spark.read.format("shapefile").load(r"c:\data\shapefile_demo\Mineplants")
Visualize a subset of the columns, including the geometry column, by
showing a sample of five rows from the input.
# Sample your DataFrame
shapefileDF.select("commodity", "COMPANY_NA", "geometry").show(5, truncate=False)
Result
+---------+-------------------------------+------------------------+
|commodity|COMPANY_NA |geometry |
+---------+-------------------------------+------------------------+
|Aluminum |Alcoa Inc |{"x":-87.336,"y":37.915}|
|Aluminum |Century Aluminum Co |{"x":-86.786,"y":37.942}|
|Aluminum |Alcan Inc |{"x":-87.5,"y":37.65} |
|Aluminum |Ormet Corp |{"x":-90.923,"y":30.138}|
|Aluminum |Kaiser Aluminum & Chemical Corp|{"x":-90.755,"y":30.049}|
+---------+-------------------------------+------------------------+
only showing top 5 rows
Write shapefiles
Write a DataFrame to a shapefile
Use a defined dataset to create a DataFrame and write it to a shapefile. Define your own dataset.
# Define a point dataset
myPoints = [(0, -4655711.2806, 222503.076),
            (1, -4570473.292, 322503.076),
            (2, -4830838.089, 146545.398),
            (3, -4570771.608, 116617.112),
            (4, -4682228.671, 173377.654)]
fields = ["id", "latitude", "longitude"]
Create a DataFrame from your dataset definition.
# Create a DataFrame
df = spark.createDataFrame(myPoints, fields)
# Enable geometry
df = df.withColumn("geometry",
                   ST.srid(ST.point("longitude", "latitude"), 6329)) \
       .st.set_geometry_field("geometry")
Write your DataFrame to a shapefile.
# Write to a single shapefile - update the path to a location accessible to you
df.coalesce(1).write.format("shapefile").mode("overwrite").save(r"C:\data\output_shapefile")
Merge shapefiles with different schemas
Use schema merging when a collection of datasets contains varying schemas. For example, suppose data have been collected over time: each month a new dataset was created that introduced a new column named for that month. Use schema merging to resolve the schema differences.
If you haven't already downloaded the sample shapefile, follow the steps above to prepare your input shapefile.
# Read the shapefile into a DataFrame
shapefileDF = spark.read.format("shapefile").load(r"c:\data\shapefile_demo\Mineplants")
Set the output location for the shapefiles.
These are the shapefiles that will have their schemas merged to form a single DataFrame.
# Set the output path to store your shapefiles
output_shapefiles = os.path.join(output_dir, "merged_shapefile")
Create three subset shapefiles. Specify a value of 1 for .coalesce() to write each query result to a single shapefile. Coalescing reduces the number of partitions, resulting in fewer output shapefiles; by default, one shapefile is written per partition.
Each shapefile will have three columns with names in common (geometry, id, and commodity), and one column with a unique name.
Rows with id values between 1 and 5 will have a column named site_name.
# Create the first subset shapefile
shapefileDF.where("id <= 5").select("id", "commodity", "site_name", "geometry") \
    .coalesce(1).write.format("shapefile").mode("overwrite").save(output_shapefiles)
Rows with id values between 6 and 10 will have a column named company_na.
# Create the second subset shapefile
shapefileDF.where("id between 6 and 10").select("id", "commodity", "company_na", "geometry") \
    .coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
Rows with id values between 11 and 15 will have a column named state_loca.
# Create the third subset shapefile
shapefileDF.where("id between 11 and 15").select("id", "commodity", "state_loca", "geometry") \
    .coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
Use schema merging to create a DataFrame with a single, combined schema.
# Merge schemas for the three subset shapefiles
spark.read.format("shapefile").option("mergeSchemas", "true").load(output_shapefiles) \
    .orderBy("id").show()
Result
+---+---------+--------------------+--------------------+--------------------+--------------+
| id|commodity| company_na| geometry| site_name| state_loca|
+---+---------+--------------------+--------------------+--------------------+--------------+
| 1| Aluminum| null|{"x":-87.336,"y":...|Evansville (Warri...| null|
| 2| Aluminum| null|{"x":-86.786,"y":...| Hawesville Smelter| null|
| 3| Aluminum| null|{"x":-87.5,"y":37...| Sebree Smelter| null|
| 4| Aluminum| null|{"x":-90.923,"y":...| Burnside Refinery| null|
| 5| Aluminum| null|{"x":-90.755,"y":...| Gramercy Refinery| null|
| 6| Aluminum| Alcoa Inc|{"x":-77.469,"y":...| null| null|
| 7| Aluminum|Noranda Aluminum Inc|{"x":-89.564,"y":...| null| null|
| 8| Aluminum|Columbia Falls Al...|{"x":-114.139,"y"...| null| null|
| 9| Aluminum| Alcoa Inc|{"x":-74.75,"y":4...| null| null|
| 10| Aluminum| Alcoa Inc|{"x":-74.881,"y":...| null| null|
| 11| Aluminum| null|{"x":-80.873,"y":...| null| Ohio|
| 12| Aluminum| null|{"x":-80.05,"y":3...| null|South Carolina|
| 13| Aluminum| null|{"x":-83.968,"y":...| null| Tennessee|
| 14| Aluminum| null|{"x":-96.554,"y":...| null| Texas|
| 15| Aluminum| null|{"x":-97.076,"y":...| null| Texas|
+---+---------+--------------------+--------------------+--------------------+--------------+
Partition your shapefile into logical groups
Datasets can be partitioned by values within one or more columns. Each unique value in a column becomes a directory with the name <column_name>=<value>. In this example, you will logically separate the data based on column values for spatial regions.
Without partitioning or coalescing when writing data, you will end up with a shapefile for each partition by default. Partitioning your data logically enables you to read, write, and store data in meaningful storage structures.
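The <column_name>=<value> directory names follow the Hive-style partition layout, and special characters such as spaces are percent-encoded (which is why the partition listing contains %20). A rough sketch of how such a path is built, using Python's urllib as an illustration, not necessarily the engine's exact implementation:

```python
from urllib.parse import quote

def partition_path(**columns):
    # Build a Hive-style partition path, percent-encoding each value
    return "/".join(f"{name}={quote(str(value))}" for name, value in columns.items())

print(partition_path(state_loca="Alabama", commodity="Sand and Gravel"))
# state_loca=Alabama/commodity=Sand%20and%20Gravel
```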
Specify the location to output your newly partitioned data.
# Set the output path to store your partitioned datasets
partitioned_output = os.path.join(output_dir, "partitioned")
Partition your data based on the values for the columns
"state_loca" and "commodity".
# Partition your data by state and resource type
shapefileDF.write.format("shapefile").partitionBy("state_loca", "commodity") \
    .mode("overwrite").save(partitioned_output)
The result will be a new folder for each state. To preview the results of the partition, list the first thirty newly
created datasets.
# Print out the first 30 partitions to visualize results
for index, (path, names, filenames) in enumerate(os.walk(partitioned_output)):
    print(os.path.relpath(path, output_dir))
    if index == 30:
        break
Result
partitioned
partitioned\state_loca=Alabama
partitioned\state_loca=Alabama\commodity=Bentonite
partitioned\state_loca=Alabama\commodity=Cement
partitioned\state_loca=Alabama\commodity=Common%20Clay%20and%20Shale
partitioned\state_loca=Alabama\commodity=Crushed%20Stone
partitioned\state_loca=Alabama\commodity=Dimension%20Stone
partitioned\state_loca=Alabama\commodity=Gypsum
partitioned\state_loca=Alabama\commodity=Iron%20Oxide%20Pigments
partitioned\state_loca=Alabama\commodity=Kaolin
partitioned\state_loca=Alabama\commodity=Lime
partitioned\state_loca=Alabama\commodity=Perlite
partitioned\state_loca=Alabama\commodity=Salt
partitioned\state_loca=Alabama\commodity=Sand%20and%20Gravel
partitioned\state_loca=Alabama\commodity=Silicon
partitioned\state_loca=Alabama\commodity=Sulfur
partitioned\state_loca=Alaska
partitioned\state_loca=Alaska\commodity=Crushed%20Stone
partitioned\state_loca=Alaska\commodity=Germanium
partitioned\state_loca=Alaska\commodity=Gold
partitioned\state_loca=Alaska\commodity=Lead
partitioned\state_loca=Alaska\commodity=Sand%20and%20Gravel
partitioned\state_loca=Alaska\commodity=Silver
partitioned\state_loca=Alaska\commodity=Zinc
partitioned\state_loca=Arizona
partitioned\state_loca=Arizona\commodity=Bentonite
partitioned\state_loca=Arizona\commodity=Cement
partitioned\state_loca=Arizona\commodity=Common%20Clay%20and%20Shale
partitioned\state_loca=Arizona\commodity=Copper
partitioned\state_loca=Arizona\commodity=Crushed%20Stone
partitioned\state_loca=Arizona\commodity=Gemstones
What's next?
Learn how to read other data types, or analyze your data with SQL functions and analysis tools: