Read and write shapefiles

Learn how to read from, manage, and write to shapefiles. A shapefile data source behaves like other file formats within Spark (parquet, ORC, etc.), so you can use shapefiles as both an input and an output for your DataFrames.

In this tutorial you will read from shapefiles, write results to new shapefiles, and partition data logically.

Read shapefiles

Prepare your input shapefile

  1. Download the sample shapefile from ArcGIS Online.

Set up the workspace

  1. Import the required modules.

    # Import the required modules
    import os, tempfile
    import geoanalytics
    from geoanalytics.sql import functions as ST
    
    geoanalytics.auth(username="user1", password="p@ssword")
  2. Set the output directory to write your formatted data to.

    # Set the workspace
    output_dir = os.path.normpath(r"c:/shapefile_demo")

Read from your shapefile and display columns of interest

  1. Read the shapefile into a DataFrame. Note that you specify the folder containing the shapefile, not the full path to the .shp file. A folder can contain multiple shapefiles with the same schema, which are read together as a single DataFrame.

    # Read the shapefile into a DataFrame
    shapefileDF=spark.read.format("shapefile").load(r"c:\shapefile_demo\Mineplants")
  2. Visualize a subset of the columns, including the geometry column, by showing a sample of five rows from the input.

    # Sample your DataFrame
    shapefileDF.select("commodity","COMPANY_NA","geometry").show(5, truncate=False)
    Result
    +---------+-------------------------------+------------------------+
    |commodity|COMPANY_NA                     |geometry                |
    +---------+-------------------------------+------------------------+
    |Aluminum |Alcoa Inc                      |{"x":-87.336,"y":37.915}|
    |Aluminum |Century Aluminum Co            |{"x":-86.786,"y":37.942}|
    |Aluminum |Alcan Inc                      |{"x":-87.5,"y":37.65}   |
    |Aluminum |Ormet Corp                     |{"x":-90.923,"y":30.138}|
    |Aluminum |Kaiser Aluminum & Chemical Corp|{"x":-90.755,"y":30.049}|
    +---------+-------------------------------+------------------------+
    only showing top 5 rows
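
Because the reader takes a folder rather than a .shp path, it can help to check what the folder contains before loading it. The following is a minimal sketch that uses only the Python standard library. The helper name `list_shapefiles` is hypothetical and is not part of GeoAnalytics Engine; it assumes the general shapefile convention that each dataset has .shp, .shx, and .dbf files sharing a base name.

```python
import os

# Hypothetical helper (not part of GeoAnalytics Engine): summarize the
# shapefiles stored in a folder before passing that folder to spark.read.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")

def list_shapefiles(folder):
    """Return {base_name: [missing extensions]} for each .shp file in folder."""
    # Lower-case the file names so the check is case-insensitive
    names = {name.lower() for name in os.listdir(folder)}
    report = {}
    for name in sorted(names):
        base, ext = os.path.splitext(name)
        if ext == ".shp":
            # Record which required sidecar files are absent for this dataset
            report[base] = [e for e in REQUIRED_EXTENSIONS if base + e not in names]
    return report
```

For example, calling `list_shapefiles(r"c:\shapefile_demo\Mineplants")` would return one entry per shapefile in the folder, with an empty list for every dataset that has all three files.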

Write shapefiles

Write a DataFrame to a shapefile

Use a defined dataset to create a DataFrame and write it to a shapefile.

  1. Define your own dataset.

    # Define a point dataset
    myPoints = [(0, -4655711.2806, 222503.076),
                (1, -4570473.292, 322503.076),
                (2, -4830838.089, 146545.398),
                (3, -4570771.608, 116617.112),
                (4, -4682228.671, 173377.654)]
    
    fields = ["id", "latitude", "longitude"]
  2. Create a DataFrame from your dataset definition.

    # Create a DataFrame
    df = spark.createDataFrame(myPoints, fields)
    
    # Enable geometry
    df = df.withColumn("geometry",
                       ST.srid(ST.point("longitude", "latitude"), 6329)) \
        .st.set_geometry_field("geometry")
  3. Write your DataFrame to a shapefile.

    # Write to a single shapefile - update the path to a location accessible to you
    df.coalesce(1).write.format("shapefile").mode("overwrite").save(r"C:\shapefile_demo\output_shapefile")

Merge shapefiles with different schemas

Use schema merging when a collection of datasets contains varying schemas. For example, suppose data has been collected over time, with a new dataset created each month and a new column introduced for that month. Schema merging resolves these differences by combining the datasets into a single schema.

  1. If you haven't already downloaded the sample shapefile, follow the steps to prepare your input shapefile. Then read the shapefile into a DataFrame.

    # Read the shapefile into a DataFrame
    shapefileDF=spark.read.format("shapefile").load(r"c:\shapefile_demo\Mineplants")
  2. Set the output location for the shapefiles. These are the shapefiles that will have their schemas merged to form a single DataFrame.

    # Set the output path to store your shapefiles
    output_shapefiles = os.path.join(output_dir, "merged_shapefile")
  3. Create three subset shapefiles. By default, a shapefile is written for each partition of the DataFrame. Calling .coalesce(1) reduces each query result to a single partition, so each result is written to a single shapefile. Each shapefile will have three columns with names in common (geometry, id, and commodity) and one column with a unique name.

    • Rows with id values between 1 and 5 will have a column named site_name.
    # Create the first subset shapefile
    shapefileDF.where("id <= 5").select("id", "commodity","site_name","geometry") \
        .coalesce(1).write.format("shapefile").mode("overwrite").save(output_shapefiles)
    • Rows with id values between 6 and 10 will have a column named company_na.
    # Create the second subset shapefile
    shapefileDF.where("id between 6 and 10").select("id","commodity", "company_na",
        "geometry").coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
    • Rows with id values between 11 and 15 will have a column named state_loca.
    # Create the third subset shapefile
    shapefileDF.where("id between 11 and 15").select("id", "commodity", "state_loca",
        "geometry").coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
  4. Use schema merging to create a DataFrame with a single, combined schema.

    # Merge schemas for the three subset shapefiles
    spark.read.format("shapefile").option("mergeSchemas","true").load(output_shapefiles) \
        .orderBy("id").show()
    Result
    +---+---------+--------------------+--------------------+--------------------+--------------+
    | id|commodity|          company_na|            geometry|           site_name|    state_loca|
    +---+---------+--------------------+--------------------+--------------------+--------------+
    |  1| Aluminum|                null|{"x":-87.336,"y":...|Evansville (Warri...|          null|
    |  2| Aluminum|                null|{"x":-86.786,"y":...|  Hawesville Smelter|          null|
    |  3| Aluminum|                null|{"x":-87.5,"y":37...|      Sebree Smelter|          null|
    |  4| Aluminum|                null|{"x":-90.923,"y":...|   Burnside Refinery|          null|
    |  5| Aluminum|                null|{"x":-90.755,"y":...|   Gramercy Refinery|          null|
    |  6| Aluminum|           Alcoa Inc|{"x":-77.469,"y":...|                null|          null|
    |  7| Aluminum|Noranda Aluminum Inc|{"x":-89.564,"y":...|                null|          null|
    |  8| Aluminum|Columbia Falls Al...|{"x":-114.139,"y"...|                null|          null|
    |  9| Aluminum|           Alcoa Inc|{"x":-74.75,"y":4...|                null|          null|
    | 10| Aluminum|           Alcoa Inc|{"x":-74.881,"y":...|                null|          null|
    | 11| Aluminum|                null|{"x":-80.873,"y":...|                null|          Ohio|
    | 12| Aluminum|                null|{"x":-80.05,"y":3...|                null|South Carolina|
    | 13| Aluminum|                null|{"x":-83.968,"y":...|                null|     Tennessee|
    | 14| Aluminum|                null|{"x":-96.554,"y":...|                null|         Texas|
    | 15| Aluminum|                null|{"x":-97.076,"y":...|                null|         Texas|
    +---+---------+--------------------+--------------------+--------------------+--------------+
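
The null values in the result come from columns that exist in only some of the input shapefiles. The following plain-Python sketch illustrates that row-level effect. It is a conceptual illustration only, not how GeoAnalytics Engine implements schema merging, and the function name `merge_schemas` is hypothetical.

```python
# Conceptual illustration only: this is not the GeoAnalytics Engine
# implementation, just the row-level effect visible in the result above.
def merge_schemas(datasets):
    """Union the column names of several row lists and fill gaps with None."""
    # Collect every column name that appears in any dataset (sorted for a stable order)
    columns = sorted({col for rows in datasets for row in rows for col in row})
    # Rebuild each row against the combined schema; absent columns become None
    merged = [
        {col: row.get(col) for col in columns}
        for rows in datasets
        for row in rows
    ]
    return columns, merged

# Two small datasets that share "id" and "commodity" but differ in one column
first = [{"id": 2, "commodity": "Aluminum", "site_name": "Hawesville Smelter"}]
second = [{"id": 6, "commodity": "Aluminum", "company_na": "Alcoa Inc"}]

columns, rows = merge_schemas([first, second])
print(columns)
for row in rows:
    print(row)
```

Each output row carries every column from the combined schema, with None standing in where the source dataset had no such column.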

Partition your shapefile into logical groups

Datasets can be partitioned by values within one or more columns. Each unique value in a column becomes a directory with the name <column_name>=<value>. In this example, you will logically separate the data based on column values for spatial regions.

If you write data without partitioning or coalescing it, one shapefile is written for each partition of the DataFrame by default. Partitioning your data logically enables you to read, write, and store data in meaningful storage structures.

  1. Specify the location to output your newly partitioned data.

    # Set the output path to store your partitioned datasets
    partitioned_output = os.path.join(output_dir, "partitioned")
  2. Partition your data based on the values for the columns "state_loca" and "commodity".

    # Partition your data by state and resource type
    shapefileDF.write.format("shapefile").partitionBy("state_loca",
        "commodity").mode("overwrite").save(partitioned_output)
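  3. The result will be a new folder for each state, each containing a subfolder for each commodity found in that state. To preview the results of the partition, list the output folder followed by the first thirty folders created beneath it.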

    # Print the output folder and the first 30 partition folders to visualize results
    for index, (path, names, filenames) in enumerate(os.walk(partitioned_output)):
        print(os.path.relpath(path, output_dir))
        if index == 30:
            break
    Result
    partitioned
    partitioned\state_loca=Alabama
    partitioned\state_loca=Alabama\commodity=Bentonite
    partitioned\state_loca=Alabama\commodity=Cement
    partitioned\state_loca=Alabama\commodity=Common%20Clay%20and%20Shale
    partitioned\state_loca=Alabama\commodity=Crushed%20Stone
    partitioned\state_loca=Alabama\commodity=Dimension%20Stone
    partitioned\state_loca=Alabama\commodity=Gypsum
    partitioned\state_loca=Alabama\commodity=Iron%20Oxide%20Pigments
    partitioned\state_loca=Alabama\commodity=Kaolin
    partitioned\state_loca=Alabama\commodity=Lime
    partitioned\state_loca=Alabama\commodity=Perlite
    partitioned\state_loca=Alabama\commodity=Salt
    partitioned\state_loca=Alabama\commodity=Sand%20and%20Gravel
    partitioned\state_loca=Alabama\commodity=Silicon
    partitioned\state_loca=Alabama\commodity=Sulfur
    partitioned\state_loca=Alaska
    partitioned\state_loca=Alaska\commodity=Crushed%20Stone
    partitioned\state_loca=Alaska\commodity=Germanium
    partitioned\state_loca=Alaska\commodity=Gold
    partitioned\state_loca=Alaska\commodity=Lead
    partitioned\state_loca=Alaska\commodity=Sand%20and%20Gravel
    partitioned\state_loca=Alaska\commodity=Silver
    partitioned\state_loca=Alaska\commodity=Zinc
    partitioned\state_loca=Arizona
    partitioned\state_loca=Arizona\commodity=Bentonite
    partitioned\state_loca=Arizona\commodity=Cement
    partitioned\state_loca=Arizona\commodity=Common%20Clay%20and%20Shale
    partitioned\state_loca=Arizona\commodity=Copper
    partitioned\state_loca=Arizona\commodity=Crushed%20Stone
    partitioned\state_loca=Arizona\commodity=Gemstones
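
The folder names in this listing follow the <column_name>=<value> pattern, and spaces in the values appear percent-encoded as %20. As a minimal sketch using only the Python standard library, the hypothetical helper `parse_partition_path` below (not part of GeoAnalytics Engine or Spark) recovers the column values from one of these relative paths.

```python
from urllib.parse import unquote

# Hypothetical helper: turn a partition folder path into column values.
def parse_partition_path(path):
    """Return {column: value} for each <column_name>=<value> folder in path."""
    values = {}
    # Accept both Windows and POSIX path separators
    for part in path.replace("\\", "/").split("/"):
        if "=" in part:
            column, _, value = part.partition("=")
            values[column] = unquote(value)  # %20 becomes a space
    return values

print(parse_partition_path(r"partitioned\state_loca=Alabama\commodity=Crushed%20Stone"))
# → {'state_loca': 'Alabama', 'commodity': 'Crushed Stone'}
```

Folders without an equals sign, such as the top-level partitioned folder, contribute no entries.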

What's next?

Learn how to read in other data types or analyze your data through SQL functions and analysis tools.
