Read and write shapefiles

Learn how to read from, manage, and write to shapefiles. A shapefile data source behaves like other file formats within Spark (parquet, ORC, etc.). You can read data from shapefiles and write data to them.

In this tutorial you will read from shapefiles, write results to new shapefiles, and partition data logically.

Read shapefiles

Prepare your input shapefile

  1. Download the sample shapefile from ArcGIS Online.

  2. Store it in a local folder on your machine, for example c:\data\shapefile_demo.
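A shapefile is a set of sibling files that share one base name: at minimum the .shp (geometry), .shx (index), and .dbf (attributes) files, usually with a .prj file describing the coordinate system. After storing the download, it can be worth confirming that the folder holds complete shapefiles. The sketch below reports any shapefile that is missing one of the three mandatory parts; the helper name find_incomplete_shapefiles is hypothetical, and the folder is whatever location you chose above.

```python
from pathlib import Path

# The three mandatory parts of a shapefile (a .prj file is optional but common).
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")

def find_incomplete_shapefiles(folder):
    """Return {lowercased base name: [missing suffixes]} for shapefiles in `folder`."""
    # Index the files by lowercase name so the check is case-insensitive.
    present = {p.name.lower() for p in Path(folder).iterdir() if p.is_file()}
    incomplete = {}
    for name in sorted(present):
        if not name.endswith(".shp"):
            continue
        base = name[: -len(".shp")]
        missing = [s for s in REQUIRED_SUFFIXES if base + s not in present]
        if missing:
            incomplete[base] = missing
    return incomplete
```

An empty dictionary means every .shp file in the folder has its .shx and .dbf companions.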

Set up the workspace

  1. Import the required modules.

    Python
    # Import the required modules
    import os, tempfile
    
  2. Set the output directory to write your formatted data to.

    Python
    # Set the workspace
    output_dir = os.path.normpath(r"C:/data/shapefile_demo")
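The tempfile module imported in step 1 is not used elsewhere in this tutorial. If you would rather not write to a fixed folder, one option is to let tempfile create a scratch workspace. This is a sketch of an alternative, not part of the original workflow, and the prefix string is an arbitrary choice.

```python
import os
import tempfile

# Create a uniquely named scratch folder under the system temp directory.
output_dir = tempfile.mkdtemp(prefix="shapefile_demo_")

# Paths for later outputs can then be built relative to it.
example_output = os.path.join(output_dir, "output_shapefile")
print(output_dir)
```

Folders created with tempfile.mkdtemp are not deleted automatically, so remove the folder yourself when you are done with it.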
    

Read from your shapefile and display columns of interest

  1. Read the shapefile into a DataFrame. Note that the folder containing the shapefile is specified, and not the full path to the .shp file. A folder can contain multiple shapefiles with the same schema and be read as a single DataFrame.

    Python
    # Read the shapefile into a DataFrame
    shapefileDF=spark.read.format("shapefile").load(r"c:\data\shapefile_demo\Mineplants")
    
  2. Visualize a subset of the columns, including the geometry column, by showing a sample of five rows from the input.

    Python
    # Sample your DataFrame
    shapefileDF.select("commodity","COMPANY_NA","geometry").show(5, truncate=False)
    
    Result
    +---------+-------------------------------+------------------------+
    |commodity|COMPANY_NA                     |geometry                |
    +---------+-------------------------------+------------------------+
    |Aluminum |Alcoa Inc                      |{"x":-87.336,"y":37.915}|
    |Aluminum |Century Aluminum Co            |{"x":-86.786,"y":37.942}|
    |Aluminum |Alcan Inc                      |{"x":-87.5,"y":37.65}   |
    |Aluminum |Ormet Corp                     |{"x":-90.923,"y":30.138}|
    |Aluminum |Kaiser Aluminum & Chemical Corp|{"x":-90.755,"y":30.049}|
    +---------+-------------------------------+------------------------+
    only showing top 5 rows
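
In the output above, each point geometry is displayed as a small JSON object with x and y keys. If you copy such a value out of the console, it can be parsed with Python's standard json module, as in this sketch. It works on the displayed text only and says nothing about how the geometry is stored internally; the sample string is taken from the first row of the result.

```python
import json

# Text of the geometry column as displayed for the first row above.
displayed = '{"x":-87.336,"y":37.915}'

point = json.loads(displayed)
x, y = point["x"], point["y"]
print(x, y)
```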

Write shapefiles

Write a DataFrame to a shapefile

Use a defined dataset to create a DataFrame and write it to a shapefile.

  1. Define your own dataset.

    Python
    # Define a point dataset
    myPoints = [(0, -4655711.2806, 222503.076),
                (1, -4570473.292, 322503.076),
                (2, -4830838.089, 146545.398),
                (3, -4570771.608, 116617.112),
                (4, -4682228.671, 173377.654)]
    
    fields = ["id", "latitude", "longitude"]
    
  2. Create a DataFrame from your dataset definition.

    Python
    # Imports
    from geoanalytics.sql import functions as ST
    
    # Create a DataFrame
    df = spark.createDataFrame(myPoints, fields)
    
    # Enable geometry
    df = df.withColumn("geometry",
                       ST.srid(ST.point("longitude", "latitude"), 6329)) \
        .st.set_geometry_field("geometry")
    
  3. Write your DataFrame to a shapefile.

    Python
    # Write to a single shapefile - update the path to a location accessible to you
    df.coalesce(1).write.format("shapefile").mode("overwrite").save(r"C:\data\output_shapefile")
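
Shapefile attributes are stored in a dBASE (.dbf) table, which limits field names to 10 characters; the 10-character names in the sample data, such as COMPANY_NA and state_loca, are consistent with that limit. Before writing, it may help to flag column names that exceed it. The sketch below works on a plain list of names (for a DataFrame, df.columns); the helper name is hypothetical, and how the writer treats over-long names is not covered here.

```python
# dBASE field names, used for shapefile attributes, hold at most 10 characters.
DBF_FIELD_NAME_LIMIT = 10

def overlong_field_names(columns, geometry_column="geometry"):
    """Return attribute column names longer than the dBASE limit."""
    return [
        name for name in columns
        # The geometry is stored in the .shp file, not as a .dbf field.
        if name != geometry_column and len(name) > DBF_FIELD_NAME_LIMIT
    ]

print(overlong_field_names(["id", "latitude", "longitude", "geometry"]))
print(overlong_field_names(["id", "company_name", "geometry"]))
```

The first call returns an empty list because every name fits; the second flags company_name, which is 12 characters long.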

Merge shapefiles with different schemas

Use schema merging when a collection of datasets contains varying schemas. For example, suppose data were collected over time, and each month a new dataset was created with a new column named for that month. Schema merging resolves these differences so the datasets can be read as one DataFrame.

  1. If you haven't already downloaded the sample shapefile, follow the steps to prepare your input shapefile. Then read the shapefile into a DataFrame.

    Python
    # Read the shapefile into a DataFrame
    shapefileDF=spark.read.format("shapefile").load(r"c:\data\shapefile_demo\Mineplants")
    
  2. Set the output location for the shapefiles. These are the shapefiles that will have their schemas merged to form a single DataFrame.

    Python
    # Set the output path to store your shapefiles
    output_shapefiles = os.path.join(output_dir, "merged_shapefile")
    
  3. Create three subset shapefiles. By default, one shapefile is written for each partition of the DataFrame. Calling .coalesce(1) reduces the DataFrame to a single partition, so each query result is written to a single shapefile. Each shapefile will have three columns with names in common (geometry, id, and commodity) and one column with a unique name.

    • Rows with id values between 1 and 5 will have a column named site_name.

      Python
      # Create the first subset shapefile
      shapefileDF.where("id <= 5").select("id", "commodity","site_name","geometry") \
          .coalesce(1).write.format("shapefile").mode("overwrite").save(output_shapefiles)
      
    • Rows with id values between 6 and 10 will have a column named company_na.

      Python
      # Create the second subset shapefile
      shapefileDF.where("id between 6 and 10").select("id","commodity", "company_na",
          "geometry").coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
      
    • Rows with id values between 11 and 15 will have a column named state_loca.

      Python
      # Create the third subset shapefile
      shapefileDF.where("id between 11 and 15").select("id", "commodity", "state_loca",
          "geometry").coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
      
  4. Use schema merging to create a DataFrame with a single, combined schema.

    Python
    # Merge schemas for the three subset shapefiles
    spark.read.format("shapefile").option("mergeSchemas","true").load(output_shapefiles) \
        .orderBy("id").show()
    
    Result
    +---+---------+--------------------+--------------------+--------------------+--------------+
    | id|commodity|          company_na|            geometry|           site_name|    state_loca|
    +---+---------+--------------------+--------------------+--------------------+--------------+
    |  1| Aluminum|                null|{"x":-87.336,"y":...|Evansville (Warri...|          null|
    |  2| Aluminum|                null|{"x":-86.786,"y":...|  Hawesville Smelter|          null|
    |  3| Aluminum|                null|{"x":-87.5,"y":37...|      Sebree Smelter|          null|
    |  4| Aluminum|                null|{"x":-90.923,"y":...|   Burnside Refinery|          null|
    |  5| Aluminum|                null|{"x":-90.755,"y":...|   Gramercy Refinery|          null|
    |  6| Aluminum|           Alcoa Inc|{"x":-77.469,"y":...|                null|          null|
    |  7| Aluminum|Noranda Aluminum Inc|{"x":-89.564,"y":...|                null|          null|
    |  8| Aluminum|Columbia Falls Al...|{"x":-114.139,"y"...|                null|          null|
    |  9| Aluminum|           Alcoa Inc|{"x":-74.75,"y":4...|                null|          null|
    | 10| Aluminum|           Alcoa Inc|{"x":-74.881,"y":...|                null|          null|
    | 11| Aluminum|                null|{"x":-80.873,"y":...|                null|          Ohio|
    | 12| Aluminum|                null|{"x":-80.05,"y":3...|                null|South Carolina|
    | 13| Aluminum|                null|{"x":-83.968,"y":...|                null|     Tennessee|
    | 14| Aluminum|                null|{"x":-96.554,"y":...|                null|         Texas|
    | 15| Aluminum|                null|{"x":-97.076,"y":...|                null|         Texas|
    +---+---------+--------------------+--------------------+--------------------+--------------+
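
Conceptually, the merged schema is the union of the columns found across the input files, and rows from a file that lacks a column get null for it, as the result above shows. The plain-Python sketch below mimics that behavior on small dictionaries. It is an illustration only, not how the shapefile reader is implemented; the function name merge_schemas is hypothetical, and the sample values are taken from rows 2, 6, and 11 of the result table.

```python
def merge_schemas(*datasets):
    """Union the keys of several lists of row dicts, filling gaps with None."""
    columns = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    merged = [
        {name: row.get(name) for name in columns}
        for rows in datasets
        for row in rows
    ]
    return columns, merged

# One row from each subset, with values taken from the result table above
# (geometry omitted for brevity).
subset_1 = [{"id": 2, "commodity": "Aluminum", "site_name": "Hawesville Smelter"}]
subset_2 = [{"id": 6, "commodity": "Aluminum", "company_na": "Alcoa Inc"}]
subset_3 = [{"id": 11, "commodity": "Aluminum", "state_loca": "Ohio"}]

columns, rows = merge_schemas(subset_1, subset_2, subset_3)
print(columns)
for row in rows:
    print(row)
```

Each output row carries all five columns, with None standing in where the source subset had no such column.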

Partition your shapefile into logical groups

Datasets can be partitioned by values within one or more columns. Each unique value in a column becomes a directory with the name <column_name>=<value>. In this example, you will logically separate the data based on column values for spatial regions.

If you neither partition nor coalesce when writing data, a shapefile is written for each partition of the DataFrame by default, which can produce many small files. Partitioning your data logically enables you to read, write, and store data in meaningful storage structures.
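
The <column_name>=<value> naming can be sketched in plain Python. In the directory listing produced later in this tutorial, spaces in values appear as %20; the sketch approximates that with urllib.parse.quote, which is an assumption, since Spark's own escaping rules decide exactly which characters are encoded. The helper name partition_dir is hypothetical.

```python
from urllib.parse import quote

def partition_dir(**column_values):
    """Build a relative partition path such as 'state_loca=Alabama/commodity=Gypsum'."""
    parts = [
        # safe="" also encodes "/" so a value can never add a path level.
        f"{column}={quote(str(value), safe='')}"
        for column, value in column_values.items()
    ]
    return "/".join(parts)

print(partition_dir(state_loca="Alabama", commodity="Crushed Stone"))
```

The order of the keyword arguments sets the nesting order, mirroring the order of the columns passed to partitionBy.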

  1. Specify the location to output your newly partitioned data.

    Python
    # Set the output path to store your partitioned datasets
    partitioned_output = os.path.join(output_dir, "partitioned")
    
  2. Partition your data based on the values for the columns "state_loca" and "commodity".

    Python
    # Partition your data by state and resource type
    shapefileDF.write.format("shapefile").partitionBy("state_loca",
        "commodity").mode("overwrite").save(partitioned_output)
    
  3. The result is a new folder for each state, with a subfolder for each commodity found in that state. To preview the partitioning, list the output folder and the first thirty partition directories created under it.

    Python
    # Print the output folder and its first 30 partition directories
    for index, (path, names, filenames) in enumerate(os.walk(partitioned_output)):
        print(os.path.relpath(path, output_dir))
        if index == 30:
            break
    
    Result
    partitioned
    partitioned\state_loca=Alabama
    partitioned\state_loca=Alabama\commodity=Bentonite
    partitioned\state_loca=Alabama\commodity=Cement
    partitioned\state_loca=Alabama\commodity=Common%20Clay%20and%20Shale
    partitioned\state_loca=Alabama\commodity=Crushed%20Stone
    partitioned\state_loca=Alabama\commodity=Dimension%20Stone
    partitioned\state_loca=Alabama\commodity=Gypsum
    partitioned\state_loca=Alabama\commodity=Iron%20Oxide%20Pigments
    partitioned\state_loca=Alabama\commodity=Kaolin
    partitioned\state_loca=Alabama\commodity=Lime
    partitioned\state_loca=Alabama\commodity=Perlite
    partitioned\state_loca=Alabama\commodity=Salt
    partitioned\state_loca=Alabama\commodity=Sand%20and%20Gravel
    partitioned\state_loca=Alabama\commodity=Silicon
    partitioned\state_loca=Alabama\commodity=Sulfur
    partitioned\state_loca=Alaska
    partitioned\state_loca=Alaska\commodity=Crushed%20Stone
    partitioned\state_loca=Alaska\commodity=Germanium
    partitioned\state_loca=Alaska\commodity=Gold
    partitioned\state_loca=Alaska\commodity=Lead
    partitioned\state_loca=Alaska\commodity=Sand%20and%20Gravel
    partitioned\state_loca=Alaska\commodity=Silver
    partitioned\state_loca=Alaska\commodity=Zinc
    partitioned\state_loca=Arizona
    partitioned\state_loca=Arizona\commodity=Bentonite
    partitioned\state_loca=Arizona\commodity=Cement
    partitioned\state_loca=Arizona\commodity=Common%20Clay%20and%20Shale
    partitioned\state_loca=Arizona\commodity=Copper
    partitioned\state_loca=Arizona\commodity=Crushed%20Stone
    partitioned\state_loca=Arizona\commodity=Gemstones
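
Going the other way, a partition path from the listing above can be turned back into column values by splitting on the path separator and on =, then decoding the percent-encoded characters. This is a sketch for inspecting folder names by hand; the helper name parse_partition_path is hypothetical.

```python
from urllib.parse import unquote

def parse_partition_path(path):
    """Return {column: value} for each 'column=value' segment of a path."""
    values = {}
    # Accept both Windows and POSIX separators.
    for segment in path.replace("\\", "/").split("/"):
        if "=" in segment:
            column, _, raw = segment.partition("=")
            values[column] = unquote(raw)
    return values

print(parse_partition_path(r"partitioned\state_loca=Alabama\commodity=Common%20Clay%20and%20Shale"))
```

Segments without an = sign, such as the top-level partitioned folder, are ignored.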

What's next?

Learn how to read other data types or analyze your data with SQL functions and analysis tools.
