Dissolve Boundaries Workflow

This topic covers the Spark SQL workflow that replicates the Dissolve Boundaries tool. Dissolve Boundaries merges geometries that intersect or share a field value into a single geometry. This workflow dissolves the USA States data by region, calculates summary statistics for each dissolved region, and converts the dissolved multipart geometries into non-multipart (single-part) geometries.

Set up the input dataset

  1. Import the spatial type and PySpark SQL functions.

    Python
    # Imports
    from geoanalytics.sql import functions as ST
    from pyspark.sql import functions as F
    
  2. Create a DataFrame from the USA States Generalized shapefile data. Display the first 5 rows.

    Python
    # Create a DataFrame from the USA States Generalized shapefile data
    df = spark.read.format("shapefile") \
                        .load(r"C:\git\pysparkTestHarness\pysparkTestHarness\data\shapefile\USA_States_Generalized")
    
    # Display the first 5 rows of the DataFrame
    df.select("STATE_NAME", "SUB_REGION", "POP2010").show(5)
    
    Result
    +----------+----------+--------+
    |STATE_NAME|SUB_REGION| POP2010|
    +----------+----------+--------+
    |    Alaska|   Pacific|  710231|
    |California|   Pacific|37253956|
    |    Hawaii|   Pacific| 1360301|
    |     Idaho|  Mountain| 1567582|
    |    Nevada|  Mountain| 2700551|
    +----------+----------+--------+
    only showing top 5 rows
  3. Visualize the USA States data.

    Python
    # Plot the USA States data
    df_plot = df.st.plot(aspect='equal', figsize=(10, 10))
    
    # Add the plot info
    df_plot.set_title("USA States")
    df_plot.set_xlabel("Longitude")
    df_plot.set_ylabel("Latitude")
    

Dissolve States by region

  1. Use the ST_Aggr_Union Python function to dissolve the States by the SUB_REGION field to create multipart geometries.

    NOTE: The as_text function isn't necessary and is only included in this demo to show that there are multipart geometries.

    Python
    # Dissolve by SUB_REGION and create multipart geometries
    df_dissolved_multipart = df.groupBy("SUB_REGION").agg(ST.aggr_union("geometry").alias("dissolved_geom_multipart")) \
                                                        .withColumn("wkt", ST.as_text("dissolved_geom_multipart"))
    df_dissolved_multipart.show(10)
    
    Result
    +------------------+------------------------+--------------------+
    |        SUB_REGION|dissolved_geom_multipart|                 wkt|
    +------------------+------------------------+--------------------+
    |           Pacific|    {"rings":[[[-1.78...|MULTIPOLYGON (((-...|
    |          Mountain|    {"rings":[[[-1.32...|POLYGON ((-1.3263...|
    |West South Central|    {"rings":[[[-1.17...|MULTIPOLYGON (((-...|
    |West North Central|    {"rings":[[[-1.05...|POLYGON ((-1.0583...|
    |East South Central|    {"rings":[[[-9469...|POLYGON ((-946995...|
    |       New England|    {"rings":[[[-8185...|MULTIPOLYGON (((-...|
    |    South Atlantic|    {"rings":[[[-8993...|MULTIPOLYGON (((-...|
    |East North Central|    {"rings":[[[-9804...|MULTIPOLYGON (((-...|
    |   Middle Atlantic|    {"rings":[[[-8403...|MULTIPOLYGON (((-...|
    +------------------+------------------------+--------------------+
  2. Plot the dissolved multipart geometries.

    Python
    # Plot the dissolved multipart geometries
    df_dissolved_multipart_plot = df_dissolved_multipart.st.plot(cmap_values="SUB_REGION", is_categorical=True, cmap="Paired",
                                                                 aspect='equal', legend=True, legend_kwds={'title':"USA Region"},
                                                                 figsize=(10, 10))
    
    # Add the plot info
    df_dissolved_multipart_plot.set_title("USA States dissolved multipart by region")
    df_dissolved_multipart_plot.set_xlabel("Longitude")
    df_dissolved_multipart_plot.set_ylabel("Latitude")
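
To make the idea of dissolving by a field value concrete, here is a minimal plain-Python sketch. It does not use GeoAnalytics or Spark: each "state" is a hypothetical set of grid cells, and the union of those sets stands in for the geometric union that ST_Aggr_Union performs per group. The feature names, region names, and the `dissolve_by_field` helper are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy data: each "state" is a set of (col, row) grid cells.
states = [
    {"name": "A", "region": "West", "cells": {(0, 0), (0, 1)}},
    {"name": "B", "region": "West", "cells": {(1, 0)}},
    {"name": "C", "region": "West", "cells": {(5, 5)}},  # an "island" that touches neither A nor B
    {"name": "D", "region": "East", "cells": {(9, 0), (9, 1)}},
]


def dissolve_by_field(features, field):
    """Union the cell sets of all features that share the same value of `field`."""
    dissolved = defaultdict(set)
    for feature in features:
        dissolved[feature[field]] |= feature["cells"]
    return dict(dissolved)


regions = dissolve_by_field(states, "region")
for region, cells in sorted(regions.items()):
    print(region, sorted(cells))
```

In this toy data the "West" group ends up containing a cell that is disconnected from the rest, which is the analogue of a region whose dissolved geometry is a MULTIPOLYGON in the result table above.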
    

Calculate summary statistics for the dissolved regions

NOTE: The groupBy function supports several summary statistics, including sum, count, avg, min, and max.

  1. Calculate the total population for each region.

    Python
    # Get the sum of the population for each "SUB_REGION"
    df.groupBy("SUB_REGION").sum("POP2010").show(10)
    
    Result
    +------------------+------------+
    |        SUB_REGION|sum(POP2010)|
    +------------------+------------+
    |           Pacific|    49880102|
    |          Mountain|    22065451|
    |West South Central|    36346202|
    |West North Central|    20505437|
    |East South Central|    18432505|
    |       New England|    14444865|
    |    South Atlantic|    59777037|
    |East North Central|    46421564|
    |   Middle Atlantic|    40872375|
    +------------------+------------+
  2. Calculate the number of States within each region.

    Python
    # Get the count of States within each "SUB_REGION"
    df.groupBy("SUB_REGION").count().select("SUB_REGION", "count").show(10)
    
    Result
    +------------------+-----+
    |        SUB_REGION|count|
    +------------------+-----+
    |           Pacific|    5|
    |          Mountain|    8|
    |West South Central|    4|
    |West North Central|    7|
    |East South Central|    4|
    |       New England|    6|
    |    South Atlantic|    9|
    |East North Central|    5|
    |   Middle Atlantic|    3|
    +------------------+-----+
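
As an illustration of what the grouped sum and count compute, the following plain-Python sketch gathers several statistics per group in one pass. It uses only the five rows displayed in the first Result table, so its totals are smaller than the full-dataset results shown above. The variable names are illustrative and not part of the original workflow.

```python
from collections import defaultdict

# The five rows shown in the first Result table: (STATE_NAME, SUB_REGION, POP2010).
rows = [
    ("Alaska", "Pacific", 710231),
    ("California", "Pacific", 37253956),
    ("Hawaii", "Pacific", 1360301),
    ("Idaho", "Mountain", 1567582),
    ("Nevada", "Mountain", 2700551),
]

# Accumulate the running sum and row count for each SUB_REGION.
stats = defaultdict(lambda: {"sum": 0, "count": 0})
for _state_name, sub_region, pop2010 in rows:
    stats[sub_region]["sum"] += pop2010
    stats[sub_region]["count"] += 1

# Derive the mean from the accumulated sum and count.
summary = {
    sub_region: {**values, "mean": values["sum"] / values["count"]}
    for sub_region, values in stats.items()
}

for sub_region, values in sorted(summary.items()):
    print(sub_region, values)
```

In PySpark, the equivalent single-pass form would likely pass several aggregate expressions to one call, for example `agg(F.sum("POP2010"), F.count("*"), F.avg("POP2010"))` on the grouped DataFrame, rather than running a separate groupBy for each statistic.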

Create dissolved non-multipart geometries

  1. Convert the dissolved multipart geometries into dissolved non-multipart geometries using the ST_Geometries Python function and the PySpark explode function.

    NOTE: The as_text and monotonically_increasing_id functions aren't necessary for this workflow and are only included to show that the geometries are non-multipart both in the DataFrame and the plot below.

    Python
    # Create dissolved non-multipart geometries from the dissolved multipart geometries
    df_dissolved_non_multipart = df_dissolved_multipart.select("SUB_REGION",
        F.explode(ST.geometries("dissolved_geom_multipart")) \
        .alias("dissolved_geom_non_multipart")) \
        .withColumn("wkt", ST.as_text("dissolved_geom_non_multipart")) \
        .withColumn("index", F.monotonically_increasing_id())
    
    df_dissolved_non_multipart.orderBy("SUB_REGION", ascending=True).show(20)
    
    Result
    +------------------+----------------------------+--------------------+-----+
    |        SUB_REGION|dissolved_geom_non_multipart|                 wkt|index|
    +------------------+----------------------------+--------------------+-----+
    |East North Central|        {"rings":[[[-9851...|POLYGON ((-985149...|   79|
    |East North Central|        {"rings":[[[-9804...|POLYGON ((-980408...|   77|
    |East North Central|        {"rings":[[[-9851...|POLYGON ((-985185...|   80|
    |East North Central|        {"rings":[[[-9688...|POLYGON ((-968863...|   78|
    |East North Central|        {"rings":[[[-9334...|POLYGON ((-933466...|   81|
    |East South Central|        {"rings":[[[-9469...|POLYGON ((-946995...|   57|
    |   Middle Atlantic|        {"rings":[[[-8403...|POLYGON ((-840342...|   82|
    |   Middle Atlantic|        {"rings":[[[-8264...|POLYGON ((-826401...|   85|
    |   Middle Atlantic|        {"rings":[[[-8158...|POLYGON ((-815894...|   84|
    |   Middle Atlantic|        {"rings":[[[-8210...|POLYGON ((-821005...|   83|
    |          Mountain|        {"rings":[[[-1.32...|POLYGON ((-1.3263...|   46|
    |       New England|        {"rings":[[[-7933...|POLYGON ((-793364...|   59|
    |       New England|        {"rings":[[[-7859...|POLYGON ((-785963...|   60|
    |       New England|        {"rings":[[[-8185...|POLYGON ((-818536...|   58|
    |       New England|        {"rings":[[[-7795...|POLYGON ((-779589...|   61|
    |       New England|        {"rings":[[[-7612...|POLYGON ((-761290...|   62|
    |           Pacific|        {"rings":[[[-1.78...|POLYGON ((-1.7819...|    0|
    |           Pacific|        {"rings":[[[-1.77...|POLYGON ((-1.7737...|    1|
    |           Pacific|        {"rings":[[[-1.75...|POLYGON ((-1.7552...|    2|
    |           Pacific|        {"rings":[[[-1.74...|POLYGON ((-1.7445...|    3|
    +------------------+----------------------------+--------------------+-----+
    only showing top 20 rows
  2. Plot the dissolved non-multipart geometries.

    Python
    # Plot the dissolved non-multipart geometries
    df_dissolved_non_multipart_plot = df_dissolved_non_multipart.st.plot(cmap_values="index",
                                                                         is_categorical=True,
                                                                         cmap="tab20c",
                                                                         aspect="equal",
                                                                         figsize=(10, 10))
    
    # Add the plot info
    df_dissolved_non_multipart_plot.set_title("USA States dissolved non-multipart by region")
    df_dissolved_non_multipart_plot.set_xlabel("Longitude")
    df_dissolved_non_multipart_plot.set_ylabel("Latitude")
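
The multipart to non-multipart step can be pictured with a small plain-Python sketch. It does not use GeoAnalytics or Spark. A dissolved region is modeled as a hypothetical set of grid cells, `split_parts` (an illustrative helper, loosely analogous to ST_Geometries) breaks it into edge-connected parts, and the list comprehension plays the role of explode by emitting one row per part.

```python
def split_parts(cells):
    """Split a set of (col, row) grid cells into edge-connected parts."""
    remaining = set(cells)
    parts = []
    while remaining:
        # Start a new part from any unvisited cell and flood-fill its neighbours.
        start = remaining.pop()
        part = {start}
        stack = [start]
        while stack:
            col, row = stack.pop()
            for neighbour in ((col + 1, row), (col - 1, row), (col, row + 1), (col, row - 1)):
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    part.add(neighbour)
                    stack.append(neighbour)
        parts.append(part)
    # Sort by each part's smallest cell so the output order is deterministic.
    return sorted(parts, key=min)


# Hypothetical dissolved regions; "West" has two disconnected parts.
dissolved = {
    "West": {(0, 0), (0, 1), (1, 0), (5, 5)},
    "East": {(9, 0), (9, 1)},
}

# One output row per part, similar in spirit to explode over an array column.
exploded = [
    (region, part)
    for region in sorted(dissolved)
    for part in split_parts(dissolved[region])
]

# Attach a unique row index so each part can be styled separately in a plot.
rows = [(index, region, part) for index, (region, part) in enumerate(exploded)]

for row in rows:
    print(row)
```

A region with a single connected part yields one row, which mirrors how the Mountain and East South Central regions appear only once in the result table above.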
