Learn how to read from, manage, and write to shapefiles. A shapefile data source behaves like other file formats within Spark (Parquet, ORC, and so on): you can read data from it and write data to it.
In this tutorial you will read from shapefiles, write results to new shapefiles, and partition data logically.
Read shapefiles
Download the sample shapefile from ArcGIS Online. Store it in a local folder on your machine, for example C:\data\shapefile_demo.
Set up the workspace
Import the required modules.
# Import the required modules
import os, tempfile
import geoanalytics
from geoanalytics.sql import functions as ST
geoanalytics.auth(username="user1", password="p@ssword")
Set the output directory to write your formatted data to.
# Set the workspace
output_dir = os.path.normpath(r"C:/data/shapefile_demo")
Read from your shapefile and display columns of interest
Read the shapefile into a DataFrame. Note that you specify the folder containing the shapefile, not the full path to the .shp file. A folder can contain multiple shapefiles with the same schema and be read as a single DataFrame.
# Read the shapefile into a DataFrame
shapefileDF = spark.read.format("shapefile").load(r"c:\data\shapefile_demo\Mineplants")
Visualize a subset of the columns, including the geometry column, by
showing a sample of five rows from the input.
# Sample your DataFrame
shapefileDF.select("commodity", "COMPANY_NA", "geometry").show(5, truncate=False)
Result
+---------+-------------------------------+------------------------+
|commodity|COMPANY_NA |geometry |
+---------+-------------------------------+------------------------+
|Aluminum |Alcoa Inc |{"x":-87.336,"y":37.915}|
|Aluminum |Century Aluminum Co |{"x":-86.786,"y":37.942}|
|Aluminum |Alcan Inc |{"x":-87.5,"y":37.65} |
|Aluminum |Ormet Corp |{"x":-90.923,"y":30.138}|
|Aluminum |Kaiser Aluminum & Chemical Corp|{"x":-90.755,"y":30.049}|
+---------+-------------------------------+------------------------+
only showing top 5 rows
Write shapefiles
Write a DataFrame to a shapefile
Use a defined dataset to create a DataFrame and write it to a shapefile. Define your own dataset.
# Define a point dataset
myPoints = [(0, -4655711.2806, 222503.076),
            (1, -4570473.292, 322503.076),
            (2, -4830838.089, 146545.398),
            (3, -4570771.608, 116617.112),
            (4, -4682228.671, 173377.654)]
fields = ["id", "latitude", "longitude"]
Create a DataFrame from your dataset definition.
# Create a DataFrame
df = spark.createDataFrame(myPoints, fields)
# Enable geometry
df = df.withColumn("geometry",
                   ST.srid(ST.point("longitude", "latitude"), 6329)) \
       .st.set_geometry_field("geometry")
Write your DataFrame to a shapefile.
# Write to a single shapefile - update the path to a location accessible to you
df.coalesce(1).write.format("shapefile").mode("overwrite").save(r"C:\data\output_shapefile")
Merge shapefiles with different schemas
Use schema merging when a collection of datasets contains varying schemas. For example, suppose data have been collected over time: each month a new dataset was created that introduced a new column named for that month. Use schema merging to resolve the schema differences.
If you haven't already downloaded the sample shapefile, follow the steps above to prepare your input shapefile.
# Read the shapefile into a DataFrame
shapefileDF = spark.read.format("shapefile").load(r"c:\data\shapefile_demo\Mineplants")
Set the output location for the shapefiles.
These are the shapefiles that will have their schemas merged to form a single DataFrame.
# Set the output path to store your shapefiles
output_shapefiles = os.path.join(output_dir, "merged_shapefile")
Create three subset shapefiles. Specify a value of 1 for .coalesce() to write each query result to a single shapefile. Coalescing reduces the number of partitions, resulting in fewer output shapefiles; by default, one shapefile is written per partition.
Each shapefile will have three columns with names in common (geometry, id, and commodity), and one column with a unique name.
Rows with id values between 1 and 5 will have a column named site_name.
# Create the first subset shapefile
shapefileDF.where("id <= 5").select("id", "commodity", "site_name", "geometry") \
    .coalesce(1).write.format("shapefile").mode("overwrite").save(output_shapefiles)
Rows with id values between 6 and 10 will have a column named company_na.
# Create the second subset shapefile
shapefileDF.where("id between 6 and 10").select("id", "commodity", "company_na", "geometry") \
    .coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
Rows with id values between 11 and 15 will have a column named state_loca.
# Create the third subset shapefile
shapefileDF.where("id between 11 and 15").select("id", "commodity", "state_loca", "geometry") \
    .coalesce(1).write.format("shapefile").mode("append").save(output_shapefiles)
Use schema merging to create a DataFrame with a single, combined schema.
# Merge schemas for the three subset shapefiles
spark.read.format("shapefile").option("mergeSchemas", "true").load(output_shapefiles) \
    .orderBy("id").show()
Result
+---+---------+--------------------+--------------------+--------------------+--------------+
| id|commodity| company_na| geometry| site_name| state_loca|
+---+---------+--------------------+--------------------+--------------------+--------------+
| 1| Aluminum| null|{"x":-87.336,"y":...|Evansville (Warri...| null|
| 2| Aluminum| null|{"x":-86.786,"y":...| Hawesville Smelter| null|
| 3| Aluminum| null|{"x":-87.5,"y":37...| Sebree Smelter| null|
| 4| Aluminum| null|{"x":-90.923,"y":...| Burnside Refinery| null|
| 5| Aluminum| null|{"x":-90.755,"y":...| Gramercy Refinery| null|
| 6| Aluminum| Alcoa Inc|{"x":-77.469,"y":...| null| null|
| 7| Aluminum|Noranda Aluminum Inc|{"x":-89.564,"y":...| null| null|
| 8| Aluminum|Columbia Falls Al...|{"x":-114.139,"y"...| null| null|
| 9| Aluminum| Alcoa Inc|{"x":-74.75,"y":4...| null| null|
| 10| Aluminum| Alcoa Inc|{"x":-74.881,"y":...| null| null|
| 11| Aluminum| null|{"x":-80.873,"y":...| null| Ohio|
| 12| Aluminum| null|{"x":-80.05,"y":3...| null|South Carolina|
| 13| Aluminum| null|{"x":-83.968,"y":...| null| Tennessee|
| 14| Aluminum| null|{"x":-96.554,"y":...| null| Texas|
| 15| Aluminum| null|{"x":-97.076,"y":...| null| Texas|
+---+---------+--------------------+--------------------+--------------------+--------------+
Partition your shapefile into logical groups
Datasets can be partitioned by values within one or more columns. Each unique value in a column becomes a directory with the name <column_name>=<value>. In this example, you will logically separate the data based on column values for spatial regions.
Without partitioning or coalescing when writing data, you will end up with a shapefile for each partition by default. Partitioning your data logically enables you to read, write, and store data in meaningful storage structures.
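The <column_name>=<value> directory names follow the Hive-style partition layout, and special characters such as spaces are percent-encoded (which is why the partition listing contains %20). A rough sketch of how such a path is built, using Python's urllib as an illustration, not necessarily the engine's exact implementation:

```python
from urllib.parse import quote

def partition_path(**columns):
    # Build a Hive-style partition path, percent-encoding each value
    return "/".join(f"{name}={quote(str(value))}" for name, value in columns.items())

print(partition_path(state_loca="Alabama", commodity="Sand and Gravel"))
# state_loca=Alabama/commodity=Sand%20and%20Gravel
```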
Specify the location to output your newly partitioned data.
# Set the output path to store your partitioned datasets
partitioned_output = os.path.join(output_dir, "partitioned")
Partition your data based on the values for the columns
"state_loca" and "commodity".
# Partition your data by state and resource type
shapefileDF.write.format("shapefile").partitionBy("state_loca", "commodity") \
    .mode("overwrite").save(partitioned_output)
The result will be a new folder for each state. To preview the results of the partition, list the first thirty newly
created datasets.
# Print out the first 30 partitions to visualize results
for index, (path, names, filenames) in enumerate(os.walk(partitioned_output)):
    print(os.path.relpath(path, output_dir))
    if index == 30:
        break
Result
partitioned
partitioned\state_loca=Alabama
partitioned\state_loca=Alabama\commodity=Bentonite
partitioned\state_loca=Alabama\commodity=Cement
partitioned\state_loca=Alabama\commodity=Common%20Clay%20and%20Shale
partitioned\state_loca=Alabama\commodity=Crushed%20Stone
partitioned\state_loca=Alabama\commodity=Dimension%20Stone
partitioned\state_loca=Alabama\commodity=Gypsum
partitioned\state_loca=Alabama\commodity=Iron%20Oxide%20Pigments
partitioned\state_loca=Alabama\commodity=Kaolin
partitioned\state_loca=Alabama\commodity=Lime
partitioned\state_loca=Alabama\commodity=Perlite
partitioned\state_loca=Alabama\commodity=Salt
partitioned\state_loca=Alabama\commodity=Sand%20and%20Gravel
partitioned\state_loca=Alabama\commodity=Silicon
partitioned\state_loca=Alabama\commodity=Sulfur
partitioned\state_loca=Alaska
partitioned\state_loca=Alaska\commodity=Crushed%20Stone
partitioned\state_loca=Alaska\commodity=Germanium
partitioned\state_loca=Alaska\commodity=Gold
partitioned\state_loca=Alaska\commodity=Lead
partitioned\state_loca=Alaska\commodity=Sand%20and%20Gravel
partitioned\state_loca=Alaska\commodity=Silver
partitioned\state_loca=Alaska\commodity=Zinc
partitioned\state_loca=Arizona
partitioned\state_loca=Arizona\commodity=Bentonite
partitioned\state_loca=Arizona\commodity=Cement
partitioned\state_loca=Arizona\commodity=Common%20Clay%20and%20Shale
partitioned\state_loca=Arizona\commodity=Copper
partitioned\state_loca=Arizona\commodity=Crushed%20Stone
partitioned\state_loca=Arizona\commodity=Gemstones
What's next?
Learn how to read other data types, or analyze your data with SQL functions and analysis tools: