A comma-separated values (CSV) file (.csv) is a type of delimited text file that uses commas or other characters to separate fields. CSV data can contain spatial data as numbers (e.g., longitude and latitude), as text (e.g., Well-Known Text (WKT), GeoJSON, or EsriJSON), or as encoded binary values.
CSV data can be stored in many locations, including a distributed file system such as HDFS, cloud storage, a local directory, or any other location accessible to Spark.
After you load one or more CSV files as a Spark DataFrame, you can create a geometry column and define its spatial reference using GeoAnalytics Engine SQL functions. For example, if you had polygons stored as WKT strings, you could call ST_PolyFromText to create a polygon column from a string column and set its spatial reference. For more information, see Geometry.
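As a rough illustration of the WKT case described above, the sketch below builds a polygon column with a Spark SQL expression. It assumes GeoAnalytics Engine is installed and authorized so that ST_PolyFromText is registered as a SQL function, and that the function accepts a spatial reference ID as its second argument. The helper names, the wkt column name, and the SRID 4326 are hypothetical.

```python
def polygon_expression(wkt_column, srid):
    """Build a SQL expression that parses WKT into a polygon column.

    Assumption: ST_PolyFromText(wkt, sr) takes the spatial reference ID
    as its second argument.
    """
    return f"ST_PolyFromText({wkt_column}, {srid}) AS geometry"


def add_polygon_column(df, wkt_column="wkt", srid=4326):
    """Return df with an extra 'geometry' column created from WKT strings.

    df is a Spark DataFrame loaded from CSV; wkt_column (a hypothetical
    name) holds polygon WKT such as 'POLYGON ((0 0, 1 0, 1 1, 0 0))'.
    """
    # selectExpr keeps every existing column ("*") and appends the new one.
    return df.selectExpr("*", polygon_expression(wkt_column, srid))
```

With a DataFrame df loaded from CSV, add_polygon_column(df) would typically be called before any spatial functions or tools are run.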
After creating the geometry column and defining its spatial reference, you can perform spatial analysis and visualization using the SQL functions and tools available in GeoAnalytics Engine. You can also export a Spark DataFrame to CSV files for data storage or export to other systems.
The following table shows several examples of how to load and save CSV files in Spark, where path is a path to a directory of CSVs or a CSV file.
Load | Save |
---|---|
spark.read.csv(path) | df.write.csv(path) |
spark.read.format("csv").load(path) | df.write.format("csv").save(path) |
spark.read.load(path, format="csv") | df.write.save(path, format="csv") |
Additionally, the Spark DataFrameReader and DataFrameWriter classes provide options that can be used when loading and saving CSV files, as shown below. For a complete list of options that can be used with the CSV data source, see DataFrameReader.csv and DataFrameWriter.csv.
DataFrameReader option | Example | Description |
---|---|---|
header | .option("header", True) | Interpret the first row of the CSV files as header. |
inferSchema | .option("inferSchema", True) | Infer the column schema when reading CSV files. By default, Spark reads every column as a string column. |
delimiter | .option("delimiter", ";") | Read CSV files with the specified delimiter. The default delimiter is ",". |
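The reader options above can be chained on a single read. The following is a minimal sketch, assuming an existing SparkSession is passed in as spark; the function name and the semicolon default are illustrative only.

```python
def load_delimited_csv(spark, path, delimiter=";"):
    """Load CSV data from a file or directory into a Spark DataFrame.

    spark is an existing SparkSession; path may point to one CSV file
    or to a directory of CSV files.
    """
    return (
        spark.read
        .option("header", True)          # first row holds column names
        .option("inferSchema", True)     # detect column types instead of all strings
        .option("delimiter", delimiter)  # field separator used in the files
        .csv(path)
    )
```

Note that schema inference requires Spark to make an additional pass over the data, so supplying an explicit schema may be preferable for very large inputs.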
DataFrameWriter option | Example | Description |
---|---|---|
partitionBy | .partitionBy("date") | Partition the output by the given column name. This example partitions the output CSV files by the values in the date column. |
overwrite | .mode("overwrite") | Overwrite existing data in the specified path. Other available options are append, error, and ignore. |
header | .option("header", True) | Export the Spark DataFrame with a header row. |
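The writer options can be combined in the same way. The sketch below assumes df is a Spark DataFrame that contains a column named date and no geometry columns; the function name and default column are hypothetical. It uses the standard DataFrameWriter methods partitionBy, mode, and option.

```python
def save_partitioned_csv(df, path, partition_column="date"):
    """Write df as CSV files, one subdirectory per partition value.

    Existing data at path is replaced because the mode is "overwrite".
    """
    (
        df.write
        .partitionBy(partition_column)  # creates subdirectories like date=<value>
        .mode("overwrite")              # replace anything already at path
        .option("header", True)         # write column names as the first row
        .csv(path)
    )
```

Because the output subdirectories are named with the column= convention, reading the same path back lets Spark restore the partition column from the directory names.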
Usage notes
- The CSV data source doesn't support loading or saving point, line, polygon, or geometry columns.
- The spatial reference of a geometry column always needs to be set when importing geometry data from CSV files.
- Spark will read CSV files from multiple directories if the directory names start with column=. For example, in a directory of CSV data that is partitioned by district, Spark can infer district as a column name in the DataFrame by reading the subdirectory names starting with district=.
- When reading a directory of CSV data with subdirectories not named with column=, Spark won't read from the subdirectories in bulk. You must add the glob pattern at the end of the root path (e.g., C:\data\example\* or C:\data\example\*\*).
- All subdirectories must contain CSV files in order to be read in bulk.
- Consider explicitly saving the spatial reference in the CSV as a column or in the schema as part of the geometry column name. To read more on the best practices of working with spatial references in DataFrames, see the documentation on Coordinate Systems and Transformations.
- Be careful when saving to one CSV file using .coalesce(1) with large datasets. Consider partitioning large data by a certain attribute column to easily read and filter subdirectories of data and to improve performance.
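The two directory-reading notes above can be sketched as a single helper. This is a minimal example, assuming an existing SparkSession is passed in as spark and that the files have header rows; the function name and the hive_partitioned flag are hypothetical.

```python
def read_csv_tree(spark, root, hive_partitioned):
    """Read the CSV files in the subdirectories of root into one DataFrame.

    hive_partitioned=True:  subdirectories are named like district=North,
        so Spark discovers them from the root path and adds the partition
        column (district) to the DataFrame.
    hive_partitioned=False: subdirectories have arbitrary names, so a glob
        pattern is appended to reach the files one level down.
    """
    if hive_partitioned:
        path = root
    else:
        # Strip any trailing separator before appending the glob pattern.
        path = root.rstrip("/\\") + "/*"
    return spark.read.option("header", True).csv(path)
```

For data nested two levels deep, the same idea applies with a second glob segment appended to the path.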