Skip to content

CSV

A comma-separated values (CSV) file (.csv) is a type of delimited text file that uses commas or other characters to separate fields. CSV data can contain spatial data as numbers (e.g., longitude and latitude), as text (e.g., Well-Known Text (WKT), GeoJSON, or EsriJSON), or as encoded binary values.

CSV data can be stored in many locations, including a distributed file system such as HDFS, cloud storage, a local directory, or any other location accessible to Spark.

After you load one or more CSV files as a Spark DataFrame, you can create a geometry column and define its spatial reference using GeoAnalytics Engine SQL functions. For example, if you had polygons stored as WKT strings you could call ST_PointFromText to create a point column from a string column and set its spatial reference. For more information see Geometry.

After creating the geometry column and defining its spatial reference, you can perform spatial analysis and visualization using the SQL functions and tools available in GeoAnalytics Engine. You can also export a Spark DataFrame to CSV files for data storage or export to other systems.

Reading CSV Files in Spark

The following examples demonstrate how to load CSV files into Spark DataFrames using both Python and Scala.

PythonPythonScala
Use dark colors for code blocksCopy
1
2
3
4
5

# Option 1
df = spark.read.option("header", "true").csv("path/to/your/file.csv")
# Option 2
df = spark.read.format("csv").option("header", "true").load("path/to/your/file.csv")

Note that the option("header", "true") ensures the first row is used as column headers.

Options for reading CSV files

Here are some common options used in reading CSV files in Spark:

DataFrameReader optionExampleDescription
header.option("header", True)Interpret the first row of the CSV files as header.
inferSchema.option("inferSchema", True)Infer the column schema when reading CSV files. Spark will by default infer all columns as string columns.
delimiter.option("delimiter", ";")Read CSV files with the specified delimiter. The default delimiter is ",".

For a complete list of options that can be used with the CSV data source in Spark, see the Spark documentation.

Specifying a Schema

To define the schema explicitly, you can use StructType and StructField in both PySpark and Scala.

PythonPythonScala
Use dark colors for code blocksCopy
1
2
3
4
5
6
7
8
9
10
11
12

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

df = spark.read.option("header", "true") \
               .schema(schema) \
               .csv("path/to/your/file.csv")

Reading multiple CSV Files

Spark can infer partition columns from directory names that follow the column=value format. For instance, if your CSV files are organized in subdirectories named District=0, District=2, etc., Spark will recognize district as a partition column when reading the data. This allows Spark to optimize query performance by reading only the relevant partitions.

export data dir

When reading a directory of CSV data with subdirectories not named with column=, Spark won't read from the subdirectories in bulk. You must add the glob pattern at the end of the root path. Note that all subdirectories must contain CSV files in order to be read in bulk.

PythonPythonScala
Use dark colors for code blocksCopy
1
2

df = spark.read.option("header", "true").csv("path/to/your/csv/files/*.csv")

Writing DataFrames to CSV

PythonPythonScala
Use dark colors for code blocksCopy
1
2
3
4
5

# Option 1
df.write.option("header", "true").csv("path/to/output/directory")
# Option 2
df.write.format("csv").option("header", "true").save("path/to/output/directory")

Here are some common options used in writing CSV files in Spark:

DataFrameWriter optionExampleDescription
partitionBy.partitionBy("date")Partition the output by the given column name. This example will partition the output CSV files by values in the date column.
overwrite.mode("overwrite")Overwrite existing data in the specified path. Other available options are append, error, and ignore.
header.option("header", True)Export the Spark DataFrame with a header row.

Usage notes

  • The CSV data source doesn't support loading or saving point, line, polygon, or geometry columns.
  • The spatial reference of a geometry column always needs to be set when importing geometry data from CSV files.
  • Consider explicitly saving the spatial reference in the CSV as a column or in the schema in the geometry column name. To read more on the best practices of working with spatial references in DataFrames, see the documentation on Coordinate Systems and Transformations.
  • Be careful when saving to one CSV file using .coalesce(1) with large datasets. Consider partitioning large data by a certain attribute column to easily read and filter subdirectories of data and to improve performance.

What's next?

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.