ORC

Apache ORC (Optimized Row Columnar) is an open-source, type-aware columnar file format commonly used in Hadoop ecosystems. The ORC file format (.orc) is self-describing: each file carries its own schema and type information. It is optimized for large streaming reads and also includes support for finding required rows quickly. As a result, ORC can significantly reduce both the time needed to read data and the size of data on disk. ORC also supports complex data types such as structs, lists, maps, and unions. ORC is supported natively in Spark and in Hive. To learn more about ORC, see the ORC specification. To learn more about using ORC with Spark and Hive, see the Spark documentation on ORC files.

ORC data can be stored in a distributed file system such as HDFS, cloud storage, a local directory, or any other location accessible to Spark.

After you load one or more ORC files as a Spark DataFrame, you can create a geometry column and define its spatial reference using GeoAnalytics Engine SQL functions. For example, if you had polygons stored as WKT strings, you could call ST_PolyFromText to create a polygon column from a string column and set its spatial reference. For more information, see Geometry.
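
The step described above could be sketched roughly as follows. This is a minimal illustration, assuming a DataFrame with a string column named `wkt` that holds polygon WKT, data in WGS 84 (SRID 4326), and a session in which GeoAnalytics Engine is authorized so its SQL functions are registered. The helper names, column names, and SRID are hypothetical, and the two-argument form of ST_PolyFromText is an assumption to verify against the function reference.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import DataFrame


def polygon_expr(wkt_col: str = "wkt", srid: int = 4326, out_col: str = "geometry") -> str:
    """Build a SQL expression that parses polygon WKT and sets its spatial reference.

    Assumption: ST_PolyFromText accepts (wkt, sr) in SQL.
    """
    return f"ST_PolyFromText({wkt_col}, {srid}) AS {out_col}"


def with_polygon_column(df: "DataFrame", wkt_col: str = "wkt", srid: int = 4326) -> "DataFrame":
    """Return df with an extra polygon column named `geometry`."""
    return df.selectExpr("*", polygon_expr(wkt_col, srid))


# Usage (assumes an active SparkSession `spark` and a hypothetical path):
# df = with_polygon_column(spark.read.orc("/data/parcels_orc"))
```

Keeping the expression in a small helper makes it easy to swap in a different SRID or column name without touching the read logic.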

After creating the geometry column and defining its spatial reference, you can perform spatial analysis and visualization using the SQL functions and tools available in GeoAnalytics Engine. You can also export a Spark DataFrame to ORC files for data storage or export to other systems.

The following table shows several examples of how to load and save ORC files in Spark, where path is a path to a single ORC file or to a directory of ORC files.

Load	Save
`spark.read.orc(path)`	`df.write.orc(path)`
`spark.read.format("orc").load(path)`	`df.write.format("orc").save(path)`
`spark.read.load(path, format="orc")`	`df.write.save(path, format="orc")`
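
Put together, a load followed by a save might look like the sketch below. The function name and both paths are hypothetical; the function expects an existing SparkSession and uses the shorthand `orc` methods from the first row of the table.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import DataFrame, SparkSession


def copy_orc(spark: "SparkSession", src_path: str, dst_path: str) -> "DataFrame":
    """Load ORC data from src_path, write it to dst_path, and return the DataFrame."""
    # Equivalent to spark.read.format("orc").load(src_path)
    df = spark.read.orc(src_path)
    # Equivalent to df.write.format("orc").save(dst_path)
    df.write.orc(dst_path)
    return df


# Usage (hypothetical paths):
# df = copy_orc(spark, "/data/example_in", "/data/example_out")
```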

Additionally, Spark DataFrameReader and DataFrameWriter classes provide optional parameters that you can use when reading or writing ORC files. For a full list of options, see DataFrameReader.orc and DataFrameWriter.orc.

DataFrameReader option	Example	Description
`recursiveFileLookup`	`.option("recursiveFileLookup", True)`	Recursively search for ORC files under the given directory.
`mergeSchema`	`.option("mergeSchema", True)`	Merge the schemas of a collection of ORC datasets in the input directory.
`pathGlobFilter`	`.option("pathGlobFilter", "*.orc")`	Read in files with the specified name pattern under the given file path.
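
The reader options can be chained before the final `load` call. The following is a sketch under stated assumptions: the function name, root path, and default pattern are hypothetical. Spark's ORC reader spells the schema-merging option `mergeSchema`, and enabling `recursiveFileLookup` turns off partition inference, so `column=` subdirectory names are not added as columns.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import DataFrame, SparkSession


def read_orc_tree(
    spark: "SparkSession",
    root: str,
    pattern: str = "*.orc",
    merge_schema: bool = False,
) -> "DataFrame":
    """Read every file matching `pattern` anywhere under `root`."""
    reader = (
        spark.read.format("orc")
        .option("recursiveFileLookup", True)  # descend into all subdirectories
        .option("pathGlobFilter", pattern)    # keep only matching file names
    )
    if merge_schema:
        # Reconcile files whose schemas differ (costs an extra pass over metadata)
        reader = reader.option("mergeSchema", True)
    return reader.load(root)


# Usage (hypothetical path):
# df = read_orc_tree(spark, "/data/example", merge_schema=True)
```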

DataFrameWriter option	Example	Description
`partitionBy`	`.partitionBy("date")`	Partition the output by the given column name. This example will partition the output ORC files by values in the `date` column.
`overwrite`	`.mode("overwrite")`	Overwrite existing data in the specified path. Other available modes are `append`, `error`, and `ignore`.
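
The writer settings in the table can be combined in one call chain. A minimal sketch, assuming a DataFrame that has a `date` column; the function name, default column, and output path are hypothetical.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import DataFrame


def write_partitioned_orc(
    df: "DataFrame",
    path: str,
    column: str = "date",
    mode: str = "overwrite",
) -> None:
    """Write df as ORC, one subdirectory per distinct value of `column`."""
    (
        df.write.format("orc")
        .mode(mode)            # "overwrite", "append", "error", or "ignore"
        .partitionBy(column)   # produces subdirectories named column=value
        .save(path)
    )


# Usage (hypothetical path):
# write_partitioned_orc(df, "/data/example_out", column="date")
```

Choosing `append` instead of `overwrite` adds new files to existing partitions rather than replacing the data at the path.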

Usage notes

The ORC data source doesn't support loading or saving DataFrames containing point, line, polygon, or geometry columns.
The spatial reference of a geometry column always needs to be set when importing geometry data from ORC files.
Spark reads ORC files from multiple subdirectories when the subdirectory names start with column=. For example, if a directory contains ORC data that is partitioned by district, Spark can infer district as a column name in the DataFrame by reading the subdirectory names starting with district=.
When reading a directory of ORC data whose subdirectories do not follow the column= naming convention, Spark won't read from the subdirectories in bulk. You will need to add a glob pattern at the end of the root path (for example, C:\data\example\* or C:\data\example\*\*).
Consider explicitly saving the spatial reference in the ORC data, either as a column or as part of the geometry column name in the schema. To read more on the best practices of working with spatial references in DataFrames, see the documentation on Coordinate Systems and Transformations.
Be careful when saving to one ORC file using .coalesce(1) with large datasets. Consider partitioning large data by a certain attribute column to easily read and filter subdirectories of data and to improve performance.
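
Several of these notes can be combined into one export routine. The sketch below is an illustration, not a definitive implementation: because geometry columns can't be written to ORC, it converts the geometry to WKT, stores the spatial reference ID in its own column, drops the geometry, and partitions the output by an attribute instead of coalescing to a single file. The function name, the `geometry`, `wkt`, `srid`, and `district` column names, and the output path are hypothetical, and the SQL functions ST_AsText and ST_SRID are assumed to be available with these single-argument forms in an authorized GeoAnalytics Engine session.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import DataFrame


def export_geometry_as_orc(
    df: "DataFrame",
    path: str,
    partition_col: str = "district",
    geom_col: str = "geometry",
) -> None:
    """Write df to ORC with the geometry stored as WKT plus an SRID column."""
    flat = df.selectExpr(
        "*",
        f"ST_AsText({geom_col}) AS wkt",   # assumed function: geometry -> WKT string
        f"ST_SRID({geom_col}) AS srid",    # assumed function: geometry -> SRID integer
    ).drop(geom_col)                       # ORC can't store the geometry column itself
    # Partition by an attribute rather than forcing one file with .coalesce(1)
    flat.write.partitionBy(partition_col).orc(path)


# Usage (hypothetical path):
# export_geometry_as_orc(df, "/data/example_out", partition_col="district")
```

When the data is read back, the `srid` column records which spatial reference to set on the recreated geometry column.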