Read and write to databases

GeoAnalytics Engine can read and write data to and from databases that are supported by Apache Spark. This tutorial demonstrates how to read and write data to and from DataFrames for the following databases:

  • Microsoft SQL Server
  • Oracle
  • PostgreSQL

Prerequisites

The following are required for this tutorial:

  1. A running Spark session configured with ArcGIS GeoAnalytics Engine.
  2. A Jupyter or JupyterLab notebook connected to your Spark session.
  3. A JDBC driver used to make a connection between GeoAnalytics Engine and the database.

Steps

JDBC drivers

A JDBC driver is required to read and write data to and from a database with GeoAnalytics Engine. Each database type (e.g. PostgreSQL, Microsoft SQL Server, etc.) requires a specific JDBC driver.

In this tutorial, the following JDBC drivers are used:

DatabaseJDBC Driver
Microsoft SQL ServerSpark MS SQL Connector
OracleOjdbc11
PostgreSQLPostgreSQL JDBC Driver

The database specific JDBC drivers can be included when starting up Spark/GeoAnalytics Engine by using the packages flag and specifying a comma-delimited list of Maven coordinates.

Examples:

  • --packages groupId:artifactId:version
  • --packages groupId:artifactId:version,groupId:artifactId:version

Setup modules and data

  1. Import the GeoAnalytics Engine Spatial Type functions and the Point geometry type. The Python uuid module is also imported to create a unique database table name.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    import uuid
    from geoanalytics.sql import functions as ST
    from geoanalytics.sql import Point
  2. Create a DataFrame with columns for a unique ID and Point type geometry.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    data = [(1, Point(10, 10)),
           (2, Point(20, 20)),
           (3, Point(30, 30))]
    
    df = spark.createDataFrame(data, ["unique_id", "geometry"])
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    root
     |-- unique_id: long (nullable = true)
     |-- geometry: point (nullable = true)
    
    +---------+---------------+
    |unique_id|       geometry|
    +---------+---------------+
    |        1|{"x":10,"y":10}|
    |        2|{"x":20,"y":20}|
    |        3|{"x":30,"y":30}|
    +---------+---------------+
  3. Create a new well-known binary column from the Point geometry column and then drop the Point geometry column.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    df = df.withColumn("geom_wkb", ST.as_binary("geometry")) \
           .drop("geometry")
    
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    root
     |-- unique_id: long (nullable = true)
     |-- geom_wkb: binary (nullable = true)
    
    +---------+--------------------+
    |unique_id|            geom_wkb|
    +---------+--------------------+
    |        1|[01 01 00 00 00 0...|
    |        2|[01 01 00 00 00 0...|
    |        3|[01 01 00 00 00 0...|
    +---------+--------------------+

Microsoft SQL Server

SQL Server Write

  1. Write the DataFrame created in the Setup modules and data section to a new table within the Microsoft SQL Server database.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    # Create a unique table name using the UUID module
    table_name = f"df_write_{uuid.uuid4().hex[:6]}"
    
    # Write the DataFrame to the SQL Server database
    df.write \
      .format("jdbc") \
      .option("url", f"jdbc:sqlserver://{server_name}\\{instance_name};databaseName={database_name}") \
      .option("dbtable", table_name) \
      .option("user", username) \
      .option("password", password) \
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
      .save()
  2. Use the Microsoft SQL Server STPointFromWKB geometry method to create a SQL Server point geometry type column from the well-known binary column. The SQL Query below will add an empty SQL Server geometry spatial data type column to the database table created in step 1 and then populate it with the results from STPointFromWKB.

    SQL
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    -- Add a geometry column to the database table
    ALTER TABLE table_name
    ADD geom Geometry;
    
    -- Populate the geometry column with the results from STPointFromWKB and set the SRID
    UPDATE table_name
    SET geom=geometry::STPointFromWKB(geom_wkb, SRID);
    
    -- View the database table updates
    SELECT * FROM table_name;
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
     +---------+--------------------------------------------+----------------------------------------------+
     |unique_id|geom_wkb                                    |geom                                          |
     +---------+--------------------------------------------+----------------------------------------------+
     |3        |0x01010000000000000000003E400000000000003E40|0xE6100000010C0000000000003E400000000000003E40|
     |1        |0x010100000000000000000024400000000000002440|0xE6100000010C00000000000024400000000000002440|
     |2        |0x010100000000000000000034400000000000003440|0xE6100000010C00000000000034400000000000003440|
     +---------+--------------------------------------------+----------------------------------------------+

SQL Server Read

  1. When reading from a database table, you can read all records or a subset. Both options are demonstrated below.

    The Microsoft SQL Server STAsBinary method is used with the Spark query data source option to convert the SQL Server geometry spatial data type column to it's well-known binary format.

    Create a DataFrame by reading all records in the database table that was created in the SQL Server write example above.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    df = spark.read \
              .format("jdbc") \
              .option("url", f"jdbc:sqlserver://{server_name}\\{instance_name};databaseName={database_name}") \
              .option("query", f"select unique_id, geom.STAsBinary() as wkb from {table_name}") \
              .option("user", username) \
              .option("password", password) \
              .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
              .load()
    
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
     root
      |-- unique_id: long (nullable = true)
      |-- wkb: binary (nullable = true)
    
     +---------+--------------------+
     |unique_id|                 wkb|
     +---------+--------------------+
     |        3|[01 01 00 00 00 0...|
     |        1|[01 01 00 00 00 0...|
     |        2|[01 01 00 00 00 0...|
     +---------+--------------------+

    Create a DataFrame using a where SQL statement in the Spark query data source option to read in only a subset of the database table that was created in the SQL Server write example above.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    df = spark.read \
              .format("jdbc") \
              .option("url", f"jdbc:sqlserver://{server_name}\\{instance_name};databaseName={database_name}") \
              .option("query", f"select unique_id, geom.STAsBinary() as wkb from {table_name} where unique_id = 1") \
              .option("user", username) \
              .option("password", password) \
              .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
              .load()
    
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
     root
      |-- unique_id: long (nullable = true)
      |-- wkb: binary (nullable = true)
    
     +---------+--------------------+
     |unique_id|                 wkb|
     +---------+--------------------+
     |        1|[01 01 00 00 00 0...|
     +---------+--------------------+
  2. Create a new Point geometry column from the well-known binary column in the DataFrame using ST_PointFromBinary.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    df = df.withColumn("geometry", ST.point_from_binary("wkb"))
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
     root
      |-- unique_id: long (nullable = true)
      |-- wkb: binary (nullable = true)
      |-- geometry: point (nullable = true)
    
     +---------+--------------------+---------------+
     |unique_id|                 wkb|       geometry|
     +---------+--------------------+---------------+
     |        3|[01 01 00 00 00 0...|{"x":30,"y":30}|
     |        1|[01 01 00 00 00 0...|{"x":10,"y":10}|
     |        2|[01 01 00 00 00 0...|{"x":20,"y":20}|
     +---------+--------------------+---------------+

Oracle

Oracle write

  1. Write the DataFrame created in the Setup modules and data section to a new table within the Oracle database.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    # Create a unique table name using the UUID module
    table_name = f"df_write_{uuid.uuid4().hex[:6]}"
    
    # Write the DataFrame to the Oracle database
    df.write \
      .format("jdbc") \
      .option("url", f"jdbc:oracle:thin:@//{host}:{port}/{service_name}") \
      .option("dbtable", table_name) \
      .option("user", username) \
      .option("password", password) \
      .option("driver", "oracle.jdbc.driver.OracleDriver") \
      .save()
  2. Use the Oracle Spatial Data Object FROM_WKBGEOMETRY function to create an SDO_Geometry type column from the well-known binary column. The SQL Query below will add an empty Oracle SDO Geometry column to the database table created in step 1 and then populate it with the results from FROM_WKBGEOMETRY.

    SQL
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    -- Add a geometry column to the database table
    ALTER TABLE table_name
    ADD geom SDO_GEOMETRY;
    
    -- Populate the geometry column with the results from SDO_UTIL.FROM_WKBGEOMETRY
    UPDATE table_name
    SET geom=SDO_UTIL.FROM_WKBGEOMETRY("geom_wkb");
    
    -- Set the SRID for the geometry column in the database
    UPDATE table_name tbl
    SET tbl.GEOM.SDO_SRID = SRID;
    
    -- Commit the changes to the database table
    commit;
    
    -- View the database table updates
    SELECT * FROM table_name;
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    +---------+--------------------------------------------+----------------------------------+
    |unique_id|geom_wkb                                    |GEOM                              |
    +---------+--------------------------------------------+----------------------------------+
    |2        |0x010100000000000000000034400000000000003440|{2001,4326,{20,20,null},null,null}|
    |1        |0x010100000000000000000024400000000000002440|{2001,4326,{10,10,null},null,null}|
    |3        |0x01010000000000000000003E400000000000003E40|{2001,4326,{30,30,null},null,null}|
    +---------+--------------------------------------------+----------------------------------+

Oracle Read

  1. When reading from a database table, you can read all records or a subset. Both options are demonstrated below.

    The Oracle Spatial Data Object TO_WKBGEOMETRY function to convert the Oracle SDO geometry column to it's well-known binary format.

    Create a DataFrame by reading all records in the database table that was created in the Oracle write example above.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    df = spark.read \
              .format("jdbc") \
              .option("url", f"jdbc:oracle:thin:@//{host}:{port}/{service_name}") \
              .option("query", f'select "unique_id", SDO_UTIL.TO_WKBGEOMETRY(GEOM) as wkb from {table_name}') \
              .option("user", username) \
              .option("password", password) \
              .option("driver", "oracle.jdbc.driver.OracleDriver") \
              .load()
    
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
     root
      |-- unique_id: decimal(19,0) (nullable = true)
      |-- WKB: binary (nullable = true)
    
     +---------+--------------------+
     |unique_id|                 WKB|
     +---------+--------------------+
     |        2|[00 00 00 00 01 4...|
     |        3|[00 00 00 00 01 4...|
     |        1|[00 00 00 00 01 4...|
     +---------+--------------------+

    Create a DataFrame using a where SQL statement in the Spark query data source option to read in only a subset of the database table that was created in the Oracle write example above.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    df = spark.read \
            .format("jdbc") \
            .option("url", f"jdbc:oracle:thin:@//{host}:{port}/{service_name}") \
            .option("query", f'select "unique_id", SDO_UTIL.TO_WKBGEOMETRY(GEOM) as wkb from {table_name} where rownum<3') \
            .option("user", username) \
            .option("password", password) \
            .option("driver", "oracle.jdbc.driver.OracleDriver") \
            .load()
    
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
     root
      |-- unique_id: decimal(19,0) (nullable = true)
      |-- WKB: binary (nullable = true)
    
     +---------+--------------------+
     |unique_id|                 WKB|
     +---------+--------------------+
     |        2|[00 00 00 00 01 4...|
     |        1|[00 00 00 00 01 4...|
     +---------+--------------------+
  2. Create a new Point geometry column from the well-known binary column in the DataFrame using ST_PointFromBinary.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    df = df.withColumn("geometry", ST.point_from_binary("wkb"))
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
     root
      |-- unique_id: decimal(19,0) (nullable = true)
      |-- WKB: binary (nullable = true)
      |-- geometry: point (nullable = true)
    
     +---------+--------------------+---------------+
     |unique_id|                 WKB|       geometry|
     +---------+--------------------+---------------+
     |        2|[00 00 00 00 01 4...|{"x":20,"y":20}|
     |        3|[00 00 00 00 01 4...|{"x":30,"y":30}|
     |        1|[00 00 00 00 01 4...|{"x":10,"y":10}|
     +---------+--------------------+---------------+

PostgreSQL

PostgreSQL Write

  1. Write the DataFrame created in the Setup modules and data section to a new table within the PostgreSQL database.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    # Create a unique table name using the UUID module
    table_name = f"df_write_{uuid.uuid4().hex[:6]}"
    
    # Write the DataFrame to the PostgreSQL database
    df.write \
      .format("jdbc") \
      .option("url", f"jdbc:postgresql://{host}:{port}/{database}") \
      .option("dbtable", table_name) \
      .option("user", username) \
      .option("password", password) \
      .option("driver", "org.postgresql.Driver") \
      .save()
  2. Use the PostgreSQL ST_PointFromWKB function to create a POINT geometry type column from the well-known binary column. The SQL Query below will create a new database table that is a copy of the table created in step 1 but with an additional column for the PostgreSQL geometry type that was created using ST_PointFromWKB.

    SQL
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    -- Create a new database table with a geometry column with the results from ST_PointFromWKB
    DROP TABLE IF EXISTS new_table_name;
    SELECT *, ST_PointFromWKB(geom_wkb, SRID) AS geom INTO new_table_name FROM table_name;
    
    -- View the database table updates
    SELECT * FROM new_table_name;
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
     +---------+--------------------------------------------+--------------------------------------------------------------------------------+
     |unique_id|geom_wkb                                    |geom                                                                            |
     +---------+--------------------------------------------+--------------------------------------------------------------------------------+
     |2        |0x010100000000000000000034400000000000003440|280000000100000001000400E61000000C0000000100000080A09B9FB91880A09B9FB91800000000|
     |3        |0x01010000000000000000003E400000000000003E40|280000000100000001000400E61000000C0000000100000080B0FADF831980B0FADF831900000000|
     |1        |0x010100000000000000000024400000000000002440|280000000100000001000400E61000000C000000010000008090BCDEEE178090BCDEEE1700000000|
     +---------+--------------------------------------------+--------------------------------------------------------------------------------+

PostgreSQL Read

  1. When reading from a database table, you can read all records or a subset. Both options are demonstrated below.

    The PostgreSQL STAsBinary function is used with the Spark query data source option to convert the PostgreSQL geometry column to it's well-known binary format.

    Create a DataFrame by reading all records in the database table that was created in the PostgreSQL write example above.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    df = spark.read \
            .format("jdbc") \
            .option("url", f"jdbc:postgresql://{host}:{port}/{database}") \
            .option("query", f"select unique_id, ST_AsBinary(geom) as wkb from {table_name}") \
            .option("user", username) \
            .option("password", password) \
            .option("driver", "org.postgresql.Driver") \
            .load()
    
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
     root
      |-- unique_id: long (nullable = true)
      |-- wkb: binary (nullable = true)
    
     +---------+--------------------+
     |unique_id|                 wkb|
     +---------+--------------------+
     |        3|[01 01 00 00 00 1...|
     |        1|[01 01 00 00 00 2...|
     |        2|[01 01 00 00 00 1...|
     +---------+--------------------+

    Create a DataFrame using a where SQL statement in the Spark query data source option to read in only a subset of the database table that was created in the PostgreSQL write example above.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    df = spark.read \
            .format("jdbc") \
            .option("url", f"jdbc:postgresql://{host}:{port}/{database}") \
            .option("query", f"select unique_id, ST_AsBinary(geom) as wkb from {table_name}  where unique_id > 1") \
            .option("user", username) \
            .option("password", password) \
            .option("driver", "org.postgresql.Driver") \
            .load()
    
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
     root
      |-- unique_id: long (nullable = true)
      |-- wkb: binary (nullable = true)
    
     +---------+--------------------+
     |unique_id|                 wkb|
     +---------+--------------------+
     |        3|[01 01 00 00 00 1...|
     |        2|[01 01 00 00 00 1...|
     +---------+--------------------+
  2. Create a new Point geometry column from the well-known binary column in the DataFrame using ST_PointFromBinary.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    df = df.withColumn("geometry", ST.point_from_binary("wkb"))
    df.printSchema()
    df.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
     root
      |-- unique_id: long (nullable = true)
      |-- wkb: binary (nullable = true)
      |-- geometry: point (nullable = true)
    
     +---------+--------------------+--------------------+
     |unique_id|                 wkb|            geometry|
     +---------+--------------------+--------------------+
     |        3|[01 01 00 00 00 1...|{"x":30.000000000...|
     |        1|[01 01 00 00 00 2...|{"x":10.000000000...|
     |        2|[01 01 00 00 00 1...|{"x":20.000000000...|
     +---------+--------------------+--------------------+

What's next?

Learn about how to read in other data types or analyze your data through SQL functions and analysis tools:

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.