Google Cloud Managed Service for Apache Spark (formerly Google Dataproc) is a fully managed and highly scalable service for running Apache Spark and other open-source tools and frameworks. Using the steps outlined below, GeoAnalytics Engine can be leveraged within a PySpark notebook hosted in Google Cloud Managed Service for Apache Spark.

GeoAnalytics Engine    Managed Service for Apache Spark
1.0.x                  2.0-debian10, 2.0-ubuntu18, 2.0-rocky8
1.1.x-1.3.x            2.0-debian10, 2.0-ubuntu18, 2.0-rocky8, 2.1-debian11, 2.1-ubuntu20, 2.1-rocky8
1.4.x-1.5.x            2.0-debian10, 2.0-ubuntu18, 2.0-rocky8, 2.1-debian11, 2.1-ubuntu20, 2.1-rocky8, 2.2-debian12, 2.2-ubuntu22, 2.2-rocky9
1.6.x                  2.1-debian11, 2.1-ubuntu20, 2.1-rocky8, 2.2-debian12, 2.2-ubuntu22, 2.2-rocky9
1.7.x                  2.1-debian11, 2.1-ubuntu20, 2.1-rocky8, 2.2-debian12, 2.2-ubuntu22, 2.2-rocky9, 2.3-debian12, 2.3-ubuntu22, 2.3-rocky9
2.0.x-2.1.x            2.2-debian12, 2.2-ubuntu22, 2.2-rocky9, 2.3-debian12, 2.3-ubuntu22, 2.3-rocky9
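
The compatibility table above can be checked programmatically before a cluster is created. The following is a minimal sketch that transcribes the table into a Python lookup; the names COMPATIBILITY, supported_images, and is_supported are hypothetical helpers introduced for illustration and are not part of GeoAnalytics Engine or any Google Cloud library.

```python
from __future__ import annotations

# Image families, grouped by image version, as listed in the table above.
_IMG_2_0 = ["2.0-debian10", "2.0-ubuntu18", "2.0-rocky8"]
_IMG_2_1 = ["2.1-debian11", "2.1-ubuntu20", "2.1-rocky8"]
_IMG_2_2 = ["2.2-debian12", "2.2-ubuntu22", "2.2-rocky9"]
_IMG_2_3 = ["2.3-debian12", "2.3-ubuntu22", "2.3-rocky9"]

# Hypothetical lookup: GeoAnalytics Engine release series -> supported images.
COMPATIBILITY = {
    ("1.0",): _IMG_2_0,
    ("1.1", "1.2", "1.3"): _IMG_2_0 + _IMG_2_1,
    ("1.4", "1.5"): _IMG_2_0 + _IMG_2_1 + _IMG_2_2,
    ("1.6",): _IMG_2_1 + _IMG_2_2,
    ("1.7",): _IMG_2_1 + _IMG_2_2 + _IMG_2_3,
    ("2.0", "2.1"): _IMG_2_2 + _IMG_2_3,
}


def supported_images(engine_version: str) -> list[str]:
    """Return the images listed for a version such as '1.6.2'."""
    # Only the major.minor part decides which table row applies.
    series = ".".join(engine_version.split(".")[:2])
    for group, images in COMPATIBILITY.items():
        if series in group:
            return list(images)
    raise ValueError(f"No table row for GeoAnalytics Engine {engine_version}")


def is_supported(engine_version: str, image: str) -> bool:
    """Check one engine version / image pair against the table."""
    return image in supported_images(engine_version)


print(is_supported("1.6.2", "2.2-debian12"))
print(is_supported("2.0.0", "2.1-debian11"))
```

The two calls at the end check one supported and one unsupported combination. The lookup only reflects the rows shown here, so it needs updating whenever the table changes.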

To complete this install you will need:

  • An active subscription to Google Cloud Platform.
  • GeoAnalytics Engine install files. If you have a connected GeoAnalytics Engine subscription, you can download the ArcGIS GeoAnalytics Engine distribution here after signing in. If you have a license file, follow the instructions provided with your license file to download the GeoAnalytics Engine distribution.

  • A GeoAnalytics Engine subscription, or a license file.

Prepare the workspace

  1. Log in to the Google Cloud Console.

  2. Select an existing project or set up a new one.

  3. Create a Google Cloud Storage bucket in the same region you plan to deploy a cluster in.

  4. Upload the GeoAnalytics Engine .jar file and .whl file to your bucket. Depending on the analysis you will complete, optionally upload the following jars:

    • esri-projection-geographic, if you need to perform a transformation that requires supplementary projection data.
    • geoanalytics-natives to use geocoding or network analysis tools.
    • geoanalytics-raster to use raster functions, tools, or data sources.
    • geoanalytics-geoenrichment to use GeoEnrichment tools.
  5. Copy and paste the text below into a text editor and save it as a .sh script. Replace <bucket-name>, <jar-file-name>, and <wheel-file-name> with the name of your bucket and the names of the jar file and wheel file you uploaded in step 4. Save the script and upload it to your bucket.

    #!/bin/bash
    
    BUCKET_NAME=<bucket-name>
    JAR_FILE=<jar-file-name>
    WHL_FILE=<wheel-file-name>
    
    WHL_PATH="gs://${BUCKET_NAME}/${WHL_FILE}"
    JAR_PATH="gs://${BUCKET_NAME}/${JAR_FILE}"
    H3_JAR_PATH="gs://${BUCKET_NAME}/h3-4.1.1.jar"
    tmpdir=$(dirname "$(mktemp -u)")
    
    # Copy the jars into Spark's jar directory and stage the wheel in a temp directory.
    gsutil cp "${JAR_PATH}" "/usr/lib/spark/jars"
    gsutil cp "${WHL_PATH}" "${tmpdir}"
    gsutil cp "${H3_JAR_PATH}" "/usr/lib/spark/jars"
    
    # https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/python/pip-install.sh
    
    set -exo pipefail
    readonly PACKAGES="${tmpdir}/${WHL_FILE}"
    
    # Print a timestamped error message and stop the script.
    function err() {
      echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $*"
      exit 1
    }
    
    # Run a command up to 10 times, waiting 5 seconds between attempts.
    function run_with_retry() {
      local -r cmd=("$@")
      for ((i = 0; i < 10; i++)); do
        if "${cmd[@]}"; then
          return 0
        fi
        sleep 5
      done
      err "Failed to run command: ${cmd[*]}"
    }
    
    # Make sure pip is available, installing it if necessary.
    function install_pip() {
      if command -v pip >/dev/null; then
        echo "pip is already installed."
        return 0
      fi
      if command -v easy_install >/dev/null; then
        echo "Installing pip with easy_install..."
        run_with_retry easy_install pip
        return 0
      fi
      echo "Installing python-pip..."
      run_with_retry apt update
      run_with_retry apt install python-pip -y
    }
    
    # Install the GeoAnalytics Engine wheel with pip.
    function main() {
      if [[ -z "${PACKAGES}" ]]; then
        echo "ERROR: Must specify PIP PACKAGES. A space separated list of packages to install. Packages can contain version selector"
        exit 1
      fi
      install_pip
      run_with_retry pip install --upgrade ${PACKAGES}
    }
    
    main

    If you need to perform a transformation that requires supplementary projection data, add the first line in the example below to the script and replace PROJECTION_DATA_JAR_PATH with the Cloud Storage (gs://) path of the corresponding jar uploaded in step 4. Repeat this for every esri-projection-geographic jar that you uploaded.

    If you are planning to use geocoding or network analysis tools, add the second line in the example below to the script and replace GEOANALYTICS_NATIVES_JAR_PATH with the Cloud Storage (gs://) path of the corresponding jar uploaded in step 4.

    If you are planning to use raster functions, tools, or data sources, add the third line in the example below to the script and replace GEOANALYTICS_RASTER_JAR_PATH with the Cloud Storage (gs://) path of the corresponding jar uploaded in step 4.

    If you are planning to use GeoEnrichment tools, add the fourth line in the example below to the script and replace GEOANALYTICS_GEOENRICHMENT_JAR_PATH with the Cloud Storage (gs://) path of the corresponding jar uploaded in step 4.

    gsutil cp "${PROJECTION_DATA_JAR_PATH}" "/usr/lib/spark/jars"
    gsutil cp "${GEOANALYTICS_NATIVES_JAR_PATH}" "/usr/lib/spark/jars"
    gsutil cp "${GEOANALYTICS_RASTER_JAR_PATH}" "/usr/lib/spark/jars"
    gsutil cp "${GEOANALYTICS_GEOENRICHMENT_JAR_PATH}" "/usr/lib/spark/jars"
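
When several optional jars are involved, the extra copy lines for the script can be generated rather than typed by hand. The sketch below is one way to do that in Python, assuming the jars sit at the top level of the bucket; the function name optional_jar_lines, the bucket name, and the jar file names are placeholders chosen for illustration, not actual distribution file names.

```python
from __future__ import annotations

SPARK_JAR_DIR = "/usr/lib/spark/jars"  # destination used by the script above


def optional_jar_lines(bucket_name: str, jar_files: list[str]) -> list[str]:
    """Build one 'gsutil cp' line per optional jar uploaded to the bucket."""
    return [
        f'gsutil cp "gs://{bucket_name}/{jar}" "{SPARK_JAR_DIR}"'
        for jar in jar_files
    ]


# Placeholder names; substitute the bucket and jar files you actually uploaded.
lines = optional_jar_lines(
    "my-bucket",
    ["geoanalytics-natives.jar", "geoanalytics-raster.jar"],
)
print("\n".join(lines))
```

Each generated line can be pasted into the script next to the other gsutil cp commands.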

Create a cluster

  1. Navigate to Managed Service for Apache Spark and open the Create a cluster page. If prompted, choose to create a cluster on Compute Engine.

  2. Define your cluster. Specify a Name, Region, Zone, and Cluster Type that meet your requirements, and choose a supported Managed Service for Apache Spark image under Version.

  3. Customize the cluster in Advanced configurations (optional). Under Cluster, adjust the settings for AutoScaling policy and Dataproc Metastore, or keep the defaults. Configure one or more components, and select "Enable component UIs" from Additional optional components. The cluster already contains Apache Spark, and you must also select at least one of "Jupyter Notebook" or "Zeppelin Notebook".

  4. Under Infrastructure, update the Master node and Worker node configurations to meet your requirements. Change other settings as needed.

  5. Configure the Security settings to meet your requirements or keep the defaults.

  6. Under Other, adjust the settings for Internal IP only and Labels.

    Add the three properties in the table below to Cluster properties:

    Prefix    Key                       Value
    spark     spark.plugins             com.esri.geoanalytics.Plugin
    spark     spark.serializer          org.apache.spark.serializer.KryoSerializer
    spark     spark.kryo.registrator    com.esri.geoanalytics.KryoRegistrator

    Under Initialization actions, browse to and select the .sh script you uploaded previously. The service runs this script on the cluster nodes when provisioning your cluster.

  7. Click Create to create the cluster.
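
The console steps above can also be approximated from the command line. The sketch below assembles a gcloud command in Python as an illustration. It assumes the service is still managed through the gcloud dataproc command group and that cluster properties are passed in prefix:key=value form; the cluster name, region, image version, and script URI are placeholders, and build_create_command is a hypothetical helper. Treat the result as a starting point to review, not a definitive invocation.

```python
from __future__ import annotations

import shlex

# The three Spark properties from the table above (prefix "spark").
SPARK_PROPERTIES = {
    "spark.plugins": "com.esri.geoanalytics.Plugin",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "com.esri.geoanalytics.KryoRegistrator",
}


def build_create_command(
    cluster_name: str, region: str, image_version: str, init_script_uri: str
) -> list[str]:
    """Return an argv list for creating a cluster with the settings above."""
    properties = ",".join(f"spark:{k}={v}" for k, v in SPARK_PROPERTIES.items())
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--image-version={image_version}",
        "--optional-components=JUPYTER",   # notebook component
        "--enable-component-gateway",      # exposes the component web UIs
        f"--initialization-actions={init_script_uri}",
        f"--properties={properties}",
    ]


# Placeholder values; replace with your own names and paths.
cmd = build_create_command(
    "gae-cluster", "us-central1", "2.2-debian12", "gs://my-bucket/init-gae.sh"
)
print(shlex.join(cmd))
```

Printing the command instead of running it leaves room to add machine types, autoscaling, or security flags before execution.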

(Optional) Check cluster status and view logs

  1. If the cluster was created successfully, its status on the Google Cloud Managed Service for Apache Spark Clusters page shows as Running.

  2. If cluster creation failed, the status of the cluster shows as Error. To view cluster logs and understand the reasons for the failure:

    1. Click the cluster name on the Clusters page in Google Cloud Managed Service for Apache Spark to open the Cluster details page.
    2. Click View logs next to Cluster details. This opens the Logs Explorer page, where you can build a query to view cluster logs at a given severity level. Querying ERROR-level logs is usually a good starting point for finding error messages. To look for master daemon logs, worker daemon logs, and system logs, filter by the log names indicated in the Managed Service for Apache Spark documentation.
    3. To access logs through other methods, such as gcloud logging or the Logging API, see the documentation on accessing cluster logs in Cloud Logging.
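
As an illustration of the ERROR-level query mentioned above, the sketch below builds a Cloud Logging filter string in Python. It assumes cluster logs are recorded under the cloud_dataproc_cluster resource type with a cluster_name label; verify both against the logging documentation for your project. The function name error_log_filter and the cluster name are placeholders.

```python
def error_log_filter(cluster_name: str, min_severity: str = "ERROR") -> str:
    """Build a logging filter for one cluster at or above a severity level."""
    return (
        'resource.type="cloud_dataproc_cluster" '  # assumed resource type
        f'AND resource.labels.cluster_name="{cluster_name}" '
        f"AND severity>={min_severity}"
    )


# Placeholder cluster name.
print(error_log_filter("gae-cluster"))
```

The resulting string can be pasted into the query box on the logs page or passed to other logging tools that accept the same filter syntax.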

Authorize GeoAnalytics Engine

  1. Find the Web Interfaces page on the cluster you created previously. Open a Jupyter, JupyterLab, or Zeppelin notebook by clicking on the corresponding Component gateway.

  2. Import the geoanalytics library and authorize it using a username and password, an API key, or a license file. See Authorization for more information. For example:

    import geoanalytics
    geoanalytics.auth(username="User1", password="p@ssw0rd")
  3. Try out the API by importing the SQL functions as an easy-to-use alias like ST and listing the first 20 functions in a notebook cell:

    from geoanalytics.sql import functions as ST
    spark.sql("show user functions like 'ST_*'").show()
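
The authorization example above hardcodes a username and password in the notebook. One alternative, sketched below, is to read them from environment variables and pass them to geoanalytics.auth using the same keyword arguments. The variable names GAE_USERNAME and GAE_PASSWORD and the helper credentials_from_env are assumptions made for this illustration; how the variables get set on the cluster is up to you.

```python
import os


def credentials_from_env(env=os.environ) -> dict:
    """Collect the username and password from environment variables."""
    try:
        return {
            "username": env["GAE_USERNAME"],  # hypothetical variable name
            "password": env["GAE_PASSWORD"],  # hypothetical variable name
        }
    except KeyError as missing:
        raise RuntimeError(
            f"Set {missing.args[0]} before authorizing GeoAnalytics Engine"
        ) from None


# In a notebook on the cluster, where the geoanalytics wheel is installed:
# import geoanalytics
# geoanalytics.auth(**credentials_from_env())
```

This keeps credentials out of saved notebooks. The two commented lines are the only part that depends on the geoanalytics module.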

What’s next?

You can now use any SQL function, track function, raster function, or analysis tool in the geoanalytics module.

See Data sources and Using DataFrames to learn more about how to access your data from your notebook. Also see Visualize results to get started with viewing your data on a map. For examples of what else is possible with GeoAnalytics Engine, check out the sample notebooks, tutorials, and blog posts.
