Install GeoAnalytics Engine on Amazon EMR

Amazon EMR is a platform for rapidly processing, analyzing, and applying machine learning (ML) to big data using open-source frameworks. By following the steps outlined below, you can use GeoAnalytics Engine from a PySpark notebook hosted in Amazon EMR.

The table below summarizes the Amazon EMR runtimes supported by each version of GeoAnalytics Engine.

GeoAnalytics Engine    Amazon EMR
1.0.x                  6.1.0-6.6.0

To complete this install you will need:

  • An active AWS subscription
  • GeoAnalytics Engine install files. If you have a GeoAnalytics Engine subscription with a username and password, you can download the ArcGIS GeoAnalytics Engine distribution here after signing in. If you have a license file, follow the instructions provided with your license file to download the GeoAnalytics Engine distribution.
  • A GeoAnalytics Engine subscription or a license file.

Prepare the workspace

  1. Sign in to the AWS Management Console.

  2. Create a bucket in S3 or choose an existing one to stage setup files in.

  3. Upload the jar and wheel (.whl) files to your S3 bucket. If you prefer to do this programmatically, see the boto3 sketch at the end of this section.

  4. Copy and paste the text below into a text editor and change the BUCKET_PATH variable value to the path of the bucket or folder where you uploaded the jar and wheel. Change WHEEL_NAME and JAR_NAME to the names of the wheel file and jar file respectively. Save the file using ".sh" as the file extension and upload it to your S3 bucket.

    #!/bin/bash

    # Update these values to match the bucket path and the file names you uploaded
    BUCKET_PATH=s3://testbucket
    WHEEL_NAME=geoanalytics-1.0.0-py3-none-any.whl
    JAR_NAME=geoanalytics_2.12-1.0.0.jar

    # Copy the jar onto Spark's classpath and install the Python wheel on each node
    sudo mkdir -p /home/geoanalytics/
    sudo aws s3 cp $BUCKET_PATH/$JAR_NAME /usr/lib/spark/jars/
    sudo aws s3 cp $BUCKET_PATH/$WHEEL_NAME /home/geoanalytics/$WHEEL_NAME
    sudo python3 -m pip install -U pip
    sudo python3 -m pip install /home/geoanalytics/$WHEEL_NAME

    If you are using the supplemental projection data jars, add the lines below to the script before uploading the script to S3 and update the file names if needed.

    sudo aws s3 cp $BUCKET_PATH/esri-projection-data1.jar /usr/lib/spark/jars/
    sudo aws s3 cp $BUCKET_PATH/esri-projection-data2.jar /usr/lib/spark/jars/
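
If you prefer to stage the files from steps 3 and 4 programmatically instead of through the console, a minimal sketch using boto3 is shown below. The bucket name, file names, and script name are placeholders; replace them with the values you used above.

    # Minimal sketch: upload the GeoAnalytics Engine files and the bootstrap
    # script to S3 with boto3. All names below are placeholders.
    import boto3

    s3 = boto3.client("s3")
    for name in ["geoanalytics_2.12-1.0.0.jar",
                 "geoanalytics-1.0.0-py3-none-any.whl",
                 "geoanalytics-bootstrap.sh"]:
        s3.upload_file(name, "testbucket", name)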

Create a cluster

  1. Navigate to Amazon EMR and click Create cluster.

  2. Click Go to advanced options to open the Advanced Options page.

  3. Under Software Configuration, for Release choose an EMR release supported by your GeoAnalytics Engine version (see the table above). See About Amazon EMR Releases for details on release components. Ensure that at least the following packages are selected:

    • Hadoop
    • Hive
    • Spark
    • JupyterHub
    • JupyterEnterpriseGateway
  4. Under Edit software settings select "Enter configuration" and copy and paste the text below into the text box.

    [
       {
          "classification":"spark-defaults",
          "properties":{
             "spark.plugins":"com.esri.geoanalytics.Plugin",
             "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
             "spark.kryo.registrator":"com.esri.geoanalytics.KryoRegistrator"
          }
       }
    ]
  5. Accept the defaults for all other parameters or change them before clicking Next.

  6. Configure the hardware and networking for your cluster and click Next.

  7. Under Bootstrap Actions select "Custom action" and click Configure and add. For Script location, specify the path to the .sh script you uploaded to S3 earlier and click Add.

  8. Accept the defaults for all other parameters in the General Cluster Settings page or update them before clicking Next.

  9. Configure the security options for your cluster and click Create cluster. If cluster creation fails, check the cluster logs to diagnose the issue.
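
The console workflow above can also be scripted. Below is a minimal sketch that uses boto3's run_job_flow to request an equivalent cluster with the same release, applications, Spark configuration, and bootstrap action. The region, bucket path, script name, instance types and counts, and IAM roles are example values; adjust them for your environment.

    # Sketch of creating an equivalent cluster with boto3. The region, bucket
    # path, script name, instance types, and IAM roles are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-west-2")

    response = emr.run_job_flow(
        Name="geoanalytics-emr",
        ReleaseLabel="emr-6.6.0",
        Applications=[{"Name": app} for app in
                      ["Hadoop", "Hive", "Spark", "JupyterHub", "JupyterEnterpriseGateway"]],
        Configurations=[{
            "Classification": "spark-defaults",
            "Properties": {
                "spark.plugins": "com.esri.geoanalytics.Plugin",
                "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
                "spark.kryo.registrator": "com.esri.geoanalytics.KryoRegistrator"
            }
        }],
        BootstrapActions=[{
            "Name": "Install GeoAnalytics Engine",
            "ScriptBootstrapAction": {"Path": "s3://testbucket/geoanalytics-bootstrap.sh"}
        }],
        Instances={
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2}
            ],
            "KeepJobFlowAliveWhenNoSteps": True
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole"
    )
    print("Cluster ID:", response["JobFlowId"])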

Authorize GeoAnalytics Engine

  1. Create a new PySpark notebook or open an existing one and attach the notebook to your cluster.
  2. Import the geoanalytics library and authorize it using your username and password or another supported authorization method. See Licensing and Authorization for more information. For example:

    import geoanalytics
    geoanalytics.auth(username="User1", password="p@ssw0rd")
  3. Try out the API by importing the SQL functions as an easy-to-use alias like ST and listing the first 20 functions in a notebook cell:

    from geoanalytics.sql import functions as ST
    spark.sql("show user functions like 'ST_*'").show()

What’s next?

You can now use any SQL function or analysis tool in the geoanalytics module.

See Data sources and Using DataFrames to learn more about how to access your data from your notebook. Also see Visualize results to get started with viewing your data on a map. For examples of what else is possible with GeoAnalytics Engine, check out the sample notebooks, tutorials, and blog posts.
