Amazon EMR is a platform for rapidly processing, analyzing, and applying machine learning (ML) to big data using open-source frameworks. Using the steps outlined below, you can install and run GeoAnalytics Engine in a PySpark notebook hosted in Amazon EMR.
The table below summarizes the Amazon EMR runtimes supported by each version of GeoAnalytics Engine.
| GeoAnalytics Engine | Amazon EMR |
|---|---|
| 1.0.x | 6.1.0-6.6.0 |
| 1.1.x | 6.1.0-6.9.0 |
| 1.2.x | 6.1.0-6.11.0 |
| 1.3.x | 6.1.0-6.15.0 |
| 1.4.x | 6.1.0-7.1.0 |
To complete this install you will need:
- An active AWS subscription
- GeoAnalytics Engine install files. If you have a GeoAnalytics Engine subscription with a username and password, you can download the ArcGIS GeoAnalytics Engine distribution here after signing in. If you have a license file, follow the instructions provided with your license file to download the GeoAnalytics Engine distribution.
- A GeoAnalytics Engine subscription, or a license file.
Prepare the workspace
1. Sign in to the AWS Management Console.
2. Create a bucket in S3 or choose an existing one to stage setup files in.
3. Upload the jar and whl files to your S3 bucket. If you prefer to upload from code rather than the console, see the sketch after these steps. Depending on the analysis you will complete, optionally upload the following jars:
   - `esri-projection-geographic`, if you need to perform a transformation that requires supplementary projection data.
   - `geoanalytics-natives`, to use geocoding or network analysis tools.
4. Copy and paste the text below into a text editor and change the `BUCKET_PATH` variable value to the path of the bucket or folder where you uploaded the jar and wheel. If you copy the bucket path from S3 using Copy S3 URI, make sure to remove the `/` character that is added at the end, as your script will break otherwise. Change `WHEEL_NAME` and `JAR_NAME` to the names of the wheel file and jar file respectively. Save the file using ".sh" as the file extension and upload it to your S3 bucket.

```bash
#!/bin/bash

BUCKET_PATH=s3://testbucket
WHEEL_NAME=geoanalytics-x.x.x-py3-none-any.whl
JAR_NAME=geoanalytics_2.12-x.x.x.jar

sudo mkdir -p /home/geoanalytics/
sudo aws s3 cp $BUCKET_PATH/$JAR_NAME /usr/lib/spark/jars/
sudo aws s3 cp $BUCKET_PATH/$WHEEL_NAME /home/geoanalytics/$WHEEL_NAME
sudo python3 -m pip install -U pip
sudo python3 -m pip install /home/geoanalytics/$WHEEL_NAME
```
If you are using the supplemental projection data jars, add the first line in the example below to the script and replace `<esri-projection-name>.jar` with the name of the corresponding file uploaded in step 3. Make sure to follow these steps for every projection data jar used. If you are planning to use geocoding or network analysis tools, add the second line in the example below to the script and replace `<geoanalytics-natives>.jar` with the name of the corresponding file uploaded in step 3. Make sure to upload the updated script to S3.

```bash
sudo aws s3 cp $BUCKET_PATH/<esri-projection-name>.jar /usr/lib/spark/jars/
sudo aws s3 cp $BUCKET_PATH/<geoanalytics-natives>.jar /usr/lib/spark/jars/
```
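If you would rather stage the setup files from code than through the S3 console, the sketch below performs the same upload with boto3. All file and bucket names are placeholders (the script name `geoanalytics-install.sh` is hypothetical); substitute your own.

```python
# Minimal sketch: stage the install files in S3 with boto3.
# All file and bucket names are placeholders; match them to your release.
import boto3

s3 = boto3.client("s3")
for name in ("geoanalytics_2.12-x.x.x.jar",
             "geoanalytics-x.x.x-py3-none-any.whl",
             "geoanalytics-install.sh"):
    s3.upload_file(name, "testbucket", name)  # upload_file(Filename, Bucket, Key)
```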
Create a cluster
1. Navigate to Amazon EMR and click Create cluster.
2. Under Name and applications, for Amazon EMR release choose any supported EMR release. See About Amazon EMR Releases for details on release components. For Application bundle, select "Custom" and ensure that at least the following applications are checked:
   - Hadoop
   - Hive
   - Spark
   - JupyterHub
   - JupyterEnterpriseGateway
3. Under Cluster configuration, choose your preferred configuration.
4. Configure the Networking for your cluster.
5. Under Bootstrap Actions - optional, click "Add". For Script location, specify the path to the `.sh` script you uploaded to S3 earlier and click Add bootstrap action.
6. Under Software settings - optional, select "Enter configuration" and copy and paste the text below into the text box.

```json
[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.plugins": "com.esri.geoanalytics.Plugin",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.kryo.registrator": "com.esri.geoanalytics.KryoRegistrator"
    }
  }
]
```
7. Set the Identity and Access Management (IAM) roles for your cluster.
8. Accept the defaults for all other parameters in the previous steps or change them based on your needs.
9. Click Create cluster. If cluster creation fails, check the cluster logs to diagnose the issue.
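If you manage infrastructure from code, the console steps above can also be expressed through the EMR API. The following is a minimal sketch using boto3's `run_job_flow`; the cluster name, release label, instance types, instance count, roles, and S3 path are illustrative assumptions rather than values prescribed by this guide.

```python
# Minimal sketch: create a comparable EMR cluster with boto3.
# Names, instance types, roles, and S3 paths are placeholder assumptions.
import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="geoanalytics-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": n} for n in
                  ("Hadoop", "Hive", "Spark", "JupyterHub", "JupyterEnterpriseGateway")],
    BootstrapActions=[{
        "Name": "Install GeoAnalytics Engine",
        "ScriptBootstrapAction": {"Path": "s3://testbucket/geoanalytics-install.sh"},
    }],
    Configurations=[{
        "Classification": "spark-defaults",
        "Properties": {
            "spark.plugins": "com.esri.geoanalytics.Plugin",
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.kryo.registrator": "com.esri.geoanalytics.KryoRegistrator",
        },
    }],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for notebooks
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
)
print(response["JobFlowId"])  # the new cluster ID, e.g. j-XXXXXXXXXXXXX
```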
(Optional) Check cluster status and view logs
- If the cluster is created successfully, you should see Running or Waiting next to the cluster name. Under the cluster list in the EMR console, select the down arrow next to the cluster name. You should see the Master and Core nodes both show running.
- If the cluster didn't create successfully, you should see Terminated with errors next to the cluster name. To check logs related to the errors, click on the cluster in the cluster list in the EMR console, click the folder icon next to Log URL under the Summary tab, and select the subfolder for the component that is relevant to the error. For example, if the cluster failed to provision, look for the subfolder called `provision-node` for more details.
- If the cluster terminated with unclear errors, check out the EMR documentation on troubleshooting failed clusters for more tips on triaging cluster creation failures.
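The same status information is available programmatically. Below is a minimal sketch using boto3; the cluster ID is a placeholder.

```python
# Minimal sketch: poll cluster state with boto3 instead of the console.
import boto3

emr = boto3.client("emr")
status = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]["Status"]
print(status["State"])                      # e.g. RUNNING, WAITING, TERMINATED_WITH_ERRORS
print(status.get("StateChangeReason", {}))  # termination reason details, when present
```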
Authorize GeoAnalytics Engine
- Create a Workspace in your EMR Studio to run notebook code, or open an existing one, and attach it to your cluster.
- Import the geoanalytics library and authorize it using your username and password or a license file. See Authorization for more information. For example:

```python
import geoanalytics
geoanalytics.auth(username="User1", password="p@ssw0rd")
```
- Try out the API by importing the SQL functions as an easy-to-use alias like `ST` and listing the first 20 functions in a notebook cell:

```python
from geoanalytics.sql import functions as ST

spark.sql("show user functions like 'ST_*'").show()
```
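For a quick smoke test beyond listing the functions, you can evaluate one of them directly. The snippet below is a minimal sketch assuming the standard `ST_Point` constructor appears in the function list; the coordinates are arbitrary example values.

```python
# Minimal sketch: confirm a spatial SQL function evaluates end to end.
spark.sql("SELECT ST_Point(-122.68, 45.52) AS geometry").show(truncate=False)
```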
What’s next?
You can now use any SQL function, track function, or analysis tool in the geoanalytics module.
See Data sources and Using DataFrames to learn more about how to access your data from your notebook. Also see Visualize results to get started with viewing your data on a map. For examples of what else is possible with GeoAnalytics Engine, check out the sample notebooks, tutorials, and blog posts.