Spark Local Mode

Apache Spark supports a local deployment mode that lets you run PySpark code using your personal computer's resources as a single-node cluster. This mode is useful for testing your workflow before using resources on a larger Spark cluster. For example, you might write code on your personal computer using a subset of your data before deploying a full-scale Spark cluster in the cloud. This would lower your overall compute time in the cloud and reduce costs.

The following steps explain how to install Apache Spark and GeoAnalytics Engine on Windows or Linux using Spark in local mode. Once complete, you will be able to run PySpark and GeoAnalytics Engine code in a Python notebook, the PySpark shell, or with a Python script.

Prerequisites:

Note that some versions of Java or Python are deprecated in some versions of Spark. See Dependencies for details.

Install Apache Hadoop

GeoAnalytics Engine requires Hadoop binaries to be installed when reading from or writing to shapefiles. Hadoop is also required when reading from or writing to other file formats and storage systems that Spark supports, including Parquet, S3, and others.

To install Hadoop on Linux, download the binaries directly from Apache and unpack the distribution as described in Hadoop documentation.

To install Hadoop on Windows, download the Windows binaries from a third party or build them yourself. At a minimum you must have winutils.exe and hadoop.dll staged on your machine at <install location>\Hadoop\bin\.

For both Linux and Windows, set the HADOOP_HOME environment variable to the Hadoop install location and add its bin directory to your path (%HADOOP_HOME%\bin on Windows, $HADOOP_HOME/bin on Linux). For example:

Windows:
set HADOOP_HOME=C:\Hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin
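
The Hadoop setup above can be sanity-checked before moving on. The following is a minimal sketch, not part of GeoAnalytics Engine: the function name `check_hadoop_home` is hypothetical, and the file list assumes the Windows requirement stated above (winutils.exe and hadoop.dll in the bin folder). On Linux, pass an empty tuple for `required_files`.

```python
import os
from pathlib import Path

# Files the text above says must be staged in <install location>\Hadoop\bin on Windows.
REQUIRED_WINDOWS_FILES = ("winutils.exe", "hadoop.dll")


def check_hadoop_home(env=None, required_files=REQUIRED_WINDOWS_FILES):
    """Return a list of problems found; an empty list means the check passed.

    `env` defaults to the current process environment. Passing a plain dict
    makes the check easy to try without changing real environment variables.
    """
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return ["HADOOP_HOME is not set"]

    bin_dir = Path(hadoop_home) / "bin"
    if not bin_dir.is_dir():
        return [f"{bin_dir} does not exist"]

    # Report every required file that is missing from the bin directory.
    return [
        f"{name} not found in {bin_dir}"
        for name in required_files
        if not (bin_dir / name).is_file()
    ]
```

Calling `check_hadoop_home()` in a new command prompt (so the updated variables are visible) and printing the returned list shows what, if anything, still needs to be staged.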

Install Apache Spark and PySpark

Download Apache Spark. Any supported version of Spark will work, but the release should support the versions of Java and Python you have installed.

Set the required environment variables:

Set the SPARK_HOME environment variable to the Spark install directory and add %SPARK_HOME%\bin to your Path variable. For example:

Windows:
set SPARK_HOME=C:\Spark\spark-3.2.0-bin-hadoop2.7
set PATH=%PATH%;%SPARK_HOME%\bin

Set the PYSPARK_PYTHON environment variable to the path of the Python executable you're using, for example:

Windows:
set PYSPARK_PYTHON=C:\Python37\python.exe

If you want to use GeoAnalytics Engine in a notebook, set the PYSPARK_DRIVER_PYTHON environment variable to the path of a Python notebook executable, for example:

Windows:
set PYSPARK_DRIVER_PYTHON=C:\Python37\Scripts\jupyter-notebook.exe

If you want to use GeoAnalytics Engine via the PySpark shell or by running Python scripts, skip this step.

Ensure that JAVA_HOME is set and that %JAVA_HOME%\bin is in your Path environment variable. If not, set it using:

Windows:
set JAVA_HOME=C:\Java
set PATH=%PATH%;%JAVA_HOME%\bin
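
Once the variables above are set, a short script can confirm that nothing was missed. This is a minimal sketch under stated assumptions: the helper names `missing_variables` and `bin_dirs_missing_from_path` are hypothetical, and PYSPARK_DRIVER_PYTHON is left out of the required list because it is only needed for notebooks.

```python
import os

# Variables set in the steps above. PYSPARK_DRIVER_PYTHON is optional
# (notebook use only), so it is not checked here.
REQUIRED_VARS = ("SPARK_HOME", "PYSPARK_PYTHON", "JAVA_HOME")


def _normalize(path):
    """Normalize a path so that comparisons ignore case (on Windows) and trailing slashes."""
    return os.path.normcase(os.path.normpath(path))


def missing_variables(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def bin_dirs_missing_from_path(env=None, homes=("SPARK_HOME", "JAVA_HOME")):
    """Return each <HOME>/bin directory that is set but not listed on PATH."""
    env = os.environ if env is None else env
    entries = {_normalize(p) for p in env.get("PATH", "").split(os.pathsep) if p}
    missing = []
    for name in homes:
        home = env.get(name)
        if not home:
            continue  # already reported by missing_variables()
        bin_dir = os.path.join(home, "bin")
        if _normalize(bin_dir) not in entries:
            missing.append(bin_dir)
    return missing
```

Both functions accept a plain dictionary in place of the real environment, which makes it possible to try them without changing any system settings.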

Install PySpark with pip, conda, or by manually installing the package. For more information, see PySpark Installation. Below is an example using pip.
pip install pyspark
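
To confirm that the package was installed into the Python environment you intend to use, you can query its version from that interpreter. A minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is part of the standard library); the function name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="pyspark"):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("pyspark is not installed in this Python environment")
else:
    print(f"pyspark {found} is installed")
```

Run this with the same executable that PYSPARK_PYTHON points to, since a different interpreter may have a different set of packages.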

Start a PySpark session with GeoAnalytics Engine

Copy the jar and zip install files to your computer.

Open a command prompt and run the command below. Change the paths to the jar and zip files before running. You can also change the amount of memory available to Spark by updating the value for spark.driver.memory. If you set PYSPARK_DRIVER_PYTHON to a Python notebook, the notebook application will open and the geoanalytics module will be available to import in any notebook you create. If you are using the PySpark shell or running a script, you can import geoanalytics as soon as PySpark starts.

Windows:
pyspark --jars C:\engine\geoanalytics.jar ^
        --py-files C:\engine\geoanalytics.zip ^
        --conf spark.plugins=com.esri.geoanalytics.Plugin ^
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer ^
        --conf spark.kryo.registrator=com.esri.geoanalytics.KryoRegistrator ^
        --conf spark.driver.memory=5g

If you need to perform a transformation that requires supplementary projection data, add the projection data jars to the --jars argument. Similarly, if you need to use geocoding or network analysis tools, add the file path of geoanalytics-natives.jar to the --jars argument. For example:

Windows:
pyspark --jars C:\engine\geoanalytics.jar,C:\engine\esri-projection-geographic-north-america.jar,C:\engine\esri-projection-geographic-south-america.jar,C:\engine\geoanalytics-natives.jar ^
        ...
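
Because the --jars value grows as optional jars are added, it can help to assemble the command programmatically. The sketch below only builds the same argument list shown above; the function names `build_pyspark_command` and `launch` are hypothetical, and the file paths in the usage note are placeholders to replace with your own.

```python
import shutil
import subprocess


def build_pyspark_command(engine_jar, engine_zip, extra_jars=(), driver_memory="5g"):
    """Build the pyspark command line from the steps above as a list of arguments.

    `extra_jars` holds optional jars such as projection data or
    geoanalytics-natives.jar. Spark expects --jars as one comma-separated value.
    """
    jars = ",".join([engine_jar, *extra_jars])
    return [
        "pyspark",
        "--jars", jars,
        "--py-files", engine_zip,
        "--conf", "spark.plugins=com.esri.geoanalytics.Plugin",
        "--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "--conf", "spark.kryo.registrator=com.esri.geoanalytics.KryoRegistrator",
        "--conf", f"spark.driver.memory={driver_memory}",
    ]


def launch(command):
    """Resolve the pyspark launcher on PATH and run the command in the foreground."""
    executable = shutil.which(command[0])
    if executable is None:
        raise FileNotFoundError("pyspark was not found on PATH; check SPARK_HOME and PATH")
    return subprocess.run([executable, *command[1:]], check=True)
```

For example, `build_pyspark_command(r"C:\engine\geoanalytics.jar", r"C:\engine\geoanalytics.zip", extra_jars=[r"C:\engine\geoanalytics-natives.jar"])` returns the argument list, which can be printed for review or passed to `launch`.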

Authorize GeoAnalytics Engine

If using a notebook, create a new notebook or open an existing one. Otherwise, continue to the next step.
Import the geoanalytics library and authorize it using your username and password or a license file. See Authorization for more information. For example:
import geoanalytics
geoanalytics.auth(username="User1", password="p@ssw0rd")
Try out the API by importing the SQL functions as an easy-to-use alias like ST and listing the first 20 functions in a notebook cell:
from geoanalytics.sql import functions as ST
spark.sql("show user functions like 'ST_*'").show()
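
When a notebook or script is shared, you may prefer not to type a password into the authorization call directly. One option is to read the credentials from environment variables, as in the sketch below. The variable names GAE_USERNAME and GAE_PASSWORD and the helper `credentials_from_env` are assumptions made for illustration; they are not defined by GeoAnalytics Engine. Only the `geoanalytics.auth(username=..., password=...)` call comes from the step above.

```python
import os


def credentials_from_env(env=None, user_var="GAE_USERNAME", password_var="GAE_PASSWORD"):
    """Read a username and password from environment variables.

    The variable names are hypothetical; choose any names you like.
    Raises KeyError if either variable is unset or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in (user_var, password_var) if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variable(s): {', '.join(missing)}")
    return {"username": env[user_var], "password": env[password_var]}


# In a PySpark session started as described above:
#   import geoanalytics
#   geoanalytics.auth(**credentials_from_env())
```

The returned dictionary uses the same keyword names as the authorization call, so it can be unpacked straight into it.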

What's Next?

You can now use any SQL function, track function, or analysis tool in the geoanalytics module.

See Data sources and Using DataFrames to learn more about how to access your data from your notebook. Also see Visualize results to get started with viewing your data on a map. For examples of what else is possible with GeoAnalytics Engine, check out the sample notebooks, tutorials, and blog posts.