Spark Local Mode
Apache Spark supports a local deployment mode that lets you run PySpark code using your personal computer's resources as a single-node cluster. This mode is useful for testing your workflow before using resources on a larger Spark cluster. For example, you might write code on your personal computer against a subset of your data before deploying a full-scale Spark cluster in the cloud, lowering your overall compute time in the cloud and reducing costs.
The following steps explain how to install Apache Spark and GeoAnalytics Engine on Windows or Linux using Spark in local standalone mode. Once complete, you will be able to run PySpark and GeoAnalytics Engine code in a Python notebook, in the PySpark shell, or with a Python script.
Prerequisites:
Note that some versions of Java or Python are deprecated in some versions of Spark. See Dependencies for details.
Install Apache Hadoop
GeoAnalytics Engine requires Hadoop binaries to be installed when reading from or writing to shapefiles. Hadoop is also required when reading from or writing to any distributed storage that Spark supports, including S3, and file formats such as Parquet, among others.
To install Hadoop on Linux, download the binaries directly from Apache and unpack the distribution as described in Hadoop documentation.
To install Hadoop on Windows, download the Windows binaries from a third party or build them yourself. At a minimum, you must have winutils.exe and hadoop.dll staged on your machine at <install location>.
For both Linux and Windows, set the HADOOP_HOME environment variable to the Hadoop install location and add %HADOOP_HOME%\bin (on Linux, $HADOOP_HOME/bin) to your Path variable.
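For example, as a sketch that sets both variables for the current Python process only (the install path is a hypothetical placeholder; use your operating system settings or shell profile for a permanent configuration):

```python
import os

# Hypothetical install location; point this at the directory containing bin\winutils.exe.
os.environ["HADOOP_HOME"] = r"C:\hadoop"

# Prepend the Hadoop bin directory to the Path for this process.
os.environ["PATH"] = os.path.join(os.environ["HADOOP_HOME"], "bin") + os.pathsep + os.environ["PATH"]
```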
Install Apache Spark and PySpark
Download Apache Spark. Any supported version of Spark will work, but the release should support the versions of Java and Python you have installed.
Set the required environment variables:
Set the SPARK_HOME environment variable to the Spark install directory and add %SPARK_HOME%\bin to your Path variable.
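For example, a per-process sketch with a hypothetical Spark install path:

```python
import os

# Hypothetical path; point this at the directory you unpacked the Spark release into.
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.4.1-bin-hadoop3"
os.environ["PATH"] = os.path.join(os.environ["SPARK_HOME"], "bin") + os.pathsep + os.environ["PATH"]
```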
Set the PYSPARK_PYTHON environment variable to the path of the Python executable you're using.
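For example, a sketch that points PySpark at the interpreter currently running this code (any other Python executable path also works):

```python
import os
import sys

# sys.executable is the path of the Python interpreter running this script or notebook.
os.environ["PYSPARK_PYTHON"] = sys.executable
```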
If you want to use GeoAnalytics Engine in a notebook, set the PYSPARK_DRIVER_PYTHON environment variable to the path of a Python notebook executable. If you want to use GeoAnalytics Engine via the PySpark shell or by running Python scripts, skip this step.
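For example, a sketch using a hypothetical path to a Jupyter launcher; substitute the notebook executable you actually use:

```python
import os

# Hypothetical path to a notebook executable.
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Python38\Scripts\jupyter-notebook.exe"
```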
Ensure that JAVA_HOME is set and that %JAVA_HOME%\bin is in your Path environment variable. If not, set it.
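As a sketch, again per-process and with a hypothetical JDK location:

```python
import os

os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-11"  # hypothetical JDK install path
os.environ["PATH"] = os.path.join(os.environ["JAVA_HOME"], "bin") + os.pathsep + os.environ["PATH"]
```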
Install PySpark with pip, conda, or by manually installing the package. For more information, see PySpark Installation. Below is an example using pip.
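A sketch of the pip install, driven from the current interpreter so the package lands in the active environment:

```python
import subprocess
import sys

# Equivalent to running `pip install pyspark` in a shell.
subprocess.check_call([sys.executable, "-m", "pip", "install", "pyspark"])
```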
Start a PySpark session with GeoAnalytics Engine
Copy the jar and zip install files to your computer.
Open a command prompt and run the launch command, changing the paths to the jar and zip file before running. You can also change the amount of memory available to Spark by updating the value for spark.driver.memory. If you set PYSPARK_DRIVER_PYTHON to a Python notebook, the notebook application will open and the geoanalytics module will be available to import in any notebook you create. If you are using the PySpark shell or running a script, you can import geoanalytics as soon as PySpark starts.
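A sketch only, with hypothetical paths and jar name: the commented lines show a rough pyspark launch command, and the Python below builds an equivalent local session:

```python
# Rough equivalent launch command (paths and jar name are hypothetical placeholders):
#   pyspark --jars C:\engine\geoanalytics_2.12-x.y.z.jar ^
#           --py-files C:\engine\geoanalytics.zip ^
#           --conf spark.driver.memory=8g
import sys

from pyspark.sql import SparkSession

# Make the geoanalytics module importable in the driver process.
sys.path.append(r"C:\engine\geoanalytics.zip")  # hypothetical path

spark = (
    SparkSession.builder
    .appName("geoanalytics-local")
    .master("local[*]")                                              # use all local cores
    .config("spark.jars", r"C:\engine\geoanalytics_2.12-x.y.z.jar")  # hypothetical path
    .config("spark.driver.memory", "8g")                             # adjust for your machine
    .getOrCreate()
)
```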
If you need to perform a transformation that requires supplementary projection data, add the projection data jars to the --jars argument.
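For example, as a sketch in which the projection-data jar names are hypothetical placeholders (both --jars and the spark.jars property take a comma-separated list):

```python
from pyspark.sql import SparkSession

# Hypothetical jar paths; list the engine jar and each supplementary projection-data jar.
proj_jars = ",".join([
    r"C:\engine\geoanalytics_2.12-x.y.z.jar",
    r"C:\engine\esri-projection-data-part1.jar",
    r"C:\engine\esri-projection-data-part2.jar",
])

spark = (
    SparkSession.builder
    .config("spark.jars", proj_jars)
    .getOrCreate()
)
```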
Authorize GeoAnalytics Engine
- If using a notebook, create a new notebook or open an existing one. Otherwise, continue to the next step.
Import the geoanalytics library and authorize it using your username and password or a license file. See Authorization for more information.
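For example, a minimal sketch with placeholder credentials, assuming the auth() entry point described in the Authorization topic:

```python
import geoanalytics

# Placeholder credentials; a license file can be used instead of a username and password.
geoanalytics.auth(username="user1", password="p@ssw0rd")
```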
Try out the API by importing the SQL functions as an easy-to-use alias like ST and listing the first 20 functions in a notebook cell:
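A sketch, assuming an active session named spark and the geoanalytics.sql module layout:

```python
from geoanalytics.sql import functions as ST

# show() prints the first 20 rows by default.
spark.sql("show user functions like 'ST_*'").show()
```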
What's Next?
You can now use any SQL function or analysis tool in the geoanalytics module.
See Data sources and Using DataFrames to learn more about how to access your data from your notebook. Also see Visualize results to get started with viewing your data on a map. For examples of what else is possible with GeoAnalytics Engine, check out the sample notebooks, tutorials, and blog posts.