Spark Local Mode

Apache Spark supports a local deployment mode that lets you run PySpark code using your personal computer's resources as a single-node cluster. This mode is useful for testing your workflow before using resources on a larger Spark cluster. For example, you might write and test code on your personal computer using a subset of your data before deploying a full-scale Spark cluster in the cloud. This lowers your overall compute time in the cloud and reduces costs.

The following steps explain how to install Apache Spark and GeoAnalytics Engine on Windows or Linux and run Spark in local mode. Once complete, you will be able to run PySpark and GeoAnalytics Engine code in a Python notebook, in the PySpark shell, or with a Python script.

Prerequisites:

Note that some versions of Java or Python are deprecated in some versions of Spark. See Dependencies for details.

Install Apache Hadoop

GeoAnalytics Engine requires Hadoop binaries to be installed when reading from or writing to shapefiles. Hadoop is also required when reading from or writing to any file format or distributed file system that Spark supports through Hadoop, including Parquet, S3, and others.

To install Hadoop on Linux, download the binaries directly from Apache and unpack the distribution as described in the Hadoop documentation.

To install Hadoop on Windows, download the Windows binaries from a third party or build them yourself. At a minimum you must have winutils.exe and hadoop.dll staged on your machine at <install location>\Hadoop\bin\.

For both Linux and Windows, set the HADOOP_HOME environment variable to the Hadoop install location and add its bin directory (%HADOOP_HOME%\bin on Windows, $HADOOP_HOME/bin on Linux) to your Path variable. For example:

Windows:
set HADOOP_HOME=C:\Hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin
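
Before moving on, it can help to confirm that the Hadoop settings described above took effect. The following is a minimal sketch, not part of the official install procedure: the function name check_hadoop_home is hypothetical, and it only checks the items this guide names (HADOOP_HOME, its bin directory on the path, and winutils.exe and hadoop.dll on Windows).

```python
import os
from pathlib import Path


def check_hadoop_home(env=None, windows=None):
    """Return a list of problems found with the Hadoop environment settings.

    check_hadoop_home is a hypothetical helper written for this guide.
    """
    env = os.environ if env is None else env
    windows = (os.name == "nt") if windows is None else windows

    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return ["HADOOP_HOME is not set"]

    problems = []
    bin_dir = Path(hadoop_home) / "bin"
    if not bin_dir.is_dir():
        problems.append(f"{bin_dir} does not exist")

    # The guide asks for these two files on Windows only.
    if windows:
        for name in ("winutils.exe", "hadoop.dll"):
            if not (bin_dir / name).is_file():
                problems.append(f"{name} is missing from {bin_dir}")

    # PATH entries are separated by os.pathsep (';' on Windows, ':' on Linux).
    path_entries = [Path(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    for line in check_hadoop_home() or ["Hadoop environment looks complete"]:
        print(line)
```

Run the script from the same command prompt in which you set the variables, because values set with set apply only to that session.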

Install Apache Spark and PySpark

  1. Download Apache Spark. Any supported version of Spark will work, but the release should support the versions of Java and Python you have installed.

  2. Set the required environment variables:

    • Set the SPARK_HOME environment variable to the Spark install directory and add %SPARK_HOME%\bin to your Path variable. For example:

      Windows:
      set SPARK_HOME=C:\Spark\spark-3.2.0-bin-hadoop2.7
      set PATH=%PATH%;%SPARK_HOME%\bin
      
    • Set the PYSPARK_PYTHON environment variable to the path of the Python executable you're using, for example:

      Windows:
      set PYSPARK_PYTHON=C:\Python37\python.exe
      
    • If you want to use GeoAnalytics Engine in a notebook, set the PYSPARK_DRIVER_PYTHON environment variable to the path of a Python notebook executable, for example:

      Windows:
      set PYSPARK_DRIVER_PYTHON=C:\Python37\Scripts\jupyter-notebook.exe
      

      If you want to use GeoAnalytics Engine via the PySpark shell or by running Python scripts, skip this step.

    • Ensure that JAVA_HOME is set and that %JAVA_HOME%\bin is in your Path environment variable. If not, set it using:

      Windows:
      set JAVA_HOME=C:\Java
      set PATH=%PATH%;%JAVA_HOME%\bin
      
  3. Install PySpark using pip or conda, or by manually installing the package. For more information, see PySpark Installation. Below is an example using pip.

    pip install pyspark
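
The environment variables and the PySpark install from the steps above can be checked together with a short script. This is a rough sketch under stated assumptions: the helper names missing_variables and pyspark_installed are hypothetical, and the check assumes each variable holds a file system path, as this guide describes. PYSPARK_DRIVER_PYTHON is left out because it is only needed for notebooks.

```python
import importlib.util
import os
from pathlib import Path

# Variables this guide asks you to set for every workflow.
REQUIRED_VARS = ("SPARK_HOME", "PYSPARK_PYTHON", "JAVA_HOME")


def missing_variables(env=None):
    """Return the names of variables that are unset or point to a missing path."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        # Assumes the value is a path; a bare command name would be reported here.
        if not value or not Path(value).exists():
            missing.append(name)
    return missing


def pyspark_installed():
    """Return True if the pyspark package can be found by this Python interpreter."""
    return importlib.util.find_spec("pyspark") is not None


if __name__ == "__main__":
    for name in missing_variables():
        print(f"{name} is unset or points to a path that does not exist")
    print("pyspark importable:", pyspark_installed())
```

Run the script with the same Python executable that PYSPARK_PYTHON points to, since the pyspark check applies only to the interpreter running it.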

Start a PySpark session with GeoAnalytics Engine

  1. Copy the jar and zip install files to your computer.

  2. Open a command prompt and run the command below. Change the paths to the jar and zip files before running. You can also change the amount of memory available to Spark by updating the value for spark.driver.memory. If you set PYSPARK_DRIVER_PYTHON to a Python notebook executable, the notebook application will open and the geoanalytics module will be available to import in any notebook you create. If you are using the PySpark shell or running a script, you can import geoanalytics as soon as PySpark starts.

    Windows:
    pyspark --jars C:\engine\geoanalytics.jar ^
            --py-files C:\engine\geoanalytics.zip ^
            --conf spark.plugins=com.esri.geoanalytics.Plugin ^
            --conf spark.serializer=org.apache.spark.serializer.KryoSerializer ^
            --conf spark.kryo.registrator=com.esri.geoanalytics.KryoRegistrator ^
            --conf spark.driver.memory=5g
    

    If you need to perform a transformation that requires supplementary projection data, add the projection data jars to the --jars argument. For example:

    Windows:
    pyspark --jars C:\engine\geoanalytics.jar,C:\engine\esri-projection-data1.jar,C:\engine\esri-projection-data2.jar ^
            ...
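
If you start PySpark often with different jar locations or memory settings, you could assemble the command in Python instead of editing it by hand. The sketch below is illustrative only: build_pyspark_command is a hypothetical helper, and it reproduces just the arguments shown in the commands above, including the comma-separated --jars value used for projection data jars.

```python
def build_pyspark_command(engine_jar, engine_zip, extra_jars=(), driver_memory="5g"):
    """Return the pyspark command from this guide as a list of arguments.

    build_pyspark_command is a hypothetical helper; the options and values
    mirror the command shown in the step above.
    """
    # --jars takes one comma-separated value with no spaces.
    jars = ",".join([engine_jar, *extra_jars])
    conf = {
        "spark.plugins": "com.esri.geoanalytics.Plugin",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrator": "com.esri.geoanalytics.KryoRegistrator",
        "spark.driver.memory": driver_memory,
    }
    command = ["pyspark", "--jars", jars, "--py-files", engine_zip]
    for key, value in conf.items():
        command += ["--conf", f"{key}={value}"]
    return command


if __name__ == "__main__":
    # Example paths taken from the commands above; change them to your own.
    command = build_pyspark_command(
        r"C:\engine\geoanalytics.jar",
        r"C:\engine\geoanalytics.zip",
        extra_jars=[r"C:\engine\esri-projection-data1.jar"],
    )
    print(" ".join(command))
```

The script prints a single-line command that you can paste into a command prompt. Paths that contain spaces would need quoting, which this sketch does not handle.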
    

Authorize GeoAnalytics Engine

  1. If using a notebook, create a new notebook or open an existing one. Otherwise, continue to the next step.
  2. Import the geoanalytics library and authorize it using your username and password or another supported authorization method. See Licensing and Authorization for more information. For example:

    import geoanalytics
    geoanalytics.auth(username="User1", password="p@ssw0rd")
  3. Try out the API by importing the SQL functions as an easy-to-use alias like ST and listing the first 20 functions in a notebook cell:

    from geoanalytics.sql import functions as ST
    spark.sql("show user functions like 'ST_*'").show()
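
The authorization example in step 2 writes the password directly in the code. One possible alternative, if you share notebooks or scripts, is to read the credentials from environment variables. This is a sketch under assumptions: the variable names GEOANALYTICS_USERNAME and GEOANALYTICS_PASSWORD and the helper read_credentials are hypothetical and are not defined by GeoAnalytics Engine. Only the geoanalytics.auth call comes from the step above.

```python
import os


def read_credentials(env=None):
    """Return (username, password) read from two hypothetical environment variables."""
    env = os.environ if env is None else env
    try:
        return env["GEOANALYTICS_USERNAME"], env["GEOANALYTICS_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Set {missing.args[0]} before starting PySpark") from None


if __name__ == "__main__":
    try:
        username, password = read_credentials()
    except RuntimeError as error:
        print(error)
    else:
        # Requires a PySpark session started with the GeoAnalytics Engine jar and zip.
        import geoanalytics

        geoanalytics.auth(username=username, password=password)
```

Set both variables in the command prompt before running the pyspark command so that the notebook or shell inherits them.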

What's Next?

You can now use any SQL function or analysis tool in the geoanalytics module.

See Data sources and Using DataFrames to learn more about how to access your data from your notebook. Also see Visualize results to get started with viewing your data on a map. For examples of what else is possible with GeoAnalytics Engine, check out the sample notebooks, tutorials, and blog posts.
