Make your data accessible to the GeoAnalytics Server

Collecting, storing, managing, and analyzing large quantities of numbers, figures, and files is not a new business activity. However, referring to these numbers, figures, and files as big data is relatively recent.

The GeoAnalytics Server expands your ArcGIS Enterprise deployment, providing functionality and services to process and analyze big data.

In order to run the GeoAnalytics tools, your data needs to be in one of the following formats:

Note: Please refer to the features module guide to understand more about feature layers and feature collections.

Big data file shares

The GeoAnalytics server allows you to register datasets in a format called a big data file share. Big data file shares are items on your Web GIS, and can reference data in any of the following data sources:

Storing your data in a big data file share datastore benefits you because:

  • The GeoAnalytics tools read your data only when they are executed, which allows you to update or add data to these locations.
  • You can use partitioned data as a single dataset.
  • Big data file shares are flexible in how time and geometry are defined, allowing data in multiple formats in a single dataset.

When writing results to a big data file share, you can use the following output GeoAnalytics Tools:

  • File share
  • HDFS
  • Cloud store

The following file types are supported as datasets for input and output in big data file shares:

  • Delimited files (such as .csv, .tsv, and .txt)
  • Shapefiles (.shp)
  • Parquet files (.gz.parquet)
  • ORC files (orc.crc)

Store data where all ArcGIS Server machines can access it

To allow your ArcGIS Server sites to access the data resources you want to publish, all machines in the ArcGIS Server site must have access to the resource. For example, when you publish a map as a service, the map and all the data for the map's layers must be accessible to all ArcGIS Server machines.

You need to do the following to make your data accessible to ArcGIS Server:

  • Store your data in a location that all machines in your ArcGIS Server site can access.
  • Grant permissions to allow ArcGIS Server to access the data. If your data is stored in a folder or a database that you access using operating system authentication, you must grant the ArcGIS Server account permissions to these locations. The ArcGIS Server account is the account you used to install ArcGIS Server, not the primary site administrator specified when the ArcGIS Server site was created. If your data is stored in a database that you access using database authentication, the database user you provide when registering the database must have permissions to the data.
  • Register your data store with the ArcGIS Server site.

Prepare your data to be registered as big data file share

To register a file share or an HDFS, you need to format your datasets as subfolders within a single parent folder and register the parent folder. This parent folder becomes a datastore, and each subfolder becomes a dataset. For instance, to register two datasets representing earthquakes and hurricanes, your folder hierarchy would look like the following:

|---FileShareFolder         <-- register as a datastore
   |---Earthquakes          <-- dataset 1
      |---1960              
         |---01_1960.csv
         |---02_1960.csv
      |---1961              
         |---01_1961.csv
         |---02_1961.csv
   |---Hurricanes           <-- dataset 2
      |---atlantic_hur.shp
      |---pacific_hur.shp

Learn more about preparing your big data file share datasets here.
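
Before registering a parent folder, it can help to confirm that it has the expected shape: one subfolder per dataset, with the data files inside. The following is a minimal sketch using only the Python standard library. The function names summarize_file_share and build_example_share are hypothetical helpers written for this illustration and are not part of the ArcGIS API for Python; the example hierarchy is recreated with empty placeholder files in a temporary folder.

```python
import tempfile
from collections import Counter
from pathlib import Path


def summarize_file_share(parent_folder):
    """Map each dataset subfolder to a count of the file suffixes inside it."""
    summary = {}
    for dataset_dir in sorted(p for p in Path(parent_folder).iterdir() if p.is_dir()):
        # Datasets may be nested (for example, one folder per year), so walk recursively.
        suffixes = Counter(
            f.suffix.lower() for f in dataset_dir.rglob("*") if f.is_file()
        )
        summary[dataset_dir.name] = dict(suffixes)
    return summary


def build_example_share(root):
    """Recreate the example hierarchy with empty placeholder files."""
    relative_files = [
        "Earthquakes/1960/01_1960.csv",
        "Earthquakes/1960/02_1960.csv",
        "Earthquakes/1961/01_1961.csv",
        "Earthquakes/1961/02_1961.csv",
        "Hurricanes/atlantic_hur.shp",
        "Hurricanes/pacific_hur.shp",
    ]
    for rel in relative_files:
        target = Path(root) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()


with tempfile.TemporaryDirectory() as tmp:
    share = Path(tmp) / "FileShareFolder"
    build_example_share(share)
    # Each top-level subfolder is reported as one dataset.
    for name, suffixes in summarize_file_share(share).items():
        print(name, suffixes)
```

Files placed directly in the parent folder are ignored by this sketch, since only subfolders are treated as datasets.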

Input
# Connect to enterprise GIS
from arcgis.gis import GIS
import arcgis.geoanalytics
portal_gis = GIS("your_enterprise_portal")

Ensuring your GIS supports GeoAnalytics

Before executing a tool, we need to ensure that the ArcGIS Enterprise GIS is set up with a licensed GeoAnalytics Server. To do so, call the is_supported() method after connecting to your Enterprise portal. See the Components of ArcGIS URLs documentation for details on the URLs to enter in the GIS parameters based on your particular Enterprise configuration.

Input
# Verify that GeoAnalytics is supported 
arcgis.geoanalytics.is_supported()
Output
True

Registering big data file shares

The get_datastores() method of the geoanalytics module returns a DatastoreManager object that lets you search for and manage the big data file share items as Python API Datastore objects on your GeoAnalytics server.

Input
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
bigdata_datastore_manager
Output
<DatastoreManager for https://pythonapi.playground.esri.com/ga/admin>

The data prepared above can be stored in one of the following locations, depending on the size of your data, the number of people who will access the web service, and how frequently the data changes.

  • Store data locally on each ArcGIS Server machine
  • Store data in a shared directory
  • Store data in a database
  • Store caches, imagery, and big data files in a cloud storage container

Next, we will learn how to register a cloud store as a datastore and register its data as a big data file share. After that, we will learn how to register data from a local or shared directory.

Register the cloud store with your GeoAnalytics Server

If your data is stored in a cloud store, you can register it as a datastore using the add_cloudstore() method. This method can register a cloud store for Azure Data Lake Storage, Amazon, Alibaba, or Azure.

Input
conn_dict = {"accessKeyId":"<provide key here>",
             "secretAccessKey":"<provide secret key here>",
             "region":"<provide region here>",
             "defaultEndpointsProtocol":"<provide https or http here>",
             "credentialType":"accesskey"}
Input
import json
conn_str = json.dumps(conn_dict)
Input
datastore_obj = bigdata_datastore_manager.add_cloudstore(name='cloud_store', 
                                         conn_str=conn_str, 
                                         object_store="esri-delhi-store", 
                                         provider='amazon')
Created cloud store for cloud_store
Input
datastore_obj.path
Output
'/cloudStores/cloud_store'
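
Typing access keys directly into a notebook makes them easy to leak. As one possible alternative, the connection string can be assembled from environment variables. This is a minimal sketch: the helper build_conn_str and the variable names GA_ACCESS_KEY_ID, GA_SECRET_ACCESS_KEY, GA_REGION, and GA_ENDPOINT_PROTOCOL are assumptions made for this illustration, while the dictionary keys are the ones used in the example above.

```python
import json
import os


def build_conn_str(env=None):
    """Build the JSON connection string from environment variables.

    The variable names read here are hypothetical; use whatever names
    your deployment defines.
    """
    env = os.environ if env is None else env
    conn_dict = {
        "accessKeyId": env.get("GA_ACCESS_KEY_ID", ""),
        "secretAccessKey": env.get("GA_SECRET_ACCESS_KEY", ""),
        "region": env.get("GA_REGION", ""),
        # Assumption for this sketch: default to https when nothing is set.
        "defaultEndpointsProtocol": env.get("GA_ENDPOINT_PROTOCOL", "https"),
        "credentialType": "accesskey",
    }
    missing = [key for key, value in conn_dict.items() if not value]
    if missing:
        raise ValueError("Missing connection values: " + ", ".join(missing))
    return json.dumps(conn_dict)


# Demonstration with a plain dict standing in for os.environ.
example_env = {
    "GA_ACCESS_KEY_ID": "<provide key here>",
    "GA_SECRET_ACCESS_KEY": "<provide secret key here>",
    "GA_REGION": "<provide region here>",
}
conn_str_from_env = build_conn_str(example_env)
print(sorted(json.loads(conn_str_from_env)))
```

The returned string has the same shape as conn_str above and could be passed to add_cloudstore in its place.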

Register the data on cloud store as a big data file share

You can register your data as a big data file share using the add_bigdata() method on a DatastoreManager object. Ensure the datasets are stored in a format compatible with the GeoAnalytics server, as demonstrated earlier in this guide.

item = bigdata_datastore_manager.add_bigdata("Name_of_big_data_file_share", r"\\<file_share_path>\<big_data_folder>")

Input
# Use the path of the cloud store registered above (datastore_obj)
data_item1 = bigdata_datastore_manager.add_bigdata(name="ServiceCallsOrleans", 
                                                  server_path=datastore_obj.path, 
                                                  connection_type='dataStore')
Created Big Data file share for ServiceCallsOrleans

Register data using a local directory or shared directory

If your data is stored locally on your system, you can register it directly by passing the shared path of that data to the add_bigdata() method. We recommend using shared paths so that other servers can access the data while running operations.

Input
data_item2 = bigdata_datastore_manager.add_bigdata("ServiceCallsOrleans", r"\\machinename\datastore")
Created Big Data file share for ServiceCallsOrleans
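
Because shared paths are recommended, a quick check that a path is in UNC form (two leading backslashes, then a machine name and a share name) can catch a local drive path before it is registered. This is a minimal sketch using the standard library; is_unc_path is a hypothetical helper written for this illustration, and it only inspects the text of the path without contacting the machine.

```python
from pathlib import PureWindowsPath


def is_unc_path(path):
    """Return True when the path starts with a machine-and-share prefix."""
    # For UNC paths, PureWindowsPath reports the machine and share as the drive.
    drive = PureWindowsPath(path).drive
    return drive.startswith("\\\\")


for candidate in [r"\\machinename\datastore", r"C:\data\datastore"]:
    print(candidate, is_unc_path(candidate))
```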

To learn more about how to register other formats as a file share, such as HDFS or Hive, click here.

Inspecting big data file shares

Now that you have successfully registered your data with ArcGIS GeoAnalytics Server to run GeoAnalytics tools, it's time to inspect the file share to verify that your data is registered in the desired format.

Searching for big data file shares on datastore

Here, we use the search() method on a DatastoreManager object to search for Datastores. In the output below, note the item titled ServiceCallsOrleans, which is registered as a big data file share in the portal.

Input
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares
Output
[<Datastore title:"/bigDataFileShares/NYC_taxi_data15" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/all_hurricanes" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/NYCdata" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_1848_1900" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/ServiceCallsOrleans" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_dask_csv" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_dask_shp" type:"bigDataFileShare">,
 <Datastore title:"/cloudStores/cloud_store" type:"cloudStore">]
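
Search results are returned as a list, so selecting an item by its position can break when datastores are added or removed. A possible alternative is to select by path. This is a minimal sketch that assumes each result exposes the datapath property used elsewhere in this guide; find_by_datapath is a hypothetical helper, and SimpleNamespace objects stand in for real Datastore objects so the example runs without a server.

```python
from types import SimpleNamespace


def find_by_datapath(datastores, datapath):
    """Return the first item whose datapath matches, or None if there is none."""
    for datastore in datastores:
        if getattr(datastore, "datapath", None) == datapath:
            return datastore
    return None


# Stand-ins for the list returned by bigdata_datastore_manager.search().
stand_ins = [
    SimpleNamespace(datapath="/bigDataFileShares/NYCdata"),
    SimpleNamespace(datapath="/bigDataFileShares/ServiceCallsOrleans"),
    SimpleNamespace(datapath="/cloudStores/cloud_store"),
]

match = find_by_datapath(stand_ins, "/bigDataFileShares/ServiceCallsOrleans")
print(match.datapath)
```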

Get datasets from a big data file share datastore

Let's use the datasets property on a Datastore object to find out how many datasets are available and then list them.

Input
file_share_folder = bigdata_fileshares[4]
file_share_datasets = file_share_folder.datasets
len(file_share_datasets)
Output
1
Input
for i, dataset in enumerate(file_share_datasets):
    print("{:<10}{:<3}{}".format("Dataset " + str(i) + ":", "", dataset['name']))
Dataset 0:   yearly_calls
Input
# let's view the json schema of the calls dataset for a sample
file_share_datasets[0]
Output
{'name': 'yearly_calls',
 'format': {'quoteChar': '"',
  'fieldDelimiter': ',',
  'hasHeaderRow': True,
  'encoding': 'UTF-8',
  'escapeChar': '"',
  'recordTerminator': '\n',
  'type': 'delimited',
  'extension': 'csv'},
 'schema': {'fields': [{'name': 'NOPD_Item', 'type': 'esriFieldTypeString'},
   {'name': 'Type_', 'type': 'esriFieldTypeString'},
   {'name': 'TypeText', 'type': 'esriFieldTypeString'},
   {'name': 'Priority', 'type': 'esriFieldTypeString'},
   {'name': 'MapX', 'type': 'esriFieldTypeDouble'},
   {'name': 'MapY', 'type': 'esriFieldTypeDouble'},
   {'name': 'TimeCreate', 'type': 'esriFieldTypeString'},
   {'name': 'TimeDispatch', 'type': 'esriFieldTypeString'},
   {'name': 'TimeArrive', 'type': 'esriFieldTypeString'},
   {'name': 'TimeClosed', 'type': 'esriFieldTypeString'},
   {'name': 'Disposition', 'type': 'esriFieldTypeString'},
   {'name': 'DispositionText', 'type': 'esriFieldTypeString'},
   {'name': 'BLOCK_ADDRESS', 'type': 'esriFieldTypeString'},
   {'name': 'Zip', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'PoliceDistrict', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'Location', 'type': 'esriFieldTypeString'}]},
 'geometry': {'geometryType': 'esriGeometryPoint',
  'spatialReference': {'wkid': 102682, 'latestWkid': 3452},
  'fields': [{'name': 'MapX', 'formats': ['x']},
   {'name': 'MapY', 'formats': ['y']}]},
 'time': {'timeType': 'instant',
  'timeReference': {'timeZone': 'UTC'},
  'fields': [{'name': 'TimeCreate', 'formats': ['MM/dd/yyyy hh:mm:ss a']}]}}
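
The full dataset description is long, so a short summary can make it easier to confirm the geometry and time settings at a glance. This is a minimal sketch that works on a plain dictionary shaped like the output above; describe_dataset is a hypothetical helper, and the sample below is a trimmed copy of that output rather than a live result.

```python
def describe_dataset(dataset):
    """Summarize the name, file type, and geometry/time fields of one dataset."""
    geometry = dataset.get("geometry", {})
    time_info = dataset.get("time", {})
    return {
        "name": dataset.get("name"),
        "file_type": dataset.get("format", {}).get("type"),
        "field_count": len(dataset.get("schema", {}).get("fields", [])),
        "geometry_fields": [f["name"] for f in geometry.get("fields", [])],
        "wkid": geometry.get("spatialReference", {}).get("wkid"),
        "time_fields": [f["name"] for f in time_info.get("fields", [])],
    }


# Trimmed copy of the dataset description shown above (only three schema fields kept).
sample = {
    "name": "yearly_calls",
    "format": {"type": "delimited", "extension": "csv"},
    "schema": {"fields": [
        {"name": "MapX", "type": "esriFieldTypeDouble"},
        {"name": "MapY", "type": "esriFieldTypeDouble"},
        {"name": "TimeCreate", "type": "esriFieldTypeString"},
    ]},
    "geometry": {
        "geometryType": "esriGeometryPoint",
        "spatialReference": {"wkid": 102682, "latestWkid": 3452},
        "fields": [{"name": "MapX", "formats": ["x"]},
                   {"name": "MapY", "formats": ["y"]}],
    },
    "time": {
        "timeType": "instant",
        "fields": [{"name": "TimeCreate", "formats": ["MM/dd/yyyy hh:mm:ss a"]}],
    },
}

print(describe_dataset(sample))
```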

Get path of the big data file share item

Input
file_share_folder.datapath
Output
'/bigDataFileShares/ServiceCallsOrleans'

Check if the data is accessible to all GeoAnalytics servers

You can validate the data store connection to confirm that the ArcGIS Server site can communicate with a data store.

Input
file_share_folder.validate()
Output
True

If ArcGIS Server did not connect, confirm that the data store is available. For example, ensure that the machine the data store is on is running and has network connectivity.
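
When the data store is briefly unreachable, for example while a machine restarts, retrying the check a few times can be more useful than a single call. This is a minimal sketch that assumes validate() returns True or False as in the output above; it does not handle the case where a failed check raises an exception. The names validate_with_retry and FlakyStore are hypothetical, and FlakyStore is only a stand-in so the example runs without a server.

```python
import time


def validate_with_retry(datastore, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call datastore.validate() up to `attempts` times; True on first success."""
    for attempt in range(1, attempts + 1):
        if datastore.validate():
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # give the share or network time to recover
    return False


class FlakyStore:
    """Stand-in for a Datastore: fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures

    def validate(self):
        if self.failures > 0:
            self.failures -= 1
            return False
        return True


# Pass a no-op sleep so the demonstration does not actually wait.
print(validate_with_retry(FlakyStore(failures=2), sleep=lambda seconds: None))
```

With a real datastore, the call would look like validate_with_retry(file_share_folder), leaving the default delay in place.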

Get schema of the data

Once a big data file share is created, the GeoAnalytics Server samples the datasets to generate a manifest that outlines the data schema and specifies any time and geometry fields. This process can take a few minutes depending on the size of your data. Once processing completes, querying the manifest property returns the schema of each dataset in your big data file share.

To learn more about the big data file share manifest, see Understanding the big data file share manifest in the ArcGIS Server help.

Input
manifest = file_share_folder.manifest
manifest
Output
{'datasets': [{'name': 'calls',
   'format': {'quoteChar': '"',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'encoding': 'UTF-8',
    'escapeChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited',
    'extension': 'csv'},
   'schema': {'fields': [{'name': 'NOPD_Item', 'type': 'esriFieldTypeString'},
     {'name': 'Type_', 'type': 'esriFieldTypeString'},
     {'name': 'TypeText', 'type': 'esriFieldTypeString'},
     {'name': 'Priority', 'type': 'esriFieldTypeString'},
     {'name': 'MapX', 'type': 'esriFieldTypeDouble'},
     {'name': 'MapY', 'type': 'esriFieldTypeDouble'},
     {'name': 'TimeCreate', 'type': 'esriFieldTypeString'},
     {'name': 'TimeDispatch', 'type': 'esriFieldTypeString'},
     {'name': 'TimeArrive', 'type': 'esriFieldTypeString'},
     {'name': 'TimeClosed', 'type': 'esriFieldTypeString'},
     {'name': 'Disposition', 'type': 'esriFieldTypeString'},
     {'name': 'DispositionText', 'type': 'esriFieldTypeString'},
     {'name': 'BLOCK_ADDRESS', 'type': 'esriFieldTypeString'},
     {'name': 'Zip', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'PoliceDistrict', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'Location', 'type': 'esriFieldTypeString'}]},
   'geometry': {'geometryType': 'esriGeometryPoint',
    'spatialReference': {'wkid': 4326},
    'fields': [{'name': 'MapX', 'formats': ['x']},
     {'name': 'MapY', 'formats': ['y']}]},
   'time': {'timeType': 'instant',
    'timeReference': {'timeZone': 'UTC'},
    'fields': [{'name': 'TimeCreate',
      'formats': ['MM/dd/yyyy hh:mm:ss a']}]}}]}

Modify a big data file share

When a big data catalog service is created, a manifest for the input data is automatically generated and uploaded to the GeoAnalytics Server site where you registered the data. The process of generating a manifest may not always correctly estimate the fields representing geometry and time, and you may need to apply edits. To edit a manifest through the user interface, follow the steps in Edit big data file share manifests in Manager.

In this example, the manifest sets the spatial reference of the dataset to 4326, but we know this data is from New Orleans, Louisiana, and is actually stored in the Louisiana State Plane Coordinate System. We will edit the manifest to set the correct spatial reference: {"wkid": 102682, "latestWkid": 3452}.

Input
manifest['datasets'][0]['geometry']['spatialReference'] = { "wkid": 102682, "latestWkid": 3452 }
Input
file_share_folder.manifest = manifest
Input
file_share_folder.manifest
Output
{'datasets': [{'name': 'calls',
   'format': {'quoteChar': '"',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'encoding': 'UTF-8',
    'escapeChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited',
    'extension': 'csv'},
   'schema': {'fields': [{'name': 'NOPD_Item', 'type': 'esriFieldTypeString'},
     {'name': 'Type_', 'type': 'esriFieldTypeString'},
     {'name': 'TypeText', 'type': 'esriFieldTypeString'},
     {'name': 'Priority', 'type': 'esriFieldTypeString'},
     {'name': 'MapX', 'type': 'esriFieldTypeDouble'},
     {'name': 'MapY', 'type': 'esriFieldTypeDouble'},
     {'name': 'TimeCreate', 'type': 'esriFieldTypeString'},
     {'name': 'TimeDispatch', 'type': 'esriFieldTypeString'},
     {'name': 'TimeArrive', 'type': 'esriFieldTypeString'},
     {'name': 'TimeClosed', 'type': 'esriFieldTypeString'},
     {'name': 'Disposition', 'type': 'esriFieldTypeString'},
     {'name': 'DispositionText', 'type': 'esriFieldTypeString'},
     {'name': 'BLOCK_ADDRESS', 'type': 'esriFieldTypeString'},
     {'name': 'Zip', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'PoliceDistrict', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'Location', 'type': 'esriFieldTypeString'}]},
   'geometry': {'geometryType': 'esriGeometryPoint',
    'spatialReference': {'wkid': 102682, 'latestWkid': 3452},
    'fields': [{'name': 'MapX', 'formats': ['x']},
     {'name': 'MapY', 'formats': ['y']}]},
   'time': {'timeType': 'instant',
    'timeReference': {'timeZone': 'UTC'},
    'fields': [{'name': 'TimeCreate',
      'formats': ['MM/dd/yyyy hh:mm:ss a']}]}}]}
Input
# You can regenerate a manifest if you have added new data or if you have uploaded a hints file using the edit resource.
file_share_folder.regenerate()
Output
True
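
The manifest edit above addresses the dataset by its position in the list, which is fine for a single dataset but fragile when a file share holds several. One possible variation is to look the dataset up by name and work on a copy, so that the original manifest is left untouched until the result is assigned back. This is a minimal sketch on plain dictionaries; with_spatial_reference is a hypothetical helper, and the sample manifest is a trimmed copy of the output above.

```python
import copy


def with_spatial_reference(manifest, dataset_name, spatial_reference):
    """Return a copy of the manifest with one dataset's spatial reference replaced."""
    updated = copy.deepcopy(manifest)
    for dataset in updated.get("datasets", []):
        if dataset.get("name") == dataset_name:
            dataset.setdefault("geometry", {})["spatialReference"] = dict(spatial_reference)
            return updated
    raise KeyError("No dataset named {!r} in manifest".format(dataset_name))


# Trimmed copy of the manifest shown above.
sample_manifest = {
    "datasets": [
        {
            "name": "calls",
            "geometry": {
                "geometryType": "esriGeometryPoint",
                "spatialReference": {"wkid": 4326},
            },
        }
    ]
}

edited = with_spatial_reference(
    sample_manifest, "calls", {"wkid": 102682, "latestWkid": 3452}
)
print(edited["datasets"][0]["geometry"]["spatialReference"])
```

With a real file share, the edited dictionary would then be assigned back with file_share_folder.manifest = edited, as in the steps above.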

Now you are ready to run the tools available in the arcgis.geoanalytics module.

Modify the output templates for a big data file share

When you choose to use the big data file share as an output location, output templates are automatically generated. These templates outline the formatting of output analysis results, such as the file type, and how time and geometry will be registered. If you want to modify the geometry or time formatting, or add or delete templates, you can modify the templates. To edit the output templates, follow the steps in Edit big data file share manifests in Manager. Learn more about output templates in the Output templates in a big data file share topic.

In this guide, we have covered registering datastores and big data file shares, inspecting a big data file share, modifying the data schema, and more. In the next guide, we will perform analysis with the published data and learn more about the tools available in the geoanalytics module.
