Making your data accessible to the GIS

Collecting, storing, managing, and analyzing large quantities of numbers, figures, and files is not a new business activity, but referring to them as big data is relatively recent. Analysts, researchers, and other professionals commonly characterize big data by four key traits, known as the 4 Vs:

  • high volume: a large quantity of data that cannot be efficiently managed by traditional relational databases or analyzed in a traditional manner using the memory of a single machine
  • high velocity: data that is dynamic and streams in at a fast pace from multiple sources
  • large variety: different formats, from structured to unstructured, tabular to documents, email, video, or audio, and spatial to non-spatial
  • unknown veracity: data that is unprocessed, uncleaned, inconsistent or unscreened and of unknown origin or quality

GeoAnalytics Server expands your ArcGIS Enterprise deployment by providing functionality and services to process and analyze big data.

Big data file shares

GeoAnalytics Server allows you to register datasets in a format called a big data file share. Big data file shares are items on your Web GIS and can reference data in sources such as a file share or HDFS.

Storing your data in a big data file share datastore benefits you because:

  • The GeoAnalytics tools read your data only when they are executed, which allows you to update or add data to these locations.
  • You can use partitioned data as a single dataset.
  • Big data file shares are flexible in how time and geometry are defined, allowing data in multiple formats in a single dataset.

Preparing your data

To register a file share or an HDFS location, arrange your datasets as subfolders within a single parent folder and register the parent folder. The parent folder becomes a datastore, and each subfolder becomes a dataset. For instance, to register two datasets representing earthquakes and hurricanes, your folder hierarchy would look like this:

|---FileShareFolder         <-- register as a datastore
   |---Earthquakes          <-- dataset 1
   |---Hurricanes           <-- dataset 2

Learn more about preparing your big data file share datasets here.
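
As a rough illustration of this layout, the sketch below builds the example hierarchy in a temporary directory and lists the subfolders that would become datasets. The helper name list_candidate_datasets is hypothetical and not part of the ArcGIS API for Python; it only inspects the local file system and does not validate the data files themselves.

```python
import tempfile
from pathlib import Path


def list_candidate_datasets(parent_folder):
    """Return the sorted names of the immediate subfolders of parent_folder.

    Each subfolder corresponds to one dataset once the parent folder is
    registered as a big data file share.
    """
    parent = Path(parent_folder)
    return sorted(p.name for p in parent.iterdir() if p.is_dir())


# Build the example layout in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    parent = Path(tmp) / "FileShareFolder"
    (parent / "Earthquakes").mkdir(parents=True)
    (parent / "Hurricanes").mkdir()
    # A loose file directly under the parent is not a subfolder, so it is skipped.
    (parent / "readme.txt").write_text("not a dataset")
    dataset_names = list_candidate_datasets(parent)

print(dataset_names)
```

Running a check like this before registration can catch a dataset that was placed directly in the parent folder instead of in its own subfolder.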

In [ ]:
# Connect to enterprise GIS
from arcgis.gis import GIS
import arcgis.geoanalytics
portal_gis = GIS("portal_url", "username", "password")
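
The connection cell above uses literal placeholder strings. One possible way to keep real credentials out of a notebook is to read them from environment variables, as in the minimal sketch below. The variable names PORTAL_URL, PORTAL_USERNAME, and PORTAL_PASSWORD and the helper load_portal_credentials are assumptions made for illustration; they are not defined by ArcGIS.

```python
import os


def load_portal_credentials(env=None):
    """Read the portal URL, username, and password from a mapping.

    Defaults to os.environ. The variable names are hypothetical.
    Raises KeyError naming any variables that are missing or empty.
    """
    env = os.environ if env is None else env
    names = ("PORTAL_URL", "PORTAL_USERNAME", "PORTAL_PASSWORD")
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return tuple(env[n] for n in names)


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "PORTAL_URL": "https://gis.example.com/portal",  # hypothetical URL
    "PORTAL_USERNAME": "analyst",
    "PORTAL_PASSWORD": "not-a-real-password",
}
url, username, password = load_portal_credentials(example_env)

# With the arcgis package installed, the values can then be passed on:
# portal_gis = GIS(url, username, password)
```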

Ensuring your GIS supports GeoAnalytics

As a best practice, confirm that your ArcGIS Enterprise deployment is configured to support GeoAnalytics Server before going further.

In [ ]:
# Verify that GeoAnalytics is supported
arcgis.geoanalytics.is_supported()
Out[ ]:

Searching for big data file shares

The get_datastores() function of the geoanalytics module returns a DatastoreManager object that lets you search for and manage big data file share items as Python API Datastore objects on your GeoAnalytics Server.

In [ ]:
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
Out[ ]:
<DatastoreManager for>

Use the search() method on a DatastoreManager object to search for Datastore objects. In the output below, the item titled FileShareFolder, from the example folder structure above, is registered as a big data file share in the portal.

In [ ]:
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares
Out[ ]:
[<Datastore title:"/bigDataFileShares/FileShareFolder" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hdfs_test" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/qalab" type:"bigDataFileShare">]

Get datasets from a big data file share datastore

Let's use the datasets property on a Datastore object to find out how many datasets are available and then list them.

In [ ]:
file_share_folder = bigdata_fileshares[0]
file_share_datasets = file_share_folder.datasets
len(file_share_datasets)
Out[ ]:
In [ ]:
# Print each dataset's index and name in aligned columns
for i, dataset in enumerate(file_share_datasets):
    print("{:<13}{}".format(f"Dataset {i}:", dataset["name"]))
Dataset 0:   Earthquakes
Dataset 1:   Hurricanes
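
Because the order of datasets in the list is positional, looking a dataset up by name can be less fragile than indexing. The following sketch assumes only what the output above shows, namely that each entry is a dict with a 'name' key. The helper index_by_name and the stand-in list are hypothetical.

```python
def index_by_name(datasets):
    """Map each dataset's 'name' to its full description dict."""
    by_name = {}
    for dataset in datasets:
        name = dataset["name"]
        if name in by_name:
            raise ValueError(f"duplicate dataset name: {name}")
        by_name[name] = dataset
    return by_name


# Stand-in for file_share_folder.datasets, trimmed to the key used here.
sample_datasets = [
    {"name": "Earthquakes"},
    {"name": "Hurricanes"},
]

datasets_by_name = index_by_name(sample_datasets)
hurricanes = datasets_by_name["Hurricanes"]
print(hurricanes["name"])
```
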
In [ ]:
# View the JSON schema of the Hurricanes dataset as a sample
file_share_datasets[1]
Out[ ]:
{'format': {'extension': 'shp', 'type': 'shapefile'},
 'geometry': {'geometryType': 'esriGeometryPoint',
  'spatialReference': {'wkid': 4326}},
 'name': 'Hurricanes',
 'schema': {'fields': [{'name': 'serial_num', 'type': 'esriFieldTypeString'},
   {'name': 'season', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'num', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'basin', 'type': 'esriFieldTypeString'},
   {'name': 'sub_basin', 'type': 'esriFieldTypeString'},
   {'name': 'name', 'type': 'esriFieldTypeString'},
   {'name': 'iso_time', 'type': 'esriFieldTypeString'},
   {'name': 'nature', 'type': 'esriFieldTypeString'},
   {'name': 'latitude', 'type': 'esriFieldTypeDouble'},
   {'name': 'longitude', 'type': 'esriFieldTypeDouble'},
   {'name': 'wind_wmo_', 'type': 'esriFieldTypeDouble'},
   {'name': 'pres_wmo_', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'center', 'type': 'esriFieldTypeString'},
   {'name': 'wind_wmo1', 'type': 'esriFieldTypeDouble'},
   {'name': 'pres_wmo1', 'type': 'esriFieldTypeDouble'},
   {'name': 'track_type', 'type': 'esriFieldTypeString'},
   {'name': 'size', 'type': 'esriFieldTypeString'},
   {'name': 'Wind', 'type': 'esriFieldTypeBigInteger'}]},
 'time': {'fields': [{'formats': ['yyyy-MM-dd HH:mm:ss'], 'name': 'iso_time'}],
  'timeReference': {'timeZone': 'UTC'},
  'timeType': 'instant'}}
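
A schema like the one above can be easier to review when its fields are grouped by type. The sketch below works on a plain dict with the same shape as the output shown, using a trimmed copy of the Hurricanes schema. The helpers fields_by_type and time_field_formats are hypothetical and not part of the ArcGIS API for Python.

```python
from collections import defaultdict


def fields_by_type(dataset):
    """Group field names by their Esri field type, preserving field order."""
    grouped = defaultdict(list)
    for field in dataset["schema"]["fields"]:
        grouped[field["type"]].append(field["name"])
    return dict(grouped)


def time_field_formats(dataset):
    """Map each time field name to its list of format strings."""
    time_fields = dataset.get("time", {}).get("fields", [])
    return {f["name"]: f["formats"] for f in time_fields}


# Trimmed copy of the Hurricanes schema shown above.
hurricanes = {
    "name": "Hurricanes",
    "schema": {"fields": [
        {"name": "serial_num", "type": "esriFieldTypeString"},
        {"name": "season", "type": "esriFieldTypeBigInteger"},
        {"name": "iso_time", "type": "esriFieldTypeString"},
        {"name": "latitude", "type": "esriFieldTypeDouble"},
        {"name": "longitude", "type": "esriFieldTypeDouble"},
    ]},
    "time": {"fields": [{"formats": ["yyyy-MM-dd HH:mm:ss"], "name": "iso_time"}],
             "timeReference": {"timeZone": "UTC"},
             "timeType": "instant"},
}

grouped = fields_by_type(hurricanes)
time_formats = time_field_formats(hurricanes)
for field_type, names in grouped.items():
    print(field_type, names)
print(time_formats)
```

Note that iso_time appears as a string field in the schema while the separate 'time' section records how it should be parsed.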

Registering big data file shares

You can register your data as a big data file share using the add_bigdata() method on a DatastoreManager object. Ensure the datasets are stored in a format and folder layout compatible with GeoAnalytics Server, as described earlier in this guide.

In [ ]:
Sample_City_Crime_data_item = bigdata_datastore_manager.add_bigdata("Sample_US_City_Crime", 
Created Big Data file share for Sample_US_City_Crime
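
The registration call above takes a name followed by the location of the parent folder. The sketch below wraps that call shape in a small helper and exercises it against a stand-in object, so it runs without a live server. The helper register_file_share, the _RecordingManager class, its return value, and the path /data/shares/Sample_US_City_Crime are all hypothetical; only the two-argument add_bigdata(name, path) call shape is taken from the cell above.

```python
def register_file_share(manager, name, server_path):
    """Register server_path as a big data file share called name.

    manager is expected to be a DatastoreManager, for example the object
    returned by arcgis.geoanalytics.get_datastores().
    """
    if not name or not server_path:
        raise ValueError("both name and server_path are required")
    return manager.add_bigdata(name, server_path)


class _RecordingManager:
    """Stand-in used only to show the call shape without a live server."""

    def __init__(self):
        self.calls = []

    def add_bigdata(self, name, server_path):
        self.calls.append((name, server_path))
        # The real method returns a Datastore object; a string stands in here.
        return f"/bigDataFileShares/{name}"


demo_manager = _RecordingManager()
result = register_file_share(
    demo_manager,
    "Sample_US_City_Crime",
    "/data/shares/Sample_US_City_Crime",  # hypothetical path
)
print(result)
```

Against a real deployment, the same helper would be called with bigdata_datastore_manager in place of the stand-in.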
In [ ]:
Sample_City_Crime_data_item
Out[ ]:
<Datastore title:"/bigDataFileShares/Sample_US_City_Crime" type:"bigDataFileShare">

Once a big data file share is created, GeoAnalytics Server samples the datasets to infer their schema and builds a manifest from it. This process can take a few minutes, depending on the size of your data. Once processing finishes, querying the manifest property returns the schema of the datasets in your big data file share.

In [ ]:
Sample_City_Crime_data_item.manifest
Out[ ]:
{'datasets': [{'format': {'encoding': 'UTF-8',
    'extension': 'csv',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'quoteChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited'},
   'name': 'HoustonCrime',
   'schema': {'fields': [{'name': 'Address', 'type': 'esriFieldTypeString'},
     {'name': 'Beat', 'type': 'esriFieldTypeString'},
     {'name': 'BlockRange', 'type': 'esriFieldTypeString'},
     {'name': 'Date', 'type': 'esriFieldTypeString'},
     {'name': 'DayOfWeek', 'type': 'esriFieldTypeString'},
     {'name': 'Hour', 'type': 'esriFieldTypeDouble'},
     {'name': 'OffenseTyp', 'type': 'esriFieldTypeString'},
     {'name': 'Offenses', 'type': 'esriFieldTypeDouble'},
     {'name': 'Premise', 'type': 'esriFieldTypeString'},
     {'name': 'StreetName', 'type': 'esriFieldTypeString'},
     {'name': 'Suffix', 'type': 'esriFieldTypeString'},
     {'name': 'Type', 'type': 'esriFieldTypeString'},
     {'name': 'x', 'type': 'esriFieldTypeDouble'},
     {'name': 'y', 'type': 'esriFieldTypeDouble'}]},
   'time': {'fields': [{'formats': ['yyyy-MM-dd'], 'name': 'Date'}],
    'timeReference': {'timeZone': 'UTC'},
    'timeType': 'instant'}},
  {'format': {'encoding': 'UTF-8',
    'extension': 'csv',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'quoteChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited'},
   'name': 'PhiladelphiaCrime',
   'schema': {'fields': [{'name': 'X', 'type': 'esriFieldTypeDouble'},
     {'name': 'Y', 'type': 'esriFieldTypeDouble'},
     {'name': 'DC_DIST', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'SECTOR', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'DISPATCH_DATE_TIME', 'type': 'esriFieldTypeString'},
     {'name': 'DISPATCH_DATE', 'type': 'esriFieldTypeString'},
     {'name': 'DISPATCH_TIME', 'type': 'esriFieldTypeString'},
     {'name': 'HOUR', 'type': 'esriFieldTypeString'},
     {'name': 'DC_KEY', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'LOCATION_BLOCK', 'type': 'esriFieldTypeString'},
     {'name': 'UCR_GENERAL', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'OBJECTID', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'TEXT_GENERAL_CODE', 'type': 'esriFieldTypeString'},
     {'name': 'POINT_X', 'type': 'esriFieldTypeDouble'},
     {'name': 'POINT_Y', 'type': 'esriFieldTypeDouble'},
     {'name': 'GlobalID', 'type': 'esriFieldTypeString'}]},
   'time': {'fields': [{'formats': ["yyyy-MM-dd'T'HH:mm:ss.SSSZ"],
      'name': 'DISPATCH_DATE_TIME'}],
    'timeReference': {'timeZone': 'UTC'},
    'timeType': 'instant'}}]}
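
For a file share with many datasets, a compact overview of the manifest can be quicker to read than the full dict. The sketch below assumes a manifest with the same shape as the output above and uses a trimmed copy of it as input. The helper summarize_manifest is hypothetical and not part of the ArcGIS API for Python.

```python
def summarize_manifest(manifest):
    """Return one summary dict per dataset in a manifest-shaped dict."""
    rows = []
    for dataset in manifest.get("datasets", []):
        time_fields = [f["name"] for f in dataset.get("time", {}).get("fields", [])]
        rows.append({
            "name": dataset["name"],
            "format": dataset["format"]["type"],
            "field_count": len(dataset["schema"]["fields"]),
            "time_fields": time_fields,
        })
    return rows


# Trimmed copy of the manifest shown above (only a few fields per dataset).
sample_manifest = {"datasets": [
    {"format": {"extension": "csv", "type": "delimited"},
     "name": "HoustonCrime",
     "schema": {"fields": [
         {"name": "Date", "type": "esriFieldTypeString"},
         {"name": "x", "type": "esriFieldTypeDouble"},
         {"name": "y", "type": "esriFieldTypeDouble"}]},
     "time": {"fields": [{"formats": ["yyyy-MM-dd"], "name": "Date"}]}},
    {"format": {"extension": "csv", "type": "delimited"},
     "name": "PhiladelphiaCrime",
     "schema": {"fields": [
         {"name": "X", "type": "esriFieldTypeDouble"},
         {"name": "Y", "type": "esriFieldTypeDouble"},
         {"name": "DISPATCH_DATE_TIME", "type": "esriFieldTypeString"}]},
     "time": {"fields": [{"formats": ["yyyy-MM-dd'T'HH:mm:ss.SSSZ"],
                          "name": "DISPATCH_DATE_TIME"}]}},
]}

summary = summarize_manifest(sample_manifest)
for row in summary:
    print(row)
```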
