Making your data accessible to the GIS

Big data is popularly characterized by four v's:

  • high volume: a quantity of data too large to analyze in a traditional manner using the memory available on a single machine,
  • high velocity: data that is not just static but can also arrive from streaming sources,
  • wide variety: tabular and non-tabular, spatial and non-spatial formats from a variety of sources,
  • unknown veracity: data that is not pre-processed or screened and is of unknown quality.

Big data file shares

Given the enormity and uncertainty of such data, the GeoAnalytics server allows you to register your big datasets in a format called a big data file share. Big data file shares can reference data in the following data sources:

  • file share - a directory of datasets
  • HDFS - a Hadoop Distributed File System directory of datasets
  • Hive - metastore databases

Storing your data in a big data file share datastore has the following benefits:

  • the GeoAnalytics tools read your data only when they are executed, which allows you to keep updating or adding new data to these locations,
  • you can partition your data, say using file system folders, yet treat the partitions as a single dataset,
  • big data file shares are flexible in how time and geometry are defined, allowing you to have data in multiple formats even within a single dataset.

Preparing your data

To register a file share or an HDFS location, arrange your datasets as subfolders within a single parent folder and register that folder. The folder you register becomes a datastore and each of its subfolders becomes a dataset. For instance, to register two datastores representing earthquakes and hurricanes, your folder hierarchy would look like the following:

   |---Earthquakes          <-- register as a datastore
      |---1960              <-- dataset 1
      |---1961              <-- dataset 2
   |---Hurricanes           <-- register as a datastore

To learn more about preparing your data for use with the GeoAnalytics server, refer to the server documentation.
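The hierarchy above can be scripted with nothing but the standard library; a minimal sketch, in which a temporary directory merely stands in for your actual file share location and the folder names are the ones from the example:

```python
import tempfile
from pathlib import Path

# Parent location holding the folders to be registered
# (a temporary directory stands in for your real file share).
root = Path(tempfile.mkdtemp())

# 'Earthquakes' and 'Hurricanes' would each be registered as a datastore;
# '1960' and '1961' become datasets of the Earthquakes datastore.
for sub in ["Earthquakes/1960", "Earthquakes/1961", "Hurricanes"]:
    (root / sub).mkdir(parents=True, exist_ok=True)

print(sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()))
# ['Earthquakes', 'Earthquakes/1960', 'Earthquakes/1961', 'Hurricanes']
```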

Searching for big data file shares

The get_datastores() method of the geoanalytics module returns a DatastoreManager object that lets you search for and manage Datastore objects on your GeoAnalytics server.

In [ ]:
# Connect to enterprise GIS
from arcgis.gis import GIS
import arcgis.geoanalytics
portal_gis = GIS("portal url", "username", "password")
In [ ]:
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
Out[ ]:
<DatastoreManager for>

Use the search() method on a DatastoreManager object to search for Datastores.

In [ ]:
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares
Out[ ]:
[<Datastore title:"/bigDataFileShares/Chicago_accidents" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_1m_168yrs" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_all" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/Hurricane_tracks" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/NYCdata" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/NYC_taxi" type:"bigDataFileShare">]

Get datasets from a big data file share datastore

Use the datasets property on a Datastore object to get a dictionary representation of the datasets.

In [ ]:
Chicago_accidents = bigdata_fileshares[0]
In [ ]:
# let us view the first dataset for a sample
Chicago_accidents.datasets[0]
Out[ ]:
{'format': {'encoding': 'UTF-8',
  'extension': 'csv',
  'fieldDelimiter': ',',
  'hasHeaderRow': True,
  'quoteChar': '"',
  'recordTerminator': '\n',
  'type': 'delimited'},
 'geometry': {'fields': [{'formats': ['x'], 'name': 'longitude'},
   {'formats': ['y'], 'name': 'latitude'}],
  'geometryType': 'esriGeometryPoint',
  'spatialReference': {'wkid': 4326}},
 'name': 'April',
 'schema': {'fields': [{'name': 'date', 'type': 'esriFieldTypeString'},
   {'name': 'year', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'day_o_week', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'num_veh', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'injuries', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'fatalities', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'coll_type', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'weather', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'lighting', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'surf_cond', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'rd_defect', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh1_type', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh1_specl', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh1_dir', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh1_manuv', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh1_loc1', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh2_type', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh2_specl', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh2_dir', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh2_manuv', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'veh2_loc1', 'type': 'esriFieldTypeBigInteger'},
   {'name': 'longitude', 'type': 'esriFieldTypeDouble'},
   {'name': 'latitude', 'type': 'esriFieldTypeDouble'},
   {'name': 'intersection', 'type': 'esriFieldTypeBigInteger'}]},
 'time': {'fields': [{'formats': ['MM/dd/yyyy'], 'name': 'date'}],
  'timeReference': {'timeZone': 'UTC'},
  'timeType': 'instant'}}
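Since the dataset entry above is a plain Python dictionary, its schema can be inspected with ordinary dictionary operations; a sketch using a trimmed copy of the 'April' entry shown:

```python
# Trimmed copy of the 'April' dataset entry shown above
dataset = {
    "name": "April",
    "format": {"extension": "csv", "type": "delimited"},
    "geometry": {
        "fields": [{"formats": ["x"], "name": "longitude"},
                   {"formats": ["y"], "name": "latitude"}],
        "geometryType": "esriGeometryPoint",
    },
    "schema": {"fields": [
        {"name": "date", "type": "esriFieldTypeString"},
        {"name": "year", "type": "esriFieldTypeBigInteger"},
        {"name": "longitude", "type": "esriFieldTypeDouble"},
        {"name": "latitude", "type": "esriFieldTypeDouble"},
    ]},
}

# Map each field name to its esri field type
field_types = {f["name"]: f["type"] for f in dataset["schema"]["fields"]}
print(field_types["year"])   # esriFieldTypeBigInteger

# Which schema fields drive the geometry?
geometry_fields = [f["name"] for f in dataset["geometry"]["fields"]]
print(geometry_fields)       # ['longitude', 'latitude']
```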

Registering big data file shares

You can register your data as a big data file share using the add_bigdata() method on a DatastoreManager object. Ensure the datasets are stored in a format compatible with the GeoAnalytics server, as described earlier in this guide.

In [ ]:
NYC_data_item = bigdata_datastore_manager.add_bigdata("NYCdata2",
                                                      r"\\<server>\<path_to_folder>")
Created Big Data file share for NYCdata2
In [ ]:
NYC_data_item
Out[ ]:
<Datastore title:"/bigDataFileShares/NYCdata2" type:"bigDataFileShare">

Once a big data file share is created, the GeoAnalytics server scans all valid file types to discern the schema of the data. This process can take a few minutes depending on the size of your data. Once processed, querying the manifest property returns the schema.

In [ ]:
NYC_data_item.manifest
Out[ ]:
{'datasets': [{'format': {'encoding': 'UTF-8',
    'extension': 'csv',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'quoteChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited'},
   'geometry': {'fields': [{'formats': ['x'], 'name': 'pickup_longitude'},
     {'formats': ['y'], 'name': 'pickup_latitude'}],
    'geometryType': 'esriGeometryPoint',
    'spatialReference': {'wkid': 4326}},
   'name': 'sampled',
   'schema': {'fields': [{'name': 'VendorID',
      'type': 'esriFieldTypeBigInteger'},
     {'name': 'tpep_pickup_datetime', 'type': 'esriFieldTypeString'},
     {'name': 'tpep_dropoff_datetime', 'type': 'esriFieldTypeString'},
     {'name': 'passenger_count', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'trip_distance', 'type': 'esriFieldTypeDouble'},
     {'name': 'pickup_longitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'pickup_latitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'RateCodeID', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'store_and_fwd_flag', 'type': 'esriFieldTypeString'},
     {'name': 'dropoff_longitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'dropoff_latitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'payment_type', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'fare_amount', 'type': 'esriFieldTypeDouble'},
     {'name': 'extra', 'type': 'esriFieldTypeDouble'},
     {'name': 'mta_tax', 'type': 'esriFieldTypeDouble'},
     {'name': 'tip_amount', 'type': 'esriFieldTypeDouble'},
     {'name': 'tolls_amount', 'type': 'esriFieldTypeDouble'},
     {'name': 'improvement_surcharge', 'type': 'esriFieldTypeDouble'},
     {'name': 'total_amount', 'type': 'esriFieldTypeDouble'}]},
   'time': {'fields': [{'formats': ['MM/dd/yyyy HH:mm'],
      'name': 'tpep_pickup_datetime'}],
    'timeReference': {'timeZone': 'UTC'},
    'timeType': 'instant'}}]}
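Because the manifest is a dictionary holding a 'datasets' list, a quick sanity check that every discovered dataset carries geometry and time can be written directly against it; a sketch using a trimmed copy of the manifest above:

```python
# Trimmed copy of the manifest returned for NYCdata2
manifest = {
    "datasets": [{
        "name": "sampled",
        "geometry": {"geometryType": "esriGeometryPoint",
                     "spatialReference": {"wkid": 4326}},
        "time": {"timeType": "instant",
                 "fields": [{"formats": ["MM/dd/yyyy HH:mm"],
                             "name": "tpep_pickup_datetime"}]},
    }]
}

# Confirm each dataset was registered with both geometry and time
for ds in manifest["datasets"]:
    print(f"{ds['name']}: geometry={'geometry' in ds}, time={'time' in ds}")
# sampled: geometry=True, time=True
```

Datasets missing geometry or time here would be usable only by tools that do not require those properties, so this is a useful check before running analysis.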
