Making your data accessible to the GIS¶
Collecting, storing, managing and analyzing large quantities of numbers, figures, and files is not a new business activity. But referring to these numbers, figures and files as big data is relatively recent. Analysts, researchers and other professionals popularly characterize big data as sharing 4 key traits, also known as the 4 Vs:
- high volume: a large quantity of data that cannot be efficiently managed by traditional relational databases or analyzed in a traditional manner using the memory of a single machine
- high velocity: data that is dynamic and streams in at a fast pace from multiple sources
- large variety: different formats from structured to unstructured; tabular to documents, email, video or audio; spatial to non-spatial;
- unknown veracity: data that is unprocessed, uncleaned, inconsistent or unscreened and of unknown origin or quality
The GeoAnalytics Server expands your ArcGIS Enterprise deployment providing functionality and services to process and analyze big data.
Big data file shares¶
The GeoAnalytics server allows you to register datasets in a format called a big data file share. Big data file shares are items on your Web GIS, and can reference data in any of the following data sources:
- File Share - a directory of datasets stored locally or shared across a network
- HDFS - an Hadoop Distributed File System directory of datasets
- Apache Hive - a metastore database
- Cloud Store - an Azure Blob Storage container or Amazon Web Services S3 bucket
Storing your data in a big data file share datastore benefits you because:
- The GeoAnalytics tools read your data only when they are executed, which allows you to update or add data to these locations.
- You can use partitioned data as a single dataset.
- Big data file shares are flexible in how time and geometry are defined, allowing data in multiple formats in a single dataset.
Preparing your data¶
To register a file share or an HDFS, you need to format your datasets as subfolders within a single parent folder and register the parent folder. This parent folder becomes a datastore
, and each subfolder becomes a dataset
. For instance, to register 2 datasets representing earthquakes and hurricanes, your folder hierarchy would look like below:
|---FileShareFolder <-- register as a datastore
|---Earthquakes <-- dataset 1
|---1960
|---01_1960.csv
|---02_1960.csv
|---1961
|---01_1961.csv
|---02_1961.csv
|---Hurricanes <-- dataset 2
|---atlantic_hur.shp
|---pacific_hur.shp
Learn more about preparing your big data file share datasets here.
# Connect to enterprise GIS
from arcgis.gis import GIS
import arcgis.geoanalytics
portal_gis = GIS("portal_url", "username", "password")
Ensuring your GIS supports GeoAnalytics¶
It is best practice to confirm proper configuration of your Enterprise to support the GeoAnalytics Server.
# Verify that GeoAnalytics is supported
arcgis.geoanalytics.is_supported()
Searching for big data file shares¶
The get_datastores()
method of the geoanalytics
module returns a DatastoreManager
object that lets you search for and manage the big data file share items as Python API Datastore
objects on your GeoAnalytics server.
bigdata_datastore_manager = arcgis.geoanalytics.get_datastores()
bigdata_datastore_manager
Use the search()
method on a DatastoreManager
object to search for Datastores
. Observe in the output below the item titled FileShareFolder as illustrated in the example file structure above is registered as a big data file share in the portal.
bigdata_fileshares = bigdata_datastore_manager.search()
bigdata_fileshares
Get datasets from a big data file share datastore¶
Let's use the datasets
property on a Datastore
object to find out how many datasets are available and then list them.
file_share_folder = bigdata_fileshares[0]
file_share_datasets = file_share_folder.datasets
len(file_share_datasets)
for i in range(0, len(file_share_datasets)):
print("{:<10}{:<3}{}".format("Dataset " + str(i) + ":", "", file_share_datasets[i]['name']))
# let's view the json schema of the hurricanes dataset for a sample
file_share_datasets[1]
Registering big data file shares¶
You can register your data as a big data file share using the add_bigdata()
method on a DatastoreManager
object. Ensure the datasets are stored in a format compatible with the GeoAnalytics server as seen earlier in this guide.
Sample_City_Crime_data_item = bigdata_datastore_manager.add_bigdata("Sample_US_City_Crime",
r"\\<file_share_path>\<big_data_folder>")
Sample_City_Crime_data_item
Once a big data file share is created, the GeoAnalytics server samples the datasets to generate the schema of the data to create a manifest. This process can take a few minutes depending on the size of your data. Once processed, querying the manifest
property returns the schema of the datasets in your big data file share.
Sample_City_Crime_data_item.manifest
Feedback on this topic?