Reconstructing 3D buildings from Aerial LiDAR with Deep Learning

  • 🔬 Data Science
  • 🥠 Deep Learning and Instance Segmentation


The workflow traditionally used to reconstruct 3D building models from aerial LiDAR is relatively straight-forward: the LiDAR point-cloud is transformed into a Digital Surface Model (DSM) raster, then inspected by human editors for buildings present. If a building is found, one or more polygons describing the roof form of the building is manually digitized, e.g. if it is a large hip roof with two gable outlets, there will be three polygons (one hip and two gables on top) drawn by the editor. Once all the roofs are described that way, a set of ArcGIS Procedural rules is applied to extrude the building models using the manually digitized roof segments, with heights and ridge directions computed from the DSM.

Figure 1. 3D building reconstruction from Lidar example: a building with complex roof shape and its representation in visible spectrum (RGB), Aerial LiDAR, and corresponding roof segments digitized by a human editor. The last one is a 3D reconstruction of the same building using manually digitized masks and ArcGIS Procedural rules.

The most time-consuming and expensive step in the above workflow is the manual search and digitization of the roof segment polygons from a DSM raster. In this notebook, we are going to focus on this challenging step and demonstrate how to detect instances of roof segments of various types using instance segmentation to make the process more efficient. The workflow consists of four major steps: (1) extract training data, (2) train a deep learning instance segmentation model, (3) model deployment and roof segments detection and (4) 3D enabling the detected segments.


Complete data required to run this sample is packaged together in a project package and can be downloaded from here. You are also required to download the rule package used in Part 4 of this notebook from here.

Below are the items present in the project package shared:

  • D1_D2_D3_Buildings_1: labelled feature data for training data preparation
  • R7_nDSM_TestVal: raster image for training data preparation
  • DSM_AOI_Clip: DSM raster for area of interest, required during model inferencing
  • DTM_AOI_Clip: DTM raster for area of interest, required during model inferencing
  • DSM_AOI_Clip_DetectObjects_26032020_t4_220e: sample results obtained from the trained MaskRCNN model inferenced on area of interest obtained after performing part 3 of the notebook
  • DSM_AOI_Clip_DetectObjects_26032020_t4_220e_selection_3dEnabling: sample 3D enabled roof segments obtained after performing part 4 of the notebook

    Moreover, there is a toolbox (3d_workflow.tbx) in the 'Toolboxes' section of the project having the script (3dEnabling) to perform part 4 of the notebook.

Part 1 - Data Preparation

We started with two input data:

  • A single-band raster layer (R7_nDSM_TestVal) with 2.25 square feet per pixel resolution converted from LiDAR point cloud (using the “LAS Dataset to Raster” geoprocessing tool)
  • A feature class (D1_D2_D3_Buildings_1) that defines the location and label (i.e. flat, gable, hip, shed, mansard, vault, dome) of each roof segment.

We are using single band Lidar data which is essentially elevation to train our deep learning MaskRCNN model.

Figure 2. Example of different roof types (flat not shown).

Export training data

Export training data using 'Export Training data for deep learning' tool, detailed documentation here.

  • Input Raster: R7_nDSM_TestVal
  • Output Folder: Set a location where you want to export the training data, it can be an existing folder or the tool will create that for you.
  • Input Feature Class Or Classified Raster: D1_D2_D3_Buldings_1
  • Image Format: TIFF format
  • Tile Size X & Tile Size Y can be set to 256
  • Stride X & Stride Y: 128
  • Meta Data Format: Select 'RCNN Masks' as the data format because we are training a MaskRCNN model.
  • In Environments tab set an optimum Cell Size. For this example, as we have to perform the analysis on the LiDAR imagery, we used 0.2 cell size.

Figure 3. Export Training data screenshot from ArcGIS Pro.

After filling all details and running the Export Training Data For Deep Learning tool, a code like above will be generated and executed. That will create all the necessary files needed for the next step in the 'Output Folder', and we will now call it our training data.

Part 2 - Model Training

You should already have the training chips with you exported from ArcGIS pro. Please change the path to your own export training data folder that contains "images" and "labels" folder. Note that we set a relatively small batch_size here on purpose as instance segmentation is a more computationally intensive task compared with object detection and pixel-based classification. If you run into "insufficient memory" issue during training, you can come back to adjust it to meet your own needs.

Necessary Imports

import os
from pathlib import Path

from arcgis.learn import prepare_data, MaskRCNN
#connect to GIS
from arcgis.gis import GIS
gis = GIS('home')

Prepare data

We will now use the prepare_data() function to apply various types of transformations and augmentations on the training data. These augmentations enable us to train a better model with limited data and also prevent the model from overfitting. prepare_data() takes 3 parameters.

  • path: path of the folder containing training data.
  • batch_size: No of images your model will train on each step inside an epoch, it directly depends on the memory of your graphic card. 8 worked for us on a 11GB GPU.
  • imagery_type: It is a mandatory input to enable a model for multispectral data processing. It can be "landsat8", "sentinel2", "naip", "ms" or "multispectral".
training_data = gis.content.get('807425fa74f34d7695ec024eb934456c')
Image Collection by api_data_owner
Last Modified: September 18, 2020
0 comments, 5 views
filepath =
import zipfile
with zipfile.ZipFile(filepath, 'r') as zip_ref:
data_path = Path(os.path.join(os.path.splitext(filepath)[0]))
data = prepare_data(data_path, batch_size=8, imagery_type='ms')

Visualize a few samples from your training data

To get a sense of what the training data looks like, arcgis.learn.show_batch() method randomly picks a few training chips and visualizes them. Note that the masks representing different roof segments are overlaid upon the original images with red and pink colors.

rows: No of rows we want to see the results for.


Load model architecture

Here we use Mask R-CNN [1], a well-recognized instance algorithm, to detect roof segments (Figure 3). A Mask R-CNN model architecture and a pretrained model has already been predefined in arcgis.learn, so we can just define it with a single line. Please refer to the guide on our developers' site for more information.

The idea of Mask R-CNN is to detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. In other words, it is like a combination of UNet and SSD and does two jobs in one go. This is also why it is relatively computationally more intensive.

model = MaskRCNN(data)

Figure 4. Mask R-CNN framework for instance segmentation [1]

Find an optimal learning rate

Learning rate is one of the most important hyperparameters in model training. arcgis.learn leverages’s learning rate finder to find an optimum learning rate for training models. We can use the lr_find() method to find the optimum learning rate at which we can train a robust model fast enough.

# The users can visualize the learning rate of the model with comparative loss. 
lr = model.lr_find()

Fit the model

Let's train it for a few epochs with the learning rate we have found. For the sake of time, we can start with 10 epochs.

Input, lr=lr)
epoch train_loss valid_loss time
0 2.769909 2.669544 06:09
1 2.142615 2.117454 06:38
2 1.951803 1.997903 07:03
3 1.843830 1.804779 07:50
4 1.687831 1.723288 08:12
5 1.534998 1.688702 07:52
6 1.503451 1.559698 07:48
7 1.417018 1.515242 07:40
8 1.335621 1.512897 07:36
9 1.238250 1.535544 07:26

Note that the validation loss actually goes higher, but this doesn't necessarily mean the result is getting bad. Because our training data is also missing many buildings, the loss sometimes tells the actual model performance, so let's look at the actual results instead.

Save the model

We will save the model which we trained as a 'Deep Learning Package' ('.dlpk' format). Deep Learning package is the standard format used to deploy deep learning models on the ArcGIS platform.

We will use the save() method to save the trained model. By default it will be saved to the 'models' sub-folder within our training data folder.


Visualize results in validation set

Now we have the model, let's look at how the model performs. There are 3 modes when visualizing the results.

  • bbox - For visualizing only bounding boxes.
  • mask - For visualizing only mask
  • bbox_mask - For visualizing both mask and bounding boxes.
model.show_results(mode='bbox_mask', rows=2)

As we can see, with only 10 epochs, we are already seeing good results. We have trained it further till 220 epochs and were able to get promising results. Below are the results that we got from our model trained for 220 epochs.

Load a saved model to visualize results

To visualize results from a saved model, we can load it using load() function.

# model.load('220e')
# model.show_results(mode='bbox_mask', rows=5, box_threshold=0.5)

Part 3 - Deploy Model and Detect Roof Segments

We will use saved model to detect objects using 'Detect Objects Using Deep Learning' tool available in both ArcGIS Pro and ArcGIS Enterprise. For this sample, we will use the DSM image for our area of Interest. You will find this image in the ArcGIS project you have downloaded with the name - DSM_AOI_Clip.

  • Input Raster: DSM_AOI_Clip
  • Output Detect Objects: DSM_AOI_Clip_DetectObjects
  • Model Definition: path to the model emd file
  • padding: The 'Input Raster' is tiled and the deep learning model runs on each individual tile separately before producing the final 'detected objects feature class'. This may lead to unwanted artifacts along the edges of each tile as the model has little context to detect objects accurately. Padding as the name suggests allows us to supply some extra information along the tile edges, this helps the model to predict better.
  • batch_size: 4
  • threshold:0.4
  • return_bboxes: False
  • Cell Size: Should be close to at which we trained the model, we specified that at the Export Training Data step .

Figure 5. Detect objects using Deep Learning screenshot from ArcGIS Pro
model_arguments="padding 56;batch_size 4;threshold 0.4;return_bboxes False;",
processing_mode = "PROCESS_AS_MOSAICKED_IMAGE")

Figure 6. A subset of detected different types of building roof forms (shown in different color)

Part 4 - 3D enabling the MaskRCNN results

The next step is to extrude object detection results in 3D using traditional Geoprocessing tools and Procedural rules. First, we used the “Regularize Building Footprint” tool from the 3D Analyst toolbox to regularize raw detections. Next, we used surface elevation and DSM rasters to acquire the base elevation for each building segment, and calculate the roof ridge direction. The final step was calling the “Features from CityEngine Rules” GP tool which applied a procedural rule to these polygons to achieve a final reconstruction of the 3D buildings.

All the pre-processing steps are recorded in a python script and shared with you in the form of a toolbox within your project. Double click on the script (3dEnabling).

Figure 7. Screenshot of location of the script to perform 3D enabling task

The 3dEnabling script takes the following inputs:

  • Input Masks: detected buildings from MaskRCNN model.
  • DTM: 'DTM_AOI_Clip' image from the project.
  • DSM: 'DSM_AOI_Clip' image from the project.
  • Rule Package: 'LGIM_RFE_LOD2.rpk' rule package file from the project.
  • Output Buildings: 3dEnabledPolygons
  • Cell Size: Should be close to at which we trained the model, we specified that at the Export Training Data step.

Figure 8. 3dEnabling tool screenshot from ArcGIS Pro

This tool converts the detection obtained from MaskRCNN model in 3D based on the rules defined in the rule package.

scene_item = gis.content.get('9ce6cf3170334c61bf3da25dacbdf553')
from arcgis.widgets import MapView
map2 = MapView(item = scene_item)
map2.extent = {'spatialReference': {'latestWkid': 3857, 'wkid': 102100},
                'xmin': -8919494.611200048,
                'ymin': 2981619.558592063,
               'xmax': -8918883.485987043,
               'ymax': 2981867.4795709597}

Here is a web map to showcase the results obtained from the notebook.


In this notebook, we have covered a lot of ground. In part 1, we discussed how to prepare and export training data from aerial Lidar data. In part 2, we demonstrated how to prepare the input data, train an instance segmentation model, visualize the results, as well as apply the model to an unseen image using the deep learning module in ArcGIS API for Python. In part 4, we discussed how the MaskRCNN results obtained can be converted in 3D using the toolbox shared.


[1] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.