ArcGIS API for Python

Object detection and tracking on videos

The Object detection with arcgis.learn section of this guide explains how object detection models can be trained and used to extract the location of detected objects from imagery. This section of the guide explains how they can be applied to videos, both to detect objects in a video and to track them.

Object detection in videos

Object detection models can be used to detect objects in videos using the predict_video function. This function applies the model to each frame of the video, and provides the classes and bounding boxes of detected objects in each frame. The information is stored in a metadata file. The detected objects can also be visualized on the video, by specifying the visualize=True flag. By default, the output video is saved in the original video's directory.

The metadata file is a comma-separated values (CSV) file containing metadata about the video frames at specific times. The predict_video function updates this CSV file by encoding the object detections, following the MISB 0903 standard, in the vmtilocaldataset column. When multiplexed with the original video, this enables the object detections to be visualized in ArcGIS Pro, using its support for Full Motion Video (FMV) and Video Moving Target Indicator (VMTI) metadata. To learn more about it, read here.

Object tracking

When detecting objects in a video, we are often interested in knowing how many objects there are and what tracks they follow. As an example, in a video from a traffic camera installed at an intersection, we may be interested in counting the number and types of vehicles crossing the intersection. Similarly, in a video captured from a drone, we might be interested in counting or tracking individual objects as they move around.

Object tracking is a process of:

  • Taking an initial set of object detections (such as an input set of bounding box coordinates)
  • Creating a unique ID for each of the initial detections
  • And then tracking each of the objects as they move around frames in a video, maintaining the assignment of unique IDs

Object tracking in arcgis.learn is based on the SORT (Simple Online and Realtime Tracking) algorithm, which combines Kalman filtering with the Hungarian assignment algorithm.

The Kalman filter is used to estimate the position of a tracker, while the Hungarian algorithm is used to assign trackers to new detections.

Kalman Filter

Kalman filtering uses a series of measurements observed over time and produces estimates of unknown variables by estimating a joint probability distribution over the variables for each timeframe. The filter is named after Rudolf E. Kálmán, one of the primary developers of its theory.

Our state contains 8 variables, (u, v, a, h, u’, v’, a’, h’), where (u, v) is the centre of the bounding box, a is its aspect ratio and h is its height. The remaining four variables are the respective velocities of these quantities.
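As a sketch of how this state can be used, the equations below write it as a vector and advance it by one frame. They assume a constant-velocity motion model, which is a common choice for SORT-style trackers but is not stated in this guide; the symbols F, H, z and Δt are introduced here for illustration only.

```latex
\mathbf{x} = (u,\, v,\, a,\, h,\, u',\, v',\, a',\, h')^{\top},
\qquad
\mathbf{x}_{t} = F\,\mathbf{x}_{t-1},
\qquad
F = \begin{pmatrix} I_4 & \Delta t\, I_4 \\ 0 & I_4 \end{pmatrix}

% Only the box itself is observed, not its velocities:
\mathbf{z}_{t} = H\,\mathbf{x}_{t},
\qquad
H = \begin{pmatrix} I_4 & 0 \end{pmatrix}
```

Written out for the first component, the prediction is u_t = u_{t-1} + Δt · u’_{t-1}, and likewise for v, a and h, while the velocities themselves are carried over unchanged.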

A Kalman filter is maintained for every bounding box, so filtering takes place after a box has been matched with a tracker. Once the association is made, the predict and update functions are called.

  • Predict: The prediction step is a matrix multiplication that estimates the position of the bounding box at time t based on its position at time t-1.

  • Update: The update step is a correction step. It incorporates the new measurement from the object detection model and improves the filter's estimate.
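The two steps can be sketched with NumPy as follows. This is a minimal illustration of a generic linear Kalman filter under a constant-velocity assumption, not the code used inside arcgis.learn; the function names, the noise matrices Q and R, and the sample detections are all hypothetical values chosen for the example.

```python
import numpy as np

DIM = 4  # measured quantities: u, v, a, h


def make_constant_velocity_model(dt=1.0):
    """Return (F, H) for the 8-variable state (u, v, a, h and their velocities)."""
    F = np.eye(2 * DIM)
    F[:DIM, DIM:] = dt * np.eye(DIM)  # position += velocity * dt
    H = np.eye(DIM, 2 * DIM)          # only (u, v, a, h) are observed
    return F, H


def predict(x, P, F, Q):
    """Predict step: project the state and its covariance one frame forward."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P


def update(x, P, z, H, R):
    """Update step: correct the prediction with a new detection z = (u, v, a, h)."""
    y = z - H @ x                     # innovation (measurement residual)
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ y
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P


F, H = make_constant_velocity_model()
Q = 0.01 * np.eye(2 * DIM)  # process noise (illustrative value)
R = 1.0 * np.eye(DIM)       # measurement noise (illustrative value)

# Start from a first detection with unknown (zero) velocity.
x = np.array([100.0, 50.0, 0.5, 40.0, 0.0, 0.0, 0.0, 0.0])
P = 10.0 * np.eye(2 * DIM)

# Feed five detections of a box that moves 5 pixels to the right per frame.
for t in range(1, 6):
    x, P = predict(x, P, F, Q)
    z = np.array([100.0 + 5.0 * t, 50.0, 0.5, 40.0])
    x, P = update(x, P, z, H, R)

print("estimated box (u, v, a, h):", x[:DIM])
print("estimated horizontal velocity:", x[DIM])
```

After a few frames the estimated horizontal velocity approaches the true displacement per frame, which is what lets the predict step place the box sensibly even in a frame where the detector misses the object.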

Hungarian Assignment Algorithm

The Hungarian algorithm, also known as the Kuhn-Munkres algorithm, can associate an object in one frame with the same object in another frame, based on a score such as Intersection over Union (IoU).

We iterate through the list of trackers and detections and assign a tracker to each detection on the basis of IoU scores.
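The matching step can be sketched as below. This is an illustrative example rather than the arcgis.learn implementation: it uses SciPy's linear_sum_assignment, which solves the same assignment problem that the Hungarian algorithm solves, and the helper names iou and match, the (x_min, y_min, x_max, y_max) box format and the sample boxes are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match(trackers, detections, iou_threshold=0.3):
    """Return (matches, unmatched_tracker_indices, unmatched_detection_indices)."""
    if not trackers or not detections:
        return [], list(range(len(trackers))), list(range(len(detections)))
    scores = np.array([[iou(t, d) for d in detections] for t in trackers])
    # Pick the one-to-one pairing that maximizes the total IoU.
    rows, cols = linear_sum_assignment(scores, maximize=True)
    # Discard pairings whose overlap is below the threshold.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols)
               if scores[r, c] >= iou_threshold]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_t = [i for i in range(len(trackers)) if i not in matched_t]
    unmatched_d = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_t, unmatched_d


# Two predicted tracker boxes and three detections in the next frame.
trackers = [(10, 10, 50, 50), (100, 100, 140, 140)]
detections = [(102, 98, 142, 138), (12, 11, 52, 51), (300, 300, 320, 320)]
matches, lost, new = match(trackers, detections)
print(matches, lost, new)
```

In this sketch, each match pairs a tracker index with a detection index, and any detection left unmatched is a candidate for a new track.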

The general process is to detect objects using an object detection algorithm, match the resulting bounding boxes with the bounding boxes from previous frames using the Hungarian algorithm, and then use Kalman filters to predict future bounding box positions or refine the current ones.

Object detection and tracking using predict_video function

The following options/parameters can be specified in the predict_video function by the user:

  • The final saved VMTI can be multiplexed with the input video by passing the multiplex=True flag. The multiplexed video can be saved at the path specified in multiplex_file_path. By default, the video gets saved in the original video's directory.

The track=True parameter can be used to track detected objects in the video. When tracking the detected objects, the following tracker_options can be specified as a dict:

  • assignment_iou_thrd - There might be multiple trackers detecting and tracking objects. This Intersection over Union (IoU) threshold sets the minimum overlap required to assign a tracker to a detection.
  • vanish_frames - The number of frames an object must remain absent from the video before it is considered to have vanished.
  • detect_frames - The number of frames an object must remain present in the video before tracking of it starts.
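To make the roles of vanish_frames and detect_frames more concrete, the sketch below shows one way such counters could drive the lifecycle of a track. It is a hypothetical illustration of the idea described in the list above, not the arcgis.learn implementation; the Track class, the update_tracks function and the exact comparison rules are assumptions.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    """Hypothetical bookkeeping for one tracked object."""
    track_id: int
    hits: int = 1            # consecutive frames with a matched detection
    misses: int = 0          # consecutive frames without a matched detection
    confirmed: bool = False  # True once the object is actually being tracked


def update_tracks(tracks, matched_ids, new_detections, ids,
                  detect_frames=10, vanish_frames=40):
    """Advance the bookkeeping by one frame and return the surviving tracks."""
    survivors = []
    for track in tracks:
        if track.track_id in matched_ids:
            track.hits += 1
            track.misses = 0
            if track.hits >= detect_frames:
                track.confirmed = True
        else:
            track.hits = 0
            track.misses += 1
        # A track absent for vanish_frames consecutive frames is dropped.
        if track.misses < vanish_frames:
            survivors.append(track)
    # Every unmatched detection starts a new track with its own unique ID.
    for _ in range(new_detections):
        survivors.append(Track(track_id=next(ids), confirmed=detect_frames <= 1))
    return survivors


ids = count(1)
# Frame 1: one new detection appears.
tracks = update_tracks([], set(), 1, ids, detect_frames=3, vanish_frames=2)
# Frames 2 and 3: the same object is matched again, so it becomes confirmed.
for _ in range(2):
    tracks = update_tracks(tracks, {1}, 0, ids, detect_frames=3, vanish_frames=2)
confirmed_after_three = tracks[0].confirmed
# Frames 4 and 5: the object is missing, so the track is dropped.
for _ in range(2):
    tracks = update_tracks(tracks, set(), 0, ids, detect_frames=3, vanish_frames=2)
print(confirmed_after_three, tracks)
```

Under these assumptions, a larger detect_frames suppresses short-lived false detections at the cost of starting tracks later, while a larger vanish_frames keeps an ID alive through longer occlusions.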

Additionally, the detections can be visualized on an output video that this function can create, if passed the visualize=True parameter. When visualizing the detected objects, the following visual_options can be specified to display scores, labels, the color of the predictions, thickness and font face to show the labels:

  • show_scores - Whether to display scores on predictions
  • show_labels - Whether to display labels on predictions
  • thickness - Line thickness of the bounding box
  • fontface - Font face for the labels, given as an OpenCV font value
  • color - Color of the predictions, as a (B, G, R) tuple

The example below shows how a trained model can be used to detect objects in a video:

In [ ]:
# `model` stands for an arcgis.learn object detection model class (for example, RetinaNet)
mdl = model.from_model(r'\path\to\model\model.emd')
In [20]:
mdl.predict_video(
    input_video_path=r'\path\to\video.mp4', 
    metadata_file=r'\path\to\metadata\file.csv',
    visualize=True)

The following example shows how the detected objects can also be tracked and the result multiplexed with the input video. It also creates an output video that visualizes the detected objects using the specified visual_options:

In [21]:
mdl.predict_video(
    input_video_path=r'\path\to\input_video.mp4',
    metadata_file=r'\path\to\metadata\metadata_file.csv',
    track=True,
    output_file_path=r'\path\to\output\output_file.mp4',
    multiplex=True,
    multiplex_file_path=r'\path\to\output\multiplexed_file.mp4',
    tracker_options={'assignment_iou_thrd': 0.3, 'vanish_frames': 40, 'detect_frames': 10},
    visual_options={'show_scores': True, 'show_labels': True, 'thickness': 2, 'fontface': 0, 'color': (255, 255, 255)})

You can refer to this sample notebook for a detailed workflow that automates road surface investigation using a video.

References

[1] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He: “Focal Loss for Dense Object Detection”, 2017; arXiv:1708.02002, http://arxiv.org/abs/1708.02002

[2] https://towardsdatascience.com/computer-vision-for-tracking-8220759eee85

[3] https://arxiv.org/abs/1602.00763

