YOLOv3 Object Detector

Introduction

YOLO (You Only Look Once) is one of the most popular series of object detection models. Its main advantage is that it provides real-time detections while approaching the accuracy of state-of-the-art object detection models.

Earlier object detection models relied on either a sliding window technique or a region proposal network. A sliding window approach, as the name suggests, chooses a Region of Interest (RoI) by sliding a window across the image and then classifies the contents of each chosen RoI to detect an object. Region proposal networks work in two steps: first they extract region proposals, and then they classify the proposed regions using CNN features. The sliding window method is not very precise, and although some region-based networks can be highly accurate, they tend to be slower.
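To make the cost of the sliding window approach concrete, here is a minimal sketch that only enumerates the candidate windows. The function name, window size, and stride are illustrative choices, not part of any library; a real detector would also run a classifier on every window and usually repeat the scan at several window sizes.

```python
def sliding_windows(image_width, image_height, window_size, stride):
    """Yield (x_min, y_min, x_max, y_max) for every window that fits in the image."""
    for y in range(0, image_height - window_size + 1, stride):
        for x in range(0, image_width - window_size + 1, stride):
            yield (x, y, x + window_size, y + window_size)


# Illustrative values: a 416x416 image scanned with a 104 px window and a 52 px stride.
windows = list(sliding_windows(416, 416, window_size=104, stride=52))
print(len(windows))  # number of RoIs that would each need a classification pass
```

Even this coarse scan at a single window size produces dozens of regions to classify, which is why one-shot detectors that process the image in a single pass are so much faster.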

Then came the one-shot object detectors such as SSD, YOLO, and RetinaNet. These models detect objects in a single pass over the image and are therefore considerably faster, while still being able to match the accuracy of region-based detectors. The SSD guide explains the essential components of a one-shot object detection model, and the RetinaNet guide covers that model in more detail. Both models are already part of ArcGIS API for Python, and the addition of YOLOv3 provides another tool in our deep learning toolbox.

The biggest advantage of YOLOv3 in arcgis.learn is that it comes preloaded with weights pretrained on the COCO dataset. This makes it ready-to-use for the 80 common objects (car, truck, person, etc.) that are part of the COCO dataset.

Figure 1. Real-time Object detection using YOLOv3 [1]

Model Architecture

YOLOv3 uses Darknet-53 as its backbone, in contrast to the popular ResNet family of backbones used by other models such as SSD and RetinaNet. Darknet-53 is a deeper version of Darknet-19, which was used in the earlier YOLOv2. As the name suggests, this backbone architecture has 53 convolutional layers. Adopting ResNet-style residual layers has improved its accuracy while maintaining the speed advantage. This feature extractor performs better than ResNet-101 and similarly to ResNet-152 while being about 1.5x and 2x faster, respectively [2].

YOLOv3 makes incremental improvements over its prior versions [2]. It upsamples feature layers and concatenates them with earlier feature layers, which preserves fine-grained features. Another improvement is detection at three scales, which makes the model good at detecting objects of varying sizes in an image. There are other improvements in anchor box selection, the loss function, etc. For a detailed analysis of the YOLOv3 architecture, please refer to this blog.

Figure 2. YOLOv3 architecture [3]
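The effect of detecting at three scales can be illustrated with a little arithmetic. The sketch below assumes a 416x416 input, strides of 32, 16, and 8, and three anchor boxes per grid cell, as in the reference YOLOv3 configuration [2]. The function and constant names are illustrative only.

```python
INPUT_SIZE = 416         # assumed square input resolution in pixels
STRIDES = (32, 16, 8)    # assumed downsampling factor at each detection scale
ANCHORS_PER_SCALE = 3    # anchor boxes predicted per grid cell
NUM_CLASSES = 80         # COCO classes


def detection_grid_summary(input_size=INPUT_SIZE, strides=STRIDES,
                           anchors=ANCHORS_PER_SCALE):
    """Return (stride, grid_size, box_count) for each detection scale."""
    summary = []
    for stride in strides:
        grid = input_size // stride
        summary.append((stride, grid, grid * grid * anchors))
    return summary


# Each box carries 4 coordinates, 1 objectness score, and one score per class.
channels_per_cell = ANCHORS_PER_SCALE * (4 + 1 + NUM_CLASSES)

for stride, grid, boxes in detection_grid_summary():
    print(f"stride {stride:>2}: {grid}x{grid} grid, {boxes} candidate boxes")

total_boxes = sum(boxes for _, _, boxes in detection_grid_summary())
print(channels_per_cell, total_boxes)
```

The coarse 13x13 grid is suited to large objects, while the 52x52 grid contributes most of the candidate boxes and helps with small objects.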

Implementation in `arcgis.learn`

You can create a YOLOv3 model in arcgis.learn using a single line of code.

from arcgis.learn import YOLOv3
model = YOLOv3(data)

where data is the databunch prepared for training using the prepare_data method in the earlier steps.
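For context, a training workflow might look roughly like the sketch below. It is a sketch under stated assumptions rather than the definitive workflow: the helper name `train_yolov3`, the batch size, and the epoch count are illustrative, and `data_path` is a hypothetical folder of exported training chips. The import is placed inside the function so that the snippet can be loaded even where `arcgis` is not installed.

```python
def train_yolov3(data_path, epochs=10, batch_size=8):
    """Prepare a databunch from exported training data and fit a YOLOv3 model.

    data_path is a hypothetical folder containing exported training chips.
    """
    # Imported lazily so this sketch can be loaded without arcgis installed.
    from arcgis.learn import prepare_data, YOLOv3

    data = prepare_data(data_path, batch_size=batch_size)  # build the databunch
    model = YOLOv3(data)                                   # model tied to that data
    model.fit(epochs)                                      # train for the given epochs
    return model
```

Calling `train_yolov3(r"path/to/training_data")` would return the fitted model. Suitable values for `epochs` and `batch_size` depend on the dataset and the available GPU memory.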

For more information about the API, please go to the API reference.

Using COCO pretrained weights

To use the model out-of-the-box with COCO pretrained weights, initialize the model as follows:

model = YOLOv3()

Note that the model must be initialized without providing any data. Because we are not training the model and are instead using the pretrained weights, we do not require a databunch. Any oriented image or video (at least 416x416 px) can be used for inference using the following commands, respectively:

model.predict(image_path)

model.predict_video(input_video_path, metadata_file)
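The two steps above (initialize without data, then predict) can be wrapped in a small helper that also enforces the minimum image size. This is only a sketch: `is_large_enough` and `detect_objects` are hypothetical helper names, and the caller is assumed to already know the image dimensions. Only `YOLOv3()` and `model.predict(image_path)` come from this guide.

```python
MIN_SIDE = 416  # minimum width and height in pixels, as stated in this guide


def is_large_enough(width, height, min_side=MIN_SIDE):
    """Return True if both image dimensions meet the minimum size."""
    return width >= min_side and height >= min_side


def detect_objects(image_path, width, height):
    """Run COCO-pretrained YOLOv3 on one image after a size check."""
    if not is_large_enough(width, height):
        raise ValueError(f"image must be at least {MIN_SIDE}x{MIN_SIDE} px")

    # Imported lazily so the size check works without arcgis installed.
    from arcgis.learn import YOLOv3

    model = YOLOv3()  # no databunch: use the COCO pretrained weights
    return model.predict(image_path)
```

Checking the size before loading the model avoids the cost of initializing the pretrained weights for an image that cannot be used.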

The following 80 classes are available for object detection in the COCO dataset:

'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck',
'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass',
'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich',
'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair',
'couch', 'potted plant', 'bed', 'dining table',
'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'

References

[1] Sik-Ho Tsang, "Review: YOLOv3 — You Only Look Once (Object Detection)", https://towardsdatascience.com/review-yolov3-you-only-look-once-object-detection-eab75d7a1ba6.
[2] Joseph Redmon, Ali Farhadi, "YOLOv3: An Incremental Improvement", 2018, arXiv:1804.02767, https://arxiv.org/abs/1804.02767.
[3] Ayoosh Kathuria, "What’s new in YOLO v3?", https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b.