Vehicle detection and tracking using deep learning

🔬 Data Science

🥠 Deep Learning and Object Detection

🛤️ Tracking

Introduction and objective

Vehicle detection and tracking is a common problem with multiple use cases. Government authorities and private establishment might want to understand the traffic flowing through a place to better develop its infrastructure for the ease and convenience of everyone. A road widening project, timing the traffic signals and construction of parking spaces are a few examples where analysing the traffic is integral to the project.

Traditionally, identification and tracking has been carried out manually. A person will stand at a point and note the count of the vehicles and their types. Recently, sensors have been put into use, but they only solve the counting problem. Sensors will not be able to detect the type of vehicle.

In this notebook, we'll demonstrate how we can use deep learning to detect vehicles and then track them in a video. We'll use a short video taken from live traffic camera feed.

Necessary imports

%matplotlib inline
import os
import pandas as pd
from pathlib import Path

from arcgis.gis import GIS
from arcgis.learn import RetinaNet, prepare_data

gis = GIS('home')

Prepare data that will be used for training

You can download vehicle training data from here. Extract the downloaded file to get your training data.

Model training

Let's set a path to the folder that contains training images and their corresponding labels.

training_data = gis.content.get('ccaa060897e24b379a4ed2cfd263c15f')
training_data

vehicle_detection_and_tracking

Image Collection by api_data_owner
Last Modified: August 26, 2020
0 comments, 1,891 views

filepath = training_data.download(file_name=training_data.name)

import zipfile
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)

data_path = Path(os.path.join(os.path.splitext(filepath)[0]))

We'll use the prepare_data function to create a fastai databunch with the necessary parameters such as batch_size, and chip_size. A complete list of parameters can be found in the API reference.

The given dataset has 235 images of size 854x480 pixels. We will define a chip_size of 480 pixels which will create random crops of 480x480 from the given images. This way we will maintain the aspect ratios of the objects but can miss out on objects when training the model for fewer epochs. To avoid cropping, we can set resize_to=480 so that every chip is an entire frame and doesn't miss any object, but there is a risk of poor detection with smaller sized object.

data = prepare_data(data_path, 
                    batch_size=4, 
                    dataset_type="PASCAL_VOC_rectangles", 
                    chip_size=480)

Please check your dataset. 9 images dont have the corresponding label files.

We see the warning above because there are a few images in our dataset with missing corresponding label files. These images will be ignored while loading the data. If it is a significant number, we might want to fix this issue by adding the label files for those images or removing those images.

We can use the classes attribute of the data object to get information about the number of classes.

data.classes

['background',
 'bicycle',
 'bus',
 'car',
 'motorcycle',
 'person',
 'scooter',
 'tempo',
 'tractor',
 'truck',
 'van']

Visualize training data

To visualize and get a sense of the training data, we can use the data.show_batch method.

data.show_batch()

In the previous cell, we see a sample of the dataset. We can observe, in the given chips, that the most common vehicles are cars and bicycles. It can also be noticed that the different instance of the vehicles have varying scales.

Load model architecture

arcgis.learn provides us object detection models which are based on pretrained convnets, such as ResNet, that act as the backbones. We will use RetinaNet with the default parameters to create our vehicle detection model. For more details on RetinaNet check out How RetinaNet works? and the API reference.

retinanet = RetinaNet(data)

We will use the lr_find() method to find an optimum learning rate. It is important to set a learning rate at which we can train a model with good accuracy and speed.

lr = retinanet.lr_find()

Train the model

We will now train the RetinaNet model using the suggested learning rate from the previous step. We can specify how many epochs we want to train for. Let's train the model for 100 epochs. Also, we can turn tensorboard True if we want to visualize the training process in tensorboard.

retinanet.fit(100, lr=lr, tensorboard=True)

Monitor training on Tensorboard using the following command: 'tensorboard --host=DELDEVD020 --logdir="~\AppData\Local\Temp\vehicle_detection_and_tracking\training_log"'

epoch	train_loss	valid_loss	average_precision	time
0	2.920065	3.539380	0.000000	00:35
1	2.813155	3.267476	0.000000	00:34
2	2.669132	2.747293	0.026481	00:35
3	2.436182	33.604561	0.051374	00:37
4	2.093326	53.471359	0.056984	00:38
5	1.764035	59.744678	0.058219	00:37
6	1.622536	21.022703	0.072807	00:35
7	1.422675	10.428946	0.068505	00:35
8	1.306369	4.067480	0.070096	00:35
9	1.177961	3.009438	0.071657	00:37
10	1.094871	1.647490	0.078009	00:34
11	1.106127	4.243889	0.067299	00:34
12	1.015017	4.861908	0.080683	00:32
13	1.000670	2.700608	0.098910	00:33
14	0.922511	2.807095	0.109304	00:32
15	0.859568	1.992613	0.097922	00:32
16	0.805707	1.943712	0.108335	00:32
17	0.708350	2.501110	0.100423	00:32
18	0.700377	1.970295	0.101732	00:32
19	0.737585	1.489944	0.116635	00:34
20	0.715043	1.333420	0.119699	00:32
21	0.658129	1.908532	0.096939	00:33
22	0.641105	1.665467	0.124671	00:32
23	0.584993	1.614303	0.122485	00:32
24	0.642041	1.949863	0.124358	00:32
25	0.582971	1.438722	0.151532	00:33
26	0.598245	2.033441	0.154287	00:32
27	0.534201	1.356403	0.146847	00:32
28	0.558636	1.075806	0.108606	00:32
29	0.538215	1.143882	0.171819	00:34
30	0.552466	1.204231	0.136003	00:32
31	0.510792	0.950244	0.198825	00:33
32	0.506083	0.949070	0.282821	00:32
33	0.458930	1.120145	0.202151	00:32
34	0.457333	1.095438	0.200361	00:32
35	0.431510	0.948762	0.181125	00:32
36	0.428306	0.849569	0.178509	00:33
37	0.432655	1.017028	0.245018	00:33
38	0.433414	1.021725	0.154290	00:33
39	0.399637	0.951915	0.161055	00:35
40	0.415549	1.017545	0.198056	00:33
41	0.407991	0.955821	0.261862	00:32
42	0.411236	1.030218	0.248527	00:32
43	0.395501	0.999297	0.358107	00:32
44	0.403905	0.977955	0.329283	00:33
45	0.402387	1.105100	0.211323	00:35
46	0.378837	0.870647	0.230285	00:34
47	0.358226	0.711156	0.266512	00:35
48	0.329803	0.798039	0.326183	00:34
49	0.334917	0.891248	0.315338	00:36
50	0.345367	0.872495	0.366508	00:34
51	0.345040	0.745729	0.349554	00:33
52	0.307302	0.838848	0.230683	00:33
53	0.323212	0.826456	0.408229	00:34
54	0.333826	0.711416	0.396662	00:35
55	0.325671	0.783359	0.287597	00:34
56	0.308648	0.780350	0.233106	00:35
57	0.311531	0.804540	0.267550	00:33
58	0.324239	0.753153	0.304216	00:36
59	0.320816	0.672748	0.329250	00:34
60	0.305792	0.748309	0.277391	00:35
61	0.303483	0.774665	0.344212	00:34
62	0.278325	0.723832	0.340825	00:34
63	0.273383	0.729151	0.324897	00:34
64	0.264061	0.676256	0.265130	00:35
65	0.249917	0.810901	0.323834	00:33
66	0.246965	0.728388	0.292947	00:34
67	0.263901	0.777008	0.330378	00:34
68	0.270437	0.704789	0.292732	00:36
69	0.237354	0.711246	0.316746	00:34
70	0.232953	0.765340	0.283510	00:34
71	0.246333	0.703580	0.326877	00:34
72	0.280900	0.729755	0.299404	00:34
73	0.262047	0.682556	0.299948	00:34
74	0.246901	0.713064	0.308421	00:34
75	0.239837	0.693446	0.303404	00:33
76	0.226346	0.709199	0.293320	00:33
77	0.247556	0.675500	0.301058	00:34
78	0.224785	0.668302	0.304285	00:35
79	0.231438	0.717872	0.294961	00:34
80	0.227474	0.761334	0.298954	00:35
81	0.230270	0.709718	0.302787	00:35
82	0.247459	0.693567	0.306647	00:34
83	0.241058	0.665935	0.306471	00:34
84	0.232183	0.620805	0.414182	00:34
85	0.246630	0.620489	0.406833	00:35
86	0.236326	0.658135	0.406687	00:35
87	0.222942	0.724259	0.295845	00:35
88	0.223491	0.711598	0.406538	00:36
89	0.212068	0.701122	0.398747	00:34
90	0.218335	0.656301	0.306746	00:35
91	0.215216	0.671562	0.305381	00:34
92	0.209122	0.658336	0.306580	00:34
93	0.204161	0.678559	0.289841	00:34
94	0.204125	0.680999	0.305388	00:34
95	0.218789	0.662410	0.306862	00:34
96	0.207825	0.682917	0.306709	00:34
97	0.215605	0.663999	0.398909	00:34
98	0.220050	0.685042	0.306403	00:36
99	0.228829	0.674382	0.397525	00:34

After the training is complete, we can view the plot with training and validation losses.

retinanet.learn.recorder.plot_losses()

Visualize results on validation set

To see sample results we can use the show_results method. This method displays the chips from the validation dataset with ground truth (left) and predictions (right). We can also specify the threshold to view predictions at different confidence levels. This visual analysis helps in assessing the qualitative results of the trained model.

retinanet.show_results(thresh=0.4)

To see the quantitative results of our model we will use the average_precision_score method.

retinanet.average_precision_score(detect_thresh=0.4)

100.00% [6/6 00:01<00:00]

{'bicycle': 0.45238095592884786,
 'bus': 0.5,
 'car': 0.9038291679057693,
 'motorcycle': 0.1666666716337204,
 'person': 0.0,
 'scooter': 0.0,
 'tempo': 0.0,
 'tractor': 1.0,
 'truck': 0.5,
 'van': 0.5454545617103577}

We can see the average precision for each class in the validation dataset. Note that while car and bicycle have a good score, van doesn't, and a few have a score of 0. Remember when we visualized the data using show_batch we noted that the cars and bicycles were the most common objects. It means, the scores could be correlated with the number of examples of these objects we have in our training dataset.

Let's look at the number of instances of each class in the training data and it should explain.

all_classes = []
for i, bb in enumerate(data.train_ds.y):
    all_classes += bb.data[1].tolist()
    
df = pd.value_counts(all_classes, sort=False)
df.index = [data.classes[i] for i in df.index] 
df

car           747
van            75
bicycle       267
person         26
scooter         7
motorcycle     32
truck          29
bus            23
tempo           1
tractor         3
Name: count, dtype: int64

It is evident that the classes that have a score of 0.0 have extremely low number of examples in the training dataset.

Save the model

Let's save the model by giving it a name and calling the save method, so that we can load it later whenever required. The model is saved by default in a directory called models in the data_path initialized earlier, but a custom path can be provided.

retinanet.save('vehicle_det_ep100_defaults')

Computing model metrics...

100.00% [6/6 00:01<00:00]

WindowsPath('~/AppData/Local/Temp/vehicle_detection_and_tracking/models/vehicle_det_ep100_defaults')

Inference and tracking

Multiple-object tracking can be performed using predict_video function of the arcgis.learn module. To enable tracking, set the track parameter in the predict_video function as track=True.

The following options/parameters are available in the predict video function for the user to decide:-

vanish_frames: The number of frames the object remains absent from the frame to be considered as vanished.
detect_frames: The number of frames an object remains present in the frame to start tracking.
assignment_iou_thrd: There might be multiple trackers detecting and tracking objects. The Intersection over Union (iou) threshold can be set to assign a tracker with the mentioned threshold value.

video_data = gis.content.get('1801dc029fed467ba67d6e39113202af')
video_data

vehicle_detection_and_tracking_video

Image Collection by api_data_owner
Last Modified: November 19, 2020
0 comments, 482 views

videopath = video_data.download(file_name=video_data.name)

import zipfile
with zipfile.ZipFile(videopath, 'r') as zip_ref:
    zip_ref.extractall(Path(videopath).parent)

video_file = os.path.join(os.path.splitext(videopath)[0], 'test.mp4')
video_file

'~\AppData\\Local\\Temp\\vehicle_detection_and_tracking_video\\test.mp4'

retinanet.predict_video(input_video_path=video_file, 
                        metadata_file='test.csv',
                        track=True, 
                        visualize=True, 
                        threshold=0.5,
                        resize=True)

100.00% [15023/15023 31:25<00:00]

# Output video will be saved at video_out_file path
video_out_file = os.path.join(os.path.splitext(videopath)[0], 'test_predictions.avi')
video_out_file

'~\AppData\\Local\\Temp\\vehicle_detection_and_tracking_video\\test_predictions.avi'

We have already generated the output video and have displayed it below.

from IPython.display import HTML

HTML("""<style>.video-container{display:grid;justify-items:center;align-items:center;position:relative;width:100%;height:100%;min-height:400px;overflow:hidden}.video-container iframe{position:absolute;z-index:1;top:50%;left:50%;min-width:100%;min-height:100%;transform:translate(-50%,-50%)}</style><div class="video-container"><iframe id="kmsembed-1_61k6sy6f" width="608" height="402" src="https://mediaspace.esri.com/embed/secure/iframe/entryId/1_tssk1yqk/uiConfId/49806163/st/0" class="kmsembed" allowfullscreen webkitallowfullscreen mozallowfullscreen allow="autoplay *; fullscreen *; encrypted-media *" referrerpolicy="no-referrer-when-downgrade" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Road crack prediction"></iframe></div>""")

We can count the number of vehicles per unit of time and update a feature layer with the live count of cars, buses, trucks etc. When this process is done for multiple intersections within the city, an ArcGIS dashboard can be created. It queries the continually updated feature layers and displays the results using a dashboard such the following:

ArcGIS Dashboard Demo for Vehicle Detection and Tracking

from IPython.display import HTML
 
HTML("""<style>.video-container{display:grid;justify-items:center;align-items:center;position:relative;width:100%;height:100%;min-height:400px;overflow:hidden}.video-container iframe{position:absolute;z-index:1;top:50%;left:50%;min-width:100%;min-height:100%;transform:translate(-50%,-50%)}</style><div class="video-container"><iframe id="kmsembed-1_ig6khu77" width="608" height="402" src="https://mediaspace.esri.com/embed/secure/iframe/entryId/1_ig6khu77/uiConfId/49806163/st/0" class="kmsembed" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" referrerPolicy="no-referrer-when-downgrade" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="ArcGIS Dashboard Demo for Vehicle Detection and Tracking"></iframe></div>""")

Conclusion

In this notebook, we have learnt how to automate multi-object tracking and counting system. This will not only help in intelligent traffic management but can be found useful in wide variety of applications.

Vehicle detection and tracking using deep learning

Your ArcGIS portal

Your ArcGIS Location Platform dashboard