Citation: Zhang, Y.; Feng, Y.; Wang, S.; Tang, Z.; Zhai, Z.; Viegut, R.; Webb, L.; Raedeke, A.; Shang, Y. Deep Learning Models for Waterfowl Detection and Classification in Aerial Images. Information 2024, 15, 157. https://doi.org/10.3390/info15030157
Academic Editor: Danilo Avola
Received: 29 January 2024; Revised: 29 February 2024; Accepted: 4 March 2024; Published: 11 March 2024
Deep Learning Models for Waterfowl Detection and
Classification in Aerial Images
Yang Zhang 1,*, Yuan Feng 1, Shiqi Wang 1, Zhicheng Tang 1, Zhenduo Zhai 1, Reid Viegut 2, Lisa Webb 3, Andrew Raedeke 4 and Yi Shang 1,*
1 The Department of Electrical Engineering and Computer Science (EECS), University of Missouri, Columbia, MO 65201, USA; yfzc8@mail.missouri.edu (Y.F.); swz45@mail.missouri.edu (S.W.); zt253@mail.missouri.edu (Z.T.); zz7z9@mail.missouri.edu (Z.Z.)
2 The School of Natural Resources, University of Missouri, Columbia, MO 65201, USA; rav3pt@missouri.edu
3 U.S. Geological Survey, Missouri Cooperative Fish and Wildlife Research Unit, University of Missouri, Columbia, MO 65201, USA; ewebb@usgs.gov
4 The Missouri Department of Conservation, Columbia, MO 65201, USA; andrew.raedeke@mdc.mo.gov
* Correspondence: zhangy1@missouri.edu (Y.Z.); shangy@missouri.edu (Y.S.)
This article is a revised and expanded version of a paper entitled “Development of New Aerial Image Datasets and Deep Learning Methods for Waterfowl Detection and Classification”, which was presented at the 2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI), Atlanta, GA, USA, 14–17 December 2022.
Abstract: Waterfowl population monitoring is essential for wetland conservation. Recently, deep learning techniques have shown promising advancements in detecting waterfowl in aerial images. In this paper, we present a performance evaluation of several popular supervised and semi-supervised deep learning models for waterfowl detection in aerial images using four new image datasets containing 197,642 annotations. The best-performing model, Faster R-CNN, achieved 95.38% accuracy in terms of mAP. Semi-supervised learning models outperformed supervised models when the same amount of labeled data was used for training. Additionally, we present a performance evaluation of several deep learning models for waterfowl classification in aerial images using a new real-bird classification dataset consisting of 6986 examples in 20 categories and a new decoy classification dataset consisting of about 10,000 examples. The best model achieved an accuracy of 91.58% on the decoy dataset and 82.88% on the real-bird dataset.
Keywords: aerial images; waterfowl detection; waterfowl classification; deep learning; computer
vision
1. Introduction
The audience for this paper should be machine learning and data science professionals
who are interested in developing deep learning models for wildlife management and
research. Effective management of waterfowl populations is pivotal in the decision-making
framework outlined by the Missouri Department of Conservation’s Wetland Planning
Initiative [1]. Managers currently employ diverse methods, from informal observations to structured transect counts, for monitoring waterfowl. However, the lack of standardized monitoring hampers comparability across locations and diminishes collective learning for statewide management decisions. Accurate classification of waterfowl using UAS imagery requires an extensive library of annotated images and a more complete assessment of performance among alternative machine learning approaches than has been completed to date. Our work assesses the potential of using Uncrewed Aircraft Systems (UAS)
and deep learning techniques to enhance waterfowl population monitoring [2].
Our previous paper [3] aimed to present aerial-image datasets and to apply deep
learning models to detect and classify waterfowl in these datasets. The focus of the previous paper was on deep learning for waterfowl detection, while limited work had
been presented on waterfowl classification. As an extension of that work, this paper presents new aerial-image datasets and the adaptation and evaluation of advanced deep learning methods to detect and classify waterfowl in aerial
images. Between 2020 and 2022, we conducted 57 trips to capture real waterfowl imagery
and an additional 5 trips specifically for waterfowl decoy imagery across 10 conservation
areas in Missouri. The distribution of these conservation areas is shown in Figure 1.
Employing DJI Mavic Pro 2 drones and a custom drone-path-planning app, we captured
images at various altitudes (15 to 90 m) and in diverse lighting conditions (Sunny and
Cloudy), resulting in thousands of aerial images in varying contexts.
Figure 1. A map of Missouri with star marks to indicate the distribution of the habitats in which we
conducted the waterfowl survey.
To create labeled datasets for machine learning, we used a server-based LabelMe
program to collaboratively label the waterfowl instances in the aerial images. This involved
generating labels (bounding boxes) around the contours of the waterfowl instances. We
labeled 197,642 waterfowl across 1237 images for training and assessing deep learning mod-
els for waterfowl detection and classification. However, there were still over 100,000 aerial
images unlabeled. We created an unlabeled detection dataset from these images, which
served as the training data for our semi-supervised models.
For the waterfowl classification, we created a new labeled decoy classification dataset
containing around 10,000 examples and a new labeled real-bird classification dataset by
cropping individual waterfowl from aerial images captured at a 15 m altitude by a drone.
Additionally, we selected a subset of model-filtered waterfowl crops from the images
captured at a 15 m altitude in the unlabeled detection dataset, to create an unlabeled classi-
fication dataset for training semi-supervised models. In total, the waterfowl classification
dataset comprised 6989 labeled waterfowl crops and 235,542 unlabeled waterfowl crops.
The main contributions of this paper are as follows:
1. We created three new labeled datasets specifically designed for waterfowl detection in aerial images, along with a new dataset for waterfowl classification in aerial images.
2. Through rigorous evaluation using authentic waterfowl datasets, we assessed the efficacy of cutting-edge supervised deep learning models for both waterfowl detection and classification. Our analysis yielded notably accurate outcomes, demonstrating the models' robust performance in real-life scenarios.
3. We trained and evaluated semi-supervised learning models for waterfowl detection and classification. Our experiments' results showed an improvement in detection and classification accuracy.
2. Related Work
2.1. Deep Learning Methods for Object Detection
There are two main types of deep learning models for image object detection: one-stage
detectors and two-stage detectors. Two-stage detectors, exemplified by Faster R-CNN [4], Mask R-CNN [5], and EfficientDet [6], function by proposing regions through a dedicated network and subsequently classifying those regions via an independent network. Faster R-CNN, a popular two-stage detector, integrates a Region Proposal Network (RPN) for proposal generation, sharing convolutional layers with the object detection network. It
also employs a Feature Pyramid Network (FPN) to facilitate multi-scale proposal generation,
with specific anchor size adjustments optimized for detecting smaller objects, such as birds
in aerial images.
In contrast, one-stage detectors, such as RetinaNet [7] and SSD [8], operate as end-to-end deep learning models. Although slower, two-stage detectors often offer more accurate predictions. RetinaNet, a popular one-stage detector, enhances prediction accuracy through focal loss, performing direct regression and classification on individual anchor boxes derived from the feature map. The YOLO (You Only Look Once) models, such as YOLOv1 [9] and the recent YOLONAS [10], are well-known one-stage detectors. For example, YOLOv5 [11] demonstrated remarkable performance in 2021, while YOLONAS [10] attained state-of-the-art performance in 2023.
Transformer-based object-detection models, such as Detection Transformer (DETR),
have shown promising performance. DETR [12], notable for being the first end-to-end transformer-based object detector, achieved comparable performance to Faster R-CNN without the need for Non-Maximum Suppression (NMS) methods to reduce duplicated proposals. Deformable DETR [13] further improved DETR's convergence time by focusing on sparse spatial positions. The state-of-the-art model CODETR [14] has surpassed others
in the COCO detection leaderboard. A key innovation of CODETR lies in its application of
auxiliary heads to increase the number of samples in each training batch.
In the domain of aerial bird detection, the DeepForest Bird Detector, developed by the
Weecology lab at the University of Florida [15], is a leading RetinaNet-based model. Trained
on extensive drone-captured bird images worldwide, this model served as a baseline for
evaluating the bird-detection models developed in our study.
Semi-supervised learning techniques, like Mean Teacher [16], utilize unlabeled data
to bolster the performance of supervised learning. In the student–teacher model, these
methods initially generate predicted labels for unlabeled images through labeling functions
or existing models trained on labeled data. Subsequently, an object detector is trained,
using images containing both accurate and predicted (potentially inaccurate) labels. An-
other approach involves concurrent training of a detection neural network on labeled and
unlabeled images, utilizing the consistency of the predictions as an additional learning objective [17]. Soft Teacher [18], an end-to-end semi-supervised object-detection model based
on Faster R-CNN, diversifies input images by applying weak augmentation for the teacher
model and strong augmentation for the student model. It also employs a box-jittering
technique to select reliable pseudo-boxes for regression learning. Addressing imbalanced
foreground and background pseudo-labels during training, Unbiased Teacher [19] implements focal loss and Exponential Moving Average (EMA) training, effectively mitigating
the data-imbalance issue.
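Since focal loss recurs in both RetinaNet and Unbiased Teacher, a compact sketch of its binary (sigmoid) form is given below as a reading aid; the default alpha and gamma values follow common RetinaNet settings and are otherwise assumptions, not details taken from this paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so training focuses on
    hard ones. `logits` and `targets` (0/1) share the same shape."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```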
2.2. Deep Learning Methods for Image Classification
The objective of image classification is to predict the categories of distinct objects in
images. In the past decade, deep learning methods in image classification have attained
significant advancements since 2012 [20]. Numerous deep learning models have been
introduced, consistently delivering improved performance [21,22].
ResNet [23] is a highly successful image-classification model, specifically tackling the vanishing gradient challenge within deep neural networks by introducing a framework for deep residual learning. EfficientNet [24] introduced the compound coefficient technique.
Unlike random scaling of network depth and width, this technique harmonizes width,
depth, and resolution dimensions using a constant ratio, thereby effectively balancing
the model's overall architecture. MixMatch [25] is a semi-supervised classification model published in 2019. MixMatch applies k rounds of augmentation to original images and
employs a sharpening algorithm to generate distinct pseudo-labels for them. Both labeled
and unlabeled data are incorporated into the training process, with prediction consistency
serving as the guiding supervision. FixMatch [26] is another semi-supervised classification
model. It employs a blend of consistency regularization and pseudo-labeling within its
semi-supervised training methodology. Pseudo-labels, serving as the supervision for
predictions on strongly augmented unlabeled images, are generated from the model’s
output on weakly augmented unlabeled images.
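To make the FixMatch idea concrete, the following is a minimal sketch of its core pseudo-labeling step in PyTorch; the model interface, the augmentation pipelines, and the 0.95 confidence threshold are illustrative assumptions rather than the exact configuration used in this paper.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
    """FixMatch-style consistency loss on unlabeled images.

    weak_batch / strong_batch: weakly and strongly augmented views of the
    same unlabeled images, tensors of shape [B, C, H, W].
    """
    with torch.no_grad():
        # Pseudo-labels come from predictions on the weakly augmented view.
        weak_logits = model(weak_batch)
        probs = F.softmax(weak_logits, dim=1)
        max_probs, pseudo_labels = probs.max(dim=1)
        # Keep only confident pseudo-labels.
        mask = (max_probs >= threshold).float()

    # The strongly augmented view must match the pseudo-labels.
    strong_logits = model(strong_batch)
    loss = F.cross_entropy(strong_logits, pseudo_labels, reduction="none")
    return (loss * mask).mean()
```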
3. New Waterfowl Aerial-Image Datasets
3.1. Waterfowl-Detection Datasets
From images collected in Missouri conservation areas, we labeled 1237 aerial images
(drone altitude 15–90 m) with 197,642 waterfowl and decoy labels and created four new
waterfowl-detection datasets: Bird-G, Bird-H, Bird-I, and Bird-J, as shown in Table 1. These
datasets were categorized based on the Missouri conservation areas from which the aerial
images were collected. Compared with datasets Bird-A to Bird-F, these datasets are more
practical as they encompass data collected across various seasons and altitudes, thus
enhancing their comprehensiveness. The number of images, number of birds, drone flight
altitudes, and target objects of each dataset are given in the table. We then divided each
waterfowl-detection dataset into subsets of training (60%), validation (20%), and test (20%).
In addition, we selected 11,021 unlabeled aerial images to form an unlabeled dataset for semi-supervised learning experiments.
Table 1. Summary of waterfowl-detection datasets created based on collected aerial images.
Dataset Name No. of Images No. of Birds Altitude (m) Object
Bird-G 181 62,758 15–90 Birds
Bird-H 177 16,738 15–90 Decoys
Bird-I 171 7058 15 Birds
Bird-J 708 111,088 15–90 Birds
Unlabeled-K 11,021 Unknown 15–90 Birds
After dividing Bird-G, Bird-I, and Bird-J into training, validation, and test subsets, we combined the labeled datasets to form a larger dataset named the 'real-bird dataset'.
This dataset was used in evaluating model performance across various model and training
parameters. While annotating waterfowl instances, we also annotated the habitat and
weather conditions of the aerial images. The testing data encompassed images captured
at four different altitudes (15, 30, 60, and 90 m) in 11 distinct habitat conditions (i.e.,
HarvestedCrop, Ice, Land, Lotus, etc.) and two weather conditions (Cloudy and Sunny).
3.2. Waterfowl-Classification Dataset
We manually labeled the categories of waterfowl in the Bird-H and Bird-I datasets to
create a real-bird classification dataset and a decoy-bird classification dataset. The real-bird
classification dataset comprises 6986 waterfowl image crops—individual birds cropped
from 15 m images in the Bird-I dataset. The images belong to 20 categories, including
19 waterfowl categories and 1 'Unknown' category. Figure 2 shows the distribution of
the waterfowl images across the 20 categories. These category labels have been assigned
with high confidence by waterfowl experts within our team. The dataset division between
the training and test sets for the real-bird classification dataset mirrors that of the Bird-I
detection dataset, with a ratio of 5:1. However, it is important to note that this proportion
may not be consistent across all classes.
To create an unlabeled training dataset for semi-supervised learning, we ran a pre-
trained YOLOv5 model on all the images in the Unlabeled-K dataset to extract crops of
bird images and filtered out low-quality crops using a confidence threshold of 0.5. This
process yielded an unlabeled training dataset of 235,452 bird crops. While it is important to
acknowledge that some crops containing waterfowl instances may be mistakenly removed,
potentially reducing the transferability of the model to new datasets, these filtering methods
can significantly decrease the number of crops without waterfowl.
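As a rough illustration of this crop-extraction step, the sketch below runs a custom-trained YOLOv5 checkpoint over an image directory and saves detections above a confidence threshold as individual crops; the checkpoint path, directory names, and output layout are assumptions made for illustration.

```python
from pathlib import Path

import torch
from PIL import Image

# Assumed paths; replace with the actual checkpoint and image folders.
model = torch.hub.load("ultralytics/yolov5", "custom", path="yolov5_waterfowl.pt")
model.conf = 0.5  # discard low-confidence detections

image_dir = Path("unlabeled_k_images")
out_dir = Path("unlabeled_bird_crops")
out_dir.mkdir(exist_ok=True)

for img_path in sorted(image_dir.glob("*.jpg")):
    image = Image.open(img_path)
    results = model(image)
    # results.xyxy[0]: one row per detection -> [x1, y1, x2, y2, conf, cls]
    for i, (x1, y1, x2, y2, conf, cls) in enumerate(results.xyxy[0].tolist()):
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
        crop.save(out_dir / f"{img_path.stem}_{i}_{conf:.2f}.jpg")
```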
Figure 2. Distribution of waterfowl image examples in training and test sets across 20 categories in
the real-bird classification dataset.
The decoy-bird classification dataset contains around 10,000 decoy-bird crops from
images in the Bird-H dataset. There are 10 different bird categories for images taken at four
different heights. Considering the limited number of waterfowl instances, we divided the
dataset into training and test sets and omitted the validation set. In the test set, we ensured
an equal number of examples across all classes. The remaining examples were placed in the
training set. In the case of the 90 m subset, we excluded the ‘female wigeon’ and ‘female
pintail’ classes due to their limited number of images, which fell below 10, making them
too small for reliable analysis. To investigate the impact of habitat on detection accuracy,
we used three habitat subsets representing OpenWater, StandingCorn, and MoistSoil. Each
subset was further divided into training and test sets, using a 7:3 ratio.
4. Methods
We applied some state-of-the-art deep learning methods to detect and classify water-
fowl in drone images and compared their performances under various conditions.
4.1. Deep Learning Models For Waterfowl Detection
We applied both supervised models—including DeepForest Bird Detector, RetinaNet,
Faster R-CNN, YOLOv5, and YOLONAS—and a semi-supervised model, Soft Teacher,
to our waterfowl detection. The waterfowl objects in our datasets fall within the small-to-medium object category according to the COCO [27] dataset guidelines. The bounding-box sizes for the waterfowl ranged from 18 × 18 pixels to 94 × 89 pixels.
During the Faster R-CNN training, we adjusted the initial size of the anchor boxes
from [32, 64, 128, 256, 512] to [8, 16, 32, 64, 128] to align with typical waterfowl sizes. To
accommodate higher waterfowl density, the RPN positive-sample fraction was increased
from 0.5 to 0.8 and the RPN batch size from 256 to 512 to generate more positive samples
in the Region Proposal Network training. The training parameters were set to 100 epochs, with an early-stopping tolerance of 30 epochs, a learning rate of 0.001, and a batch size of 4 for all the models. For input uniformity across the deep learning models, we cropped each training image into multiple non-overlapping 512 × 512 pixel images, facilitating training across various models.
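A minimal sketch of these anchor and RPN adjustments using torchvision's Faster R-CNN implementation is shown below; the anchor sizes, positive fraction of 0.8, and RPN batch size of 512 come from the description above, while the specific framework, aspect ratios, and class count are assumptions, since the paper does not state its code base.

```python
import torchvision
from torchvision.models.detection.anchor_utils import AnchorGenerator

# Smaller anchors for small birds: one size per FPN level, three aspect ratios each.
anchor_sizes = ((8,), (16,), (32,), (64,), (128,))
aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)
anchor_generator = AnchorGenerator(sizes=anchor_sizes, aspect_ratios=aspect_ratios)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None,
    num_classes=2,                      # background + waterfowl
    rpn_anchor_generator=anchor_generator,
    rpn_batch_size_per_image=512,       # raised from the default 256
    rpn_positive_fraction=0.8,          # raised from the default 0.5
)
```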
During testing, we initially cropped each test image into 512 × 512 pixel images, which were then fed into the trained deep learning models. The resulting detections were aggregated to form predictions for the original test images. Performance metrics were computed based on these predictions and their corresponding ground-truth labels. For the semi-supervised detection models, unlabeled image crops in the Unlabeled-K dataset were prepared by cropping all the original aerial images into 512 × 512 pixel images.
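The tile-and-merge inference step can be sketched as follows; the detector interface (a function returning boxes in tile coordinates) and the tile size follow the description above, while everything else is an illustrative assumption.

```python
import numpy as np

def detect_on_tiles(image, detect_fn, tile=512):
    """Run a detector on non-overlapping tiles and merge boxes back into
    full-image coordinates.

    image: HxWxC numpy array.
    detect_fn: callable returning an (N, 5) array [x1, y1, x2, y2, score]
               in tile coordinates.
    """
    h, w = image.shape[:2]
    merged = []
    for y0 in range(0, h, tile):
        for x0 in range(0, w, tile):
            patch = image[y0:y0 + tile, x0:x0 + tile]
            for x1, y1, x2, y2, score in detect_fn(patch):
                # Shift tile-local boxes back to original image coordinates.
                merged.append([x1 + x0, y1 + y0, x2 + x0, y2 + y0, score])
    return np.array(merged)
```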
4.2. Deep Learning Models for Waterfowl Classification
For the waterfowl classification, we applied two supervised classification models,
EfficientNet and ResNet, and two semi-supervised classification models, MixMatch and
FixMatch. After some basic parameter tuning by exploring a range of parameter values,
we selected parameters that yielded good results across all of our experiments. In training,
we used data augmentation that included random rotation and random horizontal flip. We
tested two backbones, WiderResNet and ResNext, for the semi-supervised models. Across
all the models, we set the training epochs to 300, the learning rate to 0.0001, and the batch size
to 32. Regarding the semi-supervised models, each training batch comprised 16 labeled and
16 unlabeled images.
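As an illustration of this training setup, the following sketch configures the augmentation pipeline and a supervised classifier in PyTorch; the rotation range, optimizer choice (Adam), and use of ResNet18 with a 20-way head are assumptions layered on top of the settings stated above.

```python
import torch
import torchvision
from torchvision import transforms

# Augmentation used during training: random rotation and random horizontal flip.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=30),   # rotation range is an assumption
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# Supervised baseline: ResNet18 with a 20-way output head (one per category).
model = torchvision.models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 20)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr from the paper
EPOCHS, BATCH_SIZE = 300, 32
```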
4.3. Data Processing
We collected thousands of RGB aerial images featuring waterfowl and decoys, using
a DJI Mavic Pro 2 drone across various conservation areas in Missouri. This drone has a
20 MP 1-inch CMOS sensor, providing a 66-degree field of view and images at a resolution
of 5472 × 3648 pixels. For the waterfowl detection, we cropped each aerial image into 512 × 512 crops with an overlap of 20%. For the waterfowl classification, we resized the waterfowl crop images to different sizes according to the altitudes of the drone-captured images. That is, we resized the crops to 128 × 128 pixels for 15 m images, 64 × 64 pixels for 30 m images, and 40 × 40 pixels for 60 m images. For the semi-supervised models (FixMatch and MixMatch), we resized the waterfowl crop images to 32 × 32 pixels to match the input requirements of WiderResNet, the backbone of the two semi-supervised models.
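A sketch of these two preprocessing steps is given below: overlapping 512-pixel tiling for detection and altitude-dependent resizing for classification. The helper names and edge handling are hypothetical.

```python
from PIL import Image

TILE, OVERLAP = 512, 0.2
STRIDE = int(TILE * (1 - OVERLAP))   # 409 px step gives roughly 20% overlap

# Altitude (m) -> classification crop size (pixels), per the text above.
CROP_SIZE_BY_ALTITUDE = {15: 128, 30: 64, 60: 40}

def tile_origins(length, tile=TILE, stride=STRIDE):
    """1-D tile start positions with overlap; the last tile is clamped to the edge."""
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts

def crop_boxes(width, height):
    """All (left, top, right, bottom) tile boxes covering a full aerial image."""
    return [(x, y, x + TILE, y + TILE)
            for y in tile_origins(height) for x in tile_origins(width)]

def resize_for_classification(crop: Image.Image, altitude_m: int) -> Image.Image:
    size = CROP_SIZE_BY_ALTITUDE[altitude_m]
    return crop.resize((size, size))
```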
4.4. Evaluation Metrics
When evaluating the detection performance, we used Precision, Recall, F1, and mAP30 [2]:

$$\mathrm{Precision} = \frac{tp}{tp + fp}, \quad \mathrm{Recall} = \frac{tp}{tp + fn}, \qquad (1)$$

where $tp$ is the number of true positives, $fp$ the number of false positives, and $fn$ the number of false negatives.

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \qquad (2)$$

$$\mathrm{IoU} = \frac{\text{Intersection area of two bounding boxes}}{\text{Union area of two bounding boxes}}. \qquad (3)$$
Note that mAP stands for mean Average Precision and that mAP30 is the mean
Average Precision when the Intersection over Union (IoU) threshold is 30%.
When evaluating the classification performance, we used classification accuracy.
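For concreteness, a small sketch of how these metrics can be computed is given below; the matching step that pairs predictions with ground-truth boxes (at IoU ≥ 0.3 for mAP30) is assumed to have produced the tp, fp, and fn counts already.

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes (Equation (3))."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 from detection counts (Equations (1) and (2))."""
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1
```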
5. Experimental Results
The experiments were run on a Dell AlienWare desktop with Nvidia RTX 2070 GPU
and 8 GB of memory.
5.1. Performance of Detectors Trained Using Individual Datasets
In this experiment, we separately applied Faster R-CNN, YOLOv5, YOLONAS, and
Soft Teacher to each dataset. To elaborate, using the Bird-G dataset as an example, we
trained each deep learning model using its allocated training and validation sets. Subse-
quently, the model’s performance was assessed and reported based on its test set.
Table 2 compares the mAP30 performances of four models on four datasets. YOLONAS
was the best on average, reaching 86.66% mAP. Faster R-CNN and SoftTeacher were slightly
worse than YOLONAS. YOLOv5 was the worst, only 77.16%, mainly due to its poor
performance on Bird-H. None of the models performed the best across all datasets.
Table 2. Test performances of individually trained detection models, in terms of mAP30 (%).
Faster R-CNN YOLOv5 YOLONAS SoftTeacher
Bird-G 89.76 89.42 86.62 88.73
Bird-H 81.77 52.14 91.52 78.56
Bird-I 94.57 88.48 89.2 95.54
Bird-J 73.60 78.61 79.23 71.43
Average 84.92 77.16 86.66 83.31
5.2. Performance of Detectors Trained Using All Datasets Combined
In this experiment, we trained each detection model using the combined training
images from all the detection datasets. For fair comparison, we used the same parameters
when training the detection models: 100 epochs, learning rate 0.01, and batch size 2. Then,
we evaluated these trained models on the test set of each dataset. One exception was Deep-
Forest Bird Detector. We did not re-train it and simply used the pre-trained weight from its
public release.
Table 3 compares the mAP30 performances of six models on four datasets. Again,
YOLONAS was the best on average, reaching 84.3% mAP. The pre-trained DeepForest Bird
Detector was the worst, only 64.64%. None of the models performed the best across all
datasets. YOLONAS was the best on Bird-H. YOLOv5 was the best on Bird-G and Bird-J.
Faster R-CNN was the best on Bird-I.
We compared the results in Table 3 with those in Table 2, to assess the feasibility of
training a generic model capable of achieving performance comparable to models trained on
individual datasets. However, the results indicate that generic models generally perform
worse than those trained on individual datasets, in terms of average mAP30, with the
exception of YOLOv5. We observed that YOLOv5 performed less effectively on small
datasets (Bird-H) during training. Increasing the number of training images can improve
its performance.
Table 3. Test performances of detection models trained using all datasets, in terms of mAP30 (%).
DeepForest RetinaNet Faster R-CNN YOLOv5 YOLONAS Soft Teacher
Bird-G 76.60 89.69 89.67 91.08 84.56 88.56
Bird-H 55.65 81.69 82.88 68.78 88.11 82.45
Bird-I 77.41 85.48 88.85 87.07 87.57 84.46
Bird-J 48.93 74.71 74.48 88.98 76.97 72.20
Average 64.64 82.89 83.71 83.97 84.30 81.91
Table 4 shows the training and inference times of these models. The semi-supervised
model Soft Teacher was the slowest in training, about 30 times slower than RetinaNet,
17 times slower than YOLOv5 and YOLONAS, and 4 times slower than Faster R-CNN. In
terms of inference time, Soft Teacher and Faster R-CNN had the same speed, about 4 times
slower than the other models.
Table 4. Comparison of training and inference times (in seconds) of detection models. The models
were trained for 1000 iterations. Inference time was for one drone image.
DeepForest RetinaNet Faster R-CNN YOLOv5 YOLONAS Soft Teacher
Training - 36 247 59 58 1014
Inference 0.9 0.9 4.3 1.1 1.0 4.2
Figure 3 shows an example of the detection results of the DeepForest Bird Detector, Faster R-CNN, YOLOv5, and YOLONAS on an image of a flooded corn field. These results were generated by setting
the models’ confidence threshold to 0.3.
Figure 3. Detection results on an image of a flooded corn field: (a) DeepForest Bird Detector, F1 = 0.328; (b) Faster R-CNN, F1 = 0.841; (c) YOLOv5, F1 = 0.943; (d) YOLONAS, F1 = 0.901. In each image, green boxes denote True Positive (TP) predictions, yellow boxes denote False Positive (FP) predictions, and red boxes denote False Negative (FN) predictions.
To study the influence of environmental factors, e.g., habitats and light conditions, on
detection accuracy, Table 5 compares the performances of those detectors trained using all
datasets on images captured in different habitats and light conditions. The performance of
those detectors exhibited significant variations, ranging from below 40% to above
90%. For instance, in the Ice-habitat case, Faster R-CNN and RetinaNet achieved 99.05%
and 89.76%, respectively, under Sunny conditions, but dropped to 70.68% and 34.28% under
Cloudy skies. In comparison, YOLOv5 and YOLONAS performed consistently well in the
Ice case under both Sunny and Cloudy skies. In the Land-habitat case, Faster R-CNN was
the best, reaching 95.82% in Sunny conditions but only 74.29% in Cloudy conditions.
As the results show, for most of the habitats the detectors performed better on Sunny
images. Yet, in the cases of the Wooded, MoistSoil, and StandingCorn habitats, most of the detectors performed better on Cloudy images.
Table 5. Test performances of detectors trained using all datasets combined in terms of mAP30 (%)
on bird images in different habitat and light conditions.
Faster R-CNN YOLOv5 YOLONAS RetinaNet Soft Teacher
Sunny Cloudy Sunny Cloudy Sunny Cloudy Sunny Cloudy Sunny Cloudy
HarvestedCrop 91.18 75.41 82.93 36.16 84.15 71.77 56.10 50.15 90.75 72.90
Ice 99.05 70.68 99.37 98.48 99.33 92.69 89.76 34.28 96.66 49.00
Land 95.82 74.29 88.71 66.95 90.56 68.99 73.11 48.85 68.03 61.81
Lotus 88.24 85.70 79.32 66.16 85.59 81.75 88.24 85.88 56.08 74.99
MoistSoil 93.13 90.20 72.98 93.59 86.58 91.99 76.44 84.61 78.65 71.91
OpenWater 98.67 87.18 99.09 93.05 97.89 91.30 98.11 83.98 89.09 44.10
ShrubScrub 93.73 - 56.81 - 89.15 - 64.01 - 84.80 -
StandingCorn 90.75 93.36 55.06 69.46 83.20 86.07 87.27 72.51 75.88 74.70
WaterCorn 95.48 91.69 71.27 68.03 91.78 88.56 94.87 66.24 83.21 52.99
Wooded 81.92 92.88 67.66 92.16 78.02 87.29 89.04 88.13 58.19 75.33
5.3. Performance of Altitude-Specific Detection Models
Based on the altitudes at which the aerial images in the datasets were captured, which
were 15, 30, 60, and 90 m, we partitioned all the real-bird detection datasets (i.e., Bird-G,
Bird-I, and Bird-J) into four distinct subsets. The division between the training and test
sets within each subset remained consistent with the original dataset. We subsequently
conducted separate training and testing of various models on these altitude-specific subsets.
Table 6 compares the performances of five models, in terms of mAP30, on datasets
of different image-capturing altitudes. The results show a decrease in the performances
of all the models as the altitude increased, which can be attributed to the decreasing size
and resolution of waterfowl at higher altitudes. Faster R-CNN was the best for lower-
altitude cases (i.e., 15 and 30 m), reaching 95.38% and 93.25% mAP. The two YOLO models
performed better on higher-altitude images. The semi-supervised model Soft Teacher was
competitive, but not the best for any altitude case.
Table 6. Test performances of altitude-specific models, in terms of mAP30 (%), on images captured at
different altitudes.
Altitude Faster R-CNN RetinaNet YOLOv5 YOLONAS Soft Teacher
15 m 95.38 86.20 85.37 93.96 92.59
30 m 93.25 90.54 80.78 91.23 92.27
60 m 87.56 43.93 86.21 91.41 88.58
90 m 81.67 62.94 90.58 88.70 77.23
5.4. Performance of Semi-Supervised Learning Detectors
In this experiment, we utilized the real-bird detection datasets to assess the efficacy
of semi-supervised learning models. We varied the proportions of labeled training data,
ranging from 10% to 50% of the training set for Soft Teacher, while the remainder served
as unlabeled data. For comparison, Faster R-CNN was trained using the same amount of
labeled data. Both models were trained using identical parameters, including 100 epochs, a
learning rate of 0.01, and a batch size of 4. The performance metric was mAP30.
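The labeled/unlabeled partitioning used in this experiment can be sketched as follows; the random seed and file-list representation are illustrative assumptions.

```python
import random

def split_labeled_unlabeled(train_images, labeled_fraction, seed=0):
    """Randomly mark a fraction of training images as labeled; the rest are
    treated as unlabeled for semi-supervised training."""
    rng = random.Random(seed)
    shuffled = train_images[:]
    rng.shuffle(shuffled)
    n_labeled = int(len(shuffled) * labeled_fraction)
    return shuffled[:n_labeled], shuffled[n_labeled:]

# Example: 10% labeled, 90% unlabeled, as in the smallest setting above.
# labeled, unlabeled = split_labeled_unlabeled(train_list, 0.10)
```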
Table 7 compares the performance of Soft Teacher with that of Faster R-CNN when
different amounts of labeled training examples were used in training. When a small amount
of labeled training examples was used, such as 10%, Soft Teacher outperformed Faster
R-CNN by a large margin (73.45% vs. 64.50%). As the amount of labeled training examples
being used increased, the performances of both Soft Teacher and Faster R-CNN improved,
and the difference between them decreased. Soft Teacher outperformed Faster R-CNN
in all cases. The performance of Faster R-CNN trained by a 100% labeled training set
was similar to that of Soft Teacher trained by a 50% labeled training set. We also noticed
that models trained by 80% labeled images outperformed models trained by 100% labeled
images and we believe that the inaccurate labels for the remaining 20% of images caused
this performance difference.
Table 7. Test performances of Faster R-CNN and Soft Teacher, in terms of mAP30 (%) when trained
using a proportion of the training set (10%, 20%, 50%, and 100%) as labeled data.
Labeled Training Set Proportion
10% 20% 50% 100%
Faster R-CNN 67.50 74.12 78.17 82.79
Soft Teacher 73.45 77.74 82.65 -
5.5. Performances of Classification Models
In this experiment, we evaluated the classification performances of various deep
learning models, including EfficientNet-b5, ResNet18, MixMatch, and FixMatch, using
both our real-bird and decoy classification datasets. All the models were trained with a
learning rate of 0.00001 and a batch size of 4 and with early stopping, halting the training
process when the validation accuracy showed no improvement for 15 consecutive epochs.
The maximum number of training epochs was capped at 300. Since all the decoy image
crops were labeled, we utilized unlabeled waterfowl crops from our unlabeled training set
when training the semi-supervised models on the decoy classification training set.
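A minimal sketch of the early-stopping rule described here (stop after 15 epochs without validation-accuracy improvement, capped at 300 epochs) is shown below; the train_one_epoch and evaluate helpers are hypothetical placeholders.

```python
def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=300, patience=15):
    """Stop training when validation accuracy has not improved for `patience` epochs."""
    best_acc, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_acc = evaluate(model)
        if val_acc > best_acc:
            best_acc, epochs_without_improvement = val_acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_acc
```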
Table 8 shows the classification accuracy of four models on decoy-bird-image crops
taken at altitudes of 15, 30, 60, and 90 m, as well as on real-bird-image crops taken at
an altitude of 15 m. The results show that the classification accuracy of all four models
decreased as the image altitude increased. For instance, EfficientNet reached 91.58% on
15 m images, but only 41.05% on 90 m images. There was a big classification-accuracy drop
from 30 m to 60 m. This leads us to the conclusion that images captured at 15 m and 30 m
altitudes are suitable for bird classification in aerial images, while images captured at 60 m
and 90 m are not.
In terms of overall performance, the two semi-supervised models, MixMatch and
FixMatch, leveraged extra unlabeled training data and outperformed EfficientNet and
ResNet18 on the 30, 60, and 90 m cases. However, EfficientNet was the best on the 15 m
decoy case, whereas MixMatch was the best on the 15 m real-bird case.
Table 8. Classification accuracy (%) of four classification models on 15, 30, 60, and 90 m waterfowl
classification datasets.
EfficientNet ResNet18 MixMatch FixMatch
15 m real bird 81.65 78.37 82.88 80.70
15 m decoy 91.58 89.78 87.54 88.71
30 m decoy 79.98 76.74 81.34 80.09
60 m decoy 43.75 40.66 46.40 48.80
90 m decoy 41.05 36.72 47.92 46.25
Table 9 compares the performances of the four models on images captured in different
habitats: OpenWater, MoistSoil, and StandingCorn. All the models achieved accuracy of over 90% on OpenWater images, between roughly 81% and 85% on StandingCorn images, and between 54.78% and 72.53% on MoistSoil images. All the models were competitive in all cases, except that
ResNet18 was much worse on MoistSoil images. These results underscore the considerable
influence of various habitat types on classification accuracy.
Table 9. Classification accuracy (%) of four deep learning models on images captured over three
different habitats in the decoy classification dataset.
EfficientNet ResNet18 MixMatch FixMatch
OpenWater 93.46 91.25 92.18 93.57
MoistSoil 70.77 54.78 71.58 72.53
StandingCorn 83.68 82.44 84.55 81.56
6. Summary and Future Work
This paper presents our recent work, which involved the creation of new aerial-image
datasets for waterfowl detection and classification and the adaptation and evaluation of
popular supervised and semi-supervised deep learning models. Our experimental results
for semi-supervised learning models showed their ability to slightly improve detection and
classification performance using unlabeled data. Furthermore, we showed that altitude-
specific detection models achieved improved detection results over altitude-blind detection
models. Multiple models delivered strong performance, particularly excelling in images
captured at 15 and 30 m, where they achieved detection accuracy exceeding 90%. Our
experimental results also showed that different image contexts, such as different habitat
and weather conditions, had significant impact on detection and classification accuracy.
Additionally, we evaluated several classification models using our classification dataset
and compared their performance across images taken at different heights and in vari-
ous habitats. These models delivered good performance on images captured at 15 and
30 m, achieving accuracy from 80% to 90%.
While labeling aerial images of waterfowl, we observed a disparity in their distribution
across habitats. There was a high-density distribution in habitats such as water and ice,
while habitats like land and crops showed a lower-density distribution. In future work, we
aim to identify the distribution patterns and to adjust the focus of our models accordingly.
When evaluating detection models on waterfowl datasets, we observed a significant
disparity between image crops containing birds (foreground images) and those without
birds (background images). The proportion of negative samples in the training set plays a critical role in the model's tendency to produce False Positives. Our future work
will focus on developing a dynamic training strategy to determine the optimal proportion
of negative samples in the training set.
Transformer-based object detection and classification models have exhibited promising
performance. Our future work will involve training and testing these models, with a focus
on comparing their performance against convolution-based models.
Author Contributions: Conceptualization, Y.Z., A.R., L.W. and Y.S.; methodology, Y.Z. and Y.S.;
software, Y.Z., Y.F., S.W., Z.T. and Z.Z.; validation, Y.Z.; formal analysis, Y.Z.; investigation, Y.Z.,
Y.F., Z.T., R.V. and Z.Z.; resources, Y.Z., Y.F., Z.T. and Z.Z.; data curation, Y.Z., Y.F., Z.T., R.V. and
Z.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.S.; visualization, Y.Z.;
supervision, Y.S.; project administration, Y.S.; funding acquisition, A.R., L.W. and Y.S. All authors
have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Department of Conservation, Missouri. The Missouri
Cooperative Fish and Wildlife Research Unit is jointly sponsored by the Missouri Department of
Conservation, the University of Missouri, the U.S. Fish and Wildlife Service, the U.S. Geological
Survey, and the Wildlife Management Institute. Any use of trade, firm, or product names is for
descriptive purposes only and does not imply endorsement by the U.S. Government.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: This article does not contain any studies performed by any of the
authors that involved human participants.
Data Availability Statement: Check our website to obtain sample data: https://waterfowldetector.readthedocs.io.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Missouri Department of Conservation. Wetland Planning Initiative Implementation Plan; Missouri Department of Conservation: St. Charles, MO, USA, 2019.
2. Tang, Z.; Zhang, Y.; Wang, Y.; Shang, Y.; Viegut, R.; Webb, E.; Raedeke, A.; Sartwell, J. sUAS and Machine Learning Integration in Waterfowl Population Surveys. In Proceedings of the 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI), Washington, DC, USA, 1–3 November 2021; pp. 517–521.
3. Zhang, Y.; Wang, S.; Zhai, Z.; Shang, Y.; Viegut, R.; Webb, E.; Raedeke, A.; Sartwell, J. Development of New Aerial Image Datasets and Deep Learning Methods for Waterfowl Detection and Classification. In Proceedings of the 2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI), Atlanta, GA, USA, 14–17 December 2022; pp. 117–124. [CrossRef]
4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28.
5. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [CrossRef]
6. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787.
7. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [CrossRef] [PubMed]
8. Ma, W.; Wang, X.; Yu, J. A Lightweight Feature Fusion Single Shot Multibox Detector for Garbage Detection. IEEE Access 2020, 8, 188577–188586. [CrossRef]
9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [CrossRef]
10. Aharon, S.; Louis-Dupont; Oferbaratz; Masad, O.; Yurkova, K.; Fridman, L.; Lkdci; Khvedchenya, E.; Rubin, R.; Bagrov, N.; et al. Super-Gradients, 2021. Available online: https://zenodo.org/records/7789328 (accessed on 29 February 2024).
11. Ultralytics. YOLOv5: A State-of-the-Art Real-Time Object Detection System. 2021. Available online: https://docs.ultralytics.com (accessed on 29 February 2024).
12. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 213–229.
13. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159.
14. Zong, Z.; Song, G.; Liu, Y. DETRs with Collaborative Hybrid Assignments Training. arXiv 2022, arXiv:2211.12860.
15. Weinstein, B.G.; Marconi, S.; Aubry-Kientz, M.; Vincent, G.; Senyondo, H.; White, E.P. DeepForest: A Python package for RGB deep learning tree crown delineation. Methods Ecol. Evol. 2020, 11, 1743–1751. [CrossRef]
16. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
17. Jeong, J.; Lee, S.; Kim, J.; Kwak, N. Consistency-based semi-supervised learning for object detection. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
18. Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Bai, X.; Liu, Z. End-to-End Semi-Supervised Object Detection with Soft Teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3060–3069.
19. Liu, Y.C.; Ma, C.Y.; He, Z.; Kuo, C.W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased Teacher for Semi-Supervised Object Detection. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021.
20. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [CrossRef]
21. Wei, W.; Yang, Y.; Wang, X.; Wang, W.; Li, J. Development of convolutional neural network and its application in image classification: A survey. Opt. Eng. 2019, 58, 040901.
22. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
24. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019.
25. Berthelot, D.; Carlini, N.; Goodfellow, I.J.; Papernot, N.; Oliver, A.; Raffel, C. MixMatch: A Holistic Approach to Semi-Supervised Learning. arXiv 2019, arXiv:1905.02249.
26. Sohn, K.; Berthelot, D.; Li, C.L.; Zhang, Z.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Zhang, H.; Raffel, C. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Virtual, 6–12 December 2020.
27. Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.