All content in this area was uploaded by Mithun Haridas T P on Aug 01, 2021
2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS) | 978-1-6654-0521-8/20/$31.00 ©2021 IEEE | DOI: 10.1109/ICACCS51430.2021.9441905
Underwater Object Detection model based on
YOLOv3 architecture using Deep Neural Networks
Athira. P
Department of Electronics
Cochin University of Science and
Technology
Kochi, India
athirapmohan1996@cusat.ac.in
Mithun Haridas T.P.
Department of Electronics
Cochin University of Science and
Technology
Kochi, India
mithuntp@cusat.ac.in
Supriya M.H.
Department of Electronics
Cochin University of Science and
Technology
Kochi, India
supriya@cusat.ac.in
Abstract— Object detection plays a crucial role in strategic
areas such as underwater surveillance and resource exploration.
The capability to analyse objects and extract their inherent
information gives object detection high research value in
underwater and other low-light media. Conventional systems
serving this objective rely on handcrafted algorithms and
computational methodologies, which are highly inefficient. This
brings out the need for computer-vision-based systems that are
automated and learning based. This paper proposes a model that
automatically detects underwater objects using the YOLOv3
architecture with the Darknet framework and deep learning. It
also explores custom training of YOLOv3-based underwater
object detection models using the Fish4Knowledge dataset.
Keywords— object detection, underwater images, YOLOv3, deep
learning.
I. INTRODUCTION
Object detection is a crucial task that is used broadly
across industries for monitoring, inspection, sorting, etc. It can
be defined as a technique that identifies and localises the
required targets in video frames in real time. Object detection
[1][7] can also be used to count and track different objects. It
differs from image recognition: recognition assigns a label to an
image, whereas object detection draws a bounding box around
each object and then labels it. It finds application in fields
such as autonomous vehicle systems, motion recognition,
automated CCTV, object counting, etc. Object detection can be
implemented through traditional approaches as well as learning
approaches. Traditional approaches use a regression model to
predict the output by combining information from various image
features, giving the object location and its label. In learning
approaches, deep neural network architectures are used end to
end, so that feature extraction and object detection are
achieved together.
Underwater object detection currently plays an
important role in studying climatic factors, port safety, resource
exploration, etc. The manual methods previously used for
analysis are labour intensive and time consuming; hence they
are being replaced by automatic remotely operated vehicles
(ROVs), which reduce the manpower required. The video data
obtained from an ROV are very large, so the system should be
able to process large amounts of such video automatically;
processing them manually would be tedious. The main objectives
of these vehicles are automatic identification of man-made and
off-shore structures, object detection and/or obstacle avoidance,
etc.
YOLOv3 is an improved version of the YOLO detection
model, proposed by Joseph Redmon and Ali Farhadi [1], and is
a fast object detection algorithm. Improving on the previous
models, it extends detection to multiple scales with stronger
feature extraction and uses cross-entropy error functions, and
hence can be applied to multiple object tracking. Like SSD,
YOLOv3 [1] performs fast object detection, enabling real-time
inference on a GPU. The detection precision of YOLOv3 [1] is
comparable to that of Faster R-CNN. R-CNN based models use a
region proposal method, which makes the detection process
tedious, as it uses a selective search algorithm to eliminate
bounding boxes with low confidence values and select the best
one. In YOLO, by contrast, the information in the image pixels
is used directly to predict bounding boxes and the probability
of each belonging to a particular object class.
978-1-6654-0521-8/21/$31.00 ©2021 IEEE
Authorized licensed use limited to: COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY. Downloaded on August 01,2021 at 06:05:08 UTC from IEEE Xplore. Restrictions apply.
This work explores YOLOv3[1] architecture and
DarkNet [6] framework for implementing an efficient
underwater object detection model. Fish 4 Knowledge
database [13][14] is used for the training of the model. The
image dataset is preprocessed, labelled and annotated for
training the model for underwater images. Performance
analysis is also done by analyzing the mean average precision
and learning curves.
Organisation of paper: Section II presents works
related to object detection. The theory behind the model
proposed is described in the Section III, followed by the
explanation about the Methodology in Section IV, then the
Section V presents Experimental Results related to the
implementation and the Section VI concludes the research
work.
II. RELATED WORKS
Deep learning models are found to be more suitable for
detecting and extracting information from images in
challenging environments, and can work with larger amounts of
data at the same time.
Object detection [1][7] can be seen as a classification
problem in which a classifier window is passed over the image
and determines the object class present at each position. R-CNN
[4] combines region proposals from selective search with AlexNet
to solve the problem of selecting candidate regions. Its modified
versions are Fast R-CNN [4] and Faster R-CNN [5].
Strachan and Kell [12] made an early attempt in 1995
to detect dead fish based on features such as shape and
colour. Later, in 2001, Storbeck and Daan [10] proposed a
3D model that added features such as height and width for
classification. Real-time detection of fish was proposed in 2014
by Hsiao et al. [9] using motion-based fish detection from video
frames. This was done with a Gaussian Mixture Model and
achieved an accuracy of about 83.99%. In the same year,
Palazzo and Murabito [8] discussed another method for real-
time detection that uses a covariance model of fish video frames
and achieved an average detection accuracy of 78.01%.
Another improved model for object detection is YOLO
[1], which predicts bounding boxes, object classes and the
confidence value of each class in a single pipeline using a
single convolutional network, and produces results with high
accuracy and precision. In 2017, Sung et al. [11] proposed
deep-neural-network-based fish detection using a CNN
architecture and achieved 65.2% accuracy in localising and
detecting fish from 93 image datasets.
III. ARCHITECTURE
A. Network architecture
The YOLO architecture is based on a CNN, as shown in Fig. 1.
Three versions of YOLO preceded YOLOv3 [1].
YOLOv1 [2] was the first implementation of the single-stage
detector concept; it uses 1x1 reduction layers followed by 3x3
convolutional layers, with batch normalization and the leaky
ReLU activation function. The network consists of 24
convolutional layers, which extract features, and two fully
connected (FC) layers, which predict the bounding boxes and
their class probabilities. The final output is a 7x7x30 tensor of
bounding box predictions. This model can detect up to 49
objects, one per cell of the 7x7 grid, but produces a high
localisation error.
YOLOv2, the improved version of YOLOv1, was built mainly
to reduce localisation error. YOLOv2 removed the final FC
layers and added batch normalisation to all convolutional
layers, which made the network resolution independent and
lowered the localisation error. YOLOv2 [3] uses Darknet-19, a
19-layer network augmented with 11 additional layers for
object detection.
Fig. 2. Bounding box [2]
Both previous YOLO models can detect fewer than 20
classes; hence a more advanced model, YOLO9000 [3], was
developed, which can detect and classify more objects and
classes. These models were then improved by adding features
such as residual blocks, skip connections and up-sampling,
resulting in YOLOv3, which uses a 53-layer network trained on
the ImageNet database [1].
B. Bounding box forecasting
YOLOv3 [1] uses a single pipeline for feature extraction:
the whole image is passed to the convolutional network, which
produces a square output called a grid, onto which the bounding
boxes are anchored. The grid cell and the anchor share a
common centroid. The YOLO algorithm predicts the location
offsets relative to the anchor box (tx, ty, tw, th), the
objectness score and the class probabilities. The objectness
score gives the confidence that an object is present in the
bounding box, and the class probability gives the confidence
that the object belongs to a given class [1]. The bounding box
coordinates are obtained from the predictions, with (Cx, Cy)
being the offset of the grid cell from the upper-left corner of
the image and Pw and Ph being the width and height of the
anchor box, as depicted in Fig. 2 and given in (1), (2), (3), (4).
$b_x = \sigma(t_x) + C_x$ (1)
$b_y = \sigma(t_y) + C_y$ (2)
$b_w = P_w e^{t_w}$ (3)
$b_h = P_h e^{t_h}$ (4)
where bx and by are the x- and y-coordinates of the box centre,
and bw and bh are its width and height. The measure of overlap
between the ground truth and the bounding box, called the
objectness score, is calculated by logistic regression. A value
of "1" indicates that the bounding box overlaps the ground truth
perfectly or above a threshold, whereas if the overlap is below
the threshold the value is "0" and the bounding box is ignored.
The objectness score thus provides a first filter: bounding
boxes with an objectness score greater than the threshold are
retained and passed on to further filtering. Most object
detection algorithms face the problem of detecting the same
object multiple times, which degrades their performance. YOLO
[1][2][3] uses non-maximal suppression (NMS) to remove
multiple detections of the same object. NMS uses a function
called Intersection over Union (IOU), with a minimum IOU
threshold commonly set to 0.5. If B1 and B2 are two bounding
boxes, the IOU is the ratio of the area of their intersection to
the area of their union.
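As an illustration only, the decoding in (1)-(4) and the IOU-based suppression described above can be sketched in a few lines of Python. The function names, the corner-based box layout (x_min, y_min, x_max, y_max) and the sample numbers are assumptions made for this sketch and are not taken from the Darknet implementation.

```python
import math


def sigmoid(t):
    """Logistic function used for the centre offsets in (1) and (2)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply (1)-(4): return centre (bx, by) and size (bw, bh)."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical example: zero offsets in grid cell (6, 6) with a 3x2 anchor.
decoded = decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6, pw=3.0, ph=2.0)

# Hypothetical detections: the first two overlap heavily, the third is separate.
sample_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
sample_scores = [0.9, 0.8, 0.7]
kept = nms(sample_boxes, sample_scores, iou_threshold=0.5)
```

With zero offsets the decoded centre falls in the middle of the grid cell and the box keeps the anchor size. In the NMS example the second box is suppressed because its IOU with the first exceeds 0.5.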
C. Class prediction
YOLOv3 [1] uses multilabel classification. Independent
logistic classifiers are used instead of the softmax function, to
reduce the computational complexity, which in turn improves
system performance. For example, in complex situations such as
the Open Images dataset, an object can be labelled both as a cat
and as an animal, i.e. many labels overlap. Softmax performs
poorly here because it assumes that each box has a single class,
which may not be the desired result; hence binary cross-entropy
is used in YOLOv3 [1].
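The difference between the two choices can be illustrated with a small sketch. The label names, the logit values and the 0.5 decision threshold below are hypothetical and chosen only to show the overlapping-label case described above.

```python
import math


def softmax(logits):
    """Softmax: scores compete and always sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_sigmoid(logits):
    """Independent logistic classifiers: each class is scored on its own."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Hypothetical logits for one box that is both a cat and an animal.
labels = ["cat", "animal", "car"]
logits = [3.0, 3.0, -4.0]

soft = softmax(logits)
sig = independent_sigmoid(logits)

# With a 0.5 threshold, the logistic outputs report both overlapping labels,
# whereas softmax splits the probability mass between them.
predicted = [name for name, p in zip(labels, sig) if p > 0.5]
softmax_confident = [name for name, p in zip(labels, soft) if p > 0.5]
```

In this example the independent classifiers select both "cat" and "animal", while no softmax score exceeds 0.5 because the two correct labels share the probability mass.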
D. Predictions across scales
Predictions are made at three different scales: 13x13, 26x26
and 52x52. Features are extracted using Darknet-53 [6]
combined with a feature pyramid network. The final prediction
at each scale is a 3-D tensor encoding the bounding box, the
confidence value that gives the objectness score, and the
probability of the object belonging to a particular class [1].
Fig. 3. Darknet53 Architecture [1]
Fig. 4. Sample F4K dataset image showing single and multi-objects
Fig. 5. Object Detection Methodology and bounding box with objectness
score
Fig. 6. Snapshot of labelling tool
E. Feature extractor
YOLOv3 [1] uses the Darknet-53 [6] network for feature
extraction, a hybrid network derived from Darknet-19 and
residual networks. It has a total of 53 convolutional layers and
is therefore called Darknet-53. The architecture of Darknet-53
is shown in Fig. 3. YOLOv2 uses Darknet-19 for the same
purpose. Both YOLOv3 and YOLOv2 use batch normalisation.
IV. METHODOLOGY
YOLO reframes object detection as a regression task
and produces a final output of bounding boxes and confidence
scores. The Fish4Knowledge [13][14] video dataset is used for
model development. Sample images from the Fish4Knowledge
database are shown in Fig. 4. The whole implementation was
done in Python in the Google Colaboratory (Colab)
environment. By adopting transfer learning, the YOLOv3 [1]
network was trained for 1600 iterations on a custom dataset
prepared from the Fish4Knowledge database. The image dataset
split is given in Table I. The trained model is tuned to perform
the object detection task as shown in Fig. 5.
YOLOv3 uses residual skip connections and
upsampling. It is a fully convolutional network and performs
detection at three scales by applying 1x1 kernels to feature
maps; the depth of each output is determined by the number of
bounding boxes and the number of classes. Detection occurs
only at three layers: detection layers 82, 94 and 106. In the
initial stage the image is down-sampled over the first 81 layers,
giving a stride of 32. After the first detection with a 1x1 kernel,
a feature map of size 13x13x255 is obtained. A similar process
happens in the remaining layers and produces a final feature
map of size 52x52x255. In YOLO, different scales are
responsible for detecting objects of different sizes, i.e. the
13x13 scale detects large objects, the 26x26 scale medium
objects and the 52x52 scale small objects.
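The feature-map sizes quoted above follow from simple arithmetic, sketched below. The 416x416 input size and the strides of 16 and 8 for the second and third scales are assumptions inferred from the 13x13, 26x26 and 52x52 grids, and three boxes per grid cell is assumed from the nine anchors split over three scales; the function name is hypothetical.

```python
def yolo_output_shapes(input_size=416, num_classes=80,
                       boxes_per_cell=3, strides=(32, 16, 8)):
    """Return the (rows, cols, depth) of the output at each detection scale."""
    # Each box carries 4 offsets, 1 objectness score and one score per class.
    depth = boxes_per_cell * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]


# 80 classes, as in the pretrained network mentioned in the training section.
pretrained_shapes = yolo_output_shapes(num_classes=80)

# A single class (fish), as used for the custom model in this work.
single_class_shapes = yolo_output_shapes(num_classes=1)
```

With 80 classes the depth is 3 x (4 + 1 + 80) = 255, matching the 13x13x255 map above. With a single class the same formula gives a depth of 18.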
For training YOLO with custom objects, the anchor boxes
need to be arranged in decreasing order of their dimensions.
The nine anchors of YOLO are assigned three per scale: the
three biggest anchors to the first scale, the next three to the
second and the last three to the third.
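A minimal sketch of this assignment is given below. The anchor values are the nine COCO anchors listed in the YOLOv3 paper [1] and are used only as an example; the anchors used for the fish model are not stated here. Sorting by area and the function name are assumptions of this sketch.

```python
# Anchors (width, height) in pixels from the YOLOv3 paper [1]; example values only.
EXAMPLE_ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                   (59, 119), (116, 90), (156, 198), (373, 326)]


def assign_anchors(anchors, grid_sizes=(13, 26, 52), per_scale=3):
    """Sort anchors from largest to smallest area and give three to each scale."""
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1], reverse=True)
    return {grid: ordered[i * per_scale:(i + 1) * per_scale]
            for i, grid in enumerate(grid_sizes)}


assignment = assign_anchors(EXAMPLE_ANCHORS)
```

The coarsest grid (13x13) receives the three largest anchors because it is the scale that detects large objects, and the finest grid (52x52) receives the three smallest.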
A. Data Preparations
The Fish4Knowledge [13][14] video dataset is available in
mp4 format. The video data is converted into frames, and a
total of 2500 frames were obtained, which were then labelled
manually using the LabelImg tool [15], as shown in Fig. 6.
Images were labelled in YOLO format, which records the object
class, the bounding box centre coordinates and the height and
width of the box relative to the image, with left-bottom as
origin.
TABLE I. DATASET SPLIT
Dataset Type       Training   Validation   Testing
Number of Images   2000       250          250
Fig. 7. Output of YOLO detection algorithm on F4K dataset
Fig. 8. Loss and mAP chart during Training
Let x and y be the coordinates of the bounding box centre along
the x-axis and y-axis, w and h the width and height of the box,
and W and H the width and height of the image in pixels. The
annotation values are then calculated using (5), (6), (7), (8).
$\text{Center-}x = \frac{x}{W}$ (5)
$\text{Center-}y = \frac{y}{H}$ (6)
$\text{Width} = \frac{w}{W}$ (7)
$\text{Height} = \frac{h}{H}$ (8)
For instance, the annotation
0 0.557813 0.5578173 0.104375 0.100833
describes a bounding box with centre coordinates (0.557813,
0.5578173) whose width and height are each about 10% of the
entire image (0.104375 and 0.100833); the leading 0 is the
object class present in it.
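As a sketch of how such an annotation line can be produced from pixel measurements, the helper below applies (5)-(8) directly. The function name, the six-decimal formatting and the 320x240 example image are assumptions of this sketch rather than details of the LabelImg tool.

```python
def to_yolo_annotation(class_id, x, y, w, h, img_w, img_h):
    """Build one YOLO-format line from a box centre (x, y) and size (w, h) in pixels."""
    values = (
        x / img_w,  # (5) normalised centre x
        y / img_h,  # (6) normalised centre y
        w / img_w,  # (7) normalised width
        h / img_h,  # (8) normalised height
    )
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)


# Hypothetical 320x240 frame with a 32x24 pixel fish centred in the image.
line = to_yolo_annotation(0, x=160, y=120, w=32, h=24, img_w=320, img_h=240)
```

All four values lie between 0 and 1, so the same annotation remains valid if the image is resized.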
B. Training
Training is done in Google Colab using a GPU, and the
annotations are made using the LabelImg tool [15]. Since the
network was previously trained for 80 object classes that do
not include the object of interest (fish), the first step before
training was to create a new configuration file with only one
class. The number of filters in the convolutional layer before
each detection layer was set to 18, i.e. 3 x (5 + 1), since only
one class was used for training.
The input to the network must be an image; hence the video is
first split into frames, which are forwarded to the YOLO object
detector. The output of YOLO consists of the confidence score
and the class ID of the object present in each bounding box, as
shown in Fig. 7.
V. RESULTS AND ANALYSIS
The network was trained and tested with the
Fish4Knowledge [13][14] dataset. The loss in each batch can
be calculated from the log file generated during the training
phase. Fig. 8 shows the loss and mAP plotted against iteration.
The loss decreases and the mAP increases with iterations. The
network can be trained until the average loss falls below 0.2;
with further training the network overfits, which was avoided
using early stopping. The three detection layers, i.e. layers 82,
94 and 106, calculate the loss for the bounding boxes, namely:
the mean squared error of centreX, centreY, width and height;
and the binary cross-entropy of the objectness score, the
no-objectness score and the multi-class predictions. Thus the
loss function has four parts and can be calculated as in (9).
Loss = Lambda_Coord * Sum(Mean_Square_Error((bx, by), (bx', by')) * obj_mask)
     + Lambda_Coord * Sum(Mean_Square_Error((bw, bh), (bw', bh')) * obj_mask)
     + Sum(Binary_Cross_Entropy(obj, obj') * obj_mask)
     + Lambda_Noobj * Sum(Binary_Cross_Entropy(obj, obj') * (1 - obj_mask) * ignore_mask)
     + Sum(Binary_Cross_Entropy(class, class'))    (9)
TABLE II. EVALUATION METRICS FOR CONFIDENCE THRESHOLD = 0.25 AND IOU THRESHOLD = 0.5
Evaluation Metric   Best Accuracy   mAP      Recall   Avg. IOU   F1 Score
Values              0.9759          0.9661   0.95     0.6928     0.92
In (9), the relative centroid is represented as bx and by
and the directly predicted centroid as bx' and by'.
Lambda_Coord is a weight with a value of 5. The second
term is the width and height loss, calculated using the
width (bw) and the height (bh); it is followed by the
objectness/no-objectness score loss, and the last term is the
classification loss [2]. The mean Average Precision (mAP) was
calculated to analyse the performance of the object detector.
After 1600 iterations a mean average precision of 96.61% was
obtained, with a confidence threshold of 0.25 set in order to
avoid occlusion of bounding boxes. The mAP was calculated
with an IOU threshold of 0.5. The testing results are tabulated
in Table II.
The object detector was tested with both images and
videos. Testing the model on Fish4Knowledge video data of
9 min 35 s duration gave a best accuracy of 97.59%, an average
loss of 0.475593, a precision of 0.88, a recall of 0.95, an F1
score of 0.92 and an average IoU of 69.28%, with an average
detection time of 15 seconds, for a confidence threshold of 0.25
and an IoU threshold of 0.5. Accuracy, precision and recall are
calculated by taking "fish in the frame" as the positive class
and "no fish in the frame" as the negative class. The mean
Average Precision (mAP), F1 score and Intersection over Union
(IoU) are calculated as shown in (10), (11), (12).
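$\text{mAP} = \frac{1}{\text{No. of classes}} \sum \text{Average precision}$ (10)
$\text{F1 score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$ (11)
$\text{IoU} = \frac{B_1 \cap B_2}{B_1 \cup B_2}$ (12)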
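The definitions in (10) and (11) can be checked numerically with a short sketch. The function names are hypothetical, and the inputs are the rounded precision, recall and mAP values reported above, so the result only approximates the tabulated F1 score.

```python
def f1_score(precision, recall):
    """F1 score as in (11): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_average_precision(ap_per_class):
    """mAP as in (10): average of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)


# Rounded values reported in the text: precision 0.88 and recall 0.95.
f1 = f1_score(0.88, 0.95)

# With a single class (fish), the mAP equals that class's average precision.
m_ap = mean_average_precision([0.9661])
```

From the rounded inputs the F1 score evaluates to approximately 0.91. For a single-class detector the sum in (10) has one term, so the mAP and the average precision for fish coincide.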
VI. CONCLUSION
An underwater object detection model was implemented
using the YOLOv3 architecture and the Fish4Knowledge [13][14]
dataset. A total of 2500 images were utilised for training the
detector for a single class. The network successfully detects
multiple objects in consecutive frames with an accuracy of
96.17% and a mean average precision of 96.61% for a confidence
threshold of 0.25 and an IoU threshold of 0.5. The average IoU
obtained was 69.28% and the F1 score 0.92.
The model can be further tuned to detect different
object classes from different domains for object detection and
tracking. YOLO can be combined with Deep SORT or any other
object tracker for tracking and further analysis.
References
[1] J. Redmon and A. Farhadi, "YOLOv3: An incremental
improvement," arXiv preprint arXiv:1804.02767, 2018.
[2] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look
Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016,
pp. 779-788.
[3] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," 2017
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Honolulu, HI, USA, 2017, pp. 6517-6525.
[4] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on
Computer Vision (ICCV), Santiago, 2015, pp. 1440-1448.
[5] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-
Time Object Detection with Region Proposal Networks," in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6,
pp. 1137-1149, 1 June 2017.
[6] J. Redmon, "Darknet: Open source neural networks in C,"
http://pjreddie.com/darknet/, 2013-2016.
[7] A. Mekonnen and F. Lerasle, "Comparative Evaluations of Selected
Tracking-by-Detection Approaches," IEEE Transactions on Circuits and
Systems for Video Technology, vol. 29, no. 4, pp. 996-1010, 2019.
[8] S. Palazzo and F. Murabito, "Fish species identification in
real-life underwater images," in 3rd ACM International Workshop on
Multimedia Analysis for Ecological Data, Orlando, Florida, pp. 13-18.
[9] Y. Hsiao, C. Chen, S. Lin, and Lin, "Real-world underwater fish
recognition and identification using sparse representation," Ecological
Informatics, vol. 23, pp. 13-21, 2014.
[10] F. Storbeck and B. Daan, "Fish species recognition using
computer vision and a neural network," Fisheries Research, vol. 51, pp. 11-15.
[11] M. Sung, S. Yu, and Y. Girdhar, "Vision based real-time fish detection
using convolution neural network," in IEEE OCEAN-2017, Aberdeen,
UK, pp. 1-6.
[12] N. J. C. Strachan and L. Kell, "A potential method for the differentiation
between haddock fish stocks by computer vision using canonical
discriminant analysis," ICES Journal of Marine Science, vol. 52, pp. 145-149.
[13] B. J. Boom, P. X. Huang, C. Spampinato, S. Palazzo, J. He, C. Beyan, E.
Beauxis-Aussalet, J. van Ossenbruggen, G. Nadarajan, J. Y. Chen-Burger,
D. Giordano, L. Hardman, F.-P. Lin, R. B. Fisher, "Long-term underwater
camera surveillance for monitoring and analysis of fish populations",
Proc. Int. Workshop on Visual observation and Analysis of Animal and
Insect Behavior (VAIB), in conjunction with ICPR 2012, Tsukuba, Japan,
2012.
[14] B. J. Boom, P. X. Huang, J. He, R. B. Fisher, "Supporting Ground-Truth
annotation of image datasets using clustering", 21st Int. Conf. on Pattern
Recognition (ICPR), 2012.
[15] "LabelImg," Tzutalin.github.io, 2019. [Online].
Available: https://tzutalin.github.io/labelImg/.