Underwater Object Detection model based on
YOLOv3 architecture using Deep Neural Networks
Athira. P
Department of Electronics
Cochin University of Science and
Technology
Kochi, India
athirapmohan1996@cusat.ac.in
Mithun Haridas T.P.
Department of Electronics
Cochin University of Science and
Technology
Kochi, India
mithuntp@cusat.ac.in
Supriya M.H.
Department of Electronics
Cochin University of Science and
Technology
Kochi, India
supriya@cusat.ac.in
Abstract— Object detection plays a crucial role in strategic areas such as underwater surveillance, resource exploration and scrutiny. The capability of analysing objects and extracting their inherent information underlines the high research value of object detection in underwater and other low-light media. Conventional systems serving this objective rely on traditional hand-crafted algorithms and computational methodologies, which are highly inefficient. This brings out the need for computer-vision-based systems that are automated and learning based. This paper proposes a model to automatically detect underwater objects using the YOLOv3 architecture with the DarkNet framework and deep learning. The paper also explores the possibility of custom training YOLOv3-based underwater object detection models using the Fish4Knowledge dataset.

Keywords— object detection, underwater images, YOLOv3, deep learning.
I. INTRODUCTION
The problem of object detection is a crucial task that is used broadly in various kinds of industries for monitoring, inspection, sorting, etc. Basically, it can be defined as a technique that identifies and localises the required targets from video frames in real time. Object detection [1][7] can also be used to count and track different objects. It is quite different from recognition: image recognition assigns a label to an image, whereas object detection draws a bounding box around each object and then labels it. This finds application in various fields such as mechanized vehicle frameworks, movement acknowledgement, automated CCTV, object checking, etc. Object detection can be implemented through traditional approaches as well as learning approaches. Traditional approaches use a regression model to predict the output by combining information from various features of the image, giving the object location and its label. In learning approaches, deep neural network architectures are used in an end-to-end process in which feature extraction and object detection are achieved together.
At present, underwater object detection plays an important role in studying climatic factors, port safety, resource exploration, etc. The manual methods previously used for analysis are labour intensive and time consuming; hence they have been replaced by automatic remotely operated vehicles (ROVs), which reduce the required man-power. The video data obtained from ROVs are very large, and processing such large amounts of video information manually would make the process tedious, so it must be processed automatically. The main objectives of these vehicles are the automatic identification of man-made and off-shore structures, object detection and obstacle avoidance.
YOLOv3 is an improved version of the YOLO detection model proposed by Joseph Redmon and Ali Farhadi [1], which is a fast-performing object detection algorithm. Enhancing the previous models, it extends detection to multiple scales with stronger feature extraction and uses cross-entropy error functions, and hence can be applied to multiple object tracking. Like SSD, YOLOv3 [1] also performs fast object detection, enabling real-time inference on a GPU. The detection precision of YOLOv3 [1] resembles that of Faster R-CNN. R-CNN based models use a region proposal method, which makes the detection process tedious, as a selective search algorithm is used to eliminate bounding boxes with low confidence values and select the best one. In YOLO, by contrast, the information in the image pixels is used directly to predict bounding boxes and the probability of belonging to a particular object class.
This work explores YOLOv3[1] architecture and
DarkNet [6] framework for implementing an efficient
underwater object detection model. Fish 4 Knowledge
database [13][14] is used for the training of the model. The
image dataset is preprocessed, labelled and annotated for
training the model for underwater images. Performance
analysis is also done by analyzing the mean average precision
and learning curves.
Organisation of the paper: Section II presents works related to object detection. The theory behind the proposed model is described in Section III, followed by the explanation of the methodology in Section IV; Section V presents the experimental results related to the implementation and Section VI concludes the research work.
II. RELATED WORKS
Deep learning models are found to be more suitable for detecting and extracting information from images in challenging environments, along with the ability to work with larger amounts of data at the same time.

Object detection [1][7] can be seen as a classification problem in which each pixel is passed through a classifier window that determines the object class present. R-CNN is modelled by combining a region proposal network and selective search with AlexNet to solve the problem of selecting candidate regions. The modified versions of R-CNN are Fast R-CNN [4] and Faster R-CNN [5].
Strachan and Kell [12] made an early attempt in 1995 to detect dead fish based on features such as shape and colour. Later, Storbeck and Daan [10] in 2001 proposed a 3-D model that adds features such as height and width for classification. Real-time detection of fish was proposed in 2014 by Hsiao et al. [9] using motion-based fish detection from video frames; this was achieved using a Gaussian Mixture Model and reached an accuracy of about 83.99%. Similarly, in the same year, Palazzo and Murabito [8] discussed another method for real-time detection using a covariance model of fish video frames and achieved an average detection accuracy of 78.01%.

Another improved model for object detection is YOLO [1], which predicts bounding boxes and their confidence values in a single pipeline using a single convolutional network. In 2017, Sung et al. [11] proposed deep-neural-network-based fish detection using a CNN architecture and achieved 65.2% accuracy in localizing and detecting fish from 93 image datasets. YOLO produces results with high accuracy and precision. It predicts the bounding box and object class, with a confidence value for each class, utilising a single pipeline of neural network [1].
III. ARCHITECTURE
A. Network architecture
The YOLO architecture is based on a CNN, as shown in Fig. 1. There are three prior versions of YOLO before YOLOv3 [1]. YOLOv1 [2] was the first implementation of the single-stage detector concept; it uses reduction layers of dimension 1x1 followed by convolutional layers of dimension 3x3, together with batch normalization and the leaky ReLU activation function. The network consists of 24 convolutional layers, which extract features, and two fully connected (FC) layers, which predict the bounding boxes and their class probabilities. The final output is a 7x7x30 tensor encoding the bounding boxes. This model can detect at most 49 objects (one per grid cell) but produces a high localisation error.
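As a quick check of the dimensions stated above, the 7x7x30 output tensor follows from a 7x7 grid with 2 boxes per cell (5 values each) and 20 class probabilities per cell, the setting used in [2]; a minimal sketch:

```python
# Sketch: YOLOv1 output tensor size for the original PASCAL VOC setting
# (S x S grid, B boxes per cell, C classes), as described in [2].
S, B, C = 7, 2, 20
depth = B * 5 + C          # 2*5 + 20 = 30 values per grid cell
print((S, S, depth))       # (7, 7, 30) -> at most S*S = 49 detected objects
```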
The improved version of YOLOv1 is YOLOv2, which was built mainly focusing on reducing the localisation error. YOLOv2 removed the final FC layers and added batch normalisation to all convolutional layers, which made the network resolution independent and lowered the localisation error. YOLOv2 [3] used Darknet-19, a network with 19 convolutional layers, augmented with an additional 11 layers to detect objects.
Fig. 2. Bounding box [2]
Both of the earlier YOLO models can detect fewer than 20 classes; hence a more advanced model, YOLO9000 [3], was developed which can detect and classify more objects and classes. These models were then improved by adding further features such as residual blocks, skip connections and up-sampling, and the result was named YOLOv3, which utilizes a 53-layer network trained on the ImageNet database [1].
B. Bounding box forecasting
YOLOv3 [1] uses a single pipeline for feature extraction: the whole image is passed through the convolutional network, which produces a square output grid onto which the bounding boxes are anchored. The grid cell and its anchor share a common centroid. The YOLO algorithm predicts the location offsets against the anchor box, tx, ty, tw, th, an objectness score, and the class probabilities. The objectness score gives the confidence that an object is present in the bounding box, and the class probability defines how well it belongs to a particular class [1]. The predictions correspond to the bounding box coordinates, with (Cx, Cy) the offset of the grid cell from the top-left corner of the image and Pw and Ph the width and height of the anchor box, as depicted in Fig. 2 and calculated in (1)-(4):

bx = σ(tx) + Cx    (1)
by = σ(ty) + Cy    (2)
bw = Pw · e^tw     (3)
bh = Ph · e^th     (4)

where σ is the sigmoid function, bx and by are the centre coordinates, and bw and bh are the width and height of the predicted box. The measure of overlap between the ground truth and the bounding box, the objectness score, is estimated by logistic regression. A value of "1" indicates a perfect overlap of the bounding box and ground truth, or an overlap above a threshold, whereas if the overlap is below the threshold the value is "0" and the bounding box is ignored. The objectness score initially helps to filter the best bounding boxes: generally, those bounding boxes with an objectness score greater than the threshold are selected first and then considered for further filtering. Most object detection algorithms face the problem of detecting the same object multiple times in different frames, which degrades their performance. YOLO [1][2][3] uses non-maximum suppression (NMS) to solve this problem of multiple detections of the same object. NMS uses a measure called Intersection over Union (IoU), with a minimum IoU threshold commonly set to 0.5. If B1 and B2 are two bounding boxes, the IoU is the ratio of the area of intersection of B1 and B2 to the area of their union.
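The objectness filtering and NMS step described above can be illustrated with a short sketch; the corner box format (x1, y1, x2, y2) and the helper names are illustrative assumptions rather than the exact routines used in this work, with the 0.25/0.5 thresholds taken from the experiments reported later.

```python
def iou(b1, b2):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (area1 + area2 - inter + 1e-9)

def nms(boxes, scores, conf_thresh=0.25, iou_thresh=0.5):
    """Keep the highest-scoring boxes, suppressing overlapping duplicates."""
    order = [i for i in sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
             if scores[i] >= conf_thresh]          # objectness filter first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```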
C. Class prediction
YOLOv3 [1] uses multilabel classification. Independent logistic classifiers are used instead of a softmax function to reduce the computational complexity, which in turn improves the system performance. For example, in complex situations such as an open image dataset, an object can be labelled both as a cat and as an animal, i.e., there are many overlapping labels. Softmax gives poor performance here, as it predicts the presence of a single class, which may not be the desired result; hence binary cross entropy is used in YOLOv3 [1].
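A minimal sketch of the difference: softmax forces the classes to compete for a single label, while independent sigmoid outputs allow overlapping labels (e.g. "cat" and "animal") to be active simultaneously. The raw scores below are purely illustrative.

```python
import math

logits = [2.0, 1.8, -1.0]          # raw class scores (illustrative only)

# Softmax: probabilities compete and sum to 1 (single-label assumption).
exps = [math.exp(z) for z in logits]
softmax = [e / sum(exps) for e in exps]

# Independent logistic classifiers: each class is decided on its own,
# so several classes can exceed the threshold at once (multilabel).
sigmoid = [1.0 / (1.0 + math.exp(-z)) for z in logits]

print(softmax)   # roughly [0.54, 0.44, 0.03]
print(sigmoid)   # roughly [0.88, 0.86, 0.27]
```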
D. Predictions across scales
Predictions are made at three different scales: 13x13, 26x26 and 52x52. Features are extracted using Darknet-53 [6] followed by a feature-pyramid-style network. The last stage of prediction is a 3-D tensor encoding the bounding box, the confidence value that gives the objectness score, and the probability of the object belonging to a particular class [1].
Fig. 3. Darknet53 Architecture [1]
Fig. 4. Sample F4K dataset image showing single and multi-objects
Fig. 5. Object Detection Methodology and bounding box with objectness
score
Fig. 6. Snapshot o f labelling tool
E. Feature extractor
YOLOv3 [1] uses the Darknet-53 [6] network for feature extraction, a hybrid network derived from Darknet-19 and residual networks. It has a total of 53 convolutional layers and is therefore called Darknet-53; its architecture is shown in Fig. 3. YOLOv2 uses Darknet-19 for feature extraction, whereas YOLOv3 uses the 53-layer Darknet-53 for the same purpose. Both YOLOv3 and YOLOv2 use batch normalisation.
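A hedged sketch of one Darknet-53 residual unit (a 1x1 channel reduction followed by a 3x3 convolution, batch normalisation, leaky ReLU and a skip connection) is given below. It is written in PyTorch purely for illustration; the original implementation uses the DarkNet C framework [6].

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """One residual unit: 1x1 channel reduction, 3x3 conv, skip connection."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.LeakyReLU(0.1),
            nn.Conv2d(mid, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)      # skip connection

x = torch.randn(1, 256, 52, 52)       # example feature map
print(DarknetResidual(256)(x).shape)  # torch.Size([1, 256, 52, 52])
```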
IV. METHODOLOGY
Object detection is reframed by YOLO as a regression task whose final output consists of bounding boxes and confidence scores. The Fish4Knowledge [13][14] video dataset is utilised for model development; sample images from the Fish4Knowledge database are shown in Fig. 4. The whole implementation was done using Python in the Google Colab environment. By adopting transfer learning, the YOLOv3 [1] network was trained with a custom dataset prepared from the Fish4Knowledge database for 1600 iterations using Google Colaboratory. The image dataset split is given in Table I. The trained model is tuned to perform the object detection task as shown in Fig. 5.
YOLOv3 uses residual skip connections and upsampling. It is a fully convolutional network and performs detection at three scales by applying a 1x1 kernel on feature maps whose depth is determined by the number of bounding boxes and the number of classes. Detection occurs only in three layers: detection layers 82, 94 and 106. In the initial stage the image is down-sampled, resulting in a stride of 32 over the first 81 layers. After the first detection using the 1x1 kernel, a feature map of 13x13x255 is obtained (255 corresponds to the 80-class configuration). A similar process happens in the remaining layers and produces a final feature map of size 52x52x255. In YOLO, different layers are responsible for detecting objects of different sizes: the 13x13 scale detects large objects, the 26x26 scale detects medium objects, and the 52x52 scale detects small objects.
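The relation between input size, stride and grid size described above can be verified with a small sketch; a 416x416 input and the single "fish" class of this work are assumed, so the channel depth is 3 x (5 + classes) = 18 rather than the 255 used for the 80-class COCO configuration.

```python
input_size = 416
num_classes = 1                      # single "fish" class in this work
depth = 3 * (5 + num_classes)        # 3 anchors x (4 box coords + objectness + classes)

for stride in (32, 16, 8):           # detection layers 82, 94 and 106
    grid = input_size // stride
    print(f"stride {stride:2d} -> feature map {grid}x{grid}x{depth}")
# stride 32 -> 13x13x18 (large objects)
# stride 16 -> 26x26x18 (medium objects)
# stride  8 -> 52x52x18 (small objects)
```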
For training YOLO with a custom object, the anchor boxes need to be arranged in decreasing order of their dimensions. The nine anchors of YOLO are assigned three per scale: the biggest anchors go to the first (coarsest) scale and the next sets of three to the second and third scales.
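A minimal sketch of this anchor assignment, using the default YOLOv3 anchors [1] only as placeholder values; for a custom dataset the anchors would typically be recomputed from the training boxes.

```python
# Default YOLOv3 anchors (width, height) in pixels; placeholders for illustration.
anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

# Sort by area in decreasing order and split into three groups of three:
# the biggest anchors go to the coarse 13x13 scale, the smallest to 52x52.
anchors_sorted = sorted(anchors, key=lambda wh: wh[0] * wh[1], reverse=True)
per_scale = {"13x13": anchors_sorted[0:3],
             "26x26": anchors_sorted[3:6],
             "52x52": anchors_sorted[6:9]}
print(per_scale)
```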
A. Data Preparations
The Fish4Knowledge [13][14] video dataset is available in mp4 format. The video data are converted into frames, and a total of 2500 frames were obtained; these were then labelled manually using the LabelImg tool [15], as shown in Fig. 6. Images were labelled in YOLO format, in which each annotation contains the object class together with the bounding box centre coordinates, width and height, normalised by the image dimensions.
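Frame extraction from the mp4 clips can be sketched with OpenCV as below; the file names, output folder and sampling interval are assumptions for illustration, not the exact settings used here.

```python
import os
import cv2

def extract_frames(video_path, out_dir, every_n=10):
    """Save every n-th frame of a video as a JPEG for later annotation."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# extract_frames("f4k_clip.mp4", "frames/")   # hypothetical file names
```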
TABLE I. DATASET SPLIT
Dataset Type Training Validation Testing
Number of Images 2000 250 250
Fig. 7. Output of YOLO detection algorithm on F4K dataset

Fig. 8. Loss and mAP chart during Training

If the bounding box parameters are x and y, the coordinates of the bounding box centre, w, its width, and h, its height, and the image parameters are the width (W) and height (H) in pixels, the annotation values can be calculated using (5)-(8):

Center-x = x / W    (5)
Center-y = y / H    (6)
Width = w / W       (7)
Height = h / H      (8)

For instance, the annotation line

0 0.557813 0.5578173 0.104375 0.100833

gives bounding box centre coordinates of (0.557813, 0.5578173), a width and height of roughly 10% of the entire image (0.104375 and 0.100833), and 0 as the object class present in it.
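The conversion in (5)-(8) can be written as a short helper; the box is assumed to be given by its centre and size in pixels, matching the definitions of x, y, w and h above, and the example frame size is an assumption for illustration.

```python
def to_yolo_annotation(cls_id, x, y, w, h, W, H):
    """Normalise a box (centre x, centre y, width, height in pixels)
    into a YOLO label line for an image of size W x H."""
    return f"{cls_id} {x / W:.6f} {y / H:.6f} {w / W:.6f} {h / H:.6f}"

# Example with an assumed 640x480 frame; the values approximate the
# sample annotation line shown in the text.
print(to_yolo_annotation(0, 357.0, 267.8, 66.8, 48.4, 640, 480))
```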
B. Training
Training is done in Google Colab using a GPU, and the annotations are made using the LabelImg tool [15]. Since the network had previously been trained for 80 object classes, which do not contain the object of interest (fish), the first step before training was to create a new configuration file with only one class. The number of filters in the convolutional layers feeding the detection layers was accordingly set to 18, since only one class was used for training.
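The filter count in the 1x1 convolutional layers immediately preceding the YOLO detection layers follows the usual DarkNet configuration convention; the helper below is only an illustrative sketch of that calculation.

```python
def yolo_layer_filters(num_classes, anchors_per_scale=3):
    """Filters in the 1x1 conv feeding each YOLO detection layer:
    anchors x (4 box coordinates + 1 objectness + class scores)."""
    return anchors_per_scale * (num_classes + 5)

print(yolo_layer_filters(1))    # 18 for the single "fish" class
print(yolo_layer_filters(80))   # 255 for the original 80-class COCO model
```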
The input to the network should be an image; hence the video is passed through the system to extract frames, which are forwarded to the YOLO object detector. The output of YOLO consists of the confidence score and the class ID of the object class present in each bounding box, as shown in Fig. 7.
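A hedged sketch of running a trained DarkNet model on video frames with OpenCV's DNN module is shown below; the configuration, weight and video file names are placeholders, and the snippet mirrors rather than reproduces the exact inference pipeline used in this work.

```python
import cv2

# Placeholder file names for the custom single-class model.
net = cv2.dnn.readNetFromDarknet("yolov3_fish.cfg", "yolov3_fish.weights")
layer_names = net.getUnconnectedOutLayersNames()

cap = cv2.VideoCapture("f4k_test.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(layer_names)   # predictions from the three detection scales
    for out in outputs:
        for det in out:                  # det = [cx, cy, w, h, objectness, class scores...]
            if det[4] > 0.25:            # confidence threshold used in the paper
                class_id = int(det[5:].argmax())
                # ...convert the relative cx, cy, w, h to a pixel box, apply NMS, draw, etc.
cap.release()
```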
V. RESULTS AND ANALYSIS
The network was trained and tested with the Fish4Knowledge [13][14] dataset. The losses in each batch can be calculated from the log file generated during the training phase. Fig. 8 shows the loss and mAP plotted against the number of iterations. The loss decreases and the mAP increases with the iterations. The network can be trained further until the average loss falls below 0.2; training beyond that point causes the network to overfit, which was avoided using early stopping. The three detection layers, i.e. layers 82, 94 and 106, calculate the loss functions for the bounding box, namely the mean squared error of centreX, centreY, width and height, and the binary cross entropy of the objectness score, the no-objectness score and the multi-class predictions. Thus the loss function has four parts and can be calculated as in (9):
Loss = Lambda_Coord * Sum(Mean_Square_Error((bx, by), (bx', by')) * obj_mask)
     + Lambda_Coord * Sum(Mean_Square_Error((bw, bh), (bw', bh')) * obj_mask)
     + Sum(Binary_Cross_Entropy(obj, obj') * obj_mask)
     + Lambda_Noobj * Sum(Binary_Cross_Entropy(obj, obj') * (1 - obj_mask) * ignore_mask)
     + Sum(Binary_Cross_Entropy(class, class'))    (9)
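A schematic NumPy version of (9) is sketched below; the tensor shapes, the helper names and the Lambda_Noobj value of 0.5 are assumptions for illustration (the text only specifies Lambda_Coord = 5), and this is not the DarkNet implementation itself.

```python
import numpy as np

def bce(p, y):
    """Elementwise binary cross entropy for probabilities p and targets y."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def yolo_loss(pred_xy, true_xy, pred_wh, true_wh,
              pred_obj, true_obj, pred_cls, true_cls,
              obj_mask, ignore_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Schematic version of (9). Expected shapes: (grid, grid, anchors, 2) for
    xy/wh, (grid, grid, anchors) for objectness and masks, and
    (grid, grid, anchors, classes) for the class terms."""
    xy_loss    = lambda_coord * np.sum(((pred_xy - true_xy) ** 2) * obj_mask[..., None])
    wh_loss    = lambda_coord * np.sum(((pred_wh - true_wh) ** 2) * obj_mask[..., None])
    obj_loss   = np.sum(bce(pred_obj, true_obj) * obj_mask)
    noobj_loss = lambda_noobj * np.sum(bce(pred_obj, true_obj) * (1 - obj_mask) * ignore_mask)
    cls_loss   = np.sum(bce(pred_cls, true_cls) * obj_mask[..., None])
    return xy_loss + wh_loss + obj_loss + noobj_loss + cls_loss
```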
TABLE II. EVALUATION METRICS FOR CONFIDENCE THRESHOLD = 0.25 AND IOU THRESHOLD = 0.5

Evaluation Metric   Best Accuracy   mAP      Recall   Avg. IoU   F1 Score
Values              0.9759          0.9661   0.95     0.6928     0.92
Here the relative centroid is represented as bx and by, and the directly predicted centroid as bx' and by'. Lambda_Coord is a weight with a value of 5. The second term represents the width and height loss calculated using the width (bw) and the height (bh), followed by the objectness/no-objectness score loss, and finally the last term represents the classification loss [2]. The mean Average Precision (mAP) was calculated to analyse the performance of the object detection. After completing 1600 iterations, a mean average precision of 96.61% was obtained; a confidence threshold of 0.25 was set in order to avoid occlusion of bounding boxes, and the mAP was calculated with an IoU threshold of 0.5 in order to obtain a better result. The testing results obtained are tabulated in Table II.

The object detector was tested with both images and videos. The results obtained by testing the model on Fish4Knowledge video data of 09 min 35 s duration show a best accuracy of 97.59%, an average loss of 0.475593, a precision of 0.88, a recall of 0.95, an F1 score of 0.92, an average IoU of 69.28% and an average detection time of 15 seconds, for a confidence threshold of 0.25 and an IoU threshold of 0.5. The accuracy, precision and recall of the model are calculated by taking the positive class as a fish present in the frame and the negative class as no fish in the frame. The mean Average Precision (mAP), F1 score and Intersection over Union (IoU) can be calculated as shown in (10), (11), (12).
mAP = (1 / No. of classes) × Σ Average precision    (10)

F1 score = 2 × (precision × recall) / (precision + recall)    (11)

IoU = (B1 ∩ B2) / (B1 ∪ B2)    (12)
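The evaluation metrics in (10)-(12) reduce to simple ratios once true/false positives are counted; a minimal sketch, with the counts below purely illustrative and not the experiment's actual confusion values:

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 score from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_average_precision(ap_per_class):
    """mAP as the mean of per-class average precisions, as in (10)."""
    return sum(ap_per_class) / len(ap_per_class)

print(prf1(tp=190, fp=26, fn=10))        # roughly (0.88, 0.95, 0.91)
print(mean_average_precision([0.9661]))  # single "fish" class
```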
VI. CONCLUSION
An underwater object detection model has been implemented using the YOLOv3 architecture and the Fish4Knowledge [13][14] dataset. A total of 2500 images were utilised to train the detector for a single class. The network successfully detects multiple objects in consecutive frames with an accuracy of 96.17% and a mean average precision of 96.61% for a confidence threshold of 0.25 and an IoU threshold of 0.5. An average IoU of 69.28% and an F1 score of 0.92 were obtained for this result.

The model can be further tuned to detect different object classes from different domains for object detection and tracking. YOLO can be combined with Deep SORT or any other object tracker for the implementation of tracking and further analysis.
References
[1] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[2] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 779-788.
[3] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 6517-6525.
[4] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1440-1448.
[5] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June 2017.
[6] J. Redmon, "Darknet: Open source neural networks in C," http://pjreddie.com/darknet/, 2013-2016.
[7] A. Mekonnen and F. Lerasle, "Comparative Evaluations of Selected Tracking-by-Detection Approaches," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 4, pp. 996-1010, 2019.
[8] Palazzo, Simone and Murabito, Francesca, "Fish species identification in real-life underwater images," in 3rd ACM International Workshop on Multimedia Analysis for Ecological Data, Orlando, Florida, 2014, pp. 13-18.
[9] Hsiao, Y., Chen, C., Lin, S., and Lin, "Real-world underwater fish recognition and identification using sparse representation," Ecological Informatics, vol. 23, pp. 13-21, 2014.
[10] Storbeck, Frank and Daan, Berent, "Fish species recognition using computer vision and a neural network," Fisheries Research, vol. 51, pp. 11-15, 2001.
[11] Sung, M., Yu, S., and Girdhar, Y., "Vision based real-time fish detection using convolution neural network," in IEEE OCEANS 2017, Aberdeen, UK, 2017, pp. 1-6.
[12] Strachan, N. J. C., and Kell, L., "A potential method for the differentiation between haddock fish stocks by computer vision using canonical discriminant analysis," ICES Journal of Marine Science, vol. 52, pp. 145-149, 1995.
[13] B. J. Boom, P. X. Huang, C. Spampinato, S. Palazzo, J. He, C. Beyan, E. Beauxis-Aussalet, J. van Ossenbruggen, G. Nadarajan, J. Y. Chen-Burger, D. Giordano, L. Hardman, F.-P. Lin and R. B. Fisher, "Long-term underwater camera surveillance for monitoring and analysis of fish populations," Proc. Int. Workshop on Visual Observation and Analysis of Animal and Insect Behavior (VAIB), in conjunction with ICPR 2012, Tsukuba, Japan, 2012.
[14] B. J. Boom, P. X. Huang, J. He and R. B. Fisher, "Supporting Ground-Truth annotation of image datasets using clustering," 21st Int. Conf. on Pattern Recognition (ICPR), 2012.
[15] "LabelImg," Tzutalin.github.io, 2019. [Online]. Available: https://tzutalin.github.io/labelImg/.