Underwater Object Detection model based on YOLOv3 architecture using Deep Neural Networks

March 2021

DOI:10.1109/ICACCS51430.2021.9441905

Conference: 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS)

Authors:

Mithun Haridas T P

Cochin University of Science and Technology

Supriya M H

Cochin University of Science and Technology

Athira. P

Department of Electronics

Cochin University of Science and

Technology

Kochi, India

athirapmohan1996@cusat.ac.in

Mithun Haridas T.P.

Department of Electronics

Cochin University of Science and

Technology

Kochi, India

mithuntp@cusat.ac.in

Supriya M.H.

Department of Electronics

Cochin University of Science and

Technology

Kochi, India

supriya@cusat.ac.in

Abstract: Object detection plays a crucial role in strategic areas such as underwater surveillance and resource exploration or scrutiny. The capability of analysing objects and extracting their inherent information underlines the high research value of object detection in underwater and other low-light media. Conventional systems serving this objective utilize traditional handcrafted algorithms and computational methodologies, which are highly inefficient. This brings out the need for computer-vision-based systems that are automated and learning based. This paper proposes a model to automatically detect underwater objects using the YOLOv3 architecture with the Darknet framework and deep learning. It also explores custom training of YOLOv3-based underwater object detection models using the Fish4Knowledge dataset.

Keywords: object detection, underwater images, YOLOv3, deep learning.

I. INTRODUCTION

Object detection is a crucial task that is used broadly across industries for monitoring, inspection, sorting, etc. It can be defined as a technique that identifies and localises the required targets in video frames in real time. Object detection [1][7] can also be used to count and track different objects. It differs from image recognition: recognition assigns a label to an image, whereas object detection draws a bounding box around each object and then labels it. This finds application in fields such as automated vehicle systems, motion recognition, automated CCTV, object counting, etc. Object detection can be implemented through traditional approaches as well as learning approaches. Traditional approaches use a regression model that combines information from various image features to predict the object location and its label, whereas learning approaches use deep neural network architectures for an end-to-end process in which feature extraction and object detection are achieved together.

Underwater object detection now plays an important role in studying climatic factors, port safety, resource exploration, etc. The manual methods previously used for analysis are labour intensive and time consuming; hence they are being replaced by automated ROVs (remotely operated vehicles), which reduce the manpower required. The video data obtained from an ROV is very large, and processing it manually would be tedious, so the system should be able to process large amounts of such video automatically. The main objectives of these vehicles include automatic identification of man-made structures and off-shore structures, object detection and/or obstacle avoidance, etc.

YOLOv3, proposed by Joseph Redmon and Ali Farhadi [1], is an improved version of the YOLO detection model and a fast-performing object detection algorithm. Improving on the previous models, it extends detection to multiple scales with stronger feature extraction and uses cross-entropy error functions, and hence can be applied to multiple object tracking. Like SSD, YOLOv3 [1] performs fast object detection, enabling real-time inference on a GPU. The detection precision of YOLOv3 [1] is comparable to that of Faster R-CNN. R-CNN based models use a region proposal method, which makes the detection process tedious, as it uses a selective search algorithm to eliminate bounding boxes with low confidence values and select the best one. In YOLO, by contrast, the information in the image pixels is used directly to predict bounding boxes and the probability of each belonging to a particular object class.

Authorized licensed use limited to: COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY. Downloaded on August 01,2021 at 06:05:08 UTC from IEEE Xplore. Restrictions apply.

This work explores the YOLOv3 [1] architecture and the Darknet [6] framework for implementing an efficient underwater object detection model. The Fish4Knowledge database [13][14] is used for training the model. The image dataset is preprocessed, labelled and annotated for training the model on underwater images. Performance is analysed using the mean average precision and learning curves.

Organisation of paper: Section II presents works related to object detection. The theory behind the proposed model is described in Section III, followed by the methodology in Section IV. Section V presents the experimental results of the implementation and Section VI concludes the research work.

II. RELATED WORKS

Deep learning models are found to be more suitable for detecting objects and extracting information from images in challenging environments, and they can work with large amounts of data at the same time.

Object detection [1][7] can be seen as a classification problem in which each pixel is passed through a classifier window that determines the object class present. R-CNN [4] is modelled by combining a region proposal network and selective search with AlexNet to solve the problem of selecting a candidate region. The modified versions of R-CNN are Fast R-CNN [3] and Faster R-CNN [5].

Strachan and Kell [12] made an early attempt in 1995 to detect dead fish based on features such as shape and colour. Later, in 2001, Storbeck and Daan [10] proposed a 3D model that added features such as height and width for classification. Real-time detection of fish was proposed in 2014 by Hsiao et al. [9], using motion-based fish detection from video frames. This was done with a Gaussian Mixture Model and achieved an accuracy of about 83.99%. In the same year, Palazzo and Murabito [8] discussed another method for real-time detection using a covariance model of fish video frames and achieved an average detection accuracy of 78.01%.

Another improved model for object detection is YOLO [1], which predicts bounding boxes and their confidence values in a single pipeline using a single convolutional network. In 2017, Sung et al. [11] proposed deep neural network-based fish detection using a CNN architecture and achieved 65.2% accuracy in localizing and detecting fish on a dataset of 93 images. YOLO produces results with high accuracy and precision. It predicts the bounding box and object class, with a confidence value for each class, using a single neural network pipeline [1].

III. ARCHITECTURE

A. Network architecture

The YOLO architecture is based on a CNN, as shown in Fig. 1. There were three versions of YOLO prior to YOLOv3 [1]. YOLOv1 [2] was the first implementation of the single-stage detector concept; it uses 1x1 reduction layers followed by 3x3 convolutional layers, together with batch normalization and the leaky ReLU activation function. The network consists of 24 convolutional layers, which extract features, and two fully connected (FC) layers, which predict the bounding boxes and their class probabilities. The final output is a 7x7x30 tensor of bounding box predictions. This model is trained to detect 49 objects, but produces a high localisation error.

YOLOv2, the improved version of YOLOv1, was built with a focus on reducing localisation error. YOLOv2 removed the final FC layers and added batch normalisation on all convolutional layers, which made the network resolution independent and lowered the localisation error. YOLOv2 [3] used Darknet-19, a network with 19 layers, augmented with 11 additional layers to detect objects.

Fig. 2. Bounding box [2]

Both of the previous YOLO models can detect fewer than 20 classes; hence a more advanced model, YOLO9000 [3], was developed, which can detect and classify more objects and classes. These models were then improved by adding features such as residual blocks, skip connections and up-sampling, and the result was named YOLOv3, which uses a 53-layer network trained on the ImageNet database [1].

B. Bounding box forecasting

YOLOv3 [1] uses a single pipeline for feature extraction: the whole image is passed to the convolutional network, which produces a square output called a grid, onto which the bounding boxes are anchored. The grid cell and the anchor share a common centroid. The YOLO algorithm predicts the location offsets against the anchor box (tx, ty, tw, th), the objectness score and the class probabilities. The objectness score gives the confidence that an object is present in the bounding box, and the class probability indicates how well the object matches a given class [1]. The predictions correspond to the bounding box coordinates, with (Cx, Cy) being the offset of the grid cell from the upper-left corner of the image and Pw and Ph being the width and height of the anchor box, as depicted in Fig. 2, and are calculated as given in (1), (2), (3), (4). Here σ denotes the logistic sigmoid function.

$$b_x = \sigma(t_x) + C_x \tag{1}$$

$$b_y = \sigma(t_y) + C_y \tag{2}$$

$$b_w = P_w e^{t_w} \tag{3}$$

$$b_h = P_h e^{t_h} \tag{4}$$

where bx and by are the x- and y-coordinates of the box centre, and bw and bh are the width and height. The objectness score, a measure of the overlap between the ground truth and the bounding box, is calculated by logistic regression. A value of "1" indicates that the bounding box overlaps the ground truth perfectly or above a threshold, whereas if the overlap is below the threshold the value is "0" and the bounding box is ignored. The objectness score thus provides an initial filter for selecting the best bounding box: bounding boxes with an objectness score greater than the threshold are retained first and then passed on to further filtering. Most object detection algorithms face the problem of detecting the same object multiple times, which results in poor performance. YOLO [1][2][3] uses non-maximal suppression (NMS) to solve the problem of multiple detections of the same object. NMS uses a function called Intersection over Union (IoU), with a minimum IoU threshold that is commonly set to 0.5. If B1 and B2 are two bounding boxes, the IoU is the ratio of the area of the intersection of B1 and B2 to the area of their union.
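The decoding in (1)-(4) and the filtering steps described above can be illustrated with a minimal Python sketch. It is not the Darknet implementation: the function names (`decode_box`, `iou`, `nms`) are hypothetical, boxes are assumed to be (centre x, centre y, width, height) tuples, and the default thresholds simply reuse the 0.25 confidence and 0.5 IoU values quoted in this paper.

```python
import math

def sigmoid(v):
    """Logistic sigmoid used in equations (1) and (2)."""
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(t, cell, prior):
    """Apply equations (1)-(4) to raw offsets t = (tx, ty, tw, th)."""
    tx, ty, tw, th = t
    cx, cy = cell    # grid-cell offset (Cx, Cy)
    pw, ph = prior   # anchor width and height (Pw, Ph)
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))

def iou(a, b):
    """Intersection over Union of two (bx, by, bw, bh) centre-format boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, score_thresh=0.25, iou_thresh=0.5):
    """Greedy NMS: drop low-score boxes, then keep the best box of each overlapping group."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap a higher-scoring kept box too much
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(5.0, 5.0, 2.0, 2.0), (5.1, 5.0, 2.0, 2.0), (9.0, 9.0, 2.0, 2.0), (1.0, 1.0, 1.0, 1.0)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))
```

In this toy input the second box overlaps the first almost completely and is suppressed, while the last box is discarded because its score falls below the confidence threshold.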

C. Class prediction

YOLOv3 [1] uses multilabel classification. Independent logistic classifiers are used instead of a softmax function, which reduces the computational complexity and in turn improves system performance. For example, in complex datasets such as the Open Images dataset, an object can be labelled as both a cat and an animal, i.e. there are many overlapping labels. Softmax performs poorly in such cases because it assumes that each box has a single class, which may not be the desired result; hence binary cross-entropy is used in YOLOv3 [1].
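The difference between softmax and independent logistic classifiers can be seen in a small Python sketch. The three logits and their class names ("cat", "animal", "fish") are made-up values for illustration only, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Softmax: class probabilities compete and sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    """Independent logistic classifier for one class."""
    return 1.0 / (1.0 + math.exp(-v))

def binary_cross_entropy(targets, probs, eps=1e-12):
    """Sum of per-class binary cross-entropy terms."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(targets, probs))

# Hypothetical logits for the classes "cat", "animal", "fish"
logits = [3.0, 3.0, -4.0]
targets = [1, 1, 0]  # the object is both a cat and an animal

soft = softmax(logits)
indep = [sigmoid(v) for v in logits]

print([round(p, 3) for p in soft])   # the two true labels have to share the probability mass
print([round(p, 3) for p in indep])  # each true label can be close to 1 on its own
print(binary_cross_entropy(targets, soft), binary_cross_entropy(targets, indep))
```

With overlapping labels, softmax cannot assign a high probability to both "cat" and "animal", whereas the independent sigmoid outputs can, which gives a much lower binary cross-entropy for the same logits.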

D. Predictions across scales

Predictions are made at three different scales: 13x13, 26x26 and 52x52. Features are extracted using Darknet-53 [6] together with a feature pyramid network. The last stage of prediction is a 3-D tensor encoding the bounding box, the confidence value that gives the objectness score, and the probability of the object belonging to a particular class [1].

Fig. 3. Darknet53 Architecture [1]

Fig. 4. Sample F4K dataset image showing single and multi-objects

Fig. 5. Object Detection Methodology and bounding box with objectness score

Fig. 6. Snapshot of labelling tool

E. Feature extractor

YOLOv3 [1] uses the Darknet-53 [6] network for feature extraction, a hybrid network derived from Darknet-19 and residual networks. It has a total of 53 convolutional layers and is therefore called Darknet-53. The architecture of Darknet-53 is shown in Fig. 3. YOLOv2 uses Darknet-19 for feature extraction, whereas YOLOv3 uses Darknet-53. Both YOLOv3 and YOLOv2 use batch normalisation.

IV. METHODOLOGY

YOLO reframes object detection as a regression task and produces a final output of bounding boxes with confidence scores. The Fish4Knowledge [13][14] video dataset is used for model development. Sample images from the Fish4Knowledge database are shown in Fig. 4. The whole implementation was done in Python in the Google Colab environment. Using transfer learning, the YOLOv3 [1] network was trained for 1600 iterations on a custom dataset prepared from the Fish4Knowledge database. The image dataset split is given in Table I. The trained model is tuned to perform the object detection task as shown in Fig. 5.

YOLOv3 uses residual skip connections and upsampling. It is a fully convolutional network and performs detection at three scales by applying, on the feature maps, 1x1 kernels whose shape is determined by the number of bounding boxes and the number of classes. Detection occurs only in three layers: detection layers 82, 94 and 106. In the initial stage the image is down-sampled over the first 81 layers, resulting in a stride of 32. After the first detection using the 1x1 kernel we thus obtain a feature map of 13x13x255. A similar process happens in the remaining detection layers, producing a final feature map of size 52x52x255. In YOLO, different layers are responsible for detecting objects of different sizes, i.e. the 13x13 scale detects large objects, the 26x26 scale detects medium objects and the 52x52 scale detects small objects.
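The feature-map sizes quoted above follow from simple arithmetic, sketched below in Python. The 416x416 input size is an assumption (it is the size for which a stride of 32 gives a 13x13 grid), and the function name is hypothetical; the depth of each map is 3 boxes per cell times (4 box offsets + 1 objectness score + number of classes).

```python
def yolo_output_shapes(input_size=416, num_classes=80,
                       boxes_per_cell=3, strides=(32, 16, 8)):
    """Return the (rows, cols, depth) of the three YOLOv3 detection maps."""
    depth = boxes_per_cell * (5 + num_classes)  # 4 offsets + objectness + classes
    return [(input_size // s, input_size // s, depth) for s in strides]

print(yolo_output_shapes())               # 80-class pretrained network
print(yolo_output_shapes(num_classes=1))  # single-class (fish) detector
```

For 80 classes the depth is 3 x 85 = 255, matching the 13x13x255 and 52x52x255 maps above; for a single class it drops to 3 x 6 = 18.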

For training YOLO on custom objects, the anchor boxes need to be arranged in decreasing order of their dimensions. The nine anchors of YOLO are assigned so that the three biggest anchors go to the first scale, the next set of three to the second and the remaining three to the third.

A. Data Preparations

The Fish4Knowledge [13][14] video dataset is available in mp4 format. The video data is converted into frames, yielding a total of 2500 frames, which were then labelled manually using the LabelImg tool [15], as shown in Fig. 6. Images were labelled in YOLO format, which contains the object class, the bounding box coordinates, and the box height and width relative to the image, with left-bottom as origin.

TABLE I. DATASET SPLIT

Dataset Type     | Training | Validation | Testing
Number of Images | 2000     | 250        | 250
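A split with the counts in Table I could be produced as in the following Python sketch. The paper does not state how frames were assigned to each subset, so the random shuffle, the fixed seed and the file-name pattern are all assumptions made for illustration.

```python
import random

def split_dataset(items, n_train=2000, n_val=250, n_test=250, seed=0):
    """Shuffle the items and cut them into train/validation/test subsets."""
    if len(items) < n_train + n_val + n_test:
        raise ValueError("not enough items for the requested split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Hypothetical frame names standing in for the 2500 labelled frames
frames = [f"frame_{i:04d}.jpg" for i in range(2500)]
train, val, test = split_dataset(frames)
print(len(train), len(val), len(test))
```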

Given the bounding box parameters x (the x-coordinate of the box centre), y (the y-coordinate of the box centre), w (the width) and h (the height), and the image parameters width (W) and height (H) in pixels, the annotation values can be calculated using (5), (6), (7), (8).

$$\text{Center-}x = \frac{x}{W} \tag{5}$$

$$\text{Center-}y = \frac{y}{H} \tag{6}$$

$$\text{Width} = \frac{w}{W} \tag{7}$$

$$\text{Height} = \frac{h}{H} \tag{8}$$

For instance,

0 0.557813 0.5578173 0.104375 0.100833

The bounding box centre coordinates are (0.557813, 0.5578173), the width and height are about 10% of the entire image (0.104375 and 0.100833), and 0 represents the object class present in it.

Fig. 7. Output of YOLO detection algorithm on F4K dataset

Fig. 8. Loss and mAP chart during Training
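Equations (5)-(8) amount to dividing the pixel values by the image size, as in the Python sketch below. The function names and the 640x480 example image are hypothetical; `corners_to_center` is an added convenience for boxes given as corner coordinates, which is an assumption about how a box might be stored before conversion.

```python
def corners_to_center(xmin, ymin, xmax, ymax):
    """Convert a corner-format pixel box to centre x, centre y, width, height."""
    return (xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin

def to_yolo_annotation(class_id, x, y, w, h, img_w, img_h):
    """Build one YOLO label line using equations (5)-(8)."""
    values = (x / img_w, y / img_h, w / img_w, h / img_h)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)

# Hypothetical 64x48 pixel fish box centred in a 640x480 frame
x, y, w, h = corners_to_center(288, 216, 352, 264)
print(to_yolo_annotation(0, x, y, w, h, img_w=640, img_h=480))
```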

B. Training

Training is done in Google Colab using a GPU, and the annotations are made using the LabelImg tool [15]. Since the network was previously trained on 80 object classes that do not include the object of interest (fish), the first step before training was to create a new configuration file with only one class. The number of convolution filters before each detection layer was set to 18, i.e. 3 x (5 + 1), since only one class was used for training.

The input to the network must be an image; hence the video is passed through the system to extract frames, which are forwarded to the YOLO object detector. The output of YOLO consists of the confidence score and the class ID of the object present in each bounding box, as shown in Fig. 7.

V. RESULTS AND ANALYSIS

The network was trained and tested with the Fish4Knowledge [13][14] dataset. The loss in each batch can be obtained from the log file generated during the training phase. Fig. 8 shows the loss and mAP plotted against iterations. The loss decreases and the mAP increases with iterations. The network can be trained until the average loss falls below 0.2; on further training the network becomes overfitted, which was avoided using early stopping. The three detection layers, i.e. layers 82, 94 and 106, calculate the loss functions for the bounding boxes, namely: the mean squared error of centreX, centreY, width and height; and the binary cross-entropy of the objectness score, the no-objectness score and the multi-class predictions. Thus the loss function has four parts and is calculated as in (9).

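$$
\begin{aligned}
\text{Loss} ={} & \text{Lambda\_Coord} \cdot \sum \text{MSE}\big((b_x, b_y), (b_x', b_y')\big) \cdot \text{obj\_mask} \\
& + \text{Lambda\_Coord} \cdot \sum \text{MSE}\big((b_w, b_h), (b_w', b_h')\big) \cdot \text{obj\_mask} \\
& + \sum \text{BCE}(\text{obj}, \text{obj}') \cdot \text{obj\_mask} \\
& + \text{Lambda\_Noobj} \cdot \sum \text{BCE}(\text{obj}, \text{obj}') \cdot (1 - \text{obj\_mask}) \cdot \text{ignore\_mask} \\
& + \sum \text{BCE}(\text{class}, \text{class}')
\end{aligned}
\tag{9}
$$

Here MSE denotes the mean squared error and BCE the binary cross-entropy.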
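The structure of (9) can be made concrete with a pure-Python sketch that evaluates the five sums for a list of predictions. This is an illustration of the formula as written, not the Darknet training code: the dictionary layout and function names are hypothetical, Lambda_Coord = 5 is the value given in this paper, and Lambda_Noobj = 0.5 is an assumption borrowed from the original YOLO formulation [2], since no value is stated here.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def bce(target, pred, eps=1e-7):
    """Binary cross-entropy for one probability, clipped for numerical safety."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def yolo_loss(truth, pred, obj_mask, ignore_mask,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Evaluate equation (9); lambda_noobj=0.5 is an assumed value."""
    total = 0.0
    for t, p, m, ig in zip(truth, pred, obj_mask, ignore_mask):
        total += lambda_coord * mse(t["xy"], p["xy"]) * m              # centre loss
        total += lambda_coord * mse(t["wh"], p["wh"]) * m              # width/height loss
        total += bce(t["obj"], p["obj"]) * m                           # objectness loss
        total += lambda_noobj * bce(t["obj"], p["obj"]) * (1 - m) * ig  # no-object loss
        total += sum(bce(tc, pc) for tc, pc in zip(t["cls"], p["cls"]))  # class loss
    return total

truth = [{"xy": (0.5, 0.5), "wh": (1.0, 2.0), "obj": 1.0, "cls": [1.0]}]
shifted = [{"xy": (0.6, 0.6), "wh": (1.0, 2.0), "obj": 1.0, "cls": [1.0]}]
print(yolo_loss(truth, truth, obj_mask=[1], ignore_mask=[1]))    # close to zero
print(yolo_loss(truth, shifted, obj_mask=[1], ignore_mask=[1]))  # centre error adds about 0.05
```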

TABLE II. EVALUATION METRICS FOR CONFIDENCE THRESHOLD = 0.25 AND IOU THRESHOLD = 0.5

Evaluation Metric | Best Accuracy | mAP    | Recall | Avg. IOU | F1 Score
Values            | 0.9759        | 0.9661 | 0.95   | 0.6928   | 0.92

In (9), bx and by represent the relative centroid and bx' and by' represent the directly predicted centroid. Lambda_Coord is a weight with a value of 5. The second term represents the width and height loss, calculated using the width (bw) and the height (bh), followed by the objectness/no-objectness score loss; the last term represents the classification loss [2]. The mean Average Precision (mAP) was calculated to analyse the performance of the object detector. After completing 1600 iterations, a mean average precision of 96.61% was obtained, with a confidence threshold of 0.25 set in order to avoid occlusion of bounding boxes. The mAP was calculated with an IoU threshold of 0.5 in order to obtain a better result. The testing results obtained are tabulated in Table II.

The object detector was tested with both images and videos. Testing the model on Fish4Knowledge video data of 9 min 35 s duration gave a best accuracy of 97.59%, an average loss of 0.475593, a precision of 0.88, a recall of 0.95, an F1 score of 0.92 and an average IoU of 69.28%, with an average detection time of 15 seconds, for a confidence threshold of 0.25 and an IoU threshold of 0.5. The accuracy, precision and recall of the model are calculated by taking fish present in the frame as the positive class and no fish in the frame as the negative class. The mean Average Precision (mAP), F1 score and Intersection over Union (IoU) can be calculated as shown in (10), (11), (12).

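$$\text{mAP} = \frac{1}{\text{No. of classes}} \sum \text{Average precision} \tag{10}$$

$$\text{F1 score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{11}$$

$$\text{IoU} = \frac{B_1 \cap B_2}{B_1 \cup B_2} \tag{12}$$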
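The metrics in (10) and (11) can be sketched in Python as follows. The true-positive and false-positive counts and the detection list are made-up inputs, the function names are hypothetical, and the average precision uses all-point interpolation of the precision-recall curve, which is one common convention and is assumed here because the paper does not specify which variant was used.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score (equation (11)) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(detections, n_ground_truth):
    """Area under the interpolated precision-recall curve for one class.

    detections: list of (confidence, is_true_positive) pairs.
    """
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        tp += 1 if is_tp else 0
        fp += 0 if is_tp else 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_ground_truth)
    # Make precision non-increasing from right to left (all-point interpolation)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """Equation (10): mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)

print(precision_recall_f1(tp=8, fp=2, fn=2))
ap_fish = average_precision([(0.9, True), (0.8, False), (0.7, True)], n_ground_truth=2)
print(ap_fish, mean_average_precision([ap_fish]))
```

With a single class, as in this work, the mAP equals the average precision of that class.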

VI. CONCLUSION

An underwater object detection model is implemented using the YOLOv3 architecture and the Fish4Knowledge [13][14] dataset. A total of 2500 images were utilised for training the detector for a single class. The network successfully detects multiple objects in consecutive frames with an accuracy of 96.17% and a mean average precision of 96.61% for a confidence threshold of 0.25 and an IoU threshold of 0.5. The average IoU obtained was 69.28% and the F1 score was 0.92.

The model can be further tuned to detect different object classes from different domains for object detection and tracking. YOLO can be combined with Deep SORT or any other object tracker for tracking and further analysis.

References

[1] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.

[2] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 779-788.

[3] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 6517-6525.

[4] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1440-1448.

[5] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 1 June 2017.

[6] J. Redmon, "Darknet: Open source neural networks in C," http://pjreddie.com/darknet/, 2013-2016.

[7] A. Mekonnen and F. Lerasle, "Comparative Evaluations of Selected Tracking-by-Detection Approaches," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 4, pp. 996-1010, 2019.

[8] S. Palazzo and F. Murabito, "Fish species identification in real-life underwater images," in 3rd ACM International Workshop on Multimedia Analysis for Ecological Data, Orlando, Florida, pp. 13-18.

[9] Hsiao, Y., Chen, C., Lin, S., and Lin, "Real-world underwater fish recognition and identification using sparse representation," Ecological Informatics, vol. 23, pp. 13-21, 2014.

[10] F. Storbeck and B. Daan, "Fish species recognition using computer vision and a neural network," Fisheries Research, vol. 51, pp. 11-15.

[11] M. Sung, S. Yu and Y. Girdhar, "Vision based real-time fish detection using convolution neural network," in IEEE OCEAN-2017, Aberdeen, UK, pp. 1-6.

[12] N. J. C. Strachan and L. Kell, "A potential method for the differentiation between haddock fish stocks by computer vision using canonical discriminant analysis," ICES Journal of Marine Science, vol. 52, pp. 145-149.

[13] B. J. Boom, P. X. Huang, C. Spampinato, S. Palazzo, J. He, C. Beyan, E. Beauxis-Aussalet, J. van Ossenbruggen, G. Nadarajan, J. Y. Chen-Burger, D. Giordano, L. Hardman, F.-P. Lin, R. B. Fisher, "Long-term underwater camera surveillance for monitoring and analysis of fish populations," Proc. Int. Workshop on Visual observation and Analysis of Animal and Insect Behavior (VAIB), in conjunction with ICPR 2012, Tsukuba, Japan, 2012.

[14] B. J. Boom, P. X. Huang, J. He, R. B. Fisher, "Supporting Ground-Truth annotation of image datasets using clustering," 21st Int. Conf. on Pattern Recognition (ICPR), 2012.

[15] "LabelImg," Tzutalin.github.io, 2019. [Online]. Available: https://tzutalin.github.io/labelImg/.

Authorized licensed use limited to: COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY. Downloaded on August 01,2021 at 06:05:08 UTC from IEEE Xplore. Restrictions apply.

Deep Learning-Based Automated Classroom Slide Extraction

Article

Full-text available

Apr 2024

Automated extraction of valuable content from real-time classroom lectures holds significant potential for enhancing educational accessibility and efficiency. However, capturing the spontaneous insights of live lectures often proves challenging due to rapid visual transitions, instructor movement, and diverse learning styles. This paper presents a novel approach that combines the strengths of YOLO and Scale-Invariant Feature Transform (SIFT) techniques to automatically extract slides from live classroom lectures. YOLO, a real-time object detection algorithm, is employed to identify board area, teacher, and other objects within the video stream. While SIFT, a robust feature-based method, was used to accurately merge key points from multiple pictures of the same region. The proposed method involves a multi-stage process: first, YOLO detects the potential place of the teacher, which occluded the board within the video frames. Subsequently, the teacher was removed from the image. The board was divided into multiple segments, to remove and merge redundant content Scale-invariant feature Transform (SIFT) was employed. Experimental results on a diverse dataset of classroom lecture videos demonstrated the effectiveness of the proposed method in extracting slides across different environments, lecture styles, and recording conditions. The potential benefits include improved note-taking, reduced manual effort in content curation, and enhanced accessibility to lecture materials. The presented approach contributes to the broader goal of leveraging computer vision and machine learning techniques to transform traditional classroom settings into modern, interactive, and adaptive learning environments.

Passive Visual Underwater Surveillance: A Survey

Preprint

Full-text available

Feb 2023

The objective of this article is to provide a detailed review of the state-of-the-art underwater surveillance process and the new trends of the same. Underwater surveillance has recently getting lots of attention because of its potential applications including the security of the coastal border, effective fish farming, deep-sea exploration, preservation of rare aquatic animals, etc. In the case of underwater, the light that is sensed by the camera got degraded due to many underwater phenomena such as haze, low illumination, scattering, absorption, diffraction, and refraction. The actual color of the object, as well as the scene, gets degraded as the light can not travel deep underwater. In a situation, where the scene of view is not clear, it is very difficult to detect the moving objects present in the scene. Further, the identification of the object of concern and tracking of that becomes more challenging. In this survey, we categorize underwater surveillance as a combination of the three blocks: enhancement, object detection, and object tracking. We categorize all these three blocks based on the underwater complexity and motion models considered. In this article, we tried to enumerate the most detailed descriptions of each category. We also discuss the future directions of research in the area of underwater surveillance.

Marine Plastic Detection Using Deep Learning

Chapter

Full-text available

Nov 2022

Ocean Pollution is one of the alarming environmental concerns where studies reveal that the biggest reason for ocean pollution is caused by the plastic debris discarded from the land. These plastics pose a threat to the coastal wildlife, marine ecosystem balance, and the economic health of the coastal communities. Inevitably this would result in affecting both human and aquatic living. The most commonly used methods, though effective, pose certain disadvantages when it comes to detecting and quantifying plastics. Thus, it is important to adopt alternative methods involving the latest technologies that would easily help us to identify the plastics and aid in their removal. In this paper, we have investigated the YOLO v4 and YOLO v5 deep learning object detection algorithms for detecting and identifying the marine plastics in the epipelagic layers of the water bodies. Ocean plastic images available on the internet are used to create the datasets. Image augmentation helps in increasing the number of images in the dataset. The Mean Average Precision of YOLO v4 and YOLO v5 are studied and the algorithm performance is explained with the results concluded.

Efficient Underwater Object Detection Using Deep Neural Networks

Conference Paper

Feb 2024

Detection of Underwater Objects in Images and Videos Using Deep Learning

Article

Jan 2023

YOLOv7-CHS: An Emerging Model for Underwater Object Detection

Article

Full-text available

Oct 2023

Underwater target detection plays a crucial role in marine environmental monitoring and early warning systems. It involves utilizing optical images acquired from underwater imaging devices to locate and identify aquatic organisms in challenging environments. However, the color deviation and low illumination in these images, caused by harsh working conditions, pose significant challenges to an effective target detection. Moreover, the detection of numerous small or tiny aquatic targets becomes even more demanding, considering the limited storage and computing power of detection devices. To address these problems, we propose the YOLOv7-CHS model for underwater target detection, which introduces several innovative approaches. Firstly, we replace efficient layer aggregation networks (ELAN) with the high-order spatial interaction (HOSI) module as the backbone of the model. This change reduces the model size while preserving accuracy. Secondly, we integrate the contextual transformer (CT) module into the head of the model, which combines static and dynamic contextual representations to effectively improve the model’s ability to detect small targets. Lastly, we incorporate the simple parameter-free attention (SPFA) module at the head of the detection network, implementing a combined channel-domain and spatial-domain attention mechanism. This integration significantly improves the representation capabilities of the network. To validate the implications of our model, we conduct a series of experiments. The results demonstrate that our proposed model achieves higher mean average precision (mAP) values on the Starfish and DUO datasets compared to the original YOLOv7, with improvements of 4.5% and 4.2%, respectively. Additionally, our model achieves a real-time detection speed of 32 frames per second (FPS). Furthermore, the floating point operations (FLOPs) of our model are 62.9 G smaller than those of YOLOv7, facilitating the deployment of the model. 
Its innovative design and experimental results highlight its effectiveness in addressing the challenges associated with underwater object detection.

Underwater Surveillance Robot

Conference Paper

Jul 2023

A Systematic Review on Underwater Image Enhancement and Object Detection Methods

Chapter

Nov 2022

In the last decade, the number of underwater image processing research has increased significantly. This is primarily due to society's dependency on the precious resources found underwater and to protect the underwater environment. Unlike regular imaging in a normal environment, underwater images suffer from low visibility, blurriness, color casts, etc. due to light scattering, turbidity, darkness, and wavelength of light. For effective underwater exploration, excellent approaches are necessary. This review study discusses the survey of “underwater image enhancement and object detection” methods. These methods are outlined briefly with the available dataset and evaluation metrics used for underwater image enhancement. A wide range of domain applications is also highlighted.

Quality Inspection of Dengue kits using YOLOv4 architecture

Article

Jun 2022

With the increasing advancements in Artificial Intelligence and its varied applications across multiple domains, the manufacturing industry is not left behind. Manufacturing and production require a large labour force to ensure good-quality end results. While this may be a necessity in the rudimentary stages of development, there is a way to reduce it while still checking the quality of the end product. This project aims to use Artificial Intelligence, specifically computer vision, to create a quality inspection tool that localizes and predicts the required objects in an image of the Dengue kit. It covers the entire process, including simulation and the design of the conveyor belt, and shows how the two combined can speed up quality inspection by reducing manual effort. Keywords: Artificial Intelligence, Inspection, Computer vision, Industry 4.0 Revolution, Object Detection, YOLOv4

YOLOv3: An Incremental Improvement

Article

Apr 2018

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/
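
The ".5 IOU" metric mentioned in this abstract counts a predicted box as correct when its intersection over union (IoU) with a ground-truth box reaches 0.5. The following is a minimal sketch of that check, not code from the YOLOv3 release. It assumes boxes are given as (x_min, y_min, x_max, y_max) corner coordinates, and the names `iou` and `is_true_positive` are hypothetical, chosen only for illustration.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so disjoint boxes yield no intersection area.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Apply the IoU threshold used by the mAP@50 style of evaluation."""
    return iou(pred, truth) >= threshold
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving an IoU of 1/7, which falls below the 0.5 threshold.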

Comparative Evaluations of Selected Tracking-by-Detection Approaches

Article

Mar 2018

In this work, we present a comparative evaluation of various multi-person tracking-by-detection approaches on public datasets. The work investigates five popular trackers coupled with six relevant visual people detectors evaluated on seven public datasets. The evaluation emphasizes the performance variation exhibited across tracker-detector choices. Our experimental results show that the overall performance depends on how challenging the dataset is, the performance of the detector on the specific dataset, and the tracker-detector combination. Some trackers are more sensitive to the choice of a detector and some detectors to the choice of a tracker than others. Based on our results, two of the trackers demonstrate the best performances consistently across different datasets whereas the best performing detectors vary per dataset. This underscores the need for careful application context specific evaluation when choosing a detector.

YOLO9000: Better, Faster, Stronger

Conference Paper

Jul 2017

Vision based real-time fish detection using convolutional neural network

Conference Paper

Jun 2017

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Conference Paper

Jan 2016

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.

You Only Look Once: Unified, Real-Time Object Detection

Conference Paper

Jun 2016

Fish Species Identification in Real-Life Underwater Images

Conference Paper

Nov 2014

Kernel descriptors consist of finite-dimensional vectors extracted from image patches and designed in such a way that the dot product approximates a nonlinear kernel, whose projection feature space would be high-dimensional. Recently, they have been successfully used for fine-grained object recognition, and in this work we study the application of two such descriptors, called EMK and KDES (respectively designed as a kernelized generalization of the common bag-of-words and histogram-of-gradient approaches) to the MAED 2014 Fish Classification task, consisting of about 50,000 underwater images from 10 fish species.

You Only Look Once: Unified, Real-Time Object Detection

Article

Jun 2015

We present YOLO, a unified pipeline for object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is also extremely fast; YOLO processes images in real-time at 45 frames per second, hundreds to thousands of times faster than existing detection systems. Our system uses global image context to detect and localize objects, making it less prone to background errors than top detection systems like R-CNN. By itself, YOLO detects objects at unprecedented speeds with moderate accuracy. When combined with state-of-the-art detectors, YOLO boosts performance by 2-3% points mAP.
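
Framing detection as regression means each ground-truth box is turned into numeric targets tied to a location in the image. The sketch below illustrates one such encoding and is not the authors' code. It assumes the grid formulation described in the YOLO paper (an S x S grid with S = 7, where the cell containing the box centre is responsible for the box), which this abstract does not spell out. The function name `encode_box_to_grid` and the normalized (cx, cy, w, h) input layout are hypothetical.

```python
from typing import Tuple


def encode_box_to_grid(
    cx: float, cy: float, w: float, h: float, grid_size: int = 7
) -> Tuple[int, int, float, float, float, float]:
    """Map a box with normalized centre (cx, cy) and size (w, h) to grid targets.

    Returns (row, col, x_offset, y_offset, w, h), where the offsets locate the
    centre inside the responsible cell and w, h stay relative to the image.
    """
    if not (0.0 <= cx <= 1.0 and 0.0 <= cy <= 1.0):
        raise ValueError("centre coordinates must be normalized to [0, 1]")

    # Cell that contains the box centre; clamp so cx == 1.0 stays in range.
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)

    # Position of the centre relative to the top-left corner of that cell.
    x_offset = cx * grid_size - col
    y_offset = cy * grid_size - row

    return row, col, x_offset, y_offset, w, h
```

Under these assumptions, a box centred in the middle of the image lands in the middle cell of the 7 x 7 grid with offsets of 0.5 in both directions, and the network is trained to regress those values alongside the class probabilities.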

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Article

Jun 2015

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.

Fast R-CNN

Article

Apr 2015

Ross Girshick

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
