A Comprehensive Study towards High-level Approaches for Weapon Detection
using Classical Machine Learning and Deep Learning Methods
Pavinder Yadav, Nidhi Gupta, Pawan Kumar Sharma
Department of Mathematics and Scientific Computing
National Institute of Technology, Hamirpur
Himachal Pradesh, 177005, India
Corresponding author. Email addresses: pavinder_phdmath@nith.ac.in (Pavinder Yadav), nidhi@nith.ac.in (Nidhi Gupta), psharma@nith.ac.in (Pawan Kumar Sharma)
Abstract
Surveillance systems do not give a rapid response to deal with suspicious activities such as armed robbery in public places. Consequently, there is a need for technology that can recognize criminal activities from Closed Circuit Television (CCTV) footage without the need for human intervention. Various high-performance computing algorithms have been developed but are limited to specific conditions. In this paper, we identify gaps between existing technologies for weapon detection. The automatic detection of guns/weapons could help in the investigation of crime scenes. A new and difficult area of study is identifying the specific type of firearm used in an attack, known as intra-class detection. The study examines and classifies the strengths and shortcomings of several existing algorithms using classical machine learning and deep learning approaches, employed in the detection of different kinds of weapons. We thoroughly compare and analyze the performance of several recent state-of-the-art methods on different datasets along with their future scope. We observed that deep learning techniques beat traditional machine learning techniques in terms of speed and accuracy.
Keywords: Weapon Detection, Deep Learning, Machine Learning, Computer Vision, Security and Surveillance
1. Introduction
Nowadays, Closed Circuit Televisions (CCTVs) are widely used in society to prevent crimes and identify suspicious activities. With the rapid proliferation of CCTV cameras, it becomes increasingly difficult for a human operator to inspect and analyze the feeds and to take any necessary action based on the video input from remote cameras. When several operators are required to monitor the video input, the process becomes expensive and ineffective. According to several studies, human operators develop video blindness and tend to miss up to 95% of the screen action after 20 to 40 minutes of intensive monitoring (Velastin et al., 2006). This results in a significant reduction in quality and productivity, which leads to an inaccurate detection rate of up to 83% (Ainsworth, 2002). Researchers have developed a number of computer vision-based automatic weapon detection systems in response to the proliferation of high-powered computers and the availability of high-speed internet.
Object detection, in particular, has become a demanding field of study in the last decade, and it has been used in a variety of applications including foreground moving target detection (Minaeian et al., 2018), human activity recognition (Singh and Vishwakarma, 2019), marine surveillance (Jeong et al., 2018), pedestrian identification (Jin et al., 2016), weapon detection (Olmos et al., 2018), and many more. Some applications merely need to identify items that take up a significant portion of the scene, while others require the detection of many objects of different sizes. The outcomes vary according to the size of the object, with small objects having poorer outcomes compared to large objects. These findings are well reflected in challenges like ImageNet (Deng et al., 2009), Common Objects in COntext (COCO) (Lin et al., 2014), and Open Images (Kuznetsova et al., 2020), where small objects are observed with 38% less accuracy in detection than larger objects. The reason is that small objects occupy fewer pixels in the image and therefore appear infrequently, either because they are not labeled or because they are not well represented in the training phase.
The crime rate is relatively high in nations where a person has easy access to weaponry (e.g., a pistol). Information from many sources has revealed the effects of criminal actions, ranging from murder to theft, resulting in the loss of valuable lives, devastation of infrastructure, and disruption of law and order. Standard CCTV cameras are used for monitoring at certain places, but the monitoring process is very laborious. Nowadays, guns are available in a wide range of styles and sizes, which makes them difficult to identify in real-time. In this situation, deep learning ushered in a breakthrough, and over time researchers have created various models to identify weaponry using deep learning methods. Nevertheless, the detection of weapons remains a difficult task despite the existence of several advanced state-of-the-art deep learning techniques. A thorough examination of several such techniques has been carried out in this article.
Detection of small objects in aerial and satellite images has previously been addressed by modifying the architecture of the network (Sommer et al., 2017), augmenting data for small objects (Tong et al., 2020), or adding perceptual generative adversarial networks for better image resolution (Li et al., 2017). Although these techniques achieved higher precision in small object detection, they were limited to certain applications such as traffic signs or satellite images only.
Earlier, the object detection procedure was divided into three phases: (i) generating proposals, (ii) extracting feature vectors, and (iii) classifying regions. The main objective was to find a region of interest in an image that might include objects of any size. To preserve information across object sizes, the input images were rescaled to multiple sizes and multi-scale sliding windows were moved across them. The second step was to obtain a feature vector of fixed length from the sliding window in order to capture discriminative information about the enclosed area. Low-level visual descriptors such as the Harris Corner Detector (Harris et al., 1988), Histogram of Gradients (HOG) (Dalal and Triggs, 2005), Scale Invariant Feature Transform (SIFT) (Lowe, 1999), or Speeded Up Robust Features (SURF) (Bay et al., 2006) were used to encode feature vectors, which exhibited robustness to scale, illumination, and rotational variance. Finally, in the third phase, region classifiers were trained to assign category labels to the covered regions. Because of their high performance on small-sized training data, Support Vector Machines (SVM) (Hearst et al., 1998) were commonly used. Additionally, in the classification stage, various approaches such as bagging (Opitz and Maclin, 1999), cascade learning (Dalal and Triggs, 2005), and AdaBoost (Freund et al., 1996) were utilised, resulting in better detection accuracy.
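As an illustration of this three-phase pipeline, the following is a minimal sketch using HOG features and a linear SVM; the 64 x 64 window size, the stride, and the placeholder training data are our own assumptions for demonstration, not parameters taken from any of the surveyed systems.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def describe(window):
    # Phase (ii): encode a 64 x 64 window as a fixed-length HOG descriptor.
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Placeholder training windows; a real system would crop labeled patches.
rng = np.random.default_rng(0)
windows = rng.random((20, 64, 64))
labels = np.array([1] * 10 + [0] * 10)      # 1 = weapon, 0 = background
clf = LinearSVC().fit([describe(w) for w in windows], labels)

# Phase (i): slide a fixed-size window over the image at a fixed stride.
image = rng.random((256, 256))
for y in range(0, 256 - 64, 32):
    for x in range(0, 256 - 64, 32):
        # Phase (iii): classify each region; high scores are detections.
        score = clf.decision_function([describe(image[y:y + 64, x:x + 64])])
```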
In particular, Deep Convolutional Neural Networks (DCNNs) have outperformed all other machine learning approaches in object detection in the last few years. Finding higher-level features in data with traditional methods takes substantial manual effort (Guo et al., 2016); DCNN models do this automatically. Convolutional layers, non-linear activation functions such as ReLU, pooling layers, and fully connected layers make up DCNNs. The convolutional layers extract a variety of characteristics from the source images, and the fully connected layers then learn from these characteristics. Another advantage of such a design is that it may be reused partially or fully for similar applications. This is possible because of transfer learning, which cuts down on model building time and eliminates the need for a large dataset.
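As a hedged illustration of this transfer learning idea, the sketch below reuses an ImageNet-pretrained VGG-16 in PyTorch and retrains only a new two-class (weapon / no-weapon) head; the two-class setup and the hyperparameters are assumptions for demonstration, and the weights argument name varies across torchvision versions.

```python
import torch
import torchvision

# Load VGG-16 pretrained on ImageNet (torchvision >= 0.13 API shown).
model = torchvision.models.vgg16(weights="IMAGENET1K_V1")
for p in model.features.parameters():
    p.requires_grad = False              # freeze the convolutional layers
# Replace the final 1000-way layer with a 2-way weapon / no-weapon head.
model.classifier[6] = torch.nn.Linear(4096, 2)

optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
# Training would then iterate over a labeled weapon dataset (see Section 2).
```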
Object detection models based on deep learning methods are majorly divided into two groups: (i) one-stage detectors like You Only Look Once (YOLO) (Redmon et al., 2016) and its versions YOLOv2 (Redmon and Farhadi, 2017) and YOLOv3 (Redmon and Farhadi, 2018), and (ii) two-stage detectors like the Region-based Convolutional Neural Network (R-CNN) (Girshick et al., 2014) and its versions Fast R-CNN (Girshick, 2015) and Faster R-CNN (Ren et al., 2015). Without the need for a cascading region classification phase, one-stage detectors produce categorical predictions of items at each position of the feature maps. Two-stage detectors start with a proposal generator that generates a small number of proposals and extracts characteristics from each one, and then use region classifiers to predict the category of the proposed object. One-stage detectors are substantially more time-efficient and have more applicability in real-time identification, but two-stage detectors produce better results on public benchmark datasets. In this study, we cover the fundamental concepts of the major approaches and review each of these methods in a methodical manner.
For a few decades, researchers have been striving to develop an automatic weapon detection system based on computer vision algorithms. In a potentially perilous situation, a person typically carries a knife or a firearm in the hand rather than on any other body part; needless to say, guns are normally operated by hand while committing a crime. Therefore, the vision system is expected to be trained on the ideal weapon image or a form that is comparable to that weapon. The major goals of such a detection system are as follows:

- To create an automatic alarm system that can alert surveillance security personnel in real-time, resulting in a quick response; and
- To classify different types of weapons, which can provide crucial information for forensic investigation.
Deep learning revolutionised the creation of weapon detection systems. In this regard, several studies have been
conducted and various models have been developed to identify firearms.
The main objective of this comprehensive study is to identify research gaps in the field of weapon detection and identification and to thoroughly study existing datasets, their limitations, and future research directions. This article presents the contributions of a large number of significant articles in a structured and systematic manner. This survey can provide readers with a comprehensive understanding of weapon detection using deep learning, as well as perhaps drive future research efforts on weapon detection approaches and their benefits. Overall, this article examines the strengths and shortcomings of the various existing approaches, and offers a detailed assessment of the open issues along with a forecast of future prospects.
The paper is organized into seven sections. Section 2 provides a detailed description of existing publicly available
datasets. Classical machine learning approaches adopted for weapon detection are discussed in Section 3. In Section 4,
deep learning approaches are described in detail. Furthermore, the comparative analysis of various classical machine
learning and deep learning methods is discussed in Section 5. The key contributions of the study are highlighted in
Section 6. Future directions related to real-time weapon detection methods are given in Section 7.
2. Publicly Available Datasets
Table 1 shows the public datasets that can be used to classify and recognize weapons, providing the year of publication, the number of images in each dataset, the resolution of images or videos, and the types of weapons used.
Table 1: Statistics of datasets that are publicly available.

| Database | Publication/Site | Year | # Images/Videos | Resolution | Types of Weapon |
|---|---|---|---|---|---|
| IMFDBs | IMFDBs | 2014 | 450,000* | Variable size | Handguns, rifles, knives |
| Knives Images Database | Grega et al. (2016) | 2016 | 12,899 | 100 x 100 | Knives |
| Gun Movies Database | Grega et al. (2016) | 2016 | 7 videos | 640 x 480 | Guns |
| Dataset of Olmos et al. | Olmos et al. (2018) | 2018 | 3,000 | Variable size | Guns |
| Sohas weapon | Pérez-Hernández et al. (2020) | 2020 | 17,684 | Variable size | Guns, knives |
| Dataset created by mock attack | González et al. (2020) | 2020 | 5,149 | 1920 x 1080 | Handguns and rifles |
| Synthetic dataset | González et al. (2020) | 2020 | 2,500 | 1920 x 1080 | Guns |
| ITU Firearm Dataset | Iqbal et al. (2021) | 2021 | 10,973 | 480 x 800 | Guns |

*Approximate
2.1. IMFDBs
The Internet Movie Firearm Database contains a huge collection of weapon pictures. It is a wiki-powered online repository that is publicly available at the IMFDBs site. It comprises about 450,000 pictures of weapons, some of which are displayed in Fig. 1. Several thousand images from movie sequences or video games are used to create the database, which explains the limited number of pictures with close-up views of weapons. Weapons obscured by darkness or made hard to see due to blur or small size are also included. IMFDBs is an excellent dataset for firearms since it has a wide range of gun pictures in various unconstrained orientations and positions.
Figure 1: Sample images from IMFDBs database (IMFDBs).
2.2. Knives images database
This database (Grega et al., 2016) comprises 12,899 pictures of knives, which are classified into two categories: (i) positive examples containing 3,559 images, and (ii) negative examples containing 9,340 images. The images containing a knife, shown in Fig. 2a, belong to the positive class, and the images shown in Fig. 2b belong to the negative examples, covering all circumstances. A knife that is not being wielded by a person is considered less hazardous, and can be ignored during processing to avoid a slew of false alarms. Because carrying knives in public is banned in Poland, the photographs were taken either indoors or through vehicle windows. All images in this database have a resolution of 100 x 100 pixels.
Figure 2: Knives images database (a) Positive samples (b) Negative samples (Grega et al., 2016).
2.3. Database of firearm videos
The Gun Movies Dataset is a video collection recorded by surveillance cameras. Grega et al. (2016) created this
dataset by simulating a gun-shooting situation due to the lack of real-life gun-shooting footage. As a result, this
dataset comprises CCTV footage with an actor across seven different video recordings. The training and testing sets
were identical in size, with each recording lasting for 8.5 minutes and yielding roughly 12,000 frames. Sixty percent
of each set contains negative examples that do not include a handgun but other objects in a hand, while the remaining
forty percent contains positive examples that include a firearm visible to an observer. A few images from this dataset
are shown in Fig. 3.
Figure 3: Frames extracted from the Gun Movies Dataset of Grega et al. (2016).
2.4. Dataset of Olmos et al.
Olmos et al. (2018) developed two databases, for knives and handguns. The knife dataset contains 12,869 images; each image contains multiple variations of cold steel weapons of various sorts, forms, colors, sizes, and materials, placed near and far from the camera, and partially occluded by the hand or other objects. There are three sets of handgun databases: one with 102 classes and 9,261 images, which is suitable for classification tasks; a second comprising 3,000 images of weapons with extensive contextual information that may be used for detection; and a third comprising 608 images, 304 of them being handgun images, which may be used for classification as well as detection tasks. Some of the sample images are shown in Fig. 4a.
2.5. Sohas weapon
Pérez-Hernández et al. (2020) created a dataset called Sohas weapon to investigate six small objects that are frequently handled in a similar manner to a weapon, namely handguns, knives, smartphones, bills, handbags, and cards. They employed a variety of surveillance cameras to capture images, and 10% of the images were obtained from web sources. All of these images were manually annotated for detection. The dataset contains different types of knives with different shapes. A few images of the dataset are shown in Fig. 4b.
Figure 4: A few samples from the (a) Dataset of Olmos et al. (2018) (b) Sohas weapon dataset (Pérez-Hernández et al., 2020).
2.6. Dataset created by mock attack and synthetic dataset
During a simulated assault, this dataset was collected and manually annotated by González et al. (2020). The infrastructure of the dataset is made up of three security cameras that are strategically placed at the same location to cover two distinct pathways and one entrance, generating diverse situations. A total of 5,149 frames were retrieved from the videos at a rate of two frames per second (607, 3,511, and 1,031 frames from cam1, cam5, and cam7, respectively). In addition, they constructed a synthetic dataset by simulating a section of a city and an educational facility within it using the Unity game engine. Several cameras capture the motions of eleven distinct models and seven animations that make up the cast of characters. The images add eleven distinct items to the produced datasets: four different pistols, five different rifles, a knife, and a smartphone. The creation of synthetic data might help the network focus on the item to be detected; however, this dataset is not realistic. Fig. 5a shows an example of the dataset collected by mock attack, and Fig. 5b illustrates a few sample images from the synthetic dataset.
Figure 5: (a) Dataset collected during mock attack (b) A few sample images from the synthetic dataset (González et al., 2020).
2.7. ITU Firearms Dataset (ITUF)
The ITUF dataset (Iqbal et al., 2021) contains images of guns and rifles in a variety of settings, including being aimed, laid out on tables, carried, or stored in racks. Web scraping was used to acquire the images; weapons, battles, pistols, film titles, firearms, firearm variations, shooters, corps, guns, and rifles were among the search phrases used to build the dataset. The dataset covers garment variances, body posture variations, weapon position and size variations, changing light conditions, and both indoor and outdoor situations; these variables provide a strong prior for data-driven algorithms. Images not associated with weaponry, as well as cartoons and duplicates, were eliminated from the results. The final clean collection contains 10,973 fully annotated firearm images with 13,647 firearm instances. Fig. 6 depicts a few sample images from the dataset.
Figure 6: Sample images from the ITU Firearm Dataset (Iqbal et al., 2021).
3. Classical Machine Learning Methods
This section covers a detailed description of algorithms that use classical machine learning approaches (Haralick and Shapiro, 1985); some of these approaches are listed in Table 2. Edge detection is the process of identifying parts of an image where objects can be distinguished from one another. Edge detection has the advantage of reducing the quantity of data required to analyse the image. It works effectively on sharp images, while noisy images increase the complexity and make edge detection a challenging task.
Table 2: List of classical machine learning approaches and respective areas of application.

| Method | Reference | Year | Area of Application |
|---|---|---|---|
| Active Appearance Models (AAMs) | Cootes et al. (1998) | 1998 | Knife detection |
| Harris Corner Detector | Harris et al. (1988) | 1988 | Knife and gun detection |
| Scale Invariant Feature Transform (SIFT) | Lowe (1999) | 1999 | Knife and gun detection |
| Haar Cascades | Viola and Jones (2001) | 2001 | Knife detection |
| Speeded Up Robust Features (SURF) | Bay et al. (2006) | 2006 | Knife and gun detection |
3.1. Active Appearance Models (AAMs)
A statistical model of the form and pixel intensities (texture) of an object can be expressed as an Active Appearance Model (AAM) (Cootes et al., 1998). The term appearance refers to the combination of form and texture, while active refers to the employment of an algorithm to match the shape and texture model in fresh images. Objects of interest are manually tagged with so-called landmark points in the images from the training set to characterise the form during the training phase. The algorithm may be divided into four stages:

1. Select a starting reference shape.
2. Align the reference shape with all other forms.
3. Recalculate the mean shape of the aligned forms.
4. If the distance of the mean shape from the reference shape exceeds a threshold, set the mean shape as the reference shape and return to step 2. Otherwise, return the mean shape.
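To make the alignment loop concrete, here is a minimal numpy sketch under our own assumptions: landmarks are stored as N x 2 arrays and alignment is a similarity (Procrustes) fit, which matches the spirit of the procedure above but not necessarily the exact details of any published AAM implementation.

```python
import numpy as np

def align(shape, ref):
    """Similarity-align one shape (N x 2 landmark array) to a reference."""
    s = shape - shape.mean(axis=0)       # remove translation
    r = ref - ref.mean(axis=0)
    u, _, vt = np.linalg.svd(r.T @ s)    # orthogonal Procrustes rotation
    rot = u @ vt
    scale = np.linalg.norm(r) / np.linalg.norm(s)
    return scale * (s @ rot.T)

def mean_shape(shapes, tol=1e-6, max_iter=100):
    ref = shapes[0]                                  # step 1
    for _ in range(max_iter):
        aligned = [align(s, ref) for s in shapes]    # step 2
        mean = np.mean(aligned, axis=0)              # step 3
        if np.linalg.norm(mean - ref) < tol:         # step 4
            return mean
        ref = mean
    return ref
```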
3.2. Harris Corner Detector
It is a corner detection technique commonly utilized in computer vision algorithms to extract corners and infer image features (Harris et al., 1988). Instead of applying shifted patches at every 45° angle, the Harris corner detector considers the differential of the corner score with respect to direction directly, allowing it to more effectively distinguish between edges and corners. The concept is to imagine a tiny window surrounding each pixel in the image. By moving each window a small fraction in a specific direction, it is possible to determine the amount of change in the pixel values. Eq. 1 expresses the change function E(m, n) as the total sum of all squared differences,

E(m, n) = \sum_{x,y} w(x, y) \, [I(x + m, y + n) - I(x, y)]^2    (1)

where (m, n) is the window shift, (x, y) are the pixel coordinates, w(x, y) is the window function, and I is the pixel intensity within a 3 x 3 window. Any pixel with a high E(m, n) value, as determined by a threshold, is regarded as a feature of the image. For corner detection, we must maximize the function E(m, n), and consequently the second term must be maximized. Eq. 2 is obtained by applying a Taylor series expansion to Eq. 1 and then performing a number of mathematical operations on the result:

E(m, n) \approx [m \;\; n] \, P \, [m \;\; n]^T    (2)

where,

P = \sum_{x,y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}    (3)

where I_x and I_y are the derivatives of the image in the x and y directions, respectively. The corner response is calculated by Eq. 4,

Q = \det(P) - c \, \mathrm{trace}(P)^2    (4)

where c is a constant, and α and β are the eigenvalues of P. As a consequence, as seen in Fig. 7, the eigenvalues indicate whether an area is a corner, a flat surface, or an edge.
Figure 7: Classification of an image point using the eigenvalues of P.
- When |Q| is small, which occurs when α and β are both small, the area is flat.
- When Q < 0, the region is considered to be an edge, which happens when α >> β or vice versa.
- When Q is large, the area is considered a corner, which happens when α and β are both large and α ≈ β.
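A minimal OpenCV sketch of this detector is shown below; the image path is hypothetical, and the threshold of 1% of the maximum response is an assumed heuristic rather than a value from the surveyed work.

```python
import cv2
import numpy as np

img = cv2.imread("knife.jpg")                        # hypothetical path
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# blockSize=3 matches the 3 x 3 window w(x, y) of Eq. 1; k plays the role
# of the constant c in Eq. 4; ksize is the Sobel aperture for I_x and I_y.
response = cv2.cornerHarris(gray, blockSize=3, ksize=3, k=0.04)

# Keep pixels whose corner response Q is well above the maximum response.
corners = response > 0.01 * response.max()
print(corners.sum(), "corner pixels detected")
```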
3.3. Scale Invariant Feature Transform (SIFT)
Scale Invariant Feature Transform (SIFT) (Lowe, 1999) extracts a large number of distinct candidate key-points from an image which are invariant to changes in perspective, scale, rotation, illumination, and noise. Fig. 8 shows the architecture of the SIFT algorithm.
Figure 8: Architecture of SIFT (Lowe, 1999).
There are basically four stages in this algorithm:

- Scale-space detection: The input image is scanned to find regions of interest that are either local maxima or local minima and invariant to transformations like rotation and scaling.
- Key-point localization: In this phase, the most stable key-points are identified, and poor-contrast points and outliers are eliminated. Outliers, low-contrast pixels, and poorly located key-points along an edge are all removed using the Taylor series.
- Orientation: After finding the most stable key-points, the local image gradient direction is used to assign an orientation to each key-point.
- Key-point description: In the final phase, suitable feature points are described by the stored intensity samples in their neighborhood. All of these key-points are unique and unaffected by affine transformations or changes in illumination.
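The full pipeline above is available in OpenCV; the sketch below extracts SIFT key-points and 128-dimensional descriptors from a hypothetical weapon image (SIFT_create has been available in mainline opencv-python since version 4.4).

```python
import cv2

img = cv2.imread("weapon.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical path
sift = cv2.SIFT_create()
# Each key-point carries location, scale, and orientation; each descriptor
# is a 128-dimensional vector robust to scale, rotation, and illumination.
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)
```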
3.4. Speeded Up Robust Features (SURF)
Speeded Up Robust Features (SURF) (Bay et al., 2006) is a feature extraction method that operates in a similar way to SIFT but is significantly faster. The SURF method is a reliable detector and descriptor of prospective key-points of interest. The architecture of the SURF algorithm is shown in Fig. 9.
Figure 9: Architecture of SURF (Bay et al., 2006).
The SURF detector makes use of the Hessian matrix to discover interest key-points, as it permits a quick and accurate calculation. The Hessian matrix H for a point x = (x, y) at scale σ is expressed by Eq. 5:

H(x, σ) = \begin{bmatrix} D_{xx}(x, σ) & D_{xy}(x, σ) \\ D_{xy}(x, σ) & D_{yy}(x, σ) \end{bmatrix}    (5)

where D(x, σ) is obtained by convolving the Gaussian second-order derivative with the integral image I.

SURF descriptor: Interest key-points are characterized by using Haar wavelet responses to assign an orientation to each key-point. A square area is built around each key-point for its description; the chosen square region is then subdivided into 4 x 4 subregions. After that, the four descriptor components d_x, d_y, |d_x|, and |d_y| are assessed, where d_x represents the horizontal Haar wavelet response and d_y represents the vertical Haar wavelet response; |d_x| and |d_y| are the absolute values of the horizontal and vertical responses, respectively.
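A corresponding OpenCV sketch for SURF is given below; note that SURF is patented and is only available in opencv-contrib builds compiled with the non-free modules enabled, and the Hessian threshold value here is an assumption.

```python
import cv2

img = cv2.imread("weapon.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical path
# hessianThreshold discards key-points with a weak det(H) response (Eq. 5).
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)   # 64-dim descriptors by default
```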
3.5. Haar Cascades
Viola and Jones (2001) presented the Haar Cascades detector, which is a successful fusion of three fundamental principles. First, a large collection of features is needed that can be computed in a short and consistent time; this feature-based strategy reduces in-class variability while increasing inter-class variability. Secondly, a boosting method allows the salient features to be selected and the classifier to be trained at the same time. Finally, a quick and efficient detection method is made possible by building a cascade of increasingly complex classifiers.

As illustrated in Fig. 10, the method employs edge and line detection features as well as center-surround features. At each level of the procedure, the number of features employed to evaluate the image increases. With only 200 basic features, Wilson and Fernandez (2006) were able to recognize a human face with a 95% accuracy rate.
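OpenCV ships pretrained Haar cascades (for faces, not weapons), so the sketch below illustrates the detection API on the bundled frontal-face cascade; a knife or gun cascade would first have to be trained from positive and negative samples as described above.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("scene.jpg")                          # hypothetical path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Each cascade stage applies progressively more Haar-like features,
# quickly rejecting windows that cannot contain the target object.
detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in detections:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```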
Figure 10: Haar-like features employing edge or line detection characteristics (Wilson and Fernandez, 2006).
4. Deep Learning Methods
Deep neural networks are used in deep learning algorithms such as Faster R-CNN (Ren et al., 2015), the Single Shot multibox Detector (SSD) (Liu et al., 2016), and YOLO (Redmon et al., 2016). Fig. 11 depicts the key advancements and achievements in deep learning-based object detection algorithms from the year 2012 onwards. One of the most significant benefits of deep learning algorithms is that they do not require hand-crafted features such as edge and corner detectors; they learn characteristics on the fly during the training phase. Consequently, these algorithms require a large quantity of data to be trained, although a large amount of data may also enable the detection of occluded objects. The data must be labeled beforehand for training SSD, Faster R-CNN, and YOLO; this is commonly referred to as supervised learning. Several deep learning methods are given in Table 3 with their advantages and shortcomings.
Table 3: Highlights and shortcomings of deep learning methods using one- and two-stage approaches.

| Method | Publication | Approach | Highlights and Shortcomings |
|---|---|---|---|
| R-CNN | Girshick et al. (2014) | Two-stage | Highlights: Significant improvement in performance over previous state-of-the-art methods; the first method to combine CNN and region proposal (RP) methods. Shortcomings: Training is costly in terms of both space and time; testing is time-consuming. |
| Fast R-CNN | Girshick (2015) | Two-stage | Highlights: Creates a layer for ROI pooling; first method enabling end-to-end detector training (ignoring RP generation). Shortcomings: External RP computation is revealed as the new bottleneck; still too slow for real-time applications. |
| Faster R-CNN | Ren et al. (2015) | Two-stage | Highlights: Proposes an RPN instead of selective search for producing nearly cost-free, high-quality RPs; combines Fast R-CNN and RPN into a single network by sharing convolution layers; introduces multi-scale anchor boxes and translation invariance as RPN references. Shortcomings: Not a simplified procedure; training is complex; still too slow for real-time applications. |
| YOLO | Redmon et al. (2016) | One-stage | Highlights: The first highly effective unified detector; can run at 45 FPS; drops the RP process completely; an efficient and elegant detection framework; dramatically faster than previous detection techniques. Shortcomings: Localization of small objects is difficult; detector accuracy falls far short of previous detectors. |
| SSD | Liu et al. (2016) | One-stage | Highlights: Effectively combines the YOLO and RPN ideas to detect objects using multi-scale convolution layers; first efficient and accurate unified detector; can run at 59 FPS; faster and significantly more accurate than YOLO. Shortcomings: Ineffective at detecting objects of small size. |
| FPN | Lin et al. (2017) | Two-stage | Highlights: Superior to Faster R-CNN while maintaining high accuracy; creates a set of position-sensitive score maps using a bank of specialised convolution layers. Shortcomings: Still too slow for real-time applications; training is not a simplified process. |
| YOLOv2 | Redmon and Farhadi (2017) | One-stage | Highlights: Employs a variety of existing strategies to boost both accuracy and speed; proposes the faster DarkNet-19; can identify over 9,000 object classes in real time. Shortcomings: Ineffective at detecting objects of small size. |
| YOLOv3 | Redmon and Farhadi (2018) | One-stage | Highlights: Improves detection accuracy by adopting the concept of the residual network; employs Darknet-53 to generate small-size feature maps. Shortcomings: Ineffective at detecting objects of small size. |
| YOLOv4 | Bochkovskiy et al. (2020) | One-stage | Highlights: The backbone employs Bag-of-Specials (BoS) and Bag-of-Freebies (BoF), which improves performance with Cross Stage Partial Darknet-53; the neck has improved Spatial Pyramid Pooling, yielding a fixed-size output independent of input size. Shortcomings: Training is not a streamlined process; ineffective at detecting objects of small size. |
Figure 11: Milestones in object detection using deep convolutional neural networks.
4.1. Backbone Networks
Object detectors based on deep neural networks make use of backbone networks to extract high-level information from input images. Deep neural networks are most commonly used as image classifiers performing on large-scale image classification datasets like the ImageNet classification dataset. In most cases, the final classification layers of an image classifier are eliminated, the remaining layers are employed as the backbone network, and further detection layers are added to the backbone to construct a complete object detector. The major design goals of backbone networks are to increase detection accuracy and processing efficiency. The following are some of the most widely used backbone networks:
- VGGNets (Simonyan and Zisserman, 2014) have convolutional layers that employ tiny 3x3 filters, followed by 2x2 max pooling. VGG-16 contains thirteen convolutional layers, whereas VGG-19 has sixteen convolutional layers. VGG was the winner of the ImageNet Challenge in 2014, and it is still one of the most popular networks today.
- Residual networks (ResNets) (He et al., 2016) were presented as a way to train very deep networks using residual blocks. Residual networks come in a variety of shapes and sizes; ResNet50 and ResNet101 are the most popular variants. ResNet is substantially deeper than VGGNet.
- Inception networks (Szegedy et al., 2015, 2016) boosted network size and scope without adding to the computational cost. Convolution layers of 1x1, 3x3, and 5x5 filter sizes, as well as max pooling layers, are stacked in parallel in the Inception module, so features at many scales may be retrieved at the same time in a single layer. VGGNet is substantially slower than Inception networks.
- DenseNet (Huang et al., 2017) is a network in which each layer is densely connected to all subsequent layers in a feed-forward manner, allowing all later layers to utilise lower-level characteristics. The vanishing-gradient problem can be mitigated with DenseNet.
- ZFNet (Zeiler and Fergus, 2014) is a classic convolutional neural network. The design was inspired by visualising intermediate feature layers and the operation of the classifier. The filter sizes and strides of the convolutions are comparatively smaller than in several previous architectures.
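As a small illustration of how a classifier becomes a backbone, the sketch below strips the classification head from a pretrained ResNet-50 in PyTorch and keeps the remaining layers as a feature extractor; the input size and weight identifier are assumptions, and the weights argument differs across torchvision versions.

```python
import torch
import torchvision

resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
# Drop the final average-pool and fully connected layers; what remains is
# the backbone that detection heads are attached to.
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

x = torch.randn(1, 3, 640, 640)          # dummy input batch
features = backbone(x)                   # high-level feature map
print(features.shape)                    # torch.Size([1, 2048, 20, 20])
```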
4.2. Two-stage detectors
Broadly, detectors are divided into one-stage detectors and two-stage detectors. In this section, two-stage detector
models are discussed in detail below.
4.2.1. R-CNN
One of the most significant drawbacks of the conventional sliding-window method is that it examines every available portion of the image. Within the image, the object of interest might appear at multiple spatial positions and with different aspect ratios, necessitating the selection and processing of a large number of regions and hence increasing the processing time. R-CNN (Girshick et al., 2014) resolves this issue by employing a selective search approach. This technique generates 2,000 region proposals, a step commonly known as "region extraction". A 4096-dimensional feature vector is generated for each proposal by warping the region into a square and forwarding it to a convolutional neural network. These features are passed to an SVM, which categorises the regions at the final stage. A regression technique is also used to refine the bounding boxes of the categorized objects detected in the image. The drawbacks of this method include the fact that it takes a long time to train, because 2,000 regions must be classified for each image. It is difficult to use in real time because each image takes around 47 seconds to process.
4.2.2. Fast R-CNN
Fast R-CNN (Girshick, 2015) addresses the issues of R-CNN and develops a significantly faster algorithm. The steps are similar to R-CNN, but instead of recomputing features for each region, the image is sent directly to a CNN, which generates feature maps. The region proposals are detected and, using the Region-Of-Interest (ROI) pooling layer, warped into squares of a fixed size on this convolutional feature map before being transmitted to the fully connected layer. A softmax layer is used to predict the class and bounding box from the ROI feature vector. This method is considerably faster than R-CNN, since the CNN does not process 2,000 proposed regions separately.
4.2.3. Faster R-CNN
In the Faster R-CNN (Ren et al., 2015) configuration, there are two phases. In the first stage, the feature map of the original image is created using a feature extractor (VGG, ResNet, ResNet-V2, Inception, etc.). The feature map from a chosen intermediate convolutional layer is used by the Region Proposal Network (RPN) to predict proposal areas with objectness scores and locations; the objectness score approximates the probability that a region contains an object. Box regressions are also performed for each of the proposals, using a robust loss function.

The second step uses ROI pooling to crop features from the same intermediate feature map at the position of each proposed area. The regional feature map for each proposed region is given to the remainder of the network to predict the class-specific score and refine the box location. Using this technique, it is possible to skip entering each proposed region into the front-end CNN in order to calculate the regional feature map; however, each proposed region must still be processed individually by the remainder of the network. As a result, the speed of detection is proportional to the number of RPN proposal areas. The architecture of the Faster R-CNN is shown in Fig. 12.
Figure 12: Architecture of Faster R-CNN (Ren et al., 2015).
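For reference, torchvision provides a ready-made Faster R-CNN with a ResNet-50 FPN backbone; the sketch below runs inference on a dummy image. The model is pretrained on COCO, which contains no weapon class, so for weapon detection the heads would be retrained on one of the datasets from Section 2.

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights="DEFAULT")                   # COCO-pretrained (torchvision >= 0.13)
model.eval()

image = torch.rand(3, 480, 640)          # dummy RGB image scaled to [0, 1]
with torch.no_grad():
    output = model([image])[0]           # dict with boxes, labels, scores
print(output["boxes"].shape, output["scores"][:5])
```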
4.2.4. Feature Pyramid Network (FPN)
The dilemma of detecting objects across scales was addressed by the Feature Pyramid Network (FPN) (Lin et al., 2017), which recognised that bottom-level feature maps contain spatial information rather than semantic information, whereas later layers of a deep neural network contain high-level semantic information rather than spatial information. FPN augments the CNN structure with a bottom-up and a top-down pathway of wide extent. In the bottom-up section, a CNN processes the input image and pooling layers reduce the size of the feature maps; in the top-down section, the extracted features are up-sampled to the same sizes as in the bottom-up section. FPN creates integrated image features that boost detection performance substantially, mainly for small objects.
4.3. One-stage detectors
4.3.1. SSD
SSD (Liu et al., 2016) is a one-stage object detection network that uses a single forward CNN to predict object class and position. SSD set performance standards in terms of speed and accuracy for object detection tasks, achieving over 74% mean Average Precision (mAP) at 59 fps on standard datasets. The basic architecture of SSD is shown in Fig. 13.
Figure 13: Architecture of SSD (Liu et al., 2016).
In general, the SSD consists of three sections:

1. The base convolutional layers, drawn from ResNet, ResNetV2, VGG, Inception, or other feature extraction networks. The intermediate convolutional layers create large-scale feature maps, which split the receptive field into a large number of small cells, assisting in the identification of small objects.
2. Extra convolutional layers linked to the last layer of the base convolutional network, producing additional multi-scale feature maps.
3. A prediction convolutional layer employing a tiny convolutional kernel to predict bounding box locations and confidences for several categories.
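torchvision likewise ships an SSD300 built on the VGG-16 base network; the sketch below mirrors the Faster R-CNN example above, with the same caveat that the COCO-pretrained heads would need retraining for weapon classes.

```python
import torch
import torchvision

model = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT")
model.eval()

image = torch.rand(3, 300, 300)          # SSD300 expects roughly 300 x 300
with torch.no_grad():
    output = model([image])[0]
print(output["boxes"].shape)
```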
4.3.2. YOLO
YOLO (Redmon et al., 2016) focused on speeding up object detection by treating detection as a regression problem, thereby dropping the region proposal step entirely. It divides the input image into a 7 x 7 grid of cells, with each cell being used to estimate whether the centre of an object lies within it, rather than using pre-defined anchors for object parts. Each cell predicts bounding box locations, class probabilities, and confidence scores for each bounding box. YOLO is a real-time object detector that can detect objects at a rate of 45 fps, which is extremely fast in comparison to previous object detection models. On the other hand, only one set of class probabilities is estimated within each cell: YOLO cannot handle a large number of ground-truth objects, does not work well with objects that are only partially localized in one cell, and has poor localization accuracy due to bounding box sizes and proportions. On the COCO dataset, YOLO produces a mAP of around 54.30%.
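A minimal sketch of the per-cell decoding described above is given below; the 448 x 448 input size matches the original YOLO paper, while the particular offset values are made-up examples.

```python
# Convert one YOLO cell prediction into image-space box corners. (tx, ty)
# are the box-centre offsets within the cell, in [0, 1]; (tw, th) are the
# box width and height as fractions of the whole image.
def decode_cell(row, col, tx, ty, tw, th, grid=7, img_w=448, img_h=448):
    cx = (col + tx) / grid * img_w       # box centre x in pixels
    cy = (row + ty) / grid * img_h       # box centre y in pixels
    w, h = tw * img_w, th * img_h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(decode_cell(3, 4, 0.5, 0.5, 0.2, 0.3))   # (x1, y1, x2, y2)
```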
4.3.3. YOLOv2
YOLOv2 (Redmon and Farhadi, 2017) proposes several improvements to the first version of YOLO. The fully connected layers are eliminated, and the anchor box approach is used to forecast bounding boxes and improve recall. Unsupervised learning (clustering over the training data) is used to construct the bounding box sizes and proportions directly. The bounding box prediction forecasts the position relative to the top-left corner of the cell, constraining predictions to lie between 0 and 1. Batch normalisation, high-resolution classification, and multi-resolution training are among the other strategies offered by this version. All of these strategies significantly increase detection accuracy while maintaining high speed.
4.3.4. YOLOv3
In order to keep translation variance low, SSD chooses early layers to create large-scale feature maps specifically used to find smaller objects. However, the feature maps generated by early layers are not semantically rich enough, resulting in poor performance on smaller objects. To address these issues, YOLOv3 (Redmon and Farhadi, 2018) enhances detection accuracy by adopting the notion of the residual network. It is a one-stage method that also works efficiently with respect to detection speed. The architecture of YOLOv3 is depicted in depth in Fig. 14. Using Darknet-53 without its last three layers, it generates a small-scale feature map whose resolution is 32 times lower than that of the original image; this small-scale feature map is employed to detect big objects. The small-scale feature map is then up-sampled and concatenated with the feature maps generated by earlier layers, producing a large-scale feature map, as opposed to SSD choosing the early layers to build large-scale feature maps. For the detection of small-sized objects, this large-scale feature map, combining position information from earlier layers and complicated features from deeper levels, is employed. The feature map scales are 8, 16, and 32 times down-sampled compared to the original image, respectively. Whereas softmax predicts a single-label classification, YOLOv3 predicts a multi-label classification for each bounding box by using separate sigmoid functions.
Figure 14: YOLOv3 architecture (Redmon and Farhadi, 2018).
4.3.5. YOLOv4
YOLOv4 (Bochkovskiy et al., 2020) is an improved version of YOLOv3. It splits images into regions and computes the probabilities and bounding boxes for each region using a single neural network over the entire image. Bag-of-Specials (BoS) and Bag-of-Freebies (BoF) are two distinct packages used in the model's backbone to improve performance with Cross Stage Partial Darknet-53. The trade-off between these two factors affects the performance efficiency: BoS increases inference cost by a minimal amount while considerably enhancing object detection accuracy, whereas BoF only raises the cost of training while keeping the cost of inference low. The neck of YOLOv4 has improved Spatial Pyramid Pooling, which creates a fixed-size output independent of input size.
5. State-of-the-Art for Weapon Detection Methods
Weapon detection has emerged as a captivating topic in the field of object detection. Various systems have been developed for detecting weapons like pistols, rifles, and knives, each with its own set of advantages and limitations. Misclassification, intra-class detection, dynamic backgrounds, occlusion, and varying illumination are amongst the major issues which make the identification of handguns, knives, and other weapons difficult. A thorough examination of several techniques has been carried out, covering weapon detection systems from the earliest to the most recent models, as shown in Fig. 15.
Figure 15: A classification of computer vision algorithms for detecting weapons.
In the literature, several algorithms have been suggested for the detection task, such as the Harris interest point detector (Harris et al., 1988), SIFT (Lowe, 1999), and SURF (Bay et al., 2006), as well as deep learning techniques like Faster R-CNN (Ren et al., 2015), SSD (Liu et al., 2016), and YOLO (Redmon et al., 2016). The generalized weapon detection model is shown in Fig. 16.
Figure 16: Basic model of computer vision-based weapon detection system.
5.1. Weapon Detection using Classical Machine Learning Methods
Żywicki et al. (2011) applied Haar Cascades to identify hazardous items such as knives. Positive and negative sample images were used in the training phase to demonstrate the presence and absence of the target item, respectively; a total of 1,560 positive and 6,518 negative samples were used. The positive sample criteria included information about angle, illumination, dynamic background, knife in hand, variety of blades, and varied grips. To improve the performance, three training sets were constructed in the experiment, among which the third training set produced the best result. However, the results were not satisfying, as the true positive rate was 46%, which is a relatively small score for detection in real-time.
Glowacz et al. (2015) introduced an AAM-based method for detecting objects such as knives. The goal was to determine whether or not a knife could be seen in a given image. They utilised the Harris corner detection method (Harris et al., 1988) to identify the tip of the knife. The number of discovered corners was determined on the basis of a pre-defined threshold. All knife tips were identified at the lowest threshold of 204, and across all images, the mean value of the threshold at which the knife tip is tagged as a corner was 217. The overall classification accuracy of this model was 92.50%. However, as AAMs are not invariant to rotation, the method works only if the tip of the knife is visible in the images. Kmieć et al. (2012) introduced an approach employing the Harris corner detection technique and an AAM initialised with shape-specific interest points. The model failed to recognize the knife in three images out of 40 positive images. This method also only works when the tip of the knife is visible in the image.
Tiwari and Verma (2015a) used the Harris Interest Point Detector (HIPD) and Fast Retina Keypoint (FREAK) to develop a new approach for detecting firearms. The hybrid technique utilised both concepts: colour-based segmentation to eliminate irrelevant regions or colours from the image, and the Harris Interest Point Detector with FREAK to detect the gun. For colour-based segmentation, the K-means clustering method was used. Morphological processing is applied to each image in order to extract boundaries and close tiny gaps. To discover the resemblance with a gun, interest point features of the object boundary were extracted and compared to a stored description; when the similarity score exceeds 50%, the system raises a warning. The model was assessed in terms of accuracy after testing it on various sessions as well as negative images, and achieved an overall accuracy of 84.26%.

Later on, Tiwari and Verma (2015b) enhanced their work by proposing a technique for detecting firearms using SURF. The extracted features of the object boundary were compared to the stored descriptions to find the resemblance with the gun, and the system raises a warning when the resemblance is greater than 50%. The authors also discussed several challenges, such as gun rotation, orientation, and variation, as well as light, shadow, noise, real-time processing power, information loss owing to the 3D-to-2D transformation, partial or complete occlusion of the gun, and deformation. Following that, morphological closure and boundary extraction were performed, resulting in an image that displays the general structure of the item, with the interior details covered by a rectangular box. Although SURF feature extraction is not faster than simpler techniques like the Harris detector, it can handle images irrespective of scale, orientation, or other characteristics. SURF first finds interest points (such as corners and blobs) and then uses the Hessian matrix to produce descriptors for each. Finally, a similarity score was calculated between the stored description of the gun and that of the blob; the SURF characteristics of an object border are utilised to compare the shapes of items. A total of 25 pictures were utilised, out of which 15 were positive samples. Overall, the true positive rate of the model was 86.67%. However, since these systems were time-consuming and complicated, they could not be used for real-time weapon detection.
The comparative analysis in terms of true positive rate between classical machine learning methods is shown in Fig. 17. It can be explicitly concluded from the graph that AAMs are an efficient method compared to the others, having a true positive rate of 92.50%.
Figure 17: Performance analysis between classical machine learning methods.
A detailed description of the work based on classical machine learning methods is summarised in Table 4, which details the specifics of the datasets and the outcomes that were obtained. From the results, it can be concluded that the approaches utilized by Kmieć et al. (2012) and Glowacz et al. (2015) resulted in the highest true positive rates.
Table 4: Detailed results based on classical machine learning methods.

| Publication | # Images in positive test set | # Correctly classified positive images | True positive rate | # Images in negative test set | # Misclassified negative images | False positive rate | Classification accuracy |
|---|---|---|---|---|---|---|---|
| Żywicki et al. (2011) | 1,560 | - | 46.00% | 6,518 | - | - | - |
| Kmieć et al. (2012) | 40 | 37 | 92.50% | 40 | 0 | 0% | 92.50% |
| Tiwari and Verma (2015a) | 65 | 54 | 83.07% | 24 | 3 | 8.33% | 84.26% |
| Glowacz et al. (2015) | 40 | 37 | 92.50% | 40 | 0 | 0% | 92.50% |
| Tiwari and Verma (2015b) | 15 | 13 | 86.67% | 10 | 0 | 0% | 86.67% |
Classical machine learning methods need much more human interaction to produce results. These systems also have problems with the representativeness of their databases, in which guns occupy the majority of the picture; this does not accurately reflect real-life scenes involving a handgun. As a result, these systems are not suitable for continuous monitoring in situations where the images retrieved from CCTV recordings are complicated owing to various variables, or where there are open regions with a large number of objects. In such complicated scenarios, these conventional methods fail to provide good accuracy for weapon detection.

In traditional machine learning methodologies, the bulk of the applicable features must be defined by a domain expert in order to minimize computational complexity and make patterns more transparent for learning techniques to work efficiently. The major advantage of deep learning algorithms is that they learn high-level features from data in an incremental manner. This reduces the feature extraction complexity and lowers the need for domain expertise in real-time applications.
5.2. Weapon Detection using Two-stage Deep Learning Methods
The most often used measurements in computer vision are True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). When employing classifiers to identify weapons, TP refers to the number of images accurately labeled as positive, where a positive indicates the existence of a weapon in the input image. FP refers to a negative image wrongly labeled as positive, TN denotes a properly classified negative image, and FN denotes the number of positive instances missed by the classifier. The performance parameters, namely accuracy, precision, recall, and F1 score, are measured by Eq. 6, Eq. 7, Eq. 8, and Eq. 9, respectively, as follows:

accuracy = \frac{TP + TN}{TP + FP + FN + TN}    (6)

precision = \frac{TP}{TP + FP}    (7)

recall = \frac{TP}{TP + FN}    (8)

F1\,score = \frac{2 \cdot precision \cdot recall}{precision + recall}    (9)
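These four equations translate directly into code; the example below reproduces the Tiwari and Verma (2015a) row of Table 4 (54 TP, 3 FP, 21 TN, 11 FN), which yields the reported accuracy of roughly 84.3%.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 score (Eqs. 6-9)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# 65 positive test images with 54 correct, 24 negative with 3 misclassified.
print(detection_metrics(tp=54, fp=3, tn=21, fn=11))
```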
Olmos et al. (2018) designed an automated method for the detection of handguns to facilitate monitoring and control systems. The authors reframed the problem of weapon detection as a problem of reducing false positives, and approached a solution by (i) building a key training dataset guided by the results of a DCNN classifier, and (ii) comparing two approaches, namely the sliding window approach and the region proposal approach, to determine the best classification model. The dataset used in the experiment contains 3,000 images of short guns with rich backdrop detail. The Faster R-CNN model with VGG-16 as the feature extractor produced the most promising results. The automatic alarm system activates the alarm after five consecutive true positives; among the thirty tested scenes, the model correctly identified the gun in 27. They also established a metric called Alarm Activation Time per Interval (AATpI) to assess the performance of the detection model, with an average time interval of AATpI = 0.2 s. In the remaining three scenes, the detector was incapable of detecting handguns due to factors such as low contrast and poor brightness of the frames, fast movements of the gun, or the gun not being visible in the forefront of the image. The precision score of this model was recorded at 84.21%, with an F1 score of 91.43%. As illustrated in Fig. 18, the architecture of VGG-16, which is used as the feature extractor, contains 13 convolution layers and 3 fully connected layers.
Figure 18: Architecture of VGG-16 used in Olmos et al. (2018).
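The consecutive-detection rule used by this alarm system can be sketched as follows; the implementation details (frame-by-frame booleans, k = 5) are our reading of the description above rather than the authors' published code.

```python
def alarm_triggered(frame_detections, k=5):
    """Raise the alarm after k consecutive frames with a positive detection,
    which suppresses isolated false positives."""
    streak = 0
    for detected in frame_detections:
        streak = streak + 1 if detected else 0
        if streak >= k:
            return True
    return False

print(alarm_triggered([1, 1, 0, 1, 1, 1, 1, 1]))   # True: five in a row
```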
Verma and Dhillon (2017) utilized transfer learning to identify guns using a deep convolutional network and a state-of-the-art feature-region-based CNN model. As a feature extractor, the system uses a CNN-based VGG-16 architecture followed by state-of-the-art classifiers trained on a typical gun database. The performance of the model was evaluated in a variety of situations, including different backgrounds with firearms, occlusion, and so on. The results show that SVM (Hearst et al., 1998) outperforms other classifiers with a classification accuracy of 92.60%, while the total accuracy was 93.10%. However, the model was trained on a single CPU, which meant that training time was a major concern.
Gelana and Yadav (2019) proposed an image processing and machine learning-based weapon identification model. Their model consisted of six main elements: (i) RGB to gray-scale conversion, used to reduce the complexity of each frame and speed up the background subtraction process; (ii) background subtraction, for which three alternative techniques were used: the visual background extractor (Barnich and Van Droogenbroeck, 2011), the improved Gaussian mixture model (Zivkovic, 2004), and the frame-difference background subtraction algorithm; (iii) filtering, in which dilation and erosion were applied to the extracted foreground object to eliminate tiny white noise caused by illumination fluctuations and to connect disjoint parts of an image; (iv) segmentation/edge detection, for which the well-known Canny edge detection method (Canny, 1986) was employed, taking the filtered foreground object as input and outputting the edge information; (v) a sliding window approach that substantially reduces the area evaluated by the learning algorithm, with the window size and slide step chosen after several tests and subject to alteration in the future; and (vi) a TensorFlow-based CNN used to classify an item as either a threat (gun) or a non-threat (non-gun). After applying a 30% split to the CNN training-testing dataset, 4,000 negative and 1,869 positive images comprised the training frames, and 585 positive and 1,173 negative images among 1,758 images were used to test the algorithm. The most essential requirement in weapon detection was to reduce the frequency of false positives while maintaining detection sensitivity. The approach described in this work had a specificity of 99.73% for images containing non-gun items and a detection accuracy of 93.84% for images including gun objects.
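Stages (i)-(iv) of this pipeline map directly onto standard OpenCV primitives, as sketched below with a hypothetical video path; the improved Gaussian mixture model of Zivkovic (2004) corresponds to OpenCV's MOG2 subtractor, and stages (v)-(vi), the sliding window and CNN classifier, are omitted here.

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2()
cap = cv2.VideoCapture("cctv_feed.mp4")              # hypothetical path

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # (i) gray-scale
    mask = subtractor.apply(gray)                    # (ii) subtraction
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    mask = cv2.dilate(cv2.erode(mask, kernel), kernel)   # (iii) filtering
    edges = cv2.Canny(mask, 100, 200)                # (iv) edge detection
    # (v)-(vi): sliding windows over `edges` would feed the CNN classifier.
cap.release()
```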
Castillo et al. (2019) developed an automated cold steel weapon identification model for video surveillance based on a new brightness-directed preprocessing technique termed Darkening and Contrast at Learning and Test stages (DaCoLT) that enhances detection quality. The Faster R-CNN with Inception-ResNet-V2 (Szegedy et al., 2017) was the most accurate model, with an F1 score of 95%. However, with a frame rate of 1.3 frames per second, it was not suited for near-real-time operation.
Olmos et al. (2019) proposed a unique binocular image fusion technique for reducing the frequency of false positives in the identification of firearms in surveillance videos. They used the dataset of 3,000 weapon images created by Olmos et al. (2018) for training and testing purposes, and compared the performance of Faster R-CNN with and without image fusion using four feature extractors, i.e., VGG-16, ResNet, Inception-ResNetV2, and Neural Architecture Search (NAS). Faster R-CNN (VGG-16 + ImageNet) achieved much greater accuracy, precision, recall, and F1 score compared to the previous existing methods, with the overall highest accuracy of 80.62%. However, the most frequent cameras in CCTV systems are not dual cameras, so this method would not be appropriate for most retail establishments.
Pérez-Hernández et al. (2020) proposed a method utilising a binarization approach to improve the robustness, precision, and reliability of small item recognition. To enhance detection accuracy in videos, they recommended adopting a two-level deep learning-based approach called Object Detection using Binary Classifiers, in which the first level selects potential areas from the input frame, while the second level employs a CNN classifier based on One-Versus-All (OVA) and One-Versus-One (OVO) binarization techniques. A firearm, a knife, a smartphone, a bill, a purse, and a card were used to create the database. The experimental study shows that the suggested technique reduces the incidence of false positives when compared to the baseline multi-class detection model. However, because this model was complicated and time-intensive, it could not be used to identify guns in real time. The dataset collection had 560 images in total. As indicated in Table 5, the OVO model observed the highest precision of 93.87%.
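The binarization idea itself is easy to reproduce with scikit-learn's One-Versus-One and One-Versus-Rest wrappers; the sketch below decomposes a six-class problem (firearm, knife, smartphone, bill, purse, card) into binary classifiers over precomputed CNN features. The feature matrix here is a random placeholder, not the authors' data.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 512))      # placeholder CNN features
y = rng.integers(0, 6, size=120)     # 6 classes: firearm, knife, phone, ...

# OVO trains one binary classifier per class pair: 6*5/2 = 15 models.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
# OVA (called One-Vs-Rest in scikit-learn) trains 6 class-vs-rest models.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# 15 and 6, respectively, when all six classes appear in the sample.
print(len(ovo.estimators_), len(ova.estimators_))
```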
Iqbal et al. (2021) proposed a weakly supervised Orientation Aware Object Detection (OAOD) approach that trains on Axis-Aligned Bounding Boxes (AABB) while learning to recognize Oriented Bounding Boxes (OBB). The proposed OAOD differs from previous oriented object detectors in that it does not require OBB annotations during training, which may or may not be available at any given time. To achieve training on AABB and prediction of OBB, a multiphase method was utilised, with Stage-1 estimating AABB and Stage-2 estimating OBB. The ITUF weapon dataset presented by Iqbal et al. (2021) contains 10,973 pictures of firearms and rifles. The ITUF dataset was used to compare the OAOD technique with other state-of-the-art techniques, including fully supervised oriented object detectors. The overall obtained mAP on AABB was 88.30% and the mAP on OBB was 77.50%. However, because the model was computationally expensive and the mAP was quite low, it could not be used for real-time gun detection.
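The relationship between the two box types is simple in one direction: an axis-aligned box is just the bounding rectangle of the oriented box's corners, which is why AABB ground truth is cheap to derive while recovering an OBB from an AABB (the direction OAOD learns) is the hard one. A small sketch:

```python
import numpy as np

def obb_to_aabb(corners):
    """corners: (4, 2) array of an oriented box's (x, y) vertices.
    Returns the enclosing axis-aligned box as (x_min, y_min, x_max, y_max)."""
    xs, ys = corners[:, 0], corners[:, 1]
    return xs.min(), ys.min(), xs.max(), ys.max()

# A rifle box rotated roughly 30 degrees; its AABB is noticeably looser,
# which is why AABB supervision alone under-constrains orientation.
obb = np.array([[10, 40], [96, -10], [116, 24], [30, 74]], float)
print(obb_to_aabb(obb))  # (10.0, -10.0, 116.0, 74.0)
```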
González et al. (2020) used Faster R-CNN with an FPN and a ResNet-50 backbone on a new dataset collected from a genuine CCTV system installed on a university campus. Further, they developed synthetic images to be employed in quasi-real-time CCTV. The FPN architecture achieved an accuracy score of 88.12%. However, the synthetic dataset could not be utilised for training or testing purposes because it did not provide satisfactory results.
Kaya et al. (2021) presented a novel deep learning-based model that utilizes VGG-16, ResNet-101, ResNet-50, and a proposed seven-layer CNN to detect seven distinct weapon types. The 5,214 weapon images are split into seven categories: assault rifles, knives, bazookas, hunting rifles, pistols, grenades, and revolvers. The system was developed with a total of 3,128 images for training and 1,043 images for validation and testing. The proposed model was found to be 98.40% accurate. However, the system was capable of identifying only a few types of weapons, and although the intended method was not very complicated, it was computationally very slow.
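As a rough illustration of what a compact seven-class weapon classifier of this kind can look like, a minimal Keras sketch follows. It is not the authors' architecture; the layer widths, input size, and dropout rate are illustrative assumptions.

```python
from tensorflow.keras import layers, models

# A compact CNN for 7 weapon classes (assault rifle, knife, bazooka,
# hunting rifle, pistol, grenade, revolver). Layer sizes are illustrative.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```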
Galab et al. (2021) demonstrated how to improve knife detection under varying brightness in surveillance systems using an adaptive method. Based on the preprocessing Brightness Handler procedure (BHp), they compared four CNN architectures: AlexNet, VGGNet, GoogLeNet, and ResNet. AlexNet with BHp produced excellent outcomes with 96.95% accuracy. AlexNet is an early CNN with five convolutional layers and performs quite slowly in contrast to current CNN models. The model took images of size 227 x 227 pixels, implying that the weapon must cover the majority of the image.
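Brightness-directed preprocessing of the DaCoLT/BHp kind can be approximated with a simple linear brightness-contrast transform, applied as darkening augmentation at training time and as normalization at test time. The gains and offsets below are illustrative assumptions, not the published procedures.

```python
import cv2
import numpy as np

def adjust_brightness_contrast(image, alpha=1.0, beta=0):
    """Linear transform out = alpha * image + beta, clipped to [0, 255]."""
    return cv2.convertScaleAbs(image, alpha=alpha, beta=beta)

def darken_augment(image, rng):
    """Training-time darkening: a random gain < 1 simulates dim scenes."""
    alpha = rng.uniform(0.4, 1.0)   # contrast gain (illustrative range)
    beta = rng.uniform(-40, 0)      # brightness offset (illustrative range)
    return adjust_brightness_contrast(image, alpha, beta)

rng = np.random.default_rng(0)
frame = np.random.randint(0, 255, (224, 224, 3), np.uint8)  # placeholder
dark = darken_augment(frame, rng)
```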
Fig. 19 shows the performance analysis of two-stage deep learning methods in terms of accuracy, precision, recall, and F1 score. It may be inferred from the graph that the model developed by Kaya et al. (2021) observed the best accuracy, 98.40%, compared to all other methods, whereas Galab et al. (2021) secured the highest F1 score, 98.42%. The models developed by Castillo et al. (2019) and Galab et al. (2021) have the highest precision values. On the other hand, the models developed by Olmos et al. (2018) and González et al. (2020) observed the highest recall values.
Figure 19: Comparative analysis of performance of two-stage deep learning methods.
Table 5 shows a detailed comparative analysis of two-stage deep learning methods' outcomes in terms of accuracy, precision, recall, and F1 score.
Table 5: Comparison of detection results based on two-stage deep learning methods.

Authors | Data Specifications | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%)
Verma and Dhillon (2017) | - | 93.10 | - | - | -
Olmos et al. (2018) | 3,000 | - | 84.21 | 100 | 91.43
Castillo et al. (2019) | 19,379 | - | 100 | 78.55 | 87.44
Gelana and Yadav (2019) | 5,869 | 97.78 | 99.45 | 94.21 | 96.76
Olmos et al. (2019) | 3,000 | 80.62 | 92.68 | 80 | 85.88
Pérez-Hernández et al. (2020) | 5,680 | - | 93.87 | 93.09 | 93.43
González et al. (2020) | 7,649 | - | 88.12 | 100 | 93.68
Kaya et al. (2021) | - | 98.40 | 99.28 | 95.97 | 92.89
Iqbal et al. (2021) | 10,973 | - | 88.30 | - | -
Galab et al. (2021) | 12,899 | 96.95 | 100 | 96.80 | 98.42
5.3. Weapon Detection using One-stage Deep Learning Methods
Narejo et al. (2021) created a smart surveillance security system that identifies weapons, especially firearms. For that purpose, they trained the YOLOv3 model with a Darknet-53 backbone. They manually collected photos from Google, with approximately 50 pictures for each weapon class. The overall accuracy of the model was 98.89%, but precision and F1 score were not measured; the total number of images in the collection was also not provided.
Romero and Salamea (2019) developed a system to resolve existing issues and divided its operation into two halves: a front end in charge of limiting the area of interest, and a back end in charge of detecting the weapon within the front end's output. The authors created a database comprising 17,684 images taken from various movies, with firearms (class A) and without firearms (class B). Using various augmentation approaches (rotating and flipping), the image dataset was additionally expanded by 229,892 images (from 17,684 to 247,576). The authors employed YOLO for real-time object recognition and localization in the front end. YOLO was trained on the COCO dataset to recognize people while ignoring the rest of the image, which reduces the complexity of the system and, hence, the probability of false positives. The VGG-Net and ZFNet models were used to identify weapons. Grayscale pictures were also useful in enhancing the efficiency of the system. If the individual in the bounding box does not have a weapon, the bounding box is eliminated, narrowing the area of concern. The overall performance of the system was observed as 86% recall and 90.80% accuracy. Fig. 20 depicts the operation of the weapon-detecting system proposed by Romero and Salamea (2019).
Figure 20: System architecture of Romero and Salamea (2019).
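The front-end/back-end split reduces to: detect people first, then run a weapon classifier only on the person crops. The minimal sketch below shows this control flow; the detector and classifier bodies are deliberate placeholders standing in for the COCO-trained YOLO front end and the VGG-Net/ZFNet back end.

```python
import numpy as np

def detect_people(frame):
    """Front end: stand-in for a COCO-trained YOLO person detector.
    Returns a list of (x, y, w, h) boxes around detected people."""
    h, w = frame.shape[:2]
    return [(w // 4, h // 4, w // 4, h // 2)]  # placeholder detection

def classify_weapon(crop):
    """Back end: stand-in for the VGG-Net/ZFNet weapon classifier.
    Returns True if the crop is judged to contain a firearm."""
    return crop.mean() > 127  # placeholder decision

frame = np.random.randint(0, 255, (480, 640), np.uint8)  # grayscale frame
alerts = []
for (x, y, w, h) in detect_people(frame):
    crop = frame[y:y + h, x:x + w]
    # Bounding boxes without a weapon are discarded, shrinking the
    # region of concern and the false-positive surface.
    if classify_weapon(crop):
        alerts.append((x, y, w, h))
```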
Cardoso et al. (2019) used the YOLO object detector, a CNN-based approach, to detect guns in images. The idea was tested on a database of 608 images, including 304 weapon images. Experiments observed an accuracy of 89.15%. However, the number of images in the dataset was very low, hence it was found infeasible for real-time detection.
Jain et al. (2020) used SSD and Faster R-CNN algorithms to develop automated weapon identification using CNNs. Faster R-CNN observed the better accuracy, 84.60%, whereas SSD achieved an accuracy of 73.80%. Owing to its higher speed, SSD is closer to real-time detection, but Faster R-CNN observed higher accuracy. Even in a system where a person in charge double-checks every gun detection alert, the reported throughputs of 0.73 fps (SSD) and 1.606 fps (Faster R-CNN) are too slow for real-time detection.
Salido et al. (2021) compared three CNN models for automatic identification of pistols in video surveillance. The goal was to see whether integrating pose information on the way firearms are held into the training dataset would reduce false positives. The findings showed that RetinaNet fine-tuned with the unfrozen ResNet-50 backbone had the greatest average precision of 96.36% and recall of 97.23%, while YOLOv3 had the highest accuracy (96.23%) and F1 score (93.36%) when trained on the dataset with pose information. Using YOLOv3, the numbers of false positives and false negatives were 8 and 21, respectively, which is quite high given the small dataset and low-resolution images.
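The frozen-versus-unfrozen distinction behind RetinaNet's best result is easy to reproduce on any Keras backbone: freezing trains only the new head, while unfreezing also updates backbone weights at a smaller learning rate. A generic sketch, with a ResNet-50 classification head standing in for the detector:

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import ResNet50

backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")
head = models.Sequential([backbone,
                          layers.Dense(1, activation="sigmoid")])

# Phase 1: frozen backbone -- only the new head is trained.
backbone.trainable = False
head.compile(optimizer=optimizers.Adam(1e-3), loss="binary_crossentropy")

# Phase 2: unfrozen backbone -- fine-tune everything at a lower rate,
# the setting that gave RetinaNet its best precision/recall above.
backbone.trainable = True
head.compile(optimizer=optimizers.Adam(1e-5), loss="binary_crossentropy")
```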
Singh et al. (2021) presented a computer vision-based method for identifying firearms using YOLOv4. Images of knives, swords, pistols, machine guns, shotguns, and other weapons were included in the dataset used to train the model, combined into a single weapon class. The model achieved a mean Average Precision (mAP) of 77.75% and an average loss of 1.314. However, mAP alone is an insufficient measure of real-time weapon identification performance.
Sliding window/classification and region proposal/object detection were the two methodologies used by Bhatti et al. (2020). The algorithms employed included VGG-16, Faster R-CNN, Inception-ResNetV2, SSD, MobileNetV1, Inception-V3, YOLOv3, and YOLOv4. A total of 8,327 images comprising pistol and non-pistol classes were used, collected from various sources; 7,328 images were utilised for training and another 999 for testing. YOLOv4 outperformed all other algorithms, obtaining an F1 score of 91% and a mean average precision of 91.73%. The numbers of false positives and false negatives were still relatively high, at 54 and 52, respectively.
Lamas et al. (2022) presented a reproducible and traceable top-down weapon detection over pose estimation methodology that exploits the human presence in scenarios where a person carries a weapon, firearm, or knife. Two types of detection architecture were used among the four selected detection models for evaluating the approach: Faster R-CNN, a two-stage detector based on ResNet-101, and various one-stage detectors, namely SSD based on ResNet-50, EfficientDet-D3 (Tompson et al., 2015), and CenterNet (Duan et al., 2019). All deep learning architectures were trained on the Sohas weapon dataset, on which EfficientDet outperformed the others with a precision score of 94.4%. However, this method was able to detect human-handled weapons only.
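The pose-guided idea can be sketched independently of any particular pose estimator: given wrist keypoints, a crop is taken around each wrist, sized by a factor that grows as the person appears closer to the camera, which is the role of the Adaptive pose factor. The scaling rule below is a simplified stand-in for the published one, and the keypoints are hypothetical.

```python
import numpy as np

def hand_regions(wrists, person_height_px, frame_shape, base=0.6):
    """Crop boxes around wrist keypoints, sized relative to how large
    (i.e., how close) the person appears in the frame."""
    h, w = frame_shape[:2]
    # Simplified adaptive factor: crop size proportional to person height.
    size = int(base * person_height_px)
    boxes = []
    for (x, y) in wrists:
        x0, y0 = max(0, int(x - size / 2)), max(0, int(y - size / 2))
        boxes.append((x0, y0, min(w, x0 + size), min(h, y0 + size)))
    return boxes

# Hypothetical wrist keypoints from a pose estimator for one person.
wrists = np.array([[220.0, 310.0], [365.0, 295.0]])
print(hand_regions(wrists, person_height_px=280, frame_shape=(480, 640)))
```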
Fig. 21 shows the performance analysis of one-stage deep learning methods in terms of accuracy, precision,
recall, and F1 score. According to Fig. 21, the model created by Narejo et al. (2021) observed the greatest accuracy
of 98.89%, while the model developed by Salido et al. (2021) achieved the best precision score of 96.23%.
Figure 21: Analysis of performance of one-stage deep learning methods in terms of accuracy, precision, recall and F1 score.
Fig. 22 shows the overall performance analysis of detection using deep learning methods in terms of accuracy and
precision parameters.
Figure 22: Analysis of performance of detection using deep learning methods in terms of accuracy and precision.
Table 6 presents a thorough comparison of one-stage deep learning models developed by different researchers in terms of accuracy, precision, recall, and F1 score.
Table 6: Comparison of weapon detection results based on one-stage deep learning methods.

Authors | Positive Images | Negative Images | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%)
Romero and Salamea (2019) | 8,843 | 8,841 | 90.80 | 86.00 | 86.00 | 86.00
Cardoso et al. (2019) | 3,000 | 6,857 | - | 89.15 | 100 | 94.26
Jain et al. (2020) | - | - | 84.60 | - | - | -
Narejo et al. (2021) | - | - | 98.89 | - | - | -
Salido et al. (2021) | 1,220 | - | 90.09 | 96.23 | 90.67 | 93.36
Singh et al. (2021) | - | - | - | 77.75 | - | -
Bhatti et al. (2020) | 3,073 | 5,254 | - | 93.00 | 88.00 | 91.00
Lamas et al. (2022) | 3,000 | 14,684 | - | 94.40 | 91.50 | 92.90
6. Conclusion
In the area of security and surveillance, weapon detection is of significant use in computer vision. An automatic weapon detection system that responds quickly in potentially dangerous situations is good for public safety. This literature review attempts to showcase several conventional weapon detection systems using machine learning as well as the most advanced deep learning techniques. The journey began with manually operated systems and progressed to completely automated and sophisticated technologies. In light of this, numerous conventional weapon detection techniques have already been developed, viz. HIPD, AAMs, SIFT, SURF, FREAK, and many more, wherein the AAMs have emerged as preeminent. Although the multitudinous applications of these conventional techniques have been reviewed in the past, none has so far emerged as an effective technique, owing to the imprecision in detecting tiny objects against complex backgrounds and under partial occlusion. Classical methods require manual intervention for extracting features and thus are not very precise for weapon recognition (Krizhevsky et al., 2012). This opens a window for the development of deep learning architectures capable of automatically discovering higher-level features from input images that offer speed, accuracy, and real-time applications, viz. self-driving cars (Maqueda et al., 2018), natural language processing (Worsham and Kalita, 2020), face detection (Zhan et al., 2016), speech recognition (Nassif et al., 2019), text recognition (Roy et al., 2017), and disease diagnosis (Ma et al., 2021; Hu et al., 2018), etc. Additionally, a wide literature is available in the domain of DCNN and transfer learning methods incorporating multiple models (one-stage and two-stage) such as Faster R-CNN, VGG-Net, ZFNet, and YOLOv3. Among the one-stage deep learning methods, YOLOv3 has higher precision and shows better performance in comparison to the others. Among the two-stage methods, the Faster R-CNN architecture observed the highest precision.
In Table 7, we provide the important findings of the study. The following information is included in the table: (a) the name of the method used, (b) the strengths of the method, (c) problems encountered in weapon detection, and (d) the respective publications and comments.
Table 7: An overview of the survey's major results using the classical machine learning approach and the deep learning approach.

Haar Cascades
Strengths: The accuracy of the cascade improves with an increased number of positive and negative sample images.
Issues: The findings are unsatisfactory due to the low true positive rate obtained.
Publications and Remarks: Żywicki et al. (2011) observed that increasing the Haar scale coefficient reduced the frequency of incorrectly detected knives. As previously stated, the obtained results for this cascade are unsatisfactory.

AAMs
Strengths: According to the test results, this technique outperforms other classical machine learning algorithms with a TPR of 92.50%.
Issues: This method is not rotation invariant, and the technique works only if the knife tip is visible in the images.
Publications and Remarks: Glowacz et al. (2015) present AAMs as a weapon detection tool. Later, Kmieć et al. (2012) enhanced these findings by using the Harris corner detection approach in their own work.

HIPD and FREAK
Strengths: The K-means clustering method along with HIPD and FREAK is applied to utilize color-based segmentation, which results in higher accuracy.
Issues: This is only useful if the gun is entirely visible in the scene; the approach fails in the case of a partially visible gun or blurred images.
Publications and Remarks: This approach was adopted by Tiwari and Verma (2015a). Additionally, the similarity score surpasses 50% after the alert mechanism is applied.

SURF
Strengths: The findings of previously used methodologies are improved in terms of accuracy.
Issues: The computational time is too high for real-time detection.
Publications and Remarks: Tiwari and Verma (2015b) used SURF methods for the classification task and improved efficiency.

SVM with VGG-16
Strengths: For feature extraction, VGG-16 is a very effective and commonly used deep learning architecture; it improves the extraction of high-level features from images.
Issues: SVM is a slow and complex classification method, and CNN techniques produce better results than the SVM method.
Publications and Remarks: Gelana and Yadav (2019) used this method with a variety of techniques, such as Canny edge detection and the improved Gaussian mixture model, to achieve results. However, the approach is slow and complicated, which makes it unsuitable for real-time detection.

Faster R-CNN
Strengths: This approach improves the precision of small weapon detection using a two-stage deep learning approach.
Issues: Despite its higher precision, it is a time-consuming, complex, and computationally expensive method.
Publications and Remarks: Various studies employed Faster R-CNN with a variety of backbone networks, including VGG-16 (Olmos et al., 2018), Inception-ResNetV2 (Castillo et al., 2019), FPN with ResNet-50 (González et al., 2020), and others to obtain better results. Instead of a selective search method, it uses an RPN to generate region proposals, which makes it much faster than R-CNN and Fast R-CNN.

SSD-based CNN
Strengths: Localization and classification are completed in a single forward pass across the network, which results in much faster detection.
Issues: It can analyze a video at the rate of 0.75 frames per second. However, it does not perform well on small objects like handguns, because it uses the first convolutional layers to create high-level feature maps.
Publications and Remarks: This approach, along with Faster R-CNN, was suggested by Jain et al. (2020). SSDs deliver much faster performance but achieve less accuracy.

YOLO and its versions
Strengths: YOLOv3 and YOLOv4 are faster than any other deep learning architectures currently available. The method generates high-level feature maps using both earlier and later layers, resulting in better accuracy than SSD. It is the most effective approach for detecting firearms in real time.
Issues: YOLO and YOLOv2 are more effective for the identification of large objects and perform poorly on small objects.
Publications and Remarks: Several researchers (Cardoso et al., 2019; Narejo et al., 2021; Romero and Salamea, 2019) use the YOLO architecture for weapon detection. YOLOv4 (Singh et al., 2021; Bhatti et al., 2020) outperforms other enhanced versions of YOLO.
The two-stage detectors exhibit better accuracy than single-stage detectors, as evidenced by their vast range of real-time applications, but the latter are more cost-effective than the former. One-stage detectors are usually faster than two-stage ones because they use lightweight backbone networks, eliminate preprocessing algorithms, and consider fewer candidate regions for prediction. However, two-stage detectors can run in real time with the introduction of similar techniques. One-stage frameworks perform worse than two-stage architectures such as Faster R-CNN in the detection of small objects, though they offer fair competition in the detection of large objects.
There are still certain challenges in the field of weapon detection that need to be addressed, such as a lack of datasets, the detection of weapons in a variety of lighting conditions, and others. Table 8 provides a more in-depth discussion of these concerns.
Table 8: Several issues of weapon detection systems.

The unavailability of real-time datasets
Comments: The datasets presented in a number of research papers are gathered from internet sources. Only a few available datasets (González et al., 2020) were obtained from closed-circuit television cameras.

Multiple weapon detection systems
Comments: There is still a requirement for multiple weapon detection. This specific problem has only been addressed in a few studies (Verma and Dhillon, 2017; González et al., 2020; Salido et al., 2021).

The partial appearance of the weapon
Comments: Only a few studies have tackled the subject of partial occlusion of weapons. Nonetheless, these are important difficulties for weapon detection.

Weapon detection of different kinds
Comments: This is a significant problem to take into consideration. Only a few works (Olmos et al., 2018; Iqbal et al., 2021; González et al., 2020) are capable of detecting distinct sorts of guns.
Despite the fact that datasets have recently emerged, the lack of large and well-balanced datasets limits the development of deep learning algorithms generalizable enough to be employed in automatic weapon detection systems. As the public datasets originate from a range of machines with different inherent architectures, domain adaptation techniques might help. Deep learning techniques can provide very productive outputs considering all the above-mentioned features, but models based on these techniques are still not at the forefront of real-time applications. The reason is the complex nature of the required training, as a large dataset is needed for computing the output. Moreover, detectors based on deep learning generally contain a high number of parameters and are consequently data-hungry, requiring a powerful computing system for training the developed model.
Additionally, devices may be constructed utilizing these algorithms for automatic weapon detection, notifying security staff when a weapon is detected. Companies and organizations that supply security and surveillance systems would benefit from the implementation of an automated weapon detection system on Internet of Things (IoT) devices, such as smartphones and laptops. Human resource management and the creation of new products or applications are being transformed by machine learning and deep learning. This creates an environment appropriate for deep learning in open innovation and small and medium enterprises (SMEs) (Malo-Perisé and Merseguer, 2022; Alam and Ansari, 2020). According to the findings of the research (Baierle et al., 2020), open innovation characteristics have a significant impact on the competitiveness of manufacturing SMEs in a southern Brazilian area.
7. Future Scope
There is still a long way to go before a single robust deep learning technique is developed. The following future scopes are offered based on the thorough survey discussed in this paper. The following points are inferred for the automated identification of firearms:
i. Requirement of real-time datasets: A dedicated dataset for weapon detection is unavailable; at present, only a few real-time datasets exist. Usually, the datasets are gathered from virtual sources such as movies, games, and others, which raises concerns about the reliability of the data under varying surrounding conditions, viz. illumination conditions, viewing angle, and image resolution. The scarcity of real-time datasets emerges as a major obstacle to the development of automatic weapon detection systems.
ii. A heterogeneous model: The existing methods are not entirely capable of detecting weapons of the same class with various shapes, colours, and complex backgrounds. For example, different sensors are used to capture the images of guns or knives, resulting in different intensity distributions for a single scene, and the same weapon is mapped to different pixel resolutions under different imaging parameters such as the size of the weapon. Thus, the development of a heterogeneous method for a reliable automatic weapon detection system is indispensable.
iii. Constructive use of contextual information: Objects in the visual world have complex relationships, and precise context is essential for comprehending them. Insufficient consideration has been devoted to the appropriate use of contextual information in the object detection field. A guidebook on the precise and successful utilisation of this information might be a potential future avenue for visual software development.
iv. Detection of small objects: Another significant challenge in object detection studies is the identification of small objects such as weapons or knives, which is one of the shortcomings of existing deep learning architectures. As a result, there is potential scope for developing techniques for small-sized objects.
v. Need for low-computing networks: Current networks comprise hundreds of millions of parameters, demanding large amounts of data as well as high-performance graphical processing units (GPUs) for training. This has drawn researchers toward building small and lightweight networks that decrease or eliminate network redundancy, as illustrated by the sketch below. Such models can operate effectively on small devices like smartphones and can be employed with the IoT.
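To give point (v) a concrete shape: much of the parameter reduction in lightweight networks comes from replacing standard convolutions with depthwise separable ones, as the following small comparison illustrates. The tensor shape is an arbitrary assumption chosen for the demonstration.

```python
from tensorflow.keras import layers, models

def param_count(layer_factory):
    """Parameters of a single conv layer applied to a 56x56x128 tensor."""
    model = models.Sequential([layers.Input(shape=(56, 56, 128)),
                               layer_factory()])
    return model.count_params()

standard = param_count(lambda: layers.Conv2D(128, 3, padding="same"))
separable = param_count(lambda: layers.SeparableConv2D(128, 3, padding="same"))
print(standard, separable)  # roughly an 8-9x reduction for this layer
```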
Funding Information
This research did not receive any specific grants from funding agencies in the public, commercial, or non-profit
sectors.
Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could
have appeared to influence the work reported in this paper.
Acknowledgment
The present work has been carried out in the computer laboratory of the Department of Mathematics and Scientific
Computing at the National Institute of Technology, Hamirpur, Himachal Pradesh, India.
References
Ainsworth, T. (2002). Buyer beware. Security Oz, 19, 18-26.
Alam, M. A., & Ansari, K. M. (2020). Open innovation ecosystems: Toward low-cost wind energy startups. Interna-
tional Journal of Energy Sector Management, 14(5), 853-869.
https://doi.org/10.1108/ijesm-07-2019-0010
Baierle, I. C., Benitez, G. B., Nara, E. O., Schaefer, J. L., & Sellitto, M. A. (2020). Influence of open innovation
variables on the competitive edge of small and medium enterprises. Journal of Open Innovation: Technology,
Market, and Complexity, 6(4), 179.
https://doi.org/10.3390/joitmc6040179
Barnich, O., & Van Droogenbroeck, M. (2011). ViBe: A universal background subtraction algorithm for video se-
quences. IEEE Transactions on Image Processing, 20(6), 1709-1724.
https://doi.org/10.1109/tip.2010.2101613
Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. Computer Vision – ECCV 2006, 3951, 404-417.
https://doi.org/10.1007/11744023_32
Bhatti, M. T., Khan, M. G., Aslam, M., & Fiaz, M. J. (2021). Weapon detection in real-time CCTV videos using deep
learning. IEEE Access, 9, 34366-34382.
https://doi.org/10.1109/access.2021.3059170
Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). Yolov4: Optimal speed and accuracy of object detection.
arXiv preprint arXiv:2004.10934.
Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine
Intelligence, PAMI-8(6), 679-698.
https://doi.org/10.1109/tpami.1986.4767851
Cardoso, G. V., Ciarelli, P. M., & Vassallo, R. F. (2019). Use of deep learning for firearms detection in images. Anais do XV Workshop de Visão Computacional (WVC 2019), 109-114.
https://doi.org/10.5753/wvc.2019.7637
Castillo, A., Tabik, S., Pérez, F., Olmos, R., & Herrera, F. (2019). Brightness guided preprocessing for automatic cold steel weapon detection in surveillance videos with deep learning. Neurocomputing, 330, 151-161.
https://doi.org/10.1016/j.neucom.2018.10.076
Cootes, T. F., Edwards, G. J., & Taylor, C. J. (1998). Active appearance models. In European Conference on Computer
Vision, 1407, 484-498.
https://doi.org/10.1007/BFb0054760
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR’05), 1, 886–893.
https://doi.org/10.1109/cvpr.2005.177
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248-255.
https://doi.org/10.1109/cvpr.2009.5206848
Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). CenterNet: Keypoint triplets for object detection.
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 6568-6577.
https://doi.org/10.1109/iccv.2019.00667
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML, 96, 148-156.
10.1.1.380.9055
Galab, M. K., Taha, A., & Zayed, H. H. (2021). Adaptive technique for brightness enhancement of automated knife
detection in surveillance video with deep learning. Arabian Journal for Science and Engineering, 46(4), 4049-
4058.
https://doi.org/10.1007/s13369-021-05401-4
Gelana, F., & Yadav, A. (2018). Firearm detection from surveillance cameras using image processing and machine
learning techniques. Smart Innovations in Communication and Computational Sciences, 851, 25-34.
https://doi.org/10.1007/978-981-13-2414-7_3
Girshick, R. (2015). Fast R-CNN. 2015 IEEE International Conference on Computer Vision (ICCV), 1440–1448.
https://doi.org/10.1109/iccv.2015.169
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and
semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 580–587.
https://doi.org/10.1109/cvpr.2014.81
Glowacz, A., Kmieć, M., & Dziech, A. (2013). Visual detection of knives in security applications using active appearance models. Multimedia Tools and Applications, 74(12), 4253-4267.
https://doi.org/10.1007/s11042-013-1537-2
González, J. L. S., Zaccaro, C., Álvarez-García, J. A., Morillo, L. M. S., & Caparrini, F. S. (2020). Real-time gun detection in CCTV: An open problem. Neural Networks, 132, 297-308.
https://doi.org/10.1016/j.neunet.2020.09.013
Grega, M., Matiolański, A., Guzik, P., & Leszczuk, M. (2016). Automated detection of firearms and knives in a CCTV image. Sensors, 16(1), 47.
https://doi.org/10.3390/s16010047
Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual understanding: A
review. Neurocomputing, 187, 27-48.
https://doi.org/10.1016/j.neucom.2015.09.116
Haralick, R. M., & Shapiro, L. G. (1985). Image segmentation techniques. Computer Vision, Graphics, and Image
Processing, 29(1), 100-132.
https://doi.org/10.1016/s0734-189x(85)90153-7
Harris, C., & Stephens, M. (1988). A combined corner and edge detector. Proceedings of the Alvey Vision Conference, 147-151.
https://doi.org/10.5244/c.2.23
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 770-778.
https://doi.org/10.1109/cvpr.2016.90
Hearst, M., Dumais, S., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent
Systems and their Applications, 13(4), 18-28.
https://doi.org/10.1109/5254.708428
Hu, Z., Tang, J., Wang, Z., Zhang, K., Zhang, L., & Sun, Q. (2018). Deep learning for image-based cancer detection
and diagnosis - A survey. Pattern Recognition, 83, 134-149.
https://doi.org/10.1016/j.patcog.2018.05.014
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected Convolutional networks.
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2261–2269.
https://doi.org/10.1109/cvpr.2017.243
IMFDB: Internet Movie Firearms Database. http://www.imfdb.org/wiki/Main_Page. [Online; accessed 10-Oct-2021].
Iqbal, J., Munir, M. A., Mahmood, A., Ali, A. R., & Ali, M. (2021). Leveraging orientation for weakly supervised
object detection with application to firearm localization. Neurocomputing, 440, 310-320.
https://doi.org/10.1016/j.neucom.2021.01.075
Jain, H., Vikram, A., Mohana, Kashyap, A., & Jain, A. (2020). Weapon detection using artificial intelligence and deep
learning for security applications. 2020 International Conference on Electronics and Sustainable Communication
Systems (ICESC), 193–198.
https://doi.org/10.1109/icesc48915.2020.9155832
Jeong, C. Y., Yang, H. S., & Moon, K. (2018). Fast horizon detection in maritime images using region-of-interest.
International Journal of Distributed Sensor Networks, 14(7), 155014771879075.
https://doi.org/10.1177/1550147718790753
Jin, X., Zhang, Y., & Jin, Q. (2016). Pulmonary nodule detection based on CT images using convolution neural
network. 2016 9th International Symposium on Computational Intelligence and Design (ISCID), 1, 202-204.
https://doi.org/10.1109/iscid.2016.1053
Kaya, V., Tuncer, S., & Baran, A. (2021). Detection and classification of dierent weapon types using deep learning.
Applied Sciences, 11(16), 7535.
https://doi.org/10.3390/app11167535
Kmieć, M., Głowacz, A., & Dziech, A. (2012). Towards robust visual knife detection in images: Active appearance models initialised with shape-specific interest points. Communications in Computer and Information Science, 287, 148-158.
https://doi.org/10.1007/978-3-642-30721-8_15
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097-1105.
https://doi.org/10.1145/3065386
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M.,
Kolesnikov, A., Duerig, T., & Ferrari, V. (2020). The open images dataset V4. International Journal of Computer
Vision, 128(7), 1956-1981.
https://doi.org/10.1007/s11263-020-01316-z
Lamas, A., Tabik, S., Montes, A. C., Pérez-Hernández, F., García, J., Olmos, R., & Herrera, F. (2022). Human pose estimation for mitigating false negatives in weapon detection in video-surveillance. Neurocomputing, 489, 488-503.
https://doi.org/10.1016/j.neucom.2021.12.059
Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., & Yan, S. (2017). Perceptual generative adversarial networks for small
object detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1951– 1959.
https://doi.org/10.1109/cvpr.2017.211
Lin, T., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object
detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2117-2125.
https://doi.org/10.1109/cvpr.2017.106
Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Computer Vision – ECCV 2014, 740-755.
https://doi.org/10.1007/978-3-319-10602-1_48
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Berg, A. C. (2016). SSD: Single shot MultiBox
detector. Computer Vision – ECCV 2016, 9905, 21-37.
https://doi.org/10.1007/978-3-319-46448-0_2
Lowe, D. (1999). Object recognition from local scale-invariant features. Proceedings of the Seventh IEEE Interna-
tional Conference on Computer Vision, 2, 1150–1157.
https://doi.org/10.1109/iccv.1999.790410
Ma, X., Niu, Y., Gu, L., Wang, Y., Zhao, Y., Bailey, J., & Lu, F. (2021). Understanding adversarial attacks on deep
learning based medical image analysis systems. Pattern Recognition, 110, 107332.
https://doi.org/10.1016/j.patcog.2020.107332
Malo-Perisé, P., & Merseguer, J. (2022). The "Socialized architecture": A software engineering approach for a new cloud. Sustainability, 14(4), 2020.
https://doi.org/10.3390/su14042020
Maqueda, A. I., Loquercio, A., Gallego, G., Garcia, N., & Scaramuzza, D. (2018). Event-based vision meets deep
learning on steering prediction for self-driving cars. 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 5419–5427.
https://doi.org/10.1109/cvpr.2018.00568
Minaeian, S., Liu, J., & Son, Y. (2018). Eective and ecient detection of moving targets from a UAV’s camera.
IEEE Transactions on Intelligent Transportation Systems, 19(2), 497-506.
https://doi.org/10.1109/tits.2017.2782790
Narejo, S., Pandey, B., Esenarro Vargas, D., Rodriguez, C., & Anjum, M. R. (2021). Weapon detection using YOLO V3 for smart surveillance system. Mathematical Problems in Engineering, 2021, 1-9.
https://doi.org/10.1155/2021/9975700
Nassif, A. B., Shahin, I., Attili, I., Azzeh, M., & Shaalan, K. (2019). Speech recognition using deep neural networks:
A systematic review. IEEE Access, 7, 19143-19165.
https://doi.org/10.1109/access.2019.2896880
Olmos, R., Tabik, S., & Herrera, F. (2018). Automatic handgun detection alarm in videos using deep learning. Neuro-
computing, 275, 66-72.
https://doi.org/10.1016/j.neucom.2017.05.012
Olmos, R., Tabik, S., Lamas, A., Pérez-Hernández, F., & Herrera, F. (2019). A binocular image fusion approach for minimizing false positives in handgun detection with deep learning. Information Fusion, 49, 271-280.
https://doi.org/10.1016/j.inffus.2018.11.015
Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence
Research, 11, 169-198.
https://doi.org/10.1613/jair.614
Pérez-Hernández, F., Tabik, S., Lamas, A., Olmos, R., Fujita, H., & Herrera, F. (2020). Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance. Knowledge-Based Systems, 194, 105590.
https://doi.org/10.1016/j.knosys.2020.105590
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection.
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779-788.
https://doi.org/10.1109/cvpr.2016.91
Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 6517–6525.
https://doi.org/10.1109/cvpr.2017.690
Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal
networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137-1149.
https://doi.org/10.1109/tpami.2016.2577031
Romero, D., & Salamea, C. (2019). Convolutional models for the detection of firearms in surveillance videos. Applied
Sciences, 9(15), 2965.
https://doi.org/10.3390/app9152965
Roy, S., Das, N., Kundu, M., & Nasipuri, M. (2017). Handwritten isolated Bangla compound character recognition:
A new benchmark using a novel deep learning approach. Pattern Recognition Letters, 90, 15-21.
https://doi.org/10.1016/j.patrec.2017.03.004
Salido, J., Lomas, V., Ruiz-Santaquiteria, J., & Deniz, O. (2021). Automatic handgun detection with deep learning in
video surveillance images. Applied Sciences, 11(13), 6085.
https://doi.org/10.3390/app11136085
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556.
Singh, A., Anand, T., Sharma, S., & Singh, P. (2021). IoT based weapons detection system for surveillance and secu-
rity using YOLOV4. 2021 6th International Conference on Communication and Electronics Systems (ICCES),
488–493.
https://doi.org/10.1109/icces51350.2021.9489224
Singh, T., & Vishwakarma, D. K. (2019). Human activity recognition in video benchmarks: A survey. Lecture Notes
in Electrical Engineering, 526, 247-259.
https://doi.org/10.1007/978-981-13-2553-3_24
Sommer, L. W., Schuchert, T., & Beyerer, J. (2017). Fast deep vehicle detection in aerial images. 2017 IEEE Winter
Conference on Applications of Computer Vision (WACV), 311-319.
https://doi.org/10.1109/wacv.2017.41
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1).
https://doi.org/10.1609/aaai.v31i1.11231
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818-2826.
https://doi.org/10.1109/cvpr.2016.308
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1-9.
https://doi.org/10.1109/cvpr.2015.7298594
Tiwari, R. K., & Verma, G. K. (2015a). A computer vision based framework for visual gun detection using Harris interest point detector. Procedia Computer Science, 54, 703-712.
https://doi.org/10.1016/j.procs.2015.06.083
Tiwari, R. K., & Verma, G. K. (2015b). A computer vision based framework for visual gun detection using SURF. 2015 International Conference on Electrical, Electronics, Signals, Communication and Optimization (EESCO), 1-5.
https://doi.org/10.1109/eesco.2015.7253863
Tompson, J., Goroshin, R., Jain, A., LeCun, Y., & Bregler, C. (2015). Efficient object localization using convolutional networks. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 648-656.
https://doi.org/10.1109/cvpr.2015.7298664
Tong, K., Wu, Y., & Zhou, F. (2020). Recent advances in small object detection based on deep learning: A review.
Image and Vision Computing, 97, 103910.
https://doi.org/10.1016/j.imavis.2020.103910
Velastin, S. A., Boghossian, B. A., & Vicencio-Silva, M. A. (2006). A motion-based image processing system for
detecting potentially dangerous situations in Underground Railway stations. Transportation Research Part C:
Emerging Technologies, 14(2), 96-113.
https://doi.org/10.1016/j.trc.2006.05.006
Verma, G. K., & Dhillon, A. (2017). A handheld gun detection using faster R-CNN deep learning. Proceedings of the
7th International Conference on Computer and Communication Technology - ICCCT-2017, 84–88.
https://doi.org/10.1145/3154979.3154988
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted Cascade of simple features. Proceedings of the
2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, 1, I-I
https://doi.org/10.1109/cvpr.2001.990517
Wilson, P. I., & Fernandez, J. (2006). Facial feature detection using Haar classifiers. Journal of Computing Sciences
in Colleges, 21(4), 127-133.
https://dl.acm.org/doi/abs/10.5555/1127389.1127416
Worsham, J., & Kalita, J. (2020). Multi-task learning for natural language processing in the 2020s: Where are we
going? Pattern Recognition Letters, 136, 120-126.
https://doi.org/10.1016/j.patrec.2020.05.031
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. Computer Vision – ECCV 2014, 818-833.
https://doi.org/10.1007/978-3-319-10590-1_53
Zhan, S., Tao, Q., & Li, X. (2016). Face detection using representation learning. Neurocomputing, 187, 19-26.
https://doi.org/10.1016/j.neucom.2015.07.130
Zivkovic, Z. (2004). Improved adaptive gaussian mixture model for background subtraction. Proceedings of the 17th
International Conference on Pattern Recognition, 2004. ICPR 2004, 2, 28–31.
https://doi.org/10.1109/icpr.2004.1333992
Żywicki, M., Matiolański, A., Orzechowski, T. M., & Dziech, A. (2011). Knife detection as a subset of object detection approach based on Haar cascades. In Proceedings of the 11th International Conference "Pattern Recognition and Information Processing", 139-142.
... The robustness of this model to illumination conditions was enhanced using so-called brightness-controlled preprocessing (DaCoLT) involving dimming and changing the contrast of the images during the learning and testing stages. The authors of an interesting paper in the field of dangerous object detection are Yadav et al. 15. They made a systematic review of datasets and traditional and deep learning methods that are used for weapon detection. ...
Preprint
Full-text available
Recently, a significant number of tools for detecting dangerous objects have been developed. Unfortunately, the performance offered by them is overestimated due to the poor quality of the datasets used (insufficiently numerous, contain items not strictly related to dangerous objects, insufficient range of presentation conditions). To fill in a gap in this area we have built an extensive dataset dedicated to detecting the objects most often used in various acts of breaching public security (baseball bat, gun, knife, machete, rifle). This collection contains images presenting the detected objects with different quality and under different environmental conditions. We believe that the results obtained from it are more reliable and give a better idea of the detection accuracy that can be achieved under real conditions. We used the Faster R-CNN with different backbone networks in the study. The best results were obtained for the ResNet152 backbone. The mAP value was 85%, while the AP level ranged from 80% to 91%, depending on the item detected. An average real-time detection speed was 11-13 FPS. Both the accuracy and speed of the Faster R-CNN model allow it to be recommended for use in public security monitoring systems aimed at detecting potentially dangerous objects.
... Rawat and Juya [12] compare SSD, YOLO, and Faster RCNN, highlighting Faster RCNN's superior accuracy. Yadav et al. [13] analyze weapon detection systems, emphasizing the need for a real-time dataset. In [14], SSD and Faster RCNN detect guns, balancing accuracy and speed. ...
... DCNN models accomplish this task automatically. These methods are used in a variety of fields, including self-driving auto-mobiles [9], traffic monitoring [10], weapons recognition systems [28,29], face recognition systems [11], natural language processing [30], license plate detection [6] and many others. ...
Conference Paper
Full-text available
Automatic instrument reading has become a critical issue for intelligent sensors in smart cities. Several artificial intelligence techniques are developing tools for addressing the issue. The image-based Automatic Meter Reading (AMR) techniques have been tested on images taken under regulated conditions, but they become unresponsive when dealing with fuzzy, hazy or blurred meter images. In this paper, we deal with AMR, which focuses on unconstrained settings such as fuzzy, hazy or blurry meter images. Automated meter reading consists of three major components: identifying the counter region, localising and cropping the counter region and digit recognition. In this article, the deep learning model YOLOv5 have been used on the image dataset. YOLOv5 is a state-of-the-art single-stage deep learning detector that outperforms all other detectors and it is observed that the proposed technique and the trained model based on YOLOv5 can reliably detect and recognise meter readings from the different meter kinds. For the task of digit recognition, a YOLOv5 based custom-built digit optical character reader is used that can recognise 0-to-9-digit numbers. Furthermore, the proposed AMR system achieves remarkable recognition rates of 99.74% for counters and 88.70% for digit recognition even while rejecting counters with lower confidence values.
Conference Paper
Handguns, pistols, and revolvers are commonly used in today’s world for committing criminal acts, requiring the need for effective surveillance and control systems. However, despite the advancement of security systems, human monitoring and involvement are still necessary to effectively combat these crimes. This paper provides a robust automated handgun identification technique for recorded videos and live CCTV footage that may be used for both control and surveillance purposes. Automatic detection of firearms is crucial for improving people’s protection and safety, however, it is a challenging task because of the numerous differences in design, size, and appearance of firearms. In recent years, object detectors have improved, yielding better findings and shorter inference times. The authors used cutting edge object detector YOLOv7 for firearm detection. A varied and demanding dataset of 15,367 images for weapon identification is also proposed, which is carefully annotated for weapon localization and classification. After analysing the data, it is determined that the model achieves an accuracy rate of 96.80% and recall rate of 90.37%.
Conference Paper
In the past years, since 2020, the outbreak of COVID-19 has alarmed the world with the speed and its spread around the world. This raised the demand for early, accurate and automated detection systems for COVID-19 as there is a scarcity of manpower in the medical field. This attracted many researchers using deep learning to build a COVID-19 detection model. For the diagnosis of COVID-19, computed tomography scanning is being used as a more accurate, non-invasive and efficient method in real-time. In this work, we have proposed a model using six different image classification techniques of deep learning on CT scan images and compared the accuracy to find the most suitable and reliable model for transfer learning to achieve the best result on ResNet50 as 97.19% training and 98.05% testing accuracy. The model will automate the process of detection of COVID-19, leading to the advancement in the field of smart health care.
Article
Full-text available
The relevance of the study stems from the legal ambiguity surrounding specific aspects of visual surveillance utilised by law enforcement agencies, journalists, private detectives, and other individuals with a need for it. The purpose of the study is to identify indicators that can differentiate between legal and illegal covert visual surveillance of individuals in public spaces, establish the circumstances under which such surveillance should be deemed a criminal offence, define the specific aspects of documenting this offence, and explore methods of proving the guilt of those responsible. Historical-legal, formal-legal, logical-normative, logical-semantic, sociological and statistical research methods are applied in the study. The criteria for the legality of covert visual surveillance of a person in publicly accessible places are: its conduct by authorised subjects (investigators or employees of operational units); implementation only within the framework of criminal proceedings (or proceedings in an intelligence gathering case); the existence of a decision of the investigating judge on permission to conduct visual surveillance of a specific person; strict compliance with the requirements of the Criminal Procedure Law regarding the procedure for conducting visual surveillance and restrictions established by the decision of the investigating judge. It is found that representatives of civilian professions can conduct visual surveillance in publicly accessible places only in an open way. Covert visual surveillance of a person to collect information about them constitutes a criminal offence consisting in violation of privacy. To bring illegal observers to criminal responsibility, factual data indicating the purpose of visual surveillance (collecting confidential information about a person), motives, time, place, means of committing the crime, and other circumstances are collected during the pre-trial investigation. The practical value of the paper is the possibility of using the obtained data to prevent illegal actions of private detectives, journalists, and other entities who secretly collect information about a person through visual surveillance, and to ensure effective investigation of such activities.
Article
Full-text available
Today, the cloud means a revolution within the Internet revolution. However, an oligopoly sustaining the cloud may not be the best solution, since ethical problems such as privacy or even transferring data sovereignty could eventually happen. Our research, coined as the "socialized architecture," presents a novel disruptive approach to completely transform the cloud as we know it today. The approach follows ideas already working in the field of volunteer computing, since it tries to socialize spare computing power in the infraused hardware that institutions and normal people own. However, our solution is completely different to current ones, since it does not create hyper-specialized muscles in client machines. The solution is new since it proposes a software engineering approach for developing “socialized services”, which, leveraging an asynchronous interaction model, creates a network of lightweight microservices that can be dynamically allocated and replicated through the network. The use of state-of-the-art patterns, such as Command Query Responsibility Segregation, helps to isolate domain events and persistence needs, while an API Gateway addresses communication. All previous ideas were tested through a complete and functional proof of concept, which is a prototype called Circle implementing a social network. Circle has been useful to expose problems that need to be addressed. The results of the assessment confirm, in our view, that it is worth to start this new field of work.
Article
Full-text available
Applying CNN-based object detection models to the task of weapon detection in video-surveillance is still producing a high number of false negatives. In this context, most existing works focus on one type of weapons, mainly firearms, and improve the detection using different pre- and post-processing strategies. One interesting approach that has not been explored in depth yet is the exploitation of the human pose information for improving weapon detection. This paper proposes a top-down methodology that first determines the hand regions guided by the human pose estimation then analyzes those regions using a weapon detection model. For an optimal localization of each hand region, we defined a new factor, called Adaptive pose factor, that takes into account the distance of the body from the camera. Our experiments show that this top-down Weapon Detection over Pose Estimation (WeDePE) methodology is more robust than the alternative bottom-up approach and state-of-the art detection models in both indoor and outdoor video-surveillance scenarios.
Article
Full-text available
Today, with the increasing number of criminal activities, automatic control systems are becoming the primary need for security forces. In this study, a new model is proposed to detect seven different weapon types using the deep learning method. This model offers a new approach to weapon classification based on the VGGNet architecture. The model is taught how to recognize assault rifles, bazookas, grenades, hunting rifles, knives, pistols, and revolvers. The proposed model is developed using the Keras library on the TensorFlow base. A new model is used to determine the method required to train, create layers, implement the training process, save training in the computer environment, determine the success rate of the training, and test the trained model. In order to train the model network proposed in this study, a new dataset consisting of seven different weapon types is constructed. Using this dataset, the proposed model is compared with the VGG-16, ResNet-50, and ResNet-101 models to determine which provides the best classification results. As a result of the comparison, the proposed model’s success accuracy of 98.40% is shown to be higher than the VGG-16 model with 89.75% success accuracy, the ResNet-50 model with 93.70% success accuracy, and the ResNet-101 model with 83.33% success accuracy.
Conference Paper
Full-text available
The increasing number of terrorist acts and lone wolf attacks on places of public gathering such as Hotels and Cinemas has solidified the need for much denser Closed-circuit Television (CCTV) systems. The increasing number of CCTV cameras has deemed it almost impossible for a human operator to inspect all the video streams and detect possible terror events. One of the common types of terror event is called “Active Shooter”. Events such as the 2008 Mumbai shooting, shooting at the movie theater in Colorado (USA), Oslo (Norway) and recently an attacker opened gun fire at an outdoor music festival in Las Vegas on Oct. 1, 2017, USA. Therefore in this work, the detection of an “Active Shooter” carrying a non-concealed firearm and alerting the CCTV operator of a potentially dangerous event both visually and audibly has been carried out. The proposed approach of gun detection uses a feature extraction techniques and a convolutional neural network classifier for classifying objects as either a gun or not a gun. And the classification accuracy achieved by the proposed approach is 97.78%.
Article
There is a great need for preventive mechanisms against shootings and terrorist acts in public spaces with a large influx of people. While surveillance cameras have become common, the need for 24/7 monitoring and real-time response requires automatic detection methods. This paper presents a study based on three convolutional neural network (CNN) models applied to the automatic detection of handguns in video surveillance images. It investigates the reduction of false positives achieved by including pose information, associated with the way the handguns are held, in the training dataset. The results highlight the best average precision (96.36%) and recall (97.23%), obtained by RetinaNet fine-tuned with the unfrozen ResNet-50 backbone, and the best precision (96.23%) and F1 score (93.36%), obtained by YOLOv3 when trained on the dataset including pose information. The latter architecture was the only one that showed a consistent improvement, of around 2%, when pose information was expressly considered during training.
Article
Every year, a large number of people worldwide are affected by gun-related violence. In this work, we develop a fully automated computer-based system to identify basic armaments, particularly handguns and rifles. Recent work in deep learning and transfer learning has demonstrated significant progress in object detection and recognition. We have implemented the YOLO V3 ("You Only Look Once") object detection model by training it on our customized dataset. The training results confirm that YOLO V3 outperforms YOLO V2 and a traditional convolutional neural network (CNN). Additionally, intensive GPU or high-end computational resources were not required, as we used transfer learning to train our model. Applied in a surveillance system, this model can help save human lives and reduce the rate of manslaughter and mass killings. The proposed system can also be implemented in high-end surveillance and security robots to detect weapons or unsafe assets and avert any kind of assault or risk to human life.
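As a hedged sketch of how such a fine-tuned YOLO V3 model might be deployed at inference time, the snippet below uses OpenCV's DNN module, which can load standard Darknet configuration/weights pairs. The file names, input resolution, and confidence threshold are illustrative assumptions, not artifacts released with the paper.

    import cv2
    import numpy as np

    # Hypothetical file names for a YOLO V3 network fine-tuned on a weapon dataset.
    net = cv2.dnn.readNetFromDarknet("yolov3-weapons.cfg", "yolov3-weapons.weights")
    out_layers = net.getUnconnectedOutLayersNames()

    def detect_weapons(frame, conf_thresh=0.5):
        h, w = frame.shape[:2]
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                     swapRB=True, crop=False)
        net.setInput(blob)
        boxes = []
        for output in net.forward(out_layers):
            for det in output:  # det = [cx, cy, bw, bh, objectness, class scores...]
                scores = det[5:]
                class_id = int(np.argmax(scores))
                if scores[class_id] > conf_thresh:
                    cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                    boxes.append((int(cx - bw / 2), int(cy - bh / 2),
                                  int(bw), int(bh), class_id,
                                  float(scores[class_id])))
        return boxes  # overlapping boxes would normally be merged with NMS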
Article
Security and safety are major concerns in today's modern world; for a country to be economically strong, it must ensure a safe and secure environment for investors and tourists. Closed Circuit Television (CCTV) cameras are used for surveillance and to monitor activities such as robberies, but these cameras still require human supervision and intervention. We need a system that can detect such illegal activities automatically. Despite state-of-the-art deep learning algorithms, fast processing hardware, and advanced CCTV cameras, real-time weapon detection remains a serious challenge. Differences in viewing angle and occlusions by the carrier of the firearm and persons around it further increase the difficulty. This work focuses on securing public places by applying state-of-the-art open-source deep learning algorithms to CCTV footage to detect harmful weapons. We have implemented binary classification with the pistol class as the reference class, and we introduce the concept of including relevant confusion objects to reduce false positives and false negatives. Since no standard dataset was available for the real-time scenario, we built our own dataset from weapon photos taken with our own camera, images collected manually from the internet, frames extracted from YouTube CCTV videos, GitHub repositories, data from the University of Granada, and the Internet Movies Firearms Database (IMFDB, imfdb.org). Two approaches are used: sliding window/classification and region proposal/object detection. The algorithms tested include VGG16, Inception-V3, Inception-ResnetV2, SSDMobileNetV1, Faster-RCNN Inception-ResnetV2 (FRIRv2), YOLOv3, and YOLOv4. Since precision and recall matter more than accuracy when performing object detection, all of these algorithms were evaluated in those terms. YOLOv4 stands out among all the algorithms, giving an F1-score of 91% along with a mean average precision of 91.73%, higher than previously achieved.
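The sliding window/classification approach can be sketched as below. The window size, stride, and score threshold are assumptions, and `classifier` stands in for any binary CNN that returns the probability of the pistol class (with confusion objects folded into the negative class), not the authors' exact model.

    import numpy as np

    def sliding_windows(frame, win=128, stride=64):
        # Yield fixed-size crops together with their top-left coordinates.
        h, w = frame.shape[:2]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                yield (x, y), frame[y:y + win, x:x + win]

    def detect_pistols(frame, classifier, threshold=0.9):
        hits = []
        for (x, y), crop in sliding_windows(frame):
            score = classifier(crop)  # P(pistol) from the binary classifier
            if score >= threshold:
                hits.append((x, y, score))
        return hits  # overlapping hits are typically merged with NMS

The region proposal/object detection approach replaces this exhaustive scan with learned proposals or dense single-shot predictions, as in the Faster-RCNN, SSD, and YOLO variants listed above.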
Article
Detecting knives in surveillance videos is urgent for public safety, yet research on identifying dangerous weapons is relatively new. Knife detection is a very challenging task because knives vary in size and shape. Moreover, a knife's steel surface easily reflects light, which reduces its visibility in a video sequence; this reflection and surface brightness can make the detection process extremely difficult, even impossible. This paper presents an adaptive brightness enhancement technique for knife detection in surveillance systems. The technique overcomes the brightness problem posed by steel weapons and improves the knife detection process. It suggests an automatic threshold to assess the level of frame brightness; depending on this threshold, the proposed technique determines whether the frame's brightness needs enhancement. Experimental results verify the efficiency of the proposed technique in detecting knives using a deep transfer learning approach. Moreover, the four most famous deep convolutional neural network models are tested to select the best one for detecting knives. Finally, a comparison with state-of-the-art techniques demonstrates the superiority of the proposed technique.
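A minimal sketch of the adaptive idea follows: measure the mean brightness of the frame's HSV value channel and apply a gamma curve only when it deviates from an acceptable level. The fixed target and tolerance values, and the use of gamma correction, are assumptions made for illustration; the paper derives its threshold automatically.

    import cv2
    import numpy as np

    def normalize_brightness(frame, target=120, tol=40):
        # Mean brightness measured on the HSV value (V) channel.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mean_v = float(np.clip(hsv[..., 2].mean(), 1.0, 254.0))
        if abs(mean_v - target) <= tol:
            return frame  # brightness acceptable: leave the frame untouched
        # Choose gamma so the corrected mean lands on the target:
        # (mean_v / 255) ** gamma == target / 255. Gamma > 1 darkens overly
        # bright, reflective frames; gamma < 1 brightens dark ones.
        gamma = np.log(target / 255.0) / np.log(mean_v / 255.0)
        lut = np.array([(i / 255.0) ** gamma * 255 for i in range(256)],
                       dtype=np.uint8)
        return cv2.LUT(frame, lut)  # applied to all channels for simplicity

Frames pass through this step before being handed to the transfer-learned CNN detector, so only frames whose brightness actually deviates pay the correction cost.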
Article
Automatic detection of firearms is important for enhancing the security and safety of people; however, it is a challenging task owing to the wide variations in the shape, size, and appearance of firearms. Moreover, most generic object detectors process axis-aligned rectangular areas, although a long, thin rifle may actually cover only a small percentage of such an area while the rest contains irrelevant details that suppress the required object signatures. To handle these challenges, we propose a weakly supervised Orientation Aware Object Detection (OAOD) algorithm that learns to detect oriented object bounding boxes (OBB) while using only Axis-Aligned Bounding Boxes (AABB) for training. The proposed OAOD differs from existing oriented object detectors, which strictly require OBBs during training even though they may not always be available. Training on AABBs while detecting OBBs is achieved with a multistage scheme: Stage-1 predicts the AABB and Stage-2 predicts the OBB. Between the two stages, an oriented proposal generation module, together with object-aligned RoI pooling, is designed to extract features based on the predicted orientation and to make these features orientation invariant. A diverse and challenging dataset of eleven thousand images, manually annotated for firearm classification and localization, is also proposed. The proposed ITU Firearm dataset (ITUF) contains a wide range of guns and rifles. The OAOD algorithm is evaluated on the ITUF dataset and compared with current state-of-the-art object detectors, including fully supervised oriented object detectors; OAOD outperforms both types by a significant margin. The experimental results (mAP: 88.3 on AABB and mAP: 77.5 on OBB) demonstrate the effectiveness of the proposed algorithm for firearm detection.
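To illustrate the kind of oriented-box geometry such a Stage-2 head must handle, the helper below converts a (cx, cy, w, h, theta) prediction into its four corner points. The center-size-angle encoding used here is the common convention and is assumed for illustration; the paper's exact parameterization may differ.

    import numpy as np

    def obb_corners(cx, cy, w, h, theta):
        # Rotate the four half-extent offsets by theta, then translate
        # them to the box center to obtain the oriented corners.
        c, s = np.cos(theta), np.sin(theta)
        rot = np.array([[c, -s], [s, c]])
        offsets = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                            [w / 2, h / 2], [-w / 2, h / 2]])
        return offsets @ rot.T + np.array([cx, cy])

For example, obb_corners(100, 50, 120, 20, np.pi / 6) gives the corners of a long, thin rifle-shaped box tilted by 30 degrees, which covers far less irrelevant background than its axis-aligned counterpart.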