
A comprehensive study towards high-level approaches for weapon detection using classical machine learning and deep learning methods

August 2022
Expert Systems with Applications 212(5):118698

DOI:10.1016/j.eswa.2022.118698

Authors:

Pavinder Yadav

National Institute of Technology, Hamirpur

Nidhi Gupta

National Institute of Technology, Hamirpur

Pawan Sharma

National Institute of Technology, Hamirpur

A Comprehensive Study towards High-level Approaches for Weapon Detection using Classical Machine Learning and Deep Learning Methods

Pavinder Yadav, Nidhi Gupta∗, Pawan Kumar Sharma

Department of Mathematics and Scientific Computing

National Institute of Technology, Hamirpur

Himachal Pradesh, 177005, India

Abstract

Surveillance systems do not give a rapid response to deal with suspicious activities such as armed robbery in public places. Consequently, there is a need for technology that can recognize criminal activities from Closed Circuit Television (CCTV) footage without human help. Various high-performance computing algorithms have been developed, but they are limited to specific conditions. In this paper, we identify gaps in existing technologies for weapon detection. The automatic detection of guns and other weapons could help in the investigation of crime scenes. A new and difficult area of study is identifying the specific type of firearm used in an attack, known as intra-class detection. The study examines and classifies the strengths and shortcomings of several existing algorithms, based on classical machine learning and deep learning approaches, that are employed in the detection of different kinds of weapons. We thoroughly compare and analyze the performance of several recent state-of-the-art methods on different datasets, along with their future scope. We observed that deep learning techniques outperform traditional machine learning techniques in terms of speed and accuracy.

Keywords: Weapon Detection, Deep Learning, Machine Learning, Computer Vision, Security and Surveillance

1. Introduction

Nowadays, Closed Circuit Televisions (CCTVs) are widely used in society to prevent crimes and identify suspicious activities. With the rapid spread of CCTV cameras, it becomes more difficult for a human operator to inspect and analyze the footage and to take any necessary action based on the video input from a remote camera. When several people are involved in the video input, monitoring becomes expensive and ineffective. According to several studies, human operators develop video blindness and tend to miss up to 95% of the screen action after 20 to 40 minutes of intensive monitoring (Velastin et al., 2006). Overall, this results in a significant reduction in quality and poor productivity, which leads to an inaccurate detection rate of up to 83% (Ainsworth, 2002). Researchers have developed a number of computer vision-based automatic weapon detection systems in response to the proliferation of high-powered computers and the availability of high-speed internet.

Object detection, in particular, has become a demanding field of study in the last decade, and it has been used in a variety of applications including foreground moving target detection (Minaeian et al., 2018), human activity recognition (Singh and Vishwakarma, 2019), marine surveillance (Jeong et al., 2018), pedestrian identification (Jin et al., 2016), weapon detection (Olmos et al., 2018), and many more. Some applications merely need to identify items that take up a significant portion of the scene, while others require the detection of many objects of different sizes. The outcomes vary according to the size of the object, with small objects having poorer outcomes than large objects. These findings are well reflected in challenges like ImageNet (Deng et al., 2009), Common Objects in COntext (COCO) (Lin et al., 2014), and Open Images (Kuznetsova et al., 2020), where small objects are detected with 38% less accuracy than larger objects. The reason is that small objects occupy fewer pixels in the image, which means that they contribute little during training, either because they are not labeled or because they are not well represented in the training phase.

∗Corresponding author

Email addresses: pavinder_phdmath@nith.ac.in (Pavinder Yadav), nidhi@nith.ac.in (Nidhi Gupta), psharma@nith.ac.in (Pawan Kumar Sharma)

Preprint submitted to Elsevier August 2, 2022

Revised manuscript (with changes marked)

The crime rate is relatively high in nations where a person has access to firearms (e.g., a pistol). Information from many sources reveals the effects of criminal actions, which range from murder to theft and result in the loss of valuable lives, the destruction of infrastructure, and the disruption of law and order. Standard CCTV cameras are used for monitoring at certain places, but the monitoring process is very laborious. Nowadays, guns are available in a wide range of styles and sizes, which makes them difficult to identify in real time. In this situation, deep learning ushered in a breakthrough, and over time researchers have created various models to identify weaponry using deep learning methods. Nevertheless, the detection of weapons remains a difficult task despite the existence of several advanced state-of-the-art deep learning techniques. A thorough examination of several techniques is carried out in this article.

Detection of small objects in aerial and satellite images has previously been addressed by modifying the architecture of the network (Sommer et al., 2017), applying data augmentation for small objects (Tong et al., 2020), or adding perceptual generative adversarial networks for better image resolution (Li et al., 2017). Although these techniques achieved higher precision in small object detection, they were limited to certain applications such as traffic signs or satellite images.

Earlier, the object detection procedure was divided into three phases: (i) generating proposals, (ii) extracting feature vectors, and (iii) classifying regions. The objective of the first phase was to find regions of interest in an image that might include objects of any size. To preserve information, the input images were resized at multiple scales, and multi-scale sliding windows were moved across the images. The second phase was to extract a fixed-length feature vector from each sliding window in order to capture specific information about the enclosed area. Low-level visual descriptors such as the Harris Corner Detector (Harris et al., 1988), Histogram of Oriented Gradients (HOG) (Dalal and Triggs, 2005), Scale Invariant Feature Transform (SIFT) (Lowe, 1999), or Speeded Up Robust Features (SURF) (Bay et al., 2006) were used to encode feature vectors, which exhibited robustness to variations in scale, illumination, and rotation. Finally, in the third phase, region classifiers were trained to assign category labels to the covered regions. Because of their high performance on small training sets, Support Vector Machines (SVM) (Hearst et al., 1998) were commonly used. Additionally, techniques such as bagging (Opitz and Maclin, 1999), cascade learning (Dalal and Triggs, 2005), and AdaBoost (Freund et al., 1996) were utilised in the classification stage, resulting in better detection accuracy.
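The proposal-generation phase described above can be sketched as a multi-scale sliding window. This is only an illustrative sketch: the function name `sliding_windows`, the window size, the stride, and the scale factors are assumptions chosen for the example and are not taken from any of the surveyed methods.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in original-image pixels


def sliding_windows(img_w: int, img_h: int, win: int = 64, stride: int = 32,
                    scales: Tuple[float, ...] = (1.0, 0.5)) -> Iterator[Box]:
    """Yield candidate regions from a fixed-size window slid over rescaled images.

    Shrinking the image by `scale` while keeping the window fixed is equivalent
    to enlarging the window by 1/scale on the original image.
    """
    for scale in scales:
        scaled_w, scaled_h = int(img_w * scale), int(img_h * scale)
        for y in range(0, scaled_h - win + 1, stride):
            for x in range(0, scaled_w - win + 1, stride):
                # Map the window back to original-image coordinates.
                yield (round(x / scale), round(y / scale),
                       round(win / scale), round(win / scale))


# Hypothetical 128 x 128 image: nine 64-pixel windows plus one 128-pixel window.
boxes = list(sliding_windows(128, 128))
```

In a classical pipeline, each yielded box would then be passed to a descriptor such as HOG and to a classifier such as an SVM; those steps are omitted here.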

In particular, Deep Convolutional Neural Networks (DCNNs) have outperformed all other machine learning approaches in object detection in the last few years. Finding higher-level features in data takes a lot of work with traditional methods (Guo et al., 2016), whereas DCNN models do this automatically. DCNNs are made up of convolutional layers, non-linear activation functions such as ReLU, pooling layers, and fully connected layers. The convolutional layers extract a variety of characteristics from the source images, and the fully connected layers then learn from these characteristics. Another advantage of such a design is that it may be reused, partially or fully, for similar applications. This is possible because of transfer learning, which cuts down on model building time and reduces the need for a large dataset.
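The building blocks named above (convolution, ReLU, pooling) can be illustrated with a minimal pure-Python sketch. The toy image and the hand-set edge filter are assumptions made for the example; in a trained DCNN the filter weights are learned, and the fully connected layers are omitted here.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding) 2-D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap: Matrix) -> Matrix:
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2x2(fmap: Matrix) -> Matrix:
    """Non-overlapping 2x2 max pooling; a trailing odd row/column is dropped."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]


# 5x5 toy image: dark on the left, bright on the right (a vertical edge).
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Hand-set vertical-edge filter (hypothetical; learned in practice).
kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]

response = conv2d(image, kernel)          # 3x3 feature map
features = max_pool2x2(relu(response))    # 1x1 pooled map
```

The filter responds strongly where the window straddles the edge and not at all inside the uniformly bright region, which is the kind of characteristic a convolutional layer extracts.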

Object detection models based on deep learning methods are mainly divided into two groups: (i) one-stage detectors like You Only Look Once (YOLO) (Redmon et al., 2016) and its versions YOLOv2 (Redmon and Farhadi, 2017) and YOLOv3 (Redmon and Farhadi, 2018), and (ii) two-stage detectors like the Region-based Convolutional Neural Network (R-CNN) (Girshick et al., 2014) and its versions Fast R-CNN (Girshick, 2015) and Faster R-CNN (Ren et al., 2015). One-stage detectors produce categorical predictions of objects at each position of the feature maps, without the need for a cascaded region classification phase. Two-stage detectors start with a proposal generator that generates a small number of proposals and extracts features from each one, and then use region classifiers to predict the category of each proposed region. One-stage detectors are substantially more time-efficient and more applicable to real-time identification, but two-stage detectors produce better results on public benchmark datasets. In this study, we cover the fundamental concepts of the major approaches and review each of these methods in a methodical manner.

For a few decades, researchers have been striving to develop an automatic weapon detection system based on computer vision algorithms. In a potentially perilous situation, a person carries a knife or a firearm in the hand rather than on any other part of the body. Needless to say, guns can normally only be operated by hand while committing a crime. Therefore, the vision system is expected to be trained to recognize the ideal weapon image or a form that is comparable to that weapon. The major goals of such a detection system are as follows:

• To create an automatic alarm system that can alert surveillance security personnel in real time, resulting in a quick response; and

• To classify different types of weapons, which can provide crucial information for forensic investigation.

Deep learning revolutionised the creation of weapon detection systems. In this regard, several studies have been conducted and various models have been developed to identify firearms.

The main objective of this comprehensive study is to identify research gaps in the field of weapon detection and identification and to thoroughly study existing datasets, their limitations, and future research directions. This article presents the contributions of a large number of significant articles in a structured and systematic manner. This survey can provide readers with a comprehensive understanding of weapon detection using deep learning, and may drive future research efforts on weapon detection approaches and their benefits. Overall, this article examines the strengths and shortcomings of the various existing approaches and offers a detailed assessment of the open issues, together with a forecast of future prospects.

The paper is organized into seven sections. Section 2 provides a detailed description of existing publicly available datasets. Classical machine learning approaches adopted for weapon detection are discussed in Section 3. In Section 4, deep learning approaches are described in detail. The comparative analysis of various classical machine learning and deep learning methods is discussed in Section 5. The key contributions of the study are highlighted in Section 6. Future directions for real-time weapon detection methods are given in Section 7.

2. Publicly Available Datasets

Table 1 lists the public datasets that can be used to classify and recognize weapons, giving the year of publication, the number of images or videos in each dataset, the resolution of the images or videos, and the types of weapons covered.

Table 1: Statistics of datasets that are publicly available.

| Database | Publication/site | Year | # Images/Videos | Resolution | Types of Weapon |
|---|---|---|---|---|---|
| IMFDBs | IMFDBs | 2014 | 450,000* | Variable size | Handguns, Rifles, Knives |
| Knives Images Database | Grega et al. (2016) | 2016 | 12,899 | 100 x 100 | Knives |
| Gun Movies Database | Grega et al. (2016) | 2016 | 7 videos | 640 x 480 | Guns |
| Dataset of Olmos et al. | Olmos et al. (2018) | 2018 | 3,000 | Variable size | Guns |
| Sohas weapon | Pérez-Hernández et al. (2020) | 2020 | 17,684 | Variable size | Guns, Knives |
| Dataset created by mock attack | González et al. (2020) | 2020 | 5,149 | 1920 x 1080 | Handguns and Rifles |
| Synthetic dataset | González et al. (2020) | 2020 | 2,500 | 1920 x 1080 | Guns |
| ITU Firearm Dataset | Iqbal et al. (2021) | 2021 | 10,973 | 480 x 800 | Guns |

*Approximate

2.1. IMFDBs

The Internet Movie Firearm Database contains a huge collection of pictures of weapons. It is a wiki-powered online repository that is publicly available at the IMFDBs site. It comprises about 450,000 pictures of weapons, some of which are displayed in Fig. 1. The database is built from several thousand images taken from movie sequences or video games, which means that only a limited number of pictures show close-up views of weapons. Weapons that are obscured by darkness or hard to see because of blurriness or size are also included. IMFDBs is an excellent dataset for firearms since it has a wide range of gun pictures in various unconstrained orientations and positions.

Figure 1: Sample images from IMFDBs database (IMFDBs).

2.2. Knives images database

This database (Grega et al., 2016) comprises 12,899 images, which are classified into two categories: (i) positive examples, containing 3,559 images, and (ii) negative examples, containing 9,340 images. Images containing a knife, as shown in Fig. 2a, belong to the positive class, and images such as those shown in Fig. 2b belong to the negative class. A knife that is not being wielded by a person is considered less hazardous. It can also be overlooked during processing, resulting in a slew of false alarms. Because carrying knives in public is banned in Poland, the photographs were taken either indoors or through vehicle windows. All images in this database have a resolution of 100 x 100 pixels.

Figure 2: Knives images database: (a) positive samples, (b) negative samples (Grega et al., 2016).

2.3. Database of firearm videos

The Gun Movies Dataset is a video collection recorded by surveillance cameras. Grega et al. (2016) created this dataset by simulating a gun-shooting situation due to the lack of real-life gun-shooting footage. As a result, the dataset comprises seven different CCTV recordings featuring an actor. The training and testing sets were identical in size, with each recording lasting 8.5 minutes and yielding roughly 12,000 frames. Sixty percent of each set consists of negative examples, which show other objects held in a hand rather than a handgun, while the remaining forty percent consists of positive examples, in which a firearm is visible to an observer. A few images from this dataset are shown in Fig. 3.

Figure 3: Frames extracted from Gun movie dataset of Grega et al. (2016).

2.4. Dataset of Olmos et al.

Olmos et al. (2018) developed two databases for knives and handguns. The knife dataset contains 12,869 images. Each image contains multiple variations of cold steel weapons of various sorts, forms, colors, sizes, and materials, placed near and far from the camera and partially occluded by the hand or other objects. There are three sets of databases. The first, with 102 classes and 9,261 images, is suitable for classification tasks. The second comprises 3,000 images of weapons with extensive contextual information that may be used for detection. The third comprises 608 images, 304 of which are handgun images, and may be used for classification as well as detection tasks. Some sample images are shown in Fig. 4a.

2.5. Sohas weapon

Pérez-Hernández et al. (2020) created a dataset called Sohas weapon to investigate six small objects that are frequently handled in a similar manner to a weapon, namely handguns, knives, smartphones, bills, handbags, and cards. They employed a variety of surveillance cameras to capture the images. Of these images, 10% were obtained from web sources. All of the images were manually annotated for detection. The dataset contains different types of knives with different shapes. A few images from the dataset are shown in Fig. 4b.

Figure 4: A few samples from (a) the dataset of Olmos et al. (2018) and (b) the Sohas weapon dataset (Pérez-Hernández et al., 2020).

2.6. Dataset created by mock attack and synthetic dataset

This dataset was collected during a simulated assault and manually annotated by González et al. (2020). The recording setup consists of three security cameras placed at the same location to cover two distinct pathways and one entrance, generating diverse situations. A total of 5,149 frames were extracted from the videos at a rate of two frames per second (607, 3,511, and 1,031 frames from cam1, cam5, and cam7, respectively). In addition, the authors constructed a synthetic dataset by simulating a section of a city and an educational facility within it using the Unity Game Engine. Several cameras capture the motions of the cast of characters, which is made up of eleven distinct models and seven animations. The images add eleven distinct items to the produced datasets: four different pistols, five different rifles, a knife, and a smartphone. The creation of synthetic data might help the network focus on the item to be detected. However, this dataset is not realistic. Fig. 5a shows an example from the dataset collected during the mock attack, and Fig. 5b shows a few sample images from the synthetic dataset.

Figure 5: (a) Dataset collected during mock attack and (b) a few sample images from the synthetic dataset (González et al., 2020).

2.7. ITU Firearms Dataset (ITUF)

The ITUF dataset (Iqbal et al., 2021) contains images of guns and rifles in a variety of settings, including being aimed, laid out on tables, carried, or stored in racks. Web scraping was used to acquire the images for the dataset. Weapons, battles, pistols, film titles, firearms, firearm variations, shooters, corps, guns, and rifles were among the search phrases used to build the dataset. Data-driven algorithms trained on it can cope with variations in clothing, body posture, weapon position and size, and lighting conditions, in both indoor and outdoor situations, because these variations result in a strong prior for the data-driven algorithms. Images that were not associated with weaponry, as well as cartoons and duplicates, were eliminated from the search results. The final clean collection contains 10,973 fully annotated firearm images, with 13,647 firearm instances. Fig. 6 depicts a few sample images from the dataset.

Figure 6: Sample images from ITU Firearm Dataset (Iqbal et al., 2021).

3. Classical Machine Learning Methods

This section gives a detailed description of algorithms that use classical machine learning approaches (Haralick and Shapiro, 1985). Some of these approaches are listed in Table 2. Edge detection is the process of identifying the parts of an image where objects can be distinguished from one another. Edge detection has the advantage of reducing the quantity of data required to analyse the image. It works effectively with sharp images, whereas noise in an image increases complexity and makes edge detection a challenging task.

Table 2: List of classical machine learning approaches and respective area of application.

| Method | Reference | Year | Area of Application |
|---|---|---|---|
| Active Appearance Models (AAMs) | Cootes et al. (1998) | 1998 | Knives detection |
| Harris Corner Detector | Harris et al. (1988) | 1988 | Knives and Gun detection |
| Scale Invariant Feature Transform (SIFT) | Lowe (1999) | 1999 | Knives and Gun detection |
| Haar Cascades | Viola and Jones (2001) | 2001 | Knives detection |
| Speeded Up Robust Features (SURF) | Bay et al. (2006) | 2006 | Knives and Gun detection |

3.1. Active Appearance Models (AAMs)

In general, an Active Appearance Model (AAM) (Cootes et al., 1998) is a statistical model of the shape and pixel intensities (texture) across an object. The term appearance refers to the combination of shape and texture, while active refers to the use of an algorithm to fit the shape and texture model to new images. During the training phase, objects of interest in the training images are manually annotated with so-called landmark points that characterise the shape. The shape alignment procedure may be divided into four stages:

1. Select a starting reference shape.

2. Align all other shapes with the reference shape.

3. Recalculate the mean shape of the aligned shapes.

4. If the distance between the mean shape and the reference shape exceeds a threshold, set the mean shape as the reference shape and return to step 2. Otherwise, return the mean shape.
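The four stages above can be sketched as an iterative mean-shape computation. This is a minimal sketch under stated assumptions, not the implementation of Cootes et al.: landmarks are stored as complex numbers, alignment is a least-squares similarity fit (translation, scale, rotation), and the names `normalise`, `align`, and `mean_shape`, as well as the tolerance, are hypothetical.

```python
from typing import List

Shape = List[complex]  # each landmark (x, y) stored as x + yj


def normalise(shape: Shape) -> Shape:
    """Remove translation and scale: centre at the origin, unit norm."""
    centroid = sum(shape) / len(shape)
    centred = [p - centroid for p in shape]
    norm = sum(abs(p) ** 2 for p in centred) ** 0.5
    return [p / norm for p in centred]


def align(shape: Shape, reference: Shape) -> Shape:
    """Rotate a normalised shape onto the reference in the least-squares sense."""
    # The optimal rotation is the phase of sum(ref * conj(shape)).
    # Assumes this correlation is non-zero (shapes are not degenerate).
    corr = sum(r * s.conjugate() for r, s in zip(reference, shape))
    rotation = corr / abs(corr)
    return [rotation * p for p in shape]


def mean_shape(shapes: List[Shape], tol: float = 1e-9, max_iter: int = 100) -> Shape:
    shapes = [normalise(s) for s in shapes]
    reference = shapes[0]                                      # step 1
    for _ in range(max_iter):
        aligned = [align(s, reference) for s in shapes]        # step 2
        mean = normalise([sum(pts) / len(aligned)
                          for pts in zip(*aligned)])           # step 3
        distance = sum(abs(m - r) ** 2
                       for m, r in zip(mean, reference)) ** 0.5
        reference = mean
        if distance <= tol:                                    # step 4
            break
    return reference


# A unit square and a rotated, scaled, translated copy of it.
square = [0j, 1 + 0j, 1 + 1j, 1j]
moved = [(2 + 2j) + 3j * p for p in square]
result = mean_shape([square, moved])
expected = normalise(square)
```

Because the second shape differs from the first only by a similarity transform, the computed mean coincides with the normalised square.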

3.2. Harris Corner Detector

The Harris corner detector is a corner detection technique commonly used in computer vision algorithms to extract corners and infer image features (Harris et al., 1988). Instead of using shifted patches at every 45° angle, the Harris corner detector considers the differential of the corner score with respect to direction directly, allowing it to distinguish more effectively between edges and corners. The idea is to consider a small window surrounding each pixel in the image. By shifting each window by a small amount in a given direction, it is possible to determine the amount of change in the pixel values. Eq. 1 expresses the change function E(m, n) as the weighted sum of squared differences,

$$E(m,n) = \sum_{x,y} w(x,y)\,\left[I(x+m,\,y+n) - I(x,y)\right]^2 \tag{1}$$

where (m, n) is the shift of the window, (x, y) are the pixel coordinates, w(x, y) is the window function, and I is the intensity of the pixels in a 3 x 3 window. Any pixel with a high E(m, n) value, as determined by a threshold, is regarded as a feature of the image. For corner detection, the function value E(m, n) must be maximized, which means that the squared-difference term must be maximized. Eq. 2 is obtained by applying a Taylor series expansion to Eq. 1 and then performing a number of mathematical operations on the result,

$$E(m,n) \approx \begin{bmatrix} m & n \end{bmatrix} P \begin{bmatrix} m \\ n \end{bmatrix} \tag{2}$$

where

$$P = \sum_{x,y} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \tag{3}$$

and I_x and I_y are the derivatives of the image in the x and y directions, respectively. The corner response is calculated by Eq. 4,

$$Q = \det(P) - c\,\left(\operatorname{trace}(P)\right)^2 \tag{4}$$

where c is a constant, det(P) = αβ, trace(P) = α + β, and α and β are the eigenvalues of P. Consequently, as seen in Fig. 7, the eigenvalues indicate whether an area is a corner, a flat surface, or an edge.

Figure 7: Classification of image points using the eigenvalues of P.

• When |Q| is small, which occurs when α and β are both small, the area is flat.

• When Q < 0, which happens when α >> β or vice versa, the region is considered to be an edge.

• When Q is large, which happens when α and β are both large and α ∼ β, the area is considered a corner.
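The three cases above can be checked numerically with a small sketch of the corner response. The sketch makes several assumptions that are not from the source: derivatives are approximated by central differences, the window function w(x, y) is uniform over a 3 x 3 neighbourhood, c is set to 0.04 (a commonly used value), and the function name and the synthetic image are hypothetical.

```python
from typing import List

Image = List[List[float]]


def harris_response(img: Image, y: int, x: int, c: float = 0.04) -> float:
    """Corner response Q = det(P) - c * trace(P)^2 at pixel (y, x).

    P is accumulated over a uniform 3x3 window; (y, x) must lie at least
    two pixels away from the image border.
    """
    sxx = syy = sxy = 0.0
    for j in range(y - 1, y + 2):
        for i in range(x - 1, x + 2):
            ix = (img[j][i + 1] - img[j][i - 1]) / 2.0   # derivative along x
            iy = (img[j + 1][i] - img[j - 1][i]) / 2.0   # derivative along y
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - c * trace ** 2


# Synthetic 9x9 image: a bright square occupying rows 4..8 and columns 4..8.
img = [[1.0 if (r >= 4 and col >= 4) else 0.0 for col in range(9)]
       for r in range(9)]

corner = harris_response(img, 4, 4)   # at the corner of the square
edge = harris_response(img, 6, 4)     # on its vertical edge
flat = harris_response(img, 2, 2)     # in the uniform background
```

On this image the response is positive at the corner, negative on the edge, and zero in the flat region, matching the classification by the sign and magnitude of Q.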

3.3. Scale Invariant Feature Transform (SIFT)

Scale Invariant Feature Transform (SIFT) (Lowe, 1999) extracts a large number of distinctive candidate key-points from an image, which are invariant to changes in viewpoint, scale, rotation, illumination, and noise. Fig. 8 shows the architecture of the SIFT algorithm.

Figure 8: Architecture of SIFT (Lowe, 1999).

There are basically four stages in this algorithm:

• Scale-space detection: The input image is scanned across scales to find regions of interest that are either local maxima or local minima and are invariant to transformations such as rotation and scaling.

• Key-point localization: In this phase, the most stable key-points are identified. Outliers, low-contrast points, and poorly localized key-points along edges are removed using a Taylor series expansion.

• Orientation: After the most stable key-points have been found in the previous phase, the local image gradient direction is used to assign an orientation to each key-point.

• Key-point description: In the final phase, suitable feature points are described by intensity samples stored from their neighborhood. All of these key-points are distinctive and robust to affine transformations and changes in illumination.
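The extremum test of the scale-space detection stage can be sketched as follows. The sketch assumes, following Lowe's formulation, that the scale space is a stack of difference-of-Gaussian (DoG) images and that each sample is compared with its 26 neighbours in the current and adjacent scales. The function name and the toy stack are hypothetical, and building the DoG stack itself is omitted.

```python
from typing import List

Stack = List[List[List[float]]]  # dog[scale][row][col]


def is_scale_space_extremum(dog: Stack, s: int, y: int, x: int) -> bool:
    """True if dog[s][y][x] is strictly above or strictly below all 26 neighbours.

    (s, y, x) must be an interior sample of the stack.
    """
    centre = dog[s][y][x]
    neighbours = [dog[s + ds][y + dy][x + dx]
                  for ds in (-1, 0, 1)
                  for dy in (-1, 0, 1)
                  for dx in (-1, 0, 1)
                  if (ds, dy, dx) != (0, 0, 0)]
    return centre > max(neighbours) or centre < min(neighbours)


def toy_stack(centre_value: float) -> Stack:
    """3 scales of 3x3 zeros with a chosen value at the centre sample."""
    stack = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    stack[1][1][1] = centre_value
    return stack


peak = is_scale_space_extremum(toy_stack(1.0), 1, 1, 1)      # local maximum
valley = is_scale_space_extremum(toy_stack(-1.0), 1, 1, 1)   # local minimum
plateau = is_scale_space_extremum(toy_stack(0.0), 1, 1, 1)   # no extremum
```

Candidates that pass this test would then go through the localization stage, which discards low-contrast and edge responses.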

3.4. Speeded Up Robust Features (SURF)

Speeded Up Robust Features (SURF) (Bay et al., 2006) is a feature extraction method that operates in a similar way to SIFT but is significantly faster. The SURF method is a reliable detector and descriptor of prospective key-points of interest. The architecture of the SURF algorithm is shown in Fig. 9.

Figure 9: Architecture of SURF (Bay et al., 2006).

The SURF detector makes use of the Hessian matrix to find interest key-points, since it is quick to compute and accurate. The Hessian matrix H at a point x = (x, y) and scale σ is expressed by Eq. 5:

$$H(\mathbf{x}, \sigma) = \begin{bmatrix} D_{xx}(\mathbf{x}, \sigma) & D_{xy}(\mathbf{x}, \sigma) \\ D_{xy}(\mathbf{x}, \sigma) & D_{yy}(\mathbf{x}, \sigma) \end{bmatrix} \tag{5}$$

where each entry D(x, σ) is obtained by convolving the corresponding Gaussian second-order derivative with the integral image I.

SURF descriptor: Interest key-points are characterized by using Haar wavelet responses to assign an orientation to each key-point. To describe a key-point, a square region is built around it. The chosen square region is then subdivided into 4 x 4 subregions. After that, the four descriptor components dx, dy, |dx|, and |dy| are evaluated, where dx represents the horizontal Haar wavelet response and dy represents the vertical Haar wavelet response; |dx| and |dy| are the absolute values of the responses in the horizontal and vertical directions, respectively.
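How the 4 x 4 subregions and the four components combine into one vector can be sketched as below. The sketch assumes, as in Bay et al. (2006), that the components are summed over each subregion, giving 4 x 4 x 4 = 64 values. The function name, the 8 x 8 sample grid, and the toy responses are hypothetical, and the Gaussian weighting and final normalisation of the real descriptor are omitted.

```python
from typing import List

Grid = List[List[float]]


def surf_like_descriptor(dx: Grid, dy: Grid, cells: int = 4) -> List[float]:
    """Concatenate (sum dx, sum dy, sum |dx|, sum |dy|) over a cells x cells grid.

    dx and dy hold Haar wavelet responses sampled on a square grid whose
    side length is a multiple of `cells`.
    """
    step = len(dx) // cells            # samples per subregion side
    descriptor: List[float] = []
    for cy in range(cells):
        for cx in range(cells):
            rows = range(cy * step, (cy + 1) * step)
            cols = range(cx * step, (cx + 1) * step)
            vals_x = [dx[r][c] for r in rows for c in cols]
            vals_y = [dy[r][c] for r in rows for c in cols]
            descriptor += [sum(vals_x), sum(vals_y),
                           sum(abs(v) for v in vals_x),
                           sum(abs(v) for v in vals_y)]
    return descriptor


# Toy responses: dx alternates in sign along each row, dy is zero everywhere.
dx = [[1.0 if c % 2 == 0 else -1.0 for c in range(8)] for _ in range(8)]
dy = [[0.0] * 8 for _ in range(8)]
desc = surf_like_descriptor(dx, dy)
```

For this oscillating pattern the signed sums cancel while the absolute sums do not, which illustrates why both kinds of component are kept.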

3.5. Haar Cascades

Viola and Jones (2001) presented the Haar Cascades detector, which is a successful fusion of three fundamental principles. First, a large collection of features is needed that can be calculated in a short and constant time. This feature-based strategy reduces in-class variability while increasing inter-class variability. Second, a boosting method allows the salient features to be selected and the classifier to be trained at the same time. Third, a quick and efficient detection method is made possible by building a cascade of increasingly complex classifiers.

As illustrated in Fig. 10, the method employs edge and line detection features as well as center-surround features. At each stage of the cascade, the number of features employed to evaluate the image increases. With only 200 basic features, Wilson and Fernandez (2006) were able to recognize a human face with a 95% accuracy rate.

Figure 10: Haar-like features employing edge or line detection characteristics (Wilson and Fernandez, 2006).
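The constant-time feature evaluation and the cascade idea can be illustrated with a short sketch based on an integral image. It is an illustration under stated assumptions rather than the Viola and Jones implementation: the stages here are single hand-written threshold tests with made-up thresholds, whereas real cascade stages are boosted classifiers over many features, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Image = List[List[int]]


def integral_image(img: Image) -> Image:
    """ii[y][x] = sum of img over rows < y and columns < x (zero-padded)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii: Image, x: int, y: int, w: int, h: int) -> int:
    """Sum of the w x h rectangle with top-left corner (x, y), in four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def edge_feature(ii: Image, x: int, y: int, w: int, h: int) -> int:
    """Two-rectangle (edge) Haar-like feature: right half minus left half."""
    half = w // 2
    return box_sum(ii, x + half, y, half, h) - box_sum(ii, x, y, half, h)


def cascade_accepts(ii: Image, stages: Sequence[Callable[[Image], bool]]) -> bool:
    """Evaluate stages in order and reject as soon as one stage fails."""
    return all(stage(ii) for stage in stages)


# 4x4 toy window: dark left half, bright right half.
img = [[0, 0, 5, 5] for _ in range(4)]
ii = integral_image(img)

# Hypothetical stages with made-up thresholds.
has_edge = lambda table: edge_feature(table, 0, 0, 4, 4) > 10
is_very_bright = lambda table: box_sum(table, 0, 0, 4, 4) > 100

passes_first = cascade_accepts(ii, [has_edge])
passes_both = cascade_accepts(ii, [has_edge, is_very_bright])
```

Because `all` stops at the first failing stage, most windows are rejected after only a few cheap tests, which is what makes the cascade fast.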

4. Deep Learning Methods

Deep learning algorithms such as Faster R-CNN (Ren et al., 2015), the Single Shot MultiBox Detector (SSD) (Liu et al., 2016), and YOLO (Redmon et al., 2016) are based on deep neural networks. Fig. 11 depicts the key advancements and achievements in deep learning-based object detection algorithms since 2012. One of the most significant benefits of deep learning algorithms is that they do not require hand-crafted features such as edge and corner detectors; during the training phase, these algorithms learn features automatically. Consequently, they require a large quantity of data to be trained. However, a large amount of data may also make it possible to detect occluded objects. The data must be labeled beforehand in order to train SSD, Faster R-CNN, and YOLO; this is referred to as supervised learning. Several deep learning methods are given in Table 3 with their advantages and shortcomings.

Table 3: Highlights and shortcomings of deep learning methods using one-stage and two-stage approaches.

| Method | Publication | Approach | Highlights and Shortcomings |
|---|---|---|---|
| R-CNN | (Girshick et al., 2014) | Two-stage | Highlights: Significant improvement in performance over previous state-of-the-art methods; the first method to combine CNN and region proposal (RP) methods. Shortcomings: Training is costly in terms of both space and time; testing is time-consuming. |
| Fast R-CNN | (Girshick, 2015) | Two-stage | Highlights: Introduces a layer for ROI pooling; the first method that enables end-to-end detector training (ignoring RP generation). Shortcomings: External RP computation is revealed to be the new bottleneck; still too slow for real-time applications. |
| Faster R-CNN | (Ren et al., 2015) | Two-stage | Highlights: Proposes RPN instead of selective search for producing high-quality, nearly cost-free RPs; combines Fast R-CNN and RPN into a single network by sharing convolution layers; introduces multi-scale, translation-invariant anchor boxes as RPN references. Shortcomings: Not a simplified procedure; training is complex; still too slow for real-time applications. |
| YOLO | (Redmon et al., 2016) | One-stage | Highlights: The first highly effective unified detector; can run at 45 FPS; drops the RP process completely; a detection framework that is both efficient and elegant; dramatically faster than previous detection techniques. Shortcomings: Localization of small objects is difficult; the accuracy of the detector falls far short of that of previous detectors. |
| SSD | (Liu et al., 2016) | One-stage | Highlights: Effectively combines YOLO and RPN ideas to detect objects using multi-scale convolution layers; the first efficient and accurate unified detector; can run at 59 FPS; faster and significantly more accurate than YOLO. Shortcomings: Ineffective at detecting objects of small size. |
| FPN | (Lin et al., 2017) | Two-stage | Highlights: Superior to Faster R-CNN while maintaining high accuracy; creates a set of position-sensitive score maps using a bank of specialised convolution layers. Shortcomings: Still too slow for real-time applications; training is not a simplified process. |
| YOLOv2 | (Redmon and Farhadi, 2017) | One-stage | Highlights: Employs a variety of existing strategies to boost both accuracy and speed; proposes a faster DarkNet-19; can identify over 9000 object classes in real time. Shortcomings: Ineffective at detecting objects of small size. |
| YOLOv3 | (Redmon and Farhadi, 2018) | One-stage | Highlights: Improves object detection accuracy by drawing on the concept of residual networks; employs Darknet-53 to generate small-size feature maps. Shortcomings: Ineffective at detecting objects of small size. |
| YOLOv4 | (Bochkovskiy et al., 2020) | One-stage | Highlights: The backbone of the model employs Bag-of-Specials (BoS) and Bag-of-Freebies (BoF), which improve performance with Cross Stage Partial Darknet-53; the neck has improved Spatial Pyramid Pooling, resulting in fixed-size output independent of the input size. Shortcomings: Training is not a streamlined process; ineffective at detecting objects of small size. |

Figure 11: Milestones in object detection using deep convolutional neural networks.

4.1. Backbone Networks

Object detectors based on deep neural networks use backbone networks to extract high-level features from input images. These backbones are usually deep networks first trained as image classifiers on large-scale datasets such as the ImageNet classification dataset. The final classification layers of the classifier are removed, the remaining layers serve as the backbone network, and further detection layers are added on top to construct a complete object detector. The major design goals of backbone networks are to increase detection accuracy and processing efficiency. The following are some of the most widely used backbone networks:

• VGGNets (Simonyan and Zisserman, 2014) use convolutional layers with small filters of 3x3 pixels, followed by 2x2 max pooling. VGG-16 contains thirteen convolutional layers, whereas VGG-19 has sixteen convolutional layers. VGG took first place in the localisation task and second place in the classification task of the ImageNet Challenge 2014, and it is still one of the most popular networks today.

• Residual networks (ResNets) (He et al., 2016) were presented as a way to train very deep networks using residual blocks. They come in a variety of depths, of which ResNet50 and ResNet101 are the most popular variants. ResNet is substantially deeper than VGGNet.

• Inception networks (Szegedy et al., 2015, 2016) increased network depth and width without adding to the computational cost. In the Inception module, convolution layers with 1x1, 3x3, and 5x5 filter sizes, as well as max pooling layers, are arranged in parallel, so that features at several scales can be extracted at the same time in a single layer. Inception networks are substantially faster than VGGNet.

• DenseNet (Huang et al., 2017) is a network in which each layer is connected to all subsequent layers in a feed-forward manner, allowing all later layers to reuse lower-level features. DenseNet alleviates the vanishing-gradient problem.

• ZFNet (Zeiler and Fergus, 2014) is a classic convolutional neural network. Its design was motivated by visualising intermediate feature layers and the operation of the classifier. The filter sizes and strides of its convolutions are smaller than in several previous architectures.
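The layer counts quoted for VGG-16 and VGG-19 can be checked with a short sketch. The layouts below are the commonly published VGG configurations and are an assumption here, as are the helper names; convolutions are assumed to preserve spatial size, so only the pooling layers shrink the feature map.

```python
# Assumed layouts: integers are output channels of a 3x3 convolution,
# "M" marks a 2x2 max pooling layer.
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
VGG19_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
                512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]


def count_conv_layers(layout):
    """Count the convolutional layers (every entry that is not a pool)."""
    return sum(1 for item in layout if item != "M")


def backbone_output_size(layout, input_size):
    """Spatial size of the last feature map; each 2x2 pool halves it."""
    size = input_size
    for item in layout:
        if item == "M":
            size //= 2
    return size


print(count_conv_layers(VGG16_LAYOUT))          # 13
print(count_conv_layers(VGG19_LAYOUT))          # 16
print(backbone_output_size(VGG16_LAYOUT, 224))  # 7
```

When such a network is used as a backbone, the fully connected classification layers that would follow this layout are dropped and the detection layers consume the final feature map instead.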

4.2. Two-stage detectors

Broadly, detectors are divided into one-stage detectors and two-stage detectors. In this section, two-stage detector models are discussed in detail.

4.2.1. R-CNN

One of the most significant drawbacks of the conventional sliding-window method is that it scans every possible portion of the image. Within the image, the object of interest may appear at multiple spatial positions and with different aspect ratios. This necessitates the selection and processing of a large number of regions and hence increases the processing time. R-CNN (Girshick et al., 2014) resolves this issue by employing a selective search approach, which generates about 2,000 candidate regions per image, commonly known as region proposals. Each region is warped into a square and forwarded to a convolutional neural network, which produces a 4096-dimensional feature vector. These features are passed on to an SVM, which classifies the regions at the final stage. A regression technique is also used to refine the bounding boxes of the objects detected in the image. The drawbacks of this method include a long training time, because 2,000 regions must be classified for each image. It is also difficult to use in real time because each image takes around 47 seconds to process.
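To give a feel for why exhaustive sliding windows are costly, the sketch below counts window positions for a single frame. The frame size, window shapes, and stride are hypothetical values chosen for illustration only; they are not taken from any of the cited papers.

```python
def count_sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Number of window positions for one window shape on one image."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


# Hypothetical setting: 640x480 frame, four window shapes, stride of 16 px.
shapes = [(64, 64), (128, 128), (128, 64), (64, 128)]
total = sum(count_sliding_windows(640, 480, w, h, 16) for w, h in shapes)
print(total)  # 3500
```

Even this coarse setting already exceeds the roughly 2,000 regions produced by selective search, and finer strides or more scales increase the count rapidly.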

4.2.2. Fast R-CNN

Fast R-CNN (Girshick, 2015) addresses the issues of R-CNN with a significantly faster algorithm. The steps are similar to R-CNN, but instead of feeding every region proposal to the CNN, the whole image is sent through the CNN once to generate a convolutional feature map. The region proposals are located on this feature map and, using the Region-Of-Interest (ROI) pooling layer, each one is converted to a fixed size and passed to the fully connected layers. From the resulting ROI feature vector, a softmax layer predicts the class and a regression layer predicts the bounding box. This method is considerably faster than R-CNN, since the convolutional features are computed once per image rather than separately for each of the 2,000 proposed regions.
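The role of ROI pooling, turning regions of different sizes into outputs of one fixed size, can be illustrated with a minimal single-channel sketch. The function names and the bin-boundary rule (floor for the start, ceiling for the end) are assumptions made for this illustration and are not claimed to match the original implementation.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x0, y0, x1, y1) of a 2D feature map
    into an out_h x out_w grid; x1 and y1 are exclusive."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        ys = y0 + (i * roi_h) // out_h
        ye = y0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = x0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(ys, ye) for c in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy feature map with value 10 * row + column.
fmap = [[10 * r + c for c in range(6)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 6, 4), 2, 2))  # [[12, 15], [32, 35]]
print(roi_max_pool(fmap, (1, 1, 4, 4), 2, 2))  # [[22, 23], [32, 33]]
```

Both regions, although of different sizes, yield a 2 x 2 output, which is what allows a single set of fully connected layers to process every proposal.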

4.2.3. Faster R-CNN

In the Faster R-CNN (Ren et al., 2015) configuration, there are two stages. In the first stage, a feature extractor (VGG, ResNet, ResNet-V2, Inception, etc.) creates the feature map of the original image. The Region Proposal Network (RPN) uses the feature map from a chosen intermediate convolutional layer to predict proposal regions with objectness scores and locations. The objectness score approximates the probability that a proposal contains an object rather than background. Box regression is also performed for each of the proposals, using a robust loss function.

The second stage uses ROI pooling to crop features from the same intermediate feature map at the positions of the proposed regions. The regional feature map for each proposed region is passed to the remainder of the network to predict the class-specific score and refine the box location. With this technique, it is not necessary to pass each proposed region through the front-end CNN in order to calculate its regional feature map. However, each proposed region must still be passed individually through the remainder of the network. As a result, the detection time grows with the number of RPN proposals. The architecture of the Faster R-CNN is shown in Fig. 12.

Figure 12: Architecture of Faster R-CNN (Ren et al.,2015).
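A robust loss commonly used for box regression in this family of detectors is the smooth L1 loss, which is quadratic for small errors and linear for large ones. The sketch below assumes that form; the `beta` threshold and the four-offset box encoding are illustrative choices.

```python
def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear for |x| >= beta (less outlier-sensitive)."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


def box_regression_loss(predicted, target):
    """Sum of smooth L1 terms over the four box offsets."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))


print(smooth_l1(0.5))  # 0.125
print(smooth_l1(2.0))  # 1.5
print(box_regression_loss((0.5, 0.0, 2.0, 0.0), (0.0, 0.0, 0.0, 0.0)))  # 1.625
```

Compared with a squared error, the large offset of 2.0 contributes 1.5 rather than 2.0, so a few badly localised proposals dominate the gradient less.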

4.2.4. Feature Pyramid Network (FPN)

The Feature Pyramid Network (FPN) (Lin et al., 2017) starts from the observation that bottom-level feature maps contain spatial information rather than semantic information, whereas the later layers of a deep neural network contain high-level semantic information rather than spatial information. FPN uses the structure of the CNN to create a bottom-up pathway and a top-down pathway joined by lateral connections. In the bottom-up pathway, a CNN processes the input image and pooling layers reduce the size of the feature maps. In the top-down pathway, the extracted features are up-sampled to the same sizes as in the bottom-up pathway. FPN creates integrated image features that boost detection performance substantially, mainly for small objects.
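The top-down step can be sketched on single-channel toy maps: a coarse map is up-sampled by a factor of two and added element-wise to the finer bottom-up map. Nearest-neighbour up-sampling and plain addition are assumptions of this sketch, and the convolutions that a full network applies around this step are omitted.

```python
def upsample2x(fmap):
    """Nearest-neighbour up-sampling of a 2D map by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge_top_down(top, lateral):
    """Add the up-sampled coarse map to the finer bottom-up map."""
    up = upsample2x(top)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


top = [[1, 2], [3, 4]]                   # coarse, semantically strong map
lateral = [[10] * 4 for _ in range(4)]   # finer, spatially precise map
for row in merge_top_down(top, lateral):
    print(row)
```

The merged map keeps the finer resolution while carrying information from the deeper layer, which is the property credited above for the improvement on small objects.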

4.3. One-stage detectors

4.3.1. SSD

SSD (Liu et al., 2016) is a one-stage object detection network that uses a single forward pass of a CNN to predict object class and position. SSD set new performance standards in terms of speed and accuracy for object detection tasks, achieving over 74% mean Average Precision (mAP) at 59 fps on standard datasets. The basic architecture of SSD is shown in Fig. 13.

Figure 13: Architecture of SSD (Liu et al.,2016).

In general, the SSD consists of three sections:

1. The base convolutional network, for which ResNet, ResNetv2, VGG, Inception, or another feature extraction network can be used. An intermediate convolutional layer creates a large-scale feature map, which splits the receptive field into a large number of small cells, assisting in the identification of small objects.

2. Extra convolutional layers linked to the last layer of the base convolutional network, which produce further feature maps at multiple scales.

3. A prediction convolutional layer employing a small convolutional kernel, which predicts bounding box locations and confidences for several categories.
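How multi-scale feature maps translate into predictions can be illustrated by counting default boxes. The feature map sizes and boxes per cell below are assumed from the commonly cited SSD300 configuration and are not stated in this section.

```python
# (feature map side length, default boxes per cell), assumed SSD300 values.
SSD300_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def total_default_boxes(maps):
    """Every cell of every feature map predicts a fixed number of boxes."""
    return sum(side * side * boxes for side, boxes in maps)


for side, boxes in SSD300_MAPS:
    print(side, side * side * boxes)
print(total_default_boxes(SSD300_MAPS))  # 8732
```

Most of the boxes come from the largest (38 x 38) map, whose small cells are the ones intended for small objects.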

4.3.2. YOLO

YOLO (Redmon et al., 2016) focused on speeding up object detection. The region proposal step was removed, since object detection was treated as a regression problem. YOLO divides the input image into a 7 x 7 grid of cells, with each cell being used to estimate whether the centre of an object lies within it, rather than using pre-defined anchors for object portions. Each cell predicts bounding box locations, class probabilities, and a score for each bounding box. YOLO is a real-time object detector that runs at 45 fps, which is extremely fast in comparison to previous object detection models. On the other hand, only one set of class probabilities is estimated within each cell. Consequently, it cannot handle a large number of ground truth objects, does not work well with objects that are only partially located in one cell, and has poor localization accuracy due to bounding box sizes and proportions. On the COCO dataset, YOLO achieves a mAP of around 54.30%.
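The grid formulation can be made concrete with a small sketch that finds the cell responsible for an object centre and computes the size of the output tensor. The 448-pixel input, two boxes per cell, and twenty classes are assumed example values.

```python
def responsible_cell(cx, cy, img_w, img_h, grid=7):
    """Return (row, col) of the grid cell containing the object centre."""
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col


def output_values(grid=7, boxes_per_cell=2, num_classes=20):
    """Each cell predicts (x, y, w, h, score) per box plus class probabilities."""
    return grid * grid * (boxes_per_cell * 5 + num_classes)


print(responsible_cell(224, 224, 448, 448))  # (3, 3)
print(responsible_cell(447, 0, 448, 448))    # (0, 6)
print(output_values())                       # 1470
```

Because the class probabilities are shared by all boxes of a cell, two objects of different classes whose centres fall into the same cell cannot both be labelled correctly.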

4.3.3. YOLOv2

YOLOv2 (Redmon and Farhadi, 2017) proposes several improvements to the first version of YOLO. The fully connected layers are eliminated, and anchor boxes are used to predict bounding boxes, which improves recall. Unsupervised learning (clustering) is used to derive the anchor box sizes and proportions directly from the training data. Each bounding box position is predicted relative to the top-left corner of its grid cell, so that the predicted offsets are bounded between 0 and 1. Batch normalisation, a high-resolution classifier, and multi-resolution training are among the other strategies introduced in this version. Together, these strategies significantly increase detection accuracy while maintaining high speed.
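The bounded position prediction can be sketched as follows: a sigmoid keeps the centre offset between 0 and 1 inside the cell, and the anchor size is scaled by an exponential. The variable names and the use of grid-cell units are assumptions of this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h):
    """Turn raw network outputs into a box, in grid-cell units."""
    bx = cell_x + sigmoid(tx)     # offset from the cell's left edge, in (0, 1)
    by = cell_y + sigmoid(ty)     # offset from the cell's top edge, in (0, 1)
    bw = anchor_w * math.exp(tw)  # anchor width rescaled
    bh = anchor_h * math.exp(th)  # anchor height rescaled
    return bx, by, bw, bh


print(decode_box(0.0, 0.0, 0.0, 0.0, 3, 4, 2.0, 1.5))  # (3.5, 4.5, 2.0, 1.5)
```

With all raw outputs at zero, the box sits in the middle of cell (3, 4) and has exactly the anchor's size; no raw value can move the centre out of its cell.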

4.3.4. YOLOv3

To keep translation variance low, SSD uses early layers to create the large-scale feature maps that are specifically used to find smaller objects. However, the feature maps generated by early layers are not complex enough, which results in poor performance on smaller objects. To address this issue, YOLOv3 (Redmon and Farhadi, 2018) enhances detection accuracy by borrowing the notion of the residual network. It is a one-stage method that is also efficient in terms of detection speed. The architecture of YOLOv3 is depicted in Fig. 14. Using Darknet-53 without its last three layers, it generates a small-scale feature map whose resolution is 32 times lower than that of the original image. This small-scale feature map is employed to detect big objects. It is then up-sampled and concatenated with a feature map generated by earlier layers to form a large-scale feature map, as opposed to SSD, which takes its large-scale feature maps directly from the earlier layers. For the detection of small-sized objects, this large-scale feature map, which combines position information from earlier layers with complex features from deeper layers, is employed. The three feature map scales are down-sampled by factors of 8, 16, and 32 relative to the original image. Whereas softmax supports only single-label classification, YOLOv3 performs multi-label classification for each bounding box by using a separate sigmoid function for each class.

Figure 14: YOLOv3 architecture (Redmon and Farhadi,2018).
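Two points from the description above, the three detection scales and the use of independent sigmoids instead of a softmax, can be sketched briefly. The 416-pixel input size, the example logits, and the 0.5 threshold are assumptions for illustration.

```python
import math


def grid_sizes(input_size, strides=(8, 16, 32)):
    """Side length of the feature map at each down-sampling factor."""
    return [input_size // s for s in strides]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label(logits, threshold=0.5):
    """Independent per-class decisions: several classes may be active."""
    return [i for i, z in enumerate(logits) if sigmoid(z) > threshold]


print(grid_sizes(416))                # [52, 26, 13]
print(multi_label([2.0, 1.0, -3.0]))  # [0, 1]
```

A softmax over the same logits forces the scores to sum to one, so it can only favour a single class, whereas the per-class sigmoids allow overlapping labels.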

4.3.5. YOLOv4

YOLOv4 (Bochkovskiy et al., 2020) is an improved version of YOLOv3. It applies a single neural network to the entire image, splitting the image into regions and predicting bounding boxes and probabilities for each region. Bag-of-Specials (BoS) and Bag-of-Freebies (BoF) are two distinct sets of techniques used with the Cross Stage Partial Darknet53 backbone to improve performance. The trade-off between the two affects the overall efficiency: BoS increases the inference cost by a small amount while considerably enhancing object detection accuracy, whereas BoF raises only the cost of training while keeping the cost of inference low. The neck of YOLOv4 has an improved Spatial Pyramid Pooling, which creates a fixed-size output independent of input size.
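The fixed-size property of spatial pyramid pooling can be shown with a single-channel sketch of the basic idea: the map is max-pooled into 1 x 1, 2 x 2, and 4 x 4 grids and the results are concatenated. The pyramid levels and bin rule are assumptions, and the improved variant used in the YOLOv4 neck may differ in detail.

```python
def ceil_div(a, b):
    return -(-a // b)


def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a 2D map into n x n bins for each level and concatenate."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0, r1 = (i * h) // n, ceil_div((i + 1) * h, n)
            for j in range(n):
                c0, c1 = (j * w) // n, ceil_div((j + 1) * w, n)
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


small = [[r * 8 + c for c in range(8)] for r in range(6)]
large = [[r * 13 + c for c in range(13)] for r in range(13)]
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))  # 21 21
```

Both inputs produce 1 + 4 + 16 = 21 values per channel, regardless of their spatial size.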

5. State-of-the-Art for Weapon Detection Methods

Weapon detection has emerged as a captivating topic in the field of object detection. Various systems have been developed for detecting weapons like pistols, rifles, and knives, each with its own set of advantages and limitations. Misclassification, intra-class detection, dynamic backgrounds, occlusion, and varying illumination are amongst the major issues which make the identification of handguns, knives, and other weapons difficult. A thorough examination of several techniques has been carried out, covering weapon detection systems from the earliest to the most recent models, as shown in Fig. 15.

Figure 15: A classiﬁcation of computer vision algorithms for detecting weapons.

In the literature, several algorithms have been suggested like Harris interest point detector (Harris et al.,1988),

SIFT (Lowe,1999), SURF (Bay et al.,2006) and deep learning techniques like Faster R-CNN (Ren et al.,2015), SSD

(Liu et al.,2016), and YOLO (Redmon et al.,2016) for the detection task. The generalized weapon detection model

is shown in Fig. 16.

Figure 16: Basic model of computer vision-based weapon detection system.

5.1. Weapon Detection using Classical Machine Learning Methods

Żywicki et al. (2011) used Haar cascades to identify dangerous objects such as knives. Positive and negative sample images were used in the training phase to demonstrate the presence and absence of the target item, respectively. A total of 1,560 positive and 6,518 negative samples were used. The positive samples varied in angle, illumination, dynamic background, knife in hand, variety of blades, and grip. To improve the performance, three training sets were constructed in the experiment, among which the third gave the best result. However, the results were not satisfactory, as the true positive rate was 46%, which is too low for real-time detection.

Glowacz et al. (2015) introduced an Active Appearance Model (AAM) based method for detecting objects such as knives. The goal was to determine whether or not a knife could be seen in the given image. They utilised the Harris corner detection method (Harris et al., 1988) to identify the tip of the knife. The number of detected corners depended on a pre-defined threshold. All knife tips were identified at the lowest threshold of 204, and, over all images, the mean value of the threshold at which the knife tip was tagged as a corner was 217. The overall classification accuracy of this model was 92.50%. However, AAMs are not invariant to rotation, and the method works only if the tip of the knife is visible in the images. Kmieć et al. (2012) introduced an approach employing the Harris corner detection technique and an AAM initialised with shape-specific interest points. The model failed to recognize the knife in three out of 40 positive images. This method also works only when the tip of the knife is visible in the image.

Tiwari and Verma (2015a) used the Harris Interest Point Detector (HIPD) and Fast Retina Keypoint (FREAK) to develop a new approach for detecting firearms. The hybrid technique comprised colour-based segmentation, to eliminate irrelevant objects or colours from the image, followed by the Harris Interest Point Detector and FREAK to detect the gun. For colour-based segmentation, the K-means clustering method was used. Morphological processing was applied to each image in order to extract boundaries and close tiny gaps. To measure the resemblance to a gun, the interest point features of the object boundary were extracted and compared to the stored description. When the similarity score exceeded 50%, the system raised a warning. The model was assessed in terms of accuracy on various test images, including negative images, and achieved an overall accuracy of 84.26%.

Later on, Tiwari and Verma (2015b) enhanced their work by proposing a technique for detecting firearms based on SURF. The extracted features of the object boundary were compared to the stored descriptions to measure the resemblance to a gun, and the system raised a warning when the resemblance exceeded 50%. The authors also discussed several challenges, such as gun rotation, orientation, and variation, as well as light, shadow, noise, real-time processing power, information loss owing to the 3D to 2D transformation, partial or complete occlusion of the gun, and deformation. Morphological closing and boundary extraction were then performed, resulting in an image that displays the general structure of the item, covered by a rectangular box, while hiding the interior details. Although SURF feature extraction is not faster than other techniques like Harris and SIFT, it can handle images irrespective of scale, orientation, or other characteristics. SURF first finds interest points (such as corners and blobs) using the Hessian matrix and then produces a descriptor for each. The SURF features of the object border were used to compare the shapes of items, and a similarity score was calculated between the stored description of the gun and that of the blob. A total of 25 images were used, of which 15 were positive samples. Overall, the true positive rate of the model was 86.67%. However, since these systems were time-consuming and complicated, they could not be used for real-time weapon detection.

The comparative analysis of classical machine learning methods in terms of true positive rate is shown in Fig. 17. The graph shows that the AAM-based methods outperform the others, with a true positive rate of 92.50%.

Figure 17: Performance analysis between classical machine learning methods.

A detailed description of the work based on classical machine learning methods is summarised in Table 4, which lists the specifics of the datasets and the results obtained. From the results, it can be concluded that the approaches of Kmieć et al. (2012) and Glowacz et al. (2015) resulted in the highest true positive rates.

Table 4: Detailed results based on classical machine learning methods.

| Publication | # Images in positive test set | # Correctly classified positive images | True positive rate | # Images in negative test set | # Misclassified negative images | False positive rate | Classification accuracy |
|---|---|---|---|---|---|---|---|
| Żywicki et al. (2011) | 1,560 | - | 46.00% | 6,518 | - | - | - |
| Kmieć et al. (2012) | 40 | 37 | 92.50% | 40 | 0 | 0% | 92.50% |
| Tiwari and Verma (2015a) | 65 | 54 | 83.07% | 24 | 3 | 8.33% | 84.26% |
| Glowacz et al. (2015) | 40 | 37 | 92.50% | 40 | 0 | 0% | 92.50% |
| Tiwari and Verma (2015b) | 15 | 13 | 86.67% | 10 | 0 | 0% | 86.67% |
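The rate columns of Table 4 follow directly from the count columns. The sketch below recomputes them for two rows; the function names are illustrative, and recomputed values for other rows may differ slightly from the published ones because of rounding or differences in how the counts were reported.

```python
def true_positive_rate(correct_positive, total_positive):
    """Share of positive test images that were classified correctly, in %."""
    return 100.0 * correct_positive / total_positive


def false_positive_rate(misclassified_negative, total_negative):
    """Share of negative test images that were flagged as positive, in %."""
    return 100.0 * misclassified_negative / total_negative


# (positives, correct positives, negatives, misclassified negatives)
rows = {
    "Kmieć et al. (2012)": (40, 37, 40, 0),
    "Tiwari and Verma (2015b)": (15, 13, 10, 0),
}
for name, (pos, correct, neg, wrong) in rows.items():
    print(name,
          round(true_positive_rate(correct, pos), 2),
          round(false_positive_rate(wrong, neg), 2))
```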

Classical machine learning methods require considerably more human intervention to produce results. Their datasets are also of limited reliability: the gun typically occupies most of the image, which does not reflect real-life incidents involving a handgun. As a result, these systems are not suitable for continuous monitoring in situations where the images retrieved from CCTV recordings are complicated by many variables, or where open areas contain a large number of objects. In such complicated scenarios, these conventional methods fail to provide adequate accuracy for weapon detection.

In traditional machine learning methodologies, most of the relevant features must be specified by a domain expert in order to reduce computational complexity and make patterns more visible to the learning algorithm. The major advantage of deep learning algorithms is that they learn high-level features from data in an incremental manner. This reduces the complexity of feature extraction and lowers the need for domain expertise in real-time applications.

5.2. Weapon Detection using Two-stage Deep Learning methods

The most often used measurements in computer vision are True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). Here, a positive indicates the presence of a weapon in the input image. TP is the number of images correctly labeled as positive by the classifier. FP is the number of negative images incorrectly labeled as positive. TN is the number of correctly classified negative images, while FN is the number of positive images missed by the classifier. The performance parameters, namely accuracy, precision, recall, and F1 score, are given by Eq. 6, Eq. 7, Eq. 8, and Eq. 9, respectively:

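$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{6}$$

$$\text{precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{8}$$

$$\text{F1 score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{9}$$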
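Eq. 6 to Eq. 9 translate directly into code. In the sketch below the confusion-matrix counts are hypothetical, while the precision and recall pairs passed to `f1_score` are the values reported in Table 5 for Olmos et al. (2018) and Gelana and Yadav (2019).

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # Eq. 6


def precision(tp, fp):
    return tp / (tp + fp)                    # Eq. 7


def recall(tp, fn):
    return tp / (tp + fn)                    # Eq. 8


def f1_score(p, r):
    return 2 * p * r / (p + r)               # Eq. 9


# Hypothetical counts, for illustration only.
tp, fp, fn, tn = 8, 2, 2, 8
print(accuracy(tp, fp, fn, tn), precision(tp, fp), recall(tp, fn))

# Reported precision and recall (in %) reproduce the reported F1 scores.
print(round(f1_score(84.21, 100.0), 2))  # 91.43
print(round(f1_score(99.45, 94.21), 2))  # 96.76
```

Because Eq. 9 is a harmonic mean, a perfect recall cannot compensate for a modest precision: with 100% recall and 84.21% precision the F1 score stays near 91%.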

Olmos et al. (2018) designed an automated method for the detection of handguns to facilitate monitoring and control systems. The authors reframed the problem of weapon detection as a problem of reducing false positives, and addressed it by (i) building a key training dataset guided by the results of the DCNN classifier, and (ii) comparing two approaches, namely the sliding window approach and the region proposal approach, to determine the best classification model. The dataset used in the experiment contains 3,000 images of short guns with rich background detail. The Faster R-CNN model with VGG-16 as a feature extractor produced the most promising results. The automatic alarm system activates the alarm after five consecutive true positives, and it was evaluated on thirty scenes. They also established a metric called Alarm Activation Time per Interval (AATpI) to assess the performance of the detection model. The model correctly identified the gun in 27 scenes, with an average AATpI of 0.2 s. However, in three scenes, the detector failed to detect handguns due to factors such as low contrast and poor brightness of the frames, fast movements of the gun, or the gun not being visible in the foreground of the image. The precision of this model was 84.21%, with an F1 score of 91.43%. As illustrated in Fig. 18, VGG-16, which is used as the feature extractor, contains 13 convolution layers and 3 fully connected layers.

Figure 18: Architecture of VGG-16 used in Olmos et al. (2018).
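The alarm rule described above, triggering only after five consecutive positive detections, can be sketched as a simple counter over per-frame detector outputs. The function name and the Boolean frame encoding are assumptions of this illustration.

```python
def alarm_frame(detections, required=5):
    """Return the index of the frame at which the alarm fires, or None.

    detections: iterable of booleans, one per frame (True = gun detected).
    """
    streak = 0
    for index, detected in enumerate(detections):
        streak = streak + 1 if detected else 0
        if streak >= required:
            return index
    return None


frames = [True, True, False, True, True, True, True, True]
print(alarm_frame(frames))        # 7
print(alarm_frame([True] * 4))    # None
```

Requiring a run of detections suppresses isolated false positives at the cost of a short delay before the alarm is raised.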

Verma and Dhillon (2017) utilized transfer learning to identify guns using a deep convolutional network and a state-of-the-art region-based CNN model. As a feature extractor, the system uses a CNN-based VGG-16 architecture, followed by state-of-the-art classifiers trained on a standard gun database. The performance of the model was evaluated in a variety of situations, including different backgrounds with firearms, occlusion, and so on. The results show that SVM (Hearst et al., 1998) outperformed the other classifiers with a classification accuracy of 92.60%, and the total accuracy was 93.10%. However, the model was trained on a single CPU, which made training time a major concern.

Gelana and Yadav (2019) proposed an image processing and machine learning-based weapon identification model. Their model consisted of six main elements: (i) RGB to gray-scale conversion, used to reduce the complexity of each frame and speed up the background subtraction process; (ii) background subtraction, for which three alternative techniques were used: the visual background extractor (Barnich and Van Droogenbroeck, 2011), the improved Gaussian mixture model (Zivkovic, 2004), and the frame-difference background subtraction algorithm; (iii) a filtering operation, in which dilation and erosion were applied to the extracted foreground object to eliminate tiny white noise caused by illumination fluctuations and to connect disjoint parts in an image; (iv) segmentation/edge detection, for which the well-known Canny edge detection method (Canny, 1986) was employed, taking the filtered foreground object as input and returning its edge information; (v) a sliding window approach, which substantially reduces the area evaluated by the learning algorithm, with the window size and slide step chosen after several tests and subject to alteration in the future; and (vi) a TensorFlow-based CNN, used to classify an item as either a threat (gun) or a non-threat (non-gun). The dataset comprised 4,000 negative and 1,869 positive images. After applying a 30% split to this dataset, 1,758 images (585 positive and 1,173 negative) were used to test the algorithm. The most essential requirement in weapon detection was to reduce the frequency of false positives while maintaining detection sensitivity. The approach described in this work had a specificity of 99.73% for images containing non-gun items and a detection accuracy of 93.84% for images containing gun objects.
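Three of the six stages (gray-scale conversion, frame-difference background subtraction, and the sliding window) can be sketched in plain Python on a toy frame. The luma weights, threshold, and window parameters are assumed values, and the morphological filtering, Canny edge detection, and CNN classifier are deliberately left out.

```python
def to_gray(frame):
    """Convert an RGB frame (rows of (r, g, b) tuples) to gray-scale."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in frame]


def foreground_mask(gray, background, threshold=25):
    """Frame-difference background subtraction: 1 where a pixel changed."""
    return [[1 if abs(p - b) > threshold else 0 for p, b in zip(row, bg_row)]
            for row, bg_row in zip(gray, background)]


def candidate_windows(mask, win, step, min_foreground):
    """Slide a win x win window and keep positions with enough foreground."""
    h, w = len(mask), len(mask[0])
    keep = []
    for top in range(0, h - win + 1, step):
        for left in range(0, w - win + 1, step):
            count = sum(mask[r][c]
                        for r in range(top, top + win)
                        for c in range(left, left + win))
            if count >= min_foreground:
                keep.append((top, left))
    return keep


# Toy 8x8 scene: black background with one bright 2x2 object.
background = [[0] * 8 for _ in range(8)]
frame = [[(0, 0, 0)] * 8 for _ in range(8)]
for r in (4, 5):
    for c in (4, 5):
        frame[r][c] = (255, 255, 255)

mask = foreground_mask(to_gray(frame), background)
print(candidate_windows(mask, win=4, step=4, min_foreground=4))  # [(4, 4)]
```

Only one of the four window positions reaches the classifier stage, which illustrates how background subtraction reduces the area the learning algorithm has to evaluate.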

Castillo et al. (2019) developed an automated cold steel weapon identiﬁcation model for video surveillance that

was based on a new brightness-directed preprocessing technique termed Darkening and Contrast at Learning and

Test stages (DaCoLT) that enhances detection quality. The Faster R-CNN with Inception-ResNet-V2 (Szegedy et al.,

2017) was the most accurate model with an F1 score of 95%. However, with a frame rate of 1.3 frames per second, it

was not suited for near-real-time operations.

Olmos et al. (2019) proposed a binocular image fusion technique for reducing the frequency of false positives in the identification of firearms in surveillance videos. They used the dataset of 3,000 weapon images created by Olmos et al. (2018) for training and testing. They compared the performance of Faster R-CNN with and without image fusion using four feature extractors, i.e., VGG-16, ResNet, Inception-ResNetv2, and Neural Architecture Search (NAS). Faster R-CNN (VGG-16 + ImageNet) achieved much greater accuracy, precision, recall, and F1 score than the previously existing methods, with the highest overall accuracy of 80.62%. However, most cameras in CCTV systems are not dual cameras, so this method would not be appropriate for most retail establishments.

Pérez-Hernández et al. (2020) proposed a method utilising a binarization approach to improve the robustness, precision, and reliability of small object recognition. To enhance detection accuracy in videos, they recommended a two-level deep learning-based approach called Object Detection using Binary Classifiers, in which the first level selects candidate regions from the input frame, while the second level applies a CNN classifier based on One-Versus-All (OVA) and One-Versus-One (OVO) binarization techniques. The database covers six object classes: firearm, knife, smartphone, bill, purse, and card. The experimental study shows that the suggested technique reduces the incidence of false positives when compared to the baseline multi-class detection model. However, because this model was complicated and time-intensive, it could not be used to identify guns in real time. The dataset collection had 560 images in total. As indicated in Table 5, the OVO model achieved the highest precision of 93.87%.
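The two binarization schemes differ in how many binary problems they create from the six classes. The sketch below enumerates them and shows a simple majority vote for OVO; the voting rule and the stand-in pairwise classifier are assumptions for illustration and are not claimed to be the aggregation used by the authors.

```python
from itertools import combinations

CLASSES = ["firearm", "knife", "smartphone", "bill", "purse", "card"]


def ova_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, "rest") for c in classes]


def ovo_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def ovo_vote(pairwise_winner, classes):
    """Each pairwise classifier votes once; the most voted class wins."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b)] += 1
    return max(classes, key=lambda c: votes[c])


def prefer_knife(a, b):
    """Stand-in for trained pairwise classifiers (hypothetical)."""
    return "knife" if "knife" in (a, b) else a


print(len(ova_tasks(CLASSES)), len(ovo_tasks(CLASSES)))  # 6 15
print(ovo_vote(prefer_knife, CLASSES))                   # knife
```

For K classes, OVA trains K classifiers and OVO trains K(K-1)/2, which is one source of the additional computational cost noted above.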

Iqbal et al. (2021) proposed a weakly supervised Orientation Aware Object Detection (OAOD) approach that is trained using Axis-Aligned Bounding Boxes (AABB) and learns to predict oriented object bounding boxes (OBB). The proposed OAOD differs from previous oriented object detectors in that it does not require OBB annotations during training, which may not always be available. To achieve the goal of training on AABB and detecting OBB, a multi-stage method was utilised, with Stage-1 estimating the AABB and Stage-2 estimating the OBB. The ITUF weapon dataset presented by Iqbal et al. (2021) contains 10,973 images of firearms and rifles. The ITUF dataset was used to compare the OAOD technique with other state-of-the-art techniques, such as fully supervised oriented object detectors. The overall mAP obtained on AABB was 88.30% and the mAP on OBB was 77.50%. However, because the model was computationally expensive and the mAP was quite low, it could not be used for real-time gun detection.
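The relation between the two box types can be illustrated geometrically: every oriented box has a unique enclosing axis-aligned box, whereas the reverse direction is ambiguous, which is what makes learning OBB from AABB annotations a weakly supervised problem. The box parameterisation (centre, width, height, angle in degrees) is an assumption of this sketch.

```python
import math


def obb_corners(cx, cy, w, h, angle_deg):
    """Corner points of a box rotated by angle_deg around its centre."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners


def enclosing_aabb(corners):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around points."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


for angle in (0, 45, 90):
    box = enclosing_aabb(obb_corners(0.0, 0.0, 4.0, 2.0, angle))
    print(angle, [round(v, 2) for v in box])
```

For an elongated object such as a rifle held diagonally, the axis-aligned box is much larger than the oriented one and includes a lot of background.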

González et al. (2020) used Faster R-CNN with an FPN based on ResNet-50 on a new dataset collected from a real CCTV system installed on a university campus. They further generated synthetic images for use in quasi real-time CCTV detection. The FPN architecture achieved a precision of 88.12%. However, the synthetic dataset did not provide satisfactory results, so it could not be relied upon for training or testing the model.

Kaya et al. (2021) presented a novel deep learning-based approach that utilizes VGG-16, ResNet-101, ResNet-50, and a proposed seven-layer CNN model to detect seven distinct weapons. The dataset contains 5,214 weapon images split into seven categories: assault rifles, knives, bazookas, hunting rifles, pistols, grenades, and revolvers. The system was developed with 3,128 images for training and 1,043 images each for validation and testing. The proposed model was found to be 98.40% accurate. However, the model was capable of identifying only a few types of weapons. The proposed method was not very complicated, but it was computationally very slow.

Galab et al. (2021) demonstrated how to improve knife detection in surveillance systems under varying brightness using an adaptive method. Based on the preprocessing Brightness Handler procedure (BHp), they compared four CNN architectures: AlexNet, VGGNet, GoogLeNet, and ResNet. AlexNet with BHp produced excellent outcomes, with 96.95% accuracy. However, AlexNet is an early CNN with five convolutional layers and is quite slow in contrast to current CNN models. The model takes input images of 227 x 227 pixels, which means that the weapon must cover the majority of the image.

Fig. 19 shows the performance analysis of two-stage deep learning methods in terms of accuracy, precision, recall, and F1 score. It may be inferred from the graph that the model developed by Kaya et al. (2021) achieved the best accuracy, 98.40%, whereas Galab et al. (2021) secured the highest F1 score, 98.42%. The models developed by Castillo et al. (2019) and Galab et al. (2021) have the highest precision values. On the other hand, the models developed by Olmos et al. (2018) and González et al. (2020) achieved the highest recall values.

Figure 19: Comparative analysis of performance of two-stage deep learning methods.

Table 5 shows a detailed comparative analysis of the outcomes of two-stage deep learning methods in terms of accuracy, precision, recall, and F1 score.

Table 5: Comparison of detection results based on two-stage deep learning methods.

| Authors | Data Specifications | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|---|
| Verma and Dhillon (2017) | - | 93.10 | - | - | - |
| Olmos et al. (2018) | 3,000 | - | 84.21 | 100 | 91.43 |
| Castillo et al. (2019) | 19,379 | - | 100 | 78.55 | 87.44 |
| Gelana and Yadav (2019) | 5,869 | 97.78 | 99.45 | 94.21 | 96.76 |
| Olmos et al. (2019) | 3,000 | 80.62 | 92.68 | 80 | 85.88 |
| Pérez-Hernández et al. (2020) | 5,680 | - | 93.87 | 93.09 | 93.43 |
| González et al. (2020) | 7,649 | - | 88.12 | 100 | 93.68 |
| Kaya et al. (2021) | - | 98.40 | 99.28 | 95.97 | 92.89 |
| Iqbal et al. (2021) | 10,973 | - | 88.30 | - | - |
| Galab et al. (2021) | 12,899 | 96.95 | 100 | 96.80 | 98.42 |
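The metrics compared in Table 5 are related: the F1 score is conventionally defined as the harmonic mean of precision and recall. The following minimal sketch shows that relationship and how precision and recall follow from detection counts. The function names are illustrative and do not come from any of the surveyed papers. Reported values that were averaged per class or taken from different configurations may not satisfy the identity exactly.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F1 (in percent) from true/false positive and false negative counts."""
    precision = 100 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100 * tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Precision and recall listed for Olmos et al. (2018) in Table 5.
print(round(f1_from_precision_recall(84.21, 100.0), 2))  # 91.43
```

With the precision and recall listed for Olmos et al. (2018), the function returns the F1 score given in the same row of the table.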

5.3. Weapon Detection using One-stage Deep Learning methods

Narejo et al. (2021) created a smart surveillance security system that identifies weapons, especially firearms. For that purpose, they trained a YOLOv3 model with a Darknet-53 backbone. They manually collected a large number of photos from Google, with approximately 50 pictures for each weapon class. The overall accuracy of the model was 98.89%, but precision and F1 score were not measured, and the total number of images in the collection was not provided.

Romero and Salamea (2019) developed a system to address existing issues and divided its operation into two parts. The front end limits the area of interest, while the back end detects the weapon within the area passed on by the front end. The authors created a database comprising 17,684 images from various movies, with firearms (class A) and without firearms (class B). Using augmentation (rotating and flipping), the image dataset was expanded by a further 229,892 images (from 17,684 to 247,576). In the front end, the authors employed YOLO for real-time object recognition and localization. YOLO was trained on the COCO dataset to recognize people while ignoring the rest of the image, which reduces the complexity of the system and, hence, the probability of false positives. In the back end, the VGG-Net and ZFNet models were used to identify weapons. Grayscale pictures were also useful in enhancing the efficiency of the system. If the individual in a bounding box does not have a weapon, that bounding box is eliminated, narrowing the area of concern. Overall, the system achieved 86% recall and 90.80% accuracy. Fig. 20 depicts the operation of the weapon detection system proposed by Romero and Salamea (2019).

Figure 20: System architecture of Romero and Salamea (2019).
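The front-end/back-end split described for Romero and Salamea (2019) can be illustrated with a minimal sketch. It is not the authors' implementation: `person_detector` and `weapon_classifier` are hypothetical callables standing in for the YOLO person detector and the VGG-Net/ZFNet classifier, and the 0.5 threshold is an assumption made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]      # (x1, y1, x2, y2) in pixels
Image = Sequence[Sequence[float]]    # grayscale image as rows of pixel values


def crop(image: Image, box: Box) -> List[List[float]]:
    """Cut the region covered by a bounding box out of the image."""
    x1, y1, x2, y2 = box
    return [list(row[x1:x2]) for row in image[y1:y2]]


def detect_armed_people(
    image: Image,
    person_detector: Callable[[Image], List[Box]],
    weapon_classifier: Callable[[List[List[float]]], float],
    threshold: float = 0.5,
) -> List[Tuple[Box, float]]:
    """Front end proposes person boxes; back end keeps only boxes scored as armed."""
    kept = []
    for box in person_detector(image):
        score = weapon_classifier(crop(image, box))
        if score >= threshold:  # boxes without a weapon are eliminated
            kept.append((box, score))
    return kept


# Toy stand-ins (assumptions): two fixed person boxes and a "classifier"
# that simply returns the brightest pixel of the crop as its score.
image = [[0.0] * 4 for _ in range(4)]
image[1][1] = 0.9
toy_detector = lambda img: [(0, 0, 2, 2), (2, 2, 4, 4)]
toy_classifier = lambda patch: max(max(row) for row in patch)

alerts = detect_armed_people(image, toy_detector, toy_classifier)
print(alerts)  # [((0, 0, 2, 2), 0.9)]
```

Restricting the classifier to person crops is what narrows the area of concern: the second box is discarded because its crop scores below the threshold.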

Cardoso et al. (2019) used the CNN-based YOLO object detector to detect guns in images. The idea was tested on a database of 608 images, including 304 with weapons. The experiments yielded an accuracy of 89.15%. However, the number of images in the dataset was very low; hence, the approach was found infeasible for real-time detection.

Jain et al. (2020) used the SSD and Faster R-CNN algorithms to develop CNN-based automated weapon identification. Faster R-CNN achieved the better accuracy of 84.60%, whereas SSD achieved an accuracy of only 73.80%. Owing to its higher speed, SSD provided real-time detection, but Faster R-CNN achieved higher accuracy. In a fully automated system, a person in charge double-checks every gun detection alert; the reported speeds of 0.73 fps (SSD) and 1.606 fps (Faster R-CNN) are too slow for real-time detection.

Salido et al. (2021) compared three CNN models for automatic identification of pistols in video surveillance. The goal was to see whether including posture information about the way firearms were held in the training dataset would reduce false positives. The findings showed that RetinaNet fine-tuned with an unfrozen ResNet-50 backbone had the greatest average precision (96.36%) and recall (97.23%), while YOLOv3 had the highest precision (96.23%) and F1 score (93.36%) when trained on the dataset with posture information. Using YOLOv3, the numbers of false positives and false negatives were 8 and 21, respectively, which is quite high given the small dataset and low-resolution images.

Singh et al. (2021) presented a computer vision-based method for identifying firearms using YOLOv4. The dataset used to train the model included images of knives, swords, pistols, machine guns, shotguns, and other weapons, which were combined into a single weapon class. The model achieved a mean Average Precision (mAP) of 77.75% and an average loss of 1.314. However, mAP alone is an insufficient measure of real-time weapon identification performance.

Bhatti et al. (2020) used two methodologies: sliding window and region proposal/object detection. The algorithms employed included VGG16, Faster R-CNN, Inception-ResnetV2, SSD, MobileNetV1, Inception-V3, YOLOv3, and YOLOv4. A total of 8,327 images comprising pistol and non-pistol classes were collected from various sources, of which 7,328 were used for training and the remaining 999 for testing. YOLOv4 outperformed all other algorithms, achieving an F1 score of 91% and a mean average precision (mAP) of 91.73%. The numbers of false positives and false negatives were still relatively high, at 54 and 52, respectively.
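To make the sliding window methodology concrete, the sketch below enumerates the fixed-size windows that a classifier would be applied to. The window size and stride are hypothetical values chosen for illustration and are not the settings used by Bhatti et al. (2020).

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def sliding_windows(width: int, height: int, window: int, stride: int) -> Iterator[Box]:
    """Yield square windows that fit inside an image of the given size, row by row."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)


# Toy example: an 8 x 6 image scanned with a 4 x 4 window and a stride of 2.
boxes = list(sliding_windows(width=8, height=6, window=4, stride=2))
print(len(boxes))  # 6
print(boxes[0], boxes[-1])  # (0, 0, 4, 4) (4, 2, 8, 6)
```

Each window would be cropped and passed to an image classifier. Because the number of windows grows quickly with image size and with the number of scales tried, this approach tends to be more expensive than letting a detector propose regions directly.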

Lamas et al. (2022) presented a reproducible and traceable top-down methodology for weapon detection over pose estimation, which exploits the human presence in scenarios where a person carries a weapon, such as a firearm or knife. Four detection models covering two types of detection architecture were selected to evaluate the approach: Faster R-CNN, a two-stage detector based on ResNet101, and three one-stage detectors, namely SSD based on ResNet50, EfficientDet (Tompson et al., 2015) based on D3, and CenterNet (Duan et al., 2019). All deep learning architectures were trained on the Sohas weapon dataset; EfficientDet outperformed the others with a precision score of 94.4%. However, this method was able to detect human-handled weapons only.

Fig. 21 shows the performance analysis of one-stage deep learning methods in terms of accuracy, precision, recall, and F1 score. According to Fig. 21, the model created by Narejo et al. (2021) achieved the highest accuracy (98.89%), while the model developed by Salido et al. (2021) achieved the best precision score (96.23%).

Figure 21: Analysis of performance of one-stage deep learning methods in terms of accuracy, precision, recall and F1 score.

Fig. 22 shows the overall performance analysis of detection using deep learning methods in terms of accuracy and

precision parameters.

Figure 22: Analysis of performance of detection using deep learning methods in terms of accuracy and precision.

Table 6 presents a thorough comparison of one-stage deep learning models developed by different researchers in terms of accuracy, precision, recall, and F1 score.

Table 6: Comparison of weapon detection results based on one-stage deep learning methods.

| Authors | Positive Images | Negative Images | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|---|---|
| Romero and Salamea (2019) | 8,843 | 8,841 | 90.80 | 86.00 | 86.00 | 86.00 |
| Cardoso et al. (2019) | 3,000 | 6,857 | - | 89.15 | 100 | 94.26 |
| Jain et al. (2020) | - | - | 84.60 | - | - | - |
| Narejo et al. (2021) | - | - | 98.89 | - | - | - |
| Salido et al. (2021) | 1,220 | - | 90.09 | 96.23 | 90.67 | 93.36 |
| Singh et al. (2021) | - | - | - | 77.75 | - | - |
| Bhatti et al. (2020) | 3,073 | 5,254 | - | 93.00 | 88.00 | 91.00 |
| Lamas et al. (2022) | 3,000 | 14,684 | - | 94.40 | 91.50 | 92.90 |

6. Conclusion

In the area of security and surveillance, weapon detection is a significant application of computer vision. An automatic weapon detection system that responds quickly in potentially dangerous situations benefits public safety. This survey showcases several conventional weapon detection systems based on machine learning as well as the most advanced deep learning techniques. The field began with manually operated systems and has progressed to completely automated and sophisticated technologies. Numerous conventional weapon detection techniques have been developed, viz. HIPD, AAMs, SIFT, SURF, FREAK, and many more, among which AAMs have emerged as the preeminent. Although the many applications of these conventional techniques have been reviewed in the past, none has so far emerged as an effective technique, owing to imprecision in detecting tiny objects against complex backgrounds and under partial occlusion. Classical methods require manual intervention for extracting features and are therefore not very precise for weapon recognition (Krizhevsky et al., 2012). This opens a window for the development of deep learning architectures capable of automatically discovering higher-level features from input images. Such architectures offer speed and accuracy and are used in real-time applications, viz. self-driving cars (Maqueda et al., 2018), natural language processing (Worsham and Kalita, 2020), face detection (Zhan et al., 2016), speech recognition (Nassif et al., 2019), text recognition (Roy et al., 2017), and disease diagnosis (Ma et al., 2021; Hu et al., 2018). Additionally, a wide literature is available in the domain of DCNN and transfer learning methods incorporating multiple models (one-stage and two-stage) such as Faster R-CNN, VGG-Net, ZFNet, and YOLOv3. Among one-stage deep learning methods, YOLOv3 has higher precision and shows better performance than the others. Among two-stage methods, the Faster R-CNN architecture achieved the highest precision.

Table 7 summarizes the important findings of the study. The table includes the following information: (a) the name of the method used, (b) the strengths of the method, (c) the problems encountered in weapon detection, and (d) the respective publications and remarks.

Table 7: An overview of the survey’s major results using the classical machine learning approach and deep learning approach.

| Methods | Strengths | Issues | Publications and Remarks |
|---|---|---|---|
| Haar Cascades | The accuracy of the cascade improves with an increased number of positive and negative sample images. | The findings are unsatisfactory due to the low true positive rate obtained. | Zywicki et al. (2011) observed that increasing the Haar scale coefficient reduced the frequency of incorrectly detected knives. As previously stated, the results obtained for this cascade are unsatisfactory. |
| AAMs | According to the test results, this technique outperforms other classical machine learning algorithms, with a true positive rate (TPR) of 92.50%. | This method is not rotation invariant. The technique works only if the knife tip is visible in the images. | Glowacz et al. (2015) present AAMs as a weapon detection tool. Later on, Kmieć et al. (2012) enhanced their findings by using the Harris corner detection approach in their own work. |
| HIPD and FREAK | The K-means clustering method is applied along with HIPD and FREAK for color-based segmentation, which results in higher accuracy. | This is only useful if the gun is entirely visible in the scene. The approach fails when the gun is only partially visible or the image is blurred. | This approach was adopted by Tiwari and Verma (2015a). Additionally, the similarity score surpasses 50% after the alert mechanism is applied. |
| SURF | The findings of previously used methodologies are improved in terms of accuracy. | The computational time is too high for real-time detection. | Tiwari and Verma (2015b) used SURF methods for the classification task and improved the efficiency. |
| SVM with VGG-16 | For feature extraction, VGG-16 is a very effective and commonly used deep learning architecture. It improves the extraction of high-level features from images. | SVM is a classification method that is slow as well as complex. CNN techniques produce better results than the SVM method. | Gelana and Yadav (2019) used this method with a variety of techniques, such as Canny edge detection, an enhanced Gaussian mixture model, and others, to achieve results. However, the approach is slow and complicated, which makes it unsuitable for real-time detection. |
| Faster R-CNN | This approach improves the precision of small weapon detection using a two-stage deep learning approach. | Despite its higher precision, it is a time-consuming, complex, and computationally expensive method. | Various studies employed Faster R-CNN with a variety of backbone networks, including VGG-16 (Olmos et al., 2018), Inception ResNetv2 (Castillo et al., 2019), FPN with ResNet-50 (González et al., 2020), and others, to obtain better results. Instead of using a selective search method, it uses an RPN to generate region proposals, which makes it much faster than R-CNN and Fast R-CNN. |
| SSD based CNN | Localization and classification tasks are completed in a single forward pass across the network, which results in much faster detection. | It can analyze video at a rate of 0.75 frames per second. However, it does not perform well with small objects like handguns, because it uses the first convolution layers to create high-level feature maps. | This approach, along with Faster R-CNN, was suggested by Jain et al. (2020). SSD delivers much faster performance but achieves lower accuracy. |
| YOLO and its versions | YOLOv3 and YOLOv4 are faster than any of the deep learning architectures currently available. This method generates high-level feature maps by using both previous and subsequent layers, resulting in higher accuracy than SSD. It is the most effective approach for detecting firearms in real time. | YOLO and YOLOv2 are more effective for the identification of large objects. Importantly, they perform poorly when dealing with small objects. | Several researchers (Cardoso et al., 2019; Narejo et al., 2021; Romero and Salamea, 2019) use the YOLO architecture for weapon detection. YOLOv4 (Singh et al., 2021; Bhatti et al., 2020) outperforms the other enhanced versions of YOLO. |

Two-stage detectors exhibit better accuracy than single-stage detectors, as evidenced by their wide range of real-time applications, but the latter are more cost-effective. One-stage detectors are usually faster than two-stage ones because they use lightweight backbone networks, eliminate preprocessing algorithms, and consider fewer candidate regions for prediction. However, two-stage detectors can also run in real time when similar techniques are introduced. One-stage frameworks perform worse than two-stage architectures such as Faster R-CNN in the detection of small objects, but they compete fairly in the detection of large objects.

Certain challenges in the field of weapon detection still need to be addressed, such as the lack of datasets and the detection of weapons under a variety of lighting conditions. Table 8 provides a more in-depth discussion of these concerns.

Table 8: Several issues of weapon detection systems.

| Issues | Comments |
|---|---|
| The unavailability of real-time datasets | The datasets presented in a number of research papers are gathered from internet sources. Only a few available datasets (González et al., 2020) are obtained from closed-circuit television cameras. |
| Multiple weapon detection system | There is still a requirement for multiple weapon detection. This specific problem has been addressed in only a few studies (Verma and Dhillon, 2017; González et al., 2020; Salido et al., 2021). |
| The partial appearance of the weapon | Only a few studies have tackled the subject of partial occlusion of the weapon. Nonetheless, it is an important difficulty for weapon detection. |
| Weapon detection of different kinds | This is a significant problem to take into consideration. Only a few works (Olmos et al., 2018; Iqbal et al., 2021; González et al., 2020) are capable of detecting distinct sorts of guns. |

Although datasets have recently emerged, the lack of large and well-balanced datasets limits the development of deep learning algorithms that generalize well enough to be employed in automatic weapon detection systems. As the public datasets originate from a range of machines with different inherent architectures, domain adaptation techniques might help. Deep learning techniques can provide very productive outputs considering all the above-mentioned features, but models based on these techniques for real-time applications are still not at the forefront. The reason is the complex nature of the computations performed, as a large dataset is required for computing the output. Moreover, detectors based on deep learning generally contain a high number of parameters and are consequently data-hungry, requiring a powerful computing system to train the developed model.

Additionally, a device may be constructed using these algorithms for automatic weapon detection, which notifies security staff when it detects a weapon. The companies and organizations that supply security and surveillance systems would benefit from the implementation of an automated weapon detection system on internet of things (IoT) devices, such as smartphones and laptops. Human resource management and the creation of new products or applications are being transformed by machine learning and deep learning. This creates an environment appropriate for deep learning in open innovation and small and medium enterprises (SMEs) (Malo-Perisé and Merseguer, 2022; Alam and Ansari, 2020). According to the findings of Baierle et al. (2020), open innovation characteristics have a significant impact on the competitiveness of manufacturing SMEs in a Southern Brazilian area.

7. Future Scope

There is still a long way to go before a single robust deep learning technique is developed. Based on the survey presented in this paper, the following future directions are inferred for the automated identification of firearms:

i. Requirement of real-time dataset: A dedicated dataset for weapon detection is unavailable, and at present only a few real-time datasets exist. Usually, the datasets are gathered from virtual sources such as movies, games, and others, which raises concerns about the reliability of the data due to varying surrounding conditions, viz. illumination, viewing angle, and image resolution. The scarcity of real-time datasets is a major obstacle in the development of automatic weapon detection systems.

ii. A heterogeneous model: The existing methods are not entirely capable of detecting weapons of the same class that vary in shape and colour or appear against complex backgrounds. For example, different sensors used to capture images of guns or knives produce different intensity distributions for the same image. The same weapon image is mapped to different pixel resolutions under different imaging parameters, such as the size of the weapon. Thus, the development of a heterogeneous method is indispensable for a reliable automatic weapon detection system.

iii. Constructive use of contextual information: Objects in the visual world have complex relationships, and precise context is essential for comprehending them. Insufficient attention has been devoted to the appropriate use of contextual information in the object detection field. A guide to the precise and successful utilisation of this information might be a potential future avenue for visual software development.

iv. Detection of small objects: Another significant challenge in object detection is the identification of small objects such as firearms or knives, which is one of the shortcomings of existing deep learning methods. As a result, there is scope for developing techniques for small-sized objects.

v. Need for low-computing network: Deep networks comprise hundreds of millions of parameters, demanding large amounts of data as well as high-performance graphical processing units (GPUs) for training. This has motivated researchers to build small and lightweight networks that decrease or eliminate network redundancy. Such models can be operated effectively on small devices like smartphones and can be employed with the IoT.
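The parameter counts mentioned in the last point can be estimated by hand. The sketch below counts the learnable parameters of standard convolutional and fully connected layers. The three-layer network is a hypothetical toy example chosen only to show how quickly the total grows; it does not correspond to any model in this survey.

```python
def conv2d_params(in_channels: int, out_channels: int, kernel_size: int, bias: bool = True) -> int:
    """Learnable parameters of a standard 2-D convolution with a square kernel."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def dense_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Learnable parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Hypothetical toy network: two 3 x 3 convolutions followed by one dense layer
# applied to a flattened 128-channel, 56 x 56 feature map.
layer_counts = [
    conv2d_params(3, 64, 3),
    conv2d_params(64, 128, 3),
    dense_params(128 * 56 * 56, 1000),
]
total = sum(layer_counts)
print(layer_counts)  # [1792, 73856, 401409000]
print(total)         # 401484648
```

In this toy case almost all parameters sit in the single dense layer, which is one reason lightweight designs try to avoid large fully connected layers on high-resolution feature maps.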

Funding Information

This research did not receive any speciﬁc grants from funding agencies in the public, commercial, or non-proﬁt

sectors.

Conﬂict of Interest

The authors declare that they have no known competing ﬁnancial interests or personal relationships that could

have appeared to inﬂuence the work reported in this paper.

Acknowledgment

The present work has been carried out in the computer laboratory of the Department of Mathematics and Scientiﬁc

Computing at the National Institute of Technology, Hamirpur, Himachal Pradesh, India.

References

Ainsworth, T. (2002). Buyer beware. Security Oz, 19, 18-26.

Alam, M. A., & Ansari, K. M. (2020). Open innovation ecosystems: Toward low-cost wind energy startups. International Journal of Energy Sector Management, 14(5), 853-869.

https://doi.org/10.1108/ijesm-07-2019-0010

Baierle, I. C., Benitez, G. B., Nara, E. O., Schaefer, J. L., & Sellitto, M. A. (2020). Inﬂuence of open innovation

variables on the competitive edge of small and medium enterprises. Journal of Open Innovation: Technology,

Market, and Complexity, 6(4), 179.

https://doi.org/10.3390/joitmc6040179

Barnich, O., & Van Droogenbroeck, M. (2011). ViBe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image Processing, 20(6), 1709-1724.

https://doi.org/10.1109/tip.2010.2101613

Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. Computer Vision – ECCV 2006,

3951, 404-417.

https://doi.org/10.1007/11744023_32

Bhatti, M. T., Khan, M. G., Aslam, M., & Fiaz, M. J. (2021). Weapon detection in real-time CCTV videos using deep

learning. IEEE Access, 9, 34366-34382.

https://doi.org/10.1109/access.2021.3059170

Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). Yolov4: Optimal speed and accuracy of object detection.

arXiv preprint arXiv:2004.10934.

Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine

Intelligence, PAMI-8(6), 679-698.

https://doi.org/10.1109/tpami.1986.4767851

Cardoso, G. V., Ciarelli, P. M., & Vassallo, R. F. (2019). Use of deep learning for firearms detection in images. Anais do XV Workshop de Visão Computacional (WVC 2019), 109–114.

https://doi.org/10.5753/wvc.2019.7637

Castillo, A., Tabik, S., Pérez, F., Olmos, R., & Herrera, F. (2019). Brightness guided preprocessing for automatic cold steel weapon detection in surveillance videos with deep learning. Neurocomputing, 330, 151-161.

https://doi.org/10.1016/j.neucom.2018.10.076

Cootes, T. F., Edwards, G. J., & Taylor, C. J. (1998). Active appearance models. In European Conference on Computer

Vision, 1407, 484-498.

https://doi.org/10.1007/BFb0054760

Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE Computer Society

Conference on Computer Vision and Pattern Recognition (CVPR’05), 1, 886–893.

https://doi.org/10.1109/cvpr.2005.177

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248-255.

https://doi.org/10.1109/cvpr.2009.5206848

Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). CenterNet: Keypoint triplets for object detection.

2019 IEEE/CVF International Conference on Computer Vision (ICCV), 6568-6577.

https://doi.org/10.1109/iccv.2019.00667

Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML, 96, 148-156.

10.1.1.380.9055

Galab, M. K., Taha, A., & Zayed, H. H. (2021). Adaptive technique for brightness enhancement of automated knife detection in surveillance video with deep learning. Arabian Journal for Science and Engineering, 46(4), 4049-4058.

https://doi.org/10.1007/s13369-021-05401-4

Gelana, F., & Yadav, A. (2018). Firearm detection from surveillance cameras using image processing and machine

learning techniques. Smart Innovations in Communication and Computational Sciences, 851, 25-34.

https://doi.org/10.1007/978-981-13-2414-7_3

Girshick, R. (2015). Fast R-CNN. 2015 IEEE International Conference on Computer Vision (ICCV), 1440–1448.

https://doi.org/10.1109/iccv.2015.169

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and

semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 580–587.

https://doi.org/10.1109/cvpr.2014.81

Glowacz, A., Kmieć, M., & Dziech, A. (2013). Visual detection of knives in security applications using active appearance models. Multimedia Tools and Applications, 74(12), 4253-4267.

https://doi.org/10.1007/s11042-013-1537-2

González, J. L. S., Zaccaro, C., Álvarez-García, J. A., Morillo, L. M. S., & Caparrini, F. S. (2020). Real-time gun detection in CCTV: An open problem. Neural Networks, 132, 297-308.

https://doi.org/10.1016/j.neunet.2020.09.013

Grega, M., Matiolański, A., Guzik, P., & Leszczuk, M. (2016). Automated detection of firearms and knives in a CCTV image. Sensors, 16(1), 47.

https://doi.org/10.3390/s16010047

Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual understanding: A

review. Neurocomputing, 187, 27-48.

https://doi.org/10.1016/j.neucom.2015.09.116

Haralick, R. M., & Shapiro, L. G. (1985). Image segmentation techniques. Computer Vision, Graphics, and Image

Processing, 29(1), 100-132.

https://doi.org/10.1016/s0734-189x(85)90153-7

Harris, C., & Stephens, M. (1988). A combined corner and edge detector. Proceedings of the Alvey Vision Conference, 15, 10–5244.

https://doi.org/10.5244/c.2.23

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference

on Computer Vision and Pattern Recognition (CVPR), 770-778.

https://doi.org/10.1109/cvpr.2016.90

Hearst, M., Dumais, S., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent

Systems and their Applications, 13(4), 18-28.

https://doi.org/10.1109/5254.708428

Hu, Z., Tang, J., Wang, Z., Zhang, K., Zhang, L., & Sun, Q. (2018). Deep learning for image-based cancer detection

and diagnosis - A survey. Pattern Recognition, 83, 134-149.

https://doi.org/10.1016/j.patcog.2018.05.014

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected Convolutional networks.

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2261–2269.

https://doi.org/10.1109/cvpr.2017.243

IMFDBs. http://www.imfdb.org/wiki/Main_Page. [Online; accessed 10-Oct-2021].

Iqbal, J., Munir, M. A., Mahmood, A., Ali, A. R., & Ali, M. (2021). Leveraging orientation for weakly supervised

object detection with application to ﬁrearm localization. Neurocomputing, 440, 310-320.

https://doi.org/10.1016/j.neucom.2021.01.075

Jain, H., Vikram, A., Mohana, Kashyap, A., & Jain, A. (2020). Weapon detection using artiﬁcial intelligence and deep

learning for security applications. 2020 International Conference on Electronics and Sustainable Communication

Systems (ICESC), 193–198.

https://doi.org/10.1109/icesc48915.2020.9155832

Jeong, C. Y., Yang, H. S., & Moon, K. (2018). Fast horizon detection in maritime images using region-of-interest.

International Journal of Distributed Sensor Networks, 14(7), 155014771879075.

https://doi.org/10.1177/1550147718790753

Jin, X., Zhang, Y., & Jin, Q. (2016). Pulmonary nodule detection based on CT images using convolution neural

network. 2016 9th International Symposium on Computational Intelligence and Design (ISCID), 1, 202-204.

https://doi.org/10.1109/iscid.2016.1053

Kaya, V., Tuncer, S., & Baran, A. (2021). Detection and classiﬁcation of diﬀerent weapon types using deep learning.

Applied Sciences, 11(16), 7535.

https://doi.org/10.3390/app11167535

Kmieć, M., Głowacz, A., & Dziech, A. (2012). Towards robust visual knife detection in images: Active appearance models initialised with shape-specific interest points. Communications in Computer and Information Science, 287, 148-158.

https://doi.org/10.1007/978-3-642-30721-8_15

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097-1105.

https://doi.org/10.1145/3065386

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M.,

Kolesnikov, A., Duerig, T., & Ferrari, V. (2020). The open images dataset V4. International Journal of Computer

Vision, 128(7), 1956-1981.

https://doi.org/10.1007/s11263-020-01316-z

Lamas, A., Tabik, S., Montes, A. C., Pérez-Hernández, F., García, J., Olmos, R., & Herrera, F. (2022). Human pose estimation for mitigating false negatives in weapon detection in video-surveillance. Neurocomputing, 489, 488-503.

https://doi.org/10.1016/j.neucom.2021.12.059

Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., & Yan, S. (2017). Perceptual generative adversarial networks for small object detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1951–1959.

https://doi.org/10.1109/cvpr.2017.211

Lin, T., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object

detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2117-2125.

https://doi.org/10.1109/cvpr.2017.106

Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Computer Vision – ECCV 2014, 740-755.

https://doi.org/10.1007/978-3-319-10602-1_48

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Berg, A. C. (2016). SSD: Single shot MultiBox

detector. Computer Vision – ECCV 2016, 9905, 21-37.

https://doi.org/10.1007/978-3-319-46448-0_2

Lowe, D. (1999). Object recognition from local scale-invariant features. Proceedings of the Seventh IEEE International Conference on Computer Vision, 2, 1150–1157.

https://doi.org/10.1109/iccv.1999.790410

Ma, X., Niu, Y., Gu, L., Wang, Y., Zhao, Y., Bailey, J., & Lu, F. (2021). Understanding adversarial attacks on deep

learning based medical image analysis systems. Pattern Recognition, 110, 107332.

https://doi.org/10.1016/j.patcog.2020.107332

Malo-Perisé, P., & Merseguer, J. (2022). The “Socialized architecture”: A software engineering approach for a new cloud. Sustainability, 14(4), 2020.

https://doi.org/10.3390/su14042020

Maqueda, A. I., Loquercio, A., Gallego, G., Garcia, N., & Scaramuzza, D. (2018). Event-based vision meets deep

learning on steering prediction for self-driving cars. 2018 IEEE/CVF Conference on Computer Vision and Pattern

Recognition, 5419–5427.

https://doi.org/10.1109/cvpr.2018.00568

Minaeian, S., Liu, J., & Son, Y. (2018). Eﬀective and eﬃcient detection of moving targets from a UAV’s camera.

IEEE Transactions on Intelligent Transportation Systems, 19(2), 497-506.

https://doi.org/10.1109/tits.2017.2782790

Narejo, S., Pandey, B., Esenarro vargas, D., Rodriguez, C., & Anjum, M. R. (2021). Weapon detection using YOLO

V3 for smart surveillance system. Mathematical Problems in Engineering, 2021, 1-9.

https://doi.org/10.1155/2021/9975700

Nassif, A. B., Shahin, I., Attili, I., Azzeh, M., & Shaalan, K. (2019). Speech recognition using deep neural networks:

A systematic review. IEEE Access, 7, 19143-19165.

https://doi.org/10.1109/access.2019.2896880

Olmos, R., Tabik, S., & Herrera, F. (2018). Automatic handgun detection alarm in videos using deep learning. Neurocomputing, 275, 66-72.

https://doi.org/10.1016/j.neucom.2017.05.012

Olmos, R., Tabik, S., Lamas, A., Pérez-Hernández, F., & Herrera, F. (2019). A binocular image fusion approach for minimizing false positives in handgun detection with deep learning. Information Fusion, 49, 271-280.

https://doi.org/10.1016/j.inffus.2018.11.015

Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198. https://doi.org/10.1613/jair.614

Pérez-Hernández, F., Tabik, S., Lamas, A., Olmos, R., Fujita, H., & Herrera, F. (2020). Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance. Knowledge-Based Systems, 194, 105590. https://doi.org/10.1016/j.knosys.2020.105590

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788. https://doi.org/10.1109/cvpr.2016.91

Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517–6525. https://doi.org/10.1109/cvpr.2017.690

Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.

Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/tpami.2016.2577031

Romero, D., & Salamea, C. (2019). Convolutional models for the detection of firearms in surveillance videos. Applied Sciences, 9(15), 2965. https://doi.org/10.3390/app9152965

Roy, S., Das, N., Kundu, M., & Nasipuri, M. (2017). Handwritten isolated Bangla compound character recognition: A new benchmark using a novel deep learning approach. Pattern Recognition Letters, 90, 15–21. https://doi.org/10.1016/j.patrec.2017.03.004

Salido, J., Lomas, V., Ruiz-Santaquiteria, J., & Deniz, O. (2021). Automatic handgun detection with deep learning in video surveillance images. Applied Sciences, 11(13), 6085. https://doi.org/10.3390/app11136085

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Singh, A., Anand, T., Sharma, S., & Singh, P. (2021). IoT based weapons detection system for surveillance and security using YOLOV4. 2021 6th International Conference on Communication and Electronics Systems (ICCES), 488–493. https://doi.org/10.1109/icces51350.2021.9489224

Singh, T., & Vishwakarma, D. K. (2019). Human activity recognition in video benchmarks: A survey. Lecture Notes in Electrical Engineering, 526, 247–259. https://doi.org/10.1007/978-981-13-2553-3_24

Sommer, L. W., Schuchert, T., & Beyerer, J. (2017). Fast deep vehicle detection in aerial images. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 311–319. https://doi.org/10.1109/wacv.2017.41

Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. (2017). Inception-V4, inception-resnet and the impact of residual connections on learning. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1). https://doi.org/10.1609/aaai.v31i1.11231

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818–2826. https://doi.org/10.1109/cvpr.2016.308

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9. https://doi.org/10.1109/cvpr.2015.7298594

Tiwari, R. K., & Verma, G. K. (2015). A computer vision based framework for visual gun detection using Harris interest point detector. Procedia Computer Science, 54, 703–712. https://doi.org/10.1016/j.procs.2015.06.083

Tiwari, R. K., & Verma, G. K. (2015). A computer vision based framework for visual gun detection using SURF. 2015 International Conference on Electrical, Electronics, Signals, Communication and Optimization (EESCO), 1–5. https://doi.org/10.1109/eesco.2015.7253863

Tompson, J., Goroshin, R., Jain, A., LeCun, Y., & Bregler, C. (2015). Efficient object localization using Convolutional networks. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 648–656. https://doi.org/10.1109/cvpr.2015.7298664

Tong, K., Wu, Y., & Zhou, F. (2020). Recent advances in small object detection based on deep learning: A review. Image and Vision Computing, 97, 103910. https://doi.org/10.1016/j.imavis.2020.103910

Velastin, S. A., Boghossian, B. A., & Vicencio-Silva, M. A. (2006). A motion-based image processing system for detecting potentially dangerous situations in Underground Railway stations. Transportation Research Part C: Emerging Technologies, 14(2), 96–113. https://doi.org/10.1016/j.trc.2006.05.006

Verma, G. K., & Dhillon, A. (2017). A handheld gun detection using faster R-CNN deep learning. Proceedings of the 7th International Conference on Computer and Communication Technology - ICCCT-2017, 84–88. https://doi.org/10.1145/3154979.3154988

Viola, P., & Jones, M. (2001). Rapid object detection using a boosted Cascade of simple features. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, 1, I-I. https://doi.org/10.1109/cvpr.2001.990517

Wilson, P. I., & Fernandez, J. (2006). Facial feature detection using Haar classifiers. Journal of Computing Sciences in Colleges, 21(4), 127–133. doi/abs/10.5555/1127389.1127416

Worsham, J., & Kalita, J. (2020). Multi-task learning for natural language processing in the 2020s: Where are we going? Pattern Recognition Letters, 136, 120–126. https://doi.org/10.1016/j.patrec.2020.05.031

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding Convolutional networks. Computer Vision – ECCV 2014, 818–833. https://doi.org/10.1007/978-3-319-10590-1_53

Zhan, S., Tao, Q., & Li, X. (2016). Face detection using representation learning. Neurocomputing, 187, 19–26. https://doi.org/10.1016/j.neucom.2015.07.130

Zivkovic, Z. (2004). Improved adaptive gaussian mixture model for background subtraction. Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004, 2, 28–31. https://doi.org/10.1109/icpr.2004.1333992

Zywicki, M., Matiolański, A., Orzechowski, T. M., & Dziech, A. (2011). Knife detection as a subset of object detection approach based on Haar cascades. In Proceedings of 11th International Conference “Pattern recognition and information processing”, 139–142.

