Detecting Criminal Activities and Promoting Safety Using Deep Learning

Rohan Mathur, Tejas Chintala and Rajeswari D*
Department of Data Science and Business Systems, School of Computing, SRM Institute of Science and Technology,
Kattankulathur, Tamil Nadu 603203, India.
E-mail : rm3686@srmist.edu.in, tc5739@srmist.edu.in, rajeswad@srmist.edu.in
Abstract- Automation and autonomous systems are among
the powerhouses of innovation that drive entire domains
forward in leaps and bounds. Many technological advances
can be attributed to tasks that automation has made easier,
and artificial intelligence makes these automated systems
smart enough to perform their tasks with the power of
decision-making, thereby greatly reducing human
intervention in redundant processes. Our project follows
these ideals: building a product that minimizes manual labor
(both physical and mental) for tasks that can be seamlessly
automated while solving the main problem at hand.
Currently, surveillance cameras play a vital role in ensuring
the safety of people, yet they are plain video-providing
entities with no smart decision-making mechanisms of their
own. Because of the growing volume of data produced by
surveillance cameras, automated analysis of video streams
has become a requisite for detecting abnormal events. The
main aim of the project is to promote safety on campus by
employing deep learning techniques to automate the task of
monitoring and reporting crimes from physical
Closed-Circuit Television (CCTV) footage, assigning the
responsibility of detecting criminal activity to a framework
that can identify patterns and differentiate them for smarter
monitoring. The model proposed in this paper was able to
distinguish between certain crimes with precisions of 0.94
and 0.95 for Assault and Abuse, respectively.
Keywords- ResNet, Computer Vision, Deep Learning, Triplet
Loss Function, Single Shot Detector, UCF Crime
I. INTRODUCTION
Video classification is a computer vision problem
motivated by the need to automate classification
tasks on real-time live video. Because the problem
is relatively recent, many candidate solutions
remain to be tested. Even so, its applications
span a wide spectrum, from detecting important
aspects of sports actions [1] or daily activities
taking place in a scene to various security and
healthcare uses. Our project aims at
promoting safety on campus by automating the task
of monitoring and reporting crimes, assigning the
responsibility of detecting criminal or abnormal
activity to a system that is well-versed in deducing
patterns that distinguish criminal activity from
normal activity. Traditional surveillance systems
suffer from a plethora of shortcomings and do not
make good use of the facilities available today.
One of the most notable flaws in vanilla surveillance
systems is the heavy dependence on an attentive
supervisor monitoring footage and ensuring that any
abnormal activity is duly noted and taken care of.
CCTV footage requires human intervention, which
may lead to errors [8]. In a bid to reduce manual
intervention and labor, the project not only
eliminates the need for additional supervision, but
also detects the occurrence of a crime as it happens,
takes note of the people involved, recognizes the
type of crime occurring, and accordingly triggers
actions that initiate mitigation measures at the crime
scene in real time.
There have been different implementations for the
automatic detection of crimes, such as localizing the
gun and victim present in video footage [4] using
deep learning approaches, especially CNN
architectures like Residual models [5]. The
availability of large databases such as the
UCF-Crime dataset [6] and the RWF-2000 database
[7] makes it easier to create automatic
activity detectors that enhance safety in
public places.
II. LITERATURE REVIEW
Many papers have addressed this application
along with their implementations; several are
discussed in this section. Umadevi V. Navalgund and
Priyadharshini K [8] proposed a crime detection
system built around a pipeline that detects weapons
by training pretrained models on weapon images
to classify them. They used
2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI) | 978-1-6654-9529-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/ACCAI53970.2022.9752619
Authorized licensed use limited to: SRM University. Downloaded on May 11,2022 at 06:34:43 UTC from IEEE Xplore. Restrictions apply.
VGGNet 19 as their pre-trained model for detection
and achieved 69% accuracy and 75% recall. Roberto Olmos,
Siham Tabik, and Francisco Herrera [4] described a
system that detects guns in surveillance videos
and, based on whether a gun is present, predicts
whether a crime has occurred. The results in that
paper were obtained using Region-based Convolutional
Neural Network (RCNN) and Faster Region-based
Convolutional Neural Network (FRCNN)
models that the authors trained on their own data.
Their system gave satisfactory predictions even on
low-quality videos.
A. R. Zamir [1] presents an overview of prominent
action localization and recognition methods for
sports videos. Using the Sports dataset provided by
UCF as the benchmark for evaluating the
discussed techniques, the work divides action
recognition into three major stages:
feature extraction, representation of
videos with dictionary learning, and finally
classification of the data (in this use case, the sport).
A. Karpathy et al. [9] study the performance of
convolutional neural networks (CNNs) in video
classification, finding that CNN architectures [10]
are capable of learning powerful features
from weakly-labeled data and surpassing other
existing methods in performance.
Their transfer learning experiments suggest that the
learned features are generic and generalize
to other classification tasks. Tahani
Almanie, Rsha Mirza and Elizabeth Lor [11] pursue
a similar objective using machine learning
techniques such as Decision Trees and Naïve Bayes
classifiers to classify and predict crimes. The
datasets used there are spatial data. They were
able to achieve 51% and 54% accuracy,
respectively, on the datasets for two cities, Denver
and Los Angeles.
III. PROPOSED METHOD
The main aim of this paper is to detect crimes
happening in video streams such as those from CCTV
cameras. An additional module recognizes faces
in these streams using embeddings trained with a
previously published method known as Triplet Loss.
This paper therefore proposes two modules. The first
takes a CCTV stream and recognizes the faces in it
using the methods explained below. The second tackles
crime detection using deep learning methods. Figure
1 gives a brief overview of both modules.
The Face Recognition
module uses OpenCV's built-in DNN (Deep Neural
Networks) module to build a model that
differentiates faces and recognizes them. This
is done by identifying faces in the given data,
performing data preprocessing, and then
training on embeddings using the Triplet Loss
function. After these embeddings are calculated, the
model is used to recognize such faces: the model
is loaded and a webcam stream is used to identify
each face, surrounding it with a bounding box and
reporting a confidence value. In the evaluation
(Section IV), this module correctly recognized the
test faces.
The Crime Detection module initially selects
input videos that contain criminal activity as
well as videos with normal (no criminal) activity.
After basic preprocessing steps, which
include data augmentation and converting the
videos to image frames, a pre-existing ResNet
architecture is trained on the frames, and accuracy
and other evaluation metrics are observed.
Finally, using the trained model, one can
detect crimes by playing simulated webcam streams
through a mobile phone. Accuracies over 90% were
achieved when evaluating a total of six classes in
this paper. Overall, this paper suggests an end-to-end
pipeline that ties both modules together to identify
the crime and, simultaneously, who is committing
it.
A. Face Recognition
Face Detection is the technique of detecting and
returning the location of a face inside the frame from
a given photo or video stream. Face verification
takes it a step further and checks if the given face
holds any resemblance with another face stored in
memory. This is done by gauging the similarity
between the two faces using distance metrics like the
L2 norm or the cosine similarity. Finally, face
recognition cumulatively involves both the above
techniques to extract prominent features from the
face and identify who the face belongs to from a set
of labels obtained from the dataset on which the
model is being trained. The solution that
this paper proposes for face recognition involves
detecting faces, computing the embedding of each
face, training a Support Vector Machine
(SVM) on these embeddings, and finally being
able to detect faces in images or simulated video
streams. Figure 2 explains the pipeline that is
discussed in this paper. The Caffe Model is being
used for Face Detection and the OpenFace Model for
feature extraction.
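The verification step described in this subsection, comparing two face embeddings with the L2 norm or the cosine similarity, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy 4-dimensional embeddings, and the 0.6 distance threshold are assumptions made for the example, since the paper does not state an embedding size or a threshold.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_same_person(a, b, max_distance=0.6):
    """Accept a pair of faces when their embeddings are closer than a threshold.

    max_distance is a hypothetical value; in practice it would be tuned on
    held-out pairs of matching and non-matching faces.
    """
    return l2_distance(a, b) < max_distance


# Toy embeddings (assumed values, for illustration only).
stored_face = [0.10, 0.80, 0.30, 0.50]
same_person = [0.12, 0.79, 0.33, 0.48]
other_person = [0.90, 0.10, 0.70, 0.05]

print(is_same_person(stored_face, same_person))
print(is_same_person(stored_face, other_person))
```

With these toy values the first pair is accepted and the second is rejected. Recognition then goes one step further by assigning the embedding to one of the known identities.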
Fig. 1. Overview of Face Recognition and Crime
Detection Module
Fig. 2. Overall Face Recognition Pipeline with
OpenCV
OpenCV's deep learning face detector is based on
the Single Shot Detector (SSD) architecture with a
ResNet backbone. Single-shot detection refers to a
technique wherein the model requires a single forward
pass to detect multiple objects within the image. It
discretizes the image into a set of candidate bounding
boxes around regions whose feature maps have high
confidence and generates multiple boxes around
such regions. The confidence for each of these
boxes is calculated and the box dimensions are
adjusted to obtain the best fit for detection. Figure 3
depicts how the final bounding boxes look
once a face is detected.
Fig. 3. SSD Multiple Bounding Boxes for
Localization and Confidence for
A Given Face
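To illustrate the confidence-and-box step, the sketch below filters raw SSD detections by confidence and scales the surviving boxes to pixel coordinates. It assumes each detection is a 7-element row `[image_id, class_id, confidence, x1, y1, x2, y2]` with coordinates normalized to [0, 1], which is the layout returned by OpenCV's SSD face detector. The function name, the 0.5 threshold, and the sample rows are assumptions made for the example.

```python
def filter_detections(detections, frame_width, frame_height, min_confidence=0.5):
    """Keep detections above min_confidence and convert boxes to pixels."""
    faces = []
    for _image_id, _class_id, confidence, x1, y1, x2, y2 in detections:
        if confidence < min_confidence:
            continue  # discard weak candidate boxes
        box = (
            int(x1 * frame_width),
            int(y1 * frame_height),
            int(x2 * frame_width),
            int(y2 * frame_height),
        )
        faces.append((box, confidence))
    # Strongest detection first.
    faces.sort(key=lambda item: item[1], reverse=True)
    return faces


# Made-up detections for a 640x480 frame (illustrative values only).
raw = [
    [0, 1, 0.98, 0.25, 0.25, 0.50, 0.75],
    [0, 1, 0.30, 0.10, 0.10, 0.20, 0.20],
    [0, 1, 0.75, 0.50, 0.50, 0.75, 1.00],
]

for box, confidence in filter_detections(raw, 640, 480):
    print(box, confidence)
```

The low-confidence second row is dropped, and the remaining boxes are what would be drawn on the frame together with their confidence values.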
Additionally, we can compute facial landmarks
(mouth, right/left eyebrows, eyes, nose, jawline)
using the dlib library, which further enables us to
preprocess the images and perform face alignment
on datasets for better results. After cropping and
performing face alignment, we pass the face
through the proposed neural network. For training a
face recognition model, each input batch must contain
an Anchor Image (current image of person “A”), a
Positive Image (another image of person “A”), and
a Negative Image (an image of anyone who is not
person “A”). From these, the neural network calculates
the face embeddings and adjusts its weights using a
method called triplet loss. In this way, the
embeddings of the “Anchor” and the “Positive”
images are pulled close to each other, whereas the
embedding of the “Negative” image is pushed farther
away. A CNN (Caffe)
model computes the embeddings for all input
images, and these embeddings are sufficiently
distinct to train a classifier such as SVMs, SGD
Classifiers, Random Forests, etc. on top of the
calculated embeddings of the face, which constitute
the facial recognition pipeline.
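The triplet objective described above is commonly written as L = max(0, ||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + margin), where f is the embedding network. The sketch below evaluates this loss on precomputed embeddings. It is a minimal illustration under stated assumptions: the margin of 0.2 and the toy 2-dimensional vectors are not taken from the paper, and a real model would backpropagate this loss through the embedding network.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an assumed value here).
    """
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    return max(0.0, pos_dist - neg_dist + margin)


# Toy embeddings (assumed values, for illustration only).
anchor = [0.0, 1.0]
positive = [0.1, 0.9]
far_negative = [1.0, 0.0]   # already well separated from the anchor
near_negative = [0.2, 0.9]  # too close to the anchor

print(triplet_loss(anchor, positive, far_negative))
print(triplet_loss(anchor, positive, near_negative))
```

The well-separated negative contributes no loss, while the nearby negative yields a positive loss that would push its embedding away during training.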
1) Data Augmentation: To make the most of a small
amount of data, this paper applies several
data augmentation methods that multiply the number
of effective training samples. Flipping, rotating,
zooming, translating, scaling, cropping, shifting
along the x-axis and y-axis, adding Gaussian noise,
shearing, skewing, applying black-and-white
filters, and blurring have
been implemented. For our dataset, the augmented
version has been displayed in Figure 4.
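A few of the listed augmentations can be sketched on a tiny image stored as a nested list of pixel values. This is an illustrative sketch only: the function names are hypothetical, the 2x3 "image" is made up, and a real pipeline would apply such operations to full frames through an image library.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_left(image):
    """Rotate the image 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def rotate_right(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def shift_x(image, pixels, fill=0):
    """Translate along the x-axis (positive = right), padding with `fill`."""
    width = len(image[0])
    shifted = []
    for row in image:
        if pixels >= 0:
            new_row = [fill] * pixels + row[: max(width - pixels, 0)]
        else:
            new_row = row[-pixels:] + [fill] * (-pixels)
        shifted.append(new_row[:width])
    return shifted


def augment(image):
    """Return the original image plus its augmented variants."""
    return [
        image,
        flip_horizontal(image),
        rotate_left(image),
        rotate_right(image),
        shift_x(image, 1),
    ]


# A made-up 2x3 grayscale image.
image = [[1, 2, 3],
         [4, 5, 6]]

for variant in augment(image):
    print(variant)
```

Each source image yields several training samples, which is how augmentation compensates for a small dataset.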
B. Video Classification for Crime Detection
This module aims at detecting anomalies in
video footage and distinguishing them from normal
activity. The dataset used for training is the UCF Crimes
Dataset, which is the only dataset that contains
videos of diverse classes of crimes, each replete with
valuable and distinctive features. The dataset
contains 13 classes in total: Accidents, Fighting,
Burglary, Shoplifting, Robbery, Shooting, Abuse,
Arrest, Arson, Assault, Explosion,
Stealing, and Vandalism. In total, it consists of
approximately 1,900 real-world
videos, all taken in different places and each
showing a realistic crime. This amounts to a
total of 128 hours of
video, occupying about 95 GB of storage.
Figure 5 depicts the pipeline that is
followed for this module. Figure 6 showcases
videos from a few of the classes.
Fig. 4. Different Types Of Augmented Images
Fig. 5. Crime Detection Module
1) Dataset Preparation and Pre-Processing: Three
techniques are applied to each scene:
converting, enhancing and, finally,
augmenting the data. This paper uses six classes:
“Abuse”, “Assault”, “Fighting”, “Normal”,
“Robbery” and “Vandalism”. Because of the sheer
size of the videos, the authors had to reduce
the number of dataset classes. Video editing and
trimming helped remove redundant and misleading
features: videos that were 5 minutes long were
trimmed to 45 seconds covering the interval in
which the crime occurred, and the irrelevant portions
were discarded. Low-resolution videos
were sharpened and cropped to highlight
the relevant portions of the crime scene. The criminal
part of each crime video was labelled manually, and the
remaining portion of the video could then be labeled
under the normal class. Finally, data augmentation was
applied to enlarge the variety of data
available to train the model. Figure
7 gives an example of the data augmentation that
was done.
Fig. 6. Video Classes being turned into photos for preparation
Fig. 7. Normal Image, Rotated Left, Rotated Right
2) Residual Network (ResNet): The ResNet layers
are formulated to learn
residual functions with reference to the inputs of the
layer, instead of learning unreferenced functions.
The architecture, with a depth of 18 to 152
layers, passes the signal from one
layer to a later one through a shortcut connection.
These connections let gradients flow
directly between the early layers of the network
and the later ones, easing the training of
very deep networks. The Residual Block in Figure 8
shows how the connection
carries the signal from the top to the tail of the
block. By introducing these skip connections,
the Residual Network architecture mitigates the
vanishing gradient issue.
In this way, the network can skip training for
a few layers and make a direct connection to the
output [12], [13].
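In the notation of [5], a residual block with input x and learned residual mapping F can be written as below. The second expression is the Jacobian that follows directly from the first; it is added here as a worked step and is not taken from the paper.

```latex
\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x},
\qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}}
  = \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + \mathbf{I}
```

The identity term I means that a gradient arriving at the output of the block reaches its input unattenuated even when the gradient through F is small, which is why stacking many such blocks remains trainable.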
3) Proposed Video Classification Technique: The
proposed pipeline for crime classification
loops over all frames in an input video. Each
frame is passed through a CNN
and classified individually,
independently of the other frames. Subsequently, the model
then chooses the label with the largest probability for
the frame, labels it, and finally writes the output
frame. Because crime detection is a sequential problem,
classifying each frame in isolation is not sufficient:
the predicted label can change from one frame to the
next. The pipeline therefore needs to preserve
the correlation between subsequent frames of a
single video input.
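One way to preserve this correlation is to smooth the per-frame predictions over a sliding window of recent frames. The sketch below is a minimal illustration under stated assumptions: the class name `RollingClassifier`, the window size of 3, and the probability values are hypothetical, and in practice the per-frame probabilities would come from the trained CNN.

```python
from collections import deque


class RollingClassifier:
    """Average the last `window` per-frame probability vectors."""

    def __init__(self, labels, window=3):
        self.labels = labels
        # deque with maxlen drops the oldest prediction automatically.
        self.history = deque(maxlen=window)

    def update(self, frame_probs):
        """Add one frame's class probabilities and return the smoothed label."""
        self.history.append(frame_probs)
        count = len(self.history)
        averaged = [
            sum(probs[i] for probs in self.history) / count
            for i in range(len(self.labels))
        ]
        best = max(range(len(averaged)), key=averaged.__getitem__)
        return self.labels[best], averaged[best]


# Made-up per-frame probabilities for the classes (Normal, Fighting).
frames = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.4, 0.6],  # taken alone, this frame would be labeled Fighting
    [0.2, 0.8],
    [0.1, 0.9],
]

classifier = RollingClassifier(["Normal", "Fighting"], window=3)
smoothed = [classifier.update(probs)[0] for probs in frames]
print(smoothed)
```

In this toy run the third frame is still reported as Normal because the window average has not yet shifted; the label changes to Fighting only once the evidence persists across frames.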
Fig. 8. Residual Block
Temporal consistency is achieved by computing the
prediction for each frame as above and maintaining a
list of the last 'N' predictions. The pipeline computes
the average of the last 'N' predictions, chooses the
label with the largest average probability, and
returns it as the final output.
4) Training the ResNet on UCF-Crimes Dataset: The
training and testing steps with ResNet are discussed
below. First, the file directories of the training
and validation images are located, and training
parameters such as the batch size, the number of epochs,
the height and width of the images, and the learning
rate are specified. Training and validation data are
generated using the ImageDataGenerator class in
TensorFlow. As training runs, the metrics for
each epoch are saved and plotted at the end of
training. Finally, the model is stored along with its
weights. Hyper-parameters such as the learning rate,
the number of epochs, and the batch size must be
chosen with care when training a CNN, since they
affect performance. Figure 9 shows the final
training epochs.
The following parameters were provided to the
ResNet Module -
Splitting the dataset: 25% kept as test data,
the remainder used for model training.
Number of Epochs = 50
Loss = Categorical Cross-Entropy
Optimizer = Stochastic Gradient Descent with a
learning rate of 0.0001
Metric = Accuracy
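The parameters listed above could be wired into a Keras model roughly as follows. This is a hedged sketch, not the authors' training script: the ResNet50 variant, the ImageNet initialization, the 224x224 input size, and the names `TRAIN_CONFIG` and `build_classifier` are assumptions, because the paper does not state the ResNet depth, the initialization, or the image dimensions. Only the split, epochs, loss, optimizer, learning rate, metric, and the six classes come from the text. TensorFlow is imported inside the function so that the configuration can be inspected without it installed.

```python
TRAIN_CONFIG = {
    "test_split": 0.25,                  # 25% held out as test data
    "epochs": 50,
    "loss": "categorical_crossentropy",
    "learning_rate": 1e-4,               # used with SGD
    "metrics": ["accuracy"],
    "num_classes": 6,                    # Abuse, Assault, Fighting, Normal, Robbery, Vandalism
    "image_size": (224, 224),            # assumption: not stated in the paper
}


def build_classifier(config=TRAIN_CONFIG):
    """Build and compile a ResNet-based frame classifier (assumed ResNet50)."""
    import tensorflow as tf  # requires TensorFlow to be installed

    height, width = config["image_size"]
    backbone = tf.keras.applications.ResNet50(
        weights="imagenet",              # assumption: pretrained initialization
        include_top=False,
        input_shape=(height, width, 3),
        pooling="avg",
    )
    outputs = tf.keras.layers.Dense(
        config["num_classes"], activation="softmax"
    )(backbone.output)
    model = tf.keras.Model(inputs=backbone.input, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=config["learning_rate"]),
        loss=config["loss"],
        metrics=config["metrics"],
    )
    return model
```

Calling `build_classifier()` returns a compiled model whose `fit` method would then be run for `TRAIN_CONFIG["epochs"]` epochs on the generated training data.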
The evaluation of the final model is based on the
following terms -
False Positive: No crime occurs in the
scene, but the model reports one.
False Negative: A crime occurs, but the model fails
to detect it.
True Positive: A crime occurs and the model
detects it correctly.
True Negative: No crime occurs
and the model correctly reports none.
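The four terms above combine into the precision and recall figures reported for each class. The sketch below counts them for one class treated as "positive" and derives both metrics; the labels and predictions are made-up values for illustration, not results from the paper.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive and truth == positive:
            tp += 1   # crime present and detected
        elif predicted == positive:
            fp += 1   # no crime, but the model reported one
        elif truth == positive:
            fn += 1   # crime present, but missed
        else:
            tn += 1   # no crime and none reported
    return tp, fp, fn, tn


def precision(tp, fp):
    """Fraction of reported crimes that were real."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of real crimes that were reported."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up ground truth and predictions (illustrative only).
y_true = ["Assault", "Assault", "Normal", "Normal", "Assault", "Normal"]
y_pred = ["Assault", "Normal", "Normal", "Assault", "Assault", "Normal"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="Assault")
print(tp, fp, fn, tn)
print(precision(tp, fp), recall(tp, fn))
```

For a multi-class model such as the six-class classifier here, the same counting is repeated with each class in turn as the positive class.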
Fig. 9. Training the classifier (Epochs = 50)
The final epoch displays the following parameters -
Training Loss - 0.158
Training Accuracy - 0.953
Validation Loss - 0.1168
Validation Accuracy - 0.966
IV. RESULTS
A. Face Recognition
Training for Face Recognition was done for three
different faces with approximately 15 photos of each
person. Having limited resources and limited
computational power, the authors chose to keep
the data for this module minimal. The results are
shown in Figure 10 and Figure 11, using
a webcam as a simulated CCTV stream.
As seen, the model was able to correctly classify the
faces as well as provide bounding boxes and
confidence levels for the faces.
Fig. 10. Face Recognition for person A
(simulating a CCTV)
Fig. 11. Face Recognition for person B
(simulating a CCTV)
B. Crime Detection
After training for 50 epochs, the model was evaluated
and gave the results shown in Table I and Table II.
The training loss and accuracy along with the validation
loss and accuracy were recorded and are shown in
Figure 12. The authors then evaluated the trained
model on rolling streams of YouTube videos depicting
the criminal activities 'Fighting',
'Vandalism' and 'Abuse'. The model was
able to correctly predict the activity taking place in
each stream, as shown in Figure 13, Figure 14
and Figure 15.
V. CONCLUSION AND FUTURE WORKS
The face recognition module developed in this
project demonstrated a simulated version
of the final prototype and successfully
replicated a facial recognition model. The Crime
Detection module gave satisfactory
results with the ResNet model, having been trained on
several classes after the final data preprocessing.
Future enhancements would include using more
facial classes for training the face recognition model.
It would also be interesting to create a database of
these faces to easily identify a given victim when
deploying our implementation with a live CCTV camera
stream. Trying different facial
recognition models and comparing the results would
also be interesting to experiment with. Owing to
constraints on computational power, this paper
uses a limited number of classes; future work can aim
to classify more types of crimes, and thereby
use more data.
Table I Results for the Different Classes
Table II Accuracy Metrics
Fig. 12. Loss and Accuracy as a Function of Epochs
Fig. 13. Detecting and Classifying Crimes from Test
Footage
Fig. 14. Detecting and Classifying Movement with
Weapons
Fig. 15. Detecting and Classifying Crimes
REFERENCES
[1] A. R. Zamir, “Action recognition in realistic sports videos,” in
Computer Vision in Sports. Springer, 2014, pp. 181-208.
[2] Agarwal, H., Singh, A., Rajeswari, D., Deepfake Detection
using SVM, Proceedings of the 2nd International Conference on
Electronics and Sustainable Communication Systems, ICESC
2021, 2021, pp. 1245-1249.
[3] Soni, H., Arora, P., Rajeswari, D., Malicious Application
Detection in Android using Machine Learning, Proceedings of
the 2020 IEEE International Conference on Communication and
Signal Processing, ICCSP 2020, 2020, pp.
846-848.
[4] Roberto Olmos, Siham Tabik, and Francisco Herrera, Soft
Computing and Intelligent Information Systems research group,
“Automatic Handgun Detection Alarm in Videos Using Deep
Learning”, February 20, 2017.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep
Residual Learning for Image Recognition”, 2015.
[6] Ucf-crime dataset (real-world anomalies detection in videos),
Jun 2019.
[7] Ming Cheng, Kunjing Cai, Ming Li, “RWF-2000: An Open
Large Scale Video Database for Violence Detection”, ICPR'20,
https://arxiv.org/pdf/1911.05913v3.pdf.
[8] Umadevi V. Navalgund; Priyadharshini K, “Crime Intention
Detection System Using Deep Learning”, International
Conference on Circuits and Systems in Digital Enterprise
Technology (ICCSDET), December 2018.
[9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar,
and L. Fei-Fei, “Large-scale video classification with
convolutional neural networks,” in The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2014.
[10] K. O'Shea and R. Nash: An introduction to convolutional neural
networks, ArXiv e-prints, (2015)
[11] Tahani Almanie, Rsha Mirza, Elizabeth Lor, “Crime Prediction
Based on Crime Types and Using Spatial and Temporal
Criminal Hotspots”, International Journal of Data Mining &
Knowledge Management Process (IJDKP), Vol.5, No.4, July
2015.
[12] Devvi Sarwinda, Radifa Hilya Paradisa, Alhadi
Bustamam, Pinkie Anggia: Deep Learning in Image
Classification using Residual Network (ResNet) Variants for
Detection of Colorectal Cancer, Procedia Computer Science
Volume 179, (2021) 423-431.
[13] Panagiotis Stalidis, Theodoros Semertzidis: Examining Deep
Learning Architectures for Crime Classification and Prediction,
Improved Forecasting through Artificial Intelligence. (October
2021).
Authorized licensed use limited to: SRM University. Downloaded on May 11,2022 at 06:34:43 UTC from IEEE Xplore. Restrictions apply.
... For this case, to improve the efficiency of the security management system and minimize crime incidents and losses, an effective crime prediction analysis system is needed. Such a system would enable proactive crime prevention and ensure robust security management in public places such as banks, shopping malls, and avenues [3,6]. ...
... In contemporary times, machine-learning and advanced image-processing algorithms have significantly contributed to the evolution of smart surveillance and security systems, as evidenced by recent developments [3,6,7,8]. In addition, the rise of smart devices and networked cameras has also boosted this field. ...
... A deep learning neural network generally has five layers: input and output layers with Convolution, Max-Pooling, and Fully connected layers. Many individuals choose to employ pre-trained deep learning models due to constraints such as limited time, memory, and resources such as CPU and processors [6,35]. When opposed to machine learning, which involves explicit design, these pre-trained models produce better and more accurate outcomes. ...
Article
Full-text available
The majority of visual based surveillance applications and security systems heavily rely on object detection, which serves as a critical module. In the context of crime scene analysis, images and videos play an essential role in capturing visual documentation of a particular scene. By detecting objects associated with a specific crime, police officers are able to reconstruct a scene for subsequent analysis. Nevertheless, the task of identifying objects of interest can be highly arduous for law enforcement agencies, mainly because of the massive amount of data that must be processed. Hence, the main objective of this paper is to propose a DL-based model for detecting tracked objects such as handheld firearms and informing the authority about the threat before the incident happens. We have applied VGG-19, ResNet, and GoogleNet as our deep learning models. The experiment result shows that ResNet50 has achieved the highest average accuracy of 0.92% compared to VGG19 and GoogleNet, which have achieved 0.91% and 0.89%, respectively. Also, YOLOv6 has achieved the highest MAP and inference speed compared to the faster R-CNN.
... The different features in the two-dimensional image pixels or the three-dimensional video pixels are used to generate the gradient of the image or the histogram of the image and assess it through the help of scientific methods as a spatiotemporal gradient (STG) [5], optical histograms for optical flow (HOF) [6][7][8][9], and also perhaps to derive the textures of the image through the mixture of dynamic textures (MDT) [10]. Similarly, the object level features track the objects in the image, such as track their trajectory [11] or their appearance or any change to them [12] in order to indicate an event such as in sports images or History images or energy images etc. ...
... In case the test sample is a normal event it will be in the high probability area of the Gaussian distribution, while an abnormal event will have a lower value. Hence inference on occurrence of an abnormal event is derived through a threshold as given in Eq. (11). ...
... To train the discriminator of the eventual algorithm, we use sequences that were synthesised by the original GAN as additional unfavorable instances. This works in a way that deep reinforcement learning does to prevent the model from deviating from its intended path [15]. ...
... We use a depth map to further enhance the picture. This map modifies the appearance of items in according to their distance from the camera [15], successfully reproducing the undersea environment's impacts on light absorption and scattering. Using this method, we can provide a more faithful rendering of the scene. ...
... These models have been useful in collecting real-time information about the road and traffic to find road accidents [5] monitor speed limits, traffic jams, driver behaviour [6] etc. But compared to complex models lightweight traffic detection are considerably much useful to ensure safety. ...
... A Convolutional Neural Network model takes one input in the form of an image and then assigns learnable weights and biases to a number of visual properties. The proposed CNN model receives as input images scaled to 256*256 pixels that are then normalised and enhanced [11]. CNN requires far fewer pre-processing steps than alternative machine learning algorithms. ...
Conference Paper
Full-text available
Train track crack detection is a process of identifying cracks in the structure of railway tracks. Railways are major modes of transport in India. The tracks must be in good condition for trains to have safe voyages. Cracks that appear on the tracks are often due to heat and other natural causes. At present these cracks are identified manually by railway personnel by inspecting them at regular intervals. This process is not effective as it consumes more time and there is an increased chance of leaving the cracked track undiscovered. The aim of this research work is to avoid the derailment of trains and reduce the cost and time that happens due to the cracks. This work proposed a technique for recognizing railway track cracks by combining Convolutional Neural Networks with image pre-processing techniques. Observations indicate that neural networks are capable of capturing the colours and textures of lesions related to respective railway track breaks with 94.6% accuracy.
Article
Full-text available
Since identifying criminals is a crucial function of intelligent surveillance systems, it has attracted a lot of attention. Although various approaches are developed for criminal face recognition, they cannot accurately identify the criminal faces. In this study, a novel advanced deep learning model was designed for accurate identification of criminal face from the CCTV images. The developed model utilizes five major phases namely, data collection, pre-processing, feature extraction, feature selection and classification. The study utilizes the data collected from the National Institute of Standards and Technology (NIST) containing criminal and non-criminal face images. The developed model employs Haarcascade algorithm for scaling and transforming the raw images into appropriate format for subsequent analysis. Further, the designed model utilizes Principal Component Analysis (PCA) and Ant Colony Optimization (ACO) for feature extraction and selection, respectively. Finally, the face recognition task was performed using the DenseNet 169 classifier. The developed framework was designed and implemented in Pytorch software and the result metrics are estimated. Furthermore, a comprehensive comparative study was conducted to validate the performances of the developed model with the conventional deep learning models. The experimental results and comparative study illustrate that the designed model outperformed the traditional models.
Article
Surveillance system research is now experiencing great expansion. Surveillance cameras put in public locations such as offices, hospitals, schools, roads, and other locations can be utilised to capture important activities and movements for event prediction, online monitoring, goal-driven analysis, and intrusion detection. This research proposed novel technique in detecting crime scene video surveillance system in real time violence detection using deep learning architectures. Here the aim is to collect the real time crime scene video of surveillance system and extract the features using spatio temporal (ST) technique with Deep Reinforcement neural network (DRNN) based classification technique. The input video has been processed and converted as video frames and from the video frames the features has been extracted and classified. Its purpose is to detect signals of hostility and violence in real time, allowing abnormalities to be distinguished from typical patterns. To validate our system's performance, it is trained as well as tested in large-scale UCF Crime anomaly dataset. The experimental results reveal that the suggested technique performs well in real-time datasets, with accuracy of 98%, precision of 96%, recall of 80%t, and F-1 score of 78%.
Article
This paper investigates a deep learning method for image classification in the detection of colorectal cancer using the ResNet architecture. The exceptional performance of deep learning classifiers motivates researchers to apply them to medical images. In this study, we trained ResNet-18 and ResNet-50 on colon gland images. The models were trained to classify colorectal tissue as benign or malignant. We assessed our models on three test splits (20%, 25%, and 40% of the whole dataset). The empirical outcomes confirm that ResNet-50 provides more reliable accuracy, sensitivity, and specificity than ResNet-18 on all three test splits. Across the three splits, the best performance is observed on the 20% and 25% test sets, with classification accuracy above 80%, sensitivity above 87%, and specificity above 83%. In this research, a deep learning method demonstrates reliable and reproducible outcomes for biomedical image analysis.
Article
In this paper, a detailed study on crime classification and prediction using deep learning architectures is presented. To the best of our knowledge, this is the first work that examines the effectiveness of deep learning algorithms in this domain and provides recommendations for designing and training deep learning systems for crime prediction and classification, using open data from police reports. Using time series of crime types per location as training data, a comparative study of state-of-the-art methods against three different deep learning configurations is conducted. In our experiments with five publicly available datasets, we demonstrate that the deep learning-based methods consistently outperform the existing best-performing methods. Moreover, we evaluate the effectiveness of different parameters in the deep learning architectures and give insights for configuring them in order to achieve improved performance in crime classification and prediction.
Article
The ability to analyze the actions which occur in a video is essential for automatic understanding of sports. Action localization and recognition in videos are two main research topics in this context. In this chapter, we provide a detailed study of the prominent methods devised for these two tasks which yield superior results for sports videos. We adopt UCF Sports, which is a dataset of realistic sports videos collected from broadcast television channels, as our evaluation benchmark. First, we present an overview of UCF Sports along with comprehensive statistics of the techniques tested on this dataset as well as the evolution of their performance over time. To provide further details about the existing action recognition methods in this area, we decompose the action recognition framework into three main steps of feature extraction, dictionary learning to represent a video, and classification; we overview several successful techniques for each of these steps. We also overview the problem of spatio-temporal localization of actions and argue that, in general, it manifests a more challenging problem compared to action recognition. We study several recent methods for action localization which have shown promising results on sports videos. Finally, we discuss a number of forward-thinking insights drawn from overviewing the action recognition and localization methods. In particular, we argue that performing the recognition on temporally untrimmed videos and attempting to describe an action, instead of conducting a forced-choice classification, are essential for analyzing the human actions in a realistic environment.
Conference Paper
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
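The residual reformulation described above can be illustrated with a small sketch: the block learns a residual function F(x) and adds the input back through an identity shortcut, so the output is relu(F(x) + x). The version below uses two fully connected layers in NumPy purely for brevity; the function names and the dense (rather than convolutional) layers are assumptions made for this example and do not reproduce the architecture from the cited paper.

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = relu(x @ w1) @ w2.

    x:  array of shape (batch, dim)
    w1: array of shape (dim, hidden)
    w2: array of shape (hidden, dim), so F(x) matches the shape of x.
    """
    residual = relu(x @ w1) @ w2   # the learned residual function F(x)
    return relu(residual + x)      # identity shortcut adds the input back

# With all-zero weights the residual is zero, so the block reduces to relu(x).
x = np.array([[1.0, -2.0, 3.0]])
zero_w1 = np.zeros((3, 4))
zero_w2 = np.zeros((4, 3))
out = residual_block(x, zero_w1, zero_w2)
print(out)
```

The zero-weight case shows why such blocks are thought to be easier to optimize: a block can fall back to passing its input through (up to the final activation) simply by driving its weights toward zero.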