Detecting Criminal Activities and Promoting
Safety Using Deep Learning
Rohan Mathur, Tejas Chintala and Rajeswari D*
Department of Data Science and Business Systems, School of Computing, SRM Institute of Science and Technology,
Kattankulathur, Tamil Nadu– 603203, India.
E-mail : rm3686@srmist.edu.in, tc5739@srmist.edu.in, rajeswad@srmist.edu.in
Abstract- Automation and autonomous systems are among
the few powerhouses of innovation that drive entire domains
towards advancing further in leaps and bounds. Great
technological innovations can be attributed to tasks that are
made easier and more perceptible by automation, and
artificial intelligence is here to make these automated
systems smart enough to perform their tasks with the power
of decision-making, thereby greatly reducing human
intervention in redundant processes. Our project follows the
aforementioned ideals: building a product to minimize
manual labor (both physical and mental) for tasks that can
be seamlessly automated and processed while solving the
main problem statement at hand. Currently,
surveillance cameras play a vital role in ensuring
the safety of people, yet they are plain
video-providing entities with no smart
decision-making mechanisms of their own. With the
growth of data collected from surveillance cameras,
automated analysis of video streams has become a
requisite for detecting abnormal events. The main
aim of the project is to promote safety on campus by
employing deep learning techniques to automate the
task of monitoring and reporting crimes from the
physical Closed-Circuit Television (CCTV), assigning
the responsibility of detecting criminal activity to
a framework that can identify patterns that
differentiate crimes for smarter monitoring. The
model proposed in this paper was able to distinguish
between certain crimes, achieving precisions of 0.94
and 0.95 for Assault and Abuse, respectively.
Keywords—ResNet, Computer Vision, Deep Learning, Triplet
Loss Function, Single Shot Detector, UCF Crime
I. INTRODUCTION
Video classification is a computer vision problem
motivated by the ability to solve and automate
classification tasks concerning real-time live
video. Since the problem is recent, there still
exist solutions left to be tested. Even so, the
applications constitute a wide spectrum, from
detecting important aspects of sports actions [1] or
daily pursuits taking place in a scene, to various
security and healthcare use cases. Our project aims
at promoting safety on campus by automating the task
of monitoring and reporting crimes, assigning the
responsibility of detecting criminal or abnormal
activity to a system that is well-versed in deducing
patterns that distinguish criminal activity from
normal activity. Traditional surveillance systems
deal with a plethora of shortcomings that do not
sync well with the facilities available in today's
day and age.
One of the most notable flaws in vanilla surveillance
systems is the heavy dependence on an attentive
supervisor monitoring footage and ensuring that any
abnormal activity is duly noted and taken care of.
CCTV footage requires human intervention, which may
lead to errors [8]. In a bid to reduce manual
intervention and labor, the project not only
eliminates the need for additional supervision, but
also immediately detects the occurrence of a crime,
takes note of the people involved, recognizes the
type of crime occurring, and accordingly triggers
actions that initiate mitigation measures at the
crime scene in real time.
There have been different implementations for the
automatic detection of crimes like localizing the gun
and victim present in the video footage [4] using
deep learning approaches, especially CNN
architectures like Residual models [5]. With the
availability of vast databases like the UCF-Crime
dataset [6] or the RWF-2000 database [7], it becomes
easier to create systems of automatic activity
detection for the enhancement of safety in public
places.
II. LITERATURE REVIEW
Many papers on this application, along with their
implementations, are discussed in this section.
Umadevi V. Navalgund and Priyadharshini K [8]
proposed a crime detection system, creating a
pipeline that detects weapons from images and
training pretrained models to classify them.
2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI) | 978-1-6654-9529-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/ACCAI53970.2022.9752619
Authorized licensed use limited to: SRM University. Downloaded on May 11,2022 at 06:34:43 UTC from IEEE Xplore. Restrictions apply.
They used VGGNet 19 as their pre-trained model for detection
and achieved 69% accuracy and 75% recall. R. Olmos,
S. Tabik, and F. Herrera [4] described a system for
detecting guns in surveillance videos and
classifying whether or not a crime is taking place.
The results of that paper were obtained using
Region-based Convolutional Neural Network (RCNN) and
Faster Region-based Convolutional Neural Network
(FRCNN) models, which the authors trained on their
own data. The system gave satisfactory predictions
even on low-quality videos.
A. R. Zamir [1] presents an overview of prominent
action localization and recognition methods for
sports videos. Using the Sports dataset provided by
UCF as the benchmark for evaluating the discussed
techniques, they organize action recognition into a
three-stage pipeline: feature extraction,
representation of videos with dictionary learning,
and finally classification of the data (in this
use case, the sport). A. Karpathy et al. [9] study
the performance of convolutional neural networks
(CNNs) in video classification, finding that CNN
architectures [10] are capable of learning powerful
features from weakly-labeled data and surpass other
existing methods in performance. Their transfer
learning experiments suggest that the learned
features are generic and generalize to other
classification tasks. Tahani Almanie, Rsha Mirza and
Elizabeth Lor [11] pursue a similar objective using
machine learning techniques like Decision Trees and
Naïve Bayes Classifiers for classifying and
predicting crimes on spatial datasets. They achieved
51% and 54% accuracy on datasets for two cities,
Denver and Los Angeles, respectively.
III. PROPOSED METHOD
The main aim of this paper is to detect crimes
happening in video streams such as CCTV camera
feeds. In addition, a module for recognizing faces
in these streams is implemented using the previously
established Triplet Loss method. This paper proposes
two modules to address the given problem. The first
takes a CCTV stream and recognizes the faces in it
using the methods explained below. The second
tackles crime detection using deep learning methods. Figure
1 gives a brief overview of both the modules
discussed in this paper. The Face Recognition
module uses OpenCV's built-in DNN (Deep Neural
Networks) module to train a model that is primarily
built to differentiate and recognize faces. This
is done by identifying faces in the given data,
performing data preprocessing, and then training on
embeddings using the Triplet Loss function. After
these embeddings are calculated, the model is used
to recognize such faces. This can easily be done by
loading the model and using a webcam; identified
faces are surrounded with bounding boxes along with
a confidence value. This paper was able to
achieve significant results when evaluating this
module. The Crime Detection module first selects
input videos that contain criminal activity as
well as videos with normal (no criminal) activity.
After basic preprocessing steps, which include data
augmentation and converting the videos into image
frames, the data is trained on a pre-existing
ResNet architecture, from which accuracy and other
evaluation metrics are observed.
Finally, using the model that is trained, one can
detect crimes by playing simulated webcam streams
through a mobile phone. Accuracies over 90% were
achieved when evaluating for a total of six classes in
this paper. Overall, this paper suggests an
end-to-end pipeline that ties both modules together
and uses them to simultaneously detect a crime and
identify who is committing it.
A. Face Recognition
Face Detection is the technique of detecting and
returning the location of a face inside the frame from
a given photo or video stream. Face verification
takes it a step further and checks if the given face
holds any resemblance with another face stored in
memory. This is done by gauging the similarity
between the two faces using distance metrics like the
L2 norm or the cosine similarity. Finally, face
recognition cumulatively involves both the above
techniques to extract prominent features from the
face and identify who the face belongs to from a set
of labels obtained from the dataset on which the
model is being trained. The proposed solution that
this paper offers for face recognition involves
detecting faces, computing the embeddings of each
face, training a Support Vector Machine (SVM) on the
given embeddings, and finally recognizing faces in
images or simulated video streams. Figure 2 explains
the pipeline discussed in this paper. A Caffe model
is used for face detection and the OpenFace model
for feature extraction.
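The verification step described above can be sketched with plain NumPy. The 0.6 distance threshold and the 128-dimensional vectors below are illustrative stand-ins for a tuned threshold and real OpenFace embeddings:

```python
import numpy as np

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean (L2) distance between two face embeddings."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, the alternative metric mentioned above."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_same_face(a: np.ndarray, b: np.ndarray, threshold: float = 0.6) -> bool:
    """Verify two embeddings as the same person if their L2 distance
    falls below a threshold (0.6 here is purely illustrative)."""
    return l2_distance(a, b) < threshold

# Toy 128-d embeddings standing in for OpenFace outputs.
rng = np.random.default_rng(0)
anchor = rng.normal(size=128)
close = anchor + rng.normal(scale=0.001, size=128)  # near-duplicate embedding
far = -anchor                                       # maximally dissimilar embedding
print(is_same_face(anchor, close), is_same_face(anchor, far))  # True False
```

In practice the threshold is chosen by validating distances between known same-person and different-person pairs.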
Fig. 1. Overview of Face Recognition and Crime
Detection Module
Fig. 2. Overall Face Recognition Pipeline with
OpenCV
OpenCV's deep learning face detection is based on
the Single Shot Detector (SSD) architecture with a
ResNet backbone. Single-shot detection refers to a
technique wherein the model requires only a single
forward pass to detect multiple objects within the
image. It discretizes the given image into bounding
boxes around regions whose feature maps have high
confidence and generates multiple boxes around such
regions. The confidence for each of these boxes is
calculated and the box dimensions are adjusted to
obtain the best fit for detection. Figure 3 depicts
how the final bounding boxes look once a face is
recognized.
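A minimal sketch of how an SSD output tensor is typically parsed into pixel-space bounding boxes. The model file names in the comments are illustrative, and the synthetic tensor stands in for a real forward-pass result:

```python
import numpy as np

def parse_detections(detections, frame_w, frame_h, conf_threshold=0.5):
    """Convert a raw SSD output tensor (shape 1x1xNx7, rows of
    [batch, class, confidence, x1, y1, x2, y2] with normalized
    coordinates) into pixel-space boxes above a confidence cutoff."""
    boxes = []
    for det in detections[0, 0]:
        confidence = float(det[2])
        if confidence < conf_threshold:
            continue
        scale = np.array([frame_w, frame_h, frame_w, frame_h])
        x1, y1, x2, y2 = det[3:7] * scale
        boxes.append((int(x1), int(y1), int(x2), int(y2), confidence))
    return boxes

# With OpenCV and the Caffe model files, this would be driven by:
#   net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "face_ssd.caffemodel")
#   blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
#                                (300, 300), (104.0, 177.0, 123.0))
#   net.setInput(blob); detections = net.forward()
# (file names here are illustrative placeholders).

# Synthetic output tensor: one confident face, one low-confidence region.
fake = np.zeros((1, 1, 2, 7), dtype=np.float32)
fake[0, 0, 0] = [0, 1, 0.98, 0.25, 0.25, 0.50, 0.50]
fake[0, 0, 1] = [0, 1, 0.10, 0.70, 0.70, 0.80, 0.80]
print(parse_detections(fake, 640, 480))  # only the confident face is kept
```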
Fig. 3. SSD – Multiple Bounding Boxes with Localization and Confidence for a Given Face
Additionally, we can compute facial landmarks
(mouth, right/left eyebrows, eyes, nose, jawline)
using the dlib library, which further enables us to
preprocess the images and perform face alignment
on datasets for better results. After cropping and
performing face alignment, we pass the given face
through the proposed neural network. For training a
face recognition model, an input batch must contain
an Anchor image (current image of person "A"), a
Positive image (another image of person "A"), and
a Negative image (any image that is not of person
"A"). From these, the neural network calculates the
face embeddings and adjusts its weights using a
method called triplet loss. In this way, the
embeddings of the "Anchor" and the "Positive"
image end up close to each other, whereas the
embedding of the "Negative" image is farther away.
A CNN (Caffe) model computes the embeddings for all
input images, and these embeddings are sufficiently
distinctive to train a classifier such as an SVM,
SGD Classifier, Random Forest, etc. on top of the
calculated face embeddings, which completes the
facial recognition pipeline.
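The triplet loss just described can be written out in a few lines of NumPy. The margin of 0.2 is a common illustrative choice (e.g. in FaceNet-style training), not a value stated in this paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: push the anchor-positive distance below the
    anchor-negative distance by at least `margin`. Zero loss means
    the triplet is already correctly separated."""
    pos_dist = np.sum((anchor - positive) ** 2)  # squared L2 to the positive
    neg_dist = np.sum((anchor - negative) ** 2)  # squared L2 to the negative
    return float(max(pos_dist - neg_dist + margin, 0.0))

anchor = np.array([0.0, 1.0])
positive = np.array([0.0, 0.9])   # same person: close to the anchor
negative = np.array([1.0, 0.0])   # different person: far away
print(triplet_loss(anchor, positive, negative))  # 0.0 -> constraint satisfied
print(triplet_loss(anchor, negative, positive))  # positive loss -> weights must adjust
```

During training, the network's weights are updated to drive this loss toward zero over many such triplets.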
1) Data Augmentation: To make the most of a limited
amount of data, this paper applies several data
augmentation methods that multiply the effective
size of the dataset. Flipping, rotating, zooming,
translating along the x-axis and y-axis, scaling,
cropping, adding Gaussian noise, shearing, skewing,
applying black-and-white filters, and blurring
images have all been implemented. The augmented
version of our dataset is displayed in Figure 4.
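A few of the listed augmentations can be sketched directly as NumPy array operations; rotation, shearing, and the other geometric transforms would normally be delegated to an image-processing library:

```python
import numpy as np

def augment(img, rng):
    """Yield simple augmented variants of an HxWxC image array:
    horizontal/vertical flips, Gaussian noise, and a grayscale filter."""
    yield img[:, ::-1]                                    # horizontal flip
    yield img[::-1, :]                                    # vertical flip
    noisy = img + rng.normal(0, 10, img.shape)            # additive Gaussian noise
    yield np.clip(noisy, 0, 255).astype(img.dtype)
    gray = img.mean(axis=2, keepdims=True)                # black-and-white filter
    yield np.repeat(gray, img.shape[2], axis=2).astype(img.dtype)

rng = np.random.default_rng(42)
frame = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # toy frame
variants = list(augment(frame, rng))
print(len(variants), [v.shape for v in variants])
```

Each variant keeps the original label, so one labeled frame yields several training samples.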
B. Video Classification for Crime Detection
This module aims at detecting anomalies in video
footage of criminal and normal activity. The dataset
used for training is the UCF Crime dataset, which is
the only dataset that contains videos of such
diverse classes of crimes, each replete with
valuable and distinctive features. The dataset
contains 13 classes in total: Accidents, Fighting,
Burglary, Shoplifting, Robbery, Shooting, Abuse,
Arrest, Arson, Assault, Explosion, Stealing, and
Vandalism. In total, it consists of approximately
1,900 real-world videos, taken in different places,
each video showing a realistic crime. This amounts
to 128 hours of video, cumulatively occupying 95 GB
of storage. Figure 5 depicts the pipeline followed
for this module. Figure 6 showcases videos of a few
of the classes.
Fig. 4. Different Types Of Augmented Images
Fig. 5. Crime Detection Module
1) Dataset Preparation and Pre-Processing: Three
techniques are identified and used on each scene,
which involve converting, enhancing, and finally
augmenting the data. This paper uses six classes:
"Abuse", "Assault", "Fighting", "Normal",
"Robbery" and "Vandalism". Because of the sheer
size of the given videos, the authors had to reduce
the number of dataset classes. Video editing and
trimming helped reduce redundant and misleading
features: videos that were 5 minutes long were
trimmed to 45 seconds, focusing on the duration in
which the crime occurred, with irrelevant portions
discarded. Low-resolution videos were sharpened and
cropped to highlight portions of the crime scene.
The parts of each video containing the crime are
manually labeled, and the remaining stream of the
video can then be labeled under the normal class.
Finally, data augmentation was performed to enlarge
the variety of data available to train the model.
Figure 7 gives an example of the data augmentation
that was done.
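The conversion of trimmed clips into image frames can be sketched as follows. The sampling helper is pure Python, while the OpenCV portion assumes a video file is available and uses illustrative paths:

```python
def frame_indices(total_frames, fps, step_seconds=1.0):
    """Indices of the frames to keep when sampling one frame every
    `step_seconds` from a video with `total_frames` frames at `fps`."""
    step = max(1, int(round(fps * step_seconds)))
    return list(range(0, total_frames, step))

def extract_frames(video_path, out_dir, step_seconds=1.0):
    """Dump sampled frames of a (trimmed) clip as JPEGs.
    Requires OpenCV; the paths passed in are illustrative."""
    import os
    import cv2  # imported lazily so the sampling helper stays dependency-free
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unreported
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    keep = set(frame_indices(total, fps, step_seconds))
    os.makedirs(out_dir, exist_ok=True)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in keep:
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.jpg"), frame)
        idx += 1
    cap.release()

# A 45-second trimmed clip at 30 fps sampled once per second yields 45 frames.
print(len(frame_indices(45 * 30, 30)))  # 45
```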
Fig. 6. Video Classes being turned into photos for preparation
Fig. 7. Normal Image, Rotated Left, Rotated Right
2) Residual Network (ResNet): ResNet layers are
formulated to learn residual functions with
reference to the layer inputs, instead of learning
unreferenced functions. The architecture, with a
depth of between 18 and 152 convolutional layers,
bypasses the signal from one layer to another by
introducing a shortcut connection. These connections
allow gradients to flow between layers of the
network, from the early ones to the later ones,
thereby easing the training of very deep networks.
The Residual Block illustrated in Figure 8 shows how
the connection bypasses the signal from the top to
the tail of the block. By introducing residual
connections, the architecture was able to solve the
vanishing gradient issue: with skip connections, the
network is able to skip training for a few layers
and make a direct connection to the output
[12], [13].
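The skip connection can be illustrated with a toy fully-connected residual block in NumPy (a simplification of the convolutional blocks in the actual architecture). When the learned residual function F(x) is zero, the block reduces to a ReLU of the identity, which is what lets very deep stacks train:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: output = relu(F(x) + x), where
    F(x) = w2 @ relu(w1 @ x) is the learned residual function and
    the identity skip connection carries x straight through."""
    fx = w2 @ relu(w1 @ x)
    return relu(fx + x)

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# With zero weights, F(x) = 0 and the block passes x through (up to
# the final ReLU) -- gradients can bypass the residual path entirely.
w_zero = np.zeros((4, 4))
print(np.allclose(residual_block(x, w_zero, w_zero), relu(x)))  # True
```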
3) Proposed Video Classification Technique: The
proposed pipeline for crime classification loops
over all frames in an input video. Each frame is
passed through a CNN and classified individually,
independent of the others. The model then chooses
the label with the largest probability for the
frame, labels it, and finally writes the output
frame. Since crime detection is a sequential
problem, this single-frame method alone will not
work; the correlation between subsequent frames of a
single video input must be preserved.
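One common remedy, and the rolling-average scheme this paper adopts, is to average the last N per-frame predictions before choosing a label. A minimal sketch, where the window size of 3 and the probability vectors are illustrative:

```python
from collections import deque
import numpy as np

CLASSES = ["Abuse", "Assault", "Fighting", "Normal", "Robbery", "Vandalism"]

def make_smoother(n=16):
    """Return a function that averages the last `n` per-frame softmax
    vectors and emits the label with the largest mean probability
    (n=16 is an illustrative window size, not the paper's value)."""
    window = deque(maxlen=n)
    def smooth(frame_probs):
        window.append(np.asarray(frame_probs))
        mean = np.mean(window, axis=0)   # rolling average of the last n frames
        return CLASSES[int(np.argmax(mean))]
    return smooth

smooth = make_smoother(n=3)
print(smooth([0.1, 0.0, 0.8, 0.1, 0.0, 0.0]))  # Fighting
print(smooth([0.0, 0.0, 0.3, 0.7, 0.0, 0.0]))  # one noisy frame is outvoted: Fighting
print(smooth([0.1, 0.0, 0.8, 0.1, 0.0, 0.0]))  # Fighting
```

Averaging suppresses single-frame flicker, so a momentary misclassification does not change the reported label.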
Fig. 8. Residual Block
This is achieved by computing the prediction for a
given frame while maintaining a list of the last N
predictions. The pipeline then computes the average
of these last N predictions, chooses the label with
the largest probability, and returns the final
output.
4) Training the ResNet on the UCF-Crimes Dataset:
The training and testing steps through ResNet are
discussed below.
First, the file directories of the training and
validation images are located, and training
parameters such as batch size, number of epochs,
image height and width, and learning rate are
specified. Training and validation data are
generated using the ImageDataGenerator module in
TensorFlow. As training runs, the behaviour of each
epoch is saved and plotted at the end of training.
Finally, the model, along with its weights, is
stored. For training, the stated hyper-parameters
such as learning rate, number of epochs, and batch
size are specified for the model; these
hyper-parameters must be taken into account while
training a CNN to improve performance. Figure 9
shows the final training epochs.
The following parameters were provided to the ResNet module:
Splitting the dataset: 25% kept as test data, the remainder for model training.
Number of epochs = 50
Loss = categorical cross-entropy
Optimizer = stochastic gradient descent with a learning rate of 0.0001
Metric = accuracy
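The stated loss and optimizer can be illustrated in isolation; the class probabilities, gradient, and weight values below are hypothetical and chosen only to show the arithmetic:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """The training loss: -sum over classes of y_true * log(y_pred)."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))

# One-hot target for a 6-class problem (here, the second class).
y_true = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
confident = np.array([0.02, 0.90, 0.02, 0.02, 0.02, 0.02])
uniform = np.full(6, 1 / 6)
# A confident correct prediction incurs a much smaller loss.
assert categorical_cross_entropy(y_true, confident) < categorical_cross_entropy(y_true, uniform)

# A vanilla SGD step with the stated learning rate of 0.0001:
# each weight moves a small distance against its gradient.
lr = 1e-4
w = np.array([0.5, -0.3])
grad = np.array([2.0, -1.0])
w = w - lr * grad
print(w)  # [ 0.4998 -0.2999]
```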
The evaluation of the final model is based on the
following terms:
False Positive: no crime occurs in the scene, but
the model considers otherwise.
False Negative: the model fails to detect a crime
that occurred.
True Positive: the model correctly detects a crime
that occurred.
True Negative: no crime occurs and the model
correctly reports none.
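From these counts, precision and recall follow directly. The counts below are hypothetical, chosen only to show the arithmetic; they are not taken from the paper's confusion matrix:

```python
def precision(tp, fp):
    """Of everything flagged as a crime, the fraction that really was one."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all crimes that occurred, the fraction the model caught."""
    return tp / (tp + fn)

# Hypothetical per-class counts for illustration only.
tp, fp, fn = 94, 6, 10
print(round(precision(tp, fp), 2), round(recall(tp, fn), 3))  # 0.94 0.904
```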
Fig. 9. Training the classifier (Epochs = 50)
The final epoch displays the following metrics:
Training Loss: 0.158
Training Accuracy: 0.953
Validation Loss: 0.1168
Validation Accuracy: 0.966
IV. RESULTS
A. Face Recognition
Training for face recognition was done on three
different faces, with approximately 15 photos of
each person. Having limited resources and
computational power, the authors chose to keep the
data for this module minimal. The results, using a
webcam as a simulated CCTV stream, are shown below
in Figure 10 and Figure 11.
As seen, the model was able to correctly classify
the faces as well as provide bounding boxes and
confidence levels for them.
Fig. 10. Face Recognition for person A
(simulating a CCTV)
Fig. 11. Face Recognition for person B
(simulating a CCTV)
B. Crime Detection
After training for 50 epochs, the model was
evaluated and gave the results shown in Table I and
Table II. The training loss and accuracy, along with
the validation loss and accuracy, were recorded and
are shown in Figure 12. Using the trained model, the
authors ran evaluations on rolling streams of
YouTube videos of the criminal activities
"Fighting", "Vandalism" and "Abuse". The model was
able to correctly predict the happenings in each
stream, as shown in Figure 13, Figure 14 and
Figure 15.
V. CONCLUSION AND FUTURE WORKS
The face recognition module developed in this
project was able to demonstrate a simulated version
of the final prototype and successfully replicate a
facial recognition model. The Crime Detection module
gave satisfactory results with the ResNet model,
having been trained on numerous classes after the
final data preprocessing. Future enhancements would
include using more facial classes for training the
face recognition model. It would also be interesting
to create a database of these faces so that a given
victim can easily be identified when deploying the
implementation with a live CCTV camera stream.
Trying different facial recognition models and
comparing the results would also be worthwhile
experiments. Owing to constraints on computational
power, this paper uses a limited number of classes;
with more data, the project could aim to classify
additional types of crimes.
Table I Results for the Different Classes
Table II Accuracy Metrics
Fig. 12. Loss and Accuracy as a Function of Epochs
Fig. 13. Detecting and Classifying Crimes from Test
Footage
Fig. 14. Detecting and Classifying Movement with
Weapons
Fig. 15. Detecting and Classifying Crimes
REFERENCES
[1] A. R. Zamir, “Action recognition in realistic sports videos,” in
Computer vision in sports. Springer, 2014, pp. 181–208.
[2] H. Agarwal, A. Singh, and D. Rajeswari, “Deepfake Detection
using SVM,” in Proceedings of the 2nd International Conference on
Electronics and Sustainable Communication Systems (ICESC), 2021,
pp. 1245–1249.
[3] H. Soni, P. Arora, and D. Rajeswari, “Malicious Application
Detection in Android using Machine Learning,” in Proceedings of
the 2020 IEEE International Conference on Communication and
Signal Processing (ICCSP), 2020, pp. 846–848.
[4] R. Olmos, S. Tabik, and F. Herrera, Soft Computing and
Intelligent Information Systems research group, “Automatic
Handgun Detection Alarm in Videos Using Deep Learning,”
February 20, 2017.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep
Residual Learning for Image Recognition”, 2015.
[6] UCF-Crime dataset (real-world anomaly detection in videos),
Jun 2019.
[7] M. Cheng, K. Cai, and M. Li, “RWF-2000: An Open Large Scale
Video Database for Violence Detection,” ICPR 2020,
https://arxiv.org/pdf/1911.05913v3.pdf.
[8] Umadevi V. Navalgund; Priyadharshini K, “Crime Intention
Detection System Using Deep Learning”, International
Conference on Circuits and Systems in Digital Enterprise
Technology (ICCSDET), December 2018.
[9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar,
and L. Fei-Fei, “Large-scale video classification with
convolutional neural networks,” in The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2014.
[10] K. O’Shea and R. Nash, “An Introduction to Convolutional
Neural Networks,” ArXiv e-prints, 2015.
[11] T. Almanie, R. Mirza, and E. Lor, “Crime Prediction Based on
Crime Types and Using Spatial and Temporal Criminal Hotspots,”
International Journal of Data Mining & Knowledge Management
Process (IJDKP), vol. 5, no. 4, July 2015.
[12] D. Sarwinda, R. H. Paradisa, A. Bustamam, and P. Anggia,
“Deep Learning in Image Classification using Residual Network
(ResNet) Variants for Detection of Colorectal Cancer,” Procedia
Computer Science, vol. 179, pp. 423–431, 2021.
[13] P. Stalidis and T. Semertzidis, “Examining Deep Learning
Architectures for Crime Classification and Prediction: Improved
Forecasting through Artificial Intelligence,” October 2021.