Figure 7 - uploaded by Vinayakumar Ravi
Uniform resource locator (URL).

Source publication
Article
Full-text available
A computer virus or malware is a computer program written with the purpose of causing harm to a system. This year has witnessed a rise in malware, and the losses it causes are high. Cyber criminals have been continually advancing their methods of attack. The existing methodologies to detect the existence of such malicious programs and to prevent them...

Context in source publication

Context 1
... Resource Locator (URL) is a subset of the Uniform Resource Identifier (URI), a unique character string used to locate web resources unambiguously across the web, as shown in Figure 7. URLs are commonly web addresses used to locate the servers behind different applications: web pages (HTTP), database access, file transfer (FTP) and email. ...
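As a rough illustration of the URL structure described in this excerpt, the sketch below splits a URL into its components with Python's standard `urllib.parse` module. The helper name `describe_url` and the example address are assumptions made for illustration; they do not come from the source publication.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into its named components (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # protocol, e.g. http or ftp
        "host": parts.hostname,      # server that holds the resource
        "port": parts.port,          # None when the URL gives no port
        "path": parts.path,          # location of the resource on the server
        "query": parts.query,        # optional parameters after '?'
        "fragment": parts.fragment,  # optional anchor after '#'
    }


# Illustrative address only; example.com is a reserved documentation domain.
components = describe_url("http://www.example.com:8080/docs/index.html?lang=en#intro")
for name, value in components.items():
    print(f"{name}: {value}")
```

The scheme identifies the application protocol (HTTP, FTP and so on), while the host and path together locate the resource.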

Citations

... such as cyber security events detection (Vinayakumar et al. 2019), commitment detection (Azarbonyad, Sim, and White 2019), intent detection (Shu et al. 2020) and online template induction (Whittaker et al. 2019). Most of these applications depend on training high-quality classifiers on emails. ...
Article
It is an essential product requirement of Yahoo Mail to distinguish between personal and machine-generated emails. The old production classifier in Yahoo Mail was based on a simple logistic regression model. That model was trained by aggregating features at the SMTP address level. We propose building deep learning models at the message level. We train four individual CNN models: (1) a content model with subject and content as input, (2) a sender model with sender email address and name as input, (3) an action model by analyzing email recipients’ action patterns and generating target labels based on senders’ opening/deleting behaviors and (4) a salutation model by utilizing senders’ "explicit salutation" signal as positive labels. Next, we train a final full model after exploring different combinations of the above four models. Experimental results on editorial data show that our full model improves the adjusted-recall from 70.5% to 78.8% and the precision from 94.7% to 96.0% compared to the old production model. Also, our full model significantly outperforms a state-of-the-art BERT model at this task. Our new model has been deployed to the current production system (Yahoo Mail 6).
... Figure 7 depicts the various phishing email detection methods utilised in the literature, and also the volume of publications that use each method. The most prevalent phishing email detection algorithms are supervised approaches, such as support vector machines (SVM) [44]-[50], logistic regression (LR) [44], [45], [48], [51]-[56], Decision Tree (DT) [48]-[50], [57]-[62], and Naïve Bayes (NB) [44], [63]-[65]. Unsupervised approaches such as k-means clustering [48], [66]-[70] and deep learning methods have also been adopted [45], [51], [52], [63], [71]-[82]. ...
... Hence, the filter matrix column space would be similar to the input matrix column space [145]. We noticed that several studies, e.g., [45], [63], [74], [75] have used the CNN method. ...
Article
Full-text available
Every year, phishing results in losses of billions of dollars and is a major threat to the Internet economy. Phishing attacks are now most often carried out by email. To better comprehend the existing research trend of phishing email detection, several review studies have been performed. However, it is important to assess this issue from different perspectives. None of the surveys has comprehensively studied the use of Natural Language Processing (NLP) techniques for detection of phishing, except one that shed light on the use of NLP techniques for classification and training purposes while exploring a few alternatives. To bridge the gap, this study aims to systematically review and synthesise research on the use of NLP for detecting phishing emails. Based on specific predefined criteria, a total of 100 research articles published between 2006 and 2022 were identified and analysed. We study the key research areas in phishing email detection using NLP, machine learning algorithms used in phishing email detection, text features in phishing emails, datasets and resources that have been used in phishing emails, and the evaluation criteria. The findings include that the main research area in phishing detection studies is feature extraction and selection, followed by methods for classifying and optimizing the detection of phishing emails. Amongst the range of classification algorithms, support vector machines (SVMs) are heavily utilised for detecting phishing emails. The most frequently used NLP techniques are found to be TF-IDF and word embeddings. Furthermore, the most commonly used dataset for benchmarking phishing email detection methods is the Nazario phishing corpus. Also, Python is the most commonly used programming language for phishing email detection. It is expected that the findings of this paper can be helpful for the scientific community, especially in the field of NLP application in cybersecurity problems.
This survey also is unique in the sense that it relates works to their openly available tools and resources. The analysis of the presented works revealed that not much work had been performed on Arabic language phishing emails using NLP techniques. Therefore, many open issues are associated with Arabic phishing email detection.
Preprint
Full-text available
Enterprise security is increasingly being threatened by social engineering attacks, such as phishing, which deceive employees into giving access to enterprise data. To protect both the users themselves and enterprise data, more and more organizations provide cyber security training that seeks to teach employees/customers to identify and report suspicious content. By its very nature, such training seeks to focus on signals that are likely to persist across a wide range of attacks. Further, it expects the user to apply the lessons from this training to e-mail messages that were not filtered by existing, automatic enterprise security (e.g., spam filters and commercial phishing detection software). However, relying on such training shifts the detection of phishing from an automatic process to a human-driven one, which is fallible, especially when a user errs due to distraction, forgetfulness, etc. In this work we explore treating this type of detection as a natural language processing task and modifying training pipelines accordingly. We present a dataset with annotated labels where these labels are created from the classes of signals that users are typically asked to identify in such training. We also present baseline classifier models trained on these classes of labels. With a comparative analysis of performance between human annotators and the models on these labels, we provide insights which can contribute to the improvement of the respective curricula for both machine and human training.
Article
Full-text available
Modern malware increasingly employs domain generation algorithms (DGAs) to evade traditional DNS query detection methods, such as blacklisting or reverse engineering of suspicious domain names. These algorithms generate vast numbers of random domain names to establish communication with Command and Control (C&C) servers, posing significant challenges for detection. Previous research has predominantly relied on classical machine learning algorithms, necessitating manual feature extraction and classification, which is both time-consuming and labour-intensive. In this paper, we propose a deep learning-based architecture for detecting DGA-generated domain names. Our model utilizes recurrent networks with gated recurrent units (GRUs) for domain name detection. By converting domain names into vectors and employing GRUs, the model autonomously learns features, eliminating the need for manual intervention in feature extraction. Compared to traditional methods, our approach reduces the time costs associated with feature extraction. The experimental results demonstrate the effectiveness of our proposed GRU model, which achieves 98% accuracy, a 94% recall rate, 93% precision, and an Area Under the Curve (AUC) of 99.6%. The GRU architecture outperforms LSTM models in terms of recall rate and accuracy while requiring fewer computational resources, indicating a significant performance enhancement.
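The abstract does not say how domain names are converted into vectors. One common choice, shown here purely as an assumption, is character-level integer encoding with padding to a fixed length, which yields sequences a recurrent layer such as a GRU can consume. The alphabet, the `max_len` default and the name `encode_domain` are hypothetical.

```python
import string
from typing import List

# Assumed alphabet: lowercase letters, digits, hyphen and dot.
ALPHABET = string.ascii_lowercase + string.digits + "-."
# Index 0 is reserved for padding, so characters start at 1.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
# Any character outside the assumed alphabet maps to one shared index.
UNKNOWN_INDEX = len(ALPHABET) + 1


def encode_domain(domain: str, max_len: int = 32) -> List[int]:
    """Map a domain name to a fixed-length list of integer character indices."""
    truncated = domain.lower()[:max_len]
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN_INDEX) for ch in truncated]
    # Right-pad with zeros so every domain has the same length.
    return indices + [0] * (max_len - len(indices))


print(encode_domain("example.com", max_len=16))
```

In a full model these integer sequences would typically pass through an embedding layer before the recurrent units; that part is omitted here.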
Article
Phishing involves malicious activity whereby phishers, in the disguise of legitimate entities, obtain illegitimate access to the victims' personal and private information, usually through emails. Currently, phishing attacks and threats are being handled effectively through the use of the latest phishing email detection solutions. Most current phishing detection systems assume phishing attacks to be in English, though attacks in other languages are growing. In particular, Arabic is a widely used language and therefore represents a vulnerable target. However, there is a significant shortage of corpora that can be used to develop Arabic phishing detection systems. This paper presents the development of a new English-Arabic parallel phishing email corpus that has been developed from the anti-phishing shared task text (IWSPA-AP 2018). The email content was translated by 10 volunteers who had a university background and were English and Arabic language experts. To evaluate the effectiveness of the new corpus, we develop phishing email detection models using Term Frequency-Inverse Document Frequency (TF-IDF) and a Multilayer Perceptron, using 1258 emails in Arabic and English that have equal ratios of legitimate and phishing emails. The experimental findings show that the accuracy reaches 96.82% for the Arabic dataset and 94.63% for the emails in English, providing some assurance of the potential value of the parallel corpus developed.
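For readers unfamiliar with TF-IDF, the sketch below computes one textbook variant: term frequency normalised by document length, multiplied by the unsmoothed logarithm of the inverse document frequency. The abstract does not state which variant or tokenisation the authors used, so the formula, the function name `tf_idf` and the toy emails are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # tf = count / document length, idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up emails that are already split into tokens.
emails = [["verify", "your", "account"], ["meeting", "account"]]
for vector in tf_idf(emails):
    print(vector)
```

Terms that occur in every document, such as "account" in the toy data, receive a weight of zero under this variant, so they contribute nothing to a downstream classifier such as a multilayer perceptron.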
Chapter
Paste sites are largely used for innocent text sharing but they have grown in popularity as venues for criminal operations such as data leaks and publication. This research examines numerous types of sensitive information and the extent to which each can cause damage if compromised. Our proposal intends to develop an efficient scoring scheme for determining the sensitivity of information included within a paste’s body. We designed a scraper to monitor two surface web and two dark web paste sites and extract and score various aspects from the obtained data. The findings indicated that surface web paste sites featured a greater amount of sensitive material than dark web paste sites.