An anti-phishing model based on similarity measurement

Int. J. Computational Vision and Robotics, Vol. 12, No. 2, 2022
Copyright © 2022 Inderscience Enterprises Ltd.
Parvinder Singh* and Bhawna Sharma
Department of Computer Science and Engineering,
Deenbandhu Chhotu Ram University of Science and Technology,
Murthal, Sonepat, Haryana, India
Email: parvindersingh.cse@dcrustm.org
Email: bhawnash024@gmail.com
*Corresponding author
Jasvinder Kaur
Computer Science and Engineering,
PDM University,
Bahadurgarh, Haryana, India
Email: jasvinder.kaur@pdm.ac.in
Abstract: Phishing poses a growing threat to internet users. In the current
work, the authors develop an anti-phishing technique based on a hybrid
similarity approach that combines cosine and soft-cosine similarity to measure
the resemblance between a user query and a database. The proposed similarity
hybrid is also evaluated against another hybrid, comprising the cosine and
Jaccard similarity measures, to validate the proposed work. Both hybrid
similarities are separately fed to the validation layer of a feed forward back
propagation neural network (FFBPNN) to predict phishing and legitimate
websites. The performance of the proposed work is evaluated on a dataset of
3,000 sample files in terms of positive predictive value (PPV), true positive
rate (TPR) and F-measure. The comparative analysis demonstrated that the
anti-phishing model using the proposed similarity hybrid outperformed the
cosine and Jaccard similarity hybrid, with 0.233%, 0.2833% and 0.258% higher
PPV, TPR and F-measure, respectively.
Keywords: phishing; cosine similarity; soft-cosine similarity; similarity index;
FFBPNN.
Reference to this paper should be made as follows: Singh, P., Sharma, B. and
Kaur, J. (2022) ‘An anti-phishing model based on similarity measurement’,
Int. J. Computational Vision and Robotics, Vol. 12, No. 2, pp.141–155.
Biographical notes: Parvinder Singh is working as a Professor in the
Department of Computer Science and Engineering at Deenbandhu Chhotu Ram
University of Science and Technology, Murthal, Sonepat, Haryana, India. His
research interests are information security and image processing. He has
published more than 100 research papers in journals of reputed publishers. He
is also associated with editing work of many journals. He is also a recipient of
various international awards and grants.
Bhawna Sharma is currently pursuing her PhD from Deenbandhu Chhotu Ram
University of Science and Technology, Murthal, Sonepat, Haryana, India. Her
research interests include cyber security and machine learning.
Jasvinder Kaur is presently working as an Assistant Professor in Department of
Computer Science and Engineering at PDM University, Bahadurgarh, Haryana.
She has 28 publications in international and national journals. She is a recipient
of Excellence Award for her research work on ‘Digital logic embedding for
information view privileges’. Her research areas include information hiding and
information security.
1 Introduction
Advances in technology have made the internet increasingly popular, and hardly
any field now remains untouched by it. The rising popularity of social media,
including Twitter, Instagram and Facebook, has made internet technology an
indispensable daily need (Sheng et al., 2007). The number of people using
social media has nearly doubled since 2014. As shown in Figure 1, a 9% rise in
the number of users was observed in the last year, as per the Global Digital
Report 2019 (https://datareportal.com/). Such an enormous rise in network
traffic has raised two main challenges: malicious attacks and the need for
more advanced hardware. The latter has led to the development of big data
computing technologies with more sophisticated hardware configurations. The
internet provides a platform that connects numerous users sharing information
in the form of media, images, videos, files, etc., and therefore also requires
attention to the risk of privacy leakage in one way or another.
Phishing is a cyber-crime that frequently targets personal computers and mobile
platforms. It is fraudulent behaviour that exploits social networks and
technical skills to advertise and send illegitimate links to users in disguise,
in order to obtain their private information and financial credentials. Spoofed
e-mails posing as legitimate businesses trick recipients into sharing their
login and password details. In other words, phishing is the theft of someone’s
private information by deceiving them into believing that the request is
genuine (Cao et al., 2008; Aksu et al., 2019; Lam and Kettani, 2019). The
majority of internet users are taken in by these illusions and become trapped.
The statistical analysis by Clement (2020) reports that, in the first quarter
of 2020, 19% of all phishing attacks targeted the financial sector, in addition
to payments, which accounted for 13.3%. Figure 2 shows that software as a
service (SaaS) and web-mail formed the most compromised sector, with 33.5% of
phishing attacks (Engel et al., 2012).
Figure 1 Rising popularity of social media (see online version for colours)
[Bar chart: number of social media users (in millions, axis 0 to 4,000) by year, 2014 to 2019]
Figure 2 Phishing attacks in the 1st quarter-2020 (see online version for colours)
[Chart: share of phishing attacks by targeted sector; segment values 8.3%, 6.2%, 3.9%, 8.5%, 19.4%, 33.5%, 13.3% and 6.9%]
Researchers around the world have been working continuously to tackle phishing
attacks and deploy various anti-phishing protocols. About one third of the
anti-phishing mechanisms implemented in the recent past are rule-based
prevention methods (Mensah et al., 2015). The major limitation of these
mechanisms is adaptability: the rule set must be updated with new data for
every new phishing attack. To deal with adaptability issues, authors have
combined swarm intelligence with machine learning. Some similarity-based
approaches that compare phishing websites with legitimate ones have also been
proposed (Kordestani and Shajari, 2018). In the current research, the authors
use a rule-based architecture to develop an anti-phishing protocol based on the
cosine and soft-cosine similarity index, powered by a machine learning design.
Cosine similarity measures the resemblance between two documents independent of
the document size.
In Figure 3, the cosine angle is the angle between the query and repository
vectors projected in the space. The advantage of this similarity measure is
that documents of very different sizes can still be oriented close to each
other. A smaller angle reflects higher similarity; for instance, an angle of 0°
gives a cosine of 1, which reflects complete similarity. The mechanism is
discussed in more detail later in the article.
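As a small worked illustration of this size independence, take two made-up vectors Q = (1, 2) and R = (2, 4), where R is simply Q doubled (these values are assumptions for the example, not data from the study). The cosine of the angle between them is still 1:

```latex
\cos\theta
  = \frac{Q \cdot R}{\lVert Q \rVert \, \lVert R \rVert}
  = \frac{1 \cdot 2 + 2 \cdot 4}{\sqrt{1^2 + 2^2}\,\sqrt{2^2 + 4^2}}
  = \frac{10}{\sqrt{5}\,\sqrt{20}}
  = \frac{10}{10}
  = 1
```

Scaling a vector changes its length but not its direction, so the similarity is unaffected.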
Figure 3 Cosine similarity (see online version for colours)
The current section has introduced phishing attacks and related statistics.
Section 2 summarises the research on anti-phishing protocols deployed by
researchers, Section 3 describes the proposed methodology based on cosine and
soft-cosine similarity, Section 4 presents the observations and discusses the
results obtained, and Section 5 concludes the paper.
2 Literature review
This section covers various protocols proposed to detect and prevent phishing
attacks. Ramanathan and Wechsler (2012) proposed PhishGILLNET as a
multi-layered anti-phishing model. The technique demonstrated effective
results, with the prerequisite that the webpage should be in Hyper Text Markup
Language (HTML) and Multipurpose Internet Mail Extensions (MIME) formats (Engel
et al., 2012). Li et al. (2016) proposed a novel anti-phishing design based on
ball-support vector machine (BVM) to distinguish a
malicious uniform resource locator (URL) from a genuine one. In the process,
they extracted various topological features of the website and analysed 12 of
them, followed by BVM-based vector analysis. Evaluation against support vector
machines (SVM) showed BVM to be highly effective in detecting phishing
websites, although relatively slow for big data. Kaur and Kalra (2016) proposed
a five-tier anti-phishing design to safeguard against phishing attacks. The
hybrid approach analyses the URL and reports the page status as a secure or
phishing website. In 2016, Nguyen et al. proposed an anti-phishing design that
took advantage of a neural fuzzy model. The work was evaluated on 11,660
phishing and 10,000 legitimate websites, which were used as training data in
the adaptive neural network (NN). The work demonstrated efficient detection of
phishing URLs. Sonowal and Kuppusamy (2020) developed PhiDMA as a multilayered
anti-phishing architecture divided into five layers corresponding to white
list, URL feature, signature, string matching and score layer. The model was
developed to offer easy access even to visually impaired individuals, with
92.72% phishing detection accuracy. Ugochi (2018) focused on in-depth analysis
of IP address and URL cosine similarity in order to identify phishing URLs.
Experimental evaluation proved the approach highly effective against the 100
phishing URLs used in the study, with the least memory requirement. Makki et
al. (2018) postulated a cost-sensitive K-nearest neighbour (KNN) approach
enhanced with cosine similarity to identify fraud instances affecting financial
markets, money transactions, telecommunication and credit cards. The model
outperformed the approach based on traditional KNN alone. In 2018, Jain and
Gupta presented a phishing detection technique based on the analysis of
hyperlinks present in webpages. The work proved to be highly efficient and
exhibited 98.4% accuracy for phishing site detection using logistic
regression-based classification. Kordestani and Shajari (2018) proposed a novel
textual similarity-based method to identify phishing sites. The method was
evaluated against real websites and demonstrated optimal phishing detection
accuracy while discriminating between phishing and legitimate websites and
guiding the user towards the genuine website (Kordestani and Shajari, 2018). In
2019, Azeez developed the PhishDetect technique, which aims to identify
phishing websites by evaluating URL features and web contents. The technique
proved to be highly efficient in identifying phishing URLs. Morovati and Kadam
(2019) dedicated their study to the identification of phishing attacks spread
through phishing e-mails. To achieve distinguishable results, they incorporated
e-mail forensic analysis with machine learning methodology. Later, Zhu et al.
(2019) postulated the optimal feature selection neural network (OFS-NN) as a
phishing website detector. This model was based on an optimised feature
selection technique followed by an NN technique. They developed an optimal
classifier that could accurately detect different types of phishing websites.
Lin (2019) developed an architecture that mediates a cascade of kaizen events.
In this approach, the cosine similarity of data objects is used to classify
protection levels. Experimental evaluation showed that the designed fuzzy
architecture was competitive with other methods in supporting the kaizen
architecture to identify the susceptibility of web applications.
Table 1 Comparison of latest work in anti-phishing

Sr. no.: 1
Author and year: Rao et al. (2020)
Proposed work and technique: An anti-phishing application named CatchPhish was proposed to validate the genuineness of a URL ahead of visiting it.
Result and outcomes: The CatchPhish program obtained 94.26% accuracy over the authors’ own dataset and 98.25% accuracy over a benchmark dataset. It has scope for application of feature selection for handcrafted apps when detecting phishing sites.

Sr. no.: 2
Author and year: Suleman and Awan (2019)
Proposed work and technique: An anti-phishing method based on analysing the uniform resource locator (URL) was postulated. It uses machine-learning classifiers such as naïve Bayes, iterative dichotomiser-3 (ID3), K-nearest neighbour (KNN), decision tree and random forest for the classification of legitimate and illegitimate websites, so that phishing websites can be distinguished. It was also observed that the use of genetic algorithms (GAs) for feature selection can improve the detection accuracy.
Result and outcomes: The experimental results show that the use of ID3 along with yet another generating genetic algorithm (YAGGA) improves the detection accuracy up to 95%.

Sr. no.: 3
Author and year: Jain and Gupta (2018)
Proposed work and technique: A bi-level authentication system was proposed to identify phishing attacks without depending on the text-based language of the webpage. The first-level authentication was based on the search engine mechanism, and the second level was based on hyperlinks, to detect phishing URLs or websites.
Result and outcomes: The simulation analysis demonstrated that the proposed work achieved a much higher detection accuracy, averaging around 98.1%.

Sr. no.: 4
Author and year: Gupta and Singhal (2018)
Proposed work and technique: A practical method was proposed that could detect phishing in a minimum time span. The URLs were mined with the data mining tool Waikato Environment for Knowledge Analysis (Weka). For the classification of phishing sites, the authors implemented six computational intelligence techniques, namely random tree, random forest, SMO, LMT, NB and J48, in order to investigate the URLs of the dataset.
Result and outcomes: The authors demonstrated that both random tree and random forest exhibited higher accuracy as compared to the other classification algorithms. Between the two, random tree performed better because of its execution time.

Sr. no.: 5
Author and year: Jain and Gupta (2018)
Proposed work and technique: A phishing detection system named PHISH-SAFE was demonstrated, based on training and testing using an SVM machine learning approach.
Result and outcomes: The simulation analysis shows that the proposed SVM-based phishing detection approach could detect phishing sites with an accuracy of 90%.

Sr. no.: 6
Author and year: Abutair and Belghith (2017)
Proposed work and technique: An anti-phishing detection system based on case-based reasoning was proposed.
Result and outcomes: Experimental analysis demonstrated that the proposed work achieved phishing detection accuracy higher than 95.62%, which is further enhanced when a smaller feature set or a smaller dataset is used.
3 Proposed methodology
3.1 Dataset
The current research work employs the dataset obtained from the PhishTank
database (https://www.phishtank.com/). The downloadable data attributes consist
of phishing ID, URL information, type of target, online status, etc. PhishTank
also provides a community-based evaluation platform where a user query is
classified as phishing or legitimate based on votes.
The proposed anti-phishing design is largely a two-layered architecture. The
first layer calculates the website similarity between the input user query and
the database repository using two hybrids, namely, Hsim1, the hybrid similarity
obtained using cosine and soft-cosine, and Hsim2, the hybrid similarity
obtained using cosine and Jaccard. The second layer functions as a validation
layer (VL) to check the effectiveness of the prediction performed by the first
layer. This layer classifies phishing and non-phishing sites based on a
multiclass NN. The quality of the proposed architecture is evaluated using the
quality parameters positive predictive value (PPV), true positive rate (TPR)
and F-measure. An overview of the steps is shown in Figure 4.
Figure 4 Proposed multi-layer anti-phishing architecture
[Flow diagram: Input data (user query, repository) → Similarity layer (apply cosine similarity, apply soft cosine similarity, hybrid similarity) → Training and classification layer (apply feed forward back propagation neural network using Levenberg processing engine) → Performance evaluation (determine performance in terms of PPV, TPR and F-measure)]
3.1.1 Similarity layer
The layer is dedicated to performing similarity predictions between the query,
or test data, and the repository. To achieve this, the authors propose a hybrid
similarity approach using cosine with soft-cosine for predictions that take
advantage of the angular correlation established between the query and the
repository vector, as shown in Figure 3, where θ defines the angle between the
two vectors. Mathematically, cosine similarity is calculated as follows:
$$Cosine_{Sim} = \frac{\sum_{i=1}^{n} Q_{Vect_i} R_{Vect_i}}{\sqrt{\sum_{i=1}^{n} \left(Q_{Vect_i}\right)^2}\,\sqrt{\sum_{j=1}^{n} \left(R_{Vect_j}\right)^2}} \qquad (1)$$
where CosineSim represents the cosine similarity observed between the user
query (represented by QVect) and the repository (represented by RVect). The
pseudo code to compute the similarity is summarised in Algorithm 1.
Algorithm 1 Pseudo code for cosine similarity
1 Input: RValue // repository data values
2 Foreach IVal in RValue // scan every data value present in the repository
3 Vectorisation of repository data
RVect = conversion(RValue) // convert repository data values to vectors
4 Isolate words from repository data
wdata = Segregate(RVect) // segregate words from repository files
5 Eliminate stop words
wstop = Remove(wdata) // removal of stop words
6 ASCII code generation
$w^{ASCII}_{data} = ASCII(w_{stop})$ // generate ASCII code for each word
7 Calculate cosine similarity
$Cosine_{Sim} = \frac{\sum_{i=1}^{n} Q_{Vect_i} R_{Vect_i}}{\sqrt{\sum_{i=1}^{n} (Q_{Vect_i})^2}\,\sqrt{\sum_{j=1}^{n} (R_{Vect_j})^2}}$
8 StoretoList
9 Endfor
10 Output: CosineSim // cosine similarity between query and repository
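The flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list is hypothetical, the function names are invented for the example, and plain term counts stand in for the ASCII code generation of Step 6, whose exact encoding the paper does not specify.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}


def tokenize(text):
    """Lower-case the text, split it into word tokens and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def cosine_similarity(query, repository_entry):
    """Cosine similarity as in equation (1), computed on term-count vectors."""
    q_counts = Counter(tokenize(query))
    r_counts = Counter(tokenize(repository_entry))
    vocabulary = sorted(set(q_counts) | set(r_counts))
    q_vect = [q_counts[term] for term in vocabulary]
    r_vect = [r_counts[term] for term in vocabulary]
    dot = sum(q * r for q, r in zip(q_vect, r_vect))
    norm_q = math.sqrt(sum(q * q for q in q_vect))
    norm_r = math.sqrt(sum(r * r for r in r_vect))
    if norm_q == 0 or norm_r == 0:
        return 0.0  # an empty vector has no direction; treat it as dissimilar
    return dot / (norm_q * norm_r)


def similarity_list(query, repository):
    """Score the query against every repository entry (Steps 2 to 9)."""
    return [cosine_similarity(query, entry) for entry in repository]


if __name__ == "__main__":
    repo = ["http://example.com/login", "http://example.org/secure/account"]
    print(similarity_list("http://example.com/login/verify", repo))
```

Term counts are used here only because they give a well-defined vector for any pair of texts; any other vectorisation could be substituted without changing the similarity step.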
Algorithm 1 calculates the cosine similarity used to predict phishing sites or
URLs. In the process, it first converts the data values to vectors and removes
the stop words present in the repository data and the query data. Another
similarity measure, soft-cosine, is also calculated; it follows the same
algorithmic flow except for the similarity calculation made in Step 7.
Mathematically, soft-cosine similarity is calculated as follows:
$$SCosine_{Sim} = \frac{\sum_{i,j}^{n} Q_{Vect_i} R_{Vect_j}}{\sqrt{\sum_{i,j}^{n} Q_{Vect_i} Q_{Vect_j}}\,\sqrt{\sum_{i,j}^{n} R_{Vect_i} R_{Vect_j}}} \qquad (2)$$
where SCosineSim represents the soft-cosine similarity observed between user query
represented by QVect and the repository data represented by RVect.
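For comparison, the sketch below implements the soft-cosine measure in its commonly used form, in which every pair of features (i, j) is weighted by a feature-similarity value s_ij. That weighting matrix is an assumption here: it is not visible in equation (2) as printed, and the paper does not say how such values would be obtained. The function name and the example matrix are made up for illustration.

```python
import math


def soft_cosine(q_vect, r_vect, sim):
    """Soft-cosine similarity with a feature-similarity matrix sim[i][j].

    With sim equal to the identity matrix this reduces to ordinary cosine
    similarity.
    """
    n = len(q_vect)
    if len(r_vect) != n or len(sim) != n or any(len(row) != n for row in sim):
        raise ValueError("vectors and similarity matrix must share dimension n")

    def bilinear(a, b):
        # Sum of sim[i][j] * a[i] * b[j] over all index pairs (i, j).
        return sum(sim[i][j] * a[i] * b[j] for i in range(n) for j in range(n))

    denominator = math.sqrt(bilinear(q_vect, q_vect)) * math.sqrt(bilinear(r_vect, r_vect))
    if denominator == 0:
        return 0.0
    return bilinear(q_vect, r_vect) / denominator


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    related = [[1.0, 0.5], [0.5, 1.0]]  # made-up feature similarities
    print(soft_cosine([1, 0], [0, 1], identity))  # orthogonal under plain cosine
    print(soft_cosine([1, 0], [0, 1], related))   # non-zero once features are related
```

The two calls illustrate the design difference: vectors that share no feature score zero under plain cosine, but receive a positive score when the features themselves are treated as related.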
Further, hybrid similarity prediction is performed by combining the observed
similarity predicted by individual similarity calculations, i.e., using cosine similarity and
soft-cosine similarity as follows:
$$H_{Sim1} = Cosine_{Sim} + SCosine_{Sim} \qquad (3)$$
where the hybrid similarity prediction is represented by HSim1, based on the
predictions made by the cosine and soft-cosine similarity calculators
represented by CosineSim and SCosineSim, respectively. In addition to cosine
similarity, a statistical measure in the form of Jaccard similarity is also
incorporated to evaluate the effectiveness of the proposed cosine and
soft-cosine hybrid. The Jaccard similarity JaccardSim between the query dataset
Qdata and the repository dataset Rdata is calculated as follows:
$$Jaccard_{Sim} = \left(\frac{|Q_{data} \cap R_{data}|}{|Q_{data} \cup R_{data}|}\right) * 100 \qquad (4)$$
where Qdata ∩ Rdata represents the intersection of the data present in Qdata
and Rdata, and Qdata ∪ Rdata represents the union of the data present in the
two datasets. Here, another hybrid similarity prediction is executed to
represent the similarity estimations between the query and repository datasets
by combining the observed similarity of individual cosine similarity and
Jaccard similarity as follows:
$$H_{Sim2} = Cosine_{Sim} + Jaccard_{Sim} \qquad (5)$$
where HSim2 represents the hybrid similarity formed with the combination of CosineSim
and JaccardSim. The hybrid similarity calculations HSim1 and HSim2 made in this layer are
sent to VL.
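Equations (3) to (5) can be sketched as below. The function names are hypothetical, the cosine and soft-cosine values passed in are made-up numbers, and the data are treated as sets of tokens, which is an assumption about how Qdata and Rdata are represented.

```python
def jaccard_similarity(q_data, r_data):
    """Jaccard similarity as in equation (4): shared items over all items, times 100."""
    q_set, r_set = set(q_data), set(r_data)
    union = q_set | r_set
    if not union:
        return 0.0  # two empty datasets share nothing to compare
    return len(q_set & r_set) / len(union) * 100


def hybrid_sim1(cosine_sim, soft_cosine_sim):
    """Equation (3): cosine plus soft-cosine similarity."""
    return cosine_sim + soft_cosine_sim


def hybrid_sim2(cosine_sim, jaccard_sim):
    """Equation (5): cosine plus Jaccard similarity."""
    return cosine_sim + jaccard_sim


if __name__ == "__main__":
    query_tokens = ["paypal", "login", "secure"]
    repo_tokens = ["login", "secure", "account"]
    jaccard = jaccard_similarity(query_tokens, repo_tokens)
    print(jaccard)                    # two shared tokens out of four distinct ones
    print(hybrid_sim1(0.8, 0.9))      # made-up cosine and soft-cosine values
    print(hybrid_sim2(0.8, jaccard))
```

Note that, taken literally, equation (4) yields a value between 0 and 100 while cosine similarity lies between 0 and 1, so the two terms of equation (5) are on different scales unless one of them is rescaled.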
3.1.2 Validation layer
The current layer evaluates the similarity predictions of the similarity layer
(SL) based on an NN. It consists of an input layer that receives the user
query, a hidden layer that acts as a processing framework based on the weights,
and an output layer that returns the classification results. A sigmoid
activation function is applied to the raw data from the input layer, and the
output of the sigmoid function unit is passed to the hidden layer. Algorithm 2
summarises the steps employed in the validation of the similarity predictions
using the NN.
The similarity predictions made in the previous layer are arranged in groups.
The blocks in a group represent the similarity calculations corresponding to
each repository, and the similarity predictions corresponding to a repository
are represented by a single group tagged by the repository name. The obtained
sets then undergo supervised learning using the NN. This multiclass classifier
is used to differentiate phishing and genuine sites using URLs. The group value
and the classified value are compared for each query value. If the group value
and the classified value match, the NN classifies the query as true; otherwise,
the query is classified as false.
Algorithm 2 Pseudo code for validation using NN
1 Input: HSim // hybrid similarity value
2 Initialise variables
TVal // training value
GVal // group value
3 Assign training value
TVal = HSim // assign respective training value
4 Assign group value
GVal = Gnum // group value is represented by respective group number
5 Assign NN parameters
DRatio = 0.7 // distribution ratio
CVratio = 0.15 // cross-validation ratio
Tratio = 0.15 // test ratio
Nnum = 20 neurons // number of neurons
6 Initialise neural network for training
TrainNN(TVal, GVal, Nnum) // initialise neural network
7 Start training neural network
8 Foreach X in Fclassified // for every value in classified frame
9 If FVal == TVal // classified result matches with the training value
10 Tclass++ // auto increment true classified class label
11 Else
12 Fclass++ // auto increment false classified class label
13 Endif
14 Endfor
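The data handling around the network in Algorithm 2 can be sketched as follows. The sketch covers only the 0.7/0.15/0.15 split of Step 5 and the true/false tally of Steps 8 to 14; the neural network itself is left out, and the classified values are assumed to come from any trained classifier. The function names, the fixed seed and the use of rounding for the split sizes are illustrative assumptions.

```python
import random


def split_dataset(samples, d_ratio=0.7, cv_ratio=0.15, seed=0):
    """Shuffle the samples and split them into training, cross-validation and test parts.

    The test part receives whatever remains after the first two parts,
    which is roughly 15% with the default ratios.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * d_ratio)
    n_cv = round(len(shuffled) * cv_ratio)
    train = shuffled[:n_train]
    cross_val = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]
    return train, cross_val, test


def tally(classified, expected):
    """Count matches (Tclass) and mismatches (Fclass) between two label lists."""
    if len(classified) != len(expected):
        raise ValueError("label lists must have the same length")
    t_class = sum(1 for c, e in zip(classified, expected) if c == e)
    f_class = len(classified) - t_class
    return t_class, f_class


if __name__ == "__main__":
    train, cross_val, test = split_dataset(range(100))
    print(len(train), len(cross_val), len(test))
    print(tally([1, 0, 1, 1], [1, 1, 1, 0]))
```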
4 Result and discussion
The phishing predictions made by the proposed framework are evaluated in terms
of the quality parameters PPV, also termed precision, which is calculated
against the training dataset, F-measure, and TPR, also called the recall value.
a PPV is computed as the number of true detections made by the prediction model
in comparison to the total number of detections. Mathematically, it can be
calculated as follows:
$$PPV = \frac{T_{positive}}{T_{positive} + F_{positive}} \qquad (6)$$
where TPositive and FPositive represent true positive and false positive
detections, respectively.
b TPR is the proportion of actual positive instances that are correctly
detected as positive. It can be calculated as follows:
$$TPR = \frac{T_{positive}}{T_{positive} + F_{negative}} \qquad (7)$$
where Fnegative represents the false negative detections.
c F-measure represents the harmonic mean of the above two parameters. It is
calculated from their product and arithmetic sum as follows:
$$F_{measure} = 2 * \left(\frac{PPV \times TPR}{PPV + TPR}\right) \qquad (8)$$
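Equations (6) to (8) translate directly into code. The sketch below uses hypothetical function names and made-up detection counts; returning 0.0 when a denominator is zero is a convention chosen for the example, not something the paper specifies.

```python
def ppv(true_positive, false_positive):
    """Positive predictive value (precision), equation (6)."""
    total = true_positive + false_positive
    return true_positive / total if total else 0.0


def tpr(true_positive, false_negative):
    """True positive rate (recall), equation (7)."""
    total = true_positive + false_negative
    return true_positive / total if total else 0.0


def f_measure(ppv_value, tpr_value):
    """Harmonic mean of PPV and TPR, equation (8)."""
    total = ppv_value + tpr_value
    return 2 * ppv_value * tpr_value / total if total else 0.0


if __name__ == "__main__":
    # Made-up counts: 8 phishing sites caught, 2 false alarms, 2 missed.
    precision = ppv(8, 2)
    recall = tpr(8, 2)
    print(precision, recall, f_measure(precision, recall))
```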
The values obtained from the above relationships using the two hybrids, namely
Hsim1 (cosine hybridised with soft-cosine) and Hsim2 (cosine hybridised with
Jaccard similarity), are summarised in Table 2. Columns 2 and 3 compare the
average value of PPV, columns 4 and 5 that of TPR, and columns 6 and 7 that of
F-measure observed using the two hybrids separately for the respective number
of test files given in column 1.
Table 2 Average value of PPV, TPR and F-measure

Number of    PPV                         TPR                         F-measure
test files   Cosine      Cosine          Cosine      Cosine          Cosine      Cosine
             + Jaccard   + soft-cosine   + Jaccard   + soft-cosine   + Jaccard   + soft-cosine
100          0.708       0.709           0.716       0.718           0.712       0.713
200          0.712       0.714           0.719       0.721           0.715       0.717
500          0.714       0.715           0.720       0.722           0.717       0.718
1,000        0.718       0.719           0.723       0.725           0.720       0.722
2,000        0.722       0.725           0.726       0.729           0.724       0.727
3,000        0.726       0.732           0.732       0.738           0.729       0.735
Figure 5 PPV analysis of the proposed work using the two hybrid similarities (see online version for colours)
[Line graph: PPV values (axis 0.705 to 0.74) against number of test files (100, 200, 500, 1000, 2000, 3000) for cosine + soft-cosine and cosine + Jaccard]
It can be generalised from Table 2 that a rise in the number of test files
increases the average value of each considered parameter. Figure 5 shows the
PPV analysis of the proposed work using the two hybrid similarities separately
for different numbers of test files. It shows that the best performance
(maximum value of each parameter) is obtained with the larger number of test
files. The line graph shows that the average PPV observed with the cosine and
soft-cosine hybrid is 0.719, which is 0.42% higher than that of the cosine and
Jaccard hybrid, which is 0.716. This shows that the proposed cosine and
soft-cosine hybrid demonstrates better precision than the cosine and Jaccard
hybrid.
Figure 6 TPR analysis of the proposed work using the two hybrid similarities (see online version for colours)
[Line graph: TPR values (axis 0.705 to 0.74) against number of test files (100, 200, 500, 1000, 2000, 3000) for cosine + soft-cosine and cosine + Jaccard]
Figure 7 F-measure analysis of the proposed work using the two hybrid similarities (see online version for colours)
[Line graph: F-measure values (axis 0.705 to 0.740) against number of test files (100, 200, 500, 1000, 2000, 3000) for cosine + soft-cosine and cosine + Jaccard]
The TPR analysis of the two hybrids is shown in Figure 6, which shows that the
proposed work exhibited better TPR using the cosine and soft-cosine hybrid
similarity than using the cosine and Jaccard hybrid similarity. The average
value of TPR using cosine and soft-cosine similarity is 0.2833% higher than
that of phishing detection performed using the cosine and Jaccard hybrid
similarity.
F-measure reflects the harmonic mean of PPV and TPR. Figure 7 gives a graphical
comparison of the performance of the proposed work using the two hybrid
similarity approaches separately. The proposed anti-phishing model using the
cosine and soft-cosine hybrid similarity achieved an average F-measure of
0.7222, while the model using the cosine and Jaccard hybrid similarity achieved
an average F-measure of 0.7196. This shows that the proposed work using cosine
and soft-cosine similarity resulted in a 0.258% higher F-measure for phishing
detection.
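The averages discussed above can be recomputed from the rounded entries of Table 2 with a few lines of Python. The dictionary layout and function names are assumptions made for this sketch. Computed this way, the PPV and TPR differences come out at about 0.233 and 0.283 when expressed as the absolute difference multiplied by 100, which matches the reported 0.233% and 0.2833% if those figures are read as percentage points; the F-measure difference from the rounded table is about 0.25, close to the reported 0.258.

```python
# Values copied from Table 2, ordered by number of test files
# (100, 200, 500, 1,000, 2,000, 3,000).
TABLE_2 = {
    "PPV": {
        "cosine_jaccard": [0.708, 0.712, 0.714, 0.718, 0.722, 0.726],
        "cosine_soft_cosine": [0.709, 0.714, 0.715, 0.719, 0.725, 0.732],
    },
    "TPR": {
        "cosine_jaccard": [0.716, 0.719, 0.720, 0.723, 0.726, 0.732],
        "cosine_soft_cosine": [0.718, 0.721, 0.722, 0.725, 0.729, 0.738],
    },
    "F-measure": {
        "cosine_jaccard": [0.712, 0.715, 0.717, 0.720, 0.724, 0.729],
        "cosine_soft_cosine": [0.713, 0.717, 0.718, 0.722, 0.727, 0.735],
    },
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def compare(metric):
    """Return (cosine + Jaccard average, cosine + soft-cosine average, difference)."""
    jaccard_avg = mean(TABLE_2[metric]["cosine_jaccard"])
    soft_avg = mean(TABLE_2[metric]["cosine_soft_cosine"])
    return jaccard_avg, soft_avg, soft_avg - jaccard_avg


if __name__ == "__main__":
    for metric in TABLE_2:
        jaccard_avg, soft_avg, diff = compare(metric)
        print(f"{metric}: {jaccard_avg:.4f} vs {soft_avg:.4f}, "
              f"difference {diff * 100:.3f} percentage points")
```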
5 Conclusions
In the current work, the authors have proposed an anti-phishing framework based
on a new rule-based architecture. The first layer of the anti-phishing model is
dedicated to similarity evaluation between the user query and the website.
Phishing detection is based on cosine and soft-cosine hybrid similarity
measurements, followed by an NN that cross-validates the prediction results. In
the SL, to prove the effectiveness of the proposed cosine and soft-cosine
similarity, another hybrid similarity (cosine and Jaccard similarity) was also
implemented. Simulation analysis has shown that the proposed work using cosine
and soft-cosine similarity outperformed the cosine and Jaccard similarity-based
anti-phishing model, with 0.233% higher PPV, 0.2833% better TPR and 0.258%
improved F-measure. Overall, the proposed anti-phishing model showed an average
improvement in PPV of 2.3%, TPR of 2% and F-measure of 2.15% when the sample
size is increased from 100 to 3,000 test files. Average values of 0.719 for
PPV, 0.726 for TPR and 0.722 for F-measure were observed, which supports the
effectiveness of the proposed anti-phishing design.
Acknowledgements
This work is part of bilateral Indian-Bulgarian cooperation research project
between Technical University of Sofia, Bulgaria and Deenbandhu Chhotu Ram
University of Science and Technology, Murthal, Sonepat, India under the title
‘Contemporary approaches for processing and analysis of multidimensional signals in
telecommunications’, financed by the Department of Science and Technology (DST),
India and the Ministry of Education and Science, Bulgaria.
References
Abutair, H.Y. and Belghith, A. (2017) ‘Using case-based reasoning for phishing detection’,
Procedia Computer Science, Vol. 109, pp.281–288.
Aksu, D., Turgut, Z., Üstebay, S. and Aydin, M.A. (2019) ‘Phishing analysis of websites
using classification techniques’, in International Telecommunications Conference, Springer,
Singapore, pp.251–258.
Cao, Y., Han, W. and Le, Y. (2008) ‘Anti-phishing based on automated individual white-list’,
in Proceedings of the 4th ACM Workshop on Digital Identity Management, pp.51–60.
Clement, J. (2020) Phishing Statistics Analysis, Statista.com [online]
https://www.statista.com/statistics/266161/websites-most-affected-by-phishing/ (accessed July 2020).
Engel, D., Stütz, T. and Uhl, A. (2012) ‘Assessing JPEG2000 encryption with key-dependent
wavelet packets’, EURASIP Journal on Information Security, Vol. 1, pp.2–13.
Global Digital Report 2019 [online] https://datareportal.com/ (accessed 21 February 2020).
Gupta, S. and Singhal, A. (2018) ‘Dynamic classification mining techniques for predicting phishing
URL’, in Soft Computing: Theories and Applications, pp.537–546, Springer, Singapore.
Jain, A.K. and Gupta, B.B. (2018) ‘Two-level authentication approach to protect from phishing
attacks in real time’, Journal of Ambient Intelligence and Humanized Computing, Vol. 9,
pp.1783–1796.
Jain, A.K. and Gupta, B.B. (2019) ‘A machine learning based approach for phishing detection
using hyperlinks information’, Journal of Ambient Intelligence and Humanized Computing,
Vol. 10, pp.2015–2028.
Kaur, D. and Kalra, S. (2016) ‘Five-tier barrier anti-phishing scheme using hybrid approach’,
Information Security Journal: A Global Perspective, Vol. 25, pp.247–260.
Kordestani, H. and Shajari, M. (2018) ‘A similarity-based framework for detecting phishing
websites’, International Journal of Advanced Research in Computer Science, Vol. 9.
Lam, T. and Kettani, H. (2019) ‘PhAttApp: a phishing attack detection application’, in Proceedings
of the 2019 3rd International Conference on Information System and Data Mining,
pp.154–158.
Li, Y., Yang, L. and Ding, J. (2016) ‘A minimum enclosing ball-based support vector machine
approach for detection of phishing websites’, Optik, Vol. 127, pp.345–351.
Lin, K.S. (2019) ‘New attack potential measurement method to kaizen event for web application
security vulnerabilities’, International Journal of Electronic Commerce Studies, Vol. 10,
pp.89–112.
Makki, S., Haque, R., Taher, Y., Assaghir, Z., Hacid, M.S. and Zeineddine, H. (2018)
‘A cost-sensitive cosine similarity K-nearest neighbor for credit card fraud detection’, Big
Data and Cyber-security Intelligence, Beirut, Lebanon.
Mensah, P., Blanc, G., Okada, K., Miyamoto, D. and Kadobayashi, Y. (2015) ‘AJNA: anti-phishing
JS-based visual analysis, to mitigate users’ excessive trust in SSL/TLS’, in 2015 4th
International Workshop on Building Analysis Datasets and Gathering Experience Returns for
Security (BADGERS), pp.74–84.
Morovati, K. and Kadam, S.S. (2019) ‘Detection of phishing emails with email forensic analysis
and machine learning techniques’, International Journal of Cyber-Security and Digital
Forensics, Vol. 8, pp.98–108.
Nguyen, L.A.T., Nguyen, H.K. and To, B.L. (2016) ‘An efficient approach based on neuro-fuzzy
for phishing detection’, Journal of Automation and Control Engineering, Vol. 4.
PhishTank [online] https://www.phishtank.com/ (accessed 1 February 2020).
Rao, R.S., Vaishnavi, T. and Pais, A.R. (2020) ‘CatchPhish: detection of phishing websites by
inspecting URLs’, Journal of Ambient Intelligence and Humanized Computing, Vol. 11,
pp.813–825.
Ramanathan, V. and Wechsler, H. (2012) ‘phishGILLNET – phishing detection methodology using
probabilistic latent semantic analysis, AdaBoost, and co-training’, EURASIP Journal on
Information Security, No. 1, pp.1–22.
Sheng, S., Magnien, B., Kumaraguru, P., Acquisti, A., Cranor, L.F., Hong, J. and Nunge, E. (2007)
‘Anti-phishing phil: the design and evaluation of a game that teaches people not to fall for
phish’, in Proceedings of the 3rd Symposium on Usable Privacy and Security, pp.88–99.
Sonowal, G. and Kuppusamy, K.S. (2020) ‘PhiDMA – a phishing detection model with multi-filter
approach’, Journal of King Saud University – Computer and Information Sciences, Vol. 32,
pp.99–112.
Suleman, M.T. and Awan, S.M. (2019) ‘Optimization of URL-based phishing websites detection
through genetic algorithms’, Automatic Control and Computer Sciences, Vol. 53, pp.333–341.
Ugochi, O.C. (2018) ‘A novel web page anti-phishing approach using URL cosine similarity and IP
address comparison’, in International Conferences on WWW/Internet, ICWI 2018 and Applied
Computing, pp.321–328.
Zhu, E., Chen, Y., Ye, C., Li, X. and Liu, F. (2019) ‘OFS-NN: an effective phishing websites
detection model based on optimal feature selection and neural network’, IEEE Access, Vol. 7,
pp.73271–73284.