
An anti-phishing model based on similarity measurement

January 2022
International Journal of Computational Vision and Robotics 12(2):141

DOI:10.1504/IJCVR.2022.121169

Authors:

Parvinder Singh

Deenbandhu Chhotu Ram University of Science and Technology, Murthal

Jasvinder Kaur

PDM Group of Institutions


Int. J. Computational Vision and Robotics, Vol. 12, No. 2, 2022

An anti-phishing model based on similarity measurement

Parvinder Singh* and Bhawna Sharma

Department of Computer Science and Engineering,

Deenbandhu Chhotu Ram University of Science and Technology,

Murthal, Sonepat, Haryana, India

Email: parvindersingh.cse@dcrustm.org

Email: bhawnash024@gmail.com

*Corresponding author

Jasvinder Kaur

Computer Science and Engineering,

PDM University,

Bahadurgarh, Haryana, India

Email: jasvinder.kaur@pdm.ac.in

Abstract: Phishing poses a growing threat to internet users. In the current work, the authors develop an anti-phishing technique based on a hybrid similarity approach that combines cosine and soft-cosine similarity to measure the resemblance between a user query and a database. To validate the proposed work, the hybrid is also evaluated against a second hybrid comprising cosine and Jaccard similarity. Both hybrid similarities are separately fed to the validation layer of a feed forward back propagation neural network (FFBPNN) to predict phishing and legitimate websites. Performance is evaluated on a dataset of 3,000 sample files in terms of positive predictive value (PPV), true positive rate (TPR) and F-measure. The comparative analysis shows that the anti-phishing model using the proposed similarity hybrid outperforms the cosine and Jaccard similarity hybrid, with 0.233%, 0.2833% and 0.258% higher PPV, TPR and F-measure, respectively.

Keywords: phishing; cosine similarity; soft-cosine similarity; similarity index;

FFBPNN.

Reference to this paper should be made as follows: Singh, P., Sharma, B. and

Kaur, J. (2022) ‘An anti-phishing model based on similarity measurement’,

Int. J. Computational Vision and Robotics, Vol. 12, No. 2, pp.141–155.

Biographical notes: Parvinder Singh is working as a Professor in the

Department of Computer Science and Engineering at Deenbandhu Chhotu Ram

University of Science and Technology, Murthal, Sonepat, Haryana, India. His

research interests are information security and image processing. He has

published more than 100 research papers in journals of reputed publishers. He

is also associated with editing work of many journals. He is also a recipient of

various international awards and grants.

142 P. Singh et al.

Bhawna Sharma is currently pursuing her PhD from Deenbandhu Chhotu Ram

University of Science and Technology, Murthal, Sonepat, Haryana, India. Her

research interests include cyber security and machine learning.

Jasvinder Kaur is presently working as an Assistant Professor in Department of

Computer Science and Engineering at PDM University, Bahadurgarh, Haryana.

She has 28 publications in international and national journals. She is a recipient

of Excellence Award for her research work on ‘Digital logic embedding for

information view privileges’. Her research areas include information hiding and

information security.

1 Introduction

Modern advances in the technological sector have raised the popularity of internet technology. Presently, hardly any field remains untouched by the network framework of the internet. The rising popularity of social media, including Twitter, Instagram and Facebook, further adds to the popularity of internet technology, making it an indispensable daily need (Sheng et al., 2007). The number of people using social media has nearly doubled since 2014. As shown in Figure 1, a 9% rise in the number of users was observed in the last year, according to the Global Digital Report 2019 (https://datareportal.com/). Such an enormous rise in network traffic has raised two main challenges: malicious attacks and the need for advanced hardware. The latter has led to the development of big data computing technologies with more sophisticated hardware configurations. The internet provides a platform that connects numerous users sharing information in the form of media, images, videos, files, etc., and thus also requires attention to the risk of privacy leakage in one way or another.

Phishing is a cyber-crime that frequently targets personal computers and mobile platforms. It is a fraudulent practice that exploits social networks and technical skills to advertise and send illegitimate links to users in disguise, in order to obtain their private information and financial credentials. Spoofed e-mails posing as legitimate businesses trick recipients into sharing their login and password details. In other words, phishing is the theft of someone's private information by deceiving them into believing the request is genuine (Cao et al., 2008; Aksu et al., 2019; Lam and Kettani, 2019). The majority of internet users are taken in by these illusions and get trapped. The statistical analysis by Clement (2020) reports that, in the first quarter of 2020, 19% of all phishing attacks targeted the financial sector, in addition to payments, which accounted for 13.3%. Figure 2 shows that software as a service (SaaS) and web-mail formed the most compromised sector, with 33.5% of phishing attacks (Engel et al., 2012).


Figure 1 Rising popularity of social media (see online version for colours)

[Bar chart of social media users: number of users (in millions, axis from 0 to 4,000) by year, 2014 to 2019]

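Figure 2 Phishing attacks in the 1st quarter-2020 (see online version for colours)

[Chart of phishing attack shares by sector; segment values: 33.5%, 19.4%, 13.3%, 8.5%, 8.3%, 6.9%, 6.2%, 3.9%; segment labels not recoverable]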

Researchers around the world have been working continuously to tackle phishing attacks and deploy various anti-phishing protocols. One third of the anti-phishing mechanisms implemented in the recent past are rule-based prevention methods (Mensah et al., 2015). The major limitation of these mechanisms is adaptability: the rule set must be updated with new data for every new phishing attack. To deal with adaptability issues, authors have used swarm intelligence with machine learning. Some similarity-based approaches that compare phishing websites with legitimate ones have also been proposed (Kordestani and Shajari, 2018). In the current research, the authors use a rule-based architecture to develop an anti-phishing protocol based on the cosine and soft-cosine similarity index, powered by a machine learning design. Cosine similarity measures the resemblance between two documents independent of document size.

In Figure 3, the cosine angle is formed by the query and repository vectors projected in space. The advantage of this similarity measure is that even documents of very different sizes can still be closely oriented. A smaller angle reflects higher similarity; for instance, an angle of 0° gives a cosine of 1, which reflects complete similarity. The mechanism is discussed in more detail later in the article.
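As a small worked illustration of this size independence (the vectors are made up for the example and are not taken from the paper), consider a query vector q = (1, 2) and a repository vector r = (2, 4), which is the same document doubled in length:

```latex
\cos\theta = \frac{q \cdot r}{\lVert q \rVert \, \lVert r \rVert}
           = \frac{1 \cdot 2 + 2 \cdot 4}{\sqrt{1^2 + 2^2}\,\sqrt{2^2 + 4^2}}
           = \frac{10}{\sqrt{5}\,\sqrt{20}}
           = \frac{10}{10} = 1
```

The two vectors point in the same direction, so the angle is 0° and the similarity is 1 despite the difference in length. By contrast, q = (1, 0) and r = (0, 1) share no component, giving a cosine of 0 (an angle of 90°).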

Figure 3 Cosine similarity (see online version for colours)

The current section has introduced phishing attacks and related statistics. Section 2 summarises the research on anti-phishing protocols deployed by researchers, Section 3 describes the proposed methodology based on cosine and soft-cosine similarity, and Section 4 presents the observations and discusses the results achieved. Section 5 concludes the paper.

2 Literature review

This section covers the various types of protocols proposed to deal with and prevent phishing attacks. Ramanathan and Wechsler (2012) proposed PhishGILLNET, a multi-layered anti-phishing model. The technique demonstrated effective results, with the prerequisite that the webpage be in Hyper Text Markup Language (HTML) and Multipurpose Internet Mail Extensions (MIME) formats (Engel et al., 2012). Li et al. (2016) proposed a novel anti-phishing design based on the ball-support vector machine (BVM) to distinguish a malicious uniform resource locator (URL) from a genuine one. In the process, they extracted various topological features of the website and analysed 12 of them, followed by BVM-based vector analysis. Evaluation against support vector machines (SVM) showed BVM to be highly effective in detecting phishing websites, although relatively slower for big data. Kaur and Kalra (2016) proposed a five-tier anti-phishing design to safeguard against phishing attacks. The hybrid approach analyses the URL and reports the page status as a secure or a phishing website. Nguyen et al. (2016) proposed an anti-phishing design that took advantage of a neuro-fuzzy model. The work was evaluated against 11,660 phishing and 10,000 legitimate websites, which were used as training data in the adaptive neural network (NN), and detected phishing URLs efficiently. Sonowal and Kuppusamy (2020) developed PhiDMA, a multilayered anti-phishing architecture divided into five layers: white list, URL feature, signature, string matching and score layer. The model was developed to offer easy access even to visually impaired individuals, with 92.72% phishing detection accuracy. Ugochi (2018) focused on in-depth analysis of IP addresses and URL cosine similarity in order to identify phishing URLs. Experimental evaluation proved the approach highly effective against the 100 phishing URLs used in the study, with the least memory requirement. Makki et al. (2018) proposed a cost-sensitive K-nearest neighbour (KNN) approach enhanced with cosine similarity to identify fraud instances affecting financial markets, money transactions, telecommunication and credit cards. The model outperformed the approach based on traditional KNN alone. In 2018, Jain and Gupta presented a phishing detection technique based on the analysis of hyperlinks present in webpages. The work proved highly efficient and exhibited 98.4% accuracy for phishing site detection using logistic regression-based classification. Kordestani and Shajari (2018) proposed a novel textual similarity-based method to identify phishing sites. The method was evaluated against real websites and demonstrated optimal phishing detection accuracy while discriminating between phishing and legitimate websites, and it guides the user towards the genuine website. Azeez (2019) developed the PhishDetect technique, which identifies phishing websites by evaluating URL features and web contents. The technique proved highly efficient in identifying phishing URLs. Morovati and Kadam (2019) studied the identification of phishing attacks spread through phishing e-mails. To achieve distinguishable results, they combined e-mail forensic analysis with machine learning methodology. Later, Zhu et al. (2019) proposed the optimal feature selection neural network (OFS-NN) as a phishing website detector. This model was based on an optimised feature selection technique followed by an NN, yielding an optimal classifier that could accurately detect different types of phishing websites. Lin (2019) developed an architecture that mediates a cascade of kaizen events. In this approach, the cosine similarity of data objects is used to classify protection levels. Experimental evaluation showed that the designed fuzzy architecture was competitive with other methods in supporting the Kaizen architecture to identify the susceptibility of web applications.


Table 1 Comparison of latest work in anti-phishing

| Sr. no. | Author and year | Proposed work and technique | Result and outcomes |
|---|---|---|---|
| 1 | Rao et al. (2020) | An anti-phishing application named CatchPhish was proposed to validate the genuineness of a URL before visiting it. | CatchPhish obtained 94.26% accuracy on the authors' own dataset and 98.25% accuracy on a benchmark dataset. It has scope for the application of feature selection for handcrafted apps when detecting phishing sites. |
| 2 | Suleman and Awan (2019) | An anti-phishing method based on analysing the uniform resource locator (URL) was proposed. It uses machine-learning classifiers such as naïve Bayes, iterative dichotomiser-3 (ID3), K-nearest neighbour (KNN), decision tree and random forest to classify legitimate and illegitimate websites, so that phishing websites can be easily distinguished. It was also observed that using genetic algorithms (GAs) for feature selection can improve detection accuracy. | The experimental results show that using iterative dichotomiser-3 (ID3) along with yet another generating genetic algorithm (YAGGA) improves detection accuracy up to 95%. |
| 3 | Jain and Gupta (2018) | A bi-level authentication system was proposed to identify phishing attacks without depending on the text language of the webpage. The first level of authentication was based on a search engine mechanism, and the second level on hyperlinks, to detect phishing URLs or websites. | The simulation analysis demonstrated that the proposed work achieved a much higher detection accuracy, averaging around 98.1%. |
| 4 | Gupta and Singhal (2018) | A practical method was proposed that could detect phishing in a minimum time span. The URLs were mined with the data mining tool Waikato Environment for Knowledge Analysis (Weka). To classify phishing sites, the authors implemented six computational intelligence techniques, namely random tree, random forest, SMO, LMT, NB and J48, to investigate the URLs of the dataset. | The authors demonstrated that both random tree and random forest exhibited higher accuracy compared with the other classification algorithms. Between the two, random tree is better because of its shorter execution and running time. |
| 5 | Jain and Gupta (2018) | A phishing detection system named PHISH-SAFE was demonstrated, based on training and testing using the SVM machine learning approach. | The simulation analysis shows that the proposed SVM-based phishing detection approach could detect phishing sites with an accuracy of 90%. |
| 6 | Abutair and Belghith (2017) | An anti-phishing detection system based on case-based reasoning was proposed. | Experimental analysis demonstrated that the proposed work achieved phishing detection accuracy higher than 95.62%, which is further enhanced when a smaller feature set or a smaller dataset is used. |


3 Proposed methodology

3.1 Dataset

The current research work employs the dataset obtained from the PhishTank database (https://www.phishtank.com/). The downloadable data attributes consist of phishing ID, URL information, type of target, online status, etc. PhishTank also offers a community-based evaluation platform where a user's query is classified as phishing or legitimate based on votes.

The proposed anti-phishing design is largely a two-layered architecture. The first layer calculates the website similarity between the input user query and the database repository using two hybrids: Hsim1, the hybrid similarity obtained using cosine and soft-cosine, and Hsim2, the hybrid of cosine and Jaccard similarity. The second layer functions as a validation layer (VL) that checks the effectiveness of the prediction performed by the first layer. This layer classifies phishing and non-phishing sites using a multiclass NN. The quality of the proposed architecture is evaluated using three quality parameters, namely positive predictive value (PPV), true positive rate (TPR) and F-measure. An overview of the steps is shown in Figure 4.

Figure 4 Proposed multi-layer anti-phishing architecture

[Flowchart: Input data (user query, repository) → Similarity layer (apply cosine similarity, apply soft cosine similarity, hybrid similarity) → Training and classification layer (apply feed forward back propagation neural network using Levenberg processing engine) → Performance evaluation (determine performance in terms of PPV, TPR and F-measure)]

3.1.1 Similarity layer

This layer performs similarity predictions between the query (test data) and the repository. To achieve this, the authors propose a hybrid similarity approach using cosine with soft-cosine for predictions, which takes advantage of the angular correlation between the query and the repository vectors as shown in Figure 3, where θ is the angle between the two vectors. Mathematically, cosine similarity is calculated as follows:

$$\mathrm{Cosine}_{Sim} = \frac{\sum_{i=1}^{n} Q_{Vect_i}\, R_{Vect_i}}{\sqrt{\sum_{i=1}^{n} Q_{Vect_i}^{2}}\; \sqrt{\sum_{i=1}^{n} R_{Vect_i}^{2}}} \tag{1}$$

where CosineSim represents the cosine similarity observed between the user query (represented by QVect) and the repository (represented by RVect), and n is the number of vector components. The pseudo code to compute the similarity is summarised in Algorithm 1.

Algorithm 1 Pseudo code for cosine similarity

1 Input: RValue // repository data values
2 Foreach IVal in RValue // scan every data value present in the repository
3   Vectorisation of repository data
      RVect = conversion(RValue) // convert repository data to vector form
4   Isolate words from repository data
      wdata = Segregate(RVect) // segregate words from repository files
5   Eliminate stop words
      wstop = Remove(wdata) // removal of stop words
6   ASCII code generation
      wASCII = ASCII(wstop) // generate ASCII code for each word
7   Calculate cosine similarity
      CosineSim = Σ(QVect_i × RVect_i) / (√Σ(QVect_i²) × √Σ(RVect_i²)) // as in equation (1)
8   StoretoList
9 Endfor
10 Output: CosineSim // cosine similarity between query and repository
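The flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list is hypothetical, and plain term counts are swapped in for the ASCII-code encoding of Step 6, whose exact form the paper does not specify. The function names are likewise made up for the sketch.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the paper does not say which list it uses.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}


def tokenize(text):
    """Split text into lowercase alphanumeric words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def cosine_similarity(query, document):
    """Cosine similarity of two texts using term-count vectors (equation (1))."""
    q_counts = Counter(tokenize(query))
    r_counts = Counter(tokenize(document))
    dot = sum(q_counts[t] * r_counts[t] for t in q_counts)
    q_norm = math.sqrt(sum(c * c for c in q_counts.values()))
    r_norm = math.sqrt(sum(c * c for c in r_counts.values()))
    if q_norm == 0 or r_norm == 0:
        return 0.0  # an empty vector has no direction; treat as no similarity
    return dot / (q_norm * r_norm)


def score_repository(query, repository):
    """Return one cosine score per repository entry, mirroring the Foreach loop."""
    return [cosine_similarity(query, entry) for entry in repository]


# Illustrative data only, not drawn from the PhishTank dataset.
scores = score_repository(
    "paypal secure login",
    ["paypal login secure", "weather forecast today"],
)
```

In this example the first repository entry contains the same words as the query in a different order, so its score is (up to floating-point rounding) 1, while the second entry shares no words and scores 0.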

Algorithm 1 calculates the cosine similarity used to predict phishing sites or URLs. In the process, it first converts the data values to vectors and removes the stop words present in the repository data and the query data. A second similarity measure, soft-cosine, is calculated using the same algorithmic flow, except for the similarity calculation made in Step 7. Mathematically, soft-cosine similarity is calculated as follows:

$$\mathrm{SCosine}_{Sim} = \frac{\sum_{i,j}^{n} s_{ij}\, Q_{Vect_i}\, R_{Vect_j}}{\sqrt{\sum_{i,j}^{n} s_{ij}\, Q_{Vect_i}\, Q_{Vect_j}}\; \sqrt{\sum_{i,j}^{n} s_{ij}\, R_{Vect_i}\, R_{Vect_j}}} \tag{2}$$

where SCosineSim represents the soft-cosine similarity observed between the user query, represented by QVect, and the repository data, represented by RVect, and s_ij denotes the similarity between features i and j, as in the standard definition of soft-cosine similarity.
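To make the difference from ordinary cosine similarity concrete, the following sketch implements the standard soft-cosine formula with an explicit feature-similarity matrix. The three-term vocabulary and the relatedness value of 0.8 are assumptions chosen purely for illustration; the paper does not state how its feature similarities are obtained.

```python
import math


def soft_cosine(q, r, s):
    """Soft-cosine similarity of vectors q and r under feature-similarity matrix s."""
    n = len(q)

    def bilinear(a, b):
        # Sum over all i, j of s[i][j] * a[i] * b[j].
        return sum(s[i][j] * a[i] * b[j] for i in range(n) for j in range(n))

    denom = math.sqrt(bilinear(q, q)) * math.sqrt(bilinear(r, r))
    if denom == 0:
        return 0.0
    return bilinear(q, r) / denom


# Hypothetical three-term vocabulary: ["login", "signin", "weather"].
query = [1, 0, 0]       # contains only "login"
repo_entry = [0, 1, 0]  # contains only "signin"

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Assumed relatedness of 0.8 between "login" and "signin" (illustrative value).
related = [[1, 0.8, 0], [0.8, 1, 0], [0, 0, 1]]

plain = soft_cosine(query, repo_entry, identity)  # reduces to ordinary cosine
soft = soft_cosine(query, repo_entry, related)
```

With the identity matrix the measure reduces to ordinary cosine similarity, and the two vectors, which share no term, score 0. With the assumed relatedness between the first two terms, the same pair scores 0.8, which is the kind of partial match that soft-cosine is designed to capture.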

Further, hybrid similarity prediction is performed by combining the observed

similarity predicted by individual similarity calculations, i.e., using cosine similarity and

soft-cosine similarity as follows:

$$H_{Sim1} = \mathrm{Cosine}_{Sim} + \mathrm{SCosine}_{Sim} \tag{3}$$

where the hybrid similarity prediction is represented by HSim1, based on the predictions made by the cosine and soft-cosine similarity calculators, represented by CosineSim and SCosineSim, respectively. In addition to cosine similarity, a statistical similarity measure, Jaccard similarity, is also incorporated to evaluate the effectiveness of the proposed cosine and soft-cosine hybrid. Jaccard similarity JaccardSim between the query dataset Qdata and the repository dataset Rdata is calculated as follows:

$$\mathrm{Jaccard}_{Sim} = \left( \frac{|Q_{data} \cap R_{data}|}{|Q_{data} \cup R_{data}|} \right) \ast 100 \tag{4}$$

where Qdata ∩ Rdata represents the data common to Qdata and Rdata, and Qdata ∪ Rdata represents the union of the data present in the two datasets. A second hybrid similarity prediction is then computed to represent the similarity between the query and repository datasets by combining the individual cosine similarity and Jaccard similarity as follows:

$$H_{Sim2} = \mathrm{Cosine}_{Sim} + \mathrm{Jaccard}_{Sim} \tag{5}$$

where HSim2 represents the hybrid similarity formed by combining CosineSim and JaccardSim. The hybrid similarity calculations HSim1 and HSim2 made in this layer are sent to the VL.
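The Jaccard measure and the two hybrids of equations (3) to (5) can be sketched as below. The function names and the sample tokens are illustrative assumptions; only the formulas themselves come from the text.

```python
def jaccard_similarity(q_data, r_data):
    """Jaccard similarity of two token collections, scaled by 100 as in equation (4)."""
    q_set, r_set = set(q_data), set(r_data)
    union = q_set | r_set
    if not union:
        return 0.0
    return len(q_set & r_set) / len(union) * 100


def hybrid_sim1(cosine_sim, soft_cosine_sim):
    """Equation (3): sum of cosine and soft-cosine similarity."""
    return cosine_sim + soft_cosine_sim


def hybrid_sim2(cosine_sim, jaccard_sim):
    """Equation (5): sum of cosine and Jaccard similarity."""
    return cosine_sim + jaccard_sim


# Illustrative tokens, not taken from the paper's dataset.
query_tokens = ["secure", "bank", "login"]
repo_tokens = ["bank", "login", "page", "help"]
jac = jaccard_similarity(query_tokens, repo_tokens)  # 2 shared of 5 distinct tokens
```

One point worth noting when reading the equations as written: equation (4) scales the Jaccard term to the range 0 to 100, whereas cosine similarity of non-negative vectors lies between 0 and 1, so in HSim2 the Jaccard term carries most of the magnitude unless the two are brought to a common scale.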

3.1.2 Validation layer

The current layer evaluates the similarity predictions of the similarity layer (SL) using an NN. It consists of an input layer that receives the user query, a hidden layer that acts as a processing framework based on the weights, and an output layer that returns the classification results. A sigmoid activation function is applied to the raw data from the input layer, and the output of the sigmoid unit is passed to the hidden layer. Algorithm 2 summarises the steps employed in validating the similarity predictions using the NN.

The similarity predictions made in the previous layer are arranged in groups. The blocks in a group represent the similarity calculations corresponding to one repository, so that the similarity predictions for a repository form a single group tagged with the repository name. The obtained sets then undergo supervised learning using the NN. This multiclass classifier is used to differentiate phishing and genuine sites using URLs. The group value and the classified value are compared for each query value. If they match, the NN classifies the query as true; otherwise, the query is classified as false.


Algorithm 2 Pseudo code for validation using NN

1 Input: HSim // hybrid similarity value
2 Initialise variables
    TVal // training value
    GVal // group value
3 Assign training value
    TVal = HSim // assign respective training value
4 Assign group value
    GVal = Gnum // group value is represented by respective group number
5 Assign NN parameters
    DRatio = 0.7 // distribution ratio
    CVratio = 0.15 // cross-validation ratio
    Tratio = 0.15 // test ratio
    Nnum = 20 // number of neurons
6 Initialise neural network for training
    TrainNN(TVal, GVal, Nnum) // initialise neural network
7 Start training neural network
8 Foreach FVal in Fclassified // for every value in classified frame
9   If FVal == GVal // classified result matches the group value
10    Tclass++ // auto increment true classified class label
11  Else
12    Fclass++ // auto increment false classified class label
13  Endif
14 Endfor
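The bookkeeping around the network in Algorithm 2 can be sketched as follows: the 0.7/0.15/0.15 data split of Step 5 and the true/false tally of Steps 8 to 14. The training call itself (TrainNN with 20 neurons and the Levenberg engine) is deliberately left out, since it depends on the neural network library used; the helper names and sample values here are hypothetical.

```python
import random


def split_dataset(samples, train_ratio=0.7, cv_ratio=0.15, seed=0):
    """Shuffle samples and split them into training, cross-validation and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    n_cv = round(len(shuffled) * cv_ratio)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # the remainder, about 15% of the data
    return train, cv, test


def tally(classified, group_values):
    """Count classified values that match or miss their group value (Steps 8 to 14)."""
    t_class = f_class = 0
    for predicted, expected in zip(classified, group_values):
        if predicted == expected:
            t_class += 1
        else:
            f_class += 1
    return t_class, f_class


# Illustrative inputs: 100 sample indices and four made-up classification results.
train, cv, test = split_dataset(range(100))
counts = tally([1, 2, 2, 1], [1, 2, 1, 1])
```

A trained classifier would sit between the two helpers: it would be fitted on the training part, tuned on the cross-validation part, and its outputs on the test part would be passed to the tally.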

4 Result and discussion

The phishing predictions made by the proposed framework are evaluated in terms of three quality parameters, namely PPV (also termed precision), which is calculated against the training dataset, TPR (also called recall) and F-measure.

a PPV is computed as the number of true detections made by the prediction model relative to the total number of detections. Mathematically, it is calculated as follows:

$$PPV = \frac{T_{positive}}{T_{positive} + F_{positive}} \tag{6}$$

where Tpositive and Fpositive represent the true positive and false positive detections, respectively.

b TPR is the proportion of actual positive cases that are correctly detected as positive. It is calculated as follows:

$$TPR = \frac{T_{positive}}{T_{positive} + F_{negative}} \tag{7}$$

where Fnegative represents the false negative detections.

c F-measure represents the harmonic mean of the above two parameters. It is calculated from their product and arithmetic sum as follows:

$$F_{measure} = 2 \ast \left( \frac{PPV \ast TPR}{PPV + TPR} \right) \tag{8}$$
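Equations (6) to (8) translate directly into code. The sketch below uses made-up confusion counts for illustration; the function names are not from the paper.

```python
def ppv(tp, fp):
    """Positive predictive value (precision), equation (6)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def tpr(tp, fn):
    """True positive rate (recall), equation (7)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall, equation (8)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Made-up counts for illustration only: 3 true positives, 1 false positive,
# 1 false negative.
precision = ppv(3, 1)
recall = tpr(3, 1)
score = f_measure(precision, recall)
```

As a consistency check on the formula, applying `f_measure` to a PPV of 0.732 and a TPR of 0.738 gives a value that rounds to 0.735 at three decimal places.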

The values obtained from the above relationships using the two hybrids, namely Hsim1 (cosine hybrid with soft-cosine) and Hsim2 (cosine hybrid with Jaccard similarity), are summarised in Table 2. Columns 2 and 3 compare the average value of PPV, columns 4 and 5 that of TPR, and columns 6 and 7 that of F-measure, observed using the two hybrids separately for the number of test files given in column 1.

Table 2 Average value of PPV, TPR and F-measure

| Number of test files | PPV (cosine + Jaccard) | PPV (cosine + soft-cosine) | TPR (cosine + Jaccard) | TPR (cosine + soft-cosine) | F-measure (cosine + Jaccard) | F-measure (cosine + soft-cosine) |
|---|---|---|---|---|---|---|
| 100 | 0.708 | 0.709 | 0.716 | 0.718 | 0.712 | 0.713 |
| 200 | 0.712 | 0.714 | 0.719 | 0.721 | 0.715 | 0.717 |
| 500 | 0.714 | 0.715 | 0.720 | 0.722 | 0.717 | 0.718 |
| 1,000 | 0.718 | 0.719 | 0.723 | 0.725 | 0.720 | 0.722 |
| 2,000 | 0.722 | 0.725 | 0.726 | 0.729 | 0.724 | 0.727 |
| 3,000 | 0.726 | 0.732 | 0.732 | 0.738 | 0.729 | 0.735 |

Figure 5 PPV analysis of the proposed work using the two hybrid similarities (see online version for colours)

[Line graph: PPV values (axis from 0.705 to 0.74) against number of test files (100, 200, 500, 1,000, 2,000, 3,000); series: Cosine + Soft-Cosine and Cosine + Jaccard]


Table 2 shows that a rise in the number of test files increases the average value of each parameter considered. Figure 5 shows the PPV analysis of the proposed work using the two hybrid similarities separately for different numbers of test files. The best performance (maximum value of each parameter) is obtained with the largest number of test files. The line graph shows that the average PPV observed with the cosine and soft-cosine hybrid is 0.719, which is 0.42% higher than that of the cosine and Jaccard hybrid, 0.716. This shows that the proposed cosine and soft-cosine hybrid demonstrates better precision than the cosine and Jaccard hybrid.

Figure 6 TPR analysis of the proposed work using the two hybrid similarities (see online version for colours)

[Line graph: TPR values (axis from 0.705 to 0.74) against number of test files (100, 200, 500, 1,000, 2,000, 3,000); series: Cosine + Soft-Cosine and Cosine + Jaccard]

Figure 7 F-measure analysis of the proposed work using the two hybrid similarities (see online version for colours)

[Line graph: F-measure values (axis from 0.705 to 0.740) against number of test files (100, 200, 500, 1,000, 2,000, 3,000); series: Cosine + Soft-Cosine and Cosine + Jaccard]


The TPR analysis of the two hybrids in Figure 6 shows that the proposed work exhibited better TPR using the cosine and soft-cosine hybrid similarity than using the cosine and Jaccard hybrid similarity. The average TPR using cosine and soft-cosine similarity is 0.2833% higher than that of phishing detection performed using the cosine and Jaccard hybrid similarity.

F-measure reflects the harmonic mean of PPV and TPR. Figure 7 gives a graphical comparison of the performance of the proposed work using the two hybrid similarity approaches separately. The proposed anti-phishing model using the cosine and soft-cosine hybrid similarity achieved an average F-measure of 0.7222, while the cosine and Jaccard hybrid similarity achieved an average F-measure of 0.7196. This shows that the proposed work using cosine and soft-cosine similarity resulted in a 0.258% higher F-measure for phishing detection.
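The averages quoted in this section can be recomputed from the entries of Table 2. The short sketch below does so; the dictionary layout and the function name are choices made for this illustration, while the numbers are the rounded values printed in the table.

```python
from statistics import mean

# Rows of Table 2: values for each hybrid at 100, 200, 500, 1,000, 2,000
# and 3,000 test files.
table2 = {
    "PPV": {"jaccard": [0.708, 0.712, 0.714, 0.718, 0.722, 0.726],
            "soft_cosine": [0.709, 0.714, 0.715, 0.719, 0.725, 0.732]},
    "TPR": {"jaccard": [0.716, 0.719, 0.720, 0.723, 0.726, 0.732],
            "soft_cosine": [0.718, 0.721, 0.722, 0.725, 0.729, 0.738]},
    "F-measure": {"jaccard": [0.712, 0.715, 0.717, 0.720, 0.724, 0.729],
                  "soft_cosine": [0.713, 0.717, 0.718, 0.722, 0.727, 0.735]},
}


def summarise(table):
    """Return per-metric means and their absolute gap in percentage points."""
    summary = {}
    for metric, cols in table.items():
        jac = mean(cols["jaccard"])
        soft = mean(cols["soft_cosine"])
        summary[metric] = (jac, soft, (soft - jac) * 100)
    return summary


for metric, (jac, soft, gap) in summarise(table2).items():
    print(f"{metric}: Jaccard hybrid {jac:.4f}, "
          f"soft-cosine hybrid {soft:.4f}, gap {gap:.3f} points")
```

Computed this way, the absolute gaps between the two hybrids are about 0.233 points for PPV and 0.283 points for TPR, which suggests that the 0.233% and 0.2833% figures are differences in percentage points rather than relative changes. For F-measure the rounded table entries give a gap of about 0.250 points, close to the 0.258 obtained from the averages of 0.7222 and 0.7196 quoted in the text, which were presumably computed before rounding.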

5 Conclusions

In the current work, the authors have proposed an anti-phishing framework based on a new rule-based architecture. The first layer of the anti-phishing model is dedicated to similarity evaluation between the user query and the website. Phishing detection is based on hybrid cosine and soft-cosine similarity measurements, followed by an NN that cross-validates the prediction results. In the SL, to demonstrate the effectiveness of the proposed cosine and soft-cosine similarity, another hybrid similarity (cosine and Jaccard similarity) was also implemented. Simulation analysis has shown that the proposed work using cosine and soft-cosine similarity outperformed the cosine and Jaccard similarity-based anti-phishing model, with 0.233% higher PPV, 0.2833% better TPR and 0.258% improved F-measure. Overall, the proposed anti-phishing model showed an average improvement in PPV of 2.3%, TPR of 2% and F-measure of 2.15% when the sample size was increased from 100 to 3,000 test files. Average values of 0.719 for PPV, 0.726 for TPR and 0.722 for F-measure were observed, which indicates the effectiveness of the proposed anti-phishing design.

Acknowledgements

This work is part of bilateral Indian-Bulgarian cooperation research project

between Technical University of Sofia, Bulgaria and Deenbandhu Chhotu Ram

University of Science and Technology, Murthal, Sonepat, India under the title

‘Contemporary approaches for processing and analysis of multidimensional signals in

telecommunications’, financed by the Department of Science and Technology (DST),

India and the Ministry of Education and Science, Bulgaria.


References

Abutair, H.Y. and Belghith, A. (2017) ‘Using case-based reasoning for phishing detection’,

Procedia Computer Science, Vol. 109, pp.281–288.

Aksu, D., Turgut, Z., Üstebay, S. and Aydin, M.A. (2019) ‘Phishing analysis of websites

using classification techniques’, in International Telecommunications Conference, Springer,

Singapore, pp.251–258.

Cao, Y., Han, W. and Le, Y. (2008) ‘Anti-phishing based on automated individual white-list’,

in Proceedings of the 4th ACM Workshop on Digital Identity Management, pp.51–60.

Clement, J. (2020) Phishing Statistics Analysis, Statista.com [online] https://www.statista.com/statistics/266161/websites-most-affected-by-phishing/ (accessed July 2020).

Engel, D., Stütz, T. and Uhl, A. (2012) ‘Assessing JPEG2000 encryption with key-dependent

wavelet packets’, EURASIP Journal on Information Security, Vol. 1, pp.2–13.

Global Digital Report 2019 [online] https://datareportal.com/ (accessed 21 February 2020).

Gupta, S. and Singhal, A. (2018) ‘Dynamic classification mining techniques for predicting phishing

URL’, in Soft Computing: Theories and Applications, pp.537–546, Springer, Singapore.

Jain, A.K. and Gupta, B.B. (2018) ‘Two-level authentication approach to protect from phishing

attacks in real time’, Journal of Ambient Intelligence and Humanized Computing, Vol. 9,

pp.1783–1796.

Jain, A.K. and Gupta, B.B. (2019) ‘A machine learning based approach for phishing detection

using hyperlinks information’, Journal of Ambient Intelligence and Humanized Computing,

Vol. 10, pp.2015–2028.

Kaur, D. and Kalra, S. (2016) ‘Five-tier barrier anti-phishing scheme using hybrid approach’,

Information Security Journal: A Global Perspective, Vol. 25, pp.247–260.

Kordestani, H. and Shajari, M. (2018) ‘A similarity-based framework for detecting phishing

websites’, International Journal of Advanced Research in Computer Science, Vol. 9.

Lam, T. and Kettani, H. (2019) ‘PhAttApp: a phishing attack detection application’, in Proceedings

of the 2019 3rd International Conference on Information System and Data Mining,

pp.154–158.

Li, Y., Yang, L. and Ding, J. (2016) ‘A minimum enclosing ball-based support vector machine

approach for detection of phishing websites’, Optik, Vol. 127, pp.345–351.

Lin, K.S. (2019) ‘New attack potential measurement method to kaizen event for web application

security vulnerabilities’, International Journal of Electronic Commerce Studies, Vol. 10,

pp.89–112.

Makki, S., Haque, R., Taher, Y., Assaghir, Z., Hacid, M.S. and Zeineddine, H. (2018)

‘A cost-sensitive cosine similarity K-nearest neighbor for credit card fraud detection’, Big

Data and Cyber-security Intelligence, Beirut, Lebanon.

Mensah, P., Blanc, G., Okada, K., Miyamoto, D. and Kadobayashi, Y. (2015) ‘AJNA: anti-phishing

JS-based visual analysis, to mitigate users’ excessive trust in SSL/TLS’, in 2015 4th

International Workshop on Building Analysis Datasets and Gathering Experience Returns for

Security (BADGERS), pp.74–84.

Morovati, K. and Kadam, S.S. (2019) ‘Detection of phishing emails with email forensic analysis

and machine learning techniques’, International Journal of Cyber-Security and Digital

Forensics, Vol. 8, pp.98–108.

Nguyen, L.A.T., Nguyen, H.K. and To, B.L. (2016) ‘An efficient approach based on neuro-fuzzy

for phishing detection’, Journal of Automation and Control Engineering, Vol. 4.

PhishTank [online] https://www.phishtank.com/ (accessed 1 February 2020).

Rao, R.S., Vaishnavi, T. and Pais, A.R. (2020) ‘CatchPhish: detection of phishing websites by

inspecting URLs’, Journal of Ambient Intelligence and Humanized Computing, Vol. 11,

pp.813–825.

Ramanathan, V. and Wechsler, H. (2012) ‘phishGILLNET – phishing detection methodology using

probabilistic latent semantic analysis, AdaBoost, and co-training’, EURASIP Journal on

Information Security, No. 1, pp.1–22.

Sheng, S., Magnien, B., Kumaraguru, P., Acquisti, A., Cranor, L.F., Hong, J. and Nunge, E. (2007)

‘Anti-phishing phil: the design and evaluation of a game that teaches people not to fall for

phish’, in Proceedings of the 3rd Symposium on Usable Privacy and Security, pp.88–99.

Sonowal, G. and Kuppusamy, K.S. (2020) ‘PhiDMA – a phishing detection model with multi-filter

approach’, Journal of King Saud University – Computer and Information Sciences, Vol. 32,

pp.99–112.

Suleman, M.T. and Awan, S.M. (2019) ‘Optimization of URL-based phishing websites detection

through genetic algorithms’, Automatic Control and Computer Sciences, Vol. 53, pp.333–341.

Ugochi, O.C. (2018) ‘A novel web page anti-phishing approach using URL cosine similarity and IP

address comparison’, in International Conferences on WWW/Internet, ICWI 2018 and Applied

Computing, pp.321–328.

Zhu, E., Chen, Y., Ye, C., Li, X. and Liu, F. (2019) ‘OFS-NN: an effective phishing websites

detection model based on optimal feature selection and neural network’, IEEE Access, Vol. 7,

pp.73271–73284.
