Semantic Sky: A Gmail Plugin for Email
Classification and Annotation
Agon Osmani
Faculty of Computer Science and Engineering
Skopje, North Macedonia
agon.osmani.2@students.finki.ukim.mk
Milos Jovanovik
Faculty of Computer Science and Engineering
Skopje, North Macedonia
milos.jovanovik@finki.ukim.mk
Sasho Gramatikov
Faculty of Computer Science and Engineering
Skopje, North Macedonia
sasho.gramatikov@finki.ukim.mk
Riste Stojanov
Faculty of Computer Science and Engineering
Skopje, North Macedonia
riste.stojanov@finki.ukim.mk
Abstract—We introduce a novel tool designed to simplify email management with the use of cutting-edge machine learning frameworks for natural language processing. Handling emails can often prove to be a challenging task, especially when faced with a substantial volume of emails that require immediate action from the recipient. Our vision is to develop a tool that accelerates the process of taking the necessary actions directly within the email interface, eliminating the need to switch between multiple windows or platforms. Our approach uses a text classification model that identifies the intent of each email and a named entity recognition model that identifies key tokens or information within the email that may assist the user. After the plugin determines the appropriate action for the recipient, it establishes connections with relevant systems and facilitates the exchange of necessary information. The proposed tool simplifies email management and accelerates the completion of tasks within the email interface: it highlights the key information within the email and the email's intended purpose, and expedites the process of information sharing. This tool improves the handling of emails and gives users the means to complete their tasks efficiently.
I. INTRODUCTION
In our professional careers, emails often play a crucial role. This is especially the case when we interact with parties outside of our company. Emails can have two purposes: (1) to spread information, or (2) to request an action from us. In this paper we focus on the second type of emails. In today's dynamic environment, when everyone is on the move, smart mobile devices are usually the primary way of consuming emails [1]. However, replying from our mobile devices can be tricky. Additionally, forgetting to process an email that was read on the move is very common. Even though mail providers try to solve this problem by analysing the content of the email with machine learning models and then suggesting a few suitable quick replies, this does not cover the completion of the action requested in the message. These actions are often domain specific and require interaction with some of the company's systems.
The volume of emails exchanged at a company level grows over time, and it is generally hard for an individual to keep track of all the information stored within them. In cases where we do not have the capacity to process such amounts of data manually, we try to invent tools that might help us. Natural language processing (NLP) tools allow us to perform intelligent knowledge extraction (KE) and use this knowledge to execute the actions of interest.
Named entity recognition (NER), as a key component of NLP systems for annotating entities with their corresponding classes, enriches the semantic context of words by adding hierarchical identification. There is currently a lot of new work in this field, especially in optimizing neural networks for sequence labeling, and these systems outperform early NER systems based on domain dictionaries, lexicons, orthographic feature extraction and semantic rules. Starting with [2], neural network NER systems with minimal feature engineering have become popular due to the performance they achieve.
Sequence-to-Sequence (Seq2Seq) architectures [3] first introduced the powerful ability to transform a given sequence of text elements into another sequence, a concept which fits machine translation well. Transformers [4] are models which implement the Seq2Seq architecture using an encoder-decoder structure. Google's BERT [5] is based on the transformer architecture and integrates an attention mechanism [4]. It produces outstanding results on many NLP tasks, including NER and text classification, due to its ability to learn contextual relations between words (or sub-words) in a text, making it applicable in any domain. Multilingual BERT [6], as a single language model, is remarkable at cross-lingual model transfer, in which annotations in one language can be used to generalize the model to another language.
Transfer learning, as a machine learning method, provides the concept of re-usability in neural networks, where a model developed for one task can be reused as the starting point for training on another problem that has a significantly smaller training set. In recent years, transfer learning has become one of the most popular approaches, since it outperforms state-of-the-art models in many use-cases, and does so using smaller training sets for fine-tuning and far fewer computational resources. Hugging Face [7] has been a pioneering force in the realm of transfer learning, offering a versatile platform and pre-trained models that empower developers to efficiently adapt and fine-tune models for a wide array of natural language processing tasks.
II. RELATED WORK
In this section we review related efforts that have employed techniques similar to ours. These include email classification using BERT, the use of NLP techniques to enhance email management workflows, Named Entity Recognition (NER) within emails, and the creation of Gmail plugins aimed at enhancing email processing.
Large language models (LLMs) are usually Transformer-based language models with billions of parameters, trained on extensive text datasets. LLMs demonstrate remarkable abilities in comprehending natural language and tackling intricate tasks, particularly through text generation. GPT-3 and GPT-3.5 [8] are iterations of language models created by OpenAI that produce human-like natural language text. Initially, the davinci model [8], with 175 billion parameters, served as the foundation for the GPT-3 series. GPT-3.5-turbo [8] is optimized for chat applications, offering greater capabilities compared to text-davinci-003. LLaMA [9] offers models ranging from 7 billion to 65 billion parameters, exhibiting competitive performance compared to leading large language models. Notably, LLaMA-13B is competitive with GPT-3 on most benchmarks despite being ten times smaller.
In the field of email classification, distinct from our focus on extracting call-to-action elements from emails, there have been attempts that use the BERT transformer to discern whether an email is spam or not.
In [10], the BERT model was used for the purpose of spam detection in emails. This research utilized the BERT base model, which comprises a stack of 12 encoders, to construct a high-performance spam detector. The model was trained on diverse datasets and used the BERT tokenizer to segment email sentences into chunks of words and feed them to the model, achieving remarkable F1 scores of 98.62%, 97.83%, and 99.13% across three different email corpora: the Enron corpus [11], the SpamAssassin corpus¹, and the Ling-Spam corpus², respectively.
In [12], various NLP techniques used for phishing detection in emails are discussed. The paper reviews a wide array of methods, including machine learning strategies like Random Forests [13] and Support Vector Machines [14], as well as bag-of-words and Dynamic Markov Chain features. Additionally, the survey references several notable works in this domain, with particular mention of [15], where a phishing detection model was constructed using Keras, word embeddings, and convolutional neural networks (CNN). This model achieved an accuracy of 96.8% on the test dataset.

¹ https://www.kaggle.com/datasets/ganiyuolalekan/spam-assassin-email-classification-dataset?resource=download
² https://www.kaggle.com/datasets/mandygu/lingspam-dataset
In the field of named entity extraction, Lisa F. Rau [16] pioneered a system for extracting and recognizing company names, relying on heuristic techniques and manually crafted rules. In an earlier paper dating back to 1999 [17], during a time when dictionaries and lists of people, organizations, and locations posed a bottleneck for Named Entity Recognition, the authors aimed to address this limitation. Their model incorporated contextual information related to named entities, including their sentence position, lowercase usage, and presence in the document. This approach achieved a combined precision and recall score of 93.39%.
In [18], Conditional Random Fields (CRF) were applied to label sequences of examples, specifically in the context of extracting personal names from emails. The approach included three distinct types of word features: basic features, dictionary features, and email structure features. These features were designed to identify patterns such as capitalization, the presence of common words or first names from dictionaries, and other indicators, including token matching within the “from” field or across the email header. During evaluation, the model demonstrated impressive performance, achieving an F1 score of 91.9% when utilizing all of these features.
Several research papers have explored the automation of email processing to streamline tasks and enhance email management. For instance, in [19], the authors developed thread visualization techniques to handle complex email conversations with accumulated messages. Meanwhile, [20] directed their attention toward an automated attention manager, a tool designed to assist computer users in efficiently handling notifications. Their approach centers on the automatic evaluation of message value and the continuous inference of a user's attention, using Bayesian models to estimate the probability distribution of the user's focus.
III. SYSTEM ARCHITECTURE
A. Google Apps Script
Google Apps Script [21] is an application development platform that expedites the creation of business applications that integrate with G Suite. This web-based coding platform enables users to create plugins for all the products that Google offers. Within the framework of this project, we use Apps Script to create a plugin which is integrated with our custom API service.
In this paper, we introduce an application deployed as an Apps Script project. Once deployed, the application can be accessed within Gmail's right-side panel, as depicted in Figure 1.
Upon clicking the plugin icon, a right-side panel unfolds. The plugin operates on individual emails; if no email is selected, the panel remains empty. When the user opens an email of interest, the panel presents an “Analyze” button (Figure 2).
Fig. 1. Finding our plugin application in Gmail's right-side panel.
Fig. 2. Analyze button for user consent to analyze the email.
By clicking this button, the user consents to transmitting the
contents of the current email to the content analysis service
while ensuring the data’s protection in accordance with GDPR
guidelines [22]. It’s important to note that in this scenario,
we do not retain the data within the content analysis service;
instead, we store its representation obtained by the model.
Once the email is transmitted to the content analysis service, we employ a fine-tuned transformer model to identify some of the most frequently occurring named entities within the email, including individuals and locations. In this process, we start with Multilingual BERT [6], fine-tuned specifically for Named Entity Recognition (NER) using the Wikidata corpus [23]. In the current iteration, our emphasis lies on the Macedonian language, and therefore we have fine-tuned the model exclusively on the Macedonian section of Wikidata. However, we intend to extend this fine-tuning methodology to cover all languages in subsequent iterations.
The panel is refreshed, presenting a new layout comprising several distinct sections. The initial section of the panel exhibits lists of information detected within the email, encompassing individual names, locations, and university subjects (as depicted in Figure 3). These detected pieces of information are categorized as tokens within the text, employing the Named Entity Recognition (NER) technique.
Fig. 3. The plugin panel displays token information captured during email
analysis.
The content analysis service utilizes the fine-tuned model to extract the entities of interest. It is crucial to acknowledge that certain named entities might not be present or accurately labeled within the fine-tuning dataset, and that the model is initially not infallible. Therefore, we intentionally designed the platform to be extensible from the outset, by having it provide these results to the end user for verification and correction.
To facilitate the initial development of training data, especially when introducing new named entities, we grant users the capability to define regular expressions that identify named entities within the text. Consequently, the content analysis service combines the results predicted by the model with those obtained through regular expression matches. Figure 3 provides an illustration of the outcomes presented to the user. As depicted, users have the option to select certain candidate tokens via checkboxes and subsequently remove them if they are identified as false positives. Furthermore, for instances where false negatives are identified, each named entity provides an input field at its beginning. Users can input text fragments that were not recognized in the text. It is important to mention that we have implemented validation to ensure that only valid text parts can be added as new named entities.
Once all modifications for a particular email have been made, if any, the user can fine-tune the model with the newly annotated data by clicking the “Fine Tune” button. During this process, training is conducted using a single example to adapt the model to the freshly annotated insights provided by the end user.
Fig. 4. The section in the plugin panel for articulating the necessary action.
The second section of the side panel pertains to specifying
the intended action to be taken, as illustrated in Figure 4.
In this form, users have the opportunity to articulate their
goals concerning the email’s content by describing the action
they wish to take after reading the currently opened email.
Examples of such actions, in the context of an academic institution and from a professor's standpoint, include adding a grade to a system, reviewing homework, and scheduling a meeting with a student, among numerous others.
Within this section, users are presented with a set of predefined actions, and they have the option to select one of these suggestions as their intended action. Additionally, users can manually input an alternative action if it is not present in the predefined options. This form is equipped with another fine-tuned multilingual BERT model that offers action suggestions. However, in this scenario, we employ text classification, utilizing the email's content as input and extracting the desired actions as output. These suggested actions can be customized, and the model can be fine-tuned to enhance its future performance.
B. Content analysis service
The core technology integrated with the plugin is a Flask application³, written in Python 3.10. The Gmail plugin communicates with this application via HTTP GET and POST requests to access the required data from the content analysis service. This architecture creates a composite application with distinct components: the user interface, email processing and model training, each separated into an individual system.
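For illustration, a minimal sketch of what such a service endpoint could look like is given below; the endpoint name, payload fields, port, and the choice of SHA-256 are our assumptions, since the paper does not publish the service's API.

import hashlib
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/analyze", methods=["POST"])
def analyze():
    payload = request.get_json()
    sender, content = payload["sender"], payload["content"]  # hypothetical field names
    # Only a hash of sender + content is kept; SHA-256 is an assumption here.
    mail_hash = hashlib.sha256((sender + content).encode("utf-8")).hexdigest()
    # The NER model and regex matchers would run over `content` at this point,
    # storing tokens under `mail_hash` and discarding the raw text.
    tokens = []  # e.g. [{"type": "PERSON", "start": 31, "length": 11}]
    return jsonify({"hash": mail_hash, "tokens": tokens})

if __name__ == "__main__":
    app.run(port=5000)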
Emails serve as the central component of the system and
are encapsulated by the Mail model. The primary attribute of
this model is the hash generated from the email’s sender and
content. Importantly, the content itself is not stored to mitigate
privacy concerns.
Each email is associated with a single action (goal) and
multiple named entity tokens. Each Token is characterized
by its named entity type and its position within the message
text of the mail. This design allows the content, along with
the tokens, to remain in memory briefly within the content
analysis service before being discarded. However, the email’s
hash is unique, enabling the analysis to retrieve the Action
and Tokens from the database when the same email is opened
again. Subsequently, the stored position of the token in the text
can be used to extract the textual value for each Token. This
approach ensures user privacy is upheld while maintaining the
platform’s core functionality.
Within the application's service layer resides a collection of algorithms designed to facilitate both manual and predictive extraction of tokens from emails. These algorithms determine the placement of the tokens within the text and handle their storage in the database, alongside the assignment of the corresponding action to the mail.
We employ Python code with the following expression:

matches = re.finditer(token_regex, text)
This allows us to utilize a regular expression tailored to the token type, seeking and identifying all the corresponding matches within the text. We proceed by iterating through these matches, capturing the start position of each token. Subsequently, we assign the token's foreign key, linking it to the hash of the email currently undergoing processing, and store this information in the database. Following the identification of the tokens, we also obtain the NER tokens generated by the BERT model, in order to return a union of the tokens to the Gmail plugin interface.
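Putting the pieces together, a sketch of this extraction step could look as follows; the token_regex pattern, the PERSON example, and the (type, start, length) tuple representation are illustrative only, not the service's exact implementation.

import re

def extract_regex_tokens(text: str, entity_type: str, token_regex: str) -> list[tuple[str, int, int]]:
    tokens = []
    for match in re.finditer(token_regex, text):
        # Capture only the entity type and position, not the matched text itself.
        tokens.append((entity_type, match.start(), match.end() - match.start()))
    return tokens

# Example: a deliberately naive pattern for capitalized two-word name candidates.
text = "Please schedule a meeting with Agon Osmani next week."
person_tokens = extract_regex_tokens(text, "PERSON", r"[A-Z][a-z]+ [A-Z][a-z]+")
# The service then takes the union of these tokens with the NER tokens
# predicted by the BERT model before replying to the Gmail plugin.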
A similar procedure is applied when users manually add or
exclude tokens. Utilizing the same code, a search is conducted
to identify all instances of the token, extract their starting
positions, and employ a verification process.
³ https://flask.palletsprojects.com/en/3.0.x/
The exclusion of tokens proves particularly valuable for handling false positives generated by the BERT model. When a token is excluded, the algorithm flags it as incorrect in the database, ensuring that it does not appear in the interface results. Crucially, however, the information that the token is a false positive is preserved in the database, which enables the computation of more accurate metrics during evaluation.
C. The Transformer Architecture
Recurrent neural networks, including long short-term memory (LSTM) [24] and gated recurrent neural networks (GRU) [25], are models used for solving NLP tasks such as machine translation and sentiment analysis. These models generate a sequence of hidden states h_t, each computed from the previous hidden state h_{t-1} and the input at position t. However, their sequential nature prevents parallelization within training examples, especially with longer sequences, due to memory constraints. In contrast, the Transformer architecture [4] omits recurrence and relies solely on an attention mechanism to capture global dependencies between input and output. This design allows for much greater parallelization, overcoming the limitations of sequential processing in recurrent models.
Fig. 5. Transformer encoder-decoder architecture.
The Transformer adopts an encoder-decoder architecture
(Figure 5). In this setup, the encoder takes an input sequence
of symbol representations and maps it to a sequence of
continuous representations.
The encoder comprises a stack of identical layers, each
consisting of two sub-layers. The first sub-layer is a multi-
head self-attention mechanism, while the second sub-layer is
a straightforward fully connected feed-forward network.
Given that this model lacks recurrence, “positional encodings” are integrated into the input embeddings at the base of the encoder, in order to incorporate information about the position of the tokens in the text. Sine and cosine functions of varying frequencies are employed to generate the positional encodings. Specifically, the positional encodings are defined as follows:
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
Here, pos represents the position and i denotes the dimension. Each dimension of the positional encoding corresponds to a sinusoid.
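For illustration, a short NumPy sketch of these sinusoidal encodings (the max_len and d_model values are arbitrary examples):

import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    # pos: (max_len, 1) column of positions; i: (1, d_model/2) row of dimension indices.
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=128, d_model=768)  # 768 is the BERT-Base hidden size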
The attention function can be defined as a process that maps a query and a collection of key-value pairs to an output. The output is determined by calculating a weighted sum of the values, where the weight assigned to each value is determined by a compatibility function of the query with the corresponding key. We calculate the dot products of the query with all keys, scale each by 1/\sqrt{d_k}, and apply a softmax function to derive the weights on the values:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
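A NumPy sketch of this scaled dot-product attention, for illustration:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled compatibility of queries with keys
    return softmax(scores) @ V       # weighted sum of the values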
Multi-head attention (Figure 6) allows the model to focus on different parts of the input data (such as words in a sentence) from different perspectives simultaneously. This helps capture complex relationships within the data. In multi-head attention, the attention mechanism is replicated multiple times in parallel, and each head learns different relationships within the input data. The queries, keys, and values are linearly transformed using learned weight matrices (W_i^Q, W_i^K, W_i^V) specific to each head.
The multi-head attention computation can be described as
follows:
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O

where

\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)
Fig. 6. Multi-head attention mechanism.
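A sketch of this computation is given below, reusing the attention function from the previous sketch; the random matrices are stand-ins for the learned projections W_i^Q, W_i^K, W_i^V, W^O.

import numpy as np

def multi_head_attention(Q, K, V, num_heads, d_model):
    d_k = d_model // num_heads
    rng = np.random.default_rng(0)  # random stand-ins for learned weights
    heads = []
    for _ in range(num_heads):
        WQ, WK, WV = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(attention(Q @ WQ, K @ WK, V @ WV))   # head_i
    WO = rng.standard_normal((num_heads * d_k, d_model))  # output projection W^O
    return np.concatenate(heads, axis=-1) @ WO

x = np.random.default_rng(1).standard_normal((10, 768))  # 10 tokens, d_model = 768
out = multi_head_attention(x, x, x, num_heads=12, d_model=768)  # self-attention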
D. Fine-tuning BERT
Before an example is used for fine-tuning, the content analysis service runs an automated feature engineering process comprising two key phases: creating an example dictionary and tokenizing labels. In the first step, the email is transformed into a list of words and preprocessed. Subsequently, a new list is generated, mirroring the list of words and associating each word with its corresponding NER tag. Following this, the words are segmented into word chunks using the BERT tokenizer, with each chunk labeled either with the NER tag or with the value -100, signifying that the chunk does not carry a label.
Fig. 7. BERT tokenization which breaks down words into smaller subwords.
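For illustration, the following sketch shows this label-alignment step with the Hugging Face tokenizer; the example words and tag values are made up, and we assume the standard word_ids() alignment approach rather than the paper's exact implementation.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

words = ["Agon", "studies", "in", "Skopje"]
word_tags = [1, 0, 0, 2]  # illustrative: 1 = PERSON, 0 = O (no entity), 2 = LOCATION

encoding = tokenizer(words, is_split_into_words=True, truncation=True)
labels = []
previous_word = None
for word_id in encoding.word_ids():
    if word_id is None:             # special tokens such as [CLS] and [SEP]
        labels.append(-100)
    elif word_id != previous_word:  # the first sub-word chunk keeps the NER tag
        labels.append(word_tags[word_id])
    else:                           # remaining chunks get -100, ignored by the loss
        labels.append(-100)
    previous_word = word_id
encoding["labels"] = labels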
BERT [5], which stands for Bidirectional Encoder Representations from Transformers, is an NLP model introduced by researchers at Google. It comprises two pre-trained model variants, BERT-Base and BERT-Large. For our work, we exclusively utilize the BERT-Base model.
BERT is constructed as a stack of encoders from the Transformer architecture, in which each encoder block consists of a self-attention layer followed by a feed-forward neural network layer; the full Transformer additionally employs attention mechanisms on the decoder side, which BERT does not use. BERT-Base, specifically, features an encoder stack with 12 transformer blocks. Its architecture incorporates feed-forward networks with 768 hidden units, 12 attention heads, and a total of 110 million parameters. This architecture empowers BERT-Base with advanced language comprehension and processing capabilities.
The model's input begins with the [CLS] (classification) token, followed by a sequence of words. The [CLS] token serves as a distinct symbol placed at the beginning of each input example, while [SEP] functions as a specialized separator token. At each layer of the model, self-attention mechanisms are applied, and the outcomes are subsequently passed through a feed-forward network before being transferred to the next encoder.
Fig. 8. BERT consists of a stack of transformer encoder layers that process
the input text in a series of self-attention and feedforward neural network
layers.
The vector obtained through this training process can now
be used to accomplish various tasks, including classification,
named entity recognition, and more.
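As an illustration, a fine-tuned checkpoint saved by the content analysis service could be used for NER inference via the Hugging Face pipeline API; the checkpoint path below is hypothetical.

from transformers import pipeline

ner = pipeline("token-classification",
               model="./finetuned-mbert-ner",    # hypothetical local checkpoint path
               aggregation_strategy="simple")    # merge sub-word chunks into entities

entities = ner("Agon Osmani has a meeting in Skopje on Friday.")
for ent in entities:
    print(ent["entity_group"], ent["word"], ent["start"], ent["end"])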
IV. CONCLUSION
In this paper we presented a general workflow that enables us to extract the information of interest and obtain suggested actions from a given email. Given that email is one of the main communication channels for businesses, this opens a big opportunity to integrate the extracted information with the proprietary systems used in companies and institutions. In turn, this can boost employee productivity by reducing the time spent on processing emails and mitigating distractions during periods of focus.

Considering that BERT is an established pre-trained model, there is a high probability of successful performance after model training. In the next iteration of our work, we intend to record the model performance over several training epochs by measuring F1 scores, in an attempt to find the highest performing version of the model.

This tool holds potential for further development, considering the anticipated advancements in technology and Artificial Intelligence, e.g., automatic integration of the Gmail plugin with third-party systems, automation of logical responses, and automatic execution of requested tasks.
It’s worth noting that even though the email is processed
in the content analysis service, the content is never stored
in memory. Any personal information is made sure to be
discarded as soon as it gets processed. E.g. After the hash
of the mail content is extracted, or the model is fine-tuned
with an example. We use the hash representation and token
position in the text, rather than their raw form. This way we
are not prone to a data breaches and insider attacks.
REFERENCES
[1] D. A. Dillman, J. D. Smyth, and L. M. Christian, Internet, phone, mail,
and mixed-mode surveys: The tailored design method. John Wiley &
Sons, 2014.
[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” Advances in neural information processing
systems, vol. 27, 2014.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training
of deep bidirectional transformers for language understanding,” arXiv
preprint arXiv:1810.04805, 2018.
[6] T. Pires, E. Schlinger, and D. Garrette, “How multilingual is multilingual
bert?,” arXiv preprint arXiv:1906.01502, 2019.
[7] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., “HuggingFace's Transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019.
[8] J. Ye, X. Chen, N. Xu, C. Zu, Z. Shao, S. Liu, Y. Cui, Z. Zhou, C. Gong, Y. Shen, et al., “A comprehensive capability analysis of GPT-3 and GPT-3.5 series models,” arXiv preprint arXiv:2303.10420, 2023.
[9] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[10] T. Sahmoud and D. M. Mikki, “Spam detection using bert,” arXiv
preprint arXiv:2206.02443, 2022.
[11] B. Klimt and Y. Yang, “The enron corpus: A new dataset for email
classification research,” in European conference on machine learning,
pp. 217–226, Springer, 2004.
[12] S. Salloum, T. Gaber, S. Vadera, and K. Shaalan, “Phishing email
detection using natural language processing techniques: a literature
survey,” Procedia Computer Science, vol. 189, pp. 19–28, 2021.
[13] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32,
2001.
[14] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, 1998.
[15] M. Hiransha, N. A. Unnithan, R. Vinayakumar, K. Soman, and A. Verma, “Deep learning based phishing e-mail detection,” in Proc. 1st AntiPhishing Shared Pilot, 4th ACM Int. Workshop Secur. Privacy Anal. (IWSPA), pp. 1–5, Tempe, AZ, USA, 2018.
[16] L. F. Rau, “Extracting company names from text,” in Proceedings of the Seventh IEEE Conference on Artificial Intelligence Applications, pp. 29–30, IEEE Computer Society, 1991.
[17] A. Mikheev, M. Moens, and C. Grover, “Named entity recognition
without gazetteers,” in Ninth Conference of the European Chapter of
the Association for Computational Linguistics, pp. 1–8, 1999.
[18] E. Minkov, R. C. Wang, and W. Cohen, “Extracting personal names
from email: Applying named entity recognition to informal text,” in
Proceedings of human language technology conference and conference
on empirical methods in natural language processing, pp. 443–450,
2005.
[19] G. D. Venolia and C. Neustaedter, “Understanding sequence and reply relationships within email conversations: a mixed-model visualization,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 361–368, 2003.
[20] E. J. Horvitz, A. Jacobs, and D. Hovel, “Attention-sensitive alerting,” arXiv preprint arXiv:1301.6707, 2013.
[21] J. Ferreira, Google Apps Script: Web Application Development Essentials. O'Reilly Media, Inc., 2014.
[22] European Parliament and Council of the European Union, “Regulation
(EU) 2016/679 of the European Parliament and of the Council.”
[23] D. Vrandečić and M. Krötzsch, “Wikidata: a free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014.
[24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[25] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of
gated recurrent neural networks on sequence modeling,” arXiv preprint
arXiv:1412.3555, 2014.