Semantic Sky: A Gmail Plugin for Email
Classification and Annotation
Agon Osmani
Faculty of Computer Science and Engineering
Skopje, North Macedonia
agon.osmani.2@students.finki.ukim.mk
Milos Jovanovik
Faculty of Computer Science and Engineering
Skopje, North Macedonia
milos.jovanovik@finki.ukim.mk
Sasho Gramatikov
Faculty of Computer Science and Engineering
Skopje, North Macedonia
sasho.gramatikov@finki.ukim.mk
Riste Stojanov
Faculty of Computer Science and Engineering
Skopje, North Macedonia
riste.stojanov@finki.ukim.mk
Abstract—We introduce a novel tool designed to simplify email management with the use of cutting-edge machine learning frameworks for natural language processing. Handling emails can often prove to be a challenging task, especially when faced with a substantial volume of emails that require immediate action from the recipient. Our vision is to develop a tool that accelerates the process of taking the necessary actions directly within the email interface, eliminating the need to switch between multiple windows or platforms. Our approach uses a text classification model that identifies the intent of each email and a named entity recognition model that identifies key tokens or information within the email that may assist the user. After the plugin determines the appropriate action for the recipient, it establishes connections with relevant systems and facilitates the exchange of necessary information. The proposed tool simplifies email management and accelerates the completion of tasks within the email interface: it highlights the key information within the email and the email's intended purpose, and expedites the process of information sharing. This tool improves the handling of emails and gives users the means to complete their tasks efficiently.
I. INTRODUCTION
In our professional careers, emails often play a crucial role. This is especially the case when we interact with parties outside of our company. Emails can have two purposes: (1) to spread information, or (2) to request an action from us. In this paper we focus on the second type of emails. In today's dynamic environment, when everyone is on the move, smart mobile devices are usually the primary way of consuming emails [1]. However, replying from our mobile devices can be tricky. Additionally, forgetting to process an email that was read on the move is very common. Even though mail providers try to solve this problem by analysing the content of the email with machine learning models and then suggesting a few suitable quick replies, this does not cover the completion of the action requested in the message. These actions are often domain specific and require interaction with some of the company's systems.
The volume of emails exchanged at a company level grows over time, and it is generally hard for an individual to keep track of all the information stored within them. In cases where we do not have the capacity to process such amounts of data manually, we try to invent tools that might help us. Natural language processing (NLP) tools allow us to perform intelligent knowledge extraction (KE) and use this knowledge to execute the actions of interest.
Named entity recognition (NER), as a key component of NLP systems for annotating entities with their corresponding classes, enriches the semantic context of words by adding hierarchical identification. There is currently a lot of new work in this field, especially in optimizing neural networks for sequence labeling, and these systems outperform early NER systems based on domain dictionaries, lexicons, orthographic feature extraction and semantic rules. Starting with [2], neural network NER systems with minimal feature engineering have become popular due to the performance they achieve.
Sequence-to-Sequence (Seq2Seq) architectures [3] first introduced the powerful ability to transform a given sequence of text elements into another sequence, a concept which fits machine translation well. Transformers [4] are models which implement the Seq2Seq architecture using an encoder-decoder structure. Google's BERT [5] is based on the transformer architecture and integrates an attention mechanism [4]. It produces outstanding results on many NLP tasks, including NER and text classification, due to its ability to learn contextual relations between words (or sub-words) in a text, making it applicable in any domain. Multilingual BERT [6], as a single language model, is remarkable at cross-lingual model transfer, in which annotations in one language can be used to generalize the model to another language.
Transfer learning, as a machine learning method, provides the concept of re-usability in neural networks, where a model developed for one task can be reused as the starting point for training on another problem that has a significantly smaller training set. In recent years, transfer learning has become one of the most popular approaches, since it outperforms state-of-the-art models in many use-cases, and does so using smaller training sets for fine-tuning and far fewer computational resources. Hugging Face [7] has been a pioneering force in the realm of transfer learning, offering a versatile platform and pre-trained models that empower developers to efficiently adapt and fine-tune models for a wide array of natural language processing tasks.
II. RELATED WORK
In this section we review related efforts that have employed techniques similar to ours. These include email classification using BERT, the use of NLP techniques to enhance email management workflows, Named Entity Recognition (NER) within emails, and the creation of Gmail plugins aimed at enhancing email processing.
Large language models (LLMs) are usually Transformer-based language models with billions of parameters, trained on extensive text datasets. LLMs demonstrate remarkable abilities in comprehending natural language and tackling intricate tasks, particularly through text generation. GPT-3 and GPT-3.5 [8] are iterations of language models created by OpenAI that produce human-like natural language text. Initially, the davinci model [8], with 175 billion parameters, served as the foundation for the GPT-3 series. GPT-3.5-turbo [8] is optimized for chat applications, offering greater capabilities compared to text-davinci-003. LLaMA [9] offers models ranging from 7 billion to 65 billion parameters, exhibiting competitive performance compared to leading large language models. Notably, LLaMA-13B is competitive with GPT-3 on most benchmarks despite being ten times smaller.
In the field of email classification, distinct from our focus on extracting call-to-action elements from emails, there have been attempts that use the BERT transformer to discern whether an email is spam or not.
In [10], the BERT model was used for the purpose of spam detection in emails. This research utilized the BERT base model, which comprises a stack of 12 encoders, to construct a high-performance spam detector. The model was trained on diverse datasets and used the BERT tokenizer to segment email sentences into chunks of words and feed them to the model, achieving remarkable F1 scores of 98.62%, 97.83%, and 99.13% across three different email corpora: the Enron corpus [11], the SpamAssassin corpus¹, and the Ling-Spam corpus², respectively.
In [12], various NLP techniques used for phishing detection in emails are discussed. The paper reviews a wide array of methods, including machine learning strategies like Random Forests [13] and Support Vector Machines [14], as well as bag-of-words and Dynamic Markov Chain features. Additionally, the survey references several notable works in this domain, with particular mention of [15], where a phishing detection model was constructed using Keras, word embeddings, and convolutional neural networks (CNN). This model achieved an accuracy of 96.8% on the test dataset.

¹ https://www.kaggle.com/datasets/ganiyuolalekan/spam-assassin-email-classification-dataset?resource=download
² https://www.kaggle.com/datasets/mandygu/lingspam-dataset
In the field of named entity extraction, Lisa F. Rau [16] pioneered a system for extracting and recognizing company names, relying on heuristic techniques and manually crafted rules. In an earlier paper dating back to 1999 [17], during a time when dictionaries and lists of people, organizations, and locations posed a bottleneck for Named Entity Recognition, the authors aimed to address this limitation. Their model incorporated contextual information related to named entities, including their sentence position, lowercase usage, and presence in the document. This approach achieved a combined precision and recall score of 93.39%.
In [18], Conditional Random Fields (CRF) were applied to label sequences of examples, specifically in the context of extracting personal names from emails. The approach included three distinct types of word features: basic features, dictionary features, and email structure features. These features were designed to identify patterns such as capitalization, the presence of common words or first names from dictionaries, and other indicators, including token matching within the “from” field or across the email header. During evaluation, the model demonstrated impressive performance, achieving an F1 score of 91.9% when utilizing all of these features.
Several research papers have explored the automation of email processing to streamline tasks and enhance email management. For instance, in [19], the authors developed thread visualization techniques to handle complex email conversations with accumulated messages. Meanwhile, [20] directed their attention toward an automated attention manager, a tool designed to assist computer users in efficiently handling notifications. Their approach centers on the automatic evaluation of message value and the continuous inference of a user's attention, using Bayesian models to estimate the probability distribution of the user's focus.
III. SYSTEM ARCHITECTURE
A. Google Apps Script
Google Apps Script [21] is an application development platform that expedites the creation of business applications that integrate with G Suite. This web-based coding platform enables users to create plugins for all the products that Google offers. Within the framework of this project, we use Apps Script to create a plugin which is integrated with our custom API service.
In this paper, we introduce an application deployed as an Apps Script project. Once deployed, the application can be accessed within Gmail's right-side panel, as depicted in Figure 1.
Upon clicking the plugin icon, a right-side panel unfolds. The plugin operates on individual emails; if no email is selected, the panel remains empty. When the user opens an email of interest, the panel presents an “Analyze” button (Figure 2).
Fig. 1. Finding our plugin application in Gmail's right-side panel.
Fig. 2. Analyze button for user consent to analyze the email.
By clicking this button, the user consents to transmitting the
contents of the current email to the content analysis service
while ensuring the data’s protection in accordance with GDPR
guidelines [22]. It’s important to note that in this scenario,
we do not retain the data within the content analysis service;
instead, we store its representation obtained by the model.
Once the email is transmitted to the content analysis service, we employ a fine-tuned transformer model to identify some of the most frequently occurring named entities within the email, including individuals and locations. In this process, we start with Multilingual BERT [6], fine-tuned specifically for Named Entity Recognition (NER) using the Wikidata corpus [23]. In the current iteration, our emphasis lies on the Macedonian language, and therefore we have fine-tuned the model exclusively on the Macedonian section of Wikidata. However, we intend to extend this fine-tuning methodology to cover all languages in subsequent iterations.
The panel is refreshed, presenting a new layout comprising several distinct sections. The initial section of the panel exhibits lists of information detected within the email, encompassing individual names, locations, and university subjects (as depicted in Figure 3). These detected pieces of information are categorized as tokens within the text, employing the Named Entity Recognition (NER) technique.
Fig. 3. The plugin panel displays token information captured during email
analysis.
The content analysis service utilizes the fine-tuned model to extract the entities of interest. It is crucial to acknowledge that certain named entities might not be present or accurately labeled within the fine-tuning dataset, and that the model is initially not infallible. Therefore, we intentionally designed the platform to be extensible from the outset, by having it provide these results to the end user for verification and correction.
To facilitate the initial development of training data, especially when introducing new named entities, we grant users the capability to define regular expressions that identify named entities within the text. Consequently, the content analysis service combines the results predicted by the model with those obtained through regular expression matches. Figure 3 provides an illustration of the outcomes presented to the user. As depicted, users have the option to select certain candidate tokens via checkboxes and subsequently remove them if they are identified as false positives. Furthermore, for instances where false negatives are identified, each named entity provides an input field at its beginning. Users can input text fragments that were not recognized in the text. It is important to mention that we have implemented validation to ensure that only valid text parts can be added as new named entities.
Once all modifications for a particular email have been made, if any, the user can fine-tune the model with the newly annotated data by clicking the “Fine Tune” button. During this process, training is conducted using a single example to adapt the model to the freshly annotated insights provided by the end user.
Fig. 4. The section in the plugin panel for articulating the necessary action.
The second section of the side panel pertains to specifying
the intended action to be taken, as illustrated in Figure 4.
In this form, users have the opportunity to articulate their
goals concerning the email’s content by describing the action
they wish to take after reading the currently opened email.
Examples of such actions, in the context of an academic institution and from a professor's standpoint, include adding a grade to a system, reviewing homework, and scheduling a meeting with a student, among numerous others.
Within this section, users are presented with a set of predefined actions, and they have the option to select one of these suggestions as their intended action. Additionally, users can manually input an alternative action if it is not present in the predefined options. This form is equipped with another fine-tuned multilingual BERT model that offers action suggestions. However, in this scenario, we employ text classification, utilizing the email's content as input and extracting the desired actions as output. These suggested actions can be customized, and the model can be fine-tuned to enhance its future performance.
B. Content analysis service
The core technology integrated with the plugin is a Flask application³, written in Python 3.10. The Gmail plugin communicates with this application via HTTP GET and POST requests to access the required data from the content analysis service. This architecture creates a composite application with distinct components: the user interface, email processing and model training, each separated into an individual system.
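For illustration, a minimal sketch of what such a service endpoint could look like is given below; the endpoint name, payload fields, port, and the choice of SHA-256 are our assumptions, since the paper does not publish the service's API.

import hashlib
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/analyze", methods=["POST"])
def analyze():
    payload = request.get_json()
    sender, content = payload["sender"], payload["content"]  # hypothetical field names
    # Only a hash of sender + content is kept; SHA-256 is an assumption here.
    mail_hash = hashlib.sha256((sender + content).encode("utf-8")).hexdigest()
    # The NER model and regex matchers would run over `content` at this point,
    # storing tokens under `mail_hash` and discarding the raw text.
    tokens = []  # e.g. [{"type": "PERSON", "start": 31, "length": 11}]
    return jsonify({"hash": mail_hash, "tokens": tokens})

if __name__ == "__main__":
    app.run(port=5000)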
Emails serve as the central component of the system and
are encapsulated by the Mail model. The primary attribute of
this model is the hash generated from the email’s sender and
content. Importantly, the content itself is not stored to mitigate
privacy concerns.
Each email is associated with a single action (goal) and
multiple named entity tokens. Each Token is characterized
by its named entity type and its position within the message
text of the mail. This design allows the content, along with
the tokens, to remain in memory briefly within the content
analysis service before being discarded. However, the email’s
hash is unique, enabling the analysis to retrieve the Action
and Tokens from the database when the same email is opened
again. Subsequently, the stored position of the token in the text
can be used to extract the textual value for each Token. This
approach ensures user privacy is upheld while maintaining the
platform’s core functionality.
Within the application's service layer resides a collection of algorithms designed to facilitate both manual and predictive extraction of tokens from emails. These algorithms determine the placement of the tokens within the text and handle their storage in the database, alongside the assignment of the corresponding action to the mail.
We employ Python code with the following expression:

matches = re.finditer(token_regex, text)
This allows us to utilize a regular expression tailored to the token type, seeking and identifying all the corresponding matches within the text. We proceed by iterating through these matches, capturing the start position of each token. Subsequently, we assign the token's foreign key, linking it to the hash of the email currently undergoing processing, and store this information in the database. Following the identification of the tokens, we also obtain the NER tokens generated by the BERT model, in order to return a union of the tokens to the Gmail plugin interface.
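Putting the pieces together, a sketch of this extraction step could look as follows; the token_regex pattern, the PERSON example, and the (type, start, length) tuple representation are illustrative only, not the service's exact implementation.

import re

def extract_regex_tokens(text: str, entity_type: str, token_regex: str) -> list[tuple[str, int, int]]:
    tokens = []
    for match in re.finditer(token_regex, text):
        # Capture only the entity type and position, not the matched text itself.
        tokens.append((entity_type, match.start(), match.end() - match.start()))
    return tokens

# Example: a deliberately naive pattern for capitalized two-word name candidates.
text = "Please schedule a meeting with Agon Osmani next week."
person_tokens = extract_regex_tokens(text, "PERSON", r"[A-Z][a-z]+ [A-Z][a-z]+")
# The service then takes the union of these tokens with the NER tokens
# predicted by the BERT model before replying to the Gmail plugin.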
A similar procedure is applied when users manually add or
exclude tokens. Utilizing the same code, a search is conducted
to identify all instances of the token, extract their starting
positions, and employ a verification process.
³ https://flask.palletsprojects.com/en/3.0.x/
The exclusion of tokens proves particularly valuable for handling false positives generated by the BERT model. When a token is excluded, the algorithm flags it as incorrect in the database, ensuring that it does not appear in the interface results. Crucially, however, the information that the token is a false positive is preserved in the database, which enables the computation of more accurate metrics during evaluation.
C. The Transformer Architecture
Recurrent neural networks, including long short-term memory (LSTM) [24] and gated recurrent neural networks (GRU) [25], are models used for solving NLP tasks such as machine translation and sentiment analysis. These models generate a sequence of hidden states h_t, each computed from the previous hidden state h_{t-1} and the input at position t. However, their sequential nature prevents parallelization within training examples, especially with longer sequences, due to memory constraints. In contrast, the Transformer architecture [4] omits recurrence and relies solely on an attention mechanism to capture global dependencies between input and output. This design allows for much greater parallelization, overcoming the limitations of sequential processing in recurrent models.
Fig. 5. Transformer encoder-decoder architecture.
The Transformer adopts an encoder-decoder architecture
(Figure 5). In this setup, the encoder takes an input sequence
of symbol representations and maps it to a sequence of
continuous representations.
The encoder comprises a stack of identical layers, each
consisting of two sub-layers. The first sub-layer is a multi-
head self-attention mechanism, while the second sub-layer is
a straightforward fully connected feed-forward network.
Given that this model lacks recurrence, “positional encodings” are integrated into the input embeddings at the base of the encoder, in order to incorporate information about the position of the tokens in the text. Sine and cosine functions of varying frequencies are employed to generate the positional encodings. Specifically, the positional encodings are defined as follows:
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
Here, pos represents the position and i denotes the dimension. Each dimension of the positional encoding corresponds to a sinusoid.
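For illustration, a short NumPy sketch of these sinusoidal encodings (the max_len and d_model values are arbitrary examples):

import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    # pos: (max_len, 1) column of positions; i: (1, d_model/2) row of dimension indices.
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=128, d_model=768)  # 768 is the BERT-Base hidden size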
The attention function can be defined as a process that maps a query and a collection of key-value pairs to an output. The output is determined by calculating a weighted sum of the values, where the weight assigned to each value is determined by a compatibility function of the query with the corresponding key. We calculate the dot products of the query with all keys, scale each by 1/\sqrt{d_k}, and apply a softmax function to derive the weights on the values:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
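A NumPy sketch of this scaled dot-product attention, for illustration:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled compatibility of queries with keys
    return softmax(scores) @ V       # weighted sum of the values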
Multi-head attention (Figure 6) allows the model to focus on different parts of the input data (such as words in a sentence) from different perspectives simultaneously. This helps capture complex relationships within the data. In multi-head attention, the attention mechanism is replicated multiple times in parallel, and each head learns different relationships within the input data. The queries, keys, and values are linearly transformed using learned weight matrices (W_i^Q, W_i^K, W_i^V) specific to each head.
The multi-head attention computation can be described as
follows:
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O

where

\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)
Fig. 6. Multi-head attention mechanism.
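A sketch of this computation is given below, reusing the attention function from the previous sketch; the random matrices are stand-ins for the learned projections W_i^Q, W_i^K, W_i^V, W^O.

import numpy as np

def multi_head_attention(Q, K, V, num_heads, d_model):
    d_k = d_model // num_heads
    rng = np.random.default_rng(0)  # random stand-ins for learned weights
    heads = []
    for _ in range(num_heads):
        WQ, WK, WV = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(attention(Q @ WQ, K @ WK, V @ WV))   # head_i
    WO = rng.standard_normal((num_heads * d_k, d_model))  # output projection W^O
    return np.concatenate(heads, axis=-1) @ WO

x = np.random.default_rng(1).standard_normal((10, 768))  # 10 tokens, d_model = 768
out = multi_head_attention(x, x, x, num_heads=12, d_model=768)  # self-attention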
D. Fine-tuning BERT
Before an example is used for fine-tuning, the content analysis service runs an automated feature engineering process comprising two key phases: creating an example dictionary and tokenizing labels. In the first step, the email is transformed into a list of words and preprocessed. Subsequently, a new list is generated, mirroring the list of words and associating each word with its corresponding NER tag. Following this, the words are segmented into word chunks using the BERT tokenizer, with each chunk labeled either with the NER tag or with the value -100, signifying that the chunk does not carry a label.
Fig. 7. BERT tokenization which breaks down words into smaller subwords.
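For illustration, the following sketch shows this label-alignment step with the Hugging Face tokenizer; the example words and tag values are made up, and we assume the standard word_ids() alignment approach rather than the paper's exact implementation.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

words = ["Agon", "studies", "in", "Skopje"]
word_tags = [1, 0, 0, 2]  # illustrative: 1 = PERSON, 0 = O (no entity), 2 = LOCATION

encoding = tokenizer(words, is_split_into_words=True, truncation=True)
labels = []
previous_word = None
for word_id in encoding.word_ids():
    if word_id is None:             # special tokens such as [CLS] and [SEP]
        labels.append(-100)
    elif word_id != previous_word:  # the first sub-word chunk keeps the NER tag
        labels.append(word_tags[word_id])
    else:                           # remaining chunks get -100, ignored by the loss
        labels.append(-100)
    previous_word = word_id
encoding["labels"] = labels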
BERT [5], which stands for Bidirectional Encoder Representations from Transformers, is an NLP model introduced by researchers at Google. It comprises two pre-trained model variants, BERT-Base and BERT-Large. For our work, we exclusively utilize the BERT-Base model.
BERT is constructed as a stack of encoders from the Transformer architecture, in which each encoder block consists of a self-attention layer followed by a feed-forward neural network layer; the full Transformer additionally employs attention mechanisms on the decoder side, which BERT does not use. BERT-Base, specifically, features an encoder stack with 12 transformer blocks. Its architecture incorporates feed-forward networks with 768 hidden units, 12 attention heads, and a total of 110 million parameters. This architecture empowers BERT-Base with advanced language comprehension and processing capabilities.
The model's input begins with the [CLS] (classification) token, followed by a sequence of words. The [CLS] token serves as a distinct symbol placed at the beginning of each input example, while [SEP] functions as a specialized separator token. At each layer of the model, self-attention mechanisms are applied, and the outcomes are subsequently passed through a feed-forward network before being transferred to the next encoder.
Fig. 8. BERT consists of a stack of transformer encoder layers that process
the input text in a series of self-attention and feedforward neural network
layers.
The vector obtained through this training process can now
be used to accomplish various tasks, including classification,
named entity recognition, and more.
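As an illustration, a fine-tuned checkpoint saved by the content analysis service could be used for NER inference via the Hugging Face pipeline API; the checkpoint path below is hypothetical.

from transformers import pipeline

ner = pipeline("token-classification",
               model="./finetuned-mbert-ner",    # hypothetical local checkpoint path
               aggregation_strategy="simple")    # merge sub-word chunks into entities

entities = ner("Agon Osmani has a meeting in Skopje on Friday.")
for ent in entities:
    print(ent["entity_group"], ent["word"], ent["start"], ent["end"])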
IV. CONCLUSION
In this paper we presented a general workflow that enables us to extract the information of interest and obtain suggested actions from a given email. Given that email is one of the main communication channels for businesses, this opens a big opportunity to integrate the extracted information with the proprietary systems used in companies and institutions. In turn, this can boost employee productivity by reducing the time spent on processing emails and mitigating distractions during periods of focus.

Considering that BERT is an established pre-trained model, there is a high probability of successful performance after model training. In the next iteration of our work, we intend to record the model performance over several training epochs by measuring F1 scores, in an attempt to find the highest performing version of the model.

This tool holds potential for further development, considering the anticipated advancements in technology and Artificial Intelligence, e.g., automatic integration of the Gmail plugin with third-party systems, automation of logical responses, and automatic execution of requested tasks.
It’s worth noting that even though the email is processed
in the content analysis service, the content is never stored
in memory. Any personal information is made sure to be
discarded as soon as it gets processed. E.g. After the hash
of the mail content is extracted, or the model is fine-tuned
with an example. We use the hash representation and token
position in the text, rather than their raw form. This way we
are not prone to a data breaches and insider attacks.
REFERENCES
[1] D. A. Dillman, J. D. Smyth, and L. M. Christian, Internet, phone, mail,
and mixed-mode surveys: The tailored design method. John Wiley &
Sons, 2014.
[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” Advances in neural information processing
systems, vol. 27, 2014.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training
of deep bidirectional transformers for language understanding,” arXiv
preprint arXiv:1810.04805, 2018.
[6] T. Pires, E. Schlinger, and D. Garrette, “How multilingual is multilingual
bert?,” arXiv preprint arXiv:1906.01502, 2019.
[7] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., “HuggingFace's Transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019.
[8] J. Ye, X. Chen, N. Xu, C. Zu, Z. Shao, S. Liu, Y. Cui, Z. Zhou, C. Gong, Y. Shen, et al., “A comprehensive capability analysis of GPT-3 and GPT-3.5 series models,” arXiv preprint arXiv:2303.10420, 2023.
[9] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[10] T. Sahmoud and D. M. Mikki, “Spam detection using bert,” arXiv
preprint arXiv:2206.02443, 2022.
[11] B. Klimt and Y. Yang, “The enron corpus: A new dataset for email
classification research,” in European conference on machine learning,
pp. 217–226, Springer, 2004.
[12] S. Salloum, T. Gaber, S. Vadera, and K. Shaalan, “Phishing email
detection using natural language processing techniques: a literature
survey,” Procedia Computer Science, vol. 189, pp. 19–28, 2021.
[13] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32,
2001.
[14] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, 1998.
[15] M. Hiransha, N. A. Unnithan, R. Vinayakumar, K. Soman, and A. Verma, “Deep learning based phishing e-mail detection,” in Proc. 1st AntiPhishing Shared Pilot, 4th ACM Int. Workshop Secur. Privacy Anal. (IWSPA), pp. 1–5, Tempe, AZ, USA, 2018.
[16] L. F. Rau, “Extracting company names from text,” in Proceedings of the Seventh IEEE Conference on Artificial Intelligence Applications, pp. 29–30, IEEE Computer Society, 1991.
[17] A. Mikheev, M. Moens, and C. Grover, “Named entity recognition
without gazetteers,” in Ninth Conference of the European Chapter of
the Association for Computational Linguistics, pp. 1–8, 1999.
[18] E. Minkov, R. C. Wang, and W. Cohen, “Extracting personal names
from email: Applying named entity recognition to informal text,” in
Proceedings of human language technology conference and conference
on empirical methods in natural language processing, pp. 443–450,
2005.
[19] G. D. Venolia and C. Neustaedter, “Understanding sequence and reply relationships within email conversations: a mixed-model visualization,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 361–368, 2003.
[20] E. J. Horvitz, A. Jacobs, and D. Hovel, “Attention-sensitive alerting,” arXiv preprint arXiv:1301.6707, 2013.
[21] J. Ferreira, Google Apps Script: Web Application Development Essentials. O'Reilly Media, Inc., 2014.
[22] European Parliament and Council of the European Union, “Regulation
(EU) 2016/679 of the European Parliament and of the Council.”
[23] D. Vrandečić and M. Krötzsch, “Wikidata: a free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014.
[24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[25] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of
gated recurrent neural networks on sequence modeling,” arXiv preprint
arXiv:1412.3555, 2014.