Conference Paper

LSTM Based Self-Defending AI Chatbot Providing Anti-Phishing

Authors: Kovalluri et al. (2018)

Abstract

Email services have to put in a lot of effort to fight spam emails. Most of this effort goes into detecting and filtering spam emails from benign ones. On the other front, banks and other organizations educate people to be aware of such emails. These approaches are essentially passive ways of countering spam attacks, in which the attacker is never directly engaged. Despite all these efforts, many innocent people fall for such attacks, leading them to share their account details or lose large sums of money. We propose an AI-based system, self-aware and self-defending, which sends coherent replies to spammers with the aim of consuming their time. To make things more difficult for spammers, we reply from algorithmically generated mail servers. Also, to keep spammers from filtering out our replies by simple matching, we make the replies differ from one another and look genuine, using a language model trained with an LSTM to form sentences in natural language depending on the context of the email.
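The abstract leaves the generation mechanism at a high level. Below is a minimal, hypothetical sketch (PyTorch; not the authors' code) of how an LSTM language model can produce varied replies: sampling with a temperature makes each response different, which is what defeats the simple match filtering mentioned above. The model names, sizes and omitted training loop are all assumptions.

```python
# Hedged sketch, not the paper's implementation: a token-level LSTM
# language model whose sampled outputs differ on every call.
import torch
import torch.nn as nn

class ReplyLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), state

def sample_reply(model, start_ids, length=200, temperature=0.8):
    """Sample token by token; temperature > 0 keeps replies varied."""
    model.eval()
    ids = list(start_ids)
    x = torch.tensor([ids])
    state = None
    with torch.no_grad():
        for _ in range(length):
            logits, state = model(x, state)
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            nxt = torch.multinomial(probs, 1).item()
            ids.append(nxt)
            x = torch.tensor([[nxt]])
    return ids
```

Conditioning on the incoming mail (e.g., by seeding `start_ids` with its encoded context) is what would make the replies context-dependent, as the abstract describes.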


... Among the available methods, spam filters are the most commonly employed, typically built on manually developed keyword patterns and blacklists of known spammers. In addition, ML algorithms such as k-NN, NB classifiers and SVMs are employed, trained on spam e-mails and compared against benign e-mails (Kovalluri et al., 2018). Researchers are also interested in evaluating other ML algorithms for filtering spam, including advanced SVMs and memory-based systems or rough sets (RS). ...
... The drawback of the method was the false positive (FP) errors it generated, and a security mechanism was needed to prevent them. Kovalluri et al. (2018) modeled an artificial intelligence (AI)-based system using a long short-term memory (LSTM)-based language model, which reduced the use of spam for stealing and propagation and made the victims hard to track. The demerit of the method was that the LSTM model makes some mistakes while generating sentences. ...
Article
Full-text available
Purpose: Phishing is a serious cybersecurity problem, widely conducted through multimedia such as e-mail and Short Messaging Service (SMS) to collect the personal information of individuals. However, the rapid growth of unsolicited and unwanted information needs to be addressed, raising the necessity of effective anti-phishing methods. Design/methodology/approach: The primary intention of this research is to design and develop an approach for preventing phishing by proposing an optimization algorithm. The proposed approach involves four steps, namely preprocessing, feature extraction, feature selection and classification, for dealing with phishing e-mails. Initially, the input data set is subjected to preprocessing, which removes stop words and performs stemming, and the preprocessed output is given to the feature extraction process. By extracting keyword frequencies from the preprocessed data, the important words are selected as features. Then, feature selection is carried out using the Bhattacharya distance such that only the significant features that can aid the classification are selected. Using the selected features, classification is done using a deep belief network (DBN) that is trained with the proposed fractional earthworm optimization algorithm (EWA). The proposed fractional EWA is designed by integrating EWA with fractional calculus to determine the weights in the DBN optimally. Findings: The accuracy of the methods naive Bayes (NB), DBN, neural network (NN), EWA-DBN and fractional EWA-DBN is 0.5333, 0.5455, 0.5556, 0.5714 and 0.8571, respectively. The sensitivity of the methods NB, DBN, NN, EWA-DBN and fractional EWA-DBN is 0.4558, 0.5631, 0.7035, 0.7045 and 0.8182, respectively. Likewise, the specificity of the methods NB, DBN, NN, EWA-DBN and fractional EWA-DBN is 0.5052, 0.5631, 0.7028, 0.7040 and 0.8800, respectively. It is clear from the comparison that the proposed method achieves the maximal accuracy, sensitivity and specificity compared with the existing methods. Originality/value: E-mail phishing detection is performed in this paper using optimization-based deep learning networks. The e-mails include a number of unwanted messages that are to be detected in order to avoid storage issues. The importance of the method is that the inclusion of historical data in the detection process enhances the accuracy of detection.
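As a concrete illustration of the feature-selection step described above, here is a hedged NumPy sketch of ranking features by the Bhattacharyya distance between the phishing and legitimate classes. The histogram binning and top-k cutoff are my assumptions, not the paper's settings.

```python
# Illustrative sketch only: score each keyword-frequency feature by how
# well it separates the two classes, then keep the top-k features.
import numpy as np

def bhattacharyya_distance(x_pos, x_neg, bins=10):
    lo = min(x_pos.min(), x_neg.min())
    hi = max(x_pos.max(), x_neg.max()) + 1e-9   # avoid a zero-width range
    p, _ = np.histogram(x_pos, bins=bins, range=(lo, hi))
    q, _ = np.histogram(x_neg, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()
    bc = np.sum(np.sqrt(p * q))                 # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))              # larger = more separable

def select_features(X, y, k=100):
    """X: (n_samples, n_features); y: 1 = phishing, 0 = legitimate."""
    scores = [bhattacharyya_distance(X[y == 1, j], X[y == 0, j])
              for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]          # indices of top-k features
```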
... With the increasing diffusion of the Internet, the impact of threats delivered via mail is now very relevant, both considering the economic losses for the victims and the effort dedicated to detecting harmful messages or attachments [15]. As of today, the overall fraction of mails supporting frauds and criminal activities is up to 90% of the total exchanged volume, and this trend is expected to grow in the near future [3,8,19]. Therefore, mitigating the impact of malicious and unwanted mails is a crucial activity, not only limited to human aspects but also to prevent waste of resources (e.g., bandwidth and storage of mail servers). ...
... To the best of our knowledge, previous techniques for mitigating the impact of scam attempts via mail do not consider the use of generative AI-based schemes to engage scammers. The only notable exception is [19], although it adopts a long short-term memory approach to generate basic questions and consume the time of the attacker. Concerning AI techniques to implement spam/scam countermeasures, they have been primarily used to automatically inspect various parts of a message in order to detect spam or phishing mails, i.e., for classification purposes. ...
Preprint
Full-text available
The use of Artificial Intelligence (AI) to support cybersecurity operations is now a consolidated practice, e.g., to detect malicious code or configure traffic filtering policies. The recent surge of AI, generative techniques and frameworks with efficient natural language processing capabilities dramatically magnifies the number of possible applications aimed at increasing the security of the Internet. Specifically, the ability of ChatGPT to produce textual contents while mimicking realistic human interactions can be used to mitigate the plague of emails containing scams. Therefore, this paper investigates the use of AI to engage scammers in automated and pointless communications, with the goal of wasting both their time and resources. Preliminary results showcase that ChatGPT is able to decoy scammers, thus confirming that AI is an effective tool to counteract threats delivered via mail. In addition, we highlight the multitude of implications and open research questions to be addressed in the perspective of the ubiquitous adoption of AI.
... The investigation [19] suggests an AI-based, self-aware, self-defending system that delivers cogent responses. It sends responses from algorithmically generated mail servers to make it harder for spammers. ...
... (3) Self-Defending Systems: Self-defending systems powered by artificial intelligence can identify threats and respond to them. Some intelligence-based self-defending systems may take the form of self-defending chatbots (Kovalluri et al., 2018) and self-defending security software (Kerivan & Brothers, 2006). Self-defending systems may also be divided into two categories (Yuan et al., 2014): (a) Self-Configuring Systems: self-configuring systems enable a system to automatically adapt its configuration of components without any human intervention. ...
Article
Full-text available
Intelligence has been defined in many ways like logic, awareness, reasoning, critical thinking, etc. Many researchers insist on the possibility of a Technological Singularity shortly, which may see machines gaining intelligence similar to, or greater than humans. While many researchers believe that Technological Singularity is at an arm’s length, many counter-question the possibility of the same due to the lack of concrete evidence. Recently Cybersecurity has manoeuvred its way through technology to become one of the most rapidly advancing fields. Artificial Intelligence introduced to Cybersecurity has seen a tremendous increase in the number of systems that are capable of performing tasks faster and better than humans. This has led us to believe that there is intelligence in cyberspace along with the possibility of Cyber Singularity. We emphasise the intelligence of systems using a set of characteristics that insist on how sophisticated the systems have become over time that might lead to Cyber Singularity. We map these characteristics to the characteristics of living species with the hope of locating intelligence in the biomedical domain and further, try to identify systems displaying such characteristics in cyberspace. Keeping in mind the concept of technological singularity proposed before, we also perform an extensive survey of the past research works related to the field and also, use the concepts of set theory to reinforce the possibility of Cyber Singularity in the coming years.
... A chatbot for eCommerce that responds to user queries from a predefined data set of frequently asked questions is discussed in [18]. In [19], a chatbot that filters out spam mails by predefined category and replies to spam mails in natural language, based on context, using an LSTM model is discussed. Plutchik [20], similar to ChEMBL Bot, has the ability to search medical databases offered through NCBI. ...
Conference Paper
Full-text available
Moving from limited-domain natural language generation (NLG) to open domain is difficult because the number of semantic input combinations grows exponentially with the number of domains. Therefore, it is important to leverage existing resources and exploit similarities between domains to facilitate domain adaptation. In this paper, we propose a procedure to train multi-domain, Recurrent Neural Network-based (RNN) language generators via multiple adaptation steps. In this procedure, a model is first trained on counterfeited data synthesised from an out-of-domain dataset, and then fine-tuned on a small set of in-domain utterances with a discriminative objective function. Corpus-based evaluation results show that the proposed procedure can achieve competitive performance in terms of BLEU score and slot error rate while significantly reducing the data needed to train generators in new, unseen domains. In subjective testing, human judges confirm that the procedure greatly improves generator performance when only a small amount of data is available in the domain.
Working Paper
Full-text available
This paper addresses the problem of unwanted email, known as spam, which may be commercial or malicious in nature. Unwanted email threatens to overwhelm legitimate email traffic—the messages that people want to receive—and is often used by criminals as a means to sell illegal goods or compromise systems and confidential data (phishing). Reforming email through adoption of a proposed Trusted Email Open Standard (TEOS) could solve this problem. TEOS is platform-agnostic and combines technology with policy to create a level of trust and accountability currently lacking in email. By securely identifying email senders and enabling them to make verifiable assertions about the messages they send, including participation in programs that promote best practices, TEOS provides a solution to spam and phishing. Because this standard combines proven technology with broad consensus on best practices, adoption could be rapid, with costs offset by savings through reduced spam and phishing activity. At the same time, TEOS safeguards the interests of all responsible users of email, from legitimate bulk senders and email service providers to consumers, even those individuals who wish to use email anonymously.
Article
Full-text available
Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact both on usability and perceived quality. Most NLG systems in common use employ rules and heuristics and tend to generate rigid and stylised responses without the natural variation of human language. They are also not easily scaled to systems covering multiple domains and languages. This paper presents a statistical language generator based on a semantically controlled Long Short-term Memory (LSTM) structure. The LSTM generator can learn from unaligned data by jointly optimising sentence planning and surface realisation using a simple cross entropy training criterion, and language variation can be easily achieved by sampling from output candidates. An objective evaluation in two differing test domains showed improved performance compared to previous methods with fewer heuristics. Human judges scored the LSTM system higher on informativeness and naturalness and overall preferred it to the other systems.
Article
Full-text available
This paper shows the implementation of an artificially intelligent chatterbot with which humans can interact by speaking to it, receiving responses through its speech synthesizer. The objective of this paper is to show applications of the chatterbot in various fields like education, healthcare, and route assistance. It is a statistical model: the chatterbot is based on the AIML (Artificial Intelligence Markup Language) structure for training and uses the Microsoft voice synthesizer for speech recognition and natural language processing.
Conference Paper
Full-text available
In this work, we explain the design of a chat robot specifically tailored to provide an FAQBot system for university students, with the objective of serving as an undergraduate adviser at the student information desk. The chat robot accepts natural language input from users, navigates through the Information Repository and responds with student information in natural language. In this paper, we model the Information Repository as a connected graph where the nodes contain information and links interrelate the information nodes. The design semantics include the AIML (Artificial Intelligence Markup Language) specification language for authoring the information repository, such that the chat robot design separates the information repository from the natural language interface component. Correspondingly, in the experiment, we constructed three experimental systems (a pure dialog system associated with natural language knowledge-based entries, a domain knowledge system engineered with information content, and a hybrid system combining dialog and domain knowledge). Consequently, the information repository can easily be modified and focused on a particular topic without recreating the code design. Experimental parameters and outcomes suggest that topic-specific dialogue coupled with conversational knowledge yields longer dialogue sessions than general conversational dialogue.
Conference Paper
Full-text available
Spam has become a major problem that is threatening the efficiency of the current email system. Spam is overwhelming the Internet because 1) emails are pushed from senders to receivers without much control from recipients, and 2) the cost of delivering emails is very low. In this paper, we present an anti-spam framework that slows down spammers by adding delay to email delivery and by consuming more sender resources. Both delay and resource consumption are controlled based on the likelihood of the source of email messages being a spammer, so that our technique only impacts spammers and has negligible impact on normal email senders. The mechanisms are implemented at the TCP level on the recipient side without requiring any modifications on the sender side. Our evaluations show that selectively delaying connections can effectively slow down a spammer thousands of times when they use a simple setup or open relays. The mechanism of increasing the sender's resource consumption can significantly slow down spammers even when they are spamming from their own optimized servers.
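The paper implements its slowdown inside TCP at the recipient; the sketch below only illustrates the policy idea in plain Python (an assumed mapping from spammer likelihood to acceptance delay), not the actual TCP-level mechanism.

```python
# Hedged sketch of the policy, not the paper's TCP implementation:
# suspected spammers get stalled, normal senders see almost no latency.
# The 0.5 threshold and 60 s cap are illustrative assumptions.
import time

def acceptance_delay(spam_likelihood, max_delay=60.0):
    """Return seconds to stall the mail dialogue for this sender."""
    if spam_likelihood < 0.5:       # likely legitimate: negligible impact
        return 0.0
    # scale smoothly from 0 to max_delay as suspicion grows
    return max_delay * (spam_likelihood - 0.5) / 0.5

def handle_incoming(sender_score):
    time.sleep(acceptance_delay(sender_score))
    # ... proceed with normal message acceptance here ...
```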
Conference Paper
Full-text available
A challenging problem for spoken dialog systems is the design of utterance generation modules that are fast, flexible and general, yet produce high quality output in particular domains. A promising approach is trainable generation, which uses general-purpose linguistic knowledge automatically adapted to the application domain. This paper presents a trainable sentence planner for the MATCH dialog system. We show that trainable sentence planning can produce output comparable to that of MATCH's template-based generator even for quite complex information presentations.
Conference Paper
Full-text available
The freely available SPaRKy sentence planner uses hand-written weighted rules for sentence plan construction, and a user- or domain-specific second-stage ranker for sentence plan selection. However, coming up with sentence plan construction rules for a new domain can be difficult. In this paper, we automatically extract sentence plan construction rules from the RST-DT corpus. In our rules, we use only domain-independent features that are available to a sentence planner at runtime. We evaluate these rules, and outline ways in which they can be used for sentence planning. We have integrated them into a revised version of SPaRKy.
Book
Full-text available
Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book we focus on those algorithms of reinforcement learning which build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, a large number of state of the art algorithms, followed by the discussion of their theoretical properties and limitations.
Conference Paper
Full-text available
Most proponents of domain authentication suggest combining domain authentication with reputation services. This paper presents a new learning algorithm for learning the reputation of email domains and IP addresses based on analyzing the paths used to transmit known spam and known good mail. The result is an effective algorithm providing the reputation information needed to combine with domain authentication to make filtering decisions. This algorithm achieves many of the benefits offered by domain-authentication systems, black-list services, and white-list services without any infrastructure costs or rollout requirements.
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
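For readers unfamiliar with the gating described here, a single LSTM forward step can be written out directly. This NumPy sketch uses my own variable names, and the weights are stand-ins for learned parameters.

```python
# One forward step of one LSTM cell (didactic sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """x: (D,) input; h_prev, c_prev: (H,); W: (4H, D); U: (4H, H); b: (4H,)."""
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])        # input gate: what to write
    f = sigmoid(z[H:2*H])      # forget gate: what to keep in the carousel
    o = sigmoid(z[2*H:3*H])    # output gate: what to expose
    g = np.tanh(z[3*H:4*H])    # candidate cell content
    c = f * c_prev + i * g     # constant-error-carousel update
    h = o * np.tanh(c)         # new hidden state
    return h, c
```

The multiplicative gates `i`, `f`, `o` are exactly the "gate units" the abstract credits with enforcing constant error flow.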
Article
Full-text available
We study the use of support vector machines (SVM) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one where the number of features was constrained to the 1000 best features and another where the dimensionality was over 7000. SVM performed best when using binary features. For both data sets, boosting trees and SVM had acceptable test performance in terms of accuracy and speed; however, SVM had significantly less training time.
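The winning configuration (binary word features with a linear SVM) is easy to reproduce in outline. This scikit-learn sketch with an invented toy corpus is an illustration, not the paper's actual setup.

```python
# Hedged re-creation of the comparison's best setting: binary
# presence/absence features fed to a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

mails = ["win money now", "meeting at noon",
         "cheap pills online", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                 # 1 = spam, 0 = non-spam

vec = CountVectorizer(binary=True)    # presence/absence, not counts
X = vec.fit_transform(mails)
clf = LinearSVC().fit(X, labels)

print(clf.predict(vec.transform(["cheap money pills"])))  # likely [1]
```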
Conference Paper
Full-text available
Recent work in P2P overlay networks allows for decentralized object location and routing (DOLR) across networks based on unique IDs. In this paper, we propose an extension to DOLR systems to publish objects using generic feature vectors instead of content-hashed GUIDs, which enables the systems to locate similar objects. We discuss the design of a distributed text similarity engine, named Approximate Text Addressing (ATA), built on top of this extension that locates objects by their text descriptions. We then outline the design and implementation of a motivating application on ATA, a decentralized spam-filtering service.
Article
Full-text available
We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method to construct automatically anti-spam filters with superior performance. We investigate thoroughly the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures. Both methods achieve very accurate spam filtering, outperforming clearly the keyword-based filter of a widely used e-mail reader.
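The cost-sensitive evaluation the abstract mentions can be made concrete as a weighted accuracy in which losing a legitimate mail costs lambda times more than letting a spam through. The function below is one plausible form of such a measure, with my own variable names rather than the paper's notation.

```python
# Sketch of a cost-sensitive spam-filter measure: blocking a legitimate
# (ham) message counts lam times as much as passing a spam.
def weighted_accuracy(n_ham_ok, n_ham_total, n_spam_ok, n_spam_total, lam=9):
    """lam = relative cost of losing a legitimate message."""
    return ((lam * n_ham_ok + n_spam_ok) /
            (lam * n_ham_total + n_spam_total))

# e.g. 990/1000 ham kept, 380/400 spam caught, lam = 9:
print(weighted_accuracy(990, 1000, 380, 400, lam=9))  # ~= 0.9883
```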
Thesis
Language is the principal medium for ideas, while dialogue is the most natural and effective way for humans to interact with and access information from machines. Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact on usability and perceived quality. Many commonly used NLG systems employ rules and heuristics, which tend to generate inflexible and stylised responses without the natural variation of human language. However, the frequent repetition of identical output forms can quickly make dialogue become tedious for most real-world users. Additionally, these rules and heuristics are not scalable and hence not trivially extensible to other domains or languages. A statistical approach to language generation can learn language decisions directly from data without relying on hand-coded rules or heuristics, which brings scalability and flexibility to NLG. Statistical models also provide an opportunity to learn in-domain human colloquialisms and cross-domain model adaptations. A robust and quasi-supervised NLG model is proposed in this thesis. The model leverages a Recurrent Neural Network (RNN)-based surface realiser and a gating mechanism applied to input semantics. The model is motivated by the Long-Short Term Memory (LSTM) network. The RNN-based surface realiser and gating mechanism use a neural network to learn end-to-end language generation decisions from input dialogue act and sentence pairs; it also integrates sentence planning and surface realisation into a single optimisation problem. The single optimisation not only bypasses the costly intermediate linguistic annotations but also generates more natural and human-like responses. Furthermore, a domain adaptation study shows that the proposed model can be readily adapted and extended to new dialogue domains via a proposed recipe. Continuing the success of end-to-end learning, the second part of the thesis speculates on building an end-to-end dialogue system by framing it as a conditional generation problem. The proposed model encapsulates a belief tracker with a minimal state representation and a generator that takes the dialogue context to produce responses. These features suggest comprehension and fast learning. The proposed model is capable of understanding requests and accomplishing tasks after training on only a few hundred human-human dialogues. A complementary Wizard-of-Oz data collection method is also introduced to facilitate the collection of human-human conversations from online workers. The results demonstrate that the proposed model can talk to human judges naturally, without any difficulty, for a sample application domain. In addition, the results also suggest that the introduction of a stochastic latent variable can help the system model intrinsic variation in communicative intention much better.
Conference Paper
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Conference Paper
In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning. We describe the architecture of the system as well as the challenges that we faced while building it, like response diversity and scalability. We also introduce a new method for semantic clustering of user-generated content that requires only a modest amount of explicitly labeled data.
Article
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.
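The subword representation is simple enough to sketch directly. The function below enumerates a word's character n-grams with boundary markers; the word's vector would then be the sum of its n-gram vectors. The 3-to-6 range and the inclusion of the full word as one extra feature follow the approach as I understand it; the code itself is illustrative.

```python
# Sketch of a word as a bag of character n-grams with boundary markers.
def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                      # boundary markers
    grams = {w}                          # keep the full word as one feature
    for n in range(n_min, n_max + 1):
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

print(sorted(char_ngrams("where", 3, 3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```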
Article
Recent neural models of dialogue generation offer great promise for generating responses for conversational agents, but tend to be shortsighted, predicting utterances one at a time while ignoring their influence on future outcomes. Modeling the future direction of a dialogue is crucial to generating coherent, interesting dialogues, a need which led traditional NLP models of dialogue to draw on reinforcement learning. In this paper, we show how to integrate these goals, applying deep reinforcement learning to model future reward in chatbot dialogue. The model simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that display three useful conversational properties: informativity (non-repetitive turns), coherence, and ease of answering (related to forward-looking function). We evaluate our model on diversity, length as well as with human judges, showing that the proposed algorithm generates more interactive responses and manages to foster a more sustained conversation in dialogue simulation. This work marks a first step towards learning a neural conversational model based on the long-term success of dialogues.
Article
Email spam is unsolicited and unwanted email in the mailbox. Spam threatens internet security and creates traffic in the network. Users spend their precious time removing spam emails from their mailboxes. Spam email occupies storage space, consumes network bandwidth, and is very expensive to handle. Detecting these mails and classifying them as spam or ham is a major task for the organization, and the solution to this problem is filtering spam emails. The Bayesian filter is one such filtering technique; the major issue in the Bayesian approach is the performance of the filter when the word library is very large. In this paper, we introduce an algorithm based on the Support Vector Machine (SVM) to detect spam email and deliver only legitimate email to users. The Support Vector Machine computes quickly and provides accurate results.
Article
We present a novel response generation system that can be trained end to end on large quantities of unstructured Twitter conversations. A neural network architecture is used to address sparsity issues that arise when integrating contextual information into classic statistical models, allowing the system to take into account previous dialog utterances. Our dynamic-context generative models show consistent gains over both context-sensitive and non-context-sensitive Machine Translation and Information Retrieval baselines.
Article
Conversational modeling is an important task in natural language understanding and machine intelligence. Although previous approaches exist, they are often restricted to specific domains (e.g., booking an airline ticket) and require hand-crafted rules. In this paper, we present a simple approach for this task which uses the recently proposed sequence to sequence framework. Our model converses by predicting the next sentence given the previous sentence or sentences in a conversation. The strength of our model is that it can be trained end-to-end and thus requires much fewer hand-crafted rules. We find that this straightforward model can generate simple conversations given a large conversational training dataset. Our preliminary results suggest that, despite optimizing the wrong objective function, the model is able to extract knowledge from both a domain specific dataset, and from a large, noisy, and general domain dataset of movie subtitles. On a domain-specific IT helpdesk dataset, the model can find a solution to a technical problem via conversations. On a noisy open-domain movie transcript dataset, the model can perform simple forms of common sense reasoning. As expected, we also find that the lack of consistency is a common failure mode of our model.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ∗∗∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
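As a usage illustration (not Breiman's original implementation), scikit-learn exposes both ingredients named above: a bootstrap sample per tree and a random feature subset per split, plus the internal out-of-bag estimates the abstract describes.

```python
# Random forest on synthetic data, showing the knobs that correspond to
# the abstract: bootstrap sampling, random feature selection per split,
# and internal (out-of-bag) estimates of error and variable importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,       # number of trees in the forest
    max_features="sqrt",    # random subset of features tried at each split
    oob_score=True,         # internal generalization-error estimate
    random_state=0,
).fit(X, y)

print(forest.oob_score_)                 # out-of-bag accuracy estimate
print(forest.feature_importances_[:5])   # internal variable importance
```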
Conference Paper
In the present-day world, people are so habituated to social networks that it is very easy to spread spam content through them. One can access the details of any person very easily through these sites; no one is safe inside social media. In this paper we propose an application which uses an integrated approach to spam classification in Twitter. The integrated approach comprises URL analysis, natural language processing and supervised machine learning techniques. In short, this is a three-step process.
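A hedged outline of that three-step integration (URL analysis, NLP features, supervised classifier) follows. The shortener heuristic, helper names and toy training data are invented for illustration, not taken from the authors' implementation.

```python
# Illustrative three-step pipeline: URL analysis, then NLP features,
# then a supervised classifier.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

SHORTENERS = ("bit.ly", "tinyurl.com", "goo.gl")   # assumed heuristic list

def url_features(tweet):
    urls = re.findall(r"https?://(\S+)", tweet)
    return len(urls), sum(u.startswith(SHORTENERS) for u in urls)

# Steps 2+3: text features and a supervised classifier on a toy corpus.
tweets = ["Free followers http://bit.ly/x", "Great game last night!"]
labels = [1, 0]                                    # 1 = spam, 0 = ham
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(tweets), labels)

def classify(tweet):
    n_urls, n_short = url_features(tweet)          # Step 1: URL analysis
    if n_short:                                    # crude illustrative rule
        return "spam"
    return "spam" if clf.predict(vec.transform([tweet]))[0] else "ham"
```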
Article
In the past few years, as the number of dialogue systems has increased, there has been an increasing interest in the use of natural language generation in spoken dialogue. Our research assumes that trainable natural language generation is needed to support more flexible and customized dialogues with human users. This paper focuses on methods for automatically training the sentence planning module of a spoken language generator. Sentence planning is a set of inter-related but distinct tasks, one of which is sentence scoping, i.e., the choice of syntactic structure for elementary speech acts and the decision of how to combine them into one or more sentences. The paper first presents SPoT, a trainable sentence planner, and a new methodology for automatically training SPoT on the basis of feedback provided by human judges. Our methodology is unique in neither depending on hand-crafted rules nor on the existence of a domain-specific corpus. SPoT first randomly generates a candidate set of sentence plans and then selects one. We show that SPoT learns to select a sentence plan whose rating on average is only 5% worse than the top human-ranked sentence plan. We then experimentally evaluate SPoT by asking human judges to compare SPoT's output with a hand-crafted template-based generation component, two rule-based sentence planners, and two baseline sentence planners. We show that SPoT performs better than the rule-based systems and the baselines, and as well as the hand-crafted system.
Conference Paper
We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. After addressing these challenges, we compare approaches based on SMT and Information Retrieval in a human evaluation. We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15% of cases. As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response.
Conference Paper
In recent years the volume of junk email (spam, virus etc.) has increased dramatically. These unwanted messages clutter up users' mailboxes, consume server resources, and cause delays to the delivery of mail. This paper presents an approach that ensures that nonjunk mail is delivered without excessive delay, at the expense of delaying junk mail. Using data from two Internet-facing mail servers, we show how it is possible to simply and accurately predict whether the next message sent from a particular server will be good or junk, by monitoring the types of messages previously sent. The prediction can be used to delay acceptance of junk mail, and prioritize good mail through the mail server, ensuring that loading is reduced and delays are low, even if the server is overloaded. The paper includes a review of server-based anti-spam techniques, and an evaluation of these against the data. We develop and calibrate a model of mail server performance, and use it to predict the performance of the prioritization scheme. We also describe an implementation on a standard mail server.
Article
Two important recent trends in NLG are (i) probabilistic techniques and (ii) comprehensive approaches that move away from traditional strictly modular and sequential models. This paper reports experiments in which PCRU, a generation framework that combines probabilistic generation methodology with a comprehensive model of the generation space, was used to semi-automatically create five different versions of a weather forecast generator. The generators were evaluated in terms of output quality, development time and computational efficiency against (i) human forecasters, (ii) a traditional handcrafted pipelined NLG system, and (iii) a HALOGEN-style statistical generator. The most striking result is that despite acquiring all decision-making abilities automatically, the best PCRU generators produce outputs of high enough quality to be scored more highly by human judges than forecasts written by experts.
Conference Paper
CAFE (collaborative agents for filtering e-mails) is a multiagent system to collaboratively filter spam from users' mail stream. CAFE associates a proxy agent with each user, and this agent represents a sort of interface between the user's e-mail client (i.e. Microsoft Outlook, Eudora, etc.) and the e-mail server. With the support of other types of agents, the proxy agent makes a classification of new messages into three categories: ham (good messages), spam and spam-presumed. The system analyzes every single e-mail using essentially three kinds of approach: a first approach based on the usage of a hash function, a static approach using DNSBL (DNS-based black lists) databases and a dynamic approach based on a Bayesian algorithm.
Conference Paper
In recent years, spam email has become a big problem on the Internet. Although some technical countermeasures against spam have been proposed, spam still arrives every day; yet when countermeasures are overly sensitive, even legitimate emails are refused or eliminated. Hence, we propose a new countermeasure that guarantees that legitimate emails are delivered. Our proposed system can restrain the transfer of spam emails only. In this paper, we explain the proposed system and show the results of some experiments.
Article
We present a simple, yet highly accurate, spam filtering program, called SpamCop, which is able to identify about 92% of the spams while misclassifying only about 1.16% of the nonspam e-mails. SpamCop treats an e-mail message as a multiset of words and employs a naive Bayes algorithm to determine whether or not a message is likely to be a spam. Compared with keyword-spotting rules, the probabilistic approach taken in SpamCop not only offers high accuracy, but also overcomes the brittleness suffered by the keyword-spotting approach.

1. Introduction: With the explosive growth of the Internet, so comes the proliferation of spams. Spammers collect a plethora of e-mail addresses without the consent of the owners of these addresses. Then, unsolicited advertising or even offensive messages are sent to them in mass-mailings. As a result, many individuals suffer from mailboxes flooded with spams. Many e-mail packages contain mechanisms that attempt to filter out spams by comparing the se...
Article
The noisy channel model has been applied to a wide range of problems, including spelling correction. These models consist of two components: a source model and a channel model. Very little research has gone into improving the channel model for spelling correction. This paper describes a new channel model for spelling correction, based on generic string to string edits. Using this model gives significant performance improvements compared to previously proposed models.
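The noisy channel decision rule itself is compact: choose the dictionary word maximizing the source model times the channel model. The sketch below substitutes a toy unigram prior and a similarity-based stand-in for the paper's learned string-to-string edit channel; all names and probabilities are invented.

```python
# Noisy channel spelling correction in outline: w* = argmax_w P(w) * P(s|w).
from difflib import SequenceMatcher

LM = {"their": 0.6, "there": 0.3, "thier": 0.0}   # toy source model P(w)

def channel(s, w):
    # stand-in channel model: higher string similarity ~ more probable typo
    return SequenceMatcher(None, s, w).ratio()

def correct(s):
    return max(LM, key=lambda w: LM[w] * channel(s, w))

print(correct("thier"))   # -> 'their' on this toy model
```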
Wen, Tsung-Hsien, et al. "Semantically conditioned LSTM-based natural language generation for spoken dialogue systems." arXiv preprint arXiv:1508.01745 (2015).

Bateman, J. A. "Natural language generation: an introduction and open-ended review of the state of the art" (2001) [cited March 2015]. URL http://www.fb10.uni-

Fecyk, Gordon. "Designated mailers protocol." http://www.panam.ca/dmp/draft-fecyk-dmp-01.txt, 2003. Accessed: 31.05.06.

SenderID. "Sender ID technology: Information for IT professionals." http://www.microsoft.com/mscorp/safety/technologies/senderid/technology.mspx, 2004. Accessed: 31.05.06.

Jasongroce. "Bot Framework Documentation." Bot Framework | Microsoft Docs. N.p., n.d. Web. 01 July 2017.

Bojanowski, Piotr, et al. "Enriching word vectors with subword information."

Li, Jiwei, et al. "Deep reinforcement learning for dialogue generation." arXiv preprint arXiv:1606.01541 (2016).

Ranzato, Marc'Aurelio, Sumit Chopra, Michael Auli, and Wojciech Zaremba. "Sequence level training with recurrent neural networks." arXiv preprint arXiv:1511.06732 (2015).

Walker, Marilyn A., Owen C. Rambow, and Monica Rogati. "Training a sentence planner for spoken dialogue using boosting." Computer Speech and Language.

Wen, Tsung-Hsien, et al. "Multi-domain neural network language generation for spoken dialogue systems."