ArticlePDF Available

Mining E-mail Content for Cyber Forensic Investigation

January 2012

January 2012
2(2)

Authors:

Raisoni Group of Institutions

E-mail is a widely used mechanism for communication, due to its cost and expediency. However, the concern lies when along with its legitimate usage; it is being abused for committing various cyber crimes. E-mail system security lacks adequate proactive mechanism, to defend against such vulnerabilities and misuses. A cyber forensic investigation is employed for gathering significant evidences against adversaries by examining suspected e-mail accounts, in order to prosecute criminals in court of law. In this context, data mining techniques and tools based on them have been used extensively for extracting evidences from huge e-mail ensembles. This can provide assistance to the forensic investigator, to perform a multi-staged analysis of e-mail ensembles. In this paper, we briefly discuss various applications of data mining techniques with respect to cyber forensic investigation. Specifically, we describe our proposed framework and give implementation of first module,e-mail statistical analysis of our framework.

E-mail Details Window

…

Figures - uploaded by Smita Nirkhi

Content may be subject to copyright.

Content uploaded by Smita Nirkhi

Content may be subject to copyright.

UACEE International Journal of Computer Science and its Applications - Volume 2: Issue 2 [ISSN 2250 - 3765]

112

Mining E-mail Content for Cyber Forensic

Investigation

Ms. Sobiya R. Khan

P.G. Dept. of Computer Sci.

&Eng., GHRCE, Nagpur, India

sobi16@gmail.com

Ms. Smita M. Nirkhi

P.G. Dept. of Computer Sci.

&Eng.,GHRCE, Nagpur, India

smita811@gmail.com

Dr. R. V. Dharaskar

M. P. G. I,

Nanded, India

rvdharaskar@rediffmail.com

Abstract - E-mail is a widely used mechanism for

communication, due to its cost and expediency. However, the

concern lies when along with its legitimate usage; it is being

abused for committing various cyber crimes. E-mail system

security lacks adequate proactive mechanism, to defend against

such vulnerabilities and misuses. A cyber forensic investigation is

employed for gathering significant evidences against adversaries

by examining suspected e-mail accounts, in order to prosecute

criminals in court of law. In this context, data mining techniques

and tools based on them have been used extensively for

extracting evidences from huge e-mail ensembles. This can

provide assistance to the forensic investigator, to perform a

multi-staged analysis of e-mail ensembles. In this paper, we

briefly discuss various applications of data mining techniques

with respect to cyber forensic investigation. Specifically, we

describe our proposed framework and give implementation of

first module,e-mail statistical analysis of our framework.

Keywords - Cyber Crime, E-mail forensic analysis, Statistical

Analysis, Classification and Clustering techniques, Authorship

identification, Community identification.

I. INTRODUCTION

Nowadays, e-mail has become an easy, efficient and

economical means of communication over the Internet &

Intranet. It is being employed by most of the industries and

governments, as well. Thus there is huge amount of e-mail

traffic generated on daily basis. However, with its increased

usage, there is an undesired increase in the crimes which are

mediated via e-mails. Examples of such misuse include:

phishing, spamming, drug trafficking, cyber bullying, racial

vilification, child pornography, and sexual harassment. The

prime reason for this inherent vulnerability are twofold,

firstly, there is no mechanism for message encryption at the

sender end and an integrity check at the recipient end.

Secondly, the widely used, Simple Mail Transfer Protocol

(SMTP) e-mail protocol lacks a source authentication

mechanism. This inherent vulnerability of e-mail

communication exposes it to such crimes. Due to such crimes,

these e-mail misuse phenomena do a lot of harm to people„s

benefit, and even influence social stability.However, there are

no effective upbeat methods for preventing these phenomena.

The current methods are merely some passive mechanisms

such as e-mail filtering, installing firewall, etc. But they are

unable to put an end to the e-mail misuse phenomena.In this

context, a cyber crime investigation is carried out to gather

evidences and bring the culprits in the court of law and

provide justice to the victims. This had lead to the need for

efficient automated tools in the hands of forensic experts,

during forensic investigation, which can provide means to

capture evidence against such criminals which are credible in

the court of law.

This paper is divided into six sections. Section II

briefly describes the issues to be considered in e-mail mining.

Section III gives a brief overview of the related work. Section

IV gives an outline of our proposed framework, its

experimental setup and implementation of first module e-mail

statistical analysis of our framework. Theexperimentalresults

are described in Section V. Section VI gives the conclusion

drawn.

II. E-MAIL MINING

Digital Forensic technology has already become a centre

of attention among researcher‟s and law professionals. With

the ongoing research and development in this technology,

there is hope that this could help to curb the amount of cyber

crime going.Data mining is the process of extracting useful

patterns from vast amount of data. So, the obvious question is

why to use data mining in forensic investigation? The crux to

this question lies in the Analysis phase of forensic

investigation. The Analysis phase poses difficulty in front of

the forensic investigator, because it‟s difficult to analyse large

data set, if no appropriate methods are available to process it.

Also, it is unknown at the initial stage of the investigation,

which pieces of information may have value as evidence. Data

mining techniques are inherently applicable to this problem

domain and hence can provide an efficient mechanism to

capture relevant information out of huge data set [24].

E-mail Mining can be considered as an application of the

upcoming research area of Text Mining on e-mail data [17].

However, there are some specific characteristics of e-mail

data that set a distinctive separating line between E-mail and

Text Mining:

1. Information in the headers of e-mail can be used for

various e-mail mining tasks.Text mining techniques

might be inefficient fore-mail, as e-mail data is generally

quite short.

2. Well-formed linguistic is not guaranteed in e-mail and

spelling and grammar mistakes might also appear

frequently.Different topics may be discussed; which

makes e-mail classification more difficult.

3. E-mail is writtenpersonally hence generic techniques are

difficult to be effective.Concepts or distributions of

target classes may change over time. HTML tags and

UACEE International Journal of Computer Science and its Applications - Volume 2: Issue 2 [ISSN 2250 - 3765]

113

attachments must also be removed in order to apply a

text mining technique.

4. Due to privacy issues very few e-mail data are available

publicly for experiments. Exception to the above

statement, are the Enron Corpus and the Ling Spam

corpus, which have been made public for research

purpose [24].

III. RELATED WORK

A. E-mail Analysis Tools

Researchers have employed existing state-of-the-art

data mining techniques, machine learning algorithms and

visualization techniquesto implement various tools and

frameworks. Such tools have varied functionalities and

applications with respect to cyber crime investigation. Some

of the existing tools are online such as e.g., MET, UnMask,

etc which use online e-mail data whereas others are offline in

nature, such as e.g., EMT, IEFAF, etc which use offline E-

mail dumps to extract information relevant for forensic

investigation [24]. The following is a brief description of the

various tools: The benchmark tools of E-mail Mining Toolkit

(EMT) and Malicious E-mail Tracking (MET) were

developed at the Columbia University, which employed data

mining techniques to perform behaviour-based analyses and

social network analysis [8-9]. The EMS toolkit sheds some

light on the social network of the users [3]. Visualize

Association inside E-mails (VAIE) builds data models of e-

mails to classify them in different categories based on key

word search techniques[11].Visualization techniques have

been employed for e-mail analysis to provide graphical

representation of the e-mail data [10].UnMask, has been

developed as an ongoing project for determining phishing

[12]. IEFAF includes features such as computing statistical

distribution; generating data mining models& performing e-

mail authorship analysis [1].

B. E-mail Author Attribution

Authorship analysis is a process of examining the

characteristics of a piece of writing to draw conclusions on its

authorship. Its roots are from a linguistic research area called

stylometry, which refers to statistical analysis of literary style.

Authorship analysis is categorized into three major field,

Authorship identification, Similarity detection and Authorship

characterization. Authorship identiﬁcation determines the

likelihood of a piece of writing to be produced by a particular

author by examining other writings by that author. Authorship

analysis has been used in a small but diverse number of

application areas. Examples include identifying authors in

literature, in program code, and in forensic analysis for

criminal cases. Authorship analysis has been applied to online

messages in recent years [24]. Commendable results were

obtained with respect to e-mail authorship analysis on both

aggregated and across different topics in [6, 14]. In another

literature, four types of writing-style features (lexical,

syntactic, structural, and content-specific features) along with

SVM were used to identify plausible author of e-mail and

online messages [4, 5] which was extended using genetic

algorithm in [19]. Stylometric features combined with

unsupervised techniques have been employed for author

identification and similarity detection in [18]. A novel method

termed as Write-print using frequent pattern mining has been

developed in [2], which was further improved using clustering

technique in [7].

C. E-mail Classification and Clustering

Most e-mail mining tasks are being accomplished by

using e-mail classification at some point. E-mail classification

is the assignment of an email message to one of the category,

from a pre-defined set of categories. Automatic email

classification aims at building a model (typically by using

machine learning techniques), which will undertake this task

on behalf of the user. Examples of applications are automatic

mail categorization into folders, spam filtering and author

identification [24]. Four different classifiers (Neural Network,

SVM, Naïve Bayesian and Decision Tree) have been used to

identify suspicious mails in [15]. Naïve Bayesian classifier

has been used for identifying threats from a company's rapidly

expanding e-mail data set in [16]. Various studies have

revealed that SVM has been shown to be very robust and

successful. Clustering techniques goes one step further where

training data set isn‟t available by automatically categorizing

data. Clustering technique has been used extensively for text

categorization and authorship analysis as well [24]. An

effective digital text analysis strategy has been given in [20]

which rely on clustering based text mining technique.

D. E-mail Social Network Analysis

Social Network analysis is the study of

communication links or associations between people. It

reveals a great deal of information about his/her behaviour

and circle of people (friends, colleagues, family members,

etc.) around him/her with whom he/she interacts [24]. Social

Network has been explored in [22] by implementing a novel

algorithm using data mining to identify user behaviours,

identify patterns of communications between entities in an e-

mail collection to extract social standing. Associations

between members have been extracted to discover criminal

communities in [13]. Social Network analysis has also been

explored in [23] which use recursive data mining in order to

identify frequently occurring communities in online messages

such as e-mails, blogs, chats, etc. Studies have shown that

frequent pattern mining techniques have been very successful

in this problem domain.

IV. PROPOSED SYSTEM

Above discussed tools, frameworks and techniques

have shown expertise in one or the other application, but still

they lack a consistent interface, an integrated approach, and a

commercial outfit which can provide varied functionalities to

analyse e-mail ensembles and discern useful information from

it, which could be useful in the investigation process. The

results should be available on timely basis during the

investigation and evidences should be in such form which

could be satisfactorily presented in the court of law for further

UACEE International Journal of Computer Science and its Applications - Volume 2: Issue 2 [ISSN 2250 - 3765]

114

jurisdiction. Thus, we can conclude that need still exist for

automated forensic tools whichwill help forensic experts to

efficiently analyse e-mail collections, within a limited time

frame.

Here we are proposing the implementation ofa

framework which will employ data mining techniques to

achieve the various functionalities. The framework is

proposed to perform E-mail Statistical Analysis, E-mail

Classification & Clustering, E-mail Author Identification and

E-mail Social Network Analysis and will try to overcome the

limitations observed in previous systems.To evaluate our

implementation, we are using the Enron e-mail corpus made

available by MIT athttp://www.cs.cmu.edu/~enron/. The

proposed framework will be implemented in Java and will use

the data mining tool weka.

A. E-mail Statistical Analysis

Statistical analysis of e-mail accounts calculated

from communication patterns reflects a great deal of

information which could be of value to the forensic

investigator. The various possible statistics obtained from the

email corpus could be number of e-mails per sender, per

recipient, per sender domain, per recipient domain, per class,

per cluster etc. Statistics related to Classes and clusters are

determined after applying classification and clustering

respectively. Computing various statistics such as e-mailing

frequency during different parts of the day, average e-mail

size, and average attachment size (if any) helps to discern

usage behaviour and can be used to detect suspicious

behaviour. This may help investigators to narrow down the

investigation scope by short listing e-mail accounts that are

showing unusual behaviourand can be emphasized for further

investigation. Additional information like total number of

users (senders/recipients) within an e-mail collection, finding

all the recipients of each user can help during

investigation.Statistical distributions can be computed over a

certain period of time and for a specific set of e-mails.

B. E-mail Extraction

The objective of the first module of our proposed

framework is to perform E-mail Statistical Analysis. This

provides a statistical summary of the e-mail data, via applying

various statistical measures on the data obtained from e-mail.

But prior to that, it is essential to extract the relevant

information from the e-mail dataset and to represent that

information in a form which will be suitable for manipulation

for statistical analysis. Hence we can put forth the present

objective before statistical analysis as,

1. E-mail extraction

2. Data cleaning for removing ambiguities

3. Identification of relevant tokens

4. Extraction of tokens from e-mail archives such as

Message Id, To, From, Date, Time, CC, BCC,

Subject and Body

5. Storing the extracted tokens in the Database

The tokens are identified from this data, which is saved

inside database and which helps to perform various analyses

pertaining to E-mail.

C. E-mail Extraction Logic

The following steps must be followed for E-mail

extraction:

1. Read directory of particular User

2. If file present in directory, Read File

3. For each line in the file, check whether token exists or

not. If token is present, extract token value and go to next

line

4. Repeat step 3 until end of file

5. If directory has more files go to step 2

6. If all files in directory are processed , stop

After the extraction, various statistics described in

the next section are calculated from the database.

V. EXPERIMENTAL RESULT

Presently we are using data from Inbox and Sent folders

for our analysis. The E-mail Details Window is as shown in

Fig.1. Here the details regarding each users mails is displayed

from the inbox and sent folders.

Figure 1. E-mail Details Window

Fig. 2 represents the General E-mail Statistics which can

be calculated from the E-mail data. These statistics are

calculated from the E-mail Inbox and Sent data. The various

statistics included are as follows:

No. of Items in Inbox

No. of Items Sent

Average Mail Size [AMS]

Average No. of Mails received per Day

Average No. of Mails Sent per Day

No. of Mails below AMS

No. of Items received at Night

No. of Items received during Day Time

No. of Items sent at Night

No. of Items sent during Day Time

UACEE International Journal of Computer Science and its Applications - Volume 2: Issue 2 [ISSN 2250 - 3765]

115

Average No. of Recipients

Mails with larger no of Recipients

No of Contacts

Figure 2. E-mail Statistical Analysis

Graphical representation of the statistical data is

presented for few statistics in Fig. 3 and 4.

Figure 3. Distribution of Incoming E-mails

Figure 4. Time Distribution of Inbox

The graph of Fig. 4 represents inbox distribution of 5 users for

user Allen P. The next graph in Fig. 5 shows the mailing

frequency during various time distributions. Anomalous

behaviour can be identified from the statistical distributions

and unusual behaviour could be identified by the forensic

investigator. The work is still in progress for rest of the

modules.More Statistics will be added to the general statistics

after the completion of our second and fourth module, which

will include the statistics corresponding to Classification,

Clustering and Social Network Analysis.

VI. CONCLUSION

In this paper, we briefly discussed the application of

data mining techniques with respect to various automated

tools, e-mail authorship analysis, e-mail classification &

clustering and social network analysis. The study of previous

work reveals that data mining techniques gives a promising

lookfor analysis of huge e-mail dataset. Automated tools

based on such technique can assistforensic investigator during

initial cyber forensic investigation. We are employing data

mining techniques to implement our framework. The initial

results of our first module of E-mail Statistical Analysis have

been shown which will be integrated with the rest of the

modules.

REFERENCES

[1] Rachid Hadjidj, Mourad Debbabi, Hakim Lounis, Farkhund Iqbal,

Adam Szporer, Djamel Benredjem, “Towards an integrated e-mail

forensic analysis framework”, Digital Investigation 5, pp.124–137,

2009.

[2] Iqbal F, Hadjidj R, Fung BCM, Debbabi M., “A novel approach of

mining write-prints for authorship attribution in e-mail forensics”,

Digital Investigation 5:pp.42–51, 2008.

[3] Hongjun Li, Jiangang Zhang, Haibo Wang, Shaoming Huang, “A

Mining Algorithm For E-mail‟s Relationships Based On Neural

Networks”, International Conference on Computer Science and

Software Engineering, 2008.

[4] Zheng R, Li J, Chen H, Huang Z., “A framework for authorship

identification of online messages: writing-style features and

classification techniques”. Journal of the American Society for

Information Science and Technology, February ;57(3), pp.378– 93,

2006.

[5] Zheng R, Qin Y, Huang Z, Chen H., “Authorship analysis in cybercrime

investigation”, In: Proc. 1st NSF/NIJ symposium. ISI Springer-Verlag;

pp. 59–73, 2003.

[6] de Vel O, Anderson A, Corney M, Mohay G., “Mining e-mail content

for author identification forensics”, SIGMOD Record December

;30(4):55–64, 2001.

[7] Farkhund Iqbal, Hamad Binsalleeh, Benjamin C.M. Fung, Mourad

Debbabi., “Mining writeprints from anonymous e-mails for forensic

investigation”, Digital Investigation, 2010.

[8] Stolfo S.J., Hershkop S., Ke Wang, Nimeskern O., “EMT/MET: systems

for modeling and detecting errant e-mail”, Proceedings of DARPA

Information Survivability Conference and Exposition, 2003.

[9] Stolfo S.J., Hershkop S., Ke Wang, Nimeskern O., Chia-Wei Hu,

“Behavior-Based Modeling and Its Application to E-mail

Analysis”,ACM Transactions on Internet Technology, Vol. 6,No. 2,

May, Pages 187–221, 2006.

[10] Xiaoyan Fu_,Seok-Hee Hong,Nikola S. Nikolov,Xiaobin Shen,Yingxin

Wu,Kai Xuk, “Visualization and Analysis of E-mail Networks”, Asia-

Pacific Symposium on Visualisation, 2007.

[11] Fanlin Meng, Shunxiang Wu, Junbin Yang, Genzhen Yu, “Research of

an E-mail Forensic and Analysis System Based on Visualization”,

Second Asia-Pacific Conference on Computational Intelligence and

Industrial Applications, 2009.

UACEE International Journal of Computer Science and its Applications - Volume 2: Issue 2 [ISSN 2250 - 3765]

116

[12] Sudhir Aggarwal,Jasbinder Bali,Zhenhai Duan,Leo Kermes,Wayne

Liu,Shahank Sahai,Zhenghui Zhu, “The Design and Development of an

Undercover Multipurpose Anti-Spoofing Kit (UnMask)”, 23rd Annual

Computer Security Applications Conference, 2007.

[13] Rabeah Al-Zaidy, Benjamin C. M. Fung, Amr M. Youssef, “Towards

discovering criminal communities from textual data”, Proceedings of

the 2011 ACM Symposium on Applied Computing, 2011.

[14] Olivier de Vel, “Mining E-mail Authorship”, KDD-2000 Workshop on

Text Mining, August 20, Boston, 2000.

[15] S S.Appavu alias Balamurugan, Dr.R.Rajaram, “Data mining techniques

for suspicious e-mail detection: A comparative study”, IADIS European

Conference Data Mining, 2007.

[16] D.V. Chandra Shekar and S.Sagar Imambi, “Classifying and Identifying

of Threats in E-mails – Using Data Mining Techniques”, Proceedings of

the International MultiConference of Engineers and Computer

Scientists, Vol.I, IMECS, 19-21 March 2008, Hong Kong.

[17] Ioannis Katakis, Grigorios Tsoumakas, Ioannis Vlahavas, “E-mail

Mining: Emerging Techniques for E-mail Management”,Aristotle

University of Thessaloniki, Department of Informatics, Greece.

[18] Abbasi A, Chen H., “Writeprints: a stylometric approach to identity

level identification and similarity detection in cyberspace”, ACM

Transactions on Information Systems, Vol.26, No.2, Article 7, March

2008.

[19] Jiexun Li, Rong Zheng, Hsinchun Chen, “From Fingerprint to

Writeprint”, Communications of the ACM, 2006.

[20] Sergio Decherchi, Simone Tacconi, Judith Redi, Fabio Sangiacomo,

Alessio Leoncini and Rodolfo Zunino, “Text Clustering for Digital

Forensics Analysis”, Journal of Information Assurance and Security 5

(2010),pp.384-391.

[21] Gary Palmer, “A Road Map for Digital Forensic Research, “DFRWS

Technical Report”,Available:http://www.dfrws.org/2001/dfrwsrmfinal.

pdf, 2001.

[22] Ryan Rowe, German Creamer, Shlomo Hershkop and Salvatore J

Stolfo, “Automated Social Hierarchy Detection through E-mail Network

Analysis”,Joint 9th WEBKDD and 1st SNAKDD Workshop ‟07 August

12, 2007, San Jose, California, USA.

[23] M. Goldberg, M. Hayvanovych, A. Hoonlor, S. Kelley, M. Ismail, K.

Mertsalov, B. Szymanski and W. Wallace, “Discovery, Analysis and

Monitoring of Hidden Social Networks and Their Evolution”,

Technologies for Homeland Security, IEEE Conference, pp.1-6, 2008.

[24] Sobiya R. Khan, Smita M. Nirkhi, R. V. Dharaskar, “E-mail Mining for

Cyber Crime Investigation”, Proceedings of International Conference on

Advances in Computer and Communication Technology, pp.138-141,

February 2012.

E-mail Data Analysis for Application to Cyber Forensic Investigation using Data Mining

Conference Paper

Full-text available

Jan 2013

This paper discusses briefly the significance of e-mail communication in today's world, how substantial e-mails are with respect to obtaining digital evidence. The framework proposed by authors employs state-of-the-art existing data mining techniques. Experiments are conducted for e-mail analysis on the Enron data corpus. The intent of the proposed system is to provide assistance during forensic investigation. In this paper, we enhance the results obtained in our previous work on statistical analysis and provide our findings on e-mail classification experiments.

Integrated Cyber Forensic for E-Mail Analysis

Article

Full-text available

Jul 2023

In the past two decades, the Internet has become as open, publicly and widely used as a source of data transmission and exchanging the messages between criminals, terrorists and those who have illegal motivations. Moreover, exchanging important data between various military and financial institutions, even ordinary citizens. From this view, there is one of the important means of exchanging information widely used on the Internet medium is e-mail. Email messages are digital evidence that has been become one of the important means to adopt by courts in many countries and societies as evidence relied upon in condemnation. This paper presents a distinct technique for classifying emails based on data processing and mining, trimming, refinement, and then adapts several algorithms to classify these emails and then using SWARM algorithm to obtain practical and accurate results also using hybrid English lexical dictionary SentiWordNet3. 0 for email forensic analysis then deal with a machine learning algorithm. The proposed system is capable of learning in an environment with large and variable data. To test the proposed system, have to select available data which Enron Data set. A high accuracy rate (95%) was obtained, which is higher than the classification rates mentioned in previous research papers presented in section 2 in this paper.

DIGITAL CYBER FORENSICS CONTRIBUTION FOR EMAIL ANALYSIS

Article

Full-text available

Jul 2020

In the past two decades, the Internet has become as open, publicly and widely used as a source of data transmission and exchanging the messages between criminals, terrorists and those who have illegal motivations. Moreover, exchanging important data between various military and financial institutions, even ordinary citizens. From this view, there is one of the important means of exchanging information widely used on the Internet medium is e-mail. Email messages are digital evidence that has been become one of the important means to adopt by courts in many countries and societies as evidence relied upon in condemnation. This paper presents a distinct technique for classifying emails based on data processing and mining, trimming, refinement, and then adapts several algorithms to classify these emails and then using SWARM algorithm to obtain practical and accurate results also using hybrid English lexical dictionary SentiWordNet3.0 for email forensic analysis then deal with a machine learning algorithm. The proposed system is capable of learning in an environment with large and variable data. To test the proposed system, have to select available data which Enron Data set. A high accuracy rate (95%) was obtained, which is higher than the classification rates mentioned in previous research papers presented in section 2 in this paper.

DIGITAL CYBER FORENSIC EMAIL ANALYSIS AND DETECTION BASED ON INTELLIGENT TECHNIQUES

Research

Full-text available

Jan 2020

The Internet has become open, public and widely used as a source of data transmission and exchanging messages between criminals, terrorists and those who have illegal motivations. Moreover, it can be used for exchanging important data between various military and financial institutions, or even ordinary citizens. One of the important means of exchanging information widely used on the Internet medium is the e-mail. Email messages are digital evidence that has been become one of the important means to adopt by courts in many countries and societies as evidence relied upon in condemnation, that prompts the researchers to work continuously to develop email analysis tool using the latest technologies to find digital evidence from email messages to assist the forensic expertise into to analyze email groups. This work presents a distinct technique for analyzing and classifying emails based on data processing and extraction, trimming, and refinement, clustering, then using the SWARM algorithm to improve the performance and then adapting support vector machine algorithm to classify these emails to obtain practical and accurate results. This framework, also proposes a hybrid English lexical Dictionary (SentiWordNet 3.0) for email forensic analysis, it contains all the sentiwords such as positive and negative and can deal with the Machine Learning algorithm. The proposed system is capable of learning in an environment with large and variable data to test the proposed system will be select available data which is Enron Data set. A high accuracy rate is 92% was obtained in best case. The experiment is conducted the Enron email dataset corpus (May 7, 2015 Version of the dataset).

Integrated Cyber Forensic for E-Mail Analysis Framework

Article

Full-text available

Sep 2019

Objectives: This study presents a distinct technique for classifying emails based on data processing and mining, trimming, refinement, and then adapts several algorithms to classify these emails. Methods/Statistical Analysis: the SWARM algorithm to obtain practical and accurate results. Findings: The proposed system is capable of learning in an environment with large and variable data. To test the proposed system, we have to select available data which Enron Data set. A high accuracy rate (95%) was obtained, which is higher than the classification rates mentioned in previous research papers presented in section 2 in this paper. Application/ Improvements: In the past two decades, the Internet has become as an open, publicly and widely used as a source of data transmission and exchanging the messages between criminals, terrorists and those who have illegal motivations. Moreover exchanging important data between various military and financial institutions even ordinary citizens. From this view, there is one of the important means of exchanging information widely used on the Internet medium is e-mail. Email messages are digital evidence which became one of the important means to adopt by courts in many countries and societies as evidence relied upon in condemnation.

CS-BPSO: Hybrid feature selection based on chi-square and binary PSO algorithm for Arabic email authorship analysis

Article

Jun 2021
KNOWL-BASED SYST

Email authorship analysis is a challenging task involving the detection of an author’s style to help determine their identity. Emails represent a widespread application of big data, and email authorship analysis is widely performed in the forensic linguistics field. However, the high-dimensional feature space encountered in authorship analysis affects the classification performance. Moreover, the Arabic language is highly inflected and involves certain unique characteristics, which pose critical challenges in identifying the context. Therefore, the selection of prominent features is a critical step in realizing authorship analysis. Swarm intelligence (SI) algorithms are widely adopted to address such feature selection problems. In this study, an efficient hybrid feature selection algorithm based on binary particle swarm optimization (BPSO) and chi-square BPSO (CS-BPSO) was developed to enhance the performance of Arabic email authorship analysis. Static and dynamic features were considered. Experiments were conducted on Arabic email messages collected from a sample population to test the algorithm performance using three popular classifiers: support vector machine, K-nearest neighbour, and naïve Bayes classifiers. Different metrics, specifically, the accuracy, precision, recall, and f1-score, were considered as performance measures. The results showed that the CS-BPSO method achieves impressive results using dynamic features. The findings were quite satisfactory in terms of solving multiple types of difficulties, e.g., imbalanced dataset, small dataset, and short text.

Developing a Reliable System for Real-Life Emails Classification Using Machine Learning Approach

Chapter

Full-text available

May 2021

Cyber World has become accessible, public and commonly used to distribute and exchange messages between malicious actors, terrorists, and illegally motivated persons. Electronic mail is one of the most frequently used transfers of information on internet media. E-mails are the most important digital proof that courts in various countries and communities use to condemn and that enables researchers to work continually to improve e-mail analysis using state-of-the-art technology to find digital evidence from e-mails. This work introduces a distinctive technology to analyze emails. It is based on consecutive phases, starting with data processing, extraction, compilation, then implementing the SWARM algorithm to adjust the output and to transfer these electronic mails for realistic and precise results by adjusting the support algorithm of vector machines. For email forensic analysis this system includes all the sentiment terms plus positives and negatives cases. It can deal with the machine learning algorithm (Sent WordNet 3.0). Enron Data set is used to test the proposed framework. In the best case, a high accuracy rate is 92%.

DIGITAL CYBER FORENSIC EMAIL ANALYSIS AND DETECTION BASED ON INTELLIGENT TECHNIQUESINVESTIGATION

Article

Apr 2020

The Internet has become open, public and widely used as a source of data transmission and exchanging messages between criminals, terrorists and those who have illegal motivations. Moreover, it can be used for exchanging important data between various military and financial institutions, or even ordinary citizens. One of the important means of exchanging information widely used on the Internet medium is the e-mail. Email messages are digital evidence that has been become one of the important means to adopt by courts in many countries and societies as evidence relied upon in condemnation, that prompts the researchers to work continuously to develop email analysis tool using the latest technologies to find digital evidence from email messages to assist the forensic expertise into to analyze email groups .This work presents a distinct technique for analyzing and classifying emails based on data processing and extraction, trimming, and refinement, clustering, then using the SWARM algorithm to improve the performance and then adapting support vector machine algorithm to classify these emails to obtain practical and accurate results. This framework, also proposes a hybrid English lexical Dictionary (SentiWordNet 3.0) for email forensic analysis, it contains all the sentiwords such as positive and negative and can deal with the Machine Learning algorithm. The proposed system is capable of learning in an environment with large and variable data. To test the proposed system will be select available data which is Enron Data set. A high accuracy rate is 92% was obtained in best case. The experiment is conducted the Enron email dataset corpus (May 7, 2015 Version of the dataset).

Writeprints: A Stylometric Approach to Identity-level Identification and Similarity Detection in Cyberspace

Article

Full-text available

Mar 2008

One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints,technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.

DATA MINING TECHNIQUES FOR SUSPICIOUS EMAIL DETECTION: A COMPARATIVE STUDY

Article

Full-text available

Email has become one of the fastest and most economical forms of communication. This paper proposes to apply classification data mining for the task of suspicious email detection based on deception theory. In this paper, email data was classified using four different classifiers (Neural Network, SVM, Naïve Bayesian and Decision Tree). The experiment was performed using WEKA based on different features by which the email corpus is classified into suspicious or normal emails. Experimental results show that simple ID3 classifier which make a binary tree, will give a promising detection rates.

Automated social hierarchy detection through email network analysis

Article

Full-text available

Aug 2007

This paper provides a novel algorithm for automatically extracting social hierarchy data from electronic communication behavior. The algorithm is based on data mining user behaviors to automatically analyze and catalog patterns of communications between entities in a email collection to extract social standing. The advantage to such automatic methods is that they extract relevancy between hierarchy levels and are dynamic over time. We illustrate the algorithms over real world data using the Enron corporation's email archive. The results show great promise when compared to the corporations work chart and judicial proceeding analyzing the major players.

E-mail Mining: Emerging Techniques for E-Mail Management

Article

Full-text available

Jan 2006

ABSTRACT Email has met tremendous popularity over the past f ew years. People are sending and receiving many messages per day, communicating with partners and friends, or exchanging files and information. Unfortunately, the phenomenon of email overload has grown over the past years becoming a personal headache for users and a financ ial issue for companies. In this chapter, we will discuss how,disciplines like Machine Learning and Data Mining can contribute to the solution of the problem by constructing intelligent techniques which automate email managing tasks and what advantages they hold over other conv entional solutions. We will also discuss the particularity of email data and what special treatm ent it requires. Some interesting email mining applications like mail categorization, summarizatio n, automatic answering and spam filtering

Mining Email Content for Author Identification Forensics

Article

Full-text available

Dec 2001
SIGMOD REC

We describe an investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation. We focus our discussion on the ability to discriminate between authors for the case of both aggregated e-mail topics as well as across different e-mail topics. An extended set of e-mail document features including structural characteristics and linguistic patterns were derived and, together with a Support Vector Machine learning algorithm, were used for mining the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.

Behavior-based modeling and its application to Email analysis

Article

Full-text available

May 2006

The Email Mining Toolkit (EMT) is a data mining system that computes behavior profiles or models of user email accounts. These models may be used for a multitude of tasks including forensic analyses and detection tasks of value to law enforcement and intelligence agencies, as well for as other typical tasks such as virus and spam detection. To demonstrate the power of the methods, we focus on the application of these models to detect the early onset of a viral propagation without “content-base ” (or signature-based) analysis in common use in virus scanners. We present several experiments using real email from 15 users with injected simulated viral emails and describe how the combination of different behavior models improves overall detection rates. The performance results vary depending upon parameter settings, approaching 99 &percnt; true positive (TP) (percentage of viral emails caught) in general cases and with 0.38 &percnt; false positive (FP) (percentage of emails with attachments that are mislabeled as viral). The models used for this study are based upon volume and velocity statistics of a user's email rate and an analysis of the user's (social) cliques revealed in the person's email behavior. We show by way of simulation that virus propagations are detectable since viruses may emit emails at rates different than human behavior suggests is normal, and email is directed to groups of recipients in ways that violate the users' typical communications with their social groups.

A novel approach of mining write-prints for authorship attribution in e-mail forensics

Article

Full-text available

Sep 2008
DIGIT INVEST

There is an alarming increase in the number of cybercrime incidents through anonymous e-mails. The problem of e-mail authorship attribution is to identify the most plausible author of an anonymous e-mail from a group of potential suspects. Most previous contributions employed a traditional classification approach, such as decision tree and Support Vector Machine (SVM), to identify the author and studied the effects of different writing style features on the classification accuracy. However, little attention has been given on ensuring the quality of the evidence. In this paper, we introduce an innovative data mining method to capture the write-print of every suspect and model it as combinations of features that occurred frequently in the suspect's e-mails. This notion is called frequent pattern, which has proven to be effective in many data mining applications, but it is the first time to be applied to the problem of authorship attribution. Unlike the traditional approach, the extracted write-print by our method is unique among the suspects and, therefore, provides convincing and credible evidence for presenting it in a court of law. Experiments on real-life e-mails suggest that the proposed method can effectively identify the author and the results are supported by a strong evidence.

A road map for digital forensic research

Article

Jan 2001

G. Palmer

The Design and Development of an Undercover Multipurpose Anti-spoofing Kit (UnMask)

Article

Jan 2007

This paper describes the design and development of a software system to support law enforcement in investigating and prosecuting email based crimes. It focuses on phishing scams which use emails to trick users into revealing personal data. The system described in this paper, called the Undercover Multipurpose Anti-Spoofing Kit (UnMask), will enable investigators to reduce the time and effort needed for digital forensic investigations of email-based crimes. A novel aspect of UnMask is its use of a database to not only store information related to the email and its constituent parts (such as IP addresses, links, domain names), but also to organize a workflow to automatically launch UNIX tools to collect additional information from the Internet. The retrieved information is in turn added to the database. Reports can then be automatically generated according to the needs of the forensic investigator, including correlations across multiple email data stored in the database. UnMask is a working system. To the best of our knowledge, UnMask is the first comprehensive system that can automatically analyze emails and generate forensic reports that can be used for subsequent investigation and prosecution.

Research of an E-mail forensic and analysis system based on visualization

Conference Paper

Dec 2009

Nowadays, E-mail communication has been abused for numerous illegitimate purposes such as E-mail spamming, terrorist attack, business fraud, etc. As a result, to analysis the rich personal information hidden in E-mail is significant for investigation and evidence collection. In this paper, an investigation and analysis system aiming to Email was presented, which supports a variety of data sources including the preserved Email client data files, databases as well as text files. The system firstly parses related data files, preprocess the data, and then, a key word search technique based on KMP algorithm was adopted to classify the E-mail collections into different categories. Afterwards, an association frequency mining based on statistics will be performed to discover the association features behind email accounts. To make the forensic results more readable, we will associate the E-mail accounts with personnel information table in reality. The final forensic results will be visualized using related layout techniques to make the information more illustrative and understandable.

Mining E-mail Content for Cyber Forensic Investigation

Abstract and Figures

Recommended publications

Digital evidence and procedural protections: Potential of blockchain technology

An authorship attribution forensic method for web information

Social media risks from forensic point of view

Generic Model for Forensic Investigation of p2p Networks