
Visualization of sanitized email logs for spam analysis

March 2007


DOI:10.1109/APVIS.2007.329303

Source
IEEE Xplore

Conference: Visualization, 2007. APVIS '07. 2007 6th International Asia-Pacific Symposium on

Authors:

kun ling Ma

Institute of Nephrology



Visualization of Sanitized Email Logs for Spam Analysis

Chris Muelder∗

University of California, Davis

Kwan-Liu Ma†

University of California, Davis

ABSTRACT

Email has become an integral method of communication. How-

ever, it is still plagued by vast amounts of spam. Many statistical

techniques, such as Bayesian ﬁltering, have been applied to this

problem, and been proven useful. But these techniques in general

require training. Another common method of spam prevention is

blacklisting known spam sources. In order to do this, the sources

must be identiﬁed. What this paper presents is a set of visualization

techniques designed to show patterns in incoming email which can

reveal misidentiﬁed pieces of spam, common spam sources, and

patterns such as periods of increased spam activity, while maintain-

ing the privacy of the email. This can aid system administrators in

rapidly and effectively adjusting system level filters, which would

improve the quality of service and decrease the time and resources

wasted by spam.

1 INTRODUCTION

Email has become the standard for electronic communication.

Along with its popularity, it has gained the attention of advertisers,

who use email as a means of very inexpensively spreading adver-

tisements. Most users ﬁnd this spam annoying and would rather not

receive it. Spam also tends to waste resources such as bandwidth,

not just for the end users, but for the intermediate service providers

as well. Eliminating spam can free up such resources for legitimate

uses. Thus, many techniques have been created to combat spam.

Because of the rapid delivery time of email, and due to the desire for

privacy, these anti-spam techniques must be automatic and require

no human interaction. However, as solicitors learn the techniques in

use, they start to devise ways to work around them. For instance, if

the email delivery service starts ﬁltering on a word that is common

in spam, then solicitors will simply misspell or vary the word such

that the current detection software will not catch it. This leads to an

arms race between the spam and the ﬁlter, and so often spam can

get through. Various adaptive measures such as Bayesian ﬁlters [7]

have been developed; however they are content based and thus are

in general only convenient for end users. Thus, it would be beneﬁ-

cial to use visualization to get the big picture of incoming emails at

the system level to identify likely spam so that ﬁlters can be rapidly

adjusted accordingly, and so perhaps even Bayesian ﬁlters can be

employed effectively at the system level without the need for an

administrator to look at the content of the email. It could also al-

low the identiﬁcation of primary spam sources for blacklisting pur-

poses. The visualization techniques used in this paper accomplish

this task by presenting an abstract view of large amounts of email

so that large scale patterns such as spam can be seen and identiﬁed.

1.1 Privacy Issues

One of the difﬁculties of working with email is that email infor-

mation generally contains a lot of sensitive information, such as

the content or the subject lines. A system administrator looking at

this sensitive data would violate the privacy of email, which is often

∗e-mail: muelder@cs.ucdavis.edu

†e-mail: ma@cs.ucdavis.edu

granted in the terms of use of the email system. Thus, most logs and

email records are unavailable for analysis due to the need to keep

that sensitive information private. To work around this, one must

take special care to clean out the parts of the data that are sensitive.

In some cases even the senders’ and recipients’ email addresses can

be considered sensitive, so they must be abstracted or just scrubbed

out of the data. While some of this potentially sensitive data can

be useful in statistical ﬁltering (such as Bayesian ﬁlters), it is not

necessarily a distinguishing feature of spam. Spam is simply de-

ﬁned as unsolicited bulk email, regardless of content, so it should

be possible to identify spam even without the content to work with.

Thus, it should be sufﬁcient to visualize email patterns with the sen-

sitive content removed and still effectively identify spam, which is

the goal of the visualizations presented in this paper.

1.2 Bipartite Graph Representation

One abstraction of email communications is a directed graph, where

nodes represent addresses and edges represent emails between their

associated nodes. Since in this paper, only incoming email is con-

sidered, it can safely be assumed that the data can be split into two

disjoint sets, a set of external sending email addresses and a set of

internal receiving ones. Also, almost all edges will go from the ex-

ternal set to the internal set. Those that do not are either completely

internal and trusted, or completely external and unalterable. Thus,

these edges can be discarded and what is left is a bipartite graph,

where all edges go from the set of senders to the set of receivers.

By limiting the graph to a bipartite one, the options for visualization

can be clearer and faster than a general graph.
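As an illustration of this reduction, the following Python sketch filters a list of (source, destination) pairs down to the external-to-internal edges and counts repeated edges. The `INTERNAL_DOMAINS` set, the function names, and the sample addresses are assumptions made for the example and are not taken from the system described in this paper. Domains are matched exactly, so subdomains would need extra handling.

```python
from collections import Counter

# Assumption: one internal domain, matched exactly (no subdomain handling).
INTERNAL_DOMAINS = {"ucdavis.edu"}


def is_internal(address):
    """True when the address belongs to one of the assumed internal domains."""
    domain = address.rsplit("@", 1)[-1].lower()
    return domain in INTERNAL_DOMAINS


def bipartite_edges(emails):
    """Keep external -> internal emails and count them per (source, destination)."""
    edges = Counter()
    for source, destination in emails:
        if not is_internal(source) and is_internal(destination):
            edges[(source, destination)] += 1
    return edges


# Hypothetical sample data.
emails = [
    ("a@virgin.net", "x@ucdavis.edu"),
    ("a@virgin.net", "y@ucdavis.edu"),
    ("a@virgin.net", "x@ucdavis.edu"),
    ("x@ucdavis.edu", "y@ucdavis.edu"),  # completely internal: discarded
    ("a@virgin.net", "b@example.org"),   # completely external: discarded
]
edges = bipartite_edges(emails)
```

The edge counts kept here are what a renderer could later map to line opacity.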

1.3 Related Work

Using visualization to analyze email is not a new idea, as there are

a number of email visualizations that have been developed. Mostly,

these visualizations work with one of three types of email datasets:

public forums, end users’ inboxes, and organizational archives.

•Publicly accessible forums such as newsgroups are often the

easiest to work with since everything is accessible. Many vi-

sualizations have been developed to analyze the social net-

work represented by public email communication such as

through newsgroups [6, 8, 9]. Often, spam is prevented by

making the newsgroup somewhat exclusive, such as requiring

a password or user veriﬁcation before allowing posting. And

any user caught spamming is usually banned fairly quickly,

so spam is usually not as big of an issue as it is with private

email.

•The next easiest set of data to work with is the user’s own

inbox, since there is no privacy restriction. However, in this

case the scope is fairly limited, since only the email to or from

this one user can be seen. Many interesting email visualiza-

tions have been developed that focus on analyzing or manag-

ing one’s own email inbox or archive [1, 5]. Some of these are

useful in categorizing spam, but as these visualizations are on

a per user basis, they are usually not useful at the administra-

tive level.

•Organizational or system level email archives are the focus

of this paper, and are likely the most difﬁcult to work with

due to potential sensitivity of the data. Visualizations have

been developed to analyze emails at the organizational level

[2, 11], but these tend to focus on social structures. Due to the

lack of content analysis at this level, they do not usually focus

on spam detection and ﬁltering. The visualization methods

shown in this paper were designed to detect and analyze spam

and spam sources without the content.

The concept of visualizing bipartite graphs is also not new. Bipar-

tite graphs have been around for a long time, and so there have been

several works regarding how to visualize them effectively [3, 13].

There has even been work done on applying bipartite graph visu-

alization to source-destination relationships [12]. But these works

are not directed towards the visualization of email. Finally, identi-

fying spam has traditionally been performed using statistical meth-

ods [4, 7]. But these methods are generally content based, and so

they could be in violation of the user’s expectation of privacy if ap-

plied at the system level. Also, keeping statistical methods such

as Bayesian ﬁlters trained at the system level is a daunting task in

itself, which can be alleviated through the use of visualization.

2 DESIGN AND TECHNIQUES

The visualization presented in this paper was designed to work

with raw system log email records. These records include infor-

mation such as sources, destinations, arrival times, subjects, relays,

and spam scores as deﬁned by Spam Assassin [10], the statistical

spam detection system in place. Much more information was also

present, but this information was not used. Many parts of this data

can be considered sensitive, in particular the source and destination

addresses and subject lines. Thus, before any visualization could be

attempted, the sensitive data needed to be scrubbed out. Once the

sensitive data was scrubbed out, the relevant information needed to

be parsed out into a more structured format. Then, the time varying

bipartite graph represented by this data was displayed using visual-

ization techniques.

Data Sample 1 An example of raw log data.

Feb 12 20:36:32 baton sendmail[10529]: [ID 801593 mail.info] k1D4a2Zd010529: from=<5138651db1f0438111aa7bf24025145f@virgin.net>, size=13174, class=0, nrcpts=1, msgid=<e24a11cf32bd155b8284c28ad3cc6feb@doug>, proto=ESMTP, daemon=MTA, relay=ucdavis.edu [169.237.104.35]

Feb 12 20:36:34 baton mimedefang.pl[4245]: [ID 702911 mail.info] MDLOG,k1D4a2Zd010529,mail_in,,,<5138651db1f0438111aa7bf24025145f@virgin.net>,<785e0431a07a0bbadd089e63318b2122@ucdavis.edu>,(Subject Removed)

Feb 12 20:36:34 baton sendmail[10529]: [ID 801593 mail.info] k1D4a2Zd010529: Milter delete: header X-Spam-Score: 13.128 (*************) BAYES_80,HELO_DYNAMIC_IPADDR,HTML_80_90,HTML_IMAGE_ONLY_12,HTML_MESSAGE,MIME_QP_LONG_LINE,URIBL_SBL

Feb 12 20:36:34 baton sendmail[10529]: [ID 801593 mail.info] k1D4a2Zd010529: Milter add: header: X-Scanned-By: MIMEDefang 2.52 on 169.237.6.6

Feb 12 20:36:34 baton sendmail[10539]: [ID 801593 mail.info] k1D4a2Zd010529: to=785e0431a07a0bbadd089e63318b2122@ucdavis.edu, delay=00:00:02, xdelay=00:00:00, mailer=esmtp, pri=43286, relay=ucdavis.edu. [169.237.6.84], dsn=2.0.0, stat=Sent (k1D4aYH9008118 Message accepted for delivery)

Data Sample 2 An example of structured (parsed) log data.

----------id=k1D4a2Zd010529---------------

ID=k1D4a2Zd010529

TO=785e0431a07a0bbadd089e63318b2122@ucdavis.e

FROM=5138651db1f0438111aa7bf24025145f@virgin.net

SPAMSCORE=

DELAY=00:00:02

XDELAY=00:00:00

MAILER=esmtp

PRI=43286

RELAY=ucdavis.edu. [169.237.6.84]

DSN=2.0.0

STAT=Sent (k1D4aYH9008118 Message accepted for delivery)

TIME=Feb 12 20:36:34
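The step from the raw log to the structured record can be sketched in Python as follows. This is only an illustration and not the parsing script used for this paper. It handles just the sendmail `from=` and `to=` entries, skips the milter entries that carry the spam score, and uses an abridged copy of the sample entries. The regular expression and the function names are assumptions. The upper-case field names follow the structured sample.

```python
import re

# Matches the syslog prefix of a sendmail entry and captures the timestamp,
# the queue ID, and the remainder of the line.
LINE_RE = re.compile(
    r"^(?P<time>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ sendmail\[\d+\]: "
    r"\[ID \d+ \S+\] (?P<id>\w+): (?P<body>.*)$"
)


def parse_line(line):
    """Return the key=value fields of a from=/to= entry, or None otherwise."""
    match = LINE_RE.match(line)
    if match is None or not match.group("body").startswith(("from=", "to=")):
        return None
    fields = {"ID": match.group("id"), "TIME": match.group("time")}
    for part in match.group("body").split(", "):
        key, sep, value = part.partition("=")
        if sep:
            fields[key.upper()] = value.strip("<>")
    return fields


def parse_log(lines):
    """Merge all entries that share a queue ID into one record per email."""
    records = {}
    for line in lines:
        fields = parse_line(line)
        if fields is not None:
            records.setdefault(fields["ID"], {}).update(fields)
    return records


# Abridged entries from the raw sample (msgid field omitted for brevity).
sample = [
    "Feb 12 20:36:32 baton sendmail[10529]: [ID 801593 mail.info] "
    "k1D4a2Zd010529: from=<5138651db1f0438111aa7bf24025145f@virgin.net>, "
    "size=13174, class=0, nrcpts=1, proto=ESMTP, daemon=MTA, "
    "relay=ucdavis.edu [169.237.104.35]",
    "Feb 12 20:36:34 baton sendmail[10539]: [ID 801593 mail.info] "
    "k1D4a2Zd010529: to=785e0431a07a0bbadd089e63318b2122@ucdavis.edu, "
    "delay=00:00:02, xdelay=00:00:00, mailer=esmtp, pri=43286, "
    "relay=ucdavis.edu. [169.237.6.84], dsn=2.0.0, "
    "stat=Sent (k1D4aYH9008118 Message accepted for delivery)",
]
records = parse_log(sample)
```

Because later entries overwrite earlier ones for the same queue ID, the `TIME` and `RELAY` fields of the merged record come from the final `to=` entry, as in the structured sample.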

2.1 Scrubbing and Filtering

The data in its original raw form consists of Linux system logs

which contain the system log messages sent by the various email

daemons. Since these messages are very unstructured, the process

of extracting only the non-sensitive data in a structured form in-

volves a lot of text manipulation, and so the programs to handle

this were written in Perl. The ﬁrst script takes the data and scrubs

out the sensitive information by encrypting or removing the user

names and removing the subject lines. The next script takes the

log, parses out the relevant information, and outputs the data in a

structured form for parsing in a visualization program.
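A minimal sketch of the scrubbing step is given below in Python. The original scripts were written in Perl and are not reproduced here. This version assumes that user names are replaced by a keyed hash so that the same user always maps to the same token while the domain stays readable. The key, the regular expressions, and the `subject=` field layout are hypothetical.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical key, never stored with the logs
ADDRESS_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+)")
SUBJECT_RE = re.compile(r"subject=.*$")  # hypothetical field layout


def scrub_address(match):
    """Replace the user name with a 32-character keyed hash and keep the domain."""
    user, domain = match.group(1), match.group(2)
    digest = hmac.new(SECRET_KEY, user.lower().encode("utf-8"), hashlib.sha256)
    return f"{digest.hexdigest()[:32]}@{domain}"


def scrub_line(line):
    """Hash every user name on the line, then blank out the subject."""
    line = ADDRESS_RE.sub(scrub_address, line)
    return SUBJECT_RE.sub("subject=(Subject Removed)", line)


scrubbed = scrub_line("from=<alice@virgin.net>, subject=Cheap meds")
```

Keeping the domain intact matters for the later views, which group addresses by domain. A keyed hash is used instead of a plain hash so that common user names cannot be recovered by hashing guesses without the key.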

2.2 Bipartite Graph View

One way of rendering a bipartite graph is to line up all the vertices in

one set of the graph on one side, line up all the vertices in the other

set on the other side, and then render edges between these vertices

as for a normal graph. This creates a relatively easy to understand

picture, which is good for showing patterns in the edges. Also,

since the vertices are ﬁxed, the edges can change with time while

keeping a constant frame of reference. Sources are lined up on the

left, destinations are lined up on the right, and edges are drawn

between them for each email. The addresses are grouped according

to domain names, which are sorted alphabetically. Color is used to

represent the spam score that Spam Assassin gave it, where black

means a spam score of zero, and the more red the line the higher the

spam score. Opacity represents how many emails there were from

that source to that destination in that time range. That is, when it is

nearly transparent there were very few emails but when it is opaque

there were many emails. Figure 1 shows the results of applying

these techniques to a small amount of email. The ﬁrst and most

obvious pattern shown is that the majority of the email originates

from just a few sources. Since much of the known spam originates

from these same addresses, it is likely that the rest of this email is

spam as well. It also indicates which addresses or domains should

likely be blacklisted, or in the case of internal addresses, be audited.
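The layout and color rules described above can be sketched in Python as follows. The normalization constants `max_score` and `max_count` are assumptions, since the exact scales used by the visualization are not stated, and the function names are hypothetical.

```python
def layout_positions(addresses):
    """Order addresses by domain, then by address, and spread them over [0, 1]."""
    ordered = sorted(set(addresses), key=lambda a: (a.rsplit("@", 1)[-1], a))
    step = 1.0 / max(len(ordered) - 1, 1)
    return {address: index * step for index, address in enumerate(ordered)}


def edge_rgba(spam_score, count, max_score=10.0, max_count=20):
    """Map spam score to redness (0 is black) and email count to opacity."""
    red = min(max(spam_score, 0.0), max_score) / max_score
    alpha = min(count, max_count) / max_count
    return (red, 0.0, 0.0, alpha)


# Hypothetical addresses: the a.com domain sorts before z.org.
positions = layout_positions(["b@z.org", "a@z.org", "c@a.com"])
color = edge_rgba(13.128, 1)  # high score, single email: bright red, faint
```

Scores are clamped because a spam score can fall outside the assumed range. Without clamping, a few extreme scores would wash out the rest of the color scale.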

2.3 Scatterplot View

When the number of emails gets large, the graph starts to become

incoherent, since there are just too many overlapping edges. In

order to resolve this, a scatterplot was added as an alternate view.

While somewhat less intuitive, it is better at handling large numbers

of emails since there is no overlapping. Sources are still lined up on

the left side, but now destinations are lined up along the bottom, and

emails are shown simply as points at the coordinates corresponding

to their source and destination addresses. The same color and opacity rules are applied as in the graph view: color is spam score and opacity is number of emails. Figure 2 shows an example of this technique applied to an entire month of emails. In this view, horizontal lines correspond to bulk email, which is often spam. The more interesting pattern revealed here, however, is the presence of vertical lines, which indicate addresses that receive lots of email from everywhere. These vertical lines all seem to share the same pattern, which indicates that they are all on the same set of external lists. In fact, this is a likely indicator that these are a subset of internal email addresses that are on a particular spam list.

Figure 1: A graph of emails over the period of 1 hour. Sources are on the left, destinations on the right, and semitransparent lines are drawn between them to represent emails (known spam is colored red). Blue delimits different domains, and the orange line designates the current focus, which is labeled. Bulk email (such as spam) forms fan-like shapes from one source to many destinations. As can be seen, in this hour most of the spam originated from a select few sources.

Figure 2: A month’s worth of emails shown in a scatterplot. Sources are on the left, destinations are along the bottom. Points represent emails. Bulk mail, such as most spam, appears as horizontal lines (one source, many destinations). Another odd pattern revealed here is the set of vertical lines (many sources, one destination). Interestingly, many of these vertical lines have very similar patterns, indicating the same set of sources. This might indicate a spammer that is using a botnet to send email from distributed, compromised sources.

(a) A 3d graph of emails over the period of about a day.

(b) A 3d extension of the scatterplot over the whole month.

Figure 4: 3d versions of the graph and scatterplot. The added dimension is time. This allows larger amounts of data to be viewed at any one time, at the cost of introducing occlusion. It also shows how many patterns persist through time as opposed to traffic bursts.
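The horizontal and vertical lines of the scatterplot correspond to sources with many destinations and destinations with many sources. A numerical counterpart of what the eye picks out can be sketched in Python as below. The threshold, the use of Jaccard similarity to compare two vertical lines, and all names are assumptions for illustration and are not part of the visualization itself.

```python
from collections import defaultdict


def fan_maps(edges):
    """Return (destinations per source, sources per destination) as sets."""
    destinations_of = defaultdict(set)
    sources_of = defaultdict(set)
    for source, destination in edges:
        destinations_of[source].add(destination)
        sources_of[destination].add(source)
    return destinations_of, sources_of


def find_lines(edges, threshold=3):
    """Sources and destinations that would show up as horizontal/vertical lines."""
    destinations_of, sources_of = fan_maps(edges)
    horizontal = sorted(s for s, d in destinations_of.items() if len(d) >= threshold)
    vertical = sorted(d for d, s in sources_of.items() if len(s) >= threshold)
    return horizontal, vertical


def jaccard(a, b):
    """Overlap of two sets: 1.0 means the same members, 0.0 means none shared."""
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical data: s1 is a bulk sender, d8 and d9 share the same senders.
edges = [("s1", "d1"), ("s1", "d2"), ("s1", "d3")]
edges += [(s, d) for d in ("d8", "d9") for s in ("s2", "s3", "s4")]
horizontal, vertical = find_lines(edges)
_, sources_of = fan_maps(edges)
similarity = jaccard(sources_of["d8"], sources_of["d9"])
```

A similarity near 1.0 between two vertical lines corresponds to the observation that several destinations receive mail from the same set of sources.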

2.4 Timeline Overview

Often times one does not just want to know what happened but also

when it happened. Thus it is useful to present information regarding

when events occur. Also, looking at the entire data set at once can

be confusing or overwhelming because of the large size of the data,

so it is often helpful to look at small time segments individually. A

timeline was added to the visualization as an overview of the

data set from which individual time segments can be selected. An

example of this timeline is shown in Figure 3. In this timeline, the

top edge of the black area represents the total number of emails and

the red area represents the portion of these emails that were iden-

tiﬁed as spam. The grey highlight indicates the currently selected

time period, which would be shown in the graph or scatterplot view.

The ﬁrst pattern to notice is that there is a distinct cyclical pattern

corresponding to the work week. That is, there is a repeating pat-

tern of ﬁve peaks followed by two much shorter peaks. There are

also some spikes that do not correspond to this pattern. These cor-

respond to other increased amounts of activity, either due to brief

floods or merely increased activity. The overall low level of identified spam

is indicative of how poorly Spam Assassin identifies spam.
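The data behind such a timeline is a count of emails per hour together with the count of those marked as spam. A minimal Python sketch follows. The spam threshold of 5.0 and the function name are assumptions. The year is passed in explicitly because the log timestamps do not carry one.

```python
from collections import Counter
from datetime import datetime

SPAM_THRESHOLD = 5.0  # assumed cutoff above which an email counts as known spam


def hourly_timeline(records, year=2006):
    """Return sorted (hour, total emails, spam emails) tuples."""
    total = Counter()
    spam = Counter()
    for time_text, score in records:
        stamp = datetime.strptime(f"{year} {time_text}", "%Y %b %d %H:%M:%S")
        hour = stamp.replace(minute=0, second=0)
        total[hour] += 1
        if score >= SPAM_THRESHOLD:
            spam[hour] += 1
    return [(hour, total[hour], spam[hour]) for hour in sorted(total)]


# Hypothetical (timestamp, spam score) pairs.
timeline = hourly_timeline([
    ("Feb 12 20:36:32", 13.128),
    ("Feb 12 20:50:00", 0.0),
    ("Feb 12 21:05:00", 6.0),
])
```

In the rendered timeline, the second value of each tuple would give the height of the black area and the third the height of the red area.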

2.5 3D View

While the two dimensional graph and scatterplot representations are

useful for viewing the emails themselves, they do not show anything

about the timing information. On the other hand, the timeline

view shows the timing information of large scale events well,

but not what sources and destinations were involved. So, both of

the two dimensional visualizations were extended to three dimen-

sional views that show both the time information and the source

and destination information. This allows one view to summarize

the data. As the number of emails increases, occlusion can begin to

be a problem, so often times more transparency is required. Thus,

small scale patterns such as individual emails can tend to be lost in

this view. However, spam messages in general are not small scale

patterns, and so this is not a very large problem for spam analysis. Figure 4(a) shows an example of about a day’s worth of email shown in a three dimensional graph view and Figure 4(b) shows a month’s worth of email shown in the three dimensional scatterplot view. In Figure 4(a), it can easily be seen that the two largest sources of bulk email tend to be a continuous source of email, and not merely occasional senders. In Figure 4(b), a pattern is revealed that was invisible in the 2d view: lines parallel to the time axis. These lines are cases where there was one source, one destination, and a continuous stream of traffic. While this could be due to automated messages such as news feeds, it is also possible that these destinations are spam relays, which take spam and resend it to other addresses.

Figure 3: A timeline view of the emails per hour over a month. The red area represents the portion of the emails that are known spam according to Spam Assassin. The grey region is the currently selected time period. A repeating pattern of five large peaks followed by two smaller (or nonexistent) peaks corresponds to the five day work week plus two day weekends.

Figure 5: One option for presenting the time aspect of the data is to animate it, either by showing sequential time-steps or by showing a sliding time window. Three consecutive frames are shown here. In them, it can be seen that some patterns are persistent while others come and go.

2.6 Animation

Another option for displaying time information is to actually use

time to represent it. That is, timing information can be represented

by animating the view of the data. This is likely the most intu-

itive way to represent time, but it can be somewhat difﬁcult to use

effectively since it relies on the user’s memory and reaction time.

However, it does keep the visualization down to two dimensions,

so that the inherent difﬁculties of three dimensional visualizations

can be avoided. So the visualization presented here has a capability

to animate the edges of the email graph. This can be done by simply

showing sequential time steps at a fixed rate. A slightly more

complex way that was also implemented is a sliding time window,

where a constant amount of time is shown while new time steps

are added and old ones removed. This creates some persistence to

patterns with short durations, while allowing better comparison be-

tween sequential time steps. The nodes that correspond to sources

and destinations are kept constant, which provides a frame of reference

as the edges are changing. A few frames of the

animation process are shown in Figure 5.
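The sliding time window can be sketched in Python as a bounded queue of time steps: each new step pushes the oldest one out, and every frame draws the union of the steps currently held. The function name and the window length are assumptions for illustration.

```python
from collections import deque


def sliding_windows(steps, window=3):
    """Yield, for each frame, the edges of the most recent `window` time steps."""
    recent = deque(maxlen=window)  # the oldest step is dropped automatically
    for step in steps:
        recent.append(step)
        yield [edge for held in recent for edge in held]


# Hypothetical time steps, each holding the edges seen in that step.
steps = [["a"], ["b"], ["c"], ["d"]]
frames = list(sliding_windows(steps, window=2))
```

With `window=1` this reduces to showing sequential time steps one at a time, so both animation modes can share one code path.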

2.7 Zooming and Filtering

Since there are a very large number of email addresses in both the

set of sources and the set of destinations, it can be quite difﬁcult if

not impossible to single one out from the full graph or scatterplot

views. So, a capability was added to zoom into the source and destination axes in order to both reduce the complexity of the graph and allow the user to focus on regions of interest. On the graph this is done by selecting regions of the sources or regions of the destinations with the mouse. On the scatterplot, as is shown in Figure 6, both are done simultaneously by selecting a rectangular area of interest with the mouse.

Figure 6: In order to see details more clearly, a capability was added to zoom into regions of the scatterplot. In this image, the ucdavis.edu domain was focused on for both sources and destinations. This reveals an interesting pattern of a diagonal line, which is indicative of people emailing themselves. Zooming into the graph is done very similarly.

Figure 7: A detail view of the email originating from a selected address, and the emails originating from the addresses it sent emails to. The size of each segment of the ring corresponds to what percentage of the email was sent there, and the color is random. Color and position of the nodes in the middle are derived from the color and position of each address in the first level that sent it email. This is useful for showing patterns such as email relays, where spam is forwarded by a compromised user.

(a) All spam over the month (b) Spam from U. C. Davis

Figure 8: Email marked as spam by Spam Assassin over the whole month. (b), (c), and (d) focus on some of the larger domains. In it we can see that ucdavis.edu is the source for a lot of spam, yahoo.com for less, and hotmail.com for even less.

2.8 Detail View

Once the graph or scatterplot visualizations have been used to iden-

tify an address of interest, it is beneﬁcial to be able to investigate

this address in more detail. Figure 7 shows one possible visual-

ization of the details of a node. This visualization shows a repre-

sentation of the emails’ outgoing social network structure, where

the ring shows one degree away from the selected address, and the

nodes in the middle show the second degree away. In the ring, each

section represents an address that the selected address emailed, and

the size indicates how much email was sent to that address. The col-

ors were chosen randomly. The nodes in the middle are addresses

that received emails sent by members of the ﬁrst level. They are

positioned and colored according to which nodes in the ﬁrst level

sent them email. For example, if a node received email from exactly

three of the nodes in the first level, then it would be placed

equidistant from these three nodes and its color would be the aver-

age color of the nodes. The edges that would normally be drawn

are only shown for the currently selected node in the ﬁrst level for

clarity, since otherwise it can become quite cluttered.
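The placement rules of the detail view can be sketched in Python as follows. Each ring segment is represented by the midpoint of its arc on a unit circle. Note that a point strictly equidistant from three points on a circle is always the center of that circle, so this sketch uses the centroid of the sending nodes instead. That choice is an assumption about the intended placement, as are all function names.

```python
import math


def ring_layout(sent_counts):
    """Place each first-level address at the midpoint of its arc on a unit ring.

    The arc length is proportional to the share of email sent to that address.
    """
    total = sum(sent_counts.values())
    layout = {}
    start = 0.0
    for address, count in sent_counts.items():
        sweep = 2 * math.pi * count / total
        middle = start + sweep / 2
        layout[address] = (math.cos(middle), math.sin(middle))
        start += sweep
    return layout


def inner_node(senders, positions, colors):
    """Position and color a second-level node as the mean of its senders."""
    n = len(senders)
    x = sum(positions[s][0] for s in senders) / n
    y = sum(positions[s][1] for s in senders) / n
    color = tuple(sum(colors[s][c] for s in senders) / n for c in range(3))
    return (x, y), color


# Hypothetical example: two first-level addresses with equal shares.
positions = ring_layout({"a": 1, "b": 1})
colors = {"a": (1.0, 0.0, 0.0), "b": (0.0, 0.0, 1.0)}
point, color = inner_node(["a", "b"], positions, colors)
```

With a centroid, a node that received email from only one first-level address sits on the ring next to it, while a node shared by addresses on opposite sides drifts toward the middle.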

3 CASE STUDIES

In order to test the effectiveness of this visualization, it was run on

data collected from the month of February, 2006. This data con-

sisted of system logs from the University of California, Davis. This

is the month of data shown in the timeline view of Figure 3, and

used in the rest of the Figures. While exploring this data, many interesting patterns were discovered. Several have already been discussed along with their respective Figures in the previous section, and others are discussed here.

(a) A common spam pattern of one source and many destinations. This shows a large amount of spam originating from an address at “cuna.org”.

(b) An odd spam pattern where just a single user was targeted by nearly the entire “globo.com” domain.

Figure 9: One to many and many to one spam patterns. These kinds of patterns are fairly common to spam.

3.1 Known Spam

Much can be learned by simply ﬁltering out all but the mail known

to be spam according to Spam Assassin. Figure 8(a) shows the

known spam from all sources over the entire month. From it, it

can easily be seen that there are deﬁnitely some primary sources

from which the majority of spam originates. In fact, the largest sev-

eral spam sources seem to be internal to U. C. Davis itself, which

is shown in more detail in Figure 8(b). In this Figure, it can eas-

ily be seen that there are a handful of addresses from which the

majority of spam originates, indicating which addresses should be

blocked or at least audited since they are internal. While most of

the other spam sources are from individual small domains, there

is also spam originating from larger more common domains such

as Yahoo and Hotmail, which are shown in Figures 8(c) and 8(d)

respectively. Of interest here is that Hotmail apparently does a better

job of preventing outgoing spam messages than Yahoo, but it is still

not perfect. Why this occurs is beyond the scope of the research

possible with just this dataset, and would require a more in depth

study. However, this visualization does produce a set of individual

addresses in these and other domains that could be used as the basis

for a blacklist. Another interesting pattern seen here is the set of

common spam receivers. These are email addresses that have apparently

been collected by an outside spammer somehow, because

spam from many different sources seems to concentrate on these

select addresses. It is also possible that these are the addresses of

email lists, and so they are targeted by many spammers because do-

ing so increases the spam’s audience while decreasing the expense

on the side of the spammer.

3.2 Isolating Spam Sources

Once spam sources have been located in the big picture, the zoom-

ing feature can be used to hone in on them and identify the actual

address or addresses from which the spam is originating. In the

case of Figure 9(a), a large amount of spam was traced down to

a single source, originating from “cuna.org.” Once this source has

been identiﬁed, it can be further analyzed by viewing it in a detail

view or by going back to the original data. Alternately, it could

be good enough to monitor its future actions more closely or treat

mail coming from it more harshly, making it more difficult for spam

originating from it to get past the ﬁlters (by giving it an initial spam

score for example).

(a) Spam is often relayed through bot networks. The lines in the scatterplot shown at left are potentially compromised users. At right, they are shown in more

detail in the bipartite graph view, and it is seen that the source was the null source address (missing@data.edu).

(b) Detail view of the addresses that received mail from the null source and

those addresses they sent email to. The ﬁve lines shown in Figure 10(a)

correspond to the ﬁve large segments of the outer ring. From here, it can be

seen by selecting each of these regions that each of their addresses were also

sending mail to many destinations, just like a spam relay.

these addresses in the overview, it can be seen that while they did send to

many destinations, it was in low volume. Since this was much less than the

incoming mail from the null address, they can not have been actually relaying

spam in this case.

Figure 10: An email pattern similar to that of a set of spam relays. However, in this case this trafﬁc was not actually spam.

Figure 11: A recurring trafﬁc burst. About a day apart, these patterns show emails from one source to one destination that were sent many

times. Investigation reveals that these are emails that were repeatedly deferred by the server.

3.3 Spam Focus

One interesting pattern that was detected that was not really ex-

pected is shown in Figure 9(b). In this Figure, nearly the entire

domain of “globo.com” is sending spam to exactly one destination.

One possible explanation is that this indicates that the user gave his

or her email address out for some reason, and this domain is taking

advantage of this. Another possibility is that this user signed up for

something legitimate on this site and the mail is being mislabeled

as spam. However, there are few reasons for legitimate email to be

coming from several distinct sources that comprise almost the entire

domain. If further analysis reveals that this is

truly spam, then this entire domain would be a likely candidate for

blacklisting.

3.4 Spam Relays

When sending spam, a spammer can use a compromised innocent

user as a proxy in order to bypass spam prevention methods such

as blacklists. In this case, the spammer would continually send the

spam to the user, possibly encrypted, where it would be intercepted

by a daemon which would then forward it to a set of destinations.

The pattern found in Figure 4(b) was indicative of this kind of traf-

ﬁc, so it was explored in more detail. In Figure 10(a), the source of

5 of these lines was focused in on, then shown in the two dimen-

sional bipartite graph view. In this view, it is seen that the source of

these emails had the special token “missing@data.edu,” indicating

that there was no source address in the log ﬁles. Since it does not

really make any sense for a legitimate email to have a null source

address, this indicates that this is more likely spam. From here, the

null source was selected for viewing in the detail view, and the re-

sult was the image shown in Figure 10(b), which shows some of the

relaying properties better. However, when the addresses of the ﬁve

destinations were viewed in the normal graph view, it was seen that

they in fact were not sending spam/ Thus, whatever mail they were

receiving from the null address was getting blocked or dropped or

was in fact not spam to be relayed.
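
The check performed visually above, comparing how much mail an address receives from the null source with how much it sends on, can be sketched in a few lines. This is an illustration under stated assumptions rather than part of the tool: the `(source, destination)` record format and the name `relay_check` are hypothetical, and the simple volume comparison is only a rough stand-in for inspecting the graph views.

```python
from collections import Counter

NULL_SOURCE = "missing@data.edu"  # token used in the sanitized logs


def relay_check(records, null_source=NULL_SOURCE):
    """Compare mail received from the null source with mail sent onward.

    records: iterable of (source, destination) pairs (hypothetical format).
    Returns {address: (received_from_null, sent_total, plausible_relay)},
    where plausible_relay is True only if the address sends at least as
    many messages as it received from the null source.
    """
    records = list(records)  # the data is traversed twice
    received = Counter(dst for src, dst in records if src == null_source)
    sent = Counter(src for src, dst in records if src in received)
    return {
        addr: (received[addr], sent[addr], sent[addr] >= received[addr])
        for addr in received
    }
```

An address whose outgoing volume is far below its incoming volume from the null source, as was the case for the five destinations here, would be reported with `plausible_relay` set to `False`.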

3.5 Recurring Email Deferring

One anomaly revealed by the visualization was a recurring pattern of emails from individual sources to individual destinations that were repeated many times. This odd pattern occurred each night at about the same time, between 2:00 A.M. and 3:00 A.M., and is shown in Figure 11. It turns out that this pattern was caused by emails being deferred repeatedly, so that a single email creates many messages in the system log. Since this deferral apparently occurs daily, it is likely not due to an external cause, such as a denial of service attack. Rather, it is likely due to an internal issue of some sort, which it might be beneficial to identify and fix. Thus, the visualization presented here is capable of identifying internal issues as well as external attacks.
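
A recurring burst of this kind could also be flagged automatically by counting how often the same source and destination pair appears within one hour, and then checking whether that happens at the same hour on several days. The sketch below assumes a hypothetical record format `(timestamp, source, destination)`; the function name and both thresholds are illustrative choices, not values taken from this work.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Iterable, Tuple


def recurring_bursts(
    records: Iterable[Tuple[datetime, str, str]],
    min_repeats: int = 5,
    min_days: int = 2,
):
    """Find (hour, source, destination) bursts that repeat across days.

    A burst is one source/destination pair logged at least min_repeats
    times within a single clock hour. Only bursts seen at the same hour
    on at least min_days distinct dates are returned, as a dict mapping
    (hour, source, destination) to the sorted list of dates.
    """
    per_hour = Counter()
    for timestamp, source, destination in records:
        per_hour[(timestamp.date(), timestamp.hour, source, destination)] += 1

    days = defaultdict(set)
    for (day, hour, source, destination), count in per_hour.items():
        if count >= min_repeats:
            days[(hour, source, destination)].add(day)

    return {key: sorted(d) for key, d in days.items() if len(d) >= min_days}
```

Applied to log entries like those behind Figure 11, a pair that is logged many times between 2:00 A.M. and 3:00 A.M. on consecutive nights would be returned with hour `2` and the list of affected dates.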

3.6 Other Anomalous Patterns

While not necessarily spam related, many other patterns can be seen with the visualization techniques shown in this paper. For example, in Figure 12(a), the internal U. C. Davis traffic was focused on, and a pair of dashed lines parallel to the time axis was found. When zoomed in and shown in more detail, these lines turn out to be emails from one source to a pair of destinations, with a volume that depends on the work day. That is, the amount of email sent from this source appears to follow the standard work week pattern of five peaks of traffic in the middle of each weekday, followed by two lesser peaks on the weekend. This kind of pattern could be explained by automatic forwarding of a tech support address. That is, mail sent to the tech support address could be automatically forwarded to a pair of tech support people.
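
The work week pattern described above can be given a rough numeric form by comparing the average daily volume on weekdays with that on weekends for a single source. This is only a sketch under assumptions: the input is taken to be one `date` per logged message from the source of interest, and the helper name `weekday_weekend_means` is hypothetical. Days with no messages at all are not counted, so a completely silent weekend yields a weekend mean of 0.0.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Tuple


def weekday_weekend_means(dates: Iterable[date]) -> Tuple[float, float]:
    """Return (mean messages per weekday, mean messages per weekend day).

    dates: one entry per message sent by the source under inspection.
    Only days that appear in the input contribute to the means.
    """
    per_day = Counter(dates)
    weekday = [n for d, n in per_day.items() if d.weekday() < 5]   # Mon-Fri
    weekend = [n for d, n in per_day.items() if d.weekday() >= 5]  # Sat-Sun

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(weekday), mean(weekend)
```

A weekday mean several times larger than the weekend mean would be consistent with the five large peaks and two lesser peaks seen in Figure 12(b).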

4 FUTURE WORK

The results generated by this visualization are quite useful, but there are still ways in which they could be improved. It would be useful to extract and use more of the data in the raw log records that is currently ignored. In fact, it could even be useful to be able to simply drill down from the email representation to the actual raw log entry that defines it. A variation that could be interesting to investigate is to reverse the selection of data and consider only the outgoing mail. This would be useful in identifying compromised internal systems, or users who are intentionally misusing the system. It could also be useful to use a tool such as this as an interface to create filter rules in a sandbox environment. That is, potential spam could be identified, and a filter rule could be made, applied to the data, and fed back into the visualization. This would make a useful feedback system. Finally, it would be beneficial to test the system against datasets where the ground truth is known, perhaps even artificially generated datasets, in order to more fully validate the effectiveness of the visualizations.

(a) A pair of dashed lines. (b) One week of the dashed lines in more detail.

Figure 12: An anomalous pattern. These emails match up with the standard work week pattern yet are from one source to exactly two destinations. Possibly caused by an automatic forwarding script.

5 CONCLUSIONS

Visualization of incoming email is an effective way to get an overall feel for what kinds of email are coming into the network, even without using potentially sensitive information such as subject lines. The visualization techniques used in this paper, while not very complex, quite clearly point out some interesting features in the data which would likely be of interest to a system administrator. They are quite effective at pointing out which sources are predominantly responsible for the incoming spam on a network. The visualizations are also fairly effective at revealing spam messages that were not identified by the system-wide filtering process, which can then be used to train system-wide filters. This system inherently has an issue with false positives. Since all the sensitive data has been removed, there is no way to confirm whether a spam source identified by this tool is really sending spam without going back to the original data. As with statistical measures, care should be taken when dealing with spam sources detected with this tool, since it is possible that they are legitimate users that are being used as spam bots, and blacklisting them could be construed as denial of service. However, the results of this tool may be considered sufficient grounds to go past the privacy wall and view the sensitive data in suspicious cases to confirm their spam content. These features would make it a valuable tool for an email system administrator by enabling rapid identification of weaknesses in the system's spam prevention mechanisms.

ACKNOWLEDGEMENTS

This work is sponsored in part by the National Science Foundation under contracts CCF 0222991, OCI 0325934, IIS 0552334, and CCF 0634913. Special thanks to Ken Jones and Ken Gribble of the Computer Science Department at the University of California, Davis for supplying the data and helping with the data scrubbing process.

REFERENCES

[1] Kerr, Bernard (2003). Thread Arcs: An Email Thread Visualization. 2003 IEEE Symposium on Information Visualization, pp. 27.

[2] Li, W., Hershkop, S. and Stolfo, S. J. (2004). Email Archive Analysis Through Graphical Visualization. Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security, pp. 128-132.

[3] M. Newton, O. Sýkora, and I. Vrto. Two new heuristics for two-sided bipartite graph drawing. In Graph Drawing, pages 312-319, 2002.

[4] Patrick Pantel and Dekang Lin. “SpamCop: A Spam Classification and Organization Program.” Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

[5] Rohall, S. L., Gruen, D., Moody, P., Wattenberg, M., Stern, M., Kerr, B., Stachel, B., Dave, K., Armes, R. and Wilcox, E. (2004). ReMail: A Reinvented Email Prototype. Proceedings of ACM Human Factors in Computing Systems (CHI 2004), pp. 791-792.

[6] Sack, W. (2000). Discourse Diagrams: Interface Design for Very Large Scale Conversations. Proceedings of the 33rd Hawaii International Conference on System Sciences, January 2000, p. 3034.

[7] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. “A Bayesian Approach to Filtering Junk E-Mail.” Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

[8] Smith, M. (2002). Tools for Navigating Large Social Cyberspaces. Communications of the ACM, vol. 45, no. 4, April 2002, pp. 51-55.

[9] Smith, M. (1999). Invisible Crowds in Cyberspace: Measuring and Mapping the Social Structure of USENET. In Smith, M. and Kollock, P. (eds.): Communities in Cyberspace, Routledge Press, London, 1999.

[10] http://spamassassin.apache.org/.

[11] Tyler, J. R., Wilkinson, D. M., Huberman, B. A. (2003). Email as Spectroscopy: Automated Discovery of Community Structure within Organizations. Communities and Technologies, pp. 81-96.

[12] William Yurcik, “VisFlowConnect-IP: A Link-Based Visualization of NetFlows for Security Monitoring,” 18th Annual FIRST Conference on Computer Security Incident Handling, Baltimore, MD, USA, June 25-30, 2006.

[13] Lanbo Zheng, Le Song, Peter Eades. Crossing minimization problems of drawing bipartite graphs in two clusters. Proceedings of the 2005 Asia-Pacific Symposium on Information Visualisation, pp. 33-37, January 2005, Sydney, Australia.
