Visualization of sanitized email logs for spam analysis

Chris Muelder
University of California, Davis
Kwan-Liu Ma
University of California, Davis
ABSTRACT
Email has become an integral method of communication. However, it is
still plagued by vast amounts of spam. Many statistical techniques,
such as Bayesian filtering, have been applied to this problem and have
proven useful, but these techniques generally require training.
Another common method of spam prevention is blacklisting known spam
sources, which requires that the sources first be identified. This
paper presents a set of visualization techniques designed to show
patterns in incoming email that can reveal misidentified pieces of
spam, common spam sources, and patterns such as periods of increased
spam activity, while maintaining the privacy of the email. This can
aid system administrators in rapidly and effectively adjusting system
level filters, which would improve the quality of service and decrease
the time and resources wasted by spam.
1 INTRODUCTION
Email has become the standard for electronic communication.
Along with its popularity, it has gained the attention of advertisers,
who use email as a means of very inexpensively spreading adver-
tisements. Most users find this spam annoying and would rather not
receive it. Spam also tends to waste resources such as bandwidth,
not just for the end users, but for the intermediate service providers
as well. Eliminating spam can free up such resources for legitimate
uses. Thus, many techniques have been created to combat spam.
Because of the rapid delivery time of email, and due to the desire for
privacy, these anti-spam techniques must be automatic and require
no human interaction. However, as solicitors learn the techniques in
use, they start to devise ways to work around them. For instance, if
the email delivery service starts filtering on a word that is common
in spam, then solicitors will simply misspell or vary the word such
that the current detection software will not catch it. This leads to an
arms race between the spam and the filter, and so often spam can
get through. Various adaptive measures such as Bayesian filters [7]
have been developed; however they are content based and thus are
in general only convenient for end users. Thus, it would be benefi-
cial to use visualization to get the big picture of incoming emails at
the system level to identify likely spam so that filters can be rapidly
adjusted accordingly, and so perhaps even Bayesian filters can be
employed effectively at the system level without the need for an
administrator to look at the content of the email. It could also al-
low the identification of primary spam sources for blacklisting pur-
poses. The visualization techniques used in this paper accomplish
this task by presenting an abstract view of large amounts of email
so that large scale patterns such as spam can be seen and identified.
1.1 Privacy Issues
One of the difficulties of working with email is that email records
generally contain a lot of sensitive information, such as the content
or the subject lines. A system administrator looking at this sensitive
data would violate the privacy of email, which is often granted in the
terms of use of the email system. Thus, most logs and email records
are unavailable for analysis due to the need to keep that sensitive
information private. To work around this, one must take special care
to clean out the parts of the data that are sensitive. In some cases
even the senders’ and recipients’ email addresses can be considered
sensitive, so they must be abstracted or just scrubbed out of the
data. While some of this potentially sensitive data can be useful in
statistical filtering (such as Bayesian filters), it is not
necessarily a distinguishing feature of spam. Spam is simply defined
as unsolicited bulk email, regardless of content, so it should be
possible to identify spam even without the content to work with. Thus,
it should be sufficient to visualize email patterns with the sensitive
content removed and still effectively identify spam, which is the goal
of the visualizations presented in this paper.

e-mail: muelder@cs.ucdavis.edu
e-mail: ma@cs.ucdavis.edu
1.2 Bipartite Graph Representation
One abstraction of email communications is a directed graph, where
nodes represent addresses and edges represent emails between their
associated nodes. Since in this paper, only incoming email is con-
sidered, it can safely be assumed that the data can be split into two
disjoint sets, a set of external sending email addresses and a set of
internal receiving ones. Also, almost all edges will go from the ex-
ternal set to the internal set. Those that do not are either completely
internal and trusted, or completely external and unalterable. Thus,
these edges can be discarded and what is left is a bipartite graph,
where all edges go from the set of senders to the set of receivers.
By limiting the graph to a bipartite one, the options for visualization
can be clearer and faster than a general graph.
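As a minimal sketch of the reduction described above (not the authors' implementation), the following Python keeps only the edges that go from an external sender to an internal receiver and counts how many emails share each edge. The names `INTERNAL_DOMAIN`, `is_internal`, and `bipartite_edges` are hypothetical, and treating a single domain suffix as "internal" is an assumption made for illustration.

```python
from collections import Counter

# Assumption: every internal address belongs to this one domain (or a subdomain).
INTERNAL_DOMAIN = "ucdavis.edu"


def is_internal(address, internal_domain=INTERNAL_DOMAIN):
    """Return True if the address belongs to the internal domain."""
    domain = address.rsplit("@", 1)[-1].lower()
    return domain == internal_domain or domain.endswith("." + internal_domain)


def bipartite_edges(emails, internal_domain=INTERNAL_DOMAIN):
    """Count emails per (external sender, internal receiver) edge.

    `emails` is an iterable of (sender, receiver) pairs. Edges that are
    completely internal or completely external are discarded.
    """
    edges = Counter()
    for sender, receiver in emails:
        if is_internal(sender, internal_domain):
            continue  # internal sender: not part of the sender set
        if not is_internal(receiver, internal_domain):
            continue  # external receiver: not part of the receiver set
        edges[(sender, receiver)] += 1
    return edges
```

The resulting counter maps each sender/receiver pair to the number of emails on that edge, which is the quantity the later views use for opacity.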
1.3 Related Work
Using visualization to analyze email is not a new idea, and a number
of email visualizations have been developed. Mostly, these
visualizations work with one of three types of email datasets: public
forums, end users’ inboxes, and organizational archives.
Publicly accessible forums such as newsgroups are often the
easiest to work with since everything is accessible. Many vi-
sualizations have been developed to analyze the social net-
work represented by public email communication such as
through newsgroups [6, 8, 9]. Often, spam is prevented by
making the newsgroup somewhat exclusive, such as requiring
a password or user verification before allowing posting. And
any user caught spamming is usually banned fairly quickly,
so spam is usually not as big of an issue as it is with private
email.
The next easiest set of data to work with is the user’s own
inbox, since there is no privacy restriction. However, in this
case the scope is fairly limited, since only the email to or from
this one user can be seen. Many interesting email visualiza-
tions have been developed that focus on analyzing or manag-
ing one’s own email inbox or archive [1, 5]. Some of these are
useful in categorizing spam, but as these visualizations are on
a per user basis, they are usually not useful at the administra-
tive level.
Organizational or system level email archives are the focus
of this paper, and are likely the most difficult to work with
due to potential sensitivity of the data. Visualizations have
been developed to analyze emails at the organizational level
[2, 11], but these tend to focus on social structures. Due to the
lack of content analysis at this level, they do not usually focus
on spam detection and filtering. The visualization methods
shown in this paper were designed to detect and analyze spam
and spam sources without the content.
The concept of visualizing bipartite graphs is also not new. Bipar-
tite graphs have been around for a long time, and so there have been
several works regarding how to visualize them effectively [3, 13].
There has even been work done on applying bipartite graph visu-
alization to source-destination relationships [12]. But these works
are not directed towards the visualization of email. Finally, identi-
fying spam has traditionally been performed using statistical meth-
ods [4, 7]. But these methods are generally content based, and so
they could be in violation of the user’s expectation of privacy if ap-
plied at the system level. Also, keeping statistical methods such
as Bayesian filters trained at the system level is a daunting task in
itself, which can be alleviated through the use of visualization.
2 DESIGN AND TECHNIQUES
The visualization presented in this paper was designed to work
with raw system log email records. These records include infor-
mation such as sources, destinations, arrival times, subjects, relays,
and spam scores as defined by Spam Assassin [10], the statistical
spam detection system in place. Much more information was also
present, but this information was not used. Many parts of this data
can be considered sensitive, in particular source and destination
addresses and subject lines. Thus, before any visualization could be
attempted, the sensitive data needed to be scrubbed out. Once the
sensitive data was scrubbed out, the relevant information needed to
be parsed out into a more structured format. Then, the time varying
bipartite graph represented by this data was displayed using visual-
ization techniques.
Data Sample 1: An example of raw log data.
Feb 12 20:36:32 baton sendmail[10529]: [ID 801593 mail.info] k1D4a2Zd010529: from=<5138651db1f0438111aa7bf24025145f@virgin.net>, size=13174, class=0, nrcpts=1, msgid=<e24a11cf32bd155b8284c28ad3cc6feb@doug>, proto=ESMTP, daemon=MTA, relay=ucdavis.edu [169.237.104.35]
Feb 12 20:36:34 baton mimedefang.pl[4245]: [ID 702911 mail.info] MDLOG,k1D4a2Zd010529,mail_in,,,<5138651db1f0438111aa7bf24025145f@virgin.net>,<785e0431a07a0bbadd089e63318b2122@ucdavis.edu>,(Subject Removed)
Feb 12 20:36:34 baton sendmail[10529]: [ID 801593 mail.info] k1D4a2Zd010529: Milter delete: header X-Spam-Score: 13.128 (*************) BAYES_80,HELO_DYNAMIC_IPADDR,HTML_80_90,HTML_IMAGE_ONLY_12,HTML_MESSAGE,MIME_QP_LONG_LINE,URIBL_SBL
Feb 12 20:36:34 baton sendmail[10529]: [ID 801593 mail.info] k1D4a2Zd010529: Milter add: header: X-Scanned-By: MIMEDefang 2.52 on 169.237.6.6
Feb 12 20:36:34 baton sendmail[10539]: [ID 801593 mail.info] k1D4a2Zd010529: to=785e0431a07a0bbadd089e63318b2122@ucdavis.edu, delay=00:00:02, xdelay=00:00:00, mailer=esmtp, pri=43286, relay=ucdavis.edu. [169.237.6.84], dsn=2.0.0, stat=Sent (k1D4aYH9008118 Message accepted for delivery)
Data Sample 2: An example of structured (parsed) log data.
----------id=k1D4a2Zd010529---------------
ID=k1D4a2Zd010529
TO=785e0431a07a0bbadd089e63318b2122@ucdavis.edu
FROM=5138651db1f0438111aa7bf24025145f@virgin.net
SPAMSCORE=
DELAY=00:00:02
XDELAY=00:00:00
MAILER=esmtp
PRI=43286
RELAY=ucdavis.edu. [169.237.6.84]
DSN=2.0.0
STAT=Sent (k1D4aYH9008118 Message accepted for delivery)
TIME=Feb 12 20:36:34
2.1 Scrubbing and Filtering
The data in its original raw form consists of Linux system logs
which contain the system log messages sent by the various email
daemons. Since these messages are very unstructured, the process
of extracting only the non-sensitive data in a structured form in-
volves a lot of text manipulation, and so the programs to handle
this were written in Perl. The first script takes the data and scrubs
out the sensitive information by encrypting or removing the user
names and removing the subject lines. The next script takes the
log, parses out the relevant information, and outputs the data in a
structured form for parsing in a visualization program.
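The scrubbing step can be illustrated with a short sketch. The original scripts were written in Perl and are not reproduced in the paper, so the Python below is only one plausible way to do it, under stated assumptions: user names are replaced by a keyed hash so that the same address always maps to the same token, domains are kept so that addresses can still be grouped, and subjects are dropped. `SECRET_KEY`, `scrub_address`, `scrub_record`, and the lowercase field names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical site secret; without it the tokens cannot be regenerated
# from guessed user names.
SECRET_KEY = b"replace-with-a-site-secret"


def scrub_address(address):
    """Replace the user name with a keyed hash, keeping the domain."""
    user, sep, domain = address.rpartition("@")
    if not sep:
        return address  # not an address of the form user@domain
    digest = hmac.new(SECRET_KEY, user.lower().encode("utf-8"), hashlib.sha256)
    return f"{digest.hexdigest()[:32]}@{domain.lower()}"


def scrub_record(record):
    """Return a copy of a parsed record with sensitive fields removed."""
    clean = dict(record)
    for field in ("from", "to"):
        if field in clean:
            clean[field] = scrub_address(clean[field])
    if "subject" in clean:
        clean["subject"] = "(Subject Removed)"
    return clean
```

Because the mapping is deterministic, one sender still appears as one node in the later views, which is what the pattern analysis relies on.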
2.2 Bipartite Graph View
One way of rendering a bipartite graph is to line up all the vertices in
one set of the graph on one side, line up all the vertices in the other
set on the other side, and then render edges between these vertices
as for a normal graph. This creates a relatively easy to understand
picture, which is good for showing patterns in the edges. Also,
since the vertices are fixed, the edges can change with time while
keeping a constant frame of reference. Sources are lined up on the
left, destinations are lined up on the right, and edges are drawn
between them for each email. The addresses are grouped according
to domain names, which are sorted alphabetically. Color is used to
represent the spam score that Spam Assassin gave it, where black
means a spam score of zero, and the more red the line the higher the
spam score. Opacity represents how many emails there were from
that source to that destination in that time range. That is, when it is
nearly transparent there were very few emails but when it is opaque
there were many emails. Figure 1 shows the results of applying
these techniques to a small amount of email. The first and most
obvious pattern shown is that the majority of the email originates
from just a few sources. Since much of the known spam originates
from these same addresses, it is likely that the rest of this email is
spam as well. It also indicates which addresses or domains should
likely be blacklisted, or in the case of internal addresses, be audited.
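The ordering and the color and opacity rules of the graph view can be sketched as follows. This is an illustration rather than the paper's code: the saturation limits `max_score` and `max_count` are assumed values chosen for the example, and `domain_order` and `edge_style` are hypothetical names.

```python
def domain_order(addresses):
    """Map each address to a row index, grouped by domain (alphabetically)."""
    def key(address):
        user, _, domain = address.rpartition("@")
        return (domain.lower(), user.lower())

    ordered = sorted(set(addresses), key=key)
    return {address: row for row, address in enumerate(ordered)}


def edge_style(spam_score, count, max_score=20.0, max_count=50):
    """Return an (r, g, b, alpha) tuple for one source/destination edge.

    Black means a spam score of zero and redder means a higher score;
    alpha grows with the number of emails on the edge. Both limits are
    assumptions, and values beyond them are clamped.
    """
    red = min(max(spam_score, 0.0), max_score) / max_score
    alpha = min(count, max_count) / max_count
    return (red, 0.0, 0.0, alpha)
```

Clamping keeps a single extreme edge from washing out every other edge in the same time range.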
2.3 Scatterplot View
When the number of emails gets large, the graph starts to become
incoherent, since there are just too many overlapping edges. In order
to resolve this, a scatterplot was added as an alternate view. While
somewhat less intuitive, it is better at handling large numbers of
emails since there are no overlapping edges. Sources are still lined
up on the left side, but now destinations are lined up along the
bottom, and emails are shown simply as points at the coordinates
corresponding to their source and destination addresses. The same
color and opacity rules are applied as in the graph view: color is
spam score and opacity is number of emails. Figure 2 shows an example
of this technique applied to an entire month of emails. In this view,
horizontal lines correspond to bulk email, which is often spam. The
more interesting pattern revealed here, however, is the presence of
vertical lines, which indicate addresses that receive lots of email
from everywhere. These vertical lines all seem to share the same
pattern, which indicates that they are all on the same set of external
lists. In fact, this is a likely indicator that these are a subset of
internal email addresses that are on a particular spam list.

Figure 1: A graph of emails over the period of 1 hour. Sources are on
the left, destinations on the right, and semitransparent lines are
drawn between them to represent emails (known spam is colored red).
Blue delimits different domains, and the orange line designates the
current focus, which is labeled. Bulk email (such as spam) forms fan
like shapes from one source to many destinations. As can be seen, in
this hour most of the spam originated from a select few sources.

Figure 2: A month’s worth of emails shown in a scatterplot. Sources
are on the left, destinations are along the bottom. Points represent
emails. Bulk mail, such as most spam, appears as horizontal lines (one
source, many destinations). Another odd pattern revealed here is the
set of vertical lines (many sources, one destination). Interestingly,
many of these vertical lines have very similar patterns, indicating
the same set of sources. This might indicate a spammer that is using a
botnet to send email from distributed, compromised sources.

(a) A 3d graph of emails over the period of about a day.
(b) A 3d extension of the scatterplot over the whole month.
Figure 4: 3d versions of the graph and scatterplot. The added
dimension is time. This allows larger amounts of data to be viewed at
any one time, at the cost of introducing occlusion. It also shows how
many patterns persist through time as opposed to traffic bursts.
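The horizontal and vertical lines in the scatterplot correspond to simple counts that can also be computed directly. The sketch below is not part of the paper's tool; it collects the set of destinations per source (horizontal lines) and the set of sources per destination (vertical lines), then lists pairs of destinations whose sender sets largely coincide. The Jaccard threshold `min_overlap` and the cutoff `min_sources` are assumptions picked for illustration.

```python
from collections import defaultdict


def fan_sets(edges):
    """Group (source, destination) pairs by source and by destination."""
    dests_per_source = defaultdict(set)
    sources_per_dest = defaultdict(set)
    for source, dest in edges:
        dests_per_source[source].add(dest)   # wide set: horizontal line
        sources_per_dest[dest].add(source)   # wide set: vertical line
    return dests_per_source, sources_per_dest


def similar_vertical_lines(sources_per_dest, min_sources=3, min_overlap=0.8):
    """Return destination pairs that receive mail from similar sender sets."""
    heavy = {d: s for d, s in sources_per_dest.items() if len(s) >= min_sources}
    names = sorted(heavy)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = len(heavy[first] & heavy[second])
            combined = len(heavy[first] | heavy[second])
            if shared / combined >= min_overlap:  # Jaccard similarity
                pairs.append((first, second))
    return pairs
```

Destinations that come back paired are candidates for the "same spam list" explanation and would still need to be checked in the views themselves.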
2.4 Timeline Overview
Often one wants to know not just what happened but also when it
happened, so it is useful to present information regarding when events
occur. Also, looking at the entire data set at once can be confusing
or overwhelming because of the large size of the data, so it is often
helpful to look at small time segments individually. A timeline was
added to the visualization as an overview of the data set from which
individual time segments can be selected. An example of this timeline
is shown in Figure 3. In this timeline, the top edge of the black area
represents the total number of emails and the red area represents the
portion of these emails that were identified as spam. The grey
highlight indicates the currently selected time period, which would be
shown in the graph or scatterplot view. The first pattern to notice is
that there is a distinct cyclical pattern corresponding to the work
week. That is, there is a repeating pattern of five peaks followed by
two much shorter peaks. There are also some spikes that do not
correspond to this pattern. These correspond to other periods of
increased activity, due either to brief floods or merely to increased
activity. The overall low level of identified spam is indicative of
how poorly Spam Assassin identifies spam.
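The two curves of the timeline amount to per-hour counts, which can be sketched as below. This is an assumption-laden illustration, not the paper's code: the syslog timestamps carry no year, so the year is passed in (2006 matches the data set described later), and the spam threshold of 5.0 is Spam Assassin's default required score rather than a value stated in the paper. `hourly_timeline` is a hypothetical name.

```python
from collections import Counter
from datetime import datetime

SPAM_THRESHOLD = 5.0  # assumption: Spam Assassin's default required score


def hourly_timeline(records, year=2006, threshold=SPAM_THRESHOLD):
    """Bin (timestamp, spam_score) pairs into per-hour totals.

    Timestamps look like "Feb 12 20:36:34"; a score of None means the
    message was never scored. Returns (total, spam) counters keyed by the
    start of each hour. Month names are parsed with the current locale.
    """
    total, spam = Counter(), Counter()
    for stamp, score in records:
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        hour = when.replace(minute=0, second=0)
        total[hour] += 1
        if score is not None and score >= threshold:
            spam[hour] += 1
    return total, spam
```

Plotting `total` as the black area and `spam` as the red area over the same hours would give a view of the kind described above.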
2.5 3D View
While the two dimensional graph and scatterplot representations are
useful for viewing the emails themselves, they do not show anything
about the timing information. On the other hand, the timeline view
shows the timing of large scale events well, but not what sources and
destinations were involved. So, both of the two dimensional
visualizations were extended to three dimensional views that show both
the time information and the source and destination information. This
allows one view to summarize the data. As the number of emails
increases, occlusion can begin to be a problem, so more transparency
is often required. Thus, small scale patterns such as individual
emails can tend to be lost in this view. However, spam messages in
general are not small scale patterns, and so this is not a very large
problem for spam analysis. Figure 4(a) shows an example of about a
day’s worth of email shown in a three dimensional graph view and
Figure 4(b) shows a month’s worth of email shown in the three
dimensional scatterplot view. In Figure 4(a), it can easily be seen
that the two largest sources of bulk email tend to be a continuous
source of email, and not merely occasional senders. In Figure 4(b), a
pattern is revealed that was invisible in the 2d view: the lines
parallel to the time axis. These lines are cases where there was one
source, one destination, and a continuous stream of traffic. While
this could be due to automated messages such as news feeds, it is also
possible that these destinations are spam relays, which take spam and
resend it to other addresses.

Figure 3: A timeline view of the emails per hour over a month. The red
area represents the portion of the emails that are known spam
according to Spam Assassin. The grey region is the currently selected
time period. A repeating pattern of five large peaks followed by two
smaller (or nonexistent) peaks corresponds to the five day work week
plus two day weekends.

Figure 5: One option for presenting the time aspect of the data is to
animate it, either by showing sequential time-steps or by showing a
sliding time window. Three consecutive frames are shown here. In them,
it can be seen that some patterns are persistent while others come and
go.
2.6 Animation
Another option for displaying time information is to actually use
time to represent it. That is, timing information can be represented
by animating the view of the data. This is likely the most intu-
itive way to represent time, but it can be somewhat difficult to use
effectively since it relies on the user’s memory and reaction time.
However, it does keep the visualization down to two dimensions,
so that the inherent difficulties of three dimensional visualizations
can be avoided. So the visualization presented here has a capability
to animate the edges of the email graph. This can be done by simply
showing sequential time steps at a fixed rate. A slightly more
complex way that was also implemented is a sliding time window,
where a constant amount of time is shown while new time steps
are added and old ones removed. This creates some persistence to
patterns with short durations, while allowing better comparison be-
tween sequential time steps. The nodes that correspond to sources
and destinations are kept constant, which provides a frame of
reference as the edges are changing. An example of a few frames of the
animation process is shown in Figure 5.
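The sliding time window can be sketched with a bounded queue: each frame shows the union of the most recent time steps, so a new step is added while the oldest one drops out. This is a minimal illustration; the function name `sliding_frames` and the default window length are assumptions.

```python
from collections import deque


def sliding_frames(steps, window=3):
    """Yield one frame per time step, each combining the last `window` steps.

    `steps` is an iterable of per-time-step edge lists. A deque with a
    maximum length discards the oldest step automatically.
    """
    recent = deque(maxlen=window)
    for edges in steps:
        recent.append(edges)
        yield [edge for step in recent for edge in step]
```

Setting `window=1` reduces this to the simpler mode of showing sequential time steps one at a time.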
2.7 Zooming and Filtering
Since there are a very large number of email addresses in both the set
of sources and the set of destinations, it can be quite difficult if
not impossible to single one out from the full graph or scatterplot
views. So, a capability was added to zoom into the source and
destination axes in order to both reduce the complexity of the graph
and allow the user to focus on regions of interest. On the graph this
is done by selecting regions of the sources or regions of the
destinations with the mouse. On the scatterplot, as is shown in
Figure 6, both are done simultaneously by selecting a rectangular area
of interest with the mouse.

Figure 6: In order to see details more clearly, a capability was added
to zoom into regions of the scatterplot. In this image, the
ucdavis.edu domain was focused on for both sources and destinations.
This reveals an interesting pattern of a diagonal line, which is
indicative of people emailing themselves. Zooming into the graph is
done very similarly.

Figure 7: A detail view of the email originating from a selected
address, and the emails originating from the addresses it sent emails
to. The size of each segment of the ring corresponds to what
percentage of the email was sent there, and the color is random. Color
and position of the nodes in the middle are derived from the color and
position of each address in the first level that sent it email. This
is useful for showing patterns such as email relays, where spam is
forwarded by a compromised user.

(a) All spam over the month (b) Spam from U. C. Davis
(c) Spam from Yahoo (d) Spam from Hotmail
Figure 8: Email marked as spam by Spam Assassin over the whole month.
(b), (c), and (d) focus on some of the larger domains. In them we can
see that ucdavis.edu is the source for a lot of spam, yahoo.com for
less, and hotmail.com for even less.
2.8 Detail View
Once the graph or scatterplot visualizations have been used to iden-
tify an address of interest, it is beneficial to be able to investigate
this address in more detail. Figure 7 shows one possible visual-
ization of the details of a node. This visualization shows a repre-
sentation of the emails’ outgoing social network structure, where
the ring shows one degree away from the selected address, and the
nodes in the middle show the second degree away. In the ring, each
section represents an address that the selected address emailed, and
the size indicates how much email was sent to that address. The col-
ors were chosen randomly. The nodes in the middle are addresses
that received emails sent by members of the first level. They are
positioned and colored according to which nodes in the first level
sent them email. For example, if a node received email from exactly
three of the nodes in the first level, then it would be placed
equidistant from these three nodes and its color would be the aver-
age color of the nodes. The edges that would normally be drawn
are only shown for the currently selected node in the first level for
clarity, since otherwise it can become quite cluttered.
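The layout rules of the detail view can be sketched as follows. The paper does not give formulas, so this is one reading under stated assumptions: ring segments are sized in proportion to the email sent to each first-level address, and a second-level node is placed at the centroid of the first-level nodes that sent it email, with the mean of their RGB colors. `ring_arcs`, `arc_midpoint`, and `inner_node` are hypothetical names.

```python
import math


def ring_arcs(mail_counts):
    """Map each first-level address to a (start, end) angle in radians.

    `mail_counts` maps address -> number of emails sent to it by the
    selected address; the total must be greater than zero.
    """
    total = sum(mail_counts.values())
    arcs, start = {}, 0.0
    for address, count in mail_counts.items():
        span = 2.0 * math.pi * count / total
        arcs[address] = (start, start + span)
        start += span
    return arcs


def arc_midpoint(arc, radius=1.0):
    """Return the (x, y) point at the middle of a ring segment."""
    mid = (arc[0] + arc[1]) / 2.0
    return (radius * math.cos(mid), radius * math.sin(mid))


def inner_node(sender_positions, sender_colors):
    """Place a second-level node at the centroid of its first-level senders.

    Assumption: the centroid stands in for "equidistant", and the color is
    the component-wise mean of the senders' RGB colors.
    """
    n = len(sender_positions)
    x = sum(p[0] for p in sender_positions) / n
    y = sum(p[1] for p in sender_positions) / n
    color = tuple(sum(c[i] for c in sender_colors) / n for i in range(3))
    return (x, y), color
```

A node that receives mail from a single first-level address ends up on that address's ring position with its exact color, which makes one-to-one forwarding easy to spot.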
3 CASE STUDIES
In order to test the effectiveness of this visualization, it was run
on data collected during February 2006. This data consisted of system
logs from the University of California, Davis. This is the month of
data shown in the timeline view of Figure 3, and used in the rest of
the Figures. While exploring this data, many interesting patterns were
discovered. Several have already been discussed along with their
respective Figures in the previous section, and others are discussed
here.

(a) A common spam pattern of one source and many destinations. This
shows a large amount of spam originating from an address at “cuna.org”
(b) An odd spam pattern where just a single user was targeted by
nearly the entire “globo.com” domain.
Figure 9: One to many and many to one spam patterns. These kinds of
patterns are fairly common to spam.
3.1 Known Spam
Much can be learned by simply filtering out all but the mail known
to be spam according to Spam Assassin. Figure 8(a) shows the
known spam from all sources over the entire month. From it, it
can easily be seen that there are definitely some primary sources
from which the majority of spam originates. In fact, the largest sev-
eral spam sources seem to be internal to U. C. Davis itself, which
is shown in more detail in Figure 8(b). In this Figure, it can eas-
ily be seen that there are a handful of addresses from which the
majority of spam originates, indicating which addresses should be
blocked or at least audited since they are internal. While most of
the other spam sources are from individual small domains, there
is also spam originating from larger more common domains such
as Yahoo and Hotmail, which are shown in Figures 8(c) and 8(d)
respectively. Of interest here is that Hotmail apparently does a better
job of preventing outgoing spam messages than Yahoo, but is still
not perfect. Why this occurs is beyond the scope of the research
possible with just this dataset, and would require a more in depth
study. However, this visualization does produce a set of individual
addresses in these and other domains that could be used as the basis
for a blacklist. Another interesting pattern seen here is the set of
common spam receivers. These are email addresses that have apparently
been collected by an outside spammer somehow, because
spam from many different sources seems to concentrate on these
select addresses. It is also possible that these are the addresses of
email lists, and so they are targeted by many spammers because do-
ing so increases the spam’s audience while decreasing the expense
on the side of the spammer.
3.2 Isolating Spam Sources
Once spam sources have been located in the big picture, the zooming
feature can be used to home in on them and identify the actual address
or addresses from which the spam is originating. In the case of
Figure 9(a), a large amount of spam was traced down to a single
source, originating from “cuna.org.” Once this source has been
identified, it can be further analyzed by viewing it in a detail view
or by going back to the original data. Alternatively, it could be good
enough to monitor its future actions more closely or treat mail coming
from it more harshly, making it more difficult for spam originating
from it to get past the filters (by giving it an initial spam score,
for example).
(a) Spam is often relayed through bot networks. The lines in the scatterplot shown at left are potentially compromised users. At right, they are shown in more
detail in the bipartite graph view, and it is seen that the source was the null source address (missing@data.edu).
(b) Detail view of the addresses that received mail from the null source and
those addresses they sent email to. The five lines shown in Figure 10(a)
correspond to the five large segments of the outer ring. From here, it can be
seen by selecting each of these regions that each of their addresses were also
sending mail to many destinations, just like a spam relay.
(c) One of the five potential relays shown in overview. Upon inspection of
these addresses in the overview, it can be seen that while they did send to
many destinations, it was in low volume. Since this was much less than the
incoming mail from the null address, they cannot have been actually relaying
spam in this case.
Figure 10: An email pattern similar to that of a set of spam relays. However, in this case this traffic was not actually spam.
Figure 11: A recurring traffic burst. About a day apart, these patterns show emails from one source to one destination that were sent many
times. Investigation reveals that these are emails that were repeatedly deferred by the server.
3.3 Spam Focus
One interesting pattern that was detected but not really expected is
shown in Figure 9(b). In this Figure, nearly the entire domain of
“globo.com” is sending spam to exactly one destination. One possible
explanation is that the user gave his or her email address out for
some reason, and this domain is taking advantage of this. Another
possibility is that this user signed up for something legitimate on
this site and the mail is being mislabeled as spam. However, there are
few reasons for legitimate email to be coming from several distinct
sources that comprise almost the entire domain. If further analysis
reveals that this is truly spam, then this entire domain would be a
likely candidate for blacklisting.
3.4 Spam Relays
When sending spam, a spammer can use a compromised innocent
user as a proxy in order to bypass spam prevention methods such
as blacklists. In this case, the spammer would continually send the
spam to the user, possibly encrypted, where it would be intercepted
by a daemon which would then forward it to a set of destinations.
The pattern found in Figure 4(b) was indicative of this kind of traf-
fic, so it was explored in more detail. In Figure 10(a), the source of
5 of these lines was focused in on, then shown in the two dimen-
sional bipartite graph view. In this view, it is seen that the source of
these emails had the special token “missing@data.edu,” indicating
that there was no source address in the log files. Since it does not
really make any sense for a legitimate email to have a null source
address, this indicates that this is more likely spam. From here, the
null source was selected for viewing in the detail view, and the re-
sult was the image shown in Figure 10(b), which shows some of the
relaying properties better. However, when the addresses of the five
destinations were viewed in the normal graph view, it was seen that
they in fact were not sending spam. Thus, whatever mail they were
receiving from the null address was getting blocked or dropped or
was in fact not spam to be relayed.
3.5 Recurring Email Deferring
One anomaly that was revealed by the visualization was a recurring
pattern of emails from individual sources to individual destinations
that were repeated many times. This odd pattern was seen to occur
each night at about the same time, between 2:00 A.M. and 3:00
A.M. and is shown in Figure 11. It turns out that this pattern was
caused by emails getting deferred repeatedly, and so they end up
creating many messages in the system log for a single email. Since
this email deferring apparently occurs daily, it is likely not due to an
external reason, such as a denial of service attack. Rather, it is likely
due to an internal issue of some sort, that it might be beneficial to
identify and fix. Thus, the visualization presented here is capable
of identifying internal issues as well as external attacks.
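A pattern like this can also be flagged before visualization by counting deferral entries per message. The sketch below is not from the paper; it assumes parsed records shaped like Data Sample 2 (with `ID` and `STAT` fields) and assumes that a deferred delivery attempt is logged with a status beginning with "Deferred". The names `deferral_counts` and `repeatedly_deferred` and the cutoff `min_deferrals` are hypothetical.

```python
from collections import Counter


def deferral_counts(records):
    """Count deferred delivery attempts per message ID.

    `records` is an iterable of dicts with "ID" and "STAT" keys.
    Assumption: deferred attempts have a STAT starting with "Deferred".
    """
    counts = Counter()
    for record in records:
        if record.get("STAT", "").startswith("Deferred"):
            counts[record["ID"]] += 1
    return counts


def repeatedly_deferred(records, min_deferrals=2):
    """Return the IDs of messages deferred at least `min_deferrals` times."""
    counts = deferral_counts(records)
    return sorted(msg_id for msg_id, n in counts.items() if n >= min_deferrals)
```

Collapsing such entries to one edge per message ID would keep a single deferred email from appearing as a burst of traffic.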
3.6 Other Anomalous Patterns
While not necessarily spam related, many other patterns
can be seen with the visualization techniques shown in this paper.
For example, in Figure 12(a), the view was focused on internal U.C. Davis
traffic, and a pair of dashed lines parallel to the time axis
was found. When zoomed in and shown in more detail (Figure 12(b)), it
can be seen that these lines were emails from one source to a pair
of destinations, with volume dependent on the work day. That is, the
amount of email sent from this source appears to follow the standard
work week pattern of five peaks of traffic in the middle
of each weekday, followed by two lesser peaks on the weekend.
This kind of pattern could be explained by automatic forwarding
from a tech support address. That is, mail sent to the tech support
address could be automatically forwarded to a pair of tech
support people.
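One way to check numerically for the work week pattern described above is to compare the average daily message count on weekdays with that on weekends for a single source. This is a minimal sketch under assumed inputs: the function name and the sample timestamps are hypothetical, and days with no messages at all are not counted in either average.

```python
from collections import Counter
from datetime import datetime, timedelta

def _mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def weekday_weekend_means(timestamps):
    """Return (mean weekday count, mean weekend count) of messages per day.

    Only days that appear in `timestamps` are included in the averages.
    """
    per_day = Counter(ts.date() for ts in timestamps)
    weekday = [n for day, n in per_day.items() if day.weekday() < 5]
    weekend = [n for day, n in per_day.items() if day.weekday() >= 5]
    return _mean(weekday), _mean(weekend)

# Hypothetical week starting Monday 2 January 2006:
# 20 messages on each weekday, 5 on each weekend day.
start = datetime(2006, 1, 2, 12, 0)
sample = [
    start + timedelta(days=d, minutes=m)
    for d in range(7)
    for m in range(20 if d < 5 else 5)
]
print(weekday_weekend_means(sample))
```

A weekday mean well above the weekend mean would match the five large peaks and two lesser peaks seen in the figure.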
Figure 12: An anomalous pattern. (a) A pair of dashed lines. (b) One week of the dashed lines in more detail. These emails match up with the standard work week pattern yet are from one source to exactly two destinations. Possibly caused by an automatic forwarding script.
4 FUTURE WORK
The results generated by this visualization are quite useful, but there
are still ways in which they could be improved. It would be useful
to extract and use more of the data in the raw log records that are
currently ignored. In fact, it could even be useful to be able to simply
drill down from the email representation to the actual raw log
entry that defines it. A variation that could be interesting to investigate
is to reverse the selection of data, and consider only the
outgoing mail. This would be useful in identifying compromised
internal systems, or users that are misusing the system intention-
ally. It could also be useful to use a tool such as this as an interface
to create filter rules in a sandbox environment. That is, potential
spam could be identified, a filter rule could be made, and could
then be applied to the data and fed back into the visualization. This
would make a useful feedback system. Finally, it would be benefi-
cial to test the system against some datasets where the ground truth
is known, perhaps even artificially generated datasets, in order to
more fully validate the effectiveness of the visualizations.
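The sandbox feedback loop proposed above could be prototyped roughly as follows. In this sketch a candidate filter rule is a plain predicate that is applied to a copy of the data, and the relabeled records would then be passed back to the visualization. The record fields (`source`, `destination`, `is_spam`), the function name, and the example rule are assumptions for illustration only.

```python
def apply_rule(records, rule):
    """Return a relabeled copy of `records` and the number of newly flagged ones.

    `records` is a list of dicts with 'source', 'destination' and 'is_spam'
    keys (a hypothetical layout); `rule` is a callable that takes a record
    and returns True if the record should be marked as spam.
    """
    relabeled = []
    newly_flagged = 0
    for rec in records:
        updated = dict(rec)  # copy, so the original data stays untouched
        if not rec["is_spam"] and rule(rec):
            updated["is_spam"] = True
            newly_flagged += 1
        relabeled.append(updated)
    return relabeled, newly_flagged

# Hypothetical sample data and a candidate rule that blocks the null source.
sample = [
    {"source": "missing@data.edu", "destination": "a@example.edu", "is_spam": False},
    {"source": "b@example.com", "destination": "a@example.edu", "is_spam": False},
    {"source": "missing@data.edu", "destination": "c@example.edu", "is_spam": True},
]
relabeled, newly_flagged = apply_rule(
    sample, lambda rec: rec["source"] == "missing@data.edu"
)
print(newly_flagged)
```

Because the rule runs on a copy, a rule that flags too much legitimate mail can be discarded without affecting the underlying data.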
5 CONCLUSIONS
Visualization of incoming email is an effective way to get an overall
feel for what kinds of email are coming into the network, even with-
out using potentially sensitive information such as subject lines.
The visualization techniques used in this paper, while not very com-
plex, quite clearly point out some interesting features in the data
which would likely be of interest to a system administrator. They
are quite effective at pointing out which sources are predominately
responsible for most of the incoming spam on a network. The visu-
alizations are also fairly effective at revealing spam messages that
were not identified by the system wide filtering process, which can
then be used to train system-wide filters. This system inherently has
an issue with false positives. Since all the sensitive data has been
removed, there is no way to confirm whether a spam
source identified by this tool is really sending spam without
going back to the original data. As with statistical measures, care
should be taken when dealing with spam sources detected
with this tool, since they may be legitimate users that
are being used as spam bots, and blacklisting them could be construed
as denial of service. However, the results
of this tool may be considered sufficient grounds to go past the privacy wall
and view the sensitive data in suspicious cases for confirmation of
their spam content. These features would make it a valuable tool
for an email system administrator, by enabling rapid identification of
weaknesses in the system's spam prevention.
ACKNOWLEDGEMENTS
This work is sponsored in part by the National Science Founda-
tion under contracts CCF 0222991, OCI 0325934, IIS 0552334,
and CCF 0634913. Special thanks to Ken Jones and Ken Gribble of
the Computer Science Department at the University of California,
Davis for supplying the data and helping with the data scrubbing
process.
REFERENCES
[1] Kerr, Bernard (2003). Thread Arcs: An Email Thread Visualization.
2003 IEEE Symposium on Information Visualization, pp. 27.
[2] Li, W., Hershkop, S. and Stolfo, S. J., (2004). Email Archive Anal-
ysis Through Graphical Visualization. Proceedings of the 2004 ACM
Workshop on Visualization and Data Mining for Computer Security,
pp. 128-132.
[3] M. Newton, O. S´
ykora, and I. Vrto. Two new heuristics for two-sided
bipartite graph drawing. In Graph Drawing, pages 312–319, 2002.
[4] Patrick Pantel and Dekang Lin. “SpamCop– A Spam Classification
and Organization Program.” Proceedings of AAAI-98 Workshop on
Learning for Text Categorization.
[5] Rohall, S. L., Gruen, D., Moody, P., Wattenberg, M., Stern, M., Kerr,
B., Stachel, B., Dave, K., Armes, R. and Wilcox, E. (2004). ReMail:
A Reinvented Email Prototype. Proceedings of ACM Human Factors
in Computing Systems (CHI 2004), pp 791-792.
[6] Sack, W. (2000). Discourse Diagrams: Interface Design for Very
Large Scale Conversations. Proceedings of the 33rd Hawaii Interna-
tional Conference on System Sciences, January 2000, p. 3034.
[7] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz.
“A Bayesian Approach to Filtering Junk E-Mail.” Proceedings of
AAAI-98 Workshop on Learning for Text Categorization.
[8] Smith, M. (2002). Tools for Navigating Large Social Cyberspaces.
Communications of the ACM, vol. 45, no. 4., April 2002, pp. 51-55.
[9] Smith, M. (1999). Invisible Crowds in Cyberspace: Measuring and
Mapping the Social Structure of USENET. In Smith, M. and Kol-
lock, P. (eds.): Communities in Cyberspace, Routledge Press, London,
1999.
[10] http://spamassassin.apache.org/.
[11] Tyler, J. R., Wilkinson, D. M., Huberman, B. A. (2003). Email as
Spectroscopy: Automated Discovery of Community Structure within
Organizations. Communities and Technologies, pp. 81-96.
[12] William Yurcik, ”VisFlowConnect-IP: A Link-Based Visualization of
NetFlows for Security Monitoring,” 18th Annual FIRST Conference
on Computer Security Incident Handling , Baltimore, MD USA, June
25-30, 2006.
[13] Lanbo Zheng , Le Song , Peter Eades, Crossing minimization prob-
lems of drawing bipartite graphs in two clusters, proceedings of the
2005 Asia-Pacific symposium on Information visualisation, p.33-37,
January 01, 2005, Sydney, Australia.