Dustminer: Troubleshooting Interactive Complexity Bugs
in Sensor Networks
Mohammad Maifi Hasan Khan, Hieu Khac Le, Hossein Ahmadi, Tarek F. Abdelzaher,
and Jiawei Han
Department of Computer Science
University of Illinois, Urbana-Champaign
mmkhan2@uiuc.edu, hieule2@cs.uiuc.edu, hahmadi2@uiuc.edu, zaher@cs.uiuc.edu
hanj@cs.uiuc.edu
ABSTRACT
This paper presents a tool for uncovering bugs due to inter-
active complexity in networked sensing applications. Such
bugs are not localized to one component that is faulty, but
rather result from complex and unexpected interactions be-
tween multiple often individually non-faulty components.
Moreover, the manifestations of these bugs are often not
repeatable, making them particularly hard to find, as the
particular sequence of events that invokes the bug may not
be easy to reconstruct. Because of the distributed nature
of failure scenarios, our tool looks for sequences of events
that may be responsible for faulty behavior, as opposed to
localized bugs such as a bad pointer in a module. An ex-
tensible framework is developed where a front-end collects
runtime data logs of the system being debugged and an of-
fline back-end uses frequent discriminative pattern mining
to uncover likely causes of failure. We provide a case study
of debugging a recent multichannel MAC protocol that was
found to exhibit corner cases of poor performance (worse
than single channel MAC). The tool helped uncover event
sequences that lead to a highly degraded mode of operation.
Fixing the problem significantly improved the performance
of the protocol. We also provide a detailed analysis of tool
overhead in terms of memory requirements and impact on
the running application.
Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging - Distributed Debugging
General Terms
Design, Reliability, Experimentation
Keywords
Protocol debugging, Distributed automated debugging, Wire-
less sensor networks
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SenSys’08, November 5–7, 2008, Raleigh, North Carolina, USA.
Copyright 2008 ACM 978-1-59593-990-6/08/11 ...$5.00.
1. INTRODUCTION
Dustminer is a diagnostic tool that leverages an extensible
framework for uncovering root causes of failures and perfor-
mance anomalies in wireless sensor network applications in
an automated way. This paper presents the design and im-
plementation of Dustminer along with two case studies of
real life failure diagnosis scenarios. The goal of this work is
to further contribute to automating the process of debug-
ging, instead of relying only on manual efforts, and hence
reduce the development time and effort significantly.
Developing wireless sensor network applications still re-
mains a significant challenge and a time consuming task.
To make the development of wireless sensor network applica-
tions easier, much of the previous work focused on program-
ming abstractions [27, 7, 25, 17, 11]. Most wireless sensor
network application developers would agree, however, that
most of the development time is spent on debug-
ging and troubleshooting the current code, which greatly
reduces productivity.
Early debugging and troubleshooting support revolved
around testbeds [8, 36], simulation [18, 35, 34] and emula-
tion environments [29, 12]. Recent source level debugging
tools [38, 37] have greatly contributed to the convenience of
the troubleshooting process. They make it easier to zoom in
on sources of errors by offering more visibility into run-time
state. Unfortunately, wireless sensor network applications
often fail not because of a single node coding error but as
a result of improper interaction between components. Such
interaction may be due to some protocol design flaw (e.g.,
missed corner cases that the protocol does not handle cor-
rectly) or unexpected artifacts of component integration. In-
teraction errors are often non-reproducible since repeating
the experiment might not lead to the same corner-case again.
Hence, in contrast to previous debugging tools, in this pa-
per, we focus on finding (generally non-deterministically oc-
curring) bugs that arise from interactions among seemingly
individually sound components.
The main approach of Dustminer is to log many differ-
ent types of events in the sensor network and then analyze
the logs in an automated fashion to extract the sequences
of events that lead to failure. These sequences shed light on
what caused the bug to manifest making it easier to under-
stand and fix the root cause of the problem.
Our work extends a somewhat sparse chain of prior at-
tempts at diagnostic debugging in sensor networks. Sym-
pathy [30] is one of the earliest tools in the wireless sensor
networks domain that addresses the issue of diagnosing fail-
ures in deployed systems in an automated fashion. Specif-
ically, it infers the identities of failed nodes or links based
on reduced throughput at the basestation. SNTS [15] pro-
vides a more general diagnostic tool that extracts conditions
on current network state that are correlated with failure, in
hopes that they may include the root cause. A diagnostic
simulator (an extension to TOSSIM) is described in [14]. Be-
ing a simulator extension, it is not intended to troubleshoot
deployed systems. It is based on frequent pattern mining
(to find event patterns that occur frequently when the bugs
manifest). Unfortunately, the cause of a problem is often an
infrequent pattern; a single “bad” chain of events that leads
to many problems.
We extend the diagnostic capability of the above tools
by implementing a new automated discriminative sequence
analysis technique that relies on two separate phases to find
root causes of problems or performance anomalies in sensor
networks; the first phase identifies frequent patterns corre-
lated to failure as before. The second focuses on those pat-
terns, correlating them with (infrequent) events that may
have caused them, hence uncovering the true root of the
problem. We apply this technique to identify the sequences
of events that cause manifestations of interaction bugs. In
turn, identifying these sequences helps the developer un-
derstand the nature of the bug. Our tool is based on a
front-end that collects data from the run-time system and a
back-end that analyses it. Both are plug-and-play modules
that can be chosen from multiple alternatives. We identify
the architectural requirements of building such an extensi-
ble framework and present a modular software architecture
that addresses these requirements.
We provide two case studies of real life debugging using
the new tool. The first case study shows how our tool iso-
lates a kernel-level bug in the radio communication stack in
the newly introduced LiteOS [6] operating system for sensor
networks that offers a UNIX-like remote filesystem abstrac-
tion. The second case study shows how the tool is used to
debug a performance problem in a recently published mul-
tichannel Media Access Control (MAC) protocol [16]. For
both case studies we used MicaZ motes as target hardware.
It is important to stress what our tool is not. We spe-
cialize in uncovering problems that result from component
interactions. Specifically, the proposed tool is not intended
to look for local errors (e.g., code errors that occur and man-
ifest themselves on one node). Examples of the latter type of
errors include infinite loops, dereferencing invalid pointers,
or running out of memory. Current debugging tools are, for
the most part, well-equipped to help find such errors. An
integrated development environment can use previous tools
for that purpose.
The rest of the paper is organized as follows. In Sec-
tion 2, we describe recent work on debugging sensor net-
work applications. In Section 3, we describe the main idea
of our tool. In Section 4, we describe the design challenges
with our proposed solution. Section 5 describes the system
architecture of Dustminer. Section 6 describes the imple-
mentation details of the data collection front-end and data
analysis back-end that are used in this paper along with their
system overhead. To show the effectiveness of our tool, in
Section 7, we provided two case studies. In the first, we
uncover a kernel-level bug in LiteOS. In the second, we use
our tool to identify a protocol design bug in a recent multichannel
MAC protocol [16]. We show that when the bug is
fixed, the throughput of the system is improved by nearly
50%. Section 8 concludes the paper.
2. RELATED WORK
Current troubleshooting support for sensor networks (i)
favors reproducible errors, and (ii) is generally geared to-
wards finding local bugs such as an incorrectly written line of
code, an erroneous pointer reference, or an infinite loop. Ex-
isting tools revolve around testing, measurements, or step-
ping through instruction execution. Marionette [37] and
Clairvoyant [38] are examples of source debugging systems
that allow the programmer to interact with the sensor net-
work using breakpoints, watches, and line-by-line tracing.
A source-level debugger is more suitable for identifying program-
ming errors that are contained in a single node. It is dif-
ficult to find distributed bugs using a source-level debugger
because source-level debugging interferes heav-
ily with the normal operation of the code and may prevent
the excitation of distributed bugs in the first place. It also
involves manual checking of system state, which is not scal-
able. SNMS [32] presents a sensor network measurement
service that collects performance statistics such as packet
loss and radio energy consumption. Testing-based systems
include laboratory testbeds such as Motelab [36], Kansei [8],
Emstar [12] etc. These systems are good at exposing mani-
festations of errors, but leave it to the programmer’s skill to
guess the cause of the problem.
Simulation and emulation based systems include TOSSIM
[18], DiSenS [35], S2DB [34], Atemu [29] etc. Atemu provides
XATDB which is a GUI based debugger that provides inter-
face to debug code at line level. S2DB is a simulation based
debugger that provides debugging abstractions at different
levels such as the node level and network level. It provides
the concept of parallel debugging where a developer can set
breakpoints across multiple devices to access internal system
state. This remains a manual process, and it is very hard
to debug a large system manually for design bugs. More-
over the simulated environment prevents the system from
exciting bugs which arise due to peculiar characteristics of
real hardware, and deployment scenarios such as clock skew,
radio irregularities, and sensing failures, to name a few.
For offline parameter tuning and performance analysis,
record and replay is a popular technique which is imple-
mented by Envirolog [26] for sensor network applications.
Envirolog stores module outputs (e.g., outputs of sensing
modules) locally and replays them to reproduce the original
execution. It is good at testing performance of high-level
protocols subjected to a given recorded set of environmental
conditions. While this can also help with debugging such
protocols, no automated diagnosis support is offered.
In [31], the authors pointed out that sensor data, such as
humidity and temperature, may be used to predict network
and node failure, although they do not propose an auto-
mated technique for correlating failures with sensor data.
In contrast, we are focused on automating the troubleshoot-
ing of general protocol bugs which may not necessarily be
revealed by analyzing sensor measurements.
Sympathy [30] presents an early step towards sensor net-
work self-diagnosis. It specializes in attributing reduced
communication throughput at a base-station to the failure
of a node or link in the sensor network. Another example
of automated diagnostic tools is [15] which analyzes pas-
sively logged radio communication messages using a clas-
sification algorithm [10] to identify states correlated with
the occurrence of bugs. The diagnostic capability of [15] is
constrained by its inability to identify event sequences that
precipitate an interaction-related bug. The tool also does
not offer an interface to the debugged system that allows
logging internal events. The recently published diagnostic
simulation effort [14] comes closest to our work. The present
paper extends the diagnostic capability of the above tools by
implementing an actual system (as opposed to using simu-
lation) and presenting a better log analysis algorithm that
is able to uncover infrequent sequences that lead to failures.
Machine learning techniques have previously been applied
to failure diagnosis in other systems [5, 3, 22]. Extensive
work in software bug mining and analysis studies includes
a software behavior graph analysis method for back-tracing
noncrashing bugs [22], a statistical model-based bug local-
ization method [19], which explores the distinction of run-
time execution distribution between buggy and nonbuggy
programs, a control flow abnormality analysis method for
logic error isolation [23], a fault localization-based approach
for failure proximity [20], a Bayesian analysis approach for
localization of bugs with a single buggy run, which is espe-
cially valuable for distributed environments where the bug
can hardly be reproduced reliably [21], and a dynamic slicing-
based approach for failure indexing which may cluster and
index bugs based on their speculated root cause and locality
[24], as well as discriminative frequent pattern analysis [9].
We extend the techniques of discriminative pattern analy-
sis to a two-stage process for analyzing logs to pinpoint the
cause of failure; the first stage identifies all the “symptoms”
of the problem from the logs, while the second ties them to
possible root causes.
Formal methods [33, 13, 4, 28] offer a different alterna-
tive based on verifying component correctness or, conversely,
identifying which properties are violated. The approach is
challenging to apply to large systems due to the high degree
of concurrency and non-determinism in complex wireless
sensor networks, which leads to an exponential state space
explosion. Unreliable wireless communication and sensing
pose additional challenges. Moreover, even a verified sys-
tem can suffer poor performance due to a design flaw. Our
tool automatically answers questions that help identify the
cause of failure that manifests during run time.
3. DUSTMINER OVERVIEW
Most previous debugging approaches for sensor networks
are geared at finding localized errors in code (with prefer-
ence to those that are reproducible). In contrast, we focus
on non-deterministically occurring errors that arise because
of unexpected or adverse distributed interactions between
multiple seemingly individually-correct components. The
non-localized, hard-to-reproduce nature of such errors makes
them especially hard to find.
Dustminer is based on the idea that, in a distributed wire-
less sensor network, nodes interact with each other in a
manner defined by their distributed protocols to perform
cooperative tasks. Unexpected sequences of events, subtle
interactions between modules, or unintended design flaws
in protocols may occasionally lead to an undesirable or in-
valid state, causing the system to fail or exhibit poor perfor-
mance. Hence, in principle, if we log different types of events
in the network, we may be able to capture the unexpected
sequence that leads to failure (along with thousands of other
sequences of events). The challenge for the diagnostic tool
is to automatically identify this culprit sequence. Our ap-
proach exploits both (i) non-determinism and (ii) interactive
complexity to improve ability to diagnose distributed inter-
action bugs. This point is elaborated below:
Exploiting non-reproducible behavior: We adapt data
mining approaches that use examples of both “good”
and “bad” system behavior to be able to classify the
conditions correlated with good and bad. In particu-
lar, note that conditions that cause a problem to occur
are correlated (by causality) with the resulting bad be-
havior. Root causes of non-reproducible bugs are thus
inherently suited for discovery using such data mining
approaches; the lack of reproducibility itself and the
inherent system non-determinism improve the odds of
occurrence of sufficiently diverse behavior examples to
train the troubleshooting system to understand the rel-
evant correlations and identify causes of problems.
Exploiting interactive complexity: Interactive complex-
ity describes a system where scale and complexity cause
components to interact in unexpected ways. A failure
that occurs due to such unexpected interactions is typ-
ically hard to “blame” on any single component. This
fundamentally changes the objective of a troubleshoot-
ing tool from aiding in stepping through code (which
is more suitable for finding a localized error in some
line, such as an incorrect pointer reference), to aid-
ing with diagnosing a sequence of events (component
interactions) that leads to a failure state.
At a high level, our tool first uses a data collection front-end
to collect runtime events for post-mortem analysis. Once
the log of runtime events is available, the tool separates the
collected sequence of events into two piles - a “good” pile,
which contains the parts of the log when the system performs
as expected, and a “bad” pile, which contains the parts of
the log when the system fails or exhibits poor performance.
This data separation phase is done based on a predicate that
defines “good” versus “bad” behavior, provided by the appli-
cation developer. For example, the predicate, applied offline
to logged data, might state that a sequence of more than 10
consecutive lost messages between a sender and receiver is
bad behavior (hence return “bad” in this case). To increase
diagnostic accuracy, experiments can be run multiple times
before data analysis.
A discriminative frequent pattern mining algorithm then
looks for patterns (sequences of events) that exist with very
different frequencies in the two piles. These patterns are
called discriminative. Later, such patterns are analyzed for
correlations with preceding events in the logs, if any, that
occur less frequently. Hence, it is possible to catch anomalies
that cause problems as well as common roots of multiple
error manifestations.
A well-known algorithm for finding frequent patterns in
data is the Apriori algorithm [2]. This algorithm was used
in previous work on sensor network debugging [14] to ad-
dress a problem similar to ours, where the Apriori algorithm is
used to determine the event sequences that lead to problems
in sensor networks. We show that the approach has serious
limitations and extend this algorithm to suit the purposes of
sensor network debugging. The original Apriori algorithm,
used in the aforementioned work, is an iterative algorithm
that proceeds as follows. At the first iteration, it counts the
number of occurrences (called support) of each distinct event
in the data set (i.e., in the “good” or “bad” pile). Next, it
discards all events that are infrequent (their support is less
than some parameter minSup). The remaining events are
frequent patterns of length 1. Assume the set of frequent
patterns of length 1 is S1. At the next iteration, the algo-
rithm generates all the candidate patterns of length 2, which
is S1 × S1. Here ‘×’ represents the Cartesian product. It
then computes the frequency of occurrence of each pattern
in S1 × S1 and discards those with support less than minSup
again. The remaining patterns are the frequent patterns of
length 2. Let us call them set S2. Similarly, the algorithm
will generate all the candidate patterns of length 3, which is
S2 × S1, and discard infrequent patterns (with support less
than minSup) to generate S3, and so on. It continues this
process until it cannot generate any more frequent patterns.
We show in this paper, how the previous work is extended
for purposes of diagnosis. The problems with the algorithm
and our proposed solutions are described in section 4.
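For concreteness, the following minimal sketch (in Java, for illustration only and not part of the actual tool) captures this iterative loop; events are plain strings, candidates of length k are formed as the product of the length-(k-1) patterns with S1, and support() assumes one simple greedy, non-overlapping convention for counting subsequence occurrences.
import java.util.*;

// Minimal illustrative sketch of the Apriori-style frequent subsequence mining
// loop described above (not the tool's actual code; the counting convention is
// an assumption).
public class AprioriSketch {

    // Greedy, non-overlapping count of 'pattern' as a subsequence of 'log'.
    static int support(List<String> log, List<String> pattern) {
        int count = 0, j = 0;
        for (String e : log) {
            if (e.equals(pattern.get(j))) {
                j++;
                if (j == pattern.size()) { count++; j = 0; }   // one complete match
            }
        }
        return count;
    }

    // Returns all patterns with support >= minSup, up to length maxLen.
    static List<List<String>> mine(List<String> log, int minSup, int maxLen) {
        // Iteration 1: frequent patterns of length 1 (the set S1).
        List<List<String>> s1 = new ArrayList<>();
        for (String e : new LinkedHashSet<>(log)) {
            if (support(log, List.of(e)) >= minSup) s1.add(List.of(e));
        }
        List<List<String>> frequent = new ArrayList<>(s1);
        List<List<String>> current = s1;
        // Iteration k: candidates are S(k-1) x S1; keep those that are frequent.
        for (int len = 2; len <= maxLen && !current.isEmpty(); len++) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : current) {
                for (List<String> e : s1) {
                    List<String> cand = new ArrayList<>(prefix);
                    cand.addAll(e);
                    if (support(log, cand) >= minSup) next.add(cand);
                }
            }
            frequent.addAll(next);
            current = next;
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<String> log = List.of("a", "b", "b", "c", "d", "b", "b", "a", "c");
        System.out.println(mine(log, 2, 3));   // includes [b], [b, b], [b, b, c], ...
    }
}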
4. CHALLENGES AND SOLUTIONS
Performing discriminative frequent pattern mining based on
frequent patterns generated by the Apriori algorithm poses sev-
eral challenges that need to be addressed before this technique
can be applied to debugging. For
the purposes of the discussion below, let us define an event
to be the basic element in the log that is analyzed for failure
diagnosis. The structure of an event in our log is as follows:
<NodeId, EventType, attribute1, attribute2, ..., attributeN,
Timestamp>
NodeId is used to identify the node that records the event.
EventType is used to identify the event type (e.g., message
dropped, flash write finished, etc.). Based on the event type,
it is possible to interpret the rest of the record (the list of
attributes). The set of distinct EventTypes is often called
the alphabet in an analogy with strings. In other words, if
events were letters in an alphabet, we are looking for strings
that cause errors to occur. These strings represent event
sequences (ordered lists of events). The generated log can
be thought of as a single sequence of logged events. For
example, S1 = (<a>, <b>, <b>, <c>, <d>, <b>,
<b>, <a>, <c>) is an event sequence. Elements <a>,
<b>, ..., are events. A discriminative pattern between two
data sets is a subsequence of (not necessarily contiguous)
events that occurs with a different count in the two sets. The
larger the difference, the better the discrimination. With the
above terminology in mind, we present how the algorithm is
extended to apply to debugging.
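Purely for illustration, such a record might be represented in an analysis back-end by a type like the hypothetical one below (the actual on-flash layout is a raw byte stream that is parsed by the middleware described in Section 5).
import java.util.List;

// Hypothetical back-end representation of one logged event, mirroring the
// <NodeId, EventType, attribute1, ..., attributeN, Timestamp> record structure
// (for illustration only; the on-flash format is a raw byte stream).
public record LoggedEvent(int nodeId, String eventType,
                          List<Integer> attributes, long timestamp) { }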
4.1 Preventing False Frequent Patterns
The Apriori algorithm generates all possible combinations
of frequent subsequences of the original sequence. As a re-
sult, it generates subsequences combining events that are
“too far” apart to be causally correlated with high proba-
bility and thus reduces the chance of finding the “culprit
sequence” that actually caused the failure. This strategy
could negatively impact the ability to identify discrimina-
tive patterns in two ways; (i) it could lead to the generation
of discriminative patterns that are not causally related, and
(ii) it could eliminate discriminative patterns by generating
false patterns. Consider the following example.
Suppose we have the following two sequences:
S1 = (<a>, <b>, <c>, <d>, <a>, <b>, <c>, <d>)
S2 = (<a>, <b>, <c>, <d>, <a>, <c>, <b>, <d>)
Suppose the system fails when <a> is followed by <c>
before <b>. As this condition is violated in sequence S2,
ideally, we would like our algorithm to be able to detect
(<a>, <c>, <b>) as a discriminative pattern that distin-
guishes these two sequences.
Now, if we apply the Apriori technique, it will generate
(<a>, <c>, <b>) as an equally likely pattern for both
S1 and S2, since in both S1 and S2 it will combine the first
occurrence of <a> and the first occurrence of <c> with
the second occurrence of <b>. So the pattern will get canceled out
at the differential analysis phase.
To address this issue, the key observation is that the
first occurrence of <a> should not be allowed to combine
with the second occurrence of <b>, as there is another
event <a> after the first occurrence of <a> but before
the second occurrence of <b>, and the second occurrence
of <b> is correlated with the second occurrence of <a> with
higher probability.
To prevent such erroneous combinations, we use a dy-
namic search window scheme where the first item of any can-
didate sequence is used to determine the search window. In
this case, for any pattern starting with < a >, the search win-
dow is [1,4] and [5,8] in S1and S2. With this search window,
the algorithm will search for pattern (< a >, < c >, < b >)
in window [1,4] and [5,8] and will fail to find it in S1but
will find it in sequence S2only. As a result, the algorithm
will be able to report pattern (< a >, < c >, < b >)as a
discriminative pattern.
This dynamic search window scheme also speeds up the
search significantly. In this scheme, the original sequence (of
size 8 events) was reduced to windows of size 4, making the
search for patterns in those windows more efficient.
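The hypothetical Java sketch below illustrates the windowed search on the example above; it counts at most one occurrence of the candidate pattern per window, which is an assumed convention rather than a detail fixed by the algorithm description.
import java.util.List;

// Illustrative sketch of the dynamic search window: a candidate pattern whose
// first event is, say, <a> is matched only inside windows that start at an
// occurrence of <a> and end just before the next occurrence of <a>.
public class WindowedSupport {

    static int windowedSupport(List<String> log, List<String> pattern) {
        String first = pattern.get(0);
        int count = 0;
        int start = 0;
        while (start < log.size()) {
            if (!log.get(start).equals(first)) { start++; continue; }
            // The window runs from 'start' up to (but not including) the next 'first'.
            int end = start + 1;
            while (end < log.size() && !log.get(end).equals(first)) end++;
            // Try to match the rest of the pattern inside this window.
            int j = 1;
            for (int k = start + 1; k < end && j < pattern.size(); k++) {
                if (log.get(k).equals(pattern.get(j))) j++;
            }
            if (j == pattern.size()) count++;    // at most one occurrence per window
            start = end;                         // continue with the next window
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> s1 = List.of("a", "b", "c", "d", "a", "b", "c", "d");
        List<String> s2 = List.of("a", "b", "c", "d", "a", "c", "b", "d");
        List<String> pattern = List.of("a", "c", "b");
        System.out.println(windowedSupport(s1, pattern));  // 0: not found in any window of S1
        System.out.println(windowedSupport(s2, pattern));  // 1: found in the second window of S2
    }
}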
4.2 Suppressing Redundant Subsequences
At the frequent pattern generation stage, if two patterns,
Si and Sj, have support minSup, the Apriori algorithm
keeps both sequences as frequent patterns even if one is a
subsequence of the other and both have equal support. This
makes perfect sense in data mining but not in debugging.
For example, when mining the “good” data set, the above
strategy assumes that any subset of a “good” pattern is also a
good pattern. In real-life, this is not true. Forgetting a step
in a multi-step procedure may well cause failure. Hence,
subsequences of good sequences are not necessarily good.
Keeping these subsequences as examples of “good” behavior
leads to a major problem at the differential analysis stage
when discriminative patterns are generated since they may
incorrectly cancel out similar subsequences found frequent
in the other (i.e., “bad” behavior) data pile. For example,
consider two sequences below:
S1 = (<a>, <b>, <c>, <d>, <a>, <b>, <c>, <d>)
S2 = (<a>, <b>, <c>, <d>, <a>, <b>, <d>, <c>)
Suppose, for correct operation of the protocol, event <a>
has to be followed by event <c> before event <d> can
happen. In sequence S2 this condition is violated. Ideally,
we would like our algorithm to report sequence:
S3 = (<a>, <b>, <d>) as the “culprit” sequence. How-
ever, if we apply the Apriori algorithm, it will fail to catch this
sequence. This is because it will generate S3 as a frequent
pattern for both S1 and S2 with support 2, and it will get can-
celed out at the differential analysis phase. As expected, S3
will never show up as a “discriminative pattern”. Note that
with the dynamic search window scheme alone, we cannot
prevent this.
To illustrate, suppose a successful message transmission
involves the following sequence of events:
(<enableRadio>, <messageSent>, <ackReceived>, <disableRadio>)
Now, although sequence:
(<enableRadio>, <messageSent>, <disableRadio>)
is a subsequence of the original “good” sequence, it does not
represent a successful scenario as it disables the radio before
receiving the “ACK” message.
To solve this problem, we need an extra step (which we call
sequenceCompression) before we perform differential analy-
sis to identify discriminative patterns. At this step, we re-
move the sequence Si if it is a subsequence of Sj with the
same support (this mechanism can be extended to remove
subsequences of a similar but not identical support). This
will remove all the redundant subse-
quences from the frequent pattern list. Subsequences with
a (sufficiently) different support will be retained and will
show up after discriminative pattern mining.
In the above example, pattern (<a>, <b>, <c>, <d>)
has support 2 in S1 and support 1 in S2. Pattern (<a>,
<b>, <d>) has support 2 in both S1 and S2. Fortunately, at
the sequenceCompression step, pattern (<a>, <b>, <d>)
will be removed from the frequent pattern list generated for
S1 because it is a subsequence of a larger frequent pattern
of the same support. It will therefore remain only on the
frequent pattern list generated for S2 and will show up as a
discriminative pattern.
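A minimal sketch of the sequenceCompression step, under the assumption that frequent patterns are kept in a map from pattern to support, might look as follows (illustrative Java, not the tool's actual code).
import java.util.*;

// Illustrative sketch of the sequenceCompression step: a frequent pattern is
// dropped when it is a subsequence of a strictly longer frequent pattern that
// has the same support.
public class SequenceCompression {

    // true if 'small' occurs as a (not necessarily contiguous) subsequence of 'big'
    static boolean isSubsequence(List<String> small, List<String> big) {
        int j = 0;
        for (String e : big) {
            if (j < small.size() && e.equals(small.get(j))) j++;
        }
        return j == small.size();
    }

    // 'patterns' maps each frequent pattern to its support count.
    static Map<List<String>, Integer> compress(Map<List<String>, Integer> patterns) {
        Map<List<String>, Integer> kept = new LinkedHashMap<>();
        outer:
        for (Map.Entry<List<String>, Integer> p : patterns.entrySet()) {
            for (Map.Entry<List<String>, Integer> q : patterns.entrySet()) {
                boolean redundant = q.getKey().size() > p.getKey().size()
                        && q.getValue().equals(p.getValue())
                        && isSubsequence(p.getKey(), q.getKey());
                if (redundant) continue outer;   // drop p: it adds no information
            }
            kept.put(p.getKey(), p.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<List<String>, Integer> patterns = new LinkedHashMap<>();
        patterns.put(List.of("a", "b", "c", "d"), 2);   // support of the full pattern in S1
        patterns.put(List.of("a", "b", "d"), 2);        // subsequence with the same support
        System.out.println(compress(patterns).keySet()); // [[a, b, c, d]]
    }
}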
4.3 Two Stage Mining for Infrequent Events
In debugging, sometimes less frequent patterns could be
more indicative of the cause of failure than the most frequent
patterns. A single mistake can cause a damaging sequence
of events. For example, a single node reboot event can cause
a large number of message losses. In such cases, if frequent
patterns are generated that are commonly found in failure
cases, the most frequent patterns may not include the real
cause of the problem. For example, in case of node reboot,
manifestation of the bug (message loss event) will be re-
ported as the most frequent pattern and the real cause of
the problem (the node reboot event) may be overlooked.
Fortunately, in the case of sensor network debugging, a so-
lution may be inspired by the nature of the problem domain.
The fundamental issue to observe is that much computation
in sensor networks is recurrent. Code repeatedly visits the
same states (perhaps not strictly periodically), repeating the
same actions over time. Hence, a single problem, such as a
node reboot or a race condition that pollutes a data struc-
ture, often results in multiple manifestations of the same
unusual symptom (like multiple subsequent message losses
or multiple subsequent false alarms). Catching these re-
current symptoms by an algorithm such as Apriori is much
easier due to their larger frequency. With such symptoms
identified, the search space can be narrowed and it becomes
easier to correlate them with other less frequent preceding
event occurrences. To address this challenge, we developed
a two stage pattern mining scheme.
At the first stage, the Apriori algorithm generates the
usual frequent discriminative patterns that have support
larger than minSup. For the first stage, minSup is set larger
than 1. It is expected that the patterns involving mani-
festations of bugs will survive at the end of this stage but
infrequent events like a node reboot will be dropped due to
their low support.
At the second stage, at first, the algorithm splits the log
into fixed width segments (default width is 50 events in our
implementation). Next, the algorithm counts the number of
discriminative frequent patterns found in each segment and
ranks each segment of the log based on the count (the higher
the number of discriminative patterns in a segment, the
higher the rank). If discriminative patterns occurred consec-
utively in multiple segments, those segments are merged into
a larger segment. Next, the algorithm generates frequent
patterns with minSup reduced to 1 on the K highest-ranked
segments separately (the default K is 5 in our implementation)
and extracts the patterns that are common in these regions.
Note that the initial value of K is set conservatively. The
optimum value of K depends on the application. If, with the
initial value of K, the tool fails to catch the real cause,
the value of K is increased iteratively. In this scheme, we
have a higher chance of reporting single events such as race
conditions that cause multiple problematic symptoms. Ob-
serve that the algorithm is applied to data that combines the
logs from several experimental runs. The race condition may
have occurred once, at different points, in some of these runs.
This scheme has a significant impact on the performance
of the frequent pattern mining algorithm. Scalability is one
of the biggest challenges in applying discriminative frequent
pattern analysis to debugging. For example, if the total
number of logged events is on the order of thousands (more
than 40,000 in one of our later examples), it is computation-
ally infeasible to generate frequent patterns of non-trivial
length for the whole sequence. Using two stage mining, we
can dramatically reduce the search space and make it fea-
sible to mine for longer frequent patterns which are more
indicative of the cause of failure than shorter sequences.
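As an illustration of the second stage, the hypothetical sketch below segments a log, scores each segment by how many stage-one discriminative patterns it contains, and returns the K highest-ranked segments; merging of adjacent high-scoring segments and the re-mining pass with minSup reduced to 1 are omitted.
import java.util.*;

// Illustrative sketch of stage two of the two-stage mining scheme: segment the
// log, rank segments by how many stage-one discriminative patterns they
// contain, and keep the top-K segments for a second pass with minSup = 1.
public class TwoStageRanking {

    static List<List<String>> topSegments(List<String> log,
                                          List<List<String>> discriminative,
                                          int width, int k) {
        // 1. Split the log into fixed-width segments (default width 50 events).
        List<List<String>> segments = new ArrayList<>();
        for (int i = 0; i < log.size(); i += width) {
            segments.add(log.subList(i, Math.min(i + width, log.size())));
        }
        // 2. Score each segment by the number of discriminative patterns it contains.
        int[] score = new int[segments.size()];
        for (int s = 0; s < segments.size(); s++) {
            for (List<String> p : discriminative) {
                if (containsSubsequence(segments.get(s), p)) score[s]++;
            }
        }
        // 3. Rank segments by score and keep the K highest-ranked ones
        //    (merging of adjacent high-scoring segments is omitted here).
        List<Integer> order = new ArrayList<>();
        for (int s = 0; s < segments.size(); s++) order.add(s);
        order.sort((a, b) -> score[b] - score[a]);
        List<List<String>> top = new ArrayList<>();
        for (int s = 0; s < Math.min(k, order.size()); s++) {
            top.add(segments.get(order.get(s)));
        }
        return top;
    }

    // true if 'pattern' occurs as a (not necessarily contiguous) subsequence of 'segment'
    static boolean containsSubsequence(List<String> segment, List<String> pattern) {
        int j = 0;
        for (String e : segment) {
            if (j < pattern.size() && e.equals(pattern.get(j))) j++;
        }
        return j == pattern.size();
    }
}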
4.4 Other Challenges
Several other changes need to be made to standard data
mining techniques. For example, the number of logged events
and the corresponding frequency of patterns can be different
from run to run depending on factors such as length of exe-
cution and system load. A higher sampling rate at sensors,
for example, may generate more messages and cause more
events to be logged. Many logged event patterns in this case
will appear to be more frequent. This is problematic when
it is desired to compare the frequency of patterns found in
“good” and “bad” data piles for purposes of identifying those
correlated with bad behavior. To address this issue, we need
to normalize the frequency count of events in the log. In the
case of single events (i.e., patterns of length 1), we use the
ratio of occurrence of the event instead of absolute counts.
In other words, the support of any particular event <e>
in the event log is divided by the total number of events
logged, yielding in essence the probability of finding that
event in the log, P(e). For patterns of length more than
1, we extend the scheme to compute the probability of the
pattern, given recursively by P(e1) · P(e2|e1) · P(e3|e1,e2) · ...
The individual terms above are easy to compute. For ex-
ample, P(e2|e1) is obtained by dividing the support of the
pattern (<e1>, <e2>) by the total support of patterns
starting with <e1>.
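A small illustrative sketch of this normalization is given below; the support() routine assumes a greedy, non-overlapping counting convention, and the conditional term divides the support of the extended prefix by the total support of all one-event extensions of that prefix, as described above.
import java.util.*;

// Illustrative sketch of the frequency normalization: a pattern's raw support
// is converted into a probability P(e1) * P(e2|e1) * P(e3|e1,e2) * ...
public class NormalizedSupport {

    static double probability(List<String> log, List<String> pattern) {
        // P(e1): fraction of logged events equal to e1.
        double p = (double) support(log, pattern.subList(0, 1)) / log.size();
        for (int i = 2; i <= pattern.size(); i++) {
            List<String> prefix = pattern.subList(0, i - 1);
            // Denominator: total support of patterns that start with 'prefix'.
            int denom = 0;
            for (String x : new HashSet<>(log)) {
                List<String> extended = new ArrayList<>(prefix);
                extended.add(x);
                denom += support(log, extended);
            }
            int numer = support(log, pattern.subList(0, i));
            p *= (denom == 0) ? 0.0 : (double) numer / denom;   // P(e_i | e_1 .. e_{i-1})
        }
        return p;
    }

    // Greedy, non-overlapping count of 'pattern' as a subsequence of 'log'.
    static int support(List<String> log, List<String> pattern) {
        int count = 0, j = 0;
        for (String e : log) {
            if (e.equals(pattern.get(j))) {
                j++;
                if (j == pattern.size()) { count++; j = 0; }
            }
        }
        return count;
    }
}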
Finally, there are issues with handling event parameters.
Logged events may have parameters (e.g., identity of the re-
ceiver for a message transmission event). Since event param-
eter lists may be different, calling each variation a different
event will cause a combinatorial explosion of the alphabet.
For example, an event with 10 parameters, each with 10 pos-
sible values, will generate a space of 10^10 possible combina-
tions. To address the problem, continuous or fine-grained
parameters need to be discretized into a smaller number of
ranges. Multi-parameter events need to be converted into se-
quences of single-parameter events each listing one parame-
ter at a time. Hence, the exponential explosion is reduced to
linear growth in the alphabet, proportional to the number of
discrete categories a single parameter can take and the aver-
age number of parameters per event. Techniques for dealing
with event parameter lists were introduced in [14] and are
not discussed further in this paper.
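For completeness, the hypothetical sketch below shows one way such a conversion might look; the event and attribute names are made up for illustration, and the bucket width is an arbitrary example choice.
import java.util.*;

// Illustrative sketch of event-parameter handling: a multi-attribute event is
// flattened into a sequence of single-attribute events, and raw attribute
// values are discretized into coarse buckets so that the alphabet grows
// linearly rather than exponentially. All names here are made up.
public class FlattenEvent {

    static List<String> flatten(String eventType,
                                Map<String, Integer> attributes,
                                int bucketWidth) {
        List<String> out = new ArrayList<>();
        out.add(eventType);                                  // the event itself
        for (Map.Entry<String, Integer> a : attributes.entrySet()) {
            int bucket = a.getValue() / bucketWidth;         // discretize the raw value
            out.add(eventType + ":" + a.getKey() + "=bucket" + bucket);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> attrs = new LinkedHashMap<>(); // a made-up MessageSent event
        attrs.put("receiverId", 7);
        attrs.put("channel", 3);
        System.out.println(flatten("MessageSent", attrs, 4));
        // [MessageSent, MessageSent:receiverId=bucket1, MessageSent:channel=bucket0]
    }
}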
5. DUSTMINER ARCHITECTURE
We realize that the types of debugging algorithms needed
are different for different applications, and are going to evolve
over time with the evolution of hardware and software plat-
forms. Hence, we aim to develop a modular tool architecture
that facilitates evolution and reuse. Keeping that in mind,
we developed a software architecture that provides the nec-
essary functionality and flexibility for future development.
The goal of our architecture is to facilitate easy use and ex-
perimentation with different debugging techniques and fos-
ter future development. As there are numerous different
types of hardware, programming abstractions, and operat-
ing systems in use for wireless sensor networks, the architec-
ture must be able to accommodate different combinations
of hardware and software. Different ways of data collec-
tion should not affect the way the data analysis layer works.
Similarly we realize that for different types of bugs, we may
need different types of techniques to identify the bug and
we want to provide a flexible framework to experiment with
different data analysis algorithms. Based on the above re-
quirements, we designed a layered, modular architecture as
shown in Figure 1. We separate the whole system into three
subsystems: (i) a data collection front-end, (ii) data prepro-
cessing middleware, and (iii) a data analysis back-end.
5.1 Data Collection Front-End
The role of the data collection front-end is to provide the de-
bug information (i.e., log files) that can be analyzed for di-
agnosing failures. The source of this debug log is irrelevant
to the data analysis subsystem. As shown in Figure 1, the
developer may choose to analyze the recorded radio commu-
nication messages obtained using a passive listening tool, or
the execution traces obtained from simulation runs, or the
run-time sequences of events obtained by logging on actual
application motes and so on. With this separation of con-
cerns, the front-end developer could design and implement
the data collection subsystem more efficiently and indepen-
dently.
[Figure 1: Debugging framework. The diagram shows a set of data collection front-ends (Front-End I: passive listener, Front-End II: runtime logging, Front-End III: diagnostic simulation), the data preprocessing middleware (front-end specific data cleaning, data parsing driven by an application-specific header file describing the event format, data labeling driven by an application-specific “bad” behavior predicate, and data analysis tool specific data conversion), and a set of data analysis back-ends (Data Analysis Tool I: WEKA, Tool II: discriminative frequent pattern miner, Tool III: graphical visualizer).]
The data collection front-end developer merely needs
to provide the format of the recorded data. These data are
used by the data preprocessing middleware to parse the raw
recorded byte streams.
5.2 Data Preprocessing Middleware
This middleware that sits between the data collection front-
end and the data analysis back-end provides the necessary
functionality to change or modify one subsystem without af-
fecting the other. The interface between the data collection
front-end and the data analysis back-end is further divided
into the following layers:
Data cleaning layer: This layer is front-end specific.
Each supported front-end will have one instance of it.
The layer is the interface between the particular data
collection front-end and the data preprocessing mid-
dleware. It ensures that the recorded events are com-
pliant with format requirements.
Data parsing layer: This layer is provided by our frame-
work and is responsible for extracting meaningful records
from the recorded raw byte stream. To parse the
recorded byte stream, this layer requires a header file
describing the recorded message format. This infor-
mation is provided by the application developer (i.e.,
the user of the data collection front-end).
Data labeling layer: To be able to identify the proba-
ble causes of failure, the data analysis subsystem needs
samples of logged events representing both “good” and
“bad” behavior. As “good” or “bad” behavior seman-
tics are an application specific criterion, the applica-
tion developer needs to implement a predicate (a small
module) whose interface is already provided by us in
the framework. The predicate, presented with an or-
dered event log, decides whether behavior is good or
bad.
104
Data conversion layer: This layer provides the inter-
face between the data preprocessing middleware and
the data analysis subsystem. One instance of this layer
exists for each different analysis back-end. This layer is
responsible for converting the labeled data into appro-
priate format for the data analysis algorithm. The in-
terface of this data conversion layer is provided by the
framework. As different data analysis algorithms and
techniques can be used for analysis, each may have dif-
ferent input format requirements. This layer provides
the necessary functionality to accommodate supported
data analysis techniques.
5.3 Data Analysis Back-End
At present, we implement the data analysis algorithm and
its modifications presented earlier in Section 4. It is respon-
sible for identifying the causes of failures. The approach is
extensible. As newer analysis algorithms are developed that
catch more or different types of bugs, they can be easily
incorporated into the tool as alternative back-ends. Such
algorithms can be applied in parallel to analyze the same
set of logs to find different problems with them.
6. DUSTMINER IMPLEMENTATION
In this section, we describe the implementation of the data
collection front-end and the data analysis back-end that are
used for failure diagnosis in this paper. We used two differ-
ent data collection front-ends for two different case studies.
The first one is implemented by us and used for real-time
logging of user-defined events on flash memory in MicaZ
motes, and the second front-end is the built-in logging sup-
port functionality provided by the LiteOS operating system for
MicaZ motes. At the data analysis back-end, we used dis-
criminative frequent pattern analysis for failure diagnosis.
We describe the implementation of each of these next.
6.1 The Front-End: Acquiring System State
We used two different data collection front-ends to col-
lect data: (i) event logging system implemented for MicaZ
platform in TinyOS 2.0 and (ii) kernel event logger for Mi-
caZ platform provided by LiteOS. The format of the event
logged by the two subsystems are completely different. We
were able to use our framework to easily integrate the two
different front-ends and use the same back-end to analyze
the cause of failures, which shows modularity. We briefly
describe each of these front-ends below.
6.1.1 Data Collection Front-End for TinyOS
Implementation:
The event logger for MicaZ hardware is implemented us-
ing the TinyOS 2.0 BlockRead and BlockWrite interfaces
to perform read and write operations respectively on flash.
BlockRead and BlockWrite interfaces allow accessing the
flash memory at a larger granularity which minimizes the
recording time to flash.
To minimize the number of flash accesses we used a global
buffer to accumulate events temporarily before writing to
flash. Two identical buffers (buffer A and B) are used alter-
nately to minimize the interference between event buffering
and writing to flash.
[Figure 2: Impact of buffer size and event rate on logging performance. The plot shows logging success rate (%) versus event interval (ms) for one buffer of 32, 64, 128, 256, or 512 bytes and for two buffers of 16, 32, 64, 128, or 256 bytes each.]
[Figure 3: Flash space layout. Flash space is divided into a region reserved for the application and a region reserved for logging; the logging region holds metadata (Flash_Head_Index, Flash_Tail_Index) and records, each consisting of a record length, an EventId, and data.]
When buffer A gets filled up, buffer B is
used for temporary buffering and buffer A is written to flash
and vice versa. In Figure 2 we show the effect of buffer size
on logging performance for a single buffer and a double buffer,
respectively. Using two buffers increases the logging per-
formance substantially. As shown in the figure, for an event
rate of 1000 events/second, using one buffer of 512 bytes has a
success ratio (measured as the ratio of successfully logged
events to the total number of generated events) of only 60%,
whereas using two buffers of 256 bytes each (512 bytes in
total) gives an almost 100% success ratio. For a rate of 200
events/second, two buffers of 32 bytes each are enough for
a 100% success ratio.
The sizes of these buffers are configurable as different ap-
plications need different amounts of runtime memory. It is
to be noted that if the system crashes while some data are
still in the RAM buffer, those events will be lost. The flash
space layout is given in Figure 3.
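The double-buffering scheme itself can be pictured roughly as follows; this is an illustrative sketch only (the actual front-end is nesC code using the TinyOS 2.0 BlockWrite interface), with a generic flush callback standing in for the asynchronous flash write.
import java.util.Arrays;
import java.util.function.Consumer;

// Illustrative sketch of the double-buffering idea only (the real front-end is
// nesC code on TinyOS 2.0): events accumulate in one buffer while the other,
// already full, buffer is handed to a flush callback that stands in for the
// asynchronous BlockWrite operation. Records are assumed smaller than a buffer.
public class DoubleBufferLogger {
    private final byte[][] buffers;
    private int active = 0;        // index of the buffer currently being filled
    private int used = 0;          // bytes used in the active buffer
    private final Consumer<byte[]> flush;

    DoubleBufferLogger(int bufferSize, Consumer<byte[]> flush) {
        this.buffers = new byte[][] { new byte[bufferSize], new byte[bufferSize] };
        this.flush = flush;
    }

    // Append one encoded event record; swap buffers when the active one is full.
    synchronized void log(byte[] record) {
        if (used + record.length > buffers[active].length) {
            byte[] full = Arrays.copyOf(buffers[active], used);
            active = 1 - active;   // keep logging into the other buffer
            used = 0;
            flush.accept(full);    // write the full buffer out (to flash, in the paper)
        }
        System.arraycopy(record, 0, buffers[active], used, record.length);
        used += record.length;
    }
}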
A separate MicaZ mote (LogManager) is used to commu-
nicate with the logging subsystem to start and stop log-
ging. Until the logging subsystem receives the “StartLog-
ging” command, it will not log anything, and after receiv-
ing the “StopLogging” command it will flush the remaining data
that is in the buffer to flash and stop logging. This gives the
user the flexibility to start and stop logging whenever they
want. It also lets the user run their application without
enabling logging, when needed, to avoid the runtime over-
head of the logging functionality without recompiling the code.
We realize that occasional event reordering can occur due
to preemption, interrupts, or task scheduling delays. An
occasional invalid log entry is not a problem. An occasional
incorrect logging sequence is fine too as long as the same
occasional wrong sequence does not occur consistently. This
is because common sequences do not have to occur every
time, but only often enough to be noticed. Hence, they can
be occasionally mis-logged without affecting the diagnostic
accuracy.
Time Synchronization:
We need to timestamp the recorded events so that events
recorded on different nodes can be serialized later during
offline analysis. To avoid the overhead of running a time
synchronization protocol on the application mote, we used
an offline time synchronization scheme. A separate node
(TimeManager) is used to broadcast its local clock periodi-
cally. The event logging component will receive the message
and log it in flash with a local timestamp. From this infor-
mation we can calculate the clock skew of different nodes
with reference to the TimeManager node, adjust the timestamps of
the logged events, and serialize the logs. We realize that the
serialized log may not be exact but it is good enough for
pattern mining.
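The offline adjustment can be pictured roughly as in the sketch below; the linear skew model fitted between the first and last logged TimeManager beacons is an assumption made for illustration rather than a detail of our implementation.
import java.util.List;

// Illustrative sketch of the offline time synchronization step: each node logs
// the TimeManager's periodic broadcasts together with its own local timestamp;
// during offline analysis these pairs are used to map local timestamps onto
// the TimeManager's clock before logs from different nodes are merged.
public class OfflineTimeSync {

    // One logged beacon: the TimeManager clock value carried in the broadcast
    // and the local timestamp at which this node recorded it.
    record Beacon(long managerTime, long localTime) { }

    // Map a local timestamp onto TimeManager time, assuming a linear clock skew
    // estimated from the first and last beacons seen by the node (at least two
    // beacons with distinct local timestamps are assumed).
    static long toManagerTime(long localTimestamp, List<Beacon> beacons) {
        Beacon first = beacons.get(0);
        Beacon last = beacons.get(beacons.size() - 1);
        double skew = (double) (last.managerTime() - first.managerTime())
                    / (double) (last.localTime() - first.localTime());
        return first.managerTime()
             + Math.round(skew * (localTimestamp - first.localTime()));
    }
}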
Interface:
The only part of the data collection front-end that is ex-
posed to the user is the interface for logging user defined
events. Our design goal was to have an easy-to-use interface
and efficient implementation to reduce the runtime overhead
as much as possible. One critical issue with distributed log-
ging was to timestamp the recorded events so that events on
different nodes can be serialized later during offline analysis.
To make event logging functionality simpler, we defined the
interface to the logging component as follows:
log(EventId, (void *)buffer, uint8_t size)
This function is the key inter-
face between application developers and the logging subsys-
tem. To log an event, the user has to call the log() function
with appropriate parameters. For example, if the user wants
to log the event that a radio message was sent and also wants
to log the receiverId along with the event, he/she needs to
define the appropriate record structure in a header file (this
file will also be used to parse the data) with these fields,
initialize the record with appropriate values and call the log
function with that record as the parameter. This simple
function call will log the event. The rest is taken care of by
the logging system underneath. The logging system will append
the timestamp to the recorded event and log them as a single
event. Note that NodeId is not recorded during logging.
This information is added when data is uploaded to PC for
offline analysis.
System Overhead:
The event logging support requires 14670 bytes of pro-
gram memory (this includes the code size for BlockRead
and BlockWrite interface provided by TinyOS 2.0) and 830
bytes of data memory when 400 bytes are used for buffering
(two buffers of 200 bytes each) data before writing to flash.
The user can choose to use less buffer space if the expected event
rate is low. To instrument code, the program size increase
is minimal. To log an event with no attributes, it takes a
single line of code. To log an event with n attributes, it
takes n + 1 lines of code: n lines to initialize the record
and 1 line to call the log() function.
6.1.2 Data Collection Front-End for LiteOS
LiteOS [6] provides the required functionality to log ker-
nel events on MicaZ platforms. Specifically, the kernel logs
events including system calls, radio activities, context swit-
ches and so on. An event log entry is a single 8-bit code
without attributes. In Figure 4, we present a subset of
events logged for the LiteOS case study presented in our pa-
per. We used an experimental set up of a debugging testbed
with all motes connected to a PC via serial interfaces. In
pre-deployment testing on our indoor testbed, logs can thus
be transmitted in real-time through a programming board
via serial communication with a base-station. When a sys-
tem call is invoked or a radio packet is received on a node,
the corresponding code for that specific event is transmitted
through the serial port to the base station (PC). The base
station collects event codes from the serial port and records
it in a globally ordered file.
6.2 The Data Analysis Back-End
At the back-end, we implement the data preprocessing
and discriminative frequent pattern mining algorithm. To
integrate the data collection front-end with the data pre-
processing middleware, we provided a simple text file de-
scribing the storage format of the raw byte stream stored
on flash for each of the front-ends. This file was used to
parse the recorded events. The user supplied different pred-
icates as a Java function. These were used to annotate data
into good and bad segments. The rest of the system is a
collection of data analysis algorithms such as discriminative
frequent pattern mining, or any other tool such as Weka [1].
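As a concrete illustration, a labeling predicate in the spirit of the example from Section 3 (more than 10 consecutive lost messages constitutes “bad” behavior) might look like the hypothetical sketch below; the event names are placeholders rather than the identifiers used in our case studies.
import java.util.List;

// Hypothetical example of a user-supplied labeling predicate: a segment of the
// log is labeled "bad" if it contains more than 10 consecutive lost messages.
// The event names are placeholders for whatever the front-end actually logs.
public class ConsecutiveLossPredicate {

    static String label(List<String> eventLog) {
        int consecutiveLosses = 0;
        for (String event : eventLog) {
            if (event.equals("MessageLost")) {
                consecutiveLosses++;
                if (consecutiveLosses > 10) return "bad";
            } else if (event.equals("MessageReceived")) {
                consecutiveLosses = 0;   // a successful reception breaks the run
            }
        }
        return "good";
    }
}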
7. EVALUATION
To test the effectiveness of the tool, we applied it
to troubleshoot two real-life applications. We present the
case studies where we have used our tool successfully. The
first was a kernel level bug in the LiteOS operating sys-
tem. The second was to debug a multichannel Media Access
Control (MAC) protocol [16] implemented in TinyOS 2.0 for
the MicaZ platform with only one half-duplex radio interface.
7.1 Case Study - I: LiteOS Bug
In this case study, we troubleshoot a simple data collection
application where several sensors monitor light and report
it to a sink node. The communication is performed in a
single-hop environment. In this scenario, sensors transmit
packets to the receiver, and the receiver records received
packets and sends an “ACK” back. The sending rate that
sensors use is variable and depends on the variations in their
readings. After receiving each message, depending on its
sequence number, the receiver decides to record the value or
not. If the sequence number is older than the last sequence
number it has received, the packet is dropped.
This application is implemented using MicaZ motes on
LiteOS operating system and is tested on an experimen-
tal testbed. Each of the nodes is connected to a desktop
computer via an MIB520 programming board and a serial
cable. The PC acts as the base station. In this experiment,
there was one receiver (the base node) and a set of 5 senders
(monitoring sensors). This experiment illustrates a typical
experimental debugging set up. Prior to deployment, pro-
grammers would typically test the protocol on target hard-
ware in the lab. This is how such a test might proceed.
7.1.1 Failure Scenario
When this simple application was stress tested, some of
the nodes would crash occasionally and non-deterministically.
Each time different nodes would crash and at different times.
Perplexed by the situation, the developer (a first-year grad-
uate student with no prior experience with sensor networks)
decided to log different types of events using LiteOS support
and use our debugging tool. These were mostly kernel-level
events along with a few application-level events. The built-
in logging functionality provided by LiteOS was used to log
the events. A subset of the different types of logged events
are listed in Figure 4.
Recorded Events Attribute List
Context_Switch_To_User_Thread Null
Get_Current_Thread_Index Null
Get_Current_Radio_Info_Address Null
Get_Current_Radio_Handle_Address Null
Post_Thread_Task Null
Get_Serial_Mutex Null
Get_Current_Serial_Info_Address Null
Get_Serial_Send_Function Null
Disable_Radio_State Null
Packet_Received Null
Packet_Sent Null
Yield_To_System_Thread Null
Get_Current_Thread_Address Null
Get_Radio_Mutex Null
Get_Radio_Send_Function Null
Mutex_Unlock_Function Null
Get_Current_Radio_Handle Null
Figure 4: Logged events for diagnosing LiteOS ap-
plication bug
7.1.2 Failure Diagnosis
After running the experiment, “good” logs were collected
from the nodes that did not crash during the experiment
and “bad” logs were collected from nodes that crashed at
some point in time. After applying our discriminative fre-
quent pattern mining algorithm to the logs, we provided two
sets of patterns to the developer, one set includes the high-
est ranked discriminative patterns that are found only in
“good” logs as shown in Figure 5, and the other set includes
the highest ranked discriminative patterns that are found
only in “bad” logs as shown in Figure 6.
Based on the discriminative frequent patterns, it is clear
that in the “good” pile, the <Packet_Received> event is highly
correlated with the <Get_Current_Radio_Handle> event.
On the other hand, in the “bad” pile, though the <Packet_Received>
event is present, the other event is missing. In
the “bad” pile, <Packet_Received> is highly correlated
with the <Get_Serial_Send_Function> event. From these
observations, it is clear that proceeding with a <Get_Serial_Send_Function>
when <Get_Current_Radio_Handle>
is missing is the most likely cause of failure.
To explain the error we will briefly describe the way a
<Packet_Received>,<Packet_Sent >,<Get_Current_Radio_Handle>
<Packet_Received>,<Get_Current_Radio_Handle_Address>,<Get_Current_Radio_Handle>
<Packet_Received>,<Mutex_Unlock_Function>,<Get_Current_Radio_Handle>
<Packet_Received>,<Disable_Radio_State>,<Get_Current_Radio_Handle>
<Packet_Received>,<Post_Thread_Task>,<Get_Current_Radio_Handle>
Figure 5: Discriminative frequent patterns found
only in “good” log for LiteOS bug
<Context_Switch_to_User_Thread>,<Get_Current_Thread_Address>,<Get_Serial_Send_Function>
<Packet_Received>,<Context_Switch_to_User_Thread>,<Get_Serial_Send_Function>
<Packet_Received>,<Post_Thread_Task>,<Get_Serial_Send_Function>
<Packet_Received>,<Get_Current_Thread_Index>,<Get_Serial_Send_Function>
<Packet_Received>,<Get_Current_Thread_Address>,<Get_Serial_Send_Function>
Figure 6: Discriminative frequent patterns found
only in “bad” log for LiteOS bug
received packet is handled in LiteOS. In the application,
the receiver always registers for receiving packets, then waits
until a packet arrives. At that time, the kernel switches
back to the user thread with the appropriate packet information.
The packet is then processed in the application. However,
at very high data rates, another packet can arrive before the
processing of the previous packet has been completed. In
that case, the LiteOS kernel overwrites the radio receive buffer
with new information even if the user is still using the old
packet data to process the previous packet.
Indeed, for correct operation, the <Packet_Received> event
always has to be followed by the <Get_Current_Radio_Handle>
event before the <Get_Serial_Send_Function> event. Other-
wise, it crashes the system. Overwriting a receive buffer
for some reason is a very typical bug in sensor networks.
This example is presented to illustrate the use of the tool.
In section 7.2 we present a more complex example that ex-
plores more of the interactive complexity this tool was truly
designed to uncover.
7.1.3 Comparison with Previous Work
To compare the performance of our discriminative pattern
mining algorithm with previous work on diagnosing sensor
network bugs [15, 14], we implemented the pure Apriori
algorithm, used in [14], to generate frequent patterns and
perform differential analysis to extract discriminative pat-
terns. We did not compare with [15] because that work
did not look for sequences of events that cause problems but
rather looked for the current state that correlates with immi-
nent failure. Hence, it addressed a different problem. For
this case study, when we applied the Apriori algorithm to
the “good” log and the “bad” log, the list of discriminative
patterns missed the < P acket Receiv ed > event completely
and failed to identify the fact that the problem was corre-
lated with the timing of packet reception. Moreover, when
we applied the Apriori algorithm to multiple instances of
“good” logs and “bad” logs together, the list of discrimina-
tive patterns returned was empty. All the frequent patterns
generated by Apriori algorithm were canceled at the differ-
ential phase. This shows the necessity of our extensions as
described in section 4.
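For concreteness, the baseline behaves roughly as in the sketch below: Apriori-style mining over unordered event sets, followed by a differential step that discards itemsets frequent in both piles. This is our own illustrative reconstruction, not the code of [14]; it shows why itemsets shared by the “good” and “bad” piles cancel out, leaving an empty discriminative list for a timing-dependent bug.

from itertools import combinations

def apriori_itemsets(logs, min_support=0.5, max_len=2):
    """Classic Apriori over unordered event sets (sketch only)."""
    transactions = [set(log) for log in logs]
    n = len(transactions)
    frequent = {}
    current = list({frozenset([item]) for t in transactions for item in t})
    length = 1
    while current and length <= max_len:
        # Support of each candidate itemset across all transactions.
        counts = {c: sum(c <= t for t in transactions) / n for c in current}
        level = {c: s for c, s in counts.items() if s >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-itemset candidates.
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == length + 1})
        length += 1
    return frequent

def differential(good_logs, bad_logs, min_support=0.5):
    good = apriori_itemsets(good_logs, min_support)
    bad = apriori_itemsets(bad_logs, min_support)
    # Itemsets frequent in both piles cancel; when the bug depends on event timing,
    # the same events appear in both piles and the discriminative lists come back empty.
    return set(good) - set(bad), set(bad) - set(good)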
7.2 Case Study - II: Multichannel MAC Protocol
In this case study, we debug a multichannel MAC protocol. The objective of the protocol used in our study is to dynamically assign a home channel to each node in the network in such a way that throughput is maximized. The design of the protocol exploits the fact that in most wireless sensor networks the communication rate among different nodes is not uniform (e.g., in a data aggregation network). Hence, the problem was formulated so that nodes communicating frequently are clustered together and assigned the same home channel, whereas nodes that communicate less frequently are clustered into different channels. This minimizes the overhead of channel switching when nodes need to communicate. The protocol was recently published in [16].
During experimentation with the protocol, it was noticed that when the data rate between different internally closely-communicating clusters is low, the multichannel protocol comfortably outperforms a single-channel MAC protocol, as it should. However, when the data rate between clusters was increased, the throughput near the base station still significantly outperformed a single-channel MAC, but nodes farther from the base station performed worse than under the single-channel MAC. This should not happen in a well-designed protocol, as the multichannel MAC protocol should utilize the communication spectrum better than a single-channel MAC. The author of the protocol initially concluded that the performance degradation was due to the overhead associated with communication across clusters assigned to different channels. Such communication entails frequent channel switching, as the sender node, according to the protocol, must switch to the frequency of the receiver before transmission and then return to its home channel. This incurs overhead that increases with the transmission rate across clusters. We decided to verify this conjecture.
As a stress test of our tool, we instrumented the protocol to log events related to the MAC layer (such as message transmission and reception as well as channel switching) and used our tool to determine the discriminative patterns generated from different runs with different message rates, some of which performed better than others. For a better understanding of the detected failure scenario, we briefly describe the operation of the multichannel MAC protocol below.
7.2.1 Multichannel MAC Protocol Overview
In the multichannel MAC protocol, each node initially starts with channel 0 as its home channel. To communicate with others, every node maintains a data structure called a “neighbor table” that stores the home channel of each of its neighboring nodes. Channels are organized as a ladder, numbered from lowest (0) to highest (12). When a node decides to change its home channel, it sends out a “Bye” message on its current home channel that includes its new home channel number. Upon receiving a “Bye” message, each other node updates its neighbor table to reflect the new home channel of the sender. After changing its home channel, the node sends out a “Hello” message on the new home channel that includes its node ID. All neighboring nodes on that channel add this node as a new neighbor and update their neighbor tables accordingly.
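The following sketch captures this bookkeeping. The class and message field names are illustrative assumptions (the protocol in [16] runs as mote firmware, not Python); the sketch only mirrors the behavior described above.

class NeighborTable:
    """Per-node neighbor-table bookkeeping for the Bye/Hello handshake (sketch)."""

    def __init__(self, node_id, home_channel=0):
        self.node_id = node_id
        self.home_channel = home_channel          # channels form a ladder 0..12
        self.neighbors = {}                       # neighbor_id -> believed home channel

    def on_bye(self, sender_id, new_channel):
        # A neighbor announced (on its old channel) that it is moving.
        self.neighbors[sender_id] = new_channel

    def on_hello(self, sender_id, channel_heard_on):
        # A newcomer announced itself on our channel: add or update it.
        self.neighbors[sender_id] = channel_heard_on

    def change_home_channel(self, new_channel, send):
        # Announce departure on the old channel, then arrival on the new one.
        send({"type": "Bye", "src": self.node_id, "new_channel": new_channel},
             channel=self.home_channel)
        self.home_channel = new_channel
        send({"type": "Hello", "src": self.node_id}, channel=new_channel)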
To increase robustness to message loss, the protocol also includes a mechanism for discovering the home channel of a neighbor when its entry in the neighbor table becomes stale. When a node sends a message to a receiver on that receiver’s home channel (as listed in the neighbor table) but does not receive an “ACK” after n tries (n is set to 5), it assumes that the destination node is not on that home channel. The reason may be that the destination node changed its home channel permanently but the notification was lost. Instead of wasting more time on retransmissions on the same channel, the sender starts scanning all channels, asking whether the receiver is there. The purpose is to find the receiver’s new home channel and update the neighbor table accordingly. The destination node will eventually hear this data message and reply when it is on its home channel.
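A sketch of this recovery logic, building on the NeighborTable sketch above. Here `unicast` stands in for a send-and-wait-for-ACK primitive and is an assumption, as are the constant names.

MAX_RETRIES = 5            # 'n' in the text
NUM_CHANNELS = 13          # channels 0..12

def send_with_recovery(dst, table, unicast):
    """Retry on the believed home channel, then scan the channel ladder until the
    destination ACKs (stale-neighbor recovery described above)."""
    channel = table.neighbors.get(dst, 0)
    for _ in range(MAX_RETRIES):
        if unicast(dst, channel):              # returns True when an ACK is heard
            return channel
    # No ACK after n tries: assume the table entry is stale and scan all channels.
    for channel in range(NUM_CHANNELS):
        if unicast(dst, channel):
            table.neighbors[dst] = channel     # refresh the neighbor table
            return channel
    return None                                # destination not found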
Since the above mechanism is expensive, as an optimization, overhearing is used to reduce staleness of the neighbor table. Namely, a node updates the home channel of a neighbor in its neighbor table when it overhears an acknowledgement (“ACK”) from that neighbor on that channel. Since “ACK”s are used as a mechanism to infer home channel information, whenever a node switches channels temporarily (e.g., to send to a different node on the home channel of the latter), it delays sending out “ACK” messages until it returns to its home channel, in order to prevent incorrect updates of neighbor tables by recipients of such ACKs.
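In sketch form, the two rules just described look as follows; the function names are ours, not the protocol's.

def on_overheard_ack(table, neighbor_id, channel_heard_on):
    """Overhearing an ACK on a channel is treated as evidence of that neighbor's
    home channel (the optimization described above)."""
    table.neighbors[neighbor_id] = channel_heard_on

def should_ack_now(current_channel, home_channel):
    """The original protocol suppresses ACKs while a node is away from its home
    channel, so that overhearers do not record a temporary channel."""
    return current_channel == home_channel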
Finally, to estimate channel conditions, each node periodically broadcasts a “channelUpdate” message that contains information about successfully received and sent messages during the last measurement period (where the period is set at compile time). Based on that information, each node calculates the channel quality (i.e., the probability of successfully accessing the medium) and uses that measure to decide probabilistically whether to change its home channel. Nodes that sink a lot of traffic (e.g., aggregation hubs or cluster heads) switch first. Others that communicate heavily with them follow. This typically results in a natural separation of node clusters into different frequencies so they do not interfere.
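A possible form of this decision is sketched below. The exact quality metric, threshold, and switching probability used by the protocol are not specified here, so the formulas in the sketch are assumptions intended only to convey the shape of the mechanism.

import random

def channel_quality(sent_ok, received_ok, attempts):
    """Assumed per-period estimate of the probability of successful medium access."""
    return (sent_ok + received_ok) / max(attempts, 1)

def maybe_switch_channel(node, quality, send):
    """Switch home channel probabilistically when quality is poor; heavily loaded
    nodes (sinks) see poor quality first, so they tend to switch before their children."""
    if random.random() < (1.0 - quality):
        new_channel = min(node.home_channel + 1, 12)   # one step up the channel ladder
        node.change_home_channel(new_channel, send)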
7.2.2 Performance Problem
This protocol was executed on 16 MicaZ motes implementing an aggregation tree where several aggregation cluster heads filter data received from their children, significantly reducing the amount forwarded, and then send the reduced data to a base station. When the data rate across clusters was low, the protocol outperformed the single-channel MAC. However, when the data rate among clusters was increased, the performance of the protocol deteriorated significantly, performing worse than a single-channel MAC in some cases. The developer of the protocol assumed that this was due to the overhead of the channel change mechanism, which is incurred when communication happens heavily among different clusters. Much debugging effort was spent in that direction with no result.
7.2.3 Failure Diagnosis
To diagnose the cause of the performance problem, we logged
different types of MAC events as listed in Figure 7.
The question posed to our tool was “Why is the perfor-
mance bad at higher data rate?”. To answer this question,
we first executed the protocol at low data rates (when the
performance is better than single channel MAC) to collect
Recorded Events Attribute List
Ack_Received Null
Home_Channel_Changed oldChannel, newChannel
TimeSyncMsg referenceTime, localTime
Channel_Update_Msg_Sent homeChannel
Data_Msg_Sent_On_Same_Channel destId, homeChannel
Data_Msg_Sent_On_Different_Channel destId, homeChannel, destChannel
Channel_Update_Msg_Received homeChannel, neighborId, neighborChannel
Retry_Transmission oldChannelTried, nextChannelToTry
No_Ack_Received Null
Figure 7: Logged events for diagnosing multichannel
MAC protocol
<No_Ack_Received>,<Retry_Transmission>
<Retry_Transmission>,<No_Ack_Received>
<Data_Msg_Sent_On_Same_Channel: homechannel:0>,
<No_Ack_Received>,
<Retry_Transmission>,
<Retry_Transmission: nextchanneltotry:1>,
<Retry_Transmission>,
<Retry_Transmission: oldchanneltried:1>,
<No_Ack_Received>
<Data_Msg_Sent_On_Same_Channel: homechannel:0>,
<No_Ack_Received>,
<Retry_Transmission>,
<Retry_Transmission: nextchanneltotry:1>,
<Retry_Transmission: nextchanneltotry:2>
<Data_Msg_Sent_On_Same_Channel: homechannel:0>,
<No_Ack_Received>,
<Retry_Transmission>,
<Retry_Transmission: nextchanneltotry:1>,
<Retry_Transmission: oldchanneltried:2>,
<Retry_Transmission: nextchanneltotry:3>,
<No_Ack_Received>,
<Retry_Transmission: oldchanneltried:3>
Figure 8: Discriminative frequent patterns for mul-
tichannel MAC protocol
To answer this question, we first executed the protocol at low data rates (when its performance is better than a single-channel MAC) to collect logs representing “good” behavior. We then executed the protocol at a high data rate (when its performance is worse than a single-channel MAC) to collect logs representing “bad” behavior.
After performing the discriminative pattern analysis, the top five discriminative patterns produced by our tool are shown in Figure 8.
The sequences indicate that, in all cases, there seems to be a problem with not receiving acknowledgements. The lack of acknowledgements causes a channel scanning pattern to unfold, shown as <Retry_Transmission> events on different channels. Hence, the problem does not lie in the frequent overhead of senders changing their channel to that of their receiver in order to send a message across clusters. The problem lay in the frequent lack of response (an ACK) from a receiver. At the first stage of frequent pattern mining, <No_Ack_Received> is identified as the most frequent event. At the second stage, the algorithm searches for frequent patterns in the top K (e.g., top 5) segments of the logs where the <No_Ack_Received> event occurred with the highest frequency.
[Bar chart omitted: number of messages for successful send and successful receive, comparing multichannel MAC performance with the bug and with the bug fix.]
Figure 9: Performance improvement after the bug
fix
The second stage of the log analysis (correlating frequent events to preceding ones) then uncovered that the lack of an ACK from the receiver is preceded by a temporary channel change. This gave away the bug. As described earlier, whenever a node changes its channel temporarily, it disables “ACK”s until it returns to its home channel. In a high inter-cluster communication scenario, disabling “ACK”s is a bad decision for a node that spends a significant amount of time communicating with other clusters on channels other than its own home channel. As a side effect, nodes trying to communicate with it fail to receive an “ACK” for a long time and start scanning channels frequently, looking for the missing receiver. Another interesting aspect of the problem is its cascading effect. When we look at the generated discriminative patterns across multiple nodes (not shown due to space limitations), we see that the scanning pattern revealed in the logs in fact cascades: channel scanning at the destination node often triggers channel scanning at the sender node, and this cascading effect was also captured by our tool.
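The two-stage scheme can be pictured as in the following sketch. The fixed window size, the use of simple event pairs, and the five-event look-back are illustrative simplifications of the algorithm described in Section 4, not its actual parameters.

from collections import Counter

def two_stage_mining(log, window=50, top_k=5):
    """Two-stage mining sketch over a non-empty list of event names."""
    # Stage 1: identify the most frequent event over the whole log.
    target = Counter(log).most_common(1)[0][0]
    # Rank fixed-size segments by how often the frequent event occurs in them.
    segments = [log[i:i + window] for i in range(0, len(log), window)]
    segments.sort(key=lambda seg: seg.count(target), reverse=True)
    # Stage 2: mine only the densest segments, correlating the frequent event
    # with the events that immediately precede it.
    pair_counts = Counter()
    for seg in segments[:top_k]:
        for i, event in enumerate(seg):
            if event == target:
                pair_counts.update((prev, target) for prev in seg[max(0, i - 5):i])
    return pair_counts.most_common(top_k)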
As a quick fix, we stopped disabling “ACK”s when a node is outside its home channel. This may appear to violate some correctness semantics because a node may now send an ACK while temporarily on a channel other than its home channel. One would think this would pollute the neighbor tables of nodes that overhear the ACK, because they would update their tables to indicate an incorrect home channel. In reality, the performance of the MAC layer improved significantly (by up to 50%), as shown in Figure 9. In retrospect, this is not unexpected. As inter-cluster communication increases, the distinction between one’s home channel and the home channel of another node with which one communicates a lot becomes fuzzy, as one spends more and more time on that other node’s home channel (to send messages to it). When ACKs are never disabled, the neighbor tables of nodes tend to record the channel on which each neighbor spends most of its time. This could be the neighbor’s home channel or the channel of a downstream node with which the neighbor communicates a lot. The distinction becomes immaterial as long as the neighbor can be found on that channel with high probability. Indeed, complex interaction problems often seem simple when explained but are sometimes hard to think of at design time. Dustminer was successful at uncovering the aforementioned interaction and significantly improving the performance of the MAC protocol in question.
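Relative to the should_ack_now sketch given earlier, the quick fix amounts to the following:

def should_ack_now_fixed(current_channel, home_channel):
    """Quick fix: never suppress ACKs, even while visiting another channel, so that
    overhearers record whichever channel the neighbor is actually reachable on."""
    return True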
7.2.4 Comparison with Prior Work
As before, we compare our results with the result of the algorithm in [14], which uses the Apriori algorithm. As mentioned earlier, one problem with the previous approach is scalability. Due to the huge number of events logged for this case study (about 40,000 for the “good” logs and 40,000 for the “bad” logs), we could not generate frequent patterns of length more than 2 using the approach in [14]. Generating frequent patterns of length 2 for the 40,000 events in the “good” log took 1683.02 seconds (28 minutes), and finishing the whole computation, including differential analysis, took 4323 seconds (72 minutes). With our two-stage mining scheme, the first stage took 5.547 seconds, and finishing the whole computation, including differential analysis, took 332.924 seconds (6 minutes). In terms of the quality of the generated sequences (which is often correlated with sequence length), our algorithm returned discriminative sequences of length up to 8, which was enough to understand the chain of events causing the problem, as illustrated above. We tried to generate frequent patterns of length 3 with the approach in [14] but terminated the process after one day of computation that had not yet completed. We used a machine with a 2.53 GHz processor and 512 MB of RAM. The generated patterns of length 2 were insufficient to give insight into the problem.
7.3 Debugging Overhead
To test the impact of logging on application behavior, we ran the multichannel MAC protocol with and without logging enabled, at both a moderate data rate and a high data rate. The network was set up as a data aggregation network.
For the moderate data rate experiment, the source nodes (nodes that only send messages) were set to transmit data at a rate of 10 messages/sec, the intermediate nodes were set to transmit data at a rate of 2 messages/sec, and one node acted as the base station (which only receives messages). We tested this on an 8-node network with 5 source nodes, 2 intermediate nodes, and one base station. Averaged over multiple runs to obtain a reliable estimate, the number of successfully transmitted messages increased by 9.57% and the number of successfully received messages increased by 2.32%. The most likely reason for this minor improvement is that writing to flash created a randomization effect that probably helped reduce interference at the MAC layer.
At the high data rate, source nodes were set to transmit data at a rate of 100 messages/sec and intermediate nodes at a rate of 20 messages/sec. Averaged over multiple runs to obtain a reliable estimate, the number of successfully transmitted messages dropped by 1.09% and the number of successfully received messages dropped by 1.62%. The most likely reason is that the overhead of writing to flash kicked in at such a high data rate and eventually outweighed the advantage observed at the low data rate.
The performance improvement of the multichannel MAC protocol reported in this paper was obtained by running the protocol at the high data rate to prevent overestimation. We realize that this effect may change the behavior of the original application slightly, but the effect appears negligible in our experience and did not affect the diagnostic capability of the discriminative pattern mining algorithm, which is inherently robust against minor statistical variance.
As the multichannel MAC protocol did not use flash memory to store any data, we were able to use the whole flash for logging events. To test the relation between the quality of the generated discriminative patterns and the logging space used, we used 100KB, 200KB, and 400KB of flash space in three different experiments. The generated discriminative patterns were similar. We realize that different applications have different flash space requirements and that the amount of logging space may affect diagnostic capability. To help under severe space constraints, we provide a radio interface so users can choose to log at selected times instead of logging continuously. Users can also choose to log events at different resolutions (e.g., instead of logging every message transmitted, log only every 50th message transmitted).
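A minimal sketch of such resolution-controlled logging is shown below; the class name and interface are assumptions, not the tool's actual logging API.

class SampledLogger:
    """Record only every Nth occurrence of each event type to stretch a limited
    flash budget (sketch)."""

    def __init__(self, every_nth=50):
        self.every_nth = every_nth
        self.counts = {}

    def log(self, event, write):
        self.counts[event] = self.counts.get(event, 0) + 1
        if self.counts[event] % self.every_nth == 1:   # keep the 1st, 51st, 101st, ...
            write(event)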
For the LiteOS case study, we did not use flash space at all, as events were transmitted directly to the base station (PC) over a serial connection, eliminating the flash space overhead completely. This makes our tool readily usable on testbeds, which often provide serial connections.
8. CONCLUSION
In this paper, we presented a sensor network troubleshoot-
ing tool that helps the developer diagnose root causes of er-
rors. The tool is geared towards finding interaction bugs.
Very successful examples of debugging tools that hunt for
localized errors in code have been produced in previous lit-
erature. The point of departure in this paper lies in focus-
ing on errors that are not localized (such as a bad pointer
or an incorrect assignment statement) but rather arise be-
cause of adverse interactions among multiple components
each of which appears to be correctly designed. The cascading channel-scanning example that occurred due to disabling acknowledgements in the MAC protocol illustrates the subtlety of interaction problems in sensor networks. With increased distribution and resource constraints, the interactive complexity of sensor network applications will remain high, motivating tools such as the one we described. Future
development of Dustminer will focus on scalability and user
interface to reduce the time and effort needed to understand
and use the new tool.
Acknowledgments
The authors thank the shepherd, Kay Römer, and the anonymous reviewers for providing valuable feedback and improv-
ing the paper. This work was supported in part by NSF
grants DNS 0554759, CNS 0626342, and CNS 0613665.
9. REFERENCES
[1] http://www.cs.waikato.ac.nz/ml/weka/.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining
association rules. In Proceedings of the Twentieth
International Conference on Very Large Data Bases
(VLDB’94), pages 487–499, 1994.
[3] M. K. Aguilera, J. C. Mogul, J. L. Wiener,
P. Reynolds, and A. Muthitacharoen. Performance
debugging for distributed systems of black boxes. In
Proceedings of the nineteenth ACM symposium on
Operating systems principles (SOSP’03), pages 74–89,
2003. Bolton Landing, NY, USA.
[4] P. Ballarini and A. Miller. Model checking medium
access control for sensor networks. In Proceedings of
the 2nd International Symposium On Leveraging
Applications of Formal Methods, Verification and
Validation (ISOLA’06), pages 255–262, Paphos,
Cyprus, November 2006.
[5] P. Bodík, G. Friedman, L. Biewald, H. Levine,
G. Candea, K. Patel, G. Tolle, J. Hui, A. Fox, M. I.
Jordan, and D. Patterson. Combining visualization
and statistical analysis to improve operator confidence
and efficiency for failure detection and localization. In
Proceedings of the 2nd International Conference on
Autonomic Computing(ICAC’05), 2005.
[6] Q. Cao, T. Abdelzaher, J. Stankovic, and T. He. The
liteos operating system: Towards unix-like
abstractions for wireless sensor networks. In
Proceedings of the Seventh International Conference
on Information Processing in Sensor Networks
(IPSN’08), April 2008.
[7] E. Cheong, J. Liebman, J. Liu, and F. Zhao. Tinygals:
a programming model for event-driven embedded
systems. In Proceedings of the 2003 ACM symposium
on Applied computing (SAC’03), pages 698–704, 2003.
Melbourne, Florida.
[8] E. Ertin, A. Arora, R. Ramnath, and M. Nesterenko.
Kansei: A testbed for sensing at scale. In Proceedings
of the 4th Symposium on Information Processing in
Sensor Networks (IPSN/SPOTS track), 2006.
[9] G. D. Fatta, S. Leue, and E. Stegantova.
Discriminative pattern mining in software fault
detection. In Proceedings of the 3rd international
workshop on Software quality assurance (SOQUA ’06),
pages 62–69, 2006.
[10] E. Frank and I. H. Witten. Generating accurate rule
sets without global optimization. In Proceedings of the
Fifteenth International Conference on Machine
Learning (ICML’98), pages 144–151, 1998.
[11] D. Gay, P. Levis, R. von Behren, M. Welsh,
E. Brewer, and D. Culler. The nesc language: A
holistic approach to networked embedded systems. In
Proceedings of Programming Language Design and
Implementation (PLDI’03), pages 1–11, June 2003.
[12] L. Girod, J. Elson, A. Cerpa, T. Stathopoulos,
N. Ramanathan, and D. Estrin. Emstar: a software
environment for developing and deploying wireless
sensor networks. In Proceedings of the annual
conference on USENIX Annual Technical Conference
(ATEC’04), pages 24–24, Boston, MA, 2004.
[13] Y. Hanna, H. Rajan, and W. Zhang. Slede:
Lightweight specification and formal verification of
sensor networks protocols. In Proceedings of the First
ACM Conference on Wireless Network Security
(WiSec), Alexandria, VA, March-April 2008.
[14] M. M. H. Khan, T. Abdelzaher, and K. K. Gupta.
Towards diagnostic simulation in sensor networks. In
Proceedings of International Conference on Distributed
Computing in Sensor Systems (DCOSS), 2008. Greece.
[15] M. M. H. Khan, L. Luo, C. Huang, and
T. Abdelzaher. Snts: Sensor network troubleshooting
suite. In Proceedings of International Conference on
Distributed Computing in Sensor Systems (DCOSS),
2007. Santa Fe, New Mexico, USA.
[16] H. K. Lee, D. Henriksson, and T. Abdelzaher. A
practical multi-channel medium access control
protocol for wireless sensor networks. In Proceedings of
International Conference on Information Processing in
Sensor Networks (IPSN’08), St. Louis, Missouri, April
2008.
[17] P. Levis and D. Culler. Mate: a tiny virtual machine
for sensor networks. In Proceedings of the 10th
international conference on Architectural support for
programming languages and operating systems, San
Jose, California, October 2002.
[18] P. Levis, N. Lee, M. Welsh, and D. Culler. Tossim:
accurate and scalable simulation of entire tinyos
applications. In Proceedings of the 1st international
conference on Embedded networked sensor systems
(SenSys’03), pages 126–137, Los Angeles, California,
USA, 2003.
[19] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff.
Statistical debugging: A hypothesis testing-based
approach. IEEE Transactions on Software
Engineering, 32:831–848, 2006.
[20] C. Liu and J. Han. Failure proximity: a fault
localization-based approach. In Proceedings of the 14th
ACM SIGSOFT international symposium on
Foundations of software engineering (SIGSOFT
’06/FSE-14), pages 46–56, 2006.
[21] C. Liu, Z. Lian, and J. Han. How bayesians debug. In
Proceedings of the Sixth International Conference on
Data Mining (ICDM’06), pages 382–393, December
2006.
[22] C. Liu, X. Yan, L. Fei, J. Han, and S. P. Midkiff.
Sober: statistical model-based bug localization. In
Proceedings of the 13th ACM SIGSOFT international
symposium on Foundations of software engineering
(FSE-13), 2005. Lisbon, Portugal.
[23] C. Liu, X. Yan, and J. Han. Mining control flow
abnormality for logic error isolation. In Proceedings of
2006 SIAM International Conference on Data Mining
(SDM’06), Bethesda, MD, April 2006.
[24] C. Liu, X. Zhang, J. Han, Y. Zhang, and B. K.
Bhargava. Failure indexing: A dynamic slicing based
approach. In Proceedings of the 2007 IEEE
International Conference on Software Maintenance
(ICSM’07), Paris, France, October 2007.
[25] L. Luo, T. F. Abdelzaher, T. He, and J. A. Stankovic.
Envirosuite: An environmentally immersive
programming framework for sensor networks. ACM
Transactions on Embedded Computing Systems,
5(3):543–576, 2006.
[26] L. Luo, T. He, G. Zhou, L. Gu, T. Abdelzaher, and
J. Stankovic. Achieving Repeatability of
Asynchronous Events in Wireless Sensor Networks
with EnviroLog. In Proceedings of the 25th IEEE
International Conference on Computer
Communications (INFOCOM’06), pages 1–14, 2006.
[27] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and
W. Hong. Tinydb: an acquisitional query processing
system for sensor networks. ACM Transactions on
Database Systems, 30(1):122–173, 2005.
[28] P. Ölveczky and S. Thorvaldsen. Formal modeling and
analysis of wireless sensor network algorithms in
real-time maude. In Proceedings of the International
Parallel and Distributed Processing Symposium
(IPDPS), Rhodes Island, Greece, April 2006.
[29] J. Polley, D. Blazakis, J. McGee, D. Rusk, and J. S.
Baras. Atemu: A fine-grained sensor network
simulator. In Proceedings of the First International
Conference on Sensor and Ad Hoc Communications
and Networks (SECON’04), pages 145–152, Santa
Clara, CA, October 2004.
[30] N. Ramanathan, K. Chang, R. Kapur, L. Girod,
E. Kohler, and D. Estrin. Sympathy for the sensor
network debugger. In Proceedings of the 3rd
international conference on Embedded networked
sensor systems (SenSys’05), pages 255–267, 2005.
[31] R. Szewczyk, J. Polastre, A. Mainwaring, and
D. Culler. Lessons from a sensor network expedition.
In Proceedings of the First European Workshop on
Sensor Networks (EWSN), 2004.
[32] G. Tolle and D. Culler. Design of an
application-cooperative management system for
wireless sensor networks. In Proceedings of the Second
European Workshop on Wireless Sensor Networks
(EWSN’05), pages 121–132, Istanbul, Turkey,
February 2005.
[33] P. Volgyesi, M. Maroti, S. Dora, E. Osses, and
A. Ledeczi. Software composition and verification for
sensor networks. Science of Computer Programming,
56(1-2):191–210, 2005.
[34] Y. Wen and R. Wolski. s2db: A novel simulation-based debugger for sensor network applications. Technical Report 2006-01, UCSB, 2006.
[35] Y. Wen, R. Wolski, and G. Moore. Disens: scalable
distributed sensor network simulation. In Proceedings
of the 12th ACM SIGPLAN symposium on Principles
and practice of parallel programming (PPoPP’07),
pages 24–34, 2007. San Jose, California, USA.
[36] G. Werner-Allen, P. Swieskowski, and M. Welsh.
Motelab: A wireless sensor network testbed. In
Proceedings of the Fourth International Conference on
Information Processing in Sensor Networks
(IPSN’05), Special Track on Platform Tools and
Design Methods for Network Embedded Sensors
(SPOTS), pages 483–488, April 2005.
[37] K. Whitehouse, G. Tolle, J. Taneja, C. Sharp, S. Kim,
J. Jeong, J. Hui, P. Dutta, and D. Culler. Marionette:
Using rpc for interactive development and debugging
of wireless embedded networks. In Proceedings of the
Fifth International Conference on Information
Processing in Sensor Networks: Special Track on
Sensor Platform, Tools, and Design Methods for
Network Embedded Systems (IPSN/SPOTS), pages
416–423, Nashville, TN, April 2006.
[38] J. Yang, M. L. Soffa, L. Selavo, and K. Whitehouse.
Clairvoyant: a comprehensive source-level debugger
for wireless sensor networks. In Proceedings of the 5th
international conference on Embedded networked
sensor systems (SenSys’07), pages 189–203, 2007.