Daredevil: Indoor Location Using Sound
Ionut Constandache (Duke University), Sharad Agarwal and Ivan Tashev (Microsoft Research), Romit Roy Choudhury (University of Illinois at Urbana-Champaign)
ionut@cs.duke.edu, {sagarwal,ivantash}@microsoft.com, croy@illinois.edu
Abstract
A variety of techniques have been used by prior work on the problem of smartphone location. In this paper, we propose a novel approach using sound source localization (SSL) with microphone arrays to determine where in a room a smartphone is located. In our system called Daredevil, smartphones emit sound at particular times and frequencies, which are received by microphone arrays. Using SSL that we modified for our purposes, we can calculate the angle between the center of each microphone array and the phone, and thereby triangulate the phone's position. In this early work, we demonstrate the feasibility of our approach and present initial results. Daredevil can locate smartphones in a room with an average precision of 3.19 feet. We identify a number of challenges in realizing the system in large deployments, and we hope this work will benefit researchers who pursue such techniques.
I. Introduction
Making indoor location available ubiquitously is difficult. There are many challenges, including achieving accuracy with off-the-shelf phones, relying solely on existing infrastructure such as Wi-Fi access points, and scaling any human effort such as fingerprinting. However, there are specific situations where tailored indoor location solutions are valuable and one or more of these constraints do not apply. For instance, firefighters may be willing to carry custom equipment for indoor location but will require the solution to work in any burning building. A retail chain may be willing to deploy custom equipment in its stores, but may want location to work with the off-the-shelf phones that users carry.
In this work, we focus on the retail scenario as our motivation. A store owner may want to provide indoor location to shoppers for a variety of reasons. She may want shoppers to efficiently find items on their shopping list with minimal clerk assistance. She may want to entice shoppers with special discounts or product reviews depending on what product the shopper is looking at. In doing so, the store owner wants to minimize the burden on the shopper and not require anything custom on the end user devices beyond an app for the store. She may be willing to deploy custom equipment inside the store. While we focus on the retail scenario, our techniques are equally applicable to any scenario where user phones with a custom app need to be located in a large indoor room where the room can be augmented with additional equipment.
Prior work has considered a variety of ways to address such scenarios, including sensing ambient magnetic fields [4], fingerprinting Wi-Fi [1] and fingerprinting FM radio transmissions [3]. In this work, we focus on sound, either audible or ultrasound. The choice of sound comes with its own advantages and disadvantages. There are transducers on all phones (speakers and microphones) and sound is generally unaffected by changes in the store layout or human presence (unlike magnetic fields and Wi-Fi signal strength). However, depending on the frequency, amplitude and duration, it can be overheard by humans, and there is a lot of ambient noise that can interfere with the detection of the intended signal.
In this work, we attempt to accurately locate users indoors by detecting the angle of arrival of audio chirps emitted by smartphones. We examine how well off-the-shelf phones can emit such chirps, at a variety of different frequencies and amplitudes. We examine how we can encode small amounts of information in these chirps to distinguish different phones. To calculate the angle of arrival, we build microphone arrays that can be used in conjunction with SSL (sound source localization) algorithms [17] that we have customized for our use.
The novelty of our work is in applying SSL to locate smartphones. Shopkick is a company that has deployed sound-based location in stores. However, in their system the phone detects sound emitted by a custom device in the store, and the system only determines whether the phone is in the store, not where in the store the phone is. SSL techniques have been explored in depth in prior work, and are in use in products such as Microsoft Kinect, for distinguishing different humans speaking at the same time. In this work, we apply those techniques to locating phones, which has unique challenges.
Our system, called Daredevil, operates at least two microphone arrays, and by calculating the angle from each array to the phone emitting a tone, can triangulate the user's location. In this paper, we present the design of the system and hardware and feasibility experiments in §III, evaluation results in §IV, and relevant prior work in §V. This early work demonstrates a working system that achieves low error: on average 3.8°, or approximately 3.2 feet. However, we also identify limitations of current hardware on smartphones that prevent us from deploying Daredevil in practical scenarios, and we point to future work in this space in §VI.
II. Motivation
We envision Daredevil to be deployed in retail stores where pairs of microphone arrays are mounted on walls or ceilings. Unmodified phones running an app can be identified and located using sound that they emit. There are several open questions that we need to answer when building such a system.
It is important to understand how well off-the-shelf smartphones can emit high frequency audio tones. This is partly a function of the speakers they have, the audio processing hardware, and the software platform on the phones. A related question is how loud these tones are, or at what distance they can be heard using microphones. The longer the distance, the fewer microphone arrays are needed, but the higher the potential for interference between multiple phones. The higher the frequency, the lower the chance of interference from human voice.
The cost and robustness of our microphone array design are two more key factors. An inexpensive array can allow for more arrays to be deployed in the indoor environment. The array has to be robust enough to pick up audio tones from phones from a variety of angles and distances.
The ultimate question is how accurately the microphone arrays can detect the angle of incidence from phones, and subsequently the location of each phone by triangulation from a pair of arrays. Additional questions are how quickly tones can be generated and phones located, and how that impacts the scalability of the system: how many phones can be located, and how frequently.
Figure 1: Daredevil overview. Two microphone arrays, Mic 1 at (X_mic1, Y_mic1) and Mic 2 at (X_mic2, Y_mic2), compute angles α_mic1 and β_mic2 toward the phone at (X, Y); a server combines them.
III. Design and Implementation
III.A. Overview
As shown in Figure 1, we assume that the indoor environment has been instrumented with at least two microphone arrays, with known positions. We expect the arrays to be mounted high on adjacent walls that are perpendicular to each other, or attached to the ceiling. The placement should be high enough such that there is good line of sight to most users in the environment, minimizing the impact of audio reflecting from indoor surfaces. The goal of each microphone array is to calculate the angle of incidence of received audio to the front surface of the array. With two angles (one from each microphone array), we can triangulate the position of the sound source.
Each phone has an app that generates and plays audio in a specific frequency band. The idea is to play audio at high enough frequencies to be inaudible to humans, yet be playable by smartphone speakers and be loud enough to be captured by microphone arrays.
In our application scenario, we expect multiple phones to be present that need to be located simultaneously. If multiple phones emit sound at the same time and frequency, it can be difficult to distinguish them and hence calculate an angle of incidence to each microphone array. We can use two basic approaches to address the problem of multiple phones: frequency division multiplexing (FDM) or time division multiplexing (TDM). As our experiments in §IV show, when attempting to emit sound at a particular frequency, we observe our signal at additional frequencies. As a result, FDM may be more challenging to achieve in practice. Hence, we focus on TDM in Daredevil.
Our smartphone app uses coarse-grained location information from the underlying mobile OS (typically based on Wi-Fi and cellular tower location) to estimate which Daredevil-enabled store the user is in. The app will then receive a schedule from the Daredevil server (shown in Figure 1) that tells it when it can emit sound and in what frequency band. The app uses amplitude modulation to encode a unique phone ID in the sound it emits. Clock drift between phones and servers will limit how quickly multiple phones can be located.
Each phone registers with the Daredevil server and retrieves the common tone frequency, a unique phone ID, and the TDM schedule assigned to it. The phone constructs the audio in software and plays it at the assigned schedule. The microphone arrays stream audio to the server. SSL software running at the server identifies the tone, decodes the phone ID, and computes the tone angle of arrival at each of the two microphone arrays (angles α_mic1 and β_mic2 in Figure 1). The microphone array coordinates (X_mic1, Y_mic1) and (X_mic2, Y_mic2) are static values provided to the server software as configuration values at deployment time. Using the microphone array positions and the tone angles of arrival, the server computes the phone coordinates through triangulation. The phone coordinates (X, Y) are returned to the app, or processed further for higher-level services (such as providing directions on top of an indoor map, or sending a discount coupon).
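To make the triangulation step concrete, here is a minimal sketch (in Python, for illustration; the actual server implementation is C and Matlab code, per §III.E). Each array contributes a bearing ray from its known position, and the phone estimate is the intersection of the two rays. The function name and the use of a shared global coordinate frame for both angles are our assumptions.

```python
import math

def triangulate(p1, theta1, p2, theta2):
    """Intersect two bearing rays to estimate the sound source position.

    p1, p2: (x, y) positions of the two microphone arrays.
    theta1, theta2: bearings (radians) of the phone as seen from each
    array, expressed in a shared global coordinate frame (assumed).
    Returns the (x, y) intersection, or None if the bearings are parallel.
    """
    d1 = (math.cos(theta1), math.sin(theta1))  # direction of ray 1
    d2 = (math.cos(theta2), math.sin(theta2))  # direction of ray 2
    # Solve p1 + t*d1 == p2 + s*d2 for t via Cramer's rule.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None  # (nearly) parallel bearings: no unique fix
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])
```

For example, with arrays at (0, 0) and (25, 0) reporting bearings of 45° and 135°, triangulate returns (12.5, 12.5).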
The phone location is updated periodically, each time the app on the phone plays sound at the scheduled times. The schedule periodicity can be made adaptive based on the number of phones present in the environment. Additional factors, including the maximum user speed, and limits on user movement by walls and aisles, can also be used to dynamically adjust the schedule for different phones.
III.B. Sound Source Localization
Locating sounds using microphone arrays is a well-established area in signal processing. One of the first applications was pointing a pan-tilt-zoom camera toward the current speaker in a conference room [21].

Direction estimation with a pair of microphones is done by computing the delay between the two received signals, and using the known speed of sound and the distance between the two microphones: θ = arcsin(τν/δ). Here θ is the direction angle, τ is the estimated time delay, ν is the speed of sound, and δ is the distance between the two microphones. τ is computed using the Generalized Cross-Correlation function [10], typically with PHAT weighting.
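To illustrate the two-microphone case, the sketch below estimates τ with GCC-PHAT and maps it to an angle via θ = arcsin(τν/δ). The sampling rate (96 kHz), microphone spacing (8.16 mm), and speed of sound reflect numbers appearing elsewhere in this paper; the function itself is our simplification, not Daredevil's eight-microphone SRP pipeline.

```python
import numpy as np

def gcc_phat_angle(x1, x2, fs=96_000, mic_dist=0.00816, c=343.0):
    """Direction of arrival for one microphone pair via GCC-PHAT.

    Computes the generalized cross-correlation with PHAT weighting,
    takes its peak as the time delay tau, and returns
    theta = arcsin(tau * c / mic_dist) in radians.
    """
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)  # PHAT weighting
    # Only delays up to mic_dist / c are physically possible.
    max_shift = max(1, int(fs * mic_dist / c))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    return np.arcsin(np.clip(tau * c / mic_dist, -1.0, 1.0))
```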
Using more than two microphones improves precision, but increases the complexity of the estimation. The naive approach of combining angles from all possible pairs does not work well. Instead, Steered Response Power (SRP) algorithms are often used. Those are based on evaluating multiple angles and picking the one that maximizes certain criteria (power of the signal from the sound source [18], spatial probability [17], eigenvalues [14], etc.). In Daredevil, we use a modified and improved version of the algorithm described in [17]. On every audio processing frame we run a Voice Activity Detector (VAD), like the one described in [15] but modified for the type of audio we generate, and engage the sound source localizer only if there is a signal (real or interfering) present.
The audio frames with signal present are converted to the frequency domain using a short-term Fourier transform, and only the frequency bins containing the signal frequency band are processed. For each frequency bin k of audio frame n, the probability p_k^(n)(θ) that it contains a signal as a function of the direction θ is estimated using the IDOA approach [17]. The probabilities from the frequency bins of interest are averaged to obtain the probability of a sound source being present as a function of the direction, p^(n)(θ). A confidence level is estimated as the difference between the maximum and minimum values of the probability, divided by the probability average. If the confidence level is below a given threshold, the result from this audio frame is discarded; otherwise, the direction angle is taken where the probability peaks. The time-stamped angle, accompanied by the confidence level, is sent up the stack for further processing.
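A compact sketch of this per-frame decision, assuming an upstream IDOA-style estimator has already produced per-bin direction probabilities (the array layout and the threshold value are hypothetical):

```python
import numpy as np

def frame_direction(p_bins, angles, conf_threshold=0.3):
    """Combine per-bin direction probabilities into one frame estimate.

    p_bins: (num_bins, num_angles) array, where p_bins[k, j] is the
    probability that bin k holds signal arriving from angles[j].
    Returns (angle, confidence) or None when the frame is discarded.
    """
    p = p_bins.mean(axis=0)                      # average over frequency bins
    confidence = (p.max() - p.min()) / p.mean()  # (max - min) / average
    if confidence < conf_threshold:
        return None                              # too flat: discard the frame
    return angles[int(np.argmax(p))], confidence
```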
The angular precision based on a single audio frame with a duration of 10-30 ms is not high, and hence we use post-processors. Their purpose is to remove outliers and reflections and to increase location precision by averaging and interpolating the sound source trajectory over small time windows. A variety of methods can be used, including Kalman filtering [9], clustering [16] and particle filtering [22]. In Daredevil, we use the clustering algorithm described in chapter 6 of [16]. The output of the sound source localizer is a set of directions toward all tracked sound sources, each accompanied by a confidence level.
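We do not reproduce the clustering algorithm of [16] here, but the following toy post-processor conveys the idea: group time-stamped per-frame estimates that agree in angle, keep only clusters that persist across several frames, and report a confidence-weighted direction per cluster. All names and parameter values are illustrative.

```python
def cluster_angles(frame_estimates, window=0.5, min_frames=3, tol=5.0):
    """Toy stand-in for the clustering post-processor.

    frame_estimates: iterable of (time_s, angle_deg, confidence) tuples,
    in time order. Returns a list of (mean_angle_deg, total_confidence)
    for clusters observed in at least min_frames frames.
    """
    clusters = []
    for t, ang, conf in frame_estimates:
        for c in clusters:
            # Join an existing cluster if close in angle and recent in time.
            if abs(ang - c["mean"]) <= tol and t - c["last"] <= window:
                c["wsum"] += ang * conf
                c["w"] += conf
                c["mean"] = c["wsum"] / c["w"]
                c["last"], c["n"] = t, c["n"] + 1
                break
        else:
            clusters.append({"mean": ang, "wsum": ang * conf, "w": conf,
                             "last": t, "n": 1})
    return [(c["mean"], c["w"]) for c in clusters if c["n"] >= min_frames]
```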
In addition to localizing sounds, signals from the microphones can be combined into one signal that is equivalent to a highly directive microphone, an operation called beamforming. Commonly the beamforming happens in the frequency domain, where we have already converted the audio signal to perform sound source localization. In this domain the beamforming output is defined as a weighted sum with direction-dependent weights. The output signal contains less reverberation and noise, allowing us to more easily process the chirps from a phone. The beamforming operation requires knowledge of the direction to the desired sound source, which is obtained from the sound source localizer. In Daredevil, we use a time-invariant steerable beamformer [16]. It consists of 21 pre-computed tables of weight coefficients for directions from -50° to +50°, pointing at every 5°. Using the list of sound sources obtained from the sound source localizer, for each of the tracked sound sources we snap the direction to the closest pre-computed angle and then compute the beamformer output. This means that we apply the beamforming procedure as many times as there are tracked sound sources. The output of each beamformer contains the sound from the corresponding phone, while the other signals and noises are attenuated. It is used further for decoding the information coming from that particular phone.
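A sketch of the steering step, assuming the 21 weight tables are available as a mapping from steering angle (degrees) to per-microphone, per-bin complex weights; this data layout is hypothetical.

```python
import numpy as np

def beamform(frame_spectra, weights_by_angle, direction_deg):
    """Weighted-sum beamformer output for one tracked source.

    frame_spectra: (num_mics, num_bins) complex spectra of one frame.
    weights_by_angle: dict angle_deg -> (num_mics, num_bins) weights,
    with keys -50, -45, ..., +50 (the 21 precomputed tables).
    direction_deg: direction estimate from the sound source localizer.
    """
    # Snap the SSL direction to the closest precomputed steering angle.
    snapped = int(round(np.clip(direction_deg, -50, 50) / 5.0)) * 5
    w = weights_by_angle[snapped]
    return (np.conj(w) * frame_spectra).sum(axis=0)  # sum over microphones
```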
Each microphone array works in the range of ±50°. In the ideal case we should be able to reject any sound source that is outside this angle or out of the line of sight; otherwise we will get spurious directions for sound sources. We mitigate this problem in the following ways:

- We use VAD to select only those audio frames where we have strong signals, presumably rejecting those that are out of the monitored zone or not in the direct sight of the array.
- We scan for sound sources only in the directions of the range above.
- After processing each frame we compute a confidence level ((max-min)/average) and reject all of the frames with a confidence level below a threshold.
- We cluster the sound source localizations, and track only sound sources with presence in several frames. We then compute a confidence level for each sound source.
- We compute the beamformer toward a direction only when the confidence level after clustering is above a threshold.

Again, the key is to reject directions toward sporadic sound sources, reflections from walls and ceilings, and noise. This is part of the reason why we do not detect real sound sources at larger distances: with our thresholds we cannot distinguish them from soft sounds and nearby noise. Further tuning of the thresholds can be done to increase the detection range as much as possible while still keeping false positives at acceptably low levels.

Figure 2: Spectrogram from Adobe Audition software of 18 kHz audio played on an HTC Surround phone and recorded on a microphone array.
III.C. Audio Frequency Band
In our application scenario, the store owner is willing to deploy hardware in the store but cannot expect users to modify their phones beyond installing an app. Ideally, we want the audio emitted by phones to be well beyond human-perceptible ranges. Medical literature indicates that human hearing can recognize sounds up to 20 kHz. Unfortunately, speakers on most smartphones are not designed to operate beyond human hearing ranges, and our experiments on a number of phone models (Apple iPhone 4, Samsung Focus, Asus E600, and HTC Surround) verified that expectation. As the emitted sound frequency approaches 21 kHz, the sound amplitude drops off quickly. Most phones are capable of achieving 18 kHz at loud volumes. It is well known that the upper range of audio frequencies detectable by the human ear varies with age. In a small experiment using 10 test users aged 20 and higher, playing audio on a PC with high-end speakers, we observed no user perception of frequencies higher than 17 kHz. For this paper, we focus on 18 kHz, realizing that the short audio chirps that our app emits may be perceptible by children.
Phone loudspeakers at maximum volume produce substantial non-linear distortions. For 18 kHz sound, the second, third, and higher harmonics are above one half of the sampling rate and should be removed by the anti-aliasing filter integrated into the ADC. However, due to the high amplitude of these non-linear distortions, they still bleed over and are mirrored as signals with lower frequency. The spectrogram in Figure 2 shows these additional signals with frequencies of 6 and 10 kHz. While these two shadow frequencies are within the human audible range, their amplitudes are much lower than the primary signal, which is clearly visible at 18 kHz. We have observed this behavior on multiple phones and we believe this is a limitation of the speakers built into phones. We observe similar spectrograms for frequency bands ranging from 18 kHz to 21 kHz.

Figure 3: TDM time slot. A slot of length T_SLOT contains the play delay t_PLAY, the tone of length t_TONE, and a guard time t_GUARD.
Our findings are consistent with prior work. Recent work [5] investigated the feasibility of playing ultrasonic sounds on mobile phones. The authors tested four commercial phones (HTC G1, HTC Hero, Apple iPhone 3GS, and Nokia 6210 Navigator) playing tones at frequencies between 17 kHz and 22 kHz. They observed that all phones were capable of generating these high frequencies. While noise appeared at some other frequencies when the volume of the phone's speakers was set to maximum, there exists a combination of volume settings for each phone such that the noise is minimal.
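When inspecting recordings for such shadow frequencies, a simple band-energy measurement over an averaged magnitude spectrum suffices. The sketch below is our own diagnostic, not part of Daredevil; for instance, comparing band_energy(x, 96_000, 17_500, 18_500) against band_energy(x, 96_000, 5_500, 6_500) quantifies how strong a 6 kHz shadow is relative to an 18 kHz tone.

```python
import numpy as np

def band_energy(x, fs, lo, hi, nfft=1024):
    """Summed average-magnitude-spectrum energy in the band [lo, hi] Hz."""
    win = np.hanning(nfft)
    frames = np.stack([x[i:i + nfft] * win
                       for i in range(0, len(x) - nfft, nfft // 2)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    return mag[(freqs >= lo) & (freqs <= hi)].sum()
```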
III.D. Locating Multiple Phones
Using FDM to allocate distinct frequencies to differ-
ent phones is possible, but complicated by the shadow
frequency problem we observe. We instead use TDM.
However, for a TDM approach to work, we need to
ensure that the slots do not overlap. This is challeng-
ing because different phones and the Daredevil server
may not have good time synchronization with each
other. In addition, scheduling and processing over-
head in smartphone OSes may introduce variable de-
lay between our app issuing a command to play audio
and it coming out of the phone’s speaker.
Figure 3 represents a time slot in the Daredevil TDM schedule. We denote the length of the slot by T_SLOT. This represents the time interval allocated to a phone to play the tone of length t_TONE (t_TONE < T_SLOT). We denote by t_PLAY the delay between the request to play the tone and the phone actually playing it, and by t_GUARD a guard time to account for clock drift. The guard time t_GUARD ensures that the next scheduled phone does not play its tone concurrently with the current phone due to poor clock synchronization.
To measure the delay t_PLAY between issuing the play-tone command and the tone coming out of the phone speaker, we ran a small experiment. We instrumented the Daredevil app on a phone to record a timestamp when it requests the OS to play the tone. In parallel, the phone recorded sound through its microphone. In this way, we are able to estimate t_PLAY as the difference in time between the app issuing the play command and the recorded audio containing the expected frequency. After running this experiment 10 times on different phone models, we observed a maximum delay of 100 ms. Hence we use a t_PLAY of 100 ms.
We empirically evaluated a number of values for t_TONE, while measuring (i) tone detection and (ii) angle estimation accuracy. Both suffer when the tone length is short. If the tone length is too short, our SSL algorithm does not have enough samples to remove noise from angle estimates. In our experiments, 500 ms was an adequate length, and hence we set t_TONE to 500 ms.
To pick t_GUARD, we ran experiments with two phones that were configured to synchronize their clocks with cellular towers. We attached both phones to a PC using USB cables and recorded on the PC the time reported by both phones at multiple points throughout the day. The two sets of reported times were very close to each other; the difference remained below 150 ms, which is the value we pick for t_GUARD.
Putting together the values for t_PLAY, t_TONE and t_GUARD, we have 750 ms for the value of T_SLOT. Hence, with a single frequency band, we can locate up to 40 phones every 30 seconds in the coverage area of a pair of microphone arrays.
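The slot arithmetic is simple enough to capture in a few lines. The sketch below derives the 40-phones-per-30-seconds capacity from the measured delays; the scheduling function and its interface are hypothetical.

```python
T_PLAY, T_TONE, T_GUARD = 0.100, 0.500, 0.150  # seconds, from our experiments
T_SLOT = T_PLAY + T_TONE + T_GUARD             # 0.750 s per phone

def make_schedule(phone_ids, period_s=30.0):
    """Assign each phone a play offset within the period (a sketch)."""
    capacity = int(period_s // T_SLOT)          # 40 slots per 30 s here
    if len(phone_ids) > capacity:
        raise ValueError("too many phones for a single frequency band")
    return {pid: i * T_SLOT for i, pid in enumerate(phone_ids)}
```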
III.E. Hardware and Software Implementation
Figure 4 shows a photograph of one of the three microphone arrays we built for Daredevil. We used a laser cutter on a sheet of plastic to form the base and the front plate that holds the microphones. The geometry is a linear equidistant eight-element microphone array. The distance between the microphones is one half wavelength for sound with a frequency of 21 kHz (8.16 mm). We used cardioid electret microphones with a diameter of 6 mm, all of them pointing forward.

The microphones fit snugly into the holes on the front plate (bottom of the picture), and wires connect them to a simple circuit. The voltage bias for the microphones is provided by a 9 V battery. We also have variable resistors on each channel for individual gain calibration. Eight cables then run out to an A-D converter. For our experiments, we used the MOTU UltraLite-mk3, which supports 8 audio inputs and connects to a PC via USB. For prototype and hardware debugging purposes, we made the microphone array larger than it needs to be. If productized, the entire microphone array and the ADC could be made into a box the size of a deck of playing cards.

Figure 4: Top-down view of one of our microphone arrays.
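The microphone spacing above follows directly from the design frequency; a quick check, assuming a nominal room-temperature speed of sound of 343 m/s:

```python
SPEED_OF_SOUND = 343.0  # m/s, nominal room-temperature value (assumed)
F_MAX = 21_000.0        # highest design frequency, Hz

# Half-wavelength spacing at F_MAX avoids spatial aliasing in the array.
spacing_mm = SPEED_OF_SOUND / F_MAX / 2.0 * 1000.0
print(f"{spacing_mm:.2f} mm")  # -> 8.17 mm, matching the built 8.16 mm
                               # to within rounding of the assumed speed
```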
Each A-D converter outputs 8-channel audio through the audio driver stack in Microsoft Windows. We use a combination of C code and Matlab code to filter the audio, do VAD, do SSL, and triangulate the phone's position. Another piece of software maintains a list of active phones and allocates frequency bands, time slots, and unique IDs to each phone. This constitutes the Daredevil server.
On the phone side, we have a simple app on Windows Phone 7, which communicates with the Daredevil server over the Internet, and plays audio when given a schedule. The app uses amplitude modulation to encode a 24-bit unique phone ID on top of the audio tone. Both the unique ID and the audio frequency are provided to the app by the Daredevil server. The unique phone ID is modulated at a baud rate of 50 symbols per second, and there are two additional guard bits at the start of the audio sequence. Each tone is therefore 0.52 seconds long. We have also ported our app to the Apple iOS platform.
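A sketch of how such a chirp could be synthesized follows. The paper specifies amplitude modulation, a 24-bit ID, 50 symbols per second, and two guard bits; the bit-to-amplitude mapping, guard-bit values, modulation depth, and sample rate below are our assumptions.

```python
import numpy as np

def make_chirp(phone_id, freq=18_000.0, fs=48_000, baud=50, bits=24,
               depth=0.5):
    """Amplitude-modulate a phone ID onto a tone (illustrative sketch).

    Two guard bits precede the 24-bit ID, so at 50 symbols/s the tone
    lasts (2 + 24) / 50 = 0.52 s, matching the tone length reported.
    """
    symbols = [1, 1] + [(phone_id >> i) & 1 for i in range(bits)]
    spb = fs // baud  # samples per symbol
    env = np.repeat([1.0 if b else 1.0 - depth for b in symbols], spb)
    t = np.arange(len(env)) / fs
    return env * np.sin(2 * np.pi * freq * t)
```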
IV. Experimental Results
The primary question we want to answer in our evaluation is how accurately we can determine the location of a phone in Daredevil. We begin by evaluating how accurately Daredevil can determine the angle between a microphone array and where the phone was when it played a sound. We evaluate that error at different angles and distances to the microphone array. We also evaluate how that error changes when we change the frequency band in which audio is played. Finally, using the accuracy we achieve in determining the angle, we can use triangulation to calculate the error between a phone's actual and estimated positions.
In the experiments here, we set the volume of the phone's loudspeaker to 90% of the maximum, which does not produce as much distortion as 100%. The limited ability of phone loudspeakers to emit sounds at high amplitudes without distortion, and the noise abatement techniques in our sound source localization algorithm, together limit the maximum distance at which we can locate phones.
IV.A. Angular accuracy at 18 kHz
To evaluate Daredevil, we deployed one of our microphone arrays in a conference room. We then played tones at 18 kHz on a Samsung Focus smartphone at specific locations in the room. Figure 5 shows the 12 locations in the room that we evaluated. The figure also shows, for each location, the distance to the microphone array and the angle between the center of the microphone array and the line from the array to the phone's position. We calculated these distances and angles by using a laser range finder, a measuring tape, and trigonometry.
Figure 6 shows the results of estimating the angle at each of the 12 locations in Daredevil. As shown in the figure, many of the estimates are off by 5° or less. However, our modified VAD is not able to detect the tone at several locations, and at two locations our SSL provides poor accuracy.
To understand why accuracy is poor at certain locations, we present Figure 7. The green baseline in the center represents the noise floor, while the spikes denote the tone being played by the phone. While the tones are distinguishable, the SNR is low: in many cases less than twice the noise floor. This low SNR results in missed tones, and in poor angle accuracy when tone amplitudes are erroneously correlated with the noise rather than with the actual tone signal. The SNR is poor at higher frequencies as well (we tested 19 kHz, 20 kHz, and 21 kHz).
Figure 5: Evaluation locations when playing tones at 18 kHz. Each "X" marks a spot where we played audio from a phone, labeled with the position number and the distance and angle to the center of the microphone array (5 to 35 feet, -45° to 71°; not drawn to scale).

Figure 6: Results from calculating the angle at each of the 12 locations while playing tones at 18 kHz. Each orange circle indicates the time at which the tone was played on a phone, in order (location #1 is the left-most circle, location #12 is the right-most circle). Locations 3, 10, 11, and 12 were not detected by Daredevil. Blue dots denote the angle direction inferred per audio frame (1024 audio samples at 96 kHz). Green lines denote the beamforming directions. Small horizontal red lines represent the angles output by SSL after clustering and weighting the per-frame directions.

Figure 7: Amplitudes of the 18 kHz audio tones received at one of the 8 microphones, after amplifying the signal by 40 dB, applying a band pass filter around 18 kHz, and then amplifying again by 40 dB. The vertical axis is amplitude, and the horizontal axis is time. Annotations in the figure mark one spike with good SNR and a weaker spike that is hard to distinguish from noise.

IV.B. Angular accuracy at 10 kHz

The poor SNR at 18 kHz contributes to our poor accuracy, and we suspect the poor SNR is due to phone speakers not being able to produce sound at high amplitudes at 18 kHz. To test this theory, we repeated our experiments at 10 kHz. Figure 8 shows the locations we tested in the same room at this frequency.
Figure 9 shows the angular accuracy at 10 kHz. We observe that all 7 locations are detected, and in most cases with relatively low error. Daredevil can detect the phone's audio as far away as 35 feet. The average error is 3.8°.
Figure 10 shows the amplitude of the received signal at one microphone after processing. As expected, we see a very strong set of 7 signals (the vertical bars). Unfortunately, since we are operating at audible frequencies, we also catch human voice, as shown by the other peaks. These sometimes appear in Figure 9 as calculated angles (short red horizontal lines without an orange circle), but since the phone's ID cannot be decoded from that signal, no phone is detected at those angles.
Unfortunately, the lower frequency of 10 kHz poses problems for Daredevil. While each tone played by a phone is very short, it can become annoying for humans. Human voice can also appear in this frequency range; it can be filtered out, but it can interfere with the phone's audio signal when humans speak at the same time. While 10 kHz is not a feasible option in a real deployment, it demonstrates the effectiveness of Daredevil if this limitation of phone speakers were to be mitigated in the future.
IV.C. Location accuracy
To evaluate location accuracy, we simulate a range of scenarios with different angular accuracies from our results above. Rather than stand at many locations inside a room and run the application on the phone, we use our prior measured angular accuracies in a simulation for convenience. We define the location error as the distance in feet between the Daredevil-computed location and the actual location of the user. We randomly select 50 user locations within 25 feet of each of two microphone arrays. At each location, we use two angular errors that are randomly chosen from Figure 9 (for each angular error we consider both the positive and the negative value). Figure 11 shows the CDF of the Daredevil location error in feet. The minimum location error is 0.52 feet and the maximum error is 17.59 feet, while the average error across all 50 locations is 3.19 feet (less than 1 meter). As baseline values, we include the CDFs of two other localization schemes: Centroid (estimate the user location as the mid-point between the two microphone arrays) and Best Mic (estimate the user location as the location of the microphone array that is closest to the user). The CDF graphs show that Daredevil offers promising location accuracy, and can meet the requirements of the scenarios we outlined in §I.

Figure 8: Evaluation locations when playing tones at 10 kHz. Each "X" marks a spot where we played audio from a phone, labeled with the position number and the distance and angle to the center of the microphone array (5 to 35 feet, 54° down to 11°; not drawn to scale).

Figure 9: Results from calculating the angle at each of the 7 locations while playing tones at 10 kHz. Each orange circle indicates the time at which the tone was played on a phone, in order (location #1 is the left-most circle, location #7 is the right-most circle). All locations were detected by Daredevil, with per-location errors between 1° and 7°. Blue dots denote the angle direction inferred per audio frame (1024 audio samples at 96 kHz). Green lines denote the beamforming directions. Small horizontal red lines represent the angles output by SSL after clustering and weighting the per-frame directions.

Figure 10: Amplitudes of the 10 kHz audio tones received at one of the 8 microphones, after amplifying the signal by 40 dB, applying a band pass filter around 10 kHz, and then amplifying again by 40 dB. The vertical axis is amplitude, and the horizontal axis is time. Annotations in the figure mark the seven strong tone spikes (excellent SNR) and smaller peaks caused by human voice.

Figure 11: CDF of location error in feet for Daredevil, Centroid, and Best Mic. Centroid and Best Mic represent baseline localization schemes. Centroid always estimates the user location at the mid-point between the two microphone arrays. Best Mic places the user at the closest microphone array location to the user's ground-truth location.
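For concreteness, here is a self-contained sketch of the simulation just described. The array positions and room extent are illustrative; the angular error values are read off Figure 9, and the ray intersection mirrors the triangulation sketch in §III.A.

```python
import math
import random

ANGLE_ERRORS_DEG = [7, 1, 2, 1, 5, 5, 4]  # per-location errors from Figure 9

def intersect(p1, th1, p2, th2):
    """Intersect two bearing rays (same construction as the §III.A sketch)."""
    d1 = (math.cos(th1), math.sin(th1))
    d2 = (math.cos(th2), math.sin(th2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

def mean_location_error(mic1=(0.0, 0.0), mic2=(25.0, 0.0), trials=50):
    """Perturb true bearings by signed errors from ANGLE_ERRORS_DEG,
    re-triangulate, and average the distance error (in feet)."""
    errors = []
    while len(errors) < trials:
        # Random true location in an illustrative 25 x 25 ft area.
        x, y = random.uniform(0.0, 25.0), random.uniform(1.0, 25.0)
        true1 = math.atan2(y - mic1[1], x - mic1[0])
        true2 = math.atan2(y - mic2[1], x - mic2[0])
        e1 = math.radians(random.choice(ANGLE_ERRORS_DEG)) * random.choice((-1, 1))
        e2 = math.radians(random.choice(ANGLE_ERRORS_DEG)) * random.choice((-1, 1))
        estimate = intersect(mic1, true1 + e1, mic2, true2 + e2)
        if estimate is not None:
            errors.append(math.hypot(estimate[0] - x, estimate[1] - y))
    return sum(errors) / len(errors)
```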
V. Related Work
Indoor location is an active area of research, with a large number of published works. We cannot even begin to comprehensively cite all prior work in this space, but instead focus on location systems that use sound.
ActiveBat [6] relies on small devices, called Bats, to transmit ultrasonic pulses when instructed over RF. Bats are carried by users or attached to objects. Known-location receivers listen for the Bat ultrasonic pulse, and compute the distance to it based on the time difference between the RF request and the device's ultrasonic pulse. The Bat location is computed centrally by the system using distance estimates to at least three receivers. ActiveBat avoids collisions by scheduling the Bat transmissions: it broadcasts over RF which Bat devices should transmit the ultrasonic pulse. The system determines the Bat location with an accuracy of a few cm.
In Cricket [13], mobile devices correlate RF signals and ultrasonic pulses transmitted by hardware beacons installed in the environment. The mobile device is located at the closest emitting beacon. The beacons are carefully placed in the environment to maximize location coverage and minimize interference. To avoid collisions and make correct RF-ultrasonic pulse correlations, the beacon transmission times are randomized. The system achieves location granularities of 4x4 square feet.
We believe that Daredevil provides an interesting solution in a different part of the design space, where fewer devices are installed (two microphone arrays per room or area) and a different technique (SSL) is used to calculate location. In situations where a large number of devices can be installed in the environment, such as in ActiveBat and Cricket, perhaps finer location accuracy can be achieved.
WALRUS [2] provides room-level mobile device location. Each room is instrumented with a desktop computer which simultaneously broadcasts Wi-Fi and ultrasound beacons. Ultrasound beacons are confined by room walls and, as a result, these beacons are overheard only by devices co-located with the desktop computer. The mobile device infers room location by correlating received Wi-Fi packets (possibly from multiple rooms) with the overheard ultrasound beacon (from the same room). To make correct correlations, WALRUS uses a maximum time difference between the Wi-Fi and ultrasound beacons and periodic variations in the desktops' broadcast times. The authors show that even in the case of 6 neighboring rooms, room identification is correct 85% of the time. In contrast, Daredevil locates phones within a room at much finer granularity using microphone arrays.
WebBeep [11] proposes a scheme based on time of flight to locate mobile devices. These devices inform over RF when they are about to play an audible tone (at 4 kHz). Microphones detect the tone, and then compute a distance estimate to the device based on the speed of sound and the difference between the advertised time of play and the time the tone was heard. These distance estimates are affected by a constant, unknown time delay between when the sound was advertised to play and when it actually played. By factoring in this unknown delay and relying on distance estimates to three microphones, the location of the device can be inferred. In a 20x9 m² room covered with 6 microphones, the authors show an accuracy of 70 cm, 90% of the time, in a quiet environment.
The authors of [7, 12] built or used off-the-shelf UWB transmitters/receivers for location in an indoor environment. The advantage of wideband ultrasound over narrowband ultrasound (as used in the above systems) is that it allows multiple, simultaneous access to the medium. A wide frequency spectrum can support code division multiple access (CDMA) techniques. In CDMA, special codes are assigned to transmitters to ensure their signals can be separated at the receiver, even when transmissions occur in parallel.
In contrast to these prior systems, we use a unique approach of placing microphone arrays at carefully spaced intervals and using SSL to calculate the angle of incidence of sound. In §III.B, we already described the underlying prior work in SSL. However, the prior work in that space has focused on locating human voice, not locating phones, which comes with associated challenges of how phones can emit appropriate sound, how to schedule those transmissions, and what accuracy can be achieved. SSL has been used in robotics, typically to locate human voice so that a robot can follow humans. For example, in [20] the authors designed a sound location system to point the robot camera toward the direction of the sound source. When the source was placed at 3 m from the robot, the mean direction error was 3°. The authors of [8] allow for simultaneous localization of a mobile robot and multiple fixed sound sources. The sound sources are assumed fixed, while the robot can move freely through the space of a room. The bearing to sound sources is computed with accuracies between 3° and 9°. Similar work [19] tracked multiple people (2-4) while speaking. The system used a combination of beamforming and particle filtering to distinguish between multiple sound sources in the environment. The authors showed direction accuracies of 10° when the sound source is up to 7 m away from the robot.
VI. Conclusions
Daredevil is a system for locating phones in an indoor environment that has been instrumented with small microphone arrays. Each array can be as small as a pack of playing cards. Our system uses sound source localization to calculate the angle between a phone and each microphone array, and then triangulation to calculate the position. We use time-division multiplexing to schedule multiple phones that want to be located. Our implementation and evaluation demonstrate that low error can be achieved: 3.8° on average, or approximately 3.2 feet.
Our current implementation is limited by the quality of audio speakers on modern smartphones. At frequencies higher than 18 kHz, the amplitude of sound that these speakers generate is too low compared to the ambient noise floor. At human-audible ranges, sufficient amplitude can be achieved. Our hope is that better speakers will be available on smartphones in the future. Nonetheless, at frequencies above 18 kHz, it is unknown what the impact will be on animals, especially service animals for the blind. One possibility of using audible frequencies such as 10 kHz is to embed our unique audio signal into music or sounds that the UI of an app may play while the user is interacting with the app.
There are two directions for future work in this space that we believe are promising. In this paper, we did not address the problem of segmenting the floor plan of a large room (such as a large department store) into squares, where each square is served by a pair of microphone arrays. In such a configuration, the schedule of audio transmissions by phones in one square would need to be coordinated with those in adjacent squares: a traditional coloring problem that would need to be solved. A second direction that we did not explore in this work is reversing the direction of audio. If we had wall-mounted speakers transmitting at inaudible, high frequencies, could a microphone array on a phone determine its location? In such a scheme, no scheduling is needed since the same signal would be usable by all phones in the vicinity. Some modern smartphones have multiple microphones, primarily for noise cancellation during audio calls. If each audio stream from each microphone were exposed to the software stack on the phone, potentially SSL could be applied.
VII. Acknowledgments
We thank Mike Sinclair (MSR) for letting us use his hardware lab and helping us with the laser cutter, and we thank Bruce Cleary (MSR) for help with circuit boards.
References
[1] P. Bahl and V. Padmanabhan. RADAR: An In-Building RF-based User Location and Tracking System. In IEEE INFOCOM, 2000.
[2] G. Borriello, A. Liu, T. Offer, C. Palistrant, and R. Sharp. WALRUS: Wireless acoustic location with room-level resolution using ultrasound. In ACM MobiSys, 2005.
[3] Y. Chen, D. Lymberopoulos, J. Liu, and B. Priyantha. FM-based Indoor Localization. In ACM MobiSys, 2012.
[4] I. Constandache, R. R. Choudhury, and I. Rhee. CompAcc: Using Mobile Phone Compasses and Accelerometers for Localization. In IEEE INFOCOM, 2010.
[5] V. Filonenko, C. Cullen, and J. Carswell. Investigating Ultrasonic Positioning on Mobile Phones. In IPIN, Sept. 2010.
[6] A. Harter, A. Hopper, P. Steggles, A. Ward, and P. Webster. The anatomy of a context-aware application. In ACM MobiCom, 1999.
[7] M. Hazas and A. Hopper. Broadband ultrasonic location systems for improved indoor positioning. IEEE Transactions on Mobile Computing, 5:536–547, 2006.
[8] J.-S. Hu, C.-Y. Chan, C.-K. Wang, and C.-C. Wang. Simultaneous localization of mobile robot and multiple sound sources using microphone array. In IEEE ICRA, 2009.
[9] S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In International Symposium on Aerospace/Defense Sensing, Simulation and Controls, 1997.
[10] C. Knapp and G. Carter. The generalized correlation method for estimation of time delay. IEEE TASSP, Aug. 1976.
[11] C. V. Lopes, A. Haghighat, A. Mandal, T. Givargis, and P. Baldi. Localization of off-the-shelf mobile devices using audible sound: architectures, protocols and performance assessment. ACM MC2R, April 2006.
[12] K. Muthukrishnan and M. Hazas. Position Estimation from UWB Pseudorange and Angle-of-Arrival: A Comparison of Non-linear Regression and Kalman Filtering. In LoCA, pages 222–239, 2009.
[13] N. B. Priyantha, A. Chakraborty, and H. Balakrishnan. The Cricket location-support system. In ACM MobiCom, 2000.
[14] R. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 1986.
[15] J. Sohn, N. Kim, and W. Sung. A statistical model based voice activity detector. IEEE Signal Processing Letters, Jan. 1999.
[16] I. Tashev. Sound Capture and Processing: Practical Approaches. John Wiley and Sons, 2009.
[17] I. Tashev and A. Acero. Microphone array post-processor using instantaneous direction of arrival. In IWAENC, 2006.
[18] H. V. Trees. Optimum Array Processing. Part IV of Detection, Estimation and Modulation Theory. John Wiley and Sons, 2002.
[19] J.-M. Valin, F. Michaud, and J. Rouat. Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robotics and Autonomous Systems, 55(3):216–228, 2007.
[20] J.-M. Valin, F. Michaud, J. Rouat, and D. Letourneau. Robust sound source localization using a microphone array on a mobile robot. In IEEE IROS, 2003.
[21] H. Wang and P. Chu. Voice source localization for automatic camera pointing in videoconferencing. In ICASSP, 1997.
[22] D. Ward and R. Williamson. Particle filter beamforming for acoustic source localization in reverberant environment. In ICASSP, 2002.
... Popular solutions include RF multi-lateration techniques [18,26] or recent deep learning methods [13]. There are other works using optical (e.g., IR beacons [30,32], VLC [38]) and acoustic sensors [20,36,44], however, these solutions are either limited to LoS scenarios or highly sensitive to ambient noise [29], material attenuation (e.g., wood, concrete) [20,55], and certain environmental conditions (e.g., temperature, humidity) limiting them to a room-level applications. In comparison, RF signals are more robust in LoS and NLoS indoor environments. ...
... Popular solutions include RF multi-lateration techniques [18,26] or recent deep learning methods [13]. There are other works using optical (e.g., IR beacons [30,32], VLC [38]) and acoustic sensors [20,36,44], however, these solutions are either limited to LoS scenarios or highly sensitive to ambient noise [29], material attenuation (e.g., wood, concrete) [20,55], and certain environmental conditions (e.g., temperature, humidity) limiting them to a room-level applications. In comparison, RF signals are more robust in LoS and NLoS indoor environments. ...
Preprint
The plethora of sensors in our commodity devices provides a rich substrate for sensor-fused tracking. Yet, today's solutions are unable to deliver robust and high tracking accuracies across multiple agents in practical, everyday environments - a feature central to the future of immersive and collaborative applications. This can be attributed to the limited scope of diversity leveraged by these fusion solutions, preventing them from catering to the multiple dimensions of accuracy, robustness (diverse environmental conditions) and scalability (multiple agents) simultaneously. In this work, we take an important step towards this goal by introducing the notion of dual-layer diversity to the problem of sensor fusion in multi-agent tracking. We demonstrate that the fusion of complementary tracking modalities, - passive/relative (e.g., visual odometry) and active/absolute tracking (e.g., infrastructure-assisted RF localization) offer a key first layer of diversity that brings scalability while the second layer of diversity lies in the methodology of fusion, where we bring together the complementary strengths of algorithmic (for robustness) and data-driven (for accuracy) approaches. RoVaR is an embodiment of such a dual-layer diversity approach that intelligently attends to cross-modal information using algorithmic and data-driven techniques that jointly share the burden of accurately tracking multiple agents in the wild. Extensive evaluations reveal RoVaR's multi-dimensional benefits in terms of tracking accuracy (median of 15cm), robustness (in unseen environments), light weight (runs in real-time on mobile platforms such as Jetson Nano/TX2), to enable practical multi-agent immersive applications in everyday environments.
... Relaxing either one of these two requirements brings up rich bodies of past work [24,28,37,63,66,78,79]. For instance, a known source signal (such as a training sequence or an impulse sound) can be localized through channel estimation and fingerprinting [29,59,66,79], while scattered microphone arrays permit triangulation [24,25,37,63]. However, VoLoc's aim to localize arbitrary sound signals with a single device essentially inherits the worst of both worlds. ...
... ■ Multiple arrays or known sound signals: Distributed microphone arrays have been used to localize (or triangulate) an unknown sound source, such as gun shots [63], wildlife [24], noise sources [55,70,71], and mobile devices [25]. Many works also address the inverse problem of localizing microphones with speaker arrays that are playing known sounds [13,15,42,50]. ...
Conference Paper
Full-text available
Voice assistants such as Amazon Echo (Alexa) and Google Home use microphone arrays to estimate the angle of arrival (AoA) of the human voice. This paper focuses on adding user localization as a new capability to voice assistants. For any voice command, we desire Alexa to be able to localize the user inside the home. The core challenge is twofold: (1) accurately estimating the AoAs of multipath echoes without the knowledge of the source signal, and (2) tracing back these AoAs to reverse triangulate the user's location. We develop VoLoc, a system that proposes an iterative align-and-cancel algorithm for improved multipath AoA estimation, followed by an error-minimization technique to estimate the geometry of a nearby wall reflection. The AoAs and geometric parameters of the nearby wall are then fused to reveal the user's location. Under modest assumptions, we report localization accuracy of 0.44 m across different rooms, clutter, and user/microphone locations. VoLoc runs in near real-time but needs to hear around 15 voice commands before becoming operational.
... Relaxing either one of the requirements brings up rich bodies of past work [76,77,78,79,80,81,82]. For instance, a known source signal can be localized through channel estimation and fingerprinting [83,79,84,82], while scattered microphone arrays permit triangulation [78,77,80,85]. However, VoLoc's aim to localize arbitrary sound signals with a single device essentially inherits the worst of both worlds. ...
... • Multiple arrays or known sound signals: Distributed microphone arrays have been used to localize (or triangulate) an unknown sound source, such as gun shots [78], wildlife [77], and mobile devices [85]. ...
Thesis
Full-text available
Propagation delay refers to the length of time it takes for a signal to travel from point A to point B. Many existing systems, including Global Positioning System (GPS) localization, vehicular imaging, and microphone array beamforming, have taken advantage of propagation delay. This dissertation revisits different properties of propagation delay to enable new acoustic techniques and applications. For instance: (1) We leverage the propagation delay difference between two very different frequencies-radio frequency (RF), and acoustics-to improve active noise cancellation. By "piggybacking" sound over RF, our proposed system is able to compute anti-noise signals more precisely , and ultimately attain better cancellation performance. (2) We develop solutions that exploit the propagation delays of multipath echoes to localize an indoor human speaker. By aligning the arrivals of the voice signal at different times, we compute user location within an optimization framework, serving as a valuable context for smart voice assistants like Amazon Echo and Google Home. (3) We design 3D directional sound by actively synthesizing different propagation delays at two ears using earphones. We develop algorithms that accurately track the 3D orientation of the head, a key enabler for designing 3D acoustics. In general, this dissertation shows that while propagation delay has been studied for a long time and for many applications , there is still opportunity for new techniques and systems, by carefully looking at different properties of the propagation delay, across frequencies, time, and space.
... However, the well-known outdoor positioning system, GPS, is not useful for guaranteeing the accuracy of indoor positioning because of the shielding effect of the positioning signal transmission of satellites, i.e., the positioning ability of GPS is weak in regards to indoor environments such as shopping malls, hypermarkets, office building, etc. A variety of theoretical and available technologies have been proposed in recent decades for achieving the requirements of the indoor positioning [2][3][4][5], based on methods using infrared rays (IR), ultrasound, radio-frequency identification (RFID), wireless local area networks (WLAN), Bluetooth, audible sound, and other technologies. Nevertheless, not all of the above-mentioned methods can be applied to mobile communication systems due to complexity of implementation and the costs of the hardware and software; hence, a new design which is suitable for performing indoor positioning, based on the integration of a single microphone of the mobile communication system and one positioning estimation design. ...
Article
Full-text available
An indoor positioning design developed for mobile phones by integrating a single microphone sensor, an H2 estimator, and tagged sound sources, all with distinct frequencies, is proposed in this investigation. From existing practical experiments, the results summarize a key point for achieving a satisfactory indoor positioning: The estimation accuracy of the instantaneous sound pressure level (SPL) that is inevitably affected by random variations of environmental corruptions dominates the indoor positioning performance. Following this guideline, the proposed H2 estimation design, accompanied by a sound pressure level model, is developed for effectively mitigating the influences of received signal strength (RSS) variations caused by reverberation, reflection, refraction, etc. From the simulation results and practical tests, the proposed design delivers a highly promising indoor positioning performance: an average positioning RMS error of 0.75 m can be obtained, even under the effects of heavy environmental corruptions.
... The microphone array and distributed microphone arrays have been widely used for sound localization. These works mainly target a sound source emitting pre-designed signals [29][30][31]. The human voice is generally unknown to microphones, which brings about challenges for localization. ...
... For example, some companies leverage acoustic sensing to infer the user's location [11,39]. The research community also pays close attention to this trend and proposes many innovative voice localization technologies [10,12,14,30,38,47]. Knowing a user's location helps to narrow down the possible set of voice commands and provide customized services to users. ...
Article
Full-text available
Voice interaction is friendly and convenient for users. Smart devices such as Amazon Echo allow users to interact with them by voice commands and become increasingly popular in our daily life. In recent years, research works focus on using the microphone array built in smart devices to localize the user's position, which adds additional context information to voice commands. In contrast, few works explore the user's head orientation, which also contains useful context information. For example, when a user says, "turn on the light", the head orientation could infer which light the user is referring to. Existing model-based works require a large number of microphone arrays to form an array network, while machine learning-based approaches need laborious data collection and training workload. The high deployment/usage cost of these methods is unfriendly to users. In this paper, we propose HOE, a model-based system that enables Head Orientation Estimation for smart devices with only two microphone arrays, which requires a lower training overhead than previous approaches. HOE first estimates the user's head orientation candidates by measuring the voice energy radiation pattern. Then, the voice frequency radiation pattern is leveraged to obtain the final result. Real-world experiments are conducted, and the results show that HOE can achieve a median estimation error of 23 degrees. To the best of our knowledge, HOE is the first model-based attempt to estimate the head orientation by only two microphone arrays without the arduous data training overhead.
... Advanced light sensors and projected location sequences provide sub-room accuracy [28]. Similarly, audio has also proven effective across a variety of approaches including audio fingerprinting rooms [1,2], proximity beacons, 3 and techniques that leverage microphone/speaker arrays for angle-based 3D geometric localization [29,30]. While effective, the density requirements make these approaches expensive to deploy. ...
Article
Full-text available
An important capability of most smart, Internet-of-Things-enabled spaces (e.g., office, home, hospital, factory) is the ability to leverage context of use. Location is a key context element, particularly indoor location. Recent advances in radio ranging technologies, such as Wi-Fi RTT, promise the availability of low-cost, near-ubiquitous time-of-flight-based ranging estimates. In this paper, we build on prior work to enhance this ranging technology’s ability to provide useful location estimates. For further improvements, we model user motion behavior to estimate the user motion state by taking the temporal measurements available from time-of-flight ranging. We select the velocity parameter of a particle-filter-based on this motion state. We demonstrate meaningful improvements in coordinate-based estimation accuracy and substantial increases in room-level estimation accuracy. Furthermore, insights gained in our real-world deployment provide important implications for future Internet-of-Things applications and their supporting technology deployments such as social interaction, workflow management, inventory control, or healthcare information tools.
... A 2011 study [10] showed that a mobile phone can be located indoors using an ambient sound fingerprint when Wi-Fi infrastructure is unavailable. In addition, a system presented in 2014 [11] uses triangulation with a configuration of microphone arrays to determine the location of mobile phones. Unlike the earlier work, instead of detecting unstable background sound, the latter describes how the room's acoustical properties affect generated audio signals, achieving a low distance error of approximately 3.2 feet on average. ...
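The angle-based triangulation mentioned in this excerpt can be illustrated with a short geometric sketch (our own, with assumed names and 2D geometry; not the system's code): each microphone array reports a bearing to the phone, and the phone lies at the intersection of the two bearing rays.

    import numpy as np

    def triangulate(p1, theta1, p2, theta2):
        # Intersect rays from array positions p1 and p2 along bearings
        # theta1 and theta2 (radians): solve p1 + t1*d1 = p2 + t2*d2.
        d1 = np.array([np.cos(theta1), np.sin(theta1)])
        d2 = np.array([np.cos(theta2), np.sin(theta2)])
        A = np.column_stack((d1, -d2))
        t = np.linalg.solve(A, np.asarray(p2, float) - np.asarray(p1, float))
        return np.asarray(p1, float) + t[0] * d1

    # Arrays 10 ft apart, each hearing the phone at 45 degrees off the
    # baseline: the phone sits at (5, 5).
    print(triangulate((0, 0), np.radians(45), (10, 0), np.radians(135)))

With more than two arrays, a least-squares intersection of all bearing rays would replace this exact two-ray solve.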
Thesis
Full-text available
This thesis presents an indoor positioning solution for custom locations, aimed at finding and searching for items relevant to a given store or mart, that combines multiple inputs from BLE beacons, Wi-Fi signal strengths, and other smartphone sensors such as the magnetic field and ambient noise level. The solution addresses the accuracy problem and the time lag between the actual position and the estimated position at custom indoor location points, and predicts the correct position using machine-learning multi-class classification.
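A minimal sketch of the multi-class classification step described above (synthetic data and an assumed feature layout; not the thesis code): each feature row concatenates BLE and Wi-Fi signal strengths with magnetic-field and noise readings, and the class label is a predefined indoor location point.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 7))     # e.g., 3 BLE RSSI, 2 Wi-Fi RSSI, magnetic, noise
    y = rng.integers(0, 4, size=300)  # 4 custom location points (synthetic labels)

    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(clf.predict(X[:1]))         # predicted location point for a new scan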
Article
The plethora of sensors in our commodity devices provides a rich substrate for sensor-fused tracking. Yet today's solutions are unable to deliver robust, high tracking accuracy across multiple agents in practical, everyday environments, a feature central to the future of immersive and collaborative applications. This can be attributed to the limited scope of diversity leveraged by these fusion solutions, preventing them from catering to the multiple dimensions of accuracy, robustness (diverse environmental conditions), and scalability (multiple agents) simultaneously. In this work, we take an important step towards this goal by introducing the notion of dual-layer diversity to the problem of sensor fusion in multi-agent tracking. We demonstrate that the fusion of complementary tracking modalities, passive/relative (e.g., visual odometry) and active/absolute (e.g., infrastructure-assisted RF localization), offers a key first layer of diversity that brings scalability, while the second layer of diversity lies in the methodology of fusion, where we bring together the complementary strengths of algorithmic (for robustness) and data-driven (for accuracy) approaches. ROVAR is an embodiment of such a dual-layer diversity approach that intelligently attends to cross-modal information using algorithmic and data-driven techniques that jointly share the burden of accurately tracking multiple agents in the wild. Extensive evaluations reveal ROVAR's multi-dimensional benefits in terms of tracking accuracy, scalability, and robustness in enabling practical multi-agent immersive applications in everyday environments.
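A toy complementary-fusion sketch of the passive/relative plus active/absolute layering described above (assumed weights and names; not ROVAR itself): dead-reckon with visual-odometry deltas and pull the estimate toward each absolute RF fix when one arrives.

    def fuse(pose, vo_delta, rf_fix=None, alpha=0.3):
        # pose, vo_delta, rf_fix are (x, y) tuples; alpha weights the RF fix.
        x, y = pose[0] + vo_delta[0], pose[1] + vo_delta[1]  # relative update
        if rf_fix is not None:  # absolute correction when RF localization fires
            x = (1 - alpha) * x + alpha * rf_fix[0]
            y = (1 - alpha) * y + alpha * rf_fix[1]
        return (x, y)

    pose = (0.0, 0.0)
    pose = fuse(pose, (0.9, 0.1))                     # odometry only: drifts freely
    pose = fuse(pose, (1.0, 0.0), rf_fix=(2.1, 0.0))  # RF fix reins the drift in
    print(pose)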
Article
Full-text available
The major challenge for accurate fingerprint-based indoor localization is the design of robust and discriminative wireless signatures. Even though WiFi RSSI signatures are widely available indoors, they vary significantly over time and are susceptible to human presence, multipath, and fading due to the high operating frequency. To overcome these limitations, we propose to use FM broadcast radio signals for robust indoor fingerprinting. Because of their lower frequency, FM signals are less susceptible to human presence, multipath, and fading; they exhibit exceptional indoor penetration; and, according to our experimental study, they vary less over time than WiFi signals. In this work, we demonstrate through a detailed experimental study in 3 different buildings across the US that FM radio RSSI values can achieve room-level indoor localization with accuracy similar to or better than that achieved by WiFi signals. Furthermore, we propose to use additional physical-layer signal quality indicators (e.g., SNR and multipath) to augment the wireless signature, and show that localization accuracy can be further improved by more than 5%. More importantly, we experimentally demonstrate that the localization errors of FM and WiFi signals are independent. When FM and WiFi signals are combined to generate wireless fingerprints, the localization accuracy increases by as much as 83% (when accounting for wireless signal temporal variations) compared to when WiFi RSSI alone is used as a signature.
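A minimal fingerprinting sketch of the FM-plus-WiFi combination (illustrative only; the data layout and names are assumptions): concatenate the FM and WiFi RSSI vectors into one signature and classify a new measurement by its nearest neighbor in that joint space.

    import numpy as np

    def locate(fm_rssi, wifi_rssi, database):
        # database maps a room label to its stored (fm_vector, wifi_vector) pair.
        query = np.concatenate([fm_rssi, wifi_rssi])
        return min(
            database,
            key=lambda room: np.linalg.norm(query - np.concatenate(database[room])),
        )

    rooms = {"kitchen": ([-60.0, -72.0], [-40.0, -55.0]),
             "office":  ([-55.0, -80.0], [-52.0, -48.0])}
    print(locate([-61.0, -71.0], [-41.0, -56.0], rooms))  # -> kitchen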
Conference Paper
Full-text available
In this paper we describe a novel algorithm for post-processing a microphone array's beamformer output to achieve better spatial filtering under noise and reverberation. For each audio frame and frequency bin, the algorithm estimates the spatial probability of sound source presence and applies a spatio-temporal filter towards the look-up direction. It is implemented as a real-time post-processor after a time-invariant beamformer and substantially improves the directivity of the microphone array. The algorithm is CPU-efficient and adapts quickly when the listening direction changes. It was evaluated with a linear four-element microphone array. The directivity index improvement is up to 8 dB, and the suppression of a jammer 40° from the sound source is up to 17 dB.
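A schematic sketch of the per-bin spatial filtering idea (our two-channel simplification with assumed names and constants; not the paper's algorithm): for each frequency bin of a frame, estimate how well the observed inter-channel delay agrees with the look-up direction and scale the bin by that soft probability.

    import numpy as np

    def post_filter(fft_a, fft_b, freqs_hz, mic_spacing_m, look_delay_s,
                    c=343.0, kappa=20.0):
        # Observed inter-channel delay per bin, from the cross-spectrum phase.
        phase_diff = np.angle(fft_a * np.conj(fft_b))
        observed_delay = phase_diff / (2 * np.pi * np.maximum(freqs_hz, 1.0))
        # Soft spatial probability: near 1 when the observed delay matches the
        # look-up direction's delay, falling toward 0 as they disagree.
        mismatch = (observed_delay - look_delay_s) * c / mic_spacing_m
        prob = np.exp(-kappa * mismatch ** 2)
        return prob * fft_a  # spatially filtered beamformer output spectrum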
Article
This paper identifies the possibility of using the electronic compasses and accelerometers in mobile phones as a simple and scalable method of localization. The idea is not fundamentally different from ship or air navigation systems, known for centuries. Nonetheless, directly applying the idea to human-scale environments is non-trivial: noisy phone sensors and complicated human movements present practical research challenges. We cope with these challenges by recording a person's walking patterns and matching them against possible path signatures generated from a local electronic map. Electronic maps enable greater coverage while eliminating the reliance on WiFi infrastructure and expensive war-driving. Measurements on Nokia phones and evaluation with real users confirm the anticipated benefits. Results show a location accuracy of less than 12 m in regions where today's localization services are unsatisfactory or unavailable.
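A minimal dead-reckoning sketch of the compass-and-accelerometer idea (assumed step length and names; not the paper's implementation): accumulate one stride along the compass heading per detected step, then score the traced path against candidate path signatures from the map.

    import math

    STEP_LENGTH_M = 0.7  # assumed average stride

    def trace_path(headings_deg):
        # One compass heading per detected step -> list of (x, y) positions.
        x, y, path = 0.0, 0.0, [(0.0, 0.0)]
        for h in headings_deg:
            x += STEP_LENGTH_M * math.sin(math.radians(h))  # east component
            y += STEP_LENGTH_M * math.cos(math.radians(h))  # north component
            path.append((x, y))
        return path

    def path_distance(path_a, path_b):
        # Mean pointwise distance between two equally sampled paths; the map
        # path with the smallest distance is the best location hypothesis.
        n = min(len(path_a), len(path_b))
        return sum(math.dist(p, q) for p, q in zip(path_a, path_b)) / n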
Article
The proliferation of mobile computing devices and local-area wireless networks has fostered a growing interest in location-aware systems and services. In this paper we present RADAR, a radio-frequency (RF) based system for locating and tracking users inside buildings. RADAR operates by recording and processing signal strength information at multiple base stations positioned to provide overlapping coverage in the area of interest. It combines empirical measurements with signal propagation modeling to determine user location and thereby enable location-aware services and applications. We present experimental results that demonstrate the ability of RADAR to estimate user location with a high degree of accuracy.
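The nearest-neighbor-in-signal-space matching at the heart of this description can be sketched as follows (illustrative; k and the data layout are assumptions): each calibration point stores the signal strengths seen from the base stations, and a live measurement is located at the average of the k closest calibration points in signal space.

    import numpy as np

    def nnss(measurement, radio_map, k=3):
        # radio_map: list of ((x, y), rssi_vector) calibration entries.
        ranked = sorted(
            radio_map,
            key=lambda e: np.linalg.norm(np.asarray(measurement) - np.asarray(e[1])),
        )
        points = np.array([entry[0] for entry in ranked[:k]])
        return points.mean(axis=0)  # averaged coordinates of the k best matches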
Article
Traditional acoustic source localization uses a two-step procedure requiring intermediate time-delay estimates from pairs of microphones. An alternative single-step approach is proposed in this paper in which particle filtering is used to estimate the source location through steered beamforming. This scheme is especially attractive in speech enhancement applications, where the localization estimates are typically used to steer a beamformer at a later stage. Simulation results show that the algorithm is robust to reverberation, and is able to accurately follow the source trajectory.
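A schematic sketch of the single-step approach (illustrative assumptions throughout, including the STFT-domain steering and numpy-array inputs): instead of first estimating time delays, each particle, a candidate source position, is weighted by the delay-and-sum beamformer power obtained when the array is steered at that position.

    import numpy as np

    def srp_weight(particle, mic_positions, frame_ffts, freqs_hz, c=343.0):
        # Delay-and-sum power at `particle` for one STFT frame of all mics:
        # undo each mic's propagation delay in the frequency domain and sum.
        steered = np.zeros_like(frame_ffts[0])
        for mic_pos, fft in zip(mic_positions, frame_ffts):
            tau = np.linalg.norm(np.asarray(particle) - np.asarray(mic_pos)) / c
            steered += fft * np.exp(2j * np.pi * freqs_hz * tau)
        return float(np.sum(np.abs(steered) ** 2))

    def reweight(particles, weights, mic_positions, frame_ffts, freqs_hz):
        # One measurement update: scale each particle's weight by the steered
        # response power at its position, then renormalize.
        w = np.array([srp_weight(p, mic_positions, frame_ffts, freqs_hz)
                      for p in particles]) * np.asarray(weights)
        return w / w.sum()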