Daredevil: Indoor Location Using Sound
Ionut Constandache (Duke University), Sharad Agarwal and Ivan Tashev (Microsoft Research), Romit Roy Choudhury (University of Illinois at Urbana-Champaign)
ionut@cs.duke.edu, {sagarwal,ivantash}@microsoft.com, croy@illinois.edu
Abstract
A variety of techniques have been used by prior work on the problem of smartphone location. In this paper, we propose a novel approach using sound source localization (SSL) with microphone arrays to determine where in a room a smartphone is located. In our system called Daredevil, smartphones emit sound at particular times and frequencies, which are received by microphone arrays. Using SSL that we modified for our purposes, we can calculate the angle between the center of each microphone array and the phone, and thereby triangulate the phone's position. In this early work, we demonstrate the feasibility of our approach and present initial results. Daredevil can locate smartphones in a room with an average precision of 3.19 feet. We identify a number of challenges in realizing the system in large deployments, and we hope this work will benefit researchers who pursue such techniques.
I. Introduction
Making indoor location available ubiquitously is difficult. There are many challenges, including achieving accuracy with off-the-shelf phones, relying solely on existing infrastructure such as Wi-Fi access points, and scaling any human effort such as fingerprinting. However, there are specific situations where tailored indoor location solutions are valuable and one or more of these constraints do not apply. For instance, firefighters may be willing to carry custom equipment for indoor location but will require the solution to work in any burning building. A retail chain may be willing to deploy custom equipment in its stores, but may want location to work with the off-the-shelf phones that users carry.
In this work, we focus on the retail scenario as our motivation. A store owner may want to provide indoor location to shoppers for a variety of reasons. She may want shoppers to efficiently find items on their shopping list with minimal clerk assistance. She may want to entice shoppers with special discounts or product reviews depending on what product the shopper is looking at. In doing so, the store owner wants to minimize the burden on the shopper and not require anything custom on the end user devices beyond an app for the store. She may be willing to deploy custom equipment inside the store. While we focus on the retail scenario, our techniques are equally applicable to any scenario where user phones with a custom app need to be located in a large indoor room where the room can be augmented with additional equipment.
Prior work has considered a variety of ways to address such scenarios, including sensing ambient magnetic fields [4], fingerprinting Wi-Fi [1] and fingerprinting FM radio transmissions [3]. In this work, we focus on sound, either audible or ultrasound. The choice of sound comes with its own advantages and disadvantages. There are transducers on all phones (speakers and microphones) and sound is generally unaffected by changes in the store layout or human presence (unlike magnetic fields and Wi-Fi signal strength). However, depending on the frequency, amplitude and duration, it can be overheard by humans, and there is a lot of ambient noise that can interfere with the detection of the intended signal.
In this work, we attempt to accurately locate users indoors by detecting the angle of arrival of audio chirps emitted by smartphones. We examine how well off-the-shelf phones can emit such chirps, at a variety of different frequencies and amplitudes. We examine how we can encode small amounts of information in these chirps to distinguish different phones. To calculate the angle of arrival, we build microphone arrays that can be used in conjunction with SSL (sound source localization) algorithms [17] that we have customized for our use.
The novelty of our work is in applying SSL to locate smartphones. Shopkick is a company that has deployed sound-based location in stores. However, in their system the phone detects sound emitted by a custom device in the store, and the system only determines whether the phone is in the store, not where in the store the phone is. SSL techniques have been explored in depth in prior work, and are in use in products such as Microsoft Kinect, for distinguishing different humans speaking at the same time. In this work, we apply those techniques to locating phones, which has unique challenges.
Our system, called Daredevil, operates at least two microphone arrays, and by calculating the angle from each array to the phone emitting a tone, can triangulate the user's location. In this paper, we present the design of the system and hardware and feasibility experiments in §III, evaluation results in §IV, and relevant prior work in §V. This early work demonstrates a working system that achieves low error: on average 3.8°, or approximately 3.2 feet. However, we also identify limitations of current hardware on smartphones that prevent us from deploying Daredevil in practical scenarios, and we point to future work in this space in §VI.
II. Motivation
We envision Daredevil to be deployed in retail stores where pairs of microphone arrays are mounted on walls or ceilings. Unmodified phones running an app can be identified and located using sound that they emit. There are several open questions that we need to answer when building such a system.
It is important to understand how well off-the-shelf smartphones can emit high frequency audio tones. This is partly a function of the speakers they have, the audio processing hardware, and the software platform on the phones. A related question is how loud these tones are, or at what distance they can be heard using microphones. The longer the distance, the fewer microphone arrays are needed, but the higher the potential for interference between multiple phones. The higher the frequency, the lower the chance of interference from human voice.
The cost and robustness of our microphone array design are two more key factors. An inexpensive array can allow for more arrays to be deployed in the indoor environment. The array has to be robust enough to pick up audio tones from phones from a variety of angles and distances.
The ultimate question is how accurately the microphone arrays can detect the angle of incidence from phones, and subsequently the location of each phone by triangulation from a pair of arrays. Additional questions are how quickly tones can be generated and phones located, and how that impacts the scalability of the system: how many phones can be located, and how frequently.
Figure 1: Daredevil overview. Two microphone arrays, Mic 1 at (X_mic1, Y_mic1) and Mic 2 at (X_mic2, Y_mic2), compute angles α_mic1 and β_mic2 toward the phone at (X, Y); a server combines them.
III. Design and Implementation
III.A. Overview
As shown in Figure 1, we assume that the indoor environment has been instrumented with at least two microphone arrays, with known positions. We expect the arrays to be mounted high on adjacent walls that are perpendicular to each other, or attached to the ceiling. The placement should be high enough such that there is good line of sight to most users in the environment, minimizing the impact of audio reflecting from indoor surfaces. The goal of each microphone array is to calculate the angle of incidence of received audio to the front surface of the array. With two angles (one from each microphone array), we can triangulate the position of the sound source.
Each phone has an app that generates and plays audio in a specific frequency band. The idea is to play audio at high enough frequencies to be inaudible to humans, yet be playable by smartphone speakers and be loud enough to be captured by microphone arrays.
In our application scenario, we expect multiple phones to be present that need to be located simultaneously. If multiple phones emit sound at the same time and frequency, it can be difficult to distinguish them and hence calculate an angle of incidence to each microphone array. We can use two basic approaches to address the problem of multiple phones: frequency division multiplexing (FDM) or time division multiplexing (TDM). As our experiments in §IV show, when attempting to emit sound at a particular frequency, we observe our signal at additional frequencies. As a result, FDM may be more challenging to achieve in practice. Hence, we focus on TDM in Daredevil.
Our smartphone app uses coarse-grained location information from the underlying mobile OS (typically based on Wi-Fi and cellular tower location) to estimate which Daredevil-enabled store the user is in. The app will then receive a schedule from the Daredevil server (shown in Figure 1) that tells it when it can emit sound and in what frequency band. The app uses amplitude modulation to encode a unique phone ID in the sound it emits. Clock drift between phones and servers will limit how quickly multiple phones can be located.
Each phone registers with the Daredevil server and retrieves the common tone frequency, a unique phone ID, and the TDM schedule assigned to it. The phone constructs the audio in software and plays it at the assigned schedule. The microphone arrays stream audio to the server. SSL software running at the server identifies the tone, decodes the phone ID, and computes the tone angle of arrival at each of the two microphone arrays (angles α_mic1 and β_mic2 in Figure 1). The microphone array coordinates (X_mic1, Y_mic1) and (X_mic2, Y_mic2) are static values provided to the server software as configuration values at deployment time. Using the microphone array positions and the tone angles of arrival, the server computes the phone coordinates through triangulation. The phone coordinates (X, Y) are returned to the app, or processed further for higher-level services (such as providing directions on top of an indoor map, or sending a discount coupon).
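To make the triangulation step concrete, here is a minimal sketch (in Python, for illustration; the actual server implementation is C and Matlab code, per §III.E). Each array contributes a bearing ray from its known position, and the phone estimate is the intersection of the two rays. The function name and the use of a shared global coordinate frame for both angles are our assumptions.

```python
import math

def triangulate(p1, theta1, p2, theta2):
    """Intersect two bearing rays to estimate the sound source position.

    p1, p2: (x, y) positions of the two microphone arrays.
    theta1, theta2: bearings (radians) of the phone as seen from each
    array, expressed in a shared global coordinate frame (assumed).
    Returns the (x, y) intersection, or None if the bearings are parallel.
    """
    d1 = (math.cos(theta1), math.sin(theta1))  # direction of ray 1
    d2 = (math.cos(theta2), math.sin(theta2))  # direction of ray 2
    # Solve p1 + t*d1 == p2 + s*d2 for t via Cramer's rule.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None  # (nearly) parallel bearings: no unique fix
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])
```

For example, with arrays at (0, 0) and (25, 0) reporting bearings of 45° and 135°, triangulate returns (12.5, 12.5).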
The phone location is updated periodically, each time the app on the phone plays sound at the scheduled times. The schedule periodicity can be made adaptive based on the number of phones present in the environment. Additional factors, including the maximum user speed, and limits on user movement by walls and aisles, can also be used to dynamically adjust the schedule for different phones.
III.B. Sound Source Localization
Locating sounds using microphone arrays is a well-established area in signal processing. One of the first applications was pointing a pan-tilt-zoom camera toward the current speaker in a conference room [21].

Direction estimation with a pair of microphones is done by computing the delay between the two received signals, and using the known speed of sound and the distance between the two microphones: θ = arcsin(τν/δ). Here θ is the direction angle, τ is the estimated time delay, ν is the speed of sound, and δ is the distance between the two microphones. τ is computed using the Generalized Cross-Correlation function [10], typically with PHAT weighting.
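To illustrate the two-microphone case, the sketch below estimates τ with GCC-PHAT and maps it to an angle via θ = arcsin(τν/δ). The sampling rate (96 kHz), microphone spacing (8.16 mm), and speed of sound reflect numbers appearing elsewhere in this paper; the function itself is our simplification, not Daredevil's eight-microphone SRP pipeline.

```python
import numpy as np

def gcc_phat_angle(x1, x2, fs=96_000, mic_dist=0.00816, c=343.0):
    """Direction of arrival for one microphone pair via GCC-PHAT.

    Computes the generalized cross-correlation with PHAT weighting,
    takes its peak as the time delay tau, and returns
    theta = arcsin(tau * c / mic_dist) in radians.
    """
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)  # PHAT weighting
    # Only delays up to mic_dist / c are physically possible.
    max_shift = max(1, int(fs * mic_dist / c))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    return np.arcsin(np.clip(tau * c / mic_dist, -1.0, 1.0))
```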
Using more than two microphones improves precision, but increases the complexity of the estimation. The naive approach of combining angles from all possible pairs does not work well. Instead, Steered Response Power (SRP) algorithms are often used. Those are based on evaluating multiple angles and picking the one that maximizes certain criteria (power of the signal from the sound source [18], spatial probability [17], eigenvalues [14], etc.). In Daredevil, we use a modified and improved version of the algorithm described in [17]. On every audio processing frame we run a Voice Activity Detector (VAD), like the one described in [15] but modified for the type of audio we generate, and engage the sound source localizer only if there is a signal (real or interfering) present.
The audio frames with signal present are converted to the frequency domain using a short-term Fourier transform, and only the frequency bins containing the signal frequency band are processed. For each frequency bin k of audio frame n, the probability p_k^(n)(θ) that it contains a signal as a function of the direction θ is estimated using the IDOA approach [17]. The probabilities from the frequency bins of interest are averaged to obtain the probability of a sound source being present as a function of the direction, p^(n)(θ). A confidence level is estimated as the difference between the maximum and minimum values of the probability, divided by the probability average. If the confidence level is below a given threshold, the result from this audio frame is discarded; otherwise, the direction angle is taken where the probability peaks. The time-stamped angle, accompanied by the confidence level, is sent up the stack for further processing.
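A compact sketch of this per-frame decision, assuming an upstream IDOA-style estimator has already produced per-bin direction probabilities (the array layout and the threshold value are hypothetical):

```python
import numpy as np

def frame_direction(p_bins, angles, conf_threshold=0.3):
    """Combine per-bin direction probabilities into one frame estimate.

    p_bins: (num_bins, num_angles) array, where p_bins[k, j] is the
    probability that bin k holds signal arriving from angles[j].
    Returns (angle, confidence) or None when the frame is discarded.
    """
    p = p_bins.mean(axis=0)                      # average over frequency bins
    confidence = (p.max() - p.min()) / p.mean()  # (max - min) / average
    if confidence < conf_threshold:
        return None                              # too flat: discard the frame
    return angles[int(np.argmax(p))], confidence
```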
The angular precision based on a single audio frame with a duration of 10-30 ms is not high, and hence we use post-processors. Their purpose is to remove outliers and reflections and to increase location precision by averaging and interpolating the sound source trajectory over small time windows. A variety of methods can be used, including Kalman filtering [9], clustering [16] and particle filtering [22]. In Daredevil, we use the clustering algorithm described in chapter 6 of [16]. The output of the sound source localizer is a set of directions toward all tracked sound sources, each accompanied by a confidence level.
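We do not reproduce the clustering algorithm of [16] here, but the following toy post-processor conveys the idea: group time-stamped per-frame estimates that agree in angle, keep only clusters that persist across several frames, and report a confidence-weighted direction per cluster. All names and parameter values are illustrative.

```python
def cluster_angles(frame_estimates, window=0.5, min_frames=3, tol=5.0):
    """Toy stand-in for the clustering post-processor.

    frame_estimates: iterable of (time_s, angle_deg, confidence) tuples,
    in time order. Returns a list of (mean_angle_deg, total_confidence)
    for clusters observed in at least min_frames frames.
    """
    clusters = []
    for t, ang, conf in frame_estimates:
        for c in clusters:
            # Join an existing cluster if close in angle and recent in time.
            if abs(ang - c["mean"]) <= tol and t - c["last"] <= window:
                c["wsum"] += ang * conf
                c["w"] += conf
                c["mean"] = c["wsum"] / c["w"]
                c["last"], c["n"] = t, c["n"] + 1
                break
        else:
            clusters.append({"mean": ang, "wsum": ang * conf, "w": conf,
                             "last": t, "n": 1})
    return [(c["mean"], c["w"]) for c in clusters if c["n"] >= min_frames]
```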
In addition to localizing sounds, signals from the microphones can be combined into one signal that is equivalent to a highly directive microphone, an operation called beamforming. Commonly the beamforming happens in the frequency domain, where we have already converted the audio signal to perform sound source localization. In this domain the beamforming output is defined as a weighted sum with direction-dependent weights. The output signal contains less reverberation and noise, allowing us to more easily process the chirps from a phone. The beamforming operation requires knowledge of the direction to the desired sound source, which is obtained from the sound source localizer. In Daredevil, we use a time-invariant steerable beamformer [16]. It consists of 21 pre-computed tables of weight coefficients for directions from -50° to +50°, pointing at every 5°. Using the list of sound sources obtained from the sound source localizer, for each of the tracked sound sources we snap the direction to the closest pre-computed angle and then compute the beamformer output. This means that we apply the beamforming procedure as many times as there are tracked sound sources. The output of each beamformer contains the sound from the corresponding phone, while the other signals and noises are attenuated. It is used further for decoding the information coming from that particular phone.
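A sketch of the steering step, assuming the 21 weight tables are available as a mapping from steering angle (degrees) to per-microphone, per-bin complex weights; this data layout is hypothetical.

```python
import numpy as np

def beamform(frame_spectra, weights_by_angle, direction_deg):
    """Weighted-sum beamformer output for one tracked source.

    frame_spectra: (num_mics, num_bins) complex spectra of one frame.
    weights_by_angle: dict angle_deg -> (num_mics, num_bins) weights,
    with keys -50, -45, ..., +50 (the 21 precomputed tables).
    direction_deg: direction estimate from the sound source localizer.
    """
    # Snap the SSL direction to the closest precomputed steering angle.
    snapped = int(round(np.clip(direction_deg, -50, 50) / 5.0)) * 5
    w = weights_by_angle[snapped]
    return (np.conj(w) * frame_spectra).sum(axis=0)  # sum over microphones
```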
Each microphone array works in the range of ±50°. In the ideal case we should be able to reject any sound source that is outside this angle or out of the line of sight; otherwise we will get spurious directions for sound sources. We mitigate this problem in the following ways:

- We use VAD to select only those audio frames where we have strong signals, presumably rejecting those that are out of the monitored zone or not in the direct sight of the array.
- We scan for sound sources only in the directions of the range above.
- After processing each frame we compute a confidence level ((max-min)/average) and reject all of the frames with a confidence level below a threshold.
- We cluster the sound source localizations, and track only sound sources with presence in several frames. We then compute a confidence level for each sound source.
- We compute the beamformer toward a direction only when the confidence level after clustering is above a threshold.

Again, the key is to reject directions toward sporadic sound sources, reflections from walls and ceilings, and noise. This is part of the reason why we do not detect real sound sources at larger distances: with our thresholds we cannot distinguish them from soft sounds and nearby noise. Further tuning of the thresholds can be done to increase the detection range as much as possible while still keeping false positives at acceptably low levels.

Figure 2: Spectrogram from Adobe Audition software of 18 kHz audio played on an HTC Surround phone and recorded on a microphone array.
III.C. Audio Frequency Band
In our application scenario, the store owner is willing to deploy hardware in the store but cannot expect users to modify their phones beyond installing an app. Ideally, we want the audio emitted by phones to be well beyond human-perceptible ranges. Medical literature indicates that human hearing can recognize sounds up to 20 kHz. Unfortunately, speakers on most smartphones are not designed to operate beyond human hearing ranges, and our experiments on a number of phone models (Apple iPhone 4, Samsung Focus, Asus E600, and HTC Surround) verified that expectation. As the emitted sound frequency approaches 21 kHz, the sound amplitude drops off quickly. Most phones are capable of achieving 18 kHz at loud volumes. It is well known that the upper range of audio frequencies detectable by the human ear varies with age. In a small experiment using 10 test users aged 20 and higher, playing audio on a PC with high-end speakers, we observed no user perception of frequencies higher than 17 kHz. For this paper, we focus on 18 kHz, realizing that the short audio chirps that our app emits may be perceptible by children.
Phone loudspeakers at maximum volume produce substantial non-linear distortions. For 18 kHz sound, the second, third, and higher harmonics are above one half of the sampling rate and should be removed by the anti-aliasing filter integrated into the ADC. However, due to the high amplitude of these non-linear distortions, they still bleed over and are mirrored as signals with lower frequency. The spectrogram in Figure 2 shows these additional signals with frequencies of 6 and 10 kHz. While these two shadow frequencies are within the human audible range, their amplitudes are much lower than the primary signal, which is clearly visible at 18 kHz. We have observed this behavior on multiple phones and we believe this is a limitation of the speakers built into phones. We observe similar spectrograms for frequency bands ranging from 18 kHz to 21 kHz.

Figure 3: TDM time slot. A slot of length T_SLOT contains the play delay t_PLAY, the tone of length t_TONE, and a guard time t_GUARD.
Our findings are consistent with prior work. Recent work [5] investigated the feasibility of playing ultrasonic sounds on mobile phones. The authors tested four commercial phones (HTC G1, HTC Hero, Apple iPhone 3GS, and Nokia 6210 Navigator) playing tones at frequencies between 17 kHz and 22 kHz. They observed that all phones were capable of generating these high frequencies. While noise appeared at some other frequencies when the volume of the phone's speakers was set to maximum, there exists a combination of volume settings for each phone such that the noise is minimal.
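When inspecting recordings for such shadow frequencies, a simple band-energy measurement over an averaged magnitude spectrum suffices. The sketch below is our own diagnostic, not part of Daredevil; for instance, comparing band_energy(x, 96_000, 17_500, 18_500) against band_energy(x, 96_000, 5_500, 6_500) quantifies how strong a 6 kHz shadow is relative to an 18 kHz tone.

```python
import numpy as np

def band_energy(x, fs, lo, hi, nfft=1024):
    """Summed average-magnitude-spectrum energy in the band [lo, hi] Hz."""
    win = np.hanning(nfft)
    frames = np.stack([x[i:i + nfft] * win
                       for i in range(0, len(x) - nfft, nfft // 2)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    return mag[(freqs >= lo) & (freqs <= hi)].sum()
```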
III.D. Locating Multiple Phones
Using FDM to allocate distinct frequencies to differ-
ent phones is possible, but complicated by the shadow
frequency problem we observe. We instead use TDM.
However, for a TDM approach to work, we need to
ensure that the slots do not overlap. This is challeng-
ing because different phones and the Daredevil server
may not have good time synchronization with each
other. In addition, scheduling and processing over-
head in smartphone OSes may introduce variable de-
lay between our app issuing a command to play audio
and it coming out of the phone’s speaker.
Figure 3 represents a time slot in the Daredevil TDM schedule. We denote the length of the slot by T_SLOT. This represents the time interval allocated to a phone to play the tone of length t_TONE (t_TONE < T_SLOT). We denote by t_PLAY the delay between the request to play the tone and the phone actually playing it, and by t_GUARD a guard time to account for clock drift. The guard time t_GUARD ensures that the next scheduled phone does not play its tone concurrently with the current phone due to poor clock synchronization.
To measure the delay t_PLAY between issuing the play-tone command and the tone coming out of the phone speaker, we ran a small experiment. We instrumented the Daredevil app on a phone to record a timestamp when it requests the OS to play the tone. In parallel, the phone recorded sound through its microphone. In this way, we are able to estimate t_PLAY as the difference in time between the app issuing the play command and the recorded audio containing the expected frequency. After running this experiment 10 times on different phone models, we observed a maximum delay of 100 ms. Hence we use a t_PLAY of 100 ms.
We empirically evaluated a number of values for t_TONE, while measuring (i) tone detection and (ii) angle estimation accuracy. Both suffer when the tone length is short. If the tone length is too short, our SSL algorithm does not have enough samples to remove noise from angle estimates. In our experiments, 500 ms was an adequate length, and hence we set t_TONE to 500 ms.
To pick t_GUARD, we ran experiments with two phones that were configured to synchronize their clocks with cellular towers. We attached both phones to a PC using USB cables and recorded on the PC the time reported by both phones at multiple points throughout the day. The two sets of reported times were very close to each other; the difference remained below 150 ms, which is the value we pick for t_GUARD.
Putting together the values for t_PLAY, t_TONE and t_GUARD, we have 750 ms for the value of T_SLOT. Hence, with a single frequency band, we can locate up to 40 phones every 30 seconds in the coverage area of a pair of microphone arrays.
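The slot arithmetic is simple enough to capture in a few lines. The sketch below derives the 40-phones-per-30-seconds capacity from the measured delays; the scheduling function and its interface are hypothetical.

```python
T_PLAY, T_TONE, T_GUARD = 0.100, 0.500, 0.150  # seconds, from our experiments
T_SLOT = T_PLAY + T_TONE + T_GUARD             # 0.750 s per phone

def make_schedule(phone_ids, period_s=30.0):
    """Assign each phone a play offset within the period (a sketch)."""
    capacity = int(period_s // T_SLOT)          # 40 slots per 30 s here
    if len(phone_ids) > capacity:
        raise ValueError("too many phones for a single frequency band")
    return {pid: i * T_SLOT for i, pid in enumerate(phone_ids)}
```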
III.E. Hardware and Software Implementation
Figure 4 shows a photograph of one of the three microphone arrays we built for Daredevil. We used a laser cutter on a sheet of plastic to form the base and the front plate that holds the microphones. The geometry is a linear equidistant eight-element microphone array. The distance between the microphones is one half wavelength for sound with a frequency of 21 kHz (8.16 mm). We used cardioid electret microphones with a diameter of 6 mm, all of them pointing forward.

The microphones fit snugly into the holes on the front plate (bottom of the picture), and wires connect them to a simple circuit. The voltage bias for the microphones is provided by a 9 V battery. We also have variable resistors on each channel for individual gain calibration. Eight cables then run out to an A-D converter. For our experiments, we used the MOTU UltraLite-mk3, which supports 8 audio inputs and connects to a PC via USB. For prototype and hardware debugging purposes, we made the microphone array larger than it needs to be. If productized, the entire microphone array and the ADC could be made into a box the size of a deck of playing cards.

Figure 4: Top-down view of one of our microphone arrays.
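The microphone spacing above follows directly from the design frequency; a quick check, assuming a nominal room-temperature speed of sound of 343 m/s:

```python
SPEED_OF_SOUND = 343.0  # m/s, nominal room-temperature value (assumed)
F_MAX = 21_000.0        # highest design frequency, Hz

# Half-wavelength spacing at F_MAX avoids spatial aliasing in the array.
spacing_mm = SPEED_OF_SOUND / F_MAX / 2.0 * 1000.0
print(f"{spacing_mm:.2f} mm")  # -> 8.17 mm, matching the built 8.16 mm
                               # to within rounding of the assumed speed
```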
Each A-D converter outputs 8-channel audio through the audio driver stack in Microsoft Windows. We use a combination of C code and Matlab code to filter the audio, do VAD, do SSL, and triangulate the phone's position. Another piece of software maintains a list of active phones and allocates frequency bands, time slots, and unique IDs to each phone. This constitutes the Daredevil server.
On the phone side, we have a simple app on Windows Phone 7, which communicates with the Daredevil server over the Internet, and plays audio when given a schedule. The app uses amplitude modulation to encode a 24-bit unique phone ID on top of the audio tone. Both the unique ID and the audio frequency are provided to the app by the Daredevil server. The unique phone ID is modulated at a baud rate of 50 symbols per second, and there are two additional guard bits at the start of the audio sequence. Each tone is therefore 0.52 seconds long. We have also ported our app to the Apple iOS platform.
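A sketch of how such a chirp could be synthesized follows. The paper specifies amplitude modulation, a 24-bit ID, 50 symbols per second, and two guard bits; the bit-to-amplitude mapping, guard-bit values, modulation depth, and sample rate below are our assumptions.

```python
import numpy as np

def make_chirp(phone_id, freq=18_000.0, fs=48_000, baud=50, bits=24,
               depth=0.5):
    """Amplitude-modulate a phone ID onto a tone (illustrative sketch).

    Two guard bits precede the 24-bit ID, so at 50 symbols/s the tone
    lasts (2 + 24) / 50 = 0.52 s, matching the tone length reported.
    """
    symbols = [1, 1] + [(phone_id >> i) & 1 for i in range(bits)]
    spb = fs // baud  # samples per symbol
    env = np.repeat([1.0 if b else 1.0 - depth for b in symbols], spb)
    t = np.arange(len(env)) / fs
    return env * np.sin(2 * np.pi * freq * t)
```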
IV. Experimental Results
The primary question we want to answer in our evaluation is how accurately we can determine the location of a phone in Daredevil. We begin by evaluating how accurately Daredevil can determine the angle between a microphone array and where the phone was when it played a sound. We evaluate that error at different angles and distances to the microphone array. We also evaluate how that error changes when we change the frequency band in which audio is played. Finally, using the accuracy we achieve in determining the angle, we can use triangulation to calculate the error between a phone's actual and estimated positions.
In the experiments here, we set the volume of the phone's loudspeaker to 90% of the maximum, which does not produce as much distortion as 100%. The limited ability of phone loudspeakers to emit sounds at high amplitudes without distortion, and the noise abatement techniques in our sound source localization algorithm, together limit the maximum distance at which we can locate phones.
IV.A. Angular accuracy at 18 kHz
To evaluate Daredevil, we deployed one of our microphone arrays in a conference room. We then played tones at 18 kHz on a Samsung Focus smartphone at specific locations in the room. Figure 5 shows the 12 locations in the room that we evaluated. The figure also shows, for each location, the distance to the microphone array and the angle between the center of the microphone array and the line from the array to the phone's position. We calculated these distances and angles by using a laser range finder, a measuring tape, and trigonometry.
Figure 6 shows the results of estimating the angle at each of the 12 locations in Daredevil. As shown in the figure, many of the estimates are off by 5° or less. However, our modified VAD is not able to detect the tone at several locations, and at two locations our SSL provides poor accuracy.
To understand why accuracy is poor at certain locations, we present Figure 7. The green baseline in the center represents the noise floor, while the spikes denote the tone being played by the phone. While the tones are distinguishable, the SNR is low: in many cases less than twice the noise floor. This low SNR results in missed tones, and in poor angle accuracy when tone amplitudes are erroneously correlated with the noise rather than with the actual tone signal. The SNR is poor at higher frequencies as well (we tested 19 kHz, 20 kHz, and 21 kHz).
Figure 5: Evaluation locations when playing tones at 18 kHz. Each "X" marks a spot where we played audio from a phone, labeled with the position number and the distance and angle to the center of the microphone array (5 to 35 feet, -45° to 71°; not drawn to scale).

Figure 6: Results from calculating the angle at each of the 12 locations while playing tones at 18 kHz. Each orange circle indicates the time at which the tone was played on a phone, in order (location #1 is the left-most circle, location #12 is the right-most circle). Locations 3, 10, 11, and 12 were not detected by Daredevil. Blue dots denote the angle direction inferred per audio frame (1024 audio samples at 96 kHz). Green lines denote the beamforming directions. Small horizontal red lines represent the angles output by SSL after clustering and weighting the per-frame directions.

Figure 7: Amplitudes of the 18 kHz audio tones received at one of the 8 microphones, after amplifying the signal by 40 dB, applying a band pass filter around 18 kHz, and then amplifying again by 40 dB. The vertical axis is amplitude, and the horizontal axis is time. Annotations in the figure mark one spike with good SNR and a weaker spike that is hard to distinguish from noise.

IV.B. Angular accuracy at 10 kHz

The poor SNR at 18 kHz contributes to our poor accuracy, and we suspect the poor SNR is due to phone speakers not being able to produce sound at high amplitudes at 18 kHz. To test this theory, we repeated our experiments at 10 kHz. Figure 8 shows the locations we tested in the same room at this frequency.
Figure 9 shows the angular accuracy at 10 kHz. We observe that all 7 locations are detected, and in most cases with relatively low error. Daredevil can detect the phone's audio as far away as 35 feet. The average error is 3.8°.
Figure 10 shows the amplitude of the received signal at one microphone after processing. As expected, we see a very strong set of 7 signals (the vertical bars). Unfortunately, since we are operating at audible frequencies, we also catch human voice, as shown by the other peaks. These sometimes appear in Figure 9 as calculated angles (short red horizontal lines without an orange circle), but since the phone's ID cannot be decoded from that signal, no phone is detected at those angles.
Unfortunately, the lower frequency of 10 kHz poses problems for Daredevil. While each tone played by a phone is very short, it can become annoying for humans. Human voice can also appear in this frequency range; it can be filtered out, but it can interfere with the phone's audio signal when humans speak at the same time. While 10 kHz is not a feasible option in a real deployment, it demonstrates the effectiveness of Daredevil if this limitation of phone speakers were to be mitigated in the future.
IV.C. Location accuracy
To evaluate location accuracy, we simulate a range of scenarios with different angular accuracies from our results above. Rather than stand at many locations inside a room and run the application on the phone, we use our prior measured angular accuracies in a simulation for convenience. We define the location error as the distance in feet between the Daredevil-computed location and the actual location of the user. We randomly select 50 user locations within 25 feet of each of two microphone arrays. At each location, we use two angular errors that are randomly chosen from Figure 9 (for each angular error we consider both the positive and the negative value). Figure 11 shows the CDF of the Daredevil location error in feet. The minimum location error is 0.52 feet and the maximum error is 17.59 feet, while the average error across all 50 locations is 3.19 feet (less than 1 meter). As baseline values, we include the CDFs of two other localization schemes: Centroid (estimate the user location as the mid-point between the two microphone arrays) and Best Mic (estimate the user location as the location of the microphone array that is closest to the user). The CDF graphs show that Daredevil offers promising location accuracy, and can meet the requirements of the scenarios we outlined in §I.

Figure 8: Evaluation locations when playing tones at 10 kHz. Each "X" marks a spot where we played audio from a phone, labeled with the position number and the distance and angle to the center of the microphone array (5 to 35 feet, 54° down to 11°; not drawn to scale).

Figure 9: Results from calculating the angle at each of the 7 locations while playing tones at 10 kHz. Each orange circle indicates the time at which the tone was played on a phone, in order (location #1 is the left-most circle, location #7 is the right-most circle). All locations were detected by Daredevil, with per-location errors between 1° and 7°. Blue dots denote the angle direction inferred per audio frame (1024 audio samples at 96 kHz). Green lines denote the beamforming directions. Small horizontal red lines represent the angles output by SSL after clustering and weighting the per-frame directions.

Figure 10: Amplitudes of the 10 kHz audio tones received at one of the 8 microphones, after amplifying the signal by 40 dB, applying a band pass filter around 10 kHz, and then amplifying again by 40 dB. The vertical axis is amplitude, and the horizontal axis is time. Annotations in the figure mark the seven strong tone spikes (excellent SNR) and smaller peaks caused by human voice.

Figure 11: CDF of location error in feet for Daredevil, Centroid, and Best Mic. Centroid and Best Mic represent baseline localization schemes. Centroid always estimates the user location at the mid-point between the two microphone arrays. Best Mic places the user at the closest microphone array location to the user's ground-truth location.
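For concreteness, here is a self-contained sketch of the simulation just described. The array positions and room extent are illustrative; the angular error values are read off Figure 9, and the ray intersection mirrors the triangulation sketch in §III.A.

```python
import math
import random

ANGLE_ERRORS_DEG = [7, 1, 2, 1, 5, 5, 4]  # per-location errors from Figure 9

def intersect(p1, th1, p2, th2):
    """Intersect two bearing rays (same construction as the §III.A sketch)."""
    d1 = (math.cos(th1), math.sin(th1))
    d2 = (math.cos(th2), math.sin(th2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

def mean_location_error(mic1=(0.0, 0.0), mic2=(25.0, 0.0), trials=50):
    """Perturb true bearings by signed errors from ANGLE_ERRORS_DEG,
    re-triangulate, and average the distance error (in feet)."""
    errors = []
    while len(errors) < trials:
        # Random true location in an illustrative 25 x 25 ft area.
        x, y = random.uniform(0.0, 25.0), random.uniform(1.0, 25.0)
        true1 = math.atan2(y - mic1[1], x - mic1[0])
        true2 = math.atan2(y - mic2[1], x - mic2[0])
        e1 = math.radians(random.choice(ANGLE_ERRORS_DEG)) * random.choice((-1, 1))
        e2 = math.radians(random.choice(ANGLE_ERRORS_DEG)) * random.choice((-1, 1))
        estimate = intersect(mic1, true1 + e1, mic2, true2 + e2)
        if estimate is not None:
            errors.append(math.hypot(estimate[0] - x, estimate[1] - y))
    return sum(errors) / len(errors)
```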
V. Related Work
Indoor location is an active area of research, with a large number of published works. We cannot even begin to comprehensively cite all prior work in this space, but instead focus on location systems that use sound.
ActiveBat [6] relies on small devices, called Bats, to transmit ultrasonic pulses when instructed over RF. Bats are carried by users or attached to objects. Known-location receivers listen for the Bat ultrasonic pulse, and compute the distance to it based on the time difference between the RF request and the device's ultrasonic pulse. The Bat location is computed centrally by the system using distance estimates to at least three receivers. ActiveBat avoids collisions by scheduling the Bat transmissions: it broadcasts over RF which Bat devices should transmit the ultrasonic pulse. The system determines the Bat location with an accuracy of a few cm.
In Cricket [13], mobile devices correlate RF signals and ultrasonic pulses transmitted by hardware beacons installed in the environment. The mobile device is located at the closest emitting beacon. The beacons are carefully placed in the environment to maximize location coverage and minimize interference. To avoid collisions and make correct RF-ultrasonic pulse correlations, the beacon transmission times are randomized. The system achieves location granularities of 4x4 square feet.
We believe that Daredevil provides an interesting solution in a different part of the design space, where fewer devices are installed (two microphone arrays per room or area) and a different technique (SSL) is used to calculate location. In situations where a large number of devices can be installed in the environment, such as in ActiveBat and Cricket, perhaps finer location accuracy can be achieved.
WALRUS [2] provides room-level mobile device location. Each room is instrumented with a desktop computer which simultaneously broadcasts Wi-Fi and ultrasound beacons. Ultrasound beacons are confined by room walls and, as a result, these beacons are overheard only by devices co-located with the desktop computer. The mobile device infers room location by correlating received Wi-Fi packets (possibly from multiple rooms) with the overheard ultrasound beacon (from the same room). To make correct correlations, WALRUS uses a maximum time difference between the Wi-Fi and ultrasound beacons and periodic variations in the desktops' broadcast times. The authors show that even in the case of 6 neighboring rooms, room identification is correct 85% of the time. In contrast, Daredevil locates phones within a room at much finer granularity using microphone arrays.
WebBeep [11] proposes a scheme based on time of flight to locate mobile devices. These devices inform over RF when they are about to play an audible tone (at 4 kHz). Microphones detect the tone, and then compute a distance estimate to the device based on the speed of sound and the difference between the advertised time of play and the time the tone was heard. These distance estimates are affected by a constant, unknown time delay between when the sound was advertised to play and when it actually played. By factoring in this unknown delay and relying on distance estimates to three microphones, the location of the device can be inferred. In a 20x9 m² room covered with 6 microphones, the authors show an accuracy of 70 cm, 90% of the time, in a quiet environment.
The authors of [7, 12] built or used off-the-shelf UWB transmitters/receivers for location in an indoor environment. The advantage of wideband ultrasound over narrowband ultrasound (as used in the above systems) is that it allows multiple, simultaneous access to the medium. A wide frequency spectrum can support code division multiple access (CDMA) techniques. In CDMA, special codes are assigned to transmitters to ensure their signals can be separated at the receiver, even when transmissions occur in parallel.
In contrast to these prior systems, we use a unique approach of placing microphone arrays at carefully spaced intervals and using SSL to calculate the angle of incidence of sound. In §III.B, we already described the underlying prior work in SSL. However, the prior work in that space has focused on locating human voice, not locating phones, which comes with associated challenges of how phones can emit appropriate sound, how to schedule those transmissions, and what accuracy can be achieved. SSL has been used in robotics, typically to locate human voice so that a robot can follow humans. For example, in [20] the authors designed a sound location system to point the robot camera toward the direction of the sound source. When the source was placed at 3 m from the robot, the mean direction error was 3°. The authors of [8] allow for simultaneous localization of a mobile robot and multiple fixed sound sources. The sound sources are assumed fixed, while the robot can move freely through the space of a room. The bearing to sound sources is computed with accuracies between 3° and 9°. Similar work [19] tracked multiple people (2-4) while speaking. The system used a combination of beamforming and particle filtering to distinguish between multiple sound sources in the environment. The authors showed direction accuracies of 10° when the sound source is up to 7 m away from the robot.
VI. Conclusions
Daredevil is a system for locating phones in an indoor environment that has been instrumented with small microphone arrays. Each array can be as small as a pack of playing cards. Our system uses sound source localization to calculate the angle between a phone and each microphone array, and then triangulation to calculate the position. We use time-division multiplexing to schedule multiple phones that want to be located. Our implementation and evaluation demonstrate that low error can be achieved: 3.8° on average, or approximately 3.2 feet.
Our current implementation is limited by the quality of audio speakers on modern smartphones. At frequencies higher than 18 kHz, the amplitude of sound that these speakers generate is too low compared to the ambient noise floor. At human-audible ranges, sufficient amplitude can be achieved. Our hope is that better speakers will be available on smartphones in the future. Nonetheless, at frequencies above 18 kHz, it is unknown what the impact will be on animals, especially service animals for the blind. One possibility of using audible frequencies such as 10 kHz is to embed our unique audio signal into music or sounds that the UI of an app may play while the user is interacting with the app.
There are two directions for future work in this space that we believe are promising. In this paper, we did not address the problem of segmenting the floor plan of a large room (such as a large department store) into squares, where each square is served by a pair of microphone arrays. In such a configuration, the schedule of audio transmissions by phones in one square would need to be coordinated with those in adjacent squares: a traditional coloring problem that would need to be solved. A second direction that we did not explore in this work is reversing the direction of audio. If we had wall-mounted speakers transmitting at inaudible, high frequencies, could a microphone array on a phone determine its location? In such a scheme, no scheduling is needed since the same signal would be usable by all phones in the vicinity. Some modern smartphones have multiple microphones, primarily for noise cancellation during audio calls. If each audio stream from each microphone were exposed to the software stack on the phone, potentially SSL could be applied.
VII. Acknowledgments
We thank Mike Sinclair (MSR) for letting us use his hardware lab and helping us with the laser cutter, and we thank Bruce Cleary (MSR) for help with circuit boards.
References
[1] P. Bahl and V. Padmanabhan. RADAR: An In-Building RF-based User Location and Tracking System. In IEEE INFOCOM, 2000.
[2] G. Borriello, A. Liu, T. Offer, C. Palistrant, and R. Sharp. WALRUS: Wireless acoustic location with room-level resolution using ultrasound. In ACM MobiSys, 2005.
[3] Y. Chen, D. Lymberopoulos, J. Liu, and B. Priyantha. FM-based Indoor Localization. In ACM MobiSys, 2012.
[4] I. Constandache, R. R. Choudhury, and I. Rhee. CompAcc: Using Mobile Phone Compasses and Accelerometers for Localization. In IEEE INFOCOM, 2010.
[5] V. Filonenko, C. Cullen, and J. Carswell. Investigating Ultrasonic Positioning on Mobile Phones. In IPIN, Sept. 2010.
[6] A. Harter, A. Hopper, P. Steggles, A. Ward, and P. Webster. The anatomy of a context-aware application. In ACM MobiCom, 1999.
[7] M. Hazas and A. Hopper. Broadband ultrasonic location systems for improved indoor positioning. IEEE Transactions on Mobile Computing, 5:536–547, 2006.
[8] J.-S. Hu, C.-Y. Chan, C.-K. Wang, and C.-C. Wang. Simultaneous localization of mobile robot and multiple sound sources using microphone array. In IEEE ICRA, 2009.
[9] S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In International Symposium on Aerospace/Defense Sensing, Simulation and Controls, 1997.
[10] C. Knapp and G. Carter. The generalized correlation method for estimation of time delay. IEEE TASSP, Aug. 1976.
[11] C. V. Lopes, A. Haghighat, A. Mandal, T. Givargis, and P. Baldi. Localization of off-the-shelf mobile devices using audible sound: architectures, protocols and performance assessment. ACM MC2R, April 2006.
[12] K. Muthukrishnan and M. Hazas. Position Estimation from UWB Pseudorange and Angle-of-Arrival: A Comparison of Non-linear Regression and Kalman Filtering. In LoCA, pages 222–239, 2009.
[13] N. B. Priyantha, A. Chakraborty, and H. Balakrishnan. The Cricket location-support system. In ACM MobiCom, 2000.
[14] R. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 1986.
[15] J. Sohn, N. Kim, and W. Sung. A statistical model based voice activity detector. IEEE Signal Processing Letters, Jan. 1999.
[16] I. Tashev. Sound Capture and Processing: Practical Approaches. John Wiley and Sons, 2009.
[17] I. Tashev and A. Acero. Microphone array post-processor using instantaneous direction of arrival. In IWAENC, 2006.
[18] H. V. Trees. Optimum Array Processing. Part IV of Detection, Estimation and Modulation Theory. John Wiley and Sons, 2002.
[19] J.-M. Valin, F. Michaud, and J. Rouat. Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robotics and Autonomous Systems, 55(3):216–228, 2007.
[20] J.-M. Valin, F. Michaud, J. Rouat, and D. Letourneau. Robust sound source localization using a microphone array on a mobile robot. In IEEE IROS, 2003.
[21] H. Wang and P. Chu. Voice source localization for automatic camera pointing in videoconferencing. In ICASSP, 1997.
[22] D. Ward and R. Williamson. Particle filter beamforming for acoustic source localization in reverberant environment. In ICASSP, 2002.
... Popular solutions include RF multi-lateration techniques [18,26] or recent deep learning methods [13]. There are other works using optical (e.g., IR beacons [30,32], VLC [38]) and acoustic sensors [20,36,44], however, these solutions are either limited to LoS scenarios or highly sensitive to ambient noise [29], material attenuation (e.g., wood, concrete) [20,55], and certain environmental conditions (e.g., temperature, humidity) limiting them to a room-level applications. In comparison, RF signals are more robust in LoS and NLoS indoor environments. ...
... Popular solutions include RF multi-lateration techniques [18,26] or recent deep learning methods [13]. There are other works using optical (e.g., IR beacons [30,32], VLC [38]) and acoustic sensors [20,36,44], however, these solutions are either limited to LoS scenarios or highly sensitive to ambient noise [29], material attenuation (e.g., wood, concrete) [20,55], and certain environmental conditions (e.g., temperature, humidity) limiting them to a room-level applications. In comparison, RF signals are more robust in LoS and NLoS indoor environments. ...
Preprint
The plethora of sensors in our commodity devices provides a rich substrate for sensor-fused tracking. Yet, today's solutions are unable to deliver robust and high tracking accuracies across multiple agents in practical, everyday environments - a feature central to the future of immersive and collaborative applications. This can be attributed to the limited scope of diversity leveraged by these fusion solutions, preventing them from catering to the multiple dimensions of accuracy, robustness (diverse environmental conditions) and scalability (multiple agents) simultaneously. In this work, we take an important step towards this goal by introducing the notion of dual-layer diversity to the problem of sensor fusion in multi-agent tracking. We demonstrate that the fusion of complementary tracking modalities, - passive/relative (e.g., visual odometry) and active/absolute tracking (e.g., infrastructure-assisted RF localization) offer a key first layer of diversity that brings scalability while the second layer of diversity lies in the methodology of fusion, where we bring together the complementary strengths of algorithmic (for robustness) and data-driven (for accuracy) approaches. RoVaR is an embodiment of such a dual-layer diversity approach that intelligently attends to cross-modal information using algorithmic and data-driven techniques that jointly share the burden of accurately tracking multiple agents in the wild. Extensive evaluations reveal RoVaR's multi-dimensional benefits in terms of tracking accuracy (median of 15cm), robustness (in unseen environments), light weight (runs in real-time on mobile platforms such as Jetson Nano/TX2), to enable practical multi-agent immersive applications in everyday environments.
... Relaxing either one of these two requirements brings up rich bodies of past work [24,28,37,63,66,78,79]. For instance, a known source signal (such as a training sequence or an impulse sound) can be localized through channel estimation and fingerprinting [29,59,66,79], while scattered microphone arrays permit triangulation [24,25,37,63]. However, VoLoc's aim to localize arbitrary sound signals with a single device essentially inherits the worst of both worlds. ...
... ■ Multiple arrays or known sound signals: Distributed microphone arrays have been used to localize (or triangulate) an unknown sound source, such as gun shots [63], wildlife [24], noise sources [55,70,71], and mobile devices [25]. Many works also address the inverse problem of localizing microphones with speaker arrays that are playing known sounds [13,15,42,50]. ...
Conference Paper
Full-text available
Voice assistants such as Amazon Echo (Alexa) and Google Home use microphone arrays to estimate the angle of arrival (AoA) of the human voice. This paper focuses on adding user localization as a new capability to voice assistants. For any voice command, we desire Alexa to be able to localize the user inside the home. The core challenge is twofold: (1) accurately estimating the AoAs of multipath echoes without the knowledge of the source signal, and (2) tracing back these AoAs to reverse triangulate the user's location. We develop VoLoc, a system that proposes an iterative align-and-cancel algorithm for improved multipath AoA estimation, followed by an error-minimization technique to estimate the geometry of a nearby wall reflection. The AoAs and geometric parameters of the nearby wall are then fused to reveal the user's location. Under modest assumptions, we report localization accuracy of 0.44 m across different rooms, clutter, and user/microphone locations. VoLoc runs in near real-time but needs to hear around 15 voice commands before becoming operational.
... Relaxing either one of the requirements brings up rich bodies of past work [76,77,78,79,80,81,82]. For instance, a known source signal can be localized through channel estimation and fingerprinting [83,79,84,82], while scattered microphone arrays permit triangulation [78,77,80,85]. However, VoLoc's aim to localize arbitrary sound signals with a single device essentially inherits the worst of both worlds. ...
... • Multiple arrays or known sound signals: Distributed microphone arrays have been used to localize (or triangulate) an unknown sound source, such as gun shots [78], wildlife [77], and mobile devices [85]. ...
Thesis
Full-text available
Propagation delay refers to the length of time it takes for a signal to travel from point A to point B. Many existing systems, including Global Positioning System (GPS) localization, vehicular imaging, and microphone array beamforming, have taken advantage of propagation delay. This dissertation revisits different properties of propagation delay to enable new acoustic techniques and applications. For instance: (1) We leverage the propagation delay difference between two very different frequencies-radio frequency (RF), and acoustics-to improve active noise cancellation. By "piggybacking" sound over RF, our proposed system is able to compute anti-noise signals more precisely , and ultimately attain better cancellation performance. (2) We develop solutions that exploit the propagation delays of multipath echoes to localize an indoor human speaker. By aligning the arrivals of the voice signal at different times, we compute user location within an optimization framework, serving as a valuable context for smart voice assistants like Amazon Echo and Google Home. (3) We design 3D directional sound by actively synthesizing different propagation delays at two ears using earphones. We develop algorithms that accurately track the 3D orientation of the head, a key enabler for designing 3D acoustics. In general, this dissertation shows that while propagation delay has been studied for a long time and for many applications , there is still opportunity for new techniques and systems, by carefully looking at different properties of the propagation delay, across frequencies, time, and space.
... However, the well-known outdoor positioning system, GPS, is not useful for guaranteeing the accuracy of indoor positioning because of the shielding effect of the positioning signal transmission of satellites, i.e., the positioning ability of GPS is weak in regards to indoor environments such as shopping malls, hypermarkets, office building, etc. A variety of theoretical and available technologies have been proposed in recent decades for achieving the requirements of the indoor positioning [2][3][4][5], based on methods using infrared rays (IR), ultrasound, radio-frequency identification (RFID), wireless local area networks (WLAN), Bluetooth, audible sound, and other technologies. Nevertheless, not all of the above-mentioned methods can be applied to mobile communication systems due to complexity of implementation and the costs of the hardware and software; hence, a new design which is suitable for performing indoor positioning, based on the integration of a single microphone of the mobile communication system and one positioning estimation design. ...
Article
Full-text available
An indoor positioning design developed for mobile phones by integrating a single microphone sensor, an H2 estimator, and tagged sound sources, all with distinct frequencies, is proposed in this investigation. From existing practical experiments, the results summarize a key point for achieving a satisfactory indoor positioning: The estimation accuracy of the instantaneous sound pressure level (SPL) that is inevitably affected by random variations of environmental corruptions dominates the indoor positioning performance. Following this guideline, the proposed H2 estimation design, accompanied by a sound pressure level model, is developed for effectively mitigating the influences of received signal strength (RSS) variations caused by reverberation, reflection, refraction, etc. From the simulation results and practical tests, the proposed design delivers a highly promising indoor positioning performance: an average positioning RMS error of 0.75 m can be obtained, even under the effects of heavy environmental corruptions.
... The microphone array and distributed microphone arrays have been widely used for sound localization. These works mainly target a sound source emitting pre-designed signals [29][30][31]. The human voice is generally unknown to microphones, which brings about challenges for localization. ...
... For example, some companies leverage acoustic sensing to infer the user's location [11,39]. The research community also pays close attention to this trend and proposes many innovative voice localization technologies [10,12,14,30,38,47]. Knowing a user's location helps to narrow down the possible set of voice commands and provide customized services to users. ...
Article
Full-text available
Voice interaction is friendly and convenient for users. Smart devices such as Amazon Echo allow users to interact with them by voice commands and become increasingly popular in our daily life. In recent years, research works focus on using the microphone array built in smart devices to localize the user's position, which adds additional context information to voice commands. In contrast, few works explore the user's head orientation, which also contains useful context information. For example, when a user says, "turn on the light", the head orientation could infer which light the user is referring to. Existing model-based works require a large number of microphone arrays to form an array network, while machine learning-based approaches need laborious data collection and training workload. The high deployment/usage cost of these methods is unfriendly to users. In this paper, we propose HOE, a model-based system that enables Head Orientation Estimation for smart devices with only two microphone arrays, which requires a lower training overhead than previous approaches. HOE first estimates the user's head orientation candidates by measuring the voice energy radiation pattern. Then, the voice frequency radiation pattern is leveraged to obtain the final result. Real-world experiments are conducted, and the results show that HOE can achieve a median estimation error of 23 degrees. To the best of our knowledge, HOE is the first model-based attempt to estimate the head orientation by only two microphone arrays without the arduous data training overhead.
... Advanced light sensors and projected location sequences provide sub-room accuracy [28]. Similarly, audio has also proven effective across a variety of approaches including audio fingerprinting rooms [1,2], proximity beacons, 3 and techniques that leverage microphone/speaker arrays for angle-based 3D geometric localization [29,30]. While effective, the density requirements make these approaches expensive to deploy. ...
Article
Full-text available
An important capability of most smart, Internet-of-Things-enabled spaces (e.g., office, home, hospital, factory) is the ability to leverage context of use. Location is a key context element, particularly indoor location. Recent advances in radio ranging technologies, such as Wi-Fi RTT, promise the availability of low-cost, near-ubiquitous time-of-flight-based ranging estimates. In this paper, we build on prior work to enhance this ranging technology’s ability to provide useful location estimates. For further improvements, we model user motion behavior to estimate the user motion state by taking the temporal measurements available from time-of-flight ranging. We select the velocity parameter of a particle-filter-based on this motion state. We demonstrate meaningful improvements in coordinate-based estimation accuracy and substantial increases in room-level estimation accuracy. Furthermore, insights gained in our real-world deployment provide important implications for future Internet-of-Things applications and their supporting technology deployments such as social interaction, workflow management, inventory control, or healthcare information tools.
... A 2011 study [10] showed that a mobile phone can be located indoors using an ambient sound fingerprint when Wi-Fi infrastructure is unavailable. In addition, a system presented in 2014 [11] uses triangulation with a configuration of microphone arrays to determine the location of mobile phones. Unlike the earlier work, instead of detecting unstable background sound, the latter describes how the room's acoustical properties affect generated audio signals, achieving a low distance error of approximately 3.2 feet on average. ...
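The angle-based triangulation mentioned in this excerpt can be illustrated with a short geometric sketch (our own, with assumed names and 2D geometry; not the system's code): each microphone array reports a bearing to the phone, and the phone lies at the intersection of the two bearing rays.

    import numpy as np

    def triangulate(p1, theta1, p2, theta2):
        # Intersect rays from array positions p1 and p2 along bearings
        # theta1 and theta2 (radians): solve p1 + t1*d1 = p2 + t2*d2.
        d1 = np.array([np.cos(theta1), np.sin(theta1)])
        d2 = np.array([np.cos(theta2), np.sin(theta2)])
        A = np.column_stack((d1, -d2))
        t = np.linalg.solve(A, np.asarray(p2, float) - np.asarray(p1, float))
        return np.asarray(p1, float) + t[0] * d1

    # Arrays 10 ft apart, each hearing the phone at 45 degrees off the
    # baseline: the phone sits at (5, 5).
    print(triangulate((0, 0), np.radians(45), (10, 0), np.radians(135)))

With more than two arrays, a least-squares intersection of all bearing rays would replace this exact two-ray solve.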
Thesis
Full-text available
This thesis presents an indoor positioning solution for custom locations, aimed at finding and searching for items relevant to a given store or mart, that combines multiple inputs from BLE beacons, Wi-Fi signal strengths, and other smartphone sensors such as the magnetic field and ambient noise level. The solution addresses the accuracy problem and the time lag between the actual position and the estimated position at custom indoor location points, and predicts the correct position using machine-learning multi-class classification.
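A minimal sketch of the multi-class classification step described above (synthetic data and an assumed feature layout; not the thesis code): each feature row concatenates BLE and Wi-Fi signal strengths with magnetic-field and noise readings, and the class label is a predefined indoor location point.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 7))     # e.g., 3 BLE RSSI, 2 Wi-Fi RSSI, magnetic, noise
    y = rng.integers(0, 4, size=300)  # 4 custom location points (synthetic labels)

    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(clf.predict(X[:1]))         # predicted location point for a new scan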
Article
The plethora of sensors in our commodity devices provides a rich substrate for sensor-fused tracking. Yet today's solutions are unable to deliver robust, high tracking accuracy across multiple agents in practical, everyday environments, a feature central to the future of immersive and collaborative applications. This can be attributed to the limited scope of diversity leveraged by these fusion solutions, preventing them from catering to the multiple dimensions of accuracy, robustness (diverse environmental conditions), and scalability (multiple agents) simultaneously. In this work, we take an important step towards this goal by introducing the notion of dual-layer diversity to the problem of sensor fusion in multi-agent tracking. We demonstrate that the fusion of complementary tracking modalities, passive/relative (e.g., visual odometry) and active/absolute (e.g., infrastructure-assisted RF localization), offers a key first layer of diversity that brings scalability, while the second layer of diversity lies in the methodology of fusion, where we bring together the complementary strengths of algorithmic (for robustness) and data-driven (for accuracy) approaches. ROVAR is an embodiment of such a dual-layer diversity approach that intelligently attends to cross-modal information using algorithmic and data-driven techniques that jointly share the burden of accurately tracking multiple agents in the wild. Extensive evaluations reveal ROVAR's multi-dimensional benefits in terms of tracking accuracy, scalability, and robustness in enabling practical multi-agent immersive applications in everyday environments.
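A toy complementary-fusion sketch of the passive/relative plus active/absolute layering described above (assumed weights and names; not ROVAR itself): dead-reckon with visual-odometry deltas and pull the estimate toward each absolute RF fix when one arrives.

    def fuse(pose, vo_delta, rf_fix=None, alpha=0.3):
        # pose, vo_delta, rf_fix are (x, y) tuples; alpha weights the RF fix.
        x, y = pose[0] + vo_delta[0], pose[1] + vo_delta[1]  # relative update
        if rf_fix is not None:  # absolute correction when RF localization fires
            x = (1 - alpha) * x + alpha * rf_fix[0]
            y = (1 - alpha) * y + alpha * rf_fix[1]
        return (x, y)

    pose = (0.0, 0.0)
    pose = fuse(pose, (0.9, 0.1))                     # odometry only: drifts freely
    pose = fuse(pose, (1.0, 0.0), rf_fix=(2.1, 0.0))  # RF fix reins the drift in
    print(pose)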
Article
Full-text available
The major challenge for accurate fingerprint-based indoor localization is the design of robust and discriminative wireless signatures. Even though WiFi RSSI signatures are widely available indoors, they vary significantly over time and are susceptible to human presence, multipath, and fading due to the high operating frequency. To overcome these limitations, we propose to use FM broadcast radio signals for robust indoor fingerprinting. Because of their lower frequency, FM signals are less susceptible to human presence, multipath, and fading; they exhibit exceptional indoor penetration; and, according to our experimental study, they vary less over time than WiFi signals. In this work, we demonstrate through a detailed experimental study in 3 different buildings across the US that FM radio RSSI values can achieve room-level indoor localization with accuracy similar to or better than that achieved by WiFi signals. Furthermore, we propose to use additional physical-layer signal quality indicators (e.g., SNR and multipath) to augment the wireless signature, and show that localization accuracy can be further improved by more than 5%. More importantly, we experimentally demonstrate that the localization errors of FM and WiFi signals are independent. When FM and WiFi signals are combined to generate wireless fingerprints, the localization accuracy increases by as much as 83% (when accounting for wireless signal temporal variations) compared to when WiFi RSSI alone is used as a signature.
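A minimal fingerprinting sketch of the FM-plus-WiFi combination (illustrative only; the data layout and names are assumptions): concatenate the FM and WiFi RSSI vectors into one signature and classify a new measurement by its nearest neighbor in that joint space.

    import numpy as np

    def locate(fm_rssi, wifi_rssi, database):
        # database maps a room label to its stored (fm_vector, wifi_vector) pair.
        query = np.concatenate([fm_rssi, wifi_rssi])
        return min(
            database,
            key=lambda room: np.linalg.norm(query - np.concatenate(database[room])),
        )

    rooms = {"kitchen": ([-60.0, -72.0], [-40.0, -55.0]),
             "office":  ([-55.0, -80.0], [-52.0, -48.0])}
    print(locate([-61.0, -71.0], [-41.0, -56.0], rooms))  # -> kitchen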
Conference Paper
Full-text available
In this paper we describe a novel algorithm for post-processing a microphone array's beamformer output to achieve better spatial filtering under noise and reverberation. For each audio frame and frequency bin, the algorithm estimates the spatial probability of sound source presence and applies a spatio-temporal filter towards the look-up direction. It is implemented as a real-time post-processor after a time-invariant beamformer and substantially improves the directivity of the microphone array. The algorithm is CPU-efficient and adapts quickly when the listening direction changes. It was evaluated with a linear four-element microphone array. The directivity index improvement is up to 8 dB, and the suppression of a jammer 40° from the sound source is up to 17 dB.
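A schematic sketch of the per-bin spatial filtering idea (our two-channel simplification with assumed names and constants; not the paper's algorithm): for each frequency bin of a frame, estimate how well the observed inter-channel delay agrees with the look-up direction and scale the bin by that soft probability.

    import numpy as np

    def post_filter(fft_a, fft_b, freqs_hz, mic_spacing_m, look_delay_s,
                    c=343.0, kappa=20.0):
        # Observed inter-channel delay per bin, from the cross-spectrum phase.
        phase_diff = np.angle(fft_a * np.conj(fft_b))
        observed_delay = phase_diff / (2 * np.pi * np.maximum(freqs_hz, 1.0))
        # Soft spatial probability: near 1 when the observed delay matches the
        # look-up direction's delay, falling toward 0 as they disagree.
        mismatch = (observed_delay - look_delay_s) * c / mic_spacing_m
        prob = np.exp(-kappa * mismatch ** 2)
        return prob * fft_a  # spatially filtered beamformer output spectrum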
Article
This paper identifies the possibility of using the electronic compasses and accelerometers in mobile phones as a simple and scalable method of localization. The idea is not fundamentally different from ship or air navigation systems, known for centuries. Nonetheless, directly applying the idea to human-scale environments is non-trivial: noisy phone sensors and complicated human movements present practical research challenges. We cope with these challenges by recording a person's walking patterns and matching them against possible path signatures generated from a local electronic map. Electronic maps enable greater coverage while eliminating the reliance on WiFi infrastructure and expensive war-driving. Measurements on Nokia phones and evaluation with real users confirm the anticipated benefits. Results show a location accuracy of less than 12 m in regions where today's localization services are unsatisfactory or unavailable.
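A minimal dead-reckoning sketch of the compass-and-accelerometer idea (assumed step length and names; not the paper's implementation): accumulate one stride along the compass heading per detected step, then score the traced path against candidate path signatures from the map.

    import math

    STEP_LENGTH_M = 0.7  # assumed average stride

    def trace_path(headings_deg):
        # One compass heading per detected step -> list of (x, y) positions.
        x, y, path = 0.0, 0.0, [(0.0, 0.0)]
        for h in headings_deg:
            x += STEP_LENGTH_M * math.sin(math.radians(h))  # east component
            y += STEP_LENGTH_M * math.cos(math.radians(h))  # north component
            path.append((x, y))
        return path

    def path_distance(path_a, path_b):
        # Mean pointwise distance between two equally sampled paths; the map
        # path with the smallest distance is the best location hypothesis.
        n = min(len(path_a), len(path_b))
        return sum(math.dist(p, q) for p, q in zip(path_a, path_b)) / n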
Article
The proliferation of mobile computing devices and local-area wireless networks has fostered a growing interest in location-aware systems and services. In this paper we present RADAR, a radio-frequency (RF) based system for locating and tracking users inside buildings. RADAR operates by recording and processing signal strength information at multiple base stations positioned to provide overlapping coverage in the area of interest. It combines empirical measurements with signal propagation modeling to determine user location and thereby enable location-aware services and applications. We present experimental results that demonstrate the ability of RADAR to estimate user location with a high degree of accuracy.
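The nearest-neighbor-in-signal-space matching at the heart of this description can be sketched as follows (illustrative; k and the data layout are assumptions): each calibration point stores the signal strengths seen from the base stations, and a live measurement is located at the average of the k closest calibration points in signal space.

    import numpy as np

    def nnss(measurement, radio_map, k=3):
        # radio_map: list of ((x, y), rssi_vector) calibration entries.
        ranked = sorted(
            radio_map,
            key=lambda e: np.linalg.norm(np.asarray(measurement) - np.asarray(e[1])),
        )
        points = np.array([entry[0] for entry in ranked[:k]])
        return points.mean(axis=0)  # averaged coordinates of the k best matches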
Article
Traditional acoustic source localization uses a two-step procedure requiring intermediate time-delay estimates from pairs of microphones. An alternative single-step approach is proposed in this paper in which particle filtering is used to estimate the source location through steered beamforming. This scheme is especially attractive in speech enhancement applications, where the localization estimates are typically used to steer a beamformer at a later stage. Simulation results show that the algorithm is robust to reverberation, and is able to accurately follow the source trajectory.
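A schematic sketch of the single-step approach (illustrative assumptions throughout, including the STFT-domain steering and numpy-array inputs): instead of first estimating time delays, each particle, a candidate source position, is weighted by the delay-and-sum beamformer power obtained when the array is steered at that position.

    import numpy as np

    def srp_weight(particle, mic_positions, frame_ffts, freqs_hz, c=343.0):
        # Delay-and-sum power at `particle` for one STFT frame of all mics:
        # undo each mic's propagation delay in the frequency domain and sum.
        steered = np.zeros_like(frame_ffts[0])
        for mic_pos, fft in zip(mic_positions, frame_ffts):
            tau = np.linalg.norm(np.asarray(particle) - np.asarray(mic_pos)) / c
            steered += fft * np.exp(2j * np.pi * freqs_hz * tau)
        return float(np.sum(np.abs(steered) ** 2))

    def reweight(particles, weights, mic_positions, frame_ffts, freqs_hz):
        # One measurement update: scale each particle's weight by the steered
        # response power at its position, then renormalize.
        w = np.array([srp_weight(p, mic_positions, frame_ffts, freqs_hz)
                      for p in particles]) * np.asarray(weights)
        return w / w.sum()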