Real-World Sound Recognition:
A Recipe
Tjeerd C. Andringa and Maria E. Niessen
Department of Artificial Intelligence, University of Groningen,
Gr. Kruisstr. 2/1, 9712 TS Groningen, The Netherlands
{T.Andringa,M.Niessen}@ai.rug.nl
http://www.ai.rug.nl/research/acg/
Abstract. This article addresses the problem of recognizing acoustic
events present in unconstrained input. We propose a novel approach
to the processing of audio data which combines bottom-up hypothesis
generation with top-down expectations, which, unlike standard pattern
recognition techniques, can ensure that the representation of the input
sound is physically realizable. Our approach gradually enriches low-level
signal descriptors, based on Continuity Preserving Signal Processing,
with more and more abstract interpretations along a hierarchy of descrip-
tion levels. This process is guided by top-down knowledge which provides
context and ensures an interpretation consistent with the knowledge of
the system. This leads to a system that can detect and recognize spe-
cific events for which the evidence is present in unconstrained real-world
sounds.
Key words: real-world sounds, sound recognition, event perception,
ecological acoustics, multimedia access, Continuity Preserving Signal Pro-
cessing, Computational Auditory Scene Analysis.
1 Introduction
Suppose you claim you have made a system that can automatically recognize the
sounds of passing cars and planes, making coffee, and verbal aggression, while
ignoring sounds like speech. People might react enthusiastically and ask you for
a demonstration. How impressive would it be if you could start up a program on
your laptop, and the system recognizes the coffee machine the moment it starts
percolating? You might then bring your laptop outside where it starts counting
passing cars and it even detects the plane that takes off on a nearby airfield.
Meanwhile you tell your audience that the same system is currently alerting
security personnel whenever verbal aggression occurs at public places like train
stations and city centers. Your audience might be impressed, but will they be
as impressed if you show a system that works with sound recorded in studio
conditions but that does not work in the current environment? Or that only
works in the absence of other sound sources, or only when the training database
is sufficiently similar to the current situation, which it, unfortunately, is not?
Apart from the coffee machine detector, detectors similar to those described
above are actually deployed in the Netherlands by a spin-off of the University
of Groningen called Sound Intelligence [16]. But although these systems reliably
monitor the activities in a complex and uncontrollable acoustic environment,
they require some optimization to their environment and cannot easily be ex-
tended to recognize and detect a much wider range of acoustic events or to reason
about their meaning.
The one system that does have the ability to function reliably in all kinds of
dynamic environments is, of course, the human and animal auditory system. The
main attribute of the natural auditory system is that it deals with unconstrained
input and helps us to understand the (often unexpected) events that occur in
our environment. This entails that it can determine the cause of complex, un-
controllable, in part unknown, and variable input. In the sound domain we can
call this real-world sound recognition.
A real-world sound recognition system does not put constraints on its input:
it recognizes any sound source, in any environment, if and only if the source is
present. As such it can be contrasted to popular sound recognition approaches,
which function reliably only with input from a limited task domain in a spe-
cific acoustic environment. This leads to systems for specific tasks in specific
environments. For example, a study by Defréville et al. describes the automatic
recognition of urban sound sources from a database with real-world recordings
[6]. However, the goals of our investigation do not allow for the implicit assump-
tion made in Defréville's study, namely that at least one sound source on which
the system is trained is present in each test sample.
Apart from being able to recognize unconstrained input, a real-world sound
recognition system must also be able to explain the causes of sounds in terms of
the activities in the environment. In the case of making coffee, the system must
assign the sounds of filling a carafe, placing the carafe in the coffee machine, and
a few minutes later the sound of hot water percolating through the machine, to
the single activity of making coffee.
The systems of Sound Intelligence approach the ideal of real-world sound
recognition by placing minimal constraints on the input. These systems rely on
Continuity Preserving Signal Processing, a form of signal processing, developed
in part by Sound Intelligence, which is designed to track the physical develop-
ment of sound sources as faithfully as possible. We suspect that the disregard of
a faithful rendering of physical information in, for example, Short Term Fourier
Transform based methods limits traditional sound processing systems to appli-
cations in which a priori knowledge about the input is required for reasonable
recognition results.
The purpose of our research is the development of a sound processing system
that can determine the cause of a sound from unconstrained real-world input
in a way that is functionally similar to natural sound processing systems. This
paper starts with the scientific origins of our proposal. From this we derive a
research paradigm that describes the way we want to contribute to the further
development of a comprehensive theory of sound processing on the one hand,
and the development of an informative and reliable sound recognition system
on the other hand. From this we spell out a proposal for a suitable architecture
for such a system, illustrated by the example of a coffee making process. Finally we will
discuss how our approach relates to other approaches.
2 Real-World Sound Recognition
2.1 Event Perception
Ecological psychoacoustics (see for example [10, 9, 18, 3]) provides part of the
basis of our approach. Instead of the traditional focus on musical or analytical
listening, in which sensations such as pitch and loudness are coupled to phys-
ical properties such as frequency and amplitude in controlled experiments (for
a short historical overview, see Neuhoff [13]), ecological psychoacousticians in-
vestigate everyday listening, introduced by Gaver [9]. Everyday or descriptive
listening refers to a description of the sounds in terms of the processes or events
that produced them. For example, we do not hear a noisy harmonic complex in
combination with a burst of noise, instead we hear a passing car. Likewise we
do not hear a double pulse with prominent energy around 2.4 and 6 kHz, but
we hear a closing door. William Gaver concludes:
“Taking an ecological approach implies analyses of the mechanical physics
of source events, the acoustics describing the propagation of sound through
an environment, and the properties of the auditory system that enable us
to pick up such information. The result of such analyses will be a char-
acterization of acoustic information about sources, environments, and
locations which can be empirically verified.” ([9], p. 8)
Gaver links source physics to event perception, and since natural environ-
ments are, of course, always physically realizable, we insist on physical realizabil-
ity within our model. This entails that we aim to limit all signal interpretations
to those which might actually describe a real-world situation and as such do not
violate the physical laws that shape reality.
Physical realizability, as defined above, is not an issue in current low-level
descriptors [5], which focus on mathematically convenient manipulations. The
solution space associated with these descriptors may contain a majority of phys-
ically impossible, and therefore certainly incorrect, signal descriptions. In fact,
given these descriptors, there is no guarantee that the best interpretation is phys-
ically possible (and therefore potentially correct). For example, current speech
recognition systems are designed to function well under specific conditions (such
as limited background noise and adapted to one speaker). However, when these
systems are exposed to non-speech data (such as instrumental music) that hap-
pen to generate an interpretation as speech with a sufficiently high likelihood,
the system will produce nonsense.
2.2 Continuity Preserving Signal Processing
We try to ensure physical realizability of the signal interpretation by apply-
ing Continuity Preserving Signal Processing (CPSP, [2, 1]). Compared to other
signal processing approaches CPSP has not been developed from a viewpoint of
mathematical elegance or numerical convenience. Instead it is a signal processing
framework designed to track the physical development of a sound source through
the identification of signal components as the smallest coherent units of physical
information. Signal components are defined as physically coherent regions of the
time-frequency plane delimited by qualitative changes (such as on- and offsets
or discrete steps in frequency). Although CPSP is still in development, indica-
tions are that it is often, and even usually, possible to form signal component
patterns that have a very high probability of representing physically coherent in-
formation of a single source or process. This is especially true for pulse-like and
sinusoidal components, such as individual harmonics, for which reliable estima-
tion techniques have been developed [2]. For example, estimated sinusoidal signal
components (like in voiced speech) signify the presence of a sound source which
is able to produce signal components with a specific temporal development of
energy and frequency content. This analysis excludes a multitude of other sound
sources, like doors, which are not able to produce signal components with these
properties; the system is thus one step closer to a correct interpretation.
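For concreteness, a signal component could be represented by a small data structure like the following sketch (Python; the field names and types are our own illustration, not part of CPSP):

```python
from dataclasses import dataclass

@dataclass
class SignalComponent:
    """Illustrative container for a signal component: a physically coherent
    region of the time-frequency plane delimited by qualitative changes
    (on-/offsets, discrete frequency steps). Fields are our assumptions."""
    kind: str      # e.g. "pulse", "sinusoid", "narrowband noise"
    t_on: float    # onset time (s)
    t_off: float   # offset time (s)
    f_lo: float    # lower frequency bound (Hz)
    f_hi: float    # upper frequency bound (Hz)
    energy: float  # integrated energy of the region
```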
CPSP is a form of Computational Auditory Scene Analysis (CASA). Modern
CASA approaches typically aim to identify a subset of the time-frequency plane,
called a mask, where a certain target sound dominates [17, 4]. The advantage of
such a mask is that it can be used to identify the evidence which should be
presented to a subsequent recognition phase. One major disadvantage is that
it requires a hard decision of what is target and what not before the signal is
recognized as a certain instance of a target class. In contrast, CPSP does not
aim to form masks, but aims to identify patterns of evidence of physical pro-
cesses that can be attributed to specific events or activities. CPSP is based on
the assumption that sound sources are characterized by their physical proper-
ties, which in turn determine how they distribute energy in the time-frequency
plane E(f,t). Furthermore it assumes that the spatio-temporal continuity of the
basilar membrane (BM) in the mammalian cochlea, where position corresponds
to frequency, is used by the auditory system to track the development of phys-
ically coherent signal components. Typical examples of signal components are
pulses, clicks, and bursts, or sinusoids, narrowband noises, wavelets, and broad-
band noises. A sinusoid or chirp may exist for a long time, but is limited to a
certain frequency range. In contrast pulses are short, but they span a wide range
of frequencies. Noises are broad in frequency, persist for some time, and show a
fine structure that does not repeat itself.
CPSP makes use of signal components because they have several character-
istics which are useful for automatic sound recognition: first, the low probability
that the signal component consists of qualitatively different signal contributions
(for instance periodic versus aperiodic); second, the low probability that the
signal component can be extended to include a larger region of the time-frequency
plane without violating the first condition; and third, the low probability that
the whole region stems from two or more uncorrelated processes. Together these
properties help to ensure a safe and
correct application of quasi-stationarity and as such a proper approximation of
the corresponding physical development and the associated physical realizability.
Signal components can be estimated from a model of the mammalian cochlea
[7]. This model can be interpreted as a bank of coupled (and therefore overlap-
ping) bandpass-filters with a roughly logarithmic relation between place and
center-frequency. Unlike the Short Term Fourier Transform, the cochlea has no
preference for specific frequencies or intervals: A smooth development of a source
will result in a smooth development on the cochlea. A quadratic, energy-domain
rendering of the distribution of spectro-temporal energy, like a spectrogram, re-
sults from leaky integration of the squared BM excitation with a suitable time
constant. This representation is called a cochleogram and it provides a rendering
of the time-frequency plane E(f,t) without biases toward special frequencies or
intervals ([2], chapter 2). Although not a defining characteristic, signal components
often correspond to the smallest perceptual units: it is usually possible
to hear out individual signal components, and in the case of complex patterns,
such as speech, they stand out when they do not comply with the pattern.
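As a rough illustration of this energy-domain rendering, the sketch below computes a crude cochleogram-like representation: a bank of log-spaced bandpass filters stands in for the coupled cochlear model of [7], and the squared filter outputs are leaky-integrated with a suitable time constant. All parameter choices are our own; this is a sketch, not the CPSP implementation.

```python
import numpy as np
from scipy.signal import butter, lfilter

def cochleogram(x, fs, n_channels=64, f_lo=30.0, f_hi=7800.0, tau=0.01):
    """Crude cochleogram-like rendering E(f, t): log-spaced bandpass
    filterbank, squared outputs, leaky integration with time constant tau."""
    cfs = np.geomspace(f_lo, f_hi, n_channels)   # log-spaced center frequencies
    alpha = np.exp(-1.0 / (tau * fs))            # pole of the leaky integrator
    E = np.empty((n_channels, len(x)))
    for i, cf in enumerate(cfs):
        bw = 0.2 * cf                            # roughly constant-Q bandwidth
        lo, hi = max(cf - bw, 1.0), min(cf + bw, 0.49 * fs)
        b, a = butter(2, [lo, hi], btype="band", fs=fs)
        y = lfilter(b, a, x) ** 2                # energy domain
        E[i] = lfilter([1 - alpha], [1, -alpha], y)  # leaky integration
    return cfs, E

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                  # a 440 Hz test tone
cfs, E = cochleogram(x, fs)
print(cfs[np.argmax(E[:, -1])])                  # the channel near 440 Hz dominates
```

Because the center frequencies are spaced logarithmically, a smooth glide of the test tone would produce a correspondingly smooth trajectory across channels, without the interval bias of a fixed Fourier grid.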
2.3 An Example: Making Coffee
Figure 1 shows a number of cochleograms derived from a recording of the coffee
making process in a cafeteria.¹ The upper three cochleograms correspond to
filling the machine with water, positioning the carafe in the machine, and the
percolation process, respectively. The lower row shows enlargements of the last
two processes. The positioning of the carafe in the machine (a) results in a
number of contacts between metal and glass. The resulting pulses have significant
internal structure, do not repeat themselves, and reduce gradually in intensity.
Some strongly dampened resonances occur at different frequencies, but do not
show a discernible harmonic pattern. The whole pattern is consistent with hard
(stiff) objects that hit each other a few times in a physical setting with strong
damping. This description conveys evidence of physical processes which helps
to reduce the number of possible, physically meaningful, interpretations of the
signal energy. Note that signal components function as interpretation hypotheses
for subsets of the signal. As hypotheses, they may vary in reliability. A context
in which signal components predict each other, for instance through a repetition
of similar components, enhances the reliability of interpretation hypotheses of
individual signal components. In this case the presence of one pulse predicts the
presence of other more or less similar pulses.
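The way such repetition can enhance reliability may be illustrated with a toy computation (our own sketch, not the actual system): detect pulse-like peaks in the summed cochleogram, and let every pulse hypothesis gain support from other pulses with a similar spectral profile.

```python
import numpy as np
from scipy.signal import find_peaks

def pulse_hypotheses(E, frame_rate, min_separation=0.05, sim_threshold=0.8):
    """Toy mutual-prediction score for pulse hypotheses in a cochleogram
    E (channels x frames). A pulse that 'predicts' similar pulses elsewhere
    is considered a more reliable hypothesis."""
    envelope = E.sum(axis=0)                     # summed energy over channels
    peaks, _ = find_peaks(envelope,
                          distance=max(1, int(min_separation * frame_rate)),
                          prominence=envelope.std())
    profiles = [E[:, p] / (np.linalg.norm(E[:, p]) + 1e-12) for p in peaks]
    scored = []
    for i, prof in enumerate(profiles):
        similar = [float(prof @ other)
                   for j, other in enumerate(profiles) if j != i]
        boost = sum(s for s in similar if s > sim_threshold)
        scored.append((peaks[i], 1.0 + boost))   # base reliability plus support
    return scored
```

Applied to an E(f,t) rendering like the one sketched in the previous section, the repeated carafe contacts would mutually raise each other's reliability.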
The lower right cochleogram (b) corresponds to a time 6 minutes after the
start where the water in the coffee machine has heated to the boiling point
and where it starts to percolate through the coffee. This phase is characterized
¹ The sounds can be found on our website, http://www.ai.rug.nl/research/acg/research.html
[Figure 1 image: cochleogram panels "filling coffee machine", "coffee carafe" (a), and "coffee machine percolating" (b); axes: frequency (Hz, roughly 29-7740) versus time (s).]
Fig. 1. Cochleograms of different sound events involved in making coffee. The top
three cochleograms are of the sound of, from left to right, the filling of the water
compartment of a coffee machine, the placing of the coffee carafe, and the coffee machine
percolating, respectively. These three cochleograms have the same energy range, as
shown by the right color bar. The bottom two cochleograms are enlargements of the
two top right cochleograms. (Note that the energy ranges and time scales of the bottom
two cochleograms do not correspond to each other.)
by small droplets of boiling water and steam emerging from the heating sys-
tem. This produces a broadband signal between 400 and 800 Hz with irregular
amplitude modulation and, superimposed, many non-repeating wavelet-like com-
ponents that probably correspond to individual drops of steam-driven boiling
water that emerge from the machine. Again it is possible to link signal details
to a high level description of a complex event like percolating water.
3 Research Paradigm
We propose a novel approach to real-world sound recognition which combines
bottom-up audio processing, based on CPSP, with top-down knowledge. The top-
down knowledge provides context for the sound event and guides interpretations
of the hypotheses. Succeeding description levels in a hierarchy gradually include
more semantic content. Low levels in the hierarchy represent much of the details
of the signal and are minimally interpreted. High levels represent minimal signal
detail, but correspond to a specific semantic interpretation of the input signal
which is consistent with the system's knowledge of its environment and its current
input. Before we present the model for real-world sound recognition, we first
propose a paradigm, comprising seven focus points, to guide the development
of the model, inspired by considerations from the previous section:
Real-world sound recognition The model should recognize unconstrained
input, which means there is no a priori knowledge about the input signal
such as the environment it stems from. The system will use knowledge, but
this knowledge is typically of the way we expect events to develop through
time and frequency, used a posteriori to form hypotheses. Besides this, the
knowledge can also be about the context, for example an interpretation of the
past which may need updating, or about the environment, which may have
changed as well. Matching top-down knowledge with bottom-up processing
helps to generate plausible interpretation hypotheses. However, top-down
knowledge does not pose restrictions on the bottom-up processing and the
generation of hypotheses; it will only influence activation and selection of
hypotheses.
Domain independent Techniques that are task or environment specific are to
be avoided, because every new task or environment requires the development
of a new recognizer. In particular, optimizing on closed tasks or closed and
constrained environments should be avoided. This excludes many standard
pattern recognition techniques such as neural networks and Hidden Markov
Models (HMMs), based on Bayesian decision theory (see for example Knill
and Young [11]). Note that specifying a general approach toward a specific
task is allowed, but including a task specific solution in the general system
is not.
Start from the natural exemplar Initially the model should remain close to
approaches known to work in human and animal auditory systems. The prob-
lem in this respect is the multitude of fragmented and often incompatible
knowledge about perception and cognition, which (by scientific necessity)
is also based on closed domain research in the form of controlled experi-
ments. Nevertheless domains such as psycholinguistics have reached many
conclusions we might use as inspiration.
Physical optimality Because the auditory systems of different organisms are
instances of a more general approach to sound recognition, we may try to gen-
eralize these instances toward physical optimality or physical convenience.
We might for example ignore physiological non-linearities which may be nec-
essary only to squeeze the signal into the limited dynamic range of the
natural system. Physical optimality rules out the application of standard
frame-based methods in which quasi-stationarity, with a period equaling the
frame size, is applied to an as yet unknown mixture of sound sources. Physi-
cally, the approximation of a sound source as a number of different discrete
steps is only justified with a suitable (inertia dependent) time-constant. For
closed domains with clean speech this might be guaranteed, but for open
domains quasi-stationarity can only be applied on signal evidence which,
firstly, is likely to stem from a single process, and secondly, is known to
develop slowly enough for a certain discrete resampling. Signal components
comply with these requirements, while in general the result of frame blocking
does not ([2], chapter 1).
Physical realizability The model should at all times only consider those so-
lutions which are physically plausible. This is important because we want to
link the physics of the source of the sound to the signal, which only makes
sense if we actually consider only plausible sound events as hypotheses. An
extra advantage of this rule is the reduction of the solution space, since so-
lutions which are not physically plausible will not be generated. Again this
is different from standard pattern recognition techniques, which consider
mathematically possible (that is, most probable) solutions.
Limited local complexity To ensure realizability, the different steps in the
model should neither be too complex nor too large. This can be ensured
through a hierarchy of structure which is not imposed by the designer, but
dictated by the predictive structures in the environment.
Testing The model should not remain in the theoretical domain, but should be
implemented and confronted with the complexities of unconstrained input.
This also means the model should not be optimized for a target domain, but
confronted with input from many other domains as well.
4 Model of Real-World Sound Recognition
Figure 2 captures the research paradigm of the previous section. The system pro-
cesses low-level data into interpretation hypotheses, and it explicates semantically
rich, top-down queries into specific signal component expectations. The top-down
input results from task specific user queries, like “Alert me when the coffee is
ready” or “How many times is the coffee machine used in the cafeteria every
day?”, and contextual knowledge, like the setting, for example a cafeteria, and
the time of day, for example 8:30 am. However, the system recognizes sounds
and updates its context model even without top-down input: The more the sys-
tem learns about an environment (that is, the longer it is processing data in an
environment), the better its context model is able to generate hypotheses about
what to expect and the better and faster it will deal with bottom-up ambiguity.
The bottom-up input is continuous sound input, which can be assigned to signal
components with a delay of less than a few hundred milliseconds (in part de-
pendent on the properties of the signal components). Subsequent levels generate
interpretation hypotheses of each signal component combination which complies
with predictions about the way relevant sound events, such as speech, combine
signal components.² Several levels match top-down expectations with bottom-
up hypotheses. This ensures consistency between the query and the sound (set
of signal components) which is assigned to it.
² The bottom-up hypothesis generation from signal component patterns will be hard-coded in the first implementations of the system, but eventually we want to use machine learning techniques so that the system learns to classify the signal component patterns. A similar approach is chosen for top-down queries and contextual input; we first hard-code the knowledge between sound events and signal component patterns, but eventually we plan to use machine learning techniques here as well.
[Figure 2 diagram: a bottom-up path from the sound signal via signal components and SC pattern hypotheses to sound event hypotheses, matched against a top-down path from query and context via sound event expectations to SC pattern expectations; the output is the best hypothesis (through matching).]
Fig. 2. The sound is analyzed in terms of signal components. Signal component pat-
terns lead to sound event hypotheses. These hypotheses are continuously matched
with top-down expectations based on the current knowledge state. The combination of
bottom-up hypotheses and top-down expectations results in a best interpretation of
the input sound.
When top-down knowledge is suf-
ficiently consistent with the structure of reality (that is, potentially correct) the
resulting description of the sound is physically realizable.
Top-down combinations of query and context lead to expectations of which
instances of sound events might occur and what the current target event is.
Although important, the way queries lead to expectations is not yet addressed in
this coffee-making detection system. The event expectations are translated into
probability distributions of (properties of) the signal component combinations to
expect. For example, knowledge about the type of the coffee machine, for instance
filter or espresso, leads to a reduction of the signal components to expect.
Deriving these expectations through a top-down process ensures that they are
always consistent with the knowledge of the system.
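A minimal sketch of such a translation, with purely illustrative numbers loosely based on the 400-800 Hz percolation band of Section 2.3 (the event names and all values are our assumptions):

```python
import numpy as np

def expected_sc_likelihood(event):
    """Toy translation from an expected sound event to a likelihood over
    the center frequency of the signal components to expect."""
    table = {
        "percolating": (600.0, 150.0),         # broadband 400-800 Hz activity
        "carafe placement": (3000.0, 1500.0),  # damped, structured pulses
    }
    mu, sigma = table.get(event, (1000.0, 2000.0))  # vague default expectation
    return lambda f: float(np.exp(-0.5 * ((f - mu) / sigma) ** 2))

match = expected_sc_likelihood("percolating")
print(match(600.0), match(5000.0))  # high in-band, negligible out-of-band
```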
The continuous matching process between the bottom-up instances and the
top-down probability distributions ensures that bottom-up hypotheses which are
inconsistent with top-down expectations will be deactivated (or flagged irrelevant
for the current task, in which case they might still influence the context model).
"impact sounds"
pulses"
"damped, structured
SC patterns
"pouring water"
event hypothesis
sound classes
"making coffee"
sound classes
"liquid sounds"
SC patterns
broadband sound"
"structured
sound classes
"percolating sounds"
SC patterns
"confused AM
abstraction level
t
im
e
physical world
representational world
coffee can"
event hypothesis
"positioning
at work"
event hypothesis
"coffee machine
broadband noise"
knowledge scheme
Fig. 3. Different stages in the process of recognizing coffee making. The top level is
the highest abstraction level, representing most semantic content, but minimal signal
detail. The bottom level is the lowest abstraction level, representing very little or no
interpretation, but instead reflecting much of the details of the signal. From left to right
the succeeding best hypotheses are shown, generated by the low-level signal analysis,
and matched to high-level expectations, which follow from the knowledge scheme.
Vice versa, signal component combinations which are consistent with top-down
expectations are given priority during bottom-up processing. The consistent hy-
pothesis that fits the query and the context best is called the best hypothesis, and
is selected as output of the system. The activation of active hypotheses changes
with new information: Either new top-down knowledge or additional bottom-up
input changes the best interpretation hypothesis whenever one hypothesis be-
comes stronger than the current best hypothesis. Similar models for hypothesis
selection can be found in psycholinguistics [12, 14]. For example, the Shortlist
model of Norris [14] generates an initial candidate-list on the basis of bottom-up
acoustic input; a best candidate is selected after competition within this set. The
list of candidates is updated whenever there is new information.
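A minimal sketch of such a competition step, loosely in the spirit of Shortlist [14] (the update rule and all numbers are our own toy formulation, not the actual system): bottom-up evidence raises activation, top-down expectations weight that evidence, and hypotheses falling below a floor are deactivated.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    label: str
    activation: float = 0.0
    active: bool = True

def update(hypotheses, bottom_up, top_down, floor=-1.0):
    """One matching step: evidence weighted by expectation accumulates;
    hypotheses flagged inconsistent are deactivated; the best survives."""
    for h in hypotheses:
        if not h.active:
            continue
        evidence = bottom_up.get(h.label, -0.5)   # absence counts against
        expectation = top_down.get(h.label, 0.0)
        h.activation += evidence * (1.0 + expectation)
        if h.activation < floor:                  # flagged inconsistent
            h.active = False
    alive = [h for h in hypotheses if h.active]
    return max(alive, key=lambda h: h.activation) if alive else None

hyps = [Hypothesis("filling glass"), Hypothesis("filling coffee machine")]
best = update(hyps,
              bottom_up={"filling glass": 0.4, "filling coffee machine": 0.4},
              top_down={"filling coffee machine": 0.5})  # cafeteria, 8:30 am
print(best.label)  # context tips the ambiguous liquid sound toward the machine
```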
Let us turn back to the example, the process of making coffee. Figure 3 depicts
the coffee making process at three levels: a sequence of activities, a sequence of
sound events to expect, and a series of signal component combinations to expect.
These three levels correspond to the top-down levels of the system. The lowest
level description must be coupled to the signal components and the cochleogram
(see figure 1) from which they are estimated. We focus on the second event
hypothesis, the sound of a coffee carafe being placed on the hot plate (a).
Previously, the system found some broadband irregularly amplitude modu-
lated signal components with local, not repeating, wavelet-like components su-
perimposed. Particular sound events with these features are liquid sounds [9].
However, the system did not yet have enough evidence to classify the liquid
sound as an indication of the coffee machine being filled; the same evidence
might also correspond to filling a glass. But since the signal component pattern
certainly does not comply with speech, doors, moving chairs, or other events,
the number of possible interpretations of this part of the acoustic energy is lim-
ited, and each interpretation corresponds, via the knowledge of the system, to
expectations about future events. If there are more signal components detected
consistent with the coffee making process, the hypothesis will be better sup-
ported. And if the hypothesis is better supported, it will be more successful in
predicting subsequent events which are part of the process. In this case additional
information becomes available after 6 minutes, when the water in the coffee
machine has heated to the boiling point and starts percolating through the coffee
(b).
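The knowledge scheme of Figure 3 could be caricatured as an ordered sequence of expected sound events whose accumulated support both strengthens the activity hypothesis and generates the next expectation. Again a hypothetical sketch, not the implemented system:

```python
class KnowledgeScheme:
    """Toy activity scheme: an ordered sequence of expected sound events."""

    def __init__(self, name, events):
        self.name = name
        self.events = list(events)
        self.support = {e: 0.0 for e in self.events}

    def observe(self, event, confidence):
        """Register bottom-up evidence for one of the scheme's events."""
        if event in self.support:
            self.support[event] = max(self.support[event], confidence)

    def expectation(self):
        """The next weakly supported event; lower levels would translate
        this into expected signal component properties."""
        for e in self.events:
            if self.support[e] < 0.5:
                return e
        return None

    def total_support(self):
        return sum(self.support.values()) / len(self.events)

coffee = KnowledgeScheme("making coffee",
                         ["pouring water", "positioning carafe", "percolating"])
coffee.observe("pouring water", 0.4)   # ambiguous liquid sound: weak support
print(coffee.expectation())            # -> "pouring water" is still uncertain
coffee.observe("percolating", 0.9)     # 6 minutes later, strong evidence
print(coffee.total_support())          # support for the whole activity grows
```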
5 Discussion
This article started with a description of a sound recognition system that works
always and everywhere. In the previous sections we have sketched an architecture
for real-world sound recognition which is capable of exactly that. Our general
approach is reminiscent of the prediction-driven approach of Ellis [8]. However,
the insistence on physical realizability is an essential novelty, associated with the
use of CPSP.
The physical realizability in our approach ensures that the state of the context
model in combination with the evidence is at least consistent with physical laws
(insofar as these are reflected in the sounds). In the example of the coffee making detection
system only a minimal fraction of the sounds will match the knowledge of the
system and most signal component patterns will be ignored from the moment
they are flagged as inconsistent with the knowledge of the system. This makes
the system both insensitive to irrelevant sounds and specific to the target
event, which helps to ensure its independence of the acoustic environment. A
richer system with more target events functions identically, but more patterns
will match the demands of one or more of the target events. An efficient tree-
like representation of sound source properties can be implemented to prevent a
computational explosion.
Our approach is an intelligent agent approach [15] to CASA. The intelli-
gent agent approach is common within the fields of artificial intelligence and
cognitive science, and generally involves the use of explicit knowledge (often
at different levels of description). The intelligent agent based approach can be
contrasted to the more traditional machine learning based approaches, such as
reviewed in Cowling [5], and other CASA approaches [4, 8, 17]. An intelligent
agent is typically assumed to exist in a complex situation in which it is ex-
posed to an abundance of data, most of which is irrelevant — unlike, for example,
the database in Defréville's study [6], where most data is relevant. Some of the
data may be informative in the sense that it changes the knowledge state of
the agent, which helps to improve the selection of actions. Each executed action
changes the situation and ought to be in some way beneficial to the agent (for
example because it helps the user). An intelligent agent requires an elaborate
and up-to-date model of the current situation to determine the relevance of the
input.
More traditional machine learning and CASA approaches avoid or trivialize
the need for an up-to-date model of the current situation and the selection of
relevant input by assuming that the input stems from an environment with par-
ticular acoustic properties. If the domain is chosen conveniently and if sufficient
training data is available, sophisticated statistical pattern classification tech-
niques (HMM, Neural Networks, etcetera) can deal with features like Mel Fre-
quency Cepstral Coefficients (MFCC) and Linear Predictive Coefficients (LPC),
which represent relevant and irrelevant signal energy equally well. These features
actually make it more difficult to separate relevant from irrelevant input, but
a proper domain choice helps to prevent (or reduce) the adverse effects of this
blurring during classification. CASA approaches [4, 17] rely on the estimation of
spectro-temporal masks that are intended to represent relevant sound sources;
only afterwards is the information in the masks presented to similar statistical
classification systems. These approaches require an evaluation of the input in terms of
relevant/irrelevant before the signal is classified. This entails that class-specific
knowledge cannot be used to optimize the selection of evidence. This limits these
approaches to acoustic domains in which class-specific knowledge is not required
to form masks that contain sufficiently reliable evidence of the target sources.
6 Conclusions
Dealing with real-world sounds requires methods which do not rely on a conve-
niently chosen acoustic domain as traditional CASA methods do. Furthermore,
these traditional methods do not make use of explicit class or even instance specific knowl-
edge to guide recognition, but rely on statistical techniques, which, by their
very nature, blur (that is, average away) differences between events. Our intel-
ligent agent approach can be contrasted to traditional CASA approaches, since
it assumes the agent (that is, the recognition system) is situated in a complex
and changing environment. The combination of bottom-up and top-down pro-
cessing in our approach allows a connection with the real world through event
physics. The insistence on physical realizability ensures that each event specific
representation is both supported by the input and consistent with the system's
knowledge. The reliability of a number of CPSP-based commercial products for
verbal aggression and vehicle detection, which function in real-life and therefore
unconstrained acoustic environments, is promising with respect to a success-
ful implementation of our approach. We have a recipe and a large number of
ingredients. Let us put the pudding to the proof.
Acknowledgments
This work is supported by SenterNovem, Dutch Companion project grant nr:
IS053013.
References
1. Andringa, T.C. et al. (1999), “Method and Apparatuses for Signal Processing”,
International Patent Application WO 01/33547.
2. Andringa, T.C. (2002), “Continuity Preserving Signal Processing”, PhD disserta-
tion, University of Groningen, see http://irs.ub.rug.nl/ppn/237110156.
3. Ballas, J.A. (1993), “Common Factors in the Identification of an Assortment of
Brief Everyday Sounds”, Journal of Experimental Psychology: Human Perception
and Performance 19(2), pp 250-267.
4. Cooke, M., Ellis, D.P.W. (2001), “The Auditory Organization of Speech and Other
Sources in Listeners and Computational Models”, Speech Communication 35, pp
141-177.
5. Cowling, M., Sitte, R. (2003), “Comparison of Techniques for Environmental Sound
Recognition”, Pattern Recognition Letters 24, pp 2895-2907.
6. Defréville, B., Roy, P., Rosin, C., Pachet, F. (2006), “Automatic recognition of
urban sound sources”, Audio Engineering Society 120th Convention.
7. Duifhuis, H., Hoogstraten, H.W., Netten, S.M., Diependaal, R.J., Bialek, W.
(1985), “Modeling the Cochlear Partition with Coupled Van der Pol Oscillators”,
in J.W. Wilson and D.T. Kemp (Ed.), Cochlear Mechanisms: Structure, Function
and Models, pp 395-404. Plenum, New York.
8. Ellis, D.P.W. (1999), “Using Knowledge to Organize Sound: The Prediction-
Driven Approach to Computational Auditory Scene Analysis and its Application
to Speech/Nonspeech Mixtures”, Speech Communication 27(3-4), pp 281-298.
9. Gaver, W.W. (1993), “What in the World Do We Hear?: An Ecological Approach
to Auditory Event Perception”, Ecological Psychology 5(1), pp 1-29.
10. Gibson, J.J. (1979), The Ecological Approach to Visual Perception, Boston, MA:
Houghton Mifflin.
11. Knill, K., Young, S. (1997), “Hidden Markov Models in Speech and Language Pro-
cessing”, in S. Young and G. Bloothooft (Ed.), Corpus-Based Methods in Language
and Speech Processing, pp 27-68. Kluwer Academic, Dordrecht.
12. McClelland, J.L., Elman, J.L. (1986), “The TRACE Model of Speech Perception”,
Cognitive Psychology 18, pp 1-86.
13. Neuhoff, J.G. (2004), “Ecological Psychoacoustics: Introduction and History”, in
J.G. Neuhoff (Ed.), Ecological Psychoacoustics, pp 1-13. Elsevier Academic Press.
14. Norris, D. (1994), “Shortlist: A Connectionist Model of Continuous Speech Recog-
nition”, Cognition 52, pp 189-234.
15. Russell, S.J., Norvig, P. (1995), Artificial Intelligence: A Modern Approach.
Prentice-Hall.
16. Sound Intelligence, http://www.soundintel.com/
17. Wang, D. (2005), “On Ideal Binary Mask As the Computational Goal of Auditory
Scene Analysis”, in P. Divenyi (Ed.), Speech Separation by Humans and Machines,
pp 181-197. Kluwer Academic, Norwell MA.
18. Warren, W.H., Verbrugge, R.R. (1984) “Auditory Perception of Breaking and
Bouncing Events: A Case Study in Ecological Acoustics”, Journal of Experimental
Psychology: Human Perception and Performance 10(5), pp 704-712.