Article

Disambiguating Sound through Context.

Abstract

A central problem in automatic sound recognition is the mapping between low-level audio features and the meaningful content of an auditory scene. We propose a dynamic network model to perform this mapping. In acoustics, much research is devoted to low-level perceptual abilities such as audio feature extraction and grouping, which are translated into successful signal processing techniques. However, little work is done on modeling knowledge and context in sound recognition, although this information is necessary to identify a sound event rather than to separate its components from a scene. We first investigate the role of context in human sound identification in a simple experiment. Then we show that the use of knowledge in a dynamic network model can improve automatic sound identification by reducing the search space of the low-level audio features. Furthermore, context information dissolves ambiguities that arise from multiple interpretations of one sound event.
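To make the role of context concrete, here is a minimal Python sketch of the general idea: a contextual prior reweights signal-driven hypotheses and keeps only the most plausible ones, shrinking the search space. The labels, scores, and weighting scheme are illustrative assumptions, not the model described in the paper.

```python
# Minimal sketch (not the authors' implementation): context as a prior that
# reweights signal-driven hypotheses and prunes the label search space.
# All labels and probabilities below are made up for illustration.

def rerank_with_context(signal_scores, context_prior, keep=3):
    """Combine bottom-up classifier scores with a contextual prior.

    signal_scores : dict mapping sound-event label -> classifier confidence
    context_prior : dict mapping sound-event label -> plausibility in context
    keep          : number of top hypotheses to retain (search-space reduction)
    """
    combined = {
        label: score * context_prior.get(label, 0.01)  # small floor for unseen labels
        for label, score in signal_scores.items()
    }
    total = sum(combined.values()) or 1.0
    ranked = sorted(((s / total, l) for l, s in combined.items()), reverse=True)
    return ranked[:keep]

# An acoustically ambiguous event: "bacon frying" vs. "fuse burning".
signal_scores = {"bacon frying": 0.48, "fuse burning": 0.47, "applause": 0.05}
kitchen_prior = {"bacon frying": 0.6, "applause": 0.1, "fuse burning": 0.05}
print(rerank_with_context(signal_scores, kitchen_prior))
```

In a kitchen context the prior tips the balance toward "bacon frying", mirroring the ambiguity-resolution example discussed in the citing work below.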


... Environmental sounds are an integral part of the everyday listening experience and can convey critical information for individual safety and well-being [10]. Previous research has demonstrated varied effects of semantic context in environmental sound perception in normal-hearing adults and children [11][12][13][14][15]. However, time-efficient tests for the assessment of nonspeech semantic context effects have not yet been developed. ...
... Over time, the causal dependencies and arbitrary but consistent probabilistic contingencies among sound-producing objects and events in an auditory scene can form stable semantic memory networks which affect the identification of specific sounds. Listeners tend to rely on contextual information to resolve perceptual ambiguity about specific environmental sounds in a sound sequence [11,12,33]. For instance, an ambiguous sound can be perceived as 'a fuse burning' when preceded by the sound of a match being struck and followed by the sound of an explosion, but it is perceived as 'bacon frying' when preceded and followed by other kitchen sounds [11]. ...
... Semantic connections are formed among sounds that are likely to occur together, which results in the formation of a semantic memory network [33], also referred to as an auditory schema [3,37]. Similar to identification of objects in visual scenes [31,32], the identification of one or more individual sounds within such schemas or networks activates other elements, increasing the likelihood of their identification [3,11,12]. On the other hand, semantic incongruence between a specific sound and the other sounds in an auditory scene can also result in an identification advantage [13]. For instance, the sound of a rooster crowing in an auditory scene of a hospital emergency room is more detectable than the same sound heard in a barnyard ambience. ...
Article
Full-text available
Objective Sounds in everyday environments tend to follow one another as events unfold over time. The tacit knowledge of contextual relationships among environmental sounds can influence their perception. We examined the effect of semantic context on the identification of sequences of environmental sounds by adults of varying age and hearing abilities, with an aim to develop a nonspeech test of auditory cognition. Method The familiar environmental sound test (FEST) consisted of 25 individual sounds arranged into ten five-sound sequences: five contextually coherent and five incoherent. After hearing each sequence, listeners identified each sound and arranged them in the presentation order. FEST was administered to young normal-hearing, middle-to-older normal-hearing, and middle-to-older hearing-impaired adults (Experiment 1), and to postlingual cochlear-implant users and young normal-hearing adults tested through vocoder-simulated implants (Experiment 2). Results FEST scores revealed a strong positive effect of semantic context in all listener groups, with young normal-hearing listeners outperforming other groups. FEST scores also correlated with other measures of cognitive ability, and for CI users, with the intelligibility of speech-in-noise. Conclusions Being sensitive to semantic context effects, FEST can serve as a nonspeech test of auditory cognition for diverse listener populations to assess and potentially improve everyday listening skills.
... This method constructs a dynamic network that keeps track of both bottom-up signal information and contextual knowledge. By using more information than what can be known from the signal at each point in time, the system is not only more robust to noise, but it can also distinguish between sound events that are similar in acoustic structure but different in meaning (Niessen et al., 2008). The nodes of the dynamic network represent information about sound events at different levels of complexity. ...
... In this section we present a model that creates expectancies of sound events and evaluates the signal-driven hypotheses based on these expectancies. The description of the way the model operates is given in more detail in Niessen et al. (2008). ...
... As a result, every hypothesis in the network holds a confidence value after spreading the activation. A description of the details of the spreading activation algorithm can be found in Niessen et al. (2008). The activation values of all hypotheses in the network decrease with time when they get no reinforcement from signal-driven evidence. ...
Article
A recognition system for environmental sounds is presented. Signal-driven classification is performed by applying machine-learning techniques on features extracted from a cochleogram. These possibly unreliable classifications are improved by creating expectancies of sound events based on context information.
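The excerpts above describe the bookkeeping of the dynamic network: hypothesis nodes hold an activation ("confidence") value, activation spreads along links, and it decays over time unless reinforced by signal-driven evidence. The following is a minimal sketch of that bookkeeping; the node names, link weights, and the decay and spreading constants are assumptions, not values from Niessen et al. (2008).

```python
# Sketch of activation spreading with decay in a small hypothesis network.
# Constants and node names are illustrative assumptions.

DECAY = 0.8      # per-time-step retention factor (assumed)
SPREAD = 0.5     # fraction of activation passed along each link (assumed)

class Node:
    def __init__(self, name):
        self.name = name
        self.activation = 0.0
        self.links = []          # list of (Node, weight)

    def link(self, other, weight=1.0):
        self.links.append((other, weight))

def step(nodes, evidence):
    """One update: inject signal-driven evidence, spread activation, then decay."""
    for node in nodes:
        node.activation += evidence.get(node.name, 0.0)
    incoming = {node: 0.0 for node in nodes}
    for node in nodes:
        for target, weight in node.links:
            incoming[target] += SPREAD * weight * node.activation
    for node in nodes:
        node.activation = DECAY * (node.activation + incoming[node])

# Tiny network: a "kitchen" context node reinforcing a "bacon frying" hypothesis.
kitchen, frying = Node("kitchen"), Node("bacon frying")
kitchen.link(frying, 1.0)
nodes = [kitchen, frying]
step(nodes, {"kitchen": 1.0})            # context evidence arrives
step(nodes, {"bacon frying": 0.3})       # weak signal-driven evidence
print({n.name: round(n.activation, 3) for n in nodes})
```

Without further evidence, repeated calls to step() let both activations fade, which corresponds to the decay of unreinforced hypotheses described above.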
... Applying these ideas to automatic sound event detection, Niessen et al. (2008) show that modeling the environmental context, i.e. the probability of hearing an event in a given type of environment and, a fortiori, the probability of co-occurrence of events, improves detection performance. ...
... Section 2.3.5), i.e. on the nature of the co-occurring sources in the scene (Ballas and Howard, 1987; Gygi and Shafiro, 2011; Niessen et al., 2008). ...
Thesis
Full-text available
This thesis addresses the analysis of scenes drawn from sound environments, the auditory result of mixing distinct, concurrent emitting sources. Opening the range of possible sources and research beyond the more specific domains of speech and music, the sound environment is a complex object. Its analysis, the process by which a subject gives it meaning, concerns both the perceived data and the context in which those data are perceived. In perception research as well as in machine learning, any experiment requires fine control by the experimenter over the stimuli presented. Nevertheless, the nature of the sound environment calls for an ecological framework, that is, the use of real, recorded data rather than synthetic stimuli. Aware of this issue, we propose a model that simulates, from recordings of isolated sounds, sound scenes whose structural properties we control: intensity, density, and diversity of the sources. Building on available knowledge of the human auditory system, the model treats the sound scene as a composite object, a sum of source sounds. We use this tool in two fields of application. The first concerns perception and the notion of perceived pleasantness in urban environments; the use of simulated data allows us to assess precisely the impact of each sound source on it. The second concerns automatic sound event detection and proposes an evaluation methodology for algorithms that tests their generalization capabilities.
... Identifying separate acoustic objects and sources other than speech has been employed particularly as a tool in video segmentation and classification [6] [7], where the audio stream provides semantic information that cannot be deduced from the video image alone. Furthermore, the analysis of a particular acoustic scene (such as is done in the present work) has been tackled by tracking activation patterns for different classes, e.g. in train station scenes [8]. A particular application to office scenes can be found in [9], where microphones and several other sensors were used to track the activity of workers in an office, judging for example how much time is spent working on the computer. ...
Conference Paper
Full-text available
In this study, a new generic framework for the detection and interpretation of disagreement ("incongruence") between different classifiers [1] is applied to the problem of detecting novel acoustic objects in an office environment. Using a general model that detects generic acoustic objects (standing out from a stationary background) and specific models tuned to particular sounds expected in the office, a novel object is detected as an incongruence between the models: the general model detects it as a generic object, but the specific models cannot identify it as any of the known office-related sources. The detectors are realized using amplitude modulation spectrogram and RASTA-PLP features with support vector machine classification. Data considered are speech and non-speech sounds embedded in real office background at signal-to-noise ratios (SNR) from +20 dB to -20 dB. Our approach yields approximately 90% hit rate for novel events at 20 dB SNR, 75% at 0 dB and reaches chance level below -10 dB.
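The decision logic of the incongruence idea can be sketched in a few lines: a sound counts as "novel" when the general detector fires but no specific model claims it. The thresholds and scores below are placeholders; the actual system uses amplitude modulation spectrogram and RASTA-PLP features with SVM classifiers, which are not reproduced here.

```python
# Sketch of incongruence-based novelty detection between a general detector
# and a set of specific, office-tuned detectors. Scores and thresholds are
# illustrative assumptions.

def detect_incongruence(general_score, specific_scores,
                        general_threshold=0.5, specific_threshold=0.5):
    """Return one of: 'background', 'known event: <label>', 'novel event'."""
    if general_score < general_threshold:
        return "background"
    best_label, best_score = max(specific_scores.items(), key=lambda kv: kv[1])
    if best_score >= specific_threshold:
        return f"known event: {best_label}"
    return "novel event"   # general and specific models disagree

# Example: the generic detector fires, but no office model recognizes the sound.
print(detect_incongruence(0.9, {"keyboard": 0.2, "phone ring": 0.3, "speech": 0.1}))
```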
... The model continuously makes an estimation of its current state, based on sensory input and knowledge of the context. The dynamic network model has been applied previously to sound input [7], but is developed to process any type of sensory input. Therefore, as we will show in this paper, it can also be applied to visual information from the real-world environment of a mobile robot. ...
... Moreover, the model is general, because the sensory information in the model is not limited to visual observations. Hence, it can be used for state estimation in other domains (see [7]), or even combine information from different modalities to make predictions. ...
Conference Paper
Full-text available
Agents that operate in a real-world environment have to process an abundance of information, which might be ambiguous or noisy. We present a method inspired by cognitive research that keeps track of sensory information, and interprets it with knowledge of the context. We test this model on visual information from the real-world environment of a mobile robot. A basic task for a robot is to build a map of its environment for self-localization. We use a topological map to represent the environment, which is an abstract representation of distinct places and the connections between them. The context of a place on the map is used to compute expectancies of the robot's next location when it is moving. These expectancies are combined with evidence from observations to reach the best prediction of the next location of the robot. Furthermore, such expectancies of where the robot will be in the next time step can resolve ambiguous observations. In general, the model is more robust to noise in the input than a data-driven model. Results of the model on data gathered by a mobile robot confirm that context evaluation improves localization compared to a data-driven model.
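A minimal sketch of the expectancy-plus-observation update on a topological map, written as a discrete Bayes-style filter; the places, transition probabilities, and observation likelihoods are invented for illustration and are not taken from the paper.

```python
# Sketch of context-based localization on a topological map: the expectancy of
# the next place comes from the map's connectivity and is combined with a
# (possibly ambiguous) observation likelihood.

# Adjacency of places: from each place the robot can stay or move to neighbours.
TRANSITIONS = {
    "corridor": {"corridor": 0.4, "office": 0.3, "kitchen": 0.3},
    "office":   {"office": 0.6, "corridor": 0.4},
    "kitchen":  {"kitchen": 0.6, "corridor": 0.4},
}

def predict(belief):
    """Expectancy of the next place, derived from the topological map."""
    new_belief = {place: 0.0 for place in TRANSITIONS}
    for place, p in belief.items():
        for nxt, t in TRANSITIONS[place].items():
            new_belief[nxt] += p * t
    return new_belief

def update(belief, likelihood):
    """Combine expectancy with observation evidence and renormalize."""
    posterior = {place: belief[place] * likelihood.get(place, 1e-6)
                 for place in belief}
    z = sum(posterior.values()) or 1.0
    return {place: p / z for place, p in posterior.items()}

belief = {"corridor": 1.0, "office": 0.0, "kitchen": 0.0}
belief = predict(belief)
# Ambiguous observation: office and kitchen look alike to the classifier.
belief = update(belief, {"office": 0.45, "kitchen": 0.45, "corridor": 0.10})
print({p: round(v, 3) for p, v in belief.items()})
```

Because the map makes both office and kitchen reachable from the corridor, the ambiguous observation is resolved by the expectancy rather than by the observation alone, which is the point the abstract makes.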
... Context-aware sound event recognition is still at an early stage compared with context-aware speech recognition. Niessen et al. [8] modeled context in audio recognition by investigating the role of the dynamic network model to improve automatic audio identification and simultaneously reduce the search space of low-level audio features. The context-aware level describes more general information about the surroundings of an audio device beyond the location, such as time [9] and even user-dependent states like emotion [10]. ...
Conference Paper
Full-text available
Sound event recognition without the context is challenging for both humans and robots due to the diversity of sound events. Contextual information allows them to disambiguate the sound events, for example, in a home environment. This paper proposes and implements a context-aware sound event recognition for a home service robot that monitors the elderly living alone at home. The location context of sound events is estimated by fusing distributed PIR (Passive Infrared) Sensors. A two-level dynamic Bayesian network (DBN) is used to model the intra-temporal and inter-temporal constraints among the context and sound events. We conducted experiments in a robot-integrated smart home (RiSH) testbed to evaluate the proposed method. The obtained results show the effectiveness and accuracy of context-aware sound event recognition.
... This is due to the relative lack of basic lexicalized terms to describe acoustic phenomena [39]. • Sound description and identification are highly dependent on their context, that is, sound source identification depends on the nature of other co-occurring sound sources [34], [40], [41]. ...
Article
Full-text available
This paper introduces a model for simulating environmental acoustic scenes that abstracts temporal structures from audio recordings. This model allows us to explicitly control key morphological aspects of the acoustic scene and to isolate their impact on the performance of the system under evaluation. Thus, more information can be gained on the behavior of an evaluated system, providing guidance for further improvements. To demonstrate its potential, this model is employed to evaluate the performance of nine state of the art sound event detection systems submitted to the IEEE DCASE 2013 Challenge. Results indicate that the proposed scheme is able to successfully build datasets useful for evaluating important aspects of the performance of sound event detection systems, such as their robustness to new recording conditions and to varying levels of background audio.
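A rough sketch of the simulation idea follows: isolated event recordings are mixed into a background at a controlled event density and event-to-background ratio. The function and its parameters are assumptions for illustration, not the published model's interface, and file loading is left out.

```python
# Sketch of controlled acoustic-scene simulation with plain NumPy.
import numpy as np

def simulate_scene(background, events, density_per_s, ebr_db, sr=16000, seed=0):
    """background: 1-D array; events: list of 1-D arrays; density_per_s: mean
    number of events per second; ebr_db: event-to-background ratio in dB."""
    rng = np.random.default_rng(seed)
    scene = background.astype(float).copy()
    n_events = rng.poisson(density_per_s * len(background) / sr)
    bg_rms = np.sqrt(np.mean(background.astype(float) ** 2)) + 1e-12
    for _ in range(n_events):
        clip = events[int(rng.integers(len(events)))].astype(float)
        ev_rms = np.sqrt(np.mean(clip ** 2)) + 1e-12
        gain = bg_rms / ev_rms * 10 ** (ebr_db / 20.0)   # set the event level
        start = int(rng.integers(0, max(1, len(scene) - len(clip))))
        seg = clip[: len(scene) - start]
        scene[start:start + len(seg)] += gain * seg      # add event at random onset
    return scene

# Example with synthetic signals standing in for recordings.
bg = np.random.default_rng(1).normal(scale=0.05, size=16000 * 10)   # 10 s noise
beep = np.sin(2 * np.pi * 880 * np.arange(0, 0.3, 1 / 16000))        # 0.3 s tone
scene = simulate_scene(bg, [beep], density_per_s=0.5, ebr_db=6.0)
```

Varying density_per_s and ebr_db is the kind of explicit control over scene morphology that the abstract argues is needed to probe detector robustness.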
... AERS includes several different ways in which top-down influence can affect its operation. Specifying these effects is an exciting direction for further research (see, e.g., Niessen, van Maanen, & Andringa, 2008). ...
Article
Communication by sounds requires that the communication channels (i.e. speech/speakers and other sound sources) have been established. This makes it possible to separate concurrently active sound sources, to track their identity, to assess the type of message arriving from them, and to decide whether and when to react (e.g., reply to the message). We propose that these functions rely on a common generative model of the auditory environment. This model predicts upcoming sounds on the basis of representations describing temporal/sequential regularities. Predictions help to identify the continuation of the previously discovered sound sources, to detect the emergence of new sources, as well as changes in the behavior of the known ones. It produces auditory event representations which provide a full sensory description of the sounds, including their relation to the auditory context and the current goals of the organism. Event representations can be consciously perceived and serve as objects in various cognitive operations.
... In other words, the same sound may be described quite differently depending on the subject. This is due to the relative lack of basic lexicalized terms to describe acoustic phenomena [32]. • Sound description and identification are highly context-dependent, that is, sound source identification depends on the nature of the other co-occurring sound sources [27], [33], [34]. Even if there is no systematic way to build a sound collection, one may take into account some perceptual considerations to guarantee a certain level of ecological validity. Those considerations are addressed in the next sections. ...
Preprint
This paper introduces a model of environmental acoustic scenes which adopts a morphological approach by abstracting temporal structures of acoustic scenes. To demonstrate its potential, this model is employed to evaluate the performance of a large set of acoustic event detection systems. This model allows us to explicitly control key morphological aspects of the acoustic scene and isolate their impact on the performance of the system under evaluation. Thus, more information can be gained on the behavior of evaluated systems, providing guidance for further improvements. The proposed model is validated using submitted systems from the IEEE DCASE Challenge; results indicate that the proposed scheme is able to successfully build datasets useful for evaluating some aspects of the performance of event detection systems, in particular their robustness to new listening conditions and to increasing levels of background sounds.
... However, sounds in coherent scenes were identified with greater accuracy than the same sounds in incoherent scenes, demonstrating the effect of context. These findings were later replicated by Niessen et al. (2008), who also observed an advantage for the identification of individual environmental sounds embedded in a semantically coherent sequence. Target sounds in that study were auditory "chimeras" in which the energy envelope of one sound was used to modulate another sound. ...
Article
Full-text available
Past work has shown involvement of context in the perception of sequences of meaningful speech sounds. The present study extended investigation to meaningful environmental sounds, evaluating the accuracy in identifying individual sources in sequences of five environmental sounds that were either likely or not to have occurred together in place and time. Rating of sequence coherence confirmed the subjective distinction between sequence types. In the main task, young normal-hearing listeners were instructed to identify and order each sound they heard by selecting from among 20 randomly selected labels. Listeners were significantly more accurate in reporting both the names and order of the sounds of the coherent sequences in which sounds were more likely to have occurred together. If not scored in terms of source order, a smaller but still significant difference was obtained between coherent and incoherent sequences. For both sequence types, there was a trend for best performance for the initial and final sources of the sequences. Overall, results are consistent with speech findings and demonstrate a positive effect of context in the perceptual processing of environmental sounds. [Work supported by NIH/NIDCD.].
... In contrast to domain-specific solutions, a general sound recognition system should be robust to the complexities of unconstrained soundscapes, such as strong and varying transmission effects and concurrent sources. To handle real-world complexities, human perception relies on signal-driven processing, but also on contextual knowledge and reasoning 4. Therefore, a general sound recognition system should comprise an interaction of signal-driven techniques and interpretation of the context. ...
Conference Paper
Full-text available
... The methods are only briefly described here. For a detailed description we refer to Krijnders et al. 13 and Niessen et al. 14 . ...
Conference Paper
Full-text available
Human perception of sound in real environments is a complex fusion of many factors, which are investigated by diverse research fields. Most approaches to assess and improve sonic environments (soundscapes) use a holistic approach. For example, in experimental psychology, subjective measurements usually involve the evaluation of a complete soundscape by human listeners, mostly through questionnaires. In contrast, psycho-acoustic measurements try to capture low-level perceptual attributes in quantities such as loudness. However, these two types of soundscape measurements are difficult to link other than with correlational measures. We propose a method inspired by cognitive research to improve our understanding of the link between acoustic events and human soundscape perception. Human listeners process sound as meaningful events. Therefore, we developed a model to identify components in a soundscape that are the basis of these meaningful events. First, we select structures from the sound signal that are likely to stem from a single source. Subsequently, we use a model of human memory to identify the most likely events that constitute these components.
Article
Monitoring daily activities is essential for home service robots to take care of the older adults who live alone in their homes. In this article, we proposed a sound-based human activity monitoring (SoHAM) framework by recognizing sound events in a home environment. First, the method of context-aware sound event recognition (CoSER) is developed, which uses contextual information to disambiguate sound events. The locational context of sound events is estimated by fusing the data from the distributed passive infrared (PIR) sensors deployed in the home. A two-level dynamic Bayesian network (DBN) is used to model the intratemporal and intertemporal constraints between the context and the sound events. Second, dynamic sliding time window-based human action recognition (DTW-HaR) is developed to estimate active sound event segments with their labels and durations, then infer actions and their durations. Finally, a conditional random field (CRF) model is proposed to predict human activities based on the recognized action, location, and time. We conducted experiments in our robot-integrated smart home (RiSH) testbed to evaluate the proposed framework. The obtained results show the effectiveness and accuracy of CoSER, action recognition, and human activity monitoring. Note to Practitioners —This article is motivated by the goal to develop companion robots that can assist older adults living alone. Among many capabilities, monitoring human daily activities is an essential one for such robots. Though computer vision or wearable sensors-based methods have been developed by other researchers, they are not practical due to the privacy concern and intrusiveness. Sound-based daily activity recognition can address these concerns and offer a viable solution. In this regard, our proposed method adopts microphones on the robot and a small set of motion sensors distributed in the home. The proposed theoretical framework was tested in a small-scale mock-up apartment with promising results. Before such companion robots can be deployed to real homes for elderly care, there is a need to improve the robustness of the algorithms. More thorough tests in various realistic home environments should be conducted to fully evaluate the performance of the robots. In addition, privacy concern related to audio capture should be further mitigated.
Article
We describe a soundscape annotation tool for unconstrained environmental sounds. The tool segments the time-frequency-plane into regions that are likely to stem from a single source. A human annotator can classify these regions. The system learns to suggest possible annotations and presents these for acceptance. Accepted or corrected annotations will be used to improve the classification further. Automatic annotations with a very high probability of being correct might be accepted by default. This speeds up the annotation process and makes it possible to annotate complex soundscapes both quickly and in considerable detail. We performed a pilot on recordings of uncontrolled soundscapes at locations in Assen (NL) made in the early spring of 2009. Initial results show that the system is able to propose the correct class in 75% of the cases.
Article
Two sound classifiers were proposed for a novel aquaculture application that involved processing sound to estimate the feed consumption of prawns within the turbid waters of farm ponds. A two stage content classifier inferred feed events using identified sound features. To deal with the class ambiguity created by the acoustically challenging conditions of ponds, the CADBN was proposed to jointly model the sound features with the context of feed events. The CADBN was then reformulated to classify the energy load of devices using a distributed state space that enabled flexible and efficient modelling of context. The CADBN was compared to a set of benchmark classifiers for both the prawn feeding and energy applications. Results indicate that the inclusion of context greatly enhances class discrimination in both problems. Furthermore, results illustrate that the temporal structure of the CADBN produced superior performance to benchmark context classifiers that adopt the same context features as independent inputs.
Article
Full-text available
When environmental sounds are semantically incongruent with the background scene (e.g., horse galloping in a restaurant) they can be identified more accurately by young normal hearing listeners (YNH) than sounds congruent with the scene (e.g., horse galloping at a racetrack). This study investigated how age and high frequency audibility affect this Incongruency Advantage (IA) effect. Methods In Experiments 1a and 1b, elderly listeners (N=18 for 1a, N=10 for 1b) with age appropriate hearing (EAH) were tested on target sounds and auditory scenes in five Sound-to-Scene ratios (So/Sc) between -3 to -18 dB. Experiment 2 tested eleven YNH on the same sound/scene pairings lowpass-filtered at 4 kHz (YNH-4k). The EAH and YNH-4k groups exhibited an almost identical pattern of significant IAs, but both were at approximately 3.9 dB higher So/Sc than the previously tested YNH. However, the psychometric functions revealed a shallower slope for EAH compared to YNH for the Congruent stimuli only, suggesting a greater difficulty for the elderly in attending to sounds expected to occur in a scene. These findings indicate that semantic relationship between environmental sounds in soundscapes are mediated by both audibility and cognitive factors and suggest a method for dissociating the two.
Article
This study investigated the development of children's skills in identifying ecologically relevant sound objects within naturalistic listening environments, using a non-linguistic analogue of the classic 'cocktail-party' situation. Children aged 7 to 12.5 years completed a closed-set identification task in which brief, commonly encountered environmental sounds were presented at varying signal-to-noise ratios. To simulate the complexity of real-world acoustic environments, target sounds were embedded in either a single, stereophonically presented scene, or in one of two different scenes, with each scene presented to a single ear. Each target sound was either congruent or incongruent with the auditory context. Identification accuracy improved with increasing age, particularly in trials with low signal-to-noise ratios. Performance was most accurate when target sounds were incongruent with the background scene, and when sounds were presented in a single background scene. The presence of two backgrounds disproportionately disrupted children's performance relative to that of previously tested adults, and reduced children's sensitivity to contextual cues. Successful identification of familiar sounds in complex auditory contexts is the outcome of a protracted learning process, with children reaching adult levels of performance after a decade or more of experience.
Article
Full-text available
The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out unlikely events given the context. We propose a similar utilization of context information in the automatic sound event detection process. The proposed approach is composed of two stages: an automatic context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, the audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first one, a monophonic event sequence is output by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate the sound event detection performance with various levels of polyphony. This combines the detection accuracy and coarse time-resolution error into one metric, making the comparison of the performance of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system. At the block level, the detection accuracy can be almost doubled by using the proposed context-dependent event detection.
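The two-stage structure can be sketched as follows: one Gaussian mixture model per context selects the context, which then restricts the event set and supplies count-based priors. The HMM/Viterbi event stage of the paper is simplified here to a frame-wise decision, and all contexts, events, and features are synthetic placeholders.

```python
# Sketch of two-stage, context-dependent event detection.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_context_models(features_per_context, n_components=4):
    """features_per_context: dict context -> (n_frames, n_dims) array."""
    return {c: GaussianMixture(n_components=n_components, covariance_type="diag",
                               random_state=0).fit(X)
            for c, X in features_per_context.items()}

def recognize_context(models, X):
    """Pick the context whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda c: models[c].score(X))

def detect_events(frame_scores, context, context_events, event_priors):
    """frame_scores: dict event -> per-frame score array. Returns one label per
    frame, chosen only among events allowed in the recognized context."""
    allowed = context_events[context]
    stacked = np.stack([frame_scores[e] * event_priors[context][e] for e in allowed])
    return [allowed[i] for i in stacked.argmax(axis=0)]

# Tiny synthetic demonstration.
rng = np.random.default_rng(0)
train = {"street": rng.normal(0, 1, (200, 5)), "home": rng.normal(2, 1, (200, 5))}
models = train_context_models(train)
context = recognize_context(models, rng.normal(2, 1, (50, 5)))   # looks like "home"
context_events = {"home": ["speech", "dishes"], "street": ["car", "speech"]}
event_priors = {"home": {"speech": 0.7, "dishes": 0.3},
                "street": {"car": 0.8, "speech": 0.2}}
frame_scores = {e: rng.random(50) for e in ["speech", "dishes", "car"]}
labels = detect_events(frame_scores, context, context_events, event_priors)
```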
Article
Full-text available
This paper presents a model for reading cursive scripts which has an architecture inspired by the behavior of human reading and perceptual concepts. The scope of this study is limited to offline recognition of isolated cursive words. First, this paper describes McClelland and Rumelhart's reading model, which formed the basis of the system. The method's behavior is presented, followed by the main original contributions of our model which are: the development of a new technique for baseline extraction, an architecture based on the chosen reading model (hierarchical, parallel, with local representation and interactive activation mechanism), the use of significant perceptual features in word recognition such as ascenders and descenders, the creation of a fuzzy position concept dealing with the uncertainty of the location of features and letters, and the adaptability of the model to words of different lengths and languages. After a description of our model, new results are presented.
Article
Full-text available
This paper surveys the use of Spreading Activation techniques on Semantic Networks in Associative Information Retrieval. The major Spreading Activation models are presented and their applications to IR are surveyed. A number of works in this area are critically analyzed in order to study the relevance of Spreading Activation for associative IR. Key words: spreading activation, information storage and retrieval, semantic networks, associative information retrieval, information processing, knowledge representation.
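A minimal sketch of spreading activation for associative retrieval: query nodes are activated, activation propagates along weighted links for a limited number of pulses (a common constraint in these models), and the most activated non-query nodes are returned as related items. The network, weights, and constraint scheme are toy assumptions, not taken from the survey.

```python
# Sketch of (constrained) spreading activation over a small semantic network.

NETWORK = {                      # node -> {neighbour: link weight}
    "siren":     {"ambulance": 0.9, "alarm": 0.6},
    "ambulance": {"siren": 0.9, "hospital": 0.8},
    "alarm":     {"siren": 0.6, "clock": 0.5},
    "hospital":  {"ambulance": 0.8},
    "clock":     {"alarm": 0.5},
}

def spread(query_nodes, pulses=2, attenuation=0.7):
    activation = {n: 0.0 for n in NETWORK}
    for q in query_nodes:
        activation[q] = 1.0
    for _ in range(pulses):                       # limit the number of pulses
        new = dict(activation)
        for node, a in activation.items():
            for neigh, w in NETWORK[node].items():
                new[neigh] += attenuation * w * a # attenuate with each hop
        activation = new
    related = {n: a for n, a in activation.items() if n not in query_nodes}
    return sorted(related.items(), key=lambda kv: kv[1], reverse=True)

print(spread({"siren"}))   # "hospital" and "alarm" receive indirect activation
```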
Article
Full-text available
Four experiments investigated the acoustical correlates of similarity and categorization judgments of environmental sounds. In Experiment 1, similarity ratings were obtained from pairwise comparisons of recordings of 50 environmental sounds. A three-dimensional multidimensional scaling (MDS) solution showed three distinct clusterings of the sounds, which included harmonic sounds, discrete impact sounds, and continuous sounds. Furthermore, sounds from similar sources tended to be in close proximity to each other in the MDS space. The orderings of the sounds on the individual dimensions of the solution were well predicted by linear combinations of acoustic variables, such as harmonicity, amount of silence, and modulation depth. The orderings of sounds also correlated significantly with MDS solutions for similarity ratings of imagined sounds and for imagined sources of sounds, obtained in Experiments 2 and 3—as was the case for free categorization of the 50 sounds (Experiment 4)—although the categorization data were less well predicted by acoustic features than were the similarity data.
Article
Full-text available
This paper presents a comprehensive comparative study of artificial neural networks, learning vector quantization and dynamic time warping classification techniques combined with stationary/non-stationary feature extraction for environmental sound recognition. Results show 70% recognition using mel frequency cepstral coefficients or continuous wavelet transform with dynamic time warping.
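One of the compared pipelines, MFCC features with dynamic time warping (DTW) and nearest-template classification, can be sketched as below. librosa is assumed to be available for the MFCCs; the DTW is a plain quadratic-time implementation rather than the one used in the study.

```python
# Sketch of MFCC + DTW nearest-template classification.
import numpy as np
import librosa

def mfcc_features(y, sr):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # shape (frames, 13)

def dtw_distance(a, b):
    """Classic DTW on per-frame Euclidean distances between two MFCC sequences."""
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[len(a), len(b)]

def classify(y, sr, templates):
    """templates: dict label -> list of MFCC sequences for that sound class."""
    query = mfcc_features(y, sr)
    return min(((dtw_distance(query, t), label)
                for label, seqs in templates.items() for t in seqs))[1]
```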
Article
Full-text available
Describes a model in which perception results from excitatory and inhibitory interactions of detectors for visual features, letters, and words. A visual input excites detectors for visual features in the display and for letters consistent with the active features. Letter detectors in turn excite detectors for consistent words. It is suggested that active word detectors mutually inhibit each other and send feedback to the letter level, strengthening activation and hence perceptibility of their constituent letters. Computer simulation of the model exhibits the perceptual advantage for letters in words over unrelated contexts and is considered consistent with basic facts about word advantage. Most important, the model produces facilitation for letters in pronounceable pseudowords as well as words. Pseudowords activate detectors for words that are consistent with most active letters, and feedback from the activated words strengthens activations of the letters in the pseudoword. The model thus accounts for apparently rule-governed performance without any actual rules.
Article
Full-text available
Everyday listening is the experience of hearing events in the world rather than sounds per se. In this article, I take an ecological approach to everyday listening to overcome constraints on its study implied by more traditional approaches. In particular, I am concerned with developing a new framework for describing sound in terms of audible source attributes. An examination of the continuum of structured energy from event to audition suggests that sound conveys information about events at locations in an environment. Qualitative descriptions of the physics of sound-producing events, complemented by protocol studies, suggest a tripartite division of sound-producing events into those involving vibrating solids, gasses, or liquids. Within each of these categories, basic-level events are defined by the simple interactions that can cause these materials to sound, whereas more complex events can be described in terms of temporal patterning, compound, or hybrid sources. The results of these investigations are used to create a map of sound-producing events and their attributes useful in guiding further exploration.
Conference Paper
Full-text available
The goal of the FDAI project is to create a general system that computes an efficient representation of the acoustic environment. More precisely, FDAI has to compute a noise disturbance indicator based on the identification of six categories of sound sources. This paper describes experiments carried out to identify acoustic features and recognition models that were implemented in FDAI. This framework is based on EDS (Extractor Discovery System), an innovative system for acoustic feature extraction. The design and development of FDAI raised two critical issues: completeness (it is very difficult to design descriptors that identify every sound source in urban environments) and consistency (some sound sources are not acoustically consistent). We solved the first issue with a conditional evaluation of a family of acoustic descriptors, rather than the evaluation of a single general-purpose extractor. Indeed, a first hierarchical separation between vehicles (moped, bus, motorcycle and car) and non-vehicles (bird and voice) significantly raised the accuracy of identification of the buses. The second issue turned out to be more complex and is still under study. We give here preliminary results.
Article
Full-text available
The effects of context on the identification of everyday sounds were examined in four experiments. In all experiments, the test sounds were selected as nearly homonymous pairs. That is, the members of a pair sounded similar but were aurally discriminable. These test sounds were presented in isolation to get baseline identification performance and within sequences of other everyday sounds to assess contextual effects. These sequences were: (a) semantically consistent with the correct response, (b) semantically biased toward the other member of the pair, or (c) composed of randomly arranged sounds. Two paradigms, binary choice and free identification, were used. Results indicate that context had significant negative effects and only minor positive effects. Performance was consistently poorest in biased context and best in both isolated and consistent context. A signal detection analysis indicated that performance in identifying an out-of-context sound remains constant for the two paradigms, and that response bias is conservative, especially with a free-response paradigm. Labels added to enhance context generally had little effect beyond the effects of sounds alone.
Article
Full-text available
Comparisons are made between the perception of environmental sound and the perception of speech. With both, two types of processing are involved, bottom-up and top-down, and with both, the detailed form of the processing is, in several respects, similar. Recognition of isolated speech and environmental sounds produces similar patterns of semantic interpretations. Environmental sound "homonyms" are ambiguous in much the same manner as speech homonyms. Environmental sounds become integrated on the basis of cognitive processes similar to those used to perceive speech. The general conclusion is that environmental sound is usefully thought of as a form of language.
Conference Paper
Full-text available
We present a new approach to robust pitch tracking of speech in noise. We transform the audio signal to the time-frequency domain using a transmission-line model of the human ear. The model output is leaky-integrated resulting in an energy measure, called a cochleogram. To select the speech parts of the cochleogram a correlogram, based on a running autocorrelation per channel, is used. In the correlogram possible pitch hypotheses are tracked through time and for every track a salience measure is calculated. The salience determines the importance of that track, which makes it possible to select tracks that belong to the signal while ignoring those of noise. The algorithm is shown to be more robust to the addition of pink noise than standard autocorrelation-based methods. An important property of the grouping is the selection of parts of the time-frequency plane that are highly likely to belong to the same source, which is impossible with conventional autocorrelation-based methods.
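A much-simplified stand-in for this pipeline is sketched below: instead of a cochleogram and a per-channel correlogram, it uses a single short-time autocorrelation per frame, with the normalized peak height serving as a salience measure for accepting or rejecting a pitch hypothesis. Frame length, search range, and the salience threshold are assumptions.

```python
# Greatly simplified autocorrelation pitch tracking with a salience criterion.
import numpy as np

def frame_pitch(frame, sr, fmin=60.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return None, 0.0
    ac = ac / ac[0]                                  # normalize by energy
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    salience = float(ac[lag])                        # peak height in [-1, 1]
    return sr / lag, salience

def track_pitch(y, sr, frame_len=1024, hop=512, min_salience=0.3):
    track = []
    for start in range(0, len(y) - frame_len, hop):
        f0, sal = frame_pitch(y[start:start + frame_len], sr)
        track.append(f0 if sal >= min_salience else None)   # reject weak frames
    return track

sr = 16000
t = np.arange(0, 1.0, 1 / sr)
y = np.sin(2 * np.pi * 220 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)
print(track_pitch(y, sr)[:5])    # roughly 220 Hz where the tone dominates
```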
Conference Paper
Full-text available
This article addresses the problem of recognizing acoustic events present in unconstrained input. We propose a novel approach to the processing of audio data which combines bottom-up hypothesis generation with top-down expectations, which, unlike standard pattern recognition techniques, can ensure that the representation of the input sound is physically realizable. Our approach gradually enriches low-level signal descriptors, based on Continuity Preserving Signal Processing, with more and more abstract interpretations along a hierarchy of description levels. This process is guided by top-down knowledge which provides context and ensures an interpretation consistent with the knowledge of the system. This leads to a system that can detect and recognize specific events for which the evidence is present in unconstrained real-world sounds.
Article
Full-text available
Acoustic, ecological, perceptual and cognitive factors that are common in the identification of 41 brief, varied sounds were evaluated. In Experiment 1, identification time and accuracy, causal uncertainty values, and spectral and temporal properties of the sounds were obtained. Experiment 2 was a survey to obtain ecological frequency counts. Experiment 3 solicited perceptual-cognitive ratings. Factor analyses of spectral parameters and perceptual-cognitive ratings were performed. Identification time and causal uncertainty are highly interrelated, and both are related to ecological frequency and the presence of harmonics and similar spectral bursts. Experiments 4 and 5 used a priming paradigm to verify correlational relationships between identification time and causal uncertainty and to assess the effect of sound typicality. Results support a hybrid approach for theories of everyday sound identification.
Article
Full-text available
By Fourier's theorem, signals can be decomposed into a sum of sinusoids of different frequencies. This is especially relevant for hearing, because the inner ear performs a form of mechanical Fourier transform by mapping frequencies along the length of the cochlear partition. An alternative signal decomposition, originated by Hilbert, is to factor a signal into the product of a slowly varying envelope and a rapidly varying fine time structure. Neurons in the auditory brainstem sensitive to these features have been found in mammalian physiological studies. To investigate the relative perceptual importance of envelope and fine structure, we synthesized stimuli that we call 'auditory chimaeras', which have the envelope of one sound and the fine structure of another. Here we show that the envelope is most important for speech reception, and the fine structure is most important for pitch perception and sound localization. When the two features are in conflict, the sound of speech is heard at a location determined by the fine structure, but the words are identified according to the envelope. This finding reveals a possible acoustic basis for the hypothesized 'what' and 'where' pathways in the auditory cortex.
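The chimaera construction can be sketched with the Hilbert transform: the envelope of one sound modulates the fine time structure of another. The published stimuli were built per band in a multi-band filterbank; the single-band version below only illustrates the idea.

```python
# Single-band sketch of an auditory chimaera via the analytic signal.
import numpy as np
from scipy.signal import hilbert

def single_band_chimaera(env_source, fine_source):
    n = min(len(env_source), len(fine_source))
    a = np.asarray(env_source[:n], dtype=float)
    b = np.asarray(fine_source[:n], dtype=float)
    envelope = np.abs(hilbert(a))            # slowly varying envelope of sound A
    fine = np.cos(np.angle(hilbert(b)))      # rapidly varying fine structure of B
    return envelope * fine

sr = 16000
t = np.arange(0, 0.5, 1 / sr)
am_noise = np.random.default_rng(0).normal(size=t.size) * (1 + np.sin(2 * np.pi * 4 * t))
tone = np.sin(2 * np.pi * 440 * t)
chimaera = single_band_chimaera(am_noise, tone)   # 4 Hz envelope on a 440 Hz carrier
```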
Article
Full-text available
Multiple sound sources often contain harmonics that overlap and may be degraded by environmental noise. The auditory system is capable of teasing apart these sources into distinct mental objects, or streams. Such an 'auditory scene analysis' enables the brain to solve the cocktail party problem. A neural network model of auditory scene analysis, called the ARTSTREAM model, is presented to propose how the brain accomplishes this feat. The model clarifies how the frequency components that correspond to a given acoustic source may be coherently grouped together into a distinct stream based on pitch and spatial location cues. The model also clarifies how multiple streams may be distinguished and separated by the brain. Streams are formed as spectral-pitch resonances that emerge through feedback interactions between frequency-specific spectral representations of a sound source and its pitch. First, the model transforms a sound into a spatial pattern of frequency-specific activation across a spectral stream layer. The sound has multiple parallel representations at this layer. A sound's spectral representation activates a bottom-up filter that is sensitive to the harmonics of the sound's pitch. This filter activates a pitch category which, in turn, activates a top-down expectation that is also sensitive to the harmonics of the pitch. Resonance develops when the spectral and pitch representations mutually reinforce one another. Resonance provides the coherence that allows one voice or instrument to be tracked through a noisy multiple source environment. Spectral components are suppressed if they do not match harmonics of the top-down expectation that is read-out by the selected pitch, thereby allowing another stream to capture these components, as in the 'old-plus-new heuristic' of Bregman. Multiple simultaneously occurring spectral-pitch resonances can hereby emerge. These resonance and matching mechanisms are specialized versions of Adaptive Resonance Theory, or ART, which clarifies how pitch representations can self-organize during learning of harmonic bottom-up filters and top-down expectations. The model also clarifies how spatial location cues can help to disambiguate two sources with similar spectral cues. Data are simulated from psychophysical grouping experiments, such as how a tone sweeping upwards in frequency creates a bounce percept by grouping with a downward sweeping tone due to proximity in frequency, even if noise replaces the tones at their intersection point. Illusory auditory percepts are also simulated, such as the auditory continuity illusion of a tone continuing through a noise burst even if the tone is not present during the noise, and the scale illusion of Deutsch whereby downward and upward scales presented alternately to the two ears are regrouped based on frequency proximity, leading to a bounce percept. Since related sorts of resonances have been used to quantitatively simulate psychophysical data about speech perception, the model strengthens the hypothesis that ART-like mechanisms are used at multiple levels of the auditory system. Proposals for developing the model to explain more complex streaming data are also provided.
Conference Paper
Full-text available
This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is the exploitation of the complementary nature of audio and video sensing to disambiguate scene activity in real-life, noisy and dynamic environments. At the lower level, independent analysis of the audio and video streams yields intermediate descriptors of a scene like: "scream", "passing train" or "articulation energy". At the higher level, a Dynamic Bayesian Network is used as a fusion mechanism that produces an aggregate aggression indication for the current scene. Our prototype system is validated on a set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.
Conference Paper
Full-text available
Many works in audio and image signal analysis are based on the use of "features" to represent characteristics of sounds or images. Features are used in various ways, for instance as inputs to classifiers to automatically categorize objects, e.g. for audio scene description. Most, if not all, approaches focus on the development of clever classifiers and on the various processes of feature selection, classifier algorithms and parameter tuning. Strangely, the features themselves are rarely justified. The predominant paradigm consists in selecting, by hand, "generic", well-known features from the literature and focusing on the rest of the chain. In this study we try to generalize the notion of feature to make the choice of features more systematic and less prone to hazardous, unjustified human choices. To this aim, we introduce the notion of "analytical feature": features built only from the analysis of the problem at hand, using a heuristic function generation process. We present experiments aimed at answering some general questions about analytical features.
Article
Full-text available
The sound of a busy environment, such as a city street, gives rise to a perception of numerous distinct events in a human listener -- the `auditory scene analysis' of the acoustic information. Recent advances in the understanding of this process from experimental psychoacoustics have led to several efforts to build a computer model capable of the same function. This work is known as `computational auditory scene analysis'. The dominant approach to this problem has been as a sequence of modules, the output of one forming the input to the next. Sound is converted to its spectrum, cues are picked out, and representations of the cues are grouped into an abstract description of the initial input. This `data-driven' approach has some specific weaknesses in comparison to the auditory system: it will interpret a given sound in the same way regardless of its context, and it cannot `infer' the presence of a sound for which direct evidence is hidden by other components. The `prediction-driven' ap...
Book
Auditory Scene Analysis addresses the problem of hearing complex auditory environments, using a series of creative analogies to describe the process required of the human auditory system as it analyzes mixtures of sounds to recover descriptions of individual sounds. In a unified and comprehensive way, Bregman establishes a theoretical framework that integrates his findings with an unusually wide range of previous research in psychoacoustics, speech perception, music theory and composition, and computer modeling. Bradford Books imprint
Book
How can we engineer systems capable of “cocktail party” listening? Human listeners are able to perceptually segregate one sound source from an acoustic mixture, such as a single voice from a mixture of other voices and music at a busy cocktail party. How can we engineer “machine listening” systems that achieve this perceptual feat? Albert Bregman's book Auditory Scene Analysis, published in 1990, drew an analogy between the perception of auditory scenes and visual scenes, and described a coherent framework for understanding the perceptual organization of sound. His account has stimulated much interest in computational studies of hearing. Such studies are motivated in part by the demand for practical sound separation systems, which have many applications including noise-robust automatic speech recognition, hearing prostheses, and automatic music transcription. This emerging field has become known as computational auditory scene analysis (CASA). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications provides a comprehensive and coherent account of the state of the art in CASA, in terms of the underlying principles, the algorithms and system architectures that are employed, and the potential applications of this exciting new technology. With a Foreword by Bregman, its chapters are written by leading researchers and cover a wide range of topics including: estimation of multiple fundamental frequencies; feature-based and model-based approaches to CASA; sound separation based on spatial location; processing for reverberant environments; segregation of speech and musical signals; automatic speech recognition in noisy environments; and neural and perceptual modeling of auditory organization. The text is written at a level that will be accessible to graduate students and researchers from related science and engineering disciplines. The extensive bibliography accompanying each chapter will also make this book a valuable reference source. A web site accompanying the text, http://www.casabook.org, features software tools and sound demonstrations. © 2006 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Chapter
A mixture of sounds, though distinct in the environment, arrives in the form of a single pressure wave at each ear. From it, listeners must extract the signals coming from individual sources of sound, a process called auditory scene analysis (ASA). ASA first characterizes the incoming waveform by its frequency components and other features, then creates subsets (auditory streams) that extend over time, each representing a single environmental sound source. ASA exploits regularities in the signal to determine how to parse it. ASA is present at birth in humans and has been found in other animals.
Book
Auditory scene analysis (ASA) is defined and the problem of partitioning the time-varying spectrum resulting from mixtures of individual acoustic signals is described. Some basic facts about ASA are presented. These include causes and effects of auditory organization (sequential, simultaneous, and the old-plus-new heuristic). Processes employing different cues collaborate and compete in determining the final organization of the mixture. These processes take advantage of regularities in the mixture that give clues about how to parse it. There are general regularities that apply to most types of sound, as well as regularities in particular types of sound. The general ones are hypothesized to be used by innate processes, and the ones specific to restricted environments to be used by learned processes in humans and possibly by innate ones in animals. The use of brain recordings and the study of nonhuman animals is discussed.
Article
This paper discusses an analysis of how scientists select relevant publications, and an application that can assist scientists in this information selection task. The application, called the Personal Publication Assistant, is based on the assumption that successful information selection is driven by recognizing familiar terms. To adapt itself to a researcher’s interests, the system takes into account what words have been used in a particular researcher’s abstracts, and when these words have been used. The user model underlying the Personal Publication Assistant is based on a rational analysis of memory, and takes the form of a model of declarative memory as developed for the cognitive architecture ACT-R. We discuss an experiment testing the assumptions of this model and present a user study that validates the implementation of the Personal Publication Assistant. The user study shows that the Personal Publication Assistant can successfully make an initial selection of relevant papers from a large collection of scientific literature.
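The declarative-memory assumption behind such a recommender can be illustrated with the standard ACT-R base-level activation equation, B = ln(sum_j t_j^(-d)), where the t_j are the times since each past use of a term and d is the decay (commonly 0.5). The terms and usage times below are invented for illustration.

```python
# Sketch of ACT-R style base-level activation: items used often and recently
# receive higher activation and therefore rank higher.
import math

def base_level_activation(use_times, now, decay=0.5):
    """use_times: times (e.g. in days) at which a term appeared in the
    researcher's abstracts; now: current time on the same scale."""
    lags = [now - t for t in use_times if now > t]
    if not lags:
        return float("-inf")
    return math.log(sum(lag ** (-decay) for lag in lags))

term_history = {"spreading activation": [10.0, 200.0, 400.0],
                "cochleogram": [395.0, 398.0, 401.0]}
now = 405.0
ranked = sorted(term_history,
                key=lambda term: base_level_activation(term_history[term], now),
                reverse=True)
print(ranked)   # recently and frequently used terms rank first
```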
Article
The fields of human speech recognition (HSR) and automatic speech recognition (ASR) both investigate parts of the speech recognition process and have word recognition as their central issue. Although the research fields appear closely related, their aims and research methods are quite different. Despite these differences, there has lately been a growing interest in possible cross-fertilisation. Researchers from both ASR and HSR are realising the potential benefit of looking at the research field on the other side of the ‘gap’. In this paper, we provide an overview of past and present efforts to link human and automatic speech recognition research and present an overview of the literature describing the performance difference between machines and human listeners. The focus of the paper is on the mutual benefits to be derived from establishing closer collaborations and knowledge interchange between ASR and HSR. The paper ends with an argument for more and closer collaborations between researchers of ASR and HSR to further improve research in both fields.
Article
Speech is typically perceived against a background of other sounds. Listeners are adept at extracting target sources from the acoustic mixture reaching the ears. The auditory scene analysis (ASA) account holds that this feat is the result of a two-stage process. In the first stage, sound is decomposed into collections of fragments along several dimensions. Subsequent processes of perceptual organization reassemble these fragments based on cues indicating a common source of origin, which are interpreted in the light of prior experience. In this way, the decomposed auditory scene is processed to extract coherent evidence for one or more sources. Auditory scene analysis in listeners has been studied for several decades, and recent years have seen a steady accumulation of computational models of perceptual organization. The purpose of this review is to describe the evidence for the nature of auditory organization in listeners and to explore the computational models which have been motivated by such evidence. The primary focus is on speech rather than on sources such as polyphonic music or non-speech ambient backgrounds, although all these domains are equally amenable to auditory organization. The review includes a discussion of the relationship between auditory scene analysis and alternative approaches to sound source segregation.
Article
A challenging problem for research in computational auditory scene analysis is the integration of evidence derived from multiple grouping principles. We describe a computational model which addresses this issue through the use of a 'blackboard' architecture. The model integrates evidence from multiple grouping principles at several levels of abstraction, and manages competition between principles in a manner that is consistent with psychophysical findings. In addition, the blackboard architecture allows heuristic knowledge to influence the organisation of an auditory scene. We demonstrate that the model can replicate listeners' perception of interleaved melodies, and is also able to segregate melodic lines from polyphonic, multi-timbral audio recordings. The applicability of the blackboard architecture to speech processing tasks is also discussed.
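A minimal sketch of the blackboard idea described above, under the assumption that each grouping principle acts as a knowledge source that posts scored hypotheses to a shared data structure and that a simple arbiter keeps the best-supported hypothesis. The specific cues, weights, and thresholds are illustrative, not those of the cited model.

```python
# Minimal blackboard sketch: independent "knowledge sources" (grouping
# principles) post scored hypotheses about how components belong together;
# a simple arbiter combines support and keeps the best hypothesis.
# Cues, weights, and data are illustrative assumptions only.

class Blackboard:
    def __init__(self):
        self.hypotheses = []  # (label, score, source) tuples

    def post(self, label, score, source):
        self.hypotheses.append((label, score, source))

    def best(self):
        # combine support for identical labels from different sources
        combined = {}
        for label, score, _ in self.hypotheses:
            combined[label] = combined.get(label, 0.0) + score
        return max(combined.items(), key=lambda kv: kv[1])

def harmonicity_expert(board, components):
    # favour grouping components that are near-integer multiples of the lowest
    f0 = min(components)
    harmonic = all(abs(round(c / f0) - c / f0) < 0.05 for c in components)
    board.post("one harmonic source", 1.0 if harmonic else 0.1, "harmonicity")

def onset_expert(board, onsets):
    # favour grouping components that start at nearly the same time
    synchronous = max(onsets) - min(onsets) < 0.02  # seconds
    board.post("one harmonic source", 0.8 if synchronous else 0.0, "common onset")
    board.post("two sources", 0.0 if synchronous else 0.8, "common onset")

if __name__ == "__main__":
    board = Blackboard()
    harmonicity_expert(board, components=[200.0, 400.0, 600.0])
    onset_expert(board, onsets=[0.00, 0.01, 0.01])
    print(board.best())  # both experts support a single harmonic source
```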
Article
GRANT is an expert system for finding sources of funding given research proposals. Its search method, constrained spreading activation, makes inferences about the goals of the user and thus finds information that the user did not explicitly request but that is likely to be useful. The architecture of GRANT and the implementation of constrained spreading activation are described, and GRANT's performance is evaluated.
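The following sketch illustrates constrained spreading activation over a toy semantic network: activation flows outward from query nodes, is attenuated at each step, and constraints block paths known to produce spurious matches. The example graph, decay value, and blocked edge are hypothetical; this is not GRANT's implementation.

```python
# Illustrative sketch of constrained spreading activation over a small
# semantic network (not GRANT's implementation). Activation spreads from
# seed nodes along edges, attenuated each step; constrained (blocked)
# edges are never followed.

def spread(graph, seeds, steps=2, decay=0.5, blocked_edges=frozenset()):
    """graph: {node: set of neighbours}; seeds: {node: initial activation}."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = {}
        for node, act in frontier.items():
            for nbr in graph.get(node, ()):
                if (node, nbr) in blocked_edges:   # path constraint
                    continue
                gain = act * decay
                next_frontier[nbr] = next_frontier.get(nbr, 0.0) + gain
                activation[nbr] = activation.get(nbr, 0.0) + gain
        frontier = next_frontier
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    graph = {
        "hearing loss": {"audiology", "aging"},
        "audiology": {"NIH hearing program"},
        "aging": {"gerontology fund"},
    }
    # block the path through "aging" as an example of a constraint
    ranking = spread(graph, seeds={"hearing loss": 1.0},
                     blocked_edges={("hearing loss", "aging")})
    print(ranking)
```

Nodes reached only through the blocked edge receive no activation, so the ranking surfaces funding sources related to the proposal's goals while suppressing a path judged unreliable.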
Conference Paper
Conventional speech recognition is notoriously vulnerable to additive noise, and even the best compensation methods are defeated if the noise is nonstationary. To address this problem, we propose a new integration of bottom-up techniques to identify 'coherent fragments' of spectro-temporal energy (based on local features), with the top-down hypothesis search of conventional speech recognition, extended to search also across possible assignments of each fragment as speech or interference. Initial tests demonstrate the feasibility of this approach, and achieve a relative reduction in word error rate of more than 25% at 5 dB SNR over stationary-noise missing-data recognition.
Conference Paper
In this paper, we will present an outline for an online recommender system for art works. The system, termed Virtual Museum Guide, will take into account the interest that visitors of an online museum express when recommending suitable art works, as well as the relationships that exist between art works in the collection. To keep the Virtual Museum Guide similar to a human museum guide, we based its design on principles from research on human memory. This way, the Virtual Museum Guide can 'remember' which is the most suitable art work to present, based on its perception of the visitor's interests and its knowledge of the works of art.
Article
Computational auditory scene analysis - modeling the human ability to organize sound mixtures according to their sources - has experienced a rapid evolution from simple implementations of psychoacoustically inspired rules to complex systems able to process demanding real-world sounds. Phenomena such as the continuity illusion and phonemic restoration show that the brain is able to use a wide range of knowledge-based contextual constraints when interpreting obscured or overlapping mixtures. To model such processing, we need architectures that operate by confirming hypotheses about the observations rather than relying on directly extracted descriptions. One such architecture, the 'prediction-driven' approach, is presented along with results from its initial implementation. This architecture can be extended to take advantage of the high-level knowledge implicit in today's speech recognizers by modifying a recognizer to act as one of the 'component models' providing the explanations of the signal mixture. A preliminary investigation indicates the viability of this approach while at the same time raising a number of issues which are discussed. These results point to the conclusion that successful scene analysis must, at every level, exploit abstract knowledge about sound sources.
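A minimal sketch of the predict-and-confirm idea behind such an architecture, assuming each component model predicts an energy spectrum and an explanation is a set of components whose summed prediction best matches the observation. The component names, spectra, and exhaustive search are illustrative simplifications, not the cited implementation.

```python
# Minimal predict-and-confirm loop in the spirit of a prediction-driven
# architecture (illustrative sketch only). Each "component model" predicts
# an energy spectrum; candidate explanations are subsets of components,
# scored by how well their summed prediction matches the observation.
import itertools

def prediction_error(observed, predictions):
    combined = [sum(vals) for vals in zip(*predictions)] if predictions \
               else [0.0] * len(observed)
    return sum((o - c) ** 2 for o, c in zip(observed, combined))

def best_explanation(observed, component_models):
    """component_models: {name: predicted spectrum (list of floats)}."""
    best_set, best_err = (), float("inf")
    names = list(component_models)
    for r in range(len(names) + 1):
        for subset in itertools.combinations(names, r):
            err = prediction_error(observed, [component_models[n] for n in subset])
            if err < best_err:
                best_set, best_err = subset, err
    return best_set, best_err

if __name__ == "__main__":
    observed = [1.0, 3.0, 1.0, 0.2]                # toy 4-band spectrum
    models = {
        "tonal component": [0.0, 3.0, 0.0, 0.0],
        "noise bed":       [1.0, 0.0, 1.0, 0.0],
        "click":           [0.0, 0.0, 0.0, 2.0],
    }
    print(best_explanation(observed, models))      # tonal + noise explain it best
```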
Article
This study investigated the role of auditory context in identifying individual environmental sounds. Limited previous research has not demonstrated a facilitatory influence of context on sound identification comparable to the benefits grammatical context provides for identifying speech sounds. Unlike some previous studies, in which context was determined by sound presented sequentially before and after the target environmental sound, in the present experiments the sounds to be identified were mixed in with a background of naturally occurring auditory scenes, such as the ambience of a beach or a street. Thirty‐one familiar environmental sounds from previous environmental sounds studies were presented mixed either with congruent scenes (such as hammering at a construction site) or incongruent scenes (such as a horse galloping in a restaurant). The sounds were presented at signal‐to‐background ratios of −12 and −16 dB and trained listeners identified them using three‐letter codes. Results indicate auditory context had a small but significant effect on identification performance. However, contrary to expectations, the sounds were identified more accurately in incongruent scenes (55% correct) rather than congruent ones (46% correct). The differential effects of auditory attention in determining foreground and background sounds will be discussed in connection with the results.
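The stimuli described above mix a target sound into a scene background at a fixed signal-to-background ratio. The sketch below shows one common way to construct such a mixture by scaling the target relative to the background RMS level; the exact stimulus construction used in the cited study may differ.

```python
# Sketch of mixing a target sound into a background scene at a chosen
# signal-to-background ratio (in dB). Illustrative only; the cited
# study's stimulus construction may differ.
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def mix_at_snr(target, background, snr_db):
    """Scale `target` so that 20*log10(rms(target)/rms(background)) == snr_db."""
    gain = (rms(background) / rms(target)) * 10 ** (snr_db / 20.0)
    scaled = target * gain
    n = min(len(scaled), len(background))
    return scaled[:n] + background[:n]

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    target = np.sin(2 * np.pi * 440 * t)          # stand-in for a target sound
    background = np.random.randn(sr) * 0.1        # stand-in for a scene ambience
    mixture = mix_at_snr(target, background, snr_db=-12.0)
    print(rms(target), rms(background), rms(mixture))
```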
Article
Ph.D. thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996, by Daniel P. W. Ellis. Includes bibliographical references (pp. 173-180).
Article
The premise of this paper is that the auditory system's primary function is its ability to determine the sources of sound. Auditory image perception and analysis are defined as the basis for sound source determination. Few studies in the literature have focused on understanding these abilities and the paper argues that more attention should be paid to auditory image perception and analysis. Four questions are posed for understanding auditory image formation and seven physical variables are described which might be used for auditory image perception. The paper relates auditory image perception and analysis to a number of other topics in the hearing sciences in order to reinforce the argument that auditory image perception and analysis are the basis of hearing.
Article
Nearly perfect speech recognition was observed under conditions of greatly reduced spectral information. Temporal envelopes of speech were extracted from broad frequency bands and were used to modulate noises of the same bandwidths. This manipulation preserved temporal envelope cues in each band but restricted the listener to severely degraded information on the distribution of spectral energy. The identification of consonants, vowels, and words in simple sentences improved markedly as the number of bands increased; high speech recognition performance was obtained with only three bands of modulated noise. Thus, the presentation of a dynamic temporal pattern in only a few broad spectral regions is sufficient for the recognition of speech.
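The manipulation described here, extracting temporal envelopes from broad frequency bands and using them to modulate noise of the same bandwidths, is commonly implemented as a noise-band vocoder. A hedged sketch follows; the band edges, filter order, and envelope extraction method are illustrative choices, not necessarily those of the cited study.

```python
# Sketch of a noise-band vocoder in the spirit of the manipulation above:
# band-pass the signal, take each band's temporal envelope, use it to
# modulate noise filtered to the same band, and sum the bands.
# Band edges and filter settings are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def bandpass(x, lo, hi, sr, order=4):
    sos = butter(order, [lo, hi], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, x)

def noise_vocode(signal, sr, band_edges=(100, 800, 1500, 4000)):
    rng = np.random.default_rng(0)
    out = np.zeros_like(signal)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = bandpass(signal, lo, hi, sr)
        envelope = np.abs(hilbert(band))            # temporal envelope of the band
        noise = bandpass(rng.standard_normal(len(signal)), lo, hi, sr)
        out += envelope * noise                     # envelope-modulated noise band
    return out

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    # amplitude-modulated tone as a stand-in for a speech-like signal
    speech_like = np.sin(2 * np.pi * 300 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))
    vocoded = noise_vocode(speech_like, sr)
    print(vocoded.shape)
```

With three band edges defining three bands, as here, the output carries only the per-band temporal envelopes, which is the kind of spectrally degraded stimulus the study reports as still supporting high recognition.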
Article
Environmental sound research is still in its beginning stages, although in recent years a body of research has begun to accumulate, both on the perception of environmental sounds themselves and on their practical applications in other areas of auditory research and cognitive science. This chapter details some of those practical applications, discusses the implications of environmental sound research for auditory perception in general, and finally outlines some outstanding issues and possible directions for future research.
Article
A key question for speech enhancement and simulations of auditory scene analysis in high levels of nonstationary noise is how to combine principles of auditory grouping and to integrate several noise-perturbed acoustical cues in a robust way. We present an application of recent online, nonlinear, non-Gaussian multidimensional statistical filtering methods which integrates tracking of sound-source direction and of the spectro-temporal dynamics of two mixed voices. The framework used is in agreement with the notion of evaluating competing hypotheses. To limit the number of hypotheses which need to be evaluated, the approach developed here uses a detailed statistical description of the high-dimensional spectro-temporal dynamics of speech, which is measured from a large speech database. The results show that the algorithm tracks sound-source directions very precisely, separates the voice envelopes with algorithmic convergence times down to 50 ms, and enhances the signal-to-noise ratio in adverse conditions, albeit at high computational cost. The approach has a high potential for efficiency improvements and could be applied to voice separation and the reduction of nonstationary noises.
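The online, nonlinear, non-Gaussian filtering named here belongs to the family of sequential Monte Carlo (particle filter) methods. The sketch below tracks a single source direction from noisy bearing observations with a bootstrap particle filter; the state, motion model, and noise levels are illustrative assumptions, far simpler than the spectro-temporal state tracked in the cited work.

```python
# Minimal bootstrap particle filter tracking one source direction (degrees)
# from noisy bearing observations. Illustrates the class of method named in
# the abstract; the cited system tracks much richer spectro-temporal state.
import numpy as np

def particle_filter(observations, n_particles=500,
                    process_std=2.0, obs_std=8.0, rng=None):
    rng = rng or np.random.default_rng(0)
    particles = rng.uniform(-90.0, 90.0, n_particles)   # prior over direction
    estimates = []
    for z in observations:
        # predict: random-walk motion model for the source direction
        particles = particles + rng.normal(0.0, process_std, n_particles)
        # weight: Gaussian likelihood of the observed bearing
        weights = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        weights /= weights.sum()
        estimates.append(float(np.sum(weights * particles)))
        # resample particles in proportion to their weights
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
    return estimates

if __name__ == "__main__":
    true_direction = 30.0
    rng = np.random.default_rng(1)
    noisy_bearings = true_direction + rng.normal(0, 8.0, 40)
    print(particle_filter(noisy_bearings)[-5:])  # estimates settle near 30 degrees
```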
Article
The statistical theory of speech recognition introduced several decades ago has brought about low word error rates for clean speech. However, it has been less successful in noisy conditions. Since extraneous acoustic sources are present in virtually all everyday speech communication conditions, the failure of the speech recognition model to take noise into account is perhaps the most serious obstacle to the application of ASR technology. Approaches to noise-robust speech recognition have traditionally taken one of two forms. One set of techniques attempts to estimate the noise and remove its effects from the target speech. While noise estimation can work in low-to-moderate levels of slowly varying noise, it fails completely in louder or more variable conditions. A second approach utilises noise models and attempts to decode speech taking into account their presence. Again, model-based techniques can work for simple noises, but they are computationally complex under realistic conditions and require models for all sources present in the signal. In this paper, we propose a statistical theory of speech recognition in the presence of other acoustic sources. Unlike earlier model-based approaches, our framework makes no assumptions about the noise background, although it can exploit such information if it is available. It does not require models for background sources, or an estimate of their number. The new approach extends statistical ASR by introducing a segregation model in addition to the conventional acoustic and language models. While the conventional statistical ASR problem is to find the most likely sequence of speech models which generated a given observation sequence, the new approach additionally determines the most likely set of signal fragments which make up the speech signal. Although the framework is completely general, we provide one interpretation of the segregation model based on missing-data theory. We derive an efficient HMM decoder, which searches both across subword state and across alternative segregations of the signal between target and interference. We call this modified system the speech fragment decoder. The value of the speech fragment decoder approach has been verified through experiments on small-vocabulary tasks in high-noise conditions. For instance, in a noise-corrupted connected digit task, the new approach decreases the word error rate in the condition of factory noise at 5 dB SNR from over 59% for a standard ASR system to less than 22%.
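The distinctive step in this framework is the search over segregations: every spectro-temporal fragment is assigned to either speech or background, and the assignment is decoded jointly with the word sequence. The toy sketch below enumerates fragment labellings and scores each against a crude "speech model"; it stands in for the HMM decoder and missing-data likelihoods of the cited work, and all names and scores are hypothetical.

```python
# Toy sketch of the segregation search in a speech fragment decoder:
# every spectro-temporal fragment is labelled "speech" or "background",
# and each labelling is scored against a crude speech model. The real
# system couples this search with an HMM decoder and missing-data
# likelihoods; this code only illustrates the structure of the search space.
import itertools

def speech_score(selected_fragments, speech_bands):
    """Crude match: reward selected energy that falls in expected speech bands."""
    score = 0.0
    for frag in selected_fragments:
        for band, energy in frag["energy"].items():
            score += energy if band in speech_bands else -energy
    return score

def best_segregation(fragments, speech_bands):
    best_labels, best_score = None, float("-inf")
    for labels in itertools.product(("speech", "background"), repeat=len(fragments)):
        selected = [f for f, lab in zip(fragments, labels) if lab == "speech"]
        score = speech_score(selected, speech_bands)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score

if __name__ == "__main__":
    fragments = [
        {"name": "low harmonic patch", "energy": {"low": 3.0, "mid": 1.0}},
        {"name": "broadband burst",    "energy": {"low": 1.0, "high": 4.0}},
        {"name": "mid formant patch",  "energy": {"mid": 2.5}},
    ]
    speech_bands = {"low", "mid"}   # bands the toy "speech model" expects
    print(best_segregation(fragments, speech_bands))
```

Even in this toy form, the number of labellings grows as 2 to the power of the number of fragments, which is why the cited work builds an efficient decoder that searches subword states and segregations together rather than enumerating them.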
Signal component estimation using phase and energy information
  • R. Violanda
  • H. van de Vooren
  • T. C. Andringa
R. Violanda, H. van de Vooren, and T. C. Andringa, Signal component estimation using phase and energy information, in preparation.
Information Processing &
  • P. R. Cohen
  • R. Kjeldsen