, were disappointing: performance fell between 70% and 75% in all cases.

Source publication

Topic and speaker identification via large vocabulary continuous speech recognition

Article

Full-text available

Jan 1993

In this paper we exhibit a novel approach to the problems of topic and speaker identification that makes use of a large vocabulary continuous speech recognizer. We present a theoretical framework which formulates the two tasks as complementary problems, and describe the symmetric way in which we have implemented their solution. Results of trials of...

Context 1

... the 120 test messages were rescored us- ing this adjustment, the results improved dramatically for all but the smallest list (where the keywords were too sparse for scores to be adequately estimated). The improved results are given in the last column of Table 2. ...

View in full-text

Context 2

... were surprised not to find a more pronounced benefit from using large numbers of keywords for the topic iden- tification task. Our prior experience had indicated that there were small but significant gains as the number of keywords grew and, although such a pattern is perhaps suggested by the results in Table 2, the gains (beyond those in the recalibration estimates) are too small to be considered significant. It is possible that with bet- ter modelling of keyword frequencies or by introducing acoustic distinctiveness as a keyword selection criterion, such improvements might be realized. ...

View in full-text

Design and collection of czech lombard speech database

Article

Full-text available

Jan 2005

In this paper, design, collection and parameters of newly proposed Czech Lombard Speech Database (CLSD) are presented. The database focuses on analysis and modeling of Lombard effect to achieve robust speech recognition improvement. The CLSD consists of neutral speech and speech produced in various types of simulated noisy background. In comparison...

The effect of language identification accuracy on speech recognition accuracy of proper names

Conference Paper

Full-text available

Nov 2017

Effect of reference set selection on speaker dependent speech recognition

Article

Full-text available

May 1981

Presented here for a speaker dependent system, is an algorithm which chooses a reference template for each word in the vocabulary from a set of N exemplars. The goal of the algorithm is to produce a reference set that minimizes the worst matching behavior and total error over the N sets of exemplars. The results of the experiments presented here sh...

Homophone Identification and Merging for Code-switched Speech Recognition

Conference Paper

Full-text available

Sep 2018

Adaptive Training for Large Vocabulary Continuous Speech Recognition

Article

Full-text available

Kai Yu

PHONEME MATCHING TECHNIQUES FOR SPEECH IDENTIFICATION PROBLEMS

Conference Paper

Mar 2024

From pose to activity: Surveying datasets and introducing CONVERSE

Article

Nov 2015
COMPUT VIS IMAGE UND

We present a review on the current state of publicly available datasets within the human action recognition community; highlighting the revival of pose based methods and recent progress of understanding person-person interaction modeling. We categorize datasets regarding several key properties for usage as a benchmark dataset; including the number of class labels, ground truths provided, and application domain they occupy. We also consider the level of abstraction of each dataset; grouping those that present actions, interactions and higher level semantic activities. The survey identifies key appearance and pose based datasets, noting a tendency for simplistic, emphasized, or scripted action classes that are often readily definable by a stable collection of sub-action gestures. There is a clear lack of datasets that provide closely related actions, those that are not implicitly identified via a series of poses and gestures, but rather a dynamic set of interactions. We therefore propose a novel dataset that represents complex conversational interactions between two individuals via 3D pose. 8 pairwise interactions describing 7 separate conversation based scenarios were collected using two Kinect depth sensors. The intention is to provide events that are constructed from numerous primitive actions, interactions and motions, over a period of time; providing a set of subtle action classes that are more representative of the real world, and a challenge to currently developed recognition methodologies. We believe this is among one of the first datasets devoted to conversational interaction classification using 3D pose features and the attributed papers show this task is indeed possible. The full dataset is made publicly available to the research community at www.csvision.swansea.ac.uk/converse.

Using latent topic features to improve binary classification of spoken documents

Conference Paper

Full-text available

Jun 2011
Acoust Speech Signal Process

Jonathan Wintrode

In many topic identification applications, supervised training labels are indirectly related to the semantic content of the documents being classified. For example, many topically distinct emails will all be assigned a single broad category label of "spam" or "not-spam", and a two-class classifier will lack direct knowledge of the underlying topic structure. This paper examines the degradation of topic identification performance on conversational speech when multiple semantic topics are combined into a single broad category. We then develop techniques using document clustering and Latent Dirchlet Allocation (LDA) to exploit the underlying semantic topics which improve performance over classifiers trained on the single category label by up to 20%.

The Beta-Binomial Mixture Model and Its Application to TDT Tracking and Detection

Article

Aug 2000

Stephen A. Lowe

This paper describes a continuous-mixture statistical model for word occurrence frequencies in documents, and the application of that model to the TDT topic identification tasks. This model was originally proposed by Gillick [1] as a means to account for variation in word frequencies across documents more accurately than the binomial and multinomial models. Further mathematical development of the model will be presented, along with performance results on the DARPA TDT December 1998 Evaluation Tracking Task. Application to the Detection Task will also be discussed. 1. INTRODUCTION Previous work at Dragon Systems on topic identification tasks has consistently followed a theme of defining document similarity using statistical measures [1, 2, 3, 4, 5, 6]. To elaborate, for a given document collection we construct a statistical model for the frequencies with which words (or other surface features, such as bigrams) occur in documents drawn from that collections. For example, we construct a...

The Beta-Binomial Mixture Model For Word Frequencies In Documents With Applications To Information Retrieval

Article

Jun 2000

Stephen A. Lowe

This paper describes a continuous-mixture statistical model for word occurrence frequencies in documents, and the application of that model to the DARPA-sponsored TDT topic identification tasks [1]. This model was originally proposed in 1990 by L. Gillick [2] as a means to account for variation in word frequencies across documents more accurately than the binomial model. The present paper presents further mathematical development of the model, leading to the implementation of a topic-tracking system. Performance results for this system on the Tracking Task in the December 1998 DARPA TDT Evaluation will be shown and compared with Dragon's existing, more complex multinomial-model-based system. (Results from other systems applied to this task are available in [3].) We will conclude with plans for further development. 1.

Person Identification Using Multimodal Biometrics under Different Challenges

Chapter

Full-text available

Jul 2018

Characterizing and Detecting Spontaneous Speech: Application To Speaker Role Recognition

Article

Jan 2014
SPEECH COMMUN

Processing spontaneous speech is one of the many challenges that automatic speech recognition systems have to deal with. The main characteristics of this kind of speech are disfluencies (filled pause, repetition, false start, etc.) and many studies have focused on their detection and correction. Spontaneous speech is defined in opposition to prepared speech, where utterances contain well-formed sentences close to those found in written documents. Acoustic and linguistic features made available by the use of an automatic speech recognition system are proposed to characterize and detect spontaneous speech segments from large audio databases. To better define this notion of spontaneous speech, segments of an 11-hour corpus (French Broadcast News) had been manually labeled according to three classes of spontaneity. Firstly, we present a study of these features. We then propose a two-level strategy to automatically assign a class of spontaneity to each speech segment. The proposed system reaches a 73.0% precision and a 73.5% recall on high spontaneous speech segments, and a 66.8% precision and a 69.6% recall on prepared speech segments. A quantitative study shows that the classes of spontaneity are useful information to characterize the speaker roles. This is confirmed by extending the speech spontaneity characterization approach to build an efficient automatic speaker role recognition system.

Elements of Speaker Verification

Article

Jan 2005

Herbert Gish

Semi-Automated Speech Transcription System Study

Article

Aug 1994

This report describes preliminary explorations towards the design of a semi-automatic transcription system. Current transcription practices were studied and are described in this report. The promising results of several speech recognition experiments as well as a topic identification experiment, all performed on broadcast data, are reported. These experiments were designed to gauge the quality of speech recognition on broadcast data and to explore possible uses of a continuous speech recognizer in a semi-automatic transcription system. Possible future directions for research are also reported.

Automatic language identification using large vocabulary continuous speech recognition

Article

May 1996

We have developed a highly accurate automatic language identification system based on large vocabulary continuous speech recognition (LVCSR). Each test utterance is recognized in a number of languages, and the language ID decision is based on the probability of the output word sequence reported by each recognizer. Recognizers were implemented for this test in English, Japanese, and Spanish, using the Ricardo corpus of telephone monologues. When tested on the OGI corpus of digitally recorded telephone speech, we obtained error rates of 3% or lower on 2-way and 3-way closed-set classification of ten-second and one-minute speech segments.

, were disappointing: performance fell between 70% and 75% in all cases.

Contexts in source publication

Similar publications

Citations