Table 2 - uploaded by James K. Baker
... were disappointing: performance fell between 70% and 75% in all cases.

Source publication
Article
Full-text available
In this paper we exhibit a novel approach to the problems of topic and speaker identification that makes use of a large vocabulary continuous speech recognizer. We present a theoretical framework which formulates the two tasks as complementary problems, and describe the symmetric way in which we have implemented their solution. Results of trials of...

Contexts in source publication

Context 1
... the 120 test messages were rescored using this adjustment, the results improved dramatically for all but the smallest list (where the keywords were too sparse for scores to be adequately estimated). The improved results are given in the last column of Table 2. ...
Context 2
... were surprised not to find a more pronounced benefit from using large numbers of keywords for the topic identification task. Our prior experience had indicated that there were small but significant gains as the number of keywords grew and, although such a pattern is perhaps suggested by the results in Table 2, the gains (beyond those in the recalibration estimates) are too small to be considered significant. It is possible that with better modelling of keyword frequencies or by introducing acoustic distinctiveness as a keyword selection criterion, such improvements might be realized. ...
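The keyword-likelihood scoring that these excerpts describe can be sketched as follows. The topics, keyword lists, message counts, and add-one smoothing here are invented for illustration; the smoothing stands in for the score-estimation problem the paper notes for its smallest keyword list.

```python
import math

# Estimate smoothed keyword rates for one topic from its training
# messages. Add-one smoothing guards against sparse keyword counts,
# the failure mode noted above for the smallest keyword list.
def keyword_rates(messages, vocab):
    total = sum(len(m) for m in messages) + len(vocab)
    counts = {w: 1 for w in vocab}  # add-one smoothing
    for m in messages:
        for w in m:
            if w in counts:
                counts[w] += 1
    return {w: c / total for w, c in counts.items()}

# Score a message under a topic: log-likelihood of its keywords.
def topic_score(message, rates):
    return sum(math.log(rates[w]) for w in message if w in rates)

vocab = ["budget", "tax", "coach", "inning"]          # hypothetical keywords
finance_train = [["budget", "tax", "tax"], ["budget"]]  # toy messages
sports_train = [["coach", "inning"], ["inning", "coach", "coach"]]

rates_fin = keyword_rates(finance_train, vocab)
rates_spo = keyword_rates(sports_train, vocab)

test = ["tax", "budget", "tax"]
scores = {"finance": topic_score(test, rates_fin),
          "sports": topic_score(test, rates_spo)}
best = max(scores, key=scores.get)
print(best)  # the higher-likelihood topic wins
```

The recalibration step discussed above would adjust these raw scores per keyword list before comparison; that adjustment is omitted here.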

Similar publications

Article
Full-text available
In this paper, design, collection and parameters of newly proposed Czech Lombard Speech Database (CLSD) are presented. The database focuses on analysis and modeling of Lombard effect to achieve robust speech recognition improvement. The CLSD consists of neutral speech and speech produced in various types of simulated noisy background. In comparison...
Article
Full-text available
Presented here, for a speaker-dependent system, is an algorithm which chooses a reference template for each word in the vocabulary from a set of N exemplars. The goal of the algorithm is to produce a reference set that minimizes the worst matching behavior and total error over the N sets of exemplars. The results of the experiments presented here sh...
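The minimax selection idea in this abstract can be sketched as follows: from the N exemplars of a word, pick the one whose worst-case distance to the remaining exemplars is smallest. A real system would use a DTW distance between acoustic feature sequences; absolute difference on toy scalar scores stands in here.

```python
# Choose the exemplar whose worst-case distance to the other
# exemplars is minimal (a minimax criterion).
def pick_reference(exemplars, dist):
    def worst(i):
        return max(dist(exemplars[i], exemplars[j])
                   for j in range(len(exemplars)) if j != i)
    best_i = min(range(len(exemplars)), key=worst)
    return exemplars[best_i]

# Hypothetical 1-D stand-ins for per-exemplar match behavior.
exemplars = [10.0, 11.0, 10.5, 14.0]
ref = pick_reference(exemplars, lambda a, b: abs(a - b))
print(ref)  # 11.0: its worst distance (3.0) is the smallest worst case
```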

Citations

... Examples of this are the speaker's identity, the language spoken, the gender of the speaker, or the topic under discussion. The latter problem has been discussed elsewhere [1,2,3], and so in this paper we will consider the first three problems, that is, speaker, language and gender identification. In Section 2 we will describe the system we use to match the incoming speech to a set of sub-word models, and in Section 3 we present the theory which underlies our approach to identification. In Section 4 we present recent results we have achieved in each of the application areas under discussion. 2. MATCHING PHONEME SEQUENCE The phoneme matching system comprises the first two stages shown in Figure 1. ...
... We also provide facial tracking features used for the tracking of gaze and facial dynamics, which have been used for feature fusion in [122]. Despite the benefit that audio provides to action classification [123, 124, 125, 20], the audio has been stripped from all recordings due to the private nature of the conversations that occurred during the interactions. This allows the conversations to be natural, providing a more realistic representation of the scenarios than if each subject was given a script. ...
Article
We present a review on the current state of publicly available datasets within the human action recognition community; highlighting the revival of pose based methods and recent progress of understanding person-person interaction modeling. We categorize datasets regarding several key properties for usage as a benchmark dataset; including the number of class labels, ground truths provided, and application domain they occupy. We also consider the level of abstraction of each dataset; grouping those that present actions, interactions and higher level semantic activities. The survey identifies key appearance and pose based datasets, noting a tendency for simplistic, emphasized, or scripted action classes that are often readily definable by a stable collection of sub-action gestures. There is a clear lack of datasets that provide closely related actions, those that are not implicitly identified via a series of poses and gestures, but rather a dynamic set of interactions. We therefore propose a novel dataset that represents complex conversational interactions between two individuals via 3D pose. 8 pairwise interactions describing 7 separate conversation based scenarios were collected using two Kinect depth sensors. The intention is to provide events that are constructed from numerous primitive actions, interactions and motions, over a period of time; providing a set of subtle action classes that are more representative of the real world, and a challenge to currently developed recognition methodologies. We believe this is among one of the first datasets devoted to conversational interaction classification using 3D pose features and the attributed papers show this task is indeed possible. The full dataset is made publicly available to the research community at www.csvision.swansea.ac.uk/converse.
... As in text information retrieval, topic identification (ID) can be used to improve search results, enrich browsing, or provide filtering of documents (such as spam detection). Topic ID of spoken documents has been part of the repertoire of speech retrieval and browsing since work on the Switchboard corpus in 1993 [1]. Much of the previous work has considered the effects of a variety of automatic speech recognition (ASR) approaches [2] [3], feature selection techniques [4], and non ASR-based approaches [5], on topic ID. ...
Conference Paper
Full-text available
In many topic identification applications, supervised training labels are indirectly related to the semantic content of the documents being classified. For example, many topically distinct emails will all be assigned a single broad category label of "spam" or "not-spam", and a two-class classifier will lack direct knowledge of the underlying topic structure. This paper examines the degradation of topic identification performance on conversational speech when multiple semantic topics are combined into a single broad category. We then develop techniques using document clustering and Latent Dirichlet Allocation (LDA) to exploit the underlying semantic topics which improve performance over classifiers trained on the single category label by up to 20%.
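The document-clustering side of this idea can be sketched in a few lines: documents sharing one broad label ("spam") are grouped by word overlap, and each cluster can then serve as a finer-grained training class. The paper uses LDA and proper clustering; a single pass of cosine-similarity assignment against two seed documents stands in here, and all documents are invented.

```python
import math
from collections import Counter

# Bag-of-words term-frequency vector for a document.
def tf_vector(doc):
    return Counter(doc.lower().split())

# Cosine similarity between two term-frequency Counters
# (missing keys in a Counter read as 0).
def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Assign each document to its nearest seed document.
def cluster(docs, seeds):
    seed_vecs = [tf_vector(s) for s in seeds]
    return [max(range(len(seeds)),
                key=lambda k: cosine(tf_vector(d), seed_vecs[k]))
            for d in docs]

spam = ["cheap pills online pharmacy pills",
        "win lottery prize win money",
        "online pharmacy cheap meds",
        "lottery winner claim prize money"]
labels = cluster(spam, seeds=[spam[0], spam[1]])
print(labels)  # pharmacy-like vs lottery-like sub-topics
```

A classifier trained on these induced sub-topic labels, rather than the single "spam" label, is the mechanism the abstract credits for the reported gains.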
... Previous work at Dragon Systems on topic identification tasks has consistently followed a theme of defining document similarity using statistical measures [1, 2, 3, 4, 5, 6]. To elaborate, for a given document collection we construct a statistical model for the frequencies with which words (or other surface features, such as bigrams) occur in documents drawn from that collection. ...
... This model bears a relationship to the 2-Poisson model of Harter [8, 9], which used a 2-component mixture of Poisson distributions instead of binomials, and to the continuous Gamma-Poisson mixture employed by Burrell [10]. A non-parametric approach to account for document variability within a source through the use of mixtures was used by Peskin and Gillick [3, 4, 5], in which the method was used to improve the reliability of keyword selection for use in a multinomial model. ...
Article
This paper describes a continuous-mixture statistical model for word occurrence frequencies in documents, and the application of that model to the TDT topic identification tasks. This model was originally proposed by Gillick [1] as a means to account for variation in word frequencies across documents more accurately than the binomial and multinomial models. Further mathematical development of the model will be presented, along with performance results on the DARPA TDT December 1998 Evaluation Tracking Task. Application to the Detection Task will also be discussed. 1. INTRODUCTION Previous work at Dragon Systems on topic identification tasks has consistently followed a theme of defining document similarity using statistical measures [1, 2, 3, 4, 5, 6]. To elaborate, for a given document collection we construct a statistical model for the frequencies with which words (or other surface features, such as bigrams) occur in documents drawn from that collection. For example, we construct a...
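The document-similarity theme described here can be sketched as a log-likelihood-ratio score: a document's word counts are scored under a topic's unigram (multinomial) model against a background model. The word probabilities and the 1e-6 floor for unseen words below are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical unigram probabilities for a "sports" topic model
# and a general background model.
topic_p = {"goal": 0.05, "match": 0.04, "the": 0.10}
background_p = {"goal": 0.005, "match": 0.004, "the": 0.10}

# Score = sum over words of count * log(P_topic / P_background).
# Words unknown to both models are skipped; a small floor handles
# words seen in only one model.
def llr_score(words, topic_p, background_p):
    counts = Counter(words)
    return sum(c * math.log(topic_p.get(w, 1e-6) /
                            background_p.get(w, 1e-6))
               for w, c in counts.items()
               if w in topic_p or w in background_p)

doc = "the match ended with a late goal".split()
print(round(llr_score(doc, topic_p, background_p), 3))  # positive: looks on-topic
```

A positive score indicates the document's word frequencies are better explained by the topic model than the background; common words like "the" contribute nothing because the two models assign them equal probability.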
... Previous work at Dragon Systems on topic identification tasks has consistently followed a theme of defining document similarity using statistical measures [2,4,5,6,7,8]. To elaborate, for a given document collection we construct a statistical model for the frequencies with which words (or other features, such as bigrams) occur in documents drawn from that collection. ...
... A non-parametric approach to account for document variability within a source through the use of mixtures was used by Peskin and Gillick [5,6,7], in which the method was used to improve the reliability of keyword selection for use in a multinomial model. ...
... It is also not hard to calculate the expected value and variance for the mixture output distribution of equation (1):

E[n/s] = λ,   Var[n/s] = (λ/s)(1 + sγ)   (7)

The main conceptual points to take from these relations are that the parameter λ is the expected value of n/s, and that the variance of P(n|s; λ, γ) increases with γ. ...
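The qualitative point of these relations, that the mixture's variance grows with the second parameter while its mean stays fixed, can be checked by simulation. The parametrization below is an assumption for illustration: the per-document rate theta is drawn from a Gamma with mean lam and variance lam*gam, and the count n is Poisson(s*theta), which gives E[n/s] = lam and Var[n/s] = (lam/s)(1 + s*gam).

```python
import math
import random

# Knuth's Poisson sampler; adequate for the small means used here.
def poisson(mu, rng):
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

# One draw of the observed rate n/s under the Gamma-Poisson mixture.
def sample_rate(lam, gam, s, rng):
    theta = rng.gammavariate(lam / gam, gam)  # mean lam, variance lam*gam
    return poisson(s * theta, rng) / s

rng = random.Random(0)
s, lam = 200, 0.02          # hypothetical document length and word rate
variances = {}
for gam in (0.001, 0.01):
    draws = [sample_rate(lam, gam, s, rng) for _ in range(20000)]
    mean = sum(draws) / len(draws)
    variances[gam] = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(f"gam={gam}: mean={mean:.4f}, var={variances[gam]:.6f}")
```

The empirical mean stays near lam for both settings, while the variance roughly triples between gam=0.001 and gam=0.01, matching (lam/s)(1 + s*gam).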
Article
This paper describes a continuous-mixture statistical model for word occurrence frequencies in documents, and the application of that model to the DARPA-sponsored TDT topic identification tasks [1]. This model was originally proposed in 1990 by L. Gillick [2] as a means to account for variation in word frequencies across documents more accurately than the binomial model. The present paper presents further mathematical development of the model, leading to the implementation of a topic-tracking system. Performance results for this system on the Tracking Task in the December 1998 DARPA TDT Evaluation will be shown and compared with Dragon's existing, more complex multinomial-model-based system. (Results from other systems applied to this task are available in [3].) We will conclude with plans for further development.
Article
Processing spontaneous speech is one of the many challenges that automatic speech recognition systems have to deal with. The main characteristics of this kind of speech are disfluencies (filled pause, repetition, false start, etc.) and many studies have focused on their detection and correction. Spontaneous speech is defined in opposition to prepared speech, where utterances contain well-formed sentences close to those found in written documents. Acoustic and linguistic features made available by the use of an automatic speech recognition system are proposed to characterize and detect spontaneous speech segments from large audio databases. To better define this notion of spontaneous speech, segments of an 11-hour corpus (French Broadcast News) had been manually labeled according to three classes of spontaneity. Firstly, we present a study of these features. We then propose a two-level strategy to automatically assign a class of spontaneity to each speech segment. The proposed system reaches a 73.0% precision and a 73.5% recall on high spontaneous speech segments, and a 66.8% precision and a 69.6% recall on prepared speech segments. A quantitative study shows that the classes of spontaneity are useful information to characterize the speaker roles. This is confirmed by extending the speech spontaneity characterization approach to build an efficient automatic speaker role recognition system.
Article
This report describes preliminary explorations towards the design of a semi-automatic transcription system. Current transcription practices were studied and are described in this report. The promising results of several speech recognition experiments as well as a topic identification experiment, all performed on broadcast data, are reported. These experiments were designed to gauge the quality of speech recognition on broadcast data and to explore possible uses of a continuous speech recognizer in a semi-automatic transcription system. Possible future directions for research are also reported.
Article
We have developed a highly accurate automatic language identification system based on large vocabulary continuous speech recognition (LVCSR). Each test utterance is recognized in a number of languages, and the language ID decision is based on the probability of the output word sequence reported by each recognizer. Recognizers were implemented for this test in English, Japanese, and Spanish, using the Ricardo corpus of telephone monologues. When tested on the OGI corpus of digitally recorded telephone speech, we obtained error rates of 3% or lower on 2-way and 3-way closed-set classification of ten-second and one-minute speech segments.
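The decision rule this abstract describes reduces to an argmax over per-language recognizer scores: run the utterance through one recognizer per language and pick the language whose output word sequence has the highest reported probability. The scores below are invented stand-ins for recognizer output log-probabilities.

```python
# Pick the language whose recognizer reports the highest
# log-probability for its output word sequence.
def identify_language(logprobs):
    return max(logprobs, key=logprobs.get)

# Hypothetical per-recognizer log-probabilities for one ten-second
# segment, assumed normalized per frame so utterance length cancels.
scores = {"English": -42.7, "Japanese": -51.3, "Spanish": -48.9}
print(identify_language(scores))  # English
```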