Article

A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition


... The hidden (unknown) state sequences of the leader and the follower are estimated from the observed state sequence using the decoding problem of an HMM [67]. Given the visible or observable state sequence O = {o_0, o_1, ..., o_{T-2}}, the decoding problem of an HMM finds the most likely hidden sequence using the backward algorithm, or β-pass algorithm [67]. ...
... The hidden (unknown) state sequences of the leader and the follower are estimated from the observed state sequence using the decoding problem of an HMM [67]. Given the visible or observable state sequence O = {o_0, o_1, ..., o_{T-2}}, the decoding problem of an HMM finds the most likely hidden sequence using the backward algorithm, or β-pass algorithm [67]. Hence, each element of O belongs to a set of observable states that are known and extracted using Eq. ...
... For example, π_0 is the probability of the hidden state sequence initially being at state s_0. Then, the most likely hidden states of the leader and the follower are estimated from the observed state sequence using the β-pass algorithm [67]. ...
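The β-pass mentioned in these excerpts is the standard HMM backward recursion. A minimal numpy sketch, assuming a discrete-observation HMM with transition matrix A, emission matrix B, and integer-coded observations (all names illustrative, not the cited paper's code):

```python
import numpy as np

def beta_pass(A, B, obs):
    """Backward (beta-pass) recursion for a discrete HMM.

    A   : (N, N) transition matrix, A[i, j] = P(q_{t+1}=j | q_t=i)
    B   : (N, M) emission matrix,   B[i, k] = P(o_t=k | q_t=i)
    obs : observed symbol indices o_0 .. o_{T-1}
    Returns beta with beta[t, i] = P(o_{t+1}, ..., o_{T-1} | q_t = i).
    """
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                      # base case: beta_{T-1}(i) = 1
    for t in range(T - 2, -1, -1):      # recurse backwards in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```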
Article
Full-text available
Traffic count (or link count) data represents the cumulative traffic in the lanes between two consecutive signalised intersections. Typically, dedicated infrastructure-based sensors are required for link count data collection. The lack of adequate data collection infrastructure leads to a lack of link count data for numerous cities, particularly those in low- and middle-income countries. Here, we address the research problem of link count estimation using crowd-sourced trajectory data to reduce the reliance on any dedicated infrastructure. A stochastic queue discharge model is developed to estimate link counts at signalised intersections, taking into account the sparsity and low penetration rate (i.e., the percentage of vehicles with known trajectory) brought on by crowdsourcing. The issue of poor penetration rate is tackled by constructing synthetic trajectories entirely from known trajectories. The proposed model further provides a methodology for estimating the delay resulting from the start-up loss time of the vehicles in the queue under unknown traffic conditions. The proposed model is implemented and validated with real-world data at a signalised intersection in Kolkata, India. Validation results demonstrate that the model can estimate link count with an average accuracy score of 82% with a very low penetration rate (not in the city, but at the intersection) of 5.09% in unknown traffic conditions, which is yet to be accomplished in the current state-of-the-art.
... where A is the transition matrix, B is the emission or observation distribution, and π is the initial state distribution [28]. Let ξ = [ξ_1, ξ_2, ξ_3, ..., ξ_N] be the definition of the hidden states. ...
... Assuming N = 4, ξ_1 refers to S1, ξ_2 refers to systole, ξ_3 refers to S2, and ξ_4 refers to diastole. Assuming that T represents an entire sequence, the state at time t is q_t and the entire sequence of states is Q [28]. Assuming that the observation sequence is ...
... However, it is impractical to evaluate every possible combination of Q for short-time sequences to find the optimal sequence. Therefore, the Viterbi algorithm is used to find the most likely state sequence via dynamic programming [28]. ...
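For reference, the Viterbi decoding these excerpts describe can be sketched in a few lines of numpy. This is a generic log-domain implementation, assuming strictly positive A, B, and pi as defined above, not the cited paper's code:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden state sequence via dynamic programming."""
    N, T = A.shape[0], len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))           # best log-prob ending in state i at t
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA      # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):     # trace back through the pointers
        path[t] = psi[t + 1, path[t + 1]]
    return path
```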
Thesis
Full-text available
The heart sound signal has been studied widely over the past several decades. The goal of this thesis is to investigate the impact of heart sound segmentation on heart sound classification. Heart sound segmentation, which involves dividing heart sounds into distinct parts, is commonly used as a pre-processing step for heart sound classification, but recent research has demonstrated that heart sound classification can also be effective without segmentation. Automated heart sound segmentation and classification, such as the detection of abnormal heart sounds, have the potential to screen for diseases in various clinical settings. The heart sound database used in this thesis is the publicly available dataset from the PhysioNet/CinC Challenge 2022. To investigate this question, the heart sounds are first segmented by a method of time-series query search with the TSSEARCH toolkit. The segmented heart sounds are then processed by a Convolutional Neural Network (CNN) model for the classification task. To compare performance, the classification results are compared to those obtained from heart sound classification without segmentation.
... An HMM λ = (A, B, π) is fully characterized by [32]: 1) its number N of states, the set of states being S = {s_1, ..., s_N}. ...
... A[s_i, s_j] = Prob(q_{t+1} = s_j | q_t = s_i) with 1 ≤ i, j ≤ N. 4) Its symbol probabilities matrix B verifying B[s_i, z_k] = Prob(z_k at time t | q_t = s_i) with 1 ≤ i ≤ N and 1 ≤ k ≤ M. 5) Its initial state probability vector π verifying π[s_i] = Prob(q_1 = s_i) with 1 ≤ i ≤ N. ...
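A toy instantiation of this characterization, with purely illustrative numbers (N = 2 states, M = 3 symbols), might look like the following sketch; the stochasticity checks mirror the constraints the definition implies:

```python
import numpy as np

# Toy HMM lambda = (A, B, pi); all values are illustrative.
A = np.array([[0.7, 0.3],           # A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],      # B[i, k] = P(z_k at time t | q_t = s_i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])           # pi[i] = P(q_1 = s_i)

# Each row of A and B, and pi itself, must be a probability distribution:
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```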
... To achieve this goal, we behave like a photographer who sequentially browses the vector u from its first dimension to its last dimension, and in each dimension k (1 ≤ k ≤ n), the photographer takes a suitable position for capturing the value of p(u_k), which materializes the adherence of the k-th component u_k to property p. In the current work, this suitable position is the angle (u, e_k), which is computed using Equation 2. ...
Article
Full-text available
Vectors are generally compared through the comparison of the exact values of their components. Existing techniques only compare two vectors and are limited by many factors when comparing vectors of different dimensions. This paper attempts to overcome these limitations by proposing a new technique based on hidden Markov models which enhances existing techniques by giving them the ability to compare two finite sets of vectors, each containing vectors of different dimensions, while specifying the set of targeted properties on which the comparison should be performed. Classification experiments conducted on three publicly available custom datasets demonstrated that when the suitable set of targeted properties is selected, the proposed approach outperforms existing techniques with accuracy gains reaching +82.3%.
... where α t (k) is the forward probability of procedure P k at step t. This is known as the Forward algorithm [52]. ...
... where Pr[X | A] is the probability of observing the sequence X given the HMM parameterized by A. We omit other parameters like P as they are shared by all HMMs. This problem can be solved by the Forward algorithm [52]. ...
... The estimated transition matrix is the fingerprint F of the attack trace X. We obtain it using the Expectation-Maximization algorithm [52], which finds the maximum likelihood estimate given the observed data. ...
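The Forward algorithm invoked in these excerpts computes Pr[X | A] by summing over all hidden paths. A minimal scaled-forward sketch in numpy (a generic illustration under the usual discrete-HMM assumptions, not SEA's implementation):

```python
import numpy as np

def forward_loglik(A, B, pi, obs):
    """Forward algorithm: log P(obs | lambda), with per-step scaling
    so long sequences do not underflow."""
    alpha = pi * B[:, obs[0]]               # alpha_1(i) = pi_i * b_i(o_1)
    loglik = 0.0
    for o in obs[1:]:
        c = alpha.sum()                     # scaling constant
        loglik += np.log(c)
        alpha = (alpha / c) @ A * B[:, o]   # induction step
    return loglik + np.log(alpha.sum())
```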
Preprint
Full-text available
Machine Learning (ML) systems are vulnerable to adversarial examples, particularly those from query-based black-box attacks. Despite various efforts to detect and prevent such attacks, there is a need for a more comprehensive approach to logging, analyzing, and sharing evidence of attacks. While classic security benefits from well-established forensics and intelligence sharing, Machine Learning is yet to find a way to profile its attackers and share information about them. In response, this paper introduces SEA, a novel ML security system to characterize black-box attacks on ML systems for forensic purposes and to facilitate human-explainable intelligence sharing. SEA leverages the Hidden Markov Models framework to attribute the observed query sequence to known attacks. It thus understands the attack's progression rather than just focusing on the final adversarial examples. Our evaluations reveal that SEA is effective at attack attribution, even on a second occurrence, and is robust to adaptive strategies designed to evade forensics analysis. Interestingly, SEA's explanations of the attack behavior allow us even to fingerprint specific minor implementation bugs in attack libraries. For example, we discover that the SignOPT and Square attack implementations in ART v1.14 send over 50% zero-difference queries. We thoroughly evaluate SEA in a variety of settings and demonstrate that it can recognize the same attack's second occurrence with 90+% Top-1 and 95+% Top-3 accuracy.
... First, whole-hash pre-masking with a 20% masking ratio is applied to both the training and validation data splits. Next, BERT was pre-trained, to minimize perplexity [25], for 40 epochs (roughly 240 hours). Pre-training was done in parallel on two NVIDIA Tesla V100 GPU chips with 128 GB of memory each. ...
... A.4.1 Evaluation metrics. To evaluate the quality of pre-training, we used perplexity [25] as a metric. Perplexity measures how "surprised" the model is when it sees new data and is a commonly used metric in NLP in general and masked language modeling specifically. ...
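Perplexity as used here is the exponential of the average per-token negative log-likelihood. A small illustrative helper (generic, not the cited paper's evaluation code):

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned to each
    (masked) target token; lower perplexity means a less "surprised" model.
    """
    return float(np.exp(-np.mean(token_log_probs)))

# Sanity check: assigning probability 0.25 to every token gives perplexity 4.
assert np.isclose(perplexity(np.log([0.25] * 10)), 4.0)
```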
Preprint
Full-text available
We empirically demonstrate that a transformer pre-trained on country-scale unlabeled human mobility data learns embeddings capable, through fine-tuning, of developing a deep understanding of the target geography and its corresponding mobility patterns. Utilizing an adaptation framework, we evaluate the performance of our pre-trained embeddings in encapsulating a broad spectrum of concepts directly and indirectly related to human mobility. This includes basic notions, such as geographic location and distance, and extends to more complex constructs, such as administrative divisions and land cover. Our extensive empirical analysis reveals a substantial performance boost gained from pre-training, reaching up to 38% in tasks such as tree-cover regression. We attribute this result to the ability of the pre-training to uncover meaningful patterns hidden in the raw data, beneficial for modeling relevant high-level concepts. The pre-trained embeddings emerge as robust representations of regions and trajectories, potentially valuable for a wide range of downstream applications.
... There is evidence that financial markets exhibit dual regime behaviour [17,18]. We test whether DRL agents can adapt to the different regimes using contextual reinforcement learning (CRL) [19] and a hidden Markov model (HMM) [20] to learn to predict the current regime. ...
... Specifically, we solve for the infinitesimal generator matrix Q such that e^(Q/12) equals the parameters in [17]. Then we get the required probabilities by computing the matrix exponential of Q at the required time step. The multivariate Gaussian HMM was trained using the Viterbi algorithm [20] implemented by hmmlearn (https://github.com/hmmlearn/hmmlearn). Due to the possibility of no bear regime appearing in a single episode, the HMM is trained for the first 10 episodes of each experimental run. Thereafter, the learned parameters are frozen. ...
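A hedged sketch of the recipe described above, using scipy for the generator matrix and hmmlearn's GaussianHMM for regime inference. The matrix entries and return data are placeholders, not the values from [17]:

```python
import numpy as np
from scipy.linalg import expm, logm
from hmmlearn.hmm import GaussianHMM

# Hypothetical monthly regime-transition matrix (illustrative values):
P_month = np.array([[0.97, 0.03],
                    [0.10, 0.90]])
Q = logm(P_month).real * 12       # generator: expm(Q / 12) == P_month
P_day = expm(Q / 252)             # transition matrix at daily frequency

# Two-regime Gaussian HMM on daily returns, shape (T, n_assets):
returns = np.random.default_rng(0).normal(size=(1000, 1))  # placeholder data
hmm = GaussianHMM(n_components=2, covariance_type="full", n_iter=100)
hmm.fit(returns)                  # EM parameter estimation
regimes = hmm.predict(returns)    # most likely regime path (Viterbi decode)
```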
Preprint
Full-text available
We evaluate benchmark deep reinforcement learning (DRL) algorithms on the task of portfolio optimisation under a simulator. The simulator is based on correlated geometric Brownian motion (GBM) with the Bertsimas-Lo (BL) market impact model. Using the Kelly criterion (log utility) as the objective, we can analytically derive the optimal policy without market impact and use it as an upper bound to measure performance when including market impact. We found that the off-policy algorithms DDPG, TD3 and SAC were unable to learn the right Q function due to the noisy rewards and therefore perform poorly. The on-policy algorithms PPO and A2C, with the use of generalised advantage estimation (GAE), were able to deal with the noise and derive a close-to-optimal policy. The clipping variant of PPO was found to be important in preventing the policy from deviating from the optimal once converged. In a more challenging environment where we have regime changes in the GBM parameters, we found that PPO, combined with a hidden Markov model (HMM) to learn and predict the regime context, is able to learn different policies adapted to each regime. Overall, we find that the sample complexity of these algorithms is too high, requiring more than 2 million steps to learn a good policy in the simplest setting, which is equivalent to almost 8,000 years of daily prices.
... DTW leads to an optimal alignment between time series under local and global constraints, composed of a forward pass that computes a global distortion and an optional backward pass that determines the warping function [17]. Another important technique for continuous speech recognition was introduced with Hidden Markov Models (HMM) [18], where a model's state and its sequence of states are hidden, but during training the state sequence is estimated along with the state transition probabilities and the observation probabilities for each state in the sequence. When testing an utterance, its sequence of acoustic features, taken as observations, is used to determine how likely it is to have been generated by each trained model, and the most likely one is selected. ...
... where mfcc_n[m] is computed varying m from 1 to the total number of MFCCs chosen (usually about 10-25). ...
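For illustration, extracting MFCCs in that typical 10-25 coefficient range can be done with librosa; the file name, sample rate, and coefficient count below are placeholder choices, not the cited paper's setup:

```python
import librosa

# Load an utterance and compute 13 MFCCs per analysis frame.
y, sr = librosa.load("utterance.wav", sr=16000)   # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)   # (13, n_frames): one coefficient vector per frame
```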
Article
Full-text available
Keyword Spotting (KWS) has been the subject of research in recent years given the increase of embedded systems for command recognition such as Alexa, Google Home, and Siri. Performance, model size, processing time, and robustness to noise are fundamental in these systems. Furthermore, applications in embedded systems demand computationally efficient models that can be implemented in current technology. In this work, an approach for keyword recognition is evaluated using three deep learning models, namely LeNet-5, SqueezeNet, and EfficientNet-B0. We evaluate transfer learning, pruning and quantization strategies in training and testing using noisy and clean speech signals. In addition, compression techniques such as pruning and quantization were assessed in terms of the size reduction of the model footprint and the accuracy obtained in each case. Using Google's Speech Commands dataset and an additive babble noise signal, our keyword recognition approach achieves an accuracy of 94.6% using unstructured pruning of 80% of the parameters of the original SqueezeNet network, with a reduction of 70% in the model size.
... The acoustic model is typically a machine learning model that maps the spectrogram or extracted acoustic features to an intermediate representation. Traditional ASR systems use hidden Markov models (HMMs) and Gaussian mixture models (GMMs) [17,49], while modern ASR systems employ DNNs, such as convolutional neural networks (CNNs) and Transformers [23,36]. • Decoder. ...
Preprint
In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. Through a comprehensive review and categorization of modern ASR technologies, we first meticulously select surrogate ASRs of diverse types to generate adversarial examples. Following this, ZQ-Attack initializes the adversarial perturbation with a scaled target command audio, rendering it relatively imperceptible while maintaining effectiveness. Subsequently, to achieve high transferability of adversarial perturbations, we propose a sequential ensemble optimization algorithm, which iteratively optimizes the adversarial perturbation on each surrogate model, leveraging collaborative information from other models. We conduct extensive experiments to evaluate ZQ-Attack. In the over-the-line setting, ZQ-Attack achieves a 100% success rate of attack (SRoA) with an average signal-to-noise ratio (SNR) of 21.91dB on 4 online speech recognition services, and attains an average SRoA of 100% and SNR of 19.67dB on 16 open-source ASRs. For commercial intelligent voice control devices, ZQ-Attack also achieves a 100% SRoA with an average SNR of 15.77dB in the over-the-air setting.
... Q_f is a set of final states. α_t(i) = P(O_1, O_2, ..., O_t, q_t = S_i | λ) (10) β_t(i) = P(O_{t+1}, O_{t+2}, ..., O_T | q_t = S_i, λ) (11) This specifies the probability of the partial observation sequence O_{t+1}, O_{t+2}, ..., O_T, given state q_t = S_i and model λ (λ being the IP risk model, a particular and unique combination of IP risk factors/events including their value estimates from the IP risk database). ...
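Equations (10) and (11) are the standard HMM forward and backward variables. A short numpy sketch that computes both (unscaled, so suitable only for short sequences) and checks the identity that Σ_i α_t(i)β_t(i) = P(O | λ) for every t:

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Compute alpha_t(i) (Eq. 10) and beta_t(i) (Eq. 11) for a discrete HMM."""
    N, T = A.shape[0], len(obs)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                       # forward recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):              # backward recursion
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # For every t, sum_i alpha_t(i) * beta_t(i) equals P(O | lambda):
    assert np.allclose((alpha * beta).sum(axis=1), alpha[-1].sum())
    return alpha, beta
```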
Research
Full-text available
An important element of intellectual property (IP) risk management is valuation, forecasting and strategy. Forecasting the optimal likelihood probabilities for the risk can be an audacious exercise, but it is critical in understanding the damage that can be caused by infringement, IP rights litigation, etc., providing the basis for prioritizing risk management activities and allocating resources. In this paper, the occurrence and interaction of risk events as they impact intellectual property management are modeled as a Hidden Markov Model (HMM). The paper presents the HMM as a tool that can be used to optimize IP risk management response, and develops an HMM that can be used to predict the maximum likelihood probability for IP risk. This gives substantial information for optimal planning and coordination of IP risk response activities. KEY TERMS: IP, risk management, HMM maximum likelihood probabilities, IP risk features.
... The author provided a summary of research works that primarily focus on the application of deep learning algorithms for the detection of pavement degradation. The works being assessed utilize CNN and DCNN techniques on various image datasets, encompassing Google Street View pictures, smartphone photos, and 3D images generated through specialized hardware like GPR [12]. Literature highlights the commendable performance of both CNN and DCNN in the classification of pavement images. ...
Article
Full-text available
Intelligent Transportation Systems (ITS) provide state-of-the-art real-time integration of vehicles and intelligent systems. Collectively, these technologies have the capability to communicate between system users, roads, and infrastructure. This study presents a comprehensive examination of the applications and implications of AI and ML in the development of an ITS. The primary objective is to effectively mitigate traffic congestion and enhance road safety measures to prevent accidents. Subsequently, we examine different machine learning methodologies employed in the identification of road traffic based on vehicles and their junctions, with the purpose of evading impediments and forecasting real-time traffic patterns, to attain intelligent and effective transportation systems. The exponential growth of the population inside the country has resulted in a corresponding rise in the utilization of vehicles and various modes of transportation, thereby contributing to the exacerbation of traffic congestion and the occurrence of road accidents. Therefore, there exists a need for intelligent transportation systems that can offer dependable transportation services while simultaneously upholding environmental standards to overcome traffic congestion. Designing accurate models for predicting traffic density is a crucial task in the field of transportation systems. This study compares ML models derived using a variety of machine-learning approaches. Supervised machine learning algorithms, including Naive Bayes, Markov models, KNN, linear regression, and SVM, are employed. The results suggest that the Markov model achieves the highest level of accuracy, of 98%. Implementation of ITS with the Markov model provides the best performance in a resilient environment.
... In this framework, the life of an individual is summarized as a set of possible discrete hidden states that represent a succession of life stages, and it follows a Markov process, meaning that the present state of an individual at a given generation depends only on the state at the previous generation through the transition probability. The observed outcome depends only on the present hidden state, through what is referred to as the observation or emission probability (Rabiner, 1989). ...
... The Hidden Markov Model is a probabilistic model about sequences. It describes the process where an underlying Markov chain generates an unobservable sequence of states, followed by the generation of an observed sequence of outcomes based on each state [7]. ...
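The two-stage generative process described in [7] can be made concrete with a short sampling routine. A generic sketch, with A, B, and pi as in the earlier discrete-HMM definitions:

```python
import numpy as np

def sample_hmm(A, B, pi, T, seed=0):
    """Generate (hidden states, observations) from a discrete HMM:
    a Markov chain produces the unobservable state sequence, and each
    state then emits an observed symbol from its own distribution."""
    rng = np.random.default_rng(seed)
    N, M = B.shape
    states = np.zeros(T, dtype=int)
    obs = np.zeros(T, dtype=int)
    states[0] = rng.choice(N, p=pi)
    obs[0] = rng.choice(M, p=B[states[0]])
    for t in range(1, T):
        states[t] = rng.choice(N, p=A[states[t - 1]])   # hidden transition
        obs[t] = rng.choice(M, p=B[states[t]])          # emission
    return states, obs
```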
Article
Full-text available
This paper employs the Auto-Encoding Variational Bayes (AEVB) estimator based on Stochastic Gradient Variational Bayes (SGVB), designed to optimize recognition models for challenging posterior distributions and large-scale datasets. It has been applied to the MNIST dataset and extended to form a Dynamic Bayesian Network (DBN) in the context of time series. The paper delves into Bayesian inference, variational methods, and the fusion of Variational Autoencoders (VAEs) and variational techniques. Emphasis is placed on reparameterization for achieving efficient optimization. AEVB employs VAEs as an approximation for intricate posterior distributions.
... Profile HMMs show greater sensitivity [25,29,46] than other homology search methods such as BLAST [5], LAST [26], and MMseqs2 [46]. The sensitivity of pHMMs is due to a combination of (i) position specific scores [19] learned from sequence family members and (ii) implementation of the Forward algorithm [39,30], which sums the probabilities of all possible alignments between the aligned pair of sequences. The Forward algorithm is responsible for much of the sensitivity gains of pHMMs, but is computationally expensive. ...
Preprint
Full-text available
We present NEAR, a method based on representation learning that is designed to rapidly identify good sequence alignment candidates from a large protein database. NEAR's neural embedding model computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of k-NN search, filtration, and neighbor aggregation. NEAR's ResNet embedding model is trained using an N-pairs loss function guided by sequence alignments generated by the widely used HMMER3 tool. Benchmarking results reveal improved performance relative to state-of-the-art neural embedding models specifically developed for protein sequences, as well as enhanced speed relative to the alignment-based filtering strategy used in HMMER3's sensitive alignment pipeline.
... Hidden Markov models (HMMs) [1] are the most commonly used acoustic models for the state-of-the-art speech recognition systems. However, the "state output probability conditional independence" assumption underlying HMM fails to model the correlations among speech frames. ...
... Our work rests on the crucial fact, elucidated in [8], that finitely correlated states can be seen as a generalization to the quantum setting of stochastic processes admitting finite dimensional linear models, so-called quasi-realizations [9]. In the classical case, a special subclass of the latter go by the name of hidden Markov models [9,10] (also known as positive realizations). In the quantum case, the natural analogue of hidden Markov models are the aforementioned description of states generated by consecutive applications of a quantum channel to a memory system. ...
Preprint
Full-text available
We show that marginals of subchains of length t of any finitely correlated translation invariant state on a chain can be learned, in trace distance, with O(t²) copies -- with an explicit dependence on local dimension, memory dimension and spectral properties of a certain map constructed from the state -- and computational complexity polynomial in t. The algorithm requires only the estimation of a marginal of a controlled size, in the worst case bounded by a multiple of the minimum bond dimension, from which it reconstructs a translation invariant matrix product operator. In the analysis, a central role is played by the theory of operator systems. A refined error bound can be proven for C*-finitely correlated states, which have an operational interpretation in terms of sequential quantum channels applied to the memory system. We can also obtain an analogous error bound for a class of matrix product density operators reconstructible by local marginals. In this case, a linear number of marginals must be estimated, obtaining a sample complexity of Õ(t³). The learning algorithm also works for states that are only close to a finitely correlated state, with the potential of providing competitive algorithms for other interesting families of states.
... Since early appearance in the statistical literature (Baum and Petrie, 1966;Baum et al., 1970) and popularization in speech recognition (Rabiner, 1989), Hidden Markov models (HMMs) have been used to solve a broad range of problems ranging from texture recognition (Bose and Kuo, 1994), to gene prediction (Stanke and Waack, 2003), and weather forecasting (Hughes et al., 1999). The influential paper of Ghahramani and Jordan (1997) introduced the class of Factorial hidden Markov models (FHMMs), in which the hidden Markov chain is a multivariate process, with a-priori independent coordinates. ...
Article
We propose algorithms for approximate filtering and smoothing in high-dimensional Factorial hidden Markov models. The approximation involves discarding, in a principled way, likelihood factors according to a notion of locality in a factor graph associated with the emission distribution. This allows the exponential-in-dimension cost of exact filtering and smoothing to be avoided. We prove that the approximation accuracy, measured in a local total variation norm, is "dimension-free" in the sense that as the overall dimension of the model increases the error bounds we derive do not necessarily degrade. A key step in the analysis is to quantify the error introduced by localizing the likelihood function in a Bayes' rule update. The factorial structure of the likelihood function which we exploit arises naturally when data have known spatial or network structure. We demonstrate the new algorithms on synthetic examples and a London Underground passenger flow problem, where the factor graph is effectively given by the train network.
... Sequence tagging can be modelled with Linear-Chain CRF (Lafferty et al., 2001). The partition function for linear-chain models is computed with the forward algorithm (Rabiner, 1990). The computational complexity is O(m 2 n) for m tags and sequence of length n. ...
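The O(m²n) forward computation of the linear-chain partition function mentioned here can be sketched in log space. A generic illustration (assumed score-matrix names, not SynJax's implementation):

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(log_emit, log_trans):
    """log Z for a linear-chain CRF via the forward algorithm.

    log_emit : (n, m) per-position tag scores for a length-n sequence
    log_trans: (m, m) tag-to-tag transition scores
    Runs in O(m^2 n): one (m x m) logsumexp sweep per position.
    """
    alpha = log_emit[0]
    for t in range(1, len(log_emit)):
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_emit[t]
    return logsumexp(alpha)
```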
Preprint
Full-text available
The development of deep learning software libraries enabled significant progress in the field by allowing users to focus on modeling, while letting the library to take care of the tedious and time-consuming task of optimizing execution for modern hardware accelerators. However, this has benefited only particular types of deep learning models, such as Transformers, whose primitives map easily to the vectorized computation. The models that explicitly account for structured objects, such as trees and segmentations, did not benefit equally because they require custom algorithms that are difficult to implement in a vectorized form. SynJax directly addresses this problem by providing an efficient vectorized implementation of inference algorithms for structured distributions covering alignment, tagging, segmentation, constituency trees and spanning trees. With SynJax we can build large-scale differentiable models that explicitly model structure in the data. The code is available at https://github.com/deepmind/synjax.
... It operates by iteratively denoising Gaussian noise to produce the image x_0. Typically, the diffusion model assumes a Markov process [44] wherein Gaussian noise is gradually added to a clean image x_0 based on the following equation: ...
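The equation itself is truncated in the excerpt. As a plausible reading, the standard DDPM forward-noising step q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I) looks like this in numpy; this is an assumption about the intended formula, not necessarily the cited paper's exact form:

```python
import numpy as np

def diffuse_step(x_prev, beta_t, seed=0):
    """One Markov forward-noising step in the standard DDPM form:
    scale the previous image down and add Gaussian noise of matching
    variance, so the marginal variance stays controlled."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise
```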
Preprint
Full-text available
While language-guided image manipulation has made remarkable progress, the challenge of how to instruct the manipulation process faithfully reflecting human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If this possibility exists, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity in uncovering human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.
... Since the release of larger scale noisy datasets [11,12] and methods to clean them [13], many efforts [14,15] have focused on building conceptually simpler end-to-end systems that enable quick iteration and can make use of larger amounts of data. Many of these systems are trained using the Connectionist Temporal Classification (CTC) loss [16], which is a specific instance of the forward-backward (FB) algorithm [17]. ...
... Glow-TTS [1] integrates a DP algorithm into a flow-based model [1] to obtain unobserved monotonic alignments in parallel. Other models with structured latent representations, such as HMMs [21] and PCFGs [22], make strong assumptions about the model structure which could limit their flexibility and applicability. Instead, we propose a unified framework that enables VAEs to capture sparse latent optimal paths, allowing for flexible adaptation to a variety of tasks. ...
Preprint
Full-text available
We propose a unified approach to obtain structured sparse optimal paths in the latent space of a variational autoencoder (VAE) using dynamic programming and Gumbel propagation. We solve the classical optimal path problem by a probability softening solution, called the stochastic optimal path, and transform a wide range of DP problems into directed acyclic graphs in which all possible paths follow a Gibbs distribution. We show the equivalence of the Gibbs distribution to a message-passing algorithm by the properties of the Gumbel distribution and give all the ingredients required for variational Bayesian inference. Our approach obtaining latent optimal paths enables end-to-end training for generative tasks in which models rely on the information of unobserved structural features. We validate the behavior of our approach and showcase its applicability in two real-world applications: text-to-speech and singing voice synthesis.
... Interpretable NLP systems can foster trust by enabling end-users, practitioners, and researchers to understand the model's prediction mechanisms, and ensure ethical NLP practices. Historically, traditional NLP systems, such as rule-based methods (Woods, 1973), Hidden Markov models (Ghahramani, 2001; Rabiner, 1989), and logistic regression (Cramer, 2002), were inherently interpretable, known as white-box techniques. However, recent advancements in NLP, most of which are black-box methods, come at the cost of a loss in interpretability. ...
Preprint
Recent progress in large language models has enabled the deployment of many generative NLP applications. At the same time, it has also led to a misleading public discourse that ``it's all been solved.'' Not surprisingly, this has in turn made many NLP researchers -- especially those at the beginning of their career -- wonder about what NLP research area they should focus on. This document is a compilation of NLP research directions that are rich for exploration, reflecting the views of a diverse group of PhD students in an academic research lab. While we identify many research areas, many others exist; we do not cover those areas that are currently addressed by LLMs but where LLMs lag behind in performance, or those focused on LLM development. We welcome suggestions for other research directions to include: https://bit.ly/nlp-era-llm
Chapter
Smarter apps and connected devices are now possible because of the proliferation of IoT, which has greatly improved the quality of life in today's urban centers. ML and IoT approaches have been employed in the study of smart transportation, which has attracted a large number of researchers. Smart transportation is viewed as a catch-all word that encompasses a wide range of topics, including optimization of route, parking, street lighting, accident detection, abnormalities on the road, and other infrastructure-related issues. The purpose of this chapter is to examine the state of machine learning (ML) and internet of things (IoT) applications for smart city transport in order to better comprehend recent advances in these fields and to spot any holes in coverage. From the existing publications it's clear that ML may be underrepresented in smart lighting and smart parking systems. Additionally, researchers' favorite applications in terms of transportation system's intelligence include optimization of route, smart parking management, and accident/collision detection.
Article
Full-text available
The Internet has revolutionized the way we live. The banking sector has grown greatly in popularity, and users have now taken up online banking. But this revolution, even though it has paved a greater way to help people, carries a number of risks and security flaws that in turn affect ordinary users: attackers can retrieve confidential information about them, especially in the field of online banking. To avoid this, we have planned to implement and develop applications in such a way that they provide secured transactions against unauthorized users and hackers. In the early days people used textual passwords to protect their information from hackers, but these passwords remain vulnerable to various attacks like shoulder surfing, phishing, dictionary attacks, etc. In our concept we are going to provide security in the login phase instead of the transaction phase, so that hackers are filtered out at login itself. In the login phase, users are presented with a set of secret questions which can only be answered by the authorized user, so unauthorized users are filtered out and cannot proceed to the transaction phase. This technique is all the more beneficial since security is provided in the login phase itself.
Article
This paper studies a high-speed text-independent Automatic Speaker Recognition (ASR) algorithm based on the Gaussian Mixture Model (GMM) on a multicore system. The high speed is achieved using parallel implementation of the feature extraction and aggregation methods during training and testing procedures. Shared-memory parallel programming techniques using both the OpenMP and PThreads libraries are developed to accelerate the code and improve the performance of the ASR algorithm. The experimental results show speed-up improvements of around 3.2 on a personal laptop with an Intel i5-6300HQ (2.3 GHz, four cores without hyper-threading, and 8 GB of RAM). In addition, a remarkable 100% speaker recognition accuracy is achieved.
Conference Paper
Full-text available
Driver intention is crucial to advanced driving assist system design. In this paper, a Hidden Markov Model approach is adopted to recognize driver steering intention, which is further used to adapt the desired handling model in vehicle stability control. When emergency steering is recognized, a fuzzy-based yaw moment controller is used to calculate the desired corrective torque, thus improving the yaw rate response of the vehicle. Simulation and test results show that the proposed controller is able to coordinate driver steering and torque vectoring, and that both vehicle stability and driver steering feel can be improved.
Article
Full-text available
We propose a new modeling framework to compute the most likely path for stochastic hidden systems, where the computation is based on the control theory of discrete event systems. The main innovation in this proposed model is calculating which event will have a higher probability of occurring in the future by applying a k-step lookahead to the likelihood of events occurring at discrete times, which gives us the best way to transition between situations. We encode the problem as a node built with synchronous data-flow equations; then we apply the synthesis algorithm to the node in order to generate a controller that will find the most likely state sequence, where the algorithm is limited to a sliding window of a fixed number of discrete steps. We experimentally evaluate and validate our approach by comparing it with several algorithms, which are the most common and suitable algorithms applied for best path calculation.
Article
Hidden Markov Chains (HMC) and Recurrent Neural Networks (RNN) are two well known tools for predicting time series. Even though these solutions were developed independently in distinct communities, they share some similarities when considered as probabilistic structures. So in this paper we first consider HMC and RNN as generative models, and we embed both structures in a common generative unified model (GUM). We next address a comparative study of the expressivity (or modelling power) of these models, which here refers to the range of the joint probability distribution of an observations sequence, induced by the underlying latent variables. To that end we assume that the models are furthermore linear and Gaussian. The probability distributions produced by these models are characterized by structured covariance series, and as a consequence expressivity reduces to comparing sets of structured covariance series, which enables us to call for stochastic realization theory (SRT). We finally provide conditions under which a given covariance series can be realized by a GUM, an HMC or an RNN.
Conference Paper
Full-text available
Lemmatisation, one of the most important stages of text preprocessing, consists in grouping the inflected forms of a word together so they can be analysed as a single item, identified by the word's lemma, or dictionary form. It is not a very complicated task for languages such as English, where a paradigm consists of a few forms close in spelling; but when it comes to morphologically rich languages, such as Russian, Hungarian or Irish, lemmatisation becomes more challenging. However, this task is often considered solved for most resource-rich modern languages regardless of their morphological type. The situation is dramatically different for ancient languages characterised not only by a rich inflectional system, but also by a high level of orthographic variation and, more importantly, a very small amount of available data. These factors make automatic morphological analysis of historical language data an underrepresented field in comparison to other NLP tasks. This work describes a case of creating an Early Irish lemmatiser with a character-level sequence-to-sequence learning method that proves efficient in overcoming data scarcity. A simple character-level sequence-to-sequence model trained for 34,000 iterations reached an accuracy score of 99.2% for known words and 64.9% for unknown words on a rather small corpus of 83,155 samples. It outperforms both the baseline and the rule-based models described in [21] and [76], and meets the results of other systems working with historical data.
Article
We consider probabilistic models for sequential observations which exhibit gradual transitions among a finite number of states. We are particularly motivated by applications such as human activity analysis where observed accelerometer time series contains segments representing distinct activities, which we call pure states , as well as periods characterized by continuous transition among these pure states. To capture this transitory behavior, the dynamical Wasserstein barycenter (DWB) model of (Cheng et al., 2021) associates with each pure state a data-generating distribution and models the continuous transitions among these states as a Wasserstein barycenter of these distributions with dynamically evolving weights. Focusing on the univariate case where Wasserstein distances and barycenters can be computed in closed form, we extend (Cheng et al., 2021) specifically relaxing the parameterization of the pure states as Gaussian distributions. We highlight issues related to the uniqueness in identifying the model parameters as well as uncertainties induced when estimating a dynamically evolving distribution from a limited number of samples. To ameliorate non-uniqueness, we introduce regularization that imposes temporal smoothness on the dynamics of the barycentric weights. A quantile-based approximation of the pure state distributions yields a finite dimensional estimation problem which we numerically solve using cyclic descent alternating between updates to the pure-state quantile functions and the barycentric weights. We demonstrate the utility of the proposed algorithm in segmenting both simulated and real world human activity time series.
Article
Full-text available
Latent variable models are widely used to perform unsupervised segmentation of time series in different contexts such as robotics, speech recognition, and economics. One of the most widely used latent variable models is the Auto-Regressive Hidden Markov Model (ARHMM), which combines a latent mode governed by Markov chain dynamics with linear Auto-Regressive dynamics of the observed state. In this work, we propose two generalizations of the ARHMM. First, we propose a more general AR dynamics in Cartesian space, described as a linear combination of non-linear basis functions. Second, we propose a linear dynamics in unit quaternion space, in order to properly describe orientations. These extensions allow describing more complex dynamics of the observed state. Although this extension is proposed for the ARHMM, it can be easily extended to other latent variable models with AR dynamics in the observed space, such as Auto-Regressive Hidden semi-Markov Models.
Article
The aspiration for insight into human cognitive processing has traditionally driven research in cognitive science. With methods such as the Hidden semi-Markov Model-Electroencephalography (HsMM-EEG) method, new approaches have been developed that help to understand the temporal structure of cognition by identifying temporally discrete processing stages. However, it remains challenging to assign concrete functional contributions by specific processing stages to the overall cognitive process. In this paper, we address this challenge by linking HsMM-EEG with cognitive modelling, with the aim of further validating the HsMM-EEG method and demonstrating the potential of cognitive models to facilitate functional interpretation of processing stages. For this purpose, we applied HsMM-EEG to data from a mental rotation task and developed an ACT-R cognitive model that is able to closely replicate human performance in this task. Applying HsMM-EEG to the mental rotation experiment data revealed a strong likelihood for 6 distinct stages of cognitive processing during trials, with an additional stage for non-rotated conditions. The cognitive model predicted intra-trial mental activity patterns that project well onto the processing stages, while explaining the additional stage as a marker of non-spatial shortcut use. Thereby, this combined methodology provided substantially more information than either method by itself and suggests conclusions for cognitive processing in general.
Article
The integration of information and communication technologies (ICT) can be of great utility in monitoring and evaluating the health condition of the elderly and their behavior in performing Activities of Daily Living (ADL), with the aim of delaying, as long as possible, recourse to health care institutions (e.g., nursing homes and hospitals). In this research, we propose a predictive model for detecting behavioral and health-related changes in a patient who is monitored continuously in an assisted living environment. We focus on keeping track of the evolution of the dependency level and detecting the loss of autonomy for an elderly person using a Hidden Markov Model based approach. In this predictive process, we were interested in including the correlation between cardiovascular history and hypertension, as hypertension is considered the primary risk factor for cardiovascular diseases, stroke, kidney failure and many other diseases. Our simulation was applied to an empirical dataset covering 3046 elderly persons monitored over 9 years. The results show that our model accurately evaluates a person's dependency, follows their autonomy evolution over time and thus predicts moments of important change.
Article
Full-text available
Algorithms for recognizing strings of connected words from whole-word patterns have become highly efficient and accurate, although computation rates remain high. Even the most ambitious connected-word recognition task is practical with today's integrated circuit technology, but extracting reliable, robust whole-word reference patterns still is difficult. In the past, connected-word recognizers relied on isolated-word reference patterns or patterns derived from a limited context (e.g., the middle digit from strings of three digits). These whole-word patterns were adequate for slow rates of articulated speech, but not for strings of words spoken at high rates (e.g., about 200 to 300 words per minute). To alleviate this difficulty, a segmental k-means training procedure was used to extract whole-word patterns from naturally spoken word strings. The segmented words are then used to create a set of word reference patterns for recognition. Recognition string accuracies were 98 to 99 percent for digits in variable length strings and 90 to 98 percent for sentences from an airline reservation task. These performance scores represent significant improvements over previous connected-word recognizers.
Article
Full-text available
Accurate detection of the boundaries of a speech utterance during a recording interval has been shown to be crucial for reliable and robust automatic speech recognition. The endpoint detection problem is fairly straightforward for high-level speech signals spoken in low-level stationary noise environments (e.g., signal-to-noise ratios greater than 30 dB). However, these ideal conditions do not always exist. One example where reliable word detection is difficult is speech spoken in a mobile environment. Because of road, tire, and fan noises, etc., detection of speech often becomes problematic.
Conference Paper
Full-text available
In this paper, we describe BYBLOS, the BBN continuous speech recognition system. The system, designed for large vocabulary applications, integrates acoustic, phonetic, lexical, and linguistic knowledge sources to achieve high recognition performance. The basic approach, as described in previous papers [1, 2], makes extensive use of robust context-dependent models of phonetic coarticulation using Hidden Markov Models (HMM). We describe the components of the BYBLOS system, including: signal processing frontend, dictionary, phonetic model training system, word model generator, grammar and decoder. In recognition experiments, we demonstrate consistently high word recognition performance on continuous speech across: speakers, task domains, and grammars of varying complexity. In speaker-dependent mode, where 15 minutes of speech is required for training to a speaker, 98.5% word accuracy has been achieved in continuous speech for a 350-word task, using grammars with perplexity ranging from 30 to 60. With only 15 seconds of training speech we demonstrate performance of 97% using a grammar.
Article
Full-text available
Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them.
Article
Full-text available
Most current attempts at automatic speech recognition are formulated in an artificial intelligence framework. In this paper we approach the problem from an information-theoretic point of view. We describe the overall structure of a linguistic statistical decoder (LSD) for the recognition of continuous speech. The input to the decoder is a string of phonetic symbols estimated by an acoustic processor (AP). For each phonetic string, the decoder finds the most likely input sentence. The decoder consists of four major subparts: 1) a statistical model of the language being recognized; 2) a phonemic dictionary and statistical phonological rules characterizing the speaker; 3) a phonetic matching algorithm that computes the similarity between phonetic strings, using the performance characteristics of the AP; 4) a word level search control. The details of each of the subparts and their interaction during the decoding process are discussed.
Article
Full-text available
In a template-based speech recognition system, distortion measures that compute the distance or dissimilarity between two spectral representations have a strong influence on the performance of the recognizer. Accordingly, extensive comparative studies have been conducted to determine good distortion measures for improved recognition accuracy. Previous studies have shown that the log likelihood ratio measure, the likelihood ratio measure, and the truncated cepstral measures all gave good recognition performance (comparable accuracy) for isolated word recognition tasks. In this paper we extend the interpretation of distortion measures, based upon the observation that measurements of speech spectral envelopes (as normally obtained from standard analysis procedures such as LPC or filter banks) are prone to statistical variations due to window position fluctuations, excitation interference, measurement noise, etc., and may not accurately characterize the true speech spectrum because of analysis model constraints. We have found that these undesirable spectral measurement variations can be partially controlled (i.e., reduced in the level of variation) by appropriate signal processing techniques. In particular, we have found that a bandpass "liftering" process reduces the variability of the statistical components of LPC-based spectral measurements and hence it is desirable to use such a liftering process in a speech recognizer. We have applied this liftering process to several speech recognition tasks: in particular, single frame vowel recognition and isolated word recognition. Using the liftering process, we have been able to achieve an average digit error rate of 1 percent in a speaker-independent isolated digit test. This error rate is about one-half that obtained without the liftering process.
Article
We introduce a new method of estimating transition probabilities of Markov source models from given sparse data. We first review the forward-backward algorithm, which derives parameter estimates having maximum likelihood properties relative to the data. We then give two heuristic methods based on the interpolated estimator concept that address the problem, and present results of simulations confirming the viability of our approach. The algorithms were successfully applied in continuous speech recognition to the estimation of parameters of speech processes.
Article
Many signals can be modeled as probabilistic functions of Markov chains in which the observed signal is a random vector whose probability density function (pdf) depends on the current state of an underlying Markov chain. Such models are called Hidden Markov Models (HMMs) and are useful representations for speech signals in terms of some convenient observations (e.g., cepstral coefficients or pseudolog area ratios). One method of estimating parameters of HMMs is the well-known Baum-Welch reestimation method. For continuous pdf's, the method was known to work only for elliptically symmetric densities. We have recently shown that the method can be generalized to handle mixtures of elliptically symmetric pdf's. Any continuous pdf can be approximated to any desired accuracy by such mixtures, in particular, by mixtures of multivariate Gaussian pdf's. To effectively make use of this method of parameter estimation, it is necessary to understand how it is affected by the amount of training data available, the number of states in the Markov chain, the dimensionality of the signal, etc. To study these issues, Markov chains and random vector generators were simulated to generate training sequences from “toy” models. The model parameters were estimated from these training sequences and compared to the “true” parameters by means of an appropriate distance measure. The results of several such experiments show the strong sensitivity of the method to some (but not all) of the model parameters. A procedure for getting good initial parameter estimates is, therefore, of considerable importance.
Article
In this paper we present an approach to speaker-independent, isolated word recognition in which the well-known techniques of vector quantization and hidden Markov modeling are combined with a linear predictive coding analysis front end. This is done in the framework of a standard statistical pattern recognition model. Both the vector quantizer and the hidden Markov models need to be trained for the vocabulary being recognized. Such training results in a distinct hidden Markov model for each word of the vocabulary. Classification consists of computing the probability of generating the test word with each word model and choosing the word model that gives the highest probability. There are several factors, in both the vector quantizer and the hidden Markov modeling, that affect the performance of the overall word recognition system, including the size of the vector quantizer, the structure of the hidden Markov model, the ways of handling insufficient training data, etc. The effects, on recognition accuracy, of many of these factors are discussed in this paper. The entire recognizer (training and testing) has been evaluated on a 10-word digits vocabulary. For training, a set of 100 talkers spoke each of the digits one time. For testing, an independent set of 100 tokens of each of the digits was obtained. The overall recognition accuracy was found to be 96.5 percent for the 100-talker test set. These results are comparable to those obtained in earlier work, using a dynamic time-warping recognition algorithm with multiple templates per digit. It is also shown that the computation and storage requirements of the new recognizer were an order of magnitude less than that required for a conventional pattern recognition system using linear prediction with dynamic time warping.
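The decision rule this abstract describes (score the test token against each word HMM and choose the word whose model gives the highest probability) is easy to sketch. A self-contained, generic illustration with hypothetical structure, not the paper's code:

```python
import numpy as np

def word_loglik(A, B, pi, obs):
    """Scaled forward pass: log P(obs | word model)."""
    alpha = pi * B[:, obs[0]]
    loglik = 0.0
    for o in obs[1:]:
        c = alpha.sum()
        loglik += np.log(c)
        alpha = (alpha / c) @ A * B[:, o]
    return loglik + np.log(alpha.sum())

def classify_word(word_models, obs):
    """Return the vocabulary word whose HMM best explains the test token.

    word_models: dict mapping word -> (A, B, pi) discrete HMM parameters
    obs        : VQ codeword indices extracted from the test utterance
    """
    return max(word_models,
               key=lambda w: word_loglik(*word_models[w], obs))
```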
Article
Accurate location of the endpoints of spoken words and phrases is important for reliable and robust speech recognition. The endpoint detection problem is fairly straightforward for high-level speech signals in low-level stationary noise environments (e.g., signal-to-noise ratios greater than 30-dB rms). However, this problem becomes considerably more difficult when either the speech signals are too low in level (relative to the background noise), or when the background noise becomes highly nonstationary. Such conditions are often encountered in the switched telephone network when the limitation on using local dialed-up lines is removed. In such cases the background noise is often highly variable in both level and spectral content because of transmission line characteristics, transients and tones from the line and/or from signal generators, etc. Conventional speech endpoint detectors have been shown to perform very poorly (on the order of 50-percent word detection) under these conditions. In this paper we present an improved word-detection algorithm, which can incorporate both vocabulary (syntactic) and task (semantic) information, leading to word-detection accuracies close to 100 percent for isolated digit detection over a wide range of telephone transmission conditions.
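The paper's improved detector is not reproduced here, but the conventional baseline it improves upon can be sketched: mark speech where frame log-energy rises a fixed margin above a noise floor estimated from the opening frames. The thresholds and window sizes below are illustrative assumptions of ours.

```python
import numpy as np

def endpoints(x, fs, frame_ms=10, margin_db=10.0):
    """Crude energy-based endpoint detection: return (start, end) sample
    indices of the region whose frame log-energy exceeds the estimated
    noise floor by margin_db; assumes a speech-free leading segment."""
    n = int(fs * frame_ms / 1000)
    frames = x[: len(x) // n * n].reshape(-1, n)
    e = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    floor = e[:10].mean()                      # noise floor from the first frames
    active = np.where(e > floor + margin_db)[0]
    if active.size == 0:
        return None
    return active[0] * n, (active[-1] + 1) * n
```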
Article
We propose a probabilistic distance measure for measuring the dissimilarity between pairs of hidden Markov models with arbitrary observation densities. The measure is based on the Kullback-Leibler number and is consistent with the reestimation technique for hidden Markov models. Numerical examples that demonstrate the utility of the proposed distance measure are given for hidden Markov models with discrete densities. We also discuss the effects of various parameter deviations in the Markov models on the resulting distance, and study the relationships among parameter estimates (obtained from reestimation), initial guesses of parameter values, and observation duration through the use of the measure.
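In the spirit of the proposed measure, a common Monte Carlo approximation scores a long sequence generated by one model under both models and normalizes by length. The sketch below assumes discrete densities and reuses the log_forward scorer from the earlier sketch; it is our paraphrase of the idea, not the paper's exact construction.

```python
import numpy as np

def sample_hmm(pi, A, B, T, rng):
    """Draw an observation sequence of length T from a discrete HMM."""
    s = rng.choice(len(pi), p=pi)
    obs = []
    for _ in range(T):
        obs.append(rng.choice(B.shape[1], p=B[s]))
        s = rng.choice(len(pi), p=A[s])
    return np.array(obs)

def hmm_distance(m1, m2, T=2000, seed=0):
    """D(m1, m2) ~ (1/T) [log P(O|m1) - log P(O|m2)] with O drawn from m1;
    symmetrize with (D(m1, m2) + D(m2, m1)) / 2 if desired."""
    rng = np.random.default_rng(seed)
    obs = sample_hmm(*m1, T, rng)
    return (log_forward(*m1, obs) - log_forward(*m2, obs)) / T
```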
Conference Paper
A connected speech recognition method based on the Baum forward-backward algorithm is presented. The segmentation of the test sentence uses the probability that an acoustic vector lies at the boundary between two speech-subunit models (hidden Markov models). The labelling rests on the highest probability that a vector has been emitted from the last state of a subunit model. Results are presented for word and phoneme recognition.
Article
Speech recognition research can be divided into three areas: isolated word recognition, where words are separated by distinct pauses; continuous speech recognition, where sentences are produced continuously in a natural manner; and speech understanding, where the aim is not transcription but understanding, in the sense that the system responds correctly to a spoken instruction or request. This chapter focuses on Continuous Speech Recognition (CSR) and summarizes acoustic processing techniques. Markov models of speech processes are introduced, and an elegant linguistic decoder based on dynamic programming that is practical under certain conditions is described. The chapter discusses the practical aspects of the sentence hypothesis search conducted by the linguistic decoder and introduces algorithms for extracting model parameter values automatically from data. Methods of assessing the performance of CSR systems and the relative difficulty of recognition tasks are discussed. The chapter illustrates the capabilities of present recognition systems by describing the results of certain recognition experiments.
Article
In this paper we discuss parameter estimation by means of the reestimation algorithm for a class of multivariate mixture density functions of Markov chains. The scope of the original reestimation algorithm is expanded and the previous assumptions of log concavity or ellipsoidal symmetry are obviated, thereby enhancing the modeling capability of the technique. Reestimation formulas in terms of the well-known forward-backward inductive procedure are also derived.
Article
During the past decade, the applicability of hidden Markov models (HMM) to various facets of speech analysis has been demonstrated in several different experiments. These investigations all rest on the assumption that speech is a quasi-stationary process whose stationary intervals can be identified with the occupancy of a single state of an appropriate HMM. In the traditional form of the HMM, the probability of duration of a state decreases exponentially with time. This behavior does not provide an adequate representation of the temporal structure of speech. The solution proposed here is to replace the probability distributions of duration with continuous probability density functions to form a continuously variable duration hidden Markov model (CVDHMM). The gamma distribution is ideally suited to specification of the durational density since it is one-sided and only has two parameters which, together, define both mean and variance. The main result is a derivation and proof of convergence of re-estimation formulae for all the parameters of the CVDHMM. It is interesting to note that if the state durations are gamma-distributed, one of the formulae is non-algebraic but, fortuitously, has properties such that it is easily and rapidly solved numerically to any desired degree of accuracy. Other results are presented including the performance of the formulae on simulated data.
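The durational density itself is simple to write down; a sketch using SciPy's parameterization (names ours):

```python
import numpy as np
from scipy.stats import gamma

def gamma_duration(d, shape, rate):
    """One-sided gamma density over state durations; its two parameters
    jointly fix the mean (shape / rate) and variance (shape / rate**2)."""
    return gamma.pdf(d, a=shape, scale=1.0 / rate)

d = np.arange(1, 50)                          # durations in frames
p = gamma_duration(d, shape=4.0, rate=0.5)    # mean duration of 8 frames
```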
Article
In this paper we extend previous work on isolated-word recognition based on hidden Markov models by replacing the discrete symbol representation of the speech signal with a continuous Gaussian mixture density. In this manner the inherent quantization error introduced by the discrete representation is essentially eliminated. The resulting recognizer was tested on a vocabulary of the ten digits across a wide range of talkers and test conditions and shown to have an error rate comparable to that of the best template recognizers and significantly lower than that of the discrete symbol hidden Markov model system. We discuss several issues involved in the training of the continuous density models and in the implementation of the recognizer.
Article
Recent work at Bell Laboratories has shown how the theories of LPC Vector Quantization (VQ) and hidden Markov modeling (HMM) can be applied to the recognition of isolated word vocabularies. Our first experiments with HMM based recognizers were restricted to a vocabulary of the ten digits. For this simple vocabulary we found that a high performance recognizer (word accuracy on the order of 97%) could be implemented, and that the performance was, for the most part, insensitive to parameters of both the Markov model and the vector quantizer. In this talk we extend our investigations to the recognition of isolated words from a medium size vocabulary (129 words), as used in the Bell Laboratories airline reservation and information system. For this moderately complex vocabulary we have found that recognition accuracy is indeed a function of the HMM parameters (i.e., the number of states and the number of symbols in the vector quantizer). We have also found that a vector quantizer which uses energy information gives better performance than a conventional LPC shape vector quantizer of the same size (i.e., number of codebook entries).
Article
Continuous speech was treated as if produced by a finite‐state machine making a transition every centisecond. The observable output from state transitions was considered to be a power spectrum—a probabilistic function of the target state of each transition. Using this model, observed sequences of power spectra from real speech were decoded as sequences of acoustic states by means of the Viterbi trellis algorithm. The finite‐state machine used as a representation of the speech source was composed of machines representing words, combined according to a “language model.” When trained to the voice of a particular speaker, the decoder recognized seven‐digit telephone numbers correctly 96% of the time, with a better than 99% per‐digit accuracy. Results for other tests of the system, including syllable and phoneme recognition, will also be given.
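The decoding step described, finding the best state sequence through the trellis, is the classic Viterbi recursion; a compact sketch in the log domain (naming ours):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for a discrete observation string."""
    T, N = len(obs), len(pi)
    logA = np.log(A)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA          # scores[i, j]: best path into j via i
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + np.log(B[:, obs[t]])
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):               # trace the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```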
Article
This paper gives a unified theoretical view of the Dynamic Time Warping (DTW) and the Hidden Markov Model (HMM) techniques for speech recognition problems. The application of hidden Markov models in speech recognition is discussed. We show that the conventional dynamic time-warping algorithm with Linear Predictive (LP) signal modeling and distortion measurements can be formulated in a strictly statistical framework. It is further shown that the DTW/LP method is implicitly associated with a specific class of Markov models and is equivalent to the probability maximization procedures for Gaussian autoregressive multivariate probabilistic functions of the underlying Markov model. This unified view offers insights into the effectiveness of the probabilistic models in speech recognition applications.
Article
In this paper we present several of the salient theoretical and practical issues associated with modeling a speech signal as a probabilistic function of a (hidden) Markov chain. First we give a concise review of the literature with emphasis on the Baum-Welch algorithm. This is followed by a detailed discussion of three issues not treated in the literature: alternatives to the Baum-Welch algorithm; critical facets of the implementation of the algorithms, with emphasis on their numerical properties; and behavior of Markov models on certain artificial but realistic problems. Special attention is given to a particular class of Markov models, which we call “left-to-right” models. This class of models is especially appropriate for isolated word recognition. The results of the application of these methods to an isolated word, speaker-independent speech recognition experiment are given in a companion paper.
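One of the numerical issues alluded to is underflow of the forward variables over long utterances; the standard remedy, rescaling at every frame and accumulating log scale factors, can be sketched as follows (naming ours; the paper itself should be consulted for the full treatment):

```python
import numpy as np

def scaled_forward(pi, A, B, obs):
    """Forward pass with per-frame normalization; returns the sequence
    log-likelihood as the sum of the log scale factors."""
    alpha = pi * B[:, obs[0]]
    loglik = 0.0
    for t, o in enumerate(obs):
        if t > 0:
            alpha = (alpha @ A) * B[:, o]       # one forward induction step
        c = alpha.sum()                         # scale factor for frame t
        alpha = alpha / c
        loglik += np.log(c)
    return loglik, alpha
```

A left-to-right model of the kind the paper emphasizes is imposed simply by constraining A to be upper triangular (often banded), so that states are never revisited once left.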
Article
In this paper a new sequential decoding algorithm is introduced that uses stack storage at the receiver. It is much simpler to describe and analyze than the Fano algorithm, and is about six times faster than the latter at transmission rates equal to Rcomp, the rate below which the average number of decoding steps is bounded by a constant. Practical problems connected with implementing the stack algorithm are discussed, and a scheme is described that facilitates satisfactory performance even with limited stack storage capacity. Preliminary simulation results estimating the decoding effort and the needed stack size are presented.
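As a generic sketch of the best-first, stack-based search idea (the expand/score/is_leaf interfaces and the truncation rule are our illustrative assumptions, not the paper's specification):

```python
import heapq
import itertools

def stack_decode(root, expand, score, is_leaf, max_stack=1000):
    """Best-first search with a bounded stack: repeatedly pop the
    highest-scoring partial path, stop if it is complete, otherwise push
    its extensions; truncate the stack when it exceeds max_stack."""
    tie = itertools.count()                     # tiebreaker for equal scores
    heap = [(-score(root), next(tie), root)]
    while heap:
        _, _, path = heapq.heappop(heap)
        if is_leaf(path):
            return path
        for nxt in expand(path):
            heapq.heappush(heap, (-score(nxt), next(tie), nxt))
        if len(heap) > max_stack:               # keep only the best entries
            heap = heapq.nsmallest(max_stack, heap)
            heapq.heapify(heap)
    return None
```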
Conference Paper
An isolated word recognizer based on vector quantization at the acoustic level and on stochastic modeling at the phonetic level is described. The power of this approach lies in its best utilization of the training data. The first experimental results obtained are encouraging and suggest that further optimization is possible.
Conference Paper
A method for modelling time series is presented and then applied to the analysis of the speech signal. A time series is represented as a sample sequence generated by a finite state hidden Markov model with output densities parameterized by linear prediction polynomials and error variances. These objects are defined and their properties developed. The theory culminates in a theorem that provides a computationally efficient iterative scheme to improve the model. The theorem has been used to create models from speech signals of considerable length. One such model is examined with emphasis on the relationship between states of the model and traditional classes of speech events. A use of the method is illustrated by an application to the talker verification problem.
Conference Paper
The Speech Recognition Group at IBM Research in Yorktown Heights has developed a real-time, isolated-utterance speech recognizer for natural language based on the IBM Personal Computer AT and IBM Signal Processors. The system has recently been enhanced by expanding the vocabulary from 5,000 words to 20,000 words and by the addition of a speech workstation to support usability studies on document creation by voice. The system supports spelling and interactive personalization to augment the vocabularies. This paper describes the implementation, user interface, and comparative performance of the recognizer.
Conference Paper
This paper proposes a new strategy, Multi-Level Decoding (MLD), that allows the use of a Very Large Size Dictionary (VLSD, more than 100,000 words) in speech recognition. MLD proceeds in three steps: a Syllable Match procedure uses an acoustic model to build a list of the most probable syllables matching the acoustic signal from a given time frame; from this list, a Word Match procedure uses the dictionary to build partial word hypotheses; a Sentence Match procedure then uses a probabilistic language model to build partial sentence hypotheses until complete sentences are found. An original matching algorithm is proposed for the Syllable Match procedure. The strategy is evaluated on a dictation task of French texts. Two different dictionaries are tested: one composed of the 10,000 most frequent words, the other composed of 200,000 words. The recognition results are given and compared. The word error rate with 10,000 words is 17.3%; if the errors due to the lack of coverage are not counted, it is reduced to 10.6%. The error rate with 200,000 words is 12.7%.
Conference Paper
A new iterative approach for hidden Markov modeling of information sources which aims at minimizing the discrimination information (or the cross-entropy) between the source and the model is proposed. This approach does not require the commonly used assumption that the source to be modeled is a hidden Markov process. The algorithm is started from the model estimated by the traditional maximum likelihood (ML) approach and alternatively decreases the discrimination information over all probability distributions of the source which agree with the given measurements and all hidden Markov models. The proposed procedure generalizes the Baum algorithm for ML hidden Markov modeling. The procedure is shown to be a descent algorithm for the discrimination information measure and its local convergence is proved.
Conference Paper
This paper describes an experimental continuous speech recognition system comprising procedures for acoustic/phonetic classification, lexical access and sentence retrieval. Speech is assumed to be composed of a small number of phonetic units which may be identified with the states of a hidden Markov model. The acoustic correlates of the phonetic units are then characterized by the observable Gaussian process associated with the corresponding state of the underlying Markov chain. Once the parameters of such a model are determined, a phonetic transcription of an utterance can be obtained by means of a Viterbi-like algorithm. Given a lexicon in which each entry is orthographically represented in terms of the chosen phonetic units, a word lattice is produced by a lexical access procedure. Lexical items whose orthography matches subsequences of the phonetic transcription are sought by means of a hash coding technique and their likelihoods are computed directly from the corresponding interval of acoustic measurements. The recognition process is completed by recovering, from the word lattice, the string of words of maximum likelihood conditioned on the measurements. The desired string is derived by a best-first search algorithm. In an experimental evaluation of the system, the parameters of an acoustic/phonetic model were estimated from fluent utterances of 37 seven-digit numbers. A digit recognition rate of 96% was then observed on an independent test set of 59 utterances of the same form from the same speaker. Half of the observed errors resulted from insertions while deletions and substitutions accounted equally for the other half.
Conference Paper
One approach to large-vocabulary speech recognition is to build phonetic Markov models and to concatenate them to obtain word models. In previous work, we designed a recognizer based on 40 phonetic Markov machines that accepts a 10,000-word vocabulary ([3]) and, more recently, a 200,000-word vocabulary ([5]). Since there is one machine per phoneme, these models obviously do not account for coarticulatory effects, which may lead to recognition errors. In this paper, we improve the phonetic models by using general principles about the effects of coarticulation on automatic phoneme recognition. We show that both the analysis of the errors made by the recognizer and linguistic facts about the influence of phonetic context suggest a method for choosing context-dependent models. This method limits the growth in the number of phoneme models while still accounting for the most important coarticulation effects. We present our experiments with a system applying these principles to a set of models for French. With this new system, including context-dependent machines, the phoneme recognition rate rises from 82.2% to 85.3%, and the word error rate with a 10,000-word dictionary decreases from 11.2% to 9.8%.
Conference Paper
This paper proposes a new way of using vector quantization for improving recognition performance for a 60,000 word vocabulary speaker-trained isolated word recognizer using a phonemic Markov model approach to speech recognition. We show that we can effectively increase the codebook size by dividing the feature vector into two vectors of lower dimensionality, and then quantizing and training each vector separately. For a small codebook size, integration of the results of the two parameter vectors provides significant improvement in recognition performance as compared to the quantizing and training of the entire feature set together. Even for a codebook size as small as 64, the results obtained when using the new quantization procedure are quite close to those obtained when using Gaussian distribution of the parameter vectors.
Conference Paper
Most current speech recognition systems are sensitive to variations in speaker style; the following is the result of an effort to make a Hidden Markov Model (HMM) Isolated Word Recognizer (IWR) tolerant to such speech changes caused by speaker stress. More than an order-of-magnitude reduction in the error rate was achieved for a 105-word simulated-stress database, and a 0% error rate was achieved for the TI 20 isolated word database.
Conference Paper
A new training procedure called multi-style training has been developed to improve performance when a recognizer is used under stress or in high noise but cannot be trained in these conditions. Instead of speaking normally during training, talkers use different, easily produced, talking styles. This technique was tested using a speech data base that included stress speech produced during a workload task and when intense noise was presented through earphones. A continuous-distribution talker-dependent Hidden Markov Model (HMM) recognizer was trained both normally (5 normally spoken tokens) and with multi-style training (one token each from normal, fast, clear, loud, and question-pitch talking styles). The average error rate under stress and normal conditions fell by more than a factor of two with multi-style training and the average error rate under conditions sampled during training fell by a factor of four.
Conference Paper
The use of instantaneous and transitional spectral representations of spoken utterances for speaker recognition is investigated. LPC-derived cepstral coefficients are used to represent instantaneous spectral information, and best linear fits of each cepstral coefficient over a specified time window are used to represent transitional information. An evaluation has been carried out using a database of isolated digit utterances spoken over dialed-up telephone lines by 10 talkers. Two vector quantization (VQ) codebooks, instantaneous and transitional, are constructed from training utterances for each speaker. The experimental results show that the instantaneous and transitional representations are relatively uncorrelated, thus providing complementary information for speaker recognition. A rectangular window of approximately 100-150 ms duration provides an effective estimate of spectral transitions for speaker recognition. Also, simple transmission channel variations are shown to affect the instantaneous spectral representations and the corresponding recognition performance significantly, while the transitional representations and performance are relatively resistant.
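The transitional representation, a best linear fit of each cepstral coefficient over a sliding window, reduces to correlating each coefficient track with regression-slope weights; a sketch (naming ours):

```python
import numpy as np

def transitional(cep, width=11):
    """Slope of the best linear fit of each cepstral coefficient over a
    sliding window; cep is (frames x coefficients). At a 10 ms frame rate,
    a width of 11-15 roughly matches the 100-150 ms window the paper
    finds effective."""
    half = width // 2
    taps = np.arange(-half, half + 1, dtype=float)
    taps /= (taps ** 2).sum()                   # least-squares slope weights
    out = np.zeros_like(cep)
    for k in range(cep.shape[1]):
        out[:, k] = np.correlate(cep[:, k], taps, mode="same")
    return out
```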
Conference Paper
The problem of modeling durational structure is addressed. Results of experiments on the use of temporal models to optimally change the duration and temporal structure of words are presented and used to show that a first-order Markov chain is inadequate for effectively modeling expected local duration. Semi-Markov models are proposed and shown to lead to improved performance. The implications of these results for automatic speech recognition are considered. Hidden semi-Markov models are introduced, and two alternative models of state duration are proposed. The problem of model parameter estimation is addressed. Finally, preliminary experimental results are presented that compare the recognition performance obtained using hidden Markov models with that obtained using a special class of hidden semi-Markov models.
Conference Paper
This paper describes the results of our work in designing a system for phonetic recognition of unrestricted continuous speech. We describe several algorithms used to recognize phonemes using context-dependent Hidden Markov Models of the phonemes. We present results for several variations of the parameters of the algorithms. In addition, we propose a technique that makes it possible to integrate traditional acoustic-phonetic features into a hidden Markov process. The categorical decisions usually associated with heuristic acoustic-phonetic algorithms are replaced by automated training techniques and global search strategies. The combination of general spectral information and specific acoustic-phonetic features is shown to result in more accurate phonetic recognition than either representation by itself.
Article
Although a great deal of effort has gone into studying large-vocabulary speech-recognition problems, there remain a number of interesting, and potentially exceedingly important, problems which do not require the complexity of these large systems. One such problem is connected-digit recognition, which has applications to telecommunications, order entry, credit-card entry, forms automation, and data-base management, among others. Connected-digit recognition is also an interesting problem for another reason, namely that it is one in which whole-word training patterns are applicable as the basic speech-recognition unit. Thus one can bring to bear all the fundamental speech recognition technology associated with whole-word recognition to solve this problem. As such, several connected digit recognizers have been proposed in the past few years. The performance of these systems has steadily improved to the point where high digit-recognition accuracy is achievable in a speaker-trained mode.
Article
A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
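Of the examples listed, the finite-mixture case is the one most relevant here; a minimal EM sketch for a one-dimensional two-component Gaussian mixture (our construction of the textbook special case, not the paper's general formulation):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=100, seed=0):
    """EM for a 1-D Gaussian mixture: the E-step computes posterior
    responsibilities under current parameters, the M-step re-estimates
    weights, means, and variances; the likelihood never decreases."""
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)
    mu = rng.choice(x, k, replace=False)
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: r[n, j] proportional to w_j N(x_n; mu_j, var_j)
        r = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted ML re-estimates
        nj = r.sum(axis=0)
        w, mu = nj / len(x), (r * x[:, None]).sum(axis=0) / nj
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return w, mu, var
```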
Article
This paper gives an exposition of linear prediction in the analysis of discrete signals. The signal is modeled as a linear combination of its past values and present and past values of a hypothetical input to a system whose output is the given signal. In the frequency domain, this is equivalent to modeling the signal spectrum by a pole-zero spectrum. The major part of the paper is devoted to all-pole models. The model parameters are obtained by a least squares analysis in the time domain. Two methods result, depending on whether the signal is assumed to be stationary or nonstationary. The same results are then derived in the frequency domain. The resulting spectral matching formulation allows for the modeling of selected portions of a spectrum, for arbitrary spectral shaping in the frequency domain, and for the modeling of continuous as well as discrete spectra. This also leads to a discussion of the advantages and disadvantages of the least squares error criterion. A spectral interpretation is also given.
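For the all-pole, stationary (autocorrelation-method) case, the least-squares normal equations are Toeplitz and are conventionally solved by the Levinson-Durbin recursion; a compact sketch (naming ours):

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method linear prediction: returns predictor
    coefficients a_1..a_p (x[n] ~ sum_j a_j x[n-j]) and the residual
    energy, via Levinson-Durbin on the Toeplitz normal equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order)
    e = r[0]
    for i in range(order):
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / e   # reflection coefficient
        a[: i + 1] = np.append(a[:i] - k * a[:i][::-1], k)
        e *= 1.0 - k * k                          # prediction-error update
    return a, e
```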
Article
Parameter estimation for multivariate functions of Markov chains, a class of versatile statistical models for vector random processes, is discussed. The model regards an ordered sequence of vectors as noisy multivariate observations of a Markov chain. Mixture distributions are a special case. The foundations of the theory presented were established by L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A powerful representation theorem by Fan is employed to generalize the analysis of L. E. Baum, et al. to a larger class of distributions.
Article
The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models, based on Markov chains, have not been developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960's and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.
Article
We describe a procedure for efficient encoding of the speech wave by representing it in terms of time-varying parameters related to the transfer function of the vocal tract and the characteristics of the excitation. The speech wave, sampled at 10 kHz, is analyzed by predicting the present speech sample as a linear combination of the 12 previous samples. The 12 predictor coefficients are determined by minimizing the mean-squared error between the actual and the predicted values of the speech samples. Fifteen parameters, namely, the 12 predictor coefficients, the pitch period, a binary parameter indicating whether the speech is voiced or unvoiced, and the rms value of the speech samples, are derived by analysis of the speech wave, encoded, and transmitted to the synthesizer. The speech wave is synthesized as the output of a linear recursive filter excited by either a sequence of quasiperiodic pulses or a white-noise source. Application of this method for efficient transmission and storage of speech signals, as well as procedures for determining other speech characteristics, such as formant frequencies and bandwidths, the spectral envelope, and the autocorrelation function, are discussed.
Conference Paper
A new speech analysis technique applicable to speech recognition is proposed considering the auditory mechanism of speech perception, which emphasizes spectral dynamics and which compensates for the spectral undershoot associated with coarticulation. A speech wave is represented by the LPC cepstrum and logarithmic energy sequences, and the time sequences over short periods are expanded by first- and second-order polynomial functions at every frame period. The dynamics of the cepstrum sequences are then emphasized by the linear combination of their polynomial expansion coefficients, that is, derivatives, and their instantaneous values. Speaker-independent word recognition experiments using time functions of the dynamics-emphasized cepstrum and the polynomial coefficient for energy indicate that the error rate can be greatly reduced by this method.
Conference Paper
SPHINX, the first large-vocabulary, speaker-independent, continuous-speech recognizer, is described. SPHINX is a hidden-Markov-model (HMM)-based recognizer using multiple codebooks of various LPC-derived features. Two types of HMMs are used in SPHINX: context-independent phone models and function-word-dependent phone models. On a 997-word task using a bigram grammar, SPHINX achieved a word accuracy of 93%. This demonstrates the feasibility of speaker-independent continuous-speech recognition, and the appropriateness of hidden Markov models for such a task.
Article
A weighted cepstral distance measure is proposed and is tested in a speaker-independent isolated word recognition system using standard DTW (dynamic time warping) techniques. The measure is a statistically weighted distance measure with weights equal to the inverse variance of the cepstral coefficients. The experimental results show that the weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and the log likelihood ratio distance measures across two different databases. The recognition error rate obtained using the weighted cepstral distance measure was about 1 percent for digit recognition. This result was less than one-fourth of that obtained using the simple Euclidean cepstral distance measure and about one-third of the results using the log likelihood ratio distance measure. The most significant performance characteristic of the weighted cepstral distance was that it tended to equalize the performance of the recognizer across different talkers.
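The measure itself is one line; a sketch (naming ours), with the weights taken as the inverse variances of the cepstral coefficients estimated from training data:

```python
import numpy as np

def weighted_cepstral_distance(c_test, c_ref, var):
    """Variance-weighted cepstral distance: squared differences scaled by
    the inverse variance of each coefficient, which equalizes the
    contribution of low- and high-order terms."""
    return float(np.sum((c_test - c_ref) ** 2 / var))
```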