Available Online at www.ijcsmc.com
International Journal of Computer Science and Mobile Computing
A Monthly Journal of Computer Science and Information Technology
ISSN 2320–088X
IJCSMC, Vol. 2, Issue. 6, June 2013, pg.233 – 238
RESEARCH ARTICLE
© 2013, IJCSMC All Rights Reserved
An Overview of Speech Recognition Using HMM
Ms. Rupali S. Chavan¹, Dr. Ganesh S. Sable²
¹Department of E&TC, Savitribai Phule Women’s Engineering College, Aurangabad, Maharashtra, India
²Department of E&TC, Savitribai Phule Women’s Engineering College, Aurangabad, Maharashtra, India
¹chavanrupali452@gmail.com; ²sable.eesa@gmail.com
Abstract— Speech is the most prominent and natural form of communication among human beings, and the speech signal is rich in information. There are different tasks related to speech, such as speech recognition, speech verification, speech synthesis, speaker recognition, and speaker identification. The purpose of this project is to study a speech recognition system using HMM. The goal of speech recognition is to determine which word was spoken from the acoustic information. The system uses MFCC for feature extraction and HMM for pattern training. The success of MFCCs, combined with their robust and cost-effective computation, has made them a standard choice in speech recognition applications, and HMMs provide a highly reliable way of recognizing speech.
Key Terms: Discrete Cosine Transform; Fast Fourier Transform; Hidden Markov Model; Mel Frequency Cepstral Coefficients; Speech Recognition
I. INTRODUCTION
Speech recognition is a powerful tool for information exchange using the acoustic signal, so it is not surprising that the speech signal has long been a subject of research. Speech recognition is a technology that enables a computer to capture the words spoken by a human with the help of a microphone. These words are then recognized by a speech recognizer, and finally the system outputs the recognized words. Speech recognition is essentially the science of talking with a computer and having it correctly understand what was said: extracting the meaning of an utterance so that one can respond properly, whether or not every word was correctly recognized. Data input to a machine is of generic use, but in what circumstances is speech input preferred? An eyes-and-hands-busy user, such as a quality control inspector, inventory taker, cartographer, radiologist (medical X-ray reader), mail sorter, or aircraft pilot, is one example. Another use is transcription in the business environment, where it may be faster to remove the distraction of typing for the non-typist. The technology is also helpful to disabled persons who might otherwise require helpers to control their environments. Automatic speech recognition has a long history as a difficult problem; the first papers date from about 1950. Over this period a number of techniques have been used, such as linear-time-scaled word-template matching, dynamic-time-warped word-template matching, linguistically motivated approaches (find the phonemes, assemble them into words, assemble the words into sentences), and hidden Markov models (HMMs). Of all the available techniques, HMMs currently yield the best performance [1].
Speech recognition involves a database creation (training) process and a recognition process. Database creation covers the collection of speakers’ voice samples and the extraction of features for the selected words. Recognition is the process of identifying a spoken word by comparing the current voice features with the pre-stored features. At run time, the recognizer first finds the likelihood of the unknown spoken word against the pre-stored database of known words, and then decides on the word with the maximum likelihood. Speech recognition has two categories: text dependent and text independent. Text-dependent speech recognition identifies the spoken word against the words that were provided at the time of database collection; in this case the text in the recognition phase is the same as in the training phase. Text-independent speech recognition identifies the spoken word irrespective of the text used in training. Speech recognition is also classified as speaker dependent and speaker independent. In the speaker-dependent type, speech is recognized only for those speakers whose samples were taken during training. Speaker-independent speech recognition identifies the spoken word irrespective of the speaker [2].
In early research, Lawrence Rabiner and Biing-Hwang Juang, in their book “Fundamentals of Speech Recognition”, explained techniques such as hidden Markov models, DTW, LPC, VQ, and MFCC in detail. HMM systems generally use large acoustic models composed of several thousand parameters. Dynamic Time Warping (DTW) and the Hidden Markov Model (HMM) are two well-studied non-linear sequence alignment (or pattern matching) algorithms. The research trend moved from DTW to HMM in approximately 1988-1990, since DTW is deterministic and lacks the power to model stochastic signals [3]. In another review, M. A. Anusuya and S. K. Katti reported on “Speech Recognition by Machine: A Review”. They presented a brief survey of automatic speech recognition, explaining in depth the classification of ASR systems, the relevant issues of ASR design, and the approaches to speech recognition [2]. In other work, Mahdi Shaneh and Azizollah Taheri proposed a “Voice command recognition system based on MFCC and VQ algorithms”. They designed a system to recognize voice commands, using the MFCC algorithm for feature extraction and vector quantization (VQ) to reduce the amount of data and decrease computation time. In the feature matching stage, Euclidean distance was applied as the similarity criterion. Because of the high accuracy of the algorithms used, they obtained a highly accurate voice command system: training initially with one repetition per command and one repetition per testing session gave a 15% error rate, while increasing the number of training samples reduced the error rate to zero [4].
In their research, H. P. Combrinck and E. C. Botha reported “On the Mel-scaled Cepstrum”. They reported the superior performance of MFCC, especially under adverse conditions, and concluded that it represents a good trade-off between computational efficiency and perceptual considerations [5].
Further research was done by Ahmad A. M. Abushariah, Teddy S. Gunawan, and Othman O. Khalifa in their paper “English Digits Speech Recognition System Based on Hidden Markov Models”. Two modules were developed, namely isolated-word speech recognition and continuous speech recognition. Both modules were tested in clean and noisy environments and showed successful recognition rates, relatively high when compared to similar systems. The recognition rates of the multi-speaker mode were better than those of the speaker-independent mode in both environments [6]. Ibrahim Patel and Dr. Y. Srinivasa Rao, in their paper “Speech recognition using Hidden Markov Model with MFCC subband technique”, concluded that these methods improve the quality metrics of speech recognition with respect to computation time and learning accuracy [8]. In another study, on voice recognition using HMM with MFCC for a secure ATM, by Shumaila Iqbal, Tahira Mahboob, and Malik Sikandar, the recognition accuracy was found to be 86.67% [9].
This paper gives an overview of a speech recognition system using MFCC and HMM. The Mel Frequency Cepstral Coefficient (MFCC) method is studied here for extracting the features of the speech signal. The pre-processing and feature extraction stages of a pattern recognition system serve as an interface between the real world and a classifier operating on an idealised model of reality. An HMM is then used to train these features into the HMM parameters and to find the log likelihood of the speech samples; in recognition, this likelihood is used to identify the spoken word.
II. SPEECH RECOGNITION SYSTEM
The speech signal primarily conveys the words or message being spoken, and speech recognition is concerned with determining the underlying meaning of the utterance. Success in speech recognition depends on extracting and modelling the speech-dependent characteristics which can effectively distinguish one word from another. The speech recognition system may be viewed as working in four stages, as shown in Fig. 1:
i. Feature extraction
ii. Pattern training
iii. Pattern Matching.
iv. Decision logic
Fig. 1 Speech Recognition System
The feature extraction process is implemented using Mel Frequency Cepstral Coefficients (MFCC), in which speech features are extracted from all the speech samples. These features are then given to the pattern trainer and trained by HMM to create an HMM model for each word. Viterbi decoding is then used to select the model with the maximum likelihood, which corresponds to the recognized word.
III. MFCC APPROACH
The purpose of this module is to convert the speech waveform into some type of parametric representation. MFCC is used to extract the unique features of speech samples; it represents the short-term power spectrum of human speech. The MFCC technique makes use of two types of filters, namely linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed on the Mel frequency scale. The Mel scale is based on studies of the pitch or frequency perceived by humans, and its unit is the mel. The mapping is approximately linear below 1000 Hz and logarithmic above 1000 Hz. Equation (1) is used to convert a normal frequency f (in Hz) to the Mel scale:

Mel = 2595 log10(1 + f/700) (1)
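As a minimal sketch of equation (1), the mapping and its inverse (the inverse is added here only for illustration; it is useful later when placing filter-bank centre frequencies) can be written as:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale, as in equation (1)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of equation (1)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

For example, 700 Hz maps to 2595·log10(2) ≈ 781 mel, and the round trip mel_to_hz(hz_to_mel(f)) returns f.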
As shown in Fig. 2, MFCC consists of six computational steps. Each step has its own function and mathematical approach, as discussed briefly in the following:
Step 1: Pre-emphasis
In this step the signal is passed through a filter which emphasizes the higher frequencies in the band, boosting the magnitude of the higher frequencies with respect to the lower ones in order to improve the overall SNR. This process increases the energy of the signal at higher frequencies [7].
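Pre-emphasis is usually realized as a first-order high-pass filter y[n] = x[n] − a·x[n−1]; a sketch follows, where the coefficient 0.95 is an assumed typical value (values around 0.95-0.97 are common), not one specified in this paper:

```python
import numpy as np

def pre_emphasis(signal, a=0.95):
    """First-order high-pass filter y[n] = x[n] - a * x[n-1].

    Boosts high-frequency energy relative to low frequencies.
    """
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```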
Step 2: Framing
Framing is the process of segmenting the sampled speech into small frames. The speech signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N). Typical values are M = 100 and N = 256 (equivalent to roughly 30 ms of windowing).
Fig.2 Computational Steps of MFCC
Step 3: Hamming windowing
Each individual frame is windowed so as to minimize the signal discontinuities at the beginning and end of the frame. The Hamming window is used because it integrates the closest frequency lines. If the window is defined as W(n), 0 ≤ n ≤ N−1, where
N = number of samples in each frame
Y(n) = output signal
X(n) = input signal
W(n) = Hamming window,
then the result of windowing the signal is:

Y(n) = X(n) · W(n) (2)

W(n) = 0.54 − 0.46 cos(2πn/(N−1)), 0 ≤ n ≤ N−1 (3)
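Steps 2 and 3 can be sketched together, using the frame length N = 256 and shift M = 100 given in the text and the Hamming window of equation (3):

```python
import numpy as np

N, M = 256, 100  # frame length and frame shift from the text

def frame_signal(x, n=N, m=M):
    """Slice x into overlapping frames of n samples, shifted by m samples."""
    num_frames = 1 + max(0, (len(x) - n) // m)
    return np.stack([x[i * m : i * m + n] for i in range(num_frames)])

def hamming(n=N):
    """W(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)), equation (3)."""
    k = np.arange(n)
    return 0.54 - 0.46 * np.cos(2 * np.pi * k / (n - 1))

# Equation (2): each frame is multiplied sample-wise by the window.
# windowed = frame_signal(x) * hamming()
```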
Step 4: Fast Fourier Transform
The FFT is applied to convert each frame of N samples from the time domain into the frequency domain.
Step 5: Mel Filter Bank Processing
The range of frequencies in the FFT spectrum is very wide, and the voice signal does not follow a linear scale. A bank of filters spaced according to the Mel scale, as shown in Fig. 3, is therefore applied.
Fig. 3 Mel Filter Bank
The figure shows the set of triangular filters that are used to compute a weighted sum of the spectral components, so that the output of the process approximates a Mel scale. Each filter’s magnitude frequency response is triangular in shape, equal to unity at its centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters. Each filter output is then the sum of its filtered spectral components. The output is the mel spectrum, consisting of the output powers of these filters; taking its logarithm gives the log mel spectrum.
Step 6: Discrete Cosine Transform
The log Mel spectrum is converted back to the time domain using the Discrete Cosine Transform (DCT). The result of the conversion is the set of Mel Frequency Cepstral Coefficients, called an acoustic vector. Each input utterance is therefore transformed into a sequence of acoustic vectors [7][8].
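Steps 4-6 can be sketched in NumPy as follows. The filter count (26), cepstral count (13), and 8 kHz sampling rate are assumed illustrative values, not parameters given in the paper; the DCT-II is written out explicitly:

```python
import numpy as np

def mel_filterbank(num_filters=26, nfft=256, fs=8000):
    """Triangular filters spaced uniformly on the Mel scale (Step 5)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0.0), mel(fs / 2.0), num_filters + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fb = np.zeros((num_filters, nfft // 2 + 1))
    for j in range(1, num_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        for k in range(l, c):                       # rising edge
            fb[j - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                       # falling edge
            fb[j - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_from_frame(frame, num_filters=26, num_ceps=13, fs=8000):
    """Steps 4-6: FFT power spectrum -> log Mel energies -> DCT-II."""
    nfft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / nfft
    energies = mel_filterbank(num_filters, nfft, fs) @ power
    log_mel = np.log(np.maximum(energies, 1e-10))
    # DCT-II of the log Mel spectrum gives the cepstral coefficients.
    n = np.arange(num_filters)
    basis = np.cos(np.pi * np.outer(np.arange(num_ceps), 2 * n + 1)
                   / (2 * num_filters))
    return basis @ log_mel
```

Applied per windowed frame, this yields the sequence of 13-dimensional acoustic vectors described above.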
IV. HIDDEN MARKOV MODELLING APPROACH
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a
Markov process with unknown parameters; the challenge is to determine the hidden parameters from the
observable data. In a hidden Markov model, the state is not directly visible, but variables influenced by the state
are visible. Each state has a probability distribution over the possible output tokens; therefore the sequence of tokens generated by an HMM gives some information about the sequence of states. A hidden Markov model can be considered a generalization of a mixture model in which the hidden variables, which control the mixture component selected for each observation, are related through a Markov process rather than being independent of each other.
HMM creates stochastic models from known utterances and compares the probability that the unknown utterance was generated by each model. Statistical theory is used to arrange the feature vectors into a Markov matrix (chain) that stores the probabilities of state transitions. That is, if each of the code words were to represent some state, the HMM would follow the sequence of state changes and build a model that includes the probabilities of each state progressing to another state.
HMMs are popular because they can be trained automatically and are simple and computationally feasible to use. An HMM treats the speech signal as quasi-stationary over short durations and models these frames for recognition. It breaks the feature vector of the signal into a number of states and finds the probability of the signal transiting from one state to another. HMMs are simple networks that can generate speech (sequences of cepstral vectors) using a number of states for each model, with the short-term spectra associated with each state usually modelled by mixtures of multivariate Gaussian distributions (the state output distributions). The parameters of the model are the state transition probabilities and the means, variances, and mixture weights that characterize the state output distributions [10].
An HMM with discrete observations can be characterized by the following:
i. N, the number of states in the model; these states are hidden.
ii. M, the number of distinct observation symbols, corresponding to the physical outputs of the model.
iii. A = {aij}, the state transition probability distribution, an N×N matrix defined by equation (4):

aij = P{qt+1 = j | qt = i}, 1 ≤ i, j ≤ N (4)

Σj aij = 1, 1 ≤ i ≤ N (5)

where qt denotes the state occupied at time t. The transition probabilities must satisfy the stochastic constraint (5).
iv. B = {bj(k)}, the observation symbol probability distribution, an N×M matrix defined by equation (6):

bj(k) = P{ot = vk | qt = j}, 1 ≤ j ≤ N, 1 ≤ k ≤ M (6)

Σk bj(k) = 1, 1 ≤ j ≤ N (7)

where vk represents the k-th observation symbol in the alphabet and ot the current parameter vector. B must likewise satisfy the stochastic constraint (7).
v. π = {πi}, the initial state distribution, an N×1 vector:

πi = P{q1 = i}, 1 ≤ i ≤ N (8)

By defining N, M, A, B, and π, the HMM can generate an observation sequence for the entire model, and λ = (A, B, π) specifies the complete parameter set of the model [11].
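A toy discrete HMM λ = (A, B, π) with N = 2 hidden states and M = 3 observation symbols can be written down directly; the numerical values below are illustrative only, not taken from the paper:

```python
import numpy as np

A = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # N x N transition matrix, eq. (4)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])     # N x M emission matrix, eq. (6)
pi = np.array([0.6, 0.4])           # N x 1 initial distribution, eq. (8)

# Stochastic constraints (5) and (7): every row must sum to one.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```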
V. FORWARD-BACKWARD ALGORITHM
The forward-backward estimation algorithm is used to train the HMM parameters and to find the log likelihood of a voice sample. It estimates the unknown parameters of the HMM and computes the maximum likelihood and posterior mode estimates for the parameters during training. Here we want to find P(O|λ), given the observation sequence O = o1, o2, o3, ..., oT.
Forward Algorithm
The forward variable αt(i) is defined as αt(i) = P(o1, o2, ..., ot, qt = i | λ), i.e. the probability of the partial observation sequence (up to time t) and state i at time t, given the model λ. αt(i) is computed inductively by the following steps:
Initialization:

α1(i) = πi bi(o1), 1 ≤ i ≤ N (9)

Induction:

αt+1(j) = [Σi αt(i) aij] bj(ot+1), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N (10)

Termination:

P(O|λ) = Σi αT(i) (11)

The required P(O|λ) is the sum of the terminal forward variables αT(i); this holds because

αT(i) = P(o1, o2, ..., oT, qT = Si | λ) (12)

where Si denotes one of the N possible states (1 ≤ i ≤ N).
Backward Algorithm
The backward variable βt(i) is defined as βt(i) = P(ot+1, ot+2, ..., oT | qt = i, λ), i.e. the probability of the partial observation sequence from t+1 to the end, given state i at time t and the model λ. βt(i) is solved inductively as follows:
Initialization:

βT(i) = 1, 1 ≤ i ≤ N (13)

Induction:

βt(i) = Σj aij bj(ot+1) βt+1(j), t = T−1, T−2, ..., 1, 1 ≤ i ≤ N (14)

Combining the forward and backward variables, we get, for any 1 ≤ t ≤ T:

P(O|λ) = Σi αt(i) βt(i) (15)
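The recursions (9)-(14) translate directly into NumPy; this sketch returns the full α and β trellises so that identity (15) can be checked at any t:

```python
import numpy as np

def forward(A, B, pi, obs):
    """alpha_t(i) via equations (9)-(10); returns the T x N trellis."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                          # initialization (9)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]  # induction (10)
    return alpha

def backward(A, B, obs):
    """beta_t(i) via equations (13)-(14); returns the T x N trellis."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                        # initialization (13)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])    # induction (14)
    return beta
```

For any t, (alpha[t] * beta[t]).sum() equals alpha[-1].sum(), which is P(O|λ) per equations (11) and (15).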
VI. K-MEANS ALGORITHM
The segmental k-means algorithm is used to generate the codebook of the features of the voice samples; it clusters the observations into k partitions. The algorithm first partitions the input vectors into k initial sets, by random selection or using heuristic data, and then alternates between two steps: each observation is assigned to the cluster with the closest mean, and the means are then recalculated as the centroids of the observations in each cluster. By associating each observation with its closest centroid, a new partition is constructed, and the centroids are recalculated for the new clusters until convergence, i.e. until the assignments no longer change. In practice it converges extremely fast, and the best clustering found over several runs is returned [9].
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S1, S2, ..., Sk}, so as to minimize the within-cluster sum of squares (WCSS):

WCSS = Σi Σx∈Si ||x − µi||²

where µi is the mean of the points in Si.
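The two alternating steps can be sketched as plain (non-segmental) k-means; the random initialization from the data points is one of the heuristics mentioned above, and a fixed iteration cap stands in for a full convergence test:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Assign each point to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by random selection from the observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assignment step: distance of every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```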
VII. VITERBI ALGORITHM
Using the final re-estimated A, B, and π, the log likelihood of the HMM is calculated with respect to all the word models available to the recognition engine using the Viterbi algorithm. The Viterbi algorithm takes the model parameters and the observation vectors of the word as input and returns the matching value against each word model. These are the likelihood values of the word (LIHMM) passed to the hybrid training model [9].
To find the single best state sequence Q = q1, q2, q3, ..., qt (which produces the given observation sequence) for a given observation sequence O = o1, o2, o3, ..., ot, we define the quantity

δt(i) = max over q1, q2, ..., qt−1 of P(q1 q2 ... qt = Si, o1 o2 ... ot | λ)

i.e., δt(i) is the best score along a single path, at time t, which accounts for the first t observations and ends in state Si. By induction,

δt+1(j) = [maxi δt(i) aij] bj(ot+1)

In order to recover the state sequence, we need to keep track of the state which maximizes the above equation; we do this via an array ψt(j) for each t and state j. Once the final state is reached, the corresponding state sequence can be found by backtracking [9].
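The δ/ψ recursion with backtracking can be sketched as follows (a plain-probability version; practical systems work in the log domain to avoid underflow):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Best state path and its probability via the delta/psi recursion."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A     # delta_{t-1}(i) * a_ij
        psi[t] = trans.argmax(axis=0)         # best predecessor for each j
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    # Backtrack from the most likely final state.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[-1].max()
```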
VIII. CONCLUSION
In this overview, we have discussed a speech recognition system using HMM, covering the techniques used in each stage of the system. It is found that MFCC is widely used for feature extraction of speech because it is robust to noise, and that HMM is the best among the modelling techniques considered, as it increases recognition accuracy and speed.
REFERENCES
[1] D. B. Paul, “Speech Recognition Using Hidden Markov Models.”
[2] M. A. Anusuya, S. K. Katti, “Speech Recognition by Machine: A Review,” International Journal of Computer Science and Information Security, 2009.
[3] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey, 1993.
[4] Mahdi Shaneh and Azizollah Taheri, “Voice Command Recognition System Based on MFCC and VQ Algorithms,” World Academy of Science, Engineering and Technology, 57, 2009.
[5] H. Combrinck and E. Botha, “On the Mel-scaled Cepstrum,” Department of Electrical and Electronic Engineering, University of Pretoria; Journal of Computer Science 3(8): 608-616, 2007, ISSN 1549-3636.
[6] Ahmad A. M. Abushariah, Teddy S. Gunawan, Othman O. Khalifa, “English Digits Speech Recognition System Based on Hidden Markov Models,” International Islamic University Malaysia, International Conference on Computer and Communication Engineering (ICCCE 2010), 11-13 May 2010, Kuala Lumpur, Malaysia.
[7] Anjali Bala, Abhijeet Kumar, Nidhika Birla, “Voice Command Recognition System Based on MFCC and DTW,” International Journal of Engineering Science and Technology, Vol. 2(12), 2010.
[8] Ibrahim Patel, Dr. Y. Srinivasa Rao, “Speech Recognition Using Hidden Markov Model with MFCC-Subband Technique,” 2010 International Conference on Recent Trends in Information, Telecommunication and Computing.
[9] Shumaila Iqbal, Tahira Mehboob, Malik Sikandar, “Voice Recognition Using HMM with MFCC for Secure ATM,” IJCS, Vol. 8, Issue 6, Nov 2011.
[10] Vimala C, Dr. V. Radha, “A Review on Speech Recognition Challenges and Approaches,” World of Computer Science and Information Technology Journal (WCSIT), ISSN: 2221-0741, Vol. 2, No. 1, 1-7, 2012.
[11] Lawrence R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
... The HMM-based voice recognition system has been covered in this [3] by the author. Based on this comprehensive research, it has been determined that MFCC was the most popular choice for noise-resistant feature extraction of speech, and that HMM was the most effective modelling technique overall due to its ability to simultaneously improve identification accuracy and speed. ...
... In speech recognition systems, several feature extraction approaches are utilized. It helps to achieve the objective of identifying speakers based on lowlevel attributes [3]. The presence of rhythmic patterns in the speech signal supports the use of the cepstral approach for feature extraction from our speech data. ...
Article
Speech recognition is the application of sophisticated algorithms which involve the transforming of the human voice to text. Speech identification is essential as it utilizes by several biometric identification systems and voice-controlled automation systems. Variations in recording equipment, speakers, situations, and environments make speech recognition a tough undertaking. Three major phases comprise speech recognition: speech pre-processing, feature extraction, and speech categorization. This work presents a comprehensive study with the objectives of comprehending, analyzing, and enhancing these models and approaches, such as Hidden Markov Models and Artificial Neural Networks, employed in the voice recognition system for feature extraction and classification.
... HMM is a model used to represent the acoustic phenomenon and acoustic changes according to time. HMMs provide a highly reliable method of recognizing spoken signals (Chavan & Sable, 2013;Naithani et al., 2018;Najkar et al., 2010). HMMs were also applied to recognize inhaling and exhaling signals (Phoophuangpairoj, 2020) and sleep spindles (Stevner et al., 2019). ...
Article
This paper proposes a method to recognize fruits whose quality, including their ripeness, grades, brix values, and flesh characteristics, cannot be determined visually from their skin but from striking and flicking sounds. Four fruit types consisting of durians, watermelons, guavas, and pineapples were studied in this research. In recognition of fruit types, preprocessing removes the non-striking/non-flicking parts from the striking and flicking sounds. Then the sequences of frequency domain acoustic features containing 13 Mel Frequency Cepstral Coefficients (MFCCs) and their 13 first- and 13 second-order derivatives were extracted from striking and flicking sounds. The sequences were used to create the Hidden Markov Models (HMMs). The HMM acoustic models, dictionary, and grammar were incorporated to recognize striking and flicking sounds. When testing the striking and flicking sounds obtained from the fruits used to create the training set but were collected at different times, the recognition accuracy using 1 through 5 strikes/flicks was 98.48%, 98.91%, 99.13%, 98.91%, and 99.57%, respectively. For an unknown test set, of which the sounds obtained from the fruits that were not used to create the training set, the recognition accuracy using 1 through 5 strikes/flicks were 95.23%, 96.82%, 96.82%, 97.05%, and 96.59%, respectively. The results also revealed that the proposed method could accurately distinguish the striking sounds of durians from the flicking sounds of watermelons, guavas, and pineapples.
... Before the era of deep neural networks, the HMM was one of the most popular methods for speech recognition [28,29]. Therefore, initially, in a speech recognition scenario for dysarthric speech, the overall performance of a system based on proposed CNN and gammatonegram with a system based on HMM-GMM and conventional features of Mel Frequency Cepstral Coefficient (MFCC) has been compared. ...
Preprint
Full-text available
Dysarthria is a disability that causes a disturbance in the human speech system and reduces the quality and intelligibility of a person's speech. Because of this effect, the normal speech processing systems can not work properly on impaired speech. This disability is usually associated with physical disabilities. Therefore, designing a system that can perform some tasks by receiving voice commands in the smart home can be a significant achievement. In this work, we introduce gammatonegram as an effective method to represent audio files with discriminative details, which is used as input for the convolutional neural network. On the other word, we convert each speech file into an image and propose image recognition system to classify speech in different scenarios. Proposed CNN is based on the transfer learning method on the pre-trained Alexnet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is located in a cascade arrangement with the two-class intelligibility assessment system, and the output of this system activates each one of the speech recognition networks. This architecture achieves an accuracy of 92.3% WRR. The source code of this paper is available.
... This employs statistics postulates to organize feature vectors into a Markov matrix (chains) that contains state transition probabilities [60]. If each of our code words were to represent some state, the HMM would split the feature vector of the signal into a no. of states and find the chances of a signal to travel from one state to another state [61]. HMM's popularity stems from the fact that they can be trained automatically and are computationally feasible due to their solid mathematical foundation compared to the template-based approach discussed. ...
Article
Full-text available
Speech recognition systems have become a unique human-computer interaction (HCI) family. Speech is one of the most naturally developed human abilities; speech signal processing opens up a transparent and hand-free computation experience. This paper aims to present a retrospective yet modern approach to the world of speech recognition systems. The development journey of ASR (Automatic Speech Recognition) has seen quite a few milestones and breakthrough technologies that have been highlighted in this paper. A step-by-step rundown of the fundamental stages in developing speech recognition systems has been presented, along with a brief discussion of various modern-day developments and applications in this domain. This review paper aims to summarize and provide a beginning point for those starting in the vast field of speech signal processing. Since speech recognition has a vast potential in various industries like telecommunication, emotion recognition, healthcare, etc., this review would be helpful to researchers who aim at exploring more applications that society can quickly adopt in future years of evolution.
... HMMs are traditionally applied in fields such as speech recognition (Palaz et al., 2019;Novoa et al., 2018;Chavan and Sable, 2013;Abdel-Hamid and Jiang, 2013), bioinformatics, and anomaly detection (Qiao et al., 2002;Joshi and Phoha, 2005;Cho and Park, 2003). It has also been used for short-term EQ forecasting, using observations from EQ catalogs (Yip et al., 2018;Chambers et al., 2012;Ebel et al., 2007), as well as GPS measurements of ground deformations (Wang and Bebbington, 2013). ...
Article
Full-text available
Geoelectric time series (TS) have long been studied for their potential for probabilistic earthquake forecasting, and a recent model (GEMSTIP) directly used the skewness and kurtosis of geoelectric TS to provide times of increased probability (TIPs) for earthquakes for several months in the future. We followed up on this work by applying the hidden Markov model (HMM) to the correlation, variance, skewness, and kurtosis TSs to identify two hidden states (HSs) with different distributions of these statistical indexes. More importantly, we tested whether these HSs could separate time periods into times of higher/lower earthquake probabilities. Using 0.5 Hz geoelectric TS data from 20 stations across Taiwan over 7 years, we first computed the statistical index TSs and then applied the Baum–Welch algorithm with multiple random initializations to obtain a well-converged HMM and its HS TS for each station. We then divided the map of Taiwan into a 16-by-16 grid map and quantified the forecasting skill, i.e., how well the HS TS could separate times of higher/lower earthquake probabilities in each cell in terms of a discrimination power measure that we defined. Next, we compare the discrimination power of empirical HS TSs against those of 400 simulated HS TSs and then organized the statistical significance values from this cellular-level hypothesis testing of the forecasting skill obtained into grid maps of discrimination reliability. Having found such significance values to be high for many grid cells for all stations, we proceeded with a statistical hypothesis test of the forecasting skill at the global level to find high statistical significance across large parts of the hyperparameter spaces of most stations. We therefore concluded that geoelectric TSs indeed contain earthquake-related information and the HMM approach is capable of extracting this information for earthquake forecasting.
Article
The paper presents an analysis of modern Artificial Intelligence algorithms for an automated system supporting human beings during conversation in the Polish language. Its task is to perform Automatic Speech Recognition (ASR) and process the result further, for instance to fill in a computer-based form or to perform Natural Language Processing (NLP) to assign the conversation to one of several predefined categories. A state-of-the-art review is required to select the optimal set of tools for processing speech under the difficult conditions that degrade ASR accuracy. The paper presents the top-level architecture of a system applicable to the task. Characteristics of the Polish language are discussed. Next, existing ASR solutions and architectures with End-To-End (E2E) deep neural network (DNN) based ASR models are presented in detail. Differences between Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN) and Transformers in the context of ASR technology are also discussed.
Article
A Hidden Markov Model (HMM) is a pair of stochastic processes: a hidden Markov process and an observed emission process. Generally, HMMs are used to study the hidden behavior of random systems through observed emission sequences generated by the phenomenon under study. In this framework, we propose to solve the likelihood and decoding problems for HMMs whose state space is composed of a continuous part and a discrete part. We adapt the forward, backward and Viterbi algorithms to our proposal. Numerical examples and Monte Carlo simulations are presented to show the efficiency of the algorithms and their adaptation to the proposed model.
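As a reference point for the classical fully discrete case that this paper extends, a minimal Viterbi decoder can be sketched as below; all probabilities are illustrative, not taken from the paper:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for a discrete HMM (log domain).

    pi: (S,) initial probs, A: (S,S) transitions, B: (S,O) emissions.
    """
    S, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])      # delta at t = 0
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)        # scores[from, to]
        back[t] = scores.argmax(axis=0)           # best predecessor per state
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy 2-state example (numbers purely illustrative)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
best_path = viterbi([0, 0, 1], pi, A, B)
```

The paper's contribution is adapting this recursion to a mixed continuous/discrete state space; the sketch above covers only the standard discrete setting.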
Conference Paper
Data mining is capable of revealing hidden, unknown, and interesting information in the form of knowledge in the healthcare industry. It is useful for building decision support systems for disease prediction and valid diagnosis of health issues. The concepts of data mining can be used to recommend solutions and suggestions in the medical sector for precautionary measures to control the origin of disease at an early stage. Today, diabetes is one of the most common and life-threatening syndromes found all over the world. The presence of diabetes is itself a cause of many other health issues, in the form of side effects in the human body. In such cases, there is a need to find the hidden data patterns in diabetic data to discover knowledge, so as to reduce the invisible health problems that arise in diabetic patients. Many studies have shown that the Associative Classification concept of data mining works well and can deliver good outcomes in terms of prediction accuracy. This research work presents the experimental results of work carried out to predict and detect by-diseases in diabetic patients using Associative Classification, and it discusses an improved algorithmic method of Associative Classification named Associative Classification using Maximum Threshold and Super Subsets (ACMTSS) to achieve better accuracy.
Keywords: Knowledge · By-disease · Maximum threshold · Super subsets · ACMTSS · Associative Classification
Article
Full-text available
Security is an essential part of human life. In this era, security is a major issue, and it is reliable and efficient if it is unique in some way. Voice recognition is one of the security measures used to protect a person's computerized and electronic belongings by means of their voice. In this paper, a voice sample is analysed with MFCC to extract acoustic features, which are then used to train HMM parameters through the forward–backward algorithm; finally, the log-likelihood computed during training is stored in a database. The system recognizes the speaker by comparing the log value from the database against the PIN code. It is implemented in the Matlab 7.0 environment and shows 86.67% correct acceptances and correct rejections, with an error rate of 13.33%.
Conference Paper
Full-text available
Abushariah, A. A. M., et al. (2010). Proc. ICCCE 2010, Kuala Lumpur, Malaysia. Art. no. 5556819, doi: 10.1109/ICCCE.2010.5556819.
Article
Full-text available
Voice is a signal of infinite information. Digital processing of the speech signal is very important for high-speed and precise automatic voice recognition technology. Nowadays it is used in health care, telephony, the military, and for people with disabilities; therefore digital signal processes such as feature extraction and feature matching are current topics in the study of voice signals. In order to extract valuable information from the speech signal, make decisions in the process, and obtain results, the data needs to be manipulated and analyzed. The basic method used for extracting the features of the voice signal is to find the Mel frequency cepstral coefficients. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively represent the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. This paper is divided into two modules: in the first module, features of the speech signal are extracted in the form of MFCC coefficients, and in the second module, the nonlinear sequence alignment known as Dynamic Time Warping (DTW), introduced by Sakoe and Chiba, is used as the feature matching technique. Since voice signals tend to have different temporal rates, alignment is important to produce better performance. This paper presents the feasibility of MFCC for extracting features and DTW for comparing test patterns.
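The DTW matching step described above is a classic dynamic program over a local-cost matrix; a minimal sketch with Euclidean local cost between feature frames (the toy sequences stand in for MFCC vectors and are purely illustrative):

```python
import numpy as np

def dtw_distance(x, y):
    """DTW alignment cost between two sequences of feature vectors."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)       # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])   # Euclidean local cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])    # same shape, different tempo
d = dtw_distance(a, b)
```

Because `b` is just a time-stretched copy of `a`, the warping path absorbs the tempo difference and the alignment cost is zero, which is exactly why DTW suits utterances spoken at different rates.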
Article
Speech technology and systems in human computer interaction have witnessed stable and remarkable advancement over the last two decades. Today, speech technologies are commercially available for a wide and interesting range of tasks. These technologies enable machines to respond correctly and reliably to human voices and provide useful and valuable services. Recent research concentrates on developing systems that are much more robust against variability in environment, speaker, and language. Hence today's research mainly focuses on ASR systems with a large vocabulary that support speaker-independent operation on continuous speech in different languages. This paper gives an overview of the speech recognition system and its recent progress. The primary objective of this paper is to compare and summarize some of the well-known methods used in the various stages of a speech recognition system.
Article
The goal of this project is to design a system to recognize voice commands. Most voice recognition systems contain two main modules: "feature extraction" and "feature matching". In this project, the MFCC algorithm is used to implement the feature extraction module. Using this algorithm, the cepstral coefficients are calculated on the mel frequency scale. Vector quantization (VQ) is used to reduce the amount of data and decrease computation time. In the feature matching stage, Euclidean distance is applied as the similarity criterion. Because of the high accuracy of the algorithms used, the accuracy of this voice command system is high. With each command repeated at least five times in a single training session, and then twice in each testing session, a zero error rate in command recognition is achieved.
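The VQ stage this abstract describes (codebook construction, then Euclidean-distance matching) can be sketched with a tiny k-means; the codebook size, iteration count, and synthetic two-cluster frames are assumptions for illustration, not values from the paper:

```python
import numpy as np

def build_codebook(features, k, iters=20, seed=0):
    """Tiny k-means to build a VQ codebook from MFCC-like frames."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # distance of every frame to every codeword
        d = np.linalg.norm(features[:, None] - codebook[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):                    # recentre each codeword
            if (labels == j).any():
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

def vq_distortion(features, codebook):
    """Average Euclidean distance to the nearest codeword (matching score)."""
    d = np.linalg.norm(features[:, None] - codebook[None], axis=2)
    return d.min(axis=1).mean()

rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
cb = build_codebook(frames, k=2)
dist = vq_distortion(frames, cb)
```

At recognition time, a command's frames would be scored against each trained codebook and the lowest-distortion codebook wins; that matching rule follows the abstract's description, while the clustering details here are generic.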
Article
This paper presents a brief survey of Automatic Speech Recognition and discusses the major themes and advances made in the past 60 years of research, so as to provide a technological perspective and an appreciation of the fundamental progress that has been accomplished in this important area of speech communication. After years of research and development, the accuracy of automatic speech recognition remains one of the important research challenges (e.g., variations of context, speakers, and environment). The design of a speech recognition system requires careful attention to the following issues: definition of various types of speech classes, speech representation, feature extraction techniques, speech classifiers, databases, and performance evaluation. The problems existing in ASR and the various techniques constructed by researchers to solve them are presented in chronological order. The authors hope that this work is a contribution to the area of speech recognition. The objective of this review paper is to summarize and compare some of the well-known methods used in the various stages of a speech recognition system and to identify research topics and applications at the forefront of this exciting and challenging field. Comment: 25 pages, IEEE format, International Journal of Computer Science and Information Security, IJCSIS December 2009, ISSN 1947 5500, http://sites.google.com/site/ijcsis/