Available Online at www.ijcsmc.com
International Journal of Computer Science and Mobile Computing
A Monthly Journal of Computer Science and Information Technology
ISSN 2320–088X
IJCSMC, Vol. 2, Issue. 6, June 2013, pg.233 – 238
RESEARCH ARTICLE
© 2013, IJCSMC All Rights Reserved
An Overview of Speech Recognition Using HMM
Ms. Rupali S. Chavan¹, Dr. Ganesh S. Sable²

¹Department of E&TC, Savitribai Phule Women's Engineering College, Aurangabad, Maharashtra, India
²Department of E&TC, Savitribai Phule Women's Engineering College, Aurangabad, Maharashtra, India

¹chavanrupali452@gmail.com; ²sable.eesa@gmail.com
Abstract— Speech is the most prominent and natural form of communication among human beings, and the speech signal carries a wealth of information. Speech processing has several aspects, such as speech recognition, speech verification, speech synthesis, speaker recognition, and speaker identification. The purpose of this work is to study a speech recognition system using HMM. The goal of speech recognition is to determine which word was spoken from the information in the speech signal. The system uses MFCC for feature extraction and HMM for pattern training. The success of MFCCs, combined with their robust and cost-effective computation, has made them a standard choice in speech recognition applications, and HMMs provide a highly reliable way of recognizing speech.
Key Terms: - Discrete Cosine Transform; Fast Fourier Transform; Hidden Markov Model; Mel Frequency Cepstral Coefficients; Speech Recognition
I. INTRODUCTION
Speech recognition is a powerful tool for information exchange using the acoustic signal, so it is not surprising that the speech signal has long been a subject of research. Speech recognition is a technology that enables a computer to capture the words spoken by a human with the help of a microphone. These words are then recognized by a speech recognizer, and finally the system outputs the recognized words. Speech recognition is essentially the science of talking with a computer and having it correctly recognize what was said; more broadly, it is about extracting the meaning of an utterance so that one can respond properly, whether or not every word has been correctly recognized. Data input to a machine is of generic use, but in what circumstances is speech recognition preferred? One example is an eyes-and-hands-busy user such as a quality control inspector, inventory taker, cartographer, radiologist (medical X-ray reader), mail sorter, or aircraft pilot. Another use is transcription in the business environment, where removing the distraction of typing may be faster for the non-typist. The technology is also helpful to handicapped persons who might otherwise require helpers to control their environments. Automatic speech recognition has a long history as a difficult problem; the first papers date from about 1950. During this period a number of techniques were used, such as linear-time-scaled word-template matching, dynamic-time-warped word-template matching, linguistically motivated approaches (find the phonemes, assemble them into words, assemble the words into sentences), and hidden Markov models (HMMs). Of all of the available techniques, HMMs currently yield the best performance [1].
Speech recognition involves two processes: database creation (training) and recognition. Database creation covers the collection of speakers' voice samples and the extraction of features for the selected words. Recognition is the process of identifying a spoken word by comparing the current voice features to the pre-stored features. At run time, the recognizer first computes the likelihood of the unknown spoken word against the pre-stored database of known words and then decides on the word with the maximum likelihood. Speech recognition falls into two categories: text dependent and text independent. Text-dependent speech recognition identifies the spoken word against the words that were provided at the time of database collection; in this case the text in the recognition phase is the same as in the training phase. Text-independent speech recognition identifies the spoken word irrespective of the text. Speech recognition is also classified as speaker dependent and speaker independent. In the speaker-dependent type, speech is recognized only for speakers whose samples were taken during training; speaker-independent speech recognition identifies the spoken word irrespective of the speaker [2].
In early research, Lawrence Rabiner and Biing-Hwang Juang, in their book "Fundamentals of Speech Recognition", explained different techniques such as Hidden Markov Models, DTW, LPC, VQ, and MFCC in detail. HMM systems generally use large acoustic models composed of several thousand parameters. Dynamic Time Warping (DTW) and the Hidden Markov Model (HMM) are two well-studied non-linear sequence alignment (or pattern matching) algorithms. The research trend transitioned from DTW to HMM in approximately 1988-1990, since DTW is deterministic and lacks the power to model stochastic signals [3]. In another review, M. A. Anusuya and S. K. Katti reported on "Speech Recognition by Machine: A Review". They presented a brief survey of Automatic Speech Recognition and explained in depth the classification of ASR systems, the relevant issues of ASR design, and the approaches to speech recognition [2]. In another study, Mahdi Shaneh and Azizollah Taheri proposed a "Voice command recognition system based on MFCC and VQ algorithms". They designed a system to recognize voice commands, using the MFCC algorithm for feature extraction and vector quantization (VQ) to reduce the amount of data and decrease computation time. In the feature matching stage, Euclidean distance was applied as the similarity criterion. Because of the high accuracy of the algorithms used, they obtained a high-accuracy voice command system. Training initially with one repetition of each command and testing once per session, they got a 15% error rate; after increasing the number of training samples, they obtained a zero error rate [4].
In their research "On the Mel-scaled Cepstrum", H. P. Combrinck and E. C. Botha reported on the superior performance of MFCC, especially under adverse conditions, and concluded that it represents a good trade-off between computational efficiency and perceptual considerations [5].
Another study was done by Ahmad A. M. Abushariah, Teddy S. Gunawan, and Othman O. Khalifa in their paper "English Digits Speech Recognition System Based on Hidden Markov Models". Two modules were developed, namely isolated-word speech recognition and continuous speech recognition. Both modules were tested in clean and noisy environments and showed successful recognition rates, relatively high compared to similar systems. The multi-speaker mode performed better than the speaker-independent mode in both environments [6]. Ibrahim Patel and Dr. Y. Srinivasa Rao, in their paper "Speech Recognition using Hidden Markov Model with MFCC Subband Technique", concluded that these methods improve the quality metrics of speech recognition with respect to computational time and learning accuracy [8]. In another study of voice recognition using HMM with MFCC for a secure ATM, by Shumaila Iqbal, Tahira Mahboob, and Malik Sikandar, the recognition accuracy was found to be 86.67% [9].
This paper presents an overview of a speech recognition system using MFCC and HMM. The Mel Frequency Cepstral Coefficient (MFCC) method is studied here for extracting the features of the speech signal; the pre-processing and feature extraction stages of a pattern recognition system serve as an interface between the real world and a classifier operating on an idealised model of reality. An HMM is then trained on these features, and its parameters are used to find the log likelihood of the speech samples; in recognition, this likelihood is used to identify the spoken word.
II. SPEECH RECOGNITION SYSTEM
The speech signal primarily conveys the words or message being spoken. The area of speech recognition is concerned with determining the underlying meaning of an utterance. Success in speech recognition depends on extracting and modelling the speech-dependent characteristics which can effectively distinguish one word from another. The speech recognition system may be viewed as working in four stages, as shown in Fig. 1:
i. Feature extraction
ii. Pattern training
iii. Pattern matching
iv. Decision logic
Fig. 1 Speech Recognition System
The feature extraction process is implemented using Mel Frequency Cepstral Coefficients (MFCC), in which speech features are extracted for all the speech samples. These features are then given to the pattern trainer and trained by HMM to create an HMM model for each word. Viterbi decoding is then used to select the model with the maximum likelihood, which corresponds to the recognized word.
III. MFCC APPROACH
The purpose of this module is to convert the speech waveform into some type of parametric representation. MFCC is used to extract the unique features of speech samples; it represents the short-term power spectrum of human speech. The MFCC technique makes use of two types of filters, namely linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed in the Mel frequency scale. The Mel scale is based on studies of the pitch or frequency perceived by humans, and its unit is the mel. The mapping is approximately linear below 1000 Hz and logarithmic above 1000 Hz. Equation (1) is used to convert a normal frequency f (in Hz) to the Mel scale:
Mel(f) = 2595 log10(1 + f/700)    (1)
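The conversion in equation (1) can be written as a one-line function; the following is an illustrative Python sketch, not part of the original system:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale, per equation (1)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The scale is roughly linear below 1000 Hz and logarithmic above;
# 1000 Hz maps to approximately 1000 mel.
mel_1k = hz_to_mel(1000.0)
```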
As shown in Fig. 2, MFCC consists of six computational steps. Each step has its own function and mathematical approach, as discussed briefly in the following:
Step 1: Pre-emphasis
In this step the signal is passed through a filter that emphasizes the higher frequencies in the band, increasing the magnitude of the higher frequencies with respect to the lower ones in order to improve the overall SNR. This process increases the energy of the signal at higher frequencies [7].
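Pre-emphasis is commonly realised as a first-order high-pass filter y[n] = x[n] - α·x[n-1]. A minimal sketch follows; the coefficient α = 0.97 is a conventional choice, not a value stated in the text:

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    alpha (typically 0.95-0.97) boosts high frequencies relative to low ones."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]
```

Applied to a constant (purely low-frequency) signal, the filter output after the first sample is close to zero, which is exactly the attenuation of low-frequency energy described above.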
Step 2: Framing
The process of segmenting the sampled speech samples into a small frames. The speech signal is divided into
frames of N samples. Adjacent frames are being separated by M (M<N). Typical values used are M = 100 and
N= 256(which is equivalent to ~ 30 m sec windowing)
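The segmentation with the N = 256, M = 100 values above can be sketched as follows (an illustrative snippet, not the authors' implementation):

```python
def frame_signal(samples, frame_len=256, frame_shift=100):
    """Split samples into overlapping frames of N = frame_len samples,
    with adjacent frames offset by M = frame_shift samples (M < N)."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames
```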
Fig.2 Computational Steps of MFCC
Step 3: Hamming Windowing
Each individual frame is windowed so as to minimize the signal discontinuities at the beginning and end of the frame. The Hamming window is used, as it integrates the closest frequency lines. If the window is defined as W(n), 0 ≤ n ≤ N-1, where
N = number of samples in each frame,
X(n) = input signal,
W(n) = Hamming window,
Y(n) = output signal,
then the result of windowing the signal is:
Y(n) = X(n) × W(n)    (2)
W(n) = 0.54 - 0.46 cos(2πn / (N-1)),  0 ≤ n ≤ N-1    (3)
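Equations (2) and (3) translate directly into code; a minimal sketch:

```python
import math

def hamming(N):
    """Hamming window, equation (3): w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def window_frame(frame):
    """Equation (2): multiply each sample of the frame by the window value."""
    w = hamming(len(frame))
    return [x * wn for x, wn in zip(frame, w)]
```

The window is symmetric, tapering from 0.08 at the edges to 1.0 at the centre, which is what suppresses the discontinuities at the frame boundaries.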
Step 4: Fast Fourier Transform
To convert each frame of N samples from the time domain into the frequency domain, the FFT is applied.
Step 5: Mel Filter Bank Processing
The frequency range of the FFT spectrum is very wide, and the voice signal does not follow a linear scale. A bank of filters spaced according to the Mel scale, as shown in Fig. 3, is therefore applied.
Fig. 3 Mel Filter Bank
The figure shows the set of triangular filters that are used to compute a weighted sum of the spectral components, so that the output of the process approximates a Mel scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at the centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters. Each filter output is then the sum of its filtered spectral components. The output is the mel spectrum, which consists of the output powers of these filters; taking its logarithm gives the log mel spectrum.
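Such a triangular filter bank can be sketched as below. The sampling rate (8 kHz), FFT size (256) and filter count (20) are illustrative assumptions, not values fixed by the text:

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=256, sample_rate=8000):
    """Triangular filters equally spaced on the Mel scale.
    Each row covers the first n_fft//2 + 1 FFT bins; it rises linearly
    to 1 at its centre frequency and falls to 0 at the centre
    frequencies of its two neighbours."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low + i * (high - low) / (n_filters + 1)
                  for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    fbank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):          # rising slope
            filt[k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope, 1.0 at centre
            filt[k] = (right - k) / max(right - centre, 1)
        fbank.append(filt)
    return fbank
```

Multiplying each filter row by the frame's power spectrum and summing gives one mel-spectrum value per filter, whose logarithm is the log mel spectrum described above.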
Step 6: Discrete Cosine Transform
In this step the log Mel spectrum is converted back to the time domain using the Discrete Cosine Transform (DCT). The result of the conversion is the set of Mel Frequency Cepstral Coefficients; this set of coefficients is called an acoustic vector. Each input utterance is therefore transformed into a sequence of acoustic vectors [7][8].
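The DCT step can be sketched as a plain DCT-II over the log mel spectrum; keeping the first 13 coefficients is a common convention, not a value from the text:

```python
import math

def dct_ii(log_mel, n_coeffs=13):
    """DCT-II of the log Mel spectrum; keeps the first n_coeffs
    Mel Frequency Cepstral Coefficients (unnormalized)."""
    N = len(log_mel)
    return [sum(log_mel[m] * math.cos(math.pi * k * (m + 0.5) / N)
                for m in range(N))
            for k in range(n_coeffs)]
```

A useful sanity check: for a flat log mel spectrum only the zeroth coefficient is non-zero, since the higher basis cosines sum to zero over a full period.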
IV. HIDDEN MARKOV MODELLING APPROACH
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters; the challenge is to determine the hidden parameters from the observable data. In a hidden Markov model the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens, so the sequence of tokens generated by an HMM gives some information about the sequence of states. A hidden Markov model can be considered a generalization of a mixture model in which the hidden variables, which control the mixture component selected for each observation, are related through a Markov process rather than being independent of each other.
HMM creates stochastic models from known utterances and compares the probability that the unknown utterance was generated by each model. Statistical theory is used to arrange the feature vectors into a Markov chain that stores the probabilities of state transitions: if each code word represents some state, the HMM follows the sequence of state changes and builds a model that includes the probability of each state progressing to another.
HMMs are popular because they can be trained automatically and are simple and computationally feasible to use. An HMM considers the speech signal as quasi-static over short durations and models these frames for recognition. It breaks the feature vector of the signal into a number of states and finds the probability of the signal transiting from one state to another. HMMs are simple networks that can generate speech (sequences of cepstral vectors) using a number of states for each model, modeling the short-term spectra associated with each state with, usually, mixtures of multivariate Gaussian distributions (the state output distributions). The parameters of the model are the state transition probabilities and the means, variances, and mixture weights that characterize the state output distributions [10].
An HMM with discrete observations can be characterized by the following:
i. N is the number of states in the given model; these states are hidden.
ii. M is the number of distinct observation symbols, corresponding to the physical outputs of the model.
iii. A is the state transition probability distribution, defined by an N×N matrix as shown in equation (4):
A = {a_ij},  a_ij = P(q_{t+1} = j | q_t = i),  1 ≤ i, j ≤ N    (4)
Σ_{j=1..N} a_ij = 1,  1 ≤ i ≤ N    (5)
where q_t denotes the current state; the transition probabilities must satisfy the stochastic constraint (5).
B is the observation symbol probability distribution, defined by an N×M matrix:
B = {b_j(k)},  b_j(k) = P(o_t = v_k | q_t = j),  1 ≤ j ≤ N, 1 ≤ k ≤ M    (6)
Σ_{k=1..M} b_j(k) = 1,  1 ≤ j ≤ N    (7)
where v_k represents the k-th observation symbol in the alphabet and o_t the current parameter vector; B must also satisfy the stochastic constraint (7).
π is the initial state distribution, defined by an N×1 vector:
π = {π_i}    (8)
By defining N, M, A, B, and π, the HMM can give the observation sequence for the entire model as λ = (A, B, π), which specifies the complete parameter set of the model [11].
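A discrete λ = (A, B, π) can be written down concretely as below. The numbers are hypothetical, chosen only to illustrate the N×N, N×M, and N×1 shapes and the stochastic constraints (5) and (7):

```python
def is_row_stochastic(rows, tol=1e-9):
    """Check the stochastic constraints (5) and (7): each row sums to 1."""
    return all(abs(sum(r) - 1.0) < tol for r in rows)

# A toy discrete HMM with N = 2 hidden states and M = 3 observation symbols:
A  = [[0.7, 0.3],         # a_ij = P(q_{t+1} = j | q_t = i), N x N
      [0.4, 0.6]]
B  = [[0.5, 0.4, 0.1],    # b_j(k) = P(o_t = v_k | q_t = j), N x M
      [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]           # initial state distribution, N x 1
```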
V. FORWARD BACKWARD ALGORITHM
The forward-backward algorithm is used to train the HMM parameters and to find the log likelihood of a voice sample. It estimates the unidentified parameters of the HMM, computing the maximum likelihood and posterior mode estimates of the parameters during training. Here we want to find P(O|λ), given the observation sequence O = o_1, o_2, o_3, · · · , o_T.
Forward Algorithm
The forward variable α_t(i) is defined as α_t(i) = P(o_1, o_2, …, o_t, q_t = i | λ), i.e. the probability of the partial observation sequence (up to time t) and of state i at time t, given the model λ. α_t(i) is computed inductively by the following steps:
• Initialization:
α_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N    (9)
• Induction:
α_{t+1}(j) = [Σ_{i=1..N} α_t(i) a_ij] b_j(o_{t+1}),  1 ≤ t ≤ T-1, 1 ≤ j ≤ N    (10)
• Termination:
P(O|λ) = Σ_{i=1..N} α_T(i)    (11)
Finally, the required P(O|λ) is the sum of the terminal forward variables α_T(i). This is true because
α_T(i) = P(o_1, o_2, · · · , o_T, q_T = S_i | λ)    (12)
where S_i is the state at time T, and there are N possible states S_i (1 ≤ i ≤ N).
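The three steps (9)-(11) can be sketched directly in code; the toy model below is hypothetical, for illustration only:

```python
def forward(A, B, pi, obs):
    """Forward algorithm for a discrete HMM: returns P(O | lambda).
    alpha_t(i) = P(o_1..o_t, q_t = i | lambda), built up by induction."""
    N = len(pi)
    # Initialization (eq. 9): alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction (eq. 10): alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination (eq. 11): P(O|lambda) = sum_i alpha_T(i)
    return sum(alpha)

# Toy 2-state, 3-symbol model (hypothetical numbers):
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
likelihood = forward(A, B, pi, [0, 1, 2])
```

As a sanity check, the likelihoods of all possible observation sequences of a fixed length sum to 1.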
Backward Algorithm
The backward variable β_t(i) is defined as β_t(i) = P(o_{t+1}, o_{t+2}, …, o_T | q_t = i, λ), i.e. the probability of the partial observation sequence from t+1 to the end, given state i at time t and the model λ. β_t(i) is solved inductively as follows:
• Initialization:
β_T(i) = 1,  1 ≤ i ≤ N    (13)
• Induction:
β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T-1, T-2, …, 1, 1 ≤ i ≤ N    (14)
Combining the forward and backward variables, we get:
P(O|λ) = Σ_{i=1..N} α_t(i) β_t(i),  1 ≤ t ≤ T    (15)
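The backward recursion (13)-(14) can be sketched analogously; evaluating it at t = 1 together with π and b_i(o_1) recovers the same P(O|λ) as the forward pass, which is a convenient consistency check. The toy model is again hypothetical:

```python
def backward(A, B, pi, obs):
    """Backward algorithm: beta_t(i) = P(o_{t+1}..o_T | q_t = i, lambda).
    Combining beta_1 with the initial distribution gives P(O | lambda)."""
    N = len(pi)
    beta = [1.0] * N                     # Initialization (eq. 13): beta_T(i) = 1
    for o in reversed(obs[1:]):          # Induction (eq. 14), t = T-1 ... 1
        beta = [sum(A[i][j] * B[j][o] * beta[j] for j in range(N))
                for i in range(N)]
    # At t = 1: P(O|lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in range(N))

# Same toy 2-state, 3-symbol model (hypothetical numbers):
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
```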
VI. K-MEANS ALGORITHM
The segmental k-means algorithm is used to generate the codebook of the features of the voice samples; it clusters the observations into k partitions. The k-means algorithm first partitions the input vectors into k initial sets, either by random selection or by using heuristic data, and then alternates two steps: each observation is assigned to the cluster with the closest mean, and the new means are then computed as the centroids of the observations in each cluster. By associating each observation with the closest centroid it constructs a new partition, and the centroids are recalculated for the new clusters until the process converges, i.e. the assignments no longer change. In practice it converges extremely fast, and executing several iterations returns the best clustering found [9].
Given a set of observations (x_1, x_2, …, x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S_1, S_2, …, S_k}, so as to minimize the within-cluster sum of squares (WCSS):
arg min_S Σ_{i=1..k} Σ_{x ∈ S_i} ||x − µ_i||²
where µ_i is the mean of the points in S_i.
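The two alternating steps can be sketched as follows. This is a plain Lloyd-style k-means for illustration (with a fixed random seed for reproducibility), not the segmental variant used in the referenced system:

```python
import random

def k_means(points, k, max_iter=100):
    """Plain k-means on d-dimensional points (lists of floats).
    Alternates the two steps in the text: assign each observation to the
    nearest centroid, then recompute each centroid as its cluster mean."""
    random.seed(0)                        # deterministic initial centroids
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                  # assignment step
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        new_centroids = [                 # update step: cluster means
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)]
        if new_centroids == centroids:    # convergence: assignments stable
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points, the returned centroids settle on the group means after a few iterations, which is the fast convergence noted above.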
VII. VITERBI ALGORITHM
Using the final re-estimated A, B, and π, the log likelihood of the HMM is calculated with respect to all the word models available to the recognition engine by using the Viterbi algorithm. The Viterbi algorithm takes the model parameters and the observation vectors of the word as input, and returns the value of the match with each word model. These are the likelihood values of the word (LIHMM) passed to the hybrid training model [9].
To find the single best state sequence Q = q_1, q_2, q_3, · · · , q_t (which produces the given observation sequence) for a given observation sequence O = o_1, o_2, o_3, · · · , o_t, we define the quantity
δ_t(i) = max_{q_1,…,q_{t-1}} P(q_1, q_2, …, q_t = i, o_1, o_2, …, o_t | λ)
i.e. δ_t(i) is the best score along a single path, at time t, which accounts for the first t observations and ends in state S_i. By induction,
δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(o_{t+1})
In order to find the state sequence we need to keep track of the state which maximizes the above equation. We do this via an array ψ_t(j) for each t and state j. Once the final state is reached, the corresponding state sequence can be found by backtracking [9].
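The recursion on δ together with the ψ backtracking array can be sketched as below; the toy model is hypothetical, for illustration only:

```python
def viterbi(A, B, pi, obs):
    """Viterbi algorithm: single best state sequence and its probability.
    delta[j] is the best path score ending in state j after the
    observations so far; psi stores the argmax at each step."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    psi = []
    for o in obs[1:]:
        # For each next state j, the best predecessor i and its score
        best = [max((delta[i] * A[i][j], i) for i in range(N)) for j in range(N)]
        psi.append([b[1] for b in best])
        delta = [best[j][0] * B[j][o] for j in range(N)]
    # Backtrack from the best final state through psi
    state = delta.index(max(delta))
    path = [state]
    for back in reversed(psi):
        state = back[state]
        path.append(state)
    return path[::-1], max(delta)

# Toy 2-state, 3-symbol model (hypothetical numbers):
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
path, score = viterbi(A, B, pi, [0, 0, 2])
```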
VIII. CONCLUSION
In this overview we have discussed a speech recognition system using HMM, covering the techniques used in each stage of the system. It is found that MFCC is widely used for feature extraction of speech because it is noise-robust, and that HMM performs best among the modeling techniques surveyed, as it increases recognition accuracy and speed.
REFERENCES
[1] D. B. Paul, "Speech Recognition Using Hidden Markov Models".
[2] M. A. Anusuya and S. K. Katti, "Speech Recognition by Machine: A Review", International Journal of Computer Science and Information Security, 2009.
[3] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey, 1993.
[4] Mahdi Shaneh and Azizollah Taheri, "Voice Command Recognition System Based on MFCC and VQ Algorithms", World Academy of Science, Engineering and Technology, 57, 2009.
[5] H. Combrinck and E. Botha, "On the Mel-scaled Cepstrum", Department of Electrical and Electronic Engineering, University of Pretoria; Journal of Computer Science 3 (8): 608-616, 2007, ISSN 1549-3636.
[6] Ahmad A. M. Abushariah, Teddy S. Gunawan, and Othman O. Khalifa, "English Digits Speech Recognition System Based on Hidden Markov Models", International Islamic University Malaysia, International Conference on Computer and Communication Engineering (ICCCE 2010), 11-13 May 2010, Kuala Lumpur, Malaysia.
[7] Anjali Bala, Abhijeet Kumar, and Nidhika Birla, "Voice Command Recognition System Based on MFCC and DTW", International Journal of Engineering Science and Technology, Vol. 2 (12), 2010.
[8] Ibrahim Patel and Dr. Y. Srinivasa Rao, "Speech Recognition Using Hidden Markov Model with MFCC-Subband Technique", 2010 International Conference on Recent Trends in Information, Telecommunication and Computing.
[9] Shumaila Iqbal, Tahira Mehboob, and Malik Sikandar, "Voice Recognition Using HMM with MFCC for Secure ATM", IJCS Vol. 8, Issue 6, Nov 2011.
[10] Vimala C and Dr. V. Radha, "A Review on Speech Recognition Challenges and Approaches", World of Computer Science and Information Technology Journal (WCSIT), ISSN: 2221-0741, Vol. 2, No. 1, 1-7, 2012.
[11] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989.