German End-to-end Speech Recognition based on DeepSpeech
Aashish Agarwal and Torsten Zesch
Language Technology Lab
University of Duisburg-Essen
Duisburg, Germany
Abstract
While automatic speech recognition is an
important task, freely available models are
rare, especially for languages other than
English. In this paper, we describe the pro-
cess of training German models based on
the Mozilla DeepSpeech architecture using
publicly available data. We compare the re-
sulting models with other available speech
recognition services for German and find
that we obtain comparable results. Accept-
able performance under noisy conditions
would, however, still require much more
training data. We release our trained Ger-
man models and also the training configu-
rations.
1 Introduction
Automatic speech recognition (ASR) is the task of
translating a spoken utterance into a textual tran-
script. It is a key component of voice assistants like
Google Home (Li et al., 2017), in spoken language
translation devices (Krstovski et al., 2008), or for
automatic transcription of audio and video files
(Liao et al., 2013). For any language beyond En-
glish, readily available pre-trained models are still
rare. For German, we are only aware of the model
by Milde and Köhn (2018) for the Kaldi framework
(Povey et al., 2011). For the recently introduced
Mozilla DeepSpeech framework, a German model
is still missing. This is a serious obstacle to ap-
plied research on German speech data, as available
web-services by Google, Amazon, or Microsoft are
problematic due to data privacy reasons. We thus
use publicly available speech data to train a Ger-
man DeepSpeech model. We release our trained
German model and also publish the code and con-
figurations enabling researchers to (i) directly use
the model in applications, (ii) reproduce state-of-
the-art results, and (iii) train new models based on
other source corpora.
2 Speech Recognition Systems
Due to the underlying complexity of recognizing spoken language and the wish of service providers to keep their models private, many systems are offered as web services. This includes commercial services like Google Cloud Speech-to-Text (He et al., 2018), Amazon Alexa Voice Services1, IBM Watson Speech to Text (Saon et al., 2017) or Speechmatics2 as well as academic services like BAS.3
While web services are convenient, there are many situations where they cannot be used:
- sending data to a web service might violate data privacy protection laws
- the data throughput of a web service is limited, which might rule out batch processing of large amounts of speech data
- the user cannot control (or change) the functionality of a remotely deployed web service
- research results based on web service calls are not easily replicable, as services might change without notice or become unavailable altogether.
For this work, we therefore consider only frame-
works that can be used locally and without restric-
tions. One such framework is Kaldi (Povey et al., 2011), which was found to be the best performing open-source ASR system in a previous study (Gaida et al., 2014). It is an open-source toolkit written in C++ that supports conventional models (e.g. Gaussian Mixture Models) as well as deep neural networks. Recently, end-to-end neural systems like wav2letter++ (Pratap et al., 2018) provided by Facebook, or DeepSpeech4 provided by Mozilla, have been introduced.
1https://developer.amazon.com/alexa/science
2https://www.speechmatics.com
3https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/ASR
4https://github.com/mozilla/DeepSpeech
Figure 1: DeepSpeech architecture (adapted from Mozilla Blog5)
To our knowledge, there is only one German model for any of these frameworks that is publicly available, which is the one by Milde and Köhn (2018) for Kaldi. Other German models, e.g. a Kaldi model from Fraunhofer IAIS (Stadtschnitzer et al., 2014), rely on in-house datasets and are not publicly available.
In this work, we focus on Mozilla’s DeepSpeech
framework, as it is an end-to-end neural system
that can be quite easily trained, unlike Kaldi, which requires more domain knowledge, or wav2letter++, which is not yet widely tested by the community.
Mozilla DeepSpeech
DeepSpeech (v0.1.0) was
based on a TensorFlow (Abadi et al., 2016) imple-
mentation of Baidu’s end-to-end ASR architecture
(Hannun et al., 2014). As it is under active devel-
opment, the current architecture deviates from the
original version quite a bit. In Figure 1, we give
an overview of the architecture of version v0.5.0,
which we also used for our experiments in this
paper.6
DeepSpeech is a character-level, deep recurrent
5https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech
6https://github.com/mozilla/DeepSpeech/releases/tag/v0.5.0
neural network (RNN), which can be trained end-
to-end using supervised learning.7
It extracts Mel-
Frequency Cepstral Coefficients (Imai, 1983) as
features and directly outputs the transcription, with-
out the need for forced alignment on the input or
any external source of knowledge like a Grapheme
to Phoneme (G2P) converter. Overall, the network
has six layers: the speech features are fed into three
fully connected layers (dense), followed by a uni-
directional RNN layer, then a fully connected layer
(dense) and finally an output layer as shown in Fig-
ure 1. The RNN layer uses LSTM cells, and the
hidden fully connected layers use a ReLU activa-
tion function. The network outputs a matrix of
character probabilities, i.e. for each time step the
system gives a probability for each character in the
alphabet, which represents the likelihood of that
character corresponding to the audio. Further, the
Connectionist Temporal Classification (CTC) loss
function (Graves et al., 2006) is used to maximize
the probability of the correct transcription.
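To make the layer stack concrete, the following is a minimal sketch of a DeepSpeech-style network in TensorFlow/Keras. It is not Mozilla's implementation (which, among other things, adds context windows to the MFCC features and clips the ReLU activations), but it reproduces the six-layer structure and the CTC objective; all layer sizes are illustrative.

import tensorflow as tf

NUM_MFCC = 26        # MFCC features per frame (illustrative)
NUM_HIDDEN = 2048    # width of the hidden layers (illustrative)
ALPHABET_SIZE = 29   # e.g. a-z, space, apostrophe, plus the CTC blank

def build_deepspeech_like_model():
    # Input: a batch of variable-length sequences of MFCC frames
    feats = tf.keras.Input(shape=(None, NUM_MFCC), name="mfcc_frames")
    x = feats
    # Layers 1-3: fully connected with ReLU
    for _ in range(3):
        x = tf.keras.layers.Dense(NUM_HIDDEN, activation="relu")(x)
    # Layer 4: unidirectional recurrent layer with LSTM cells
    x = tf.keras.layers.LSTM(NUM_HIDDEN, return_sequences=True)(x)
    # Layer 5: fully connected with ReLU
    x = tf.keras.layers.Dense(NUM_HIDDEN, activation="relu")(x)
    # Layer 6: per-time-step character distribution
    char_probs = tf.keras.layers.Dense(ALPHABET_SIZE, activation="softmax")(x)
    return tf.keras.Model(feats, char_probs)

# Training minimizes the CTC loss (e.g. via tf.nn.ctc_loss) between the character
# probability matrix and the reference transcription, so no frame-level alignment
# between audio and text is needed.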
DeepSpeech comes with a pre-trained English
model, but while Mozilla is collecting speech samples8 and is releasing training datasets in several languages (see the paragraph on Mozilla Common Voice in Section 3), no official models other than English are provided. Users have reported on training models for French9 and Russian (Iakushkin et al., 2018), but the resulting models do not seem to be available.
3 Model Training
In this section, we describe in detail our setup for
training the German model in order to ease subse-
quent attempts to train DeepSpeech models.
3.1 Datasets
To train the German Deep Speech model, we utilize
the following publicly available datasets:
The Voxforge10 corpus contains about 35 hours of German speech clips. Nearly 180 speakers have read aloud sentences from the German Wikipedia, protocols from the European Parliament, and some individual commands. The clips vary in length, ranging from 5 to 7 seconds.
The Tuda-De corpus (Milde and Köhn, 2018) is similar to Voxforge. It uses the same sources
7https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/
8https://voice.mozilla.org/
9http://bit.ly/discourse-mozilla-org
10http://www.voxforge.org/home/forums/other-languages/german/open-speech-data-corpus-for-german
Dataset Size Median Length # Speakers Condition Type
Voxforge 35h 4.5s 180 noisy read
Tuda-De 127h 7.4s 147 clean read
Mozilla Common Voice 140h 3.7s >1,000 noisy read
Table 1: Overview of German datasets
(Wikipedia, parliament speeches, commands), but
the recordings are under more controlled condi-
tions. The final data was also curated “to reduce
speaking errors and artefacts”. Each recording was
made with 4 different microphones at the same
time. This means that while the overall size of
the dataset is larger than Voxforge and a model
based on this dataset is supposed to be more robust,
the actual amount of unique speech hours in both datasets is about the same.
The Mozilla Common Voice project11 aims to
aims to
make speech recognition open to everyone. The
multilingual dataset currently covers 18 languages, including English, French, German, and Mandarin.
The German corpus contains clips with lengths
varying from 3 to 5 seconds. However, the corpus
is recorded outside controlled conditions, at the convenience of the speaker. The utterances have background noise, and users have varied accents. Therefore, we expect this dataset to be relatively challenging. Speakers in this dataset are relatively young,
and the male/female ratio is about 5:1, which might
result in a severe bias when trying to transfer the model.12 The version used in our experiments has
140 hours of recordings, but as Mozilla aims at
adding more recordings, there might already be a
larger dataset available.
3.2 Preprocessing
DeepSpeech expects audio and transcription data
to be prepared in a specific format so that they can
be read directly by the input pipeline (see Figure 2
for an example). We cleaned the transcriptions
by removing commas and other punctuation and converting all transcriptions to lower case. We
further ensured all audio clips are in .wav format.
The pruned results were split into training (70%),
validation (15%), and test data (15%).
For more details on data preprocessing parame-
ters, we refer the reader to the code release.13
11https://voice.mozilla.org/de/datasets
12Speaker information is based on the self-reported statistics provided on the project homepage for each dataset.
13https://github.com/AASHISHAG/deepspeech-german
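As an illustration of this step, the following sketch (simplified from the released preprocessing code; paths and the exact CSV columns are illustrative and should be checked against Figure 2 and the code release) cleans the transcriptions and writes the files read by DeepSpeech's input pipeline.

import csv
import os
import random
import re

def clean_transcript(text):
    # lower-case and strip punctuation; keep German umlauts and ß
    text = text.lower()
    return re.sub(r"[^a-zäöüß' ]", "", text).strip()

def write_csv(samples, path):
    # samples: list of (absolute_wav_path, raw_transcript) pairs
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav, transcript in samples:
            writer.writerow([wav, os.path.getsize(wav), clean_transcript(transcript)])

def split_and_export(samples, out_dir):
    # 70% training, 15% validation, 15% test
    random.shuffle(samples)
    n = len(samples)
    train, dev = samples[: int(0.7 * n)], samples[int(0.7 * n): int(0.85 * n)]
    test = samples[int(0.85 * n):]
    for name, part in [("train", train), ("dev", dev), ("test", test)]:
        write_csv(part, os.path.join(out_dir, f"{name}.csv"))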
Hyperparameter Value
Batch Size 24
Dropout 0.25
Learning Rate 0.0001
Table 2: Hyperparameters used in the experiments
3.3 Hyperparameter Setup
We searched for a good set of hyperparameters as shown in Figure 3. In the first iteration, we fixed the learning rate and training batch size and plotted dropout against word error rate to determine the dropout with the lowest WER. We then used the best dropout (0.25) from this iteration and kept the training batch size fixed to identify the best learning rate. Finally, we took the best dropout (0.25) and learning rate (0.0001) to determine the effect of the batch size, which shows that our initial choice of 24 was reasonable, even if somewhat better results seem possible using smaller batches.
Since DeepSpeech employs early stopping, which halts training before the network overfits the training data, we did not experiment much with the number of epochs. The remaining hyperparameters were set to the values pre-configured in Mozilla DeepSpeech. The
best results are obtained with the hyper-parameters
mentioned in Table 2. We train the network using
the Adam optimizer (Kingma and Ba, 2014).
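For reference, the dropout sweep from the first search iteration can be scripted roughly as follows. The DeepSpeech.py flag names are those of the v0.5.0 release as we used it and may differ in other versions; the file paths and sweep values are illustrative, and the released configurations contain the invocations we actually ran.

import subprocess

# Illustrative sweep over dropout with learning rate and batch size held fixed.
DROPOUT_VALUES = [0.05, 0.15, 0.25, 0.35, 0.45]

for dropout in DROPOUT_VALUES:
    subprocess.run([
        "python3", "DeepSpeech.py",
        "--train_files", "data/train.csv",
        "--dev_files", "data/dev.csv",
        "--test_files", "data/test.csv",
        "--train_batch_size", "24",
        "--learning_rate", "0.0001",
        "--dropout_rate", str(dropout),
        "--export_dir", f"models/dropout_{dropout}",
    ], check=True)  # the test WER for each run is read from the DeepSpeech log output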
Language Model
We use the KenLM toolkit (Heafield, 2011) to train a 3-gram language model on the pre-processed corpus provided by Radeck-Arneth et al. (2015). It consists of eight million filtered sentences comprising 63.0% Wikipedia, 22.0% Europarl, and 14.6% crawled sentences. MaryTTS14 has been used to canonicalize the corpus, i.e. to normalize it to a form that is close to how a reader would speak the sentence, especially expanding numbers, abbreviations, and dates. Additionally, punctuation was discarded, as it is usually not pronounced.
14http://mary.dfki.de/
Figure 2: Screenshot of the input file format
Figure 3: Hyperparameter search space (WER as a function of dropout, learning rate, and batch size)
Dataset WER
Mozilla 79.7
Voxforge 72.1
Tuda-De 26.8
Tuda-De + Mozilla 57.3
Tuda-De + Voxforge 15.1
Tuda-De + Voxforge + Mozilla 21.5
Table 3: German DeepSpeech results
We used the unpruned language model, which has a rather large vocabulary of over 2 million types, but we expect pruning would only affect runtime, not recognition quality.
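The language model itself was built with the standard KenLM tools; the following sketch shows the essential steps (the corpus path is illustrative, and DeepSpeech v0.5.0 additionally expects a trie generated from the alphabet and the binary model, for which we refer to the code release).

import subprocess

CORPUS = "corpus/normalized_sentences.txt"  # one canonicalized sentence per line

# Estimate a 3-gram model in ARPA format with KenLM's lmplz.
with open(CORPUS, "rb") as text, open("lm.arpa", "wb") as arpa:
    subprocess.run(["lmplz", "-o", "3"], stdin=text, stdout=arpa, check=True)

# Convert the ARPA file to KenLM's binary format for faster loading.
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)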
3.4 Server & Runtime
We trained and tested our models on a compute server with 56 Intel(R) Xeon(R) Gold 5120 CPUs @ 2.20GHz and 3 Nvidia Quadro RTX 6000 GPUs with 24GB of RAM each. Typical training time on a single dataset under this setup was on the order of one hour.
4 Results & Discussion
Table 3 shows the word error rates (WER) obtained
when training and testing DeepSpeech on the avail-
able German datasets and their combinations. The
best configuration in Milde and Köhn (2018) using
only the Tuda-De corpus yields a WER of 28.96%.
Our model only trained on Tuda-De yields a com-
parable WER of 26.8%.
Results for the other datasets are much worse, but apparently combining several datasets improves the results. While the combination of Tuda-De and Mozilla yields a WER of 57.3%, the combination of Tuda-De, Voxforge, and Mozilla gives a WER of 21.5%.
Combining the very similar Tuda-De and Voxforge
yields a WER of 15.1%, which is a remarkable im-
provement over using only a single dataset. Note
that this is the black-box performance, as we used
DeepSpeech as is and only slightly tuned hyper-
parameters. See Section 6 for ideas on how to
improve over these results.
To put our results into perspective, in Table 4,
we present results in other languages for training
different versions of the DeepSpeech architecture.
Our best results are in the same range as for the
other languages, but cross-dataset comparisons are
hard to interpret. However, it is safe to say that
training a DeepSpeech model can result in accept-
able in-domain word error rates with considerably
less training data than previously considered.
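All scores in this paper are word error rates. For completeness, the following sketch shows the standard computation as the word-level edit distance between reference and hypothesis, normalized by the reference length (which is why individual utterances can score above 100%).

# Word error rate as normalized word-level Levenshtein distance (standard definition).
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative example: one substitution and one deletion against a five-word reference -> 0.4
print(wer("die geschwindigkeit kann erhöht werden",
          "die geschwindigkeit wird erhöht"))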
4.1 Influence of Training Size
Figure 4 depicts the relation between the amount of training data and the resulting word error rate. To plot the learning curve, we split the training data into 10 subsets, each containing 10% of the data.
Figure 4: Learning curves for single datasets (WER over the number of training instances for Voxforge, Mozilla, and Tuda-De)
Figure 5: Learning curves when combining datasets (WER over the number of training instances for Tuda-De + Mozilla, Tuda-De + Voxforge, and Tuda-De + Voxforge + Mozilla)
Figure 6: Order effects when combining datasets (WER over the number of training instances when adding Tuda-De, Voxforge, and Mozilla)
Language | DeepSpeech version | Training Set | Size | Test Set | WER
English | Baidu (Hannun et al., 2014) | Switchboard, Fisher, WSJ, Baidu | 7,380h | Hub5 (LDC2002S23) | 16.0
English | Mozilla v0.3.0 | Switchboard, Fisher, LibriSpeech | 3,260h | LibriSpeech (clean test) | 11.0
English | Mozilla v0.5.0 | Switchboard, Fisher, LibriSpeech | 3,260h | LibriSpeech (clean test) | 8.2
Russian | Mozilla v? (Iakushkin et al., 2018) | Yt-vad-1k, Yt-vad-650-clean | 1,650h | Voxforge (Russian) | 18.0
German | Mozilla v0.5.0 (our) | Tuda-De + Voxforge | 162h | Tuda-De + Voxforge (test) | 15.1
Table 4: Comparison with previous results in other languages
Then the model is trained on one subset and WER is calculated on a separate test dataset. Next, we add the next subset with more data, re-train the model, and compute the effect on the error rate. The model is trained on each subset for a maximum of 10 epochs, or fewer when the model starts to overfit the training data and early stopping is triggered. We observe
that the rather noisy datasets Voxforge and Mozilla
converge rather slowly, while the clean Tuda-De
reaches much better results. This might also be
a result of the different microphones that add in-
creased robustness (not unlike other data augmen-
tation strategies).
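The learning-curve procedure described above amounts to the following bookkeeping; the train_and_score function is a hypothetical stand-in for a full DeepSpeech training run (max. 10 epochs, early stopping) followed by evaluation on the fixed test set.

def learning_curve(train_samples, train_and_score, num_steps=10):
    """Train on a growing fraction of the training data and record the test WER.

    train_samples: the full (shuffled) training set.
    train_and_score: callable that trains a model on the given subset and
        returns its WER on the fixed test set (hypothetical stand-in here).
    """
    results = []
    for step in range(1, num_steps + 1):
        subset = train_samples[: len(train_samples) * step // num_steps]
        results.append((len(subset), train_and_score(subset)))
    return results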
Figure 5 presents the same learning curves when combining datasets, showing that we can reach even better WER in this setting. Mixing the datasets seems to force the model to converge more quickly. However, combining the similar datasets Tuda-De and Voxforge yields somewhat better performance than combining all three datasets.
We also tested the mix of all datasets, but added the training data one dataset at a time. Thus, the order in which datasets are introduced into the training process might influence performance. Figure 6 shows the results for different orders in which the datasets are introduced. Adding the noisy Mozilla dataset too early seems to slow down convergence, while it yields a small improvement when added at the end.
4.2 Cross-dataset Performance
So far, we used training and testing data either from
the same dataset or a mix of the available datasets,
Train | Test | WER
Voxforge | Voxforge | 72.1
Tuda-De | Voxforge | 96.8
Mozilla | Voxforge | 73.1
Tuda-De, Mozilla | Voxforge | 66.2
Tuda-De | Tuda-De | 26.8
Voxforge | Tuda-De | 98.5
Mozilla | Tuda-De | 84.9
Voxforge, Mozilla | Tuda-De | 83.8
Mozilla | Mozilla | 79.7
Tuda-De | Mozilla | 94.8
Voxforge | Mozilla | 87.1
Tuda-De, Voxforge | Mozilla | 80.5
Table 5: Results across datasets
while of course keeping train and test data separate.
To get a more realistic estimate of performance
when used in a general setting, we assess cross-
dataset performance, i.e. we train and develop on
one or two datasets and test on a third one.
Table 5 shows the resulting word error rates. Ap-
parently, the cross-domain results are much worse
than in the in-domain setting in Table 3. For exam-
ple, training on Mozilla, or on Voxforge and Mozilla combined, and testing on Tuda-De yields unacceptable word error rates of 84.9 and 83.8, compared to 26.8 when
training on Tuda-De. Interestingly, in this case, as
we have seen already above, adding Voxforge in
the mix does not help much, even if it is similar to
Tuda-De. We see a similar picture for the other test datasets: transferring from a single dataset does not work at all, as during training the model is never forced to generalize beyond the properties of that dataset.
However, training on the Tuda-De and Mozilla
combination yields WER of 66.2 on Voxforge,
Model | WER | Example
original | - | der bandbreitenverbrauch wird erheblich verringert
Tuda-De | 60 | diese zeiten tonwoche erheblich verringert
Voxforge | 80 | zeiten epoche erheblich in
Tuda-De + Mozilla | 160 | es sind endete suche den ist es in
Tuda-De + Voxforge | 60 | der pen zeiten verprach wird erheblich verringert
Tuda-De + Voxforge + Mozilla | 40 | der bandbreiten verbrauch wird erheblich verringert
original | - | ferner gibt es möglicherweise eine gewisse anonymität und sicherheit
Tuda-De | 78 | weites mögliche welche in glichen unität und sicherheit
Voxforge | 100 | zitierweise sich entsichert
Tuda-De + Mozilla | 100 | hunde titisee gelten die die mitte zum
Tuda-De + Voxforge | 44 | den gibt es möglicherweise eine gewisse mietsicherheit
Tuda-De + Voxforge + Mozilla | 11 | er gibt es möglicherweise eine gewisse anonymität und sicherheit
original | - | die einwilligung des schuldners war nicht erforderlich
Tuda-De | 100 | ideen
Voxforge | 86 | die angebliche natacha vollich
Tuda-De + Mozilla | 57 | die einwilligung des schutzmacht erfordern
Tuda-De + Voxforge | 86 | die ein eigenes schuldnersicht erfordern
Tuda-De + Voxforge + Mozilla | 43 | die einigung des schuldner zwar nicht erforderlich
original | - | die geschwindigkeit für die kunden kann erhöht werden
Tuda-De | 75 | die geschwindigkeit und unterteilten
Voxforge | 100 | schinkelpreise
Tuda-De + Mozilla | 88 | wie die schmiede den trennendes
Tuda-De + Voxforge | 38 | die geschwindigkeit für die kunden kenterte
Tuda-De + Voxforge + Mozilla | 0 | die geschwindigkeit für die kunden kann erhöht werden
original | - | mehrere arbeitgeberverbände sind zu einem dachverband zusammengeschlossen
Tuda-De | 114 | der see aufweitungen des in einem tatorten samen erschossen
Voxforge | 100 | es recognitionszeichen
Tuda-De + Mozilla | 100 | in den sitzungen des entstandenen schaden
Tuda-De + Voxforge | 29 | mehrere arbeitgeberverbände sind zu einem tachodaten geschlossen
Tuda-De + Voxforge + Mozilla | 14 | der arbeitgeberverbände sind zu einem dachverband zusammengeschlossen
Table 6: Recognition results on random Voxforge test instances
which is even lower than using the training por-
tion of Voxforge (which yields 72.1). Thus, forcing the model to generalize over topics, recording conditions, speakers, etc. seems to be crucial.
5 Error Analysis
Table 6 shows the recognition results on randomly
selected test instances from the Voxforge dataset.
The models trained on only one dataset are surpris-
ingly bad, resulting in rather poetic utterances that
sometimes are quite far from the expected source.
An example is the Tuda-De model recognizing
tatorten samen erschossen instead of dachverband
zusammengeschlossen.
As is to be expected for German, compounds
are especially challenging as exemplified by band-
breitenverbrauch that is recognized as bandbreiten
verbrauch or even pen zeiten verprach, where ver-
prach is probably only in the language model as a
common misspelling of versprach.
The models often fail in interesting ways, e.g.
all models sometimes return very short results like
schinkelpreise that should actually have low prob-
ability. We currently have no explanation for this
behaviour and need to explore the issue further.
In cases like des schuldners war being recog-
nized as des schuldner zwar, the phonetic ambigu-
ity should have been resolved by a better language
model.
6 Summary
In this paper, we presented the first results on
building a German speech recognition model using
Mozilla DeepSpeech. Our best performing model
reaches an in-domain WER of 15.1%, which is in
line with the performance for other languages us-
ing the DeepSpeech framework. Our results thus
support the idea that Mozilla DeepSpeech can
be easily transferred to new languages. Learning
curve experiments highlight the importance of the
amount of training data, but also quite strong order
effects when mixing the datasets.
We publish our trained model along with con-
figuration data for all our experiments in order to
enable replicating all results. The model can eas-
ily be re-trained and optimised on new datasets by
referring the code-release.
15
No specific hardware
is required to run the trained model, and it works
even on a normal desktop computer or laptop.
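For illustration, transcribing a 16 kHz mono WAV file with the released model looks roughly as follows; the Model and decoder argument lists shown here are those of the DeepSpeech v0.5.x Python client as we recall them, they changed in later releases, and the file names are placeholders.

import wave
import numpy as np
from deepspeech import Model

# Decoder settings corresponding to the DeepSpeech v0.5.x client defaults (assumed).
N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
LM_ALPHA, LM_BETA = 0.75, 1.85

ds = Model("output_graph.pb", N_FEATURES, N_CONTEXT, "alphabet.txt", BEAM_WIDTH)
ds.enableDecoderWithLM("alphabet.txt", "lm.binary", "trie", LM_ALPHA, LM_BETA)

with wave.open("sample.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio, 16000))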
Future Work
Our experiments only scratch the
surface of possible approaches, and our analysis suggests several avenues for further exploration.
We mainly treated DeepSpeech as a black-box
and only performed a light hyper-parameter search.
The model can probably still be fine-tuned by ex-
ploring other hyper-parameters. We also did not
experiment much with the language model, but
used a simple 3-gram model.
Since the amount of publicly available training
data is limited, it could be interesting to consider
data augmentation strategies.16 Another approach
to improve recognition quality could be to use
transfer learning by taking an English model (pre-
trained with the larger English datasets) and re-
training with the German data (Kunze et al., 2017;
Bansal et al., 2018). In the light of recent discus-
sions on the CO2 footprint of training deep learning
models (Strubell et al., 2019), using re-training and
providing trained models is desirable. Additionally,
more research is needed to find neural architectures
that perform equally well, but require less compute.
Finally, the training process described here could
be easily used to train speech recognition models
for other languages, where currently no pre-trained
models are available.
Acknowledgments
We want to thank Andrea Horbach for her many
helpful comments that significantly improved the
paper. We also thank the developers at Mozilla
DeepSpeech, who provided insight and expertise
that greatly assisted the research.
References
[Abadi et al.2016] Martín Abadi, Paul Barham, Jianmin
Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving,
Michael Isard, et al. 2016. Tensorflow: A system
for large-scale machine learning. In 12th USENIX
Symposium on Operating Systems Design and Imple-
mentation OSDI 16, pages 265–283.
[Bansal et al.2018] Sameer Bansal, Herman Kamper,
Karen Livescu, Adam Lopez, and Sharon Goldwater.
15https://github.com/AASHISHAG/deepspeech-german
16https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html
2018. Pre-training on high-resource speech recog-
nition improves low-resource speech-to-text transla-
tion. CoRR, abs/1809.01431.
[Gaida et al.2014] Christian Gaida, Patrick Lange, Rico
Petrick, Patrick Proba, Ahmed Malatawy, and David
Suendermann-Oeft. 2014. Comparing open-source
speech recognition toolkits.
[Graves et al.2006] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.
[Hannun et al.2014] Awni Y. Hannun, Carl Case, Jared
Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen,
Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta,
Adam Coates, and Andrew Y. Ng. 2014. Deep
speech: Scaling up end-to-end speech recognition.
CoRR, abs/1412.5567.
[He et al.2018] Yanzhang He, Tara N. Sainath, Rohit
Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding
Zhao, David Rybach, Anjuli Kannan, Yonghui Wu,
Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan
Shangguan, Bo Li, Golan Pundak, Khe Chai Sim,
Tom Bagby, Shuo-Yiin Chang, Kanishka Rao, and
Alexander Gruenstein. 2018. Streaming end-to-
end speech recognition for mobile devices. CoRR,
abs/1811.06621.
[Heafield2011] Kenneth Heafield. 2011. KenLM:
Faster and smaller language model queries. In Pro-
ceedings of the Sixth Workshop on Statistical Ma-
chine Translation, pages 187–197, Edinburgh, Scot-
land.
[Iakushkin et al.2018] Oleg Iakushkin, George Fe-
doseev, Anna S. Shaleva, Alexander Degtyarev,
and Olga S. Sedova. 2018. Russian-language
speech recognition system based on deepspeech. In
Proceedings of the VIII International Conference
on Distributed Computing and Grid-technologies in
Science and Education (GRID 2018).
[Imai1983] Satoshi Imai. 1983. Cepstral analy-
sis synthesis on the mel frequency scale. In
ICASSP’83. IEEE International Conference on
Acoustics, Speech, and Signal Processing, volume 8,
pages 93–96. IEEE.
[Kingma and Ba2014] Diederik P. Kingma and Jimmy
Ba. 2014. Adam: A Method for Stochastic Op-
timization. arXiv e-prints, page arXiv:1412.6980,
Dec.
[Krstovski et al.2008] Kriste Krstovski, Michael De-
cerbo, Rohit Prasad, David Stallard, Shirin Saleem,
and Premkumar Natarajan. 2008. A wearable head-
set speech-to-speech translation system. In Proceed-
ings of the ACL-08: HLT Workshop on Mobile Lan-
guage Processing, pages 10–12, Columbus, Ohio,
June. Association for Computational Linguistics.
[Kunze et al.2017] Julius Kunze, Louis Kirsch, Ilia
Kurenkov, Andreas Krug, Jens Johannsmeier, and
Sebastian Stober. 2017. Transfer learning
for speech recognition on a budget. CoRR,
abs/1706.00290.
[Li et al.2017] Bo Li, Tara Sainath, Arun Narayanan,
Joe Caroselli, Michiel Bacchiani, Ananya Misra,
Izhak Shafran, Hasim Sak, Golan Pundak, Kean
Chin, Khe Chai Sim, Ron J. Weiss, Kevin Wil-
son, Ehsan Variani, Chanwoo Kim, Olivier Siohan,
Mitchel Weintraub, Erik McDermott, Rick Rose,
and Matt Shannon. 2017. Acoustic modeling for
google home.
[Liao et al.2013] Hank Liao, Erik McDermott, and An-
drew W. Senior. 2013. Large scale deep neural net-
work acoustic modeling with semi-supervised train-
ing data for youtube video transcription. In ASRU,
pages 368–373. IEEE.
[Milde and Köhn2018] Benjamin Milde and Arne Köhn. 2018. Open source automatic speech recognition for German. CoRR, abs/1807.10311.
[Povey et al.2011] Daniel Povey, Arnab Ghoshal, Gilles
Boulianne, Lukas Burget, Ondrej Glembek, Nagen-
dra Goel, Mirko Hannemann, Petr Motlicek, Yanmin
Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer,
and Karel Vesely. 2011. The kaldi speech recogni-
tion toolkit. In IEEE 2011 Workshop on Automatic
Speech Recognition and Understanding, December.
[Pratap et al.2018] Vineel Pratap, Awni Hannun,
Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Syn-
naeve, Vitaliy Liptchinsky, and Ronan Collobert.
2018. wav2letter++: The fastest open-source
speech recognition system. CoRR, abs/1812.07625.
[Radeck-Arneth et al.2015] Stephan Radeck-Arneth, Benjamin Milde, Arvid Lange, Evandro Gouvêa, Stefan Radomski, Max Mühlhäuser, and Chris Biemann. 2015. Open source German distant speech recognition: Corpus and acoustic model. In Text, Speech, and Dialogue, pages 480–488, Cham.
[Saon et al.2017] George Saon, Gakuto Kurata, Tom
Sercu, Kartik Audhkhasi, Samuel Thomas, Dim-
itrios Dimitriadis, Xiaodong Cui, Bhuvana Ram-
abhadran, Michael Picheny, Lynn-Li Lim, Bergul
Roomi, and Phil Hall. 2017. English conversa-
tional telephone speech recognition by humans and
machines. CoRR, abs/1703.02136.
[Stadtschnitzer et al.2014] Michael Stadtschnitzer,
Jochen Schwenninger, Daniel Stein, and Joachim
Koehler. 2014. Exploiting the large-scale German
Broadcast Corpus to boost the Fraunhofer IAIS
Speech Recognition System. In Proceedings of
LREC 2014, pages 3887–3890, Reykjavik, Iceland.
[Strubell et al.2019] Emma Strubell, Ananya Ganesh,
and Andrew McCallum. 2019. Energy and policy
considerations for deep learning in nlp. In Proceed-
ings of ACL.