German End-to-end Speech Recognition based on DeepSpeech
Aashish Agarwal and Torsten Zesch
Language Technology Lab
University of Duisburg-Essen
Duisburg, Germany
Abstract
While automatic speech recognition is an
important task, freely available models are
rare, especially for languages other than
English. In this paper, we describe the pro-
cess of training German models based on
the Mozilla DeepSpeech architecture using
publicly available data. We compare the re-
sulting models with other available speech
recognition services for German and find
that we obtain comparable results. Accept-
able performance under noisy conditions
would, however, still require much more
training data. We release our trained Ger-
man models and also the training configu-
rations.
1 Introduction
Automatic speech recognition (ASR) is the task of
translating a spoken utterance into a textual tran-
script. It is a key component of voice assistants like
Google Home (Li et al., 2017), in spoken language
translation devices (Krstovski et al., 2008), or for
automatic transcription of audio and video files
(Liao et al., 2013). For any language beyond En-
glish, readily available pre-trained models are still
rare. For German, we are only aware of the model
by Milde and Köhn (2018) for the Kaldi framework
(Povey et al., 2011). For the recently introduced
Mozilla DeepSpeech framework, a German model
is still missing. This is a serious obstacle to ap-
plied research on German speech data, as available
web-services by Google, Amazon, or Microsoft are
problematic due to data privacy reasons. We thus
use publicly available speech data to train a Ger-
man DeepSpeech model. We release our trained
German model and also publish the code and con-
figurations enabling researchers to (i) directly use
the model in applications, (ii) reproduce state-of-
the-art results, and (iii) train new models based on
other source corpora.
2 Speech Recognition Systems
Due to the underlying complexity of recognizing spoken language and the wish of service providers to keep their models private, many systems are offered as web services. This includes commercial services like Google Cloud Speech-to-Text (He et al., 2018), Amazon Alexa Voice Services1, IBM Watson Speech to Text (Saon et al., 2017) or Speechmatics2 as well as academic services like BAS.3
While web services are convenient, there are many situations where they cannot be used:
- sending data to a web service might violate data privacy protection laws
- the data throughput of a web service is limited, which might rule out batch processing of large amounts of speech data
- the user cannot control (or change) the functionality of a remotely deployed web service
- research results based on web service calls are not easily replicable, as services might change without notice or become unavailable altogether.
For this work, we therefore consider only frame-
works that can be used locally and without restric-
tions. One such framework is Kaldi (Povey et al., 2011), which was found to be the best performing open-source ASR system in a previous study (Gaida et al., 2014). It is an open-source toolkit written in C++ that supports conventional models (e.g. Gaussian Mixture Models) as well as deep neural networks. Recently, end-to-end neural systems like wav2letter++ (Pratap et al., 2018) provided by Facebook, or DeepSpeech4 provided by Mozilla, have been introduced.
1https://developer.amazon.com/alexa/science
2https://www.speechmatics.com
3https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/ASR
4https://github.com/mozilla/DeepSpeech
Figure 1: DeepSpeech architecture (adapted from Mozilla Blog5)
To our knowledge, there is only one German model for any of these frameworks that is publicly available, which is the one by Milde and Köhn (2018) for Kaldi. Other German models, e.g. a Kaldi model from Fraunhofer IAIS (Stadtschnitzer et al., 2014), rely on in-house datasets and are not publicly available.
In this work, we focus on Mozilla’s DeepSpeech
framework, as it is an end-to-end neural system
that can be quite easily trained, unlike Kaldi, which requires more domain knowledge, or wav2letter++, which is not yet widely tested by the community.
Mozilla DeepSpeech
DeepSpeech (v0.1.0) was
based on a TensorFlow (Abadi et al., 2016) imple-
mentation of Baidu’s end-to-end ASR architecture
(Hannun et al., 2014). As it is under active devel-
opment, the current architecture deviates from the
original version quite a bit. In Figure 1, we give
an overview of the architecture of version v0.5.0,
which we also used for our experiments in this
paper.6
DeepSpeech is a character-level, deep recurrent
5https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech
6https://github.com/mozilla/DeepSpeech/releases/tag/v0.5.0
neural network (RNN), which can be trained end-
to-end using supervised learning.7
It extracts Mel-
Frequency Cepstral Coefficients (Imai, 1983) as
features and directly outputs the transcription, with-
out the need for forced alignment on the input or
any external source of knowledge like a Grapheme
to Phoneme (G2P) converter. Overall, the network
has six layers: the speech features are fed into three
fully connected layers (dense), followed by a uni-
directional RNN layer, then a fully connected layer
(dense) and finally an output layer as shown in Fig-
ure 1. The RNN layer uses LSTM cells, and the
hidden fully connected layers use a ReLU activa-
tion function. The network outputs a matrix of
character probabilities, i.e. for each time step the
system gives a probability for each character in the
alphabet, which represents the likelihood of that
character corresponding to the audio. Further, the
Connectionist Temporal Classification (CTC) loss
function (Graves et al., 2006) is used to maximize
the probability of the correct transcription.
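To make the layer stack concrete, the following is a minimal sketch of a DeepSpeech-style network in TensorFlow/Keras. It is not Mozilla's implementation (which, among other things, adds context windows to the MFCC features and clips the ReLU activations), but it reproduces the six-layer structure and the CTC objective; all layer sizes are illustrative.

import tensorflow as tf

NUM_MFCC = 26        # MFCC features per frame (illustrative)
NUM_HIDDEN = 2048    # width of the hidden layers (illustrative)
ALPHABET_SIZE = 29   # e.g. a-z, space, apostrophe, plus the CTC blank

def build_deepspeech_like_model():
    # Input: a batch of variable-length sequences of MFCC frames
    feats = tf.keras.Input(shape=(None, NUM_MFCC), name="mfcc_frames")
    x = feats
    # Layers 1-3: fully connected with ReLU
    for _ in range(3):
        x = tf.keras.layers.Dense(NUM_HIDDEN, activation="relu")(x)
    # Layer 4: unidirectional recurrent layer with LSTM cells
    x = tf.keras.layers.LSTM(NUM_HIDDEN, return_sequences=True)(x)
    # Layer 5: fully connected with ReLU
    x = tf.keras.layers.Dense(NUM_HIDDEN, activation="relu")(x)
    # Layer 6: per-time-step character distribution
    char_probs = tf.keras.layers.Dense(ALPHABET_SIZE, activation="softmax")(x)
    return tf.keras.Model(feats, char_probs)

# Training minimizes the CTC loss (e.g. via tf.nn.ctc_loss) between the character
# probability matrix and the reference transcription, so no frame-level alignment
# between audio and text is needed.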
DeepSpeech comes with a pre-trained English
model, but while Mozilla is collecting speech samples8 and is releasing training datasets in several languages (see the paragraph on Mozilla Common Voice in Section 3), no official models other than English are provided. Users have reported on training models for French9 and Russian (Iakushkin et al., 2018), but the resulting models do not seem to be available.
3 Model Training
In this section, we describe in detail our setup for
training the German model in order to ease subse-
quent attempts to train DeepSpeech models.
3.1 Datasets
To train the German Deep Speech model, we utilize
the following publicly available datasets:
The Voxforge10 corpus contains about 35 hours of German speech clips. Nearly 180 speakers have read aloud sentences from the German Wikipedia, protocols from the European Parliament, and some individual commands. The clips vary in length, ranging from 5 to 7 seconds.
The Tuda-De corpus (Milde and Köhn, 2018) is similar to Voxforge. It uses the same sources
7https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/
8https://voice.mozilla.org/
9http://bit.ly/discourse-mozilla-org
10http://www.voxforge.org/home/forums/other-languages/german/open-speech-data-corpus-for-german
Dataset Size Median Length # Speakers Condition Type
Voxforge 35h 4.5s 180 noisy read
Tuda-De 127h 7.4s 147 clean read
Mozilla Common Voice 140h 3.7s >1,000 noisy read
Table 1: Overview of German datasets
(Wikipedia, parliament speeches, commands), but
the recordings are under more controlled condi-
tions. The final data was also curated “to reduce
speaking errors and artefacts”. Each recording was
made with 4 different microphones at the same
time. This means that while the overall size of
the dataset is larger than Voxforge and a model
based on this dataset is supposed to be more robust,
the actual amount of unique speech hours in both datasets is about the same.
The Mozilla Common Voice project11 aims to
aims to
make speech recognition open to everyone. The
multilingual dataset currently covers 18 languages, including English, French, German, and Mandarin.
The German corpus contains clips with lengths
varying from 3 to 5 seconds. However, the corpus
is recorded outside controlled conditions, at the convenience of the speaker. The utterances have background noise, and users have varied accents. Therefore, we expect this dataset to be relatively challenging. Speakers in this dataset are relatively young,
and the male/female ratio is about 5:1, which might
result in a severe bias when trying to transfer the model.12 The version used in our experiments has
140 hours of recordings, but as Mozilla aims at
adding more recordings, there might already be a
larger dataset available.
3.2 Preprocessing
DeepSpeech expects audio and transcription data
to be prepared in a specific format so that they can
be read directly by the input pipeline (see Figure 2
for an example). We cleaned the transcriptions
by removing commas and other punctuation and converting all transcriptions to lower case. We
further ensured all audio clips are in .wav format.
The pruned results were split into training (70%),
validation (15%), and test data (15%).
For more details on data preprocessing parame-
ters, we refer the reader to the code release.13
11https://voice.mozilla.org/de/datasets
12Speaker information is based on the self-reported statistics provided on the project homepage for each dataset.
13https://github.com/AASHISHAG/deepspeech-german
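As an illustration of this step, the following sketch (simplified from the released preprocessing code; paths and the exact CSV columns are illustrative and should be checked against Figure 2 and the code release) cleans the transcriptions and writes the files read by DeepSpeech's input pipeline.

import csv
import os
import random
import re

def clean_transcript(text):
    # lower-case and strip punctuation; keep German umlauts and ß
    text = text.lower()
    return re.sub(r"[^a-zäöüß' ]", "", text).strip()

def write_csv(samples, path):
    # samples: list of (absolute_wav_path, raw_transcript) pairs
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav, transcript in samples:
            writer.writerow([wav, os.path.getsize(wav), clean_transcript(transcript)])

def split_and_export(samples, out_dir):
    # 70% training, 15% validation, 15% test
    random.shuffle(samples)
    n = len(samples)
    train, dev = samples[: int(0.7 * n)], samples[int(0.7 * n): int(0.85 * n)]
    test = samples[int(0.85 * n):]
    for name, part in [("train", train), ("dev", dev), ("test", test)]:
        write_csv(part, os.path.join(out_dir, f"{name}.csv"))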
Hyperparameter Value
Batch Size 24
Dropout 0.25
Learning Rate 0.0001
Table 2: Hyperparameters used in the experiments
3.3 Hyperparameter Setup
We searched for a good set of hyperparameters as shown in Figure 3. In the first iteration, we fixed the learning rate and training batch size and plotted dropout against word error rate to determine the dropout with the lowest WER. We then used the best dropout (0.25) from this iteration and kept the training batch size fixed to identify the best learning rate. Finally, we took the best dropout (0.25) and learning rate (0.0001) to determine the effect of the batch size, which shows that our initial choice of 24 was reasonable, even if somewhat better results seem possible using smaller batches.
Since DeepSpeech employs early stopping, which halts training before the network overfits the training data, we did not experiment much with the number of epochs. The remaining hyperparameters were set to the values pre-configured in Mozilla DeepSpeech. The
best results are obtained with the hyper-parameters
mentioned in Table 2. We train the network using
the Adam optimizer (Kingma and Ba, 2014).
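For reference, the dropout sweep from the first search iteration can be scripted roughly as follows. The DeepSpeech.py flag names are those of the v0.5.0 release as we used it and may differ in other versions; the file paths and sweep values are illustrative, and the released configurations contain the invocations we actually ran.

import subprocess

# Illustrative sweep over dropout with learning rate and batch size held fixed.
DROPOUT_VALUES = [0.05, 0.15, 0.25, 0.35, 0.45]

for dropout in DROPOUT_VALUES:
    subprocess.run([
        "python3", "DeepSpeech.py",
        "--train_files", "data/train.csv",
        "--dev_files", "data/dev.csv",
        "--test_files", "data/test.csv",
        "--train_batch_size", "24",
        "--learning_rate", "0.0001",
        "--dropout_rate", str(dropout),
        "--export_dir", f"models/dropout_{dropout}",
    ], check=True)  # the test WER for each run is read from the DeepSpeech log output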
Language Model
We use the KenLM toolkit (Heafield, 2011) to train a 3-gram language model on the pre-processed corpus provided by Radeck-Arneth et al. (2015). It consists of eight million filtered sentences comprising 63.0% Wikipedia, 22.0% Europarl, and 14.6% crawled sentences. MaryTTS14 has been used to canonicalize the corpus, i.e. to normalize it to a form that is close to how a reader would speak the sentence, especially expanding numbers, abbreviations, and dates. Additionally, punctuation was discarded, as it is usually not pronounced.
14http://mary.dfki.de/
Figure 2: Screenshot of the input file format
Figure 3: Hyperparameter search space (WER as a function of dropout, learning rate, and batch size)
Dataset WER
Mozilla 79.7
Voxforge 72.1
Tuda-De 26.8
Tuda-De + Mozilla 57.3
Tuda-De + Voxforge 15.1
Tuda-De + Voxforge + Mozilla 21.5
Table 3: German DeepSpeech results
We used the unpruned language model, which has a rather large vocabulary of over 2 million types, but we expect pruning would only affect runtime, not recognition quality.
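The language model itself was built with the standard KenLM tools; the following sketch shows the essential steps (the corpus path is illustrative, and DeepSpeech v0.5.0 additionally expects a trie generated from the alphabet and the binary model, for which we refer to the code release).

import subprocess

CORPUS = "corpus/normalized_sentences.txt"  # one canonicalized sentence per line

# Estimate a 3-gram model in ARPA format with KenLM's lmplz.
with open(CORPUS, "rb") as text, open("lm.arpa", "wb") as arpa:
    subprocess.run(["lmplz", "-o", "3"], stdin=text, stdout=arpa, check=True)

# Convert the ARPA file to KenLM's binary format for faster loading.
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)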
3.4 Server & Runtime
We trained and tested our models on a compute server with 56 Intel(R) Xeon(R) Gold 5120 CPUs @ 2.20GHz and 3 Nvidia Quadro RTX 6000 GPUs with 24GB of RAM each. Typical training time on a single dataset under this setup was on the order of one hour.
4 Results & Discussion
Table 3 shows the word error rates (WER) obtained
when training and testing DeepSpeech on the avail-
able German datasets and their combinations. The
best configuration in Milde and Köhn (2018) using
only the Tuda-De corpus yields a WER of 28.96%.
Our model only trained on Tuda-De yields a com-
parable WER of 26.8%.
Results for the other datasets are much worse, but apparently combining several datasets improves the results. While the combination of Tuda-De and Mozilla yields a WER of 57.3%, the combination of Tuda-De, Voxforge, and Mozilla gives a WER of 21.5%.
Combining the very similar Tuda-De and Voxforge
yields a WER of 15.1%, which is a remarkable im-
provement over using only a single dataset. Note
that this is the black-box performance, as we used
DeepSpeech as is and only slightly tuned hyper-
parameters. See Section 6 for ideas on how to
improve over these results.
To put our results into perspective, in Table 4,
we present results in other languages for training
different versions of the DeepSpeech architecture.
Our best results are in the same range as for the
other languages, but cross-dataset comparisons are
hard to interpret. However, it is safe to say that
training a DeepSpeech model can result in accept-
able in-domain word error rates with considerably
less training data than previously considered.
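All scores in this paper are word error rates. For completeness, the following sketch shows the standard computation as the word-level edit distance between reference and hypothesis, normalized by the reference length (which is why individual utterances can score above 100%).

# Word error rate as normalized word-level Levenshtein distance (standard definition).
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative example: one substitution and one deletion against a five-word reference -> 0.4
print(wer("die geschwindigkeit kann erhöht werden",
          "die geschwindigkeit wird erhöht"))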
4.1 Influence of Training Size
Figure 4 depicts the relation between the amount of training data and the resulting word error rate. To plot the learning curve, we split the training data into 10 subsets, each containing 10% of the data.
Figure 4: Learning curves for single datasets (WER over the number of training instances for Voxforge, Mozilla, and Tuda-De)
Figure 5: Learning curves when combining datasets (WER over the number of training instances for Tuda-De + Mozilla, Tuda-De + Voxforge, and Tuda-De + Voxforge + Mozilla)
Figure 6: Order effects when combining datasets (WER over the number of training instances when adding Tuda-De, Voxforge, and Mozilla)
Language | DeepSpeech version | Training Set | Size | Test Set | WER
English | Baidu (Hannun et al., 2014) | Switchboard, Fisher, WSJ, Baidu | 7,380h | Hub5 (LDC2002S23) | 16.0
English | Mozilla v0.3.0 | Switchboard, Fisher, LibriSpeech | 3,260h | LibriSpeech (clean test) | 11.0
English | Mozilla v0.5.0 | Switchboard, Fisher, LibriSpeech | 3,260h | LibriSpeech (clean test) | 8.2
Russian | Mozilla v? (Iakushkin et al., 2018) | Yt-vad-1k, Yt-vad-650-clean | 1,650h | Voxforge (Russian) | 18.0
German | Mozilla v0.5.0 (our) | Tuda-De + Voxforge | 162h | Tuda-De + Voxforge (test) | 15.1
Table 4: Comparison with previous results in other languages
Then the model is trained on one subset and WER is calculated on a separate test dataset. Next, we add the next subset with more data, re-train the model, and compute the effect on the error rate. The model is trained on each subset for a maximum of 10 epochs, or fewer when the model starts to overfit the training data and early stopping is triggered. We observe
that the rather noisy datasets Voxforge and Mozilla
converge rather slowly, while the clean Tuda-De
reaches much better results. This might also be
a result of the different microphones that add in-
creased robustness (not unlike other data augmen-
tation strategies).
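The learning-curve procedure described above amounts to the following bookkeeping; the train_and_score function is a hypothetical stand-in for a full DeepSpeech training run (max. 10 epochs, early stopping) followed by evaluation on the fixed test set.

def learning_curve(train_samples, train_and_score, num_steps=10):
    """Train on a growing fraction of the training data and record the test WER.

    train_samples: the full (shuffled) training set.
    train_and_score: callable that trains a model on the given subset and
        returns its WER on the fixed test set (hypothetical stand-in here).
    """
    results = []
    for step in range(1, num_steps + 1):
        subset = train_samples[: len(train_samples) * step // num_steps]
        results.append((len(subset), train_and_score(subset)))
    return results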
Figure 5 presents the same learning curves when combining datasets, showing that we can reach even better WER in this setting. Mixing the datasets seems to force the model to converge more quickly. However, combining the similar datasets Tuda-De and Voxforge yields somewhat better performance than combining all three datasets.
We also tested the mix of all datasets, but added the training data one dataset at a time. Thus, the order in which datasets are introduced into the training process might influence performance. Figure 6 shows the results for different orders in which the datasets are introduced. Adding the noisy Mozilla dataset too early seems to slow down convergence, while it yields a small improvement when added at the end.
4.2 Cross-dataset Performance
So far, we used training and testing data either from
the same dataset or a mix of the available datasets,
Train | Test | WER
Voxforge | Voxforge | 72.1
Tuda-De | Voxforge | 96.8
Mozilla | Voxforge | 73.1
Tuda-De, Mozilla | Voxforge | 66.2
Tuda-De | Tuda-De | 26.8
Voxforge | Tuda-De | 98.5
Mozilla | Tuda-De | 84.9
Voxforge, Mozilla | Tuda-De | 83.8
Mozilla | Mozilla | 79.7
Tuda-De | Mozilla | 94.8
Voxforge | Mozilla | 87.1
Tuda-De, Voxforge | Mozilla | 80.5
Table 5: Results across datasets
while of course keeping train and test data separate.
To get a more realistic estimate of performance
when used in a general setting, we assess cross-
dataset performance, i.e. we train and develop on
one or two datasets and test on a third one.
Table 5 shows the resulting word error rates. Ap-
parently, the cross-domain results are much worse
than in the in-domain setting in Table 3. For exam-
ple, training on Mozilla, or on Voxforge and Mozilla combined, and testing on Tuda-De yields unacceptable word error rates of 84.9 and 83.8, compared to 26.8 when
training on Tuda-De. Interestingly, in this case, as
we have seen already above, adding Voxforge in
the mix does not help much, even if it is similar to
Tuda-De. We see a similar picture for the other test datasets: transferring from a single dataset does not work at all, as during training the model is never forced to generalize beyond the properties of that dataset.
However, training on the Tuda-De and Mozilla
combination yields WER of 66.2 on Voxforge,
Model | WER | Example
original | - | der bandbreitenverbrauch wird erheblich verringert
Tuda-De | 60 | diese zeiten tonwoche erheblich verringert
Voxforge | 80 | zeiten epoche erheblich in
Tuda-De + Mozilla | 160 | es sind endete suche den ist es in
Tuda-De + Voxforge | 60 | der pen zeiten verprach wird erheblich verringert
Tuda-De + Voxforge + Mozilla | 40 | der bandbreiten verbrauch wird erheblich verringert
original | - | ferner gibt es möglicherweise eine gewisse anonymität und sicherheit
Tuda-De | 78 | weites mögliche welche in glichen unität und sicherheit
Voxforge | 100 | zitierweise sich entsichert
Tuda-De + Mozilla | 100 | hunde titisee gelten die die mitte zum
Tuda-De + Voxforge | 44 | den gibt es möglicherweise eine gewisse mietsicherheit
Tuda-De + Voxforge + Mozilla | 11 | er gibt es möglicherweise eine gewisse anonymität und sicherheit
original | - | die einwilligung des schuldners war nicht erforderlich
Tuda-De | 100 | ideen
Voxforge | 86 | die angebliche natacha vollich
Tuda-De + Mozilla | 57 | die einwilligung des schutzmacht erfordern
Tuda-De + Voxforge | 86 | die ein eigenes schuldnersicht erfordern
Tuda-De + Voxforge + Mozilla | 43 | die einigung des schuldner zwar nicht erforderlich
original | - | die geschwindigkeit für die kunden kann erhöht werden
Tuda-De | 75 | die geschwindigkeit und unterteilten
Voxforge | 100 | schinkelpreise
Tuda-De + Mozilla | 88 | wie die schmiede den trennendes
Tuda-De + Voxforge | 38 | die geschwindigkeit für die kunden kenterte
Tuda-De + Voxforge + Mozilla | 0 | die geschwindigkeit für die kunden kann erhöht werden
original | - | mehrere arbeitgeberverbände sind zu einem dachverband zusammengeschlossen
Tuda-De | 114 | der see aufweitungen des in einem tatorten samen erschossen
Voxforge | 100 | es recognitionszeichen
Tuda-De + Mozilla | 100 | in den sitzungen des entstandenen schaden
Tuda-De + Voxforge | 29 | mehrere arbeitgeberverbände sind zu einem tachodaten geschlossen
Tuda-De + Voxforge + Mozilla | 14 | der arbeitgeberverbände sind zu einem dachverband zusammengeschlossen
Table 6: Recognition results on random Voxforge test instances
which is even lower than using the training por-
tion of Voxforge (which yields 72.1). Thus, forcing the model to generalize over topics, recording conditions, speakers, etc. seems to be crucial.
5 Error Analysis
Table 6 shows the recognition results on randomly
selected test instances from the Voxforge dataset.
The models trained on only one dataset are surpris-
ingly bad, resulting in rather poetic utterances that
sometimes are quite far from the expected source.
An example is the Tuda-De model recognizing
tatorten samen erschossen instead of dachverband
zusammengeschlossen.
As is to be expected for German, compounds
are especially challenging as exemplified by band-
breitenverbrauch that is recognized as bandbreiten
verbrauch or even pen zeiten verprach, where ver-
prach is probably only in the language model as a
common misspelling of versprach.
The models often fail in interesting ways, e.g.
all models sometimes return very short results like
schinkelpreise that should actually have low prob-
ability. We currently have no explanation for this
behaviour and need to explore the issue further.
In cases like des schuldners war being recog-
nized as des schuldner zwar, the phonetic ambigu-
ity should have been resolved by a better language
model.
6 Summary
In this paper, we presented the first results on
building a German speech recognition model using
Mozilla DeepSpeech. Our best performing model
reaches an in-domain WER of 15.1%, which is in
line with the performance for other languages us-
ing the DeepSpeech framework. Our results thus
support the idea that Mozilla DeepSpeech can
be easily transferred to new languages. Learning
curve experiments highlight the importance of the
amount of training data, but also quite strong order
effects when mixing the datasets.
We publish our trained model along with con-
figuration data for all our experiments in order to
enable replicating all results. The model can eas-
ily be re-trained and optimised on new datasets by
referring the code-release.
15
No specific hardware
is required to run the trained model, and it works
even on a normal desktop computer or laptop.
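For illustration, transcribing a 16 kHz mono WAV file with the released model looks roughly as follows; the Model and decoder argument lists shown here are those of the DeepSpeech v0.5.x Python client as we recall them, they changed in later releases, and the file names are placeholders.

import wave
import numpy as np
from deepspeech import Model

# Decoder settings corresponding to the DeepSpeech v0.5.x client defaults (assumed).
N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
LM_ALPHA, LM_BETA = 0.75, 1.85

ds = Model("output_graph.pb", N_FEATURES, N_CONTEXT, "alphabet.txt", BEAM_WIDTH)
ds.enableDecoderWithLM("alphabet.txt", "lm.binary", "trie", LM_ALPHA, LM_BETA)

with wave.open("sample.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio, 16000))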
Future Work
Our experiments only scratch the
surface of possible approaches, and our analysis suggests several avenues for further exploration.
We mainly treated DeepSpeech as a black-box
and only performed a light hyper-parameter search.
The model can probably still be fine-tuned by ex-
ploring other hyper-parameters. We also did not
experiment much with the language model, but
used a simple 3-gram model.
Since the amount of publicly available training
data is limited, it could be interesting to consider
data augmentation strategies.16 Another approach
to improve recognition quality could be to use
transfer learning by taking an English model (pre-
trained with the larger English datasets) and re-
training with the German data (Kunze et al., 2017;
Bansal et al., 2018). In the light of recent discus-
sions on the CO2 footprint of training deep learning
models (Strubell et al., 2019), using re-training and
providing trained models is desirable. Additionally,
more research is needed to find neural architectures
that perform equally well, but require less compute.
Finally, the training process described here could
be easily used to train speech recognition models
for other languages, where currently no pre-trained
models are available.
Acknowledgments
We want to thank Andrea Horbach for her many
helpful comments that significantly improved the
paper. We also thank the developers at Mozilla
DeepSpeech, who provided insight and expertise
that greatly assisted the research.
References
[Abadi et al.2016] Martín Abadi, Paul Barham, Jianmin
Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving,
Michael Isard, et al. 2016. Tensorflow: A system
for large-scale machine learning. In 12th USENIX
Symposium on Operating Systems Design and Imple-
mentation OSDI 16, pages 265–283.
[Bansal et al.2018] Sameer Bansal, Herman Kamper,
Karen Livescu, Adam Lopez, and Sharon Goldwater.
15https://github.com/AASHISHAG/deepspeech-german
16https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html
2018. Pre-training on high-resource speech recog-
nition improves low-resource speech-to-text transla-
tion. CoRR, abs/1809.01431.
[Gaida et al.2014] Christian Gaida, Patrick Lange, Rico
Petrick, Patrick Proba, Ahmed Malatawy, and David
Suendermann-Oeft. 2014. Comparing open-source
speech recognition toolkits.
[Graves et al.2006] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.
[Hannun et al.2014] Awni Y. Hannun, Carl Case, Jared
Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen,
Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta,
Adam Coates, and Andrew Y. Ng. 2014. Deep
speech: Scaling up end-to-end speech recognition.
CoRR, abs/1412.5567.
[He et al.2018] Yanzhang He, Tara N. Sainath, Rohit
Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding
Zhao, David Rybach, Anjuli Kannan, Yonghui Wu,
Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan
Shangguan, Bo Li, Golan Pundak, Khe Chai Sim,
Tom Bagby, Shuo-Yiin Chang, Kanishka Rao, and
Alexander Gruenstein. 2018. Streaming end-to-
end speech recognition for mobile devices. CoRR,
abs/1811.06621.
[Heafield2011] Kenneth Heafield. 2011. KenLM:
Faster and smaller language model queries. In Pro-
ceedings of the Sixth Workshop on Statistical Ma-
chine Translation, pages 187–197, Edinburgh, Scot-
land.
[Iakushkin et al.2018] Oleg Iakushkin, George Fe-
doseev, Anna S. Shaleva, Alexander Degtyarev,
and Olga S. Sedova. 2018. Russian-language
speech recognition system based on deepspeech. In
Proceedings of the VIII International Conference
on Distributed Computing and Grid-technologies in
Science and Education (GRID 2018).
[Imai1983] Satoshi Imai. 1983. Cepstral analy-
sis synthesis on the mel frequency scale. In
ICASSP’83. IEEE International Conference on
Acoustics, Speech, and Signal Processing, volume 8,
pages 93–96. IEEE.
[Kingma and Ba2014] Diederik P. Kingma and Jimmy
Ba. 2014. Adam: A Method for Stochastic Op-
timization. arXiv e-prints, page arXiv:1412.6980,
Dec.
[Krstovski et al.2008] Kriste Krstovski, Michael De-
cerbo, Rohit Prasad, David Stallard, Shirin Saleem,
and Premkumar Natarajan. 2008. A wearable head-
set speech-to-speech translation system. In Proceed-
ings of the ACL-08: HLT Workshop on Mobile Lan-
guage Processing, pages 10–12, Columbus, Ohio,
June. Association for Computational Linguistics.
[Kunze et al.2017] Julius Kunze, Louis Kirsch, Ilia
Kurenkov, Andreas Krug, Jens Johannsmeier, and
Sebastian Stober. 2017. Transfer learning
for speech recognition on a budget. CoRR,
abs/1706.00290.
[Li et al.2017] Bo Li, Tara Sainath, Arun Narayanan,
Joe Caroselli, Michiel Bacchiani, Ananya Misra,
Izhak Shafran, Hasim Sak, Golan Pundak, Kean
Chin, Khe Chai Sim, Ron J. Weiss, Kevin Wil-
son, Ehsan Variani, Chanwoo Kim, Olivier Siohan,
Mitchel Weintraub, Erik McDermott, Rick Rose,
and Matt Shannon. 2017. Acoustic modeling for
google home.
[Liao et al.2013] Hank Liao, Erik McDermott, and An-
drew W. Senior. 2013. Large scale deep neural net-
work acoustic modeling with semi-supervised train-
ing data for youtube video transcription. In ASRU,
pages 368–373. IEEE.
[Milde and Köhn2018] Benjamin Milde and Arne Köhn. 2018. Open source automatic speech recognition for German. CoRR, abs/1807.10311.
[Povey et al.2011] Daniel Povey, Arnab Ghoshal, Gilles
Boulianne, Lukas Burget, Ondrej Glembek, Nagen-
dra Goel, Mirko Hannemann, Petr Motlicek, Yanmin
Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer,
and Karel Vesely. 2011. The kaldi speech recogni-
tion toolkit. In IEEE 2011 Workshop on Automatic
Speech Recognition and Understanding, December.
[Pratap et al.2018] Vineel Pratap, Awni Hannun,
Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Syn-
naeve, Vitaliy Liptchinsky, and Ronan Collobert.
2018. wav2letter++: The fastest open-source
speech recognition system. CoRR, abs/1812.07625.
[Radeck-Arneth et al.2015] Stephan Radeck-Arneth, Benjamin Milde, Arvid Lange, Evandro Gouvêa, Stefan Radomski, Max Mühlhäuser, and Chris Biemann. 2015. Open source German distant speech recognition: Corpus and acoustic model. In Text, Speech, and Dialogue, pages 480–488, Cham.
[Saon et al.2017] George Saon, Gakuto Kurata, Tom
Sercu, Kartik Audhkhasi, Samuel Thomas, Dim-
itrios Dimitriadis, Xiaodong Cui, Bhuvana Ram-
abhadran, Michael Picheny, Lynn-Li Lim, Bergul
Roomi, and Phil Hall. 2017. English conversa-
tional telephone speech recognition by humans and
machines. CoRR, abs/1703.02136.
[Stadtschnitzer et al.2014] Michael Stadtschnitzer,
Jochen Schwenninger, Daniel Stein, and Joachim
Koehler. 2014. Exploiting the large-scale German
Broadcast Corpus to boost the Fraunhofer IAIS
Speech Recognition System. In Proceedings of
LREC 2014, pages 3887–3890, Reykjavik, Iceland.
[Strubell et al.2019] Emma Strubell, Ananya Ganesh,
and Andrew McCallum. 2019. Energy and policy
considerations for deep learning in nlp. In Proceed-
ings of ACL.