Fig. 2. Standard text-to-speech system architecture

Source publication
Article
Full-text available
Arabic text-to-speech synthesis from non-diacritized text is still a major challenge, because of the unique rules and characteristics of the Arabic language. Indeed, the diacritic and gemination signs, which are special characters representing short vowels and consonant doubling respectively, have a major effect on the accurate pronunciation of Arabic. However thes...

Contexts in source publication

Context 1
... TTS system is based on two main modules (cf. Fig. 2): the Text-to-Phoneme module, also called the Natural Language Processing (NLP) module, and the Phoneme-to-Speech module, or the Digital Signal Processing (DSP) ...
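A minimal sketch of this two-stage pipeline may help fix the terminology; the function names and the toy lookup table below are illustrative placeholders, not the authors' implementation.

def nlp_module(text: str) -> list[str]:
    # Text-to-Phoneme (NLP) stage: in a real system this covers text
    # normalization, diacritization and grapheme-to-phoneme conversion;
    # a toy lookup table stands in here.
    toy_g2p = {"b": "b", "a": "a", "t": "t"}
    return [toy_g2p.get(ch, ch) for ch in text.lower() if not ch.isspace()]

def dsp_module(phonemes: list[str]) -> bytes:
    # Phoneme-to-Speech (DSP) stage: acoustic-parameter prediction and
    # waveform generation; a placeholder byte string is returned here.
    return "|".join(phonemes).encode("utf-8")

def synthesize(text: str) -> bytes:
    return dsp_module(nlp_module(text))

print(synthesize("bat"))  # b'b|a|t'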
Context 2
... this cannot be compatible with all languages: each language has its own characteristics, which lead to it having its own system. The focus of this work is to implement all the linguistic module components for Arabic text, as shown in Figure 2, using a deep learning approach. In this research, no linguistic features or tools are employed, as has been the case in previous works such as [16]. ...

Similar publications

Preprint
Full-text available
Converting written texts into their spoken forms is an essential problem in any text-to-speech (TTS) system. However, building an effective text normalization solution for a real-world TTS system faces two main challenges: (1) the semantic ambiguity of non-standard words (NSWs), e.g., numbers, dates, ranges, scores, abbreviations, and (2) transform...

Citations

... In order to improve the TTS performance of the GRU model, the AOA technique is used as a hyperparameter optimizer [21]. Like other birds, the Aquila has a dark brown colour. ...
... Similar to other population-based algorithms, the AOA methodology starts with a population of candidate solutions, initialized stochastically between an upper and a lower limit [21]. In each iteration, the best-obtained solution is taken as the near-optimal solution, as given below. ...
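As a rough illustration of that initialization step (not the cited paper's code; the array shapes, bounds, and toy objective are assumptions), a population-based optimizer such as AOA typically draws its candidates uniformly between the lower and upper limits:

import numpy as np

rng = np.random.default_rng(0)
pop_size, dim = 20, 5        # number of candidate solutions and their dimension
lower, upper = -10.0, 10.0   # lower and upper limits of the search space

# X[i, j] = lower + U(0, 1) * (upper - lower)
X = lower + rng.random((pop_size, dim)) * (upper - lower)

def fitness(x):
    # Toy objective; in the cited work this would score GRU hyperparameters.
    return np.sum(x ** 2)

best = min(X, key=fitness)   # near-optimal candidate of the current iteration
print(best)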
... Diacritical marks are used in Arabic to differentiate words in terms of sound and meaning. However, Arabic writers rarely use diacritics in their non-academic writing [28], compounding the complexity of text processing in MSA. • Most Arabic words can be reduced to roots consisting of three letters [29]. ...
Article
Full-text available
Although there are several speech synthesis models available for different languages, tailored to specific domain requirements and applications, there is currently no readily available information on the latest trends in Arabic speech synthesis. This can make it challenging for beginners to research and develop text-to-speech (TTS) systems for Arabic. To address this issue, this article provides a comprehensive overview of several scholars’ contributions to the field of Arabic TTS, along with an examination of the unique features of the Arabic language and the corresponding challenges in creating TTS systems. Reporting only on papers discussing Arabic TTS, this systematic review evaluated the available literature published between 2000 and 2022. We conducted a systematic review of six databases following the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines to identify studies that addressed Arabic text-to-speech systems. Of a total of 3719 articles identified, only 36 (0.96%) met our search criteria. Bibliometric analyses of these studies were conducted and reported. The results highlight the main types of speech synthesis techniques used in TTS systems: concatenative, formant, deep neural network (DNN), hybrid, and multiagent models. The corpora used to develop these systems are reported, together with the diacritization techniques incorporated, the evaluation techniques, and the systems’ performance results. Subjective evaluation using the mean opinion score is the most commonly applied method for measuring system accuracy. This study also identifies gaps in the literature and makes recommendations for future research directions.
... Tacotron 2 [69] has been tested for a voweled MSA corpus by [64]. Hadj Ali et al. [71] have tested DNN for the task of grapheme-to-phoneme conversion using diacritized texts. Abdelali et al., [72] have also tested Tacotron [67], Tacotron 2 [69] and Model ESPnet Transformer TTS [73] in the Arabic language. ...
Article
Full-text available
Intelligent systems powered by artificial intelligence techniques have been widely proposed to help humans in various tasks. The intelligent personal assistant (IPA) is one of these smart systems. In this paper, we present an attempt to create an IPA that interacts with users in Tunisian Arabic (TA), the colloquial form used in Tunisia. We propose and explore a simple-to-implement method for building the principal components of a TA IPA. We apply deep learning techniques, namely CNN [1], RNN encoder-decoder [2], and end-to-end approaches, to create the IPA speech components (speech recognition and speech synthesis). In addition, we explore available free dialog platforms for understanding a request and generating a suitable response in TA. For this purpose, we create and use TA transcripts to generate the corresponding models. Evaluation results are acceptable for a first attempt.
... We use two widely used datasets for Arabic NLP [43], [44], [45], [46], [47], [48], [49] to train, validate, and test our proposed BiLSTM models. The first is a processed subset of the Tashkeela corpus as extracted in [50]. ...
... With the success of deep learning, the paradigm of speech synthesis has shifted away from hidden Markov model-based speech synthesis toward neural speech synthesis. The DNN-based technique has the potential to solve the drawbacks of the HMM-based approach, such as inefficiency in expressing complicated context dependencies, fragmentation of training data, and full disregard for language input characteristics [2]. ...
... The contextual features are translated to the vector output in DNN-based speech synthesis [2]. The size and quality of the training data have a significant impact on the quality of synthesized speech of a neural text-to-speech system [3]. ...
... The aim of this work is to realize an Arabic DNN-based speech synthesis system using deep learning techniques. Deep Neural Networks of various forms, such as feed forward, recurrent (LSTM and BLSTM), and hybrid DNN models [2], have been employed for a number of applications in the field of speech synthesis systems. In terms of our objective, the most relevant studies employ DNN to predict phone durations [19,20], or syllable durations [7]. ...
Conference Paper
This article discusses a Deep Neural Network-based text-to-speech synthesis system for the Arabic language. Subjective and objective tests were used to evaluate the system. We used the Mean Opinion Score (MOS) for subjective evaluation, the Diagnostic Rhyme Test (DRT) to test the intelligibility of some consonants and vowels, and the Perceptual Evaluation of Speech Quality (PESQ) for objective evaluation. The results have mean scores of 3.92/5 and 3.88/5 for the MOS and DRT tests, respectively, and 3.17/5 for the PESQ test; the majority of words and sentences were recognized, and the system's overall quality was judged satisfactory. Furthermore, the results show a significant improvement in the quality of synthesized speech for DNN-based TTS when compared to its HMM-based counterpart.
... A feedforward DNN is designed as a one-way process: the inputs are fed into the network through the first layer, and each layer's output is provided as input to the following layer. The whole process is governed by supervised ML, and the final result can be a classification or a regression [17]. Fig. 2 is a diagram of a feedforward DNN. ...
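A minimal feedforward network along these lines might look as follows; the layer sizes and the single regression output are assumptions for illustration, not the configuration used in [17].

import torch
import torch.nn as nn

# One-way (feedforward) stack: each layer's output feeds the next layer.
model = nn.Sequential(
    nn.Linear(10, 64),   # first layer receives the input features
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),    # final layer: a regression (or classification) output
)

x = torch.randn(8, 10)   # a batch of 8 input feature vectors
y = model(x)             # forward pass through the network
print(y.shape)           # torch.Size([8, 1])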
... It suffers from low accuracy [32]. Indeed, Farasa's diacritic restoration performance was compared to some examples of DNN performance [17]. The results show that the DNN system outperforms Farasa, with an accuracy rate 4.2% higher than Farasa's. ...
... While the study in [17] falls within the field of Arabic text-to-speech synthesis, restoring diacritics in non-diacritized texts is an essential step for that task. The study tackles restoring gemination in a first step and then restoring the other diacritics in a second step. ...
Article
Full-text available
Arabic diacritics are signs used in Arabic orthography to represent essential morphophonological and syntactic information. It is common practice to leave out those diacritics in written Arabic, and most Arabic electronic texts lack them. Processing such texts for various Natural Language Processing purposes is a complicated task, and diacritized words are necessary for applications such as machine translation, sentiment analysis, and speech synthesis. To address this problem, several studies have proposed automatic systems to restore diacritics in Arabic texts. The present paper presents an in-depth survey of the 56 most recent Arabic diacritization studies. The studies encompassed in this survey were selected from the following databases: IEEE Xplore, Clarivate Analytics, Google Scholar, and Science Direct. Based on the diacritization approach, the studies are grouped into four categories by method: rule-based, simple statistical, hybrid, and neural network. While rule-based methods such as morphological analyzers and lexicon retrieval were the earliest approaches, results indicate that they are still valuable tools that can aid the process of diacritization. Effective statistical methods that produce diacritics with acceptable accuracy include Hidden Markov Models, n-grams, and Support Vector Machines; they are often combined with either rule-based methods or neural networks in hybrid systems. Neural networks, specifically Bidirectional Long Short-Term Memory, reach very high diacritization accuracy levels. Studies employing neural networks focus on evaluating and comparing the efficacy of several types of neural networks or hybrids of them, testing alternative input units, or suggesting schemes for partial diacritization. The study synthesizes the results of these studies, identifies research gaps, and offers recommendations for future research.
... sion (Ali et al., 2020), a crucial component in Text-to-Speech (TTS). With the rise of personal digital assistants with TTS capabilities, there is a clear need for improved automatic diacritization methods. ...
Preprint
We propose a novel multitask learning method for diacritization which trains a model to both diacritize and translate. Our method addresses data sparsity by exploiting large, readily available bitext corpora. Furthermore, translation requires implicit linguistic and semantic knowledge, which is helpful for resolving ambiguities in the diacritization task. We apply our method to the Penn Arabic Treebank and report a new state-of-the-art word error rate of 4.79%. We also conduct manual and automatic analysis to better understand our method and highlight some of the remaining challenges in diacritization.
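The multitask idea can be illustrated with a rough sketch of a shared encoder feeding two heads; all names, sizes, and the simplified translation head below are assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class MultitaskDiacritizer(nn.Module):
    def __init__(self, vocab=100, diac_classes=15, trans_vocab=8000, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.diac_head = nn.Linear(2 * hidden, diac_classes)  # per-character diacritic
        self.trans_head = nn.Linear(2 * hidden, trans_vocab)  # stand-in for a translation decoder

    def forward(self, chars):
        h, _ = self.encoder(self.embed(chars))
        return self.diac_head(h), self.trans_head(h)

model = MultitaskDiacritizer()
chars = torch.randint(0, 100, (4, 30))     # a batch of 4 sequences of 30 characters
diac_logits, trans_logits = model(chars)
# Joint training would sum both objectives, e.g.
# loss = ce(diac_logits, diac_targets) + lam * ce(trans_logits, trans_targets)
print(diac_logits.shape, trans_logits.shape)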
... We use two widely used datasets for Arabic NLP [42,43,44,45,46,47,48] to train, validate, and test our proposed BiLSTM models. The first is a processed subset of the Tashkeela corpus as extracted in [49]. ...
Preprint
Full-text available
Soft spelling errors are a class of spelling mistakes that is widespread among native Arabic speakers and foreign learners alike. Some of these errors are typographical in nature: they occur because of the orthographic variations of some Arabic letters and the complex rules that dictate their correct usage. Many people forgo these rules and, given the identical phonetic sounds, often confuse such letters. In this paper, we propose a bidirectional long short-term memory network that corrects this class of errors. We develop, train, evaluate, and compare a set of BiLSTM networks. We approach the spelling correction problem at the character level, handle Arabic texts from both classical and modern standard Arabic, and treat the problem as a one-to-one sequence transcription problem. Since the soft Arabic error class encompasses omission and addition mistakes, we propose a simple, low-resource yet effective technique that preserves the one-to-one sequencing and avoids using a costly encoder-decoder architecture. We train the BiLSTM models to correct the spelling mistakes using transformed-input and stochastic error-injection approaches. We recommend a configuration that has two BiLSTM layers, uses dropout regularization, and is trained with the latter approach and an error injection rate of 40%. The best model corrects 96.4% of the injected errors and achieves a low character error rate of 1.28% on a real test set of soft spelling mistakes.
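A character-level BiLSTM of this kind can be sketched roughly as below; the vocabulary size, layer widths, and dropout value are illustrative assumptions rather than the configuration recommended in the paper.

import torch
import torch.nn as nn

class CharBiLSTM(nn.Module):
    # One output character per input character, preserving the
    # one-to-one sequence transcription described above.
    def __init__(self, n_chars=60, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.bilstm = nn.LSTM(emb, hidden, num_layers=2, batch_first=True,
                              bidirectional=True, dropout=0.3)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, x):
        h, _ = self.bilstm(self.embed(x))
        return self.out(h)

model = CharBiLSTM()
noisy = torch.randint(0, 60, (4, 40))  # a batch of 4 sequences with injected errors
logits = model(noisy)                  # (4, 40, 60): per-position character scores
print(logits.shape)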
... For instance, the phoneme /b/ in the word 'cab' distinguishes that word from 'can', 'cap', and 'cat'. Phonemicization plays important roles in automatically recognizing speech [1], synthesizing speech [2,3], developing phonemic syllabification models [4,5], and many other applications in the speech and linguistics areas [6]. A G2P can be developed using a rule-based approach, a conventional ML-based approach, or a DL-based approach. The performance of these approaches commonly depends on the complexity of a language's phonotactic rules, which reflect how strong the relation between graphemes and phonemes is. ...
Article
Full-text available
A phonemicization, or grapheme-to-phoneme conversion (G2P), is the process of converting a word into its pronunciation. It is one of the essential components in speech synthesis, speech recognition, and natural language processing. Deep learning (DL)-based state-of-the-art G2P models generally give a low phoneme error rate (PER) as well as a low word error rate (WER) for high-resource languages, such as English and European languages, but not for low-resource languages. Therefore, some conventional machine learning (ML)-based G2P models incorporating specific linguistic knowledge are preferable for low-resource languages. However, these models perform poorly for several low-resource languages because of various issues. For instance, an Indonesian G2P model works well for roots but gives a high PER for derivatives. Most errors come from the ambiguities of some roots and derivative words containing four prefixes: 〈ber〉, 〈meng〉, 〈peng〉, and 〈ter〉. In this research, an Indonesian G2P model based on n-grams combined with a stemmer and phonotactic rules (NGTSP) is proposed to solve those problems. An investigation based on 5-fold cross-validation, using 50k Indonesian words, shows that the proposed NGTSP gives a much lower PER (0.78%) than the state-of-the-art Transformer-based G2P model (1.14%). It also provides a much faster processing time.