Figure: Examples of Tunisian comments with their MSA and English translation.


Source publication
Chapter
Full-text available
Pre-trained models have achieved high performance since the introduction of Transformer architectures such as the Bidirectional Encoder Representations from Transformers, known as BERT. Nevertheless, most of these models have been trained on the most represented languages (English, French, German, etc.), and few models target under-represented languages...

Contexts in source publication

Context 1
... Tunisian Arabic has no standardized grammatical rules, and there are multiple ways to write and speak it. Furthermore, this dialect has its own lexicon, phonetics, and morphological structure, as shown in Table 1. A strong language model for the Tunisian dialect has therefore become essential for building NLP-based applications (sentiment analysis, translation, information retrieval, etc.). ...
Context 2
... also outperforms AraBERT. Likewise, it achieved a macro-F1 of 93.25% on the Tunisian-Algerian dialect identification task, outperforming the other language models used, as shown in Table 10. ...
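For readers unfamiliar with the metric, the macro-averaged F1 quoted above can be computed as in the minimal sketch below; the labels are invented for illustration and are not taken from the evaluation data.

from sklearn.metrics import f1_score

# Toy gold and predicted dialect labels (illustrative only).
y_true = ["TUN", "TUN", "ALG", "ALG", "TUN", "ALG"]
y_pred = ["TUN", "ALG", "ALG", "ALG", "TUN", "TUN"]

# Macro-F1 averages the per-dialect F1 scores, so each class counts
# equally regardless of how many examples it has.
print(f1_score(y_true, y_pred, average="macro"))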
Context 3
... strategy was to use the pre-trained language model, fine-tune it for a few epochs on the MSA dataset, and then use the best checkpoint to train and test on the TRCD dataset. Following this strategy, TunBERT achieved strong results, with an exact match of 27.65%, an F1 score of 60.24%, and a recall of 82.36%, as shown in Table 11. ...
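This two-stage strategy can be sketched with the Hugging Face Trainer API as below. It is only an assumed reconstruction: the checkpoint identifier, hyperparameters, and the pre-tokenized MSA and TRCD datasets passed as arguments are placeholders, not the authors' actual setup.

from transformers import (AutoModelForQuestionAnswering, Trainer,
                          TrainingArguments)

def two_stage_finetune(base_ckpt, msa_train, msa_dev, trcd_train, trcd_dev):
    # Datasets are assumed to be already tokenized with the checkpoint's
    # tokenizer and to carry start/end answer positions.
    model = AutoModelForQuestionAnswering.from_pretrained(base_ckpt)

    # Stage 1: a few epochs on the MSA data, keeping the best checkpoint
    # (on recent transformers versions the argument is named eval_strategy).
    stage1 = TrainingArguments(output_dir="qa_msa", num_train_epochs=3,
                               save_strategy="epoch",
                               evaluation_strategy="epoch",
                               load_best_model_at_end=True)
    Trainer(model=model, args=stage1, train_dataset=msa_train,
            eval_dataset=msa_dev).train()

    # Stage 2: continue from the best MSA checkpoint, then train and
    # evaluate on the TRCD dataset.
    stage2 = TrainingArguments(output_dir="qa_trcd", num_train_epochs=3)
    trainer = Trainer(model=model, args=stage2, train_dataset=trcd_train,
                      eval_dataset=trcd_dev)
    trainer.train()
    return trainer.evaluate()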
Context 4
... experimental results indicate that the proposed pre-trained TunBERT model yields improvements compared to the mBERT, GigaBERT, and AraBERT models on the studied tasks, as shown in Tables 7 and 8 for the Sentiment Analysis sub-task, Tables 9 and 10 for the Dialect Identification task, and Table 11 for the Question-Answering task. Unsurprisingly, GigaBERT, a BERT variant customized for English-to-Arabic cross-lingual transfer, is not effective for the tackled tasks and should instead be applied to tasks using code-switched data, as suggested in [8]. ...


Citations

... This discrepancy is attributed to the wealth of data available for these languages. The Arabic language stands out as a prime example of a low-resource language, lacking dedicated models, especially ones tailored to its diverse dialects. Presently, the landscape includes just three monodialectal BERT models for Arabic: SudaBERT [5], TunBERT [6], and DziriBERT [7]. The other models are either multidialectal or exclusively focused on Modern Standard Arabic (MSA), which is significantly distinct from dialectal Arabic (DA). ...
... While the first BERT model was published in 2018, it was not until 2020 that the first MSA-specific models, such as AraBERT [10] and ArabicBERT [11], were introduced. The first dialect-specific models [5][6][7] were only published in 2021, three years after the release of the original BERT model. ...
... This indicated that the selected keywords were not specific enough to the Moroccan dialect. Based on this observation, the decision was made to exclude keywords shared between the Moroccan and Algerian dialects⁶. Furthermore, very short keywords⁷ and ambiguous words were eliminated to ensure the clarity and reliability of the dataset. ...
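The keyword pruning described in this context can be pictured with the short sketch below; the example keywords, the length threshold, and the ambiguity list are assumptions made for demonstration, not the authors' actual resources.

MIN_LEN = 4  # assumed cut-off for "very short" keywords

def filter_keywords(moroccan, algerian, ambiguous):
    # Keep Moroccan-dialect keywords that are not shared with the
    # Algerian dialect, not too short, and not flagged as ambiguous.
    shared = set(moroccan) & set(algerian)
    return [kw for kw in moroccan
            if kw not in shared and kw not in ambiguous and len(kw) >= MIN_LEN]

# Toy usage with transliterated placeholder keywords:
print(filter_keywords(["wakha", "bezzaf", "daba", "ana"],
                      ["bezzaf", "wesh"],
                      {"ana"}))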
Article
Full-text available
The established performance of existing transformer-based language models, delivering state-of-the-art results on numerous downstream tasks, is noteworthy. However, these models often face limitations, being either confined to high-resource languages or designed with a multilingual focus. The availability of models dedicated to Arabic dialects is scarce, and even those that do exist primarily cater to dialects written in Arabic script. This study presents the first BERT models for Moroccan Arabic dialect, also known as Darija, called DarijaBERT, DarijaBERT-arabizi, and DarijaBERT-mix. These models are trained on the largest Arabic monodialectal corpus, supporting both Arabic and Latin character representations of the Moroccan dialect. Their performance is thoroughly evaluated and compared to existing multidialectal and multilingual models across four distinct downstream tasks, showcasing state-of-the-art results. The data collection methodology and pre-training process are described, and the Moroccan Topic Classification Dataset (MTCD) is introduced as the first dataset for topic classification in the Moroccan Arabic dialect. The pre-trained models and MTCD dataset are available to the scientific community.
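As a usage illustration only, a BERT-style Darija model of this kind can be probed with the standard fill-mask pipeline; the model path below is a placeholder to be replaced by the checkpoint actually released by the authors, and the masked sentence is a made-up example rather than MTCD data.

from transformers import pipeline

# Placeholder identifier; substitute the released DarijaBERT checkpoint.
fill = pipeline("fill-mask", model="path/to/DarijaBERT")

# Score candidate completions for a masked Darija sentence.
masked = f"كنبغي {fill.tokenizer.mask_token} بزاف"
for pred in fill(masked):
    print(pred["token_str"], round(pred["score"], 3))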
... The Arabic language serves as an example of a low-resource language, as it lacks dedicated models, particularly for its various dialects. Currently, only three monodialectal BERT models are available for Arabic, namely SudaBERT [5], TunBERT [6], and DziriBERT [7]. The remaining models are either multidialectal or exclusively dedicated to Modern Standard Arabic (MSA), which differs substantially from dialectal Arabic (DA). ...
Preprint
Full-text available
The performance of existing transformer-based language models in providing state-of-the-art results on many downstream tasks is well established. However, these models tend to be limited to high-resource languages or are multilingual in nature. The availability of models dedicated to Arabic dialects is limited, and even those that exist primarily support dialects written in Arabic script. This study presents the first BERT models for Moroccan Arabic dialect, also known as Darija, called DarijaBERT, DarijaBERT-arabizi, and DarijaBERT-mix. These models are trained on the largest Arabic monodialectal corpus, supporting both Arabic and Latin character representations of the Moroccan dialect. The models' performance is evaluated and compared to existing multidialectal and multilingual models on four distinct downstream tasks, demonstrating state-of-the-art results. The data collection methodology and pre-training process are described, and the Moroccan Topic Classification Dataset (MTCD) is introduced as the first dataset for topic classification in the Moroccan Arabic dialect. The pre-trained models and MTCD dataset are available to the scientific community.
... In [28], we released the first pretrained BERT model for the Tunisian dialect, built on a Tunisian Common-Crawl-based dataset. We pretrained the model, then fine-tuned it on three downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI), and Reading Comprehension Question-Answering (QA). ...
... [Figure captions: Architecture overview of fine-tuning the pretrained TunBERT model on the Reading Comprehension Question-Answering downstream task; Fig. 6: TRCD dataset example [28].] ...
Article
Full-text available
Language models have proved to achieve high performance and state-of-the-art results in the Natural Language Processing field. More specifically, Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for such tasks. Most of the available language models have been trained on Indo-European languages, and these models are known to require huge training datasets. However, only a few studies have focused on under-represented languages and dialects. In this work, we describe the pretraining of a customized Google BERT TensorFlow implementation (named TunBERT-T) and of a PyTorch BERT language model based on the NVIDIA implementation (named TunBERT-P) for the Tunisian dialect. We describe the process of creating a training dataset by collecting a Common-Crawl-based corpus, then filtering and pre-processing the data. We describe the training setup and detail the fine-tuning of the TunBERT-T and TunBERT-P models on three NLP downstream tasks. We challenge the assumption that a lot of training data is needed and explore the effectiveness of training a monolingual Transformer-based language model for low-resource languages, taking the Tunisian dialect as a use case. Our models' results indicate that a proportionately small Common-Crawl-based dataset (500K sentences, 67.2 MB) leads to performance comparable to that obtained using costly larger datasets (from 24 GB to 128 GB of text). We demonstrate that, using newly created datasets, our proposed TunBERT-P model achieves comparable or higher performance on three downstream tasks: Sentiment Analysis, Language Identification, and Reading Comprehension Question-Answering. We release the two pretrained models along with all the datasets used for fine-tuning.
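The dataset-creation step (collecting a Common-Crawl-based corpus, then filtering and pre-processing it) can be pictured with the rough sketch below. The specific cleaning rules shown (URL stripping, length bounds, de-duplication) are illustrative assumptions, not the authors' exact pipeline.

import re

URL_RE = re.compile(r"https?://\S+")

def clean_sentences(raw_lines, min_tokens=3, max_tokens=128):
    # Very rough cleaning pass over crawled lines: strip URLs, drop lines
    # that are too short or too long, and remove exact duplicates.
    seen, kept = set(), []
    for line in raw_lines:
        line = URL_RE.sub(" ", line).strip()
        n_tokens = len(line.split())
        if min_tokens <= n_tokens <= max_tokens and line not in seen:
            seen.add(line)
            kept.append(line)
    return kept

# Toy usage (the first two lines collapse to one after cleaning):
print(clean_sentences(["برشا باهي هالفيلم http://t.co/x",
                       "برشا باهي هالفيلم",
                       "ok"]))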
Chapter
This article aims, on the one hand, to analyze school failure theoretically by identifying and describing its extent, specifically in the province of Ouezzane in northern Morocco; on the other hand, it aims to describe the effect of the hybrid education imposed by the COVID-19 health crisis on student results in the 2020/2021 school year, and to carry out a comparative, exploratory analysis of school failure rates for the previous school years, namely 2015/2016, 2016/2017, 2017/2018, and 2018/2019. To carry out this study, we performed an in-depth analysis of students' marks in the scientific subjects, in particular mathematical sciences, life and earth sciences, and physics, obtained at the regional examination for the third year of college. Finally, we suggest some recommendations regarding the technology plan that aim to reduce the rate of this failure in this province in particular and that can be generalized to the other parts of the kingdom. Keywords: School Failure, COVID-19, Mathematics, Science of Life and Earth, Physics Sciences
Chapter
Arabic nations share the same mother tongue, Arabic. However, the vernacular actually used differs from one country to another and may even vary from one region to another. In this paper, we aim to identify various dialects by performing text classification. We distinguish between the Moroccan, Algerian, Tunisian, Egyptian, and Lebanese Arabic dialects, denoted MDA, ADA, TDA, EDA, and LDA respectively. Aside from explaining the collection process of the system resources, we identify linguistic specificities that characterize each dialect. We build models from the preprocessed text using a combination of the pretrained model AraBERT and Generative Adversarial Networks (GANs). The work establishes the foundation of a dialect identification system by gathering freely available corpora and reaching the state of the art. Keywords: Dialect Identification, Arabic Language, AraBERT, GANs
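To make the classification setup concrete, the sketch below shows a plain fine-tuning of AraBERT for the five dialect labels mentioned above (MDA, ADA, TDA, EDA, LDA). It deliberately omits the GAN component of the chapter, uses a placeholder checkpoint path, and assumes the caller supplies already-tokenized, label-encoded datasets.

from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

LABELS = ["MDA", "ADA", "TDA", "EDA", "LDA"]  # the five target dialects

def finetune_dialect_classifier(train_ds, dev_ds,
                                ckpt="path/to/arabert-checkpoint"):
    # train_ds / dev_ds are assumed to be tokenized with the checkpoint's
    # tokenizer and to carry integer labels indexing into LABELS.
    model = AutoModelForSequenceClassification.from_pretrained(
        ckpt, num_labels=len(LABELS))
    args = TrainingArguments(output_dir="dialect_id", num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=dev_ds)
    trainer.train()
    return trainer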