Figure - uploaded by Mehren Alam
List of 40 most frequently and 40 least frequently occurring Urdu words and Roman-Urdu words in Roman-Urdu-Parl.


Source publication
Article
Availability of corpora is a basic requirement for conducting research in a particular language. Unfortunately, for a morphologically rich language like Urdu, despite its use by over 100 million people around the globe, the dearth of corpora is a major reason for the lack of attention and advancement in research. To this end, we prese...

Context in source publication

Context 1
... has a total of 6.37 million (6, Table 4. We list the 40 most and 40 least frequently occurring Urdu and Roman-Urdu words in Roman-Urdu-Parl in Table 3. In addition to statistical analysis, we have also performed an empirical analysis of our dataset that confirms its validity and shows its syntactic, semantic, and contextual fitness for solving research problems. ...
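A frequency listing of the kind described in this excerpt can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function name `frequency_extremes`, the whitespace tokenization, the alphabetical tie-breaking, and the toy sentences are all assumptions made for the example.

```python
from collections import Counter


def frequency_extremes(sentences, k=40):
    """Return the k most and k least frequent tokens with their counts.

    Assumption: tokens are separated by whitespace. A real corpus would
    likely need script-aware tokenization and normalization first.
    """
    counts = Counter(token for s in sentences for token in s.split())
    # Most frequent first; ties are broken alphabetically for determinism.
    most = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]
    # Least frequent first; ties are broken alphabetically as well.
    least = sorted(counts.items(), key=lambda item: (item[1], item[0]))[:k]
    return most, least


# Toy Roman-Urdu sentences, invented for illustration only.
toy_sentences = [
    "main ghar ja raha hoon",
    "main school ja raha hoon",
    "woh ghar par hai",
]
most, least = frequency_extremes(toy_sentences, k=2)
print(most)
print(least)
```

Applied to each side of a parallel corpus separately, the same function would produce one list for the Urdu text and one for the Roman-Urdu text.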

Similar publications

Article
Word embedding is a key procedure in natural language processing for semantically and syntactically manipulating an unlabeled text corpus. The process represents the extracted features of the corpus in a vector space, which enables NLP tasks such as summary generation, text simplification, next sentence prediction, etc....

Citations

... Table 1 summarizes the Roman Urdu datasets generated along with their name, size, and language. A novel study [14] conducted research on generating a parallel corpus for Urdu and RU. They presented a large-scale RU parallel corpus named Roman-Urdu-Parl that contained 6.37 million Urdu and RU text pairs. ...
Article
Social media has transformed into a crucial channel for political expression. Twitter, especially, is a vital platform used to exchange political hate in Pakistan. Political hate speech affects the public image of politicians, targets their supporters, and hurts public sentiments. Hate speech is a controversial public speech that promotes violence toward a person or group based on specific characteristics. Although studies have been conducted to identify hate speech in European languages, Roman languages have yet to receive much attention. In this research work, we present the automatic detection of political hate speech in Roman Urdu. An exclusive political hate speech labeled dataset (RU-PHS) containing 5002 instances and city-level information has been developed. To overcome the vast lexical structure of Roman Urdu, we propose an algorithm for the lexical unification of Roman Urdu. Three vectorization techniques are developed: TF-IDF, word2vec, and fastText. A comparative analysis of the accuracy and time complexity of conventional machine learning models and fine-tuned neural networks using dense word representations is presented for classifying and predicting political hate speech. The results show that a random forest and the proposed feed-forward neural network achieve an accuracy of 93% using fastText word embedding to distinguish between neutral and politically offensive speech. The statistical information helps identify trends and patterns, and the hotspot and cluster analysis assist in pinpointing Punjab as a highly susceptible area in Pakistan in terms of political hate tweet generation.
Article
A parallel corpus is a key component of statistical and neural machine translation (NMT). While most research focuses on machine translation, corpus creation studies are limited for many languages, and no research paper on a Mizo-English corpus exists yet. A high-quality parallel corpus is required for natural language processing (NLP) activities including machine translation, chatbots, transliteration, and cross-language information retrieval. This work aims to investigate parallel corpus creation techniques and apply them to the Mizo-English language pair. Another goal is to test machine translation on the newly constructed corpus. We contributed to the LF Aligner tool to support the Mizo language for sentence alignment in corpus development. Our effort created the first large-scale Mizo-English parallel corpus with over 529K sentences. The pre-processed corpus was used for Mizo-to-English NMT. It was evaluated using BLEU, ChrF, and TER scores. Our system achieved BLEU 45.08, ChrF 65.36, and TER 41.16, setting a new benchmark for Mizo-to-English translation.
Chapter
Author profiling is part of information retrieval in which different perspectives of the author are observed by considering various characteristics like native language, gender, and age. Different techniques are used to extract the required information using text analysis, like author identification on social media and for Short Text Message Service. Author profiling helps in security and blogs for identification purposes while capturing authors’ writing behaviors through messages, posts, comments, blogs, and chat logs. Most of the work in this area has been done in English and other native languages. Roman Urdu is also getting attention for the author profiling task, but existing approaches need to convert Roman Urdu to English to extract important features such as Named Entity Recognition (NER) and other linguistic features. The conversion may lose important information because of the limitations of converting one language to another. This research explores machine learning techniques that can be used for all languages to overcome the conversion limitation. The Vector Space Model (VSM) and Query Likelihood (Q.L.) are used to identify the author’s age and gender. Experimental results revealed that Q.L. produces better results in terms of accuracy.
Keywords: Vector space model; Query likelihood model; Information retrieval (I.R.); Text mining; Author profiling
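The query likelihood idea named in the chapter abstract above can be illustrated with a small sketch: each author class gets a unigram language model, and a new text is assigned to the class whose model gives it the highest probability. The abstract does not say how the models were smoothed or tokenized, so the add-one (Laplace) smoothing, the whitespace tokenization, the function names, the class labels, and the toy training texts below are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_class_models(labeled_docs):
    """Build one unigram count table per class label."""
    models = {}
    for label, text in labeled_docs:
        models.setdefault(label, Counter()).update(text.split())
    return models


def query_log_likelihood(query, counts, vocab_size):
    """Log P(query | class model) with add-one smoothing (an assumption)."""
    total = sum(counts.values())
    return sum(
        math.log((counts[token] + 1) / (total + vocab_size))
        for token in query.split()
    )


def predict(query, models):
    """Return the label whose model scores the query highest."""
    vocab = set().union(*models.values())
    return max(
        models,
        key=lambda label: query_log_likelihood(query, models[label], len(vocab)),
    )


# Hypothetical labels and toy texts, invented for illustration only.
training = [
    ("group_a", "cricket match score cricket"),
    ("group_b", "drama episode recipe drama"),
]
models = train_class_models(training)
print(predict("cricket score", models))
```

Because the models work directly on surface tokens, the same sketch applies to Roman Urdu text without first converting it to English, which is the limitation the abstract describes.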