
FarsiTag: A Part-of-Speech Tagging System for Persian


Abstract

FarsiTag is a tagging system capable of assigning the most probable part-of-speech (POS) tags to Persian words in a text. In this system, linguistic rules are used to select the best POS tag for every Persian word. The present study reports the process through which a robust tagging system, FarsiTag, was designed and implemented for Persian texts. A POS-tagged English-Persian parallel corpus containing about 5,000,000 words has also been developed as a by-product of the tagger. An experiment has been conducted to evaluate the performance of the system in tagging unrestricted Persian texts. The highest error rates trace back to the medical and religious genres, while the lowest occur in scientific texts. The total error rate across all domains is as low as 1.4%, for an overall system accuracy of 98.6%, which is very promising for a language like Persian.
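FarsiTag's actual rule set is not reproduced on this page, but the idea of rule-based tag selection can be sketched as follows; the rules, words, and candidate sets below are illustrative assumptions, not the system's own.

```python
# Minimal sketch of rule-based POS selection (illustrative only; these
# rules and examples are assumptions, not FarsiTag's actual rule set).

def select_tag(prev_tag, candidates):
    """Pick one tag from the candidate set using simple context rules."""
    # Rule 1: after a preposition, prefer a nominal reading.
    if prev_tag == "P" and "N_SING" in candidates:
        return "N_SING"
    # Rule 2: after a noun, prefer an adjectival reading.
    if prev_tag in ("N_SING", "N_PL") and "ADJ_SIM" in candidates:
        return "ADJ_SIM"
    # Fallback: first candidate (in practice, the most probable tag).
    return candidates[0]

# Each word carries the set of tags it can take; tagging walks left to right.
sentence = [("dar", ["P"]), ("ketab", ["N_SING", "ADJ_SIM"]), ("khub", ["ADJ_SIM"])]
prev = "START"
for word, cands in sentence:
    tag = select_tag(prev, cands)
    print(word, tag)
    prev = tag
```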
... However, if the tag-sets or corpora are different, the comparison is not so straightforward. For instance, Rezai and Mosavi Miangah (2016) compare their POS tagger FarsiTag to another Persian POS tagger, TagPer, developed by Seraji (2011). They conclude that FarsiTag surpasses TagPer because the respective accuracies of the taggers are 98.6% and 96.9%. ...
... Here we illustrate the approach from the above section by using the results reported in Rezai and Mosavi Miangah (2016), where a Persian POS tagging system, called FarsiTag, is described. A previously tagged corpus of Persian, namely the BijanKhan Corpus, was used (with some modifications) to tag every single word of the raw corpus, and then a genotype was created for each word based on the various tags it takes in different contexts. ...
... A total of 37 tags was used. Their list can be found in Rezai and Mosavi Miangah (2016). Here, we mention only those used in Table 1; they are among the most frequently assigned: N_PL (noun, plural), N_SING (noun, singular), ADJ_SIM (adjective, simple), P (preposition), ADV_NEGG (adverb, negative), and PS (pseudo-sentence). ...
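The genotype idea referenced in these excerpts (the set of tags a word takes across contexts) can be sketched roughly as follows; the corpus format and the example words are assumptions for illustration.

```python
from collections import defaultdict

# Sketch: build a "genotype" (the set of tags a word takes in different
# contexts) from a tagged corpus given as (word, tag) pairs.
tagged_corpus = [
    ("ketab", "N_SING"), ("ketabha", "N_PL"), ("khub", "ADJ_SIM"),
    ("dar", "P"), ("dar", "N_SING"),  # ambiguous word: "in" vs. "door"
]

genotypes = defaultdict(set)
for word, tag in tagged_corpus:
    genotypes[word].add(tag)

print(genotypes["dar"])  # e.g. {'P', 'N_SING'}: an ambiguity class
```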
Article
Formulae are proposed for measuring the efficiency of part-of-speech tagging systems. The efficiency is a modification of tagging accuracy, which enables the comparison of tagging systems in the situation when they are applied to different corpora and/or use different tag-sets. The approach is based on a mathematical model of tagging accuracy and genotypes.
... In a similar study in the open domain, 14,369 tokens in the training set and 5,000 tokens in the testing set were considered [43]. Rezai [44] offered a POS-tagged corpus with 5,000,000 tokens in the open domain for training and 11,766 tokens in the test set for the Persian language. However, with a manually annotated corpus, the corpus size may not be enough for modeling and efficient evaluation [16,45]. ...
Article
Full-text available
The manufacturing industry faces increasing complexity in the performance of assembly tasks due to escalating demand for complex products with a greater number of variations. Operators require robust assistance systems to enhance productivity, efficiency, and safety. However, existing support services often fall short when operators encounter unstructured open questions and incomplete sentences, since they rely primarily on procedural digital work instructions. This draws attention to the need for practical application of natural language processing (NLP) techniques. This study addresses these challenges by introducing a domain-specific dataset tailored to assembly tasks, capturing unique language patterns and linguistic characteristics. We explore strategies to process declarative and imperative sentences, including incomplete ones, effectively. A thorough evaluation of three pre-trained NLP libraries (NLTK, spaCy, and Stanford) is performed to assess their effectiveness in handling assembly-related concepts and their ability to address the domain's distinctive challenges. Our findings demonstrate the efficient performance of these open-source NLP libraries in accurately handling assembly-related concepts. By providing valuable insights, our research contributes to developing intelligent operator assistance systems, bridging the gap between NLP techniques and the assembly domain within the manufacturing industry.
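As a rough illustration of the kind of library comparison described above, the following sketch tags one assembly-style sentence with NLTK and spaCy; the sentence is invented, and the snippet assumes the relevant NLTK data and the en_core_web_sm spaCy model are installed.

```python
# Sketch: tagging the same sentence with NLTK and spaCy. Requires NLTK's
# punkt and perceptron-tagger data plus spaCy's en_core_web_sm model.
import nltk
import spacy

sentence = "Insert the bolt into the left bracket and tighten it."

# NLTK: Penn Treebank tags via the default perceptron tagger.
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))

# spaCy: coarse universal POS tags plus fine-grained tags.
nlp = spacy.load("en_core_web_sm")
print([(t.text, t.pos_, t.tag_) for t in nlp(sentence)])
```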
... In [19], 14,369 tokens in the training set and 5,000 tokens in the testing set are studied. Rezai [25] offered a POS-tagged corpus with 5,000,000 tokens for training and 11,766 tokens in the test set for the Persian language. ...
Chapter
Adaptive guidance systems in manufacturing that support operators during the assembly process need to serve the right information at the right time. A conversational recommender system, based on natural language processing and acting as the single point of contact between the operator and different sources of information, can be introduced to assist operators. Natural language processing techniques can help to mine answers in text-based knowledge repositories such as training documents, work instructions, and company procedures. Both the content and the style of writing in these documents differ from general language use, and we examine the accuracy of part-of-speech tagging within this closed domain of manufacturing. A benchmark dataset has been constructed from four different classes of documents typical of the manufacturing domain. The dataset contains 1,206 tokens divided over eight tag types. The accuracy of two open-source libraries, spaCy and NLTK, has been measured on this benchmark, with average accuracies of 93% and 87%, respectively. The conclusion drawn is that pre-trained natural language libraries can effectively handle the specific contexts of the assembly domain, given the measured accuracy.
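The per-token accuracy measured in this chapter can be computed along these lines; the token/tag pairs below are made-up placeholders, not the benchmark data.

```python
# Sketch: per-token tagging accuracy against a gold benchmark.
gold      = [("insert", "VERB"), ("the", "DET"), ("bolt", "NOUN")]
predicted = [("insert", "NOUN"), ("the", "DET"), ("bolt", "NOUN")]

correct = sum(g[1] == p[1] for g, p in zip(gold, predicted))
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.2%}")  # 66.67% on this toy example
```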
... In [27] an LSTM model was used to investigate the effectiveness of neural networks (NNs) in Arabic NLP. There are several studies for other languages, such as Persian [28], [29] and Russian [30]. ...
... In the structure of this network, the hidden-layer neurons were replaced with memory blocks to address the forgetting problem with long sequences; hence, blocks were used in the LSTM network [24]. For training the LSTM network, the steps given in [26] should be followed. ...
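The cited training steps are truncated in the excerpt; as a rough sketch of the architecture being described, a minimal LSTM tagger in PyTorch might look like this, with all sizes and the 1,000-word vocabulary as placeholder assumptions (the 37-tag output matches FarsiTag's tagset size mentioned above).

```python
import torch
import torch.nn as nn

# Minimal LSTM POS-tagger sketch; sizes and vocabulary are placeholder
# assumptions, not taken from the cited work.
class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The LSTM's memory blocks replace plain hidden neurons, letting the
        # model retain information over long sequences.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tagset_size)

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)                   # (batch, seq_len, tagset_size)

model = LSTMTagger(vocab_size=1000, tagset_size=37)
scores = model(torch.randint(0, 1000, (1, 5)))  # one 5-token sentence
print(scores.shape)  # torch.Size([1, 5, 37])
```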
... Secondly, the Brown term-clustering algorithm is used to cluster the large-scale untagged corpus to obtain cluster information. Subsequently, the Metaphone speech-matching algorithm [6] is used to generate pronunciation key values for words. Finally, experiments are conducted on the above-mentioned manually tagged corpus and an official news corpus. ...
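Generating Metaphone pronunciation keys can be sketched as below; the use of the third-party jellyfish library is an assumption, since the excerpt does not name an implementation.

```python
# Sketch of generating pronunciation keys with the Metaphone algorithm,
# here via the third-party jellyfish library (an assumption; the cited
# work does not name its implementation).
import jellyfish

for word in ("tagging", "taging", "phonetic"):
    print(word, jellyfish.metaphone(word))
# Sound-alike words tend to map to the same key, which is what makes the
# keys useful for matching misspellings by pronunciation.
```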
Article
Full-text available
Part-of-speech (POS) tagging for English is the basis for implementing automatic correction of English. Although researchers have done many useful studies on English POS tagging, most are aimed at users with English as a first language, while studies for users with English as a second language are few. For this purpose, manual tagging is performed based on a typical parameter-smoothing algorithm. On this basis, a performance evaluation method for English part-of-speech tagging is proposed, which integrates features of term clustering, non-tagged corpus statistics, word pronunciation, and so on. The experimental results show that the algorithm can improve the performance of POS tagging effectively, with the tagging accuracy improved from 94.49% to 97.07%.
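The abstract does not specify the smoothing algorithm beyond "parameter smoothing"; add-one (Laplace) smoothing of emission probabilities is a standard instance and can be sketched as follows, with toy counts.

```python
from collections import Counter

# Sketch: add-one (Laplace) smoothing of emission probabilities P(word|tag),
# a standard instance of the "parameter smoothing" mentioned above.
emissions = Counter({("NOUN", "bolt"): 3, ("NOUN", "bracket"): 2,
                     ("VERB", "insert"): 4})
vocab = {"bolt", "bracket", "insert", "tighten"}   # toy vocabulary
tag_totals = Counter()
for (tag, _), count in emissions.items():
    tag_totals[tag] += count

def p_word_given_tag(word, tag):
    # Add-one smoothing: unseen (tag, word) pairs get a small nonzero mass.
    return (emissions[(tag, word)] + 1) / (tag_totals[tag] + len(vocab))

print(p_word_given_tag("tighten", "NOUN"))  # unseen pair, but > 0
```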
Article
Over the past few decades, with the advancement of technology, the use of corpora in linguistic studies has increased dramatically. Linguistic corpora give researchers large collections of data on which to apply different methods of linguistic analysis. Most such work to date has been done in English, French, and Japanese; research in Farsi is limited, and the lack is especially tangible in specialized fields such as the medical sciences, mathematics, science, and tourism. Most term and vocabulary extraction in Farsi has so far been done non-automatically, with researchers reading texts and collecting data by hand. Term extractors built for other languages such as English, French, and Japanese, though quite successful there, cannot be applied to Farsi directly, because each extractor is defined around the features and properties of the language it was built for. Improving Farsi teaching materials makes this problem worth addressing, so we set out to adapt some of these extraction methods and devise one that works properly for Farsi. Since Iranian universities annually admit many international students who are non-native speakers of Farsi and who come to study fields such as medicine, engineering, and the humanities, preparing standard, modern teaching materials in Farsi is significantly important. The purpose of this study was to improve the resources used in teaching Farsi at the university level, especially to non-native speakers; to explore the feasibility of frequency-based methods for the automatic extraction of core medical terms; and to compare the capabilities of each method. The findings reveal the strengths and weaknesses of these methods in Farsi, explore the possibility of using each of them, and offer technical solutions for improving the results. Research methodology: The frequency-counting approaches used in this study drew on a general corpus and a specialized corpus created by the researcher. The general corpus was the Hamshahri Corpus; the specialized corpus comprised texts from the science books of grades 1-4 of senior high school and grades 1-3 of junior high school in Iran, science courses at the Imam Khomeini Farsi language center, and general-medicine texts from journals and the Internet. After corpus construction, preparation, and tokenization, two families of frequency methods were examined: classical and modern. The classical methods were frequency in the general corpus, frequency in the specialized corpus, and their improved variants; the modern methods were PMI and chi-square. Pearson correlation analysis and trend analysis were used to compare the methods.
Research findings: The results showed that classical methods in their basic form have little accuracy in identifying specialized vocabulary, but that some techniques can improve the selection process. The best performance came from the improved numerical method on the specialized corpus, which extracted 60% of the specialized vocabulary within the first 50 high-frequency words. This result improved when the scope was widened to the first 100, 150, and 200 extracted words, where the percentage of specialized vocabulary identified rose to about 75%. The results for the modern methods indicated that they too can be used in Farsi: the chi-square method extracted 32% and the PMI method 52% of the specialized vocabulary within the first 50 high-frequency words, and both percentages improved when the scope was widened to the first 200 words. Conclusion: Frequency-based methods are applicable to Farsi. With classical frequency methods, the improved variants are needed to raise the accuracy of the extracted words; and for reliable results with the modern frequency approaches, a large enough vocabulary scope must be chosen for the extracted vocabulary.
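As a rough sketch of the PMI-style scoring described here, the following compares a word's relative frequency in a specialized corpus against a general one; all counts are made-up placeholders, and the exact formula used in the study may differ.

```python
import math

# Sketch of a PMI-style association score between a word and the
# specialized domain; counts are invented placeholders.
spec_counts = {"cell": 40, "tissue": 25, "and": 300}   # specialized corpus
gen_counts  = {"cell": 10, "tissue": 3, "and": 5000}   # general corpus
spec_total = sum(spec_counts.values())
gen_total = sum(gen_counts.values())

def domain_pmi(word):
    # Log ratio of relative frequencies: how much more probable the word
    # is in the specialized corpus than in the general one.
    p_spec = spec_counts[word] / spec_total
    p_gen = gen_counts.get(word, 1) / gen_total   # pseudo-count for unseen
    return math.log2(p_spec / p_gen)

for w in spec_counts:
    print(w, round(domain_pmi(w), 2))  # high scores flag candidate terms
```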
Article
Full-text available
Commonly used evaluation measures, including Recall, Precision, F-Measure and Rand Accuracy, are biased and should not be used without a clear understanding of the biases and corresponding identification of chance or base-case levels of the statistic. Using these measures, a system that performs worse in the objective sense of Informedness can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance, notably Informedness, and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally, we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance, as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
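For the dichotomous case, Informedness and Markedness follow directly from the standard confusion counts, as in this sketch; the counts are invented.

```python
# Sketch: Informedness and Markedness for the dichotomous case, from the
# standard confusion counts (tp, fp, fn, tn); values here are made up.
tp, fp, fn, tn = 40, 10, 5, 45

recall = tp / (tp + fn)                 # true positive rate
inverse_recall = tn / (tn + fp)         # true negative rate
precision = tp / (tp + fp)
inverse_precision = tn / (tn + fn)

informedness = recall + inverse_recall - 1        # prediction informed vs. chance
markedness = precision + inverse_precision - 1    # prediction marked vs. chance
print(round(informedness, 3), round(markedness, 3))
```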
Article
Full-text available
In this paper we introduce a method for part-of-speech disambiguation of Persian texts, which uses word class probabilities in a relatively small training corpus in order to automatically tag unrestricted Persian texts. The experiment has been carried out at two levels, unigram and bigram genotype disambiguation. Comparing the results gained from the two levels, we show that using the immediate right context to which a given word belongs can increase the accuracy rate of the system to a high degree. Keywords: genotype, machine translation, part of speech disambiguation, word class probabilities. 1. Introduction. In linguistics, the term 'corpus' refers to a relatively large number of raw or annotated words in a body of text. Computational linguists have recently turned to corpus-based approaches for solving various linguistic problems such as phrase recognition (Cutting et al., 1992), word sense disambiguation (Mosavi Miangah & Delavar Khalafi, 2005), building dictionaries, morphological analysis and automatic lemmatization (Masayuki, 2003; Mosavi Miangah, 2006), language teaching (Conrad, 1999), machine translation (Tsutsumi et al., 1994), information retrieval (Braschler & Schauble, 2000), and other problems. Naturally, preparing a tagged or annotated corpus from different points of view is of great significance for anyone involved in computational linguistics. Such annotated corpora have already been constructed for many languages, including English, Czech, German, Hungarian, French, and Arabic, to name a few. Automatic part-of-speech (below POS) disambiguation of a large corpus has been studied with different approaches, some of which we review in what follows. To start with Persian, corpus-based approaches to text analysis have a rather short history in this language. The only serious attempt so far in this connection is an interactive POS tagging system developed by Assi and Abdolhosseini (2000). In their project they followed the methods proposed in Schuetze (1995), which rest on the hypothesis that syntactic behavior is reflected in co-occurrence patterns. The similarity between two words is therefore measured by the degree to which they share the same neighbors on their left. Word types are recognized according to this distributional similarity, and each category can then be tagged manually (Assi & Haji Abdolhosseini, 2000). In this way a grammatically tagged corpus of Persian was created, made up of 45 tags designed with reference to the categories normally introduced in dictionaries; each tag is made up of one to five characters. In general, the accuracy of this kind of distributional POS tagging system proved to be 57.5%. Brill (1992) presents a simple rule-based POS tagger which automatically acquires its rules and tags with accuracy comparable to stochastic taggers. Petasis et al. (1999) study the performance of Transformation-Based Error-Driven (TBED) learning for solving POS ambiguity in the Greek language, and examine its dependence on the thematic domain. For their work they trained the Brill tagger (Brill, 1995) over a relatively small annotated Greek corpus and found its performance to be around 95% (Petasis et al., 1999). Daelemans et al. (1996) introduce a memory-based approach to POS tagging. The POS tag of a word in a particular context is extrapolated from the most similar cases held in memory. Using this method, they obtain a tagging accuracy on a par with that of known statistical approaches, with attractive space and time complexity properties when using IGTree, a tree-based formalism for indexing and searching huge case bases (Daelemans et al., 1996). Kempe (2000) presents a method of constructing and applying a cascade consisting of a left-sequential and a right-sequential finite-state transducer, T1 and T2, for POS disambiguation. In the process of POS tagging, every word is first assigned a unique ambiguity class that represents the set of alternative tags that this word can occur with. The sequence of the ambiguity classes of all words of one
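The bigram-level disambiguation described above can be sketched as picking, for an ambiguous word, the candidate tag most probable after the previous tag; the probabilities below are invented placeholders.

```python
# Sketch: choosing a tag for an ambiguous word from bigram tag statistics
# (immediate left context). Probabilities are made-up placeholders.
bigram_prob = {
    ("P", "N_SING"): 0.60, ("P", "ADJ_SIM"): 0.05,
    ("N_SING", "ADJ_SIM"): 0.40, ("N_SING", "N_SING"): 0.20,
}

def disambiguate(prev_tag, candidates):
    # Pick the candidate tag most probable after the previous tag.
    return max(candidates, key=lambda t: bigram_prob.get((prev_tag, t), 0.0))

print(disambiguate("P", ["N_SING", "ADJ_SIM"]))  # -> N_SING
```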
Article
Full-text available
In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, is commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to constitute an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is of course open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is to develop software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string in it and corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
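The parallel concordancing functionality described here can be sketched over sentence-aligned pairs; the pairs and transliterations below are invented examples.

```python
# Sketch of parallel concordancing: search one side of sentence-aligned
# pairs and show the aligned target sentence. The pairs are placeholders.
aligned = [
    ("The book is on the table.", "ketab ruye miz ast."),
    ("I read the book.", "man ketab ra khandam."),
]

def concordance(query, pairs):
    """Yield (source, target) pairs whose source side contains the query."""
    for en, fa in pairs:
        if query.lower() in en.lower():
            yield en, fa

for en, fa in concordance("book", aligned):
    print(en, "|", fa)
```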
Article
Full-text available
In this paper we present a rather novel unsupervised method for part-of-speech (below POS) disambiguation which has been applied to Persian. This method, known as the Iterative Improved Feedback (IIF) Model, is a heuristic one and uses as input only a raw corpus of Persian together with all possible tags for every word in that corpus. During the process of tagging, the algorithm passes through several iterations corresponding to n-gram levels of analysis to disambiguate each word based on a previously defined threshold. The total accuracy of the program applied to Persian texts has been calculated as 93 percent, which seems very encouraging for POS tagging in this language.
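A minimal sketch of the iterative, threshold-based decision loop described in this abstract might look like the following; the scores, the threshold, and the elided score-update step are all assumptions, not the IIF model's actual procedure.

```python
# Sketch in the spirit of the IIF model described above: fix a word's tag
# once one candidate's share of the evidence passes a threshold. Scores
# and threshold are placeholder assumptions.
THRESHOLD = 0.8

def iterate(candidate_scores, max_iters=3):
    decided = {}
    for _ in range(max_iters):
        for word, scores in candidate_scores.items():
            if word in decided:
                continue
            total = sum(scores.values())
            best_tag, best = max(scores.items(), key=lambda kv: kv[1])
            if total and best / total >= THRESHOLD:
                decided[word] = best_tag   # fixed; feeds later iterations
        # In the real model, each decision updates neighbouring words'
        # scores at the next n-gram level; that update is elided here.
    return decided

print(iterate({"dar": {"P": 9, "N_SING": 1}}))  # -> {'dar': 'P'}
```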
Article
Full-text available
Persian, with about 100,000,000 speakers in the world, belongs to the group of languages with less-developed linguistically annotated resources and tools. The few existing resources and tools are neither open source nor freely available. Thus, our goal is to develop open-source resources such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian. We do this by exploring the reusability of existing resources and adapting state-of-the-art methods for linguistic annotation. We present fully functional tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and parsing. As for resources, we describe the Uppsala PErsian Corpus (UPEC), a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization modified for more appropriate syntactic annotation. The corpus consists of 2,782,109 tokens and is annotated with parts of speech and morphological features. A treebank is derived from UPEC with an annotation scheme based on Stanford Typed Dependencies and is planned to consist of 10,000 sentences, of which 215 have already been annotated.
Article
Full-text available
This paper presents the statistical part-of-speech tagger HunPoS trained on a Persian corpus. The result of the experiments shows that HunPoS provides an overall accuracy of 96.9%, which is the best result reported for Persian part-of-speech tagging.
Article
Full-text available
In the world of non-proprietary NLP software the standard, and perhaps the best, HMM-based POS tagger is TnT (Brants, 2000). We argue here that some of the criticism aimed at HMM performance on languages with rich morphology should more properly be directed at TnT's peculiar license, free but not open source, since it is those details of the implementation which are hidden from the user that hold the key for improved POS tagging across a wider variety of languages. We present HunPos, a free and open-source (LGPL-licensed) alternative, which can be tuned by the user to fully utilize the potential of HMM architectures, offering performance comparable to more complex models but preserving the ease and speed of the training and tagging process.
Chapter
In this paper we describe an unsupervised learning algorithm for automatically training a rule-based part-of-speech tagger without using a manually tagged corpus. We compare this algorithm to the Baum-Welch algorithm, used for unsupervised training of stochastic taggers. Next, we show a method for combining unsupervised and supervised rule-based training algorithms to create a highly accurate tagger using only a small amount of manually tagged text.
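A single transformation rule of the kind such rule-based taggers learn ("change tag X to Y when the previous tag is Z") can be applied as in this sketch; the rule and the tagged sentence are illustrative placeholders.

```python
# Sketch: applying one transformation rule of the kind described above.
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous token has prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("the", "DET"), ("can", "MD"), ("rusted", "VBD")]
print(apply_rule(sentence, "MD", "NN", "DET"))
# -> [('the', 'DET'), ('can', 'NN'), ('rusted', 'VBD')]
```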
Article
The purpose of this article is to briefly introduce an interactive POS tagging system developed as a project at the Institute for Humanities and Cultural Studies in Tehran, Iran. The system is designed as part of the annotation procedure for a Persian corpus called The Farsi Linguistic Database (FLDB) (a project at the Institute for Humanities and Cultural Studies in Tehran which comprises a selection of contemporary Modern Persian literature, formal and informal spoken varieties of the language, and a series of dictionary entries and word lists [Assi 1997: 5]) and is the first attempt ever to tag a Persian corpus. In Section 1, the project itself will be introduced; Section 2 presents an evaluation of the project, and Section 3 is allocated to some suggestions for future work.