Figure 1: Initial FST accepting only the two words xaXi and awra.
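For readers who want to see the shape of such an acceptor in code, below is a minimal sketch, assuming a plain trie-style transition table rather than the paper's actual OpenFst construction; the two accepted words are the same WX-transliterated forms as in the figure.

```python
# Minimal finite-state acceptor for exactly two WX-transliterated words,
# "xaXi" and "awra". Illustrative sketch only, not the paper's OpenFst code.

def build_acceptor(words):
    """Build transitions as {state: {symbol: next_state}} plus final states."""
    transitions = {0: {}}
    finals = set()
    next_state = 1
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions[state][ch] = next_state
                transitions[next_state] = {}
                next_state += 1
            state = transitions[state][ch]
        finals.add(state)
    return transitions, finals

def accepts(transitions, finals, word):
    """Run the acceptor over `word`; True iff it ends in a final state."""
    state = 0
    for ch in word:
        state = transitions[state].get(ch)
        if state is None:
            return False
    return state in finals

fst, finals = build_acceptor(["xaXi", "awra"])
assert accepts(fst, finals, "xaXi") and accepts(fst, finals, "awra")
assert not accepts(fst, finals, "xawra")
```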

Source publication
Conference Paper
Full-text available
In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string and the output is a set of possible splits with weights associated with each of them. We followed two different appro...
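As a concrete picture of the input/output contract described in the abstract, here is a hedged sketch of the segmenter interface; the candidate table, splits and weights are invented placeholders (the real system derives them from an FST and a manually split corpus), and the sandhi example xaXi + awra -> xaXyawra follows the figure above.

```python
from typing import Dict, List, Tuple

def segment(text: str) -> List[Tuple[List[str], float]]:
    """Return possible splits of a WX-encoded Sanskrit string with weights."""
    # Stand-in lookup table; a real implementation traverses a sandhi-aware
    # FST and scores the splits from corpus statistics.
    candidates: Dict[str, List[Tuple[List[str], float]]] = {
        # xaXi + awra -> xaXyawra (final i + initial a -> ya by sandhi)
        "xaXyawra": [(["xaXi", "awra"], 0.92), (["xa", "Xyawra"], 0.01)],
    }
    return sorted(candidates.get(text, [([text], 0.0)]),
                  key=lambda split_weight: split_weight[1], reverse=True)

print(segment("xaXyawra"))  # highest-weight split first
```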

Citations

... Previous attempts at SWS mainly focused on rule-based Finite State Transducer systems [Huet, 2003; Mittal, 2010]. One approach produced all possible solutions and recommended a solution based on a probabilistic score inferred from a dataset of 25,000 data splits [Mittal, 2010]. Another approach attempted to solve the SWS task for sentences with one or two splits using a Bayesian approach and the same dataset [Natarajan and Charniak, 2011]. ...
... The Sanskrit Compound Type Identification task has garnered considerable attention from researchers in the last decade. In order to decode the meaning of a Sanskrit compound, it is essential to figure out its constituents [Huet, 2010; Mittal, 2010; Hellwig and Nehrdich, 2018a], how the constituents are grouped [Kulkarni and Kumar, 2011b], identify the semantic relation between them [Kumar, 2012] and finally generate the paraphrase of the compound [Kumar et al., 2009]. Satuluri and Kulkarni [2013] and Kulkarni and Kumar [2013b] proposed a rule-based approach where around 400 rules mentioned in Pāṇini ...
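The four steps named in this excerpt can be pictured as a small pipeline. The sketch below is purely illustrative: the hyphen-based splitter, the bracketing, the "Tatpurusha" label and the gloss are all hypothetical stubs standing in for the rule-based or statistical components that the cited works actually use.

```python
def analyze_compound(compound: str) -> dict:
    """Illustrative four-step compound analysis; every step is a stub."""
    constituents = compound.split("-")             # stand-in for sandhi-aware splitting
    grouping = (constituents[0], constituents[1])  # stand-in binary bracketing
    relation = "Tatpurusha"                        # stand-in semantic relation label
    gloss = f"{constituents[1]} of {constituents[0]}"  # stand-in paraphrase
    return {"constituents": constituents, "grouping": grouping,
            "relation": relation, "paraphrase": gloss}

# Invented WX-notation example: rAja-puruRa ("king's man").
print(analyze_compound("rAja-puruRa"))
```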
Preprint
Full-text available
The primary focus of this thesis is to make Sanskrit manuscripts more accessible to the end-users through natural language technologies. The morphological richness, compounding, free word orderliness, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks, which are crucial for developing a robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any other downstream applications. However, it is challenging due to the sandhi phenomenon that modifies characters at word boundaries. Similarly, the existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and web-based toolkit.
... This was further enhanced by introducing a graphical interface which presents all the possible lexically and morphologically valid segments (Goyal and Huet, 2016). Mittal (2010) used OpenFST and augmented it with sandhi rules, finally validating the segments using optimality theory. Kumar et al. (2010) developed a segmenter exclusively for Sanskrit compounds using probabilistic methods and optimality theory. ...
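A toy rendering of the idea of augmenting a word acceptor with sandhi rules: reverse each rule at every matching position to propose a split, then validate the parts against a lexicon. The two-entry rule table and lexicon below are invented samples, not the OpenFst grammar of Mittal (2010).

```python
# A sandhi rule here maps a surface string to the (final, initial) pair
# it may have arisen from, e.g. "ya" <- i + a.
SANDHI_RULES = {"ya": ("i", "a")}
LEXICON = {"xaXi", "awra"}

def splits(text):
    """Undo each sandhi rule at every occurrence; keep lexicon-valid parts."""
    out = []
    for surface, (left_end, right_start) in SANDHI_RULES.items():
        start = text.find(surface)
        while start != -1:
            left = text[:start] + left_end
            right = right_start + text[start + len(surface):]
            if left in LEXICON and right in LEXICON:
                out.append((left, right))
            start = text.find(surface, start + 1)
    return out

print(splits("xaXyawra"))  # [('xaXi', 'awra')]
```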
Preprint
Full-text available
Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models despite working with relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for the low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts, but it also comes with limitations at different levels of a sentence, such as chunk, segment, stem and morphological analysis. One way to overcome these limitations is to look at alternatives such as the Sanskrit Heritage Segmenter (SH) and Samsaadhanii tools, which provide information complementing DCS's data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information from both DCS and SH. Furthermore, this work also discusses the impact of such datasets on the performances of existing segmenters, specifically the Sanskrit Heritage Segmenter.
... Earlier approaches to SWS focused on rule-based Finite State Transducer systems (Huet, 2003; Mittal, 2010). Natarajan and Charniak (2011) attempted to solve the SWS task for sentences with one or two splits using a Bayesian approach. ...
Preprint
Full-text available
Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks. It is, however, non-trivial because of the sandhi phenomenon that modifies the characters at the word boundaries, and needs special treatment. Existing lexicon driven approaches for SWS make use of Sanskrit Heritage Reader, a lexicon-driven shallow parser, to generate the complete candidate solution space, over which various methods are applied to produce the most valid solution. However, these approaches fail while encountering out-of-vocabulary tokens. On the other hand, purely engineering methods for SWS have made use of recent advances in deep learning, but cannot make use of the latent word information on availability. To mitigate the shortcomings of both families of approaches, we propose Transformer based Linguistically Informed Sanskrit Tokenizer (TransLIST) consisting of (1) a module that encodes the character input along with latent-word information, which takes into account the sandhi phenomenon specific to SWS and is apt to work with partial or no candidate solutions, (2) a novel soft-masked attention to prioritize potential candidate words and (3) a novel path ranking algorithm to rectify the corrupted predictions. Experiments on the benchmark datasets for SWS show that TransLIST outperforms the current state-of-the-art system by an average 7.2 points absolute gain in terms of perfect match (PM) metric. The codebase and datasets are publicly available at https://github.com/rsingha108/TransLIST
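The "soft-masked attention" idea can be illustrated generically as biasing attention logits with a per-position prior instead of hard-masking them. The numpy toy below assumes invented priors and random tensors and is not the TransLIST implementation.

```python
import numpy as np

def soft_masked_attention(q, k, v, prior):
    """Toy scaled dot-product attention with an additive soft mask.
    `prior` holds a score per key position (e.g. confidence that the
    position belongs to a candidate word); it biases, rather than
    hard-masks, the attention logits."""
    logits = q @ k.T / np.sqrt(k.shape[-1])
    logits = logits + np.log(prior + 1e-9)   # soft mask as additive bias
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

q = np.random.rand(2, 8); k = np.random.rand(5, 8); v = np.random.rand(5, 8)
prior = np.array([0.9, 0.9, 0.1, 0.5, 0.9])  # invented candidate scores
print(soft_masked_attention(q, k, v, prior).shape)  # (2, 8)
```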
... It also provides inflectional analysis, prunes the answer, uses local morph analysis to handle unrecognised words and produces a derivational analysis of the derived roots [56][57][58]. • Parsing: A parser is used as a compiler or interpreter component that breaks data into smaller units for easy translation from one language to another. Parsers take a sequence of words or tokens as input. ...
Article
Full-text available
Languages help to unite the world socially, culturally and technologically. Since people communicate in different native languages, there is a tremendous need for inter-language translation to transfer and share information and ideas. Though Sanskrit is an ancient Indo-European language, a significant amount of processing work is still required to explore its full potential and open vistas in computational linguistics and computer science. In this paper, we propose and present a machine translation system for translating Sanskrit to Hindi. The developed technique uses linguistic features from a rule-based feed to train a neural machine translation system. The work is novel and applicable to any low-resource language with rich morphology. It is a generic system covering various domains with minimal human intervention. The performance of the work is analysed on automatic and linguistic measures. The results show that the proposed approach outperforms earlier work for this language pair.
... It also provides inflectional analysis, prunes the answer, uses local morph analysis to handle unrecognized words and produces a derivational analysis of the derived roots (Bharati et al. 2006; Mittal 2010; Jha et al. 2009). - Parsing: A parser is used as a compiler or interpreter component that breaks data into smaller units for easy translation from one language to another. ...
Article
Full-text available
A machine translation system (MTS) consists of functionally heterogeneous modules for processing a source language into a given target language. Deploying such an application on a stand-alone system demands much time and knowledge and introduces complications; it becomes even more challenging for a common user to operate such a complex application. This paper presents an MTS developed using a combination of a linguistically rich, rule-based approach and a prominent neural approach. The proposed MTS is deployed on the cloud to offer translation as a cloud service and to improve the quality of service (QoS) over a stand-alone system. It is developed on TensorFlow and deployed on a cluster of virtual machines in Amazon's web service (EC2). The significance of this paper is to demonstrate the management of recurrent changes in terms of corpus, domain, algorithms and rules. Further, the paper also compares the MTS as deployed on a stand-alone machine and on the cloud for different QoS parameters like response time, server load, CPU utilization and throughput. The experimental results assert that in the translation task, with the availability of elastic computing resources in the cloud environment, the job completion time, irrespective of job size, can be assured to be within a fixed time limit with high accuracy.
... [13] proposed a statistical method based on Dirichlet process. Finite state methods have also been used [12]. A graph query method has been proposed by [10]. ...
Preprint
Full-text available
This paper describes neural network based approaches to the processes of forming and splitting word compounds, known respectively as sandhi and vichchhed, in the Sanskrit language. Sandhi is an important idea essential to the morphological analysis of Sanskrit texts. Sandhi leads to word transformations at word boundaries. The rules of sandhi formation are well defined but complex, sometimes optional and, in some cases, dependent on knowledge about the nature of the words being compounded. Sandhi split, or vichchhed, is an even more difficult task given its non-uniqueness and context dependence. In this work, we formulate the problem as a sequence-to-sequence prediction task using modern deep learning techniques. Ours is the first fully data-driven technique, and we demonstrate that our model achieves better accuracy than existing methods on multiple standard datasets, despite not using any additional lexical or morphological resources. The code is being made available at https://github.com/IITD-DataScience/Sandhi_Prakarana
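A minimal sketch of the sequence-to-sequence formulation the abstract describes, assuming character-level tokens; the sandhi pair used is an invented WX-notation sample, and this is not the authors' data pipeline.

```python
from typing import List, Tuple

def make_example(surface: str, split_words: List[str]) -> Tuple[List[str], List[str]]:
    """Build one (source, target) pair for character-level seq2seq training."""
    src = list(surface)                # character tokens of the joined form
    tgt = list(" ".join(split_words))  # character tokens of the split form
    return src, tgt

src, tgt = make_example("xaXyawra", ["xaXi", "awra"])
print(src)  # ['x', 'a', 'X', 'y', 'a', 'w', 'r', 'a']
print(tgt)  # ['x', 'a', 'X', 'i', ' ', 'a', 'w', 'r', 'a']
```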
... This kind of non-determinism is also found in languages like Chinese and Japanese, where word boundaries are not indicated, and also in agglutinative languages like Turkish (Mittal, 2010). In some of these languages, like Thai (Haruechaiyasak et al., 2008), most sentences are a mere concatenation of words. ...
... The current paper focuses on updating this external sandhi segmenter. Mittal (2010) used Optimality Theory to derive a probabilistic method and developed two methods to segment the input text: (1) augmenting a finite state transducer built using OpenFst (Allauzen et al., 2007) with sandhi rules, where the FST is used for morphological analysis and is traversed for segmentation, and (2) using optimality theory to validate all the possible segmentations. Kumar et al. (2010) developed a compound processor where the segmentation of compound words was done, and used optimality theory with a different probabilistic method (discussed in section 5). ...
... Many probabilistic measures have been proposed in the past to prioritize the solutions. Mittal (2010) calculated the weight for a specific split s_j as ...
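The excerpt truncates the formula itself. As a generic illustration of this family of frequency-based measures (explicitly not Mittal's exact expression), a split could be weighted by the relative frequencies of its constituent words and of the sandhi rules joining them:

```python
# Invented corpus counts; NOT the truncated formula of Mittal (2010).
WORD_FREQ = {"xaXi": 40, "awra": 55}
RULE_FREQ = {("i", "a"): 120}          # rule: final i + initial a
TOTAL_WORDS, TOTAL_RULES = 1000, 500

def weight(words, rules):
    """Product of relative frequencies of words and joining sandhi rules."""
    w = 1.0
    for word in words:
        w *= WORD_FREQ.get(word, 1) / TOTAL_WORDS
    for rule in rules:
        w *= RULE_FREQ.get(rule, 1) / TOTAL_RULES
    return w

print(weight(["xaXi", "awra"], [("i", "a")]))  # 0.000528
```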
Preprint
Computationally analyzing Sanskrit texts requires proper segmentation in the initial stages, and various tools have been developed for Sanskrit text segmentation. Of these, Gérard Huet's Reader in the Sanskrit Heritage Engine analyzes the input text and segments it based on word parameters (phases like iic, ifc, Pr, Subst, etc.) and the sandhi (or transition) that takes place at the end of a word with the initial part of the next word. It enlists all the possible solutions, differentiating them with the help of the phases. The phases and their analyses have their use in the domain of sentential parsers; in segmentation, though, they are not used beyond deciding whether the words formed with the phases are morphologically valid. This paper modifies the above segmenter by ignoring the phase details (except for a few cases) and also proposes a probability function to prioritize the list of solutions so as to bring the most valid solutions to the top.
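A sketch of what "ignoring the phase details" might look like operationally: collapse solutions whose word sequences coincide once phase labels are dropped. The solution tuples and the "Adv" phase label below are invented; the real phases come from the Heritage Engine.

```python
# Invented solution tuples; the real phase labels (iic, ifc, Pr, Subst, ...)
# come from the Sanskrit Heritage Engine, and "Adv" here is hypothetical.
solutions = [
    [("xaXi", "Subst"), ("awra", "Adv")],
    [("xaXi", "iic"), ("awra", "Adv")],   # same words, different phases
]

def drop_phases(solutions):
    """Keep one representative per phase-free word sequence."""
    seen, merged = set(), []
    for sol in solutions:
        words = tuple(word for word, _phase in sol)
        if words not in seen:
            seen.add(words)
            merged.append(list(words))
    return merged

print(drop_phases(solutions))  # [['xaXi', 'awra']]
```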
... One peculiarity of Sanskrit processing is the non-trivial word segmentation [5]. For a long time, oral transmission played a dominant role in preserving and spreading Sanskrit stories; if they were eventually written down, the writing system closely followed pronunciation. ...
Article
Full-text available
We present the first freely available dependency treebank of Sanskrit. It is based on text from Panchatantra, an ancient Indian collection of fables. The annotation scheme we chose is that of Universal Dependencies, a current de-facto standard for cross-linguistically comparable morphological and syntactic annotation. In the present paper, we discuss word segmentation issues, morphological inventory and certain interesting syntactic constructions in the light of the Universal Dependencies guidelines. We also present an initial parsing experiment.