Figure 1: Initial FST accepting only the two words xaXi and awra.
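For readers who want to see the shape of such an acceptor in code, below is a minimal sketch, assuming a plain trie-style transition table rather than the paper's actual OpenFst construction; the two accepted words are the same WX-transliterated forms as in the figure.

```python
# Minimal finite-state acceptor for exactly two WX-transliterated words,
# "xaXi" and "awra". Illustrative sketch only, not the paper's OpenFst code.

def build_acceptor(words):
    """Build transitions as {state: {symbol: next_state}} plus final states."""
    transitions = {0: {}}
    finals = set()
    next_state = 1
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions[state][ch] = next_state
                transitions[next_state] = {}
                next_state += 1
            state = transitions[state][ch]
        finals.add(state)
    return transitions, finals

def accepts(transitions, finals, word):
    """Run the acceptor over `word`; True iff it ends in a final state."""
    state = 0
    for ch in word:
        state = transitions[state].get(ch)
        if state is None:
            return False
    return state in finals

fst, finals = build_acceptor(["xaXi", "awra"])
assert accepts(fst, finals, "xaXi") and accepts(fst, finals, "awra")
assert not accepts(fst, finals, "xawra")
```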

Source publication
Conference Paper
Full-text available
In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string and the output is a set of possible splits with weights associated with each of them. We followed two different appro...
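As a concrete picture of the input/output contract described in the abstract, here is a hedged sketch of the segmenter interface; the candidate table, splits and weights are invented placeholders (the real system derives them from an FST and a manually split corpus), and the sandhi example xaXi + awra -> xaXyawra follows the figure above.

```python
from typing import Dict, List, Tuple

def segment(text: str) -> List[Tuple[List[str], float]]:
    """Return possible splits of a WX-encoded Sanskrit string with weights."""
    # Stand-in lookup table; a real implementation traverses a sandhi-aware
    # FST and scores the splits from corpus statistics.
    candidates: Dict[str, List[Tuple[List[str], float]]] = {
        # xaXi + awra -> xaXyawra (final i + initial a -> ya by sandhi)
        "xaXyawra": [(["xaXi", "awra"], 0.92), (["xa", "Xyawra"], 0.01)],
    }
    return sorted(candidates.get(text, [([text], 0.0)]),
                  key=lambda split_weight: split_weight[1], reverse=True)

print(segment("xaXyawra"))  # highest-weight split first
```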

Citations

... Previous attempts at SWS mainly focused on rule-based Finite State Transducer systems [Huet, 2003; Mittal, 2010]. One approach produced all possible solutions and recommended a solution based on a probabilistic score inferred from a dataset of 25,000 data splits [Mittal, 2010]. Another approach attempted to solve the SWS task for sentences with one or two splits using a Bayesian approach and the same dataset [Natarajan and Charniak, 2011]. ...
... The Sanskrit Compound Type Identification task has garnered considerable attention from researchers in the last decade. In order to decode the meaning of a Sanskrit compound, it is essential to figure out its constituents [Huet, 2010; Mittal, 2010; Hellwig and Nehrdich, 2018a], how the constituents are grouped [Kulkarni and Kumar, 2011b], identify the semantic relation between them [Kumar, 2012] and finally generate the paraphrase of the compound [Kumar et al., 2009]. Satuluri and Kulkarni [2013] and Kulkarni and Kumar [2013b] proposed a rule-based approach where around 400 rules mentioned in Pāṇini ...
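The four steps named in this excerpt can be pictured as a small pipeline. The sketch below is purely illustrative: the hyphen-based splitter, the bracketing, the "Tatpurusha" label and the gloss are all hypothetical stubs standing in for the rule-based or statistical components that the cited works actually use.

```python
def analyze_compound(compound: str) -> dict:
    """Illustrative four-step compound analysis; every step is a stub."""
    constituents = compound.split("-")             # stand-in for sandhi-aware splitting
    grouping = (constituents[0], constituents[1])  # stand-in binary bracketing
    relation = "Tatpurusha"                        # stand-in semantic relation label
    gloss = f"{constituents[1]} of {constituents[0]}"  # stand-in paraphrase
    return {"constituents": constituents, "grouping": grouping,
            "relation": relation, "paraphrase": gloss}

# Invented WX-notation example: rAja-puruRa ("king's man").
print(analyze_compound("rAja-puruRa"))
```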
Preprint
Full-text available
The primary focus of this thesis is to make Sanskrit manuscripts more accessible to the end-users through natural language technologies. The morphological richness, compounding, free word orderliness, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks, which are crucial for developing a robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any other downstream applications. However, it is challenging due to the sandhi phenomenon that modifies characters at word boundaries. Similarly, the existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and web-based toolkit.
... This was further enhanced by introducing a graphical interface which presents all the possible lexically and morphologically valid segments (Goyal and Huet, 2016). Mittal (2010) used OpenFST and augmented it with sandhi rules, finally validating the segments using optimality theory. Kumar et al. (2010) developed a segmenter exclusively for Sanskrit compounds using probabilistic methods and optimality theory. ...
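A toy rendering of the idea of augmenting a word acceptor with sandhi rules: reverse each rule at every matching position to propose a split, then validate the parts against a lexicon. The two-entry rule table and lexicon below are invented samples, not the OpenFst grammar of Mittal (2010).

```python
# A sandhi rule here maps a surface string to the (final, initial) pair
# it may have arisen from, e.g. "ya" <- i + a.
SANDHI_RULES = {"ya": ("i", "a")}
LEXICON = {"xaXi", "awra"}

def splits(text):
    """Undo each sandhi rule at every occurrence; keep lexicon-valid parts."""
    out = []
    for surface, (left_end, right_start) in SANDHI_RULES.items():
        start = text.find(surface)
        while start != -1:
            left = text[:start] + left_end
            right = right_start + text[start + len(surface):]
            if left in LEXICON and right in LEXICON:
                out.append((left, right))
            start = text.find(surface, start + 1)
    return out

print(splits("xaXyawra"))  # [('xaXi', 'awra')]
```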
Preprint
Full-text available
Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models despite working with relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for the low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts, but it also comes with limitations at different levels of a sentence, such as chunk, segment, stem and morphological analysis. One way to overcome these limitations is to look at alternatives such as the Sanskrit Heritage Segmenter (SH) and Samsaadhanii tools, which provide information complementing DCS's data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information from both DCS and SH. Furthermore, this work also discusses the impact of such datasets on the performances of existing segmenters, specifically the Sanskrit Heritage Segmenter.
... Earlier approaches to SWS focused on rule-based Finite State Transducer systems (Huet, 2003; Mittal, 2010). Natarajan and Charniak (2011) attempted to solve the SWS task for sentences with one or two splits using a Bayesian approach. ...
Preprint
Full-text available
Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks. It is, however, non-trivial because of the sandhi phenomenon that modifies the characters at the word boundaries, and needs special treatment. Existing lexicon driven approaches for SWS make use of Sanskrit Heritage Reader, a lexicon-driven shallow parser, to generate the complete candidate solution space, over which various methods are applied to produce the most valid solution. However, these approaches fail while encountering out-of-vocabulary tokens. On the other hand, purely engineering methods for SWS have made use of recent advances in deep learning, but cannot make use of the latent word information on availability. To mitigate the shortcomings of both families of approaches, we propose Transformer based Linguistically Informed Sanskrit Tokenizer (TransLIST) consisting of (1) a module that encodes the character input along with latent-word information, which takes into account the sandhi phenomenon specific to SWS and is apt to work with partial or no candidate solutions, (2) a novel soft-masked attention to prioritize potential candidate words and (3) a novel path ranking algorithm to rectify the corrupted predictions. Experiments on the benchmark datasets for SWS show that TransLIST outperforms the current state-of-the-art system by an average 7.2 points absolute gain in terms of perfect match (PM) metric. The codebase and datasets are publicly available at https://github.com/rsingha108/TransLIST
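The "soft-masked attention" idea can be illustrated generically as biasing attention logits with a per-position prior instead of hard-masking them. The numpy toy below assumes invented priors and random tensors and is not the TransLIST implementation.

```python
import numpy as np

def soft_masked_attention(q, k, v, prior):
    """Toy scaled dot-product attention with an additive soft mask.
    `prior` holds a score per key position (e.g. confidence that the
    position belongs to a candidate word); it biases, rather than
    hard-masks, the attention logits."""
    logits = q @ k.T / np.sqrt(k.shape[-1])
    logits = logits + np.log(prior + 1e-9)   # soft mask as additive bias
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

q = np.random.rand(2, 8); k = np.random.rand(5, 8); v = np.random.rand(5, 8)
prior = np.array([0.9, 0.9, 0.1, 0.5, 0.9])  # invented candidate scores
print(soft_masked_attention(q, k, v, prior).shape)  # (2, 8)
```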
... It also provides inflectional analysis, prunes the answer, uses local morph analysis to handle unrecognised words and produces a derivational analysis of the derived roots [56][57][58]. • Parsing: A parser is used as a compiler or interpreter component that breaks data into smaller units for easy translation from one language to another. Parsers take a sequence of words or tokens as input. ...
Article
Full-text available
Languages help to unite the world socially, culturally and technologically. Since people communicate in different native languages, there is a tremendous need for inter-language translation to transfer and share information and ideas. Though Sanskrit is an ancient Indo-European language, a significant amount of processing work is still required to explore its full potential and open vistas in computational linguistics and computer science. In this paper, we propose and present a machine translation system for translating Sanskrit to Hindi. The developed technique uses linguistic features from a rule-based feed to train a neural machine translation system. The work is novel and applicable to any low-resource language with rich morphology. It is a generic system covering various domains with minimal human intervention. The performance of the work is analysed on automatic and linguistic measures. The results show that the proposed approach outperforms earlier work for this language pair.
... It also provides inflectional analysis, prunes the answer, uses local morph analysis to handle unrecognized words and produces a derivational analysis of the derived roots (Bharati et al. 2006; Mittal 2010; Jha et al. 2009). - Parsing: A parser is used as a compiler or interpreter component that breaks data into smaller units for easy translation from one language to another. ...
Article
Full-text available
A machine translation system (MTS) consists of functionally heterogeneous modules for processing a source language into a given target language. Deploying such an application on a stand-alone system demands much time and knowledge and introduces complications; it becomes even more challenging for a common user to operate such a complex application. This paper presents an MTS developed using a combination of a linguistically rich, rule-based approach and a prominent neural approach. The proposed MTS is deployed on the cloud to offer translation as a cloud service and to improve the quality of service (QoS) over a stand-alone system. It is developed on TensorFlow and deployed on a cluster of virtual machines in Amazon's web service (EC2). The significance of this paper is to demonstrate the management of recurrent changes in terms of corpus, domain, algorithms and rules. Further, the paper also compares the MTS as deployed on a stand-alone machine and on the cloud for different QoS parameters like response time, server load, CPU utilization and throughput. The experimental results assert that in the translation task, with the availability of elastic computing resources in the cloud environment, the job completion time, irrespective of job size, can be assured to be within a fixed time limit with high accuracy.
... [13] proposed a statistical method based on Dirichlet process. Finite state methods have also been used [12]. A graph query method has been proposed by [10]. ...
Preprint
Full-text available
This paper describes neural network based approaches to the processes of forming and splitting word compounds, known respectively as sandhi and vichchhed, in the Sanskrit language. Sandhi is an important idea essential to the morphological analysis of Sanskrit texts. Sandhi leads to word transformations at word boundaries. The rules of sandhi formation are well defined but complex, sometimes optional and, in some cases, dependent on knowledge about the nature of the words being compounded. Sandhi split, or vichchhed, is an even more difficult task given its non-uniqueness and context dependence. In this work, we formulate the problem as a sequence-to-sequence prediction task using modern deep learning techniques. Ours is the first fully data-driven technique, and we demonstrate that our model achieves better accuracy than existing methods on multiple standard datasets, despite not using any additional lexical or morphological resources. The code is being made available at https://github.com/IITD-DataScience/Sandhi_Prakarana
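A minimal sketch of the sequence-to-sequence formulation the abstract describes, assuming character-level tokens; the sandhi pair used is an invented WX-notation sample, and this is not the authors' data pipeline.

```python
from typing import List, Tuple

def make_example(surface: str, split_words: List[str]) -> Tuple[List[str], List[str]]:
    """Build one (source, target) pair for character-level seq2seq training."""
    src = list(surface)                # character tokens of the joined form
    tgt = list(" ".join(split_words))  # character tokens of the split form
    return src, tgt

src, tgt = make_example("xaXyawra", ["xaXi", "awra"])
print(src)  # ['x', 'a', 'X', 'y', 'a', 'w', 'r', 'a']
print(tgt)  # ['x', 'a', 'X', 'i', ' ', 'a', 'w', 'r', 'a']
```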
... This kind of non-determinism is also found in languages like Chinese and Japanese, where word boundaries are not indicated, and also in agglutinative languages like Turkish (Mittal, 2010). In some of these languages, like Thai (Haruechaiyasak et al., 2008), most sentences are a mere concatenation of words. ...
... The current paper focuses on updating this external sandhi segmenter. Mittal (2010) used Optimality Theory to derive a probabilistic method and developed two methods to segment the input text: (1) augmenting a finite state transducer built using OpenFst (Allauzen et al., 2007) with sandhi rules, where the FST is used for morphological analysis and is traversed for segmentation, and (2) using optimality theory to validate all the possible segmentations. Kumar et al. (2010) developed a compound processor where the segmentation of compound words was done, and used optimality theory with a different probabilistic method (discussed in section 5). ...
... Many probabilistic measures have been proposed in the past to prioritize the solutions. Mittal (2010) calculated the weight for a specific split s_j as ...
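The excerpt truncates the formula itself. As a generic illustration of this family of frequency-based measures (explicitly not Mittal's exact expression), a split could be weighted by the relative frequencies of its constituent words and of the sandhi rules joining them:

```python
# Invented corpus counts; NOT the truncated formula of Mittal (2010).
WORD_FREQ = {"xaXi": 40, "awra": 55}
RULE_FREQ = {("i", "a"): 120}          # rule: final i + initial a
TOTAL_WORDS, TOTAL_RULES = 1000, 500

def weight(words, rules):
    """Product of relative frequencies of words and joining sandhi rules."""
    w = 1.0
    for word in words:
        w *= WORD_FREQ.get(word, 1) / TOTAL_WORDS
    for rule in rules:
        w *= RULE_FREQ.get(rule, 1) / TOTAL_RULES
    return w

print(weight(["xaXi", "awra"], [("i", "a")]))  # 0.000528
```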
Preprint
Computationally analyzing Sanskrit texts requires proper segmentation in the initial stages, and various tools have been developed for Sanskrit text segmentation. Of these, Gérard Huet's Reader in the Sanskrit Heritage Engine analyzes the input text and segments it based on word parameters (phases like iic, ifc, Pr, Subst, etc.) and the sandhi (or transition) that takes place at the end of a word with the initial part of the next word. It enlists all the possible solutions, differentiating them with the help of the phases. The phases and their analyses have their use in the domain of sentential parsers; in segmentation, though, they are not used beyond deciding whether the words formed with the phases are morphologically valid. This paper modifies the above segmenter by ignoring the phase details (except for a few cases) and also proposes a probability function to prioritize the list of solutions so as to bring the most valid solutions to the top.
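A sketch of what "ignoring the phase details" might look like operationally: collapse solutions whose word sequences coincide once phase labels are dropped. The solution tuples and the "Adv" phase label below are invented; the real phases come from the Heritage Engine.

```python
# Invented solution tuples; the real phase labels (iic, ifc, Pr, Subst, ...)
# come from the Sanskrit Heritage Engine, and "Adv" here is hypothetical.
solutions = [
    [("xaXi", "Subst"), ("awra", "Adv")],
    [("xaXi", "iic"), ("awra", "Adv")],   # same words, different phases
]

def drop_phases(solutions):
    """Keep one representative per phase-free word sequence."""
    seen, merged = set(), []
    for sol in solutions:
        words = tuple(word for word, _phase in sol)
        if words not in seen:
            seen.add(words)
            merged.append(list(words))
    return merged

print(drop_phases(solutions))  # [['xaXi', 'awra']]
```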
... One peculiarity of Sanskrit processing is the non-trivial word segmentation [5]. For a long time, oral transmission played a dominant role in preserving and spreading Sanskrit stories; if they were eventually written down, the writing system closely followed pronunciation. ...
Article
Full-text available
We present the first freely available dependency treebank of Sanskrit. It is based on text from Panchatantra, an ancient Indian collection of fables. The annotation scheme we chose is that of Universal Dependencies, a current de-facto standard for cross-linguistically comparable morphological and syntactic annotation. In the present paper, we discuss word segmentation issues, morphological inventory and certain interesting syntactic constructions in the light of the Universal Dependencies guidelines. We also present an initial parsing experiment.