
Sanskrit parsing

The Sanskrit texts of greatest interest to the serious student originated between 3,500 and 1,000 years ago, and many have been discussed in commentaries by learned experts for over 3,000 years. Most of these commentaries explore, in great depth, various interpretations of phrases and words in these classics, and provide references to the contexts that the classics allude to. Several of them offer alternate interpretations of the classics, and have been studied at traditional schools in the Indian subcontinent for over a thousand years. Almost every learned translation of these classics in the modern era refers extensively to a number of these commentaries in order to provide detailed interpretations. Lexicographers have also noted, from a study of such commentaries, that a large number of words have multiple, diverse meanings. One example is the root verb दय्, with which the Monier-Williams Dictionary[1] associates various meanings including 'to divide', 'partake', 'kill', 'pity', 'love', 'repent', and 'go', among many others. Some of these meanings were described by Panini himself, c. 500 BCE, as being associated with specific syntactic Cases, while others may have resulted from a 'drift' in meaning over time (see note 1).

It is also well known that the inherent ambiguity in 'euphonic combinations' (संधि), i.e. adjacent words that are fused together, can lead to significant differences in interpretation. It is frequently possible to split a 'euphonic combination' in several ways that are each syntactically well-formed (and sometimes semantically plausible as well). For example, the combination पश्यैतां in the following sentence can be split as पश्य+एताम्, पश्य+ऐताम्, पश्या+एताम्, पश्या+ऐताम्, or पश्यै+ताम्, among other possibilities. Note that every word in these splits is not only well-formed according to the rules of grammar but also present in the Sanskrit lexicon, so each split may lead to a different interpretation if it is not ruled out by other constraints:

पश्यैतां पाण्डुपुत्राणामाचार्य महतीं चमूम्

व्यूढां द्रुपदपुत्रेण तव शिष्येण धीमता

(Srimad Bhagavad Gita, Stanza 1.3)
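The ambiguity described above can be illustrated with a minimal Python sketch. It works on IAST transliteration (paśyaitām for पश्यैतां) rather than Devanagari, covers only the single vowel rule needed for this example (a/ā + e/ai → ai), and uses a toy lexicon; the names REVERSE_SANDHI, LEXICON, and candidate_splits are hypothetical and are not taken from any existing tool.

```python
# Reverse vowel-sandhi table: fused surface vowel -> possible
# (left-word ending, right-word beginning) pairs.
# Assumption: only the one rule needed for this example is listed.
REVERSE_SANDHI = {
    "ai": [("a", "e"), ("a", "ai"), ("ā", "e"), ("ā", "ai")],
}

# Toy lexicon of inflected forms (hypothetical; a real system would need
# a full morphological lexicon).
LEXICON = {"paśya", "paśyā", "paśyai", "etām", "aitām", "tām"}


def candidate_splits(surface, lexicon=LEXICON):
    """Return every two-word split whose halves are both in the lexicon."""
    results = []
    for i in range(1, len(surface)):
        # Case 1: a plain word boundary with no sound change.
        left, right = surface[:i], surface[i:]
        if left in lexicon and right in lexicon:
            results.append((left, right))
        # Case 2: a fused vowel starting at position i is undone
        # using the reverse table.
        for fused, pairs in REVERSE_SANDHI.items():
            if surface.startswith(fused, i):
                rest = surface[i + len(fused):]
                for left_end, right_start in pairs:
                    left = surface[:i] + left_end
                    right = right_start + rest
                    if left in lexicon and right in lexicon:
                        results.append((left, right))
    return results


for left, right in candidate_splits("paśyaitām"):
    print(left, "+", right)
```

With this toy lexicon the sketch proposes the same five splits listed above. None of them is ruled out at this stage, which is why the choice has to be deferred to later stages of parsing.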

A few short verbs are particularly noteworthy. The verb इ ('to go/come/retire, etc.'), for example, has conjugated forms such as इत, इते, इयते, and एतु that also appear, fused, inside a large number of other conjugated or declined words (e.g. the passive verb form क्रियते, from the verb 'to do'; most 3rd Person Singular Passive Present tense verbs have the word-final इयते). This causes problems when splitting 'euphonic combinations', as a decision must be made about when not to split a word (i.e. when to safely ignore such conjugated forms of the verb इ). It must also be kept in mind that nominals (and proper names in particular) are frequently word-internal combinations of words that must not be split into their constituents; these combinations are virtually indistinguishable from regular 'euphonic combinations', so the parser can mistake them for such, leading to a failed or wrong parse and translation.
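The over-splitting risk can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: words are in IAST transliteration, the split is a naive surface cut that ignores sandhi rules, and the guard is a hypothetical set of words known to be complete (WHOLE_WORDS). A real parser would need far richer evidence before deciding not to split.

```python
# Conjugated forms of the verb "i" listed in the text (IAST transliteration).
I_FORMS = ("iyate", "etu", "ite", "ita")

# Hypothetical set of forms known to be complete words (or protected names)
# that must never be split.
WHOLE_WORDS = {"kriyate"}


def embedded_i_form(word):
    """Return the first listed form of 'i' that ends the word, if any.

    The word itself being a form of 'i' does not count as embedded.
    """
    for form in I_FORMS:
        if word.endswith(form) and word != form:
            return form
    return None


def propose_split(word, whole_words=WHOLE_WORDS):
    """Naive splitter: cut off a trailing form of 'i' unless the word is protected."""
    if word in whole_words:
        return [word]
    form = embedded_i_form(word)
    if form is None:
        return [word]
    return [word[: -len(form)], form]


print(propose_split("kriyate", whole_words=set()))  # unguarded: wrongly split
print(propose_split("kriyate"))                     # guarded: left intact
```

Without the guard, the sketch cuts kriyate after its first two letters because the remainder happens to match iyate; with the guard, the word is left intact.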

Clearly, interpretations could be radically different if two experts choose to split a 'euphonic combination' in different ways, and there is evidence of such differences in interpretation (see note 2). The reader must also note that 'euphonic combinations' cannot be decoded in a simplistic manner. A number of declined forms are identical across Cases/Genders (e.g. dual INS, DAT, and ABL share the declined form ending -भ्याम् for most nominals), and even across phrase types (e.g. the term 'bodha' (बोध) could represent either an Imperative 2nd Person Singular verb or a VOC Singular nominal); these are difficult problems in themselves. Such ambiguities can be resolved only at a late stage of parsing (if they can be resolved mechanically at all), when multiple candidates for each word can be ruled out based on other considerations (e.g. the presence or absence of a verb in the clause).
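One way to picture this late-stage resolution is to keep every candidate analysis of a word and prune the list only when clause-level evidence is available. The Python sketch below is an assumption-laden illustration, not a description of any actual parser: the Analysis class, the two hard-coded patterns, and the 'at most one finite verb per clause' constraint are all simplifications introduced for this example, and forms are given in IAST transliteration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    category: str  # "verb" or "nominal"
    features: str  # e.g. "INS-dual" or "imperative-2sg"


def analyses(form):
    """Return all candidate analyses for a form (toy rules for two patterns only)."""
    if form.endswith("bhyām"):
        # Dual INS, DAT and ABL share the ending -bhyām.
        return [Analysis("nominal", f"{case}-dual") for case in ("INS", "DAT", "ABL")]
    if form == "bodha":
        return [Analysis("verb", "imperative-2sg"), Analysis("nominal", "VOC-sg")]
    return []


def prune(candidates, clause_has_other_verb):
    """Drop verb readings if the clause already contains another finite verb.

    Simplifying assumption: a clause has at most one finite verb.
    If pruning would remove everything, the original list is kept.
    """
    if not clause_has_other_verb:
        return candidates
    kept = [a for a in candidates if a.category != "verb"]
    return kept or candidates


print(analyses("devābhyām"))
print(prune(analyses("bodha"), clause_has_other_verb=True))
```

Note that the three readings of devābhyām survive this particular constraint untouched; separating them needs other evidence, such as the case requirements of the verb.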

Yet another problem is the 'free' word-order seen in the classics; 'free' word-order was necessary to ensure the correct metre for the verse (the right metre ensured that many of these texts were transmitted, relatively error-free, from teacher to pupil for several thousand years). It is important to note that the components of a single noun phrase may be scattered across different parts of the clause/ sentence. While syntactic Case makes it technically feasible to identify the various components, a large number of alternate candidate forms (i.e. alternate Cases) and 'free' word-order lead to a delayed resolution of 'euphonic combinations' and clause structure (until the entire sentence has been seen by the parser). Further, many sentences in the classics may not be 'well-formed' as per the rules of Sanskrit grammar (as is generally true of poetry in all languages), leading to difficulties in parsing them correctly.
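The scattering of a noun phrase can be seen in Gita 1.3, quoted earlier. The Python sketch below groups the words of that verse by agreement in Case, number, and gender, regardless of position. It assumes that 'euphonic combinations' have already been split and that each word carries a single, hand-assigned tag (in IAST transliteration); in practice each word would have several candidate tags, and agreement alone is only a heuristic for phrase membership. The names TAGGED and group_by_agreement are hypothetical.

```python
from collections import defaultdict

# Words of Gita 1.3 after splitting, with hand-assigned (Case, number, gender)
# tags. None marks the finite verb; "any" marks a gender-neutral pronoun.
TAGGED = [
    ("paśya", None),
    ("etām", ("ACC", "sg", "f")),
    ("pāṇḍuputrāṇām", ("GEN", "pl", "m")),
    ("ācārya", ("VOC", "sg", "m")),
    ("mahatīm", ("ACC", "sg", "f")),
    ("camūm", ("ACC", "sg", "f")),
    ("vyūḍhām", ("ACC", "sg", "f")),
    ("drupadaputreṇa", ("INS", "sg", "m")),
    ("tava", ("GEN", "sg", "any")),
    ("śiṣyeṇa", ("INS", "sg", "m")),
    ("dhīmatā", ("INS", "sg", "m")),
]


def group_by_agreement(tagged):
    """Group nominals that share Case, number and gender, keeping their positions."""
    groups = defaultdict(list)
    for position, (word, tag) in enumerate(tagged):
        if tag is not None:
            groups[tag].append((position, word))
    return dict(groups)


for tag, members in group_by_agreement(TAGGED).items():
    print(tag, members)
```

The accusative feminine singular group collects etām, mahatīm, camūm, and vyūḍhām from positions 1, 4, 5, and 6, so the last member of the phrase cannot be attached until the second line of the verse has been read.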

The above discussion leads us to the conclusion that it is meaningless to attempt what can at best be a crude Sanskrit-to-English machine translation of these classics. Instead, our goal is to provide tools that help the serious student appreciate both the classics and the expert commentaries (most of which are in Sanskrit, and are important texts in themselves). Hence, our approach is to stop well short of translation, and to limit the software to partial translations, i.e. splitting the 'euphonic combinations' in a sentence, identifying the sandhi sutras applicable to each word in the sentence, and parsing it into clauses and phrases. This is useful in itself, as these mappings (e.g. identification of the subject, object, adjuncts, verb form, etc. of each clause) are otherwise very difficult for students to work out, because there are a number of alternate base/root words for each declined/conjugated form. While many commentaries mention each word and its meaning, they do not specify which conjugation/declension is applicable, nor do they mention the root/base word. This level of detail is available only in grammatical analyses of the major texts (see e.g. [2] or [3] for a grammatical analysis of the Srimad Bhagavad Gita), and requires a lot of effort by experts. Every student of Sanskrit knows that it is non-trivial to consult a Sanskrit dictionary (e.g. the Monier-Williams Dictionary[1] or the Capeller Dictionary[4]) for the meanings of complex words, as head words are usually listed by their base forms or roots rather than by their declined/conjugated forms. In order to look up a word, the declined/conjugated form must first be decoded (using the Paninian 'euphonic combination' rules, followed by handling of exceptional cases), and its root/base form must be identified.
For example, a word such as यन्तु (see below) cannot be looked up in the dictionary if the reader does not know that it is an irregular conjugated form of the verb इ ('to go/come/retire, etc.'). This word is present in the following well-known sentence from the Rig Veda meaning, 'Let right understanding/ good judgement come to us from everywhere ...':

आ नो भद्राः क्रतवो यन्तु विश्वतोऽदब्धासो अपरीतास उद्भिदः

(Rig Veda 1.89.1)

The serious student is encouraged to try to look up the dictionary meanings of the remaining words in the sentence (a non-trivial task even after the verb यन्तु has been identified in the preceding paragraph).
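The two-step lookup described above (first decode the inflected form to its root, then consult the head word) can be sketched as follows. The tables FORM_TO_ROOT and HEADWORDS are hypothetical, hand-filled stand-ins for a morphological analyser and a dictionary, and contain only the one example from the text, in IAST transliteration.

```python
# Hypothetical table from inflected form to (root, grammatical description).
FORM_TO_ROOT = {
    "yantu": ("i", "imperative, 3rd person plural"),
}

# Hypothetical dictionary keyed by root/base form, as printed dictionaries are.
HEADWORDS = {
    "i": "to go/come/retire, etc.",
}


def lookup(form):
    """Resolve an inflected form to its root, then fetch the head-word gloss.

    Returns (root, description, gloss), or None if either step fails.
    """
    entry = FORM_TO_ROOT.get(form)
    if entry is None:
        return None
    root, description = entry
    gloss = HEADWORDS.get(root)
    if gloss is None:
        return None
    return root, description, gloss


print(lookup("yantu"))
print(HEADWORDS.get("yantu"))  # a direct lookup of the inflected form finds nothing
```

The direct lookup fails because the dictionary is keyed by root; only the intermediate decoding step connects yantu to the head word i.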

Online dictionaries are available at the Cologne Sanskrit Digital Dictionaries website, University of Cologne (Köln). Another excellent resource for Sanskrit is the Sanskrit Grammarian website by Gérard Huet, INRIA, Paris, which provides declensions, conjugations, and an interactive mode for splitting the 'euphonic combinations' in a sentence (during this process, it also provides hyperlinks from root/base words to the Cologne Sanskrit Digital Dictionaries website mentioned above). A further useful resource, which provides online access to some of the classics, including associated commentaries by many learned experts, is the Gita Supersite by IIT Kanpur.

Sanskrit-to-English machine translation is particularly challenging because of the steep learning curve involved in understanding the language in some depth. Any serious attempt at parsing and translation requires an in-depth study of Paninian rules (and exceptions), including conjugation, declension, euphonic combinations, derived forms, and syntax. Sanskrit follows a very detailed set of rules that were described by Panini in the Ashtadhyayi around 2,500 years ago, and even the splitting of euphonic combinations, one of the preconditions for parsing, requires the software to be aware of a large number of these rules (and exceptions). However, the time and effort spent in studying Sanskrit are extremely rewarding, as they enable a deeper engagement with the timeless classics of the Indian subcontinent.

கற்றது கைமண் அளவு, கல்லாதது உலகளவு

(kaRRadu kEmaNN aLavu, kallAdadu ulagaLavu)

The above quote, attributed to Avvaiyar, a Tamil poet-philosopher from ancient times, cautions that 'what one has studied can be likened to a mere fistful of earth contrasted with the vastness of the universe'. Its aptness was recently brought to the author's attention by a visiting relative, during a discussion of how it may take a lifetime to study and fully appreciate Panini's Ashtadhyayi.

  • 1. In Panini's time, the Dhatukosha included almost 2,000 roots. However, according to present-day Sanskrit experts, a large number of these roots are no longer found in the literature that has survived to the present time, indicating that a large number of texts may have been lost over time. Interestingly, over 260 roots in the Dhatukosha are associated with the meaning 'to go'.
  • 2. These include a well-known difference between two famed grammarians regarding the interpretation of a sutra in Panini's Ashtadhyayi.

References