Sanskrit parsing

On this page, we discuss the question 'What does parsing a Sanskrit sentence involve?' We list the major tasks involved in parsing a Sanskrit sentence, so that the reader does not make the common mistake of confusing 'sandhi analysis' with parsing ('sandhi analysis' is merely the first step in parsing a Sanskrit sentence). Our Sanskrit Parser is unique in that it performs most of these tasks on an extremely complex non-prose text such as the Srimad Bhagavad Gita. While this work is ongoing, the results of processing the early chapters (i.e. Chapters 1 to 5) are very encouraging, making this the only parser in the public domain to have successfully parsed the first 200+ verses of this complex text (with defects of the order of 2-3% per chapter).

Parsing a Sanskrit sentence is not unlike solving an irregular-shaped Sudoku puzzle, where many cells can have a number of alternate candidate values. In Sudoku, the value chosen for each cell depends on the satisfaction of certain constraints in its row, column, and immediate block (for example, the value of each cell must be between 1 and 9, and unique in each of its three dimensions). Imagine how complex a Sudoku puzzle would be if the structure of the puzzle itself had to be figured out; in other words, if the dimensions of each row, column, and immediate block of each cell were not known in advance (or, worse still, if the initial dimensions or decision variables had to be revised half-way through the processing). This is analogous to what the parser must do while parsing a Sanskrit sentence.


Now let us consider what is involved in parsing the following example from Stanza 1.1 of the Srimad Bhagavad Gita:

धृतराष्ट्र उवाच
धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः
मामकाः पाण्डवाश्चैव किमकुर्वत सञ्जय

Printed: dhRutarASHTra uvAcha . dharmakSHetre kurukSHetre samavetA yuyutsavaHa , mAmakAHa pANNDavAshchEva kimakurvata saNjaya .

dhRutarASHTra uvAcha ... [8.3.17, 8.3.19, 1.3.2, 8.2.66] bho bhago agho apUrvasya yo'shi
samavetA yuyutsavaHa ... [8.3.17, 8.3.22, 1.3.2, 8.2.66] bho bhago agho apUrvasya yo'shi
mAmakAHa p ... [8.3.15, 1.3.2, 8.2.66] kharavasAnayo visarjanIyaHa
pANNDavAshchEva ... [8.4.40, 8.3.34, 8.3.15, 1.3.2, 8.2.66] stoHa shchunA shchuHa
chEva ... [6.1.88, 1.1.1] vRuddhirechi
kimakurvata ... [6.1.72] saNhitAyAm
pANNDavAHa ch ... [8.3.15, 1.3.2, 8.2.66] kharavasAnayo visarjanIyaHa

Underlying: dhRutarASHTras uvAcha . dharmakSHetre kurukSHetre samavetAs yuyutsavas , mAmakAs pANNDavAs cha eva kim akurvata saNjaya .
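
The relation between the 'Printed' and 'Underlying' lines above can be illustrated with a minimal Python sketch that undoes sandhi at word junctures. It does not reproduce the parser's actual implementation: the rule table, the function name `analyse`, and the four-word lexicon are assumptions made purely for illustration, and only two of the sutras cited above (8.4.40 and 6.1.88) are modelled, alongside plain juxtaposition as in 'kimakurvata'.

```python
from typing import List, Set, Tuple

# Hypothetical rule table: (surface at the juncture, tail of the left word,
# head of the right word, sutra number as cited on this page).
RULES: List[Tuple[str, str, str, str]] = [
    ("shch", "s", "ch", "8.4.40"),  # ...s + ch... is printed as ...shch...
    ("E", "a", "e", "6.1.88"),      # ...a + e... is printed as ...E...
]

# Toy lexicon of 'Underlying' terms (an assumption, not the parser's Lexicon).
LEXICON: Set[str] = {"pANNDavAs", "cha", "eva", "kim", "akurvata"}


def analyse(surface: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every split of `surface` into lexicon terms, undoing RULES."""
    results: List[List[str]] = []
    for word in sorted(lexicon):
        if surface == word:
            results.append([word])
            continue
        # Plain juxtaposition: the terms were merely written together.
        if surface.startswith(word):
            for rest in analyse(surface[len(word):], lexicon):
                results.append([word] + rest)
        # Undo one rule at the juncture that follows `word`.
        for surf, tail, head, _sutra in RULES:
            if not word.endswith(tail):
                continue
            prefix = word[: len(word) - len(tail)] + surf
            if surface.startswith(prefix):
                remainder = head + surface[len(prefix):]
                for rest in analyse(remainder, lexicon):
                    results.append([word] + rest)
    return results


if __name__ == "__main__":
    print(analyse("pANNDavAshchEva", LEXICON))
    print(analyse("kimakurvata", LEXICON))
```

With this toy data, 'pANNDavAshchEva' has the single analysis pANNDavAs / cha / eva, and 'kimakurvata' the single analysis kim / akurvata. Because the function returns a list of all analyses, a form that can be split in more than one valid way would simply yield several candidates, to be resolved later by grammatical analysis.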


Parsing of a sentence in Sanskrit involves the following tasks (NOTE: our parser follows the framework of Chomsky's Government and Binding Theory [1][2][3][4][5][6][7][8]):

  1. Split each euphonic combination ('sandhi') into its constituent terms in the 'sandhi analysis' stage. For example, the euphonic combination 'pANNDavAshchEva' in this example must be split into its constituent terms 'pANNDavAHa', 'cha', and 'eva'. In the example above, the input to 'sandhi analysis' is marked as 'Printed' and the output is marked as 'Underlying'. The numbers in the square brackets above (e.g., [6.1.88, 1.1.1]) indicate the Paninian Ashtadhyayi sutras that are applied to the 'Underlying' terms to transform them into the 'Printed' form. It is important for 'sandhi analysis' to be aware of all the steps in each derivation, as it works by undoing each transformation from the 'Printed' form to yield the 'Underlying' terms.
  2. Sometimes, due to inherent syntactic ambiguity, the 'sandhi analysis' stage is unable to decide between multiple valid ways to split a euphonic combination. Such cases must be marked for the subsequent grammatical analysis to resolve, based on other syntactic cues in the sentence.
  3. Having obtained all the individual terms (shown as 'Underlying' above) of the sentence after 'sandhi analysis', the parser must now figure out all the clauses in the sentence. For example, the parser must identify two clauses, built around the verbs 'uvAcha' ('vach:2:P:to speak:VerbPerfect') and 'akurvata' ('kRu:8:U:to do:VerbImperfect') respectively. The reader must be aware that many terms can be defined as both verbs and nouns, hence figuring this out is a non-trivial task in some cases.
  4. In Sanskrit, it is also possible for the verb to be elided, and such verbs must be 'read in' by the parser. For example, the verb 'to be' (also known as the copula verb) is frequently elided, and the reader must use other syntactic cues to 'read in' such elided verbs. See, for example, Stanza 1.23 of the Srimad Bhagavad Gita, where the parser 'reads in' the elided copula verb to create Clause A.2. If this verb and clause are not created by the parser, the terms of the stanza cannot be resolved correctly (e.g., NOM-P 'ye' and 'ete' would conflict with the NOM-S Subject 'aham', and could well be wrongly considered as Feminine/ Neuter ACC-D). Occasionally, a sentence may require a different elided verb (as seen in Stanza 1.15 and Stanza 1.16 of the Srimad Bhagavad Gita), which must be 'read in' from another clause in the same or preceding stanza, and inflected so as to agree with its Subject.
  5. The clauses must be marked as Main clause, subordinate clause, subjunctive clause, co-ordinate clause, etc. to show their relation with each other. This is especially important in case of relative clauses, where a component inside the relative clause must be linked with its 'antecedent' (in some other clause) by the parser.
  6. For each clause, the parser must figure out its Arguments (usually its Subject and Object/s). This will depend on whether the verb in the clause is used as a Transitive verb (e.g., 'akurvata' in this example) or an Intransitive verb (e.g., 'uvAcha' in this example). Note that many verbs can be used in both Transitive and Intransitive senses. For example, the verb 'uvAcha' can be used as a Transitive verb with a single object, or even as a Ditransitive verb with two objects (as seen in Stanza 2.1 of the Srimad Bhagavad Gita). Hence, the parser must study the sentence carefully to rule out a number of alternatives before arriving at the right answer.
  7. The parser must also be aware of whether the Verb can accept a Subject term and an Object term. Remember that in the case of non-Finite verbs (such as Gerunds and Infinitives), the clause must not contain a Subject term. Further, in the case of Passives, the clause must not contain an Object term (unless it is a Ditransitive verb), as the Syntactic Subject is usually the Semantic Object of the Passive clause (e.g., consider the sentence 'The ball was kicked by the boy.', in which the syntactic Subject 'the ball' is actually the Semantic Object of the sentence, i.e. 'the boy kicked what?').
  8. In Sanskrit, the verb is usually inflected for the Person and Number of its Subject. Once the parser figures out the verb, it will try to locate the Subject using the Person and Number of the verb (or do this the other way around). As one can see from the example above, the Subject of the verb 'akurvata' should be in the III Person and Plural Number. The parser must figure out in this case that the Subject is a Coordinated term i.e. 'mAmakAHa pANNDavAHa cha eva' ('mine as well as the sons of Pandu'). In the other clause, the Subject term 'dhRutarASHTraHa' ('Dhrutarashtra') matches the Person and Number of the verb 'uvAcha' (III-Singular). But note that the I-Singular VerbPerfect conjugation is identical (i.e. 'uvAcha'); hence the parser must rule this candidate out based on the presence of the NOM-S 'dhRutarASHTraHa' (but remember that the printed form was 'dhRutarASHTra', and if 'sandhi analysis' had slipped up, we could well have had the wrong underlying VOC-S term 'dhRutarASHTra', and would not have been able to rule out uvAcha:I-Singular:VerbPerfect).
  9. The reader must keep in mind that there may be several candidates for the Subject and the Object in each clause. Some of these candidates may not be the Nominal Head, and may instead modify the Subject or Object (e.g., modifiers such as 'samavetAHa' and 'yuyutsavaHa' in this example).
  10. In Sanskrit, nominal modifiers (e.g., adjectives, pronouns, and derived nominals such as participles) must agree in Number, Gender, and Case with the Nominal Head that they modify. However, there are exceptions to this rule, and there are complications when the Nominal Head is a part of a coordinated phrase (i.e. what is the Gender of the coordinated phrase 'the boy and the girl' where one component is Masculine and the other is Feminine, and what is its Nominal Head?). See Stanza 1.27+28 of the Srimad Bhagavad Gita for an example of agreement between the NOM-S masculine 'saHa', 'kOnteya', 'AviSHTaHa', and 'viSHIdan', skipping over the INS-S Feminine terms 'kRupayA' and 'parayA'. Nevertheless, this agreement rule frequently helps the parser to figure out the correct declined forms: where either the Nominal Head or its modifier is unambiguous with reference to Gender, Case, or Number, it forms the anchor using which the other terms can be set to the correct Gender, Case, or Number.
  11. In relative clauses, a relative pronoun may be present (but is occasionally elided) and the Subject (or Object, Adjunct) must be linked with its antecedent (which may not always be present) in another clause (or perhaps in a preceding sentence). An example can be seen in Stanza 1.23 of the Srimad Bhagavad Gita where Arjuna tells Lord Krishna: 'I observe those who wish to fight are gathered here ...'. In this example, the relative pronoun is NOM-P 'ye' and its antecedent is ACC-P 'yotsyamAnAn' ('those who wish to fight'). The relative clause has the Subject term NOM-P 'ete' ('these') as well as a Subject modifier NOM-P 'samAgatAHa' (passive past participle 'those who are gathered'). Note that, in a relative clause that contains a Nominative relative pronoun (NOM-P 'ye' in this case), the presence of such a Subject phrase ('ete samAgatAHa') usually indicates an appositival structure that may be used for emphasis or clarification ('those who wish to fight, these people who are gathered here, ...').
  12. There is also the question of agreement in Number and Gender between the relative pronoun, the Subject inside the relative clause, and the antecedent. We saw how this works in the example from Stanza 1.23 of the Srimad Bhagavad Gita above. However, this agreement can be very difficult to check reliably because some of these terms may not be present in the sentence.
  13. Pronominals must also agree in Number and Gender with their antecedents. However, it is not possible for our syntactic parser to check this agreement reliably, without an understanding of the semantics of the sentence (as well as preceding and succeeding sentences). See an example in Stanza 1.33 of the Srimad Bhagavad Gita, where the NOM-P pronominal 'te' ('those') must agree with its antecedent GEN-P 'yeSHAm' ('whose') in the same stanza, as well as the link to the succeeding stanza ('teachers, fathers, ...'). Binding Theory [3] helps the parser to rule out some candidates, but there are usually several valid candidates for the antecedent in the sentence (and preceding context).
  14. It is also possible for the Subject to be elided in Sanskrit sentences, especially when they can be reconstructed using the verb inflections. For example, 'aham' ('I') may be elided if the verb is in the First Person Singular conjugation as seen in Clause A.3 of Stanza 1.7 of the Srimad Bhagavad Gita.
  15. It is also possible for the Subject to be elided and shared between two clauses. For example, 'vayam' ('we') is shared between two clauses in Stanza 1.45 of the Srimad Bhagavad Gita.
  16. It is also possible for the direct object to be elided in Sanskrit sentences. However, the direct object of a clause may not be subject to agreement constraints, making it very difficult for a syntactic parser to identify such elided (or shared) objects between clauses. For example, 'enam' ('this') is elided and shared between two clauses in Stanza 2.29 of the Srimad Bhagavad Gita, but the parser cannot definitively link the missing object of one clause with the object of the other clause in the absence of any syntactic cue that indicates such sharing.
  17. The reader must also be aware that it is not easy for the parser to figure out the declension or conjugation applicable to each term in the sentence, because there are a large number of words that have identical declined forms for multiple Cases. There are also conjugated forms of some verbs that are indistinguishable from certain declensions of nominals. For example, we can see the term 'veda' in Stanza 2.21 of the Srimad Bhagavad Gita which could be either a nominal (VOC-S declension of 'Veda/ sacred knowledge') or a verb (III-S or I-S conjugation of 'vid:2:to know:VerbPresent').
  18. It is not trivial to resolve the Subject and Object terms for some nominals that have Neuter entries in the Lexicon. The reader must keep in mind that the Neuter declensions are frequently identical for Nominative, Accusative, and Vocative (e.g., for 'a'-ending nominal bases). For example, 'vanam' ('forest') may be NOM-S, ACC-S, or VOC-S. The parser must use other syntactic cues in the sentence to figure out which one is applicable. The situation can be complicated further by the presence of Masculine or Feminine entries for the word in the Lexicon (note that adjectives are declined in all three Genders, and can be difficult to resolve in cases where the Nominal Head has multiple candidate declensions). For example, the Pronominal 'te' could be the Masculine NOM-P, Feminine NOM-D or ACC-D, or Neuter NOM-D or ACC-D terms. In this case, the parser must figure out whether the term is Accusative (Object) or Nominative (Subject), and what its Number and Gender are. This is particularly difficult if the term is not the Subject (e.g., the short pronoun 'naHa'), since no information is available from the verb inflection in such cases. As mentioned earlier, it is not possible for our parser to relate pronominals with their 'antecedents', because there may be a number of candidate terms, only one of which may be the correct 'antecedent' (keep in mind that the 'antecedent' may well be in some preceding sentence, or occasionally even a succeeding sentence).
  19. The parser must also resolve and assign the Indeclinables (e.g., 'eva'), as well as declensions of the remaining 'adjunctive' terms (e.g., 'saNjaya' is VOC-S, 'dharmakSHetre' is LOC-S, and 'kurukSHetre' is LOC-S) in each clause. It must be noted here that some of these 'adjunctive' terms may have alternate declined forms, giving them the appearance of Nominative or Accusative terms as well.
  20. There is considerable ambiguity in attaching such 'adjunctive' terms to clauses. For example, there is no syntactic reason why 'dharmakSHetre' cannot be attached to Clause A.1 (i.e. there could be an alternate interpretation: 'Dhrutarashtra spoke in Dharmakshetra Kurukshetra') instead of Clause B.1 (note that, here too, there are two alternatives, i.e. 'What did [those who are] eager [who are] assembled in Dharmakshetra Kurukshetra, mine as well as the sons of Pandu, do ?', and 'What did [those who are] eager [who are] assembled, mine and the sons of Pandu, do in Dharmakshetra Kurukshetra ?'). In general, the attachment of such 'adjunctive' terms is determined by semantics rather than syntactic constraints, and is probably best left to human experts. Our parser is a purely syntactic parser, and may therefore assign such terms to the wrong clause on occasion, especially those that are on the boundaries between clauses. However, in this particular example, our syntactic parser assumes that the adjunctive term 'LOC-S:dharmakSHetre' is governed by the passive past participle 'samavetAHa:[who are] assembled' that succeeds it.
  21. The internal structure of the clause consists of phrases (groups of associated terms), some of which may be subordinate to other terms. For example, Passive past participles in Sanskrit are derived from verbs and will have their own internal structure (i.e. Subject and Objects). In the example below, in clause B.1, note the comment 'Link_subj_yuyutsavaHa' in the entry for the passive past participle 'samavetAHa:[those who are] assembled'. This 'Link_subj_' indicates that the term 'samavetAHa' has as its grammatical subject the phrase containing the term 'yuyutsavaHa:[those who are] eager to fight'. Similarly, the comment 'Link_gov_samavetAHa' in the entry for 'dharmakSHetre' indicates that the term is governed by the phrase containing 'samavetAHa'. With these links provided in the parsed output, the reader can figure out the internal structure of the clause (e.g., the larger phrase can be read as '... [those who are] eager to fight [who are] assembled in Dharmakshetra Kurukshetra ...'). However, we have not grouped and displayed phrases in the parser output because of the possibility of egregious errors in such groupings due to the phenomenon of 'free' word order.
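
The Subject-verb check described in task 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the field names, and the function `subject_verb_pairs` are hypothetical and are not taken from the parser, and the candidate readings are limited to the ones mentioned in task 8.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VerbReading:
    form: str
    person: str  # "I", "II" or "III"
    number: str  # "S", "D" or "P"


@dataclass(frozen=True)
class NominalReading:
    form: str
    case: str           # e.g. "NOM", "ACC", "VOC"
    number: str         # "S", "D" or "P"
    person: str = "III"  # assumption: ordinary nouns count as III Person


def subject_verb_pairs(
    verbs: List[VerbReading], nominals: List[NominalReading]
) -> List[Tuple[VerbReading, NominalReading]]:
    """Keep only (verb, nominal) readings where a Nominative term
    matches the verb in Person and Number."""
    return [
        (v, n)
        for v in verbs
        for n in nominals
        if n.case == "NOM" and n.person == v.person and n.number == v.number
    ]


# The two identical VerbPerfect readings of 'uvAcha' mentioned in task 8.
uvacha = [VerbReading("uvAcha", "I", "S"), VerbReading("uvAcha", "III", "S")]

# Correct sandhi analysis: the NOM-S term selects the III-S reading.
nom = [NominalReading("dhRutarASHTraHa", "NOM", "S")]

# Faulty sandhi analysis: a VOC-S term cannot be the Subject.
voc = [NominalReading("dhRutarASHTra", "VOC", "S")]

if __name__ == "__main__":
    print(subject_verb_pairs(uvacha, nom))
    print(subject_verb_pairs(uvacha, voc))
```

With the NOM-S term present, only the III-Singular reading of 'uvAcha' survives. With the VOC-S term, no pair is found, so this check alone cannot decide between the two readings, which is the situation that task 8 warns about.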

The final result of parsing Stanza 1.1 of the Srimad Bhagavad Gita as shown above will include the following (in addition to the sandhi analysis shown at the top of this page):

A: dhRutarASHTraHa uvAcha

A.1:

  • dhRutarASHTraHa:NOM-S:dhRutarASHTra:Masc.:Noun
  • uvAcha:III-S:vach:2:P:VerbPerfect
  • B: dharmakSHetre kurukSHetre samavetAHa yuyutsavaHa mAmakAHa pANNDavAHa cha eva kim akurvata saNjaya

    B.1:

  • dharmakSHetre:LOC-S:dharman-kSHetra (dharmakSHetra) :Masc.:Noun:samAsa_tatpuruSHa(GEN):Link_gov_samavetAHa
  • kurukSHetre:LOC-S:kuru-kSHetra (kurukSHetra) :Masc.:Noun:samAsa_tatpuruSHa(GEN)
  • samavetAHa:NOM-P:samaveta:Masc.:Adj:past_participle_passive_kta_2P_sam-ava-i:Link_subj_yuyutsavaHa
  • yuyutsavaHa:NOM-P:yuyutsu:Masc.:Noun
  • mAmakAHa:NOM-P:mAmaka:Masc.:Noun
  • pANNDavAHa:NOM-P:pANNDava:Masc.:Noun
  • cha:Indeclinable
  • eva:Indeclinable
  • kim:ACC-S:kim:Neut.:Pronoun
  • akurvata:III-P:kRu:8:A:VerbImperfect
  • saNjaya:VOC-S:saNjaya:Masc.:Noun

    For reasons of readability and space, we have not listed all the alternative forms for each term (investigated and subsequently ruled out by our parser). The English translation in the gloss for each term is merely indicative; our Sanskrit parser steers clear of semantic interpretations, leaving this difficult task to humans. The reader must note the NOM and ACC terms in particular in each clause, in order to understand the internal structure of the clause.
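
    The colon-separated entries above, including their 'Link_subj_' and 'Link_gov_' comments, can be read mechanically. The following Python sketch shows one way to do so; the function name `parse_entry` and the dictionary keys are assumptions made for illustration, and the field layout is inferred only from the entries shown on this page.

```python
from typing import Dict, List


def parse_entry(entry: str) -> Dict[str, object]:
    """Split one output entry into its fields and collect its links."""
    fields: List[str] = [f.strip() for f in entry.split(":")]
    links: Dict[str, str] = {}
    for field in fields[1:]:
        if field.startswith("Link_"):
            # 'Link_subj_yuyutsavaHa' -> kind 'subj', target 'yuyutsavaHa'
            kind, _, target = field[len("Link_"):].partition("_")
            links[kind] = target
    return {
        "form": fields[0],
        "tag": fields[1] if len(fields) > 1 else "",
        "links": links,
        "fields": fields,
    }


# Entries copied from clause B.1 above.
ENTRIES = [
    "dharmakSHetre:LOC-S:dharman-kSHetra (dharmakSHetra) :Masc.:Noun:"
    "samAsa_tatpuruSHa(GEN):Link_gov_samavetAHa",
    "samavetAHa:NOM-P:samaveta:Masc.:Adj:"
    "past_participle_passive_kta_2P_sam-ava-i:Link_subj_yuyutsavaHa",
    "cha:Indeclinable",
]

if __name__ == "__main__":
    for raw in ENTRIES:
        parsed = parse_entry(raw)
        print(parsed["form"], parsed["tag"], parsed["links"])
```

    Following the 'gov' and 'subj' links from term to term is enough to recover the phrase structure described in task 21, without the parser output having to group the terms explicitly.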

    Some more details are provided in the examples in this section.


    As will be noted from the above, Sanskrit parsing is an extremely complex task, and largely involves the application of a number of syntactic constraints (such as the agreement rules described above) to rule out various alternatives that violate such constraints.
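
    As a small illustration of ruling out alternatives by agreement, the Python sketch below intersects the candidate readings of two terms that are assumed to stand in a head-modifier relation. The candidate readings of 'te' are the ones listed in task 18; the `Reading` type, the function `agree`, and the two anchor readings are hypothetical, and the exceptions to the agreement rule noted in task 10 are ignored.

```python
from typing import NamedTuple, Set


class Reading(NamedTuple):
    case: str    # "NOM", "ACC", ...
    number: str  # "S", "D" or "P"
    gender: str  # "Masc", "Fem" or "Neut"


# Candidate readings of the Pronominal 'te', as listed in task 18.
TE: Set[Reading] = {
    Reading("NOM", "P", "Masc"),
    Reading("NOM", "D", "Fem"),
    Reading("ACC", "D", "Fem"),
    Reading("NOM", "D", "Neut"),
    Reading("ACC", "D", "Neut"),
}


def agree(head: Set[Reading], modifier: Set[Reading]) -> Set[Reading]:
    """Keep only the readings on which both terms agree in
    Case, Number, and Gender."""
    return head & modifier


# Hypothetical anchor whose only reading is NOM-P Masculine.
ANCHOR: Set[Reading] = {Reading("NOM", "P", "Masc")}

# Hypothetical term that shares no reading with 'te'.
OTHER: Set[Reading] = {Reading("ACC", "S", "Neut")}

if __name__ == "__main__":
    print(agree(ANCHOR, TE))
    print(agree(OTHER, TE))
```

    An unambiguous anchor reduces the five readings of 'te' to one, while an empty result indicates that the two terms cannot be in an agreement relation, so that some other attachment must be tried.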

    One question that is frequently asked of this author is whether Sanskrit sentences can be parsed using Machine Learning techniques (whether using Artificial Neural Networks or otherwise). Our response is that every serious student of Sanskrit and Computational Linguistics should take the time to study the Ashtadhyayi, a text considered by most Linguists and Philosophers to be a work of monumental genius. Once they have undertaken such a study, we are confident that they will see that such an exploration is very much more rewarding than being able to parse some sentences quickly.


    References

    1. [Chomsky1980a] Chomsky N. On Binding. Linguistic Inquiry. 1980;11:1-46.
    2. [Chomsky1980b] Chomsky N. Rules and Representations. New York: Columbia University Press; 1980.
    3. [Chomsky1981] Chomsky N. Lectures on Government and Binding: The Pisa lectures. Seventh 1993 ed. Berlin; New York: Mouton de Gruyter; 1981.
    4. [Chomsky1981b] Chomsky N. Principles and Parameters in syntactic theory. In: Hornstein N., Lightfoot D., editors. Explanations in Linguistics. London: Longman; 1981.
    5. [Chomsky1982] Chomsky N. Some consequences of the Theory of Government and Binding. Linguistic Inquiry Monograph 6. Cambridge, MA: MIT Press; 1982.
    6. [Chomsky1986] Chomsky N. Barriers. Linguistic Inquiry Monograph 13. Cambridge, MA: MIT Press; 1986.
    7. [Chomsky1995] Chomsky N. The Minimalist Program. Cambridge, MA: MIT Press; 1995.
    8. [Chomsky2000] Chomsky N. New horizons in the study of language and mind. Cambridge, UK: Cambridge University Press; 2000.