UD for Malayalam
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters or punctuation.
- Multiword tokens are relatively common in Malayalam. In the following situations, we understand orthographic tokens
as corresponding to multiple syntactic words and split them:
- The copula ആക് / āk “to be” is written as a suffix of the nominal/adjectival predicate. However, sometimes it is suffixed to another word in the clause, indicating that it is a clitic rather than a derivational morpheme that would derive a verb from a noun/adjective.
- The quotative particle or complementizer എന്ന് / enn “that” usually occurs as a suffix of the verb or the copula. Given that we split the copula as a syntactic word, we split the complementizer as well. (This also increases parallelism with languages where complementizers are independent words, and avoids having to define a language-specific feature for verbs with a complementizer.)
- The coordinating clitics -ഉം / -um and -ഓ / -ō are written together with conjuncts but analyzed as separate syntactic words.
- In orthography, the object and the verb of a sentence sometimes occur as a multiword token. For example, in the sentence പെൺകുട്ടി തന്റെ സുഹൃത്തിന് കത്തെഴുതി. / peṇkuṭṭi tanṟe suhr̥ttin katteḻuti. “The girl wrote a letter to her friend.”, കത്ത് / katt “letter” and എഴുതി / eḻuti “wrote” occur as a multiword token and are split.
- There are letters that can be encoded in multiple ways, even after standard Unicode normalization (NFC), which is
required in UD.
- The viram sign, used across Indic scripts to cancel the vowel (a or schwa) inherently present in a consonant character, may in Malayalam actually result in a half vowel ŭ, especially at the end of a word. To signal that there is really no vowel, some consonants in the Malayalam script have so-called chillu variants, which have their own Unicode position. However, there is an older alternative of signalling that the chillu glyph should be rendered: using the standard code of the consonant, followed by a viram (U+0D4D) and a zero width joiner (U+200D). For example, കമ്യൂണിക്കേഷന്സ് / kamyūṇikkēṣans “communications”, has a viram and a zero width joiner between ന / na and സ / sa. Without these two characters, but using chillu na (U+0D7B) instead of na (U+0D28), the resulting glyphs should look the same: കമ്യൂണിക്കേഷൻസ്. On the other hand, with the standard na, viram, and without the zero width joiner, the string will look different: കമ്യൂണിക്കേഷന്സ്. When annotating text that originally used the zero width joiner, we normalize the text to the encoding that uses the chillu letters, provided the corresponding chillu letter exists. If the viram and zero width joiner follow a consonant that lacks the corresponding chillu letter, the string is left as is.
- In addition, a zero width non-joiner (U+200C) may be used to signal that two consecutive consonants should not be rendered as a ligature. This is a rare phenomenon but it is not obsolete, so no normalization is applied. For example: സ്കോര്പിയോ / skōrpiyō has a viram and a zero width non-joiner between the initial സ / sa and ക / ka. Without the zero width non-joiner, the word would look like this: സ്കോർപിയോ (the ligature sk may or may not be visible depending on the rendering algorithm used by your browser). Note that the first variant of the word also has a zero width joiner after ര ra, which is converted in the second variant using ർ / chillu rra, as described above.
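The chillu normalization described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tooling actually used for the treebank: the function name `normalize_chillu` and the mapping table are illustrative, and the table covers only the five chillu letters U+0D7A to U+0D7E. Sequences after other consonants, and zero width non-joiners, are left untouched.

```python
import unicodedata

VIRAM = "\u0d4d"  # Malayalam sign virama
ZWJ = "\u200d"    # zero width joiner

# Assumed mapping: base consonant -> atomic chillu letter.
CHILLU = {
    "\u0d23": "\u0d7a",  # nna -> chillu nn
    "\u0d28": "\u0d7b",  # na  -> chillu n
    "\u0d30": "\u0d7c",  # ra  -> chillu rr
    "\u0d32": "\u0d7d",  # la  -> chillu l
    "\u0d33": "\u0d7e",  # lla -> chillu ll
}


def normalize_chillu(text):
    """Apply NFC, then replace consonant + viram + ZWJ with the chillu letter."""
    text = unicodedata.normalize("NFC", text)
    for base, chillu in CHILLU.items():
        text = text.replace(base + VIRAM + ZWJ, chillu)
    return text
```

Writing the characters as escape sequences keeps the invisible joiners visible in the source code, which makes the table easier to review.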
Morphology
Tags
- Malayalam uses all the 17 POS tags, including particles (PART).
- The noun tag NOUN is intended for common nouns only. Abstract nouns are also tagged NOUN. The nouns in Malayalam are marked for case and number.
- Proper nouns include the name of a specific individual, place, or object and are tagged PROPN.
- Pronouns are tagged PRON. The nominal reflexive താൻ / tān is also tagged PRON.
- The numeral “one” which functions as the indefinite article is tagged DET. For example, ഒരു വീട് / oru vīṭ “a house”. Quantifiers like ഒരുപാട് / orupāṭ “a lot” that act as modifiers are also tagged DET.
- Cardinal numbers (including ഒരു / ŏru “one” if it denotes primarily quantity) are tagged NUM.
- The emphatic markers -ഏ / -ē and തന്നേ / tannē, the coordination clitics -ഉം / -um and -ഓ / -ō, and the quotative particle എന്ന് / enn are tagged PART.
- The tag ADJ covers both free adjectives, such as പഴയ / paḻaya “old”, and derived adjectives, such as സന്തോഷകരമായ / santōṣakaramāya “pleasant”.
- The tag ADV covers adverbs like സങ്കടത്തോടെ / saṅkaṭattōṭe “sadly”, തീർച്ചയായും / tīṟ̕ccayāyuṁ “certainly”.
- Finite and nonfinite verb forms are tagged VERB or AUX.
- Malayalam has the following auxiliary verbs AUX:
- ആക് / āk “to be” is used as a copula to denote existential and stative meanings. It can also function as a lexical verb conveying the meanings of “to have”, “to take place”, “to be able to”.
- ഉണ്ട് / uṇṭ “to be” is used as a copula to denote existential and stative meanings, but it additionally has a possessive meaning, “to have”.
- Modal auxiliaries:
- കഴിയുക / kaḻiyuka “to be able, can”
- വേണം / vēṇaṁ “want”
- There are four main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
- Finite verb Fin, tagged VERB or AUX. It is marked for Tense and it can occur in the main clause without an auxiliary.
- Nominalized form of a verb is annotated as Vnoun with the UPOS VERB. These forms (sometimes also called gerunds in the literature) end in -ത് / -t. Despite being nominalized, they are marked for Tense and assign the nominative case to their subjects. They occur with the auxiliary ആക് / āk “to be”.
- Infinitive Inf, tagged VERB or AUX.
- Participle Part, tagged VERB or AUX.
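The pairing of VerbForm values and UPOS tags listed above can be checked mechanically. The following is a minimal sketch; the table name `ALLOWED_UPOS` and the function `verbform_is_consistent` are hypothetical and only encode the four forms as described in this section.

```python
# UPOS tags permitted for each VerbForm value, as listed in the guidelines.
ALLOWED_UPOS = {
    "Fin": {"VERB", "AUX"},
    "Vnoun": {"VERB"},  # nominalized forms are tagged VERB only
    "Inf": {"VERB", "AUX"},
    "Part": {"VERB", "AUX"},
}


def verbform_is_consistent(upos, verbform):
    """Return True if the UPOS tag is allowed for the given VerbForm value.

    VerbForm values outside the four listed forms are reported as inconsistent.
    """
    return upos in ALLOWED_UPOS.get(verbform, set())
```

For example, `verbform_is_consistent("AUX", "Vnoun")` is false, because nominalized forms carry the UPOS tag VERB.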
Nominal Features
- Inherent Gender of nouns determines which personal pronoun can refer to the noun, and it is sometimes reflected as agreement on adjectives. It is not reflected on verbs (unlike in related Tamil). We do not annotate the gender of nouns in data but we do so for third-person pronouns with one of three values: Masc, Fem or Neut.
- Like Gender, Animacy is also an inherent feature of nominal words (NOUN, PROPN, and PRON). It has two values: Anim and Inan. Animacy is grammatically relevant because inanimate nouns may occur without accusative marking when used as direct objects. Animates include nouns denoting persons, animals, or trees.
- Animacy aligns with gender only partially. Masculine and feminine third person pronouns refer to persons and are perceived as animate. Neuter pronouns can be animate if referring to animals or plants, and inanimate otherwise. For inanimates, the accusative form is equal to the nominative (അത് / at “it”), while for animates it uses a separate form (അതിനെ / atine “it”).
- We annotate the animacy of third person neuter pronouns but we omit the feature for other personal pronouns. We annotate the animacy of interrogative pronouns.
- The two values of Number are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON. There is no agreement in Number, that is, the number of nouns is not reflected in the form of verbs or adjectives.
- Case has 13 possible values: Nom, Acc, Gen, Dat, Ins, Loc, Abl, All, Cmp, Com, Ben, Cau, Voc. Malayalam is an agglutinative language and the spatiotemporal and/or case-like morphemes are analyzed as postpositions. The Case feature occurs with the nominal words, i.e., NOUN, PROPN, PRON, NUM and also with nominalized verb forms.
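The value inventories above lend themselves to a simple sanity check on the FEATS column of a CoNLL-U file. The sketch below assumes the standard `Name=Value|Name=Value` FEATS syntax with `_` for an empty set; the names `NOMINAL_VALUES`, `parse_feats` and `invalid_nominal_feats` are illustrative and not part of any UD tool.

```python
# Values listed in the guidelines for the nominal features.
NOMINAL_VALUES = {
    "Gender": {"Masc", "Fem", "Neut"},
    "Animacy": {"Anim", "Inan"},
    "Number": {"Sing", "Plur"},
    "Case": {"Nom", "Acc", "Gen", "Dat", "Ins", "Loc", "Abl",
             "All", "Cmp", "Com", "Ben", "Cau", "Voc"},
}


def parse_feats(feats):
    """Parse a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def invalid_nominal_feats(feats):
    """Return (feature, value) pairs whose value is not listed above.

    Features that are not nominal features (e.g. PronType) are ignored.
    """
    return [
        (name, value)
        for name, value in parse_feats(feats).items()
        if name in NOMINAL_VALUES and value not in NOMINAL_VALUES[name]
    ]
```

A string such as `Case=Nom|Number=Sing` yields an empty list, while a value outside the inventory, for instance `Number=Dual`, is reported.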
Verbal Features
- Finite verbs always have one of eight values of Mood: Ind, Irr, Cnd, Des, Nec, Imp, Prp or Opt.
- Verbs in the indicative mood always have one of three values of Tense: Past, Pres or Fut.
- Aspect has five possible values: Hab, Imp, Perf, Prog, Iter.
- Voice has three possible values: Act, Pass, Cau.
- Polarity has two values: Pos and Neg.
- Politeness must be distinguished in the imperative and has two values: Infm and Form. The verb stem serves as an informal imperative: തുറ / tuṟa “open”. The citation form may serve as a formal imperative: തുറക്കുക / tuṟakkuka “open”. Finally, there is another formal imperative with -kkū: തുറക്കൂ / tuṟakkū “open”.
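The dependencies between the verbal features (finite verbs need Mood, indicative verbs need Tense, imperatives need politeness) can be expressed as a small check. This is a sketch only: the function name `check_finite_verb` is hypothetical, features are passed as a plain dictionary, and politeness is assumed to be stored under the UD feature name `Polite`.

```python
MOODS = {"Ind", "Irr", "Cnd", "Des", "Nec", "Imp", "Prp", "Opt"}
TENSES = {"Past", "Pres", "Fut"}
POLITE = {"Infm", "Form"}


def check_finite_verb(feats):
    """Return a list of problems for a verb's feature dict.

    Only finite forms (VerbForm=Fin) are checked; other forms pass.
    """
    problems = []
    if feats.get("VerbForm") != "Fin":
        return problems
    mood = feats.get("Mood")
    if mood not in MOODS:
        problems.append("finite verb needs one of the eight Mood values")
    if mood == "Ind" and feats.get("Tense") not in TENSES:
        problems.append("indicative verb needs Tense=Past, Pres or Fut")
    if mood == "Imp" and feats.get("Polite") not in POLITE:
        problems.append("imperative needs Polite=Infm or Form")
    return problems
```

Returning a list of messages rather than a boolean makes it possible to report all missing features of a word at once.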
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns PRON.
- NumType is used with numerals NUM.
- Person is a lexical feature of personal pronouns (PRON) and has three values 1, 2 and 3.
- Clusivity distinguishes inclusive and exclusive 1st person plural pronouns.
- Deixis distinguishes proximate and remote demonstratives and 3rd person singular and plural pronouns.
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- Malayalam is a verb-final language; both SOV and OSV orders are possible.
- Core arguments are marked by the morphological cases nominative (subject) and accusative (object). Core arguments are bare noun phrases without postpositions. Neither subject nor object is cross-referenced by verbal morphology.
- Subjects have the following characteristics:
- Case marking: Subjects occur in nominative case without adpositions.
- Passivization: Subjects are suppressed when verbs are passivized.
- Objects have the following characteristics:
- Case marking: Objects occur in accusative case without adpositions.
- Passivization: Objects become (non-expletive) subjects when verbs are passivized.
- Non-nominative subjects are attached as nsubj.
- Adjuncts and non-core arguments are attached as obl.
Relations Overview
- nsubj:pass for nominal subjects in passive clauses.
- nmod:poss for possessive adjectives.
പെൺകുട്ടി തന്റെ സുഹൃത്തിന് കത്ത് എഴുതി .
nsubj(എഴുതി, പെൺകുട്ടി)
nmod:poss(സുഹൃത്തിന്, തന്റെ)
obl(എഴുതി, സുഹൃത്തിന്)
obj(എഴുതി, കത്ത്)
punct(എഴുതി, .)
‘The girl wrote a letter to her friend’
- mark for the quotative particle introducing a finite clause subordinate to another clause.
ആര് ആണ് എഴുതിയത് എന്ന് അവർക്ക് അറിയില്ല .
nsubj(എഴുതിയത്, ആര്)
cop:emph(എഴുതിയത്, ആണ്)
ccomp(അറിയില്ല, എഴുതിയത്)
mark(എഴുതിയത്, എന്ന്)
obl(അറിയില്ല, അവർക്ക്)
punct(അറിയില്ല, .)
‘They don’t know who wrote it’
- cop for the copular or the non-verbal predicates.
നിങ്ങൾ ഒരു വിദ്യാർത്ഥി ആണോ ?
nsubj(വിദ്യാർത്ഥി, നിങ്ങൾ)
det(വിദ്യാർത്ഥി, ഒരു)
cop(വിദ്യാർത്ഥി, ആണോ)
punct(വിദ്യാർത്ഥി, ?)
‘Are you a student?’
- cop:emph for the copula used for emphasis or focus shift.
ഇന്ന് ഇരുവര ഉം ഒന്നിച്ചുള്ള ആദ്യ ചിത്രമ് ആണ് നസ്രിയ പങ്കു വച്ചിരിക്കുന്നത് .
advmod(വച്ചിരിക്കുന്നത്, ഇന്ന്)
nmod(ചിത്രമ്, ഇരുവര)
advmod:emph(ഇരുവര, ഉം)
amod(ആദ്യ, ഒന്നിച്ചുള്ള)
compound(ചിത്രമ്, ആദ്യ)
obj(വച്ചിരിക്കുന്നത്, ചിത്രമ്)
cop:emph(ചിത്രമ്, ആണ്)
nsubj(വച്ചിരിക്കുന്നത്, നസ്രിയ)
compound(വച്ചിരിക്കുന്നത്, പങ്കു)
punct(വച്ചിരിക്കുന്നത്, .)
‘Today Nazriya has shared the first picture of the two together’
- cc for coordinating conjunctions.
അവൻ പുകവലി ഉം മദ്യപാനം _ഉം നിർത്താൻ ശ്രമിച്ചു .
nsubj(ശ്രമിച്ചു, അവൻ)
obj(നിർത്താൻ, പുകവലി)
cc(പുകവലി, ഉം)
conj(പുകവലി, മദ്യപാനം)
cc(മദ്യപാനം, _ഉം)
xcomp(ശ്രമിച്ചു, നിർത്താൻ)
punct(ശ്രമിച്ചു, .)
‘He tried to quit smoking and drinking’
- compound:svc for serial verb constructions with shared complements.
ഇന്ത്യയിൽ ഉം കഞ്ചാവ് നിയമവിധേയമാക്കണം എന്ന വാദങ്ങൾ ഉയർന്നു വരുന്നുണ്ട് .
obl(നിയമവിധേയമാക്കണം, ഇന്ത്യയിൽ)
advmod:emph(ഇന്ത്യയിൽ, ഉം)
nsubj(നിയമവിധേയമാക്കണം, കഞ്ചാവ്)
acl:relcl(വാദങ്ങൾ, നിയമവിധേയമാക്കണം)
mark(നിയമവിധേയമാക്കണം, എന്ന)
nsubj(വരുന്നുണ്ട്, വാദങ്ങൾ)
compound:svc(വരുന്നുണ്ട്, ഉയർന്നു)
punct(വരുന്നുണ്ട്, .)
‘Arguments to legalize cannabis are also emerging in India.’
- advcl for adverbial clauses.
രാഷ്ട്രപതി അംഗീകരിച്ചില്ലെങ്കിൽ സുപ്രീം കോടതിയെ സമീപിക്കാന ഉം തീരുമാനം .
nsubj(അംഗീകരിച്ചില്ലെങ്കിൽ, രാഷ്ട്രപതി)
advcl(തീരുമാനം, അംഗീകരിച്ചില്ലെങ്കിൽ)
nmod(കോടതിയെ, സുപ്രീം)
obj(സമീപിക്കാന, കോടതിയെ)
xcomp(തീരുമാനം, സമീപിക്കാന)
advmod:emph(സമീപിക്കാന, ഉം)
punct(തീരുമാനം, .)
‘Decision to approach the Supreme Court if the President does not agree.’
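The first example in this overview splits the orthographic token കത്തെഴുതി into the syntactic words കത്ത് and എഴുതി. In the CoNLL-U format, such a multiword token is recorded as a range line (here `4-5`) placed before the lines of its words. The following sketch shows how the relations of that example could be serialized; the function `to_conllu` and its input structures are hypothetical, the `root` relation for the main verb is added as standard UD practice, and all columns not discussed here are left as `_` placeholders, so the output is not a fully valid treebank entry.

```python
def to_conllu(words, mwts):
    """Serialize words into ten-column CoNLL-U lines.

    words: list of (form, head, deprel); heads are 1-based, 0 marks the root.
    mwts:  dict mapping the id of the first word of a multiword token
           to (id of its last word, surface form).
    """
    lines = []
    for idx, (form, head, deprel) in enumerate(words, start=1):
        if idx in mwts:
            last, surface = mwts[idx]
            # Range line: only ID and FORM are filled in.
            lines.append("\t".join([f"{idx}-{last}", surface] + ["_"] * 8))
        lines.append("\t".join(
            [str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"]
        ))
    return "\n".join(lines)


words = [
    ("പെൺകുട്ടി", 5, "nsubj"),
    ("തന്റെ", 3, "nmod:poss"),
    ("സുഹൃത്തിന്", 5, "obl"),
    ("കത്ത്", 5, "obj"),
    ("എഴുതി", 0, "root"),
    (".", 5, "punct"),
]
mwts = {4: (5, "കത്തെഴുതി")}

print(to_conllu(words, mwts))
```

The range line carries no syntactic annotation of its own; heads and relations are attached only to the syntactic words.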
Treebanks
There is 1 Malayalam UD treebank: