UD for Tamil
Tokenization and Word Segmentation
- Following most tokenization patterns, words are delimited by whitespace or punctuation.
- Multiword tokens are relatively common in Tamil. For example, the coordinating clitic -உம் / -um is analyzed as a separate syntactic word.
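As an illustration of how a multiword token is laid out in the CoNLL-U format, the sketch below separates surface tokens from syntactic words. The fragment is a made-up example (அவனும் "he too" split into அவன் + உம்), not a sentence from a Tamil treebank, and the function name `split_tokens_and_words` is hypothetical.

```python
# Hypothetical CoNLL-U fragment (not taken from a treebank): the surface token
# அவனும் "he too" is split into the syntactic words அவன் + உம்.
# Only ID and FORM are filled in; the other eight columns are left as "_".
SAMPLE = "\n".join([
    "1-2\tஅவனும்" + "\t_" * 8,
    "1\tஅவன்" + "\t_" * 8,
    "2\tஉம்" + "\t_" * 8,
])


def split_tokens_and_words(conllu_text):
    """Return (multiword_tokens, words) for one CoNLL-U sentence.

    multiword_tokens maps (start, end) ID ranges to surface forms;
    words maps integer IDs to syntactic word forms.
    """
    multiword_tokens = {}
    words = {}
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        if "-" in token_id:
            # A range ID such as "1-2" marks a multiword token line.
            start, end = token_id.split("-")
            multiword_tokens[(int(start), int(end))] = form
        elif "." in token_id:
            continue  # empty nodes (IDs like "1.1") are ignored here
        else:
            words[int(token_id)] = form
    return multiword_tokens, words
```

Calling `split_tokens_and_words(SAMPLE)` yields one multiword token spanning IDs 1 to 2 and two syntactic words, the second of which is the clitic உம்.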
Morphology
Tags
- Tamil uses 14 universal tags (SCONJ, INTJ, and SYM do not occur in the corpus at present).
- Auxiliary verbs (AUX) include:
  - போ / po “go” for future tense, follows the infinitive of the main verb
  - மாட்டேன் / māṭṭen “will not” (lemma மாட்டு māṭṭu) for negative future tense with human subject
  - படு / paṭu “experience” for the passive voice
  - வை / vai “put” for the causative voice
  - இல் / il (இல்லை / illai) “not be” for negation
  - உள் / uḷ “within”, இரு / iru “be”, வரு / varu “come”, கொள் / kòḷ “take”, செய் / cèy “do”, விடு / viṭu “let”, வா / vā “come”
  - வேண்டு / veṇṭu “must”
  - முடியும் / muṭiyum “can” (lemma முடி muṭi): modal auxiliary, follows the infinitive of the main verb
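The figure of 14 tags follows from removing the three unattested tags from the 17 universal part-of-speech tags. A minimal Python sketch of that arithmetic; the variable names are illustrative and not part of any UD tool.

```python
# The 17 universal part-of-speech tags defined by Universal Dependencies.
UNIVERSAL_POS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# Tags reported above as absent from the Tamil corpus at present.
UNATTESTED_IN_TAMIL = {"SCONJ", "INTJ", "SYM"}

# Set difference gives the tag inventory actually used for Tamil.
TAMIL_POS_TAGS = UNIVERSAL_POS_TAGS - UNATTESTED_IN_TAMIL
```

`len(TAMIL_POS_TAGS)` is 14, and `"AUX"` remains in the set, consistent with the auxiliary verbs listed above.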
Features
- 7 cases are annotated as morphological features of nouns: nominative, genitive, dative, accusative, instrumental, comitative, locative. Tamil is an agglutinating language and other spatiotemporal and/or case-like morphemes may be analyzed as postpositions.
- Verbs occur as finite forms, participles, infinitives, and gerunds.
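To make the case inventory concrete, the sketch below reads the `Case` feature out of a CoNLL-U FEATS string. It assumes the standard UD value abbreviations for the seven cases (Nom, Gen, Dat, Acc, Ins, Com, Loc); the helper names and the sample feature strings are hypothetical.

```python
# The seven cases named above, keyed by the standard UD Case values.
TAMIL_CASES = {
    "Nom": "nominative",
    "Gen": "genitive",
    "Dat": "dative",
    "Acc": "accusative",
    "Ins": "instrumental",
    "Com": "comitative",
    "Loc": "locative",
}


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Dat|Number=Sing' into a dict."""
    if feats == "_":
        return {}  # "_" means the word has no morphological features
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def case_name(feats):
    """Return the full name of the word's case, or None if there is none."""
    value = parse_feats(feats).get("Case")
    return TAMIL_CASES.get(value)
```

For example, `case_name("Case=Dat|Number=Sing")` returns `"dative"`, while a word with FEATS `_` yields `None`.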
Syntax
- Tamil is a verb-final language; both SOV and OSV orders are possible.
- Core arguments are marked by the morphological cases nominative (subject) and accusative (object). Core arguments are bare noun phrases without postpositions.
- Subjects have the following characteristics:
  - Case marking: Subjects occur in nominative case without adpositions.
  - Passivization: Subjects are suppressed when verbs are passivized.
- Objects have the following characteristics:
  - Case marking: Objects occur in accusative case without adpositions.
  - Passivization: Objects become (non-expletive) subjects when verbs are passivized.
- Bare nominal arguments (i.e., verb-licensed dependents) in the dative case are not considered core arguments. They are attached as obl:arg.
- Postpositional arguments (i.e., verb-licensed dependents) are not considered core arguments. They are attached as obl:arg.
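The case-based rules above can be summarised as a small lookup. This is a deliberately simplified sketch: it covers only bare, verb-licensed nominal dependents, the function name `expected_relation` is hypothetical, and real annotation depends on more than the case value.

```python
def expected_relation(case, passive=False):
    """Map the Case value of a bare verb-licensed nominal to a relation.

    Simplified reading of the rules above; returns None for cases
    for which the text states no rule.
    """
    if case == "Nom":
        # Nominative marks the subject; in a passive clause the subject
        # is attached with the nsubj:pass subtype.
        return "nsubj:pass" if passive else "nsubj"
    if case == "Acc":
        return "obj"  # accusative marks the object
    if case == "Dat":
        return "obl:arg"  # dative arguments are oblique, not core
    return None
```

Under these assumptions, `expected_relation("Dat")` gives `"obl:arg"` and `expected_relation("Nom", passive=True)` gives `"nsubj:pass"`.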
Tamil uses 4 relation subtypes:
- advmod:emph for adverbials emphasizing noun phrases
- compound:prt to attach verbal particles to verbs
- nsubj:pass for nominal subjects in passive clauses
- obl:arg for oblique arguments (to distinguish them from other oblique dependents, i.e., adjuncts)
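A relation label with a subtype has the form `universal:subtype`. The sketch below splits such labels and checks them against the four subtypes listed above; the set of core relations is the general UD inventory of core argument relations, and the helper names are hypothetical.

```python
# The four relation subtypes listed above for Tamil.
TAMIL_SUBTYPES = {"advmod:emph", "compound:prt", "nsubj:pass", "obl:arg"}

# Core argument relations in the general UD taxonomy.
CORE_RELATIONS = {"nsubj", "obj", "iobj", "csubj", "ccomp", "xcomp"}


def split_deprel(deprel):
    """Split 'nsubj:pass' into ('nsubj', 'pass'); subtype is None if absent."""
    universal, _, subtype = deprel.partition(":")
    return universal, (subtype or None)


def is_known_label(deprel):
    """True for a plain universal relation or one of the listed subtypes."""
    _, subtype = split_deprel(deprel)
    return subtype is None or deprel in TAMIL_SUBTYPES


def is_core_argument(deprel):
    """True if the universal part of the label is a core argument relation."""
    return split_deprel(deprel)[0] in CORE_RELATIONS
```

With these definitions, `nsubj:pass` counts as a core argument, whereas `obl:arg` does not, which matches the treatment of dative arguments as oblique.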
References
- See also http://www.southasia.sas.upenn.edu/tamil/grammar/tamilgrammar12.html
- Tamil at the Language Gulper
Treebanks
There are two Tamil UD treebanks at present: