UD for Kurmanji
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Description of exceptions follows.
- According to typographical rules, many punctuation marks are attached to a neighboring word. We tokenize them as separate tokens (words), except in the following cases:
  - The period marking an abbreviation: Dr. “doctor” is one token.
  - The apostrophe (or occasionally a hyphen) is not treated as punctuation when it occurs between a number and its morphological suffix, as in 15’ê, 1932’an.
- There is a type of verb called ‘Lêkerên hevedudanî’, which is similar to English phrasal verbs. These verbs typically consist of two or three parts that are separated by spaces when written. However, in passive and causative forms, these parts are written together, without spaces.
- There are several closed classes of contractions that are treated as multi-word tokens and segmented to individual syntactic words. The most prominent type is a pronoun fused with the future auxiliary: ezê = ez + dê “I will”.
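The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the treebank's actual tokenizer: the names `ABBREVIATIONS`, `CONTRACTIONS`, `split_chunk` and `tokenize` are hypothetical, the two lookup tables contain only the examples mentioned in this section, and every edge punctuation character is emitted as its own token. Apostrophes and hyphens inside a chunk are left alone, which keeps forms such as 15’ê intact; the rejoining of ‘Lêkerên hevedudanî’ parts is not covered.

```python
import re

# Assumption: tiny illustrative tables; real lists would be much longer.
ABBREVIATIONS = {"Dr."}
CONTRACTIONS = {"ezê": ["ez", "dê"]}  # pronoun + future auxiliary

# Leading punctuation, core, trailing punctuation of a whitespace-delimited chunk.
EDGE_PUNCT = re.compile(r"^(\W*)(.*?)(\W*)$")


def split_chunk(chunk):
    """Split punctuation off the edges of one whitespace-delimited chunk."""
    lead, core, trail = EDGE_PUNCT.match(chunk).groups()
    # Keep the abbreviation period attached: "Dr." stays one token.
    if trail.startswith(".") and core + "." in ABBREVIATIONS:
        core, trail = core + ".", trail[1:]
    tokens = list(lead)      # each leading punctuation mark is its own token
    if core:
        tokens.append(core)  # internal apostrophes/hyphens are not split
    tokens.extend(trail)     # each trailing punctuation mark is its own token
    return tokens


def tokenize(text):
    """Return (surface_token, syntactic_words) pairs for a text."""
    result = []
    for chunk in text.split():
        for token in split_chunk(chunk):
            # Case-insensitive lookup; the expansion is returned in lower case.
            words = CONTRACTIONS.get(token.lower(), [token])
            result.append((token, words))
    return result
```

For the input string `"Dr. 15’ê ezê,"` this sketch keeps `Dr.` and `15’ê` as single tokens, expands `ezê` into the two syntactic words `ez` and `dê`, and splits off the comma.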
Morphology
Tags
- Kurmanji uses all 17 universal POS categories, including particles (PART). Only 2 word types are tagged PART: jî “also”, ma.
- Kurmanji has four auxiliaries; three of them inflect like verbs (and can act as full verbs depending on context), while dê is an uninflected particle:
  - The copula bûn “to be”.
  - The future tense marker dê.
  - The passive auxiliary hatin “to come” (it combines with an infinitive of the lexical verb).
  - The causative auxiliary dan “to give” (it combines with an infinitive of the lexical verb).
- Verbs with modal meaning are not considered auxiliary in Kurmanji.
- There are four main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
Nominal Features
- Nominal words (NOUN, PROPN) have an inherent Gender feature with one of two values: Masc or Fem. The gender of the referent is reflected by PRON and DET.
- The two values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, DET, VERB, AUX, marginally NUM.
- Case has 4 possible values: Nom, Acc, Con, Voc. It occurs with the nominal words, i.e., NOUN, PROPN, PRON, ADJ, DET, NUM.
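As an illustration, the value inventories above can be turned into a small consistency check over CoNLL-U style FEATS strings (`Name=Value` pairs joined by `|`). This is only a sketch under stated assumptions: `NOMINAL_FEATURES`, `parse_feats` and `check_nominal_feats` are hypothetical names, Gender is allowed on PRON and DET because the text says they reflect the gender of the referent, and multi-valued features (comma-separated values) are not handled.

```python
# Allowed values and host parts of speech, transcribed from the list above.
NOMINAL_FEATURES = {
    "Gender": {
        "values": {"Masc", "Fem"},
        "upos": {"NOUN", "PROPN", "PRON", "DET"},  # PRON/DET: assumption
    },
    "Number": {
        "values": {"Sing", "Plur"},
        "upos": {"NOUN", "PROPN", "PRON", "DET", "VERB", "AUX", "NUM"},
    },
    "Case": {
        "values": {"Nom", "Acc", "Con", "Voc"},
        "upos": {"NOUN", "PROPN", "PRON", "ADJ", "DET", "NUM"},
    },
}


def parse_feats(feats):
    """Parse a FEATS string such as 'Case=Nom|Gender=Fem' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_nominal_feats(upos, feats):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for name, value in parse_feats(feats).items():
        spec = NOMINAL_FEATURES.get(name)
        if spec is None:
            continue  # not a nominal feature; out of scope for this sketch
        if upos not in spec["upos"]:
            problems.append(f"{name} is not expected on {upos}")
        if value not in spec["values"]:
            problems.append(f"{name}={value} is not a listed value")
    return problems
```

Calling `check_nominal_feats("NOUN", "Case=Nom|Gender=Fem|Number=Sing")` returns an empty list, while a Case feature on an ADV or a value outside the four listed cases is reported.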
Degree and Polarity
- Degree applies to adjectives (ADJ) and has one of three possible values: Pos, Cmp, Sup. For example, zêde “a lot of”, zêdetir “more”, zêdetirîn “most”.
- Polarity has one value, Neg (while Pos is not marked explicitly), and applies primarily to verbs (VERB, AUX), determiners (DET) and adverbs (ADV).
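The zêde / zêdetir / zêdetirîn example suggests a simple suffix heuristic for Degree, sketched below. It assumes that the -tir and -tirîn endings seen in that one example generalize, which this section does not state; any adjective whose base form happens to end in -tir would be misclassified. The function names `guess_degree` and `polarity_feature` are hypothetical. The second function mirrors the convention that only negative polarity is written out.

```python
def guess_degree(adjective):
    """Guess the Degree value of an adjective from its ending (heuristic)."""
    # Check the longer suffix first, since "-tirîn" also contains "-tir".
    if adjective.endswith("tirîn"):
        return "Sup"
    if adjective.endswith("tir"):
        return "Cmp"
    return "Pos"


def polarity_feature(is_negated):
    """Only negative polarity is written; positive polarity stays unmarked."""
    return "Polarity=Neg" if is_negated else None
```

With the forms from the example, `guess_degree` maps zêde to `Pos`, zêdetir to `Cmp` and zêdetirîn to `Sup`.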
Verbal Features
- Aspect is either Perf (perfective) or Prog (progressive); it can also be unmarked.
- Finite verbs always have one of four values of Mood: Ind, Imp, Opt or Sub.
- Verbs in the indicative mood always have one of four values of Tense: Pqp, Past, Pres or Fut.
- Evident (evidentiality) has only one value, Nfh (non-first-hand).
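The two "always" statements above are implications that can be checked mechanically. The following sketch assumes that finiteness is recorded as `VerbForm=Fin`, the usual UD convention, which this section does not spell out; `check_verbal_feats` and the constant names are hypothetical.

```python
# Value inventories transcribed from the list above.
ALLOWED = {
    "Aspect": {"Perf", "Prog"},
    "Mood": {"Ind", "Imp", "Opt", "Sub"},
    "Tense": {"Pqp", "Past", "Pres", "Fut"},
    "Evident": {"Nfh"},
}


def check_verbal_feats(feats):
    """Check a dict of verbal features; return a list of problems found."""
    problems = []
    # 1. Every listed feature that is present must carry a listed value.
    for name, allowed in ALLOWED.items():
        if name in feats and feats[name] not in allowed:
            problems.append(f"{name}={feats[name]} is not a listed value")
    # 2. Finite verbs always have Mood (assumption: finite means VerbForm=Fin).
    if feats.get("VerbForm") == "Fin" and "Mood" not in feats:
        problems.append("finite verb without Mood")
    # 3. Indicative verbs always have Tense.
    if feats.get("Mood") == "Ind" and "Tense" not in feats:
        problems.append("indicative verb without Tense")
    return problems
```

Aspect and Evident are optional in this check, matching the statement that aspect can be unmarked.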
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON), determiners (DET) and adverbs (ADV).
- NumType is used with numerals (NUM – only Card).
- The Reflex feature marks reflexive pronouns (xwe).
- Person is a lexical feature of personal pronouns (PRON) and has three values: 1, 2 and 3. Person is not marked on other types of pronouns or on nouns, although they can almost always be interpreted as 3rd person.
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a noun phrase in the nominative case, without preposition.
- Objects may be bare noun phrases in the accusative (oblique) case.
Non-verbal Clauses
- The copula verb bûn “to be” is used in equational, attributional, locative, possessive and benefactive nonverbal clauses.
Treebanks
There is 1 Kurmanji UD treebank: