UD for Kazakh
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. The exceptions are described below.
- According to typographical rules, many punctuation marks are attached to a neighboring word. These are normally tokenized as separate tokens (words), with the following exceptions:
- The period that marks an abbreviation is part of the abbreviation token: млн. “million”
- The hyphen that attaches a morphological suffix to a number is not a token separator: 100-ге, 19,4°С-қа
- Hyphenated compounds are also kept as a single token: Премьер-Министрге “to the Prime Minister”
- There are a few instances of multi-word tokens that are segmented into individual syntactic words.
- On the other hand, a few closed classes of words can contain spaces. The segment after the space is most often жоқ, емес or екен.
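The punctuation rules above can be illustrated with a minimal Python sketch. It is not the tokenizer used for any treebank: the function name `tokenize`, the abbreviation list (only млн. comes from the text) and the set of edge punctuation are assumptions made for illustration. Multi-word tokens and words containing spaces are not handled.

```python
# Hypothetical abbreviation list; the overview only mentions "млн.".
ABBREVIATIONS = {"млн."}

# Punctuation split off from the edges of a whitespace-delimited chunk.
# The hyphen is deliberately absent, so word-internal hyphens
# (100-ге, Премьер-Министрге) never act as token separators.
EDGE_PUNCT = set('.,;:!?()«»"“”')


def tokenize(text):
    """Split text on whitespace, then peel punctuation off chunk edges."""
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        # Peel punctuation off the left edge.
        while chunk and chunk[0] in EDGE_PUNCT:
            leading.append(chunk[0])
            chunk = chunk[1:]
        # Peel punctuation off the right edge, but stop as soon as the
        # remainder is a known abbreviation, which keeps its period.
        while chunk and chunk[-1] in EDGE_PUNCT and chunk not in ABBREVIATIONS:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))
    return tokens


print(tokenize("100-ге, 19,4°С-қа."))
# ['100-ге', ',', '19,4°С-қа', '.']
```

Only punctuation at the edges of a chunk is split off, which is why the decimal comma in 19,4 and the hyphens stay inside their tokens. A sentence-final abbreviation is ambiguous under this scheme, since its period is never emitted as a separate token.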
Morphology
Tags
This is an overview only. For more detailed discussion and examples, see the list of Kazakh POS tags and Kazakh features.
- Kazakh uses all 17 universal POS categories, including particles (PART). At present, only 6 word types are tagged PART: ма, ау, шығар, шы, ғой, ше.
- There is a large number of constructions where a semantically weak verb combines with a non-finite form (infinitive or converb) of a lexically prominent verb. Traditional grammatical descriptions of Kazakh would label them as auxiliary constructions. However, most of them do not fall under the AUX category in UD. Instead, the non-finite form of the lexical verb is attached to the semantically weak verb via the xcomp relation.
- Nevertheless, some verbs are tentatively categorized as AUX even in UD:
- The copula бол or е (two lemmas with deficient paradigms).
- The verbs that, when combined with the -ip infinitive, render the progressive aspect: жат, жүр, қал, отыр, тұр.
- The durative бер.
- The optative кел.
- The potential modal auxiliary ал “can”.
- There are five main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
Nominal Features
- There is no grammatical gender in Kazakh; however, personal names (PROPN) are annotated with the Gender feature as either Masc or Fem.
- The two values of the Number feature are Sing and Plur. For NOUN, PROPN and ADJ, only the Plur value is used if the plural suffix is present; the singular is unmarked and unannotated. Pronouns (PRON) have both values and they are treated as lexical, that is, the plural pronoun has its own lemma, distinct from the corresponding singular pronoun. Finite verbs (VERB and AUX) cross-reference the person and number of the subject. They annotate both singular and plural.
- Case has 7 possible values: Nom, Gen, Dat, Acc, Loc, Abl, Ins. It occurs with the nominal words, i.e., NOUN, PROPN, PRON, ADJ, NUM, as well as gerunds and participles (VERB, AUX).
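As a rough illustration, the nominal feature conventions in this section can be turned into a small consistency check over a word's UPOS tag and its CoNLL-U FEATS string. This is a sketch under assumptions: the function names `parse_feats` and `check_nominal` are hypothetical, Gender is read as being restricted to PROPN, and the check does not verify that a case-marked VERB or AUX is really a gerund or participle.

```python
CASE_VALUES = {"Nom", "Gen", "Dat", "Acc", "Loc", "Abl", "Ins"}
# UPOS tags on which Case may occur, per this section.
CASE_UPOS = {"NOUN", "PROPN", "PRON", "ADJ", "NUM", "VERB", "AUX"}
# UPOS tags on which the singular is left unannotated.
PLUR_ONLY_UPOS = {"NOUN", "PROPN", "ADJ"}


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Dat|Number=Plur' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_nominal(upos, feats):
    """Return a list of problems; the list is empty if none were found."""
    f = parse_feats(feats)
    problems = []
    if "Gender" in f:
        if upos != "PROPN":
            problems.append("Gender is only expected on PROPN")
        elif f["Gender"] not in {"Masc", "Fem"}:
            problems.append("unknown Gender value: " + f["Gender"])
    if "Number" in f:
        if f["Number"] not in {"Sing", "Plur"}:
            problems.append("unknown Number value: " + f["Number"])
        elif f["Number"] == "Sing" and upos in PLUR_ONLY_UPOS:
            problems.append("singular is left unannotated on " + upos)
    if "Case" in f:
        if f["Case"] not in CASE_VALUES:
            problems.append("unknown Case value: " + f["Case"])
        elif upos not in CASE_UPOS:
            problems.append("Case is not expected on " + upos)
    return problems


print(check_nominal("NOUN", "Case=Dat|Number=Plur"))  # []
print(check_nominal("NOUN", "Number=Sing"))
# ['singular is left unannotated on NOUN']
```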
Degree and Polarity
- Degree applies to adjectives (ADJ) and adverbs (ADV) and has only one value: Cmp. The basic (positive) form is unmarked and unannotated.
- Polarity applies to verbs (VERB, AUX) and has only one value: Neg. The basic (positive) form is unmarked and unannotated.
Verbal Features
- Finite verbs are normally annotated with the habitual Aspect (Hab). Other values (Imp, Perf) can be observed with infinitives and converbs.
- Finite verbs always have one of five values of Mood: Ind, Imp, Opt, Pot or Des. The conditional mood (Cnd) is only used with conditional converbs.
- Verbs in the indicative mood always have one of two values of Tense: Past, Pres. The future tense (Fut) may occur with participles.
- The Evident feature (evidentiality) distinguishes non-first-hand past tense (Nfh, e.g., болыпты) from evidentiality-neutral forms (unmarked, e.g., болды).
- There are two values of the Voice feature: Pass and Rcp. The basic (active) form of the verb is unmarked and unannotated.
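The Aspect, Mood, Tense and Voice conventions above could be checked along the following lines. This is a hedged sketch, not a validator shipped with UD: the function name `check_verbal` is hypothetical, and identifying converbs through VerbForm=Conv is an assumption, since this overview does not list the VerbForm values used in Kazakh.

```python
# Value inventories as listed in this section.
ALLOWED = {
    "Aspect": {"Hab", "Imp", "Perf"},
    "Mood": {"Ind", "Imp", "Opt", "Pot", "Des", "Cnd"},
    "Tense": {"Past", "Pres", "Fut"},
    "Voice": {"Pass", "Rcp"},
}


def check_verbal(feats):
    """Check a dict of verbal features; return a list of problems."""
    problems = []
    # Every listed feature must take a value from its inventory.
    for name, value in feats.items():
        if name in ALLOWED and value not in ALLOWED[name]:
            problems.append(f"{name}={value} is not in the listed inventory")
    # Indicative verbs always carry Tense=Past or Tense=Pres.
    if feats.get("Mood") == "Ind" and feats.get("Tense") not in {"Past", "Pres"}:
        problems.append("indicative verbs need Tense=Past or Tense=Pres")
    # Assumption: conditional converbs are marked VerbForm=Conv.
    if feats.get("Mood") == "Cnd" and feats.get("VerbForm") != "Conv":
        problems.append("Mood=Cnd is expected only on converbs")
    return problems


print(check_verbal({"Mood": "Ind", "Tense": "Past", "Aspect": "Hab"}))  # []
print(check_verbal({"Voice": "Act"}))
# ['Voice=Act is not in the listed inventory']
```

The second call flags Voice=Act because the active voice is left unannotated rather than marked explicitly.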
Syntax
This is an overview only. For more detailed discussion and examples, see the list of Kazakh relations.
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a noun phrase in the nominative case, without adposition.
- A subordinate clause may serve as the subject and is labeled csubj.
- Object (obj) is a noun phrase without adposition and typically in the accusative case, although it can also be nominative or dative.
Relations Overview
- The following relation subtypes are used in Kazakh:
- acl:relcl for relative clauses
- nmod:poss for possessive and genitive modifiers
- obl:own for locative nominals that denote owners in constructions with the have-meaning
- compound:lvc for light-verb constructions
- flat:name to connect parts of a person name
- The following main types are not used alone and must be subtyped: flat
- The following relation types are not used in Kazakh at all: expl, dislocated
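The three lists above amount to a simple lookup, sketched here in Python. The function name `check_deprel` and the message wording are illustrative assumptions, and the sketch does not verify that the base label is a valid universal relation.

```python
# Relation subtypes listed as used in Kazakh.
SUBTYPES = {"acl:relcl", "nmod:poss", "obl:own", "compound:lvc", "flat:name"}
# Main types that must always carry a subtype.
MUST_SUBTYPE = {"flat"}
# Relation types not used in Kazakh at all.
UNUSED = {"expl", "dislocated"}


def check_deprel(deprel):
    """Return a problem description, or None if the label looks acceptable."""
    base = deprel.split(":", 1)[0]
    if base in UNUSED:
        return f"{base} is not used in Kazakh"
    if deprel in MUST_SUBTYPE:
        return f"{deprel} must be subtyped"
    if ":" in deprel and deprel not in SUBTYPES:
        return f"{deprel} is not a listed subtype"
    return None


print(check_deprel("flat:name"))  # None
print(check_deprel("flat"))       # flat must be subtyped
```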
Treebanks
There is 1 Kazakh UD treebank: