UD for Karelian
Tokenization and Word Segmentation
- Tokenisation follows the standard whitespace-delimited approach, with punctuation split off into separate tokens.
- Punctuation stays part of the token in ordinals written with digits (“123.”) and in some abbreviations.
- The initial tokenisation of the only treebank available so far was produced with a morphological analyser from Giellatekno and tooling from Apertium.
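The rules above can be illustrated with a minimal Python sketch. It is not the Giellatekno/Apertium pipeline used for the treebank; the function name `tokenize`, the regular expressions and the `abbreviations` parameter are assumptions made for illustration, and the abbreviation list must be supplied by the caller.

```python
import re

# Digits followed by a full stop, e.g. "123." (ordinal written with digits).
ORDINAL = re.compile(r"\d+\.")
# Either a run of word characters or one punctuation character.
PIECES = re.compile(r"\w+|[^\w\s]")


def tokenize(text, abbreviations=frozenset()):
    """Split on whitespace, then detach punctuation from each chunk.

    Chunks that are digit ordinals or listed abbreviations are kept whole.
    `abbreviations` is a hypothetical, caller-supplied set of strings.
    """
    tokens = []
    for chunk in text.split():
        if ORDINAL.fullmatch(chunk) or chunk in abbreviations:
            tokens.append(chunk)
        else:
            tokens.extend(PIECES.findall(chunk))
    return tokens
```

This sketch cannot tell an ordinal from a number that merely ends a sentence, and it also splits hyphenated words; a real tokeniser would need lexical information to resolve such cases.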
Morphology
Tags
- Karelian uses all 17 universal POS categories.
- Karelian has the following auxiliaries:
  - “olla” (to be, also to own etc.)
  - “ei” (inflected negation verb)
  - modals: “voija” (can), “piteä” (must), … (the list will be extended as the corpora grow)
- For pro-adjectives etc., ADJ is used as the UPOS tag; similarly, ADV is used for pro-adverbs, and so forth.
- Ordinal numerals are tagged ADJ.
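As a rough illustration of the auxiliary list, the following Python sketch picks AUX or VERB for a verb lemma. The names `AUX_LEMMAS` and `upos_for_verb` are hypothetical, the set contains only the lemmas named above (the list is open-ended), and treating every occurrence of these lemmas as AUX is a simplifying assumption.

```python
# Auxiliary lemmas named in the guidelines; the list is expected to grow.
AUX_LEMMAS = {"olla", "ei", "voija", "piteä"}


def upos_for_verb(lemma):
    """Return "AUX" for a listed auxiliary lemma, otherwise "VERB".

    Simplification: context is ignored, so a listed lemma is always AUX.
    """
    return "AUX" if lemma in AUX_LEMMAS else "VERB"
```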
Features
Verbal Features
- There are three main verbal forms, distinguished by the value of the VerbForm feature.
- Mood has four values: Cnd, Imp, Ind or Pot.
- Tense has two values: Past or Pres.
- Voice has two values: Act and Pass.
- Person has four values: 0, 1, 2 and 3.
- Number has the values Sing or Plur.
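The value inventories listed above can be checked mechanically against the FEATS column of a CoNLL-U file. The sketch below is only an illustration: the names `VERBAL_FEATURES`, `parse_feats` and `invalid_verbal_features` are hypothetical, VerbForm is left out because its values are not listed here, and comma-separated multi-values are not handled.

```python
# Allowed values per verbal feature, as listed in this section.
VERBAL_FEATURES = {
    "Mood": {"Cnd", "Imp", "Ind", "Pot"},
    "Tense": {"Past", "Pres"},
    "Voice": {"Act", "Pass"},
    "Person": {"0", "1", "2", "3"},
    "Number": {"Sing", "Plur"},
}


def parse_feats(feats):
    """Turn a FEATS string such as "Mood=Ind|Tense=Pres" into a dict."""
    if feats == "_":  # CoNLL-U marks an empty feature set with "_"
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def invalid_verbal_features(feats):
    """Return the verbal features whose value is not in the inventory."""
    return {
        name: value
        for name, value in parse_feats(feats).items()
        if name in VERBAL_FEATURES and value not in VERBAL_FEATURES[name]
    }
```

Features outside the table (for example Case) are ignored rather than reported, so the check stays limited to what this section documents.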
Nominal Features
- Karelian does not use the Gender feature.
- The Number feature has two possible values: Sing and Plur.
- Case has 15 possible values: Abe, Abl, Acc, Ade, All, Com, Ela, Ess, Gen, Ill, Ine, Ins, Nom, Par, Tra.
Degree and Polarity
- Degree applies to adjectives (ADJ), adverbs (ADV) and participles (VERB or AUX), and has one of three possible values: Pos, Cmp, Sup.
- Polarity has only the value Neg, and applies to the negative verb ‘ei’.
- Connegative has only the value Yes, and applies to verbs that have been negated by ‘ei’.
Possessives
- Layered features are used for possessive suffixes:
  - Number[psor] for the number and Person[psor] for the person of the possessor.
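To show how the layered features differ from the plain ones, here is a small Python sketch that extracts the possessor layer from a FEATS string. The function name `possessor_features` and the regular expression are assumptions for illustration; the FEATS string in the usage note is a constructed example, not taken from the treebank.

```python
import re

# Matches layered feature names such as "Number[psor]".
LAYERED = re.compile(r"^(?P<name>[A-Za-z]+)\[(?P<layer>[a-z]+)\]$")


def possessor_features(feats):
    """Collect the [psor]-layer features of a FEATS string into a dict."""
    result = {}
    if feats == "_":  # empty feature set
        return result
    for pair in feats.split("|"):
        key, value = pair.split("=", 1)
        match = LAYERED.match(key)
        if match and match.group("layer") == "psor":
            result[match.group("name")] = value
    return result
```

For a constructed string like `Case=Ine|Number=Sing|Number[psor]=Sing|Person[psor]=1`, the plain `Number` describes the noun itself, while the returned dictionary holds only the number and person of the possessor.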
Syntax
- The nominal subject (nsubj) is typically a nominal in the nominative, genitive or partitive case, without a preposition.
- Objects (obj) can be nominals in the nominative, genitive, partitive or accusative case.
- The copula verb olla (be) is used in equational, attributional, locative, possessive and benefactive nonverbal clauses.
  - For the possessive structure, the cop:own subtype is used.
- A genitive modifier uses the nmod:poss subtype.
- acl:relcl is used for relative clauses.
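The relation subtypes mentioned above (cop:own, nmod:poss, acl:relcl) sit in the DEPREL column, the eighth of the ten tab-separated CoNLL-U columns. The following Python sketch counts subtyped relations in CoNLL-U text; the names `split_deprel` and `subtype_counts` are hypothetical, and the code is a simple illustration rather than a replacement for a full CoNLL-U reader.

```python
from collections import Counter


def split_deprel(deprel):
    """Split "nmod:poss" into ("nmod", "poss"); subtype is None if absent."""
    base, _, subtype = deprel.partition(":")
    return base, subtype or None


def subtype_counts(conllu_text):
    """Count subtyped relations (e.g. "cop:own") over ordinary word lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges ("1-2"), empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        base, subtype = split_deprel(cols[7])
        if subtype is not None:
            counts[f"{base}:{subtype}"] += 1
    return counts
```

Running `subtype_counts` over a treebank file read with `open(path, encoding="utf-8").read()` gives a quick overview of how often each subtype occurs.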
Treebanks
There is one Karelian UD treebank: