UD for Estonian
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters.
- Multitoken words (syntactic words consisting of several orthographic words) include some foreign names, e.g. New York, Rio de Janeiro, and numerical expressions like 20 000.
- Punctuation marks are treated as separate tokens. The exceptions are ordinal numbers (1. jaanuar) and abbreviations (which can be written with or without a period). If an abbreviation ends a sentence, the period is treated as the end-of-sentence punctuation mark, not as part of the abbreviation.
- Emoticons (consisting mostly of punctuation marks) are single tokens.
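The rules above can be illustrated with a minimal Python sketch. It is not the tokenizer used for any Estonian treebank: the abbreviation list, the emoticon pattern and the function name `tokenize` are assumptions made for illustration, multitoken words (New York, 20 000) are not handled, and splitting a sentence-final "1." is an assumption that the guidelines state only for abbreviations.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (assumption, not from the guidelines).
ABBREVIATIONS = {"nt.", "jne.", "vt."}

# Characters that are split off as separate tokens at the edges of a chunk.
PUNCT = ".,;:!?()\"'"

# Very rough emoticon pattern such as :) ;-( =D (assumption, not from the guidelines).
EMOTICON = re.compile(r"^[:;=][-o^]?[()\[\]DPp/\\|]+$")

# Digits followed by a period, e.g. "1." in "1. jaanuar".
ORDINAL = re.compile(r"^\d+\.$")


def tokenize(sentence):
    """Split one sentence into tokens following the rules sketched above."""
    chunks = sentence.split()  # words are delimited by whitespace
    tokens = []
    for i, chunk in enumerate(chunks):
        is_last = i == len(chunks) - 1
        # Ordinals and abbreviations keep their period unless they end the sentence.
        keeps_period = not is_last and (
            ORDINAL.match(chunk) or chunk.lower() in ABBREVIATIONS
        )
        if EMOTICON.match(chunk) or keeps_period:
            tokens.append(chunk)
            continue
        leading, trailing = [], []
        while chunk and chunk[0] in PUNCT:
            leading.append(chunk[0])
            chunk = chunk[1:]
        while chunk and chunk[-1] in PUNCT:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(tokenize("Kohtume 1. jaanuaril, eks :)"))
print(tokenize("Ostsin õunu, pirne jne."))
```

In the second call the final period is split from "jne" because the abbreviation ends the sentence.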
Morphology
Tags
This is an overview only.
- Estonian uses 16 universal POS categories (all UD categories except PART).
- Estonian has the following auxiliary verbs: olema (“to be”); ei, ära (“not”); and the modal verbs võima, saama, pidama, tohtima, näima, paistma, tunduma. The modal verbs, except võima and tohtima, can also be used as main verbs, depending on the context.
The auxiliary verbs are used in several types of constructions:
- ei, ära:
  - negation of verb
- olema:
  - copula with non-verbal predicates
  - past tenses
- saama:
  - modal verb (+ infinitive)
  - future (+ supine, bad style)
  - passive (+ participle)
- võima, tohtima:
  - modal verb (+ infinitive)
- pidama:
  - modal verb (+ supine)
- näima, paistma, tunduma:
  - modal verb (+ vat-infinitive)
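As a reading aid, the construction list above can be written down as a small lookup table. This is only a sketch: the names `AUX_CONSTRUCTIONS` and `functions_of` are hypothetical, and the table contains nothing beyond what the list states.

```python
# Lemma -> list of (function, complement form) pairs, transcribed from the list above.
# A complement of None means the list names no complement form for that function.
AUX_CONSTRUCTIONS = {
    "ei": [("negation", None)],
    "ära": [("negation", None)],
    "olema": [("copula", None), ("past tenses", None)],
    "saama": [
        ("modal", "infinitive"),
        ("future", "supine"),
        ("passive", "participle"),
    ],
    "võima": [("modal", "infinitive")],
    "tohtima": [("modal", "infinitive")],
    "pidama": [("modal", "supine")],
    "näima": [("modal", "vat-infinitive")],
    "paistma": [("modal", "vat-infinitive")],
    "tunduma": [("modal", "vat-infinitive")],
}


def functions_of(lemma, complement=None):
    """Return the functions listed for an auxiliary lemma.

    If a complement form is given, only functions listed with that form are returned.
    Lemmas that are not in the table yield an empty list.
    """
    entries = AUX_CONSTRUCTIONS.get(lemma, [])
    if complement is None:
        return [function for function, _ in entries]
    return [function for function, form in entries if form == complement]


print(functions_of("saama"))
print(functions_of("saama", "supine"))
```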
- There are five main verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
Nominal Features
- Estonian does not have the Gender feature.
- The two main values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB, AUX (finite forms, participles and converbs), and marginally NUM.
- Case has 15 possible values: Nom, Gen, Par, Add, All, Ade, Abl, Ill, Ine, Ela, Tra, Ter, Ess, Abe, Com. Additive (Add) is a short form of the illative (Ill) and exists only in the singular. It occurs with nominal words, i.e., NOUN, PROPN, PRON, ADJ, DET, NUM. It can occur with participles, but only with those tagged as ADJ. The cases Abe, Ill, Ine, Ela and Tra occur with supines.
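The Case constraints above lend themselves to a simple consistency check over the FEATS column of a CoNLL-U file. The sketch below assumes the standard CoNLL-U encoding of FEATS (`Name=Value` pairs joined by `|`, or `_` when empty); the function names `parse_feats` and `check_case` are hypothetical, and the supine restriction is not checked.

```python
# The 15 Case values; Tra is the value named in the supine list above.
CASES = {
    "Nom", "Gen", "Par", "Add", "All", "Ade", "Abl", "Ill",
    "Ine", "Ela", "Tra", "Ter", "Ess", "Abe", "Com",
}

# Parts of speech with which the additive (Add) can occur.
NOMINAL_UPOS = {"NOUN", "PROPN", "PRON", "ADJ", "DET", "NUM"}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict; "_" means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_case(upos, feats):
    """Return a list of problems with the Case feature of one token."""
    features = parse_feats(feats)
    case = features.get("Case")
    problems = []
    if case is None:
        return problems
    if case not in CASES:
        problems.append(f"unknown Case value: {case}")
    if case == "Add":
        if features.get("Number") != "Sing":
            problems.append("Add exists only in the singular")
        if upos not in NOMINAL_UPOS:
            problems.append(f"Add does not occur with {upos}")
    return problems


print(check_case("NOUN", "Case=Add|Number=Sing"))
print(check_case("NOUN", "Case=Add|Number=Plur"))
```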
Degree and Polarity
- Degree applies to adjectives (ADJ) and has one of three possible values: Pos, Cmp, Sup.
- Polarity has only one value, Neg. It applies to the auxiliaries ‘ei’ and ‘ära’ if they form a negating predicate, and to the negative forms of olema: ‘pole’, ‘polnud’, ‘poldud’ etc.
- Connegative has only one value, Yes, and applies to verbs which have been negated by ‘ei’ or ‘ära’.
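A minimal sketch of the Polarity rule, assuming tokens are available as (form, lemma) pairs. The function name `polarity_feature` is hypothetical, the set of negative olema forms holds only the three forms quoted above (the text says "etc.", so it is not exhaustive), and the condition that ei and ära form a negating predicate is not checked.

```python
# Auxiliary lemmas that carry Polarity=Neg.
NEG_AUX_LEMMAS = {"ei", "ära"}

# Negative forms of olema quoted in the text; the real list is longer.
NEG_OLEMA_FORMS = {"pole", "polnud", "poldud"}


def polarity_feature(form, lemma):
    """Return "Polarity=Neg" for the negators described above, otherwise None."""
    if lemma in NEG_AUX_LEMMAS:
        return "Polarity=Neg"
    if lemma == "olema" and form.lower() in NEG_OLEMA_FORMS:
        return "Polarity=Neg"
    return None


for form, lemma in [("ei", "ei"), ("Pole", "olema"), ("on", "olema")]:
    print(form, polarity_feature(form, lemma))
```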
Verbal Features
- Finite verbs always have one of four values of Mood: Ind, Imp, Cnd or Qot.
- Verbs in the indicative mood always have one of two values of Tense: Past or Pres.
- There are two values of the Voice feature: Act and Pass. Impersonal verb forms have Voice=Pass. All other verb forms have Voice=Act.
- Person has three values: 1, 2 and 3.
- Number has values Sing or Plur.
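The dependencies between the verbal features can be expressed as a short check. This sketch assumes features are already parsed into a dict and that finiteness is marked with the universal value VerbForm=Fin; the function name `check_verb` is hypothetical.

```python
MOODS = {"Ind", "Imp", "Cnd", "Qot"}
TENSES = {"Past", "Pres"}
VOICES = {"Act", "Pass"}
PERSONS = {"1", "2", "3"}
NUMBERS = {"Sing", "Plur"}


def check_verb(feats):
    """Return a list of problems with the verbal features of one token."""
    problems = []
    # Finite verbs always have one of the four Mood values.
    if feats.get("VerbForm") == "Fin" and feats.get("Mood") not in MOODS:
        problems.append("finite verb without a valid Mood")
    # Indicative verbs always have Tense=Past or Tense=Pres.
    if feats.get("Mood") == "Ind" and feats.get("Tense") not in TENSES:
        problems.append("indicative verb without a valid Tense")
    # Remaining features are optional here, but must use a listed value if present.
    for name, allowed in (("Voice", VOICES), ("Person", PERSONS), ("Number", NUMBERS)):
        value = feats.get(name)
        if value is not None and value not in allowed:
            problems.append(f"unknown {name} value: {value}")
    return problems


print(check_verb({"VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Voice": "Act"}))
print(check_verb({"VerbForm": "Fin", "Mood": "Ind"}))
```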
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON) and determiners (DET).
- NumType is used with numerals (NUM) and adjectives (ADJ).
- The Poss feature marks possessive personal pronouns (e.g. oma “my”).
- The Reflex feature marks reflexive pronouns (ise, end). In Estonian it is always used together with PronType=Prs.
- Person is a lexical feature of personal pronouns (PRON) and has three values: 1, 2 and 3.
- PronType is a list-based feature of pronouns and determiners and has the following values:
  - “PronType=Dem” for demonstrative pronouns
  - “PronType=Ind” for indefinite pronouns
  - “PronType=Int,Rel” for interrogative or relative pronouns
  - “PronType=Prs” for personal pronouns
  - “PronType=Rcp” for reciprocal pronouns
  - “PronType=Tot” for total (collective) pronouns (kõik, kogu etc.)
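The PronType inventory and the Reflex constraint can be checked the same way. The sketch below assumes features are given as a dict and that a reflexive pronoun is marked Reflex=Yes, the universal value of that feature; the function name `check_pron` is hypothetical.

```python
# PronType values listed above; "Int,Rel" is kept as a single combined value.
PRONTYPES = {"Dem", "Ind", "Int,Rel", "Prs", "Rcp", "Tot"}

# PronType is used with pronouns and determiners only.
PRONTYPE_UPOS = {"PRON", "DET"}


def check_pron(upos, feats):
    """Return a list of problems with the pronominal features of one token."""
    problems = []
    pron_type = feats.get("PronType")
    if pron_type is not None:
        if upos not in PRONTYPE_UPOS:
            problems.append(f"PronType is not used with {upos}")
        if pron_type not in PRONTYPES:
            problems.append(f"unknown PronType value: {pron_type}")
    # Reflex is always used together with PronType=Prs.
    if feats.get("Reflex") == "Yes" and pron_type != "Prs":
        problems.append("Reflex=Yes requires PronType=Prs")
    return problems


print(check_pron("PRON", {"PronType": "Prs", "Reflex": "Yes"}))
print(check_pron("PRON", {"PronType": "Dem", "Reflex": "Yes"}))
```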
Other Features
- Estonian UD treebanks have the following language-specific features:
- The following universal features are not used in Estonian: Definite, Evident, Polite.
Syntax
This is an overview only.
Core Arguments, Oblique Arguments and Adjuncts
- A nominal subject (nsubj) is a nominal in the nominative or partitive case, without a preposition.
- Objects (obj) can be nominals in the nominative, genitive or partitive case (there is no accusative case in Estonian).
- Adjuncts or adverbial modifiers realized as noun phrases are labeled obl:
Non-verbal Clauses
- The copula verb olema (“to be”) is used in equational, attributional, locative, possessive and benefactive nonverbal clauses. Purely existential clauses (without indicating location) use olema as well, but there it is treated as the head of the clause and tagged VERB.
Relations Overview
- The following relation subtypes are used in Estonian:
  - acl:relcl for relative clauses
  - cc:preconj for the constructions nii … kui ka … and kas … või …
  - compound:prt for adverbial components of particle verbs
  - csubj:cop for clausal or infinitive subjects in copula clauses
  - nsubj:cop for nominal subjects in copula clauses
- The following main types are not used alone and must be subtyped:
- The following relation types are not used in Estonian at all: expl, clf, dislocated
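The relation inventory above can be turned into a simple check on the DEPREL column. This is a sketch under stated assumptions: the function name `check_deprel` is hypothetical, the subtype set holds only the subtypes named in this overview (which may be incomplete), and the rule about main types that must be subtyped is not implemented because those types are not listed here.

```python
# Relations that are not used in Estonian at all.
UNUSED_RELATIONS = {"expl", "clf", "dislocated"}

# Subtypes named in this overview; the full inventory may contain more.
LISTED_SUBTYPES = {"acl:relcl", "cc:preconj", "compound:prt", "csubj:cop", "nsubj:cop"}


def check_deprel(deprel):
    """Return a warning string for a suspicious relation label, or None."""
    main_type = deprel.split(":", 1)[0]
    if main_type in UNUSED_RELATIONS:
        return f"relation not used in Estonian: {main_type}"
    if ":" in deprel and deprel not in LISTED_SUBTYPES:
        return f"subtype not named in the overview: {deprel}"
    return None


for label in ["nsubj:cop", "expl", "obl", "nmod:poss"]:
    print(label, check_deprel(label))
```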