UD for Latvian 
Tokenization and Word Segmentation
In general, words are delimited by whitespace characters, and punctuation is separated from words. The exceptions are as follows:
- A whitespace separating digits in a large number is not treated as a word separator. For example, 1 000 000 (“1,000,000” by English rules) is one token.
- Abbreviations without spaces are treated as single words and may contain punctuation (utt. “etc.”). In the following cases the abbreviation is treated as a single token even if whitespace is used between a part of the abbreviation and a punctuation mark: u.t.jpr., u.c., u.tml., v.tml., u.t.t., N.B., P.S. and P.P.S.
- Double surnames such as Vīķe-Freiberga and words abbreviated with dashes such as e-pasts “e-mail”, k-dze “Ms.” are tokenized as a single token.
- In Latvian, ordinal numerals are written with a punctuation mark and no whitespace, like abbreviations (1.), so an ordinal numeral is tokenized together with its punctuation mark as one token.
- Multiple dots (… and ..) are considered one token. Multiple ?! are considered one token, while ?!… is considered to be two tokens (?! and …).
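The exceptions above can be approximated with a small regular-expression tokenizer. This is only a sketch under stated assumptions, not the tokenizer used for the treebanks: the names `ABBREVIATIONS`, `TOKEN_RE` and `tokenize` are hypothetical, an ordinal is recognized only when the digits and dot are followed by whitespace and a lowercase letter, and abbreviations written with internal whitespace are not handled.

```python
import re

# Abbreviations named in the guidelines above (assumed list for this sketch).
ABBREVIATIONS = ["u.t.jpr.", "u.c.", "u.tml.", "v.tml.", "u.t.t.",
                 "N.B.", "P.S.", "P.P.S.", "utt."]

# Lowercase Latvian letters, used by the ordinal heuristic below.
LOWER = "a-zāčēģīķļņšūž"

TOKEN_RE = re.compile(
    "|".join(
        [
            # Listed abbreviations, longest first.
            "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)),
            # Large numbers with space-separated digit groups: 1 000 000
            r"\d{1,3}(?: \d{3})+(?!\d)",
            # Ordinals such as "1." (heuristic: a lowercase word follows)
            rf"\d+\.(?=\s+[{LOWER}])",
            # Words, including hyphenated ones: Vīķe-Freiberga, e-pasts
            r"\w+(?:-\w+)*",
            # Runs of ? and ! form one token
            r"[?!]+",
            # Ellipsis character or two or more dots form one token
            r"…|\.{2,}",
            # Any other single non-space character (punctuation)
            r"\S",
        ]
    )
)


def tokenize(text):
    """Return the list of tokens found in text, left to right."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]
```

For instance, `tokenize("Ko?!… 1. maijā")` splits `?!` and `…` into two tokens and keeps `1.` together, matching the rules above.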
Paragraph borders from the original text are indicated by the comment line # newpar when the paragraph border coincides with a sentence border, and by the MISC value NewPar=Yes on the token following a mid-sentence paragraph break. The MISC value SpaceAfter=No marks tokens that are not followed by any whitespace.
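As an illustration of how these MISC values can be consumed, the following sketch rebuilds surface text from one sentence's tokens. It assumes each token is given as a `(form, misc)` pair, where `misc` is the raw MISC column (`_` when empty); the function name `detokenize` is hypothetical.

```python
def detokenize(tokens):
    """Rebuild the surface text of one sentence from (form, misc) pairs."""
    text = ""
    for form, misc in tokens:
        attrs = set() if misc == "_" else set(misc.split("|"))
        if "NewPar=Yes" in attrs:
            # Mid-sentence paragraph break before this token.
            text = text.rstrip(" ") + "\n"
        text += form
        if "SpaceAfter=No" not in attrs:
            text += " "
    return text.rstrip(" ")
```

Tokens without SpaceAfter=No are followed by a single space here, which is an assumption: the original whitespace may have been of a different kind or length.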
Latvian uses all 17 universal POS categories.
The PART tag is used for the following function words: acīmredzot, ak, ar, arī, arīdzan, da, diemžēl, diez, diezin, droši, gan, i, ij, ik, ir, it, itin, ja, jau, jā, jel, jo, kaut, kā, lai, laikam, mjā, ne, nea, nebūt, nez, nezin, nē, nu, nudien, nujā, nū, nūja, nūjā, pat, patiesi, patiešām, protams, proti, taču, tad, tak, tā, tāpat, tātad, tiešām, tik, tikai, tikpat, tipa, tomēr, turklāt, vai, varbūt, vēl, vien, vienīgi, vis.
Particles can be homonymous with other POS, most notably conjunctions (CCONJ and SCONJ), interjections (INTJ), and adverbs (ADV). The correct POS is assigned based on sentence context.
Pronouns and Determiners
Effectively distinguishing the PRON and DET categories in Latvian is very hard, as words used as DET can also be used as PRON, and thus traditional Latvian grammar does not define determiners as a distinct POS. Since version 2.15, the pronoun (PRON) vs. determiner (DET) distinction is made by lemma (similarly to PDT). In earlier versions the distinction was made based on tree structure.
Currently DET are: abas, abi, cikais, cikas, ciki, cita, cits, daudzi, daža, dažs, ikkatra, ikkatrs, ikkura, ikkurš, ikviena, ikviens, jebkāda, jebkāds, jebkura, jebkurš, jelkāda, jelkāds, jūsējs, kāda, kādā, kādais, kāds, katra, katrs, kura, kurā, kurais, kurs, kurš, manējs, mana, mans, mūsējs, nekāda, nekādā, nekādais, nekāds, neviena, neviens, pate, pati, pats, savējs, sava, savs, šāda, šāds, šī, šis, šitāda, šitāds, šitaids, šitejāda, šitejāds, šitā, šitais, šitas, šitentāda, šitentāds, šitentas, štā, štas, štis, tāda, tāds, tā, tas, taste, tāte, tavējs, tava, tavs, vairāki, vēlviena, vēlviens, vienotra, vienotrs, viņējs, viņā, viņais, visa, viss.
Currently PRON are: daudzkas, es (“I”), jebkas, jelkas, jis, jūs (“you”, plural), kas, mēs (“we”), nekas, nezinkas, sevis, tu (“you”, singular), viņa (“she”), viņš (“he”), viš (“they/he/she”).
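Because the distinction is made by lemma, it amounts to a table lookup. The sketch below illustrates this with the pronoun lemmas listed above and only a small sample of the determiner lemmas; the names `PRON_LEMMAS`, `DET_LEMMAS_SAMPLE` and `pronominal_upos` are hypothetical and not part of any treebank tooling.

```python
# Pronoun lemmas as listed above.
PRON_LEMMAS = {
    "daudzkas", "es", "jebkas", "jelkas", "jis", "jūs", "kas", "mēs",
    "nekas", "nezinkas", "sevis", "tu", "viņa", "viņš", "viš",
}

# Only a sample of the determiner lemmas listed above, to keep the sketch short.
DET_LEMMAS_SAMPLE = {
    "abi", "cits", "katrs", "kurš", "mans", "pats", "savs", "šis",
    "tas", "tāds", "viss",
}


def pronominal_upos(lemma):
    """Return 'PRON' or 'DET' for a known lemma, or None if it is in neither table."""
    lemma = lemma.lower()
    if lemma in PRON_LEMMAS:
        return "PRON"
    if lemma in DET_LEMMAS_SAMPLE:
        return "DET"
    return None
```

A lookup like this ignores sentence context by design, which is the point of a lemma-based split.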
The syntactic relation det is used for Latvian pronominal words that modify a noun in the sentence and agree with it in gender, number and case. The pronominal quantifiers daudzi “many” and vairāki “several”, and the personal possessives manējais, tavējais, mūsējais, jūsējais, viņējais are tagged DET, although Latvian grammar describes them as adjectives.
Auxiliary Verbs
Latvian has three auxiliary verbs AUX: būt “to be”, tikt “to get”, and tapt “to become” (obsolete). The auxiliary verb is used in several types of constructions:
- Analytic word forms of verbs (būt, tikt).
- The copula in non-verbal predicates (būt).
- The copula in infinitive predicates (būt).
Būt, tikt and tapt may still occur as normal VERB if they are used in purely existential sentences or indicate location. Verbs with modal meaning are not considered auxiliary in Latvian.
Deverbal Nouns, Participles, Converbs
Deverbal nouns with endings -šana, -šanās (skriešana “running”) are tagged as NOUN. Most converbs with endings -ot, -oties, -am, -ām, -amies, -āmies, -dams, -dama, -damies, -damās are tagged as VERB or AUX. Most adjectival participles (redzams, aizgājis, negaidīts, velkošs) are tagged as VERB. Exceptions are lexicalized uses with a separate meaning, like protams “of course” and acīmredzot “obviously”, which are tagged as PART, and iespējams “possible”, which is tagged as ADJ.
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc, Fem.
- The following parts of speech inflect for Gender as they must agree with nouns: ADJ, DET, NUM, VERB, AUX. For verbs (including auxiliaries), only participles inflect for Gender. Finite verbs don’t.
- The two main values of the Number feature are Sing and Plur. The parts of speech that inflect for number include NOUN and verbs (finite forms, participles and verbal nouns), and marginally NUM. Selected nouns are plurale tantum (Ptan) or singulare tantum (Coll).
- Case has 6 possible values: Nom, Gen, Dat, Acc, Loc, Voc. It occurs with the nominal words, including NOUN, and with verbs (participles and verbal nouns).
- Definite has 2 possible values: Ind, Def. The parts of speech that inflect for definiteness include ADJ.
Degree and Polarity
- Degree applies to adjectives (ADJ), adverbs (ADV), and some participles (VERB, AUX), and has one of three possible values: Pos, Cmp, Sup.
- Polarity has two values, Pos and Neg, and applies to verbs (VERB, AUX).
  - The words ne, nē “no” occur as independent negation particles (PART) and are marked with Polarity=Neg.
  - Occasionally ne occurs as a part of a correlative conjunction and is marked with Polarity=Neg.
  - The word jā occurs as an independent affirmation particle (PART) and is marked with Polarity=Pos.
  - The Polarity feature is not used with pronouns and determiners, although there is a subset of pronouns and determiners which are traditionally considered to be negated. The PronType=Neg feature is used there instead.
Verbal Features
- There are five main (de)verbal form types, distinguished by the UPOS tag and the value of the VerbForm feature:
- Aspect applies only to some participles and is either imperfective (Imp) or perfective (Perf).
- Finite verbs always have one of five values of Mood: Ind, Imp, Cnd, Qot, Nec.
- Tense is used for verbs and participles:
  - Verbs in the indicative mood always have one of three Tense values.
  - Infinitive, imperative, conditional, quotative, and necessitative forms do not have the Tense feature.
  - The Tense feature is also used to distinguish declinable participles (tagged VERB) into two groups: present participles (ziedošs “[it is] flowering” and lasāms “[it is] readable”) and past participles (darījis “[he has] been doing” and pateikts “[it has] been said”).
- There are two values used for the Voice feature, Act and Pass:
  - Passive participles (lasāms “[it is] readable” and pateikts “[it has] been said”) have Voice=Pass.
  - Finite verb forms and active participles (ziedošs “[it is] flowering” and darījis “[he has] been doing”) have Voice=Act.
- Evident applies to finite verb forms (VerbForm=Fin) and depends on the value of Mood: quotatives have the value Nfh, while indicatives have the value Fh.
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns PRON, determiners DET and pronominal adverbs ADV, with 8 permissible values.
- NumType is used with numerals (also cardinal numbers) NUM, ordinal numbers ADJ, and some adverbs ADV:
  - Numerals and ordinal numbers have one of three possible values: Card, Ord, Frac.
  - The adverbs vienreiz “once”, divreiz “twice”, trīsreiz “thrice”, četrreiz, piecreiz, sešreiz, septiņreiz, astoņreiz, deviņreiz, desmitreiz “ten times”, pusotrreiz “one and a half times” have NumType=Mult.
- The Poss feature marks possessive personal pronouns and determiners (e.g., mans “my”) and possessive adjectives (e.g., tavējais “yours”) with the value Yes.
- The Reflex feature marks the reflexive pronoun sevis.
  - Reflexivity is also marked on reflexive verbs and participles (VERB, e.g., mazgāties, pusapģērbusies).
- Person is marked for pronouns and finite verbs and has three values: 1, 2, 3.
  - It is a lexical feature of personal pronouns (PRON) like es “I”, tu “you” (singular), viņš “he”, viņa “she”, mēs “we”, jūs “you” (plural), viņi “they” (plural, masculine), viņas “they” (plural, feminine).
  - It is a lexical feature of personal possessives (DET) mans, manējais “my/mine”, tavs, tavējais “your/yours” (singular), mūsējais “our/ours”, jūsējais “your/yours” (plural), viņējais “his/hers/theirs”.
  - Person is also marked on some demonstrative pronouns with the value 3.
  - As a cross-reference to the subject, person is also marked on finite verbs (VERB, AUX).
- Foreign is annotated Foreign=Yes for foreign words X.
- Abbr is annotated Abbr=Yes for abbreviations, which can be nouns NOUN (DJ), PROPN (NATO), ADJ (god. “honored”), VERB (skat. “see”), ADV (v.j.l. “above sea level”), SYM (utt. “etc.”).
ExtPos is currently used for annotating fixed constructions. See ExtPos for Latvian for currently used values and examples.
Unused Features
Features not applicable for Latvian:
Core Arguments
- Nominal subject (nsubj) is a noun phrase usually in the nominative case. However:
- If the noun phrase is quantified, it may be in the genitive, which is required by the quantifier.
- With the predicates nebūt, trūkt, pietikt, netrūkt, nepietikt, the noun phrase can be in the genitive.
- A finite subordinate clause may serve as the subject and is labeled csubj.
- The noun phrase may be in the dative, if the predicate is in the necessitative mood (man jāskatās spēle “I have to watch the game”) or if the predicate is with modal meaning and has subordinated infinitive (viņam vajadzētu pasteigties “he should hurry”).
- Objects as defined in the Latvian grammar may be either bare noun phrases in accusative, dative, or genitive, or prepositional phrases in accusative, dative, genitive. All objects are labeled as obj or iobj.
  - However, if the predicate is in the necessitative mood, the object may be in the nominative (zēnam jāuzraksta mājasdarbs “the boy has to write the homework”), and it is labeled as obj.
  - Accusative objects are considered obj.
  - Objects in dative and genitive cases and prepositional objects are considered iobj.
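The case-based split between the two object relations can be sketched as a small function. This is a hedged illustration, not the annotation procedure itself: it assumes that accusative objects map to obj and that dative, genitive and prepositional objects map to iobj, it uses the standard UD Case values Acc, Dat and Gen, and the name `object_relation` is hypothetical. Objects of necessitative predicates are deliberately left out.

```python
def object_relation(case, prepositional=False):
    """Pick obj or iobj for an object noun phrase, given its Case value.

    Assumption of this sketch: bare accusative objects are obj; dative,
    genitive and all prepositional objects are iobj.
    """
    if prepositional:
        return "iobj"
    if case == "Acc":
        return "obj"
    if case in {"Dat", "Gen"}:
        return "iobj"
    raise ValueError(f"case {case!r} is not covered by this sketch")
```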
Non-verbal Clauses
The copula verb būt “be” is used in equational and attributional nonverbal clauses. Purely existential clauses (also indicating location) use būt as well, but it is treated as the head of the clause and tagged VERB.
Relations Overview
The following relation subtypes are used in Latvian:
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- aux:pass for passive auxiliaries
- flat:foreign for non-first words in quoted foreign phrases
- flat:name for exocentric complex name
- advmod:neg for negative particles
- advmod:emph for emphasizing particles
The following relation types are not used for Latvian: clf, dislocated, list, reparandum. However, reparandum should be introduced in the future, once appropriate speech texts are annotated.
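The relation inventory above lends itself to a simple consistency check over the DEPREL column. The following is a minimal sketch with hypothetical names (`LATVIAN_SUBTYPES`, `UNUSED_RELATIONS`, `check_deprel`); it only encodes the two lists given in this section and does not validate the base relations themselves.

```python
# Relation subtypes listed above as used in Latvian.
LATVIAN_SUBTYPES = {
    "nsubj:pass", "csubj:pass", "aux:pass",
    "flat:foreign", "flat:name",
    "advmod:neg", "advmod:emph",
}

# Universal relations listed above as not used for Latvian.
UNUSED_RELATIONS = {"clf", "dislocated", "list", "reparandum"}


def check_deprel(deprel):
    """Classify a DEPREL value as 'ok', 'unused' or 'unknown-subtype'."""
    base = deprel.split(":", 1)[0]
    if base in UNUSED_RELATIONS:
        return "unused"
    if ":" in deprel and deprel not in LATVIAN_SUBTYPES:
        return "unknown-subtype"
    return "ok"
```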
Annotating Textual Errors
The following MISC values can be used to annotate errors in the source text that interfere with treebank annotation:
- CorrectionType=Spelling for typos (FORM is given as in the text, while LEMMA is given as for the word without the error)
- CorrectionType=Spacing for missing or unnecessary whitespace
- CorrectionType=InsertedPunctAfter for cases when a punctuation mark (usually a comma) is missing after this token
- CorrectionType=RemovedPunctuation for unnecessary punctuation (usually a comma)
- In case of CorrectionType=Spelling, the additional feature CorrectedForm=… gives the corrected form.
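A consumer of the treebank might read these annotations from the MISC column as follows. This is a sketch under assumptions: MISC is taken to be a `|`-separated list of `Key=Value` pairs (`_` when empty), and the helper names `parse_misc` and `corrected_form` are hypothetical.

```python
def parse_misc(misc):
    """Turn a raw MISC column string into a dict of key/value pairs."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))


def corrected_form(form, misc):
    """Return the CorrectedForm value if present, otherwise the original FORM."""
    return parse_misc(misc).get("CorrectedForm", form)
```

Falling back to FORM when no CorrectedForm is present keeps the helper usable on tokens without any error annotation.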
There are 2 Latvian UD treebanks: