UD for Latgalian
Note that the UD guidelines for annotating Latgalian are at a very early stage, as little text has been annotated so far.
Tokenization and Word Segmentation
In general, words are delimited by whitespace characters and punctuation is separated. The exceptions are as follows:
- Whitespace separating digit groups in a large number is not treated as a word separator. For example, 1 000 000 (“1,000,000” by English rules) is one token.
- Abbreviations without spaces are treated as single words and may contain punctuation (v.tml. “etc.”). In the following cases, the abbreviation is treated as a single token even if whitespace occurs between a part of the abbreviation and a punctuation mark: v.tml., N.B., P.S. and P.P.S.
- Double surnames such as Vīke-Freiberga and words abbreviated with hyphens such as e-posts “e-mail”, k-dze “Ms.” are tokenized as a single token.
- In Latgalian, ordinal numerals are written with a punctuation mark and no intervening whitespace, like abbreviations (1.), so an ordinal numeral is tokenized together with its punctuation mark as one token.
- Multiple dots (… and ..) are considered one token. A run of ?! is considered one token, while ?!… is considered two tokens (?! and …).
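The exceptions above can be approximated with an ordered regular expression. The following Python sketch is only an illustration, not the tokenizer used for the treebank: the names ABBREVIATIONS, TOKEN_RE and tokenize are hypothetical, the abbreviation list contains only the four items named above, and abbreviations written with internal whitespace are not handled. It also treats every digit sequence followed by a dot as an ordinal numeral, which a real tokenizer would have to disambiguate at sentence ends.

```python
import re

# Abbreviations named in the guidelines (assumption: a real list would be longer).
ABBREVIATIONS = ["P.P.S.", "v.tml.", "N.B.", "P.S."]

# Alternatives are tried left to right, so the special cases come first.
TOKEN_RE = re.compile(
    "|".join([
        "|".join(re.escape(a) for a in ABBREVIATIONS),  # listed abbreviations
        r"\d{1,3}(?: \d{3})+(?!\d)",  # 1 000 000: spaces inside a large number
        r"\d+\.",                     # ordinal numeral with its dot: 1.
        r"\w+(?:-\w+)+",              # Vīke-Freiberga, e-posts, k-dze
        r"\w+",                       # ordinary word
        r"\.{2,}|…",                  # multiple dots form one token
        r"[?!]{2,}",                  # a run such as ?! forms one token
        r"[^\w\s]",                   # any other single punctuation mark
    ])
)

def tokenize(text):
    """Return the list of tokens found in text (whitespace is dropped)."""
    return TOKEN_RE.findall(text)

# Artificial input built only from examples in the guidelines.
print(tokenize("Vīke-Freiberga e-posts 1 000 000 v.tml. 1. ?!…"))
```

Because the pattern is ordered, moving the plain-word alternative above the special cases would split 1 000 000 and e-posts into several tokens.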
Paragraph borders from the original text are indicated by the comment line # newpar when the paragraph border coincides with a sentence border, and by the MISC value NewPar=Yes on the token following a mid-sentence paragraph break. The MISC value SpaceAfter=No marks tokens that are not followed by any whitespace.
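As an illustration of how the two MISC values combine on one token, here is a minimal Python sketch. The function name misc_value and its parameters are hypothetical; the only assumptions taken from the CoNLL-U format are that multiple MISC attributes are separated by a vertical bar and that an empty column is written as an underscore.

```python
def misc_value(space_after=True, new_par=False):
    """Build the MISC column for one token (hypothetical helper)."""
    parts = []
    if new_par:
        # Token follows a paragraph break that falls inside a sentence.
        parts.append("NewPar=Yes")
    if not space_after:
        # Token is not followed by any whitespace in the original text.
        parts.append("SpaceAfter=No")
    return "|".join(parts) if parts else "_"

print(misc_value(space_after=False, new_par=True))
```

A paragraph break that coincides with a sentence border is not handled here, since it is expressed by the # newpar comment line rather than by a MISC value.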
Morphology
Tags
Latgalian uses all 17 universal POS categories.
Particles
The PART tag is used for the following function words: ar, ari, až, ba, da, dīvamžāļ, dīz, gon, ik, it, kab, kazyn, konče, koč, kod, kuo, lai, laikam, mošeit, mož, na, nabejs, naviņ, naz, nazyn, nui, nā, pat, prūtams, rikti, ta, tak, tik, tikai, to, tok, tože, varbyut, viņ, vys, vīneigi. This list may be expanded in the future.
Pronouns and Determiners
Effectively distinguishing the PRON and DET categories in Latgalian (as in Latvian) is very hard, and no clear guidelines have been developed yet. Following the example of Latvian, the distinction is made by lemma.
Currently, the DET lemmas are: itei, itys, kaida, kaids, kura, kurs, muna, muns, sova, sovs, tei, tis, toveja, tovejs.
The PRON lemmas are: es, jei, jis, jī, kas, tu.
These lists will be expanded in the future.
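Since the distinction is made by lemma, it amounts to a lookup in two closed lists. The Python sketch below shows this under the assumption that a lemma is already known; the names DET_LEMMAS, PRON_LEMMAS and pronominal_upos are hypothetical, and the sets contain only the lemmas listed above.

```python
# Lemma lists copied from the guidelines; both are expected to grow.
DET_LEMMAS = {
    "itei", "itys", "kaida", "kaids", "kura", "kurs", "muna",
    "muns", "sova", "sovs", "tei", "tis", "toveja", "tovejs",
}
PRON_LEMMAS = {"es", "jei", "jis", "jī", "kas", "tu"}

def pronominal_upos(lemma):
    """Return "DET" or "PRON" for a listed lemma, otherwise None."""
    if lemma in DET_LEMMAS:
        return "DET"
    if lemma in PRON_LEMMAS:
        return "PRON"
    return None  # lemma is not covered by the current lists

print(pronominal_upos("muns"), pronominal_upos("kas"))
```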
Auxiliary Verbs
Latgalian has one auxiliary verb AUX: byut “to be”. The auxiliary verb is used in several types of constructions:
- Analytic word forms of verbs.
- The copula in non-verbal predicates.
- The copula in infinitive predicates.
Byut may still occur as a normal VERB if it is used in purely existential sentences or indicates location.
Verbs with modal meaning are not considered auxiliary in Latgalian.
Deverbal Nouns, Participles, Converbs
Latgalian features a rich set of deverbal derivations, and not all of them have yet been analyzed to align with the UD guidelines. However, deverbal nouns with the endings -šona, -šonuos (skrīšona “running”) are tagged as NOUN. Most converbs with the endings -ūt, -ūts, -ūte, -ūtīs, -om, -omīs, -dams, -dama, -damīs, -damuos are tagged as VERB or AUX. Most adjectival participles (radzams, aizguojs, nagaideits, valkūšs) are tagged as VERB. Exceptions are lexicalized uses with a separate meaning, like prūtams “of course” and acimradzūt “obvious”, which are tagged as PART, and īspiejams “possible”, which is tagged as ADJ.
Features
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc or Fem.
- The following parts of speech inflect for Gender as they must agree with nouns: ADJ, DET, NUM, VERB, AUX. For verbs (including auxiliaries), only participles inflect for Gender. Finite verbs don’t.
- The two main values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB and AUX (finite, participles and verbal nouns), marginally NUM. Selected nouns are plurale tantum Ptan or singulare tantum Coll.
- Case has 6 possible values: Nom, Gen, Dat, Acc, Loc, Voc. It occurs with the nominal words, i.e., NOUN, PROPN, PRON, ADJ, DET, NUM, VERB and AUX (participles and verbal nouns).
- Definite has 2 possible values: Ind and Def. The following parts of speech inflect for definiteness: ADJ, NUM, VERB and AUX (participles).
Verbal Features
- There are five main (de)verbal form types, distinguished by the UPOS tag and the value of the VerbForm feature:
- Aspect applies only to some participles (VERB, AUX) and is either imperfective Imp or perfective Perf.
- Finite verbs always have one of five values of Mood: Ind, Imp, Cnd, Qot or Nec.
- Tense is used for verbs and participles:
  - Verbs in the indicative mood always have one of three Tense values: Past, Pres or Fut.
  - Infinitive, imperative, conditional, quotative, and necessitative forms do not have the Tense feature.
  - The Tense feature is also used to distinguish declinable participles (tagged VERB or AUX) into two groups: present participles (zīdūšs “[it is] flowering” and skaitams “[it is] readable”) and past participles (darejs “[he has] been doing” and pasaceits “[it has] been said”).
- There are two values used for the Voice feature: Act and Pass:
  - Passive participles (skaitams “[it is] readable” and pasaceits “[it has] been said”) have Voice=Pass.
  - Finite verb forms and active participles (zīdūšs “[it is] flowering” and darejs “[he has] been doing”) have Voice=Act.
- Evident applies to finite verb forms (VERB, AUX) and depends on the value of Mood: quotatives have the value Nfh, while indicatives have the value Fh.
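The dependencies between Mood, Tense and Evident described above can be written down as a small consistency check. This Python sketch is only an illustration and assumes that the features of a finite verb are available as a dictionary; the function name check_finite_verb and the message texts are hypothetical. Evident is checked only for the two moods for which a value is stated above.

```python
MOODS = {"Ind", "Imp", "Cnd", "Qot", "Nec"}
INDICATIVE_TENSES = {"Past", "Pres", "Fut"}
# Evident values stated in the guidelines; other moods are left unchecked.
EVIDENT_BY_MOOD = {"Ind": "Fh", "Qot": "Nfh"}

def check_finite_verb(feats):
    """Return a list of problems found in the features of a finite verb."""
    problems = []
    mood = feats.get("Mood")
    if mood not in MOODS:
        return ["finite verb needs Mood=Ind, Imp, Cnd, Qot or Nec"]
    tense = feats.get("Tense")
    if mood == "Ind" and tense not in INDICATIVE_TENSES:
        problems.append("indicative needs Tense=Past, Pres or Fut")
    if mood != "Ind" and tense is not None:
        problems.append("Tense is not used with Mood=" + mood)
    expected = EVIDENT_BY_MOOD.get(mood)
    if expected is not None and feats.get("Evident") != expected:
        problems.append("Mood=" + mood + " needs Evident=" + expected)
    return problems

print(check_finite_verb({"Mood": "Qot", "Tense": "Pres", "Evident": "Nfh"}))
```

Participles are outside the scope of this check, because their Tense and Voice values follow the separate rules given above.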
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns PRON, determiners DET and pronominal adverbs ADV, with 8 permissible values: Prs, Rcp, Int, Rel, Dem, Tot, Neg, Ind.
- NumType is used with numerals (also cardinal numbers) NUM, ordinal numbers ADJ, and some adverbs ADV:
  - Numerals and ordinal numbers have one of three possible values: Card, Ord or Frac.
  - The adverbs vīnreiz “once”, divreiz “twice”, treisreiz “thrice”, četrreiz, pīcreiz, sešreiz, septeņreiz, ostoņreiz, deveņreiz, desmitreiz “ten times” have NumType=Mult.
- The Poss feature marks possessive personal pronouns and determiners (e.g., muns “my”) and possessive adjectives (e.g., tovejs “yours”) with value Yes.
- The Reflex feature marks reflexive pronouns seve, sevi.
  - Reflexivity is also marked on reflexive verbs and participles (VERB, e.g., apsamozguot, mozguotīs, apsavāruse, vārusīs).
- Person is marked for pronouns and finite verbs and has three values: 1, 2 and 3.
  - It is a lexical feature of personal pronouns PRON like es “I”, tu “you” (singular), jis “he”, jei “she”, mes “we”, jius “you” (plural), jī “they” (plural, masculine), juos “they” (plural, feminine).
  - It is a lexical feature of personal possessives DET/PRON muns, munejs, munejais “my/mine”, tovs, tovejs, tovejais “your/yours” (singular), myusejs, myusejais “our/ours”, jiusejs, jiusejais “your/yours” (plural).
  - Person is also marked on some demonstrative pronouns with value 3.
  - As a cross-reference to subject, person is also marked on finite verbs (VERB, AUX).
- Foreign is annotated Yes for foreign words X.
- Abbr is annotated Yes for abbreviations, which can be nouns NOUN (DJ), PROPN (NATO), ADJ (gūd. “honored”), VERB (sal. “compare”), ADV (p.Kr. “anno Domini”), SYM (v.tml. “etc.”).
Unused Features
Features not applicable for Latgalian:
Syntax
Core Arguments
- Nominal subject (nsubj) is a noun phrase, usually in the nominative case. However:
  - If the noun phrase is quantified, it may be in the genitive, which is required by the quantifier.
  - With the predicates nabyut, tryukt, pītikt, natryukt, napītikt, the noun phrase can be in the genitive.
  - A finite subordinate clause may serve as the subject and is labeled csubj.
  - The noun phrase may be in the dative if the predicate is in the necessitative mood (maņ juosaver spēle “I have to watch the game”) or if the predicate has a modal meaning and a subordinated infinitive (jam vajadzātu pasasteigt “he should hurry”).
- Objects as defined in the Latgalian grammar may be either bare noun phrases in accusative, dative, or genitive, or prepositional phrases in accusative, dative, genitive. All objects are labeled as obj or iobj.
  - However, if the predicate is in the necessitative mood, the object may be in the nominative (puikam juoatnas iudiņs “the boy has to bring the water.”), and it is labeled as obj.
  - Accusative objects are considered obj.
  - Objects in dative and genitive cases and prepositional objects are considered iobj.
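The choice between obj and iobj described above depends only on the case of the object, on whether it is a prepositional phrase, and on the mood of the predicate. The following Python sketch encodes that decision under the assumption that the phrase has already been identified as an object; the function name object_relation and its parameters are hypothetical.

```python
def object_relation(case, prepositional=False, necessitative=False):
    """Pick obj or iobj for a phrase already known to be an object.

    case is the UD Case value of the phrase; returns None when the
    combination is not covered by the rules in the guidelines.
    """
    if prepositional and case in {"Acc", "Dat", "Gen"}:
        return "iobj"   # prepositional objects
    if prepositional:
        return None     # other prepositional phrases are not covered
    if case == "Acc":
        return "obj"    # bare accusative objects
    if case == "Nom" and necessitative:
        return "obj"    # nominative object of a necessitative predicate
    if case in {"Dat", "Gen"}:
        return "iobj"   # bare dative and genitive objects
    return None

print(object_relation("Nom", necessitative=True))
```

Checking the prepositional case first reflects the rule that a prepositional phrase in the accusative is still an iobj.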
Non-verbal Clauses
The copula verb byut “be” is used in equational and attributional nonverbal clauses. Purely existential clauses (also those indicating location) use byut as well, but there it is treated as the head of the clause and tagged VERB.
Relations Overview
The following relation subtypes are used in Latgalian:
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- aux:pass for passive auxiliaries
- flat:foreign for non-first words in quoted foreign phrases
- flat:name for exocentric complex names
The following relation types are not used for Latgalian: clf, dislocated, list, reparandum. However, reparandum should be introduced in the future, once appropriate speech texts are annotated.
Treebanks
There is 1 Latgalian UD treebank: