UD for Croatian and Serbian
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. In cases where a word is immediately followed or preceded by a punctuation sign (comma, full stop, parentheses, etc.), a white space is inserted between the word and the punctuation. In this way, punctuation signs are treated as separate tokens, with a few exceptions:
- Full stops separating digits in a large number (“65.000” stands for sixty-five thousand, one token)
- Hyphens in compounds such as etno-selo “ethnic (traditional) village” (one token) and for abbreviations such as atd. “etc.” (two tokens).
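The segmentation rules above can be approximated with a single regular expression. This is a minimal sketch, not the tokenizer used to build the treebanks: the name `tokenize`, the pattern and the sample sentence are illustrative assumptions, and real text will contain cases the pattern does not cover.

```python
import re

# Hypothetical pattern; the alternatives are tried left to right:
#   1. digit groups joined by full stops ("65.000") stay one token
#   2. words, optionally joined by hyphens ("etno-selo"), stay one token
#   3. any other character that is neither a word character nor
#      whitespace becomes a token of its own
TOKEN_RE = re.compile(r"\d+(?:\.\d+)*|\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into tokens following the rules sketched above."""
    return TOKEN_RE.findall(text)


# Made-up sentence that exercises each rule.
print(tokenize("Etno-selo ima 65.000 gostiju, atd."))
```

With this pattern the abbreviation atd. comes out as two tokens, the word and the full stop, which matches the token count given above.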
Morphology
Tags
- All 17 universal POS categories are used.
- Pronominal quantifiers (which the traditional grammar includes in numerals) are DET.
- There are two auxiliary verbs (AUX), biti (“to be”, past tense) and ht(j)eti (“will”, future tense); the latter can be attached to a verb in Serbian (e.g. pružiće “will provide”), but NOT in Croatian (e.g. pružit će “will provide”).
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of three values: Masc, Fem or Neut. In some cases the masculine gender is further subclassified by the Animacy values Anim and Inan. Feminine and neuter nominals do not distinguish animacy grammatically.
- The Number feature has two values: Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB, AUX (finite, participles and converbs), marginally NUM.
- Case has 7 possible values: Nom, Gen, Dat, Acc, Voc, Loc, Ins.
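The value inventories above lend themselves to a simple consistency check on the FEATS column of a CoNLL-U file. The sketch below is an assumption-laden illustration, not an official validator: `parse_feats` and `check_nominal_feats` are hypothetical helpers, and comma-joined multiple values are not handled.

```python
# Value inventories as listed in this section.
NOMINAL_VALUES = {
    "Gender": {"Masc", "Fem", "Neut"},
    "Animacy": {"Anim", "Inan"},
    "Number": {"Sing", "Plur"},
    "Case": {"Nom", "Gen", "Dat", "Acc", "Voc", "Loc", "Ins"},
}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict; "_" means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_nominal_feats(feats):
    """Return a list of problems found in the nominal features."""
    parsed = parse_feats(feats)
    problems = []
    for name, allowed in NOMINAL_VALUES.items():
        value = parsed.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name}={value} is not a listed value")
    # Animacy is only distinguished for the masculine gender.
    if "Animacy" in parsed and parsed.get("Gender") != "Masc":
        problems.append("Animacy used without Gender=Masc")
    return problems


print(check_nominal_feats("Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing"))
print(check_nominal_feats("Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing"))
```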
Degree and Polarity
- Degree applies to adjectives (ADJ) and adverbs (ADV) and has one of three possible values: Pos, Cmp, Sup.
- Polarity has two values, Pos and Neg, and applies primarily to verbs (VERB, AUX), adjectives (ADJ) and adverbs (ADV) that can be negated using the bound morpheme ne-.
  - Typically ne occurs as an independent negation particle (PART) and is marked with Polarity=Neg.
  - Negated nouns are rare and considered lexical derivations (e.g. nepravda “injustice”).
  - The Polarity feature is not used with pronouns and determiners, although there is a subset of negative pronouns and determiners. The PronType=Neg feature is used there instead.
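The division of labour between Polarity and PronType for negative words can be summarised as a lookup by part of speech. This is only a sketch of one reading of the rules in this section; `negation_feature` is a hypothetical helper, and treating negated nouns as carrying no negation feature is an interpretation of the statement that they are lexical derivations.

```python
# UPOS tags for which this section marks negation with Polarity=Neg.
POLARITY_UPOS = {"VERB", "AUX", "ADJ", "ADV", "PART"}
# Negative pronouns and determiners use PronType=Neg instead.
PRONTYPE_UPOS = {"PRON", "DET"}


def negation_feature(upos):
    """Return the feature marking a negative word of this UPOS, or None."""
    if upos in POLARITY_UPOS:
        return "Polarity=Neg"
    if upos in PRONTYPE_UPOS:
        return "PronType=Neg"
    # Assumption: negated nouns such as "nepravda" get no negation feature.
    return None


print(negation_feature("PART"))  # the particle "ne"
print(negation_feature("DET"))   # a negative determiner
print(negation_feature("NOUN"))  # a negated noun
```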
Verbal Features
- Although verbs have a lexical Aspect, either imperfective (Imp) or perfective (Perf), like in Czech, this category is not included in the language-specific features.
- Finite verbs always have one of three values of Mood: Ind, Imp or Cnd.
- Verbs in the indicative mood always have one of three values of Tense: Past, Pres or Fut.
- There are two values of the Voice feature: Act and Pass. Only the passive participle has Voice=Pass. All other verb forms have Voice=Act.
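The constraints on Mood, Tense and Voice stated in this section can be written as checks over a feature dictionary. The following is a rough sketch under stated assumptions: `check_verbal_feats` is a hypothetical helper, and it relies on the general UD values VerbForm=Fin and VerbForm=Part to identify finite verbs and participles, which this section does not itself list.

```python
MOODS = {"Ind", "Imp", "Cnd"}
TENSES = {"Past", "Pres", "Fut"}


def check_verbal_feats(feats):
    """Check a dict such as {"VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres"}."""
    problems = []
    # Finite verbs always carry one of the three Mood values.
    if feats.get("VerbForm") == "Fin" and feats.get("Mood") not in MOODS:
        problems.append("finite verb without a valid Mood")
    # Indicative verbs always carry one of the three Tense values.
    if feats.get("Mood") == "Ind" and feats.get("Tense") not in TENSES:
        problems.append("indicative verb without a valid Tense")
    # Only participles may be passive (assumes VerbForm=Part marks them).
    if feats.get("Voice") == "Pass" and feats.get("VerbForm") != "Part":
        problems.append("Voice=Pass on a non-participle")
    return problems


print(check_verbal_feats({"VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres", "Voice": "Act"}))
print(check_verbal_feats({"VerbForm": "Fin", "Mood": "Ind", "Voice": "Pass"}))
```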
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON), determiners (DET) and adverbs (ADV).
- NumType is used with numerals (NUM) and adjectives (ADJ).
- The Poss feature marks possessive personal determiners (e.g. moj “my”), possessive interrogative and possessive relative determiners (e.g. čiji “whose”) and possessive adjectives (e.g. očev “father’s”).
- The Reflex feature marks reflexive particles (se, si) and determiners (svoj). It is always used together with PronType=Prs.
- Person is a lexical feature of personal pronouns (PRON) and has three values, 1, 2 and 3. With personal possessive determiners (DET), the feature actually encodes the person of the possessor. Person is not marked on other types of pronouns or on nouns, although they can almost always be interpreted as the 3rd person.
- There are two layered features, Gender[psor] and Number[psor]. They appear with certain possessive adjectives and determiners and encode the lexical gender/number of the possessor. The extra layer is needed to distinguish these lexical features from the inflectional gender and number that mark agreement with the modified (possessed) noun.
Syntax
This is an overview of the implementation of the general UD guidelines for Croatian and Serbian. As the syntax of these two languages is very similar to Czech, Czech-specific examples scattered across the general UD guidelines might be helpful too.
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a noun phrase in the nominative case, without preposition.
- If the noun phrase is quantified, it may be in the genitive, which is required by the quantifier. If this is the case, then the quantifier is attached using a special relation, either nummod:gov or det:numgov.
- An infinitive verb may serve as the subject and is labeled as clausal subject, csubj. On the other hand, verbal nouns as subjects are just nsubj.
- A finite subordinate clause may serve as the subject and is labeled csubj.
- While traditional grammars distinguish between direct and indirect objects (like in Czech), we do not annotate any indirect object, but distinguish between objects labeled obj and oblique constituents labeled obl.
- Bare accusative phrases are considered objects.
- All other constituents (bare phrases in oblique cases, prepositional phrases) are considered oblique.
- Accusative objects of some verbs alternate with finite clausal complements, which are labeled ccomp.
- If a verb subcategorizes for the infinitive (e.g. modal verbs or verbs of control), the infinitival complement is labeled xcomp.
- Adjuncts (adverbial modifiers realized as noun phrases) are usually prepositional phrases, but they can be bare noun phrases as well. These dependencies, including temporal modifiers (e.g. svake godine “every year”), are all labeled obl.
- Extra attention has to be paid to clitic forms of reflexive pronouns se (accusative) and si (dative, used more in Croatian than in Serbian). Traditional grammars distinguish many different functions of these words, including (but not limited to) those listed in the documentation for Czech. We do not distinguish between most of these functions and label these words as expl unless they clearly function as obj. This decision is motivated by theoretical findings showing that the traditional distinctions do not hold in most cases. Instead, the common function of the reflexive particle across different uses is to mark a reduction in the number of core arguments of the verb (either object or subject is not expressed).
- Passive voice is not marked in syntactic relations (e.g. no distinction between active and passive subjects), only in verbal features (as described above).
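The obj/obl split described in this list reduces to a test on case and on the presence of a preposition. The function below is a minimal sketch of that one rule; `nominal_dependent_relation` is a hypothetical name, and the function makes no attempt to recognise adjuncts (which are obl even as bare noun phrases), reflexive clitics or subjects.

```python
def nominal_dependent_relation(case, has_preposition):
    """Choose obj or obl for a non-subject nominal complement of a verb.

    case: value of the Case feature of the phrase head, e.g. "Acc".
    has_preposition: True if the phrase is introduced by a preposition.
    """
    # Bare accusative phrases are considered objects.
    if case == "Acc" and not has_preposition:
        return "obj"
    # Bare phrases in other cases and all prepositional phrases are oblique.
    return "obl"


print(nominal_dependent_relation("Acc", False))  # bare accusative
print(nominal_dependent_relation("Acc", True))   # prepositional accusative
print(nominal_dependent_relation("Dat", False))  # bare dative
```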
Non-verbal Clauses
- The copula verb biti/jeste (“to be”) is used in equational, attributional, locative, possessive and benefactive nonverbal clauses.