UD for Bulgarian
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. In our original treebank we use MWEs, such as compound subordinators, adverbs or composite pronouns. However, in UD every single token is segmented separately. For example: за да / za da “in order to”, може би / mozhe bi “maybe”, се забавям / se zabavyam “slow myself”. To indicate the common POS meaning, we use the relation fixed.
- Numbers are analyzed as one token when written without spaces (20000) or with an internal comma as a separator (10,434).
- The hyphenated complex words are treated as one token: външно-политически / vanshno-politicheski “foreign-political”, министър-председателят / ministar-predsedateleyat “the prime minister”, по-малко / po-malko “less”.
- Depending on the spacing, there might be cases in which a complex word is analyzed as three tokens. For example, ДПС - депутати / DPS - members-of-parliament “MPs from the DPS party”, where the hyphen is surrounded by spaces.
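The segmentation rules above can be approximated with a single regular expression. The following is only a minimal sketch: the pattern and the function name `tokenize` are illustrative assumptions, not the tokenizer actually used to build the treebank, and the sketch does not handle abbreviations or other special cases.

```python
import re

# Illustrative pattern only; not the treebank's actual tokenizer.
TOKEN_RE = re.compile(
    r"""
    \d+(?:,\d+)*      # numbers, optionally with internal commas: 20000 or 10,434
    | \w+(?:-\w+)*    # words, including hyphenated compounds written without spaces
    | \S              # any other non-space character, e.g. a hyphen between spaces
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens following the whitespace and hyphen rules sketched above."""
    return TOKEN_RE.findall(text)


print(tokenize("по-малко"))
print(tokenize("ДПС - депутати"))
print(tokenize("10,434"))
```

Under this sketch a hyphen that is attached to both neighbours stays inside the token, while a hyphen surrounded by spaces becomes a token of its own, which yields the three-token analysis described for ДПС - депутати.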
Morphology
Tags
This is an overview only. For more detailed discussion and examples, see the list of Bulgarian POS tags and Bulgarian features.
- Bulgarian uses 15 universal POS categories. It does not make use of (SYM) and (X).
- Affirmative, negative, interrogative, modal particles are analyzed as (PART).
- The pronoun (PRON) vs. determiner (DET) distinction is handled as follows:
- as pronouns: personal and reflexive pronouns, and all other entity-pointing pronouns (demonstrative, interrogative, relative, indefinite, collective, negative).
- as determiners: the attributive and possessive attributive forms of the demonstrative, interrogative, relative, indefinite, collective and negative pronouns; the long forms of the possessive pronouns.
- Disclaimer 1: entity-denoting demonstrative, interrogative, relative, indefinite, collective and negative pronouns can be either (PRON) or (DET), depending on the usage of the homonymous form. If the form stands in for a noun, it is a pronoun. If it is used attributively, it is a determiner.
- Disclaimer 2: Bulgarian has a postpositive definite article, which is part of the word and acts as a phrasal affix within a phrase. Thus, it does not receive a separate analysis.
- Bulgarian has just one auxiliary verb (AUX), съм / sam (“to be”), but the lemmas бъда / bada, бивам / bivam, би / bi (“would”) are also possible.
- Auxiliary particles ще / shte (“will, shall”) and да / da (“to”) are analyzed as (AUX).
- Modal verbs are analyzed as (VERB).
- The following POS are tagged as (ADJ): adjectives; ordinal numerals; participles in adjectival usage; adjectives derived from family names.
- The following POS are tagged as (VERB): personal and impersonal verbs; participles when used as verbal forms - indicators of evidentiality; and converbs.
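As a quick illustration of the tag inventory described above, the sketch below derives the 15 categories by removing SYM and X from the 17 universal POS tags. The names `BULGARIAN_UPOS` and `is_valid_bulgarian_upos` are hypothetical helpers introduced only for this example.

```python
# The 17 universal POS tags defined by UD.
UD_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# Tags that the overview above says are not used for Bulgarian.
UNUSED_IN_BULGARIAN = {"SYM", "X"}

BULGARIAN_UPOS = UD_UPOS - UNUSED_IN_BULGARIAN


def is_valid_bulgarian_upos(tag):
    """Return True if the tag belongs to the inventory described above."""
    return tag in BULGARIAN_UPOS


print(len(BULGARIAN_UPOS))
print(is_valid_bulgarian_upos("AUX"), is_valid_bulgarian_upos("SYM"))
```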
Features
- Features not used: Abbr, Typo, NounClass, Evident, Polite, Clusivity. Although Evident and Polite are relevant for Bulgarian, their annotation requires additional manual intervention, so for the moment they are not annotated.
Nominal Features
- Nouns (NOUN and PROPN) have an inherent Gender feature with one of three values: Masc, Fem or Neut.
- Animacy is a semantic feature. It is grammatically incorporated only in some pronouns and numerals. The distinction Human - Non-human is more explicit in the Count form of Masculine. The Count form is applicable only for masculine nouns (Masc) that denote Non-human.
- ADJ, DET, NUM, PART inflect for Gender and Number, and agree with nouns.
- Bulgarian lacks declension, so only some vocative forms have vocative case, while personal pronouns have nominative, accusative and dative cases. Masculine forms of Int, Rel, Neg, Ind and very rarely Tot have accusative and dative forms.
- Bulgarian nominals (nouns, adjectives, ordinal numerals, attributively used participles) make use of the Definite feature. When the form has a definite article, it is marked as Def. When no definite article is attached after the ending, it is marked as Ind.
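The nominal feature values above can be checked mechanically on a CoNLL-U FEATS string. The sketch below is a hedged illustration: the helper names are hypothetical, and it assumes that the count form is encoded as Number=Count, which the overview above does not state explicitly.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'Definite=Def|Gender=Masc' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Value inventories taken from the description above.
ALLOWED = {
    "Gender": {"Masc", "Fem", "Neut"},
    "Definite": {"Def", "Ind"},
}


def check_nominal_feats(feats):
    """Return a list of human-readable problems found in a FEATS string."""
    parsed = parse_feats(feats)
    problems = []
    for name, allowed in ALLOWED.items():
        value = parsed.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name}={value} is not an expected value")
    # Assumption: the count form is annotated as Number=Count.
    if parsed.get("Number") == "Count" and parsed.get("Gender") != "Masc":
        problems.append("count form is expected only on masculine nouns")
    return problems


print(check_nominal_feats("Definite=Def|Gender=Masc|Number=Sing"))
print(check_nominal_feats("Gender=Fem|Number=Count"))
```

The check cannot verify the Non-human restriction on the count form, because animacy is a semantic property that is not recoverable from the feature string alone.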
Degree and Polarity
- Degree is an inherent feature for adjectives (ADJ) and adverbs (ADV). It has one of three possible values: Pos, Cmp, Sup.
- Polarity has two values, Pos and Neg, and applies primarily to negative and affirmative particles (PART).
Verbal Features
- Similarly to other Slavic languages, Bulgarian verbs have Aspect as a lexically classifying feature, either imperfective (Imp) or perfective (Perf).
- Finite verbs always have one of three values of Mood: Ind, Imp or Cnd. The conditional mood is only used with the special conditional auxiliaries (бих / bih, би / bi, бихме / bihme, бихте / bihte, биха / biha). The l-participle of the main verb, which is needed to form the analytic conditional, is not marked with this feature.
- Verbs in the indicative mood always have one of three values of Tense: Past, Imp and Pres. Fut is not used because this tense is always analytic and formed with a special particle.
- There are two values of the Voice feature: Act and Pass. Only the passive participle has Voice=Pass. All other verb forms have Voice=Act.
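The verbal value inventories above lend themselves to a simple lookup table. The following sketch, with the hypothetical helper `unexpected_verbal_feats`, flags any value outside those inventories, for instance Tense=Fut. It assumes the features have already been parsed into a dictionary.

```python
# Value inventories taken from the description above.
VERBAL_VALUES = {
    "Aspect": {"Imp", "Perf"},
    "Mood": {"Ind", "Imp", "Cnd"},
    "Tense": {"Past", "Imp", "Pres"},
    "Voice": {"Act", "Pass"},
}


def unexpected_verbal_feats(feats):
    """Return (feature, value) pairs that fall outside the inventories above."""
    return [
        (name, value)
        for name, value in sorted(feats.items())
        if name in VERBAL_VALUES and value not in VERBAL_VALUES[name]
    ]


print(unexpected_verbal_feats({"Aspect": "Perf", "Mood": "Ind", "Tense": "Past", "Voice": "Act"}))
print(unexpected_verbal_feats({"Mood": "Ind", "Tense": "Fut"}))
```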
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON), determiners (DET) and adverbs (ADV).
- NumType is used with numerals (NUM), adjectives (ADJ), determiners (DET) and adverbs (ADV).
- The Poss feature marks possessive personal determiners (e.g. мой / moy “my”), possessive interrogative, indefinite or negative determiners (e.g. чий / chiy “whose”), possessive relative determiners (e.g. чийто / chiyto “whose”) and possessive adjectives (e.g. майчин / maychin “mother’s”). It also marks the clitic personal pronouns Prs and reflexive pronouns.
- The Reflex feature marks reflexive pronouns (себе си, се, си / sebe si, se, si), reflexive determiners (свой / svoy “one’s own”) and the possessive clitic pronoun си / si.
- Person is a lexical feature of personal pronouns (PRON) and has three values: 1, 2 and 3.
Syntax
This is an overview only. For more detailed discussion and examples, see the list of Bulgarian relations.
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a noun phrase in a nominative case, without a preposition.
- A finite subordinate clause can serve the role of a subject. In such a case it is labeled as clausal subject, csubj. There is no infinitive in Bulgarian. The construction that takes its place is finite: да идвам / da idvam “to come”.
- Objects can be bare noun phrases in the position of an accusative pronoun.
- Bare accusative, dative and prepositional dative (with the preposition на / na) are considered core arguments.
- All other prepositional objects are considered oblique (obl).
- Accusative objects of some verbs alternate with finite clausal complements, which are labeled ccomp.
- If the governing verb is a modal verb or a verb of control, its verbal complement is labeled xcomp.
- Adjuncts are usually prepositional phrases, but they can be bare noun phrases as well. They are labeled obl.
- In Bulgarian there is the phenomenon of clitic doubling. Thus, when the short pronoun appears alone, it takes the role of obj or iobj. However, when the full-fledged pronoun or phrase is present, the short doubling pronoun is marked expl. Expletive expl is used also for the reflexive short pronouns when they are semantically empty and are part of the lexical verb. For example: смея се / smeya se “I am laughing”.
- In Bulgarian the copula cop is expressed by the auxiliary verb съм / sam “to be” and its synonyms that are semantically vacuous.
- In passive clauses (both reflexive and periphrastic passive), the subject is labeled nsubj:pass when it is nominal and csubj:pass when it is clausal.
- The auxiliary verb in periphrastic passive is labeled aux:pass.
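The treatment of short pronouns under clitic doubling, described in the list above, can be summarized as a small decision function. This is only a sketch: the function name and its parameters are hypothetical, and the mapping of accusative clitics to obj and dative clitics to iobj is an assumption, since the overview only says that a short pronoun standing alone is obj or iobj.

```python
def short_pronoun_relation(case, is_doubled, is_lexical_reflexive=False):
    """Pick a relation for a short (clitic) pronoun.

    case: "Acc" or "Dat", the case of the clitic.
    is_doubled: True when the full pronoun or phrase is also present.
    is_lexical_reflexive: True when the clitic is a semantically empty
        reflexive that is part of the lexical verb (e.g. смея се).
    """
    if is_doubled or is_lexical_reflexive:
        return "expl"
    # Assumption: accusative clitics are direct objects, dative clitics indirect objects.
    return "obj" if case == "Acc" else "iobj"


print(short_pronoun_relation("Acc", is_doubled=False))
print(short_pronoun_relation("Dat", is_doubled=True))
```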
Other relations:
- In Bulgarian the Yes-No questions are formed with the question particle ли / li. At the moment this particle is annotated with the discourse relation.
Relations not used: compound, dislocated, clf, list, reparandum, orphan, dep.
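A consistency check for the relations listed as unused might look like the sketch below. The helper name `uses_unused_relation` is hypothetical. The check compares only the universal part of a label, so a subtyped relation such as nsubj:pass is reduced to nsubj before the lookup.

```python
# Relations that the overview above lists as not used for Bulgarian.
UNUSED_RELATIONS = {
    "compound", "dislocated", "clf", "list", "reparandum", "orphan", "dep",
}


def uses_unused_relation(deprel):
    """Return True if the universal part of the relation label is in the unused list."""
    base = deprel.split(":", 1)[0]
    return base in UNUSED_RELATIONS


print(uses_unused_relation("compound"))
print(uses_unused_relation("nsubj:pass"))
```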
Treebanks
There is 1 Bulgarian UD treebank: