UD for Polish
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Description of exceptions follows.
- According to typographical rules, a punctuation mark is attached to a neighbouring (usually preceding) word. Punctuation marks are usually tokenised as separate tokens (words), unless they are considered an integral part of the lemma (as in Rolls-Royce “Rolls-Royce”, O’Donellowie “the O’Donells”, or 85-lecie “85th anniversary”) or are used to express inflection (as in the accusative or genitive Melville’a “Melville”). On the other hand, hyphens in constructions such as biało-czerwona “white-and-red” are treated as separate tokens.
- A whitespace separating digits in a large number is not treated as a word separator. For example, 1 000 000 (“1,000,000” by English rules) is one token. (However, such tokens do not occur in Polish treebanks as of release 2.2.)
- There are two classes of “orthographic words” (sequences of letters without spaces inside) that are split into several syntactic tokens.
- The most prominent type is an l-participle (or, rarely, another form) fused with a so-called “mobile inflection” auxiliary (e.g., śmy expressing first person and plural number) or the conditional particle by (also treated as an auxiliary), as in: wyprodukowalibyśmy = wyprodukowali + by + śmy “we would have produced”.
- Orthographic words of the other class consist of a preposition and a short (not accentable) pronoun, as in czekał nań = czekał na + ń “(he) waited for him”.
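The split of an orthographic word into several syntactic tokens can be illustrated with the CoNLL-U convention of a range line (e.g. `1-3`) followed by the individual words. The fragment below is hand-written for illustration, not copied from a treebank: only the ID, FORM and UPOS columns are filled in, and `read_multiword_tokens` is a hypothetical helper name.

```python
# Hand-written CoNLL-U fragment (10 tab-separated columns). Only ID, FORM and
# UPOS are filled in; all other columns are left as "_" for brevity.
SAMPLE = "\n".join([
    "1-3\twyprodukowalibyśmy\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\twyprodukowali\t_\tVERB\t_\t_\t_\t_\t_\t_",
    "2\tby\t_\tAUX\t_\t_\t_\t_\t_\t_",
    "3\tśmy\t_\tAUX\t_\t_\t_\t_\t_\t_",
])


def read_multiword_tokens(conllu):
    """Map each orthographic word to the syntactic words it is split into."""
    ranges = []  # (first_id, last_id, surface_form) for every range line
    words = {}   # integer word id -> word form
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        if "-" in token_id:
            first, last = token_id.split("-")
            ranges.append((int(first), int(last), form))
        elif token_id.isdigit():
            words[int(token_id)] = form
    return {
        surface: [words[i] for i in range(first, last + 1)]
        for first, last, surface in ranges
    }


print(read_multiword_tokens(SAMPLE))
# {'wyprodukowalibyśmy': ['wyprodukowali', 'by', 'śmy']}
```

The same representation would apply to the preposition + pronoun class (nań = na + ń).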
Morphology
Tags
- Polish in principle uses all 17 universal POS categories. However, SYM is only used in the PDB treebank, to mark symbols, e.g. % (percent), ° (degree), + (plus), - (minus), $ (dollar), or emojis, e.g. :-), and X is likewise only used in the PDB treebank (to mark abbreviations and digits).
- The NOUN tag is used not only for prototypical nouns, but also – somewhat arbitrarily – for gerunds (the so-called -nie/-cie forms), which have both nominal and verbal properties.
- Pronouns (PRON) are here understood as personal pronouns, so-called reflexive pronouns (also in their non-reflexive and – generally – non-pronominal uses), and such nominal pronouns as kto “who”, nic “nothing” and wszyscy “everybody”.
- As Polish grammars do not include a separate part of speech determiner, the DET class is based on a word list and includes words treated by standard Polish tagsets as adjectives, numerals or even nouns:
- determiners treated elsewhere as adjectives include possessive pronouns, as well as words such as ten “this”, każdy “each”, taki “such”, którykolwiek “whichever”, etc.,
- determiners treated elsewhere as numerals include indefinite numerals (e.g., wiele “many”, niedużo “not much, not many”, kilka “several”), as well as fractional numerals such as pół “half”,
- one determiner treated elsewhere as a noun is mnóstwo “a lot”.
- The main auxiliary verb (AUX) in Polish is być (“to be”), with the aspectual variant bywać “to be (habitual)”.
This auxiliary verb is used in several types of constructions:
- the copula with predicative phrases,
- periphrastic future tense (future form of być + infinitive or so-called l-participle form of the main verb),
- periphrastic conditional (any form of być + the conditional mood marker by + l-participle of the main verb),
- (imperfective) periphrastic passive (any form of być, including periphrastic forms, + passive participle of the main verb).
- Another auxiliary, zostać “become” (and its habitual version zostawać), is used for the perfective periphrastic passive (any form of zostać + passive participle of the main verb). Additionally, mood markers by (conditional) and niech (imperative, also its variants niechaj, niechże, niechby) are marked as AUX, as are “mobile inflections” and the copular uses of to (usually, but inappropriately in this context, translated as “this”).
- The words być, bywać, zostać and zostawać may also occur as normal VERB if they are used in purely existential sentences (i.e., ones that do not even indicate location, because if they do, they should be treated as copulas).
- Verbs with modal meaning are not considered auxiliary in Polish.
- There are five main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
- Inherently impersonal forms ending in -no/-to (a specialty of Polish and Ukrainian) are marked as finite verbs with Person=0 (and Tense=Past).
Nominal Features
- Nouns (NOUN and PROPN) have an inherent Gender feature. Five genders are standardly assumed in Polish linguistics (and in Polish tagsets): three masculine, one feminine and one neuter. The three masculine genders are often called “human masculine”, “animate masculine” and “inanimate masculine”, but the correlation with the semantic animacy feature is far from perfect. In particular, there are many “animate masculine” semantically inanimate nouns (including all masculine names of dances, and many more), as well as “animate masculine” nouns which are, semantically, human and feminine (some derogatory nouns for women, e.g., babsztyl), or which are human and, well, no longer animate (trup “corpse”), or which are “superhuman” (e.g., diabeł “devil” and anioł “angel”, but not bóg “god”, which is “human masculine”). For the sake of cross-linguistic consistency, three values are assumed for the Gender feature, i.e., Masc, Fem and Neut, but there must be another feature which distinguishes the three masculine genders.
- The following parts of speech in general inflect for gender: ADJ, DET, NUM, PRON, VERB, AUX. In the case of pronouns, only personal pronouns inflect for gender; other nominal pronouns (as well as the nominal determiner wszyscy “everybody”) have this feature defined lexically, and so-called reflexive pronouns lack this feature altogether. In the case of tokens tagged as VERB or AUX, only past forms of finite verbs overtly inflect for gender.
- The two values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB, AUX (only finite verbs and auxiliaries).
- Case has 7 possible values: Nom, Acc, Gen, Dat, Loc, Ins, Voc. It occurs on broadly nominal categories, i.e., NOUN, PROPN, PRON, ADJ, DET, NUM. It can occur with de-verbal forms but only with those tagged as ADJ (adjectival participles) or NOUN (gerunds). It never occurs with purely verbal forms.
- Polite is used in Polish (in the LFG treebank of release 2.2) as a nominal feature, with the language-specific value Depr in case of special derogatory forms of some human masculine nouns, e.g., profesory “professors (derogatory)”, as opposed to profesorowie “professors (neutral)”.
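The value inventories listed above lend themselves to a simple sanity check over the FEATS column. The following is a minimal sketch, not an official validator: `parse_feats` and `check_nominal_feats` are hypothetical helper names, the sample FEATS strings are hand-written, and multi-valued features (comma-separated values) are deliberately not handled.

```python
CASE_VALUES = {"Nom", "Acc", "Gen", "Dat", "Loc", "Ins", "Voc"}
GENDER_VALUES = {"Masc", "Fem", "Neut"}
NUMBER_VALUES = {"Sing", "Plur"}
# UPOS tags on which Case is expected, as listed in the text above.
CASE_UPOS = {"NOUN", "PROPN", "PRON", "ADJ", "DET", "NUM"}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict ("_" means no features)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_nominal_feats(upos, feats):
    """Return a list of human-readable problems; empty if none were found."""
    f = parse_feats(feats)
    problems = []
    inventories = (
        ("Case", CASE_VALUES),
        ("Gender", GENDER_VALUES),
        ("Number", NUMBER_VALUES),
    )
    for name, allowed in inventories:
        if name in f and f[name] not in allowed:
            problems.append(f"unexpected {name}={f[name]}")
    if "Case" in f and upos not in CASE_UPOS:
        problems.append(f"Case on {upos}")
    return problems


print(check_nominal_feats("NOUN", "Case=Nom|Gender=Masc|Number=Plur"))  # []
print(check_nominal_feats("VERB", "Case=Nom|Number=Sing"))  # ['Case on VERB']
```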
Pronouns, Determiners, Numerals
- PronType is used with pronouns (PRON), determiners (DET) and adverbs (ADV), as well as with the word co when it plays the dual role of a complementiser (SCONJ) introducing a special kind of relative clause (one that may involve resumptive pronouns).
- NumType (Card or Frac) is used with numerals (NUM) and determiners (DET).
- The Poss feature marks possessive personal determiners (e.g., mój “my”).
- The Reflex feature marks so-called reflexive pronouns (się, siebie) and determiners (swój), even when they are not used reflexively or reciprocally.
- Person is a lexical feature of personal pronouns (PRON) and has three values, 1, 2 and 3. With personal possessive determiners (DET), the feature actually encodes the person of the possessor. Person is not marked on other types of pronouns and on nouns, although they can be almost always interpreted as the 3rd person. On the other hand, it is marked on finite verbs (VERB, AUX).
- A layered feature, Number[psor], appears with certain possessive determiners and encodes the lexical number of the possessor. The extra layer is needed to distinguish this lexical number from the inflectional number that marks agreement with the modified (possessed) noun.
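To show how the layered Number[psor] feature coexists with the plain Number feature, here is a small sketch that keys each feature by its name and optional layer. The helper name `split_layered_feats` and the sample FEATS string (roughly what one might expect for a singular form of a plural-possessor determiner) are assumptions made for illustration.

```python
import re

# Feature name, optionally followed by a layer in square brackets.
LAYERED = re.compile(r"^(?P<name>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?$")


def split_layered_feats(feats):
    """Return {(feature, layer_or_None): value} for a CoNLL-U FEATS string."""
    result = {}
    if feats == "_":
        return result
    for pair in feats.split("|"):
        key, value = pair.split("=", 1)
        match = LAYERED.match(key)
        if match is None:
            raise ValueError(f"malformed feature name: {key!r}")
        result[(match.group("name"), match.group("layer"))] = value
    return result


feats = split_layered_feats("Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes")
print(feats[("Number", None)])    # Sing  (agreement with the possessed noun)
print(feats[("Number", "psor")])  # Plur  (lexical number of the possessor)
```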
Degree and Polarity
- Degree applies to adjectives (ADJ) and adverbs (ADV) and has one of three possible values: Pos, Cmp, Sup.
- Polarity has two values, Pos and Neg, and applies to de-verbal adjectives (ADJ; i.e., adjectival participles) and nouns (NOUN; i.e., gerunds), which can be negated using the bound morpheme nie.
- Often, nie occurs as an independent negation particle (PART) and is marked with Polarity=Neg.
- The Polarity feature is not used with pronouns, determiners or adverbs, although there is a subset of traditional pronouns (hence, here, elements of various parts of speech) which are negative in the sense that they have a negative meaning when used as standalone utterances but do not introduce additional negation when they occur with negated verbs (i.e., when they participate in so-called negative concord). The PronType=Neg feature is used for such cases.
Verbal Features
- Typical Polish verbs (including auxiliaries) have lexical Aspect, either imperfective (Imp) or perfective (Perf).
- There is, however, a class of verb-like words, marked as VERB with the universal VerbType feature with the language-specific Quasi value, which do not inflect for person and do not have aspect.
- On the other hand, the Aspect feature is used with de-verbal nouns (gerunds) and adjectives (participles), if they have the VerbForm feature.
- Finite verbs have one of three values of Mood: Ind, Imp or Cnd. The conditional mood is only used with the conditional auxiliary (by). The imperative mood is marked on imperative forms of verbs, as well as on the imperative auxiliary (niech, and its variant niechaj). All other finite verb forms, but not the “mobile inflection” auxiliaries (m, śmy, etc.), are marked for the indicative mood.
- Verbs in the indicative mood always have one of three values of Tense: Past, Pres or Fut.
- Imperative forms of verbs do not have the Tense feature.
- The Tense feature is also used to distinguish contemporary and anterior adverbial participles (sometimes called “converbs”), e.g., robiąc “while doing” (Tense=Pres) vs. zrobiwszy “having done” (Tense=Past).
- The l-participle (tagged VERB or AUX) also has Tense=Past because its primary function is to form the past tense.
- De-verbal adjectives (adjectival participles) and nouns (gerunds) do not have Tense.
- There are two values of the Voice feature: Act and Pass. Only the passive participle has Voice=Pass. All other verb forms have Voice=Act.
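The co-occurrence constraints stated in this section (Mood, Tense, Voice) can be read as a handful of checks on a token's UPOS and FEATS. The sketch below encodes the prose literally under that assumption. It is not an official validator, real treebank data may deviate from these rules, and `check_verbal_feats` and the sample FEATS strings are made up for illustration.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict ("_" means no features)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_verbal_feats(upos, feats):
    """Return the verbal-feature constraints from this section that are violated."""
    f = parse_feats(feats)
    mood, tense = f.get("Mood"), f.get("Tense")
    problems = []
    if mood == "Ind" and tense is None:
        problems.append("indicative form without Tense")
    if mood == "Imp" and tense is not None:
        problems.append("imperative form with Tense")
    if mood == "Cnd" and upos != "AUX":
        problems.append("Mood=Cnd outside the conditional auxiliary")
    if f.get("Voice") == "Pass" and f.get("VerbForm") != "Part":
        problems.append("Voice=Pass on a non-participle")
    if upos in {"ADJ", "NOUN"} and tense is not None:
        problems.append("Tense on a de-verbal adjective or noun")
    return problems


ok = "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act"
print(check_verbal_feats("VERB", ok))  # []
print(check_verbal_feats("VERB", "Aspect=Perf|Mood=Imp|Tense=Pres|VerbForm=Fin"))
# ['imperative form with Tense']
```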
Other Features
- Other universal features used in Polish include:
- AdpType – almost always Prep, but in the case of the adposition temu “ago” it is Post.
- Hyph – marks forms such as biało “white” in biało-czerwony “white-and-red”.
- PartType – used (as of release 2.2, only in the LFG treebank) only to mark question particles (Int).
- PrepCase – distinguishes those pronominal forms which may only occur as dependents of prepositions (Pre) from those which may only occur in other contexts (Npr).
- PunctSide and PunctType
- The following universal features are not used in Polish: Definite, Evident.
- Apart from SubGender, other language-specific features include:
- Agglutination – distinguishes those rare situations where the l-participle has different forms depending on whether the “mobile inflection” auxiliary directly attaches to it or not, e.g., on mógł “he could” (Agglutination=Nagl) vs. mogł in ja mogłem “I could” (Agglutination=Agl); as of release 2.2, only used in the LFG treebank.
- Emphatic – present on those traditional pronouns (hence, various parts of speech here) which include the emphatic particle ż(e), e.g., co “what” (neutral) vs. cóż “what” (emphatic); as of release 2.2, only used in the LFG treebank.
- Variant – distinguishes short and long forms of adjectives, a Slavic-wide phenomenon; in Polish primarily used to distinguish basic from vocalised versions of some prepositions (e.g., z vs. ze “from”), basic from vocalised versions of the “mobile inflection” auxiliary (e.g., m from em), and short (not accentable) from long (accentable) forms of some pronouns.
Syntax
Core and Oblique Dependents
- Prototypically, nominal subjects (nsubj) are bare noun phrases in the nominative case. In the case of typical numeral phrases in the subject position, the noun itself occurs in the genitive case. The case of the numeral is more controversial: it is nominative on some theories (and in the PDB treebank) and accusative on others (and in the LFG treebank). This special numeral construction, where the noun is in the genitive case, is marked in both release 2.2 treebanks:
- in the PDB treebank, the dependency relation is nummod:gov or det:numgov,
- in the LFG treebank, the MISC column contains the [DepType=Rec] feature (it is [DepType=Congr] in the case of those numerals which do not assign the genitive case but rather agree with the noun).
- Clausal subjects (csubj) are typically infinitival phrases or subordinate clauses.
- On the other hand, verbal nouns in the subject position are just nsubj.
- However, it is possible to have a csubj dependency to a nominal word (a noun or an adjective), namely, when this word heads a copular clause.
- In passive clauses, the subject is labelled with nsubj:pass or csubj:pass, respectively.
- Direct objects are those dependents of verbs which may passivise, i.e., which become subjects in the passive voice. Nominal direct objects are marked as obj. They usually occur in the accusative case (but not all bare accusative nominals are objects), but also some instrumental and genitive nominals may be direct objects.
- Since only nominal dependents may be considered objects according to current UD guidelines, passivisable clauses are marked as ccomp:obj.
- In the case of typical numeral phrases in the accusative object position, the noun actually occurs in the genitive case, similarly to subject positions, and the numeral is uncontroversially accusative. Such constructions are marked as in the case of numeral subjects (see above).
- All required dependents of verbs in the dative case are indirect objects (iobj).
- All other bare nominal phrases, e.g. Pies merdał ogonem “The dog wagged its tail”, are treated as indirect objects (iobj) in the PDB treebank and as obliques (obl) in the LFG treebank.
- All adpositional phrases, when they are dependents of verbs, are treated as obliques (obl).
- Required clausal dependents of verbs are marked as ccomp, unless they are subjects (csubj) or direct objects (ccomp:obj).
- Open (“controlled”) dependents are marked as xcomp; they are either infinitival phrases or predicative complements of verbs such as stać się “become”.
- Extra attention has to be paid to the so-called reflexive pronoun się. It may function as:
- reflexive direct object (obj): zobaczył się w lustrze “he saw himself in the mirror” (in such cases się may alternate with the longer form siebie),
- reciprocal direct object (obj): całowali się “they were kissing each other”,
- impersonal (expl:impers): oddycha się historią “one breathes with history”, lit. “breathe się history.INS”,
- an inherent part of a verb (usually included in the lemma in dictionaries). In accord with the current UD guidelines, we label the relation between the verb and the clitic as expl:pv, not compound. Example: śmiała się “she laughed.”
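Because the two release 2.2 treebanks mark case-governing numerals differently (a relation subtype in PDB, a MISC attribute in LFG), a consumer of both may want one normalising function. The sketch below assumes the MISC attribute is serialised as `DepType=Rec` or `DepType=Congr` among `|`-separated items, which is an assumption about the file format rather than something stated above. `numeral_government` is a hypothetical helper name.

```python
# Relation labels that signal a case-governing numeral in the PDB treebank.
GOVERNING_RELATIONS = {"nummod:gov", "det:numgov"}


def numeral_government(deprel, misc="_"):
    """Classify a numeral dependent as 'governing', 'agreeing' or None (unknown)."""
    if deprel in GOVERNING_RELATIONS:
        return "governing"
    # LFG-style marking: look for a DepType attribute in the MISC column.
    misc_items = {}
    if misc != "_":
        misc_items = dict(
            item.split("=", 1) for item in misc.split("|") if "=" in item
        )
    dep_type = misc_items.get("DepType")
    if dep_type == "Rec":
        return "governing"
    if dep_type == "Congr":
        return "agreeing"
    return None


print(numeral_government("nummod:gov"))               # governing
print(numeral_government("nummod", "DepType=Congr"))  # agreeing
print(numeral_government("nsubj"))                    # None
```

Returning `None` for anything not explicitly marked keeps the function from guessing when neither convention applies.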
Non-verbal (Predicative) Clauses
- The copula verb być “be” (and the habitual variant bywać) is used in equational, attributional, locative, possessive and benefactive nonverbal clauses. Purely existential clauses (without indicating location) use this verb as well, but there it is treated as the head of the clause and tagged VERB. Another copula word in Polish is the quasi-verbal to (it inflects periphrastically for tense, but not for person, etc.).
Relations Overview
This is an overview only. For more detailed discussion and examples, see the list of Polish relations.
- The following relation subtypes are used in Polish:
- acl:relcl for relative clauses,
- advcl:cmpr for comparative clauses (as of release 2.5, in PDB-UD),
- advcl:relcl for relative clause modifiers of clauses (as of release 2.5, in PDB-UD and PUD-PL),
- advmod:arg for adverbial complements of verbs (as of release 2.4, in PDB-UD and PUD-PL),
- advmod:emph for emphasizing adverbial modifiers (as of release 2.5, in PDB-UD and PUD-PL),
- advmod:neg for negation particles (as of release 2.4, in PDB-UD and PUD-PL),
- amod:flat for adjectival parts of named entities (as of release 2.5, in PDB-UD and PUD-PL),
- aux:clitic for “mobile inflection” auxiliaries,
- aux:cnd for conditional auxiliaries,
- aux:imp for imperative auxiliaries,
- aux:pass for passive auxiliaries,
- cc:preconj for preconjunctions,
- ccomp:cleft for required clausal dependents of the pronoun to (as of release 2.5, in PDB and PUD-PL),
- ccomp:obj for clausal objects of verbs,
- cop:locat for locative uses of copulas (as of release 2.2, only in the LFG treebank),
- csubj:pass for clausal subjects of passive verbs (does not occur in release 2.2),
- det:numgov for pronominal quantifiers that are attached as children of the quantified noun but govern its case (as of release 2.4, in PDB-UD and PUD-PL),
- det:nummod for pronominal quantifiers in cases in which they do not govern the case of the quantified noun (as of release 2.4, in PDB-UD and PUD-PL),
- discourse:emo for emoticons and emojis (as of release 2.4, in PDB-UD),
- discourse:intj for interjections (as of release 2.4, in PDB-UD),
- expl:pv for inherent uses of the so-called reflexive pronoun się,
- expl:impers for impersonal uses of the so-called reflexive pronoun się (as of release 2.2, only in the LFG treebank),
- flat:foreign for foreign words (as of release 2.4, in PDB-UD and PUD),
- nmod:arg for required nominal dependents of nouns (as of release 2.4, in PDB-UD and PUD-PL),
- nmod:flat for nominal parts of named entities (as of release 2.5, in PDB-UD and PUD-PL),
- nmod:poss for possessive nominal modifiers, including 3rd person possessive pronouns (as of release 2.2, only in the LFG treebank),
- nmod:pred for predicative expressions depending on the gerund form of the copula być “to be” (as of release 2.5, in PDB-UD and PUD-PL),
- nummod:flat for numeral parts of named entities (as of release 2.5, in PDB-UD),
- nummod:gov for numerals that are attached as dependents of the noun but govern its case, in contrast to numerals attached with plain nummod, which agree with the noun in case (as of release 2.5, in PDB-UD and PUD-PL),
- nsubj:pass for nominal subjects of passive verbs,
- obl:agent for agents of passive verbs,
- obl:arg for adpositional arguments of verbs (as of release 2.4, in PDB and PUD-PL),
- obl:cmpr for comparative phrases (as of release 2.5, only in PDB-UD),
- obl:orphan for adpositional dependents with the elided noun (as of release 2.5, only in PDB-UD),
- parataxis:insert for parenthetical clauses or comments (as of release 2.4, in PDB-UD and PUD-PL),
- parataxis:obj for direct speech (as of release 2.4, in PDB-UD and PUD-PL),
- xcomp:cleft for required open dependents (non-finite clauses) of the pronoun to (as of release 2.5, in PDB-UD),
- xcomp:obj for objects realized as infinitival clauses (in PDB-LFG),
- xcomp:pred for predicative dependents of non-copular verbs (as of release 2.4, in PDB-UD and PUD-PL),
- xcomp:subj for subjects realized as infinitival or adverbial phrases (as of release 2.5, in PDB and PUD-PL).
- The following main types are not used alone and must be subtyped: expl.
- The following relation types are not used in Polish at all (as of release 2.2): clf, dislocated.
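The two closing constraints (expl must always be subtyped; clf and dislocated are not used) are easy to check mechanically. The following is a minimal sketch with hypothetical helper names `split_deprel` and `check_deprel`. It does not verify that a subtype belongs to the list above, only the two constraints just mentioned.

```python
MUST_BE_SUBTYPED = {"expl"}
UNUSED = {"clf", "dislocated"}


def split_deprel(deprel):
    """Split 'main:subtype' into (main, subtype); subtype is None if absent."""
    main, _, subtype = deprel.partition(":")
    return main, subtype or None


def check_deprel(deprel):
    """Return a problem description for the relation label, or None if it passes."""
    main, subtype = split_deprel(deprel)
    if main in UNUSED:
        return f"{main} is not expected in Polish"
    if main in MUST_BE_SUBTYPED and subtype is None:
        return f"{main} must carry a subtype"
    return None


print(split_deprel("expl:pv"))  # ('expl', 'pv')
print(check_deprel("expl"))     # expl must carry a subtype
print(check_deprel("nsubj"))    # None
```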
Treebanks
There are three Polish UD treebanks: