UD for Dutch
Tokenization and Word Segmentation
- Words are delimited by whitespace or punctuation
- Words do not contain spaces, although some lemmas for multi-word expressions do (au serieux, dat wil zeggen, onder ander, onder veel, ter plaatse, tot en met)
- Words (e.g. abbreviations, names, URLs etc.) may contain arbitrary punctuation signs (http://www.speelgoedmuseum.be, vroeg-renaissance, o.a., ex-VU&ID)
- No multiword tokens occur (i.e. forms like ten are treated as a single token, not as te+een)
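The last rule can be checked mechanically on a CoNLL-U file, where multiword tokens are written as lines whose ID is a range such as 1-2. The following is a minimal sketch; the function name has_multiword_tokens and the sample sentence are illustrative assumptions, and columns that this page does not specify are left as underscores.

```python
def has_multiword_tokens(conllu_text):
    """Return True if any token line has a range ID such as '1-2'."""
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        # Range IDs mark multiword tokens; empty nodes use '1.1' instead.
        if "-" in token_id:
            return True
    return False


# Hypothetical fragment: 'ten' is one token, not te+een.
sample = (
    "# text = ten opzichte van\n"
    "1\tten\t_\tADP\t_\t_\t_\t_\t_\t_\n"
    "2\topzichte\t_\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "3\tvan\t_\tADP\t_\t_\t_\t_\t_\t_\n"
)

print(has_multiword_tokens(sample))
```

For Dutch data that follows the rules above, the check is expected to report no multiword tokens.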
Morphology
Tags
- ADJ (XPOS=ADJ) is used for adjectives. Adjectives occur as prenominal modifiers, as predicates, as separable verb particles, as adverbs, and as part of fixed expressions and names. Also, in cases where an adjective functions as a noun (om Nederlandstaligen te pesten) the POS is still ADJ. Ordinal number words such as eerste, 60ste (XPOS=TW rang) are also ADJ.
- ADP (XPOS=VZ) is used for prepositions and postpositions. They introduce nominal and verbal modifiers. They also occur as separable verb particles and as part of fixed expressions and names. The verbal inflection element te is also an ADP. In prepositional phrases such as ten opzichte van, ten and van are ADP and opzichte is a NOUN.
- ADV (XPOS=BW) is used for adverbs. Also, some adverbial pronouns (R-pronouns) such as daar, er, ergens, waar are ADV.
- AUX (XPOS=WW) is used for
- perfect tense auxiliaries hebben and zijn
- the passive auxiliaries worden, zijn and krijgen (the latter in the so-called krijgen-passive, as in U krijgt een bewegwijzering toegezonden)
- the modal verbs kunnen, zullen, moeten, mogen (The treebank annotation on which the conversion to UD is based does not distinguish between auxiliaries and main verbs. Here we take a conservative approach and label only these modals as auxiliaries.)
- the copula verb zijn (UD allows only one copula verb per language, even though traditional Dutch syntax lists several verbs as copula-verbs.)
- CCONJ (XPOS=VG neven) is used for coordinating conjunctions such as en and of, zowel (X als Y). (ISSUE: Note that als in this context is labeled SCONJ as it is XPOS=VG onder.)
- DET (XPOS=LID, XPOS=VNW prenom) is used for words that are part of a noun phrase and function as determiner. They are:
- words that have XPOS LID in the original data (de, het, een). These are words traditionally seen as determiners in Dutch.
- prenominal pronouns (deze, welke, meerdere, sommige), with the exception of possessive pronouns (mijn), which are labeled PRON.
- ISSUE: In multi-word determiner expressions such as maar weinig, helemaal geen, al mijn, both words are DET. The modifiers should be ADV and mijn should be PRON.
- ISSUE: Numeric elements such as +11,77 or 26% (XPOS=SPEC symb) are labeled DET if their grammatical function is det, where this should be NUM. (Note that regular numeric values such as 11 or 26 have XPOS=TW and are labeled NUM.)
- INTJ (XPOS=TSW) Elements that are not part of the syntactic structure such as ach, verrek, jazeker
- NOUN (XPOS=N soort) Nouns can be the head of a variety of dependents, such as subject, object, oblique, etc.
- NUM (XPOS=TW hoofd). Tokens such as 1, 0,80, +4,20, -1, 1.000, 040-12280, drieduizend, iv, twaalf. Note that elements such as eerste are labeled ADJ.
- PART is not used. In particular, separable verb prefixes are assigned their regular POS and are a compound:prt dependent of the verb.
- PRON (XPOS=VNW) is used for
- possessive pronouns
- pronominal elements that do not occur in prenominal position but are dependents of a verb, like personal and impersonal pronouns (ik, u, iemand)
- PROPN (XPOS=N eigen, SPEC deeleigen) is used for proper names (Achterberg) and the parts of multi-word names (Gerrit Achterberg). When used as an adjective (Amsterdamse), names are labeled ADJ.
- PUNCT (XPOS=LET) is used for punctuation.
- SCONJ (XPOS=VG onder) is used for subordinating conjunctions (als, dat, of, omdat, toen).
- ISSUE: In multi-word expressions such as dan wel, dan is VG onder, but it could also be VG neven and thus CCONJ, which seems more appropriate as it also introduces a cc relation in the syntax.
- SYM (XPOS=SPEC symb, SPEC afgebr, SPEC vreemd, LET) is used for symbols, foreign words (unit, vici, walkover), incomplete words (welzijns-, zorg-) and punctuation that does not introduce a punct relation in the syntax, i.e. because it is part of a name or title (ZinderZlam !) or of a multi-word unit (en / of).
- VERB (XPOS=WW) is used for verbs that are not AUX. Note that adjectivally used verbs and nominalized verbs are VERB (passende maatregelen, het verplaatsen van voorwerpen).
- X (XPOS=SPEC afk) is used for abbreviations (v.Chr., o.a., nr.)
Detailed documentation of the decisions w.r.t. POS-tags in the original data can be found in the D-COI POS-tagging and lemmatization manual
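The tag correspondences listed above can be summarised in a small lookup. The sketch below is illustrative only and is not the actual conversion script. It assumes XPOS strings of the form HEAD(attr,attr,...), and it takes the auxiliary decision as a caller-supplied flag because that decision depends on the lemma and the syntactic context. The names parse_xpos and to_upos are hypothetical. Context-dependent cases described above (R-pronouns labeled ADV, LET labeled SYM inside names, adjectival use of names, the ISSUE cases) are not handled.

```python
import re

# XPOS heads that map to a single UPOS regardless of attributes.
SIMPLE = {
    "ADJ": "ADJ",
    "VZ": "ADP",
    "BW": "ADV",
    "LID": "DET",
    "TSW": "INTJ",
    "LET": "PUNCT",
}


def parse_xpos(xpos):
    """Split an assumed 'HEAD(a,b,c)' string into (head, [attributes])."""
    match = re.fullmatch(r"(\w+)(?:\((.*)\))?", xpos)
    if match is None:
        raise ValueError(f"unexpected XPOS format: {xpos!r}")
    attrs = match.group(2).split(",") if match.group(2) else []
    return match.group(1), attrs


def to_upos(xpos, is_aux=False):
    """Map an XPOS string to a UPOS tag following the list above."""
    head, attrs = parse_xpos(xpos)
    if head == "WW":
        return "AUX" if is_aux else "VERB"
    if head == "VG":
        return "CCONJ" if "neven" in attrs else "SCONJ"
    if head == "VNW":
        # Prenominal pronouns are DET, except possessives (bez).
        if "prenom" in attrs and "bez" not in attrs:
            return "DET"
        return "PRON"
    if head == "N":
        return "PROPN" if "eigen" in attrs else "NOUN"
    if head == "TW":
        return "ADJ" if "rang" in attrs else "NUM"
    if head == "SPEC":
        if "deeleigen" in attrs:
            return "PROPN"
        if "afk" in attrs:
            return "X"
        if {"symb", "afgebr", "vreemd"} & set(attrs):
            return "SYM"
        raise ValueError(f"SPEC subtype not covered by this sketch: {xpos!r}")
    if head in SIMPLE:
        return SIMPLE[head]
    raise ValueError(f"XPOS head not covered by this sketch: {xpos!r}")


print(to_upos("N(soort,ev,basis,zijd,stan)"))
print(to_upos("WW(pv,tgw,ev)", is_aux=True))
```

Keeping the auxiliary decision outside the function mirrors the remark above that the original annotation does not distinguish auxiliaries from main verbs.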
Features
- Abbr=Yes for abbreviations (POS=X, XPOS=SPEC afk)
- Case=Acc,Nom for PRON (Nom for XPOS=VNW nomin, Acc for XPOS=VNW obl)
- Definite=Def,Ind for DET (Def for XPOS=LID bep, Ind for XPOS=LID onbep)
- Degree=Pos,Cmp,Sup for adjectives (POS=ADJ; Pos for XPOS=ADJ basis, Cmp for ADJ comp, Sup for ADJ sup)
- Gender=Com,Neut for NOUN and PROPN (Com for N zijd, Neut for N onz, Com,Neut for N genus)
- Person=1,2,3 for PRON (1 for XPOS=VNW 1, 2 for XPOS=VNW 2,2v,2b, 3 for XPOS=VNW 3,3p,3v,3o)
- PronType=Int,Prs,Ind,Rel,Dem for PRON (Dem for demonstratives, VNW aanw; Rel for relative pronouns, VNW betr; Prs for personal and possessive pronouns, VNW pers and VNW bez; Ind for indefinite pronouns, VNW onbep; Int for interrogative pronouns, VNW vb)
- Number=Sing,Plur for AUX, NOUN, VERB, PROPN (Sing for WW ev, WW met-t, N ev; Plur for WW mv, N mv)
- Poss=Yes for PRON with VNW bez
- Reflex=Yes for PRON with VNW refl
- VerbForm=Inf,Fin,Part for AUX and VERB (Inf for WW inf, Fin for WW pv, Part for WW od or WW vd)
Detailed documentation of the decisions w.r.t. features in the original data can be found in the D-COI POS-tagging and lemmatization manual
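The feature list above can be read as a rule table keyed on UPOS, XPOS head and XPOS attribute. The following is a minimal sketch of that reading, not the actual conversion code. It assumes XPOS strings of the form HEAD(attr,attr,...) and assumes that zijd is the attribute marking common gender; the names RULES and to_feats are hypothetical. Features are joined in alphabetical order, as in the FEATS column of CoNLL-U.

```python
import re

NOMINAL = {"NOUN", "PROPN"}
VERBAL = {"AUX", "VERB"}

# Each rule: (UPOS tags it applies to, XPOS head, XPOS attribute, feature, value).
RULES = [
    ({"X"}, "SPEC", "afk", "Abbr", "Yes"),
    ({"PRON"}, "VNW", "nomin", "Case", "Nom"),
    ({"PRON"}, "VNW", "obl", "Case", "Acc"),
    ({"DET"}, "LID", "bep", "Definite", "Def"),
    ({"DET"}, "LID", "onbep", "Definite", "Ind"),
    ({"ADJ"}, "ADJ", "basis", "Degree", "Pos"),
    ({"ADJ"}, "ADJ", "comp", "Degree", "Cmp"),
    ({"ADJ"}, "ADJ", "sup", "Degree", "Sup"),
    (NOMINAL, "N", "zijd", "Gender", "Com"),  # assumption: zijd = common gender
    (NOMINAL, "N", "onz", "Gender", "Neut"),
    (NOMINAL, "N", "genus", "Gender", "Com,Neut"),
    (NOMINAL, "N", "ev", "Number", "Sing"),
    (NOMINAL, "N", "mv", "Number", "Plur"),
    (VERBAL, "WW", "ev", "Number", "Sing"),
    (VERBAL, "WW", "met-t", "Number", "Sing"),
    (VERBAL, "WW", "mv", "Number", "Plur"),
    ({"PRON"}, "VNW", "aanw", "PronType", "Dem"),
    ({"PRON"}, "VNW", "betr", "PronType", "Rel"),
    ({"PRON"}, "VNW", "pers", "PronType", "Prs"),
    ({"PRON"}, "VNW", "bez", "PronType", "Prs"),
    ({"PRON"}, "VNW", "onbep", "PronType", "Ind"),
    ({"PRON"}, "VNW", "vb", "PronType", "Int"),
    ({"PRON"}, "VNW", "bez", "Poss", "Yes"),
    ({"PRON"}, "VNW", "refl", "Reflex", "Yes"),
    (VERBAL, "WW", "inf", "VerbForm", "Inf"),
    (VERBAL, "WW", "pv", "VerbForm", "Fin"),
    (VERBAL, "WW", "od", "VerbForm", "Part"),
    (VERBAL, "WW", "vd", "VerbForm", "Part"),
]
# Person values: the first character of the attribute is the UD person.
RULES += [
    ({"PRON"}, "VNW", attr, "Person", attr[0])
    for attr in ("1", "2", "2v", "2b", "3", "3p", "3v", "3o")
]


def to_feats(upos, xpos):
    """Return a FEATS string ('_' if empty) for a UPOS/XPOS pair."""
    match = re.fullmatch(r"(\w+)(?:\((.*)\))?", xpos)
    if match is None:
        raise ValueError(f"unexpected XPOS format: {xpos!r}")
    head = match.group(1)
    attrs = set(match.group(2).split(",")) if match.group(2) else set()
    feats = {}
    for tags, rule_head, attr, feat, value in RULES:
        if upos in tags and head == rule_head and attr in attrs:
            feats[feat] = value
    if not feats:
        return "_"
    return "|".join(f"{name}={value}" for name, value in sorted(feats.items()))


print(to_feats("NOUN", "N(soort,ev,basis,zijd,stan)"))
```

Keying each rule on the UPOS as well as the XPOS keeps, for example, Case and Person restricted to PRON, as the list above states.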
Syntax
The Dutch treebanks are automatically converted from annotated and manually corrected treebanks. Detailed documentation of the original syntactic annotation is in the syntactic annotation manual of the Lassy project. The data included in the UD treebanks can be explored using the PaQu interface, which supports querying both the original and UD annotation.
- acl, acl:relcl acl is used for phrases headed by a verb modifying a noun. These can be prenominal (as in thans geldende rentestand) or postnominal (as in de vraag of de rente zal stijgen). acl:relcl is used for relative clauses. In the original syntactic annotation these are nodes with a mod dependency relation that occur as sister to a nominal head, and which have category ppres or ppart (prenominal), cp or oti (postnominal), or rel (relative clauses). Verbs without dependents in prenominal position are considered to be amod.
- advcl is used for phrases that occur as modifying phrases (adjuncts) and are dependents of a verbal head. In the original annotation they have relation mod and they can be of category cp, oti, ppart, among others.
- advmod is used for adverbs and adverbial phrases modifying a verb. The POS of advmod elements is almost always ADV or ADJ.
- amod is used for adjectives and other elements modifying a noun. The POS of amod elements is usually ADJ, but ADV, NOUN and others occur as well. ADV is used for elements such as slechts (5 euro), vele (kookboeken), zo’n (25 optredens), occurs in nominalisations (het niet doen terugkeren, where niet is amod of the verb terugkeren, which is itself used nominally), and is used for adverbial pronouns (de verlenging ervan).
- appos is used for appositions. In the original annotation, the relation app is used for a wide range of nominal phrases occurring in postnominal position (de fotograaf Philip Mechanicus, Nooteboom’s debuut ‘Philip en de anderen’, de jaren 1979-1981, de wethouder cultuur, presentatie Slibreeks, Hans Groenewegen, dichter en publicist, ZUiderzinnen, Festival van het woord, zondag 18 september 2005). All these are mapped to the appos dependency relation, even though this stretches the intended use of appos in UD.
- aux, aux:pass aux is used for auxiliaries as defined above in the section on POS tags. Note that this implies that auxiliaries are dependents of the main verb with which they co-occur. In the original annotation, no distinction between verbs and auxiliaries is made, and auxiliaries always have a sister that is a clause headed by the main verb. Note that this also means that elements such as subjects, complementizers, and even the marker ‘te’ become dependents of the main verb, and not the auxiliary.
- case is used for prepositions (ADP) that introduce a prepositional phrase. The preposition is a dependent of the head of the nominal phrase. Where there is both a preposition and a postposition (door de eeuwen heen, om hem heen), both elements are case dependents of the nominal head. In cases where the nominal element is replaced by an R-pronoun (er, etc.), the R-pronoun precedes the preposition and may be non-adjacent to it (U doet er verstandig aan). Note that this is a source of non-projective annotations.
- cc is used for coordination words such as en, of, maar.
- ccomp is used for complement clauses that are dependents of a verb. Complement clauses are phrases with relation vc in the original annotation and that are headed by a finite verb or a te-infinitive, so they can be of category cp, whsub, ti, oti. In ccomp clauses, there is no controlled subject.
- compound:prt is used for separable verbal prefixes (groeide uit, aan te wijzen) and the non-verbal part of phrasal verbs (op prijs stellen, bekend staan, kenbaar maken)
- conj is used for conjuncts.
- cop is used for the copula zijn only. Thus, the copula is a dependent of the predicate. If the copula is preceded by the inflection marker te, the marker also becomes a dependent of the predicate (in wordt aangeraden waakzaam te zijn, we have (waakzaam,mark,te)).
- csubj is used for clausal subjects. Clausal subjects are sometimes introduced by expletive het (marked as expl), as in het blijft onduidelijk wat Japix bedoelt. Clausal subjects can be of category cp, whsub, ti, or oti in the original annotation.
- det is used for determiners, i.e. for elements with the DET POS-tag, as explained above.
- expl, expl:pv Expletives are het or er when used to introduce a clausal subject (het is verstandig u te laten adviseren, u dient er rekening mee te houden dat…). expl:pv is used for inherent reflexives (richt zich op, bevindt zich in, scheidt zich af, jaagt NP tegen zich in het harnas).
- fixed is used for the non-initial parts of multi-word expressions, such as ten aanzien van, voor zover, dan wel, fine fleur. Also, titles of books and other works of art and some institutions are annotated as fixed expressions (De ontdekking van de hemel, Faculteit Kunst en Cultuur), as are some amounts (EUR 37,50, 15 uur). Note that the decision on what to label as fixed follows largely from the original annotation (i.e. phrases with category mwu where the parts are not labeled as proper names). Also note that fixed elements can in fact be coordinated (maandag 18 t/m zaterdag 23 april 2005, where april 2005 is shared between the two conjuncts in the original annotation) and that discontinuous fixed expressions exist (exclusively in the so-called wat-voor construction, as in wat is dit voor een kutfilm).
- flat is used for the non-initial tokens of multi-word proper names (Kees van Kooten) and other multi-word expressions that contain at least one proper name. In particular, in dates like 20 augustus 2000, 20 is the head with augustus and 2000 as flat dependents, as augustus is a name. Also, some titles of works of art are labeled flat, if at least one of the tokens was labeled as SPEC deeleigen in the original annotation. ISSUE: there is some inconsistency between when a multi-word unit introduces flat or fixed dependents, but this is caused at least in part by the underlying annotation.
- iobj is used for indirect objects that are NOT introduced by a preposition. The original annotation has both prepositional (geef het boek aan haar) and nominal (geef haar het boek) obj2 constituents. In UD, only the latter are iobj, while the former are obl dependents.
- mark is used for subordinating conjunctions (dat, omdat, wanneer, hoewel, etc.). The word om is also a mark if it introduces a te-infinitive. The word te preceding a verb is also a mark dependent of the verb. As auxiliaries take no dependents, the te that may precede an auxiliary is attached, somewhat counterintuitively, to the main verb (na door het moeras gedwaald te hebben, here te is a dependent of gedwaald)
- nmod, nmod:poss nmod is used for nominal and prepositional phrases modifying a noun (een neiging to dalen, de rente in de VS). In het Dow Jones gemiddelde, Dow is an nmod dependent of gemiddelde. Note also that some nouns can be used as adjectives, as in de afzijdige waarnemer, where afzijdige is a NOUN and thus an nmod dependent of waarnemer. In Enkele malen the pronoun Enkele is a modifier of the noun in the original annotation, and is thus also labeled nmod. nmod:poss is used for possessive pronouns (hun oude boeken) and genitives (Nootebooms debuut).
- nsubj, nsubj:pass nsubj is used for the nominal subject of finite sentences. nsubj:pass is used for the subject of passives. Clausal subjects are labeled csubj.
- nummod is used for NUM elements occurring in prenominal position (tien arrestaties, 450.000 mark). In zeven miljard gulden we have zeven as nummod dependent of miljard, while miljard (a NOUN) is an nmod of gulden.
- obj is used for the direct object of verbal heads (winst boeken, een shock oplopen). Note that reflexives are labeled as obj if the verb is not inherently reflexive (in zich emanciperen, zich is an obj).
- obl, obl:agent obl is used for prepositional arguments and adjuncts of a verbal head (klopt met de werkelijkheid). (Temporal) nominal adjuncts can appear without a preposition (enkele malen); these are also obl. obl:agent is used for the door-phrase that can be present in passives (hij moet door zijn vrouw tot kalmte worden gebracht). As the underlying annotation does not mark such prepositional phrases, the labeling is based on heuristics and may contain errors.
- orphan is used in elliptic constructions where the syntactic head has been elided and more than one dependent remains. The leftmost dependent is attached to the preceding constituent, while the remaining dependents are attached as orphan to the initial dependent (In 850 fondsen boekten winst tegenover 512 een verlies, een verlies is an orphan dependent of 512 which itself is a conj dependent of boekten).
- parataxis is used to label utterances that do not form a syntactic unit, but consist of a number of phrases for which no obvious dependency label can be given (in dit in verband met de langere levensduur van de vrouw, dit is the root, with the rest of the phrase headed by levensduur being a parataxis dependent of dit). Note that in cases of ellipsis, there is a preceding conjunction which also contains a predicate that can be seen as identical to the elided element. In parataxis constructions, this is not the case. Parataxis is also used in attribution, as in Het deksel was er afgeslagen, zei Rijkers, where the speech verb zei is a parataxis dependent of afgeslagen.
- punct is used for punctuation signs.
- root is the root of the utterance. This is usually the main verb, but in copula constructions it is the head of the predicate.
- xcomp is used for the head of non-finite verbal complements of verbs (de burgemeester wil een traditie handhaven, de debiteuren staan te dringen, hij vraagt om een krediet beschikbaar te stellen), and for predicative complements of non-copula verbs (Fennema werd raadslid, de aandeelhouders vonden het bod onaanvaardbaar). In the enhanced dependencies, the subject of xcomp dependents that are non-finite clauses is added. For other predicative elements no controlled subject is identified.
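The description of case above notes that an R-pronoun separated from its preposition gives rise to non-projective annotations. The sketch below shows one common way to test a dependency tree for projectivity: an arc is projective if every token between head and dependent is dominated by the head. The function names are hypothetical, and the head indices used for U doet er verstandig aan are an illustrative analysis based on the descriptions on this page, not copied from the treebank.

```python
def dominates(heads, ancestor, token):
    """True if `ancestor` is `token` or one of its (transitive) heads."""
    while token != 0:
        if token == ancestor:
            return True
        token = heads[token - 1]
    return False


def is_projective(heads):
    """heads[i] is the head of token i+1 (1-based IDs); 0 marks the root.

    Assumes the heads form a well-formed tree (no cycles).
    """
    for dependent, head in enumerate(heads, start=1):
        if head == 0:
            continue  # the artificial root dominates everything
        left, right = sorted((head, dependent))
        for between in range(left + 1, right):
            if not dominates(heads, head, between):
                return False
    return True


# Assumed analysis of "U doet er verstandig aan":
#   1 U -> 2, 2 doet -> root, 3 er -> 2, 4 verstandig -> 2, 5 aan -> 3 (case)
heads = [2, 0, 2, 2, 3]
print(is_projective(heads))
```

Under this assumed analysis the arc from er to aan spans verstandig, which is attached to doet rather than to er, so the tree is reported as non-projective.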
Treebanks
There are 2 Dutch UD treebanks: