
This page pertains to UD version 2.

UD Icelandic GC

Language: Icelandic (code: is)
Family: IE

This treebank has been part of Universal Dependencies since the UD v2.11 release.

The following people have contributed to making this treebank part of UD: Vilhjálmur Þorsteinsson, Hulda Óladóttir, Þórunn Arnardóttir, Sveinbjörn Þórðarson, Haukur Barri Símonarson, Katla Ásgeirsdóttir.

Repository: UD_Icelandic-GC
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15

License: CC BY-SA 4.0

Genre: news, government

Questions, comments? General annotation questions (either Icelandic-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [thar (æt) hi • is]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas: annotated manually
UPOS: annotated manually in non-UD style, automatically converted to UD
XPOS: annotated manually
Features: annotated manually in non-UD style, automatically converted to UD
Relations: annotated manually in non-UD style, automatically converted to UD

Description

UD_Icelandic-GC is a conversion of the gold part of GreynirCorpus, which has been manually corrected and verified. The corpus is annotated with full constituency trees, which were converted to UD using UDConverter-GreynirCorpus.

The treebank consists of text extracted from news and government websites in the years 2015-2021.

The GreynirCorpus data was originally split into a development set and a test set. UD_Icelandic-GC keeps the test set unchanged: it consists of 10% of the total number of sentences, chosen at random. The original development set was divided further: every tenth file now forms the UD development set, and the remaining files form the training set.
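The train/dev division described above can be sketched as follows. This is only an illustration, not the procedure actually used to build the treebank: the function name, the file names, and the reading of "every tenth file" as positions 10, 20, 30, ... in sorted order are all assumptions.

```python
from typing import List, Tuple


def split_train_dev(files: List[str], every: int = 10) -> Tuple[List[str], List[str]]:
    """Send every `every`-th file (in sorted order) to dev, the rest to train."""
    train: List[str] = []
    dev: List[str] = []
    for index, name in enumerate(sorted(files), start=1):
        # Assumption: "every tenth file" means positions 10, 20, 30, ...
        if index % every == 0:
            dev.append(name)
        else:
            train.append(name)
    return train, dev


# Hypothetical file names, for illustration only.
example_files = [f"file_{i:03d}.txt" for i in range(1, 31)]
train_files, dev_files = split_train_dev(example_files)
print(len(train_files), len(dev_files))
```

Splitting by file rather than by sentence keeps all sentences from one source document in the same set.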

The treebank consists of 99,611 tokens in total. The training set consists of 78,568 tokens, the development set of 10,694 tokens and the test set of 10,349 tokens.
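As a quick consistency check, the token counts above can be summed and turned into per-split shares. The variable names are illustrative; only the four numbers come from this page.

```python
# Token counts as reported on this page.
splits = {"train": 78_568, "dev": 10_694, "test": 10_349}
reported_total = 99_611

total = sum(splits.values())
# The three splits should add up to the reported total.
assert total == reported_total

# Share of tokens in each split.
shares = {name: count / total for name, count in splits.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The test split was drawn as 10% of sentences, so its share of tokens need not be exactly 10%.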

Acknowledgments

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur (https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.

Statistics of UD Icelandic GC

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X

Features

Case, Definite, Degree, Gender, Mood, Number, NumType, Person, PronType, Tense, VerbForm, Voice

Relations

acl, acl:relcl, advcl, advmod, amod, aux, case, cc, ccomp, compound:prt, conj, cop, csubj, dep, det, discourse, expl, fixed, flat, flat:foreign, flat:name, iobj, mark, nmod, nmod:poss, nsubj, nummod, obj, obl, parataxis, punct, root, xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Relations Overview