
This page pertains to UD version 2.

UD Hebrew IAHLTknesset

Language: Hebrew (code: he)
Family: Afro-Asiatic

This treebank has been part of Universal Dependencies since the UD v2.15 release.

The following people have contributed to making this treebank part of UD: Amir Zeldes, Avner Algom, Noam Ordan, Yifat Ben Moshe, Nick Howell, Shira Wigderson, Omer Strass, Israel Landau, Netanel Dahan, Yael Minerbi, Hilla Merhav, Emmanuelle Kowner, Shuly Wintner, Gili Goldin, Ella Rabinovich, Vladimir Gurevich.

Repository: UD_Hebrew-IAHLTknesset
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15

License: CC BY-SA 4.0

Genre: government, spoken

Questions, comments? General annotation questions (either Hebrew-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [amir • zeldes (æt) georgetown • edu]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.

Annotation Source
Lemmas annotated manually
UPOS annotated manually, natively in UD style
XPOS not available
Features annotated manually, natively in UD style
Relations annotated manually, natively in UD style

Description

The Knesset section of the publicly available IAHLT UD Hebrew Treebank (https://www.iahlt.org/).

UD_Hebrew-IAHLTknesset is a manually annotated UD treebank of spoken Hebrew, with approximately 67K words in 2,800 sentences taken from transcribed proceedings of the Israeli parliament, the Knesset. The data is a subset of sentences from the proceedings that was originally extracted for modeling factuality. It represents chunks of 100 parliamentary discussions that are sometimes contiguous, but not necessarily entire or fully contiguous (see the document identifiers in the # newdoc id annotations). Where possible, consecutive sentences are given in their original order, though there may be gaps in the dialogue. Speaker names are provided as well.
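Because the sentences come in per-discussion chunks, it can be useful to group them by document before further processing. The following is a minimal sketch that relies only on the standard CoNLL-U comment lines `# newdoc id = ...` and `# sent_id = ...`. The function name, the sample identifiers, and the placeholder tokens are hypothetical and are not taken from the treebank.

```python
def group_sentences_by_doc(conllu_text):
    """Map each `# newdoc id` value to the `# sent_id` values that follow it."""
    docs = {}
    current = None
    for line in conllu_text.splitlines():
        line = line.strip()
        if line.startswith("# newdoc id ="):
            # A new document starts here; later sentences belong to it.
            current = line.split("=", 1)[1].strip()
            docs.setdefault(current, [])
        elif line.startswith("# sent_id =") and current is not None:
            docs[current].append(line.split("=", 1)[1].strip())
    return docs


# Hypothetical sample: identifiers and tokens are placeholders, not real data.
SAMPLE = (
    "# newdoc id = doc_A\n"
    "# sent_id = doc_A-1\n"
    "1\tTOKEN\t_\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = doc_A-2\n"
    "1\tTOKEN\t_\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# newdoc id = doc_B\n"
    "# sent_id = doc_B-1\n"
    "1\tTOKEN\t_\tNOUN\t_\t_\t0\troot\t_\t_\n"
)

if __name__ == "__main__":
    for doc_id, sent_ids in group_sentences_by_doc(SAMPLE).items():
        print(doc_id, len(sent_ids))
```

Grouping by document keeps sentences from the same discussion together. It does not by itself reveal where the gaps in the dialogue fall.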

Compatible datasets

The HTB version used in the project was initially converted automatically, then a subset of the converted data was manually validated and adopted as a gold standard for training the model for UD parsing used in Hebrew-IAHLT. The entire parsed data has been manually edited to correct parsing errors, and was automatically QA’ed to apply corrections following updates in the schema. For a fork of UD_Hebrew-HTB (Ha’aretz newswire data) using the same annotation scheme, see:

https://github.com/IAHLT/UD_Hebrew

For an additional UD_Hebrew corpus with the same annotation scheme (Wikipedia articles), see:

https://github.com/UniversalDependencies/UD_Hebrew-IAHLTwiki

NER annotations

The data additionally contains named entity annotations in the IAHLT scheme, stored under the Entity= key of the MISC column.
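As a rough sketch of how the Entity= values could be read, the code below splits the MISC column (the tenth CoNLL-U column) into its `|`-separated `key=value` pairs and returns the raw Entity value for each token that has one. It makes no assumption about the internal format of the values in the IAHLT scheme. The function names and the sample line, including `VALUE_A`, are hypothetical placeholders rather than real treebank content.

```python
def misc_to_dict(misc):
    """Parse a CoNLL-U MISC field such as 'A=1|B=2' into a dict."""
    if not misc or misc == "_":
        return {}
    pairs = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def entity_annotations(conllu_text):
    """Return (token id, form, raw Entity value) for tokens carrying Entity=."""
    results = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        entity = misc_to_dict(cols[9]).get("Entity")
        if entity is not None:
            results.append((cols[0], cols[1], entity))
    return results


# Hypothetical sample: 'VALUE_A' stands in for whatever the scheme stores.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tTOKEN1\t_\tPROPN\t_\t_\t0\troot\t_\tEntity=VALUE_A|SpaceAfter=No\n"
    "2\tTOKEN2\t_\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    print(entity_annotations(SAMPLE))
```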

## Acknowledgments

We would like to thank Gili Goldin, Shuly Wintner, and Ella Rabinovich for making the original raw data available. We also thank all the people who contributed to this corpus: Amir Zeldes, Hilla Merhav, Israel Landau, Netanel Dahan, Nick Howell, Noam Ordan, Omer Strass, Shira Wigderson, Yael Minerbi and Yifat Ben Moshe.

## References

For academic citations of the IAHLT UD treebanks, please use:

Zeldes, Amir, Nick Howell, Noam Ordan and Yifat Ben Moshe (2022) [A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing](https://arxiv.org/abs/2210.07873). In: *Proceedings of EMNLP 2022*. Abu Dhabi, UAE, 4331-4344.

```bibtex
@InProceedings{ZeldesHowellOrdanBenMoshe2022,
author = {Amir Zeldes and Nick Howell and Noam Ordan and Yifat Ben Moshe},
booktitle = {Proceedings of {EMNLP} 2022},
title = {A Second Wave of {UD} {H}ebrew Treebanking and Cross-Domain Parsing},
year = {2022},
pages = {4331--4344},
address = {Abu Dhabi, UAE},
url = {https://aclanthology.org/2022.emnlp-main.292/},
}
```

For academic citations of the underlying Knesset corpus, please use:

Goldin, Gili, Nick Howell, Noam Ordan, Ella Rabinovich, and Shuly Wintner (2024) The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings.

Statistics of UD Hebrew IAHLTknesset

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X

Features

Abbr – Aspect – Case – Definite – Foreign – Gender – HebBinyan – Mood – Number – NumType – Person – Polarity – Poss – Prefix – PronType – Reflex – Tense – Typo – VerbForm – VerbType – Voice

Relations

acl – acl:relcl – advcl – advmod – amod – appos – aux – case – cc – ccomp – compound – compound:affix – conj – cop – csubj – csubj:outer – csubj:pass – dep – det – discourse – dislocated – expl – fixed – flat – iobj – list – mark – nmod – nmod:poss – nmod:unmarked – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:unmarked – orphan – parataxis – punct – reparandum – root – vocative – xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
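A restriction of this kind can be sketched as a filter over CoNLL-U token lines: keep a dependency only if the parent's UPOS is VERB and the child's UPOS is nominal. This is an illustration, not the procedure used to compute the statistics on this page. The function name and the sample sentence are hypothetical, and treating PROPN as a noun alongside NOUN and PRON is an assumption.

```python
from collections import Counter

# Assumption: "nouns or pronouns" is read as these three UPOS tags.
NOMINAL_UPOS = {"NOUN", "PROPN", "PRON"}


def verb_nominal_relations(conllu_text):
    """Count deprels whose parent is a VERB and whose child is nominal."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        upos_by_id = {}
        rows = []
        for line in block.splitlines():
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            if len(cols) != 10 or not cols[0].isdigit():
                continue  # skip multiword token ranges and empty nodes
            upos_by_id[cols[0]] = cols[3]
            rows.append((cols[3], cols[6], cols[7]))
        for child_upos, head_id, deprel in rows:
            if child_upos in NOMINAL_UPOS and upos_by_id.get(head_id) == "VERB":
                counts[deprel] += 1
    return counts


# Hypothetical sentence with placeholder tokens.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tTOKEN1\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tTOKEN2\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tTOKEN3\t_\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\tTOKEN4\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    print(verb_nominal_relations(SAMPLE))
```

In the sample, the punctuation token is attached to the verb but is excluded because its UPOS is not nominal, and the root verb is excluded because it has no verbal parent.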

Verbs with Reflexive Core Objects

Relations Overview