UD English GUMReddit
Language: English (code: en)
Family: IE
This treebank has been part of Universal Dependencies since the UD v2.6 release.
The following people have contributed to making this treebank part of UD: Siyao Peng, Amir Zeldes.
Repository: UD_English-GUMReddit
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY 4.0. The underlying text is not included; the user must obtain it separately and then merge it with the UD annotation using a script distributed with UD.
Genre: blog, social
Questions, comments? General annotation questions (either English-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [amir • zeldes (æt) georgetown • edu]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | annotated manually |
UPOS | annotated manually in non-UD style, automatically converted to UD |
XPOS | annotated manually |
Features | annotated manually in non-UD style, automatically converted to UD |
Relations | annotated manually, natively in UD style |
Description
Universal Dependencies syntax annotations from the Reddit portion of the GUM corpus (https://gucorpling.org/gum/).
This repository contains only the annotations, without the underlying textual data from Reddit.
To obtain the underlying text, you will need to use the script get_text.py. For more information on the underlying Reddit text see this page. For Universal Dependencies annotations of other genres from GUM, see https://github.com/UniversalDependencies/UD_English-GUM
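Before training or evaluating on the data, it can help to confirm that the text has actually been merged into the annotation files. The sketch below assumes that withheld word forms appear as the CoNLL-U placeholder `_` in the FORM column (an assumption, though consistent with the `_` examples in the statistics on this page). The function name and both sample sentences are hypothetical and are not part of get_text.py.

```python
def has_placeholder_forms(conllu_text: str) -> bool:
    """Return True if any token line still has '_' in the FORM column.

    Assumption: withheld text is represented by '_' in column 2 (FORM).
    A genuine underscore token in the source text would also trigger this.
    """
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) == 10 and cols[1] == "_":
            return True
    return False


def make_row(idx, form, lemma, upos, xpos, head, deprel):
    """Build one 10-column CoNLL-U line (hypothetical helper for the samples)."""
    return "\t".join([idx, form, lemma, upos, xpos, "_", head, deprel, "_", "_"])


# Hypothetical two-word sentence, before and after the text is merged in.
unmerged = "\n".join([
    "# sent_id = example-1",
    make_row("1", "_", "_", "PRON", "PRP", "2", "nsubj"),
    make_row("2", "_", "_", "VERB", "VBP", "0", "root"),
])
merged = "\n".join([
    "# sent_id = example-1",
    make_row("1", "I", "I", "PRON", "PRP", "2", "nsubj"),
    make_row("2", "agree", "agree", "VERB", "VBP", "0", "root"),
])

print(has_placeholder_forms(unmerged))
print(has_placeholder_forms(merged))
```

A check like this only detects unmerged files; it does not restore any text.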
GUM, the Georgetown University Multilayer corpus, is an open source collection of richly annotated texts from multiple text types. The corpus is collected and expanded by students as part of the curriculum in the course LING-4427 “Computational Corpus Linguistics” at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (usually Creative Commons licenses), so that new texts can be annotated and published with ease.
The dependencies in the corpus up to GUM version 5 were originally annotated using Stanford Typed Dependencies (de Marneffe & Manning 2013) and converted automatically to UD using DepEdit (https://gucorpling.org/depedit/). The rule-based conversion took into account gold annotations found in other annotation layers of the GUM corpus (e.g. entity annotations), and has since been corrected manually in native UD. The original conversion script can be found in the GUM build bot code from version 5, available from the (non-UD) GUM repository. Documents from version 6 of GUM onwards were annotated directly in UD, and subsequent manual error correction to all GUM data has also been done directly using the UD guidelines. Enhanced dependencies were added semi-automatically from version 7.1 of the corpus. For more details see the corpus website.
Acknowledgments
GUM annotation team (so far - thanks for participating!)
Adrienne Isaac, Akitaka Yamada, Alex Giorgioni, Alexandra Berends, Alexandra Slome, Amani Aloufi, Amber Hall, Amelia Becker, Andrea Price, Andrew O’Brien, Ángeles Ortega Luque, Aniya Harris, Anna Prince, Anna Runova, Anne Butler, Arianna Janoff, Aryaman Arora, Ayan Mandal, Aysenur Sagdic, Bertille Baron, Bradford Salen, Brandon Tullock, Brent Laing, Caitlyn Pineault, Calvin Engstrom, Candice Penelton, Carlotta Hübener, Caroline Gish, Charlie Dees, Chenyue Guo, Chloe Evered, Cindy Luo, Colleen Diamond, Connor O’Dwyer, Cristina Lopez, Cynthia Li, Dan DeGenaro, Dan Simonson, Derek Reagan, Devika Tiwari, Didem Ikizoglu, Edwin Ko, Eliza Rice, Emile Zahr, Emily Pace, Emma Manning, Emma Rafkin, Ethan Beaman, Felipe De Jesus, Han Bu, Hana Altalhi, Hang Jiang, Hannah Wingett, Hanwool Choe, Hassan Munshi, Helen Dominic, Ho Fai Cheng, Hortensia Gutierrez, Jakob Prange, James Maguire, Janine Karo, Jehan al-Mahmoud, Jemm Excelle Dela Cruz, Jess Godes, Jessica Cusi, Jessica Kotfila, Jingni Wu, Joaquin Gris Roca, John Chi, Jongbong Lee, Juliet May, Jungyoon Koh, Katarina Starcevic, Katelyn Carroll, Katelyn MacDougald, Katherine Vadella, Khalid Alharbi, Kristen Cook, Lara Bryfonski, Lauren Levine, Leah Northington, Lindley Winchester, Linxi Zhang, Lucia Donatelli, Luke Gessler, Mackenzie Gong, Margaret Anne Rowe, Margaret Borowczyk, Maria Laura Zalazar, Maria Stoianova, Mariko Uno, Mary Henderson, Maya Barzilai, Md. Jahurul Islam, Michael Kranzlein, Michaela Harrington, Mingyeong Choi, Minnie Annan, Mitchell Abrams, Mohammad Ali Yektaie, Naomee-Minh Nguyen, Negar Siyari, Nicholas Mararac, Nicholas Workman, Nicole Steinberg, Nitin Venkateswaran, Parker DiPaolo, Phoebe Fisher, Rachel Kerr, Rachel Thorson, Rebecca Childress, Rebecca Farkas, Riley Breslin Amalfitano, Rima Elabdali, Robert Maloney, Ruizhong Li, Ryan Mannion, Ryan Murphy, Sakol Suethanapornkul, Sarah Bellavance, Sarah Carlson, Sasha Slone, Saurav Goswami, Sean Macavaney, Sean Simpson, Seyma Toker, Shane Quinn, Shannon Mooney, Shelby Lake, Shira Wein, Sichang Tu, Siddharth Singh, Siona Ely, Siyao Peng, Siyu Liang, Stephanie Kramer, Sylvia Sierra, Talal Alharbi, Tatsuya Aoyama, Tess Feyen, Timothy Ingrassia, Trevor Adriaanse, Ulie Xu, Wai Ching Leung, Wenxi Yang, Wesley Scivetti, Xiaopei Wu, Xiulin Yang, Yang Liu, Yi-Ju Lin, Yifu Mu, Yilun Zhu, Yingzhu Chen, Yiran Xu, Young-A Son, Yu-Tzu Chang, Yuhang Hu, Yunjung Ku, Yushi Zhao, Zhijie Song, Zhuosi Luo, Zhuxin Wang, Amir Zeldes
… and other annotators who wish to remain anonymous!
References
To cite the Reddit subset of GUM in particular, please use this citation:
- Behzad, Shabnam and Zeldes, Amir (2020) “A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging”. In: Proceedings of the 12th Web as Corpus Workshop (WAC-XII).
@InProceedings{BehzadZeldes2020,
author = {Shabnam Behzad and Amir Zeldes},
title = {A Cross-Genre Ensemble Approach to Robust {R}eddit Part of Speech Tagging},
booktitle = {Proceedings of the 12th Web as Corpus Workshop (WAC-XII)},
pages = {50--56},
year = {2020},
}
As a scholarly citation for the GUM corpus as a whole, please use this article (note that this paper predates the inclusion of Reddit data in GUM):
- Zeldes, Amir (2017) “The GUM Corpus: Creating Multilayer Resources in the Classroom”. Language Resources and Evaluation 51(3), 581–612.
@Article{Zeldes2017,
author = {Amir Zeldes},
title = {The {GUM} Corpus: Creating Multilayer Resources in the Classroom},
journal = {Language Resources and Evaluation},
year = {2017},
volume = {51},
number = {3},
pages = {581--612},
doi = {10.1007/s10579-016-9343-x}
}
Statistics of UD English GUMReddit
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Abbr – Case – Definite – Degree – ExtPos – Gender – Mood – Number – NumForm – NumType – Person – Polarity – Poss – PronType – Reflex – Style – Tense – Typo – VerbForm – Voice
Relations
acl – acl:relcl – advcl – advcl:relcl – advmod – amod – appos – aux – aux:pass – case – cc – cc:preconj – ccomp – compound – compound:prt – conj – cop – csubj – csubj:pass – dep – det – det:predet – discourse – dislocated – expl – fixed – flat – goeswith – iobj – mark – nmod – nmod:poss – nmod:unmarked – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:agent – obl:unmarked – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 895 sentences, 15958 tokens and 16364 syntactic words.
- This corpus contains 1922 tokens (12%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus does not contain words that contain both letters and punctuation.
- This corpus contains 406 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
- There is 1 type of multi-word token. Examples: __.
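The counts above are mutually consistent, which can be checked with a few lines of arithmetic: each multi-word token counts once as a token but contributes two syntactic words. The variable names below are illustrative only; the numbers are taken from the list above.

```python
tokens = 15958            # surface tokens
multiword_tokens = 406    # tokens that split into several syntactic words
words_per_mwt = 2.00      # average syntactic words per multi-word token
no_space_after = 1922     # tokens not followed by a space

# Replace each multi-word token by the syntactic words it expands to.
syntactic_words = tokens - multiword_tokens + round(multiword_tokens * words_per_mwt)
share_no_space = no_space_after / tokens

print(syntactic_words)            # 16364
print(f"{share_no_space:.1%}")    # 12.0%
```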
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 1 word type tagged as a particle (PART): _
- This corpus contains 1 lemma tagged as a pronoun (PRON): _
- This corpus contains 1 lemma tagged as a determiner (DET): _
- Out of the above, 1 lemma occurred sometimes as PRON and sometimes as DET: _
- This corpus contains 1 lemma tagged as an auxiliary (AUX): _
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: _
- There are 4 (de)verbal forms:
  - Fin
    - AUX: _
    - VERB: _
  - Ger
    - AUX: _
    - VERB: _
  - Inf
    - AUX: _
    - VERB: _
  - Part
    - AUX: _
    - VERB: _
Nominal Features
- Gender
  - Fem
    - PRON: _
  - Masc
    - PRON: _
  - Neut
    - PRON: _
- Number
  - Plur
    - AUX-Fin: _
    - DET: _
    - NOUN: _
    - PRON: _
    - PROPN: _
    - VERB-Fin: _
  - Ptan
    - NOUN: _
  - Sing
    - AUX-Fin: _
    - DET: _
    - NOUN: _
    - PRON: _
    - PROPN: _
    - SYM: _
    - VERB-Fin: _
- Case
  - Acc
    - PRON: _
  - Gen
    - PRON: _
  - Nom
    - PRON: _
- Definite
  - Def
    - DET: _
  - Ind
    - DET: _
Degree and Polarity
- Degree
  - Cmp
    - ADJ: _
    - ADV: _
  - Pos
    - ADJ: _
    - ADV: _
  - Sup
    - ADJ: _
    - ADV: _
- Polarity
  - Neg
    - ADV: _
    - CCONJ: _
    - INTJ: _
    - PART: _
  - Pos
    - INTJ: _
Verbal Features
- Mood
  - Imp
    - AUX-Fin: _
    - VERB-Fin: _
  - Ind
    - AUX-Fin: _
    - VERB-Fin: _
  - Sub
    - AUX-Fin: _
    - VERB-Fin: _
- Tense
  - Past
    - AUX-Fin: _
    - AUX-Part: _
    - VERB-Fin: _
    - VERB-Part: _
  - Pres
    - AUX-Fin: _
    - AUX-Part: _
    - VERB-Fin: _
    - VERB-Part: _
- Voice
  - Pass
    - VERB-Part: _
Pronouns, Determiners, Quantifiers
- PronType
  - Art
    - DET: _
  - Dem
    - ADV: _
    - DET: _
    - PRON: _
  - Emp
    - PRON: _
  - Ind
    - DET: _
    - PRON: _
  - Int
    - ADV: _
    - DET: _
    - PRON: _
  - Neg
    - DET: _
    - PRON: _
  - Prs
    - PRON: _
  - Rel
    - ADV: _
    - DET: _
    - PRON: _
  - Tot
    - DET: _
    - PRON: _
- NumType
  - Card
    - NOUN: _
    - NUM: _
    - PROPN: _
  - Frac
    - NOUN: _
  - Mult
    - ADV: _
  - Ord
    - ADJ: _
    - ADV: _
- Poss
  - Yes
    - PRON: _
- Reflex
  - Yes
    - PRON: _
- Person
  - 1
    - AUX-Fin: _
    - PRON: _
    - VERB-Fin: _
  - 2
    - AUX-Fin: _
    - PRON: _
    - VERB-Fin: _
    - VERB-Inf: _
  - 3
    - AUX-Fin: _
    - PRON: _
    - VERB-Fin: _
Other Features
- Abbr
  - Yes
    - ADP: _
    - ADV: _
    - NOUN: _
    - PRON: _
    - PROPN: _
    - VERB-Fin: _
    - VERB-Inf: _
- ExtPos
  - ADP
    - VERB-Part: _
  - ADV
    - ADP: _
    - ADV: _
    - NOUN: _
- NumForm
  - Combi
    - ADJ: _
    - NOUN: _
    - NUM: _
  - Digit
    - NUM: _
  - Word
    - ADJ: _
    - ADV: _
    - NOUN: _
    - NUM: _
    - PROPN: _
- Style
  - Coll
    - PART: _
  - Vrnc
    - VERB-Part: _
- Typo
  - Yes
    - ADJ: _
    - ADP: _
    - ADV: _
    - AUX-Fin: _
    - AUX-Inf: _
    - CCONJ: _
    - DET: _
    - NOUN: _
    - PART: _
    - PRON: _
    - PROPN: _
    - PUNCT: _
    - SCONJ: _
    - VERB: _
    - VERB-Fin: _
    - VERB-Ger: _
    - VERB-Inf: _
    - VERB-Part: _
    - X: _
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Examples: _.
- This corpus uses 1 lemma as an auxiliary (aux). Examples: _.
- This corpus uses 1 lemma as a passive auxiliary (aux:pass). Examples: _.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (1)
  - VERB-Fin--NOUN (189)
  - VERB-Fin--PRON (93)
  - VERB-Fin--PRON-Nom (368)
  - VERB-Ger--NOUN (4)
  - VERB-Inf--NOUN (52)
  - VERB-Inf--PRON (31)
  - VERB-Inf--PRON-Nom (182)
  - VERB-Part--NOUN (22)
  - VERB-Part--PRON (16)
  - VERB-Part--PRON-Nom (121)
- obj
  - VERB--NOUN (1)
  - VERB-Fin--NOUN (236)
  - VERB-Fin--PRON (37)
  - VERB-Fin--PRON-Acc (45)
  - VERB-Ger--NOUN (26)
  - VERB-Ger--PRON (2)
  - VERB-Ger--PRON-Acc (6)
  - VERB-Inf--NOUN (208)
  - VERB-Inf--PRON (44)
  - VERB-Inf--PRON-Acc (55)
  - VERB-Inf--PRON-Gen (1)
  - VERB-Part--NOUN (82)
  - VERB-Part--PRON (17)
  - VERB-Part--PRON-Acc (18)
- iobj
  - VERB-Fin--NOUN (7)
  - VERB-Fin--PRON-Acc (23)
  - VERB-Fin--PRON-Gen (1)
  - VERB-Ger--NOUN (2)
  - VERB-Inf--NOUN (2)
  - VERB-Inf--PRON (2)
  - VERB-Inf--PRON-Acc (9)
  - VERB-Part--NOUN (3)
  - VERB-Part--PRON (1)
  - VERB-Part--PRON-Acc (7)
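As a rough illustration of how counts like those above could be derived from CoNLL-U data, the sketch below tallies verb-to-noun/pronoun pairs for one sentence. It is not the script used to generate this page. It assumes that labels such as `VERB-Fin--PRON-Nom` combine the parent's UPOS and `VerbForm` with the child's UPOS and `Case`, and that only the exact relations `nsubj`, `obj` and `iobj` are counted; both are assumptions. The sample sentence is hypothetical.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}  # assumption: subtypes are not merged in


def feats_to_dict(feats):
    """Parse a CoNLL-U FEATS string such as 'Case=Nom|Number=Sing'."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def label(upos, feats, feature):
    """Append the value of one feature to the UPOS tag when it is present."""
    value = feats_to_dict(feats).get(feature)
    return f"{upos}-{value}" if value else upos


def count_core_pairs(sentence_lines):
    """Count (relation, 'PARENT--CHILD') pairs within one sentence."""
    rows = {}
    for line in sentence_lines:
        cols = line.split("\t")
        # Keep plain word lines only; this skips comments,
        # multi-word token ranges (1-2) and empty nodes (1.1).
        if len(cols) == 10 and cols[0].isdigit():
            rows[cols[0]] = cols
    counts = Counter()
    for cols in rows.values():
        head = rows.get(cols[6])  # None for the root, whose head is 0
        if head is None or cols[7] not in CORE_RELATIONS:
            continue
        if head[3] == "VERB" and cols[3] in {"NOUN", "PRON"}:
            pair = label("VERB", head[5], "VerbForm") + "--" + label(cols[3], cols[5], "Case")
            counts[(cols[7], pair)] += 1
    return counts


# Hypothetical three-word sentence with placeholder forms and lemmas.
sample = [
    "\t".join(["1", "_", "_", "PRON", "PRP",
               "Case=Nom|Number=Sing|Person=3|PronType=Prs", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "_", "_", "VERB", "VBZ",
               "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", "0", "root", "_", "_"]),
    "\t".join(["3", "_", "_", "NOUN", "NNS", "Number=Plur", "2", "obj", "_", "_"]),
]

counts = count_core_pairs(sample)
for (relation, pair), n in sorted(counts.items()):
    print(relation, pair, n)
```

Summing such counters over all sentences of a treebank would give per-relation tables of the same shape as the list above.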
Verbs with Reflexive Core Objects
- This corpus contains 1 lemma that occurs at least once with a reflexive core object (obj or iobj). Examples: _ _
Relations Overview
- This corpus uses 13 relation subtypes: acl:relcl, advcl:relcl, aux:pass, cc:preconj, compound:prt, csubj:pass, det:predet, nmod:poss, nmod:unmarked, nsubj:outer, nsubj:pass, obl:agent, obl:unmarked
- The following 3 relation types are not used in this corpus at all: clf, list, orphan
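Both figures can be reproduced from the relation inventory listed under Relations near the top of the statistics. The sketch below takes that list together with the 37 universal relation types of UD v2, counts the labels that contain a subtype, and computes which universal types never occur. The variable names are illustrative.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj dep
det discourse dislocated expl fixed flat goeswith iobj list mark nmod nsubj
nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relations used in this treebank, copied from the Relations list on this page.
used = """
acl acl:relcl advcl advcl:relcl advmod amod appos aux aux:pass case cc
cc:preconj ccomp compound compound:prt conj cop csubj csubj:pass dep det
det:predet discourse dislocated expl fixed flat goeswith iobj mark nmod
nmod:poss nmod:unmarked nsubj nsubj:outer nsubj:pass nummod obj obl obl:agent
obl:unmarked parataxis punct reparandum root vocative xcomp
""".split()

subtypes = [rel for rel in used if ":" in rel]       # labels with a subtype
base_types = {rel.split(":")[0] for rel in used}     # universal part of each label
unused = sorted(UNIVERSAL - base_types)

print(len(subtypes))   # 13
print(unused)          # ['clf', 'list', 'orphan']
```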