UD Pesh ChibErgIS
Language: Pesh (code: pay)
Family: Chibchan
This treebank has been part of Universal Dependencies since the UD v2.15 release.
The following people have contributed to making this treebank part of UD: Natalia Cáceres Arandia, Claudine Chamoreau, Sylvain Kahane, Bruno Guillaume.
Repository: UD_Pesh-ChibErgIS
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: spoken
Questions, comments? General annotation questions (either Pesh-specific or cross-linguistic) can be raised in the main UD issue tracker. Bugs in this treebank can be reported in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [natalia • caceres • arandia (æt) cnrs • fr]. Development of the treebank happens outside the UD repository, so any bug must be fixed either in the original data source or in the conversion procedure. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | annotated manually |
UPOS | annotated manually, natively in UD style |
XPOS | not available |
Features | annotated manually, natively in UD style |
Relations | annotated manually, natively in UD style |
Description
A Universal Dependencies corpus for Pesh (also known as Paya), a member of the Chibchan language family. The language has about 500 speakers in Honduras.
The treebank is an automatic conversion of SUD_Pesh-ChibErgIS, which is itself an automatic conversion of mSUD_Pesh-ChibErgIS. The latter was extracted from the interlinearized corpus in Flex format prepared by Claudine Chamoreau and Natalia Cáceres, itself an extension of an oral corpus documented by Claudine Chamoreau (https://www.elararchive.org/dk0392).
Sentences are annotated with the following metadata:
- speaker_id: identifies the turn of speech
- sent_timecode: will enable playback of the sentence
- morphemic_text: original segmentation of the text into morphemes
- text: lexical tokenization
- text_en: English interpretation
- text_phrase-gls-de: original id
- text_phrase-gls-es: Spanish interpretation
- text_phrase-gls-it: IPA transcription
- text_phrase-gls-pro: prosodic transcription
- text_phrase-gls-tl: original comments in Flex
- text_phrase-gls-wg: original word-gloss in Flex
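In CoNLL-U files, sentence-level metadata is stored as comment lines of the form `# key = value` placed before the token rows. The following is a minimal sketch of reading those fields, assuming that layout. The function name and the sample values are illustrative placeholders, not corpus data.

```python
def read_sentence_metadata(conllu_text):
    """Collect the `# key = value` comments of each sentence block."""
    sentences, current, in_block = [], {}, False
    for raw in conllu_text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current sentence block.
            if in_block:
                sentences.append(current)
            current, in_block = {}, False
            continue
        in_block = True
        if line.startswith("#") and "=" in line:
            # Split on the first "=" only, so values may contain "=" themselves.
            key, _, value = line[1:].partition("=")
            current[key.strip()] = value.strip()
    if in_block:
        sentences.append(current)
    return sentences


# Hypothetical sample: keys come from the list above, values are placeholders.
SAMPLE = "\n".join([
    "# speaker_id = A",
    "# text = word1 =clitic",
    "# text_en = placeholder translation",
    "1\tword1\t_\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\t=clitic\t_\tADP\t_\t_\t1\tcase\t_\t_",
    "",
])

metadata = read_sentence_metadata(SAMPLE)
print(metadata[0]["speaker_id"], "|", metadata[0]["text"])
```

Splitting on the first "=" only matters here because clitic tokens in the `text` field themselves begin with "=".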
Structure
This version of the treebank is a dependency annotation of the first four files of the original corpus.
The original data are spoken data. They were first segmented into words with clitics attached, then interlinearized and glossed in Flex with clitics as separate tokens. Tokens comprise words and affixes (the latter preceded by an “=” sign).
The UD_Pesh-ChibErgIS treebank contains 2,507 tokens in 307 sentences.
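Sentence and token totals like these can be recomputed from the CoNLL-U file. The sketch below assumes the standard ten-column, tab-separated layout; `count_units` is a hypothetical helper name, and the sample rows are invented placeholders rather than corpus sentences. It also counts the tokens written with a leading "=".

```python
def count_units(conllu_text):
    """Return (sentences, words, '='-prefixed words) for CoNLL-U text."""
    sentences = words = eq_prefixed = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # blank line separates sentences
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip malformed rows, multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) < 2 or not cols[0].isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        words += 1
        if cols[1].startswith("="):
            eq_prefixed += 1
    return sentences, words, eq_prefixed


# Invented placeholder rows, for illustration only.
SAMPLE = "\n".join([
    "# text = w1 =c1 w2",
    "1\tw1\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "2\t=c1\t_\tADP\t_\t_\t1\tcase\t_\t_",
    "3\tw2\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "",
    "# text = w3 =c2",
    "1\tw3\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "2\t=c2\t_\tPART\t_\t_\t1\tdep\t_\t_",
    "",
])

print(count_units(SAMPLE))  # (2, 5, 2)
```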
References
- Chamoreau, Claudine. 2015. A cross-varietal documentation and description of Pesh, a Chibchan language of Honduras. Endangered Languages Archive. Handle: http://hdl.handle.net/2196/00-0000-0000-000F-BF49-B
Acknowledgments
This treebank was produced as part of the ChibErgIS and Autogramm ANR projects. Special thanks go to Bruno Guillaume for the conversion from SUD to UD, and to Sylvain Kahane, Christian Chanard, Uyên-To Rabier and Aleksandra Miletic.
Statistics of UD Pesh ChibErgIS
POS Tags
ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB – X
Features
AdvType – Animacy – Case – Clusivity – Person – PronType – VerbForm – Voice
Relations
acl – acl:relcl – advcl – advmod – advmod:lmod – appos – aux – case – cc – ccomp – compound – compound:lvc – compound:svc – conj – cop – csubj – dep – dep:conj – det – discourse – dislocated – mark – nmod – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:arg – obl:mod – orphan – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 307 sentences and 2508 tokens.
- All tokens in this corpus are followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 10 types of words that contain both letters and punctuation. Examples: San.Esteban, akasteʃk(w)a, amaspariʃkaw(a), kapaʃbar(w)a, ke,, nãpar(w)a, sukuher(w)a, tarkasakw(a), teʔkertVw(a), yãhaw(a)
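One way to find word types that mix letters and punctuation is to look at Unicode general categories. This is a sketch under the assumption that "letter" means a category starting with `L` and "punctuation" one starting with `P`; the exact criterion used to produce the count above is not stated in this page.

```python
import unicodedata


def has_letter_and_punct(form):
    """True if the form contains at least one letter and one punctuation mark."""
    categories = [unicodedata.category(ch) for ch in form]
    has_letter = any(cat.startswith("L") for cat in categories)
    has_punct = any(cat.startswith("P") for cat in categories)
    return has_letter and has_punct


# Forms taken from the examples and tag lists on this page.
for form in ["San.Esteban", "kapaʃbar(w)a", "ke,", "=ma", "kluk"]:
    print(form, has_letter_and_punct(form))
```

Under this criterion a clitic such as `=ma` is not flagged, because "=" belongs to the Unicode category of math symbols (`Sm`), not punctuation.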
Morphology
Tags
- This corpus uses 15 UPOS tags out of 17 possible: ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X
- This corpus does not use the following tags: ADJ, SYM
- This corpus contains 48 word types tagged as particles (PART): =ha, =hã, =hãʔ, =hĩ, =i, =kan, =kanka, =kari, =ken, =lerwa, =ma, =mã, =n, =na, =pe, =pera, =pero, =pra, =ra, =ras, =ri, =riʃ, =sa, =sri, =tV, =tWh, =wi, =ɲãrã, =ʃona, =ʔi, =ʔã, akarʃki, akarʃkwa, akaʃkwa, akãʃkwa, ama, karʃwĩka, nĩhã, nẽhã, nẽʔ, nẽʔã, ukwa, ukwã, ãm, ãma, ãmã, ũtanĩhã, ẽkerʃ
- This corpus contains 12 lemmas tagged as pronouns (PRON): =wa, eka, este, pa, ta, tas, to, toʔ, wa, ã, ĩ, ẽka
- This corpus contains 9 lemmas tagged as determiners (DET): =na, =nã, =pero, =s, =ɲa, =ɲã, =ɲãh, as, ãs
- This corpus contains 4 lemmas tagged as auxiliaries (AUX): _, ak, r, tʃa
- Out of the above, 3 lemmas occurred sometimes as AUX and sometimes as VERB: ak, r, tʃa
- There is 1 (de)verbal form:
  - Inf
    - VERB: artʃuiʃ
Nominal Features
- Animacy
  - Hum
    - NOUN: taarwã
- Case
  - Abs
    - ADP: =ra, =ro
    - SCONJ: =ro
  - Erg
    - ADP: =ya
  - Nom
    - ADP: =ma
Degree and Polarity
Verbal Features
- Voice
  - Appl
    - AUX: akatʃaitVri, akatʃaui, takatʃai, takatʃaii, takatʃauwa, takatʃawa, ũtakatʃaitVi
    - VERB: artʃuiʃkari, artʃuiʃatVri, tarwarkuh, akasteʃkawa, akastok, arkapriʃi, artapuki, artʃuiʃbartVi, akaporki, akasteʃ
    - VERB-Inf: artʃuiʃ
  - Cau
    - VERB: ũkawa, ũweerwa, ũwarahparh
  - Mid
    - VERB: taõʃi, atʃi, taiʃkari, taõʃ, taõʃkerwa, taõʃki, apastVpi, apiʃki, atuhwa, atuhweʃkwa
  - Rcp
    - VERB: apuru, tVkaeri, tVkairi
Pronouns, Determiners, Quantifiers
- PronType
  - Int
    - ADP: =kanki
    - PART: =kanka
- Person
  - 1
    - AUX: tʃatVpa, =bartVwa
    - VERB: piãpa, kaporpa, akonapa, artʃuiʃpa, kapai, kawiʃpa, kawiʃpai, paspa, peʔpa, piʃpa
  - 2
    - AUX: =rya
    - VERB: kaya, takaya
  - 3
    - VERB: kawiʃkawa
Other Features
- AdvType
  - Ideoph
    - ADV: tõʃ, kluk, roh, teʔne, tukuluk
- Clusivity
  - Ex
    - AUX: tʃabaruri, =barwa, =bari, tʃaberuri, tʃaberwa, =bartVwa, ũtakatʃaitVi
    - NOUN: ũtaoryah, ũtayãha, ũtakaki, ũtaoryaha, ũtasira, ũtasuwa, ũtasãma
    - PART: ũtanĩhã
    - PRON: ũtas
    - VERB: tiʃbarwa, artʃuiʃbartVi, atʃahbari, kapaʃbarwa, artʃuiʃbarwa, kabarwa, kakoyoʃbari, kapaʃbar(w)a, kapaʃbari, kapaʃbarpi
  - In
    - NOUN: patatiʃta, patasaʔa, patasã, pataya, patayãha, patayãhha, pataĩ
    - VERB: ãparh, amaskapiwa, amasparwa, akatipari, amaspari, iʃparwa, kapari, kaparwa, masperwa, nãapi
Syntax
Auxiliary Verbs and Copula
- This corpus uses 2 lemmas as copulas (cop). Examples: r, _.
- This corpus uses 3 lemmas as auxiliaries (aux). Examples: tʃa, ak, r.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (25)
- VERB--NOUN-ADP(=ma) (7)
- VERB--NOUN-ADP(=mã) (1)
- VERB--NOUN-ADP(=ya) (5)
- VERB--PRON (14)
- VERB--PRON-ADP(=ma) (7)
- VERB--PRON-ADP(=ma)-ADP(=ma) (1)
- VERB--PRON-ADP(=mã) (2)
- obj
- VERB--NOUN (53)
- VERB--NOUN-ADP(=ma) (5)
- VERB--NOUN-ADP(=ra) (2)
- VERB--NOUN-ADP(=yo) (1)
- VERB--PRON (13)
- VERB--PRON-ADP(=ken) (1)
- VERB--PRON-ADP(=ma) (4)
- VERB--PRON-ADP(=ra) (4)
- VERB-Inf--PRON (1)
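Pattern labels such as `VERB--PRON-ADP(=ma)` can be read as: a verb parent, a pronoun child, and an adposition attached to that child. The sketch below shows one way to count such patterns for a single sentence. It is an illustration under assumptions, not the procedure used to generate the figures above: it treats only `case` dependents tagged ADP as markers, uses their word forms in the label, and ignores VerbForm (so it would not produce a label like `VERB-Inf`). The row format and the function name are hypothetical.

```python
from collections import Counter


def argument_patterns(rows, relations=("nsubj", "obj")):
    """Count parent--child patterns for one sentence.

    rows: list of (id, form, upos, head, deprel) tuples.
    """
    by_id = {row[0]: row for row in rows}
    patterns = Counter()
    for tid, form, upos, head, deprel in rows:
        if deprel not in relations or upos not in ("NOUN", "PRON"):
            continue
        parent = by_id.get(head)
        if parent is None or parent[2] != "VERB":
            continue
        # Adpositions attached to the argument with the `case` relation.
        markers = [
            f"ADP({r[1]})"
            for r in rows
            if r[3] == tid and r[2] == "ADP" and r[4] == "case"
        ]
        label = parent[2] + "--" + "-".join([upos] + markers)
        patterns[(deprel, label)] += 1
    return patterns


# Invented placeholder sentence; only "=ma" is a form attested on this page.
rows = [
    (1, "w1", "PRON", 4, "nsubj"),
    (2, "=ma", "ADP", 1, "case"),
    (3, "w2", "NOUN", 4, "obj"),
    (4, "w3", "VERB", 0, "root"),
]

for (deprel, label), count in sorted(argument_patterns(rows).items()):
    print(deprel, label, count)
```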
Relations Overview
- This corpus uses 9 relation subtypes: acl:relcl, advmod:lmod, compound:lvc, compound:svc, dep:conj, nsubj:outer, nsubj:pass, obl:arg, obl:mod
- The following 8 relation types are not used in this corpus at all: iobj, expl, amod, clf, fixed, flat, list, goeswith
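Both figures in this overview can be derived from the relation inventory listed under Relations. In this sketch a subtype is any label containing ":", and the unused types are the universal relations whose base label never occurs. The set of 37 universal relations is taken from the UD v2 guidelines; the variable names are illustrative.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL_RELATIONS = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}

# Relations used in this treebank, as listed on this page.
USED_RELATIONS = [
    "acl", "acl:relcl", "advcl", "advmod", "advmod:lmod", "appos", "aux",
    "case", "cc", "ccomp", "compound", "compound:lvc", "compound:svc", "conj",
    "cop", "csubj", "dep", "dep:conj", "det", "discourse", "dislocated",
    "mark", "nmod", "nsubj", "nsubj:outer", "nsubj:pass", "nummod", "obj",
    "obl", "obl:arg", "obl:mod", "orphan", "parataxis", "punct", "reparandum",
    "root", "vocative", "xcomp",
]

# A subtype is written as base:subtype.
subtypes = sorted(rel for rel in USED_RELATIONS if ":" in rel)

# A universal relation counts as used if it occurs bare or as the base of a subtype.
used_bases = {rel.split(":")[0] for rel in USED_RELATIONS}
unused = sorted(UNIVERSAL_RELATIONS - used_bases)

print(len(subtypes), subtypes)
print(len(unused), unused)
```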