This page pertains to UD version 2.

UD Nheengatu CompLin

Language: Nheengatu (code: yrl)
Family: Tupian, Maweti-Guarani

This treebank has been part of Universal Dependencies since the UD v2.11 release.

The following people have contributed to making this treebank part of UD: Leonel Figueiredo de Alencar.

Repository: UD_Nheengatu-CompLin
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.13

License: CC BY-NC-SA 4.0

Genre: spoken, bible, fiction, nonfiction, grammar-examples

Questions, comments? General annotation questions (either Nheengatu-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [leonel • de • alencar (æt) ufc • br]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation	Source
Lemmas	annotated manually
UPOS	annotated manually, natively in UD style
XPOS	annotated manually
Features	annotated manually, natively in UD style
Relations	annotated manually, natively in UD style

Description

The UD_Nheengatu-CompLin is a treebank of Nheengatu, also known, e.g., as Modern Tupi and Língua Geral Amazônica. It comprises sentences from diverse published sources, e.g., grammatical descriptions, fables, myths, coursebooks, and dictionaries.

To our knowledge, this is the first treebank of Nheengatu. It is a work in progress. The initial release contained only a couple hundred sentences. This new release encompasses more than six times this number. We plan to keep expanding the resource over the coming months.

The published sources from which the sentences are drawn are freely available on the Internet. The sentences were either extracted from text-based PDF files, transcribed from non-searchable (image-only) PDF files, or manually converted to orthography from phonetic transcriptions. Throughout the treebank, we use the spelling system of Navarro (2016), which differs only in minor respects from that of Avila (2021). The annotation was performed semi-automatically: we applied a Python program to the output of a morphological analyzer and then manually revised each automatically annotated sentence.

The development of this treebank and related tools and resources is part of the research activities of the Research Group on Computation and Natural Language (Computação e Linguagem Natural — CompLin) at the Humanities Center of the Federal University of Ceará in Brazil. The main contributor to this effort is Leonel Figueiredo de Alencar, coordinator of the CompLin group. Additional annotators include Dominick Maia Alexandre. For more information, please visit the corresponding repository:

https://github.com/CompLin/nheengatu

So far, the treebank includes examples from Magalhães (1876), Rodrigues (1890), Amorim (1928), Moore, Facundes, and Pires (1994), Casasnovas (2006), Cruz (2011), Comunidade de Terra Preta (2013), Stradelli (2014), Navarro (2016), Alencar (2021), and Avila (2021) as well as from the New Testament (Novo Testamento na língua Nyengatu, 1973/2019).

Acknowledgments

We are grateful to Eduardo de Almeida Navarro (University of São Paulo) for kindly allowing us to use examples and texts from his coursebook (Navarro 2016) in this project. In addition, the glossary of this coursebook served as the initial basis for the morphological analyzer.

We also acknowledge the use of the dictionary of Avila (2021), from which numerous treebank sentences stem. This dictionary also provided invaluable lexical, grammatical, and semantic information for the further development of the morphological analyzer and related treebank annotation tools. We are much obliged to its author, Marcel Twardowsky Avila, for making the XML version of the dictionary available to us and for clarifying questions about some entries.

License

Copyright of the treebank sentences and their translations belongs to their respective authors. This data is made available here solely to promote research, teaching, and learning of the Nheengatu language. Therefore, it should not be used for any commercial purposes. For more information, see LICENSE.txt.

References

Avila, Marcel Twardowsky. (2021). Proposta de dicionário nheengatu-português [Doctoral dissertation, University of São Paulo]. doi:10.11606/T.8.2021.tde-10012022-201925
Casasnovas, Afonso. (2006). Noções de língua geral ou nheengatú: Gramática, lendas e vocabulário (2nd ed.). Editora da Universidade Federal do Amazonas; Faculdade Salesiana Dom Bosco.
Comunidade de Terra Preta. (2013). Fábulas de Terra Preta: Uma coletânea bilingüe.
Cruz, Aline da. (2011). Fonologia e gramática do nheengatú: A língua falada pelos povos Baré, Warekena e Baniwa. Netherlands National Graduate School of Linguistics.
de Alencar, Leonel Figueiredo. (2021). Uma gramática computacional de um fragmento do nheengatu / A computational grammar for a fragment of Nheengatu. Revista de Estudos da Linguagem, 29(3), 1717-1777. doi:10.17851/2237-2083.29.3.1717-1777
de Amorim, Antonio Brandão. (1928). Lendas em nheêngatú e em portuguez. Revista do Instituto Historico e Geographico Brasileiro, 154(2), 9-475.
de Magalhães, J. V. C. (1876). O selvagem. Typographia da Reforma.
Maslova, Irina. (2018). Tradução Comentada de Mitos e Lendas Amazônicas do Nheengatu para o Russo. [Master’s Dissertation, University of São Paulo]. doi:10.11606/D.8.2019.tde-22022019-175350
Moore, Denny, Facundes, Sidney, & Pires, Nádia. (1994). Nheengatu (Língua Geral Amazônica), its History, and the Effects of Language Contact. UC Berkeley: Department of Linguistics. Retrieved from https://escholarship.org/uc/item/7tb981s1
Navarro, Eduardo de Almeida. (2016). Curso de Língua Geral (nheengatu ou tupi moderno): A língua das origens da civilização amazônica (2nd ed.). Centro Angel Rama da Faculdade de Filosofia, Letras e Ciências Humanas da Universidade de São Paulo.
Novo Testamento na língua Nyengatu (2nd ed.). (2019). Missão Novas Tribos do Brasil. (Original work published 1973)
Rodrigues, João Barbosa. (1890). Poranduba amazonense ou kochiyma-uara porandub, 1872-1887. Typ. de G. Leuzinger & Filhos.
Stradelli, Ermanno. (2014). Vocabulário português-nheengatu, nheengatu-português. Ateliê Editorial. (Original work published 1929)

Statistics of UD Nheengatu CompLin

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB

Features

AdvType – Aspect – Case – Clitic – Compound – Definite – Degree – Deixis – Derivation – Evident – Foc – Mood – Number – Number[grnd] – Number[psor] – NumType – PartType – Person – Person[grnd] – Person[psor] – Polarity – Poss – PronType – PunctType – Red – Rel – Style – Tense – VerbForm – Voice

Relations

acl – acl:relcl – advcl – advcl:relcl – advmod – amod – appos – aux – case – cc – ccomp – compound – conj – cop – csubj – dep – det – discourse – dislocated – expl – fixed – flat – iobj – mark – nmod – nmod:poss – nsubj – nummod – obj – obl – parataxis – punct – reparandum – root – vocative – xcomp

Tokenization and Word Segmentation

This corpus contains 1239 sentences, 12621 tokens and 12743 syntactic words.

This corpus contains 3529 tokens (28%) that are not followed by a space.

This corpus does not contain words with spaces.
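
As a rough illustration of how counts like these can be derived from a CoNLL-U file, the sketch below tallies sentences, surface tokens, syntactic words, and tokens marked SpaceAfter=No. The two-sentence sample is invented for illustration (its lemmas, tags, and relations are not copied from the treebank), and the function name corpus_counts is hypothetical; the script that generated this page may well work differently.

```python
# Toy CoNLL-U sample, invented for illustration; NOT copied from the treebank.
ROWS = [
    "# text = Asú kaá-pe.",
    "1\tAsú\tsú\tVERB\t_\t_\t0\troot\t_\t_",
    "2-3\tkaá-pe\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "2\tkaá\tkaá\tNOUN\t_\t_\t1\tobl\t_\t_",
    "3\tpe\tpe\tADP\t_\t_\t2\tcase\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
    "# text = Aikú iké.",
    "1\tAikú\tikú\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tiké\tiké\tADV\t_\t_\t1\tadvmod\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
]
SAMPLE = "\n".join(ROWS)


def corpus_counts(conllu: str) -> dict:
    """Count sentences, surface tokens, syntactic words and spaceless tokens."""
    sentences = tokens = words = no_space = 0
    in_sentence = False
    covered_until = 0  # highest word ID covered by the current multi-word token
    for line in conllu.splitlines():
        if not line.strip():          # blank line closes a sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        cols = line.split("\t")
        tok_id, misc = cols[0], cols[9]
        if "." in tok_id:             # empty node: neither token nor word
            continue
        spaceless = "SpaceAfter=No" in misc.split("|")
        if "-" in tok_id:             # multi-word token range, e.g. 2-3
            tokens += 1
            no_space += int(spaceless)
            covered_until = int(tok_id.split("-")[1])
            continue
        words += 1
        if int(tok_id) > covered_until:   # word is also a surface token
            tokens += 1
            no_space += int(spaceless)
    return {"sentences": sentences, "tokens": tokens,
            "words": words, "no_space": no_space}


counts = corpus_counts(SAMPLE)
```

On the toy sample this yields 2 sentences, 6 tokens, 7 syntactic words, and 2 tokens without a following space. Applied to the figures reported above, 3529 of 12621 tokens rounds to 28%.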

This corpus contains 131 types of words that contain both letters and punctuation. Examples: waá-itá, kwá-itá, mira-itá, apigawa-itá, kunhã-itá, amú-itá, anama-itá, taína-itá, maã-itá, nhaã-itá, pindá-itá, raíra-itá, rundewara-itá, Pirá-itá, kunhamukú-itá, mirá-itá, mú-itá, taria-itá, tayera-itá, taíra-itá, amú-tetamawara, amú-wirandé, arú-itá, ikewara-itá, iwá-itá, kariwa-itá, kurasí-ara, kurumiwasú-itá, kurumĩ-itá, kurupira-itá, kuẽma-piranga, makú-itá, mbira-itá, mimbira-itá, nheenga-itá, pituna-pisayé, pituna-pukú, pura-itá, rikusawa-itá, rimirikú-itá, ruayana-itá, sakanga-itá, sera-rakapira, suiwara-itá, suú-itá, tuyué-itá, uka-itá, wanana-itá, wirá-itá, yepé-yepé

This corpus contains 122 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
There are 86 types of multi-word tokens. Examples: maita, asú-putari, asú-kwáu, iwí-pe, kupixá-pe, uyuká-putari, Ukiririntu, ambaú-putari, ipí-pe, kaá-pe, paraname, resú-putari, ripí-pe, rumasá-pe, rupitá-pe, ukwáu-putari, usú-putari, uwatá-kwáu, Amaã-putari, Amaãntu, Amunhã-kari, Apiripana-putari, Apituú-putari, Apurakí-putari, Asenúi-kari, Ayuíri-putari, Igarapé-pe, Indé-ta, Marã-ta, Rekiri-putari, Remukaturu-kari, Reumpuka-putari, Yamunhã-putari, amupuka-kwáu, apisika-kwáu, apitimú-kwáu, apurungitá-putari, awatá-kwáu, gantime, garapá-pe, katuntu, kitintu, kupé-pe, pemaãntu, pemukanhemu-kwáu, piá-pe, putiá-pe, rasú-kwáu, rembií-pe, remunhã-kari.
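
The multi-word token figures can be approximated with a short sketch like the one below, which collects the surface forms of ID ranges (such as 2-3) and the number of syntactic words each one spans. The sample is invented, only the ID and FORM columns are filled in, and the helper names row and multiword_tokens are hypothetical.

```python
from collections import Counter


def row(tok_id: str, form: str) -> str:
    """Build a 10-column CoNLL-U line with only ID and FORM filled in."""
    return "\t".join([tok_id, form] + ["_"] * 8)


# Toy sample, invented for illustration; NOT copied from the treebank.
SAMPLE = "\n".join([
    "# text = Asú-putari kaá-pe.",
    row("1-2", "Asú-putari"), row("1", "Asú"), row("2", "putari"),
    row("3-4", "kaá-pe"), row("3", "kaá"), row("4", "pe"),
    row("5", "."),
    "",
    "# text = Aikú kaá-pe.",
    row("1", "Aikú"),
    row("2-3", "kaá-pe"), row("2", "kaá"), row("3", "pe"),
    row("4", "."),
    "",
])


def multiword_tokens(conllu: str):
    """Return (Counter of multi-word token forms, list of their lengths in words)."""
    forms = Counter()
    lengths = []
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0]:                      # ID range marks a multi-word token
            start, end = map(int, cols[0].split("-"))
            forms[cols[1]] += 1                 # case is preserved, as in the list above
            lengths.append(end - start + 1)
    return forms, lengths


forms, lengths = multiword_tokens(SAMPLE)
total = sum(forms.values())           # multi-word token occurrences
types = len(forms)                    # distinct surface forms
average = sum(lengths) / len(lengths)
```

The toy sample contains 3 multi-word tokens of 2 types, each spanning 2 words. The reported totals are consistent with one another: 122 two-word tokens add one extra word each, and 12621 + 122 = 12743.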

Morphology

Nominal Features

Number

Plur
- AUX-Fin: yasú, yaikú, pesú, yapuderi, Pekũi, Pepuderi, peikú, yayuíri
- DET: kwá-itá, amú-itá, nhaã-itá
- NOUN: mira-itá, apigawa-itá, kunhã-itá, anama-itá, taína-itá, maã-itá, pindá-itá, Pirá-itá, kunhamukú-itá, mirá-itá
- PRON: aintá, yané, ta, waá-itá, yandé, penhẽ, pe, kwá-itá, amú-itá, nhaã-itá
- VERB-Fin: yamunhã, yamaã, yasú, yaú, taunheẽ, pemaã, pemunhã, peú, yambaú, pembeú

Sing
- AUX-Fin: aikú, asú, reikú, resú, aputari, Repuderi, apuderi, ayuíri, rekwáu, reputari
- DET: kwá, nhaã, amú, kwaá
- NOUN: ara, mira, igara, manha, apigawa, pituna, kunhã, yautí, paraná, pirá
- PRON: i, waá, se, aé, ixé, indé, ne, kwá, nhaã, amú
- VERB-Fin: asú, reputari, amunhã, akwáu, amaã, ambeú, remaã, resú, surí, aputari
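
The group labels used in this section (for example VERB-Fin) appear to combine the UPOS tag with the VerbForm value. Under that assumption, the sketch below shows one way to collect example forms per feature value and group. The feature bundles are pieced together from the lists on this page and may be incomplete; the names parse_feats, group_label, and examples_by_feature are hypothetical, and multi-valued features (comma-separated values) are not handled.

```python
from collections import defaultdict


def parse_feats(feats: str) -> dict:
    """Turn a CoNLL-U FEATS string such as 'Number=Sing|Person=1' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def group_label(upos: str, feats: dict) -> str:
    """UPOS plus VerbForm when present, e.g. 'VERB-Fin' (assumed convention)."""
    verb_form = feats.get("VerbForm")
    return f"{upos}-{verb_form}" if verb_form else upos


def examples_by_feature(rows, feature: str) -> dict:
    """Map feature value -> group label -> list of distinct example forms."""
    table = defaultdict(lambda: defaultdict(list))
    for form, upos, feats_str in rows:
        feats = parse_feats(feats_str)
        if feature in feats:
            bucket = table[feats[feature]][group_label(upos, feats)]
            if form not in bucket:
                bucket.append(form)
    return table


# (form, UPOS, FEATS) triples; illustrative bundles, possibly incomplete.
ROWS = [
    ("asú", "VERB", "Number=Sing|Person=1|VerbForm=Fin"),
    ("yasú", "VERB", "Number=Plur|Person=1|VerbForm=Fin"),
    ("mira", "NOUN", "Number=Sing"),
    ("mira-itá", "NOUN", "Number=Plur"),
    ("kwá", "DET", "Deixis=Prox|Number=Sing|PronType=Dem"),
]

number_table = examples_by_feature(ROWS, "Number")
```

With these rows, number_table["Plur"] groups yasú under VERB-Fin and mira-itá under NOUN, mirroring the layout of the lists above.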

Case

Gen
- PRON: i, se, ne, aintá, yané, pe, ta, tá

Definite

Ind
- DET: yepé
- PRON: yepé

Degree and Polarity

Degree

Aug
- NOUN: buyawasú, mirawasú, miráwasú, amanawasú, awawasú, irusangawasú, itapewawasú, kavernawasú, marikawasú, mirá-itawasú

Cmp
- ADV: piri

Dim
- ADJ: purangamirĩ
- NOUN: makakaí

Sup
- ADV: piri

Polarity

Neg
- PART: ti, te, nti, intí, umbaá, nẽ, tenhẽ

Pos
- PART: eré

Verbal Features

Aspect

Compl
- PART: pawa, pá

Freq
- ADV: Asuiwara, Ikewara, kwayewara, sewara, yawewara
- AUX-Fin: uikuwera
- NOUN: arawara, rukawara
- PART: aikwewara
- VERB-Fin: Amanduariwara, Asuwara, pesenduwera, upukawera, upuruwera, uyumuatiriwera

Frus
- PART: yepé

Hab
- SCONJ: rametiwa
- VERB-Fin: ambautiwa, ukanhemutiwa, upurungitatiwa, usutiwa, uyukatiwa

Iter
- AUX-Fin: ayuíri, yayuíri

Perf
- PART: ana, ã, wã, wana

Mood

Cnd
- PART: maã

Imp
- AUX-Fin: Pekũi
- PART: te, tenhẽ
- VERB-Fin: yuri

Tense

Fut
- PART: kurí, arama, arã, ku, rã

Past
- AUX-Fin: uikuwera
- PART: kwera
- VERB-Fin: pesenduwera, upukawera, upuruwera, uyumuatiriwera

Pres
- ADV: Asuiwara, Ikewara, kwayewara, sewara, yawewara
- NOUN: arawara, rukawara
- PART: aikwewara
- VERB-Fin: Amanduariwara, Asuwara

Voice

Pass
- VERB-Fin: Uyupurungitá, uyumunhã, uyumusangawa

Evident

Nfh
- PART: paá

Pronouns, Determiners, Quantifiers

PronType

Art
- DET: yepé
- PRON: yepé

Dem
- ADV: iké, ape, kwá, akití, Mimi, aape, kí, Ikewara
- DET: kwá, nhaã, kwá-itá, aé, kwaá, nhaã-itá
- PRON: kwá, nhaã, kwá-itá, nhaã-itá, Kwaá, aé

Emp
- DET: aité
- PRON: aité

Ind
- ADV: mairamé, makití, masuí, marupí
- DET: amú, muíri, siiya, siía, maã, setá, yawé, turusú, siya, yepé-yepé
- PRON: maã, awá, amú, amú-itá, manungara, siiya, siya, muiriira, yepé-yepé

Int
- ADV: mayé, mamé, makití, marupí, maita, marama, masuí, mairamé, marã, mayawé
- DET: muíri, Maã, awá
- PRON: maã, awá, Muíri

Prs
- PRON: aintá, i, se, aé, ixé, indé, ne, yané, ta, yandé

Rel
- ADV: mamé, makití, masuí, mayé, marupí, mairamé
- PRON: waá, waá-itá, awá, maã

Tot
- DET: panhẽ, muíri, upaĩ
- PRON: panhẽ, muíri

NumType

Card
- NUM: mukũi, musapiri, yepé, nove, pú-mukũi

Ord
- ADV: mukũisawa, primeru

Poss

Yes
- PRON: se, i, ne, yané, aintá, pe, ta

Person

1
- AUX-Fin: aikú, asú, yasú, yaikú, yapuderi, aputari, apuderi, ayuíri, yayuíri
- PRON: se, ixé, yané, yandé
- VERB-Fin: asú, yamunhã, amunhã, akwáu, amaã, ambeú, yamaã, aputari, ayuíri, yasú

2
- AUX-Fin: pesú, reikú, resú, Pekũi, Pepuderi, Repuderi, peikú, rekwáu, reputari
- PRON: indé, ne, penhẽ, pe
- VERB-Fin: reputari, remaã, resú, remunhã, rerikú, reyuri, remundú, pemaã, pemunhã, rembeú

3
- AUX-Fin: uikú, usú, upuderi, uputari, uikuwera
- PRON: aintá, i, aé, ta, tá
- VERB-Fin: unheẽ, usika, usú, umaã, umunhã, urikú, umbeú, upitá, upisika, uri

Number[psor]

Sing
- NOUN: sera, suka, ximirikú, sawa, sesá, sukwera, sumuara, suíwa, sakakwera, sesewara

Other Features

AdvType
- Cau
  - ADV: nhaãsé, ape, aresé, aramé, kurumú, marama, marã
- Con
  - ADV: Ma, nuká
- Deg
  - ADV: katú, reté, piri, retana, xinga, puru, mirĩ, retã, ité
- Loc
  - ADV: iké, apekatú, mamé, ape, makití, marupí, masuí, kwá, akití, mikití
- Man
  - ADV: yawé, mayé, kwayé, puranga, kutara, kirimbawa, puxí, satambika, sé, kwayentu
- Mod
  - ADV: kuité, kuté
- Tim
  - ADV: asuí, kuíri, rẽ, aramé, wirandé, ariré, kuxiíma, aiwana, kuité, yeperesé

Clitic
- Yes
  - ADP: pe, me
  - ADV: ntu
  - PART: taá, ta

Compound
- Yes
  - AUX: putari, kwáu, kari
  - AUX-Inf: putari, kwáu

Deixis
- Prox
  - ADV: iké, kwá, kí
  - DET: kwá, kwá-itá, kwaá
  - PRON: kwá, kwá-itá, Kwaá
- Remt
  - ADV: ape, akití, Mimi, aape
  - DET: nhaã, aé, nhaã-itá
  - PRON: nhaã, nhaã-itá, aé

Derivation
- Coll
  - NOUN: itatiwa, kapĩtiwa, mirawasutiwa, sakaitiwa
- Priv
  - ADJ: Adana-ima, ara-ima, kiinha-ima, santaíma, sawa-ima, tĩ-ima, ximirikú-ima
  - VERB: kiaíma

Foc
- Yes
  - PART: tẽ, tenhẽ, katú, té

Number[grnd]
- Sing
  - ADP: sesé, suakí, sakakwera, sesewara, suaxara

PartType
- Emp
  - PART: tẽ, tenhẽ, katú, té
- Exs
  - PART: aikwé, aikwewara
- Int
  - PART: taá, será, ta
- Mod
  - PART: paá, pu, supí, maã, eré, tenki, tenupá, presizu, ba, ipú
- Neg
  - PART: ti, te, nti, intí, umbaá, nẽ, tenhẽ
- Prs
  - PART: xukúi, Kusukúi

Person[grnd]
- 3
  - ADP: sesé, suakí, sakakwera, sesewara, suaxara

Person[psor]
- 3
  - NOUN: sera, suka, ximirikú, sawa, sesá, sukwera, sumuara, suíwa, sakakwera, sesewara

PunctType
- Elip
  - PUNCT: [...]

Red
- Yes
  - VERB-Fin: Akaá-kaá, Tasuú-suú, ukaúkaú

Rel
- Abs
  - NOUN: uka, tatá, tetama, timbiú, tuixawa, tendawa, peé, pé, teapú, tuwí
- Cont
  - ADP: resé, resewara, ruakí, rakakwera, rikuyara, renundé, ruaxara
  - NOUN: ruka, ramunha, rapé, raíra, riiya, retama, raínha, rangawa, rera, resá
  - SCONJ: resewara
  - VERB: rurí, rakú, ranhẽ, rawa, resarái, rikwé, renúi, ripiaka
  - VERB-Inf: renúi, ripiaka
- NCont
  - ADP: sesé, suakí, sakakwera, sesewara, suaxara
  - NOUN: sera, suka, ximirikú, sawa, sesá, sukwera, sumuara, suíwa, sakakwera, sesewara
  - VERB-Fin: surí, sakú, sasí, tiapú, sesaíma, setá, tipí, sawa, sikwé, Ikupukú

Style
- Arch
  - NOUN: uka, ukena
  - PRON: aé, penhẽ, yandé
  - VERB-Inf: renúi, ripiaka
- Rare
  - NOUN: Yukasara

Syntax

Auxiliary Verbs and Copula

This corpus uses 1 lemma as a copula (cop). Example: ikú.

This corpus uses 7 lemmas as auxiliaries (aux). Examples: sú, ikú, putari, kwáu, puderi, kari, yuíri.

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB--NOUN (20)
- VERB--PRON (20)
- VERB--PRON-Gen (36)
- VERB-Fin--NOUN (417)
- VERB-Fin--PRON (441)
- VERB-Inf--NOUN (4)
- VERB-Inf--PRON (3)

obj
- VERB-Fin--NOUN (483)
- VERB-Fin--NOUN-ADP(resé) (2)
- VERB-Fin--PRON (250)
- VERB-Fin--PRON-Gen (1)
- VERB-Fin--PRON-Gen-ADP(irumu) (1)
- VERB-Inf--PRON-Gen (6)

iobj
- VERB-Fin--NOUN (1)
- VERB-Fin--NOUN-ADP(irumu) (1)
- VERB-Fin--NOUN-ADP(supé) (20)
- VERB-Fin--NOUN-ADP(suí) (2)
- VERB-Fin--NOUN-ADP(xupé)-ADP(arama) (2)
- VERB-Fin--PRON (1)
- VERB-Fin--PRON-ADP(arama) (13)
- VERB-Fin--PRON-ADP(arã) (4)
- VERB-Fin--PRON-ADP(supé) (1)
- VERB-Fin--PRON-ADP(supé)-ADP(arama) (1)
- VERB-Fin--PRON-Gen-ADP(supé) (10)
- VERB-Fin--PRON-Gen-ADP(xupé) (25)
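
The parent--child patterns listed above can be sketched as follows. The code assumes that a label consists of the UPOS tag plus the VerbForm or Case value, and that ADP(...) names the lemma of an adposition attached to the child via case; these are guesses about the page generator's conventions, not documented behaviour. The example sentence and its annotation are invented, and the function names are hypothetical.

```python
from collections import Counter


def label(word: dict) -> str:
    """UPOS plus VerbForm/Case value when present, e.g. VERB-Fin or PRON-Gen."""
    parts = [word["upos"]]
    for feat in ("VerbForm", "Case"):
        if feat in word["feats"]:
            parts.append(word["feats"][feat])
    return "-".join(parts)


def argument_patterns(sentence, relations=("nsubj", "obj", "iobj")) -> Counter:
    """Count (relation, pattern) pairs for verb parents with NOUN/PRON children."""
    by_id = {w["id"]: w for w in sentence}
    counts = Counter()
    for child in sentence:
        parent = by_id.get(child["head"])   # None for the root (head 0)
        if parent is None or child["deprel"] not in relations:
            continue
        if parent["upos"] != "VERB" or child["upos"] not in ("NOUN", "PRON"):
            continue
        # Adpositions attached to the child via the 'case' relation.
        adps = "".join(
            f"-ADP({w['lemma']})" for w in sentence
            if w["head"] == child["id"] and w["deprel"] == "case"
            and w["upos"] == "ADP"
        )
        counts[(child["deprel"], f"{label(parent)}--{label(child)}{adps}")] += 1
    return counts


# Invented sentence "Ixé ambeú kunhã supé." with illustrative annotation.
SENTENCE = [
    {"id": 1, "form": "Ixé", "lemma": "ixé", "upos": "PRON",
     "feats": {"Number": "Sing", "Person": "1"}, "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "ambeú", "lemma": "mbeú", "upos": "VERB",
     "feats": {"VerbForm": "Fin"}, "head": 0, "deprel": "root"},
    {"id": 3, "form": "kunhã", "lemma": "kunhã", "upos": "NOUN",
     "feats": {}, "head": 2, "deprel": "iobj"},
    {"id": 4, "form": "supé", "lemma": "supé", "upos": "ADP",
     "feats": {}, "head": 3, "deprel": "case"},
    {"id": 5, "form": ".", "lemma": ".", "upos": "PUNCT",
     "feats": {}, "head": 2, "deprel": "punct"},
]

patterns = argument_patterns(SENTENCE)
```

For this sentence the result contains one nsubj pattern VERB-Fin--PRON and one iobj pattern VERB-Fin--NOUN-ADP(supé), matching the label shapes in the lists above.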

Relations Overview

This corpus uses 3 relation subtypes: acl:relcl, advcl:relcl, nmod:poss
The following 4 relation types are not used in this corpus at all: clf, list, orphan, goeswith
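
The overview can be cross-checked against the relation list given under Relations. The sketch below compares that list with the 37 universal relations of UD v2; the variable names are illustrative only.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL_RELATIONS = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relations listed for this treebank in the Relations section of this page.
USED = """
acl acl:relcl advcl advcl:relcl advmod amod appos aux case cc ccomp compound
conj cop csubj dep det discourse dislocated expl fixed flat iobj mark nmod
nmod:poss nsubj nummod obj obl parataxis punct reparandum root vocative xcomp
""".split()

# Subtypes carry a colon; the part before the colon is the universal relation.
subtypes = sorted(r for r in USED if ":" in r)
base_types = {r.split(":")[0] for r in USED}
unused = sorted(UNIVERSAL_RELATIONS - base_types)

print(subtypes)
print(unused)
```

Running this reproduces the three subtypes and the four unused relations named above.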