This page pertains to UD version 2.

UD Portuguese Porttinari

Language: Portuguese (code: pt)
Family: IE

This treebank has been part of Universal Dependencies since the UD v2.13 release.

The following people have contributed to making this treebank part of UD: Magali Sanches Duran, Lucelene Lopes, Maria das Graças Volpe Nunes, Thiago Alexandre Salgueiro Pardo.

Repository: UD_Portuguese-Porttinari
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15

License: CC BY 4.0

Genre: news

Questions, comments? General annotation questions (either Portuguese-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [taspardo (æt) icmc • usp • br]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually
UPOS annotated manually, natively in UD style
XPOS not available
Features annotated manually, natively in UD style
Relations annotated manually, natively in UD style
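
The annotation layers listed above correspond to columns of the CoNLL-U format used by UD releases. The following is a minimal sketch of reading one token line into named fields, assuming the standard ten tab-separated columns. The two-token sample is hypothetical and not taken from the treebank, and `parse_token` is an illustrative helper, not part of any UD tool. Since XPOS is not available, that column is expected to hold the `_` placeholder.

```python
# Hypothetical two-token CoNLL-U fragment; NOT taken from the treebank.
SAMPLE = (
    "1\tEla\tela\tPRON\t_\tCase=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs\t2\tnsubj\t_\t_\n"
    "2\tchegou\tchegar\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t0\troot\t_\t_\n"
)

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc")


def parse_token(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    token = dict(zip(FIELDS, values))
    # FEATS is a '|'-separated list of Name=Value pairs, or '_' when empty.
    if token["feats"] == "_":
        token["feats"] = {}
    else:
        token["feats"] = dict(pair.split("=", 1) for pair in token["feats"].split("|"))
    return token


tokens = [parse_token(line) for line in SAMPLE.splitlines()]
for tok in tokens:
    print(tok["form"], tok["lemma"], tok["upos"], tok["xpos"], tok["deprel"])
```

A real file also contains comment lines starting with `#`, blank lines between sentences, and multiword-token ranges, which this sketch does not handle.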

Description

Porttinari-base (Duran et al., 2023) is the journalistic portion of Porttinari (short for “PORTuguese Treebank”), a project that aims to build a large multi-genre treebank for Portuguese (Pardo et al., 2021) following the Universal Dependencies grammar framework (de Marneffe et al., 2021).

As reported by Duran et al. (2023), Porttinari currently comprises three subcorpora with different characteristics and purposes: Porttinari-base, Porttinari-check, and Porttinari-automatic.

The texts in the treebank come from the Folha de São Paulo newspaper and are publicly available on Kaggle. Overall, the journalistic portion of Porttinari includes 167,048 news articles, with 3,964,321 sentences and 94,646,080 tokens, distributed across the subcorpora as follows.

[Figure: distribution of the data across the subcorpora]

For the interested reader, Porttinari-check and Porttinari-automatic, as well as other related information, may be accessed at https://sites.google.com/icmc.usp.br/poetisa/porttinari.

Acknowledgments

This work was carried out at the Center for Artificial Intelligence of the University of São Paulo (C4AI - http://c4ai.inova.usp.br/), with support from the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and the IBM Corporation. The project was also supported by the Ministry of Science, Technology, and Innovation, with resources of Law N. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published as Residence in TIC 13, DOU 01245.010222/2022-44.

References

Statistics of UD Portuguese Porttinari

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X

Features

Abbr, Case, Definite, Foreign, Gender, Mood, Number, NumType, Person, Poss, PronType, Tense, VerbForm, Voice

Relations

acl, acl:relcl, advcl, advmod, amod, appos, aux, aux:pass, case, cc, ccomp, ccomp:speech, conj, cop, csubj, csubj:outer, csubj:pass, det, discourse, dislocated, expl, expl:impers, fixed, flat, flat:foreign, flat:name, iobj, list, mark, nmod, nsubj, nsubj:outer, nsubj:pass, nummod, obj, obl, obl:agent, orphan, parataxis, punct, reparandum, root, vocative, xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
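
The restriction stated above (a verb parent with a noun or pronoun child) can be sketched as a simple filter over a dependency tree. This is only an illustration under stated assumptions: the sentence and its annotation are hypothetical, not drawn from the treebank, and `verb_nominal_relations` is an illustrative name rather than the procedure actually used to compute the statistics.

```python
from collections import Counter

# Hypothetical sentence as (id, form, upos, head, deprel) tuples.
# The annotation is illustrative only and NOT taken from the treebank.
SENTENCE = [
    (1, "Ela", "PRON", 2, "nsubj"),
    (2, "entregou", "VERB", 0, "root"),
    (3, "o", "DET", 4, "det"),
    (4, "relatório", "NOUN", 2, "obj"),
    (5, "a", "ADP", 7, "case"),
    (6, "a", "DET", 7, "det"),
    (7, "imprensa", "NOUN", 2, "obl"),
]


def verb_nominal_relations(sentence):
    """Count relation labels where the parent is a VERB and the child is a NOUN or PRON."""
    upos_by_id = {tok_id: upos for tok_id, _form, upos, _head, _deprel in sentence}
    counts = Counter()
    for _tok_id, _form, upos, head, deprel in sentence:
        # head 0 is the artificial root and has no UPOS, so .get() returns None.
        if upos in {"NOUN", "PRON"} and upos_by_id.get(head) == "VERB":
            counts[deprel] += 1
    return counts


relation_counts = verb_nominal_relations(SENTENCE)
print(dict(relation_counts))
```

Determiners and case markers are skipped because their own part of speech is neither NOUN nor PRON, and the root verb is skipped because its parent is the artificial root.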

Relations Overview