This page pertains to UD version 2.

UD Indonesian CSUI

Language: Indonesian (code: id)
Family: Austronesian

This treebank has been part of Universal Dependencies since the UD v2.7 release.

The following people have contributed to making this treebank part of UD: Ika Alfina, Jessica Naraiswari Arwidarasti, Muhammad Yudistira Hanifmuti, Arawinda Dinakaramani, Ruli Manurung, Fam Rashel, Andry Luthfi.

Repository: UD_Indonesian-CSUI
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15

License: CC BY-SA 4.0

Genre: nonfiction, news

Questions, comments? General annotation questions (either Indonesian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [ika • alfina (æt) cs • ui • ac • id]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.

Annotation	Source
Lemmas	assigned by a program, with some manual corrections, but not a full manual verification
UPOS	annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
XPOS	annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
Features	assigned by a program, with some manual corrections, but not a full manual verification
Relations	annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion

Description

UD Indonesian-CSUI was converted automatically from Kethu, an Indonesian constituency treebank in the Penn Treebank format. Kethu was in turn converted from a constituency treebank built by Dinakaramani et al. (2015). We named this treebank Indonesian-CSUI because all three versions of the treebank were built at the Faculty of Computer Science, Universitas Indonesia.

Other characteristics of the treebank:

Genre: news in formal Indonesian (the majority is economic news)
This treebank consists of 1030 sentences and 28K words. The CSUI treebank is divided into a testing and a training dataset:
The testing dataset consists of around 10K words
The training dataset consists of around 18K words
The average sentence length is about 27.4 words per sentence, considerably higher than in the Indonesian-PUD treebank, whose average sentence length is 19.4 words.
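
The average above can be checked from the exact counts reported in the tokenization statistics of this page (1030 sentences, 28263 syntactic words). The following is a minimal sketch of that arithmetic; the variable names are illustrative, and the PUD figure is simply the 19.4 quoted above.

```python
# Counts quoted from the statistics on this page.
sentences = 1030
syntactic_words = 28263
pud_avg_len = 19.4  # Indonesian-PUD average, as quoted in the description

avg_len = syntactic_words / sentences
ratio = avg_len / pud_avg_len

print(f"{avg_len:.1f} words per sentence")  # 27.4 words per sentence
print(f"{ratio:.2f} times the PUD average")
```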

Acknowledgments

The original constituency treebank was built with manual annotation by Arawinda Dinakaramani, Fam Rashel, Andry Luthfi, and Ruli Manurung at the Faculty of Computer Science, Universitas Indonesia in 2015.
That treebank was converted to the Penn Treebank format by Ika Alfina and Jessica Naraiswari Arwidarasti in 2019-2020. This PTB version was named Kethu.
The Kethu treebank was converted automatically to this UD treebank by Alfina et al. (2020).
The lemma (LEMMA) and morphological features (FEATS) were generated using Aksara and manually corrected.

References

Ika Alfina, Indra Budi, and Heru Suhartanto. “Tree Rotations for Dependency Trees: Converting the Head-Directionality of Noun Phrases”. In Journal of Computer Science, 2020, Vol 16 No 11.
M. Yudistira Hanifmuti and Ika Alfina. “Aksara: An Indonesian Morphological Analyzer that Conforms to the UD v2 Annotation Guidelines”. In Proceedings of the 2020 International Conference on Asian Language Processing (IALP) in Kuala Lumpur, Malaysia, 4-6 December 2020.

Statistics of UD Indonesian CSUI

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X

Features

Clusivity – Definite – Degree – Foreign – Mood – Number – NumType – Person – Polarity – Polite – PronType – Reflex – Voice

Relations

acl – acl:relcl – advcl – advmod – advmod:emph – amod – appos – aux – case – case:adv – cc – cc:preconj – ccomp – clf – compound:a – conj – cop – csubj – dep – det – discourse – dislocated – fixed – flat – flat:foreign – flat:name – iobj – mark – nmod – nmod:lmod – nmod:poss – nmod:tmod – nsubj – nsubj:pass – nummod – obj – obl – obl:agent – obl:tmod – orphan – parataxis – punct – root – xcomp

Tokenization and Word Segmentation

This corpus contains 1030 sentences, 27771 tokens and 28263 syntactic words.

This corpus contains 3923 tokens (14%) that are not followed by a space.

This corpus does not contain words with spaces.

This corpus contains 148 types of words that contain both letters and punctuation. Examples: rata-rata, APBN-P, masing-masing, Ltd., non-migas, 's, non-keuangan, AA-idn, II/2007, Ka'ban, Pte., langkah-langkah, negara-negara, No., RAPBN-P, bank-bank, baru-baru, idA-, syarat-syarat, C/D, Co., I/2007, II/2003, LLC., S., S/A, Tbk., anak-anak, benar-benar, berbeda-beda, berturut-turut, minus/idn, monyet-monyet, nama-nama, non-residence, obligasi-obligasi, peringkat-peringkat, perusahaan-perusahaan, prinsip-prinsip, rasio-rasio, semata-mata, sumber-sumber, terus-menerus, 03-Oct, 05-May, 10-Jan, 17-Mar, 23-Aug, 26-Sep, 34/PMK.011/2007

This corpus contains 492 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
There are 200 types of multi-word tokens. Examples: katanya, adanya, menurutnya, ujarnya, laporannya, pihaknya, lainnya, tambahnya, membaiknya, apakah, pernyataannya, sahamnya, jelasnya, masuknya, sisanya, tingginya, bersihnya, walaupun, Keuangannya, besarnya, meningkatnya, meskipun, naiknya, rencananya, ucapnya, Dikatakannya, antaranya, bunganya, jumlahnya, kalinya, nilainya, penjelasannya, persnya, turunnya, usahanya, Dijelaskannya, Disebutkannya, Ditambahkannya, Misalnya, akhirnya, artinya, aslinya, baiknya, banyaknya, bukanlah, halnya, informasinya, inilah, instrumennya, investasinya.
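
The distinction between tokens, syntactic words, and multi-word tokens follows the CoNLL-U format: a multi-word token such as katanya occupies a range line (for example ID 4-5) followed by one line per syntactic word. The sketch below shows one way such counts could be derived from a CoNLL-U file. The function name conllu_counts, the helper row, and the sample sentence are all hypothetical; the sample is hand-made for illustration and is not taken from the treebank, and every column except ID, FORM, and MISC is left as "_".

```python
def conllu_counts(text):
    """Count sentences, surface tokens, syntactic words and multi-word tokens."""
    counts = {"sentences": 0, "tokens": 0, "words": 0, "mwt": 0, "no_space_after": 0}
    covered_until = 0   # last word ID covered by the current multi-word token
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():            # a blank line ends the sentence
            in_sentence, covered_until = False, 0
            continue
        if line.startswith("#"):        # sentence-level comment
            continue
        cols = line.split("\t")
        tok_id, misc = cols[0], cols[9].split("|")
        if not in_sentence:
            counts["sentences"] += 1
            in_sentence = True
        if "." in tok_id:               # empty node, neither token nor word
            continue
        if "-" in tok_id:               # multi-word token range, e.g. 4-5
            covered_until = int(tok_id.split("-")[1])
            counts["mwt"] += 1
            counts["tokens"] += 1
            counts["no_space_after"] += "SpaceAfter=No" in misc
            continue
        counts["words"] += 1
        if int(tok_id) > covered_until:  # word that is also a surface token
            counts["tokens"] += 1
            counts["no_space_after"] += "SpaceAfter=No" in misc
    return counts


def row(tok_id, form, misc="_"):
    """Build a 10-column CoNLL-U line with unspecified annotation columns."""
    return "\t".join([tok_id, form] + ["_"] * 7 + [misc])


# Hand-made sample (not from the treebank): "Harga naik, katanya."
sample = "\n".join([
    "# text = Harga naik, katanya.",
    row("1", "Harga"),
    row("2", "naik", "SpaceAfter=No"),
    row("3", ","),
    row("4-5", "katanya", "SpaceAfter=No"),
    row("4", "kata"),
    row("5", "nya"),
    row("6", "."),
    "",
])

counts = conllu_counts(sample)
print(counts)
# {'sentences': 1, 'tokens': 5, 'words': 6, 'mwt': 1, 'no_space_after': 2}
```

In the sample, the multi-word token katanya counts as one token but two syntactic words, which is why the corpus-level word count (28263) exceeds the token count (27771) by the number of extra words introduced by its 492 multi-word tokens.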

Morphology

Nominal Features

Number

Plur
- DET: beberapa, banyak, para, berbagai
- NOUN: langkah-langkah, negara-negara, bank-bank, syarat-syarat, anak-anak, monyet-monyet, nama-nama, obligasi-obligasi, peringkat-peringkat, perusahaan-perusahaan
- PRON: kita, mereka, kami

Sing
- NOUN: persen, Rp, tahun, dolar, sebesar, saham, perusahaan, pemerintah, negara, pertumbuhan
- PRON: nya, dia, ia, saya, anda

Definite

Def
- DET: nya, yang

Ind
- DET: sebuah, seorang, suatu

Degree and Polarity

Degree

Sup
- ADJ: terakhir, terbesar, tertinggi, terbaik, tertentu, terkaya, terdekat, terbanyak, terendah, terutama

Polarity

Neg
- PART: tidak, belum, bukan, tak, jangan

Verbal Features

Mood

Ind
- VERB: kata, menjadi, mencapai, mengatakan, ada, meningkat, naik, dibandingkan, lalu, merupakan

Voice

Act
- VERB: kata, menjadi, mencapai, mengatakan, ada, meningkat, naik, lalu, merupakan, turun

Pass
- VERB: dibandingkan, dibanding, terjadi, dilakukan, diperkirakan, termasuk, terdiri, diharapkan, didorong, diterbitkan

Pronouns, Determiners, Quantifiers

PronType

Art
- DET: nya, sebuah, seorang, yang, suatu

Dem
- DET: ini, tersebut, itu, si, sana, sebagian
- PRON: itu, demikian, ini, mana, begitu

Emp
- DET: sendiri

Ind
- DET: beberapa, banyak, para, berbagai, sedikit
- PRON: sesuatu

Int
- PRON: Apa

Prs
- PRON: nya, dia, kita, ia, mereka, saya, kami, diri, anda

Rel
- ADV: bagaimana
- PRON: yang, apa, siapa

Tot
- DET: seluruh, semua, masing-masing, setiap, segala
- NUM: Ke-23

NumType

Card
- NUM: 2007, triliun, miliar, 2006, juta, 2008, satu, dua, 30, 10

Ord
- ADJ: pertama, kedua, ketiga, keenam, kedelapan, kelima, ke-10, ke-2, ke-4, ke-40

Reflex

Yes
- PRON: diri

Person

1
- PRON: kita, saya, kami

2
- PRON: anda

3
- PRON: nya, dia, ia, mereka

Polite

Form
- PRON: saya, anda

Other Features

Clusivity
- Ex
  - PRON: kami
- In
  - PRON: kita

Foreign
- Yes
  - X: rate, year, rating, mortgage, subprime, on, listed, net, netto, outlook
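
In CoNLL-U, the features listed in this section are stored in the FEATS column as Name=Value pairs joined by "|". The sketch below parses such a string into a dictionary. The helper parse_feats is hypothetical, and the two feature strings are assembled from the listings on this page (kami and kita both appear under Number=Plur, Person=1, and PronType=Prs, and differ in Clusivity); they are an assumption about how those values combine on a single word, not lines copied from the treebank.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict; "_" means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature bundles assembled from the listings on this page (assumed).
kami = parse_feats("Clusivity=Ex|Number=Plur|Person=1|PronType=Prs")
kita = parse_feats("Clusivity=In|Number=Plur|Person=1|PronType=Prs")

# Features whose values distinguish the two first-person plural pronouns.
differing = {name for name in kami if kami[name] != kita.get(name)}
print(differing)  # {'Clusivity'}
```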

Syntax

Auxiliary Verbs and Copula

This corpus uses 1 lemma as a copula (cop): adalah.

This corpus uses 10 lemmas as auxiliaries (aux). Examples: akan, telah, bisa, dapat, sudah, harus, sedang, mungkin, tengah, boleh.

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB--NOUN (687)
- VERB--PRON (503)

obj
- VERB--NOUN (943)
- VERB--PRON (31)

iobj
- VERB--NOUN (1)
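
Counts of this kind can be reproduced by looking up the part of speech of each word's head. The sketch below is one possible way to do it, assuming each sentence is already available as a list of (ID, UPOS, HEAD, DEPREL) tuples and that relation labels are matched exactly (so nsubj:pass is not counted as nsubj). The function name and the three-word sample are hypothetical and hand-made, not taken from the treebank.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}
NOMINAL_TAGS = {"NOUN", "PRON"}


def core_argument_pairs(sentence):
    """Count core relations whose parent is a VERB and child a NOUN or PRON.

    sentence: list of (id, upos, head, deprel) tuples for one sentence.
    """
    upos_by_id = {tok_id: upos for tok_id, upos, _, _ in sentence}
    pairs = Counter()
    for _, upos, head, deprel in sentence:
        if (deprel in CORE_RELATIONS
                and upos in NOMINAL_TAGS
                and upos_by_id.get(head) == "VERB"):
            pairs[(deprel, f"VERB--{upos}")] += 1
    return pairs


# Hand-made sample: a pronoun subject and a noun object of the same verb.
sample = [
    (1, "PRON", 2, "nsubj"),
    (2, "VERB", 0, "root"),
    (3, "NOUN", 2, "obj"),
]

pairs = core_argument_pairs(sample)
print(pairs[("nsubj", "VERB--PRON")], pairs[("obj", "VERB--NOUN")])  # 1 1
```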

Verbs with Reflexive Core Objects

This corpus contains 5 lemmas that occur at least once with a reflexive core object (obj or iobj). Examples: beri diri, daftar diri, tahu diri, tarik diri, tempat diri

Relations Overview

This corpus uses 13 relation subtypes: acl:relcl, advmod:emph, case:adv, cc:preconj, compound:a, flat:foreign, flat:name, nmod:lmod, nmod:poss, nmod:tmod, nsubj:pass, obl:agent, obl:tmod
The following main type is never used alone; it is always subtyped: compound
The following 5 relation types are not used in this corpus at all: vocative, expl, list, goeswith, reparandum
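
The three statements above can be derived from the relation list given earlier on this page together with the 37 universal relations of UD v2. The following sketch shows the set operations involved; the variable names are illustrative.

```python
# The 37 universal dependency relations of UD v2.
UD_RELATIONS = """
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split()

# Relations used in this corpus, as listed in the Relations section above.
CORPUS_RELATIONS = """
acl acl:relcl advcl advmod advmod:emph amod appos aux case case:adv cc
cc:preconj ccomp clf compound:a conj cop csubj dep det discourse dislocated
fixed flat flat:foreign flat:name iobj mark nmod nmod:lmod nmod:poss
nmod:tmod nsubj nsubj:pass nummod obj obl obl:agent obl:tmod orphan
parataxis punct root xcomp
""".split()

subtypes = sorted(r for r in CORPUS_RELATIONS if ":" in r)
used_alone = {r for r in CORPUS_RELATIONS if ":" not in r}
main_of_subtypes = {r.split(":")[0] for r in subtypes}

# Main types that occur only with a subtype, and types that never occur.
always_subtyped = sorted(main_of_subtypes - used_alone)
unused = sorted(set(UD_RELATIONS) - used_alone - main_of_subtypes)

print(len(subtypes))     # 13
print(always_subtyped)   # ['compound']
print(unused)            # ['expl', 'goeswith', 'list', 'reparandum', 'vocative']
```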