UD for Indonesian
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Multiword tokens and punctuation receive special treatment.
- Special treatment of multiword tokens:
- Multiword tokens that end with the particles -lah/-kah/-tah/-pun are split into two tokens. These particles usually emphasize the word before them. The particles -lah/-kah/-tah are clitics, while the particle pun can be written either as a clitic or as a separate token. Examples of how these clitic particles are tokenized:
- bacalah is split into baca “read” and lah
- diakah is split into dia “he/she” and kah
- apatah is split into apa “what” and tah
- walaupun is split into walau “although” and pun
- The particle kah marks yes-no questions, and its position may mark the preceding word as the focus of the question. The word apa “what”, when placed at the beginning of a sentence, also functions as a question particle, and it may optionally be strengthened by kah, resulting in apakah, as in Apa(kah) dia guru? “Is she a teacher?” However, apa is also used as an interrogative pronoun, even sentence-initially, as in Apa pendapatmu? “What do you think?” Finally, kah can also be added to interrogative words (apa “what”, siapa “who”, di mana “where”, kapan “when”, bagaimana “how”) in open questions; when -kah is added, the tone becomes more polite.
- Multiword tokens that contain the clitics -ku “me/my”, -mu “you/your”, or -nya “he/him/she/her/it” are split into two tokens, with exceptions for certain words ending with -nya.
- Words ending with -nya where -nya itself serves as a pronoun or determiner are split into two tokens. For example:
- -nya as a pronoun: mencintainya “love him/her/it” is split into mencintai “love” and nya “him/her/it”.
- -nya as a possessive pronoun: bukunya “his/her/its book” is split into buku “book” and nya “his/her/its”.
- -nya as a determiner in predicate nominalisation: meningkatnya “the increase” is split into meningkat “increase” and nya “the”.
- Words ending with -nya that function as adverbs, adjectives, or auxiliaries are not split. For example:
- adverbs ending with -nya: khususnya “especially”, awalnya “initially”, akhirnya “finally”
- adjectives ending with -nya: sebelumnya “previous”, sesudahnya “next”, berikutnya “next”
- auxiliaries ending with -nya: seharusnya/sebaiknya “shall/should”
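The clitic rules above can be sketched in Python. This is only a minimal illustration, not the tokenizer used to build the treebanks. The stem lexicon `KNOWN_STEMS`, the exception list `UNSPLIT_NYA`, and the function name `split_clitic` are hypothetical and cover just the examples in this section. A lexicon check is assumed because suffix matching alone would over-split ordinary words such as sekolah “school”, which merely happens to end in -lah.

```python
# Toy data for illustration only; a real system would need a full lexicon.
KNOWN_STEMS = {"baca", "dia", "apa", "walau", "mencintai", "buku", "meningkat"}
UNSPLIT_NYA = {
    "khususnya", "awalnya", "akhirnya",         # adverbs
    "sebelumnya", "sesudahnya", "berikutnya",   # adjectives
    "seharusnya", "sebaiknya",                  # auxiliaries
}
CLITICS = ("lah", "kah", "tah", "pun", "ku", "mu", "nya")


def split_clitic(token):
    """Return [stem, clitic] if the token should be split, else [token]."""
    lower = token.lower()
    if lower in UNSPLIT_NYA:
        return [token]
    for clitic in CLITICS:
        stem = lower[: -len(clitic)]
        # Split only when the remainder is a known stem (assumed lexicon check).
        if lower.endswith(clitic) and stem in KNOWN_STEMS:
            return [token[: -len(clitic)], token[-len(clitic):]]
    return [token]


print(split_clitic("bacalah"))    # split into stem and particle
print(split_clitic("khususnya"))  # kept as one token
```

Listing the unsplit -nya words explicitly mirrors the guideline: whether -nya is split depends on the function of the word, which surface form alone does not reveal.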
- Special treatment of punctuation. All punctuation symbols are separated from the words, except in two cases:
- Hyphens in reduplicated words. Indonesian has many reduplicated words, including nouns (both singular and plural), verbs, adjectives, and adverbs. These reduplicated words are not split and remain one token. Examples of reduplicated words:
- Singular noun: mata-mata “spy”
- Plural noun: anak-anak “children”, from anak “child”
- Verb: merobek-robek “shredding”
- Adjective: hiruk-pikuk “noisy”
- Adverb: terus-menerus “continuously”
- Abbreviations. Abbreviations such as Mr., M.Sc., and Tn. are not split and remain one token.
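The two punctuation exceptions can be illustrated with a small whitespace-and-regex tokenizer. This is a sketch under simplifying assumptions: `ABBREVIATIONS` is a hypothetical list holding only the three abbreviations named above, and every hyphen between word characters is kept, so the code does not check whether a hyphenated word is actually a reduplication.

```python
import re

# Hypothetical abbreviation list; only the examples from the text.
ABBREVIATIONS = {"Mr.", "M.Sc.", "Tn."}

# A word (optionally hyphen-joined, as in reduplication) or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split on whitespace, then separate punctuation from words."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # abbreviations stay one token
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens


print(tokenize("Anak-anak bermain terus-menerus."))
print(tokenize("Tn. Budi datang."))
```

An abbreviation followed directly by further punctuation (for example `Tn.,`) is not recognized by this exact-match lookup and would need extra handling.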
Morphology
Tags
- We use KBBI (Kamus Besar Bahasa Indonesia/Indonesian Great Dictionary) as the reference dictionary. However, since this dictionary defines only 7 word classes (noun, verb, adjective, adverb, pronoun, particle, and number), adjustments are needed so that the tags conform to UD v2.
- Indonesian UD treebanks use all 17 universal POS categories.
- PART is used for:
- negation words: tidak/tak/bukan “no/not”, belum “not yet”, jangan “don’t + VERB”
- the particles -lah, -kah, -tah, and pun, which were discussed in the previous section.
- The auxiliary (AUX) vs. VERB distinction follows the examples in the English treebanks, since KBBI has no AUX word class. The following Indonesian words are defined as AUX:
- adalah and ialah “be” serve as copulas.
- Tense-related AUX:
- akan “will/would” for the future tense.
- sedang/tengah “be” for the present tense.
- telah/sudah “have/has/had” for the past tense.
- Modal-related AUX:
- harus/mesti/wajib as the equivalents of modal “must”.
- sebaiknya/seharusnya as the equivalents of modal “shall/should”.
- bisa/dapat/sanggup/mampu as the equivalents of modal “can/could”.
- boleh as the equivalent of modal “may”.
- mungkin as the equivalent of modal “might”.
- The pronoun (PRON) vs. determiner (DET) distinction likewise follows the examples in the English treebanks, since KBBI does not define a DET word class either.
- The following word types are tagged as PRON:
- personal pronouns, such as saya/aku/ku “I”, kamu/mu/anda “you”, dia/ia/nya “he/she/it/him/her/its”, kami/kita “we/us/our”, mereka “they/them/their”
- interrogative pronouns, apa “what”, siapa “who” as in Apa yang kamu inginkan? “What do you want?”
- relative pronouns: apa “what”, siapa “who” as in Saya tahu siapa yang kamu maksud. “I know who you mean”
- indefinite pronouns: seseorang “someone/somebody”, sesuatu “something”
- total pronouns, such as semua “all” as in Semua kecuali bukumu “All except your books”.
- demonstrative pronouns: ini “this” as in Ini bukan salahmu. “This is not your fault”.
- The following word types are tagged as DET:
- demonstrative determiners: ini “this” as in Kota ini sangat indah “This city is beautiful”
- pronominal numerals: beberapa, berbagai, para “some/many”, semua “all” as in semua siswa “all students”
- Indonesian has the following coordinating conjunction words (CCONJ):
- dan, serta, maupun as the equivalents of “and” in English
- atau “or”
- tapi, tetapi, namun, melainkan as the equivalents of “but” in English
- Clauses can be nominalized by attaching the clitic -nya to the predicate. In the annotation, the clitic is analyzed as a separate syntactic word, functioning as a determiner. However, the predicate keeps the VERB tag, so there may be a verb with a determiner attached to it.
- meningkat “to increase” is a verb
- meningkatnya “the increase” is a nominalized form; however, since -nya is also used with regular nouns and functions like a definite article, meningkatnya is treated as a multi-word token meningkat+nya, where nya is attached as a det to the verb meningkat
- Since meningkat stays tagged as a verb, it will attach to its parent as a clause rather than a nominal. So if it is a subject of another clause, it will be csubj rather than nsubj.
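The analysis of meningkatnya described above can be written out as CoNLL-U rows. The sketch below builds the multiword-token range line and its two syntactic words. The helper names `conllu_row` and `nominalized_predicate_rows` are hypothetical, lemmas are left unspecified, and the head index and `csubj` relation passed in the usage line are made-up values for a verb that happens to be the subject of a predicate at position 3.

```python
def conllu_row(tok_id, form, upos="_", feats="_", head="_", deprel="_"):
    """One CoNLL-U line: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC."""
    cols = [tok_id, form, "_", upos, "_", feats, head, deprel, "_", "_"]
    return "\t".join(str(c) for c in cols)


def nominalized_predicate_rows(first_id, verb, verb_head, verb_deprel):
    """Rows for verb + -nya: a range line, the VERB, and nya as its determiner."""
    verb_id, det_id = first_id, first_id + 1
    return [
        conllu_row(f"{verb_id}-{det_id}", verb + "nya"),  # surface token
        conllu_row(verb_id, verb, upos="VERB", head=verb_head, deprel=verb_deprel),
        conllu_row(det_id, "nya", upos="DET", feats="PronType=Art",
                   head=verb_id, deprel="det"),
    ]


# Hypothetical context: the nominalized clause is the subject of word 3.
for row in nominalized_predicate_rows(1, "meningkat", 3, "csubj"):
    print(row)
```

Because the verb keeps its VERB tag, the relation passed for it is clausal (`csubj`) rather than nominal (`nsubj`), while nya attaches to the verb as `det`.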
Features
- We propose the use of 15 of 24 features defined in UD v2 that are relevant to Indonesian grammar:
- Abbr, with one possible value: Yes. This feature can be applied to all UPOS categories, except PUNCT and SYM.
- Clusivity, applies to PRON with two possible values: Ex and In.
- Clusivity=Ex for kami “we/our”
- Clusivity=In for kita “we/our”
- Degree, applies to ADJ with one possible value: Sup.
- Degree=Sup for superlative adjectives, such as terbaik “the best”, tercantik “the most beautiful”.
- Foreign, with one possible value: Yes. This feature only applies to X.
- Mood, applies to VERB, with two possible values: Ind and Imp.
- Mood=Ind for verbs in declarative sentences.
- Mood=Imp for verbs in imperative sentences.
- Number, applies to DET, NOUN, and PRON, with two possible values: Sing or Plur.
- Number=Sing is used for singular nouns, determiners, or pronouns.
- Number=Plur is used for plural nouns, determiners, or pronouns.
- NumType, applies to NUM and ADJ, with two possible values: Card or Ord.
- NumType=Card is used for cardinal numbers tagged as NUM.
- NumType=Ord is used for ordinal numbers tagged as ADJ.
- Person, applies to PRON with three possible values: 1, 2, 3.
- Polarity, with one possible value: Neg, applies to PART and INTJ.
- Polite, applies to PRON with two possible values: Form and Infm.
- Polite=Form, applies to PRON, such as for saya “I”, anda “you”, and beliau “him/her”.
- Polite=Infm, applies to PRON, such as for aku “I”, kamu “you” (singular), and kalian “you” (plural).
- PronType, applies to PRON, DET, and ADV. For Indonesian, eight possible values can be applied:
- PronType=Art, applies to DET, such as for sebuah, seorang and -nya
- PronType=Dem, applies to ADV, DET, and PRON, such as for itu “that” in Itu masalahmu. “That is your problem.”
- PronType=Emp, applies to DET, such as for sendiri “self” in Kamu harus percaya pada dirimu sendiri “You have to believe in yourself”.
- PronType=Ind, applies to ADV, DET, and PRON, such as for seseorang “someone/somebody” or sesuatu “something”
- PronType=Int, applies to PRON and ADV.
- PronType=Int for PRON, such as for apa “what” and siapa “who” in interrogative sentences
- PronType=Int for ADV, such as for bagaimana “how” and kapan “when” in interrogative sentences
- PronType=Prs, applies to PRON for all personal pronouns.
- PronType=Rel, applies to PRON and ADV.
- PronType=Rel for PRON, such as for apa “what”, siapa “who”, yang “that”.
- PronType=Rel for ADV, such as for di mana “where”, bagaimana “how” and kapan/saat/ketika “when” in non-interrogative sentences
- PronType=Tot, applies to ADV, DET, and PRON.
- PronType=Tot for PRON, such as for semua “all” in Semua adalah milikmu. “All is yours.”
- PronType=Tot for DET, such as for semua “all” in Semua siswa terlihat senang. “All students look happy.”
- PronType=Tot for ADV, such as for selalu “always” in Dia selalu terlambat. “She is always late.”
- Reflex, applies to PRON with one possible value: Yes. Only one word qualifies for this feature: diri “self”.
- Typo, with one possible value: Yes. This feature can be applied to all UPOS categories except PUNCT and SYM.
- Voice, applies to VERB with two possible values: Act and Pass. Voice alternation is treated as inflection and the active and passive counterparts have the same lemma.
- Voice=Act for active verbs, which are characterized by a bare base word or the prefixes me-, ber-.
- Active verbs without affix: duduk “sit”, pergi “go”
- Active verbs with prefix me-: memperbaiki “fix”, mengakui “admit”
- Active verbs with prefix ber-: belajar “study”, bekerja “work”
- Voice=Pass for passive verbs, which are characterized by the prefixes di-, ter- or the circumfix ke-an.
- Passive verbs with prefix di-: dipublikasikan “be published”, dilepaskan “be released”
- Passive verbs with prefix ter-: terbakar “on fire”, terjatuh “fell”, terkejut “shocked”
- Passive verbs with circumfix ke-an: ketinggalan “lag behind”, kecurian “be stolen”
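The feature inventory above can be encoded as a lookup table and used to check a FEATS string. This is a minimal sketch and not part of the official UD validator. The names `FEATURES` and `invalid_features` are hypothetical, and the table contains only the features and values enumerated in this section. It is also coarser than the prose: it accepts NumType=Ord on NUM, for instance, although the text assigns Ord to ADJ.

```python
UPOS = {"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"}
NOT_PUNCT_SYM = UPOS - {"PUNCT", "SYM"}

# feature name -> (allowed values, allowed UPOS tags), as listed in the text
FEATURES = {
    "Abbr": ({"Yes"}, NOT_PUNCT_SYM),
    "Clusivity": ({"Ex", "In"}, {"PRON"}),
    "Degree": ({"Sup"}, {"ADJ"}),
    "Foreign": ({"Yes"}, {"X"}),
    "Mood": ({"Ind", "Imp"}, {"VERB"}),
    "Number": ({"Sing", "Plur"}, {"DET", "NOUN", "PRON"}),
    "NumType": ({"Card", "Ord"}, {"NUM", "ADJ"}),
    "Person": ({"1", "2", "3"}, {"PRON"}),
    "Polarity": ({"Neg"}, {"PART", "INTJ"}),
    "Polite": ({"Form", "Infm"}, {"PRON"}),
    "PronType": ({"Art", "Dem", "Emp", "Ind", "Int", "Prs", "Rel", "Tot"},
                 {"PRON", "DET", "ADV"}),
    "Reflex": ({"Yes"}, {"PRON"}),
    "Typo": ({"Yes"}, NOT_PUNCT_SYM),
    "Voice": ({"Act", "Pass"}, {"VERB"}),
}


def invalid_features(upos, feats):
    """Return the Name=Value pairs in a FEATS string that the table rejects."""
    if feats == "_":
        return []
    problems = []
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        values, tags = FEATURES.get(name, (set(), set()))
        if value not in values or upos not in tags:
            problems.append(pair)
    return problems


print(invalid_features("PRON", "Clusivity=Ex|Person=1"))  # nothing rejected
print(invalid_features("NOUN", "Tense=Past"))             # Tense is not in the table
```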
- We consider these 9 UD v2 features not relevant to Indonesian grammar:
- Gender. Indonesian words have no gender.
- Animacy. As with Gender, there is no requirement of agreement between words in Indonesian.
- NounClass, for the same reason as Gender and Animacy.
- Case, for the same reason as Gender, Animacy, and NounClass.
- Tense. Indonesian verbs have the same form in all tenses.
- Aspect, for the same reason as Tense.
- Evident
- Poss
- VerbForm
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- The default word order is SVO, so the subject (nsubj) normally precedes and the object follows the verb (with the exception of inverted sentences).
- A verb may serve as the subject and is labeled as clausal subject, either as csubj or csubj:pass.
- Transitive verbs will have a noun phrase as the object (obj).
- Passive verbs can be followed by an agent (obl:agent). For example, in Pesan yang dikirimkan presiden “Messages sent by the president”, presiden “president” is the agent of the predicate dikirimkan “be sent”.
- Verbs can have oblique arguments (obl). Temporal modifiers are labeled obl:tmod.
Non-verbal Clauses
- The copula ialah or adalah “be” is optionally used in equational, attributional, locative, possessive, and benefactive nonverbal clauses. The two forms are interchangeable, but adalah is more common. For example, “This is my house.” can be written in Indonesian as:
- Ini rumahku., without copula
- Ini adalah rumahku., with copula adalah
Relations Overview
- Among 37 universal dependency relations in UDv2:
- 31 deprels are represented in the Indonesian-CSUI (except: compound, expl, goeswith, list, reparandum, and vocative)
- 33 deprels are represented in the Indonesian-PUD (except: dep, expl, list, and reparandum)
- 34 deprels are represented in the Indonesian-GSD (except: dislocated, expl, and reparandum)
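The coverage figures above follow from subtracting each treebank's unused relations from the 37 universal ones. The short sketch below reproduces that arithmetic and also derives which relations are unused in all three treebanks. The variable names are hypothetical, and the data is copied from the list above.

```python
UNIVERSAL_DEPRELS = 37

# Relations not represented in each treebank, as listed in the text.
MISSING = {
    "Indonesian-CSUI": {"compound", "expl", "goeswith", "list",
                        "reparandum", "vocative"},
    "Indonesian-PUD": {"dep", "expl", "list", "reparandum"},
    "Indonesian-GSD": {"dislocated", "expl", "reparandum"},
}

# Number of universal relations represented in each treebank.
represented = {name: UNIVERSAL_DEPRELS - len(rels) for name, rels in MISSING.items()}

# Relations absent from every treebank.
unused_everywhere = set.intersection(*MISSING.values())

print(represented)
print(sorted(unused_everywhere))
```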
- We provide additional documentation with examples in Indonesian for some of the universal deprels:
- The following 14 relation subtypes could be used in Indonesian UD treebank:
- acl:relcl for relative clauses that modify a noun phrase.
- advmod:emph for the particles (PART) -lah, -kah, -tah, and pun that emphasize other words.
- case:adv for adposition (ADP) that is not a nominal dependent.
- cc:preconj for word baik in clause baik A maupun B “both A and B”.
- compound:a for adjective compounds
- csubj:pass for clausal subjects of passive verbs.
- flat:foreign to label sequences of foreign words.
- flat:name to label sequences of names of PROPN-PROPN pairs.
- nmod:lmod for locative nouns.
- nmod:poss for possessive relationship.
- nmod:tmod for temporal modifier of a noun phrase.
- nsubj:pass for nominal subjects of passive verbs.
- obl:agent for agents of passive verbs.
- obl:tmod for temporal modifier for a VERB/ADJ.
Treebanks
There are 3 Indonesian UD treebanks: