UD Japanese GSDLUW
Language: Japanese (code: ja)
Family: Japanese
This treebank has been part of Universal Dependencies since the UD v2.9 release.
The following people have contributed to making this treebank part of UD: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman.
Repository: UD_Japanese-GSDLUW
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: news, blog
Questions, comments? General annotation questions (either Japanese-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [masayu-a (æt) ninjal • ac • jp]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | annotated manually in non-UD style, automatically converted to UD |
UPOS | annotated manually in non-UD style, automatically converted to UD |
XPOS | annotated manually |
Features | not available |
Relations | annotated manually in non-UD style, automatically converted to UD |
Description
This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from Google UDT 2.0.
The Japanese UD treebank contains the sentences from Google Universal Dependency Treebanks v2.0 (legacy): https://github.com/ryanmcd/uni-dep-tb. First, Google UDT v2.0 was converted to UD style with bunsetsu-based word units (the “master” corpus).
Because the word units in “master” differed significantly from the definition in the documentation, which is based on the Short Word Unit (SWU) [1], the sentences were automatically re-processed by Hiroshi Kanayama in February 2017. The result was UD_Japanese v2.0, which was used in the CoNLL 2017 shared task. In November 2017, UD_Japanese v2.0 was merged with the “master” data so that the manual dependency annotations could be reflected in the corpus. This reduced the errors in the dependency structures and relation labels.
There are still slight differences in the word units between UD_Japanese v2.1 and UD_Japanese-KTC 1.3.
In May 2020, we introduced a conversion method like that of UD_Japanese BCCWJ [3] for UD_Japanese GSD v2.6.
In May 2021, we introduced another word segmentation version of UD_Japanese-GSD.
Acknowledgments
The original treebank was provided by:
- Adam LaMontagne
- Milan Souček
- Timo Järvinen
- Alessandra Radici
via
- Dan Zeman.
The corpus was converted by:
- Mai Omura
- Aya Wakasa
- Masayuki Asahara
through annotation, discussion and validation with
- Kayo Yamashita
Statistics of UD Japanese GSDLUW
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Relations
acl – advcl – advmod – amod – aux – case – cc – ccomp – compound – cop – csubj – csubj:outer – dep – det – discourse – fixed – mark – nmod – nsubj – nsubj:outer – nummod – obj – obl – punct – root
Tokenization and Word Segmentation
- This corpus contains 8100 sentences and 150243 tokens.
- This corpus contains 142134 tokens (95%) that are not followed by a space.
- This corpus contains 153 types of words with spaces. Examples: You Tube, DEATH NOTE, EEP ROM, Mozilla Firefox, 12.1型WXGA TFT液晶, AOL Key words, ASIA GIRLS EXPLOSION, Acoustic UK, Ad Planner, André Franquin, Arc Sight株, Ars Technica, BAD TIMES, Bill of Lading, Biohazard archives, BlackBerry Bold9700, Blue tooth, Blues Attack, Brian Brazil, British Rail termini, COCK AND BULL TUNES, CRYSTAL BALLカードミラー, CS 5, City of Sarnia, Club Class, DARK SIDE REPORT, DFJ Esprit, DRAGON GATE RECORDS代表兼プロデューサー, Deep Junior, Deep Sjeng, Detailed Baseline Report, Direct X, Double Click Ad Planner, EMI CLASSICS, Enterprise Java Beans, F-ZERO AX, FM TOWNS, FRIDAY NIGHT, Feeling Heart, GNU Cライブラリ, GTC Speed, Galaxy Tab, HOUND DOG, HTTP Proxy/SSH/Socks, Happy Tablet, HellChose Me, Herbert J.Zeiger, Hugo Pratt, IBM PC/AT以来, IT’S FRIDAY
- This corpus contains 999 types of words that contain both letters and punctuation. Examples: レアル・マドリード, J-WAVE, SETI@home, T-72, Wi-Fi, 『真型』以前, エドガー・ダイクストラ, スター・ウォーズ, テーブル“T”, ルイ・ヴィトン, 一、二塁, 阪神・淡路大震災, 0.2%減, 0.5%減, 0.6%安, 1%未満, 100%有機, 11ウォール・ストリート, 15%急落, 157km/h, 1ch・2ch・12ch, 2.6%減, 2両・3両単位, 3%増, 30%アップ, 323A-1, 35%程度, 3人目・4度目, 3回転ルッツ-3回転ループ, 4.4%減, 5%以下, 50%以上, 50%以下, 6344P-L, 6・7%増, 6・7・DS, 6番・出口, 7.0%増, 70%以上, 80’sカルチャー, 90%以上, A&Mレコード, A&S, A.T.フィールド, AC/DC, ACミネロス・デ・グアヤナ所属, AMX-10RC, AQTI-2型クリッパー発売, AT-X, AW.55アポロ
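The counts above can be approximated from a CoNLL-U file. The following is a minimal sketch, not the script that produced these statistics: it assumes the standard 10-column CoNLL-U layout with `SpaceAfter=No` in the MISC column, and the function name `tokenization_stats`, the helper `row`, and the two sample sentences are made up for illustration.

```python
def tokenization_stats(conllu_text):
    """Tally sentences, word tokens, SpaceAfter=No tokens, and forms containing a space."""
    sentences = tokens = no_space_after = 0
    forms_with_space = set()
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed lines, multiword ranges (1-2), empty nodes (1.1)
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space_after += 1
        if " " in cols[1]:
            forms_with_space.add(cols[1])
    return {
        "sentences": sentences,
        "tokens": tokens,
        "no_space_after": no_space_after,
        "forms_with_space": forms_with_space,
    }


def row(idx, form, upos, head, deprel, misc):
    """Build one tab-separated CoNLL-U line (hypothetical helper; lemma = form here)."""
    return "\t".join([str(idx), form, form, upos, "_", "_", str(head), deprel, "_", misc])


# Invented sample data, not taken from the treebank.
sample = "\n".join([
    "# text = You Tubeを見る。",
    row(1, "You Tube", "PROPN", 3, "obj", "SpaceAfter=No"),
    row(2, "を", "ADP", 1, "case", "SpaceAfter=No"),
    row(3, "見る", "VERB", 0, "root", "SpaceAfter=No"),
    row(4, "。", "PUNCT", 3, "punct", "_"),
    "",
    "# text = 見る。",
    row(1, "見る", "VERB", 0, "root", "SpaceAfter=No"),
    row(2, "。", "PUNCT", 1, "punct", "_"),
    "",
])
stats = tokenization_stats(sample)
```

For the figures reported above, 142134 / 150243 is roughly 94.6%, which is consistent with the rounded 95%.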
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 36 word types tagged as particles (PART): か, かしらん, くらい, ぐらい, さ, さえ, しか, すら, ぞ, ぞお, たり, だけ, だけでなく, だり, って, どころ, な, なぁ, なあ, など, なんて, なー, ね, ねえ, の, のみ, ばかり, ほど, まで, や, よ, よぉ, よー, わ, 程, 風
- This corpus contains 44 lemmas tagged as pronouns (PRON): あちこち, おめえ等, こっち, こんな, そらあ, そんな, どんな, わしゃ, 何, 何れ, 何処, 何方, 何時, 何等, 余, 俺, 僕, 僕等, 其れ, 其れ等, 其処, 其方, 君, 己, 彼, 彼れ, 彼女, 彼女達, 彼方, 彼等, 御前, 御宅, 我, 我々, 本の, 此れ, 此れ等, 此処, 此方, 私, 私達, 誰, 貴方, 貴様
- This corpus contains 8 lemmas tagged as determiners (DET): あらゆる, とある, 何の, 其の, 彼の, 我が, 或る, 此の
- This corpus contains 79 lemmas tagged as auxiliaries (AUX): かもしれない, かもしれません, ことがある, ことができない, ことができる, こととなる, ことにする, ことになる, ことはない, こともある, こともない, ごとし, させる, ざるを得ない, しかない, じゃ, ず, せる, そう, た, たい, たがる, たらいい, たり, だ, ちゃう, つう, つつある, てある, ていく, ていただく, ている, ておく, ておる, てく, てくださる, てくる, てくれる, てしまう, てはいけない, てはならない, てほしい, てみる, てもいい, てもらう, てやる, てる, である, です, でない, ではありません, ではない, でもある, とく, ない, ないではいられない, なくてはならない, なければならない, なり, に過ぎない, に違いない, のだ, のである, のです, のではない, ばいい, べし, まい, まじ, ます, までもない, みたい, む, や, らしい, られる, れる, わけにはいかない, 様
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: ている
- This corpus does not use the VerbForm feature.
Nominal Features
Degree and Polarity
- Neg
- AUX: ない, ず, ん, なかっ, なく, なけれ, なければならない, ぬ, ざるをえない, ざるを得ない
- NOUN: なし
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 2 lemmas as copulas (cop). Examples: だ, です.
- This corpus uses 79 lemmas as auxiliaries (aux). Examples: た, ている, だ, れる, ます, である, ない, です, られる, のだ, ず, 様, ておる, せる, てくる, たい, ではない, てしまう, のです, てくれる, てる, ていく, ことができる, てもらう, ことになる, そう, てみる, ていただく, べし, てくださる, ことがある, こともある, らしい, でもある, のである, みたい, こととなる, のではない, でない, かもしれない, ちゃう, ではありません, ておく, てある, てく, かもしれません, ことはない, なければならない, てほしい, てもいい.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN-ADP(が) (2474)
- VERB--NOUN-ADP(は) (1532)
- VERB--NOUN-ADP(も) (527)
- VERB--PRON-ADP(が) (51)
- VERB--PRON-ADP(は) (105)
- VERB--PRON-ADP(も) (50)
- obj
- VERB--NOUN-ADP(か)-ADP(を) (2)
- VERB--NOUN-ADP(だけ)-ADP(を) (2)
- VERB--NOUN-ADP(と)-ADP(か)-ADP(を) (1)
- VERB--NOUN-ADP(と)-ADP(を) (3)
- VERB--NOUN-ADP(など)-ADP(を) (52)
- VERB--NOUN-ADP(など)-ADP(を通じて) (2)
- VERB--NOUN-ADP(により)-ADP(を) (2)
- VERB--NOUN-ADP(の)-ADP(の)-ADP(を) (1)
- VERB--NOUN-ADP(のみ)-ADP(を) (3)
- VERB--NOUN-ADP(まで)-ADP(を) (2)
- VERB--NOUN-ADP(を) (4458)
- VERB--NOUN-ADP(を)-ADP(で)-ADP(も) (1)
- VERB--NOUN-ADP(を)-ADP(に) (1)
- VERB--NOUN-ADP(を)-ADP(も) (2)
- VERB--NOUN-ADP(をはじめ) (1)
- VERB--NOUN-ADP(をもって) (7)
- VERB--NOUN-ADP(を通じて) (5)
- VERB--PRON-ADP(か)-ADP(を) (5)
- VERB--PRON-ADP(まで)-ADP(を) (1)
- VERB--PRON-ADP(を) (83)
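Patterns such as VERB--NOUN-ADP(が) can be tallied by pairing each nominal argument with its case markers. The sketch below is one possible reading of the notation, not the procedure actually used to generate the list: it assumes the markers are the argument's `case` dependents tagged ADP, listed in surface order by lemma. The function name `argument_patterns`, the simplified row tuples, and the example sentence are hypothetical.

```python
from collections import Counter


def argument_patterns(rows):
    """Count parent--child-marker patterns for nsubj and obj in one sentence.

    rows: list of (id, form, lemma, upos, head, deprel) tuples with integer ids.
    """
    by_id = {r[0]: r for r in rows}
    counts = Counter()
    for tid, form, lemma, upos, head, deprel in rows:
        # Only nouns and pronouns attached as nsubj or obj are considered.
        if deprel not in ("nsubj", "obj") or upos not in ("NOUN", "PRON"):
            continue
        parent = by_id.get(head)
        if parent is None or parent[3] != "VERB":
            continue  # the parent must be a verb
        # Assumption: markers are the ADP children attached with `case`.
        markers = sorted(
            (r for r in rows if r[4] == tid and r[5] == "case" and r[3] == "ADP"),
            key=lambda r: r[0],
        )
        suffix = "".join(f"-ADP({m[2]})" for m in markers)
        counts[(deprel, f"VERB--{upos}{suffix}")] += 1
    return counts


# Invented example: 猫 が 魚 を 食べる ("the cat eats fish").
example = [
    (1, "猫", "猫", "NOUN", 5, "nsubj"),
    (2, "が", "が", "ADP", 1, "case"),
    (3, "魚", "魚", "NOUN", 5, "obj"),
    (4, "を", "を", "ADP", 3, "case"),
    (5, "食べる", "食べる", "VERB", 0, "root"),
]
patterns = argument_patterns(example)
```

Summing such counters over all sentences would give per-pattern totals comparable in form to the numbers in parentheses above.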
Relations Overview
- This corpus uses 2 relation subtypes: csubj:outer, nsubj:outer
- The following 14 relation types are not used in this corpus at all: iobj, xcomp, vocative, expl, dislocated, appos, clf, conj, flat, list, parataxis, orphan, goeswith, reparandum
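The two figures above follow from comparing the relations listed for this corpus against the 37 universal relations of UD v2. The sketch below reproduces that comparison. The variable names are illustrative, and the `USED` list is copied from the Relations list earlier on this page.

```python
# The 37 universal dependency relations defined in UD v2.
UNIVERSAL = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}

# Relations reported for this corpus (from the Relations list on this page).
USED = [
    "acl", "advcl", "advmod", "amod", "aux", "case", "cc", "ccomp", "compound",
    "cop", "csubj", "csubj:outer", "dep", "det", "discourse", "fixed", "mark",
    "nmod", "nsubj", "nsubj:outer", "nummod", "obj", "obl", "punct", "root",
]

# A subtype is any label of the form "relation:subtype".
subtypes = sorted(r for r in USED if ":" in r)

# Strip subtypes to get the universal relations actually in use.
used_base = {r.split(":")[0] for r in USED}
unused = sorted(UNIVERSAL - used_base)

print(subtypes)     # relation subtypes in use
print(len(unused))  # number of universal relations never used
print(unused)
```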