UD Chinese Beginner
Language: Chinese (code: zh)
Family: Sino-Tibetan
This treebank has been part of Universal Dependencies since the UD v2.13 release.
The following people have contributed to making this treebank part of UD: Kirian Guiller, Yidi Huang, Yixuan Li, Qishen Wu, Bruno Guillaume, Sylvain Kahane, Kim Gerdes.
Repository: UD_Chinese-Beginner
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-NC-SA 3.0
Genre: grammar-examples
Questions, comments? General annotation questions (either Chinese-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [kiriangui (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | annotated manually |
UPOS | annotated manually, natively in UD style |
XPOS | not available |
Features | annotated manually, natively in UD style |
Relations | annotated manually, natively in UD style |
Description
A treebank of Chinese sentences suited to learners from level A1 to C1 (HSK 1 to 5), collected from the Chinese Grammar Wiki website (CC BY-NC-SA 3.0 license). The treebank was manually annotated by researchers at Paris Nanterre University (Modyco) in the mSUD annotation scheme (morpheme-level Surface-Syntactic Universal Dependencies).
The syntactic analysis was originally done in SUD at the character level, under the name SUD_Chinese-PatentChar. See the SUD guidelines: https://surfacesyntacticud.github.io/guidelines/u/
Structure of the Treebank
The treebank is partitioned into 5 parts, A1, A2, B1, B2 and C1, which represent different levels of difficulty (from easiest to hardest).
/!\ As of October 12, 2023, 2295 sentences (around 20k tokens in total) have been hand-annotated. The distribution below describes the complete corpus as it will be once annotation is finished.
The corpus is made up of around 4300 sentences, with the following distribution:
- A1: 382 sentences (3456 tokens, ~ 9.05 tokens per sentence)
- A2: 1103 sentences (11920 tokens, ~ 10.80 tokens per sentence)
- B1: 1347 sentences (18236 tokens, ~ 15.54 tokens per sentence)
- B2: 1441 sentences (24419 tokens, ~ 16.95 tokens per sentence)
- C1: 300 sentences (5482 tokens, ~ 18.27 tokens per sentence)
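The per-level averages can be recomputed from the sentence and token counts listed above. The sketch below does only that; the names `LEVELS`, `tokens_per_sentence` and `totals` are illustrative and not part of any treebank tooling. Recomputed this way, the averages for A1, B2 and C1 agree with the listed values, while B1 comes out near 13.54 and the sentence counts sum to 4573, so some of the listed figures may reflect an earlier state of the corpus.

```python
# Sentence and token counts per level, copied from the list above.
LEVELS = {
    "A1": (382, 3456),
    "A2": (1103, 11920),
    "B1": (1347, 18236),
    "B2": (1441, 24419),
    "C1": (300, 5482),
}


def tokens_per_sentence(levels):
    """Return {level: average tokens per sentence}, rounded to 2 decimals."""
    return {
        name: round(tokens / sentences, 2)
        for name, (sentences, tokens) in levels.items()
    }


def totals(levels):
    """Return (total sentences, total tokens) summed over all levels."""
    total_sentences = sum(sentences for sentences, _ in levels.values())
    total_tokens = sum(tokens for _, tokens in levels.values())
    return total_sentences, total_tokens


if __name__ == "__main__":
    for name, average in tokens_per_sentence(LEVELS).items():
        print(f"{name}: {average} tokens per sentence")
    print("total sentences, total tokens:", totals(LEVELS))
```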
Data Split
The treebank is still being annotated, and around 40% of the sentences are yet to be annotated or validated. The current version is therefore not representative of the final distribution, which prevents us from making a representative data split that would remain stable across releases (see the UD data split guidelines). Until the treebank is fully annotated, we will not split the data and will release all sentences in a single test folder. Please perform 10-fold cross-validation if you use this treebank for any machine learning task.
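One possible way to set up the suggested 10-fold cross-validation is sketched below in plain Python. It assumes the data is a CoNLL-U file in which sentences are separated by blank lines; the function names and the file name in the usage comment are hypothetical, and the fixed seed is only there to make the folds reproducible.

```python
import random
import re


def read_sentences(conllu_text):
    """Split CoNLL-U text into sentence blocks (separated by blank lines)."""
    blocks = re.split(r"\n\s*\n", conllu_text.strip())
    return [block for block in blocks if block.strip()]


def k_fold_splits(sentences, k=10, seed=0):
    """Yield (train, test) sentence lists for each of the k folds."""
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for fold in folds:
        held_out = set(fold)
        train = [sentences[j] for j in indices if j not in held_out]
        test = [sentences[j] for j in fold]
        yield train, test


# Usage sketch (the file name is an assumption, adjust it to the release):
# with open("zh_beginner-ud-test.conllu", encoding="utf-8") as handle:
#     sentences = read_sentences(handle.read())
# for fold_number, (train, test) in enumerate(k_fold_splits(sentences)):
#     print(fold_number, len(train), len(test))
```

Every sentence lands in exactly one test fold, so scores averaged over the ten folds cover the whole treebank.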
Structure of a sentence
Here is an example of the metadata that each sentence contains:
Acknowledgments
This annotation work is supported by the Autogramm project and relies on the extensive work done by AllSetLearning contributors to the Chinese Grammar Wiki.
References
- Please cite any of the following: this GitHub repo, the original mSUD repo or the SUD conversion, together with the Chinese Grammar Wiki, the source of the original content.
Statistics of UD Chinese Beginner
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB
Features
Aspect – Case – Number – NumType – Person – Polarity – PronType
Relations
acl – advcl – advmod – amod – appos – aux – case – cc – ccomp – clf – compound – compound:svc – compound:vv – conj – cop – csubj – dep – det – discourse – discourse:sp – fixed – flat – iobj – mark – nmod – nsubj – nsubj:outer – nummod – obj – obl – obl:arg – obl:lmod – obl:tmod – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 2295 sentences and 19999 tokens.
- This corpus contains 19999 tokens (100%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 1 type of word that contains both letters and punctuation. Example: 漂亮”
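Counts of this kind can be approximated directly from a CoNLL-U file. The sketch below assumes the standard 10-column format, in which the MISC column marks a token with no following space as `SpaceAfter=No`. The helper names and the toy sentence are illustrative and not taken from the treebank.

```python
def row(idx, form, upos, head, deprel, misc="SpaceAfter=No"):
    """Build one 10-column CoNLL-U token line (toy values, lemma = form)."""
    return "\t".join(
        [str(idx), form, form, upos, "_", "_", str(head), deprel, "_", misc]
    )


# Hypothetical toy sentence, not taken from the treebank.
SAMPLE = "\n".join([
    "# text = 我爱你。",
    row(1, "我", "PRON", 2, "nsubj"),
    row(2, "爱", "VERB", 0, "root"),
    row(3, "你", "PRON", 2, "obj"),
    row(4, "。", "PUNCT", 2, "punct"),
    "",
])


def count_tokens(conllu_text):
    """Return (sentences, tokens, tokens not followed by a space)."""
    sentences = tokens = no_space = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return sentences, tokens, no_space


if __name__ == "__main__":
    print(count_tokens(SAMPLE))
```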
Morphology
Tags
- This corpus uses 15 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB
- This corpus does not use the following tags: SYM, X
- This corpus contains 26 word types tagged as particles (PART): 之, 之前, 之后, 之类, 了, 什么的, 们, 儿, 吗, 吧, 呗, 呢, 啊, 啦, 嘛, 地, 它, 得, 所有, 来着, 的, 的话, 等, 等等, 罢了, 那
- This corpus contains 26 lemmas tagged as pronouns (PRON): 为什么, 什么, 他, 他们, 你, 你们, 你自己, 其, 几, 别, 咱们, 哪, 哪儿, 多少, 她, 她们, 它, 干吗, 怎么, 您, 我, 我们, 是否, 自己, 谁, 这么
- This corpus contains 18 lemmas tagged as determiners (DET): 一, 一点, 些, 其, 其他, 几, 所有, 整, 本, 某, 每, 点, 这, 这儿, 这里, 那, 那么, 那儿
- Out of the above, 2 lemmas occurred sometimes as PRON and sometimes as DET: 其, 几
- This corpus contains 16 lemmas tagged as auxiliaries (AUX): 了, 会, 可以, 应该, 得, 必须, 想, 是, 有, 用, 着, 能, 被, 要, 过, 需要
- Out of the above, 11 lemmas occurred sometimes as AUX and sometimes as VERB: 了, 会, 得, 想, 是, 有, 用, 被, 要, 过, 需要
- This corpus does not use the VerbForm feature.
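The overlap between tag classes reported above can be re-derived from the lemma lists themselves with plain set operations. The sketch below copies the PRON, DET and AUX lists from this section, plus the list of lemmas attached with the aux relation from the Syntax section. The variable names are illustrative.

```python
# Lemma lists copied from this page (whitespace-separated for readability).
PRON = set("""为什么 什么 他 他们 你 你们 你自己 其 几 别 咱们 哪 哪儿 多少
她 她们 它 干吗 怎么 您 我 我们 是否 自己 谁 这么""".split())

DET = set("""一 一点 些 其 其他 几 所有 整 本 某 每 点 这 这儿 这里
那 那么 那儿""".split())

AUX_TAG = set("了 会 可以 应该 得 必须 想 是 有 用 着 能 被 要 过 需要".split())

AUX_RELATION = set("要 了 想 会 过 可以 能 应该 有 得 着 用 必须 需要 被".split())

# Lemmas tagged sometimes as PRON and sometimes as DET.
pron_and_det = PRON & DET

# Lemmas tagged AUX that never appear with the aux relation.
aux_tag_only = AUX_TAG - AUX_RELATION

if __name__ == "__main__":
    print(len(PRON), len(DET), sorted(pron_and_det))
    print(len(AUX_TAG), len(AUX_RELATION), sorted(aux_tag_only))
```

On these lists, the only AUX lemma missing from the aux relation list is 是, which matches its being the single copula lemma reported in the Syntax section.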
Nominal Features
- Number=Plur
  - PART: 们
  - PRON: 我们, 他们, 你们, 咱们, 她们
- Case=Gen
  - PART: 的, 什么的, 的话
Degree and Polarity
- Polarity=Neg
  - ADV: 不, 没, 不怎么, 不管, 再不
  - VERB: 没, 没办法
Verbal Features
- Aspect=Perf
  - AUX: 过
  - PART: 了
Pronouns, Determiners, Quantifiers
- PronType=Ind
  - PRON: 什么, 谁
- PronType=Int
  - PRON: 怎么, 什么, 谁, 为什么, 多少, 哪儿, 哪, 几
- NumType=Card
  - NUM: 一, 两, 十, 三, 几, 五, 1, 二, 八, 四
- NumType=Ord
  - NUM: 第
- Person=1
  - PRON: 我, 我们, 咱们
- Person=2
  - PRON: 你, 你们, 您, 你自己
- Person=3
  - PRON: 他, 她, 他们, 她们
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Example: 是.
- This corpus uses 15 lemmas as auxiliaries (aux). Examples: 要, 了, 想, 会, 过, 可以, 能, 应该, 有, 得, 着, 用, 必须, 需要, 被.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (382)
  - VERB--PRON (1185)
- obj
  - VERB--NOUN (1292)
  - VERB--NOUN-ADP(的) (2)
  - VERB--NOUN-ADP(通过) (1)
  - VERB--PRON (172)
  - VERB--PRON-ADP(的) (6)
- iobj
  - VERB--PRON (1)
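Counts like the ones above can be approximated by pairing each NOUN or PRON token with the UPOS tag of its head. The sketch below is a simplified version of that idea, assuming standard 10-column CoNLL-U input: it ignores the adposition markers (such as `-ADP(的)`) shown in the list, and its helper names and toy sentences are illustrative rather than taken from the treebank.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}


def row(idx, form, upos, head, deprel):
    """Build one 10-column CoNLL-U token line (toy values, lemma = form)."""
    return "\t".join(
        [str(idx), form, form, upos, "_", "_", str(head), deprel, "_", "_"]
    )


# Two hypothetical toy sentences, not taken from the treebank.
SAMPLE = "\n".join([
    row(1, "我", "PRON", 2, "nsubj"),
    row(2, "爱", "VERB", 0, "root"),
    row(3, "你", "PRON", 2, "obj"),
    row(4, "。", "PUNCT", 2, "punct"),
    "",
    row(1, "他", "PRON", 2, "nsubj"),
    row(2, "喝", "VERB", 0, "root"),
    row(3, "茶", "NOUN", 2, "obj"),
    row(4, "。", "PUNCT", 2, "punct"),
    "",
])


def parse_sentences(conllu_text):
    """Yield each sentence as a list of 10-column token rows."""
    rows = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if rows:
                yield rows
            rows = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 10 and cols[0].isdigit():
                rows.append(cols)
    if rows:
        yield rows


def core_argument_counts(conllu_text):
    """Count (relation, 'VERB--CHILD') pairs for NOUN/PRON children of verbs."""
    counts = Counter()
    for rows in parse_sentences(conllu_text):
        upos_by_id = {cols[0]: cols[3] for cols in rows}
        for cols in rows:
            child_upos, head, deprel = cols[3], cols[6], cols[7]
            if (deprel in CORE_RELATIONS
                    and child_upos in {"NOUN", "PRON"}
                    and upos_by_id.get(head) == "VERB"):
                counts[(deprel, "VERB--" + child_upos)] += 1
    return counts


if __name__ == "__main__":
    for key, value in sorted(core_argument_counts(SAMPLE).items()):
        print(key, value)
```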
Relations Overview
- This corpus uses 7 relation subtypes: compound:svc, compound:vv, discourse:sp, nsubj:outer, obl:arg, obl:lmod, obl:tmod
- The following 5 relation types are not used in this corpus at all: expl, dislocated, list, orphan, goeswith
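Both figures in this overview can be re-derived from the relation inventory listed under Relations above. The sketch below copies that inventory, treats any label containing a colon as a subtype, and compares the remaining base labels with the 37 universal relations of UD v2. The variable names are illustrative.

```python
# Relation labels used in this corpus, copied from the Relations list above.
USED = """acl advcl advmod amod appos aux case cc ccomp clf compound
compound:svc compound:vv conj cop csubj dep det discourse discourse:sp
fixed flat iobj mark nmod nsubj nsubj:outer nummod obj obl obl:arg
obl:lmod obl:tmod parataxis punct reparandum root vocative xcomp""".split()

# The 37 universal dependency relations of UD v2.
UNIVERSAL = """acl advcl advmod amod appos aux case cc ccomp clf compound
conj cop csubj dep det discourse dislocated expl fixed flat goeswith iobj
list mark nmod nsubj nummod obj obl orphan parataxis punct reparandum root
vocative xcomp""".split()

# Subtypes are labels of the form base:subtype.
subtypes = sorted(label for label in USED if ":" in label)

# Universal relations whose base label never occurs in the corpus.
used_bases = {label.split(":")[0] for label in USED}
unused = sorted(set(UNIVERSAL) - used_bases)

if __name__ == "__main__":
    print(len(subtypes), subtypes)
    print(len(unused), unused)
```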