UD Korean KSL
Language: Korean (code: ko)
Family: Korean
This treebank has been part of Universal Dependencies since the UD v2.14 release.
The following people have contributed to making this treebank part of UD: Hakyung Sung, Gyu-Ho Shin.
Repository: UD_Korean-KSL
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: learner-essays
Questions, comments? General annotation questions (either Korean-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [hsung (æt) uoregon • edu]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | annotated manually |
UPOS | annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion |
XPOS | annotated manually |
Features | annotated manually, natively in UD style |
Relations | annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion |
Description
UD_Korean-KSL is a dependency treebank of L2 Korean. It provides manual morpheme-level and Universal Dependencies annotations for 600 randomly sampled texts from the Kyung Hee Korean Learner Corpus (which is no longer available).
- Language-specific morpheme tags (XPOS) are based on the Sejong tag set and were manually annotated.
- Dependency annotations adhere to the Universal Dependencies (version 2.0) framework. They were first assigned automatically by Stanza and then manually corrected.
- Universal part of speech (UPOS) tags were added automatically using Stanza, trained on the UD_Korean-GSD dataset, and were then corrected.
- The current version contains a total of 7,530 sentences: 6,024 in the training set, 753 in the test set, and 753 in the development set. The data also includes details on classroom proficiency levels (ranging from A1 to C2, serving as a proxy for learner proficiency).
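As a quick sanity check on the split sizes reported above, the following sketch sums the three splits and computes each split's share of the total. The dictionary keys are illustrative labels only, not file names from the repository.

```python
# Split sizes as reported in the description above.
splits = {"train": 6024, "dev": 753, "test": 753}

total = sum(splits.values())

# Percentage of all sentences that falls in each split.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)   # 7530
print(shares)  # {'train': 80.0, 'dev': 10.0, 'test': 10.0}
```

The figures are consistent with an 80/10/10 split of the 7,530 sentences.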
References
- Sung, H., & Shin, G-H. (2023). Towards L2-friendly pipelines for learner corpora: A case of written production by L2-Korean learners. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 72-82, Association for Computational Linguistics.
- Sung, H., & Shin, G-H. (2024). Constructing a Dependency Treebank for Second Language Learners of Korean. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 3747-3758).
Acknowledgments
Statistics of UD Korean KSL
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – NOUN – NUM – PART – PRON – PROPN – PUNCT – SYM – VERB – X
Features
Relations
acl – advcl – advmod – amod – appos – aux – case – cc – ccomp – compound – conj – cop – csubj – dep – det – discourse – dislocated – fixed – flat – goeswith – list – mark – nmod – nmod:poss – nsubj – nummod – obj – obl – parataxis – punct – root – vocative
Tokenization and Word Segmentation
- This corpus contains 7530 sentences and 66989 tokens.
- This corpus contains 8410 tokens (13%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 19 types of words that contain both letters and punctuation. Examples: 10일-17일에, K-POP을, T-EXPRESS, 남주인공-카일, 성인-199,000원, 아동-174,000원, 여주인공-사라, 용산-목포-홍도-흑산도까지, 용산-목포-홍도-흑산도다, 용산-목포-홍도-흑산도와, 용산-목포-홍도-흑산도이라는, 용산-목포-홍도-흑산도입니다, 용산-목표-홍도-족산도, 용산-목표-홍도-흑단도이다, 용산-목표-홍도-흑산도, 용산-목표-홍도-흑산도이다, 있., 청.훙의, 초.한으로
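Counts of this kind can be approximated directly from CoNLL-U data. The sketch below is a minimal illustration, not the script used to generate this page: it assumes that tokens not followed by a space carry `SpaceAfter=No` in the MISC column, and it treats "letters and punctuation" as "at least one alphabetic character plus at least one character in a Unicode punctuation category". The sample sentence is invented for the example and is not taken from the treebank.

```python
import unicodedata

# Invented sample in CoNLL-U format (10 tab-separated columns); not from the treebank.
SAMPLE = "\n".join([
    "# text = 저는 K-POP을 좋아합니다.",
    "1\t저는\t_\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tK-POP을\t_\tNOUN\t_\t_\t3\tobj\t_\t_",
    "3\t좋아합니다\t_\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t_\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])


def token_rows(conllu_text):
    """Return the column lists of all basic token lines (integer IDs only)."""
    rows = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            rows.append(cols)  # ignores multiword ranges (1-2) and empty nodes (1.1)
    return rows


def has_letter_and_punct(form):
    """Assumed definition: at least one letter and one punctuation character."""
    has_letter = any(ch.isalpha() for ch in form)
    has_punct = any(unicodedata.category(ch).startswith("P") for ch in form)
    return has_letter and has_punct


rows = token_rows(SAMPLE)
no_space = sum("SpaceAfter=No" in cols[9].split("|") for cols in rows)
mixed = sorted({cols[1] for cols in rows if has_letter_and_punct(cols[1])})

print(len(rows), no_space, mixed)
```

To run this on the real data, `SAMPLE` would be replaced by the contents of one of the treebank's `.conllu` files.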
Morphology
Tags
- This corpus uses 15 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SYM, VERB, X
- This corpus does not use the following tags: SCONJ, INTJ
- This corpus contains 1 word type tagged as a particle (PART): 에는
- This corpus contains 95 lemmas tagged as pronouns (PRON): 거기+는, 그, 그+는, 그+들+은, 그+들+의, 그+들+이, 그+를, 그+의, 그거+도, 그것, 그것+ㄴ, 그것+도, 그것+들+도, 그것+은, 그것+을, 그것+이, 그녀+가, 그녀+는, 그녀+를, 그녀+의, 그때+의, 나, 나+ㄴ, 나+는, 나+도, 나+를, 나+의, 남+이, 내+가, 너, 너+ㄴ, 너+는, 너+도, 너+의, 네+가, 누구+가, 누구+나, 누구+도, 누구+를, 누구+이+ㄴ가, 니+는, 다+들, 당신+은, 둘째+는, 무엇+을, 무엇+이, 뭐+가, 비, 어디, 어디+가, 여기, 여기+는, 여러분, 여러분+들+은, 영, 우리, 우리+가, 우리+는, 우리+도, 우리+들+이, 우리+를, 우리+만+의, 우리+의, 이, 이+는, 이+를, 이거, 이것, 이것+도, 이것+들+은, 이것+들+을, 이것+은, 이것+을, 이것+이, 자+기, 자기, 자기+가, 자기+도, 자기+를, 자기+만, 자기+의, 자신+들+의, 자신+을, 자신+의, 자신+이, 저, 저+는, 저+도, 저+랑, 저+를, 저+와, 저+의, 저희, 저희+는, 제+가
- This corpus contains 33 lemmas tagged as determiners (DET): 각, 그, 그떤, 그러+ㄹ, 그런, 그런+한, 너+ㄴ, 몇, 모든, 무슨, 아무, 어누, 어느, 어던, 어떤, 어러, 어쩌+ㄹ, 여러, 예기, 오+ㄴ, 이, 이+들, 이러하+ㄴ, 이런, 이런+저런, 이런+하+ㄴ, 이럼, 이렇, 이번, 일, 저, 저런, 한
- Out of the above, 4 lemmas occurred sometimes as PRON and sometimes as DET: 그, 너+ㄴ, 이, 저
- This corpus contains 5 lemmas tagged as auxiliaries (AUX): 싶, 않, 이, 있, 하
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: 있
- This corpus does not use the VerbForm feature.
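The overlap between PRON and DET lemmas reported above can be reproduced by grouping UPOS tags by lemma. The sketch below uses a handful of (lemma, UPOS) pairs taken from the lists above rather than the full treebank. It also shows that the `+` in a lemma separates morphemes, so a lemma can be split into its parts with a plain string split.

```python
from collections import defaultdict

# A few (lemma, UPOS) pairs copied from the lists above; not the full inventory.
pairs = [
    ("그", "PRON"), ("그", "DET"),
    ("이", "PRON"), ("이", "DET"),
    ("나+는", "PRON"),
    ("모든", "DET"),
]

# Collect the set of UPOS tags observed for each lemma.
tags_by_lemma = defaultdict(set)
for lemma, upos in pairs:
    tags_by_lemma[lemma].add(upos)

# Lemmas that were seen with both tags.
ambiguous = sorted(
    lemma for lemma, tags in tags_by_lemma.items() if {"PRON", "DET"} <= tags
)
print(ambiguous)  # ['그', '이']

# Morphemes inside a lemma are joined with '+'.
morphemes = "나+는".split("+")
print(morphemes)  # ['나', '는']
```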
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): 이.
- This corpus uses 5 lemmas as auxiliaries (aux). Examples: 싶, 하, 있, 않, 이.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (5107)
  - VERB--NOUN-ADP(가) (1)
  - VERB--NOUN-ADP(는) (4)
  - VERB--NOUN-ADP(도) (7)
  - VERB--NOUN-ADP(들+이) (1)
  - VERB--NOUN-ADP(등) (3)
  - VERB--NOUN-ADP(등+뿐+만) (1)
  - VERB--NOUN-ADP(등+은) (1)
  - VERB--NOUN-ADP(등+의) (1)
  - VERB--NOUN-ADP(따위+도) (1)
  - VERB--NOUN-ADP(만) (1)
  - VERB--NOUN-ADP(밖에) (2)
  - VERB--NOUN-ADP(반+쯤) (1)
  - VERB--NOUN-ADP(뿐+만) (7)
  - VERB--NOUN-ADP(와) (1)
  - VERB--NOUN-ADP(은) (1)
  - VERB--NOUN-ADP(을) (1)
  - VERB--NOUN-ADP(이) (1)
  - VERB--NOUN-ADP(쫌) (1)
  - VERB--NOUN-ADP(쯤) (2)
  - VERB--PRON (1052)
  - VERB--PRON-ADP(뿐+만) (3)
- obj
  - VERB--NOUN (4975)
  - VERB--NOUN-ADP(대신+에) (1)
  - VERB--NOUN-ADP(도) (4)
  - VERB--NOUN-ADP(등) (3)
  - VERB--NOUN-ADP(등+을) (4)
  - VERB--NOUN-ADP(라는) (1)
  - VERB--NOUN-ADP(로) (1)
  - VERB--NOUN-ADP(를) (4)
  - VERB--NOUN-ADP(만) (1)
  - VERB--NOUN-ADP(을) (4)
  - VERB--NOUN-ADP(정도+로) (1)
  - VERB--PRON (48)
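The bare `VERB--NOUN` and `VERB--PRON` counts for `nsubj` and `obj` can be approximated with a small pass over CoNLL-U data. This is a minimal sketch under stated assumptions, not the script that produced the figures above: the function name `count_core_args` is hypothetical, the sample sentence is invented, and the `-ADP(...)` variants (children that carry a separate adposition dependent) are not handled.

```python
from collections import Counter

# Invented sample in CoNLL-U format; not taken from the treebank.
SAMPLE = "\n".join([
    "# text = 저는 한국어를 배웁니다.",
    "1\t저는\t저+는\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\t한국어를\t한국어+를\tNOUN\t_\t_\t3\tobj\t_\t_",
    "3\t배웁니다\t배우+ㅂ니다\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])


def count_core_args(conllu_text):
    """Count nsubj/obj relations with a VERB parent and a NOUN or PRON child."""
    counts = Counter()
    sentence = {}  # token ID (string) -> column list for the current sentence

    def flush():
        for cols in sentence.values():
            head = sentence.get(cols[6])  # None for the root (HEAD = 0)
            if head is None:
                continue
            if (cols[7] in ("nsubj", "obj")
                    and head[3] == "VERB"
                    and cols[3] in ("NOUN", "PRON")):
                counts[(cols[7], f"{head[3]}--{cols[3]}")] += 1
        sentence.clear()

    for line in conllu_text.splitlines():
        if not line.strip():
            flush()  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if cols[0].isdigit():  # skip multiword ranges and empty nodes
            sentence[cols[0]] = cols
    flush()  # handle input without a trailing blank line
    return counts


result = count_core_args(SAMPLE)
for (deprel, pattern), n in sorted(result.items()):
    print(deprel, pattern, n)
```

On the sample, this reports one `nsubj` with pattern `VERB--PRON` and one `obj` with pattern `VERB--NOUN`.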