UD Telugu English TECT
Language: Telugu English (code: qte)
Family: Code switching
This treebank has been part of Universal Dependencies since the UD v2.14 release.
The following people have contributed to making this treebank part of UD: Anishka Vissamsetty.
Repository: UD_Telugu_English-TECT
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: spoken
Questions, comments? General annotation questions (either Telugu English-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [anishka18v (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | assigned by a program, not checked manually |
UPOS | annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion |
XPOS | not available |
Features | not available |
Relations | assigned by a program, not checked manually |
Description
UD Telugu_English-TECT is a Telugu-English code-switching treebank.
The treebank consists of edited data from three sources: the Telugu UD treebank (Rama and Vajjala, 2021), sentences from a grammar book, and the MASSIVE dataset of spoken conversational utterances in Telugu (FitzGerald et al., 2022; Bastianelli et al., 2020). Sentences were randomly selected from each corpus, romanized, and altered so that each contains at least one code-switch. They were then annotated following the Universal Dependencies annotation scheme.
Acknowledgments
We thank the creators of the Telugu UD treebank and the MASSIVE dataset for their corpora.
References
- Rama, Taraka and Vajjala, Sowmya (2021). The Telugu UD treebank.
@misc{UD_Telugu-MTG,
year = {2021},
title = {The Telugu UD treebank},
author = {Rama, Taraka and Vajjala, Sowmya},
url= {https://github.com/UniversalDependencies/UD_Telugu-MTG}
}
- FitzGerald, J., Hench, C., Peris, C., et al. (2022). MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages. arXiv preprint arXiv:2204.08582.
@misc{fitzgerald2022massive,
title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
year={2022},
eprint={2204.08582},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
- Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser. (2020). SLURP: A Spoken Language Understanding Resource Package. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7252–7262, Online. Association for Computational Linguistics.
@inproceedings{bastianelli-etal-2020-slurp,
title = "{SLURP}: A Spoken Language Understanding Resource Package",
author = "Bastianelli, Emanuele and
Vanzo, Andrea and
Swietojanski, Pawel and
Rieser, Verena",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.emnlp-main.588",
doi = "10.18653/v1/2020.emnlp-main.588",
pages = "7252--7262",
abstract = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp."
}
Statistics of UD Telugu English TECT
POS Tags
ADJ – ADP – ADV – DET – NOUN – NUM – PRON – PROPN – PUNCT – VERB
Features
Relations
acl – advcl – advmod – amod – case – ccomp – compound – dep – det – iobj – nmod – nsubj – nummod – obj – obl – punct – root – xcomp
Tokenization and Word Segmentation
- This corpus contains 97 sentences and 456 tokens.
- All tokens in this corpus are followed by a space.
- This corpus does not contain words with spaces.
- This corpus does not contain words that contain both letters and punctuation.
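Counts like these can be recomputed from the treebank's CoNLL-U file. The following is a minimal sketch, not the script that produced the figures above. It assumes the standard 10-column CoNLL-U layout; the function name `corpus_counts` and the sample sentence are hypothetical, and the sentence is invented for illustration rather than taken from the treebank.

```python
def corpus_counts(conllu_text):
    """Return (sentences, words, words_without_following_space)."""
    sentences = 0
    words = 0
    no_space_after = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False          # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue                     # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue                     # not a well-formed token line
        if not in_sentence:
            sentences += 1
            in_sentence = True
        if not cols[0].isdigit():
            continue                     # skip multiword ranges (1-2) and empty nodes (1.1)
        words += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space_after += 1
    return sentences, words, no_space_after


# Invented example sentence (an assumption, not corpus data).
ROWS = [
    ["1", "nenu", "_", "PRON", "_", "_", "3", "nsubj", "_", "_"],
    ["2", "movie", "_", "NOUN", "_", "_", "3", "obj", "_", "_"],
    ["3", "chusanu", "_", "VERB", "_", "_", "0", "root", "_", "_"],
    [".", ".", "_", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
ROWS[3][0] = "4"  # token ID of the final punctuation mark
SAMPLE = "# text = nenu movie chusanu .\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

print(corpus_counts(SAMPLE))
```

A third value of zero corresponds to the statement that every token is followed by a space.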
Morphology
Tags
- This corpus uses 10 UPOS tags out of 17 possible: ADJ, ADP, ADV, DET, NOUN, NUM, PRON, PROPN, PUNCT, VERB
- This corpus does not use the following tags: AUX, SCONJ, CCONJ, PART, INTJ, SYM, X
- This corpus contains 1 lemma tagged as a pronoun (PRON): _
- This corpus contains 1 lemma tagged as a determiner (DET): _
- Out of the above, 1 lemma occurred sometimes as PRON and sometimes as DET: _
- This corpus contains 0 lemmas tagged as auxiliaries (AUX).
- This corpus does not use the VerbForm feature.
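The used and unused tag lists above can be derived by comparing the UPOS column of the CoNLL-U file against the 17 tags of the UD inventory. A minimal sketch, assuming the standard 10-column layout; `upos_usage` and the sample sentence are hypothetical names and data introduced only for illustration.

```python
# The 17 universal part-of-speech tags defined by UD.
UPOS_ALL = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def upos_usage(conllu_text):
    """Return (sorted used tags, sorted unused tags) for CoNLL-U text."""
    used = set()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Only regular word lines carry a UPOS value worth counting.
        if len(cols) == 10 and cols[0].isdigit():
            used.add(cols[3])
    return sorted(used), sorted(UPOS_ALL - used)


# Invented example sentence (an assumption, not corpus data).
UPOS_ROWS = [
    ["1", "nenu", "_", "PRON", "_", "_", "3", "nsubj", "_", "_"],
    ["2", "movie", "_", "NOUN", "_", "_", "3", "obj", "_", "_"],
    ["3", "chusanu", "_", "VERB", "_", "_", "0", "root", "_", "_"],
]
UPOS_SAMPLE = "\n".join("\t".join(r) for r in UPOS_ROWS) + "\n\n"

used_tags, unused_tags = upos_usage(UPOS_SAMPLE)
print(used_tags)
print(len(unused_tags))
```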
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
- Card
- NUM: five, ten
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus does not contain copulas.
- This corpus does not contain auxiliaries.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (15)
- VERB--NOUN-ADP(_) (2)
- VERB--PRON (41)
- obj
- VERB--NOUN (45)
- VERB--PRON (7)
- iobj
- VERB--PRON (3)
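Counts of this kind pair each dependent with the UPOS tag of its head. The sketch below shows one way to tally them, assuming the standard 10-column CoNLL-U layout. `core_argument_counts` and the sample sentence are hypothetical, and the sketch does not separate out the adposition-marked pattern (VERB--NOUN-ADP), so such cases would be counted under plain VERB--NOUN.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}


def core_argument_counts(conllu_text):
    """Count (relation, head UPOS, child UPOS) for VERB heads with NOUN/PRON children."""
    counts = Counter()
    sentence = []
    # The extra empty string forces a flush of the final sentence.
    for line in conllu_text.splitlines() + [""]:
        if line.strip():
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            if len(cols) == 10 and cols[0].isdigit():
                sentence.append(cols)
            continue
        upos_by_id = {cols[0]: cols[3] for cols in sentence}
        for cols in sentence:
            relation = cols[7].split(":")[0]  # drop any subtype, e.g. nsubj:pass
            head_upos = upos_by_id.get(cols[6])
            if (relation in CORE_RELATIONS
                    and cols[3] in ("NOUN", "PRON")
                    and head_upos == "VERB"):
                counts[(relation, head_upos, cols[3])] += 1
        sentence = []
    return counts


# Invented example sentence (an assumption, not corpus data).
ARG_ROWS = [
    ["1", "nenu", "_", "PRON", "_", "_", "3", "nsubj", "_", "_"],
    ["2", "movie", "_", "NOUN", "_", "_", "3", "obj", "_", "_"],
    ["3", "chusanu", "_", "VERB", "_", "_", "0", "root", "_", "_"],
]
ARG_SAMPLE = "\n".join("\t".join(r) for r in ARG_ROWS) + "\n\n"

for key, n in sorted(core_argument_counts(ARG_SAMPLE).items()):
    print("--".join(key[1:]), key[0], n)
```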