UD for Odia
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Exceptions are described below.
- According to typographical rules, some punctuation marks (e.g., the comma) are attached to a neighboring word, while others (e.g., the sentence-terminating danda) are not. In either case, we tokenize punctuation marks as separate tokens (words).
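The two rules above can be illustrated with a minimal sketch. It is not the tokenizer used for any Odia treebank; it assumes that "punctuation" means any character whose Unicode general category starts with `P` (this covers the comma and the danda `।`), and that only punctuation at the edges of a whitespace-delimited chunk is split off. The function names `is_punct` and `tokenize` and the example sentence are hypothetical and serve only as illustration.

```python
import unicodedata


def is_punct(ch: str) -> bool:
    # Assumption: treat every Unicode punctuation category (Pc, Pd, Po, ...) as punctuation.
    return unicodedata.category(ch).startswith("P")


def tokenize(text: str) -> list[str]:
    """Split on whitespace, then detach leading and trailing punctuation marks."""
    tokens = []
    for chunk in text.split():
        start, end = 0, len(chunk)
        leading, trailing = [], []
        # Peel punctuation off the left edge of the chunk.
        while start < end and is_punct(chunk[start]):
            leading.append(chunk[start])
            start += 1
        # Peel punctuation off the right edge of the chunk.
        while end > start and is_punct(chunk[end - 1]):
            trailing.append(chunk[end - 1])
            end -= 1
        tokens.extend(leading)
        if start < end:
            tokens.append(chunk[start:end])
        # Trailing marks were collected right to left; restore text order.
        tokens.extend(reversed(trailing))
    return tokens


if __name__ == "__main__":
    # The comma is attached to the preceding word; the danda is already separate.
    print(tokenize("ସେ ଆସିଲା, ମୁଁ ଗଲି ।"))
```

Punctuation inside a chunk (for example a word-internal hyphen) is deliberately left untouched in this sketch, since the text above does not say how such cases are handled.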
Morphology
Tags
- Odia uses all 17 universal POS categories, including particles (PART).
- Odia has the following auxiliary verbs (AUX):
Instruction: Specify any unused tags. Explain what words are tagged as PART. Describe how the AUX-VERB and DET-PRON distinctions are drawn, and specify whether there are (de)verbal forms tagged as ADJ, ADV or NOUN. Include links to language-specific tag definitions if any.
Features
*
Instruction: Describe inherent and inflectional features for major word classes (at least NOUN and VERB). Describe other noteworthy features. Include links to language-specific feature definitions if any.
Syntax
*
Instruction: Give criteria for identifying core arguments (subjects and objects), and describe the range of copula constructions in nonverbal clauses. List all subtype relations used. Include links to language-specific relations definitions if any.
Treebanks
There are N Odia UD treebanks:
Instruction: Treebank-specific pages are generated automatically from the README file in the treebank repository and from the data in the latest release. Link to the respective *-index.html page in the treebanks folder, using the language code and the treebank code in the file name.