UD for Zaar
Tokenization and Word Segmentation
Since the dependencies presented in the Universal Dependencies framework are based on a lexical approach to syntax, the first step of the processing chain is to decide how to tokenize the language. The idea is to break the sentence down into tokens in order to extract the syntactic information related to words in the discourse chain.
- The Zaar treebank is an extension of an oral corpus (https://cortypo.huma-num.fr/index.html) interlinearized and glossed on a morphological basis.
- Tokenization had to take into account the fact that syntactic information in Zaar can be distributed in different ways across words, affixes and clitics. We decided to keep as tokens only words (with or without affixes) and clitics, while the syntactic information contained in affixes is annotated through morphological features on the affixed words. Clitics are PRON conveying syntactic functions such as complement and modifier. They are preceded by an “=” sign in the transcription.
- As we are dealing with oral data, we have chosen the illocutionary unit as the basic transcription unit. Punctuation tokens (e.g. <, >, //, etc.) organise the illocutionary unit into: pre-nucleus < nucleus > post-nucleus //
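As a rough illustration of the conventions above, the following Python sketch splits a whitespace-separated transcription into tokens, detaches clitics at the “=” sign, and groups the tokens into pre-nucleus, nucleus and post-nucleus. The function names, the whitespace-separated input, and the assumption of at most one “<” and one “>” per unit are illustrative only and do not describe the treebank's actual pipeline. The sample words are placeholders, not Zaar.

```python
def tokenize(unit: str) -> list[str]:
    """Split a transcribed unit into tokens, detaching '='-marked clitics."""
    tokens = []
    for chunk in unit.split():
        # "host=clitic" becomes ["host", "=clitic"]; the "=" stays on the clitic.
        host, *clitics = chunk.split("=")
        if host:
            tokens.append(host)
        tokens.extend("=" + clitic for clitic in clitics if clitic)
    return tokens


def segment(tokens: list[str]) -> dict[str, list[str]]:
    """Group tokens into pre-nucleus, nucleus and post-nucleus.

    Assumption: "<" closes the pre-nucleus, ">" closes the nucleus,
    and "//" closes the illocutionary unit.
    """
    parts = {"pre_nucleus": [], "nucleus": [], "post_nucleus": []}
    buffer = []
    nucleus_closed = False
    for tok in tokens:
        if tok == "//":
            break
        if tok == "<":
            parts["pre_nucleus"] = buffer
            buffer = []
        elif tok == ">":
            parts["nucleus"] = buffer
            buffer = []
            nucleus_closed = True
        else:
            buffer.append(tok)
    # Leftover tokens follow ">" if it was seen; otherwise they are the nucleus.
    parts["post_nucleus" if nucleus_closed else "nucleus"] = buffer
    return parts


if __name__ == "__main__":
    print(segment(tokenize("w1 < w2 w3=c1 > w4 //")))
```

A unit without any “<” or “>” marker is treated as consisting of a nucleus only.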
Morphology
This is an overview only. For more detailed discussion and examples, see the list of Zaar POS tags and Zaar features.
Tags
- The language-specific tagset is the original annotation, based on the extended version of the Leipzig Glossing Rules (available here).
- The UD tagset is based on a conversion from the previous annotation to UPOS.
- Zaar uses 16 of the universal tags (with the exception of SYM, which is not relevant for oral data).
- As in other African languages (e.g. Hausa, Wolof), the verbal inflections in Zaar are gathered in a single AUX that precedes the VERB and expresses various combinations of Tense (4 values), Aspect (7 values) and Mood (4 values). This relatively small treebank already shows 23 combinations, resulting in 23 different AUX. The following auxiliaries are recognized in Zaar:
- àː for perfect (aspect)
- á for aorist (aspect)
- àːnáː for recent past perfect (tense + aspect)
- àːtá for remote past perfect (tense + aspect)
- àːyi for perfect iterative (aspect)
- àːyí for immediate past perfect (tense + aspect)
- àːyiká for perfect progressive (aspect)
- ánáː for recent past (tense)
- ánáːyáː for recent past imperfect (tense + aspect)
- ánáːyi for recent past iterative (tense + aspect)
- átâ for remote past (tense)
- átâyáː for remote past imperfect (tense + aspect)
- átâyi for remote past iterative (tense + aspect)
- átáyiká for remote past progressive (tense + aspect)
- áyǎː for immediate past imperfect (tense + aspect)
- áyí for immediate past (tense)
- á̙yyiká for immediate past progressive (tense + aspect)
- ʧáː for imperfect (aspect)
- ʧáːnaː for concomitant (aspect)
- ʧáːyi for imperfect iterative (aspect)
- ʧáyiká for imperfect progressive (aspect)
- tə̀ for subjunctive (mood)
- ʧiká for progressive (aspect)
- ʧínaː for recent past irrealis (tense + mood)
- ʧíta for remote past irrealis (tense + mood)
- wò for future (tense)
- wòyi for future iterative (tense + aspect)
- wòyiká for future continuous (tense + aspect)
- yáː for conditional (mood)
- yí for irrealis (mood)
- yiː for iterative (aspect)
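To make the link between an auxiliary and its features concrete, here is a small sketch that pairs a few of the auxiliaries listed above with Tense, Aspect and Mood values and renders them as a CoNLL-U FEATS string. Rec and Aor are language-specific values of this treebank; Perf, Fut and Sub are standard UD values. The pairing of each gloss with a particular value, and the names `AUX_FEATURES` and `feats_string`, are assumptions made for illustration. The treebank itself remains the authority.

```python
import unicodedata


def _nfc(text: str) -> str:
    """Normalize to NFC so composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical mapping for a handful of the auxiliaries listed above.
_RAW_AUX_FEATURES = {
    "àː": {"Aspect": "Perf"},                      # perfect
    "á": {"Aspect": "Aor"},                        # aorist
    "àːnáː": {"Tense": "Rec", "Aspect": "Perf"},   # recent past perfect
    "wò": {"Tense": "Fut"},                        # future
    "tə̀": {"Mood": "Sub"},                         # subjunctive
}
AUX_FEATURES = {_nfc(form): feats for form, feats in _RAW_AUX_FEATURES.items()}


def feats_string(form: str) -> str:
    """Return the FEATS column for an auxiliary, or "_" if it is not in the table."""
    feats = AUX_FEATURES.get(_nfc(form))
    if not feats:
        return "_"
    # CoNLL-U lists features alphabetically by name, separated by "|".
    return "|".join(f"{name}={value}" for name, value in sorted(feats.items()))


if __name__ == "__main__":
    for aux in AUX_FEATURES:
        print(aux, feats_string(aux))
```

Normalizing both the table keys and the lookup form avoids silent mismatches between visually identical strings that differ only in how their tone diacritics are encoded.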
Features
- The Zaar treebank uses 34 universal features.
- 8 language-specific values associated with the Zaar AUX have been added to the scheme:
  - 3 for the Tense feature (Imm = Immediate Past; Rec = Recent Past; Rem = Remote Past),
  - 3 for the Aspect feature (Aor = Aorist; Conc = Concomitant; ImpIter = Iterative Imperfect),
  - 2 other Aspect features, used independently of the AUX upos, were added:
    - Inchoative (Aspect=Inch)
    - Resultative (Aspect=Res)
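When processing the FEATS column, it can be useful to spot the language-specific values. The sketch below parses a FEATS string in the usual CoNLL-U `Name=Value|Name=Value` form and returns the pairs that belong to the eight values listed above. The function and constant names are hypothetical, and the example strings are constructed for illustration rather than taken from the treebank.

```python
# The eight language-specific values described above, grouped by feature.
LANGUAGE_SPECIFIC = {
    "Tense": {"Imm", "Rec", "Rem"},
    "Aspect": {"Aor", "Conc", "ImpIter", "Inch", "Res"},
}


def parse_feats(feats: str) -> dict[str, list[str]]:
    """Parse a CoNLL-U FEATS string into {name: [values]}; "_" means no features."""
    if feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        # A feature may carry several comma-separated values.
        parsed[name] = values.split(",")
    return parsed


def language_specific(feats: str) -> list[tuple[str, str]]:
    """Return the (name, value) pairs that are language-specific additions."""
    found = []
    for name, values in parse_feats(feats).items():
        for value in values:
            if value in LANGUAGE_SPECIFIC.get(name, set()):
                found.append((name, value))
    return found


if __name__ == "__main__":
    print(language_specific("Aspect=Aor|Number=Sing|Person=3"))
    print(language_specific("Aspect=Perf|Tense=Rec"))
```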
Syntax
- The dependency analysis is a conversion of the manual annotation to SUD format. For more information, see SUD guidelines.
- Zaar is mostly an SVO language. The only exception is found in the progressive Aspect, where the direct object precedes the nominalized verb (a Vnoun).
- Zaar is a pro-drop language with a high proportion of dislocated subjects and complements. In addition to a possible independent lexical or pronominal subject (tagged nsubj), the AUX contains agreement features for Person and Number.
- Direct objects are annotated with obj, indirect objects with iobj.
Treebanks
There is 1 Zaar UD treebank: