
This page pertains to UD version 2.

UD South Levantine Arabic MADAR

Language: South Levantine Arabic (code: ajp)
Family: Afro-Asiatic

This treebank has been part of Universal Dependencies since the UD v2.7 release.

The following people have contributed to making this treebank part of UD: Shorouq Zahra.

Repository: UD_South_Levantine_Arabic-MADAR
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15

License: CC BY-SA 4.0

Genre: spoken, social

Questions, comments? General annotation questions (either South Levantine Arabic-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [shorouqjzahra (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.

Annotation Source
Lemmas annotated manually
UPOS annotated manually, natively in UD style
XPOS not available
Features not available
Relations annotated manually, natively in UD style
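
As a rough illustration of what the annotation status above means in practice, the sketch below reads CoNLL-U text (the standard ten-column UD format) and counts the values found in a given column. Columns that are not annotated, here XPOS and Features, hold the underscore placeholder. The sample sentence is an invented placeholder and is not taken from the treebank; the names read_conllu and column_counts are hypothetical helpers, not part of any UD tool.

```python
from collections import Counter

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Invented placeholder sentence (NOT treebank data). XPOS and FEATS are "_"
# because the page lists them as not available.
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = word1 word2 .\n"
    "1\tword1\tlemma1\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tword2\tlemma2\tNOUN\t_\t_\t1\tobj\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as sent_id or text
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, cols))
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def column_counts(sentences, field):
    """Count how often each value occurs in one column."""
    return Counter(tok[field] for sent in sentences for tok in sent)


sentences = read_conllu(SAMPLE)
upos_counts = column_counts(sentences, "upos")
xpos_counts = column_counts(sentences, "xpos")
print(upos_counts)
print(xpos_counts)
```

To apply the same counting to the real data, pass the contents of a .conllu file from the repository to read_conllu instead of SAMPLE.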

Description

The South_Levantine_Arabic-MADAR treebank consists of 100 manually annotated sentences taken from the MADAR (Multi-Arabic Dialect Applications and Resources) project.

TO-DO: Add 20 annotated sentences from CCC as a train set.

The treebank contains 100 manually annotated sentences in the South Levantine dialect primarily spoken in Amman. The sentences were taken from the “MADAR Parallel Corpus Dataset” (Bouamor et al., 2018), which consists of parallel texts translated into 25 dialects spoken in 25 different cities in the Arab World. The original texts were taken from the Basic Traveling Expression Corpus (BTEC) (described in Takezawa et al., 2007).

Sentences in the treebank can best be described as short conversational tourism-related texts.

The treebank was created as part of the “Language Technology: Research and Development” course at Uppsala University. The accompanying report, “Parsing Low-Resource Levantine Arabic: Annotation Projection versus Small-Sized Annotated Data”, describes two methods for parsing low-resource Levantine Arabic using the treebank provided in this repository (split, for that work, into three sets: train, dev, and test).

Acknowledgments

Big thanks to Houda Bouamor, Nizar Habash, and the MADAR project team for creating the multi-dialect parallel corpus and allowing me to use the Amman portion of it prior to official release.

References

Statistics of UD South Levantine Arabic MADAR

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB – X

Features

Relations

acl – advcl – advmod – advmod:emph – amod – aux – case – cc – ccomp – conj – dep – det – discourse – flat:foreign – iobj – mark – nmod – nmod:poss – nsubj – nummod – obj – obl – obl:arg – parataxis – punct – root – xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Relations Overview