Introduction
Universal Dependencies (UD) is a project developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme evolved from (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines that facilitates consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.
The following parallel examples from English, Bulgarian, Czech and Swedish illustrate this: the main grammatical relations involving a passive verb, a nominal subject and an oblique agent are the same in all four languages, but their concrete grammatical realization varies.
# visual-style 4 2 nsubj:pass color:blue
# visual-style 4 7 obl color:blue
1 The the DET _ Definite=Def|PronType=Art 2 det _ _
2 dog dog NOUN _ Gender=Neut|Number=Sing 4 nsubj:pass _ _
3 was be AUX _ Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin 4 aux:pass _ _
4 chased chase VERB _ Tense=Past|VerbForm=Part 0 root _ _
5 by by ADP _ _ 7 case _ _
6 the the DET _ Definite=Def|PronType=Art 7 det _ _
7 cat cat NOUN _ Gender=Neut|Number=Sing 4 obl _ _
8 . . PUNCT _ _ 4 punct _ _
# visual-style 3 1 nsubj:pass color:blue
# visual-style 3 5 obl color:blue
1 Кучето куче NOUN _ Definite=Def|Gender=Neut|Number=Sing 3 nsubj:pass _ _
2 се се PRON _ Case=Acc|PronType=Prs|Reflex=Yes 3 expl:pass _ _
3 преследваше преследвам VERB _ Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root _ _
4 от от ADP _ _ 5 case _ _
5 котката котка NOUN _ Definite=Def|Gender=Fem|Number=Sing 3 obl _ _
6 . . PUNCT _ _ 3 punct _ _
# visual-style 3 1 nsubj:pass color:blue
# visual-style 3 4 obl color:blue
1 Pes pes NOUN _ Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing 3 nsubj:pass _ _
2 byl být AUX _ Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act 3 aux:pass _ _
3 honěn honit VERB _ Aspect=Imp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass 0 root _ _
4 kočkou kočka NOUN _ Case=Ins|Gender=Fem|Number=Sing 3 obl _ _
5 . . PUNCT _ _ 3 punct _ _
# visual-style 2 1 nsubj:pass color:blue
# visual-style 2 4 obl color:blue
1 Hunden hund NOUN _ Definite=Def 2 nsubj:pass _ _
2 jagades jaga VERB _ Tense=Past|Voice=Pass 0 root _ _
3 av av ADP _ _ 4 case _ _
4 katten katt NOUN _ Definite=Def 2 obl _ _
5 . . PUNCT _ _ 2 punct _ _
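The parallelism in the examples above can be checked mechanically. The following is a minimal sketch, not part of the UD tooling: it assumes the rows are whitespace-separated with the ten columns shown above (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and that every ID is a plain integer. The names `Token`, `parse_feats`, `parse_sentence` and `core_relations` are hypothetical helpers introduced only for this illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    upos: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Definite=Def|PronType=Art' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_sentence(text: str) -> List[Token]:
    """Parse one sentence, skipping blank lines and '#' comment lines.

    Assumption: columns are separated by whitespace and no FORM contains
    a space. Multiword-token ranges (e.g. '1-2') are not handled.
    """
    tokens = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tokens.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3],
                  parse_feats(cols[5]), int(cols[6]), cols[7])
        )
    return tokens


def core_relations(tokens: List[Token]) -> Dict[str, Tuple[str, str]]:
    """Map each nsubj:pass / obl dependent of the root to (dependent, root)."""
    root = next(t for t in tokens if t.head == 0)
    return {
        t.deprel: (t.form, root.form)
        for t in tokens
        if t.head == root.id and t.deprel in ("nsubj:pass", "obl")
    }


ENGLISH = """
1 The the DET _ Definite=Def|PronType=Art 2 det _ _
2 dog dog NOUN _ Gender=Neut|Number=Sing 4 nsubj:pass _ _
3 was be AUX _ Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin 4 aux:pass _ _
4 chased chase VERB _ Tense=Past|VerbForm=Part 0 root _ _
5 by by ADP _ _ 7 case _ _
6 the the DET _ Definite=Def|PronType=Art 7 det _ _
7 cat cat NOUN _ Gender=Neut|Number=Sing 4 obl _ _
8 . . PUNCT _ _ 4 punct _ _
"""

SWEDISH = """
1 Hunden hund NOUN _ Definite=Def 2 nsubj:pass _ _
2 jagades jaga VERB _ Tense=Past|Voice=Pass 0 root _ _
3 av av ADP _ _ 4 case _ _
4 katten katt NOUN _ Definite=Def 2 obl _ _
5 . . PUNCT _ _ 2 punct _ _
"""

print(core_relations(parse_sentence(ENGLISH)))
# {'nsubj:pass': ('dog', 'chased'), 'obl': ('cat', 'chased')}
print(core_relations(parse_sentence(SWEDISH)))
# {'nsubj:pass': ('Hunden', 'jagades'), 'obl': ('katten', 'jagades')}
```

Both sentences yield the same two relation labels attached to the root verb, even though English marks the passive with an auxiliary (`aux:pass`) and Swedish marks it on the verb itself (`Voice=Pass`). The root is located by `HEAD == 0` rather than by its relation label, so the sketch does not depend on how that label is spelled.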
What is needed for UD to be successful?
The key to understanding the design and current success of UD is to realize that the design is a subtle compromise among six competing requirements:
- UD needs to be satisfactory on linguistic analysis grounds for individual languages.
- UD needs to be good for linguistic typology, i.e., providing a suitable basis for bringing out parallelism across languages and language families.
- UD must be suitable for rapid, consistent annotation by a human annotator.
- UD must be easily comprehended and used by a non-linguist, whether a language learner or an engineer with prosaic needs for language processing. We refer to this as seeking a habitable design, and it leads us to favor traditional grammar notions and terminology.
- UD must be suitable for computer parsing with high accuracy.
- UD must provide good support for downstream language understanding tasks (relation extraction, reading comprehension, machine translation, …).
It’s easy to come up with a proposal that improves UD on one of these dimensions. The interesting and difficult part is to improve UD while remaining sensitive to all these dimensions.
Project organization
UD is an open collaboration with many project members. The administrative structure is kept to a minimum and currently consists of the following:
- The project is coordinated by Joakim Nivre (aka chief cat herder).
- Releases (including validation and documentation) are managed by Dan Zeman.
- Universal guidelines are managed by a small group of core members, currently consisting of Marie de Marneffe, Chris Manning, Lori Levin, Joakim Nivre, Nathan Schneider, Francis Tyers, Amir Zeldes and Dan Zeman.
- Language-specific guidelines and treebanks are maintained by each specific language team.
- Issues are raised on GitHub and resolved through discussion and voting among the core members.