home edit page issue tracker

This page still pertains to UD version 1.

Introduction

UD currently contains three treebanks for Russian:

Below we give a brief description of each treebank. More information can be found in the respective README files.

UD Russian

https://github.com/UniversalDependencies/UD_Russian/

UD Russian is based on the Russian Wikipedia data set originally annotated by the Google team led by Ryan McDonald. It consists of 4,429 sentences and 87,841 tokens. The syntactic annotation and annotation of morphological features is converted from the Google annotation scheme, while lemmatization follows the guidelines of the Russian National Corpus (Lyashevskaya, Plungian, Sichinava 2005) and is performed using MyStem-RNC tagger (Zobnin & Nosyrev 2015). All data were checked manually. Documentation is provided by Elena Shakurova, Nina Mustafina, and Olga Lyashevskaya.

Source of annotations

This table summarizes the origins and checking of the various columns of the CoNLL-U data.

Column Status
ID Sentence segmentation and tokenization was automatically done by Google. Additional changes (splitting and merging) were done manually during conversion to UD 1.3 and UD 2.0.
FORM  
LEMMA  
UPOSTAG  
XPOSTAG Converted automatically from FEATS and then hand-corrected.
FEATS  
HEAD  
DEPREL  
DEPS — (currently unused)
MISC Information about token spacing.

Acknowledgments

We wish to thank all of the contributors to the original Google annotation effort as well as reannotation scheme developers, including Vitaly Nikolaev, Ryan McDonald, Olga Lyashevskaya, Daniel Zeman, Kira Droganova, Svetlana Toldova, Andrey Kutuzov, Yana Kurmacheva, Maria Alexeeva, Tatiana Gavrilova, Nina Mustafina, and Elena Shakurova.

References

UD Russian-SynTagRus

readme

UD Russian-SynTagRus is based on the SynTagRus dependency treebank which is being developed by the Computational Linguistics Laboratory, A. A. Kharkevich Institute of Information Transmission Problems, Russian Academy of Sciences, located in Moscow.

Currently the treebank contains over 1,000,000 tokens (over 66,000 sentences) belonging to texts from a variety of genres (contemporary fiction, popular science, newspaper and journal articles dated between 1960 and 2016, texts of online news, etc.)

The treebank is so far the only human-corrected corpus of Russian supplied with comprehensive morphological annotation and syntactic annotation in the form of a complete dependency tree provided for every sentence. Additionally, the original version of SynTagRus contains other types of annotation, first of all lexical functional annotation in terms of lexical functions as defined in the Meaning-Text model.

SynTagRus is an integral but fully autonomous part of the Russian National Corpus developed in a nationwide research project and can be freely consulted on the Web: http://www.ruscorpora.ru/instruction-syntax.html

Source of annotations

This table summarizes the origins and checking of the various columns of the CoNLL-U data.

Column Status
ID  
FORM  
LEMMA  
UPOSTAG  
XPOSTAG  
FEATS  
HEAD  
DEPREL  
DEPS — (currently unused)
MISC Information about token spacing.

References

Acknowledgments

UD Russian-PUD

readme

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies (http://universaldependencies.org/conll17/). There are 1000 sentences in each language, always in the same order. (The sentence alignment is 1-1 but occasionally a sentence-level segment actually consists of two real sentences.) The sentences are taken from the news domain (sentence id starts in ‘n’) and from Wikipedia (sentence id starts with ‘w’). There are usually only a few sentences from each document, selected randomly, not necessarily adjacent. The digits on the second and third position in the sentence ids encode the original language of the sentence. The first 750 sentences are originally English (01). The remaining 250 sentences are originally German (02), French (03), Italian (04) or Spanish (05) and they were translated to other languages via English. Translation into German, French, Italian, Spanish, Arabic, Hindi, Chinese, Indonesian, Japanese, Korean, Portuguese, Russian, Thai and Turkish has been provided by DFKI and performed (except for German) by professional translators. Then the data has been annotated morphologically and syntactically by Google according to Google universal annotation guidelines; finally, it has been converted by members of the UD community to UD v2 guidelines.