
This page pertains to UD version 2.

This checklist describes the steps needed to add a new language or treebank to UD. It is meant for the maintenance task force rather than for individual treebank teams. Data contributors have a separate checklist.

How to add a language to UD

How to add a treebank to UD

# If you have the gh tool, run:
gh repo create UniversalDependencies/UD_Ancient_Greek-PROIEL --public --add-readme --team Contributors
git clone git@github.com:UniversalDependencies/UD_Ancient_Greek-PROIEL.git
cd UD_Ancient_Greek-PROIEL
copy ..\UD_ZZZ-Template\README.md .
copy ..\UD_ZZZ-Template\CONTRIBUTING.md .
copy ..\UD_ZZZ-Template\LICENSE.txt .
git add CONTRIBUTING.md LICENSE.txt

or

perl docs-automation\ghapi\ghapi.pl --create UD_Ancient_Greek-PROIEL
git commit -a -m "Initialization and the last commit to the master branch; switching to dev now."
git checkout -b dev
git push --all --set-upstream
perl docs-automation\ghapi\ghapi.pl --protect UD_Ancient_Greek-PROIEL

or

perl docs-automation\ghapi\ghapi.pl --finalize UD_Ancient_Greek-PROIEL
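
The copy commands and backslash paths above are written for the Windows command prompt. On a Unix-like shell, the gh variant might look as follows. This is only a sketch: the function name init_treebank_repo is hypothetical, and it assumes that UD_ZZZ-Template is cloned in the current directory and that gh and git are installed and authenticated.

```shell
# Sketch: create and populate a new treebank repository from a POSIX shell.
# init_treebank_repo is a hypothetical name, not part of the UD tooling.
#   $1 = repository name, e.g. UD_Ancient_Greek-PROIEL
init_treebank_repo() {
    repo=$1
    gh repo create "UniversalDependencies/$repo" --public --add-readme --team Contributors || return 1
    git clone "git@github.com:UniversalDependencies/$repo.git" || return 1
    # Like the commands above, this leaves the shell inside the new clone.
    cd "$repo" || return 1
    # Same three files as the copy commands above, with forward slashes.
    cp ../UD_ZZZ-Template/README.md ../UD_ZZZ-Template/CONTRIBUTING.md ../UD_ZZZ-Template/LICENSE.txt . || return 1
    git add CONTRIBUTING.md LICENSE.txt
}
```

Calling init_treebank_repo UD_Ancient_Greek-PROIEL would then stand in for the first block of commands; the commit, the dev branch and the branch protection still follow as described.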

How to rename a treebank in UD

Normally, the names of the treebank repositories should be stable because the infrastructure depends on them (as this section partially illustrates). However, between releases 2.1 and 2.2 we want to rename the repositories that have so far been named only by language (e.g., UD_Czech) so that all repository names also contain a treebank-specific suffix (e.g., UD_Czech-PDT, where PDT is the suffix). Renaming a repository involves at least the following steps:

  1. Go to the Settings tab of the website of the repository. Change the name (e.g. from “UD_Czech” to “UD_Czech-PDT”) and click the Rename button.
  2. Go to the server where the automatic validation and evaluation run (currently quest.ms.mff.cuni.cz, operated by Dan). Remove the old clone of the repository and the reports from validation and evaluation.
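    oldrepo=UD_Czech
    newrepo=UD_Czech-PDT
    # Remove the old clone and its validation and evaluation logs.
    rm -rf "$oldrepo"
    rm "log/$oldrepo.log"
    rm "log/$oldrepo.eval.log"
    # Drop the old repository's lines from the validation report (name followed by a colon).
    grep -v -P "^${oldrepo}:" validation-report.txt > /tmp/newreport.txt
    mv /tmp/newreport.txt validation-report.txt
    # mv replaced the file, so set its permissions and ACL again.
    chmod 666 validation-report.txt
    setfacl -m u:zeman:rw,u:www-data:rw validation-report.txt
    # Same for the evaluation report (name followed by a tab).
    grep -v -P "^${oldrepo}\t" evaluation-report.txt > /tmp/newreport.txt
    mv /tmp/newreport.txt evaluation-report.txt
    chmod 666 evaluation-report.txt
    setfacl -m u:zeman:rw,u:www-data:rw evaluation-report.txt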
  3. Call
    docs-automation/valdan/clone_one.sh $newrepo
    ./update-validation-report.pl $newrepo
  4. Go to one of the places where you have local clones of all UD repositories. Remove the old clone. Create a new clone under the new name. Check out the dev branch.
  5. Rename the data files in the dev branch (e.g. from “cs-ud-test.conllu” to “cs_pdt-ud-test.conllu”).
  6. Check the README.md and LICENSE.txt files for any mentions of the treebank name that may have to be modified. In the README file, add a line to the Changelog, e.g.:
    * 2018-04-15 v2.2
      * Repository renamed from UD_Czech to UD_Czech-PDT.
  7. Commit and push the changes. This should also trigger an automatic re-validation of the treebank under the new name. There will be a README error because the treebank is not recognized as previously released (in the function check_metadata() in tools/udlib.pm); see the next step.
  8. Go to the docs-automation repository. Open the file valdan/releases.json and go to its end, where the key renamed_after_release is. At the end of the hash under this key, add a new record in the following form:
    "2.1": [["UD_Czech", "UD_Czech-PDT"]]

    The release number identifying this record should be the last release where the treebank appeared under the old name.

  9. If there are other places where you maintain local clones of UD repositories (e.g., your laptop and your university network), go to each of them, make a new clone, check out the dev branch, and remove the old clone.
  10. Finally, we want to regenerate the title page of Universal Dependencies. Go to docs-automation. This assumes that all UD treebank repositories and the docs repository are cloned as siblings of docs-automation in the folder hierarchy and are switched to the dev branch. (The branch does not matter here because we will switch them to master in any case; but we assume that this is temporary and that we will switch back to dev when we are done.)
  11. Remove the old cached metadata:
    rm _corpus_metadata/UD_Czech.json
  12. Generate new metadata for the treebank (this script switches the repo temporarily to master):
    ./refresh_corpus_data.sh ../UD_Czech-PDT
  13. Regenerate the UD title page and push it to GitHub:
    make dan
    cd ../docs
    git pull --no-edit
  14. Rename the folder with the treebank hub page in the docs repository. Then push the changes.
    git mv treebanks/cs treebanks/cs_pdt
    for i in treebanks/cs_pdt/cs-* ; do git mv "$i" "$(printf '%s' "$i" | perl -pe 's/cs-/cs_pdt-/')" ; done
    git status
    git diff

    Then press Q to leave the pager and run:

    git commit -a -m 'Renamed treebank repository.'
    git push
    cd ..
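
Steps 2 and 5 above lend themselves to small shell helpers. The following is a minimal sketch, not part of the UD tooling: the function names drop_repo_records and new_data_name are hypothetical, and it assumes a POSIX shell with awk and mktemp available. The first helper matches the repository name as a literal prefix rather than as a regular expression; the second only prints the new file name, so it can be combined with git mv or used for a dry run.

```shell
# Hypothetical helpers for the rename procedure; not part of docs-automation.

# Remove all records of one repository from a report file.
#   $1 = old repository name
#   $2 = separator that follows the name
#        (":" in validation-report.txt, a tab in evaluation-report.txt)
#   $3 = report file
drop_repo_records() {
    prefix=$1$2
    report=$3
    tmp=$(mktemp) || return 1
    # index() == 1 is a literal prefix test, so the name needs no regex escaping
    # and "UD_Czech:" does not match the lines of "UD_Czech-PDT:".
    if awk -v prefix="$prefix" 'index($0, prefix) != 1' "$report" > "$tmp"; then
        # Write back through redirection so that the existing file is reused
        # (with its mode and ACL) instead of being replaced by a new one.
        cat "$tmp" > "$report"
    fi
    rm -f "$tmp"
}

# Print the new name of a data file, e.g. cs-ud-test.conllu -> cs_pdt-ud-test.conllu.
#   $1 = old prefix, $2 = new prefix, $3 = file name
new_data_name() {
    old=$1
    new=$2
    file=$3
    case "$file" in
        "$old"-ud-*)
            rest=${file#"$old"}          # strip the old prefix, keep "-ud-..."
            printf '%s\n' "$new$rest" ;;
        *)
            printf '%s\n' "$file" ;;     # not a data file of this treebank
    esac
}

# Possible usage (commented out so that sourcing this file has no side effects):
#   drop_repo_records UD_Czech : validation-report.txt
#   drop_repo_records UD_Czech "$(printf '\t')" evaluation-report.txt
#   for f in cs-ud-*.conllu; do git mv "$f" "$(new_data_name cs cs_pdt "$f")"; done
```

Because drop_repo_records overwrites the report in place, repeating chmod and setfacl afterwards should not be necessary, although this has not been checked on the validation server.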