
This page pertains to UD version 2.

UD for Old Irish

Treebank Classification and Pre-tokenisation Considerations

Both spelling and word separation in Old Irish texts can be highly irregular. In modern editions, some editors attempt to reproduce the text faithfully as it survives in the original manuscript. These are generally referred to as diplomatic editions. Other editors may alter the text so that it does not exactly match the contents of any single manuscript source, often normalising the text, and typically adding editorial commentary and supplying variant readings from extant sources in footnotes (but not within the body of the text). This work may be done with the aim of emulating a theorised earlier exemplar from which one or more existing manuscript sources are believed to have been copied, or with a view to making the resulting edition more reader-friendly. In such cases the resulting work is generally referred to as a critical edition.

Editors may also alter texts by standardising spelling, by silently introducing word spacing, by capitalising certain letters in accordance with modern orthographic practice, and by introducing forms of punctuation not present in the original manuscript. While these changes are not necessarily associated with critical editions, they alter the text in such a manner that it can no longer be described as entirely diplomatic. Texts edited in such a manner will therefore also be referred to broadly as “critical editions” here.

It is necessary to mark a distinction between diplomatic editions and those which have been altered to any extent by modern editors (i.e. “critical editions”). To mark this distinction all Old Irish treebanks should identify in their README documentation which type of edition they represent by using either the “diplomatic” or “critical” designation. This information should also be included in the treebank name and URL using the abbreviations Dip and Crit (for example, the Diplomatic St. Gall Glosses treebank URL ends: …/UD_Old_Irish-DipSGG).
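The naming convention described above can be sketched in a few lines of Python. This is only an illustration: the function name `treebank_name` and the lookup table are hypothetical and are not part of any UD tooling. Only the abbreviations Dip and Crit and the example name UD_Old_Irish-DipSGG come from the text.

```python
# Hypothetical helper for illustration only; not part of any UD tooling.
# The abbreviations "Dip" and "Crit" are the ones prescribed in the text above.
EDITION_ABBREVIATIONS = {"diplomatic": "Dip", "critical": "Crit"}


def treebank_name(edition_type: str, corpus_abbreviation: str) -> str:
    """Build a treebank name such as UD_Old_Irish-DipSGG."""
    try:
        prefix = EDITION_ABBREVIATIONS[edition_type.lower()]
    except KeyError:
        raise ValueError(
            f"edition type must be 'diplomatic' or 'critical', got {edition_type!r}"
        ) from None
    return f"UD_Old_Irish-{prefix}{corpus_abbreviation}"


print(treebank_name("diplomatic", "SGG"))
```

Rejecting any designation other than the two permitted ones keeps a mislabelled edition from silently producing a plausible-looking name.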

For the purpose of choosing the correct designation for a new treebank, the following definitions should be adhered to.

Diplomatic:

Critical:

Tokenisation and Word Segmentation

Tokenising Old Irish text is an unusually difficult task (Doyle and McCrae, 2025a; Doyle et al., 2019), posing difficulties which may not arise in many other languages. Words are not necessarily delimited by whitespace characters or punctuation in Old Irish texts. Instead, manuscript sources tend to combine unstressed words (including common clitics like the copula and definite article) with surrounding words that bear stress (see Thurneysen, 1946, p. 24 §34 and p. 30 §41). This practice results in many compound words which are purely orthographic, but composed of two or more lexical words. Where spacing does occur in Roman script, the whitespace character is used to delineate word boundaries. Ogham script, however, has a discrete space mark consisting of a stemline devoid of any other markings.
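As a rough illustration of the two kinds of spacing, the sketch below splits a string into orthographic units on either ordinary whitespace or the Ogham space mark, which Unicode encodes as U+1680 OGHAM SPACE MARK. The function name `orthographic_units` is an assumption made for this example, and the Ogham letters in the usage line are arbitrary characters rather than attested text. The units it returns are purely orthographic and may still contain several lexical words, so this is at most a pre-processing step, not the tokenisation method itself.

```python
import re

OGHAM_SPACE_MARK = "\u1680"  # Unicode OGHAM SPACE MARK

# Hypothetical helper for illustration only.
def orthographic_units(text: str) -> list[str]:
    """Split text on runs of whitespace and/or Ogham space marks."""
    pattern = rf"[\s{OGHAM_SPACE_MARK}]+"
    # Drop empty strings produced by leading or trailing separators.
    return [unit for unit in re.split(pattern, text) if unit]


# Arbitrary Ogham letters separated by the space mark (not attested text).
print(orthographic_units("\u168b\u1690\u1680\u168a\u1694"))
```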

Tokenisation in treebanks for Old Irish follows the method set out in Doyle and McCrae (2025a). This is intended to ensure compatibility, not only between digital resources for Old Irish, but also between Old Irish and other languages in UD (see Doyle and McCrae, 2025b). The following section gives a brief overview of some of the requirements of this tokenisation method.

Tokenisation in Old Irish Treebanks

Orthographic combinations of discrete lexical words should be separated during tokenisation:

Prepositional pronouns (conjugated prepositions) are taken to be discrete words in their own right, and should not be separated during tokenisation. This is based on Stifter’s assertion that “It is not possible to separate one element from the other” (2006, p. 87), and follows their treatment in Modern Irish treebanks (though notably not their treatment in Scottish Gaelic or Manx treebanks).

Punctuation is infrequent in manuscript sources; however, punctuation characters not present in the original manuscript material may be introduced by editors of some modern editions. Aside from these, the following exceptions occur:

No multiword tokens occur. Where adjectives or nouns precede other nouns they generally remain separate tokens. This can be seen in examples like “sengrec” which is split into “sen” and “grec”.
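One way to check the "no multiword tokens" requirement mechanically is to look for range IDs in a CoNLL-U file, since CoNLL-U marks a multiword token with an ID such as `1-2` in the first column. The sketch below is a minimal illustration under that assumption. The function name `multiword_token_ids` and the sample annotation are hypothetical, and every column other than ID and FORM is left as `_` rather than guessed.

```python
# Hypothetical checker for illustration only; not part of the UD validator.
def multiword_token_ids(conllu_text: str) -> list[str]:
    """Return the IDs of any multiword-token lines (range IDs such as '1-2')."""
    range_ids = []
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        # Only multiword tokens have a hyphen in the ID column.
        if "-" in token_id:
            range_ids.append(token_id)
    return range_ids


# "sengrec" annotated as two separate tokens, "sen" and "grec".
sample = "\n".join([
    "# text = sengrec",
    "\t".join(["1", "sen"] + ["_"] * 8),
    "\t".join(["2", "grec"] + ["_"] * 8),
])

print(multiword_token_ids(sample))
```

An empty result means the sample satisfies the requirement; any returned ID points at a line that would need to be re-tokenised.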

Some general advice on tokenisation follows which may not be intuitive to those familiar with Old Irish:

Morphology

POS-Tags


Features

Syntax

References

Bergin, Osborn. 1938. On the Syntax of the Verb in Old Irish. In Ériu, vol. 12, pages 197–214.

Doyle, Adrian and John P. McCrae. 2024. Developing a Part-of-speech Tagger for Diplomatically Edited Old Irish Text. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, pages 11–21, Torino, Italia. ELRA and ICCL. https://aclanthology.org/2024.lt4hala-1.2/

Doyle, Adrian and John P. McCrae. 2025a. An Assessment of Word Separation Practices in Old Irish Text Resources and a Universal Method for Tokenising Old Irish Text. In Proceedings of the 5th Celtic Language Technology Workshop, pages 1–11, Abu Dhabi [Virtual Workshop]. International Committee on Computational Linguistics. https://aclanthology.org/2025.cltw-1.1/

Doyle, Adrian and John P. McCrae. 2025b. Development of Old Irish Lexical Resources, and Two Universal Dependencies Treebanks for Diplomatically Edited Old Irish Text. In 5th International Conference on Natural Language Processing for Digital Humanities.

Doyle, Adrian, John P. McCrae, and Clodagh Downey. 2019. A Character-Level LSTM Network Model for Tokenizing the Old Irish text of the Würzburg Glosses on the Pauline Epistles. In Proceedings of the Celtic Language Technology Workshop, pages 70–79, Dublin, Ireland. European Association for Machine Translation. https://aclanthology.org/W19-6910/

McCone, Kim. 1997. The Early Irish Verb - Second Edition Revised with Index. An Sagart, Maynooth.

Ó hUiginn, Ruairí. 1987. Notes on Old Irish Syntax. In Ériu, vol. 38, pages 177–183.

Stifter, David. 2006. Sengoidelc. Syracuse University Press, New York.

Thurneysen, Rudolf. 1946. A Grammar of Old Irish. Translated by D. A. Binchy and Osborn Bergin. Reprinted 2010. Dublin Institute for Advanced Studies.

Treebanks

There are two Old Irish UD treebanks: