UD for Scottish Gaelic
At present UD for Scottish Gaelic contains a single corpus, the Annotated Reference Corpus of Scottish Gaelic (ARCOSG).
Tokenisation and Word Segmentation
Words are delimited by whitespace or punctuation. There are no multiword tokens. There are however multitoken words.
Reconstructing spacing
Context: ARCOSG does not contain the original texts, so we have to reconstruct them in a consistent way. We use GOC (Gaelic Orthographic Conventions, https://www.sqa.org.uk/files_ccc/SQA-Gaelic_Orthographic_Conventions-En-e.pdf) for consistency in reconstructing spacing, but don’t apply any other corrections.
According to the latest GOC:
- There are spaces after a’, b’, d’ or m’.
- There are no spaces after dh’.
- Do not close up before ‘m or ‘n.
Also (not covered explicitly by GOC but shown in examples):
- Close up h-, t-, and n-.
- Don’t close up after th’ and bh’.
If an elided a’ or ag before a verbal noun is indicated by ‘, close this up.
Close up around the hyphen in a-measg, a-rèir, a-thaobh and similar but don’t close up around hyphens if they’re being used as dashes. Also don’t attempt to bring into line with GOC by adding or taking away hyphens.
Also close up dhà-na-tri (see fp05_012).
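The closing-up rules above can be sketched as a small detokeniser. This is only an illustration, not the script used to build the treebank: the function names are hypothetical, it assumes that clitics such as dh’ and t- arrive as separate tokens written with the typographic apostrophe ’, and it ignores punctuation, dashes, ‘m/‘n and the elided a’/ag case.

```python
# Tokens after which the next token is closed up (no space), per the rules above.
# Everything else, including a’, b’, d’, m’, th’ and bh’, keeps a following space.
CLOSE_UP_AFTER = {"dh’", "h-", "t-", "n-"}


def needs_space_after(token: str) -> bool:
    """Return True if a space should follow this token."""
    return token.lower() not in CLOSE_UP_AFTER


def detokenise(tokens: list[str]) -> str:
    """Join tokens into a string, closing up only where the rules require it."""
    pieces = []
    for token in tokens:
        pieces.append(token)
        if needs_space_after(token):
            pieces.append(" ")
    return "".join(pieces).rstrip()


print(detokenise(["dh’", "fhalbh"]))           # closed up after dh’
print(detokenise(["a’", "toirt"]))             # space kept after a’
print(detokenise(["an", "t-", "seachdain"]))   # closed up after t-
```

Keeping the exceptions in a single set makes it easy to compare the implementation against the list of rules when GOC is revised.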
Multiword tokens
The original version of ARCOSG contains tokens that contain spaces. For UD, however, we need to split these up. The XPOS is duplicated for each of these words but the UPOS need not be.
PROPNs have a flat:name relation; others use fixed.
Multitoken words
Conversely, there are single tokens in ARCOSG that correspond to more than one word in the UD sense. Here are the most common families:
- Inflected prepositions (tagged Pr* in ARCOSG) such as orm, agam and ann are divided into the preposition, which is ADP, and the personal pronoun, which is PRON.
- Similarly, prepositions (tagged Spa-s in ARCOSG) which have been fused with a following article, such as dhan, mun and tron, are divided into a preposition and an article. ‘san and the like, which are short for anns an, are divided into two ADPs with a fixed relation between the two.
- Prepositions and aspect markers (see below) which have been fused with a possessive pronoun (tagged Sap* or Spp* in ARCOSG).
- Where ‘se and ‘sann are a single orthographical word they too are separated out. See the section on is below for details.
Morphology
Parts of speech
Standard UPOS tags are used throughout. Generally we follow the choices made in the Irish UD treebanks.
- AUX is used for is (the copula) and rach (the passive copula). bi is tagged as VERB.
- The following words are tagged PART: the adverbialiser gu (gu math ‘well’), the comparative particle nas, the superlative particle as, the agreement particle a (a dol ‘to go’), the vocative particle a (a Sheumais ‘James’), the patronymic particles Mac and Nic, the numerical particle a (a h-aon ‘one’), the past tense marker do, the negative particles cha and nach, the interrogative particles a and an and the relative particle a.
- Verbal nouns are tagged as NOUN with VerbType=Vnoun.
- Deverbal adjectives are tagged as ADJ.
- The aspectual particles ag/a’, air, gu and ri (all prototypically adpositions) are tagged as ADP.
- Demonstrative pronouns, seo, sin and siud are tagged as PRON as in Irish.
- If they are acting as determiners (and tagged as Dd in ARCOSG) then they are tagged DET, as in Irish again.
Features
Gaelic has two genders (masculine and feminine), four cases (nominative/accusative, genitive, dative and vocative), three numbers (singular, dual and plural), the usual three persons and an impersonal form.
The words fèin and chèile take Reflex=Yes.
The indicative mood is default and we mark the conditional (Cnd), imperative (Imp) and interrogative (Int) moods. The tenses we mark are
We also follow Irish in marking three pronoun types (Emp = emphatic, Int = interrogative and Rel = relative), polarity (Neg on negative particles) and the following particle types: Ad (adverbialiser), Comp (comparative), Cmpl (complement), Inf (agreement particle), Int (interrogative), Num (numerical), Pat (patronymic), Vb (verbal) and Voc (vocative).
We also have Foreign=Yes for words that are in Irish or English according to the original ARCOSG tagging.
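Features are stored in the CoNLL-U FEATS column as Name=Value pairs separated by |, with _ for an empty set. The sketch below, with hypothetical function names, parses and re-serialises strings of the kind found in the treebank, such as Gender=Fem|Number=Sing|Person=3 or PartType=Vb|Polarity=Neg. It assumes each feature has a single value.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Parse a FEATS value into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feats: dict[str, str]) -> str:
    """Serialise a dict back to FEATS, sorted by feature name ignoring case."""
    if not feats:
        return "_"
    ordered = sorted(feats.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


feats = parse_feats("Gender=Fem|Number=Sing|Person=3")
print(feats["Gender"])
print(format_feats({"Polarity": "Neg", "PartType": "Vb"}))
```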
Syntax
VSO clause structure
Main clauses and subordinate clauses are VSO. The subject almost invariably follows the verb:
- Chuala mi sin gun teagamh. ‘I heard that without a doubt’ (V S O ADV)
- Can gun do chaith thu e ‘Say that you did it’ (V SCONJ past-tense-marker V S O)
However, if there is an externally-controlled complement then the object follows the verbal noun if it is in the progressive aspect with a nominal object, but precedes it if it is in the progressive aspect with a pronominal object.
- Bha iad a’ toirt an teachd-an-tìr ‘They were making a living’ (V S asp V O)
- Bheil thu ga mo leantainn? ‘Are you following me?’ (V S asp O V)
- Bha esan air a bhean a chaill ‘He had lost his wife’ (V S asp O V)
Core arguments, oblique arguments and adjuncts
The core arguments are marked by nsubj and obj if they are noun phrases. Oblique arguments and adjuncts are marked by obl when they are prepositional phrases. Occasionally they are noun phrases, in which case we use obl:tmod if they indicate a stretch of time or obl:smod if they indicate a distance.
In terms of clausal subjects csubj:cop is used for expressions like:
- B’ àbhaist do dhaoine saoilsinn… ‘People usually think…’ where àbhaist do dhaoine is the root, bu (here in the reduced form b’) is the copula and saoilsinn is the clausal subject.
In Gaelic clefting constructions are much more common than in Irish:
- ‘se caoraich a th’ aice ‘it is sheep that she has’
- chan e gearrain aon duine a th’ ann (lit. it is not the complaint of one person that is in it) ‘it is not the complaint of one person’
The expletive particle e or ann is linked to the copula with fixed.
Language-specific labels
With three exceptions, these follow Irish:
- acl:relcl for relative clauses
- aux:pass for rach-passives
- case:voc for vocative particles
- csubj:cleft for cleft subjects
- csubj:cop for copular clausal subjects
- mark:prt for particles not otherwise marked
- nmod:poss for possessive pronouns (but we use obj where the use of the possessive pronoun indicates an object)
- nsubj:outer for where there are two subjects for a rach-passive
- nsubj:pass for the subjects of rach-passives
- obl:smod (not in Irish) for spatial modifiers
- obl:tmod for temporal modifiers
- xcomp:pred for predicates of the substantive verb bi ‘to be’. bi does not take an object. To identify the predicate: the most likely is a verbal noun, followed by an existential prepositional phrase ann or a prepositional phrase expressing location, a noun phrase expressing temporal extent, spatial extent or cost, and lastly an adverb.
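For tooling, the labels in this list can be kept as a set and used to pick out language-specific relations in a file. This is a minimal sketch with hypothetical names. Other subtyped relations that these guidelines mention elsewhere, such as flat:name and flat:foreign, are deliberately left out because they are not part of this list.

```python
from typing import Optional, Tuple

# The language-specific labels listed in this section.
LISTED_SUBTYPES = {
    "acl:relcl", "aux:pass", "case:voc", "csubj:cleft", "csubj:cop",
    "mark:prt", "nmod:poss", "nsubj:outer", "nsubj:pass",
    "obl:smod", "obl:tmod", "xcomp:pred",
}


def split_deprel(deprel: str) -> Tuple[str, Optional[str]]:
    """Split 'obl:tmod' into ('obl', 'tmod'); plain relations get None."""
    universal, _, subtype = deprel.partition(":")
    return universal, (subtype or None)


def is_listed(deprel: str) -> bool:
    """True if the relation is one of the labels listed in this section."""
    return deprel in LISTED_SUBTYPES


print(split_deprel("xcomp:pred"))
print(is_listed("obl:smod"), is_listed("flat:name"))
```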
Some specific cases
The verbal noun
Annotate as a NOUN.
With aspect markers (continuous tenses and depictives)
Here it has VerbType=Vnoun.
ag, air, ri and so forth preceding it have a case relationship as in Irish.
Here it is an xcomp:pred of the verb bi.
Inversion structures and rach-passives
Here it has VerbType=Inf.
Usually it is preceded by an infinitive particle a but this is elided where it begins with a vowel or fh.
In inversion structures, the object is obj of the verbal noun, with the exception of rach-passives where it is nsubj:pass or exceptionally nsubj:outer.
agus, is and ’s
- Usually these are CCONJ and are related to what they are conjoining with cc.
- However, if they are being used cosubordinatively, to introduce an adverbial phrase that looks like a bi clause where the verb has been elided, they are SCONJ and the relation is mark.
- Caution: one eighth of ARCOSG is football commentary where the verb is routinely elided. In this case look at whether the events being related are sequential or simultaneous. If they are sequential, then agus is a coordinating conjunction. If they are simultaneous then agus is a subordinating conjunction.
- In expressions like fad ‘s and o chionn ‘s, ‘s has a fixed relation to the subordinating conjunction.
- However, in expressions like còrr is and fiù ‘s, where the word preceding it is a content word, it is a coordinating conjunction and behaves as normal.
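The decision procedure in the list above can be written down as a small function. This is a sketch only: the function and parameter names are hypothetical, and both inputs are judgements an annotator has to make, not things the code can detect. The text does not state a UPOS for ‘s in the fixed case, so the sketch returns None there.

```python
from typing import Optional, Tuple


def annotate_agus(simultaneous: bool,
                  fixed_after_sconj: bool = False) -> Tuple[Optional[str], str]:
    """Return (UPOS, deprel) for agus, is or ‘s.

    simultaneous: the conjunction introduces a verbless, bi-like adverbial
        whose event overlaps the main one (the cosubordinative use).
    fixed_after_sconj: ‘s follows a subordinating conjunction, as in fad ‘s.
    """
    if fixed_after_sconj:
        return (None, "fixed")    # UPOS not specified in the guidelines above
    if simultaneous:
        return ("SCONJ", "mark")  # cosubordination
    return ("CCONJ", "cc")        # ordinary or sequential coordination


print(annotate_agus(simultaneous=False))
print(annotate_agus(simultaneous=True))
```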
air ais
In ARCOSG, ais is tagged as Nf (fossilized noun).
However there are phrases like air ais no air adhart in which there seems to be no good reason to treat the first half differently from the second half, even if ais is no longer productive.
c04_024: ‘she did not write back yet’
1 cha cha PART Qn PartType=Vb|Polarity=Neg 3 mark:prt _ _
2 do do PART Q--s Tense=Past 3 mark:prt _ _
3 sgrìobh sgrìobh VERB V-s Tense=Past 0 root _ _
4 i i PRON Pp3sf Gender=Fem|Number=Sing|Person=3 3 nsubj _ _
5 air air ADP Sp _ 6 case _ _
6 ais ais NOUN Nf _ 3 obl _ _
7 fhathast fhathast ADV Rt _ 3 advmod _ _
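To check the structure of the example, the rows can be loaded into dictionaries and the heads inspected. The sketch below splits on whitespace because the rows are shown here with spaces; real CoNLL-U files are tab-separated and also contain comment lines, which this hypothetical helper does not handle.

```python
EXAMPLE = """\
1 cha cha PART Qn PartType=Vb|Polarity=Neg 3 mark:prt _ _
2 do do PART Q--s Tense=Past 3 mark:prt _ _
3 sgrìobh sgrìobh VERB V-s Tense=Past 0 root _ _
4 i i PRON Pp3sf Gender=Fem|Number=Sing|Person=3 3 nsubj _ _
5 air air ADP Sp _ 6 case _ _
6 ais ais NOUN Nf _ 3 obl _ _
7 fhathast fhathast ADV Rt _ 3 advmod _ _
"""

FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_rows(text: str) -> list[dict]:
    """Parse whitespace-separated token rows into dicts with integer id/head."""
    rows = []
    for line in text.strip().splitlines():
        values = line.split()
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns: {line!r}")
        row = dict(zip(FIELDS, values))
        row["id"] = int(row["id"])
        row["head"] = int(row["head"])
        rows.append(row)
    return rows


rows = parse_rows(EXAMPLE)
root = next(row for row in rows if row["head"] == 0)
dependents = [row["form"] for row in rows if row["head"] == root["id"]]
print(root["form"], dependents)
```

In this analysis ais attaches to the verb as obl and air attaches to ais as case, the same shape any other prepositional phrase would get.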
bi
Auxiliary use: we follow the Irish UD treebank and treat bi as a VERB, and the verbal noun as a NOUN linked back to bi with an xcomp:pred deprel.
Predicative use: again, we follow Irish and use xcomp:pred for predicative adjectives, PPs and adverbs. There is a construction exemplified in c02_009a, c02_009b and c02_010, bi… agam… ri dhol…, and in this case we assume that the PP with aig is the quirky experiencer and ri is the predicate.
However (see f01_028), there are also uses of bi for extent in time (n03_041) and space.
còrr is and friends
Example taken from pw01_015a: in còrr is deich bliadhna, bliadhna is conjoined with còrr and deich is a nummod of bliadhna.
From ns04_053: in thachair an tubaist còrr is bliadhna gu leth, còrr is obl:tmod of thachair because the phrase as a whole is a time phrase.
dè cho…
‘how’ as in ‘how big’. dè remains PRON and cho is advmod of the succeeding adjective.
feuch
When this is tagged as Vm-2s the sense in which it is usually used is ‘to try to’, in which case it is linked to the higher clause with an xcomp deprel.
For example n04_002: … gu robh e ‘dol a dh’fhalbh feuch a faigheadh…, feuch is an xcomp of dh’fhalbh.
fhios agad and variants
‘you know’. Treat as parataxis as it is explicitly excluded from discourse. See also parataxis below.
foreign words
Usually English (en) but sometimes Early Modern Irish (ghc).
If they’re the names of institutions (mostly in the news subcorpus) or borrowings being used in a matter-of-fact way (mostly in the conversation subcorpus) then they are tagged with their original parts of speech and joined by flat.
OrigLang=en (or whichever language) goes in the MISC column.
If they’re being used appositively or are titles of works, or are reported speech in another language, then tag everything with X and use flat:foreign to join them.
They have Foreign=Yes and no other features in the morphology column.
Lang=en goes in the MISC column.
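The two treatments can be summarised in a small helper. This is a sketch with hypothetical names: the caller decides whether the span is a name or everyday borrowing on the one hand, or an appositive, title or reported speech on the other, and the relation returned is the one joining the tokens within the span, not the relation attaching the span to the rest of the sentence.

```python
from typing import Optional


def foreign_annotation(original_upos: str, lang: str, quoted: bool) -> dict:
    """Return the annotation for one token of a foreign-language span.

    quoted: True for appositives, titles of works and reported speech;
            False for institution names and matter-of-fact borrowings.
    lang:   a language code such as 'en' or 'ghc'.
    """
    feats: Optional[str]
    if quoted:
        upos, feats = "X", "Foreign=Yes"
        join_deprel, misc = "flat:foreign", f"Lang={lang}"
    else:
        upos, feats = original_upos, None  # features are not specified for this case
        join_deprel, misc = "flat", f"OrigLang={lang}"
    return {"upos": upos, "feats": feats, "join_deprel": join_deprel, "misc": misc}


print(foreign_annotation("NOUN", "en", quoted=False))
print(foreign_annotation("NOUN", "en", quoted=True))
```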
an ìre mhath
This means ‘almost’. See s08_061b for an example. Use nmod.
is
‘S, b’, bu, ‘se, ‘sann and so on are cop and the root is whatever has been fronted by it.
We treat ‘S e as a fixed expression where e has a fixed relation with the AUX.
Likewise ‘S ann, except of course ann is divided up into an and e and both have a fixed relation with the AUX.
Following Cox in Geàrr Ghràmar na Gàidhlig, p. 284, in phrases like is ann a cheannaich mi bainne, cheannaich is still the root even though it’s preceded by a relative particle.
Again we follow Irish and whatever comes after the root is a subject, be it a nominal subject, nsubj, or a clausal subject, csubj:cleft or csubj:cop.
mas
Mas (‘if’) is divided into the two words ma (SCONJ) and is (AUX).
nach maireann
(as in Dr Calum MacGilleathain nach maireann, ‘The late Dr Calum Maclean’) This is acl:relcl of the deceased because nach is the negative relativiser.
parataxis
Where you have a big long sentence with lots of “ars’ esan” and “ars’ ise”s in it, treat them like punctuation and make them parataxis of the most contentful content word in the nearest quoted text so as to avoid non-projectivity. Sentence n01_038 is an example of this.
an t-seachdain seo chaidh and others
‘last week’, literally ‘this week that went’. Treat chaidh as being acl:relcl of t-seachdain (pw05_005, also ceud in the sense of ‘century’: see fp01_034).
urrainn
In most dialects the person (or thing) that can follows the preposition do, so is of course nmod.
In some, however, you can say, for example, ’s urrainn mi, so in this case mi is nmod of urrainn.
vocables
There are no vocables in ARCOSG, but in the event of a future poetry/song corpus the words in them should be connected by flat.
Treebanks
There is one Scottish Gaelic UD treebank, based on the Annotated Reference Corpus of Scottish Gaelic (ARCOSG).
References
- Batchelor, Colin. 2019. Universal dependencies for Scottish Gaelic: syntax. In Proceedings of CLTW2019 at Machine Translation Summit XVII, Dublin, August 2019.
- Lamb, William, Sharon Arbuthnot, Susanna Naismith, and Samuel Danso. 2016. Annotated Reference Corpus of Scottish Gaelic (ARCOSG), 1997–2016 [dataset]. Technical report, University of Edinburgh; School of Literatures, Languages and Cultures; Celtic and Scottish Studies. https://doi.org/10.7488/ds/1411.
- Lynn, Teresa and Jennifer Foster. 2016. Universal Dependencies for Irish (http://www.nclt.dcu.ie/~tlynn/Lynn_CLTW2016.pdf). CLTW 2016, Paris, France, July 2016.