Tokenization and Word Segmentation
In Irish, words are generally delimited by whitespace characters. The exceptions are described below.
Some punctuation marks are attached to a neighbouring word. The word and the punctuation mark are taken together as one token. For example, D’ (contraction of do in d’ith “ate”), b’ (in b’fhearr “would prefer”) and O’ (in surnames) are recognised as single tokens. Abbreviations such as srl. “etc.” or i.n. “p.m.” are also recognised as one token.
Note that compound prepositions (os_cionn “above”, in_aice “beside”, etc.) are split into two tokens for UD v2, as are some placenames that the tagger recognises (e.g. Cill_Dara) and a limited number of MWEs (chomh_fada_is “as long as”; cé_is_moite “except for”). The Irish POS tagger used in the Irish Dependency Treebank retains these as single tokens, so they must be mapped accordingly as the treebanks develop concurrently.
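The mapping from underscore-joined tagger tokens to separate UD v2 tokens can be sketched as follows. This is a minimal illustration only: the function name split_mwe_tokens is hypothetical, and it assumes that the underscore is used solely as the joining character in the tagger output, which may not hold for all data.

```python
def split_mwe_tokens(tokens):
    """Split underscore-joined tagger tokens into separate tokens.

    Assumption: "_" only ever marks a multiword unit in the input.
    """
    ud_tokens = []
    for token in tokens:
        # Drop empty pieces so that a stray leading/trailing "_" is ignored.
        parts = [part for part in token.split("_") if part]
        # A token consisting only of "_" is passed through unchanged.
        ud_tokens.extend(parts if parts else [token])
    return ud_tokens


tagger_tokens = ["os_cionn", "Cill_Dara", "chomh_fada_is", "srl."]
print(split_mwe_tokens(tagger_tokens))
```

Tokens without an underscore, such as the abbreviation srl., pass through untouched, so the function can be applied to a whole tagged sentence.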
POS Tags
The UD part-of-speech (POS) tagset is an extension of the Google Universal POS tagset (Petrov et al., 2012) and contains 17 POS tags. The tags for the Irish Dependency Treebank are based on the PAROLE Morphosyntactic Tagset (ITÉ, 2002).
A mapping from this tagset to the UD tagset for use in the IUDT is given in: Lynn, Teresa and Jennifer Foster. Universal Dependencies for Irish. In Proceedings of the 2nd Celtic Language Technology Workshop, 2016, Paris, France.
The following is a summary of some specific or unintuitive choices made when mapping Irish data to conform to the Universal POS tags for UD v2:
- The AUX tag is used for the Copula only. All other verbs (including the substantive verb bí “to be”) are tagged as VERB.
- Verbal adjectives are tagged as ADJ
- The following particles are tagged as PART: adverbial (go mall “slowly”), verbal (ná déan “don’t do”), vocative (a Sheosamh), comparative (níos déanaí “later”), superlative (is déanaí “latest”), numeral (a haon “one”), relative (a chonaic sé “that he saw”), infinitive (a dhéanamh “to do”), degree (a luaithe “sooner”), name (Seosamh Mac Grianna)
- Verbal nouns are tagged as NOUN
- ag, when used with a verbal noun to form a gerund in progressive aspectual phrases, is tagged as ADP
- demonstrative pronouns are tagged as PRON (sin an fadhb “that’s the problem”, Thug sé sin faoi deara “he noticed that”)
- demonstrative determiners, on the other hand, are tagged as DET along with all other determiners (an leabhar sin “that book”)
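The choices listed above can be written down as a small lookup table. The sketch below is illustrative only: the category names on the left are hypothetical labels invented for this example, not PAROLE tags, and the full mapping is the one given in Lynn and Foster (2016).

```python
# Hypothetical category names (not PAROLE tags) mapped to the UD POS tags
# named in the list above.
UD_POS_CHOICES = {
    "copula": "AUX",
    "substantive_verb": "VERB",
    "verbal_adjective": "ADJ",
    "verbal_noun": "NOUN",
    "particle": "PART",
    "progressive_ag": "ADP",
    "demonstrative_pronoun": "PRON",
    "demonstrative_determiner": "DET",
}


def ud_pos(category):
    """Return the UD POS tag for one of the illustrative categories."""
    try:
        return UD_POS_CHOICES[category]
    except KeyError:
        raise ValueError(f"no UD mapping recorded for category {category!r}")


print(ud_pos("copula"))
print(ud_pos("demonstrative_determiner"))
```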
Here we summarise the morphological features of Irish, which can be categorised into inflectional and lexical features.
Inflection in Irish mainly occurs through suffixation, but initial mutation through lenition and eclipsis is also common. Lenition is a phonological change that softens or weakens the articulation of a consonant. The eclipsis process renders voiced segments as nasalised and voiceless segments as voiced (Stenson, 1981, p. 18). A prominent feature of Irish that influences inflection is the existence of two sets of consonants, referred to as “broad” and “slender” consonants. Consonants can be slenderised by accompanying the consonant with a slender vowel, either e or i. Broadening occurs through the use of broad vowels: a, o or u. In general, there needs to be vowel harmony (slender or broad) between stem endings and the initial vowel in a suffix.
- buail “hit” ag bualadh na liathróide “hitting the ball” (Verbal Noun)
- buail “hit” buaileadh an liathróid “the ball was hit” (Impersonal Form)
Verbs inflect for number and person, as well as mood and tense. Verbs can incorporate their subject, inflecting for person and number through suffixation. Such forms are referred to as synthetic verb forms. Most verbs tend to incorporate a subject when it is first person singular or plural. These synthetic forms are generally restricted to the Present Tense, Imperfect Tense, Conditional Mood and Imperative Mood.
- scríobh “write”
- scríobhaim “I write”
- scríobhfaimid “we will write”
However, second person singular and plural subjects are incorporated in some verb tenses and moods:
- nigh “wash”
- niteá “you used to wash” (Past Habitual)
- nígí! “(you pl.) wash!” (Imperative)
Tense is also marked by lenition on some verb forms:
- dún “close”
- dhún mé “I closed”
- dhúnfainn “I would close”
Lenition occurs after the negative particle ní:
- tugaim “I give”
- ní thugaim “I do not give”
- tabharfaidh mé “I will give”
- ní thabharfaidh mé “I will not give”
Eclipsis (initial mutation) occurs following clitics such as interrogative particles (an, nach); complementisers (go, nach); and relativisers (a, nach) (Stenson, 1981, pp. 21-26).
- an dtuigeann sé? “does he understand?”
- nach dtuigeann sé “that he does not understand”
- go dtabharfadh sé “that he would give”
- go dtabharfadh sé “that he would give”
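The orthographic side of the two initial mutations illustrated above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of any tool used for the treebank: the function names lenite and eclipse are hypothetical, only the standard spelling of the mutation is modelled, and the grammatical contexts that trigger or block a mutation (as well as prefixes added before vowels) are ignored.

```python
# Consonants that are lenited in writing by inserting "h" after them.
LENITABLE = set("bcdfgmpst")

# Letters prefixed in writing to an eclipsed initial consonant.
ECLIPSIS_PREFIX = {
    "b": "m", "c": "g", "d": "n", "f": "bh",
    "g": "n", "p": "b", "t": "d",
}


def lenite(word):
    """Return the lenited spelling of word, or word itself if unaffected."""
    first = word[:1].lower()
    second = word[1:2].lower()
    if first not in LENITABLE or second == "h":
        return word
    # Initial s is not lenited before c, f, m, p or t.
    if first == "s" and second in set("cfmpt"):
        return word
    return word[0] + "h" + word[1:]


def eclipse(word):
    """Return the eclipsed spelling of word, or word itself if unaffected."""
    prefix = ECLIPSIS_PREFIX.get(word[:1].lower())
    return prefix + word if prefix else word


for base in ["dún", "tugaim", "tabharfaidh"]:
    print(base, "->", lenite(base))
for base in ["tuigeann", "tabharfadh"]:
    print(base, "->", eclipse(base))
```

Applied to the forms in this section, lenite maps dún to dhún and tugaim to thugaim, while eclipse maps tuigeann to dtuigeann and tabharfadh to dtabharfadh.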
Modern Irish uses three cases: Nominative, Genitive and Vocative. The nominative form is sometimes regarded as the “common form” as it is now also used for accusative and dative forms (see Case for a description of ‘Case=NomAcc’). Nouns in Irish are divided into five classes, or declensions, depending on the manner in which the genitive case is formed. In addition, there are two grammatical genders in Irish: masculine and feminine. Case, declension and gender are expressed through noun inflection. For example, páipéar “paper” is a masculine noun in the first declension. Both lenition and slenderisation are used to form the genitive singular form: pháipéir.
- an dochtúir “the doctor”
- cóta an dochtúra “the doctor’s coat”
- an fheoil “the meat”
- boladh na feola “the smell of the meat”
- an coinín “the rabbit”
- eireaball an choinín “the rabbit’s tail”
- an siopa “the shop”
- cúl an tsiopa “the back of the shop”
- Máire “Mary”
- a Mháire! “Mary!” (Vocative)
In addition, possessive determiners cause nominal inflection through lenition, eclipsis and prefixation.
- teach “house”
- mo theach “my house”
- ár dteach “our house”
- ainm “name”
- a hainm “her name”
- a n-ainm “their name”
In general, adjectives follow nouns and agree in number, gender and case. Depending on the noun they modify, adjectives can also inflect. The Christian Brothers (1988, p.63) note eight main declensions of adjectives. They can decline for genitive singular masculine, genitive singular feminine and nominative plural.
- bacach “lame”
- bacaigh (Gen.Sg.Masc)
- bacaí (Gen.Sg.Fem)
- bacacha (Nom.Pl)
Comparative adjectives are also formed through inflection:
- láidir “strong” / níos láidre “stronger”
- déanach “late” / is déanaí “latest”
Irish has simple prepositions (e.g. ar “on”) and compound prepositions (e.g. in aghaidh “against”). Most of the simple prepositions can inflect for a pronominal object that indicates person and number (known as prepositional pronouns or pronominal prepositions), thus including a nominal element. Compare le and leis:
- bhí sé ag labhairt le fear “he was speaking with a man”
- bhí sé ag labhairt leis “he was speaking with him”
These forms are used quite frequently, not only with regular prepositional attachment where pronominal prepositions operate as arguments of verbs or modifiers of nouns and verbs, but also in idiomatic use where they express emotions and states.
- tá brón orm “I am sorry” (lit. ‘is sorrow on-me’)
- tá súil agam “I hope”
Here we summarise some of the distinctive features of Irish as a Celtic language. These features commonly occur in standard Irish use and therefore require discussion in the context of treebank development. Irish theoretical syntax is relatively under-researched, and even within the limited work carried out in this area thus far, many disagreements remain unresolved. In general, Irish dependency treebank development follows the work of Stenson (1981).
VSO clause structure
Both main clauses and subordinate clauses follow a VSO structure in Irish.
- Thug sí comhairle dom (lit. Gave she advice to-me) “She gave me advice”
- Dúirt siad gur chaith na daoine an airgead “They said that the people spent the money” (V S [that V S O])
There are only a couple of exceptional circumstances under which an element can appear between the verb and the subject (see the first example below). While various elements, such as prepositional phrases and adverbs, may occur between the subject and object, the verb-subject-object order is strict (McCloskey, 1983, pp. 10-11).
- Tá ar ndóigh daoine a chreideann… (V ADV SUBJ REL-CL) “There are of-course people who believe…”
- Thug sé dom inné é (V S PP ADV O) “He gave it to me yesterday”
Irish sentences using bí, the Substantive Verb “to be”, follow the VSO structure. However, constructions using the Copula is follow a Copula-Predicate-Subject order. This is explained in more detail in cop.
Core Arguments, Oblique Arguments and Adjuncts
A nominal subject (nsubj) is a noun phrase in the nominative case, without preposition.
An infinitive verb may serve as the subject and is labeled as clausal subject, ‘csubj’. Verbal nouns as subjects, on the other hand, are labeled simply ‘nsubj’.
A finite subordinate clause may serve as the subject and is labeled ‘csubj:cop’.
‘csubj:cop’ is used when the clause is a subject of a copular phrase. These are copular constructions that follow the Copula-Predicate-Subject order.
- Ní hamháin nach bhfaca sé aon rogha eile áfach “it wasn’t just that he didn’t see any other option however”
On the other hand, ‘csubj:cleft’ is used when the clause is the subject of a clefted sentence (which also follow the Copula-Predicate-Subject order).
- Is leabhar a thug sí dó “It’s a book she gave him”
There are idiomatic phrases in which translations would suggest that the Irish subject is actually the object.
For example:
- Is maith liom tae “I like tea” (lit. tea is good with me)
There is no passive construction in Irish, and therefore ‘nsubj:pass’ and ‘csubj:pass’ are not used in the Irish treebank. What often translates into English as a passive is the autonomous verb form. These verbs, labelled with the feature ‘Voice=Auto’ (see Voice), have an “understood”/implicit subject and are usually followed directly by the object.
- Foilsíodh an chéad chuid den sraith cartún “The first cartoon series was published” (lit. somebody published the first series of the cartoon)
Objects ‘obj’ in Irish may be bare noun phrases in common form (NomAcc) or prepositional phrases in common form (NomAcc). For the purpose of UD the objects are divided into core objects, labeled obj, and oblique objects, labeled obl.
There are no indirect objects in Irish.
Oblique ‘obl’. Adjuncts are usually prepositional phrases, but they can be bare noun phrases as well. They are labeled obl:
- Foilsíodh an chéad chuid den sraith cartún sa bhliain 1983 “The first cartoon series was published in the year 1983”
In the dative alternation, the prepositional construction gets a similar analysis to the double object construction:
- Thug sé litir don fhear “He gave a letter to the man”
Nouns can be objects of clausal complements, which are labeled xcomp.
If a verb subcategorizes for two core objects, one of them accusative (or ccomp) and the other non-accusative, then the non-accusative object is labeled iobj. Core nominal objects in other situations are labeled just obj.
Oblique agents of verbal adjectives are labelled as ‘obl’:
- go bhfuil dul chun cinn iontach déanta ag foireann shinsir… “that the senior team have made great progress…” (lit. that great progress has been made by the senior team)
All prepositional phrases that are not prepositional objects (i.e., their role and form is not defined lexically by the predicate) are adjuncts (‘nmod’).
- as gach ceann de na béilí seo “from each one of these meals”
Clefting / Fronting
Clefting or fronting is a commonly used structure in the Irish language and is described in more detail in csubj:cleft. Elements are fronted to predicate position to create emphasis or focus. Irish clefts differ from clefts in English in that there is more freedom with regard to the type of sentence element that can be fronted (Stenson, 1981, p. 99). In Irish, the structure is as follows: Copula, followed by the fronted element (Predicate), followed by the rest of the sentence (Relative Clause). The predicate can take the form of a noun phrase (headed by a pronoun, noun or verbal noun), or an adjectival, prepositional or adverbial phrase.
Nominal Fronting:
- Is leabhar a thug sí dó “It’s a book that she gave to him”
Adverbial Fronting:
- Is laistigh de bhliain a déanfar é “It’s within a year that it will be done”
Pronoun Fronting:
- Is ise a chonaic siad inné “It is she whom they saw yesterday”
Prepositional fronting:
- Is sa pháirc a chonaic mé an gabhar “It’s in the field that I saw the goat”
Note that in UD, the cleft particle a is indistinguishable from the relative particle a. Both are labelled ‘mark:prt’ (see mark:prt).
Stenson (1981, p. 111) describes the cleft construction as being similar to copular identity structures with the order of elements as Copula, Predicate, Subject. According to Stenson, the a is a relative particle which forms part of the relative clause. However, there is no surface head noun in the relative clause: it is missing an NP. Stenson refers to these structures as having an “understood” nominal head such as an rud “the thing” or an té “the person/the one”, e.g. Is ise [an té] a chonaic siad inné. When the nominal head is present, it becomes a copular identity construction: She is the one who they saw yesterday. In the absence of a head noun, the verb is labelled as the head of the clause.
Note that a relative clause which is copular is considered to be clefted when it occurs as the predicate of a copular phrase.
- Is é Michael D. Higgins ba chionsiocair leis an Roinn a bhunú sa bhliain 1992. “Michael D. Higgins was the driving force behind the establishment of the Department in 1992.”
Pleonastic Conjunction ‘ná’
The presence of the pleonastic conjunction ná allows for the reordering of the copula-predicate-subject structure, which is rearranged to become copula-subject-conjunction-predicate.
Note that in this example, we consider ‘bunú’ as the root, ‘toradh’ as the subject and ‘é’ as a nominal modifier on ‘toradh’. There is a relative clause; ‘bhí’ is an acl:relcl coming off ‘toradh’.
- Ba é an toradh a bhí ar a gcuid iarrachtaí ná bunú ‘Irish Historical Studies’ i 1938 “The result of their efforts was the establishment of ‘Irish Historical Studies’ in 1938”.
ROOT Ba é an toradh a bhí ar a gcuid iarrachtaí ná bunú 'Irish Historical Studies' i 1938.
ROOT COP(past) The result of their efforts was the establishment of the Irish Historical Studies in 1938.
mark:prt(bunú, ná)
cop(bunú, Ba)
root(ROOT, bunú)
nsubj(bunú, toradh)
nmod(toradh, é)
acl:relcl(toradh, bhí)
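The relation lines above follow a label(head, dependent) pattern, which can be read into triples for inspection. The sketch below is only an illustration: the function name parse_relations and the regular expression are assumptions made for this example and are not part of any UD tool.

```python
import re

# One relation per line, written as label(head, dependent).
RELATION = re.compile(r"^\s*([\w:]+)\(\s*(.+?)\s*,\s*(.+?)\s*\)\s*$")


def parse_relations(lines):
    """Return (label, head, dependent) triples for lines that match."""
    triples = []
    for line in lines:
        match = RELATION.match(line)
        if match:
            triples.append(match.groups())
    return triples


analysis = """\
mark:prt(bunú, ná)
cop(bunú, Ba)
root(ROOT, bunú)
nsubj(bunú, toradh)
nmod(toradh, é)
acl:relcl(toradh, bhí)
""".splitlines()

triples = parse_relations(analysis)
# The root word is the dependent of the artificial ROOT node.
root = next(dep for label, head, dep in triples if label == "root")
# Collect everything attached directly to the root word.
dependents = [(label, dep) for label, head, dep in triples if head == root]
print(root, dependents)
```

Listing the direct dependents of the root makes the analysis easy to check against the prose: ná, Ba and toradh all attach to bunú.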
Language specific labels
The Irish UD treebank uses 26 of the UD dependency labels. A further 10 language-specific labels were introduced to deal with certain linguistic phenomena in Irish:
- acl:relcl for relative clauses
- case:voc for vocative particles
- compound:prt for verb particle heads
- csubj:cleft for cleft subjects
- csubj:cop for copular clausal subject
- mark:prt for (most) particles
- nmod:poss for possessive pronouns
- obl:prep for pronominal prepositions
- obl:tmod for temporal modifiers
- xcomp:pred for predicates of the substantive verb “to be”
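Each label in the list above consists of a universal relation and a language-specific subtype separated by a colon. As a small illustration, the sketch below splits the labels and groups the subtypes by universal relation; the names split_label and subtypes_by_relation are hypothetical and chosen only for this example.

```python
IRISH_SUBTYPE_LABELS = [
    "acl:relcl", "case:voc", "compound:prt", "csubj:cleft", "csubj:cop",
    "mark:prt", "nmod:poss", "obl:prep", "obl:tmod", "xcomp:pred",
]


def split_label(label):
    """Split a label into (universal relation, subtype or None)."""
    relation, _, subtype = label.partition(":")
    return relation, subtype or None


# Group the language-specific subtypes under their universal relation.
subtypes_by_relation = {}
for label in IRISH_SUBTYPE_LABELS:
    relation, subtype = split_label(label)
    subtypes_by_relation.setdefault(relation, []).append(subtype)

print(subtypes_by_relation)
```

Falling back to the universal relation in this way is one option when a downstream tool does not recognise language-specific subtypes.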
