Tokenization
The low-level tokenization of the UD Armenian Treebanks (both Eastern and Western Armenian) generally adopts the Հայերենի ծառադարան - ArmTDP standard:
- In general, tokens are delimited by whitespace.
- Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
Some special cases worth mentioning:
- An abbreviation marked by a period, as in թ. “year”, becomes two tokens, թ and . .
- A compound containing a hyphen becomes three tokens (two words and the hyphen), as in անգլո-ամերիկյան “Anglo-American”, պատմա-բանասիրական “historical-philological”. In these cases, the first token is a special form of adjective that never occurs independently. Compounds without a hyphen are not split, thus ռազմածովային “navy” is one token but հասարակական-քաղաքական “civic-social” is three tokens. Another common case of splitting on a hyphen is reduplicative or echo words, as in մեծ-մեծ “very big”, շուն-մուն “dog or something”.
- Inflectional bound morphemes and hyphens that follow phrases or sentences used as names in quotation marks, or that follow abbreviations marked by a period, are split off and treated as separate tokens. For example, «Երկիր Նաիրի»-ից “from ‘Yerkir Nairi’” becomes { « , Երկիր , Նաիրի , » , - , ից } and 1937 թ.-ին “in the year 1937” becomes { 1937 , թ , . , - , ին }. The word before the hyphen is the head, and the bound morpheme is attached to it with the deprel dep. Tokenizing and segmenting this way seems easier for parsing.
- Words that contain “infixed” punctuation (question, exclamation, emphasis and Armenian abbreviation marks), as in ինչո՞ւ “why?”, are considered multi-word tokens and become two tokens, ինչու and ՞ . The exception is the apostrophe, as in Ժաննա դ՚Արկ “Joan of Arc”: the word is split after the apostrophe, which stays with the preceding token, giving { Ժաննա , դ՚ , Արկ }.
- Generally, every punctuation character constitutes a token of its own. Thus »,— will become three tokens.
- Exceptions are conventional multi-character punctuation marks ( … , …. ) and emojis and smileys ( :) , ^_^ , ։Ճ , etc.), which are kept as single tokens. Conventional non-Armenian multi-character punctuation marks and terms, such as ?! , are also tokenized as single tokens.
- Special symbols before and after numerical expressions, as in $250 , 4,81% , +32°С , are tokenized separately, so the tokens are { $ , 250 } , { 4,81 , % } , { + , 32 , °С }.
- Email addresses, URLs, and tweet-style names are treated as single tokens: muster@muster.am , https://github.com , @gov_am .
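The whitespace, punctuation and hyphen rules above can be approximated with a single regular expression. The following is only a minimal sketch, not the treebank's actual tokenizer: the name `tokenize` and the pattern are assumptions made for illustration, and the sketch deliberately ignores infixed punctuation, emoticons, decimal numbers and units such as °С.

```python
import re

# Hypothetical pattern; earlier alternatives take priority over later ones.
TOKEN_RE = re.compile(
    r"""
      https?://\S+                     # URL kept whole
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # email address kept whole
    | @\w+                             # tweet-style name kept whole
    | …\.? | \?!                       # conventional multi-character punctuation
    | \w+                              # a run of letters or digits
    | \S                               # any other character is its own token
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text (illustrative only)."""
    return TOKEN_RE.findall(text)


print(tokenize("անգլո-ամերիկյան"))
print(tokenize("«Երկիր Նաիրի»-ից"))
print(tokenize("1937 թ.-ին"))
```

Because hyphens and periods fall through to the last alternative, hyphenated compounds, abbreviations and bound morphemes after quotation marks are split without any dedicated rule.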
Special cases for numerical expressions:
- Numerical expressions are treated as single words as long as they do not contain spaces or hyphens, for example, 355,089.40 . Decimal numbers (with the Armenian decimal comma or the English decimal point) are also kept as one token, e.g. 2.1 , 2,1 .
- Exception: time expressions and dates like 19:45 , 20.05.2000 or 20/05/2000 are split into separate tokens: three for the time, { 19 , : , 45 }, and five for each date, { 20 , . , 05 , . , 2000 } and { 20 , / , 05 , / , 2000 }.
- Numerical expressions with or without a hyphen and Armenian endings, as well as adjectives and other non-numerals which contain digits (e.g. 1-ին “1st” , 2րդ “2nd” , 1000-ական “by 1000” , 1700-ամյա “1700-year-old” , ՆԱՏՕ-ական “belonging-to-NATO” , ՏՈՒ-154Մ “TU-154M”), are treated as single tokens as long as they do not contain inflectional endings. Forms with inflectional endings (e.g. 3-ով “3.Ins” , 1950-ականներին “in 1950s” , 20-ամյակը “the 20th anniversary”) are split into separate tokens, in this case three each: { 3 , - , ով } , { 1950 , - , ականներին } , { 20 , - , ամյակը }.
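The first two rules in this list (plain numbers versus times and dates) can be sketched as follows. This is an illustrative heuristic only: the function name `split_numeric` and both patterns are assumptions, and the choice between derivational and inflectional endings in the third rule is not modelled because it requires lexical knowledge.

```python
import re

# Hypothetical shapes: a date has the same separator twice (20.05.2000, 20/05/2000),
# a time has a colon between hours and minutes (19:45).
DATE_OR_TIME_RE = re.compile(r"\d{1,2}([./])\d{1,2}\1\d{2,4}|\d{1,2}:\d{2}")
# A plain number may contain comma or period separators (355,089.40, 2.1, 2,1).
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def split_numeric(chunk):
    """Tokenize one whitespace-delimited chunk that looks numeric."""
    if DATE_OR_TIME_RE.fullmatch(chunk):
        # Dates and times: every digit run and every separator is a token.
        return re.findall(r"\d+|\D", chunk)
    if NUMBER_RE.fullmatch(chunk):
        # Ordinary numbers stay whole.
        return [chunk]
    # Anything else is left untouched by this sketch.
    return [chunk]


for example in ["355,089.40", "2,1", "19:45", "20.05.2000", "20/05/2000"]:
    print(example, split_numeric(example))
```

The date pattern is checked first because a string such as 20.05.2000 would otherwise also match the plain-number pattern.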
Multi-word tokens
See above, the “infixed” punctuation.
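As a rough illustration of how a surface word with an infixed mark maps to its two tokens, consider the sketch below. The helper name `split_infixed` and the exact set of code points are assumptions chosen for this example (emphasis, exclamation, question and abbreviation marks); the apostrophe is intentionally left out because it is handled differently.

```python
# Assumed set of Armenian marks that can appear inside a word:
# U+055B emphasis, U+055C exclamation, U+055E question, U+055F abbreviation.
INFIXED_MARKS = {"\u055B", "\u055C", "\u055E", "\u055F"}


def split_infixed(word):
    """Return [bare word, mark, ...] if the word carries infixed marks."""
    marks = [ch for ch in word if ch in INFIXED_MARKS]
    if not marks:
        return [word]
    # The word token is the surface form with the marks removed.
    bare = "".join(ch for ch in word if ch not in INFIXED_MARKS)
    return [bare] + marks


print(split_infixed("ինչո՞ւ"))
```

The surface form remains the multi-word token, while the returned items correspond to the syntactic words it contains.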
Pronouns and adverbs
- Indefinite pronouns and adverbs like ինչ-որ, փոքր-ինչ, դույզն-ինչ, ինչ-ինչ “something, somewhat”, etc. are split like compounds containing a hyphen and become three tokens (two words and the hyphen).
Verb forms, analytical grammatical forms, negation
- The forms of the necessitative mood, analytical causative, complex tenses, complex comparatives, etc. are split according to the orthographic principle: { պիտի , վազեն } “they must run”, { գրել , տվեց } “made write”, { վազում , եմ } “I run”, { ավելի , լուրջ } “more serious”.
- մի and ոչ used as negation markers with verbs, adjectives, pronouns and other words are tokenized according to the orthographic rules: { մի , գրիր } “don’t write”, { ոչ , պաշտոնական } “unofficial”, { ոչ , մի , կերպ } “in no way”.
Sentence splitting
Each sentence contains only one root. Splitting is usually performed after an end-of-sentence full stop or after a dot, ellipsis or colon when these punctuation marks separate unrelated subparts of a sentence. Items in a list may sometimes be rendered as separate sentences.