This page pertains to UD version 2.

Tokenization

The low-level tokenization of the UD Armenian Treebanks (both Eastern and Western Armenian) generally adopts the Հայերենի ծառադարան - ArmTDP standard:

In general, tokens are delimited by whitespace.
Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization. Some special cases worth mentioning:
- An abbreviation marked by a period, as in թ. “year”, becomes two tokens, թ and . .
- A compound containing a hyphen becomes three tokens (two words and the hyphen), as in անգլո-ամերիկյան “anglo-american”, պատմա-բանասիրական “historical-philological”. In these cases, the first token is a special form of adjective that never occurs independently. Compounds without a hyphen are not split, thus ռազմածովային “navy” is one token but հասարակական-քաղաքական “socio-political” would be three tokens. Another common case of splitting on the hyphen is reduplicative or echo words, as in մեծ-մեծ “very big”, շուն-մուն “dog or something”.
- Inflectional bound morphemes and hyphens that follow phrases or sentences used as names in quotation marks, or that follow abbreviations marked by a period, as in «Երկիր Նաիրի»-ից “from ‘Yerkir Nairi’” or 1937 թ.-ին “in the year 1937”, are split off as separate tokens: { « , Երկիր , Նաիրի , » , - , ից } and { 1937 , թ , . , - , ին } . The word before the hyphen is the head, and the bound morpheme is attached to it with the deprel dep. Tokenizing and segmenting this way seems easier for parsing.
- Words that contain “infixed” punctuation (question, exclamation, emphasis and Armenian abbreviation marks), as in ինչո՞ւ “why?”, are treated as multi-word tokens and become two tokens, ինչու and ՞ . The exception is the apostrophe, as in Ժաննա դ՚Արկ “Joan of Arc”: the word is split after the apostrophe, which stays attached to the preceding part, { Ժաննա , դ՚ , Արկ }.
- Generally, every punctuation character constitutes a token of its own. Thus »,— will become three tokens.
- Exceptions are conventional multi-character punctuation marks ( … , …. ) and emojis and smileys ( :) , ^_^ , ։Ճ , etc.). Conventional non-Armenian multi-character punctuation marks and terms are also tokenized as single tokens: ?! .
Special symbols before and after numerical expressions, as in $250 , 4,81% , +32°С , are tokenized separately (so the tokens are { $ , 250 } , { 4,81 , % } , { + , 32 , °С }).
Email addresses, URLs, and tweet-style names are treated as single tokens: muster@muster.am , https://github.com , @gov_am .
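
The basic rules above can be approximated with a short Python sketch. It is an illustration only, not the tool used to build the treebanks: the names `WORDLIKE`, `MULTI_PUNCT` and `tokenize` are hypothetical, the regular expressions are simplified assumptions, and infixed punctuation, the apostrophe, unit symbols such as °С, dates, and numerals with endings (which need the extra rules described elsewhere on this page) are deliberately not handled.

```python
import re
import unicodedata

# Items kept whole when they start a token (simplified, assumed patterns).
WORDLIKE = re.compile(
    r"""
    https?://\S+                      # URLs
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email addresses
    | @\w+                            # tweet-style names
    | \d+(?:[.,]\d+)*(?!\w)           # numbers with decimal/grouping separators
    """,
    re.VERBOSE,
)

# Conventional multi-character punctuation and a few smileys kept as one token.
MULTI_PUNCT = re.compile(r"\.{3,4}|…\.?|\?!|:\)|\^_\^")


def tokenize(text):
    """Split on whitespace, then separate punctuation and symbols."""
    tokens = []
    for chunk in text.split():
        word = ""
        i = 0
        while i < len(chunk):
            # Multi-character punctuation may follow a word directly;
            # word-like items are only tried at the start of a token.
            m = MULTI_PUNCT.match(chunk, i)
            if m is None and not word:
                m = WORDLIKE.match(chunk, i)
            if m:
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(m.group())
                i = m.end()
                continue
            ch = chunk[i]
            # Unicode categories P* (punctuation) and S* (symbols)
            # make a character a token of its own.
            if unicodedata.category(ch)[0] in "PS":
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(ch)
            else:
                word += ch
            i += 1
        if word:
            tokens.append(word)
    return tokens


print(tokenize("«Երկիր Նաիրի»-ից"))
print(tokenize("$250 4,81% muster@muster.am"))
```

Trying the word-like patterns only at the start of a token keeps a form such as 2րդ together, while the per-character fallback yields the three-way split of hyphenated compounds.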

Some special cases worth mentioning:

Numerical expressions are treated as single words as long as they do not contain spaces or hyphens, for example, 355,089.40 . Decimal numbers (with the Armenian decimal comma or the English decimal point) are also kept as one token, e.g. 2.1 , 2,1 .
EXCEPTION: Time expressions and dates like 19:45 or 20.05.2000 , 20/05/2000 are split into separate tokens (in this case, three { 19 , : , 45 } and five { 20 , . , 05 , . , 2000 } , { 20 , / , 05 , / , 2000 }).
Numerical expressions with or without a hyphen and Armenian endings, as well as adjectives and other non-numerals that contain digits (e.g. 1-ին “1st” , 2րդ “2nd” , 1000-ական “by 1000” , 1700-ամյա “1700-year-old” , ՆԱՏՕ-ական “belonging-to-NATO” , ՏՈՒ-154Մ “TU-154M”), are treated as single tokens as long as they do not contain inflectional endings. Forms with inflectional endings (e.g. 3-ով “3.Ins” , 1950-ականներին “in 1950s” , 20-ամյակը “the 20th anniversary” ) are split into separate tokens (in this case, three each: { 3 , - , ով } , { 1950 , - , ականներին } , { 20 , - , ամյակը }).
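
The numeric rules can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the treebank's actual procedure: `split_numeric` and the pattern names are hypothetical, and `UNINFLECTED_ENDINGS` is a tiny list taken only from the examples above, whereas a real system would need a morphological lexicon to tell derivational endings from inflectional ones.

```python
import re

# Times such as 19:45 and dates such as 20.05.2000 or 20/05/2000.
TIME_OR_DATE = re.compile(r"\d{1,2}:\d{2}|\d{1,2}([./])\d{1,2}\1\d{2,4}")
# Plain numbers, optionally with decimal or grouping separators.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
# Digits, an optional hyphen, then a run of letters.
NUM_WITH_ENDING = re.compile(r"(\d+)(-?)([^\W\d_]+)")
# Assumption: endings from the examples that do not count as inflectional.
UNINFLECTED_ENDINGS = {"ին", "րդ", "ական", "ամյա"}


def split_numeric(chunk):
    """Return the token list for one whitespace-delimited numeric chunk."""
    if TIME_OR_DATE.fullmatch(chunk):
        # Keep the separators as tokens of their own.
        return [part for part in re.split(r"([:./])", chunk) if part]
    if NUMBER.fullmatch(chunk):
        return [chunk]
    m = NUM_WITH_ENDING.fullmatch(chunk)
    if m:
        number, hyphen, ending = m.groups()
        if ending in UNINFLECTED_ENDINGS:
            return [chunk]
        return [part for part in (number, hyphen, ending) if part]
    return [chunk]


for example in ["355,089.40", "19:45", "20.05.2000", "1-ին", "3-ով"]:
    print(example, split_numeric(example))
```

The date pattern is checked before the plain-number pattern because a string like 20.05.2000 would otherwise also pass as a number with separators.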

Multi-word tokens

See above, the “infixed” punctuation.
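
As an illustration, the sketch below splits a word with one infixed mark and lays the result out in the standard CoNLL-U way, with a range line for the surface form followed by one line per token. The helper names `split_infixed` and `conllu_rows` are hypothetical, all annotation columns are left as `_`, and the order "word, then mark" simply follows the ինչու and ՞ example above.

```python
# Armenian emphasis, exclamation, question and abbreviation marks.
INFIX_MARKS = "\u055b\u055c\u055e\u055f"


def split_infixed(surface):
    """Return (word, mark) if surface holds exactly one infixed mark, else None."""
    found = [ch for ch in surface if ch in INFIX_MARKS]
    if len(found) != 1:
        return None
    mark = found[0]
    return surface.replace(mark, ""), mark


def conllu_rows(surface, start_id=1):
    """Build 10-column CoNLL-U rows; only ID and FORM are filled in."""
    blank = "\t_" * 8
    parts = split_infixed(surface)
    if parts is None:
        return [f"{start_id}\t{surface}{blank}"]
    word, mark = parts
    return [
        f"{start_id}-{start_id + 1}\t{surface}{blank}",  # multi-word token range
        f"{start_id}\t{word}{blank}",
        f"{start_id + 1}\t{mark}{blank}",
    ]


print("\n".join(conllu_rows("ինչո\u055eւ")))
```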

Pronouns and adverbs

Indefinite pronouns and adverbs like ինչ-որ, փոքր-ինչ, դույզն-ինչ, ինչ-ինչ “something, somewhat”, etc. are split like compounds containing a hyphen and become three tokens (two words and the hyphen).

Verb forms, analytical grammatical forms, negation

The forms of the necessitative mood, analytical causatives, complex tenses, complex comparatives, etc. are split according to the orthographic principle: { պիտի , վազեն } “they must run”, { գրել , տվեց } “made write”, { վազում , եմ } “I run”, { ավելի , լուրջ } “more serious”.
մի and ոչ used as negation markers with verbs, adjectives, pronouns and other words are tokenized according to the orthographic rules: { մի , գրիր } “don’t write”, { ոչ , պաշտոնական } “unofficial”, { ոչ , մի , կերպ } “in no way”.

Sentence splitting

Each sentence contains only one root. Splitting is usually performed after an end-of-sentence full stop or after a dot, ellipsis or colon when these punctuation marks separate unrelated subparts of a sentence. Items in a list may sometimes be rendered as separate sentences.
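
The one-root constraint can be checked mechanically on CoNLL-U data. The sketch below is a minimal illustration, with `count_roots` as a hypothetical helper name; it relies only on the standard CoNLL-U layout, where HEAD is the seventh column and the root has HEAD 0. The two-token example reuses { վազում , եմ } from above, and its `aux` relation is an assumption made for the illustration.

```python
def count_roots(sentence_block):
    """Count syntactic words whose HEAD column is 0 in one CoNLL-U sentence."""
    roots = 0
    for line in sentence_block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multi-word token ranges and empty nodes
        if cols[6] == "0":
            roots += 1
    return roots


rows = [
    ["1", "վազում", "_", "_", "_", "_", "0", "root", "_", "_"],
    ["2", "եմ", "_", "_", "_", "_", "1", "aux", "_", "_"],
]
sentence = "\n".join("\t".join(row) for row in rows)
print(count_roots(sentence) == 1)
```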