Typos and Other Errors in Underlying Text
Sometimes the text underlying a UD treebank does not conform to canonical spelling or other grammatical rules of the language. In most situations it is desirable to preserve the error because taggers and parsers that learn their models from the data should learn how to deal with noisy input too. On the other hand, it is also desirable to mark such places as errors and to show the correct spelling, so that an application can hide bad sentences or present their correct version when necessary.
The recommendations on this page are designed with sporadic errors in mind. Technically they could also be applied to learner corpora, which are full of errors; however, learner corpora usually require more thought, and the main question is: do we want to guess what the author would have written if they knew the language better, or do we want to approximate “the grammar in their head,” which is probably a mixture of the intended language and a language they know better?
Mechanisms similar to typo handling could also be used to annotate historical corpora with historical spelling; see below for more details.
Misspelled Word
The easiest type of error is a simple typo in a single word, especially if the result is a non-word. (If the result is another word of the language, e.g. if one writes too instead of two in English, then we must decide that the author really wanted to say something else, and it may not always be obvious.)
The FORM field and the text attribute at the beginning of the sentence should always contain the form that really occurred in the original text. On the other hand, LEMMA should use normalized spelling; thus if the text says kats instead of cats, the lemma will be cat, not kat. Now the morphological features should include the feature Typo=Yes that marks the typo. This is important: it ensures that there is a unique mapping from lemma + part-of-speech tag + morphological features to the correct word form. Without Typo=Yes, one could infer from the corpus that the correct plural form of the English noun cat is kats. (The mapping is actually not unique for wrong forms, as all possible misspellings are still marked by the same Typo=Yes feature.)
Finally, neither the lemma nor the morphological features tell the user what the correct spelling at this position would be. We want to list the correct form as well. This is not a morphological feature, so we must put it in the MISC column instead: CorrectForm=cats. Here is a full example:
# text = I have two kats.
1	I	I	PRON	_	Case=Nom|Number=Sing|Person=1|PronType=Prs	2	nsubj	_	_
2	have	have	VERB	_	Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	0	root	_	_
3	two	two	NUM	_	NumType=Card	4	nummod	_	_
4	kats	cat	NOUN	_	Number=Plur|Typo=Yes	2	obj	_	CorrectForm=cats|SpaceAfter=No
5	.	.	PUNCT	_	_	2	punct	_	_
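As an illustration of how an application might consume this annotation, here is a minimal Python sketch that lists every form marked with Typo=Yes together with its CorrectForm. The helper names parse_attrs and find_typos are hypothetical and not part of any UD tool; the sketch assumes well-formed ten-column CoNLL-U input.

```python
def parse_attrs(column):
    """Parse a FEATS or MISC value such as 'A=1|B=2' into a dict ('_' means empty)."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def find_typos(conllu):
    """Return (FORM, CorrectForm or None) for every line whose FEATS contain Typo=Yes."""
    typos = []
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        feats, misc = parse_attrs(cols[5]), parse_attrs(cols[9])
        if feats.get("Typo") == "Yes":
            typos.append((cols[1], misc.get("CorrectForm")))
    return typos


tokens = [
    "1 I I PRON _ Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj _ _",
    "2 have have VERB _ Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 0 root _ _",
    "3 two two NUM _ NumType=Card 4 nummod _ _",
    "4 kats cat NOUN _ Number=Plur|Typo=Yes 2 obj _ CorrectForm=cats|SpaceAfter=No",
    "5 . . PUNCT _ _ 2 punct _ _",
]
# Columns are typed with spaces for readability and converted to the tabs CoNLL-U requires.
sentence = "# text = I have two kats.\n" + "\n".join(t.replace(" ", "\t") for t in tokens)

print(find_typos(sentence))  # [('kats', 'cats')]
```

Keying the lookup on Typo=Yes rather than on the mere presence of CorrectForm is a deliberate choice in this sketch, because CorrectForm may also accompany forms that are not typos.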
Typo=Yes is not intended for all typographical errors in a text, only those that are internal to (the rendering of) a single word in the language, including wrongly split words as described below. Errors in spacing around words, or erroneous insertion or deletion of words, are represented via other means (see below).
Misspelled Multiword Token
A typo is a surface feature: in the case of a multiword surface token, there can be a typo in the token form which does not appear in the reconstructed forms of the corresponding syntactic words. For example, the Spanish words vamos and nos may be merged into one token. In isolation, neither of the two words is spelled with a stress-marking accent. However, when merged, the resulting token has stress on the third syllable from the end, which requires adding an acute accent over that syllable: vámonos. If the accent is omitted, it is a typo in the surface token while the reconstructed syntactic words are spelled correctly. Therefore, the feature Typo=Yes is exceptionally allowed to occur on the MWT line, although normally the FEATS column must be empty there. See the example below.
If a language has solely concatenative multiword tokens (that is, the form of the MWT is always identical to the concatenation of the forms of the syntactic words), then the language-specific guidelines may rule that Typo=Yes should be placed on the line of the misspelled word as usual. However, in such cases Typo=Yes must not also occur on the MWT line. (Allowing both could lead to confusion about the redundancy.)
# text = Vamonos al mar.
# text_en = Let's go to the sea.
1-2	Vamonos	_	_	_	Typo=Yes	_	_	_	CorrectForm=Vámonos
1	Vamos	ir	VERB	_	Mood=Imp|Number=Plur|Person=1|VerbForm=Fin	0	root	_	_
2	nos	nosotros	PRON	_	Case=Acc|Number=Plur|Person=1|PronType=Prs|Reflex=Yes	1	expl:pv	_	_
3-4	al	_	_	_	_	_	_	_	_
3	a	a	ADP	_	_	5	case	_	_
4	el	el	DET	_	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	5	det	_	_
5	mar	mar	NOUN	_	Gender=Masc|Number=Sing	1	obl	_	SpaceAfter=No
6	.	.	PUNCT	_	_	1	punct	_	_
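Whether a multiword token is concatenative in the sense used above can be checked mechanically. The following Python sketch is only an illustration: the function names are hypothetical, and comparing forms case-insensitively is an assumption of the sketch rather than something the guidelines prescribe.

```python
def multiword_tokens(rows):
    """Yield (MWT form, [word forms]) for each range line such as '1-2'.

    rows: lists of CoNLL-U columns; only ID and FORM are used here.
    """
    forms = {row[0]: row[1] for row in rows}
    for row in rows:
        if "-" in row[0]:
            first, last = map(int, row[0].split("-"))
            yield row[1], [forms[str(i)] for i in range(first, last + 1)]


def is_concatenative(mwt_form, word_forms):
    """True if the MWT form equals the concatenation of its word forms (ignoring case)."""
    return mwt_form.lower() == "".join(word_forms).lower()


# ID and FORM columns of the Spanish example above.
rows = [
    ["1-2", "Vamonos"], ["1", "Vamos"], ["2", "nos"],
    ["3-4", "al"], ["3", "a"], ["4", "el"],
    ["5", "mar"], ["6", "."],
]
for form, parts in multiword_tokens(rows):
    print(form, parts, is_concatenative(form, parts))
```

Both multiword tokens in the Spanish example are reported as non-concatenative (Vamonos differs from Vamos + nos, and al differs from a + el), which is why a typo in the surface token cannot simply be attributed to one of the syntactic words.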
Intentionally Noncanonical Spellings
Abbreviations and expressive spelling variants are not considered typos, but may be paired with a CorrectForm for the canonical spelling. See the Abbr=Yes and Style=Expr features.
Wrongly Split Word
If the word is erroneously written with one or more spaces, we have several incorrect tokens. We do not join them into one token with a space, although Universal Dependencies since version 2 allows words with spaces. This option is reserved for very specific situations, usually quite marginal in the language (with the exception of Vietnamese) but predictable; it is not meant for arbitrary errors. Instead, UD defines the goeswith relation to connect the parts of the word. The first part is always the head, the other parts are attached to it via goeswith.
Parts attaching as goeswith should not themselves have any dependents.
If the treebank provides Enhanced Dependencies, goeswith relations should be the same as in Basic Dependencies, and goeswith dependents should not participate in any additional enhanced relations.
The head should bear the part-of-speech tag, lemma, and morphological annotation of the entire word. Beginning with UD release 2.10, any treebank that uses the Typo feature must apply it to all words with goeswith dependents, as an extra space within a word is a misrendering of that word.
Example:
# text = This spel ling is wrong.
1	This	this	DET	_	Number=Sing|PronType=Dem	2	det	_	_
2	spel	spelling	NOUN	_	Number=Sing|Typo=Yes	5	nsubj	_	CorrectForm=spelling
3	ling	_	X	_	_	2	goeswith	_	_
4	is	be	AUX	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	5	cop	_	_
5	wrong	wrong	ADJ	_	_	0	root	_	SpaceAfter=No
6	.	.	PUNCT	_	_	5	punct	_	_
The goeswith solution is only for segmentations that violate syntactic word boundaries. If the extra space is inserted at a syntactic word boundary (e.g., between a clitic and its base), this is instead represented as two syntactic words, the first of which has CorrectSpaceAfter=No. (Note that a multi-word token should not be used in this case: multi-word tokens are strictly for syntactically complex single orthographic tokens, whether spaced correctly or not.)
To summarize the rules about goeswith:
- Any surface word attaching as goeswith in Basic Dependencies should have an orthographic form but empty lemma, UPOS of X, and no morphological features (empty FEATS). If the corpus uses DEPS, it should have the same goeswith relation and no other relations.
- The goeswith dependents of a head must be immediately after it and must be consecutive (cannot skip any words).
- The head must not have goeswith as its deprel.
- If the treebank uses Typo=Yes, it must appear in the FEATS column for the head.
- A goeswith dependent may have MISC features. For example, the last goeswith dependent for a given head may have SpaceAfter=No and CorrectSpaceAfter=Yes. Neither of these should apply to the other parts of the goeswith structure, as they concern spacing between true syntactic words, whereas goeswith indicates excessive space within a true syntactic word.
- Within a multi-word token, only the first word is permitted to attach to a previous word as goeswith.
- Within a multi-word token, only the last word is permitted to head any goeswith dependents.
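Several of the rules above lend themselves to automatic checking. The Python sketch below is a hypothetical helper, not the official UD validator: it covers only the rules on LEMMA/UPOS/FEATS, adjacency, missing dependents, the deprel of the head, and Typo=Yes on the head, and it assumes a fully annotated sentence with integer HEAD values and consecutive word IDs.

```python
def check_goeswith(rows, treebank_uses_typo=True):
    """Check some goeswith rules on one sentence; return a list of error messages.

    rows: ten-column CoNLL-U lines, each already split into a list of strings.
    """
    # Keep plain words only; range lines ('1-2') and empty nodes ('1.1') are skipped.
    words = {int(r[0]): r for r in rows if r[0].isdigit()}
    errors, heads = [], set()
    for wid in sorted(words):
        row = words[wid]
        if row[7] != "goeswith":
            continue
        head = int(row[6])
        heads.add(head)
        if row[2] != "_" or row[3] != "X" or row[5] != "_":
            errors.append(f"word {wid}: goeswith part needs empty LEMMA, UPOS X, empty FEATS")
        # Every word between the head and this part must be a part of the same head.
        between = [words.get(i) for i in range(head + 1, wid)]
        if head >= wid or any(
            b is None or b[7] != "goeswith" or int(b[6]) != head for b in between
        ):
            errors.append(f"word {wid}: must directly follow head {head} or its other parts")
        if any(int(other[6]) == wid for other in words.values()):
            errors.append(f"word {wid}: goeswith part must not have dependents")
    for head in sorted(heads):
        row = words.get(head)
        if row is None:
            errors.append(f"head {head}: not a word of this sentence")
            continue
        if row[7] == "goeswith":
            errors.append(f"word {head}: head of goeswith must not itself be goeswith")
        if treebank_uses_typo and "Typo=Yes" not in row[5].split("|"):
            errors.append(f"word {head}: head of goeswith must carry Typo=Yes")
    return errors


rows = [line.split() for line in [
    "1 This this DET _ Number=Sing|PronType=Dem 2 det _ _",
    "2 spel spelling NOUN _ Number=Sing|Typo=Yes 5 nsubj _ CorrectForm=spelling",
    "3 ling _ X _ _ 2 goeswith _ _",
    "4 is be AUX _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 5 cop _ _",
    "5 wrong wrong ADJ _ _ 0 root _ SpaceAfter=No",
    "6 . . PUNCT _ _ 5 punct _ _",
]]
print(check_goeswith(rows))  # []
```

The rules about DEPS and about multi-word tokens are left out to keep the sketch short; they would need the range lines and the DEPS column as additional input.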
Wrongly Merged Words
UD has two mechanisms capable of capturing that two words are not separated by whitespace: the SpaceAfter=No attribute in MISC, and multi-word tokens. The former is considered low-level and it is normally used between a word and a punctuation node. The latter is intended for situations where two real words are merged into one, but it is assumed that these cases adhere to regular rules of the grammar, i.e., they are not arbitrary errors. Also, the format of multi-word token annotation is technically more complex because it allows for non-concatenative fusions. For the annotation of poorly edited text, the low-level SpaceAfter attribute seems quite suitable.
As with Typo=Yes and CorrectForm=X, it is desirable to indicate that the space is missing by error. Therefore, SpaceAfter=No should be accompanied by CorrectSpaceAfter=Yes.
Though CorrectSpaceAfter=Yes signals a kind of typographical error in the sentence, Typo=Yes should not be applied unless there is an error in how a word is rendered, and that error is internal to the word. Missing spaces between legitimate words are considered external to the word.
Note that a similar mechanism can also be used to mark excess spaces around punctuation (using CorrectSpaceAfter=No). Punctuation should not be attached to another node via goeswith because they do not together constitute a word. Example:
# text = This spellingis wrong .
1	This	this	DET	_	Number=Sing|PronType=Dem	2	det	_	_
2	spelling	spelling	NOUN	_	Number=Sing	4	nsubj	_	SpaceAfter=No|CorrectSpaceAfter=Yes
3	is	be	AUX	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	4	cop	_	_
4	wrong	wrong	ADJ	_	_	0	root	_	CorrectSpaceAfter=No
5	.	.	PUNCT	_	_	4	punct	_	_
A Combination of the Above
Here is a more complex example with several error types:
# text = This spel lingi$ wrong .
1	This	this	DET	_	Number=Sing|PronType=Dem	2	det	_	_
2	spel	spelling	NOUN	_	Number=Sing|Typo=Yes	5	nsubj	_	CorrectForm=spelling
3	ling	_	X	_	_	2	goeswith	_	SpaceAfter=No|CorrectSpaceAfter=Yes
4	i$	be	AUX	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|Typo=Yes|VerbForm=Fin	5	cop	_	CorrectForm=is
5	wrong	wrong	ADJ	_	_	0	root	_	CorrectSpaceAfter=No
6	.	.	PUNCT	_	_	5	punct	_	_
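To show how these annotations work together, the following Python sketch rebuilds both the text as written and a corrected version from the token lines of the example above. It is a simplified illustration under stated assumptions: the function names are hypothetical, multi-word tokens are not handled, and the corrected text relies only on CorrectForm, goeswith, SpaceAfter, and CorrectSpaceAfter.

```python
def parse_attrs(column):
    """Parse a MISC value such as 'A=1|B=2' into a dict ('_' means empty)."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def join_tokens(items):
    """Concatenate [form, space_after, ...] entries into a string."""
    return "".join(item[0] + (" " if item[1] else "") for item in items).rstrip()


def render(rows):
    """Return (text as written, corrected text) for a sentence without multi-word tokens."""
    written, corrected = [], []  # entries: [form, space_after, has_correct_form]
    for row in rows:
        misc = parse_attrs(row[9])
        space = misc.get("SpaceAfter") != "No"
        # Without CorrectSpaceAfter, the spacing found in the text is taken as correct.
        correct_space = misc.get("CorrectSpaceAfter", "Yes" if space else "No") == "Yes"
        written.append([row[1], space, False])
        if row[7] == "goeswith" and corrected:
            # Part of a wrongly split word: merge it into the entry of its head.
            if not corrected[-1][2]:
                corrected[-1][0] += row[1]  # no CorrectForm on the head: glue the parts
            corrected[-1][1] = correct_space
        else:
            form = misc.get("CorrectForm", row[1])
            corrected.append([form, correct_space, "CorrectForm" in misc])
    return join_tokens(written), join_tokens(corrected)


rows = [line.split() for line in [
    "1 This this DET _ Number=Sing|PronType=Dem 2 det _ _",
    "2 spel spelling NOUN _ Number=Sing|Typo=Yes 5 nsubj _ CorrectForm=spelling",
    "3 ling _ X _ _ 2 goeswith _ SpaceAfter=No|CorrectSpaceAfter=Yes",
    "4 i$ be AUX _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|Typo=Yes|VerbForm=Fin 5 cop _ CorrectForm=is",
    "5 wrong wrong ADJ _ _ 0 root _ CorrectSpaceAfter=No",
    "6 . . PUNCT _ _ 5 punct _ _",
]]
print(render(rows))
# ('This spel lingi$ wrong .', 'This spelling is wrong.')
```

The spacing after a split word is taken from its last goeswith part, in line with the rule that SpaceAfter and CorrectSpaceAfter concern the boundary between true syntactic words.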
The following contains two errors as well as a multi-word token. Note that the second word of the multi-word token is headed by the beginning of the erroneously split word, which is outside of the multi-word token:
# text = mc donalds
1	mc	McDonald	PROPN	NNP	Number=Sing|Typo=Yes	0	root	_	CorrectForm=McDonald
2-3	donalds	_	_	_	_	_	_	_	_
2	donald	_	X	NNP	_	1	goeswith	_	_
3	s	's	PART	POS	Typo=Yes	1	case	_	CorrectForm='s
Missing Word
If one or more words are missing from the text, we treat it as ellipsis. That is, we select a constituent from the remainder of the incomplete subtree, promote it to the head of the subtree, and attach the other surviving dependents to it. If an argument of a missing verb is promoted, the other arguments and adjuncts are attached to it via the orphan relation; otherwise they are attached with the relation they would have received from the head if the missing material were present.
Note that words are sometimes missing through genuine error rather than ellipsis; we nevertheless propose an ellipsis-like annotation for such cases. For instance, errors in sentence segmentation may cause the sentence to end prematurely, after a period that was not intended to terminate the sentence.
Extra Word
If the text erroneously contains a word that should not be there, it can be treated similarly to speech disfluencies, that is, attached to the following constituent via the reparandum relation. A relatively common instance in written language is a word typed twice in a row.
Wrong Morphology or Syntax
For example, the grammar may require the dative while the actual form is nominative, or a singular may occur instead of a plural (the cars is produced in Detroit). Such errors could be treated as simple typos, but intuitively they are not in the same category (although they could co-occur with a typo, as in the cars iss produced…). It is not always obvious what the correct form is: we could correct either cars to car or is to are (but not both). Similarly, it may be clear that the actual word form is the nominative case and that it is wrong, but several other morphological cases may be plausible in the given context.
Sometimes it will not be obvious whether such forms should be classified as errors at all. In some languages the form may be dialectal or belong to another variety. (This actually applies to certain typos too: color is correct in American English, but in British English it should be colour.)
Suggestion: Keep the word as it was in the source text. Add morphological features that correspond to the actual form, not to the hypothetical correct form: English is is Number=Sing, and cars is Number=Plur. (Note that some cases will be hard to decide. Czech auto “car” is singular nominative or accusative. If the context requires the dative (autu), we only know that the actual form is wrong. But we don’t know whether it is Case=Nom or Case=Acc, thus we may have to annotate Case=Acc,Nom. If the text contained the correct form autu, which besides the dative could also mean locative, we would disambiguate it by the context and annotate Case=Dat, not Case=Dat,Loc.)
In the MISC column, we should indicate the correct form as we did with simple typos: CorrectForm=autu. We also add in the MISC column those features from the FEATS column that would differ for the correct form, and prefix them with “Correct”, e.g. CorrectCase=Dat. We will not add the Typo=Yes feature in FEATS because the word form in FORM reflects the values of the morphological features in FEATS.
As for the syntactic annotation, there does not seem to be a simple and easy-to-follow rule. Each sentence will have to be decided separately, seeking a compromise between the actual surface form and the assumed intended reading. For example, consider the Czech preposition k “to” which requires noun phrases in the dative. If the text contains (wrong) k auto instead of (correct) k autu, using the relation case(auto, k) is probably the only thing we can do, disregarding the fact that the nominative auto is ungrammatical with the preposition.
# text = The cars is produced in Detroit.
1	The	the	DET	_	Definite=Def|PronType=Art	2	det	_	_
2	cars	car	NOUN	_	Number=Plur	4	nsubj:pass	_	_
3	is	be	AUX	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	4	aux:pass	_	CorrectForm=are|CorrectNumber=Plur
4	produced	produce	VERB	_	Tense=Past|VerbForm=Part	0	root	_	_
5	in	in	ADP	_	_	6	case	_	_
6	Detroit	Detroit	PROPN	_	Number=Sing	4	obl	_	SpaceAfter=No
7	.	.	PUNCT	_	_	4	punct	_	_
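An application that wants the morphological description of the intended form can overlay the “Correct”-prefixed MISC attributes on the FEATS column. The Python sketch below illustrates one way to do this; the function names are hypothetical, and excluding CorrectForm and CorrectSpaceAfter from the overlay is an assumption of the sketch, made because these two attributes do not correspond to morphological features.

```python
# MISC attributes that start with "Correct" but are not morphological features.
NON_FEATURE_KEYS = {"CorrectForm", "CorrectSpaceAfter"}


def parse_attrs(column):
    """Parse a FEATS or MISC value such as 'A=1|B=2' into a dict ('_' means empty)."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def corrected_feats(feats, misc):
    """Return the FEATS string of the intended form, given actual FEATS and MISC."""
    values = parse_attrs(feats)
    for key, value in parse_attrs(misc).items():
        if key.startswith("Correct") and key not in NON_FEATURE_KEYS:
            values[key[len("Correct"):]] = value  # e.g. CorrectNumber=Plur -> Number=Plur
    if not values:
        return "_"
    # Sort feature names alphabetically, ignoring case, as FEATS normally are.
    ordered = sorted(values.items(), key=lambda pair: pair[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


print(corrected_feats(
    "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
    "CorrectForm=are|CorrectNumber=Plur",
))
# Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
```

The FEATS column itself is left untouched, so the annotation of the actual form remains available alongside the derived description of the intended form.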
Historical Spelling
Similar mechanisms could also be used to mark historical spelling in older texts. For instance, German sein “to be” used to be spelled seyn. It is not a typo because this form was correct at the time the text was produced. Thus in the FEATS column, we may use Style=Arch to mark that this is an archaic form. In the MISC column, we can add ModernForm=sein (an analogy to CorrectForm=sein, which we would use if we wanted to mark it as a typo).