UD for Punjabi
Tokenization and Word Segmentation
- Tokenize primarily on whitespace and punctuation.
- Hyphenated compounds should be split.
- Some clitics begin with an apostrophe and may be written attached to the preceding word (e.g. ‘ਚ “in”). These should be tokenized separately.
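The tokenization rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build any treebank: the function names (tokenize, split_chunk), the set of apostrophe characters, and the choice to keep the hyphen as its own token are all assumptions made for the example. Character classes come from unicodedata rather than the regex class \w, because Python's \w does not match Gurmukhi vowel signs and would break words apart.

```python
import unicodedata

# Assumption: clitics may be written with any of these apostrophe-like characters.
APOSTROPHES = {"'", "\u2018", "\u2019"}


def is_punct(ch):
    """True for any character in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    tokens = []
    current = ""
    for i, ch in enumerate(chunk):
        next_is_word = i + 1 < len(chunk) and not is_punct(chunk[i + 1])
        if ch in APOSTROPHES and next_is_word:
            # The apostrophe opens a clitic: close the host word and
            # start a new token that keeps the apostrophe.
            if current:
                tokens.append(current)
            current = ch
        elif is_punct(ch):
            # Any other punctuation (including a hyphen) is its own token.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def tokenize(text):
    """Whitespace split first, then punctuation, hyphen and clitic splitting."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens


if __name__ == "__main__":
    # Illustrative input: a word with the clitic written attached to it.
    print(tokenize("ਘਰ‘ਚ"))
```

A known limitation of this sketch is that it cannot tell an opening single quotation mark from a clitic apostrophe, so a word in single quotes would keep the opening quote attached. Real data would need a list of known clitics or manual review.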
Morphology
Tags
- Use the full range of UPOS tags.
- Aspectual light verbs should be tagged VERB since they take full inflectional paradigms.
Features
- TBD
Syntax
- Special relations:
  - acl:relcl for relative adnominal clauses. These must contain a relative pronoun; otherwise use plain acl.
  - aux:pass for the passive auxiliary ਜਾਣਾ.
  - compound:lvc for noun/adjective + verb constructions.
  - compound:redup for reduplication.
  - compound:svc for aspectual light verbs.
  - nsubj:pass for passivized subjects.
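As a rough aid for annotators, the relation subtypes listed above can be kept in a small lookup table and used to flag subtyped labels that these guidelines do not mention. This is a minimal sketch: the names SPECIAL_RELATIONS and unlisted_subtypes are hypothetical, and it assumes the list above is the complete set of subtypes in use, which the guidelines do not state. Flagged labels are candidates for review, not necessarily errors.

```python
# Subtyped relations named in the guidelines, with the use given for each.
SPECIAL_RELATIONS = {
    "acl:relcl": "relative adnominal clause containing a relative pronoun",
    "aux:pass": "passive auxiliary ਜਾਣਾ",
    "compound:lvc": "noun/adjective + verb construction",
    "compound:redup": "reduplication",
    "compound:svc": "aspectual light verb",
    "nsubj:pass": "passivized subject",
}


def unlisted_subtypes(deprels):
    """Return, sorted, the subtyped labels (those containing ':')
    that do not appear in SPECIAL_RELATIONS."""
    return sorted({d for d in deprels if ":" in d and d not in SPECIAL_RELATIONS})


if __name__ == "__main__":
    # Hypothetical DEPREL column values taken from some annotated sentences.
    labels = ["nsubj", "aux:pass", "obl:tmod", "compound:svc", "root"]
    print(unlisted_subtypes(labels))
```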
Treebanks
There is 1 Punjabi UD treebank: