UD version 2
This is the workspace that was used for preliminary discussions of v2 of the universal guidelines in August-September 2016. It is now preserved mainly for archival purposes. The current v2 draft can be found here.
Deadlines (revised 2016-10-17):
- 2016-10-17: Draft guidelines made available for feedback
- 2016-12-01: Guidelines fixed for spring release and CoNLL shared task
The main issues
- Core dependents. The distinction between core dependents and the rest is fundamental to the whole taxonomy. Having specific and cross-linguistically consistent guidelines for core dependents is therefore crucial for putting the whole enterprise on a solid footing. This involves clarifying the treatment of (among other things) double objects, reflexives, expletives, copula constructions and valency-changing operations. Relevant reports from the Uppsala meeting include: copula, clitics.
- Functional labels. Cross-linguistic guidelines for the use of functional labels such as aux, det, cop. There is currently a lot of variation around this. Representing lexical heads promotes cross-linguistic parallelism, but only if we can agree on what lexical heads are.
- Tokenization (or perhaps better, word segmentation). We need to be able to handle the whole spectrum from multitoken words in Vietnamese to multiword tokens in Turkish. Ideally, we should also set up more substantial criteria for when to split tokens into words and vice versa. On this issue, there is a relevant paper dealing with the Turkish case. See also the report from the Uppsala meeting: tokenization.
- Enhanced dependencies. Having a first version of the guidelines for enhanced dependencies is important not just for its own sake, but also because it has implications for the basic dependencies. If we know that something can be captured in the enhanced dependencies, we don’t need to clutter the basic dependencies with this information. Examples of constructions that can benefit from this are control verbs and light verb constructions. In this connection, it would also be relevant to discuss what language-specific subtypes can and cannot be used for. We seem to have a lot of inconsistencies here. Report from Uppsala meeting: future.
- Ellipsis. There seems to be a consensus that we should get rid of the remnant relation, but it is still unclear what we should put in its place. See the report from the Uppsala meeting here: ellipsis. Conceivably, the enhanced dependencies could be put to use here as well.
- Part-of-speech tags and their relation to syntax. To what extent should the part-of-speech tag be predictable from the syntactic relation and vice versa? For example, does “det” imply “DET” (rather than “PRON”) or does “DET” imply “det” (or both or neither)? Coming up with a more consistent set of principles for making these decisions will be important to achieve (better) cross-language consistency.
- Features. Check the language-specific features and values defined so far in our treebanks. Add new values to existing features where necessary. Do we need entirely new features as well? Evidentiality perhaps?
- Coordination. We may want to revise the guidelines for coordination and similar constructions. (See the report from the Uppsala meeting: coordination, and a position paper.)
- CoNLL-U. The definition of the CoNLL-U format may have to be revised in the light of decisions about tokenization (see above). In addition, we should standardize comments for sentence ids, etc.
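To make the tokenization and CoNLL-U items more concrete, here is a minimal sketch of a reader that separates multiword tokens (range IDs such as `1-2`) from syntactic words and collects sentence-level comments. It is only an illustration: the `# key = value` comment style, the `sent_id` key, the German sample sentence, and the names `parse_conllu`, `new_sentence` and `row` are assumptions made for this example, not decisions recorded on this page.

```python
# Column names of a CoNLL-U line, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def new_sentence():
    """Return an empty sentence record."""
    return {"comments": {}, "multiword_tokens": [], "words": []}


def parse_conllu(text):
    """Split CoNLL-U text into sentences with comments, token ranges and words."""
    sentences = []
    sentence = new_sentence()
    # The extra empty string flushes the last sentence if the text
    # does not end with a blank line.
    for line in text.splitlines() + [""]:
        if not line.strip():
            if sentence["words"]:
                sentences.append(sentence)
            sentence = new_sentence()
        elif line.startswith("#"):
            # Assumed comment style: "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                sentence["comments"][key.strip()] = value.strip()
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            if "-" in cols[0]:
                # Range ID such as "1-2": one surface token covering words 1..2.
                start, end = cols[0].split("-")
                sentence["multiword_tokens"].append((int(start), int(end), cols[1]))
            else:
                word = dict(zip(FIELDS, cols))
                word["id"] = int(word["id"])
                sentence["words"].append(word)
    return sentences


def row(*cols):
    """Join column values with tabs, as CoNLL-U requires."""
    return "\t".join(cols)


# Hypothetical sample: German "zum" split into the words "zu" + "dem".
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    row("1-2", "zum", "_", "_", "_", "_", "_", "_", "_", "_"),
    row("1", "zu", "zu", "ADP", "_", "_", "3", "case", "_", "_"),
    row("2", "dem", "der", "DET", "_", "_", "3", "det", "_", "_"),
    row("3", "Haus", "Haus", "NOUN", "_", "_", "0", "root", "_", "_"),
])

parsed = parse_conllu(SAMPLE)
print(parsed[0]["comments"], parsed[0]["multiword_tokens"])
```

Keeping the surface token range apart from the word list means that syntactic annotation attaches only to words, while the original token can still be recovered from the range.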
Proposed revisions
- Remove nsubjpass, csubjpass, and auxpass from the list of universal relations. Discussion
- Require language-specific subtypes to be used for true syntactic subtypes, not cross-classification of syntax/semantics. Discussion
- Remove remnant from universal relations. Use promotion + enhanced representation to annotate ellipsis. Discussion
- New general principles for form vs. function in POS assignment, and new proposal for categorizing the pronominal words. Discussion
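As a rough illustration of what the first proposed revision could mean for existing treebank files, the sketch below rewrites the three passive relations in the DEPREL column of CoNLL-U lines. The list above only says that these relations should be removed from the universal set. The replacement labels `nsubj:pass`, `csubj:pass` and `aux:pass` are an assumption made here (recasting passives as subtypes), and the names `PASSIVE_SUBTYPES`, `relabel_deprel` and `relabel_line` are hypothetical.

```python
# Assumed mapping: the old passive relations become subtypes of their base relation.
PASSIVE_SUBTYPES = {
    "nsubjpass": "nsubj:pass",
    "csubjpass": "csubj:pass",
    "auxpass": "aux:pass",
}


def relabel_deprel(deprel):
    """Return the replacement label, or the input unchanged if it is not listed."""
    return PASSIVE_SUBTYPES.get(deprel, deprel)


def relabel_line(line):
    """Rewrite the DEPREL column (8th field) of one CoNLL-U line."""
    cols = line.split("\t")
    # Comments, blank lines and malformed lines are passed through untouched.
    if line.startswith("#") or len(cols) != 10:
        return line
    cols[7] = relabel_deprel(cols[7])
    return "\t".join(cols)


old = "\t".join(["1", "Books", "book", "NOUN", "_", "_", "3", "nsubjpass", "_", "_"])
print(relabel_line(old))
```

Only exact matches are rewritten, so labels that already carry a language-specific subtype are left as they are and would need a separate decision. Removing remnant is a different kind of change: it alters the tree structure and cannot be handled by relabeling alone.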