empirist-lemmatization

Shared task on the lemmatization of German web and social media texts

GermEval 2019 Task 3: Shared task on the lemmatization of German web and social media texts (EmpiriST-lemmatization 2019)

Goal

The goal of the shared task is to encourage the developers of NLP applications to adapt their tools and resources to the lemmatization of German Web pages and written German discourse in genres of computer-mediated communication (CMC). Examples of CMC genres include chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues.

The shared task is a follow-up to the EmpiriST 2015 shared task, which focused on tokenization and POS-tagging. The current task focuses on the next fundamental step in the NLP pipeline. Lemmatization is crucial for general corpus indexing purposes as well as for many applications in lexicography, text classification, discourse analysis, etc.

Tasks

Participants will receive pre-tokenized and pre-tagged text files and will have to provide surface-oriented lemmata and/or normalized lemmata. Surface-oriented lemmata are mainly based on the inflectional suffixes of the token and retain, as far as possible, any non-standard orthographical features of the token. For normalized lemmata, on the other hand, obvious spelling errors are corrected and non-standard forms are treated as standard forms.

Subtask 1: Surface-oriented lemmatization

XD	EMOASC	XD
du	PPER	du
killst	VVFIN	killen
mich	PPER	mich
!	$.	!
Soooo	PTKIFG	soooo
herrlich	ADJD	herrlich
xDD	EMOASC	xDD
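
The three-column layout shown above (token, POS tag, lemma) can be read with a few lines of Python. This is only a minimal sketch, not an official tool of the shared task: it assumes the columns are separated by tab characters and that blank lines may be skipped. The names `Row`, `parse_rows` and `SAMPLE` are hypothetical.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """One annotated token: surface form, POS tag, lemma."""
    token: str
    tag: str
    lemma: str


def parse_rows(text: str) -> List[Row]:
    """Parse tab-separated token/tag/lemma lines (assumed layout)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no annotation
        token, tag, lemma = line.split("\t")
        rows.append(Row(token, tag, lemma))
    return rows


# The Subtask 1 example from above, with explicit tab separators.
SAMPLE = "\n".join([
    "XD\tEMOASC\tXD",
    "du\tPPER\tdu",
    "killst\tVVFIN\tkillen",
    "mich\tPPER\tmich",
    "!\t$.\t!",
    "Soooo\tPTKIFG\tsoooo",
    "herrlich\tADJD\therrlich",
    "xDD\tEMOASC\txDD",
])

for row in parse_rows(SAMPLE):
    print(f"{row.token} [{row.tag}] -> {row.lemma}")
```

A line that does not contain exactly three tab-separated fields raises a `ValueError` in this sketch, which makes malformed input visible early.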

Subtask 2: Normalized lemmatization

XD	EMOASC	XD
du	PPER	du
killst	VVFIN	killen
mich	PPER	mich
!	$.	!
Soooo	PTKIFG	so
herrlich	ADJD	herrlich
xDD	EMOASC	xDD
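
To make the difference between the two subtasks concrete, the following sketch compares the surface-oriented and the normalized lemmata of the example sentence and lists the tokens on which they disagree. It is an illustration only, not part of the task infrastructure. The names `SURFACE`, `NORMALIZED` and `lemma_differences` are hypothetical, and the data are copied from the two examples above with the POS column omitted.

```python
from typing import List, Tuple

# (token, lemma) pairs from the Subtask 1 example.
SURFACE = [
    ("XD", "XD"), ("du", "du"), ("killst", "killen"), ("mich", "mich"),
    ("!", "!"), ("Soooo", "soooo"), ("herrlich", "herrlich"), ("xDD", "xDD"),
]

# (token, lemma) pairs from the Subtask 2 example.
NORMALIZED = [
    ("XD", "XD"), ("du", "du"), ("killst", "killen"), ("mich", "mich"),
    ("!", "!"), ("Soooo", "so"), ("herrlich", "herrlich"), ("xDD", "xDD"),
]


def lemma_differences(
    surface: List[Tuple[str, str]],
    normalized: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (token, surface lemma, normalized lemma) where the lemmata differ."""
    if len(surface) != len(normalized):
        raise ValueError("both annotations must cover the same tokens")
    diffs = []
    for (tok_s, lem_s), (tok_n, lem_n) in zip(surface, normalized):
        if tok_s != tok_n:
            raise ValueError(f"token mismatch: {tok_s!r} vs {tok_n!r}")
        if lem_s != lem_n:
            diffs.append((tok_s, lem_s, lem_n))
    return diffs


print(lemma_differences(SURFACE, NORMALIZED))
# prints [('Soooo', 'soooo', 'so')]
```

In this example only `Soooo` is affected: the surface-oriented lemma keeps the letter repetition, while the normalized lemma maps it to the standard form `so`.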

Schedule

The shared task will be a pre-conference workshop of the Conference on Natural Language Processing (“Konferenz zur Verarbeitung natürlicher Sprache”, KONVENS), held on October 8, 2019 at FAU Erlangen-Nuremberg; see http://2019.konvens.org/.

Registration

Participants in the shared task need to register by sending an e-mail with the following information to empirist@collocations.de:

Mailing list

All participants and other interested parties are invited to subscribe to our mailing list.

Lemmatization guidelines

The training and test data of the shared task have been manually lemmatized according to our lemmatization guidelines (in German), which are an extension of the TIGER annotation scheme.

Data sets

The training data were individually lemmatized by four student annotators according to our lemmatization guidelines. Unclear cases were decided in group meetings with the task organizers.

Organizers

The shared task is organized by: