Shared task on the lemmatization of German web and social media texts
Note that the shared task has been cancelled due to an insufficient number of participants.
The goal of the shared task is to encourage the developers of NLP applications to adapt their tools and resources to the lemmatization of German web pages and written German discourse in genres of computer-mediated communication (CMC). Examples of CMC genres are chats, forums, wiki talk pages, tweets, blog comments, social networks, and SMS and WhatsApp dialogues.
The shared task is a follow-up to the EmpiriST 2015 shared task, which focused on tokenization and POS-tagging. The current task focuses on the next fundamental step in the NLP pipeline. Lemmatization is crucial for general corpus indexing purposes as well as for many applications in lexicography, text classification, discourse analysis, etc.
Participants will receive pre-tokenized and pre-tagged text files and will have to provide surface-oriented lemmata and/or normalized lemmata. Surface-oriented lemmata are mainly based on the inflectional suffixes of the token and retain, as far as possible, any non-standard orthographical features of the token. For normalized lemmata, on the other hand, obvious spelling errors are corrected and non-standard forms are treated as standard forms.
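Example with surface-oriented lemmata (columns: token, POS tag, lemma):

XD EMOASC XD
du PPER du
killst VVFIN killen
mich PPER mich
! $. !
Soooo PTKIFG soooo
herrlich ADJD herrlich
xDD EMOASC xDD

Example with normalized lemmata (columns: token, POS tag, lemma):

XD EMOASC XD
du PPER du
killst VVFIN killen
mich PPER mich
! $. !
Soooo PTKIFG so
herrlich ADJD herrlich
xDD EMOASC xDD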
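The two lemmatization variants can be compared programmatically. The following is a minimal sketch, not part of the official task tooling. It assumes each line holds exactly three whitespace-separated columns (token, POS tag, lemma), as in the examples shown here; the actual data files may use a different delimiter. The names `Row`, `parse_rows`, and `differing_lemmata` are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    token: str
    pos: str
    lemma: str


def parse_rows(text: str) -> List[Row]:
    """Parse lines of the form 'token POS lemma' (assumed whitespace-separated)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got {len(parts)}: {line!r}")
        rows.append(Row(*parts))
    return rows


def differing_lemmata(surface: List[Row], normalized: List[Row]) -> List[Tuple[str, str, str]]:
    """Return (token, surface lemma, normalized lemma) wherever the two lemmata differ."""
    if len(surface) != len(normalized):
        raise ValueError("both annotations must cover the same tokens")
    return [
        (s.token, s.lemma, n.lemma)
        for s, n in zip(surface, normalized)
        if s.lemma != n.lemma
    ]


# The example sentence pair from this page.
SURFACE = """XD EMOASC XD
du PPER du
killst VVFIN killen
mich PPER mich
! $. !
Soooo PTKIFG soooo
herrlich ADJD herrlich
xDD EMOASC xDD"""

NORMALIZED = SURFACE.replace("Soooo PTKIFG soooo", "Soooo PTKIFG so")

if __name__ == "__main__":
    for token, surf, norm in differing_lemmata(parse_rows(SURFACE), parse_rows(NORMALIZED)):
        print(f"{token}: surface={surf} normalized={norm}")
```

For this example, the only token whose lemma differs between the two variants is "Soooo": the surface-oriented lemma keeps the letter repetition, while the normalized lemma is the standard form.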
The shared task will be held as a pre-conference workshop of the Conference on Natural Language Processing (“Konferenz zur Verarbeitung natürlicher Sprache”, KONVENS) on October 8, 2019, at FAU Erlangen-Nuremberg; see http://2019.konvens.org/.
Participants in the shared task need to register by sending an e-mail with the following information to empirist@collocations.de:
All participants and other interested parties are invited to subscribe to our mailing list.
The training and test data of the shared task have been manually lemmatized according to our lemmatization guidelines (in German), which extend the TIGER annotation scheme.
The training data were individually lemmatized by four student annotators according to our lemmatization guidelines. Unclear cases were decided in group meetings with the task organizers.
The shared task is organized by: