WMT 2018 Automatic Post-Editing data set

ID:

http://hdl.handle.net/11372/LRT-2390

For the APE shared task at WMT2018**, we will use:
- A new test set of 2,000 segments for the English-German language pair from 2017 (en-de and de-en), where the MT segments are generated by an SMT system. In total, this language pair covers 30k segments. The split is: training set = 23k, dev set = 1k, test-set16 = 2k, test-set17 = 2k, test-set18 = 2k.
- A new English-German dataset of 30,000 segments where the MT segments are generated by an NMT system. The split is: training set = 27k, dev set = 1k, test-set18 = 1k.

The SMT English-German test data consists of 2,000 triplets (source, target and post-edit) belonging to the Information Technology domain and already tokenised. Target sentences are machine-translated with the KIT SMT system. Post-edits are collected by Text & Form from professional translators.
The NMT English-German data consists of 30,000 triplets (source, target and post-edit) belonging to the Information Technology domain and already tokenised. Target sentences are machine-translated with the Nematus system. Post-edits are collected by Text & Form from professional translators.
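
As a minimal sketch of how such triplets might be loaded, the Python snippet below assumes that each split is distributed as three line-aligned, UTF-8 plain-text files (one for the source, one for the MT output, one for the post-edit) and that tokens are separated by whitespace. Neither the file layout nor the file names are stated in this description, so the function name `read_triplets` and any paths passed to it are hypothetical.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    source: List[str]     # tokens of the source segment
    target: List[str]     # tokens of the machine-translated segment
    post_edit: List[str]  # tokens of the human post-edit


def read_triplets(src_path: str, mt_path: str, pe_path: str) -> List[Triplet]:
    """Read three line-aligned files into a list of Triplet records.

    Assumption: line i of each file belongs to the same segment, and the
    text is already tokenised, so splitting on whitespace yields the tokens.
    """
    columns = []
    for path in (src_path, mt_path, pe_path):
        with open(path, encoding="utf-8") as handle:
            columns.append([line.split() for line in handle])

    # All three files must contain the same number of segments.
    lengths = [len(column) for column in columns]
    if len(set(lengths)) != 1:
        raise ValueError(f"files are not line-aligned: {lengths}")

    return [Triplet(*row) for row in zip(*columns)]
```

Calling `read_triplets` with the three files of one split would return one `Triplet` per segment. Comparing the length of the returned list with the split sizes listed above is a quick sanity check after download.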

** The 2018 APE data sets will be made available at the end of June 2018.

IMPORTANT LEGAL NOTICE (This dataset is provided under the following terms of use)
TAUS Terms of Use (https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21).
TAUS grants to QT21 User access to the WMT Data Set with the following rights:
i) the right to use the target side of the translation units into a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is its own new translation;
ii) the right to make Derivative Works; and
iii) the right to use or resell such Derivative Works commercially and for the following goals:
    i) research and benchmarking;
    ii) piloting new solutions; and
    iii) testing of new commercial services.
