WMT 2018 Automatic Post-Editing data set – META-SHARE

WMT 2018 Automatic Post-Editing data set

http://www.qt21.eu/

https://lindat.mff.cuni.cz/

ID: http://hdl.handle.net/11372/LRT-2390

For the APE shared task at WMT2018, we will use:
- A new test set of 2,000 segments for the English-German language pair from 2017 (en-de and de-en), where the MT segments are generated by an SMT system. In total, this language pair covers 30k segments. The split is: training set = 23k, dev set = 1k, test-set16 = 2k, test-set17 = 2k, test-set18 = 2k.
- A new English-German dataset of 30,000 segments where the MT segments are generated by an NMT system. The split is: training set = 27k, dev set = 1k, test-set18 = 1k.
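The split figures listed above can be tallied with a short Python sketch. The dictionary keys and function names are illustrative only; the numbers are copied from the description. Summed this way, the SMT splits reach the stated 30k, while the NMT splits as listed add up to 29k against the stated 30,000; the description does not say where the difference lies.

```python
# Split sizes in thousands of segments, copied from the description above.
# The key names are illustrative labels, not official file or set names.
SPLITS_K = {
    "SMT en-de": {"train": 23, "dev": 1, "test16": 2, "test17": 2, "test18": 2},
    "NMT en-de": {"train": 27, "dev": 1, "test18": 1},
}


def total_segments(splits_k):
    """Return the total number of segments for one system's splits."""
    return sum(splits_k.values()) * 1000


def summarise(all_splits):
    """Map each system label to its total segment count."""
    return {name: total_segments(splits) for name, splits in all_splits.items()}


if __name__ == "__main__":
    for name, total in summarise(SPLITS_K).items():
        print(f"{name}: {total:,} segments")
```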
The SMT English-German test data consists of 2,000 triplets (source, target and post-edit) belonging to the Information Technology domain and already tokenised. Target sentences are machine-translated with the KIT SMT system. Post-edits are collected by Text & Form from professional translators.
The NMT English-German data consists of 30,000 triplets (source, target and post-edit) belonging to the Information Technology domain and already tokenised. Target sentences are machine-translated with the Nematus system. Post-edits are collected by Text & Form from professional translators.
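As a minimal sketch of how the (source, target, post-edit) triplets might be consumed, the Python snippet below assumes the three sides are stored as separate line-aligned, whitespace-tokenised UTF-8 text files. That file layout, the `Triplet` type and the `read_triplets` helper are assumptions made for illustration; this page does not specify the distribution format.

```python
from itertools import zip_longest
from typing import Iterator, List, NamedTuple


class Triplet(NamedTuple):
    source: List[str]     # tokenised source sentence
    target: List[str]     # tokenised machine translation
    post_edit: List[str]  # tokenised human post-edit of the MT output


def read_triplets(src_path: str, mt_path: str, pe_path: str) -> Iterator[Triplet]:
    """Yield one Triplet per aligned line of the three files.

    Assumes (hypothetically) one segment per line and tokens separated by
    whitespace. Raises ValueError if the files differ in length.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(mt_path, encoding="utf-8") as mt, \
         open(pe_path, encoding="utf-8") as pe:
        for n, lines in enumerate(zip_longest(src, mt, pe), start=1):
            if None in lines:
                raise ValueError(f"files differ in length at line {n}")
            yield Triplet(*(line.split() for line in lines))
```

Splitting on whitespace is enough here because the data is described as already tokenised; the length check guards against misaligned files, which would silently pair the wrong segments otherwise.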

The 2018 APE data sets will be made available at the end of June 2018.

IMPORTANT LEGAL NOTICE (This dataset is provided under the following terms of use)
TAUS Terms of Use (https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21).
TAUS grants to QT21 User access to the WMT Data Set with the following rights:
i) the right to use the target side of the translation units into a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is its own new translation;
ii) the right to make Derivative Works; and
iii) the right to use or resell such Derivative Works commercially and for the following goals:
    i) research and benchmarking;
    ii) piloting new solutions; and
    iii) testing of new commercial services.

Distribution

Availability: Available - Restricted Use

Licence: Other

Distribution Access/Medium: Downloadable

Contact Person

Christian Dugast

Bilingual text corpus

Languages

English, German

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

1,294 Kb (both language pairs)

Bilingual text corpus

Languages

German, English

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

1,294 Kb (both language pairs)

Domains

Information Technology

Metadata

Created: 12/13/2017

Last Updated: 03/02/2018

Usage

Foreseen Use - NLP Applications

Use NLP Specific: Machine Translation

Actual Use - NLP Applications

Use NLP Specific: Machine Translation
