WMT 2017 Automatic Post-editing and Quality Estimation data set – META-SHARE

Last view: 2024-03-20

WMT 2017 Automatic Post-editing and Quality Estimation data set

http://www.qt21.eu/

https://lindat.mff.cuni.cz/

ID:

http://hdl.handle.net/11372/LRT-2390

For WMT 2017, 11,000 segments were added to the WMT16 training set (En-De), together with a new 2017 test set of 2,000 segments (En-De). A new language pair was also added in 2017: De-En, with 25k segments for training, 1k for development and 2k for test. Combining the 2016 and 2017 APE and QE data gives a total of 28k segments for each language pair, split as follows. En-De: training set = 23k, dev set = 1k, test set 2016 = 2k, test set 2017 = 2k. De-En: training set = 25k, dev set = 1k, test set 2017 = 2k.
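As a quick sanity check on the figures above, the split sizes can be tallied per language pair. This is only an illustrative sketch: the dictionary keys and the function name are made up for the example, and the WMT16 training size is inferred from the stated numbers (23k minus the 11k added in 2017), not quoted from the record.

```python
# Split sizes in segments, copied from the description above.
# Key names ("train", "test16", ...) are illustrative, not official file names.
SPLITS = {
    "en-de": {"train": 23_000, "dev": 1_000, "test16": 2_000, "test17": 2_000},
    "de-en": {"train": 25_000, "dev": 1_000, "test17": 2_000},
}

# Inferred, not stated directly: En-De training size before the 2017 addition.
WMT16_EN_DE_TRAIN = SPLITS["en-de"]["train"] - 11_000


def total_segments(pair: str) -> int:
    """Return the total number of segments across all splits of one pair."""
    return sum(SPLITS[pair].values())


if __name__ == "__main__":
    for pair in SPLITS:
        print(pair, total_segments(pair))
```

Both pairs sum to the 28k total given in the description.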
For En-De, the training, development and test data consist of English-German triplets (source, target and post-edit) from the Information Technology domain, already tokenised. Target sentences are machine-translated with the KIT system. Post-edits are collected by Text & Form from professional translators.
For De-En, the training, development and test data consist of German-English triplets (source, target and post-edit) from the Pharma domain, already tokenised. Target sentences are machine-translated with the KIT system. Post-edits are collected by Text & Form from professional translators.
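The triplet structure described above could be loaded roughly as follows. This is a minimal sketch under assumptions the record does not confirm: that each split ships as three line-aligned plain-text files (source, machine translation, post-edit), one segment per line, with tokens separated by whitespace. The function names and the file names in the usage note are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Triplet(NamedTuple):
    source: List[str]     # tokens of the source sentence
    target: List[str]     # tokens of the machine-translated sentence
    post_edit: List[str]  # tokens of the human post-edit


def triplets_from_lines(
    src: Iterable[str], mt: Iterable[str], pe: Iterable[str]
) -> List[Triplet]:
    """Zip three line-aligned sequences into token-level triplets."""
    src, mt, pe = list(src), list(mt), list(pe)
    if not (len(src) == len(mt) == len(pe)):
        raise ValueError("source, target and post-edit must have equal line counts")
    # Assumption: the data is pre-tokenised, so whitespace splitting is enough.
    return [Triplet(s.split(), t.split(), p.split()) for s, t, p in zip(src, mt, pe)]


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file, one segment per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_triplets(src_path: str, mt_path: str, pe_path: str) -> List[Triplet]:
    """Load one split from three files (paths are supplied by the caller)."""
    return triplets_from_lines(
        read_lines(src_path), read_lines(mt_path), read_lines(pe_path)
    )
```

A call such as `read_triplets("train.src", "train.mt", "train.pe")` would then return one `Triplet` per segment; those file names are placeholders and should be replaced with the names found in the downloaded archive.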

IMPORTANT LEGAL NOTICE (This dataset is provided under the following terms of use)
TAUS Terms of Use (https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21).
TAUS grants to QT21 User access to the WMT Data Set with the following rights:
i) the right to use the target side of the translation units into a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is its own new translation;
ii) the right to make Derivative Works; and
iii) the right to use or resell such Derivative Works commercially and for the following goals:
i) research and benchmarking;
ii) piloting new solutions; and
iii) testing of new commercial services.

Distribution Availability

Available - Restricted Use

Licence

Other

Distribution Access/Medium: Downloadable

Contact Person

Christian Dugast

Bilingual text corpus

Languages

English German

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

1,294 Kb (both language pairs)

Bilingual text corpus

Languages

German English

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

1,294 Kb (both language pairs)

Domains

Information Technology

Metadata

Created: 12/13/2017

Last Updated: 03/02/2018

Usage

Foreseen Use - NLP Applications

Use NLP Specific: Machine Translation

Actual Use - NLP Applications

Use NLP Specific: Machine Translation
