ROMBAC - Romanian balanced corpus


Last update: 2016-01-20

http://catalog.elra.info/product_info.php?products_id=1253

ID: ELRA-W0088

ROMBAC is a Romanian corpus containing equal shares of text from five genres: journalism, legal texts, fiction, medicine, and biographies of Romanian literary personalities. Around 7,000,000 words of text were selected for each genre, and the entire corpus counts around 41,000,000 words, including punctuation.

The corpus is annotated at the paragraph, sentence, constituent group and word levels. It provides morpho-syntactic descriptions (MSDs), assigned automatically by the TTL tagger, which implements the tiered tagging methodology and has an accuracy of at least 98%. About 20% of the MSDs have been manually checked, validated and, where necessary, corrected.
MSDs follow the Multext-East specifications. For Romanian there are 614 different MSDs. The specifications have been slightly modified: new tags for named entities have been added.
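
Multext-East MSDs are positional codes: the first character gives the word category and each following character encodes one attribute. The sketch below illustrates how such a code might be unpacked. It is not part of the ROMBAC distribution: the function name decode_msd is hypothetical, the category table is an illustrative subset, the noun attribute order and values are an assumption based on the published Multext-East Romanian specification, and the named-entity tags added for ROMBAC are not covered.

```python
# Illustrative subset of Multext-East category codes (first MSD character).
CATEGORIES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "D": "Determiner", "T": "Article", "R": "Adverb", "S": "Adposition",
    "C": "Conjunction", "M": "Numeral", "Q": "Particle", "I": "Interjection",
    "Y": "Abbreviation", "X": "Residual",
}

# Assumed attribute order and values for Romanian nouns (MSD positions 1 to 5).
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"r": "direct", "o": "oblique", "v": "vocative"}),
    ("Definiteness", {"n": "no", "y": "yes"}),
]


def decode_msd(msd: str) -> dict:
    """Decode the category of any MSD and, for nouns, the leading attributes."""
    if not msd:
        raise ValueError("empty MSD")
    result = {"Category": CATEGORIES.get(msd[0], "Unknown")}
    if msd[0] == "N":
        # zip stops at the shorter sequence, so truncated MSDs are handled.
        for (name, values), code in zip(NOUN_ATTRIBUTES, msd[1:]):
            if code != "-":  # "-" marks an attribute that does not apply
                result[name] = values.get(code, code)
    return result


if __name__ == "__main__":
    print(decode_msd("Ncfsrn"))
```

Under these assumptions, "Ncfsrn" decodes as a common feminine singular noun in the direct case without a definite article. For other categories the sketch reports only the category, since their attribute tables are not reproduced here.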

The corpus is encoded in XML.

ROMBAC is a balanced corpus of Romanian covering 5 different text genres: journalistic, legal, academic, medical and literary. Each genre contains about 7 million words, for a total of about 41 million words (punctuation marks included).

The corpus has been annotated at the morpho-syntactic level (MSD annotations) with the TTL tagger, which implements the tiered tagging methodology and has an estimated accuracy of 98%. About 20% of the MSD annotations have been manually validated. The morpho-syntactic annotations (MSDs) follow the Multext-East specifications. There are 614 different MSDs for Romanian (tags have been added for named entities).

The corpus is encoded in XML.

Distribution

Availability

Available - Restricted Use

Start date: 01/19/2016

Licence

Licence         Restrictions                    Membership                User nature   Fee
ELRA VAR        Commercial Use                  Members of ELRA           Academic      5,000.00
ELRA END USER   Academic - Non Commercial Use   Non Members of ELRA       Academic      500.00
ELRA VAR        Commercial Use                  Non Members of ELRA       Academic      8,000.00
ELRA END USER   Academic - Non Commercial Use   Non Members of ELRA       Commercial    8,000.00
ELRA VAR        Commercial Use                  Non Members of ELRA       Commercial    8,000.00
ELRA END USER   Academic - Non Commercial Use   Members of ELRA           Academic      0.00
ELRA END USER   Academic - Non Commercial Use   Members of ELRA           Commercial    5,000.00
ELRA VAR        Commercial Use                  Members of ELRA           Commercial    5,000.00

Contact Person

Mapelli Valérie

text

Monolingual text corpus

Languages

Romanian

Linguality

Linguality type: Monolingual

Size

4 GB

Metadata

Created: 05/12/2005

Version

Version: 1.0

Last Updated: 01/19/2016
