The EMILLE/CIIL Corpus – META-SHARE

Last view: 2026-05-07

Last update: 2013-06-26

The EMILLE/CIIL Corpus

http://catalog.elra.info/product_info.php?products_id=696

ID:

ELRA-W0037

The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora. The monolingual component comprises fourteen corpora, one for each of fourteen South Asian languages, with written data and, for some languages, spoken data: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu. The monolingual corpora contain approximately 92,799,000 words, including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu. The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes the Urdu monolingual and parallel corpora, automatically annotated for parts of speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up in SGML compliant with the CES (Corpus Encoding Standard) and encoded in Unicode.

Reference: Xiao, Z., McEnery, A., Baker, P. and Hardie, A. 2004. ‘Developing Asian language corpora: standards and practice’, in Sornlertlamvanich, V., Tokunaga, T. and Huang, C. (eds.) Proceedings of the Fourth Workshop on Asian Language Resources, pp. 1-8. March 25, Sanya.

This database is available for research use by academic organisations only. For commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 (The EMILLE Lancaster Corpus).

Distribution

Availability

Available - Restricted Use

Start date: 09/15/2004

Licence

ELRA END USER

Restrictions: Academic - Non Commercial Use

For Members of ELRA

User Nature: Academic

ELRA END USER

Restrictions: Academic - Non Commercial Use

For Non Members of ELRA

User Nature: Academic

Contact Person

Mapelli Valérie

text

Multilingual text corpus

Languages

Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Panjabi (Variety: Punjabi; Type: Dialect; 2 Gb), Sinhalese, Tamil, Telugu, Urdu

Linguality

Linguality type: Multilingual

Size

1.82 Gb

Resource Creation

Funding Project

EMILLE (Enabling Minority Language Engineering) - UK EPSRC

Funding Type: Own Funds

Metadata

Created: 05/12/2005

Version

Version: 1.0

Last Updated: 03/06/2009
