Modern French Corpus including Anaphors Tagging
Collection de corpus du français contemporain - Corpus avec étiquetage des anaphores
The anaphor-tagged corpus was created by the CRISTAL-GRESEC team (Stendhal-Grenoble 3 University, France) and XRCE (Xerox Research Centre Europe, France) in response to the call launched by the DGLF-LF (the national institution for the French language and the languages spoken in France) for the creation of modern French corpora.
Over 1 million words have been annotated. The corpora have been selected so that they represent a wide sampling of the French language (scientific and human science articles, extracts from newspapers and magazines, legal texts, etc.) and according to the points of interest of the teams working on the project. The processed corpora supplied by ELRA are listed below:
- Two books published by the CNRS: La protection des oeuvres scientifiques en droit d'auteur français, Xavier Strubel. Paris, CNRS Editions, 1997 (77 591 words) and Cinquante ans de traction à la SNCF. Enjeux politiques, économiques et réponses techniques, Clive Lamming. Paris, CNRS Editions, 1997 (124 990 words).
- 204 articles extracted from CNRS Info, a magazine which contains short popular scientific articles from the CNRS laboratories (201 280 words).
- 14 articles dealing with Hermès Human Sciences (111 886 words).
- 136 articles extracted from "Le Monde", dealing with economics (roughly 180 760 words).
- 13 booklets of the Official Journal of the European Communities (roughly 337 000 words).
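As a quick consistency check, the per-source word counts listed above can be summed to confirm the "over 1 million words" figure. This is only a sketch: the dictionary keys are labels chosen here for illustration, and the two counts given as "roughly" in the list are treated as exact.

```python
# Word counts as listed above; the "Le Monde" and Official Journal
# figures are approximate in the source and used here as given.
word_counts = {
    "Strubel (CNRS Editions, 1997)": 77_591,
    "Lamming (CNRS Editions, 1997)": 124_990,
    "CNRS Info": 201_280,
    "Hermès": 111_886,
    "Le Monde": 180_760,
    "Official Journal of the European Communities": 337_000,
}

total = sum(word_counts.values())
print(total)  # 1033507
```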
The tagged anaphoric elements are listed below:
- Personal pronouns: anaphoric 3rd person pronouns.
- Possessive determiners: 3rd person possessive determiner.
- Demonstrative pronouns: anaphoric pronouns (celui, celle, ceux, celles-ci, celles-là)
- Indefinite pronouns: aucun(e), chacun(e), certain(e)s, l'un(e), les un(e)s, tout(es), etc., when they are anaphoric.
- "Pro-verbs": "le" + "faire".
- Anaphoric and cataphoric adverbs: dessus, dedans, dessous, when they have an anaphoric function.
- Ellipsis of head nouns: ellipsis with nominal adjectives or quantifier determiners.
- Textual headers like "ce dernier": ce dernier, le premier, etc.
The annotation scheme was defined in XML format. The texts were divided into sections, paragraphs and sentences. Sentence segmentation was carried out with NLP tools developed by XRCE, while the annotation itself was done manually by two qualified linguists. A large subset of the anaphoric expressions was automatically pre-annotated. The antecedents and the anaphoric relations were tagged manually, with editing tools (emacs, macros from the Author/Editor software) used to ease the task. 5% of the corpora were checked to measure annotation reliability.
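To give an idea of how an XML annotation of this kind might be consumed, here is a minimal sketch. The actual tag set of the corpus is not described here, so every element and attribute name in it (`text`, `section`, `p`, `s`, `antecedent`, `anaphor`, `id`, `ref`, `type`) is a hypothetical placeholder, and the sample sentences are invented for illustration rather than taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element and attribute names below are
# assumptions for illustration, not the corpus's real scheme.
SAMPLE = """<text>
  <section>
    <p>
      <s><antecedent id="a1">Le chercheur</antecedent> publie un article.</s>
      <s><anaphor type="personal_pronoun" ref="a1">Il</anaphor> conserve
         <anaphor type="possessive_determiner" ref="a1">ses</anaphor> droits.</s>
    </p>
  </section>
</text>"""


def anaphoric_links(xml_text):
    """Return (anaphor text, anaphor type, antecedent text) triples."""
    root = ET.fromstring(xml_text)
    # Index antecedents by their identifier.
    antecedents = {
        el.get("id"): "".join(el.itertext())
        for el in root.iter("antecedent")
    }
    links = []
    # Walk anaphors in document order and resolve each reference.
    for el in root.iter("anaphor"):
        links.append((
            "".join(el.itertext()),
            el.get("type"),
            antecedents.get(el.get("ref")),
        ))
    return links


links = anaphoric_links(SAMPLE)
for anaphor, kind, antecedent in links:
    print(f"{anaphor} ({kind}) -> {antecedent}")
```

With the real corpus, the names would need to be replaced by those defined in the distributed annotation scheme.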