CINTIL DependencyBank PREMIUM

CINTIL DependencyBank PREMIUM is a corpus of Portuguese utterances manually annotated with grammatical dependency relations as well as part-of-speech, inflection and lemma information. It is being developed and maintained at the University of Lisbon. The current version is composed of 3,000 sentences (79,378 tokens) taken from Portuguese newspaper articles.

The approach we follow is to build on top of an existing resource by adding a new annotation layer. We take the existing CINTIL corpus (Barreto et al., 2006), a 1-million-token corpus already annotated with manually verified information on part-of-speech, morphology and named entities, and add syntactic function tags by automatically analysing it with a state-of-the-art dependency parser (LX-DepParser). This tentative automatic annotation is then manually corrected. The manual correction is done by two annotators under a double-blind scheme, which is followed by adjudication by a third annotator. This process is supported by a general-purpose annotation tool, WebAnno (https://code.google.com/p/webanno/).

The main motivation for creating this resource was to provide a corpus with a large variety of annotated phenomena for training statistical dependency parsers intended for applications that deal with unrestricted text. It also enables linguistic studies that need to search the corpus for specific dependency structures.
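The double-blind correction followed by adjudication described above can be illustrated with a minimal sketch. It is not the project's actual tooling (that role is played by WebAnno); the `Arc` type, the function names, the example sentence and the relation tags are all hypothetical and chosen only for illustration. The sketch compares the dependency arcs chosen by two annotators for the same sentence and lists the tokens a third annotator would need to adjudicate.

```python
from typing import List, NamedTuple


class Arc(NamedTuple):
    """One token's dependency annotation (hypothetical representation)."""
    head: int   # 1-based index of the governing token, 0 for the root
    label: str  # dependency relation tag (illustrative tag names only)


def disagreements(ann_a: List[Arc], ann_b: List[Arc]) -> List[int]:
    """Return 1-based token positions where the two annotations differ."""
    if len(ann_a) != len(ann_b):
        raise ValueError("both annotations must cover the same tokens")
    return [i for i, (a, b) in enumerate(zip(ann_a, ann_b), start=1) if a != b]


def agreement(ann_a: List[Arc], ann_b: List[Arc]) -> float:
    """Fraction of tokens on which both annotators chose the same head and label."""
    if not ann_a:
        return 1.0
    return 1.0 - len(disagreements(ann_a, ann_b)) / len(ann_a)


# Illustrative three-token sentence, e.g. "O gato dorme".
annotator_a = [Arc(2, "det"), Arc(3, "subj"), Arc(0, "root")]
annotator_b = [Arc(2, "det"), Arc(3, "obj"), Arc(0, "root")]

to_adjudicate = disagreements(annotator_a, annotator_b)
print("tokens needing adjudication:", to_adjudicate)
print("raw agreement:", round(agreement(annotator_a, annotator_b), 3))
```

Comparing head and label together means that a token counts as agreed only when both annotators attach it to the same governor with the same relation; a project could equally track the two separately.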

This work was partly funded by the Portuguese Foundation for Science and Technology through the Portuguese project DP4LT (PTDC/EEI-SII/1940/2012) and by the European Commission through project QTLeap (EC/FP7/610516).

Distribution

Availability

Available - Restricted Use

Licence

Proprietary

Restrictions: Academic - Non Commercial Use

User Nature: Academic

Distribution Access/Medium: Downloadable

Licensors:

António Branco

Distribution rights holders:

António Branco

IPR Holder

António Branco

Contact Person

António Branco

text

Monolingual text corpus

Languages

Portuguese

Linguality

Linguality type: Monolingual

Text Format

text

Size

79,378 Tokens

3,000 Sentences

Character encoding

UTF-8

Modalities

Written Language

Resource Creation

Resource Creator

António Branco

João Ricardo Silva

Funding Project

Deep Language Engineering for Language Technology (DP4LT - PTDC/EEI-SII/1940/2012)

Funding Type: National Funds

Funder: Fundação para a Ciência e Tecnologia

Funding Country: Portugal

Quality Translation by Deep Language Engineering Approaches (QTLeap - EC/FP7/610516)

URL: http://qtleap.eu

Funding Type: EU Funds

Metadata

Created: 10/12/2015

Last Updated: 10/21/2015

META-SHARE

Metadata Language: English (en)

Metadata Creator

Rita Carvalho

Version

Version: 1

Last Updated: 10/12/2015

Usage

Foreseen Use - NLP Applications

Use NLP Specific: Parsing

Actual Use - NLP Applications

Use NLP Specific: Parsing

Documentation

Document Type: Other

Rita de Carvalho, João Silva, CINTIL DependencyBank PREMIUM, http://194.117.45.19...

Document Type: Tech Report

António Branco, João Silva, Andreia Querido, Rita de Carvalho, CINTIL DependencyBank PREMIUM Handbook: Design options for the representation of grammatical dependencies, http://hdl.handle.ne...
