Collocation and Term Extractor

CollTerm

http://www.nljubesic.net/resources/tools/collterm/

CollTerm is a language-independent tool for collocation and term extraction. It collects collocation and term candidates in two ways: for multiword units (MWUs, i.e. collocations) it uses five different co-occurrence measures, and for single-word units it applies the TF-IDF measure to capture distributional differences from a large representative corpus. The language-dependent part consists of a stop-word list and a list of MWU MSD (morphosyntactic description) patterns, which can also be coded as regular expressions. The application is described in the paper presented at TKE 2012: Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadiņa, I., Tadić, M., Gornostay, T. Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages. The first version of the application is available as an integral part of the ACCURAT Toolkit, which is released under the Apache 2.0 licence (http://www.accurat-project.eu/index.php?p=accurat-toolkit). In this version of the tool, the MWU MSD patterns have been calibrated for Croatian, which improves the tool's usability for that language. Calibration for the other CESAR languages is planned.
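The two scoring routes described above can be illustrated with a minimal Python 3 sketch. It is not CollTerm's code or interface: the function names, the one-letter tags standing in for MSD tags, and the smoothed IDF formula are all assumptions made for illustration. The description does not name the five co-occurrence measures, so the Dice coefficient is used here only as a common example of such a measure.

```python
import math
import re
from collections import Counter


def dice_scores(tagged_tokens, pattern, stop_words=frozenset()):
    """Score adjacent word pairs with the Dice coefficient.

    tagged_tokens: list of (word, tag) pairs; the tags are simplified
    stand-ins for MSD tags (hypothetical, for illustration only).
    pattern: regular expression that the two tags, joined by a space,
    must match in full for the pair to count as a candidate.
    """
    words = [w.lower() for w, _ in tagged_tokens]
    unigrams = Counter(words)
    bigrams = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        w1, w2 = w1.lower(), w2.lower()
        if w1 in stop_words or w2 in stop_words:
            continue  # language-dependent stop-word filter
        if re.fullmatch(pattern, t1 + " " + t2):
            bigrams[(w1, w2)] += 1
    # Dice = 2 * f(w1 w2) / (f(w1) + f(w2))
    return {
        bg: 2.0 * f / (unigrams[bg[0]] + unigrams[bg[1]])
        for bg, f in bigrams.items()
    }


def tfidf_scores(domain_tokens, reference_docs):
    """Score single words: frequency in the domain text times an IDF
    estimated from a reference corpus (smoothed IDF is an assumption)."""
    tf = Counter(w.lower() for w in domain_tokens)
    n_docs = len(reference_docs)
    df = Counter()
    for doc in reference_docs:
        df.update(set(w.lower() for w in doc))
    return {
        w: f * math.log((1 + n_docs) / (1 + df[w]))
        for w, f in tf.items()
    }
```

With a pattern such as `r"(A|N) N"` (adjective or noun followed by a noun), only pairs whose tags fit the pattern become collocation candidates. Words that appear in every reference document receive a TF-IDF score of zero, so frequent general-language words drop out of the single-word term list.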

Distribution

Availability: Available - Unrestricted Use

Licence

Apache Licence 2.0

Restrictions: Inform Licensor

Execution location: hidden

Distribution Access/Medium: Downloadable

Distribution rights holders:

University of Zagreb, Faculty of Humanities and Social Sciences

IPR Holder

University of Zagreb, Faculty of Humanities and Social Sciences

Marko Tadić

Contact Person

Nikola Ljubešić

Tool/Service

Tool

Language Independent

Input

Media type: Text

Resource type: Language Description

Modality: Written Language

Output

Media type: Text

Resource type: Lexical Conceptual Resource

Modality: Written Language

Operation

Operating system: Linux

Required Software

Python (version 2.6 or higher)

Evaluation

Evaluated: True

Evaluation level: Diagnostic

Evaluation type: Black Box

Evaluation criteria: Intrinsic

Evaluation measure: Human

Evaluator: Nikola Ljubešić

Creation

Programming language: Python

Resource Creation

Resource Creator

Univ. of Zagreb, Faculty of Humanities and Social Sciences, Depts. of Linguistics & Information Sci.

Creation started: 04/01/2011

Funding Project

Analysis and Evaluation of Comparable Corpora for Under-Resourced Areas of Machine Translation (ACCURAT)

URL: http://www.accurat-p...

Funding Types: EU Funds, National Funds

Funders: European Commission (75%), University of Zagreb, Faculty of Humanities and Social Sciences (25%)

Project duration: 01/01/2010 - 06/30/2012

Central and South-East European Resources (CESAR)

URL: http://www.cesar-pro...

Funding Types: EU Funds, National Funds

Funders: European Commission (50%), University of Zagreb, Faculty of Humanities and Social Sciences (50%)

Project duration: 02/01/2011 - 01/31/2013

Metadata

Created: 07/30/2012

Last Updated: 02/04/2013

Metadata Creator

Marko Tadić

Version

Version: 1.0

Last Updated: 07/30/2012
