amph-Corpus – META-SHARE

amph-Corpus

Ajatella, miettiä, pohtia, harkita -korpus

amph

https://kitwiki.csc.fi/twiki/bin/view/Old/ResourceAmph

ID:

http://urn.fi/urn:nbn:fi:lb-2015021301

The corpus is available in Kielipankki - the Language Bank of Finland (taito-shell.csc.fi, instructions on how to gain access rights: https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/KielipankkiAccessRights).

The amph micro-corpus consists of 3,404 occurrences in total of the four most common Finnish THINK lexemes, ajatella, miettiä, pohtia, and harkita 'think, reflect, ponder, consider'.

These occurrences were extracted from a corpus consisting of two months’ worth (January–February 1995) of written text from Helsingin Sanomat (1995), Finland’s major daily newspaper, and six months’ worth (October 2002 – April 2003) of written discussion in the SFNET (2002-2003) Internet discussion forum, specifically the newsgroups on (personal) relationships (sfnet.keskustelu.ihmissuhteet) and politics (sfnet.keskustelu.politiikka). The newspaper corpus comprised 3,304,512 words of body text, excluding headers and captions (as well as punctuation tokens), and included 1,750 instances of the studied THINK verbs. The Internet corpus comprised 1,174,693 words of body text, excluding quotes of previous postings as well as punctuation tokens, and included 1,654 instances of the studied THINK verbs. The overall frequencies of the individual THINK lexemes in the corpus were 1,492 for ajatella, 812 for miettiä, 713 for pohtia, and 387 for harkita.
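The figures above can be cross-checked with a few lines of Python. The following is only a minimal sketch: the variable and function names are illustrative and are not part of the amph scripts or R functions, and the per-million-word normalization is a common convention assumed here rather than something the corpus description prescribes. It confirms that the per-source and per-lexeme counts both sum to the stated total, and derives relative frequencies for the two subcorpora.

```python
# Counts copied from the corpus description; all names here are illustrative.
SUBCORPORA = {
    "Helsingin Sanomat (1995)": {"words": 3_304_512, "think_verbs": 1_750},
    "SFNET (2002-2003)": {"words": 1_174_693, "think_verbs": 1_654},
}
LEXEME_COUNTS = {"ajatella": 1_492, "miettiä": 812, "pohtia": 713, "harkita": 387}


def per_million(occurrences: int, words: int) -> float:
    """Relative frequency normalized to one million words of body text."""
    return occurrences / words * 1_000_000


# Both breakdowns should add up to the same overall number of occurrences.
total_by_source = sum(s["think_verbs"] for s in SUBCORPORA.values())
total_by_lexeme = sum(LEXEME_COUNTS.values())
assert total_by_source == total_by_lexeme == 3_404

# Relative frequency of the THINK verbs in each subcorpus.
rates = {
    name: per_million(s["think_verbs"], s["words"])
    for name, s in SUBCORPORA.items()
}

# Share of each lexeme among all extracted occurrences.
shares = {lexeme: n / total_by_lexeme for lexeme, n in LEXEME_COUNTS.items()}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate:.1f} THINK verbs per million words")
    for lexeme, share in shares.items():
        print(f"{lexeme}: {share:.1%} of all occurrences")
```

Under these assumptions the THINK verbs turn out to be markedly more frequent, relative to corpus size, in the newsgroup text than in the newspaper text.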

The corpus contents were first analyzed automatically, both syntactically and morphologically, using a computational implementation of Functional Dependency Grammar (Tapanainen and Järvinen 1997; Järvinen and Tapanainen 1997) for Finnish, namely the FI-FDG parser (Connexor 2007). After this, all instances of the studied THINK lexemes, together with their syntactic arguments, were manually validated, corrected where necessary, and subsequently supplemented with semantic classifications. In addition, some extra-linguistic features (newspaper section or specific newsgroup, author ID when available, unique document index) were incorporated where they could be identified and extracted from the original corpora.

For each occurrence of the four selected THINK verbs in the original research corpora, the amph micro-corpus contains all relevant contextual features, including the verb itself, analyzed at the aforementioned morphological, syntactic and semantic levels within the immediate sentential context, as well as all pertinent extra-linguistic features. In addition, the amph micro-corpus includes scripts for processing this data, R functions for its statistical analysis, and a comprehensive set of the resulting output as data tables in R format.

Researchers who have a user name and a password can access the corpus in Taito (taito-shell.csc.fi). University students have to apply for access rights at https://lbr.csc.fi/ (sign in with your university credentials) before being able to access the corpus.

Distribution

Availability

Available - Restricted Use

Licence

CLARIN ACA

Restrictions: Academic - Non Commercial Use, Attribution, Other

User Nature: Academic

Execution location: hidden

Licensors:

Distribution rights holders:

CSC - Tieteen tietotekniikan keskus Oy, CSC - IT Center for Science Ltd

IPR Holder

Contact Person

User support at CSC - IT Center for Science Ltd. The Language Bank of Finland

text

Monolingual text corpus

Languages

Finnish

Linguality

Linguality type: Monolingual

Size

777,288 Kb

Modalities

Written Language

Time Coverage

1995-2003

Metadata

Created: 02/13/2015

Last Updated: 11/30/2015

Metadata Creator

Usage

Foreseen Use - Human Use

Use NLP Specific: Linguistic Research

Actual Use - Human Use

Use NLP Specific: Linguistic Research
