Estonian Dialect Corpus


Eesti murdekorpus



The dialect corpus consists of:

1) Dialect recordings. The corpus is based on dialect recordings made mainly in the 1960s and 1970s. The earliest recordings date from 1938. These are traditional dialect recordings in which the interview is conducted at the informant's home.

2) Phonetically transcribed texts. The traditional Finno-Ugric phonetic transcription is used. The texts are available as Word and PDF files (as of 1 May 2011, the corpus contained about 1,284,000 text words).

3) Dialect texts in simplified transcription. All of the phonetically transcribed texts have been converted one-to-one into the simplified transcription (.txt), which allows the texts to be used with any program and makes initial analyses possible.

4) Morphologically tagged texts, which have been loaded into a MySQL database. All word classes and morphological forms are tagged.

5) A database containing information about informants and recordings.

6) Syntactically parsed texts (about 40,000 text words).

In the corpus, every phonetically transcribed text is accompanied by a recording, a file in simplified transcription and a description; more than half of the texts are also accompanied by a morphologically tagged file.
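
As an illustration of the kind of initial analysis the plain-text (.txt) files in simplified transcription make possible, the following is a minimal sketch of a token frequency count. It assumes that tokens are separated by whitespace and that punctuation can be stripped from token edges; the actual file layout is not described here, so the function name `token_frequencies` and the sample line are hypothetical and not taken from the corpus.

```python
from collections import Counter
import string


def token_frequencies(text):
    """Count whitespace-separated tokens, ignoring case and edge punctuation.

    Assumption: the text is plain, with tokens separated by whitespace.
    """
    counts = Counter()
    for raw in text.split():
        # Strip ASCII punctuation from both ends and lowercase the token.
        token = raw.strip(string.punctuation).lower()
        if token:
            counts[token] += 1
    return counts


# Invented sample line for illustration only; not taken from the corpus.
sample = "ja siis ma läksin, ja siis tulin tagasi."
freqs = token_frequencies(sample)
print(freqs.most_common(2))
```

To apply the same function to a corpus file, the file contents could be read into a string first, for example with `pathlib.Path(...).read_text(encoding="utf-8")`; the UTF-8 encoding is likewise an assumption and may need to be adjusted.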

Some data from other Finnic languages spoken around Estonia have also been added. The aim is to incorporate at least Votic, Ingrian and Livonian data into the corpus.


