The NER-tagger corpus is a collection of sentences with manually labelled named entities. The labelling is partial: only one selected word in each sentence is labelled. As a result, the labelled word may be only a part of a named entity, and the sentence may contain other named entities that are not labelled. We distinguish the following types of named entities: PER: person, LOC: location, ORG: organization, FAC: facility, PRD: product, O: other. For each labelled word, the label is determined by the largest named entity containing it. For instance, "Eesti" in the sentence "Eesti Ühispanga Tartu kontor oli inimesi täis" ("The Tartu office of Eesti Ühispank was full of people") is labelled as a facility, although "Eesti" alone is a location and "Eesti Ühispank" is an organization.
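The "largest containing entity" rule can be illustrated with a short Python sketch. This is not the annotation tool's code: the span representation, the helper names `label_for_word` and `span_of`, and the fallback to "O" for a word outside every entity are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation of an entity: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def label_for_word(word_start: int, word_end: int, entities: List[Span]) -> str:
    """Return the label of the largest entity span containing the word.

    Falls back to "O" when no entity contains the word (an assumption).
    """
    containing = [e for e in entities if e[0] <= word_start and word_end <= e[1]]
    if not containing:
        return "O"
    # The largest entity is the one with the longest character span.
    return max(containing, key=lambda e: e[1] - e[0])[2]


sentence = "Eesti Ühispanga Tartu kontor oli inimesi täis"


def span_of(text: str, label: str) -> Span:
    """Locate the first occurrence of `text` in the example sentence."""
    start = sentence.find(text)
    return (start, start + len(text), label)


# Nested entities from the example above.
entities = [
    span_of("Eesti", "LOC"),
    span_of("Eesti Ühispanga", "ORG"),
    span_of("Eesti Ühispanga Tartu kontor", "FAC"),
]

word = span_of("Eesti", "")
print(label_for_word(word[0], word[1], entities))  # FAC
```

All three entities contain the word "Eesti", and the facility span is the longest, so the word receives the label FAC.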
The corpus was created using the nertagger web tool: https://github.com/estnltk/ner-tagger. Two human annotators were involved in the annotation process.
The data file contains one sentence per line with the following columns:
name: named entity token
sentence: sentence
start: entity start offset in the sentence
end: entity end position in the sentence
label: assigned label
annotator: human annotator id
time: number of milliseconds it took the annotator to tag the word
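A minimal Python sketch for reading such a file follows. The description above does not state the column delimiter, whether a header row is present, or whether the end position is inclusive, so the code assumes tab-separated values, tolerates an optional header, and treats `end` as exclusive in the sample. The `Record` class, the `read_records` function and the sample row (including its annotator id and time) are invented for illustration and are not taken from the corpus.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO

# Column order as listed above.
COLUMNS = ["name", "sentence", "start", "end", "label", "annotator", "time"]


@dataclass
class Record:
    name: str
    sentence: str
    start: int
    end: int
    label: str
    annotator: str
    time: int  # milliseconds


def read_records(handle: TextIO, delimiter: str = "\t") -> Iterator[Record]:
    """Yield one Record per data line; the delimiter is an assumption."""
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        if row == COLUMNS:
            continue  # skip a header row, if the file has one
        name, sentence, start, end, label, annotator, time_ms = row
        yield Record(name, sentence, int(start), int(end), label, annotator, int(time_ms))


# Invented sample row; the real file may differ in delimiter and offsets.
sample = "Eesti\tEesti Ühispanga Tartu kontor oli inimesi täis\t0\t5\tFAC\t1\t1500\n"
records = list(read_records(io.StringIO(sample)))
print(records[0].label, records[0].sentence[records[0].start:records[0].end])
```

For the real file, replace the `io.StringIO` object with `open(path, encoding="utf-8", newline="")` and adjust `delimiter` after inspecting a few lines.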