Bulgarian Sentence Splitter and Tokenizer

BulSST

http://dcl.bas.bg/en/programs_en.html#BGTokenizer

ID: 815

The sentence splitter marks sentence boundaries, and the tokenizer marks strings of symbols, in raw Bulgarian text.
The sentence splitter applies regular rules and lexicons, both manually crafted by an expert. The lexicons, which list abbreviations that must or may be followed mid-sentence by a capital letter, a number, etc., are applied before the regular rules. They are compiled by a separate tool, the Lexicon compiler, into minimal acyclic finite-state automata, which allows efficient processing. Sentence boundaries are represented as a position and a length, which keeps the input text unchanged and makes integration into different annotation systems easy.
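The interplay of lexicon lookup, boundary rules, and position/length output can be illustrated with a minimal Python sketch. It is not the BulSST implementation: the abbreviation set, the boundary pattern, and the function name split_sentences are assumptions made for illustration, and a plain Python set stands in for the compiled minimal acyclic automata.

```python
import re

# Hypothetical mini-lexicon of abbreviations that may be followed by a
# capital letter in the middle of a sentence. A Python set stands in for
# the compiled minimal acyclic finite-state automaton.
NON_FINAL_ABBREVIATIONS = {"г.", "ул.", "проф.", "напр."}

# Assumed boundary rule: ., ! or ? followed by whitespace and a capital
# Cyrillic or Latin letter.
CANDIDATE_BOUNDARY = re.compile(r"[.!?]+(?=\s+[А-ЯA-Z])")


def split_sentences(text):
    """Return sentences as (position, length) pairs; text is not modified."""
    spans = []
    start = 0
    for match in CANDIDATE_BOUNDARY.finditer(text):
        end = match.end()
        # The lexicon is consulted before the regular rule is accepted.
        last_word = text[start:end].split()[-1].lower()
        if last_word in NON_FINAL_ABBREVIATIONS:
            continue
        spans.append((start, end - start))
        # The next sentence starts at the first non-whitespace character.
        start = end
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        spans.append((start, len(text) - start))
    return spans


text = "Проф. Иванов живее на ул. Раковски. Той е роден през 1960 г. в София."
for position, length in split_sentences(text):
    print(position, length, text[position:position + length])
```

Because only offsets are returned, a caller can store the boundaries as stand-off annotation and recover each sentence by slicing the original string.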
The tokenizer demarcates strings of letters, numbers, punctuation marks, special symbols, combinations of these, and whitespace (empty symbols). Regular patterns are used to recognize some simple cases of named entities such as dates, fractions, emails, internet addresses, abbreviations, etc. The tokenizer also classifies each recognized token by type (for example, small Cyrillic letters or capital Latin letters). It uses finite-state transducers for token recognition and type matching.
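Pattern-based token recognition with type classification can be sketched in Python as follows. This is an illustrative stand-in only: Python's re module replaces the finite-state transducers of the actual tool, and the type names, patterns, and the functions classify_word and tokenize are hypothetical.

```python
import re

# Hypothetical token types and patterns. Order matters: more specific
# patterns come before more general ones.
TOKEN_PATTERNS = [
    ("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("URL", r"https?://\S+"),
    ("DATE", r"\d{1,2}\.\d{1,2}\.\d{4}"),
    ("FRACTION", r"\d+/\d+"),
    ("NUMBER", r"\d+"),
    ("WORD", r"[^\W\d_]+"),   # a run of Unicode letters
    ("SPACE", r"\s+"),
    ("PUNCT", r"[^\w\s]"),
    ("OTHER", r"."),          # anything left over, e.g. an underscore
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_PATTERNS))


def classify_word(word):
    """Refine a WORD token by script and letter case."""
    if re.fullmatch(r"[\u0400-\u04FF]+", word):
        script = "CYRILLIC"
    elif re.fullmatch(r"[A-Za-z]+", word):
        script = "LATIN"
    else:
        script = "MIXED_SCRIPT"
    if word.islower():
        case = "SMALL"
    elif word.isupper():
        case = "CAPITAL"
    elif word[0].isupper() and word[1:].islower():
        case = "CAPITALIZED"
    else:
        case = "MIXED_CASE"
    return f"{script}_{case}"


def tokenize(text):
    """Return tokens as (position, length, type) triples."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "WORD":
            kind = classify_word(match.group())
        tokens.append((match.start(), match.end() - match.start(), kind))
    return tokens


text = "На 20.07.2012 пише на ivan@example.bg за 3/4 от DCL."
for position, length, kind in tokenize(text):
    if kind != "SPACE":
        print(kind, text[position:position + length])
```

Whitespace is kept as a token of its own, so concatenating all token spans reproduces the input exactly.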


Distribution

Availability: Available - Restricted Use

Start date: 06/01/2005

Licence

CC-BY-NC

Restrictions: Share Alike

Download locations: hidden

Distribution Access/Medium: Downloadable

IPR Holder

Institute for Bulgarian Language

Contact Person

Ivelina Stoyanova

toolService

Tool

Language Dependent

Input

Media type: Text

Language: Bulgarian

Character encoding: UTF-8

Annotation type: Segmentation

Segmentation level: Sentence, Word

Output

Media type: Text

Operation

Operating system: Linux

Evaluation

Evaluated: True

Evaluation level: Diagnostic

Resource Creation

Resource Creator

Institute for Bulgarian Language

Creation started: 01/01/2010

Funding Project

Central and South-East European Resources (CESAR)

URL: http://cesar.nytud.hu/

Funding Types: EU Funds, Own Funds

Project duration: 02/01/2011 - 01/30/2013

Metadata

Created: 07/20/2012

Last Updated: 01/31/2013

Version

Version: 3.0

Last Updated: 07/20/2012

Validation

Validated

Usage

Foreseen Use: NLP Applications

Actual Use: NLP Applications

Documentation

Koeva, Svetla, Angel Genov. Bulgarian language processing chain. In Proceedings of Integration of Multilingual Resources and Tools in Web Applications. Proceedings of a Workshop in conjunction with GSCL 2011, University of Hamburg, 2011.
