Polish Wikipedia Corpus

Full textual content of the Polish Wikipedia (http://pl.wikipedia.org/) as of 28 April 2012. The main directory contains subdirectories named AA to DQ. Each of these contains subdirectories 00 to 99, each holding approximately 100 kB of text, with one Wikipedia article per file. Every article file starts with the article title, followed by a blank line and then the article text.
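
The layout described above can be walked with a few lines of Python. This is only a minimal sketch under stated assumptions: the names parse_article and iter_articles and the corpus_root argument are hypothetical, and it assumes every regular file two directory levels below the main directory is one article whose first line is the title.

```python
from pathlib import Path
from typing import Iterator, Tuple


def parse_article(text: str) -> Tuple[str, str]:
    """Split the text of one article file into (title, body).

    Assumes the first line is the title and that a blank line
    separates it from the body, as described for this corpus.
    """
    title, _, rest = text.partition("\n")
    return title.strip(), rest.lstrip("\n")


def iter_articles(corpus_root: str) -> Iterator[Tuple[Path, str, str]]:
    """Yield (path, title, body) for every article file under corpus_root."""
    root = Path(corpus_root)
    # Two directory levels (for example AA/00), then one file per article.
    for path in sorted(root.glob("*/*/*")):
        if path.is_file():
            title, body = parse_article(path.read_text(encoding="utf-8"))
            yield path, title, body
```

Reading with an explicit encoding="utf-8" avoids depending on the platform default, which matters for Polish diacritics.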

Only ordinary articles are included; stubs, templates, disambiguation pages, revision histories, etc. are excluded. All multimedia, tables, references, links, and other non-plaintext elements have been removed. The text is encoded as UTF-8. The 839,269 articles contain 127 million segments, for a total of 918 MB of text.

The corpus was created by applying the WikiExtractor script (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to a Wikipedia dump (http://dumps.wikimedia.org/backup-index.html) and then splitting the resulting 100 kB files into individual articles with custom code.
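
The custom splitting code is not included with this description. The following Python sketch shows one way such a split could be done and is not the original implementation. It assumes that each article in a WikiExtractor output file is wrapped in a <doc ...> ... </doc> element and that the article text itself never contains a literal </doc>; the names DOC_RE and split_articles are hypothetical.

```python
import re
from typing import Iterator

# Assumption: each article is wrapped as <doc ...> ... </doc>.
DOC_RE = re.compile(r"<doc\b[^>]*>\n?(.*?)</doc>", re.DOTALL)


def split_articles(extracted_text: str) -> Iterator[str]:
    """Yield the text of each article found in one extractor output file."""
    for match in DOC_RE.finditer(extracted_text):
        # Drop the wrapper and normalise to exactly one trailing newline.
        yield match.group(1).strip() + "\n"
```

Each yielded string can then be written to its own file; how the files are named is left open here, since the description does not specify it.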

Wikipedia's text content is released under the Creative Commons Attribution-ShareAlike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/).

