WP2TXT
Extract text from Wikipedia dump files quickly and easily. WP2TXT helps you extract plain text from a Wikipedia dump file (XML, compressed with Bzip2), stripping all MediaWiki markup and other metadata.
It was originally intended for researchers looking for an easy way to obtain open-source multilingual corpora, but it may be handy for other purposes as well.
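To make the input format concrete, here is a minimal Python sketch of streaming pages out of a Bzip2-compressed XML dump. It is not WP2TXT's own code. The helper names `iter_pages` and `local_name` are hypothetical, and the sketch assumes the usual MediaWiki export layout in which each `page` element contains a `title` and a `text` element.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop the XML namespace prefix, e.g. "{http://...}page" -> "page".
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a bzip2-compressed XML dump.

    Hypothetical helper: the dump is decompressed and parsed as a stream,
    so the whole file is never held in memory at once.
    """
    with bz2.open(dump_path, "rb") as stream:
        title, text = None, ""
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # free the children of the finished page
```

A caller could loop over `iter_pages("dump.xml.bz2")` (a placeholder file name) and write each page's text to an output file.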
WP2TXT Features:
1. Converts Wikipedia dump files in different languages (tested only on English and Japanese dumps, though).
2. Creates output files in a specified encoding and size.
3. Lets users specify which text elements to extract and convert (title, heading, paragraph, etc.).
4. Lets users decide whether footnotes (and the like) embedded in the text are kept or skipped.
5. Converts character references to UTF-8 characters.
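As a rough illustration of what markup stripping, footnote handling, and character-reference conversion involve, here is a small Python sketch. It is not WP2TXT's implementation. The function `strip_markup` and its `keep_footnotes` parameter are hypothetical, and the sketch handles only a few simple constructs (`<ref>` footnotes, `[[...]]` links, bold and italic quotes), not templates or tables.

```python
import html
import re


def strip_markup(text, keep_footnotes=False):
    """Reduce a small subset of MediaWiki markup to plain text (illustrative only)."""
    # Self-closing footnote tags such as <ref name="a" /> carry no text.
    text = re.sub(r"<ref\b[^>]*/>", "", text)
    if keep_footnotes:
        # Keep the footnote text, drop only the surrounding tags.
        text = re.sub(r"</?ref\b[^>]*>", "", text)
    else:
        # Drop the footnote together with its content.
        text = re.sub(r"<ref\b[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs such as ''' and ''.
    text = re.sub(r"'{2,}", "", text)
    # Convert character references (&amp;, &#x3042;, ...) to real characters last,
    # so that escaped angle brackets are not mistaken for tags.
    return html.unescape(text)


print(strip_markup("[[Capital city|capital]] of '''Japan''' &amp; more"))
# prints: capital of Japan & more
```

Unescaping is done as the final step by design: text such as `&lt;ref&gt;` then comes out as literal characters instead of being treated as a footnote tag.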
This software is free: you are free to download and use this file converter.