information and language processing systems
WikiXML is a collection of Wikipedia articles converted to XML format.
This document describes a set of XML collections based on Wikipedia. Each collection is the result of converting the Wikipedia of one language to an XML format that combines information from the original wikitext of the articles with the XHTML produced by rendering them.
Different XML conversions of Wikipedia are available from a number of other projects:
Our XML version of Wikipedia was designed to serve as a multilingual text collection for experiments in Information Retrieval and Natural Language Processing, in particular in the context of the Cross-Language Evaluation Forum (CLEF). It therefore differs in some design aspects from the other XML-izations of Wikipedia:
For each language, the WikiXML collection provides a set of XML files, one file per Wikipedia page. Pages of the following types are included in the collections:
For each language, the collection includes all Wikipedia pages of these four types, except for a small number of pages that could not be properly processed by MediaWiki, the Wikipedia rendering engine. In the English collection, for example, there were 53 such pages. Each collection provides a list of the excluded pages (the file pages_failed.list in each distribution directory).
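Because each page is stored in its own XML file, a collection can be scanned with a standard XML parser. The following is a minimal sketch, not part of the distribution: it assumes that the page files carry an `.xml` extension somewhere below a collection directory and that pages_failed.list holds one page identifier per line. Neither assumption is confirmed by this description, and the function names `iter_pages` and `load_excluded` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def load_excluded(list_path):
    """Read the list of excluded pages.

    Assumption: one page identifier per line; blank lines are ignored.
    """
    with open(list_path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def iter_pages(collection_dir):
    """Yield (path, root element) for each XML file under collection_dir.

    Assumption: page files end in ".xml"; the directory layout is not known.
    """
    for path in sorted(Path(collection_dir).rglob("*.xml")):
        try:
            yield path, ET.parse(str(path)).getroot()
        except ET.ParseError:
            # Skip files that are not well-formed instead of aborting the scan.
            continue
```

Yielding pages one at a time keeps memory use flat, which matters for collections with millions of files.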
In addition to the XML files, the collections provide a number of database tables that facilitate access to the pages. The tables contain information about page titles and sizes, categories of articles, internal and external links, images and templates used by articles, redirects, and interwiki links (links between counterpart pages in different languages). All of this information is also present in the XML files, but in our experience the tables substantially simplify access to the collections.
See the detailed description of:
|Language||XML sample||# pages||# non-redirect article pages||Size of zipped XML files||Wikipedia dump date|
|English (en)||browse||5,158,844||3,075,006||14 GB||2007-08-02|
|Dutch (nl)||browse||840,606||616,807||1.5 GB||2010-01-01|
|Spanish (es)||browse||279,191||163,383||674 MB||2006-10-17|
|German (de)||browse||969,298||502,779||2.2 GB||2006-11-06|
|Bulgarian (bg)||browse||78,202||33,130||187 MB||2006-11-30|
|Portuguese (pt)||browse||312,170||200,247||714 MB||2006-11-07|
|French (fr)||browse||720,740||405,389||1.9 GB||2006-12-04|
|Italian (it)||browse||388,175||215,821||1.1 GB||2006-11-05|
|Romanian (ro)||browse||77,066||41,209||160 MB||2006-11-08|
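The two page counts in the table can be combined into the share of each collection that consists of non-redirect article pages. The sketch below only restates the table's figures and divides them; the variable names are illustrative.

```python
# (total pages, non-redirect article pages), copied from the table above.
COUNTS = {
    "English (en)": (5_158_844, 3_075_006),
    "Dutch (nl)": (840_606, 616_807),
    "Spanish (es)": (279_191, 163_383),
    "German (de)": (969_298, 502_779),
    "Bulgarian (bg)": (78_202, 33_130),
    "Portuguese (pt)": (312_170, 200_247),
    "French (fr)": (720_740, 405_389),
    "Italian (it)": (388_175, 215_821),
    "Romanian (ro)": (77_066, 41_209),
}

# Share of pages in each collection that are non-redirect articles.
shares = {lang: articles / total for lang, (total, articles) in COUNTS.items()}

for lang, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: {share:.1%}")
```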
The following pages describe how to get, install and use the collection.
Download and unzip the file wikixml-20070312.tar.gz and read the instructions in
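Unpacking the downloaded archive can be sketched with Python's standard `tarfile` module. This is only an illustration: the helper name `unpack_archive` and the target directory are hypothetical, and the archive should come from a source you trust, since extraction writes whatever paths the archive contains.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, target_dir):
    """Extract a gzip-compressed tar archive and return its member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target)
    return names
```

A call such as `unpack_archive("wikixml-20070312.tar.gz", "wikixml")` would place the contents in a local `wikixml` directory (a name chosen here for illustration) and return the list of extracted paths.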