Offline Wikipedia and unicode support

There is an excellent tutorial that describes how to set up an offline version of Wikipedia: "Building a (fast) Wikipedia offline reader". Download the all-in-one file plus mediawiki_sa.tar.7z, extract the archive with 7-Zip, and run the Makefile. The large Wikipedia dump itself can be downloaded from the "Wikipedia database download" page.

The setup is pretty straightforward.

The problem I ran into was with the typical German umlauts such as ä, ö, ü. If you enter Pfäffikon (my hometown), then you get a nice exception. Of course there is a solution for this problem (look for the modified views.py file on the tutorial site mentioned above). But because I had no internet access, I solved the problem myself.

Here is the short explanation:
The script receives the HTTP data as unicode but then handles it with Python's default encoding. You can display the default encoding by printing sys.getdefaultencoding(). On my Debian machine it was ascii. So the script tries to convert the umlauts from unicode to ascii, which can't work because ascii has no representation for them.
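To make the failure concrete, here is a minimal sketch that explicitly performs the conversion Python 2 attempts implicitly when the default encoding is ascii. The variable names (name, ascii_ok, utf8_bytes) are made up for illustration and do not come from the reader's code.

```python
import sys

# Show the interpreter's default encoding (the post saw 'ascii' on Python 2.5).
print(sys.getdefaultencoding())

# The town name as a unicode string; \u00e4 is the a-umlaut.
name = u"Pf\u00e4ffikon"

# Encoding to ascii fails because the umlaut has no ascii representation.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
    print("ascii cannot represent the umlaut")

# Encoding to UTF-8 succeeds and yields a byte string.
utf8_bytes = name.encode("utf-8")
print(repr(utf8_bytes))
```

On a newer Python the first line will likely print utf-8 instead of ascii, but the explicit ascii conversion fails in the same way.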

Two solutions:
Either you add two lines to the file /etc/python2.5/sitecustomize.py to set your default encoding to UTF-8:
import sys
sys.setdefaultencoding('utf-8')

or

you make the following changes in the views.py file:
After line 7 (def article(request, article):), add this line, indented as part of the function body:
article = article.encode('utf-8')
and replace line 71 (searchData = request.GET['data']) with:
searchData = request.GET['data'].encode('utf-8')

Now the offline Wikipedia reader should handle umlauts properly.
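The views.py change can also be wrapped in a small helper. This is only a sketch under assumptions: to_utf8, TEXT_TYPE and fake_get are hypothetical names that are not part of the reader's code, and a plain dict stands in for Django's request.GET.

```python
# unicode on Python 2, str on Python 3
TEXT_TYPE = type(u"")


def to_utf8(value):
    """Return value as UTF-8 bytes if it is a unicode string, else unchanged."""
    if isinstance(value, TEXT_TYPE):
        return value.encode("utf-8")
    # Already a byte string: leave it alone.
    return value


# Hypothetical stand-in for request.GET in views.py.
fake_get = {"data": u"Pf\u00e4ffikon"}
search_data = to_utf8(fake_get["data"])
print(repr(search_data))
```

The type check is there because on Python 2, calling .encode('utf-8') on a byte string that already contains non-ascii bytes first triggers an implicit ascii decode, which would fail in the same way as before.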

2 Responses to “Offline Wikipedia and unicode support”

  1. Thanassis
    March 14th, 2010 20:03

    Since you were kind enough to call my tutorial excellent, I updated the tarball (on my site) to include your UTF-8 change :-)

    Thanassis.

  2. alexander
    March 15th, 2010 06:49

    :) Thank you for adding this change. I appreciate your work. It helps me a lot, especially now to save money, because I’m currently staying in Tanzania, where internet access is quite expensive.
