Friday, August 6, 2010

Towards Tatoeba on Android

Tatoeba.org is a great multilingual example-sentences project. I wish they had an API, but they very kindly provide a dump of their database, which is what third-party projects are currently using. I'm chatting with the Tatoeba guys about an API and hearing that it is an issue of expense, since they are "a free project run by a student". Makes me wonder if there is a way to use Google App Engine or another cloud computing platform to help them support an API, or Mashery or 3scale or something ...

So anyway, I used RazorSQL to take their CSV sentences file and convert it into an SQLite database. The conversion took about 10 hours once I had figured out an issue with RazorSQL always trying to find closing double quotes. Then I got an INSTALL_FAILED_INSUFFICIENT_STORAGE error trying to install an Android app with Tatoeba's full set of sentences stored in a 25 MB SQLite db. I was considering whether the better approach would be to download the file from a remote location and store it on the SD card, when I realized the problem was actually that I had left the 50 MB SQL inserts file in the same folder, so it got automatically folded into the Android APK file. I removed that and the application ran. Well, I say ran; it had the new problem that it couldn't find the Sentences table, but that was my Android SQLite plumbing and not the file size :-)
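For anyone who wants to skip the GUI tool, here is a minimal sketch of the same CSV-to-SQLite conversion in Python. It is not what I ran at the time. It assumes the sentences file is tab-separated with three columns (id, language code, text), and the function name, table name and column names are my own placeholders. Turning quote handling off is one way to sidestep the closing-double-quote problem.

```python
import csv
import sqlite3


def csv_to_sqlite(csv_path, db_path):
    """Load a tab-separated sentences file into a SQLite table.

    Assumed layout (not confirmed by the dump itself): id <TAB> lang <TAB> text.
    Returns the number of rows in the table after loading.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sentences "
            "(id INTEGER PRIMARY KEY, lang TEXT, text TEXT)"
        )
        with open(csv_path, encoding="utf-8", newline="") as f:
            # QUOTE_NONE treats double quotes as ordinary characters, so an
            # unbalanced quote in a sentence cannot swallow the rows after it.
            reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            # Skip any line that does not have exactly three fields.
            rows = ((int(r[0]), r[1], r[2]) for r in reader if len(r) == 3)
            # One transaction for all rows keeps the bulk insert fast.
            with conn:
                conn.executemany("INSERT INTO sentences VALUES (?, ?, ?)", rows)
        return conn.execute("SELECT COUNT(*) FROM sentences").fetchone()[0]
    finally:
        conn.close()
```

Called as `csv_to_sqlite("sentences.csv", "tatoeba.db")`, with both file names being placeholders.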

Fixing that, I got a new error, Data exceeds UNCOMPRESS_DATA_MAX (24825856 vs 1048576), which happened when I tried to import the Tatoeba database from resources into the file system. A little research suggested there was a hard limit of 1 MB on resource and asset files in Android packages. So I moved straight to the network download, which is what we would need for a market release of an app of this kind anyway.

The download from Google Docs worked once I added support for the 302 redirect. It took a long time, but the database did get stored on the file system, although the DDMS client died, so I couldn't check the details.
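The redirect handling is simple enough to sketch. This is a rough Python illustration of the idea, not the Android code from the app: issue a GET, and if the response is a redirect, resolve its Location header against the current URL and try again, up to a limit. The names `http_get` and `download`, the `fetch` parameter and the redirect limit are all hypothetical.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def http_get(url):
    """Issue one GET without following redirects.

    Returns (status, headers, body) with header names lower-cased.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=30)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        headers = {k.lower(): v for k, v in resp.getheaders()}
        return resp.status, headers, resp.read()
    finally:
        conn.close()


def download(url, fetch=http_get, max_redirects=5):
    """Fetch url, following redirects by hand, and return the body bytes."""
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        if status in REDIRECT_CODES:
            location = headers.get("location")
            if location is None:
                raise IOError("redirect without a Location header")
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        if status != 200:
            raise IOError("unexpected HTTP status %d" % status)
        return body
    raise IOError("too many redirects")
```

The `fetch` argument only exists so the redirect loop can be exercised without a network. Also, this sketch holds the whole body in memory; on a phone you would want to stream a file of this size to storage.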

Started to feel like it would be faster to wrap my own Rails REST layer around the db dump ...

So I got the thing displaying a list of sentences, but unfortunately on the first attempt the UTF-8 characters were mangled. SQLite appears to support UTF-8, so I wondered if this was an Android issue. Android is also UTF-8 compliant, so it seems it was actually another RazorSQL issue: although the SQL inserts had been created in UTF-8, the multi-byte characters were lost when I ran them through RazorSQL. I was able to regenerate an SQLite file with the multi-byte characters intact by running the SQL inserts from the command line, which had the added benefit of being hugely faster.
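A scripted equivalent of loading the inserts outside RazorSQL might look like the sketch below. It reads the SQL file explicitly as UTF-8, runs it inside a single transaction, and then counts rows containing the Unicode replacement character as a crude mangling check. The function names and the `sentences`/`text` table and column names are placeholders, and the sketch assumes the inserts file contains no BEGIN or COMMIT statements of its own.

```python
import sqlite3


def load_sql_dump(sql_path, db_path):
    """Execute a UTF-8 encoded SQL script against a SQLite database file."""
    # Decode explicitly as UTF-8 so non-ASCII sentences are not reinterpreted
    # in whatever the platform's default encoding happens to be.
    with open(sql_path, encoding="utf-8") as f:
        script = f.read()
    conn = sqlite3.connect(db_path)
    try:
        # Assumption: the script has no transaction statements of its own.
        # Wrapping everything in one transaction avoids a commit per INSERT.
        conn.executescript("BEGIN;\n" + script + "\nCOMMIT;")
    finally:
        conn.close()


def count_suspect_rows(db_path):
    """Count rows whose text contains U+FFFD, a common sign of bad decoding."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT COUNT(*) FROM sentences WHERE instr(text, ?) > 0",
            ("\ufffd",),
        )
        return cur.fetchone()[0]
    finally:
        conn.close()
```

A zero from `count_suspect_rows` does not prove the text is intact (characters replaced by plain question marks would slip through), so spot-checking a few known sentences is still worthwhile.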

A bit of jiggery pokery later and I had the entire sentences db of Tatoeba displaying in an Android app. Nice to see the ListActivity performantly handling the display of a table with over 400,000 rows.

So that was a fun day, but still to do are:

1) Work out the best option for download; Google Sites has a 20 MB limit, and Google Docs started doing some weird virus redirects on the latest file ...
2) Add some basic search to the app
3) Drop in the relations CSV and get things all prettied up :-)

2 comments:

Sangorrin said...

nice, can u put the source? :]

Sam Joseph said...

In principle yes. I have just moved from the US to the UK and have encountered housing chaos. I hope to work on this and potentially release the source before the end of the year.