Making Hypertext versions of ICL 1900 Documentation

The ICL 1900 Series Preservation project has a large collection of ICL manuals, mostly scanned into djvu (déjà vu) files.

Optical Character Recognition

A tool exists to do optical character recognition (OCR) on djvu files: ocrodjvu, which is based on ocropus (which, at the moment, uses tesseract for the character recognition).

By running ocrodjvu on the djvu files I've managed to make searchable versions of some of the ICL manuals.

To run OCR on a .djvu file the following complex command can be used:

	$ ocrodjvu --save-script xxx.djvused xxx.djvu

(This produces a djvused script that can be used to include the OCR text in the .djvu file. Alternatively, use the --in-place option to modify the .djvu file directly.)
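With many manuals to process, the same command can be wrapped in a small loop. The following is only a sketch: the function names script_name and ocr_all are made up for illustration, and it assumes each script should sit next to its .djvu file under the same base name.

```shell
# script_name FILE.djvu: print the name of the matching djvused script
# (hypothetical helper; assumes the script lives beside the .djvu file).
script_name() {
    printf '%s\n' "${1%.djvu}.djvused"
}

# ocr_all FILE...: run ocrodjvu on each file, skipping any file that
# already has a saved script so an interrupted run can be resumed.
ocr_all() {
    for f in "$@"; do
        s=$(script_name "$f")
        if [ -e "$s" ]; then
            echo "skipping $f: $s already exists" >&2
            continue
        fi
        ocrodjvu --save-script "$s" "$f" || echo "OCR failed for $f" >&2
    done
}
```

It would be called as "ocr_all *.djvu" from the directory holding the scans. Keeping the scripts separate, rather than using --in-place, leaves the original scans untouched if the OCR later needs redoing.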

Further information can be extracted from the OCRed text, notably the document outline (making it easy to jump to specific pages) and a clickable contents list.
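As a rough illustration of the outline step, djvused has a set-outline command that reads a (bookmarks ...) list, terminated by a line containing a single period. The sketch below builds such a script from a list of page numbers and titles. The input format (one "PAGE<TAB>TITLE" pair per line) and the function name make_outline are assumptions made for this example, and it only produces a flat, single-level outline.

```shell
# make_outline: read "PAGE<TAB>TITLE" lines on stdin and write a djvused
# script on stdout that sets a flat document outline.
# (Hypothetical helper; the input format is an assumption.)
make_outline() {
    printf 'set-outline\n(bookmarks\n'
    while IFS="$(printf '\t')" read -r page title || [ -n "$page" ]; do
        [ -n "$page" ] || continue
        # Escape backslashes and double quotes inside the title.
        esc=$(printf '%s' "$title" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
        printf ' ("%s" "#%s")\n' "$esc" "$page"
    done
    # A line holding a single period ends the outline data.
    printf ')\n.\n'
}
```

Typical use would be something like "make_outline < pages.txt > outline.djvused", followed by "djvused -f outline.djvused -s xxx.djvu". Finding the right page numbers and titles in the OCRed text is the hard part and is not attempted here.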

Problems

It is quite difficult to compile the current versions of ocrodjvu, ocropus, tesseract and all their dependencies. I'm currently using the pre-packaged versions from Debian Squeeze.

The Debian versions of ocropus and tesseract are slightly out of date and work very badly on 64-bit systems (Debian bug #590672). The results on 32-bit systems are acceptable.

The old version of ocropus has some problems recognising text that touches the edge of the page (Debian bug #575484). The Debian bug report contains a couple of minor patches for ocropus/ocrodjvu that work around this bug.

ocropus seems to have problems recognising the large bold words with which many manual pages start:

BRN

Branch on Double Indexing

...

This makes automatic generation of bookmarks something of a pig.

The GNOME document viewing program, evince, knows how to select text from an OCRed djvu file, but doesn't show the selection area on the screen: GNOME bug 448739 - Evince cannot select text in djvu documents.

This appears to be fixed in more recent versions of evince.

Evince doesn't show the document outline in the "Index" sidebar: Bug 592806 - empty index for .djvu file.

Results

The raw OCR output for the files I've done can be found in OCR. These are formatted as "djvused" scripts, which are used with the djvused command from the djvulibre package to include the text in the base djvu files:

	$ djvused -f xxx.djvused -s xxx.djvu
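Once the text has been merged in, the manuals can be searched from the command line as well as in a viewer. The sketch below uses djvutxt (also from djvulibre) to dump each file's text layer and grep to look for a phrase. The function name grep_manuals and the DJVUTXT override variable are inventions for this example.

```shell
# Command used to dump the hidden text layer; djvutxt ships with djvulibre.
# It can be overridden through the DJVUTXT variable (an assumption made
# for this sketch, handy for trying the function out on plain text files).
DJVUTXT=${DJVUTXT:-djvutxt}

# grep_manuals PATTERN FILE...: print the name of every file whose text
# layer contains PATTERN, ignoring case. (Hypothetical helper.)
grep_manuals() {
    pattern=$1
    shift
    for f in "$@"; do
        if "$DJVUTXT" "$f" 2>/dev/null | grep -qi -e "$pattern"; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, "grep_manuals 'double indexing' *.djvu" would list the manuals that mention the phrase, subject to how well the OCR recognised it.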