Configuring OCR in OpenKM

A nice feature of OpenKM is OCR of documents. With it, you can search not only the manually added metadata of a document but also the contents of the documents themselves. The quality of the search depends on the indexed content, so the better the quality of the scanned documents, pictures etc., the better the search results.

There are different OCR engines around, of which cuneiform is said to be the best. But I tried a lot and could not get it to work on the ODROID. The next option was tesseract, and there the setup process was pretty straightforward:

  1. Install the necessary packages:

    apt-get install tesseract-ocr

    and, if needed, the language pack(s) with

    apt-get install tesseract-ocr-XXX

    where the XXX stands for the language code.

  2. Configure the text extractor: remove cuneiform and add tesseract3.
  3. Configure the relevant OpenKM variables:

    system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l deu

    The -l option sets the OCR language (deu is German here), so adapt it to your installed language pack.

  4. Restart OpenKM.

Normally this should be all that is necessary (besides good quality of the scanned documents).
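Before restarting OpenKM it can help to check on the command line that the language pack is really there. The following is only a rough sketch: the function name ocr_command and the paths /tmp/scan.png and /tmp/scan are made up for illustration, and the function just prints what I assume the system.ocr line expands to once OpenKM fills in ${fileIn} and ${fileOut}. The language check only runs if tesseract is found on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: print the command line that the system.ocr setting
# is assumed to expand to for a given input file, output base and language.
ocr_command() {
    printf '/usr/bin/tesseract %s %s -l %s\n' "$1" "$2" "$3"
}

# Language code used in system.ocr (deu = German); adapt to your setup.
LANG_CODE=deu

# Show the expanded command for an example scan (paths are placeholders).
ocr_command /tmp/scan.png /tmp/scan "$LANG_CODE"

# If tesseract is installed, check that the language pack is available.
# Older versions print the language list to stderr, hence the 2>&1.
if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | grep -qx "$LANG_CODE"; then
        echo "language pack $LANG_CODE is installed"
    else
        echo "language pack $LANG_CODE is missing"
    fi
fi
```

If the language pack is reported as missing, install the matching tesseract-ocr-XXX package from step 1 first. Running the printed command by hand on a real scan should leave the recognized text in a .txt file next to the output base name.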
