Thursday, January 9, 2014

OCR Scanning

This post describes how to scan pages from a printed book and convert the scanned images to text using Optical Character Recognition (OCR).

The tools that I use are:

  1. SimpleScan
  2. tesseract

Preparation

SimpleScan is a GUI scan application that comes pre-installed in many Linux distributions (including Debian Wheezy).

To manually install it on Debian:

$ sudo apt-get install simple-scan

tesseract is a command-line OCR program.

To install:

$ sudo apt-get install tesseract-ocr

If you only need English, that is all you have to install. For any other language, you must also install the corresponding tesseract language pack, for example tesseract-ocr-rus for Russian, tesseract-ocr-deu for German, or tesseract-ocr-fra for French.
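
As a small sketch of installing several language packs at once, the helper below builds the package names from the language codes. The function name tess_pkgs is hypothetical, and it assumes the tesseract-ocr-<code> naming pattern of the three examples above holds for the codes you pass in, which may not be true for every language.

```shell
# tess_pkgs: print one Debian package name per tesseract language code.
# Hypothetical helper; assumes the tesseract-ocr-<code> naming pattern.
tess_pkgs() {
    for code in "$@"; do
        printf 'tesseract-ocr-%s\n' "$code"
    done
}

# Example (needs root and network access, so it is left commented out):
#   sudo apt-get install $(tess_pkgs rus deu fra)
```

After installing, running tesseract --list-langs should show the available languages, provided your tesseract version supports that option.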

OCR Procedure

  1. Scan the pages using SimpleScan.
  2. Save the image.
  3. Run the tesseract command:
    $ tesseract OnWritingWell.jpg out
    Tesseract Open Source OCR Engine v3.02 with Leptonica
    

    The first parameter is the input image filename. The second parameter is the desired basename of the output text file. The default txt extension is added to the basename, e.g., out.txt.

    If the language is not English, you need to specify it on the command line with the -l option and a 3-character language code (refer to the tesseract man page). Multiple languages are joined with a plus sign. The following command uses three languages: Russian, German and French.

    $ tesseract OnWritingWell.jpg myout -l rus+deu+fra
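
When a book runs to many pages, the tesseract step above can be repeated in a loop. The following is a minimal sketch, not part of the original procedure: the function name ocr_pages and the OCR_CMD variable are hypothetical, and it assumes each page was saved as a separate image file.

```shell
# ocr_pages LANGS IMAGE...: run the OCR command once per image.
# The output basename is the image name without its extension, so
# page1.jpg produces page1.txt.
# OCR_CMD is a hypothetical override; it defaults to tesseract.
ocr_pages() {
    langs=$1
    shift
    for img in "$@"; do
        base=${img%.*}
        "${OCR_CMD:-tesseract}" "$img" "$base" -l "$langs" || return 1
    done
}

# Example, left commented out because it needs real scans:
#   ocr_pages eng page1.jpg page2.jpg
#   ocr_pages rus+deu+fra page*.jpg
```

The loop stops at the first page that fails, so a bad scan is noticed rather than silently skipped. The resulting text files can then be joined with cat if a single file is preferred.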
    

Accuracy

In the example above, the scanned page contained 734 words. In the output text file, 119 of them (16% of the total) required some form of manual correction, which translates to roughly 84% OCR accuracy. A single page is too small a sample to be statistically valid. What accuracy are you getting from OCR?
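
The accuracy figure is simple arithmetic: the share of words that needed no correction. A small sketch using awk follows; the function name ocr_accuracy is hypothetical, and the two counts are the ones from this post.

```shell
# ocr_accuracy TOTAL ERRORS: print the percentage of words that
# needed no manual correction, to one decimal place.
ocr_accuracy() {
    awk -v total="$1" -v errors="$2" \
        'BEGIN { printf "%.1f\n", (total - errors) / total * 100 }'
}

ocr_accuracy 734 119   # prints 83.8
```

The number of words to correct still has to be counted by hand. For the total, wc -w on the output file gives an approximation, since it counts the words tesseract produced rather than the words on the printed page.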
