In a project I am working on I need OCR capability; in particular, I have to recognize a sequence of digits in an image that also contains some pictures (which I am not interested in).

Looking for a candidate library on PyPI, I found PyOCR, which seems to fulfill my requirements. This library is essentially a bridge to an underlying OCR system, and you can choose between two engines: Tesseract and Cuneiform.

Using Homebrew on my Mac I tried both OCR systems, and Tesseract was the one that better fit the requirements of my project.

The problem is that Tesseract has no RPMs for CentOS, the Linux distribution installed on my production servers, so I have to build them on my own in order to deploy Tesseract easily and quickly.

After a search on Google, I found this project that provides spec files for Tesseract and its dependency Leptonica.

Here you can find a complete walkthrough to get Tesseract and PyOCR fully functional on CentOS.

Creating Tesseract RPM for CentOS

Before I start, I would like to point out that I mostly follow the guide presented by the RPM project author. To simplify the entire process, however, I executed this guide as root (which is highly discouraged when building RPMs, for a lot of security reasons), and only because I have a playground CentOS loaded into VirtualBox. If you, like me, want to build the RPMs in an isolated and controlled environment, build your own CentOS on VirtualBox or download a preconfigured image.

Assuming your VirtualBox machine is up and running, the first step is to install the packages needed to compile Leptonica and Tesseract, using the command:

yum groupinstall 'Base' 'Development Tools'

This will take a considerable amount of time, so confirm your choices and have a cup of tea.
Once the installation has finished, install the tools specific to RPM creation:

yum install rpmdevtools rpmlint createrepo

At this stage, your CentOS machine is capable of building RPMs. The first step in creating them is to build the standard directory tree used by the RPM build tools. Assuming you are root and in your home directory (/root), the command:

rpmdev-setuptree

will create the directories:

/root/rpmbuild/BUILD
/root/rpmbuild/RPMS
/root/rpmbuild/SOURCES
/root/rpmbuild/SPECS
/root/rpmbuild/SRPMS
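
As a quick sanity check before going on, the tree can be verified from Python. This is only a sketch: the helper name missing_rpmbuild_dirs is made up for this post, and the default location assumes you ran rpmdev-setuptree as root, as above.

```python
import os

# Subdirectories that rpmdev-setuptree is expected to create (see the list above).
RPMBUILD_SUBDIRS = ("BUILD", "RPMS", "SOURCES", "SPECS", "SRPMS")


def missing_rpmbuild_dirs(topdir="/root/rpmbuild"):
    """Return the expected subdirectories that do not exist under topdir."""
    return [
        name for name in RPMBUILD_SUBDIRS
        if not os.path.isdir(os.path.join(topdir, name))
    ]


if __name__ == "__main__":
    missing = missing_rpmbuild_dirs()
    if missing:
        print("Missing directories: " + ", ".join(missing))
    else:
        print("rpmbuild tree looks complete")
```

An empty list means every directory is in place; anything else suggests rpmdev-setuptree was run as a different user or from a different home directory.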

Once you have created the directory tree, you need to download the source files of the programs you want to package (Leptonica, Tesseract and the optional Tesseract languages).
The maintainer of the Tesseract OCR spec files has created a very handy script that does this dirty work for you; run it with the command:

curl https://raw.githubusercontent.com/grossws/tesseract-ocr-specs/master/utils/download_sources.sh | bash

When the download is complete, you need to place the SPEC files[1] into the correct directory (/root/rpmbuild/SPECS) using the commands:

curl https://raw.githubusercontent.com/grossws/tesseract-ocr-specs/master/specs/leptonica.spec -o /root/rpmbuild/SPECS/leptonica.spec
curl https://raw.githubusercontent.com/grossws/tesseract-ocr-specs/master/specs/tesseract.spec -o /root/rpmbuild/SPECS/tesseract.spec
curl https://raw.githubusercontent.com/grossws/tesseract-ocr-specs/master/specs/tesseract-langpack.spec -o /root/rpmbuild/SPECS/tesseract-langpack.spec
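
Since three near-identical curl lines are easy to get wrong, here is a small sketch that derives each URL and its destination from a single list of names, so the two cannot drift apart. The helper names are invented for this post, and the spec names are an assumption: they are simply the file names that the rpmbuild commands later in this walkthrough expect.

```python
import os

# Base URL as used by the curl commands above.
BASE_URL = "https://raw.githubusercontent.com/grossws/tesseract-ocr-specs/master/specs"

# Assumption: the repository file names match the spec names used by rpmbuild below.
SPEC_NAMES = ("leptonica.spec", "tesseract.spec", "tesseract-langpack.spec")


def spec_downloads(specs_dir="/root/rpmbuild/SPECS"):
    """Return (url, destination) pairs, one per spec file."""
    return [
        ("{}/{}".format(BASE_URL, name), os.path.join(specs_dir, name))
        for name in SPEC_NAMES
    ]


def curl_commands(specs_dir="/root/rpmbuild/SPECS"):
    """Return the equivalent curl command lines as strings (nothing is executed)."""
    return [
        "curl {} -o {}".format(url, dest)
        for url, dest in spec_downloads(specs_dir)
    ]


if __name__ == "__main__":
    for command in curl_commands():
        print(command)
```

The script only prints the commands, so you can review them before pasting them into the shell.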

After the SPEC files are downloaded, it is time to build the first package, Leptonica, using the command:

yum-builddep /root/rpmbuild/SPECS/leptonica.spec && rpmbuild -ba /root/rpmbuild/SPECS/leptonica.spec

This will produce some RPMs in the /root/rpmbuild/RPMS/ directory; you need to install the Leptonica package itself and its development headers[2] in order to build Tesseract:

yum install /root/rpmbuild/RPMS/$(uname -m)/leptonica{,-devel}-1.69-*.rpm

After installing these packages, the next target is Tesseract, which is built using the command:

yum-builddep /root/rpmbuild/SPECS/tesseract.spec && rpmbuild -ba /root/rpmbuild/SPECS/tesseract.spec

Again, the package and its development headers are required to build the Tesseract language packs:

yum install /root/rpmbuild/RPMS/$(uname -m)/tesseract{,-devel}-3.02-*.rpm

The Tesseract language packs are built using the command:

yum-builddep /root/rpmbuild/SPECS/tesseract-langpack.spec && rpmbuild -ba /root/rpmbuild/SPECS/tesseract-langpack.spec
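
The order of these steps matters: Tesseract needs Leptonica's headers, and the language packs need Tesseract's. The sketch below just writes that order down as data and prints the commands from this walkthrough in sequence; it does not run anything, and the names BUILD_STEPS and build_commands are mine.

```python
# Each entry: (spec name, shell glob of the RPMs to install before the next build).
# The globs and version numbers are copied from the commands in this walkthrough.
BUILD_STEPS = (
    ("leptonica", "leptonica{,-devel}-1.69-*.rpm"),
    ("tesseract", "tesseract{,-devel}-3.02-*.rpm"),
    ("tesseract-langpack", None),  # last step, nothing to install afterwards
)


def build_commands(topdir="/root/rpmbuild"):
    """Return the shell command lines in the order they must be run."""
    commands = []
    for name, install_glob in BUILD_STEPS:
        spec = "{}/SPECS/{}.spec".format(topdir, name)
        commands.append("yum-builddep {0} && rpmbuild -ba {0}".format(spec))
        if install_glob is not None:
            commands.append(
                "yum install {}/RPMS/$(uname -m)/{}".format(topdir, install_glob)
            )
    return commands


if __name__ == "__main__":
    for command in build_commands():
        print(command)
```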

And it's done! After these commands you have finally built all the Tesseract RPMs for CentOS. These packages are ready to be installed in production, and they are:

/root/rpmbuild/RPMS/$(uname -m)/leptonica-1.69-*.rpm
/root/rpmbuild/RPMS/$(uname -m)/tesseract-3.02-*.rpm
/root/rpmbuild/RPMS/noarch/*.rpm
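
If you prefer to collect these files with a script before copying them to production, something along these lines should work. It is a sketch under the same assumptions as the list above (version numbers 1.69 and 3.02, build tree under /root/rpmbuild); built_rpms is a name I made up.

```python
import glob
import os
import platform


def built_rpms(rpms_dir="/root/rpmbuild/RPMS", arch=None):
    """Return the paths matching the three patterns listed above, pattern by pattern."""
    # platform.machine() gives the same architecture string as `uname -m` on Linux.
    arch = arch or platform.machine()
    patterns = [
        os.path.join(rpms_dir, arch, "leptonica-1.69-*.rpm"),
        os.path.join(rpms_dir, arch, "tesseract-3.02-*.rpm"),
        os.path.join(rpms_dir, "noarch", "*.rpm"),
    ]
    found = []
    for pattern in patterns:
        found.extend(sorted(glob.glob(pattern)))
    return found


if __name__ == "__main__":
    for path in built_rpms():
        print(path)
```

Note that the patterns deliberately leave out the -devel packages, which are only needed on the build machine.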

You have to move them into a directory on the production machine and install[3] them using:

yum install *.rpm

To test that your installation is working correctly, run:

tesseract --help

and try your first OCR recognition.
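
A first recognition can also be driven from Python through the command line tool. The sketch below assumes the usual invocation `tesseract <image> <output base> -l <lang>`, which writes the text to `<output base>.txt`, and it assumes the English data is installed; the function names and the file name in the example are placeholders.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract and return the recognized text read from <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.check_call(tesseract_command(image_path, output_base, lang))
    with open(output_base + ".txt") as handle:
        return handle.read()
```

For example, `run_ocr("sample.png", "/tmp/sample")` would leave the result in /tmp/sample.txt and return its content, where sample.png stands for any image of your own.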

Using Tesseract through PyOCR

Once Tesseract is installed, using it through PyOCR is trivial. First, install the Python package using the command[4]:

pip install pyocr

and then you can follow the instructions in the package documentation.
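
To give an idea of what this looks like for my use case (a sequence of digits), here is a minimal sketch. It relies on pyocr.get_available_tools() and image_to_string() with a TextBuilder, and on Pillow for loading the image; read_digits and digits_only are hypothetical helpers of mine, and filtering the text after recognition is just one simple approach, not necessarily the most accurate one.

```python
import re


def digits_only(text):
    """Keep only the digit characters of the OCR output, in order."""
    return "".join(re.findall(r"\d", text))


def read_digits(image_path):
    """OCR an image with the first available PyOCR tool and return its digits."""
    # Imported here so that digits_only() stays usable without PyOCR installed.
    import pyocr
    import pyocr.builders
    from PIL import Image

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("PyOCR found no OCR tool (is tesseract on PATH?)")
    text = tools[0].image_to_string(
        Image.open(image_path),
        builder=pyocr.builders.TextBuilder(),
    )
    return digits_only(text)
```

PyOCR also ships a digit-oriented builder for Tesseract; check the package documentation for the version you install before relying on it.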

When I used it on OS X with Tesseract installed through Homebrew, PyOCR refused to work because it was unable to find Tesseract. This happened because Homebrew installs Tesseract in /usr/local/bin, a directory that PyOCR did not check for the Tesseract executable.
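I solved this problem by appending /usr/local/bin to the PATH environment variable before using PyOCR:

import os

# os.pathsep is the separator between PATH entries (":" on Unix);
# os.path.sep is the directory separator ("/") and would corrupt the last entry.
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + "/usr/local/bin"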
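
A slightly more defensive variant of the same idea is sketched below: it joins entries with os.pathsep and skips the append when the directory is already listed, so running it twice does not grow PATH. The helper path_with is hypothetical, not part of PyOCR.

```python
import os


def path_with(extra_dir, current=None):
    """Return the PATH value with extra_dir appended, unless it is already listed."""
    if current is None:
        current = os.environ.get("PATH", "")
    # Empty entries are dropped; trailing slashes are ignored when comparing.
    entries = [entry for entry in current.split(os.pathsep) if entry]
    known = [entry.rstrip("/") for entry in entries]
    if extra_dir.rstrip("/") not in known:
        entries.append(extra_dir)
    return os.pathsep.join(entries)


if __name__ == "__main__":
    os.environ["PATH"] = path_with("/usr/local/bin")
    print(os.environ["PATH"])
```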

At this point, everything works smoothly and I am using Tesseract in my project on a CentOS production system without any problems.


  1. SPEC files are the recipes that tell the RPM build tools which sources to use, which dependencies are required and so on. For a complete guide on RPM building see the Fedora Wiki.

  2. Development headers are usually packaged in a separate RPM because they are rarely used and are needed only to build other packages. Normally you do not install development headers in a production environment. In this case they are used to build the final packages.

  3. The Tesseract language packages allow accurate recognition of words in a particular language, but they also take up a lot of space on server storage. A full installation of the language packages is 640 MB. Consider carefully whether you need all the languages for OCR scanning; you can install only the languages you need by selecting the corresponding langpack.

  4. It is safer to install Python packages inside a virtualenv.

