
Pytesseract: Get started with OCR


Optical Character Recognition (OCR) is a technology that recognizes alphanumeric characters in an image. It lets you extract the text from an image and convert it into an editable file.

There are several open source libraries for OCR, such as Tesseract, GOCR, JavaOCR, and Ocrad. The most popular on the list is Tesseract, an open source OCR engine that was developed at HP between 1984 and 1994. Combined with the Leptonica image processing library, it can read a wide variety of image formats and turn them into text. It also has good community support, wrappers for different languages, and produces good results compared with the alternatives.

One of these wrappers is Pytesseract, a Python wrapper for Tesseract. We will see a simple example using Tesseract from the command line and then the same task using the wrapper.

First, install Tesseract. On Ubuntu 18.04, just run:

sudo apt install tesseract-ocr

Then, check the tesseract version with:

tesseract -v

You will see output like the following:

tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3

Tesseract is a command line program. We can test it by providing an image and then checking the resulting text:

tesseract image.jpg out

It will create an out.txt file with the result.
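If you prefer to see the text directly in the terminal, you can use stdout as the output name; the -l option selects the recognition language (eng here, assuming the English language data is installed, which is the default):

tesseract image.jpg stdout -l eng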

Now, let's do the same thing we did with the command line, but with Python.

Create your virtual environment with python3:

virtualenv ocr -p /usr/bin/python3
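Activate it before installing the dependencies (this assumes the environment was created as ocr in the current directory, as in the command above):

source ocr/bin/activate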

To get started with our OCR, we need two libraries installed:

pip install pillow pytesseract
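As a quick check that pytesseract can find the Tesseract engine, you can ask it for the engine version from Python (get_tesseract_version is part of pytesseract's API and returns the version of the underlying binary):

import pytesseract
print(pytesseract.get_tesseract_version())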

Once the requirements have been installed, import the libraries needed:

import pytesseract
from PIL import Image

Inside the test() function we open the local image and get the result with the image_to_string method. Using the same image, it will return the same text as in the previous example:

def test():
    # Open the local image and run OCR on it with Tesseract
    img = Image.open("images/ocr.jpg")
    result = pytesseract.image_to_string(img)
    return "result: {}".format(result)

def main():
    print(test())

if __name__ == "__main__":
    main()
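By default, pytesseract expects the tesseract binary to be on your PATH. If it is not, you can point pytesseract to it explicitly through tesseract_cmd; the path below is just an example for a default Ubuntu install:

pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"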

Very simple, right? Well, if we send a suitable image to the OCR, it will return the exact text. A suitable image has at least the following attributes (a minimal pre-processing sketch follows the list):

  • Binary image: black and white only
  • A size that is not too big
  • The area of interest delimited
  • The text clearly highlighted over the background
  • No noise: pixels that are not part of the text removed
  • A common typeface, such as Arial, Roman, or Tahoma
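As a minimal illustration of the first point, the following sketch converts the image to grayscale and binarizes it with a fixed threshold using Pillow before sending it to Tesseract; the threshold of 150 is an arbitrary example value and the file name is the same one used above:

from PIL import Image
import pytesseract

# Open the image and convert it to grayscale ("L" mode)
img = Image.open("images/ocr.jpg").convert("L")

# Binarize: pixels brighter than the threshold become white, the rest black
# (150 is an arbitrary example threshold)
binary = img.point(lambda p: 255 if p > 150 else 0)

print(pytesseract.image_to_string(binary))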

In real life, OCR is not as simple as this example: you will need a good pre-processing step before you send the image to the OCR. You must ensure that the input image has the previous attributes to get a good result; the quality of the image will be reflected in the text you obtain.

You can analyze different techniques for the pre-processing step; the one you choose depends on the type of input image. For example, it varies if the picture was taken in darkness or in light, or if the text's font color is white or black. When you finally decide on your pre-processing technique, and you have tested it with several images, then you can send the result to the OCR.
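One way to compare pre-processing choices across several images is to look at the per-word confidences that Tesseract reports through pytesseract's image_to_data. This is only a sketch: it assumes your test images live in an images/ folder, and the grayscale, median-filter, and threshold pipeline is just an example of one possible technique:

import glob

from PIL import Image, ImageFilter
import pytesseract
from pytesseract import Output

def average_confidence(img):
    # image_to_data returns per-word data, including a confidence value;
    # entries with a negative confidence are not recognized words
    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    confidences = [float(c) for c in data["conf"] if float(c) >= 0]
    return sum(confidences) / len(confidences) if confidences else 0

for path in glob.glob("images/*.jpg"):
    raw = Image.open(path)
    # Example pre-processing: grayscale, median filter to reduce noise, threshold
    processed = raw.convert("L").filter(ImageFilter.MedianFilter(3))
    processed = processed.point(lambda p: 255 if p > 150 else 0)
    print(path, average_confidence(raw), average_confidence(processed))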

There are many applications in the OCR world: mobile apps that turn a book page into a digital page, systems that capture license plates for traffic fines, in-vehicle systems that read traffic signs on the road, or devices that let blind people read anywhere. Image processing and OCR together have many applications that you can start to explore.