OCR, Neural Networks and other Machine Learning Techniques
There are many different approaches to solving the optical character recognition problem. One of the most common and popular approaches is based on neural networks, which can be applied to many tasks, such as pattern recognition, time-series prediction, function approximation, and clustering. In this section, we’ll review some of CVISION's OCR approaches using Neural Networks (NNs) in its Maestro Recognition Server.
A Neural Network (NN) is a powerful tool for solving OCR-type problems, though the selection of appropriate classifiers remains essential. The neural network is an information-processing paradigm inspired by the way the human brain processes information. NNs are collections of mathematical models that represent some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning. The key element of an NN is its topology. Unlike the original Perceptron model, shown by Minsky and Papert to have limited computational capability, the neural networks of today consist of a large number of highly interconnected processing elements (nodes) tied together with weighted connections (links). Learning in biological systems involves adjustments to the synaptic connections between neurons.
This is true for neural networks as well. Machine learning typically occurs by example through training, i.e., exposure to a set of input/output data (patterns), where the training algorithm adjusts the link weights. The link weights store the knowledge necessary to solve specific problems. Although neural networks originated in the late 1950s, they didn’t gain much popularity until the 1980s.
Neural networks are good pattern recognition engines and robust classifiers, with the ability to generalize in making decisions based on imprecise input data.
Today NNs are mostly used to solve complex real-world problems. They are often good at solving problems that are too complex for conventional technologies (e.g., problems that have no algorithmic solution, or for which an algorithmic solution is too complex to find) and are often well suited to problems that people are good at solving but traditional methods are not. They offer ideal solutions to a variety of classification problems such as speech, character, and signal recognition, as well as functional prediction and system modeling, where the physical processes are not understood or are highly complex. The advantage of neural networks lies in their resilience against distortions in the input data and their capability to learn.
A popular and simple neural network approach to the OCR problem is based on feed-forward neural networks with backpropagation learning. The basic idea is that we first prepare a training set and then train an NN to recognize the patterns in it. In the training step, we teach the network to respond with the desired output for a specified input. For this purpose, each training sample consists of two components: a possible input and the desired network output for that input. Once training is done, we can present an arbitrary input to the network, and from the network’s output we can determine the pattern type it was shown.
Example of Neural Network OCR Font Learning using Bitmaps
For example, let’s assume that we want to train a network to recognize 26 capital letters, each represented as a 16x16-pixel image. One of the most obvious ways to convert an image into the input part of a training sample is to create a vector of size 256 (in our case), containing a “1” in every position corresponding to a letter pixel and a “0” in every position corresponding to a background pixel. In many NN training tasks, it is preferable to represent training patterns in a so-called “bipolar” way, placing “0.5” into the input vector instead of “1” and “-0.5” instead of “0”. This sort of pattern coding often leads to noticeably better training performance.
For each possible input we need to create the desired network output to complete the training sample. For the OCR task at hand, it is common to code each pattern as a vector of size 26 (because we have 26 different letters), placing “0.5” in the position corresponding to the pattern’s type number and “-0.5” in all other positions.
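As a concrete sketch, the two encodings above might be built as follows (assuming each letter image is a 16x16 binary NumPy array; the function names are illustrative, not part of any product API):

```python
import numpy as np

def encode_input(image):
    """Flatten a 16x16 binary letter image into a bipolar input vector
    of length 256: 0.5 for letter pixels, -0.5 for background pixels."""
    return np.where(image.reshape(256) > 0, 0.5, -0.5)

def encode_target(letter_index):
    """Desired output vector of length 26: 0.5 at the position of the
    letter's type number, -0.5 in all other positions."""
    target = np.full(26, -0.5)
    target[letter_index] = 0.5
    return target

# Example: an image with a single set pixel, paired with the target for 'A'.
image = np.zeros((16, 16), dtype=int)
image[0, 0] = 1
x = encode_input(image)   # shape (256,): one 0.5, the rest -0.5
t = encode_target(0)      # shape (26,): index 0 ('A') is 0.5
```

Together, `(x, t)` forms one training sample of the kind described above.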
After preparing such training samples for all letters, we can start to train our network. The last question concerns the network’s structure. For the above task we can use a single-layer neural network with 256 inputs, corresponding to the size of the input vector, and 26 neurons in the layer, corresponding to the size of the output vector. At each learning epoch, all samples from the training set are presented to the network and the summed squared error is calculated. When the error falls below the specified error limit, training is done and the network can be used for recognition.
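A minimal sketch of this single-layer network, trained by gradient descent on the summed squared error, could look like the following (the tanh activation, learning rate, and error limit are illustrative assumptions; the text does not prescribe them):

```python
import numpy as np

def train(inputs, targets, lr=0.01, error_limit=0.01, max_epochs=1000):
    """Train a single-layer network: 256 inputs, 26 output neurons.
    inputs:  (n_samples, 256) bipolar input vectors
    targets: (n_samples, 26) desired output vectors"""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(256, 26))  # link weights
    b = np.zeros(26)                            # biases
    for epoch in range(max_epochs):
        error = 0.0
        for x, t in zip(inputs, targets):
            y = np.tanh(x @ W + b)              # forward pass
            delta = (t - y) * (1.0 - y ** 2)    # error times tanh derivative
            W += lr * np.outer(x, delta)        # adjust the link weights
            b += lr * delta
            error += np.sum((t - y) ** 2)       # summed squared error
        if error < error_limit:                 # error below limit: done
            break
    return W, b

def recognize(x, W, b):
    """Return the index (0..25) of the strongest output neuron."""
    return int(np.argmax(x @ W + b))
```

At recognition time, the index of the strongest output neuron identifies the letter, mirroring the output coding described above.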
Example of Neural Network OCR Font Learning using Feature-based Classifiers
The approach described above works fine but is limited in its extensibility. There are some issues that a generalized, robust neural network OCR system needs to handle, including font and scale variations. Giving an NN OCR system bitmaps as input is somewhat problematic, since humans don’t see characters at the pixel level, nor is the “essence” of a character font conveyed by this pixelized representation. When there are considerable bitmap variations in the definition of each font character, a better set of inputs would be a set of classifiers, computable from the bitmap images, that are invariant to changes in font and point size.
Such classifiers might include topological characteristics, such as the Euler number and compactness, and geometric properties, e.g., concave-up regions. Of course, these features now need to be computed from the input images and given as input to the neural network OCR system. In addition, because the system is invariant to changes in font and point size, it cannot classify beyond labeling an input bitmap as, say, an “e”, when we may want additional information such as the font and point size, e.g., “e”, point size: 12, font: Times Roman. The point is that features typically provide some level of invariance but, at the same time, limit the degree of recognition.
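As an illustration, two of the simplest such classifiers can be computed directly from a binary bitmap. The sketch below uses Gray's bit-quad method for the Euler number (components minus holes, assuming 4-connected foreground) and takes compactness as perimeter squared over area, one of several common definitions:

```python
import numpy as np

def euler_number(image):
    """Euler number (connected components minus holes) of a binary image,
    computed from Gray's 2x2 "bit-quad" counts, 4-connected foreground."""
    padded = np.pad(image.astype(int), 1)
    # The four corners of every 2x2 window over the padded image.
    a = padded[:-1, :-1]; b = padded[:-1, 1:]
    c = padded[1:, :-1];  d = padded[1:, 1:]
    s = a + b + c + d
    q1 = np.sum(s == 1)                          # quads with one pixel set
    q3 = np.sum(s == 3)                          # quads with three pixels set
    qd = np.sum((s == 2) & (a == d) & (a != b))  # diagonal quads
    return (q1 - q3 + 2 * qd) // 4

def compactness(image):
    """Perimeter^2 / area, with the perimeter estimated as the number of
    foreground/background pixel edges."""
    img = np.pad(image.astype(int), 1)
    area = img.sum()
    perimeter = (np.abs(np.diff(img, axis=0)).sum()
                 + np.abs(np.diff(img, axis=1)).sum())
    return perimeter ** 2 / area
```

For example, a solid square has Euler number 1 (one component, no holes), while a letter like “O” has Euler number 0 (one component, one hole), so this single feature already separates some character classes regardless of font or point size.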
In this case, since there is wide variation in font definitions, we could first use an NN-based OCR system that is invariant to font and scale to recognize the character. Once we know it is an “e”, we can match it against all “e” font definitions in our font database to establish the exact font and point size.
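The second stage of such a pipeline might be sketched as follows (the font database, its structure, and the matching metric are all hypothetical assumptions; here the match is a nearest-neighbor comparison of bitmaps):

```python
import numpy as np

# Hypothetical font database: character label -> list of
# (font name, point size, 16x16 reference bitmap) entries.
def identify_font(char_image, char_label, font_db):
    """Stage 2: given the character label recognized by the invariant NN
    (e.g. 'e'), compare the input bitmap against every stored rendering
    of that character and return the (font, point size) of the closest."""
    best, best_dist = None, float("inf")
    for font, size, bitmap in font_db[char_label]:
        dist = np.sum(char_image != bitmap)  # Hamming distance of pixels
        if dist < best_dist:
            best, best_dist = (font, size), dist
    return best
```

Because stage 1 has already narrowed the search to a single character class, the database lookup only has to compare against the renderings of that one letter.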