Thresholding within OCR
Thresholding is the simplest method of image segmentation, that is, of grouping the pixels of an image into regions. In thresholding there are only two types of pixels: foreground and background. In OCR, foreground pixels correspond to the text, and background pixels correspond to everything else, such as background texture and embedded images.
Individual pixels in a grayscale image are typically marked as “object” pixels if their value is greater than some threshold value and as “background” pixels otherwise. (For dark text on a light page the comparison is reversed, so that the darker pixels become the object pixels.) Typically, an object pixel is given a value of “1” while a background pixel is given a value of “0.” This method employs a static threshold; that is, one value is used to threshold the entire page.
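As a minimal sketch of the rule just described, the following Python fragment applies one static threshold to a whole image. It assumes the image is a list of rows of integer gray values in the range 0 to 255; the function name threshold_image and the sample data are made up for illustration.

```python
def threshold_image(gray, threshold):
    """Return a binary image: 1 where a pixel is brighter than `threshold`, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in gray]


# A tiny 3x4 grayscale image (made-up values, bright object on dark background).
page = [
    [12, 200, 30, 220],
    [15, 210, 25, 215],
    [10, 205, 20, 225],
]

# One threshold value is used for every pixel on the page.
binary = threshold_image(page, 128)
for row in binary:
    print(row)
```

For dark text on a light page, the same sketch applies with the comparison flipped to `value < threshold`.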
Static Threshold Methods
The key parameter in thresholding is the choice of the threshold. Several different methods for choosing a “static” threshold exist. The simplest method is to choose the mean or median value of the image, the rationale being that if the object pixels are brighter than the background, they should also be brighter than the average value. In a noiseless image with uniform background and object values, the mean or median will work quite well as the threshold. Many real images, however, are noisy or have a non-uniform background, and in those cases the mean or median can be a poor threshold.
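The sketch below computes the mean and the median as candidate thresholds and compares them on a clean image and on a slightly noisy one. The helper names and both sample images are assumptions made for illustration; images are again lists of rows of integer gray values.

```python
from statistics import mean, median


def flatten(gray):
    """Return all pixel values of the image as one flat list."""
    return [value for row in gray for value in row]


def mean_threshold(gray):
    return mean(flatten(gray))


def median_threshold(gray):
    return median(flatten(gray))


def count_object_pixels(gray, threshold):
    """Count the pixels that would be marked as object (brighter than threshold)."""
    return sum(1 for value in flatten(gray) if value > threshold)


# Uniform background (20) with two bright object pixels (200).
clean = [
    [20, 20, 20, 200],
    [20, 20, 200, 20],
    [20, 20, 20, 20],
]

# The same layout with a little noise added to the background values.
noisy = [
    [18, 22, 19, 200],
    [21, 20, 200, 23],
    [19, 22, 18, 21],
]

for name, image in (("clean", clean), ("noisy", noisy)):
    for label, chooser in (("mean", mean_threshold), ("median", median_threshold)):
        t = chooser(image)
        print(name, label, t, count_object_pixels(image, t))
```

In the noisy image the median falls inside the spread of background values, so some background pixels end up above it and are marked as object pixels. This is one way the simple choice can break down.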
A more sophisticated approach might be to create a histogram of the image pixel intensities and use the valley point, the minimum between the two histogram peaks, as the threshold. The histogram approach assumes that the background pixels and the object pixels each have some average value, but that the actual pixel values have some variation around these average values. However, computationally this is not as simple as we’d like, and many image histograms do not have clearly defined valley points. Ideally, we’re looking for a method for choosing the threshold that is simple, does not require too much prior knowledge of the image, and works well for noisy images.
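One possible way to locate a valley point is sketched below: build the histogram, smooth it with a moving average, take the two strongest peaks that lie at least some distance apart, and pick the least populated intensity between them. The smoothing radius, the minimum peak separation, and the function names are illustrative assumptions, not part of any standard method. The sketch returns None when no second peak is found, which reflects the point that many histograms have no clear valley.

```python
def histogram(gray, levels=256):
    """Count how many pixels have each intensity (values must be ints in 0..levels-1)."""
    counts = [0] * levels
    for row in gray:
        for value in row:
            counts[value] += 1
    return counts


def smooth(counts, radius=2):
    """Moving average over a window of up to 2*radius+1 bins, to suppress small dips."""
    n = len(counts)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(counts[lo:hi]) / (hi - lo))
    return out


def valley_threshold(gray, levels=256, radius=2, min_separation=32):
    """Return an intensity in the valley between the two main peaks, or None."""
    hist = smooth(histogram(gray, levels), radius)
    # First peak: the most populated (smoothed) intensity.
    p1 = max(range(levels), key=hist.__getitem__)
    # Second peak: the most populated intensity at least min_separation away.
    candidates = [i for i in range(levels) if abs(i - p1) >= min_separation]
    if not candidates:
        return None
    p2 = max(candidates, key=hist.__getitem__)
    if hist[p2] == 0:
        return None  # only one mode was found, so there is no valley
    lo, hi = sorted((p1, p2))
    # Valley: the least populated intensities between the peaks; take the middle one.
    floor = min(hist[i] for i in range(lo, hi + 1))
    lowest = [i for i in range(lo, hi + 1) if hist[i] == floor]
    return lowest[len(lowest) // 2]


# Made-up image: background values near 40, object values near 200.
page = [
    [40, 41, 39, 40, 200, 40],
    [38, 40, 199, 201, 200, 42],
    [40, 39, 41, 198, 202, 40],
]
print(valley_threshold(page))
print(valley_threshold([[100, 100], [100, 100]]))
```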
Semistatic Threshold Methods
Clearly, if the page image contains both normal video, i.e., dark text on a light background, and reverse video, i.e., light text on a dark background, then a single static threshold for the page will not suffice. A more complex thresholding algorithm may first try to segment the image into regions with different backgrounds rather than assuming a uniform image background. Then, for each background region, a static threshold value is selected. Methods such as this one, which are static for some local region but not for the entire image, are sometimes referred to as semistatic.
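A semistatic scheme might look like the sketch below. It assumes the background regions have already been found and are supplied as rectangles (top, left, bottom, right); that segmentation step is not shown. Inside each region it uses the regional mean as a static threshold, and it further assumes that text covers less than half of a region, so the minority class is taken to be the text. All names and data are hypothetical.

```python
from statistics import mean


def threshold_region(gray, top, left, bottom, right):
    """Binarize one rectangular region; returns rows of 0/1 where 1 means text."""
    block = [row[left:right] for row in gray[top:bottom]]
    # Static threshold for this region only: the regional mean.
    t = mean(value for row in block for value in row)
    bright = [[1 if value > t else 0 for value in row] for row in block]
    n_bright = sum(sum(row) for row in bright)
    n_total = sum(len(row) for row in block)
    # Assumption: text is the minority class. If most pixels are bright,
    # the region is dark text on a light background, so invert the result.
    if n_bright * 2 > n_total:
        return [[1 - bit for bit in row] for row in bright]
    return bright


def semistatic_threshold(gray, regions):
    """Apply a separate static threshold inside each given region."""
    out = [[0] * len(row) for row in gray]
    for top, left, bottom, right in regions:
        result = threshold_region(gray, top, left, bottom, right)
        for r, row in enumerate(result):
            out[top + r][left:right] = row
    return out


# Made-up page: the left half is dark text on light, the right half is reverse video.
page = [
    [230, 20, 230, 230, 15, 240, 15, 15],
    [230, 230, 20, 230, 15, 15, 240, 15],
]
regions = [(0, 0, 2, 4), (0, 4, 2, 8)]  # assumed output of a prior segmentation step

for row in semistatic_threshold(page, regions):
    print(row)
```

With a single page-wide threshold, the text in one half would be marked as object and the text in the other half as background; thresholding each region separately marks the text pixels consistently in both halves.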
Of course, even the above method has its limitations. For example, for a book page where the background intensity varies smoothly, this method may not be appropriate, because there are no distinct background regions to segment. Undersampled text, or documents scanned with a cell phone, may need special treatment, including upsampling prior to thresholding. Gradient methods, akin to the edge detection used in computer vision, may sometimes be appropriate for hard-to-threshold images.