Generic optical character recognition (OCR) engines often perform very poorly when transcribing scanned low resolution
(LR) text documents. To improve OCR performance, we apply the Neighbor Embedding (NE) single-image
super-resolution (SISR) technique to LR scanned text documents to obtain high resolution (HR) versions, which we
subsequently process with OCR. For comparison, we repeat this procedure using bicubic interpolation (BI). We demonstrate
that mean-square errors (MSE) in NE HR estimates do not increase substantially when NE is trained on one
Latin font style and tested on another, provided both styles belong to the same font category (serif or sans serif). This
is very important in practice, since for each font size, the number of training sets required for each category may be
reduced from dozens to just one. We also incorporate randomized k-d trees into our NE implementation to perform
approximate nearest neighbor search, obtaining a 1000x speed-up over the original implementation with negligible
MSE degradation. This acceleration also made it practical to combine all of our size-specific NE Latin models
into a single Universal Latin Model (ULM). The ULM eliminates the need to determine the unknown font category
and size of an input LR text document and match it to an appropriate model, a very challenging task, since the dpi
(dots per inch) of the input LR image is generally unknown. Our experiments show that OCR character error rates
(CER) were over 90% when we applied the Tesseract OCR engine to LR text documents (scanned at 75 dpi and 100
dpi) in the 6-10 pt range. By contrast, using k-d trees and the ULM, CER after NE preprocessing averaged less than
7% at 3x (100 dpi LR scanning) and 4x (75 dpi LR scanning) magnification, over an order of magnitude improvement.
Moreover, CER after NE preprocessing was more than 6 times lower on average than after BI preprocessing.
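
As a concrete illustration of the pipeline summarized above, the following is a minimal sketch of the standard Neighbor Embedding reconstruction step: find the k nearest LR training patches for each LR input patch, solve for sum-to-one reconstruction weights, and apply the same weights to the paired HR training patches. All function and variable names are illustrative, and scikit-learn's exact NearestNeighbors stands in for the randomized k-d tree approximate search described above; this is not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ne_reconstruct_patches(lr_patches, train_lr, train_hr, k=5, reg=1e-4):
    """Neighbor Embedding SR: estimate HR patches from LR patches.

    lr_patches : (n, d_lr)  vectorized LR patches from the input image
    train_lr   : (N, d_lr)  LR training patches
    train_hr   : (N, d_hr)  corresponding HR training patches
    Exact k-NN is used here; the paper replaces it with randomized
    k-d trees for approximate search to gain speed.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(train_lr)
    _, idx = nn.kneighbors(lr_patches)              # (n, k) neighbor indices

    hr_est = np.empty((lr_patches.shape[0], train_hr.shape[1]))
    for i, (x, nbrs) in enumerate(zip(lr_patches, idx)):
        # Local Gram matrix of differences between the patch and its neighbors
        Z = train_lr[nbrs] - x                      # (k, d_lr)
        G = Z @ Z.T
        G += reg * (np.trace(G) + 1e-8) * np.eye(k)  # regularize for stability
        # Solve G w = 1, then normalize so the weights sum to one
        w = np.linalg.solve(G, np.ones(k))
        w /= w.sum()
        # Apply the same weights to the corresponding HR patches
        hr_est[i] = w @ train_hr[nbrs]
    return hr_est
```

In a full pipeline, the LR page would be tiled into overlapping patches, reconstructed this way, and the overlapping HR estimates averaged back into a single HR image before being passed to the OCR engine.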
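
The character error rates quoted above are conventionally computed as the edit (Levenshtein) distance between the OCR transcript and the ground truth, normalized by the length of the ground truth. The sketch below shows one way such a measurement could be set up; it assumes the pytesseract wrapper for the Tesseract engine (the abstract does not specify how Tesseract was invoked), and the file names are placeholders.

```python
import pytesseract
from PIL import Image

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def character_error_rate(image_path: str, ground_truth: str) -> float:
    """CER = edit distance between OCR output and reference, over reference length.
    In practice, whitespace is usually normalized in both strings first."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    return edit_distance(ocr_text, ground_truth) / max(len(ground_truth), 1)

# Example: compare CER on the raw LR scan vs. an NE-super-resolved version
# (paths are placeholders).
# print(character_error_rate("page_75dpi.png", reference_text))
# print(character_error_rate("page_75dpi_NE_x4.png", reference_text))
```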