Technology Detail
Optical Character Recognition
Optical character recognition (OCR) is the process of converting images of paper documents into electronic text files. While commercial OCR products perform well on office-quality paper documents, they typically fail to produce usable text transcriptions of noisy documents, such as faxes, crumpled paper sheets, and documents that are faded or show bleed-through. Another drawback of existing commercial OCR technology is that it is language- and script-dependent. As a result, developing an OCR solution for a new language using commercially available approaches requires significant effort and might not be commercially viable.
Raytheon BBN Technologies offers a unique, trainable, robust, and script-independent OCR technology that can be quickly configured for new languages, domains, and levels of image quality. The system is trained on a collection of scanned text images along with line-by-line transcriptions of the text content of those images. The training and recognition components of the system are identical to those in our Byblos speech recognition system; the only difference is the feature extraction stage.
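To make the training input concrete, the following is a minimal sketch of how scanned images might be paired with their line-by-line transcriptions. It does not reflect the actual BBN data format or tooling; the `TrainingLine` record, the `build_training_lines` function, and the file name `page_0001.png` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingLine:
    """One training record (hypothetical layout, not the BBN format)."""
    image_id: str     # assumed identifier of the scanned page image
    line_number: int  # 1-based position of the text line on the page
    text: str         # transcription of that line


def build_training_lines(image_id: str, transcription: str) -> list[TrainingLine]:
    """Split a page transcription into one record per non-empty text line."""
    records = []
    for raw in transcription.splitlines():
        text = raw.strip()
        if text:  # skip blank lines, which carry no text to align
            records.append(TrainingLine(image_id, len(records) + 1, text))
    return records


if __name__ == "__main__":
    page = "Optical character recognition\n\nconverts images into text.\n"
    for record in build_training_lines("page_0001.png", page):
        print(record)
```

Keeping one record per text line mirrors the line-by-line transcriptions described above. How each line is located within the scanned image is left out of this sketch.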
So far, the technology has been successfully demonstrated on nine languages: Arabic, English, Farsi, Pashto, Chinese, Japanese, Thai, Urdu, and Hindi. More recently, the same technology has been successfully applied to the recognition of text in video images.
