OCR (Optical Character Recognition)

Optical Character Recognition accuracy and speed have increased dramatically in recent years, and OCR now presents a viable and cost-effective means of automatically indexing documents. Service Point offers various solutions depending on the customer's requirements and the accuracy of recognition required.

OCR can be used for many different purposes. For instance, a client may be performing a large backfile conversion project of engineering documents and wish to index their drawings according to description, drawing number and issue. In this instance a method known as zone OCR would most likely be utilised. As the name suggests, the OCR system would be configured to look at various zones on the document, and the information would then be extracted, recognised, verified and used to populate a database.
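
The zone approach can be sketched in a few lines of Python. This is a minimal illustration rather than Service Point's actual configuration: the `Zone` class, the `TITLE_BLOCK_ZONES` coordinates and the `recognise` callable (standing in for whatever OCR engine is used) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# (left, top, right, bottom) in pixels on the scanned page.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Zone:
    """A named rectangle on the page that holds one index field."""
    name: str
    box: Box


# Hypothetical title-block layout; real coordinates depend on the drawing template.
TITLE_BLOCK_ZONES: List[Zone] = [
    Zone("description", (1200, 2000, 2300, 2080)),
    Zone("drawing_number", (1200, 2090, 1800, 2150)),
    Zone("issue", (1810, 2090, 2300, 2150)),
]


def extract_zones(
    page: Any,
    zones: List[Zone],
    recognise: Callable[[Any, Box], str],
) -> Dict[str, str]:
    """Run the supplied OCR callable on each zone and build one index record."""
    return {zone.name: recognise(page, zone.box).strip() for zone in zones}
```

The resulting dictionary, with one entry per zone, is the record that would go on to be verified and used to populate the database.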

For such an operation it is imperative that the accuracy of recognition is greater than 80%, as manually correcting any incorrect fields would otherwise require a great deal of time. For zone OCR operations of this type we would also never recommend attempting to OCR fewer than two data fields: it is far easier to catch mistakes when using two fields, as a mistake in a single field easily falls through the net. If this data can also be verified against an existing database to ensure data integrity, the OCR process is greatly enhanced, as the accuracy can be automatically verified.
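
As a rough sketch of the database cross-check, assume the existing database can be reduced to a mapping from drawing number to its known issues. The `known_drawings` structure, the field names and the sample values below are hypothetical.

```python
from typing import Mapping, Set


def needs_manual_review(
    record: Mapping[str, str],
    known_drawings: Mapping[str, Set[str]],
) -> bool:
    """Return True when the OCR'd pair of fields is not found in the reference data.

    A misread in either field will usually make the (drawing number, issue)
    pair fail the lookup, so the record goes to an operator instead of being
    accepted automatically.
    """
    valid_issues = known_drawings.get(record["drawing_number"])
    return valid_issues is None or record["issue"] not in valid_issues


# Hypothetical reference data and OCR results.
known = {"DRG-1042": {"A", "B", "C"}}
print(needs_manual_review({"drawing_number": "DRG-1042", "issue": "C"}, known))  # False
print(needs_manual_review({"drawing_number": "DRG-1O42", "issue": "C"}, known))  # True
```

In the second call the digit 0 has been misread as the letter O, so the drawing number is not found and the record is flagged for correction.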

Another example, where recognition need not be so accurate, would be the conversion of large swathes of text (e.g. historical books or manuscripts). In this example Service Point might employ a method of scanning and OCR'ing the complete text without paying a great deal of attention to the accuracy of the data; to do so would be cost prohibitive for a very large project.

We would then produce the documents as "PDF on text". This method ensures that the original look and integrity of the data are kept whilst enabling the document to be searched, thanks to the underlying text within the PDF file. The 80% accuracy rule does not apply in this case because:

  1. the term being searched for is likely to appear more than once on a particular page, and
  2. much of the text on the page is never likely to be searched for in isolation, e.g. "and", "to", "for", "the" and "then".
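
The first point can be made concrete with a little arithmetic. Assume, purely for illustration, that each occurrence of a word is recognised correctly with probability `p` and that errors are independent (neither assumption comes from measured data). The chance that at least one of `k` occurrences on a page is searchable is then `1 - (1 - p) ** k`:

```python
def hit_probability(p: float, k: int) -> float:
    """Probability that at least one of k occurrences is recognised correctly."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


# Illustrative per-word accuracy of 70%, well below the 80% rule.
for k in (1, 2, 3):
    print(k, round(hit_probability(0.7, k), 3))
```

Under these assumptions, a term that appears three times on a page would still be found about 97% of the time, even though each individual occurrence is recognised only 70% of the time.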