OCR can be run on printed or handwritten text, but the results vary depending on the type and quality of text. Handwritten text in block capitals is more likely to work with OCR than non-standardised text or difficult to read writing such as copperplate. Printed or typed text usually has much better results than handwritten. To use OCR you will need high-quality images to run the software on.
These images should meet the following criteria:
-
Be brightly lit and in focus.
-
Have a good contrast between the text and page, for example, black text on a white or light-coloured page.
-
Any text in the image should be straight in the photograph and not skewed.
-
High-resolution images on capture (this cannot be changed after the photo is taken).
-
Stored in a format that keeps the image quality as high as possible such as TIFF files; JPG files are smaller which compress data and lose image quality.
-
Kept in a useable format; OCR programs will accept certain file formats such as JPG, TIFF, PNG or in some cases PDF. It is best to check in advance to ensure you have a useable format for your specific software.
For example, this image of The Student, the University of Edinburgh’s student newspaper, is brightly lit and focused so that all the text is clear and straight-on in the photograph. There is also good contrast between the dark lettering and the lighter page, making it a perfect image to use with OCR software.
All these factors can impact the quality of the OCR produced and influence the accuracy of the text output. A conservation assessment of materials and guidelines for handling them should be made before images are scanned and digitised, either by organisations such as libraries and archives or by individuals in ownership or custody of materials. All imaging and handling should adhere to these guides and the condition of cultural heritage materials should not be compromised when imaging. For example, do not force a tightly bound book spine in order to get a clearer or straighter image; always prioritise the safe handling of materials.
Image processing software used by libraries and archives often have OCR built-in, but it is available in commonly known software such as Adobe Acrobat and Abbyy Finereader and R via Tessaract. The University of Edinburgh provides access to this software and others through the uCreate space on the first floor Main Library in George Square. There are also open software programmes to run your own; see the ‘Programming Historian: OCR and Machine Translation’ lesson to run your own OCR.
