Optical Character Recognition (OCR) software recognises text from scanned pages and is also used to populate metadata fields by identifying title information, layout and similar features. OCR first pre-processes the image into component parts such as text blocks, sentence blocks and word blocks. This step is called zoning, and different digitisers do it slightly differently (some zone by article, for example). Common OCR software is most effective on modern Roman type; older newspapers and those set in Gothic typefaces are difficult to OCR successfully. Delpher worked with volunteers to improve the quality of newspaper text, particularly for seventeenth-century newspapers and illegal resistance newspapers from the Second World War, while Trove crowdsources the correction of article text to improve accuracy.
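The nesting that zoning produces can be illustrated with a short sketch. It assumes a simplified, hypothetical XML fragment whose element names (`TextBlock`, `TextLine`, `String` with a `CONTENT` attribute) are borrowed loosely from ALTO-style output; it is not a valid ALTO file, it omits namespaces and coordinates, and it maps the "sentence block" level to text lines purely for illustration. The function name `zones_to_text` is likewise invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified OCR output: blocks contain lines, lines contain words.
# Element names follow ALTO-style conventions, but this is not a real ALTO file.
SAMPLE = """
<Page>
  <TextBlock ID="b1">
    <TextLine><String CONTENT="Daily"/><String CONTENT="News"/></TextLine>
  </TextBlock>
  <TextBlock ID="b2">
    <TextLine><String CONTENT="The"/><String CONTENT="harbour"/></TextLine>
    <TextLine><String CONTENT="reopened"/><String CONTENT="today."/></TextLine>
  </TextBlock>
</Page>
"""


def zones_to_text(xml_source):
    """Return a dict mapping each block ID to the list of its line texts."""
    page = ET.fromstring(xml_source.strip())
    result = {}
    for block in page.iter("TextBlock"):
        lines = []
        for line in block.iter("TextLine"):
            # Each String element holds one recognised word.
            words = [s.get("CONTENT", "") for s in line.iter("String")]
            lines.append(" ".join(words))
        result[block.get("ID")] = lines
    return result


zones = zones_to_text(SAMPLE)
for block_id, lines in zones.items():
    print(block_id, lines)
```

Keeping the block, line and word levels separate is what would allow a digitiser to zone by article rather than by page: an article would then simply be a group of block IDs.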
“OCR technology successfully permits the reading of documents containing a mixture of fonts in different sizes and styles.” [Feather and Sturges, 458]
“Normalization is the process of converting numerous, diverse files from their native formats into a smaller number of more open, preservation-oriented formats, typically upon deposit or ingest (e.g., migrating articles transcribed through OCR from Olive’s PrXML to METS-ALTO). Migration more generally may be employed to ensure that the content of a file type that is facing obsolescence can be rendered into a new format (proprietary or open).” [Skinner and Schultz, 11]
“As a digital collection of text becomes bigger, the only efficient way to navigate through it is with search tools supported by good indexing. OCR makes this practicable.” [Tanner, Muñoz and Ros, par. 3]
“The main challenge in this case is the basic Layout Analysis as it is part of any OCR engine may be erroneous. E.g. distinct columns might be merged into one, large newspaper titles may be recognised as image instead of text, or the reading order may be confused.” [Europeana Newspapers 2015, 12-13]
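The column-merging and reading-order problem described in the last quotation can be shown with a toy sketch. It assumes each zoned block is reduced to the pixel position of its top-left corner and that columns have a known, fixed width; real layout analysis must infer the columns itself. The `Block` class, both function names and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: int  # left edge in pixels (assumed)
    y: int  # top edge in pixels (assumed)


def naive_order(blocks):
    """Read strictly top to bottom, then left to right: columns get interleaved."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, column_width):
    """Assign each block to a column by its left edge, then read each column downwards."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.x // column_width, b.y))]


# Two columns (A and B), two blocks each.
blocks = [
    Block("A1", 0, 100), Block("B1", 500, 100),
    Block("A2", 0, 300), Block("B2", 500, 300),
]

naive = naive_order(blocks)
by_column = column_order(blocks, column_width=500)
print(naive)
print(by_column)
```

The naive ordering alternates between the two columns, which is the "merged columns" effect; the column-aware ordering finishes column A before starting column B.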