Lost Indices; or, Integrating Contemporary Metadata into Digitised Newspaper Collections

When a user approaches a newspaper collection, they usually have three principal means of retrieving the information they want: browsing individual issues, jumping to a relevant article by means of an index, or searching the full text of the collection. Considering the material and archival histories of these objects, it is perhaps strange that despite making all three available (to some degree) most providers focus their limited time and resources on improving the third—particularly through improvements to OCR. Perhaps stranger still are the unintentional impediments placed on the first two mechanisms.

When faced with bound collections or microfilm reels of historical newspapers, browsing is by far the most natural method of information retrieval. In 2007, I spent weeks nestled within a corner of the National Library of Scotland poring over twenty years of the Kelso Mail, month by month, issue by issue. Although performing what might be characterised as a manual keyword search, quickly scanning lines for relevant snippets of text, I became immersed in contextual information (including the divorce proceedings of George IV) which greatly improved my understanding of the language, focus and prominence of reports and essays on emigration and settlement. Browsing digital editions should provide the same experience but rarely does so, despite the fact that replicating the research library experience at a distance was the initial aim of many digitisation projects. The resizing of pages to be readable on modern monitors, and the ability to browse articles (computationally defined) rather than pages or issues, have led users to actively de-contextualise their results. Even collections in which OCR has not yet been fully implemented suffer from the ease and temptation of zooming. Likewise, the physical receipt of a particularly thin bound volume, or a reel containing one too many years, immediately made me aware of missing issues, something digital collections, bereft of a physical footprint and immune to physical heuristics, often unintentionally obscure.

Not that browsing was the only or even preferred means of pre-digital retrieval. Although The Times of London possesses perhaps the most famous nineteenth-century index, newspapers across the world were regularly indexed by their publishers and regional or national libraries; indeed, they often joined forces to do so! These indexes might exist as printed volumes (an in-kind contribution by the publisher), as binders of typed monthly addenda, or as enveloped “snippets” sorted by key themes or persons for use in journalistic or library research departments. This wealth of information, however, has largely been relegated to the (dust-covered) archives as incomplete or superseded by full-text searching.

During research for the Atlas, the question of digitising indexes was posed multiple times to often bewildered librarians; one earnestly confided that they had debated integrating the existing index but felt it was unwanted by users, who would demand resources be placed on OCR instead. But, as we shall explore more fully in our next blog, retrieval through human-mediated indexes is paramount to providing wider access to heritage collections.

Traditional Newspaper Index Cards at the Multnomah County Library

For their relatively small textual footprint, indexes provide deep contextual information and a diverse range of access points, far beyond what is provided by either full-text searching or computationally derived abstracts. The difficulty of the latter was so profound that librarians at one library forbade users from directly searching the New York Times’s new (and very expensive) dial-in search service in the late 1970s, leaving the crafting of search terms to those familiar with the reference system rather than those with domain knowledge of the information being retrieved. Modern studies of social bookmarking suggest similar patterns, with free tagging providing a greater diversity of relevant and domain- or context-aware keywords than algorithmic scraping of website or link text.

We are, therefore, left to consider: if human-mediated ontologies provide richer, more diverse access to heritage data, how do we best integrate them into our digitised collections?

Bibliography

Al-Khalifa, Hend S. and Hugh C. Davis. “Exploring the Value of Folksonomies for Creating Semantic Metadata.” International Journal on Semantic Web and Information Systems 3:1 (2007), 13-39. Available at https://www.researchgate.net/publication/39994821_Exploring_The_Value_Of_Folksonomies_For_Creating_Semantic_Metadata. Accessed 8 March 2020.

Friedman, Harry A. Newspaper Indexing. Milwaukee: Marquette University Press, 1942.

Garoogian, Rhoda. “Library Use of the New York Times Information Bank: A Preliminary Survey.” Reference & User Services Quarterly 16:1 (1976), 59-64. Available at https://www.jstor.org/stable/41354525.

Stafford, Robert. Australian Newspaper Index Feasibility Study. Canberra: National Library of Australia, 1980.