Ask for an Inch, We'll Give You a Mile; or, Integrating Authoritative Metadata into Digitised Newspaper Collections

In our last post, we discussed the seemingly unnecessary loss of ontological information that occurred in the move from analogue newspaper indices to digital full-text searching. In this week’s blog, we discuss some of the ameliorations that do or could exist.

Perhaps the most straightforward recapturing of this information would be the mass digitisation of newspaper indices.
This is not, however, as simple as it first appears. Newspaper indices generally appeared in one of three forms:

a printed list, bound or in expandable binders
a collection of notecards held in boxes or drawers
a collection of labelled envelopes containing clippings

The third option, often found in publisher “morgues” or research departments, is probably the least practical to digitise; given its piecemeal nature, it would require significant manual preparation before digitisation. It would, however, provide delicious fodder for research into journalistic and archival science practices, given the specialised curatorial choices made in retaining and cataloguing portions of otherwise ephemeral items. Even the choices regarding the “snipping” of individual articles would provide a remarkable training set for automatic processes that so often fail to produce human-intelligible segmentation.

The first, on the other hand, fits most neatly within standard digitisation processes. It is perhaps for this reason that book-like indices, such as Palmer’s Index to the Times, have already been digitised multiple times, though most of these remain firmly behind paywalls or Google Books’s snippet view. Their digitisation is problematic for almost the opposite reason: because so little manual processing is required, little if any is likely to be undertaken. Rather than being converted into digital subject headings for digital collections, these indices are given the same agnostic full-text treatment as the newspapers themselves, which improves accessibility but does not transform or enhance digitised newspaper collections directly.

The most promising option for integration, instead, seems to lie with the second form: collections of handwritten or typed catalogue cards. As these collections require a greater degree of manual processing than index codices, they seem to have encouraged an “ask for an inch, we’ll give you a mile” mentality from librarians. Partial and complete collections of this type have been successfully digitised by libraries around the United States and at least one European library. These digital projects usually take the form of digital scans of the original cards, which are then OCR-processed or manually transcribed by researchers, with some further corrected by crowdsourced contributions.

What happens at this stage is somewhat eclectic. Some projects are presented as digital datasets in their own right, with the individual cards offered as artefacts to be browsed, searched, viewed or downloaded as collections of ephemera or visual material. Others are treated as updated newspaper indexes, allowing users to locate specific newspaper articles and then request the materials from their local library branch. This, it should be noted, is different from simply digitising index codices. Building upon their inherent cross-referencing nature, the digitised versions of these cards can be effectively transformed into hyperlinked catalogues that allow users to browse ontological networks.
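For readers who like to see such things concretely, the idea of a hyperlinked catalogue can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not a description of how any particular library has built its system: the `IndexCard` fields, the subject headings and the citations below are all invented for the example. Each transcribed card carries a heading, the articles it cites and its "see also" references, and browsing simply means following those references from card to card.

```python
from dataclasses import dataclass, field


@dataclass
class IndexCard:
    """One transcribed catalogue card (field names are hypothetical)."""
    heading: str
    citations: list[str] = field(default_factory=list)  # articles the card points to
    see_also: list[str] = field(default_factory=list)   # cross-references to other headings


def build_catalogue(cards):
    """Map each heading to its card so cross-references can be followed like links."""
    return {card.heading: card for card in cards}


def browse(catalogue, start, depth=1):
    """Return the headings reachable from `start` within `depth` 'see also' hops."""
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for heading in frontier:
            card = catalogue.get(heading)
            if card is None:
                # A cross-reference may point to a card that was lost or never transcribed.
                continue
            for target in card.see_also:
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return sorted(seen - {start})


# Invented sample cards, purely for illustration.
cards = [
    IndexCard("Floods", ["Example Gazette, 4 March 1901, p. 2"], ["Weather", "Bridges"]),
    IndexCard("Weather", [], ["Storms"]),
    IndexCard("Bridges", ["Example Gazette, 6 March 1901, p. 1"], []),
]
catalogue = build_catalogue(cards)

print(browse(catalogue, "Floods"))           # ['Bridges', 'Weather']
print(browse(catalogue, "Floods", depth=2))  # ['Bridges', 'Storms', 'Weather']
```

Note that "Storms" appears in the second result even though no card for it exists in the sample: the network records the librarian's cross-reference whether or not the target card survives, which is itself a small piece of evidence about the original index.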

The final option, which perhaps requires the most effort but offers significant rewards, is that undertaken by the National Library of Finland and the Kenton County Public Library. These libraries have integrated their indexes directly into their digitised collections, allowing users to run full-text searches of OCRed periodicals as well as approach the collections directly through these authoritative indices. Both are fragmentary, relying upon incomplete indices or limited time frames for this full integration. Yet their efforts open significant new possibilities for understanding these historical collections as products of their own times, rather than being constrained by modern conceptions of the newspaper and its organisation.

To support these efforts, and to publicise the significant work undertaken by researchers and librarians around the world in preserving and revitalising these ontological treasures, we are adding an expanding list of digital newspaper indices to the Atlas. Below are some initial entries, which we hope you will explore in detail. If you have or are aware of additional online indices, please contribute via GitHub or by emailing us directly.

In our final blog in this series, we will discuss the possibilities of integrating born-digital indices into our historical digitised newspaper collections: how this has been approached, the continuing concerns and the tantalising rewards.