Delpher

History of the Collection

Delpher is the free, online repository of digitised printed material from the Netherlands. It was created and is maintained by the Koninklijke Bibliotheek, the national library of the Netherlands. It was officially launched in 2013, bringing together several previous digitisation projects, and now includes more than 15 million newspaper pages, 7.3 million magazine pages and approximately 900,000 books from the fifteenth to the twenty-first century. The digitised newspaper collection, originally funded by a subsidy of €12m from the National Programme for Investments in Large-Scale Research Facilities, has the explicit aim of being a resource for humanities researchers. The library has worked with academics and journalists throughout the process of selecting and obtaining the most representative and relevant collection possible.

Consulted Libraries

Delpher is the product of a collaboration between the university libraries of Amsterdam, Groningen, Leiden, and Utrecht, the Meertens Institute, and the Koninklijke Bibliotheek (The Hague). Newspapers are included from the collections of various organisations from the Netherlands and abroad, including: the Eemland archive; Arnhem Library; Calvin College Archives, Michigan; Gelders Archive; the Municipal Archive of Hulst; Municipality of Sluis; Groningen Archives; Herzog August Bibliothek, Wolfenbüttel; Historical Center Overijssel; Joint Archives of Holland, Michigan; Royal Tropical Institute; Royal Institute for Language, Land and Ethnology; Royal Library; Kungliga Biblioteket, Stockholm; L’Archivio Segreto Vaticano; Museum Enschede; Meermanno Museum; National Archives Suriname, Paramaribo; Niedersächsisches Landesarchiv – Staatsarchiv, Oldenburg; NIOD Institute for War, Holocaust and Genocide Studies; Northwestern College, Orange City; North Holland Archives; Press Museum (now part of Sound and Vision); the private collection of André de Rijck; Radboud University Nijmegen; Regional Archive Alkmaar; Regional Archives Leiden; Roosevelt Institute for American Studies; Russian State Archive of Ancient Acts, Moscow; Rutgers University Library, New Brunswick; the Reformed Political Party office; the Social Historic Center for Limburg; City Archives of ’s-Hertogenbosch; City Archives of Rotterdam; City Archive of Vlaardingen; Maastricht City Library; the National Archives Kew, Richmond; Tresoar – Frisian Historical and Literary Center; Trinity Christian College, Palos Heights; Ghent University Library; University Library of Groningen; Leiden University Library; Tilburg University Library; University Library Amsterdam; VU Amsterdam University Library; Waterlands Archive; West Frisian Archive; Wisconsin Historical Society, Madison; Zeeland Archives; ZB Planning Bureau and Library of Zeeland; Central Library, Zurich.

For those newspapers still in copyright, the following rights holders have contributed to the newspaper collection: AD News Media; Audax Publishing; Erven A. J. Morpurgo; Erven D. G. A. Findlay; Erven J. A. Pengel; Erven Varekamp; Erven Wormser; FD Mediagroep; Friesch Dagblad; HDC Media; Media Group Limburg; NDC Mediagroep; Nederlands Dagblad; NRC Media; Omroepvereniging VPRO; De Persgroep Nederland; Ried fan de Fryske Beweging; Reformed Political Party; Foundation for the Management of the CPN Archives; Nieuw Israelitisch Weekblad Foundation; Utjouwerij Frysk en Frij Foundation; Trouw; Amigoe publishing house; De Telegraaf publishing house; Dhr. W. Lionarons; Vereniging Algemene Omroepvereniging AVRO; Wegener Nieuwsmedia.

Microfilming Projects

Since the 1970s, the Koninklijke Bibliotheek and other institutions have engaged in microfilm preservation of their newspaper collections. These efforts, however, were undertaken with differing specifications and not all the collections have been found suitable for digitisation, usually owing to the high contrast employed in some filming processes. Therefore, while it was generally considered preferable to digitise from microfilm collections, as doing so was more efficient and cost-effective than scanning originals, the microfilm status of a title was not a primary selection criterion for digitisation.

Digitisation Projects

Digitisation of newspapers began at the Koninklijke Bibliotheek in 1999 as part of the Roaring Twenties and War & Revolution projects, in which three national newspapers from the 1920s and 300,000 pages from the 1910s were digitised and underwent OCR. Building upon these pilots, as well as a number of digitisation projects at other Dutch archives and libraries, full-scale digitisation began in 2007 with the Database of Digital Daily Newspapers or Databank Digitale Dagbladen (DDD) project. In 2013, this database was combined with the library’s other digital collections to be made available through the Delpher interface. Digitisation of newspapers, alongside other heritage materials, now continues as part of the general operations of the library and is overseen by Metamorfoze, the Dutch national programme for the preservation of paper heritage. All digitisation companies working for Metamorfoze were selected by means of a European tender procedure and all files are checked for completeness and correctness at the Koninklijke Bibliotheek.

Selection

At the start of the DDD project, there was no complete catalogue of the more than 5,000 national, colonial, regional and local newspapers that had been published in the Netherlands since the seventeenth century. Although the Koninklijke Bibliotheek owns the largest collection of Dutch newspapers, several titles exist only in libraries and archives elsewhere in the Netherlands or in other European nations. The selection process of the DDD project started with the installation of an advisory board, consisting of press historians and journalists, charged with selecting a number of major national newspapers as well as influential and long-standing colonial, regional and local publications. As the digitisation of these newspapers was primarily seen as serving academic researchers, its members, largely Dutch academics and journalists, were explicitly asked to consider the scientific significance or press-history relevance of the proposed selection, as well as its geographical, political and religious significance. The board first defined what qualifies as a newspaper: a product of a printing press, and thus made in multiple identical copies; published at a set frequency and on a set day; containing a high proportion of content relating to current affairs of all types; and available for purchase by the general public. Chronological periods were then defined, based on important developments in the history of the press: 1618–1800, 1800–1814, 1814–1869, 1869–1914, 1914–1965 and 1965–1995. For each period, a set of criteria was developed and used to select important, trend-setting and representative newspapers. The aim was to have a representative set of newspapers digitised from the beginning of the project. As more and more newspapers were digitised over the following years, this initial selection became less relevant.

Preservation and Access

The Koninklijke Bibliotheek considers the digitisation of both microfilm and original newspapers as part of its preservation strategy. It has discontinued its microfilming programmes and is instead creating a collection of high-resolution JPEG files as its archival objects of record.

Composition of the Collection

Selection Available

As of December 2018, the database contains over 1.4 million newspaper issues representing over 15 million newspaper pages. A full list of included titles is available. The collection includes newspapers from 1618 to 1995, as allowed by copyright restrictions, and efforts have been made to represent the entire chronology fully. There is, however, a disproportionately large number of issues relating to the Second World War owing to targeted digitisation of this period through exceptional government funding.

Data Quality

Text

Newspapers within Delpher have undergone OCR using ABBYY versions 7.0 to 10.0. According to a 2018 study by the library, the newspaper collections, excluding those from the seventeenth century, have an average word-error rate of 11.3% (standard deviation: 9.96). This represents a material improvement over the databank’s earlier OCR transcriptions: newspapers from the Roaring Twenties and War & Revolution projects had error rates of roughly 30%. The study indicated that updating the current OCR by re-processing archival masters with ABBYY 11 would improve this to 9.01%. The library is currently investigating this option. Owing to the Gothic typeface employed by seventeenth-century newspapers, the OCR on these items is particularly error-prone. To correct this, the library worked in close collaboration with the Meertens Institute and a group of skilled volunteers to manually re-key these transcriptions.
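
As an illustration of what a word-error rate measures, the following minimal sketch computes it as the word-level edit distance between a ground-truth transcription and an OCR output, divided by the number of ground-truth words. It is not the method used in the library's study, whose tooling and tokenisation are not described here; the function name and sample strings are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # naive whitespace tokenisation (an assumption)
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # previous[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from the OCR
                current[j - 1] + 1,      # extra word in the OCR
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Invented sample: two of five words are misread.
ground_truth = "de stad is in brand"
ocr_output = "de ftad is in hrand"
print(word_error_rate(ground_truth, ocr_output))  # 0.4
```

Because the rate is normalised by the length of the reference text, it can exceed 100% when the OCR output contains many spurious words.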

Images

For each page in the collection, Delpher maintains a JPEG2000 file (lossless 8:1 compression) and a PDF file, the latter of which is made available through a structured URL based on an item’s unique identifier. Lower resolution images (96 PPI) are available through the general web interface and API.

Metadata Schema

Newspaper data is divided into three file types: an MPEG21 XML file describing the issue, an ALTO XML file for each page and an OCR XML file for each article. This data adheres to two primary metadata schemas: structural metadata using MPEG21-DIDL and descriptive metadata using Dublin Core and derivatives thereof. The MPEG21 file describes the structural hierarchy of the newspaper issue and provides descriptive metadata for the issue. Its constituent pages and individual articles include segment coordinates and stable URLs. Because a page may contain multiple articles and an article may appear on multiple pages, these are listed on the same hierarchical level with sub-elements describing their relationships. The choice was made to segment issues at article level in order to facilitate user searches, improve relevance rankings, and allow for the removal of an article for copyright reasons without having to remove an entire page or issue.
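
To show how page-level ALTO data of this kind can be read, the sketch below extracts each word and its coordinates from an ALTO document using only the Python standard library. The sample XML is invented for illustration and is not taken from Delpher; the namespace shown is the ALTO version 2 namespace, and the code matches on local element names so that files using another ALTO version would still be handled. The function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented sample, not real Delpher data.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine>
        <String CONTENT="Amsterdam" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
        <SP/>
        <String CONTENT="1869" HPOS="420" VPOS="200" WIDTH="120" HEIGHT="40"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml: str) -> list:
    """Return one dict per ALTO String element: its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        words.append({
            "text": element.get("CONTENT", ""),
            "hpos": float(element.get("HPOS", "0")),
            "vpos": float(element.get("VPOS", "0")),
            "width": float(element.get("WIDTH", "0")),
            "height": float(element.get("HEIGHT", "0")),
        })
    return words


for word in extract_words(SAMPLE_ALTO):
    print(word["text"], word["hpos"], word["vpos"])
```

Coordinates of this sort are what allow an interface to highlight a search term on the page image.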

Backend Structure

Each issue within Delpher’s newspaper collection consists of several items: the issue itself, each individual page and each unique article. Each item, in turn, consists of multiple components, including metadata, text, and image data, which are stored in individual or nested resource files. The issue MPEG21 file contains the IDs for all other items, components, and resources within that issue. Of these, only the archival master images and complementary technical metadata are not accessible through API or direct download.
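
The relationships described above can be pictured with a small data model. This is a conceptual sketch only, not Delpher's storage format: the class names, identifiers and file names are all hypothetical. It keeps pages and articles on the same level and records which article appears on which page as separate pairs, since an article may span several pages and a page may hold several articles.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Item:
    """An issue, a page, or an article, with its component resources."""
    identifier: str  # illustrative ID, not Delpher's real identifier format
    kind: str        # "issue", "page" or "article"
    # Component name ("metadata", "text", "image") -> resource file name.
    components: Dict[str, str] = field(default_factory=dict)


@dataclass
class IssueRecord:
    """Plays the role of the issue-level file: it lists every item in the issue."""
    issue: Item
    pages: List[Item] = field(default_factory=list)
    articles: List[Item] = field(default_factory=list)
    # (article ID, page ID) pairs describing where each article appears.
    placements: List[Tuple[str, str]] = field(default_factory=list)

    def pages_of(self, article_id: str) -> List[str]:
        return [page for article, page in self.placements if article == article_id]

    def articles_on(self, page_id: str) -> List[str]:
        return [article for article, page in self.placements if page == page_id]

    def all_identifiers(self) -> List[str]:
        items = [self.issue, *self.pages, *self.articles]
        return [item.identifier for item in items]


record = IssueRecord(
    issue=Item("issue-1", "issue", {"metadata": "issue-1_didl.xml"}),
    pages=[Item("page-1", "page"), Item("page-2", "page")],
    articles=[Item("article-1", "article"), Item("article-2", "article")],
    placements=[
        ("article-1", "page-1"),
        ("article-1", "page-2"),  # article-1 continues on a second page
        ("article-2", "page-1"),
    ],
)
print(record.pages_of("article-1"))
print(record.articles_on("page-1"))
```

Keeping articles as items in their own right is what makes it possible to withdraw a single article without removing the page or issue that contains it.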

User Interface Structures

Web Interface

The current user interface, Delpher, was built by an in-house team at the Koninklijke Bibliotheek. It replaced an earlier interface designed specifically to serve the DDD newspaper project. It is currently only available in Dutch. The interface allows users to perform a simple or advanced search of the underlying descriptive metadata and OCR text, or to browse images by date and title. The advanced search allows for filtering by article type, newspaper type, title, years, and place of publication. The full-text search can be filtered using standard Boolean operators. By default, search results are ranked by relevance but can also be ordered by date, article title or newspaper title. Users can also visualise their results as a date-segmented histogram. Once a result is selected, a full-page image, centred on the relevant article with highlighted search results, is displayed in an image viewer. The viewer allows users to pan and zoom as well as navigate through the issue. The underlying data (plain text, PDF and JPEG) and manually selected snippets can be downloaded using icons at the top of the viewer, and the metadata and OCR text can be viewed in retractable widgets to either side of the image.

API

Delpher supports a range of API queries. Complete sets of metadata can be harvested through the OAI-PMH protocol, and sub-sets can be retrieved through a Java implementation of the SRU search protocol. Search queries are made via a structured URL at either issue or article level.
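
The sketch below shows what structured request URLs for these two protocols generally look like. The parameter names (verb, metadataPrefix, set, resumptionToken; operation, version, query, startRecord, maximumRecords) come from the OAI-PMH and SRU specifications. The base URLs are placeholders, and the function names and the sample query are assumptions; the actual endpoints, set names, search indexes and any access keys must be taken from the library's own documentation.

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder endpoints, NOT the real service addresses.
OAI_BASE = "https://example.org/oai"
SRU_BASE = "https://example.org/sru"


def oai_list_records_url(base: str,
                         metadata_prefix: str = "oai_dc",
                         set_spec: Optional[str] = None,
                         resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request for harvesting metadata."""
    if resumption_token:
        # Per OAI-PMH, a resumption token replaces all other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base}?{urlencode(params)}"


def sru_search_url(base: str, cql_query: str,
                   start_record: int = 1, maximum_records: int = 10) -> str:
    """Build an SRU searchRetrieve request for a sub-set of records."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,  # a CQL query string
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return f"{base}?{urlencode(params)}"


print(oai_list_records_url(OAI_BASE))
print(sru_search_url(SRU_BASE, "krant AND oorlog"))
```

A harvest over OAI-PMH typically proceeds by repeating the request with each resumption token returned by the server until none is supplied, whereas SRU results are paged by increasing startRecord.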

Direct Download

The Delpher Open Newspaper Archive contains the texts (OCR, ALTO, XML) of all newspapers from the period 1618 to 1876. The archive is 111 GB in size and divided into 22 ZIP files. For copyright reasons, the archive does not include newspapers after 1876 but, under certain research conditions, a licence may be granted for bulk use of post-1876 material.
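
Given the size of the archive, it is usually more practical to read files directly from the ZIP archives than to unpack them first. The following is a minimal sketch under the assumption that the downloaded ZIP files sit together in one local directory and that the text files carry an .xml extension; the internal folder layout of the real archive is not described here, and the function and directory names are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_xml_members(archive_dir: Path) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (zip name, member name, raw bytes) for every XML file found.

    Archives are opened one at a time and members are read one at a time,
    so nothing needs to be extracted to disk.
    """
    for zip_path in sorted(archive_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for info in archive.infolist():
                if info.is_dir() or not info.filename.lower().endswith(".xml"):
                    continue
                with archive.open(info) as handle:
                    yield zip_path.name, info.filename, handle.read()


if __name__ == "__main__":
    # "delpher_archive" is a hypothetical local download directory.
    for zip_name, member, data in iter_xml_members(Path("delpher_archive")):
        print(zip_name, member, len(data))
```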

Rights and Usage

Web Interface

All material obtained from the Delpher web interface may be used freely for personal research. When browsing or searching through the user interface, users are presented with a full citation to the digitised image and text. Any material that remains under third-party copyright is clearly labelled, including the specific conditions of use for that item.

API and Direct Download

Users are allowed to access archive (ZIP) files of all out-of-copyright texts and metadata on Delpher for text and data mining. Advance permission is required to access datasets that contain copyright-protected materials. Although individual texts have been released into the public domain, the dataset as a single object has been released under a CC-BY licence and must be properly attributed in derivative works.

Re-Publication

Out-of-copyright text and images have been released into the public domain and may be used or republished for both personal and commercial purposes. Items that remain under third-party copyright may not be redistributed or republished, in print or digitally. They may, however, be linked to directly.

Suggested Citation

Beals, M. H. and Emily Bell, with contributions by Ryan Cordell, Paul Fyfe, Isabel Galina Russell, Tessa Hauswedell, Clemens Neudecker, Julianne Nyhan, Sebastian Padó, Miriam Peña Pimentel, Mila Oiva, Lara Rose, Hannu Salmi, Melissa Terras, and Lorella Viola. “Delpher.” The Atlas of Digitised Newspapers and Metadata: Reports from Oceanic Exchanges. Loughborough: 2020. DOI: 10.6084/m9.figshare.11560059.