History of the Collection
Parts I and II of the British Library’s 19th Century Newspapers collection, now part of the British Library Newspapers collection, were created as part of a public-private partnership between the British Library and Gale, a Cengage company. The British Library began developing a prototype system for newspaper digitisation in 2001, focusing on nineteenth-century newspapers. The project aimed to digitise newspapers efficiently through the development of automatic indexing and sought to make the newspapers open to advanced searching. The initial focus was on newspapers outside of copyright, and eighteen microfilm reels of varying quality were selected as part of this test. Around 20,000 pages were processed in the first two months of the project.
The British Library’s main efforts to digitise its newspaper collections, beginning in 2004, were funded by a £2 million grant from the United Kingdom’s Joint Information Systems Committee (JISC). The project had an initial target of making 2 million pages available and broadly useful to scholars, researchers, and the public. The British Library partnered with commercial vendors, including Gale and Brightsolid, to process the scanned images. Brightsolid continues to expand the collection as the British Newspaper Archive, and the library itself is currently undertaking a large-scale digitising effort entitled Heritage Made Digital; neither of these is discussed here.
The British Library’s newspaper collection is based upon material obtained through legal deposit legislation. By law, a copy of every UK print publication must be given to the British Library by its publishers and to five other major libraries that request it. Newspapers have been included within the legislation since 1869. Between 1820 and 1869, publishers were obliged to provide copies to the Stamp Office for the purposes of taxation; these copies were passed on to the British Museum and now form part of the British Library’s collection. The original digitisation programme was exclusively derived from this collection.
From the early 1940s to 2010, the usual method of preserving newspapers at the British Library was through the creation of access surrogates by microfilming. Approximately 30% of the newspaper collection was microfilmed during this time and, upon examination in the early 2000s, it was deemed that only 2% of the historical microfilm collection was unfit for digitisation by the Library’s Zeutschel microfilm cameras. Microfilming continued at the British Library alongside the original digitisation programme and was seen as an intermediary stage of the digitisation process rather than a replaced technology. Microfilming has been funded in different ways. Primarily it was funded by the BL, but external microfilm providers have also been used, notably MicroFormat (now a part of Stor-a-File), under contract to the BL. Microfilming of newspapers from other libraries was undertaken as part of a number of co-operative projects, most significantly Newsplan 2000, in which at-risk newspaper titles from libraries across the UK were microfilmed and distributed to the partner libraries, with master copies held by the BL and the National Library of Wales. The project was funded by the Heritage Lottery Fund with additional financial support from the newspaper industry and ran from 2000 to 2005 (though Newsplan 2000 as a body still exists). In total, 1,325 newspaper titles, or 12,800,000 pages, were microfilmed, producing 30,476 reels of microfilm.
The British Library’s initial nineteenth-century newspaper digitisation project took place in two phases. The first took place between 2004 and 2007 while the second ran from 2008 to 2009. The second phase was specifically aimed at expanding the digital collection’s coverage of regional and local news as well as including the eighteenth-century issues of existing titles. Owing to budget constraints and available technology, newspapers were not directly scanned to digital files in either phase of digitisation. Instead, new microfilms were made of newspapers, where needed, and these films were subsequently scanned. The exception to this was The Standard, which was scanned directly from paper copies at the Boston Spa repository. These in-house scanned images and microfilm reels were sent to external vendors, first Apex CoVantage (JISC I), followed by Content Conversion Specialists (JISC II), for processing, providing the library and Gale with an archival master for each page, as well as bitonal and greyscale images and processed OCR text. Although this is a static collection, the BL has continued to expand its newspaper digitisation: over 30 million pages have been produced since 2010 through the BL’s relationship with Findmypast, augmented recently by the BL-funded Heritage Made Digital programme.
For the first newspaper digitisation project, the British Library opened an online consultation with academics forming an advisory group of library staff and scholars to develop a framework of titles that provided a representative image of the country on a given date; forty-eight titles were selected to provide a broad yet detailed view of British life in the nineteenth century. Focus groups and user panel meetings were not held for the second digitisation project because it was decided that all titles could be of interest to some users. Although the British Library’s physical collections of historical newspapers are far more extensive, newspapers were selected for digitisation to provide a representative sample of the wider collection, covering the metropolitan and provincial press, ranging in political and geographical coverage, and representing both English- and Welsh-language titles.
Preservation and Access
The British Library digitises newspapers as part of its remit to preserve its collections. Its policy is to provide access through surrogates rather than the originals, where possible. Traditionally this has been done through microfilm, but the policy was updated in the early 2000s to create additional access copies through the digitisation of existing or new microfilm reels. Therefore, for the original nineteenth-century newspaper collections, digital copies do not act as the sole preservation copy of the newspapers but rather an additional form of access.
Composition of the Collection
The Gale 19th Century Newspapers collection contains sixty-nine distinct publications. Of these, twenty-one were published in London, thirty-three in England outside London, five in Scotland, five in Wales, and two in Ireland. Many of the titles published in Scotland, Wales, and Ireland are primarily held by their respective national libraries, which have pursued separate digitisation projects. The Library aimed to provide the full date range of each selected title to the extent allowable by the physical collection and within the project criteria (1800–1900). Thus, a title such as the Glasgow Herald, which began publication in 1783 and continues today, was only digitised for 1820–1900, that is, from the first issue held by the British Library until the project cut-off date. Although the entire collection covers the period from 1800 to 1900, the number of titles increases substantially as the century progresses, with 10% of Part I being published before 1833, 10% of Part I being published before 1840 and 50% of both collections appearing after 1874.
Each page was processed into machine-readable text by Prime OCR. Part I was processed by Apex, with the hand-keying or OCR correction of article titles. Part II was captured by Gale, who re-keyed article titles for both parts and later commissioned the keying of article subheadings from an external contractor. Independent studies have suggested that the overall OCR quality of Part I and Part II is approximately 60–85%, but this varies widely within and between titles.
For Part I of the collection, Apex provided an archival master file, in TIFF format, at 300 PPI and 8-bit greyscale, as well as lower resolution images for use on the web interface, including bitonal images of text blocks and greyscale images of illustrations or photographs, to facilitate use over dial-up modem technology. For Part II of the collection, the resolution was raised to 400 PPI and the package of images included an unedited archival TIFF, a slightly cropped, lossless JP2 or JPEG2000 master image, and a compressed JPEG for use on the web interface. Images of earlier and local newspapers were generally scanned to produce a lighter image to improve OCR word accuracy.
The data contained within the British Library Newspapers is available under three distinct metadata schemes: two provided by Gale and one held for project work by the British Library.
Gale Legacy Text Mining Drives
Before 2016, the Gale Text Mining Drive contained metadata and text content in a single XML file. Rather than adopting the METS/ALTO schema used by many public institutions, which is similar in coverage, Gale established a bespoke metadata schema to label information consistently across its different newspapers and collections. A DTD file is provided on the text-mining drives. The fields appear to be adapted from Dublin Core, MARC and other bibliographic standards, to which they have been successfully mapped when working with external content partners.
Each XML file contains bibliographic information for a single issue, automatically zoned during the OCR process. The metadata for the issue is followed by the machine-readable text, in which each individual word is encoded with spatial coordinates of its location on the corresponding image, as well as marker elements indicating new pages or columns. The metadata was created partly through automatic processes and partly by direct input by contract workers.
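The structure described above, issue metadata followed by words carrying image coordinates and interleaved page markers, can be read with a streaming-style pass over the elements. The following is a minimal sketch only: the element and attribute names (issue, word, pos, pagebreak) are placeholders invented for illustration and are not taken from Gale's DTD, which should be consulted for the real names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Element and attribute names are placeholders,
# NOT the names defined in Gale's DTD.
SAMPLE = """
<issue>
  <metadata><title>Example Gazette</title><date>18840102</date></metadata>
  <text>
    <pagebreak n="1"/>
    <word pos="100,200,180,230">LONDON</word>
    <word pos="190,200,260,230">NEWS</word>
    <pagebreak n="2"/>
    <word pos="90,150,170,180">SHIPPING</word>
  </text>
</issue>
"""


def words_with_boxes(xml_text):
    """Yield (page, word, (x1, y1, x2, y2)) tuples in document order."""
    root = ET.fromstring(xml_text)
    page = None
    for el in root.iter():
        if el.tag == "pagebreak":
            # A marker element: every following word belongs to this page.
            page = int(el.get("n"))
        elif el.tag == "word":
            box = tuple(int(v) for v in el.get("pos").split(","))
            yield page, el.text, box


rows = list(words_with_boxes(SAMPLE))
for page, word, box in rows:
    print(page, word, box)
```

Tracking the most recent marker element, rather than relying on nesting, reflects the description of page and column breaks as markers within a continuous run of words.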
Gale Current Text Mining Drives
After 2016, the Gale Text Mining Drive separated metadata and text content into three XML files: title or publication-level metadata, issue-level metadata, and issue-level content data. As with the previous schema, the data is encoded using Gale’s standardised metadata schema and a DTD file is provided on the text-mining drives. Although distinct from the METS/ALTO schema, this system is similar to a combination of library MARC records and METS/ALTO XMLs.
British Library Project Drives
A pre-processed version of the data is held by British Library Labs and has been used by BL Labs Competition winners and award recipients in supported projects. This version of the XML is encoded at page rather than issue or article level. As with the Gale version, each word is encoded with the spatial coordinates. As it is encoded at page level, it does not contain the marker elements for page or column breaks. This provides a possibly more researcher-friendly variant of the XML, with human-readable element names and an intuitive nesting of elements, but lacks any form of delimitation between articles, which can be found in the Gale version.
The definitive dataset is kept in a proprietary XML format, known as the Gale Interchange Format or GIFT, from which the text-mining and online datasets are derived. In addition to the metadata provided on text-mining drives or online, this database stores image metadata, including image resolution, file format, bit depth, colour map, file size, width and height.
User Interface Structure
British Library Interface
Users of the collection through the British Library Interface can perform basic or advanced searches of the collection or browse by publication or location. The basic search can be filtered to a specific metadata field or the full text, a date range, a specific title or a specific digitised collection. The advanced search allows for standard Boolean operators and fuzzy searching as well as filtering by place of publication, issue section, title publication frequency, language and whether an image is included. Results can be sorted by publication date, article title, publication title or relevance and further filtered by publication section or article type. Individual results can be viewed at article, page, or issue level. At article level, the searched terms will be highlighted; at page level the article will be outlined in red. The image viewer allows users to navigate the issue and enlarge the image. The selection can be printed, emailed, downloaded or bookmarked using the interface at the top of the screen but there is no access to the underlying OCR text. A suggested citation, including a word count, is provided for each result.
Gale Primary Sources Interface
Users of the collection through the Gale Primary Sources interface can search using the same features as the British Library interface, with additional filters and simple analysis tools, such as topic finder and term frequency. Once a result is selected, users are presented with the chosen article and options to navigate the issue or other search results. Users can adjust the image contrast and brightness to improve legibility and download the image using standard browser context menus. The article may also be downloaded as a PDF or as the plain text of the OCR content. Bibliographical information, images and text content can also be saved to cloud storage on Gale’s servers or through integrations with Google Drive and OneDrive.
API access is not currently available through the British Library or Gale.
Direct Download or Drives
Gale Legacy Text Mining Drives
The previous version of Gale’s Text Mining Drive for Part I separated data into directories containing either scanned images or machine-readable texts. Images were contained in numerically labelled batch directories (for example, BLC_Images_Archive_01) in which there were individual directories labelled by the four-letter title abbreviation used in the XML metadata. Within these were directories labelled by individual issue dates in ISO format (for example, 18840102), inside which were the page images. Machine-readable text files were stored in a separate file structure, using numerically labelled batch directories (for example, BLC_XML_Archive_14). Within these were all the XML files created as part of that processing batch.
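The image layout described above (batch directory, then title abbreviation, then issue date) can be expressed as a small path-building helper. This is a sketch under stated assumptions: the two-digit zero padding of the batch number is inferred from the examples in the text, and the title abbreviation "ABCD" is a made-up placeholder rather than a real title code.

```python
from datetime import date
from pathlib import PurePosixPath


def legacy_part1_issue_dir(batch, title_code, issue_date):
    """Build the relative image directory for one issue on a legacy Part I drive.

    Assumes batch numbers are zero-padded to two digits, as in the
    example BLC_Images_Archive_01.
    """
    batch_dir = f"BLC_Images_Archive_{batch:02d}"
    # ISO 8601 basic format (YYYYMMDD), e.g. 18840102.
    date_dir = issue_date.isoformat().replace("-", "")
    return PurePosixPath(batch_dir) / title_code / date_dir


# "ABCD" is a hypothetical four-letter title abbreviation.
example = legacy_part1_issue_dir(1, "ABCD", date(1884, 1, 2))
print(example)
```

Because the batch number is not implied by the title or date, locating a given issue on a real drive would still mean checking each batch directory in turn.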
The previous version of Gale’s Text Mining Drive for Part II packaged image and text data together within a single data structure. Within numerically labelled batch directories, data was separated into directories using the four-letter abbreviation associated with a particular title. Within these were individual directories for specific dates, each of which contains one XML file, holding both metadata and content data, and the individual page images, stored as JPGs.
Gale Current Text Mining Drives
The current version of Gale’s Text Mining Drives separates data into image and XML data. Within these directories, the data is separated into processing batches, each with an individual alphanumeric code. Within these are directories for each individual title that has been digitised within that batch; a single title code might exist in multiple processing batch directories. Within each title directory are either the metadata and content XML or the image files for each issue. These are not separated by specific date ranges, but a full index of all issues, and their location within the structure, is provided in a spreadsheet file on the drive.
British Library Project Drives
XML data from the British Library Project Drives is divided into 524 ZIP files. These are not indexed or separated by title or date and therefore complete decompression is required to ensure a full title or date range is extracted. Each file represents a single page and is named with the project code, the title abbreviation, the year, the month, the day, and the page of data.
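Because the archives are not organised by title or date, gathering one title means visiting every ZIP file. The sketch below lists each archive's member names and filters them, which avoids extracting everything first but still touches all archives. The file-name pattern is an assumption: the text gives the order of the name components but not the separator or padding, so the underscore-separated form used here, along with the project code "PROJ" and title code "ABCD", is hypothetical.

```python
import re
import tempfile
import zipfile
from pathlib import Path

# Assumed name pattern: PROJECT_TITLE_YYYY_MM_DD_PAGE.xml
# The underscore separator and digit padding are hypothetical.
NAME_RE = re.compile(
    r"(?P<project>[^_/]+)_(?P<title>[^_]+)_(?P<year>\d{4})_"
    r"(?P<month>\d{2})_(?P<day>\d{2})_(?P<page>\d+)\.xml$"
)


def find_pages(zip_dir, title, year):
    """Return (archive name, member name) pairs for one title and year."""
    hits = []
    for archive in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # namelist() reads the archive index only; nothing is extracted.
            for member in zf.namelist():
                m = NAME_RE.search(member)
                if m and m["title"] == title and int(m["year"]) == year:
                    hits.append((archive.name, member))
    return hits


# Demonstration on two small throwaway archives with made-up names.
with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(Path(tmp) / "a.zip", "w") as zf:
        zf.writestr("PROJ_ABCD_1884_01_02_0001.xml", "<page/>")
        zf.writestr("PROJ_WXYZ_1884_01_02_0001.xml", "<page/>")
    with zipfile.ZipFile(Path(tmp) / "b.zip", "w") as zf:
        zf.writestr("PROJ_ABCD_1884_03_05_0002.xml", "<page/>")
        zf.writestr("PROJ_ABCD_1890_03_05_0002.xml", "<page/>")
    hits = find_pages(tmp, "ABCD", 1884)

print(hits)
```

The matching members could then be extracted individually with ZipFile.extract, so that only the pages of interest are written to disk.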
Rights and Usage
In the United Kingdom, the collections can be freely accessed within the British Library reading rooms as well as remotely through Gale and British Library interfaces provided to all UK Higher Education Institutions (and some others) via JISC agreements. Outside the UK, collections are accessible by institutional purchase, including many public or national libraries; there is currently no individual purchase model available.
API access is not currently available through the British Library. However, users can create batches of specific issues or titles for bulk download through the Gale Digital Scholar Lab, a separate subscription service.
Direct Download or Drives
Gale Cengage makes content from its collections available to academic researchers for data mining and analysis on physical hard drives for a nominal “cost recovery” charge. This includes directories, title manifests, XMLs and image files. In the United Kingdom, as part of the original agreement with JISC, all Higher Education Institutions can access the underlying data on request and on payment of a cost recovery fee. Elsewhere, the data is only accessible to those with an institutional purchase of the relevant Gale products. Material obtained on text mining drives may be used to examine individual texts or for large-scale analysis, for personal or non-commercial research purposes, but cannot be duplicated or shared without express permission.
As part of the user agreement, XML, OCR and image data cannot be re-published in any form, physical or digital, without the express permission and licensing of the British Library (web interface) or Gale (web interface and drives). Small quotations, using standard citation practices, may be reproduced in accordance with local fair use provisions and should be accompanied by a DOI link that points back to the individual full text article or book chapter and a proprietary notice in the following form: “Some rights reserved. This work permits non-commercial use, distribution, and reproduction in any medium, provided the original author and source are credited.”
Beals, M. H. and Emily Bell, with contributions by Ryan Cordell, Paul Fyfe, Isabel Galina Russell, Tessa Hauswedell, Clemens Neudecker, Julianne Nyhan, Sebastian Padó, Miriam Peña Pimentel, Mila Oiva, Lara Rose, Hannu Salmi, Melissa Terras, and Lorella Viola. “British Library 19th Century Newspapers.” The Atlas of Digitised Newspapers and Metadata: Reports from Oceanic Exchanges. Loughborough: 2020. DOI: 10.6084/m9.figshare.11560059.