Map of Digitised Newspaper Metadata v.1.0.0

An Oceanic Exchanges Dataset

DOI:10.6084/m9.figshare.11560110

Contents

  • Description
  • Background
  • Methods and Caveats
    • Essential Background Reading
    • The Source Data
    • Data Processing
    • Caveats
  • Data
    • General Notes
    • Table Columns
    • Value Definitions
  • Citation
  • Acknowledgments
  • License
  • Future plans

Description

This dataset provides a comprehensive list of all the XML elements and attributes and JSON keys used within the fourteen (14) database instantiations utilised by the Oceanic Exchanges Project in the creation of their Atlas of Digitised Newspapers and Metadata.

It was derived from a sampling of XML and JSON files from complete collections furnished by ten digitised newspaper providers: Hemeroteca Nacional Digital de México (National Digital Newspaper Archive of Mexico), Chronicling America (Library of Congress), the British Library, the Times Digital Archive (Gale, a Cengage Company), Delpher (Koninklijke Bibliotheek), Europeana, ZEFYS (Staatsbibliothek zu Berlin / Berlin State Library), Suomen kansalliskirjaston digitoidut sanomalehdet (Digital Newspapers at the National Library of Finland), Trove (National Library of Australia), and Papers Past (National Library of New Zealand). Information from these samples was supplemented by internal and publicly available documentation, including document type definitions (DTDs), standardised metadata schema, and API technical guides, as well as interviews with library and digitisation staff.

It includes 3343 rows, each containing a unique element, attribute or key, and provides detailed information about their content, their placement within their separate hierarchies, and their equivalencies across the different instantiations and databases.

Background

The nineteenth-century newspaper was a messy object, filled with an ever-changing mix of material (literary, factual and the suspiciously plausible) in countless amorphous layouts. Working with digitised newspapers is no different. Each database contains a theoretically standardised collection of data, metadata, and images, but the precise nature and nuance of this data is often occluded by the automatic processes that encoded it. Moreover, no true universal standard has been implemented to facilitate cross-database analysis, encouraging digital research to remain within existing institutional or commercial silos. Where common standards have been asserted, such as the minimum standards for Europeana or Chronicling America, they have been standardised at only a very low resolution, with significant variance in the range and interpretation of the metadata within their direct collaborations as well as by independent programmes following their example. These irregularities make the data highly vulnerable to misinterpretation by both end users and those updating the collections in the future.

In order to better explore global exchanges (for example, scissors-and-paste journalism) in the nineteenth-century press, Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840-1914 attempted to integrate and make interoperable the metadata used to store digitised newspapers in a variety of linguistic and institutional contexts. It excavated institutional decision-making from a variety of sources in order to understand the archaeology of digitised newspaper metadata, its vocabulary and structures, and how they related to the conceptions of the newspaper object by both modern end-users and the original nineteenth-century producers. Because exhaustive documentation was not available for any of the collections used, the project team retro-engineered the implementation of these vocabularies, beginning with document type definitions (DTDs) and schema specifications, and then complementing them with internal and public documentation on the cataloguing standards used. Some cases also required the use of grey literature—discussions by users about how to manipulate the data—and direct examination of records.

Although most of the databases used variants of the METS/ALTO standard, these were not implemented in a way that would allow for simple equivalencies. The variance in terminology, and in the interpretation of the correct range of inputs for a given field, arose from the use of a hodgepodge of different vocabularies, including variants of Dublin Core, METS/ALTO, MPEG-21, PREMIS, as well as other bespoke or proprietary taxonomies. Overlapping and ambiguous vocabularies were also structured inconsistently, with some combining data at the article, page or issue level and others separating the metadata and content for these elements into multiple files. Our initial attempts to account for both internal structures and field equivalencies across these databases made the level of irregularity strikingly clear.

Moreover, the interpretation and implementation of these fields was inconsistent within collections owing to the turnover of staff during the digitisation process as well as the long history of metadata being drawn from existing library catalogues. Such layering is particularly evident in the metadata associated with Trove, the National Library of Australia’s collections, which includes end-user annotations, categorisations and text corrections. These layers are valuable to humanities researchers but remain in unintegrated grey literature and derived data for the other collections. The level of publicly available documentation about how to interpret both authoritative and user-generated fields varied widely, and interviews and internal documents made it clear that consistent implementation of guidelines was unlikely across time. Working with these collections, therefore, requires a creative and flexible interpretation of these standards and an understanding of the history and character of the specific digital files.

Although it currently represents only an initial snapshot of fourteen (14) digitised newspaper collections, this dataset acts as a framework to help bridge the interoperability gap between individuals who create the authoritative, standardised metadata for these collections and the end-users who attempt to create historical and other narratives through the use of these materials. It aims to enable researchers from a variety of disciplinary backgrounds to break through the barriers between collections as well as suggest historically-informed principles for archivists and digitisers to consider when implementing their metadata standards and selecting which fields to make publicly available and searchable.

Methods and Caveats

Essential Background Reading

  • Beals, M. H. and Emily Bell, with contributions by Ryan Cordell, Paul Fyfe, Isabel Galina Russell, Tessa Hauswedell, Clemens Neudecker, Julianne Nyhan, Mila Oiva, Sebastian Padó, Miriam Peña Pimentel, Lara Rose, Hannu Salmi, Melissa Terras, and Lorella Viola. The Atlas of Digitised Newspapers and Metadata: Reports from Oceanic Exchanges. Loughborough University, 2020. doi: 10.6084/m9.figshare.11560059.

  • Hauswedell, Tessa, Julianne Nyhan, M. H. Beals, Melissa Terras and Emily Bell. ‘Of Global Reach Yet of Situated Contexts: An Examination of the Implicit and Explicit Selection Criteria that Shape Digital Archives of Historical Newspapers.’ Archival Science: International Journal on Recorded Information, forthcoming.

The Source Data

In 2017–18, led by Paul Fyfe of North Carolina State University, Oceanic Exchanges gathered together fourteen instantiations of ten distinct digitised newspaper databases, alongside histories of their creation, composition and licensing. These collections were generously furnished by digitisation providers via a combination of text-mining hard drives, direct download packages and API retrieval systems. The collections were hosted on a secure server by Northeastern University, which could be consulted remotely by project partners around the world, and these datasets served as the primary source material for The Map of Digitised Newspaper Metadata. Information regarding the history and composition of these datasets was consolidated and made available via the project website in 2018.

Data Processing

In 2018–19, a team led by M. H. Beals of Loughborough University worked to catalogue the data and metadata available across these collections, to undertake detailed interviews with data providers and libraries, and to develop a robust taxonomy for discussing the digitised newspaper not only as a facsimile but as a research object in its own right.

Collation of datasets

Project team members with the appropriate language knowledge worked through sample XML and JSON files, inputting into a shared spreadsheet the name of each element, key and attribute, its XPath/JSONPath, and an example of its content. Each item was given a unique identification number, which was used to describe internal hierarchies through lists of parents, children, attributes and attributing elements.
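The collation described above was carried out by hand in a shared spreadsheet. Purely as an illustration, the kind of inventory it produced (one entry per element and attribute, with its path) could be approximated for a single XML sample along the following lines. The sample document and the function name `list_paths` are hypothetical, and the paths are simplified: they ignore namespaces and positional predicates, both of which real METS/ALTO files would require.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sample; real files are far larger and use namespaces.
SAMPLE = """<issue id="i1">
  <page n="1">
    <article><title>Shipping News</title></article>
  </page>
</issue>"""


def list_paths(xml_text):
    """Return sorted, de-duplicated simplified paths for every element and attribute."""
    paths = set()

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        paths.add(path)
        # Attributes are recorded with an "@" prefix, as in XPath.
        for attribute in element.attrib:
            paths.add(f"{path}/@{attribute}")
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return sorted(paths)


for path in list_paths(SAMPLE):
    print(path)
```

A listing of this kind only gives names and positions; the example content, definitions and categories recorded in the dataset still depended on documentation and human judgement.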

Provision of technical definitions

After a conflated catalogue was completed, team members used technical documentation, grey literature and a wider sampling of the collections in order to attach technical definitions, controlled vocabularies, data types and metadata standard information to each item. The language for these was largely standardised to aid mapping, but exceptions or unique characteristics were documented.

Attachment of ontological categories

As we finalised our master list of data and metadata fields, we attempted to visually group, or map, all possible elements across all collections, using the visualisation tool draw.io; we anticipated that the majority of fields would correspond directly to similar fields in other databases and thus a visual representation would be the clearest means of conveying the information overlap between collections. However, attempts to create a single map of all possible elements and attributes, and to provide provenance of internal structures while grouping objects by type and subtype, raised significant ontological issues; the ideal relationship structure between elements varied widely depending on discipline and use case. As a result, we instead created an inclusive taxonomy of the metadata categories and sub-categories that was based upon the structure of the newspaper as both a physical and digital object and which took into account the reality of the information that was available in each dataset. We hope this format will provide a deeper, more nuanced understanding of this ubiquitous and ambiguous medium and allow for a generous mapping of similar fields while retaining sufficient detail to distinguish apples from oranges.

Caveats

  • This dataset represents only a handful of digitised newspaper collections worldwide
  • The collections it represents are predominately (50%) Anglophone
  • The collections it represents were created by national libraries or large-scale commercial publishers in Europe, North America and Australasia.
  • Where possible, technical definitions from metadata standards and DTDs have been modified to better reflect samples from individual collections but these changes have not been marked in the dataset
  • Controlled vocabulary lists are derived from metadata standards, database-specific documentation and sampling. As a result, they may not reflect all possible values or may include values that were not instantiated in the final collection
  • Format standard and data type are based on official documentation, where available, and sampling in all other cases.

We hope that the dynamic version of this dataset will lead to a more varied and globally representative selection of metadata connections. Those who work with the collections are encouraged to update and refine controlled vocabularies and definitions.

Data

General Notes

The current version is 1.0.0, released January 2020. This represents the static version of the dataset, a further description of which can be found in the Atlas of Digitised Newspapers and Metadata, DOI: 10.6084/m9.figshare.11560059.

The data table is supplied in TSV (tab separated values) format, which can be readily imported into spreadsheets and database software.
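For those who prefer a script to a spreadsheet, a minimal sketch of reading the table with Python's standard `csv` module follows. The two sample rows are invented placeholders (only the header names are taken from the column list in this README), the filename in the closing comment is a placeholder rather than the actual file name, and UTF-8 encoding is an assumption.

```python
import csv
import io

# Invented rows standing in for the real table; only the header names
# come from the column list in this README.
SAMPLE_TSV = (
    "UID\tCategory\tCollection\tName\n"
    "1\tIssue Date\tCAME\texampleDate\n"
    "2\tLanguage\tTRAP\texampleLang\n"
)


def read_rows(handle):
    """Read a tab-separated table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_rows(io.StringIO(SAMPLE_TSV))
print(len(rows), rows[0]["Collection"])

# For the real file (placeholder filename, assumed UTF-8):
# with open("metadata_map.tsv", encoding="utf-8", newline="") as handle:
#     rows = read_rows(handle)
```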

Table Columns

The columns of data are defined as follows:

Field | Description
UID | This column contains a unique numerical code used for cross-referencing between fields within the same database
Category | This column maps the field to a broad ontological category, used across collections
Sub-Category | This column maps the field to a narrower ontological category, used across collections
Collection | This column contains the 4-character code for a particular collection and file type
XPATH | This column contains the full XPath or JSONPath to the field
Name | This column contains the name of the field (element or attribute) within the database
Format | This column contains the standard or schema to which the field belongs
Content Type | This column contains a 3-character code indicating the data type
Example Content | This column contains the contents of the field from an example XML or JSON file
MCHOICE Values | This column contains the controlled vocabulary of a multiple-choice field
Definitions | This column contains a definition or description of the field
Field Type | This column indicates whether the field contains data that can be mapped across collections, technical (usually automatically generated) data, or a container, which contains no data but rather only other elements or fields
Element Type | This column describes whether the field is an element/key (XML/JSON) or attribute
Parent | If this is an element, this column contains the unique numerical code of its parent element
Attributed | If this is an attribute, this column contains the unique numerical code of the element for which it is an attribute
Child(ren) | This column contains the numerical codes of all this element’s child elements
Attributes | This column contains the numerical codes of all this element’s attributes
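As a rough sketch of how the Parent column can be used to recover a field's position in its hierarchy, the following walks from a row up to its root. The rows are invented, and the sketch assumes that Parent holds a single UID (or is empty for a root element) and that UIDs are unique across the rows supplied; if UIDs turn out to be unique only within a collection, the lookup would need to be keyed on Collection and UID together.

```python
# Invented rows; only the column names come from the table above.
ROWS = [
    {"UID": "1", "Name": "issue", "Parent": ""},
    {"UID": "2", "Name": "page", "Parent": "1"},
    {"UID": "3", "Name": "article", "Parent": "2"},
]


def ancestors(uid, rows):
    """Return field names from the root down to the row with the given UID."""
    by_uid = {row["UID"]: row for row in rows}
    chain = []
    seen = set()
    while uid and uid in by_uid and uid not in seen:
        seen.add(uid)  # guard against accidental cycles in the data
        chain.append(by_uid[uid]["Name"])
        uid = by_uid[uid]["Parent"]
    return list(reversed(chain))


print(" > ".join(ancestors("3", ROWS)))
```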

Value Definitions

Categories

The top-level ontological categories used to map fields from one collection to another are as follows:

Category | Description
Abbreviated Newspaper Title | A standardised abbreviation of the newspaper title which may also appear in unique IDs for the newspaper and article
Alternate Newspaper Title | Provides an alternate title for the publication, where the title may have changed during its run. Occasionally this is a minor change, such as dropping the article, but this can also be a more radical restyling of the publication. Standardised title information can be found in newspaper title
Article Category | Specifies the genre of the article, such as Advertisements or News Article
Article Subheading | Smaller titles used to break up the article
Article Title or Headline | Provides the headline or title of the item. This may be hand-keyed or the result of OCR. Distinct from a section heading
Attribution | The name of the author of the newspaper article, as printed
Comments and Social Tagging | Information about tags and comments added to an article by online users
Coordinates | Provides coordinates for a component of the image
Copyright | Specifies the copyright holder of the issue. Access conditions provides additional information about the status of a physical object, i.e. any restrictions on access
Database | Provides the name of the digital database in which the issue is stored. Distinct from the Holding Library information
Dimensions | Provides the dimensions of a component of the image. See measurement unit for the specific unit used; this is usually “mm10”
Document Type | Specifies the nature of the piece of writing, generally “article”
Edition | Provides edition information for the issue, including morning, afternoon, evening, day, special and supplemental. In SBMA and SBME, it also specifies that it is an electronic edition of the issue
Filename | Provides the filename of the image file attached to the XML text. This can take the form of file names, URLs or relative paths with filenames
Font Information | Provides information about the font of the text, as recognised by the OCR software. This includes font size, font style (whether bold, italics, underlined, small caps, etc.), font type (whether serif or sans serif), font width (whether proportional or fixed), and font family
Geographic Coverage | Classifies newspapers depending on their wider geographic area of publication and readership; it is listed as regional, local, or a specific territory. If not indicated, it can often be presumed to coincide with place of publication. It can be used to distinguish between different editions of the same paper aimed at different cities, towns or regions
Holding Library | Provides the name and details of the library or archive that contained the digitised material at the time of digitisation. For some databases, it is separated into library name and library location
Hyphenation | Provides information about words that have been typographically hyphenated
ID | Provides a unique ID for the component of the image
Illustration Information | Provides information about any illustrations, including whether one is present, its type, and its colour information
Issue Date | Gives the date of the issue. May refer to the publication date, the date as printed on the issue, the ISO standard date or a part of the date, such as the day of the week, day, month or year. In some cases, this is normalised and in others it is the date as printed on the image
Issue Number | Gives the issue number for the item. This is sometimes a string, as printed, and other times a numerical value. It can also take the form of a unique identifier for the newspaper issue
Language | Specifies the language of the textual unit, often, but not always, using the ISO language code. This can refer to the language of the newspaper, the article text, or the specific block of text
Measurement Unit | Provides the measurement unit for all values except the font size
Metadata Type | Defines the metadata type
Microfilm Reel | Most often provides a unique 4-digit ID to the microfilm reel used in the creation of the image associated with this XML. This does not translate to a MARC or library-based record number
Newspaper Subtitle | The subtitle, which is intended to provide clarification of the newspaper title, may be taken from the physical object, or an amalgamation chosen by the cataloguer
Newspaper Title | There are two kinds of newspaper title provided: first, the title as it appears on that particular issue; second, the title in a normalised format. This may or may not be a version of the title as printed but rather an amalgamation chosen by the cataloguer. It is usually derived from the earliest available issue from that newspaper, after which alternate newspaper titles will be recorded
OCR | Provides technical information on the OCR software used, including Description ID, Agency, Date and Time, Step Description, Step Settings, Creator, Name, Version, Relevance and Confidence Level
Page Count of Article | Provides the number of pages over which the particular article, as computationally zoned, is spread
Page Count of Issue | Total number of pages of the newspaper issue
Page Number | Provides an ID for the page. This is divided into: unique identifiers, page image numbers, identifiers across the database; URLs to web-accessible versions of the page; relative numerical identifiers, within the issue; and string descriptors
Page Position | Indicates the position of the page within the issue
Page Skew | Provides the degree to which the page image is skewed from the perpendicular
Paragraph | Information about paragraphs, including XML containers, text alignment, and UIDs
Place of Publication | Provides the geographical location associated with the printing or manufacture of the publication, generally listed as the city. It is determined by the imprint or cataloguer determination for the publication as a whole, except where specified as being the publication location of an alternate title for this newspaper
Publication Date Range | Provides the date range, in either years or full dates, of the publication. It does so without date or other restrictions and should be considered to refer to the newspaper as defined by ontological subcategory normalised title. There are three variants: the dates included in the collection (collection range), those that to our knowledge existed (full range), and all the individual days, months and years of publication. Newspaper start date and end date provide the full ISO date for the publication’s first and last issue. Instantiations are divided into container elements, which hold no specific data, and attributes or specific elements, which hold the year, month and day separately
Publication Frequency | Specifies the frequency of publication as a whole and should not be confused with the edition of a specific issue. Across these databases, it is usually listed as daily, weekly or quarterly
Publication Genre | Specifies the genre of the publication at the broadest level, i.e. whether a newspaper, periodical or magazine
Publisher | This category contains information about the publisher of the publication. It does so without date or other restrictions and should be considered to refer to the entire run of the newspaper as defined by category Normalised Title
Quality | The preservation status of the physical object
Section Heading | Specifies the printed title for a section; distinct from an article headline or title
Shelf Mark | This category contains information linking the publication, as a conceptual unit, to an item or record in an external database or catalogue. It does so without specific date or other restrictions and should be considered to refer to a specific physical volume rather than a volume as numbered by the original publisher
Starting Column for Article | Provides the column on the page (given as a letter) in which the article begins
Sub-Collection | Provides details of the sub-collection within which the item has been placed
Supplement Title | The physical newspaper or periodical section this article appeared in if not the issue itself; i.e. if it appeared in a supplement
Text | Article text content. For text content in article titles, see article title or headline. For text content in subheadings, see article subheading
Volume Number | Provides the volume information, either a numerical volume number relative to the newspaper title or a unique identifier. One volume comprises many issues
Word Count of Article | The number of words in the article
Word Count of Page | The number of words on the page, as identified through Optical Character Recognition

Definitions of sub-categories can be found within the relevant category of the Atlas.
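Because the Category values are shared across collections, they can serve as a join key for finding candidate equivalents of a field in other databases. The following is a minimal sketch under that reading; the rows and field names are invented, and a shared category indicates only a broad correspondence, so the definitions and sub-categories should still be checked before treating two fields as interchangeable.

```python
from collections import defaultdict

# Invented rows; the category names and collection codes appear in this
# README, but the field names are placeholders.
ROWS = [
    {"Category": "Issue Date", "Collection": "CAME", "Name": "exampleDateA"},
    {"Category": "Issue Date", "Collection": "TRAP", "Name": "exampleDateB"},
    {"Category": "Language", "Collection": "CAME", "Name": "exampleLang"},
]


def equivalents(rows, category):
    """Map each collection code to the field names filed under one category."""
    found = defaultdict(list)
    for row in rows:
        if row["Category"] == category:
            found[row["Collection"]].append(row["Name"])
    return dict(found)


print(equivalents(ROWS, "Issue Date"))
```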

Data Types

The codes used to signify data types are as follows:

Code | Description
BOO | A Boolean char such as 0/1 or Y/N
COO | A set of numeric coordinates to delineate a segment of an image
DAR | A range of dates
DAT | A single date
FIN | A filename
MCH | Multiple pre-defined choices
NUL | Holds no content
NUM | Numerical value, may include the symbols . , -
STR | A string of alphanumeric content
UID | Any form of unique ID or acronym
URL | A URL
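As an illustration of how these codes might be used to sanity-check the Example Content column, the sketch below tests a value against three of the simpler types. The descriptions are copied from the list above; the function name `looks_like`, the accepted Boolean spellings and the numeric pattern are assumptions made for the example, not rules taken from the dataset.

```python
import re

# Descriptions copied from the list of data type codes above.
CONTENT_TYPES = {
    "BOO": "A Boolean char such as 0/1 or Y/N",
    "COO": "A set of numeric coordinates to delineate a segment of an image",
    "DAR": "A range of dates",
    "DAT": "A single date",
    "FIN": "A filename",
    "MCH": "Multiple pre-defined choices",
    "NUL": "Holds no content",
    "NUM": "Numerical value, may include the symbols . , -",
    "STR": "A string of alphanumeric content",
    "UID": "Any form of unique ID or acronym",
    "URL": "A URL",
}

# Assumption: digits plus the three symbols named in the NUM description.
NUM_PATTERN = re.compile(r"[0-9.,-]+")


def looks_like(code, value):
    """Return True/False for the types checked here, or None if unchecked."""
    if code not in CONTENT_TYPES:
        raise KeyError(f"unknown content type code: {code}")
    if code == "NUL":
        return value == ""
    if code == "BOO":
        # Assumed spellings; the description only says "such as".
        return value.upper() in {"0", "1", "Y", "N"}
    if code == "NUM":
        has_digit = any(ch.isdigit() for ch in value)
        return has_digit and NUM_PATTERN.fullmatch(value) is not None
    return None  # dates, filenames, URLs etc. are not checked in this sketch


print(looks_like("NUM", "1,000.5"), looks_like("BOO", "maybe"))
```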

Databases

The codes used to distinguish different databases, and files within those databases, are as follows:

Code | Collection Name | Standard | Description
B1GI | British Library 19th Century Newspapers, Part I, Gale’s Current Text-Mining Drives | GIFT | Issue Metadata XML File
B1GP | British Library 19th Century Newspapers, Part I, Gale’s Current Text-Mining Drives | GIFT | Publication Metadata XML File
B1GT | British Library 19th Century Newspapers, Part I, Gale’s Current Text-Mining Drives | GIFT | Text Content XML File
B2GI | British Library 19th Century Newspapers, Part II, Gale’s Current Text-Mining Drives | GIFT | Issue Metadata XML File
B2GP | British Library 19th Century Newspapers, Part II, Gale’s Current Text-Mining Drives | GIFT | Publication Metadata XML File
B2GT | British Library 19th Century Newspapers, Part II, Gale’s Current Text-Mining Drives | GIFT | Text Content XML File
B1JI | British Library 19th Century Newspapers, Part I, British Library’s Text-Mining Drives | Bespoke | Content and Metadata XML File
B1GL | British Library 19th Century Newspapers, Part I, Gale’s Legacy Text-Mining Drives | GIFT | Content and Metadata XML File
B2GL | British Library 19th Century Newspapers, Part II, Gale’s Legacy Text-Mining Drives | GIFT | Content and Metadata XML File
CAAL | Chronicling America | ALTO | Content and Layout XML File
CADI | Chronicling America | | Directory Structure
CAME | Chronicling America | METS | Issue Metadata XML File
DEAL | Delpher | ALTO | Content and Layout XML File
DEMP | Delpher | MPEG | Issue Metadata XML File
DEOC | Delpher | Bespoke | OCR Text XML File
EUAL | Europeana | ALTO | Content and Layout XML File
EUME | Europeana | METS | Issue Metadata XML File
F1AL | Finnish National Library 1771–1910 | ALTO | Content and Layout XML File
F2AL | Finnish National Library 1771–1910 | ALTO+ | Content, Layout and Metadata XML File
F1ME | Finnish National Library 1771–1910 | METS | Issue Metadata XML File
HNME | Hemeroteca Nacional Digital de México | METS+ | Content, Layout and Metadata XML File
HNDM | Hemeroteca Nacional Digital de México | Bespoke | Content and Metadata JSON File
PPAL | Papers Past | ALTO | Content and Layout XML File
PPDI | Papers Past | | Directory Structure
PPME | Papers Past | METS | Issue Metadata XML File
SBAL | State Library of Berlin | ALTO | Content and Layout XML File
SBME | State Library of Berlin | METS | Issue Metadata XML File
SBMA | State Library of Berlin | METS | Publication Metadata XML File
SBMY | State Library of Berlin | METS | Publication-Issue Metadata XML File
TDAG | Times Digital Archive | GIFT | Content and Metadata XML File
TRAL | Trove | ALTO | Content and Layout XML File
TRAP | Trove | Bespoke | API XML Return
TRME | Trove | METS | Issue Metadata XML File

Citation

M. H. Beals and Emily Bell. (2020). Map of Digitised Newspaper Metadata v.1.0.0 [Data set]. Figshare. DOI:10.6084/m9.figshare.11560110

Acknowledgments

This derived dataset was created using internal documentation, sample XML files, and interviews with the public and commercial providers of the digitised newspaper collections. We are grateful for the support and contributions made by members of the Oceanic Exchanges team (Ryan Cordell, Paul Fyfe, Isabel Galina Russell, Tessa Hauswedell, Clemens Neudecker, Julianne Nyhan, Mila Oiva, Sebastian Padó, Miriam Peña Pimentel, Lara Rose, Hannu Salmi, Melissa Terras, and Lorella Viola) as well as our friends and colleagues at libraries and publishing firms around the world: Seth Cayley (Gale), Steven Claeyssens (KB), Huibert Crijns (KB), Nicola Frean (NLNZ), Julia Hickie (NLA), Jussi-Pekka Hakkarainen (NLF), Chris Houghton (Gale), Melanie Lovell-Smith (NLNZ), Minna Kaukonen (NLF), Luke McKernan (BL), Chris McPartland (NLA), Maaike Napolitano (KB), Tim Sherratt (University of Canberra) and Emerson Vandy (NLNZ).

License

The dataset and all accompanying documentation are licensed under a Creative Commons Attribution 4.0 International License.

This means that you are free to copy and redistribute the material in any medium or format; and to remix, transform, and build upon the material for any purpose, even commercially, providing you give appropriate credit, provide a link to the license, and indicate if changes were made.

We ask that, rather than creating multiple independent variants of this dataset, you contribute updates and additions to the dynamic dataset via a pull request. This will allow us to maintain an ever more accurate and comprehensive map with full provenance tracking of your contributions.

Future plans

  • Development of sample use cases
  • Inclusion of additional datasets