This report contains a series of maps to better align and compare data from different digitised newspaper collections. Each page is devoted to a low-resolution category of the metadata, such as title, publisher or text. Within that category, we have attempted to subdivide the relevant elements and fields into categories of metadata that are the most comparable across databases, for example normalised title or individual word.

Each section provides a technical definition of that category as well as an exploration of how the term, or variant terms, have been used by modern researchers in periodical studies, literary studies, library science and computer science (where appropriate) as well as in a nineteenth-century context. This is followed by a discussion of any exceptions or eccentricities regarding this category within the data collections. It concludes with a list of relevant XPaths, or other identifiers, and key information regarding the nature of the data in each element. With this information, the reader should be able to understand the different structures of these collections and develop computational means for robustly comparing datasets.
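As a rough illustration of how a listed XPath might be applied, the following sketch pulls word-level content out of an ALTO-style file using Python's standard library. The sample XML is invented for this example and is not taken from any of the collections; the namespace version, the path `.//alto:String` and the helper name `values_at` are assumptions that would need to be checked against the relevant map and the provider's actual files.

```python
import xml.etree.ElementTree as ET

# Invented sample: a minimal ALTO-like fragment, not drawn from any collection.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="THE"/><SP/><String CONTENT="DAILY"/><SP/><String CONTENT="NEWS"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# Assumed namespace; collections may use other ALTO versions.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}


def values_at(xml_text, path, attribute, namespaces):
    """Return the given attribute of every element matched by path, in document order."""
    root = ET.fromstring(xml_text)
    matches = root.findall(path, namespaces)
    return [el.get(attribute) for el in matches if el.get(attribute) is not None]


words = values_at(SAMPLE, ".//alto:String", "CONTENT", NS)
print(words)
```

Keeping the path, attribute and namespace as parameters means the same helper could, in principle, be pointed at a different collection by swapping in the values given in its map.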

These maps represent data and metadata that are comparable, to varying degrees, across the collections. They have been grouped teleologically, by their most likely use: content, citation, bibliography, holdings, description, social interaction and technical information. Scholarly, technical and industry terms have been included and indexed to facilitate information retrieval throughout.

In addition to descriptive information, each map provides a table with the following information:

  • A locator, comprising the four-letter collection ID
  • The XPath or JSON path to that data
  • A data type indicator, comprising a three-letter ID
  • An example of the content of that field, with long strings of content text being truncated with […]
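For readers who want to hold these table rows in code, one possible representation is sketched below. The class name `MapRow`, its field names and the 40-character truncation limit are hypothetical choices made for this example; the report itself does not prescribe any of them.

```python
from dataclasses import dataclass


def truncate_example(text, limit=40):
    """Shorten long content text, marking the cut with […] as the maps do."""
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + " […]"


@dataclass(frozen=True)
class MapRow:
    collection_id: str  # four-character locator
    path: str           # XPath or JSON path to the data
    data_type: str      # three-letter data type ID
    example: str        # sample content, possibly truncated

    def __post_init__(self):
        # Only the ID lengths are checked here; valid values are listed in the report.
        if len(self.collection_id) != 4:
            raise ValueError(f"collection ID must have four characters: {self.collection_id!r}")
        if len(self.data_type) != 3:
            raise ValueError(f"data type ID must have three letters: {self.data_type!r}")


# Hypothetical row: the path and example text are placeholders, not real map entries.
row = MapRow("CAAL", "path/to/field", "STR", truncate_example("x" * 50, limit=10))
print(row)
```

Making the record immutable (`frozen=True`) is a design choice that lets rows be used as dictionary keys or set members when comparing maps across collections.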

The data type and collection IDs are as follows:

Data Types

BOO A Boolean char such as 0/1 or Y/N
COO A set of numeric coordinates to delineate a segment of an image
DAT A single date
DAR A range of dates
FIN A filename
STR An open-ended string of content (alphanumeric)
MCH Multiple pre-defined choices
NUL Holds no content; used as a container element for other fields
NUM Numeric value; may include the symbols . , -
UID Any form of unique ID or acronym
URL A URL
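The data type IDs can also serve as lightweight sanity checks when reading values out of a collection. The sketch below covers only four of the types, and its rules are assumptions based on the short descriptions above: for example, BOO is limited to the four values named, although providers may use others. The function name `looks_like` is hypothetical.

```python
import re
from urllib.parse import urlparse

# NUM: digits plus the symbols . , - as described in the data type list.
_NUM = re.compile(r"[0-9.,-]+")


def looks_like(type_id, value):
    """Return True if value is plausible for the given three-letter data type ID."""
    if type_id == "BOO":
        # Assumption: only the values named in the description are accepted.
        return value in {"0", "1", "Y", "N"}
    if type_id == "NUM":
        # Require at least one digit so that a bare "-" or "." is rejected.
        return bool(_NUM.fullmatch(value)) and any(ch.isdigit() for ch in value)
    if type_id == "URL":
        parts = urlparse(value)
        return bool(parts.scheme and parts.netloc)
    if type_id == "NUL":
        # Container elements are expected to hold no text of their own.
        return value.strip() == ""
    raise ValueError(f"no check sketched for data type {type_id!r}")


print(looks_like("NUM", "1,234.5"), looks_like("URL", "not a url"))
```

Dates (DAT, DAR) are deliberately left out, since their formats are likely to differ between collections and would need per-collection rules.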

Collection IDs

B1GI British Library 19th Century Newspapers, Part I, Gale’s Current Text-Mining Drives GIFT Issue Metadata XML File
B1GP British Library 19th Century Newspapers, Part I, Gale’s Current Text-Mining Drives GIFT Publication Metadata XML File
B1GT British Library 19th Century Newspapers, Part I, Gale’s Current Text-Mining Drives GIFT Text Content XML File
B2GI British Library 19th Century Newspapers, Part II, Gale’s Current Text-Mining Drives GIFT Issue Metadata XML File
B2GP British Library 19th Century Newspapers, Part II, Gale’s Current Text-Mining Drives GIFT Publication Metadata XML File
B2GT British Library 19th Century Newspapers, Part II, Gale’s Current Text-Mining Drives GIFT Text Content XML File
B1JI British Library 19th Century Newspapers, Part I, British Library’s Text-Mining Drives Bespoke Content and Metadata XML File
B1GL British Library 19th Century Newspapers, Part I, Gale’s Legacy Text-Mining Drives GIFT Content and Metadata XML File
B2GL British Library 19th Century Newspapers, Part II, Gale’s Legacy Text-Mining Drives GIFT Content and Metadata XML File
CAAL Chronicling America ALTO Content and Layout XML File
CADI Chronicling America Directory Structure
CAME Chronicling America METS Issue Metadata XML File
DEAL Delpher ALTO Content and Layout XML File
DEMP Delpher MPEG Issue Metadata XML File
DEOC Delpher Bespoke OCR Text XML File
EUAL Europeana ALTO Content and Layout XML File
EUME Europeana METS Issue Metadata XML File
F1AL Finnish National Library 1771–1910 ALTO Content and Layout XML File
F2AL Finnish National Library 1771–1910 ALTO+ Content, Layout and Metadata XML File
F1ME Finnish National Library 1771–1910 METS Issue Metadata XML File
HNME Hemeroteca Nacional Digital de México METS+ Content, Layout and Metadata XML File
HNDM Hemeroteca Nacional Digital de México Bespoke Content and Metadata JSON File
PPAL Papers Past ALTO Content and Layout XML File
PPDI Papers Past Directory Structure
PPME Papers Past METS Issue Metadata XML File
SBAL State Library of Berlin ALTO Content and Layout XML File
SBME State Library of Berlin METS Issue Metadata XML File
SBMA State Library of Berlin METS Publication Metadata XML File
SBMY State Library of Berlin METS Publication-Issue Metadata XML File
TDAG Times Digital Archive GIFT Content and Metadata XML File
TRAL Trove ALTO Content and Layout XML File
TRAP Trove Bespoke API XML Return
TRME Trove METS Issue Metadata XML File

These maps are accurate to October 2019 for the specific collection datasets listed above; however, it has been our experience that data providers frequently update, tweak or otherwise modify their metadata schema, both for new collections and in order to retrofit previous collections based on end-user feedback. We are also aware of specific forthcoming updates to several of these collections, details of which have not yet been made publicly available. It is therefore advisable that you consult with the data provider on their current schema before undertaking any data mining project.