Content Data | Citation Metadata | Bibliographic Metadata |
Holdings Information | Descriptive Metadata | User-Generated Metadata |
Technical Metadata |
This report contains a series of maps designed to help align and compare data from different digitised newspaper collections. Each page is devoted to a broad, low-resolution category of metadata, such as title, publisher or text. Within that category, we have attempted to subdivide the relevant elements and fields into the subcategories that are most comparable across databases, for example normalised title or individual word.
Each section provides a technical definition of that category and explores how the term, or variant terms, have been used in a nineteenth-century context and by modern researchers in periodical studies, literary studies, library science and (where appropriate) computer science. This is followed by a discussion of any exceptions or eccentricities regarding this category within the data collections. Each section concludes with a list of relevant XPaths, or other identifiers, and key information regarding the nature of the data in each element. With this information, the reader should be able to understand the different structures of these collections and develop computational means for robustly comparing datasets.
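As a rough illustration of what such a comparison might look like, the following Python sketch pulls the same field out of two differently structured XML files and normalises the values before comparing them. The collection IDs `AAAA` and `BBBB`, the XML snippets, the paths and the function names are all invented for illustration; they do not correspond to any collection or schema described in this report.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection IDs mapped to hypothetical paths for one field.
# Real paths must be taken from the tables in the relevant map.
TITLE_PATHS = {
    "AAAA": "./meta/title",
    "BBBB": "./header/name",
}

# Toy documents standing in for two collections with different structures.
DOC_A = "<issue><meta><title>The Example Gazette</title></meta></issue>"
DOC_B = "<record><header><name>  the  example gazette </name></header></record>"


def extract(collection_id, xml_text):
    """Return the text found at the path registered for this collection."""
    root = ET.fromstring(xml_text)
    return root.findtext(TITLE_PATHS[collection_id])


def normalise(value):
    """Collapse whitespace and ignore case so values can be compared."""
    if value is None:
        return None
    return " ".join(value.split()).casefold()


title_a = extract("AAAA", DOC_A)
title_b = extract("BBBB", DOC_B)
same_title = normalise(title_a) == normalise(title_b)
```

Keeping the paths in a single lookup table, keyed by collection ID, means that a schema change by a data provider only requires updating one entry rather than the comparison logic itself.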
These maps represent data and metadata that are comparable, to varying degrees, across the collections. They have been grouped teleologically, by their most likely use: content, citation, bibliography, holdings, description, social interaction and technical information. Scholarly, technical and industry terms have been included and indexed throughout to facilitate information retrieval.
In addition to descriptive information, each map provides a table with the following information:
- A locator, consisting of the four-character collection ID
- The XPath or JSON path to that data
- A data type indicator, consisting of a three-letter ID
- An example of the content of that field, with long strings of content text being truncated with […]
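One way to hold a row of such a table in code is sketched below. The class name `MapRow`, its field names, the 40-character truncation limit and the path used in the usage line are assumptions made for illustration; the report itself does not specify where examples are truncated.

```python
from dataclasses import dataclass

ELLIPSIS = "[…]"


def truncate_example(text, limit=40):
    """Shorten long content strings, marking the cut with […].

    The limit of 40 characters is an arbitrary choice for this sketch.
    """
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + " " + ELLIPSIS


@dataclass
class MapRow:
    locator: str    # collection ID, e.g. "CAAL"
    path: str       # XPath or JSON path to the data
    data_type: str  # three-letter data type ID, e.g. "STR"
    example: str    # sample content of the field

    def __post_init__(self):
        # Basic shape checks only; they do not confirm the IDs exist.
        if len(self.locator) != 4:
            raise ValueError("locator must be four characters long")
        if len(self.data_type) != 3:
            raise ValueError("data type must be three characters long")
        self.example = truncate_example(self.example)


# Usage with a made-up path and made-up content.
row = MapRow("CAAL", "/hypothetical/path/to/text", "STR", "word " * 30)
```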
The data type and collection IDs are as follows:
Data Types
BOO | A Boolean character, such as 0/1 or Y/N |
COO | A set of numeric coordinates to delineate a segment of an image |
DAT | A single date |
DAR | A range of dates |
FIN | A filename |
STR | An open-ended string of content (alphanumeric) |
MCH | Multiple pre-defined choices |
NUL | Holds no content; used as a container element for other fields |
NUM | Numeric value; may include the symbols . , - |
UID | Any form of unique ID or acronym |
URL | A URL |
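The data type IDs lend themselves to a simple lookup, which can then drive light sanity checks on extracted values. The sketch below is one possible arrangement: the function name `looks_like` is hypothetical, the accepted Boolean values are limited to the four given as examples above (real collections may use others), and only three of the types are actually checked.

```python
import re

# Data type IDs and descriptions, paraphrased from the table above.
DATA_TYPES = {
    "BOO": "Boolean character such as 0/1 or Y/N",
    "COO": "Numeric coordinates delineating a segment of an image",
    "DAT": "A single date",
    "DAR": "A range of dates",
    "FIN": "A filename",
    "STR": "An open-ended alphanumeric string of content",
    "MCH": "Multiple pre-defined choices",
    "NUL": "No content; a container element for other fields",
    "NUM": "Numeric value; may include the symbols . , -",
    "UID": "Any form of unique ID or acronym",
    "URL": "A URL",
}

_NUM_PATTERN = re.compile(r"[0-9.,\-]+")


def looks_like(type_id, value):
    """Return True if value is plausible for the given data type ID.

    Only BOO, NUM and NUL are checked; other types are accepted as-is.
    """
    if type_id not in DATA_TYPES:
        raise KeyError(f"unknown data type ID: {type_id}")
    if type_id == "BOO":
        # Assumption: only the example values from the table are accepted.
        return value in {"0", "1", "Y", "N"}
    if type_id == "NUM":
        has_digit = any(ch.isdigit() for ch in value)
        return has_digit and _NUM_PATTERN.fullmatch(value) is not None
    if type_id == "NUL":
        return value.strip() == ""
    return True
```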
Collection IDs
ID | Collection | Format | File Type |
B1GI | British Library 19th Century Newspapers, Part I, Gale’s Current Text-Mining Drives | GIFT | Issue Metadata XML File |
B1GP | British Library 19th Century Newspapers, Part I, Gale’s Current Text-Mining Drives | GIFT | Publication Metadata XML File |
B1GT | British Library 19th Century Newspapers, Part I, Gale’s Current Text-Mining Drives | GIFT | Text Content XML File |
B2GI | British Library 19th Century Newspapers, Part II, Gale’s Current Text-Mining Drives | GIFT | Issue Metadata XML File |
B2GP | British Library 19th Century Newspapers, Part II, Gale’s Current Text-Mining Drives | GIFT | Publication Metadata XML File |
B2GT | British Library 19th Century Newspapers, Part II, Gale’s Current Text-Mining Drives | GIFT | Text Content XML File |
B1JI | British Library 19th Century Newspapers, Part I, British Library’s Text-Mining Drives | Bespoke | Content and Metadata XML File |
B1GL | British Library 19th Century Newspapers, Part I, Gale’s Legacy Text-Mining Drives | GIFT | Content and Metadata XML File |
B2GL | British Library 19th Century Newspapers, Part II, Gale’s Legacy Text-Mining Drives | GIFT | Content and Metadata XML File |
CAAL | Chronicling America | ALTO | Content and Layout XML File |
CADI | Chronicling America | Directory Structure | |
CAME | Chronicling America | METS | Issue Metadata XML File |
DEAL | Delpher | ALTO | Content and Layout XML File |
DEMP | Delpher | MPEG | Issue Metadata XML File |
DEOC | Delpher | Bespoke | OCR Text XML File |
EUAL | Europeana | ALTO | Content and Layout XML File |
EUME | Europeana | METS | Issue Metadata XML File |
F1AL | Finnish National Library 1771–1910 | ALTO | Content and Layout XML File |
F2AL | Finnish National Library 1771–1910 | ALTO+ | Content, Layout and Metadata XML File |
F1ME | Finnish National Library 1771–1910 | METS | Issue Metadata XML File |
HNME | Hemeroteca Nacional Digital de México | METS+ | Content, Layout and Metadata XML File |
HNDM | Hemeroteca Nacional Digital de México | Bespoke | Content and Metadata JSON File |
PPAL | Papers Past | ALTO | Content and Layout XML File |
PPDI | Papers Past | Directory Structure | |
PPME | Papers Past | METS | Issue Metadata XML File |
SBAL | State Library of Berlin | ALTO | Content and Layout XML File |
SBME | State Library of Berlin | METS | Issue Metadata XML File |
SBMA | State Library of Berlin | METS | Publication Metadata XML File |
SBMY | State Library of Berlin | METS | Publication-Issue Metadata XML File |
TDAG | Times Digital Archive | GIFT | Content and Metadata XML File |
TRAL | Trove | ALTO | Content and Layout XML File |
TRAP | Trove | Bespoke | API XML Return |
TRME | Trove | METS | Issue Metadata XML File |
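Because several collections share a format, it can be convenient to load these rows into a registry and group the IDs by format, so that, for instance, all ALTO files can be routed to the same parser. The sketch below does this for a handful of rows copied from the table; the function names and the dictionary keys are hypothetical.

```python
from collections import defaultdict

# A few rows copied from the table above, in the same pipe-delimited form.
ROWS = """\
CAAL | Chronicling America | ALTO | Content and Layout XML File
CAME | Chronicling America | METS | Issue Metadata XML File
DEAL | Delpher | ALTO | Content and Layout XML File
PPDI | Papers Past | Directory Structure |
TRAP | Trove | Bespoke | API XML Return
"""


def parse_rows(text):
    """Turn pipe-delimited rows into a dictionary keyed by collection ID."""
    registry = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        cells = [cell.strip() for cell in line.split("|")]
        cells += [""] * (4 - len(cells))  # pad rows with empty cells
        collection_id, collection, fmt, file_type = cells[:4]
        registry[collection_id] = {
            "collection": collection,
            "format": fmt,
            "file_type": file_type,
        }
    return registry


def group_by_format(registry):
    """Map each format to the list of collection IDs that use it."""
    groups = defaultdict(list)
    for collection_id, info in registry.items():
        groups[info["format"]].append(collection_id)
    return dict(groups)


registry = parse_rows(ROWS)
groups = group_by_format(registry)
```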
These maps are accurate to October 2019 for the specific collection datasets listed above; however, it has been our experience that data providers frequently update, tweak or otherwise modify their metadata schema, both for new collections and in order to retrofit previous collections based on end-user feedback. We are also aware of specific forthcoming updates to several of these collections, details of which have not yet been made publicly available. It is therefore advisable to consult the data provider about their current schema before undertaking any data-mining project.