History of the Collection

Chronicling America is a free, online repository of newspapers printed in the United States, primarily from 1836 to 1922. It is managed as part of the National Digital Newspaper Program, a collaboration between the National Endowment for the Humanities and the Library of Congress. The collection began its initial digitisation phase in 2005 and, as of December 2018, continues to support digitisation of new material. In addition to this digitised collection, the partnership also manages the national newspaper directory, which lists all US newspapers from 1690 to the present day. This catalogue was originally produced by the United States Newspaper Program, which ceased operation in 2011.

Consulted Libraries

Digitisation began in 2005 with grants to: University of California, Riverside; University of Florida Libraries, Gainesville; University of Kentucky Libraries, Lexington; the New York Public Library, New York City; the University of Utah, Salt Lake City; and the Library of Virginia, Richmond. This initial round of funding, covering newspapers from 1900 to 1910, concluded in 2007. The programme has since continued to allocate grants to institutions to digitise state collections. As of December 2018, this has included the University of Alabama, Tuscaloosa; Alaska State Library Historical Collections; Arizona Department of Libraries, Archives, and Public Records; Arkansas State Archive; University of California, Riverside; History Colorado; Connecticut State Library; University of Delaware; University of Florida, Gainesville; Digital Library of Georgia (University of Georgia Libraries/GALILEO); University of Hawaii at Manoa; Idaho State Historical Society; University of Illinois, Urbana; Indiana State Library; State Historical Society of Iowa; Kansas State Historical Society; University of Kentucky, Lexington; Louisiana State University; Maine State Library; University of Maryland, College Park; Central Michigan University; Minnesota Historical Society; Mississippi Department of Archives and History; State Historical Society of Missouri; Montana Historical Society; University of Nebraska-Lincoln Libraries; University of Nevada, Las Vegas; Rutgers University Libraries; New Jersey State Archives and New Jersey State Library; University of New Mexico; New York Public Library, Astor, Lenox and Tilden Foundation; University of North Carolina, Chapel Hill; State Historical Society of North Dakota; Ohio History Connection; Oklahoma Historical Society; University of Oregon; Penn State University Libraries, University Park; University of Puerto Rico, Rio Piedras; University of South Carolina; South Dakota Department of Education; University of Tennessee; University of North Texas; University of Utah, Marriott Libraries; University of Vermont; Library of Virginia; Washington State Library; West Virginia University Libraries; and Wisconsin Historical Society.

Microfilming Projects

The digitisation work of the National Digital Newspaper Program is built upon earlier preservation programmes managed by the United States Newspaper Program, which worked from 1982 to 2011 to identify, describe, and preserve historical newspaper collections. These programmes, funded by the National Endowment for the Humanities and given technical support by the Library of Congress, supported the preservation of historical newspapers through microfilming rather than the retention and conservation of loose or bound copies. Funding guidelines encouraged the removal of newspapers from bound volumes to facilitate a speedier microfilming process; however, this method largely prevented the rebinding and conservation of the original newspapers. This preference for microfilm as the “object of record” has continued under the National Digital Newspaper Program, as its technical and funding guidelines instruct awardees to scan existing microfilm copies, with only a brief mention made of scanning original copies in order to complete a microfilm collection.

Digitisation Projects

The National Digital Newspaper Program builds on the work of the United States Newspaper Program by running a biannual, competitive grant programme for institutions to digitise approximately 100,000 newspaper pages representing their state, with the option of applying for a second or third grant in subsequent rounds. The programme provides awardees with technical guidelines on selection, digitisation, encoding and delivery to ensure consistency across institutions and grant cycles, but allows institutions to employ local expertise in fulfilling these guidelines, particularly regarding selection and populating bibliographic metadata. The programme originally limited the date range of submissions to 1836–1922, but since July 2016 has allowed digitisation of newspapers from 1690 to 1963.

Selection

The guidelines for the National Digital Newspaper Program highlight four primary intellectual considerations for selecting titles to include in the Chronicling America database. First, the title should reflect the political, economic and cultural history of the state or territory, with preference given to titles recognised as “papers of record”. Second, it should provide state, or multi-county, coverage of the majority of the state or territory’s population. Third, titles with longer chronological runs are preferred over those with short or sporadic runs. Finally, particular consideration is given to titles that have ceased publication and are therefore less likely to be digitised by other providers. Technical guidelines state that a title should have a complete, or largely complete, run available on microfilm, and that use of the microfilm should not be restricted in any way that would affect the newspaper’s inclusion in the Chronicling America database. Additional guidelines for microfilm quality and reduction ratio are also provided, noting that titles with higher-quality microfilm should be given preference.

Preservation and Access

Chronicling America’s primary aim is to enhance free, public access to historical newspapers. Although archival-quality TIFF images are created for all NEH-funded digitisation projects, microfilm remains the preferred method of permanent preservation.

Composition of the Collection

Selection Available

As of December 2018, the collection included 2,689 distinct historical newspaper titles comprising 14,181,901 historical newspaper pages. A full list of included titles can be found at http://chroniclingamerica.loc.gov/newspapers.txt. The collection contains issues from the years 1789–1963, but the bulk of the collection is from 1850 to 1922. This latter date is commonly understood as the US copyright boundary. Of the 2,689 titles available, only eight include issues before 1800 and only forty-two include issues after 1922.

Data Quality

Text

Because Chronicling America is a distributed digital content creation programme, individual awardees are responsible for selecting content, evaluating microfilm, assigning metadata and writing descriptive newspaper histories for each title, the latter hosted directly by the Chronicling America website. Although awardees share common technical guidelines, this has led to the quality and character of the METS/ALTO data files varying considerably by state awardee, date of digitisation, quality of the source material, and the title itself. In some cases, OCR quality varies widely even within a given title. This may be the result of shifts in the original quality of the source material, which changed rapidly over the course of the nineteenth century, or the paper’s conservation status. In addition, variation in OCR quality can reflect changes made to OCR software over the course of a long-term digitisation project. The quality of the OCR data is so variable that any summary would be largely inaccurate, even for a specific title.

Images

For each page in their submission, contributors to Chronicling America are required to include an 8-bit, grayscale, 300–400 PPI TIFF preservation-quality image, as well as a derivative JPEG2000 (lossless 8:1 compression) and PDF file. These derivative images are made available through the Chronicling America web interface, while the TIFF can be provided to researchers by request. As JPEG2000 is a proprietary format, the National Digital Newspaper Program does not currently consider it a suitable archival substitute for TIFF, despite its higher compression allowing for more efficient online distribution.
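As a rough illustration of why the compressed derivative matters for online delivery, the following sketch estimates file sizes from the figures given above (8 bits per pixel, 300–400 PPI, 8:1 compression). The page dimensions are a hypothetical example rather than a value taken from the guidelines, the function names are invented for illustration, and file-container overhead is ignored.

```python
def uncompressed_bytes(width_in: float, height_in: float, ppi: int,
                       bits_per_pixel: int = 8) -> int:
    """Estimate the raw pixel data size of a scanned page (headers ignored)."""
    pixels = round(width_in * ppi) * round(height_in * ppi)
    return pixels * bits_per_pixel // 8


def compressed_bytes(raw_bytes: int, ratio: float = 8.0) -> int:
    """Estimate the size after compression at the given ratio (8:1 in the text)."""
    return round(raw_bytes / ratio)


# Hypothetical 18 x 24 inch page scanned at 400 PPI (dimensions are an assumption).
raw = uncompressed_bytes(18, 24, 400)
jp2 = compressed_bytes(raw)
print(f"TIFF pixel data: {raw / 1e6:.1f} MB; JPEG2000 derivative: {jp2 / 1e6:.1f} MB")
```

Under these assumptions, each page's derivative is roughly one eighth the size of its preservation image, which adds up quickly across a collection of millions of pages.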

Metadata Schema

The data contained within Chronicling America adheres to several specific metadata encoding schemas. The METS XML schema is used for structural metadata, ALTO XML for the OCR content, PREMIS and MIX for technical metadata, and MARC and MODS for descriptive and bibliographic metadata. The descriptive and bibliographic metadata is largely based on title-level MARC records, many of which were created by the United States Newspaper Program. The technical guidelines for awardees direct them to map specific MARC fields to the required and optional metadata components. Other descriptive and administrative metadata is populated by awardees when collating or evaluating the microfilm and may include page numbers, section and edition labels, and preservation metadata. Since 2011–2013, many of the technical metadata fields relating to the digitisation process have been reclassified as recommended, rather than required, components of the data files.

Backend Structure

The data for each issue is stored in multiple image and text files within an alphanumerical directory structure, organised by unique batch identifiers. Each batch contains several unique titles, listed by their canonical Library of Congress Control Number (LCCN). Within each title subdirectory, files are separated into subdirectories by issue date. Within these, there are two METS-encoded files providing technical, administrative and descriptive metadata for the issue. There are also separate image files (.jp2/.pdf) and an ALTO-encoded XML file for each page within that issue. The METS files are named by the date of the issue and the ALTO files are numbered sequentially across the title within that batch. Archival TIFF files are stored offline. Thus, data for each issue can be obtained using the following standardised URL:

https://chroniclingamerica.loc.gov/data/batches/[Batch_Name]/data/[Library of Congress Title ID]/[Processing_ID]/[YYYYMMDDEE]/[File ID].xml
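The URL pattern above can be assembled programmatically. The following is a minimal sketch, assuming the issue directory is an eight-digit date (four-digit year, month, day) followed by a two-digit edition number. The function name and every identifier value shown are hypothetical placeholders, not real batch or title names.

```python
BASE_URL = "https://chroniclingamerica.loc.gov/data/batches"


def issue_file_url(batch_name: str, title_id: str, processing_id: str,
                   issue_date: str, edition: int, file_id: str) -> str:
    """Assemble a file URL following the directory template in the text.

    issue_date is expected as YYYYMMDD; edition is rendered as two digits.
    """
    if len(issue_date) != 8 or not issue_date.isdigit():
        raise ValueError("issue_date must be eight digits in YYYYMMDD form")
    issue_dir = f"{issue_date}{edition:02d}"
    parts = [BASE_URL, batch_name, "data", title_id, processing_id,
             issue_dir, f"{file_id}.xml"]
    return "/".join(parts)


# Placeholder identifiers for illustration only.
print(issue_file_url("batch_example_ver01", "sn00000000", "00000000000",
                     "19000101", 1, "0001"))
```

Validating the date before building the path avoids silently requesting a directory that cannot exist.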

The National Digital Newspaper Program decided to serve image and OCR data at the page rather than article level for the sake of efficiency, but allows users to zoom, pan and crop images—in essence, manually zoning the text. Many of the collections are hosted by individual awardees as well as the Library of Congress, and some awardees have undertaken and provide a higher resolution of zoning and descriptive metadata through their own hosting venues, going beyond the minimum zoning of individual columns with appropriate coordinate information to facilitate text highlighting within the images.
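The coordinate information mentioned above is what makes text highlighting possible. As a sketch of how it might be read, the example below collects each word and its bounding box from an ALTO file using Python's standard library. It assumes the common ALTO convention of String elements carrying CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes. The sample document, including its namespace URI, is invented for illustration, and elements are matched by local name because the namespace can differ between ALTO versions.

```python
import xml.etree.ElementTree as ET

# Invented sample; real files are much larger and may use another namespace.
SAMPLE_ALTO = """<alto xmlns="http://schema.ccs-gmbh.com/ALTO">
  <Layout><Page><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine>
        <String CONTENT="Daily" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="30"/>
        <SP/>
        <String CONTENT="News" HPOS="190" VPOS="200" WIDTH="75" HEIGHT="30"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def words_with_boxes(alto_xml: str) -> list:
    """Return (word, (hpos, vpos, width, height)) for every String element."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            box = tuple(float(element.get(name, 0))
                        for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            words.append((element.get("CONTENT", ""), box))
    return words


for word, box in words_with_boxes(SAMPLE_ALTO):
    print(word, box)
```

The units of the coordinates depend on the measurement unit declared in each file, so they should be checked before being mapped onto image pixels.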

User Interface Structure

Web Interface

The online user interface is an open-source Django installation called chronam that allows users to perform a simple or advanced search of the underlying descriptive and OCR data, or to browse images by title and date. The advanced search allows for filtering by state, title, years, page number, and language, and employs checkbox Boolean operators such as “any”, “all”, “exact” and “near”. Tiled images with highlighted search results are displayed in an image viewer with an attached citation. The viewer allows users to pan and zoom as well as navigate through the issue. The underlying data (plain text, PDF and JP2) and manually selected snippets can be downloaded using icons at the top of the viewer. The underlying code for chronam is available on GitHub.
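Advanced searches of this kind are expressed as URL query strings, so they can also be composed in code. The sketch below maps the four operators named above to query parameters. The parameter names (ortext, andtext, phrasetext, proxtext, state, dateFilterType, date1, date2) are assumptions based on the public search form and should be verified against the site before use; the function name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

SEARCH_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"

# Assumed mapping from the interface's operators to query parameters.
OPERATOR_PARAMS = {
    "any": "ortext",
    "all": "andtext",
    "exact": "phrasetext",
    "near": "proxtext",
}


def search_url(terms: str, operator: str = "all", state: Optional[str] = None,
               year_from: Optional[int] = None,
               year_to: Optional[int] = None) -> str:
    """Build a page-search URL for the given terms and optional filters."""
    params = {OPERATOR_PARAMS[operator]: terms}
    if state:
        params["state"] = state
    if year_from is not None and year_to is not None:
        params.update({"dateFilterType": "yearRange",
                       "date1": year_from, "date2": year_to})
    return f"{SEARCH_URL}?{urlencode(params)}"


print(search_url("gold rush", "exact", state="California",
                 year_from=1850, year_to=1860))
```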

API

Materials from Chronicling America can be obtained through the site’s API which, as of December 2018, does not require a unique access key. Datasets can be filtered by metadata or content and retrieved in HTML, Atom/XML or JSON formats through URL queries. More information and examples can be found at https://chroniclingamerica.loc.gov/about/api.
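As a minimal sketch of such URL queries, the example below builds the address of a page's JSON description or plain-text OCR and shows how the JSON could be fetched with the standard library. The URL layout (lccn, date, ed-N, seq-N, with a .json suffix or an ocr.txt path) is assumed from the API documentation linked above and should be confirmed there. The helper names and the title identifier are hypothetical placeholders.

```python
import json
from urllib.request import urlopen

SITE = "https://chroniclingamerica.loc.gov"


def page_url(lccn: str, issue_date: str, edition: int, sequence: int,
             view: str = "json") -> str:
    """Build the URL for a page's JSON record or its OCR text.

    issue_date is expected as YYYY-MM-DD; view is "json" or "ocr".
    """
    base = f"{SITE}/lccn/{lccn}/{issue_date}/ed-{edition}/seq-{sequence}"
    return f"{base}/ocr.txt" if view == "ocr" else f"{base}.json"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Download and decode a JSON document (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Placeholder identifier; substitute a real LCCN before fetching.
print(page_url("sn00000000", "1900-01-05", 1, 2))
print(page_url("sn00000000", "1900-01-05", 1, 2, view="ocr"))
```

Keeping URL construction separate from fetching makes it straightforward to add rate limiting or caching around the network call.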

Direct Download or Drives

Bulk data from the collection can be obtained through web crawling tools, using the standardised file structure. This is aided by Atom and JSON feeds, which detail the structure of the data and indicate when it is updated.
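A crawler can use such a feed to decide which batches need to be downloaded again. The sketch below filters an already parsed JSON feed for batches ingested after a cutoff date. The field names ("batches", "name", "ingested") and the ISO 8601 timestamp format are assumptions about the feed's layout, and the sample data is invented.

```python
from datetime import datetime, timezone


def batches_ingested_since(feed: dict, cutoff: datetime) -> list:
    """Return the names of batches whose ingest time is later than cutoff."""
    recent = []
    for batch in feed.get("batches", []):
        # Assumes timestamps such as "2018-11-30T12:00:00-05:00".
        ingested = datetime.fromisoformat(batch["ingested"])
        if ingested > cutoff:
            recent.append(batch["name"])
    return recent


# Invented sample standing in for a parsed JSON feed.
SAMPLE_FEED = {
    "batches": [
        {"name": "batch_example_a_ver01", "ingested": "2018-11-30T12:00:00-05:00"},
        {"name": "batch_example_b_ver01", "ingested": "2017-03-15T09:30:00-04:00"},
    ]
}

cutoff = datetime(2018, 1, 1, tzinfo=timezone.utc)
print(batches_ingested_since(SAMPLE_FEED, cutoff))
```

The cutoff is timezone-aware so that it can be compared directly with timestamps that carry a UTC offset.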

Rights and Usage

Web Interface

All material obtained from the Chronicling America web interface may be used freely for personal research. When browsing or searching through the user interface, users are presented with a full citation for the digitised image and text.

API and Direct Download

Users are allowed to access archive (.tar.bz2) files of all texts, images and metadata on Chronicling America for text and data mining. Those using data via the API are requested to use the URL and a website citation, such as “from the Library of Congress, Chronicling America: Historic American Newspapers site”.

Re-Publication

The Library of Congress believes that the newspapers in Chronicling America are in the public domain or have no known copyright restrictions. Newspapers published in the United States prior to 1923 are in the public domain in their entirety. Any newspapers in Chronicling America that were published after 1922 are also believed to be in the public domain but may contain some copyrighted third-party materials and should be independently cleared for derivative use.

Suggested Citation

Beals, M. H. and Emily Bell, with contributions by Ryan Cordell, Paul Fyfe, Isabel Galina Russell, Tessa Hauswedell, Clemens Neudecker, Julianne Nyhan, Sebastian Padó, Miriam Peña Pimentel, Mila Oiva, Lara Rose, Hannu Salmi, Melissa Terras, and Lorella Viola. “Chronicling America.” The Atlas of Digitised Newspapers and Metadata: Reports from Oceanic Exchanges. Loughborough: 2020. DOI: 10.6084/m9.figshare.11560059.