History of the Collection

Trove was launched in 2009 to create a single point of entry for the online discovery services developed by the National Library of Australia since 1997, including the Register of Australian Archives and Manuscripts, Picture Australia, Libraries Australia, Music Australia, Australia Dancing, the Preserving and Accessing Networked Documentary Resources of Australia (PANDORA) search service, the Australian Research Repositories Online to the World (ARROW) discovery service and the Australian Newspapers Beta service. The digitised newspaper collection was included alongside these aggregated resources in 2010. It is an extension of the Australian Newspaper Plan (ANPlan), founded in 1992, and aims to make as many Australian newspapers as possible freely available and to ensure that they remain available in perpetuity, regardless of future technological change.

The pilot by the Australian Newspaper Digitisation Program (ANDP) aimed to digitise 50,000 pages from twelve (later eleven) titles, moving towards three million pages over four years. These were curated from existing microfilm copies by the State and Territory Libraries of Australia with the aim of providing a sample of historical newspapers evenly distributed across the country. Funding for the digitisation of the newspaper collection has come from various sources. As of 2015, the National and State Libraries Australia, other cultural heritage and research organisations, as well as community groups had directly funded the digitisation of about half of the newspaper pages available on Trove. This includes the State Library of New South Wales, which, alongside the National Library, has been the most significant contributor to the digitisation of newspapers and journals. The infrastructure costs are borne by the National Library without an additional appropriation from the government, and digitisation not funded by contributors is funded from the Library’s collections budget.

The Australian Newspapers Beta service was launched in July 2008 as a standalone website and a year later became a fully integrated part of Trove. Shortly after launch, the system incorporated a platform for crowdsourced text-correction, allowing the public to improve the searchable text, as well as the ability to apply social tags to materials, create curated lists, and leave comments. These features have allowed a high degree of community engagement and enriched the collection. By 2009, users were able to access 720,000 pages of digitised content. By mid-2014, the newspaper collection had grown to 13.5 million pages, claiming the title of the largest digitised newspaper collection in the world. As of December 2019, Trove contains over 25 million newspaper pages. Since 2018, the National Library of Australia has invested heavily in expanding and improving access to its collections through updated digitisation, API and web interface protocols.

The digitisation, delivery and crowd enhancement services originally associated with newspapers are now also available for other published material within Trove, including government gazettes, journals (magazines and newsletters) and books as well as special collections.

Consulted Libraries

Most of the newspapers within Trove were scanned from microfilm collections held by members of the National and State Libraries Australasia: The National Library of Australia; the State Library of Western Australia; State Library of New South Wales; State Library of Victoria; Libraries ACT; Library and Archives NT; State Library of Queensland; State Library of South Australia; and Libraries Tasmania. Additional published collections, held by private organisations, have also been digitised through the digitisation partnership programme.

Over forty Australian public libraries have also selected titles to digitise based on significance to their local communities and have directly funded the digitisation. They have also coordinated local community groups and groups of libraries within their regions to raise the funds required. In some cases, public libraries have also provided previously uncatalogued physical copies for microfilming, as part of the digitisation process. Organisations beyond the Library sector have also nominated and funded newspapers for digitisation. This includes local, state and federal governments, historical societies, archives, universities, community groups, foreign embassies and businesses.

Microfilming Projects

Established in 1992 as the National Plan for Australian Newspapers, ANPlan brought together independent programmes of preservation by the National and State Libraries Australia (NSLA), alongside the National Library of New Zealand, which holds observer status. It continued the devolution of the responsibility for collecting, preserving and providing access to newspaper titles to respective jurisdictions but initiated a coordinating role for the State Library of South Australia. Part of this strategy was to ensure, as far as possible, that at least one hardcopy instance of every newspaper was retained alongside a surrogate copy, such as microfilm, to ensure long-term public access. In 2001, the coordinating responsibility was taken up by the National Library in Canberra. Although microfilming had been the primary means of long-term preservation for fifty years, by the mid-2000s, ANPlan partners had begun to question the long-term viability of microform preservation, citing concerns about film stock, microfilming services and the cost of suitable storage facilities. These difficulties have been compounded by decreased industry support, including the manufacture and repair of microform readers and duplicators, as well as by user preference for digital delivery.

Digitisation Projects

Selection

The aim of the National Library of Australia’s Australian Newspaper Digitisation Program (ANDP) was to make freely available all Australian newspapers published prior to 1955. During the initial phase of the programme, newspaper issues were selected by the National, State and Territory libraries. When digitisation began in 2007, the library deliberately chose one title from each state and territory in order to be geographically representative. Afterwards, particular attention was paid to the oldest or biggest newspapers from each state, though there was a general desire among the partners not to focus exclusively on newspapers with the largest circulation but also to represent smaller or more remote communities with less reliable physical access to research libraries. Under this devolved model, State and Territory libraries nominated the newspaper titles or issues for digitisation and provided the microfilm from their collections, while the National Library also selected titles for digitisation, its focus shifting year on year to address important themes, such as WWI, and to represent non-geographical communities. This selection process involved the consultation of newspaper historians and a microfilm supplier. The approach has led to some unevenness in periodisation across the collections, but this will diminish as digitisation continues. In general, this devolved decision-making process considered user demand, historical significance, geographical and regional coverage, and microfilm status. The initial selection process, therefore, largely depended upon librarians’ expert assessment of significance and user demand, as well as the availability of microfilm copies of sufficient quality for digitisation.

Up until the mid-2010s selection occurred within the framework of the Library’s Collection Digitisation Policy, which considered a newspaper’s cultural and historical significance, utility to a broad range of audiences, uniqueness or rarity, perceived public demand, conservation status, rights conditions, planned digitisation by other providers, and other practical and technical considerations regarding its digitisation.

Since 2010, individual users and groups have been encouraged to take part in a contributor-funded model, wherein they nominate and fund the digitisation of a title, so long as it falls within the Library’s general selection guidelines. Over 180 organisations and groups have participated in the programme. In 2020, the National Library implemented a new fundraising strategy under the Treasured Voices initiative to significantly increase its digitisation output. Australian newspapers pre-1955 are a finite set, and digitisation of the entire corpus into Trove is a long-term goal.

As of December 2019, the library maintains online lists of current titles, forthcoming titles and new additions.

Preservation

Although ANPlan’s 2010 five-year plan included discussion of ongoing microfilming for preservation, its most recent strategy document focuses almost exclusively on the digitisation of historical newspapers for preservation and the retention of born-digital newspaper files for legal deposit. A key aim of the 2015–2018 strategy was to implement agreed minimum scanning standards for newspapers across all member libraries. Detailed guidance on digitisation from both microfilm and hardcopy is available on the Trove Digitisation Partners webpage.

Access

The Trove newspapers collection provides users access to the most comprehensive selection of historical Australian newspapers in a single location; it is available free-of-charge, worldwide. Existing microfilm collections remain accessible at individual State and Territory Libraries as well as the National Library of Australia and, at the discretion of individual libraries and where conservation status allows, users may still consult original hardcopies of historical newspapers that have been digitised.

Composition of the Collection

Selection Available

As of December 2019, Trove Digitised Newspapers provides access to over 25 million pages across almost 1,500 Australian newspapers from each state and territory, from the earliest published newspaper in 1803 to 1954, when copyright is assumed to have expired. There are also fifty titles with digitised issues after 1954, and nine after 2000, which have been made available with the agreement of the publisher. These include the Canberra Times, the Australian Women’s Weekly, Woroni and the Chaser. In addition to the English-language press, the collections also include Australian publications in community languages such as Chinese, Japanese, Danish, Estonian, French, German, Italian, Polish, Swedish, Greek, Macedonian, Gaelic and Bahasa Indonesia. A list of newspaper titles already digitised is available on Trove, as well as a list of newspaper titles selected for digitisation for the current year.

Data Quality

Text

A single contractor was responsible for OCR and content analysis in the initial phase, while a panel of OCR and content analysis providers has been used since 2010 to cater for the rapidly expanding programme. OCR contractors process page-level image files provided by the National Library of Australia according to publicly available guidelines and provide hand-keyed metadata for key fields of each issue. Afterwards, the Digitisation and Photography Branch carries out a quality control process in which it checks a sample of articles from each batch.

The overall OCR quality of the Trove newspapers collection varies owing first to variations in the source materials and second to the non-systematic inclusion of end-user corrections. During the initial processing, titles, sub-titles, authors and the first four lines of each zoned article are re-keyed, resulting in 99% accuracy for these components. Once articles are online, Trove users are encouraged to improve the accuracy of the OCR text through line-by-line correction, and some users have self-organised into volunteer groups to undertake systematic corrections of certain parts of the collections. As of December 2019, over 333 million column-lines of the OCR text had been manually updated by Trove users, with one user having worked on almost 6 million lines. However, this represents only a small percentage of the growing collection and is not evenly distributed, with a disproportionate number of changes being made to family notices and other material useful to genealogical research. A history of these changes is recorded, allowing staff to roll back vandalism, and the web interface searches both the original OCR and corrections to it. Articles that have or have not undergone manual corrections can be filtered using the web interface facet “has:corrections”, while the API will return the number of corrections and the last date the article was modified.

Thus, any given article within Trove may have had a small or significant manual correction to the original OCR transcription, which itself varies considerably depending on the condition and typography of the original item. As these corrections are updated hourly, the OCR quality of any sub-corpus should be tested specifically at the time of analysis. Moreover, periodicals in Chinese, Estonian, French, German, and Italian currently have a greater variance in OCR quality than the English-language titles, owing to software limitations. These provisos acknowledged, independent research undertaken in 2013 showed a general OCR accuracy of 80-90%, with the late 1840s rising to 94% and the early 1920s dipping to just under 80%. The library has also undertaken research into how to evaluate the improvement of crowdsourced corrections in order to improve the reliability of its machine-readable text as the number of digitised pages increases.
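
One way to test the OCR quality of a sub-corpus, as recommended above, is to hand-transcribe a small sample of articles and compare each transcription with the OCR text. The following is a minimal sketch of a character-level accuracy measure based on edit distance. It is not the method used in the 2013 research cited above, and the function names and sample strings are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, reference: str) -> float:
    """Return 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    distance = levenshtein(ocr_text, reference)
    return max(0.0, 1.0 - distance / len(reference))


if __name__ == "__main__":
    # Made-up example pair: (OCR output, hand-corrected transcription).
    ocr, truth = "Tbe Sydney Gazette", "The Sydney Gazette"
    print(f"character accuracy: {character_accuracy(ocr, truth):.3f}")
```

Averaging this score over a random sample drawn from the sub-corpus on the day of analysis gives a rough estimate that reflects whatever user corrections have been applied up to that point.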

Images

During the digitisation programme, the majority of Trove newspapers were scanned from 35mm master negative silver gelatin microfilm reels or second-generation silver gelatin microfilm reels into a pair of digital images, consisting of a 400 PPI raw greyscale TIFF and an Image Optimised Bitonal TIFF. The National Library currently requires hard copy newspapers to be digitised in colour, with an additional bitonal image of each page for OCR purposes. Both must be formatted as TIFF 6.0 at 400 PPI, using LZW compression for the colour master image and CCITT Group 4 compression for the bitonal image.

Metadata Schema

The OCR metadata contained within Trove utilises the METS XML schema for structural metadata and ALTO XML for the OCR content. The descriptive and bibliographic metadata is largely based on human-inputted records, either by library staff or by human operators at OCR processing facilities. Additional metadata regarding user annotations and corrections is held in a separate metadata schema accessible via the API.
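
As an illustration of how OCR content is laid out in ALTO, the sketch below reads the text of each line from a small, made-up ALTO fragment. In ALTO, each recognised word is a String element whose CONTENT attribute holds the text, grouped into TextLine elements. The sample document, the namespace version and the function names are assumptions for illustration only; the source does not state which ALTO version Trove uses, so the code ignores the namespace when matching elements.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment; the namespace version is an assumption.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="SHIPPING" WC="0.95"/>
            <SP/>
            <String CONTENT="NEWS" WC="0.91"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Arrived" WC="0.88"/>
            <SP/>
            <String CONTENT="yesterday" WC="0.67"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list:
    """Return the text of each TextLine, with words joined by single spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on local element names rather than a fixed namespace keeps the sketch usable across ALTO versions, at the cost of being less strict than schema-aware parsing.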

Backend Structure

The data for each issue is stored in multiple image and text files. Each newspaper page has two digital image files: a raw greyscale TIFF image and a bitonal TIFF image. File “pairs” have identical names apart from the character that distinguishes bitonal and greyscale files (g for greyscale files, b for bitonal files; c is used for hardcopy-derived colour images). One XML file contains most of the human-supplied metadata for the issue, conforming to the METS schema. There is then an XML file for each page containing the OCR results using the ALTO schema. Each file has a name consisting of the base, generally “nla.news-issn”, followed by the ISSN for the publication (eight characters with no hyphen, all numeric except that the final character is sometimes “x”), followed by a unique sequence number for that page starting with “-s”, then by “-g” for the greyscale image or “-b” for the bitonal image, and ending with an extension for the file type. Sequence numbering is continued across scan jobs or microfilm reels for each individual newspaper title so that all file names are unique for a title. Image files are named sequentially based on the order in which they appear in the microfilm.
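
The naming convention described above can be sketched as a regular expression. This is an illustrative reading of the description rather than an official specification: the exact separators, whether sequence numbers are zero-padded, and whether per-page XML files omit the image-type suffix are all assumptions, and the example file name is made up.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: nla.news-issn<ISSN>-s<sequence>[-g|-b|-c].<extension>
FILENAME_PATTERN = re.compile(
    r"^nla\.news-issn(?P<issn>[0-9]{7}[0-9xX])"  # eight characters, no hyphen
    r"-s(?P<sequence>[0-9]+)"                     # page sequence number
    r"(?:-(?P<kind>[gbc]))?"                      # image type; assumed optional
    r"\.(?P<extension>[A-Za-z0-9]+)$"
)

IMAGE_KINDS = {"g": "greyscale", "b": "bitonal", "c": "colour"}


class PageFile(NamedTuple):
    issn: str
    sequence: int
    kind: Optional[str]  # None when no image-type suffix is present
    extension: str


def parse_page_filename(name: str) -> Optional[PageFile]:
    """Split a page-level file name into its parts, or return None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    kind_code = match.group("kind")
    return PageFile(
        issn=match.group("issn").lower(),
        sequence=int(match.group("sequence")),
        kind=IMAGE_KINDS[kind_code] if kind_code else None,
        extension=match.group("extension").lower(),
    )


if __name__ == "__main__":
    # Hypothetical file name, constructed from the description above.
    print(parse_page_filename("nla.news-issn1234567x-s42-g.tif"))
```

Because greyscale and bitonal files share everything except the type character, grouping parsed results by (issn, sequence) recovers the file pairs for each page.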

User Interface Structure

Web Interface

The current user interface allows users to perform a simple or advanced search of the underlying descriptive metadata and OCR text or to browse images by date, place, category, tag and title. Facets, as well as the advanced search, allow for filtering by article type, article length, illustration inclusion, title, date, and place of publication. The full-text search can be filtered using standard Boolean operators. By default, search results are ranked by relevance but can also be ordered by date. Once a result is selected, a full-page image, centred on the relevant article with highlighted search results, is displayed in an image viewer. The viewer allows users to pan and zoom as well as navigate through the issue. The underlying data (plain text, PDF and JPEG) as well as user-inputted categories and comments can be downloaded using icons at the left of the viewer, and the metadata, preferred citation and OCR text can be viewed in retractable widgets. When an image is downloaded, the article is segmented and then embedded into an HTML page to facilitate printing onto A4 paper. A new web interface is currently in development.

API

The Trove API provides users with machine-readable access to the underlying data of the Trove collections, including user-generated data. The API currently allows for the display of Trove results on external websites, the harvesting of data for offline analysis, the retrieval of user annotations and the creation of new tools and visualisations. A personal API key can be obtained automatically via a Trove user account; a commercial key is also available but requires explicit permission from the National Library. Materials can be accessed using a URL-based request, which assists users in formatting their requests. A full technical description of the API is available through the Trove Help pages.
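
To make the idea of a URL-based request concrete, the sketch below assembles a search URL from a key and a query without sending it. The endpoint and the parameter names (key, zone, q, encoding) are assumptions modelled on an earlier version of the API and are not confirmed by this report; they may differ in the current version, so the Trove Help pages should be treated as the authority. The function name and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint, modelled on an earlier API version; verify before use.
ASSUMED_ENDPOINT = "https://api.trove.nla.gov.au/v2/result"


def build_search_url(
    api_key: str,
    query: str,
    zone: str = "newspaper",
    encoding: str = "json",
    endpoint: str = ASSUMED_ENDPOINT,
) -> str:
    """Assemble a search request URL; parameter names are assumptions."""
    if not api_key:
        raise ValueError("an API key is required")
    params = {
        "key": api_key,        # personal key from a Trove user account
        "zone": zone,          # which part of Trove to search
        "q": query,            # full-text query
        "encoding": encoding,  # requested response format
    }
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # "MY_KEY" is a placeholder, not a working key.
    print(build_search_url("MY_KEY", "gold rush"))
```

Building the URL separately from sending it makes it easy to inspect a request and record it alongside any harvested data, which helps when documenting how a sub-corpus was assembled.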

Direct Download or Drives

The Australian Government Gazettes and the Australian Aborigines Advocate are available for bulk download through the Trove Help.

Rights and Usage

Web Interface

The Trove web interface is freely accessible to all users, worldwide. All material obtained from the web interface may be used freely for personal research. When browsing or searching through the user interface, users are presented with a full citation to the digitised image and text.

API

The API is free and open, with a key that is automatically obtainable for personal use. Commercial use requires explicit approval by Trove. Material derived from the Trove API may be used under the same conditions as that derived from the web interface.

Re-Publication

Digitised newspapers up to 1954, whether delivered through the web interface or the API, are available to users as greyscale or colour images and machine-readable texts. Copyright in Australian newspapers is complex. Neither Trove nor the National Library of Australia can grant special permission to use copyrighted items; only the copyright holder can do this. Before reproducing any newspaper articles, the user is asked to confirm whether they are out of copyright. If the article is out of copyright, it is free to use; however, proper attribution and citation should be applied when using all newspaper content.

Suggested Citation

Beals, M. H. and Emily Bell, with contributions by Ryan Cordell, Paul Fyfe, Isabel Galina Russell, Tessa Hauswedell, Clemens Neudecker, Julianne Nyhan, Sebastian Padó, Miriam Peña Pimentel, Mila Oiva, Lara Rose, Hannu Salmi, Melissa Terras, and Lorella Viola. “Trove.” The Atlas of Digitised Newspapers and Metadata: Reports from Oceanic Exchanges. Loughborough: 2020. DOI: 10.6084/m9.figshare.11560059.