History of the Collection
The Suomen Kansalliskirjaston digitoidut sanomalehdet collection has its origins in the Helsinki University Library Centre for Microfilming and Conservation, established in Mikkeli in 1990. Now known as the Centre for Preservation and Digitisation, part of the Suomen Kansalliskirjasto (National Library of Finland), the centre joined three other libraries in 1998 to form the Nordic digitisation project TIDEN. In 2001, the National Library launched its initial digital newspaper collection with 36,000 of an intended 90,000 pages from forty-four different Finnish titles published between 1771 and 1860. In 2005, the collection received 1.9 million page requests and 160,000 unique visits. By 2018, it included all newspapers and journals published in Finland between 1771 and 1929 and comprised over 880,000 newspaper issues, containing 6.2 million pages of content.
The digitisation of Finnish newspapers was undertaken as part of the Nordic project TIDEN, whose partners were the Royal Library of Sweden, the National Library of Norway, the University Library of Aarhus (Denmark) and the Helsinki University Library. The digitised collections are generally based on microfilms held by the Suomen Kansalliskirjasto; some items were digitised from physical objects, and new issues are received electronically.
Newspapers have been the primary focus of reformatting programmes throughout Scandinavia since the 1950s, and microfilm has been used as an access and storage format at the Suomen Kansalliskirjasto since 1951. Through the late 1980s, newspapers were stored on 35 mm cellulose acetate film, first by a private service provider, Rekolid, and subsequently by the Helsinki University Photographic Institute. Since 1997, all Finnish newspapers have been microfilmed at the National Centre for Preservation and Digitisation. Until the 1970s, microfilm reels were stored in the same accommodation as the print collections, after which they were transported to the Viikki bomb shelter. Since 1990, they have been held in air-conditioned vaults at Mikkeli.
As part of the Rescue Project, newspapers published between 1771 and 1945 were re-filmed (either copied onto a more stable medium or re-filmed from originals), while newspapers published between 1945 and the 1970s are currently under consideration for re-filming, owing to the difference in filming technology before and after 1980. All versions of the newspapers are currently retained, which has allowed the creation of composite collections for digitisation in cases where the original has been lost but a microfilm version remains. The completeness of the microfilm collection has allowed digitisation from microfilm, rather than from originals, where the quality is sufficiently high. Where quality was insufficient, originals were first microfilmed again before being selected for digitisation. More filming projects now include supplements and borrowed materials (in order to complete runs) than was previously the case.
The initial work undertaken by the University of Helsinki as part of TIDEN was funded by the Nordic Council for Scientific Information (NORDINFO) and the Ministry of Education in Finland: the project received €10,000 to €40,000 from NORDINFO, with additional funding from the Ministry of Education's allocation to the Finnish National Library. These initial digitisation tests were important in defining best practice for future microfilm digitisation projects, and the findings were published alongside other recommendations by the International Federation of Library Associations and Institutions in its 2002 supplement on Microfilming for Digitisation and Optical Character Recognition. In particular, the project developed test criteria for the digitisation of microfilm and experimented with best practice in developing automated production workflows.
The digitised historical newspaper collection of the National Library of Finland is based upon the newspapers acquired through free deposit laws since the eighteenth century. At the time of TIDEN, the Finnish legislation defined a newspaper as “a printed product published at least once a week”. Historical newspapers had been microfilmed systematically from the 1950s onwards and the aim was to digitise the whole older collection step-by-step, using microfilm as an intermediary. The first Finnish newspaper was published in 1771, and the first collection to be digitised was from this year forward until about 1860. After the TIDEN project, the next stages covered the newspapers from 1861–1890, 1891–1900 and 1901–1910 according to the allocated funding. Digitisation work followed the alphabetical order of newspapers within the chosen timeframe.
Composition of the Collection
The digitised collection contains all Finnish newspapers held by the Suomen Kansalliskirjasto for the years 1771–1929, with later years digitised and made available through special agreements with copyright holders where possible; newer digitised newspapers are available at the six national deposit libraries. As of October 2019, the full collection included over 998 distinct newspaper titles, comprising 6,259,133 historical newspaper pages. The majority of these pages (4,031,018, representing 64% of the collection) are currently available for public use, with a further 2,228,115 post-1929 pages held in restricted use. A full list of publicly accessible titles can be found using the filter on the newspaper web interface.
Following the general trend, the volume of newspaper publishing in Finland increased towards the turn of the century: when all issues from 1771 to 1910 are counted, 82.7% of the data is from 1890–1910, and 92.3% is from the last four decades, 1870–1910. The majority of the newspapers are in Finnish and Swedish, but there are some pages in Russian, German and other languages. Different languages dominated the Finnish public sphere in different periods: more than 50% of the publications before the late 1880s were in Swedish, after which the share of Finnish-language publications increased to over 75% by 1910. Russian-language publications emerged after 1900, while German-language publications had already appeared during the 1820s and 1830s. Out of the total number of newspaper pages in the collection, 1,063,648 are in Finnish, 892,191 in Swedish, 8,997 in Russian, and 2,551 in German.
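The per-language page counts above can be turned into relative shares with a few lines of arithmetic. The following is a minimal Python sketch; the function name `language_shares` is an illustrative choice, and the percentages it prints are relative to the sum of the four quoted figures only, which is smaller than the collection total reported elsewhere in this section.

```python
# Page counts per language, exactly as quoted in the paragraph above.
pages_by_language = {
    "Finnish": 1_063_648,
    "Swedish": 892_191,
    "Russian": 8_997,
    "German": 2_551,
}


def language_shares(counts):
    """Return each language's share of the summed counts, as a percentage."""
    total = sum(counts.values())
    return {language: 100 * pages / total for language, pages in counts.items()}


shares = language_shares(pages_by_language)
for language, percentage in shares.items():
    print(f"{language}: {percentage:.2f}%")
```

Because the denominator is the sum of the four listed languages rather than all pages in the collection, these shares describe the balance between the languages and should not be read as proportions of the whole collection.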
The majority of nineteenth-century newspapers digitised by the Suomen Kansalliskirjasto were printed using a Gothic (Fraktur, blackletter) typeface, with a minority printed using Antiqua; the difficulty that standard OCR software has in recognising the former typeface is well known. By 2006, the Suomen Kansalliskirjasto had implemented automated encoding of word coordinates and grayscale scanning, utilising the digitising software DocWorks (CCS), with OCR by ABBYY FineReader, and structuring metadata in-house by combining OCR data with catalogue information. The next phase of development focused on automating OCR for both Fraktur and Roman type on the same page, and on conforming to the international METS encoding standard. Analysis of parallel samples and word error rates showed that about 69% of all word tokens can be recognised with the modern Finnish morphological analyser, Omorfi. If orthographical variation is considered and the number of out-of-vocabulary words is estimated, the recognition rate increases to 74-75%. Overall, the collection has a relatively good quality rating of about 69-75%; around 25-30% of the collection needs further processing in order to improve the overall quality of the data.
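The token recognition rate described above is the proportion of word tokens that an analyser accepts. The sketch below illustrates the calculation only: it does not call Omorfi, the tiny `toy_lexicon` stands in for a real morphological analyser, and the single `w` to `v` substitution is just one example of the kind of historical orthographical variation that might be normalised before lookup. All names and tokens are hypothetical.

```python
def recognition_rate(tokens, is_recognised):
    """Return the fraction of tokens accepted by the given recogniser."""
    if not tokens:
        return 0.0
    recognised = sum(1 for token in tokens if is_recognised(token))
    return recognised / len(tokens)


# Stand-in for a morphological analyser: a tiny set of modern word forms.
toy_lexicon = {"sanomalehti", "on", "vanha"}

# Hypothetical OCR output: two modern forms, one older spelling, one OCR error.
tokens = ["sanomalehti", "on", "wanha", "xqz"]


def plain_lookup(token):
    """Accept a token only if it appears in the lexicon as written."""
    return token.lower() in toy_lexicon


def normalised_lookup(token):
    """Accept a token after mapping the older spelling 'w' to modern 'v'."""
    return token.lower().replace("w", "v") in toy_lexicon


raw_rate = recognition_rate(tokens, plain_lookup)
normalised_rate = recognition_rate(tokens, normalised_lookup)
print(f"raw: {raw_rate:.2f}, normalised: {normalised_rate:.2f}")
```

In this toy input, normalisation rescues the older spelling but not the OCR error, which mirrors the distinction the paragraph draws between orthographical variation and text that needs further processing.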
The publicly available images from the collection can be obtained as PDF or JPEG files with a resolution of 300 PPI; the JPEG files are made available through a structured URL based on an item’s unique identifier. High-resolution images are stored on the server of the National Library of Finland and released as part of METS packages in TIFF format.
The data hosted by the National Library of Finland uses the METS XML schema for structural metadata, ALTO XML for the OCR content, MIX for technical metadata, and MODS for descriptive and bibliographic metadata.
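As an illustration of how OCR content encoded in ALTO XML might be read, the following Python sketch extracts text lines and word coordinates using only the standard library. It relies on the ALTO convention that each word is a `String` element inside a `TextLine`, with the text in a `CONTENT` attribute and the position in `HPOS`, `VPOS`, `WIDTH` and `HEIGHT`. The sample document, its words and its namespace version are invented for the example and are not taken from the collection; the function names are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Invented miniature ALTO document; real files may use another ALTO version.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Esimerkki" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="teksti" HPOS="420" VPOS="200" WIDTH="180" HEIGHT="40"/>
          </TextLine>
          <TextLine>
            <String CONTENT="rivi" HPOS="100" VPOS="260" WIDTH="120" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO version can be matched."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of each TextLine as a space-joined string."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


def alto_words(xml_text):
    """Return (word, hpos, vpos) for every String element in the document."""
    root = ET.fromstring(xml_text)
    return [
        (e.get("CONTENT", ""), float(e.get("HPOS", 0)), float(e.get("VPOS", 0)))
        for e in root.iter()
        if local_name(e.tag) == "String"
    ]


lines = alto_lines(SAMPLE_ALTO)
words = alto_words(SAMPLE_ALTO)
print(lines)
```

Matching on the local element name rather than a fixed namespace is a deliberate choice here, since different ALTO schema versions declare different namespace URIs.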
The main database contains metadata, page data, and file data containing the archive directory information. The database offers page images of the content and access to the content of the pages in ALTO XML format. However, the URL structure is not easily translatable from bibliographic data; it places the text files within numerical directories representing individual bindings. For example:
In the data packages, pages are located in two separate directories: one based on ISSN and the other on publication year. Below the publication year in the data structure is the language of the publication, below which are the actual ALTO XML files. They are named descriptively by ISSN, year, date, issue and page. For example:
User Interface Structures
The web interface allows users to perform a simple or advanced search of the underlying descriptive metadata and OCR text. The advanced search allows for filtering by material type, title, collection, years, place of publication, author, keyword, publisher and language. The full-text search can be filtered using standard Boolean operators, a fuzzy search option, or by limiting to content or metadata fields. By default, search results are ranked by relevance but can also be ordered by date, title, author or date of inclusion in the collection. Once a result is selected, a full-page image with highlighted search results is displayed in an image viewer. The underlying data (plain text, PDF and JPEG) and manually selected snippets can be downloaded using icons at the left of the viewer, and the metadata and OCR text can be viewed in retractable widgets.
There is currently no API for accessing the newspaper data, though bulk data from the collection can be obtained with web crawling tools that follow the standardised file structure described above.
Direct Download or Drives
There are currently several options for obtaining the newspaper data in bulk. The Digital Collections maintain an Open Data website in Finnish and English that allows the download of both METS and OCR data as date- or language-delimited ZIP files. The years available vary and do not yet cover the entire collection. The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771–1874) was released in 2011; the data package is in the METS/ALTO format and is downloadable via the Language Bank of Finland. The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875–1920) was released in November 2017 and is also downloadable via the Language Bank of Finland. This dataset includes all newspapers and journals that had been digitised by the end of 2013, which covers all newspapers published between 1875 and 1920.
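Once a ZIP data package has been downloaded, its XML members can be read without unpacking the whole archive. The following Python sketch uses the standard `zipfile` module; the function name `iter_xml_members` and the member names in the demonstration archive are hypothetical, and the actual directory layout of the published packages should be checked against the downloaded files.

```python
import io
import zipfile


def iter_xml_members(zip_source):
    """Yield (member_name, raw_bytes) for every .xml file in a ZIP archive.

    zip_source may be a path to a downloaded package or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            yield info.filename, archive.read(info)


# Build a small in-memory archive so the sketch runs without any download.
# The member names are placeholders, not the real package layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_archive:
    demo_archive.writestr("package/page_0001.xml", "<alto/>")
    demo_archive.writestr("package/readme.txt", "not xml")
buffer.seek(0)

found = [name for name, _ in iter_xml_members(buffer)]
print(found)
```

Reading members lazily in this way keeps memory use modest for large packages, because only one file's bytes are held at a time.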
Rights and Usage
All out-of-copyright material obtained from the Suomen Kansalliskirjaston digitoidut sanomalehdet web interface may be used freely, but users are asked to cite it using standard citation conventions. Any materials that remain under third-party copyright are clearly labelled with the specific conditions of use for that item; users may not redistribute in-copyright digitised material without permission from the rights holder.
API and Direct Download
Users are allowed access to archive (ZIP) files of all out-of-copyright texts, images and metadata. Although individual texts have been released into the public domain, the dataset as a single object should be properly attributed in derivative works. Users may not redistribute in-copyright digitised material without the permission of the rights holder.
Beals, M. H. and Emily Bell, with contributions by Ryan Cordell, Paul Fyfe, Isabel Galina Russell, Tessa Hauswedell, Clemens Neudecker, Julianne Nyhan, Sebastian Padó, Miriam Peña Pimentel, Mila Oiva, Lara Rose, Hannu Salmi, Melissa Terras, and Lorella Viola. “Suomen Kansalliskirjaston digitoidut sanomalehdet.” The Atlas of Digitised Newspapers and Metadata: Reports from Oceanic Exchanges. Loughborough: 2020. DOI: 10.6084/m9.figshare.11560059.