Exploring the Atlas of Digitised Newspapers and Metadata Workshop: Speed Blogging Feedback Session

Video Transcription

So, we’ll begin then, so that we have time to give everyone a chance to explain. So, as we said before, everyone’s had a chance to talk in their smaller groups and hopefully everybody’s had a nice conversation about their particular topics and come up with some ideas. It’s really important that everyone understands that these are not meant to be journal articles or final thoughts, or explanations on anything. They’re meant to be discussions of things that we could bring forward as the project continues. So what we’re going to do now is, I’m going to hand over to Emily in a second, who will call out the different group numbers, and somebody from that group can volunteer (if they have a microphone) to speak and give their one-to-two-minute takeaways from their group discussion. Just so you guys know, we’re going to leave all of these blogs up for about a week or 10 days and give you a chance to continue them or revise them, and then we’ll contact the group members individually to see if they’re happy for it to be published on the Atlas blog over the next couple of weeks. That being said, I will hand it off to Emily, who will act as master of ceremonies for the feedback session.

Okay so could we hear from group one: what were your main headlines?

I guess we discussed the challenges that libraries or institutions face in understanding best practices for generating content that researchers can use, whether that be user-generated or institution-generated. We noted this and the lack of documentation for member institutes, or for new libraries coming up wanting to do more digitization, and thought that there is space for libraries to collaborate on documentation so that new digitization projects coming in can get up and running faster. And we also had a bit of discussion around the usefulness of the metadata that gets created, whether that be at the level of just images or going up to a higher level of documented metadata, and quite how deep it should go. That’s a bit of a summary of the things that we discussed.

Brilliant, thank you very much. Can we hear from group two?

We quickly converged on the importance of improving user experience for systems for adding metadata. One huge problem right now is that many people who have a lot of expertise in a range of relevant areas for all sorts of research projects can’t really contribute, because the systems used are not accessible enough or easy enough to use for them. So, our main takeaway was that these systems need to be easy to use in the sense that they have familiar graphical user interfaces for people, nothing that could really trip up people with little technical expertise, and that they should ideally involve some ability to crowdsource. There should be easy ways to add information or metadata in a standardized way: for example, using drop-down menus that let people select suggestions from particular metadata schemas. Otherwise, the system should enable metadata to be added automatically, if possible. For example, metadata about what software or tools or machines or locations were used in the creation of this metadata: adding that should be able to happen automatically really. And in the end we think that, to improve the user experience in this way, the developers of tools should do whatever possible to engage end users as early as possible when developing these tools. And at the end, when tools are openly available, there should be a lot of documentation to make things as easy as possible for people to follow, and that’s basically it.

Great, thank you. Okay group three?

We were really looking at the literary historical challenges and we had quite a wide-ranging discussion about the different kinds of disciplinary backgrounds that people bring to this work. So we had a group which ranged from corpus linguists, who are used to dealing with large forms of data and analysing those, to those of us working in literary and historical studies who are moving between questions about distant reading and close reading, and thinking about those different disciplinary expectations, to people working with family history resources. And so, we came up with a snappy title in the end about democratizing the archive and trying to think about the ways in which the interfaces need to accommodate quite a range of different backgrounds and skills that people are bringing, and different kinds of research questions. And there are some interesting points I think around how often family history or genealogical or local history researchers are using the same resources as professional academic researchers, often for quite different purposes or final outcomes, and I think there are some missed conversations there. I think one of our colleagues was talking about how a lot of the family history databases don’t really expect that academic researchers will be using their services; they imagine a family historian as a user, but in fact often we’re moving across those different formats. But also, there are some quite complex ethical questions about who has paid for access to particular things, and how much ownership they would like to assert over that in terms of intellectual property and copyright. And I think that’s the wild west frontier of online digital research which is being worked out in quite improvisational ways.
So that was a set of the ideas we talked about, but really, we were thinking about how do we actually think about some crowd-sourced and cross-disciplinary collaborations and some model projects in which we might demonstrate the best way to be doing that work?

Thank you! Okay, group four:

We went through just two questions. So, we went through “what are the issues facing literary scholars when working with collections?” and some of the things that we had to say were, for instance, it’s particularly time-consuming. The search functionality could be more advanced. It also depends on the type of research involved: so, research into a specific topic (surveying a range of publications for material related to that topic), or research into newspaper/magazine publication history itself. And then we also thought it was difficult to find the right materials quickly because there are too many databases, and capturing all the metadata necessary to anticipate what researchers may be looking for is virtually impossible. So, archivists actively work with researchers to understand their interests and needs and are normally guided by trends in research, and research collections are more often than not shaped by researchers’ needs and requirements, so those are some of the issues that we found. And then the changes we thought we could actually make would be: researchers’ needs and requirements actively helping to build archives, with researchers being actively involved; drawing attention to the wider print ecology or print economy, not just the better-known publications but also those that have largely disappeared from historical view; and also making those forgotten materials/newspapers seen by contemporary readers and researchers, making them read, known and discussed by people. And then also in terms of knowledge, the wider use of standardized APIs would allow for the creation of networks of information that would enable researchers to find the material they need more easily and hugely improve discoverability. And surveys of researchers, or perhaps a feedback function in the newspaper database, so researchers can help improve the service based on their experience. And research needs research tags: correct categories for topics, and more than one category for topics.
Some text is also not readable on the newspaper, so making sure the text is clear. Also finding accurate materials through keywords or multiple similar keywords. Also, there is an amount of metadata that archivists need to capture in order to meet international standards for archived metadata, so these fields already capture key metadata for discoverability and access. The problem is the capacity to capture a large amount of metadata, as this is extremely time-consuming, and archivists have many other pressures as well. So those are some of the things that we came up with as a group.

Thank you very much for that. Okay, group six?

We also ranged over a lot of the topics that have come up already, but specifically we started off with a discussion that there’s perhaps a disengagement on both sides, from researchers and library professionals, in terms of not knowing exactly what’s happening in the process of research and in the process of recording the metadata, which creates uncertainties about how the data can be used, what the limitations of a collection and the shape of the overall collection might be, and what that means for researchers in the conclusions that they want to draw. Another point that came up was the question of the physical object or the materiality of the digitized archive, and whether or not metadata could lead us back to that. Questions such as the marginalia which might be changing the way in which you engage with an object, and paratext and size and format, and some way in which you can be drawn back to questions which would also interest literary scholars and historians but which adds to the complexity of what metadata is gathered. And that brought us to a final question of the different forms of access for researchers, whether they’re individual researchers or affiliated with an institution, and how that difficulty of differing levels of access also means that there’s a further inconsistency on the researchers’ side in terms of how people are able to draw conclusions from periodical data. And more broadly a question of consistency across using and documenting methodologies around using different archives. We were in the process of discussing ways to keep that specificity, and the insight of bespoke collections, whilst also dealing with the need to train researchers in using different data sets and collections.

Great, thank you very much. Okay so, group eight?

We had a variety of people in our group, from very new periodical scholars to people who were much more expert, as well as our designated librarian archivist, Nicola. So, it was an interesting discussion. We talked, much as many others have, about the user experience and how, particularly when you’re a novice researcher, you’re not really sure how you want to look for things. And as you become more of an expert, you’ve got more pointed, maybe specific, queries that you’re looking for. You’ve got more sophistication, I guess, in terms of how you want to search, so there’s the need for a variety of user experiences on the basis of how you want to search. We also talked about how it can be difficult to just even know where to start, so I learned something new today, which is that Wikipedia has a decent list of online newspapers. I’m like well hey, this workshop is now totally worth my while because I’ve learned something (I mean, I’ve learned lots of new things this afternoon, but that was a new find as well). We focused on the importance of being able to download the entirety of the data set so that you could work on it offline, and just the variety of different ways of feeding our research interests back to the librarians and archivists so that we could influence and inform the kinds of decisions that are being made in that digitization process and the building of repositories. Okay, I think that’s about it from us.

Our final group is group eleven.

We discussed the challenge that libraries face in building up high quality collections, and we discussed mainly the aspect of collaboration between researchers and libraries, on the one hand by thinking and talking about structural metadata, and on the other hand by collaborating in building up more contextual information. Because if you try to search or look for a newspaper, you not only do research on the title, sometimes you have to go deeper in, and where is that information? How can you find it? And we talked a lot about how to bring those together, and maybe do something like a contextualization, like it is done for manuscripts as well. And we discussed how to get some contextual information and how to collaborate on this topic.

Thank you very much.

From workshop 2:

We’ll start with group one.

So, we had a very, very nice conversation and we talked about, let’s say, the context of metadata and how they can shed light on the content of the collection. And we discussed the fact that collections have a history, and metadata were created at different times and in different ways, and so it is very relevant to record provenance, but to record provenance in a number of ways and at a number of levels. And, for example, we discussed having more information about the person that created the metadata. For example, when the information was created: that might say a lot about what was the state of the art of the research at the time, what might have been the main cultural influences at the time. But also, now that we’re dealing increasingly with automatically-generated metadata and semantic annotation, it’s quite important to record this provenance that basically says this is machine-generated, and this generation of information that is very valuable also has to be documented to be truly valuable, so that we can, for example, have information about exactly what were the parameters of the algorithm, can the algorithm be replicated, can it be corrected, and things like that. We were then moving into how we can maybe try to overcome all the separations between collections due to the differences of the metadata, and there were at least a couple of very strong linked open data supporters in the group, so we thought that having URIs was a good starting point. And we had just started discussing how maybe online semantic annotation, or in general crowdsourcing projects, could maybe help cultural institutions produce metadata, and then feed them back, but we didn’t get there.

Thank you very much for that. Okay, so our next group is group four:

So, we all had a bit of a chat about how we felt a bit out of our depth, maybe, with this metadata, and all that. So, we talked a bit more about our own experiences using these databases, what we felt was lacking, and then from there what would be our utopia database experience without necessarily regarding whether or not that was possible, or how easy it would be. So, it’s pie in the sky dreams. So, in terms of the things that we felt were lacking, we’re saying that at the moment we find that the scattered and overlapping nature of databases can be frustrating. So, you might need to use the British Newspaper Archive in FindMyPast, but that search section is really clunky, and then you search in the British Library Newspapers first, but that’s behind a paywall so you can’t use it. So, it’s just jumping back and forth between different sources, so it might be good if there was a more standardized way of searching things—or at least that there were more accessible ways of searching things. It would be good if there was more cross-referencing between different kinds of databases, like if you search for someone’s name, especially if you’re a genealogist, and you search for them in the newspaper and it comes up with their name, there might be like “oh this name’s also found in this census, which you can get on FindMyPast” or something like that. So, it’s easier to trace people and learn more about them. Issues with tags being inaccurate: this was more not necessarily newspaper databases, but like for actual archives where it says it talks about illness and then it’s like a page in a diary and you get to the last page and it mentions that the person had a cold and that’s not overly useful, so maybe better tagging for things and maybe some issues with dates where the newspaper exists but the database says that it doesn’t and it’s hard to track down where it is and how it’s gone missing. 
So those were some of the things that we thought were issues, and so for our digital newspaper utopia we thought it would be really great if there is an algorithm like Amazon’s where it says “if you like this, you might like this”, or “other people who search this also looked at this” so it can help you jump around. Another thing might be a chat box function that goes with the article so that researchers can interact with each other and talk to each other about what they found in this, what was useful, what wasn’t, and something with just general archive categories so that if it is that they mention sickness only once, someone can say “hey I looked at this already and it’s in one sentence”. And then being able to search images, or being able to search layout styles somehow, although we weren’t really certain how to do that. And then also more searches about the newspapers themselves, the authorship of articles, the editors’ political affiliation, where they were sold, and their geographic spread. Like if there would be a way to make a map that shows where all the newspapers are, and then their geographic spread, that you can also drag across time and see how things shift and move over a period, but understanding that that’s probably very complicated. And that’s the main points we discussed.

Great, thank you! So, you covered quite a lot of points there, and there are some good suggestions in the chat for some projects that are attempting to do some of those things. Okay great, so the next group is group six:

We began with issues facing historical and literary researchers working with collections, and one issue that certainly overlapped with a number of members of our group is how do we locate minority cultures within a dominant culture? Now of course we covered in the general discussions the issues about specifically the digitization of periodicals and newspapers belonging to languages that aren’t English, but the fact is English language databases such as Trove and historical papers online such as the Library of Congress contain within them minority cultures, but locating those voices, locating those actors (whether they’re slave rebels, whether they’re indigenous people) is very, very complicated. So we wanted to begin with a plea for some way of tagging that could enable us to locate those actors more easily. And, leading on from that, one of the points that came up was about a plea for more transparency over selection and a more candid account in editorial guidelines produced by databases about the relationship between the digital surrogates and the physical object. We really need to be very, very clear what content is digitized, what hasn’t been digitized, and what the rationale is for that, because I think there’s an assumption (correct me if I’m wrong) that historical and literary researchers are primarily, or only, looking for text, when in fact many of us are looking at illustrations, advertising, mastheads, and other non-textual material. So, if that isn’t being included, it’d be helpful for researchers to know that and to know why that decision was made. And then the third thing we discussed was thinking about cross-referencing, linking data across different databases. How can we do that, and how can we make sure that newspaper databases link up with the other resources you need to do things like book history, like library catalogues and other book catalogues, and other forms of open access digitized material that isn’t necessarily held in the newspaper database?
And then I think the last thing that we wanted to discuss was really a more philosophical question we wanted to pose, which is how do we avoid verification in our search results? How do we find what we do not yet know we are seeking? If we’re looking for actors who we know are there but we don’t know exactly how they might be articulated, or how they might be articulating themselves, within a particular discourse or particular archive, how can we find more fuzzy or proximate modes of searching to help locate those people? Because our concern is that by having very rigid search terms that are very particular and unique, we actually verify our research practices, and we verify our research questions, and we predetermine what we can look for, when actually a lot of us are trying to find people we don’t know are there, but we think might be.

Thank you for that, and that’s a really interesting one. And topic modelling attempts to address that in some way, but one of the great visualizations that I saw from some Computer Science undergraduate students doing dissertation projects at the University of Stuttgart with partners in our project looked like a galaxy: it was just a lot of dots, and it was just a galaxy of newspapers with no other information. It was just light dots, and the idea was that it enabled you to do organic search because you could see when there were clusters for some reason about particular topics.

That’s really interesting.

Yeah, and it looked absolutely mad, and thinking about how you might actually use it, it didn’t seem very intuitive. But I liked the fact that it really made, again, that idea of you don’t know what you’re looking for, so we just zero in on something that’s bright.

Yeah, yeah but that’s really interesting, and I guess the data visualization that social networking can do can help with that maybe.

Thank you! Okay, group seven:

So, there were a number of things that were suggested that came up. Our prompt involved challenges facing literary scholars and historians doing research with digital archives, and there were, I think, issues that we all across the board were finding with the interfaces themselves being perhaps less user-friendly. So that would involve, for us, OCR readings and discrepancies between the physical material items and their visualizations online. So, for instance, something like bindings interfering with legibility or the intelligibility of text. Apart from those more, I suppose, granular issues, many of us did mention (I guess there were only four of us, but several of us mentioned) that there isn’t so much consistency that we’re finding between databases themselves, insofar as we’re seeing sources that are either duplicated or the reverse of that, which is to say we expect to see an item in the database that we don’t find, but do elsewhere. So, there’s less meta-communication, in a way, between the databases themselves. In my own research I’ve come across this with Periodicals Archive Online and British Periodicals Archive, or just British Periodicals, I suppose that’s what it’s called specifically. British Newspaper Archive, one of our members mentioned, is, well, for one thing it is subscription-only, as opposed to British Library Newspapers. And one thing we discussed did involve paywalls and institutional subscriptions that may not be more largely available to researchers, and so with British Newspaper Archive as well, Gale I think (oh sorry, I’m just following your notes here), there are, well, at any rate we found, difficulties that are particular to search optimization with those.
So a scholarly network for sharing, along the lines of Victorian Web or NINES, would also really facilitate research, just to the extent that a few of us—and I suppose this is true for many across the board—struggle with the worry, or the fear, the preoccupation, that there might be this crucial pool or cache of resources that we’re just missing because we haven’t contacted the right person, or we haven’t followed the right trail of breadcrumbs. And so, I guess on a more scholarly, academic level, perhaps less so a digital one, being able to share information about the databases and the resources that are out there and what overlap occurs would be really helpful, and really useful. So, good to think about for the future.

A note about the sharing: when the New York Times offered a dial-up service, essentially, to search their databases, the librarians were actually charged per search, so they actually required scholars to come to the librarian and have the librarian form the question for them, to make sure that it was going to work with the database. And then they were asked to please photocopy all the results and give them to all their friends, so that people didn’t repeat searches, because it was costing too much money. So, there is actually precedent for sharing search results like that, and I think it’s something that we should maybe consider bringing back up, because we do curate when we search, and that curation would be helpful to other people.

Okay, thank you! Next group is group eight:

We skipped across a lot of different themes, broadly linked to thinking about how we as researchers tend to approach the use of periodicals and newspapers, both in our research and in our teaching. So we talked a bit about accessibility, some of the things people have talked about already in terms of things like paywalls and open access, but also in terms of broader accessibility, about how difficult it is sometimes to get a handle on the self-study of periodicals as a field, both in terms of just the fact there are so many different databases that cover so many different things, but in overlapping, complicated ways, and then also the fact you then have to understand the world of the historical press within that. So, there are two systems you’ve got to understand, and we talked a bit about how hard that was for our students. I think it’s one thing for people who maybe specialize in periodicals and live in these archives; it’s very hard for people who are picking them up very quickly to know what they’re looking at, and why. And so, we talked quite a bit about those challenges. Linked to that, we then moved on to think about something that came up in the plenary at the start, which is how we think about and visualize things that aren’t digitized, not necessarily just to again get anxious about what’s out there that we’re missing, but just so that we know what arguments we’re making, or how we’re constructing our arguments, based on what has been digitized. So there’s a project at the British Library that is starting to address this now, but it would be really great to have a map or an Atlas that maps out the shape of the historical press before we then start to figure out how the digitized press fits into that.
So we’ve talked about that a bit, and then yeah, we talked a bit about enriching metadata and how we might describe newspapers in new ways (or more useful ways), whether it’s in terms of the subgenres and texts within them or whether it’s about adding additional things like authors or, I don’t know, whether it’s their religious affiliation, politicization, all those things. And I suppose that ended up being a discussion about the incredibly difficult methodological and practical challenges of how we do that, both in terms of the scale of the press but also its slippery nature; the fact that it’s very, very difficult to describe things that feel like they might fall into one category but start drifting into others. And so, I suppose, we were thinking about that metaphor of the Atlas that you’ve used for the project, and how that seems to be describing like a very definite landscape that exists, and yet the press feels like something more, or harder, to grasp; less solid, less easy to describe in those definite terms. I think that was most of the things we talked about. Yeah, we covered a lot of different things there I think, but I think that those are the main points.

Great, thank you very much. Okay, group ten.

So, we were in the librarian and archival science group. Two top-level questions: what would useful contributions from academics and independent users look like in terms of metadata, archive structures and access? And what particular challenges face your collections? So, in terms of the three top headings, I suppose the main one, or one of the ones we looked at, was around funding challenges and the fact that a lot of the digitization work is project-based. It’s not business as usual in that sense. And then, associated, not only the fact it’s project-based but also who does it. So a lot of the people in the breakout room were saying that it’s outsourced to a commercial partner, or even if it’s done in-house, there’s still the hard physical material that still needs to be sent somewhere, either as microfilm or as hard drives, to other people to scan, or to process. Collaboration is crucial, not only between digitizers, librarians and researchers, but also tech developers and other interested stakeholders. There was discussion around how and when to involve researchers, which I guess is what we were asked, but the feeling was that that should be an iterative process: you shouldn’t really just involve researchers at the end of the process, because by then so many decisions are already embedded in the workflows and everything. Researchers need to be involved from the start, although linked to that there was a brief discussion near the end around the Living with Machines work that’s being done, which is described almost as a flip model. So, the researcher questions come first, but this obviously then opens up different challenges, because researchers keep changing their minds over what they want digitized.
And obviously that then has implications around time challenges, costs, scope, and everything associated with that, so whichever way you do it, whether it’s digitization first or research question first, there are difficulties and challenges: you solve one, then you cause others. Then the third one was around what I call legal challenges, things like copyright: “can I do this? If I can’t, when can I do it? Can I only do part of it, or can I digitize part of the material?” They’re the three headline ones.

Great, thank you very much. And our final group, group twelve.

So, we were also a librarian/archivist group, and we discussed some of the same points that were just mentioned about contributions from academics, and how we would work with them, and then challenges for our collections. So, one of the things we talked about a lot was the context—providing context about what’s in a digital newspaper collection, and what isn’t in it. And that’s critical for research and also computational analysis, and I would say particularly for students, when we’re starting to teach them how to do computational analysis and text mining. So, understanding that results are not comprehensive or representative, and so we would need to provide that information in context with the digital newspaper collection, like “this is a subset of how many newspapers existed and how many we digitized”. And then we also, like the other group, talked a lot about the importance of partnering with researchers as the collections are being developed, learning from them, what metadata they need for their work, and then the librarians and archivists working on creating that metadata with them, so that we understand what’s actually needed. And then similar challenges: we talked about funding sustainability being a problem, especially if you want to make robust collections where you’re not just digitizing but you’re also enhancing them in these other ways. We also talked about the difficulties in selecting what to digitize, and there were a couple different facets of that, but one was about what is actually available to us, what content has survived in archives that is even available to be digitized. So, we’re aware of a lot of other newspapers that existed but they don’t exist anymore, and unfortunately those are often voices of underrepresented communities that are missing, and so really actively working to try to remedy that.