Scholarly Research Infrastructure

John Unsworth
April 18, 2015

Remarks for the closing roundtable at "Scholarly Networks and the Emerging Platforms for Humanities Research and Publication," an international colloquium co-sponsored by The Virtual Humanities Lab and the Center for Digital Scholarship at Brown University, and DARIAH-Italy.

I'm sorry I missed the conversations yesterday: clearly, this has been a very interesting colloquium--and I'd like to kick off this final "round-table" discussion by responding to some of the conversations I did hear, today. You probably won't be surprised to learn that I heard those conversations through the filter of my own obsessions, experiences, and commitments.

Guyda Armstrong and Marilyn Deegan's discussion of the "academic book of the future," the Crossick Report, library publishing, and related open-access efforts was reminiscent in some ways of conversations in Canada around INKE or Synergies--research infrastructure. The topic of the book in/of the future also calls to mind the future of our existing monograph collections, the interactions of that future with other elements in the ecosystem--including ebooks, pressures on library space and library budgets, consortial resource sharing, and the like.

This, for me, brings to mind the EAST project, a shared print monograph retention project, which is in turn related to WEST and HathiTrust, and its work on a monograph repository. And in general, while we may think of copyright as it restricts access to new academic publishing, we should also remember that it significantly constrains our access to what's already in the library--essentially, restricts access for anything published in the 20th century, or the 21st.

Also, in the Q&A, I heard discussion of the recently Mellon-funded scholarly publishing project at Brown--a conversation that reminded me that in some ways we haven't come that far in the last fifteen years (see Inside UVA article on Jerome McGann publishing the Rossetti Archive with the University of Michigan Press in 2000--a project that died on the vine, but led to NINES).

So, what's the role of publishers, and of libraries, not only with respect to new books, but with respect to the ones already on our shelves?

One point that needs to be made at the outset, I think, is that Google and to a lesser extent some commercial publishers have accelerated digitization in a way that libraries, and even the Internet Archive, on their own, were never going to accomplish--or at least, not in my lifetime. We deceive ourselves if we think that the work of digitizing those billions of pages was ever a given, or could have happened at no cost.

As those digital resources have proliferated, we've struggled with access, all along the way. Andy Land's talk on the University of Manchester's study of its own digital library services, and the services provided by others and used by the university library's clientele, reminded us all of the importance of delivering access to digital information in a more unified way, implying a more unified infrastructure.

This, in turn, reminds me that some things have not changed all that much since the BrightPlanet report in 2000, which provided evidence that the "Deep Web" was 500 times larger than the publicly accessible web. What shapes the library landscape? What gives value to library search? Access to some part of that Deep Web--licensed databases. And what of the time of those librarians who used to buy and catalog physical resources? It's now spent untangling thickets of conflicting license requirements, overlapping holdings in licensed bundles, and new uses not envisioned by the publishers who license those collections.

I want to suggest one scenario that casts a rather different light on the role of both libraries and publishers in serving the research needs of scholars, going forward. In fact, I believe this scenario has the potential to significantly realign the relationships among scholars, libraries, and publishers--at least in some cases.

So, let's start with a problem we don't usually think of: how does copyright constrain publishers? The short answer, I think, is that it consigns their content to walled gardens--and since scholarship is never interested in only one such garden, it limits the utility of their content, especially when scholarly work requires aggregation of data. This problem becomes more starkly illuminated as researchers begin to ask for kinds of access that go beyond searching and browsing--specifically, data mining.

Publishers are keenly aware of this growing interest, and they are responding by trying to build out new research services to meet those needs. But the content that underlies those services is only going to be the content provided by the publisher who provides those services. So, ProQuest, for example, can provide Statistical Abstracts of the World, covering comparative economic and demographic data for "over 60 countries," and they can provide data-mining services on top of that content (for paying users).

What no publisher research platform can do, though, is to actually meet the research needs of any one economic researcher--say, Michelle Alexopoulos, from the University of Toronto, whose opening keynote at the recent HTRC UnCamp focused on the diffusion of new technical knowledge with economic impact, as demonstrated in the text of books and newspapers collected in libraries during and after the introduction of a particular patented innovation, for example, penicillin.

It's worse than that, actually: when they try to meet the data needs of researchers, publishers end up interrupting their own normal workflow to gather (often by hand) resources that they will convey to researchers, with no real security, with no visibility into what researchers do with that content, and with no faith that the researcher will live up to her end of the bargain and destroy the dataset when the research is complete. No one really expects that to happen, in part because the researcher will have further curated the data in ways that add value--a value that cannot be recaptured and shared, either with the publisher or with other researchers. All because of copyright.

In other words, I believe that publishers actually need to find a sustainable way to participate in an analytic research commons--which will need to be secure, because of the considerable economic value of the material they (and others) would contribute to that commons. And it's not just about text-mining: annotation (which I once called a scholarly primitive--a basic building block of scholarship) is another great example, described in Andy and Eli's talk: in research with digital content, annotation is an overlay service the value of which is proportional to the scope of content across which it can be executed.

Another way to flip the question is to ask "When is copyright the scholar's friend?" Not the author's friend: the scholar's friend. I would argue it is the scholar's friend if the constraints it imposes on publishers drive those publishers to participate in a secure analytic research commons, but also, more to the point, copyright is the scholar's friend when it creates what Jerry McGann used to call a Gravity Field--in this case, a gravity that exerts a centripetal force on research computing, driving it to the commons, where code can be shared, where research products (worksets, improved data, etc.) can be shared, and where data communities (like, say, a community interested in the early Caribbean) can form around the data that interests them.

I leave it there, with the closing observation that the Secure HathiTrust Analytic Research Commons (SHARC) is what Stephen and I and others are trying to create now, and no, I don't think it is one ring to rule them all, nor do I think it will solve all problems, or enable every kind of research or scholarship, but I do think it will help us to address some kinds of problems that libraries, publishers, and scholars, working separately, have for years been unable to solve. And I would say that the HathiTrust, and the HTRC, do represent a shifting of the balance of power toward universities, in the way that John Cayley argued for just now: there really is an elephant in the library, and we are it.