The Digital Artifact in Library Collections
John Unsworth
Task Force on the Artifact in Library Collections
Council on Library and Information Resources
12/22/2000
Library Collections vs. Archives
“The functional chain of the different forms of linguistic representation is of great importance for those libraries and archives, whose duty it is to safeguard society's ability to recall its past. Whereas libraries are more concerned with keeping up knowledge and making it available, archives guarantee the availability of earlier experience. Both fields, however, employ the stabilisation and concrete form of the content of library and archive material in order to make it available.” [Menne-Haritz, “the effect of digital representation” in “The Intrinsic Value of Archive and Library Material”]
The two key terms in this passage are “stabilisation and concrete form.” For libraries, “concerned with keeping up knowledge and making it available,” stability and concrete form are important because they make long-term preservation and access economically feasible. For archives—the institutions that “guarantee the availability of earlier experience” not only in the sense that libraries do, but also in the sense of establishing the chain of evidence—stability and concrete form are also important because they are absolute requirements for proving the authenticity and provenance of unique documents. Of course, libraries also need to be able to assure the scholars who use their materials that those materials are what they purport to be, and that they can be used as evidence (albeit for scholarly arguments, rather than legal ones).
Our concern, in this document, is principally with library collections rather than archives—but many of the strategic issues and decisions provoked by the presence of digital artifacts are common to both. In fact, in dealing with the problem of digital artifacts, collections development personnel in libraries stand to benefit from considering one basic principle and one questionable opposition of long standing in the archival community.
The distinction between the evidence inherent in original forms and the information that can be abstracted from them—important to both libraries and archives, though perhaps in different ways—emerges from early preservation efforts:
“Among the many early archives and historical organizations that sought to preserve their materials by publishing and diffusing them, there developed a surprisingly modern distinction between the permanence of the archival documents themselves and the permanence of the information they contained. Initially, historical collections were valued principally for their information, information that testified to the ‘pastness of the past’ and thereby certified ‘the reality of progress.’ Only later did repositories come to value their collections as things worthy in their own right and, later still, as sources for specialized study by professional scholars.” (O’Toole, 16-17)
“Intrinsic value” is the name now used by archivists for the value of “things worthy in their own right,” and it also names the principle used in deciding when historical materials must be retained in their original form and when they can be represented with copies. The distinction is important in connection with digital artifacts in library collections not only because it would be relevant to deciding what can be represented by a digital surrogate rather than retained in its original form—or, less drastically, when a digital surrogate will suffice in place of access to an original still retained by the library—but also because it might help libraries to sort out what features of the born-digital artifact are unique to the original form of that artifact and what could be carried forward without loss into new incarnations. Intrinsic value, then,
“…is the archival term that is applied to permanently valuable records that have qualities and characteristics that make the records in their original physical form the only archivally acceptable form for preservation. Although all records in their original physical form have qualities and characteristics that would not be preserved in copies, records with intrinsic value have them to such a significant degree that the originals must be saved.”
The qualities and characteristics of records with intrinsic value are enumerated in this 1982 Staff Information Paper from the National Archives and Records Service as follows:
(“Intrinsic Value in Archival Material” 1)
The test of intrinsic value is that it can only be carried by the original document itself—even though, according to this statement of criteria, characteristics establishing a document’s intrinsic value may be either “physical or intellectual: that is, they may relate to the physical base of the record and the means by which information is recorded on it or they may relate to the information contained in the record.” So, although “intrinsic value” may be determined by something external to “the physical base of the record,” it is clear that the opposite term for ‘intrinsic value’ is not, as one might think, ‘extrinsic value’ but rather ‘fungible value’—in other words, a value which can be represented by a substitute form, or a value which can be transmitted without loss across forms. And finally, according to the same study, the judgment of this intrinsic value is, inevitably, a relative rather than an absolute evaluation:
“Records that possess any characteristic or quality of intrinsic value should be retained in their original form if possible. The concept of intrinsic value, therefore, is not relative. However, application of the concept of intrinsic value is relative; opinions concerning whether records have intrinsic value may vary from archivist to archivist and from one generation of archivists to another. Professional archival judgment, therefore, must be exercised in all decisions concerning intrinsic value.” (“Intrinsic Value in Archival Material” 2-3)
Almost by definition, of course, library collections do not consist of materials that have intrinsic value, in archival terms—the books, journals, and newspapers in the stacks and in circulation are not unique (many other libraries have copies of the same items), their physical form is not especially important (usually physical form is not even intact, as items are routinely rebound so they can stand up to heavy use), they are not so old as to be especially valuable, and so on. But as Nicholson Baker’s recent diatribe (“Deadline”) on the de-accessioning of newspaper collections demonstrates, the question of whether intrinsic value inheres in such materials is open to debate, and therefore, the criteria listed above are relevant for at least two reasons. First, as we begin to digitize library collections, we need to consider whether there are, in fact, important features or uses of those collections that are not fungible—otherwise we can never be sure when it will be acceptable to replace physical collections with digital surrogates, nor even what uses of the physical objects might be replaced by uses of a digital surrogate. Second, even though libraries may not bear the burden of legal proof that archives do, it is still important to scholars to know the source, provenance, and authority of the evidence they find in libraries. Moreover, as libraries begin to acquire artifacts that are originally digital, the criteria of intrinsic value will help us to understand where these artifacts need special authentication and documentation, where our technologies for producing, collecting, and transmitting them need further development—they might even help us to understand when the digital artifact is not simply a collection of bits, but also must include (and not merely emulate) original hardware and software (with all the practical difficulties and complex dependencies that such an object would entail).
The questionable distinction (and sometime opposition) between preservation and access may also be of some use in understanding how to deal with digital artifacts in library collections. This opposition—based on the commonsensical observation that there is an inverse relationship between use and longevity—is an old one:
“One historical society in Ohio in the 1840s announced its intention ‘to preserve the manuscripts of the present day to the remotest ages of posterity,’ adding almost theologically, ‘or at least . . . as near forever as the power and sagacity of man will effect.’ To accomplish this it proposed to store its manuscript and archival holding in ‘air-tight metallic cases, regularly numbered and indexed, so that it may be known what is contained in each case without opening it.’” (O’Toole, 15)
Even now, there are clear cases where cold storage (literal or figurative) is the best way to preserve an artifact with intrinsic value—photographic negatives, for example, or rare manuscripts. In the case of born-digital artifacts, though, another interesting possibility or principle suggests itself, namely preservation through handling:
“[E]lectronic resources require funding, skill, and ongoing commitment. Those that are intended for permanent use, moreover, will almost certainly require repeated intervention to ensure that they remain viable as technologies evolve.” (Hazen)
Precisely because the technologies used to encode, display, and enact digital information are changing so rapidly, the digital artifact that goes untouched for ten or twenty years may well be unrecoverable—because its storage medium requires hardware that no longer exists, or the software used to create it can no longer be found, or the operating system under which that software ran is obsolete, and so on. Even if none of those problems prevent access to the artifact, there is still the problem of the ephemerality of the artifact itself: as O’Toole points out, “by almost any standard, virtually all of the newer means of recording information, though more flexible, are less permanent than older ones” (25).
There’s a good deal of literature on the need to refresh and migrate digital information in order to keep it available for use (see in particular “Preserving Digital Information: Report of the Task Force on Archiving of Digital Information”). Migration—the process of moving digital information out of old formats and into new ones—raises all the core issues of intrinsic value, and there have been some interesting responses to that problem (e.g., Clifford Lynch’s “Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information”).
But whether or not we believe that digital artifacts have intrinsic value, and whether or not we believe—if they do—that this value can be retained as we migrate them across generations of carrier media and data formats, it is still clear that, where the value in digital artifacts is fungible, frequent handling may be the best assurance that this value will survive.
In library collections, one finds two kinds of digital artifacts: originals and surrogates.
Digital originals are less common, and the term itself is more problematic, for although “original” implies uniqueness, digital information is, by its nature, perfectly replicable. Nevertheless, it is clear that library collections now contain materials that come into being in digital form and are not surrogates for physical objects. This category of materials would include things like the Inter-university Consortium for Political and Social Research (ICPSR) data sets (which collectively comprise the world's largest archive of computerized social science data), or the Environmental Systems Research Institute (ESRI) data and maps (a detailed set of data for the US and the world, including census boundaries and major transportation features), or—increasingly—thematic research collections (e.g., Ed Ayers’ Valley of the Shadow civil war site or Columbia International Affairs Online) created by scholars and/or publishers.
Digital surrogates stand for original artifacts of some kind—usually, at this point, artifacts that concurrently exist, or previously existed, in another form (typically, but not necessarily, physical), either as part of the library’s collection or as part of some other collection. Digital page images are a very common example of digital surrogates: audio and video digitized from analog sources would be other examples. But we can also already point to digital surrogates for digital originals—for example, when texts marked up in the Text Encoding Initiative (TEI) Document Type Definition (DTD) are transformed into, and delivered in, the HTML (HyperText Markup Language) DTD, or when GIS (Geographic Information Systems) data are delivered as JPEG (Joint Photographic Experts Group) images.
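The TEI-to-HTML case can be made concrete with a minimal sketch. The fragment below is a simplified, namespace-free stand-in for TEI markup, and the tag mapping is a hypothetical one chosen only for illustration; it is not the stylesheet any particular library uses. It shows why the HTML delivered to readers is a surrogate rather than a copy: two distinct descriptive elements collapse into one presentational element, and the distinction cannot be recovered from the HTML alone.

```python
import xml.etree.ElementTree as ET

# Simplified TEI-like fragment (assumption: no namespace, no header).
TEI_FRAGMENT = (
    "<p>Letter from <persName>Thomas Jefferson</persName> to "
    "<persName>Ebenezer Hazard</persName>, written at "
    "<placeName>Philadelphia</placeName>.</p>"
)

# Hypothetical mapping: personal and place names both become italics.
TAG_MAP = {"p": "p", "persName": "i", "placeName": "i"}


def to_html(tei_xml: str) -> str:
    """Rename each element to its presentational HTML counterpart."""
    root = ET.fromstring(tei_xml)
    for element in root.iter():
        # Unknown elements fall back to a neutral <span>.
        element.tag = TAG_MAP.get(element.tag, "span")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_html(TEI_FRAGMENT))
```

In the output, the reader can no longer tell which italicized strings were encoded as people and which as places, which is the sense in which the delivered form stands for, but does not replace, the encoded original.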
Digital Originals
When considering artifacts that are originally digital, the first and possibly the most difficult question is, “what is the artifact?” What information or value inheres in the carrier medium? Is the equipment originally used for display part of the digital artifact? Does the software that presents and actualizes the data qualify as a constituent element of the artifact as well? Thinking again of the criteria for determining intrinsic value in an artifact—as a way of understanding what the features of an artifact might be—we can see a number of very practical ways in which these vexed questions might surface:
Physical form that may be the subject for study if the records provide meaningful documentation or significant examples of the form: For example, the layout of a form used for collecting data on the Web (software) might have a good deal to tell us about otherwise inexplicable aberrations, omissions, or misconstructions in the data collected.
Aesthetic or artistic quality: A member of this task force writes: “I may wish to read or use information in a form that recreates the context in which it was originally created or used. This will mean running a program on originally specified system hardware and software. Members of the committee are welcome to visit my office and play a really awful all-ASCII version of PacMan on my 1984 Kaypro II: the experience cannot be replicated otherwise” (Jim O’Donnell). This is asserted in spite of the existence (at http://www.yoy.org/kaypro/) of software emulation (in Java) that allows one to run Kaypro (Z-80/CPM-based) software on contemporary (Intel/Windows-based) systems. Presumably, the physical attributes of the Kaypro’s Keytronic keyboard, its 9-inch, 80-character by 24-line green phosphor display screen, its clunking 5.25” 191K disk drives, etc., are part of a genuine aesthetic experience. In fact, this same example—the complex artifact that is the Kaypro II plus its operating system and applications software—could probably also serve to illustrate the criteria of unique or curious physical features (this 26-pound computer was an early (1983) example of the “portable” computer, designed originally for field use by engineers, with no sound or graphics capability—hence the curiosity of an ASCII PacMan) as well as age that provides a quality of uniqueness and value for use in exhibits (given the rapidity with which “generations” of computer technology pass, and the average user’s tendency to discard outmoded equipment).
An exhaustive set of examples is not necessary here; these few suffice to show that the relevant features of a digital artifact could include more than the fungible information contained in an electronic file. Arguably, though, the Kaypro II belongs in an archive, not a library, for precisely the reasons just given. For the library, it might be enough to have a photorealistic 3D virtual representation of the Kaypro, and perhaps an emulation package that allows a researcher to run the original CPM code on current hardware. And though the Kaypro example may seem facetious, these are in fact the sort of questions that library collections development staff will be asking themselves as they grapple with the problem of collecting born-digital information.
In many cases, the answer for libraries will be that they are, in fact, primarily concerned with collecting, preserving, and providing access to the fungible informational content of digital objects. In that case, the “preservation through handling” scheme is a likely winner: digital information that is frequently used by library patrons undoubtedly stands a better chance of being migrated and refreshed, and therefore is more likely to continue to be available in future generations, compared with little-used digital information. Indeed, migration may turn out to be a much more frequently recurring problem than refreshing, because “today’s optical media most likely will far outlast the capability of systems to retrieve and interpret the data stored on them” (Conway in Handbook for Digital Projects). That’s bad news for libraries, since it is easier and cheaper to refresh than to migrate. If libraries have reason to be hopeful, in this regard, it lies in open, non-proprietary, standards such as JPEG for images, MPEG for video, and SGML, XML, and XSL for textual data. There are still important data types for which no such standards exist (GIS data, for example), but the trend over the last twenty years—accelerated significantly in the last decade by the advent of the World-Wide Web—has been in the direction of support for non-proprietary standards, even in proprietary software. The furthest extreme of standards-based optimism, in this regard, is represented by a recent contribution to D-Lib Magazine by a group of supercomputing scientists (Reagan Moore et al.), who believe that we can devise “an approach for maintaining digital data for hundreds of years through development of an environment that supports migration of collections onto new software systems.” (“Collection-Based Persistent Digital Archives - Part 1”).
Even those not willing to go quite so far in predicting success might look seriously at another promising strategy, called LOCKSS (Lots of Copies Keep Stuff Safe). The basic principle of LOCKSS is preservation through proliferation. LOCKSS is
“a network of PCs based at libraries around the world and designed to preserve access to scientific journals that are published on the web. The computers organise polls among themselves to find out whether files on their hard disks have been corrupted or altered, and replace them with intact copies if necessary” (“Here, There and Everywhere,” The Economist, June 24th, 2000).
This is actually an old idea, one that goes back to the 18th century:
“I learn with great satisfaction that you are about committing to the press the valuable historical and State papers you have been so long collecting. Time and accident are committing daily havoc on the originals deposited in our public offices. The late war has done the work of centuries in this business. The lost cannot be recovered, but let us save what remains; not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.” (Thomas Jefferson to Ebenezer Hazard, Philadelphia, February 18, 1791, in Jefferson’s Letters).
And though, with manuscript originals, the “preservation through proliferation” strategy assumes that a surrogate is adequate (that the value of the artifact is fungible, and that it is the information, and not the artifact itself, that one wants to preserve), with digital originals, one might well use the same strategy to preserve digital artifacts in their original form (while recognizing that a million perfect copies will still stand mute if we no longer understand how to read them).
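The polling behavior described in the Economist’s account of LOCKSS can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the LOCKSS protocol itself: the function name, the use of SHA-256 digests, and the simple-majority rule are all hypothetical choices made for the example. Each peer’s copy is reduced to a digest, the digests are tallied, and any copy that disagrees with the majority is replaced by an intact one.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Fingerprint a copy so that copies can be compared cheaply."""
    return hashlib.sha256(content).hexdigest()


def poll_and_repair(copies: dict[str, bytes]) -> dict[str, bytes]:
    """Return a mapping in which every peer holds the majority version.

    `copies` maps a (hypothetical) peer name to the bytes that peer holds.
    Raises ValueError when no version is held by a strict majority.
    """
    if not copies:
        return {}
    votes = Counter(digest(content) for content in copies.values())
    winner, count = votes.most_common(1)[0]
    if count <= len(copies) // 2:
        raise ValueError("no majority; cannot tell which copy is intact")
    # Any copy whose digest matches the winner serves as the repair source.
    intact = next(c for c in copies.values() if digest(c) == winner)
    return {
        peer: (content if digest(content) == winner else intact)
        for peer, content in copies.items()
    }
```

With three libraries holding a file, one of them corrupted, the two agreeing copies outvote the third and the damaged copy is overwritten. The sketch also makes the limit of the strategy visible: agreement among copies says nothing about whether anyone can still interpret the bytes.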
Digital Surrogates
The most common kind of digital artifact in library collections today is the digital surrogate for a physical artifact. For that reason, the most important questions about digital artifacts, at the moment, are questions having to do with this kind of digital artifact.
Chief among those questions are:
· When can a digital surrogate stand in for its source?
· When can a digital surrogate replace its source?
· When might a digital surrogate be superior to its source?
· What is the cost of producing and maintaining digital surrogates?
· What risks do digital surrogates pose?
Many of these questions are treated in some detail in Paul Conway’s contribution to a recent publication of the Northeast Document Conservation Center, the Handbook for Digital Projects: A Management Tool for Preservation and Access:
“The Preservation Purposes of the Digital Product [include efforts to] …. Protect Originals …. Represent Originals …. [and] Transcend Originals…. In a very small but increasing number of applications, digital imaging holds the promise of generating a product that can be used for purposes that are impossible to achieve with the original sources. This category includes imaging that uses special lighting to draw out details obscured by age, use, and environmental damage; imaging that makes use of specialized photographic intermediates; or imaging of such high resolution that the study of artifactual characteristics is possible.” (“Overview: Rationale for Digitization and Preservation”)
Conway makes clear the promise of the digital surrogate. The risk posed by these surrogates is presented in Menne-Haritz et al., “The Intrinsic Value of Archive and Library Material”:
“the loss of testimony is endangered, not only through . . . physical degeneration . . . but also through the unconscious destruction of evidence as to the context and circumstances of their origin, which can occur during their conversion and must therefore be prevented by a previous analysis of . . . intrinsic value.”
The problem to which Menne-Haritz adverts is not unique to digital surrogates, by any means: bad editions in printed form pose the same threat, and indeed the early history of printing is, in part, a history of the loss or destruction of manuscript materials “replaced” by printed versions, the sources for which are now both undocumented and unrecoverable. In any case, this German archivist presents the most reductive view of the value of digital surrogates, saying
“The loss of evidential value and permanent accessibility inherent in digital forms and textual conversion [by OCR] exclude them as a preservation medium. They can only be employed in addition to preservation on film in order to increase the ease of use,” (“The necessity of criteria for conversion procedures” in “The Intrinsic Value of Archive and Library Material”)
and, at another point, flatly stating that:
“digital imaging is not suitable for permanent storage.” (“Imaging” in “The Intrinsic Value of Archive and Library Material.”)
Having considered this quaternion of conditions, let us revisit the questions with which we opened this discussion of digital surrogates, and try now to provide some answers to those questions:
· When can a digital surrogate stand in for its source?
When it answers to the needs of users.
· When can a digital surrogate replace its source?
If the source is not rare.
· When might a digital surrogate be superior to its source?
In cases where remote or simultaneous access to the object is required, or when software provides tools that allow something more or different than physical examination. When the record of the digital surrogate finds its way into indexes and search engines that would never find the physical original.
· What is the cost of producing and maintaining digital surrogates?
The cost of producing digital surrogates probably depends on the uniformity, disposability, and legibility of the original. The cost of maintenance depends on frequency of use and the idiosyncrasy of format, but beyond that it depends on technological, social, and institutional factors that are difficult or impossible to predict, which is an important reason for being cautious when one chooses to replace a physical object (the maintenance costs for which are known) with a digital surrogate (the maintenance costs for which are, to some extent, unknown).
· What risks do digital surrogates pose?
The principal risk posed by digital surrogates is the risk of disposing of an imperfectly represented original because one believes the digital surrogate to be a perfect substitute for it. Digital surrogates also pose the risk of providing a partial view (of an object) that seems to be complete, and the risk of decontextualization: the possibility that the digital surrogate will become detached from some context that is important to understanding what it is, and will be received and understood in the absence of that context.
References:
Ayers, Edward L. et al. The Valley of the Shadow. http://www.iath.virginia.edu/vshadow2/
Baker, Nicholson. “Deadline: A desperate plea to stop the trashing of America's historic newspapers.” The New Yorker, July 24, 2000, 42-61.
Columbia International Affairs Online. http://www.ciaonet.org/
Hazen, Dan, Jeffrey Horrell, and Jan Merrill-Oldham. Selecting Research Collections for Digitization. CLIR publication, August 1998: http://www.clir.org/pubs/reports/hazen/pub74.html
“Intrinsic Value in Archival Material.” Staff Information Paper 21. National Archives and Records Service. General Services Administration: Washington, 1982.
Jefferson, Thomas. Letters, ed. Merrill D. Peterson, Literary Classics of the United States, Library of America Series, New York, 1984.
Lyman, Peter and Brewster Kahle. “Archiving Digital Cultural Artifacts: Organizing an Agenda for Action.” D-Lib Magazine (July/August 1998): http://www.dlib.org/dlib/july98/07lyman.html
Lynch, Clifford. “Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information.” D-Lib Magazine Vol. 5 No. 9 (September 1999): http://www.dlib.org/dlib/september99/09lynch.html
Menne-Haritz, Angelika and Nils Brübach. “The Intrinsic Value of Archive and Library Material.” Digitale Texte der Archivschule Marburg Nr. 5: http://www.uni-marburg.de/archivschule/intrinsengl.html
Moore, Reagan, with Chaitan Baru, Arcot Rajasekar, Bertram Ludaescher, Richard Marciano, Michael Wan, Wayne Schroeder, and Amarnath Gupta. “Collection-Based Persistent Digital Archives.” D-Lib Magazine Vol. 6 No. 3 (March 2000) and Vol. 6 No. 4 (April 2000): http://www.dlib.org/dlib/march00/moore/03moore-pt1.html and http://www.dlib.org/dlib/april00/moore/04moore-pt2.html
O’Toole, James M. “On the Idea of Permanence.” American Archivist Vol. 52 (Winter 1989): 10-25.
“Preserving Digital Information: Report of the Task Force on Archiving of Digital Information.” Commissioned by The Commission on Preservation and Access and The Research Libraries Group, Inc. May 1, 1996. http://www.rlg.org/ArchTF/
Rosenthal, David S. H. and Vicky Reich. “Permanent Publishing on the Web: LOCKSS (Lots of Copies Keep Stuff Safe).” http://lockss.stanford.edu/
Sitts, Maxine K., ed. Handbook for Digital Projects: A Management Tool for Preservation and Access. First Edition. Northeast Document Conservation Center, Andover, Massachusetts, 2000: http://www.nedcc.org/digital/dighome.htm