The 2005 Lyman Award Lecture
November 11, 2005
National Humanities Center
Research Triangle Park, NC
Thanks to the National Humanities Center for hosting this event and the Richard W. Lyman Award, to the Lyman Award selection committee for honoring me with the award, and to the Rockefeller Foundation for funding it. It's a pleasure to be here, not least because I began my academic career at NC State, and I have many friends there, and at Chapel Hill and Duke as well.
A few years back, when I was a member of the English Department at the University of Virginia, I was talking to a non-academic neighbor about doing research: "What--" he asked, "you discover new words?" Of course, my neighbor was thinking of "research" as something "scientific"--and had he consulted the OED, it would have supported him with a definition of research as "a search or investigation directed to the discovery of some fact by careful consideration or study of a subject; a course of critical or scientific inquiry." To tell the truth, even my colleagues in the English Department would probably have been more comfortable, on the whole, with "criticism" or "scholarship" than with "research" as the label for what we did when we were not teaching or doing committee work, unless one were talking specifically about "archival research" or possibly "library research."
Explaining how words like "Method" and "Research" apply to the humanities requires some retrospection, so before we look to the "new methods for humanities research," allow me to look backward for a moment, first twenty years or so, to when I was in graduate school, and then another thirty years or so before that, when postwar graduate education--with its professionalization of literary study--was taking shape, in the shadow of big science.
In the 1980s, when I was going through graduate school, we certainly didn't consider ourselves to be engaged in any sort of "investigation directed to the discovery of . . . fact"--perhaps in history one might undertake the discovery of fact, or possibly in textual editing, but certainly not in literary criticism, not in a postmodern era. The very concept of "fact" itself had been pretty well debunked, for purposes of literary study, by the '80s, and science was simply another ideology in the service of the state. Stanley Fish and Wolfgang Iser had reauthorized the affective fallacy, Derrida's Grammatology taught us that in writing there was "nothing outside the text," Lyotard recommended "incredulity toward metanarratives," and the decentered self had resigned itself to the endless deferral of truth, in the desert of the real.
The other word in my title, "method," raises some issues of its own. A method is a procedure, or sometimes more specifically (as in French) a "system of classification, [a] disposition of materials according to a plan or design" (OED). In the 1980s, in graduate school (and in job interviews), one sometimes faced the daunting question "what's your methodology?" Usually, what that meant was "what's your theoretical bent: what theoretical flag do you fly?" There was an older sense of methodology still in force, though: dissertations still sometimes had chapters on methodology, and graduate programs in English were wrestling with whether or not to discard requirements for coursework in research methods (which essentially meant bibliography, sometimes with library research methods included). Most departments eventually did do away with this requirement, and by the 1990s, "research" seemed to happen mostly without attention to method.
Yet research and method are connected, logically, because systematic and organized research proceeds according to some method, some plan or design. In his 1951 Kenyon Review essay called "The Archetypes of Literature," Northrop Frye talked about this:
"Every organized body of knowledge can be learned progressively; and experience shows that there is also something progressive about the learning of literature. . . . Physics is an organized body of knowledge about nature, and a student of it says that he is learning physics, not that he is learning nature. Art, like nature, is the subject of a systematic study, and has to be distinguished from the study itself, which is criticism. . . . So, while no one expects literature itself to behave like a science, there is surely no reason why criticism, as a systematic and organized study, should not be, at least partly, a science. . . . Criticism deals with the arts and may well be something of an art itself, but it does not follow that it must be unsystematic."
Systematic (methodical) thinking was, in Frye's view, what separated criticism from commentary: "commentators have little sense, unlike researchers, of being contained within some sort of scientific discipline: they are chiefly engaged, in the words of the gospel hymn, in brightening the corner where they are."
For Frye, the use of a method in pursuit of progress toward a goal was also what separated meaningful criticism from "the literary chitchat which makes the reputations of poets boom and crash in an imaginary stock-exchange . . . . That wealthy investor, Mr. Eliot, after dumping Milton on the market, is now buying him again; Donne has probably reached his peak and will begin to taper off; Tennyson may be in for a slight flutter but the Shelley stocks are still bearish. This sort of thing cannot be part of any systematic study," Frye maintained, "for a systematic study can only progress: whatever dithers or vacillates or reacts is merely leisure-class conversation."
Frye's enthusiasm for the systematic, though, is probably responsible for the downturn in his own critical fortunes during the later decades of the 20th century, when the fashion in literary criticism favored paradox, metaphor, and (in spite of the systematic basis of post-structuralism) a fairly high level of idiosyncrasy and the foregrounding of persona over logic. Oscar Wilde, who thought criticism was the only civilized form of autobiography, might have approved: Frye did not, and in 1984, in a PMLA essay called "Literary and Linguistic Scholarship in a Postliterate World," he remarked disparagingly that "it has...become generally accepted that criticism is not a parasitic growth on literature but a special form of literary language." In the end, for Frye, the insistence on the primacy of method obscured the real goal of criticism (to be "interested in literature itself and in what it does or can do for people") and methodology, turned in on itself, became part of the problem: in that 1984 essay, he wrote that "critical theory today has relapsed into a confused and claustrophobic battle of methodologies, where, as in Fortinbras' campaign in Hamlet, the ground fought over is hardly big enough to hold the contending armies."
1984 was perhaps a low point for both "research" and "method" in the humanities, but research did survive--perhaps perpetuated as some kind of guilty pleasure--and today it takes place quite openly, here at the National Humanities Center, as attested by this description of the Center's activities, on its web site:
"The Center annually admits forty fellows, who represent a broad range of ages, disciplines, and home institutions. Individually, the fellows pursue their own research and writing. Together, they create a stimulating community of intellectual discourse. Interdisciplinary seminars on topics of mutual interest provide a context in which fellows share fresh insights and thoughtful criticism. The most tangible result of the fellows' work is the publication of nearly a thousand books since the Center opened."
That's probably a fairly good use-example of the term "research" as it now applies, and has usually applied, in the humanities: it refers to the work of an individual, work that is preparatory to writing, work that results in the publication of a book. Researchers may gather to share insights and critique, but research itself is a solitary enterprise: as the same web page goes on to say, "Each fellow has a private study, appropriately furnished for reading, writing, and reflection, overlooking the surrounding woods." Research in the humanities, then, is and has been an activity characterized by the four Rs: reading, writing, reflection, and rustication.
If these are the traditional research methods in the humanities, what will "new research methods" look like--and more importantly, why do we need them?
Perhaps in at least some cases, we need them because they offer better ways of accomplishing research goals that we have long pursued. So, for example, to stick with the Canadian critical archetype for just a moment longer, in 1989 Northrop Frye delivered a plenary address to a humanities computing conference in Toronto. According to Willard McCarty, who was there, Frye said that "if he were starting out to write Anatomy of Criticism now he would pay very close attention to computer modeling in pursuit of the 'recurring conventional units' of literature on which his life-work was based." Frye was probably a little optimistic about what computers could have done in 1989, but I think today we could actually deliver on the promise he recognized, and I'd like to spend the rest of this talk considering the ways in which new methods, enabled by information technology, can support humanities research--some new kinds of research, and some very familiar kinds of research. I'll talk about what we can now do and what we can't; what's end-usable and what requires expert intervention; what notions of the humanities--and of science--inform and sometimes distort our notion of research; and where we might really need to concentrate future graduate training, standards development, and tool-building, in order to realize the promise of these new methods for the core activities and future prospects of humanities research.
In the sciences generally, research is basic or applied. Basic research is motivated by curiosity rather than by a particular goal, and its outcomes tend to be theoretical rather than practical. Applied research usually grows out of basic research, and it usually has practical goals in view from the beginning. In reality, of course, the division between these two is not so neat, and much research could be described as one or the other, depending on the circumstances (in other words, depending on what's being funded).
If we consider humanities research in terms of the basic and the applied, some would say that all humanities research is basic research, because it never aims at having a practical application in the sense that, say, laboratory research on transistors in the 1940s aimed at building amplifiers for electrical signals. On the other hand, if understanding is a practical outcome, then you might just as easily argue that all humanities research is applied, in that it aims directly at producing a practical outcome, namely changing the way we understand that part of the human record it has in view. Probably the truth is that in the humanities, as in science, both are done: Frye's work on literary archetypes, or Freud's work on the human psyche, or Saussure's work on language, might best be considered basic research: this research is aimed at developing theoretical frameworks, rather than at applying those frameworks to particular objects of attention--even though particular objects are always in view as the theories are developed. In that sense, when we apply those theoretical frameworks to the understanding of particular texts, to illuminate the text rather than to alter or extend the theory, we're doing applied research. And again, of course, in the humanities as in science, we never really do only one or the other.
In a recent instant message exchange with Steve Ramsay, a colleague at the University of Georgia who is working on the NORA project (about which more in a moment), he asked:
What is a literary-critical "problem?" How is it different from a scientific "problem?" Consider the following scenario. Let's suppose that the NSF were to ask its funded physicists to report the achievements in physics for a given year. You can imagine what that list might look like. "We discovered the top quark. We achieved cold-fusion. We proved the existence of the Bose-Einstein Condensate." What if the NEH were to ask its literary critics the same question?
"Well," I argued, "that's because literary-critical 'problems' are not for solving. The object of the literary researcher is not to settle questions, but to open and explore them, whatever their rhetoric says to the contrary." Steve's response to that was that "Words like 'problem,' 'experiment,' 'fact,' 'truth,' and 'hypothesis' all mean something very different in a humanistic context than they do in the sciences." I replied: "I think we imagine science as being more scientific than it is." And I do think that--and, as Steve pointed out, so did Kuhn, Feyerabend, and Lyotard. In science, one doesn't prove a hypothesis, any more than one does in cultural studies: all you can do is offer a hypothesis that withstands being disproven, for some period of time, until contradictory evidence or a better account of the evidence comes along. For that matter, in Against Method, Feyerabend argued that whatever scientists might say about their adherence to methodological rules, there are no rules which they always use, and if they did adhere strictly to such rules, it would retard scientific progress. Scientific research--and the shift in ground truth during scientific revolutions--do, however, turn on evidence in a way that humanities research often does not, and science's self-correcting mechanisms are not so obviously present in the humanities. Still, the difference between scientific research and humanities research, between scientific methods and humanistic methods, may be a difference of degree rather than of kind.
Bill Wulf, president of the National Academy of Engineering, would agree, at least in the case of computer science: Bill has argued (in my hearing) that computer science should really be considered one of the humanities, since the humanities deal with artifacts produced by human beings, and computers (and their software) are artifacts produced by human beings. Harold Abelson, a professor of computer science at MIT, tells students in his CS 101 course (Structure and Interpretation of Computer Programs) that
"computer science" is not a science and . . . its significance has little to do with computers. The computer revolution is a revolution in the way we think and in the way we express what we think. The essence of this change is the emergence of what might best be called procedural epistemology--the study of the structure of knowledge from an imperative point of view, as opposed to the more declarative point of view taken by classical mathematical subjects. Mathematics provides a framework for dealing precisely with notions of "what is." Computation provides a framework for dealing precisely with notions of "how to."
In other words, computers are all about method, they are epistemological to the core, and they are made by human beings. All of these qualities make them objects as well as instruments of interpretation--a point that I'll return to, after we look at some of the ways these artifacts of procedural epistemology can be used in humanities research.
My first example of new research methods for the humanities--in fact, my first several examples--comes out of the nora project. Nora (which either refers to a character in a William Gibson novel, or is an acronym for "No One Remembers Acronyms," depending on who in the project you ask), is a two-year project funded (as so much work in digital libraries and digital humanities has been) by the Andrew W. Mellon Foundation. The project began last October, so we're about one year in, and although I'm not quite ready to show, tonight, there is a good deal already to tell. The goal of the nora project is to produce text-mining software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources from existing digital libraries and scholarly projects.
In search-and-retrieval, we bring specific queries to collections of text and get back (more or less useful) answers to those queries; by contrast, the goal of data-mining (including text-mining) is to produce new knowledge by exposing similarities or differences, clustering or dispersal, co-occurrence and trends. Over the last decade, many millions of dollars have been invested in creating digital library collections: at this point, terabytes of full-text humanities resources are publicly available on the web. Those collections, dispersed across many different institutions, are large enough and rich enough to provide an excellent opportunity for text-mining, and we believe that web-based text-mining tools will make those collections significantly more useful, more informative, and more rewarding for research and teaching. In this effort, we are building on data-mining expertise at the University of Illinois' Graduate School of Library and Information Science and on several years of software development work that has already been done in Michael Welge's Automated Learning Group at the University of Illinois' National Center for Supercomputing Applications (NCSA), developing the D2K (Data to Knowledge) software, a kind of visual programming environment for building data-mining applications.
In order to assemble the testbed for the text-mining tool development, we have negotiated agreements with a number of individual libraries, projects, and centers that hold large collections of full-text humanities resources. Our agreements aim at producing an aggregation that has some scholarly, intellectual, and subject coherence, and they focus on 19th-century British and American literary texts that have been generously contributed by libraries at the University of North Carolina at Chapel Hill, the University of Virginia, the University of California at Davis, the University of Michigan, Indiana University, and the Library of Congress. Other contributors include Brown University's Women Writers Project, The Perseus Project, and scholarly projects at the University of Virginia's Institute for Advanced Technology in the Humanities, including those on Whitman, Dickinson, Stowe, Rossetti, and Blake. These agreements have allowed us to create a testbed of about 10,000 literary texts in English, roughly 5 GB of machine-readable text, almost all of it marked up according to the Text Encoding Initiative Guidelines. This is a small amount of data, by comparison to what's out there in digital libraries, but it is large enough to be a meaningful testbed, and it does meet minimum requirements of intellectual coherence.
This is a profoundly collaborative project, and very different from the solitary work that we were talking about before as being the norm in humanities research: the participants are from four universities in addition to Illinois, each site with multiple individuals, and in most cases, multiple disciplines represented as well. At Illinois, where the focus is on the data-mining itself, the work is done by two highly competent graduate students in Library and Information Science, Bei Yu and Xin Xiang, who politely allow me to muddy the waters in weekly meetings, and then proceed to get sensible things accomplished in spite of that. They also work with Loretta Auvil and others at NCSA, where the focus is properly described as engineering and applied computer science. At Maryland, Matt Kirschenbaum and Martha Nell Smith, from the English Department and the Maryland Institute for Technology in the Humanities, and Catherine Plaisant, a computer scientist at the Human Computer Interaction Lab, work with another great group of students, including Tanya Clement and Greg Lord in English and James Rose, in Computer Science. Their work focuses on visualization, and Stan Ruecker, recently added to the project from the University of Alberta, works on interface design, along with one of his students, Ximena Rossello. Tom Horton, from Computer Science at the University of Virginia, works on software design and overall architecture, with staff at the Institute for Advanced Technology in the Humanities and with graduate students Kristen Taylor (English) and Ben Taitelbaum (Computer Science). Finally, at the University of Georgia, Steve Ramsay (faculty in English) works with his graduate student Sara Steger on developing Tamarind, the xml data-management system that supports the project's need to query large xml collections for quantitative information in real time.
That's seventeen people, on one project: seven faculty members, one NCSA staff person, and nine graduate students. And I've probably left someone out. The project is divided up fairly neatly, into the data-mining, interface/visualization, data support, and architecture, but it is a real challenge to do this kind of thing, on many levels. First, simply coordinating lots of people is difficult. I think we had a breakthrough on that front when we arrived at the point where the tasks were sufficiently well defined and the goals sufficiently clear that faculty could get out of the way and let graduate students work directly with each other. Second, we're building something that none of us (or anyone else) has ever seen before, so a large part of the problem is figuring out exactly what it is supposed to be and how it is supposed to work. Third, each time we try something new, it has ramifications across the whole system, and sometimes that means that we have to stop and tear something apart and rebuild it, before we can move on to the next step.
There are many more challenges than I'll mention tonight, but perhaps the greatest challenge, at the outset and still today, has been in figuring out exactly what data-mining really has to offer literary research, at a level more specific than the cleverly non-specific generalities I offered in my opening description of nora ("software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources"). What patterns would be of interest to literary scholars? Can we distinguish between patterns that are, for example, characteristic of the English language, and those that are characteristic of a particular author, work, topic, or time? Can we extract patterns that are based in things like plot, or syntax? Or can we just find patterns of words? When is a correlation meaningful, and when is it coincidental? What does it mean to be "coincidental"? How do we train software to focus on the features that are of interest to researchers, and can that training interface be usable for people who don't like numbers and do like to read? Can we structure an interface that is sufficiently generalized that it can accommodate interest in many different kinds of features, without knowing in advance what they will be? What are meaningful visualizations, and how do we allow them to instruct their users on their use, while provoking an appropriate suspicion of what they appear to convey? How would we evaluate the effectiveness of our visualizations, or the software in general? Is it succeeding if it surprises us with its results, or if it doesn't? How can we make visualizations function as interfaces, in an iterative process that allows the user to explore and tinker? And how in the hell can we do all this in real time on the web, when a modest subset of our collection, like the novels of a single author, contains millions of datapoints, all of which need to be sifted for these patterns?
It takes many different kinds of expertise, and many hands, even to bring the epistemological elements of all this into focus, and as much again to work out the procedural details involved in actually building something that would allow a researcher to look for patterns across large collections.
As a part of the nora experiment, we're going to try to use text-analysis techniques to answer some of these questions--if not empirically, at least with some combination of evidence and subjective analysis. To that end, at my suggestion, Bei Yu has been analyzing literary criticism (from online journals in Project Muse) and comparing it to normal usage (as represented in the American National Corpus, a just-released collection of about 10 million words from the New York Times, Slate, and other such sources), and to journal and conference literature from the knowledge discovery domain (in other words, from data-mining). What we're looking for are words common in literary criticism and data-mining, but not common in the New York Times. The theory is that this will provide at least a start on figuring out an answer to the question "what do literary scholars already do, that data-mining can support?" So far, there have been some interesting results.
In the first stage of this research, Bei generated lists of relatively unusual verbs from Muse journal articles: she then asked a literary scholar (me) to identify some that seemed to indicate critical behaviors that might be characteristic of literary scholarship. Obligingly, I did so, and picked out words like "destabilizes, annotates, juxtapose, evaluates." Then she ran this list against the American National Corpus, and found that none of the words I'd picked were actually unique to literary criticism or even much more common in literary criticism than in normal usage. However, comparing her whole set of journal articles to the ANC, she found quite a few verbs that were unique--for example, "narrating, obviate, misreading, desiring, totalizing, mediating." Her conclusion, concerning round one?
Actually, the verbs . . . picked out by the literary scholar turn out to be common in ANC-NYTIMES corpus too. However, after examining the unique MUSE verb list, two literary scholars were surprised to find many unexpected unique verbs, which means their uniqueness is beyond the scholars' awareness. In conclusion, literary scholars are not explicitly aware of what are the unique research behaviors at the vocabulary-use level. They might be able to summarize their scholarly primitives as Unsworth did.... But this does not help the computer scientist to understand the data-mining needs in literary criticism.
This lack of explicit awareness on the part of the critic will become a leitmotif as we continue to discuss text-mining in literary contexts, so let me flag it as it arises here for the first time.
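The first stage of that procedure, finding words markedly more frequent in the criticism corpus than in general usage, can be sketched as follows. This is a toy illustration with invented mini-corpora standing in for the MUSE and ANC collections, and a crude frequency-ratio test with naive smoothing; the actual comparison would have used the full corpora and more careful statistics.

```python
from collections import Counter

def freq(tokens):
    """Relative frequency of each token in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def overrepresented(target, reference, ratio=5.0):
    """Words whose relative frequency in `target` is at least `ratio`
    times their frequency in `reference` (or absent from it entirely)."""
    t, r = freq(target), freq(reference)
    floor = min(r.values()) / 10  # crude smoothing for unseen words
    return sorted(w for w in t if t[w] / r.get(w, floor) >= ratio)

# Invented stand-ins for the MUSE criticism corpus and the ANC:
muse = "narrating desiring mediating the text the reader the text misreading".split()
anc  = "the market the report said the mayor said the city".split()

print(overrepresented(muse, anc))
# -> ['desiring', 'mediating', 'misreading', 'narrating', 'reader', 'text']
```

Common function words like "the" wash out, while the criticism-specific verbs survive the ratio test, which is the effect the MUSE/ANC comparison relied on.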
Bei also tried topic-analysis of the Muse articles, to see if that would help turn up some things for data-miners to do. She found
... that many essays are trying to build connections between writers, characters, concepts, and social and historic backgrounds. As evidence, 56 out of 84 ELH essays and 24 out of 40 ALH essays titles contain "and"--one of the parallel structure indicator. For example:
* "Monumental Inscriptions": Language, Rights, the Nation in Coleridge and Horne Tooke
* "Sublimation Strange": Allegory and Authority in Bleak House
* "Tranced Griefs": Melville's Pierre and the Origins of the Gothic
* Passion and Love: Anacreontic Song and the Roots of Romantic Lyric
In conclusion, simple MUSE topic analysis does not help to find new data mining applications. The reason might be that topics are so high-level and abstract that they cannot be easily represented as countable lower-level linguistic features for data mining purposes.
The third step in this procedure, or method, was then to compare the journal literature from data-mining with that from literary criticism, and see what words, at least in our sample, seemed to occur frequently in both, but infrequently in the American National Corpus. Some of those words were "model, pattern, framework, spatial" and various forms of the words "classify, correlate, associate, relationship, similarity, hierarchy," and "sequence."
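This third step, intersecting the vocabularies of the two specialist literatures and subtracting general usage, can be sketched in the same spirit. Again the corpora here are invented stand-ins, and the fifty-word cutoff is an arbitrary choice for illustration.

```python
from collections import Counter

def top_words(text, n=50):
    """The n most frequent words in a text, as a set."""
    return {w for w, _ in Counter(text.lower().split()).most_common(n)}

# Invented stand-ins for the three corpora:
crit    = top_words("pattern model the a correlate sequence reading pattern model")
mining  = top_words("pattern model cluster classify the a sequence pattern")
general = top_words("the a said market city mayor the a")

# Frequent in both specialist literatures, but not in general usage:
shared = (crit & mining) - general
print(sorted(shared))
# -> ['model', 'pattern', 'sequence']
```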
The final step is to sit down with literary scholars and look at the phrase-level context for these words in the criticism itself, to see where these words--representing interests that seem to overlap between data-mining and literary criticism--actually refer to things that data-mining could support in literary criticism. For example, since pattern is our declared objective, here are some of the phrases in which one finds the word "pattern" embedded, in literary criticism:
Which of these patterns can we actually find in large collections? Well, we went looking for some, to find out. We began with a pattern suggested by Matt Kirschenbaum, namely the use of erotic language in the poetry of Emily Dickinson. The object would be to have Dickinson scholars identify erotic and non-erotic Dickinson poems (hot and not hot, for short) and the vocabulary that makes them so, and then subject the same corpus to analysis by software, to see if the software, trained by the expert judgments, could learn to predict which poems would be hot, which not, and why. Early on, Matt worried that the net of all this, if we were successful, might be
that our results may . . . largely confirm information the scholar already has in hand, or at least strongly suspects. While I hold out hope that our visualizations will contain a genuine surprise or two, there's a larger sense in which they'll merely be confirming (or suggesting) what we already know: that a high "hotness metric" for a given document suggests that that document is likely to be of interest to the scholar who supplied the particular indicators in the first place. This isn't as circular as it sounds (the computer is working on a scale and at a pace that would be impractical for a human investigator performing the same analysis manually) but this is still a very traditional form of text analysis and does not, it seems to me, take advantage of any actual data mining algorithms. What we ultimately want to be able to do is have the computer (I use the word loosely) suggest a new indicator, one conjured automatically from the list of indicators described (based on proximity or other forms of pattern matching). The idea, in other words, is for the computer to show us that [some word] crosses some threshold in relation to its proximity to the words already listed, thereby startling some intrepid Dickinsonian (who might that be?) to stroke her chin and say, "Hmm, I wouldn't have thought of that but it sure is interesting. I'm going to go and reread some poems I thought I knew well."
A computer science graduate student in Ben Shneiderman's Information Visualization class at Maryland, Nitin Madnani, put together a Java tool for visualizing weighted searches across multiple poems, so that it would be easy to see the poems in which erotic terminology, once identified, seemed to cluster. This was done outside of D2K, and very hard-coded: it was an experiment in visualization and its requirements in this context (and in that respect, it will inform what we do later in designing interfaces and visualizations to work with D2K), but it was also a tool that helped the literary scholars begin to explore patterns across multiple works.
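The core computation behind such a weighted-search tool can be sketched roughly as follows. The term weights and the two miniature "poems" here are invented for illustration, not drawn from the actual Dickinson work; the per-poem score is simply a length-normalized weighted sum, which is the kind of quantity the visualization would display.

```python
# Hypothetical term weights a scholar might supply (not nora's actual values):
weights = {"kiss": 1.0, "lips": 0.8, "bee": 0.5}

def score(poem, weights):
    """Length-normalized weighted sum of matched term weights for one poem."""
    tokens = poem.lower().split()
    return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)

poems = {
    "p1": "the bee sipped and the kiss lingered",
    "p2": "a narrow fellow in the grass",
}

# Rank poems by score, highest (most "hot" vocabulary) first:
ranked = sorted(poems, key=lambda p: score(poems[p], weights), reverse=True)
print(ranked)
# -> ['p1', 'p2']
```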
Having tinkered with that tool a bit, Martha Nell Smith and Tanya Clement sat down with the corpus of Dickinson poems and labeled each text hot or not--as a whole. This is, of course, a subjective evaluation, but it also represents expert knowledge. These evaluations were passed to Bei Yu, who subjected the corpus to a kind of predictive analysis known as Naïve Bayesian classification (no relation: NB is named for Thomas Bayes, an early 18th-century British mathematician and minister). As Matt Kirschenbaum explained in a recent presentation on the nora project,
Bayesian probability is the domain of probability that deals with non-quantifiable events: not whether a coin will land heads or tails for instance, but rather the percentage of people who believe the coin might land on its side; also known as subjective probability. Our Bayesian classification is "naïve" because it deliberately does not consider relationships and dependencies between words we might instinctively think go together--"kiss" and "lips," for example. The algorithm merely establishes the presence or absence of one or more words, and takes their presence or absence into account when assigning a probability value to the overall text. This is the kind of thing computers are very good at, and naïve Bayes has been proven surprisingly reliable in a number of different text classification domains.
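A minimal sketch of the presence-or-absence naive Bayes classification Matt describes might look like this. The training "poems" are invented word sets rather than the actual Dickinson corpus, and the labels are toy judgments; the point is only to show how the algorithm counts the presence or absence of each vocabulary word, with Laplace smoothing, and assigns the more probable label.

```python
import math
from collections import defaultdict

def train(docs):
    """docs: list of (set_of_words, label) pairs. Returns class priors and,
    per class, the smoothed probability that each vocabulary word is present."""
    labels = {lab for _, lab in docs}
    vocab = set().union(*(words for words, _ in docs))
    prior, pword = {}, defaultdict(dict)
    for lab in labels:
        in_class = [words for words, l in docs if l == lab]
        prior[lab] = len(in_class) / len(docs)
        for word in vocab:
            present = sum(1 for words in in_class if word in words)
            pword[lab][word] = (present + 1) / (len(in_class) + 2)  # Laplace
    return prior, pword, vocab

def classify(words, prior, pword, vocab):
    """Pick the label with the highest log-probability; both the presence
    and the absence of each vocabulary word contribute to the score."""
    best, best_lp = None, -math.inf
    for lab in prior:
        lp = math.log(prior[lab])
        for word in vocab:
            p = pword[lab][word]
            lp += math.log(p if word in words else 1 - p)
        if lp > best_lp:
            best, best_lp = lab, lp
    return best

# Toy labeled "poems" (word sets), standing in for the expert judgments:
docs = [({"bee", "berries", "nights"}, "hot"),
        ({"hands", "bees", "seal"}, "hot"),
        ({"snow", "church", "winter"}, "not"),
        ({"grass", "snake", "winter"}, "not")]
prior, pword, vocab = train(docs)
print(classify({"bees", "nights", "lips"}, prior, pword, vocab))
# -> hot
```

Note that the classifier ignores relationships between words entirely, exactly as Matt says: "bees" and "nights" each count on their own, which is what makes the method "naive."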
The purpose of all this was to predict what we thought we already knew, namely what makes a Dickinson poem erotic. The prediction done by experts was based on vocabulary, but it was more generally based on long experience in writing about and reflecting on Dickinson poems--in other words, it was based on traditional humanities research methods. The prediction done by the nora software was based on the combination of the experts' overall determinations, evaluated against some 4000 features--in this case, words--extracted from the document set and ranked according to the probability that they would appear in erotic poems. In some cases, humans and software agreed. For example, the words "tasted, faces, touching, Lords, Berries, feel, Nights, Hands, Nut, Butterfly, seal, Queen, and Bees" were all identified as highly correlated with eroticism, both by the experts and by the nora software. In some cases, though, the software contradicted the experts: for example, the words "Music, tune, warm, cold, Lightning, blood, Sun, cut, and love" were all predicted as markers of the erotic by the experts, and found not to be, by the software. Most interesting, to me at least, in the list of contradictions: terms in which eroticism varies by number. For example, it is erotic to be plural in the case of "nights, bees, berries, hands," and "faces," but those same words, in the singular, do not register as markers of the erotic. Conversely, "nut" is erotic, but "nuts" are not. Hmm.
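One simple way to produce the kind of ranked word list described above is to compare, for each word, its smoothed probability in the "hot" documents against its probability in the rest. The sketch below is again entirely illustrative: the two tiny document lists, the function name, and the specific ratio statistic are my own stand-ins, not the actual feature-ranking method used in nora.

```python
from collections import Counter

# Invented mini-corpus standing in for the expert-labeled poems.
hot_docs = ["wild nights wild nights", "come slowly eden lips"]
not_docs = ["a narrow fellow in the grass", "the grass divides"]

def rank_markers(hot_docs, not_docs):
    """Rank every word by the smoothed ratio
    P(word | hot) / P(word | not); high values mark the 'hot' class,
    low values mark its absence."""
    hot = Counter(w for doc in hot_docs for w in doc.split())
    cold = Counter(w for doc in not_docs for w in doc.split())
    vocab = set(hot) | set(cold)
    n_hot = sum(hot.values()) + len(vocab)   # Laplace-smoothed totals
    n_cold = sum(cold.values()) + len(vocab)
    ratio = {
        w: ((hot[w] + 1) / n_hot) / ((cold[w] + 1) / n_cold)
        for w in vocab
    }
    return sorted(vocab, key=ratio.get, reverse=True)

markers = rank_markers(hot_docs, not_docs)
print(markers[:3])  # words most strongly associated with the 'hot' documents
```

Run over the full corpus with thousands of word-features, a ranking like this is what surfaces both the expected markers and the surprises--the "Vinnie"s and "mine"s discussed below.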
Finally, and most interestingly, the software turned up some highly correlated markers of the erotic in Dickinson which hadn't appeared on the experts' list: "mine, must, Bud, Woman, Vinnie, joy, Thee, write, Eden, luxury, remember," and "always" followed by three dots ...
Here's what the Dickinson scholar, Martha Nell Smith, had to say about those results, in a post just this week, to nora's email list (webviz[at]prairienet): it's a bit long, but worth reading in its entirety:
In the 1990s Harold Love stated something very important about 'undiscovered public knowledge': that too often knowledge, or its elements, lies (all puns intended) like scattered pieces of a puzzle but remains unknown because its logically related parts are diffused, relationships and correlations suppressed. Five years ago I wrote about that fact in "Suppressing the Book of Susan in Emily Dickinson," an article surprisingly few Dickinson scholars seem to know, precisely because, though it's well situated in the volume Epistolary Histories, it is separated from much Dickinson criticism. Love does not remark anything we don't already know in some way, shape, form, and that is, I suppose, precisely the point. At one point or another members of this nora team have been frustrated over the "oh wow" moment that just seemed to be missing. When my first one came, I was left saying, "uh, duh"--the "oh wow" moment is right in front of me/us. (By the way, all of my observations here are drawn from thinking about the lists that Bei sent and came out of conversations with Tanya and Catherine.) When Bei sent the computationally-generated list of found erotic terms and "Vinnie" was a "hot" term, and one of the most frequent to occur, I was at first surprised. But just a smidgeon of reflection changed that surprise to "uh, duh" recognition. Of course I had known that many of Dickinson's effusive expressions to Susan were penned in her early years (written when a twenty-something) when her letters were long, clearly prose, and chock-full of the daily details of life in the Dickinson household. But I had never thought of this fact in quite the way that the data mining "search and find the erotic" exercise made me put together the blending of the erotic with the domestic. And thus I was surprised again because I've written extensively on the blending of the erotic with the domestic, of the familial with the erotic, and so forth. 
So I should have expected "Vinnie" to appear frequently in these early letters and to appear near erotic expressions, but I was still taxonomizing (and rather rigidly so) in my interpretations without realizing I was doing so. In other words, I was dividing epistolary subjects within the same letter, sometimes within a sentence or two of one another, into completely separate categories, and I was doing so un-self-consciously. I could wax eloquent here about why understanding the erotic as part and parcel of, and not separate from, daily life is so important, but in the interest of time I'll just note the important connection, a connection discouraged by the traditional hierarchies of Western culture. Making the connection leads to critical understandings not otherwise obtainable, and the data mining exercise helped me do that. Similarly, though I had not designated "mine" as a hot word, it did not surprise me at all that it was FIRST on Bei's list. The minute I saw it, I had one of those "I knew that" moments. Besides possessiveness, "mine" connotes delving deep, plumbing, penetrating--all things we associate with the erotic at one point or another. And Emily Dickinson was, by her own accounting and metaphor, a diver who relished going for the pearls. So "mine" should have been identified as a "likely hot" word, but has not been, oddly enough, in the extensive literature on Dickinson's desires. Same goes for "write" -- oh to leave a piece of oneself with, for, the beloved. To "write" is to present oneself, or a piece of oneself, physically -- and noting that the data mining was picking up both "write" when recorded by Dickinson and "write" in the [XML] header [where it would indicate a letter] led the three of us to a "can we teach a computer to recognize tone" discussion. 
I wonder, remembering Dickinson's "A pen has so many inflections and a voice but one" what the human machine can do, what the human machine does (recognizing, identifying tone) what we think we're doing when we're so damned sure of ourselves. So the data mining has made me plumb much more deeply into little four- and five-letter words, the function of which I thought I was already sure, and has also enabled me to expand and deepen some critical connections I've been making for the last 20 years. On this list I've already talked about the limitations of "key words," a fact of which all humanists who get frustrated with search and retrieval are all too well aware, so I won't go on at great length about that. "Key words" are indispensable, but they don't work like magic, and we need to be rigorously self-conscious about all such taxonomies. I knew that, but it still surprised me when I saw texts that had several key erotic words and the texts were definitely not "hot." So Harold Love's observation very much holds--all of this was available to me but lay scattered as unrelated pieces. The data mining exercise was key to pulling it all together. Oh, and perhaps it goes without saying that the exercises also made me pull some things apart in order to make these connections.
To this, within an hour, Steve Ramsay replied:
"What, then, is this shock of recognition we feel? How do we make sense of it? Is it useful? We're all familiar with McGann's memorable remark (from Lisa Samuels, I believe) that HC is all about "imagining what you don't know." But here, we seem to be encountering something different: imagining what we already know. And in a sense, won't data mining operations of the sort we are undertaking always produce this effect? After all, we trained the system. It only knows how to look for what is already implicit in our sense of things. It will produce a more granular, more all-encompassing vision of what we know, but what "we know" is the ground of its knowing. I am repeatedly asked questions like "Well, who decides what the erotic is?" I say two things about that. First, no one is actually defining [eroticism]. It's much more akin to Justice Stewart's observation about the obscene ("I know it when I see it"). We don't define the term so much as point to the instances we believe belong to its class. Second, whatever deciding is going on is as highly subjective, as insistently contingent as any other critical act. The fact that we are subjecting it to computational analysis neither diminishes nor enhances the implications of that fact. But if highly subjective interpreters point to instances of a particular class, and the computer comes back with the defining features of that class, have we done anything other than give ourselves a deeper understanding of what is implicit in our own subjective musings? Much will depend on how we present that insight, I think."
Martha clearly thinks that it is a worthy outcome to arrive at a deeper understanding of what we already know, but I think she'd also argue (and maybe she has--I haven't checked the mail today) that when the data-mining process throws up new markers of the erotic, at least some of them lead her to new understandings of Dickinson, and don't just confirm or expand the understandings she came with. Data-mining delivers a new kind of evidence into the scene of reading, writing, and reflection, and although it is not easy to figure out sensible ways of applying this new research method (new, at least, to the humanities), doing so allows us to check our sense of the gestalt against the myriad details of the text, and sometimes in that process we will find our assumptions checked and altered, almost in the way that evidence sometimes alters assumptions in science.
We're continuing on with these experiments, and the next round will take it up a notch, from the works of an author to instances of a genre. We're looking at sentimental fiction in the 19th century. The first round of training, completed by Kristen Taylor and others at Virginia, consisted of ranking each of the chapters of Uncle Tom's Cabin on a one-to-ten scale (not-wet to wet, I guess we'll say). A number of people do these rankings, and their results are compared, and again we look for markers in vocabulary: we'll then take these results to a number of other works--ones that we recognize as sentimental and others we don't--and we'll see what we learn. So far, we already see some things that are interesting in the context of this particular novel: in the top 100 words appearing in chapters rated highly sentimental, number one, with a bullet, is ... "Senator" and the rest of the top ten are "susan", "weeping," "bird," "reflections," "auctioneer," "cloak," "john," "block," "mud." "Mothers" doesn't show up until #16, "forgive" is quite a bit higher on the list than "defenceless," "pain" and "prison" beat out "agony" and "sorrow," and way down toward the bottom of the list are words like "rose-colored," "swaying," and "melted." Writing to the email list about the high ranking of "bird," Kristen said:
"bird" at 4 is cool. Most of the occurrences are in the highly sentimental chapter 9 . . . with the Senator and Mrs. Bird, but there are enough significant usages of the word applied as an adjective (only once does it refer to actual birds) to make it significant. This would be a cool paper--Stowe is riffing off the slave song "I'll Fly Away," but the 'flying' and 'escaping' words do not appear often.
"Imagining what you already know" is a good description of modeling in many humanities contexts: for example, in building a model of Salisbury Cathedral, or the Crystal Palace, as we did at the Institute in Virginia, you could say that we were imagining what you already know about those structures. However, interestingly, the act of modeling almost always brings to the surface of awareness things you didn't know you knew, and often shows you significant gaps in your knowledge that--of course--you didn't know were there. Of course, in some cases--maybe even in all cases that I've mentioned--one could (in principle) do this kind of modeling and even the quantitative analysis without computers: you could model the crystal palace with toothpicks and plastic wrap; you could do the painstaking word-counting and frequency comparison by hand. But you wouldn't, because there are other interesting things you could do in far less time.
Near the beginning of this talk, I raised the distinction between basic and applied research. From a data-mining point of view, what we're doing in the nora project is, for the most part, applied research: we're not developing new alternatives to naïve Bayesian analysis, for example. But from the humanities perspective, I would argue that what we're doing is basic research, because we are working out research methods that can then be applied in pursuit of more immediate research goals (like developing new understandings of particular texts). There are many other new research methods, in addition to statistical analysis, on the horizon as well, and all of them need (at least from the point of view of the humanities) some basic research. Simulations, games, map-making, semantic and semiotic tools--there's a lot of this kind of work yet to do, a lot of basic work, to bring information technology to bear on humanities research. Doing that work will require interdisciplinary teams, because there is too much in any of these projects for one person to do, and because it is simply impossible that any one person would have all of the necessary expertise. The problems these teams encounter will, I'm sure, be substantially the same as those we've been encountering in the nora project--and perhaps the work that Bei Yu is doing will provide a reusable method for determining the best fit between the capabilities of tools developed in other domains and the needs of research in the humanities. In that respect, what she's doing may be the most basic research of all, in spite of its focus on application.
It is easy to predict that new kinds of graduate training--at least, new for humanities graduate students--will be both necessary and available in this kind of collaborative project work. You've got to have graduate students involved, because they have so much to contribute in actually carrying out certain parts of the research program, and by the same token they can make some of those parts their own, get their own publishing done, and build dissertations out of the raw materials in something like nora. They can be funded while doing it, too, and they have a completely different kind of working relationship with faculty than that provided by the tutorial model that still informs most graduate training in English. They work with faculty in other universities, which has real significance when they hit the job market, and they work with graduate students in other universities and in other disciplines as well, which means that they have a very different sense of community throughout their graduate careers than do most of their peers, at least in English departments.
The other thing I think I can predict with some confidence is that computational methods for humanities research require a new kind of infrastructure. We've been building the digital library for some time now, and the library has always been the research infrastructure for the humanities, but in order to support this new kind of research, digital libraries are going to have to interoperate in ways they are not really even close to doing now, and for certain kinds of things--like data-mining--it's hard to imagine being able to derive the requisite quantitative information from collections that are distributed rather than aggregated: to put it simply, you need to put things in a big pile to find out the characteristics of that pile. The good news, though, is that infrastructure is, by its nature, somewhat general purpose: you can use electricity to drive lots of different devices, and you can use something like Tamarind--Steve Ramsay's XML data management system--to answer lots of different kinds of questions. We don't have to build new infrastructure for every new project, especially if we've properly distinguished between basic and applied research. Growing out of nora, for example, I can already see a set of applied research activities--probably taking the form of journal articles, actually--and some proposal for further basic research to develop infrastructure. That latter work will focus on bringing the nora testbed of 18th and 19th-century British and American literary texts together with earlier texts that are being similarly prepared and analyzed in a project called "WordHoard," run by Martin Mueller (in English and Classics, at Northwestern University). Taken together, the infrastructural work that's being done in these two projects can, we think, form the basis for an Analytical Corpus of English and American Literature, that would support many different applications in humanities research, across many different kinds of literature and literary study.
On the subject of "infrastructure" I'd like to encourage you to have a look at the draft report of the Commission on Cyberinfrastructure for Humanities and Social Sciences, sponsored by the American Council of Learned Societies: it became available for public comment just a few days ago, and it can be downloaded from the ACLS web site. The Commission is looking for comments on this draft, and your contributions would be most welcome: we hope that when it is complete the report will help to foster the development of the tools and the institutions that we require in order to reintegrate the human record in digital form, and make it not only practically available but also intellectually accessible to all those who might be interested in it.
That goal is, I think, a good place to stop, because it brings us back to the point that Frye made about the purpose of criticism in general, which is that it should be "interested in literature itself and in what it does or can do for people." However "scientific" or statistical or technical these new research methods might seem--however systematizing, totalizing, and Gradgrindian--they are driven by the desire to understand the human record, and perhaps even more, to understand our understanding of it. That it should take a machine to do that is only a superficial paradox: the machine itself is simply an instrument of procedural epistemology, and its only function, at least in humanities research, is to offer us methods for imagining what we don't know, as well as what we do.
 Northrop Frye, "The Archetypes of Literature." Kenyon Review 13.1 (1951), 92-110.
 Spiritus Mundi: Essays on Literature, Myth, and Society (1976). 106.
 "The MLA and Literary and Linguistic Study and Teaching: The Centennial Forum. John H. Fisher; Geoffrey Marshall; John William Ward; Helen Vendler; Richard Lloyd-Jones; Frank G. Ryder; Northrop Frye." PMLA, Vol. 99, No. 5. (Oct., 1984), pp. 974-995. Frye's contribution has the title mentioned above, and these passages are from page 991.
 Willard McCarty, "Computing the embodied idea: modeling in the humanities." Deutsche Gesellschaft für Semiotik, Universität Kassel, 19 July 2002. http://www.kcl.ac.uk/humanities/cch/wlm/essays/kassel/
 http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-7.html#%25_chap_Temp_4. See also the video of lecture 1, at http://swiss.csail.mit.edu/classes/6.001/abelson-sussman-lectures/. Thanks to Steve Ramsay for this reference.
 Martha Nell Smith, email of 11/10/05, subject "Curmudgeon Reflections on nora" to webviz[at]lists.prairienet.org, the project email list for the NORA project.
 Steve Ramsay, email of 11/10/05, subject "Re: Curmudgeon Reflections on nora" to webviz[at]lists.prairienet.org, the project email list for the NORA project.