Text Analysis Tools for the Digital Library: An International Digital Libraries Proposal to the National Science Foundation
I. Background:
A number of major projects in the past few years, such as the NSF’s own Digital Library Initiative, as well as ELIB and other JISC initiatives, have made available a very large distributed repository of primary and secondary electronic texts for research and teaching in the humanities. However, at present our interactions with this repository are limited to searching and browsing—hardly the limit of what one could hope to do with the digital library. Tools that would enable more sophisticated analytical operations in the context of distributed digital libraries simply do not yet exist—either to work across the network, or to work natively with texts encoded in Standard Generalized Markup Language. At the same time, in many cases, significant intellectual effort has been invested in encoding the structural and content features of the texts, so the texts themselves contain sufficient intelligence to support such operations.
The stand-alone tools that are currently in use (see Table 1, at the end of this section) for the study of texts in the humanities and other disciplines are generally optimized for expert users. Many fail to take advantage of current encoding and metadata practices, and present results unintuitively. A few still in widespread use are the product of a batch-mode paradigm, as they were originally developed when computing resources were expensive and relatively slow. Others impose size limitations on the amount of text they can handle. The speed of modern computers, together with the use of texts with detailed metadata and sophisticated markup already within them, now gives us the potential to exploit existing data markup, to work more interactively with our research materials, and to use graphical visualization to view the results. The research process can shift to a discovery model, in which the user refines and tests her hypotheses by making many minor modifications and trying them out on the texts repeatedly. This will open textual analysis to a wider spectrum of users, because the cost of error will be lower, and the software will invite experimentation. For the same reason, we may begin to see textual analysis software being put to uses with which it is not traditionally associated, by both experienced and novice users.
This project would close the gap that currently separates digital library data from stand-alone text-analysis tools, and at the same time would expand the analytical repertoire, improve the usefulness of results, and create an extensible tool architecture so that new modules, performing new analytical tasks, can be added in the future by other developers. These developments will in turn have an impact on the development of future electronic text resources, giving wider scope to the creators and greater confidence to the agencies that fund such work, such as NSF, JISC, NEH and the UK's Arts and Humanities Research Board (which explicitly supports the creation of digital resources for research). Perhaps most importantly, though, the personnel of this project, together with their European partners, bring to the proposed research many years of experience with the creation, dissemination and use of structured text, textual analysis tools, and the support of users of electronic resources in the humanities. Moreover, the project will include libraries as major players: most of the texts we will be using will have gone through the metadata-control and collection-shaping that a library provides, and will already be embedded in a library setting. And finally, because of their connections with the ACH and ALLC (Association for Computers and the Humanities and the Association for Literary and Linguistic Computing—the two main organizations for humanities computing and textual analysis), project partners in the U.S. and the U.K. can draw on a broad cross-section of computing humanists for requirements analysis, software testing, and evaluation.
Table 1: Analysis Software Known to Members of the Project
The following pieces of software have been examined to date as models or precursors for parts of the preliminary design. An asterisk (*) indicates that one of the project participants is or has been involved in the design and development of the system.
EyeContact (*): EyeContact is a prototype of a visual programming environment for text manipulation, developed at McMaster University (Canada). It acts as a "think piece" for the development team to support the design of such an environment. (http://www.chass.utoronto.ca:8080/epc/chwp/rockwell/)
GATE: GATE is a rapid application development environment for Language Engineering (LE) applications which can also serve as a test bench for LE research. The GATE project aims to develop a communication and control infrastructure for linking together sets of LE software and standard access modes to common data resources (lexica, corpora, etc.). GATE builds on the TIPSTER architecture. (http://www.dcs.shef.ac.uk/research/groups/nlp/gate/overview.html)
IMS Corpus Workbench: The IMS Corpus Workbench is a Unix-based collection of software tools developed at the University of Stuttgart that is used for data-driven linguistics, lexicography and terminological work. It can support large corpora, and currently handles a German newspaper corpus which consists of about 200 million tokens, annotated with lemmata, two different part-of-speech tag sets, and sentence boundaries. (http://www.ims.uni-stuttgart.de/)
LinguaLinks/Cellar: CELLAR is a platform for linguistic work developed in Smalltalk at the Summer Institute of Linguistics (Dallas, Texas). LinguaLinks, a tool for field workers, is built on top of CELLAR and can cope simultaneously with data in many languages. CELLAR is fully programmable and can be used to develop text-related applications for many other disciplines. (http://www.sil.org/lingualinks/)
LT XML tools and software library (*): LT XML is an integrated set of XML tools and a developers' tool-kit that runs on UNIX and WIN32. The tools provided can process well-formed XML documents, with operations including searching and extracting, down-translation (e.g. report generation, formatting), tokenising and sorting. Sequences of tool applications are pipelined together to achieve complex results. An API is provided so that new tools can be created.
LT NSL tools (*) (Language Technology Group, U of Edinburgh): LT NSL adds to LT XML the tools necessary for working with arbitrary SGML as well as XML.
MonoConc: MonoConc is a simple Windows-based program developed at Rice University that allows teachers and students to carry out basic research into the lexical, syntactic, and semantic patterns of a short text. It does not support markup. (http://www.ruf.rice.edu/~barlow/mc.html)
OCP: The Oxford Concordance Program (OCP) is a batch program developed in the 1970s that generates concordances, word lists, and indexes. It is best known for its versatility in handling the special ordering and processing needs of foreign languages. OCP predates SGML, but can handle a number of markup systems that were then in use. (http://info.ox.ac.uk/ctitext/resguide/resources/o125.html)
PAT: PAT is a text-string search engine which efficiently retrieves all occurrences of words or phrases appearing in large texts, and allows queries to be restricted to arbitrary structured text fragments. It was originally developed for the Oxford English Dictionary project at the University of Waterloo (Canada), and is now part of the commercial product OpenText. (http://db.uwaterloo.ca/OED/index.html)
SARA/BNC (*): SARA is a client/server system consisting of the SARA client (Windows only) and Unix-based server software developed at Oxford. It currently works with the 100-million-word British National Corpus (BNC) and enables the inexpert user to search rapidly through the BNC for examples of specific words, phrases, patterns of words, etc., which can be sorted and displayed in a variety of different formats. (http://info.ox.ac.uk/bnc/)
TACT (*): A system of 15 programs for MS-DOS developed at the University of Toronto, TACT is designed to do text-retrieval and analysis on literary works. It is well known within the international literary text analysis community, but can handle only relatively small amounts of text. Recently developed extensions provide Web access to a TACT database, and support SGML and TEI texts. (http://www.chass.utoronto.ca:8080/cch/tact.html)
Tatoe: The Text Analysis Tool with Object Encoding is a tool written in Smalltalk by partners based at GMD-IPSI and ZUMA (Germany) for refining the structure of text and performing analyses based on the document structure and markup information. It supports both text exploration and interactive markup, and is aimed at corpus-based, multilingual and multi-layered analysis. It is still under development. (http://www.darmstadt.gmd.de/~rostek/tatoe.htm)
TIPSTER: The TIPSTER Text Program was a DARPA-led government effort to advance the state of the art in text processing technologies. The TIPSTER research and development community has developed a standard set of technology components, enabling "plug and play" capabilities among the various tools being developed and permitting the sharing of software among the various participants. The architecture continues to evolve. (http://www.nist.gov/itl/div894/894.02/related_projects/tipster/)
TUSTEP: TUSTEP was developed at the University of Tuebingen Computing Centre starting in the 1970s, and supports a range of scholarly textual data processing tasks, from text searching to bibliographic management. TUSTEP is highly modular and was developed to provide a toolbox consisting of a number of separate programs, each covering one basic operation. (http://www.uni-tuebingen.de/zdv/zrlinfo/tustep-des.html)
WordCruncher: WordCruncher is a Windows-based program, originally developed at BYU, that provides full text retrieval, including logical connectors, frequency distribution, and collocation tools. It is now a commercial product marketed by WordCruncher Publishing Technologies (WPT). (http://www.wordcruncher.com/)
WordSmith: WordSmith is a Windows-based suite of six programs that perform a basic set of specific text analysis tasks. It has full support for ASCII texts, and some support for SGML- or HTML-encoded corpora. It is marketed by OUP, who report that it is designed for advanced students, teachers, linguists and researchers. (http://www1.oup.co.uk/elt/catalogu/multimed/4589846/4589846.html)
II. Project outline:
The main aim of this project is to develop a core set of text analysis tools that will enable and enhance the scholarly exploitation of structured electronic textual resources in the humanities. In the process of doing this, the project will define an open modular architecture and protocols for data interchange that will enable new modules to be developed and modules to be combined in new ways. In all of these tools, we believe that visualization will play a significant role—first, in allowing users to see the data structures represented in the electronic text collection they are dealing with, to see how those data structures are populated, and (interactively) to choose the structures and elements against which they want to run a given analytical operation; second, in allowing users to see patterns in the results they obtain from that analytical operation. A proof of concept of this second role for visualization is available at Virginia’s Dante project, http://www.iath.virginia.edu/dante/, where users can choose different colors to associate with different types of reference in Dante’s Inferno. A VRML model of the entire poem is then produced, in which each line of the poem is represented by a dot on a circle and each circle represents a canto; flags of the specified colors are distributed across the model (and linked back to the relevant sections of the text), showing the clustering and distribution of references in a way that could only be done with visualizations such as density plots, dot plots for pattern discovery, and selections and subsets of markup trees. This text visualization would be an instance of the kind of exploratory data analysis described by John Tukey in Exploratory Data Analysis (Addison-Wesley: Reading, MA, 1977).
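To make the idea concrete, here is a minimal, self-contained Java sketch (our illustration only; the canto names and counts are invented placeholders, not data from any actual edition) of how counts of encoded references per structural division might be rendered as a crude textual density plot. A real implementation would derive the counts from the marked-up texts themselves and drive a graphical display rather than printing characters.

import java.util.LinkedHashMap;
import java.util.Map;

/*
 * Illustrative sketch only: given counts of encoded references per
 * structural division (e.g. per canto), render a crude textual density
 * plot.  The division names and counts below are invented placeholders.
 */
public class MarkupDensityPlot {

    public static void main(String[] args) {
        // Hypothetical counts of one reference type per canto.
        Map<String, Integer> refsPerCanto = new LinkedHashMap<>();
        refsPerCanto.put("Canto I", 12);
        refsPerCanto.put("Canto II", 7);
        refsPerCanto.put("Canto III", 19);
        refsPerCanto.put("Canto IV", 3);

        int max = refsPerCanto.values().stream().max(Integer::compare).orElse(1);

        for (Map.Entry<String, Integer> e : refsPerCanto.entrySet()) {
            // Scale each bar to a width of at most 40 characters.
            int width = (int) Math.round(40.0 * e.getValue() / max);
            System.out.printf("%-10s %3d %s%n", e.getKey(), e.getValue(), "#".repeat(width));
        }
    }
}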
The project will begin by assessing the requirements of humanists with respect to literary and linguistic analysis of structured electronic texts, and we will then proceed to build a representative sample of distributed, network-based analytical software that is informed by these requirements. In addition to incorporating features of existing stand-alone text-analysis software, this project will emphasize the production of tools designed for interactive and iterative analysis of networked textual resources. The outcome of the project will be a set of widely usable, well-documented, and freely distributed tools for literary and linguistic analysis.
In the course of the three years during which this project will take place, participants will engage in several different categories of work, described in the plan of work below and summarized in Table 2: requirements analysis, architecture and protocol specification, data analysis, tools design and development, user-interface design and development, documentation, testing, and evaluation.
It is worth pointing out that although we do create texts ourselves, the digital library landscape is now more than ever shaped by collections that we purchase, just like any other library holdings. Many of the titles in our testbed collections are part of large commercially-available publications—Chadwyck-Healey, Oxford University Press, Accessible Archives, and InteLex datasets, for example—and the tools we develop to work on this particular data will be of potential benefit to any library owning these datasets. In addition, these tools provide an opportunity to experiment with issues of intellectual property, inasmuch as users might want results of an analytical operation and not want access to the full-text resource. We will work with the providers of licensed text bases to investigate these possibilities. It is important to note, also, that all project participants (in both the US and the UK) have experience as users of text analysis software themselves and as providers of this kind of service to other users. Moreover, three of the partners—UVa, Oxford and Brown—bring to the table significant collections of structured electronic texts, and the Virginia contingent includes library personnel (from the Library’s Electronic Text Center) with extensive experience in providing electronic text resources and serving the users of those resources.
III. Participants:
The principal investigator will be John Unsworth (Associate Professor and Director of the Institute for Advanced Technology in the Humanities at the University of Virginia). Other U.S. participants will include:
Brown University:
Steven DeRose (Adjunct Associate Professor, Computer Science Department)
Elli Mylonas (Lead Project Analyst, Scholarly Technology Group)
Florida Atlantic University:
Tom Horton (Associate Professor, Computer Science Department)
University of Virginia:
Worthy Martin (Associate Professor, Computer Science, and Technical Director, Institute for Advanced Technology in the Humanities)
Daniel Pitti (Project Director, Institute for Advanced Technology in the Humanities)
David Seaman (Director, Electronic Text Center, University of Virginia Library)
United Kingdom participants, who are applying separately to the Joint Information Systems Committee for their funding, will include:
Edinburgh University:
Chris Brew (Researcher, Language Technology Group)
Henry Thompson (Reader, Department of Artificial Intelligence & Researcher, Language Technology Group)
King’s College, London:
John Bradley (Senior Analyst, Centre for Computing in the Humanities)
John Lavagnino (Lecturer, Humanities Computing)
Harold Short (Director, Centre for Computing in the Humanities)
Oxford University:
Lou Burnard (Manager, Humanities Computing Unit, Computing Services)
Mike Popham (Head, Oxford Text Archive)
Brown, Oxford, King’s, and Virginia have considerable experience working with literary and linguistic researchers to create and use digital library resources; these institutions also have large collections of structured electronic texts, the largest being at Oxford and Virginia, which together comprise more than 60,000 SGML-encoded texts in a dozen languages. Participants from Brown, Edinburgh, Florida Atlantic, King’s, and Virginia also have significant experience in designing and producing software, and much of that experience is focused on structured text and on humanities computing. Finally, participants in this project are already part of an informal association of humanities software developers, called the ELTA (encoded literary text analysis) Initiative, a group which has been meeting at least annually since 1995, and which has the support of the ACH and the ALLC. At these meetings, we have been discussing possible requirements, elements and principles of design, architectural issues, and the like. Many participants and sponsoring organizations have also been involved in the eleven-year standards development effort that produced the Text Encoding Initiative DTD, as well as in the W3C standards efforts, most recently XML.
IV. Plan of work:
The focus of this effort will be on developing tools for literary and linguistic analysis. There is a fairly long history of tools development in these areas, but none of those tools is capable of working with networked texts or arbitrary document structures—most, in fact, do not work directly with SGML or XML and do not work in a networked environment.
The primary research questions to be addressed in this project are:
The information resource to be used in answering these questions is a distributed, networked, multilingual collection of structured text that comprises more than 60,000 texts, variously marked up, in a dozen languages, and held in four locations on two continents.
A. Year 1
1. Overview:
In the first year, emphasis will be placed on requirements analysis, on the specification of the architecture, protocols, and internal data formats, and on analysis of the testbed data, as described below.
This first-year planning and design process will take account of the structure and content of the resources available in digital libraries now and in the future (as far as this can be anticipated) and will focus on the requirements of academic and other users exploiting these resources. Working together, participants in the U.S. and the U.K. will design information-gathering tools, including Web-based questionnaires and surveys. They will also conduct a small number of strictly coordinated, in-person, hands-on interviews with experienced scholarly users of structured electronic texts. Finally, the ACH/ALLC and Oxford’s Digital Resources in the Humanities (DRH) conferences will be used to conduct a workshop to gather requirements from scholars and developers working in this area.
This requirements analysis will be done in the context of a (brief) survey of the general features and text preparation requirements of widely used extant software. At the end of the first year, results of the requirements analysis and architecture design work will be made public and comments will be solicited. Work done with potential users to define requirements will, in turn, become the basis for later evaluation: we will use input collected in the requirements phase to design the questionnaires, surveys, and feedback mechanisms for evaluating individual tools.
U.S. Partners: The Institute for Advanced Technology in the Humanities (IATH) and the Electronic Text Center (ETC), at the University of Virginia, and the Scholarly Technology Group (STG) at Brown University will contribute to this project as follows:
2. IATH:
Working with consultant Tom Horton, a software engineering professor from Florida Atlantic University, IATH will contribute to the requirements analysis with a dozen in-depth interviews with literary scholars who are experienced users, creators, and analysts of structured electronic texts. These scholars will be asked to describe (and rank in importance) the various analytical activities which characterize their interaction with texts (electronic or printed) and also to imagine what kinds of analysis they would like to be able to perform, were the tools to perform it available. Structured meetings with groups of domain experts have been found to be highly effective in earlier work carried out by Horton in an industrial setting [France 1995].
IATH will also contribute a percentage of effort from its technical director, Worthy Martin, and from its project director, Daniel Pitti, to the definition of architecture and protocols. We expect a portion of Tom Horton’s time to be involved in this activity as well.
Finally, IATH will convene several management meetings during the year, with the ETC and STG participants, and, as appropriate, with consultants. At least one meeting during the year will also include the UK participants, and we will plan to meet with those UK partners a second time during this year at a conference—such as the annual joint international conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing—which all project participants can reasonably be expected to attend. IATH will also provide daily project communication facilities such as email discussion lists, hypermail archives of those discussions, network conferencing facilities, and—perhaps most importantly—clients for a TCP/IP-oriented, SGML-based document management system, hosted at IATH, which can provide the entire project group with a version-controlled central repository for documentation, interview results, Java classes, and other work in progress.
3. ETC:
The Electronic Text Center will have primary responsibility, in the first year of this project, for a thorough analysis of the electronic textual data that comprises the project testbed—some 60,000 electronic texts in more than a dozen languages and in many different Document Type Definitions, held at several different institutions. These collections will be assessed and described in terms of the language groups represented, the SGML specifications in use, and the manner in which the data collections actually use those specifications. The purpose of this analysis and description will be to provide architecture and protocols designers, in the first year, and software developers, at later stages in the project, with an accurate preview of the data on which their tools will need to work.
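By way of illustration only, the following short Java sketch (ours, not a project deliverable; it assumes well-formed XML and the standard Java SAX parser, whereas the actual testbed is SGML in several DTDs and would need an SGML-aware parser or prior conversion) shows the kind of element-usage profile such a description might start from:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/*
 * Illustrative sketch only: walk a directory of well-formed XML files and
 * tally how often each element type actually occurs, as a first step toward
 * describing how a collection uses its markup.
 */
public class ElementUsageSurvey {

    public static void main(String[] args) throws Exception {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        Map<String, Integer> counts = new TreeMap<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();

        // Collect the candidate files first, then parse each one.
        List<Path> files;
        try (var paths = Files.walk(root)) {
            files = paths.filter(p -> p.toString().endsWith(".xml"))
                         .collect(Collectors.toList());
        }

        for (Path p : files) {
            try {
                factory.newSAXParser().parse(p.toFile(), new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        counts.merge(qName, 1, Integer::sum);
                    }
                });
            } catch (Exception e) {
                System.err.println("Skipping " + p + ": " + e.getMessage());
            }
        }

        // Print a simple profile: element name and number of occurrences.
        counts.forEach((element, n) -> System.out.printf("%-20s %d%n", element, n));
    }
}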
4. STG:
In the first year, during which the primary focus will be on specifying requirements, architecture, and test-bed characteristics, STG staff with expertise in tool use and design will contribute to the needs assessment, with the help of a consultant whose expertise is in the area of scholarly research methods and information gathering. Mylonas and the evaluation consultant will be the primary STG participants in this task; an STG student will implement any mockups and web-based information-gathering instruments. This work will be done in conjunction with partners from King’s College and UVa.
The other area STG will work on during the first year is the system architecture—the framework for software development. Steve DeRose will work with Tom Horton and the European partners at King’s College and Edinburgh to define a system architecture, protocols and internal data formats for the distributed tools. Recent and ongoing standards efforts—including DOM, XPointer/XLink, and WebDAV—will inform this work. A major architectural goal is to provide an environment where independently-implemented modules, possibly distributed around the network, can be combined into a coherent, dynamic, multiple-view user interface that makes high-level text-analytic studies far less tedious and far more effective and intuitive than is currently the case.
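As a sketch of what such a module contract might imply (the type and method names below are invented for illustration and do not represent any specification the project will actually publish), consider the following Java fragment, in which every analysis module consumes and produces the same simple intermediate representation and can therefore be chained with modules written elsewhere:

import java.util.List;
import java.util.function.Function;

/*
 * Illustrative sketch of an open module contract: each analysis module maps
 * a common intermediate representation to the same representation, so
 * independently written modules can be composed into pipelines.
 */
public class ModulePipelineSketch {

    /** A deliberately tiny stand-in for a structured-text fragment. */
    record Fragment(String elementName, String text) { }

    /** Every analysis module maps a list of fragments to a list of fragments. */
    interface AnalysisModule extends Function<List<Fragment>, List<Fragment>> { }

    /** Example module: keep only fragments whose text contains a keyword. */
    static AnalysisModule keywordFilter(String keyword) {
        return fragments -> fragments.stream()
                .filter(f -> f.text().contains(keyword))
                .toList();
    }

    /** Example module: normalise text to lower case. */
    static AnalysisModule lowerCase() {
        return fragments -> fragments.stream()
                .map(f -> new Fragment(f.elementName(), f.text().toLowerCase()))
                .toList();
    }

    public static void main(String[] args) {
        List<Fragment> input = List.of(
                new Fragment("l", "Midway upon the journey of our life"),
                new Fragment("l", "I found myself within a forest dark"));

        // Compose two independently written modules into one pipeline.
        Function<List<Fragment>, List<Fragment>> pipeline =
                lowerCase().andThen(keywordFilter("forest"));

        pipeline.apply(input).forEach(f ->
                System.out.println("<" + f.elementName() + "> " + f.text()));
    }
}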
UK partners: Harold Short will co-ordinate the activities of the UK partners, and he will liaise closely with the Principal Investigator, John Unsworth, at the University of Virginia. The Centre for Computing in the Humanities (CCH) at King’s College, the Humanities Computing Unit (HCU) at Oxford University, and the Language Technology Group (LTG) at Edinburgh, will contribute to the project as follows:
5. CCH:
The primary focus will be the work to define a system architecture, protocols and internal data formats, and functional requirements. The work at King's will be supervised by John Bradley, who will liaise closely with the project staff at Edinburgh, Brown and Virginia. CCH will also maintain the current close liaison with research and development staff working in this area at the University of Bergen, Norway, at IPSI in Darmstadt and ZUMA in Mannheim, at the University of Pisa in Italy, and at the Universities of Alberta and McMaster in Canada. CCH will also contribute to the requirements analysis, using the survey instruments developed by Virginia and Oxford among relevant research and teaching staff at King's.
6. HCU:
The HCU will take the lead among the UK partners in the requirements analysis, working closely with Virginia to develop the survey instruments, co-ordinating their use at Oxford and King's, and collating the results. The HCU will liaise with Virginia on the preparation of the results across the whole project for publication.
7. LTG:
LTG will focus on architecture, protocol and data format specification, extending the models they have been developing for a number of years in relation to language corpora. Their emphasis will be on the specification of a hyperlinking architecture (using XSL or similar) to enable non-intrusive annotation of text corpora (such as might be used as a basis for tools to produce aligned editions—e.g. of multiple-language texts).
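The idea of non-intrusive annotation can be illustrated with a deliberately simplified Java sketch (ours only; a real design along the lines described here would point into SGML/XML elements through a hyperlinking scheme rather than the raw character offsets used below, which serve merely to keep the example self-contained):

import java.util.List;

/*
 * Illustrative sketch of non-intrusive ("stand-off") annotation: the base
 * text is never modified; annotations live in a separate structure that
 * points back into it, here by character offsets.
 */
public class StandoffAnnotationSketch {

    /** One annotation: a span of the base text plus a label. */
    record Annotation(int start, int end, String type) { }

    public static void main(String[] args) {
        String baseText = "In the middle of the journey of our life";

        // Annotations are stored apart from the text they describe.
        List<Annotation> annotations = List.of(
                new Annotation(7, 13, "noun"),     // "middle"
                new Annotation(21, 28, "noun"));   // "journey"

        for (Annotation a : annotations) {
            System.out.printf("%-8s %s%n", a.type(), baseText.substring(a.start(), a.end()));
        }
    }
}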
B. Year 2
1. Overview:
The focus of this year’s work will be on developing core modules for the text-analysis package, testing those modules with one another and with the data collections at partner institutions, and documenting the architectural specification and the protocols on which they are based. Methods such as formal reviews of architecture and low-level designs will be used to verify that they meet requirements defined in earlier stages. We will release the architectural documentation at the end of year two, and we will develop in the open throughout. The ACH/ALLC and DRH conferences will be used for a developers workshop and for the demonstration of current prototypes. Also, we will recruit a few users with substantial projects to be early adopters of our tools, so that we can perform case studies evaluating their use of our tools in Year 3, in order to determine the potential impact of our project on literary and linguistic research. Finally, the Modern Language Association (MLA) conference, in December, will be the occasion for the demonstration of software prototypes as well.
2. IATH and ETC:
In the second year of this project, IATH and ETC will share a full-time programmer-analyst position, and will be the principal partner providing software development to meet the specifications developed in year one. The focus for this development effort in year two will be on creating the core modules for the projected tool set and testing them, separately and in combination, using data drawn from the digital libraries at the appropriate partner institutions. Programming is likely to be done in a combination of Java, TCL, and Perl, with Perl and TCL handling server-side functions, and Java handling client-side/user-interface functions. During the second year, IATH and ETC will also begin to draft principles and prototypes for data visualization of text-analytical result sets.
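As a rough sketch of this division of labor (the host, path, and parameter names below are placeholders of our own; no such service yet exists), a minimal Java client might hand a query to a server-side concordance script over HTTP and simply display what comes back:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/*
 * Illustrative sketch of the client/server split described above: a small
 * Java client queries a hypothetical server-side concordance service and
 * prints whatever the server returns.  The endpoint is a placeholder.
 */
public class ConcordanceClientSketch {

    public static void main(String[] args) throws Exception {
        String word = args.length > 0 ? args[0] : "forest";
        String query = "word=" + URLEncoder.encode(word, StandardCharsets.UTF_8);

        // Placeholder endpoint; a real deployment would publish its own URL.
        URL url = new URL("http://localhost:8080/cgi-bin/concordance?" + query);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}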
Once again, IATH will convene several management meetings during the year, with the ETC and STG participants, and, as appropriate, with consultants. At least one meeting during the year will also include the UK participants, and we will plan to meet with those UK partners a second time during this year at a conference—such as the annual joint international conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing—which all project participants can reasonably be expected to attend. IATH will also continue to provide daily project communication facilities such as email discussion lists, hypermail archives of those discussions, network conferencing facilities, and an SGML-based document management system for work in progress.
3. STG:
In the second year, STG’s contribution will consist primarily of supporting software development by working on user-interface design and testing: STG will test the tools under development, and design and begin to test user interface and presentation, working first of all with internal beta testers and later in the year with users not otherwise connected to the project. Mylonas and an STG student will be primarily involved in this, and Horton will work with staff at Brown to plan for the use of exploratory prototyping techniques to define user requirements and evaluate proposed user interfaces. These will use simple but effective rapid prototyping techniques such as paper prototypes [Rettig 1994] or hypermedia-based methods.
DeRose will continue to work with the developers to refine the architecture specification. STG will also contribute its design expertise in refining visualizations and other graphical views of data and results. As at the end of the first year, results will be made available, in this case including sample tools, and comments will be solicited. At this time we will also make the module architecture public and encourage third parties to build various add-on modules; this will allow us to evaluate the generality and ease-of-use of the architecture in a broader range of applications and user scenarios, and identify areas of needed enhancement.
4. CCH:
John Bradley will supervise a full-time programmer in the development of a set of tools and libraries, as agreed in the software development plan. CCH will take the lead among the UK partners in work on user-interface design and prototyping, and will liaise closely with the project staff at Brown.
5. HCU:
Lou Burnard will supervise a full-time programmer in extending the BNC's SARA client to conform to the architecture and protocols specified in Year 1 of the project, and in developing tools that can interface with SARA to provide additional functions.
6. LTG:
Chris Brew and Henry Thompson will supervise work to develop tools to support the transparent processing of distributed documents, both static and dynamic.
C. Year 3
1. Overview:
The tools will be tested with groups of users at partner institutions, and the definitions and the tools will be modified to take account of feedback from the tests. Details of the work in progress will be published on the project web site and on the relevant discussion lists, inviting comment and input from interested observers. Full reports of the project will be published, including the architecture and protocol definitions and the tool specifications. The tools themselves will be made freely available for use in the higher education community. The intention is that the open architecture will enable new modules to be created and combined with existing modules in currently unforeseen ways.
The ACH/ALLC and DRH annual conferences will be used, in this year, for training workshops and the presentation of a final project report; the MLA and the Association for Computational Linguistics’ US meeting will also be used as venues for final dissemination of the project report.
2. IATH and ETC:
In the third and final year of this effort, IATH and ETC will continue to share a full-time programmer-analyst, whose time will be devoted to developing the remaining software modules projected in the initial design phase. Since these tools, their source code, and their specifications, will be released for broader volunteer development at the end of year three, it is critically important that the modules released be consistent with one another, with the documentation, and with the published specifications, so it is reasonable to expect significant debugging and tweaking of core modules produced in year two.
Once again, IATH will convene several management meetings during the year, with the ETC and STG participants, and, as appropriate, with consultants. At least one meeting during the year will also include the UK participants, and we will plan to meet with those UK partners a second time during this year at a conference—such as the annual joint international conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing—which all project participants can reasonably be expected to attend. IATH will also continue to provide daily project communication facilities such as email discussion lists, hypermail archives of those discussions, network conferencing facilities, and an SGML-based document management system for work in progress.
3. STG:
In the final year, while software development will be continuing, more of the basic framework will be complete. STG will continue to work with beta testers to refine the user interface(s), will provide online documentation for users, and will work with the other partners to ensure that there is technical documentation for anyone who may want to extend the system by adding a new analytical module.
4. CCH:
John Bradley will supervise the programmer in further development work. CCH will also participate in the developer and user testing, and in the evaluation activities. The 'developer' work will concentrate on links with other European developers in Italy, Germany and Norway. The user testing will focus on research and teaching staff at King's.
5. HCU:
The HCU will continue to focus on the interface between the specified architecture and protocols and existing software, building on experience gained in Year 2. Oxford will also take the lead among the UK partners in evaluating the success of the prototype tools in effectively exploiting the holdings of the OTA and in meeting the needs of its user community.
6. LTG:
Chris Brew and Henry Thompson will supervise further development work. LTG will take a lead role in evaluating the specified protocols in relation to the development of future tools and tool sets in the context of large language corpora.
The end result of the project should be a widely accepted and extensible architecture for the development of tools for literary and linguistic analysis of structured electronic texts, and a set of functional sample tools for such analysis. In addition to the usual means of research publication, we will publish the architecture specification and the final evaluative reports on the effectiveness of the architecture and tools, conduct training workshops, and make all source code and documentation freely available through a web site that the participants commit to maintaining beyond the end of the project.
V. User Community:
There are two types of 'user community' that may be identified: the 'developers' who will use the specifications and the initial tool set to create new tools; and the 'end users' who will use the tools (the initial set and future ones) to study and analyse digital library materials. For the developers, the open nature of the proposed architecture and the formal specification of protocols will enable the long-held goal of independent collaborative tools development to be realised. It will also become possible for developers to create and combine tools flexibly in response to new scholarly requirements.
The end-user community to be served by this project has several constituent elements. At its core are scholars involved in literary and linguistic computing, text-encoding, and the analysis of encoded texts. This group has been at the heart of humanities computing for decades now, and represents the segment of the humanities community most experienced and most involved with digital library problems and projects. In addition, these tools have the potential to provide an important evaluation of current markup practices in the scholarly text-encoding community.
Next, the tools we develop will serve a wider, less specialized audience interested in more preliminary investigation of digital library materials. These users might include undergraduate and graduate students for whom literary or linguistic analysis is either a discipline in which they are training or a requirement subordinate to some other task, such as critical analysis, literary history, or cultural studies. At Virginia, the patterns of usage we see with the library’s on-line SGML data bear this out: we now see about 125,000 hits a day (40,000 accesses, 11,000 unique hosts), and the bulk of that traffic is non-university—it is general web surfers and high school teachers and students, and they value the same things we do—detailed searching, flexible browsing (browsing by SGML region is popular with the WebTV contingent, who don't want an entire book but want to move from data chunk to data chunk), common interfaces, reliable data. If these patterns of unanticipated use carry over to these tools, once they are embedded in library-based collections of electronic text resources, then literary and linguistic analysis of networked electronic texts may well find new users in an even more general public.
Finally, tools that are capable of providing visualization of the structures within large collections of electronic text could play a key role in the management, maintenance, and development of digital library collections.
VI. Benefits:
1) The scholarly community will be able to exploit the text resources in digital libraries more effectively than at present, being able to take full advantage of any mark-up that has been added to the text.
VII. Project management:
UVa is the lead site, and will co-ordinate U.S. partners. King’s will co-ordinate UK partners. In each partner institution, one individual will have primary responsibility for the project work in that institution, as follows:
UVa (IATH): John Unsworth
Brown (STG): Elli Mylonas
King’s (KCL): Harold Short
Oxford (HCU): Lou Burnard
Edinburgh (LTG): Chris Brew
These individuals together will form the Project Co-ordination Team (PCT). The PCT will be in regular communication, scheduled and informal, and will meet at least twice a year, on the occasions of the full project team meetings. The general project management strategy will be to have a US and a UK 'lead site' for each major strand of activity, matched to areas of specialist expertise and experience, with one of these having the overall responsibility for co-ordination, as follows:
Table 2: (OL=overall lead; L=lead; P=participant)
Task | IATH/ETC | STG | KCL | HCU | LTG
Requirements analysis | OL | P | P | L | P
Architectural specification | P | L | OL | P | P
Data analysis | OL | P | - | L | P
Functional specification | OL | P | L | P | P
Tools development plan | OL | P | L | P | P
Evaluation plan | P | L | P | OL | P
Tools design | P | L | OL | P | P
Tools development | OL | P | L | P | P
UI design & development | P | OL | L | P | P
Documentation | P | OL | P | L | P
Developer testing | P | OL | P | P | L
User testing | L | P | P | OL | P
Evaluation | OL | P | P | L | P
During the three years of the project, we expect to address the various tasks involved in this research on the following general schedule:
Table 3:
Tasks | Months 1-6 | 7-12 | 13-18 | 19-24 | 25-30 | 31-36
Requirements analysis | x | x | - | - | - | -
Architecture specification | x | x | x | x | x | x
Data analysis | x | x | - | - | - | -
Functional specification | - | x | - | - | - | -
Tools development plan | - | x | - | - | - | -
Evaluation plan | - | x | x | - | - | -
Tools design | - | x | x | - | - | -
Tools development | - | - | x | x | x | x
UI design & development | - | - | x | x | x | x
Documentation | x | x | x | x | x | x
Developer testing | - | - | - | x | x | x
User testing | - | - | - | - | x | x
Evaluation | - | x | x | x | x | x
Project management | x | x | x | x | x | x
In the overall US/UK collaboration, our intention is to conduct tightly coordinated parallel development programs, with the US participants attending to the design, production, and documentation of tools for literary analysis and the UK partners doing the same for linguistic analysis tools. This division of labor matches the interests and expertise of the respective partners. There will be one managing partner on each side of the Atlantic—the University of Virginia on the American side, and King’s College London on the British side. These two managing partners will be individually responsible for coordinating the work of their respective compatriots, and will work together to coordinate the overall collaboration. It is hoped that this tiered and focused management structure will streamline planning, communication, and work in progress. Virginia and King’s personnel have some history of collaboration and work well together.
In general terms, we anticipate that architecture and protocol specification will be directed by King’s College (overall lead), with the U.S. lead being Brown, and will include participation from Virginia, principally in the form of a consultant, Tom Horton, whose activities will be coordinated by Virginia. Requirements analysis on the U.S. side will be conducted jointly by Virginia and Brown, coordinated by Virginia (overall lead), and the results of this analysis will be shared with King’s and the UK group, who will be conducting a parallel process directed by Oxford (UK lead). Data analysis will be directed by Virginia (overall lead), with assistance from Oxford (UK lead), and will involve collections at both US sites and at two UK partner sites. Documentation will be a principal responsibility of the Brown group (overall lead), with some review and input from Virginia; on the UK side, Oxford will take the lead in documentation, and will coordinate with the other UK partners. Actual tools development, on the US side, will be done almost entirely at Virginia (overall lead), with some design consultation from Brown; on the UK side, King’s will take the lead in tools development, with some assistance from Edinburgh. Testing will be done at both US sites and at multiple UK sites, with the same parameters. This division of labor follows the expertise and current capabilities of the various partners quite closely.
A number of participants are experienced in planning and management activities, including several who run centers, departments or institutes (Burnard, Seaman, Short, Unsworth) and others with exposure to industrial software development practices (DeRose, Horton). In the first year, we will use this expertise to define a minimal but effective software development process to coordinate the development efforts at the various sites. (We recognize the challenge in balancing the trade-offs between a heavy- and light-weight process. Horton has experience in working in a large company's efforts using the SEI's Capability Maturity Model, but is also actively developing light-weight processes for use in undergraduate software engineering project classes.) Our process will define documents and deliverables to be produced during development. It will address quality assurance issues, including the use of formal technical reviews, requirements verification, and integration testing. As deliverables are developed, we will develop a change control mechanism to allow fixes and improvements to be added to components while causing as little impact to other developers' work as possible. We will make use of tools that facilitate distributed development when appropriate. These will include networked configuration management and repository tools (such as Astoria for SGML documents, or CVS for source code). Of course, the Internet also supports many forms of networked communication, including "meetings" held using AOL Instant Messenger or other chat programs. We may also adopt the use of a more sophisticated tool like WebProject (www.wproj.com), which provides Web-based client-server support for project scheduling, task management, status meetings, etc. Our main and mirrored Web sites will serve as repositories for documents and reports on requirements, designs, etc. that have been completed and reviewed during the project. We envision that this Web component will be similar to that developed by the TIPSTER project (www.tipster.org), and that it will provide up-to-date information for both the project participants and interested outsiders.
Progress will be coordinated by means of email discussion among participants, an SGML-based document management system housed at Virginia, networked conferencing as needed, and face-to-face meetings several times a year, either at meetings called for the purpose, or at other meetings likely to be attended by many or all of the project participants. At each of these meetings, progress to date will be compared to the plan of work presented above. We anticipate that participants will need to make two domestic trips a year, and one international trip.
Our primary goal is to establish the architecture and protocol specification that will make possible subsequent, volunteer development of software tools for literary and linguistic analysis; as a means of testing and refining those specifications, we mean to produce some core modules and some demonstration applications. If, in the course of the project, it becomes necessary to adjust the plan of work given above, we will make those adjustments in accordance with these priorities.
M. Rettig. "Prototyping for Tiny Fingers." Communications of the ACM 37 (April 1994), pp. 21-27.
R. B. France and T. B. Horton. "Applying Domain Analysis and Modeling: An Industrial Experience." Proceedings of the ACM-SIGSOFT Symposium on Software Reusability (SSR'95), Seattle, WA, 1995.