The Electronic Archive of Early American Fiction at the University of Virginia

David Seaman
Director, Electronic Text Center
Alderman Library
University of Virginia
Charlottesville, VA 22903, U.S.A.
Tel.: 804-924-3230 Fax: 804-924-1431
E-mail: etext@virginia.edu
URL: http://etext.lib.virginia.edu

Kendon Stubbs
Associate University Librarian
Alderman Library
University of Virginia
Charlottesville, VA 22903, U.S.A.
Tel.: 804-924-3026 Fax: 804-924-1431
E-mail: kstubbs@virginia.edu

Abstract:

The University of Virginia Library has recently received a $400,000 grant from The Andrew W. Mellon Foundation to digitize and put on the Web 558 rare volumes of early American fiction, and to study the economics of electronic versions of rare books. The two-year project is called the Early American Fiction project (EAF). The texts chosen for the project include first printings of James Fenimore Cooper's The Last of the Mohicans, Edgar Allan Poe's Tales of the Grotesque and Arabesque, Nathaniel Hawthorne's Scarlet Letter, and other novels and short stories. Two versions of each text will be made available: a TEI-conformant SGML-tagged text and color images of the pages of the first editions--a total of 118,000 pages. The project will conclude in 1998 with an economic study of usage of the e-texts compared with usage of the original rare books.

This paper is the first formal progress report on this project, focusing on the challenges and benefits of converting fragile rare books to electronic formats.

Keywords:

Digital Library, Fulltext Databases, Digital Images, Electronic Texts, TIFF, JPEG, SGML, WWW, Information Retrieval, Research Grants, Rare Books

The University of Virginia Library (UVA) is in the first year of a two-year project for 1996-1998 to create electronic texts of rare books and to compare the usage and costs of electronic texts with the usage and costs of original paper texts of rare books. Called the Early American Fiction (EAF) project, the work is focusing on e-texts of a well-defined and comprehensive collection of early American fiction derived from the two standard bibliographies of American fiction. Specific outcomes expected from the project are

electronic texts and images on the World Wide Web of 558 seminal volumes in early American literature
a model process, exportable to other libraries, for creating e-texts of rare books
measurement and analysis of usage and costs of the e-texts and of the originals on which they are based

In the U.S. and Canada the largest university libraries house over a million rare books. From the ancient library in Alexandria, Egypt, to the present time, preserving unique and rare books has been taken as a principal raison d'être of research libraries. But this objective is a costly part of the libraries' mission. At the University of Virginia in recent years it has cost eight times as much on average to acquire a rare book as an ordinary trade book. Because of security and preservation needs, maintenance of rare books is three times as expensive as that of the other book collections. Security also means that physical access to rare books is necessarily restricted. As a result, last year the ratio of rare books used to total volumes in the rare books collection was .03, while the ratio in the general collections was .23 and in the undergraduate library 1.18. On average, 3% of the rare books collection was used, while each undergraduate library book was used more than once. And from the standpoint of the patron, usage of original rare books requires visiting geographical locations to use physical objects, just as 2,000 years ago in the great library of Alexandria.

The World Wide Web offers the possibility of greatly expanded access to computer versions of rare books. The electronic versions offer the added value that every word in the rare books can be indexed and searched. It is possible in an online collection of early American fiction to find in seconds every instance of the word "freedom" for a study of fictional concepts of freedom, while in the original rare books such a search might take years. And while computer images of rare book pages alone can only serve as pointers to the rich actuality of the original physical artifacts, the combination of searchable text and high-resolution color page images provides a detailed and flexible view of the material to teacher and scholar alike.

The UVA Early American Fiction project presents the opportunity to study scholarly use of original rare books and of their computer simulacra, and to determine the extent to which electronic texts of rare books can serve scholars. We expect to create an online collection, to focus on use by faculty and student scholars, and to obtain objective data supporting reliable comparisons of usage of e-texts with usage of original rare books.

This is the first formal report on the EAF project, which has been in progress for about half a year. In this paper we will concentrate on three aspects of the project:

1. the contents of the early American fiction collection;

2. the production of digital images of rare books;

3. the production of SGML-tagged electronic texts.

We will conclude with some remarks on future directions of the EAF project.

The Early American Fiction Collection

In 1776 Ueda Akinari's elegant and sophisticated masterpiece, Ugetsu Monogatari, was published. This work was in the great tradition of 800 years of fiction in Japan, from Ise Monogatari and Genji Monogatari to Akinari's time. 1776 was also the year when America declared its independence and when the earliest American fiction was just beginning to be published. The next 75 years, up to 1850, were marked by some masterpieces of American novels and short stories, such as James Fenimore Cooper's The Last of the Mohicans, Edgar Allan Poe's Tales of the Grotesque and Arabesque, and Nathaniel Hawthorne's Scarlet Letter. In addition to these well-known works, there were also numerous publications popular in their own time but now forgotten. All of these works, however, cast light on the early days of the United States, and they are worth studying in order to examine patterns of thought when the U.S. was still young.

Two standard bibliographies describe classic American literature:

Wright's American Fiction 1774-1850 lists all works of fiction published from the first story up to 1850.[1]

Bibliography of American Literature (BAL) lists the original editions of the most important authors of American literature, as chosen by a committee of the Modern Language Association of America.[2]

The University of Virginia Library is fortunate to have two of the world's major collections of rare first editions of American fiction in its Barrett and Taylor collections. In these collections most of the first editions in Wright and BAL are available. For some editions UVA has one of the few existing copies of the edition. In the EAF project, therefore, we are using first editions from UVA that meet the following criteria:

1. the author is in BAL;

2. the edition is listed in Wright;

3. UVA has a first edition of the work.

When these criteria are applied, the project will cover 421 titles in 558 volumes by 81 authors, containing 118,000 pages.
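The three selection criteria amount to a simple conjunctive filter. The following is a minimal sketch in Python, not the project's actual selection procedure: the `Edition` record, its field names, and the two placeholder titles are hypothetical and exist only for illustration. Wieland is included because it is cited in this paper as appearing in both Wright and BAL.

```python
from dataclasses import dataclass


@dataclass
class Edition:
    """Hypothetical record for one candidate edition (field names are assumptions)."""
    title: str
    author_in_bal: bool          # criterion 1: the author is in BAL
    listed_in_wright: bool       # criterion 2: the edition is listed in Wright
    uva_has_first_edition: bool  # criterion 3: UVA holds a first edition


def eligible(edition: Edition) -> bool:
    """An edition is selected only if all three criteria hold."""
    return (edition.author_in_bal
            and edition.listed_in_wright
            and edition.uva_has_first_edition)


# Illustrative data: only the first entry is a real title; the others are placeholders.
candidates = [
    Edition("Wieland; or, the Transformation", True, True, True),
    Edition("Placeholder title A", True, False, True),
    Edition("Placeholder title B", False, True, True),
]

selected = [e.title for e in candidates if eligible(e)]
print(selected)
```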

The Production of Digital Images of Rare Books

There are two major tasks in the creation of the electronic archive of early American fiction. First is making digital images of every page of each of the 558 volumes. Second is converting the page images to SGML-tagged ASCII text. The final product will be both digital images and searchable tagged texts of every book in the project.

Ordinary, non-rare books can be scanned on a flat-bed scanner, and sometimes the ASCII text can be created by OCR from the digital images. Rare books present the special problem that the physical books must be handled with extreme care. Most rare books cannot be scanned on a flat-bed scanner, for example, because on such a scanner their spines or pages might be damaged. Some of the rare books are so fragile, in fact, that they cannot be opened wider than about 120 degrees. As a result, one of the interesting aspects of this project is to develop methods for large-scale production of digital images of fragile materials. This part of the project is being carried out in the UVA Library's Special Collections Digital Center.

Most of the imaging work is therefore being done with digital cameras mounted above light tables. The camera backs that we are using are Phase One PhotoPhase Plus units. These camera backs provide a maximum resolution of 5000 x 7000 pixels with 24-bit color, and come with high-quality software to control the camera. The backs are attached to Tarsia Technical Industries Prisma 45 4" x 5" cameras, on TTI Repro-Graphic Workstations. The workstations use Lowel Tota lights with 500-watt halogen bulbs on Bogen light stands. With these cameras, one can view, focus, and capture an image without removing the digital camera back. To protect the books, we use book cradles specially designed by John Riser for rare books. The cameras are run by Apple Power Macs, which process the images.

In the EAF project room adjacent to the Special Collections Digital Center [show slide 1 here] we have two Phase One cameras. A full-time digitizing supervisor and part-time student assistants keep the cameras busy all day long, in order to stay on our project schedule. Production work got underway in the first week of January, 1997, and we need to make 118,000 digital images by June, 1998. In an eight-hour day, five days a week, this is about 20 images per hour for each of two cameras, or one image every three minutes. Our production work so far suggests an average of four minutes per image [show slide 2 here]. Of course, within this average there is a range of imaging speeds that depend on the type of book, among other things. For example, a book that is tightly bound or exceptionally fragile takes longer to digitize than a book that can be opened wider on the book cradle [show slide 3 here]. We expect the digitizing speed per page to increase as we gain more experience with the digital cameras.
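The schedule arithmetic above can be reproduced with a short calculation. This is a rough sketch under stated assumptions: the exact start and end dates are guesses consistent with "the first week of January, 1997" and "June, 1998", and no allowance is made for holidays or equipment downtime.

```python
from datetime import date

TOTAL_IMAGES = 118_000   # pages to capture (figure from the text)
CAMERAS = 2              # two Phase One camera stations
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5

# Assumed production window; the paper gives only the week and the month.
START = date(1997, 1, 6)
END = date(1998, 6, 30)

weeks = (END - START).days / 7
camera_hours = weeks * DAYS_PER_WEEK * HOURS_PER_DAY * CAMERAS

# Rate each camera must sustain to finish on schedule.
required_per_hour = TOTAL_IMAGES / camera_hours
required_minutes_per_image = 60 / required_per_hour

# Output if the currently observed pace of four minutes per image held throughout.
OBSERVED_MINUTES_PER_IMAGE = 4
projected_images = camera_hours * 60 / OBSERVED_MINUTES_PER_IMAGE

print(f"required: {required_per_hour:.1f} images/hour/camera "
      f"({required_minutes_per_image:.1f} min/image)")
print(f"projected at {OBSERVED_MINUTES_PER_IMAGE} min/image: {projected_images:,.0f} images")
```

Under these assumptions the required pace comes out close to the "about 20 images per hour" quoted above, and a sustained four-minute pace would fall short of 118,000 images, which is why an increase in digitizing speed matters to the schedule.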

Since a goal of the EAF project is to test whether digital versions of rare books can substitute for the original editions in some cases, we are methodically making images of all parts of every book--the spines, front and rear covers, and all pages of the book, including copies of any blank pages. For each book we do a test sheet on which we film the cover of the book with a ruler and Kodak grayscale and color strips for color comparisons. From the EAF copy of a book, it should be possible to get an idea of the appearance of every part of the book. In the future, we hope to use these images to create virtual reality images of each of the rare books.

The pages are being scanned in 24-bit color at a resolution of 500 dots per inch. The image files are saved in the Tagged Image File Format (TIFF) for long-term storage. Each TIFF is then converted by Photoshop 4.0 software into a high-quality JPEG (Joint Photographic Experts Group) format image and a smaller JPEG for display on the World Wide Web. Both TIFF and JPEG are recognized standard formats for image encoding and delivery. Each page image requires 20 megabytes or more for the TIFF file. The JPEG images are approximately 300 and 100 kilobytes each. On average, therefore, a book in this project (about 211 pages) requires at least 4.2 gigabytes for storing its TIFF images, 63 megabytes for its 300KB JPEGs, and 21MB for its 100KB JPEGs. The entire collection of 558 volumes should require a little less than 2,400 gigabytes of storage for TIFFs, about 35GB of storage for the high-quality JPEGs, and about 12GB of storage for the smaller JPEGs.
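The storage estimates follow directly from the page count and the per-page file sizes. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1,000 MB, 1 MB = 1,000 KB) and treating the rounded per-page sizes quoted above as exact:

```python
PAGES = 118_000
VOLUMES = 558

# Approximate per-page file sizes from the text, in megabytes.
SIZES_MB = {
    "TIFF": 20.0,          # lower bound: "20 megabytes or more"
    "JPEG (large)": 0.3,   # about 300 KB
    "JPEG (small)": 0.1,   # about 100 KB
}

pages_per_volume = PAGES / VOLUMES

# Storage per average volume (MB) and for the whole collection (GB).
per_volume_mb = {fmt: size * pages_per_volume for fmt, size in SIZES_MB.items()}
collection_gb = {fmt: size * PAGES / 1000 for fmt, size in SIZES_MB.items()}

print(f"average pages per volume: {pages_per_volume:.0f}")
for fmt in SIZES_MB:
    print(f"{fmt:13s} {per_volume_mb[fmt]:8.0f} MB per volume "
          f"{collection_gb[fmt]:8.1f} GB for the collection")
```

With these rounded inputs the collection totals come to roughly 2,360 GB of TIFFs, about 35 GB of large JPEGs, and about 12 GB of small JPEGs; actual totals will vary with the real file sizes.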

Both the TIFF and the JPEG images are stored on writeable CD-ROMs. The TIFF CD-ROMs are used for archival storage of those images, and for additional security EXABYTE tapes are used to back up the CD-ROM TIFF images.

The first book completely digitized in the project was Charles Brockden Brown's Wieland [show slides 4 and 5 here].[3]

The Production of SGML-Tagged Electronic Texts

To summarize the preceding section: the 558 volumes that comprise the print originals for this extraordinary electronic text collection will be scanned as high-quality color images, and then "working copy" JPEG files will be created from the archival TIFF image original. These copies are the source for the keyboarding of the text by a commercial service bureau. This method is much cheaper than transcribing the documents ourselves, and much faster. The work is done in large professional typing operations overseas, and is the way in which practically all large commercial e-text projects are created. This part of the project will be overseen by the UVA Library's Electronic Text Center [show slide 6 here].

Both optical character recognition (OCR) and keyboarding are variously useful and reliable methods for the creation of machine-readable transcriptions of print materials. For the EAF project, the obvious practical choice is to use a commercial keyboarding company, because of both the physical nature of the source material and its bulk.

OCR works by taking a digital image of a page of type and interpreting the shapes on it, turning clusters of image pixels into ASCII characters. OCR works well with modern typefaces, and often copes reasonably well with later 19th-century printed matter, but its effectiveness decreases with earlier material. The ability of the software to recognize letters is principally challenged by printing flaws that disrupt the integrity of the letter form, such as uneven inking and broken type; and such features are typical of earlier printed material. For sample typefaces and the results of OCR on them, see the following Web site:

http://etext.lib.virginia.edu/helpsheets/scan-train.html

Even with clean modern type, one or two errors per page are not atypical with OCR. This makes OCR suitable for small runs of material, especially when the output can be effectively corrected using a modern spell-checker. But both the bulk and the non-20th-century spelling of the EAF texts would make OCR output cumbersome to correct, and the typographical features of these books would produce more errors than we would see in modern print items.

The Electronic Text Center receives the page images from the Special Collections Digital Center on CD-ROMs and sends them to the vendor, who creates SGML-tagged texts and returns them to the E-Text Center. The returned texts are then checked, cataloging headers are prepared, and the texts are added to the E-Text Center's Web site.

The workflow for creating e-texts is as follows:

Using images of the pages sent by the E-Text Center, the vendor types in all texts at least twice, and the two versions are electronically compared to catch discrepancies. This double-keying is the accepted "standard" in the humanities text industry at present, and is the way in which works such as the Oxford English Dictionary have been prepared. This results in an accuracy rate of 99.995% or better, that is, no more than one wrong character in 20,000. We estimate that there is an average of approximately 1,800 characters per page in the EAF volumes. This means that the final EAF e-texts will exhibit typographic error at a rate of less than one error per ten pages.
As the texts are created, Standard Generalized Markup Language (SGML) tagging is added to record the physical and structural characteristics of the text: title-page layout, pagination, paragraphs, verse lines, italics, accented letters, etc. The vendor checks the accuracy of the tagging with a computer program that makes sure the tags are properly formed.
Each page of the electronic text has the location of its corresponding image marked, so that the two can be linked together hypertextually.
The texts are returned as they are finished to UVA, where the following happens:
They are spot-checked -- sections are proof-read -- to verify accuracy of input.
The SGML encoding is checked for completeness.
For each work, a standard bibliographical header is created by staff in the Electronic Text Center following the Text Encoding Initiative's (TEI) Guidelines for Electronic Text Encoding and Interchange (P3). The header records all the details of the print source and of the electronic version, and includes some keyword information that will be valuable as the items are searched.
For each pictorial illustration in a work, a standardized description of type and subject matter is created.
The final text is parsed, indexed, and put online on the World Wide Web.
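The double-keying comparison and the resulting error estimate in the workflow above can be illustrated with a short sketch. This is not the vendor's software, whose details are not described here; it uses Python's standard `difflib` module as a stand-in for the electronic comparison, and the mis-keyed sample string is invented for illustration.

```python
import difflib


def keying_discrepancies(first: str, second: str):
    """Return (offset in first, text in first, text in second) wherever two keyings differ."""
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)
    return [
        (i1, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def pages_per_error(accuracy: float, chars_per_page: int) -> float:
    """Expected number of pages per residual error at a given character accuracy."""
    errors_per_page = (1.0 - accuracy) * chars_per_page
    return 1.0 / errors_per_page


# Two independent keyings of the same line; the second has an invented typing error.
keying_a = "Wieland; or, the Transformation"
keying_b = "Wieland; or, the Transfonnation"
differences = keying_discrepancies(keying_a, keying_b)
print(differences)

# Figures from the text: 99.995% accuracy and about 1,800 characters per page.
print(f"about one error per {pages_per_error(0.99995, 1800):.1f} pages")
```

Each reported discrepancy would be resolved by a human checking the page image. The comparison cannot catch a mistake that both typists make identically, which is one reason the texts are also spot-checked after they are returned.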

The final online texts will be searchable, like the other texts at the E-Text Center Web site, as described in the accompanying paper by Seaman and Stubbs on "The Electronic Text Center in the University of Virginia Library" [show slide 7 here].

A Web home page for the EAF project has been established at

http://etext.lib.virginia.edu/eaf

Future Directions for the EAF Project

As online texts become available in the EAF project, two issues will be the focus of attention: cost recovery for the project and measurement of the usage and benefits of the texts.

We have already begun to address the questions of cost recovery by arranging with a commercial publisher to make the texts available. We have contracted with Chadwyck-Healey Inc. to publish The Early American Fiction Collection on CD-ROM and to make the collection available on the World Wide Web as part of Chadwyck-Healey's Literature Online (LiOn) Web service. The online data will be housed on the E-Text Center's server at UVA. Through this arrangement we hope to recover at least some of our investment in the EAF texts, in order to regain funds to use for creating additional e-texts -- probably of American fiction published from 1851 to 1900.

A critical part of the EAF project is assessment of the costs and usage of the EAF texts, as compared with costs and usage of the original rare books. We plan to test the hypothesis that for most uses of rare books, high-quality electronic texts and digital images are adequate substitutes.

In order to survey EAF users, we will put on the World Wide Web in the spring of 1998 a forms-based survey instrument. The online questionnaire will be used to collect information on the demographics, knowledge, attitudes, and behavior of users of the online American fiction. Questions about behavior, for example, will focus on use of the searchable ASCII texts vs. the images of pages of the books. Other questions will elicit information on perceptions of ease of use of online texts compared with original paper texts. People who have used the online American fiction will be asked to fill in the questionnaire. Standard procedures for ensuring the reliability and validity of survey results will be followed, such as follow-up of non-respondents.

These data will be compared with data collected in a survey during spring, 1998, of users of original rare books in the UVA and other libraries. We expect to do a sample survey of users, again covering the topics of demographics, knowledge, attitudes, and behavior. The questions on this survey will be designed to mirror those on the online survey. For example, users will be asked if they are familiar with the online texts and if those texts could satisfy their needs. Demographic questions will elicit information on topics like distance traveled to use the original texts; this information can be used for inferences about costs.

From the segment of the project on measurement, the data should be available to allow us to consider costs per use. Consideration of costs needs to take account of the traditional way of getting at rare books: users travel to where the books are and look at them in a library. That is, the costs of access are mainly paid by the individual user. Even here, however, the maintenance of rare books imposes special burdens and costs on research libraries, especially the older and larger research libraries.

For example, in 1994-95 the unit cost of a purchased monograph in U.S. university libraries was $45.07.[4] In the same year, the UVA Library spent an average of $373 for each purchased rare book. So the typical rare book costs over 8 times as much as an ordinary monograph. And this initial cost disparity persists throughout the life of the books. Conventional wisdom is that it costs three times as much to house and maintain a rare book as a regular book (information from Professor Terry Belanger). So in a cost-per-use model, a typical rare book would need to be used 3 to 8 times as much as a regular monograph for the unit costs of acquisitions and maintenance to be equal. But in fact, the per-volume circulation of rare books is considerably lower than the per-volume circulation of ordinary monographs. As a result, the costs of acquiring, maintaining and providing access to rare books are disproportionately high for research libraries; and for users there is also a cost differential to use rare materials.
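The cost-per-use comparison can be made concrete with the figures given in this paper. The sketch below is illustrative only: it pairs the 1994-95 acquisition costs with the usage ratios reported earlier for the rare books collection (.03) and the general collections (.23), which were not necessarily measured over the same period, and it ignores every cost other than acquisition.

```python
MONOGRAPH_COST = 45.07   # ARL unit cost of a purchased monograph, 1994-95
RARE_BOOK_COST = 373.00  # UVA average cost of a purchased rare book, same year

# Annual uses per volume held, from the usage ratios reported earlier in the paper.
RARE_USE_RATIO = 0.03
GENERAL_USE_RATIO = 0.23


def cost_per_use(unit_cost: float, uses_per_volume: float) -> float:
    """Acquisition cost spread over one year's uses of an average volume."""
    return unit_cost / uses_per_volume


acquisition_multiple = RARE_BOOK_COST / MONOGRAPH_COST
rare = cost_per_use(RARE_BOOK_COST, RARE_USE_RATIO)
general = cost_per_use(MONOGRAPH_COST, GENERAL_USE_RATIO)

print(f"acquisition cost multiple: {acquisition_multiple:.1f}x")
print(f"acquisition cost per annual use: rare ${rare:,.0f}, general ${general:,.0f}")
print(f"rare books cost about {rare / general:.0f} times as much per use")
```

On these assumptions the gap in cost per use is far larger than the eightfold gap in purchase price, because rare books are both more expensive and less used.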

Though the initial cost of creating an e-text of a rare book is also high, the e-text offers considerably greater opportunities of distributed uses, and thus of much lower unit costs per use. From the standpoint of patrons, their costs of travel to UVA to see the first edition of The Scarlet Letter may be made unnecessary by the availability of an online version. And the online edition also accommodates many classes of users who could never access the original text because the cost to them is too high. It is significant that 70% of the books to be put online in the EAF project are not in print, and are available in only a few university libraries.

The EAF project therefore offers an opportunity for testing whether principles of digital libraries can be applied to a very traditional area of librarianship, the area of rare books and special collections. We are grateful to the University of Library and Information Science for offering us this opportunity to make this first report on the EAF project.

1. Wright, Lyle H. American Fiction 1774-1850: A Contribution Toward a Bibliography. Second revised edition. San Marino, CA: Huntington Library, 1969.

2. Bibliography of American Literature, compiled by Jacob Blanck for the Bibliographical Society of America. New Haven: Yale University Press, 1955-1990. 9 volumes.

3. Brown, Charles Brockden. Wieland; or, the Transformation. New York: T. & J. Swords, for H. Caritat, 1798. 1 volume. Wright: 426; BAL: 1496.

4. ARL Statistics 1994-95. Washington: Association of Research Libraries, 1996. Page 46.