Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities. [Waters, 1998]

While experimentation and research are essential to the development of effective digital libraries, it is ultimately formally defined and institutionally supported organizations that will ensure the viability of digital collections.
Wendy Lougee has suggested that we can apply a kind of human developmental model to the four distinct stages of digital library development, beginning with Infancy and culminating in Maturity [Lougee, 1998]. Infancy is marked by learning through projects. Projects not only facilitate learning, but test resistance to ideas, find opportunities, uncover resources and barriers, and eventually create stable building blocks. In Adolescence, we see peer modeling and the exploration of best practices: here, the digital library has begun to look outward, not only to other institutions, but also to methods and formats that will ensure longevity and interoperability. The third stage of development, which she calls the digital library as Young Adult, is where we find the organizations that I wish to explore in this paper. In this stage of development, we find an increasing focus on collaboration, especially for the establishment of standards and architectures for interoperability. Maturity is marked by the presence of a fully functioning digital society, a market economy, and rich collaboration and knowledge environments such as those found in UARC [Finholt, 1995].
The relatively intense focus on digital libraries in the last few years has resulted in a handful of fairly mature production support organizations. Although experimentation for materials such as video continues to be important, these organizations have all successfully moved beyond the early stages marked by projects, work in isolation, and developing support for basic formats and methods. In fact, these mature digital library production organizations exhibit a new set of characteristics that are an important part of their productivity and livelihood.
The University of Michigan's Digital Library Production Service (DLPS) is an operation that exhibits these characteristics. The discussion that follows attempts to illustrate the value of these characteristics by exploring the organization of DLPS and three model systems that it uses to provide a high degree of functionality and cost-effectiveness in building the digital library.
Even this commitment was situated in a set of larger accomplishments, which allowed the formalization of goals for campus digital library initiatives. Prior to the Information Symposium, the University Library had begun to embark on some preliminary and perhaps formative efforts to build digital library components. Wide area access to non-bibliographic information sources (i.e., to the data themselves, rather than metadata) became a focal effort of the University Library in the late 1980s. The University of Michigan Library put in place a modest program of support for statistical data files, began exploring delivery of GIS data, and then ultimately put in place a formal access system for text encoded in SGML.1 The Library had begun to build systems for storing and accessing library collections in electronic formats.
After the Information Symposium, beginning in 1992, the University Library undertook a number of initiatives that laid the groundwork for future work and future relationships. Some of these, like the Library's Gopher server, were unremarkable in their use of technology, but played important roles in information provision and partnerships. Others, like the UM implementation of TULIP, put in place significant pieces of infrastructure that would later prove instrumental in the University's development of digital libraries. The Library's Gopher server was remarkable in at least one respect: its aggressive approach to putting collection material online. Although the Library had undertaken similar efforts before Gopher (cf. UMLibText), the new mechanism allowed Library staff to mount collections like statistical information from the Commerce Department in ways that reached extremely broad audiences [York, 1994]. The strong presence of the Library's Gopher on campus also contributed to early cooperative efforts with the University's computing organization, ITD. TULIP's contributions were even more profound, if less visible.2 Although Michigan's was one of several Elsevier TULIP implementations, at the University of Michigan TULIP spawned a development effort that saw the creation of FTL, the search engine now used by JSTOR, and tif2gif, Doug Orr's optimized GIF generator for TIFF G4 images, now used by the University of Michigan Making of America system.
A broad, campus-wide partnership has played a powerful role in moving the concept of digital libraries forward at the University of Michigan. As mentioned earlier, the Library's Gopher project played an important role in establishing a partnership between the University of Michigan's Library and the Information Technology Division. Like Gopher, the TULIP effort contributed to bringing the Library together with another campus entity, the School of Information. These partnerships were formalized in 1993, and Wendy Lougee was appointed the director of campus-wide digital library efforts. That partnership is a powerful presence today in the University of Michigan's digital library work, with management and advisory committees composed of representatives from each of the organizations, and multi-organizational funding for initiatives.
A number of significant UM digital library efforts began in 1994. Among these is the now organizationally independent JSTOR, which grew rapidly in size and scope, and now plays an important role in shaping our expectations about archiving and retrospective conversion [Guthrie, 1997]. Another major initiative appearing in 1994 was the UM Humanities Text Initiative, which served to expand the earlier UMLibText effort and provide a significant WWW presence for the University's SGML-based text collections [Powell and Kerr, 1997]. Also appearing in 1994 was the University of Michigan's NSF/NASA/ARPA-funded digital library effort, focusing not on production technologies, but on the role of agent technologies and distribution of responsibility for the digital library [Durfee, Kiskis and Birmingham, 1997]. The breadth of these 1994 initiatives provides some indication of the fruitfulness and variety of the evolving environment at Michigan. (figure 1)
It was this growing proliferation of significant digital library activities, activities that had moved well beyond the "experimental" and had begun to reach wide audiences, that contributed most to the recognition of a need for a digital library production organization. Between 1994 and 1996, while the activities mentioned above continued to grow, new efforts at Michigan were introduced, and along with them new models and formats. Among the significant efforts undertaken were UM's MESL implementation, the UM Making of America development, and negotiation of the Elsevier journal content that would later go into the PEAK system [Stephenson and McClung, 1998; Bonn, 1999; Mackie-Mason, 1997]. These initiatives will be discussed in greater detail later, but they brought to the digital library environment at Michigan a variety of significant new elements. Not only were new formats or methods introduced (continuous tone images in the case of MESL, and preservation-oriented monograph conversion in the case of MOA), but several of these efforts helped to highlight the absence of a formally defined organization to support the effort. In 1996, the group that advised Wendy Lougee on campus digital library efforts embraced a plan for, and then committed the resources needed to create, a digital library production service.
DLPS was established with clearly defined areas of responsibility. Not only would it provide long-term support to the growing array of production digital library operations (e.g., the Humanities Text Initiative) at the University of Michigan, but it would undertake a process of articulating and implementing a number of higher level goals. DLPS was made responsible for defining near-term digital library architectures for the campus, primarily refining those mechanisms it had already put in place, and extending them to create a more fully integrated environment. Similarly, it would work to take lessons learned in previous efforts to define appropriate document or data structures for the digital library. This goal was seen as essential for ensuring that our investments made in digitization would have enduring value. DLPS was also made responsible for application development and maintenance in those primary areas of responsibility (e.g., bitonal page image systems, continuous tone color image systems, and encoded text systems). As new formats such as video evolve, it is expected that DLPS will take responsibility for them as well. Finally, DLPS was charged with responsibility for basic operations such as data loading, and ensuring that servers and software have appropriate levels of maintenance.
Mainstream Library staff members provide an array of services critical to the operation of the digital library. Public service staff members provide user support for online collections, as well as end-user instruction. Collection development staff members are responsible for selecting digital collections for local deployment and work with DLPS staff on "E-Teams" to weigh the advantages and disadvantages of alternative means of delivery. Preservation staff, with guidance from DLPS, make determinations of the most appropriate means of digital capture, and then prepare the materials for digital capture (occasionally operating equipment for the actual capture). Mainstream Cataloging staff members create descriptive metadata for locally-converted materials, and specialized digital metadata specialists in the Cataloging Department help guide DLPS in decision-making for mapping between standards, display of bibliographic elements, and related issues. Similarly, DLPS maintains important relationships with other areas of the Library, including the Library Systems Office, Acquisitions, and Special Collections. Although DLPS staff members have responsibilities that touch on all of these areas of library operation, the intention of this design is to ensure that the most qualified staff member performs each task; the intention is not to recreate the Library within DLPS. Consequently, DLPS is fully integrated into the entire Library operation.
Of primary importance in the UM MOA architecture is support for the products of a Library Preservation process. Images in the system are 600 dpi bitonal TIFF G4 files. No derivatives (e.g., GIF or JPEG images) are created or stored, except at the time of a viewing request. When a user requests a page, the system generates a GIF or PDF derivative in real time and without any appreciable delay (typically less than one second). Four levels of resolution in GIF are made available to users, taking into account the wide range of displays and network connections; a 600 dpi PDF version is also made available, primarily for printing. (figure 6) While the number of pages (approximately 3 million by late 2000) is relatively small compared to a typical research library collection, the collection's size, its expected continued growth, and continuing changes in desktop technology (including networking) argue against storing anything but the master images online. Use patterns also suggest that as long as we are able to generate appropriate derivatives in real time, based on user demand, we will significantly reduce management requirements [Price-Wilkin, 1997]. Of course there are still concerns about the appropriateness of TIFF G4 as a preservation-quality surrogate for pages, but the University of Michigan Library believes that this format provides a high quality surrogate for most printed materials.9
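The just-in-time derivative strategy can be sketched in a few lines. The sketch below uses Python and the Pillow imaging library rather than tif2gif, the optimized C tool the production system actually uses; the function name and scaling behavior are illustrative assumptions, not the MOA implementation:

```python
from io import BytesIO

from PIL import Image


def derive_page(tiff_master: bytes, scale: float) -> bytes:
    """Generate a GIF derivative of a bitonal TIFF page image on demand.

    Illustrative sketch: nothing but the 600 dpi master is ever stored;
    each requested resolution is produced at viewing time.
    """
    master = Image.open(BytesIO(tiff_master))
    # Work in grayscale so downscaling can anti-alias the bitonal source.
    gray = master.convert("L")
    w, h = gray.size
    target = (max(1, round(w * scale)), max(1, round(h * scale)))
    reduced = gray.resize(target)
    out = BytesIO()
    reduced.save(out, format="GIF")
    return out.getvalue()
```

Because derivatives are never stored, offering an additional resolution level is simply a matter of another scale factor; only the masters consume permanent disk space.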
Page images from the Preservation conversion process are subsequently treated by automated Optical Character Recognition (OCR), and the OCR output is associated with the page image using a simple form of SGML. The extensibility of the MOA system has in fact been tested through our OCR processes: two generations of OCR have been applied to all of the materials without the need to change the system architecture. The current OCR technology used by DLPS is a voting system, providing the MOA system with significantly higher-quality character representation than typical OCR.10 The system exhibits approximately 99.8% accuracy for nearly all content-bearing pages (e.g., excluding pages with engravings and textual pages such as the title page and advertisements) [Bicknese, 1998]. The SGML applied to the text is XML-compliant, and provides information such as image location, page type (e.g., table of contents), "confidence" of the OCR, and page number. (figure 7)
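A minimal sketch of the kind of page-level record described above, built with Python's standard XML tools; the element and attribute names here are hypothetical stand-ins, not the actual MOA DTD:

```python
import xml.etree.ElementTree as ET


def page_record(image_ref: str, seq: int, page_type: str,
                confidence: int, ocr_text: str, page_no: str = "") -> ET.Element:
    """Associate one page image with its OCR text in an XML-compliant element.

    Attribute names are illustrative: REF points at the image file, SEQ is
    the page's position in the volume, TYPE marks pages such as a table of
    contents, CNF records OCR confidence, and N holds the printed page number.
    """
    attrs = {
        "REF": image_ref,
        "SEQ": str(seq),
        "TYPE": page_type,
        "CNF": str(confidence),
    }
    if page_no:
        attrs["N"] = page_no  # printed page number, when one is present
    elem = ET.Element("P", attrs)
    elem.text = ocr_text
    return elem
```

Because the record is well-formed XML, a new generation of OCR can simply regenerate the text content and the CNF attribute without touching the rest of the architecture.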
The automatically generated SGML in MOA is also largely consistent with the Text Encoding Initiative (TEI) Guidelines, allowing a full integration with DLPS' encoded text efforts. While certain attributes such as those mentioned above (e.g., OCR "confidence") have been added to the MOA SGML Document Type Definition (DTD), it is otherwise entirely consistent with the TEI. This has allowed DLPS' encoded text operations in HTI to extract individual texts and upgrade them, correcting OCR and applying fuller encoding. Because the MOA system is fundamentally TEI-compliant, it can accommodate both the loosely encoded texts and texts with more detailed encoding. As resources are available to HTI, materials can be enhanced in MOA, ensuring better retrieval and higher levels of functionality for users. (figures 8-9).
Extensibility is critical for a system of this size and importance. The UM MOA system has been designed to be augmented in a variety of ways without significant overhaul. For example, DLPS has regenerated OCR for MOA without interruptions in service, thereby improving retrieval. Texts can be augmented by HTI, as discussed above. New texts can be added as Preservation resources allow, and all indexing and preparation of Web pages is automated. The underlying body of materials can remain largely unchanged while work on enhanced interfaces takes place-a process that took place between November 1998 and January 1999. We believe that the MOA system is a model of scalability and extensibility.
The success of the MOA model has been manifold. It has allowed us to add materials, a process now underway with MOA4 (2.3 million additional pages). It has helped reduce HTI costs, providing a readily accessible surrogate for the encoding process; in 1998-1999, HTI will add 100 more texts to the American Verse Project collection (see http://www.hti.umich.edu/english/amverse/). HTI works with the MOA system and contract services to keep its overall cost down and ensure future integration of its products with a Preservation surrogate. Perhaps most significant among MOA's successes, however, is the level of use by a broad range of users. While the printed source materials were largely unused (most had not circulated in more than ten years), in their online format they are searched some 100,000 times each month, and approximately 100,000 page images are displayed. The constant stream of positive user responses comes from genealogists, philologists, and academics alike.11
Image Services uses a mapping strategy for its metadata, keeping the native field names and data but simultaneously "tagging" each field with a corresponding value in a Dublin Core-like scheme. This allows Image Services to provide collections in two ways: first, as a highly functional version of the database maintained by the contributing organization (e.g., the University of Michigan Museum of Art); and second, with generic field labels common to all of the images in the system. Thus, for example, the Museum of Art might use the field label "Artist," while the History of Art Department might use the field label "Source." Having selected only the Museum of Art data, the user would be presented with search options for "Artist," and data would be displayed accordingly. However, if the user searches across the entire collection, the Dublin Core value "Creator" is presented as a search option and in the display of records. (figure 10) This, along with features for tailoring display and functionality (discussed below), ensures that a contributing organization such as the Museum of Art will enjoy the benefits of a powerful and highly functional system for its data. Bringing together a wide variety of administratively separate collections on campus is challenging, but we believe that the incentive of a cost-effective, highly functional host service (rather than, for example, administrative mandates) has been very effective in creating a unified database managed by DLPS.
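The dual-labeling strategy might be sketched as follows. The two-collection mapping table and the field names are hypothetical; the real system maps many more fields, across many more collections:

```python
# Hypothetical native schemas for two contributors; each native label is
# paired with its Dublin Core equivalent at load time.
DC_MAP = {
    "Museum of Art": {"Artist": "Creator", "Title": "Title"},
    "History of Art": {"Source": "Creator", "Work": "Title"},
}


def tag_fields(collection, native):
    """Keep native labels, tagging each field with its Dublin Core
    equivalent (unmapped fields fall back to 'Description')."""
    mapping = DC_MAP[collection]
    return [(label, mapping.get(label, "Description"), value)
            for label, value in native.items()]


def search(records, label, term, cross_collection=False):
    """Match by native label within one collection, or by Dublin Core
    label when searching the whole unified database."""
    idx = 1 if cross_collection else 0
    return [r for r in records
            if any(f[idx] == label and term.lower() in f[2].lower()
                   for f in r["fields"])]
```

A single stored record thus answers both kinds of query: the Museum of Art's users search "Artist," while a cross-collection user searching "Creator" reaches the same data.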
DLPS Image Services uses a unified database to ensure a high level of performance and flexibility. Although many of the collection providers maintain local management systems that could, in principle, be brought online directly, the diversity of those systems (Oracle, FileMaker, Embark, and others) would present a serious challenge to creating a distributed search feature. Instead, data are extracted from each of these systems on a periodic basis and then ported into the DLPS Image Services system. (figure 11) Using the methods described above, Image Services is still able to provide access to each collection as if it were an isolated, separately managed collection; the federating approach does not force a lowest-common-denominator effect. Instead, the significant resources of DLPS can be leveraged to provide faster search mechanisms, low-cost RAID, and around-the-clock maintenance of the database. These too are effective incentives for organizations considering whether to mount their data through a central campus agency such as DLPS.
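The periodic-extract approach can be illustrated with an in-memory SQLite database standing in for the unified Image Services store; the table schema and collection names below are assumptions made for the sketch, not the production design:

```python
import sqlite3


def load_extract(conn, collection, items):
    """Load one contributor's periodic extract into the unified table.

    `items` is a list of (item_id, {field: value}) pairs, as might be
    dumped from Oracle, FileMaker, Embark, or another local system.
    """
    conn.execute("""CREATE TABLE IF NOT EXISTS images
                    (collection TEXT, item_id TEXT, field TEXT, value TEXT)""")
    conn.executemany(
        "INSERT INTO images VALUES (?, ?, ?, ?)",
        [(collection, item_id, field, value)
         for item_id, fields in items
         for field, value in fields.items()])
    conn.commit()
```

Each contributor keeps its native system of record; the central copy exists only to be indexed, searched, and maintained on DLPS hardware.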
The Image Services system is very rich in functionality. Primarily to satisfy the variety of needs created by the diverse collections, Image Services has built its system around templates. By specifying (and possibly even creating) a different template, one can make the system appear radically different. For example, the primary interface presented to users is a search interface, with results appearing as a collection of thumbnails with associated descriptive information. (figure 12) Each thumbnail and descriptive label is, of course, a link to a fuller resolution view and more descriptive data. However, by invoking another template such as that for "slide shows," the system can be made to appear as a set of larger images in a pre-selected sequence. (figure 13) Another template provides an interface for comparison of multiple images, critical for the use of art and architectural images. (figure 14) Each high-resolution image is ultimately displayed with a pan-and-zoom interface. (figure 15) Users with low-speed network connections or low-end displays are served relatively small segments of high-resolution images across the network, and can pan directionally to view more of the image. This feature has been extremely useful to a wide variety of users, including our papyrologists, who use the system to work with extremely high resolution images of papyri and are rarely equipped with the computing resources needed to bring up the full-resolution image.
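The template idea reduces to dispatching the same result set through different render functions. The functions and field names below are hypothetical simplifications of the production templates, which are server-side files selected per request:

```python
# Hypothetical render functions: the same search results can be presented
# as a thumbnail grid or as a slide show simply by choosing a template.
def render_thumbnails(images):
    """Search-results view: each thumbnail links to a fuller view."""
    return "\n".join(
        f'<a href="{i["id"]}"><img src="{i["thumb"]}">{i["label"]}</a>'
        for i in images)


def render_slideshow(images):
    """Slide-show view: larger images in a pre-selected sequence."""
    return "\n".join(f'<img src="{i["large"]}">' for i in images)


TEMPLATES = {"results": render_thumbnails, "slideshow": render_slideshow}


def render(view, images):
    """Dispatch on the requested template name; adding a comparison view
    means adding one more entry to TEMPLATES, not rewriting the system."""
    return TEMPLATES[view](images)
```

New presentations (a comparison view, a pan-and-zoom viewer) become new entries in the dispatch table rather than new systems.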
The approach taken by Image Services is extremely scalable. Adding other collections incurs only small, marginal costs (e.g., RAID and time to process the new collection) rather than requiring us to build a new system for each. We are also able to add new functionality easily, adding "modules" or subroutines to the current middleware rather than re-writing the programs each time. At the same time, the system provides a high degree of functionality for a wide variety of users, including those who own and maintain the collections represented.
Critical to our being able to support the research mission was our ability to use known technologies to bring the journals online. The University of Michigan had prior experience with the Effect Specifications (i.e., Elsevier's format for delivering image, OCR, and metadata) during the TULIP experiment, and as mentioned earlier, UM developed several significant tools to make the TULIP journals available. Notably, Ken Alexander had developed the search engine, FTL, which was subsequently used and refined in JSTOR, and Doug Orr had developed tif2gif to enable real-time generation of GIF derivatives from the TIFF G4 images. (figures 16-19) The use of these tools in previous environments (e.g., TULIP and the early days of JSTOR development at UM) allowed us to look past these critical hurdles and instead to focus on supporting the research model. The availability of the known technologies also allowed us to invest energies in putting in place database mechanisms for authentication and for subscription/purchase control information, as well as new methods for compression to handle the large amounts of material. Material began to arrive in late summer, 1997. Within a few months, the system was ready for use at the University of Michigan, and by the beginning of 1998, it was released to the subscribing institutions.
With Elsevier's cooperation, Michigan contracted with eleven other institutions to deliver the 1,200 journals under several novel subscription models. Using a model designed by Prof. Jeffrey Mackie-Mason, four types of access were put in place:
In implementing PEAK, our production technologies and especially our production organization allowed us to extend the digital library more fully into the University's mission of research and teaching. Independence from Elsevier was critical in order for us to be able to test these models, and the body of Elsevier materials was equally important to ensure that users would have a valuable body of materials that would draw them into the research environment. The ultimate control and flexibility of the local production environment allowed the University of Michigan to perform research that would probably not have otherwise been possible, or could not have been performed in ways that the researcher stipulated.
The effective digital library production organization must be fully integrated into the campus's academic mission, and especially into the mission and functions of the library. This can only be done by situating the digital library production organization in the library. Too many of the processes and resources needed to support the digital library are already part of libraries, and the principles of information organization and management are an essential part of librarianship.
Success in creating digital library production organizations like those I describe here will also increase the probability that we will successfully federate digital library resources. The holistic approach creates not only economies of scale, but also important opportunities for integration. At Michigan, this approach is leading to the elaboration of an architecture in which every digital object is managed in highly functional ways that ensure the long-term maintenance of that object. This architecture also brings these resources together in ways that are transparent to the user, but which ensure tight integration of multimedia resources (figure 21). A focus on overall architecture is essential, and again can only emerge from testing hypotheses in production organizations.
Finally, sustainability is one of the key issues of the digital library, and an issue that also argues for the presence of the production operation. Certainly, as an ideal, many of us can readily embrace the notion that decisions must be made with regard for long-term value of the digital object. It is only through a permanent production organization-an organization with funding in a base budget, with open-ended appointments for staff, and with long-term responsibility for maintenance and migration-that this ideal can be supported.
Bicknese, Douglas. "Measuring the Accuracy of the OCR in the Making of America." Report prepared in fulfillment of Directed Field Experience requirements, Winter 1998, University of Michigan School of Information. See http://www.umdl.umich.edu/moa/moaocr.html.
Bonn, Maria. "Building a Digital Library: The Stories of the Making of America." Forthcoming in The Evolving Virtual Library: More Visions and Case Studies, ed. Laverna Saunders. Information Today, Inc. See also http://www.umdl.umich.edu/dlps/mbonn-saunders.html.
Durfee, Edward, Daniel Kiskis, and William Birmingham. "The Agent Architecture of the University of Michigan Digital Library." IEE Proceedings: Software Engineering, vol. 144, no. 1, Feb. 1997, pp. 61-71.
Finholt, Thomas. "Evaluation of Electronic Work: Research on Collaboratories at the University of Michigan." SIGOIS Bulletin, vol. 16, no. 2, Dec. 1995, pp. 49-51.
Guthrie, Kevin. "JSTOR: From Project to Independent Organization." D-Lib Magazine, July/August 1997. See http://www.dlib.org/dlib/july97/07guthrie.html.
Information and People: A Campus Dialogue on the Challenges of Electronic Information. Final Report of the Information Symposium. Ann Arbor: School of Information and Library Studies, March, 1991.
Lougee, Wendy. Presentation at the School for Scanning conference, 1998.
Mackie-Mason, Jeffrey and Alexandra Jankovich. "PEAK: Pricing Electronic Access to Knowledge at the University of Michigan." Presented at the Elsevier Electronic Subscriptions conference, October 1996. Library Acquisitions 21 (Fall 1997): 281-95. See also http://www-personal.umich.edu/~jmm/papers/PEAK/.
Powell, Christina Kelleher and Nigel Kerr. "SGML Creation and Delivery: The Humanities Text Initiative." D-Lib Magazine, July/August 1997. See http://www.dlib.org/dlib/july97/humanities/07powell.html.
Price-Wilkin, John. "Just-in-time Conversion, Just-in-case Collections: Effectively Leveraging Rich Document Formats for the WWW." D-Lib Magazine, May 1997. See http://www.dlib.org/dlib/may97/michigan/05pricewilkin.html.
Price-Wilkin, John. "Text Files in Libraries: Present Foundations and Future Directions," Library Hi Tech, Consecutive Issue 35, (1991)7-44.
Shaw, Elizabeth and Sarr Blumson. "Making of America; Online Searching and Page Presentation at the University of Michigan." D-Lib Magazine, July/August 1997. See http://www.dlib.org/dlib/july97/america/07shaw.html.
Stephenson, Christie, and Patricia McClung. MESL: Delivering Digital Images. Cultural Heritage Resources for Education. Los Angeles: The Getty Information Institute, 1998.
Waters, Don. "A Working Definition of Digital Library," on the Digital Library Federation Web site. http://www.clir.org/diglib/dldefinition.htm.
York, Grace. "A Facelift for Tradition: Mainstreaming Government Information on the Internet," in Proceedings of the 3rd Annual Federal Depository Library Conference, April 20-22, 1994, pp. 133-139.
York, Grace. "Out of the Basement: The Internet and Document Public Services," in Proceedings of the 7th Annual Federal Depository Library Conference, April 20-23, 1998, pp. 170-176.
I have just briefly tried MOA and it is the most amazing, spectacular research tool since the Xerox machine. It is what I assumed the future of libraries would be, but to be quite honest, I never believed I would live to see so much of the past put on-line in such an accessible form. Business data sure, but history?? 'Paradigm shift' is almost too limited a term. To be able to search for any word and pull the document up on the screen (and print it out) boggles the mind. I have a book at the publishers now, and realize that I am going to have to pull the manuscript until I get a chance to use your database. Congratulations, your founders have the thanks of professional historians and students for their foresight into what is clearly the future of the past.