Selecting Libraries, Selecting Documents, Selecting Data
Michael Buckland
School of Information Management & Systems
University of California, Berkeley, CA, USA 94720--4600
Christian Plaunt
Advanced Technology Group, Apple Computer, Inc.
One Infinite Loop, MS 301--3KM, Cupertino CA 94015
Abstract:
We use ``selecting'' as a general term for selection processes, including
filtering, retrieval, routing, and searching. The search for recorded
knowledge in a digital library environment is examined in terms of selection
at three levels:
- Selecting which library (repository) to look in;
- Selecting which document(s) within a library to look at; and
- Selecting fragments of data (text, numeric data, images) from
within a document.
These tasks with their differing problems have, historically, been treated as
separate and different. Examination and comparison of these three processes
reveal similarities and differences between the three levels. The three
selecting processes are fundamentally the same in theory. The differences in
practice are seen as arising from differing deficiencies in internal structure
or lack of metadata. Identification of these deficiencies provides a basis for
an agenda of research and development.
Introduction
This paper examines the process of selection at three levels: Selecting
libraries to look in; selecting documents within a library; and selecting data
within documents. We use the term ``selecting'' as a general term to include
filtering, retrieval, routing, and searching, all variations of search in
which the outcome is uncertain, which distinguishes ``selecting'' from
``look-up''. ``Data'' is used here to mean parts or subsets of a document:
any fragments of text, images, numbers, or other symbols from which we may learn.
Libraries, documents, data: This is the way libraries are used. First select a
library to visit, then select a document, then select which part or parts of
it to read. Sometimes we only look at one very small part; sometimes we may
read through all the parts.
A Changed Environment
The task of selecting which library to look in is not, of course, new, but it
has taken on a new importance. Much research and development concerning
digital libraries is, understandably, concerned with the creation of digital
libraries (``repositories''). Yet this situation is, in a sense, ironic
because a central, distinguishing feature of the emerging digital library
environment is that digital networking reduces the dominating ``localness'' of
paper-based library technology. With collections of documents on paper, it is
of great importance that copies of the documents that you will need are in
your local collection. But, to the extent that we move to a digital library
environment, the significance of the local library collection
diminishes. Improved ease of access to remote library collections makes the
use of non-local (digital) libraries more feasible and more attractive. With
the World Wide Web, it matters little where the web site is located. Note that
we are referring to libraries' collections, not to other library services and
not to librarians.
The improved access and, therefore, increased use of remote libraries makes
selecting which library to use a much more important activity. It is, however,
a problem that has received little systematic attention, perhaps because it is
a library users' problem more than a library provider's problem.
Descriptive directories of libraries and their collections have existed for
centuries, but we think that they have not been used much. There have also
been some quantitative measures of the relative strengths of collections,
e.g. ``Shelflist counts,'' based on counts of shelflist cards in a library for
each of numerous subject categories defined in terms of the Library of Congress
Classification numbers and, more recently, the ``Conspectus'' approach in
which, for each collection, the character as well as the size of the holdings
on each topic is evaluated.
Libraries' internal systems have also changed in a manner relevant to our
discussion. For years the objective was to connect the online catalog with the
library's internal technical services system. Now the interest is in evolving
the online catalog into a gateway to online resources everywhere, not only to
books and serial titles held locally but to other collections elsewhere and,
through indexing and abstracting services, to articles in periodicals. The
library's online catalog is increasingly designed to provide access to
resources at all three levels: to remote library collections; to documents;
and to the extent that they are accessible, to parts of documents [11].
Selecting Libraries
The retrieval process is commonly shown as a Recall Curve, showing the
cumulative increase in the number of relevant documents found as the search is
expanded to retrieve more documents. A conventional cumulative Recall graph is
shown in Figure 1.
Retrieving documents randomly from a collection would yield the straight
diagonal line D. Better-than-random retrieval yields a convex curve, R. With
better-than-random retrieval, marginal retrieval diminishes as retrieval
continues, yielding a cumulative Recall curve of the shape shown in Figure
1. The better the retrieval effectiveness (the better the selecting), the more
convex the curve R becomes and the greater the separation of curve R from line
D in the direction of the arrow. (For a more detailed discussion of Recall
curves see [4].)
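As an illustration (ours, not the paper's), the cumulative Recall curve described above can be computed from a ranked result list. The function and the example ranking are invented for demonstration.

```python
# Illustrative sketch: computing a cumulative Recall curve from a ranked
# result list. `ranking` marks whether each retrieved document is relevant;
# the curve gives, at each cutoff, the fraction of all relevant documents
# found so far.

def recall_curve(ranking, total_relevant):
    """Return cumulative recall after each retrieved document."""
    found = 0
    curve = []
    for is_relevant in ranking:
        if is_relevant:
            found += 1
        curve.append(found / total_relevant)
    return curve

# A better-than-random ranking front-loads the relevant documents,
# producing the convex curve R; random retrieval approximates line D.
print(recall_curve([True, True, False, True, False, False], 3))
```

A ranking that places relevant documents early climbs steeply at first and then flattens, which is exactly the diminishing marginal retrieval the text describes.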
The problem of selecting libraries can be defined with three decisions:
- Which library to look in first,
- Which library to look in next, and
- When to stop looking.
Each decision can be seen as a comparison of the probable marginal benefit of
searching in the next library compared with the probable marginal cost. The
marginal benefit can be formulated in terms of the number of relevant
documents not previously found, i.e. the complement of the set of relevant
documents in the prospective next library to be searched relative to the union
of the documents already found in the set of libraries already searched. The
decision has two components: Deciding which library would have the highest
benefit-cost ratio and then deciding whether undertaking the search is likely
to be worthwhile. The problem can be seen as a special case of Search
Theory. (For a more detailed discussion of selecting libraries see [3].)
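The decision rule just described can be sketched as a greedy procedure. This is our own minimal illustration of the idea, not an implementation from [3]; the library names, relevant-document sets, costs, and the threshold parameter are all invented.

```python
# Hedged sketch of the decision rule above: greedily pick the library with
# the highest expected marginal benefit-cost ratio, and stop when no
# remaining library is expected to be worth its search cost.

def select_libraries(libraries, costs, min_ratio=1.0):
    """libraries: dict name -> set of (estimated) relevant document ids.
    costs: dict name -> search cost. Returns the search order chosen."""
    found = set()          # relevant documents already retrieved
    order = []
    remaining = set(libraries)
    while remaining:
        # Marginal benefit: relevant documents NOT already found.
        def ratio(name):
            return len(libraries[name] - found) / costs[name]
        best = max(remaining, key=ratio)
        if ratio(best) < min_ratio:
            break          # stopping rule: next search not worthwhile
        order.append(best)
        found |= libraries[best]
        remaining.remove(best)
    return order

libs = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {1, 2}}
print(select_libraries(libs, {"A": 2.0, "B": 1.0, "C": 1.0}))  # ['B', 'C']
```

Note that library A, though largest, is never searched: once B and C have been searched, A's marginal benefit falls to zero, so the stopping rule ends the search.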
The benefit of the search as it progresses through any given set of libraries
in any given order can be expressed as the cumulative number of relevant
documents found. See Figure 2.
Selecting at Three Levels: Libraries, Documents, Data
The cumulative benefit curve in Figure 2, showing how the number of relevant
documents found increases as more libraries are searched, looks like the
conventional cumulative Recall curve in Figure 1. This resemblance invites
comparison of the two different tasks: Selecting libraries from a set of
libraries and, the usual focus of attention, selecting documents from within a
collection of documents. And, if we can compare conventional document
selection with selection at a higher level of aggregation --- selecting
collections (libraries) of documents --- why not also look at a lower level of
aggregation: at the contents of documents. After all, it is not usually the
totality of a document that is of interest, but, rather, one or more
individual pieces within it. We call this selecting data and we mean the
selecting of passages of text, images, numeric data, or other fragments from
within a single document. (Discussions in the literature distinguishing ``data
retrieval'' from ``document retrieval'' need to be treated with caution
because the distinction between ``data'' and ``document'' has sometimes been
confused with the difference between ``known item searches'' and ``subject
searches'' [1, p. 105].) In this discussion the definitions should be clear:
Here we are concerned with the selection (in effect the ordering) of objects
at three different levels of aggregation:
- Selecting libraries: selecting one or more libraries (collections of
documents) from some set of libraries.
- Selecting documents: selecting one or more individual documents
from within a library (a collection of documents), and
- Selecting data: selecting one or more parts from within a single
document.
We have simply defined three levels of aggregation. There are others. At a
higher level of aggregation, one could select a network (set) of libraries
among several networks. In the other direction, pieces of text, etc. (our
``data'') can themselves be divided into phrases, words, characters, bits, or
pixels. Further, a library's collection is often divided into a set of
collections; a document is often composed of smaller documents
(e.g. a periodical is composed of multiple volumes); and so on. For this
discussion we use the three levels noted above: Libraries (collections),
documents, and data (parts of documents).
An Analytical Model of Selection
So far we have taken an empirical approach, examining selection at each level
and showing that there are similarities in the structure even though there may
be practical difficulties with the structure or metadata at each level. We
could have reached the same conclusion from first principles. What follows is
a development of ideas first discussed in Kyoto in 1992 [2] and presented in
much more detail by Buckland & Plaunt [5].
Analysis of the components of selection systems reveals that even the most
complex filtering and retrieval systems can be represented (modeled) in terms
of three primitive elements: the notion of collections (loosely, sets) and two
types of functional operations on those collections:
- One type of function transforms collections by deriving a new,
transformed collection, which is distinct from, but corresponds to, the
original collection. For example, a library cataloger takes a collection
of books and derives a collection of catalog records that is a
representation of that collection of books. Similarly, the SMART system
takes a collection of textual documents and derives a collection of
vectors representing them.
- A second kind of function does not change the members of the collection
but rearranges or partitions them into a new arrangement. In particular,
the purpose of information retrieval systems is to re-order a set of
documents into either a weak order (where the ``retrieved'' documents
precede the ``not-retrieved'' ones) or into a finer ordering (as in the
case of ranked retrieval) where the order corresponds to the probability
of relevance of each document.
It appears that all selecting systems at all levels can be modeled as a
sequence of operations on collections, using these three primitive components
iteratively. In this way we find, by a theoretical path, an underlying
similarity in selection processes at any level.
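The two kinds of function can be sketched concretely. The following is our own toy illustration of the model, not the SMART system or any system from the paper: a transforming function derives a crude bag-of-words representation, and an ordering function ranks the collection by overlap with a query.

```python
# Minimal sketch of the three primitives: collections (here, dicts of
# documents), a transforming function, and an ordering function.

def derive(collection):
    """Transform: derive a new collection of representations (crude bags
    of words), distinct from but corresponding to the original."""
    return {doc_id: set(text.lower().split())
            for doc_id, text in collection.items()}

def order(representations, query):
    """Order: re-order document ids by word overlap with the query,
    best match first. The members themselves are unchanged."""
    q = set(query.lower().split())
    return sorted(representations,
                  key=lambda d: len(representations[d] & q),
                  reverse=True)

docs = {"d1": "digital library selection",
        "d2": "paper collections",
        "d3": "selecting digital documents"}
print(order(derive(docs), "digital selection"))  # ['d1', 'd3', 'd2']
```

Even this toy system is a sequence of the two operations applied to a collection, which is the point of the model: more elaborate systems iterate the same primitives.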
Selecting Libraries: The Unit of Benefit
A conventional document retrieval cumulative recall curve measures, by
definition, the increase in the number of distinct relevant documents
retrieved as the search is expanded through the collection as shown in the
cumulative curve in Figure 1. We have noted the similarity to it of the graph
of the selection of libraries in Figure 2, but, in a sense, Figure 2 should
not be this way. The direct analog of a conventional recall curve at the
library search level would be to have the vertical scale showing a cumulative
count of relevant libraries as the search extends through a set of libraries,
not of documents. This could have been done, but it makes no sense to do so
because the contents of libraries are (or can be) known in more detailed terms,
in numbers of relevant documents.
Measuring libraries by whether each library is relevant or not-relevant
seems excessively crude and raises the question of what threshold of relevance
should be used. Would any amount of relevance, even a tiny one, be enough to
cause a library to be deemed a relevant library? It is much better to avoid this
crudeness and use a more refined, a more detailed measure of benefit by using
a metric from the next lower level in the hierarchy, the number of relevant
documents retrieved. We can measure the marginal benefit of searching a
library, not by counting relevant and non-relevant libraries, but with
calibration at the finer level of the units at the next lower level of
analysis, documents.
A Finer Measure of Benefit in Document Selection?
If it makes sense to use a finer unit of measurement from the next lower level
in our hierarchy for library selection, why not do the same in document
selection? Might it not also, by the same argument, be excessively simple to
judge documents as simply relevant or not relevant? Again, how ``relevant''
does it have to be --- or how much of it has to be relevant --- for the whole
document to be deemed relevant? By the reasoning we applied in our discussion
of libraries, it would seem better to move down a level and compare documents
in terms of how much of each one is of marginal benefit.
One might say that probabilistic retrieval systems avoid the crudeness of
binary relevance judgments because they generate a ranking of documents with
respect to relevance, but this is not the case. What is generated are
probability estimates that a document is or is not relevant, which is not the
same as a measure of the relevant content of a document. Binary relevance
judgments are still being used.
But comparing documents in a metric of the next lower level in our hierarchy
(data) has problems. Library collections can be divided rather easily into
their constituent elements, documents, but documents themselves are not so
easy to subdivide with respect to relevance. There are physical divisions
(pages and lines) and there are rhetorical divisions (chapters, paragraphs,
and sentences). Back-of-book indexing usually refers to pages, but the
physical divisions are largely irrelevant to the subject matter and the
rhetorical divisions are only weakly and inconsistently related to
intellectual content. Some exceptions can be found: Bibliographies and
encyclopedias are documents composed of recognizable, discrete elements.
Other documents commonly follow standardized rhetorical forms, but in ways
that are much less clearly and usefully defined.
The advantages of dividing documents into separate conceptual units, into
hypertextual nodes, were recognized in the seventeenth century and received
detailed attention from Hartlib and Drury and from Leibniz. This interest was
renewed when there was interest in applying modern technology to information
management [14]. Already by 1911 the German chemist Wilhelm Ostwald and his
colleagues in an organization called Die Bruecke (The Bridge) were discussing
the ``monographic principle,'' their name for the hypertextual division of
documents into smaller units which could then be organized independently
[6,17]. They saw such a development as similar in nature and in significance
to Gutenberg's introduction of movable printing type. Commenting on the ease
with which bibliographies lend themselves to being subdivided and even
rearranged, Paul Otlet wrote in 1918 [13,12] of the need
``...to detach what the book amalgamates, to reduce all that is complex
to its elements... This is the `monographic principle' pushed to its
ultimate conclusion. ... What is a book, in fact, if not a continuous
line which has initially been cut to the length of a page and then cut
again to the length of a justified line? Now this cutting up, this
division, is purely mechanical; it does not correspond to any division
of ideas.'' [12, p. 149]
Otlet foresaw ``selection machines'' searching among these smaller units of
recorded knowledge and workstations for manipulating and processing them [12,
p. 150], but how to divide documents into conceptual pieces remains an
unsolved problem. There is research in this direction and some of it explores
how parts of documents might be used in various information retrieval
tasks. Examples include automatic discovery of the topic structure of text
[7], use of rhetorical boundaries (e.g. chapters, paragraphs, and sentences)
to improve retrieval performance [8], and the automatic creation of hypertext
links within documents [16]. However, their effective use outside the
laboratory may still be some way off.
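A simple form of data selection can nonetheless be sketched. The following is our own illustration, not a method from the cited research: it treats blank-line-separated paragraphs as stand-ins for conceptual units and scores each against a query, with the caveat, noted above, that such rhetorical divisions only weakly reflect intellectual content.

```python
# Hedged sketch of selecting data within a document: score each paragraph
# by term overlap with a query and return the best-matching passages.
# Splitting on blank lines is an assumed, crude segmentation.

def select_passages(document, query, top_n=2):
    passages = [p.strip() for p in document.split("\n\n") if p.strip()]
    q = set(query.lower().split())
    scored = [(len(set(p.lower().split()) & q), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_n] if score > 0]

doc = ("recall curves measure retrieval\n\n"
       "the weather is fine\n\n"
       "retrieval of documents")
print(select_passages(doc, "retrieval documents"))
```

The sketch returns passages rather than whole documents, which is the shift of level the section describes; its weakness, that paragraph boundaries rarely match conceptual boundaries, is exactly the unsolved problem at issue.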
Identifiable Duplication of Contents
Selecting libraries would be easier if there were no duplication in
libraries' collections of documents. Documents held in one library are often
held in other libraries also. Without this duplication, estimates of marginal
benefit, the number of relevant documents that would be retrieved from the
next library searched, would be relatively straightforward. But there is much
duplication between existing libraries and, with duplication, the marginal
benefit of the next library depends not only on the relevant documents in that
library's collection, but also on how many of those documents have already
been found in the libraries already selected. This means that the marginal
benefit of searching a library depends not only on that library's
collection, but also on the other libraries' collections. Analyses of
individual libraries (e.g. shelflist counts and Conspectus) do not help in
this, but analysis of location data in union catalogs such as OCLC can be
used.
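The overlap-aware marginal benefit just described reduces to a set computation. The holdings below are invented for illustration; in practice, overlap data of this kind would come from location data in a union catalog.

```python
# Sketch of the point above: with duplication, the marginal benefit of
# searching a library is the count of its relevant documents NOT already
# found in the union of the libraries searched so far.

def marginal_benefit(next_library, searched_libraries):
    """Relevant documents in next_library beyond those already found."""
    already_found = set().union(*searched_libraries) if searched_libraries else set()
    return len(next_library - already_found)

lib_a = {"doc1", "doc2", "doc3"}
lib_b = {"doc2", "doc3", "doc4", "doc5"}
print(marginal_benefit(lib_b, [lib_a]))  # 2: only doc4 and doc5 are new
```

Without the subtraction of `already_found`, library B would appear to offer four relevant documents; with it, only two, which is why per-library analyses alone cannot support the decision.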
One can identify identical documents, copies of the same publisher's edition,
but, beyond that, identifying duplication of contents among documents seems
impractical because the duplication is not clearly enough defined to be
identified. One can imagine a world in which documents are composed of a
selection of data-elements, all drawn more or less duplicatively from a shared
population of hypertext passages and data-elements [9,10]. Some web pages
composed primarily of links to other web pages begin to assume that form, but
documents are not yet designed in such a way that one could know whether and
when multiple documents have the same, duplicative content. Such systematic,
duplicative use of common data elements is not the norm and, we think, is
unlikely to become the norm.
Metadata, Description of Content
As a practical matter it is more difficult to predict which library one should
search next than it is to decide which document within a library one should
look at next. Why? Because, at least in conventional libraries, there is
searchable representation of each document, a catalog record, containing
descriptive metadata. This is lacking for libraries. The conclusion is that if
we are to be able to search cost-effectively in the digital library universe,
then we shall need some kind of catalog record and cataloging code to
represent entire libraries as well as individual documents. Searching for data
in a document is more difficult than searching for documents in a library
collection not only because of the weak internal structure, but also because
of the lack of descriptive metadata for each section of data. There is some
precedent in ``analytical'' cataloging and in the use, earlier this century,
of the Universal Decimal Classification for the subject categorization of
parts of documents, but these are labor-intensive solutions.
Trade-Offs Between Levels
Selecting libraries, selecting documents, and selecting data are consecutive
stages in a single process. Improved effectiveness and improved efficiency are
desirable in each of these three activities. We want to move the recall curve
in the direction of the arrow in Figure 1 and in Figure 2 and in any
comparable figure for data selecting, but improvement may not be equally
possible or an equally good investment at all of the three levels. Because
each is part of a larger process, trade-offs are possible: Investing in
improved selecting of libraries may matter less than (or be less feasible or
more expensive than) improved selecting of documents within
collections. Improved retrieval of data from within documents might, in
principle, compensate for weakness in the selecting of the
documents. Depending on the nature of the search, it may be cost-effective (or
necessary) to tolerate weak performance at one level if selecting performance
is good at another level. For example, if only a few documents are desired,
economical but weak library selecting may not matter if document selecting is
reliable within the libraries. If document selecting is weak, then that may be
compensated for if the selecting of the libraries and/or the selecting of data
were effective.
Summary
There is, in principle, a fundamental similarity in selecting at all three
levels, selecting libraries, selecting documents, and selecting data. In
practice, what is feasible varies importantly because of
- Differences in internal structure: Collections are easily
subdivided into documents, but documents are not easily
subdivided into data;
- Differences in the availability of descriptive metadata; and
- Identifiable duplication of contents: Documents, the components of
collections, are recognizably duplicative, and to some extent
interchangeable. The components within documents may sometimes be
duplicative, but not usually in any recognizable or usable way.
These similarities and differences are summarized in Table 1.
                      Libraries   Documents
Internal structure:      Yes          No
Metadata:                Yes          No
Duplication:             Yes          No
Table 1: Summary of some of the differences and similarities between
libraries and documents.
Examining the consequences of these analyses offers a large and promising
agenda for international research, development, and practice in digital
libraries.
Acknowledgment
This work was supported in part by NSF, NASA, DARPA Digital Libraries
Initiative grant IRI-941334 ``The Environmental Electronic Library'' and by
DARPA contract AO# F477 ``Search Support for Unfamiliar Metadata
Vocabularies''.
References
[1] Buckland, M. K. Information and Information Systems. Westport, CT:
Greenwood, 1991.
[2] Buckland, M. K. The potential of extended retrieval. United Nations
University Second International Symposium on the Frontiers of Science and
Technology: Expanding Access to Science and Technology --- The role of
Information Technologies, Kyoto, 12--14 May, 1992, Proceedings,
133--143. Tokyo: United Nations University Press, 1994.
[3] Buckland, M. K. Searching multiple digital libraries: A design analysis,
1995. http://www.sims.berkeley.edu/research/oasis/multisrch.html
[4] Buckland, M. K. & F. Gey. The relationship between recall and
precision. Journal of the American Society for Information Science,
45:12--19, 1994.
[5] Buckland, M. K. & C. Plaunt. On the construction of selection
systems. Library Hi Tech, 12(4):15--28, 1994.
[6] Buhrer, K. W. & A. Saager. La organizado de la intelekta laboro per La
Ponto [The organization of intellectual work by The Bridge]. Ansbach:
Fr. Seybold, 1911. (In Esperanto. Also published in German.)
[7] Hearst, M. A. Context and structure in automated full-text information
access. Dissertation, Computer Science, University of California,
Berkeley, 1994.
[8] Hearst, M. A. & Plaunt, C. Subtopic structuring for full-length document
access. 16th Annual International ACM/SIGIR Conference on Research and
Development in Information Retrieval, Pittsburgh, 27--30 June, 1993,
59--68. New York: ACM, 1993.
[9] Nelson, T. H. Computer lib ; Dream machines. Redmond, Wash.: Microsoft
Press, 1987.
[10] Nelson, T. H. Literary machines : The report on, and of, project xanadu
concerning word processing, electronic publishing, hypertext,
thinkertoys, tomorrow's intellectual revolution, and certain other topics
including.... Ed. 93.1. Sausalito, CA: Mindful Press, 1992.
[11] Norgard, B. A., M. G. Berger, M. K. Buckland, & C. Plaunt. The online
catalog: From technical services to access service. Advances in
Librarianship, 17:111--148, 1993.
[12] Otlet, P. International organization and dissemination of knowledge:
Selected Essays. Ed. by W. B. Rayward. Amsterdam: Elsevier, 1990.
[13] Otlet, P. Transformations operees dans l'appareil bibliographique des
science. Revue Scientifique, 58:236--241, 1918. For English translation,
see [12].
[14] Rayward, W. B. Some schemes for restructuring and mobilising information
in documents: A historical perspective. Information Processing and
Management, 30(2):163--175, 1994.
[15] Rayward, W. B. Visions of Xanadu: Paul Otlet (1868--1940) and
hypertext. Journal of the American Society for Information Science,
45:235--250, 1994.
[16] Salton, G. & J. Allan. Selective text utilization and text
traversal. Fifth ACM Conference on Hypertext: Proceedings of HYPERTEXT
'93, Seattle, WA, USA, 14--18 Nov. 1993,
131--144. New York: ACM, 1993.
[17] Satoh, T. The Bridge Movement in Munich and Ostwald's Treatise on the
organization of knowledge. Libri, 37:1--24, 1987.