The Evolving Metadata Architecture for the World Wide Web:
Bringing Together the Semantics, Structure
and Syntax of Resource Description

Stuart Weibel
Consulting Research Scientist
OCLC Office of Research
http://purl.org/net/weibel

Abstract

The Internet can be thought of as a World-Wide Commons in which many previously-distinct resource description communities are mixed together. There is a need for a single metadata architecture of sufficient richness to support the many varieties of resource description that now exist or may evolve. The Resource Description Framework (RDF) represents the foundation for such an architecture on the Web.

The Dublin Core is currently the best-developed candidate for a simple resource description model for electronic resources on the Web. It represents the results of a three-year process of consensus-building through a series of focussed, invitational workshops involving librarians, digital library researchers, and various content specialists from many countries.

Keywords: Dublin Core; metadata; World Wide Web; Resource Description Framework; RDF; Warwick Framework

1 The Internet Commons and Resource Description Communities

Among the most compelling changes wrought by the emergence of globally distributed networking is the realization of the global village that has been talked about for many years. The importance of locale as an impediment to communication has faded substantially, but the problem of common semantic and syntactic conventions necessary to exchange structured information looms larger than ever. The Internet can be thought of as a World Commons, and in this commons, many languages (natural and machine-based) must be accommodated. The emergence of an Internet Esperanto is no more likely on the Web than it was in real life, but it is reasonable to expect that unambiguous conventions will emerge to support specification of the languages and protocols used to exchange structured information.

Resource description protocols will benefit substantially from such conventions. In an unconnected world, little penalty accrues to divergent standards for resource description. An information seeker may go to a library for books on a topic, a map repository for maps, and a museum for information about relevant cultural artifacts. Each of these Resource Description Communities maintains domain-specific description conventions with canonical semantics, structure, and syntax. Differences in description conventions can be problematic, but can be ameliorated by the help of a librarian, map specialist, or museum curator.

In the Internet Commons, much of the information will be online, and a user may have less opportunity to consult an intermediary. In addition, many ad hoc communities with no coherent description standards have blossomed on the Web, and other communities with strong requirements for structured information are developing (electronic commerce, for example). The cost of reconciling disparate models of search and retrieval becomes onerous and the benefit of common description standards is greatly increased.

The term metadata simply means structured data about data. It is the term most often used in the Internet community for what has been known in the library community as cataloging data, or resource description. Many other varieties exist, and all varieties of metadata have three facets that must be addressed to achieve effective interchange:
* Structure
* Syntax
* Semantics

The structure of metadata is partly an extensive property of the data itself and partly determined by conventions established by those who maintain the data. Western personal names, for example, generally consist of a given name, a middle name, and a surname, but the encoding order is often surname, given-name, and middle initial. The important thing is that the structure is agreed upon (or can be determined dynamically) and unambiguous. In general, data structure issues are only controversial when different communities have well-established alternative solutions that are difficult to give up because they are deeply embedded in systems and culture.
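The personal-name case can be made concrete with a minimal sketch. The class and method names (PersonalName, display_order, encode_inverted) and the sample name are hypothetical; the point is only that one structured value can be rendered in more than one agreed order without ambiguity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonalName:
    """A Western personal name held as explicit, separately labeled parts."""
    given: str
    middle: str
    surname: str

    def display_order(self) -> str:
        # Natural reading order: given name, middle name, surname.
        return " ".join(part for part in (self.given, self.middle, self.surname) if part)

    def encode_inverted(self) -> str:
        # A common encoding order: surname, given name, middle initial.
        initial = f" {self.middle[0]}." if self.middle else ""
        return f"{self.surname}, {self.given}{initial}"


name = PersonalName(given="Jane", middle="Quincy", surname="Doe")  # hypothetical name
print(name.display_order())
print(name.encode_inverted())
```

Because the parts are labeled, either ordering can be produced on demand; what must be agreed between communities is the structure, not one particular rendering.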

Issues of syntax are often controversial for similar reasons. The tools, systems, and culture of a community make it difficult to move away from long-standing practice. This problem is mitigated slightly in the Internet milieu because the rapid proliferation of the Web has forced most communities to embrace these new protocols and, at the least, build gateways to their legacy systems. Thus, the Web makes it both easier and more important to establish the conventions and protocols that will support coherent resource description on a broader scale than ever before.

The foundation for this architecture is now being built under the auspices of the World Wide Web Consortium (W3C) as the Resource Description Framework, or RDF [1]. This architecture promises to support a wide variety of structured information, including library catalogs, third-party ratings, and digital commerce. When deployed, it will support the coexistence of many varieties of metadata that are developed and maintained independently by the communities with the expertise to define them. Perhaps more importantly, it will provide a foundation for these metadata standards to 'plug-and-play' in a manner that will make it easier to find descriptive metadata appropriate to the user's needs.

While it is necessary to support many varieties of independently defined metadata schemes, it is also desirable to identify shared semantics across description models, so that reconciliation of superficial differences might lead to a more coherent and simple set of description conventions. Just as machine-level protocols are needed to assure interoperability among different hardware platforms, finding commonalities among the semantics of data content standards helps to promote semantic interoperability in the platforms of greatest importance to the information process: the minds of users.

This paper addresses the evolution of a core element set (the Dublin Core) intended to be useful for describing a broad spectrum of electronic resources, and suggests how it is likely to be integrated into an architecture for metadata on the World Wide Web.

2 The Dublin Core: Simple Metadata for Electronic Resources

The Dublin Core is a 15-element metadata set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has also attracted the attention of formal resource description communities such as museums and libraries.

The Dublin Core Workshop Series [2] has gathered experts from the library world, the networking and digital library research communities, and a variety of content specialties in a series of focussed, invitational workshops. The building of an interdisciplinary, international consensus around a core element set is the central feature of the three-year evolution of the Dublin Core. The progress represents the emergent wisdom and collective experience of many stakeholders in the resource description arena. An open mailing list supports ongoing work.

The characteristics of the Dublin Core that distinguish it as a prominent candidate for description of electronic resources fall into several categories.

2.1 Simplicity

The Dublin Core is intended to be usable by non-catalogers. It is expected that authors or web-site maintainers unschooled in the cataloging arts will be able to use the Dublin Core for resource description, making their collections more visible to search engines and retrieval systems.

Most of the 15 elements have a commonly understood semantics that represents what might be described as a lowest common denominator for resource description (roughly equivalent to a catalog card). As such, the Dublin Core is not intended to replace richer description models, but rather to provide a core set of description elements that can be used by catalogers or non-catalogers for simple resource description.

2.2 Semantic Interoperability

In the Internet commons, disparate description models interfere with the ability to search across discipline boundaries. For example, libraries, museums, and the geographic information systems community use different standards for resource description. This reflects the different description needs of these communities, and the fact that such standards have evolved independently.

At a fine-grained description level, element sets are different because they must describe different things. Most writers seldom associate a cloud-cover attribute with their documents, but if you are describing satellite images of farmland, this is a critical descriptor.

Most resources nonetheless share a core set of attributes that are similar from one discipline to the next but carry different names simply because they evolved independently and at different times. Promoting a commonly understood set of core descriptors will improve the prospects for cross-disciplinary search by unifying related attributes. For example, an author and a creator can be sensibly thought of as the same attribute for the purposes of resource discovery. The Dublin Core is intended to serve as this core element set.

2.3 International Consensus

Recognition of the international scope of resource discovery on the Web is critical to the development of effective discovery infrastructure. The Dublin Core has benefited from active participation and promotion in the United Kingdom, Australia, Sweden, Denmark, Norway, Finland, The Netherlands, Germany, France, Thailand, Japan, Canada, and the United States (with active deployment projects in ten of these countries).

The high degree of international participation has brought a strong recognition of the importance of internationalization issues to the development of the Dublin Core, resulting in formal recognition of language qualifiers in the Dublin Core specification and in the latest version of HTML.

2.4 Flexibility

Although initially motivated by the need for author-generated resource description, the Dublin Core has also attracted the attention of formal resource description communities. As the diversity and volume of Web resources increase, trusted third parties (such as museums and libraries) will achieve greater recognition as preferred sources of metadata for persistent resources.

The Dublin Core, in the hands of cataloging experts, is expected to provide an economical alternative to more elaborate description models such as full AACR2/MARC cataloging. The Dublin Core includes sufficient flexibility to encode the additional structure and more elaborate semantics appropriate to such applications.

3 Metadata Modularity on the Web

The wide diversity of metadata needs on the Web requires an environment that supports the coexistence of many independently developed and maintained metadata packages. The Dublin Core is targeted specifically towards resource discovery, but one can imagine many functionally distinct packages that serve other goals (terms and conditions, archival management, administrative metadata, and many others). For example, a Terms and Conditions metadata package would include elements that describe rights holders, cost of acquiring a resource, restrictions on reuse of the resource, and related information.

Recognition of the desirability of this sort of modularity has guided the evolution of the Dublin Core since the Warwick Workshop, and has been formalized as the Warwick Framework [3,4]. The concepts articulated in this work have informed the ongoing development of a metadata architecture for the Web as well.
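The modularity described above can be illustrated with a minimal sketch of a container that aggregates independently defined packages for a single resource. This is an illustration under stated assumptions, not the Warwick Framework specification: the class names (MetadataPackage, Container), the package names, the element names, and the example URL are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MetadataPackage:
    """One independently maintained set of metadata elements."""
    name: str
    elements: Dict[str, str] = field(default_factory=dict)


@dataclass
class Container:
    """Aggregates the packages that describe a single resource."""
    resource: str
    packages: Dict[str, MetadataPackage] = field(default_factory=dict)

    def add(self, package: MetadataPackage) -> None:
        self.packages[package.name] = package

    def get(self, package_name: str) -> Optional[MetadataPackage]:
        # A client asks only for the package type it understands.
        return self.packages.get(package_name)


container = Container(resource="http://example.org/report")  # hypothetical URL
container.add(MetadataPackage("dublin_core", {
    "title": "Annual Report",
    "creator": "Example Author",
}))
container.add(MetadataPackage("terms_and_conditions", {
    "rights_holder": "Example Press",
    "cost": "free",
    "reuse_restrictions": "non-commercial use only",
}))

discovery = container.get("dublin_core")
print(discovery.elements["title"])
print(sorted(container.packages))
```

A discovery service would read only the dublin_core package and ignore the rest, while a rights-clearance tool would do the opposite; neither needs to understand the other's elements.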

3.1 RDF: A Metadata Architecture for the Web

The World Wide Web Consortium (W3C) is the primary standards forum for the Web, and has recently begun to focus on implementing an architecture for metadata for the Web. The Resource Description Framework, or RDF, is evolving to support the many different metadata needs of vendors and information providers [1]. Representatives of the Dublin Core effort are actively involved in the development of this architecture, bringing the digital library perspective to bear on this important component of the Web infrastructure.

3.2 Models for Deploying Dublin Core Description on the Web

The evolving RDF metadata architecture will support a variety of resource description models, each with implications for functionality and management.

3.2.1 Embedded Metadata

Currently the easiest way of deploying metadata on the Web is by embedding it in HTML documents (using the META tag). There are conventions that support inclusion of simple metadata in HTML versions 2.0 and above. The HTML 4.0 specification released in July of this year includes additional attributes for the META tag that allow the qualifiers necessary for more complex implementations. The advantage of embedded metadata is that no additional system must be in place to use it; the metadata is integral to the resource, and can be harvested by Web indexing agents.

The disadvantage of embedded metadata is the lack of formal control over such records. As resources are distributed, so too is the descriptive record. Changes or corrections made in one instance are not propagated to other copies of the resource, leading inevitably to inconsistencies.
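As a rough illustration of embedded metadata, the following sketch generates META tags for a few Dublin Core elements. It rests on assumptions not stated in this paper: the "DC." prefix on element names is one naming convention, the helper names (Element, meta_tag, render_head) are hypothetical, and the sample values are invented. SCHEME and LANG are attributes that HTML 4.0 defines for the META element.

```python
from html import escape
from typing import Iterable, NamedTuple, Optional


class Element(NamedTuple):
    name: str                      # Dublin Core element name, e.g. "title"
    content: str                   # the element value
    scheme: Optional[str] = None   # optional qualifier (HTML 4.0 SCHEME attribute)
    lang: Optional[str] = None     # optional qualifier (HTML 4.0 LANG attribute)


def meta_tag(element: Element) -> str:
    """Render one element as an HTML META tag."""
    attrs = [("NAME", f"DC.{element.name}")]  # "DC." prefix is an assumed convention
    if element.scheme:
        attrs.append(("SCHEME", element.scheme))
    if element.lang:
        attrs.append(("LANG", element.lang))
    attrs.append(("CONTENT", element.content))
    rendered = " ".join(f'{key}="{escape(value, quote=True)}"' for key, value in attrs)
    return f"<META {rendered}>"


def render_head(elements: Iterable[Element]) -> str:
    """Render several elements, one META tag per line."""
    return "\n".join(meta_tag(e) for e in elements)


record = [
    Element("title", "Annual Report"),
    Element("creator", "Example Author"),
    Element("subject", "Metadata", scheme="LCSH", lang="en"),
]
print(render_head(record))
```

The unqualified tags need nothing beyond NAME and CONTENT, which is why they work in older versions of HTML; only the qualified form depends on the newer attributes.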

3.2.2 Third Party Metadata

A model more familiar to the library community includes what is known in Web parlance as a third party label bureau. That is, an entity that collects and manages metadata records that refer to resources but are not embedded in the resource (a library catalog, for example). This model is important not only to libraries and museums, but also supports the development of agencies that might label resources according to age appropriateness or other acceptability criteria.

In this model, the record is a distinct data object stored and managed independently of the resource itself. Consistency maintenance and content management are both issues, but ones that can be managed according to publicly articulated policies. This will allow users to select a metadata store according to cost, collection or description policy, or other criteria.
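A minimal sketch of the third-party model follows, assuming records are keyed by the URL of the resource they describe. The names used here (MetadataStore, stores_describing), the policies, and the sample records are hypothetical and stand in for whatever a real label bureau or catalog would use.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetadataStore:
    """A third-party store: records live apart from the resources they describe."""
    name: str
    policy: str  # publicly articulated collection or description policy
    records: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def describe(self, resource_url: str, record: Dict[str, str]) -> None:
        # The record refers to the resource by URL; the resource is not modified.
        self.records[resource_url] = dict(record)

    def lookup(self, resource_url: str) -> Dict[str, str]:
        return self.records.get(resource_url, {})


def stores_describing(stores: List[MetadataStore], resource_url: str) -> List[str]:
    """Name every store that holds a record for the given resource."""
    return [store.name for store in stores if resource_url in store.records]


catalog = MetadataStore("library_catalog", policy="full description of persistent resources")
ratings = MetadataStore("rating_service", policy="age-appropriateness labels")

url = "http://example.org/report"  # hypothetical resource
catalog.describe(url, {"title": "Annual Report", "creator": "Example Author"})
ratings.describe(url, {"audience": "all ages"})

print(stores_describing([catalog, ratings], url))
print(catalog.lookup(url)["title"])
print(ratings.lookup("http://example.org/missing"))
```

Because each store carries its own stated policy, a user or client can choose among several descriptions of the same resource, which embedded metadata cannot offer.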

3.2.3 Common Search Semantics

A third model also involves management of records by a distinct entity or entities. Managing a wide variety of data stores often involves reconciling very different description models. One approach to achieving interoperability in such an environment involves mapping many description schemas into a common set such as the Dublin Core, giving users a single semantic model for searching. Dublin Core has been mapped to other data content standards such as MARC [5], and recent discussions between the GILS community and the DC community have resulted in agreement that the Dublin Core can serve as a proper subset of GILS for the purposes of discovery metadata. A collection of metadata mappings is maintained by Michael Day in the UK [6].

A metadata collection may contain no Dublin Core records per se and yet still be searched using Dublin Core semantics. Various databases of different record types might be exposed to the Internet community via the search semantics embodied in the Dublin Core. Efforts are currently underway to establish a Z39.50 profile that will facilitate this approach, allowing any Z39.50 client to issue queries to Z39.50 servers based on Dublin Core semantics [7].
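The idea of searching non-Dublin Core records through Dublin Core semantics can be sketched as follows. The two native schemas, their field names, the crosswalk tables, and the sample records are hypothetical; they are not the published MARC or GILS mappings, and the sketch does not model Z39.50 itself.

```python
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical crosswalks: native field name -> Dublin Core element.
CROSSWALKS: Dict[str, Dict[str, str]] = {
    "library": {"author": "creator", "title": "title", "topic": "subject"},
    "museum": {"maker": "creator", "object_name": "title", "classification": "subject"},
}


def to_dublin_core(schema: str, record: Record) -> Record:
    """Project a native record onto the Dublin Core elements it can supply."""
    mapping = CROSSWALKS[schema]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def search(collections: Dict[str, List[Record]], element: str, term: str) -> List[Record]:
    """Match a Dublin Core element across collections; return the native records."""
    hits = []
    for schema, records in collections.items():
        for record in records:
            view = to_dublin_core(schema, record)
            if term.lower() in view.get(element, "").lower():
                hits.append(record)
    return hits


collections = {
    "library": [{"author": "Doe, Jane", "title": "Farmland from Orbit", "shelf": "A1"}],
    "museum": [{"maker": "Doe, Jane", "object_name": "Survey Map", "accession_no": "1997.1"}],
}

for hit in search(collections, "creator", "doe"):
    print(hit)
```

Neither collection stores a field called creator, yet one query reaches both; fields with no Dublin Core counterpart (shelf, accession_no) remain available in the native record that is returned.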

3.3 Minimalists, Structuralists, and the Dublin Core as a Pidgin Metadata Language

The Canberra Metadata Workshop (DC-4) [8] exposed an important spectrum of opinion in the DC community that is most easily described in terms of the two poles of a continuum. At one end is what has become known as the minimalist camp. This position can best be described as having a strong commitment to simple, unqualified DC metadata. That is, the value of the elements should be interpreted as unstructured text without assumptions of external context or additional internal substructure. At the other end of the spectrum is the structuralist camp, characterized by the belief that additional context and qualification is appropriate, useful, and (in some cases) necessary for effective metadata.

Thus, the Canberra Qualifiers (scheme, language, and subelement) were identified as the three qualifiers that would be defined for DC metadata. As a practical matter, most people fall between the extremes on the continuum. Allegiance to simplicity is a virtue to be sacrificed only in the face of clear benefits associated with further complexification.
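The difference between the two positions can be sketched in terms of how a consumer reads the same qualified value. The class and function names (QualifiedValue, minimalist_view, structuralist_view) and the sample values are hypothetical; only the three qualifier names come from the discussion above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class QualifiedValue:
    """A Dublin Core element value with the three optional Canberra Qualifiers."""
    element: str
    value: str
    scheme: Optional[str] = None
    language: Optional[str] = None
    subelement: Optional[str] = None


def minimalist_view(item: QualifiedValue) -> Tuple[str, str]:
    """Read the value as unstructured text, ignoring every qualifier."""
    return item.element, item.value


def structuralist_view(item: QualifiedValue) -> Dict[str, object]:
    """Read the value together with whatever qualifiers are present."""
    candidates = (
        ("scheme", item.scheme),
        ("language", item.language),
        ("subelement", item.subelement),
    )
    qualifiers = {key: val for key, val in candidates if val is not None}
    return {"element": item.element, "value": item.value, "qualifiers": qualifiers}


date = QualifiedValue("date", "1997-09-30", scheme="ISO 8601")  # illustrative values
print(minimalist_view(date))
print(structuralist_view(date))
```

A minimalist consumer can still use a qualified record, since it simply discards the qualifiers; a structuralist consumer gains the context needed to parse the value reliably, at the cost of having to know the scheme.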

Tom Baker, in another paper in this volume [9], argues that the Dublin Core represents a kind of pidgin metadata language. A pidgin language is the result of the simplification and hybridization of two languages, arising from close interaction between cultures that do not share a common language. The corollary in the Internet world is the co-mingling of the Resource Description Communities discussed in section 1. Unqualified, or minimalist, Dublin Core represents the outcome of this effort of simplifying and hybridizing the resource description conventions of various communities.

The natural follow-on to this pidginization is a re-complexification (termed creolization) that takes place in a subsequent generation of pidgin speakers. That is, the need for greater nuance of semantics and syntax quickly begins to express itself, resulting in a creole language with greater expressive power than the pidgin. The structuralist perspective is an expression of this movement towards greater complexity. The tradeoffs between the two camps (minimalism promotes greater likelihood of interoperability, while structuralism allows greater richness of expression) can be seen as part of the natural development of expressivity, rather than as merely a schism of philosophy.

4 Future Directions

This manuscript is in preparation on the eve of the 5th workshop in the Dublin Core Workshop Series, to be held in Helsinki in early October of 1997. This meeting will bring together many early implementers of the Dublin Core to share their successes and problems, and should help to promote the convergence of early practice.

There remain aspects of the Dublin Core that require refinement and further specification, but the kernel of the Core is sound, and can be deployed with confidence. As the size of the Internet has made it increasingly difficult to locate resources, metadata has become a major focus of Web infrastructure development. As the metadata architecture of the Web matures, the Dublin Core is positioned as the most well-developed candidate for simple description of electronic resources. It promises to provide a bridge to promote cross-disciplinary access to a broad spectrum of resources, and to provide a model for other, richer description sets to follow as communities move more information resources to the Web.

5 References

[1] W3C Web Site: Resource Description Framework. (confirmed September 30, 1997)
http://www.w3.org/Metadata/RDF

[2] Dublin Core Homepage (September 30, 1997)
http://purl.org/metadata/dublin_core

[3] Lagoze, Carl, Lynch, Clifford A., and Daniel, Ron Jr. The Warwick Framework: A
Container Architecture for Aggregating Sets of Metadata. TR96-1593, June 21, 1996. Acrobat version:
http://www.nlc-bnc.ca/ifla/documents/libraries/cataloging/metadata/tr961593.pdf

[4] Dempsey, Lorcan and Weibel, Stuart L. The Warwick Metadata Workshop: A Framework for the Deployment of Resource Description. D-Lib Magazine, July/August 1996.
http://www.dlib.org/dlib/july96/07weibel.html

[5] Caplan, Priscilla L., Guenther, Rebecca S. Metadata for Internet Resources: the Dublin Core metadata elements set and its mapping to USMARC. Cataloging and Classification
Quarterly v.22, no. 3/4, p.43-58, 1996.

[6] Michael Day. Metadata: Mapping Between Formats. UKOLN (confirmed September 30, 1997)
http://www.ukoln.ac.uk/metadata/interoperability

[7] Ralph LeVan. Dublin Core and Z39.50: Personal Reflections.
http://cypress.dev.oclc.org:12345/~rrl/docs/dublincoreandz3950.html

[8] Weibel, Stuart, Warwick Cathro, and Renato Iannella. The 4th Dublin Core Metadata Workshop Report. D-Lib Magazine, June 1997.
http://www.dlib.org/dlib/june97/metadata/06weibel.html

[9] Tom Baker. Dublin Core in Multiple Languages: Esperanto, Interlingua, or Pidgin.
International Symposium on Digital Libraries. Tsukuba, Japan, November, 1997 (this volume)

Projects Using the Dublin Core

The following is a list of projects (by country) deploying Dublin Core metadata in one manner or another. This listing is maintained at the Dublin Core Homepage (http://purl.org/metadata/dublin_core), which can be consulted for an up-to-date listing.

Australia

DSTC http://www.dstc.edu.au/RDU/ The Distributed Systems Technology Centre is participating in the W3C Resource Description Framework (RDF) Working Group and is developing RDF-based prototypes of Dublin Core tools.

Australian Geodynamics Cooperative Research Centre (AGCRC) http://www.agcrc.csiro.au/
A collaboration between two public research organisations and two universities, linking two different metadata systems for text and numeric data.

Canada

searchBC: Vancouver Webpages http://vancouver-webpages.com/VWbot/searchBC.html A metadata generator script supports metadata collection for this British Columbia, Canada Web site.

Germany

Metadaten-Projekt (Metadata Project)

http://www2.sub.uni-goettingen.de
This project explores the use of metadata from a library point of view and looks at the impact of the developments in networked information resource discovery on traditional cataloging rules.

SSG-Fachinformation (SSG-FI) Mathematik
(Subject Area Information for Mathematics)

http://www.sub.uni-goettingen.de/ssgfi/
Metadata is generated for the listing and evaluation of information related to Mathematics. Sources include Internet servers, CD-ROM's, and reference books.

SSG-Fachinformation (SSG-FI) Geowissenschaften
(Subject Area Information for Earth Sciences)

http://www.sub.uni-goettingen.de/ssgfi/
Metadata is generated for the listing and evaluation of information related to the Earth Sciences. Sources include Internet servers, CD-ROM's and reference books.

The German Educational Resources Server / Deutscher Bildungs-Server
http://dbs.schule.de/indexe.html

2000 Web documents (with Dublin Core metadata) about teaching and learning materials from students, teachers, publishers and state educational authorities.

Math-Net

http://elib.zib.de/math-net/
This project uses author-created metadata for mathematics related documents.

Electronic Information Management and Metadata in Physics

http://www.physik.uni-oldenburg.de/EPS/EurophysNet/PhysDep/dep-links.html

The goal of this project is to build a Harvest-based distributed electronic information system in physics.

Netherlands

Koninklijke Bibliotheek/ The National Library of the Netherlands

http://www.konbib.nl:8000
The National Library of the Netherlands is in the process of developing a new version of its Web-information service with DC Metadata elements incorporated into the HTML pages.

Scandinavia

The Nordic Metadata Project

http://linnea.helsinki.fi/meta/
The Dublin Core is being used to provide and enhance end-user services by making a diversity of digital documents more easily searchable and deliverable over the Net.

Swedish EnviroNet

http://smn.environ.se/smnproj/proj/summary.htm
A project of the Swedish government that will be a gateway to electronic data and information on the Swedish environment.

Netpublikationer

http://www.fsk.dk/fsk/publ/online-pub/
Danish ministries, government offices and agencies shall publish documents on the WWW in parallel with the printed editions beginning in 1997.

United Kingdom

Art, Design, Architecture & Media Information Gateway and the Visual Arts Data Service

http://adam.ac.uk/
http://vads.ahds.ac.uk/
The Art, Design, Architecture & Media Information Gateway and the Visual Arts Data Service are two services that aim to provide the UK Higher Education community with fast, reliable access to high-quality networked resources in the visual arts, and to promote the use of standards of best practice through example and outreach.

AHDS Arts & Humanities Data Service

http://ahds.ac.uk/
The AHDS is a federal organisation encompassing archaeology, history, textual studies and the performing and visual arts. The goal of this organisation is to build an integrated system capable of providing a seamless whole to the user of the electronic resources available from each service provider.

Project BIBLINK

http://www.ukoln.ac.uk/metadata/BIBLINK/
Joint project of several national libraries and the European Union to establish an electronic metadata link between publishers and National Bibliographic Agencies (NBA's).

Project DESIRE

http://www.nic.surfnet.nl/surfnet/projects/desire/desire.html
DESIRE is demonstrating two approaches to resource discovery: subject-based services built on manual selection and description of high-quality resources, and automated web crawling. The automated web crawler is now 'metadata aware' and will gather Dublin Core descriptions.

SCRAN (Scottish Cultural Resources Access Network)

http://www.scran.ac.uk
SCRAN is a project to build a networked multimedia resource base for the study, teaching and appreciation of history and material culture in Scotland. It is anticipated that easy access to 1.5 million text records of artifacts and historic monuments and 100,000 related multimedia resources will be available by the year 2001.

NewsAgent for Libraries

http://www.sbu.ac.uk/litc/newsagent/
The aim of the NewsAgent project is to create an electronic news and current awareness service for library and information staff with a mixture of content streams, providing up to date descriptions of documents to end users based on user-configurable preferences.

Electronic Library Image Service for Europe (ELISE II)

http://severn.dmu.ac.uk/elise/
The ELISE service will operate on a client/server model, making use of Z39.50 and Dublin Core. In the ELISE II prototype, the catalogue data supplied by participating institutions is mapped to DC and displayed alongside thumbnail images.

United States

Monticello Electronic Library

http://www.solinet.net/monticello/monticel.htm
The basic function of Monticello Electronic Library is to link distributed regional resources regardless of source or type of information. The Dublin Core Element Set is being used to provide semantic interoperability between several databases of electronic media and record types including SGML EAD Finding Aid, MARC and GILS collections.

Medical Metadata Project

Home Page:
http://medir.ohsu.edu/~maletg/MedMetadata.HTM
A collaboration among Oregon Health Sciences University, the American Medical Informatics Association Internet Working Group, and the National Cancer Institute to provide a test set of the National Cancer Institute Cancer Genetics Database.

Florida International University Digital Library

http://www.fiu.edu/~diglib/
This digital library project will focus on images, sound and video including multimedia presentations and curriculum modules and will support every subject covered by university teaching and research.

University of Washington Digital Library

http://content.engr.washington.edu/
The University of Washington Digital Library is contributing to the development and adoption of standard resource descriptors for networked information; our initial efforts in this arena focus on image collections. The project is using extended Dublin Core descriptors for image collections.

Everglades Information Network & Digital Library

http://everglades.fiu.edu/
The collection will include technical reports, scientific papers, conference programs and abstracts, numerical vegetation or water quality data, maps, slides and photos, legal documents and archival documents covering the subject areas of long-term ecological research, biotic and hydrologic data, marine ecology, and wetlands restoration and resource management.

University of Michigan Digital Library Registry Database

http://dns.hti.umich.edu/registry/
This project will create a searchable, browsable database of World Wide Web resources that have been chosen for their institutional or academic value.