Layered Data View for Searching, Browsing, and Presenting Scholarly Documents

Keizo OYAMA
National Center for Science Information Systems
oyama@rd.nacsis.ac.jp
http://www.rd.nacsis.ac.jp/~oyama/

Abstract

This paper describes the results of a study on text formats suitable for searching, browsing, and presenting scholarly documents in a digital library service, in relation to document distribution formats and data production methods. Two types of data sources are considered in the context of their application to NACSIS-ELS. Printed document sources are discussed first, mainly from the viewpoint of fulltext data production and its application, including the application of OCR and document structure recognition technology. Electronic text sources are then discussed, mainly from the viewpoint of format conversion and the mutual relations among formats in a layered data view.

1 Introduction

NACSIS, the National Center for Science Information Systems, has been operating a digital library service called NACSIS-ELS[1] since April 1997. In this service, articles from academic journals published by Japanese academic societies are provided through a system that integrates online distribution of page image data with information retrieval of bibliographic and abstract data.
In order to expand the usefulness of this system, compilation of a fulltext database is essential. To implement such a mechanism, automatic methods for producing fulltext data should be developed, both by converting existing printed documents and by building a system that integrates the publication process at academic organizations.
Considering its use in NACSIS-ELS, however, it is not sufficient merely to produce fulltext data; a data structure must be designed that exploits the features of the various document data formats. In particular, the relation between data used for search processing and data used for final presentation must be taken into account.
This paper not only describes methods for obtaining fulltext data of academic journals, but also discusses a layered view of data formats and proposes how to utilize it, considering the document supply role of digital libraries.

2 Layered Data View

Document data of various formats can be viewed as a hierarchy of documents as well as the inputs and outputs of document processing filters. Figure 1 shows the relation between a typical document processing cycle and an aspect of document utilization. From the viewpoint of digital libraries, document sources are currently mostly in printed paper formats and tagged (or logically structured) text formats. Formatted (or physically structured) text formats are expected to be used much more in the near future.
In order to handle printed papers electronically, the most cost-effective method is to convert them into a raster image format using scanners. In NACSIS-ELS, printed journals are first converted to 400-dpi TIFF images.
Here we must consider what is regarded as the final document provided to the users. Fulltext data in a tagged text format are useful for search purposes but not necessarily suitable for presentation. One reason is that the wishes of copyright holders, such as publishers and authors, about the format in which they want their documents distributed should be respected. Another reason is that the cited document or document part should be identifiable, so that no confusion arises when someone cites a document or a part of it.
When the final publication is in a printed format, it readily follows that document data in a format that can reproduce the original document with acceptable accuracy should be provided to the users. When it is a formatted text format, the original formatted text or, possibly, rasterized image data should be provided. When it is a tagged text format, the original tagged text is of course most preferable, but serious consideration is needed when the document is presented in a different format.
In any case, in order to present the appropriate parts of the appropriate documents as a search result, correspondence information mapping locations within the search data to locations within the browsing and presentation data is necessary. For user interaction with a presented document, mutual correspondence information between locations in the browsing data and locations in the presentation data is also necessary.
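As a minimal sketch of what such correspondence information might look like, the following Python fragment maps a character range in the search data to a region of the presentation data (a page number and a bounding box on the page image). The type and field names are assumptions introduced here for illustration only; they are not part of the NACSIS-ELS implementation.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Region:
        page: int      # page number within the article
        x: int         # bounding box of the matched text on the page image
        y: int
        width: int
        height: int

    @dataclass
    class CorrespondenceEntry:
        start: int     # character offset in the search text (inclusive)
        end: int       # character offset in the search text (exclusive)
        region: Region # corresponding location in the presentation data

    def regions_for_hit(table: List[CorrespondenceEntry],
                        hit_start: int, hit_end: int) -> List[Region]:
        """Return the presentation regions overlapping a search hit."""
        return [e.region for e in table if e.start < hit_end and hit_start < e.end]

With such a table, a hit found in the search data can be highlighted directly on the page image delivered to the user.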
Implementing a friendly and useful user interface based on this correspondence information is, of course, an important issue. The digital library system developed at U.C. Berkeley is a good model[2]. The NACSIS-ELS system has a built-in function for hyperlinking from an item in a table of contents to the corresponding page, and from an item in a reference list to the corresponding article page. Although this is an interesting theme from the viewpoint of user interfaces, it is not discussed further here.

3 Printed Document Sources

Currently, although many academic journals are edited and printed electronically, proprietary typesetting systems of printer manufacturers are dominant in Japan. Therefore, in order to obtain text data, a conversion program must be written for each system, and even with such effort the obtained data are often just plain text. Worse, in many cases the typesetting data for back numbers no longer exist.
Under such circumstances, for existing printed journals a mass batch-processing method using OCR and document image recognition technology that can handle a variety of journal titles is desirable. Since academic journals are stored cover to cover as page image data in the NACSIS-ELS system, the most efficient approach is to produce fulltext data from the page image data.
Many OCR and document layout analysis technologies have been studied, and there are many OCR products on the market that can read text with high accuracy from document images such as those scanned for the NACSIS-ELS service, which are of relatively high quality. Although many OCR products include layout analysis capability, almost none provides a document structure recognition function.
Because our goal is to enrich the contents of NACSIS-ELS, we do not want to invest much effort in the research and development of such elemental technologies; rather, we aim to integrate a system mainly by utilizing products available in the market.

3.1 Current OCR Technology and its Application

In order to grasp the actual state of the art, we experimented with conversion of page image data extracted from the NACSIS-ELS image database, using a Japanese OCR product whose performance is reputed to be high. For each character, it can output a list of candidate characters with scores, together with the coordinate data.
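The per-character output described above can be thought of roughly as follows. This is a minimal sketch, assuming hypothetical field names; it is not the actual output format of the product.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class OcrCharacter:
        candidates: List[Tuple[str, float]]  # (character, score), best first
        x: int                               # bounding box on the page image
        y: int
        width: int
        height: int

        def best(self) -> str:
            """First candidate, used when a single reading is needed."""
            return self.candidates[0][0] if self.candidates else ""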
The results are shown in Table 1, and the error classes and samples are shown in Table 2.

3.1.1 Character Recognition Errors

Table 1 shows that, when the quality of the document images is good, the misrecognition rate is less than 0.06%, which is almost negligible for search purposes. For most misrecognized characters, the correct character either does not appear in the candidate list or appears only with a relatively low score. Therefore, candidate lists are of little use in this case.
When the quality of the document images becomes worse, the misrecognition rate tends to increase, but the proportion of misrecognized characters for which the correct character is included in the candidate list with a high score also increases accordingly. Therefore, the recall rate can be much improved by utilizing the candidate lists in the search process[3].
Looking at the first error class in Table 2, characters are often misrecognized because a symbol and a letter or a Japanese character are merged. For this class, the correct characters never appear in the candidate lists. However, since the error follows a common pattern for each character combination, such text can still be recalled by expanding queries using a confusion matrix method[4]. Moreover, parentheses and quotation marks could be corrected automatically after OCR processing by applying constraints on syntax and character width.
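The following is an illustrative sketch of confusion-matrix query expansion in the spirit of [4]. The matrix entries are invented examples, not measured confusion statistics; a real matrix would be built from observed OCR errors.

    from itertools import product
    from typing import Dict, List

    # For each correct character, the characters the OCR tends to output instead
    # (hypothetical example entries).
    CONFUSION: Dict[str, List[str]] = {
        "l": ["1", "|"],
        "O": ["0"],
    }

    def expand_query(term: str) -> List[str]:
        """Generate query variants in which characters are replaced by their
        frequent misrecognitions, so that misrecognized text is still recalled."""
        options = [[c] + CONFUSION.get(c, []) for c in term]
        return ["".join(p) for p in product(*options)]

    # expand_query("Ol") -> ['Ol', 'O1', 'O|', '0l', '01', '0|']

Each variant is then searched in the OCR text, and the union of the hits is returned.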

3.1.2 Layout Recognition Errors

Table 1 shows that the 2-gram connection error rate is less than 0.05%. The errors fall into four cases.
The first case is that a running title or a page number is inserted at a page break or a column break; this accounts for about 40% of the errors. Such elements can be identified by considering their relative locations and string patterns.
The second case is that a title, a caption, or other text within figures and tables is inserted; this accounts for about 30%. Separating these requires simple document image recognition, which is difficult to perform accurately.
The third case is that the order of blocks within a page is confused; this accounts for about 20%. These are easily reordered according to their coordinates (a sketch of this and the first correction follows the list of cases).
The last case is that blocks or lines are misrecognized; this accounts for about 10%. These are recovered by checking the height and width of each line.
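As a minimal sketch of the first and third corrections, the fragment below drops page-number-like lines near the top or bottom edge of a page and reorders text blocks by their coordinates, assuming a two-column layout. The margin fraction and the page-number pattern are assumptions for illustration.

    import re
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Block:
        text: str
        x: int   # left edge on the page image
        y: int   # top edge on the page image

    PAGE_NUMBER = re.compile(r"^\s*-?\s*\d{1,4}\s*-?\s*$")

    def drop_page_furniture(lines: List[Block], page_height: int) -> List[Block]:
        """Drop lines at the extreme top/bottom that match a page-number pattern."""
        margin = page_height // 20
        keep = []
        for ln in lines:
            at_edge = ln.y < margin or ln.y > page_height - margin
            if at_edge and PAGE_NUMBER.match(ln.text):
                continue
            keep.append(ln)
        return keep

    def reorder_two_column(blocks: List[Block], page_width: int) -> List[Block]:
        """Read the left column top-to-bottom, then the right column."""
        left = sorted((b for b in blocks if b.x < page_width // 2), key=lambda b: b.y)
        right = sorted((b for b in blocks if b.x >= page_width // 2), key=lambda b: b.y)
        return left + right

A running-title filter would work the same way, matching the known journal title string at the page edge instead of a digit pattern.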
After these corrections, 2-gram connection errors of about 0.02% will remain. Since a 4-character query word spans three overlapping 2-grams, the recall error rate will be about 0.06%, which is almost negligible. However, because this figure means that one block or line ordering error occurs roughly every 10 pages for typical Japanese academic journals, the text is not suitable for presentation purposes.

3.1.3 Dropping Errors

Table 1 shows that 0.15% of all the characters were dropped (i.e. left unread). Most of these were caused by block recognition errors on 2-column pages due to interference by figures and tables. These dropping errors account for more than half of all the errors. Moreover, we cannot cope with them by error correction or query expansion, because the data themselves do not exist.
One method to reduce the dropping errors would be to make layout analysis more accurate, which we want to avoid as far as possible. A more practical method is to register several layout patterns with the OCR system and to repeat the OCR process, switching layout patterns according to the results. This takes somewhat longer OCR processing time but can avoid block dropping.
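A hedged sketch of this retry strategy follows: OCR is run with each registered layout pattern, and the result that recognizes the most characters is kept, on the assumption that dropped blocks show up as shorter output. The run_ocr callable is a stand-in for the actual product's interface, which is not specified here.

    from typing import Callable, List

    def ocr_with_layout_retry(page_image: bytes,
                              layout_patterns: List[str],
                              run_ocr: Callable[[bytes, str], str]) -> str:
        """Try each layout pattern and keep the longest recognition result."""
        best_text = ""
        for pattern in layout_patterns:
            text = run_ocr(page_image, pattern)
            if len(text) > len(best_text):
                best_text = text
        return best_text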
By combining all of the methods described above, fulltext data accurate enough for search purposes can be obtained, accompanied by coordinates and a candidate list for each character.

3.2 Document Structure Recognition

Document structure recognition has been studied as an advanced technology of document image processing. However, its performance is not yet good enough to realize a general-purpose product on the market. Satisfactory accuracy may be achieved by writing rules that describe the relative locations, character sizes, and font styles of document elements for a specific document type and applying them during document image processing. In digital library applications, however, it is difficult to write such rules tuned for each document type.
We are therefore considering extracting the minimum document structure required for search purposes by combining rules on character string patterns, which are more abstract than image-level characteristics, with rules on the per-character coordinate information output by the OCR. Although this also requires knowledge of the layout of each document type, the rules can be prepared without knowledge of document image processing technology. With this method, elements including the running title, page numbers, bibliographic data such as the article title and author names, and possibly the reference list are expected to be recognized for each article.
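As a sketch of combining string-pattern rules with coordinate rules, the fragment below picks out a candidate article title on the first page, assuming the title is the tallest line in the upper part of the page that does not look like a short all-caps running title. The thresholds are invented for illustration, not measured values.

    import re
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Line:
        text: str
        y: int       # top of the line on the page image (pixels)
        height: int  # line height, a rough proxy for font size

    def find_article_title(lines: List[Line], page_height: int) -> Optional[Line]:
        """Tallest non-running-title line in the upper quarter of page 1."""
        upper = [ln for ln in lines if ln.y < page_height // 4]
        candidates = [ln for ln in upper
                      if len(ln.text) > 10 and not re.fullmatch(r"[A-Z .\-]+", ln.text)]
        return max(candidates, key=lambda ln: ln.height, default=None)

Analogous rules (position at the page edge plus a digit pattern) would identify page numbers, and lines immediately below the title would be candidate author names.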
Extraction of the running title and the page numbers will improve the recall and precision of search processing. The extracted bibliographic data can be used to improve the accuracy of the manually input secondary information prepared for NACSIS-ELS, by matching element by element. The article title, the author names, and the page numbers are also utilized when generating hyperlinks from the table of contents to the article pages.
The references are indispensable for producing hyperlinks to the referenced articles. To cope with character recognition errors in individual reference items, we have already developed a method that improves the recall rate of referenced articles by approximate matching against records stored in existing secondary information databases[5]. However, this method cannot be applied if the extraction of individual reference items is incorrect. As it is difficult to improve the extraction accuracy using only the image characteristics output by the OCR, we are considering rules describing text- and document-level characteristics, for example the character string pattern of the leading header (e.g. ``References''), the location within the document, the format of reference identifiers, and the usage of punctuation, in addition to coordinate information such as the starting location and the length of each line.
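The following is an illustrative sketch of such text-level rules for segmenting a reference list: it locates the header line and then starts a new item at each line that begins with a reference identifier such as "[3]" or "3.". The header and identifier patterns are assumptions; a real rule set would also use line indentation and length.

    import re
    from typing import List

    HEADER = re.compile(r"^\s*(References|REFERENCES|Bibliography)\s*$")
    ITEM_START = re.compile(r"^\s*(\[\d+\]|\(\d+\)|\d+[.)])\s+")

    def split_references(lines: List[str]) -> List[str]:
        items: List[str] = []
        in_refs = False
        for line in lines:
            if not in_refs:
                in_refs = bool(HEADER.match(line))
                continue
            if ITEM_START.match(line) or not items:
                items.append(line.strip())
            else:                      # continuation of the previous item
                items[-1] += " " + line.strip()
        return items

Each extracted item would then be passed to the approximate matching method of [5].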
There are further issues, such as recognition of figure and table areas, their titles and captions, chapter and section headings, footnotes and their indicators, reference indicators, and so on. However, these are difficult to recognize accurately using only the information currently output by the OCR. Recognition of these data elements might help the system improve its search and browsing functions, but it is not essential for the service.

3.3 Data View

How to incorporate the text data obtained as described in the previous section into the service is discussed below.
For search purposes, we are going to use manually input data for the bibliographic and abstract parts and, for the body parts, ``main'' fulltext data obtained from the OCR-processed text by applying appropriate error correction and selecting only the first candidate. When the image quality is not good enough, we are going to use, in parallel, ``complementary'' fulltext data preserving candidate characters whose scores exceed some threshold. The indexing method that utilizes these data is now under development.
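As a minimal sketch of deriving the two text layers from the OCR output, the fragment below keeps only the first candidate for the ``main'' text and every candidate above a threshold for the ``complementary'' text. The threshold value and data representation are assumptions for illustration.

    from typing import List, Tuple

    Candidates = List[Tuple[str, float]]   # (character, score), best first

    def main_text(chars: List[Candidates]) -> str:
        """Join the first candidate of each character position."""
        return "".join(c[0][0] for c in chars if c)

    def complementary_text(chars: List[Candidates],
                           threshold: float = 0.6) -> List[List[str]]:
        """For each character position, the set of plausible characters kept
        when indexing poor-quality pages."""
        return [[ch for ch, score in c if score >= threshold] for c in chars]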
For browsing purposes, we will use the bibliographic and abstract data and the ``main'' fulltext data, the same as for search purposes.
To make the browsing text ``clean'', we might use error correction methods proposed for OCR-processed text based on superficial analysis using a word dictionary or linguistic statistics such as 2-gram frequencies. However, this kind of error correction should not be applied to the text used for search. Scholarly terms are frequently used as search terms, and they often include Latin words, acronyms, and proper nouns such as inventors' names. Proper nouns such as system names are also important search terms. It should be noted that, because these terms are often not included in the dictionary or do not fit the linguistic statistics, important information may be lost in the process of error correction.
For presentation purposes, raster image data are most practical, because OCR and document structure recognition technologies are not reliable enough to produce authoritative data. If the character misrecognition rate were low enough, formatted text such as PDF data could be produced from the OCR-processed text and the coordinate information. With the current technology, however, we cannot adopt this method because checking by human eyes is indispensable.
To feed the search results back to the users, highlighting the corresponding parts of the image is effective. This requires location information in the search data pointing into the presentation data. No location information is necessary for the browsing data if it is integrated with the search data.
Depending on the user's decision, the fulltext search process will first execute strict matching against the ``main'' fulltext; then, if recall is inadequate, approximate matching using the confusion matrix method; and then, if recall is still unsatisfactory, extended matching using the candidate lists.
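A hedged sketch of this staged search follows. The three matcher callables are stand-ins for the strict, confusion-matrix, and candidate-list matchers described above and are not part of any existing system.

    from typing import Callable, List

    def staged_search(query: str,
                      search_strict: Callable[[str], List[str]],
                      search_confusion: Callable[[str], List[str]],
                      search_candidates: Callable[[str], List[str]],
                      min_hits: int = 1) -> List[str]:
        """Escalate from strict to confusion-matrix to candidate-list matching
        only while the result set is too small."""
        hits = search_strict(query)
        if len(hits) < min_hits:
            hits = search_confusion(query)
        if len(hits) < min_hits:
            hits = search_candidates(query)
        return hits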

4 Electronic Text Sources

In order to automate the production of fulltext data and to improve efficiency and accuracy, it is ideal for each academic society itself to provide its journals in an electronic text format. The fulltext data for NACSIS-ELS can then be obtained by converting the electronic text. In reality, however, most academic societies in Japan have been slow to introduce electronic publishing systems. One practical approach is therefore to develop and offer a system that integrates the editing process in academic societies from authoring through printing, based on certain standards and using popular tools. Fulltext data are then obtained by converting the internal text data produced in that process.
The data format for the presentation view is a crucial issue even when electronic text sources are available. Adopting the format that the copyright holders regard as the final distribution format is the safest choice. If that is impossible because of the system configuration, we should adopt a physically more concrete format. In that case, location information identifying the corresponding part in the final distribution format will have to be embedded in the presentation data.

4.1 Data View and Format Conversion

If final distribution is in a printed format, the data view will take the same structure as described in the previous chapter, except that the raster image data for the presentation view can be generated electronically without scanning printed papers. We can also adopt a formatted text format for the presentation view. Text data produced by converting the internal text data can be used for search and browsing purposes.
In some cases, however, the internal text cannot be converted, because quite a few Japanese word processors and DTP packages cannot produce output in SGML or RTF format. In such cases, document structure recognition from the formatted text data may be necessary to produce the tagged text data.
It is also important that the contents of the printed format and the electronic text format be identical for a given document. Since texts are sometimes modified physically in the printing process, the same modifications must be applied to the electronic text data. If this routine cannot be performed, the electronic text source is of no use.
Another point to note is that coordinate information in the presentation data is required to return search results. Therefore, the location of each character determined by typesetting has to be inserted back into the text data. If this is impossible for technical reasons, OCR processing may in fact be preferable. Even in that case, since OCR is executed on electronically generated raster images, the character recognition rate is expected to be very high, and no countermeasures against recognition errors should be necessary.
In the case of electronic publishing in a formatted text format, PDF seems to be the most commonly used. A specification for describing document structure is defined for PDF; therefore, all the information needed for each view can be extracted from a PDF document if it includes that structure. In practice, however, almost no PDF documents include structure information, so the structure has to be recognized from the layout information.
It is desirable, if possible, to obtain the internal text data used for editing and to produce text data for search and browsing from it, provided that structure-conscious editing is performed using style sheets and the like. In order to add information such as highlights to the presentation view, the search data will need to hold the offset of each character in the formatted text data.
In the case of electronic publishing in a tagged text format that can describe logical structure explicitly (e.g. SGML, XML, and possibly HTML 4.0), a flexible user interface can be constructed because the electronic text itself may be provided as the presentation data. To handle such documents together with documents of other formats in an integrated service system, the text data for search and for browsing must each be converted to a common format.
It should be noted that, because users can easily reuse or modify documents provided in a tagged text format, some measures are required to protect copyright. If that is difficult, formatted text data or raster image data should be used for the presentation view, as in the case of printed documents or formatted texts.
In the conversion process, information identifying the part of the original text corresponding to each presented part should be embedded in the presentation data. Displaying document element identifiers (e.g. paragraph numbers) in the margin may be one solution.

4.2 Practical System Configuration

When incorporating electronic text sources into NACSIS-ELS, a step-by-step approach to practical system construction will be required.
Because of current institutional restrictions (e.g. copyright protection and charging schemes) and system restrictions (e.g. the implementation of plug-ins or clients), we will have no choice, for the time being, but to use tagged text for the search and browsing views and raster images for the presentation view.
As the next step, formatted text formats (e.g. PDF) will be used for presentation. The issues here are the charging scheme and the implementation of the clients. Alignment among the charging unit (e.g. per page or per document), the document formats, and the client functions will be the condition for realization. In other words, if there were no charging issue, PDF could easily be incorporated into the NACSIS-ELS service. At this stage, a raster image format may be used in parallel so that NACSIS-ELS remains accessible from as many platforms as possible.
As the following step, tagged text formats (e.g. SGML, XML) will be used for presentation. The issues here are mainly the charging scheme and copyright protection. As there is no concept of pages, if a charging unit smaller than per document is required, a new charging scheme (e.g. per kilobyte of presented text) has to be introduced. For copyright protection, corresponding functions on the client side are necessary, and serious consideration is required as to whether commonly used clients can satisfy the protection level requested by each copyright holder. If there were no copyright issues, HTML could be used for the presentation view in parallel in order to expand the number of supported platforms.
In all of the steps described above, if the copyright holders are not satisfied with the copyright protection mechanism or the charging scheme, the service may provide only the search and browsing views, leaving the presentation view to the copyright holders themselves. In this case, NACSIS-ELS and the copyright holders have to adopt a common document identifier scheme, and feeding the search result back into the presentation view becomes almost impossible.
For formatted text formats and tagged text formats other than those described above, it is impractical to consider the charging scheme, the copyright protection issues, and the implementation of clients and plug-ins for each of them individually; therefore, the mechanisms provided by the publishers will be used for the presentation view as they are.

5 Conclusion

Based on experience with NACSIS-ELS, the author's personal view on the compilation method and the conceptual design of digital library contents has been described, focusing on the data sources and the final document distribution formats. It should be noted, among other things, that a layered data view for searching, browsing, and presenting should be taken into account in the design of content data structures and user interfaces, and that, even as electronic publishing becomes widespread, several obstacles must be overcome before electronic texts can be incorporated into a digital library service provided to unrestricted users. The urgent issues for us are the construction of a fulltext database from the document image database and the implementation of a retrieval system for it.
Although this paper has dealt mainly with the basic theme of academic journal contents in a digital library, other research is also in progress at NACSIS in the fields of natural language processing, fulltext search technology, intelligent user interfaces, and so on. The results of this research will be integrated into the future NACSIS-ELS system.

References

[1]
Adachi, J., Hashizume, H.: "NACSIS Electronic Library System: Its Design and Implementation", Proc. of ISDL'95, Tsukuba, Japan, pp. 36-41, Aug. 1995.
[2]
"UC Berkeley Digital Library Project -- Documents", URL: "http://elib.cs.berkeley.edu/docs/".
[3]
Senda, S., Minoh, M., Ikeda, K.: "A Document Image Database System with Full-text Search Functions", Digital Libraries, No. 8, 1996.
[4]
Ohta, M., Takasu, A., Adachi, J.: "Retrieval Methods for English-Text with Misrecognized OCR Characters", Proc. of ICDAR'97, Ulm, Germany, pp. 950-956, 1997.
[5]
Takasu, A., Katayama, N., Yamaoka, M., Iwaki, O., Oyama, K., Adachi, J.: "Approximate Matching for OCR-processed Bibliographic Data", Proc. of ICPR'96, Vienna, Austria, Vol. III, pp. 175-179, 1996.