Layered Data View for Searching, Browsing, and Presenting Scholarly Documents
Keizo OYAMA
National Center for Science Information Systems
oyama@rd.nacsis.ac.jp
http://www.rd.nacsis.ac.jp/~oyama/
Abstract
This paper describes the results of a study on text formats suitable for
searching, browsing, and presenting scholarly documents in a digital library
service, in relation to document distribution formats and data production
methods. Two types of data sources are considered in the context of their
application to NACSIS-ELS. Printed document sources are discussed first,
mainly from the viewpoint of fulltext data production and its application,
including the application of OCR and document structure recognition
technology. Electronic text sources are then discussed, mainly from the
viewpoint of format conversion and the mutual relations among formats in a
layered data view.
1 Introduction
NACSIS, the National Center for Science Information Systems, has been
operating a digital library service called NACSIS-ELS [1] since April 1997.
In this service, articles from academic journals published by Japanese
academic societies are provided through a system that integrates online
distribution of page image data with information retrieval of bibliographic
and abstract data.
In order to expand the usefulness of this system, compilation of a fulltext
database is essential. To implement such a mechanism, automatic methods for
producing fulltext data should be developed, both by converting existing
printed documents and by building a system that integrates the publication
process at academic organizations.
Considering its use in NACSIS-ELS, however, it is not sufficient merely to
produce fulltext data; a data structure must be designed that exploits the
features of the various document data formats. In particular, the relation
between data for search processing and data for final presentation should be
taken into account.
This paper not only describes methods for obtaining fulltext data of academic
journals, but also discusses a layered view of data formats and proposes how
to utilize it, considering the role of digital libraries in the document
supply function.
2 Layered Data View
Document data in various formats can be viewed as a hierarchy of documents as
well as the inputs and outputs of document processing filters. Figure 1 shows
the relation between a typical document processing cycle and an aspect of
document utilization. From the viewpoint of digital libraries, document
sources are mostly in printed paper formats and tagged (or logically
structured) text formats. Formatted (or physically structured) text formats
are expected to be used much more in the near future.
In order to handle printed papers in an electronic format, the most
cost-effective method is to convert them into a raster image format using
scanners. For NACSIS-ELS, printed journals are first converted to 400 dpi
TIFF images.
Here we have to consider what is regarded as the final document provided to
the users. Fulltext data in a tagged text format are useful for search
purposes but not necessarily suitable for presentation. One reason is that
respect should be paid to the wishes of copyright holders, such as publishers
and authors, regarding the format in which they want the document to be
distributed. Another reason is that the referenced document, or the
referenced part of a document, should be identifiable so as not to cause
confusion when someone cites it.
When the final publication is in a printed format, it readily follows that
document data in a format that can reproduce the original document with
acceptable accuracy should be provided to the users. When it is in a
formatted text format, the original formatted text or, possibly, rasterized
image data should be provided. When it is in a tagged text format, the
original tagged text is of course most preferable, but serious consideration
is needed if it is to be presented in a different document format.
In any case, in order to present the appropriate parts of the appropriate
documents as search results, correspondence information from locations within
the search data to locations within the browsing and presentation data is
necessary. For user interaction with a presented document, mutual
correspondence information between locations in the browsing data and
locations in the presentation data is also necessary.
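The following is a minimal sketch, under assumed names and coordinate
conventions, of what such correspondence information might look like: a table
mapping a character offset in the search text to its location (page number
and bounding box) in the presentation image data. It is an illustration, not
the NACSIS-ELS implementation.

```python
# Correspondence between search-data offsets and presentation-data locations.
# All names and the coordinate convention are illustrative assumptions.
from dataclasses import dataclass
from bisect import bisect_right

@dataclass
class CharLocation:
    offset: int    # character offset in the search/browsing text
    page: int      # page number in the presentation (image) data
    bbox: tuple    # (x0, y0, x1, y1) in image coordinates

class CorrespondenceTable:
    def __init__(self, locations):
        self.locations = sorted(locations, key=lambda l: l.offset)
        self.offsets = [l.offset for l in self.locations]

    def locate(self, offset):
        """Return the presentation location of the character at `offset`."""
        i = bisect_right(self.offsets, offset) - 1
        return self.locations[i] if i >= 0 else None

# Usage: map a search hit back to the page image region to highlight.
table = CorrespondenceTable([
    CharLocation(0, page=1, bbox=(120, 80, 150, 110)),
    CharLocation(1, page=1, bbox=(152, 80, 182, 110)),
])
hit = table.locate(1)
print(hit.page, hit.bbox)
```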
It is, of course, an important issue to implement a friendly and useful user
interface using this correspondence information. The digital library system
developed at U.C. Berkeley is a good model [2]. The NACSIS-ELS system has a
built-in function for hyperlinking from an item in a table of contents to the
corresponding page, and from an item in a reference list to the corresponding
article page. Although this is an interesting theme from the viewpoint of
user interface, it is not discussed further here.
3 Printed Document Sources
Although many academic journals are now edited and printed electronically,
proprietary typesetting systems of printer manufacturers are dominant in
Japan. Therefore, in order to obtain text data, a conversion program must be
written for each system, and even then the obtained data are often just plain
text. Worse, in many cases the typesetting data of back numbers no longer
exist.
Under such circumstances, for existing printed journals, a mass
batch-processing method using OCR and document image recognition technology
that can handle a variety of journal titles is desirable. Since the
NACSIS-ELS system stores academic journals cover to cover as page image data,
the most efficient approach is to produce fulltext data from the page image
data.
Many OCR and document layout analysis technologies have been studied, and
there are many OCR products on the market that can read text with high
accuracy from document images such as those scanned at relatively high
quality for the NACSIS-ELS service. Although many OCR products include layout
analysis capabilities, almost none provides a document structure recognition
function.
Because our goal is to enrich the contents of NACSIS-ELS, we do not want to
put much effort into research and development of such element technologies,
but rather to integrate a system mainly by utilizing products on the market.
3.1 Current OCR Technology and its Application
In order to grasp the actual state of the art, we experimented with
conversion of page image data extracted from the NACSIS-ELS image database,
using a Japanese OCR product whose performance is reputed to be high. For
each character, it can output a list of candidate characters with scores,
together with coordinate data.
The results are shown in Table 1, and the error classes and samples are shown
in Table 2.
3.1.1 Character Recognition Errors
Table 1 shows that, when the quality of the document images is good, the
misrecognition rate is less than 0.06%, which is almost negligible for search
purposes. For most misrecognized characters, the correct character either
does not appear in the candidate list or appears only with a relatively low
score. Therefore, candidate lists are of little use in this case.
As the quality of the document images becomes worse, the misrecognition rate
tends to increase, but the proportion of misrecognized characters whose
correct characters are included in the candidate lists with high scores also
increases accordingly. Therefore, the recall rate can be much improved by
utilizing the candidate lists in the search process [3].
Looking at the first error class in Table 2, characters are often
misrecognized because a symbol and a letter or a Japanese character are
merged. For this class, the correct characters never appear in the candidate
lists. However, since such errors follow a common pattern for each character
combination, they can be recalled by expanding queries using a confusion
matrix method [4]. Moreover, parentheses and quotation marks could be
corrected automatically after OCR processing by applying constraints on
syntax and character width.
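A minimal sketch of confusion-matrix query expansion follows; it is not the
method of [4], and the confusion pairs shown are illustrative assumptions.
Each query character is expanded to the characters the OCR is known to
confuse it with, and every resulting variant is searched.

```python
# Query expansion with a confusion matrix (illustrative confusion pairs).
from itertools import product

# confusion[c] = characters that frequently appear in OCR output where c was printed
CONFUSION = {
    "0": {"O", "o"},
    "1": {"l", "I"},
    "(": {"C"},
}

def expand_query(query, confusion=CONFUSION):
    """Return all variant strings obtained by substituting confusable characters."""
    alternatives = [{ch} | confusion.get(ch, set()) for ch in query]
    return {"".join(chars) for chars in product(*alternatives)}

# Usage: search the OCR text for every expanded variant of the query term.
print(sorted(expand_query("(1)")))   # e.g. ['(1)', '(I)', '(l)', 'C1)', ...]
```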
3.1.2 Layout Recognition Errors
Table 1 shows that the 2-gram connection error rate is less than 0.05%. The
errors consist of four cases.
The first case, about 40% of the errors, is that a running title or a page
number is inserted at a page break or a column break. These elements can be
identified by considering their relative location and string patterns.
The second case, about 30%, is that a title, a caption, or other text in
figures and tables is inserted. Separating such text requires simple document
image recognition, which is difficult to perform accurately.
The third case, about 20%, is that the order of blocks within a page is
confused. The blocks are easily reordered according to their coordinates.
The last case, about 10%, is that blocks or lines are misrecognized. These
can be recovered by checking the height and width of each line.
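The sketch below illustrates, under assumed coordinate conventions and
thresholds, two of the corrections just described: reordering OCR text blocks
by their page coordinates (top to bottom within each column, columns left to
right) and rejecting lines whose geometry deviates too far from normal body
text.

```python
# Illustrative block reordering and line plausibility checks.
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    x0: float; y0: float; x1: float; y1: float   # page coordinates, y grows downward

def reorder_blocks(blocks, column_boundary):
    """Sort blocks into reading order for a 2-column page (illustrative rule)."""
    left  = [b for b in blocks if b.x0 < column_boundary]
    right = [b for b in blocks if b.x0 >= column_boundary]
    return sorted(left, key=lambda b: b.y0) + sorted(right, key=lambda b: b.y0)

def plausible_line(height, width, expected_height, min_width):
    """Reject lines whose geometry deviates too far from normal body text."""
    return 0.5 * expected_height <= height <= 2.0 * expected_height and width >= min_width
```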
After these corrections, 2-gram connection errors of about 0.02% will remain.
If a 4-character word is used as a query, the recall error rate will be about
0.06%, which is almost negligible. However, because this figure means that
one block or line ordering error occurs about every 10 pages in typical
Japanese academic journals, the text is not suitable for presentation
purposes.
3.1.3 Dropping Errors
Table 1 shows that 0.15% of all characters were dropped (i.e., unread). Most
of these were caused by block recognition errors on 2-column pages due to
interference from figures and tables. Such errors account for more than half
of all errors. Moreover, they cannot be handled by error correction or query
expansion, because the data themselves do not exist.
One way to reduce dropping errors would be to make the layout analysis more
accurate, which we want to avoid as far as possible. Another, more practical
method is to register several layout patterns with the OCR system and to
repeat the OCR process, switching layout patterns according to the results.
This method takes somewhat longer for the OCR process but can avoid block
dropping.
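A minimal sketch of this retry strategy follows. The `run_ocr` callable is a
hypothetical wrapper around the OCR product, not a real API; it is assumed to
return the recognized text and the fraction of the page area covered by
recognized blocks for a given registered layout pattern.

```python
# Retry OCR with different registered layout patterns (run_ocr is hypothetical).
def ocr_with_layout_retry(page_image, layout_patterns, run_ocr, min_coverage=0.9):
    """Try each registered layout pattern until the result covers enough of the page."""
    best = None
    for pattern in layout_patterns:            # e.g. ["2-column", "1-column", "mixed"]
        text, coverage = run_ocr(page_image, layout=pattern)   # hypothetical call
        if best is None or coverage > best[1]:
            best = (text, coverage, pattern)
        if coverage >= min_coverage:           # good enough: stop early
            break
    return best
```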
By combining all of the methods described above, fulltext data accurate
enough for search purposes can be obtained, accompanied by coordinates and a
candidate list for each character.
3.2 Document Structure Recognition
Document structure recognition has been studied as an advanced technology of
document image processing. However, its performance is not yet good enough to
yield a general-purpose product on the market. Satisfactory accuracy may be
achieved by writing rules that describe the relative locations, character
sizes, and font styles of document elements for a specific document type, and
applying them in document image processing. In a digital library application,
however, it is difficult to write such rules tuned for each document type.
We are therefore considering extracting the minimum document structure
required for search purposes by combining rules on character string patterns,
which are more abstract than image-level characteristics, with rules on the
coordinate information output by the OCR for each character. Although this
still requires knowledge of the layout of each document type, the rules can
be prepared without any knowledge of document image processing technology.
With this method, elements including running titles, page numbers,
bibliographic data such as the article title and author names, and possibly
the reference list are expected to be recognized for each article.
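The following sketch illustrates this combination of string patterns and
coordinate rules for the simplest elements. The thresholds and regular
expressions are illustrative assumptions, not the rules actually used for
NACSIS-ELS.

```python
# Labeling lines as page numbers, running titles, or body text
# using string patterns plus coordinate rules (illustrative).
import re
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    y0: float          # vertical position on the page (0 = top)
    page_height: float

PAGE_NUMBER = re.compile(r"^\s*[-–]?\s*\d{1,4}\s*[-–]?\s*$")

def classify_line(line, running_title=None):
    """Label a line as a page number, running title, or body text."""
    near_top    = line.y0 < 0.08 * line.page_height
    near_bottom = line.y0 > 0.92 * line.page_height
    if (near_top or near_bottom) and PAGE_NUMBER.match(line.text):
        return "page-number"
    if near_top and running_title and line.text.strip() == running_title:
        return "running-title"
    return "body"
```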
Extraction of the running titles and page numbers will improve the recall and
precision of search processing. The extracted bibliographic data can be used
to improve the accuracy of the manually input secondary information data
prepared for NACSIS-ELS by matching it element by element. The article title,
author names, and page numbers are also utilized when generating hyperlinks
from the table of contents to the article pages.
The references are indispensable for producing hyperlinks to the referenced
articles. To cope with character recognition errors in each reference item,
we have already developed a method that improves the recall rate for
referenced articles by approximate matching against records stored in
existing secondary information databases [5]. However, this method cannot be
applied if the extraction of each reference item is incorrect. As it is
difficult to improve this accuracy using only the image characteristics
output by the OCR, we are considering using rules describing text- and
document-level characteristics, for example the string pattern of the leading
header (e.g. ``References''), the location in the document, the format of
reference identifiers, and the use of punctuation, in addition to coordinate
information such as the starting position and the length of each line.
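A minimal sketch of such a rule-based segmentation is shown below, assuming
that the reference list begins at a heading line and that each new item
starts at a line matching a reference-identifier pattern. The patterns are
illustrative assumptions.

```python
# Segmenting an OCR'd reference list into items using string-pattern rules.
import re

HEADING   = re.compile(r"^\s*(References|Bibliography|参考文献)\s*$", re.IGNORECASE)
ITEM_HEAD = re.compile(r"^\s*(\[\d+\]|\d+\))\s+")

def split_reference_items(lines):
    """Return the list of reference items found after the references heading."""
    items, current, in_refs = [], [], False
    for line in lines:
        if not in_refs:
            in_refs = bool(HEADING.match(line))
            continue
        if ITEM_HEAD.match(line):               # a new item starts here
            if current:
                items.append(" ".join(current))
            current = [line.strip()]
        elif current:                           # continuation of the current item
            current.append(line.strip())
    if current:
        items.append(" ".join(current))
    return items
```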
There are further issues, such as recognition of figure and table areas,
their titles and captions, chapter and section headings, footnotes and their
indicators, reference indicators, and so on. However, these are difficult to
recognize accurately using only the information currently output by the OCR.
Recognition of these data elements might help the system improve its search
and browsing functions, but it is not essential for the service.
3.3 Data View
The way to incorporate the text data obtained as described in the previous
section into the service is discussed below.
For search purposes, we are going to use manually input data for the
bibliographic and abstract parts and, for the body parts, ``main'' fulltext
data obtained from the OCR-processed text by applying appropriate error
correction and selecting only the first candidate. When the image quality is
not good enough, we will use, in parallel, ``complementary'' fulltext data
that preserves candidate characters whose scores exceed a certain threshold.
The indexing method to utilize these data is now under development.
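The sketch below shows, under an assumed shape for the OCR output, how these
two text layers could be derived: the ``main'' text keeps only the first
candidate per character, while the ``complementary'' layer keeps every
candidate above a score threshold so that later matching can consider the
alternatives. The threshold and data shapes are illustrative.

```python
# Deriving "main" and "complementary" text layers from per-character candidates.
def build_text_layers(ocr_chars, score_threshold=0.6):
    """ocr_chars: list of candidate lists, each candidate a (char, score) pair,
    ordered by descending score."""
    main = "".join(cands[0][0] for cands in ocr_chars if cands)
    complementary = [
        [ch for ch, score in cands if score >= score_threshold] or [cands[0][0]]
        for cands in ocr_chars if cands
    ]
    return main, complementary

# Usage with illustrative candidates for three character positions.
ocr = [[("文", 0.95)], [("書", 0.70), ("害", 0.65)], [("A", 0.9), ("Λ", 0.3)]]
main, comp = build_text_layers(ocr)
print(main)    # "文書A"
print(comp)    # [['文'], ['書', '害'], ['A']]
```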
For browsing purposes, we will use the bibliographic and abstract data and
the ``main'' fulltext data, the same as for search purposes.
In order to make the browsing text ``clean'', we might use some of the error
correction methods proposed for OCR-processed text, based on superficial
analysis using a word dictionary or linguistic statistics such as 2-gram
frequencies. However, this kind of error correction should not be applied to
the text used for search. Scholarly terms are frequently used as search
terms, and they often include Latin words, acronyms, proper nouns such as
inventors' names, and so on. Proper nouns such as system names are also used
as important search terms. It should be noted that, because these are often
not included in the dictionary or do not fit the linguistic statistics,
important information may be lost in the process of error correction.
For presentation purposes, raster image data are most practical, because OCR
and document structure recognition technologies are not reliable enough to
produce authoritative data. If the character misrecognition rate were low
enough, formatted text such as PDF could be produced from the OCR-processed
text data and the coordinate information. With the current technology,
however, we cannot adopt this method because checking by human eyes remains
indispensable.
To feed the search results back to the users, highlighting the corresponding
parts of the image is effective. This requires location information in the
search data pointing into the presentation data. No location information is
necessary for the browsing data if it is integrated into the search data.
Depending on the user's choice, the fulltext search process will execute,
first, strict matching against the ``main'' fulltext; then, if recall is
inadequate, approximate matching using the confusion matrix method; and then,
if the results are still unsatisfactory, extended matching using the
candidate lists.
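A minimal sketch of this staged strategy follows. The hit threshold and
matching routines are illustrative; `query_variants` would come from a
confusion-matrix expansion such as the one sketched earlier, and
`complementary` from the candidate-preserving text layer.

```python
# Staged fulltext search: strict, then expanded, then candidate-list matching.
def find_all(text, variants):
    """Start offsets of every occurrence of any query variant in the main text."""
    return sorted({i for v in variants for i in range(len(text) - len(v) + 1)
                   if text.startswith(v, i)})

def match_candidates(complementary, query):
    """Offsets where every query character appears among the OCR candidates."""
    n = len(query)
    return [i for i in range(len(complementary) - n + 1)
            if all(query[j] in complementary[i + j] for j in range(n))]

def staged_search(query, query_variants, main_text, complementary, min_hits=1):
    hits = find_all(main_text, [query])                # 1: strict matching
    if len(hits) >= min_hits:
        return hits
    hits = find_all(main_text, query_variants)         # 2: confusion-matrix expansion
    if len(hits) >= min_hits:
        return hits
    return match_candidates(complementary, query)      # 3: candidate-list matching
```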
4 Electronic Text Sources
To automate the production of fulltext data and to improve efficiency and
accuracy, it is ideal for each academic society itself to provide its
journals in an electronic text format; the fulltext data for NACSIS-ELS can
then be obtained by converting the electronic text. In reality, however, most
academic societies in Japan lag behind in introducing publishing systems. One
practical approach is therefore to develop and offer a system that integrates
the editing process in academic societies from authoring through printing,
based on standards and popular tools. Fulltext data are then obtained by
converting the internal text data produced in that process.
The data format for the presentation view is a crucial issue even when
electronic text sources are available. Adopting the format that the copyright
holders regard as the final distribution format is the safest choice. If that
is impossible because of the system configuration, however, we should adopt a
physically more concrete format. In that case, location information
identifying the corresponding part of the final distribution format will have
to be embedded in the presentation data.
4.1 Data View and Format Conversion
If the final distribution is in a printed format, the data view will take the
same structure as described in the previous chapter, except that the raster
image data for the presentation view can be generated electronically without
scanning printed paper. A formatted text format can also be adopted for the
presentation view. Text data produced by converting the internal text data
can be used for search and browsing purposes.
In some cases, however, the internal text cannot be converted because quite a
few Japanese word processors and DTP packages cannot produce output in SGML
or RTF format. Document structure recognition from the formatted text data
may then be necessary to produce the tagged text data.
It is also important that, for a given document, the contents of the printed
format and the electronic text format be identical. Since texts are sometimes
physically modified in the printing process, the same modifications must be
applied to the electronic text data. If this routine cannot be carried out,
the electronic text source is of no use.
Another point to note is that coordinate information in the presentation data
is required for the results of the search process. Therefore, the location of
each character resulting from typesetting has to be inserted back into the
text data. If this is impossible for technical reasons, OCR processing may
actually be preferable. Even in that case, since the OCR is applied to
electronically generated raster images, the character recognition rate is
expected to be very high, and no countermeasures against recognition errors
should be necessary.
For electronic publishing in a formatted text format, PDF seems to be the
most commonly used. A specification for describing document structure is
defined for PDF, so all the information needed for each view can be extracted
from a PDF document if it includes that structure. In practice, however,
almost no PDF documents include structure information, so the structure has
to be recognized from the layout information.
It is desirable, if possible, to obtain the internal text data used for
editing and to produce the text data for search and browsing from it,
provided that structure-conscious editing is performed using style sheets and
the like. In order to add information such as highlights to the presentation
view, the search data will need to hold the offset of each character in the
formatted text data.
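As a present-day sketch of this idea, the code below uses the pdfminer.six
library (not mentioned in this paper) to extract the text of a PDF together
with the bounding box of each character, so that search-data offsets can
later be mapped back onto the formatted presentation view. The file name is
illustrative.

```python
# Extracting text plus per-character bounding boxes from a PDF with pdfminer.six.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def text_with_char_boxes(pdf_path):
    chars = []                       # list of (offset, page_no, char, bbox)
    text = []
    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        chars.append((len(text), page_no, obj.get_text(), obj.bbox))
                        text.append(obj.get_text())
    return "".join(text), chars

# Usage: full_text feeds the search index; char_boxes supports highlighting.
full_text, char_boxes = text_with_char_boxes("article.pdf")
```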
For electronic publishing in a tagged text format that can describe the
logical structure explicitly (e.g. SGML, XML, and possibly HTML 4.0), a
flexible user interface can be constructed because the electronic text itself
may be provided as the presentation data. In order to handle such documents
together with documents in other formats in an integrated service system, the
text data for search and for browsing must each be converted to a common
format.
It should be noted that, because users can easily reuse or modify documents
provided in a tagged text format, some measures are required to protect
copyright. If this is difficult, formatted text data or raster image data
should be used for the presentation view, as in the case of printed documents
or formatted texts.
In the conversion process, information identifying the part of the original
text corresponding to each presented part should be embedded in the
presentation data. Displaying document element identifiers (e.g. paragraph
numbers) in the margin may be one solution.
4.2 Practical System Configuration
When incorporating electronic text sources into NACSIS-ELS, a step-by-step
approach to practical system construction will be required.
Because of the current institutional restrictions (e.g. copyright protection,
charging schemes) and system restrictions (e.g. the implementation of
plug-ins or clients), we will have no choice, for the time being, but to use
tagged text for the searching and browsing views and raster images for the
presentation view.
As the next step, formatted text formats (e.g. PDF) will be used for
presentation. The issues here are the charging scheme and the implementation
of the clients. Alignment among the charging unit (e.g. per page or per
document), the document formats, and the client functions will be the
condition for realization. In other words, if there were no charging issue,
PDF could easily be incorporated into the NACSIS-ELS service. At this stage,
a raster image format may be used in parallel so that NACSIS-ELS remains
accessible from as many platforms as possible.
In the step after that, tagged text formats (e.g. SGML, XML) will be used for
presentation. The issues here are mainly the charging scheme and copyright
protection. As there is no concept of pages, if a charging unit smaller than
per document is required, a new charging scheme (e.g. per kilobyte of
presented text) has to be introduced. Copyright protection requires
corresponding functions on the client side, and serious consideration is
required as to whether commonly used clients can satisfy the protection level
requested by each copyright holder. If there were no copyright issues, HTML
could be used for the presentation view in parallel in order to expand the
number of supported platforms.
In all of the steps described above, if the copyright holders are not
satisfied with the copyright protection mechanism or the charging scheme, the
service may provide only the search and browsing views, leaving the
presentation view to the copyright holders themselves. In this case,
NACSIS-ELS and the copyright holders have to adopt a common document
identifier scheme, and feeding the search results back into the presentation
view becomes almost impossible.
For formatted text formats and tagged text formats other than those described
above, it is impossible to address the charging scheme, copyright protection,
and the implementation of clients and plug-ins individually; the mechanisms
provided by the publishers will therefore be used for the presentation view
as they are.
5 Conclusion
Based on experience with NACSIS-ELS, this paper has described the author's
personal view of the compilation methods and the conceptual design of digital
library contents, focusing on the data sources and the final document
distribution formats. It should be noted, among other things, that a layered
data view for searching, browsing, and presenting should be taken into
account in the design of the content data structure and the user interface,
and that, although electronic publishing is becoming widespread, several
obstacles must be overcome before electronic publications can be incorporated
into a digital library service provided to unrestricted users. The urgent
issues for us are the construction of a fulltext database from the document
image database and the implementation of a retrieval system for it.
Although this paper has dealt mainly with the basic theme of academic journal
contents in a digital library, other research is also in progress at NACSIS
in the fields of natural language processing, fulltext search technology,
intelligent user interfaces, and so on. The results of this research will be
integrated into the future NACSIS-ELS system.
References
[1] Adachi, J., Hashizume, H.: "NACSIS Electronic Library System: Its Design
and Implementation", Proc. of ISDL'95, Tsukuba, Japan, Aug. 1995, pp. 36-41.
[2] "UC Berkeley Digital Library Project -- Documents", URL:
http://elib.cs.berkeley.edu/docs/.
[3] Senda, S., Minoh, M., Ikeda, K.: "A Document Image Database System with
Full-text Search Functions", Digital Libraries, No. 8, 1996.
[4] Ohta, M., Takasu, A., Adachi, J.: "Retrieval Methods for English-Text
with Misrecognized OCR Characters", Proc. of ICDAR'97, Ulm, Germany,
pp. 950-956, 1997.
[5] Takasu, A., Katayama, N., Yamaoka, M., Iwaki, O., Oyama, K., Adachi, J.:
"Approximate Matching for OCR-processed Bibliographic Data", Proc. of
ICPR'96, Vienna, Austria, Vol. III, pp. 175-179, 1996.