An Information Retrieval Using Conceptual Index Term
For Technical Paper on Digital Library
Chinatsu Horii, Masakazu Imai and Kunihiro Chihara
Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara, 630-01, Japan
E-mail : chinatsu, imai, chihara@is.aist-nara.ac.jp
Abstract
This paper presents a method for semantic Information Retrieval (IR)
implemented on a Digital Library. It is well known that a Digital
Library should provide an IR system through which users can
automatically access every kind of media from anywhere. However, no
improvement has been made for the retrieval errors caused by
individual differences in users' requests. This is one of the
significant problems for the search efficiency of IR. Our approach
does not use the request itself but the concepts behind it. This makes
it possible to retrieve semantic information rather than merely
comparing the word strings of the request.
Key words : Digital Library, Concept path, Query, Index Term, Request,
Technical Paper
1 Introduction
The realization of a Digital Library as an information service
center is attracting increasing interest with the development of
networks and multimedia technology. As more information is stored as
digital data, it becomes more important for a Digital Library to
provide the information that the user actually wants. In a
conventional library, the librarian ensures the reliability of IR by
resolving the omissions and the noise caused by individual differences
in how users understand the concepts of their requests. Since the
interaction with a human operator in the IR process is reduced in a
Digital Library, it becomes necessary to construct an automatic
information retrieval system that solves the conformity problems of
the search results in place of the librarian.
Automatic IR systems have been studied [1][2][3]. However, it is
difficult to say that these methods can sufficiently express the
features of the sense. We propose here an IR system that uses index
terms based on concept information, not merely on word strings. This
system realizes IR at the concept level using the EDR Dictionary[4] as
the thesaurus.
In our approach, the retrieval objects are technical papers stored
in the Digital Library. They are text files generated by OCR and
contain misread and omitted words. One of the purposes of this study
is to obtain the theme of a paper at the concept level. It is also
expected to enable the positioning and clustering of the papers.
2 The structure of the document database
It is difficult to represent the subject of a document accurately
by words alone. In this approach, a substantial representation of the
contents is achieved by feature extraction using concept information
in addition to word information. The feature of the document is
represented as a feature vector whose components are based on the
words and the concepts in the document.
2.1 The distribution of the appearance frequency of a word
In this study, we use the weight of a word, estimated from the
appearance frequency of the word in the document, to compare the
document with other documents. The weight of a word represents the
magnitude of the feature of the document.
The notation for a word Wi, which is the input of the system, and for
the document database DB, which is a set of documents dj, is defined
as follows:
Let the appearance frequency of a word Wi in the document dj be wfij.
The weight of a word Wi, wIij, is defined as follows:
where max(wflj) is the frequency of the most frequent word Wl in the
document dj.
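A minimal sketch of this word weighting follows. It assumes that the
weight wIij is the frequency wfij divided by max(wflj); this
normalization is an assumption inferred from the definition of
max(wflj) above, and the function name word_weights is hypothetical.

```python
from collections import Counter


def word_weights(words):
    """Return {word: frequency / max frequency} for one document.

    Assumption: the weight of a word is its appearance frequency
    divided by the frequency of the most frequent word in the same
    document, so the most frequent word gets weight 1.0.
    """
    freq = Counter(words)
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {word: n / max_freq for word, n in freq.items()}


# Hypothetical word list standing in for one analyzed document.
doc = ["retina", "cell", "retina", "vision", "retina", "cell"]
weights = word_weights(doc)
```

Under this assumption every weight lies in the range (0, 1], which
makes the weights of documents of different lengths comparable.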
2.2 The distribution of the appearance frequency of a concept
To obtain the semantic frequency in the document, we use the weight of
a concept, estimated from the appearance frequency of the concept
derived from a word. The concept space forms a tree structure built
from the upper level concepts.
When the concept of a word is used in the feature extraction of
the document, the system runs into the problem that the information
space broadens to include other senses[5][6][7]. For example, if the
Japanese word "ベース" (both "base" and "bass" in English) is
conceptualized, several kinds of upper level concepts that carry a
sense of "ベース" are obtained. Fig.1 shows the main branch of the
concept space constructed by searching for the upper level concept
repeatedly.
In this paper, we propose a method that takes into account the
concept path based on the concept structure, as shown in Fig.2.
c(k,tk) denotes the tk-th concept at the k-th upper level in the
concept structure built from a word in the document dj. Cktkj is the
set of upper level concepts obtained from c(k-1, tk-1). The notation
for the concepts is defined as follows:
The concept path is the connection between the concepts in
the structure. The notation for the set of concept paths is defined
as follows:
where Pktkj is the set of concept paths obtained from c(k, tk)
(the black circle in Fig.2).
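The construction of concept paths can be sketched as follows. This is
only an illustration: the toy thesaurus UPPER, its concept names, and
the function concept_paths are hypothetical and do not reproduce the
EDR Dictionary. A path is modeled here as the chain of concepts
reached by repeatedly following upper level concepts.

```python
# Hypothetical toy thesaurus: concept -> list of upper level concepts.
UPPER = {
    "bass(sound)": ["low tone"],
    "bass(fish)": ["fish"],
    "low tone": ["sound"],
    "fish": ["animal"],
    "sound": ["phenomenon"],
}


def concept_paths(concept, max_steps=3):
    """Enumerate paths from `concept` through at most `max_steps` upper levels.

    Each path is a tuple (c0, c1, ..., ck). A path stops early when
    its last concept has no upper level concept in the thesaurus.
    """
    paths = [(concept,)]
    for _ in range(max_steps):
        extended = []
        for path in paths:
            uppers = UPPER.get(path[-1], [])
            if uppers:
                # Branch once for every upper level concept.
                extended.extend(path + (u,) for u in uppers)
            else:
                # Top of the toy thesaurus reached; keep the path as is.
                extended.append(path)
        paths = extended
    return paths


sound_paths = concept_paths("bass(sound)")
fish_paths = concept_paths("bass(fish)")
```

The default of three steps mirrors the limit on the search for upper
level concepts stated in Section 4.1.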
To narrow down the concept space, our approach compares this
space with the concept space constructed from another word. For
example, when a second Japanese word is added, the overlapped part
with the same sense is obtained, as shown in Fig.3, by comparing its
concept space with that of "ベース". This overlapped part is defined
as the concept path that narrows down the concept space, and it is
estimated by the appearance frequency of the concept path. In this
approach, to obtain the semantic frequency in the document, the
weight of the concept is estimated from the appearance frequency of
each element of Pktkj. The weight of the concept, cIktkj, is
estimated as follows:
where the frequency is counted for each element of Pktkj, and
max() is the frequency of the most frequent concept path
in the document dj.
In practice, these concepts are represented by ID numbers in the EDR
dictionary, as shown in Fig.4, so the concept paths are easy to
compare.
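The weighting of concept paths can be sketched in the same way as the
word weighting. The sketch assumes that the weight of a concept path
is its appearance frequency divided by the frequency of the most
frequent concept path in the document. The mapping paths_of, its
contents, and the function name are hypothetical; in the real system
the path elements would be EDR concept IDs.

```python
from collections import Counter


def concept_path_weights(words, paths_of):
    """Weight each concept path by its normalized appearance frequency.

    words    -- word occurrences of one document
    paths_of -- hypothetical mapping: word -> set of concept paths
    Assumption: weight = frequency / frequency of the most frequent path.
    """
    freq = Counter()
    for word in words:
        # Every occurrence of a word counts once for each of its paths.
        freq.update(paths_of.get(word, ()))
    if not freq:
        return {}
    top = max(freq.values())
    return {path: n / top for path, n in freq.items()}


# Two words whose concept spaces overlap on one path.
paths_of = {
    "bass": {("tone", "sound"), ("fish", "animal")},
    "melody": {("tone", "sound")},
}
weights = concept_path_weights(["bass", "melody"], paths_of)
```

The path shared by both words receives the larger weight, which
corresponds to the overlapped part of the two concept spaces.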
2.3 The feature of the document
The concept information alone does not take the appearance
frequency of a word into account when the semantic frequency in the
document is obtained. It is therefore necessary to combine the weight
of the word with that of the concept when the document is compared
with other documents. Adding both weights, the combined weight of the
word and the concept is defined as follows:
where g represents the group of concept paths obtained from the
group of words in the document, excluding duplicates, and CP
indicates the set of words that share the same concept path. The
weight Igtgj, which determines each feature of the document, is
obtained by the following equation.
The feature vector is composed of the concepts with large weights.
The concepts with the largest weights represent the subject of the
document.
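A minimal sketch of the combination step is given below. It assumes
that the combined weight of a concept path is its concept weight plus
the sum of the word weights of the words that share that path, and
that the feature vector keeps a fixed number of the heaviest paths.
Both the additive form and the cut-off size are assumptions, and all
names and values are hypothetical.

```python
def combined_weights(word_w, concept_w, words_of_path):
    """Add the word weights of a path's words to its concept weight.

    word_w        -- {word: word weight}
    concept_w     -- {concept path: concept weight}
    words_of_path -- {concept path: set of words sharing that path}
    """
    return {
        path: cw + sum(word_w.get(w, 0.0) for w in words_of_path.get(path, ()))
        for path, cw in concept_w.items()
    }


def feature_vector(weights, size=3):
    """Return the `size` heaviest concept paths, heaviest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:size]


# Hypothetical weights for one document.
word_w = {"bass": 1.0, "melody": 0.5}
concept_w = {("tone", "sound"): 1.0, ("fish", "animal"): 0.5}
words_of_path = {
    ("tone", "sound"): {"bass", "melody"},
    ("fish", "animal"): {"bass"},
}
combined = combined_weights(word_w, concept_w, words_of_path)
theme = feature_vector(combined, size=1)
```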
3 Overview of the IR system
Fig.5 shows an overview of the IR system at the concept level.
The system consists of three parts: the query part, which obtains
the query at the concept level; the indexing part, which decides the
index terms for the theme of the technical paper; and the comparing
part, which compares the query with the theme and shows the retrieval
results to users. The "Juman System[8]" is used for Japanese
morphological analysis.
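The comparing part can be illustrated with a small sketch. The measure
used to compare the query with the theme is not specified here, so the
sketch swaps in cosine similarity between sparse concept-weight
vectors purely as an assumption. The function names and the sample
data are hypothetical.

```python
import math


def cosine(query_vec, theme_vec):
    """Cosine similarity between two sparse {concept: weight} vectors."""
    dot = sum(w * theme_vec.get(c, 0.0) for c, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_t = math.sqrt(sum(w * w for w in theme_vec.values()))
    if norm_q == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_q * norm_t)


def rank(query_vec, themes):
    """Return (document id, similarity) pairs, best match first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in themes.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical query and indexed themes at the concept level.
query = {"vision": 1.0, "transmission": 0.5}
themes = {
    "paper_a": {"vision": 1.0, "transmission": 0.5},
    "paper_b": {"speech": 1.0},
}
results = rank(query, themes)
```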
4 Experimental studies
4.1 Experiment
The main question in operating this system is whether the theme
of a technical paper can be represented at the concept level. It is
therefore necessary to perform experiments that attempt to acquire the
concept of the theme of technical papers. The experimental conditions
were set as follows:
- The technical papers are taken from the OCR text stored in the
Digital Library of our Institute.
- The search for upper level concepts is limited to three steps.
- No threshold is set on the concept paths, and paths detected only
once are truncated.
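The last condition can be sketched as a simple filter over the
concept path counts. The function name and the sample counts are
hypothetical.

```python
from collections import Counter


def truncate_single_paths(path_counts):
    """Drop the concept paths that were detected only once."""
    return Counter({path: n for path, n in path_counts.items() if n > 1})


# Hypothetical detection counts of concept paths in one paper.
counts = Counter({("tone", "sound"): 3, ("fish", "animal"): 1})
kept = truncate_single_paths(counts)
```

Apart from this removal of single detections, no threshold is applied,
so every path detected at least twice is kept.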
As an example of acquiring the theme of a technical paper, we
present the case of the paper titled "On the
Function of the Retinal Bipolar Cell in Early Vision" in the
Transactions of the Institute of Electronics, Information and
Communication Engineers D-II. This journal is one of the most popular
technical journals in Japan. The result of the experiment is shown in
Table 1. The table lists the concepts of the theme and their senses
in order of the weight of the concept path.
Each group of three rows indicates a concept path, and the upper
level concepts, which have more general meanings, appear further down.
This experimental result shows that the categories of this paper
consist of a set of concepts such as vision and transmission in the
information field.
4.2 Evaluation of experimental results
To evaluate the experimental result described above, we
conducted a questionnaire survey of 5 persons who have sufficient
knowledge of the information science field. One item of the
questionnaire asked them to evaluate whether the sets of concepts
represent the theme of the technical paper, after reading 15 papers
published in Transactions of IEICE vol.J78-DII NO.7. Each evaluation
was given on a scale of 4 ranks: "excellent", "good", "fair" and
"poor". The results of the questionnaire are shown in Fig.6. They
show that the theme of a paper at the concept level obtained in the
experiment agrees with the theme that the user determined
himself/herself without using the system developed here.
To examine the evaluation results in a little more detail, the
evaluation score of each paper is plotted in Fig.7, assigning 3 points
for "excellent", 2 for "good", 1 for "fair" and 0 for "poor". From
Fig.7, the average score of 10.6 points is roughly satisfactory, but
the variance of the score is rather large, and the scores range from
5 (minimum value) to 15 (maximum value). The reasons for this are as
follows:
- There are technical terms that the EDR does not include
(e.g. HMM, GA)
- There are some cases in which the morphological analysis is
inadequate (e.g. "3 dimension" is divided into "3" and "dimension")
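The scoring used for Fig.7 can be sketched as follows. It assumes that
the score of a paper is the sum of the points given by the 5
evaluators, which is consistent with the maximum value of 15. The
ratings in the example are hypothetical and are not taken from the
questionnaire.

```python
# Points per rank, as stated for Fig.7.
POINTS = {"excellent": 3, "good": 2, "fair": 1, "poor": 0}


def paper_score(ratings):
    """Sum the points of all evaluators' ratings for one paper."""
    return sum(POINTS[rating] for rating in ratings)


# Hypothetical ratings of one paper by 5 evaluators.
ratings = ["excellent", "good", "good", "fair", "excellent"]
score = paper_score(ratings)
best = paper_score(["excellent"] * 5)
```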
In these experiments, we did not use a special dictionary for
technical terms. In the future, we intend to treat as a technical
term any word that appears very often in the document but for which
no concept can be obtained. This approach will be able to deal with
words that are not stored in the dictionary and with new words coined
by the author of the document.
Since the connection rules for morphological analysis in the
Juman system can be changed, we can try to modify them so as to adapt
to user-oriented problems.
5 Conclusion
In this paper, we proposed an IR system at the concept level
for technical papers that are generated by OCR, contain misread and
omitted words, and are stored in a Digital Library.
Conceptualizing the theme of a technical paper is expected to
enable the positioning and clustering of the papers. Furthermore, it
will contribute to the research fields of IR and Digital Library
systems.
References
[1]Hosono, K, ed. Information Retrieval. Oyamakaku Publishing,
1991 (In Japanese).
[2]Harter, S. P. A probabilistic approach to automatic keyword
indexing. Part II: An algorithm for probabilistic indexing. Journal
of the American Society for Information Science, 26.5, pp. 280-289,
1975.
[3]Sembok, T. and Rijsbergen, C. J. SILOL:A simple logical linguistic
document retrieval system. Information Processing & Management,
26.1, pp. 111-134, 1990.
[4]Japan Electronic Dictionary Research Institute, Ltd. EDR
Electronic Dictionary Technical Guide(2nd edition). Japan: EDR
TR-045, 1995.
[5]Yokoi, T, et al. Information structure of an electric dictionary at
the surface level. Information Processing Society of Japan, 37.3, pp.
333-343, 1996.
[6]Lim, C. and Chen, H. An Automatic Indexing and Neural Network
Approach to Concept Retrieval and Classification of Multilingual
(Chinese-English) Documents. IEEE Transactions on Systems, Man and
Cybernetics, 1994.
[7]Salton, G. Allen, J. and Buckley, C. Automatic structuring and
retrieval of large text files. Communications of the ACM, 37.2,
pp. 94-108, 1994.
[8]Matsumoto, Y, et al. Japanese Morphological Analysis System JUMAN
Manual. Nara Institute of Science and Technology, Japan, 1994.