Keywords: Digital Library, Concept Path, Query, Index Term, Request, Technical Paper
Automatic information retrieval (IR) systems have been studied widely [1][2][3]. However, it is difficult to say that existing methods sufficiently express the meaning of a document. We propose an IR system that uses index terms based on concept information, not merely on word strings. The system realizes IR at the concept level by using the EDR Dictionary [4] as a thesaurus.
In our approach, the retrieval objects are technical papers stored in a Digital Library. They are text files generated by OCR and therefore include misread and omitted words. One purpose of this study is to obtain the theme of a paper at the concept level. This is also expected to enable the positioning and clustering of papers.
Let the frequency of occurrence of a word Wi in the document dj be wfij. The weight of the word Wi, wIij, is defined as follows.
where max(wflj) is the frequency of the most frequent word Wl in the document dj.
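As an illustration, the word weight can be sketched in Python under the assumption that wIij is the frequency wfij divided by max(wflj). This normalized form is only inferred from the definition of max(wflj); the function name word_weights and the sample word list are hypothetical.

```python
from collections import Counter


def word_weights(words):
    """Return a weight per word for one document.

    ASSUMPTION: weight = frequency of the word divided by the frequency
    of the most frequent word in the same document.
    """
    freq = Counter(words)
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {word: count / max_freq for word, count in freq.items()}


# Hypothetical document, already split into words.
doc = ["retina", "cell", "retina", "vision", "retina", "cell"]
weights = word_weights(doc)
```

Under this assumption, the most frequent word always receives the weight 1.0 and every other word a value between 0 and 1.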
A concept path represents the connections between concepts in the concept structure. The set of concept paths is defined as follows:
where P_{k tkj} is the set of concept paths obtained from c(k, tk) (the black circle in Fig.2).
To narrow down the concept space, our approach compares this space with the concept space constructed from another word. For example, when a second Japanese word is added, the overlapping part with the same sense is obtained by comparing the two concept spaces, as shown in Fig.3. This overlapping part is defined as the concept path that narrows down the concept space, and it is estimated from the appearance frequency of the concept path. In this approach, to obtain the semantic frequency in the document, the weight of a concept is estimated from the appearance frequency of each element of Pktkj. The weight of the concept, cIktkj, is estimated as follows:
where
In practice, these concepts are represented by their ID numbers in the
EDR dictionary, as shown in Fig.4, which makes it easy to compare
concept paths with each other.
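The narrowing step can be sketched as a set intersection over concept paths. In this minimal sketch, a concept path is assumed to be a tuple of concept IDs; the table CONCEPT_PATHS, the placeholder words, and the IDs are invented for illustration and are not real EDR entries.

```python
from collections import Counter

# Hypothetical concept paths per word: each path is a tuple of concept IDs.
# These IDs are made up and do not come from the EDR dictionary.
CONCEPT_PATHS = {
    "word_a": {("c001", "c010", "c101"), ("c001", "c020", "c201")},
    "word_b": {("c001", "c010", "c101"), ("c001", "c030", "c301")},
}


def shared_paths(word_x, word_y, table=CONCEPT_PATHS):
    """Concept paths that both words have in common (the overlapping part)."""
    return table[word_x] & table[word_y]


def concept_path_frequency(words, table=CONCEPT_PATHS):
    """Count how often each concept path appears over the words of a document."""
    counts = Counter()
    for word in words:
        for path in table.get(word, ()):
            counts[path] += 1
    return counts


overlap = shared_paths("word_a", "word_b")
freq = concept_path_frequency(["word_a", "word_b", "word_a"])
```

Because the paths are plain tuples of IDs, comparing two concept paths reduces to an equality test, which is what makes the ID representation convenient.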
where g represents the group of concept paths obtained from the set of
distinct words in the document, and CP indicates the set of words that
share the same concept path. The weight Igtgj, which determines each
feature of the document, is obtained by the following equation.
The feature vector is composed of the concepts with large weights.
A concept with a larger weight indicates the subject of the document.
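A minimal sketch of selecting the feature vector, assuming the concept weights are already available as a mapping from concept to weight. The cutoff k, the function name feature_vector, and the sample weights are hypothetical; the text does not state how many concepts are kept.

```python
def feature_vector(concept_weights, k=3):
    """Return the k concepts with the largest weights, heaviest first.

    ASSUMPTION: a fixed cutoff k is used; ties are broken by concept name
    so that the result is deterministic.
    """
    ranked = sorted(concept_weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical concept weights for one document.
sample_weights = {"vision": 0.9, "transmission": 0.7, "information": 0.4, "animal": 0.1}
top = feature_vector(sample_weights, k=2)
```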
[2] Harter, S. P. A probabilistic approach to automatic keyword
indexing. Part II: An algorithm for probabilistic indexing. Journal
of the American Society for Information Science, 26.5, pp. 280-289,
1975.
[3] Sembok, T. and Rijsbergen, C. J. SILOL: A simple logical linguistic
document retrieval system. Information Processing & Management,
26.1, pp. 111-134, 1990.
[4] Japan Electronic Dictionary Research Institute, Ltd. EDR
Electronic Dictionary Technical Guide (2nd edition). Japan: EDR
TR-045, 1995.
[5] Yokoi, T., et al. Information structure of an electric dictionary at
the surface level. Information Processing Society of Japan, 37.3, pp.
333-343, 1996.
[6] Lim, C. and Chen, H. An Automatic Indexing and Neural Network
Approach to Concept Retrieval and Classification of Multilingual
(Chinese-English) Documents. IEEE Transactions on Systems, Man and
Cybernetics, 1994.
[7] Salton, G., Allen, J. and Buckley, C. Automatic structuring and
retrieval of large text files. Communications of the ACM, 37.2,
pp. 94-108, 1994.
[8] Matsumoto, Y., et al. Japanese Morphological Analysis System JUMAN
Manual. Nara Institute of Science and Technology, Japan, 1994.
2.3 The feature of the document
The appearance frequency of a word alone does not take concept
information into account when estimating the semantic frequency in the
document. It is therefore necessary to combine the weight of the word
with the weight of the concept when the document is compared with
other documents. Adding the two weights, the combined weight of the
word and the concept is defined as follows:
3 Overview of the IR system
Fig.5 shows an overview of the concept-level IR system. The system
consists of three parts: the query part, which obtains the query at
the concept level; the indexing part, which decides the index terms
for the theme of the technical paper; and the comparing part, which
compares the query with the theme and shows the retrieval results to
users. The JUMAN system [8] is used for Japanese morphological
analysis.
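The comparing part might be sketched as follows. Cosine similarity is used here only as an assumed stand-in for the comparison between the query and the theme, which the text does not specify; the concept IDs, paper names, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec, index):
    """Rank indexed papers by similarity to the concept-level query."""
    scored = [(doc_id, cosine(query_vec, theme)) for doc_id, theme in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical index: paper -> {concept ID: weight}, as produced by the indexing part.
index = {
    "paper_1": {"c101": 1.0, "c201": 0.5},
    "paper_2": {"c301": 1.0},
}
query = {"c101": 1.0}  # hypothetical concept-level query
results = retrieve(query, index)
```

Here the query and each theme share the same representation, a weighted set of concepts, so any vector similarity measure could be substituted for the cosine.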
4 Experimental studies
4.1 Experiment
The main question in operating this system is whether the theme
of a technical paper can be represented at the concept level. It is
therefore necessary to perform several experiments that try to acquire
the concept of the theme of a technical paper. The experimental
conditions were set up as shown below:
As an example of acquiring the theme of a technical paper, we
present the case of the paper titled "On the Function of the Retinal
Bipolar Cell in Early Vision" in the Transactions of the Institute of
Electronics, Information and Communication Engineers D-II, one of the
most popular technical journals in Japan. The result of the experiment
is shown in Table 1. The table lists the concepts of the theme and
their senses in descending order of the weight of the concept path.
Each group of three lines indicates a concept path, in which
the upper-level concepts, which have more general meanings, appear on
the lower lines. This experimental result shows that the categories of
this paper consist of a set of concepts such as vision and
transmission in the information field.
4.2 Evaluation of experimental results
To evaluate the experimental result described above, we
conducted a questionnaire survey of 5 persons who have sufficient
knowledge of the information science field. After reading 15 papers
published in the Transactions of IEICE vol. J78-D-II, No. 7, the
respondents were asked to evaluate whether the sets of concepts
represent the theme of each technical paper. Each evaluation was given
as one of 4 grades: "excellent", "good", "fair", and "poor". The
results of the questionnaire are shown in Fig.6. They show that the
concept-level theme obtained in the experiment agrees with the theme
that a user determines himself/herself without using the system
developed here.
To examine the evaluation results in a little more detail, the
score for each paper is plotted in Fig.7, assigning 3 points for
"excellent", 2 for "good", 1 for "fair", and 0 for "poor". According
to Fig.7, the average score of 10.6 points is roughly satisfactory,
but the variance is somewhat large, and the scores range from 5
(minimum value) to 15 (maximum value). The reasons for this are as
follows:
In these experiments, we did not use a special dictionary for
technical terms. In the future, we intend to treat a word that appears
very often in the document but cannot be mapped to a concept as a
technical term. This approach will make it possible to deal with words
that are not stored in the dictionary and with new words coined by the
author of the document.
Since the connection rules for morphological analysis in the
JUMAN system can be changed, we can try to modify them so as to adapt
the system to user-oriented problems.
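The scoring scheme used for Fig.7 can be sketched as follows. The grades in the example are invented for illustration and are not data from the questionnaire; only the point values (3, 2, 1, and 0) and the number of evaluators (5) come from the text.

```python
from statistics import mean, pvariance

# Points per grade, as described for Fig.7 ("poor" earns no points).
POINTS = {"excellent": 3, "good": 2, "fair": 1, "poor": 0}


def paper_score(grades):
    """Sum the points given to one paper by all evaluators."""
    return sum(POINTS[grade] for grade in grades)


# Hypothetical grades from 5 evaluators for two papers.
grades_by_paper = {
    "paper_a": ["excellent"] * 5,
    "paper_b": ["good", "good", "fair", "poor", "poor"],
}
scores = {paper: paper_score(grades) for paper, grades in grades_by_paper.items()}
average = mean(scores.values())      # average score over the papers
spread = pvariance(scores.values())  # population variance of the scores
```

With 5 evaluators, the score of a paper ranges from 0 to 15, which matches the maximum value of 15 reported for Fig.7.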
5 Conclusion
In this paper, we proposed a concept-level IR system for
technical papers that are generated by OCR, include misread and
omitted words, and are stored in a Digital Library.
Conceptualizing the theme of a technical paper is expected to
enable the positioning and clustering of papers. Furthermore, it will
contribute to research on IR and Digital Library systems.
References
[1] Hosono, K., ed. Information Retrieval. Oyamakaku Publishing,
1991 (in Japanese).