Information Retrieval Using Conceptual Index Terms for Technical Papers in a Digital Library

Chinatsu Horii, Masakazu Imai and Kunihiro Chihara
Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara, 630-01, Japan
E-mail: {chinatsu, imai, chihara}@is.aist-nara.ac.jp

Abstract

This paper presents a method for semantic information retrieval (IR) implemented on a Digital Library. It is widely accepted that a Digital Library should provide an IR system through which users can automatically access every kind of media from anywhere. However, no improvement has been made for the retrieval errors caused by individual differences in how users express their requests. This is a significant problem for search efficiency in IR. Our approach does not use the request string itself but the concepts behind it. This makes it possible to retrieve information by meaning, rather than merely by comparison with the word strings of the request.

Keywords: Digital Library, Concept Path, Query, Index Term, Request, Technical Paper

1 Introduction

With the development of networks and multimedia technology, there is growing interest in realizing the Digital Library as an information service center. As more information is stored as digital data, it becomes more important for a Digital Library to provide the information that the user actually wants. In a conventional library, the librarian ensures the reliability of IR by resolving the omissions and noise caused by individual differences in how users understand the concepts in their requests. Because a Digital Library involves less interaction with a human operator during the IR process, it becomes necessary to construct an automatic information retrieval system that, in place of the librarian, solves these conformity problems in the search results.

Automatic IR systems have been studied [1][2][3]. However, it is difficult to say that these methods express the features of meaning sufficiently. We propose here an IR system that uses index terms based on concept information, not merely on word strings. The system realizes IR at the concept level using the EDR Dictionary [4] as the thesaurus.

In our approach, the retrieval objects stored in the Digital Library are technical papers, held as text files generated by OCR that contain misread and omitted words. One purpose of this study is to obtain the theme of a paper at the concept level. This is also expected to enable the positioning and clustering of papers.

2 The structure of the document database

It is difficult to represent the subject of a document accurately by words alone. In this approach, a substantial representation of the contents is achieved by feature extraction that uses concept information in addition to word information. The feature of a document is represented as a feature vector whose components are based on the words and the concepts in the document.

2.1 The distribution of the appearance frequency of a word

In this study, we use the weight of a word, estimated from the appearance frequency of the word in the document, so that it can be compared with that in other documents. The weight of a word indicates how strongly the word characterizes the document.
The word Wi, which is an input to the system, and the document database DB, which is a set of documents dj, are defined as follows:

equation (1) (2)

Let the appearance frequency of a word Wi in the document dj be wfij. The weight of a word Wi, wIij, is defined as follows.

equation (3)

where max(wflj) is the frequency of the most frequent word Wl in the document dj.
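
As an illustration of the word weight, the following minimal sketch assumes that equation (3), which is not reproduced in this text, normalizes each word frequency by the largest word frequency in the same document, as the description of max(wflj) suggests. The function name and the sample tokens are hypothetical.

```python
from collections import Counter


def word_weights(tokens):
    """Return {word: wf_ij / max_l wf_lj} for the token list of one document.

    Assumption: the weight is the frequency normalized by the frequency
    of the most frequent word in the same document.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    max_freq = max(counts.values())
    return {word: freq / max_freq for word, freq in counts.items()}


# Hypothetical tokens from one document d_j.
doc = ["retina", "cell", "retina", "vision", "retina", "cell"]
weights = word_weights(doc)
```

Under this assumption the most frequent word always receives weight 1, so weights from documents of different lengths stay on a comparable scale.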

2.2 The distribution of the appearance frequency of a concept

To obtain semantic frequency in the document, we use the weight of a concept, estimated from the appearance frequency of the concept derived from a word. The concept space is constructed as a tree structure by following upper-level concepts.
When the concept of a word is used in the feature extraction of a document, the system runs into the problem that the information space broadens to include other senses [5][6][7]. For example, if the Japanese word "ベース" ("base" and "bass" in English) is conceptualized, several kinds of upper-level concepts that carry a sense of "ベース" are obtained. Fig.1 shows the main branch of the concept space constructed by searching for the upper-level concept repeatedly.
In this paper, we propose a method that considers the concept path based on the concept structure, as shown in Fig.2. c(k,tk) denotes the tk-th concept on the level k steps up in the concept structure built from a word in the document dj. Cktkj is the set of upper-level concepts obtained from c(k-1, tk-1). The concepts are defined as follows:

equation (4)

A concept path is the connection between concepts in the structure. The set of concept paths is defined as follows:

equation (5)

where P_{k tkj} is the set of concept paths obtained from c(k, tk) (the black circle in Fig.2).
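
The construction of concept paths by repeatedly looking up upper-level concepts can be sketched as follows. This is only an illustration: the toy hierarchy, the concept names, and the ASCII key "beesu" (standing in for a word with two senses, "base" and "bass") are hypothetical and are not taken from the EDR Dictionary.

```python
# Hypothetical toy thesaurus: concept -> list of upper-level concepts.
UPPER = {
    "bass_sound": ["sound"],
    "base_foundation": ["part"],
    "sound": ["phenomenon"],
    "part": ["thing"],
    "phenomenon": [],
    "thing": [],
}

# Hypothetical word -> concepts mapping; "beesu" is an ASCII stand-in
# for a word that means both "bass" and "base".
WORD_CONCEPTS = {"beesu": ["bass_sound", "base_foundation"]}


def concept_paths(word, max_steps):
    """Enumerate upward concept paths for a word, climbing at most max_steps levels."""
    paths = []

    def climb(path, steps_left):
        parents = UPPER.get(path[-1], [])
        # Stop at the step limit or when no upper-level concept exists.
        if steps_left == 0 or not parents:
            paths.append(tuple(path))
            return
        for parent in parents:
            climb(path + [parent], steps_left - 1)

    for concept in WORD_CONCEPTS.get(word, []):
        climb([concept], max_steps)
    return paths


paths = concept_paths("beesu", 2)
```

Each sense of the word yields its own path, which is how a single ambiguous word broadens the concept space.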
To narrow down the concept space, our approach compares it with the concept space constructed from another word. For example, assuming the addition of a Japanese word, "[...] ([...] in English)", the overlapping part with the same sense is obtained, as shown in Fig.3, by comparing with the concept space of "ベース". This overlapping part is defined as the concept path that narrows down the concept space, and it is evaluated by the appearance frequency of the concept path. In this approach, to obtain semantic frequency in the document, the weight of a concept is estimated from the appearance frequency of each element of Pktkj. The weight of the concept, cIktkj, is estimated as follows:

equation (6)

where the frequency is taken for each element of Pktkj, and max() is the frequency of the most frequent concept path in the document dj.

In practice, these concepts are represented by their ID numbers in the EDR dictionary, as shown in Fig.4, which makes it easy to compare concept paths with each other.
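
The overlap of concept paths and the resulting concept weight can be sketched as follows. Equation (6) is not reproduced in this text, so the sketch assumes that the weight is the frequency of a concept path divided by the frequency of the most frequent concept path in the document. The concept IDs are made-up stand-ins for EDR concept IDs, and treating the path above the word-level concept as the comparable part is also an assumption.

```python
from collections import Counter

# Hypothetical concept paths per word, written as tuples of concept IDs
# (made-up stand-ins for EDR concept IDs), lowest level first.
PATHS = {
    "word_a": [("c01", "c10", "c20"), ("c02", "c11", "c21")],
    "word_b": [("c03", "c10", "c20")],
}


def upper_path(path):
    """Drop the word-level concept so that paths from different words can coincide."""
    return path[1:]


def concept_path_weights(tokens):
    """Return {upper path: frequency / max frequency} for the tokens of one document."""
    counts = Counter()
    for token in tokens:
        for path in PATHS.get(token, []):
            counts[upper_path(path)] += 1
    if not counts:
        return {}
    max_freq = max(counts.values())
    return {path: freq / max_freq for path, freq in counts.items()}


# Overlapping part of the two concept spaces: paths shared by both words.
shared = {upper_path(p) for p in PATHS["word_a"]} & {upper_path(p) for p in PATHS["word_b"]}

# Hypothetical document in which word_a appears twice and word_b once.
weights = concept_path_weights(["word_a", "word_b", "word_a"])
```

Because paths are plain tuples of IDs, comparing them is an equality test, and the path shared by both words accumulates the highest frequency.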

2.3 The feature of the document

The appearance frequency of a word alone takes no concept information into account when obtaining semantic frequency in the document. It is therefore necessary to combine the weight of the word with that of the concept when a document is compared with other documents. Adding the two weights, the weight of the word and the concept is defined as follows:

equation (7)

where g represents the group of concept paths obtained from the group of words in the document, excluding duplicates, and CP indicates the set of words that share the same concept path. The weight Igtgj, which determines each feature of the document, is obtained by the following equation:

equation (8)

The feature vector is composed of the concepts with large weights. The concepts with the larger weights indicate the subject of the document.
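
The combination of the two weights and the selection of the feature vector can be sketched as follows. Equations (7) and (8) are not reproduced in this text, so the sketch assumes plain addition: the weight of each concept path plus the summed weights of the words in CP that share that path. All names and numbers are hypothetical.

```python
def combined_weights(concept_w, word_w, words_by_path):
    """Add each concept-path weight to the summed weights of the words sharing that path.

    Assumption: the combination is plain addition; the exact form of
    equations (7) and (8) is not available in this text.
    """
    return {
        path: cw + sum(word_w.get(w, 0.0) for w in words_by_path.get(path, ()))
        for path, cw in concept_w.items()
    }


def feature_vector(weights, size):
    """Keep the `size` heaviest concept paths, heaviest first."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:size]


# Hypothetical weights for one document.
concept_w = {"vision": 1.0, "transmission": 0.5, "music": 0.25}
word_w = {"retina": 1.0, "signal": 0.5, "bass": 0.25}
words_by_path = {"vision": ["retina"], "transmission": ["signal"], "music": ["bass"]}

combined = combined_weights(concept_w, word_w, words_by_path)
top = feature_vector(combined, 2)
```

The heaviest entries of `top` play the role of the subject of the document in this sketch.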

3 Overview of the IR system

An overview of the IR system at the concept level is shown in Fig.5. The system consists of three parts: the query part, which obtains the query at the concept level; the indexing part, which decides the index terms for the theme of a technical paper; and the comparing part, which compares the query with the theme and shows the retrieval results to users. The Juman system [8] is used for Japanese morphological analysis.
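
The comparing part can be sketched as follows. How the query and the theme are compared is not specified here, so counting shared concepts is used purely as an illustrative assumption. The document IDs and concept names are hypothetical.

```python
def rank_documents(query_concepts, index):
    """Rank document IDs by how many index-term concepts they share with the query.

    Assumption: set overlap stands in for the comparison step, which is
    not specified in this text.
    """
    scores = {doc_id: len(query_concepts & terms) for doc_id, terms in index.items()}
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    # Highest overlap first; ties are broken by document ID for a stable order.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical index terms produced by the indexing part.
index = {
    "paper_1": {"vision", "transmission"},
    "paper_2": {"music"},
    "paper_3": {"vision"},
}

# Hypothetical query concepts produced by the query part.
results = rank_documents({"vision", "transmission"}, index)
```

Because both sides are concept sets, a request and a paper can match even when they share no word strings.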

4 Experimental studies

4.1 Experiment

The main question in operating this system is whether the theme of a technical paper can be represented at the concept level. Several experiments are needed to try to acquire the concept of the theme of a technical paper. The conditions for the experiment were set as shown below. As an example of acquiring the theme of a technical paper, we present the case of the paper titled "On the Function of the Retinal Bipolar Cell in Early Vision" in the Transactions of the Institute of Electronics, Information and Communication Engineers D-II, one of the most popular technical journals in Japan. The result of the experiment is shown in Table 1, which lists the concepts of the theme and their senses in order of the weight of the concept path.
Each group of three rows indicates a concept path, with upper-level concepts of more general meaning appearing further down. This experimental result shows that the categories of this paper consist of a set of concepts such as vision and transmission in the information field.

4.2 Evaluation of experimental results

To evaluate the experimental result above, we conducted a questionnaire survey of 5 people with sufficient knowledge of the information science field. One item of the questionnaire asked them to evaluate whether the sets of concepts represent the theme of each technical paper, after reading 15 papers published in the Transactions of IEICE vol.J78-DII No.7. Each evaluation was given as one of 4 grades: "excellent", "good", "fair" and "poor". The results of the questionnaire are shown in Fig.6. They show that the theme of a paper at the concept level obtained in the experiment agrees with the theme that users determined for themselves without using the system developed here.
To look at the evaluation results in a little more detail, the score for each paper is plotted in Fig.7, assigning 3 points for "excellent", 2 for "good", 1 for "fair" and 0 for "poor". From Fig.7, the average score of 10.6 points is roughly satisfactory, but the variance is rather large, with scores ranging from 5 (minimum) to 15 (maximum). The reason is as follows: in these experiments, we did not use a special dictionary for technical terms. In the future, we intend to treat a word that appears very often in the document but could not be mapped to a concept as a technical term. This approach will make it possible to deal with words that are not stored in the dictionary and with new words coined by the author of the document.
Since the connection rules for morphological analysis in the Juman system can be changed, we can also try to modify them so as to adapt to user-oriented problems.
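
The scoring scheme used for Fig.7 can be sketched as follows. With 5 evaluators and at most 3 points per evaluation, the score of a paper lies between 0 and 15. The ratings below are hypothetical and are not the data of this experiment.

```python
# Points per grade, as used for Fig.7.
GRADE_POINTS = {"excellent": 3, "good": 2, "fair": 1, "poor": 0}


def paper_score(grades):
    """Sum the points of all evaluators' grades for one paper."""
    return sum(GRADE_POINTS[grade] for grade in grades)


# Hypothetical ratings from five evaluators (not the experimental data).
ratings = {
    "paper_a": ["excellent"] * 5,
    "paper_b": ["good", "fair", "poor", "fair", "fair"],
}

scores = {paper: paper_score(grades) for paper, grades in ratings.items()}
mean_score = sum(scores.values()) / len(scores)
```

In this hypothetical case the two papers score 15 and 5, which correspond to the maximum and minimum values reported above.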

5 Conclusion

In this paper, we proposed an IR system at the concept level for technical papers that are generated by OCR, contain misread and omitted words, and are stored in a Digital Library.
Conceptualizing the theme of a technical paper is expected to contribute greatly to the positioning and clustering of papers. Furthermore, it will contribute to the research fields of IR and Digital Library systems.

References

[1] Hosono, K., ed. Information Retrieval. Oyamakaku Publishing, 1991 (in Japanese).

[2] Harter, S. P. A probabilistic approach to automatic keyword indexing. Part II: An algorithm for probabilistic indexing. Journal of the American Society for Information Science, 26.5, pp. 280-289, 1975.

[3] Sembok, T. and Rijsbergen, C. J. SILOL: A simple logical linguistic document retrieval system. Information Processing & Management, 26.1, pp. 111-134, 1990.

[4] Japan Electronic Dictionary Research Institute, Ltd. EDR Electronic Dictionary Technical Guide (2nd edition). Japan: EDR TR-045, 1995.

[5] Yokoi, T., et al. Information structure of an electric dictionary at the surface level. Information Processing Society of Japan, 37.3, pp. 333-343, 1996.

[6] Lim, C. and Chen, H. An Automatic Indexing and Neural Network Approach to Concept Retrieval and Classification of Multilingual (Chinese-English) Documents. IEEE Transactions on Systems, Man and Cybernetics, 1994.

[7] Salton, G., Allen, J. and Buckley, C. Automatic structuring and retrieval of large text files. Communications of the ACM, 37.2, pp. 94-108, 1994.

[8] Matsumoto, Y., et al. Japanese Morphological Analysis System JUMAN Manual. Nara Institute of Science and Technology, Japan, 1994.