Visual User Interface for Document Retrieval
Utilizing Spatial Relationships
among Document Vectors
Masahiro Watanabe Masatoshi Yoshikawa
Graduate School of Information Science
Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-01, Japan
phone: +81 (743)72-5331
fax: +81 (743)72-5399
e-mail: {masah-w, yosikawa}@is.aist-nara.ac.jp
Abstract
With the growth of the current of information such as documents,
the method of selecting relevant one is more and more required. There
are several ways to answer to this requirement. Visualizing
information seemed to be a promising method.
In this paper, we propose our user interface and algorithms. In
the approach, a document-term matrix generated from a document set
is decomposed usig the singular value decomposition (SVD)
method. The document information is visualized on a 2-D space using
the result of SVD. Users specify retrieval conditions by indicating a
position on the plane. With this approach, users can easily specify
retrieval conditions which will be complicated when expressed in
Boolean expressions of keywords. We heve built a prototype system,
and did experiments using a small set of documents.
Keywords:
Digital Library, Information Retrieval, Information Filtering, Visualisation
1. Introduction
Recently, much attention has been paid to visual user interface
for document retrieval. This is mainly because such user interface
is deemed to be a promising method to overcome the limitations of
systems using keyword interface. Past research on this topic includes
VIBE[6], PITS[1] and SearchSpace[7].
In this paper, we propose a new approach to visual user interface
for document retrieval. Our approach has the following features:
- 1.
- Multi-dimensional document vectors are projected onto 2-D
planes on which documents are visualized and retrieved based
on their spatial interrelationships. Users can specify
queries in visual and intuitive manner. Specification of such
relationships among documents would be very complicated if
Boolean combinations of keywords were used.
- 2.
- Unlike many other systems, users can specify 2-D planes on
which documents are placed. To specify horizontal and vertical
axes of planes, users give two pairs of documents which are
familiar to them. Intuitively, the line connecting the two
document vectors in the first (and the second, respectively)
pair is horizontal (and vertical, respectively)
axis. Therefore, users can view and retrieve documents from
their own point of view.
- 3.
- So far, much work have been done on information retrieval
using vector model, in which objects near to the specified
query vector are retrieved. In these methods, only distance
information in vector space is used. Since our method employs
LSI[2, 5] and hence is based on orthogonal space, our method
uses not only distance but also direction information. From
practical point of view, this is only effective when using
visual user interface.
- 4.
- The conversion from multi-dimensional to 2-D vectors is based
on FastMap[4], and is very fast.
In this paper, we present our user interface and its generative
algorithms. We have built a prototype system, and did experiments
using a small set of documents.
2. Visual Query Interface
In this section, the construction process of the 2-D interface
(Figure 1) will be described. We assume that the interface is 2-D,
because it seems difficult for human to grasp or specify object space
using more than 2 dimensions.
Users specify the picture of object feature space that they have
in their mind. They arrange known objects as two pairs of "yardstick
objects" on the 2-D interface (Figure1). The horizontal axis will be
generated based on one of the two pairs of the "yardstick objects",
and the vertical axis will be generated based on the other. The user
interface is composed by these two axes. The users specify the target
query object on the same 2-D interface, and then the system interprets
the spatial interrelationships between the query object and the
"yardstick objects" as search conditions. So, it is desirable that
two yardstick objects in a pair are poles apart from each other. The
objects in the neighbourhood of the query object are retrieved.
The method we propose is closely related to FastMap algorithm
[4]. The goal of the FastMap algorithm is to map n-D vector space
objects into points in some k-D (k < n) space so that the distances
between the objects are preserved. To solve the problem of
visualizing and indexing, the FastMap algorithm and our method are
much faster than an older method from pattern recognition, namely,
Multi-Dimensional Scaling[3]. The difference between FastMap
algorithm and our method is the way to choose the two ends of the line
segment, on which object vectors are projected.
2.1. Algorithm
The heart of the proposed method is to project the objects on a
`line' selected by the user. The projections of the objects on that
line are computed using the cosine law.
Theorem(Cosine Law)(Figure 2)
db,i = d 2a,i +
d 2a,b - 2xi da,b
|
(1)
|
In the above equations, dij is a shorthand for the
distance D(Oi, Oj)
(for i, j = 1 . . .N . Notice that the computation of xi only needs
the distances between objects, which are given.)
Solution of Cosine Law's equation for xi
From the Pythagorean theorem in the two triangles Oa EOi
and ObEOi,
the equation of the Cosine Law can be solved for xi, the first
coordinate of object Oi:
xi =
|
da,i 2 + da,b 2 -
d 2b,i
|
(2)
|
|
2da,b
|
Observe that, thanks to Eq.(2), we can map objects into points on
a line, preserving some of the distance information.
The vertical axis is chosen to be orthogonal to the horizontal
axis (Figure 3). In FastMap algorithm, the pairs of objects which
define axes are automatically selected by the system, while they are
selected by users in our method. Therefore users can view and
retrieve documents from their own point of view.
To put the algorithm more concretely, first, a pair of objects, Oa
and Ob, is chosen. It is used for the horizontal axis. Every object is
orthogonally projected onto the line Oa Ob. The position on the line
segment of each object is regarded as the horizontal attribute. Now,
the hyper plane H, which is perpendicular to the line segment
Oa Ob,
is introduced (See Figure 3). Let the orthogonal projection of
the line segment Oa Ob on H be
O'ab in figure 3. Objects other than
Oa and Ob are also projected onto
H. For example, the object Ox is
projected into O'x. Then, another pair of objects,
Ox and Oy , is
chosen for the vertical axis. Let the projection of them to HO'x and
HO'y. The line segment O'xO'y
is orthogonal to the horizontal axis.
Eventually, we obtain the two line segments, the line segments Oa Ob
and O'xO'y
as two axes of the interface (Figure 1), and they are
orthogonal each other.
As a matter of fact, the line segments Oa Ob
and OxOy are not
orthogonal (Figure 3) in the N -D space.
So, there is an alternative
plan to make the interface, judging from the intuition. It is the
plan to present the interface whose axes are not crossing at right
angle(Figure 4). In the figure, the angle cos` can be computed from
the equation of the inner product.
cos =
|
Oa ObOx Oy
|
(2)
|
|
|Oa Ob||Ox Oy|
|
2.2. Retrieving the Documents
We an arrange object vectors onto the interface, and the spatial
relationships between the query object and "yardstic objects" are
interpreted as the definition of the plane, or as the search
conditions. Using our algorithm, the projected object vectors are
compared to the condition specified by the user (Figure 1).
The object vectors matched to the target area specified by the user are
returned to the user as the result of the retrieval.
In general retrieval of vector model, there are two possible
choices of policy to select the objects. We compute the dissimilarity
of the objects by distance function d.
1. Range Policy
given a query object Q, selsect all indexed objects
Oj
s.t. d(Q; Oj) =< r.
2. k nearest neighbours (k-NN) Policy
given a query object Q and integer k >= 1, select the k
indexed objects which have the shortest distance from Q.
To add to the policy based on distance, we should consider the
topological factor of retrieval conditions. This factor of conditions
is conjunctive and narrows the candidates for retrieval. For example,
in such a case as the user specifies like Figure1, the result should
not include the objects on the left of Oa .
3. Experiments
We did an experiment on our method using 116 documents. We report
the result of the experiment below.
3.1. Process
-
1.
- The document set of 116 master theses of NAIST is the object
of our experiment.
-
2.
- Stemming the words of the each document, we eliminate the stop
words which are included in the stop word list.
-
3.
- Couting the occurrence of the each variation of the words, the
document-word matrix is constructed. The size of matrix is 126
x 1640.
-
4.
- Perform the SVD of the document-word matrix.
-
5.
- Applying our method to the vectors of documents and query, we
obtain 2-D arrangemant.
-
6.
- Compare the obtained arrangemant and the retrieval conditions
specified by the user. The objects which satisfies the
conditions are returned to the user.
3.2. Results
The example of user interface is shown in Figure 5. In the
figure, the pair of horizontal yardstick objects are denoted by "x",
the pair of vertical yardstick objects are denoted by"*", and the
other objects are denoted by "+". These yardstick objects and two
orthogonal axes are standard against which users specify the query
object vector.
4. Conclusion
The result of our experiments shows that the interface proposed in
this paper has the efficient power of discriminating the object
vectors. Using the interface like our prototype system, users can
specify the retrieval conditions even if they don't know the scheme of
the field very well.
Acknowledgement
We are indebted to the members of Uemura Laboratory for their
constructive discussions.
References
-
[1]
- Steve Benford and John Mariani. Virtual Environments for Data
Sharing and Visualizaion - Populated Information Terrains. In
Interfaces to Databese Systems, pp. 168-182, 1994.
-
[2]
- Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas
K. Landauer, and Richard Harshman. Indexing by Latent Semantic
Analysis. In Journal of the American Society for Information
Science, Vol. 41-6, pp. 391-407, 1990.
-
[3]
- Kouji Akaike et al. Iwanami Encyclopedia in Mathematics.
Iwanami, 1985.
-
[4]
- Christos Faloutsos and King-Ip(David) Lin. FastMap: A Fast
Alghorithm for Indexing, Data-Mining and Visualisation of
Traditional and Multimedia Datasets. In Proc. ACM SIGMOD
International Conference on Management of Data, pp. 163-174,
1995.
-
[5]
- Christos Faloutsos and Douglas W. Oard. A Survey of Information
Retrieval and Filtering Methods. Technical Report CS-TR-3514,
University of Maryland, August 1995.
-
[6]
- Kai A. Olsen and Robert R. Korfhage. Desktop
Visualization. In Proceedings of the 1994 IEEE Symposium
on Visual Languages, pp. 239-244, 1994.
-
[7]
- Fujio Tsutsumi. SearchSpace: a Flexible Docment Retrieval
System with an Interface Arranging Keywords in 2-D Space
Expressing Fuzziness and Relevance. In Proc. of Intl.
Symp. on Digital Libraries 1995, pp. 283-284, August 1995.