Previewing Video Data: Browsing Key Frames at High Rates
Using a Video Slide Show Interface

Wei Ding, Gary Marchionini, Tony Tse

Digital Library Research Group
College of Library & Information Services
University of Maryland, College Park, MD 20740, USA.

{weid, march, tsetony}@oriole.umd.edu

Abstract

As the amount of video data in digital libraries increases, support for fast and easy access to this information has become necessary. Our approach is to empower users with direct control of video surrogates and provide interaction flexibility. A video browsing interface prototype using a slide show-style presentation of video key frames was built and tested for user performance and subjective satisfaction. The interface allows display rates to be adjusted interactively. Subjects in this preliminary study performed two browsing-related tasks, object identification and gist determination, at display rates of 1, 4, 8, 12, and 16 key frames per second (kfps). A possible functional limit in accuracy for object identification (OI) was detected between 8 and 12 kfps. Performance for gist determination (GD) tended to degrade with increased display rates, although no significant performance differences were detected. Furthermore, it was observed that lower rates were required for object identification than for gist determination. Suggestions for designing fast video browsing interfaces are provided.

Keywords: video browsing, video surrogates, video key frames, slide show, display rate

1 Introduction

As multimedia computing technologies and networked information environments improve, complex digital data types such as high-resolution video, audio, and still images are becoming more popular. Although powerful information retrieval engines and filtering agents are capable of rapidly processing enormous amounts of information, they are not able to replace human judgement. Thus, ultimately, active human effort such as sustained attention is required to scan, evaluate, and use the information retrieved by such automated systems. At the level of the user interface, opportunities exist for improving designs that will help users save time in browsing and selecting useful video data.
Although slower than other more highly focused analytical search strategies, browsing is effective for capturing the gist of the information represented and deciding whether to examine information objects more closely [1]. Our Baltimore Learning Community (BLC) project provides users with access to various types of data through the World Wide Web, such as still images, video clips, teaching modules, websites, and texts [2]. Users can search for objects through text-searching and graphical dynamic queries and then use one of several browsing tools to examine the query results in more detail.
The study described in this paper is preliminary and focuses on a slide show video browsing mechanism. Browsing can be particularly effective for seeking video information because it takes advantage of users' innate perceptual capabilities, such as rapid cognitive processing of visual information and efficient use of memory resources. We thus hypothesize that human information processing throughput can be increased by accelerating the display rate of information surrogates. This should help users determine whether further in-depth exploration of objects represented by the surrogates is needed.

1.1 Background
In a study of short-term memory for pictures, Potter [3] found that images could be understood within about 100 milliseconds. Analogously, we believe that video surrogates such as key frames will allow users to determine a video's content rapidly and assess the need for further processing of richer information layers or even the complete video.
Zhang et al. [4] applied two types of browsing techniques to video database interfaces: hierarchical and sequential browsing. Hierarchical browsing uses a hierarchy of static key frames displayed over time [5]. The key frames at the top levels present an overview of the video clip; clicking a key frame at a higher level displays n key frames at the next lower level. Hierarchical browsing interfaces allow users to preview video clips from any point. Sequential browsing uses a VCR-like interface with stop/start, fast-forward, reverse, and pause/freeze controls. Video surrogates include key frames, video skims [6], and full-length video clips. The VCR-like interface conserves screen space and, to some extent, maintains video's temporal nature. However, a major problem with retrieving complete videos on the World Wide Web is download time.

1.2 Slide Show Browsing Mechanism
The slide show interface described in this paper is an alternative to VCR-like controls for supporting rapid video browsing. Using a slide projector metaphor, the interface flips through computer-selected video key frames in a temporally ordered sequence, displaying the "slides" at rates set by users. Key frames are representative still images extracted from different scenes contained in each video sequence, based on physical and semantic properties of video [7]. Key frames are intended to summarize the content of a video in much the same way abstracts provide a summary of text documents. A video surrogate consisting of a sequence of key frames can be transferred more quickly and, if stored locally, requires less space than the full video. With the slide show interface, users can control key frame display rates to accommodate their specific needs for different situations. Our goal is to allow users to manipulate video browsing, through these control mechanisms, as rapidly and comfortably as thumbing through a book.
A video slide show interface with key frames, similar to one prototyped for the BLC project's digital video database, was tested in this study. The tests were designed to identify human performance limitations for fast video viewing/examination. A related study investigated the human limitations in processing multiple videos simultaneously [8]. Knowledge of human performance limitations will inform interface designs that help users make full use of their information processing capabilities while avoiding information overload. The results will contribute to our ongoing work on digital library interface designs for searching and browsing.

2 Research Questions

There has been little work on testing video browsing strategies. Whether suggested control mechanisms are optimized to meet user needs and how such interfaces could be improved are not known. Empirical data from this preliminary study would provide some insights into how slide show display rates can affect user performance. The three research questions were as follows:

1. It is hypothesized that both object identification (OI) and gist determination (GD) performance decrease with increasing display rates. As cognitive processes have limited resources, both OI and GD are expected to become less efficient with increasing data input. However, it is not clear if such a trend across all rates would be linear or result in an abrupt decrease at a particular threshold. Establishing a threshold rate level will help us better understand human factors in video interface design.

H1a: There exists a rate threshold for GD.
H1b: There exists a rate threshold for OI.
H1c: The rate threshold for GD is not the same as the rate threshold for OI.

2. Are there differences in display rates for optimal OI and GD performance? OI and GD are different cognitive processes; OI requires focused attention and GD requires global attention. Thus, it is expected that for a given display rate, OI requires the expenditure of more cognitive resources than GD.

H2: At the same rate, user performance on GD is higher than on OI.

3. Both user performance and satisfaction are important factors in interface design. Based on subjective estimation, what are the best display rates for OI and GD? Do users perceive the same display rate as best for both tasks?

H3: Users perceive the need for lower display rates for OI than for GD.

Additionally, we were interested in identifying characteristics of users that influence video browsing performance. Information about age, gender, subject of academic degrees, and time spent watching TV was collected and examined descriptively.

3 Methodology

To identify rate thresholds for different information needs, 20 participants were tested. Using the slide show interface, they viewed six sets of key frame surrogates at different display rates. Subjects were then asked to complete two tasks, as discussed below. They were not allowed to view a set of key frames more than once, change predetermined speed rates, or pause the slide shows.

3.1 Selection of Video Segments

The videos used in this experiment were selected from Discovery Channel educational MPEG video resources (available in the video database for the Baltimore Learning Community project at http://www.learn.umd.edu). Six 3- to 5-minute video segments (one for practice and five for testing) were used; the segments were taken from four full-length documentaries. The varied subject matter included the American Revolutionary War, the space shuttle program, research in rainforests, and conditions in countries around the equator.

3.2 Conversion of Video to Key Frames

Twenty-three to twenty-five representative key frames (GIF files, 352 x 240 pixels) were selected from each segment. Key frames were automatically extracted based on scene changes using a technique developed at the Center for Automation Research at the University of Maryland [9].

3.3 Display Rates and User Perceptions

Display rate is defined as the number of key frames shown per second; the values tested were 1, 4, 8, 12, and 16 key frames per second (kfps). Based on earlier studies and current research [13, 14], human recognition of images takes roughly 100 milliseconds per image, corresponding to a rate of about 10 kfps. One kfps served as a baseline.
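For reference, the following short sketch converts the tested display rates into per-key-frame exposure times; this is plain arithmetic, not part of the original study materials.

```javascript
// Per-key-frame exposure times (in milliseconds) for the tested display rates.
const rates = [1, 4, 8, 12, 16];              // key frames per second (kfps)
const exposureMs = rates.map(r => 1000 / r);  // [1000, 250, 125, 83.3, 62.5] ms
// The ~100 ms recognition time [13, 14] corresponds to 10 kfps and lies between
// the 8 kfps (125 ms) and 12 kfps (~83 ms) conditions tested here.
console.log(exposureMs);
```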
User perceptions, subjects' subjective estimates of the actual display rates, were measured on a 7-point Likert scale (1 = too slow, 7 = too fast), with 4 representing the ideal rate.

3.4 Tasks

Two typical user tasks were identified and tested: gist determination (GD) and object identification (OI). As mentioned previously, overview and detailed examination are primary browsing activities. For example, users tend to narrow the size of text search results by quickly scanning them [10]. In this experiment, we call this rapid scanning strategy GD. Thus, GD helps users make rapid decisions on whether to examine a particular data record in further detail or simply to reject it. Because of the capabilities of the human visual system, it is expected that GD on visual key frames can be accomplished at high speeds. When the user needs to identify the most suitable data records among many similar ones, detailed examination is involved. Unlike GD, a filtering task, OI is a strategy for confirming one's selection, which is likely to require more attention, resulting in slower cognitive throughput.

3.4.1 Gist Determination
Gist determination was operationalized in two parts. First, subjects were asked to briefly summarize what they thought the video segment was about after viewing the key frames. Second, subjects were asked to select the one summary statement, from four provided for each set of key frames, that best described the video as represented by the key frames (i.e., multiple-choice questions). Although the free-form write-up is a direct reflection of the subjects' understanding of each video from the key frames alone, it can be biased by individual differences in verbal expression. Multiple-choice questions, on the other hand, greatly decreased variability in responses, facilitating data analysis, but allowed for false positives due to guessing.
Both the multiple-choice statements and the user-generated gist sentences were analyzed by one of the authors. For multiple-choice questions, 1 point was given for a correct selection and 0 points otherwise, corresponding to 100% and 0% accuracy respectively. The gist sentences from 20 subjects, including pretest participants, were graded with level of understanding as the primary criterion and specificity of description and number of valid words as secondary criteria. Three levels of understanding were predefined and used to categorize the sentences. If the sentence gave only a basically correct description of the contents of the key frames, such as "people and monkeys in the jungle" or "battle or war settings," it was placed at level one and given 1 point. If the sentence showed some logical interpretation, sense making, or reasoning about the key frame contents, such as "a battle during the Revolutionary War" or "researchers study monkeys in the jungle," which involves basic (though not necessarily expert) prior knowledge, it was assigned to level two and given 2 points. Finally, if the sentence reflected a deeper or complete understanding of the video, such as "industrial and cultural aspects of an Asian city" or "a battle during the Revolutionary War, maybe at Lexington," it was graded at level three and given 3 points. Blank responses or wrong descriptions/interpretations were given zero points.

3.4.2 Object Identification
Object identification was operationalized by providing subjects with lists of items, half of which appeared in the key frames (targets) and half of which did not (distractors). Subjects were then asked to mark the objects that they actually saw after viewing a set of key frames.
For OI, cued recall with a checklist was tested. Among the 20 objects in each list, half appeared in the video (targets) and the other half did not (distractors). The lists were carefully created for face validity and ordered alphabetically. Efforts were made to keep the objects at the same level of specificity and difficulty as much as possible, so that terms would be neither too specific nor too broad. Only objects that were visible at the lowest rate, 1 kfps, were considered. Objects that could potentially be misleading were avoided (for details see [11]). In general, the probability of a distractor being chosen was about the same as that of a target.
Two kinds of scores and percentages were used: score A (true positives) for correctly identifying objects that appeared in the video and score B (true negatives) for correctly rejecting objects that did not. The total accuracy score is the sum of A and B. Under random guessing, the expected accuracy is 50%. Likewise, accuracy is 50% if a subject marks all of the items, or none of them, as having appeared in the key frames.
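As a hedged sketch of how these scores combine (not the authors' analysis code; the function name and inputs are hypothetical), the accuracy is simply the proportion of correct decisions over the 20 items, so checking every item, or none, yields exactly 50%.

```javascript
// Hypothetical OI scoring: checked[i] is true if the subject marked item i;
// isTarget[i] is true for the 10 target items and false for the 10 distractors.
function oiAccuracy(checked, isTarget) {
  let scoreA = 0;  // true positives: targets correctly identified
  let scoreB = 0;  // true negatives: distractors correctly rejected
  for (let i = 0; i < checked.length; i++) {
    if (isTarget[i] && checked[i]) scoreA++;
    if (!isTarget[i] && !checked[i]) scoreB++;
  }
  return (scoreA + scoreB) / checked.length;  // 0.5 expected under random guessing
}

// Example: a subject who marks all 20 items scores 10 + 0 out of 20, i.e. 50%.
```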

3.5 Interface

The interface was developed with JavaScript and HTML 3.2 under Netscape Navigator 3.0 on a Power Mac 8500 with a 15-inch monitor at 640x480 resolution (see Figure 1 and http://www.glue.umd.edu/~weid/movie/Viewer.html).
Display rates can be selected through the Display Speed pull-down menu. Clicking on the play button starts the slide show for the selected video. Between video shows, the display area is covered with a mask, consisting of random lines and dots.
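The original Netscape-era source is not reproduced here; the sketch below shows, in present-day JavaScript, the kind of timing logic such a viewer needs. The element ids, key frame file names, and mask image are illustrative assumptions, not the actual prototype code.

```javascript
// Minimal slide show viewer sketch: advances through key frames at the selected rate.
const frames = ["kf01.gif", "kf02.gif", "kf03.gif" /* ...more key frames... */];
const display = document.getElementById("display");   // image display area
const speedMenu = document.getElementById("speed");    // Display Speed pull-down (kfps)
let timer = null;

function play() {
  const rate = Number(speedMenu.value);                // key frames per second
  let i = 0;
  clearInterval(timer);
  timer = setInterval(() => {
    if (i >= frames.length) {                          // end of the surrogate
      clearInterval(timer);
      display.src = "mask.gif";                        // mask of random lines and dots
      return;
    }
    display.src = frames[i++];
  }, 1000 / rate);                                     // e.g. 8 kfps -> 125 ms per frame
}

document.getElementById("play").addEventListener("click", play);
```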

Figure 1. The Video Slide Show Interface

3.6 Subjects

Twenty University of Maryland graduate and undergraduate students voluntarily participated in this study. Excluding the six who participated in the pretest, fourteen participants (3 males, 11 females) were formally tested. Their ages ranged from 20 to 60 years.

3.7 Experimental Procedure

A pretest was conducted with 6 participants (3 male and 3 female graduate students). Based on the pre-test results, some of the distractors in the object list were revised, and two of the 6 sets of key frames were replaced in an attempt to maintain a common difficulty level. (Note: difficulty level was assessed qualitatively, not quantitatively, in this study).
Each subject watched 5 different sets of video key frames, each at a different rate. A practice session with sample key frames and tasks was run to familiarize subjects with the experimental procedure and watching the slide show at different rates.
To avoid key frame rate bias and order effects, the display sequence and rates were randomized.
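The exact counterbalancing procedure is not detailed here; as one plausible illustration (hypothetical code, not the study's actual procedure), each subject's five key frame sets can be paired with a random permutation of the five rates.

```javascript
// Hypothetical per-subject randomization of rate-to-set assignment.
function assignRates(sets, rates) {
  const order = [...rates];
  for (let i = order.length - 1; i > 0; i--) {          // Fisher-Yates shuffle
    const j = Math.floor(Math.random() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }
  return sets.map((set, i) => ({ set, rate: order[i] }));
}

// Example: assignRates(["A", "B", "C", "D", "E"], [1, 4, 8, 12, 16]);
```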
Prior to viewing each slide show, subjects were shown a list of objects. After watching each set of key frames, they were asked to complete three kinds of activities: (1) check off the objects seen in the key frames; (2) summarize the gist in several sentences and rate the display rate from 1 (too slow) to 7 (too fast) for both tasks; and (3) select the one statement, out of four, that best represented the gist. Subjects were encouraged to finish all the tasks. The procedure was repeated for each set of key frames. A brief interview was conducted with each subject after the session to elicit comments and suggestions about the experiment.

4 Results

4.1 Gist Determination: Multiple Choice Questions and Sentence Analysis

For the multiple-choice questions, a one-way ANOVA showed no significant difference in performance across the slide show display rates tested [F(4, 65) = 0.981, p = .4245]. However, this result is most likely an artifact of having only one multiple-choice question for each set of key frames; it is likely that a single question does not fully reflect gist determination (GD) performance.

Table 1. Performance for Multiple-Choice Questions and Gist Sentences
(1 point = first level, 2 points = second level, 3 points = third level of understanding)

For the sentence analysis, a one-way ANOVA was not significant across the slide show display rates [F(4, 95) = 2.3140, p = .063]. No threshold for GD could be identified. However, the means decreased slightly as the display rate increased from 1 to 12 kfps (see Table 1). Interestingly, even though no significant differences were found, the biggest performance drop between neighboring rates was again between 8 and 12 kfps.

4.2 Object Identification: Rate Threshold

A one-way ANOVA [F(4, 65) = 12.35, p < .001] indicated significant performance differences between display rates, with three homogeneous subsets under a Student-Newman-Keuls multiple range test: 12 kfps and 16 kfps, 8 kfps and 4 kfps, and 1 kfps. The 1 kfps rate produced the best performance. The 4 kfps and 8 kfps rates produced better performance than the two highest rates but were not significantly different from each other. There was also no significant difference between 12 kfps and 16 kfps, at which accuracy was around 60%, only a little higher than random probability (see Table 2).

Table 2. Multiple Range Tests: Student-Newman-Keuls test with significance level .050
(*) Indicates significant differences, shown in the lower triangle

Figure 2. Identification Performance

It is not surprising that the best OI performance was obtained at the baseline (1 kfps). More importantly, even though performance generally decreased as the display rate increased, there was an abrupt performance drop between 8 kfps and 12 kfps, while performance at 12 kfps and 16 kfps was equally poor. At about 70% accuracy (compared with the random probability of 50%), it is possible that a rate threshold exists between 8 kfps and 12 kfps, as shown in Figure 2.

4.3 Display Rate Perceptions: Object Identification and Gist Determination

Figure 3 shows users' subjective ratings of the different key frame display rates; the value 4 on the y-axis indicates the ideal perceived rate. Subjects tended to perceive every rate as faster for OI than for GD. A t-test showed significant differences between user perceptions of comparable display rates for the two tasks (p = .001), which preliminarily confirmed our hypothesis that different tasks require different display rates. Although the differences were small in practical terms, both the written ratings and the informal interviews indicated that users found it easier to obtain the gist (a more general task) than to identify all of the objects (a more detailed task) at comparable rates. This could suggest that the rate threshold for GD is higher than that for OI.

Figure 3. User Rate Perceptions for Different Tasks

4.4 User Characteristics and Performance

User characteristics data such as age, gender, and hours per week of TV-watching were collected and analyzed descriptively.

4.4.1 Performance vs. Age:
Subjects were divided into four groups (A-D) by age: 20-24, 25-30, 31-40, and 41-60 years. Group A had the best performance and Group D the worst in both OI and GD. Group C performed slightly better than Group B in both tasks. The difference between the youngest group (A) and the oldest group (D) suggests that age might be inversely related to both OI and GD performance. Psychological research suggests a widespread slowing of mental processes with age [12]; thus, age could be an important factor in interface design for video databases.

4.4.2 Performance vs. TV-Watching Hours:
We had anticipated a positive relationship between the number of hours spent watching TV and performance on both the GD and OI tasks. However, no such relationship was found in our data.

5 Discussion

The primary goal of this study was to investigate suitable control mechanisms that can be integrated into fast video browsing/filtering interfaces to meet different user needs. The results show that slide show presentations of key frames at high rates can be used for detailed examination or determining the gist of the original video.
One of our findings is that identification performance decreases as the display rate increases. From 8 kfps to 12 kfps there is a dramatic performance drop, and at 12 and 16 kfps performance remained poor and significantly different from that at the lower rates. Therefore a rate threshold might exist between 8 and 12 kfps (corresponding to frame exposure times between 125 and 83.3 ms), beyond which identification performance stays poor regardless of the display rate. This result is consistent with the findings of Potter [3, 13] and Healey [14] that human image recognition takes about 100 ms. Such a rate is also well below the rate at which continuous video frames are perceived as smooth motion. More importantly, this suggests that users' video browsing for specific details can be greatly accelerated. For example, if one key frame is selected, on average, from every 150 consecutive frames of a complete video, a slide show at 8 kfps covers the video 40 times faster than normal playback (assuming the complete video is displayed at 30 fps).
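The arithmetic behind the 40-fold figure, under the assumptions stated above, is simply:

```javascript
// Worked arithmetic for the 40-fold browsing speedup estimate.
const framesPerKeyFrame = 150;  // one key frame represents ~150 original frames (assumption above)
const slideShowRate = 8;        // key frames per second
const normalPlayback = 30;      // frames per second for the complete video
const coveredFramesPerSecond = framesPerKeyFrame * slideShowRate;  // 1200 original frames per second
const speedup = coveredFramesPerSecond / normalPlayback;           // 1200 / 30 = 40
```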
Similar to object identification (OI), the mean performance difference for gist determination (GD) was greatest between 8 and 12 kfps and smallest between 12 and 16 kfps. The gist sentence scores degraded slightly with increasing display rates, but no significant difference in GD performance was found. A rate threshold for GD might also exist between 8 and 12 kfps, or it might lie beyond 16 kfps. Since subjects found the same rates easier to cope with for GD than for OI, the rate threshold for GD might be higher than that for OI. This also seems consistent with schema building and testing theory in picture or video viewing.
A schema is a prototypical mental representation constructed to make sense of concepts or events [15]. It helps people adjust attention allocation and maximize attentional resources. The schema, a rough mental representation of the theme or gist of a video, is activated or built at an early stage of video viewing based on viewer expectations and experiences. When further information confirms the viewer's schema, less attention is required. Reiger & Reeves [16] found that after the first 10 and 20 frames of related sequences, a viewer's attention to television decreased. Regarding the results of this study, perhaps subjects made less of an effort to make sense of the video (i.e., they just needed to confirm the schema). Thus, it was easier to get the gist (sense making) than to recognize details, which may be less dependent on the gist.
The difference in OI and GD performance also raises an interesting issue about the definition of GD. If users expect to determine video gist only at the descriptive level, that is, to learn the basic setting in order to filter out visually unexpected records, higher rates than for OI may be acceptable. When users expect a better understanding at the second or third level (a related issue is specifying other video-browsing tasks), the acceptable rate could be much lower. (Judging from the mean scores, performance reached the second level only at 1 kfps.) Possibly information from other modalities (e.g., sound or words) should be involved to maintain high video browsing speeds, since better understanding requires prior knowledge that is not visually available.

6 Conclusion

This preliminary experiment focused specifically on the extreme situations of video browsing so that we could, through empirical data, improve our understanding of human processing capabilities in fast video browsing. We preliminarily identified a possible display rate threshold for object identification and differentiated two typical video browsing/filtering tasks. The results of this study provide guidelines for designing video browsing interfaces, especially for building control mechanisms that help users make the most of scarce attentional resources for effective access to video data. When a key frame slide show is browsed at 8 kfps for object examination, browsing can reach a 40-fold speedup over regular playback, which could substantially reduce users' video filtering time and allow them to focus their attention on the most important and relevant video records.
This study also raises interesting issues for future research, such as specifying more video-browsing related activities and involving other modalities for alternate video surrogates.
In addition, further work in this area is necessary to provide empirical evidence on a number of related topics.

Acknowledgments: This research was supported by US Department of Education Grant R303A50051. The authors would like to thank Dr. Eileen Abels and Allison Gordon for their help with the experiment design, Dr. Dagobert Soergel and Dr. Doug Oard for their comments, and the subjects for their participation.

References:

[1]. Marchionini, G. Information Seeking in Electronic Environments: Cambridge University Press. 1995.

[2]. Marchionini, G.; Nolet, V.; Williams, H.; Ding, W.; Beale, J.; Rose, A.; Enomoto, E.; Gordon, A. & Harbinson, L. Connectivity = Community: Digital Resources for a Learning Community. Proceedings of the 2nd International Conference on Digital Libraries (Philadelphia: PA), pp. 212-220, 1997.

[3]. Potter, M. C. Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning & Memory, 2(5), pp. 509-522, 1976.

[4]. Zhang, H.; Low, C. Y.; & Smoliar, S. Video Parsing and Browsing Using Compressed Data. Multimedia Tools and Applications, 1(1), pp. 89-111, 1995.

[5]. Mills, M.; Cohen, J. & Wong, Y. A Magnifier Tool for Video Data. CHI '92 Conference Proceedings of Human Factors in Computing Systems (Monterey: CA), pp. 93-98, 1992.

[6]. Wactlar, H.; Kanade, T. & Stevens, S. Intelligent Access to Digital Video: Informedia Project. IEEE Computer, May, pp.46-52, 1996.

[7]. O'Connor, B. C. Access to Moving Image Documents: Background Concepts and Proposals for Surrogates for Films. Journal of Documentation, 41(4), pp. 209-220, 1985.

[8]. Slaughter, L.; Shneiderman, B. & Marchionini, G. Comprehension and Object Recognition Capabilities for Presentations of Simultaneous Video Key Frames Surrogates. Proceedings of First European Conference on Research and Advanced Technology for Digital Libraries (Pisa, Italy), 1997.

[9]. Kobla, V., Doermann, D., & Rosenfeld, A. Compressed domain video segmentation. Technical Report CAR-TR-839 CS-TR-3688, University of Maryland, 1996.

[10]. Fenichel, C. H. Online Searching: Measures that Discriminate among Users with Different Types of Experiences. Journal of the American Society for Information Science, 32(1), pp. 23-32, 1981.

[11]. Ding, W. & Marchionini, G. A Study on Video Browsing Strategies. Technical Report CS-TR-3790 CLIS-TR-97-06. University of Maryland, 1996.

[12]. Birren, J. E.; Woods, A. M. & Williams, M. V. Behavioral slowing with age: causes, organization, and consequences. In L. W. Poon (ed.) Aging in the 1980s. Washington, DC: APA, 1980.

[13]. Potter, M.C. & Levy, E. I. Recognition Memory for a Rapid Sequence of Pictures. Journal of Experimental Psychology, 81(1), pp.10-15, 1969.

[14]. Healey, C. G.; Booth, K. S. & Enns, J. T. High-Speed Visual Estimation Using Preattentive Processing. ACM Transactions on Computer-Human Interaction, 3(2), pp. 107-135, 1996.

[15]. DeGraef, P. Scene-Context Effects and Models of Real-World Perception. In K. Rayner (ed.) Eye Movements and Visual Cognition. Springer-Verlag. pp. 227-242, 1992.

[16]. Reiger, S. & Reeves, B. The Effects of Scene Changes and Semantic Relatedness on Attention to Television. Communication Research, 20(2), pp.155-175, 1993.