Stacie Hibino, EECS Department, Software Systems Research Lab, The University of Michigan, 1301 Beal Avenue, Ann Arbor, MI 48109-2122 USA, hibino@eecs.umich.edu
Elke A. Rundensteiner, Computer Science Department, Worcester Polytechnic Institute, 100 Institute Rd., Worcester, MA 01609-2280 USA, rundenst@cs.wpi.edu
In contrast to these approaches, we propose a new paradigm for video analysis involving the use of interactive visualizations in which users browse the data in search of temporal trends. Rather than requiring users to pre-code relationships, we have simplified the annotation process to that of coding atomic actions and events. Our system then enhances the analysis process by allowing users to explore temporal relationships between subsets of such annotations using simple mouse manipulations. For this, we have developed a temporal visual query language (TVQL [13]--see review in Section 2.3) composed of integrated temporal query filters that support the specification of complex temporal queries using a direct manipulation paradigm. TVQL facilitates temporal exploration of the video data by allowing users to continuously specify and incrementally refine individual as well as combinations of similar temporal relationships. Thus, in our approach, users would code "P1 Talking" and "Digression" as separate annotations. Using our dynamic temporal query filters, they could then examine all types of temporal relationships between P1 talking and digressions, including how often P1 starts, participates in, or ends a digression.
In our MMVIS system, the results of TVQL queries are presented in a temporal visualization, called TViz (Section 2.4). The visualization presents the temporal relationships between subsets by abstracting relationships between individual event instances into trends over a given time interval. Some of the abstraction strategies we have explored include frequency counts, total duration, and average duration. The visualization is user-customizable on-the-fly, allowing users to highlight subsets according to their current focus of temporal analysis. In addition, our temporal visualization is dynamically updated as temporal filters are adjusted. This provides immediate feedback on temporal trends as a function of the type of temporal relationship (e.g., "does Joe always interrupt Mary the moment she starts talking" versus "does Joe interrupt Mary only after she has been talking continuously for at least five minutes?") or as a function of the type of selected event subsets (e.g., "does Joe interrupt only the female members of the meeting who are speaking for more than five minutes and not the male members?").
The combination of specialized query filters and visualizations forms our integrated MultiMedia Visual Information Seeking (MMVIS) environment. In this paper, we describe the design and implementation of our MMVIS system. We present the system design of primary interface components of MMVIS--namely, our visual query language including the subset query and TVQL query palettes and our temporal visualization, TViz. We also present our strategies for implementing MMVIS, focusing in particular on our overall system architecture and on the TVQL query processor. In order to preserve the notion of interactive browsing for trend analysis, the visual queries must be processed as efficiently as possible. Thus, the query processor is a critical component of the system and hence we discuss it in more depth in this paper.
We describe a new index structure (the k-Array method) and an associated query processing strategy we have developed for processing TVQL queries. Our preliminary experimental evaluation of the query processor reported in this paper demonstrates the effectiveness of our index strategy for handling multidimensional range queries specified in an incremental fashion--the main mode of interaction with MMVIS. Lastly, a brief report on our case study using real CSCW data and preliminary results from a user study illustrate the utility of TVQL and MMVIS.
This paper is divided into five additional sections. In Section 2, we present our system design--describing subset query palettes for dynamic subset selection, reviewing our temporal visual query language (TVQL), and presenting our temporal visualization (TViz). In Section 3, we describe our system implementation, beginning with the system architecture, presenting our annotation model, and discussing the underlying query processing technique. This is followed by evaluation and discussion of the efficiency and utility of our approach in Section 4. In Section 5, we discuss related work, and finally in Section 6, we present our conclusions.
This use of dynamic query (DQ) filters provides us with an easy-to-use visual paradigm for formulating and posing questions. Because a visualization of results is dynamically updated as users adjust any query filter, queries are incrementally specified and refined and users see the direct correlation between adjusting values of query parameters and the corresponding display of results.
Another system requirement for MMVIS is that it should be general enough to handle a variety of multimedia data typically characterized by dynamic, spatio-temporal characteristics (e.g., video, real-time data, etc.). We accomplish this by using annotations to abstract spatio-temporal information from the original media and then analyzing the annotation collection. In the context of this paper, we explain this approach using video data as an example. Section 3.2 describes our data model for these video annotations.
Although the VIS paradigm is suitable for our goals, the current VIS framework is not designed for selecting multiple subsets nor for handling spatio-temporal characteristics of multimedia data. Thus, in order for MMVIS to support the desired new paradigm for video analysis, several extensions to VIS are required:
Figure 1. Sample subset query palette.
In each mList DQ on the subset palette, users can simply click on an item to toggle its selection on and off. As with standard DQ filters, these mList DQs are interrelated, so that the de-/selection of items in one mList can automatically affect the de-/selection of items in the other mLists. Thus, if an item is deselected in one mList, and an item in another mList is related only to that deselected item, then it too is deselected. For example, in Figure 1, the name Nil is associated only with the actions NonVerbal and Transcript. Since Transcript has already been deselected, deselecting NonVerbal from the Action mList would automatically deselect Nil from the Name mList.
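The propagation rule just described can be sketched as a small set computation. The names here (ASSOC, propagate, and the sample selections) are illustrative stand-ins, not taken from the MMVIS implementation.

```python
# Hypothetical sketch of mList deselection propagation. Each name is
# associated with the set of actions it appears with in the data.
ASSOC = {
    "Nil": {"NonVerbal", "Transcript"},
    "Carol": {"Talking", "NonVerbal"},
}

def propagate(selected_actions, selected_names):
    """Deselect any name whose associated actions are all deselected."""
    return {n for n in selected_names if ASSOC[n] & selected_actions}

# Transcript is already deselected; deselecting NonVerbal as well
# leaves Nil with no selected action, so Nil is dropped.
remaining = propagate({"Talking"}, {"Nil", "Carol"})
```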
Figure 2. Relationships between temporal primitives and the four defining endpoint difference relations.
A general temporal query language must be able to specify any one of these primitives. However, it is also desirable to specify combinations of the primitives (e.g., to see how often events start at the same time but may end at different times, corresponding to combining the starts, started by, and equals primitives). Rather than providing arbitrary combinations of such relationships, we support users in selecting similar primitives (i.e., temporal neighborhoods [8], equivalent to selecting a series of adjacent cells such as a row, column, or grid from Figure 2).
A set or combination of temporal relationships between two events forms a temporal neighborhood if it consists of relations that are path-connectable conceptual neighbors. Two primitive temporal relationships between two events are conceptual neighbors if a continuous change to the events (e.g., shortening, lengthening, or moving the duration of the events) can be used to transform one relation into the other without passing through an additional primitive temporal relationship [8]. Thus, the before and meets relations are neighbors, because we can move the ending point of A from before the start of B to the start of B without passing through any additional primitive relationship. On the other hand, the before and overlaps relations are not neighbors, because we cannot move the ending point of A past the starting point of B without first passing through the meets relation.
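Since each primitive is determined by the four endpoint difference relations of Figure 2, classifying an interval pair can be sketched directly. This is a minimal illustration of Allen's thirteen primitives, not code from MMVIS.

```python
def temporal_primitive(a_start, a_end, b_start, b_end):
    """Classify interval pair (A, B) into one of the 13 temporal
    primitives using endpoint comparisons."""
    if a_end < b_start:  return "before"
    if a_end == b_start: return "meets"
    if b_end < a_start:  return "after"
    if b_end == a_start: return "met by"
    # A and B properly overlap from here on.
    if a_start == b_start:
        if a_end == b_end: return "equals"
        return "starts" if a_end < b_end else "started by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished by"
    if b_start < a_start and a_end < b_end: return "during"
    if a_start < b_start and b_end < a_end: return "contains"
    return "overlaps" if a_start < b_start else "overlapped by"

# Sliding the end of A = (0, 1) toward the start of B = (2, 3) moves
# the pair from "before" through "meets" to "overlaps", illustrating
# the conceptual-neighbor path described above.
```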
Figure 3. TVQL palette. This query specifies all events of type A that start at the same time as events of type B (and may end before, after, or at the same time as B events).
To enhance the TVQL user interface, we have incorporated qualitative descriptive labels along the top and side and our dynamic temporal diagrams along the bottom of the palette. The labels allow users to "read" the relationship specified and the diagrams provide visual confirmation of the temporal primitive(s) specified (though not quantitative values as given by the filters). If subset A specified person P1 and subset B specified all Plan design rationales, then Figure 3 illustrates how users could ask the query "show me how often person P1 starts at the same time as a Plan starts." The descriptive labels can be used to "read" the top query filter as "start A equals start B." The relationship between the temporal ending points is unconstrained as indicated by the selection of all values in the second (i.e., endA-endB) query filter. This is also reflected in the temporal diagram, which indicates that the end of A (represented by a filled circle) can be before, equal to, or after the end of B. The benefit of this direct manipulation design is that it supports specifying particular queries as well as browsing for temporal trends. That is, users can simply drag DQ thumbs with no particular query in mind and when an interesting visualization appears, they can look at the temporal diagram to see which temporal query was specified.
Similar to standard DQ filters, our temporal DQ filters are bound to one another to prevent the specification of invalid queries. As users adjust one query filter, the other filters are automatically updated accordingly. In the case of Figure 3, the user only has to set the filter thumbs of the top startA-startB query filter to 0. The bottom two filters are automatically constrained as indicated.
Our overall approach has been to provide a main window of representative annotation icons for context, and to use transparent overlays to highlight temporal occurrences (i.e., selected subsets) and relationships between them. In addition, we provide some options to allow users to tailor visualizations according to their needs and preferences.
Figure 4 presents the main MMVIS window. This window is divided into three areas: the primary visualization area, the key below the visualization area, and visualization options in the lower right corner of the window. Initially, the visualization area only contains icons representing the different types of annotations in the database (in Figure 4, icons from the CSCW case study are displayed). By default, these annotations are spatially placed according to their corresponding first occurrence in the video and are used to provide context for TViz. Users can move icons according to their preference. In the case study data set [20], we found that subjects never changed positions (e.g., walk across the screen) in the video, so there was a one-to-one correspondence between where talking icons were placed and where the corresponding speakers appeared on the video. In this way, the annotation placement formed a spatial abstraction of the video context for the case study.
The key below the visualization emphasizes that subset A events will be highlighted with yellow transparent circle overlays while subset B events will be highlighted with blue transparent square overlays (see also Figure 6). The visualization options allow users to customize the view according to their needs and preferences. More details on the subset highlighters and visualization options are provided below.
Figure 5. Visualization of Selected Subsets. In this example, Subset A is set to all types of annotations and Subset B is set to none. Transparent circle overlays highlight the types of A events selected. The relative size of the circular highlighters provides a comparison of total duration. Other viewing options include frequency and average duration.
Figure 5 presents an example where the user has set Subset A to all types of annotations and Subset B to none. By doing so, the user can then switch between visualization options to gain an overview of the data and to compare the differences between relative frequency, average duration, or total duration of all types of events. Figure 5 provides a comparison based on total duration. In it, we can see, for example, that the total time that Richard speaks is longer than that of Carol or Gary.
Figure 6. Sample TViz. In this example, TVQL specifies the all starts temporal query, where events start at the same time but may end at the same or different times. The AB connectors in TViz are displayed as bars, and the width of these bars indicates the relative frequency that the all starts relationship occurs between each type of AB pair.
Figure 6 shows an example where the user has specified the all starts temporal query, where events start at the same time but may end at the same or different times. In this example, the thick bar between NonVerbal and Pause indicates that these types of events frequently start at the same time. Similar to the overview visualizations, the temporal relationship connectors are also user-customizable, being viewable as triangles or bars. Figure 6 shows the view by bars.
Figure 7. MMVIS System Architecture.
Once the annotation collection has been formed, users can then explore and analyze the video through iteratively specifying queries using a Visual Query Language (VQL; i.e., the subset selection and TVQL query palettes described in Sections 2.2 and 2.3) and by reviewing the visualization (e.g., TViz, Section 2.4) of results presented. In order to maintain the notion of browsing to the users, the visualizations must be updated in real-time and thus queries must be processed as quickly as possible. This is accomplished through the VQL Processor which is described in Section 3.3. The VQL Processor sends messages to the Database Manager which then passes updates of the solution set to the Presentation Manager. The Presentation Manager takes these updates of the query results, along with any user-defined display preferences and updates the visualization. Users can view the visualization as it changes and visually scan the final results to look for data trends. If no trends are found, they can use the Presentation Language (PL) to clarify the visualization, the Navigation Controls to further explore query results, or the VQL to incrementally adjust the query.
MMVIS has been implemented in a Windows-based multimedia PC environment, using a ToolBook interface to a database library. The subset selection query palettes, TVQL, VQL Processor, Database Manager, and a primitive Presentation Manager for displaying and updating TViz have all been incorporated into the current working system. The Annotation Tools are currently limited to importing existing databases of video events, but tools for creating video annotations directly within the existing framework are being developed. In addition, we are also investigating the use of alternative visualizations to extend the collection of Presentation Templates from which users can select. Details of the Database Manager and our VQL query processing strategy are presented below.
Figure 8. Overview of Annotation Data Model
The descriptive objects are separate from the annotations so that several annotations can reference the same descriptive object. This provides consistency and improves efficiency during annotation creation by allowing users to link directly to previously specified descriptive objects rather than having to specify the same information over and over again. The same is true for the relationship between the descriptive object and the media object.
The Database Manager stores information on basic annotation, descriptive, and media objects in a dBase IV format while the actual views of the objects are realized directly within the application. The t.start_frame and t.end_frame attributes of an annotation object allow us to obtain direct access to the corresponding video segment.
Given:
V = {v1, v2, ..., vs}              // set of video documents
Ann(vp) = {va1, va2, ..., vaT}     // annotations for video vp
Users select A and B subsets via subset selection DQs:
A = {a1, a2, ..., ai, ..., am | m <= T, ai ∈ Ann(vp)}    // first subset of video annotations
B = {b1, b2, ..., bj, ..., bn | n <= T, bj ∈ Ann(vp)}    // second subset of video annotations
Based on the subsets selected and the constraints imposed by the extreme values of the TVQL filters, a new database of temporal pairs (TPairs) is formed as follows:
TPairs = {pr1, pr2, ..., prq, ..., prN | prq = (ai, bj), ai ∈ A, bj ∈ B,
(ai.end - bj.start >= dTfo.minVal AND ai.start - bj.end <= dTof.maxVal)},
where:
dTfo = temporal DQ filter for specifying the difference between the end of ai and the start of bj, and
dTof = temporal DQ filter for specifying the difference between the start of ai and the end of bj.
Users explore temporal relationships by using TVQL to query the TPairs database. We define:
T = total # of records in the annotation collection Ann(vp)
N = # of (ai, bj) pairs in TPairs (N <= T^2), from which pairs that meet the temporal relationship criteria are selected
k = number of attributes or dimensions used for query specification (for TVQL, k = 4)
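The TPairs construction above can be sketched as a pairwise filter over the selected subsets. The function name form_tpairs and the representation of annotations as (start, end) tuples are illustrative assumptions.

```python
# Minimal sketch of TPairs formation. dTfo_min and dTof_max stand in
# for the extreme values of the dTfo and dTof temporal query filters.

def form_tpairs(A, B, dTfo_min, dTof_max):
    """Keep only (a, b) pairs that some TVQL query could still select."""
    return [
        (a, b)
        for a in A
        for b in B
        if a[1] - b[0] >= dTfo_min and a[0] - b[1] <= dTof_max
    ]

# With dTfo_min = -5 and dTof_max = 5, pairs farther apart than
# 5 time units in either direction are pruned up front.
pairs = form_tpairs([(0, 2), (20, 22)], [(1, 3)], -5, 5)
```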
Figure 9 presents a sample DQ filter (minVal = -3, maxVal = 3, dtIncr = 1) and its corresponding internal representation. It has slotCount = 2*(3 - (-3))/1 + 1 = 13. There are more slots than tics because a filter thumb can be closed or open, to include or exclude an endpoint value, respectively. In Figure 9a, the DQ filter is selecting 0 <= values < 1. This corresponds to including slots 6 and 7 of the internal representation shown in Figure 9b.
Figure 9.(a) Sample DQ filter (minVal = -3, maxVal = 3, dtIncr = 1).
Figure 9.(b) Corresponding internal representation where the double-lined border around slots 6 and 7 corresponds to the selected DQ filter range in Figure 9.a.
Given that each item in the TPairs data set has one and only one valid value for each attribute, in TVQL the value of each attribute of a given pair prq determines the slotNum of the corresponding query filter. For example, if prq.dTo = 0 and the above filter represented the dTo (i.e., startA-startB) query filter, then prq would go into slot 6 of that filter. Recall that prq is only in the solution set when each and every one of its attribute values is valid (i.e., when each is included in the selected range of its corresponding attribute DQ filter). Another way to think of this: if count represents the number of valid attribute values for prq, then a temporal pair prq is in the solution set when count = 4 (or, in general, when count = k).
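One slot layout consistent with Figure 9 places exact tic values in even slots and the open intervals between them in odd slots. Under that assumption, the value-to-slotNum mapping can be sketched as follows; slot_num is a hypothetical helper, not MMVIS code.

```python
# Sketch of mapping an attribute value to a DQ-filter slot, assuming
# even slots hold exact tic values and odd slots hold the open
# intervals between them (slotCount = 2*(maxVal - minVal)/dtIncr + 1).

def slot_num(value, min_val, incr):
    steps = (value - min_val) / incr
    if steps == int(steps):          # value falls exactly on a tic
        return 2 * int(steps)
    return 2 * int(steps) + 1        # value falls strictly between tics

# For the filter in Figure 9 (minVal = -3, dtIncr = 1): value 0 maps
# to slot 6, and any value in the open interval (0, 1) maps to slot 7.
```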
These types of manipulations can be represented as:
dqManip(dqi, thumbID, action, <dir>)
where:
dqi                            // DQ filter being adjusted,
thumbID ∈ {LEFT, RIGHT}        // which DQ filter thumb was manipulated,
action ∈ {MOVE, FILL, UNFILL}  // type of manipulation made, and
dir ∈ {LEFT, RIGHT}            // optional parameter specifying the direction the DQ thumb was moved, if applicable.
Query processing in TVQL must thus be able to effectively process such sets of manipulations. This includes identifying items affected by the manipulation, updating appropriate auxiliary structures such as counts, and updating the solution set by passing information on TPairs items that should be added/removed from the solution to the Presentation Manager via the Database Manager.
In order to reduce the storage required for indexing the data, to avoid empty buckets as much as possible, and to optimize performance, we thus set out to develop an alternative strategy for TVQL query processing. While several methods for processing multidimensional range queries exist (e.g., [4, 19]), we felt that the requirements and constraints of our TVQL query processing problem were different enough from previous work to warrant additional investigation. More specifically, our goal was to identify an indexing method for processing incremental multidimensional range queries over temporal data, giving higher priority to processing queries over updating data.
Our proposed k-Array approach is an array-based method generally applicable to DQ query processing--while of course for our TVQL language we set k=4. Given:
dBase = {item1, item2, ..., itemN}    // with unique IDs {1, 2, ..., N}; for TVQL, dBase = TPairs
DQ = {dq1, dq2, ..., dqk}             // set of DQ filters
kArray = {indexArray(dq1), indexArray(dq2), ..., indexArray(dqk)}
The basic idea is to keep an index array of size slotCount for each DQ filter. Each position in the index array has a one-to-one correspondence to the slotNum of the respective DQ filter. Thus, each item in the database can be sorted into one and only one position of each index array. This means that every item in the database occurs a total of k times in the full index structure.
If count is the number of selected ranges of the index arrays in which an item occurs, then an item should be added to the solution set when its count changes from k-1 to k. Similarly, the item should be removed from the solution when its count changes from k to k-1. We only need to access the actual item when it is being added to or removed from the solution. The rest of the time (i.e., while count <= k-1), it is sufficient to access only the count for each item. For this reason, the counts are stored in a separate array, consecutively by item ID:
itemCounts[N] = array of data item counts
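A minimal main-memory sketch of building the k index arrays is given below, assuming integer attribute values and a simplified slotNum = value - minVal mapping (the real structure uses the open/closed slot layout of the DQ filters).

```python
def build_k_array(dbase, filters):
    """dbase: {item_id: (v1, ..., vk)} with integer attribute values;
    each filter dict gives its minVal and slotCount. Every item lands
    in exactly one slot of each filter's index array, so each item
    appears k times in the full structure."""
    k_array = []
    for d, f in enumerate(filters):
        index_array = [[] for _ in range(f["slotCount"])]
        for item_id, values in sorted(dbase.items()):
            index_array[values[d] - f["minVal"]].append(item_id)
        k_array.append(index_array)
    return k_array

# Two items, two filters (k = 2): each item occurs once per index array.
dbase = {1: (0, 2), 2: (1, 1)}
filters = [{"minVal": 0, "slotCount": 3}, {"minVal": 0, "slotCount": 3}]
ka = build_k_array(dbase, filters)
```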
Using pointers and a main memory model, the k-Array approach is captured by Figure 10.
Figure 10. Main memory based k-Array approach.
In order to provide support for a fine-grained analysis of video data (e.g., analyzing the video over events that are only seconds or less than seconds long) and/or support for analyzing several videos at once, our system must be able to handle a large collection of video annotations as well as an even larger TPairs collection (given that TPairs are formed by pairwise combinations of the video annotation subsets). Ultimately, this means that the number of pairs can become too large to fit in main memory, and secondary storage would be required. For this case, we propose the following modifications to convert the main memory-based k-Array method into a disk-based structure:
Figure 11. Disk-based version of the k-Array approach.
Our query processing strategy for our k-Array is to process each DQ user manipulation (i.e., dqManip) as follows:
Consider the case where the left filter thumb is unfilled, and the query manipulation taken is dqManip(dqi, LEFT, FILL, NULL). This corresponds to changing the left hand side of the specified range from
leftVal < dqi.attributeName
to leftVal <= dqi.attributeName,
meaning that the nearest neighbor is slot dqi.LEFT.slotNum - 1 and that we want to add that slot to the selected range. If dqi had the corresponding indexArray in Figure 11, where dqi.LEFT.slotNum = 1, then we would be adding all data items corresponding to slot 0 of that indexArray to the solution. Since indexArray(dqi)[0] = 4, we would loop over the four IDs in that slot (i.e., ID=1, ID=5, ID=8, ID=10), incrementing the count for each, and adding to the solution set the corresponding data item for each ID whose new count equals k. Although not all counts are displayed in Figure 11, we do see that data item ID=1 has count=3. Since its count would now become 4 (i.e., k), we would fetch data item1 and add it to the solution set.
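The FILL manipulation just walked through can be sketched as follows; the structures here (index_array, item_counts, solution) are simplified stand-ins for MMVIS's internal ones, not its actual code.

```python
K = 4  # number of query dimensions in TVQL

def process_left_fill(index_array, left_slot, item_counts, solution):
    """Add the slot just left of the current left thumb to the selected
    range: bump the count of each item in that slot, and move items
    whose count reaches K into the solution set."""
    new_slot = left_slot - 1
    for item_id in index_array[new_slot]:
        item_counts[item_id] += 1
        if item_counts[item_id] == K:
            solution.add(item_id)   # fetch and display this TPairs item
    return new_slot                 # new left thumb slot position

# Mirroring the Figure 11 walkthrough: slot 0 holds IDs 1, 5, 8, 10.
item_counts = {1: 3, 5: 2, 8: 3, 10: 1}
solution = set()
process_left_fill([[1, 5, 8, 10], []], 1, item_counts, solution)
# Items 1 and 8 reach count 4 and join the solution set.
```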
In general, the k-Array has a storage cost on the order of O(kN) and search cost on the order of the average slotSize--i.e., O(N/slotCount). While this search cost may not seem to be the most efficient, it is counterbalanced by the fact that all item IDs in a slot are clustered together, and minimum information is kept to maintain and cluster the corresponding counts. Given a 2K page buffer size, this means that page faults in TVQL could be reduced to one page fault per every 512 item IDs fetched from the indexArrays and one page fault per every 5461 itemCounts.
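A back-of-the-envelope check of these page-capacity figures follows, under the assumption of 2 KB pages, 4-byte item IDs, and counts packed at 3 bits each (enough to hold count values 0 through 4 when k = 4); these encoding choices are inferred, not stated in the text.

```python
# Verify the stated per-page capacities under assumed encodings.
PAGE_BYTES = 2048

ids_per_page = PAGE_BYTES // 4            # 4-byte item IDs -> 512 per page
counts_per_page = (PAGE_BYTES * 8) // 3   # 3-bit counts -> 5461 per page
```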
Users manipulate the subset selection DQ filter lists to select subsets by d_ID. De-/selecting a d_ID then corresponds to hiding or showing the corresponding highlighter. We use a k-Array to process these incremental subset selection queries. Note that the k-Array supports access to "nearest neighbor" slots (as required by incremental selection of continuous ranges in the temporal query filters) as well as direct access to any slot of any DQ filter, which is desired in the case of subset selection based on discontinuous ranges of attributes.
Our proposed k-Array is an improvement over the linked array in that item IDs in each slot are clustered together. Figure 12 compares the number of page faults required for using the linked array vs. the k-Array methods for processing a sample TVQL query. The graph indicates that the performance of the linked array degrades considerably for larger data sets, while the performance of the k-Array is fairly stable (i.e., it degrades gracefully). We have compared our k-Array to the linked array for a series of TVQL queries and consistently find the same result--the k-Array outperforms the linked array, especially as the data set size increases [11].
Figure 12. Number of page faults required for processing a sample TVQL query for different data set sizes: comparison between linked array and k-Array methods.
Figure 13 compares the performance of the two approaches for specifying incremental vs. non-incremental queries. A query is incremental when it is processed as an update to the previous query rather than as a new query calculated from scratch. Both methods perform better overall for incremental queries, but the k-Array still outperforms the linked array under both conditions.
Figure 13. Number of page faults required for processing a series of incremental queries vs. processing the same queries from scratch: comparison between linked array and k-Array methods.
Our work on evaluating the TVQL Processor is ongoing and we have recently completed enhancements to the k-Array, converting it to a bucket-based method called the k-Bucket [11]. We have compared the k-Array and k-Bucket to the linked array and other popular indexing structures such as the k-d tree [4, 5] and grid file [15, 19]. Thus far, our findings indicate that although the k-Bucket is not the most efficient under all conditions, it is the best performer overall [11].
The series of sample temporal queries in the case study, for example, demonstrates how our tools can be used to investigate who in the study might have emerged as the leader of the group, even though no leader was assigned. We proposed a number of (temporal query) questions to investigate this and then used MMVIS to pose them and explore the results. We looked at the frequency and average duration of Talking events to see who talked more frequently and who spoke for longer periods of time than others. We examined the relationship between when individuals spoke and when digressions took place to see who initiated, who participated, and who ended digressions. TViz not only highlighted this information, but also highlighted the frequency (i.e., strength) of these temporal relationships. The overall results of the case study led us to some interesting observations, some of which were desirable and expected and some of which could lead to more in-depth analysis.
Although other extensions to dynamic query filters and VIS have been explored [7, 9], these extensions primarily focus on aggregation extensions to the interface. While these aggregation techniques could be incorporated into our system to enhance the formation of subsets, they do not address the temporal and relative exploratory needs of video analysis.
Previous research on processing multidimensional range queries focuses on processing queries once rather than processing dynamic, direct manipulation queries. Hence, the problem we are addressing is different from conventional multidimensional range query processing in that our input to the query processor corresponds to sets of incremental (continuous) query updates typically generated by adjusting one query filter at a time. Work by [16] comes closest to our TVQL query processing approach in that they analyze various structures for dynamic queries in general. However, the researchers focus their attention on main memory rather than disk-based methods. In this paper, we have presented an array-based indexing structure customized for incremental query processing. Advantages of our proposed strategy include not only its simplicity in implementation but also its efficiency of processing TVQL queries compared to alternative approaches, such as the linked array method.
In order to preserve the notion of interactive browsing for trend analysis, the visual queries must be processed as efficiently as possible. Thus, the query processor becomes a critical component of our system implementation. We thus have developed and fine-tuned a novel index data structure (called the k-Array method) and associated query processing strategy for handling TVQL queries. Our analysis indicates that the k-Array performs much better than the linked array as the data set size increases, and that it is particularly well suited to handle incrementally specified queries -- the main mode of interaction with MMVIS.
In the future, we plan to enhance the analysis of TVQL query processing, including a comparison of the various alternate processing methods specified in the literature for specifying disjunctive multidimensional range queries. We will also continue our ongoing effort on conducting user studies to test the usability and conceptual understanding of the TVQL interface as well as the utility of the MMVIS system for identifying temporal trends.
This work was completed while the first author was at the University of Michigan.