Abstract
An entropy-based criterion is proposed to characterize the pattern and intensity of object motion in a video sequence as a function of time. By applying a homoscedastic error model-based time series change point detection algorithm to this motion entropy curve, one is able to segment the corresponding video sequence into individual sections, each consisting of a semantically relevant event. The proposed method is tested on six hours of sports videos including basketball, soccer, and tennis. Excellent experimental results are observed.
1. Introduction
Fully automated video skimming and summarization is an actively pursued research topic in the field of content-based video analysis. It allows operators to quickly scan surveillance videos to spot events of interest, and viewers to efficiently browse large collections of long video clips. It may also facilitate content-based multimedia authoring and content creation. Sports videos in particular stand to benefit from automated video summarization technology. Most sports games are naturally organized into successive and alternating plays of offence and defence, culminating in events such as a goal (or an attempt at one) or a hit. If a sports video can be segmented according to these semantically meaningful events, the segments can be used in numerous applications to enhance their value and enrich the users' viewing experience.
A number of content-based analysis techniques have been proposed to analyze a particular type of sports video [1–7]. Gong et al. [1] adopted domain knowledge to parse the content of soccer video programs based on four kinds of components: a soccer court, a ball, the players, and the motion vectors. Image content analysis and panoramic reconstruction techniques are proposed in [2] to automatically detect and extract soccer highlights. Sadlier and O'Connor [3] propose an audio-visual feature-based framework for event detection in broadcast video of multiple field sports. The framework comprises crowd image detection, speech-band audio activity detection, on-screen graphic tracking, and a motion activity measure. Ye et al. [4] proposed a method for exciting event detection in broadcast soccer video based on mid-level visual description and incremental SVM learning. A retrieval system is investigated in [5] by considering the spatiotemporal behavior of an object in the footage as the embodiment of a particular semantic event. Shih and Huang [6] introduced a content-based multifunctional video retrieval (MFVR) system in which content analysis is carried out based on different content semantics. In [7], Liu et al. presented several feature extraction methods, including wavelet-based motion analysis, a hybrid field-color model, and prior knowledge-driven line detection; a boosting chain is then used for feature selection and decision making for play detection in American football video. In these earlier investigations, domain-specific, heuristic, ad hoc rules are proposed to detect specific scenes or objects in video of a specific type of sport. It is unclear whether these rules can easily be generalized to other types of sports.
Motion is perhaps the most significant feature of most sports videos. Dominant motion estimation of a video clip has been analyzed in several previous works [8–10]. The results led to the development of motion estimation models to interpret motion activities in video. Liu et al. [11] devised a perceived motion energy (PME) model for the purpose of video key frame extraction. The PME value is defined as the motion magnitude between successive frames along the dominant motion direction. However, the authors also observed that apparent camera motion may cause excessive false alarms.
For many professionally produced sports videos, it is observed that different sports events are often associated with different motion patterns. For example, scenes of a slam dunk in a basketball game are often quite similar, not only within the same game but also across different games played by different teams. This may be due to camera positions as well as common production practice in the studio. Based on this observation, we are motivated to investigate the feasibility of using motion pattern and intensity to characterize potentially semantically significant events in a sports video.
Previously, it has been observed [12] that global camera motion often causes excessive false alarms when the dominant motion value is used as a feature to analyze a sports video. In this work, we define an entropy-based motion value to alleviate this problem. Conceptually, motion vectors due to global camera motion are quite regular, as opposed to motion vectors due to complex object motion in a sports event. We therefore propose an entropy measure that simultaneously characterizes both the magnitudes and the directions of motion vectors. Regular motion vectors are expected to yield a lower entropy value, while irregular motion vectors yield a higher one. Hence, by incorporating an entropy-based criterion, it becomes possible to distinguish genuine sports events with high motion from global camera motion.
The entropy-based motion value feature yields a one-dimensional time function corresponding to the original time-indexed video sequence. Using a maximum-likelihood estimation method based on the homoscedastic error model [13], this time function is approximated by a series of piece-wise linear segments joined at change points. By examining this approximated motion entropy function against the original sports video, it is observed that semantically meaningful sports events are often highly synchronized with certain change patterns, by which the video sequence can be divided.
In the remainder of this paper, the motion entropy measure is presented in Section 2. This is followed in Section 3 by a short discussion of applying a change point detection method to break the motion entropy time function into a piece-wise linear approximation. Significant sports event detection is described in Section 4, and experiments using six hours of video from three different kinds of sports are reported in Section 5.
2. Entropy-Based Motion Analysis
As discussed above, the dominant motion feature has previously been used to aid key-frame identification in sports videos [11]. However, it has also been observed that global camera motion may be confused with genuine sports events when the dominant motion feature is used. In this work, a motion entropy feature is proposed that is both effective in rejecting false alarms and efficient to compute.
The notion of motion entropy has been used
for video watermarking [14]. In [15], an entropy measure defined on angle
distributions of motion vectors is used to compute a global motion ratio, which
then is used to calculate the perceived motion energy [11].
In Figure 1, motion vectors of two video
frames, representing (a) nonsignificant and (b) significant sports events are
shown to the right of corresponding frames. It can be observed that motion
vectors of the frame containing more significant events (b) are much less
regular than those of a nonsignificant frame (a). This motivates our
development of an entropy-based motion value to characterize the randomness of motion vectors.
Figure 1: Motion vectors obtained from uncompressed domain by full-search motion estimation approach: (a) a nonsignificant event, (b) a significant event.
Toward this goal, we exploit the statistical distribution of motion vectors within a video frame along different directions to gauge their degree of regularity. To do so, we divide the angles from 0 to $2\pi$ equally into AngNum subangles. In this work, the value of AngNum was chosen based on extensive experimentation.
Let $p_i$ be the fraction of motion vectors whose direction falls within the $i$th subangle. We define the motion directivity entropy along the $i$th subangle as
$$E_i = -p_i \log p_i. \tag{1}$$
Furthermore, we denote by $M_i$ the sum of the magnitudes of all motion vectors whose direction falls within the $i$th subangle, and compute a magnitude weight factor
$$w_i = \frac{M_i}{\sum_{j=1}^{\text{AngNum}} M_j}. \tag{2}$$
Given $E_i$ and $w_i$, we propose a novel motion entropy measure, called entropy motion value (EMV), for a given video frame as follows:
$$\text{EMV} = \sum_{i=1}^{\text{AngNum}} w_i E_i. \tag{3}$$
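The computation described by (1)–(3) can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes that the directivity entropy of a subangle is $-p \log_2 p$ of its share of motion vectors, that the magnitude weight is the subangle's share of the total motion magnitude, and that the EMV is the weighted sum over subangles. The function name `entropy_motion_value`, the default `ang_num=8`, and the base-2 logarithm are illustrative choices, not values taken from this paper.

```python
import numpy as np

def entropy_motion_value(dx, dy, ang_num=8):
    """Entropy motion value of one frame from its motion vector field.

    dx, dy  : horizontal and vertical motion vector components.
    ang_num : number of equal subangles over [0, 2*pi); 8 is a placeholder.
    """
    dx = np.asarray(dx, dtype=float).ravel()
    dy = np.asarray(dy, dtype=float).ravel()
    mag = np.hypot(dx, dy)
    moving = mag > 0                      # zero vectors have no direction
    if not moving.any():
        return 0.0

    # Map each direction to a subangle index in 0 .. ang_num - 1.
    ang = np.mod(np.arctan2(dy[moving], dx[moving]), 2 * np.pi)
    bins = np.minimum((ang / (2 * np.pi) * ang_num).astype(int), ang_num - 1)

    counts = np.bincount(bins, minlength=ang_num)
    mag_sum = np.bincount(bins, weights=mag[moving], minlength=ang_num)

    p = counts / counts.sum()             # fraction of vectors per subangle
    w = mag_sum / mag_sum.sum()           # assumed magnitude weight per subangle

    entropy = np.zeros(ang_num)
    nz = p > 0
    entropy[nz] = -p[nz] * np.log2(p[nz])  # assumed per-subangle entropy term
    return float(np.sum(w * entropy))
```

Under these assumptions, a pure camera pan (all vectors pointing in one direction) gives an EMV of 0, whereas eight unit vectors falling in eight different subangles give 0.375.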
Previously, the perceived motion energy proposed
in [11] also utilizes the magnitudes and angle of motion vectors to
characterize the sports contents in a video frame. However, by incorporating
the motion directivity entropy in the formula, EMV is insensitive to global motion such as camera panning. The use
of motion vector magnitude as a weight to the directivity entropy further
improves the performance.
In Figure 2(a), the EMV and PME values corresponding to the same sports video sequence are plotted against time. Since each sports scene usually lasts more than one second, it is sufficient to calculate the EMV and PME values once per second. The horizontal and vertical axes represent the time index in seconds and the corresponding normalized motion value, respectively. The figure shows that, while most of the values are approximately the same, there are two notable differences. (i) Near time index 400, the EMV value (~60) is markedly larger than the PME value (~40). A close examination of the soccer video reveals that this corresponds to an attacking-goal event (cf. Figure 2(b)). (ii) Around time index 1400, the EMV value (~40) is much smaller than the corresponding PME value (~80). This scene (cf. Figure 2(c)) corresponds to a closeup shot of players walking around aimlessly and hence is not a semantically significant sports event. Both instances illustrate the potential advantage of the proposed EMV measure over the previously proposed PME measure in capturing semantically significant sports events.
Figure 2: (a) An example of the EMV and PME [11] curves generated by the same sports video sequence, (b) attacking event detected only by the EMV feature, (c) non-event (false alarm) detected only by the PME feature.
3. Event Segmentation
The computed EMV value is a function of time. As shown in Figure 2, it includes multiple peaks corresponding, potentially, to various semantically significant sports events. In [11], the PME function is approximated by a train of triangular pulses and then segmented accordingly. However, careful comparison of the EMV function against actual sports video indicates that such an ad hoc segmentation approach may be insufficient. In particular, the triangle model requires the determination of an additional variable [11] and may lead to a higher false-alarm rate.
In this work, we formulate the task of sports event segmentation as a change point detection problem, which is a well-studied subject in the field of statistics [13, 16–18]. In particular, we assume that the EMV curve $y_t$ over frame indices $t = 1, 2, \ldots, n$ can be represented with $k + 1$ line segments (the EMV curve is divided into segments by $k$ change points $\tau_1 < \tau_2 < \cdots < \tau_k$):
$$y_t = a_j t + b_j + e_j(t), \quad \tau_{j-1} < t \le \tau_j, \quad j = 1, \ldots, k + 1, \tag{4}$$
with $\tau_0 = 0$ and $\tau_{k+1} = n$. In (4), the model fitting errors $e_j(t)$ of the $j$th segment at time $t$ are modelled as independent, identically distributed random variables with zero mean and variance $\sigma_j^2$. Given the set of break points $\{\tau_j\}$ and the underlying model parameters $\{a_j, b_j\}$, and imposing a homoscedasticity assumption (homogeneity of variance) [13], that is, assuming $\sigma_1^2 = \cdots = \sigma_{k+1}^2 = \sigma^2$, the maximum-likelihood function of the set of change points can then be expressed as
(5)
where $k$ is the number of change points and $m_j$ is the number of frames in segment $j$.
In practice, since the change points are unknown, the iterative method of [13] for finding the set of change points is adopted in our work. Specifically, a likelihood value (accumulated from several segments) is evaluated under the homoscedastic error model for each candidate split of a range of frames. The range of candidate positions is restricted so that at least a fixed minimum number of frames is assigned to each segment. The change point within the range is then selected as the candidate position that minimizes this likelihood value.
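The search for a single change point within a range of frames can be sketched as follows. This is a simplified illustration, not the exact procedure of [13]: it assumes Gaussian errors with a common variance, in which case maximizing the likelihood of a two-line fit is equivalent to minimizing the total residual sum of squares. The names `line_rss`, `best_change_point`, and `min_len` are hypothetical.

```python
import numpy as np

def line_rss(t, y):
    """Residual sum of squares of the least-squares line through (t, y)."""
    coeffs = np.polyfit(t, y, 1)
    resid = y - np.polyval(coeffs, t)
    return float(np.dot(resid, resid))

def best_change_point(y, min_len=3):
    """Best single split of y into two line segments.

    Returns (c, score), where y[:c] and y[c:] are fitted by separate lines
    and score is their total residual sum of squares. Each side keeps at
    least min_len samples. Returns (None, inf) if y is too short.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    best_c, best_score = None, np.inf
    for c in range(min_len, len(y) - min_len + 1):
        score = line_rss(t[:c], y[:c]) + line_rss(t[c:], y[c:])
        if score < best_score:
            best_c, best_score = c, score
    return best_c, best_score
```

Applying such a search recursively to the two resulting sub-ranges, and stopping when a split no longer reduces the score appreciably, yields a piece-wise linear approximation; the stopping criterion is left out here because it depends on details not reproduced in this sketch.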
A drawback of the original change point detection in [13] is that every point of the time function, from the first to the last, is examined to determine whether it qualifies as a change point. In this work, a slope change concept is used to reduce the number of points that need to be examined. Specifically, a SlopeChange set is maintained that contains the candidate points corresponding to zero crossings of the slope of the EMV curve. Figure 3(a) compares an example result of the proposed approach with that of the original one. In this example, the proposed method considers fewer candidate change points and detects fewer change points. In addition, a heuristically selected threshold used in [13] is eliminated. This leads to an improved change point selection criterion
(6)
subject to the constraint
(7)
Figure 3: (a) An example result of the modified change point detection (solid line) compared to that of the original one (dashed line), (b) event segmentation using change point detection under a homoscedastic error model.
These
changes result in marked improvement of computation efficiency and more
accurate results. The change point selection algorithm is summarized in
Algorithm 1.
An example of the
detected change points is presented in Figure 3(b). Each vertical
line represents a change point found by the above algorithm to partition the
whole curve into segments.
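The slope-change idea used in this section to limit the candidate change points can be sketched in Python. This is an illustrative reading only: it treats a zero crossing of the slope as a sign change of the discrete first difference of the sampled EMV curve, and the function name `slope_change_candidates` is hypothetical.

```python
import numpy as np

def slope_change_candidates(y):
    """Indices where the discrete slope of y changes sign (local extrema)."""
    y = np.asarray(y, dtype=float)
    slope_sign = np.sign(np.diff(y))
    # Carry the previous sign across flat stretches so that a plateau
    # produces at most one candidate, located at its end.
    for i in range(1, len(slope_sign)):
        if slope_sign[i] == 0:
            slope_sign[i] = slope_sign[i - 1]
    flips = np.nonzero(slope_sign[1:] * slope_sign[:-1] < 0)[0]
    return (flips + 1).tolist()
```

Only the returned indices would then need to be evaluated as potential change points, instead of every sample of the curve.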
4. Significant Sports Events Detection
The homoscedastic error model-based change point detection method provides a tentative segmentation of the sports video based on the EMV curve. However, some of these dramatic changes of motion pattern may not be semantically significant. Here, semantic significance is defined based on the context of the particular sport. Further empirical analysis indicates that semantically significant sports events seem to be highly correlated with certain patterns of the EMV curve. In particular, it is observed that, in the approximated EMV curve, line segments with positive slope often correspond to significant sports actions while line segments with negative slope do not. Figure 4(a) displays an example of motion patterns from a soccer video. In region A of Figure 4(a), the offensive team is trying to pass the ball from midfield to the front of the goal, and the motion pattern tends to be horizontal. The positive-slope motion pattern in region B corresponds to a player preparing for a direct free kick. Region C, the region around the peak value of the motion pattern, comes from the most salient activities, that is, the offensive and defensive teams both trying to respond to the free kick situation. After that, the motion pattern of region I decreases slowly until another significant event starts.
Figure 4: (a) Significant event segment consisting of the motion pattern, (b) criterion for reserving or dropping segments based on IVal: 1st dropped, 2nd dropped, 3rd reserved, and 4th dropped.
To quantify this empirically determined rule, an accumulated impact value IVal is defined for each event segment as the difference between the increasing part of the curve, from the start of the segment to its peak (PIVal), and the decreasing part of the curve, from the peak (the maximum value) to the end of the segment (NIVal). Whether an event segment is reserved or dropped depends on the difference obtained by subtracting NIVal from PIVal. The equation is defined as follows:
(8)
where the two frame counts in (8) denote the number of frames in the increasing curve and in the decreasing curve of the segment, respectively. For each of the candidate video segments obtained using the change point detection method, a simple rule-based classifier is applied to decide whether the segment is semantically significant.
Semantically significant sport event detection rule. For each candidate sport event segment, if the corresponding impact value IVal > 0, then it is semantically significant. Otherwise, it is not.
Figure 4(b) gives an example illustrating this rule. Only the third segment is verified as a significant event, as its corresponding IVal value is greater than zero. As shown in the experiment discussed in Section 5, this simple decision rule distinguishes significant events from nonsignificant ones fairly accurately.
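The decision rule can be illustrated with a small Python sketch. The exact form of (8) is not reproduced here, so the sketch rests on a loudly stated assumption: PIVal is taken as the accumulated rise of the EMV samples from the start of the segment to its peak, and NIVal as the accumulated fall from the peak to the end of the segment. The function names `impact_value` and `is_significant` are hypothetical.

```python
import numpy as np

def impact_value(segment):
    """Assumed IVal = PIVal - NIVal for the EMV samples of one segment."""
    seg = np.asarray(segment, dtype=float)
    peak = int(np.argmax(seg))           # position of the maximum value
    rise = np.diff(seg[:peak + 1])       # frame-to-frame changes before the peak
    fall = np.diff(seg[peak:])           # frame-to-frame changes after the peak
    pival = float(rise[rise > 0].sum())  # assumption: accumulated increase
    nival = float(-fall[fall < 0].sum()) # assumption: accumulated decrease
    return pival - nival

def is_significant(segment):
    """Rule from the text: a segment is significant if IVal > 0."""
    return impact_value(segment) > 0
```

With this reading, a segment that mostly builds up toward its peak is reserved, while a segment that mostly decays after an early peak is dropped.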
The overall significant sport event segmentation algorithm is summarized in Figure 5. The first stage is the entropy-based motion analysis module. It is followed by a homoscedastic error model-based event segmentation module. Finally, the candidate sports events are subjected to the significant sports event detection rule, and the output is the set of detected significant sports event segments.
Figure 5: Block diagram of the proposed architecture.
5. Experimental Results
Experiments in this paper were conducted using three kinds of video: soccer, tennis, and basketball. Table 1 presents detailed information about the experimental videos. Different kinds of video are included to test the robustness of the proposed approach. In total, around six hours of experimental video clips at a frame rate of 30 fps were used.
Table 1: Characteristics of the dataset.
The proposed event detection approach detects a set of significant event segments from the video. Events were detected by the three-phase entropy-based motion analysis proposed in this paper. First, the entropy-based motion feature is extracted from the uncompressed sports videos based on (1)–(3). The EMV time series is then segmented by the change point detection module; the detailed algorithm is presented in Algorithm 1. Finally, segments without meaningful events are removed by the significant sports event detection module. A sports video summary is then composed of the reserved segments. Three standard performance evaluation metrics are used: Precision, Recall, and Fscore. These metrics are defined as follows:
$$\text{Precision} = \frac{N_c}{N_d}, \quad \text{Recall} = \frac{N_c}{N_g}, \quad \text{Fscore} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{9}$$
where $N_c$ is the number of correctly detected significant events, $N_d$ is the total number of detected events, and $N_g$ is the number of manually labelled significant events.
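For concreteness, the three metrics can be computed from event counts as in the following Python sketch, which uses the standard definitions of precision, recall, and F-score. The function name and the counts in the usage note are made up for illustration and are not results from this paper.

```python
def precision_recall_fscore(num_correct, num_detected, num_actual):
    """Standard precision, recall, and F-score from event counts.

    num_correct  : detected segments that contain a true significant event
    num_detected : all segments reported as significant
    num_actual   : all manually labelled significant events
    """
    precision = num_correct / num_detected if num_detected else 0.0
    recall = num_correct / num_actual if num_actual else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fscore
```

For example, with 8 correct detections out of 10 reported segments and 16 labelled events (hypothetical numbers), precision is 0.8 and recall is 0.5.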
The overall performance of significant sports event segmentation
algorithm is summarized in Table 2. For each of the three sports under study,
three significant sport events are identified manually. They are Fault, Ace,
and Volley in tennis video; Scoring, Miss, and Free throw in basketball video;
Goal, Free kick, and Corner in soccer video. To conduct the experiment, the EMV curve is first evaluated, and the homoscedastic error model is
applied to compute the change points and associated candidate sport video
segments. Each of these candidate segments then is manually labelled as being
significant or not depending on whether it contains one of the corresponding
significant sport events specified above. The impact value detection rule
described in Section 4 is also applied to the corresponding EMV curve of each candidate sports event segment. Comparing the algorithm results against the manual labels leads to the Precision, Recall, and Fscore metrics which are summarized in Table 2. These results are also compared
against those obtained using PME method [11].
Table 2: Evaluation on event detection using entropy-based motion analysis.
The better performance of the proposed EMV feature is presumed to stem from two main reasons. First, the proposed feature is more sensitive to the players' actions in sports video, so it can detect more events involving interaction among players. Second, the entropy-based motion feature reduces the effect of camera motion.
To validate the effectiveness of the proposed segmentation method, we also conducted an experiment comparing its results with those obtained using the original homoscedastic error model [13] and the triangle model [11]. The detailed results are summarized in Table 3.
Table 3: Comparison of experimental results among the triangle model [11], the original [13], and our modified homoscedastic error model in a soccer video.
To meet the demand for video summarization, this paper presents an efficient event detection scheme based on entropy-based motion analysis and a modified homoscedastic error model. The entropy-based motion analysis produces the motion entropy curve, and the homoscedastic error model segments it into candidate events. The significant sport event detection then extracts the significant events from all the segmented events. Future research directions for improving the proposed architecture include (a) extending the proposed framework to movie video and (b) integrating audio features such as the melody of background music and environmental sounds to further improve the video semantic analysis.
6. Conclusions
In this paper, an efficient sports video segmentation method is
proposed. A motion entropy curve is extracted from a given sport video as a
succinct feature to characterize the motion pattern as a function of time. A
time series change point detection algorithm that minimizes the homoscedastic
error is employed to approximate the motion entropy curve with a piece-wise
linear model. The accumulated impact value is then used to decide which segment is a
significant sport event. Future research
directions for improving the proposed architecture include (a) using machine
learning or probabilistic methods to enhance the semantic information in the significant sports event
detection module and (b) integrating audio features
to enrich the proposed feature extraction model.
Acknowledgments
The authors wish to thank Professor Yu-Hen Hu for his valuable suggestion and revision. This work was supported in part by National Science Council (Taiwan) under the grant NSC97-2218-E-006-012.
References
- Y. Gong, L. T. Sin, C. H. Chuan, H. Zhang, and M. Sakauchi, “Automatic parsing of TV soccer programs,” in Proceedings of the International Conference on Multimedia Computing and Systems (ICMCS '95), pp. 167–174, Washington, DC, USA, May 1995.
- D. Yow, B.-L. Yeo, M. Yeung, and B. Liu, “Analysis and presentation of soccer highlights from digital video,” in Proceedings of the 2nd Asian Conference on Computer Vision (ACCV '95), Singapore, December 1995.
- D. A. Sadlier and N. E. O'Connor, “Event detection in field sports video using audio-visual features and a support vector machine,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 10, pp. 1225–1233, 2005.
- Q. Ye, W. Gao, and S. Jiang, “Exciting event detection in broadcast soccer video with mid-level description and incremental learning,” in Proceedings of the 13th Annual ACM International Conference on Multimedia (MULTIMEDIA '05), pp. 455–458, Singapore, November 2005.
- N. Rea, R. Dahyot, and A. Kokaram, “Semantic event detection in sports through motion understanding,” in Proceedings of the 3rd International Conference on Image and Video Retrieval (CIVR '04), pp. 88–97, Dublin, Ireland, July 2004.
- H.-C. Shih and C.-L. Huang, “Content-based multi-functional video retrieval system,” in Proceedings of the International Conference on Consumer Electronics (ICCE '05), pp. 383–384, Las Vegas, Nev, USA, January 2005.
- T. Y. Liu, W.-Y. Ma, and H.-J. Zhang, “Effective feature extraction for play detection in American football video,” in Proceedings of the 11th International Multimedia Modelling Conference (MMM '05), pp. 164–171, Melbourne, Australia, January 2005.
- M. J. Black and P. Anandan, “The robust estimation of multiple motions: parametric and piecewise-smooth flow fields,” Computer Vision and Image Understanding, vol. 63, no. 1, pp. 75–104, 1996.
- J. M. Odobez and P. Bouthemy, “Robust multiresolution estimation of parametric motion models,” Journal of Visual Communication and Image Representation, vol. 6, no. 4, pp. 348–365, 1995.
- P. Bouthemy, M. Gelgon, and F. Ganansia, “A unified approach to shot change detection and camera motion characterization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 7, pp. 1030–1044, 1999.
- T. Liu, H.-J. Zhang, and F. Qi, “A novel video key-frame-extraction algorithm based on perceived motion energy model,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 10, pp. 1006–1013, 2003.
- F. Coldefy and P. Bouthemy, “Unsupervised soccer video abstraction based on pitch, dominant color and camera motion analysis,” in Proceedings of the 12th ACM International Conference on Multimedia, pp. 268–271, New York, NY, USA, October 2004.
- V. Guralnik and J. Srivastava, “Event detection from time series data,” in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '99), pp. 33–42, San Diego, Calif, USA, August 1999.
- S. Suthaharan, S.-W. Kim, S. Sathananthan, H.-K. Lee, and K. R. Rao, “Perceptually tuned video watermarking scheme using motion entropy masking,” in Proceedings of the IEEE Region 10th Conference and Exhibition (TENCON '99), vol. 1, pp. 182–185, Cheju Island, South Korea, September 1999.
- Y.-F. Ma and H.-J. Zhang, “A new perceived motion based shot content representation,” in Proceedings of the IEEE International Conference on Image Processing (ICIP '01), vol. 3, pp. 426–429, Thessaloniki, Greece, October 2001.
- D. M. Hawkins and D. F. Merriam, “Optimal zonation of digitized sequential data,” Mathematical Geology, vol. 5, no. 4, pp. 389–395, 1973.
- S. B. Guthery, “Partition regression,” Journal of the American Statistical Association, vol. 69, no. 348, pp. 945–947, 1974.
- D. M. Hawkins, “Point estimation of the parameters of piecewise regression models,” Journal of the Royal Statistical Society. Series C, vol. 25, no. 1, pp. 51–57, 1976.