Abstract

Slow-motion replays are content-rich segments of broadcast soccer videos. In this paper, we propose an efficient method for detecting slow-motion shots produced by high-speed cameras in soccer broadcasts. A rich set of color, motion, and cinematic features is extracted from compressed video by partial decoding of the MPEG-1 bitstream. Slow-motion shots are then modeled by an SVM classifier for each shot class. A set of six full-match soccer games is used for training and evaluation of the proposed method. Our algorithm achieves satisfactory accuracy together with high speed for slow-motion detection in soccer videos.

1. Introduction

Replays in soccer broadcasts cover the most important content of the video. The rapid development of video compression techniques has led to huge archives of compressed video. Compressed-domain analysis of these archived videos can yield efficient video processing frameworks. However, noisy features are the main challenge in compressed-video analysis.

Nowadays, most sports broadcasters insert logo transitions before and after replay shots [13]; however, detecting replay shots through slow-motion detection can add robustness and generality to existing systems. Slow-motion replays can be produced by standard or high-speed cameras, and several approaches have been proposed for slow-motion detection for each production style. A slow-motion replay from a standard camera is generated by repeating some normal frames or by inserting morphed frames between two consecutive frames [4, 5]. Repeated or inserted frames produce distinctive patterns in the frame-difference feature and can be detected easily in the spatial [4–6] or compressed domain [7–9].

Recently, the majority of broadcasters have turned to high-speed cameras for slow-motion generation, to achieve a finer presentation of fast movements. When a high-speed camera records at a frame rate higher than the desired slow-motion frame rate, some frames must be dropped. Dropped frames can be detected through abundant fluctuations in the frame-difference feature [10]. In addition, slow-motion replays generated by frame dropping have a higher mean absolute frame difference than slow-motion replays generated by normal cameras [4]. With a high-speed camera, the slow-motion effect can also be generated by simply playing the video back at normal speed. We call slow-motion replays generated by high-speed cameras HISM replays. Detecting HISMs is very challenging because they do not produce any special pattern in visual features. Han et al. [11] tried to detect HISMs by exploiting camera motion patterns and achieved 72% precision and 66% recall. Wang et al. [12] used color and motion features to model HISMs with SVM classifiers and obtained 75% precision and 61% recall on sports videos. Yang et al. [13] improved on the work of Wang et al. [12] by using HMM models and achieved 83% precision and 81% recall on soccer videos. In contrast, scene transition structures were used in [14] as a general method for replay detection, yielding 74% precision and 86% recall on three soccer games. All prior work on HISM detection operates in the spatial domain and is slow.

In this paper, we propose an efficient framework for slow-motion detection in compressed MPEG-1 soccer videos. Several color, motion, and cinematic features are combined in our framework to achieve satisfactory results despite the intrinsic noise of the compressed domain. The rest of the paper is organized as follows. Section 2 presents an overview of the proposed framework for slow-motion detection. In Section 3, the proposed framework is evaluated by precision and recall rates on soccer videos. Finally, Section 4 presents conclusions and future work.

2. Proposed Framework

Humans can discriminate between slow-motion and live shots by noticing the speed of player or ball motion in the scene [12]. However, robust detection and tracking of moving objects in soccer videos is a challenging task. On the other hand, color-difference and motion features are correlated with object and camera motion in the scene. We therefore use color, motion, and cinematic features for slow-motion modeling.

Shots in soccer video can be categorized into four classes: long, medium, close-up, and out of field [6]. Close-up and out-of-field shots do not exhibit important events of the game, so we do not consider them in our framework. In addition, the motion-related visual features of long and medium shots differ substantially. Therefore, we model slow-motion shots for each shot class independently.

Shot boundary detection and shot classification are two preliminary tasks in the soccer domain and have been studied in several works [6, 15–18]. Any error in these modules can lead to high error rates in slow-motion detection. For an unbiased evaluation of the slow-motion detection module, shot boundary detection and shot classification are performed manually in our framework. First, low-level color and motion features are extracted efficiently from the compressed domain. Then, a GMM is used to model grass color adaptively in each video stream. Color, motion, and cinematic features are extracted in the next step. Thereafter, an SVM classifier is trained for slow-motion detection in each shot class using training data. Finally, the trained SVM classifiers are used to classify slow-motion shots in test videos.

2.1. Low-Level Information Extraction

In this step, low-level color and motion information is extracted from the MPEG bitstream. An MPEG-1 video decoder written in Java was modified to extract this information from compressed video by partially decoding the bitstream.

2.1.1. DC Sequence Extraction

MPEG-1 is a block-based video compression standard that exploits the discrete cosine transform (DCT) and motion-compensated prediction error to reduce spatiotemporal redundancy in an image sequence [19, 20]. In the MPEG-1 standard, each picture is divided into 16×16-pixel subimages called macroblocks (MBs). Each macroblock consists of four luminance and two chrominance blocks. For each block in an MPEG sequence, the first DCT coefficient is called the DC coefficient and carries the mean intensity of the block [21]. Therefore, a downsampled version of each picture can be constructed by approximating the DC coefficients of the picture's blocks. We utilize the method proposed in [22, 23] for DC image extraction. The DC image of each picture in the sequence has a luminance plane $Y$ and two chrominance planes, $C_b$ and $C_r$. Color and object features can be extracted from the approximated DC image.
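As an illustration, the following Python sketch shows how a DC image could be assembled once the DCT blocks are available from partial decoding; the input layout and function name are our assumptions for exposition, not the decoder interface of [22, 23].

```python
import numpy as np

def dc_image_from_blocks(dct_blocks):
    """Build a downsampled 'DC image' from a grid of 8x8 DCT blocks.

    dct_blocks: array of shape (rows, cols, 8, 8) holding the DCT
    coefficients of each luminance block (hypothetical input layout;
    a real decoder supplies these during partial decoding)."""
    # Under the standard DCT normalization, the DC coefficient at
    # position (0, 0) equals 8x the mean intensity of the block.
    return dct_blocks[:, :, 0, 0] / 8.0

# Example: a 288x352 frame has a 36x44 grid of 8x8 luminance blocks,
# so its DC image is a 36x44 thumbnail.
blocks = np.random.rand(36, 44, 8, 8)
dc_y = dc_image_from_blocks(blocks)
print(dc_y.shape)  # (36, 44)
```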

2.1.2. Motion Vector Extraction

We use only the I-Pictures and P-Pictures of the MPEG stream. In an I-Picture, all MBs are coded independently of other pictures and are called intracoded MBs. In a P-Picture, an MB can be coded as an intracoded MB or a forward-coded MB. A forward-coded MB resembles an area of the previous reference picture (I-Picture or P-Picture) and is coded by its displacement from that area together with the compensation error in the DCT domain. We take the displacement of each forward-coded MB as its motion vector:
$$x(m,n,t-1) = x(m,n,t) + \mathrm{MV}_x(m,n,t), \tag{1}$$
$$y(m,n,t-1) = y(m,n,t) + \mathrm{MV}_y(m,n,t), \tag{2}$$
where $x(m,n,t)$ and $y(m,n,t)$ are the center-point coordinates of $\mathrm{MB}(m,n)$ in the current picture, $x(m,n,t-1)$ and $y(m,n,t-1)$ are the center-point coordinates of the most similar area in the previous picture, and $\mathrm{MV}_x(m,n,t)$ and $\mathrm{MV}_y(m,n,t)$ are the horizontal and vertical components of the motion vector of $\mathrm{MB}(m,n)$ in the current picture.
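A minimal sketch of Eqs. (1)-(2), assuming zero-based MB indices and integer motion vectors:

```python
def previous_center(m, n, mv_x, mv_y, mb_size=16):
    """Map the center of MB(m, n) in the current picture to the center
    of its best-matching area in the previous reference picture,
    following Eqs. (1)-(2)."""
    x_t = n * mb_size + mb_size // 2   # horizontal center of MB(m, n)
    y_t = m * mb_size + mb_size // 2   # vertical center of MB(m, n)
    return x_t + mv_x, y_t + mv_y      # matching center in previous picture
```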

2.1.3. Grass Modeling

To model grass color, a temporal subsampling technique is used: grass color is extracted adaptively from each video sequence by processing I-Pictures only, and after each processed I-Picture the next 20 I-Pictures are skipped. For each processed I-Picture, all pixels whose color falls in the following range are considered green:
$$90 < Y < 140, \quad 110 < C_b < 130, \quad 90 < C_r < 120. \tag{3}$$
Then, green pixels are grouped into disjoint connected components, and all connected components smaller than one-tenth of the picture size are removed. The remaining pixels are the candidate grass pixels.
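The candidate-selection step could be sketched as follows; using scipy.ndimage for the connected-component analysis is our choice for illustration, not a detail of the paper.

```python
import numpy as np
from scipy import ndimage

def candidate_grass_pixels(y, cb, cr):
    """Select candidate grass pixels of one picture via the fixed green
    range of Eq. (3), then drop small connected components.
    y, cb, cr: 2-D arrays of the pixel color planes."""
    green = ((90 < y) & (y < 140) &
             (110 < cb) & (cb < 130) &
             (90 < cr) & (cr < 120))
    labels, n_components = ndimage.label(green)
    min_size = green.size / 10            # one-tenth of the picture size
    keep = np.zeros_like(green)
    for lab in range(1, n_components + 1):
        component = labels == lab
        if component.sum() >= min_size:   # keep only large components
            keep |= component
    return keep
```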

After that, a Gaussian mixture model (GMM) is fitted to the candidate grass pixels of the video to model grass color. GMMs with one, two, three, and four Gaussians are fitted, and the model with the minimum fitting error is selected as the final grass color model.
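The paper does not specify the error criterion for model selection; the sketch below assumes BIC as a stand-in and uses scikit-learn's GaussianMixture.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_grass_model(candidate_pixels):
    """Fit GMMs with 1-4 components to candidate grass pixels
    (N x 3 array of Y, Cb, Cr values) and keep the best model.
    BIC is an assumed stand-in for the paper's 'minimum error'."""
    best_model, best_bic = None, np.inf
    for k in range(1, 5):
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              random_state=0).fit(candidate_pixels)
        bic = gmm.bic(candidate_pixels)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model

# Hypothetical usage with synthetic grass-like colors:
pixels = np.random.normal([115, 120, 105], 5, size=(5000, 3))
grass_gmm = fit_grass_model(pixels)
```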

2.2. Feature Extraction

In this section, several color, motion, and cinematic features are extracted from compressed soccer videos to model slow-motion shots among long and medium shots.

2.2.1. Cinematic Features

Cinematic features are widely exploited in soccer video analysis because they are lightweight and effective. Similarly, we exploit cinematic knowledge of soccer videos for slow-motion detection.

(1) Shot Length
Considering long shots, most live scenes last more than 16 seconds. On the other hand, slow-motion medium shots are often longer than live medium shots [12]. Shot length is therefore a promising feature for discriminating live shots from slow-motion shots, so the length of each shot is computed and denoted by $\mathrm{Length}(i)$ for the $i$th shot of the sequence.

(2) Shot Type
During a break, the director may use several special consecutive shots to show a replay of the last event. This leads to common shot-type patterns around replay scenes. Therefore, we use the shot types of the two preceding and the two following shots of the current shot as features, denoted $S_{P1}(i)$, $S_{P2}(i)$, $S_{N1}(i)$, and $S_{N2}(i)$, respectively.

(3) Repeated Frame
In some slow-motion long shots, the director freezes the scene for a short time during critical moments. As a result, detecting repeated frames within a short duration can be informative for slow-motion detection. Repeated frames cause a low picture difference in the DC image sequence. Therefore, we define the repeated-frame feature for the $t$th picture in the sequence as
$$\mathrm{Repeated}(t) = \begin{cases} 1 & D_Y(t) < \mathrm{Th}_{D_Y}, \\ 0 & \text{otherwise}, \end{cases} \tag{4}$$
where $\mathrm{Th}_{D_Y}$ is the image-difference threshold and $D_Y$ is the DC image difference, defined as
$$D_Y(t) = \frac{1}{4MN} \sum_{m=1}^{2M} \sum_{n=1}^{2N} \left| Y(m,n,t) - Y(m,n,t-1) \right|. \tag{5}$$
Here $Y$ is the luminance plane of the DC image, and $M$ and $N$ are the numbers of rows and columns of the MB grid (each macroblock contains four luminance blocks). The existence of frozen pictures in a shot is then flagged as
$$\mathrm{Repetition}(i) = \begin{cases} 1 & \sum_{t \in \mathrm{Shot}_i} \mathrm{Repeated}(t) > \mathrm{Th}_{\mathrm{Repetition}}, \\ 0 & \text{otherwise}. \end{cases} \tag{6}$$
We used $\mathrm{Th}_{D_Y} = 0.05$ and $\mathrm{Th}_{\mathrm{Repetition}} = 3$ in our experiments.
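A compact sketch of Eqs. (4)-(6), assuming DC luminance values are normalized to [0, 1] so that the 0.05 threshold applies directly:

```python
import numpy as np

TH_DY = 0.05          # image-difference threshold from the paper
TH_REPETITION = 3     # minimum count of repeated pictures per shot

def dy(prev_y, cur_y):
    """Mean absolute DC-image difference, Eq. (5); prev_y and cur_y
    are 2M x 2N luminance DC images."""
    return np.abs(cur_y.astype(float) - prev_y.astype(float)).mean()

def repetition(shot_dc_images):
    """Eqs. (4) and (6): flag a shot that contains frozen pictures.
    shot_dc_images: list of consecutive DC images of one shot."""
    repeated = sum(dy(a, b) < TH_DY
                   for a, b in zip(shot_dc_images, shot_dc_images[1:]))
    return int(repeated > TH_REPETITION)
```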

2.2.2. Color Features

Color features can indicate the similarity of consecutive pictures in the sequence. In addition, some object features can be approximated by color features. Several color features are introduced in this section for slow-motion detection.

(1) Grass Ratio Difference
The grass ratio difference between consecutive pictures of a medium shot is correlated with camera and object motion in the scene. To compute the grass ratio of a picture, the constructed GMM grass model is applied to each pixel of its DC image, and all pixels with probability higher than $\mathrm{Th}_{\mathrm{grass}}$ are considered grass pixels. The value of $\mathrm{Th}_{\mathrm{grass}}$ is determined empirically for the whole video dataset. The ratio of grass pixels in the $t$th picture of the sequence, denoted $G(t)$, is computed by dividing the number of grass pixels by the total number of pixels in the DC image. We define the mean grass ratio difference of the shot as
$$M_{\mathrm{Grass}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} \left| G(t) - G(t-1) \right|, \tag{7}$$
where $G(t)$ is the ratio of grass-colored pixels in the DC image of the $t$th picture of the sequence.

(2) Difference of Luminance Standard Deviation
Camera and object motion in the scene changes the luminance contrast of the picture. Variance is a measure of image luminance contrast [24, 25], so changes in the standard deviation of luminance correlate with motion patterns in the scene. Therefore, the mean luminance-contrast difference of each shot is defined as the next shot descriptor:
$$M_{\mathrm{Std}_Y}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} \left| \mathrm{Std}_Y(t) - \mathrm{Std}_Y(t-1) \right|, \tag{8}$$
where $\mathrm{Std}_Y(t)$ is the standard deviation of pixel intensities in the $Y$ plane of the $t$th DC image of the sequence.

(3) Object Ratio
For each shot type, the size of the biggest object in the picture is related to the distance from the camera to the objects: a small movement causes a large motion magnitude in the picture when the camera is close to the objects, and vice versa. The size of the biggest object can be approximated by the size of the biggest connected component in the DC image. Consequently, we define the next shot descriptor as
$$R_{\mathrm{Object}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} R_O(t), \tag{9}$$
where $R_O(t)$ is the ratio of the biggest connected component in the DC image of the $t$th picture with respect to the DC image size.
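The three color descriptors reduce to simple statistics over per-picture series; the sketch below assumes $G(t)$, $\mathrm{Std}_Y(t)$, and $R_O(t)$ have already been computed for every picture of the shot.

```python
import numpy as np

def color_shot_features(grass_ratio, std_y, object_ratio):
    """Per-shot color descriptors, Eqs. (7)-(9). Inputs are 1-D arrays
    with one value per picture of the shot: G(t), Std_Y(t), R_O(t)."""
    n = len(grass_ratio)
    m_grass = np.abs(np.diff(grass_ratio)).sum() / n   # Eq. (7)
    m_std_y = np.abs(np.diff(std_y)).sum() / n         # Eq. (8)
    r_object = np.mean(object_ratio)                   # Eq. (9)
    return m_grass, m_std_y, r_object
```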

2.2.3. Motion Features

Because objects move more slowly in slow-motion scenes, motion-related features are the most promising ones for slow-motion detection. In this step, we introduce several motion features for slow-motion shot modeling.

In [26], we proposed a method for motion vector reliability measurement and global motion estimation in compressed MPEG-1 sequences. For each motion vector, a reliability value between zero and one is extracted, called $\mathrm{MVReliability}(m,n,t)$. In addition, global motion parameters $\mathrm{GMP}(t)$, $\mathrm{GMT}(t)$, and $\mathrm{GMZ}(t)$, denoting the pan, tilt, and zoom factors of the $t$th picture of the sequence, are computed for each frame. Figure 1 shows reliable motion information extracted from a sample picture by this method; in Figure 1(b), the intensity of each motion vector indicates its reliability.

(1) Skipped Macroblocks Ratio
In the MPEG compression standard, MBs that are very similar to the reference picture have no prediction error. These MBs are not coded in the MPEG bitstream and are called skipped MBs. Thus, the ratio of skipped MBs in a picture indicates its similarity to the reference picture. In slow-motion shots, we expect higher similarity between consecutive pictures, which results in a higher skipped-MB ratio per picture. Therefore, we define the ratio of skipped MBs in each picture as
$$R_S(t) = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \mathrm{SkippedMB}(m,n,t), \tag{10}$$
where $t$ is the index of the current picture in the sequence and $\mathrm{SkippedMB}$ for block $\mathrm{MB}(m,n)$ of the $t$th picture is defined as
$$\mathrm{SkippedMB}(m,n,t) = \begin{cases} 1 & \text{skipped macroblock}, \\ 0 & \text{otherwise}. \end{cases} \tag{11}$$
Then, we define the mean skipped-MB ratio of the shot as
$$M_{\mathrm{Skipped}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} R_S(t). \tag{12}$$
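Given a boolean skipped-MB mask per picture (a hypothetical input format; a real decoder flags skipped MBs during parsing), Eqs. (10)-(12) reduce to means:

```python
import numpy as np

def skipped_ratio(skipped_mask):
    """Eq. (10): ratio of skipped MBs in one picture.
    skipped_mask: M x N boolean array, True where the MB is skipped."""
    return skipped_mask.mean()

def mean_skipped(shot_masks):
    """Eq. (12): mean skipped-MB ratio over all pictures of a shot."""
    return np.mean([skipped_ratio(m) for m in shot_masks])
```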

(2) Reliable MV Magnitude
The motion magnitude of a picture indicates fast or slow movements of the camera and objects in the scene [12]. However, in MPEG compressed sequences some motion vectors (MVs) are noisy and unreliable. Therefore, we discriminate reliable MVs from unreliable ones and compute the mean magnitude of the reliable MVs. Reliable MVs are identified by
$$\mathrm{ReliableMV}(m,n,t) = \begin{cases} 1 & \mathrm{MVReliability}(m,n,t) > \mathrm{Th}_{\mathrm{MVRel}}(t), \\ 0 & \text{otherwise}, \end{cases} \tag{13}$$
where $\mathrm{Th}_{\mathrm{MVRel}}(t)$ is an adaptive threshold for reliable-MV detection, computed as
$$\mathrm{Th}_{\mathrm{MVRel}}(t) = \max\left(1.2\, M_{\mathrm{MVRel}}(t),\ 0.3\right). \tag{14}$$
$M_{\mathrm{MVRel}}(t)$ denotes the mean reliability of the motion vectors in the $t$th picture of the sequence:
$$M_{\mathrm{MVRel}}(t) = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \mathrm{MVReliability}(m,n,t), \tag{15}$$
where $M$ and $N$ denote the numbers of rows and columns of the MB grid, respectively. We then define the mean reliable motion magnitude of each picture as
$$M_{\mathrm{RMag}}(t) = \frac{1}{\sum_{m=1}^{M}\sum_{n=1}^{N} \mathrm{ReliableMV}(m,n,t)} \sum_{(m,n):\,\text{reliable MV}} \sqrt{\mathrm{MV}_x(m,n,t)^2 + \mathrm{MV}_y(m,n,t)^2}. \tag{16}$$
Finally, the mean motion magnitude of the shot is defined as
$$M_{\mathrm{ReliableMag}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} M_{\mathrm{RMag}}(t). \tag{17}$$
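A sketch of Eqs. (13)-(16) for one picture; the zero fallback for pictures with no reliable MV is our assumption, since the paper does not specify that case.

```python
import numpy as np

def reliable_motion_magnitude(mv_x, mv_y, reliability):
    """Mean magnitude of reliable MVs in one picture, Eqs. (13)-(16).
    mv_x, mv_y, reliability: M x N arrays over the MB grid."""
    mean_rel = reliability.mean()                 # Eq. (15)
    threshold = max(1.2 * mean_rel, 0.3)          # Eq. (14), adaptive
    reliable = reliability > threshold            # Eq. (13)
    if not reliable.any():
        return 0.0                                # assumed fallback
    magnitudes = np.hypot(mv_x[reliable], mv_y[reliable])
    return magnitudes.mean()                      # Eq. (16)
```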

(3) Motion Magnitude Fluctuations
When a slow-motion shot is produced by the frame-dropping technique, numerous fluctuations appear in the mean motion magnitude feature. Moreover, some directors now play only the critical seconds of a replay shot in slow motion while playing the other moments at live speed; the same fluctuation pattern appears during the slow-motion moments of these replay shots. The local variance of the values is an indicative measure of magnitude fluctuations. Therefore, we move a time window of length 10 over the mean motion magnitude feature and compute the variance of the values at each location as $M_{\mathrm{magvar}}(t)$. The motion magnitude fluctuation feature of each shot is then defined as
$$M_{\mathrm{Vibration}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} M_{\mathrm{magvar}}(t). \tag{18}$$
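A sketch of the sliding-window variance of Eq. (18); the handling of shots shorter than the window is our assumption.

```python
import numpy as np

def magnitude_fluctuation(m_rmag, window=10):
    """Eq. (18): mean local variance of the per-picture motion
    magnitude series m_rmag over a sliding window of length 10."""
    if len(m_rmag) < window:
        return 0.0                      # assumed behavior for short shots
    local_var = [np.var(m_rmag[t:t + window])
                 for t in range(len(m_rmag) - window + 1)]
    return float(np.mean(local_var))
```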

(4) Camera Motion
Camera movement in a slow-motion shot is slower than in similar live shots [11]. Hence, we compute the mean of each camera motion coefficient as the next features:
$$M_{\mathrm{GMP}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} \left| \mathrm{GMP}(t) \right|, \quad M_{\mathrm{GMT}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} \left| \mathrm{GMT}(t) \right|, \quad M_{\mathrm{GMZ}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} \left| \mathrm{GMZ}(t) - 1 \right|. \tag{19}$$
$M_{\mathrm{GMP}}(i)$, $M_{\mathrm{GMT}}(i)$, and $M_{\mathrm{GMZ}}(i)$ indicate the mean pan, tilt, and zoom factors of the shot, respectively.

(5) Camera Motion Difference
The speed of camera-motion changes can also be slower in slow-motion shots than in similar live shots [11]. Therefore, the mean change of the motion parameters over each shot is computed as
$$D_{\mathrm{GMP}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} \left| \mathrm{GMP}(t) - \mathrm{GMP}(t-1) \right|, \quad D_{\mathrm{GMT}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} \left| \mathrm{GMT}(t) - \mathrm{GMT}(t-1) \right|, \quad D_{\mathrm{GMZ}}(i) = \frac{1}{\mathrm{length}(i)} \sum_{t \in \mathrm{Shot}_i} \left| \mathrm{GMZ}(t) - \mathrm{GMZ}(t-1) \right|. \tag{20}$$
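Eqs. (19)-(20) can be computed in one helper over the per-picture pan/tilt/zoom series (numpy arrays assumed; the zoom mean is taken around the neutral factor 1, following our reading of Eq. (19)):

```python
import numpy as np

def camera_motion_features(gmp, gmt, gmz):
    """Per-shot camera-motion descriptors, Eqs. (19)-(20).
    gmp, gmt, gmz: 1-D arrays of pan, tilt, zoom per picture."""
    n = len(gmp)
    m_pan  = np.abs(gmp).mean()               # M_GMP, Eq. (19)
    m_tilt = np.abs(gmt).mean()               # M_GMT
    m_zoom = np.abs(gmz - 1.0).mean()         # M_GMZ: deviation from neutral zoom
    d_pan  = np.abs(np.diff(gmp)).sum() / n   # D_GMP, Eq. (20)
    d_tilt = np.abs(np.diff(gmt)).sum() / n   # D_GMT
    d_zoom = np.abs(np.diff(gmz)).sum() / n   # D_GMZ
    return m_pan, m_tilt, m_zoom, d_pan, d_tilt, d_zoom
```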

2.2.4. Semantic Features

Soccer-domain features that carry a special meaning for viewers are called semantic features. These features can also be used for slow-motion modeling in soccer videos.

(1) Field-Side View
Most slow-motion long shots contain shoot and offside events that occur near the goal area, so detection of field-side views can help detect slow-motion long shots. Kolekar [27] used a simple rule on the grass ratios of three regions of the picture for field-view detection. Similarly, we divide each picture into three regions as shown in Figure 2 and use the following rule for view classification of each frame during long shots:
IF $|G_{RUL}(t) - G_{RUR}(t)| > \mathrm{Th}_{\mathrm{FieldSide}}$ THEN $\mathrm{FieldSide}(t) = 1$ ELSE $\mathrm{FieldSide}(t) = 0$.
$G_{RUL}$ and $G_{RUR}$ denote the grass ratios in the left and right regions of the picture, respectively; we used $\mathrm{Th}_{\mathrm{FieldSide}} = 0.7$ in our experiments.
Then, we test for the existence of a field-side view in each long shot as
$$\mathrm{SideView}(i) = \begin{cases} 1 & \sum_{t \in \text{last 50 pictures of } \mathrm{Shot}_i} \mathrm{FieldSide}(t) > 0, \\ 0 & \text{otherwise}. \end{cases} \tag{21}$$
For medium and close-up shots, $\mathrm{SideView}(i)$ is set to zero.
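A sketch of the rule and Eq. (21), assuming the per-picture grass ratios of the left and right regions are already available:

```python
TH_FIELD_SIDE = 0.7

def field_side(g_left, g_right):
    """Per-picture rule: left/right grass ratios differ near a goal area."""
    return int(abs(g_left - g_right) > TH_FIELD_SIDE)

def side_view(grass_left, grass_right):
    """Eq. (21): flag a long shot whose last 50 pictures contain a
    field-side view; inputs hold one region grass ratio per picture."""
    flags = [field_side(a, b)
             for a, b in zip(grass_left[-50:], grass_right[-50:])]
    return int(sum(flags) > 0)
```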

2.3. Slow-Motion Detection

In this section, we model slow-motion shots for each shot class separately. A set of features is selected for each shot class based on the intuitive motivations given above. Since close-up and out-of-field shots do not show important events of the game, we do not consider them in our framework. Figure 3 shows the diagram of the proposed classification algorithm.

During the training phase, the following features of long shots are given to an SVM model: the shot-type features, the repeated-frame feature, the camera motion features, the camera motion difference features, and the field-side view feature. In the test phase, long shots lasting more than 16 seconds are considered live shots; shorter long shots are given to the SVM model for classification. An RBF kernel function is used in this SVM model.

For medium shots, the following features are given to another SVM model during the training phase: the shot-length feature, the shot-type features, the grass ratio difference feature, the difference of luminance standard deviation feature, the object ratio feature, the skipped-macroblock ratio feature, the reliable MV magnitude feature, and the motion magnitude fluctuation feature. In the test phase, all medium shots are given to the trained SVM model for classification. An RBF kernel function is used in this SVM model as well.
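A sketch of the two-classifier setup using scikit-learn; the feature scaling, default RBF hyperparameters, numeric encoding of shot types, and feature-vector dimensions are our assumptions, not details fixed by the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One RBF-kernel SVM per shot class, as in the framework above.
long_clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
medium_clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Hypothetical training data: one row per shot, columns matching the
# per-class feature lists (12 for long, 11 for medium shots);
# label 1 = slow motion, 0 = live.
X_long, y_long = np.random.rand(40, 12), np.random.randint(0, 2, 40)
X_med, y_med = np.random.rand(40, 11), np.random.randint(0, 2, 40)
long_clf.fit(X_long, y_long)
medium_clf.fit(X_med, y_med)

def classify_long_shot(features, length_sec):
    """Long shots over 16 s are treated as live; the rest go to the SVM."""
    if length_sec > 16:
        return 0
    return int(long_clf.predict([features])[0])
```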

3. Experimental Results

In this section, we evaluate the performance of the proposed method on our soccer video dataset (FUM-BSVD 2011) of six soccer games captured from the 2010 World Cup. Table 1 lists the six soccer games used in our experiments. The first three games are used as the training set, and the remaining games as the test set. All slow-motion shots in these videos were produced by high-speed cameras. Each video sequence is encoded to the MPEG-1 video format using the FFMPEG video library; important encoding parameters are summarized in Table 2.

Table 3 shows the experimental results of the proposed method on the training and test data, respectively. Our algorithm achieved a high accuracy of 95% precision and 91% recall on the training data; the training errors occur mainly in long shots. The proposed method achieved satisfactory results of 70% precision and 83% recall on the test data. Compared to [11, 12], our algorithm achieves higher accuracy by exploiting more features in slow-motion modeling.

The recall rate for long shots is low. Our experiments indicate that discriminating between slow-motion and live long shots is difficult for the following reasons:
(1) the movement of the camera and objects in some live shots is slow;
(2) slow-motion shots of high-motion scenes contain intensive object and camera movements;
(3) a replay shot may also be played at live speed.

To show the robustness of the proposed method against errors in the preprocessing stages, a hierarchical shot classification module with an overall accuracy of 92% was designed. The shot classes detected by this module are fed into the slow-motion detection module shown in Figure 3. As shown in Table 4, our algorithm achieved 90% precision and 81% recall on the training data using automatically detected shot classes. In this experiment, the proposed approach achieved 78% precision and 68% recall on the test data. Although such a small loss of accuracy is acceptable, it could be reduced further by exploiting more accurate shot classification methods.

Our proposed approach is more efficient than previous works because it operates on compressed video. Extraction of the color and motion features from compressed video accounts for the majority of the framework's processing time, while training and classification with the SVM classifiers are very fast. Table 5 shows the processing times of the time-consuming modules of the proposed framework. Color feature extraction and motion feature extraction can run in parallel via multithreading. With multithreading, the total processing time of the proposed method is 21 ms per frame (47 fps), nearly twice as fast as real time. When these modules run sequentially, the total processing time is 37.7 ms per frame (26 fps), which is still faster than real time. All video sequences in the FUM-BSVD dataset play at 25 fps.
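The parallel arrangement could look like the following sketch, with placeholder extraction functions standing in for the real pipelines:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_color_features(shot):    # placeholder for the color pipeline
    return [0.0, 0.0, 0.0]

def extract_motion_features(shot):   # placeholder for the motion pipeline
    return [0.0] * 6

# Color and motion extraction are independent, so they can overlap in time.
with ThreadPoolExecutor(max_workers=2) as pool:
    f_color = pool.submit(extract_color_features, "shot_001")
    f_motion = pool.submit(extract_motion_features, "shot_001")
    features = f_color.result() + f_motion.result()
```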

The performance of the proposed method on long shots could be improved by extracting better features from the video. Alternatively, segmentation and tracking of moving objects in soccer scenes could improve this method; however, this process is problematic in broadcast soccer scenes because both the camera and the players move.

4. Conclusion and Future Work

In this paper, we proposed an efficient method for slow-motion detection in compressed soccer videos. Exploiting a rich set of color, motion, and cinematic features gives the proposed method its high accuracy, while direct extraction of low-level information from the compressed domain significantly improves its efficiency. Our framework achieved 70% precision and 83% recall on test data. The framework could serve as a semantic feature extractor in soccer video summarization and retrieval tasks.