Abstract

Object tracking has progressed rapidly over the last two decades because of its importance in computer vision and surveillance systems. Research on such systems still faces several theoretical and technical problems that harm not only the accuracy of position measurements but also the continuity of tracking. In this paper, a novel strategy for tracking multiple objects using static cameras is introduced, which yields a cheap, easy-to-install, and robust tracking system. The proposed tracking strategy is based on scenes captured by a number of static video cameras. Each camera is attached to a workstation that analyzes its stream. All workstations are connected directly to the tracking server, which harmonizes the system, collects the data, and creates the output spatio-temporal database. Our contribution is twofold. The first part is a new methodology for transforming the image coordinates of an object to its real-world coordinates. The second is a flexible event-based object tracking strategy. The proposed tracking strategy has been tested on a CAD simulation of a soccer game environment. Preliminary experimental results show the robust performance of the proposed tracking strategy.

1. Introduction

Driven by advances in surveillance systems, object tracking has been an active research topic in the computer vision community over the last two decades, as it is an essential prerequisite for analyzing and understanding video data. However, tracking an object is, in general, a challenging problem. Difficulties in tracking objects may arise for several reasons, such as abrupt object motion, changing appearance patterns of the object and/or the scene, nonrigid object structures, partial or full object-to-object and object-to-scene occlusions, camera motion, loss of information caused by the projection of the 3D world onto a 2D image, image noise, complex object motion, complex object shapes, and real-time processing requirements. Moreover, tracking is usually performed in the context of higher-level applications, which in turn require the location and/or shape of the object in every captured frame. Accordingly, several assumptions should be made to constrain the tracking problem for a particular application.

A great deal of interest in the field of object tracking has been generated by (i) the recent evolution of high-speed computers, (ii) the availability of high-quality and inexpensive sensors (video cameras), and (iii) the increasing demand for automated real-time video analysis. Tracking an object means obtaining its spatio-temporal information by estimating its trajectory in the image plane as it moves around a scene, which in turn helps to study and predict its future behavior. There are three main steps in video analysis: detection of interesting moving objects, tracking those objects from frame to frame, and analysis of the recognized object tracks to detect their behavior. Examples of tracking include tracking rockets in a military defense system, customers in a market, players in a sports game, and cars in a street; tracking these objects helps in military defense, goods arrangement, sports strategy, and traffic control, respectively.

Basically, tracking can be done with one or multiple types of sensors [1]. Radar, sonar, infrared, laser radar, and video cameras are the commonly used tracking sensors. Moreover, based on the resolution of the input sensors and the preciseness of the analysis, a surveillance system may work in one of three scenarios [2]. The first is the high-preciseness scenario, which can be used in gait recognition to get the motion pattern of an object [3]. The second is the medium-preciseness scenario, which can be used in recognizing generic activities such as walking, running, or sliding [4, 5]. The third is the low-preciseness scenario, which is used to detect the presence of an object in a crowded environment. Figure 1 illustrates the resolution of the three scenarios.

Using color information may resolve some tracking situations. Various methods have been proposed to build the appearance model of the tracked objects, such as the polar representation method [6]. A similar method has been developed by slicing the object and taking the average color in each slice; these colors are stored as the appearance template of the tracked object. Figure 2 illustrates the appearance model of a soccer player.

Sometimes, it may be more beneficial to replace a high-preciseness sensor with multiple low-preciseness ones. Such a strategy has a great impact on cost minimization, as low-preciseness sensors are cheaper than high-preciseness ones [7–9]. It also improves the tracking preciseness by dividing the wide tracking area into several smaller divisions [10]. Moreover, this strategy guarantees reliable object recognition and solves the problem of object detection in cluttered environments. These benefits are gained because a multiple-sensor tracking system provides more information that can be used for further analysis and prediction. For instance, the 3D position of a soccer ball can be predicted using a pair of 2D camera video streams.

However, multiple-sensor tracking systems suffer from several hurdles, such as sensor registration and integration. Furthermore, they become very complicated if the design of the sensor network is noncentralized. In the centralized network scenario, the data stream from the machine behind each sensor (the workstation) to a dedicated server, and vice versa, so the data stream paths are clearer than those in noncentralized systems. Generally, centralized networks (server-workstation networks) are preferred over noncentralized (peer-to-peer) networks as they usually offer better stability, robustness, and organization.

Usually, in several tracking systems, the tracked objects are not only moving randomly (not as rigid bodies) but are also captured at low resolution. These restrictions result in several hurdles that are usually faced during a tracking system's operation. Such hurdles include (i) the automatic initialization and installation of new objects, (ii) detection of objects that enter the field-of-view (FOV) of each camera, (iii) detection of exiting objects, (iv) guaranteeing an object's continuous migration from one camera's FOV to another's, and, more importantly, (v) solving the merging/splitting situations of objects or groups of objects, which are the major challenge in a tracking system.

In this paper, a real-time tracking system is proposed for automatically tracking a set of moving objects using a number of video streams collected by a set of static cameras. Simple installation, portability, and low cost are the salient features of the proposed tracking system [11]. The system uses multiple sensors (static cameras), arranged in a centralized network, to track several moving objects. Such an infrastructure has many benefits, such as (i) maximizing the object tracking accuracy, (ii) minimizing not only the processing overhead but also the ambiguity in object occlusion, (iii) enhancing both the system's portability and installation flexibility, and (iv) increasing the capability of event recognition and video parsing [12–14].

The main contributions of this paper are (i) a novel image-to-world coordinate transformation that benefits the system's setup and tracking phases and improves the flexibility of tracking objects that travel between cameras' FOVs, (ii) a new concept that considers the moving object as a group of physical targets, which in turn solves and corrects many tracking problems, and (iii) a novel event-driven tracking algorithm for efficiently tracking multiple objects over a network of multiple cameras.

The rest of the paper is organized as follows. Section 2 gives a brief review of the previous efforts. Section 3 presents some main concepts of object tracking systems. Section 4 discusses the proposed system architecture, phases, and algorithms. The experiments and their results are described in Section 5. Finally, conclusions are illustrated in Section 6.

2. Previous Efforts

Over the last two decades, a great deal of work in the area of object tracking has been carried out. Both the structure and organization of the developed systems differ according to their purposes, the physical properties of the sensors used, and economic considerations. To the best of our knowledge, based on the underlying architecture, object tracking systems can be classified into three main categories. Two of them are built over a network of camera stations; the first depends on static cameras, while the second uses dynamic cameras capable of zooming and rotating. The third category depends on a wide-screen (panorama) image, which can be assembled from a number of static cameras on the same side of the tracking field. Through the rest of this section, the first category, which employs static cameras, will be discussed in detail because of its similarity to our system. Afterwards, the other two types will be briefly reviewed.

According to the first category, the tracking system consists of a set of workstations that cover the tracked environment using a set of static cameras. Kang et al. [6] used multiple stationary cameras to provide good coverage of a soccer field. The registration of camera viewports was performed using a homography. Accordingly, the system could integrate a 3D motion tracking component without explicit recovery of the 3D information. The 3D estimation used the projective space instead of the Euclidean one, as the system did not use camera calibration parameters. However, such a system posed several issues that need to be addressed, such as reducing the complexity and the delay generated by the propagation of uncertain hypotheses. Another system was introduced by Xu [12, 15], which used eight digital video cameras statically positioned around the stadium and calibrated to a common ground-plane coordinate system using Tsai's algorithm [16]. In this work, a two-stage processing architecture was used. The first stage was the extraction of information related to the objects observed by each camera (from the input image streams). Then, the data from each camera were fed to a central tracking module to update the estimated state of the tracked objects. However, several difficulties remain to be addressed regarding the system's reliability and accuracy, such as the movement of a group of occluded objects from one workstation's FOV to another.

Another architecture that uses stationary cameras was proposed by Iwase and Saito [17, 18]. The system uses multiple-view images to avoid the occlusion problem, so that robust object tracking can be obtained simply. Initially, an inner-camera operation is performed independently at each camera to track the players. When an object cannot be tracked, an inter-camera operation is performed as a second step. Tracking information from all cameras is integrated by employing the geometrical relationship between cameras through a homography. The inter-camera operation makes it possible to obtain the location of an object that is not detected in the input frame, such as an object that is completely occluded by other ones or that has moved outside the angle of view. However, in spite of its salient properties, the system suffered from the difficulty of matching object-scene positions between the cameras' input frames. Moreover, unclear data streams and complex manipulation were the main hurdles that harmed the performance of such a tracking system.

On the other hand, tracking architectures that use a network of dynamic cameras are sometimes adopted because of their low cost. For instance, tracking in a soccer game needs no further infrastructure cost when it relies on the dynamic TV broadcast cameras that are already installed. Hayet presented a modular architecture for tracking objects in sports broadcasts [19]. However, the system faced a lot of complexity, uncertainty, and inaccuracy. Furthermore, because of its complex calculations, the system suffered from long processing times, which made real-time processing impossible. In addition, its practical implementation frequently depended on manual feeds.

Finally, the use of a wide (panorama) image makes it possible to track objects using one machine, which avoids the difficulties of correlating data across multiple machines. Sullivan and Carlsson [20] used such an architecture to track objects in a soccer game. However, their system required specific, expensive, and calibrated sensors. Furthermore, the system consumed a long time to process such large video frames.

3. Background and Basic Concepts

The next subsections give a quick review of the methodology used to extract the tracked objects, as well as of how the Kalman filter has been employed in the literature.

3.1. Extracting Tracked Objects

A very fast approach to extracting the moving objects in a video stream is the background subtraction method [21], which compares each frame against a background model. Background subtraction delineates the foreground from the background in the images. Initially, it estimates the background from a certain number of early frames without any object inside. Subsequently, each following frame is compared to the estimated background, and the difference in pixel intensity is calculated. A drastic change in a pixel's intensity indicates that the pixel is a motion pixel. In other words, the pixels of the current frame that differ from their counterparts in the background model are the foreground pixels, which in turn may represent the interesting moving objects. The background subtraction method is illustrated in Figure 3.
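For illustration, the per-pixel comparison just described can be written in a few lines. The following Python sketch is our own illustration, not the paper's implementation; it assumes grayscale frames and a hypothetical intensity threshold TAU.

import numpy as np

TAU = 30  # assumed threshold on the absolute intensity difference

def foreground_mask(frame: np.ndarray, background: np.ndarray) -> np.ndarray:
    # Mark the pixels whose intensity differs strongly from the background model.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > TAU  # True where the pixel is considered a motion (foreground) pixel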

Definition 1. Background Model B. The background model B is an image that gives, at each spatial location (x, y), the color or luminosity B(x, y) of the static objects in the camera's FOV, such as the ground and walls. It can be updated regularly over time, depending on the technique used to create it.

Definition 2. Foreground Pixel. A foreground pixel is a pixel of the input frame It that differs from its equivalent in the background model B, at spatial location (x, y) and time t, in color or luminosity by more than a specified threshold T. So, for any frame It,

Ft(x, y) = 1 if |It(x, y) − B(x, y)| > T, and Ft(x, y) = 0 otherwise.

Two different techniques can be used to create the background model: recursive [22] and nonrecursive [23]. A nonrecursive technique stores a set of previous frames in a buffer and then builds the background model based on the variation of each pixel within those consecutive frames. The recursive technique, on the other hand, does not use a buffer: it builds the background model by recursively updating it with each input frame. The recursive technique requires less storage than the nonrecursive one; however, any error introduced by an input frame can have a lasting harmful effect on the model. After extracting the foreground pixels by a segmentation process (such as background subtraction), a filter such as the Laplacian of Gaussian (LoG) is applied. The LoG is used to reduce not only the noise in the foreground pixels but also the noise at the edges of the pixel groups.
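As an example of a recursive technique, the Python sketch below implements a common formulation of the approximated median filter (the paper cites [23] but does not give the exact update rule, so the step size of one gray level is an assumption): each background pixel is nudged one step toward the current frame, so the model slowly converges to the temporal median without storing any frame buffer.

import numpy as np

def update_background(background: np.ndarray, frame: np.ndarray) -> np.ndarray:
    # Recursive update: move each background pixel one gray level toward the frame.
    bg = background.astype(np.int16)
    fr = frame.astype(np.int16)
    bg = bg + (fr > bg) - (fr < bg)   # +1 where the frame is brighter, -1 where darker
    return np.clip(bg, 0, 255).astype(np.uint8)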

Next, the objects are created by connecting foreground pixels. For each object, a number of foreground pixels are clustered into a blob. These pixels are clustered by applying Otsu's method, which determines the threshold that decides whether each pixel belongs to an object or not. The blob is the source of the object's appearance in terms of its dimensions, its color signature, and even its position. An object may be represented by more than one blob. Finally, the object's dimensions and position are represented by a rectangle bounding its blobs. Figure 4 illustrates these steps in a simple manner.
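A minimal sketch of this blob-extraction step is given below, using OpenCV's Otsu thresholding and connected-component analysis; it returns one bounding box per blob, and the further grouping of nearby blobs into a single object (Definition 3) is left out. The minimum-area parameter is an assumed noise filter, not a value from the paper.

import cv2
import numpy as np

def extract_blob_boxes(diff_gray: np.ndarray, min_area: int = 50):
    # diff_gray: 8-bit grayscale difference image from background subtraction.
    # Threshold it with Otsu's method, then label the connected blobs.
    _, binary = cv2.threshold(diff_gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for i in range(1, n):            # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:         # drop tiny noise blobs
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes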

Definition 3. Tracked Object. A tracked object can be represented by a bounding rectangle that groups a number of the closest blobs. For the blobs included in pixels {(x, y)}, the position of the tracked object is the lower-center pixel of the bounding rectangle, as shown in the following formula:

(xo, yo) = ((xmin + xmax) / 2, ymax),

where xmin, xmax, ymin, and ymax are the extreme coordinates of the blobs' pixels.

The bounding rectangle of the tracked object has its width w and height h expressed by the following formula:

w = xmax − xmin, h = ymax − ymin.

The tracked object is always expressed by O = (xo, yo, w, h).

3.2. Tracking Objects Using Extended Kalman Filter

An object tracking system determines the location of an object (or objects) over time using a sequence of images (frames). The first idea for finding the location of an object in the next image is to search for it in a small rectangular area around its present location. However, this idea may fail if the object moves out of the search rectangle. Hence, the method must be modified to predict the object's location in the next image and search in a small rectangular area around the predicted location. To accomplish this, the Kalman filter can be employed successfully because of its ability to predict in noisy environments [24]. However, the main problem faced when using the Kalman filter in real-time tracking systems is that it requires a statistical model not only of the system output data but also of its measurement instruments. This problem can be avoided by using a recursive algorithm, as illustrated in Figure 5, which adjusts these unknown parameters, such as the measurement noise variance, after each time step based on the observed measurements.

The Kalman filter works as an estimator for the process state at some time and then obtains feedback in the form of noisy measurements. So, the equations of the Kalman filter are divided into two groups: prediction (time update) and correction (measurement update). The process solved by the Kalman filter is governed by a linear stochastic difference equation. In tracking systems, this process is nonlinear, so the Kalman filter must be modified to tackle this problem. This modification comes in a well-known form called the extended Kalman filter (EKF) [24].
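As a concrete illustration of the predict/correct cycle, the sketch below uses a constant-velocity motion model with position-only measurements; under this linear model the EKF reduces to the ordinary Kalman filter, so no Jacobians appear. The noise covariances are assumed values, not parameters from the paper.

import numpy as np

dt = 1.0                                  # assumed time step: one frame
F = np.array([[1, 0, dt, 0],              # state transition for x = [px, py, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],               # only the position is measured
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2                      # assumed process-noise covariance
R = np.eye(2) * 4.0                       # assumed measurement-noise covariance

def predict(x, P):
    # Time update: project the state and covariance one frame ahead.
    return F @ x, F @ P @ F.T + Q

def correct(x, P, z):
    # Measurement update: blend the prediction with the observed position z.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P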

4. The Proposed Strategy

This section illustrates the proposed tracking strategy in detail, as well as its main components.

4.1. Preliminary

The proposed tracking system architecture, as well as its main elements, is illustrated in Figure 6. The system consists of a flexible number of workstations, each of which analyzes the video stream coming from a static camera. Those cameras are the system sensors, which together cover the whole tracking field under consideration. The system is centralized at the server machine, which collects, integrates, and corrects the data coming from the workstations and then sends them to a media consumer such as a web server, a multimedia server, a TV show, or a data-mining server.

As illustrated in Figure 6, our proposed tracking system can be partitioned into four main layers: (i) the cameras layer, (ii) the workstations layer, (iii) the central server layer, and (iv) the data representation layer. The cameras layer contains a number of static cameras that deliver the input video streams. Each camera covers a specific region of the field under scope, which is called the camera's FOV. Furthermore, the cameras' FOVs must partially intersect, especially at exit/entry boundaries and preferably in crowded spots (such as the penalty area of a soccer pitch). Ideally, cameras with intersecting FOVs should be positioned so that their projective axes over the pitch are orthogonal, as shown in Figure 7, or nearly orthogonal. Such a construction solves most occlusion problems and can be used to get the 3D position of some objects.

The overall system's FOV encompasses the visible FOVs of all the cameras in the system. On the other hand, the system is not interested in any virtual FOV (areas not covered by cameras). Each camera is connected to a workstation, as illustrated in Figure 6. Each workstation reflects its camera's position in the system architecture and works as the local tracker for the video stream coming from that camera. Hence, it creates a local database that expresses the spatio-temporal data of the objects in its camera's FOV. Furthermore, all workstations are connected to an Ethernet LAN, which connects them to the tracking server.

In the server layer, the server machine works as the system's central point. It receives the spatio-temporal data from each workstation and then collates them to (i) guarantee data integration, (ii) correct falsely received data, and (iii) predict the most likely correct data. Hence, the data stream flows from the workstations to the server, and vice versa. Finally, the output spatio-temporal data of the system are stored in a database. This database can be expressed in a general markup language like XML so that it can easily be sent in real time over a low-bandwidth connection to other media, TV shows, mobile phones, or the internet. In addition, the complete database can be linked to the video streams and used offline as a multimedia database to analyze situations that happened in the tracked environment, as a kind of data mining.

Before tracking, each workstation's FOV must be defined with respect to the general system coordinates to avoid gaps and blind areas in the tracked environment. The workstation's FOV is defined as a polygon whose corners delimit the pitch area covered by the camera attached to that workstation. Choosing these corners manually is more accurate, less expensive, and more transparent in many situations. The corners are chosen on a single captured frame before tracking begins.

4.2. Workstation Operation

At each workstation, tracking the moving objects is performed using the methodology illustrated in Figure 8. Segmenting the static-camera video stream is handled by the background-subtraction method described in Section 3.1. The background model can be created manually and updated frequently by a recursive technique such as the approximated median filter [23]. After subtracting the background model from the input frame, the foreground pixels are often heavily fragmented, so a filter is applied to reduce the noise [25]. A morphological filter is then used to ensure that every group of pixels makes up a single object. The moving objects are created by quickly grouping foreground pixels into blobs. Then, each object is created by grouping the closest blobs using connected component analysis. Those blobs are surrounded by a bounding box, which represents the tracked object. Tracking each object in the scene is done by an EKF [26]. Furthermore, the EKF is employed to solve the occlusion problem that happens when two or more tracked objects merge. The prediction stage of the EKF model of each occluded object and the foreground position of the occluded group are combined to update the position of each object in the group [14].

After the occluded group splits, the workstation relies on three cues to match the input objects with the objects leaving the occluded group: (i) EKF prediction and correction, (ii) the color histogram of each object, and (iii) the feedback coming from the server.
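One simple way to realize the second cue (the color histogram of each object) is sketched below with OpenCV hue-saturation histograms; the function names and bin counts are our own illustrative choices, not the paper's implementation.

import cv2
import numpy as np

def color_template(patch_bgr: np.ndarray) -> np.ndarray:
    # Hue-saturation histogram used as a simple appearance template.
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist, 0, 1, cv2.NORM_MINMAX)
    return hist

def match_after_split(candidate_patches, stored_templates):
    # For each candidate object leaving the group, pick the stored template
    # with the highest histogram correlation.
    pairs = []
    for i, patch in enumerate(candidate_patches):
        h = color_template(patch)
        scores = [cv2.compareHist(h, t, cv2.HISTCMP_CORREL) for t in stored_templates]
        pairs.append((i, int(np.argmax(scores))))
    return pairs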

4.3. Server Operation

The server machine acts as the center of the system, as it receives all the data coming from the workstations connected to the cameras (as shown in Figure 6). However, the data stream does not flow only from the workstations to the server; the server may also send feedback to the workstations. This feedback may be sent when (i) the server corrects data about object tracks at a workstation, (ii) the server resolves uncertain data about occluded objects, or (iii) an object enters a workstation's FOV while exiting from another. Hence, the server performs a set of critical tasks: (i) collecting and synchronizing data from each workstation, (ii) associating the data about each object received from each workstation, and (iii) fusing all data to get the global position of each target in real-world coordinates. The server's operation is depicted in Figure 9.

Data synchronization is performed by the scheme presented in [19]. The data sent from the workstations are time-stamped. However, data coming from different workstations may arrive at the server at different times. Thus, the server waits for a short period (which should be less than the interval at which the workstations' cameras capture frames) before processing all incoming data stamped with the same time. Any data arriving after this period are discarded. Data coming to the server are processed by two modules: the data association and data fusion modules. Data association is the technique of correlating observations from multiple sensors about the same tracked object [13, 27]. Hence, the data association module gives a unique label to every tracked object. Many algorithms can be employed to build this module, such as the Nearest-Neighbor Standard Filter (NNSF) [28], Optimal Bayesian Filter (OBF) [29], Multiple Hypothesis Tracking (MHT) [30], Probabilistic Data Association (PDA) [31], Joint PDA (JPDA) [32], Fuzzy Data Association (FDA) [33], and Viterbi Data Association (VDA) [34]. Although PDA is recommended in cluttered environments, we used the JPDA algorithm because it extends PDA to track several objects simultaneously. Moreover, JPDA is considered the best compromise between performance and complexity. After every target is associated with itself in every sensor stream, the data from all streams must be fused to get the global position. Data fusion can be done by many models, such as the Intelligence Cycle, the Joint Directors of Laboratories (JDL) Model [35], the Waterfall Model [36], and the Omnibus Model [37]. The Omnibus Model is a recent model that capitalizes on the advantages of the older models and minimizes their limitations. Hence, the explicit structure of its components makes the Omnibus Model the preferable choice for our data fusion module.
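For readers who want a concrete starting point, the sketch below shows a gated nearest-neighbor association, which is a much simpler stand-in for the JPDA filter actually used here; the gate distance is an assumed value in metres.

import numpy as np

def nearest_neighbor_association(predicted, reported, gate=5.0):
    # predicted: (T, 2) array of track positions predicted by the server.
    # reported:  (D, 2) array of positions reported by one workstation.
    # Each report is greedily matched to the closest unused track within the gate.
    predicted = np.asarray(predicted, dtype=float)
    reported = np.asarray(reported, dtype=float)
    if len(predicted) == 0:
        return {}, list(range(len(reported)))
    assignments, used = {}, set()
    for d, obs in enumerate(reported):
        dists = np.linalg.norm(predicted - obs, axis=1)
        for t in np.argsort(dists):
            if dists[t] <= gate and int(t) not in used:
                assignments[d] = int(t)
                used.add(int(t))
                break
    unmatched = [d for d in range(len(reported)) if d not in assignments]
    return assignments, unmatched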

4.4. Image-to-World Coordinates Transformation

In any tracking system, transforming image coordinates to world coordinates is an essential and complicated step. This transformation should be calculated accurately to get the precise real position of each moving object. However, relying on the internal parameters of the camera and its position makes this transformation very complicated and expensive. Camera calibration is one of these complicated approaches, as proposed in [28]. However, such a method is based on calibrating the camera parameters using costly hardware and usually involves complicated calculations that may not be feasible in large environments. As we aim to establish a simple, cheap, real-time system, we avoid the complicated transformation methods. Our work focuses on tracking objects moving on flat planes such as pitches or parking areas; hence, we present a novel method for transforming the 2-dimensional ground coordinates in the image to real-world coordinates. Such a transformation is employed at each workstation. The proposed transformation method is based on the manual assignment of a predefined group of points whose real positions are known. Those points are arranged in perpendicular lines to form a grid, which is called the transformation grid (TG), as defined in Definition 4.

Definition 4. Transformation Grid (TG). A transformation grid (TG) is a set of points arranged in perpendicular rows and columns over the transformed plane. These points' coordinates are known in both real and image coordinates. For easy and explicit calculations, the distance between each point and any neighboring point is the same in all directions.

At the server, a TG-creator module creates the TG for each workstation. Initially, the TG-creator reads the real coordinates of the TG points, which are organized and measured manually. These data are presented in the form of an XML file such as the one shown in Algorithm 1.

"1.0" " -8"
 <scene>
  <point name=" " " " " "
  
      …………….
      …………….

The second step in the TG-creator is to use the graphical user interface (GUI) shown in Figure 10, which is employed to manually assign each point of the real model to its equivalent in the image. To accomplish this, the TG-creator plots the real model points, as illustrated in Figure 10(b), by reading the previous XML file. Then, the server's operator chooses an initial frame from any workstation; this frame is captured with only the TG points present. For instance, the human operator, as shown in Figure 10(a), chooses the initial frame of the middle workstation over the pitch and begins to assign points on both sides.

Hence, by simple clicks using the TG-creator interface, the operator can assign each point in the initial frame, on the left, to its equivalent in the real model, on the right. As soon as the manual assignment is finished, the last step of the TG-creator is to produce an XML file that collects the coordinates of each image point (xi, yi), obtained from the workstation's initial frame, together with its equivalent real coordinates (xr, yr). So, any point P in the TG is expressed as P = (xi, yi, xr, yr). Furthermore, TG points are grouped into TG-cells as defined in Definition 5. Points that belong to an incomplete cell in the workstation's initial frame are ignored.

Definition 5. TG-Cell. For any TG whose points are arranged by their real coordinates in rows r and columns c, the TG-cell is expressed as Celli = {P11, P21, P22, P12}, where Prc is the point at row r and column c of the cell. So, the TG-cell is a set of the 4 closest points, which form the four corners of a rectangle.

The XML file written by the TG-creator to express one workstation's TG is structured as in Algorithm 2.

" " -8"
" ">
   " ">
   " " " " " " " " " "
   " " " " " " " " " "
   <point " " " " " " " " "
   <point " " " " " " " " "
   cell>
        …………….
        …………….
workstation>

Transforming a point from its image coordinates to real coordinates begins with reading the previous XML file. Next, the TG-cell in which the transformed point lies is detected (or the nearest TG-cell if the point does not lie inside any cell). Once the TG-cell is detected, the transformation algorithm illustrated in Algorithm 3 can be started.
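A small helper for these two preparatory steps (reading the XML file and locating the TG-cell) is sketched below; the XML tag and attribute names follow the reconstructed listing in Algorithm 2 and are therefore assumptions, and the nearest-cell fallback simply picks the cell with the closest corner.

import xml.etree.ElementTree as ET
from matplotlib.path import Path

def load_cells(xml_path):
    # Each cell is a list of four corners (xi, yi, xr, yr) in the order P11, P21, P22, P12.
    cells = []
    for cell in ET.parse(xml_path).getroot().iter("cell"):
        corners = [(float(p.get("xi")), float(p.get("yi")),
                    float(p.get("xr")), float(p.get("yr"))) for p in cell.iter("point")]
        cells.append(corners)
    return cells

def find_cell(cells, xi, yi):
    # Return the cell whose image-space quadrilateral contains (xi, yi),
    # or the cell with the nearest corner if the point lies outside every cell.
    for corners in cells:
        if Path([(c[0], c[1]) for c in corners]).contains_point((xi, yi)):
            return corners
    return min(cells, key=lambda cs: min((c[0] - xi) ** 2 + (c[1] - yi) ** 2 for c in cs))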

Input data:
  The image coordinates of point P(xi, yi).
  The TG-cell in which point P lies: Celli = {P11, P21, P22, P12}.
  The real x-coordinate of point P11, called P11-xr.
Calculated parameters:
  Rn: the iterated value of R after n tries.
  R0: the initial value of R, which equals 0.5.
  Px1, Px2: calculated with equation (4) from the image
  coordinates of the Celli points.
  xth: the perpendicular distance in pixels from point
  P(xi, yi) to the line Px1-Px2.
  Pxr: the real x-coordinate of point P; it is the goal of
  this algorithm.
Algorithm:
  Start:
   n = 0; R0 = 0.5
   if xth for R0 <= 1 pixel then R = R0, and go to OUT
  Calculate Rn:
   n = n + 1
   Rn1 = Rn-1 + Rn-1 * 0.5   // for n = 1: Rn1 = 0.75
   Rn2 = Rn-1 * 0.5          // for n = 1: Rn2 = 0.25
   if xth for Rn1 > xth for Rn2 then
       Rn = Rn2, and xth = xth for Rn2
   else
       Rn = Rn1, and xth = xth for Rn1
   if xth <= 1 pixel then
       R = Rn, and go to OUT
   else
       go to Calculate Rn
OUT:
   Pxr = P11-xr + (P12-xr – P11-xr) * R

Referring to Figure 11, a point P is found in cell Celli = {P11, P21, P22, P12}, and it also lies on two lines, Px1-Px2 and Py1-Py2, each of which is parallel to two edges of the cell and perpendicular to the other two edges. Relying on the fact that cross-ratios remain invariant under perspective projection, and using an approximation of this ratio as in [19], the following pair of equations can be derived, in image coordinates:

Px1 = P11 + R * (P12 − P11),  Px2 = P21 + R * (P22 − P21).  (4)

Hence, using the previous pair of equations, the real x-coordinate of point P can be calculated by iterating R. Referring to Figure 12(a), the x-threshold xth is the perpendicular distance from point P to the line Px1-Px2. By changing R, the line Px1-Px2 moves and xth changes too. To reduce xth below one pixel, R is iterated by the algorithm illustrated in Algorithm 3.

In the same way, the real y-coordinate of point P can be calculated using the algorithm illustrated in Algorithm 3 by considering the y-coordinates rather than the x-coordinates.
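The sketch below puts the whole transformation together for one point inside one TG-cell. It assumes the interpolation equations reconstructed above as (4) and replaces the iteration of Algorithm 3 with a plain bisection on the signed point-to-line distance, which also drives the distance below one pixel; all names are ours, not the paper's.

import numpy as np

def signed_distance(p, a, b):
    # Signed perpendicular distance (in pixels) from point p to the line through a and b.
    p, a, b = (np.asarray(v, dtype=float) for v in (p, a, b))
    n = np.array([-(b - a)[1], (b - a)[0]])        # normal to the line direction
    return float(np.dot(p - a, n) / (np.linalg.norm(n) + 1e-12))

def image_to_world(p_img, cell_img, cell_real, tol=1.0, iters=30):
    # cell_img / cell_real: corners P11, P21, P22, P12 in image / real coordinates.
    P11, P21, P22, P12 = (np.asarray(c, dtype=float) for c in cell_img)
    R11, R21, R22, R12 = (np.asarray(c, dtype=float) for c in cell_real)

    def solve_ratio(line_at):
        # Bisection on R in [0, 1]: the interpolated line sweeps across the cell,
        # and we stop once it passes within `tol` pixels of the point.
        lo, hi = 0.0, 1.0
        d_lo = signed_distance(p_img, *line_at(lo))
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            d = signed_distance(p_img, *line_at(mid))
            if abs(d) <= tol:
                return mid
            if (d > 0) == (d_lo > 0):
                lo, d_lo = mid, d
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # R sweeps along x: Px1 on edge P11-P12, Px2 on edge P21-P22 (equation (4)).
    Rx = solve_ratio(lambda r: (P11 + r * (P12 - P11), P21 + r * (P22 - P21)))
    # The same procedure along y: Py1 on edge P11-P21, Py2 on edge P12-P22.
    Ry = solve_ratio(lambda r: (P11 + r * (P21 - P11), P12 + r * (P22 - P12)))

    xr = R11[0] + Rx * (R12[0] - R11[0])           # Pxr = P11-xr + (P12-xr - P11-xr) * R
    yr = R11[1] + Ry * (R21[1] - R11[1])           # analogous step for the y-coordinate
    return xr, yr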

As soon as the image coordinates can be transformed to the corresponding real coordinates, the workstation coverage modules can be started. After constructing the TG, the workstation coverage modules help in plotting the coverage polygon over the ground of each camera's FOV. This plotting is performed manually with a GUI software aid tool developed at the server. The server receives the initial frame of each workstation and its TG XML file, and then the server operator manually assigns the corner points of the coverage polygon. Figure 13 shows the GUI of the workstation coverage aid tool. It shows the initial frame with the corresponding coverage polygon of each workstation, as well as the coverage polygons of all workstations together. Our workstation coverage aid tool guarantees the complete coverage of the area under consideration as well as the overlap between cameras' FOVs. Hence, it helps the server operator to discover blind areas, if any exist.

4.5. Constraining the Environment and Tracked Objects

The main goal in any tracking system is to solve the problem of discontinuity of object paths (tracks) during the tracking process. This problem occurs for many reasons, such as (i) occlusion by objects or structures in the scene (disappearing partially or completely behind a static object), (ii) an object disappearing in blind areas, or (iii) objects exiting and reentering the scene. Basically, object occlusion falls into two main categories: (i) interobject occlusion and (ii) occlusion of objects by the scene. The former happens when a group of objects appears in the scene and some objects are blocked by others in the group. This poses a difficulty not only in identifying the objects in the group, but also in determining the exact position of each individual object. On the other hand, occlusion by the scene occurs when the object is not observed, or disappears for some time, because it is blocked by structures or other (static or dynamic) objects in the scene. Hence, the tracking system has to wait for the object's reappearance or conclude that the object has left the scene. To minimize the effect of occlusion, we assume an environment totally covered by several cameras; moreover, the ground is always visible since there are no static objects. The discontinuity of object tracks is then overcome by (i) initializing new objects, (ii) successfully solving merging/splitting situations, and (iii) precisely integrating the entering/exiting FOVs of the tracking workstations.

In most tracking systems, each physical target is represented as one corresponding object. However, this representation is not suitable for tracking multiple, continually interacting targets. The generic formation of the tracked objects in our system assumes that “the tracked object can be represented as a group of physical targets.” This assumption is formalized in Definition 6.

Definition 6. GOO Representation. A group-of-objects (GOO) representation maps the tracked object to a set of physical targets.

Considering Definition 6, the ambiguous tracking cases, especially merge/split cases, can be solved explicitly. Also, the problem of an object splitting after it has been registered in the system as a single object can be solved easily.

Figure 14 shows a situation in which three players are tracked in the first frame as one object. After some frames, one player splits off and becomes a single object, and the same happens for the second and third players. If the tracking module considers the three players at the beginning as a single object, it has to generate a new tracked object at each split, and the past trajectory of the split player is lost. However, if the tracked players are considered as a group, the past trajectory of a split object is the previous trajectory of the group from which it has disjoined. In the same way, the trajectory of a group is the future trajectory of its merged objects or groups. Thus, the GOO representation of the tracked objects is useful in merge and split situations.
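A minimal data structure that captures this idea is sketched below (names are ours): a group keeps the list of physical targets it contains together with one shared trajectory, and both merge and split hand the accumulated history on to the resulting groups.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TrackedGroup:
    # Group-of-objects (GOO): a tracked blob is a group of physical targets
    # that share one trajectory until the group splits.
    members: List[str]
    trajectory: List[Tuple[float, float]] = field(default_factory=list)

    def merge(self, other: "TrackedGroup") -> "TrackedGroup":
        # The merged group holds the union of members; its future trajectory
        # becomes the future trajectory of every member.
        return TrackedGroup(self.members + other.members, list(self.trajectory))

    def split(self, leaving: List[str]):
        # Both resulting groups keep a copy of the past trajectory, so the
        # split-off object does not lose its history.
        staying = [m for m in self.members if m not in leaving]
        return (TrackedGroup(staying, list(self.trajectory)),
                TrackedGroup(leaving, list(self.trajectory)))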

4.6. Object Tracking Strategy

In a traditional tracking system, the workstations are entrusted with the segmentation and object-creation stage and the object-path-creation stage, while the server is entrusted with the feedback-correction stage. We break the linearity of cascading the classic tracking stages by “driving the tracking decisions from events raised in those stages.” An early version of this strategy was introduced in [8, 9]. The strategy of using tracking events promotes the automation of the tracking system: it allows automatic object creation and avoids manual scenarios. Using tracking events also maximizes the stability of objects traveling between the system's cameras and enhances the continuity of object tracking in several scenarios such as merging and splitting. The considered events are listed in Table 1.

Each event is triggered by a collection of conditions that are generated by simple operations and calculations, for example, a change in the size of the object's bounding box generated by background subtraction, a change in the size and position of the object's bounding box generated by the EKF prediction, or a change in the object's appearance template presented earlier. The parameters used in the algorithm are briefly described in Table 2.

As illustrated in Algorithm 4, the proposed algorithm not only defines the events and their responses but also arranges them in descending order of priority. The priority of an event is the order in which it is tested with respect to the other events. Accordingly, if one event is triggered for an object, the other events are discarded.

Input data:
SSB: Set of N objects' Segmented Bounded Boxes
SSB = {SBB1, SBB2, ..., SBBN}
KBB: Set of M objects' EKF Bounded Boxes
KBB = {KBB1, KBB2, ..., KBBM}
AT: Set of M Appearance Templates
AT = {AT1, AT2, ..., ATM}
Algorithm
Defining Events for each Workstation:
 For each SBBn ϵ SSB, test E1, E2, and E3 in order
 E1:
   if SBBn.not_matched (one in Gk)
   Then
    Trigger R1
    go to OUT1
 E2:
   if SBBn.not_matched (any KBB) AND
      SBBn.matched (one in Gk)
   Then
     Trigger R2
     go to OUT1
 E3:
   if size (SBBn) ≪ matched size (KBBm) AND
    SBBn.not_matched (one in Gk)
   Then
    Trigger R3
    go to OUT1
 OUT1: Next SBBn
 For each KBBm ϵ KBB, test E4, E5, and E6 in order
 E4:
   if KBBm.Outside (FOV) AND
      KBBm.not_matched (any SBB)
   Then
     Trigger R4
     go to OUT2
 E5:
   if size (KBBm) < matched size (SBBn) AND
      KBBm.intersected_with (any KBBj)
   Then
     Trigger R5
     go to OUT2
 E6:
   if KBBm.matched (SBBn)
   Then
    Trigger R6
    go to OUT2
 OUT2: Next KBBm
Events Responses:
Responses are done over:
SRM (Server Response Module) and WRM (Workstation
Response Module)
 R1:
   Onew = WRM.register (SBBn)
   WRM.create_object_template (Onew)
   WRM.send_to_SRM (Onew)
   SRM.associate (Onew) to get its global position
   SRM.create_and_send_to_each_WRM (Gk)
 R2:
   Onew = WRM.register (SBBn)
   WRM.send_to_SRM (Onew)
   SRM.associate (Onew) to get its global position
   SRM.create_and_send_to_each_WRM (Gk)
 R3:
   For all SBBj ϵ SSB
     if not SBBj.matched_any_of (KBB) AND SBBj.near (SBBn)
     Then
       G = WRM.get_object_of (SBBn)
       if G.number_of_contained_objects == 1
       Then
         WRM.trigger_new_response (SBBj)
         WRM.build_history_of (SBBj)
       Else if G.number_of_contained_objects > 1
       Then
         SBBj.associate_to_objects_in (G)
   Next SBBj
   WRM.trigger_stable_response (SBBn)
 R4:
   WRM.stop_tracking (KBBm)
 R5:
   Gnew = WRM.register_group (KBBm + KBBj)
   WRM.announce_to_SRM (Gnew)
   SRM.associate (Gnew) to get its global position
   SRM.create_and_send_to_each_WRM (Gk)
   if WRM.search (Gnew, Gk) == True
   Then
     WRM.split (Gnew)
   Else
     WRM.give_EKF_position (Gnew)
 R6:
   WRM.track (SBBn); WRM.update_EKF (KBBm); WRM.update_template (ATm)
   WRM.send_to_SRM (SBBn)
   SRM.associate (SBBn) to get its global position
   SRM.create_and_send_to_each_WRM (Gk)
As illustrated in Algorithm 4, the algorithm works at each workstation using several inputs, such as SSB, which is produced from the current frame by the workstation's segmentation method. Another input is KBB, which was generated for the previous frame by the extended Kalman filter. The numbers of elements in the SSB and KBB sets may differ because the objects change from the previous frame to the current one. Gk is an input that comes from the server and contains the predicted locations of all objects in the tracked environment. AT is the set of appearance templates of the objects in the previous frame, which are updated in every frame. The tracking events are arranged by their priority in descending order, as indicated in Figure 15. The first three events (E1, E2, and E3) depend essentially on SSB; these events are placed in a loop that tests the properties of each SBBn. In the same manner, the second three events depend essentially on KBB. The responses of all events rely on two modules, one at each workstation and one at the server, namely the WRM and the SRM, respectively.

E1 is triggered whenever an element of SSB presents a new object. This happens when an SSB element, SBBn, does not correspond to any object already in the tracked environment, which is checked by comparing it with all elements of Gk. The response to this event is R1, which guarantees the creation of the new object at both the SRM and the WRM.

The second event, E2, is tested for SBBn only if E1 has not been triggered. E2 monitors an object's entrance into the workstation's FOV and is triggered when two conditions hold. The first condition is that the entering object does not exist in the previous frame. The second is that the entering object must already exist in the tracked environment (as the proposed tracking system requires), so it should be an element of Gk. The response to event E2 is R2, which guarantees the registration of the entering object at the workstation.

The third event (E3) detects an object splitting into two or more objects. It depends on a sudden change in the object's size. Another condition ensures that the split object was not previously detected in the system by another workstation. Hence, a split can be resolved at a workstation with the help of the server and the feeds of the other workstations. The response (R3) tries to identify the objects generated by the split. The split objects may be known or unknown: known objects are those that merged after tracking started, so their identities are known, whereas unknown objects merged before tracking started. Although the algorithm can solve the two cases separately in an efficient way, it may fail when an object splits into both known and unknown objects, when an object splits in complicated split-merge situations, or when an object splits in a crowded area. Finally, the newly unknown objects are registered at the tracking server and at all workstations by triggering R1.

The second set of events (E4, E5, and E6) depends on the KBB set. E4 is triggered when an object exits the workstation's FOV. Two conditions confirm that the object has exited the FOV, which guarantees the correctness of the EKF prediction. The response to event E4 is to stop tracking the exited object. Stopping tracking an object means deleting its KBB element and no longer building a Gk entry for it.

A merge situation between objects fires event E5. Triggering E5 depends on a sudden increase in an object's size together with the prediction of intersecting elements in KBB. The response, R5, registers the resulting merged object as a group of its constituent objects. The SRM then tries to split the merged objects using the knowledge that comes from the other workstations.

All the previous events are regarded as nonstable events, so objects that do not trigger any of them are said to be tracked in a stable way. The last event, E6, expresses this situation, in which an element of KBB matches one of SSB, as in the traditional way of tracking. The response to event E6 is to extend the object's track, update the EKF, and update the AT element of this object.

Once all possible events are handled correctly, the continuity of the objects' tracks can be maintained. Figure 16 explores some situations faced by the system, together with the corresponding triggered events, the problems solved, and the server's duties.
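To make the dispatch order concrete, the following Python sketch walks the two loops of Algorithm 4 and returns the event fired for each box. It is a deliberately simplified illustration: matching is reduced to an intersection-over-union (IoU) test, the FOV and Gk checks of E3–E5 are approximated, the responses R1–R6 are omitted, and both thresholds are assumed values.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x, y, w, h).
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return inter / float(aw * ah + bw * bh - inter) if inter else 0.0

def dispatch_events(ssb, kbb, gk_boxes, match_thr=0.3, shrink=0.5):
    # Returns (event_label, box) pairs in the priority order of Algorithm 4.
    events = []
    for s in ssb:                                      # E1-E3: segmented boxes
        if not any(iou(s, g) > match_thr for g in gk_boxes):
            events.append(("E1_new_object", s))
        elif not any(iou(s, k) > match_thr for k in kbb):
            events.append(("E2_entering_object", s))
        else:
            k = max(kbb, key=lambda b: iou(s, b))
            if s[2] * s[3] < shrink * k[2] * k[3]:
                events.append(("E3_split", s))
    for k in kbb:                                      # E4-E6: EKF-predicted boxes
        if not any(iou(k, s) > match_thr for s in ssb):
            events.append(("E4_exit", k))
        else:
            s = max(ssb, key=lambda b: iou(k, b))
            if k[2] * k[3] < shrink * s[2] * s[3] and \
               any(iou(k, other) > 0 for other in kbb if other is not k):
                events.append(("E5_merge", k))
            else:
                events.append(("E6_stable", k))
    return events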

5. Experiments and Results

In this section, three experiments are presented. The first two experiments determine the accuracy of the proposed image-to-world transformation method, while the third evaluates the proposed object tracking algorithm.

5.1. Experiment I: Measuring the Accuracy of Image-to-World Transformation Algorithm

In this experiment, the efficiency of the proposed image-to-world transformation method is measured on a three-dimensional CAD model. This model simulates a soccer pitch with a camera FOV that captures the center of the pitch. The frame size was set to 720 × 576 pixels. Each TG point is 8 meters away from its neighbors in the four directions. Five paths were constructed in the CAD simulation, each consisting of 40 points. Each point's real position is known in the CAD simulation, and its image position was fed manually. We chose these paths, as well as their point positions, to test the efficiency of the algorithm near and far, at the boundaries, and in the middle of the camera's FOV. Moreover, we did not take any path diagonal to the camera's FOV, which might yield misleading values of the calculated Root Mean Square Error (RMSE). The proposed transformation method is compared here with the fast projection method [17] using two projection planes. Each plane is constructed from its four corners, which are four noncollinear points. Transformation using each plane depends on (4), discussed previously. Figure 17 shows the CAD simulation with the constructed paths and planes.

For each path, the RMSE is calculated from each point's true real position (xr, yr) and the real position (x'r, y'r) obtained by transforming its image position (xi, yi), over the N = 40 points of the path, using the following formulas for the two real axes x and y:

RMSE_x = sqrt((1/N) * Σk (x'r,k − xr,k)^2),   RMSE_y = sqrt((1/N) * Σk (y'r,k − yr,k)^2).

The results for each path are summarized in Table 3. The transformation calculations for one point took less than 1 millisecond on an Intel i5 CPU; the calculation module was built on the .NET framework and ran under the Windows 7 operating system.

The results in Table 3 show the robustness of the proposed image-to-world transformation algorithm. The RMSE is between 0.28 m and 1.02 m for a TG with 8 m point spacing. These results correspond to an error between 3.5% (in the higher-resolution areas) and at most 12.7% (in the poor-resolution areas). Moreover, any transformed point remains in the equivalent real cell of the TG after transformation. In other words, the transformed point does not jump to another cell of the real grid, so the algorithm does not produce misleading results.

5.2. Experiment II: Extending the Image-to-World Transformation Algorithm

This experiment extends the previous one by assuming that each TG point is 8, 16, 24, 32, or 40 meters away from its neighbors in the four directions. The RMSE of the same five paths shown in Figure 17 is calculated, and the results are illustrated in Figure 18.

Figure 18 shows that the proposed transformation algorithm outperforms the fast projection method (compare with the results in Table 3). Moreover, the RMSE remains in an acceptable range even at large distances between the TG points.

5.3. Experiment III: Evaluating the Event-Tracking Algorithm

In this experiment, the proposed tracking algorithm is tested on a three-dimensional CAD simulation. The environment of this experiment is a soccer pitch covered by a network of five cameras. The pitch includes 25 moving objects, divided into two teams and three referees (with differently colored clothes). The simulation is 2 minutes long, at 25 frames per second, in MPEG format, with a frame size of 768 × 576 pixels. The workstations, as well as the server, use Intel i5 CPUs, and the modules were built on the .NET framework running under the Windows 7 operating system. The FOVs of the five static cameras used to cover the soccer field are shown in Figure 19.

The proposed tracking-events algorithm is evaluated by comparing the events generated automatically by the system against the events observed manually at the workstations and the server, expressed as a percentage. The event ratio is calculated using the following formula:

event ratio = (number of events detected automatically by the system / number of events observed manually) × 100%.

The evaluation results are detailed in Table 4.

The results reported in Table 4 show the success of the event-tracking algorithm in most cases. The results verify that the algorithm was able to handle the continuity of the objects' tracks well, which is the main goal of any tracking system.

As a proof of the system's ability to preserve the continuity of object tracks, we present plots of object tracks in two different ways. The first is a plot of two objects over the pitch, as shown in Figure 20. The events detected by the event-tracking algorithm are marked along each object's track. Moreover, the plot compares the real object track (path) with the one generated by the system, which confirms the good performance and robustness of the system. It is also noted that the object tracks (paths) are smoothed by the Kalman filter.

6. Conclusions

In this paper, a novel strategy was presented for constructing a tracking system in a crowded environment. The novelty of this strategy lies in the flexibility of the system architecture, a clear solution to the image transformation and system setup problems, a new model of the tracked objects, and an event-driven algorithm for tracking objects over a network of static cameras. The proposed tracking strategy offers a foundation for surveillance systems that require wide-area observation and tracking of objects over multiple cameras, for example, in airports, train stations, or shopping malls. It is not possible for a single camera to observe the complete area of interest, as the structures in the scene constrain the visible areas and the device resolution is limited; therefore, wide-area surveillance requires multiple cameras for object tracking. In experiments on a CAD simulation of a soccer game, the proposed tracking strategy performed well even in cluttered environments. Our future work is to extend the proposed tracking algorithm with additional events. Moreover, although the soccer game is a very rich tracking environment, the system should be tested in other environments such as parking areas, other sports grounds, and hypermarkets.