Abstract

Meniscal surgery is among the most common orthopedic procedures and deals with the treatment of meniscus tears. A meniscus tear is a common injury to the cartilage that stabilizes and cushions the human knee joint. Such tears can be classified into different categories based on age group, region, and occupation. A large number of sportsmen and heavy weightlifters, even in developed countries, are affected by meniscus injuries. These patients undergo arthroscopic surgery, during which the preservation of the meniscus is a crucial task. Current research reports a significant proportion of meniscal tear patients around the globe, and the affected population is considered to have risen strikingly, at a mean annual rate of 0.066%, in part due to surgery failure. To reduce this ratio, an innovative training mechanism based on a video retrieval system is proposed in this research. This work focuses on developing a corpus and a video retrieval system for meniscus surgery. Using the proposed system, surgeons can obtain guidance by watching videos of surgeries performed by experts and their seniors. The proposed system comprises four spatiotemporal modeling approaches intended to improve health care services: key point based retrieval, statistical modeling, PCA-scale-invariant feature transform (PCA-SIFT), and PCA-Gaussian mixture model (PCA-GMM), each combined with sparse-optical flow. A dataset of real meniscal surgery videos is used for testing and evaluation. The results show that the PCA-SIFT approach performs best, with an average precision of 0.78.

1. Introduction

A meniscus tear is a common injury to the cartilage that stabilizes and cushions the human knee joint. It is a common disorder across the globe, with a mean annual rate of 0.066%. It is noteworthy that a large number of meniscus injuries are observed, about 66 per 100,000 people around the globe. These incidences can be categorized under various aspects, that is, age group, region, and occupation. Most of the affected sportsmen and heavy weightlifters in developed countries are under 40 years of age and recover through meniscus surgery; these cases span North America, Latin America, Asia, and Africa. In Denmark, on the other hand, more degenerative changes in the annual incidence are observed: the incidence has increased from 164 to 312 over the past 10 years, and about 2 million cases of meniscus surgery treatment have been recorded. Details of registered patients, diagnoses, and procedures maintained by the Danish National Patient Register can be found in [1] and Summit (2016).

Previous studies have reported a very high rate of repeat surgery due to the failure of meniscus surgery. Different causes of failure have been observed. One major cause is inappropriate removal of the meniscus, because inexperienced surgeons do not know how much, or which part, has to be removed. Arthroscopic procedures rely on medical imaging, which deals with the interaction of tissue and radiation and may be represented in the form of static images, videos, or recordings. These recordings can be used to train surgeons and practitioners to perform successful surgeries. One way to access them is through a video retrieval system, which aims to retrieve videos from a multimedia repository. Upon a surgeon's request, a video retrieval system can provide the most relevant and accurate videos of meniscal surgery. There have been several studies in relation to surgery of, for example, the eyes, lungs, heart, and kidneys, but meniscal surgery holds a unique position: it is common in most countries, yet neither a corpus nor a video retrieval system specifically for meniscal surgery has been found in the literature [2].

To retrieve retinal surgery videos using compressed video streams, the required motion information is extracted at regular intervals. Precisely, two types of motion information are extracted: region trajectories and residual errors. A heterogeneous feature vector combines the motion information, and sequence segmentation is utilized to extract it. The Kalman filter and a generalized Gaussian distribution are used to model the region trajectory and residual information, respectively. In the last step, an extension of fast dynamic time warping is used together with the feature vector to compare videos [3].

Another approach to retrieving medical videos first extracts key frames and their global features, which are saved as objects in XML format to reduce storage space. Furthermore, a neural network and naïve Bayes are used to enhance performance. Image features are extracted using entropy, Huffman coding, and fast Fourier transforms [4].

In order to retrieve laparoscopy videos from a colossal repository, a feature signature composed of abstract and local details of the image is utilized. Initially, the position, colour, and texture features of an image are considered for this signature. A weight is assigned to each feature, describing its potential to serve as a signature representative. To compare two signatures, the earth mover's distance, the signature quadratic form distance, and the signature matching distance are used in combination [5]. To retrieve endoscopy videos from a repository, a feature signature is computed based on global features, that is, colour, position, and texture. A weight is assigned to each feature, and zero-weighted features are considered irrelevant for the signature; features with nonzero weights are denoted as signature representatives. Furthermore, the cluster centroids of these features are used as feature representatives. To measure the similarity between endoscopic images, the signature matching distance and adaptive-binning feature signatures based on visual characteristics are used [6].

In another approach to retrieving endoscopic videos, local, global, and feature fusion techniques are employed. To extract features, the CEDD, SURF, and SIMPLE algorithms are used; early and late fusion approaches are also considered. These approaches are tested on 1276 video clips. The LIRE software library, which can extract 20 visual features from an image, is used to index the test frames [7].

To retrieve cataract surgery videos from a repository, a video is divided into a number of overlapping clips, each containing the same number of frames. Various features are extracted from these frames using the histogram of oriented gradients (HOG), BoVW, and motion histograms. These features are statistically modeled using Bayesian networks, CRFs, and HMMs. Thirty cataract surgery videos with a resolution of 720 × 576 pixels were used for testing, and a mean area under the ROC curve of 0.61 was achieved in recognizing surgical steps from the video stream [8]. In order to retrieve and analyse hysteroscopy videos of the female reproductive system, a video summarization framework has been proposed. To compute the feature set, colour information is extracted from key frames using the COC colour model, together with other salient features such as texture, motion, curvature, and contrast [9].

Another approach to retrieving cataract videos formulates a feature set using global features such as the colour, texture, and motion information of each frame, and the wavelet transform is performed on the extracted colour channels; this study considers cataract and epiretinal surgeries [10]. To retrieve similar videos in the domain of retinal surgery, a motion-based approach in the literature uses MPEG-4 to extract features from the frame. For video encoding, groups of pictures and macroblocks are used, with three types of frames considered. Motion histograms, classification, and motion trajectories formulate the motion feature set. Furthermore, signatures are composed, and fast dynamic time warping is used to measure their similarity. This approach has been applied to epiretinal membrane surgery.

In order to control the failure rate of surgery and to overcome the drawbacks mentioned above, there is a dire need for a sophisticated approach. This can be achieved by forging a connection between information technology and the medical sciences. Information technology is capable of supporting medical diagnosis, clinical analysis, and treatment planning through a practice known as medical imaging [11, 2, 12].

Our proposed system firstly focuses on spatial features, that is, the location of objects in reference to static objects [13]. Secondly, it deals with the temporal information of an image, which captures moving and operational objects [14]. The system comprises four methods: interest point based, statistical modeling, PCA-SIFT, and PCA-GMM based video retrieval. The first method extracts the interest points of videos using state-of-the-art feature description and detection algorithms, namely SIFT, HOG, and the covariant feature detector (CFD) with the distance transform (DT), addressed by Bai et al. [15] and Bharathi et al. [16]. The second method statistically models the features using a GMM and the Kullback-Leibler divergence. In the third and fourth methods, a dimension reduction technique is integrated with SIFT and GMM, respectively. To extract temporal features, Lucas-Kanade sparse-optical flow is incorporated in each method [17]. This proposed system can help new and less-experienced surgeons obtain appropriate information about surgical methods, procedures, and tools to perform a successful surgery [11, 2, 12]. These meniscal recordings can additionally be used to brief patients and support their follow-up sessions [9].

Table 1 presents a summarized comparison of previous video retrieval approaches. It is clear from this comparison that many systems exist for medical video retrieval, but no video retrieval system for meniscal surgery practitioners has been found, in spite of the need for one around the globe. In our system, we use a hybrid of state-of-the-art techniques for feature detection, recognition, and comparison. The most important conclusion from this summary is that the proposed work is a step towards a new research direction.

2. Materials and Methods

This section sheds light on the proposed methodology for meniscal surgery video retrieval. A surgical video is considered to be composed of frames which are chronologically bound together to tell a story, that is, the live surgical process. Each frame captures an instant of surgery, forming a coherent association with its contiguous frames. Algorithms in this framework are applied to frames rather than to whole videos, because the frame is the building block of a video. Therefore, the videos are decomposed into frames, and shot boundaries are detected using the XOR operation. As a result, a collection of master frames depicting the master shots is extracted. These master frames are further utilized to model the spatiotemporal features of a particular video; a sketch of this decomposition step is given below, and details of the spatiotemporal feature extraction and modeling techniques follow.
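
The following minimal sketch illustrates the decomposition under stated assumptions: frames are binarized, XORed with their predecessor, and a frame is kept as a master frame when the fraction of differing pixels exceeds a threshold. The binarization level and threshold value are our assumptions, not settings from the paper.

```python
import cv2
import numpy as np

def extract_master_frames(video_path, threshold=0.25):
    """XOR-based shot boundary detection (illustrative sketch).

    A frame that differs from its predecessor in more than `threshold`
    of its binarized pixels starts a new shot; the first frame of each
    shot is kept as the master frame. `threshold` is an assumed value.
    """
    cap = cv2.VideoCapture(video_path)
    master_frames, prev_bits = [], None
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, bits = cv2.threshold(gray, 127, 1, cv2.THRESH_BINARY)
        # fraction of pixels flipped between consecutive binarized frames
        if prev_bits is None or np.bitwise_xor(bits, prev_bits).mean() > threshold:
            master_frames.append((index, frame))
        prev_bits = bits
        index += 1
    cap.release()
    return master_frames
```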

2.1. Spatiotemporal Modeling Using Interest Points

The interest point represents the identification of interesting or key points within an image that are used for further processing. An interest point can be described as follows:
(i) It has a well-defined location and a mathematical description within the image.
(ii) The local image structure around it is rich in local information content.
(iii) It is stable under both local and global perturbations in the image domain.

2.2. SIFT and CFD-Based Approach

SIFT detects interest points across various scales. With the help of Gaussian filters, the difference between successive Gaussian-blurred images is computed to extract interest points [17]. A difference-of-Gaussian (DoG) image of an original image I is given by

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ),

where L(x, y, σ) = G(x, y, σ) ∗ I(x, y) describes the convolution of the original image with a Gaussian blur at scale σ [19].
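
As a hedged illustration of this step, the snippet below computes an explicit DoG image for one scale pair and then extracts SIFT keypoints with OpenCV; the file name, σ = 1.6, and k = √2 are assumed example values.

```python
import cv2

img = cv2.imread('master_frame.png', cv2.IMREAD_GRAYSCALE)  # placeholder path

# Explicit DoG at one scale, mirroring D = L(k*sigma) - L(sigma);
# float conversion avoids clipping negative DoG responses
sigma, k = 1.6, 2 ** 0.5
L_small = cv2.GaussianBlur(img, (0, 0), sigma).astype(float)
L_large = cv2.GaussianBlur(img, (0, 0), k * sigma).astype(float)
dog = L_large - L_small

# OpenCV's SIFT runs this detection over a full scale pyramid and
# returns keypoints plus a 128-dimensional descriptor per keypoint
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
```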

CFD uses an affine adaptation to approximate the shape of an affine region in order to compute covariant features of an image. With the help of affine adaptation, interest points can be calculated from the matrix of second derivatives [10]:

H(x, σ) = [L_xx(x, σ), L_xy(x, σ); L_xy(x, σ), L_yy(x, σ)],

where L_xx is the second partial derivative in the x direction and L_xy is the mixed second partial derivative in the x and y directions. The computed features of the input and target images are matched using the Euclidean distance to obtain a similarity value [20]. The calculated distances are sorted to produce a ranked list based on the similarity of the input video to the target videos.
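
A minimal sketch of this matching-and-ranking step, under the assumption that each video is summarized by a matrix of descriptors from its master frames:

```python
import numpy as np

def rank_targets(query_desc, target_desc_list):
    """Rank target videos by mean nearest-neighbour Euclidean distance.

    `query_desc` is an (N, d) descriptor matrix of the input video;
    `target_desc_list` holds one such matrix per target video.
    Smaller mean distance means greater similarity (assumed scoring).
    """
    scores = []
    for idx, target in enumerate(target_desc_list):
        # all pairwise Euclidean distances between query and target descriptors
        dists = np.linalg.norm(query_desc[:, None, :] - target[None, :, :], axis=2)
        scores.append((idx, dists.min(axis=1).mean()))
    return sorted(scores, key=lambda pair: pair[1])  # ranked list, best first
```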

2.3. HOG and Distance Transformation (DT)

The HOG feature descriptor uses the distribution of edge directions to describe local object appearance and shape within an image.

Each image is split into small cells, and a histogram of gradient directions is computed for each cell.

In order to improve accuracy, the local histograms are contrast-normalized by measuring the intensity across a larger image block. The Dalal-Triggs scheme can be used for normalization [3]:

f = v / √(‖v‖₂² + e²),

where v is the unnormalized vector containing all histograms in a given block, ‖v‖_k is its k-norm for k = 1, 2, and e is a small constant [21].

The DT descriptor [3], also called a distance map, labels each pixel with its distance to the nearest edge, computed with the support of Canny's edge detector. The framework of the key point based approach is presented in Figure 1; a sketch of the HOG and DT computations follows.
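
The following hedged sketch computes both descriptors with OpenCV; the window size, Canny thresholds, and distance mask are assumed defaults rather than settings from the paper.

```python
import cv2

img = cv2.imread('master_frame.png', cv2.IMREAD_GRAYSCALE)  # placeholder path

# HOG: cell-wise gradient-orientation histograms with block normalization,
# using OpenCV's default Dalal-Triggs parameters (64 x 128 window)
hog = cv2.HOGDescriptor()
hog_vector = hog.compute(cv2.resize(img, (64, 128)))

# Distance transform over Canny edges: each pixel is labeled with its
# distance to the nearest edge, yielding the "distance map" descriptor
edges = cv2.Canny(img, 100, 200)
distance_map = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)
```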

2.4. Statistics-Based Framework for Video Retrieval

A statistics-based technique is an approach to scrutinize, condense, and interpret data. For this purpose, we have used a GMM with the Kullback-Leibler divergence to model the spatiotemporal content of videos. A description of the method follows.

2.5. Spatiotemporal Modeling through GMM

Surgical footage is an object containing highly complex, multidimensional content based on time-series data. Assume that a video is a sequence of frames of length T, V = (f_1, f_2, ..., f_T), with every frame f_t, t = 1, ..., T, providing one step of the sequential pattern.

Each element f_t describes a video frame, and its spatial features are modeled in d-dimensional space; that is, f_t presents the spatial features of frame t in a d-dimensional feature space. The spatial features of a frame combine locality with feature magnitudes, for example the red, green, and blue colour values of every pixel, so that salient objects may be easily traced. In order to model the RGB colour distribution, each pixel can be stated together with its position, for example as a point (x, y, r, g, b), as shown in Figure 2.

A spatiotemporal probabilistic model can be constructed through the statistical observation of samples in both time and space. The video is sampled at a presumed rate, preserving the temporal dependency between adjoining images. The spatial features of every image depend on its own spatial context and may also depend on the preceding image; this is known as the spatiotemporal association.

Furthermore, a GMM [22] is applied to approximate the likelihood density of the collected items. The GMM is an unsupervised learning approach in which a probabilistic model is built under the assumption that the data points are produced by a mixture of a finite number of Gaussian distributions with unknown parameters.

Moreover, the d-dimensional spatial elements of an image may be modeled through a GMM with K components, where the presumed number K reflects the structure of the frame, for instance the number of core objects found in a particular section. Using maximum likelihood estimation (MLE), this association can be computed through the probability of spatiotemporal continuity between two successive images, as presented in Figure 3.

We obtained Gaussians for the input and target videos after applying maximum likelihood estimation, as shown in Figures 4 and 5. By analyzing the data distributions of these Gaussians, we can determine how dissimilar the input and output are. To measure the dissimilarity ratio, the Kullback-Leibler divergence, also known as relative entropy, was used.
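
As a hedged sketch of this step, the snippet below fits a GMM to per-pixel (x, y, r, g, b) features and estimates the Kullback-Leibler divergence between two fitted mixtures by Monte Carlo sampling. Since GMM-to-GMM KL divergence has no closed form, the sampling approximation, the component count, and the sample size are our assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_frame_gmm(pixel_features, n_components=5, seed=0):
    """Fit a GMM to an (N, d) matrix of per-pixel features,
    e.g. rows of (x, y, r, g, b); n_components is an assumed setting."""
    return GaussianMixture(n_components=n_components,
                           random_state=seed).fit(pixel_features)

def kl_divergence_mc(gmm_p, gmm_q, n_samples=10_000):
    """Monte Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)]
    between two fitted GMMs (no closed form exists for mixtures)."""
    samples, _ = gmm_p.sample(n_samples)
    return np.mean(gmm_p.score_samples(samples) - gmm_q.score_samples(samples))
```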

2.6. Compression-Based Video Retrieval

In order to reduce insignificance and redundancy, data compression techniques can be used. The intention of compression is to cut down the data dimensions for effective storage, retrieval, and manipulation. For this purpose, principal component analysis (PCA) is used to extract the significant components of the data that best describe the entire image. Moreover, a hybrid approach is applied by combining PCA with SIFT and with GMM to improve the precision of the retrieval method. Details of the methods are as follows.

2.6.1. Summary of PCA

Principal component analysis is a dimensionality reduction method which calculates the linear projections that preserve the maximum variance in the original data. It computes the eigenvectors corresponding to the largest eigenvalues of the d × d covariance matrix of the data X ∈ R^(n × d), where d defines the dimensionality of the original data and n is the total number of points.

The points are visualized via the projection

Y = X U_k,

where U_k is the matrix whose columns are the eigenvectors associated with the 2 or 3 biggest eigenvalues, and Y is the calculated lower-dimensional sketch of X. PCA is the most commonly used method to visualize new data. Projections produced by PCA are easy to interpret, and the method works moderately well even when the variance in the data is primarily concentrated in a few directions [23].
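
A minimal sketch of this projection, assuming X is a matrix of 128-dimensional feature vectors (the values below are random stand-in data):

```python
import numpy as np
from sklearn.decomposition import PCA

# X: (n, d) data matrix, e.g. n feature vectors of dimension d = 128
X = np.random.default_rng(0).normal(size=(500, 128))  # stand-in data

pca = PCA(n_components=3)       # keep the 3 largest-eigenvalue directions
Y = pca.fit_transform(X)        # Y = X @ U_k, the (n, 3) sketch of X
print(pca.explained_variance_ratio_)  # variance preserved per component
```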

2.6.2. PCA-SIFT-Based Video Retrieval

In this method, PCA is applied to SIFT, making the representation highly robust to image distortion. PCA-SIFT first computes an eigenspace to describe the gradient patches of an image. In the second step, the local image gradient is calculated for a given patch. In the third step, the gradient image vector is projected through the eigenspace to derive a compact feature vector. That feature vector is significantly smaller than the standard SIFT feature vector and may be used with the same matching method: the Euclidean distance between two feature vectors can be used to decide whether they correspond to the same interest point in different frames. The following steps retrieve videos using the PCA-SIFT approach (a sketch follows the list):
(1) The input shot is transformed into images.
(2) The master frames are extracted using shot boundary detection.
(3) For each master frame, PCA-SIFT features are calculated.
(4) Steps 1-3 are repeated for target videos 1 to N.
(5) After step 4, two sets of feature vectors are obtained, one for the input video and one for the target videos.
(6) The feature vectors of the input and target videos are compared using the Euclidean distance to compute their similarity.
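
A hedged sketch of steps 1-3: standard SIFT descriptors are extracted from the master frames and projected onto a PCA eigenspace. Note that the original PCA-SIFT projects raw gradient patches rather than 128-dimensional SIFT descriptors, and n_components = 36 is an assumed setting.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def pca_sift_features(master_frames, n_components=36):
    """Compute compact PCA-projected SIFT descriptors for a video,
    given its list of master frames (BGR images)."""
    sift = cv2.SIFT_create()
    descriptors = []
    for frame in master_frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            descriptors.append(desc)
    stacked = np.vstack(descriptors)              # (N, 128) descriptor matrix
    return PCA(n_components=n_components).fit_transform(stacked)
```

The resulting vectors for the input and target videos can then be compared with the Euclidean distance, for example using the ranking sketch from Section 2.2.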

2.6.3. PCA-GMM-Based Video Retrieval

The analyses of Figures 4 and 5 show that the spatial context comprises location and magnitude features for each pixel, such as the RGB colour channels. These colour channels are extracted from the input and target videos and passed through a dimension reduction process using PCA. The compressed channels are then modeled using a GMM for spatiotemporal analysis. Furthermore, the Kullback-Leibler divergence is applied in the same way as presented in Section 2.5; a minimal sketch follows.
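
A minimal sketch of this pipeline under stated assumptions (the PCA and GMM component counts are illustrative, not values from the paper):

```python
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def pca_gmm_model(pixel_features, n_pca=3, n_gmm=5):
    """Compress per-pixel (x, y, r, g, b) features with PCA, then model
    the compressed channels with a GMM; component counts are assumed."""
    compressed = PCA(n_components=n_pca).fit_transform(pixel_features)
    return GaussianMixture(n_components=n_gmm, random_state=0).fit(compressed)

# The dissimilarity between the input and target models is then measured
# with the Kullback-Leibler divergence, e.g. via the Monte Carlo estimate
# sketched in Section 2.5.
```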

2.6.4. Temporal Points Extraction from Surgical Videos by Sparse-Optical Flow

A surgical video is a dynamic entity in which the motion of objects plays a vital role in understanding the ongoing surgical process. To capture the motion of the surgical procedure, we used the optical flow method, that is, the pattern of apparent motion of objects in the photographed scene induced by the relative movement between the scene and the eye, that is, the camera. The Lucas-Kanade method was used to capture the motion of objects within a video of live surgery [17].
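
A hedged sketch of this step with OpenCV: Shi-Tomasi corners seed the Lucas-Kanade tracker between two consecutive frames. The corner and tracker parameters are assumed defaults, not values from the paper.

```python
import cv2

def sparse_flow(prev_frame, next_frame, max_corners=200):
    """Lucas-Kanade sparse-optical flow between two consecutive frames;
    returns (start, end) point pairs for successfully tracked points."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                 qualityLevel=0.01, minDistance=7)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    tracked = status.ravel() == 1
    return p0[tracked].reshape(-1, 2), p1[tracked].reshape(-1, 2)
```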

3. Results and Discussion

In this section, we present the experimental settings, including the dataset and evaluation method.

3.1. Dataset

The dataset comprises 20 real meniscus surgery videos collected from various publicly available online repositories: MedlinePlus [24], AAOS [25], vjortho (2017), and Northgate (2017). These shots are sufficient and beneficial in the current scenario because they cover a wide range of scenarios and uniquely identify a number of instances. The dataset also provides a high degree of relevance between each object and its surroundings and properly differentiates objects from nonadjacent ones. Furthermore, it is clear from Table 1 that the state-of-the-art systems have likewise used small testing datasets. For the evaluation of the proposed system, we asked three surgeons to choose the ground truth from our dataset. The ground truth is selected at the video level; a relevant approach and mechanism can be found in [9]. Sample frames from meniscal surgery are shown in Figures 6–8.

3.2. Tool

To proceed with the evaluation phase, an input video is provided to our proposed system. Its spatiotemporal features are computed against the target surgical videos 1 to N, and the most relevant videos are displayed on the right side of the interface, as shown in Figure 9.

3.3. Evaluation Measurement

For the purpose of system evaluation, there are many possible measures, such as precision, recall, and ROC area. In information retrieval systems, the precision metric is typically used because it provides the proportion of retrieved videos that are actually relevant. In order to measure the precision of our system, we split each video into small segments, and these segments are compared with the ground truth. Precision is computed using the following metric [26, 27]:

Precision = TP / (TP + FP).

Herein, a video is counted as a true positive (TP) if it is selected by both our system and the surgeon, and as a false positive (FP) if it is selected by our system but not by the surgeon. Using this scheme, we measured the average precision of our system for each of the four proposed methods, as shown in Table 2; a small sketch of this computation is given below. As already mentioned, no video retrieval system specifically for meniscus surgery has been found in the literature; we therefore considered a number of general video retrieval parameters for result comparison. A brief comparison of medical video retrieval systems and our proposed system is shown in Table 2.
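
An illustrative sketch of the average precision computation (the per-query TP/FP counts below are placeholders, not our measured results):

```python
def average_precision(tp_fp_counts):
    """Average precision over queries, where each entry is a (TP, FP)
    pair obtained by comparing retrieved segments with the surgeons'
    ground truth; precision per query is TP / (TP + FP)."""
    precisions = [tp / (tp + fp) for tp, fp in tp_fp_counts if tp + fp > 0]
    return sum(precisions) / len(precisions)

# Example with placeholder counts for three queries:
print(average_precision([(7, 3), (8, 2), (9, 1)]))  # -> 0.8
```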

We can deduce from Table 2 that PCA-SIFT performs best among our proposed methods. A key factor in this dominant performance is that the gradient patches around SIFT points are highly structured and therefore well represented by PCA: the eigenvalues for such patches decay considerably faster than the eigenvalues for arbitrarily selected gradient patches. Because the patches around SIFT points all share certain characteristics arising from the SIFT detection stage, they are precisely localized, rotated to align their dominant gradient with the vertical, and correctly scaled. This simplifies the work that PCA must do and enables PCA-SIFT to accurately characterize each patch with a small number of dimensions, in combination with the Lucas-Kanade optical flow method.

4. Conclusion

Our proposed video retrieval system for meniscal surgery is based on four methods: key point based retrieval, statistical retrieval, PCA-SIFT-based retrieval, and PCA-GMM-based retrieval. All of them exploit spatiotemporal features for the improvement of health services.

In the key point based retrieval method, the key points of a video are extracted using SIFT, HOG, CFD, and DT. Secondly, in the statistical retrieval method, spatial and temporal attributes are statistically modeled using a GMM and the Kullback-Leibler divergence. Moreover, the PCA-SIFT-based and PCA-GMM-based retrieval methods integrate a dimension reduction technique with SIFT and GMM, respectively. To track spatial development over temporal intervals, the Lucas-Kanade sparse-optical flow method was used, which is efficient and robust to noise.

For the experimental work, we used 20 videos of meniscus surgery, and our evaluation shows that the results obtained using PCA-SIFT are the most significant, with an average precision of 0.78.

Our proposed system will help in the treatment of diverse patients, and surgeons in particular can obtain guidance by watching videos of surgeries performed by experts. In the future, additional datasets of real meniscal surgeries will be explored for testing and evaluation.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW supervised by the IITP (Institute for Information & Communications Technology Promotion) (2015-0-00938).